Randomness and Realism
Encounters with Randomness in the Scientific Search for Physical Reality
Recommended Titles in Related Topics
Modern Physics: The Scenic Route by Leo Bellantoni
ISBN: 978-981-124-220-5
ISBN: 978-981-124-317-2 (pbk)

The Quantum Universe: Essays on Quantum Mechanics, Quantum Cosmology, and Physics in General by James B Hartle
ISBN: 978-981-121-639-8

Invitation to Generalized Empirical Method: In Philosophy and Science by Terrance J Quinn
ISBN: 978-981-3208-43-8

Substance and Method: Studies in Philosophy of Science by Chuang Liu
ISBN: 978-981-4632-18-8
Randomness and Realism
Encounters with Randomness in the Scientific Search for Physical Reality
John W Fowler California Institute of Technology, USA
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI • TOKYO
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
Library of Congress Control Number: 2021941261
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
The plots in figures 1-1, 1-2, 2-7 to 2-9, 2-13 to 2-15, 3-2, 4-2 to 4-27, 5-1 to 5-5, 6-3, 6-5, 6-7 to 6-9, 6-12, A-1, B-1, B-2, F-1, I-1, and I-2 were produced using Maple 2015. Maplesoft is a division of Waterloo Maple Inc., Waterloo, Ontario. Maple is a trademark of Waterloo Maple Inc.
RANDOMNESS AND REALISM
Encounters with Randomness in the Scientific Search for Physical Reality

Copyright © 2021 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 978-981-124-345-5 (hardcover) ISBN 978-981-124-347-9 (ebook for institutions) ISBN 978-981-124-348-6 (ebook for individuals)
For any available supplementary material, please visit https://www.worldscientific.com/worldscibooks/10.1142/12468#t=suppl
Printed in Singapore
For my wife, Debbie
Table of Contents

Preface

Chapter 1 Nothing Ventured, Nothing Gained
  1.1 Beginnings
  1.2 Mathematical Formulation of Probability
  1.3 Intuition Regarding Probability
  1.4 Independent Events
  1.5 Different Interpretations of Probability
  1.6 Sucker Bets
  1.7 Russian Roulette
  1.8 The St. Petersburg Paradox
  1.9 The Standard Deviation and Statistical Significance
  1.10 The Origin of Fluctuations
  1.11 Microstates, Macrostates, and Entropy
  1.12 The Law of Averages
  1.13 Lotteries and Spreads
  1.14 Evolution Happens

Chapter 2 Classical Mathematical Probability
  2.1 Quantifying Randomness and Probability
  2.2 Probability Mass Distributions
  2.3 Probability Density Functions
  2.4 The Gaussian Distribution
  2.5 Gaussian Relatives
  2.6 Products and Ratios of Gaussian Random Variables
  2.7 The Poisson Distribution
  2.8 Bayes' Theorem
  2.9 Population Mixtures
  2.10 Correlated Random Variables
  2.11 Sample Statistics
  2.12 Summary

Chapter 3 Classical Statistical Physics
  3.1 Probability Distributions Become Relevant to Physics
  3.2 Thermodynamic Foundations of Thermal Physics
  3.3 Statistical Mechanics
  3.4 Relation Between Clausius Entropy and Boltzmann Entropy
  3.5 Entropy of Some Probability Distributions
  3.6 Brownian Motion
  3.7 Reconciling Newtonian Determinism With Random Walks Through Phase Space
  3.8 Carrying Statistical Mechanics into Quantum Mechanics

Chapter 4 Scientific Data Analysis
  4.1 The Foundation of Science: Quantified Measurements
  4.2 Putting Measurements to Work
  4.3 Hypothesis Testing
  4.4 Hypothesis Testing Example 1: Matching Sources in Two Catalogs With Gaussian Errors
  4.5 Hypothesis Testing Example 2: Matching Sources With Non-Gaussian/Variable Errors
  4.6 Monte Carlo Simulations
  4.7 Systematic Errors
  4.8 The Parameter Refinement Theorem
  4.9 Curve Fitting
  4.10 Random-Walk Interpolation
  4.11 Summary

Chapter 5 Quantum Mechanics
  5.1 Interpreting Symbolic Models
  5.2 Clouds on the Classical Horizon
  5.3 The Language of Quantum Mechanics
  5.4 The Discovery of Quantized Energy
  5.5 Gradual Acceptance of Quantization
  5.6 The Schrödinger Wave Equations
  5.7 Wave Packets
  5.8 The Heisenberg Uncertainty Principle
  5.9 The Born Interpretation of Wave Mechanics
  5.10 Probability Amplitude, Quantum Probability, and Interference Between Coherent States
  5.11 Quantum Entanglement and Nonlocality
  5.12 Hidden-Variable Theories and Bell's Inequalities
  5.13 Nonlocal Effects and Information Transfer
  5.14 "Instantaneous" Nonlocal Effects
  5.15 Wave Function "Collapse" and Various Alternatives
  5.16 The Status of Nonlocal Hidden-Variable Theories
  5.17 Summary

Chapter 6 The Quest for Quantum Gravity
  6.1 Fields and Field Quantization
  6.2 Essential Features of General Relativity
  6.3 Reasons Why a Quantum Gravity Theory Is Needed
  6.4 Miscellaneous Approaches
  6.5 Canonical Quantum Gravity
  6.6 String Theories
  6.7 Loop Quantum Gravity and Causal Dynamical Triangulations
  6.8 Spacetime Phase Transitions, Chaotic Automata, and Block Universes
  6.9 Embeddings and Intrinsic vs. Extrinsic Curvature
  6.10 Summary

Epilogue
Appendix A Moments of a Distribution
Appendix B Functions of Random Variables
Appendix C M Things Taken K at a Time With Replacement
Appendix D Chi-Square Minimization
Appendix E Chi-Square Distributions
Appendix F Quantiles, Central Probabilities, and Tails of Distributions
Appendix G Generating Correlated Pseudorandom Numbers
Appendix H The Planck Parameters
Appendix I Estimating the Mean of a Poisson Population From a Sample
Appendix J Bell's Theorem and Bell's Inequalities
Appendix K The Linear Harmonic Oscillator
References
Index
Preface

Scientists are not supposed to believe in magic. And yet what is a random number if not magical? When conjured up from the mysterious reservoir of haphazardness where effects do not require causes, it bursts upon the scene like a rabbit popping out of a hat or a bird flying out of a magician's sleeve and divulges its previously unpredictable identity. As one might dip a net into a murky pool and draw it out in the absence of any foreknowledge of what it will hold, to "draw" a random number is to plunge a container known as a "random variable" into a lagoon of numerical pandemonium and let it summon some passing denizen into its snare. As it happens, some random variables tend to acquire values that are not completely unpredictable; they have tendencies to cluster in certain patterns. But unlike any other variables that can take on explicit values in physics or mathematics, their specific numerical worth is impossible to predict on a case-by-case basis. They defy cold, hard deterministic causality much like the events which take place in our dreams, where, for example, one doesn't dare leave one's possessions in a train station's waiting room in order to visit the restroom, because upon return, not only will the possessions be gone, but the room won't even be the same shape or have the same doors and windows. No rhyme or reason, just willy-nilly nonsensical lack of strict cause-effect relationships.

This may suggest that the "magic" of random numbers is only in our imagination and therefore not really very important with respect to the "real world". But as we will see below, the "real world" also exhibits its own brand of this kind of magic. Not all scientists agree about whether the "real world" behaves according to a set of rules that can be represented symbolically and grasped by the human mind as "the laws of Nature". But human science has developed by pursuing mathematical descriptions of various physical phenomena and attempting to link such formalisms into a more integrated schema. Arguments have been made that since the universe displays such a high degree of complexity, any "laws of Nature" would have to reflect this by being so complicated themselves as to be unmanageable, but it is also now known that some rather simple mathematical systems are capable of generating complexity vastly beyond what one might expect at face value (e.g., cellular automata, nonlinear chaotic systems, etc.). If knowable "laws of Nature" do exist, we are extremely unlikely to find them without searching for them, and so in this book their existence is taken as a working hypothesis, with a special focus on the possibility that the formalism may involve intrinsically random behavior.

It is said that one sign of insanity is repeating a given act and expecting a different outcome. If this is true, then people who use random numbers are insane. But fortunately, most rules have exceptions, and this is one such case. The very essence of nontrivial random numbers is that even when the circumstances in which they are summoned are identical, each occasion generally yields a different result. In fact, it is possible for discrete random numbers to pop up with the same value on consecutive occasions every now and then. In this way they eliminate any hope that one can at least predict what value won't occur on the next occasion.
So while scientists must not believe in magic, they must make an uneasy peace with it, because modern science cannot be practiced without the use of random numbers in the form of probability, statistics, and the theory of functions of random variables. As a result of making this concession to irrationality, the greatest scientific advances in history have been made possible. A number of applications of random numbers have been found which enable progress in areas that seemed to be
dead ends when the only tools available were those based on strict determinism. As the theory of random numbers was developed, several fundamentally different kinds of randomness were identified. Here “different” refers not merely to separate distributions of random numbers, but to philosophically distinct types. For example, suppose someone asks you what day of the week August 19, 2397, will be and offers you a bet that you cannot name the correct day in three seconds. Assume no change in the way calendars are computed. Without any other information, whatever day you think of first has one chance in seven of being correct. At even money, you would be foolish to accept the bet. But suppose you are offered ten-to-one odds on a dollar. If you take the bet, you will probably lose your dollar, but if you are allowed to play this game many times, you are more likely than not to come out ahead. Here we have applied simple probability theory. Without knowing the right answer, we can proceed anyway by quantifying our ignorance in terms of random variables. But what is random here? Not the weekday corresponding to a given date; that’s completely deterministic. The random element springs from our own ignorance. We can model uncertainty due to ignorance with random numbers and proceed to make meaningful decisions. This kind of uncertainty is called epistemic uncertainty, meaning that the object of consideration is not random, it has a deterministic value, but our knowledge of that value is uncertain, and this uncertainty can be treated mathematically. Incidentally, we should note that while the year 2397 is in the future and thus seems especially mysterious, we could have chosen a year in the past, say 1746, without changing any essential aspect of the situation, and this might even accentuate the fact that the day of the week involved is not at all random. Throughout the entire gamut of classical physics, it is implicit that all uncertainty is epistemic. In Statistical Mechanics, for example, ensembles of particles are treated probabilistically, not because the particles behave irrationally, but because typically the number of particles in systems of interest is huge, and the detailed application of Newtonian mechanics to the particle interactions is orders of magnitude beyond the human capacity for computation. But nevertheless the classical interpretation of the particle behavior, the collisions and velocities as functions of time, is that it is completely deterministic. The statistical descriptions are very useful for computing overall physical parameters of systems such as energy, temperature, and pressure, even though the details about any given individual particle are completely uncertain. This uncertainty is epistemic. With the advent of Quantum Mechanics, statistical descriptions continued to play a central role in the mathematical formalism. But it was gradually realized that Quantum Mechanics was not so easily interpreted, certainly nothing like Statistical Mechanics had been. For example, shining monochromatic light through a double slit has long been known to produce interference fringes on a screen behind the slits. Although we are still working on what light actually is and how it manages to propagate, the fringes themselves were fully explained by classical electromagnetic wave theory. But the fact that electron beams do the same thing presented a problem: electrons were supposed to be particles, not waves. 
And when it was found that electrons eventually produce interference fringes even when the beam is so dilute that only one electron at a time is passing through the system, the fact that a crisis of interpretation existed was fully realized. Much later in a long story that is not yet over, this has led to the point where today it appears that our uncertainty about what an electron is doing as it passes through the double-slit apparatus is nonepistemic (also called aleatoric, although this is sometimes used to mean deterministic but
unpredictable because of being incomputable). Whatever is happening appears not to be representable in terms familiar to our intuition, nor does it seem to be governed completely by deterministic laws. The distributions of probability may be established by deterministic rules, but the actual draws from those distributions appear to be intrinsically random. It is not that the electron is doing something but we just don’t know what it is; rather it seems that it is not possible to know what it is doing, any more than it is possible to know what happened to one’s luggage during the visit to the phantasmagoric train station’s restroom. In the world of our normal experience, if a merry-go-round is spinning in one direction, then it isn’t simultaneously spinning in the other direction; the two spin directions are incompatible states, and a classical system in one such state cannot also be in the other state. But an electron can have its spin be in one direction, the opposite direction, or both directions at the same time. Another violation of classical intuition is that in the quantum world, we must accept that an object can actually be in more than one place at the same time, unless we abandon the requirement that “place” must be a property of the object at all times. Can an object exist without having a place where it exists? Epistemic uncertainty is not a particularly uncomfortable notion to accept as inevitable; at least whatever detailed behavior is taking place makes sense, and in principle, we could understand it if we took the trouble. It’s just a matter of removing our own ignorance of something that is fundamentally knowable. But nonepistemic uncertainty has its origins outside of our consciousness. It is more than just an aspect of the relationship between the behavior of the physical universe and what we can know about that behavior. It stems from an intrinsic property of that behavior itself, a capacity for achieving effects for which there were no causes, an ability to create self-inconsistent conditions. It can be a little unsettling to ponder the possibilities implicit in forces operating outside the control of logic and reason. Just how irrational can the world around us become? And yet a universe of strict determinism is not so appealing either, because we are part of the universe, and if everything happens with the strict determinism of a classical clock mechanism, what are we but automatons? There is a tension between the need to be safe from dreamlike absurdities and the need to believe that we have something we call “free will”, difficult as that is to define. In this book we will explore these and other aspects of random numbers as a sort of meditation on randomness. The goal will be to investigate the different aspects of various kinds of random numbers and the application of their mathematical descriptions in order to get a feeling for how they relate to each other and how they affect our lives. This will require some mathematics, but readers not inclined to savor equations should be able to skip certain technical sections and still follow the general flow. To follow the mathematics completely, a grasp of basic algebra and trigonometry and introductory differential and integral calculus suffices. No reader should feel uncomfortable about skimming or entirely skipping certain mathematical elaborations, because professional scientists also do this quite frequently when reading the technical literature. 
Still, it is hoped that even readers who usually avoid equations will give those within these pages a chance to reveal why they have been included. Except when simply quoting a result, care has been taken not to skip steps needed to understand how one line leads to another, and mathematical notions are generally introduced in the most intuitive way possible. The goal is to provide an opportunity to take things in a sequence that leads to an accumulation of understanding. Even less mathematically inclined readers may be surprised at how potentially foreboding concepts fall into line
when one makes a legitimate attempt to follow a succession of mathematical implications. The fundamental truth is that most of our understanding of the way the Universe behaves is founded upon mathematics, and discussions of such topics would be greatly handicapped if we avoid the language that expresses all aspects of physical reality most precisely. This book is intended primarily for scientists and science students of all ages, but it is hoped that it can also serve a wider audience. The purpose of the math is not exclusively to aid in understanding the answers to certain questions but also to aid in understanding the questions themselves. We also apologize in advance to those readers with professional expertise for treading some familiar ground, but we need to establish various contexts along the way, and perhaps such sections can be viewed simply as brief reviews that establish the book’s notation conventions. Each topic can fill at least one book on its own, and since most have done this already, much of the discussion will be at the survey level, with references to more complete rigorous treatments provided. One goal of this book is to communicate the enjoyability that exists close to the surface of all scientific disciplines. Once the fun is realized, many technical ideas lose their threatening appearance. Nothing promotes mastery more than having a good time. Another goal is to focus on intuitive notions rather than dry rote technical facts. Most intuitive aspects can be found near the surface and are most congenial when not desiccated and painstakingly dissected in the fashion required for specialized textbooks, and so we will emphasize a narrative approach that does not necessarily cover every base and plug every loophole. In keeping with this scope, it is hoped that occasionally informal use of common terminology will improve the flow without creating misunderstandings. The general public uses the word “statistics” without demanding a rigorous definition. As a result, the word is frequently misused, but usually with little damage. The distinction between “statistics” and “data” is often ignored, as is that between “statistical” and “random”. This book aims at developing intuition for the notion of randomness, but as far as rigorous definitions are concerned, we will avoid entering into the still-raging debates about its philosophical, psychological, and theological definitions, keeping instead to the fairly safe ground provided by mathematics and empirical science. It is the author’s opinion that intuition for a subtle concept develops best by gradual contemplation from various perspectives. If it can be acquired suddenly and completely from a rigorous definition, then it was not all that subtle in the first place. One amusing aspect of mathematical definitions is that frequently the opposite notion turns out to be more fundamental. For example, if one looks up the phrase “biased estimator”, one may very well find “a biased estimator is an estimator that is not unbiased.” Upon dutifully looking up “unbiased estimator”, one finds that it is an estimator whose expected error is zero. It may seem that “biased estimator” could have been more straightforwardly defined as an estimator whose expected error is nonzero (and sometimes it is indeed so defined), but that would not have taken advantage of a previously solved problem, how to define “unbiased estimator”, a central concept in estimation theory, whereas “biased estimator” is much less frequently encountered. 
This approach to providing technical definitions was lampooned in a cartoon (published in a well-known professional science periodical) showing a hallway view of the doors to a university math department; the two doors were labeled "Not an Entrance" and "Not an Exit". We will take "random" simply to mean "nondeterministic". Whether something can actually be nondeterministic will not concern us immediately. "Deterministic" is a very straightforward
notion reflected in the description of insanity above: if determinism applies, then given the same conditions, the same outcome will inevitably result. If there is the slightest possibility that even a microscopically different result can emerge, then something is random. On the other hand, the presence of some deterministic aspect does not preclude randomness. A “thing” may be composed of smaller units, some of which are random and some of which are deterministic. The composite thing may exhibit partial orderliness, but the fact that its behavior is not 100% deterministic makes it random. The conflating of determinism with the appearance of order creates much of the confusion that fuels the still-raging debates. Without being pedantic, we should be a little less casual with the term “statistics”, although it presents fewer philosophical obstacles and is less likely to start arguments. The term refers to several related things, but the most fundamental is that of a function whose domain is the set of all values that can be taken on by a specific random variable. A very important instance of such a set is that known as a data sample. A major task of the discipline known as “statistics” is to estimate properties of a population by analyzing samples drawn from it. A population is a (possibly infinite) set whose elements each possess a property that follows some distribution which describes how many members of the population have values of the property lying within specifiable ranges. For example, all oranges harvested from a particular orchard on a given day constitute a population whose members all possess a weight. Some of these oranges weigh more than others. In principle, all the oranges could be weighed, and the resulting numbers could be organized into a histogram whose shape would reveal the distribution. It may be desired to know the average weight of all of the oranges. Weighing each orange would be too tedious, however, and so a representative sample can be drawn, for example by weighing one orange in every crate as they are brought in from the orchard. If these numbers are summed, then the average of all the numbers can be computed by dividing the sum by the number of oranges weighed, and this is an estimator of the population average. The process of computing the average involves a mathematical function, namely the sum of all the numbers divided by the sample size. The result of this function is a statistic. Many functions can be defined to operate on the sample, and the result of each is another statistic. The weights of the oranges going into the average are not themselves statistics, they are the data, but they are sometimes referred to as statistics in informal language. The study of statistics includes a lot of focus on how they relate to the population properties, whether the estimators might be inaccurate and by how much, etc. If one orange is found to be twice as heavy as the average, whether this was to be expected is computed and expressed as the statistical significance of that weight. In this sense, “statistical” becomes associated with “probabilistic”, and phrases such as “statistical behavior” enter the language to mean “random behavior” and “probabilistic behavior”. Such usage is not rigorous, but sometimes rigor requires verbosity that gets in the way of common discourse. At other times, however, rigor is needed to maintain a clarity that serves mathematical accuracy, and we will attempt to provide it on such occasions. 
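To make the distinction between data and a statistic concrete, here is a small numerical sketch. It is not from the book; the population parameters, the sample size, and the variable names are invented purely for illustration, and a hand calculation would serve just as well.

    import random

    # Illustrative stand-in for the orchard: 10,000 orange weights (grams)
    # scattered around an assumed true mean of 140 g with a spread of 15 g.
    random.seed(1)
    population = [random.gauss(140.0, 15.0) for _ in range(10_000)]
    population_mean = sum(population) / len(population)  # normally unknown

    # Weigh one orange per crate: these 50 numbers are the data (the sample).
    sample = random.sample(population, 50)

    # The sample mean is a function of the sample -- a statistic -- and it
    # serves as an estimator of the population average.
    sample_mean = sum(sample) / len(sample)

    print(f"population mean = {population_mean:.1f} g")
    print(f"sample mean (estimator) = {sample_mean:.1f} g")

Drawing a different sample gives a slightly different sample mean; how such estimates scatter about the population value is exactly the sort of question the discipline of statistics addresses.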
It will not be possible in a single book to discuss every topic within which randomness plays a role, nor every detailed apparition of randomness within the topics included. The desire to ponder the most important encounters with randomness experienced by the majority of humans on the planet has led to the following chapter structure. Chapter 1 deals with how the realization that some things are random led to the formulation of mathematical probability, along with some examples of
everyday encounters and classical paradoxes. Chapter 2 provides a basic mathematical foundation for the classical description of random processes. Chapter 3 shows how this was used to advance the understanding of physical processes prior to the advent of Quantum Mechanics. Chapter 4 deals with the application of these methods to the optimal description of measurements, how we can characterize measurement uncertainty in a way that maximizes usefulness. Chapter 5 discusses how Quantum Mechanics came about and how human intuition finally met a challenge with which it could not keep up, at least not yet. In this chapter we will see how Nature not only appears to indulge in nondeterministic behavior at its deepest layer but also allows events at widely separated locations to influence each other nonlocally. The desire to remove these nonintuitive aspects by expanding the scope of physical theories is considered in Chapter 6, and finally an Epilogue presents some closing thoughts. In these later chapters the phrase "In the author's opinion" appears more frequently, and a temptation presented itself to begin almost every sentence with it explicitly, but clearly that would have quickly become tedious, so the reader is asked simply to take it as always being implicit, because many of the subjects are controversial with no clear consensus among the scientific community.

Eleven appendices serve the purpose of keeping some of the more extensive mathematics out of the main text, where the acceptable depth of mathematical analysis has been influenced considerably by how deeply it is necessary to dig in order to unearth the desired result. When too great a depth is needed, only a verbal summary of the mathematics is provided (e.g., the heuristic visualization used by Erwin Schrödinger to formulate wave mechanics), but for those jewels lying just below the surface, the small amount of exercise needed to expose them is undertaken. In some cases mathematical generality is sacrificed in order to avoid having to dig too deeply. For example, in Statistical Mechanics, the process by which pressure arises from particles bouncing off of container walls is illustrated by using a very specific point in momentum space, namely that in which all the particles are moving at the same speed. Although this is admittedly a highly improbable state of a gas in a container, it is a legitimate one that allows us to ignore the form of the velocity distribution, and in the process we do not have to dig very far below the surface, but we are unable to discuss fluctuations rigorously as a result. That seems a small price to pay when working within a limited scope.

In other cases, especially in the later chapters, some references are made to highly specialized formalisms, and some equations are shown without derivation. To understand the purpose of such references and equations, we consider an analogy. Suppose Aristotle were transported from ancient Greek times to the modern day and shown a contemporary automobile speeding around a race track. He would surely be impressed and mystified. One could open the hood, reveal the engine inside, and explain to Aristotle that the source of the automobile's ability to travel at high speed was something complicated with many moving parts constructed from finely machined metal and using energy generated by burning a liquid fuel.
Without elaborate explanations of hundreds of details, Aristotle's experience with metal implements and cooking fires might provide him some insight into how the marvelous contraption operated, not by magic but by advanced harnessing of familiar natural forces. No more than this is demanded of the reader when we refer to (e.g.) Christoffel symbols, covariant derivatives, or Einstein-Hilbert action. Derivations of these concepts are vastly beyond our scope, but just seeing such expressions and learning about the resultant properties of such formalisms should aid the reader in following the physical implications of interest. Again, some familiarity with
algebra, trigonometry, and differential and integral calculus is needed in order to absorb the content of such equations, but just as Aristotle is not expected to grasp instantly how the parts of the engine were fabricated, the reader is not expected to master the derivations and all the foundational notions that lie behind them. Of course, the reader is free to pursue a more advanced appreciation if moved to do so, but herein the purpose is limited to showing the engine inside the car. The accumulation of human experience over time has produced what seems to be a large amount of knowledge in the form of observed facts with discernible relationships. There is no more important component of this ongoing intellectual exercise than that which is concerned with knowledge about knowledge. We seek the comfort of certainty, and yet a sufficiently close inspection of any article of knowledge reveals some residual fuzziness. It has been said that the only thing of which we may be certain is that we are uncertain. Whatever the fundamental nature of human consciousness may be, it appears that uncertainty is an irreducible ingredient, and therefore the ability to analyze the role of uncertainty in everything we do is essential if progress is to continue. The theory of random numbers has allowed this analysis to become quantitative, and in leading us to develop Quantum Mechanics, it has unlocked a magic gate and made us into sorcerer’s apprentices. We must pursue our destiny by using powers we do not completely understand. Science proceeds from observations. Crafting ways to make observation more effective has led to a highly developed theory of measurement. The products of measurements are numbers that we call data. When data are subjected to interpretation, they become information, and there is no more crucial aspect of this phase than the characterization of measurement error, since perfect nontrivial measurements are impossible. If the error in a given measurement could be evaluated quantitatively, it could be removed, and this is the goal of data calibration. But some residual error always remains and can be characterized only probabilistically. When this is done properly, the product is real information, and when that information is understood as completely as humanly possible, knowledge has been acquired. Only then can the final and most important challenge be addressed: how to distill the knowledge into scientific insight. Thus the study of randomness improves our ability to distinguish between what we know and what we do not know, and it empowers us to proceed despite incompleteness of understanding by providing a way to take the unknown into account. To anyone with the slightest motivation to work in science, realizing this dispels any misconception about probability and statistics being dry and uninteresting. Some experience reveals quite the opposite: everything we learn about how random numbers behave increases our fascination with them and the pleasure we find in making them do our bidding to the extent possible. Before closing, it seems advisable to place some limits on how seriously the notion of “magic” is meant to be taken. Certainly we do not advocate embracing superstition, belief in sorcery, or any mystical power to control nature by supernatural means. On the contrary, the central theme of this book is that the most effective tool available for understanding the experience of existing in this Universe is science. 
But it would be foolish to suggest that the workings of all mechanisms operating in the Universe are well understood by science, and the gaps in that understanding require a special kind of careful handling. By ascribing a magical quality to randomness because of its fundamentally mysterious nature, we are merely acknowledging the incomprehensibility of something that defies exact predictability. All the more are we justified in being proud that we have found a way to describe whatever is in many of those gaps by treating it as random and hence subject to the
mathematical disciplines known as probability and statistics, without which we would be rather helpless.

It is no accident that the first topic in the title of this book, randomness, is closely associated with the book's other topic, physical realism, the desire to understand more deeply the nature of the physical Universe whose objective existence is posited by many scientists. It was because of this aspiration that the author became aware of the pervasive role of randomness in this quest. Pondering randomness in an organized way turns out to be one approach to exploring the methods that have been developed in the hope that science will illuminate the human experience, its highest purpose. We must be clear that deep questions remain unanswered and will remain so within this book. But there is some solace to be had in understanding why they have remained unanswered. This purpose is at least partly served by understanding more deeply the questions themselves.

The topics covered in this book and the interpretations attached to them are drawn from half a century's experience enriched by the uncountable contributions of many teachers, mentors, and coworkers. The author would like to express his gratitude by acknowledging these partners with a complete list of names. Unfortunately, that is beyond the author's powers to accomplish in detail because of the sheer numbers involved (and the unanswerable question of where to stop), so the hope is that these people will know who they are and how much the privilege of being able to call them friends has meant. It would be unacceptably negligent, however, not to thank personally by name someone whose interest in this book's topics, whose invariable willingness to engage in discussions, and whose numerous useful suggestions and insights contributed many valuable ideas included herein: Frank Masci, a great friend and colleague. Finally, a debt of gratitude is owed to the wonderful editors at World Scientific Publishing for their expertise, cooperation, and encouragement.
Chapter 1 Nothing Ventured, Nothing Gained

1.1 Beginnings

In some timeless tropical wilderness, a deer pauses in the brush near a water hole on a scorching afternoon. The desire for a long refreshing drink of the water is strong, but the deer holds back in a nervous hesitation, caught between the appeal of the water and a suspicion that something lurks among the high weeds on the other side of the pond, downwind from where the deer stands frozen and indecisive. In a completely nonverbal way, without the benefit of abstract symbolic reasoning, something about a movement in the tall grass has struck the deer as unlike the effect of the slight breeze that is so uncooperative in its choice of direction, denying any scent clue about what, if anything, might be lying in wait across the small pool. For a long time, a vacillation occurs in which the cool refreshing moisture's beckoning almost overcomes the dread that flows from an instinctive knowledge of predators. Perhaps there is nothing there after all, or if there is, perhaps fleetness of foot will suffice to provide survival. Perhaps water can be found elsewhere, but if so, perhaps yet another hunter will be using it as a lure. Eventually, the need for life-giving hydration will require the abandonment of any demand for the complete avoidance of risk. To prolong a parched condition in the oppressive heat or to hazard an attack by a hungry carnivore, either way, the deer must gamble.

For as long as animals capable of making decisions have existed on the planet, gambling has been a part of life. In most cases, this has been an essential aspect of obtaining food without becoming someone else's meal, conquering and holding territory, obtaining a mate, and other activities required for survival of the individual and the species. And as with so many things which most animals do only out of necessity, humans have adopted the habit of gambling for amusement. The human fascination for indulgence in games of chance is amply recorded in historical accounts ranging from the earliest times to the present day, and it plays a role in many ancient myths. In some manifestations it was believed to be a way of revealing the will of the gods, such as the tossing of bones, magic sticks, dice, or the "casting of lots" in general. Such activities were used to foretell the future, determine innocence or guilt, settle disputes, and decide who should be appointed to a given task. "Lots" could be wood chips, stones, slips of paper, and just about anything that could be marked in such a way that subsequent identification of a winner or loser was possible. Playing cards probably evolved from such primitive tokens. The ancient Greeks allowed the will of the gods to show itself by shaking a helmet containing lots until one fell out. Similar results were obtained by rolling dice vigorously enough to preclude controlling the results, shuffling a deck of cards several times, spinning a large wheel with marked segments, and flipping a coin through the air before allowing it to fall and come to rest with one side facing upward. These are all attempts to randomize the decision process in order to render it unbiased by human preconception, thereby allowing the outcome to be determined completely by the deities, or failing that, the person in the room with the most highly developed mind-over-matter capability. Persistent belief in the latter concept underlies the widespread acceptance of the notion
known as a “run of good luck”. As with any highly developed skill, some outings are simply better than others, and slumps can happen, but when one is “on a roll”, one must take full advantage. Of course, as we shall see, “runs” of positive or negative outcomes are common in most purely random processes, and they have historically provided an excellent context for the nurturing of superstitious notions about conscious control over events which are beyond such volitional guidance. In more recent tradition, all of these mechanisms have been viewed as depending upon classical physics, which is deterministic, and in this view, it is clear that the outcomes are not really random in any pure sense, merely unpredictable (as long as no one cheats), and so all such procedures are called pseudorandomization to keep them separate from the mathematically pristine notion of true randomness, which is not only unpredictable but also nondeterministic by definition. On the other hand, the argument can be made that classical physics has been supplanted in principle (if not in most engineering classes) by Quantum Mechanics, wherein the truly random event appears possible, and so maybe the Greek helmet technique is unfairly characterized as pseudorandom. In any case, the distinction is important, since most computer applications that require “random” numbers obtain them in a provably pseudorandom manner. Devising a good pseudorandomization technique is not an easy task, and to this day the merits of different approaches are hotly debated. The various means for invoking random outcomes probably evolved gradually from decision making to gambling for its own sake. The human sense of adventure includes a fascination for predicting an outcome in advance, and contests were inevitable. Even before mythology gave way to written history, accounts of betting on horse races, wrestling matches, gladiator contests, and just about every conceivable activity abound. Of special interest are the games of chance in which the bettors are themselves the contestants, because these led to duels in which the weapons were dice or playing cards, and these eventually led to the mathematical formulation of probability theory. Most professional poker players will tell you that what they do is not gambling but a game of skill. This is both true and a little disingenuous. It is true because skill is certainly involved in acquiring a knowledge of possible outcomes and putting that knowledge to effective use, especially in games wherein knowledge of cards no longer in play is available. In principle, when two card players engage each other in a long enough series of games, the more skillful will always emerge victorious, assuming the relative skill levels don’t change. The disingenuous aspect is that an unreasonably long series of games may be required for this to happen. If the skill levels are close enough, the players may not live long enough for the inexorable laws of probability to dominate the situation, and such series often end because one player runs out of money to bet. And so we will refer to card games as games of chance without apology. Of course, when an expert poker player contends with a naive amateur, the issue is usually resolved relatively quickly. These distinctions are easily accommodated by the notion of biased outcomes. 
For example, although it is physically impossible to construct an absolutely fair coin, mathematically the definition can be made, namely that the coin has exactly 50% probability for either heads or tails when tossed and allowed to come to rest. It has been proved that when such a coin is tossed a sufficiently large number of times, with all outcomes written down and counted, although at any given time there may be more heads than tails, eventually the number of tails will catch up to the number of heads, and vice versa. This is an example of what is called a "random walk" in which steps are taken in random directions. The fair coin tosses correspond to a one-dimensional unbiased random walk if, for example, steps of constant length are taken forward for each head and backward for each tail. In such a case, one eventually arrives back where one started.
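The catching-up behavior just described is easy to watch in a simulation. The sketch below is a minimal illustration, not part of the original text; it uses a pseudorandom generator (in the sense discussed earlier) to flip a coin with head probability p_heads, a parameter introduced here so that the biased case taken up next can be tried as well.

    import random

    def coin_walk(n_tosses, p_heads=0.5, seed=0):
        """Flip a pseudorandom coin n_tosses times; step +1 for heads and -1 for
        tails; return the final position and the number of returns to zero."""
        rng = random.Random(seed)
        position = 0
        returns_to_origin = 0
        for _ in range(n_tosses):
            position += 1 if rng.random() < p_heads else -1
            if position == 0:
                returns_to_origin += 1
        return position, returns_to_origin

    # Fair coin: the running difference keeps crossing zero as heads and tails
    # repeatedly catch up with each other.
    print(coin_walk(100_000, p_heads=0.5))

    # Slightly biased coin: the walk tends to drift steadily away from zero.
    print(coin_walk(100_000, p_heads=0.51))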
As soon as the coin has a slight bias, say for heads, the random walk is no longer unbiased and ultimately diverges, i.e., the distance from the origin grows arbitrarily large and the chance of returning to zero becomes vanishingly small. This bias operates in the same way as the skill of the professional poker player; the greater the bias, the sooner the professional takes all of the amateur's money, and for this reason the professional does not consider that gambling was involved in anything more than how long it would take.

1.2 Mathematical Formulation of Probability

Apparently for many centuries, people gambled on the basis of nothing more than hunches and a general feel for relative likelihood. Eventually number theory became highly developed, and combinatorial analysis rose to the point where it could be applied to estimating probabilities of various outcomes in card games and other gambling activities. As modern mathematics was being born in seventeenth-century France, shortly before the development of the calculus and the explosion of all forms of mathematical analysis, Blaise Pascal (1623-1662) and Pierre de Fermat (1601-1665) accepted the request of a gambler acquaintance to help him determine how the money on the table should be divided among the players when the game goes unfinished. The idea was that each player should receive a fraction of the money in proportion to that player's probability of having eventually won the game. Their application of combinatorial analysis to this problem is widely regarded as the beginning of probability theory. It is recorded in correspondence between the two friends in 1654. After Gottfried Wilhelm Leibniz (1646-1716) and Isaac Newton (1642-1727) published their work on the calculus, the way was open for the full development of probability theory based on density functions.

The problem posed to Pascal and Fermat was one in which players gain points in each hand dealt out or each throw of the dice, depending on the game involved, and the first to accumulate a certain number of points wins the money that has been bet. The mechanics of how points are won is less important than the possible outcomes of each hand or throw and the probabilities of those outcomes. A simple case is that between two players alternately rolling dice and obtaining a number of points equal to the numbers on the top surfaces of the dice after they come to rest. The dice are taken to be fair, so that each of the six sides is equally likely to end up on top. At any point in the game, each player has a certain number of points. From there on, one could in principle write down every possible outcome for each roll of each player until each branch in the tree arrived at a point where one player has enough points to win. Taking each possible branch as equally likely, the probability that a given player would have won is just the number of branches leading to that outcome divided by the total number of possible distinct outcomes. The problem with this brute-force approach is that the number of possible branches may be astronomically large unless one player was very close to victory when play was halted. What Pascal and Fermat contributed was the use of combinatorial analysis to eliminate the need for detailed enumeration of each possible branch. To see how unmanageable the detailed approach is when the number of outcomes is not very small, consider a simple game in which a coin is flipped, and heads results in a point for player A, tails a point for player B.
Suppose A has 5 points, B has 4 points, and 10 points are needed to win. Then at most ten more coin flips are needed to decide the game. To write down all possible sequences of ten coin flips may not sound difficult, but in fact there are more
than one thousand possible patterns. Since each coin flip has only two possible results, it is natural to represent each result as a binary number, say with 1 for heads and 0 for tails. Ten coin flips would then be represented by a ten-digit binary number whose value runs from 0000000000 to 1111111111. In decimal, these numbers are zero and 1023, respectively. Since every value in that range can occur, there are 1024 possible unique sequences of outcomes, a lot more than most people would care to list individually. Although there are 1024 distinct sequences, many imply the same result for the game. For example, 1010111000 and 1101101000 are quite different, but when read left to right, both result in Player A winning the game on the seventh toss if the game goes that far. In the set of 1024 different ten-digit binary numbers, 1010111000 and 1101101000 are equivalent in the sense that each represents five heads and five tails. In these terms, there are only eleven equivalent groups. These represent no heads, ten heads, and all possibilities in between. In general, for N flips, there are 2^N distinct sequences and N+1 equivalent groups. For N > 1, the number of distinct sequences is always greater than the number of equivalent groups, very much so as N becomes large.

With the number of equivalent groups so much smaller than the number of detailed sequences, an efficient way to bookkeep all the possible equivalent branches was desirable. To achieve this, Pascal made extensive use of a construction which came to be named after him, although he did not invent it: Pascal's Triangle. This construction has a number of interesting properties, among which was the usefulness it had been found to have in algebra, especially in the expansion of powers of binomials such as (a + b)^n. For example, for the case n = 3,
$$
\begin{aligned}
(a+b)^3 &= (a+b)(a+b)(a+b) \\
        &= (a+b)\,(a^2 + 2ab + b^2) \\
        &= a^3 + 3a^2b + 3ab^2 + b^3
\end{aligned}
\qquad (1.1)
$$
Here it can be seen that only one term has a^3, i.e., its coefficient is 1 (which is omitted by convention). This means that in the process of multiplying all the occurrences of a and b with each other, there was only one way to get a^3. The same is true of b^3. But there are three pairings of products that result in a^2b, and the same for ab^2, and these results have been grouped together in the last line with coefficients of 3 to indicate this. Thus the coefficients show how many ways there are to get the combination of a and b that they multiply. This much is easy to work out by hand, but what about, say, (a + b)^17? It can be done by brute force, but Pascal's Triangle offers an easier way.

First, the triangle must be constructed. This is an open-ended process, but we will take it as far as needed to solve the problem of the coin-flipping game above, which involved a maximum of ten tosses. The triangle begins with a single numeral "1" on line number 0; we begin counting at 0 rather than 1 for reasons which will become clear below. Then each new line begins and ends with a "1" but contains one more number than its predecessor. Line number 1 therefore consists of "1 1". After that, we continue making each line longer by inserting one additional number per line, but each number between the beginning and ending "1" is the sum of the two numbers closest to its position on the previous line. This process leads to the following.
 0:                               1
 1:                             1   1
 2:                           1   2   1
 3:                         1   3   3   1
 4:                       1   4   6   4   1
 5:                     1   5  10  10   5   1
 6:                   1   6  15  20  15   6   1
 7:                 1   7  21  35  35  21   7   1
 8:               1   8  28  56  70  56  28   8   1
 9:             1   9  36  84 126 126  84  36   9   1
10:           1  10  45 120 210 252 210 120  45  10   1
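The additive construction shown above is easy to mechanize. The following is a minimal illustrative sketch in Python (the row count of 10 is simply the depth needed for the ten-toss game); it applies nothing more than the rule that each interior entry is the sum of the two nearest entries on the previous line:

```python
# Build Pascal's Triangle by the additive rule: each interior entry is
# the sum of the two nearest entries on the previous line.
def pascal_rows(n_max):
    rows = [[1]]
    for _ in range(n_max):
        prev = rows[-1]
        # New row: leading 1, pairwise sums of neighbors, trailing 1.
        rows.append([1] + [prev[i] + prev[i + 1] for i in range(len(prev) - 1)] + [1])
    return rows

for m, row in enumerate(pascal_rows(10)):
    print(f"{m:2d}: " + " ".join(str(x) for x in row))
```

The last line printed reproduces line 10 of the triangle, which is the line needed below for the ten-toss game.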
Here we have eleven lines, numbered top down from zero to 10. On line number 3, we see the coefficients of the terms in Equation 1.1. Above that line we see the coefficients of the partial result on the second line of Equation 1.1 where the binomial has been squared. The higher-numbered lines correspond to higher powers of the binomial. From left to right, we have the coefficients of the terms containing a^n, a^(n-1), a^(n-2), ..., a^0, namely a^n b^0, a^(n-1) b^1, a^(n-2) b^2, ..., a^0 b^n. As we saw above, the coefficients correspond to the number of ways certain combinations of a and b occur as the binomial is multiplied out. We can bookkeep this in terms of powers of a alone, since the sum of the powers of a and b must be n. Let us denote the number of ways that a^k can occur on line m as N_m(k). Then, for example, lines 0 through 5 of the triangle can be written as follows.

0:                                  N_0(0)
1:                              N_1(1)  N_1(0)
2:                          N_2(2)  N_2(1)  N_2(0)
3:                      N_3(3)  N_3(2)  N_3(1)  N_3(0)
4:                  N_4(4)  N_4(3)  N_4(2)  N_4(1)  N_4(0)
5:              N_5(5)  N_5(4)  N_5(3)  N_5(2)  N_5(1)  N_5(0)
                                  .  .  .
Now let us associate a^k with the occurrence of k heads in m tosses. Comparing line 4 in each triangle shows that N_4(4) = 1; this means that in four flips of the coin, there is only one way to get four heads, which is fairly obvious. But line 5 shows us, for example, that there are ten ways to get three heads in five tosses, which may not be as immediately apparent. The further down the triangle we go, the more work Pascal's Triangle saves us. It's hard to imagine an easier method for computing, for example, that there are 210 ways to get six heads in ten tosses. This is the kind of information that Pascal and Fermat needed to solve problems of the type posed to them by their gambling friend, the Chevalier de Méré. Pascal, probably in company with many other mathematicians, independently developed a formula for calculating N_m(k) directly without having to work his way down to it in the triangle construction. This is known as "Pascal's Rule":
$$N_m(k) = \frac{m!}{k!\,(m-k)!} = \binom{m}{k} \qquad (1.2)$$
where the exclamation mark indicates the factorial function: m! = m×(m-1)×(m-2)×...×1, i.e., the product of all integers from 1 to m. By definition, 0! = 1 (which is consistent with line 0 above, N_0(0) = 1, i.e., there is only one way to get nothing by not tossing the coin). The expression at the right end of the equation is a standard abbreviation for the ratio of factorials preceding it and is usually called "m things taken k at a time" or just "m take k" (see Appendix C). For example, we pointed out above that there are 210 ways to get six heads in ten tosses, or N_10(6) = 210. This can be computed directly from
$$N_{10}(6) = \binom{10}{6} = \frac{10!}{6!\,(10-6)!} = \frac{10!}{6!\,4!} = \frac{10\cdot9\cdot8\cdot7\cdot6\cdot5\cdot4\cdot3\cdot2\cdot1}{(6\cdot5\cdot4\cdot3\cdot2\cdot1)\,(4\cdot3\cdot2\cdot1)} = \frac{10\cdot9\cdot8\cdot7}{4\cdot3\cdot2\cdot1} = 210 \qquad (1.3)$$
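As a quick cross-check of Equation (1.3), and of the division of the stakes worked out in the next paragraph, a short Python sketch might look like the following (math.comb is the standard-library binomial coefficient; everything else is just Equation (1.2) written out):

```python
import math

# N_m(k): number of ways to get k heads in m tosses (binomial coefficient).
def N(m, k):
    return math.factorial(m) // (math.factorial(k) * math.factorial(m - k))

print(N(10, 6), math.comb(10, 6))   # both print 210, confirming Equation (1.3)

# Division of the stakes: A (5 points) beats B (4 points) to 10 points
# if at least 5 of the next 10 tosses are heads.
wins_for_A = sum(N(10, k) for k in range(5, 11))   # 252+210+120+45+10+1 = 638
print(wins_for_A, wins_for_A / 2**10)              # 638 and 0.623046875
```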
Now we return to the coin-flipping game between players A and B. A has 5 points, B has 4, 10 points are needed to win, and play is halted. We want to know the probability that A would have won if play had continued. At least five more rounds are needed for a victory, which is possible for A if all five tosses are heads. We saw above that there is only one way to get five heads in five tosses, whereas 2^5 = 32 different sequences are possible, so the probability that A would win in the next five tosses is 1/32. The probability that play would continue to a sixth toss is therefore 31/32. We could examine the possibilities on each toss, but that is not necessary if all we want to know is the probability that A would eventually have won on some round. For that, we need only examine line 10 in the triangle, since either A or B (but not both) must acquire ten points at or before that line. Any of the N_10(k) values for k ≥ 5 corresponds to a win for player A. Many of these correspond to a win prior to the tenth toss, but the division of the money does not depend on which toss settles the issue, so we simply sum these values (1, 10, 45, 120, 210, and 252) to obtain 638, then divide by the total number of possibilities, 1024, to arrive at 0.623046875 as the probability that A would have won if play had continued until victory was established. Thus A pockets 62.3% of the money on the table, B takes 37.7%, and everyone is free to rush off to whatever appointment interrupted the game.

1.3 Intuition Regarding Probability

The reader should not feel uncomfortable if this result is less than crystal clear upon first sight. By considering only line 10, we are including some outcomes in which one of the players had more
than ten points, i.e., those in which one player won the game before the tenth coin toss but went on to win more tosses after the game was over. For example, two of the 1024 sequences are 1111100000 and 1111100001. Both of these correspond to a win by player A on the fifth toss. By counting both of them as equally likely, are we not double-bookkeeping? Certainly both are equally likely after ten tosses, but they really correspond to a single earlier outcome of the contest, so is this a legitimate way to count possible outcomes? In fact, probability theory is rife with subtleties that confuse even the most powerful intellects, and one need never feel chagrined about requiring a moment to ponder what may turn out to be obvious to someone else, or even to oneself when viewed from a different perspective. For example, in statistical mechanics, great arguments have raged between leading theoreticians over whether certain possible states of a physical system may be taken as equally likely, and some of these arguments are still unresolved. In our case, the two sequences quoted above are from a set of 32 such sequences, i.e., once we have the five consecutive 1 digits, the other five binary digits can take on 32 different patterns. 32 out of 1024 is the same as 1 out of 32, so out of the 1024 possible sequences, these 32 patterns yield one chance in 32 for a win by player A on the fifth toss, the same as what we obtain by examining the possibilities after five tosses. All the other multiple-bookkeeping cases similarly reduce to their winning-round probabilities, and the result given above is indeed correct, namely a 62.3% probability that player A would have won. Today, this can be checked with the aid of modern computers, which can perform in a reasonable amount of time the billions of arithmetic operations needed to calculate with brute force, and some problems are solvable only in this way. In Chapter 4, a simple example of a Monte Carlo program will be given which tests the result quoted above. But this luxury is relatively recent, and before it came along, some serious errors were occasionally made by even the best mathematicians. The general public has always found some difficulty becoming comfortable with the notion of probability. There are some who feel that what will be will be, and only one outcome will actually happen, and the probability of that outcome is 100%, and all other outcomes have 0% probability. The problem with this view is that it assigns the attribute of randomness to the wrong element: it is the uncertainty in our knowledge of what will happen that is properly modeled as a random variable, not the event itself whose possible inevitability is unknown to us in advance. This view therefore contributes nothing to any optimizing of the decision-making process. But the rationale behind the idea that we should make decisions based on what would happen N times out of M trials remains generally elusive. Usually there aren’t going to be M trials, there is going to be only one. What is missed is that the methodology is still applicable. For example, most people would agree that betting on the outcome of three coin flips should be guided by considerations of probability based on combinations of possible head/tail outcomes. But suppose that instead of three coin flips, the contest is to consist of three different activities: (a.) whether a single coin flip results in heads; (b.) whether a card drawn from a shuffled 52-card deck will be of a red suit; (c.) 
whether the throw of a die results in a number greater than three. Since both of the last two contests are equivalent to a coin flip, the betting strategy should be unchanged. Now suppose the last item is modified to whether the die throw results in a number greater than two: this is no longer equivalent to a fair coin flip, so is the entire methodology rendered inapplicable? Not if we take into account the fact that the relevant probability in the third contest is 2/3 instead of ½. The point is that over the course of time, the many different situations in which an optimal decision is advantageous in effect comprise M trials, and the fact that each has its own set of contests
and outcome probabilities is irrelevant to whether the methodology is appropriate. Thus one need not be troubled by the fact that a gigantic ensemble of identical contests does not exist in reality. The optimal strategy should still include considerations based on probabilities evaluated as though the contest were going to be repeated many times. This is basically what we did above: players A and B had no intention of executing up to ten coin flips 1024 times, and even if they had, they would almost certainly not have experienced all 1024 possible sequences. But if they had gone through an arbitrarily large number of such exercises, each sequence would have occurred approximately the same number of times, because each is equally likely. The only caveat is that one should avoid making any bet at all if M has to be absurdly large before statistical stability can be expected (for an extreme example, see section 1.8).

Another viewpoint is to imagine an arbitrarily large number of parallel universes, with the randomness of the coin toss exercising its freedom to be independent over them all. In each universe, the players perform the coin tossing as needed to determine one outcome, and then we compute the fraction of victories by player A over the entire ensemble. There is an apparent absurdity to the ensemble of parallel universes that obfuscates the usefulness of the mathematical analysis. Only quiet contemplation can provide intuition a foothold for the eventual appreciation of all the elements involved.

1.4 Independent Events

Even when the usefulness of random-variable modeling for decision making is accepted, many people still have trouble adjusting intuition to real situations. One area of difficulty is the idea of independent events. Textbooks often define two random events as independent if and only if the probability of both happening is the product of their individual (or marginal) probabilities. Unfortunately, this doesn't help when the context is one in which it is desired to know whether it is permissible to compute the joint probability as such a product, i.e., one must figure out whether two events are independent from physical reasoning so that one can determine whether one is permitted to multiply the two separate probabilities to get the combined probability. One guideline is to consider whether the knowledge of one outcome alters the possibilities that need to be considered for the other outcome.

A simple case is when the two events are incompatible. For example, suppose Bob is running late and may not get to the bus stop in time to catch the bus: C represents the event in which he catches the bus, and M represents the event in which he misses the bus. The probability that he will catch the bus is P(C), and the probability that he will miss the bus is P(M). Is the probability of both events happening P(C)×P(M)? No, because the two events are incompatible. They cannot both happen. If he catches the bus, he does not miss the bus. If someone tells you that he missed the bus, i.e., that event M took place, then you obtain information about event C also: it did not take place. Whenever information about one event alters the distribution of possibilities for another event in any way, those two events are not independent, and the probability of both happening is not the product of the separate probabilities.
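Both situations are easy to probe numerically. The following Python sketch uses made-up probabilities (0.7 for catching a bus) purely for illustration; it contrasts the incompatible pair just described (catching and missing the same bus) with a pair of events involving two unrelated buses, which anticipates the two-city example developed next:

```python
import random

random.seed(1)
TRIALS = 1_000_000
P_CATCH = 0.7          # hypothetical probability of catching one's bus

both_incompatible = 0  # C and M refer to the SAME bus: they can never both happen
both_independent = 0   # two unrelated buses in different cities

for _ in range(TRIALS):
    bob_catches = random.random() < P_CATCH        # event C
    bob_misses = not bob_catches                   # event M for the same bus
    ted_misses = random.random() >= P_CATCH        # unrelated bus, P(M) = 0.3
    both_incompatible += bob_catches and bob_misses
    both_independent += bob_catches and ted_misses

print(both_incompatible / TRIALS)   # 0.0 -- not the product 0.7 * 0.3 = 0.21
print(both_independent / TRIALS)    # close to 0.7 * 0.3 = 0.21
```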
On the other hand, if Bob lives in Cleveland, and the probability that he will catch his bus is P_B(C), while Ted lives in London, and the probability that he will miss his bus is P_T(M), and assuming that Bob and Ted have never met, have no acquaintances or activities in common, and are unaware of each other's existence, then it is almost certainly a safe assumption that the probability
that Bob will catch his bus and that Ted will miss his is P_B(C)×P_T(M). The reason why the probability of two independent events both happening is the product of the separate probabilities can be seen in the parallel-universe ensemble interpretation. Since a universe is completely self-contained by definition, the universes are all completely independent of each other. It follows that if you happen to know that Bob caught his bus in Universe no. 1, that fact tells you nothing about whether he caught it in Universe no. 2. All that we know is that if we could count the number of times Bob caught his bus in each universe and divide by the number of universes, the ratio would approach P_B(C) as the number of universes in the ensemble becomes arbitrarily large. In fact, since all this is happening in our imagination, we have the power to demand that it be so large that we can partition it practically any way we like without losing statistical stability, i.e., without introducing fluctuations that significantly alter the ratio from being equal to P_B(C). For example, if we put the odd-numbered universes in one sub-ensemble and the even-numbered ones in another, the fraction of universes in each sub-ensemble wherein Bob catches his bus remains arbitrarily close to P_B(C). Now we partition the full ensemble into a sub-ensemble in which Bob caught his bus and one in which he did not. The first sub-ensemble contains a fraction of the full ensemble equal to P_B(C). Within this sub-ensemble, the fraction of times that Ted misses his bus is P_T(M), since M does not depend on C. This last fraction of the sub-ensemble is a fraction of a fraction of the full ensemble, which is the product of the two fractions, and hence P_B(C)×P_T(M).

This simple product would not apply if there were some dependence between C and M. For example, if Bob and Ted lived in the same city and tried to catch the same bus every day, and if that bus tended to fill up sometimes at Bob's stop before getting to Ted's stop, then Bob catching the bus would tend to increase the probability that Ted would miss the bus because of not being allowed to board. Calculating the probability that C and M both happened in this case would require information about how frequently the bus fills up at Bob's stop and remains full when arriving at Ted's stop. In this case, the two events are said to be correlated. If someone told you that Bob caught the bus, you would then know that Ted's chances of missing it were increased. Even though you don't know whether Ted in fact missed the bus, the greater chance that he did reflects the fact that the two events are not independent, and their joint probability is therefore not the simple product of the separate probabilities.

Even professional technologists frequently overlook dependences between two random events. In subsequent chapters we will take a closer look at how such dependences often come to exist, how they are taken into account mathematically, and what the usual results of overlooking or simply neglecting them are. But the opposite error is also commonly committed: assuming that two independent events are somehow causally connected to each other. The extent to which such seemingly simple considerations escape a significant fraction of the general population is well illustrated by a true story witnessed by the author while standing in line to pay for some goods at a convenience store. The customer at the counter instructed the clerk to include two lottery tickets in his purchase.
At the time, lottery tickets were manufactured on a cardboard roll with perforations between them and consecutively numbered. The customer stipulated that his two tickets must come from separate parts of the roll, i.e., must not contain consecutive numbers. When asked why he preferred such tickets, his facial expression became a blend of puzzlement over how anyone could be so simple-minded as to need to ask that question and pity for the questioner's apparently handicapped condition. He eventually found the words to explain that if one ticket contained a losing number, then the odds were very high that a number immediately
next to it was also a losing number, i.e., if you miss, you are more likely to miss by a mile than an inch. Despite the fact that the lottery machinery was designed to make all numbers equally likely, this customer apparently felt that proximity to a losing number was a disadvantage. In his mind, the probability that two lottery tickets would both lose depended on how close the two numbers were to each other. Better to fan out and cover more territory. At least his strategy did not harm his chances.

1.5 Different Interpretations of Probability

At this point, it should be noted that the parallel-universe notion used above is firmly rooted in what is called the frequency interpretation of probability (also called the frequentist interpretation and relative frequency interpretation). We have mentioned that the random processes relevant to classical physics are associated with epistemic uncertainty, i.e., randomness models our knowledge of physical processes, not the processes themselves, which are deterministic. It should come as no surprise, therefore, that anything with such subjective overtones should also be at the center of vigorous philosophical debate over its essential meaning. This book is concerned with the impact of randomness on our experience of physical reality, and we will implicitly assume the interpretation embraced by most scientists and engineers, at least in contexts involving epistemic uncertainty, and this is the frequency interpretation. We note in passing that there are other interpretations, often listed as the classical interpretation, the subjective interpretation, and the axiomatic interpretation.

The classical interpretation is illustrated by the coin-flip analysis above. It says that one need not actually carry out a large number of experiments to determine probabilities of various outcomes, but simply enumerate all possibilities and compute the fractions of results in which those outcomes occur. By taking the probabilities to be given by those fractions, it is similar to the frequency interpretation. The essential difference is in the use of theoretical rather than empirical determination of the frequencies, and application of the classical approach often bogs down in arguments about the assignment of equal likelihood to the fundamental states. The development of empirical science compensated for the impossibility of applying purely theoretical principles to all phenomena, and the ability to measure some frequencies of various outcomes led to the frequency interpretation.

At face value, the parallel-universe notion used in the example above is classical, since actual experiments in the ensemble of universes are obviously not conducted. The values of P_B(C) and P_T(M), however, were assumed to be known, and for all practical purposes, these would be impossible to compute theoretically. Some measurements would have to be made, and the leap from the relative frequencies of C and M to P_B(C) and P_T(M) is considered the weakness of the frequency interpretation. So that example depended on the frequency interpretation to provide P_B(C) and P_T(M), after which the classical interpretation was used to ponder what these probabilities mean if the two events are independent. Such off-handed concept switching tends to be very irritating to pure mathematicians, who have come up with their own interpretation (or perhaps it is really an anti-interpretation), the axiomatic approach, in which probability is simply defined to have certain properties affecting the manipulation of symbols.
This results in what superficially resembles the equations of the frequency interpretation but sidesteps all issues involving how one actually obtains values for the probabilities and what these probabilities "mean". The subjective interpretation views probability as a measure of belief. It seems more natural for assigning probabilities to things which are not as easily associated with events, such as whether
intelligent life exists on other planets in the Milky Way Galaxy. Until empirical methods develop to the point of allowing effective measurements, the frequentist approach cannot provide a probability for this. Attempts to apply the classical approach have been made (again employing some component probabilities derived from experiments), most famously the Drake Equation, but these have suffered from extreme uncertainty in the resulting estimate and serve more to illustrate the difficulty of the problem than to provide an answer to the question. Yet most people have a guess about whether intelligent life exists elsewhere, each based on that individual's degree of belief that the events required for extraterrestrial life to evolve have in fact occurred. The conclusions reached in this way also demonstrate a large dynamic range and suggest that the process of generating subjective appraisals varies widely from one person to another.

An example of an argument against the existence of intelligent extraterrestrial life in the Galaxy goes as follows. Our only example of the evolution of intelligent life on a habitable planet is ourselves, and we arrived on the scene about four billion years after the earth formed. Once mammals evolved, intelligent life developed quickly relative to the geological time scale. From the first humanoids to the present day took only something on the order of 10^5 years, and from cave dwelling to moon walking took about 10^4 years. At this rate, we could be expected to achieve interstellar travel in less than 10^3 more years, after which we will have explored the entire Galaxy in less than another 10^6 years, because interstellar travel will be nearly at light speed, and light takes only a few hundred thousand years to travel from one side of the Galaxy to the other, assuming reasonable boundaries to what we call the Galaxy. Compared to the initial four billion years, these other intervals are negligible, so assuming that we are fairly typical of the form that intelligent life would take, four billion years after a habitable planet forms, it should produce beings that have navigated the Galaxy. But the Galaxy is about ten billion years old and contains several hundred billion stars currently believed to harbor about one planet each or more on average. Even with the most pessimistic of estimates, if the evolution of intelligent life forms is not an unthinkably rare event, then there should be a large number of Galaxy-navigating civilizations, and it is inconceivable that all that explorative activity would produce no visible evidence of having taken place. There must be some unknown factor that makes intelligent life all but impossible, and we are a highly improbable fluke. Therefore, "they" cannot be out there. Anything that close to zero probably is zero.

The subjective element is clearly at work in this example, and each step in the reasoning can be attacked with counter-arguments. Everything depends on what one considers "reasonable". We also see the use of conditional probabilities, e.g., the probability that a star develops a habitable planet may be small, but given that a habitable planet does develop, the probability that some form of life develops on it can be hypothesized. The use of conditional probabilities to estimate whether certain propositions are true has been highly developed in what is now called Bayesian statistics, the foundation of decision theory, which we will encounter in Chapter 4.
Many authors view Bayesian statistics as being in conflict with the frequentist interpretation, but it can also be viewed as employing frequentist notions in a particular combination that allows prior knowledge to refine probabilistic estimation. It provides some powerful techniques, but it can also be abused and often is in the sensational documentation of remarkable events such as UFO sightings. For example, if UFOs have visited Earth, surely the government would know. But the government denies all such knowledge; why would they do that? Only because they have something to hide. And if the government has something to hide, what could that be? Either they are collaborating with the aliens, or they fear a panic among the general population. If they fear a panic, how do they know that there
would be one? Isn’t it possible that the public would welcome the information that contact with an extraterrestrial race of beings had been established? And if they are collaborating, what technologies are being obtained from the aliens, and what other countries are involved? At this point, the audience is expected to have forgotten that “if the government has something to hide” was only a condition on whose probability the likelihood of subsequent possibilities depended, not an established fact. For further discussion of the various interpretations of probability, the reader is referred to the excellent summary provided by Papoulis (1965) and the more complete treatment given by von Plato (1994). In limiting the focus herein, it is not the intention to suggest that these distinctions are unimportant, but rather to avoid getting distracted from the main point, the nature of randomness itself. Certainly the philosophical essence of the mathematical tools used to tame (or at least co-exist rationally with) randomness deserves great study, but the intrinsic nature of randomness can be explored without settling the debates about the interpretation of probability. Our concern with physical reality leads us to depend primarily on the blend of classical and frequency interpretation used by the majority of scientists and engineers, which does involve occasional augmentation via Bayesian statistics. 1.6 Sucker Bets Returning to the lottery contestant who demanded non-sequential numbers: the prevalence of such hazy reasoning has provided an ecological niche for predatory scoundrels who engage in offering what is commonly called “sucker bets” to unsuspecting dupes. Any time a set of alternatives is presented for wagering in which the apparently obvious choice is known by the perpetrator to be significantly unlikely, it can be called a sucker bet. Perhaps the word “dupe” is a bit harsh, but anyone who accepts a sucker bet with the intention of taking money away from someone perceived to be offering a foolishly self-defeating proposal is also guilty of attempted exploitation to some extent, and so we will not feel required to employ gentle terminology in referring to them. On the other hand, sucker bets can also arise in more innocent circumstances, such as party games and friendly pranks, in which case we withhold the judgmental overtones just implied. What is of interest here is that there is a class of such bets that rest on a purely numerical foundation, and unlike sucker bets based on inside information concerning unlikely events which have recently taken place without the word getting around yet, these stratagems can be used over and over on new victims. It is the unexpected likelihood of seemingly improbable outcomes that is interesting, not the fact that such gambits can be used to separate people from their money. One fairly common sucker bet is that among any 30 people, at least two will have the same birthday. Two things must be clear: (a.) “birthday” means month and day of month, not including year or day of week; (b.) the 30 people must be chosen in a manner that has nothing to do with their birthdays, e.g., they must not include twins, triplets, etc., they must not be selected to span all astrological symbols, or any other criterion that involves date of birth. 
For example, 30 people at an office Christmas party would usually be a good set of people for the purpose of this bet, which should be offered as an even-money bet, i.e., the two betting parties put up the same amount of money, and the winner takes it all. To the average person not familiar with this sucker bet (or other similar ones), the proposition seems unlikely. There is a tendency to imagine oneself with 29 other people and estimate the
probability of one of them having the same birthday as oneself as much less than 50-50 (meaning 50% for and 50% against). In fact, assuming no birthday bias in the group, the probability that any given other person has one's own birthday is indeed only 1/365. The effect of leap years is ignored here; this approximation has very little impact, as will be seen below. It also assumes that human birthdays are uniformly distributed over the year, which is not exactly true but is an acceptable approximation. The common mistake is to proceed with the assumption that with 29 other people, this chance becomes 29/365, or less than 1/12. An even-money bet with only one chance out of twelve of losing does seem like an inviting proposition. This sort of miscalculation is exactly why the bet works; it suggests that the probability of winning is 336/365, practically a sure thing. After all, there are 365 possible birthdays, of which at most 30 are taken up, so there are at least 335 days on which no one in the group was born, and that seems like a lot of room. The problem is that it assumes a specific birthdate that must be avoided by all the others, whereas in fact everyone must avoid everyone else's birthday if the bet is to be won.

The real odds are computed as follows. Take the 30 people in any desired order. The first person has some birthday, which we will call birthday no. 1 and represent as B_1. The second person has birthday no. 2, B_2, which must not be the same as B_1. The probability for this is 364/365, since there are 364 days out of the 365 in a year when B_2 could occur without being the same as B_1. The third person has birthday B_3, which must be different from both birthdays B_1 and B_2. The probability for this is 363/365; however, this probability comes into play only if we also have the first two birthdays unequal. As we saw above, the probability that two independent events will both occur is the product of their individual probabilities. In order to win the bet, we must have the first event, B_1 ≠ B_2, and also the second event, B_3 ≠ B_1 or B_2. The probability of both events occurring is their product, (364/365)×(363/365). We can denote the probability that the birthday of the nth person, B_n, is different from all the birthdays of the preceding n-1 persons, given that all of those birthdays are different, as P_n = (366-n)/365, n > 1. To win the bet, we must have all 29 events, for which the probability is P_2 × P_3 × P_4 × ... × P_30. As we work our way through this product of 29 numbers, each fairly close to 1.0, we find that it drops significantly with each new factor. This is shown in the following table, where Π(P_n) indicates the product of all P_n for all previous n up to and including the value on the given line.
 n    P_n        P_n (decimal)   Π(P_n)
 2    364/365    0.99726027      0.99726027
 3    363/365    0.99452055      0.99179583
 4    362/365    0.99178082      0.98364409
 5    361/365    0.98904110      0.97286443
 6    360/365    0.98630137      0.95953752
 7    359/365    0.98356164      0.94376430
 8    358/365    0.98082192      0.92566471
 9    357/365    0.97808219      0.90537617
10    356/365    0.97534247      0.88305182
11    355/365    0.97260274      0.85885862
12    354/365    0.96986301      0.83297521
13    353/365    0.96712329      0.80558972
14    352/365    0.96438356      0.77689749
15    351/365    0.96164384      0.74709868
16    350/365    0.95890411      0.71639599
17    349/365    0.95616438      0.68499233
18    348/365    0.95342466      0.65308858
19    347/365    0.95068493      0.62088147
20    346/365    0.94794521      0.58856162
21    345/365    0.94520548      0.55631166
22    344/365    0.94246575      0.52430469
23    343/365    0.93972603      0.49270277
24    342/365    0.93698630      0.46165574
25    341/365    0.93424658      0.43130030
26    340/365    0.93150685      0.40175918
27    339/365    0.92876712      0.37314072
28    338/365    0.92602740      0.34553853
29    337/365    0.92328767      0.31903146
30    336/365    0.92054795      0.29368376
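The running product in the last column is easy to reproduce; here is a minimal Python sketch using the same assumptions as the table (365 equally likely birthdays, no leap years):

```python
# Probability that n people all have distinct birthdays (365 equally likely days).
def prob_all_distinct(n, days=365):
    p = 1.0
    for k in range(1, n):          # k = number of birthdays already "taken"
        p *= (days - k) / days     # the next person must avoid all k of them
    return p

for n in (10, 20, 22, 23, 30):
    print(n, round(prob_all_distinct(n), 8))
# ... prints 23 -> 0.49270277 and 30 -> 0.29368376, matching the table
```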
We can see that the probability that no two people have the same birthday drops below 50% once we get to 23 people. By the time we get to 30, it is less than 30%. So the odds of losing this bet if we accept it are greater than 2 to 1. That means that we would win sometimes, perhaps even on the first try, but if we play very many rounds, we are extremely likely to lose about 40% of the money we wager. For example, if we bet $1 on each of 100 tries, on average we will win $30 and lose $70, which amounts to a net loss of $40.

Of course, it isn't likely that we would find ourselves at 100 Christmas parties or other groups of 30 people often enough for multiple exposures to this particular sucker bet. But the same mechanism can be used in other contexts. Usually a phone book is not hard to come by, and most can be used for this purpose. Open a phone book's residential listings to a random page and select the first 15 phone numbers at the top right-hand side, but avoid any obviously repeated numbers such as those of a business that uses the same phone for orders and customer support. Consider only the last two digits of each number. In most cases, these 15 two-digit numbers will be effectively random.
Since they may take on any values from 00 to 99 inclusive, there are 100 possible values, and we have selected 15. The sucker bet is that there will be at least one duplicate two-digit number. Applying the same method as before, but now with P_n denoting the probability that the nth number will not be the same as any of the n-1 preceding it, we have P_n = (101-n)/100, n > 1. The table becomes:

 n    P_n       P_n (decimal)   Π(P_n)
 2    99/100    0.99            0.99000000
 3    98/100    0.98            0.97020000
 4    97/100    0.97            0.94109400
 5    96/100    0.96            0.90345024
 6    95/100    0.95            0.85827773
 7    94/100    0.94            0.80678106
 8    93/100    0.93            0.75030639
 9    92/100    0.92            0.69028188
10    91/100    0.91            0.62815651
11    90/100    0.90            0.56534086
12    89/100    0.89            0.50315336
13    88/100    0.88            0.44277496
14    87/100    0.87            0.38521422
15    86/100    0.86            0.33128423
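(The same running product covers this case as well: the small sketch shown after the birthday table gives prob_all_distinct(15, days=100) = 0.33128423, matching the last line here.)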
This time the probability of winning the bet drops below 50% at the 13th number and quickly gets down to just under 1/3 at the 15th. So on the average, we would lose this sucker bet twice out of every three times we accepted it, more or less the same odds as the previous one. And yet most people do not expect a mere 15 random two-digit numbers to harbor a duplication. One hundred possible values just seems like too many to allow frequent pairs to arise in a sample of only 15. But of course that is what makes the bet work, especially if one wins it on the first try, which should not be unusual since it occurs about once in every three attempts, but which really only provides encouragement to play more rounds and gradually lose more and more money.

This effect may also be seen in "screen saver" computer programs that display images selected randomly. It is common for people to observe a repeated image that was recently displayed. If there are 100 images from which to select, the odds are about 2 to 1 that in the last 15 displays, some image occurred twice. Many people react to that by suspecting a defect in the random selector, but in fact, there would have to be a defect for it not to happen, if each selection is independent from all the others, as opposed to simply dealing out a shuffled deck.

We should digress briefly into a caveat: many mathematicians would be shocked at the notion of using a telephone book as a source of pseudorandom numbers. Clearly such numbers are not random; they are assigned according to specific algorithms, and some phone books are less suitable than others for this purpose. Most mathematicians have excellent sources of pseudorandom numbers readily available and wouldn't dream of using a phone book to get them. But most people are not professional mathematicians and have no such handy lists. Without making too significant a point out of this, one goal of this book is to help the reader develop intuition about what is or is not random, as well as what is not but approximates random behavior well enough for the purpose at
hand. Some purposes, for example large-scale Monte Carlo calculations, require billions of pseudorandom numbers that are effectively independent and drawn from a specific population. Such a large set of numbers will tend to reveal systematic (i.e., non-random) patterns if not generated expertly. To get 15 numbers that are approximately uniformly randomly distributed from 0 to 99, however, one can do worse than a phone book.

But there are ways in which a phone book can fall short of the mark. The best sort of phone book would be a residential listing for a large metropolitan area, since the alphabetical order would not correlate with geographical location within the area. In such cases, the distribution of the least significant digits, two in our case, is usually a very good approximation to a uniform random distribution. The intrusion of business numbers biases the distribution toward repetition, since many businesses make some effort to acquire phone numbers ending in 00, and often there will be several phones listed for the same business, some of which are in fact the same. This makes the bet harder to win. On the other hand, residential listings for a very small community may not be effectively shuffled by the alphabetical ordering, and this may cause them to reflect the order in which phone numbers are assigned, which may be chronologically sequential, and the use of every possible number before repeating may show up in the last two digits. This would make the bet easier to win. In summary, we are not advocating telephone books as a substitute for expertly generated pseudorandom numbers, only as a way to get a handful easily when no better source is at hand, and only then after some reflection about whether any of the available books is likely to be free of biases.

1.7 Russian Roulette

The opposite of sucker bets are the bets which one is likely to win but refuses to accept anyway. Probably the best example is Russian Roulette. If offered a prize of $1000 for surviving one click of the hammer of a six-shot revolver containing one cartridge in a random position, most people would decline. For a million-dollar prize, some people would think twice. For a billion dollars, many would find it quite an agonizing decision. Some instinct is at work here, some intuitive trade-off recognizing that the prize and its probability have to be greater than the cost and its probability. Such considerations are at the foundation of decision theory. Very often the probability of each outcome can be computed precisely, but whether the potential cost is worth the chance that it will have to be paid is essentially completely subjective.

Although one would be well advised not to play such a game even once, it is instructive to consider what would happen if one played it enough times to constitute a statistically stable sample. In the $1000 case, one would win $1000 five times out of every six, suggesting that if one values one's life at $5000 or more, this game should be avoided. And while it might well be considered unforgivably foolish to play it even once, it would surely compound that foolishness to play it more than once, so that really the trade-off is at $1000 vs. one's life. This is not like betting that a roll of a die will yield a number greater than one; a single loss wipes out all previous wins (except possibly for the player's estate).
One does not say "Well, that worked out quite nicely, let's do it again and again." So even though one might contemplate one and only one round of the game, the ensemble of possible outcomes is useful to consider, not because the most likely result is a payoff of $1000, but because the potential disaster is not really all that far removed.
It is useful to distinguish here between an "expected" result and the "most likely" result. These are often not the same, depending on whether the distribution of possibilities is symmetric about its average and whether the distribution has a single highest peak. The notion of an expected value may apply to a number drawn randomly from a distribution or to any function of the variable described by the distribution. Sticking to our frequentist interpretation, one can picture a "distribution" as a finely sampled histogram. The well-known "Bell Curve", which is also known as the Gaussian or Normal distribution, has a single peak and is symmetric about that peak, which is therefore also the mean, or average value of the variable. It is also the median, the midpoint. If binned into a histogram with an odd number of equal-width cells, then the cell that contains the peak is also the mode, the most frequently occurring and hence most likely group of values of the random variable. A summary of these properties of distributions is given in Appendix A.

The mean of a distribution is the "expected" value of any random number drawn from that distribution, also called the "expectation value". These phrases based on the common English word "expect" have a special flavor in probability theory. They do not imply that we are surprised when a randomly drawn number turns out not to have the "expected" value; they simply indicate the average value. As we saw above, the average number of heads in ten tosses of a fair coin is five, but we are not in the least taken aback if a given trial produces four or six heads, or even three or seven.

Besides having an "expected" value of a number drawn from a distribution, a function of that variable also has an expectation value. To illustrate, we consider the expectation value for the money won in repetitions of the game of Russian roulette, assuming the six-chamber cylinder containing the single cartridge is spun randomly between each trial. The "expected" amount won is the prize multiplied by the probability of obtaining the prize. With $1000 per trial, for a single trial, the probability of winning is 5/6, so the expected amount is $1000 multiplied by 5/6, or $833.33. Note that this is not an amount that can actually be won; it is the average value won by a large number of players who each play a single round. But everyone who wins gets $1000, not $833.33; the most frequently observed value (i.e., the mode) is $1000.

The following table shows the expected winnings by a single player who engages in more than one trial, up to a maximum of 25. The rules are that the player agrees to the number of trials in advance, wins $1000 on each successful trial provided all agreed trials are completed successfully, and otherwise wins nothing, not even funeral expenses. The first column shows the number of trials agreed in advance to be played, denoted N. The second column gives the probability of completing that many trials; this is just (5/6)^N. The third column gives the expected payoff for completing the given number of trials. The expectation value for a variable is commonly denoted by placing the name of that variable inside "angle brackets" as shown. The fourth column shows the actual payoff. The expected payoff is just the actual payoff multiplied by the probability of completing the N trials. This is an a priori estimate of the payoff of an attempt to survive N trials, which is the same as saying that it applies only before the experiment is actually done.
It reflects both the possibility that the N trials will be survived and that they will not, and therefore the numerical value of an expectation typically is not an actual achievable outcome. After the experiment is done, we obtain what we have called the actual payoff and listed in the fourth column for the case of a successful outcome. For an unsuccessful outcome, the actual payoff is zero for all values of N. If an audience of bettors were to wager on the outcome of someone else's attempt to survive this game, their betting would be guided by the expected payoff if they wanted to
minimize the squared error in their guess. In other words, if the cost of being wrong is proportional to the square of the difference between the value on which one bets and the actual outcome, then the expectation value is the best choice to bet on, because it is the value that minimizes the expected square error. This is a very common choice for the cost function in decision problems, and it may arise from subjective preferences or the properties of the distribution or both (such considerations will be discussed more rigorously in the next chapter). The actual player would be less concerned with minimizing the squared error than simply surviving the game, so only columns two and four would be of interest.

 N    P(N)          < Payoff >    Payoff
 1    0.83333333      833.33      1000.00
 2    0.69444444     1388.89      2000.00
 3    0.57870370     1736.11      3000.00
 4    0.48225309     1929.01      4000.00
 5    0.40187757     2009.39      5000.00
 6    0.33489798     2009.39      6000.00
 7    0.27908165     1953.57      7000.00
 8    0.23256804     1860.54      8000.00
 9    0.19380670     1744.26      9000.00
10    0.16150558     1615.06     10000.00
11    0.13458799     1480.47     11000.00
12    0.11215665     1345.88     12000.00
13    0.09346388     1215.03     13000.00
14    0.07788657     1090.41     14000.00
15    0.06490547      973.58     15000.00
16    0.05408789      865.41     16000.00
17    0.04507324      766.25     17000.00
18    0.03756104      676.10     18000.00
19    0.03130086      594.72     19000.00
20    0.02608405      521.68     20000.00
21    0.02173671      456.47     21000.00
22    0.01811393      398.51     22000.00
23    0.01509494      347.18     23000.00
24    0.01257912      301.90     24000.00
25    0.01048260      262.06     25000.00
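The entries follow from P(N) = (5/6)^N and <Payoff> = $1000·N·P(N); a brief Python sketch reproducing a few representative rows (the choice of rows is arbitrary):

```python
# Expected payoff for agreeing in advance to N rounds of the $1000-per-round game:
# survive all N rounds (probability (5/6)**N) and collect 1000*N, or collect nothing.
for n in (1, 5, 6, 7, 25):
    p_survive = (5 / 6) ** n
    expected = 1000 * n * p_survive
    print(f"N={n:2d}  P(N)={p_survive:.8f}  <Payoff>=${expected:,.2f}  Payoff=${1000*n:,}")
```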
One thing to notice is that the expected payoff increases with the number of trials until N reaches 5, then it repeats for N = 6. If the player were to go by the expected payoff, it would seem to be immaterial whether 5 or 6 trials were chosen as the survival goal. Of course these two goals have different probabilities of being achieved, and it would be counterproductive to reduce the probability of success from about 40.2% to about 33.5% with no additional payoff. For the wagering onlookers, however, the expected payoff of $2009.39 would be the best bet. After N = 6, the expected payoff proceeds to become smaller and smaller despite the fact that the actual payoff for success continues to increase. This results from the exponential dependence of the probability and
the linear dependence of the payoff. The table also shows that if one wants at least a 50% chance of surviving, one should not commit to more than three trials. No sane person would play this game as portrayed, but since the expected payoff scales linearly with the actual payoff, an affluent sadist could amplify the attractiveness to the point where the fringe element of society would probably yield a few takers. The repulsive aspect of the game could also be reduced somewhat by increasing the odds of surviving; a special revolver with 100 chambers could be constructed and still employed with only one cartridge. Some combination of increased payoff and reduced mortality rate clearly would bring the game into the range of acceptability for the majority of the population, since a game played once each year with 7000 chambers and a payoff around $50,000 corresponds to the risk taken by the average person navigating automobile traffic to go to and from work and various other destinations.

1.8 The St. Petersburg Paradox

We mentioned above that before the outcome of a trial is known, the expectation value of the payoff is the best guide to whether to accept a bet. We hasten to add, however, that it is not the only guide. Fortunately there is also a guide to whether the expectation value is likely to be close to the actual outcome. A good way to illustrate this is another well-known sucker bet contained in what is known as the St. Petersburg Paradox, which is based on a game that exists in several forms, of which we will take the simplest. This involves the following proposition: one pays a fee for the privilege of playing a game whose rules are that every consecutive flip of a fair coin that results in heads yields a prize of twice the previous flip's prize amount, with the initial prize being $2. In other words, if the first coin flip yields tails, the game is over, and one has nothing to show for the entrance fee; otherwise the game continues. As long as the coin keeps coming up heads, the pot keeps doubling. Assuming that tails eventually comes up, the game ends, but one gets to keep whatever money was in the pot the last time heads came up. For example, if heads comes up on the first three flips, then tails, one wins $8, because that was the value of the pot for heads on the third flip.

Now we ask what the "expected value" of the winnings will be. We will cover this in more detail in the next chapter, but for now we will just state that the expectation value for the function of a random variable is just the sum (for discrete random variables) or the integral (for continuous random variables) of the product of the function evaluated for each value of the random variable and the probability of that value, taken over all values that the random variable can assume. Since we are dealing with a discrete random variable here, this means the following.
$$\langle W \rangle = \sum_{n=1}^{\infty} W_n\, p_n = \sum_{n=1}^{\infty} 2^n \left(\frac{1}{2^n}\right) = \sum_{n=1}^{\infty} 1 \qquad (1.4)$$
Here W without a subscript represents the total winnings, so < W > is the expectation value of the winnings. W_n is the amount won on the nth coin flip provided the game proceeds that far and heads comes up, and p_n is the probability of actually getting to that coin flip and getting heads on it.
Since W_n = 2^n dollars, we substitute that in the equation, and since p_n = 1/2^n, we also substitute that. To clarify, there is only a probability of ½ that heads will come up on the first flip, so p_1 = ½. The probability of getting heads on the second flip is also ½, but we don't even get to the second flip unless we get heads on the first flip, so we need both of these events to occur, and as we saw above, since these coin flips are independent, the probability of getting to a second flip and having heads come up on it is the product of the two probabilities, which is 1/2^2, and in general, for the nth flip happening and turning up heads, we get a probability of 1/2^n. So we see that the expected value is the sum of the product of the money in the pot for heads on each flip and the probability of that flip producing heads successfully, with the sum being taken over all possible coin flips. But now as n increases, we see that since the same factor controls the rate at which the prize increases and the rate at which the probability of winning it decreases, they cancel out, leaving an infinite summation of $1 amounts, which is therefore an infinite amount of money.

Since the "expected" return on investment is infinite, it would seem that any finite entrance fee would be well spent. But no sane person consents to play this game for any entrance fee greater than a pittance. People know instinctively that this game is not going to turn out well, but it is not clear how they know, and this has become a subject of great study. Of course, it's possible that the typical person's reason for rejecting this offer is itself fallacious, just like accepting the 30-birthday sucker bet and insisting on nonconsecutive lottery numbers.

The St. Petersburg Paradox has been discussed in essentially this form (there are variations on this theme) for approximately three centuries. Nicholas Bernoulli dealt with it in 1713, and his cousin Daniel Bernoulli published a solution in 1738 that has had a profound impact on economic theory. This involves a formulation of "economic motive" in which desire is not linearly proportional to wealth but rather to utility, which is a function of wealth but increases more slowly than linearly with wealth. Bernoulli suggested a logarithmic function of wealth, and while this still diverges, when multiplied by the decreasing probability p_n, the sum approaches a finite constant asymptotically (2 ln 2, about 1.38629436, assuming the natural logarithm and replacing W_n with ln(W_n) in Equation 1.4). Some discussions also bring aspects of psychology into play; these involve the fact that no one will believe that the person offering this bet has the unlimited resources needed to back it up. Furthermore, to get the infinite payoff, one would have to hit an infinite string of heads, which would take forever even if its probability were not vanishingly small.

Neither of these types of "solution" gets at the mathematical core of the problem: the expectation value of the money payoff really is infinite! There is no algebraic sleight of hand involved. Even with psychological and economic predispositions at work, why turn down a very large profit even if it's not really infinite? Despite the mathematically unassailable result that the "expected" value is infinite, people don't expect to get the "expected" value. They sense that, for some reason, the expectation value is not necessarily a reliable guide to what to plan on winning from this game.
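The contrast between the divergent money expectation and Bernoulli's finite utility expectation is easy to see numerically; a short sketch (truncating both sums at 60 terms, which is more than enough for the utility series to settle at 2 ln 2):

```python
import math

terms = range(1, 61)
money_expectation = sum((2 ** n) * (1 / 2 ** n) for n in terms)          # adds 1 per term
utility_expectation = sum(math.log(2 ** n) * (1 / 2 ** n) for n in terms)

print(money_expectation)                      # 60.0 -- keeps growing without bound
print(utility_expectation, 2 * math.log(2))   # both about 1.38629436
```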
A resolution to this paradox that has not been published elsewhere to the author's knowledge will be presented below. This will also show how to supplement the information provided by the expectation value. But first, we will probe this game with a Monte Carlo calculation (Chapter 4 will devote more discussion to this kind of calculation). Clearly a random variable (or function thereof) whose distribution yields infinities has some pathological aspects. When exact mathematical formulations are problematic in this way, or simply when the math is too difficult, Monte Carlo calculations come in handy. These calculations are simulations of the actual situation being studied. A "math model" is set up to describe the relevant
features of the situation incorporating relationships to variables which can be assigned appropriate pseudorandom values, and then a computer can run the model through the exercise a very large number of times. From these trials, statistical parameters can be computed (e.g., mean, median, mode, skewness, and others that are described more completely in Appendix A), histograms can be constructed, etc., and one arrives at the same numerical results that would normally be of interest for problems having formal solutions. In this case we can write a computer program that simulates the flipping of a fair coin and sums the payoffs for runs of heads, quitting upon the first appearance of tails. This can be repeated many times, and the frequencies of the various payoffs counted; a minimal code sketch of such a simulation follows the table below. In practice, the number of trials must clearly be finite, and as a result, the formal predictions of the analysis above will not be obtained exactly. Because the formal problem contains unbounded elements (e.g., it is assumed that the result of an infinite number of coin flips could be realized in a finite amount of time), a finite simulation cannot behave in exactly the same way. Our finite simulation, however, can be expected to behave more like a real, hence finite, attempt to play this game. Such a program was run for 100 million trials. The longest run of heads turned out to be 27, for which the payoff was $134,217,728. As would be expected, very close to half of the trials ended in failure on the first flip. About 25% ended after one successful flip, yielding a pot of $2. For relatively small numbers of successful runs, hence large numbers of occurrences in the set of 100 million trials, the statistics came out very close to the theoretical predictions. But once the expected number of events became small, e.g., for 20 or more consecutive heads, statistical fluctuations became significant. Since 10^8 (100 million) trials were computed, it is interesting to ask what is the number of consecutive heads that has a probability of happening once in 100 million trials, i.e., what value of n satisfies the formula 1/2^n = 1/10^8? The answer is n = 8/log(2), i.e., 8 divided by the base-10 logarithm of 2. To ten significant digits, the value of this is 26.57542476, but since n must be an integer, we round that off to 27. It is therefore not surprising that we got one instance of a run of 27 consecutive heads, nor is it surprising that we never got a longer run, in this set of 100 million trials, whose typical set of outcomes is as follows.

    nmax    No. Occurrences    Money Won
      0          50010353             $0
      1          24998194             $2
      2          12499037             $4
      3           6246710             $8
      4           3122077            $16
      5           1561769            $32
      6            780730            $64
      7            391018           $128
      8            195304           $256
      9             97790           $512
     10             48455         $1,024
     11             24251         $2,048
     12             12233         $4,096
     13              6021         $8,192
     14              3064        $16,384
     15              1505        $32,768
     16               729        $65,536
     17               386       $131,072
     18               183       $262,144
     19                91       $524,288
     20                57     $1,048,576
     21                15     $2,097,152
     22                16     $4,194,304
     23                 6     $8,388,608
     24                 3    $16,777,216
     25                 1    $33,554,432
     26                 1    $67,108,864
     27                 1   $134,217,728
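The simulation described above is easy to reproduce. The sketch below is a minimal Python version written for this discussion (the function names, random seed, and trial count are illustrative choices, not the program actually used for the 100-million-trial run); even a million trials shows the same qualitative behavior, including an average payoff far below the infinite expectation value.

    import random
    from collections import Counter

    def run_length(rng):
        """Flip a fair coin until tails appears; return the number of consecutive heads."""
        heads = 0
        while rng.random() < 0.5:      # heads with probability 1/2
            heads += 1
        return heads

    def simulate(trials, seed=1):
        """Play the game 'trials' times and tally the outcomes."""
        rng = random.Random(seed)
        counts = Counter(run_length(rng) for _ in range(trials))
        total = sum((2 ** n if n else 0) * c for n, c in counts.items())
        print(f"average payoff over {trials} trials: ${total / trials:.2f}")
        for n in sorted(counts):
            money = 2 ** n if n else 0
            print(f"{n:2d} consecutive heads: {counts[n]:>8d} occurrences, payoff ${money:,}")

    simulate(1_000_000)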
For this set of 100 million trials, the average payoff was $14.36, rather short of the expected payoff of infinity. Many runs of the program were made, and the average payoff varied between $10 and $20. The reason why there is such a large variation is that the average payoff is strongly influenced by the number of really long runs of heads, and this number is subject to significant fluctuations. In other words, because the payoff algorithm is so unstable statistically, even 100 million trials is far too few to bring out the intrinsic behavior accurately. But since attempts to play this game in the real world would involve far fewer trials, we get a fairly good idea of what to expect. The result is that one should not pay an entrance fee of more than about $5 if one wishes to allow some room for a little profit, and at that fee, one must expect to have to play a large number of games, because one quirk of the game is that the average return tends to increase with the number of games, being only pennies for a handful of games and increasing to the $10-$20 range for 100 million. Monte Carlo experiments indicate that if one wishes to play only 25 games, then one should expect to win no more than about $5 and no less than about $1 per game, because most outcomes fall in this range, although somewhat smaller and considerably larger amounts are not rare. The acceptability of an entrance fee should be judged accordingly, e.g., perhaps 50¢ is worth the risk. But no purveyor of misleading recommendations would offer the game for so low a price, since the purpose is to take money away from clients, not donate it to them. The basic idea is to make a losing proposition seem appealing, and the very large “expected” prize should warrant a stiffer charge for the privilege of playing. It is interesting to note that similar prices of admission, odds of winning, and potential purses are typical of state lotteries, a form of amusement that is much more acceptable to the public than the St. Petersburg Paradox has been found to be. Perhaps it is all in the marketing. Many players of lotteries claim that they had no real expectation of winning, but rather the thrill of the possibility was what they had purchased with their ticket. If so, it is hard to see why the possibility of winning an infinite amount of money would not be all the more of a thrill.

1.9 The Standard Deviation and Statistical Significance

Now we will attempt to understand why the expectation value is relatively worthless as a guide to smart betting in the St. Petersburg Paradox. As stated above, the word “expectation” has a special meaning in statistics. Specifically, it means the probability-weighted value summed (or for continuous variables, integrated) over the domain (all possible values) of the random variable. For most random variables encountered in everyday life, the expectation value is usually not very different from a typical outcome. For example, if a fair coin is flipped 10 times, the “expected” number of heads is 5. We are not shocked if it turns out to be 4 on one occasion or 6 on another, because 4 and 6 are as close to 5 as one can get without actually getting 5. Even 3 and 7 are not particularly disturbing unless they come up rather frequently. For this kind of experiment, our anticipation that we should get the expectation value or something fairly close is rewarded more often than not.
But if we want to be careful, we should consider another quantity that is computed by summing (or integrating) a probability-weighted value: the variance, which is the average squared difference from the expectation. Equation 1.4 shows how to compute the mean (another name for the expectation value; see Appendix A) for the winnings in the St. Petersburg game. Once we have this, we can compute the
variance as follows.
    V_W = \sum_{n=1}^{\infty} \left( W_n - \overline{W} \right)^2 p_n                    (1.5)
The variance is the square of a more commonly known parameter called the standard deviation, which is usually indicated by the Greek letter sigma, or with subscripts relevant to this case, σ_W = √V_W. The standard deviation is named as it is because for a large variety of common random processes, its value is usually a reasonably typical deviation from the mean that one might expect to encounter. It is therefore the additional guideline that one should consider when deciding whether to accept a bet or make any other probabilistic decision. For the case of 10 coin flips, the mean is 5 and the standard deviation is 1.58. This means that we can expect to get 5±1.58 heads “most” of the time. Just how seriously the word “most” can be taken depends a bit on what kind of random distribution we are dealing with. What we want to know is what fraction of the time we should expect the result to be within ±σ of the mean, and we should also use some care in interpreting fractional means and standard deviations. For example, for a single fair coin flip, the expected number of heads is 0.5. What meaning can be attached to half of a “heads”? If we take it to mean that the coin comes to rest on its edge with both heads and tails facing sideways, then this definition should be agreed upon before any flips are undertaken, and since this result is clearly extremely unlikely, it is usually left out of consideration in probabilistic computations. Some casino games have explicit rules stating that such an event is to be rejected. For a single fair coin flip, the “expected” average cannot occur, and we must be content with knowing that our chance of a successful result is simply 50%. What about the standard deviation being 1.58? Like the mean for a single flip being 0.5, the fractional aspect refers to what to expect on average over many trials. Each random distribution has a relationship between standard deviation and the fraction of outcomes that fall within that distance from the mean. This relationship varies from one distribution to another. For example, it is not the same for the “Bell Curve” and the Uniform distribution (e.g., the distribution describing the number of minutes showing on the face of a yet-unobserved clock that stopped running a year ago, for which all values between 0 and 60 are equally likely). For the Bell Curve (i.e., the Gaussian distribution, illustrated in Figure 1-1), a random draw will be within ±1σ of the mean about 68% of the time. For the Uniform distribution (illustrated in Figure 1-2), the corresponding frequency is about 58%. For most distributions that one is likely to encounter, the fraction is larger than ½. When a result lies outside the range, however, it may lie very far outside, depending on the distribution, and so usually one also asks what fraction of outcomes will lie within ±2σ, ±3σ, and possibly larger ranges when the stakes are high (e.g., analysis of failure modes involved in manned spaceflight). High-σ intervals are usually encountered only when dealing with random variables whose domains are formally infinite, as opposed to those with finite domains such as practical applications of the Uniform distribution, for which 2σ fluctuations are impossible, because the 100%-confidence range defined by the full width of the distribution corresponds to ±√3σ, or approximately ±1.73205σ.
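For readers who want to verify the figures quoted above, the short sketch below (Python; written for this discussion, not taken from the text) computes the standard deviation of the number of heads in 10 fair flips and the within-one-sigma fractions for the Gaussian and Uniform distributions:

    import math

    # Standard deviation of the number of heads in n flips of a fair coin: sqrt(n*p*(1-p))
    n, p = 10, 0.5
    sigma_heads = math.sqrt(n * p * (1 - p))
    print(f"sigma for 10 coin flips: {sigma_heads:.2f}")          # ~1.58

    # Fraction of outcomes within +/- 1 sigma of the mean
    gaussian_frac = math.erf(1 / math.sqrt(2))                    # ~0.6827
    uniform_frac = 1 / math.sqrt(3)                               # ~0.5774 (sigma = half-width / sqrt(3))
    print(f"Gaussian within 1 sigma: {gaussian_frac:.4f}")
    print(f"Uniform  within 1 sigma: {uniform_frac:.4f}")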
Figure 1-1. The Gaussian Distribution. The curve is the probability density function for the Gaussian Distribution, also called the Normal Distribution and the Bell Curve. This example has a mean of 5 and a standard deviation of 1. The vertical lines show the location of the mean plus and minus 1 standard deviation. Approximately 68.27% of the area between the curve and the horizontal axis lies between the two vertical lines, hence within 1 standard deviation of the mean.
Figure 1-2. The Uniform Distribution. The step-function rectangle is the probability density function for the Uniform Distribution. In this example, the mean is 5 and the half width is 2, so the rectangle extends from 3 to 7 with a height of 0.25, i.e., the inverse of the full width. Below 3 and above 7 on the horizontal axis, the probability density is zero. The other two vertical lines show the location of the mean plus and minus 1 standard deviation. Approximately 57.74% of the area between the rectangle and the horizontal axis lies between the two vertical lines, hence within 1 standard deviation of the mean. Points 2 standard deviations from the mean lie outside the rectangle.
For a single trial consisting of 10 fair coin flips, we might be more interested in the likelihood of getting 5±2 heads instead of 5±1.58. This could be worked out with the use of Pascal’s triangle, but even easier methods are available, as we will see in the next chapter, where a number of distributions will be discussed, including the one that governs coin flips, the binomial distribution. Here we will just quote the result: 10 coin flips can be expected to produce a result of 3 to 7 heads about 89% of the time; it will produce either less than 3 or more than 7 about 5.5% of the time each. If a single experiment produces only 2 heads, then we might suspect that the coin is not fair, and we could say that the coin is biased toward tails with a statistical significance of 94.5%, meaning that with a fair coin, the probability of getting a result at least this far below the expectation value is only 5.5%, therefore the probability of getting more than the observed result is 94.5%. Note that this is not quite the same thing as saying that there is a 94.5% probability that the coin is biased toward tails. Many statisticians would consider such a statement poorly formulated, and arguments about such statements are not unusual. The “measure of belief” interpretation of probability might be invoked to defend it, but it is generally better to stick to the logically defensible statement that with a fair coin, the observed result would be expected only 5.5% of the time, and therefore some suspicion of bias is not unreasonable. But any statement about a probability of being biased should be specific about the amount of bias by quoting an expectation value other than 0.5. Just to say that a coin is biased is to make a fairly empty statement, since the impossibility of physical perfection makes some bias inevitable. If the truth of the situation is important, then further research into possible bias would be required. An example of this would be to use a much larger number of coin flips. For example, if 100 flips yielded the same fraction of heads, only 20 in this case, then the statistical significance would rise to 99.9999999865%, and doubters of bias would be few. But if 10 coin flips produced 4 heads, the statistical significance of possible bias towards tails would be 62.3% (i.e., with a fair coin, 4 or fewer heads can be expected from 10 flips 37.7% of the time), and a claim that the probability of bias is 62.3% due to an extremely unremarkable result would be clearly overstepping the bounds of reason. Although far from a “sure thing”, 62.3% is decidedly more than 50-50, being much closer to 2-to-1 odds. To offer 2-to-1 odds is to make a strong statement, and any event as commonplace as getting 4 heads in 10 coin flips is far too flimsy a foundation for such a statement. So the statistical significance of an event and the probability of some underlying systematic effect are very different things. The former is just the a priori probability that the observed result would not have occurred without some systematic bias. The fact that it did occur does not mean that the same probability applies to the existence of a systematic bias, because fluctuations do occur randomly, i.e., for no reason. It is only when something happens despite an extremely low a priori probability that we should get suspicious. Although some philosophical debate continues about the nature of probability, at least it is quantifiable in principle, whereas the interpretation of statistical significance remains highly subjective. 
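The percentages used in this discussion come from exact binomial tail sums, which are easy to compute directly. The following is a minimal sketch (Python; purely illustrative, not the author's method) for the 10-flip cases quoted above:

    from math import comb

    def tail_prob_at_most(k, n, p=0.5):
        """P(X <= k) for X distributed as Binomial(n, p)."""
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    # 2 or fewer heads in 10 flips of a fair coin
    print(f"P(X <= 2 | n=10): {tail_prob_at_most(2, 10):.4f}")    # ~0.0547, i.e., about 5.5%
    # 4 or fewer heads in 10 flips
    print(f"P(X <= 4 | n=10): {tail_prob_at_most(4, 10):.4f}")    # ~0.377, i.e., about 37.7%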
When some event really does have a probability of 62.3%, people start to take its systematic nature seriously, because this means that out of 100 trials, it is expected to happen about 62 times. But a one-time occurrence of an event with an a priori probability of 37.7% has a statistical significance of 62.3%, which translates to “essentially not significant at all”. Only when statistical significance presses close to 100% is it taken seriously by most interested observers. Now we can apply the standard deviation concept to the problem of interpreting and resolving the St. Petersburg Paradox. If we make substitutions from Equation 1.4 into Equation 1.5, we obtain
    V_W = \sum_{n=1}^{\infty} \left( 2^n - \overline{W} \right)^2 \frac{1}{2^n}                    (1.6)
This shows that the variance about the “expected” winnings is infinite, and hence the standard deviation is infinite, and so without even getting into the question of what the confidence of a ±1σ fluctuation is, we know that a typical fluctuation itself is infinite, and this tells us that the difference between the “expected” outcome and any actual outcome is infinitely uncertain. In other words, the expected winnings may be infinite, but so is the typical difference between what is expected and what is achieved, and so an outcome of zero should be no more surprising than any other outcome. If one plays the game and wins only $2 instead of the infinite “expected” amount, this outcome is statistically completely insignificant.

1.10 The Origin of Fluctuations

It was said above that fluctuations occur randomly, i.e., for no reason. This was meant to emphasize what has been said previously, that the nature of randomness involves nondeterminism, and therefore the occurrence of a fluctuation, especially one with a typical magnitude on the order of the standard deviation, should not be over-interpreted to the point of concluding that some bias must exist. But it could also be said that there is a reason why fluctuations occur, and if they did not, that would indicate that some hidden process is working. This reason may be seen in why sometimes 10 fair coin flips produce only 2 heads, a result with an a priori probability of 5.5% as mentioned above. Consider the following numbers:

      0     8    20    48    96   160   272   516
      1     9    24    64   128   192   288   520
      2    10    32    65   129   256   320   528
      3    12    33    66   130   257   384   544
      4    16    34    68   132   258   512   576
      5    17    36    72   136   260   513   640
      6    18    40    80   144   264   514   768
If we imagine a “wheel of fortune” with 1024 equal-angle segments, each segment containing a unique integer from 0 to 1023, inclusive, then these 56 numbers will be on the wheel. If we spin the wheel in the usual way, namely with enough force to produce an unpredictable number of rotations, then the 56 numbers listed above are each as likely as any other number to end up under the wheel’s pointer, i.e., the indicator of the winning number. If we spin the wheel 1000 times and never get any of the 56 numbers above, that would be a very surprising result, since they are each as likely as any other number. In fact, given the equal likelihood of each segment on an honest wheel of fortune, one of these 56 numbers ought to come up on average about 5.5% of the time. If we do such an experiment, spin the wheel many times and compute the percentage of outcomes on which
one of the 56 numbers comes up, we should not be the least surprised if this comes out about 5.5%. The 56 numbers are all the integers between 0 and 1023 whose binary representations contain two or fewer “1” bits. For example, the middle row contains the numbers 3, 12, 33, 66, 130, 257, 384, 544; in 10-bit binary notation, these are: 0000000011, 0000001100, 0000100001, 0001000010, 0010000010, 0100000001, 0110000000, and 1000100000. We have already used 10-bit binary notation above to represent a sequence of 10 coin flips with “1” standing for heads and “0” standing for tails, so we resume that representation here. The point is that 56 of the 1024 equally likely possibilities correspond to 0, 1, or 2 heads. These 56 are as likely as any other number to come up on a spin of the wheel. So the “reason” why we occasionally observe a fluctuation this large (so far from the “expectation” of 5 heads) is that if we didn’t, it would violate the assumption that all the integers are equally likely, and something would have to be wrong with our wheel. The absence, not the presence, of such fluctuations would be a cause for suspicion. Therefore we should not attach too much “significance” to a fluctuation whose likelihood is really not all that small. So even in the context of a purely random model, there is a “reason” for the existence of fluctuations. But that does not make the random process deterministic, because even though over many trials we expect all numbers to come up approximately an equal number of times, we cannot predict the order in which they will come up, and particularly we cannot predict what will come up on the next spin of the wheel, or sequence of 10 coin tosses, or even a single coin toss. That part is still perfectly nondeterministic. If we flip a coin and get heads, and someone asks why it happened to come up heads, we can go into the microscopic physical processes that combine in a chaotic chain of deterministic events too complicated for computation, but it still comes down to why the initial conditions were as they were, so that the best answer remains: “No reason.”

1.11 Microstates, Macrostates, and Entropy

It must be emphasized that our assumption of equal probability for all 1024 integers from 0 to 1023 rests critically on the coin being fair. As soon as we give it a bias towards heads, for example, then binary numbers with more 1s than 0s become more probable. But assume that the coin is perfectly fair: it still may seem strange that a number like 1111100000 should be considered to have exactly the same probability as a number like 0110001011. The first appears deliberate, orderly, non-random, and special, while the second looks disheveled, shuffled, scrambled, and not at all special, hence more commonplace and likely to be found lying around than anything so well groomed as the first number. Both indicate 5 heads and 5 tails, and both are indeed equally probable. These two different sequences are called “microstates” in statistical mechanics, and the state defined by “5 heads” is called a “coarse-grained state” or “macrostate”. These two microstates happen to correspond to the same macrostate. The fact that “5 heads” is the most probable result of 10 fair coin flips results from the fact that this macrostate corresponds to more microstates than any other macrostate defined by the number of heads in 10 tosses. So coarse-grained states are generally not equally likely, while in this case (no bias), microstates are.
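A short enumeration (a Python sketch added here for illustration; the variable names are ours) confirms both counts used above: of the 1024 ten-bit integers, exactly 56 contain two or fewer 1 bits, about 5.5% of the total, and no macrostate contains more microstates than “5 heads”:

    # Tally the 1024 ten-bit integers by how many 1 bits ("heads") each contains.
    counts = {}
    for value in range(1024):
        ones = bin(value).count("1")
        counts[ones] = counts.get(ones, 0) + 1

    at_most_two = counts[0] + counts[1] + counts[2]
    print(at_most_two, at_most_two / 1024)   # 56 and ~0.0547, i.e., about 5.5%
    print(max(counts, key=counts.get), counts[5])   # 5 heads is the largest macrostate, with 252 microstates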
If any doubt remains about whether the microstates for 10 fair-coin flips are equally likely, consider the following. The equal likelihood of microstates is a result of classical symmetry: on the first flip, heads and tails are equally likely because they are symmetric, i.e., there is no way to identify a preference for one over the other. So after one flip, the microstates 0 and 1 are equally
likely by symmetry. Considering each possible result as the beginning of a branching path, the second flip is equally likely (again by symmetry) to take the 0 microstate to 01 or 00, and also to take the 1 microstate to 11 or 10. The third flip branches each of these four microstates to two new equally likely microstates: 01 goes to 011 or 010 with equal likelihood, and similarly 00 goes to 001 or 000, 11 goes to 111 or 110, and 10 goes to 101 or 100. If we continue to generate these branching paths through 10 flips, we generate 1024 microstates, with every branch in the process being one of symmetric equal probability for adding a 1 or a 0 to the microstate at the branch point. This extends the symmetry argument uniformly over all 1024 microstates, and indeed 1111100000 comes out with exactly the same probability as 0110001011 and every other microstate, even 0000000000. It is important to realize that 0110001011 actually is just as unique as any other microstate, even one that is the sole member of its macrostate such as 0000000000. At the risk of belaboring this justification for treating microstates as equally probable, we must recognize that in some problems (notably in statistical mechanics), the symmetries are not as obvious, and arguments about equal likelihood of microstates may go unsettled, leaving conclusions about physical behavior accepted by some groups of physicists and rejected by others. It is very important, therefore, to make claims of equal likelihood on as rigorous a foundation as possible. The coin-flip exercise provides a solid starting point for developing the intuition useful for such analyses of less obvious situations. Having now introduced the terms “microstate” and “macrostate”, we will take this opportunity to introduce the term “entropy”, which is so important in thermodynamics, statistical mechanics, and information theory, in each of which it has a slightly different mathematical definition. In general, a macrostate of higher entropy is one in which there are more ways in which that state could have come about. Since lower entropy implies fewer ways in which the state could have come to exist, knowing that a system is in that lower-entropy state tells us more than if the system were in a higher-entropy state. Among the “N heads” macrostates, “5 heads” has the highest possible entropy in the 10-flip example. If someone tells you that 10 flips produced 5 heads, you know relatively little about which microstate produced that result, because there are 252 of them. Macrostates of “4 heads” and “6 heads” have almost as high entropy, with 210 microstates each. The macrostates “no heads” and “10 heads” have the lowest possible entropy, with only one microstate each. Therefore if someone tells you that 10 flips yielded 10 heads, you know everything about what happened. This results in an association between entropy and ignorance. Since macrostates require a “coarse-grained” definition, one could also define a macrostate as “4 to 6 heads”. This has 672 microstates and therefore even higher entropy. In statistical mechanics, the connection between entropy and number of microstates is straightforward, with entropy being proportional to the logarithm of the number of microstates. In thermodynamics, entropy is a function of temperature and heat energy, and in information theory, it is defined in terms of probabilities and their logarithms.
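To make the logarithmic connection concrete, the sketch below (Python; an illustration added here, under the arbitrary assumption of a base-2 logarithm so that the result is in bits) lists the number of microstates in each “k heads” macrostate of 10 flips together with its logarithm:

    from math import comb, log2

    # Number of microstates in each "k heads" macrostate of 10 flips, and its
    # base-2 logarithm (an entropy-like measure in bits).
    for k in range(11):
        micro = comb(10, k)
        print(f"{k:2d} heads: {micro:3d} microstates, log2 = {log2(micro):.2f} bits")

    # The coarser macrostate "4 to 6 heads" pools 210 + 252 + 210 = 672 microstates.
    print(f"4 to 6 heads: {comb(10, 4) + comb(10, 5) + comb(10, 6)} microstates")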
In all cases, the association between increased entropy and reduced information is maintained, although sometimes this requires a sign reversal. For example, in imaging theory, a completely flat image has the highest possible entropy, because it has the minimum possible information (none at all); but in image-compression theory, this same flat image has the lowest possible entropy, because it is maximally orderly and therefore highly compressible. The two mathematical definitions differ only in polarity. It has been said (admittedly somewhat facetiously) that the history of entropy definitions, beginning with that of Clausius in thermodynamics, continuing with Boltzmann’s in statistical mechanics, passing through Shannon’s and others’ in
information theory, imaging theory, and compression theory, illustrates the famous growth of entropy with the passage of time about as well as anything.

1.12 The Law of Averages

An expression that one comes across occasionally in games of chance is the “Law of Averages”. This arises most often in the context of sporting events. For example, in baseball, when an excellent batter is in a slump, it frequently happens that someone says that he is “due to get a hit” because of the “Law of Averages”. What this means is that he has a certain average rate of getting a hit per plate appearance, and over the years, he has managed to maintain that rate fairly constantly. In order for it to be maintained, he must get enough hits to make up for the times he fails to get a hit. Since he has gone a while without a hit, each time he comes up to bat, his probability of getting a hit goes up, because according to the Law of Averages, eventually he has to get some hits, and therefore he is due for a hit right now. Shouldn’t smart bettors put extra money on him getting a hit the next time he gets a chance to bat? After all, didn’t we say above that the unbiased random walk always returns to zero? If that’s true, then the farther it gets from zero, the more likely it is to turn around and come back, right? Otherwise, how is it supposed to get back to zero if it never turns around? It is certainly true that the one-dimensional unbiased random walk eventually returns to zero, and this is also true in two dimensions (it gets more complicated in three or more dimensions). This is the only correct implication in the second half of the preceding paragraph, however. There are several fallacies contained in the idea that the batter is “due” for a hit, and there is no such thing as the “Law of Averages” in probability theory; it is a popular fiction. The simplest error is overlooking the possibility that the batter’s ability might be degrading as time passes. In fact, all athletes’ abilities eventually decline with time, so perhaps he will never come out of his slump. Another error is assuming that the slump is necessarily statistically significant. As we saw above, there are many ways to get 5 heads in 10 coin flips, two of which are the sequences 1011001001 and 1111100000. These two microstates of the “5 heads” macrostate are equally likely. And yet the first would probably go unnoticed, while the second would be considered by many people to indicate the use of tampered coins, whereas it is as likely as any other specific sequence of 10 flips leading to 5 heads. So runs of successful or unsuccessful outcomes do not necessarily indicate anything out of the ordinary. The most serious fallacy is the idea that the outcome of one independent random event has any effect on the outcome of another. It is of course possible that the batter has been ill with serious flu symptoms, and now he is feeling much better, so naturally we expect to see some hits very soon. But in such cases, recent recoveries from debilitating conditions are always given prominence in the discussions of expected athletic performance, and this was not the case in the example above, which is fairly typical. If a batter’s health is fairly constant over some interval of time, and if there are no other obvious biases operating (e.g., facing that year’s Cy Young Award-winning pitcher), then whether he gets a hit on any specific occasion is essentially independent of how other recent attempts came out.
Some slight daily variations in the efficiency of physical execution are normal, but we cannot assume that his recent failures have made his current attempt more likely to produce a favorable result.
This difficulty in separating a priori probability from a posteriori probability is deeply ingrained in the human psyche. These terms mean different things in different contexts, but in probability theory, the former means the probability of some event estimated before the relevant observations are acquired, and the latter is the probability of that same event estimated after some of those data have been acquired. For example, before one makes any flips of a fair coin, the methods discussed above allow us to compute that the probability of getting 4 heads on the first 4 flips is 1/16. Suppose that after 3 flips, we have obtained 3 heads; what is the probability of getting heads on the next flip? The “Law of Averages” says (erroneously) that tails is more likely, just because getting 4 consecutive heads is rather unlikely, and we are “due” for tails. But under the conditions assumed in the preceding coin-flip discussion, each flip is independent of the others. So once 3 heads are part of history, the probability that this set of 4 flips will yield 4 heads has changed. Like any other single flip, the probability of getting heads is ½, i.e., the probability that we will complete the sweep is now ½, much greater than 1/16, because the 1/16 applied prior to the achieving of 3 heads. The phrase “a posteriori probability” is encountered most frequently in decision theory, where one has probability distributions describing all possible events that contribute to the outcome of a measurement. Typically these events are correlated with each other, and therefore if some events are observed to occur, the probabilities of others are changed to what are called conditional (because they depend on the condition that some other events occurred) or a posteriori probabilities (since they apply after some events are known to have occurred). Formal manipulation of a posteriori probability can be used to work backward from an observed event to estimate probabilities for various unobserved events that may have been the cause of the observed event. This involves the use of Bayes’ Theorem, which is described briefly in section 2.8 and which reduces to trivial identities when the events are all independent. So how does the unbiased random walk ever get back to zero? The answer is: in its own sweet time. It has an arbitrarily long period in which to work its magic. It may well continue its progress away from zero for many more steps before its remoteness eventually peaks, and when it finally does, it will not be because it has a debt to pay, but only because the probability of never doing so is vanishingly small. The actual step-by-step progress may not betray any tendency toward zero. There will generally be many ups and downs along the way, creating the appearance of many slumps and surges that obscure the large-scale excursion. The mathematically ideal fairness of the coin remains undiminished through the eons that it may have to continue being flipped before the excursion finally arrives back at the origin, only to begin another odyssey. At no point will it be “due” for a step toward zero. There is another popular use of the phrase “Law of Averages” which is not so fallacious, other than its reference to a nonexistent “law” of probability theory. This is the simple admonition not to undertake some risky behavior because sooner or later one will suffer the consequences because of the “Law of Averages”. In this context, the idea conveyed is basically correct; it is just poorly expressed.
Although in practice many probabilities are estimated via an averaging process, it is the probability that is fundamental, not the method of estimating it. But indeed, if some activity has a significantly nonzero probability of a deleterious outcome, one ought to consider the fact that repeated indulgence in that activity can be expected to produce that detrimental result eventually. This is just a statement of the fact discussed in section 1.10 above: statistical fluctuations must be expected to occur at the rate indicated by their probability, and if they do not, something is wrong with the analysis of the situation. There is a legitimate notion called “The Law of Large Numbers”
which states that for a sufficiently large number of opportunities, the various possibilities will occur at rates given approximately by their probabilities. But this makes no attempt to predict when the various possibilities will occur, so one must not go one step further and claim that a fluctuation is “due” on the next few trials.

1.13 Lotteries and Spreads

Two of the most publicly visible forms of gambling are lotteries and spreads relating to scores in sporting events. State and national lotteries are now commonplace, and television news programs usually report the spreads of major upcoming sports competitions. Lotteries have become a staple of many state budgets, although typically a significant fraction of the money lost by the bettors ends up in private pockets, not public coffers. Different lotteries operate with different pseudorandom number generators, different definitions of what constitutes a winning wager, and different levels of payoffs, but the main fascination in most cases appears to be the possibility of winning an astronomical amount of money. The odds of this happening to any given individual are usually not too different from the odds of that individual being struck by lightning, but this does not cause any shortage of participants. Odds of 1 in 10 million are typical of winning the big prize in a lottery. If a given person plays that lottery 20 times in a year, the odds of winning are about the same as being struck by lightning during that year. Of the money taken in by most state lotteries, approximately half is paid out as prizes in the form of one enormous payout and many small ones. This might seem like a lot of prize money, until one realizes that every time one bets, the expected outcome is the loss of half the money. The thrill of knowing that the big payoff could happen is clearly worth the price of admission for many players. It seems reasonable to presume, however, that like most numbers too large to be fully appreciated by the human mind, the thrill survives primarily because of a failure to grasp how improbable the big payoff really is. The fact that it does happen to someone carries a lot of weight. And yet in these same people, the fact that lightning also really does strike someone generates little anticipation of being that someone. At least these lotteries have a certain mathematical honesty in that the chances of winning are not difficult to compute and are required by many state laws to be made available to participants. A less transparent parasitic activity has developed, however: services that purport to increase one’s lottery payoffs. Such services usually take the form of a computer program that generates numbers in the proper format for a given lottery. It would seem that if the lotteries themselves use honest unbiased pseudorandomization methods, then it should not be possible for any computer program to generate a set of numbers with a better than average chance of winning. In fact, the only way this would be possible would be for the computer programmer to have access to the same algorithm used by the lottery. Indeed, this technique is used to synchronize the pseudorandom numbers used as part of secure remote access to computer systems. But many lotteries don’t use computer-generated numbers; they may have numbered balls bouncing around in a container or some similar physically based mechanism. Remarkably, these lottery programs can indeed increase one’s payoff.
This happens because many people bet the same numbers, and if the numbers win, ties result and reduce the individual payoffs. For example, in a game requiring the player to guess five two-digit numbers less than or
equal to 49, if one bets 12 07 19 41 08 because the attack on Pearl Harbor came on December 7, 1941, at about 8 AM, and if these numbers win, then the prize may have to be shared, because out of the other ten million bettors, there’s a fair chance that someone else had the same idea. Similarly, a winning bet of 04 06 08 12 20, the numbers of sides of the Platonic solids, would be imperiled by the possibility of other mathematics-history buffs entering the game. Many lottery players believe in “lucky numbers” and try to find them in Nature. Simply by generating effectively random numbers, the computer program provides guesses that are less likely to be duplicated and result in a tie. Of course, first the numbers have to win, and the computer programs are no better at that than anyone else. But once a program has produced a winner at any level of payoff with no ties involved, its marketer can claim that it has increased its customers’ lottery payoffs. If the public interprets this to mean that the program has some magical power to predict winning numbers, then that is just an unfortunate misunderstanding. Spreads are very simple on the surface. A “bookmaker” offers a point difference as the likely winning margin by a favored team, and the bettor puts money on one of two possibilities: the favored team will win by more than the stated margin, or it will not (including the favored team losing outright). There are many subtle variations involving ties, etc., but this description applies to most spread betting in the United States. If the bookmaker did a perfect job, the bet could not be different from a fair coin flip. The more lopsided the expected outcome of the sports contest, the larger the spread, which is adjusted once betting begins to balance the number of bettors on each side of the outcome. The reason why this makes money for the bookmaker is that a “commission” is charged. A less generous way to describe it is that the bookmaker pays less than even money on a fair coin flip. It would not be easy to find people willing to bet on a fair coin flip when a loss costs more than the payoff, but there is no shortage of spread bettors, because a widespread illusion of special sports insight pervades the sport-loving population. Most people who follow a given sport do indeed have a lot of insight into it, but even just predicting win/lose outcomes has a history littered with failures, and correctly predicting margins of victory is vastly more subtle. But the bookmaker needs only to get a large number of bettors evenly split, and a profit is almost guaranteed. A profit for the bookmaker is not completely guaranteed on every event, because the period of adjustment may leave some bettors committed to slightly different spreads, and also bettors may make multiple-event wagers, such as picking spreads on three different football games played on the same day. When placing multi-event wagers, it is occasionally possible to win more money than the amount bet, but the payoff ratio doesn’t grow fast enough to compensate for the shrinking probability of getting all the coin flips in one’s favor, and one must win all facets of the multi-event combination to win anything. For example, the probability of winning a two-event fair coin flip is 25%, so a break-even wager would require a payoff of $3 on a $1 bet, but a $2 payoff is more typical.
Statistically, the bookmaker may suffer a few losses along the way, but over time, this variation on the biased random walk is practically certain to diverge in gross assets and converge into a comfortable retirement.

1.14 Evolution Happens

We have traced the human fascination with gambling from the primeval need to guess accurately in decisions related to survival. Apparently our ancestors were better guessers than their
competitors, since natural selection surely rewarded wiser choices in the risky situations bound to arise in an eat-or-be-eaten arena. But on the cosmological time scale, these contests are recent. It now seems clear that the furnace of evolution has been burning for much longer than the Earth has existed. From the earliest moments in the history of the universe, the ability of matter and energy to exist in various configurations has given the more stable forms an opportunity to exert greater influence on subsequent developments. In the raw infancy of what we call physical reality, the unthinkably high energy density of the milieu produced churning convulsions that prevented even quarks from stabilizing in the seething froth. As the neonatal medium expanded, it cooled enough for elementary particles to settle out, and recognizable processes took over from the incomprehensible chaos. The fundamental rules governing the microscopic time-driven unfolding of mass-energy powered the chain of events through several distinct stages. Quarks eventually became stable and fused into protons, which later captured the cooling electrons, taking away most of their opacity and freeing the light whose lingering vestiges are still evident today in the cosmic background. Density fluctuations led the primordial hydrogen to collapse into galaxies, stars formed, and the starlight heated the interstellar medium and the relatively evacuated spaces between galaxies. The stars’ nuclear ovens cooked more interesting chemical species, molecules formed, and matter swirled into planetary systems. On at least one planet in such a system, chemical reactions blossomed into biological processes that coalesced into organisms who continued to play the game of trying many variations and yielding the future to those best suited to inherit it. By some mysterious process still not understood, advanced organisms acquired consciousness, and suddenly the universe knew that it existed. Again, “suddenly” must be interpreted in the context of the cosmic time scale, and sentient beings may have evolved in many places. In the one place that we know about, awareness inevitably led to own-self-awareness, and among the most advanced individuals, to other-self-awareness. Over the last few millennia, a tendency can be seen to devote more and more conscious energy to expanding our understanding of ourselves and our surroundings. Knowledge for its own sake is no longer an odd notion. Apparently this too is a manifestation of the predisposition of mass-energy to organize into richer and more complex structures as the universe continues to cool and expand. One focus of the search for greater understanding is the attempt to determine what rules govern the activity at the most fundamental scale of reality. Notions about such a scale have come and gone, with the Greeks’ concept of indivisible atoms and the continuum physics of the 18th and 19th centuries. The advent of Quantum Mechanics (which will be discussed in Chapter 5) gave form to the realization that atoms were both real and not indivisible, and Nature was seen to be behaving as though its fabric was not a continuum. Great effort went into expressing Nature’s granularity in the language of continuum mathematics, and the old problems with infinities in classical physics took new forms in quantum field theory but did not pass away. Perhaps there is no greater quandary woven into Quantum Mechanics than the role of random processes.
The idea that quantum randomness was any different from classical randomness was consistently rejected by Albert Einstein. Classical randomness was purely epistemic, i.e., all processes were 100% deterministic but impossible to bookkeep exactly, so random distributions were used to model the large-scale behavior of systems composed of many small elements. The idea that the more accurate descriptions provided by Quantum Mechanics implied a nonepistemic randomness among the laws of Nature drove Einstein to launch a number of brilliant and extremely
useful arguments whose goal was to demonstrate an incompleteness in the quantum formalism. All of these failed to achieve their objective, but the pressure they exerted forced great progress in both theoretical and experimental physics. Today it appears that either a fundamental nonepistemic randomness does lie at the heart of the mechanism that propagates the universe of our experience through time, or the very concepts we use to represent physical reality are inadequate for that purpose. Possibly both. In any case, randomness, epistemic or otherwise, is an essential ingredient in the evolutionary development of every aspect of reality that we can recognize. Whether this is any way to run a universe is difficult to say without having any other possibilities to consider, but this is the way we see the universe operating. Something shuffles the deck and shakes out different structures to compete against each other for the favor of natural selection. Whatever does that has the appearance of randomness and seems to conduct its affairs at the most fundamental level of physical reality. It is therefore an object of intense interest in the pursuit of understanding. It probably cannot be understood without simultaneously bringing into better focus the raw materials upon which it acts, the very “atoms of spacetime” themselves.
Chapter 2 Classical Mathematical Probability

2.1 Quantifying Randomness and Probability

In the last chapter we saw how Pascal and Fermat developed the first solid mathematical foundation for probability theory. It is no coincidence that this took place at a time and in a place where modern mathematics was being born. Some notions about betting odds, likelihood, and unpredictability had existed for millennia, but in seventeenth-century Europe a talented generation of thinkers had arisen and was formulating ever more challenging problems in logic and finding solutions to them. In those days, specialization into the separate fields of mathematics, physics, and philosophy had not yet taken place, and these seekers of truth operated in all of these fields simultaneously. Developments in number theory led from the study of integers to the continuum, i.e., the ordered set of generally fractional numbers with the property that between any two members of the set, another member can always be found. Newton and Leibniz extended ancient geometry with the newly developed theory of infinite sequences and thereby brought the calculus into the toolkit of all mathematicians. The continuum had long been known to consist of different types of non-integer numbers: rationals and irrationals, numbers which can be represented as the ratio of two integers and those which cannot, for example, 3/17 and √2, respectively. Now it was found that irrationals also consist of two types: algebraic and transcendental. These two kinds of irrational numbers are, respectively, solutions to algebraic equations consisting of polynomials with integer coefficients and other kinds of equations. Examples are the algebraic equation 22x^4 - 2143 = 0 and the transcendental equation 2^x - 7 = 0. The solution to the latter is the transcendental irrational number ln(7)/ln(2) (where “ln” is the natural logarithm), and the solutions to the former are algebraic irrational numbers whose values are left as an exercise (there are two real and two imaginary). Calculus first emerged in its differential form, a form concerned with the rate of change of a function with respect to its independent variable. Soon after, interest developed in the question: given a rate of change, what is the function that has that rate of change? This led to the development of the integral calculus, which required a parallel development of measure theory and a variety of definitions of the integral operator. Newton’s work on the calculus was guided and partly motivated by his interest in “Natural Philosophy”, the name used at the time for what we now call mathematical physics (also including chemistry, optics, etc.). This took place about 200 years before the widespread acceptance of the idea that “atoms” actually exist as objectively real components of the physical universe. In fact, it was about 100 years before the development of the first approximation to the modern concept of atoms, after which for another century atoms were regarded merely as useful mnemonics for chemical calculations. So for Newton it was natural to view the substance of physical reality as continuous, and this led to the concept of a density function. Thus the mass of a stone resulted from the fact that a mass density function operated throughout the volume of the stone. This notion made the leap from mechanics to probability theory
and enabled the application of probability as known for integer variables, e.g., the number of heads in 10 coin tosses, to continuous variables, e.g., the amount of rainfall to expect in London in October. A minor addendum was found to be necessary: one couldn’t discuss meaningfully the probability that 3.0 inches of rain would fall, because the probability of any exact value of a continuous variable would seem to be zero. Being continuous, the variable has an infinite number of other values that are possible, and adding up the probabilities of all the possible values must yield a finite result corresponding to certainty. Calling the probability zero is not satisfactory, since adding up an infinite number of zeroes is not a well-defined process. This same bit of rigor had also been found necessary in mechanics: a continuous stone may be infinitely divisible, but if one were to subdivide a stone into an infinite number of pieces, one could not discuss meaningfully the mass of a given infinitely small piece, and yet somehow when assembled into the stone, all the pieces had to add up to produce the mass of the stone. Thus one can discuss only the mass of some arbitrarily small but nonzero piece (i.e., not the result of an infinite number of halvings of the stone), and the question of whether the October London rainfall will be 3 inches can be discussed meaningfully only in terms of the probability that it will be between 3-Δ and 3+Δ inches, where this probability is obtained by integrating the density function from 3-Δ to 3+Δ, Δ > 0 but otherwise arbitrarily small. This probability can then be compared to that for 2-Δ to 2+Δ, for example, to decide whether 2 or 3 inches is more likely. But one may object: surely some year came and went with no October rainfall in London, in which case there is some well-defined probability for exactly 0 inches. Such an objection is valid; in general, probability distributions are models of reality which must be understood well enough to avoid misapplication. A probability density function for October London rainfall can be constructed only for values greater than 0, and 0 itself has to be treated as a separate case of an integer value with its own probability. But is this distinction significant? Probably not, but it depends on one’s purposes. If one is being technically rigorous, then one should also ask: is it really certain that not even one water molecule went from the atmosphere to the ground in that October period? But a “molecule” of water violates the assumption that the random variable is drawn from a continuum. What about the probability of -1 inch? Isn’t that also zero? In a sense yes, but if the probability density function is properly defined, the domain of its random variable will be restricted to positive values, in which case it is not meaningful to apply it to values outside of its domain of definition. And should evaporation be considered negative rainfall? In summary, some work is in order to ensure that the model is a good approximation and that it is used exclusively in a manner consistent with that approximation. For example, the Gaussian distribution introduced in the previous chapter is frequently used in astronomy to model errors in the angular coordinates of a celestial body at a particular epoch. 
Modern astrometric techniques typically involve angular errors of less than one second of arc, where “error” is characterized statistically as the distribution’s standard deviation (the square root of the average squared distance from the mean). A second of arc, also called an arcsecond, is 1/60th of a minute of arc, which is 1/60th of an angular degree, of which 360 constitute a full circle. For example, a second of arc is about the apparent angular size of a basketball viewed with the naked eye from a distance of 31 miles, or would be if the human visual system were capable of such resolving power, which clearly it is not. The apparent sky position of typical asteroids can be computed for a specific time with an uncertainty of about this amount, which means that the Gaussian model implies that the object will be found within 1.0 arcsecond of its predicted position
about 68% of the time, between 1.0 and 2.0 arcseconds from its predicted position about 27% of the time, and farther off about 5% of the time. This is usually close to what is observed after many independent observations have been made, so the Gaussian error model seems to be very good for this purpose. But the Gaussian model also allows for some small but nonzero probability that the object will be millions of arcseconds off. Does this make the Gaussian model absurd? After all, in spherical geometry, the error can’t be more than 648,000 arcseconds, because that is 180°, and farther off than that in one direction implies closer from another. Can a distribution with an infinite domain make sense as a model for angular errors which cannot be greater than 180° in magnitude? Yes, it can, as long as the absurd region of the model is safely distant from where serious work is being done. Although the Gaussian model allows nonzero probabilities for absurdly large angular errors, those probabilities are so small as to be completely negligible. In this case, for example, the probability that the object will be more than 5 arcseconds off is less than 0.00006%. The probability of being off by 180° would require us to insert over 91 billion more zeroes after the decimal point. The important point here is that mathematical models of physical processes generally have limits beyond which they are not acceptable approximations. It follows that some care must be given to ensure that all significant aspects of a model’s application lie safely within such limits. We will see in subsequent chapters that probabilistic models play a crucial role in measurement theory, that the most scientifically meaningful definition of “information” requires their use, that they can be extremely powerful tools, and like all such tools, they can do great damage if abused. But when used properly, they facilitate progress in expanding our understanding of the physical reality within which we find ourselves embedded as some sort of self-aware beings capable of experiencing wonder and curiosity and a desire to know more about our situation, beings of incomplete knowledge, hence beings whose experience is infused with uncertainty. By addressing our uncertainty head-on via the tools of random-variable theory, we have found that a surprising amount of orderliness can be brought to our world view and aid in our quest for understanding. Observation and theory combine to guide us in our search for this greater comprehension. So far, recipes for creating more effective theories remain elusive, but the craft of observation has seen great advances in the form of standardizing the process of making and interpreting measurements. The use of probabilistic models in the representation of the information contained within measurements is quite possibly the greatest leap forward in making the observational side of science effective. For this reason, the act of representing the result of a measurement by assigning it an appropriate probability density function is considered to have a status approaching the sacred. Accomplishing this successfully opens doors to realms previously unsuspected, as for example some highly nonintuitive but empirically validated features of quantum mechanics attest.
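The percentages quoted above for the astrometric example follow directly from the Gaussian model with a standard deviation of 1.0 arcsecond; a minimal check (Python, using the standard error function; the helper name is ours) is:

    import math

    def within(k):
        """Probability that a Gaussian deviate lies within k standard deviations of the mean."""
        return math.erf(k / math.sqrt(2))

    print(f"within 1 sigma:  {within(1):.4f}")              # ~0.6827
    print(f"1 to 2 sigma:    {within(2) - within(1):.4f}")  # ~0.2718
    print(f"beyond 2 sigma:  {1 - within(2):.4f}")          # ~0.0455
    print(f"beyond 5 sigma:  {1 - within(5):.2e}")          # ~5.7e-07, i.e., less than 0.00006%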
What we mean by this assignment of a probability density function to a measurement result is that everything we know about the “true” value of whatever was measured is built into a density function that permits us to consider the implications of different “true” values. Nontrivial measurements never yield results accurate to an infinite number of decimal places, so there is always some residual uncertainty that represents possible error in the quoted value. Furthermore, precision limits alone typically do not account for all measurement error. A well-designed measurement yields a numerical result whose precision exceeds the accuracy of the observational instrument, i.e., it samples the noise. This makes the averaging of multiple results more effective, thereby reducing the uncertainty in what the “true” value is. The most effective averaging methods make use of
information built into a probability density function, and this is one reason why measurement results should be represented in this way. There are other reasons as well, most of which will be explored in later chapters, along with the questions surrounding what is meant by a “true” value. For now, we focus on what a probability density function is and how it can represent our knowledge of the value of a quantifiable physical aspect of objective reality.

By now it is clear that a great deal of focus is going to be placed on probability density functions for continuous random variables and probability distributions for discrete random variables. The latter are sometimes called “probability mass distributions”, and this is consistent with the common practice of treating discrete random variables with continuous density functions while considering the meaningful quantity to be the probability “mass” lying between k - ½ and k + ½, where k is an integer value of the discrete random variable. The distinction between continuous and discrete random variables should not be forgotten, nor that between probability density functions and probability mass functions, but we prefer not to litter subsequent discussions with constant reminders of these, and so we will assume that the nature of the random variable is obvious in context unless some special emphasis is needed.

This sliding between discrete and continuous random variables is especially useful when dealing with the fact that many distributions of both types have “asymptotic Gaussian” forms, also called “asymptotically normal” forms. For example, the number k of heads resulting from n tosses of a coin follows the “binomial distribution”, which is a probability distribution of a discrete random variable, since the random variable can take on only integer values. But under certain asymptotic conditions involving n becoming large, the shape of this distribution begins to resemble that of a Gaussian distribution, which describes a continuous random variable. It would be very inconvenient not to be able to take advantage of this fact, and so we proceed to use the continuous distribution to describe a discrete variable with the understanding that when we refer to the probability of k heads, we really mean the probability mass between k - ½ and k + ½. We will therefore talk mostly about “probability density” functions, but it should usually be obvious how the point being made relates to discrete distributions, and where this is unlikely, special comments will be made.

We must also acknowledge that the density function is not the only type of function of interest. The “cumulative” distribution is also of crucial importance, and a number of auxiliary functions will also be found useful. But when we speak of a given distribution, the mental image invoked should usually be the density function or mass distribution.

2.2 Probability Mass Distributions

The most natural prelude to the study of probability density functions is a discussion of probability mass distributions. In section 1.9 we mentioned the distribution of probabilities for various outcomes of a series of fair coin flips, and that is a good one with which to begin. In that chapter we also saw (Equation 1.2, p. 5) that the number of ways to get k heads in n flips of a coin is given by Pascal’s rule:
N_n(k) = \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \qquad (2.1)
This is the number of microstates (see section 1.11) corresponding to k heads, and it is the first piece that we need to construct a probability distribution for coin flips. We saw that, in keeping with the frequentist interpretation of probability, what we really want is the fraction of the time that n tosses can be expected to yield k heads, so we need to divide N_n(k) by the total number of possible microstates, which we found to be 2^n, that is, the number of different n-bit binary numbers. This fraction is the probability we seek:
P(k, n) = \binom{n}{k}\,\frac{1}{2^n} \qquad (2.2)
For example, with n = 2 tosses, there are 2^2 = 4 possibilities, HH, HT, TH, TT. Of these, 2 contain a single H, as Pascal’s rule predicts, so the probability of getting a single H is 2/4 = 1/2. So Equation 2.2 is indeed the probability mass distribution that gives the likelihood of getting k heads (or tails) in n flips of a fair coin.

But what if the coin is not fair? Suppose for example that we have a really badly mangled coin for which the probability of heads on a single flip is 2/3. The more general probability distribution that we need is the binomial distribution, of which Equation 2.2 is a special case. In assuming the coin is fair, we used the fact that the probability of heads on a given flip is 1/2. If instead this is 2/3, then the 2^n possible microstates are no longer equally likely, and this must be taken into account. In order to do this, we observe that the factor 1/2^n is actually the fair-coin probability of any one given microstate. Equation 2.2 actually says that the probability of getting k heads in n flips is the number of corresponding microstates multiplied by the probability of a single corresponding microstate. Since all microstates have the same probability with a fair coin, the probability of any single corresponding microstate is just the probability of any microstate at all, 1/2^n. When we change to a biased coin that has a probability of heads on a single toss of p = 2/3, we do not change the number of corresponding microstates, but we do change their probability. So we need to change the 1/2^n in Equation 2.2 to a more general microstate probability. The inverse relationship between the probability of a single microstate and the number of microstates is lost when the microstates no longer all have the same probability.

In order to get exactly k heads in n tosses, we must also get exactly n-k tails. For p = 2/3, the probability of tails on a single toss is 1 - p = 1/3. The probability of the k heads is p^k (because the tosses are independent), and the probability of the n-k tails is (1 - p)^(n-k). The probability for both of these to happen is their product, and so the probability of a corresponding microstate is p^k (1 - p)^(n-k), and the more general form is
P(k, n, p) = \binom{n}{k}\, p^k (1-p)^{n-k} \qquad (2.3)
So for p = 1/2, p^k (1 - p)^(n-k) is equal to 1/2^n, which matches Equation 2.2. However, for n = 10, the 1/2^n in Equation 2.2 evaluates to 1/1024, whereas p^k (1 - p)^(n-k) in Equation 2.3 depends on both k and n in a manner that doesn’t cancel k out unless p = 1/2. For example, for k = 5 and p = 2/3, we are dividing N_n(k) by about 1845.28. The fact that this is larger than 1024 reflects the fact that when p changes from 1/2 to 2/3, the probability of a “5 heads in 10 tosses” microstate drops from 1/1024 to about 1/1845.28. The increased probability of heads on each toss makes microstates with more than 5 heads more likely than with a fair coin, and their increased likelihood comes at the expense of the
microstates on the lower end of the spectrum, including those corresponding to “5 heads”. This is all consistent with the requirement that the normalization of a probability distribution must be such that the sum of the probabilities over all possible outcomes is 1.0. When each microstate is equally likely, dividing each macrostate’s number of microstates by the total number of possible microstates supplies this normalization. When the microstates are no longer equally likely, the probability of a microstate corresponding to a given macrostate varies from one macrostate to another, but as long as we multiply the number of microstates of a given macrostate by the corresponding microstate probability, the sum over all macrostate probabilities will still be 1.0. This is what we have done in going from the special case in Equation 2.2 to the general case (for binomials) in Equation 2.3.

Consider the case k = 5. As we saw in the last chapter, there are 252 ways to get 5 heads in 10 coin flips, i.e., the macrostate “5 heads” has 252 microstates. If the coin is fair, p = 1/2, and all 1024 possible microstates are equally likely, so the probability of k = 5 is 252/1024, or about 0.2461, the highest probability of any macrostate. But now consider p = 2/3: those 252 microstates are now each less likely to occur. A microstate with 7 heads is now more likely to occur than one with 5 heads; p^k (1 - p)^(10-k) is greater for k = 7 than for k = 5 when p = 2/3. Even though there are only 120 microstates for the “7 heads” macrostate, each is much more likely to occur, making this macrostate more likely than “5 heads”, with a probability of about 0.26012, while that for “5 heads” has dropped to about 0.13656. So while there are still fewer than half as many microstates for “7 heads” as for “5 heads”, they have each become 4 times as likely to occur. With the microstates now so unequal in probability, the relationship between fraction of microstates and probability of corresponding macrostate has clearly changed.
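These numbers are easy to reproduce with a few lines of code. The sketch below is a minimal check of Equation 2.3 using the Python standard library; the function name is our own, not the book’s.

from math import comb

def binom_pmf(k, n, p):
    # number of microstates times the probability of one corresponding microstate
    return comb(n, k) * p**k * (1 - p)**(n - k)

print(binom_pmf(5, 10, 1/2))   # ~0.2461: "5 heads" is the most likely macrostate for a fair coin
print(binom_pmf(7, 10, 2/3))   # ~0.26012: "7 heads" becomes the most likely macrostate for p = 2/3
print(binom_pmf(5, 10, 2/3))   # ~0.13656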
Figure 2-1. A wheel of fortune with equally likely sequences of four flips of a fair coin. Each four-digit binary number represents a possible sequence, with “1” indicating heads and “0” indicating tails. Each of the 16 possible sequences occupies an angular segment of 22.5°. Since six of these involve two heads, the probability of that outcome is 6/16, higher than that of one or three heads, since those outcomes have only four equally likely sequences or “microstates” and thus a probability of 4/16 each. There is only one sequence containing no heads, and only one with four heads, so those outcomes each have a probability of 1/16.
Figure 2-2. A. After one flip of a coin biased 2/3 to heads, the area proportional to the fraction of heads is labeled “1”, tails “0”. B. After the second flip, each area is subdivided into areas with 2/3 going to heads, 1/3 to tails. C. After the third flip, each area is again subdivided into areas with 2/3 going to heads, 1/3 to tails.
Figure 2-3. A wheel of fortune containing the same 16 possible sequences of four coin flips as in Figure 2-1, but for a biased coin that has a 2/3 probability of heads on any given flip. Sequences involving more heads are now more likely than for a fair coin. Although there are six sequences with two heads and only four with three heads, an outcome of three heads is more likely than two heads because the former’s microstates are each twice as likely, i.e., occupy segments with angles twice as large.
This is illustrated in Figures 2-1 through 2-3. Figure 2-1 shows a “wheel of fortune” corresponding to 4 flips of a fair coin. We reduce the example from 10 flips to 4 in order to keep the wheel segments easy to see. Each segment is labeled with the 4-digit binary number corresponding to the sequence of toss outcomes in its microstate. Like most wheels of fortune, or “prize wheels”, the segments are of equal angular size. From its earliest references in Roman literature, the wheel of fortune (“Rota Fortuna”, the wheel of the goddess Fortuna, a capricious controller of fate) has conveyed the notion of uniformly random selection from a set of possible outcomes. When spun honestly, i.e., through enough full rotations to preclude deliberate control of the final orientation, such wheels generally provide a very good approximation to uniform randomness. The pointer selects a particular location on the wheel, whose angular position when it stops spinning is usually modeled as a continuous random variable that takes on values between 0° and 360°, with these two values being equivalent. By subdividing the wheel into a finite number of segments, we allow it to select discrete outcomes.

The wheel in Figure 2-1 has 16 segments of 22.5° each. Since the final orientation of the wheel is uniformly random, each segment has the same chance of containing the pointer’s tip when the wheel stops. If we define a “trial” as a sequence of 4 coin tosses, then this mechanism is a good way to generate the equivalent of a trial. A spin of the wheel is equivalent to the actual tossing of a coin 4 times. In this case, the coin is fair, and each segment’s corresponding microstate is equally likely. Since there are 6 microstates with 2 heads, more than with any other number of heads, the “2 heads” macrostate is the most likely outcome on any given trial, with a probability of 6/16, or 0.375. There are 4 segments with a single head, and the same number with 3 heads, so the “1 head” and “3 heads” macrostates both have a probability of 0.25. The “4 heads” and “no heads” macrostates, each having only one microstate, i.e., segment, have probability 0.0625. These probabilities are the ones that Equations 2.2 and 2.3 provide.

Now we consider the biased coin with a single-toss heads probability of 2/3. Figure 2-2 shows how the microstates develop on the first, second, and third tosses. After one toss, the probability of heads is 2/3, and so the 1 microstate’s segment occupies the same fraction of the wheel, i.e., an angular size of 240°, with the 0 microstate having 120°. On the second toss, we consider what happens to each of the one-toss microstates: each will branch to two segments, with the subdivisions being again in proportion to the heads/tails probability ratio. So microstate 1 branches to 11 and 10, with the former getting 2/3 of its 240°, and microstate 0 branches to 01 and 00, also with the former getting 2/3 of its area, in this case 2/3 of its 120°. The third toss results in similar branching of the 4 existing segments into a total of 8 subdivisions with the heads region getting 2/3 of the previous area. The wheel after the final toss is shown in Figure 2-3. Clearly the segments have unequal angular size, which reflects the fact that their corresponding microstates have unequal probability. The most probable microstate is now the only one belonging to the “4 heads” macrostate; this segment has an angular size of about 71.11°.
On each flip of the coin, the “all heads” macrostate has gone from 2/3 of the wheel to 4/9, then 8/27, then 16/81. But “4 heads” is not the most probable macrostate; that distinction belongs to “3 heads”, because while each of its 4 microstates has an angular size only half that of the 1111 microstate, the fact that there are 4 of them results in twice the total area on the wheel. The 0111, 1011, 1101, and 1110 segments each occupy 8/81 of the wheel with an individual angular size of about 35.56°. Their sum is about 142.22°, double that of the 1111 segment. Just as predicted by Equation 2.3, the “3 heads” macrostate has a probability of about 0.39506, while “4 heads” has a probability of about 0.19753.
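The segment sizes in Figure 2-3 follow directly from the single-microstate probabilities. A minimal sketch (variable names are ours) for the 4-flip wheel with p = 2/3:

from math import comb

p, n = 2/3, 4
for heads in range(n + 1):
    segment = 360 * p**heads * (1 - p)**(n - heads)   # angular size of one microstate's segment
    count = comb(n, heads)                            # number of such segments (microstates)
    print(heads, round(segment, 2), count, round(count * segment / 360, 5))
# heads = 3: four segments of about 35.56 degrees each, total probability ~0.39506
# heads = 4: one segment of about 71.11 degrees, probability ~0.19753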
Because the coin flips are all independent of each other, the order in which the 3 heads appeared is immaterial, and correspondingly, the segments of the 0111, 1011, 1101, and 1110 microstates all have exactly the same angular size. The probabilities of all microstates belonging to the same macrostate are equal, as assumed when we constructed Equation 2.3. This may not have been completely obvious at the time, but inspection of Figure 2-3 reveals it to be true. If this point seems trivial, one should consider what would happen if each coin flip were not independent of the previous one. Such a dependence could develop if some damage to the bottom side of the coin occurred on each flip and introduced some lack of symmetry. Or one can imagine a game in which a fair coin is flipped, and if it is tails, then it is flipped again, but if it is heads, then a biased coin with 0.51 heads probability is used instead for the next flip, with similar rules leading to a 0.52 heads-bias coin, etc. Such artificial-sounding games in fact often correspond to real processes in Nature, such as biased random walks in which the probability of a step toward the origin depends on the distance from the origin. But we will leave such complications for later chapters. For now, we will pursue the development of probability density functions.

2.3 Probability Density Functions

In the previous section we referred to the continuous distribution of angular positions at which a “wheel of fortune” may come to rest. The usual purpose of such a wheel is to supply a randomization technique for which all values between two limits are equally probable. Such a distribution is called a uniform distribution, and an example is shown in Figure 1-2 (p. 24). That one extends between the limits 3 and 7, and so it is 4 units wide, and its density has a flat value of 1/4, the inverse of its width, so that the total area is 1.0, the required normalization. For a wheel of fortune with its angular orientation measured from the pointer in degrees, the limits would be 0° and 360°, and so the density would be 1/360 in units of probability mass per degree.

The uniform distribution is widely regarded as having the least information of any distribution. Most distributions have a peak somewhere, so that some values are more likely than others, but the uniform distribution has an infinite number of peak values. The phrase “maximum likelihood” has no relevance to a uniform distribution. The only knowledge it conveys about the random variable it governs is the range to which that variable is limited, and in the case of the wheel of fortune, that range constitutes all possible orientations, so absolutely nothing can be ruled out regarding what value a random draw might produce.

To illustrate the properties of a uniform distribution, we consider the following scenario: a man who owns a candy store has promised his son that the boy may have whatever fraction of a dollar exists in each day’s receipts as a daily allowance. For example, at the end of one particular day, the cash register contained $387.12, so the son’s allowance that day was 12¢. On another day, the profits amounted to $434.77, so the boy got 77¢. How much money will the son have after 10 business days? On any given day, the boy may get anything from nothing at all to 99¢, with all values equally likely and therefore completely unpredictable within that range.
But over the course of a number of days, it seems reasonable to assume that the boy will get something in the upper range about as often as something in the lower range, so things should average out a bit toward the middle of the distribution, 99¢/2, or $0.495. We might guess therefore that after 10 days, he should have about
$4.95. This would be correct, i.e., it is the most likely value. But suppose we want to know the probability that he will have at least $6? Since the distribution is uniform, we might guess that there’s a 40% chance that he will accumulate $6 or more at the end of 10 days’ allowance. If so, we would be wrong this time. The correct answer requires us to use a property of random variables that is discussed briefly here and more fully in Appendix B. For now we will take it on faith. The relevant topic is called “functions of random variables”. In this case, the random variables are the 10 uniformly distributed integer values drawn from a set containing the 100 distinct values 0 through 99, and the function of these random variables is the sum over those 10 draws. The boy’s accumulation after 10 payments is the sum S of 10 draws of the uniformly distributed random variable. As such, it is a function of 10 random variables:
S = \sum_{i=1}^{10} U_i, \qquad p_U(U_i) = \begin{cases} \dfrac{1}{100} & \text{for } 0 \le U_i \le 99 \\ 0 & \text{otherwise} \end{cases} \qquad (2.4)
Here U_i is the value resulting from the ith draw. Note that the actual random value is discrete, whereas we are picturing the Uniform distribution as continuous. The continuous range is therefore segmented into 100 equal sections, similar to the way the wheel of fortune in Figure 2-1 is segmented into 16 equal sections. The continuous range is the open interval extending from -0.5 to 99.5. With this definition, any value from the continuous range can be rounded off to obtain an integer from 0 to 99. It is convenient to use an underlying continuous random variable for reasons that will be seen below.

The theory of functions of random variables is an extremely interesting and useful topic, and the reader is strongly encouraged to investigate more complete discussions of it (see especially Papoulis, 1965, Chapters 5 and 7). Appendix B presents a brief summary. The purpose of this discipline is to compute the distribution describing a random variable from the distributions of those on which it has a functional dependence. Simply calculating the mean of a random variable from the means of those on which it depends is usually fairly easy, but to answer such questions as what the son’s chances are of having $6 after 10 days, we must obtain the form of the dependent distribution.

Functional dependences can be almost anything, but one that occurs again and again in scientific studies is the one in the present example: summation of random draws. Linear combination plays a role throughout classical physics, and when the quantities being combined linearly are subject to error, those errors combine linearly along with whatever else is being added up. For example, suppose you want to measure the length of your living room along its north wall, but the only length-measuring device you have is a yardstick. You can lay the yardstick out along the baseboard with one end touching the east wall, make a mark where the other end is, slide the yardstick up so that the first end aligns with the mark, and repeat the process until you reach the west wall, where you will generally have to make a fractional measurement (i.e., less than a yard) from the last mark to that wall. Then you multiply the number of whole yardstick placements by 36 inches, add the fractional measurement, and you have the distance you want in inches. But how far off might your measurement be? Worse yet: what chance is there that your estimate is at least half an inch short?
Assuming for the moment that the yardstick itself is perfect, the errors in play are those made in aligning the yardstick with the marks you made at the leading end, and in reading the fractional distance. Some error is inevitable in these activities, and if you have no reason to doubt that you were a little short about as often as you were a little long, you may model your error with a zero-mean random variable. These errors combine linearly, i.e., the final error is the sum of the contributing errors, and this way of combining errors is extremely common in empirical science. Other functional dependences do occur, but linear combination is the most important because of its ubiquity and its recognizable consequences. So your total error is a function of the individual random errors made along the way, similar in behavior to the deviation of the boy’s 10-day earnings from exactly $4.95.

Given that the yardstick itself is incapable of perfection, there is also a systematic error that occurs at each measurement point. For example, if the yardstick is a hundredth of an inch too long, you will accumulate that much error each time (in addition to your random error), causing a tendency to underestimate the total distance. Usually systematic errors are also modeled with zero-mean random variables, because while they are not “random” in the sense of taking on different values with each measurement, their values are unknown, and our knowledge of such an error involves a zero-mean random variable. If we knew some nonzero value that we could assign to them, we would subtract it off of each measurement, thereby eliminating the nonzero part and reducing the error to zero-mean status. The way that systematic errors are treated in error analysis is not by making them non-zero-mean, but rather by using the fact that they are constants.

Before ending this digression, we should give a brief example of a few functions of random variables that are not summations over random draws. A trivial example is simple scaling: the candy store owner’s daily profits may fluctuate on average by ±$57.34, so they also fluctuate by ±5734¢. This seems too obvious to mention, but in fact these are different random variables. It may seem a little less obvious if we want to know the fluctuation in yen or euros. In fact, suppose that the candy store owner sends one day’s profit per week to his sister in Paris: the exchange rate fluctuates also, and now the number of euros that the sister will receive next week is a function of two random variables, the day’s profit and the exchange rate at the time the transfer is made. In this case, the product of two random variables is involved, and computing the distribution describing various possible amounts to be received by the sister is considerably more complicated than when only sums are involved.

Now we return to the question of what chances the owner’s son has of accumulating at least $6 in 10 days. One of the most important and fundamental results of the theory of functions of random variables is: the density function of a random variable that is the summation of independent random variables is the convolution of the density functions of the independent random variables. Whether this causes a sigh of relief no doubt depends on one’s familiarity with the operation known as convolution.
It is a well-studied operation that plays a crucial role in many areas of mathematics and physics, and if one is unfamiliar with it, that should cause no immediate concern. In fact, the way it enters into our example is an excellent way to be introduced to it. It involves mixing two functions together in a particular way to produce a single function. It is a linear and associative process, which means that to convolve three functions, one can simply convolve any two of them and then convolve the result with the third. This can be extended to convolve any number of functions. The definition of convolution for two functions, f1(x) and f2(x), is:
C(x) = \int_{-\infty}^{\infty} f_1(x - x')\, f_2(x')\, dx' = \int_{-\infty}^{\infty} f_2(x - x')\, f_1(x')\, dx' \qquad (2.5)
In our case, we need to convolve the 10 Uniform distributions to which Equation 2.4 refers. These happen to be identical, but that is not required in general. We note here that we are concerned with convolution integrals (rather than summations) because we are modeling the Uniform distributions as continuous; this too is not required in general. Also, while the integration limits are infinite, the obvious substitutions are made for functions of finite domain, which is what we have in our example. Plugging in the Uniform distributions and cranking out the complete set of convolutions is straightforward but a bit tedious, and intuition is better served by presenting the process visually. Figure 2-4 shows the basic Uniform density function on the left as a rectangle denoted U. The convolution operation is often indicated with the symbol “∗”. Equation 2.5 could have been written simply as C(x) = f1(x) ∗ f2(x), or even just C = f1 ∗ f2. The first convolution of U with itself produces the triangular distribution labeled U ∗ U. If two Uniform distributions of different length had been convolved, this would have been trapezoidal, but the equal lengths result in a triangle. Here we see already that after two days, the son is more likely to have something close to a dollar than any extreme value. The peak of the triangle is actually at 99¢, but it is difficult to read a graph that precisely.
Figure 2-4. The tall rectangle on the left is the Uniform distribution (U) from -0.5 to 99.5. This is convolved with itself once to produce the triangular distribution labeled U ∗ U, where the symbol “∗” indicates convolution. This is convolved with U once again to produce the next distribution to the right, which is labeled U ∗ U ∗ U and shows curvature beginning already to resemble a “Bell Curve”. Another convolution, U ∗ U ∗ U ∗ U, continues this trend, as does the last curve on the right, a convolution of 5 Uniform distributions.
The money that the boy will have after three days is distributed as shown in the next curve to the right, which is labeled U ∗ U ∗ U, because it is the convolution of three of the U distributions. At this point, the straight lines are gone from the shape of the density function, and the tendency to become bell-shaped is quite apparent. Two more convolutions with U are shown, and the bell shape can be seen to be settling in as the domain of the random variable expands. Figure 2-5 shows the result of convolving 10 Uniform distributions as defined above. This is shown as the solid curve. To demonstrate how close this has come to the shape of a true Gaussian distribution (see Figure 1-1, p. 24), the latter is shown as a dashed curve in the same figure. Here the Gaussian has the same mean and standard deviation as the convolved Uniforms. Its peak is slightly higher, and it is slightly narrower at half maximum, but the two curves are so similar that they almost lie on top of each other over most of the range.
Figure 2-5. The solid curve is the result of convolving 10 Uniform distributions such as the one shown as a rectangle in Figure 2-4. The dashed curve is a Gaussian distribution with the same mean and standard deviation as the solid-curve distribution. The two curves are almost identical; the Gaussian peak is slightly higher and its full width at half maximum is slightly less, but clearly the 10 convolved Uniforms have attained a very close approximation to a Gaussian, or Bell, curve.
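The repeated convolution in Figures 2-4 and 2-5 is easy to reproduce numerically. The following sketch (assuming NumPy and SciPy; names are ours) convolves the discrete Uniform density with itself ten times and compares the result to a Gaussian with the same mean and standard deviation; note that the discrete Uniform variance, 833.25 per day, differs very slightly from the continuous value of 833⅓ used in the text.

import numpy as np
from scipy.stats import norm

u = np.full(100, 1 / 100)                 # Uniform over the integers 0..99 cents
density = np.array([1.0])
for _ in range(10):
    density = np.convolve(density, u)     # support grows to 0..990 cents

support = np.arange(density.size)
mean = (support * density).sum()                        # 495
sigma = np.sqrt(((support - mean)**2 * density).sum())
gauss = norm.pdf(support, loc=mean, scale=sigma)
print(mean, sigma, np.abs(density - gauss).max())       # the two curves nearly coincide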
This illustrates one of the most important facts in all of statistical theory: a random variable composed of a summation over numerous other random variables tends to have a distribution very close to the Gaussian shape. There are some conditions that the summed random variables must satisfy for this to happen, but the vast majority of random variables that make good models of physical quantities satisfy these conditions. This is why the Gaussian distribution is so ubiquitous in Nature: most macroscopic objects and processes are composed of numerous smaller objects and processes that are subject to some random variation, and the sum over all these random variations
results in the Gaussian distribution. In general, whenever one observes a Gaussian distribution, one can probably assume that the objects comprising that distribution are each composed of the same sub-objects with random variations that were added together.

Since each day’s payment to the son is independent of the others, the mean and standard deviation of the 10-day convolution of Uniforms are easy to compute. The mean of a Uniform distribution is just the center value, which is obvious but can be computed in the usual way (see Appendix A). The standard deviation is not quite as obvious, but computing it in the manner of section 1.9 (or Appendix A) shows that a Uniform with a half width of L (50¢ in this case) has a standard deviation of L/√3. The variance is the square of the standard deviation, so for U the mean is 49.5¢, and the variance is (50¢)²/3, or 833⅓ ¢². For independent random variables, the mean and variance of the sum are just the sums of the means and variances.
\bar{S} = \sum_{i=1}^{10} \bar{U}_i = 10 \times 49.5 = 495
\sigma_S^2 = \sum_{i=1}^{10} \sigma_{U_i}^2 = 10 \times 833\tfrac{1}{3} = 8333\tfrac{1}{3} \qquad (2.6)
\sigma_S = \sqrt{8333\tfrac{1}{3}} \approx 91.28709

This mean and standard deviation do not depend on the shape of the density functions that were convolved. As long as the random variables are independent, then the summation rules above apply to any shape, and if other distributions have the same means and variances as the U distribution we are using, they would yield the same results for the mean and variance of the sum. So this much is easily obtained from basic statistical principles without any recourse to more advanced theory of functions of random variables, as long as the random variables summed are independent.

But this much does not tell us what the probability is that the boy will have at least $6 after 10 days. For that, we need the cumulative distribution, which does depend on the shape of the final distribution. This is where the Gaussian approximation comes in very handy. The actual formal expression for the convolution of 10 U distributions is extremely tedious, and if possible, we don’t want to bother with it. But we can see that the Gaussian in Figure 2-5 is a good enough approximation to the convolution for our purposes, and we already know the mean and standard deviation from Equation 2.6. These two parameters completely determine this particular Gaussian distribution.

The answer to our question requires the integral of the density function. The advantage of the Gaussian approximation does not stem from any ease of formal manipulation of the integral of its density function; in fact, this integral does not exist in closed form and must be evaluated by numerical quadrature. The advantage is that the Gaussian distribution is so important that tables of its cumulative distribution are easy to come by, and every scientific programming language has built-in functions to supply this information (see Appendix F). What we want to know is: how much probability mass lies at or above 600¢ (actually, 599.5¢), since this is the probability of obtaining a result of at least $6. Alternatively, we could ask: what is the probability that the boy will have less than $6. This is the total probability mass between zero and 599.5¢, so the probability of at least $6 is 1.0 minus the probability of less. The probability mass lying between 0 and 599.5¢ is the integral of the final density function between those limits. This
is what is called the cumulative distribution and commonly indicated with a capital P, here P(599.5¢). The probability mass above the point of interest is commonly indicated with a capital Q. In general, P(x) + Q(x) = 1, where x is any value in the domain of the distribution. So what we are asking is: what is the value of Q(599.5¢)? This question would be difficult to answer using the exact algebraic form of the 10 convolved U distributions, but since the Gaussian approximation is acceptable, we can find in any table of Gaussian Q values (see Appendix F) that the probability of the boy having at least $6 is about 12.6%. Such tables are usually presented in terms of zero-mean unit-variance random variables, so we have to subtract off the mean and divide by the standard deviation, i.e., (599.5 - 495)/91.28709, which is about 1.14474. For the boy to get $6 or more, we must have at least a positive fluctuation of a little more than one standard deviation. This much could have been computed from the numerical results in Equation 2.6, but in order to evaluate Q(599.5¢), the shape of the distribution is required, and fortunately, there were enough random variables in Equations 2.4 and 2.6 to make the convolution of all their density functions acceptably Gaussian.

Although the great technological importance of convolution is beyond the scope of this book, we should mention in passing a few of its most notable aspects, and the reader is encouraged to pursue complete discussions in the ample literature. For one thing, it aids our purposes to develop some intuitive notion of convolution, and this can be done fairly easily by noting that the most common form of image degradation employs convolution of an image with a blurring function called a point-spread function (PSF). For example, the triangular function in Figure 2-4, the one labeled U ∗ U, has a very sharp peak. If we think of the first function with curvature, the one labeled U ∗ U ∗ U, as a PSF, then blurring the triangle with this PSF results in the rightmost function, the one labeled U ∗ U ∗ U ∗ U ∗ U. This last function is clearly quite blurry compared to the triangular function. This blurring is rather extreme, because the size of the PSF is similar to that of the function it is blurring. In typical imaging applications, this is not the case; the function being blurred usually has much more structure, the PSF is usually much smaller than the image, and images are usually two-dimensional. The extension of convolution to two dimensions is straightforward, simply involving a double integral over a pair of two-dimensional functions. A more typical example of an image blurred by a PSF is shown in Figure 2-6. Here the sharp image on the upper right has been blurred by the lower PSF on the left to produce the blurry image on the lower right. This was done artificially, but essentially the same blurry image could have been obtained by taking the original image out of focus.

In practice, the usual problem is that the only image one possesses is the blurry image. If the blurry PSF is known or can be constructed, then the sharp image can be obtained by deconvolution of the blurry PSF and the blurry image (by the same token, if one had the sharp and blurry images, the blurry PSF could be obtained by deconvolution of the two images). In other words, if the result of a convolution is known, and if one of the two functions inside the integral is known, then the other function can be obtained by solution of the integral equation.
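Returning to the allowance question, the Gaussian Q value and a direct simulation cross-check can be obtained in a few lines. This is a sketch only, assuming NumPy and SciPy; the variable names are ours.

import numpy as np
from scipy.stats import norm

mean, sigma = 495.0, np.sqrt(8333 + 1/3)          # from Equation 2.6
z = (599.5 - mean) / sigma                        # ~1.14474
print(norm.sf(z))                                 # Q(599.5¢) ~ 0.126, i.e. about 12.6%

# Cross-check by simulating many 10-day totals directly.
rng = np.random.default_rng(0)
totals = rng.integers(0, 100, size=(200_000, 10)).sum(axis=1)
print((totals >= 600).mean())                     # also close to 0.126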
Figure 2-6. On the left, the point-spread functions (PSFs) of the pictures on the right are shown magnified by a factor of 10 relative to their corresponding pictures. The blurrier PSF is the same everywhere in its image and much broader than the sharp PSF, and so to a high degree of approximation the blurry image is just the sharp image convolved with the blurry PSF. The sharp image may be recovered from the blurry one by deconvolution of the blurry PSF.
In real-world applications, one must attend to some details like edge effects, noise, etc., and so the sharp image is usually not recovered perfectly. But great improvement in spatial resolution is typically obtained. In order for deconvolution to be applicable, one requirement is that the PSF must be invariant over the image, otherwise the degradation is not really a convolution, i.e., the accuracy of the convolution model can be unacceptable. Improvement of the spatial resolution is still generally possible by more elaborate means, but these go beyond deconvolution per se. In astronomical imaging, the invariant PSF is often a good model, because the imaged objects are effectively at infinity, but for terrestrial photography, the blurring is usually depth-dependent. Further remarks would take us too far afield. The point is that in statistics and imaging theory, the operation of convolution can be visualized as a blurring or smoothing process. Even when the problem does not include what one normally thinks of as an image, as long as the two functions being convolved are nonnegative (which is always true of probability density functions), convolution produces a broader result than either of those functions. And almost any function can be graphed, i.e., made into an image, so this visualization is generally applicable.

So a linear combination of independent random variables has a density function that is the convolution of the density functions of the random variables that were combined, and this resultant density function will always be broader and smoother than its contributors, and in most practical applications, it will tend to have the Gaussian bell-like shape. There is a very famous statement of this fact called the Central Limit Theorem. This says that as long as the contributing random variables possess certain properties that are in fact common to the ones usually encountered (e.g., have finite standard deviations), the resultant density function becomes a better and better approximation to a Gaussian with each convolution.

One aspect of potential confusion must be avoided: when we refer to summing two random variables, we are not referring to summing their density functions, but rather actual samples of the random variables themselves as shown in Equation 2.4 (p. 44). The topic of mixtures of distributions does arise, wherein different populations with arbitrary normalizations are added together to produce a linear combination of density functions (see section 2.9). In this case, there is no general tendency toward Gaussian results. In fact, if two distinctly different Gaussian populations are added to produce a mixture, the result is decidedly non-Gaussian, whereas if two random variables drawn from distinct Gaussian distributions are added, their sum has a density function which is exactly Gaussian.

2.4 The Gaussian Distribution

The algebraic form of the general one-dimensional Gaussian density function is:
P(x) = \frac{e^{-(x - \bar{x})^2 / 2\sigma^2}}{\sqrt{2\pi\sigma^2}} \qquad (2.7)
where the random variable x has the mean x̄ and standard deviation σ. This form is found to arise asymptotically under many different circumstances. A very common one is that discussed in the previous section, namely summation over independent random variables. But the Gaussian arises
in numerous other ways. For example, the binomial distribution in Equation 2.3 (p. 39) approaches a Gaussian distribution with a mean of np and a variance of np(1-p) as n approaches infinity.

Equation 2.7 may look a bit complicated at first sight, because it shows the Gaussian in its most general one-dimensional form. This expression shows that a Gaussian is uniquely determined by the values of its mean and variance. It is very common in practice to encounter Gaussians with zero mean and unit variance, since all Gaussian random variables can be shifted and scaled to that form, and then tables and computer subroutines can be standardized easily.

The Gaussian distribution is also known as the “Normal Distribution”, and many authors prefer that name, but since the word “normal” can be mistaken to mean things like “ordinary” and “perpendicular”, we will use the name “Gaussian” herein. But one should be familiar with the notation “N(0,1)”, which means “Normal distribution with zero mean and unit variance”, and then more general values for the two arguments will be recognized. An impressively simple special case is

N\!\left(0, \tfrac{1}{2}\right) = \frac{e^{-x^2}}{\sqrt{\pi}} \qquad (2.8)
Gaussians tend to be self-replicating. As mentioned above, if two independent Gaussian random variables are added together, the sum is a random variable with a Gaussian distribution, without any approximation being involved. This may be seen by substituting two functions with the form of Equation 2.7 into the integral in Equation 2.5 and carrying out the integration. In this case, even the unit normalization is preserved. The simple product of two Gaussian density functions has the form of a Gaussian except for the normalization. As we will see in Chapter 4, the optimal combination of the information in two measurements with Gaussian errors results in another Gaussian that is narrower than either measurement’s Gaussian. The Fourier Transform of a Gaussian density function is another Gaussian function. Familiarity with the Fourier Transform is not essential here. We mention it only to note in passing that it is intimately related to convolution. Specifically, the convolution of two functions can be accomplished by first taking the Fourier Transform of each, multiplying the two results, and then taking the inverse Fourier Transform. If the two functions are Gaussian, then so are the others at each point along the way.

2.5 Gaussian Relatives

A number of distributions have arisen which are derived from the Gaussian and are thereby closely related to it. For example, at the core of the Rayleigh Distribution is a two-dimensional Gaussian, except that the random variable is expressed as a radial distance in two dimensions, each axis of which is the domain of a zero-mean Gaussian random variable with the same variance:
r = \sqrt{x^2 + y^2}
p_x(x) = \frac{e^{-x^2/2\sigma^2}}{\sqrt{2\pi}\,\sigma}, \qquad p_y(y) = \frac{e^{-y^2/2\sigma^2}}{\sqrt{2\pi}\,\sigma}
p_r(r) = \frac{r\,e^{-r^2/2\sigma^2}}{\sigma^2} \qquad (2.9)
The density function for r is shown on the last line. It has some resemblance to a Gaussian except for the factor of r in front of the exponential. This distribution was invented by Lord Rayleigh to quantify what he called “The Drunkard’s Walk”, an early manifestation of what is now called a random walk. We encountered this notion in sections 1.1 and 1.12, where the step size was constant and the walk was one-dimensional. The Rayleigh distribution describes the two-dimensional random walk and allows variable step sizes. This distribution and its generalization to three dimensions are important in theories of diffusion.

There are too many Gaussian relatives to discuss exhaustively, but an extremely important one that should not be passed over is that of the chi-square random variable, χ². This is really a family of distributions, each specific member being characterized by its number of degrees of freedom. The random variable is defined as:
\chi_N^2 = \sum_{i=1}^{N} \frac{(x_i - \bar{x}_i)^2}{\sigma_i^2} \qquad (2.10)
where the x_i are all independent Gaussian random variables, and N is the number of degrees of freedom, i.e., the number of zero-mean unit-variance Gaussian random variables squared and summed. There is a more general definition for the case in which the x_i are correlated. This involves a complete N×N covariance matrix, and we will not explore those details here. Each chi-square random variable has a density function whose form depends on N. Chi-square is used extensively for judging statistical significance of suspected data outliers, and it provides the core of the formalism for data curve-fitting that minimizes deviations between the curve and the data points with higher weighting given to more accurate data points. The technical discussion of this topic will be relegated to Appendix D, wherein the case of correlated random variables is included. While keeping the number of degrees of freedom completely general requires dealing with a nontrivial expression if the density function is needed for the purpose at hand (see Appendix E for details), several very simple formulas do exist. Probably the most useful of these are the ones for the mean and the variance, both of which depend on N in a very simple way:
\overline{\chi_N^2} = N, \qquad \sigma_{\chi_N^2}^2 = 2N \qquad (2.11)
Very often, these are all one needs, because chi-square, like so many other random variables, approaches a Gaussian as N increases. Just how large N needs to be in order for the approximation to be acceptable depends on the application, but frequently a judgment of statistical significance of an observed fluctuation can be made on the basis of nothing more than the distance from the mean in units of standard deviation. For example, suppose a polynomial y = f(x) of order P has been fit to M noisy data points y_i, each with an uncertainty σ_i and an associated noiseless abscissa value x_i (in this context, “noisy” does not necessarily mean seriously degraded, only that at least some fuzziness must be taken into account). Since the curve was derived from the data, the number of degrees of freedom drops by the number of polynomial coefficients, P+1, and so N = M-P-1. A summation similar to Equation 2.10 is performed, except that x_i is replaced by y_i, and x̄_i by f(x_i), and the summation is from 1 to M:
\chi_N^2 = \sum_{i=1}^{M} \frac{\left(y_i - f(x_i)\right)^2}{\sigma_i^2} \qquad (2.12)
The absolute deviation from the mean in units of standard deviation is:
\frac{\left|\chi_N^2 - N\right|}{\sqrt{2N}} \qquad (2.13)
If this has a value of less than 2, for example, then probably nothing is out of the ordinary. At 3, things get a bit shaky. If it is greater than 5, for example, then something is probably wrong. If the fluctuation is negative, it probably means that the data uncertainties, σ_i, are overestimated. A positive fluctuation may mean that the polynomial doesn’t really describe the ideal behavior of the data very well, or the uncertainties are underestimated, or there is an outlier in the data set. Greater than 10 indicates essential certainty that the fit or the data or both have a serious problem. A value between 3 and 5 indicates the need for much closer inspection; things are almost plausible but not quite. A search for the maximum deviation from the curve may reveal an outlier, or a different polynomial order may fare better.

In section 1.9 we mentioned that the statistical significance of a fluctuation depends on the form of the distribution. The example was given that a random draw will be within ±1σ of the mean about 68% of the time for a Gaussian but only about 58% for a Uniform distribution. What about the chi-square above? It does matter. If N is fairly large, it will behave very much like a Gaussian. If N is quite small, say 5, it can be noticeably different; in that case, the mean would be 5 and the standard deviation would be √10 ≈ 3.1623, and a negative fluctuation of 2 standard deviations is not even possible. For a small number of degrees of freedom, there is often no recourse to simple approximations, and the rigorous chi-square formalism must be applied. Fortunately, in practice, the most typical cases involve many degrees of freedom.

One other remark is needed on this subject: the chi-square is defined in terms of Gaussian random variables, but all we said about the y_i was that they were “noisy”. Technically, their noise must be Gaussian-distributed for the chi-square formalism to be applicable. The ubiquity of Gaussian distributions usually satisfies this need sufficiently well, and the methodology is surprisingly robust even when the contributing random variables are individually non-Gaussian to a noticeable extent.
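The polynomial-fit test described above is easy to put into practice. The sketch below assumes NumPy and uses invented data purely for illustration: a quadratic with known 1σ uncertainties is fit with weighted least squares, and the chi-square fluctuation of Equation 2.13 is evaluated.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)                      # M = 50 noiseless abscissa values
sigma = np.full_like(x, 0.3)                    # known 1-sigma uncertainties
y = 1.0 + 0.5 * x - 0.02 * x**2 + rng.normal(0, sigma)   # noisy data (invented)

P = 2                                           # polynomial order
coeffs = np.polyfit(x, y, P, w=1/sigma)         # weighted least-squares fit
chi2 = np.sum(((y - np.polyval(coeffs, x)) / sigma) ** 2)   # Equation 2.12
N = len(x) - (P + 1)                            # degrees of freedom, N = M - P - 1
print(abs(chi2 - N) / np.sqrt(2 * N))           # Equation 2.13; usually below ~2 when fit and sigmas are sound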
But some thought should always be given to the question of whether the noise is strongly non-Gaussian. If so, special care must be taken. One approach that usually succeeds is to probe the situation with Monte Carlo methods.

2.6 Products and Ratios of Gaussian Random Variables

While we will not be able to list every kind of random-variable distribution herein, we will mention a few more that are especially important in the sciences. As stated earlier, the most common way that random variables combine is summation, which for independent random variables produces a distribution whose density function is a convolution of the density functions of the summed variables. The difference of two random variables is a trivial extension of summation, requiring only an intermediate random variable defined as the negation of the one subtracted. This is just a scale factor which happens to be negative, and it is handled algebraically in the same way as converting the result of a random spin of a wheel of fortune from degrees to radians, for example. So sums and differences of independent random variables lead to convolutions of density functions. We now consider the other two of the “four functions” of simple arithmetic, division and multiplication.

We define the random variable z to be the ratio of two random variables, x and y, i.e., z = x/y. Given the density functions for x and y, we can compute the density function for z by applying the theory of functions of random variables (see Appendix B for a brief summary). The distributions of x and y could be anything, but our interest here is the case in which they are both Gaussian. In this case, z is a rather badly behaved random variable. The reason for this may be seen by considering the fact that, unless otherwise restricted, the domain of a Gaussian random variable is the entire real line. Although the density function far from the mean drops off rapidly, it remains greater than zero for arbitrarily large finite distances from the mean. Assuming that the mean is finite, the density function for y is not quite zero at the point y = 0. Since y occurs in the denominator, evaluating z at this point involves division by zero. Because x also spans zero, the singularity in z may arise with either sign. The mere existence of this singularity has dire consequences for integrals of the sort that define moments of the distribution (e.g., mean, variance, skewness, etc.).

The z distribution for the general case is more complicated than we need to consider here. An interesting special case arises when both Gaussians have zero mean. In this case, z follows what is known as the “Cauchy” distribution. In the theory of atomic spectroscopy, it is known as the “Lorentz” distribution. If both Gaussians also have unit variance, it is what is known as the “Student t” distribution with one degree of freedom. We want to draw attention to this distribution here, because among those actually encountered in scientific applications, it is the most pathological, the most outstanding exception to the Central Limit Theorem. The general Cauchy density function has the form
p(z) = \frac{b/\pi}{(z - z_0)^2 + b^2}
b = \frac{\sigma_x}{\sigma_y} \qquad (2.14)
where z_0 is called the “location parameter”, and b is the half width at half maximum, with the
relationship shown on the second line if z arises from a ratio of two zero-mean Gaussian random variables, in which case z_0 = 0. Since this distribution arises in ways other than the Gaussian ratio, z_0 is generally not zero; for example, in atomic spectroscopy, it is the central wavelength of a spectral line. This density function itself has a finite integral over the real line, as may be seen from the fact that it is normalized. But all moment integrals higher than zeroth diverge, and so no mean can be computed, and certainly no variance or any higher moment. That means that the random variable does not qualify for the Central Limit Theorem, and therefore if a summation of random variables contains even one Cauchy random variable, the summation itself does not tend to be Gaussian. In practice, such a summation may be sufficiently well approximated as Gaussian, since it is common for the domain of the Cauchy distribution to be forcibly truncated, but technically the Central Limit Theorem provides no guarantees.

The “location parameter” turns out to be the median and the mode of the distribution. These parameters do not depend on moment integrals. When the ratio of non-zero-mean Gaussians is considered, the density function may be bimodal, i.e., have two peaks. The Cauchy itself has only one peak at the center of a disarmingly bell-shaped curve. Figure 2-7 shows a Cauchy density function with z_0 = 10 and b = 1 (solid curve). By itself, it might be taken to be a Gaussian density function from its appearance. A real Gaussian is shown with the same peak value (dashed curve). Since the Cauchy has no mean or variance, it is not possible to compare it to a Gaussian with the same ones, so instead the Gaussian is chosen to have a mean at z_0 and the same peak, hence the Gaussian has a mean of 10 and a standard deviation of √(π/2). The wings of the Cauchy are obviously more persistent than the Gaussian’s, and although it may appear that they are approaching zero with distance from z_0, in fact they are not doing it fast enough to permit first or higher moment integrals to exist. So some caution must be exercised when dealing with functions of random variables, especially regarding assumptions about the Central Limit Theorem providing protection against pathological behavior.
Figure 2-7. The solid curve is a Cauchy density function with b = 1 and z0 = 10. The dashed curve is a Gaussian density function with the same peak as the Cauchy, 1/π, and mean = 10, standard deviation = √(π/2). The two curves are both "bell-shaped", but the Cauchy has much more persistent wings.
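The pathology is easy to see empirically. The following minimal sketch (Python with NumPy, our own choice of tool rather than anything used in the text) draws a large sample of ratios of two zero-mean unit-variance Gaussians and shows that the running sample mean never settles down the way it would for a well-behaved distribution:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, 1_000_000)   # numerator,   sigma_x = 1
    y = rng.normal(0.0, 1.0, 1_000_000)   # denominator, sigma_y = 1
    z = x / y                             # Cauchy-distributed, b = sigma_x/sigma_y = 1

    # Running mean over ever-larger subsamples.  For a Gaussian this would
    # converge toward the true mean like 1/sqrt(n); for the Cauchy ratio it
    # keeps lurching whenever a denominator lands near zero.
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, z[:n].mean())

Occasional enormous values of z, produced whenever y happens to fall near zero, dominate the average no matter how many samples are accumulated.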
Now we turn to the product of two independent Gaussian random variables x and y. We can simplify matters by taking both to be zero-mean; to see why no generality is lost, first allow them to be non-zero-mean. Any such random variable may be written as the sum of a constant and a zero-mean random variable. This is one of the more obvious principles in the theory of functions of random variables. So we can write x and y as:
x = \bar{x} + x', \qquad y = \bar{y} + y'
z = xy = (\bar{x} + x')(\bar{y} + y') = \bar{x}\bar{y} + \bar{x}y' + \bar{y}x' + x'y'    (2.15)

where the primed variables are zero-mean counterparts of the unprimed variables. The product z consists of the sum of a constant, two rescaled zero-mean Gaussian random variables, and a product of two zero-mean Gaussian random variables. The first three terms contribute nothing new; only the last term is really of interest. So we might as well just take x and y to be zero-mean from the outset, and that is what we will do, dropping the primes and bearing in mind that the more general case must include not only that fourth term but also the other three easily computed terms. The density function for z is derived in Appendix B as an example of a less trivial application of the theory of functions of random variables. The result is that the density function involves a modified Bessel function of the second kind and order zero:
p_z(z) = \frac{1}{\pi\,\sigma_x\,\sigma_y}\,K_0\!\left(\frac{|z|}{\sigma_x\,\sigma_y}\right)    (2.16)

where K0 is the n = 0 case of

K_n(u) = \int_0^{\infty} e^{-u\cosh t}\,\cosh(nt)\,dt    (2.17)

The mean and standard deviation can be computed by using this density function in the corresponding moment integrals, but since x and y are independent, we can compute them the easier way:

\bar{z} = \langle xy\rangle = \langle x\rangle\,\langle y\rangle = 0
\sigma_z^2 = \langle (z - \bar{z})^2\rangle = \langle x^2 y^2\rangle = \langle x^2\rangle\,\langle y^2\rangle = \sigma_x^2\,\sigma_y^2    (2.18)

Equations 2.16 and 2.17 show that z is definitely not a Gaussian random variable, so it is not surprising that p_z(z) has a very different shape from a Gaussian. This may be seen in Figure 2-8, where p_z(z) is shown for the case in which the two Gaussian random variables have unit variance. In this case, z is also a zero-mean unit-variance random variable, as Equation 2.18 indicates, but its density function is very much more peaked than the zero-mean unit-variance Gaussian that is also shown as the dashed curve. In fact, p_z(0) is singular, so one cannot actually quote a peak value, other than
to speak loosely and say that it is infinite. Nevertheless, this density function is otherwise very well behaved, possessing all finite moments. This is not generally true of integrals whose integrands contain singularities, but this case is one of the better behaved ones. So it can happen that a random variable whose density function has a singularity nevertheless satisfies the requirements of the Central Limit Theorem. Note that this case of a singularity is different from that of the Cauchy distribution: there it is the random variable itself that blows up, because the denominator's density is nonzero at zero, while here it is the density function that is singular, and only at the mean of the random variable.
Figure 2-8. The solid curve is the density function for the product of two independent zero-mean unit-variance Gaussian random variables, a modified Bessel function of the second kind and order zero; it is actually singular at zero. The random variable with this density function also has a mean of zero and unit variance. The dashed curve is a zero-mean unit-variance Gaussian density function.
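A quick simulation makes the contrast concrete. The sketch below (again Python with NumPy, our own illustration rather than anything from the text) confirms that the product of two independent unit-variance Gaussians has mean near zero and variance near one, exactly as Equation 2.18 requires, while piling up far more probability near z = 0 than a unit Gaussian does:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=1_000_000)
    y = rng.normal(size=1_000_000)
    z = x * y                                  # product of two standard normals

    print(z.mean(), z.var())                   # both should be close to 0 and 1

    # Fraction of samples within 0.1 of zero: much larger for the product than
    # for a unit-variance Gaussian, reflecting the sharp K0 peak at z = 0.
    g = rng.normal(size=1_000_000)
    print(np.mean(np.abs(z) < 0.1), np.mean(np.abs(g) < 0.1))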
2.7 The Poisson Distribution

Our sampling of a few of the most important random variables encountered in the sciences will end with the one described by the Poisson distribution, which provides an excellent model for fluctuations in many naturally occurring patterns, such as the number of stars in a region of sky, the number of clicks per second of a Geiger counter, and the number of envelopes delivered per day to a mailbox. In each of these cases, the fluctuation statistics depend on the average value, which may vary with time or location. For example, the average number of envelopes per day in a mailbox may go up shortly before Christmas, the average number of stars per square degree increases with proximity to the Galactic plane, etc. But if a local average value can be estimated for a given countable quantity, then the Poisson distribution is often an excellent model with which to estimate the probability of various fluctuations in that quantity over a set of observations.
The form of the Poisson distribution can be derived in a variety of ways, the simplest of which makes use of the binomial distribution (Equation 2.3, p. 39). In the terminology of section 2.2, when the probability of heads p is related to the number of coin tosses n according to p = λ/n, where λ is a finite constant greater than zero, then in the limit as n approaches infinity, the binomial distribution approaches
P(k, \lambda) = \frac{e^{-\lambda}\,\lambda^k}{k!}    (2.19)
This is the Poisson probability mass function. The mean and variance are both equal to λ. For large λ, the shape approaches that of a Gaussian, and it is often convenient to use a Gaussian instead, simply keeping the mean equal to the variance, and observing the usual caveats about using a continuous density function to describe a discrete random variable. In this particular case, there are two other caveats: (a.) the Gaussian probability mass for negative numbers must be negligible if the approximation is to be acceptable; (b.) if a refined estimate is obtained from two such measurements using the Gaussian approximation, since the latter involves inverse-variance weights, this amounts to inverse-mean weights, and a bias against larger measurement outcomes develops. Usually, if the Gaussian approximation to a Poisson is acceptable, the bias is negligible, but a correction for it can be applied if desired, namely to add 1 to the refined Gaussian mean. If the two Poisson measurement errors are from the same population, then unweighted averaging should be used in the first place, since that is the unbiased estimator of the mean of a Poisson population.

Here we have assumed that we can take the uncertainty in a measurement dominated by Poisson noise to be the square root of the measurement value, since the variance in a Poisson population is equal to the mean. This assumes that the measured value is equal to the mean of the population from which the measurement is a random draw. This is clearly not true in general, but without any further information, it is the only approximation available for estimating the uncertainty, and so it is commonly used. The result is that S/N, the estimated signal-to-noise ratio for the measurement, is the square root of the measurement itself. In practice, it frequently happens that Poisson noise is diluted by such things as quantum efficiency scale factors and contributions from non-Poisson noise sources, so that the Gaussian approximation is often even better justified.

The Poisson distribution is most interesting for relatively small values of λ, because there its own distinctive (non-Gaussian) qualities show clearly. For example, suppose that the average number of envelopes delivered to a given mailbox each day is 7. What is the probability that fewer than 4 envelopes will be delivered today? This is the sum of the Poisson probabilities for 0 through 3, and Equation 2.19 can be used to compute the answer: 8.2%. So this should happen about every two weeks. Suppose one opens the mailbox and finds nothing at all; what is the probability that the day's delivery has not yet been made? Assuming that the only possibilities are that the postal carrier has not yet arrived or that there simply was no mail that day, we compute the probability of the latter and find that P(0,7) = 9.12×10⁻⁴, indicating a very strong probability that the postal carrier has not yet come.

Sometimes the Poisson distribution is called the "law of rare events", because it describes events which are very unlikely to happen but have a large number of chances to do so. In the mailbox case, this means that on any given day there is a very large collection of sources of envelopes that could arrive in the mailbox, but the probability that any given one of them will produce an envelope destined for that mailbox on that day is very small.
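The mailbox numbers quoted above are easy to verify directly from Equation 2.19; a minimal check in plain Python (our own illustration) follows:

    from math import exp, factorial

    def poisson(k, lam):
        # Equation 2.19: P(k, lambda) = exp(-lambda) * lambda**k / k!
        return exp(-lam) * lam**k / factorial(k)

    lam = 7.0
    print(sum(poisson(k, lam) for k in range(4)))   # P(fewer than 4) ~ 0.082
    print(poisson(0, lam))                          # P(0, 7) ~ 9.12e-4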
The presence of λ in the exponential in Equation 2.19 is a reminder that the Poisson distribution applies only to dimensionless quantities, counts of things. In science, one cannot exponentiate volts or grams or meters per second. For example, consider the following situation: many golfers practice at driving ranges where the golf balls are retrieved by a cart equipped with a brush and roller mechanism that sweeps up the balls and drops them in a large bucket as the cart is driven over the range with a cage protecting the driver as golfers continue to hit balls. When the cart has completed a sweep of the range, it drops the bucket off at a shed, and the next sweep begins. When all the tees are in use by the same golfers for an extended period of time, the number of balls retrieved on each sweep tends to be more or less the same, but with some random variation that is well modeled as Poisson. Suppose we want to estimate the average number of balls per sweep, but we have only enough time to count the balls in one bucket, and this number is 625. Then 625 is our estimate, the Poisson model assigns a 1-σ uncertainty of √625 = 25, and the S/N for our estimate is 25. If each golf ball weighs 1.6 ounces, then our estimate for the average mass of the contents of the bucket is 1000 ounces. Since this quantity experiences bucket-to-bucket fluctuations that are due to a Poisson process, we might conclude (erroneously) that its uncertainty is √1000 ≈ 31.6, so that its S/N is 31.6, somewhat better than that of the number of balls. In fact, if we want even better S/N, we should measure the mass in grams, 45.4 per ball. Now the estimated average in a bucket is 28375 grams, with an S/N of 168.4! But the S/N of a measurement shouldn't depend on the units employed.

The fallacy in the example above is that variables with physical dimensions (in this case, mass) were treated as Poisson random variables. We implicitly exponentiated grams and ounces. The Poisson model can be applied only to dimensionless variables. The Poisson mean and the variance must have the same units, and this implies no units at all. The uncertainty in the average mass in a bucket must be computed from the uncertainty in the number of balls, 25, hence 40 ounces or 1135 grams. Surprisingly, this sort of error has actually been known to occur, but vigilant referees generally keep it out of the literature.

2.8 Bayes' Theorem

A very powerful and sometimes controversial theorem in probability theory is that known as Bayes' Theorem, attributed to the Rev. Thomas Bayes and published posthumously in 1764. The theorem itself is not controversial, but some applications of it are. Perhaps it is appropriate that some controversy should accompany a theorem so critical to the foundation of the subjectivist or "measure of belief" interpretation of probability (see section 1.5) and credited to a clergyman. Lest this remark seem impertinent, it should be noted that Bayes' own definition of probability has a certain frequentist ring to it, and as one whose primary responsibility was theology, he was certainly justified in bringing any available tool of knowledge to bear on religious issues. He clearly possessed some insight and imagination in turning the rather recently developing theory of probability from the problem of predicting outcomes toward the question of determining the conditions that had caused observed outcomes.
Thus his theorem lies at the heart of decision theory, one of whose applications is to estimate the probabilities of various possible unobserved causes that led to observed events. An essential aspect of Bayes' Theorem is the role played by conditional probabilities, i.e., the probability of some event given that some other event has occurred, at least hypothetically. If the
latter event is believed to have occurred, then it is called prior knowledge, and Bayesian estimation involves computing probabilities based on this prior knowledge. It can happen that the meaning of the event that is believed to have occurred is subject to interpretation, with the result that conclusions obtained via Bayesian estimation may be controversial. What some people consider prior knowledge may be considered prior opinion by other people. But some problems are best solved with the use of Bayesian estimation, since it does provide a valid formalism for including previously established facts in probabilistic analyses.

As it happens, there are several variations in the form of the theorem, depending on how many events it relates, whether probabilities of both occurrence and nonoccurrence are included, and whether continuous or discrete random variables are involved, to name a few. Since we are merely illustrating the basic idea here, we will use the simplest possible situation: A and B represent two separate events. P(A) and P(B) are the respective prior probabilities of these events taking place (i.e., probabilities that apply before it is known whether they took place), P(A|B) is the probability of A taking place given that B took place. Similarly, P(B|A) is the probability of B taking place given that A took place. These last two are also known as conditional probabilities or posterior probabilities. The probability of both A and B taking place is the joint probability and is denoted P(A∩B). If A and B are independent, the joint probability is simply the product of the prior probabilities, P(A)×P(B), and Bayes' Theorem becomes trivial. The interest is in the case where A and B are not statistically independent, because then knowledge of one implies some knowledge of the other. For example, if A and B are positively correlated, then A becomes more likely if it is known that B took place.

Rather than derive the theorem rigorously, we will illustrate its origin qualitatively using the simplest possible events, fair coin flips. Let A be defined as "two heads after two flips", and let B be defined as "heads on the first flip". Using the same binary notation for coin-flip outcomes as in section 2.2, the four possible outcomes after two flips are 00, 01, 10, and 11, with 1 representing heads and 0 representing tails. The four outcomes are all equally likely, and only one of them satisfies the definition of A, and so P(A) = ¼. Two of the four outcomes satisfy the definition of B, so P(B) = ½. Noting that A cannot happen without B, the probability of both happening is the same as the probability of A, and so P(A∩B) = ¼, i.e., only one of the four outcomes corresponds to both A and B.

Now we consider the case when the first flip has yielded heads and the second flip has not yet been made. The possibility of A happening is still viable, and its probability has changed to the conditional or posterior probability P(A|B). Obviously this must be ½, since it depends on a single fair-coin flip. This is expressed as P(A|B) = P(A∩B)/P(B) = ¼ / ½ = ½. This is how a conditional probability is obtained from a joint probability and the prior probability of an event that has occurred. We note in passing that for cases in which A and B are independent, the joint probability is just P(A∩B) = P(A)×P(B), and the conditional probability of A is the same as the prior probability, since P(A|B) = P(A∩B)/P(B) = P(A)×P(B)/P(B) = P(A).
It is also of interest to note that in such a case, given different definitions of A and B but the same prior probability values as those in the example above, the joint probability would be P(A)×P(B) = ¼ × ½ = ⅛ rather than the ¼ found above, so we see that the joint probability can depend very strongly on whether the two events are correlated. Although the case of independent events is trivial in this context, it does shed some light on how conditional probability depends on joint and prior probability. Even when the joint probability P(A∩B) does not factor into the product of prior probabilities P(A) and P(B), the same mechanism of dividing by the prior probability of the observed event is appropriate.
There is nothing special about the order in which we treated A and B. It is also true that P(B|A) = P(B∩A)/P(A) = P(A∩B)/P(A), which for the case of the coin-flip outcomes defined above becomes P(B|A) = ¼ / ¼ = 1, i.e., people not paying attention to the first coin toss can still be certain that B occurred if it is observed that A occurred. From this symmetry, it follows that the joint probability is both P(A∩B) = P(A|B)×P(B) and P(A∩B) = P(B|A)×P(A). Therefore the two right-hand sides are equal: P(A|B)×P(B) = P(B|A)×P(A). From here, it is only one more step to Bayes' Theorem, namely to divide by one of the prior probabilities, for which we arbitrarily select P(B):
P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}    (2.20)
Obviously P(B) must not be zero, but in the context of B having been observed to occur, clearly it must have nonzero probability.

With its focus on conditional probability, Bayes' Theorem provides a useful frame of reference for considering the mythical "Law of Averages" that was discussed in section 1.12. For example, now we are better equipped to consider the question of whether a baseball player who has a batting average of .300 but has gone hitless in his last 10 at-bats is "due for a hit". He may also be "due for another out". This depends on whether the joint probability of not getting a hit in the last ten at-bats and getting one in the next at-bat factors into two prior probabilities. If so, then the idea that he is due for anything other than a typical at-bat is fallacious, since we saw above that the posterior and prior probabilities are equal when the relevant events are independent and hence uncorrelated. If they are correlated, then he is "due" for something, depending on the sign of the correlation, assuming that "due" means an increased probability. The fact that the consideration of such correlations is usually omitted by sports announcers is why the spurious misconception of a "Law of Averages" is perpetuated.

It may be that some kind of perceived correlation is implicit, however. If the longer the batter goes without a hit, the greater his desire for a hit becomes, and this provides motivation and energy and better focus on the ball that was lacking in previous at-bats, then the probability of getting a hit may well be greater than his prior average of 0.3, i.e., successive at-bat outcomes would be negatively correlated, with a run of failures raising the probability of a subsequent success. On the other hand, it may be that the batter has a fragile ego that reacts to a series of failures with reduced confidence and a self-fulfilling belief that failures will continue, in which case his slump is more likely to continue. The possibility that he is ill and getting worse, or has been ill but is getting better, could also induce correlations. But without good scientific data, the only acceptable assumption is that the at-bats are uncorrelated, and the "Law of Averages" is exposed as having nothing useful to say. Even so, subjective issues of an analogous nature do sometimes arise in the formulation of conditional probabilities going into applications of Bayes' Theorem, and differing opinions about them create the associated controversy.

We will end this brief discussion of Bayes' Theorem with a simple example. A certain pilot training school has a graduation rate of 75%. One of the questions on the entrance survey is whether the student has ever driven a motorcycle. Statistical studies over many years' experience have shown that 65% of all graduating students answered yes to that question. The percentage of all incoming students who answer yes to that question is 50%. One of the new students in the current class is Alexander, who answered yes to the question. Assuming that all students answered truthfully, what is the probability that Alexander will graduate?
We will use G to denote the event "Graduated from the pilot school", and M to denote "has driven a Motorcycle". Applied to the students in the school, we have P(G) = 0.75 and P(M) = 0.5. We know that if a student has graduated, the probability of that student having driven a motorcycle is 65%, and so we have P(M|G) = 0.65. We know that Alexander has driven a motorcycle, so the probability that he will graduate is P(G|M) = P(M|G)×P(G)/P(M) = 0.65×0.75/0.5 = 0.975. Therefore we estimate the probability that Alexander will graduate to be 97.5%, quite a bit higher than the 75% we would have estimated if we had known only that he was in that pilot school without knowing whether he had driven a motorcycle.

Another student in the same class, Bartholomew, has never driven a motorcycle. What is the probability that he will graduate? Using an overline to indicate "not", the probability that a student did not drive a motorcycle given that the student graduated is P(M̄|G) = 1 − P(M|G) = 0.35. The same rules apply to these conditional probabilities, so we have P(G|M̄) = P(M̄|G)×P(G)/P(M̄). Since the prior probability of driving a motorcycle is 50%, the prior probability of not having done so is the same, 1 − 0.5, and so P(G|M̄) = 0.35×0.75/0.5 = 0.525. So Bartholomew has a 52.5% probability of graduating, not much better than even money.

This illustrates the use of Bayes' Theorem to refine predictions, specifically whether Alexander and Bartholomew will graduate. But we mentioned that this theorem is also employed in estimating probabilities of unobserved events that may have led up to an observed event. For example, suppose that instead of knowing that 65% of all graduating students are motorcyclists, we know that motorcycle drivers have a 97.5% probability of graduating. We also know that Charles, another student in the same pilot school, has graduated, but we don't know whether he has ever driven a motorcycle. In that case, given the same prior probabilities as before, P(G) = 0.75 and P(M) = 0.5, we could estimate the probability that Charles has driven a motorcycle given that he graduated as P(M|G) = P(G|M)×P(M)/P(G) = 0.975×0.5/0.75 = 0.65. Thus some people might say "given that Charles graduated, the odds that he has driven a motorcycle are about 2-to-1, and that probably contributed to the fact that he graduated." Here we can see the first hints of why some dissension about statements founded on Bayesian estimation can arise. The fact that Charles has a probability well above 50% of being a motorcycle rider does not establish that he is a motorcycle rider. For pilot school and motorcycle riding, substitute more contentious propositions, and the subjective elements can become more pronounced.
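The arithmetic in these three predictions is just repeated use of Equation 2.20; a small helper function in Python (our own illustration, with names of our choosing) makes that explicit:

    def posterior(p_b_given_a, p_a, p_b):
        # Equation 2.20: P(A|B) = P(B|A) * P(A) / P(B)
        return p_b_given_a * p_a / p_b

    P_G, P_M = 0.75, 0.50          # priors: graduating, having driven a motorcycle
    P_M_given_G = 0.65             # fraction of graduates who have driven a motorcycle

    print(posterior(P_M_given_G, P_G, P_M))            # Alexander:   0.975
    print(posterior(1 - P_M_given_G, P_G, 1 - P_M))    # Bartholomew: 0.525
    print(posterior(0.975, P_M, P_G))                  # Charles:     0.65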
2.9 Population Mixtures

In section 2.3 we discussed the density function for a random variable formed by the addition of two other independent random variables, mentioning that the resulting density function is the convolution of the two density functions for the two random variables that were summed. We also added a caveat: sometimes people confuse adding two random variables with adding two density functions (with subsequent renormalization), or equivalently, mixing together two populations with distinct distributions. The resulting population is called heteroskedastic. As we saw in Figure 2-4 (p. 46), adding uniformly distributed random variables results in a new random variable that becomes more and more Gaussian as more uniformly distributed random variables are included in the summation. If anything, the opposite effect tends to take place when mixing two whole populations: mixing two different Gaussian populations generally results in a very
non-Gaussian population. For example, suppose that a warehouse contains one million Florida oranges whose diameter distribution is Gaussian with a mean of 2.2 inches and a standard deviation of 0.1 inch. Another warehouse contains one million California oranges whose Gaussian distribution of diameter has a mean of 2.5 inches and a standard deviation of 0.2 inch. When loaded for shipment, the two groups are accidentally mixed together into one population of two million oranges. What sort of diameter distribution results?
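Before looking at the exact answer, it is instructive to simulate the accident. The sketch below (Python with NumPy, our own illustration) draws the two warehouse populations, pools them, and computes the moments of the pooled sample; the numbers it prints anticipate the values quoted below for Figure 2-9:

    import numpy as np

    rng = np.random.default_rng(3)
    florida    = rng.normal(2.2, 0.1, 1_000_000)      # diameters in inches
    california = rng.normal(2.5, 0.2, 1_000_000)
    mixture = np.concatenate([florida, california])   # the accidental 50/50 mix

    m = mixture.mean()
    s = mixture.std()
    skew = np.mean((mixture - m)**3) / s**3
    kurt = np.mean((mixture - m)**4) / s**4
    print(m, s, skew, kurt)   # roughly 2.35, 0.218, 0.65, 2.85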
Figure 2-9. Density function describing the diameters of a population of oranges formed by equally mixing two Gaussian populations with means of 2.2 and 2.5 inches and standard deviations of 0.1 and 0.2 inches, respectively. The mixture has a mean of 2.35 inches and a standard deviation of 0.218 inch, and is highly non-Gaussian, with skewness 0.652 and kurtosis 2.85.
The density function describing this mixture is shown in Figure 2-9. Denoting the fraction of the mixture that came from the first population as f, the mixture density function is just the linear combination p(x) = f p1(x) + (1−f) p2(x), where the subscripted density functions are the two pure Gaussians. The mean of p(x) (and all higher raw moments, independently of distribution) can be computed straightforwardly from this linear combination; it is just the corresponding linear combination of the two Gaussian means. Since f = ½ in the example, the mean is 2.35 inches. Expressed in terms of the means and variances of the two Gaussians, the first four raw moments (see Appendix A, whose notation we employ here, i.e., m_n for the raw nth moment) are:
m_1 = f\,\bar{x}_1 + (1-f)\,\bar{x}_2
m_2 = f\,(\bar{x}_1^2 + \sigma_1^2) + (1-f)\,(\bar{x}_2^2 + \sigma_2^2)
m_3 = f\,(\bar{x}_1^3 + 3\bar{x}_1\sigma_1^2) + (1-f)\,(\bar{x}_2^3 + 3\bar{x}_2\sigma_2^2)
m_4 = f\,(\bar{x}_1^4 + 6\bar{x}_1^2\sigma_1^2 + 3\sigma_1^4) + (1-f)\,(\bar{x}_2^4 + 6\bar{x}_2^2\sigma_2^2 + 3\sigma_2^4)    (2.21)

The first two of these formulas apply to all distributions in general, whereas the last two employ Gaussian properties not common to all distributions (e.g., m_3 depends on p1(x) and p2(x) having zero skewness), and therefore these two formulas do not apply in general to mixtures involving arbitrary distributions. Central moments can be obtained from the raw moments by the usual relationships (see Appendix A, Equation A.4, p. 420). For the example, these result in a standard deviation of 0.218 inch, with skewness 0.652 and kurtosis 2.85. If the mean diameters of the two types of orange had been equal, then the skewness would have been zero, since mixing two symmetric populations with their means aligned would result in another symmetric distribution, but one that is nevertheless non-Gaussian to a degree depending on how different the two original standard deviations are. In this same case (equal means and different variances), the excess kurtosis (see Appendix A) of the mixture can only be positive: the resulting distribution puts more weight in its tails than a pure Gaussian of the same variance.

As more and more distinct populations are added to the mixture, however, the tendency of a Gaussian mixture to be non-Gaussian may not persist. For example, one of the most important commercial applications of probability theory is in "actuarial statistics", which involves risk assessment. Insurance companies need to evaluate the risk associated with their clients in order to guarantee a profit while not pricing themselves out of the market. In the case of life insurance, different subsets of the general population have different inherent mortality risks. The population defined as American men between the ages of 50 and 60 can be broken down into sub-populations such as such men who: (a.) are heavy smokers; (b.) frequently ride motorcycles; (c.) have criminal records; (d.) have below-median income; etc. Taking age at death as a random variable, each of these groups may have a distinct distribution that is well approximated as Gaussian, but when they are mixed together, that approximation need not be lost in the result. This makes it possible to design insurance policies for heterogeneous groups. Unlike the Central Limit Theorem for adding random variables from non-pathological distributions, however, there is no guarantee that mixing a large number of distinct populations will result in a Gaussian population. Constructing counter-examples is in fact quite easy, as can be seen by considering disjoint uniform distributions. One interesting problem in population analysis is how to deduce the component populations of a significantly non-Gaussian distribution.

2.10 Correlated Random Variables

One of the nuances in the notion of "randomness" involves the occasional appearance of "order" in behavior that is supposed to be "random". It can take a while to get used to the idea that disorder in randomness may occur to varying degrees. From a mathematical point of view, a random variable may be defined simply by giving a proper specification of a probability density function and
declaring that the random variable follows the corresponding distribution. Thus we may take Equation 2.7 (p. 51) as our density function and proceed to use the Gaussian random variable. From a viewpoint more rooted in physical intuition, however, we saw above that Gaussian random variables arise when many smaller random variables are summed. Since most big things are comprised of many small things, the properties of big things tend to be distributed according to the familiar bell-shaped Gaussian curve. The fact that it has a "shape" means that there is some kind of order present to some extent, even though the random variables that were summed may all have been uniformly distributed and thus originate in the most completely disordered process known. The order implicit in the bell shape is produced by the higher entropy of some values of the random variable relative to others (see section 1.11). An example of this is the fact that in 1000 absolutely random flips of a fair coin, there are more ways for the numbers of heads and tails to come out approximately equal than dramatically different. For example, there are 1000 ways to get a single head, but about 2.7×10²⁹⁹ ways to get 500 heads. Since all 2¹⁰⁰⁰ possible distinct outcomes are equally likely, there is a vastly greater probability of getting one of those 2.7×10²⁹⁹ than one of those 1000. Thus "order" arises out of pure randomness.

Another manifestation of order within random behavior is when two random variables exhibit a tendency to follow each other, even if only partially. Such random variables are said to be correlated. When measurement data representing physically real phenomena exhibit significant correlation, one must consider the possibility of causal relationships among the phenomena. There may in fact be a direct causal link, or an indirect one, or it may be that an artificial correlation was induced by the choice of parameters. A simple example of a direct causal link has been seen already in the example of the candy store owner in section 2.3 who sends one day's profit per week to his sister in Paris. The profit on the given day is a random variable, and since the exchange rate fluctuates randomly, the number of euros per dollar that the sister will receive is also a random variable. So the number of dollars sent and the number of euros received are two random variables that are strongly correlated, because the latter is a function of the former, with only a relatively small additional random fluctuation introduced by the exchange rate. Here, the two random variables are not independent, since a functional relationship couples them directly.

Usually independence and lack of correlation go together, but they are actually fundamentally different. Independence is a stronger condition. If two random variables are independent, then they must be uncorrelated, whereas being uncorrelated does not necessarily imply that they are independent. Clearly, if one variable is a function of the other, then the variables are not independent. Having a causal relationship is generally thought of as having a functional dependence, even if the mathematical form is unknown. So a lot of effort in statistical analysis is aimed at detecting correlations which, if found, imply a dependence of one variable on the other (or both on a common driver), and hence a causal link. The lack of correlation does not imply the lack of a functional dependence. An example of this counter-intuitive situation will be given below.
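The coin-flip counts quoted a few sentences back can be checked directly; Python's exact integer arithmetic (our own illustration) handles numbers of this size without difficulty:

    from math import comb

    print(comb(1000, 1))     # 1000 ways to get exactly one head
    print(comb(1000, 500))   # about 2.7e299 ways to get exactly 500 heads
    print(2**1000)           # total number of equally likely outcomes, about 1.07e301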
A caveat is needed regarding independent variables being necessarily uncorrelated. This is strictly true of ideal populations of random variables, but much work revolves around studying such populations by analyzing a sample drawn from the population. For example, in the Preface we considered estimating the average mass of the oranges in a day's harvest by weighing one orange from each crate. In this case, the orange masses in the complete harvest comprise the population, and the set of orange masses obtained from one orange per crate is the sample. Suppose this method had
been applied to the two populations of oranges in section 2.9, one from California and one from Florida, and that the same number of crates had been involved, N. Then we would have two lists of numbers that could be combined into a single list of N pairs of numbers, say (Cn, Fn), n = 1 to N. The average mass of each type and its variance can be computed as we have already seen:

\bar{C} = \frac{1}{N}\sum_{n=1}^{N} C_n, \quad \bar{F} = \frac{1}{N}\sum_{n=1}^{N} F_n, \quad \sigma_C^2 = \frac{1}{N}\sum_{n=1}^{N}\left(C_n - \bar{C}\right)^2, \quad \sigma_F^2 = \frac{1}{N}\sum_{n=1}^{N}\left(F_n - \bar{F}\right)^2    (2.22)
where the variances shown are sample variances, not estimates of the population variances. The use of the sample means instead of the (unknown) population means biases these variances toward smaller values. If one wishes to obtain unbiased estimates of the population variances, the sample variances must be multiplied by N/(N-1), but that is not our goal here. We are interested in the sample correlation coefficient ρ, which is obtained from the standard deviations (square roots of the variances) and the covariance cov(C,F) as follows (this is the Pearson linear correlation coefficient; there are a number of correlation definitions, but this is the most widely used in the sciences).
\mathrm{cov}(C,F) = \frac{1}{N}\sum_{n=1}^{N}\left(C_n - \bar{C}\right)\left(F_n - \bar{F}\right), \qquad \rho = \frac{\mathrm{cov}(C,F)}{\sigma_C\,\sigma_F}    (2.23)
When we consider the mass difference of an orange relative to its population mean, we would not expect any causal relationship between the California and Florida oranges, hence we expect ρ = 0. However, since ρ is clearly a function of the two randomly drawn samples, it is a function of random variables, and therefore a random variable itself (the kind known as a statistic, i.e., a function defined on a random sample). So in practice, ρ usually acquires some nonzero value purely by random fluctuations, just as the sample means will generally not be exactly equal to the population means. Fortunately, the expected dispersion in ρ is a well-studied statistic itself, and so it is generally possible to make some judgment about whether the observed value of ρ is statistically significant. For example, if the value of ρ turns out to be 10σ_ρ in magnitude, we would be forced to rethink our assumption that the two populations are independent. Statistical significance of sample correlation coefficients will be discussed further in section 2.11 below.

Normally, when two random variables are correlated, the manner in which we form pairs of values has some obvious basis, unlike the arbitrary pairing in this example, where we didn't care about the order because the point was that random nonzero correlation values typically occur in samples from populations without any intrinsic correlation. If we had been testing for a correlation between mass and volume, for example, then we would naturally pair those values for each individual orange. In such a case, we would expect to find a very strong positive correlation. Correlation may also be negative, which is sometimes referred to as anti-correlation. The definition of the correlation coefficient ρ normalizes it to the range −1 to +1. The extreme values indicate a completely deterministic relationship. For example, if the density were constant throughout a sample
of oranges, and if there were no measurement error, we would find mass and volume to be 100% correlated.

An interesting and somewhat non-intuitive fact about correlation is that it can always be removed via a transformation of variables. This transformation is a pure rotation in the space defined by the correlated variables. For example, the orange volume and mass are two parameters that can be assigned to orthogonal coordinate axes and thus define a space whose dimensions are volume×mass, with units of (e.g.) centimeters³×grams, or cm³×gm for short. Suppose there were 10 crates of oranges, one orange measured from each crate, to produce the following data, which are subject to random fluctuations in density, volume, and measurement error.

Crate No.   V (Vol, cm³)   M (Mass, gm)
    1          122.17         139.54
    2          131.05         149.30
    3          144.72         174.65
    4          117.38         146.34
    5          101.61         120.68
    6          166.19         232.30
    7          122.28         142.06
    8          138.74         163.30
    9          162.99         202.70
   10          136.48         188.97
Figure 2-10. Mass (gm) of the ten oranges in a sample drawn from ten crates, one per crate, shown as a function of volume (cm³).
This sample has an average volume of 134.361 cm³ and an average mass of 165.984 gm. The corresponding standard deviations are 19.052 cm³ and 32.164 gm, respectively. Figure 2-10 shows these measurements as points in the volume×mass space. If the two variables had been 100% correlated, the points would lie exactly on a straight line. As it is, measurement errors and density fluctuations have prevented perfect correlation. The covariance and correlation coefficient for this sample have values of 577.238 cm³×gm and 0.942, respectively. The variances (squares of the standard deviations) and covariances form a covariance matrix. In general, a covariance matrix for N random variables x_i, i = 1 to N, is defined as
\begin{pmatrix}
\mathrm{cov}(x_1,x_1) & \mathrm{cov}(x_1,x_2) & \mathrm{cov}(x_1,x_3) & \cdots & \mathrm{cov}(x_1,x_N) \\
\mathrm{cov}(x_2,x_1) & \mathrm{cov}(x_2,x_2) & \mathrm{cov}(x_2,x_3) & \cdots & \mathrm{cov}(x_2,x_N) \\
\mathrm{cov}(x_3,x_1) & \mathrm{cov}(x_3,x_2) & \mathrm{cov}(x_3,x_3) & \cdots & \mathrm{cov}(x_3,x_N) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(x_N,x_1) & \mathrm{cov}(x_N,x_2) & \mathrm{cov}(x_N,x_3) & \cdots & \mathrm{cov}(x_N,x_N)
\end{pmatrix}    (2.24)
The covariance of a random variable with itself is just the ordinary variance, so the diagonal of a covariance matrix contains the squares of the standard deviations. If any of the off-diagonal elements are significantly nonzero, then correlation exists between the corresponding random variables. In this example, the covariance matrix is
\begin{pmatrix} \sigma_V^2 & \mathrm{cov}(V,M) \\ \mathrm{cov}(M,V) & \sigma_M^2 \end{pmatrix}
= \begin{pmatrix} 362.979 & 577.238 \\ 577.238 & 1034.523 \end{pmatrix}    (2.25)
Note that a covariance matrix is symmetric, because as Equation 2.23 shows, the definition of covariance is symmetric in the two random variables. As long as the correlation coefficient is not equal to ±100%, the determinant is greater than zero, so the matrix is nonsingular. In most normal applications, the random variables are real (i.e., not complex), and so the covariance matrix is both real and symmetric. Many different kinds of matrix can be diagonalized, i.e., rotated into a coordinate system in which all off-diagonal elements are zero, and this is especially straightforward for real symmetric nonsingular matrices. The details of matrix diagonalization are beyond our scope, so here we will just mention that diagonalizing a matrix is the same as solving for its eigenvalues and eigenvectors. In a diagonalized covariance matrix, the eigenvalues are the variances of new random variables formed by the rotation, i.e., linear combinations of the original random variables, and the eigenvectors are the corresponding axes of the rotated coordinate system. In this example, the diagonalized covariance matrix is
D = \begin{pmatrix} 1366.543 & 0 \\ 0 & 30.959 \end{pmatrix}    (2.26)
The coordinate rotation is −29.907°. This means that if we define a measurement vector whose components are volume and mass, (V, M), then the ten measurements provide ten such vectors, each of whose components are samples of correlated random variables. If we rotate these vectors by −29.907°, we obtain ten new vectors, each of whose components are samples of uncorrelated random variables. These new vector components are functions of the original ones and the rotation angle, which we use in a standard Euler rotation matrix operating on the measurement vector:
\begin{pmatrix} V_R \\ M_R \end{pmatrix}
= \begin{pmatrix} \cos(-29.907^{\circ}) & \sin(-29.907^{\circ}) \\ -\sin(-29.907^{\circ}) & \cos(-29.907^{\circ}) \end{pmatrix}
\begin{pmatrix} V \\ M \end{pmatrix}
= \begin{pmatrix} 0.86684\,V - 0.49859\,M \\ 0.49859\,V + 0.86684\,M \end{pmatrix}    (2.27)
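The covariance matrix, its eigenvalues, and the rotation angle can all be reproduced from the ten (V, M) pairs in the table; the following sketch (Python with NumPy, our own illustration) does so:

    import numpy as np

    V = np.array([122.17, 131.05, 144.72, 117.38, 101.61,
                  166.19, 122.28, 138.74, 162.99, 136.48])
    M = np.array([139.54, 149.30, 174.65, 146.34, 120.68,
                  232.30, 142.06, 163.30, 202.70, 188.97])

    cov = np.cov(V, M, bias=True)       # bias=True gives the 1/N sample covariance used in the text
    rho = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print(cov)                          # ~[[362.98, 577.24], [577.24, 1034.52]], Equation 2.25
    print(rho)                          # ~0.942

    eigvals, eigvecs = np.linalg.eigh(cov)
    print(eigvals)                      # ~[30.96, 1366.54], the diagonal of D in Equation 2.26
    major = eigvecs[:, np.argmax(eigvals)]
    angle = np.degrees(np.arctan2(major[1], major[0])) % 180.0
    print(angle)                        # ~60.1 deg between the major axis and the V axis;
                                        # the text's decorrelating rotation of -29.907 deg is this minus 90 deg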
Equation 2.27 has a somewhat anomalous feature. Suppose we ask: what are the units of the rotated vector? We started out with a vector that was already a bit strange in that its components had different units, cm³ and gm. But at least each component had physically meaningful units. After the rotation, we have a vector whose components have units that are a linear combination of cm³ and gm. Is this legitimate? The answer is that it is mathematically legitimate, but it has no physical meaning. There cannot be any physical parameter that has such units. These uncorrelated random variables are a mathematical fiction, but a useful one. Whatever manipulations are done with them, the results need to be mapped back through the rotation to the original space before physical meaning can be recovered. The notation V_R is just a convenience, since this variable is no longer a volume, nor is M_R a mass. But it is quite common to plot variables with different units on the same graph (e.g., atmospheric temperature vs. altitude) without eliciting any sense of transgression, and any graph of two or more variables supports the definition of a vector, so there is no requirement that the components of a vector have the same units. But any vector can be rotated, and so these odd units can arise, and we should not be surprised.

But we do need to be careful about several things. The angle, for example, is not uniquely determined. As can be seen in Figure 2-10, the points are scattered about a line that runs up to the right at an angle of about 30° from the horizontal. By rotating the points approximately −30°, we have brought the run direction of the data to the horizontal. This is what makes the new axis variables uncorrelated. But we could also have rotated through about 60° to make the run direction vertical, and this too would have eliminated correlation. It would also have permuted the diagonalized matrix elements (the assignment of the largest eigenvalue to V_R was arbitrary). Multiples of 180° may be added or subtracted from the angle without changing the diagonalized matrix, so a rotation of about 150° would have produced the same covariance matrix as −30°. In general, different original units would have changed the numerical value of the angle, the covariance, and the diagonalized matrix, but not the correlation coefficient, which is dimensionless. All of these ambiguities are typical of eigenvalue problems, but care and consistency allow useful manipulations to be performed.

In this example we have seen correlation arise between two variables because they were not independent. All the masses and volumes referred to oranges having about the same density, and so these variables are linked by a simple equation, mass = volume×density. Without measurement errors and fluctuations in density, the variables would have been 100% correlated. Simply rotating
the measurement vectors does not remove this dependence, and yet we obtained uncorrelated vector components. The relationships between correlation, independence, and causality are important to understand, since getting this wrong leads to “bad science”. The easiest pair to deal with is correlation and independence. If two random variables are independent, then they are uncorrelated. Independence is the stronger condition. It nullifies all justification for preferring any specific sample pairing scheme, leaving only arbitrary choices like pairing the California and Florida orange samples by crate number. To reiterate, here we are discussing the intrinsic properties of the random variables, not the numerical value of correlation computed for a sample of pairs of the random variables. Just as random fluctuations generally cause the mean of a sample not to be equal to the mean of the population from which the sample was drawn, they also cause fluctuations in statistics such as the correlation coefficient. We stress again that the sample correlation coefficient is a well-studied statistic whose significance can be evaluated, so there is little danger of misinterpreting a randomly nonzero sample correlation coefficient, as long as the available theory is employed. Deriving this theory is beyond the present scope, however, and so we will take a qualitative approach to the question: given that two random variables are correlated, is it possible that they are independent?
Figure 2-11 A. Asymmetric distribution of correlated random variables; ellipses have semimajor axes of 1 and 2 standard deviations but are not contours of constant probability density because the distribution is highly non-Gaussian. B. Coordinates rotated −26.76 degrees; the new variables are uncorrelated.
Suppose we have a sample of 5000 data pairs whose plot looks like Figure 2-11A. The points are obviously not scattered uniformly over the graph, and if we compute the correlation coefficient, we find that it is 0.304. As a sample is made larger and larger, the uncertainty in the sample average as an estimator of the population average decreases. This is generally true of the uncertainties of all statistics, such as variance, skewness, kurtosis (see Appendix A), etc. With a sample size of 5000, the chance that a correlation coefficient of 0.304 could be an accident may be ignored. This correlation is highly significant. Could these pairs be comprised of independent variables? If so, then as discussed in section 1.4 and elsewhere, the joint density function is the simple product of two
density functions, one for each variable that is a function only of that variable. In other words, the distribution of one variable cannot be a function of the other, since that would put the hooks of the other's distribution into that of the first, and the joint density would not separate. We can pick a limited region of the abscissa, for example between 0.1 and 0.2, and compute the sample mean and standard deviation of all ordinate values in that region. This yields 0.288 for the mean and 0.166 for the standard deviation. The uncertainty of the sample mean as an estimator of the population mean is the sample standard deviation divided by the square root of the sample size less 1. This yields a 1σ uncertainty of 0.0074. So if we draw samples from the parent population many times and repeat this calculation, the mean should always be very close to 0.288. Even 5σ fluctuations would yield means between 0.251 and 0.325. Now we consider the abscissa region between 0.8 and 0.9. Here we find a mean of 0.462 and a standard deviation of 0.267, with the 1σ uncertainty of the mean being 0.0119. Here the 5σ fluctuations run from 0.402 to 0.522. The 5σ fluctuation intervals for these two regions do not come even close to overlapping. We can be very confident that the distribution of the ordinate random variable depends significantly on the abscissa variable, and therefore the joint density function does not factor into separate density functions for X and Y, and therefore the random variables are not independent. So ignoring absurdly unlikely fluctuations, if two random variables are significantly correlated, they cannot be independent.

But if two random variables are not independent, can they be uncorrelated? The answer is yes, as we can see by constructing a simple example. We consider a random variable X that is uniformly distributed between −½ and +½. Now we define another variable Y to be the absolute value of X, i.e., Y = |X|. Since Y is a function of X, it is obviously not independent of X. As soon as a value for X is obtained, we immediately know the value of Y. And yet X and Y are uncorrelated. To see this, we compute the means and covariance:

\bar{X} = \int_{-1/2}^{1/2} X\,p_X(X)\,dX = \int_{-1/2}^{1/2} X\,dX = \left.\frac{X^2}{2}\right|_{-1/2}^{1/2} = 0

\bar{Y} = \int Y\,p_Y(Y)\,dY = \int_{-1/2}^{0} (-X)\,dX + \int_{0}^{1/2} X\,dX = \left.-\frac{X^2}{2}\right|_{-1/2}^{0} + \left.\frac{X^2}{2}\right|_{0}^{1/2} = \frac{1}{8} + \frac{1}{8} = \frac{1}{4}

\mathrm{cov}(X,Y) = \left\langle (X-\bar{X})(Y-\bar{Y})\right\rangle = \left\langle XY - X\bar{Y} - Y\bar{X} + \bar{X}\bar{Y}\right\rangle = \langle XY\rangle - \bar{X}\bar{Y} = \langle XY\rangle    (2.28)
where we used the fact that the density function of X is unity over the integration range, and in the last equation, the fact that X is zero-mean. Since Y is a function of X, and again using the fact that the density function is unity, we can write ⟨XY⟩ as
\langle XY\rangle = \int_{-1/2}^{1/2} Y(X)\,X\,p_X(X)\,dX = \int_{-1/2}^{0} (-X)\,X\,dX + \int_{0}^{1/2} X\cdot X\,dX = \left.-\frac{X^3}{3}\right|_{-1/2}^{0} + \left.\frac{X^3}{3}\right|_{0}^{1/2} = -\frac{1}{24} + \frac{1}{24} = 0    (2.29)

So the covariance is zero, and it follows that the correlation coefficient is zero, and so X and Y are uncorrelated but definitely not independent.
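A brute-force numerical check (Python with NumPy, our own illustration) agrees with the analytic result:

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.uniform(-0.5, 0.5, 1_000_000)
    Y = np.abs(X)                         # completely determined by X

    print(X.mean(), Y.mean())             # ~0 and ~0.25, as in Equation 2.28
    print(np.corrcoef(X, Y)[0, 1])        # ~0, despite the exact functional dependence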
The connection between correlation and causality is not so easily dismissed. Since correlation implies a dependence of some sort, and dependence implies the presence of some deterministic agent, it is tempting to assign roles of cause and effect to the observed correlated parameters. The two greatest challenges that the actuary faces are avoiding erroneous estimation of statistical significance and discovering which actors play which roles. When two observable parameters are in fact causally related, it is often not very difficult to ascertain which was the cause and which was the effect. For example, when a carpenter has hit his thumb with a hammer and the thumb is observed to have a blue color, it was clearly not the blueness of the thumb that attracted the hammer. That is of course an exaggerated example; much greater subtlety can arise.

Often the problem is what is known as a confounding variable, another parameter that was not explicitly taken into account but is the real cause behind the observed correlation. For example, over the last few decades, several studies have claimed to find that the use of mouthwash containing alcohol correlates with increased incidence of cancer of the mouth and throat. Some of these go on to identify the cause as the alcohol acting on mucous membranes and creating greater opportunities for carcinogenic agents to invade. The computation of a correlation coefficient and its statistical significance is straightforward, and the result is that a significant correlation can hardly be denied. The accusation that the data contain biases can be made, however, since bias may be created by bad sample selection or failing to recognize confounding agents. Counter-studies, often funded by pharmaceutical companies, have claimed that the confounding agents are tobacco and alcoholic beverages. For example, someone who frequently smokes cigars is probably more likely to use mouthwash than the average member of the population, and certainly many less controversial studies have established links between cigar smoking and oral cancer. So one of the actual causative agents driving the correlation may be cigar smoking.

Another example was seen in section 2.8, the correlation between a history of driving a motorcycle and success in pilot training school. One could claim that operating a motorcycle requires developing many of the same skills as flying an airplane, so that previous experience would give a pilot trainee a head start. But this argument overlooks the possibility of a more fundamental causative agent for both activities, a psychological predisposition for speed, exhilaration, and the machines that can provide these. There is more to being a good pilot than working the controls; a
certain mindset also helps.

We mentioned above that artificial correlation can be induced by the choice of parameters. By parameters, we really mean coordinates. As we just saw, a real correlation can be removed by rotating the coordinates. Conversely, parameters that are not correlated can often be made to appear so by a coordinate rotation. Normally one would not deliberately create correlation for its own sake, but it can happen as a byproduct of some standard processing. In typical cases, this occurs when the coordinate axes are of the same type; for example, the situation most familiar to the author involves celestial coordinates of astrophysical objects. Celestial coordinates are angles on the "celestial sphere" that represents the sky, with the Earth at the center of the sphere, and the sphere's radius being arbitrarily large in units such as kilometers but usually taken to be 1.0 in its own special units. Fiducial marks on the sphere were originally supplied by recognizable patterns of stars, but as astrometric precision improved, stellar motions became significant, and the definition of "inertially fixed" space came to depend on galaxies so distant that their apparent motion is negligible. Geometrical features of the Earth, such as its equatorial plane, spin axis, etc., can be extended to intersect the celestial sphere to produce corresponding loci or points on the spherical surface.

The angles used to define positions on the sky are similar to the spherical coordinates used to describe points on the surface of the Earth, latitude and longitude. These are both angles, hence they have the same units, and rotating coordinates does not introduce any anomalous mixture such as that of the mass and volume of oranges. There are numerous coordinate systems defined on the sky, each serving a specific purpose. The oldest is the one usually meant by the name "celestial coordinates", with an azimuthal angle called "Right Ascension" (RA, standardly denoted α) and an elevation angle called "Declination" (Dec, standardly denoted δ). The zero point of Dec is the plane of the Earth's equator projected onto the sky, and the zero point of RA in this plane is the Vernal Equinox, the point where the Sun is located at the first instant of spring in the northern hemisphere. The fact that we have to specify which hemisphere we mean when referring to "spring" is a clue that there are many subtleties in these coordinate definitions. Most of these stem from the fact that the Earth's equatorial and orbital planes precess due to influences of the Moon and other solar-system bodies, and therefore each instant of time has its own celestial poles and equator. We are not really interested in these nuances here, so we will just refer to RA and Dec, (α,δ), as though the epoch were established.

One of the activities of the astronomer is predicting where certain celestial bodies will appear in the sky at a particular time. As with all predictions, some uncertainty is associated with the nominal position, since the gravitational interactions are complicated, computer arithmetic is done with finite precision, and knowledge of orbital parameter values is limited. In many cases, and especially with asteroids, the uncertainty is dominated by error in the location of the object within its orbital plane. This is because the motion of any given asteroid within its orbital plane is very fast compared to the motion of the plane itself (i.e., its precession due to gravitational perturbations).
As a result, typical asteroids have position uncertainties of a few arcseconds due to motion within the plane and about one arcsecond or less perpendicular to the plane due to error in the orientation of the plane. After the position of an asteroid in its orbital plane has been computed, the orientation of that plane relative to the observatory is used to compute the asteroid's apparent position on the sky. This is typically a position in (α,δ) coordinates. Estimating the uncertainty is a crucial aspect of real
page 86
June 16, 2021
8:47
World Scientific Book - 10in x 7in
12468-resize
2.10 Correlated Random Variables
science, since it allows judgments to be made regarding the significance of discrepancies in the measurements. Therefore it is important to estimate the uncertainties in the orbital-plane position and in the mapping from that plane to celestial coordinates. The errors causing these uncertainties are usually well approximated as Gaussian, and “error bars” in the predicted sky position take the form of an error ellipse centered on the nominal position. The error ellipse should really be called an uncertainty ellipse, since we can compute the uncertainty, whereas we cannot compute the error, and if we could, we would subtract it off and have no error. The word “error” is often used as a shortened name meaning “estimated error”, “expected error”, “RMS error”, etc., all which mean that the so-called “error” is really a random variable believed to behave as the real error would over an ensemble of many identical measurements or similar computations. The error ellipse is a contour of constant probability density enclosing a region of most-likely location of the asteroid. The elliptical shape stems from the exponent in the Gaussian density function. Since the position is a two-dimensional quantity, the uncertainty is represented by a twodimensional Gaussian containing an exponential function whose (negative) argument is
$$\frac{(\alpha - \bar{\alpha})^2}{2\sigma_\alpha^2} + \frac{(\delta - \bar{\delta})^2}{2\sigma_\delta^2} \qquad (2.30)$$
where the bars indicate the nominal coordinates, and for simplicity here we have temporarily ignored correlation in the errors. Setting this expression equal to a constant defines an ellipse in the coordinates (α,δ). The elliptical density contour is somewhat peculiar to bivariate Gaussian distributions, although other density functions can be constructed that would also have elliptical contours. This particular ellipse would have its principal axes aligned with the celestial coordinate axes.
Figure 2-12 A. Error ellipse of an asteroid with 3 arcsec uncertainty due to motion in its orbital plane and 1 arcsec uncertainty due to error in that plane's orientation; the errors in α and δ are significantly correlated and have 1σ uncertainties of 2.646 and 1.732 arcsec, respectively. B. The same error ellipse in a rotated coordinate system aligned with its principal axes; in these coordinates, the errors are uncorrelated, with uncertainties of 3 and 1 arcsec on the major and minor axes, respectively; the correlation in A is artificially induced by the coordinate system.
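The caption's numbers can be checked by rotating the diagonal covariance matrix of the principal-axis frame (panel B) into the RA/Dec frame. The short sketch below is not from the book; it assumes only the 3 and 1 arcsec principal-axis uncertainties and the 30-degree tilt quoted later in the text, and uses NumPy.

```python
import numpy as np

# Principal-axis 1-sigma uncertainties from Figure 2-12B (arcsec).
C_principal = np.diag([3.0**2, 1.0**2])

# Tilt of the error ellipse with respect to the RA direction
# (example value quoted in the text).
theta = np.radians(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Covariance matrix in the RA/Dec frame.
C_radec = R @ C_principal @ R.T

sigma_ra  = np.sqrt(C_radec[0, 0])            # ~2.646 arcsec
sigma_dec = np.sqrt(C_radec[1, 1])            # ~1.732 arcsec
rho = C_radec[0, 1] / (sigma_ra * sigma_dec)  # ~0.756
print(sigma_ra, sigma_dec, rho)
```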
Figure 2-12A shows an example of a case in which the uncertainty due to motion in the orbital plane is 3 arcsec and the perpendicular uncertainty is 1 arcsec. In general, such error ellipses are not aligned with the local directions of RA and Dec, and so the ellipse’s axes are rotated with respect to the local Cartesian projection of the RA and Dec directions. Angular coordinate ranges of a few arcseconds are small enough to ignore projection distortions, and so the approximation of a locally Cartesian substitute for the spherical coordinates is well within the tolerances of modern asteroid science. It should be mentioned that the view shown is as seen from outside the celestial sphere; this makes the coordinate system right-handed. Astronomers normally use projections seen from inside the celestial sphere, since that is how the sky is viewed in actuality, and known star patterns appear as expected. But a sphere may be viewed either way. The joint density function defining this error ellipse is the bivariate Gaussian
$$p(\alpha,\delta) = \frac{1}{2\pi\,\sigma_\alpha \sigma_\delta \sqrt{1-\rho^2}}\, \exp\!\left[ -\frac{1}{2(1-\rho^2)} \left( \frac{(\alpha-\bar{\alpha})^2}{\sigma_\alpha^2} + \frac{(\delta-\bar{\delta})^2}{\sigma_\delta^2} - \frac{2\rho\,(\alpha-\bar{\alpha})(\delta-\bar{\delta})}{\sigma_\alpha \sigma_\delta} \right) \right] \qquad (2.31)$$
where ρ = 0.756, σα = 2.646, and σδ = 1.732 in the example, and the nominal values of α and δ are at the origin of the axes. Note that if ρ were set to zero, this equation would reduce to a product of two density functions, each with the form of Equation 2.7 (p. 51), i.e., it could be factored into two marginal density functions, one depending only on α and one depending only on δ. In general, the marginal density function for α is the joint density function integrated over the domain of δ, and vice versa. When the joint density can be factored into separate density functions, the integral of one of them will be unity by the normalization condition, leaving only the other function. When the joint density cannot be so factored, the integral of either variable over its domain can still be computed, but if this is done for both variables and the two marginal density functions are multiplied, the result is not the joint density function (an illustration will be given below). In this example, the error ellipse is rotated 30° from the RA direction. This means that errors in RA and Dec are positively correlated. A way to view this is: given the error ellipse, if someone who knows the true RA value tells you that your estimate is off by 2 arcseconds in the positive RA direction, you immediately have a clue regarding how much you are off in Dec. That estimate is also more likely to be high than low, although it can still be low, since the correlation is not 100%. But now the nominal position in Dec can be adjusted 1 arcsec down to get a zero-mean expectation of error. When knowledge gained about the true value of one uncertain variable implies knowledge also about the other, the variables are correlated. When no such implication exists, the variables can at least tentatively be considered uncorrelated. Figure 2-12B shows the error ellipse in a rotated coordinate system that is aligned with the principal axes. In this system, the errors are uncorrelated. This could be considered a more natural system to use, since it is tied directly to the asteroid's orbital properties. In fact, it is an intermediate system used in the computation of asteroid position. The standard representation in RA and Dec induces the correlation in the errors. The RA and Dec coordinates are obtained by rotating the coordinates in Figure 2-12B and are therefore linear combinations of those uncorrelated coordinates. This is an example of correlation created by the choice of coordinates. In this case, unlike the
volumes and masses of the oranges, the variables joined in linear combinations have the same (angular) dimensions and units, so no chimerical anomaly develops, and the new coordinates can be thought of as the same sort of thing as the old ones. Next we consider correlated random variables that are far removed from the Gaussian distribution, and hence for which error “ellipses” are not appropriate. The distribution we saw earlier in Figure 2-11A (p. 71) is such an example. These points were generated deliberately to be correlated and non-Gaussian with the following pseudorandom algorithm:

$$X \leftarrow U(0,1), \qquad Y \leftarrow U\!\left(0,\ \frac{X+1}{2}\right) \qquad (2.32)$$
where the left arrows symbolize a random draw, and U indicates a uniform distribution with lower and upper limits given by the arguments. So X is drawn from a uniform distribution between 0 and 1, and Y is drawn from a uniform distribution between 0 and the average of 1 with whatever value X took on. As may be seen, this results in an asymmetric distribution of points with the density higher on the left, where the range of Y is smaller than on the right. We saw earlier that X and Y are correlated and hence not independent; here we see the actual formal dependence of Y on X. Being asymmetric, correlated, and non-Gaussian does not change the fact that means and variances are well defined for this sample. The means are 0.498 for X and 0.378 for Y, with standard deviations of 0.288 and 0.231, respectively. The correlation coefficient is 0.304, and we can define error “ellipses” that turn out to be rotated from the coordinate axes by 26.76 degrees. The 1σ and 2σ ellipses are shown in each side of the figure. But obviously, these ellipses do not trace out contours of constant probability density, since they go into and out of the boundaries of the actual point distribution. This is a symptom indicating that any Gaussian approximation is probably not acceptable. The ellipses are simply mathematical constructions with the correct major and minor axes and the correct orientations. As asymmetric as this distribution is, it too can be rotated to remove the correlation. This is shown in Figure 2-11B (p. 71). The means of the new coordinates (X′, Y′) are the same as before, but now the standard deviations have changed, since the covariance matrix has changed from having nonzero off-diagonal elements to being diagonalized. The determinant of the covariance matrix is invariant under rotation, and so the diagonal elements must change as the off-diagonal elements go to zero. The new standard deviations are 0.305 and 0.207 for X′ and Y′ respectively. But rotation does not remove the asymmetry, and it is obvious that the range available to Y′ depends on the value of X′. The joint density function therefore cannot be separated into a product of marginal density functions; these coordinates remain not independent despite being uncorrelated. Because of the great importance of the Gaussian distribution, an exception to the rule “uncorrelated does not necessarily imply independent” must now be pointed out. When the two random variables are jointly Gaussian, being uncorrelated does imply independence. The errors in RA and Dec were said to be Gaussian in Figure 2-12, and so rotating into the coordinate system shown in Figure 2-12B not only yields uncorrelated errors, but also independent errors. This rotation makes the correlation coefficient zero, and as noted above, if ρ is set to zero in Equation 2.31, the joint density function reduces to the product of separate marginal density functions, and this is a sufficient condition for the random variables to be independent, despite the fact that these
uncorrelated variables are each linear combinations of the same two correlated variables. This common dependence does not supply a foundation for expressing one of the uncorrelated variables as a function only of the other. We must also note that two random variables may each be Gaussian without necessarily being jointly Gaussian. The latter requires the joint density function to have a form algebraically equivalent to Equation 2.31, and there are other ways to combine Gaussian random variables, some of which leave them uncorrelated but not independent. An example will be given below. If the random draws shown in Equation 2.32 were changed to use Gaussian distributions, but keeping the mean of the Y distribution a linear function of the X draw, the joint distribution would have probability contours shaped like the rotated ellipse in Figure 2-12A, for which we now know a coordinate transformation can produce uncorrelated and independent Gaussian random variables. If the Y distribution has a more complicated dependence on the X draw, however, then the joint density function no longer has the form of a bivariate Gaussian (whose exponential argument must be a quadratic form, otherwise the marginal density functions are not Gaussian, as required to refer to X and Y as Gaussian). In such a case, Y should probably no longer be viewed as a simple random variable but rather a random process, also called a stochastic process, and different considerations come into play. A basic example already encountered is the random walk, in which the variance of the distance from the origin increases with the number of steps, so that the distance random variable has a distribution which is a function of time. Arbitrary dependences between random variables can be constructed without regard to whether such connections arise in scientific data, and studying such artificial distributions can provide insight into how correlation and formal dependence of random variables are connected. For example, we will consider the case mentioned above of a bivariate Gaussian whose y mean is a linear function of x. This turns out to behave very much like the correlated bivariate Gaussian in Equation 2.31, but we write it in terms of the linear dependence of the y mean on x, y = k0 + k1 x, instead of in terms of a correlation coefficient ρ.
$$p_{xy}(x,y) = \frac{1}{2\pi\,\sigma_x \sigma_y}\, \exp\!\left[ -\frac{(x-\bar{x})^2}{2\sigma_x^2} - \frac{(y - k_0 - k_1 x)^2}{2\sigma_y^2} \right] \qquad (2.33)$$
Figure 2-13A shows a contour plot of this distribution for the case x̄ = 0, k0 = 0, k1 = ½, σx = 2, and σy = 3. The contours can be seen to be elliptical, just like the 1σ contours in Figure 2-12. Also like Figure 2-12A, these contours are not aligned with the coordinate axes, but this distribution could be rotated to a coordinate system whose axes are aligned with the principal axes of the ellipse, and the variables in the new system would be uncorrelated. But would they also be independent? The answer is yes. To see this, we need only consider the following property of ellipses: rotation of the principal axes results from the presence of a cross-term such as that between α and δ in the argument of the exponential in Equation 2.31, and such as that between x and y in the argument of the exponential in Equation 2.33 when the square of the y term is expanded. When we rotate to coordinates aligned with the principal axes, these cross terms vanish in the new coordinates. Once the cross terms are gone, we are left with an expression like Equation 2.30 as the argument of the exponential. But the exponential of the sum of the two terms is just the product of the exponentials
of each term separately, and so the exponential factors cleanly, allowing the joint density function to do the same, and that satisfies the conditions for the variables to be independent.
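A quick Monte Carlo check of this conclusion can be made by drawing from the construction of Equation 2.33 with the Figure 2-13A parameters and rotating the draws into the principal axes of their sample covariance. The sketch below is illustrative only (NumPy, arbitrary seed); the parameter values are the ones quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Figure 2-13A parameters quoted in the text.
k0, k1, sigma_x, sigma_y = 0.0, 0.5, 2.0, 3.0

# Draws from Equation 2.33: x is Gaussian, and the mean of y is linear in x.
x = rng.normal(0.0, sigma_x, 200_000)
y = rng.normal(k0 + k1 * x, sigma_y)
print("correlation before rotation:", np.corrcoef(x, y)[0, 1])

# Rotate into the principal axes of the sample covariance; the correlation
# (essentially) vanishes, and for this jointly Gaussian case the rotated
# variables are also independent.
eigvals, eigvecs = np.linalg.eigh(np.cov(x, y))
xp, yp = eigvecs.T @ np.vstack([x, y])
print("correlation after rotation: ", np.corrcoef(xp, yp)[0, 1])
```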
Figure 2-13 A. Contour plot of a bivariate Gaussian with x̄ = 0 and ȳ = x/2, σx = 2, σy = 3 (axes have different scales). B. Contour plot of a bivariate Gaussian similar to A except with ȳ = x²/2.
In review, a marginal density function for a given random variable can always be obtained from a joint density function by integrating that function over the domains of all other random variables involved. If the joint density function can be factored into separate functions of each variable, then these integrations are trivial, with each producing only a factor of unity. This is the case when all variables are independent. When such factoring is not possible, as in the case of Equation 2.33, marginal density functions can still be obtained, but they cannot be multiplied to reproduce the joint density function. We will illustrate this with Equation 2.33 for the nontrivial case of k0 ≠ 0, k1 ≠ 0. The marginal density functions are
$$p_x(x) = \int_{-\infty}^{\infty} p_{xy}(x,y)\,dy = \frac{e^{-\frac{(x-\bar{x})^2}{2\sigma_x^2}}}{\sqrt{2\pi}\,\sigma_x}$$

$$p_y(y) = \int_{-\infty}^{\infty} p_{xy}(x,y)\,dx = \frac{e^{-\frac{(y-k_0-k_1\bar{x})^2}{2\left(\sigma_y^2 + k_1^2\sigma_x^2\right)}}}{\sqrt{2\pi\left(\sigma_y^2 + k_1^2\sigma_x^2\right)}} \qquad (2.34)$$
Both marginal density functions are Gaussian. For x this is obvious. For y we need to define
the mean and variance as
$$\bar{y} = k_0 + k_1\,\bar{x}, \qquad \bar{\sigma}_y^2 = \sigma_y^2 + k_1^2\,\sigma_x^2 \qquad (2.35)$$
But the product of the two right-hand sides of Equation 2.34 clearly does not yield the righthand side of Equation 2.33. A plot of the 1σ contour of constant probability density for the real joint density function is shown in Figure 2-14A as a solid curve, and the similar plot of the product of the marginal density functions above is shown as a dashed curve.
Figure 2-14 A. 1σ contours of constant probability density for the joint density function in Equation 2.33 (solid curve, contour plot in Figure 2-13A) and the product of its marginal density functions in Equation 2.34 (dashed curve). B. 1σ contours of constant probability density for the joint density function in Equation 2.37 (solid curve, contour plot in Figure 2-13B) and the product of its marginal density functions (dashed curve).
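The marginal density of y in Equations 2.34 and 2.35 can be verified by integrating the joint density of Equation 2.33 numerically. The following sketch is a minimal illustration assuming the Figure 2-13A parameter values and using SciPy's quad routine; the numerically integrated marginal agrees with the closed-form Gaussian. The same numerical approach works for Equation 2.37, for which no closed form is available.

```python
import numpy as np
from scipy.integrate import quad

# Figure 2-13A parameters quoted in the text.
xbar, k0, k1, sx, sy = 0.0, 0.0, 0.5, 2.0, 3.0

def p_xy(x, y):
    """Joint density of Equation 2.33 (the mean of y is linear in x)."""
    return np.exp(-(x - xbar)**2 / (2.0 * sx**2)
                  - (y - k0 - k1 * x)**2 / (2.0 * sy**2)) / (2.0 * np.pi * sx * sy)

def p_y_closed(y):
    """Closed-form marginal of Equations 2.34 and 2.35."""
    var = sy**2 + k1**2 * sx**2
    return np.exp(-(y - k0 - k1 * xbar)**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

for y in (0.0, 2.0, 5.0):
    numeric, _ = quad(lambda x: p_xy(x, y), -np.inf, np.inf)
    print(y, numeric, p_y_closed(y))   # the last two columns should agree
```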
For comparison, we consider the marginal density functions for the joint density function in Equation 2.31, the correlated bivariate Gaussian:
$$p_\alpha(\alpha) = \int_{-\infty}^{\infty} p(\alpha,\delta)\,d\delta = \frac{e^{-\frac{(\alpha-\bar{\alpha})^2}{2\sigma_\alpha^2}}}{\sqrt{2\pi}\,\sigma_\alpha}, \qquad p_\delta(\delta) = \int_{-\infty}^{\infty} p(\alpha,\delta)\,d\alpha = \frac{e^{-\frac{(\delta-\bar{\delta})^2}{2\sigma_\delta^2}}}{\sqrt{2\pi}\,\sigma_\delta} \qquad (2.36)$$
Again we obtain Gaussian marginal density functions, but in this case they betray no hint of the correlation! The correlation dependence in the joint density function evaporates in the integrations, leaving exactly the same functions we would have if there were no correlation. Nevertheless, their product is not the right-hand side of Equation 2.31, and therefore the random variables with that joint density function are neither uncorrelated nor independent. A plot of the 1σ contours of constant probability density for the real joint density function and the product of the marginal density functions is qualitatively similar to Figure 2-14A. We note in passing that when we say “a marginal density function for a given random variable can always be obtained from a joint density function”, we do not necessarily mean in closed form. For example, the joint density function shown in Figure 2-13B is
$$p_{xy}(x,y) = \frac{1}{2\pi\,\sigma_x \sigma_y}\, \exp\!\left[ -\frac{(x-\bar{x})^2}{2\sigma_x^2} - \frac{(y - k_0 - k_1 x^2)^2}{2\sigma_y^2} \right] \qquad (2.37)$$
which differs from Equation 2.33 only in that k1 multiplies x2 instead of x. Integrating over y leaves a simple Gaussian marginal density function for x (the same as in Equation 2.34), but integrating over x yields a marginal density function for y that involves modified Bessel functions of the second kind, i.e., integral functions that can be evaluated only via numerical quadrature. Figure 2-14B shows the 1σ contour of constant probability density for the joint density function above, and the similar plot for the product of its marginal density functions is shown as a dashed curve.
Figure 2-15 A. 3-D plot of the joint density function for a zero-mean unit-variance Gaussian random variable α and a random variable δ = α for |α| > 1.5384, otherwise δ = −α. Both random variables are Gaussian, they are not independent, but they are uncorrelated. B. Contour plot of the same joint density function.
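The construction shown in the caption (and discussed in more detail below) is easy to test numerically. The sketch below is a minimal illustration using NumPy: it draws a large Gaussian sample, applies the sign flip at the threshold C = 1.5384 quoted in the text, and confirms that the resulting pair is uncorrelated even though δ is an exact function of α.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 1.5384   # threshold quoted in the text for zero total correlation

alpha = rng.normal(size=1_000_000)
delta = np.where(np.abs(alpha) > C, alpha, -alpha)

# delta is an exact function of alpha (so the two are certainly not
# independent), and its marginal distribution is still standard Gaussian,
# yet the correlation comes out consistent with zero.
print("correlation:", np.corrcoef(alpha, delta)[0, 1])
print("delta mean and std:", delta.mean(), delta.std())
```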
Finally we give an example of two Gaussian random variables that are not jointly Gaussian, are not independent, but are uncorrelated. They do of course have some joint density function, but it is not the bivariate Gaussian required to call them jointly Gaussian (e.g., Equation 2.31). We define α to be an ordinary zero-mean unit-variance Gaussian random variable, but we define δ to be equal to α for |α| > C, otherwise δ = −α, where C is some constant. This amounts to the joint density function being disjoint, part of it the same as Equation 2.31 with ρ = 1, and part with ρ = −1. Since a unit correlation magnitude collapses the joint density function into a line (in this case disjoint lines), for the sake of visualization we will back off a bit and use ρ = 0.9 for |α| > C, otherwise ρ = −0.9. A judicious choice for the value of C will result in the total correlation being zero; this turns out to be about 1.5384. Figure 2-15A shows a 3-D plot of the joint density function, and a contour plot is shown in Figure 2-15B. Clearly δ is not independent of α, and the joint density function is not a simple ellipse and cannot be factored into Gaussian marginal density functions. Integrating over either variable produces a Gaussian marginal density function for the other, however, reproducing Equation 2.36, and hence both variables are Gaussian.

2.11 Sample Statistics

Underlying almost every application of classical statistics there is a notion of a population and a sample drawn from it. The very word “statistics” implies a sample space, i.e., a set of all possible values that a given random variable may assume. It is upon this sample space that a statistic is defined as a function. The word “sample” is used to mean a set of values drawn from the population, hence a set of numbers that exist in the sample space. The same word may be used to mean a single one of these numbers within the set; the distinction is usually clear from context. The notions “population” and “sample” are intuitively fairly obvious, and we have been making considerable use of them already, but in this section we should be more specific about them and what we mean by a “random draw”. A “population” is a set of objects with some aspect that can be quantified by a parameter that has a distribution. In section 2.9 we referred to a population of oranges with diameters distributed in such a way that an arbitrarily finely binned histogram would look exactly like the Gaussian density function. In most applications, a population's distribution is not considered to be changed by the mere fact that some random draws are taken from it. The number of objects in the population is usually pictured as being arbitrarily large, so the shape of the distribution is not affected by the removal of any finite number of objects. A “sample” is formed by taking some number of objects from the population in a random way. What we mean by a “random” way can be imagined as follows. We spin a wheel of fortune (like those in sections 2.2 and 2.3) whose markings run uniformly from 0 to 1 with an arbitrarily fine scale. This gives us a randomly drawn value between 0 and 1, with every value in between equally likely. Then we locate this value on the vertical axis of a graph of the cumulative distribution corresponding to the population's density function (see, for example, Figure B-2, p. 430). Our sample value is whatever value on the horizontal axis is paired with our point on the vertical axis via the curve of the cumulative distribution.
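This "wheel of fortune" procedure is the inverse-cumulative-distribution construction, and it is easy to sketch in code. The example below is illustrative only: it assumes, purely for concreteness, a Gaussian population with mean 10 and standard deviation 0.2, and uses SciPy's norm.ppf for the inverse cumulative distribution.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# The "wheel of fortune": uniform draws between 0 and 1.
u = rng.uniform(0.0, 1.0, 100_000)

# Map each uniform draw through the inverse of the population's cumulative
# distribution.  The Gaussian population (mean 10, standard deviation 0.2)
# is assumed here purely for illustration.
sample = norm.ppf(u, loc=10.0, scale=0.2)

print(sample.mean(), sample.std())   # close to 10 and 0.2
```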
Of course in practice, this is not how random draws are usually made. For one thing, it assumes knowledge of the population distribution, whereas that is usually what is being sought. But this is the mathematical analog of what a random draw should accomplish. Flawed statistical studies result
from failures to generate samples in a manner equivalent to this procedure, such as mid-20 th-century polls predicting political outcomes incorrectly because the people polled were those whose names were found in telephone books, whereas at that time the income distribution of telephone owners was biased high compared to the general voting population. Modern pollsters go to great lengths to avoid such sampling errors. In the example given in the Preface wherein the average orange weight was computed by weighing one orange from each crate, it might be wise to take care that the heavier oranges have not sunk to the bottoms of the crates as a result of being shaken during transportation. Clearly some challenges are involved in the effort to obtain representative samples of unknown populations. Outcomes of elections are predicted based on samples of voters, pharmaceutical hazards are estimated from samples of drug recipients, automobile damage to be expected from collisions is calculated from random samples subjected to crash tests, etc. Even the statistical characterization of the error in a scientific measurement employs the notion of a population of errors from which one is drawn at random in a given measurement. It follows that the relationships between samples and their parent populations are very important to understand quantitatively. That is the goal of the discipline known as “sample statistics”. The number of elements in the sample set is called the sample size. The simplest statistics are the sample mean and the sample variance. Typically the sample mean is intended to be used as an estimator of an unknown population mean (an “estimator” is a rule for computing an estimate, in this case, “compute the mean of all data in the sample”). Often the population variance is also unknown, and so the sample variance is computed as part of the process of obtaining an estimator for that, which of course requires the sample size to be greater than 1. It should be noted immediately that the sample variance is a biased estimator of the population variance, where “biased” means that the expectation value of the error is not zero. This will be discussed more fully below, where a correction factor will be derived that allows an unbiased estimator to be computed. The sample mean, however, is an unbiased estimator of the population mean. Here we assume that the draws comprising the sample came from the same population. It often happens that this is not true, most often involving populations with the same mean but different variances. In such cases, the sample mean is typically not the optimal estimator of the common mean of all the populations, but we will leave this complication to Chapter 4. For now we consider only samples drawn from a single population. For example, a measurement of stellar brightness can be modeled as a random draw from a population with some distribution whose mean is equal to the true brightness and whose standard deviation is determined by the RMS (root-mean-square) random noise fluctuations peculiar to the measurement apparatus and photon stream. The properties of the instrument must be obtained by calibration, which usually involves measuring standards of known value and seeing what numbers come out. Photon fluctuations are well modeled as brightness-dependent Poisson random variables. 
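As a small illustration of the last point, repeated photon-count measurements can be simulated as Poisson draws; the 400-count mean rate below is invented for the example and is not from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

true_rate = 400.0                            # mean photon count (illustrative)
counts = rng.poisson(true_rate, size=1000)   # 1000 repeated measurements

# For a Poisson population the variance equals the mean, so the sample
# standard deviation should be near sqrt(400) = 20.
print(counts.mean(), counts.std(), np.sqrt(true_rate))
```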
When several measurements of a given star have been obtained from a single population, the sample mean becomes the estimate of the population mean, which is the true stellar brightness, and the sample variance provides a check on the instrument calibration. The subject of sample statistics can easily fill whole chapters and even entire books, and so we will limit the remarks herein to a brief summary of some of the topics most important in scientific data analysis. More complete treatments can be found in standard textbooks (e.g., see Cramér, 1999, and references therein).
It must be borne in mind that every sample statistic is a random variable, because it is a function of other random variables, namely the random draws that comprise the sample. Therefore every sample statistic has its own density function, mean, variance, etc. It is useful to think in terms of ensembles of samples, e.g., instead of just a set of N random draws, consider an ensemble of M sets of N random draws each. One would not expect each draw in a sample to be equal to the population mean, and since each sample generally has a different set of fluctuations attached to the random draws comprising it, one should not expect the M sample means all to be equal. The sample mean will generally vary from one sample to another according to its own random fluctuations, which are distributed according to its own density function. The same is true of the sample variance and any other statistics computed from the sample. For a given population that is fully characterized by a distribution function, it is possible (in principle at least) to compute the density function for any statistic for any sample drawn from that population by applying the theory of functions of random variables, which was discussed earlier and is summarized in Appendix B. An important consideration is whether the population is believed to be Gaussian or at least acceptably approximated as Gaussian. Many important theorems exist regarding statistics of samples drawn from Gaussian populations, and when this assumption cannot be made, sample properties may be much more difficult to relate to the population. But as long as the population distribution is nonpathological (e.g., not Cauchy), the Central Limit Theorem suggests that some sample statistics will behave in a nearly Gaussian fashion. The mean is the most obvious of these, since it depends on a simple sum of the random draws. Thus the variation of the mean from one sample to another may usually be expected to fluctuate in a somewhat Gaussian manner even if the population is decidedly non-Gaussian, depending on whether the sample size is sufficient to allow the Central Limit Theorem to work its wonders. The use of sample statistics to learn about the properties of the parent population is somewhat reminiscent of the “Allegory of the Cave” in Plato’s Republic, Book VII, wherein the observers could see only the shadows on the wall cast by some unobservable reality. We know the population only by the shadow it casts, the sample. The same descriptors (mean, variance, skewness, etc.) apply to both, and we require precise notation to keep them separate. For example, Cramér uses letters of the Roman alphabet as names for sample variables, and corresponding letters of the Greek alphabet for the corresponding population variables. In the author’s experience, modern readers find that approach somewhat cumbersome, and so we will distinguish between sample and population variables by simply prepending a subscript s to sample variables and p to population variables, prepending in order to leave space for indexes, variable subscripts, and exponents. Thus the sample mean and variance are sx and sσx2, and the population mean and variance are px and pσx2. 
Although the typical application of sample statistics involves obtaining a sample of concrete numbers and using them to estimate the properties of the population, as in so many practical problems, the situation may profitably be reversed hypothetically by assuming a population of known properties and then computing what would be expected in a sample drawn from it. These computed properties may then be compared to the actual sample properties to decide whether the real and theoretical samples are essentially the same. If so, then the hypothetical population may be likewise similar to the unknown real population, and if not, then the difference may shed light on how the hypothetical population should be altered to produce a sample closer to that observed. When operating in the hypothetical space, the sample is what has to be considered random. It is in this context that the ensemble of samples mentioned above is useful, with the sample mean
being a random variable that generally fluctuates from one sample set in the ensemble to another. It is in this sense that one may speak of an uncertainty in the sample mean. In real-life applications, the sample mean is a number computed from other known numbers and has no uncertainty. The population mean, being unknown, suffers the uncertainty of our knowledge of its value. Although the sample mean is just an observed number known exactly, we know it would fluctuate over an ensemble of samples, so we know that as an estimator of the population mean it has some uncertainty. When we do not know the properties of the population, it is crucial to have some idea of how uncertain our knowledge of its mean is when we use the sample mean as an estimate of it. Any scientific use of this estimate depends on having some knowledge of the uncertainty. That knowledge is a hallmark of the difference between real science and pseudo-science. Two questions therefore arise: what does it mean to have a “correct uncertainty”, and how accurate does our value for the uncertainty have to be? The answers clearly depend on another question: what are we going to do with these uncertainties? The answer to that is: quite a lot, as we will see below and in Chapter 4. Obviously, smaller uncertainties are preferable to larger ones, as long as the smallness is not spurious. In most scientific applications, the goal of a measurement is to obtain a numerical value for some physical parameter accurate to about 1% to 3%, although at the frontiers, 10% may represent a significant accomplishment (e.g., the early days of infrared astronomy). At the other end, the measurement of the magnetic dipole moment of the electron is now done with an uncertainty of about 1 part in 1012. What does “accurate to” mean, or “with a margin for error of”? There are no rigorous mathematical definitions, but generally they mean “with an error characterized by a random-variable distribution with a standard deviation of”. Often people seem to interpret (e.g.) “accurate to 3 grams” to mean that the true value is absolutely no more than 3 grams different from the quoted value. Or “Smith is projected to win 62% of the vote with a margin for error of 3 points” is taken to mean that Smith will win by no less than 59% and no more than 65%. But those would be 100%-confidence specifications of uncertainty, whereas for (e.g.) Gaussian errors, a standard deviation, or 1σ uncertainty, is only about 68% confidence. In such cases, the true value, if ever discovered, will be outside the “accurate to” or “margin for error” range almost one out of every three times, assuming that the uncertainty was well estimated. A complete specification of uncertainty must include a statement of the confidence and the type of error distribution, but this is not done in most cases. Usually it can be assumed that 1σ uncertainties are being quoted and that the distribution is Gaussian or approximately so, but if it matters very much, clarification should be requested. A “correct uncertainty” is one such as we just described above, one out of every three times on average, the quoted value will be off by a little more than the 1σ value, assuming a Gaussian approximation. But such after-the-fact calibrations may or may not be performed, so we can require only that such would be the case if it were done. And as long as it’s hypothetical, we might as well add that the true value should differ from the measurement by at least 2σ 4.6% of the time, and by less than 3σ 99.73% of the time. 
Since exact accountings seldom take place, the quoting of uncertainty is basically done on an honor system, and better scientists perform the task more honorably. How accurate does the uncertainty estimate need to be? Again, there is no rigorous mathematical definition, but for practical matters, a good rule of thumb is: if two measurements of the same thing disagree, it is very useful to be able to determine whether they disagree by only 1 or 2 σ versus 4 or 5 σ, because the former would be considered an ordinary fluctuation, whereas the
latter would suggest that some significant systematic effect is present. This is a borderline situation, but often scientific breakthroughs occur at the borderlines. By this we mean that if the disagreement is a tiny fraction of 1σ, it is almost certainly insignificant, since factors of 2 or 3 error in the uncertainty don’t really change the interpretation, and if the disagreement is 20 or 30 σ, there is little doubt that no matter what the two observations were intended to quantify, they took the measure of two different things. But in the 2-5σ range, things are not so clear-cut, and large errors in estimating the uncertainties cannot be tolerated, or else scientific information is lost. For example, suppose we measure the brightness of a star twice, one week apart. Stellar brightness is traditionally expressed in “magnitudes”, a logarithmic function of photon count. Suppose the two measured values are independent with values M1 = 10 and M2 = 10.5. If the measurement uncertainties are 0.2, then the random variable ΔM M2-M1 has an uncertainty which is the RSS (root-sum-square) of the two individual uncertainties, and so we have a discrepancy of 0.5 with an uncertainty of about 0.28, hence a 1.77σ case. Do we decide that the star is variable and publish? Probably not, because 3.84% of the time (assuming Gaussian statistics, and here we are concerned only with the “high tail”, since being low by 1.77σ would not suggest variability) this will happen just because of random noise fluctuations, and we don’t want to be wrong once out of every 26 times we publish something. But we probably would be motivated to invest the extra effort to get more observations. If we had overestimated the uncertainties, say as 0.4, then we would see this as a 0.88σ discrepancy, something to be expected about once out of every 5.3 cases and thus hardly remarkable, no further action needed. But if we had underestimated the uncertainties, say as 0.1, then we would see this as a 3.54σ case, which happens randomly only once out of almost every 5000 trials, so it might seem conclusive that the star is variable, and we might rush into publication, especially if another observational group is getting close to publishing their own results on this star. Unless our graduate student happened to come in and mention in passing that the uncertainties could easily be off by factors of 2. So errors by factors of 2 in estimating uncertainty won’t do. The example above suggests that if we can keep such errors down to around 20%, we’re probably on solid ground. 20% is large when it’s the uncertainty in the measured quantity, but it’s probably fairly good when it’s a highconfidence uncertainty of the uncertainty. In the author’s view, keeping the error in uncertainty estimation below 20% is a worthy goal and often not easily met. Of course, smaller is better. The example of estimating the uncertainty in a stellar brightness measurement is one in which typically the sample size is relatively small, and calibrating instrumental effects is usually the most difficult part. In most applications of sample statistics, the main source of uncertainty is the unknown nature of a population that does not depend on instrumental calibration, and sample size can usually be made large enough to expect the Central Limit Theorem to induce good behavior. 
In these cases, computing the uncertainty of the mean is straightforward, and emphasis can be placed on ensuring that all data in the sample were drawn from the same population, and that this was indeed the population of interest. We will denote the sample size as N and the ith datum in the sample as xi, i = 1 to N, and we will assume that the sample has been drawn properly from a population whose mean and variance are unknown and to be estimated. Then the sample mean and variance are
$${}_s\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad {}_s\sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - {}_s\bar{x}\right)^2 \qquad (2.38)$$
Recall that the mean, as defined above, is the value about which the second moment is minimized. In other words, if we did not have the summation expression for the mean but simply asked that the symbol represent a value that minimizes the variance, we would take the derivative of the "variance" variable with respect to the "mean" variable, set the result to zero, and solve for the "mean" variable, and this would produce the first equation above. This point is important for understanding why the sample variance is a biased estimator of the population variance. The use of any value other than the sample mean in the variance expression results in a larger value for the variance. But we know that in general the sample mean is not exactly equal to the population mean. If we knew the population mean, we would use it instead in the variance expression to get a straightforward estimate of the population variance, and since this would produce a larger number, we know that the sample variance is essentially always smaller than what we would estimate for the population variance if we knew the population mean, unless the two means were equal by pure luck. That does not imply that the sample variance is always smaller than the real population variance, of course. The population variance is some unknown constant, whereas the sample variance is a random variable that may sometimes be larger and sometimes smaller than the population variance but is slightly more likely to be smaller. This tendency makes its average value an underestimate of the population variance. The goal is to have an unbiased estimator for the latter. Our intuitive expectation that using the population mean would result in an unbiased estimator is in fact correct, although we have not proved that, and in practice it is not usually available anyway. When the population distribution is treated as completely general, the statistical relationship between it and samples drawn from it is much more complicated than when the population can be considered Gaussian. Since the Gaussian case is the most prevalent and typically not radically different from most others in qualitative aspects of sampling, we will keep this discussion simple by considering only Gaussian populations. For a full treatment, Cramér's discussion of sampling distributions is highly recommended, and proofs omitted herein can be found there. Where the Gaussian case leaves out important features, we will take specific note of that limitation. An example of that is the fact that the Gaussian distribution is symmetric about its mean, and therefore its skewness (see Appendix A) is zero. This causes the sample mean and variance to be uncorrelated random variables, which is not true for skewed distributions. Two important theorems for the Gaussian case are: (a.) sx̄ is itself Gaussian with a mean of px̄ and a variance of pσx²/N; (b.) N sσx²/pσx² is chi-square with N-1 degrees of freedom, hence it has an expectation value of N-1. Since the mean is computed using a linear sum of the sample values, we already know that its density function is the convolution of those of the summed random variables, and we know that since the latter are Gaussian, so is the former, since we saw above that Gaussians convolve to produce Gaussians. Similarly, the variance of the mean is the expectation of the squared difference between the sample mean and population mean:
$$\sigma_{{}_s\bar{x}}^2 = \left\langle \left({}_s\bar{x} - {}_p\bar{x}\right)^2 \right\rangle = \left\langle \left(\frac{1}{N}\sum_{i=1}^{N} x_i - {}_p\bar{x}\right)^{\!2} \right\rangle = \frac{1}{N^2} \left\langle \left(\sum_{i=1}^{N} \left(x_i - {}_p\bar{x}\right)\right)^{\!2} \right\rangle = \frac{1}{N^2} \sum_{i=1}^{N} {}_p\sigma_x^2 = \frac{{}_p\sigma_x^2}{N} \qquad (2.39)$$
where uncorrelated cross terms in the second-last summation were dropped because their expectation values are zero. This gives us item (a.) above. That the sample variance is chi-square to within a scale factor may be seen as follows. We start with the second line of Equation 2.38, except with both sides multiplied by N, then divide both sides by the population variance:
$$\frac{N\,{}_s\sigma_x^2}{{}_p\sigma_x^2} = \sum_{i=1}^{N} \frac{\left(x_i - {}_s\bar{x}\right)^2}{{}_p\sigma_x^2} \qquad (2.40)$$
Since the population is Gaussian and the sample consists of independent (hence uncorrelated) draws, the right-hand side would be the definition of a chi-square random variable with N degrees of freedom if the mean subtracted off inside the parentheses were the population mean. But since it is the sample mean, we lose one degree of freedom (but retain the chi-square distribution; see Cramér for a full proof). In other words, the expression for the variance in Equation 2.38 implies N-1 degrees of freedom, since the use of the sample mean uses up one such degree, whereas if we could use the population mean instead, we would not be using any information from the sample itself to determine the center about which to compute the second moment, leaving the full N degrees of freedom. Furthermore, the resulting estimator would be unbiased, because the second moment was taken about the relevant mean, not an approximation that happens to minimize the second moment. This gives us item (b.) above and provides the expectation value for the sample variance in the Gaussian case:
$$\left\langle \frac{N\,{}_s\sigma_x^2}{{}_p\sigma_x^2} \right\rangle = N-1 \quad\Longrightarrow\quad \left\langle {}_s\sigma_x^2 \right\rangle = \frac{N-1}{N}\,{}_p\sigma_x^2 \qquad (2.41)$$
This shows that if we multiply the sample variance by N/(N-1), we obtain the desired unbiased estimator of the population variance. So the degree to which the sample variance underestimates the population variance on average is that factor of (N-1)/N. Equation (2.41) depends on the chi-square nature of the sample variance, so it can be considered exact only in the Gaussian case, although it is frequently found to be a good approximation for most populations encountered in practice. A widely used convention for denoting an estimator is to place a circumflex (also known as a "caret", "hat", or "roof") over the symbol for the quantity estimated. So we have

$${}_p\hat{\bar{x}} = {}_s\bar{x}, \qquad {}_p\hat{\sigma}_x^2 = \frac{N}{N-1}\,{}_s\sigma_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - {}_s\bar{x}\right)^2 \qquad (2.42)$$
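The (N-1)/N bias and its correction are easy to demonstrate with an ensemble of samples, in the spirit of the ensembles discussed above. The sketch below is illustrative only (NumPy); the sample size, ensemble size, and population variance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 5, 200_000      # sample size and number of samples in the ensemble
pop_var = 4.0          # population variance (illustrative)

draws = rng.normal(0.0, np.sqrt(pop_var), size=(M, N))
s_var    = draws.var(axis=1, ddof=0)   # sample variance of Equation 2.38 (divide by N)
unbiased = draws.var(axis=1, ddof=1)   # corrected estimator of Equation 2.42 (divide by N-1)

print("mean sample variance:   ", s_var.mean(), "  expected:", (N - 1) / N * pop_var)
print("mean unbiased estimator:", unbiased.mean(), "  expected:", pop_var)
```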
We note in passing that the right-hand side of the second line appears often in the literature under the name "sample variance", implying an equivalence between that and an unbiased estimator of the population variance. This has created unnecessary confusion witnessed by the author about when to divide by N and when to divide by N-1. Herein we conform to Cramér's parallel definitions of sample and population moments and maintain a distinction between moments and estimators. For any sample composed of independent random draws from a single unknown population, given no further information, these are the best estimates possible for the population mean and variance, but both have some uncertainty. In general, uncertainties of sample estimators decrease as sample size increases. It is also obvious that the factor N/(N-1) eventually becomes negligibly different from unity. The sample variance does not decrease as N increases; it asymptotically approaches the population variance. The uncertainty of the sample mean as an estimator of the population mean does asymptotically approach zero as N becomes arbitrarily large, however, since (as stated above) the variance of the sample mean statistic varies as 1/N, specifically, pσx²/N. Again, this is exact only for Gaussian populations but often made to work when the distribution is unknown. Technically, this should not be called the "uncertainty of the sample mean", since once a sample has been drawn and a mean computed, that mean is known exactly without uncertainty. The quantity √(pσx²/N) is really the 1σ uncertainty of the population mean as estimated by the sample mean. The first line of Equation 2.42, however, shows that the estimator is the sample mean, they have the same value, and if the left-hand side has uncertainty, then it is common to see that ascribed also to the right-hand side, but this misses a subtle distinction between the two uses of this value. On the right-hand side, we have a number computed from other known numbers, hence a known value with no uncertainty. On the left-hand side, that value is used as an estimate of an unknown value that does (being unknown) have uncertainty. Since the population variance is typically unknown, its estimator must be used in calculating an estimate of the uncertainty of the population mean:
$${}_p\hat{\sigma}_{\bar{x}}^2 = \frac{{}_p\hat{\sigma}_x^2}{N} = \frac{N}{N-1}\,\frac{{}_s\sigma_x^2}{N} = \frac{{}_s\sigma_x^2}{N-1} \qquad (2.43)$$
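A minimal sketch of Equation 2.43 applied to a single sample follows; the measurement values are simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# A single sample of N simulated brightness measurements (illustrative values).
x = rng.normal(10.0, 0.2, size=25)
N = x.size

s_mean = x.mean()
s_var = x.var(ddof=0)                          # sample variance, Equation 2.38
mean_uncertainty = np.sqrt(s_var / (N - 1))    # square root of Equation 2.43

print(s_mean, "+/-", mean_uncertainty)
```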
For the ideal case of a Gaussian population (of unknown mean and variance) and a properly drawn sample, the uncertainty of this estimate can be computed, i.e., the uncertainty of the uncertainty of the mean. This follows from the relationship between the sample variance and a chisquare random variable with a mean of N-1, whose variance is therefore 2(N-1), twice the number of degrees of freedom. Of course, this too is an estimate, since it depends on the unknown population variance, for which we must substitute its estimator, which has its own uncertainty. So we will look first at the uncertainty in the estimate of the population variance. Using V( ) to denote the variance of the argument,
$$V\!\left(\frac{N\,{}_s\sigma_x^2}{{}_p\sigma_x^2}\right) = \frac{N^2}{{}_p\sigma_x^4}\,V\!\left({}_s\sigma_x^2\right) = 2(N-1) \;\;\Longrightarrow\;\; V\!\left({}_s\sigma_x^2\right) = \frac{2(N-1)}{N^2}\,{}_p\sigma_x^4$$

$$V\!\left({}_p\hat{\sigma}_x^2\right) = V\!\left(\frac{N}{N-1}\,{}_s\sigma_x^2\right) = \frac{N^2}{(N-1)^2}\,V\!\left({}_s\sigma_x^2\right) = \frac{2}{N-1}\,{}_p\sigma_x^4$$

$$\hat{V}\!\left({}_p\hat{\sigma}_x^2\right) = \frac{2}{N-1}\left(\frac{N}{N-1}\,{}_s\sigma_x^2\right)^{\!2} = \frac{2N^2}{(N-1)^3}\,{}_s\sigma_x^4 \qquad (2.44)$$
So the estimated uncertainty in the estimated population variance is the observed sample variance multiplied by a factor that drops off approximately as 1/√N. We can define a relative uncertainty:
$$R\!\left({}_p\hat{\sigma}_x^2\right) \equiv \frac{\sqrt{\hat{V}\!\left({}_p\hat{\sigma}_x^2\right)}}{{}_p\hat{\sigma}_x^2} = \frac{\sqrt{\frac{2N^2}{(N-1)^3}}\;{}_s\sigma_x^2}{\frac{N}{N-1}\,{}_s\sigma_x^2} = \sqrt{\frac{2}{N-1}} \approx \sqrt{\frac{2}{N}} \qquad (2.45)$$
Suppose we want to know the population standard deviation to 10% accuracy. Then to first order, we need to know the population variance to 20% accuracy, which implies that we need a sample size of 50. But this too is just an estimate, not a guarantee. Perhaps we should compute the uncertainty in this estimate. But that would also be an estimated uncertainty. Clearly an infinite regress presents itself. In practice, one generally avoids absurdly small sample sizes and then makes the very conservative assumption that the error in the uncertainty of the uncertainty is surely less than 100%. So with a sample size of 50, we feel confident that the uncertainty in the estimate of the population variance is at least 10% and no more than 40%, which leaves us with an estimated accuracy of the population standard deviation between about 5% and 18%. That translates directly into the uncertainty of the estimate of the population mean, and that should be good enough to allow us to decide whether we have statistically significant evidence whether that star is variable.
In practice this calculation is seldom done, because in real-world problems with typical sample sizes, these uncertainties become dominated by imperfections in the sampling process itself. There are few perfectly Gaussian populations outside the Platonic realm of mathematics. Real populations are more often mixtures of Gaussian populations, which (as we saw in section 2.9) tend to be significantly non-Gaussian while remaining fairly non-pathological, so that Gaussian approximations still usually produce acceptably accurate results. For example, the case of "Smith is projected to win 62% of the vote with a margin for error of 3 points": one might ask "why not use twice as large a sample and cut that margin for error almost down to 2 points?" Besides the added cost, the reason is probably that enlarging the sample would not reduce the uncertainty stemming from imperfections in the process of sampling the relevant voting population. Making predictions based on samples is far from being a mechanical procedure. The greater success of some pollsters compared to others follows from their superior expertise in constructing samples and estimating how inevitable deficiencies translate into uncertainties beyond those known to the ideal mathematical formulation. Our feeling of being on fairly solid ground depends on our feeling that the population is probably not terribly non-Gaussian. If we don't have that feeling, we need to take special action which is beyond the present scope. In any case, it is nice to have some way to estimate whether the almost-Gaussian assumption is safe to make, and fortunately there are some ways to do that. These methods depend on "shape" parameters, the most common of which are the skewness and kurtosis. The problem with these is that they are higher moments, and the higher the moment, the more it is jostled around by fluctuations in the tails of the distribution. That means that the more accuracy we need, the larger the sample size has to be, since all of these estimators have uncertainties that drop off approximately as 1/√N. We have seen that above for the uncertainty of the population mean and the relative uncertainty of the population standard deviation. Similar formulas have been derived by Fisher (1929, 1930) for the skewness and kurtosis of samples drawn from a Gaussian population, where we denote skewness S and excess kurtosis (i.e., kurtosis − 3) K:

$$\left\langle {}_sS \right\rangle = 0, \qquad \sigma_{{}_sS}^2 = \frac{6\,(N-2)}{(N+1)(N+3)}$$

$$\left\langle {}_sK \right\rangle = \frac{-6}{N+1}, \qquad \sigma_{{}_sK}^2 = \frac{24\,N(N-2)(N-3)}{(N+1)^2(N+3)(N+5)} \qquad (2.46)$$
For a Gaussian population, the sample skewness and excess kurtosis are asymptotically Gaussian, and so we can construct an approximately chi-square random variable with 2 degrees of freedom as follows:
$$D^2 = \frac{{}_sS^2}{\sigma_{{}_sS}^2} + \frac{\left({}_sK - \left\langle {}_sK \right\rangle\right)^2}{\sigma_{{}_sK}^2} \qquad (2.47)$$
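A sketch of the D² test follows, assuming SciPy's moment-based sample skewness and excess kurtosis as stand-ins for sS and sK; the uniformly distributed sample of size 300 mirrors the example discussed below, and the seed is arbitrary.

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 300)     # a decidedly non-Gaussian population
N = x.size

S = skew(x)                        # sample skewness
K = kurtosis(x, fisher=True)       # sample excess kurtosis (kurtosis - 3)

# Gaussian-population moments of S and K from Equation 2.46.
var_S  = 6.0 * (N - 2) / ((N + 1) * (N + 3))
mean_K = -6.0 / (N + 1)
var_K  = 24.0 * N * (N - 2) * (N - 3) / ((N + 1)**2 * (N + 3) * (N + 5))

# Equation 2.47; for a Gaussian sample D2 averages about 2, so values far
# above that flag non-Gaussian behavior.
D2 = S**2 / var_S + (K - mean_K)**2 / var_K
Q = np.exp(-D2 / 2.0)
print("D2 =", D2, " Q =", Q)
```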
So if the population is Gaussian, we can expect the sample skewness and excess kurtosis to obey these relationships, and D2 should follow (approximately) the distribution of a chi-square random variable (see Appendix E) with 2 degrees of freedom, i.e., a mean of 2, a variance of 4, and Q = exp(-D2/2). If a better approximation is needed, the “Omnibus K2 statistic” based on the D’Agostino transformation of variables is recommended. This makes up for the fact that the skewness and kurtosis approach their Gaussian asymptotes very slowly, but the details are beyond the current scope (see D’Agostino, 1970; D’Agostino et al., 1990). Typically, sample sizes well above 100 are needed for these tests to be useful. For example, given a uniformly distributed population, which is about as non-Gaussian as one finds in the real world, the D2 statistic averages around 4.6 for N = 100, a Q value of about 0.1, which is not a very loud alarm for non-Gaussian behavior, whereas the K2 statistic averages around 29, which is convincingly significant at Q = 5×10-7. At N = 300, the D2 statistic begins to be useful, averaging about 16.3, Q = 2.9×10-4. One warning about deriving estimates of physical parameters from measurement samples: while the methods described immediately above are valid for estimating population parameters, if the population mean is afflicted with a systematic error relative to the true value of the desired physical parameter, that systematic error will persist into the sample mean, hence the sample-based estimator of the physical parameter, and it will cancel out of the sample variance, hence not be taken into account in the sample-based uncertainty of the physical parameter. Systematic errors must be taken into account during instrument calibration and either removed, or at least an appropriate uncertainty must be estimated for them. This will be discussed in more detail in Chapter 4. The last topic to be discussed in this overview of sampling statistics is the correlation coefficient for paired variables. This is one of the most important aspects of this discipline, since causal connections are often first discovered as correlations in samples. An example is the highly controversial claim of a correlation between glioma and cell phone usage. Even when the correlation is due to a common agent, the fact that it exists is often of critical importance. What must be established is not only the extent of the correlation but also its statistical significance, since as mentioned above, some non-zero correlation can always be expected in vector samples just because of random fluctuations. Correlated random variables were discussed in section 2.10, and Equations 2.22 and 2.23 (p. 67) show how the coefficient is computed for a sample of 2-dimensional vectors (for higher dimensions, the process is the same for all 2-dimensional subspaces). Here we will be concerned with computing the corresponding statistical significance for samples drawn from Gaussian populations of 2-dimensional vectors, each of whose elements are uncorrelated, i.e., the null hypothesis is that there is no correlation, and then we ask whether the sample contradicts this hypothesis. Just as with the uncertainties of the sample-based estimators of skewness and kurtosis, better accuracy can be obtained through a transformation of variables. This was provided by Fisher (1941) and is known as the Fisher z-transform:
$$z = \frac{1}{2}\,\log\frac{1+\rho}{1-\rho} \qquad (2.48)$$
Note that this expression is undefined for ρ = ±1. If one gets a 100% correlation of either sign, one must consider it absolutely significant. Otherwise z can be tested as follows, based on Fisher's
demonstration that z is approximately Gaussian with a mean of zero and a variance of 1/(N-3): given the null hypothesis that the correlation is zero, a straightforward test using the Gaussian cumulative distribution is possible. But clearly, we must have a sample size greater than 3. It suffices to say that for a sample size of 3 or less, it would be foolhardy to expect to calculate a meaningful correlation coefficient. Otherwise, small sample size just incurs the usual penalty that statistical significance is difficult to obtain. The example in section 2.10 of correlation between mass and volume of oranges had a sample size of 10 and a correlation coefficient of 0.942. Under the null hypothesis, this is a fluctuation of a zero-mean Gaussian random variable with a standard deviation of 1/√7, so about 2.49σ. Standard tables of the Gaussian cumulative distribution (or the error function built into most scientific programming languages; see Appendix F) reveal that this large a deviation from the mean could happen randomly only about once in every 78 trials, i.e., 2Q = 0.0128, where we use the "two-tailed" distribution because z can fluctuate randomly to either side of zero. That's enough to cast serious doubt on the null hypothesis. If we did not already know the answer from the simple physics, we would conclude that the mass and volume of oranges are very probably correlated. The reason for not achieving greater confidence is that the sample size is simply too small. The same correlation with a sample size of 100 would be a 9.37σ fluctuation, 2Q = 7.33×10⁻²¹, something we expect to happen once in every 1.36×10²⁰ trials, leaving no doubt that the correlation is real. This shows a potential pitfall, however. It is not unusual to encounter sample sizes of tens of thousands or more. In the last few decades, some gigantic data bases have been built up, such as astronomical catalogs with millions of entries. For example, for a sample size of 1,000,003, z has σ = 0.001. A random correlation of 0.005 would be a 5σ event, i.e., 2Q = 5.74×10⁻⁷, something to expect once every 1.74 million trials, or in other words, extremely significant. Should this small correlation be considered nevertheless virtually certain to be a real property of the population? Probably not; it is unlikely to have come from random sampling fluctuations, but it could have been produced by roundoff errors due to limited precision in the computation. In other words, such analyses involve not only sampling a population but also sampling the imperfections of computer arithmetic. Computing statistics with gigantic samples entails correspondingly large numbers of arithmetic operations, and that leads to larger random-walk deviations from perfect numerical accuracy. There is no escaping the need to use judgment in interpreting statistical results. We must stress that this section has only highlighted some of the more important topics encountered in sampling statistics. Many fascinating aspects have gone unmentioned because of the limited scope, such as order statistics (e.g., given a sample from a known population, what density function describes the third-largest member of the sample?), analysis of variance (estimating whether two samples came from the same population), robust estimation, and non-parametric statistics. A few of these will come up in Chapter 4, but the reader is encouraged to pursue a more complete study of this discipline.
2.12 Summary

The attempt to form a clear mathematical notion of randomness has been a recent chapter in the long history of humanity’s mental grappling with intuition-defying concepts. Mathematics depends on immaculate notation and the rigorous application of logical rules for manipulating
symbols, and so it had to wait for the invention of symbols. This activity took place over quite an extended period of time. What led up to it is mostly prehistoric, since history itself depends significantly on written language, and any earlier work had to be transmitted orally until such time as what survived could be recorded symbolically. Along with other abstractions that had to be made to get this process going, the concept of a number had to be brought into focus. What seems rather trivial to us today was actually considered quite erudite in the days when written numbers were used mostly by merchants to keep track of trade agreements and farmers to assist in scheduling planting and harvesting seasons. Gleick (2011) gives a nice historical summary of the development of symbolic representation of information and the difficult hurdles that the ancients had to get past. Barrow (2000) provides an interesting account of the search for the concept of zero, whose absence crippled mathematics for a surprisingly long time. The human mind took to the concept of actual things being present in different amounts, but their absence was too subtle to be easily integrated into the numerical concepts.

To separate the notion of number from specific collections of things required one level of abstraction. To represent specific numbers symbolically required another, and to represent a whole family of numbers with a single symbol, as required for the development of algebra, is yet another. Algebra could not blossom until the concept of nothing was crystallized into a symbol that could be used with those of others and take its place in polynomial equations. Even today, the concept of “nothing” occasionally confounds the public. Taken at face value, for example, a statement like “Nothing works better than our product” can be seen to be self-defeating: if nothing works better, then one would be better off with nothing, and it probably would cost less as well.

Even after the zero symbol allowed nothingness to be used in equations, there was resistance to allowing negative solutions, since these seemed at first to have no physical meaning. It strikes us today that it took longer than should have been necessary to grasp the idea that if your bank balance is negative, that means that you owe more than you have. But what about the negative energy that is radiated into a black hole as part of the process of evaporating it? Is negative energy a sort of “selling short” of the universe’s energy content, to be paid back later? No less a luminary than Paul Dirac employed negative energy to describe positrons as holes in an infinite sea of electrons at negative energy levels. The mathematics worked perfectly well. A hole acted like a positively charged electron with the electron’s mass. This is no longer the consensus interpretation of the quantum-mechanical vacuum, but the notion of negative energy persists, while the fundamental nature of the vacuum continues to defy our understanding.

After the visionaries of their time embraced negative numbers and fractions, a new obstacle appeared: irrational numbers. Some equations appeared to have solutions that could not be represented as a fraction with integers in the numerator and denominator. The main battleground turned out to be the diagonal of a square. It was known that some right triangles (i.e., having one 90° angle at a vertex) had three sides that all had integer lengths (e.g., 3, 4, and 5).
A square could be seen as two right triangles with equal smaller sides, one triangle inverted and joined to the other at the hypotenuses. So the diagonal should behave like the hypotenuse of a triangle, but for a 1×1 square, no rational number could be found that could represent the diagonal length. The Pythagorean Theorem was well known, and so it was clear that the diagonal had a length of √2, but actual blood was shed over the argument whether this could be represented as a ratio of two integers. Once the inevitability of irrational numbers was realized, they were still considered outcasts by many for a long time. Dunham (1991, 2004) and Derbyshire (2004) give excellent narratives of
the struggle to tame numbers via algebraic equations, including the gradual absorption of negative solutions, irrational solutions, and eventually even “imaginary” solutions to algebraic equations. Nahin (1998) gives an interesting account of the story of imaginary numbers, without which it seems impossible that we would have many of the highly developed theories of modern times, notably Quantum Mechanics.

So history shows that conceptual difficulties were experienced in the process of absorbing negative numbers, zero, irrationals, and imaginary numbers. It was therefore to be expected that random numbers would present some challenges. All of these stumbling blocks persist to some extent today among some segments of the population. Many people have trouble simultaneously entertaining mutually incompatible hypotheses, something required if one is to formulate alternative scenarios for the purpose of evaluating corresponding likelihoods. There is also apparently an obstacle to considering deterministic processes in the probabilistic context required for quantifying uncertainty regarding those deterministic processes.

Even among mathematicians there has been a lack of consensus regarding the correct interpretation of some aspects of random numbers. In recent times, the most obvious hostilities have been between Bayesian and non-Bayesian theorists (the latter typically called “frequentists”). Even so great an authority as R. A. Fisher never relinquished his insistence that the conceptual foundation of Bayesian statistics was fatally flawed. Meanwhile, Bayesian theorists proceeded unhindered to develop powerful estimation techniques that seemingly could never have succeeded if they had been founded upon an erroneous interpretation of how probability relates to randomness.

There may be no complete cure for these afflictions of conceptual cloudiness. After all, the human intellect is not infinitely powerful. But the drive to understand as much as possible about our shared experience demands that we continue to press on towards greater clarity wherever it can be achieved, and honing our intuition for randomness seems a worthy goal in view of the role it appears to play in the most fundamental physical processes. The main tool at our disposal is the study of mathematical probability and statistics, since these are the instruments with which we seek to domesticate randomness, whatever its nature may turn out to be. But the goal of this book is not to be a text on probability; it is to explore the nature of randomness itself.
Chapter 3 Classical Statistical Physics

3.1 Probability Distributions Become Relevant to Physics

The 19th century witnessed rapid progress in mathematical physics based on the innovations of the previous generation. While much of what we think of now as mathematical probability theory was created in the 20th century through the work of Fisher, Kolmogorov, the Pearsons, Shannon, etc., in fact the 19th century gave birth to Statistical Mechanics, which made very productive use of (and contributed greatly to) the development of classical probability theory. Statistical Mechanics followed closely on the heels of Thermodynamics, and their relationship is thought by many to foreshadow a new similar relationship between the combination of Quantum Mechanics and General Relativity as potential parents of a Quantum Gravity Theory. The similarity is in the fact that Thermodynamics is based on the mathematics of the continuum and is concerned with macroscopic phenomena, whereas Statistical Mechanics accounts for the fact that the microscopic processes underlying Thermodynamics involve discrete particles. Thus the relationship between temperature, pressure, and volume in Thermodynamics was eventually found to have its origin in the interactions between molecules, although these molecules themselves are still described as moving about in a continuous space. When a successful Quantum Gravity Theory is eventually achieved, many expect it to involve a discrete form of what Einstein called the “Space-Time Continuum” (e.g., Sorkin, 1998), although the properties of the “atoms of spacetime” may possibly remain formalized within a continuum.

Interest in Thermodynamics was augmented by progress in the design of steam engines, an activity that had been going on since the 17th century but was just beginning to provide machines that operated with an efficiency useful for large-scale applications. Along with electricity, magnetism, light, and chemical reactions, heat was an object of intense study by the scientists of the 19th century. A large amount of experimental information had been accumulated regarding the macroscopic effects of heat, and attempts to formulate general laws governing its behavior led to the formal development of Thermodynamics, which then fed back into the design of thermal engines, especially those employing steam power. All of this took place without needing to understand the nature of heat in terms of fundamental physical processes. Since the 17th century, general notions had existed regarding a substance called phlogiston that existed in combustible substances and was released when the substance burned, and heat was viewed as a fluid called caloric that flowed from hotter to cooler bodies as a result of some self-repulsion. The great Lavoisier, considered by many to be the founder of modern chemistry, subscribed to the caloric theory and enjoyed considerable success without a more rigorous theory of heat. In 1820, Sadi Carnot began developing a scientific description of how steam engines work, and the theoretical development of Thermodynamics was underway.

Besides setting the stage for Statistical Mechanics, the relevance of Thermodynamics for our present context is simply the fact that it gave birth to the concept of entropy, which was later related to statistical behavior. Other than that, random processes play little role in the formalism of Thermodynamics. Here we are using that name in the sense of what is sometimes called “Classical Thermodynamics” to distinguish it from
what is sometimes called “Statistical Thermodynamics”. The latter is also commonly called “Statistical Mechanics”, and that is the name we will use herein.

We must note that “randomness” in classical statistical physics was regarded strictly as a mathematical convenience, just a way to average over distributions of large collections of particles. We mentioned in the Preface that guessing the day of the week corresponding to August 19, 2397, without time to consult or compute anything, involves a probability of 1/7 that any given guess is correct. The lack of any actual randomness in the relationship between dates and days of the week is no obstacle to applying probability theory to decide whether 10-to-1 odds are worth a bet. Similarly, the legacy of Newton, Lagrange, Hamilton, and others who contributed to Classical Mechanics left no room for random processes in Nature, but their heirs in the latter 19th century quickly embraced the practice of using methods of probability theory to average over different microscopic physical configurations to compute corresponding macroscopic conditions. But at first, only the macroscopic conditions were familiar. The job of organizing the notions of heat, pressure, and temperature fell to the new discipline that became known as Thermodynamics.

3.2 Thermodynamic Foundations of Thermal Physics

Among the impressive achievements of the 19th century scientists was the ability to make rather precise measurements of temperature without even knowing at first what temperature was, other than something proportional to heat content. It became clear that heat itself was not a fluid but rather something involving energy, since energy provides the wherewithal to do work, and work was being done by machines employing heat. These engines primarily used heated gases, and so most of the theoretical focus was on the gaseous state of matter, but the relevance of thermal physics to solids and liquids was also recognized. Studying the gaseous state naturally drew attention to the other two notions at the heart of Thermodynamics, pressure and volume, which at least were already very accessible to the intuition and relatively easy to measure in the laboratory.

Thus a great deal of the work done to develop Thermodynamics was concerned with the “state variables” pressure P, volume V, and temperature T, and formulas linking their mutual variations in physical systems were called “equations of state”. One of the most useful theoretical spaces within which to analyze the state variables was found to be the PV plane, i.e., pressure as a function of volume. This developed naturally from the observation that a gas contained in a piston chamber exerted a greater pressure as the volume was made smaller. In addition it was learned that compressing a gas caused its temperature to rise, and that increasing the temperature of a gas caused its pressure to rise. Empirical evidence accumulated to suggest that the product of P and V was proportional to T, i.e., for a given amount of gas at constant temperature, P and V were hyperbolically related, and their product was linearly related to T. This appeared to be true to high approximation for different gases and is called the “perfect gas law”:
PV = CT    (3.1)
where C is a constant of proportionality. C was later shown to have the form mR, where R is the “universal gas constant”, and m is the number of gram-molecular weights of the gas in the chamber. Also called a mole, a gram-molecular weight is an amount of a substance whose weight in grams
equals its molecular weight, e.g., 32 grams of O2, molecular oxygen composed of two oxygen atoms, each with an atomic weight of 16. C is also equal to nk, where k is the Boltzmann constant (which we will encounter below; its modern value is 1.3806488×10⁻¹⁶ erg/deg Kelvin), and n is the number of gas molecules. But while the notion of molecules was useful in chemistry, wherein the concept of different molecular weights arose from weighing the reagents and products of various chemical reactions and the proportions in which different chemicals combined, the objective physical reality of atoms and molecules was not accepted by most mainstream scientists until much later. These chemistry experiments also led to identifying the relationship between Boltzmann’s constant k and the universal gas constant R, namely R/k = NA, where NA is Avogadro’s constant, whose modern value is 6.02214076×10²³ particles/mole, i.e., NA molecules of a given species have a combined mass equal to the molecular weight in grams. Methods were devised to keep the amount of gas in the piston chamber constant and to employ “heat baths” to control its temperature as state variables were changed. All of these variations could be plotted in a PV diagram, and this aided in quantifying causal relationships.
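As a small numerical aside (not part of the original discussion), the relations C = mR = nk and R/k = NA can be checked directly from the modern SI values of the constants; the amount of gas chosen below is arbitrary.

# A small numerical check (SI units) of C = mR = nk and R = N_A * k,
# using the modern defined values of the constants. The amount of gas is arbitrary.
k_B = 1.380649e-23            # Boltzmann constant, J/K
N_A = 6.02214076e23           # Avogadro constant, 1/mol
R = N_A * k_B                 # universal gas constant, J/(mol K); ~8.31446
moles = 2.0                   # m in the text: gram-molecular weights of gas (illustrative)
n_molecules = moles * N_A     # n in the text: number of molecules
print(R, moles * R, n_molecules * k_B)   # the last two numbers (mR and nk) agree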
Figure 3-1. The Carnot Cycle: starting at point 1, the path 1→2 is an isothermal expansion; 2→3 is an adiabatic expansion; 3→4 is an isothermal compression; 4→1 is an adiabatic compression. Each step is quasistatic, i.e., departures from equilibrium are negligible. In one cycle, work is done at the expense of the heat source that keeps path 1→2 isothermal. Excess heat is dumped into a sink during path 3→4.
Experiments were done in which the temperature T of a gas was held constant at a value we will call Th while reducing the pressure from P1 to P2 by allowing the volume to increase from V1 to V2, i.e., by allowing the gas in the chamber to do work by moving the piston. The locus of intermediate (P,V) values could be plotted in the PV plane, where it was found not to be a straight line. Instead it had a downward curvature compatible with the hyperbolic relationship of Equation 3.1. This is shown in Figure 3-1 as the curve between the two points labeled 1 and 2. Because of the constant temperature, this is called an isothermal process. Since the gas has a natural tendency to
cool when its volume increases and its pressure decreases, it draws energy from the heat bath as it is held to the constant temperature. In this case, the heat bath is called a “heat source”. In order for this process to work in an ideal fashion, it must be carried out quasistatically, i.e., so slowly that for all practical purposes, the system is never out of equilibrium, i.e., the gas is arbitrarily close to being uniform in all state variables over its entire volume during the process. It was found that a useful continuation of this procedure was to follow it with another reduction in pressure and expansion in volume, but this time with no heat bath. In this case the natural tendency of the gas to cool lowers its temperature from Th (the “hot” temperature) to Tc (the “cool” temperature), as no heat energy is allowed to enter or leave the piston chamber. This is called an adiabatic process and is shown in Figure 3-1 as the locus in the PV plane running from point 2 to point 3. This is also curved but with a steeper average slope resulting from the temperature being allowed to drop. Again, the gas does work by moving the piston. If this is followed by a judiciously chosen isothermal compression using a different heat bath with temperature Tc, the system can be made to arrive at point 4 in Figure 3-1, from where an adiabatic compression will take it back to point 1, heating it along the way to the original temperature Th. Both of these compression phases involve doing work on the gas. During the isothermal compression, the gas will have a natural tendency to heat up, and so it must lose heat energy to the heat bath, which is therefore called a “heat sink”. These four steps constitute what is called a “Carnot Cycle”, an idealized model of how a heat engine can operate between a heat source and a heat sink and convert heat energy into work. Since the system has arrived back at point 1, the cycle can be repeated as many times as needed (assuming ideal heat baths). Our present scope prevents doing anything more than scratching the surface of Thermodynamics. Except for some very fundamental aspects, we must quote results without proof and encourage the reader to explore this pillar of physics more completely by consulting any of the standard texts (e.g., Callen, 1960; Morse, 1964; a classic text is Planck, 1921; an excellent undergraduate-level presentation may be found in Resnick & Halliday, 1961). Our interest is really just to get some feeling for the PV plane, since this is essential for understanding the Thermodynamic definition of entropy. The Carnot Cycle is a classic example of the use of the PV plane and a perfect context for introducing entropy. First we note that the net work done by the Carnot Cycle is done at the expense of the heat source. The work done by the gas in the adiabatic expansion is equal to the work done on the gas in the adiabatic compression, since no external heat energy is transferred to or from the gas during these steps, which therefore cancel each other as far as work is concerned. Not all of the heat energy taken from the source is converted into work, however. Some is dumped into the heat sink. The difference is related to the engine’s efficiency, which is defined as one minus the ratio of the heat dumped in the sink to the heat taken from the source. But what is crucial is the fact that work is done. Less heat is dumped to the sink than is taken from the source. 
Denoting the heat energy of the gas by Q, with differential dQ that we integrate along a path in the PV plane, the integral of dQ along the path 1→2 is ΔQ12 > 0. The adiabatic paths 2→3 and 4→1 cancel each other. The path 3→4 yields ΔQ34 < 0. Since work is done, |ΔQ12| > |ΔQ34|. Next we note that since each step in the cycle is done quasistatically, each step is reversible. In other words, we can run the Carnot Cycle backwards to take heat energy from the lower-temperature bath and deposit it in the higher-temperature bath. In this case, we must provide the work to force heat to flow against a temperature gradient. But the point is that when run backwards,
what had been expansion steps become compression steps, and vice versa. Thus instead of losing heat energy ΔQ34 < 0 on the path 3→4, the gas gains heat energy ΔQ43 = -ΔQ34 > 0 on the path 4→3. Since the adiabatic path 2→3 involves the same magnitude of work as the adiabatic path 1→4, and since |ΔQ12| > |ΔQ34|, the paths 1→4→3 and 1→2→3 arrive at point 3 with different amounts of heat energy added to the gas. Thus the integral of dQ in the PV plane is path-dependent, and one cannot determine the change in the gas’s energy in going from one point in the PV plane to another without knowing which path was taken. Another way of saying this is that dQ as a function of P, V, and T is not a perfect differential.

If we look at the work done by the Carnot Cycle, however, we see that since the two adiabatic steps cancel, the net energy converted to work W must be the difference between the heat drawn from the source and that lost to the sink, i.e., W = ΔQ12 + ΔQ34 = ΔQ12 - ΔQ43. The efficiency e is defined as the ratio of the work done to the energy drawn from the source, or

e = \frac{W}{\Delta Q_{12}} = \frac{\Delta Q_{12} - \Delta Q_{43}}{\Delta Q_{12}} = 1 - \frac{\Delta Q_{43}}{\Delta Q_{12}}    (3.2)
which shows that if any energy at all has to be dumped to a sink, the efficiency will be less than 100%. The two steps that determine the efficiency are therefore the isothermal steps. Since we have assumed that temperature is proportional to the heat energy, and since temperature is constant during these two steps, it follows that the heat energy drawn from the source is proportional to the work done by the gas on that step, since ΔQ12 did not go into the gas’s internal heat energy, or it would have raised the temperature, and similarly the work done by the gas did not come out of the gas’s internal heat energy, or it would have lowered the temperature. The same argument applied to ΔQ43 then leads to

\frac{\Delta Q_{12}}{T_h} = \frac{\Delta Q_{43}}{T_c}    (3.3)

or

\frac{\Delta Q_{12}}{\Delta Q_{43}} = \frac{T_h}{T_c}    (3.4)
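A minimal numerical sketch of Equations 3.2 through 3.4, with hypothetical reservoir temperatures and an arbitrary heat input (the numbers are not taken from the text):

# Hypothetical reservoir temperatures and heat input, illustrating Eqs. 3.2-3.4.
T_hot, T_cold = 500.0, 300.0          # kelvin
Q_hot = 1000.0                        # heat drawn from the source on path 1->2 (arbitrary units)
Q_cold = Q_hot * T_cold / T_hot       # heat dumped to the sink, from Eq. 3.3/3.4
work = Q_hot - Q_cold                 # net work per cycle
print(work, 1.0 - Q_cold / Q_hot, 1.0 - T_cold / T_hot)   # 400.0 0.4 0.4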
So if we integrate dQ/T on the two paths 1→4→3 and 1→2→3, we get the same value, and dQ/T is therefore a perfect differential, i.e., 1/T is an integrating factor that can be applied to dQ to create a perfect differential which we will denote dS = dQ/T. We stress that this assumes that all paths involved in the integral are traversed in a reversible manner. The physical parameter S was named entropy by Rudolf Clausius in 1865, and it is important in Thermodynamics for several reasons, not least of which is that it endows the notion of temperature with a mathematically rigorous definition and in the process establishes an absolute scale for it known as the Absolute Thermodynamic Temperature Scale. At this point, we do not have a zero point for entropy, only the ability to calculate its change along a reversible path in the PV plane. Clearly its integral around a closed reversible path is zero, since the path ends where it began, and the entropy at that point is independent of how the system arrived at that point. If the path encloses a nonzero area, then net work is done by traversing that path. This can be seen in Figure 3-1: since pressure is a force per unit area, its product with volume
is a force multiplied by a distance, and work is a force acting through a distance. Therefore P dV is the work done by the system if the path is from lower to higher volume, and it is the work done on the system if the path goes in the opposite direction. So the work done on the path 1→2 is the area under that path, and similarly for the work done on the path 2→3. The areas under the paths 3→4 and 4→1 are negative, so the areas under them subtract from the areas under the first two paths, leaving the area inside the closed path as the net work done. When the system returns to its starting point, it has the same entropy as the last time it was there, and its state variables are the same as before, so the cycle can be repeated.

In Thermodynamics, a process which does not satisfy the above definition of reversibility is called irreversible and can happen spontaneously but never reverse itself spontaneously. A different view of this will be encountered in Statistical Mechanics, but before that was developed, it was generally believed that a thermodynamically irreversible process effectively supplies an arrow of time, since such a process can go only in one direction as time increases. In general, the change in entropy during an isolated irreversible process along a path in the PV plane is greater than the change that would occur along a reversible path with the same endpoints. If the system is not isolated, i.e., if the system is allowed to interact with its environment, then entropy may be lost or gained by the system, but if the environment is taken into account as part of the “system”, then the net change in entropy is always greater than it would have been along a reversible path between the same endpoints. Since the Universe is the most all-inclusive system possible, it follows that the occurrence of spontaneous processes invariably causes the entropy of the Universe to increase.

None of this seems to have anything to do with randomness or probability, however. The first hint of such a connection may be seen in the interpretation of entropy as a measure of the disorder of a system. An isolated system allowed to undergo spontaneous irreversible processes increases its entropy and becomes more disordered. Even in reversible processes like the Carnot steps 1→2 and 2→3, in which the system does work, its entropy increases, and its ability to do work is diminished and must be eventually restored by having work done on it in order to restore its initial condition if another cycle is to be possible. Because the energy that is to make work possible must be somewhat organized in advance into some exploitable configuration, there is an association between order and the ability to do work versus disorder and a lessened ability to do work, and all spontaneous processes evolve to states of higher entropy, greater disorder, and less ability to do work. Of all the states available to a system, it seems reasonable that its spontaneous evolution would be toward states of higher probability, and these are the states with higher entropy, although in classical physics the connection between possible states of a system and corresponding relative probabilities lacks the essential ingredient of randomness. The tenuous link is that in general there are many microscopic states that give rise to essentially the same macroscopic state, and the number of microscopic states generally varies from one macroscopic state to another.
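To make the claim that the net work equals the area enclosed in the PV plane concrete, here is a rough numerical sketch for an ideal-gas Carnot cycle. The gas parameters, temperatures, and volumes are invented for illustration, and the quadrature is deliberately crude; the point is only that the integral of P dV around the closed loop reproduces the heat drawn from the source minus the heat dumped to the sink.

import math

# Build a Carnot cycle for one mole of an ideal monatomic gas (gamma = 5/3) with
# made-up temperatures and volumes, then integrate P dV around the closed loop.
n_mol, R, gamma = 1.0, 8.314462618, 5.0 / 3.0
T_h, T_c = 500.0, 300.0                            # reservoir temperatures, K
V1, V2 = 0.010, 0.020                              # m^3: isothermal expansion 1->2
V3 = V2 * (T_h / T_c) ** (1.0 / (gamma - 1.0))     # adiabat 2->3 ends at T_c
V4 = V1 * (T_h / T_c) ** (1.0 / (gamma - 1.0))     # adiabat 4->1 starts at T_c

def integrate_p_dv(p_of_v, va, vb, steps=50_000):
    # Trapezoid rule for the work done by the gas from va to vb (negative if vb < va).
    h = (vb - va) / steps
    s = 0.5 * (p_of_v(va) + p_of_v(vb)) + sum(p_of_v(va + i * h) for i in range(1, steps))
    return s * h

def isotherm(T):
    return lambda V: n_mol * R * T / V             # P = nRT/V

def adiabat(P0, V0):
    return lambda V: P0 * (V0 / V) ** gamma        # P V^gamma = constant

P2 = n_mol * R * T_h / V2
P4 = n_mol * R * T_c / V4
W = (integrate_p_dv(isotherm(T_h), V1, V2)         # 1->2 isothermal expansion
     + integrate_p_dv(adiabat(P2, V2), V2, V3)     # 2->3 adiabatic expansion
     + integrate_p_dv(isotherm(T_c), V3, V4)       # 3->4 isothermal compression (negative)
     + integrate_p_dv(adiabat(P4, V4), V4, V1))    # 4->1 adiabatic compression (negative)

W_expected = n_mol * R * (T_h - T_c) * math.log(V2 / V1)   # Q_hot - Q_cold
print(W, W_expected)    # the enclosed area matches the heat budget to quadrature accuracy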
Since something that has many ways of happening appeals to the intuition as more probable than something that has only a few, a macrostate with more corresponding microstates seems more probable, in the sense that August 19, 2397 is more likely to be a weekday than a weekend day just because there are 2.5 times as many weekdays. But that does not depend on any randomness in the calendar. “Weekday” may also be considered a more disordered state than “weekend”, since if someone tells you that some event took place on a weekday, you know less about which actual day it was than if you had been told it happened on a weekend. Thus higher entropy became associated with greater disorder and less information content, at the expense of bringing these rather subjective notions into the analysis.
For example, consider a container with a thermally nonconducting membrane in the middle separating it into halves, with a hot gas on one side of the membrane and a cold gas on the other. The system has a certain order by virtue of the organization into a hot half and a cold half. If the membrane is removed, the gases will mix spontaneously, and the system will become disordered. Since this is a spontaneous irreversible process, the entropy increases. Before the membrane was removed, the two halves could have been used as a heat source and a heat sink capable of doing work, but this is lost after the gases mix to form a system of intermediate temperature. Thus the spontaneous process is associated with an increase in entropy, a decrease in order, and a loss of the ability to do work. Since no heat was added to or subtracted from the system, ΔQ is zero, hence the integral of dQ/T along the path in the PV plane is zero, but entropy increases anyway, because the formula dS = dQ/T applies only when the path is traversed in a reversible manner, and allowing the hot and cold halves to mix is an irreversible process.

While removing the membrane can be seen as making the organization into hot and cold halves improbable as a continuing state, in fact in both Thermodynamics and Statistical Mechanics, the diffusion of the two halves into a new equilibrium at intermediate temperature is not considered subject to uncertainty or randomness. Thermodynamics considers it certain to occur. As we will see below, in Statistical Mechanics, the actual trajectories of the gas molecules are deterministic but too complicated to compute in detail, and so statistical methods are used to average over all possible trajectories as though they were random, but this is essentially like saying that on average, Tuesday occurs 1/7 of the time in a long chronological interval. This averaging produces the same equilibrium state predicted by Thermodynamics with a standard deviation too small to be worthy of consideration in practice.

We should note that we have been assuming energy conservation in all of this. It had long been believed that certain physical quantities are conserved in physical processes. For example, Kepler’s first two laws of planetary motion (based on one of the most impressive feats of scientific data analysis in history) state that planetary orbits are elliptical with the Sun at one focus and, for any given planet, that the sun-to-planet line sweeps out equal areas in equal times. This turns out to be equivalent to stating that each planet’s orbital angular momentum is conserved. Newtonian gravity involves forces that are derivable from scalar potentials that are not explicit functions of time. This is a sufficient condition for the sum of the corresponding potential energy and kinetic energy to be conserved. So the notion of energy conservation, among other conservation principles, predates Thermodynamics, but it is among the “laws” of the latter discipline that energy conservation has become enshrined. In fact, energy conservation occupies the position of honor in Thermodynamics, being the First Law. The mathematical form it takes varies depending on how energy is defined (e.g., besides work and heat, chemical potentials may be involved, along with gravitational potentials, molecular vibrational energy, etc.), but the basic statement is that energy is conserved in any physical process. It may be converted from one form to another, but the total energy is constant.
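Returning briefly to the hot-half/cold-half example at the start of this passage, a crude way to see the entropy increase numerically is to model the two halves as equal constant-volume heat capacities that merely exchange heat until they share a common temperature. Volume and mixing effects are ignored, and the heat capacity and temperatures below are invented for illustration.

import math

# Two equal constant-volume heat capacities C exchanging heat until they share
# the common temperature (T_hot + T_cold)/2; volume and mixing effects ignored.
C = 10.0                        # J/K for each half (hypothetical)
T_hot, T_cold = 400.0, 200.0    # kelvin
T_final = 0.5 * (T_hot + T_cold)

dS_hot = C * math.log(T_final / T_hot)     # negative: the hot half cools
dS_cold = C * math.log(T_final / T_cold)   # positive: the cold half warms
print(dS_hot + dS_cold)   # > 0: total entropy rises although no heat enters or leaves the container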
Before energy conservation achieved a status resembling an 11th Commandment, it was not completely taken for granted. Experimental probes into its validity were undertaken. Planck (1921) describes the experiments performed by Joule to establish that the mechanical energy converted into work to lift an object of known weight through a measured distance was the same when used instead to create heat energy through friction. Planck says “... the state produced in the liquid by friction is identical with a state produced by the absorption of a definite number of calories” and “That all his
experiments with different weights, different calorimetric substances, and different temperatures, led to the same value [of the mechanical equivalent of heat], goes to prove the correctness of the principle of the conservation of energy.”

Several antiquated aspects of Planck’s classic treatise stand out here. He makes no mention of measurement uncertainty, and he does not provide a reference for Joule’s work (it was published in 1843 in a British journal where it encountered considerable resistance at first). It is therefore not an easy task to determine whether Joule provided an error analysis of his experiments, but if he had done so, it seems reasonable that Planck would have made some reference to the statistical significance of these verifications of energy conservation. It also seems reasonable to assume that the fact that energy was conserved to within the ability of experiments to test the principle was taken as convincing evidence that Nature conserves it exactly, and that became a philosophical sine qua non for subsequent theoretical developments in physics. Such insights are crucial to the advancement of science. For example, in recent times, similar arguments have been presented to declare that the Robertson-Walker metric (which describes the large-scale curvature of the universe and will be discussed further in Chapter 6) must be flat because observations indicate that it is so close to being so. But there are also examples of cases in which physical theories with great aesthetic beauty and mathematical elegance turned out not to be correct, at least not when interpreted as fundamental laws describing the actual behavior of the universe (examples will be given in Chapter 5).

It is now known that what Joule measured was the specific heat of water, which varies with temperature by ±1.2%. That probably explains why his value of 4.155 Joule/calorie differs from the modern value of 4.1860 by 0.74%, a discrepancy which is quite impressively small but reveals clearly that his results establish energy conservation only to about two decimal places. Of course, Joule did not express his result in terms of units named after him, but rather in foot-pounds per degree Fahrenheit, and today his unit of energy has displaced the calorie altogether in physics. The notion that some connection existed between mechanical energy and heat had been floating around for some time before Joule performed his definitive experiments that ultimately led to his receiving credit for the physical principle at the end of a surprisingly common process by which scientific progress is made: the disappointment of initial rejection, gradual grudging acceptance, acrimonious contesting of priority, and finally bestowal of a physical unit of measure.

Besides introducing the concept of entropy, we are also interested in the idea that energy is strictly conserved in physical processes, because if the activity at the most fundamental layer of reality involves any randomness, it is difficult to see how energy could be preserved exactly. Of course energy conservation could also be violated by deterministic processes (an example will be given in Chapter 5). Indeed this appears to be the case in cosmology, where the expansion of the universe causes a loss of photometric energy that is not obviously converted to anything else as the background radiation redshifts to lower frequencies.
There is also the apparent fact that new energy is constantly being added in the form of the vacuum energy of an expanding space, but to say that vacuum energy is not yet well understood is to make an extreme understatement, and being a quantum mechanical process, it probably involves some nondeterministic phenomena. Here we are anticipating the acceptance of truly random physical processes that will be discussed in Chapter 5. So the notions that we will be juggling throughout this book include randomness vs. determinism, order and information vs. disorder and entropy, conservation of physical quantities (e.g., mass, energy, momentum, charge), causality and reversibility, how all these relate to each other
and to the “arrow” of time, and the nature of the fundamental laws governing the processes that constitute the evolution of the universe at all scales (which we assume to exist as a working hypothesis, as stated in the Preface). We will not presume to provide answers to all the riddles entangled in these notions, but by undertaking the meditation, we hope to assist the reader in sorting out the most significant, appealing, and potentially productive questions to consider, something which must precede finding solutions to the corresponding problems. The primary focus will be on randomness, the ingredient whose relevance bears directly on that of each of the others.

3.3 Statistical Mechanics

Chemists had not yet accepted atoms as objectively real objects even as some physicists were computing statistical properties of gas molecules and showing that pressure could be viewed as stemming from collisions between these particles with each other and chamber walls in a way that causes pressure to increase with temperature. But even most physicists still regarded the particle model as little more than a useful mnemonic device. In retrospect it is difficult to appreciate why it took so long before mainstream science embraced the fundamental objective reality of atoms and molecules. The idea certainly was not new. In ancient Greece, Democritus and Leucippus advocated the idea that matter consisted of microscopic rigid particles joined together to form solids, rolling over each other to form liquids, and rattling off of each other to form gases. Perhaps the fact that at various times the ancient Greeks also had other ideas long since discredited (e.g., all matter being composed of earth, air, fire, and water) made the 19th-century scientists wary of leaping to conclusions.

In 1649, the French philosopher Pierre Gassendi published a work titled “Syntagma philosophiae Epicuri” in which he pointed out that the atomistic theory could explain qualitatively not only the three phases of matter but also transitions between them and several phenomena observed in the gaseous state. For this, Sir James Jeans (1962) considers him to be the father of the kinetic theory of gases. Ideas similar to Gassendi’s were also advanced later by Robert Hooke and generally accepted by Isaac Newton and Daniel Bernoulli, but another century passed before mainstream physics began making real quantitative use of the particle interpretation, and then with only a tentative interpretation. Many historians identify Einstein’s 1905 paper on Brownian motion as the tipping point after which atoms and molecules were widely considered more than just computational devices. Only then could attention focus on the structure of atoms and lead eventually to Quantum Mechanics.

As we saw above, the ideal gas law does not depend on what molecular species of gas is involved, and it was found that one mole of any gas at T = 0 centigrade and P = 1 standard atmosphere (101.325 kPa) occupied 22.4 liters. Since it was known that this involved about 6×10²³ molecules, it was obvious that each molecule must be extremely small, as the ancient Greeks had already deduced. Thus a reasonable volume of gas at typical laboratory conditions contained a horrendously large number of molecules, and there was no hope of computing their collisions and trajectories via Newtonian mechanics. Some sort of averaging over all possible combinations of molecule positions and velocities was required.
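The 22.4-liter figure quoted above is easy to verify from the perfect gas law; a short check in SI units:

R = 8.314462618     # universal gas constant, J/(mol K)
T = 273.15          # 0 degrees centigrade, in kelvin
P = 101325.0        # one standard atmosphere, in pascal
V = R * T / P       # volume of one mole of an ideal gas, cubic metres
print(V * 1000.0)   # ~22.41 litres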
The need to perform such an average raised the question: should the averaging be done over time? It was not entirely clear how to do that. But if not over time, what else was there? To answer
that question, the concept of an ensemble was created. We introduced this concept in section 1.3ff, with the ensemble containing an infinite number of independent universes, and again in section 2.11 with the concept of a population of objects from which one could be drawn randomly. In the case of Statistical Mechanics, the item which can be drawn randomly from a population is the complete system of molecules for which we wish to average over molecular trajectories. Every possible state of the system at all times is represented. Instead of calling it a population, we call it an ensemble, and instead of drawing a sample, the averaging is performed over the entire ensemble. The first ensemble found to be useful in Statistical Mechanics came to be called the microcanonical ensemble, which is characterized by the facts that it is isolated (the energy of the system is constant) and the number of particles in the system is constant.

The best way to express mathematically the states of the systems comprising the microcanonical ensemble was found to be as points in phase space. This is a space whose axes are the three position coordinates and the three momentum coordinates of every particle in the physical system under consideration. Thus each point corresponds to a single unique microstate of the system. Most of the developmental work was done for gaseous systems, and so we will focus on that state, but the basic concepts apply also to liquids and solids. So for a microcanonical ensemble of n particles, the phase space has 6n dimensions. If the system consists of just one mole of an atomic or molecular species, then Avogadro’s constant tells us that we are dealing with over 3.6×10²⁴ dimensions. The 19th-century scientists were not afraid of hyperspaces!

Thus the microcanonical ensemble occupies the 6n-dimensional space of all possible states that a system of n particles may occupy with constant energy. The condition of constant energy implies that the ensemble is confined to a hypersurface in this space that corresponds to a constant sum of the kinetic energy over the n particles (assuming in our immediate case that gravitational energy, chemical energy, etc., are not relevant, but this is not necessary in general). By “possible” states, we mean accessible states. There may be points on the constant-energy hypersurface that are not accessible to a given system. For example, a classical system consisting of a single molecule bouncing back and forth between the same two points on opposite walls of a chamber cannot ever reach any other points in the chamber that do not lie on its trajectory, even though there are such points on the constant-energy hypersurface.

It was clear that for any real liquid or gas system of n particles, each particle had to be changing its position and momentum at incredibly rapid rates (and also for solids, although the range of variation was probably much more restricted). Thus the point in phase space describing the system was jumping around at breakneck speed, but it might be spending more time in some corresponding macrostates than in others. For example, the point corresponding to all particles being in the same corner of a box seemed clearly unlikely to be occupied by the system relative to points implying a more uniform distribution. But (as mentioned in section 1.3 regarding equal probability of microstates), the question arose: are all microstates equally probable?
As difficult as it seems to imagine, there is no acceptable argument that the probability of all particles being in the same corner of a box is any different from that of any other microstate. But the overwhelming fact is that an approximately uniform distribution over the box is a macrostate with many more corresponding microstates than the macrostate “all particles are close to the same corner of the box”. Just as we saw in section 1.11, some coarse-graining of microstates is needed to form a correspondence to macrostates. And as discussed in section 2.2 and illustrated in Figure 2-1 (p. 40), if all possible branches when the system changes state are equally likely, the final microstates are equally likely, and so the assumption of
equal probability for all microstates is taken as a fundamental postulate (see, e.g., Callen, 1960, p. 14). As before, the word “likely” is used somewhat loosely to mean what we are likely to find when we do a measurement on the result of some completely deterministic process. In section 3.2 we considered a container with a thermally nonconducting membrane in the middle separating it into halves, with a hot gas on one side of the membrane and a cold gas on the other, then said that if the membrane is removed, Thermodynamics considers it certain that the two halves will mix and come into equilibrium, whereas Statistical Mechanics considers this not formally certain exactly but probable with negligible chance of not happening. Earlier we defined “equilibrium” to mean arbitrarily close to being uniform in all state variables over the entire volume.

To illustrate what Statistical Mechanics has to say about this, consider a chamber that contains one mole of a gas, hence n is equal to Avogadro’s constant, NA ≈ 6×10²³. Uniformity implies that the left half of the chamber contains half of the gas particles, 3×10²³. The microcanonical ensemble contains microstates in which the left half contains many more particles than half, others with many fewer (an extreme example being that above in which all the particles are near one corner). But we suspect that there are many more ways that the left chamber could contain half of the particles than contain (e.g.) 10% of them. For example, just consider the number of ways to flip a fair coin 300 times and get heads exactly half the time: more than 9.37597×10⁸⁸, a larger number than plausible modern estimates for the number of elementary particles in the known universe! Combinatorial arithmetic applied to numbers on the order of Avogadro’s constant generally yields results beyond mind-boggling.

To get a qualitative feel for the error in assuming that the gas evolves to a uniform state, we can try modeling the microcanonical-ensemble distribution of the number of particles in the left half of the chamber with a Poisson distribution whose mean is 3×10²³. That large a mean guarantees that the Poisson distribution is well into its Gaussian asymptote. The corresponding probability that the number of particles in the left half of the chamber is less than half of the total by at least one part in 10¹⁰ is about 2.63×10⁻⁶⁵⁴. Assuming that one were able to measure the imbalance with a precision of 10 decimal places, the chances of catching the system in at least that nonuniform a state are truly negligible, and that is why we said that the difference between the certainty of Thermodynamics and the not-quite-certainty of Statistical Mechanics was too small to be worthy of consideration in practice. But the formal difference is nevertheless important, because it places Statistical Mechanics closer to being a valid description of how the universe really behaves.

Since Statistical Mechanics is a formalism meant to illuminate the same macroscopic phenomena as Thermodynamics, some correspondences between the two sets of concepts had to be established. Volume is straightforward in both formalisms. Thermodynamic pressure is a force per unit area created in Statistical Mechanics by the change in momentum of particles bouncing off of the chamber walls. But what about temperature and entropy? In Thermodynamics, temperature is something whose inverse multiplies heat energy’s imperfect differential to obtain a perfect differential named entropy.
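Both of the large numbers quoted above, the count of equal-heads coin-flip outcomes and the 10⁻⁶⁵⁴ tail probability, can be checked with a few lines of standard-library Python; the tail is evaluated in log space because it underflows an ordinary floating-point number, and the Gaussian tail approximation Q(z) ≈ exp(-z²/2)/(z√(2π)) is used only as a back-of-the-envelope estimate.

import math

# Number of ways to get exactly 150 heads in 300 flips of a fair coin:
print(f"{math.comb(300, 150):.5e}")        # ~9.37597e+88

# Gaussian-asymptote Poisson model for the left half of the chamber:
lam = 3.0e23                               # mean number of particles in the left half
z = (lam * 1.0e-10) / math.sqrt(lam)       # a deficit of one part in 1e10, in standard deviations (~54.8)
# The tail Q(z) ~ exp(-z^2/2)/(z*sqrt(2*pi)) underflows a float, so use its base-10 logarithm:
log10_Q = -z * z / (2.0 * math.log(10.0)) - math.log10(z * math.sqrt(2.0 * math.pi))
print(log10_Q)                             # ~ -653.6, i.e. a few times 10^-654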
How could these be related to a system of particles colliding with each other and the walls of the container? We will deal first with the easier notion of pressure and restrict our attention to an isolated system of identical particles inside a closed rectangular container whose internal energy consists exclusively of the total kinetic energy of the particles. Thus at any instant, the state of the system corresponds to a point in the microcanonical ensemble. In order to keep the following discussion as simple as possible, we will deliberately choose a very specific set of points in the microcanonical ensemble for a system consisting of only one type
of molecule. As a result, our conclusions will be valid only for certain average properties of the system, having nothing to say about fluctuations. A more general discussion may be found in standard texts (e.g., Morse, 1964, chapter 12), but even the simplest of these go well beyond the complexity desirable for our present purpose, which is only to illustrate the qualitative features of how the notion of randomness is employed.

Although the molecules in the gas surely have different velocities at different times (i.e., the system migrates on the constant-energy hypersurface in the 6n-dimensional space of the microcanonical ensemble), at any instant there is an average velocity magnitude, where the averaging process is carried out over all the molecules in the system at a given instant. We are not concerned at the moment with the form of the distribution of velocities. Whatever form may apply, in principle the average must be well defined. We will assume that there must be a set of points in the phase space corresponding to all particles moving with the same (hence average) velocity magnitude on all axes, with locations distributed uniformly throughout the chamber, and equal numbers moving toward each wall bounding each axis. This allows us to sketch out an extremely simplified illustration of how pressure is related to the molecular collisions on the chamber walls. The only velocity changes occur when molecules bounce off of the walls (this implies that collisions between molecules are rare enough to be neglected). The average velocity magnitude ⟨v⟩ implies that each molecule with mass m also has the average kinetic energy, ⟨E⟩ = ½m⟨v⟩². So the total energy, which is constant, is n⟨E⟩, and since the internal energy U consists only of the total kinetic energy of the particles, we have U = n⟨E⟩.

The pressure on the container’s walls results from momentum transferred to the walls by particles bouncing off of them. Consider the left-hand wall with area A perpendicular to the X axis. A particle of mass m with speed vx and momentum component px = mvx bouncing off this wall elastically undergoes a change in momentum of magnitude 2px, since its momentum reverses. If nc such collisions occur in time Δt, then the total force exerted on the wall is F = ncΔpx/Δt. Of course, in general, all these collisions would involve different momentum values, and we would have to add up all the individual contributions weighted by the distribution function, but this would be the same as the average contribution multiplied by nc, so to keep things as simple as possible, we are just taking all collisions as involving the average momentum. The average pressure on the wall over this time interval is P = F/A, and nc is just the number of particles moving toward the wall (hence half of all particles) and within a distance that can be closed within Δt at a speed of ⟨vx⟩. Since the extent of the container in the X direction is ΔX = V/A, the fraction of all particles that will hit the wall within Δt is ½⟨vx⟩Δt/ΔX. Multiplying this by the total number of particles n yields nc. Putting these pieces together,

n_c = \frac{n}{2}\,\frac{\langle v_x\rangle\,\Delta t}{\Delta X}

P = \frac{F}{A} = \frac{2\langle p_x\rangle\, n_c}{A\,\Delta t} = \frac{2\langle p_x\rangle}{A\,\Delta t}\cdot\frac{n}{2}\,\frac{\langle v_x\rangle\,\Delta t\,A}{V} = \frac{n\,\langle p_x\rangle\,\langle v_x\rangle}{V}    (3.5)
Since in this extremely simplified derivation we have selected all molecules to be moving with the average velocity magnitude on each axis, there is no dispersion, and we can set ⟨vx²⟩ equal to ⟨vx⟩². The equation of state is therefore PV = n⟨px⟩⟨vx⟩ = nm⟨vx²⟩. Since the average velocity magnitude in any direction is the same as in any other direction, we have

\langle v^2\rangle = \langle v_x^2 + v_y^2 + v_z^2\rangle = \langle v_x^2\rangle + \langle v_y^2\rangle + \langle v_z^2\rangle = 3\langle v_x^2\rangle

U = n\left\langle \frac{mv^2}{2}\right\rangle = \frac{3}{2}\, nm\,\langle v_x^2\rangle = \frac{3}{2}\, PV    (3.6)

PV = \frac{2}{3}\, U
The perfect gas law also tells us that PV = nkT, so we have U = (3/2)nkT, and in the present case, ⟨E⟩ = (3/2)kT, hence ⟨v²⟩ = 3kT/m, or T = m⟨v²⟩/(3k). Thus temperature in Statistical Mechanics is determined by the average velocity magnitude of the particles in the container and their mass in this case wherein all the energy is kinetic. Since ⟨v²⟩ = 3⟨vx²⟩, the average energy per particle per degree of freedom is ½kT. This is a more general result than can be inferred from the simplified description above. It actually does follow from much more general considerations and extends to all forms of energy.

When we assumed that collisions between molecules are rare, we assumed that the gas was not very dense, and so it might seem that the results we derived would not apply to a dense gas. This is not true, however; we simply would have had to take into account the distribution of velocities instead of assuming that all molecules moving toward a given wall and within striking distance of it within the time interval Δt actually make it to the wall. To abandon that assumption, we would have had to know what distribution to use, and getting around that obstacle would have taken us too far afield, only to arrive at the same conclusions regarding average values. Since the role of fluctuations in Statistical Mechanics is of immense importance, the reader is invited to consult standard texts to probe more deeply into this cornerstone of physics.

There is much to explore regarding what happens when various constraints defining the microcanonical ensemble are relaxed. Here we can provide only a brief sketch of some of the topics. For example, in the microcanonical ensemble, the probability that the system will be in a state corresponding to a phase-space point that is off of the constant-energy hypersurface is zero. If instead the system energy E is no longer taken to be constant, then different accessible states may have different energies. Taking the probability that the system will be in a state with energy E to be proportional to exp(-E/kT) leads to the canonical ensemble. Holding the number of particles in the system constant precludes analysis of interacting systems that may exchange particles or experience chemical or nuclear reactions, and furthermore the notion of a constant number of particles encounters problems when the particles are indistinguishable. Relaxing this constraint leads to the grand canonical ensemble. In all of these models, the randomness is strictly epistemic for classical systems. After the advent of Quantum Mechanics, these methods were applied to quantum states, and great debate arose regarding whether nonepistemic randomness was involved, but that is a topic for a later chapter.
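As a quick illustrative use of T = m⟨v²⟩/(3k), the rms speed of molecular nitrogen at room temperature comes out at roughly half a kilometre per second; the molecular mass below is an approximate value supplied for the example, not a number from the text.

import math

k_B = 1.380649e-23     # Boltzmann constant, J/K
m_N2 = 4.65e-26        # approximate mass of one N2 molecule, kg
T = 300.0              # kelvin

v_rms = math.sqrt(3.0 * k_B * T / m_N2)       # from <v^2> = 3kT/m
print(v_rms)                                  # ~517 m/s
print(0.5 * m_N2 * v_rms**2, 1.5 * k_B * T)   # both ~6.2e-21 J: <E> = (3/2)kT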
We mentioned in passing that the canonical ensemble involves a distribution of state probability density in which the energy appears in the argument of an exponential function. The reason why this form of energy dependence arises and is so important is another example of something far too complicated to discuss in detail herein, but we can explore a few qualitative aspects to set the stage for the introduction of Statistical Mechanical entropy.

The exponential distribution emerges in Statistical Mechanics in a very natural way involving the mean free path between collisions of particles. In equilibrium, it is very reasonable to expect the locations of particles on any axis to be uniformly distributed in a random fashion, i.e., for any given particle inside a chamber, at any given instant of time its X coordinate is equally likely to be anywhere inside the chamber’s limits. Let λ be the mean density of particle coordinates on the X axis inside the chamber, i.e., λ is the number of particles n divided by the extent of the chamber in the X direction. Then the mean number of particles with X coordinate in the range Δx is λΔx. The probability that m particles will be in the range Δx is given by the Poisson distribution (see section 2.7). For the range Δx to be empty, we have m = 0, for which the Poisson probability is exp(-λΔx). Thus the probability that a range Δx is clear of particles drops off exponentially with Δx. This can be used to compute the probability that a particle moving in three dimensions will travel a given distance without a collision, but the details are beyond the present scope.

One example of the dependence of phase-space state distributions on exp(-E/kT) can be described qualitatively as follows. We consider the case in which the particles in the chamber collide frequently, since this will tend to stir up the distribution and cause some particles to be moving rapidly compared to the average speed at any given instant while others are moving relatively slowly. Conservation of momentum causes some particles to gain momentum while others lose it. After many collisions, a given particle’s momentum is the sum of many random changes. Even without knowing the actual distribution of momentum changes, since this is a well-behaved physical process, we can assume that the distribution must have a well-defined variance, and so we can apply the Central Limit Theorem to argue that the distribution of momentum on a given axis is Gaussian (see section 2.4). Thus the momentum on the X axis, px, is distributed according to the density function
f_x(p_x) = \frac{e^{-p_x^2/2\sigma^2}}{\sigma\sqrt{2\pi}}        (3.7)
where we use f instead of p for the density function to avoid confusion with the phase-space variable p, and we assume that the variance is the same on all axes. The distributions on the Y and Z axes have the same form, and the three momentum components may be assumed independent, so that the joint density function for all three dimensions is

f(p_x, p_y, p_z) = f_x(p_x)\, f_y(p_y)\, f_z(p_z) = \frac{e^{-(p_x^2 + p_y^2 + p_z^2)/2\sigma^2}}{(2\pi\sigma^2)^{3/2}}        (3.8)
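The Central Limit Theorem argument above is easy to check numerically. The following sketch (an illustration of ours, not drawn from the text; the kick distribution and sample sizes are arbitrary choices) builds each particle's X momentum as the sum of many uniformly distributed random changes and then verifies that the result behaves like a Gaussian by comparing the fractions of particles within one and two standard deviations of the mean to the familiar 68% and 95%.

    import random, statistics

    random.seed(1)
    n_particles = 5000
    n_kicks = 100          # number of random momentum changes per particle

    # Each particle's X momentum is the sum of many zero-mean random kicks.
    px = [sum(random.uniform(-1.0, 1.0) for _ in range(n_kicks))
          for _ in range(n_particles)]

    mu = statistics.mean(px)
    sigma = statistics.pstdev(px)
    within1 = sum(abs(p - mu) < sigma for p in px) / n_particles
    within2 = sum(abs(p - mu) < 2 * sigma for p in px) / n_particles
    print(f"within 1 sigma: {within1:.3f}   (Gaussian value 0.683)")
    print(f"within 2 sigma: {within2:.3f}   (Gaussian value 0.954)")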
Given identical particles, each with mass m, the kinetic energy per particle is
E = \frac{p_x^2 + p_y^2 + p_z^2}{2m}        (3.9)
so that the exponential in Equation 3.8 can be written exp(-mE/σ²), and thus the joint density function for the momentum components has an exponential dependence on kinetic energy. Using Equation 3.8 to compute the mean kinetic energy yields ⟨E⟩ = 3σ²/2m, and since we know from equation 3.6ff that the mean kinetic energy is 3/2kT, we see that the variance σ² = mkT, so that the exponential can be written exp(-E/kT). This dependence on energy and temperature arises in the equilibrium distributions of many other kinds of state variables. For example, the distribution of ionization state populations for a given atomic species in an ionized plasma has this dependence on ionization energy, and the distribution of excitation state populations for electrons in a given ion has this dependence on excitation energy.

Note that Equation 3.8 describes the probability of momentum states, not energy itself (e.g., σ² is the variance of the momentum on each axis, not the energy). The fact that the distribution depends on energy does not make it the distribution of energy, despite the energy appearing in the argument of an exponential and superficially resembling a Gaussian distribution of what appears in that argument. Taking the kinetic energy per particle as the random variable of interest requires treating it as a function of the random momentum variables, so its density function has to be obtained by applying the theory of functions of random variables (see Appendix B) to Equation 3.9. In this three-dimensional case, since each momentum component is Gaussian, the kinetic energy is related to chi-square with three degrees of freedom by a scale factor. Equation 3.9 can be written

E = \frac{p_x^2 + p_y^2 + p_z^2}{2m} = \frac{\sigma^2}{2m}\,\frac{p_x^2 + p_y^2 + p_z^2}{\sigma^2} = \frac{\sigma^2}{2m}\,\chi_3^2 = \frac{kT}{2}\,\chi_3^2        (3.10)
where we substituted σ² = mkT (again, the per-axis momentum variance, not the energy variance) to get the final result on the right. The density function for chi-square with three degrees of freedom is (see Equation E.7, p. 445)
f(\chi_3^2) = \frac{\sqrt{\chi_3^2}\; e^{-\chi_3^2/2}}{\sqrt{2\pi}}        (3.11)
where we used the fact that Γ(3/2) = √π/2. For this simple linear dependence of one random variable on another, Equation B.2 (see Appendix B, p. 425) yields
f_E(E) = \frac{2}{kT}\sqrt{\frac{E}{\pi kT}}\; e^{-E/kT}        (3.12)
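This result lends itself to a quick Monte Carlo check. The sketch below (ours, not part of the text's derivation; it works in arbitrary units with kT = 1 and m = 1) draws Gaussian momentum components with per-axis variance σ² = mkT, forms the kinetic energy of Equation 3.9, and compares the sample mean and variance with the values 3/2kT and 3/2k²T² obtained from the chi-square relation discussed next.

    import random, statistics

    random.seed(2)
    kT, m = 1.0, 1.0                 # arbitrary units
    sigma = (m * kT) ** 0.5          # per-axis momentum standard deviation

    def kinetic_energy():
        px, py, pz = (random.gauss(0.0, sigma) for _ in range(3))
        return (px**2 + py**2 + pz**2) / (2.0 * m)

    samples = [kinetic_energy() for _ in range(100000)]
    print(f"sample mean     = {statistics.mean(samples):.4f}   (expected 3/2 kT   = 1.5)")
    print(f"sample variance = {statistics.pvariance(samples):.4f}   (expected 3/2 (kT)^2 = 1.5)")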
Since a chi-square random variable with N degrees of freedom has a mean of N and a variance of 2N (see Appendix E), with three degrees of freedom the mean is 3 and the variance is 6. Since E is related to this chi-square by the scale factor kT/2, its mean is 3× that, hence 3/2kT (as we saw
above), and its variance is 6×(kT/2)², hence 3/2k²T².

In giving just enough of the flavor of Statistical Mechanics to provide a context for discussing the role of randomness in physical theories, we must repeatedly stress that we have glossed over many important issues, for example the fact that in the microcanonical ensemble with kinetic energy being the only form of energy involved, the sum over all particles' kinetic energy is constant, and so the energy defined in Equation 3.9 is not entirely unconstrained; it is in principle correlated across the particles by the constraint. And in assuming many collisions between particles in order to invoke the Central Limit Theorem, we have not discussed the problem of how point particles collide at all. The reader is strongly encouraged to explore the standard texts covering these and many other fascinating aspects of Statistical Mechanics.

One thing that the reader will find in standard texts is that the concept of entropy in Statistical Mechanics is typically introduced somewhat gingerly. This follows partly from the subjective elements involved ("order", "information", and to some extent, "ability to do work", etc.) and partly from the fact that early on it was not at all clear what mathematical expressions were appropriate. Even in Thermodynamics, some subtle nuances were involved. The mathematical nature of a perfect differential was unambiguous, but even that lacked a zero-point definition, and the straightforward integration of the perfect differential only worked on reversible paths. The example given in the previous section of a hot and cold gas separated by a partition that is subsequently removed illustrated how equilibrium obtains, but we did not explicitly point out therein the connection between the increase in total entropy due to an irreversible process and the "entropy of mixing", which causes the entropy of a mixture of dissimilar substances to rise above the sum of the entropies of the unmixed substances even if the two originally separated gases were at the same temperature and pressure.

Thermodynamics provides equations for the partial derivatives of entropy with respect to the state variables, however, so expressing the state variables in the language of Statistical Mechanics provides a method for defining entropy in that language (e.g., Morse, 1964). Entropy can also be defined axiomatically (e.g., Callen, 1960, Kittel, 1958). Since we are interested primarily in a qualitative understanding of entropy, an intuitive approach may be used (e.g., Resnick and Halliday, 1961): since entropy describes the disorder of a system, and since disordered macrostates are plausibly viewed as more probable than ordered macrostates, the entropy of a macrostate must somehow be related to the probability of that macrostate. Given two independent (not mutually interacting) systems with entropy S1 and S2 respectively, the combined entropy must be the sum S = S1 + S2. The joint probability that the two systems will be in their respective macrostates must be the product of the two independent probabilities for each system. With the common assumption that the microstates are all equally likely, these probabilities must each be proportional to the number of ways that the macrostate can come to exist, i.e., the number of microstates corresponding to the macrostates.
Denoting the number of microstates for each system's macrostate W1 and W2 respectively, the probabilities P1 and P2 are proportional to W1 and W2, respectively, or P1 = f1W1 and P2 = f2W2, where f1 and f2 are the single-microstate probabilities. The joint probability is then the product of the separate independent probabilities, P = P1P2 = f1f2W1W2 = cW1W2, with c = f1f2. The only way the sum of the entropies can depend on the product of the probabilities is for the entropy to depend logarithmically on probability, or S = k lnW. This is the form proposed by Ludwig Boltzmann (using his constant k in the same notation as the engraving on his tombstone except for the more modern "ln" instead of the potentially ambiguous "log").
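The logarithmic form can be illustrated with a trivial numerical sketch (ours, not the book's; the microstate counts below are made up): for two independent systems, the microstate count of the combined system is the product W1W2, and only a logarithmic entropy turns that product into a sum.

    import math

    W1, W2 = 1200, 350                        # hypothetical microstate counts
    S1, S2 = math.log(W1), math.log(W2)       # Boltzmann entropies (factor of k omitted)
    S_combined = math.log(W1 * W2)            # combined system has W = W1 * W2 microstates

    print(f"S1 + S2    = {S1 + S2:.6f}")
    print(f"S combined = {S_combined:.6f}")   # equal, since ln(W1*W2) = ln W1 + ln W2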
Here we have assumed discrete microstates in referring somewhat glibly to the "number" of them. If the distribution of microstates is continuous, then W must be taken to be a volume in phase space instead of an integer number, and a little more work must be done to relate macrostates with microstates. These are mathematical issues, and the usual formal modifications take care of them (e.g., how to define the probability that the October London rainfall will be 3 inches; see section 2.1).

We also assumed that all relevant microstates have the same probability so that the macrostate probability will be proportional to the number of corresponding microstates. This assumption turns out to be an excellent approximation in most cases. For example, in section 2.2 we discussed the distribution of microstates for biased coin flips, with Figures 2-2 and 2-3 (p. 41) showing how the bias partitions the space of possible outcomes as the coin is flipped once, twice, three times, and four times. The 16 resulting microstates are not at all equally probable, but for any given macrostate (defined by the number of heads after four flips), all of its microstates have equal probability. With a probability of 2/3 for heads, the most probable macrostate is the one with 3 heads. This macrostate has only 4 microstates, but each of them has the same probability, about 0.0987654321.

An important point emerges here, however: while all "3 heads" microstates have the same probability, other microstates have distinctly different probabilities. The "2 heads" microstates each have a probability of about 0.049382716, half that of the "3 heads" microstates, but there are 6 of them, totaling a probability of about 0.296296296, much smaller than the 0.3950617284 probability of the "3 heads" macrostate. Thus the most probable macrostate does not have the most microstates! The correspondence between number of microstates and maximum likelihood does not hold in general, but when all microstates are equally probable, it does hold, and that is one reason why it is so useful to postulate (or prove, where possible) the equal probability of microstates.

For the case when equal-probability microstates cannot be assumed, J. Willard Gibbs (1902) developed the entropy definition

S = -k \sum_i f_i \ln f_i        (3.13)
where fi is the probability of the ith microstate. Since the system must be in some microstate, the sum over all of their probabilities is unity:
\sum_i f_i = 1        (3.14)
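For readers who like to experiment, Equations 3.13 and 3.14 translate into code almost verbatim; the sketch below (ours, with the factor of k omitted, as the text does for non-thermodynamic examples) defines a Gibbs-entropy function and confirms that four equally probable microstates give ln 4, the Boltzmann form.

    import math

    def gibbs_entropy(probs):
        # Gibbs entropy S = -sum(f_i ln f_i), with Boltzmann's constant omitted
        if abs(sum(probs) - 1.0) > 1e-9:       # Equation 3.14: probabilities must sum to unity
            raise ValueError("probabilities must sum to 1")
        return -sum(f * math.log(f) for f in probs if f > 0.0)

    print(gibbs_entropy([0.25, 0.25, 0.25, 0.25]))   # equals ln 4
    print(math.log(4))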
Consider a macrostate with W microstates whose probabilities sum to F and may be different from those corresponding to other macrostates but are equal to each other (like the "3 heads" microstates above). If the probability of the system's state is conditioned on its being in that macrostate, then fi → fi/F for the corresponding microstates and fi → 0 for all others, and then Equation 3.14 remains valid when only the microstates belonging to the given macrostate are included in the summation. Since these W probabilities are equal, each has the value 1/W, and

S = -k \sum_{i=1}^{W} \frac{1}{W} \ln\frac{1}{W} = -k\, W \frac{1}{W} \ln\frac{1}{W} = k \ln W        (3.15)
and the Gibbs expression reduces to the Boltzmann expression. So the Gibbs entropy is the same as
the Boltzmann entropy when all microstates of the system have equal probability and also when the system state probability is conditioned on being in a given macrostate whose microstate probabilities are equal. But the two forms are distinct when we must deal with a set of microstates with unequal probabilities. For example, we must use the Gibbs expression when considering all possible states corresponding to the four flips of the coin biased toward heads with a probability p = 2/3, i.e., not conditioning the system probability on any given macrostate. We find (see section 2.2, whose notation we use here, e.g., Eq. 2.3):
P(0,4,2/3) = 1/81  = 0.012345679     (1 microstate)
P(1,4,2/3) = 8/81  = 0.098765432     (4 microstates)
P(2,4,2/3) = 24/81 = 0.296296296     (6 microstates)        (3.16)
P(3,4,2/3) = 32/81 = 0.395061728     (4 microstates)
P(4,4,2/3) = 16/81 = 0.197530864     (1 microstate)
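These values follow directly from the binomial formula of section 2.2 and are easy to verify numerically; the short sketch below (ours, using only the Python standard library) recomputes the macrostate probabilities and microstate counts for four flips with p = 2/3.

    from math import comb

    N, p = 4, 2/3
    for n in range(N + 1):
        microstates = comb(N, n)                      # number of ways to obtain n heads
        prob = microstates * p**n * (1 - p)**(N - n)  # macrostate probability
        print(f"P({n},{N},2/3) = {prob:.9f}   microstates = {microstates}")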
We saw in section 2.2 that the microstate probabilities for any one given macrostate are equal to each other, so the corresponding independent microstate probabilities are just their macrostate probability divided by its number of microstates. Taking the number of coin flips and the probability of heads as given in this example to reduce notational clutter, the microstate probabilities pm(n) corresponding to the macrostates "n heads" are

p_m(0) = (2/3)^0 (1/3)^4 = 1/81  = 0.012345679
p_m(1) = (2/3)^1 (1/3)^3 = 2/81  = 0.024691358
p_m(2) = (2/3)^2 (1/3)^2 = 4/81  = 0.049382716        (3.17)
p_m(3) = (2/3)^3 (1/3)^1 = 8/81  = 0.098765432
p_m(4) = (2/3)^4 (1/3)^0 = 16/81 = 0.197530864
Applying the Gibbs expression for entropy to these, the entropies S(n) of the macrostates “n heads” are:
S(0) = -\sum_{i=1}^{1} p_m(0) \ln p_m(0) = -(1/81)\ln(1/81)   = 0.054252458
S(1) = -\sum_{i=1}^{4} p_m(1) \ln p_m(1) = -4(2/81)\ln(2/81)  = 0.365560689
S(2) = -\sum_{i=1}^{6} p_m(2) \ln p_m(2) = -6(4/81)\ln(4/81)  = 0.891305124        (3.18)
S(3) = -\sum_{i=1}^{4} p_m(3) \ln p_m(3) = -4(8/81)\ln(8/81)  = 0.914570909
S(4) = -\sum_{i=1}^{1} p_m(4) \ln p_m(4) = -(16/81)\ln(16/81) = 0.320367493
where we omit Boltzmann's constant k because it relates thermodynamic entropy to the microstates in 6n-dimensional phase space (3 position dimensions and 3 momentum dimensions) and therefore is not relevant to coin-flip microstates (a more general concept of entropy applies to all probability distributions, as discussed in section 3.5 below). The point is that now we see that with the Gibbs expression, the most probable macrostate, "3 heads", does have the highest entropy.

Whereas it is possible to give the simple heuristic illustration above of how the Boltzmann entropy depends on the logarithm of the number of microstates, no such derivation of the Gibbs entropy is possible without going much further into the formalism of Statistical Mechanics than the present scope can sustain (e.g., Kittel, 1958, gives a proof of the formula for the canonical ensemble that is about as straightforward as can be found but requires introducing partition functions and Helmholtz free energy). Instead we will take Equation 3.13 as given, with or without the factor of k as appropriate, and consider some of its more interesting aspects below.

Equation 3.15 shows how the Gibbs entropy reduces to the Boltzmann entropy when all relevant microstates have equal probability, but it may not be immediately obvious that this situation is also when the Gibbs entropy is maximized. In other words, when some microstates are more probable than others, the Gibbs entropy is less than it would otherwise be, because the uniform distribution contains the smallest possible amount of information, and any non-flat structure in the distribution corresponds to some additional information, some reason to expect certain outcomes more often than others, hence at least some partial order. An intuitive grasp for why the Boltzmann entropy corresponds to the maximum of the Gibbs entropy can be obtained by considering a qualitative argument that begins with a very simple case, namely when there are only two microstates, one with probability f and the other with probability 1-f. In this case the Gibbs entropy (Equation 3.13 but omitting the Boltzmann constant) is

S = -f \ln f - (1-f) \ln(1-f)        (3.19)
We find the value of f that maximizes S by setting the derivative of S with respect to f to zero:
S = -f \ln(f) - (1-f) \ln(1-f)
\frac{dS}{df} = -\ln(f) - \frac{f}{f} + \ln(1-f) + \frac{1-f}{1-f} = -\ln(f) + \ln(1-f) = 0        (3.20)
f = 1-f \;\;\Rightarrow\;\; f = \frac{1}{2}
To verify that this extremum is indeed a maximum, we check the second derivative:

\frac{d^2 S}{df^2} = -\left(\frac{1}{f} + \frac{1}{1-f}\right) < 0        (3.21)
We are considering only 0 < f < 1, since a probability of zero means that the microstate doesn't exist, and a probability of 1 means that there is only one microstate. Thus the quantity inside the parentheses is finite and positive for all valid f, making the second derivative exclusively negative over that range, indicating that there is only one extremum, and it is a maximum, where the second derivative has the value -4. So for a 2-microstate distribution, the Gibbs entropy is maximized when the two microstates have equal probability.

The 2-microstate example can be extended easily to a W-microstate case in which W-1 microstates have equal probability and the Wth is free to have any other probability subject to the constraint that all probabilities add up to 1:
S = -(W-1)\, f \ln f - \left[1 - (W-1)f\right] \ln\left[1 - (W-1)f\right]        (3.22)
Setting the derivative of S equal to zero and solving for f yields f = 1/W. This is not a completely general proof, since it assumes that no more than one microstate can have a probability different from all the others, but it shows that if one starts out with equal-probability microstates, changing the probability of one of them moves the entropy away from a maximum. So starting with a 2-microstate case, for which we know that the Gibbs entropy is maximized when the two have equal probability independently of any underlying distribution, we can get to a 3-microstate case by adding one microstate to the maximum-entropy 2-microstate case, and we have established that this case will have maximum entropy if the third microstate's probability is equal to that of the other two, namely 1/3. In this way, we can build up to an arbitrary number of microstates, always with maximum entropy when all have equal probability. So the remaining question is whether this maximum is global. We argue that it is, because the entropy can have only one extremum per fi axis, the probability axis of the ith microstate of the system, as in Equation 3.13. Writing that equation without the k factor and taking derivatives of S with respect to fi,
S = -\sum_i f_i \ln f_i
\frac{\partial S}{\partial f_i} = -\ln f_i - 1        (3.23)
\frac{\partial^2 S}{\partial f_i^2} = -\frac{1}{f_i}

So there can be only one extremum per fi axis because the second derivative never changes sign, and this extremum is a maximum because the second derivative is negative; since we have found that a maximum exists when all probabilities are equal, this is the global maximum of the Gibbs entropy. In the case of the four coin flips discussed above, the total entropy is greatest when the coin is fair. Every microstate is equally likely to occur, and any decision that depends on what microstate will occur must be made completely in the dark. With a bias toward heads or tails, some microstates can be expected to occur more often than others, i.e., there is partial order that provides some information about what to expect, even though the process must still be considered random. At least the randomness operates in a more constrained fashion.

3.4 Relation Between Clausius Entropy and Boltzmann Entropy

It is perhaps not immediately obvious how the Clausius entropy, defined by dS = dQ/T, and the Boltzmann entropy, S = k lnW, refer to the same physical concept. For one thing, we never established a zero point for the Clausius entropy, and to do so would take us too far afield, but we mention in passing Nernst's Heat Theorem (see Planck, 1921), also known as the Nernst Postulate (see, e.g., Callen, 1960), which states that thermodynamic entropy goes to zero as absolute thermodynamic temperature does the same. This is related to some other interesting ideas whose details are also beyond the present scope, such as heat capacities going to zero in the same way, implying that no energy would be required to heat an object at absolute zero temperature, hence that a temperature of absolute zero must be impossible to achieve. This led Einstein to point out that no meaningful statement can be made about an object at absolute zero temperature, since it can never be observed, and meaningful statements can be made only about physical situations that are at least in principle observable. As it happens, we do not require a zero point for Clausius entropy herein, because a simple illustration involving only the change in entropy suffices to connect the two forms well enough to give intuition a foothold. Here we follow the discussion of Resnick and Halliday (1961).

We consider an ideal gas consisting of a fixed number of particles undergoing a reversible isothermal expansion such as the path from point 1 to point 2 in Figure 3-1 (p. 98). We treat the positions of the particles as epistemically random variables, since we cannot follow their motions exactly. The probability that a given particle will be in a volume element ΔV is proportional to ΔV, i.e., twice the volume implies twice the probability of finding the particle there. Taking the probability f of a macrostate to be proportional to the number of its microstates W, the probability that any given single particle is in ΔV is f1 = c1W1 = c2ΔV, where c1 and c2 are
constants of proportionality (c1 is the single-microstate probability), and W1 is the number of microstates corresponding to the particle being in ΔV. Then with c = c2/c1, we have W1 = cΔV. Taking the positions of all the particles as independent random variables, the probability f of finding N particles in ΔV is the product of all the N single-particle probabilities, or
f = f_1^N = (c_1 W_1)^N = c_1^N W_1^N = (c_2\, \Delta V)^N = c_1^N (c\, \Delta V)^N        (3.24)
In going from single-particle macrostates to N-particle macrostates, we must multiply together the numbers of single-particle microstates, one factor for each particle. For example, if a single-particle macrostate has 2 microstates labeled A and B, a 3-particle macrostate will have 2³ = 8 microstates, AAA, AAB, ABA, ABB, BAA, BAB, BBA, and BBB. Each of the 3 particles can be in either of the 2 microstates. So the number of N-particle microstates corresponding to the macrostate "N particles in ΔV" is W = W_1^N. Using this in the formula for Boltzmann entropy,

S = k \ln W = k \ln W_1^N = k \ln (c\, \Delta V)^N = k N \left[\ln c + \ln \Delta V\right]        (3.25)
Taking the volume element as the entire volume of the system, the change in Boltzmann entropy between points 1 and 2 in Figure 3-1 (p. 98) is
\Delta S = S_2 - S_1 = kN\left[\ln c + \ln V_2\right] - kN\left[\ln c + \ln V_1\right] = kN \ln\frac{V_2}{V_1} = \frac{N}{N_A} R \ln\frac{V_2}{V_1} = mR \ln\frac{V_2}{V_1}        (3.26)
where we used k = R/NA and m = N/NA. Recall from section 3.2 that R is the universal gas constant, NA is Avogadro's constant, and m is the number of moles (gram-molecular weights) of the gas. The last expression is what we will compare to the change in entropy as calculated thermodynamically, i.e., the change in Clausius entropy.

In the isothermal expansion, work is done by the gas and made up by drawing energy from the heat reservoir in order to keep the temperature T constant, which in turn keeps the gas's internal energy U constant. So the work performed by the pressure P acting to move the piston through the distance that changes the gas chamber volume from V1 to V2, ∫P dV, is exactly equal to the heat energy drawn from the heat reservoir, ∫dQ, where the integrals are along the isothermal path from point 1 to point 2:
\int dQ = \int P\,dV \;\;\Rightarrow\;\; \Delta S = \int \frac{dQ}{T} = \int \frac{P\,dV}{T}        (3.27)
With the perfect gas law, PV = mRT (Equation 3.1, p. 97, with C = mR), the change in Clausius entropy becomes

\Delta S = \int_{V_1}^{V_2} \frac{mR}{V}\,dV = mR \ln\frac{V_2}{V_1}        (3.28)
which is the same result as in Equation 3.26 for the Boltzmann entropy.

3.5 Entropy of Some Probability Distributions

First of all, we observe that (omitting the factor of k) the Gibbs entropy can also be written

S = \sum_i f_i \ln\frac{1}{f_i} = \left\langle \ln\frac{1}{f} \right\rangle
S = \ln \prod_i \left(\frac{1}{f_i}\right)^{f_i}        (3.29)
The first form shows that the Gibbs entropy of an entire distribution is in fact the average value of the logarithm of the inverse probabilities. We stress again that these probabilities must be on the open interval (0,1), i.e., 0 < fi < 1, because a probability of zero means that the microstate does not exist, and a probability of unity means that there is only one microstate, hence no nontrivial distribution. For this reason we do not have to worry about inverse probabilities causing division by zero. The second form just illustrates further that different interpretations are suggested by various ways of applying the logarithm and numerator/denominator placement of factors. It may not be immediately obvious that a measure of disorder should involve raising a probability to a power equal to itself.
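The first form of Equation 3.29 is easy to play with numerically. The sketch below (ours; choosing the Binomial Distribution with n = 4 simply stays with the coin-flip example) computes the total Gibbs entropy as the average of ln(1/f) for several values of the bias p, showing the increase toward the fair-coin value that is plotted in Figure 3-2A.

    from math import comb, log

    def binomial_probs(n, p):
        # Binomial probabilities (Equation 2.3) for k = 0..n heads in n flips
        return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

    n = 4
    for p in (0.1, 0.2, 0.3, 0.4, 0.5):
        f = binomial_probs(n, p)
        S = sum(fi * log(1.0 / fi) for fi in f)   # first form of Equation 3.29
        print(f"p = {p:.1f}   total entropy = {S:.6f}")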
[Figure 3-2. A: Total entropy of the Binomial Distribution as a function of p and n (see Equation 2.3, p. 39; p > 0.5 is not shown because the distribution is symmetric about p = 0.5). B: Total entropy of the Poisson Distribution as a function of the mean, λ (see Equation 2.19).]
The summation in the Gibbs entropy can be formally evaluated straightforwardly for most discrete distributions, although the logarithm factor occasionally introduces algebraic tediousness which we will omit herein. Figure 3-2 shows the total entropy for two important distributions over a range large enough to illustrate the important features of the dependence on the distribution's parameters that remain after summing contributions over all of the discrete random variable's values. The entropy of the Binomial Distribution (Equation 2.3, p. 39) depends on both n and p, interpreted in the discussion of section 2.2 as the number of coin flips and the probability of heads per flip. Since biasing a coin toward heads affects the entropy in the same way as the same bias toward tails, the entropy is symmetric about p = 0.5, and so only lower values are included in the plot. The entropy of the Poisson distribution (Equation 2.19, p. 59) depends only on the mean, λ. These plots show the general tendency of entropy to increase approximately logarithmically as the distribution acquires a larger number of microstates with nonnegligible probabilities. The Binomial Distribution has a finite number of microstates, 2^n, but for probabilities p near 0 or 1, many of these become so improbable that they might almost as well not exist. The Poisson Distribution's microstates comprise the transfinite set of all nonnegative integers, but far from the mean, λ, the probabilities become small and eventually negligible. The larger the mean, the greater the range of microstates about it with significant probabilities, increasing the entropy. The entropy of the discrete Uniform Distribution is just lnW, as we saw in Equation 3.15 (omitting the Boltzmann constant here).

Now let us consider the continuous Uniform Distribution with a domain from a to b. The straightforward generalization is to replace the summation with an integral, so the number of discrete microstates changes to the extent of the continuous space, suggesting S = ln(b-a). But here we encounter a bump in the road: the argument of the logarithm is not necessarily dimensionless, as it seems it should be. In physics, taking the logarithm of 5 centimeters, for example, is a bit like taking the cosine of 3 grams. Some functions are just supposed to take only dimensionless arguments, as in the example of golf balls retrieved from a driving range discussed at the end of section 2.7 (the number of balls may be Poisson-distributed, but not the total mass). The arguments of some functions may have units, e.g., degrees versus radians, but not dimensions, e.g., length, mass, and time. Like angles, probability is dimensionless. Even without necessarily subscribing to the frequentist interpretation, the mathematical nature of probability is clearly that of a ratio, a fraction of all opportunities that will give rise to a random event. As such, it is dimensionless and its logarithm can be taken, just like the number of equal-probability microstates. A probability density, on the other hand, is the amount of probability contained within some space and thus has the inverse dimensions of that space.

Suppose, for example, that there is a rectangular water trough measuring 36×12 inches on the ground beneath a sloped cabin roof covered with melting snow, and the long side of the trough is parallel to a lower horizontal edge of the roof. As the snow melts, drops of water fall from the edge of the roof at uniformly random locations, and some of these cause splashes in the trough.
A person inside the cabin can hear the splashes but not see them, so the location of each drop in the longitudinal direction of the trough is unknown and is therefore an epistemically random variable that is uniformly distributed over the 36-inch length. We can define a = 0 inches and b = 36 inches, and so the splash-center locations are uniformly distributed between 0 and 36. Is the Gibbs entropy of this distribution ln(36) in units of ln(inches)? The answer is no; such units are physically meaningless, just as an angle cannot be 37 grams. If it could, then an angular rate ω could be 37 grams/second, and in the equation made famous by
Planck and Einstein, E = ħω (i.e., E = hν), E would not have energy units or dimensions. So generalizing the Gibbs entropy from discrete to continuous distributions requires more than just substituting a probability-weighted spatial extent for a number of microstates. We have seen this problem before (e.g., in section 2.1, the London rainfall). In dealing with density functions, one may compute relative densities by simply taking ratios of the density evaluated at two locations, but to get at the physical parameter x whose distribution is described by the density function, it is always necessary to work in terms of the interval from x to x+δx, where the interval size δx can be made arbitrarily small but greater than zero. Since we are considering continuous random variables describing physical quantities, one can always make δx > 0 small enough so that the density may be considered to have a constant slope over that range. This is what we must do with continuous density functions in order to remove the dimensions. For example, the water splashes have a uniform density of p = 1/36 in units of inch⁻¹, but the probability of a splash with a longitudinal coordinate between 18 and 18.1 inches is the dimensionless P = p δx = 1/360. We can take the logarithm of P with a clear conscience.

So to generalize the Gibbs entropy for use with continuous random variables, we begin by approximating the integral as the limit of a summation of discrete values as δx approaches zero. We divide the domain of the random variable (here assumed to be the entire real line, but the modification for finite domain is straightforward) into bins of width δx labeled with an index i. Thus bin i covers the semi-open interval iδx ≤ x < (i+1)δx. Since f(x) is continuous, and x may be arbitrarily close to (i+1)δx, the mean value theorem guarantees that each bin contains an x value, xi, such that

f(x_i)\,\delta x = \int_{i\,\delta x}^{(i+1)\,\delta x} f(x)\,dx        (3.30)
The integral of f(x) over the domain is therefore
\int_{-\infty}^{\infty} f(x)\,dx = \lim_{\delta x \to 0} \sum_i f(x_i)\,\delta x = 1        (3.31)
Now we write the Gibbs entropy with the discrete probabilities replaced by f(xi) δx,
S = -\lim_{\delta x \to 0} \sum_i f(x_i)\,\delta x \,\ln\!\left[f(x_i)\,\delta x\right]        (3.32)
We use the fact that ln(uv) = ln(u) + ln(v), even though in the process we may be creating the situation discussed above in which the arguments of the logarithms may not be dimensionless. In principle, as long as the sum of the two logarithms is never broken, we have an acceptable combination because it is identical to one logarithm with a dimensionless argument (i.e., the absurdity of each argument is canceled by the other). So Equation 3.32 becomes
S = -\lim_{\delta x \to 0} \sum_i f(x_i)\,\delta x \,\ln f(x_i) \;-\; \lim_{\delta x \to 0} \sum_i f(x_i)\,\delta x \,\ln \delta x
  = -\lim_{\delta x \to 0} \sum_i f(x_i)\,\delta x \,\ln f(x_i) \;-\; \lim_{\delta x \to 0} \ln \delta x \sum_i f(x_i)\,\delta x        (3.33)
  = -\int_{-\infty}^{\infty} f(x) \ln f(x)\,dx \;-\; \lim_{\delta x \to 0} \ln \delta x
So we get to the integral form we expected, but with an additional term that produces a contribution of arbitrarily large magnitude. The first term alone cannot be considered Gibbs entropy, but it is widely used under the name differential entropy. In any comparison among various cases, as long as the second term is the same constant for all such cases, the first term alone can be useful in revealing how a distribution’s entropy depends on its parameters. Some examples of differential entropy are:
S_Uniform  = \ln(b-a)                         for x Uniform in (a, b)
S_Gaussian = \ln\!\left(\sigma\sqrt{2\pi e}\right)       for x Gaussian with standard deviation σ        (3.34)
S_Cauchy   = \ln(4\pi b)                      for x Cauchy with half width at half maximum b
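These closed forms can be spot-checked by numerical integration of -∫ f ln f dx. The sketch below (ours; it uses a simple midpoint rule over ±10σ, which is adequate because the Gaussian tails contribute negligibly) does so for a Gaussian with σ = 2 and compares the result with ln(σ√(2πe)).

    import math

    sigma = 2.0
    def f(x):
        # Gaussian density with mean 0 and standard deviation sigma
        return math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

    # Midpoint-rule approximation of -integral of f ln f over (-10 sigma, +10 sigma)
    dx, total, x = 0.001, 0.0, -10 * sigma + 0.0005
    while x < 10 * sigma:
        total -= f(x) * math.log(f(x)) * dx
        x += dx
    print(f"numerical differential entropy = {total:.6f}")
    print(f"ln(sigma*sqrt(2*pi*e))         = {math.log(sigma * math.sqrt(2 * math.pi * math.e)):.6f}")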
So the Cauchy distribution (see Equation 2.14, p. 55), whose mean is undefined because the integral diverges, does have a finite differential entropy. The fact that the dependences involve the logarithm of width-related parameters is typical of most distributions, as is the lack of any dependence on the mean (unless the mean and standard deviation are coupled, as with the Poisson and chi-square distributions, for example, but even then the real dependence is on the standard deviation). Note that with the argument of the logarithm expressed in units of standard deviation, the differential entropy of the Gaussian is larger than that of the Uniform, whose standard deviation is (b-a)/√12, so its differential entropy can be written ln(σ√12), and √12 is smaller than √(2πe). Even though the Uniform distribution contains the least information within its domain of any probability distribution, the fact that its domain is finite amounts to enough information to reduce its differential entropy below that of the infinite-domain Gaussian with the same standard deviation.

The concept of entropy arises in other areas, which we will mention only in passing, since the role played by randomness is essentially the same as in what we have discussed above. Shannon (1948) introduced the concept of entropy into information theory in the form of the first line in Equation 3.23 (Gibbs entropy without the Boltzmann constant). He linked entropy to message representation by defining the entropy of a single independent fair coin flip as a single bit (binary digit) with a numerical value of ln2, then considered the binary codes for characters to be subject to probability, since some characters are more likely to be followed by others than they would be in purely random sequences. This was extended to image representation and proved useful in data compression theory, especially in determining the minimum number of bits needed for lossless compression. Bekenstein (1973) studied the entropy of black holes and proposed an expression that involved elements of both thermodynamic entropy and Shannon entropy. An interesting variety of epistemic
uncertainty arises, since his formulation is based on equating the smallest amount of Shannon entropy, a single bit with a numerical value of ln2, with the loss of the smallest possible particle into a black hole. The entropy increase stems from the fact that information concerning the state of the particle after it disappears into the black hole is lost. Conditions inside the black hole (other than implied by externally observable properties, i.e., mass, charge, and angular momentum) cannot be known even in principle, but within the context of classical General Relativity the particle's fate is still deterministic, and so the uncertainty is epistemic. Normally in classical physics the evolution of a system governed by deterministic but unknown interactions would be considered to be independent of human knowledge which is in principle available but simply not practical to obtain. In this case entropy arising from lack of human knowledge is considered to produce physically meaningful entropy because the uncertainty proceeds directly from physical law (i.e., General Relativity prohibits knowing what goes on inside a black hole), not an inability to compute an otherwise knowable solution.

Further variations on the entropy theme were composed, but with more and more tenuous connections to our goal of examining the role of randomness in the physical universe, and generally conforming to the previously quoted facetious remark about the chronological growth of entropy applying to the history of the notion itself. For example, the entropy defined for information contained in an image is the same as for compressibility of an image except for a sign flip. When all the pixels are equal (i.e., the image is blank), the entropy is maximal for information in the image (there is none) and minimal for compression (a flat image is maximally ordered and can thus be compressed to the smallest possible size).

3.6 Brownian Motion

An excellent example of how progress in physics has proceeded by harnessing randomness via probability theory is the way that Einstein convinced essentially all skeptics of the hypothesis that matter is composed of discrete units. In the late 19th century, some scientists (e.g., Ludwig Boltzmann) already accepted that "atoms" and "molecules" were objectively real ingredients of the physical universe, not just handy mnemonics for representing chemical reactions. This was based on theoretical arguments, since molecules had never been observed directly in a laboratory. The skeptics (e.g., Wilhelm Ostwald and Ernst Mach) required that some experimental verification of real molecules be achieved before these could be admitted as truly existing. It did not help that the distinction between atoms and molecules was murky at best, and most scientists just used the word "molecule" essentially as a synonym for "particle". Some rough estimates had been obtained for the size of typical molecules, and it was clear that they were extremely small, so small that no foreseeable laboratory apparatus had any hope of seeing them.

The fact that molecules would have to be very small proceeded from the fact that there were apparently so many of them in typical liquids and gases encountered in the laboratories of the 19th-century chemists. Studying the weights and volumes of reagents and products of chemical reactions had made it clear that different chemical species combined in certain ratios of whole numbers and that the volume of a gas was inversely proportional to its weight.
Extensive experience with such ratios brought into focus the concept of atomic number and led to the concept of an element's valence, despite the fact that the latter is determined by properties of atomic electron shells, and since
the existence of atoms had not been conceded, the notion of atomic electron shells was remote. Arguments based on relative viscosities and mean free paths of gas and liquid phases, notably by Johann Loschmidt in 1865, led to order-of-magnitude estimates of the diameters of putative air molecules around 10⁻⁷ centimeters, suggesting that what we now call Avogadro's constant, the number of molecules in one gram-molecular weight of a substance, was on the order of 10²² or 10²³. It is understandable that the challenge of observing something so small was considered too daunting to take seriously.

Experiments similar to Loschmidt's had also established the phenomenon of diffusion, the tendency of two different chemical species to flow into each other to form a uniform mixture, and osmosis, the ability of certain membranes to allow the passage of some chemical species but not others. Adolph Fick (1855) derived several laws describing diffusion under various laboratory setups. These are differential equations that describe the behavior of continuous media. One of these was used by Einstein (1905d; see Einstein, 1926, for an English translation by A.D. Cowper) in his paper on the motion of particles suspended in a liquid. In this paper, Einstein acknowledged that what he was discussing probably had some relevance to a phenomenon known as "Brownian motion" (also "Brownian movement") named after the botanist Robert Brown who observed it in 1827, but Einstein did not claim a connection because he felt the relevant data were inadequate.

The quality of microscopes had improved considerably since Robert Hooke popularized them in 1665. By 1676, Antony van Leeuwenhoek had observed single-celled animals in water samples and established that these were living creatures, some with flagella that allowed them to move about. Brown made microscopic observations of pollen grains suspended in water and observed that not only did these grains move about in an irregular fashion, the same was true of small particles ejected into the water from the pollen grains. Coming from a plant, the pollen grains might be considered alive, but neither they nor the smaller particles had any visible flagella with which to swim about. Brown concluded that the motion was not the result of biological activity and accepted that he had no explanation.

Other scientists had observed unexplainable irregular motion of particles in fluids, both before and after Brown. The behavior of coal dust on the surface of alcohol was described by Jan Ingenhousz in 1785. In fact, in 60 BC, the Roman Lucretius applied his belief in the Greek atomic theory to explain the motion of dust particles in the air made visible by beams of sunlight shining into his room. None of these qualitative arguments gained traction. Quantitative treatments of a tangential nature appeared shortly before Einstein's paper: Thorvald N. Thiele addressed Brownian motion in a paper on least-squares estimation in 1880, and Louis Bachelier derived the relevant stochastic process in his Ph.D. thesis of 1900 on speculation applied to stock markets. What Einstein achieved was a formulation amenable to experimentation expressed in the language of physics.
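The statistical regularity that ultimately made Brownian motion quantitatively useful is that the net displacement of a particle subjected to many small random impulses has a variance that grows linearly with the number of impulses, i.e., with time. The sketch below (ours, with arbitrary unit step sizes rather than a model of any particular experiment) demonstrates this behavior, which reappears in Einstein's analysis described next.

    import random, statistics

    random.seed(3)
    n_particles = 2000

    def displacement(n_steps):
        # Net one-dimensional displacement after n_steps unbiased unit steps
        return sum(random.choice((-1.0, 1.0)) for _ in range(n_steps))

    for n_steps in (100, 200, 400):
        var = statistics.pvariance([displacement(n_steps) for _ in range(n_particles)])
        print(f"steps = {n_steps:4d}   displacement variance = {var:7.1f}   (expected {n_steps})")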
Einstein began his paper ("On the Motion of Small Particles Suspended in a Stationary Liquid, as Required by the Molecular Kinetic Theory of Heat", Annalen der Physik 18 (13), 1905d) by pointing out that while classical Thermodynamics contains the notions of diffusion and osmotic pressure, it does not provide a description of the irregular motions of suspended particles, and if an experimentally verifiable description could be derived within the molecular-kinetic theory of heat as embodied in Statistical Mechanics, it would add to that theory's credibility, especially given that he saw a way in which it could provide a means for computing molecular sizes. He established an equivalence between the osmotic pressure of ordinary solvents and that of suspended particles for dilute solutions, then showed that dynamic equilibrium in the number density
of particles in a volume element results from an average balance between forces acting on the particles and osmotic pressure. He used this to show that diffusion takes place with a coefficient that depends on the particle size and fluid viscosity. By introducing a zero-mean probability distribution for the distance a particle moves in a short time interval, he was able to derive Fick's second law of diffusion and show that the distance on a given coordinate axis that a particle drifts in a time t has a Gaussian distribution with a variance that is linear in t, i.e., a standard one-dimensional unbiased random walk. The root-mean-square displacement of an observable suspended particle over one second of time under standard laboratory conditions was presented as a function of the suspended particle's size, liquid viscosity, temperature, pressure, the universal gas constant, and Avogadro's constant. By measuring all but the last, Avogadro's constant can be computed, and since the mass of the solvent is easily measured, the mass of its constituent molecules is readily obtained. Knowing the volume occupied by one mole of the solvent also provides an estimate of the size of the atoms or molecules that make up the fluid.

The validity of Einstein's theoretical results was soon established experimentally by a number of researchers, most notably Jean Baptiste Perrin (1909), and this work was a substantial part of what earned him the Nobel Prize in 1926. The debate over whether molecules really exist was over, atoms were accepted by mainstream physicists, and the door was open to the development of Quantum Mechanics, much of whose foundation was already in place because of the work of pioneers like Planck and Einstein.

3.7 Reconciling Newtonian Determinism With Random Walks Through Phase Space

Einstein's paper on Brownian motion was just one example of his use of probability theory. In his paper on the photoelectric effect ("On a Heuristic Point of View about the Creation and Conversion of Light", Annalen der Physik 17 (6), 1905), a key ingredient in his argument that light behaves as though it consisted of particles was his derivation of the fact that the entropy of blackbody radiation varies with volume in the same way as the entropy of a perfect gas, and he specifically relates this entropy to the probability of the radiation field's state.

Einstein found the notion of probability very useful, and yet he was a staunch believer that the physical universe obeyed absolutely deterministic laws, even to the point of conceding the implication that free will would have to be an illusion, as would the concept of morality based on the idea that individuals are free to choose good or evil. He once said (quoted by Isaacson, 2007, p. 393) "I know that philosophically a murderer is not responsible for his crime, but I prefer not to take tea with him." Nevertheless, he advocated strongly for morality, also going on record with "Only morality in our actions can give beauty and dignity to life." (ibid.) Einstein's own life and work were noteworthy for many things, not least for consistency. No doubt he felt that his commitment to morality was something about which he had no choice. Not that genuine randomness in the physical universe implies free will in any common sense of that phrase, but strict determinism surely precludes it. Clearly Einstein believed that the randomness described by probability as he used it was strictly epistemic.
This belief caused a rift between him and many other scientists after Quantum Mechanics was formulated and the challenge of interpreting the formalism was realized. Such interpretation has been elusive because of the fact that Quantum Mechanics was discovered as much as invented. For example, attempts to modify classical physics to handle atomic phenomena led to
situations in which certain variables had to be constrained to take on only integer values for no apparent reason other than the very compelling fact that it made the equations describe observed behavior with amazing accuracy. But further discussion of this topic must be postponed to Chapter 5. For now we are concerned with how a strictly deterministic universe could behave in a manner consistent with randomness.

What Einstein believed could be stated differently as follows: the microscopic processes that drive the evolution of the universe in time amount to a very good pseudorandom number generator. But why should this be? In fact, it is not always true even in classical physics. We mentioned above that parts of the constant-energy hypersurface of the microcanonical ensemble are not available to the system, giving the simple example of a single molecule bouncing back and forth between the same two points on opposite walls of a chamber and thus unable ever to reach any other points in the chamber that do not lie on its trajectory. But this is hardly a problem for Statistical Mechanics, since it can be solved exactly using simple algebra and Newton's laws. Usually some complexity is required before we can claim to have a good pseudorandom number generator. In the case of the single molecule, all we need is complete ignorance of the phase of the oscillating particle, since this effectively renders its position uniformly distributed along its path.

We have referred to the epistemic uncertainty surrounding which day of the week corresponds to August 19, 2397. As long as that is not today's date and we don't actually compute the day of the week, there is no effective difference between epistemic and nonepistemic randomness; each day of the week is equally likely to have that date. But if the relationship were truly nonepistemically random in general, a very great observable difference would result. If one has to go to work on Monday morning, and it is Sunday evening, one would not know until after midnight whether one had to show up for work in the morning. Planning would have to be done on the basis that the prior probability of tomorrow being a workday is always 5/7. Life would be very different indeed.

Why is this distinction unimportant regarding the motion of air molecules in the room in which we sit? The main difference is that the various configurations of air molecules in the room are both unthinkably large in number and completely interchangeable for all practical purposes, so that it does not matter whether they evolve deterministically or randomly as long as the distribution of random configurations corresponds to the distribution of configurations that evolve deterministically. The analogy to workdays would be more accurate if the number of workdays in the week were Avogadro's constant while the number of weekend days remained two. The chance that tomorrow will be a weekend day would be completely negligible independently of whether tomorrow is chosen randomly or by a deterministic process of sequentially dealing from a shuffled deck.

What is epistemically random in Statistical Mechanics is the location of a system in phase space. This location is generally pictured as moving along a random walk, not as jumping haphazardly from one point to a completely unrelated point. At any instant, all the particles in a system have some position and momentum. As the positions evolve between collisions, the location in phase space moves correspondingly.
Whenever particles collide, their paths change as momentum is conserved, and the system point in phase space evolves accordingly. Because the mechanical processes are deterministic, the path of the system in phase space is deterministic, but because it is incomputable, it appears random, and therefore it is really a pseudorandom walk from certain points to certain other connected points, where the connection is through the laws of particle dynamics. There is an extreme similarity between the highly stable large-number statistics of epistemically random behavior and the purely deterministic physical behavior of classical physics. This has
a profound echo in Quantum Mechanics. Even though the latter allows nonepistemically random processes to occur, they are generally unobservable at the macroscopic level because the statistics are stable to dozens of decimal places in the measurements that can actually be made. The result is that ordinary experience appears to be based on perfectly deterministic physical laws. It is natural that the development of science would proceed in this fashion. Once we adopt the premise that physical laws exist, the idea that they would be deterministic arises before the realization that such laws may not work at all levels. Having embraced determinism as an all-encompassing feature of the universe, we find it very difficult to surrender, and for Einstein it was never possible. Fortunately, his battles to defend determinism stimulated great progress in the interpretation of Quantum Mechanics. These attempts eventually led to the design of feasible experiments that could discern the difference between epistemic and nonepistemic randomness in the natural laws themselves. The verdict has gone against Einstein's beliefs.

According to Wolfgang Pauli (1954), Einstein himself denied that he would insist on strict determinism to the bitter end. Pauli quotes Einstein as saying that he would not reject a theory on the sole basis that it was not rigorously deterministic, and in fact his long series of attacks on Quantum Mechanics ended with Einstein primarily rejecting the nonlocal effects that it predicts (this will be discussed in more detail in Chapter 5). Given Einstein's strong predisposition against nonepistemic randomness (to the point of accepting free will and morality as illusions), the fact that Quantum Mechanics made it nonlocality's bedfellow certainly did not improve its chances that he could be convinced to accept it.

3.8 Carrying Statistical Mechanics into Quantum Mechanics

In section 3.5 above we saw that the process of defining the entropy of continuous random variables encountered a stumbling block that did not present itself for discrete random variables. There are several ways in which discrete variables are easier to handle. For example, the probability that ten coin flips will yield three heads is well defined and straightforward to interpret, while the probability that the phone will ring at exactly 3 PM is not (on the classical assumption of continuous time). For the latter, we must consider the probability that the phone will ring between 3 PM and 3+Δ PM, where Δ > 0 is some arbitrarily small time increment. Otherwise, if we let Δ be zero, we end up with a probability of zero. For continuous random variables, we must deal with density functions. Other examples of continuous variables creating problems can be found in size distributions that diverge as size approaches zero, electromagnetic energy that becomes infinite as wavelength goes to zero, etc. Such annoyances can generally be handled, but we would not object if something made it unnecessary, some natural cutoff that removed the need to consider arbitrarily small continuous intervals.

A similar hindrance follows from the fact that the 6N-dimensional phase space of classical Statistical Mechanics is continuous. Instead of being able to work with the probability that the system is in a given microstate, we must deal with the probability that it is in some small volume of phase space around that microstate. We must introduce "coarse graining" into the phase space, and in so doing we implant a subjective element.
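The subjective element can be seen explicitly in a few lines of code. The sketch below (ours; the uniform distribution on (0, 36) echoes the water-trough example of section 3.5, and the bin widths are arbitrary) computes the discrete entropy of the binned distribution for several bin widths δx and shows that the result shifts by -ln δx, so its numerical value depends entirely on the graining chosen.

    import math

    a, b = 0.0, 36.0                       # uniform distribution, as in the water-trough example

    def binned_entropy(dx):
        # Discrete entropy of the uniform density on (a, b) chopped into bins of width dx
        nbins = int(round((b - a) / dx))
        p = 1.0 / nbins                    # each bin is equally probable
        return -nbins * p * math.log(p)    # = ln(nbins) = ln(b - a) - ln(dx)

    for dx in (1.0, 0.1, 0.01):
        print(f"dx = {dx:5.2f}   S = {binned_entropy(dx):.6f}   "
              f"ln(b-a) - ln(dx) = {math.log(b - a) - math.log(dx):.6f}")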
Fortunately, the advent of Quantum Mechanics resolved this issue. We must leave most details to Chapter 5, but the gist is that Quantum Mechanics provides a natural coarse graining of phase
space via the Heisenberg Uncertainty Principle. One way to express this principle is ΔpΔq ≥ h, where Δp is the uncertainty of the system's momentum, Δq is the uncertainty of the system's position, and h is Planck's constant. Since ΔpΔq is a volume element in phase space within which the system's state cannot be resolved with an accuracy better than h, there is no point in considering smaller volume elements, and there are indications that such smaller elements cannot even apply meaningfully to real physical systems. Gas particles in a chamber cease to resemble infinitely hard arbitrarily small ball bearings rattling around in a continuous space with continuously distributed energies and momenta. Instead they take on the appearance of processes extended in space and evolving in time in a manner that precludes interpreting them as completely localized particles. The very notion of a particle having a well-defined position and momentum ceases to be applicable, and systems with an infinite number of degrees of freedom find that number reduced to something finite. This is a highly simplified description, but it captures the basic concept. A more detailed discussion would have to include explicitly the number of degrees of freedom Ndf possessed by the system, each with its own quantum number, and the number of distinct microstates in the volume element becomes ΔpΔq/h^Ndf.

With this feature, accompanied by the appearance of Schrödinger's equation in place of classical Newtonian equations, some energy quantization rules, and a need to consider whether certain particles can be distinguished from each other, a surprisingly large amount of entropy/ensemble theory in classical Statistical Mechanics survives the transition to Quantum Mechanics. Randomness lies at the foundation of both formalisms. A sea change, however, is encountered in the kind of randomness that must be embraced after the transition. Previously it did not matter that the randomness was not real; it was just a pseudorandom substitute for our lack of knowledge and was not relevant to the fundamental Newtonian formulation of particle interactions. In the new theory, the mainstream view is that randomness acts directly on physical reality and suffuses the laws of its behavior. This view is not universally embraced; some physicists prefer "hidden variable" theories in which the randomness remains epistemic, but the failure of such theories to remove nonlocality renders them incapable of preserving all of the traditional notions of classical physics, casting doubts on their ultimate utility.
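As a rough numerical illustration of the cell counting just described (ours; the momentum and position uncertainties below are arbitrary illustrative values for a single degree of freedom, not figures from the text), one can ask how many quantum cells of size h fit into a given classical uncertainty product:

    h = 6.62607015e-34          # Planck's constant, J*s

    dp = 1.0e-24                # illustrative momentum uncertainty, kg*m/s
    dq = 1.0e-6                 # illustrative position uncertainty, m

    cells = (dp * dq) / h       # number of distinguishable cells for one degree of freedom
    print(f"phase-space volume      = {dp * dq:.1e} J*s")
    print(f"quantum cells of size h = {cells:.3e}")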
Chapter 4 Scientific Data Analysis

4.1 The Foundation of Science: Quantified Measurements

The word "science" comes from the Latin word "scientia", meaning "knowledge". From this very general notion, it has taken on many different connotations, but we use it herein to mean the highly systematized organization of demonstrable facts into logical structures that typically employ theoretical formalisms to link various parts of the network of ideas to each other and provide a cohesive interpretation of the whole ensemble. Without a theoretical foundation, observations are simply data, and without data, theories are mere speculations. Thus scientific theories must be based solidly on observed facts, and it is in this sense that we use the word "theory".

In some contexts, "science" is taken to be distinct from "engineering". We do not emphasize that distinction herein because of the substantial overlap in purpose and practice. Engineering is applied science, and the business of science makes considerable use of engineering principles in constructing and using its instruments of observation. In both disciplines, approximations that are known not to be perfect may be acceptable for certain purposes, although in pure science, the search for exact laws of Nature is considered meaningful by some theorists and admits no approximations in these laws themselves, only in their application to specific physical situations. The history of science has unfolded in such a manner that each theory that was successful in its own time has frequently been found to be an approximation to a more comprehensive theory.

As practiced in modern times, science seeks to avoid the pitfalls of human folly by adhering to rigorous standards of measurement reproducibility, peer review, and analysis of theoretical uniqueness and falsifiability. This causes many areas of human interest to fall outside of its purview, such as metaphysics, religion, and aesthetics. The importance of these other areas of philosophical endeavor is not diminished because science eschews them; it is simply the price of strenuously shunning the occasions of cognitive error and unprovable hypotheses. Not every worthwhile question has answers subject to rigorous application of scientific method. By the same token, practitioners of these other disciplines should be aware of the limitations placed on their conclusions regarding the level of certainty that is realistic to claim and acceptance that is reasonable to expect from those who did not experience the same revelation and are professionally obliged not to take the word of others on faith alone.

Scientific theories need not be "true" in any absolute sense as long as their scope is clearly delimited and they are consistent with all relevant facts known at the time. It is not possible to make science as immaculate in this respect as mathematics, but the same goals are in effect. This means that science must be based on reliable observations. The examples herein will be taken from astrophysical research, since that is the area in which the author's most extensive experience lies. It is also appropriate in that astronomy is one of the oldest sciences known to history. It was tightly coupled in ancient times with another, agriculture. Astronomical observations were used to guide agricultural activities such as planting and harvesting.

In order to be reliable, an observation must be reproducible and quantifiable. The meaning of the first is straightforward.
For example, if an observer discovers a new comet, other observers must
be able to confirm its existence. To be considered quantified, the comet’s position in celestial coordinates must be measured, the time of observation must be recorded, and its brightness and shape must be characterized numerically. The part of this that usually escapes the notice of the general public is that each of these numbers must be accompanied by a measure of uncertainty. Nontrivial measurements are never perfect. We may be able to “measure” the number of eggs in a carton and report that it is exactly 12, but that is an example of a trivial measurement. When measuring the position of a comet at a specific time, there is always some error. This is caused by the fact that the comet does not appear as a perfect point of light, determining the telescope pointing direction depends on hardware with limited accuracy, the optical system is not perfect, etc. In order to maximize the usefulness of the observation, all significant sources of error must be identified and characterized quantitatively. It is at this point that randomness and probability theory enter: since we cannot know the actual error in a measurement at the time we perform it (otherwise we would correct for the error), the best that we can do is to take into account all the possible values it might have and weight each according to its probability of being what actually happened. Thus measurement error is treated as an epistemic random variable whose distribution we estimate by calibrating our instrumental hardware. The end product is a representation of each measurement parameter (e.g., celestial azimuthal angle) not as a single number but as a probability density function (or discrete probability distribution if appropriate, but most of our examples will use continuous random variables, so density functions will be used). Furthermore, the possibility that any given parameter may have an error that is correlated with that of another parameter should be taken into account. This means that the density function for celestial azimuthal angle may have to be represented as part of a joint density function whose other component applies to a celestial elevation angle (see Equations 2.23, p. 67, and 2.31, p. 76). In most cases, describing the uncertainties with Gaussian random variables is an excellent approximation, and this means that each measurement’s density function is fully characterized by a mean and a standard deviation. The mean is generally the nominal measurement value, since if we can compute a nonzero-mean error, we usually subtract this off as part of the calibration exercise, thereby changing the nominal measurement value and reducing the expectation value of the error to zero. When the Gaussian approximation is not sufficient, representing the density function may require more parameters, especially if the error distribution is asymmetric. In such cases, the nature of the density function must be specified clearly. The absence of such an explicit definition is usually taken to mean that the Gaussian model is in effect. When several position measurements for a given comet at different times have been characterized in this way, an orbit can be computed in terms of parameters that also have quantified uncertainties obtained from the measurement uncertainties via the theory of functions of random variables (see Appendix B). 
This allows the comet's position and associated uncertainty to be computed for other times, and when those positions are observed at those times, knowing the uncertainties guides judgment of whether objects that may be found correspond to the comet in question, and if no object is found, whether that is plausible. This is an example of what we mean by measurements being useful. There are many ways in which the utilization of our measurements by other scientists depends as completely on the uncertainties as on the nominal measured values. Because designing the error model for a measurement takes many forms depending on the kind of measurement, we cannot give a complete description of how this process is accomplished here. In general, it is fair to require that observers understand their hardware sufficiently well to know how
to model its workings, otherwise the result that comes out of a measurement is essentially useless for any purpose beyond demonstrating that something has been discovered, which is certainly a very important step in advancing science but not properly called a “measurement”. The crucial feature of any measuring instrument is its ability to respond in some way to a stimulus corresponding to whatever is to be measured. It must be possible to design a mathematical model for the response as a function of input stimulus level. Once that has been achieved, it should also be possible to anticipate the response to noise stimuli, and then it is primarily a matter of calibrating the response using known inputs and observing the dispersions about repeated constant input levels. By “noise stimuli”, we mean anything capable of provoking a response other than the phenomenon we wish to measure or anything able to alter the response to that phenomenon. Thus “noise” is any source of “error” in the measurement, and the two terms are frequently used interchangeably. Similarly, “error” and “uncertainty” are often loosely interchanged, because the error distribution determines the uncertainty distribution. The mathematical model of how an observing instrument reacts to the stimulus it is intended to measure is called its response function. The quality of an observing instrument is directly reflected in the quality of its response function, since the latter is the mechanism through which the former provides quantitative measurements. A response function must be able not only to predict the instrumental output for a given input, it must also be able to provide the value of the uncertainty. For the most common case of Gaussian errors, that uncertainty is represented by the standard deviation of the dispersion in a set of actual output values given the same input value. We do not expect real hardware to produce a single response to a set of repeated constant inputs, because we do not expect the noise to be the same on every measurement. If the same input always produces the same output, then our precision is too limited, and the uncertainty is dominated by truncation error which must be modeled as uniformly distributed over the range of the least significant digit. It is preferable to arrange for enough precision in the instrumental output to sample the noise, and we will assume below that this is the case. The dispersion in these outputs should be predicted by the response function’s error model with an accuracy sufficient to support scientific purposes. Before we consider what “accuracy sufficient to support scientific purposes” means, we must digress briefly to review the concept known as “systematic error”. We said above that we expect our hardware to show different responses to repetitions of the same input stimulus. This assumes that the noise has some randomness in it that is uncorrelated from one measurement to another. And yet it is possible that one component of that noise is the same error on every measurement, which is what we mean by “systematic error”. We may assume that the noise is not all due to systematic error, because if it were, we would get the same response to repetitions of the same stimulus, and this would be absorbed into the zero point of the response function during instrumental calibration. Thus we need not consider situations in which all instrumental error is systematic. 
We will need to address the role of systematic error again later, however, when we get to the point of considering how a single measurement of one phenomenon is used to process multiple measurements of another, since any error, no matter how it arose, repeats itself every time a single measurement is used. Recall that what makes an error systematic is only that it has the same (but unknown) value every time it enters a calculation. If we view the noise in repeated measurements of the same stimulus as a series of random draws, then the systematic component (usually a bias or scale-factor error) is drawn only once and remains as it is on all measurements, whereas the nonsystematic error is drawn fresh each time. Both are zero-mean random variables (again, we would
subtract off any nonzero expectation value), and because both are random, we avoid the common but misleading usage of “random” to mean “nonsystematic”. Returning to the question of how accurately the response function should be able to model uncertainty: in section 2.11 (Sample Statistics) we addressed this issue and gave a rule of thumb based on the need to distinguish normal fluctuations from real physical differences. .... if two measurements of the same thing disagree, it is very useful to be able to determine whether they disagree by only 1 or 2 σ versus 4 or 5 σ, because the former would be considered an ordinary fluctuation, whereas the latter would suggest that some significant systematic effect is present. This is a borderline situation, but often scientific breakthroughs occur at the borderlines. By this we mean that if the disagreement is a tiny fraction of 1 σ, it is almost certainly insignificant, since factors of 2 or 3 error in the uncertainty don’t really change the interpretation, and if the disagreement is 20 or 30 σ, there is little doubt that no matter what the two observations were intended to quantify, they took the measure of two different things. But in the 2-5 σ range, things are not so clear-cut, and large errors in estimating the uncertainties cannot be tolerated, or else scientific information is lost. .... if we can keep such errors down to around 20%, we’re probably on solid ground. 20% is large when it’s the uncertainty in the measured quantity, but it’s probably fairly good when it’s a high-confidence uncertainty of the uncertainty. In the author’s view, keeping the error in uncertainty estimation below 20% is a worthy goal and often not easily met. Of course, smaller is better.
We will therefore consider that for a response function to be of professional quality, its error model must predict dispersions with an accuracy of 20% or better. Of course, in most cases the standard deviation of the dispersion itself should be much smaller than 20% relative to the response if at all possible. Again, smaller is better, but it is not always possible at scientific frontiers. As long as the uncertainty is well modeled, even highly uncertain measurements can be useful. In other words, if the measurement itself cannot escape being highly uncertain, then the only way that it can be useful is if the uncertainty has been reliably estimated.

That raises the question of how we know whether the uncertainty has been well estimated. There are two ways. With the passage of time, better observing instruments are designed and built, and these often measure the same things that were measured in the past. Better sensitivity and more accuracy usually reveal not only what the errors were in older measurements but also whether the uncertainties matched those errors. The other way doesn't require us to wait. It is usually possible to make repeated observations not only of calibration sources but also the objects of physical interest. Instead of observing a comet, suppose we make 100 observations of a distant star whose position may be considered constant over the observing interval, each yielding a nominal position and an associated uncertainty. Once we verify that all 100 objects observed really were the same star, we can compute an average position. Assuming independent Gaussian errors, we can use inverse-variance-weighted averaging. As shown at the end of Appendix D, this is the same as doing a chi-square minimization fit to a zeroth-order polynomial (see Equations D.26-28, pp. 441-442, for the case of uncorrelated errors; for correlated errors, see Equations 4.18-22 in this chapter, pp. 163-164). We will consider the error σi in the ith measurement of the star's declination, δi, which is the elevation angle in Earth-based celestial coordinates, the angle between the star and the projection of the Earth's equator onto the celestial sphere.
Once we have the average declination δ̄, we can compute the chi-square goodness-of-fit statistic (see Equation D.2, p. 434, for the uncorrelated case or Equation D.20, p. 440, for the correlated case; see Appendix E for general information on the chi-square statistic). For simplicity, we will show the uncorrelated case, which for this example takes the form

\chi^2 = \sum_{i=1}^{100} \frac{(\delta_i - \bar{\delta})^2}{\sigma_i^2}        (4.1)
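As a concrete illustration, the following minimal sketch (the simulated star position, the range of uncertainties, and the random seed are assumed for illustration, not taken from the text) computes the inverse-variance-weighted average declination and the statistic of Equation 4.1 for 100 simulated measurements; the plausibility check against the expected number of degrees of freedom is discussed next.

```python
import random
import math

random.seed(1)

# Assumed illustrative setup: a star at declination 10.0 degrees, observed
# 100 times with per-measurement Gaussian errors whose 1-sigma uncertainties
# vary between 0.2 and 0.5 arcsec (converted to degrees here).
true_dec = 10.0
n_obs = 100
sigmas = [random.uniform(0.2, 0.5) / 3600.0 for _ in range(n_obs)]   # degrees
decs   = [true_dec + random.gauss(0.0, s) for s in sigmas]

# Inverse-variance-weighted average (equivalent to a chi-square fit of a
# zeroth-order polynomial, as stated in the text).
weights = [1.0 / s**2 for s in sigmas]
dec_bar = sum(w * d for w, d in zip(weights, decs)) / sum(weights)

# Chi-square goodness-of-fit statistic of Equation 4.1.
chi2 = sum((d - dec_bar)**2 / s**2 for d, s in zip(decs, sigmas))

print(f"weighted mean declination = {dec_bar:.6f} deg")
print(f"chi-square = {chi2:.1f} (expect ~{n_obs - 1} +/- "
      f"{math.sqrt(2 * (n_obs - 1)):.0f} if the error model is accurate)")
```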
Since the zeroth-order polynomial has one coefficient, we lose one degree of freedom, so that we assign 99 degrees of freedom to the chi-square statistic instead of 100. If the error model is sufficiently accurate, then the value of chi-square should be plausible as a randomly drawn chi-square statistic with 99 degrees of freedom. As discussed in Appendix E, such a chi-square statistic has an expected value of 99 and a variance of 2×99, hence a standard deviation of about 14. This many degrees of freedom means that the chi-square distribution is fairly well into its Gaussian asymptote, so we can treat it as a Gaussian random variable with a mean of 99 and a standard deviation of 14. Assuming our error model is good, a ±3σ range is 57 to 141. Anything outside of that range probably signals problems. If we have significantly fewer than 100 measurements, we can always use the real chi-square cumulative distribution to judge statistical plausibility (see Equation E.7, p. 445), but the more measurements, the better we are able to detect problems with our error model. There are further complications that are beyond the present scope, such as whether position errors depend on stellar brightness (they generally do), in which case the chi-square tests must be performed for a range of brightnesses. But this much should illustrate well enough how chi-square tests can be used to verify error models.

4.2 Putting Measurements to Work

In his classic textbook, instead of a dedication, the great numerical analyst Richard Hamming placed a motto which has become justifiably famous, "THE PURPOSE OF COMPUTING IS INSIGHT, NOT NUMBERS" (Hamming, 1962). This in no way denigrates numbers, it simply stresses that they constitute the means, not the end, to a scientific study. His own dedication to optimal computation makes clear his belief that numbers must be treated with great respect if the insight we are to gain from them is to be as enriching as we hope. In Chapter N+1 of that book, he suggests that the first question one should ask when beginning an attempt to solve a problem is "What are we going to do with the answers?" The main point is that as we compute, we should have some idea of how what we are doing will contribute to the insight we hope to gain. Insight may not emerge automatically at the end of a rigorous computation, we may have to work for it, but we have our best chances if we manage to carry all the relevant information along through the process.

It may seem that in stressing the need for a thorough analysis of measurement uncertainty, we are making a case that does not need to be made. Unfortunately, it has happened more than once that the question "And what is the uncertainty in this result?" was answered "There's no way to estimate that." Such an answer places a burden on diplomacy. It was made clear above that without a well-characterized uncertainty, a "measurement" is essentially worthless for anything more than motivating the design of a better observing instrument.
And yet some researchers retain a disdain for "error analysis" that they acquired in their first undergraduate laboratory class, where they viewed it as an annoying distraction rather than a fascinating stage in the conversion of a measurement result to the knowledge they were seeking in the first place. The problem is that sometimes these classes are taught with no emphasis at all on the principle that we can express by paraphrasing Hamming: "The purpose of measurement is insight, not numbers." As we stated in the Preface:

Science proceeds from observations. Crafting ways to make observation more effective has led to a highly developed theory of measurement. The products of measurements are numbers that we call data. When data are subjected to interpretation, they become information, and there is no more crucial aspect of this phase than the characterization of measurement error ... When this is done properly, the product is real information, and when that information is understood as completely as humanly possible, knowledge has been acquired.
The reason why we advocate viewing a measurement result as a probability density function is that this carries all available information about the measured quantity to the latter stages of an analysis. Here is a list of seven (not entirely uncorrelated) reasons why this is a rewarding practice. In the sections that follow we will investigate the considerations in the first three items more fully.

1.) It provides a way to limit the effects of lower-quality data without rejecting them outright. In the Gaussian case, inverse-variance weighting allows the lower-quality data (i.e., with larger uncertainties) to make a contribution that is more likely to improve the average than degrade it, unless the larger uncertainties are so much larger that they reduce the contribution to negligibility, in which case, the average is effectively protected from useless data. In non-Gaussian cases, a similar effect occurs when optimal averaging is performed, as we will see below in section 4.8. While it is possible to construct pathological error distributions artificially in which exceptions to this occur, such distributions seldom occur in practice, especially if any form of robust outlier detection is employed, and pathological errors are more likely to be identified in the first place as a result of a good error analysis.

2.) When errors can be assumed to be Gaussian (the most common case), it permits least-squares fitting to use chi-square minimization, which in turn permits goodness-of-fit tests which can reveal breakdowns in modeling the response function, including its error-model component.

3.) It permits optimal use of the product it describes in other applications such as hypothesis testing (radiation-hit detection, variability analysis, position-based merging, etc.).

4.) Inspecting uncertainties can reveal problems early. If the uncertainties turn out to be significantly larger or smaller than expected, something upstream is wrong or poorly understood.

5.) It permits quantitative evaluation of whether requirements are being met and may raise alarms when requirements specify unreasonable goals.

6.) It facilitates determining whether one is improving the product during the development phase, e.g., whether the error bars are shrinking and which of several possible methods yields the best result.

7.) It really can't be escaped anyway. Omitting it is mathematically equivalent to doing it in a sub-optimal way. For example, least-squares fitting without uncertainty-based weighting is equivalent to weighting with arbitrary equal nonzero uncertainties and then not doing any goodness-of-fit test based on statistical significance (which is not produced by ordinary correlation analysis), because there are no prior uncertainties to support one.
Of course, these benefits can be obtained only if the "error analysis" is done with sufficient accuracy, and we assume that professional standards have been met in what follows. As with all human activities, mistakes will sometimes be made, and significant mistakes in characterizing the probability density function that describes a measurement will usually lead to incorrect conclusions and, in the worst-case scenario, journal article retractions. It is the fear of this that leads some researchers to shy away from the deep end of the pool. Fortunately, as we will see below, methods for detecting such mistakes do exist, and making the application of these methods a standard practice is part of the modern scientist's normal workday. Besides simple mistakes, another very common accuracy limitation is incomplete knowledge of noise distributions. Often the sources of significant noise are numerous and difficult to characterize, typically leading to the "noise distribution" being a superposition of several populations. In rare instances, this can be a crippling deficiency that no reasonable amount of diligent study can remove. In such cases, the best that can be done is to document clearly the impact on scientific results. Few explanatory supplements of scientific surveys are free of a "caveats" section.

4.3 Hypothesis Testing

Other than requiring a rigorous specification, the scientific use of the word "hypothesis" is the same as in ordinary conversation, a scenario that may or may not be true. A scientific hypothesis is usually of more than passing interest, and as a possibility considered in the context of information that is incomplete, it is a natural target for statistical analysis. Being scientific in nature, it is specified in terms of measured quantities. Sometimes a theoretical conjecture is called a "hypothesis" before any measurements exist to test it. That is not the kind of hypothesis we are considering here. "Hypothesis testing" refers to the use of some available information to decide whether to accept a particular interpretation of actual measurements. When the measurements conform to the standards we have discussed above, quantitative analysis is possible. Deciding whether to accept a hypothesis is a typical activity of modern science for which the discipline known as decision theory has been developed. As with most advanced topics mentioned in this book, we will not be able to go deeply into decision theory, and so the reader is referred to standard texts (see, e.g., Goldman, 1970). We should note, however, that decision theory does not tell us whether a given hypothesis is "true", but rather whether we should accept it based on our quantitative statement of the benefits of accepting it if it is true and of rejecting it if it is false, weighed against the costs of accepting it if it is false and of rejecting it if it is true. Both costs and both benefits must be quantified, and this is one of the few places where subjective value judgments enter.
It has been the author's experience that most scientists assign a much higher cost to accepting a false hypothesis than to rejecting a true one. The idea is that it is better for a barrel of good apples to be not quite full than for the barrel to be full but contain some rotten ones.

A completely specified hypothesis H0 may state that its scenario(s) did or did not give rise to the measurement(s). The former case is more common in the physical sciences, and the goal is to accept it. The latter is more common in other sciences, and the goal is to reject it, in which case it is usually called the null hypothesis. Either way, it can always be paired with an alternative hypothesis, H1, that states that some scenario(s) other than H0's caused the measurement(s). This means that in classical physics, the decision is binary, only two hypotheses need to be considered (in Quantum Mechanics, superpositions must be considered, but we will leave that to Chapter 5). It may be that numerous scenarios are contained in one or both hypotheses, in which case the phrase composite hypothesis is used, but a composite hypothesis still counts as a single hypothesis. For example, a hypothesis that we successfully observed the same asteroid at two different times could be false because the sightings were actually of different asteroids, or of a star and an asteroid, or of an asteroid and a photometric noise fluctuation, or of a star and a cosmic-ray hit, etc. There are typically many ways to be wrong and only one way to be right.

The most powerful form taken by binary decision theory is the likelihood ratio test. This involves deciding whether to accept H0 on the basis of the ratio P(x|H0)/P(x|H1), where x is the set of measurements, and P(x|Hi) is the conditional probability that the measurements would be obtained as they were given the scenario(s) in Hi. When this ratio is greater than some threshold, we accept H0, otherwise we reject it. The threshold is set according to some cost function that expresses the value of making the right decision relative to the penalty for making the wrong one. Because conditional probabilities are involved, P(x|H0) and P(x|H1) do not generally sum to 1. If they did, the likelihood ratio would reduce to P(x|H0)/(1-P(x|H0)), and we would have only one conditional probability to evaluate, but the distribution(s) relevant to the two hypotheses are generally distinct. For example, the probability that a successful pair of asteroid sightings would occur in the form taken by the two measurements depends on the observational errors modeled and used in the probability density functions that describe the two measurements, while the probability that the two sightings were actually a star and a cosmic-ray hit depends on the spatial stellar density and the radiation environment, among other things, and the other scenarios in the alternative hypothesis similarly depend on their own noise populations. So the probability that H0 is true and the probability that it is false sum to 1, but this is not true of the probability that the measurements would result from H0 and the probability that they would result from H1. The latter pair of probabilities can generally be computed, whereas the probability that H0 is true usually cannot. The latter is the probability that the object observed on each sighting was the same asteroid. The scenario is well defined, but the meaning of a probability that the scenario corresponds to reality is murky. What kind of probability would this be?
In section 1.5 we considered various interpretations of probability. The subjective, or measure-of-belief, probability cannot be made quantitative here. The classical and axiomatic interpretations do not tell us how to compute this probability. Only the frequentist interpretation makes much sense, and even it has difficulties. It says that, given an arbitrarily large ensemble of independent parallel universes, the probability is the number in which this one asteroid is detected in our observing instrument at the two measurement times relative to the total number of such parallel universes in which we attempt those measurements. But what could thwart such detections?
Certainly the errors modeled in the measurement response function, since they might induce fatal fluctuations some fraction of the time, but other effects could also prevent a successful pair of observations, namely the very phenomena included under the alternative hypothesis. Those alternative scenarios would have to be taken into account somehow, and the likelihood ratio is a fairly straightforward way and is at least in principle computable when probability distributions for the phenomena in the alternative scenarios can be constructed with sufficient accuracy. Nevertheless, as we will see, sometimes those alternative distributions cannot be constructed with the resources available, and approximations cannot be avoided, and so a decision based on the measurement model alone may turn out to be required. As Goldman (1970) states, "If we know very little about the alternatives it is hard to design a good test. Admittedly, the likelihood-ratio tests are uniformly most powerful if H0 and H1 are simple or if H1 is located conveniently with respect to H0." (The phrase "located conveniently" refers to the geometry of the subset of the sample space in which we would reject H0, also called the critical region.)

The likelihood ratio test can often be used successfully when discrete random variables are involved. For example, suppose that the hypothesis H0 is that a given coin is unbiased, i.e., the probability of heads on a given flip is 0.5, and the alternative hypothesis is that the coin is biased towards heads with a probability of 0.6. Then if ten flips yield (e.g.) seven heads, the probability of that happening under each hypothesis is straightforward to calculate using the binomial distribution (see section 2.2 and Equation 2.3, p. 39). One thing that can be learned by working on this example is that the probability of rejecting H0 when it is true and of accepting it when it is false are both uncomfortably high with only ten tosses, and one can compute how many tosses are needed to drive these probabilities down to any desired level. Thus decision theory can be used to guide the design of an experiment, not just interpret observed outcomes.
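A minimal sketch of this coin-flip calculation (the implementation, the simple threshold rule, and the 5% error targets are assumptions made for illustration, not taken from the text): it computes the binomial probabilities of the observed outcome under each hypothesis, the resulting likelihood ratio, and how the two error probabilities shrink as the number of tosses grows.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips with head probability p."""
    return comb(n, k) * p**k * (1.0 - p)**(n - k)

# Ten flips yielding seven heads, as in the text: H0 is p = 0.5, H1 is p = 0.6.
print("P(7 heads in 10 | H0) =", binom_pmf(7, 10, 0.5))   # ~0.117
print("P(7 heads in 10 | H1) =", binom_pmf(7, 10, 0.6))   # ~0.215
print("likelihood ratio P(x|H0)/P(x|H1) =",
      binom_pmf(7, 10, 0.5) / binom_pmf(7, 10, 0.6))      # ~0.55

def error_probs(n, threshold, p0=0.5, p1=0.6):
    """Error probabilities for the rule 'reject H0 if heads >= threshold'."""
    type1 = sum(binom_pmf(k, n, p0) for k in range(threshold, n + 1))  # reject H0 | H0 true
    type2 = sum(binom_pmf(k, n, p1) for k in range(0, threshold))      # accept H0 | H1 true
    return type1, type2

# How many tosses are needed before some threshold gets both errors below 5%?
for n in (10, 50, 100, 200, 300):
    worst, t = min((max(error_probs(n, t)), t) for t in range(n + 1))
    print(f"n = {n:3d}: best threshold = {t:3d} heads, "
          f"worst error probability = {worst:.3f}")
```

With these assumed targets, roughly three hundred tosses are needed before both error probabilities fall below 5%, which illustrates the point about using decision theory to size an experiment.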
This kind of ease of use occurs less often with continuous random variables such as the error in reconstructing the position of an observed star. When two measurements are compared to see whether we should accept a hypothesis H0 that the same star is involved in both, the probability that the positions would differ by the observed amount is not as clear-cut (e.g., the star's position may not be constant), and the probability of the alternative hypothesis is even more troublesome. Thus it often happens that a decision algorithm must be made to depend only on some probability measure that H0 would produce the observed result, and this cannot be taken to apply to the exact position discrepancy observed, because that probability is zero. The corresponding probability density is generally not zero, but since the density at any given source separation generally depends on the variance of the measurement distributions, applying a single threshold to densities introduces biases against broader distributions, which may or may not be an acceptable feature. In any case, the probability of exactly the observed separation is zero (a nonzero density multiplied by a zero interval), and a different probability measure for the hypothesis must be designed.

Specifically, we can ask: in the infinite ensemble of universes that are identical except for randomly drawn noise fluctuations, what fraction of the time would a successful pair of observations yield position discrepancies that are greater than those observed? This fraction can be computed from the measurement density function alone, and if it turns out to be comfortably close to 1, we will probably accept H0 as sufficiently likely compared to any alternative whose probability might have been evaluated with unlimited resources. If the fraction is absurdly small, we will probably reject H0 because the measurement results in its scenario must have been caused by some glitch. In between we would have to do a delicate balancing act that depends on the relative costs of the two kinds of decision error.
We must set a threshold based on the fraction of all true results we are willing to sacrifice in order to avoid false results induced by alternative scenarios that we understand only qualitatively. We will look further into this approach in section 4.4 below. There are also other ways to use the measurement probability density functions to construct a decision algorithm, each with its own kind of threshold and challenges in setting a value. One such approach will also be discussed below in section 4.5, a method based on the cross-correlation of the density functions that can be useful when highly non-Gaussian errors dominate or when the spreads of the error distributions vary considerably. The point is that such decision algorithms are available if and only if the random-noise aspects of the measurements are optimally characterized.

Although there is only one threshold, and a decision is either correct or incorrect, as noted above, there are two kinds of correct decision and two kinds of incorrect decision. Only the latter are given specific names: rejecting H0 when it is true is called an error of type 1 (also an error in completeness or a false negative), and accepting H0 when it is false is called an error of type 2 (also an error in reliability or a false alarm or a false positive). These different errors generally incur different penalties, and balancing the corresponding costs against the benefits of the two kinds of correct decision is how the threshold is set. When we refer below to "the probability of a hypothesis", we are using that as a short way of saying "the probability that the hypothesis gave rise to the measurements", not "the probability that the hypothesis is true". When we refer to a "probability measure for a hypothesis", we mean a probability that is monotonically related to the probability that the hypothesis gave rise to the measurements. A probability measure can be useful as a proxy when the desired rigorous probability cannot be computed.

4.4 Hypothesis Testing Example 1: Matching Sources in Two Catalogs With Gaussian Errors

A frequent activity in astrophysical research involves identifying which point sources in a given survey catalog correspond to entries in a different survey catalog. By "point sources" we mean astrophysical objects that emit detectable electromagnetic energy in the survey's wavelength channels and are too small in angular size to be resolved spatially by the observing instrument. These are typically stars, but often they are asteroids, comets, and unresolved galaxies and quasars. Such catalogs typically differ in the wavelength channels used for each survey, and so spectral information is gained by matching entries. Spectral information is the key to deducing astrophysical properties such as stellar temperatures, distinguishing stars from unresolved galaxies, etc. Many surveys operate at multiple wavelengths, but combining the information in different catalogs can provide a great amplification in wavelength coverage. In order to remain photometrically unbiased, the matching is usually done on the basis of position coincidences alone. The sky positions are represented as angles in some spherical coordinate system, and we will use the most common of these herein.
This is the "celestial coordinate system" whose poles are the projection of the Earth's poles onto the "celestial sphere", i.e., the sky appearing as a hollow sphere viewed from its center, and whose azimuthal zero point is the vernal equinox, which is where the great-circle projection of the Earth's equatorial plane onto the sky crosses the great circle containing the apparent path of the Sun during the year, called the ecliptic. These two great circles intersect at two diametrically opposite points. The vernal equinox is the point at which the Sun appears to pass from south of the celestial equator to north of it, which currently occurs typically not long before or after March 21.
Since the Earth's poles wobble a bit, their projection onto the celestial sphere varies with time, and this same wobble affects the intersection of the ecliptic and celestial equator. Because the sun's apparent motion along the ecliptic is really due to the Earth's orbital motion around the Sun, and since the Earth's orbit is perturbed by other solar-system bodies, these orbital perturbations also contribute to the time dependence of the celestial poles and vernal equinox. The end result is that when using celestial coordinates, one must specify the epoch at which the system is defined. We will assume that such details have been taken into account.

The azimuthal angle is called "right ascension" and typically represented by the Greek letter alpha, α, and is traditionally measured in units of time, since this system arose for navigational purposes, and time of day was of great interest. Thus the full range of α is from zero to 24 hours, and any given value is traditionally represented in hours, minutes, and seconds. This is of course terribly inconvenient for computations, and so modern computer processing is always done with α represented in degrees or radians, and conversion to hours, minutes, and seconds is done only for final display purposes, if at all. We will assume that α is measured in floating-point degrees (i.e., not in integer degrees, integer minutes of arc, and floating-point seconds of arc). The other angle coordinate is an elevation angle relative to the celestial equator. It is called "declination" and is typically represented by the Greek letter delta, δ, and we will assume that it is also measured in floating-point degrees. It is zero at the equator, positive in the northern hemisphere, and negative in the south, and therefore it has a range from -90° to +90°.

Away from the equator, the azimuthal distance between two point sources at the same declination is not a true angular measure of their separation, since the diameter of the small circle of constant declination has been shrunk relative to the equator's great circle by a factor of cos δ, and this has to be taken into account when star separations are computed and compared to position uncertainties, which are typically represented in true angular measure. Thus a one-arcsecond uncertainty in α, usually denoted σα, is the same angular distance along the line of constant declination anywhere on the sky. As one might expect, problems can develop very close to a pole, e.g., at δ = 89.99991158°, the entire distance all the way around the line of constant declination is only one arcsecond of true angular measure. Similarly, star positions that straddle α = 0° must be handled carefully, e.g., positions at α = 0.0001° and α = 359.9999° appear to be almost 360° apart, whereas the difference in α is really only 0.0002°. For such reasons, observed point source positions close enough to be considered candidates for belonging to the same object are usually transformed into a local Cartesian coordinate system in which the spherical distortions can be neglected over the small angular distances involved. Thus we usually deal with (α,δ) coordinates transformed to Cartesian (x,y) with σx = σα and σy = σδ, and this will be assumed in the examples below.
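The transformation just described can be sketched as follows (a hedged illustration, not the book's code; the function name and the choice of reference declination are assumptions). It handles the wrap-around at α = 0° and applies the cos δ foreshortening so that small offsets can be treated as locally Cartesian, with uncertainties carried along unchanged in true angular measure.

```python
import math

def local_offsets_deg(alpha1, delta1, alpha2, delta2):
    """Approximate local Cartesian offsets (dx, dy) in degrees of true arc
    from source 1 to source 2, valid for small separations away from the poles."""
    dalpha = alpha2 - alpha1
    # Handle the wrap-around at alpha = 0/360 degrees.
    if dalpha > 180.0:
        dalpha -= 360.0
    elif dalpha < -180.0:
        dalpha += 360.0
    # Shrink the azimuthal difference by cos(delta) to get true angular measure;
    # the mean declination of the pair is used as the reference.
    dec_mid = math.radians(0.5 * (delta1 + delta2))
    dx = dalpha * math.cos(dec_mid)
    dy = delta2 - delta1
    return dx, dy

# Example: two positions straddling alpha = 0 degrees, as in the text.
dx, dy = local_offsets_deg(0.0001, 30.0, 359.9999, 30.0)
print(dx * 3600.0, "arcsec in x,", dy * 3600.0, "arcsec in y")   # ~ -0.62, 0.0
```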
If the catalog authors have really done their homework, we will even have the position error covariance, often represented as a co-sigma σxy, where the off-diagonal element of the error covariance matrix is vxy = σxy|σxy| (see section 2.10; also Appendix D, Equation D.9, p. 436, and Appendix E, Equation E.3, p. 443). We will assume that this is the case, since when it is not, the formalism will still work with the off-diagonal element set to zero. When referring to 1σ uncertainties below, we mean the principal axes of the "error ellipse" (see section 2.10, Equations 2.30 and 2.31 and Figure 2.12, pp. 75-76). The main business of catalog matching involves comparing each point source position in one catalog to every point source position in the other.
At least that's the principle; in practice, much more efficient ways to accomplish the spirit of the principle exist, e.g., sorting both catalogs by declination and not searching beyond some declination range considered too large to be of interest. Since we are not discussing computer science here, we will assume that the principle of allowing every source in one catalog to come into position comparison with every source in the other is effectively achieved. That means that at the core of the processing we have one source from each catalog under mutual scrutiny, selected because their positions are within some coarse search distance of each other, where this coarse distance is set large enough so that essentially no true matches will be missed. By "true matches" we mean a pair of positions, one from each catalog, belonging to one and the same astrophysical object. For example, if each catalog has position uncertainties that are never very different from one arcsecond (i.e., one angular second of arc, 1/3600th of a degree of true angular measure, abbreviated "arcsec"), then a coarse distance of 10 arcsec provides for 7σ root-sum-squared fluctuations. In this section, we will assume that the position errors are Gaussian and that their 1σ uncertainty does not fluctuate greatly in magnitude in either catalog. In the next section we will consider what is different when one or both of these assumptions cannot be made. For now, a coarse search distance of 10 arcsec means that only one true match out of every 72 billion will be missed because the coarse distance was not large enough. Since there is usually some fluctuation in uncertainty, this might be overestimated significantly, and typically the accuracy of the uncertainty estimation is not considered something about which to be supremely confident. If a nominal margin of 7σ is not good enough, then the coarse distance is simply enlarged as needed.

The purpose of the coarse matching is to gather pairs of point sources for which it is worthwhile to incur the additional computing expense required for an optimal test. It often happens that a source from the first catalog is within the coarse distance of more than one source from the second catalog and vice versa. The testing is done pairwise for all combinations. If more than one pair passes the test, the one which passes most convincingly is the one accepted, and usually some sort of catalog tag is set to indicate that confusion occurred. Our concern here is just how to perform a test for a single pair of sources. The data we have for this purpose are the two sets of position coordinates (xi,yi), i = 1,2, and the two sets of error covariance matrix elements (vxi,vyi,vxyi), i = 1,2. Since our data are limited to these parameters, we do not have enough information to formulate probability measures for the alternative hypothesis, and we must proceed to design a decision algorithm with the information at hand. This is often the case. By thresholding on a probability measure obtained exclusively from H0, we are implicitly assuming that the probability of the alternative hypothesis can be approximated as some constant. In the example of two catalogs from surveys taken in different wavelength channels, each catalog will usually contain some sources not in the other. The frequency at which two unrelated sources lie close enough together to look consistent with H0 will depend on spatial source densities and uncertainty size.
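The 1-in-72-billion figure can be checked with a two-line calculation (a sketch under the stated assumptions of 1-arcsec Gaussian errors in each catalog, so that the radial discrepancy of a true match is Rayleigh-distributed with a root-sum-squared σ of √2 arcsec):

```python
import math

sigma_rss = math.sqrt(1.0**2 + 1.0**2)   # arcsec, root-sum-squared of the two catalogs
coarse = 10.0                            # arcsec, coarse search radius

# For 2-D Gaussian errors the radial discrepancy is Rayleigh-distributed,
# so the probability of exceeding the coarse radius is exp(-r^2 / (2 sigma^2)).
p_miss = math.exp(-coarse**2 / (2.0 * sigma_rss**2))
print(p_miss, "->", f"about 1 in {1.0 / p_miss:.0f}")   # ~1.4e-11, about 1 in 72 billion
```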
A very frequently employed decision algorithm for such cases is the chi-square test. The probability measure is the fraction of times H0 would yield results at least as discordant as what has been observed, i.e., the area under the position-discrepancy density function in the high tail above the observed discrepancy. The smaller that fraction, the less plausible the hypothesis. The threshold must be based on some qualitative idea of the processes that could make the hypothesis false, together with the acceptable fraction of decisions that reject the hypothesis when it is true, i.e., the probability of a type 1 error.
If the source densities are low enough so that the mean separation is large compared to the position uncertainties, then we can set the threshold to a low fraction. If the mean separation is not much larger than typical uncertainties, then the probability of type 2 errors will be relatively large, and we must raise the threshold, which then also raises the probability of type 1 errors. The harder we try to avoid accepting the H0 when it is false, the more times we will reject it when it is true. In this example, “completeness” is the fraction of all valid pairs accepted as such by the decision algorithm relative to all valid pairs contained in the input data. “Unreliability” is the fraction of all incorrect pairs accepted by the decision algorithm relative to the total number of decisions made, and “reliability” is 1.0 minus the unreliability. The cleaner the data, the higher the combination of reliability and completeness can be, and the messier the data, the more that combination is limited. For any given mean source separation and position uncertainty, reliability and completeness always compete. To raise one, we must lower the other. The only way to raise both is to use a more optimal decision algorithm. An example of a less optimal algorithm is to accept the closest two sources that fall within a given search radius, independently of position uncertainty. This is seldom necessary anymore, but sometimes it is done for quick-look purposes. If all position uncertainties had circular error ellipses of the same radius, then this could be made as optimal as the chi-square algorithm, but such uncertainties essentially never happen in modern observations. Position uncertainties usually vary significantly as a function of source signal-to-noise ratio, as well as other effects. The threshold for this method is just the maximum source separation that we will accept. In general, the larger the acceptable separation, the higher the completeness and the lower the reliability, but other than that, there is no formal relationship between threshold and completeness, unlike the chi-square method, whose threshold is just the probability of a type 1 error. Both methods leave reliability to scientific intuition and should be probed via simulations if possible.
Figure 4-1. Comparison of Completeness vs. Reliability for two position-based point source matching decision algorithms. The solid curve (the Spitzer Bandmerge algorithm) uses a chi-square test that takes advantage of the position error covariance matrix. The dashed curve (GSA, General Source Association) uses only position proximity within a maximum acceptable radial distance to decide whether there is a match (from Laher and Fowler, 2007).
A comparison of the closest-acceptable-pair method and the chi-square test was made by Laher and Fowler (2007) for a substantially challenging set of astronomical data. The results are shown in Figure 4-1. This is a plot of completeness versus reliability for matching two source lists generated by simulation to have the spatial densities and position errors characteristic of the Spitzer Space Telescope source detections in the wavelength channels at 3.6 and 24 microns. Simulated sources were used so that completeness and reliability could be computed from known true values. The source matching was done for a range of thresholds so that several points in the plane could be obtained for each method, then smooth curves were fitted through these points. The figure shows the upper right portion of the plot, which is where judgments are made about setting a threshold by trading off completeness versus reliability. The dashed curve labeled “GSA” (General Source Association) is the locus of closest-acceptable-pair results, and the corresponding curve for the chi-square test is the solid line labeled “Bandmerge” (the name of the Spitzer computer program that matched sources in different wavelength bands). Both methods were able to achieve a completeness of 98%, but the less optimal method could do this only by degrading the reliability to 92.7% or less, whereas the chi-square method could do it with a reliability as high as 97.4%. In practice, this is not where most scientists would choose to set the threshold, but the spectrum of trade-offs provided by the chi-square test is clearly more desirable. Settling for 95% completeness in order to get 98% reliability would be a more typical threshold choice. For a given source pair, the chi-square test is performed as follows. It is invariant under permutation of the catalogs, so it does not matter which source we label as 1 or 2.
\Delta x = x_2 - x_1, \qquad \Delta y = y_2 - y_1

v_x = v_{x1} + v_{x2}, \qquad v_y = v_{y1} + v_{y2}, \qquad v_{xy} = v_{xy1} + v_{xy2}

\chi^2 = \frac{v_y\,\Delta x^2 + v_x\,\Delta y^2 - 2\,v_{xy}\,\Delta x\,\Delta y}{v_x\,v_y - v_{xy}^2}        (4.2)
Note that the denominator in the last line will always be greater than zero for physically realistic cases, because it is the determinant of the source-separation error covariance matrix, which is proportional to the square of the area of the error ellipse, which is always greater than zero. Note also that if the off-diagonal covariance element is zero, then chi-square reduces to the more common form for two degrees of freedom
\chi^2 = \frac{\Delta x^2}{v_x} + \frac{\Delta y^2}{v_y}        (4.3)
The Q distribution of a chi-square random variable with two degrees of freedom is especially simple (unlike the general case; see Appendix E, Equation E.7 (p. 445), with Q = 1-P). This is the area under the high tail of the density function.
Q_2(\chi^2) = e^{-\chi^2/2}        (4.4)
This makes it easy to threshold on chi-square itself. If the acceptable probability of an error of type 1 is K, then the maximum acceptable chi-square is -2 ln K. The advantage of the chi-square test over the closest-acceptable-pair method may be seen in Figure 4-2A, which shows the equal-probability contours of the position-uncertainty density functions for three sources. The source labeled A is from the first catalog and has uncertainties on both axes of 0.8 with no covariance. Sources B and C are from the second catalog and are thus both candidates for matching to source A. These two sources have unit uncertainty variance and an off-diagonal covariance of 0.4. Source A is at the origin, source B is at (2,2), and source C is at (1.5,-1.8). The source separation A-B is 2.828, and A-C is only 83% as large at 2.343. The closest-acceptable-pair method would take A-C, but the A-B separation is actually more probable, as may be seen by the greater overlap of the contours.
Figure 4-2 A. Contours of position uncertainty for three sources; source separation A-B is greater than A-C, but the A-B match is better than A-C because the separation A-B is more probable. B. The source separation A-D has a lower chi-square than A-B, but this is an artifact of the anomalously large position uncertainties of source D.
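A minimal sketch of this comparison (an illustration assuming the values given in the surrounding text, not code from the book) applies Equation 4.2 to the pairs A-B and A-C and converts each chi-square to the two-degree-of-freedom tail probability of Equation 4.4; it reproduces the chi-square values quoted in the next paragraph.

```python
import math

def pair_chi2(dx, dy, vx1, vy1, vxy1, vx2, vy2, vxy2):
    """Chi-square of Equation 4.2 for a source pair with position difference
    (dx, dy) and per-source covariance elements (vx, vy, vxy)."""
    vx, vy, vxy = vx1 + vx2, vy1 + vy2, vxy1 + vxy2
    return (vy * dx**2 + vx * dy**2 - 2.0 * vxy * dx * dy) / (vx * vy - vxy**2)

# Source A: at the origin, 1-sigma = 0.8 on each axis, no covariance.
vxA = vyA = 0.8**2
vxyA = 0.0
# Sources B and C: unit variance, off-diagonal covariance 0.4.
vxB = vyB = vxC = vyC = 1.0
vxyB = vxyC = 0.4

chi2_AB = pair_chi2(2.0, 2.0, vxA, vyA, vxyA, vxB, vyB, vxyB)     # B at (2, 2)
chi2_AC = pair_chi2(1.5, -1.8, vxA, vyA, vxyA, vxC, vyC, vxyC)    # C at (1.5, -1.8)

for name, chi2 in (("A-B", chi2_AB), ("A-C", chi2_AC)):
    q = math.exp(-chi2 / 2.0)            # Equation 4.4, two degrees of freedom
    print(f"{name}: chi-square = {chi2:.3f}, Q = {q:.3f}")
# A-B: chi-square = 3.922, Q = 0.141
# A-C: chi-square = 4.413, Q = 0.110  -> A-B is the better (more probable) match
```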
The value of chi-square for A-B is 3.922, and for A-C it is 4.413. Closely contested cases such as this usually comprise a fairly small fraction of the decisions that need to be made, but the difference between the two trade-off curves in Figure 4-1 arises from handling the marginal cases better, and this can be done only if due diligence has been paid to the task of developing an acceptable error model. Other than having no specific accounting of the alternative hypothesis, the limitations of the chi-square test are that the uncertainties must be acceptably approximated as Gaussian (not usually a problem), and the dynamic range of the uncertainties must not be too large. The latter stems from the fact that a measurement with an extremely large uncertainty can result in a misleadingly low value of chi-square, since uncertainties contribute to the denominator. Often very faint sources acquire very large position uncertainties. An example is shown in Figure 4-2B. Source D is from the second catalog and hence is a candidate to match source A. It is at position (-5,7), has 1σ position uncertainties of 8 on each axis, and the same error correlation coefficient as B and C. Source A falls well within its probability contours. The chi-square for A-D is 1.866. This would steal source A from source B if allowed to do so. Whether this should be permitted is a matter of scientific judgment, but in the author's experience, most scientists prefer not to allow low-quality source detections to interfere with those of better quality. Various ad hoc fixes can be applied, but if the challenge of dealing with a large dynamic range in uncertainties must be accepted, one should consider using a different kind of test, e.g., that described in the next section.

4.5 Hypothesis Testing Example 2: Matching Sources With Non-Gaussian/Variable Errors

The Infrared Astronomical Satellite (IRAS) performed the first infrared all-sky survey conducted from earth orbit (Beichman et al., 1988). Unlike modern astronomical telescopes, IRAS did not have what we now think of as an imaging array, i.e., a rectangular arrangement of thousands of essentially identical pixels uniformly tiling the enclosed area. IRAS had imaging optics, but they just produced an image of the sky on a continuously scanning focal plane that contained apertures leading to integrating cavities containing infrared-sensitive detector elements. Figure 4-3A shows a schematic of the IRAS 60 μm detector geometry. The black rectangles are the 16 apertures leading to the integrating cavities. At the focal-plane plate scale, all of these are 1.51 minutes of arc (arcmin) in the scan direction, y, and most of them are 4.75 arcmin wide in the cross-scan direction, x, with two at 1/4 and two at 3/4 this size that fill out the 30 arcmin width of the focal plane and allow the apertures to be staggered so that a source image passing through will fall on at least two detectors with a net overlap of no more than about 2.4 arcmin in the x direction. The telescope optics yielded a diffraction-limited point-spread function (PSF) at this wavelength that contained 80% of the photometric energy within a circle of one arcmin diameter. This provided oversampling of the source image in the scan direction but left the cross-scan direction extremely undersampled. The result of this is that source positions on the sky could be reconstructed with Gaussian uncertainties in-scan, but the cross-scan uncertainties were dominated by the uniformly distributed error due to the aperture overlap.
Figure 4-3 A. Schematic of the IRAS 60 μm detector aperture geometry; the black rectangles are apertures to integrating cavities containing detectors; source image motion is from top to bottom, caused by telescope scanning. B. Contours of a typical IRAS source-position probability density function (y scale expanded relative to x scale, units are arcsec). C. 3-D plot of a typical IRAS source-position probability density function (y scale expanded relative to x scale, units are arcsec).
The absolute telescope pointing direction was reconstructed with Gaussian uncertainties on both axes. These convolved with the source-detection position uncertainties to produce position probability density functions that were Gaussian in-scan and convolved Gaussian-Uniform (CGU) cross-scan. The contours of a typical density function are shown in Figure 4-3B, and a 3-D plot of the same density function is shown in Figure 4-3C. This density function has a Gaussian in-scan 1σ
uncertainty of 2 arcsec, a cross-scan Gaussian 1σ uncertainty of 5 arcsec, and a cross-scan uniform uncertainty half width of 72 arcsec. It is immediately obvious that the Gaussian approximation is completely unacceptable for these position uncertainties. Furthermore, the dynamic range of the uncertainties is very large for IRAS because some of the apertures are much narrower than others in the cross-scan direction, and some overlaps between apertures in adjacent rows are extremely small. So neither of the requirements for the chi-square test is met.

The IRAS data processing involved matching sources detected on consecutive orbits. This was called source confirmation. Only sources detected and matched on an orbit-to-orbit basis, called hours confirmation, were considered reliable enough to keep for the catalog (they would also have to survive weeks confirmation in the downstream processing; these steps were intended to filter out radiation hits, asteroids, scattered moonlight, and other sources of transients). The source matching used a probability measure for H0 based on the cross-correlation of the density functions (for details, see Fowler and Rolfe, 1982). A cross-correlation measures the similarity of two functions at various spacings. Functions with much in common will have a high correlation at certain spacings, while dissimilar functions tend not to. The commonality sought for the hypothesis was large overlapping probability mass, since this is a measure of how many ways the two different density functions might be describing a single source position.

The cross-correlation integral bears a strong resemblance to the convolution integral. The difference is the sign of the kernel, which affects changes of variable in a different way from convolution, because the integration variable plays a different role. The convolution integral was presented in Equation 2.5 (p. 46), which may be compared to the cross-correlation integral below, where we consider real functions. Note that if one of the functions is symmetric about any point, then a translation of origin and a flip of the sign of its argument can be made to yield the convolution integral. This is most obvious in the first integral below.
D(\Delta x) = \int_{-\infty}^{\infty} f_1(x - \Delta x)\, f_2(x)\, dx = \int_{-\infty}^{\infty} f_2(x + \Delta x)\, f_1(x)\, dx \qquad (4.5)
We denote the cross-correlation D (following Fowler and Rolfe, 1982) because its value at the observed separation between nominal source positions will be used as the decision parameter. As with convolution, the cross-correlation generalizes straightforwardly to multiple dimensions, and we will be using it in the two dimensions x and y, cross-scan and in-scan. The cross-scan and in-scan axes on consecutive orbits can each be acceptably approximated as parallel, so we will treat the marginal distributions as defined on the same coordinates for the hours-confirmation matching decision. This was not true for weeks confirmation, which required arbitrary rotation angles, but this introduced only algebraic complications that will not be addressed herein. Each probability density function has a marginal distribution in y that is simply Gaussian, whereas the marginal distribution in x is a little more complicated, being CGU, a convolution of Gaussian and Uniform density functions. We will denote these two components xG and xU, respectively, i.e., the error on the x axis is xG + xU. In order to avoid confusion between the standard deviation of x, σx, and the standard deviation of the Gaussian component, we will label the latter σGx. The half width of the Uniform component will be denoted Lx. Then the two components of the x
distribution have the density functions

f_G(x_G) = \frac{e^{-(x_G - \bar{x})^2 / 2\sigma_{Gx}^2}}{\sqrt{2\pi}\,\sigma_{Gx}}, \qquad f_U(x_U) = \begin{cases} \dfrac{1}{2L_x}, & \bar{x} - L_x \le x_U \le \bar{x} + L_x \\ 0, & \text{otherwise} \end{cases} \qquad (4.6)
Both density functions have the same mean, which is the nominal source position, about which both functions are symmetric. This conforms to the standard practice of assigning the nominal position to the value of x that minimizes the variance. The CGU density function for x is the convolution; with x + \bar{x} = x_G + x_U, i.e., x_U = x + \bar{x} - x_G, it can be written

f_{CGU}(x) = \int_{-\infty}^{\infty} f_G(x_G)\, f_U(x + \bar{x} - x_G)\, dx_G \qquad (4.7)
With the use of the standard error function, erf(t),
\mathrm{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t e^{-t'^2}\, dt' \qquad (4.8)
this can be put into the form (see the referenced paper above)

f_{CGU}(x) = \frac{\mathrm{erf}\!\left(\dfrac{x - \bar{x} + L_x}{\sqrt{2}\,\sigma_{Gx}}\right) - \mathrm{erf}\!\left(\dfrac{x - \bar{x} - L_x}{\sqrt{2}\,\sigma_{Gx}}\right)}{4 L_x} \qquad (4.9)
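As a quick check of this form, the short sketch below (numpy and scipy assumed; the parameter values are the typical IRAS cross-scan values quoted above) compares Equation 4.9 with a histogram of simulated Gaussian-plus-Uniform draws:

import numpy as np
from scipy.special import erf

def f_cgu(x, xbar, sigma_gx, L_x):
    """Equation 4.9: Gaussian-convolved-Uniform (CGU) density centered at xbar."""
    a = (x - xbar + L_x) / (np.sqrt(2.0) * sigma_gx)
    b = (x - xbar - L_x) / (np.sqrt(2.0) * sigma_gx)
    return (erf(a) - erf(b)) / (4.0 * L_x)

rng = np.random.default_rng(0)
xbar, sigma_gx, L_x = 0.0, 5.0, 72.0                      # typical IRAS values (arcsec)
draws = xbar + rng.normal(0.0, sigma_gx, 200_000) + rng.uniform(-L_x, L_x, 200_000)

grid = np.linspace(-120.0, 120.0, 241)
hist, edges = np.histogram(draws, bins=grid, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_cgu(centers, xbar, sigma_gx, L_x))))   # small residual, consistent with sampling noise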
The marginal density function in y is just the Gaussian
f_G(y) = \frac{e^{-(y - \bar{y})^2 / 2\sigma_y^2}}{\sqrt{2\pi}\,\sigma_y} \qquad (4.10)
The complete 2-dimensional point-source position probability density function is
f_{xy}(x, y) = f_{CGU}(x)\, f_G(y) \qquad (4.11)
where the errors on the two axes may be considered independent to an acceptable approximation (in fact, the telescope pointing reconstruction had some correlated errors, but they were small enough compared to the focal-plane point-source detection uncertainties to be neglected).
The density function in Equation 4.11 is the one shown in Figures 4-3B and 4-3C. The hours confirmation decision involved two point-source detections, one from each of a pair of consecutive orbits, hence two of these density functions were involved. The decision parameter was the cross-correlation of these two density functions evaluated at the observed separation between nominal positions. The nontrivial nature of the x distributions makes the derivation of the cross-correlation function too lengthy to present herein (for details, see Fowler and Rolfe, 1982). Subscripting the two density functions 1 and 2, with no. 1 at the origin, and using Δx ≡ x̄2 - x̄1 and Δy ≡ ȳ2 - ȳ1, the result can be expressed as
D_{xy}(\Delta x, \Delta y) = D_x(\Delta x)\, D_y(\Delta y) \qquad (4.12)
where
D_y(\Delta y) = \frac{e^{-\Delta y^2 / 2\sigma_y^2}}{\sqrt{2\pi}\,\sigma_y}, \qquad \sigma_y^2 \equiv \sigma_{y1}^2 + \sigma_{y2}^2 \qquad (4.13)
and

\sigma_{Gx}^2 \equiv \sigma_{Gx1}^2 + \sigma_{Gx2}^2

g(u, v, \Delta x) \equiv (v - u - \Delta x)\,\mathrm{erf}\!\left(\frac{v - u - \Delta x}{\sqrt{2}\,\sigma_{Gx}}\right) + \sqrt{\frac{2}{\pi}}\,\sigma_{Gx}\, e^{-(v - u - \Delta x)^2 / 2\sigma_{Gx}^2}

D_x(\Delta x) = \frac{g(-L_{x1}, L_{x2}, \Delta x) - g(L_{x1}, L_{x2}, \Delta x) - g(-L_{x1}, -L_{x2}, \Delta x) + g(L_{x1}, -L_{x2}, \Delta x)}{8 L_{x1} L_{x2}} \qquad (4.14)
Examples of two IRAS position probability density functions are shown in Figure 4-4A. Source no. 1 is centered at the origin of the xy coordinate system and has σGx1 = 3, Lx1 = 120, and σy1 = 3 arcsec. Source no. 2 (toward the left) is centered at (-60,2) arcsec and is taller because its uniform component is less wide at 90 arcsec. It has the same Gaussian sigmas as source no. 1. An offset of 60 arcsec in cross-scan was typical for IRAS sources, and the same is true of the in-scan discrepancy of 2 arcsec. These two density functions are completely plausible for consecutive-orbit detections of a single source. Figure 4-4B shows the cross-correlation function for these two density functions. Its peak has the value 3.92×10⁻⁴ (probability mass per square arcsec). Evaluating it for the observed separation Δx = -60 and Δy = 2 arcsec yields a value of 2.92×10⁻⁴, 74.5% of the peak and well above typical IRAS thresholds. But IRAS thresholds could not be set to yield a single probability of a type 1 error, unlike the chi-square test. In principle, it could have been done by computing cumulative distributions for every pair combination that arose during the processing, but the computer time would have been prohibitively expensive during the IRAS epoch. With modern computing power, it would probably be feasible to fit a Q function (fraction of probability mass outside a given contour level) to the six
parameters (Δx, Δy, σGx, σy, Lx1, Lx2) by generating a six-dimensional grid numerically and doing a least-squares regression based on some polynomial model. But with the advent of true imaging arrays with infrared-sensitive pixels, the need may never arise. The example shown in Figure 4-4 has Q = 0.7261 outside of the 2.92×10⁻⁴ contour.
Figure 4-4 A. Two IRAS point-source probability density functions: one centered at (0,0) with σGx1 = 3, Lx1 = 120, and σy1 = 3 arcsec; the other centered at (-60,2) with σGx2 = 3, Lx2 = 90, and σy2 = 3 arcsec. B. The cross-correlation function for these two density functions; the value at the observed separation between nominal positions is 2.92×10⁻⁴, 74.5% of the peak value of 3.92×10⁻⁴.
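The quoted numbers can be checked with a short sketch that evaluates Equations 4.13 and 4.14 for the Figure 4-4 parameters (function and variable names here are illustrative; this is not the IRAS production code):

import numpy as np
from scipy.special import erf

def g(u, v, dx, sigma_gx):
    """The auxiliary function in Equation 4.14."""
    w = v - u - dx
    return w * erf(w / (np.sqrt(2.0) * sigma_gx)) + \
           np.sqrt(2.0 / np.pi) * sigma_gx * np.exp(-w**2 / (2.0 * sigma_gx**2))

def D_x(dx, sgx1, sgx2, L1, L2):
    """Cross-scan cross-correlation of two CGU densities (Equation 4.14)."""
    s = np.hypot(sgx1, sgx2)
    return (g(-L1, L2, dx, s) - g(L1, L2, dx, s)
            - g(-L1, -L2, dx, s) + g(L1, -L2, dx, s)) / (8.0 * L1 * L2)

def D_y(dy, sy1, sy2):
    """In-scan cross-correlation of two Gaussians (Equation 4.13)."""
    s2 = sy1**2 + sy2**2
    return np.exp(-dy**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

# Figure 4-4 parameters: sigma_Gx1 = sigma_Gx2 = 3, L_x1 = 120, L_x2 = 90, sigma_y1 = sigma_y2 = 3 arcsec
peak = D_x(0.0, 3, 3, 120, 90) * D_y(0.0, 3, 3)
obs = D_x(-60.0, 3, 3, 120, 90) * D_y(2.0, 3, 3)
print(peak, obs, obs / peak)     # about 3.92e-4, 2.92e-4, and 0.745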
In practice, the IRAS thresholds were set by studying Monte Carlo simulations for various noise environments. Such simulations (see the next section) employ known “truth” scenarios within which potential processing algorithms may be explored. The nature of the IRAS position errors and the spatial densities of real infrared sources to be expected were fairly well understood, and by varying the spatial density of noise sources, the H0 rejection rate could be studied as a function of threshold and later compared to values observed inflight to validate the simulations and make any required threshold adjustments. This is the method that was used with sufficient success to spawn an entirely new processing center at Caltech, the Infrared Processing and Analysis Center, which became a hub for the data analysis of many subsequent infrared projects in addition to extracting a vastly greater amount of science from the IRAS data than had been initially foreseen.

The fact remains that the probability of a type 1 error was different for different combinations of input uncertainties. The main impact of this is a discrimination against combinations involving especially large uncertainties. Since this means that lower-quality detections had less chance of being accepted, it is not necessarily a bad thing, but one needs to be aware of it. It is possible for two low-quality detections to lie on top of each other and still fail to be accepted. For the detection pair in the example, thresholding at 1% of peak, i.e., Dxy = 3.92×10⁻⁶, would yield Q = 0.00495, i.e., we would reject about 0.5% of all true matches. Depending on the noise rate, this might be a reasonable
sacrifice when the goal after all filtering is to achieve a completeness of 98%. But this same threshold applied to a pair of detections, each with (e.g.) uniform half widths of 900 and both Gaussian sigmas of 40 arcsec, would reject all pairs even if perfectly aligned, because the peak value is 3.82×10⁻⁶, below the threshold of 3.92×10⁻⁶. On the other hand, even given the fact that a small fraction of the IRAS detectors were inoperative and therefore not available to supply the overlap that cut down the cross-scan uniformly distributed position error, the IRAS mission was incapable of producing such low-quality detections. The point here is just that, in principle, the algorithm allows the rejection of perfectly aligned detections if their position information is of sufficiently low quality. As mentioned above, subjective value judgments enter into the setting of thresholds. It could be argued that one does not want to accept low-quality detections at the same Q threshold as high-quality detections, since the larger position uncertainties of the former present a larger target for false matches. Maintaining the same probability of a type 1 error would increase the probability of a type 2 error, a problem that does not play as much of a role when the dynamic range of uncertainties is lower.

In review, the motivation for basing our decision on the cross-correlation of two position probability density functions, thresholding on the value at the observed nominal source separation, is that this is a measure of the similarity of the two density functions. The more probability mass the two density functions have in common, the more likely that they describe the same source, provided that neither is so spread out that in effect all bets are off, i.e., there could be any number of different sources in the area in question. We will close this section with an examination of two alternate interpretations of the IRAS cross-correlation algorithm’s formulas, one of which is incorrect despite being based on the exact same equations. We begin by recalling what was said earlier about the relationship between convolution and cross-correlation: if one of the functions is symmetric about any point, then a translation of origin and a flip of the sign of its argument can be made to yield the convolution integral. The IRAS marginal density functions are symmetric, hence Equations 4.13 and 4.14 can both be viewed as the results of convolution. This same symmetry allows Δx and Δy to be viewed as random variables whose density functions are those same equations, since Dx(Δx) = Dx(-Δx) and Dy(Δy) = Dy(-Δy). So the algorithm has the same appearance as a test on the joint density function for these two random variables evaluated at their observed values. But such an algorithm would be based on a misconception for the following reason.

The IRAS spacecraft had a near-polar orbit designed to precess the orbital plane at a rate of 360° per year. Its telescope scanned the sky with a constant angle between its boresight and the Sun for periods of approximately half an orbit, i.e., ascending scans operated at one solar aspect angle, and descending scans at another. This angle could be held to any value between 60° and 120° in order to prevent sunlight and earthlight from entering the telescope, whose boresight always pointed within 30° of the local zenith.
The orbit-to-orbit changes in solar aspect angle were set to produce a large fraction of overlap from each ascending/descending scan to the next, re-observing much of the same sky so that hours confirmation would be possible. During the data-taking part of a scan, the solar aspect angle was held constant to within the attitude control system’s limit-cycle motion of a few arcsec, and analysis of star sensor data allowed the achieved pointing to be reconstructed to slightly better accuracy. The significant point is that the trajectories of the IRAS detector apertures mapped onto the sky along the scans were implicitly commanded a priori, so that the aperture overlap zones in the cross-scan direction were almost
completely deterministic parameters. The only random component in the mapping of an overlap zone to xy coordinates was that due to the telescope boresight pointing reconstruction error, just a few arcsec. This means that the Δx value, typically on the order of 60 arcsec, had a random component of only a few arcsec. It was predominantly a deterministic parameter. To treat it as randomly distributed with a density function given by Equation 4.14 would be completely unjustified. The x̄1 and x̄2 values are (almost) deterministically commanded values whose role is simply to anchor the CGU distribution to a location in the xy plane. The only significantly random ingredient in the cross-scan direction is the location of the source within the overlap region. This random variable indeed has a mean of x̄1 and x̄2 for the two detections, but x̄1 and x̄2 are essentially just peg points for their respective density functions. Given a source at the sky position observed, the two values of x̄1 and x̄2 are pre-ordained by the scanning strategy, not random variables except to within the small errors involved in executing that strategy and reconstructing what it accomplished, i.e., the small Gaussian errors described by the xG component in Equation 4.6, which is generally tiny relative to Lx. The real random variable, the source location within the overlap zones, could take on any value within those zones without affecting the peg points for locating the CGU distributions. Other than determining which overlap zones would be relevant, it has no further cross-scan effect on the measurements.

So a casual observer, upon superficial inspection of Equations 4.13 and 4.14, might conclude that they involve a typical application of the fact that the sum of two independent random errors has a density function given by the convolution of the two errors’ density functions. The erroneous implication is that Δx is a random variable whose density function is given by Equation 4.14, whereas Δx is not capable of fluctuations as large as those implied by Equation 4.14 if it were a density function instead of a cross-correlation function. Even though the equations would be the same, it is important to understand the nature of the randomness involved, which variables are actually susceptible to fluctuations, and the size of those fluctuations.

This distinction is typical of the ways in which non-Gaussian applications differ from Gaussian applications. Another is the way parameter refinement operates. For two measurements with Gaussian errors of similar magnitude, inverse-variance averaging reduces the uncertainty by about a factor of √2, and the amount of density-function overlap has no bearing on the final uncertainty. Even when the uncertainties are of significantly different magnitude, the final uncertainty is always smaller than the smallest input uncertainty, although the reduction becomes negligible as the size difference becomes very large. For example, uncertainties of 1 arcsec and 10 arcsec yield a refined uncertainty of 0.995 arcsec. But the final uncertainty is never larger than the smallest input uncertainty. Of course, it may be erroneous if the refinement follows an error of type 2. It is possible to construct pathological distributions for which parameter refinement yields a final uncertainty greater than any input uncertainty, where “uncertainty” refers to the standard deviation of the distribution.
For example, if the two measurements being averaged have extremely asymmetric density functions and one is the mirror image of the other, each with essentially all of its probability mass far from the mutual mean, then the final standard deviation can be larger than either input standard deviation. To the author’s knowledge, such distributions never arise in practice, and even if they did, it seems extremely unlikely that any two such measurements could ever survive a test on whether they were measurements of the same physical quantity. An example will be given in section 4.8. When uniformly distributed errors dominate, parameter refinement can result in no uncertainty
reduction at all relative to the better measurement, or to arbitrarily small uncertainty values, or anywhere in between, depending on the overlap of the two uniform distributions, as we will see in section 4.8. Ironically, the less overlap of the two uniform distributions, i.e., the less the two measurements appear to agree, the greater the reduction in uncertainty. If the decision to accept H0 on the basis of very small overlap was incorrect (an error of type 2), then at least the refined product of that incorrect decision will have an especially difficult time downstream finding another random match. Some interesting results are found when different types of distributions are mixed in parameter refinement. One of these, refining Gaussian and Uniform measurement density functions, has an intriguing echo in the quantum theory of spontaneous localization, as we will see in the next chapter.

The second alternate interpretation mentioned above is the following, and this will be important later in section 4.8. Given two measurement density functions, we ask: what is the probability that the unknown value of the random variable described by the first density function is arbitrarily close to the unknown value of the random variable described by the other? Prior to accepting the hypothesis, we have no right to assume that the two measurements apply to the same point source. That is after all what we are trying to decide, and so our argument would be circular if we assume that the hypothesis is true before we decide whether to accept it. We need to define what we mean by “the same point source”. It may be obvious intuitively, but we need a mathematical definition, and we have already hinted that it will contain the concept of “arbitrarily close”. For the sake of simplicity, we will take advantage of the fact that this point can be made in one dimension and with independent random variables (i.e., independent measurements).

We consider two random variables, x1 and x2, with means x̄1 and x̄2 and density functions f1(x1) and f2(x2), respectively, and joint density function f12(x1,x2) = f1(x1) f2(x2). We want to know whether x1 and x2 have the same (unknown) value. Here we encounter once again the problem of dealing with the probability that a continuous random variable takes on an exact value. Once again we must think in terms of a small region around that exact value, so our question becomes whether |x1-x2| < ε/2, where ε can be made as small as we like as long as it remains greater than zero, and we divide it by 2 for reasons that will soon become apparent. The probability that |x1-x2| < ε/2 is the total probability mass in that portion of the joint density function wherein this condition is satisfied. Since in general x̄1 ≠ x̄2, the value of this probability mass depends on the observed offset Δx ≡ x̄2 - x̄1, and we choose ε small enough that we can neglect the change in the joint density function over an interval of that width. Then the probability we seek is
P\!\left(|x_1 - x_2| < \tfrac{\epsilon}{2}\right) = \int_{-\infty}^{\infty} f_1(x_1) \int_{x_1 - \epsilon/2}^{x_1 + \epsilon/2} f_2(x_2)\, dx_2\, dx_1 \approx \epsilon \int_{-\infty}^{\infty} f_1(x_1)\, f_2(x_1)\, dx_1 \qquad (4.15)
This is just the cross-correlation of the two density functions evaluated at the observed Δx and multiplied by ε. Thus the IRAS decision algorithm may also be viewed as thresholding on the probability that the two measurement random variables on each axis are within ε/2 of each other, whatever their actual values may be, with the factors of ε absorbed into the threshold. This was in fact the probability measure originally intended for the hypothesis. Either interpretation is valid, but the cross-correlation approach is a bit simpler, whereas the other supplies a useful definition of what is meant by “the same point source” that will be needed in section 4.8 for parameter refinement in general, where we define the means of any two random variables to be effectively equal if the absolute difference between their (unknown) values is arbitrarily small.

4.6 Monte Carlo Simulations

In his book on Monte Carlo exercises, Paul Nahin (2008) says “No matter how smart you are, there will always be problems that are too hard for you to solve analytically.” The problem of setting a threshold for the IRAS hours-confirmation decision algorithm is one example. The problem there was not only the difficulty of handling the complicated CGU function but also a lack of sufficient knowledge of the noise environment. The solution was to simulate a variety of environments, and with the “omniscience” possessed by the simulator, to examine various outcomes from various strategies. In the process, certain observables (e.g., the H0 rejection rate) could be calibrated as functions of threshold values and noise levels, then later compared to inflight experience to gain further insight. All this was made possible by the realistic error model. The CGU distribution is somewhat nontrivial to manipulate, but it is easy to simulate, requiring only one Gaussian and one Uniform pseudorandom draw. In the IRAS case, it requires only one pseudorandom Gaussian draw, since the Uniform component is determined by the aperture overlaps, which are mapped onto the sky by the survey strategy. The key ingredient is simulating the scanning strategy so that the Uniform distributions can be properly sized and placed on the simulated sky in order to identify the overlap region appropriate for each simulated star or transient event. After that, one needs only to add the relatively small Gaussian error to the reported peg point.

So mathematical intractability is one reason to employ simulation, and another reason is lack of certainty regarding the nature of the obstacles to be encountered. A third reason is just to verify the analytical results of complicated derivations in which the possibility that some subtlety has been overlooked is significant. Many problems can be solved by simulation, but those in which random events play a major role require not a single simulation of an event, but an entire ensemble of simulations within which the random ingredients can be expected to exhibit the full range of their peculiarities. Such systems resemble the parallel-universe populations that are so useful for gedanken experiments, but when the complexity rises to the point where computer power is required, the exercise is called a Monte Carlo calculation. In section 1.3 we promised to give a Monte Carlo simulation of the game defined in section 1.2:
... one could in principle write down every possible outcome ... until each branch in the tree arrived at a point where one player has enough points to win. Taking each possible branch as equally likely, the probability that a given player would have won is just the ratio of the number of branches leading to that outcome divided by the total number of possible distinct outcomes. The problem with this brute-force approach is that the number of possible branches may be astronomically large unless one player was very close to victory when play was halted. What Pascal and Fermat contributed was the use of combinatorial analysis to eliminate the need for detailed enumeration of each possible branch. To see how unmanageable the detailed approach is when the number of outcomes is not very small, consider a simple game in which a coin is flipped, and heads results in a point for player A, tails a point for player B. Suppose A has 5 points, B has 4 points, and 10 points are needed to win. Then at most ten more coin flips are needed to decide the game. To write down all possible sequences of ten coin flips may not sound difficult, but in fact there are more than one thousand possible patterns.
The formal solution to this problem was given using Pascal’s Rule (Equation 1.2, p. 5), which circumvents the labor-intensive need to track down every possible branch the game could take. But computers excel at brute-force labor-intensive calculations. Computer science has provided well-known methods for traversing branching “tree” patterns, so it would be possible to trace every branch on a computer until a “node” is encountered that has one of the players achieving ten points, then count outcomes and compute fractions of wins by each player. This would not be a Monte Carlo calculation, just a brute-force exact solution, but it would verify that the result using Pascal’s Rule was correct. Since we want to illustrate a simple Monte Carlo simulation, we will take the approach of using pseudorandom draws to determine the coin-flip outcomes to simulate a single game played to completion, note which player won, and repeat the game many times until the fractions stabilize.

To illustrate the Monte Carlo procedure, we will use what is called “pseudocode”, which is a recipe whose steps correspond to the actual statements in the programming language employed in the implementation of the algorithm but hopefully are more widely understandable to non-users of that programming language. This calculation simulates one million games starting with player A at 5 points and player B at 4 points, which means that 9 flips have already taken place. At least 5 and no more than 10 additional flips are needed to settle the game, so we simulate flips from no. 10 up to as many as no. 19, quitting as soon as a player achieves 10 points. In each game, a fair coin flip is simulated by drawing a pseudorandom number C that is uniformly distributed between 0 and 1 and calling it heads if it is greater than or equal to 0.5, otherwise tails. When heads comes up, we add a point to A’s total, otherwise we add a point to B’s total. When either player achieves 10 points, the game is over, and the victory tally for the winner is incremented. After the million games have been completed, we divide each victory tally by 10⁶ and print the fractions. Comments in the pseudocode are enclosed in braces, and code blocks are indented. A left arrow indicates a draw from a pseudorandom number generator U(0,1), uniformly distributed from 0 to 1.

A point of curiosity may arise here: why do we declare heads if the Uniform draw is greater than or equal to 0.5 rather than simply greater than 0.5? Is there not a bias here? Uniform generators usually never return exactly 1, leaving 0.5 just past dead center in the interval, so this choice usually prevents a bias. There may be an unavoidable bias with some Uniform generators, but they typically
cycle through about four billion floating-point numbers before starting to repeat previously seen values, and we are willing to risk a bias of one part in four billion.
nGames = 1,000,000    {number of games to be completed}
NA = 0                {number of games won by player A}
NB = 0                {number of games won by player B}

Repeat nGames times:
    A = 5                             {score for player A}
    B = 4                             {score for player B}
    Repeat until (A=10) or (B=10)     {finish game}
        C ← U(0,1)                    {pseudorandom draw}
        if C >= 0.5 then add 1 to A   {heads; point for player A}
        else add 1 to B               {tails; point for player B}
    end repetition of coin flips      {somebody won}
    if A=10 then add 1 to NA          {increment win tally for player A}
    else add 1 to NB                  {increment win tally for player B}
end repetition of games

fA = NA/nGames        {fraction for player A}
fB = NB/nGames        {fraction for player B}
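For readers who prefer runnable code to pseudocode, here is a minimal Python rendering of the same procedure (a sketch; the function name and seed are arbitrary):

import numpy as np

def points_game(n_games=1_000_000, p_heads=0.5, seed=1):
    """Simulate the interrupted points game from A = 5, B = 4, first to 10 wins."""
    rng = np.random.default_rng(seed)
    wins_a = 0
    for _ in range(n_games):
        a, b = 5, 4
        while a < 10 and b < 10:
            if rng.random() < p_heads:   # heads: point for player A
                a += 1
            else:                        # tails: point for player B
                b += 1
        wins_a += (a == 10)
    return wins_a / n_games, 1.0 - wins_a / n_games

print(points_game())                     # close to (0.623046875, 0.376953125)

Changing p_heads (e.g., to the biased value discussed later in this section) requires only the one argument.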
The correct values for fA and fB are 0.623046875 (638/1024) and 0.376953125 (386/1024), respectively. This Monte Carlo calculation based on one million games was run 5 times using independent initializations of the pseudorandom number generator. The five sets of fractions were:
Run#      fA          fB
 1        0.623377    0.376623
 2        0.622898    0.377102
 3        0.623655    0.376345
 4        0.623823    0.376177
 5        0.622735    0.377265
It can be seen that the fractions vary by a bit less than ±1 in the third significant digit. To have the relative error vary approximately as the inverse square root of the number of trials (in this case, the number of games) is typical, so with 10⁶ trials, we normally expect substantial fluctuation in the fourth significant digit. Since Monte Carlo calculations are usually done when analytical manipulation is difficult, it is often a good idea to probe the accuracy by running various numbers of trials and
noting the dispersions in the results. A run was made with 10⁸ trials, for which we would expect the first four significant digits to be stable, and this yielded fractions of 0.62307342 and 0.37692658, respectively, so the errors are ±2.6545×10⁻⁵, and indeed the fourth significant digits are in good shape. If we had any doubts about the correctness of our formal solution using Pascal’s Rule, these results would be sufficient to dispel them.

Of course, when one is dealing with a pathologically unstable computation, this simple relationship between accuracy and the number of trials may not apply, and one needs to take great care, as the next example, the St. Petersburg Paradox (section 1.8), shows. This involves the following proposition: one pays a fee for the privilege of playing a game whose rules are that every consecutive flip of a fair coin that results in heads yields a prize of twice the previous flip’s prize amount, with the initial prize being $2. In other words, if the first coin flip yields tails, the game is over, and one has nothing to show for the entrance fee; otherwise the game continues. As long as the coin keeps coming up heads, the pot keeps doubling. Assuming that tails eventually comes up, the game ends, but one gets to keep whatever money was in the pot the last time heads came up.
As shown in Equation 1.4 (p. 19), the “expected” amount of money to be won (i.e., the mean of the distribution) is infinite. But as shown later, the standard deviation about the mean is also infinite, so that finite winnings should not come as a surprise. As a rule of thumb, for example, one could compute how many games would need to be played in order to have a 50% chance of getting at least 10 heads in a row and thereby winning $1024 or more. The binomial distribution can be used to show that if one plays the game 710 times, there is a 50% chance that one will get a run of at least 10 heads. That means that one must pay the entrance fee 710 times, for which one still has only a 50% chance of winning $1024 or more. The entrance fee would have to be very low for that to be appealing. Of course, the run of 10 heads would probably not occur on the 710th try, if at all, and there is a 12% chance of getting two runs of 10 heads, and even a 2.8% chance of three. The Monte Carlo calculation used to generate the results presented in section 1.8 is given below.
nGames = 100,000,000    {number of games to be played}
TotalWon = 0            {initialize total winnings}
MaxWon = 0              {initialize maximum game payoff}
MaxHeads = 0            {initialize maximum run of Heads}
Hist(0..50) = 0         {initialize nHeads histogram}

Repeat nGames times:
    Won = 0                        {initialize game winnings}
    nHeads = 0                     {initialize no. of Heads}
    Pot = 1                        {initialize money in the pot}
    Repeat until Tails comes up
        C ← U(0,1)                 {pseudorandom draw}
        if C >= 0.5 then           {Heads}
            Add 1 to nHeads
            Double the Pot
            Won = Pot
        else Tails
    end repetition of coin flips   {Tails came up}
    increment Hist(nHeads)         {count no. of times nHeads came up}
    Add Won to TotalWon            {sum total winnings}
    if nHeads > MaxHeads then      {new longest run occurred}
        MaxHeads = nHeads
        MaxWon = Won
end repetition of games

AverageWon = TotalWon/nGames
display AverageWon, MaxWon, and MaxHeads
display Hist
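A compact Python rendering of the same calculation (a sketch with arbitrary names; a much smaller number of games than the pseudocode uses, to keep the run time modest):

import numpy as np

def st_petersburg(n_games=1_000_000, p_heads=0.5, seed=7):
    """Play the St. Petersburg game repeatedly; track average and maximum winnings."""
    rng = np.random.default_rng(seed)
    total_won, max_won, max_heads = 0, 0, 0
    for _ in range(n_games):
        pot, won, n_heads = 1, 0, 0
        while rng.random() < p_heads:    # keep flipping heads
            n_heads += 1
            pot *= 2
            won = pot
        total_won += won
        if n_heads > max_heads:
            max_heads, max_won = n_heads, won
    return total_won / n_games, max_won, max_heads

print(st_petersburg())

Repeating the call with different seeds, or with p_heads set to a favorable bias, shows the large run-to-run swings in the average winnings described in the surrounding text.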
Unlike the first coin-flip game above, this one shows large fluctuations in the fractions computed at the end. The AverageWon parameter varied by factors of 2 over repeated sets of 100 million games. The more unstable the calculation, the more important it is to have a good pseudorandom number generator. The topic is highly developed, and we cannot take the time to go into it beyond a few caveats. Perhaps most importantly, since many of these algorithms operate by cycling through four-byte integers, and since an unsigned four-byte integer can express values only from zero to 4,294,967,295, only a little more than four billion different numbers can be generated without repetitions unless something is done to complicate the cycle. Since the St. Petersburg Paradox Monte Carlo calculation draws only several hundred million pseudorandom numbers, it is on solid ground regarding accidentally induced correlations, but a computation requiring more than four billion draws needs to pay attention to the possibility that the sequence will not imitate true randomness with sufficient fidelity.

Well-designed pseudorandom number generators are a source of pride to many numerical analysts, and something of a competition exists. For example, Press et al. (1986) give an excellent discussion of classic generators and offer a $1000 bounty to the first person who can demonstrate a nontrivial statistical failure of their best uniform generator that is not due simply to limited computer precision. Wolfram (2002) also gives an excellent discussion and claims that generators based on cellular automata cannot be surpassed. Zeilinger (2010) states that numbers constructed from
binary bits determined by reflection/transmission of photons from/through a semi-reflecting mirror are as random as any possible because of the nonepistemic randomness of quantum processes (more on this in Chapter 5; of course the physical impossibility of constructing a perfectly semi-reflective mirror must induce some bias, however small, and Zeilinger’s point has more to do with quantum randomness than constructing random number generators).

A great advantage of Monte Carlo calculations is that once they are set up, it is usually very easy to probe the effects of changes in certain parameters. For example, the first coin-flip game above is easily solved with Pascal’s Rule, but what if the coin in use were not fair? Suppose the probability of heads on a single flip were 0.4517 (i.e., the test on C would be “if C > 0.5483”). The game can still be solved formally, but now it is a bit harder, because now the branches are no longer equally likely, and we must bookkeep the cascading fractions of times each branch is taken. But one small change in the Monte Carlo code allows us to see what happens. In this case, wherein player A has won 5 of the first 9 tosses despite the bias, from that point on, each player has approximately the same chance of winning (which is why we chose that bias). Similarly, it is easy to probe the effects of different states of the game, e.g., player A with 6 points and player B with 3. Or suppose that the rules were that the players alternate flipping their own coins, with player A’s coin biased toward heads and player B’s biased by the same amount toward tails. Suppose that whoever won the last flip gets to flip again. There is much to be said for possessing one’s own toy universe.

Suppose the coin used in the St. Petersburg Paradox game were biased with a 0.6 probability of heads. The average winnings after 100 million games jump to the range from about $1700 to $5000, and the game starts to acquire some appeal. We found before that if one’s strategy is to play 25 games with a fair coin, then one can expect one’s average winnings to be something in the range from $1 to $5 per game. With a 0.6 probability of heads, this becomes $3 to $20 per game. If the entrance fee is around $2 or less, the game is probably worth playing, although as noted earlier, whoever offers this game probably isn’t in the business of giving money away.

4.7 Systematic Errors

For some scientists, the concept known as a “systematic error” in scientific data analysis is a bit of a stumbling block. Various misconceptions circulate, such as “if it’s systematic, then it’s not random”, “it can’t be zero, because then it wouldn’t matter”, “all errors are systematic after they happen, since they remain constant in your result”, “correlated errors are necessarily systematic”, and “systematic errors are necessarily non-Gaussian”. We’ll consider these one at a time below.

Classical physics views natural processes as unfolding in time according to strictly deterministic laws: nothing that happens in Nature is random. But when we contemplate the fact that our knowledge is something acquired through Nature, then we have to consider the origin of mistaken beliefs. Our knowledge is finite. Most of what we know is known with limited precision. Things happen in Nature that we cannot know completely, so we always operate in a context of partial ignorance and uncertainty. In our quantitative description of Nature, how can we represent this uncertainty due to ignorance?
The only way is to model the unknown parts as random and to estimate as well as we can the distribution of conceivable values. This maximizes our ability to take into account all the possibilities that could be happening beyond our perception, with each weighted according to its probability of being actualized.
In this case, the randomness is epistemic, i.e., the thing itself need not be random, just our knowledge of it. Thus a systematic error is just as random as a nonsystematic error, because we have ignorance of its value. Systematic and nonsystematic errors must both be modeled as random variables. This doesn’t change when we undertake the transition from classical to quantum physics. By whatever means the unknown effect comes to be, we are ignorant of it, and our uncertainty must enter the formalism as an epistemically random variable. Although this applies to both systematic and nonsystematic errors equally, the ways in which these random variables appear in the model are different, as we will elaborate below.

A good way to view the entrance of systematic errors into our measurements is case E in Appendix G. This shows one way to simulate systematic errors. Specifically, when we perform a nontrivial measurement, some error always enters and is typically a combination of smaller effects. If we perform a set of N measurements, and if some of these noise sources produce errors that make the same contribution to the total error in several or all of these measurements, the fact that these measurements have some errors in common amounts to a systematic effect. If the net value of these shared errors is positive, then the affected measurements are all biased high by that amount, and if it is negative, they are all biased low. If the measurements all have uncertainty σ, and if averaging them were appropriate, we could not expect the final uncertainty to be σ/√N, because that requires the errors to be independent between measurements, and this requirement is not met. When the same noise subprocess contributes to several measurements, those measurements have correlated errors due to the systematic presence of the mutual noise subprocess. Thus systematic errors do indeed induce correlations.

Not all correlated errors are properly viewed as systematic, however, since correlated errors can arise purely by the choice of coordinates, as illustrated in Figure 2.12 (p. 75). The correlated error in Figure 2.12A arises from the desire to express position and uncertainty in standard celestial coordinates for a measurement made of a physical process (in this case the motion of an asteroid) involving different errors on the axes of its own natural coordinate system (in this case, the asteroid’s orbital plane) which was not aligned with the celestial axes at the epoch in question. A similar effect occurs when a measurement is made with an observing instrument that has different errors on the axes of its own natural coordinate system (the focal plane layout of the detectors) which are not aligned with the desired system. These correlations are introduced by transforming away from the natural axes. When the error occurs within the natural coordinate system (or other mathematical characterization of the physical situation) and affects two or more measured quantities, only then do we apply the term “systematic”. Similarly, errors in curve-fit coefficients are usually correlated. To apply the term “systematic” to correlations that arise directly from purely mathematical relationships would undermine its usefulness in distinguishing between different kinds of noise sources associated with physical processes, namely those whose values are independent from one measurement to the next and those whose values are the same in all affected measurements.
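Returning for a moment to the σ/√N remark above, a minimal simulation sketch (invented numbers, numpy assumed) shows how a shared error puts a floor under the scatter of an average of N measurements:

import numpy as np

rng = np.random.default_rng(3)
n_trials, N = 20_000, 25
sigma_n, sigma_s = 1.0, 1.0                              # nonsystematic and systematic 1-sigma values

shared = rng.normal(0.0, sigma_s, size=(n_trials, 1))    # one systematic draw shared by all N measurements
indep = rng.normal(0.0, sigma_n, size=(n_trials, N))     # independent nonsystematic errors
avg_error = (shared + indep).mean(axis=1)                # error of the straight average of the N measurements

print(avg_error.std())                                   # about sqrt(sigma_s**2 + sigma_n**2/N) = 1.02
print(np.sqrt(sigma_n**2 + sigma_s**2) / np.sqrt(N))     # 0.28, what naive independence would predict

The shared draw is exactly the kind of mutual contribution that the term “systematic” is meant to single out.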
Thus systematic errors usually induce positive correlation, because all affected measurements are biased in the same way. The actual error need not be positive for the correlation to be positive. As long as all affected measurements are biased in the same direction, the correlation will be positive. Negative correlation means that the error affects the values of the correlated variables in opposite directions. Error correlation arising from coordinate rotations and in curve-fit coefficients can be negative as easily as positive. For example, in the most common fits of data to a straight line, the errors in the slope and intercept are negatively correlated, as will be discussed further in section 4.9 below.
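A quick numerical illustration of that last point (a sketch with made-up data; np.polyfit’s covariance output is used here for convenience, anticipating the fuller treatment in section 4.9):

import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1.0, 10.0, 20)                       # positive abscissas, so the mean of x is positive
y = 2.0 + 0.5 * x + rng.normal(0.0, 0.3, x.size)     # straight line plus Gaussian noise

coeffs, cov = np.polyfit(x, y, 1, cov=True)          # coeffs = [slope, intercept]
corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print(coeffs, corr)                                  # the slope-intercept error correlation comes out negative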
But the way correlated errors are treated statistically does not depend on whether they arise from systematic noise effects. Once they enter the error covariance matrix, how they got there no longer matters. It remains true that the correlations can always be removed by coordinate rotation, but in the case of correlation due to systematic error, the interpretation of the rotated axes is often not straightforward. An example is case E in Appendix G, which involves all N measurements containing the same systematic error and a nonsystematic error whose variance is the same in all measurements. This simplification is often not far from realistic cases whose eigenvalues will not be so different from those of case E that the latter would be misleading. Equation G.15 (p. 461) shows that diagonalizing the error covariance matrix has the effect of subtracting the systematic error variance from all but one diagonal element and adding it scaled by N-1 to the remaining element. The symmetry among the measurements is lost, and the meaning of the one with N-1 times the systematic error combined in quadrature is elusive. Thus the rotated axes are neither useful nor needed, since the effects of systematic errors can be handled in the same way as any other correlated errors without transforming coordinates.

Another aspect of systematic errors that some people find non-intuitive is the need to model them as zero-mean random variables. This seems to be a variation of the fallacy sometimes encountered regarding the “expectation value” of a distribution being the value which the random variable should be thought of as having. But the means of many distributions are not even available to the corresponding random variables, e.g., the Poisson distribution, which can have a fractional mean but only integer values for the variable. Systematic errors are modeled as zero-mean random variables simply because any nonzero mean that we could discern during instrumental calibration would be subtracted off of the response function, restoring the zero-mean nature of the distribution. The fact that a mean of zero is used does not imply that we think the systematic error will be essentially zero, it is just consistent with the fact that we don’t know what sign it will have. The less we know about the systematic error, the larger its variance must be in the error model, but the mean remains zero.

In this context, the word “systematic” takes on a specialized meaning: it describes a property that is mutual among a set of measurements. Although it is true that any error in a given measurement is a permanent blemish and thus “systematic” in a loose colloquial sense, as long as a single measurement is being discussed, it is not conducive to clarity to refer to that permanent imperfection as “systematic”. The nature of a systematic error is such that it applies to more than one measurement. It is something that is systematically present in a group of measurements.

Finally, the question of whether errors are Gaussian is completely independent of the question of whether they are systematic. As we saw above, systematic errors induce correlation, and Gaussian random variables are perfectly capable of correlation, as shown by Equation 2.31 (p. 76), which gives the joint density function for two Gaussian random variables that are definitely correlated. Of course, non-Gaussian random variables are similarly free to be correlated. Case E in Appendix G has several interpretations in real-world scientific data analysis.
For example, astronomical images are used for point-source detection in which concentrations of photometric energy are found by automated means and the shapes are compared to what is expected for a point source seen through the instrumental response function (which is usually dominated by optical blur, pixel geometry, and for ground-based observations, fluctuations in atmospheric refraction). This is usually done in pixel coordinates, and the locations of the pixels in the imaging array are usually known very accurately, so that the position errors are essentially due exclusively to photometric noise fluctuations, with some contribution from response-function modeling
imperfections and possibly errors in correcting optical distortion. But these positions are represented in the pixel coordinates of the imaging array, not celestial coordinates. The ultimate goal is to locate these point sources on the sky, and so a mapping from pixel coordinates to celestial coordinates is needed. The telescope pointing control system already has some information about this mapping, since the pointing is specifically commanded by the observer, but the tolerance on this “control error” is usually quite coarse compared to the desired eventual celestial position accuracy, which is typically achieved by matching the detected point sources to an astrometric catalog of previously known point sources with well-determined positions. The accuracy of the mapping from pixel coordinates to celestial coordinates is usually better than the pixel position accuracy of any one detected point source, since the former is typically based on many point sources, but nevertheless, the residual mapping error is seldom so small that it can be ignored. So the situation consists of a set of point sources with positions and uncertainties defined in an array of pixels whose coordinate system can be viewed as a flat rigid body. Any error in placing this rigid body on the sky is shared by all the point sources detected in the imaging array and thus constitutes a systematic celestial position error that shows up in the off-diagonal elements of the error covariance matrix for the set of point-source positions. What happens after that depends on how these positions are used.

An important area in which correlated errors have a significant effect is the averaging process. Since positions of different sources are not expected to have identical true values, they never find themselves being inverse-variance averaged with each other. The fact that their errors have correlations may not be particularly important, although it is prudent to bookkeep them so that they will be handled properly if the need arises (e.g., star catalog-to-catalog comparisons). An example in which the presence of nonzero off-diagonal elements in the error covariance matrix is of more obvious importance is that involving repeated measurements of the photometric energy, or “flux”, of a single point source. Re-observations of point sources are very common. This is how variability is discovered and measured, and averaging multiple observations generally drives flux uncertainties down. This is true even for variable sources, whose mean flux remains important as a reference for the flux variations.

Figure 4-5 illustrates a typical method for estimating the flux and position of a point source from the pixel values in an imaging array. Point sources generally do not appear isolated against empty sky, so before the estimation begins, the background level is subtracted from all pixels to be used for a given point source. Depending on the situation, various background estimation methods may be used, but a common one is to find the median pixel value near the point source sufficiently distant not to contain flux from it. To keep things as simple as possible, we will analyze this procedure in one pixel dimension. Imaging arrays employed in astronomy are usually designed for the optical system in a manner that provides oversampling of the point-source blur spot, i.e., the angular scale of the pixels is usually two or three times smaller than the blur spot’s full width at half maximum. This allows position estimation at the sub-pixel scale.
The algorithm operates with a response-function model that represents the normalized shape of a noise-free point source in an imaging array. This shape is called a template, and it is constructed by averaging the shapes of many point sources that are bright enough to have a high signal-to-noise ratio (S/N) but not bright enough to saturate the pixels. This sample average, like all others, has some residual error as an estimator of a population property.
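A minimal sketch of such a template fit (the Gaussian template, pixel grid, and noise level are invented for illustration; scipy.optimize.curve_fit performs the chi-square minimization here):

import numpy as np
from scipy.optimize import curve_fit

def template(u, fwhm=3.0):
    """Normalized stand-in for the response-function template (peak value 1)."""
    sigma = fwhm / 2.3548
    return np.exp(-u**2 / (2.0 * sigma**2))

def model(pix, flux, centroid):
    """Template scaled by the flux and shifted to a sub-pixel centroid."""
    return flux * template(pix - centroid)

rng = np.random.default_rng(5)
pix = np.arange(-8.0, 9.0)                               # pixel coordinates; the blur spot is oversampled
data = model(pix, 100.0, 0.37) + rng.normal(0.0, 2.0, pix.size)   # true flux 100, centroid 0.37 pixels

popt, pcov = curve_fit(model, pix, data, p0=[data.max(), 0.0])
print(popt, np.sqrt(np.diag(pcov)))                      # estimated flux and centroid with 1-sigma uncertainties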
Figure 4-5. One-dimensional cut through a point-source image; the horizontal axis is in units of pixels, the x symbols are flux measurements with the local median background subtracted and in arbitrary units, and the solid curve is the response-function template representing a noiseless point-source flux distribution. The template is fit to the data via chi-square minimization, and the peak amplitude is the estimated point-source flux, whose uncertainty stems from photometric noise and response-function model error.
The template is shown in Figure 4-5 as a solid curve whose centroid and amplitude are fit to the pixel data by chi-square minimization, since the Gaussian model for the relevant errors is usually an excellent approximation. The position reported for a point source is the centroid of the template, and the flux is the amplitude that scales the normalized template to fit the observations. The illustration is for a single observation. When the point source is re-observed, the photometric noise is generally independent, but any error in the template model will be the same as before, or at least similar and generally of the same sign. Thus photometric noise produces nonsystematic error, and template model imperfections create systematic error.

Modern astronomical data processing algorithms employ all of the observations of a given point source simultaneously, because that maximizes the S/N of the solution and aids in identifying outliers. Thus the estimated position is that which minimizes chi-square over all observations, and then the template fit essentially averages over all flux fits for that position. Sometimes a single flux solution is computed for all measurements simultaneously, in which case any flux-dependent errors make the equations to be solved nonlinear. Typically initial estimates for the individual fluxes are used to compute the flux-dependent errors, and then each iteration of the nonlinear solution can use the fluxes of the previous iteration for that purpose, or if the initial estimates are a sufficiently good
approximation for the error model, then the equations may be recast as linear. We will assume the latter below wherein flux-dependent template errors can be computed once from individual template fits and then used in the averaging for a set of individual fluxes. What interests us here is the effect of the systematic error due to template imperfections on the flux averaging process. Each measurement such as the one illustrated in Figure 4-5 results in a nonsystematic flux error due to a number of noise processes (typically dominated by photometric noise, random fluctuations in the photon arrival rate during the flux integration period) and a systematic flux error caused by template estimation imperfection. Other possible systematic effects may stem from background estimation error, if it affects all observations of the given source similarly, and “confusion noise” caused by nearby real sources in a dense field. To the extent that these produce systematic error, they may be combined in quadrature with the template error, and so we will consider the systematic error as a single entity. The flux errors on the ith measurement due to systematic and nonsystematic effects will be denoted εsi and εni, respectively. They sum to form the total error, εi. The expectation value of the square of εi is the flux uncertainty variance, σi², which we will denote vii in error covariance matrices in order to be consistent with the notation vij for the off-diagonal elements, i ≠ j.
\[
\varepsilon_i = \varepsilon_{ni} + \varepsilon_{si}\,, \qquad
v_{ii} \equiv \sigma_i^2 = \left\langle \varepsilon_i^2 \right\rangle
= \left\langle \varepsilon_{ni}^2 \right\rangle + 2\left\langle \varepsilon_{ni}\varepsilon_{si} \right\rangle + \left\langle \varepsilon_{si}^2 \right\rangle
= \sigma_{ni}^2 + \sigma_{si}^2
\tag{4.16}
\]
where the term 2⟨εni εsi⟩ is zero because the systematic and nonsystematic errors are uncorrelated. The covariance for measurements i and j, i ≠ j, is
\[
v_{ij} = \left\langle \varepsilon_i \varepsilon_j \right\rangle
= \left\langle \left(\varepsilon_{ni} + \varepsilon_{si}\right)\left(\varepsilon_{nj} + \varepsilon_{sj}\right) \right\rangle
= \left\langle \varepsilon_{ni}\varepsilon_{nj} \right\rangle
+ \left\langle \varepsilon_{ni}\varepsilon_{sj} \right\rangle
+ \left\langle \varepsilon_{si}\varepsilon_{nj} \right\rangle
+ \left\langle \varepsilon_{si}\varepsilon_{sj} \right\rangle
\tag{4.17}
\]
The first three expectation values are zero, because those errors are uncorrelated. If the systematic error is the same on every measurement, then the i and j subscripts in the last expectation value are irrelevant, and we just have σs², the systematic error variance. Usually the template error causes a flux-dependent systematic error, however, and so the error model must provide that dependence, and the multiple-measurement error covariance matrix will not be floor-diagonal like that of case E in Appendix G, but the qualitative features should be fairly similar. Here we will consider the more general case, which may be approximated as σsi = yi σs, where yi is the flux on measurement i, σs is a unit-flux template uncertainty, and we assume that the error in yi is small enough to use yi σs ≈ y̆ σs, where y̆ is the true value of y. Then vij = yi yj σs². Since negative or zero fluxes are generally not of interest, we can assume that vij > 0. If the systematic error is the same on every measurement, then all off-diagonal elements of the error covariance matrix are the same, and so Equation 4.16 implies that vii ≥ vij, i ≠ j. But since we are using vij = yi yj σs², if the fluxes are not all equal, then the off-diagonal elements won't be either, and vii < vij is possible. This has a striking consequence that will
be discussed below. Since inverse-variance averaging of two measurements is a frequent activity and involves simpler algebra than the N-measurement case, we will consider first how to average a pair of measurements that share a systematic error. As shown at the end of Appendix D, inverse-variance averaging for Gaussian errors is equivalent to a chi-square minimization fit to a zeroth-order polynomial. We take this approach because the equations we need are already given in Appendix D, whose notation we will use here. The model is given by Equation D.1 (p. 434) with the number of terms M equal to 1 and the basis function f1 = x⁰ = 1, so that the model is simply y = p1, where y is the function of the measurements that we wish to estimate (a constant equal to the inverse-variance-weighted average of the observed yi values, i.e., the measured fluxes in the notation of Appendix D), and p1 is the model coefficient for the zeroth-order (and only) term. We have the data error covariance matrix ΩD and its inverse, the weighting matrix W:
\[
\Omega_D = \begin{pmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{pmatrix}
= \begin{pmatrix} \sigma_{n1}^2 + y_1^2\sigma_s^2 & y_1 y_2 \sigma_s^2 \\ y_1 y_2 \sigma_s^2 & \sigma_{n2}^2 + y_2^2\sigma_s^2 \end{pmatrix}, \qquad
W = \Omega_D^{-1}
= \frac{1}{v_{11}v_{22} - v_{12}v_{21}} \begin{pmatrix} v_{22} & -v_{12} \\ -v_{21} & v_{11} \end{pmatrix}
= \begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{pmatrix}
\tag{4.18}
\]
where we do not write the matrices explicitly as symmetric even though they are, because this indexing is consistent with some summations below. Equation D.8 shows that the model coefficient vector p is computed according to the product of an inverse matrix and a vector. In our case this reduces to a division of one scalar by another, p1 = b1/a11, where b1 and a11 are obtained from Equation D.24 for our zeroth-order polynomial case, m = k = 1, with N = 2, since we are averaging only two measurements, and using w12 = w21 and f1 = 1:
\[
\begin{aligned}
b_1 &= \frac{1}{2}\sum_{i=1}^{2}\sum_{j=1}^{2} w_{ij}\left(y_i f_1 + y_j f_1\right) = w_{11}\,y_1 + w_{22}\,y_2 + w_{12}\left(y_1 + y_2\right) \\
a_{11} &= \frac{1}{2}\sum_{i=1}^{2}\sum_{j=1}^{2} w_{ij}\left(f_1^2 + f_1^2\right) = w_{11} + w_{22} + 2\,w_{12}
\end{aligned}
\tag{4.19}
\]
The uncertainty variance of p1 is the inverse of the A matrix, which in our case is just 1/a11:
\[
\sigma_{p_1}^2 = \frac{1}{a_{11}} = \frac{1}{w_{11} + w_{22} + 2\,w_{12}}
\tag{4.20}
\]
Substituting the original data uncertainty variances,
\[
\begin{aligned}
p_1 &= \frac{v_{22}\,y_1 + v_{11}\,y_2 - v_{12}\left(y_1 + y_2\right)}{v_{22} + v_{11} - 2\,v_{12}} \\
\sigma_{p_1}^2 &= \frac{v_{11}v_{22} - v_{12}^2}{v_{22} + v_{11} - 2\,v_{12}}
\end{aligned}
\tag{4.21}
\]
Using v12 = ρ√(v11 v22) = ρσ1σ2, these can be written
\[
\begin{aligned}
p_1 &= \frac{\sigma_2^2\,y_1 + \sigma_1^2\,y_2 - \rho\sigma_1\sigma_2\left(y_1 + y_2\right)}{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2} \\
\sigma_{p_1}^2 &= \frac{\sigma_1^2\sigma_2^2\left(1 - \rho^2\right)}{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}
\end{aligned}
\tag{4.22}
\]
This shows the effects of correlated errors on the average of two measurements. When the correlation is zero, the familiar formulas result:
\[
p_1 = \frac{\sigma_2^2\,y_1 + \sigma_1^2\,y_2}{\sigma_1^2 + \sigma_2^2}\,, \qquad
\sigma_{p_1}^2 = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}
\tag{4.23}
\]
If the two measurement uncertainties are equal, then the correlation has no effect on the average, but it still affects the uncertainty:
\[
\begin{aligned}
p_1\left(\sigma_1 = \sigma_2\right) &= \frac{\sigma_1^2\,y_1 + \sigma_1^2\,y_2 - \rho\sigma_1^2\left(y_1 + y_2\right)}{2\sigma_1^2 - 2\rho\sigma_1^2} = \frac{y_1 + y_2}{2} \\
\sigma_{p_1}^2\left(\sigma_1 = \sigma_2\right) &= \frac{\sigma_1^4\left(1 - \rho^2\right)}{2\sigma_1^2 - 2\rho\sigma_1^2} = \frac{\sigma_1^2\left(1 + \rho\right)}{2}
\end{aligned}
\tag{4.24}
\]
The second line shows that positive correlation increases the uncertainty of the average. This is not true in general when the correlation coefficient is close to 1, but when the measurement uncertainties are equal, as ρ approaches 1, the average becomes as uncertain as each of the two measurements. With 100% correlation, a second measurement adds no information, and if y1 ≠ y2, one has an enigma to explain: if the exact same error occurred on both measurements, how did different results come out? It seems that either two different objects were measured, or the object is variable, in which case the mean flux can be useful, but its uncertainty is unimproved over either measurement, and inverse-variance weighting is inappropriate anyway because its application is
conditioned on the two unknown "true" values being arbitrarily close to each other, as in the discussion of Equation 4.15 above and further elaborated in section 4.8 below. When the quantities being averaged are known not to be equal, as in computing the average number of children per family, inverse-variance weighting is generally not applicable. For negative correlation, the average becomes less uncertain than when there is no correlation, approaching zero as ρ approaches -1. Mathematically, this makes sense: the error on the second measurement exactly cancels the error on the first. In the real world, such extreme correlations seldom arise. There is usually plenty of nonsystematic error to dilute the correlation. In any case, as mentioned above, correlations due to systematic error tend to be positive. The correlation between the errors in the slope and intercept solutions for a fit of data to a straight line, as also mentioned above, can be strongly negative, and these errors sometimes do come close to canceling each other, but the common noise sources found in Nature provide few opportunities for systematic measurement errors to fall into negative correlation. In the general case, the error covariance matrix approaches being singular as ρ approaches ±1, where some effects occur that may defy intuition. Equation 4.22 shows that the numerator of the uncertainty variance goes to zero at both extremes, while the denominator goes to the square of the difference or sum of σ1 and σ2. The former approaches division of zero by zero when σ1 = σ2, while the limit of the fraction is the value shown in Equation 4.24. Such symptoms indicate that the situation being analyzed is unphysical or poorly formulated, as seen in the fact that the uncertainty for 100% correlation becomes zero if σ1 ≠ σ2. The slightest difference between σ1 and σ2 is enough to make the difference between zero and the nonzero value of Equation 4.24! Taken literally, these conditions imply that the error is revealed by the more uncertain measurement and can be removed completely (this will be discussed further below). Figure 4-6 shows σp1 versus ρ for σ1 = 2 and σ2 = 2.1 as a solid line, and σ1 = σ2 = 2 as a dashed line. The latter can be seen to obey Equation 4.24, while the former shows the sudden drop to zero as ρ gets very close to 1.
Figure 4-6. Solid line: σp1 versus ρ for σ1 = 2 and σ2 = 2.1; the slight inequality between σ1 and σ2 is sufficient to cause the precipitous drop to zero near ρ = 1. Dashed line: σp1 versus ρ for σ1 = σ2 = 2; the precipitous drop is averted, and the final uncertainty is equal to the two equal measurement uncertainties.
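The behavior plotted in Figure 4-6 is easy to verify numerically from Equation 4.22. The following is a minimal sketch (assuming NumPy is available; the particular grid of ρ values is an illustrative assumption, not taken from the text):

```python
import numpy as np

def sigma_p1(s1, s2, rho):
    """Uncertainty of the correlated two-measurement average, Eq. 4.22."""
    var = s1**2 * s2**2 * (1.0 - rho**2) / (s1**2 + s2**2 - 2.0 * rho * s1 * s2)
    return np.sqrt(var)

for rho in (0.0, 0.9, 0.99, 0.999, 0.9999):
    # sigma1 = 2, sigma2 = 2.1 (solid curve) vs. sigma1 = sigma2 = 2 (dashed curve)
    print(rho, sigma_p1(2.0, 2.1, rho), sigma_p1(2.0, 2.0, rho))
```

The equal-uncertainty column climbs toward 2 in accordance with Equation 4.24, while the slightly unequal column plunges toward zero as ρ approaches 1.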
For physically realistic cases, positive error correlation generally increases the uncertainty unless one measurement uncertainty is much greater than the other, in which case the final uncertainty is less than the smaller measurement uncertainty, as shown in the second and third lines in Equation 4.25 below. Compare this to the uncorrelated case, in which the final uncertainty becomes equal to the smaller measurement uncertainty as the larger becomes arbitrarily large. Any correlation, positive or negative, causes a reduction in final uncertainty under those conditions! Nonzero correlation implies more information, and this can reduce uncertainties or reveal that they should be larger. To illustrate this (possibly) unexpected behavior, we plot σp1 for ρ = 0.9 and σ1 = 2 versus σ2 in Figure 4-7. Taking the derivative of σp1 with respect to σ2, setting this to zero, and solving for σ2, we find that σp1 is maximized at σ2 = σ1/ρ.
Figure 4-7. σp1 versus σ2 for ρ = 0.9 and σ1 = 2. As σ2 approaches infinity, σp1 approaches σ1√(1-ρ²), in this case 0.871779788. The peak value of σp1 occurs at σ2 = σ1/ρ, in this case 2/0.9 ≈ 2.222.
When the two measurement uncertainties are about the same size, positive correlation increases the final uncertainty, with precipitous drops toward zero and the onset of numerical instabilities occurring only when the correlated-error model has been pushed too close to 100%. Some limits of interest are as follows.
\[
\begin{aligned}
\lim_{\rho \to -1} \sigma_{p_1}^2 &= 0 \\
\lim_{\sigma_1 \to \infty} \sigma_{p_1}^2 &= \left(1 - \rho^2\right)\sigma_2^2 \\
\lim_{\sigma_2 \to \infty} \sigma_{p_1}^2 &= \left(1 - \rho^2\right)\sigma_1^2 \\
\lim_{\rho \to 1} p_1 &= \frac{\sigma_1 y_2 - \sigma_2 y_1}{\sigma_1 - \sigma_2} \\
\lim_{\rho \to -1} p_1 &= \frac{\sigma_1 y_2 + \sigma_2 y_1}{\sigma_1 + \sigma_2} \\
\lim_{\sigma_1 \to \infty} p_1 &= y_2 \\
\lim_{\sigma_2 \to \infty} p_1 &= y_1
\end{aligned}
\tag{4.25}
\]
The fourth line may appear at first glance not to be symmetric in measurement number, but interchanging the measurement indexes just amounts to multiplying the numerator and denominator by -1. Again, if y1 ≠ y2, then one must wonder how different results were obtained from two measurements with 100%-correlated errors, but fortunately this extreme correlation is just a limiting case that (to the author's knowledge) never arises in practice. The fifth line is somewhat remarkable in that it resembles the uncorrelated formula for p1 in Equation 4.23 but with the variances all replaced by standard deviations. The last two lines are the same as for zero correlation. Since nothing depends on how we number the measurements, let us consider the case when y1 ≠ y2 and number them in order of increasing value, i.e., y1 < y2. Then a potentially shocking fact lurks in the fourth line of Equation set 4.25: the inverse-variance-weighted mean can lie outside of the range (y1,y2) when σ1, σ2, and ρ take on certain values! For example, evaluating that fourth equation for y1 = 9 and y2 = 11, with σ1 = 2 and σ2 = 3, with ρ arbitrarily close to +1, we find p1 = 5, well below the range (9,11). If we swap the uncertainties, we find p1 = 15, well above the (9,11) range. In this extreme case, p1 will lie below the range (y1,y2) if σ1 < σ2 and above it if σ1 > σ2. In no case can p1 be equal to the measurement with the larger uncertainty or outside the range on that side. Taking σ1 < σ2, if we set p1 in Equation 4.22 equal to y1 and solve for ρ, we find ρ = σ1/σ2, so in this example we find p1 = y1 when ρ = 2/3. For larger values of ρ, p1 < y1. Figure 4-8 illustrates p1 for this case of y1 = 9, y2 = 11, σ1 = 2, and σ2 = 3 as a function of ρ. When ρ = σ1/σ2 we also find σp1 = σ1 (still taking σ1 < σ2), and for larger values of ρ, σp1 < σ1, dropping to zero as seen in Figure 4-6. Since v12 = ρσ1σ2, the condition ρ = σ1/σ2 implies that v12 = σ1² = v11, a physically questionable situation but one for which the error covariance matrix is still positive definite if and only if v22 > v11, i.e., y2 must have some error that does not afflict y1. This extra error may be nonsystematic or systematic. The latter is possible if and only if the measurements do not all have exactly the same systematic error.
Figure 4-8. p1 versus ρ for the case y1 = 9, y2 = 11, σ1 = 2, and σ2 = 3. The unweighted average is 10; rounded to six decimal places, the inverse-variance-weighted average for zero correlation is 9.615385 with an uncertainty of 1.664101; for ρ = 2/3, it is equal to y1, the smaller measurement, 9.000000 with an uncertainty of 2.000000; for ρ = 0.9, it is 7.727273 (less than the smaller measurement) with an uncertainty of 1.763261.
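The values quoted in the caption follow directly from Equations 4.21 and 4.22; the sketch below (assuming NumPy; the extra ρ = 0.99 row is an illustrative addition) is a minimal cross-check of the Figure 4-8 case:

```python
import numpy as np

def weighted_pair(y1, y2, s1, s2, rho):
    """Correlated inverse-variance-weighted average and its uncertainty, Eqs. 4.21-4.22."""
    v11, v22, v12 = s1**2, s2**2, rho * s1 * s2
    denom = v11 + v22 - 2.0 * v12
    p1 = (v22 * y1 + v11 * y2 - v12 * (y1 + y2)) / denom
    sigma = np.sqrt((v11 * v22 - v12**2) / denom)
    return p1, sigma

for rho in (0.0, 2.0 / 3.0, 0.9, 0.99):
    print(rho, weighted_pair(9.0, 11.0, 2.0, 3.0, rho))
# rho = 0   -> (9.615385, 1.664101)
# rho = 2/3 -> (9.000000, 2.000000)
# rho = 0.9 -> (7.727273, 1.763261)
```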
It can take intuition a while to adapt to the idea that an "average" lying outside the measurement range is not some fallacy in the formalism. The first experience most people have with averaging involves no weighting and no notion of correlated measurement errors, so the mean naturally falls inside the range of the numbers being averaged. It would be unthinkable for the mean number of children per family to be greater than the greatest number of children per family encountered in the data-taking phase of the study. On the other hand, when measurements are contaminated by noise, one expects the data values to be somewhat off from the correct number. With two measurements to average, each equally likely to be somewhat high or somewhat low, there are four equally probable scenarios: two in which the true value is straddled by the measurements, one in which both measurements are too low, and one in which both are too high. So in half the cases, the correct value lies outside of the observed range. With measurement uncertainties taken into account, assuming that they are not exactly the same, one of the measurements can be considered more likely to be closer to the truth than the other. The measurement with the smaller uncertainty gets more weight in the averaging process. Sometimes this takes the answer further from the truth, but assuming that the error analysis was done in a professional manner, more often than not it will produce an average that is closer to the right answer than an unweighted average. Now when correlation is added to the mix, a new insight is possible. As seen in Figure 4-8, negative correlation does not force the average outside of the range of observed values, and this is true in general. Systematic errors tend to produce positive correlation, so if the error model recognizes the presence of systematic errors, then we can expect ρ > 0. Given that p1 hits a range limit when ρ = σ1/σ2 (still taking σ1 < σ2), the closer the two measurement uncertainties, the more correlation is needed to get outside of the range. The larger the difference between σ1 and σ2, the smaller the correlation needs to be to drive p1 outside of the range. Consider the case when σ1 ≪ σ2: then y2 is likely to be further off than y1. But the impact of the
systematic error, which is proportional to ρ, is likely to show up in the direction in which y2 is off, and y1 is also affected by this same systematic error, and thus there is some likelihood that it is off in the same direction as y2, just probably not quite as far off. Thus both measurements could be off in the same direction, and depending on the specific values of σ1, σ2, and ρ, it could be more likely than not. This is why certain combinations of σ1, σ2, and ρ lead to estimates outside of the observed range but closer to the measurement with the smaller uncertainty. This uncomfortable phenomenon disappears for cases in which all correlation is due to the same systematic error. As we stated above, in this case, v12 can never be greater than v11, since that would require a negative nonsystematic error variance, hence an error distribution with an imaginary standard deviation. This prevents the inverse-variance-weighted average from falling outside of the range of measured values. The closest it can come is to be equal to y1 when ρ = σ1/σ2 (still taking σ1 < σ2 but not necessarily y1 < y2, i.e., y1 has the smaller uncertainty but could be either range limit). We could also change our approximation yiσs ≈ y̆σs to be compatible with the hypothesis that both measurements stem from the same true value, e.g., y̆σs ≈ (y1+y2)σs/2, making the two systematic errors equal. In order to get p1 outside of the range, we must have ρ > σ1/σ2, which implies
\[
\begin{aligned}
\frac{\sigma_1}{\sigma_2} &< \rho \\
\sigma_1^2 &< \rho\,\sigma_1\sigma_2 = v_{12} \\
v_{11} &< v_{12}
\end{aligned}
\tag{4.26}
\]
and as we have seen, the last line cannot happen for correlation due exclusively to the same systematic error in all measurements. So in principle, we could get p1 to be equal to one end of the range, but only for the implausible case when one measurement has no nonsystematic error but the other one does. We cannot get p1 outside of the range of measured values. This also means that the unruly behavior for ρ very close to +1 never happens for this case, because in order to get ρ very close to +1, we must have σ1 only very slightly less than σ2, and this introduces the immunity of p1 to ρ shown in Equation 4.24. Of course, the uncertainty is not immune to correlation. We will take a look at two extreme examples that come as close to ρ = σ1/σ2 as we can get via constant systematic error, one with the uncertainties almost equal, and one with the uncertainties very different. The former produces ρ very close to +1 and the latter produces ρ very close to zero, but neither forces p1 outside of the range of measurements, which can be arbitrary as long as they are different, so we will stick with y1 = 9 and y2 = 11. The first case is
\[
\Omega_D = \begin{pmatrix} 12 & 11.9999 \\ 11.9999 & 12.0001 \end{pmatrix}, \qquad
\rho = 0.9999875001, \qquad
p_1 = 9.666666667, \qquad
\sigma_{p_1}^2 = 11.999966667
\tag{4.27}
\]
The effect of ρ having an extreme value is largely overcome by σ1 and σ2 being so nearly equal, and Equation 4.24 comes almost into full force. The correlation does move p1 closer to y1 than what the uncorrelated result would have been, namely 9.99999583, and σp1² is almost double the corresponding uncorrelated value, 6.0000249, but p1 is still within the measurement range. The second
extreme example is
\[
\Omega_D = \begin{pmatrix} 12 & 11.9999 \\ 11.9999 & 9999 \end{pmatrix}, \qquad
\rho = 0.034642460, \qquad
p_1 = 9.000000020, \qquad
\sigma_{p_1}^2 = 11.999999999999
\tag{4.28}
\]
The effect of the much larger σ2 is to weight y1 much more heavily, moving p1 almost to the bottom of the measurement range, and to reduce ρ to such a small value that it plays almost no role. The corresponding uncorrelated results are p1 = 9.0023974 and σp1² = 11.9856158. This leaves the case when the off-diagonal elements of the error covariance matrix are greater than one of the diagonal elements, e.g., as in Equation 4.26. If y1 ≠ y2, the systematic errors on the diagonal in Equation 4.18 will not be equal, and in this case, the "average" can fall outside of the data range. Consider Equation 4.18 with the values y1 = 9, y2 = 11, σn1 = 1, σn2 = 1, and σs = 0.3. Then we have
\[
\Omega_D = \begin{pmatrix} 8.29 & 8.91 \\ 8.91 & 11.89 \end{pmatrix}
\tag{4.29}
\]
Although v11 < v12, the matrix is still positive definite. We find that ρ = 0.897448, larger than σ1/σ2, 0.835, allowing p1 to fall outside of the data range, and indeed we find that Equation 4.21 yields p1 = 8.474576 and σp1 = 2.850810. The algorithm decides that both measurements are off in the same direction and that the more uncertain measurement betrays which direction, namely too high, so that the most likely true value is a bit less than the smaller measurement. This happens only because of the strong dominance of the systematic error. The nonsystematic errors do not need to be much larger to push p1 back into the measurement range; σn1 = σn2 = 1.3 will suffice to yield p1 = 9.037433. Now we will consider the practical issues involved in taking correlated errors into account when averaging measurements. The formalism for the general case of N measurements is a straightforward generalization of the equations above. The matrices in Equation 4.18 must be extended to N×N, and the inverse shown there for the 2×2 case no longer applies when N > 2.
\[
\Omega_D = \begin{pmatrix}
v_{11} & v_{12} & v_{13} & \cdots & v_{1N} \\
v_{21} & v_{22} & v_{23} & \cdots & v_{2N} \\
v_{31} & v_{32} & v_{33} & \cdots & v_{3N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
v_{N1} & v_{N2} & v_{N3} & \cdots & v_{NN}
\end{pmatrix}, \qquad
W = \Omega_D^{-1} = \begin{pmatrix}
w_{11} & w_{12} & w_{13} & \cdots & w_{1N} \\
w_{21} & w_{22} & w_{23} & \cdots & w_{2N} \\
w_{31} & w_{32} & w_{33} & \cdots & w_{3N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{N1} & w_{N2} & w_{N3} & \cdots & w_{NN}
\end{pmatrix}
\tag{4.30}
\]
These matrices are symmetric but not explicitly indexed as symmetric in order to be consistent with the following summations, which are just those of Equation 4.19 extended for N measurements and substituting 1 for all occurrences of f1:
\[
b_1 = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left(y_i + y_j\right), \qquad
a_{11} = \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}
\tag{4.31}
\]
The average, p1 in this notation, and its uncertainty variance are the same fractions as before:
\[
p_1 = \frac{b_1}{a_{11}}\,, \qquad
\sigma_{p_1}^2 = \frac{1}{a_{11}}
\tag{4.32}
\]
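For symmetric W, Equations 4.31 and 4.32 reduce to b1 = 1ᵀWy and a11 = 1ᵀW1, which is straightforward to implement. The following is a minimal sketch (assuming NumPy; it is not the author's production code), checked here against the two-measurement covariance of Equation 4.29:

```python
import numpy as np

def correlated_average(y, cov):
    """General correlated inverse-variance-weighted average, Eqs. 4.31-4.32."""
    W = np.linalg.inv(cov)          # weighting matrix (Eq. 4.30)
    ones = np.ones(len(y))
    a11 = ones @ W @ ones           # Eq. 4.31 with f1 = 1
    b1 = ones @ W @ y
    return b1 / a11, np.sqrt(1.0 / a11)

y = np.array([9.0, 11.0])
cov = np.array([[8.29, 8.91],
                [8.91, 11.89]])     # Eq. 4.29
print(correlated_average(y, cov))   # approx. (8.474576, 2.850810)
```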
Although these equations are simple and compact, it is not unusual for the full rigorous formalism to be difficult to implement, usually because of problems with the data error covariance matrix, whose size scales as the square of the number of measurements. As case E in Appendix G shows, systematic errors tend to populate the entire error covariance matrix with nonzero elements. That example involves all the off-diagonal elements being equal, however, which suggests that some simplification should be possible, even when the off-diagonal elements are only approximately equal. A shortcut that is sometimes used to handle systematic errors is to ignore their effect on the average, then add the systematic-error variance to the final uncertainty in quadrature. This allows the systematic error variance to escape reduction by the factor of approximately N that shrinks the nonsystematic error variance (assuming the diagonal elements do not vary too much). It also takes advantage of the fact that when the measurement uncertainties are almost equal, correlated errors have almost no effect on the average, as shown in the first line of Equation set 4.24. The first phase amounts to setting all off-diagonal elements to zero and keeping only the nonsystematic error variances in the diagonal, i.e., vii = σni², vij = 0 for i ≠ j. Since systematic errors usually induce positive correlation, and often it is not extreme, this approximation may be acceptable, typically incurring relative errors of only a few percent. But there is no guarantee that such errors will be small, and figuring out the error bounds is often not much easier than simply using the rigorous formulation. The point of setting all off-diagonal elements of the error covariance matrix to zero is to remove the need for that matrix altogether, and since the diagonal elements are all just σni², the first phase reduces Equation 4.31 to the familiar formulas
\[
b_1 = \sum_{i=1}^{N} \frac{y_i}{\sigma_{ni}^2}\,, \qquad
a_{11} = \sum_{i=1}^{N} \frac{1}{\sigma_{ni}^2}
\tag{4.33}
\]
and including the second phase, Equation 4.32 becomes
\[
p_1 = \frac{b_1}{a_{11}}\,, \qquad
\sigma_{p_1}^2 = \frac{1}{a_{11}} + \sigma_s^2
\tag{4.34}
\]
where σs² is the systematic-error variance and is not indexed by measurement because either it is the same in all measurements or else it is an average value. This approximation is assumed to be acceptable when this shortcut method is used. 22 Monte Carlo simulations were run to probe this shortcut method and seek vulnerabilities. Each run consisted of one million trials with a given prior error covariance matrix, and each trial used an independently generated pseudorandom 12-vector such as those discussed in Appendix G using the grid method (e.g., case E therein). Each 12-vector was used to compute 11 different averages simulating N measurements, N = 2 to 12, by using the upper-left partition of the 12×12 error covariance matrix of the appropriate size. The values of p1 and σp1 from the shortcut method were compared to those of the rigorous method, and average relative errors were computed as follows.
\[
E(p_1) = \left\langle \frac{(p_1)_{\text{shortcut}} - (p_1)_{\text{rigorous}}}{(p_1)_{\text{rigorous}}} \right\rangle, \qquad
E(\sigma_{p_1}) = \left\langle \frac{(\sigma_{p_1})_{\text{shortcut}} - (\sigma_{p_1})_{\text{rigorous}}}{(\sigma_{p_1})_{\text{rigorous}}} \right\rangle
\tag{4.35}
\]
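The following sketch (assuming NumPy, with entirely hypothetical measurement values and variances, not the code used for the runs described below) contrasts the shortcut of Equations 4.33-4.34 with the rigorous Equations 4.31-4.32 for a floor-diagonal covariance matrix, for which the two agree exactly:

```python
import numpy as np

# Hypothetical example: 5 measurements with nonsystematic variance 3 each and a
# common systematic variance of 9, giving a floor-diagonal covariance matrix.
y = np.array([9.0, 11.0, 10.0, 9.5, 10.5])
sig_n2 = np.full(5, 3.0)
sig_s2 = 9.0
cov = np.diag(sig_n2) + sig_s2

# Rigorous (Eqs. 4.31-4.32)
W = np.linalg.inv(cov)
a11 = W.sum()
p1_rig, var_rig = (W @ y).sum() / a11, 1.0 / a11

# Shortcut (Eqs. 4.33-4.34)
w = 1.0 / sig_n2
p1_short, var_short = (w * y).sum() / w.sum(), 1.0 / w.sum() + sig_s2

print(p1_rig, var_rig)       # 10.0, 9.6
print(p1_short, var_short)   # 10.0, 9.6 -- identical for this structure
```

This exact agreement for strictly floor-diagonal matrices is consistent with the case-E results reported below; the interesting question is how the shortcut degrades when the correlation structure departs from that ideal.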
To review briefly, the grid method in Appendix G allows up to 144 independent zero-mean unit-variance pseudorandom generators to contribute to elements of a 12-vector in any desired combination. Assigning some generators to all 12 elements produces systematic error. Assigning a given generator to more than one element but less than 12 produces partial correlations that are not “systematic” in the sense of contaminating all 12 measurements. The shortcut method has no way of taking these into account, since it assumes that systematic error affects all measurements and is thus characterized by a single variance that can be added at the end. Nine runs based on case E were made with varying initial random-number seeds and amounts of systematic error, ranging from zero to 9 of the 12 generators assigned to each element. In other words, the error covariance matrices were all floor-diagonal with values of 12 on the diagonal and equal off-diagonal elements ranging from 0 to 9 in the nine cases, respectively. Since the diagonal elements of the error covariance matrix were always equal, we expect Equation 4.24 to apply. This expectation is fulfilled: none of the nine runs produced measurable error in the shortcut results. The relative errors defined in Equation 4.35 were printed out as percentages with four decimal places, and these rounded off to all zeros in all cases for both p1 and σp1. The shortcut method has no weakness when given floor-diagonal error covariance matrices. One modification of case E was made in which the number of systematic-error generators alternated between 3 and 4 per element. This resulted in a prior error covariance matrix that was only approximately floor-diagonal:
\[
\Omega_D = \begin{pmatrix}
12 & 3 & 4 & 3 & 4 & 3 & 4 & 3 & 4 & 3 & 4 & 3 \\
3 & 11 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 \\
4 & 3 & 12 & 3 & 4 & 3 & 4 & 3 & 4 & 3 & 4 & 3 \\
3 & 3 & 3 & 11 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 \\
4 & 3 & 4 & 3 & 12 & 3 & 4 & 3 & 4 & 3 & 4 & 3 \\
3 & 3 & 3 & 3 & 3 & 11 & 3 & 3 & 3 & 3 & 3 & 3 \\
4 & 3 & 4 & 3 & 4 & 3 & 12 & 3 & 4 & 3 & 4 & 3 \\
3 & 3 & 3 & 3 & 3 & 3 & 3 & 11 & 3 & 3 & 3 & 3 \\
4 & 3 & 4 & 3 & 4 & 3 & 4 & 3 & 12 & 3 & 4 & 3 \\
3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 11 & 3 & 3 \\
4 & 3 & 4 & 3 & 4 & 3 & 4 & 3 & 4 & 3 & 12 & 3 \\
3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 3 & 11
\end{pmatrix}
\tag{4.36}
\]
Some shortcut approximation errors begin to show up in this case, as summarized in the table below.

N     E(p1) %     E(σp1) %
2      0.0000      0.0000
3     -0.0004     -1.6336
4     -0.0004     -1.0063
5     -0.0009     -2.0436
6     -0.0008     -1.4643
7     -0.0013     -2.1935
8     -0.0001     -1.6950
9     -0.0009     -2.2412
10    -0.0009     -1.8134
11    -0.0011     -2.2402
12    -0.0005     -1.8702
Earlier we stated a goal to keep the error of uncertainty estimation smaller than 20%. The shortcut method achieves this goal easily in this case. The pattern of getting p1 accurate to better than 0.01% applies to all the Monte Carlo runs, as we will see below. It is primarily the accuracy of σp1 that is a concern. The Monte Carlo runs showed that the shortcut method is not sensitive to variation along the diagonal of the error covariance matrix as long as the "floor" is approximately flat. Nor is it particularly sensitive to the amount of systematic error as long as this dominates the correlation. But when there is a lot of correlation without much all-measurement systematic error causing it, the shortcut
method can fail badly in uncertainty estimation. The worst case of this sort was found for case C in Appendix G, which has no systematic error but strong enough correlation to reach across neighboring elements to each next neighbor, i.e., the error covariance matrix is pentadiagonal:
\[
\Omega_D = \begin{pmatrix}
12 & 8 & 4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
8 & 12 & 8 & 4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
4 & 8 & 12 & 8 & 4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 4 & 8 & 12 & 8 & 4 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 4 & 8 & 12 & 8 & 4 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 4 & 8 & 12 & 8 & 4 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 4 & 8 & 12 & 8 & 4 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 4 & 8 & 12 & 8 & 4 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 4 & 8 & 12 & 8 & 4 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 4 & 8 & 12 & 8 & 4 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 4 & 8 & 12 & 8 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 4 & 8 & 12
\end{pmatrix}
\tag{4.37}
\]
For this case, the shortcut method continues to do well on p1 but fails badly on σp1. This is understandable, because there is a lot of correlation that the shortcut method does not know about and has no way to take into account.

N     E(p1) %     E(σp1) %
2      0.0000     18.3216
3     -0.0053     22.4745
4     -0.0031     35.4006
5      0.0011     39.6424
6     -0.0031     46.3850
7      0.0015     55.8387
8      0.0019     60.9630
9      0.0013     67.3320
10     0.0012     75.1190
11    -0.0000     80.1875
12    -0.0007     86.0521
Adding systematic error actually helps the shortcut method. If all the "0" off-diagonal elements in Equation 4.37 are replaced with "4", the errors in σp1 are approximately cut in half, which is still much too large. The conclusions based on the Monte Carlo tests are that the shortcut method is generally quite
good as long as essentially all correlation is due to systematic error in the strict sense that all measurements have about the same systematic error and there is negligible other correlation present. In all cases, the method estimated the average well. But it should be clear from preceding material that we place as much emphasis on getting the uncertainty right as we do on getting the average right. Too large or too small an uncertainty undermines hypothesis testing. In the former case, discoveries may be missed because discrepancies appeared not to be statistically significant, and in the latter case, published results may have to be retracted because remarkable results turned out not to be statistically significant after all. As we have stated previously, the goal is to express each measurement (and quantities derived from measurements, e.g., averages) in the form of a probability density function, which requires estimating not only a mean (or other location parameter) but also a standard deviation (or other width parameter), and possibly other parameters needed to characterize less frequently encountered distributions (e.g., the Trapezoidal distribution shown in Figure G-3B, p. 472). The density functions defined by the means and variances in Equations 4.21, 4.22, and 4.34 are Gaussian, since the averages are all linear combinations of Gaussian random variables. The fact that the latter are generally correlated does not change the rule "Gaussians in, Gaussians out". One important point must be stressed: the formal handling of systematic errors illustrated above clearly depends on theoretical knowledge of the source of systematic error, which cannot be discovered in a sample. Since the systematic error affects the sample mean directly, it subtracts off in the calculation of the sample variance and disappears from view. This is another reason why due diligence is required for good science, and why lapses induce a distrust in statistical arguments. We should also mention that we have considered only additive systematic errors. In some contexts, the errors may be multiplicative, which generally complicates the analysis because the Central Limit Theorem does not come into play. If the relative errors are small, however, a perturbation formalism can usually be made to recast the errors as additive, and if the relative errors are large, unfortunately one is on thin ice and has upstream problems to solve.

4.8 The Parameter Refinement Theorem

In the previous section we showed how to average measurements with correlated Gaussian noise. Averaging noisy measurements is also called "parameter refinement", since knowledge of the parameter of interest is refined by combining the information in separate measurements. When the measurements are all represented in the form of probability density functions, the refinement procedure involves operating on those to obtain a new probability density function that defines the refined characterization of the parameter. By obtaining a refined density function, we maintain the optimal form of knowledge representation. In some cases, the process does not correspond to weighted averaging of measured observed positions, but for Gaussian errors, it does. In a large fraction of real-world cases, Gaussian distributions are involved exclusively, and the fact that the refined density function is Gaussian is not considered remarkable. The familiar inverse-variance weighting yields the new mean and uncertainty variance, and these define the new Gaussian distribution.
It is well known that convolutions of Gaussian density functions yield more Gaussian density functions, and convolution is the correct method for computing the density function for a random variable defined as the sum of two other independent random variables (see Appendix B, Equations B.17 through B.20). Equation 4.23 can be written
\[
p_1 = c_1 y_1 + c_2 y_2\,, \qquad
c_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\,, \qquad
c_2 = \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}
\tag{4.38}
\]
Since c1y1 and c2y2 each are just rescaled means of Gaussian random variables, the rescaled variables with those means are also Gaussian (see Appendix B, Equation B.2), and hence the density function for p1 is just the convolution of their density functions, making p1 also Gaussian. When y1 and y2 are correlated, a similar breakdown of Equation 4.21 into a linear combination of the form c1y1 + c2y2 is possible, but the situation is complicated by the fact that the joint density function for y1 and y2 no longer factors into separate marginal density functions, and the convolution integral no longer applies, but conceptually one can imagine using the fact that the correlation can be removed by an appropriate coordinate transformation without losing the Gaussian distribution, and the two resulting density functions can be convolved, and the refined density function will be Gaussian in those coordinates and hence Gaussian in any linearly related coordinates. Inverse-variance weighting is so widely used that it is possible to forget under what circumstances it is appropriate. One fundamental requirement is that we accept the hypothesis that the true values underlying the measurements are arbitrarily close to each other, i.e., we are averaging measurements of the same thing. If this is not the case, then we must consider seriously whether unweighted averaging would be more appropriate. As we will see below, the acceptance of that hypothesis enters explicitly in all parameter refinements based on optimal use of measurement density functions. We have seen above that inverse-variance weighting is appropriate for averaging measurements with Gaussian errors. This is implicit in our use of a chi-square minimization fit to a zero thorder polynomial, since chi-square random variables are formed from Gaussian random variables, and we will give a different derivation below. But is inverse-variance weighting appropriate for other kinds of distributions? Depending on how one categorizes them, it is fair to say that more than a hundred distinct random-number distributions have been defined and studied. Many of these are asymptotically Gaussian. For example, optimal averaging of measurements with Poisson noise should not employ inverse-variance weighting for small counts, but for very large counts, the Poisson distribution will be asymptotically Gaussian. The fact that the mean and variance remain equal, however, forces largecount Poisson values eligible for averaging not to be grossly different for most practical purposes, otherwise the statistical significance of the difference is very high, calling into question whether such numbers should be averaged at all, but if they are comparable, the Gaussian approximation does allow the uncertainty to be reduced even though the refined mean remains close to the measurements. Of course this eliminates the Poisson nature by making the refined mean and variance significantly unequal, so the Gaussian model becomes locked in at that point, but in fact the optimal refinement of two measurements with pure Poisson noise is not itself Poisson-distributed anyway, as we will see below (and also in Appendix I). In a remarkably simple derivation of inverse-variance weighting of two independent unbiased measurements, Arthur Gelb (1974) requires no reference to the form of the error distributions. The requirements for combining the two measured values are: (a.) we want to minimize the expected squared error in our refined estimate; (b.) 
we want our estimate to be unbiased, i.e., we want the error
to have an expectation value of zero; (c.) we want our estimator to have the form of a linear combination of the two measured values with no other terms. These conditions lead directly to an estimator based on inverse-variance weighting. His argument, in our notation, proceeds as follows. The two measured values are y1 and y2, each equal to a true value plus error ε1 and ε2, respectively:
\[
y_1 = \breve{y} + \varepsilon_1\,, \qquad
y_2 = \breve{y} + \varepsilon_2
\tag{4.39}
\]
where the breve accent indicates a true value, and we use a circumflex below to indicate our estimator, which we require to be of the form
\[
\hat{y} = c_1 y_1 + c_2 y_2
\tag{4.40}
\]
The estimation error is
\[
\varepsilon_{\hat{y}} = \hat{y} - \breve{y}
\tag{4.41}
\]
We require the expected value of this to be zero:
\[
\begin{aligned}
\left\langle \varepsilon_{\hat{y}} \right\rangle
&= \left\langle c_1\left(\breve{y} + \varepsilon_1\right) + c_2\left(\breve{y} + \varepsilon_2\right) - \breve{y} \right\rangle = 0 \\
&= \left\langle c_1\breve{y} + c_2\breve{y} - \breve{y} \right\rangle
+ c_1\left\langle \varepsilon_1 \right\rangle + c_2\left\langle \varepsilon_2 \right\rangle = 0
\end{aligned}
\tag{4.42}
\]
The first expectation value on the second line involves only constants, so the brackets can be removed, and the last two expectation values are zero because the measurements are unbiased. Assuming we do not expect the true value to be zero, this implies
\[
\left(c_1 + c_2\right)\breve{y} = \breve{y}
\;\;\Longrightarrow\;\;
c_1 + c_2 = 1
\;\;\Longrightarrow\;\;
c_2 = 1 - c_1
\tag{4.43}
\]
The expected squared estimation error is now
\[
\begin{aligned}
\sigma_{\hat{y}}^2 = \left\langle \varepsilon_{\hat{y}}^2 \right\rangle
&= \left\langle \left( c_1\varepsilon_1 + \left(1 - c_1\right)\varepsilon_2 \right)^2 \right\rangle \\
&= c_1^2\sigma_1^2 + \left(1 - c_1\right)^2\sigma_2^2
\end{aligned}
\tag{4.44}
\]
where σ1² and σ2² are the measurement uncertainty variances. Since we want to minimize the expected squared error in the estimator, we differentiate this with respect to c1, set the result to zero, and solve for c1:
\[
\begin{aligned}
2c_1\sigma_1^2 - 2\left(1 - c_1\right)\sigma_2^2 &= 0 \\
c_1\sigma_1^2 + c_1\sigma_2^2 - \sigma_2^2 &= 0 \\
c_1\left(\sigma_1^2 + \sigma_2^2\right) &= \sigma_2^2 \\
c_1 &= \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}
\end{aligned}
\tag{4.45}
\]
This implies the second line of Equation 4.38, and we have the Gaussian refinement formulas, Equation 4.23, including the expected squared error of the estimator, i.e., Equation 4.44 becomes
\[
\sigma_{\hat{y}}^2
= \left(\frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\right)^2 \sigma_1^2
+ \left(\frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\right)^2 \sigma_2^2
= \frac{\sigma_1^2\sigma_2^4 + \sigma_1^4\sigma_2^2}{\left(\sigma_1^2 + \sigma_2^2\right)^2}
= \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}
\tag{4.46}
\]
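A quick numerical check of this minimization is easy to write down; the following minimal sketch (assuming NumPy, with the hypothetical values σ1 = 2 and σ2 = 3) scans c1 on a grid and confirms Equations 4.45 and 4.46:

```python
import numpy as np

s1, s2 = 2.0, 3.0
c1 = np.linspace(0.0, 1.0, 100001)
err2 = c1**2 * s1**2 + (1.0 - c1)**2 * s2**2          # Eq. 4.44
print(c1[np.argmin(err2)], s2**2 / (s1**2 + s2**2))   # both approx. 0.6923 (Eq. 4.45)
print(err2.min(), s1**2 * s2**2 / (s1**2 + s2**2))    # both approx. 2.7692 (Eq. 4.46)
```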
The assumption of independent measurement errors is not crucial to this method. Keeping a covariance v12 adds a term 2c1(1-c1)v12 to the second line of Equation 4.44, and then the rest of the procedure leads to the expressions in Equation 4.21 instead of Equation 4.23. Gaussian refinement certainly meets the requirements levied above, but the general nature of the derivation suggests that errors need not be Gaussian in order for parameter refinement to boil down to inverse-variance-weighted averaging. Are there other distributions that satisfy those requirements? If not, then somewhere those requirements necessarily imply a Gaussian context. And yet they seem to capture completely the essentials of a general optimal refinement. The prime suspect is item (c.), that the estimator consist only of a linear combination of measurement values. As we will see below, that requirement is not met for uniformly distributed measurement errors; a third term is required. We will see below that it doesn't work when combining a measurement containing Gaussian noise with one containing a uniformly distributed error, nor for refining two CGU measurements (see Equation 4.9). The author is not aware of anyone checking the 100-plus distributions to see which ones satisfy the requirement that the optimal estimator consist solely of a linear combination of the two measurement values, so personal experience suggests that it is possible that only Gaussian measurements do so, and also measurements with Poisson noise for which the optimal estimate is an unweighted average. But in the Poisson case, accepting the hypothesis that two such measurements stem from the same mean simultaneously assumes that they have the same prior error variance, since that equals the mean, so inverse-variance-weighted averaging reduces to unweighted averaging. That is not to say that one may not apply inverse-variance-weighted averaging to non-Gaussian measurements with some justification other than consistency with the Parameter Refinement
Theorem, which will be introduced below, only that doing so would be less optimal and occasionally in serious error (e.g., with uniformly distributed errors in general). Inverse-variance weighting clearly reduces the influence of noisier measurements on the average, but unless the errors are Gaussian, one would be on very thin ice if one were to quote the reduced uncertainty in Equation 4.46. The usual derivation of the Gaussian averaging formulas (e.g., Equation 4.23) involves maximizing the joint density function for N independent measurements given the observed values yi, i = 1 to N, each with Gaussian noise σi. The joint density function is
\[
f_N = \prod_{i=1}^{N} \frac{e^{-\left(y - y_i\right)^2/\left(2\sigma_i^2\right)}}{\sqrt{2\pi}\,\sigma_i}
\tag{4.47}
\]
To find the value of y that maximizes this, we take the derivative with respect to y, set it to zero, and solve for y. For example, the case N = 2 yields the expression for p1 in Equation 4.23. This way of getting the formula for inverse-variance weighting does not directly provide the form of the density function for the estimator, but standard theory of functions of random variables can be used to show that it is Gaussian and that the variance is that shown in Equation 4.23. This maximum-likelihood approach can be used for other distributions, but since it does not automatically provide a density function for the estimator, a more general method was needed for the highly non-Gaussian position uncertainties of the IRAS Project (see section 4.5). This led to what we will call herein the Parameter Refinement Theorem (Fowler and Rolfe, 1982). To make this discussion as simple as possible, we will use the fact that the problem to be solved can be stated in terms of two one-dimensional continuous probability density functions representing measurements whose information is to be fused into a single refined characterization of the parameter measured. The domains of the random variables will be assumed to be the entire real line; applying limited domains is straightforward. The idea is to map all of the information in the two measurement density functions into the refined density function. Once that is done, the mean, variance, and higher moments can all be obtained from the refined density function. The process is linear in the sense that N measurements can be processed in N-1 pairwise operations whose order does not affect the final information content, although different orders may involve different levels of mathematical difficulty. Generalization to higher dimensions is straightforward. The joint density function for the two measurement random variables, x1 and x2, with means (i.e., observed values) x̄1 and x̄2, respectively, will be denoted f12(x1,x2), with marginal density functions f1(x1) and f2(x2), respectively. Other parameters of the measurement distributions will similarly be subscripted with 1 and 2, e.g., σ1 and σ2 for standard deviations, L1 and L2 as appropriate for the half widths of uniformly distributed random variables, etc. The refined random variable refers to the same physical quantity as x1 and x2 once we identify these with each other. It will be denoted x, with cumulative distribution F(x) and density function f(x), with other distribution parameters similarly unsubscripted. The distribution of x is completely determined by F(x), and so we proceed by constructing that. Let X be some value in the domain of x. Then we consider the probability that x < X. This is the probability that both x1 and x2 are less than X conditioned on x1 and x2 having values that are arbitrarily close to being equal. In other words, we accept the hypothesis that the true values underlying
the two measurements are arbitrarily close to each other, another way of saying that we take the two measurements to be observations of the same thing. By accepting this hypothesis, we are saying that the two measurements are apparitions of the same truth with different noise attached to it and that we are justified in mapping the two random variables x1 and x2 onto a single x axis. This condition must be expressed mathematically, i.e., we seek the probability that x1 < X and x2 < X given that |x1-x2| < ε/2, where ε > 0 can be made as small as we like. Specifically, we choose it to be small enough to neglect the difference between f12(x1,x2) and f12(x1±ε/2,x2±ε/2). The probability that x1 < X and x2 < X is the total probability mass in the joint density function for which that condition is satisfied. This involves a double integral of the joint density function. Adding the constraint that |x1-x2| < ε/2 takes the form of restricting the integration limits for either integral to the interval ±ε/2. This integrates the probability mass inside an arbitrarily thin slice of the two-dimensional joint density function centered on the line x1 = x2, thereby mapping the bivariate joint density function onto a single x axis, so that the result is no longer a joint density function, it is a function of a single random variable.
\[
P\left(x_1 < X \text{ and } x_2 < X \text{ with } \left|x_1 - x_2\right| < \epsilon/2\right)
= \int_{-\infty}^{X} \int_{x_1 - \epsilon/2}^{x_1 + \epsilon/2} f_{12}(x_1, x_2)\, dx_2\, dx_1
\approx \epsilon \int_{-\infty}^{X} f_{12}(x_1, x_1)\, dx_1
\tag{4.48}
\]
The probability that |x1-x2| < ε/2 is a similar integral except with the first upper limit infinity:
\[
P\left(\left|x_1 - x_2\right| < \epsilon/2\right)
= \int_{-\infty}^{\infty} \int_{x_1 - \epsilon/2}^{x_1 + \epsilon/2} f_{12}(x_1, x_2)\, dx_2\, dx_1
\approx \epsilon \int_{-\infty}^{\infty} f_{12}(x_1, x_1)\, dx_1
\tag{4.49}
\]
The first integral assumes that |x1-x2| < ε/2. To get F(X), we must condition the first probability on the second, i.e., divide the first integral by the second:
\[
F(X) = \frac{\displaystyle\int_{-\infty}^{X} f_{12}(x_1, x_1)\, dx_1}{\displaystyle\int_{-\infty}^{\infty} f_{12}(x_1, x_1)\, dx_1}
\tag{4.50}
\]
Differentiating this with respect to X yields the refined density function:
\[
f(x) = \frac{f_{12}(x, x)}{\displaystyle\int_{-\infty}^{\infty} f_{12}(x, x)\, dx}
\tag{4.51}
\]
Very commonly the measurement errors are independent, in which case the joint density function in Equation 4.48 factors into the marginal density functions, which leads to:
\[
f(x) = \frac{f_1(x)\, f_2(x)}{\displaystyle\int_{-\infty}^{\infty} f_1(x)\, f_2(x)\, dx}
\tag{4.52}
\]
i.e., the refined density function is just the renormalized product of the two measurement density functions. The product in the numerator reflects the “and” of two independent probabilistic hypotheses, and the renormalization stems directly from conditioning on the true values underlying the measurements being “the same” in the sense of their true values being arbitrarily close to each other. Given the dominant role played by Gaussian errors, we will look first at Equation 4.52 applied to two measurements with independent Gaussian noise:
\[
f(x) = \frac{\dfrac{e^{-\left(x - \bar{x}_1\right)^2/\left(2\sigma_1^2\right)}}{\sqrt{2\pi}\,\sigma_1}\;
             \dfrac{e^{-\left(x - \bar{x}_2\right)^2/\left(2\sigma_2^2\right)}}{\sqrt{2\pi}\,\sigma_2}}
            {\displaystyle\int_{-\infty}^{\infty}
             \frac{e^{-\left(x - \bar{x}_1\right)^2/\left(2\sigma_1^2\right)}}{\sqrt{2\pi}\,\sigma_1}\;
             \frac{e^{-\left(x - \bar{x}_2\right)^2/\left(2\sigma_2^2\right)}}{\sqrt{2\pi}\,\sigma_2}\, dx}
\tag{4.53}
\]
It is important that the two density functions be expressed in a manner that is compatible with mapping them onto the same x axis. Specifically, they cannot be written in a zero-mean form in general; the fact that the means may be different must be explicit. The numerator in Equation 4.53 can be expanded and simplified to obtain
\[
\text{Numerator} = \frac{e^{-\left[\sigma_1^2\left(x - \bar{x}_2\right)^2 + \sigma_2^2\left(x - \bar{x}_1\right)^2\right]/\left(2\sigma_1^2\sigma_2^2\right)}}{2\pi\,\sigma_1\sigma_2}
\tag{4.54}
\]
and the denominator is
\[
\text{Denominator} = \frac{e^{-\left(\bar{x}_1 - \bar{x}_2\right)^2/\left(2\left(\sigma_1^2 + \sigma_2^2\right)\right)}}{\sqrt{2\pi\left(\sigma_1^2 + \sigma_2^2\right)}}
\tag{4.55}
\]
Dividing the numerator by the denominator and simplifying results in
\[
f(x) = \frac{\exp\!\left\{-\dfrac{\left[\sigma_2^2\bar{x}_1 + \sigma_1^2\bar{x}_2 - x\left(\sigma_1^2 + \sigma_2^2\right)\right]^2}{2\,\sigma_1^2\sigma_2^2\left(\sigma_1^2 + \sigma_2^2\right)}\right\}}
{\sqrt{\dfrac{2\pi\,\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}}}
\tag{4.56}
\]
Dividing both the numerator and denominator of the exponential argument by (σ1² + σ2²)² and changing the sign inside the squared parentheses yields
\[
f(x) = \frac{\exp\!\left\{-\dfrac{\left[x - \dfrac{\sigma_2^2\bar{x}_1 + \sigma_1^2\bar{x}_2}{\sigma_1^2 + \sigma_2^2}\right]^2}{2\,\dfrac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}}\right\}}
{\sqrt{\dfrac{2\pi\,\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}}}
\tag{4.57}
\]
which can be recognized as Gaussian with mean and variance
\[
\bar{x} = \frac{\sigma_2^2\bar{x}_1 + \sigma_1^2\bar{x}_2}{\sigma_1^2 + \sigma_2^2}\,, \qquad
\sigma^2 = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2}
\tag{4.58}
\]
This shows that Equation 4.52 yields the expected results for Gaussian measurements. Figure 4-9 illustrates a case with the first Gaussian measurement having a mean of 10 and a standard deviation of 2, and the second has a mean of 15 and a standard deviation of 1. The dash-dot curve is the first measurement, the dashed curve is the second, and the solid filled curve is the refined density function, whose mean is 14 and whose standard deviation is 2/√5 ≈ 0.89.
Figure 4-9. Refinement of two measurements with Gaussian noise; the dash-dot curve has a mean of 10 and a standard deviation of 2; the dashed curve has a mean of 15 and a standard deviation of 1; the solid filled curve is the refined Gaussian, which has a mean of 14 and a standard deviation of approximately 0.89.
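Equation 4.52 is also easy to evaluate by brute force when closed forms are inconvenient. The following minimal sketch (assuming NumPy; the grid limits and spacing are illustrative assumptions) reproduces the Figure 4-9 example numerically:

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

x = np.linspace(0.0, 30.0, 30001)
f = gauss(x, 10.0, 2.0) * gauss(x, 15.0, 1.0)
f /= np.trapz(f, x)                              # renormalized product, Eq. 4.52

mean = np.trapz(x * f, x)
sigma = np.sqrt(np.trapz((x - mean)**2 * f, x))
print(mean, sigma)                               # approx. 14.0 and 0.894
```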
Another (simpler) example is the case of two measurements with uniformly distributed errors:
\[
f_1(x_1) = \begin{cases} \dfrac{1}{2L_1}\,, & \bar{x}_1 - L_1 \le x_1 \le \bar{x}_1 + L_1 \\[1ex] 0\,, & \text{otherwise} \end{cases}
\qquad
f_2(x_2) = \begin{cases} \dfrac{1}{2L_2}\,, & \bar{x}_2 - L_2 \le x_2 \le \bar{x}_2 + L_2 \\[1ex] 0\,, & \text{otherwise} \end{cases}
\tag{4.59}
\]
We assume that the overlap of these two Uniform distributions is greater than zero, otherwise they cannot possibly apply to the "same thing". With this assumption, applying Equation 4.52 results in
\[
f(x) = \begin{cases}
\dfrac{\dfrac{1}{4L_1L_2}}{\displaystyle\int_{\bar{x} - L}^{\bar{x} + L} \dfrac{1}{4L_1L_2}\, dx} = \dfrac{1}{2L}\,, & \bar{x} - L \le x \le \bar{x} + L \\[2ex]
0\,, & \text{otherwise}
\end{cases}
\]
\[
L = \frac{\min\left(\bar{x}_1 + L_1,\, \bar{x}_2 + L_2\right) - \max\left(\bar{x}_1 - L_1,\, \bar{x}_2 - L_2\right)}{2}\,, \qquad
\bar{x} = \frac{\min\left(\bar{x}_1 + L_1,\, \bar{x}_2 + L_2\right) + \max\left(\bar{x}_1 - L_1,\, \bar{x}_2 - L_2\right)}{2}
\tag{4.60}
\]
page 195
June 16, 2021
8:47
World Scientific Book - 10in x 7in
12468-resize
4.8 The Parameter Refinement Theorem
Figure 4-10. Refinement of two measurements with uniformly distributed errors. The dash-dot distribution has a mean of 10 and a half width of 5; the dashed distribution has a mean of 14 and a half width of 4. The refined distribution is the solid filled rectangle with a mean of 12.5 and a half width of 2.5.
An example is shown in Figure 4-10. The first measurement (dash-dot distribution) has a mean of 10 and a half width of 5. The second measurement (dashed distribution) has a mean of 14 and a half width of 4. The refined distribution (solid filled) has a mean of 12.5 and a half width of 2.5. In this case, the first measurement has the least upper bound and therefore defines the upper limit of the refined distribution, while the second measurement has the greatest lower bound and defines the refined lower limit. For this specific case, the second and third lines of Equation 4.60 can be written
\[
\bar{x} = \frac{\left(\bar{x}_1 + L_1\right) + \left(\bar{x}_2 - L_2\right)}{2}
= \frac{\bar{x}_1 + \bar{x}_2}{2} + \frac{L_1 - L_2}{2}\,, \qquad
L = \frac{\left(\bar{x}_1 + L_1\right) - \left(\bar{x}_2 - L_2\right)}{2}
\tag{4.61}
\]
Note that x̄ is not a linear combination of only x̄1 and x̄2; a third term is required, disqualifying this case from Gelb's derivation. Because of that third term, one cannot really call x̄ an average of x̄1 and x̄2. Note also that if the mean of the second measurement had been 10, its distribution would have been completely contained inside that of the first distribution, and the refined uncertainty would be unreduced relative to the less uncertain measurement. By the same token, if the second mean had been 18, the refined uncertainty would have been reduced by a factor of 8 relative to the less uncertain measurement. For independent Gaussian measurements, the refined uncertainty has no dependence on the means of the two measurements, and the best reduction possible is by a factor of 1/√2. To apply inverse-variance weighting to measurements with uniformly distributed errors could clearly
produce highly erroneous results. In principle, the refined Gaussian uncertainty is always less than the smaller measurement uncertainty, although the reduction may be negligible if the larger uncertainty is vastly larger than the smaller, so both Equation 4.58 and Equation 4.60 can result in essentially no reduction of uncertainty, but unlike Equation 4.58, Equation 4.60 can produce a refined uncertainty that is arbitrarily small relative to the smaller measurement uncertainty.
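The contrast can be made concrete with the Figure 4-10 values; the following minimal sketch (assuming NumPy) computes the refined Uniform of Equation 4.60 and, for comparison, what inverse-variance weighting would report if applied anyway (using the fact that the variance of a Uniform of half width L is L²/3):

```python
import numpy as np

def refine_uniform(x1, L1, x2, L2):
    """Refined Uniform distribution of Eq. 4.60 (assumes the two intervals overlap)."""
    hi = min(x1 + L1, x2 + L2)
    lo = max(x1 - L1, x2 - L2)
    return 0.5 * (hi + lo), 0.5 * (hi - lo)      # refined mean and half width

for x2 in (14.0, 18.0):                          # second-measurement means from the text
    mean, L = refine_uniform(10.0, 5.0, x2, 4.0)
    v1, v2 = 5.0**2 / 3.0, 4.0**2 / 3.0
    ivw_mean = (v2 * 10.0 + v1 * x2) / (v1 + v2)         # Eq. 4.23 applied anyway
    ivw_sigma = np.sqrt(v1 * v2 / (v1 + v2))
    print(x2, (mean, L / np.sqrt(3.0)), (ivw_mean, ivw_sigma))
```

For the second mean at 14, the refined standard deviation is 2.5/√3 ≈ 1.44 versus 1.80 from inverse-variance weighting; at 18 it is 0.5/√3 ≈ 0.29, a reduction that inverse-variance weighting cannot reproduce because its reported uncertainty is independent of the means.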
Figure 4-11. Refinement of a measurement with uniformly distributed errors (dash-dot, mean = 10, half width = 5) and a measurement with Gaussian errors (dashed, mean = 14, standard deviation = 2). The refined distribution (solid filled) is a truncated Gaussian renormalized to the uniform domain (5,15) and zero outside of that domain.
Of course, the two measurement distributions need not be of the same form. For example, Figure 4-11 shows the combination of a Uniform distribution and a Gaussian distribution. The first measurement’s density function (dash-dot) is Uniform between 5 and 15, and the second measurement’s density function is Gaussian (dashed) with a mean of 14 and a standard deviation of 2. The refined density function (solid filled) has a Gaussian shape clipped to the uniform domain, is normalized to that domain, and is zero outside of that domain. For this combination, we have in general
\[
f_1(x_1) = \begin{cases} \dfrac{1}{2L_1}\,, & \bar{x}_1 - L_1 \le x_1 \le \bar{x}_1 + L_1 \\[1ex] 0\,, & \text{otherwise} \end{cases}
\qquad
f_2(x_2) = \frac{e^{-\left(x_2 - \bar{x}_2\right)^2/\left(2\sigma_2^2\right)}}{\sqrt{2\pi}\,\sigma_2}
\tag{4.62}
\]
and Equation 4.52 takes the form
\[
f(x) = \begin{cases}
\dfrac{\dfrac{1}{2L_1}\,\dfrac{e^{-\left(x - \bar{x}_2\right)^2/\left(2\sigma_2^2\right)}}{\sqrt{2\pi}\,\sigma_2}}
      {\dfrac{1}{2L_1}\displaystyle\int_{\bar{x}_1 - L_1}^{\bar{x}_1 + L_1} \dfrac{e^{-\left(x - \bar{x}_2\right)^2/\left(2\sigma_2^2\right)}}{\sqrt{2\pi}\,\sigma_2}\, dx}
= \dfrac{e^{-\left(x - \bar{x}_2\right)^2/\left(2\sigma_2^2\right)}}{\displaystyle\int_{\bar{x}_1 - L_1}^{\bar{x}_1 + L_1} e^{-\left(x - \bar{x}_2\right)^2/\left(2\sigma_2^2\right)}\, dx}\,, & \bar{x}_1 - L_1 \le x \le \bar{x}_1 + L_1 \\[2ex]
0\,, & \text{otherwise}
\end{cases}
\tag{4.63}
\]
The integral in the denominator can be expressed in terms of error functions (Equation 4.8):
\[
f(x) = \begin{cases}
\dfrac{e^{-\left(x - \bar{x}_2\right)^2/\left(2\sigma_2^2\right)}}
{\sqrt{\dfrac{\pi}{2}}\,\sigma_2\left[\operatorname{erf}\!\left(\dfrac{\bar{x}_1 + L_1 - \bar{x}_2}{\sqrt{2}\,\sigma_2}\right) - \operatorname{erf}\!\left(\dfrac{\bar{x}_1 - L_1 - \bar{x}_2}{\sqrt{2}\,\sigma_2}\right)\right]}\,, & \bar{x}_1 - L_1 \le x \le \bar{x}_1 + L_1 \\[3ex]
0\,, & \text{otherwise}
\end{cases}
\tag{4.64}
\]
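As noted in the next paragraph, the refined moments of Equation 4.64 are most easily obtained by numerical quadrature; the following minimal sketch (assuming NumPy, using the Figure 4-11 values, with grid choices as illustrative assumptions) shows one way to do it:

```python
import numpy as np
from math import erf

# Refined density of Eq. 4.64 for the Figure 4-11 values: Uniform mean 10,
# half width 5; Gaussian mean 14, sigma 2.
x1bar, L1 = 10.0, 5.0
x2bar, s2 = 14.0, 2.0

x = np.linspace(x1bar - L1, x1bar + L1, 20001)
norm = np.sqrt(np.pi / 2.0) * s2 * (erf((x1bar + L1 - x2bar) / (np.sqrt(2.0) * s2))
                                    - erf((x1bar - L1 - x2bar) / (np.sqrt(2.0) * s2)))
f = np.exp(-(x - x2bar)**2 / (2.0 * s2**2)) / norm

mean = np.trapz(x * f, x)
sigma = np.sqrt(np.trapz((x - mean)**2 * f, x))
print(mean, sigma)    # roughly 13.0 and 1.4 for these values
```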
The mean and variance of the refined density function can be expressed in terms of exponentials and error functions, but the expressions are too complicated to present here, and in practice it may be advisable simply to obtain them via numerical quadrature. We should not ignore the distributions that motivated the Parameter Refinement Theorem in the first place, the CGU distributions of the IRAS Project (see Equation 4.9). For this example, we have
\[
f_1(x) = \frac{\operatorname{erf}\!\left(\dfrac{x - \bar{x}_1 + L_1}{\sqrt{2}\,\sigma_{G1}}\right) - \operatorname{erf}\!\left(\dfrac{x - \bar{x}_1 - L_1}{\sqrt{2}\,\sigma_{G1}}\right)}{4L_1}\,, \qquad
f_2(x) = \frac{\operatorname{erf}\!\left(\dfrac{x - \bar{x}_2 + L_2}{\sqrt{2}\,\sigma_{G2}}\right) - \operatorname{erf}\!\left(\dfrac{x - \bar{x}_2 - L_2}{\sqrt{2}\,\sigma_{G2}}\right)}{4L_2}
\tag{4.65}
\]
Inserting these into Equation 4.52 is straightforward, but there is no useful simplification of the resulting expressions. As with the Uniform-Gaussian case, expressions for the mean and variance can be obtained in terms of exponentials and error functions which are too complicated to be of interest here (for details, see Fowler and Rolfe, 1982). As may be obvious, unlike the Gaussian-Gaussian and Uniform-Uniform cases, the refined density function does not have the form of the measurement density functions. The IRAS Project employed numerical quadratures over a grid of (σG1, L1, σG2, L2, Δx) values
to fit the refined density functions to CGU shapes which could be used in downstream processing. These fits were very accurate when σG1 and σG2 were approximately equal, since this made the refined density function approximately symmetric. Frequently, however, σG1 and σG2 were different enough to produce significant asymmetry in the refined density function, and a conservative modification was developed to keep the CGU shape. This is illustrated in Figure 4-12, where the two measurement CGU density functions have very different Gaussian components. The first (dash-dot) has x̄1 = 120, σG1 = 4, and L1 = 90. The second (dashed) has x̄2 = 200, σG2 = 10, and L2 = 60. The refined density function (solid filled) has x̄ = 174.4 and σ = 21.5864. Note that this is not a Gaussian σ. The refined density function is neither Gaussian nor CGU, and x̄ and σ were determined via numerical quadrature. The conservative CGU approximation is shown as a dotted curve. It was designed to fit best near the lower values, since that is the neighborhood of the critical-region boundary for the downstream decisions, but the entire distribution plays a role in subsequent position refinement, and a small amount of information is sacrificed for mathematical tractability. The dotted curve has a mean of 170 and a total standard deviation of 24.2440, thus a position shift of 4.4 and increase in standard deviation of 2.66 relative to the true refined density function. The overall approximation error in uncertainty is about 12.3%, which is under our guideline of 20% maximum, but almost none of this error is in the part of the distribution critical for confirmation decisions, and the mean position offset is well covered by the slightly overestimated uncertainty.
Figure 4-12. Refinement of two measurements with CGU (Convolved Gaussian-Uniform) errors. The first measurement (dash-dot) has a mean of 120, a Gaussian σ of 4, and a Uniform half width of 90; the second measurement (dashed) has a mean of 200, a Gaussian σ of 10, and a Uniform half width of 60. The refined density function (solid filled) has a mean of 174.4 and a standard deviation of 21.5864, and it is not CGU because it is asymmetric due to the two CGU distributions having different Gaussian components. The dotted curve is a CGU approximation used as a stand-in for the refined density function so that the CGU formalism could be propagated downstream.
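The refined moments quoted in the caption can be checked by direct numerical quadrature of Equations 4.52 and 4.65; the following minimal sketch (assuming NumPy and SciPy are available; grid limits and spacing are illustrative assumptions) should land close to the quoted 174.4 and 21.59:

```python
import numpy as np
from scipy.special import erf

def cgu(x, xbar, sigma_g, L):
    """Convolved Gaussian-Uniform density, Eq. 4.65."""
    a = (x - xbar + L) / (np.sqrt(2.0) * sigma_g)
    b = (x - xbar - L) / (np.sqrt(2.0) * sigma_g)
    return (erf(a) - erf(b)) / (4.0 * L)

x = np.linspace(0.0, 400.0, 40001)
f = cgu(x, 120.0, 4.0, 90.0) * cgu(x, 200.0, 10.0, 60.0)
f /= np.trapz(f, x)                      # renormalized product, Eq. 4.52

mean = np.trapz(x * f, x)
sigma = np.sqrt(np.trapz((x - mean)**2 * f, x))
print(mean, sigma)
```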
We mentioned above that when the measurement noise is Poisson-distributed, the refined distribution is not itself Poisson, independently of whether the measurement values are large enough to invoke the Gaussian asymptote. This case is worth investigating, and it allows us to look at the Parameter Refinement Theorem applied to discrete random variables, for which we will switch notation from x to k and from x̄ to the traditional Poisson λ, but with the other conventions maintained. In this case, we are not dealing with probability density functions but rather discrete probability distributions. The Parameter Refinement Theorem can be derived in these terms, however, using summations in place of integrals, and we will assume below that this has been done. This will also be our first example using asymmetric distributions, and this fact introduces a new difficulty. Since the previous examples involved symmetric measurement density functions, we were able to use the nominal measurement values as the means of those distributions without introducing any bias. The idea was that the measurement result was the true value plus a zero-mean random fluctuation, so that the true value is at an offset from the mean in the measurement density function that is symmetric relative to the position that the measurement value occupies in the density function centered on the true value. This does not apply when the distribution is asymmetric, i.e., skewed. The true value may be the same distance from the measurement value as the latter is from the former, but the asymmetric distributions covering these points assign different probabilities to them in the two contexts. In other words, the probability of the measured value given the true distribution is generally not the same as the probability of the true value given the distribution of the same form but pegged to the measured value. For the previous symmetric distributions, the mean, mode, and median were all the same point, and we did not have to question which of these distribution parameters we were pegging to the measured value. When the distribution is skewed, these three aspects generally have different values, and we have to ask whether we want to take the measurement value to be the mean, the mode, the median, or perhaps something else. There is a significant possibility that we will introduce a bias in locating the measurement distribution on the random variable’s axis. A Poisson distribution is completely specified by its mean, λ, since the variance has the same value. The skewness is 1/√λ, and the excess kurtosis is 1/λ. The mean is a continuous variable greater than zero, but the random variable takes on only integer values greater than or equal to zero. In astronomy, the Poisson distribution occurs frequently as photometric noise when fluxes are measured in photon or photo-electron counts, and it is not likely to be part of a position uncertainty. The measurements that we want to refine are usually flux measurements, but this must apply to counts of dimensionless variables, not functions of them such as currents or voltages, since these will not be Poisson-distributed. Currents and voltages have dimensional units, and as we saw in Section 2.7, variables with such units cannot be Poisson-distributed, since that would allow the absurd situation in which we could change the S/N of a measurement by simply using different units.
The mere fact that the mean and variance are equal implies that the random variable must be dimensionless, since the mean and standard deviation must have the same dimensions, and in this case, so must the square of the standard deviation, the variance. The skewed nature of the Poisson distribution causes some problems in reporting the values of physical parameters in the form of a best estimate plus or minus some uncertainty, since the offset corresponding to the “plus” uncertainty is generally not equal to that of the “minus” uncertainty, depending on whether the “best estimate” is a mean, a most-likely value, or something else (see, e.g., D’Agostini, 2004). Since the Poisson distribution is both discrete and skewed, in general there is no point above the mean with the same probability or confidence interval from the mean as any point
below the mean. By ensuring that the complete distributions are fully specified, these ambiguities can be avoided. We also stated in Section 2.7 that if two Poisson measurement samples are drawn from the same population, then the unbiased estimator of the population mean is the unweighted average. Up to now we have not assumed that any two measurement errors are drawn from the same population. We have assumed only the forms of the distributions and that their means are arbitrarily close to being equal, but not any relationship between their variances. But because the variance of a Poisson distribution is equal to the mean, the hypothesis that the two measurements came from the same “true” value forces us to assume that the two Poisson-distributed measurements were drawn from the same population and therefore have the same prior error variance. Since we do not know the mean of that population, however, we have only two choices if the two measurement values are not equal: (a.) assign different means to the two measurement distributions; (b.) use the average to define the mean of both measurement distributions. Then by applying the Parameter Refinement Theorem, we hope to obtain a distribution that constrains the population mean with less uncertainty. We will see that there is not much difference between the two choices, and we will see that problems in assigning measurement distributions cause the Parameter Refinement Theorem result to be slightly less optimal than the known optimal method, unweighted averaging. But if we use the latter, we still need a probability distribution to go with it; this is developed in Appendix I. In the previous examples, the error distributions were provided by an error model based on the instrumental response function. We were not forced to estimate the error variance from a sample itself. Now that the error variance is determined by the distribution’s mean, we are in fact estimating the error variance from the sample, and therefore we incur the “uncertainty of the uncertainty” issue discussed in Section 2.11. Even though the refined variance will not be equal to the refined mean, it will still depend on that mean, as we will see below, and so it is uncertain itself. Whether some adjustment to the refined uncertainty is made to include this extra source of estimation error is up to the analyst, but in the author’s experience, most scientists simply quote the nominal uncertainty derived from the sample with no padding to account for the fact that all sample-derived parameters have their own uncertainties, including the estimated uncertainty of the estimated population mean. The two measurement probability distributions are:

$$ f_1(k_1) = \frac{e^{-\lambda_1}\lambda_1^{k_1}}{k_1!}, \qquad f_2(k_2) = \frac{e^{-\lambda_2}\lambda_2^{k_2}}{k_2!} \tag{4.66} $$
where λ1 and λ2 are based on the observed values. In general they cannot be equal to the observed values, because those could be zero, which is not allowed for the mean of a Poisson distribution, and in any case we anticipate assigning them some sort of bias-adjusted values. It is a property of Poisson distributions that when the mean is an integer, the distribution is bimodal in the sense that the mean and the value immediately below it have the same maximal probability. Since we can observe only integer draws, if we use an observed nonzero value as the mean, the Poisson distribution will be bimodal, and even if we assume that the drawn result is a most likely value, there is ambiguity about whether it is the mean or the value immediately below. An unbiased estimate would therefore be that
the observed value is a mode that is equally likely to be the higher or lower, so an offset of 0.5 might be best, i.e., λ1 = k1+½ and λ2 = k2+½. On average, this centers the assumed mean on the higher mode, which is the mean, although it simultaneously removes the bimodality of the assumed distribution, because the mean is no longer an integer. Since we are feeling our way here, we will test these ideas with Monte Carlo simulations, and indeed we will find these choices to be the best, i.e., the results come very close to the known optimal estimates using unweighted averages, the differences being negligible for most practical purposes unless the observed values imply a mean that is very small, e.g., less than 2, where the skewness problem is most severe and the difficulty in assigning probability distributions to measurements degrades the effectiveness of the Parameter Refinement Theorem. For discrete probability mass distributions, Equation 4.52 takes the form

$$ f(k) = \frac{f_1(k)\,f_2(k)}{\displaystyle\sum_{k=0}^{\infty} f_1(k)\,f_2(k)} \tag{4.67} $$
and plugging in Equation 4.66 yields

$$ f(k) = \frac{\dfrac{e^{-\lambda_1}\lambda_1^{k}}{k!}\,\dfrac{e^{-\lambda_2}\lambda_2^{k}}{k!}}{\displaystyle\sum_{k=0}^{\infty}\frac{e^{-\lambda_1}\lambda_1^{k}}{k!}\,\frac{e^{-\lambda_2}\lambda_2^{k}}{k!}}
= \frac{\dfrac{e^{-(\lambda_1+\lambda_2)}(\lambda_1\lambda_2)^{k}}{(k!)^2}}{\displaystyle\sum_{k=0}^{\infty}\frac{e^{-(\lambda_1+\lambda_2)}(\lambda_1\lambda_2)^{k}}{(k!)^2}}
= \frac{(\lambda_1\lambda_2)^{k}}{(k!)^2\, I_0\!\left(2\sqrt{\lambda_1\lambda_2}\right)} \tag{4.68} $$
where I0 is a modified Bessel function of the first kind, order zero. It is interesting to compare this to the optimal distribution derived in Appendix I, Equation I.7. At face value, the two bear no resemblance, but they plot almost on top of each other, at least for λ ≥ 2. Of course, Equation I.7 assumes that one knows the population mean. To apply it in parameter refinement, one would have to approximate the population mean with the average of the two measurements, which introduces just enough approximation error to offset the slight lack of optimality in Equation 4.68 using λ1 = k1+½ and λ2 = k2+½. Figure 4-13 compares Equation 4.68 to Equation I.7 for two values of the population mean, 10 and 25. Note that k in Equation 4.68 is defined on the set of nonnegative integers, whereas u in Equation I.7 describes the unweighted mean of two Poisson draws and is therefore defined on the set of nonnegative integers divided by 2, making the two equations not exactly comparable, because Equation I.7 distributes the total probability over twice as many points. When the population mean is not too small, one can compare the approximation f(k)/2 to Equation I.7's P(2u,λ), and this is what is done in Figure 4-13. Since the population mean is known to be correct as used with Equation I.7, matching values are used in Equation 4.68 in order to have a valid comparison. Any two measurement values with the proper average can be used with little difference in the result. For the population mean of 10, the measurements used with Equation 4.68 are k1 = 8 and k2 = 12. For the population mean of 25, the measurements used with Equation 4.68 are k1 = 22 and k2 = 28.
Figure 4-13. Comparison of Equation 4.68 (solid curve, Parameter Refinement Theorem applied to two measurements with Poisson noise) to Equation I.7 (dotted curve, optimal Poisson average). On the left the population mean is 10, and the two measurement values are 8 and 12. On the right, the population mean is 25, and the two measurement values are 22 and 28.
Equation I.7 has a mean of λ and a variance of λ/2. The mean and variance of Equation 4.68 are not at all algebraically straightforward but can be expressed in terms of modified Bessel functions of the first kind, orders 0 and 1. Since Equation I.7 is optimal and much simpler, and since application of the Parameter Refinement Theorem to Poisson measurements is of academic interest only, we will omit the rather complicated expression for the variance of the distribution in Equation 4.68, but for illustrative purposes, the comparatively simple mean can be obtained by taking the first moment:
$$ \bar{k} = \sum_{k=0}^{\infty} \frac{k\,(\lambda_1\lambda_2)^{k}}{(k!)^2\, I_0\!\left(2\sqrt{\lambda_1\lambda_2}\right)} = \frac{\sqrt{\lambda_1\lambda_2}\; I_1\!\left(2\sqrt{\lambda_1\lambda_2}\right)}{I_0\!\left(2\sqrt{\lambda_1\lambda_2}\right)} \tag{4.69} $$
For the population mean of 10 in Figure 4-13, k1 = 8 and k2 = 12, the mean and variance are 10.05457287 and 5.1555643, respectively, and for the population mean of 25, k1 = 22 and k2 = 28, the mean and variance are 25.07165532 and 12.662099, respectively. Thus Equation 4.68 misses the optimal solutions by 0.55% and 0.29% for the mean and by 1.54% and 0.65% for the standard deviations, respectively. The errors are generally smaller for larger means and closer observed values.
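These numbers are easy to check numerically. The sketch below (illustrative Python, not from the original text) uses λ = k + ½ as discussed above, evaluates the refined distribution of Equation 4.68 in log space for stability, and computes the mean both from the Bessel-function expression of Equation 4.69 and by direct summation, along with the variance by summation; the results should be close to the quoted 10.0546 and 5.156 for k1 = 8, k2 = 12.

```python
import numpy as np
from scipy.special import iv, gammaln  # modified Bessel functions of the first kind, log-Gamma

def refined_poisson_pmf(k, lam1, lam2):
    """Equation 4.68: (lam1*lam2)**k / ((k!)**2 * I0(2*sqrt(lam1*lam2)))."""
    z = 2.0 * np.sqrt(lam1 * lam2)
    log_num = k * np.log(lam1 * lam2) - 2.0 * gammaln(k + 1.0)
    return np.exp(log_num) / iv(0, z)

def refined_mean(lam1, lam2):
    """Equation 4.69: sqrt(lam1*lam2) * I1 / I0 evaluated at 2*sqrt(lam1*lam2)."""
    z = 2.0 * np.sqrt(lam1 * lam2)
    return np.sqrt(lam1 * lam2) * iv(1, z) / iv(0, z)

# Measurements k1 = 8 and k2 = 12 with the half-count offset discussed above.
lam1, lam2 = 8.0 + 0.5, 12.0 + 0.5
k = np.arange(0, 200)
pmf = refined_poisson_pmf(k, lam1, lam2)

mean_direct = np.sum(k * pmf)
variance = np.sum((k - mean_direct) ** 2 * pmf)
print(refined_mean(lam1, lam2), mean_direct, variance)
```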
Figure 4-14. Refinement of two measurements with Poisson noise. The dash-dot distribution has a mean and variance of 5; the dashed distribution has a mean and variance of 10. The refined distribution is the solid filled curve with a mean of 7.345 and a variance of 3.802.
Figure 4-14 shows an example of Equation 4.68 applied to two Poisson measurements, the first with λ1 = 5 and the second with λ2 = 10. Since these are somewhat small and separated counts, the Poisson skewness makes itself felt, and the refined mean and variance are 7.345 and 3.802, respectively, instead of 7.5 and 3.75. As stated above, many Monte Carlo tests were run for a variety of population means from 0.1 to 22, 65 in all, concentrated toward the lower end, each evaluating Equation 4.69 for 100 million pairs of random Poisson draws and computing error statistics. Offsets of 0.5 from the observed values were used except for very small population means, where the offset was not allowed to exceed the population mean by more than a factor of 1.5. The results in Figures 4-13 and 4-14 were typical in that deviations from optimality were apparent for population means less than 5, and tended to vanish for higher values. Figure 4-15 shows the average error in estimating the population mean as a function of the latter. Above λ = 5, the residual is at the level of the Monte Carlo fluctuations, several × 10-4. In general, Equation 4.68 yielded slightly smaller skewness and kurtosis values than Equation I.7. Although measurements completely dominated by Poisson noise do occur, in astronomy it is much more common for photon noise to be diluted with Gaussian errors due to calibration residuals for effects such as dark current and responsivity variations. These and the prevalence of much higher counts make the total noise more Gaussian.
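A reduced-scale version of that Monte Carlo test is sketched below (illustrative only; it uses far fewer than the 100 million pairs per population mean described above, and it omits the small-mean capping of the offset for brevity). It draws pairs of Poisson measurements, applies Equation 4.69 with the half-count offset, and compares the average estimate against the true population mean and against the unweighted average.

```python
import numpy as np
from scipy.special import iv

def eq_4_69_mean(k1, k2, offset=0.5):
    """Refined mean from Equation 4.69 with lambda_i = k_i + offset."""
    lam1, lam2 = k1 + offset, k2 + offset
    z = 2.0 * np.sqrt(lam1 * lam2)
    return np.sqrt(lam1 * lam2) * iv(1, z) / iv(0, z)

rng = np.random.default_rng(12468)      # seed chosen arbitrarily
n_pairs = 1_000_000                     # a sketch; the text used 1e8 pairs per mean

for true_mean in (0.5, 2.0, 5.0, 10.0):
    k1 = rng.poisson(true_mean, n_pairs)
    k2 = rng.poisson(true_mean, n_pairs)
    refined = eq_4_69_mean(k1, k2)
    unweighted = 0.5 * (k1 + k2)
    print(true_mean,
          np.mean(refined) - true_mean,      # residual of Equation 4.69
          np.mean(unweighted) - true_mean)   # residual of the optimal unweighted average
```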
Figure 4-15. Mean residual in Equation 4.69 as a function of Poisson population mean based on Monte Carlo tests using 100 million pairs of random draws each. Above λ = 5, the residual is at the level of the Monte Carlo fluctuations, several × 10⁻⁴.
We will consider one last case in this section, again of purely academic interest, because the error distributions do not apply to any real measurements of which the author is aware. The point is only to explore what was said earlier about “refinement” not necessarily producing reduced uncertainty relative to the least uncertain measurement. In principle, such a situation could arise when the measurement errors follow distributions that are extremely skewed in opposite directions. For this purpose we will use the following density functions based on chi-square with two degrees of freedom clipped to a finite domain:

$$ f_1(x_1) = \begin{cases} c\,x_1\,e^{-x_1/2}, & 0 \le x_1 \le 20 \\ 0, & \text{otherwise} \end{cases} \qquad
f_2(x_2) = \begin{cases} c\,(20-x_2)\,e^{-(20-x_2)/2}, & 0 \le x_2 \le 20 \\ 0, & \text{otherwise} \end{cases} \tag{4.70} $$

where the constant c adjusts the normalization to the domain (0, 20) with a value of 7.5×10⁻⁴. Using these in Equation 4.52 yields the “refined” distribution

$$ f(x) = \begin{cases} c'\,x\,(20-x), & 0 \le x \le 20 \\ 0, & \text{otherwise} \end{cases} \tag{4.71} $$

where c' is approximately 0.00378711. Figure 4-16 shows this example. The first measurement
(dash-dot) has a mean of 3.99 and a variance of 7.84. The second measurement has a mean of 16.01 and a variance of 7.84. The “refined” distribution (solid filled) has a mean of 10 and a variance of 20. Thus the refined standard deviation is almost 60% larger than that of either measurement. The two measurement means differ by about 3 root-sum-squared measurement uncertainties, so the hypothesis (that the same true value gave rise to both measurements) might be rejected, but if not, then the “refinement” consists not of reduced uncertainty but rather of a shocking realization that the measurements were extremely misleading.
Figure 4-16. Refinement of two measurements with pathological oppositely skewed error distributions. The dash-dot distribution has a mean of 3.99 and a variance of 7.84. The dashed distribution has a mean of 16.01 and a variance of 7.84. The refined distribution is the solid filled curve with a mean of 10 and a variance of 20, much larger than the variance of either measurement.
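The moments quoted for this example can be checked by direct quadrature. The sketch below is illustrative Python (not from the original text); it uses the clipped, oppositely skewed density forms as reconstructed in Equation 4.70, normalizes them numerically so the exact constants do not matter, and applies the renormalized product of Equation 4.52.

```python
import numpy as np

x = np.linspace(0.0, 20.0, 200001)
dx = x[1] - x[0]

def normalize(f):
    """Rescale a sampled density so it integrates to 1 on the grid."""
    return f / (np.sum(f) * dx)

def moments(f):
    m = np.sum(x * f) * dx
    v = np.sum((x - m) ** 2 * f) * dx
    return m, v

f1 = normalize(x * np.exp(-x / 2.0))                    # first measurement, skewed to the right
f2 = normalize((20.0 - x) * np.exp(-(20.0 - x) / 2.0))  # second measurement, the mirror image
f_refined = normalize(f1 * f2)                          # Equation 4.52; proportional to x*(20-x)

print(moments(f1))         # expected near (3.99, 7.84)
print(moments(f2))         # expected near (16.01, 7.84)
print(moments(f_refined))  # expected near (10, 20)
```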
4.9 Curve Fitting

The goals of science go beyond cataloging the constituents of the universe and their properties. How these constituents are distributed over space, how they interact, and how their behavior evolves over time are of paramount interest. One area of scientific activity in which pursuit of such knowledge takes place is curve fitting. A star’s electromagnetic energy within some wavelength band observed at various epochs may be fit to a curve that describes variability with respect to time. Positions at which an asteroid is observed at various epochs may be fit to a functional form that describes its orbit around the Sun. In some cases, the specific form defining the curve is constrained by theory, and in others it may be a simple approximation such as a polynomial of a sufficiently high order to accommodate the shapes observed in a plot of the measurements versus some independent parameter. The purpose may be to compute the value of a physical quantity such as the mass of the
electron, or it may be simply to obtain an approximation that may be used for interpolation or smoothing over noise fluctuations. The objectives and methods comprise a list too large to explore herein, where our concern is limited to understanding how such analyses are enhanced by the proper accounting of measurement errors. As we have seen, our knowledge concerning any measured aspect of a component of physical reality is not completely specified until we characterize its uncertainty. We have proposed that such characterization should take the form of a uniquely defined probability density function (or probability mass distribution in the case of discrete variables) to describe each measurement. This allows the effects of uncertainty to be propagated optimally in any use of the measurement. In this section, we are interested in the effects that emerge when using measurements to compute coefficients defining a functional relationship between observed quantities. To examine this topic deeply is beyond our scope, and so we will look at the most common form taken by this activity consisting of: (a.) solution for coefficients defining a curve via chi-square minimization using measurements whose errors may be taken as Gaussian; (b.) an independent variable whose sample values may be taken as perfectly known; (c.) a model whose basis functions depend only on the independent variable (i.e., not on the dependent variable, nor on the model coefficients). Under these conditions, the system of equations to be solved is linear. The formalism of chi-square minimization is developed in Appendix D, where the use of chisquare as a goodness-of-fit measure is also discussed, along with the distinction between it and the formal uncertainties of the model coefficients. It bears repeating that the latter may be very small even when the quality of the fit is very bad if the functional form used for fitting is inappropriate for the data at hand. The uncertainties of the coefficients, and the uncertainty of the model itself evaluated at some abscissa point, refer exclusively to how well a curve of the form used is determined by the data. Very small uncertainties mean only that one can be confident that the curve obtained is very close to the best possible for that functional form. That is no guarantee that the functional form bears any resemblance to the data relationships. A precisely determined straight line may be a very bad fit to parabolic data, and this fact will show up in the chi-square, not the model-coefficient uncertainties. Our point is that both of these indispensable measures are made possible by the inclusion of optimal uncertainty accounting. We will look closely at the error covariance matrix for the model coefficients and see that, under the conditions we are considering, the dependent variables appear nowhere in its elements. The fit uncertainty depends entirely on the uncertainties of the dependent variables and the sampling provided by the independent variables. In the notation of Appendix D, which we will use herein, the N observations of the dependent variable are yi with uncertainties σi, and the samples of the independent variable are xi, i = 1 to N. In this notation, the elements of the error covariance matrix for the model coefficients depend entirely on the N xi and σi values, and not at all on yi, hence they do not depend at all on how well the model matches the data. This is evident in Equations D.5 and D.9. 
Whether the model is appropriate for the data is revealed by chi-square, Equation D.2, the quantity that the fit is chosen to minimize, i.e., to make as good as possible. In section 4.7 we used chi-square minimization with a zeroth-order polynomial model to perform inverse-variance-weighted averaging of measurements with correlated errors. The next simplest model is a first-order polynomial, i.e., a straight line, which becomes appropriate when we do not believe that all of our measurements apply to the same underlying true value but rather describe the variation of the observed phenomenon as a function of some other quantity. It is always
recommended to plot the dependent data versus the independent data to see what sort of variation is apparent before choosing a specific model for the functional dependence. Even when one has a reason to expect a certain form of variation, it should be verified. If one is fitting data to a theoretical curve, then this step simply checks whether anything unexpected has happened. Very often a physical mechanism is not known, and so a basic model such as a polynomial, trigonometric, or exponential dependence may be suggested by the plot. In such cases, it is advisable to include the measurement uncertainties in the plot as one-sigma error bars in order to help avoid overfitting, i.e., choosing basis functions that fit the noise along with the signal. For example, the data points alone may appear to have some nonlinear variation, but it may turn out to be negligible compared to the error bars. We will look more closely at this below. Note that trigonometric and exponential models usually require fitting constants inside the arguments of transcendental functions and thus violate our constraint that the basis functions depend only on the independent variable. Such models generally involve nonlinear systems of equations that require iterative methods to solve and are beyond our scope, but the role played by random variables in characterizing uncertainty remains similar to what we are considering herein, and that is the central theme that concerns us. We should stress the distinction between nonlinear systems of equations and nonlinear models. Sometimes the phrase “linear least squares” is interpreted to mean “fitting data to a straight line”; this is ambiguous at best. Nonlinear models may be used in linear least squares algorithms as long as the models are linear in the model coefficients. The basis functions need not be linear in the independent variable, as we will see below when we consider a quadratic model. Equation D.1 (p. 434) expresses what is needed to generate a linear system of equations whose solution yields the model coefficients: a model composed of a sum of terms, each containing a coefficient multiplied by a basis function that depends on the independent variable but not the dependent variable and not the model coefficients, and the value of the independent variable may be considered error-free. To illustrate chi-square minimization with a straight-line model we will use the following data, which were simulated by adding Gaussian noise to the “true” straight line y = 0.4x + 10. In order to make the errors visible in plots, somewhat large measurement errors were used, and the sample size was limited to 10 (the latter keeps the fitting uncertainty from becoming too small to see easily). The xi values are in arbitrary units and are evenly spaced between 15 and 60. Each measurement error was drawn from a zero-mean Gaussian population with σi = 3+εi, where εi is drawn from a zero-mean Gaussian population with σε = 0.1 in order to get some variation in uncertainty. The following points were generated. xi
xi      yi          σi
15      15.91610    2.97667
20      18.33959    3.15813
25      18.61320    2.95540
30      24.86969    3.10864
35      23.94686    2.89100
40      21.77556    2.84877
45      31.40502    2.87415
50      28.36409    2.94364
55      36.93295    3.11954
60      34.02322    2.99718
Figure 4-17 The filled circles with vertical error bars are ten data points fit to a straight line (solid). The dotted line is the truth from which the data points were generated with simulated errors. The two dashed lines are the one-sigma uncertainty envelope of the fit line. The fit line has a positive slope error and a negative intercept error, i.e., these errors are negatively correlated. The dot-dash curve is a 9th-order polynomial fit to the same points to illustrate drastic over-fitting, i.e., fitting the noise in addition to the signal.
These points are plotted as solid circles with vertical error bars in Figure 4-17. The “truth” is shown as a dotted line. The fit is shown as a solid line. It can be seen that its slope is a bit too large and its intercept is a bit too small, i.e., the errors are negatively correlated. This follows from the facts that the default location of the intercept is at the origin, x = 0, that all the data points are on the positive side of the origin, and that straight-line fits tend to pass close to the center of the data-point scatter, causing slope errors to rotate the line a bit about that center. A positive slope error rotates the line downward at the intercept location, causing a negative intercept error, hence these two errors
are negatively correlated. This can be changed by choosing a different intercept location. For example, at the right end of the plot, the errors are positively correlated. The intercept can be located at any desired abscissa value x0 by simply using x-x0 instead of x in the fitting process. We can do this explicitly in Equation D.1, which for this first-order polynomial model becomes
$$ y = p_1 + p_2\,(x - x_0) \tag{4.72} $$
Then the elements of the coefficient matrix A in Equation D.5 (p. 435) become
$$ A = \begin{pmatrix} \displaystyle\sum_{i=1}^{N} \frac{1}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N} \frac{x_i - x_0}{\sigma_i^2} \\[2ex] \displaystyle\sum_{i=1}^{N} \frac{x_i - x_0}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N} \frac{(x_i - x_0)^2}{\sigma_i^2} \end{pmatrix} \tag{4.73} $$
and its inverse, which is the error covariance matrix for the model coefficients, is
$$ A^{-1} = \frac{1}{D} \begin{pmatrix} \displaystyle\sum_{i=1}^{N} \frac{(x_i - x_0)^2}{\sigma_i^2} & -\displaystyle\sum_{i=1}^{N} \frac{x_i - x_0}{\sigma_i^2} \\[2ex] -\displaystyle\sum_{i=1}^{N} \frac{x_i - x_0}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N} \frac{1}{\sigma_i^2} \end{pmatrix} \equiv \begin{pmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{pmatrix}, \qquad
D = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \sum_{i=1}^{N} \frac{(x_i - x_0)^2}{\sigma_i^2} - \left( \sum_{i=1}^{N} \frac{x_i - x_0}{\sigma_i^2} \right)^{\!2} \tag{4.74} $$
where the determinant D, like all determinants of error covariance matrices for real data, must be positive, since it is the product of the eigenvalues of a positive-definite matrix. In M-dimensional space (e.g., a model-coefficient error covariance matrix for a polynomial of order M-1), the square root of the determinant multiplied by π^{M/2}/Γ(1+M/2), where Γ is the Gamma function, is the volume of the error ellipsoid, another way to appreciate that the determinant must be positive. It is intuitively clear that it should not depend on the abscissa zero point, and indeed the appearance of x0 in the second line of Equation 4.74 is illusory, as may be shown by expanding and simplifying
$$ \begin{aligned}
D &= \sum_{i=1}^{N}\frac{1}{\sigma_i^2}\sum_{i=1}^{N}\frac{(x_i-x_0)^2}{\sigma_i^2} - \left(\sum_{i=1}^{N}\frac{x_i-x_0}{\sigma_i^2}\right)^{\!2} \\
&= \sum_{i=1}^{N}\frac{1}{\sigma_i^2}\left(\sum_{i=1}^{N}\frac{x_i^2}{\sigma_i^2} - 2x_0\sum_{i=1}^{N}\frac{x_i}{\sigma_i^2} + x_0^2\sum_{i=1}^{N}\frac{1}{\sigma_i^2}\right) - \left(\sum_{i=1}^{N}\frac{x_i}{\sigma_i^2} - x_0\sum_{i=1}^{N}\frac{1}{\sigma_i^2}\right)^{\!2} \\
&= \sum_{i=1}^{N}\frac{1}{\sigma_i^2}\sum_{i=1}^{N}\frac{x_i^2}{\sigma_i^2} - 2x_0\sum_{i=1}^{N}\frac{1}{\sigma_i^2}\sum_{i=1}^{N}\frac{x_i}{\sigma_i^2} + x_0^2\left(\sum_{i=1}^{N}\frac{1}{\sigma_i^2}\right)^{\!2} - \left(\sum_{i=1}^{N}\frac{x_i}{\sigma_i^2}\right)^{\!2} + 2x_0\sum_{i=1}^{N}\frac{1}{\sigma_i^2}\sum_{i=1}^{N}\frac{x_i}{\sigma_i^2} - x_0^2\left(\sum_{i=1}^{N}\frac{1}{\sigma_i^2}\right)^{\!2} \\
&= \sum_{i=1}^{N}\frac{1}{\sigma_i^2}\sum_{i=1}^{N}\frac{x_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{N}\frac{x_i}{\sigma_i^2}\right)^{\!2}
\end{aligned} \tag{4.75} $$
Setting the covariance element v12 to zero and solving for x0 results in

$$ \begin{aligned}
v_{12} &= -\frac{1}{D}\sum_{i=1}^{N}\frac{x_i-x_0}{\sigma_i^2} = 0 \\
\sum_{i=1}^{N}\frac{x_i}{\sigma_i^2} - x_0\sum_{i=1}^{N}\frac{1}{\sigma_i^2} &= 0 \\
x_0 &= \frac{\displaystyle\sum_{i=1}^{N}\frac{x_i}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{N}\frac{1}{\sigma_i^2}}
\end{aligned} \tag{4.76} $$
So x0 is just the inverse-variance-weighted average abscissa value and can be computed before the fitting is performed. Using it as the abscissa zero point in the model eliminates the correlation between the slope and the intercept. It can be easily shown that this also minimizes v11 in the process, the uncertainty variance of the intercept:
$$ \frac{\partial v_{11}}{\partial x_0} = \frac{\partial}{\partial x_0}\left(\frac{1}{D}\sum_{i=1}^{N}\frac{(x_i-x_0)^2}{\sigma_i^2}\right) = -\frac{2}{D}\sum_{i=1}^{N}\frac{x_i-x_0}{\sigma_i^2} = 0 \tag{4.77} $$
which is the same equation we solved in Equation 4.76. Note that the choice of x0 has no effect on the slope uncertainty variance, v22, which can be seen in Equation 4.74 to be independent of x0. Is there any point in doing this pre-processing computation? Not unless we are concerned that users of our model will ignore the covariance contribution to the model uncertainty. As long as everyone computes the model uncertainty according to Equation D.13 (p. 437), everyone will get the same correct result independently of whether we have adjusted the abscissa zero point. The above discussion was intended only to make clear the origin of the negative correlation between slope and intercept that happens when all the abscissa values are to the right of the intercept location, since that is the most common case. Equation 4.74 shows that if x0 = 0 and all xi are positive, then v12 will be negative, making the correlation coefficient ρ12 = v12/√(v11v22) negative. The fit shown in Figure 4-17 did not employ x0. The true intercept and slope are 10 and 0.4, respectively. The intercept and slope obtained are 9.044272868 and 0.4341002977, respectively. The model-coefficient error covariance matrix is

$$ \begin{pmatrix} 7.1673 & -0.16678 \\ -0.16678 & 0.0044304 \end{pmatrix} \tag{4.78} $$
so that ρ12 = -0.93591. The fit chi-square is 7.11363. With 8 degrees of freedom (10 data points minus two model coefficients) the Q value (high tail area) is 0.52442, and the reduced chi-square is 0.88920. So the chi-square goodness-of-fit measure indicates a completely plausible result, as does the model uncertainty, Equation D.13 for this case,
$$ \sigma_y = \sqrt{v_{11} + v_{22}\,x^2 + 2v_{12}\,x} = \sqrt{7.1673 + 0.0044304\,x^2 - 0.33356\,x} \tag{4.79} $$
which was used to generate the one-sigma uncertainty envelope shown as dashed curves in Figure 4-17. Note that this envelope has a minimal vertical width at x = 37.64339, which happens to be the value of x0 as given in Equation 4.76. If we had used this abscissa adjustment, we would have obtained the same fit (except as a function of x-x0 instead of simply x, hence with different model coefficients but equivalent to the same straight line) and the same model uncertainty, except the latter would have been expressed as the algebraic equivalent of Equation 4.79 with x0 = 37.64339,
$$ \sigma_y = \sqrt{0.88928 + 0.0044304\,(x - x_0)^2} \tag{4.80} $$
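The numbers above can be reproduced with a straightforward chi-square minimization. The following is an illustrative Python sketch (not from the original text; all names are my own): it builds the matrices of Equations 4.73 and 4.74 from the ten simulated points, solves for the intercept and slope, and evaluates chi-square, its Q value, and the zero-correlation abscissa x0 of Equation 4.76.

```python
import numpy as np
from scipy.stats import chi2

# The ten simulated points tabulated above.
xi = np.array([15, 20, 25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
yi = np.array([15.91610, 18.33959, 18.61320, 24.86969, 23.94686,
               21.77556, 31.40502, 28.36409, 36.93295, 34.02322])
si = np.array([2.97667, 3.15813, 2.95540, 3.10864, 2.89100,
               2.84877, 2.87415, 2.94364, 3.11954, 2.99718])

w = 1.0 / si**2
B = np.vstack([np.ones_like(xi), xi]).T          # basis functions 1 and x (x0 = 0)
A = B.T @ (w[:, None] * B)                       # coefficient matrix, Equation 4.73
b = B.T @ (w * yi)
p = np.linalg.solve(A, b)                        # intercept and slope
cov = np.linalg.inv(A)                           # error covariance matrix, Equation 4.74

chisq = np.sum(w * (yi - B @ p) ** 2)
dof = len(xi) - len(p)
q_value = chi2.sf(chisq, dof)                    # high-tail area Q
x0 = np.sum(w * xi) / np.sum(w)                  # Equation 4.76

print(p)                            # expected near (9.0443, 0.43410)
print(cov)                          # expected near [[7.1673, -0.16678], [-0.16678, 0.0044304]]
print(chisq, chisq / dof, q_value)  # expected near 7.114, 0.8892, 0.5244
print(x0)                           # expected near 37.643

# Model uncertainty at abscissa x (Equation 4.79 / D.13):
x = np.linspace(10.0, 65.0, 5)
sigma_y = np.sqrt(cov[0, 0] + cov[1, 1] * x**2 + 2.0 * cov[0, 1] * x)
print(sigma_y)
```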
With x0 = 0, Equation 4.74 shows that since v12 is linear in the xi, if they are all positive, then v12 will be negative. Similarly, if the xi are all negative, v12 will be positive. This is why we said in section 4.7 that not all correlated errors are properly viewed as “systematic” errors: if a simple change of abscissa zero point can remove or reverse the correlation, then the nature of the error is too different from that of a common noise source contaminating a set of measurements to dilute the term “systematic” to include all correlated errors. Although a coordinate rotation of the error covariance matrix can always remove correlation even for those arising from truly systematic errors, as noted previously, such a rotation generally changes the physical interpretation of the quantities whose errors are described by the covariance matrix, whereas a translation of origin merely changes the way the parameter is quantified, not its physical interpretation. 200
Figure 4-17 also shows the result of maximal overfitting of data. The dash-dot curve is a 9 thorder polynomial, the highest that can be fit to 10 distinct data points without causing a singular coefficient matrix. Since this solves for as many model coefficients as data points, chi-square is no longer defined, because the number of degrees of freedom is zero. To within the limitations of computer precision, the curve passes through each data point but is clearly nonsense between data points, especially outside of the domain of the data. If one really wants a solution that passes through every point, then the uncertainties are irrelevant, and the solution could be obtained via ordinary least-squares fitting, which is equivalent to chi-square minimization with unit weights but does not deal with error propagation. With chi-square undefined, one cannot claim to have minimized it, although something is minimized, and if the uncertainties are all approximately equal, it is essentially proportional to the squared error that ordinary least squares minimizes. In any case, since “fitting” a polynomial of order M-1 to M data points is not really an optimization problem, it should probably just be treated as an ordinary linear system of M simultaneous equations, to which the least-squares algorithm is equivalent in this situation. If the purpose is to obtain an interpolating polynomial, clearly fitting all available points generally does not provide a useful result. Piecewise fitting with lower-order fits is usually a better approach. Furthermore, fitting high-order polynomials typically stresses computer precision unless extra measures are taken, so the results may not turn out well even when deliberately overfitting. Our point here is simply to illustrate the dangers of doing it unintentionally. Since overfitting makes the curve pass closer to every point, one might expect the model uncertainty to decrease, but in fact, the opposite happens, because with more coefficients to fit, the uncertainty that the solution is correct becomes larger rather than smaller, and the model uncertainty is not applicable only at the data points but rather over the entire x domain. It can be seen even in the 1st-order polynomial fit that the uncertainty is going to diverge as |x| becomes arbitrarily large. So overfitting and underfitting tend to have opposite effects on chi-square and model uncertainty. As the overfitting takes the chi-square degrees of freedom toward zero, chi-square becomes smaller, but the model uncertainty becomes larger, whereas underfitting increases chi-square and reduces model uncertainty. Before the degrees of freedom disappear altogether, increasing the polynomial order reduces both chi-square and the number of degrees of freedom, so that Q(χ2) tends to get smaller with more advanced overfitting, and the reduced chi-square (chi-square divided by the number of degrees of freedom) tends to enlarge. Both of these symptoms indicate an inferior fit. For the data in Figure 4-17, reduced chi-square is minimal for the 1st-order polynomial, and Q(χ2) is maximal, although the 8th-order polynomial comes very close. With only one degree of freedom left to keep chi-square legitimate, the chi-square measure approaches that of the proper model, but the model uncertainty remains much larger, suggesting that both indicators should be checked carefully. 
Figure 4-18 shows the RMS (root-mean-square) model uncertainty, χ2, reduced χ2, and Q(χ2) for polynomial fits of order L = 0 to 8 applied to the data in Figure 4-17, where L = M-1, and M is the number of terms in the polynomial, including the zeroth-order term (see Equation D.1, p. 434). These are labeled RMS(σy), χ2, χ2R, and Q(χ2) as functions of L, respectively. It can be seen that RMS(σy) increases monotonically with L. This shows once again that the formal model uncertainty applies only to how well determined the model is compared to all possible models of the same form, not to actual goodness-of-fit. The solution for the zeroth-order polynomial is y = 25.38527865, σy = 0.94301779, which look good at face value, but χ2 = 49.64762648, with
χ2R = 5.51640294 and Q(χ2) = 1.25468229×10⁻⁷, which are unequivocal condemnations of the fit. If these ten data points had really been drawn from a single true value with the uncertainties given, we would expect such a mutually inconsistent set of numbers to arise only about once in 8 million cases. Once we get to L = 1, χ2 becomes well behaved. It decreases slowly as we start to overfit with increasing L, but not enough to keep pace with the reducing degrees of freedom, so that χ2R slowly increases, and Q(χ2) slowly decreases, both signs that we are moving away from the optimal solution. These behavior patterns depend on the actual errors, so they cannot be counted on in all cases, but in general, if we find two solutions with about the same chi-square goodness-of-fit but significantly different model uncertainties, we should take the one with the lower model uncertainty.
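For reference, the tail areas quoted here follow directly from the chi-square survival function; a minimal check (illustrative, not from the original text):

```python
from scipy.stats import chi2

# Zeroth-order polynomial fit to the ten points of Figure 4-17:
chisq, dof = 49.64762648, 9            # 10 points minus 1 coefficient
print(chisq / dof)                     # reduced chi-square, about 5.516
print(chi2.sf(chisq, dof))             # Q, about 1.25e-7, i.e. one case in roughly 8 million
```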
Figure 4-18 Upper Left. RMS uncertainty for L-order polynomial fit to data in Figure 4-17, evaluated at the data points. Upper Right. Chi-square for L-order polynomial fit to data in Figure 4-17 Lower Left. Reduced chi-square for L-order polynomial fit to data in Figure 4-17 Lower Right. Q(χ2) for L-order polynomial fit to data in Figure 4-17
In order to see how much of this is peculiar to situations in which a 1st-order polynomial is appropriate, we will apply the same considerations to fitting quadratically distributed noisy data:
xi      yi           σi
15      1197.2472     97.6675
20       812.4531    115.8131
25       455.1683     95.5403
30       402.3427    110.8645
35       198.3623     89.1002
40        74.1362     84.8767
45       403.5611     87.4152
50       447.5579     94.3641
55       977.0338    111.9540
60      1200.7727     99.7181
These points were generated by adding zero-mean Gaussian noise to y = 3000 − 150x + 2x² at the same xi values as before and are plotted as solid circles with vertical error bars in Figure 4-19. The uncertainty for each point was simulated as σi = 100+εi, where εi is drawn from a zero-mean Gaussian population with σε = 10, and the simulated error for that point was drawn from a zero-mean Gaussian population with a standard deviation of σi. The truth is plotted as the curve composed of filled rectangles that closely follows the solid curve, which is the model fit. The one-sigma uncertainty envelope of the model is shown as the two dashed lines bracketing the solid curve. To illustrate the effects of underfitting with a polynomial of too low an order, the dot-dash curve shows a 1st-order polynomial fit to these data. Its one-sigma uncertainty envelope is shown as the two dotted lines, clearly showing that the model uncertainty does not provide information on goodness-of-fit, only whether the fit is well determined for the given model. The underfitting errors, along with overfitting behavior, are shown in Figure 4-20, which corresponds to Figure 4-18. As before, the RMS model uncertainty increases monotonically with polynomial order L. The chi-square parameters all behave similarly to Figure 4-18 except that χ2 and χ2R are very high at both L = 0 and 1 instead of only 0, and χ2R at L = 8 is slightly better than at L = 2, but again we would prefer the L = 2 solution because of its much lower model uncertainty and higher Q(χ2).
Figure 4-19. The filled circles with vertical error bars are ten data points fit to a quadratic curve (solid). The rectangles lie on the true curve (not shown) from which the data points were generated with simulated errors. The two dashed lines are the one-sigma uncertainty envelope of the fit curve. The dot-dash curve is a 1st-order polynomial fit to the same points to illustrate underfitting with a polynomial of too low an order. The two dotted lines are the one-sigma uncertainty envelope of the 1st-order polynomial fit.
Figure 4-20 Upper Left. RMS uncertainty for L-order polynomial fit to data in Figure 4-19, evaluated at the data points. Upper Right. Chi-square for L-order polynomial fit to data in Figure 4-19 Lower Left. Reduced chi-square for L-order polynomial fit to data in Figure 4-19 Lower Right. Q(χ2) for L-order polynomial fit to data in Figure 4-19
One thing about the 2nd-order polynomial model that is very different from the 1st-order model is the algebraic complexity of the model error covariance matrix elements. The coefficient matrix is only slightly more complicated than Equation 4.73 (keeping x0 in the model explicitly):

$$ A = \begin{pmatrix}
\displaystyle\sum_{i=1}^{N}\frac{1}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N}\frac{x_i-x_0}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N}\frac{(x_i-x_0)^2}{\sigma_i^2} \\[2ex]
\displaystyle\sum_{i=1}^{N}\frac{x_i-x_0}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N}\frac{(x_i-x_0)^2}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N}\frac{(x_i-x_0)^3}{\sigma_i^2} \\[2ex]
\displaystyle\sum_{i=1}^{N}\frac{(x_i-x_0)^2}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N}\frac{(x_i-x_0)^3}{\sigma_i^2} & \displaystyle\sum_{i=1}^{N}\frac{(x_i-x_0)^4}{\sigma_i^2}
\end{pmatrix} \tag{4.81} $$
A pattern becomes visible here; it follows from Equation D.5 (p. 435) applied to polynomial models of the form
$$ y = \sum_{k=1}^{M} p_k\,(x - x_0)^{k-1} \tag{4.82} $$
The pattern can be seen in the upper-right-to-lower-left diagonal: the elements are all the same. This is true along all upper-right-to-lower-left diagonal directions, but only the middle one has enough elements to make it stand out. In order to reduce notational clutter, we define Sk as follows:
$$ S_k \equiv \sum_{i=1}^{N} \frac{(x_i - x_0)^k}{\sigma_i^2} \tag{4.83} $$
Then the coefficient matrix for a polynomial model of order L = M-1 is
$$ A = \begin{pmatrix}
S_0 & S_1 & S_2 & S_3 & \cdots & S_L \\
S_1 & S_2 & S_3 & S_4 & \cdots & S_{L+1} \\
S_2 & S_3 & S_4 & S_5 & \cdots & S_{L+2} \\
S_3 & S_4 & S_5 & S_6 & \cdots & S_{L+3} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
S_L & S_{L+1} & S_{L+2} & S_{L+3} & \cdots & S_{2L}
\end{pmatrix} \tag{4.84} $$
In this notation, the 1st-order polynomial model’s error covariance matrix in Equation 4.74 is
$$ A^{-1} = \frac{1}{D}\begin{pmatrix} S_2 & -S_1 \\ -S_1 & S_0 \end{pmatrix}, \qquad D = S_0 S_2 - S_1^2 \tag{4.85} $$
The corresponding equation for the 2nd-order polynomial is
$$ A^{-1} = \frac{1}{D}\begin{pmatrix}
S_2 S_4 - S_3^2 & -(S_1 S_4 - S_2 S_3) & S_1 S_3 - S_2^2 \\
-(S_1 S_4 - S_2 S_3) & S_0 S_4 - S_2^2 & -(S_0 S_3 - S_1 S_2) \\
S_1 S_3 - S_2^2 & -(S_0 S_3 - S_1 S_2) & S_0 S_2 - S_1^2
\end{pmatrix} \tag{4.86} $$

$$ D = S_0 S_2 S_4 + 2 S_1 S_2 S_3 - S_0 S_3^2 - S_1^2 S_4 - S_2^3 $$

If one wishes to choose x0 to make any given off-diagonal element zero, one has a much more difficult equation to solve. The level of complication increases very rapidly as the polynomial order increases. In general only one element (and its symmetry partner) can be forced to zero, but usually there is no need to do so. When all the abscissa values are positive, however, an interesting pattern is found: in the vij or ρij notation, we find that vij > 0 for i+j even, and vij < 0 for i+j odd. When all the abscissa values are negative, all vij > 0, i.e., all model coefficient errors are positively correlated. For example, the correlation matrix for the 8th-order polynomial fit to the data in Figure 4-19, for which all abscissa values are positive, is

$$ P = \begin{pmatrix}
 1 & -0.9996 & 0.9984 & -0.9963 & 0.9935 & -0.9900 & 0.9859 & -0.9814 & 0.9766 \\
-0.9996 & 1 & -0.9996 & 0.9983 & -0.9963 & 0.9935 & -0.9901 & 0.9862 & -0.9820 \\
 0.9984 & -0.9996 & 1 & -0.9996 & 0.9983 & -0.9963 & 0.9937 & -0.9905 & 0.9868 \\
-0.9963 & 0.9983 & -0.9996 & 1 & -0.9996 & 0.9984 & -0.9965 & 0.9940 & -0.9910 \\
 0.9935 & -0.9963 & 0.9983 & -0.9996 & 1 & -0.9996 & 0.9985 & -0.9967 & 0.9944 \\
-0.9900 & 0.9935 & -0.9963 & 0.9984 & -0.9996 & 1 & -0.9996 & 0.9986 & -0.9970 \\
 0.9859 & -0.9901 & 0.9937 & -0.9965 & 0.9985 & -0.9996 & 1 & -0.9997 & 0.9987 \\
-0.9814 & 0.9862 & -0.9905 & 0.9940 & -0.9967 & 0.9986 & -0.9997 & 1 & -0.9997 \\
 0.9766 & -0.9820 & 0.9868 & -0.9910 & 0.9944 & -0.9970 & 0.9987 & -0.9997 & 1
\end{pmatrix} \tag{4.87} $$
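A short sketch of how the Sk sums of Equation 4.83 assemble the coefficient matrix of Equation 4.84 and yield the coefficient covariance and correlation matrices is given below (illustrative Python, not from the original text; the abscissas and uncertainties are placeholders of roughly the right scale). A modest polynomial order is used because, as noted later in this section, high-order fits in raw powers of x stress double precision.

```python
import numpy as np

def polyfit_covariance(x, sigma, order, x0=0.0):
    """Coefficient matrix from the S_k sums (Equations 4.83-4.84), its inverse
    (the coefficient error covariance matrix), and the correlation matrix."""
    w = 1.0 / sigma**2
    s = [np.sum(w * (x - x0) ** k) for k in range(2 * order + 1)]
    A = np.array([[s[i + j] for j in range(order + 1)] for i in range(order + 1)])
    cov = np.linalg.inv(A)
    d = np.sqrt(np.diag(cov))
    corr = cov / np.outer(d, d)
    return A, cov, corr

# All-positive abscissas similar to the data sets above; uncertainties are placeholders.
x = np.arange(15.0, 61.0, 5.0)
sigma = np.full_like(x, 100.0)

A, cov, corr = polyfit_covariance(x, sigma, order=3)
print(np.sign(corr).astype(int))   # expected checkerboard: +1 where i+j is even, -1 where odd
```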
This pattern of alternating signs means that model coefficients for basis functions having an even power of x are all positively correlated with each other, and those for odd powers of x are also positively correlated, but errors corresponding to even powers of x are negatively correlated with those corresponding to odd powers of x. For example, if the intercept is too large, the coefficients for x, x³, x⁵, and x⁷ are probably too small, while those for x², x⁴, x⁶, and x⁸ are probably too large, etc. All the Sk values in Equation 4.84 are positive, since they are summations of positive numbers. If all the abscissa values are multiplied by -1, all the negative elements of the covariance and correlation matrices become positive, the elements of the coefficient matrix for which i+j is odd change from positive to negative, and the model coefficients for odd powers of x flip sign. If some other way of making all the abscissa values negative is applied, then the numerical values of these matrix
elements and model coefficients generally change in amplitude, but the pattern of sign flipping remains. These specific behavior patterns follow from the use of a polynomial model, so different models, as well as abscissa values that straddle the zero point, should not be expected to show the same behavior, but the possibility of some extremely positive or negative correlations should be anticipated. The data in this example are not at all peculiar compared to typical data used in curve fitting, so the extremely strong correlation magnitudes are to be expected in most cases when the abscissa values are all on the same side of the zero point. This makes it clear that when evaluating model uncertainty, the entire covariance matrix must be used as shown in Equation D.13 (p. 437) or algebraic equivalent. Given that the measurement errors are Gaussian to an acceptable approximation, the model error may also be taken as Gaussian, because (as shown in Appendix D), the Gaussian data errors propagate via linear combination to the model, so that the model error probability density function is the convolution of the intermediate Gaussian density functions, making it also Gaussian (see the discussion in Appendix B covering Equation B.2 and Equations B.17 through B.20, pp. 425 and 428). This allows the model to be used with a well-defined probability density function to characterize its information content. Even if the measurement errors deviate slightly from Gaussian behavior, since the model error is a linear combination of them, it is subject to the Central Limit Theorem and will generally be more Gaussian than the measurement errors. If some or all of the measurement errors are not well approximated as Gaussian, then one must proceed with caution. The right-hand side of Equation D.2 (p. 434), i.e., the quantity that is being minimized, is no longer chi-square in a rigorous sense, but if there are enough measurements available, asymptotically Gaussian behavior might emerge. If there is reason to expect significantly non-Gaussian or potentially pathological error distributions, then testing any models that come out of the chi-square minimization algorithm via Monte Carlo simulations is advisable for verification and possible exposure of biases that can be removed from the model or included in its uncertainty in an ad hoc fashion. In the author’s experience, such circumstances are rare as long as valid models are used, but further information on probing non-Gaussian confidence limits for model errors in such cases may be found in Press et al. (1986), Chapter 15.

4.10 Random-Walk Interpolation

In the previous section we explored connecting discrete data points to each other with curves that model the general behavior of the mechanisms that gave rise to those data points. These data points were measurement results that included complete specifications of the measurement uncertainties, and because the presence of measurement error was acknowledged, the curves did not have to pass exactly through each point. The models may be based on theoretical formulas or merely arbitrary functions that assume no more than necessary to connect the points as smoothly as possible without fitting the noise. The goal is to fit the statistically significant structure in the data points, not random fluctuations, so that the curve may be thought of as tracing the population from which the data points were drawn as a sample.
Clearly the statistical behavior of the measurement errors must be fairly accurately known before any judgment can be made about what parts of the structure are significant and what features are merely random fluctuations.
It is also necessary, furthermore, to be able to judge what sort of behavior is plausible between the data points in order to choose an appropriate mathematical model. If a theoretical model is available, then it is possible to determine what sampling is necessary to constrain its parameters to within a given accuracy. For example, if the data points are timed altitude measurements of a fly ball in a baseball game, then at least three points are needed at significantly different times, assuming we may ignore the effects of air resistance, and a few more are desirable for fitting a parabola within the measurement-error envelope. Once the parabola is determined, it can be assumed that the ball was close to the computed trajectory at the intermediate times that were not measured. The curve permits interpolation between altitude measurements, and the measurement uncertainties can be mapped into an altitude uncertainty for any interpolated point. But what if the same data points had to be analyzed without any knowledge of what they represent, just a half dozen or so (x,y) pairs? Depending on the actual sampling, a plot might suggest convincingly that a parabolic model was appropriate. This would probably be the case if the time samples were fairly uniformly distributed over the flight of the ball. But such uniform distribution is not entirely critical when one knows what the data mean. In principle, all the samples could be on the rising portion of the trajectory without necessarily rendering the solution completely useless, although this would certainly not be the way to design an experiment optimally. The inferior sampling would expand the fitting uncertainty and increase the actual error in the unsampled part of the domain, but it would still be plausible that the ball at all times was as close to the fitted curve as the model uncertainty indicates, i.e., within 3σ about 99.73% of the time, and knowing that the data represented the altitude of a fly ball would make it immediately clear that a 2 nd-order polynomial would be a good model to use. On the other hand, if no clue were provided about what the data mean, having only the rising trajectory sampled would create serious problems. Some curvature might be suggested by the data, but depending on the size of the uncertainties, whether it was significant might very well be in doubt, so a 1st-order polynomial model might seem to be the best way to avoid overfitting the data. To probe this, we will consider the following five data points which were generated in a manner similar to those of the previous section: xi
xi      yi          σi
1       87.1247      9.7667
2       153.8453    11.5813
3       168.1168     9.5540
4       157.6343    11.0864
5       76.8362      8.9100
The xi values are times in seconds measured from just before the instant the baseball left the bat at 0.21 sec. The yi values are the ball’s altitude in feet above the playing field at the corresponding times, measured by triangulation from simultaneous photographs taken from locations with a known baseline. The σi values are the altitude uncertainties due to synchronization and other measurement
errors, also in units of feet. The truth parabola is y = -23 + 133x - 22.6x². The σi values are drawn from a Gaussian population with a mean of 10 and a standard deviation of 1. The yi values are taken from the truth parabola at the corresponding times with errors drawn from a zero-mean Gaussian population with the corresponding σi. Figure 4-21 shows the five data points as solid circles with 1σ error bars. The true parabolic trajectory is shown as a dotted curve. The 2nd-order polynomial fit to the data points is shown as a solid curve bracketed by its 1σ uncertainty envelope indicated by the two dashed curves. The reduced chi-square for the fit is very good at 0.502796. Even without knowing what the data points represent, these results suggest very strongly that the quantity measured had a parabolic relationship between the ordinate and abscissa, and there would be little doubt that one could safely interpolate y between the xi values by using the model fit. Any doubt would be removed by knowing that these are measurements of the altitude of a fly ball at specific times.
Figure 4-21. Five measurements of the altitude y of a fly ball at time x shown as solid circles with error bars. The true parabolic trajectory is shown as a dotted curve. The 2nd-order polynomial fit to the data points is shown as a solid curve. Its 1σ uncertainty envelope is shown by the two dashed curves.
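The quadratic fit behind Figure 4-21 can be reproduced with the same chi-square machinery used for the straight-line example; the sketch below is illustrative Python (not from the original text) and should reproduce the quoted reduced chi-square of about 0.50.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([87.1247, 153.8453, 168.1168, 157.6343, 76.8362])
s = np.array([9.7667, 11.5813, 9.5540, 11.0864, 8.9100])

w = 1.0 / s**2
B = np.vstack([np.ones_like(x), x, x**2]).T      # basis functions 1, x, x^2
A = B.T @ (w[:, None] * B)
p = np.linalg.solve(A, B.T @ (w * y))            # fitted quadratic coefficients
cov = np.linalg.inv(A)                           # coefficient error covariance matrix

chisq = np.sum(w * (y - B @ p) ** 2)
dof = len(x) - len(p)                            # 5 points minus 3 coefficients
print(p, chisq / dof)                            # reduced chi-square, quoted as about 0.503

# One-sigma model uncertainty envelope (Equation D.13) over the plotted domain:
xx = np.linspace(0.0, 6.0, 61)
BB = np.vstack([np.ones_like(xx), xx, xx**2]).T
sigma_model = np.sqrt(np.einsum('ij,jk,ik->i', BB, cov, BB))
print(sigma_model.min(), sigma_model.max())
```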
Now we will look at the same physical phenomenon sampled less optimally, i.e., the five measurements come only from the first half of the trajectory. With the same truth parabola, we simulate independent measurements taken at half-second intervals, obtaining the following data:
xi      yi          σi
0.5     36.8108     9.0839
1.0     102.3116    9.4108
1.5     134.0844    8.7717
2.0     147.8670    9.3424
2.5     172.5775    9.4464
Figure 4-22 shows this situation with the same conventions as Figure 4-21. The sub-optimal sampling has clearly damaged the results. Even the rising portion of the trajectory is obviously less well determined because of lack of information about the falling portion. But at least the true trajectory is always within or close to 1σ of the model. Admittedly the model uncertainty becomes very coarse outside of the measurement domain. The goodness-of-fit figure of merit is not bad, with a reduced chi-square of 1.15486 and Q(χ2) = 0.31510. Note that these examples are essentially just one trial each in what should be large-scale Monte Carlo simulations. Other trials will have more or less luck giving plausible results, but these are fairly typical.
Figure 4-22. Five measurements of the altitude y of a fly ball at time x shown as solid circles with error bars. The true parabolic trajectory is shown as a dotted curve. The 2nd-order polynomial fit to the data points is shown as a solid curve. Its 1σ uncertainty envelope is shown by the two dashed curves.
If one had no knowledge of what these data represent, one could not have much confidence that a 2nd-order polynomial was appropriate, or even that one could use the model to interpolate between data points. Extrapolation would obviously be on very thin ice and probably not very useful because of the rapidly diverging model uncertainty. Clearly, good sampling and knowledge of the relevant physics both contribute strongly to confidence in the results of curve fitting. Suppose that we did not know the physical mechanism underlying the data but needed to fit a curve to them anyway. We might suspect that a cosine dependence could work, or an exponential decay. For simplicity, we will just examine a few other polynomials. For example, it is not visually obvious that the data don’t belong on a straight line, so we might try a 1st-order polynomial model. Using the same conventions as the last two figures, Figure 4-23A shows how this turns out. The chi-square goodness-of-fit measure warns us immediately that this fit is inferior to the 2nd-order polynomial. The reduced chi-square is 3.64779 despite the increased number of degrees of freedom, and Q(χ2) is alarmingly low at 0.01204. Even if we did not know that the data were from the rising portion of a fly ball trajectory, we would know to reject this model in favor of the quadratic fit.
Figure 4-23. Other polynomial fits to the data in Figure 4-22 with the same labeling conventions. A. 1st-order polynomial; reduced chi-square = 3.64779. B. 3rd-order polynomial; reduced chi-square = 0.03018. C. 4th-order polynomial; chi-square undefined because the number of degrees of freedom is zero. D. Zoomed-out view of the 4th-order polynomial fit.
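Continuing the illustrative sketch above (same hypothetical variable names), the comparison between polynomial orders can be reproduced by varying the degree; a 4th-order fit leaves zero degrees of freedom, so its chi-square statistic is undefined, as noted in the caption.

    # compare polynomial orders for the same five data points
    for deg in (1, 2, 3):
        coeffs = np.polyfit(x, y, deg, w=1.0/sig)
        chisq  = np.sum(((y - np.polyval(coeffs, x)) / sig)**2)
        dof    = len(x) - (deg + 1)
        print(deg, chisq / dof, chi2.sf(chisq, dof))
    # deg = 4 exactly interpolates the five points: dof = 0, chi-square undefined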
Figure 4-23B shows the result of fitting to a 3rd-order polynomial. The fit is better, too good, really, with a reduced chi-square value of 0.03018 and Q(χ²) = 0.86207, but we might be put off a bit by the way the formal uncertainty diverges so rapidly for x values above the data domain, and fitting with only one degree of freedom left over might give us pause. Still, ignorance of the physics involved would leave us open to considerable jeopardy. If we suspected that the y values should descend at some point past a peak, we might blame the 3rd-order polynomial for being unable to turn around and come down for larger x values, suggesting a 4th-order polynomial, the highest possible for five data points and already too high to allow chi-square to be defined. Nevertheless, a cautionary tale awaits this attempt. The fit is shown in Figure 4-23C: the curve fails to come down from the peak. The reason for that is visible in Figure 4-23D, a zoomed-out view of the fit: expecting the 4th-order polynomial to come down for larger x values assumed that it rose from the smaller x values to the left of the data domain, the way the 2nd-order polynomial does. But noise can confound such assumptions. The best fit comes down from infinity before getting to the data domain, and so it goes back up again afterwards. The model uncertainty also diverges outside of the data domain even faster than that of the 3rd-order polynomial.

The point here is that without foolproof sampling, we need to know the underlying physics, and even then we need decent sampling. This section is concerned with the fact that sometimes even professional scientists are presented with situations involving neither of those ingredients: knowledge of the value of some physical parameter is required, that parameter is measured, and at some later time it is measured again and is found to have changed, with no known explanation. When the value of that parameter at some time other than when it was measured is required, somehow it must be estimated and assigned an appropriate uncertainty.

An example can be found in the Spitzer Space Telescope mission. The IRAC (InfraRed Array Camera) imaging array pixels, like all pixels that operate in the infrared, are subject to detecting their own thermal radiation. On the frontiers of astronomy, detector materials typically involve newly discovered combinations of chemical elements whose solid-state physical properties are not 100% worked out and thoroughly understood. Such detectors are used because they have been found to have the sensitivity required for infrared observations at the desired frequencies, and their responsivities can be calibrated in the laboratory and used in an ad hoc fashion to measure infrared fluxes of celestial sources of astrophysical interest. Because they are sensitive to thermal radiation, including their own, they are usually operated at very low stabilized temperatures made possible by liquid helium or solid hydrogen cryostats. But operation at absolute zero temperature is not possible, and so these detectors have a residual response known as “dark current” that is measured by reading out their response when they have no input through the telescope system. This dark current is subtracted off of the response when observations of celestial sources are analyzed. Measuring dark current is not difficult, since it just requires measuring output when there is no input, and so dark-current calibration can be performed at various intervals during the lifetime of a project.
Despite being temperature-stabilized, however, it is not at all unusual for a dark current calibration to reveal that some pixels have changed since the last calibration by an amount that is statistically significant. Often there is no apparent reason for these changes, and they must be accepted as random drifts perhaps due to aging effects but for which there is no physical model. Thus it is not known whether the dark current at times between calibrations is at some intermediate value, i.e., whether linear interpolation is an acceptable approximation. The amplitude of the random drift varies from pixel to pixel, with some pixels changing
significantly while others do not, leaving the time scale of the drifts unknown. Long-baseline time histories of dark current typically do not reveal structure in the variation that can be correlated with other instrumental parameters, and in many cases one cannot wait years to do a complete data reprocessing anyway. The most common situation involves using the two nearest-in-time calibrations to estimate each pixel’s dark current at a given observation epoch, leaving some smooth form of interpolation as the only option, and occasionally extrapolation over a limited distance also becomes necessary. Another approach is to use the chronologically nearest calibration alone. In any of these cases, the question of how to assign an uncertainty must be addressed. Fortunately, dark current is generally small enough so that the residual error in subtracting it is a minor contributor to the overall measurement error, and so first-order corrections and very approximate uncertainties can be tolerated. Nevertheless, it is desirable to obtain the best estimation possible without investing resources out of proportion. There is not much that can be done with two measurements that doesn’t resemble linear interpolation, so the main issue is estimating uncertainty, and a secondary issue is what to do when extrapolation is needed.

It was for these purposes that Random-Walk Interpolation was invented (see, e.g., Moshir et al., 2003, and McCallon et al., 2007). To see why ordinary linear interpolation is insufficient, we will consider the following simulated data for a single pixel: dark-current calibrations were performed on days x1 = 5 and x2 = 20, yielding photo-electron counts y1 and y2 of 10±3 and 20±3, respectively, where the uncertainties are 1σ values for zero-mean independent Gaussian random variables. So after 15 days, the nominal dark current changed by more than 3σ relative to either measurement, and about 2.357σ for the difference of two independent measurements. This would be expected to occur at random in only about 1% of all cases, and so it is a matter of some concern. Although with megapixel imaging arrays, it can be expected to happen often, it still has to be taken seriously.

If we were to use ordinary linear interpolation, we would get what is shown in Figure 4-24, where the two measurements are shown with 1σ error bars. The thick solid line is the linear interpolation between them, and the thin solid line extensions are linear extrapolations. The dashed lines form the 1σ uncertainty envelope for the interpolation and extrapolations assuming independent measurements. This uncertainty is minimal at the center of the interpolated region, where the real uncertainty should be maximal. The usual uncertainty for linear interpolation is clearly incorrect in this case. It applies only when we accept the linear model as known to be correct, in which case the central portion of the interpolation indeed benefits maximally from the two measurements. The interpolation and its uncertainty obey the following formulas:
$$ f = \frac{x - x_1}{x_2 - x_1}, \qquad y = y_1 + f\,(y_2 - y_1), \qquad \sigma_y^2 = (1-f)^2\,\sigma_1^2 + f^2\,\sigma_2^2 + 2f(1-f)\,\rho\,\sigma_1\sigma_2 \qquad (4.88) $$

where ρ is the measurement error correlation coefficient. This uncertainty is found in the same manner as that used for Equations 4.39 through 4.44 (p. 177) with the same ε notation for error and overparenthesis for true values:
213
page 225
June 16, 2021
8:47
World Scientific Book - 10in x 7in
12468-resize
4.10 Random-Walk Interpolation
$$ y = \overparen{y} + \epsilon_y = \overparen{y}_1 + \epsilon_1 + f\left[(\overparen{y}_2 + \epsilon_2) - (\overparen{y}_1 + \epsilon_1)\right] $$
$$ \overparen{y} = \overparen{y}_1 + f\,(\overparen{y}_2 - \overparen{y}_1) $$
$$ \epsilon_y = \epsilon_1 + f\,(\epsilon_2 - \epsilon_1) = (1-f)\,\epsilon_1 + f\,\epsilon_2 \qquad (4.89) $$
where the second line was subtracted from the first to obtain the third, and we are considering additive noise. The uncertainty variance for the interpolated/extrapolated values is the expectation of the square of εy:
$$ \sigma_y^2 = \left\langle \epsilon_y^2 \right\rangle = (1-f)^2 \left\langle \epsilon_1^2 \right\rangle + f^2 \left\langle \epsilon_2^2 \right\rangle + 2f(1-f)\left\langle \epsilon_1 \epsilon_2 \right\rangle = (1-f)^2\,\sigma_1^2 + f^2\,\sigma_2^2 + 2f(1-f)\,\rho\,\sigma_1\sigma_2 \qquad (4.90) $$
Figure 4-24. Two dark-current measurements are shown with error bars. The thick solid line is the linear interpolation between them, and the thin solid line extensions are linear extrapolations. The dashed lines form the 1σ uncertainty envelope for the interpolation and extrapolations assuming independent measurements. The dot-dash envelope is for 90%-correlated measurement errors, and the dotted envelope is for -50%-correlated measurement errors.
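The following minimal Python sketch (assuming NumPy; function and variable names are illustrative) implements Equations 4.88 and 4.90 for the two dark-current calibrations and evaluates the 1σ envelope for the three correlation values shown in Figure 4-24.

    import numpy as np

    def linear_interp(x, x1, y1, s1, x2, y2, s2, rho=0.0):
        """Ordinary linear interpolation/extrapolation (Eq. 4.88) and its
        1-sigma uncertainty (Eq. 4.90) for measurement error correlation rho."""
        f   = (x - x1) / (x2 - x1)
        y   = y1 + f * (y2 - y1)
        var = (1-f)**2 * s1**2 + f**2 * s2**2 + 2*f*(1-f)*rho*s1*s2
        return y, np.sqrt(var)

    # dark-current calibrations: 10 +/- 3 on day 5, 20 +/- 3 on day 20
    days = np.array([5.0, 12.5, 20.0, 30.0])
    for rho in (0.0, 0.9, -0.5):
        yy, ss = linear_interp(days, 5.0, 10.0, 3.0, 20.0, 20.0, 3.0, rho)
        print("rho =", rho, " sigma_y =", np.round(ss, 3))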
The dashed-line uncertainty envelope in Figure 4-24 corresponds to ρ = 0. This is the usual uncertainty for linear interpolation, but if the measurements are known to be correlated, that can be taken into account by including the appropriate value for ρ. The dot-dash envelope corresponds to ρ = 0.9, very strong positive correlation. This envelope shows very little pinching in the center and very slow divergence for extrapolated values. For 100% correlation, the envelope would consist of straight lines parallel to the interpolation line. Even this is not realistic for our dark current drift
problem, because our intuition tells us clearly that the uncertainty must be larger than the measurement uncertainties for interpolated values as well as extrapolated values. For completeness of illustration, the envelope for ρ = -0.5 is included as a dotted line; this has more pinching in the center than the uncorrelated case. The closest thing to what we need is the 100%-correlated case, but not only is there no justification for assuming that, it does not even really satisfy our intuition. Nothing based on the hypothesis that the true value changes monotonically between the two measurements is plausible. What is left is the assumption that the true value followed a random walk that began at the first measurement and ended up on the second one. Given the time-reversal invariance of classical physics, we could also view the process as a random walk that began at the second measurement and ended up on the first. In between, the dark current may have wandered outside the range of the values measured and may have oscillated any number of times. All we know is what the two measurements tell us.

The Random-Walk Interpolation algorithm makes the minimal assumption that what was measured is the least remarkable result possible for a basic unbiased random walk in one dimension: starting from y1, the probability of ending up as far away as |y2-y1| is assumed to be 50%, and similarly for the reverse walk from y2 to y1. By “basic unbiased”, we mean a random walk in which each step is independent of any preceding step and equally likely to go in either direction. This implies that the uncertainty variance for the distance from the starting point is linear in the number of steps, or equivalently, the time, assuming a constant number of steps per unit time. Therefore the time coefficient VRW of the random-walk variance is
$$ V_{RW} = \frac{\left[\gamma\,(y_2 - y_1)\right]^2}{|x_2 - x_1|} \qquad (4.91) $$
where γ is a coefficient that converts the observed difference from 50% confidence to 1σ. For a Gaussian process, γ = 1/0.6745 = 1.4826. We assume that the distribution of random-walk distances can be approximated as Gaussian, because the random steps follow a binomial distribution, and we assume that enough steps are involved to put this into its Gaussian asymptote. The random walk is independent of the measurement error, so based on measurement yi, the total error variance of our knowledge of y at some time x is
$$ V_{Ti}(x) = \sigma_i^2 + V_{RW}\,|x - x_i| \qquad (4.92) $$
This applies to both measurements independently, and although we are accepting the hypothesis that the two measurements do not stem from the same true value (a situation whose damage we are attempting to contain), we also accept the hypothesis that once both measurements are attached to a random walk model, they do each contain information about the same true value of y at a given time x. Therefore according to the Parameter Refinement Theorem, we can combine the information in the two expanded measurement probability density functions, and since we are working within the Gaussian approximation, that means that we can compute the estimate of y at time x via inverse-variance-weighted averaging:
$$ \sigma_y^2(x) = \left[\frac{1}{V_{T1}(x)} + \frac{1}{V_{T2}(x)}\right]^{-1}, \qquad y(x) = \sigma_y^2(x)\left[\frac{y_1}{V_{T1}(x)} + \frac{y_2}{V_{T2}(x)}\right] \qquad (4.93) $$
Figure 4-25 shows the result of using this algorithm on the data in Figure 4-24. It can be seen that at each measurement time, the algorithm senses the other measurement enough to shift the interpolated value slightly in the other measurement’s direction, and there is a very small reduction in uncertainty that is not discernible in the figure. At the measurement times, the interpolated values for the dark current are 10.37845 and 19.62155 instead of the single-measurement values of 10 and 20, respectively, and the uncertainties are each 2.94269 instead of the single-measurement uncertainties of 3. These shifts result from the fact that at each measurement time, that measurement is refined via inverse-variance-weighted averaging with another of much larger uncertainty.
Figure 4-25. Random-Walk Interpolation for the dark current measurements of Figure 4-24. The two data points are shown as filled circles with 1σ measurement error bars. The thick solid line is the interpolation, and the thin solid curves are extrapolations, as in Figure 4-24, but for Random-Walk Interpolation instead of ordinary linear interpolation. The dashed curves are the 1σ interpolation uncertainty envelope.
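A minimal Python sketch of the algorithm described by Equations 4.91 through 4.93 is given below, assuming NumPy; it is illustrative code, not the mission software, and the function name and evaluation epochs are assumptions. For the data of Figure 4-25 it should reproduce, to rounding, the values quoted above (about 10.38 and 19.62 at the measurement times, with uncertainties near 2.94).

    import numpy as np

    GAMMA = 1.0 / 0.6745    # converts the 50% point of a Gaussian to 1 sigma

    def random_walk_interp(x, x1, y1, s1, x2, y2, s2, gamma=GAMMA):
        """Random-Walk Interpolation: estimate of y at x and its 1-sigma
        uncertainty from measurements y1 +/- s1 at x1 and y2 +/- s2 at x2."""
        v_rw = (gamma * (y2 - y1))**2 / abs(x2 - x1)   # Eq. 4.91
        vt1  = s1**2 + v_rw * np.abs(x - x1)           # Eq. 4.92, measurement 1
        vt2  = s2**2 + v_rw * np.abs(x - x2)           # Eq. 4.92, measurement 2
        var  = 1.0 / (1.0/vt1 + 1.0/vt2)               # Eq. 4.93
        return var * (y1/vt1 + y2/vt2), np.sqrt(var)

    # the example of Figure 4-25: 10 +/- 3 on day 5, 20 +/- 3 on day 20
    for day in (5.0, 12.5, 20.0, 35.0):
        est, unc = random_walk_interp(day, 5.0, 10.0, 3.0, 20.0, 20.0, 3.0)
        print(f"day {day:5.1f}:  y = {est:8.5f} +/- {unc:7.5f}")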
Between the two measurements, the expression for y(x) reduces algebraically to a linear interpolation. This can be derived by replacing |x-x1| and |x-x2| in Equation 4.92 for i=1 and i=2 with (x-x1) and (x2-x) respectively, which apply in the interval x1 ≤ x ≤ x2. The expression is somewhat complicated by the fact that the endpoints are not y1 and y2 but rather the inverse-variance-weighted
values mentioned in the previous paragraph, but the dependence on x is linear. Outside of the measurement interval, y(x) approaches an unweighted average of the two measurements nonlinearly. The main feature of this algorithm, however, is the interpolation uncertainty, which is larger inside the interpolated interval than at the measurements if the latter are significantly different. The two measurement uncertainties need not be equal. If they are not, then the maximum σy will be closer to the measurement with the larger uncertainty. Unlike linear interpolation, this algorithm never extrapolates y to infinity. The further one extrapolates in x, the closer the y estimate comes to an unweighted average of the two measurements, but the uncertainty does diverge. All of this behavior is what intuition suggests should happen: when a measured physical parameter changes value for an unknown reason, something close to linear interpolation is the best one can do for intermediate times, but the uncertainty must become larger, not smaller, in the interpolated region. Meanwhile, there is no plausible reason to expect that extrapolated values should become monotonically larger or smaller than the measured values, but the uncertainty in what those values are must grow indefinitely. As extrapolation proceeds to arbitrarily large distances from the measurements, the measurement uncertainty variances become negligible compared to the random-walk variances, and the difference between the two random-walk variances becomes negligible, so that the inverse-variance weights approach being equal. If the nominal measurement y values are even further apart, the interpolation uncertainty will enlarge accordingly. This is illustrated in Figure 4-26, where the nominal values are 8 and 22 instead of 10 and 20. Compared to Figure 4-25, the interpolation uncertainty can be seen to increase more inside the interpolation interval and also for the extrapolations.
Figure 4-26. Random-Walk Interpolation for data points more discrepant than those of Figures 4-24 and 4-25. The two data points are shown as filled circles with 1σ measurement error bars. The thick solid line is the interpolation, and the thin solid curves are extrapolations, as in Figure 4-25. The dashed curves are the 1σ interpolation uncertainty envelope. The larger discrepancy causes a faster random walk, resulting in larger uncertainties in the interpolated and extrapolated regions.
When the two measurements are not very different, the algorithm assumes that there is little or no random walk involved, and the interpolated values approach the usual inverse-variance-weighted average values, as can be seen by making VRW arbitrarily small in Equations 4.92 and 4.93. An example of a small but nonzero difference is shown in Figure 4-27. Here y1 = 14 and y2 = 16, with the same uncertainties as before, 3. The interpolation uncertainty is less than the measurement uncertainty, 2.44474 at the measurement times, and enlarges only to 2.58807 at the midpoint between the measurements. Although it is not obvious in the figure, the extrapolation is nonlinear, and its uncertainty envelope is diverging, although much more slowly than in the previous examples. If the measurement values had been equal, the interpolation and extrapolations would have passed through them, and the interpolation uncertainty envelope would consist of lines parallel to them at ±3/√2.
Figure 4-27. Random-Walk Interpolation for less discrepant data points, y1 = 14 and y2 = 16, with both measurement uncertainties equal to 3. The interpolation uncertainty is less than the measurement uncertainty, 2.44474 at the measurement times and enlarging only to 2.58807 at the midpoint between the measurements, which are shown as filled circles with 1σ error bars. The thick solid line is the interpolation, the thin solid curves are the extrapolations, and the dashed curves are the 1σ uncertainty envelope.
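Running the hypothetical random_walk_interp sketch given earlier on this less-discrepant case should yield uncertainties near 2.44 at the measurement times and about 2.59 at the midpoint, consistent with the values quoted above; the evaluation epochs below are illustrative.

    # less-discrepant case of Figure 4-27: y1 = 14, y2 = 16, both +/- 3
    for day in (5.0, 12.5, 20.0):
        est, unc = random_walk_interp(day, 5.0, 14.0, 3.0, 20.0, 16.0, 3.0)
        print(f"day {day:5.1f}:  y = {est:8.5f} +/- {unc:7.5f}")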
The algorithm can be extended to random vectors whose elements may be correlated and to correlation in the measurements at different times. At the user’s judgment, the interpretation of the observed difference in measured values as 50% probable can be adjusted to a higher or lower probability, depending on whether a more optimistic or conservative value is desired, by simply scaling γ down or up, respectively. Since this algorithm obviously includes some ad hoc elements, extending it to more than two measurements is not straightforward. For example, should it use separate VRW values between different point pairs? Should it perhaps average VRW over all point pairs? There is some risk of overestimating its ability to compensate for incomplete knowledge of the behavior of the observing hardware, so it should probably be kept as simple as possible. It is, after all, a damage-control technique, not anything with a claim to being omniscient. It does provide an avenue for attaching
plausible uncertainties to interpolated values that are probably not quite on target, but the goal is to assign uncertainties that are neither too large nor too small, since such extravagances generally harm either completeness or reliability downstream. We said earlier in Section 4.1, “In general, it is fair to require that observers understand their hardware sufficiently well to know how to model its workings, otherwise the result that comes out of a measurement is essentially useless”. We should reconcile the Random-Walk Interpolation algorithm with this statement. In that section, we also said “if the measurement itself cannot escape being highly uncertain, then the only way that it can be useful is if the uncertainty has been reliably estimated”. We cannot guarantee that the uncertainty provided by the Random-Walk Interpolation algorithm is “reliably estimated” any more than we can guarantee that its linear interpolation and nonlinear extrapolations are correct, but when used judiciously in a situation where its uncertainty plays a minor role, it can bridge the gap between advancing and retreating.

In the author’s opinion, Random-Walk Interpolation should seldom be needed and only for very minor effects for which coarse approximations can be tolerated. If one finds it necessary for dominant error sources, then some remedy is needed, or else substantial caveats must be added to any quoted results. Shooting in the dark at the frontiers of science can be a good way to discover phenomena worthy of subsequent follow-up with better-understood instruments, but the limits imposed by large uncertainties must be clear to all users of the results. On the other hand, as in the case of dark-current drifts in the Spitzer imaging array pixels, the fact that a complete understanding of all of the properties of newly discovered detector materials has not been achieved should not be allowed to slow the progress of science when the impacts of incomplete understanding are limited to minor effects whose uncertainties are quantifiable to an acceptable approximation. After all, it is incomplete understanding that makes the random-variable modeling of errors necessary in the first place. When the incompleteness results in uncertainties that are not small enough to ignore but are small enough to accept relatively coarse approximations (i.e., perhaps violating the 20% rule but without putting the total uncertainty in violation), then the Random-Walk Interpolation algorithm can facilitate the advance of science in the near term. The rigorous practice of scientific method does not imply that only perfection is acceptable, otherwise science could not advance at all.

4.11 Summary

As stated in the Preface, this book is necessarily limited to a subset of all the interesting and important aspects of how randomness affects our experience. This chapter has, therefore, been restricted to selected topics which, it is hoped, provide some illumination of many of the important ways in which the effects of randomness are taken into account in the process of converting well-designed measurements into scientific knowledge. In each case, the same epistemic randomness of classical physics is involved, but its facets are best viewed from different perspectives.
Sometimes it exhibits correlation, sometimes the correlation stems from systematic errors, sometimes the random ingredients in a measurement are due to instrumental effects, and sometimes they are due to the object being observed, such as the Poisson noise implicit in the photon streams from distant stars. All of these map into the uncertain components infiltrating scientific knowledge, the residual noise which we strive to make ever smaller relative to the signal. In resisting the temptation to expand this scope, the author trusts the reader not to interpret that
scope as anything approaching completeness. Scientific data analysis encompasses many more topics of great interest and usefulness, but generally such topics do not involve any essence of randomness fundamentally different from what suffuses the subjects covered. For example, the study of robust estimators is extremely important but not addressed herein. The tools it provides aid considerably in situations involving small samples or samples contaminated by data outliers, but their derivation is well within the scope of functions of random variables, with nothing novel about the intrinsic randomness in the origin of fluctuations and outliers. Typical instances in astronomy include the effects of stars on the background statistics for celestial images. Often these estimators depend on the behavior of medians of various distributions and rank statistics (how data within a sample are distributed relative to the mean, maximum, minimum). This is intimately related to outlier rejection in general. Similarly, we have not delved into estimation of time-dependent states of stochastic processes, Markov processes, Kalman and Wiener filtering, ergodicity, stationarity, among many other fascinating topics of great practical value but not essential to the contemplation of the fundamental concept of randomness. The interested reader is encouraged to follow up with independent study of these subjects. As stated in section 4.1, the examples in this chapter illustrate the sort of mathematics typical of the “hard” sciences, e.g., physics, astronomy, chemistry, geology, and biology. The representation of measurement results as probability density functions is at least as applicable to the so-called “soft” sciences, e.g., sociology, economics, and psychology. The fact that the latter tend to employ more approximate mathematical models emphasizes the need for reliable estimation of uncertainty, and in fact many of the most powerful methods of statistical analysis were developed therein. Throughout the entire spectrum of the sciences, making measurement results quantitative allows the maximum feasible rigor to be applied to their analysis and the greatest possible insight to be gained. In some cases, the nature of the subject material forces large uncertainties, but the principle that an uncertain result can be useful only if its uncertainty is well estimated applies universally.
Chapter 5 Quantum Mechanics

5.1 Interpreting Symbolic Models

What we call “Physics” today began as a part of what was called “Natural Philosophy” in earlier times. The highest academically endowed status of professional physicists today is the “Doctor of Philosophy” degree. And yet many modern physicists profess a certain distaste and mistrust for “Philosophy”. There is a broad range of opinion among theoretical physicists regarding the relationship between mathematical formalisms and the fundamental nature of objective physical reality, including denials that the latter is a meaningful concept. There is no escaping the fact that these are philosophical issues.

It is the author’s opinion that as physics became more and more specialized and separate from other subdivisions of philosophy, physicists became more focused on mathematical sophistication and less interested in such things as ontology versus contingency, theology, semiotics, rhetoric, linguistics, ethics, etc. Certainly the vast majority of physicists place great value on ethics, but they consider most ethical questions straightforward without requiring a lot of contemplation. They attempt to be semiotically immaculate in their mathematical manipulations, but they don’t need a philosophy course to help them do that. Questions perceived to be pointless are often dismissed as “just metaphysics” or “just semantics”.

There seems to be little consensus about the philosophical value of physical theories. Modern science has produced theories that have turned out to have great practical value, which is certainly a good thing, but perhaps it shouldn’t drown out the considerations that gave rise to science in the first place, primarily the need to contribute to illuminating the human experience. In many cases mathematically formulated theories may be telling us something about fundamental physical reality (ontology) if we can make our way through the connections between the symbolic formalisms and what they ultimately represent (semantics).

One of the most difficult decisions that scientists ever have to make is how literally to take the implications of what their mathematical models are telling them. Great skill is required to design and carry out measurements, and great ingenuity is needed to weave the results into a new or existing theoretical formalism. Sometimes the only apparent way to organize facts into a mathematical theory ends up producing a great challenge to interpretation. Acceptance at face value may seem absurd, and one must choose between rejecting a completely new intuition-expanding revelation versus embracing what may turn out to be a career-ending gaffe.

For example, it seems reasonable that in ancient times, the Earth was assumed to be the center of whatever existed. But there was also some kind of space above the Earth, and it had mysterious objects in it that could be seen to be moving around. The ancient Greeks and Phoenicians had noticed that, beyond seasonal variations, the angle toward the noonday Sun as seen in northern regions differed substantially from that at southern locations. The idea that the Earth was spherical developed slowly, and with it, the acceptance that the Earth is finite and therefore surrounded by some kind of void. Early attempts to understand the observable paths of the moon and planets across the sky led to the system in which celestial bodies moved on circular trajectories called epicycles which were centered on points moving uniformly along larger circular trajectories called deferents.
The virtue of this system was that it maintained the Earth’s position as the center of the Universe (although the actual centers of the deferents were generally somewhat off of the center of the Earth) and employed the geometrically perfect concept of circles. Actual geared mechanisms were constructed, such as the ancient Greek device known as the Antikythera Mechanism, which employed at least thirty gears and actually allowed the moon to move faster at perigee than at apogee. The philosophical underpinnings of the concept suggest that it was believed to describe an actual law defining how the Universe operates, but its origins precede the development of modern science by such a drastic extent that it is hard to determine what it would mean to “believe” in its physical reality. No one actually observed any gears in the sky, but presumably they could have been hidden by the many layers of rotating spheres that seemed to be up there. Independently of whether anyone granted ontologically real status to the epicycle mechanism, the basic idea persisted even in the work of Copernicus, whose main contribution was to replace the Earth with the Sun as the center of the Universe, although some early Greek and Islamic astronomers advocated similar systems. This helped a lot with the observational fact of retrograde apparent motions of the outer planets. Today it seems that explaining such retrograde motions by making the Earth just another planet going around the central Sun would have been obviously more plausible than an Earth-based epicycle system, but in fact the people of those eras had serious difficulties with the idea of a moving Earth. Some had theological objections. Others believed that if the Earth were moving, people would be swept off of its surface, and still others objected on the basis that they could not feel the Earth moving the way one feels a ship at sea moving or a horse-drawn carriage moving. This persisted at least as late as 1633 when the Roman Inquisition placed Galileo under house arrest on suspicion of heresy for advocating heliocentrism with its moving Earth. One must bear in mind that this was about half a century before Newton established his laws of motion, the first of which states that an object in uniform motion remains in uniform motion unless acted on by a force, and “force” was only a qualitative notion anyway, not yet a precisely defined physical parameter (technically, the Earth’s orbital motion about the Sun isn’t uniform, it takes place under the influence of an accelerating gravitational interaction, but the point is that it is possible for the Earth’s motion to go unnoticed). In the end, Copernicus was criticized for turning proper scientific method on its head: he started with observed facts and constructed a cosmological theory to explain them instead of starting from accepted cosmological principles and using those to guide his experiments. Given the trouble caused around this time by heliocentric theories, the fact that Copernicus was never made to suffer very much suggests that no one took his system seriously as anything more than a good mnemonic device for calculating celestial predictions. Although it did come under significant religious attack, it was nevertheless used by the Catholic Church in reforming the Julian calendar. 
Copernicus himself admitted that his system lacked any explanation for how things came to be as he described, and yet taking the Sun rather than the Earth to be the center of the Universe is an inescapably ontological statement. He may have had some reservations about the details because of the need to retain the less plausible notion of epicycles, which were required in order to keep perfect circles as the basic geometrical ingredient. Copernicus first published his heliocentric theory in his book De Revolutionibus Orbium Coelestium in Germany in 1543, three years before the birth of Tycho Brahe in Denmark. Brahe produced an astonishing wealth of astronomical observations without the benefit of a telescope, achieving star and planet position accuracies similar to those of IRAS (Infrared Astronomical
Satellite; see sections 4.5 and 4.8). His talents for theory and data analysis were more debatable. He modified the Copernican system to place the Earth back at the center while keeping the other planets still orbiting the Sun. Fortunately, Johannes Kepler came along soon afterward and made better use of the data accumulated by Brahe, for whom he worked briefly as an assistant in 1600 and 1601 following an active correspondence in the preceding years. Kepler was extremely gifted as a mathematician and made numerous important contributions. He was what Brahe’s data needed, a real theorist. After Brahe’s death in 1601 and a period of legal wrangling, Kepler obtained access to Brahe’s data and attempted to make his own modification of the Copernican system work with sufficient accuracy to be compatible with them. In a truly amazing feat of scientific data analysis, Kepler came to the realization in 1605 that all planets move in elliptical orbits with the Sun at one focus. This became his first law of planetary motion. Although he arrived earlier at another of his laws in 1602, he placed it as number two in his list: that a line segment joining a planet and the Sun sweeps out equal areas during equal intervals of time. The third is that the square of the orbital period of a planet is proportional to the cube of the semi-major axis of its orbit. These led to predictions of the transits of Mercury and Venus across the Sun that were seen as confirmations of Kepler’s theory after some observational difficulties were overcome.

In 1687, Isaac Newton published the first of three editions of his Philosophiæ Naturalis Principia Mathematica in which he included derivations of Kepler’s laws based on his own law of universal gravitation,

$$ F = \frac{G\,m_1 m_2}{r^2} \qquad (5.1) $$
where F is the magnitude of the force pulling together two point masses m1 and m2, r is the distance between them, and G is a constant determined by the strength of gravity and the units employed. The force on each mass is pointed directly at the other mass, and r is determined by the positions of the two masses at the same time, i.e., the force propagates instantaneously in a reference frame in which time is absolute. This law was successful at describing every observation to which it could be applied. The implication for elliptical orbits is limited to bound two-body systems. The presence of multiple bodies in the solar system produces deviations from perfect elliptical orbits, but it was well understood that detecting such deviations was beyond the observational art of Kepler’s day. Newton was never comfortable with the “action at a distance” implied by the ability of each mass to influence the other without physical contact. He granted that some unknown mechanism must account for it but felt that ignorance of the nature of that mechanism in no way refuted the law. The instantaneous nature of the force does not seem to have troubled him, nor the point masses. The former was probably consistent with mainstream physical intuition at the time, and he stated that the latter were a mathematical idealization. They could also be seen as simply a starting point for introducing mass density functions. That is not as trivial as it might sound, because Newton had to invent a physical definition for “mass”, as no rigorous prior definition existed. Early on, he struggled to formulate clear consistent definitions of “mass”, “weight”, “inertia”, and “density”. That these notions seem straightforward to us today reflects a part of the great debt owed to Sir Isaac. The fact that the gravitational forces point directly at the center of mass of the two bodies results in central force motion, for which angular momentum is conserved, and the force itself is
derivable as the gradient of a scalar potential that is not an explicit function of time, so that mechanical (i.e., kinetic plus potential) energy is also conserved. Kepler’s second law is actually a statement of the conservation of angular momentum, and so it follows directly from Newton’s law. For bound two-body systems with nonzero angular momentum, Newton’s law leads to elliptical (including circular) trajectories, giving Kepler’s first law. Newton also showed that the proportionality constant in Kepler’s third law was actually a function of the total mass of the two-body system (see, e.g., Danby, 1962, Chapter 6), hence different for different planets, but the solar mass dominates so strongly for any sun-planet combination that the third law is an excellent approximation for most purposes. In considering the ontological status of Kepler’s and Newton’s “laws”, one must take into account the fact that the words “law” and “theory” are commonly used with a variety of definitions, among which we find “hypothesis”, “rule”, “guide”, “principle”, and “excellent approximation in many cases”. There is also, of course, “law” in its strictest sense, that which must be obeyed. Whereas humans may break the “law”, presumably the Universe could not do so. From the earliest discussions of such things, the laws of Nature were perceived as divinely ordained. Mere insentient physical substances could not disobey the omnipotent authority behind such binding proclamations. Thus to discover a law of Nature was to lay bare the intimate workings of the Creation and perhaps something of the Mind of the Creator. Today most scientists perceive questions about the Mind of the Creator to be too permeated with pitfalls to justify investing any effort in them. Although it is common in cosmology to see the Big Bang referred to as “the Creation”, most scientists would quickly retreat from a literal interpretation of that word and even faster from whether there could be a Creation without a Creator. With or without the divine aspect, there is compelling evidence that Nature tends to do the same things under the same circumstances, hinting that even if the “laws” of Nature are simply a tautology reflecting the fact that Nature behaves the way Nature behaves, still there appear to be regularities in that behavior that can be codified mathematically and called laws. If such laws exist, it is difficult to imagine how insentient matter could choose to disobey them. Of course, one possibility that interests us is whether those laws allow for any random draws to enter the picture. The “laws” may still have a stochastic character that can be formalized. The question of whether scientists should be more respectful of “Philosophy” is really a question of degrees, since it can be readily discovered by experiment that attempts to be logically rigorous about the absolute nature of the Universe inevitably develop into labyrinthine convolutions of confusing notions from which it can be difficult to extricate oneself and get back to the original scientific goal. Before we can ask whether a particular symbol in a mathematical formalism refers to something that “really exists objectively in the physical Universe”, we need to define what we mean by “existence”, “objective”, and “physical Universe”. Attempts to do that can get tangled in circular arguments, infinite regress, and above all the inability of language to achieve absolute clarity. 
In the author’s opinion, pursuing science should involve thinking hard about what mathematical theories really mean; one should test the waters of rigorous philosophical grounding for one’s interpretation, but in the end, when it comes time to stop spinning the wheels and get back to studying Nature, one should be prepared to do what scientists do best: embrace what appeals most to intuition. It would be very nice to place physics on as firm a footing as that of a purely mathematical formal system of axioms and postulates, but the latter has an advantage in not being required to provide
axiom-independent definitions of the absolute nature of the elements to which the axioms refer. Those elements of the formal system are defined entirely by how they behave under the axioms. Using this approach for physical theories leaves open the possibility that the formalism doesn’t refer to anything outside of itself, and that is quite unsatisfying to many scientists. What seems to be left is to define terms as clearly as possible and then accept that when it comes to what is meant by an ontologically real element of the objective physical Universe, a certain amount of “you know what I mean” is ultimately inescapable. As a consolation prize, it can be argued that anyone who doesn’t know what you mean by something that can be grasped only as self-evident probably isn’t going to understand your theory anyway. In this book we accept the postulate that something truly exists outside of and independently of our conscious experience, an objectively real physical Universe that evolves according to strict laws that are knowable in principle. We posit that the nature of these laws is such that mathematical formalisms can be placed in one-to-one correspondence with them. By “outside of and independently of”, we do not mean “necessarily separate from”. We accept that as sentient beings we may or may not be components of this physical reality in whole or in part. We mean that although our existence may depend on the Universe, its existence and fundamental nature do not depend upon us. The fact that we exist means only that the Universe is capable of taking on a configuration consistent with that (i.e., in Philosophy-speak, the physical Universe is ontologically real, but our participation in it may be contingent). We take as a working hypothesis that the laws of the Universe can be discovered independently of whether we ourselves are completely subject to them. Since “consciousness” is as complete a mystery as any other, we do not rule out the possibility that it can act on the relatively external physical Universe in mysterious ways, e.g., as in forcing a physical system into a particular state. Of course every decision we make that leads to physical action involves consciousness acting on the physical Universe, and even these actions cannot be completely understood without a proper explanation of the nature of consciousness. The two examples may not be so different. To the author, the constraints in our working hypothesis are necessary in order for the study of physics to be worthwhile. The reader is not asked to accept them, only to understand that they form the background for interpreting the discussions below concerning attempts to comprehend how the elements of theories expressed as mathematical formalisms relate to ontologically real properties of the Universe. Many professional scientists do not accept them, some because they feel that the notion of a Universe independent of human thought is ill defined, others because they consider its laws unknowable or even nonexistent. What various scientists consider reasonable and meaningful should not be the final deciding factor in how one proceeds on one’s own, but neither should such opinions be dismissed out of hand. In support of our position, we offer the following statements made by Albert Einstein. “Certain it is that a conviction, akin to religious feeling, of the rationality or intelligibility of the world lies behind all scientific work of a higher order. ... 
This firm belief, a belief bound up with deep feeling, in a superior mind that reveals itself in the world of experience, represents my conception of God.” (1929, quoted in Ideas and Opinions, 1954 edition, p.262)
“I maintain that the cosmic religious feeling is the strongest and noblest motive for scientific research.” (1930, quoted in Ideas and Opinions, 1954 edition, p.39) “... science can only be created by those who are thoroughly imbued with the aspiration towards truth and understanding... To this there also belongs the faith in the possibility that the regulations valid for the world of existence are rational, that is, comprehensible to reason. I cannot conceive of a genuine scientist without that profound faith.” (1939, quoted in Out of My Later Years, 2005 edition, p.24) We supply these selected quotations with some risk of being accused of inconsistency, since we will stray from the path of strict adherence to Einstein’s views below. The idea is that we are starting from the same place as he did, but much has been learned since his passing, and we must make our own way through its implications. It is difficult if not impossible to compare scientists of different eras in terms of their powers of imagination and intuition, but it is the author’s opinion that beginning with 20th-century physicists, no one has surpassed Einstein in these qualities. That he could conceive of curved spacetime as the source of gravitational effects must be considered an unexcelled achievement of human creativity. It is therefore only with some disquiet and after extensive consideration that one should proceed to contradict his interpretation of any aspect of physics. But as a rebel and revolutionary himself, he would surely not approve of dogmatic devotion to his views, which were based on theories that he clearly stated he held to be incomplete. The purpose of this preamble is to establish the context within which we attempt to understand what scientific theories tell us about the behavior of the Universe at its most fundamental level. The particular focus in this book is whether nonepistemic randomness plays a role in the physical processes taking place at that level, and it is in that issue that we find our greatest departure from Einstein’s tenets. A secondary issue involves the nature of “spacetime”, specifically whether successful theories based on a four-dimensional physical continuum are actually asymptotic forms describing the behavior of discrete indivisible elements. Both of these dichotomies profoundly influence how ontologically fundamental theories should be constructed. It is probably not possible to reconstruct accurately the mindset of the inventors of the epicycle system. It seems reasonable to assume that they believed in the objective reality of what they were trying to model, but they may have considered the epicycle construction to be only a coarse description, not an exact representation of an actual geared mechanism in the sky, since such an apparatus would surely have added more mystery than it removed. If pressed, they may very well have said that the moon and planets move in the sky as if they were mounted on spheres whose motion resembled that of an appropriately geared machine. By contrast, Newton seems to have believed that his law of universal gravitation was an exact characterization of how the Universe must behave regarding the attraction between its component bodies. He did not say that celestial bodies behave as if they were under the influence of the force specified in Equation 5.1. In accepting that he had no explanation for action at a distance, he embraced a law that he did not fully understand. 
In one section of his book he states “It is enough that gravity does really exist and acts according to the laws I have explained, and that it abundantly serves to account for all the motions of celestial bodies.” He felt that the law applied exactly to objective reality, and the underlying mechanism responsible for it would eventually be discovered. The development of field theories soon followed, including attempts to revive ancient ideas
about an aether that filled all of space and mediated interactions like gravity and later electromagnetism, providing a medium for the propagation of light. Newton himself attempted to make an aether model work for gravity before abandoning it. He was well aware that the speed of light is finite, so the luminiferous aether apparently was not the transmitter for gravity, which had to propagate instantaneously for his law to work. If a finite propagation speed is assigned to gravity as expressed in Equation 5.1, then each body is attracted toward locations in other bodies’ pasts, the condition of central force motion is lost in general, and with it, conservation of angular momentum, and in general each body sees the others at r values corresponding to different times, invalidating the time-independent scalar potential and thereby losing conservation of mechanical energy. For example, in a bound two-body system with nonzero angular momentum, each body is pulled toward a position that the other occupied in the past, a direction constantly lagging behind that toward the instantaneous center of mass, and this creates a torque on the system that increases its angular momentum, adds mechanical energy, and generally causes the system to become asymptotically unbound. So Newton’s 1/r² dependence must have instantaneous propagation in order to succeed.

The “law” had numerous beautiful qualities such as a uniform spherical distribution of mass generating an external gravitational force equivalent to the mass all being concentrated in a point at the center, while the field internal to a uniform spherical shell cancels to zero throughout the inside. Infinite flat sheets of uniform density generate constant gravitational forces, i.e., the force does not depend on distance from the sheet. The law’s success in predicting and explaining gravitational interactions was thoroughly convincing, but the additional aesthetic appeal of its mathematical beauty appeared to confer upon it a divine seal of approval. This seemed clearly to be the way the Creator would do it. Besides its practical value, there was support for a claim of metaphysical motivation. And yet we now know that it was simply wrong in concept. It retains its engineering value because it remains the best way to solve most (i.e., nonrelativistic) gravitational problems, but it has no ontological value. There are no forces to propagate across distances instantaneously. It tells us nothing about the inner workings of the Universe. Indeed, gravitating bodies merely act as if Newton’s law were operating.

Nevertheless, it was a magnificent intellectual achievement, a vast improvement on Aristotle’s “law” of gravity (which held that heavy objects fall faster than light objects). It was correct in many aspects, such as that gravity not only causes objects to fall to the surface of the Earth but that it keeps the moon in orbit about the Earth and other planets in orbit about the Sun. It served as an extremely useful base camp from which to reconnoiter and prepare a campaign to achieve a superior understanding of gravity. The latter was accomplished by Albert Einstein in the early 20th century with his General Theory of Relativity, whose ontological content is entirely different from Newton’s, although it is not considered complete because of its exclusive involvement with gravitational effects.
The modern scientific consensus is that quantum mechanical phenomena need to be fused with gravitation to form a Theory of Quantum Gravity that can describe how all known physical interactions proceed from the same underlying reality (whether this will shed light on the nature of consciousness remains to be seen). But it is the author’s impression that most physicists today accept the spacetime of General Relativity to be a valid concept that needs only some refinement, not another complete paradigm shift. Still, the cautionary tale of Newtonian gravity must keep us vigilant regarding metaphysical guidelines. For example, while formulating General Relativity, Einstein indicated that he was guided by “Mach’s principle” (Einstein, 1922) that inertia, and hence inertial reference frames, are determined by the distribution of matter throughout the Universe. This implied that his field equations had no
solutions for a matter-free Universe. But Willem de Sitter (1917) had found such a solution. At first Einstein sought to invalidate it, but he eventually accepted it, and the role of Mach’s Principle in his thinking was, in modern parlance, deprecated. In his later years he said that he was no longer sure whether Mach’s principle had played any role in his thinking. In letters to colleagues (F. Pirani and D. Sciama, quoted in Pais, 1983, p.288), he said “As a matter of fact, one should no longer speak of Mach’s principle at all.” Even so, Misner et al. (1973, section 21.12) show that Einstein’s theory not only supports certain aspects of Mach’s Principle, it makes them more understandable. But Einstein himself apparently used the principle only as a logjam-breaking expedient that turned out to be a consequence of General Relativity rather than a foundational principle of it. So Einstein was human, and sometimes a redundant postulate can be a catalyst for a cognitive reaction that produces a valid result, a bit of conceptual scaffolding that can be removed later. By the same token, Einstein promoted the idea that physically meaningful statements can be made only about phenomena that can be observed. And yet he clearly considered strict determinism to be a hallmark of the way the Universe behaves. But how does one observe determinism? It would not be enough to observe a few physical processes that exhibited deterministic behavior. To establish the complete absence of random activity would require measuring the entire Universe with infinite precision and 100% accuracy, otherwise how could one rule out the possibility that sometimes events take place that do not completely follow deterministically from what has come before? Given that Einstein was devoted to logic, hence consistency, he must have viewed strict determinism as a metaphysical principle, thus not really immune to being jettisoned later (see the last paragraph of section 3.7), as Mach’s Principle was. But in this case, jettisoning strict determinism would not be merely the removal of scaffolding; its repercussions for ontology would be earth-shaking. In this vein, we cannot claim that postulating the existence of an objective physical reality external to and independent of our conscious experience and operating according to knowable laws is anything more than a metaphysical guiding principle. If it truly needs to be jettisoned, that need would seem most likely to emerge as a contradiction discovered while using it as a working hypothesis. If we are to follow what Einstein called an “aspiration towards truth and understanding”, we must embrace the idea that the truth we seek exists and (at least tentatively) the faith that understanding it is possible. As far as scientific pursuits are concerned, we take this approach because it is the only one that embodies our reason for pursuing science. If the goals cannot be achieved, it won’t be because we assumed they could not be. But if the goals can be achieved, what are the success criteria? Clearly human intelligence is finite. If the truth is that the physical Universe is infinitely hierarchical in nature with each deeper layer composed of smaller objects possessing their own unique properties, we will not be able to achieve closure in fundamental terms, an admittedly reductionist goal. The term “reductionist” has several variations. 
Here we mean the variation that does not deny the phenomenon known as “emergence”, where that term also has several variations, of which we do not mean that at a certain level of complexity some completely new effect appears out of nowhere and comes into play without any basis in the less complex level. Nobel laureate physicist Philip Anderson (2011) said “One can’t in any way accuse a single atom of copper or zinc or aluminum of being a metal; a metal occurs when the atoms are condensed into a solid and the electrons free themselves from their individual atoms and form a free ‘Fermi liquid’ in the ‘energy band’ formed from states on all the atoms.” We agree completely, but with the proviso that the properties of a metal stem exclusively from those of the constituent atoms with the only new ingredient being the interaction of the assembly. Once the
properties of the metal are recognized, they are understandable completely within the knowledge of the atom’s properties. Predicting the interaction without having first observed it may be mathematically intractable, but the interaction itself is not magical, it requires nothing from the atoms that they do not already possess. We could also say that one cannot accuse an electron of being a hydrogen atom; working together with a proton is needed. But the hydrogen atom’s properties are determined by those of the electron and proton, even though knowledge of the latter properties would have been more difficult to obtain without ever having studied a hydrogen atom. Part of our working hypothesis is that the way two objects interact is completely determined by the properties of the two objects, properties possessed prior to the interaction. So our goal cannot be achieved if the Universe’s properties stem from an infinitely descending hierarchy of disjoint levels, nor if a collection of objects can exhibit behavior that has no basis in the properties of those objects. Our working hypothesis is that the Universe is not like that. This amounts to assuming that there is some lowest-possible level of physical reality where the objects in play have no internal structure to be reduced further. This severely limits what properties are possible and demands extreme simplicity in what nevertheless gives rise to everything that we observe. If we can identify these ingredients and deduce their nature, we may still be left not knowing how they came to exist and how they happen to have whatever property makes them able to combine with each other to create such great complexity as that which our experience tells us exists. We do not require answers to those questions, they are not part of our success criteria. We will be satisfied knowing how these objects are able to weave the fabric of spacetime in such a way that gravity and the other known interactions emerge. This means that we must be prepared to expand our intuition to embrace the concept of a physical object that is maximally simple. The notion of “understanding” must transcend the traditional grasping of one idea in terms of other ideas, which is an infinite regress. The chain of explanations must terminate at an idea that is sufficient unto itself. The idea of a physical element with no internal structure presents a challenge: how can something have properties without some internal mechanism that gives it those properties? We must adapt to the inverse of the usual construction of formal mathematical systems; instead of the axioms defining the elements of the system as those things whose behavior obeys the laws defined by the axioms, we must take the laws of behavior as following directly from the nature of the monolithic indivisible elements. Those are the laws that lead to the emergence of the complicated system that we call the Universe.

5.2 Clouds on the Classical Horizon

It is easy to get the impression that problems in the interpretation of physical theories began with the advent of Quantum Mechanics in the early 20th century, that classical physics made sense and satisfied intuition up to the point at which atoms were accepted as real objects. This ratification of the physical importance of atoms instigated the study of the microscopic realm, wherein it was discovered that the human mindset, conditioned exclusively on macroscopic phenomena, was shockingly unequipped to deal with the mysteries encountered.
In fact, there had always been plenty of problems of interpretation just dealing with theories describing macroscopic effects. The problems simply took a sharp turn for the worse when attention became focused on natural processes occurring in layers too small for direct perception by human senses.
For example, by the late 1800s, Newton’s law of universal gravitation (Equation 5.1) had been used with gratifying success for over 200 years. Puzzling observed peculiarities in the moon’s orbital motion began to be explained ever more successfully by the inclusion of more and more perturbations and improved methods of computation. Other than a few theorists working on doomed attempts to formulate hydrodynamic theories of gravity, no one seems to have been seriously concerned about the philosophical difficulties implicit in action at a distance. The equation itself was viewed as an “understanding” of gravity. At least it was clear what was meant by a “force”, and even action at a distance could be understood intuitively, even if the mechanism for it was a complete mystery. But because it was a mystery, Newton’s law could not be considered fully interpreted. An essential ingredient was missing. This notion of “understanding” persists today in Quantum Mechanics: having an equation that successfully describes quantum behavior is taken by some scientists to be equivalent to understanding that behavior. But the disconnection with intuition has grown more troublesome, since some elements represented in the equations have no clear physical-element counterparts. The apparent divergence between successful theories and human intuition led to schools of thought involving the permanent disqualification of the human mind as capable of understanding processes so far removed from human perception. The equations of Quantum Mechanics acquired the nature of the master’s spells in Goethe’s “The Sorcerer’s Apprentice”. We can create effects that seem miraculous by making the right invocation. The spells work despite the gross incompleteness of our understanding of why they work, and we have already used them to unleash forces that we are still struggling to control. That is not to say that we would not have done so if we had possessed a complete grasp of the physical mechanisms involved. As products of evolution, we naturally make maximal use of technology for near-term advantage, and scientific progress does not take place in a vacuum. But surely Einstein’s “aspiration towards truth and understanding” is a powerful reason to hope for the eventual overall enlightenment of the human race. And we have also used these mysterious powers to do many beneficial and beautiful things. Among some scientists, however, there is a lingering desire to reconnect intuition with what the equations are telling us. But this challenge is not entirely new. For example, recognition of electricity and magnetism is found in the earliest records of human history, but the nature of each and their relationship to each other remained mysteries for millennia. About the time that Kepler began analyzing Brahe’s astronomical data, William Gilbert began studying electromagnetic phenomena scientifically. His work was continued by other scientists, including Benjamin Franklin, who showed that lightning was electrical in nature. Enough knowledge developed by the end of the 18th century to allow electricity to be stored, electrical circuits to be analyzed, and the presence of two opposite charges to be recognized. In 1831, Michael Faraday discovered that electricity and magnetism were coupled in such a way that either one could induce the other, allowing electrical motors to be constructed and electromagnetic waves to be recognized, leading to the development of electromagnetic field theory in 1864 by James Clerk Maxwell.
In 1896 Henri Becquerel discovered particle radiation, and soon thereafter work by Pierre and Marie Curie, J.J. Thomson, and Ernest Rutherford identified two kinds of particles involved, which were dubbed alpha and beta particles. Studies of gas discharges in electrified low-pressure tubes led to the identification of particles emitted from the cathode and named “cathode rays” by Eugen Goldstein in 1876. Thomson measured their charge-to-mass ratio using magnetic deflection. Becquerel then did the same for beta rays and found the same ratio, leading to the identification of the two as the same particle, named the “electron” based on terminology
introduced by Gilbert. A picture of the electron as a small particle with mass and charge emerged, but it was not clear what its size was. If it were a point particle, it would have infinite density of mass and charge, which was difficult to imagine, but if it were extended, then it would contain charge distributed continuously in a small volume, and some charge would be touching other charge, which seemed to imply that an infinite repulsive force would exist, making the particle more unstable than it was known to be. No intuitive interpretation was available for the electron’s stability or charge, nor for the field that transmitted electric and magnetic forces, other than the old workhorse, aether. The dawn of Quantum Mechanics had not yet broken, and problems interpreting physical parameters already existed. But it can be argued (see, e.g., Omnès, 1999) that such problems acquired a new aspect once the development of Quantum Mechanics was underway. For example, Newton debated Huygens and others regarding whether light consisted of particles or waves, but there seems to be no evidence that anyone advocated that light was in some unimaginable way both. The classical mysteries of the electron did not include the possibility that it was a wave phenomenon, not a particle, or at least not exclusively a particle, but in 1927 electron diffraction was demonstrated in the laboratory. The mechanism underlying action at a distance remained undiscovered, but there was no problem interpreting what was meant by “action” and “distance”. In classical physics one is permitted to say “Let the z axis be defined as parallel to the system’s axis of orbital angular momentum L”, but as pointed out by Sam Treiman (1999, p. 105ff), one cannot say this about a quantum mechanical system, because the eigenvalues for the square of L are ℓ(ℓ+1)ℏ², ℓ = 0, 1, 2, 3, ..., where ℏ is the “reduced Planck constant” h/2π, while the projection of the angular momentum onto the z axis, Lz, has eigenvalues mℏ, where m = −ℓ, −ℓ+1, ..., ℓ−1, ℓ. So for any quantum state with ℓ > 0, the maximum value of Lz² is ℓ²ℏ², which is always less than the square of the orbital angular momentum, L² = ℓ(ℓ+1)ℏ². Since L² and Lz commute, the system can be in both corresponding eigenstates simultaneously (see sections 5.3 and 5.4 below for explanations of these statements if they are not already clear). As Treiman says, “L cannot point along the direction in which it is said to point!” We must agree with Omnès that with Quantum Mechanics, the severity of interpretational problems in physics achieved unprecedented magnitudes.

5.3 The Language of Quantum Mechanics

The scope of this book permits only a scratching of the surface of Quantum Mechanics, just as with Classical Probability Theory, Thermodynamics, Statistical Mechanics, and Scientific Data Analysis. In all these cases we need just enough theory to understand the role played by randomness, which therefore requires some contemplation of what the theories mean in physical terms, i.e., we need to be able to interpret the mathematical formalism. Although the classical notions of force, momentum, inertia, etc., were challenging when they were new, centuries of usage have generated a familiarity with them so that today we treat them as fairly obvious and comfortable to the intuition.
This follows from the fact that classical physics (prior to Einstein’s Relativity theories, which are also considered part of classical physics) dealt exclusively with the properties of macroscopic objects that could be perceived with the human senses, even if aided by microscopes and telescopes. For example, the application of Newtonian mechanics to the kinetic theory of gases naturally involved mental images of the gas particles as similar to tiny marbles flying around and bouncing off of each
other, and a remarkable amount of useful theory was developed with that basic model. By “useful” we mean both appealing to intuition and supported by laboratory experiments. This thread was broken early in the pursuit of atomic physics. At first it appeared that perhaps atoms were like “plum pudding”, with a positively charged pudding containing negatively charged raisins. That image soon gave way to one depicting atoms as little solar systems, with negatively charged planets swirling around a central positively charged sun. That model is still used to depict atoms schematically, but it is known to be incorrect in its failure to explain how some of the planetary orbits can have zero angular momentum, and the interactions of the planets with each other can be seen to involve much more complicated rules than mere gravity (or the very similar Coulomb force for electricity, in this case providing the attraction to the “sun” but oppositely directed toward other “planets”). Attempts to transplant macroscopic objects conceptually into the atomic and sub-atomic realms by simply shrinking their size have produced models that appeal to the intuition but have generally failed to inspire mathematical formalisms that predict observed effects. Human intuition adapted to the evidence that substances long thought to be continuous media, e.g., air and water, were actually composed of extremely small discrete particles, allowing the kinetic theory of gases to be developed as mentioned above. But other physical parameters were expected to remain continuous, such as spin and orbital angular momentum, the energy of harmonic oscillators, the intensity of sunlight, and the orbital radii of those atomic planets, to name a few. And the flying marbles were expected to behave as particles, not waves. The conceptual models demanded to be patched up with additional nonmacroscopic features. Somehow a planet inside the atom had to be able to sustain an orbital angular momentum of zero, passing through the atom’s central sun without being destroyed, and somehow the charged planets had to be able to swirl around without losing energy to electromagnetic radiation. With the accumulation of evidence that contradicted the macroscopic preconceptions, intuition began to lose its grip, and the human mindset was unable to follow the clues leading into microscopic reality with anything more than mathematical formalisms lacking anything remotely resembling full interpretation. The entrance to the proverbial rabbit hole that led to the Wonderland of quantum phenomena turned out to be a cavity containing thermal radiation. The first person to achieve a successful mental navigation of the burrow was Max Planck, and because his journey marked the transition into an entirely new and unanticipated view of the Universe, we will examine his discovery of quantized energy in some detail in the next section. But before that, we need to define some terminology and mathematical notions that will be the minimum required to discuss quantum processes, and since we cannot present a textbook course on the subject, for the moment we will just introduce some ideas whose relevance must be taken on faith, not unlike many aspects of the formalism of Quantum Mechanics. We will re-introduce some of these later in a more coherent context, but in the meantime we need to be able to make references such as that in the previous section about orbital angular momentum in quantum systems. 
One of the new ways of thinking necessitated by Quantum Mechanics involves associating observables with mathematical operators. By “observable”, we mean a physical parameter that can be measured, e.g., energy. By “mathematical operator”, we mean a mapping from one set of numbers to another, e.g., the differential operator d/dx that maps a function y(x) to the slope of that function. Quantum Mechanics makes considerable use of complex functions, but since physical observables are real, their corresponding operators must be Hermitian, also called self-adjoint. Quantum mechanical operators often take the form of complex matrices, i.e., composed of complex numbers
a+bi, where a and b are real numbers, and i is the square root of -1. The adjoint of such an operator is the transpose of the matrix with each element changed to its complex conjugate, a-bi. A self-adjoint matrix is equal to its own adjoint. When such operators are used to compute expectation values for their corresponding observables, the results are real numbers. The matrix equations in which these operators appear frequently correspond to eigenvalue computations, e.g., det(A-λI) = 0, where “det” indicates the determinant of the matrix in its argument, A is a square matrix, I is the identity matrix with the same order N as A, and λ is an N-valued scalar whose values constitute the set of numbers for which the equation is true, called the eigenvalues or spectrum of A. The set of N values of λ may contain duplicates called degenerate eigenvalues. The eigenvalues comprise the set of all possible measurement results. Just as in the Chapter 2 discussions of rotating covariance matrices to diagonalize them, if we diagonalize the matrix A by applying the appropriate rotation, we get the matrix λI, with the values of λ on the diagonal and only zeroes off the diagonal. This matrix represents N orthogonal vectors expressed in an N-dimensional reference system whose axes are the eigenvectors obtained as part of the solution of the eigenvalue problem. The magnitudes of these vectors are arbitrary, and so they are most easily thought of as unit vectors. The eigenvalues have a one-to-one correspondence with the eigenvectors, so each axis of the reference system is associated with one possible outcome of a measurement of the physical system. The physical system state may be associated with a unit vector in the coordinate system defined by the eigenvectors but generally not aligned with any of them prior to the measurement and thus having nonzero components on some or all of the reference axes, hence corresponding to a superposition of states which offends our intuition, e.g., the famous Schrödinger’s Cat being simultaneously dead and alive. The measurement forces the system state vector onto one of the eigenvectors, yielding the corresponding eigenvalue as the measurement result, a self-consistent state that makes sense to our macroscopically conditioned intuition. When the physical system is in a state corresponding to a particular eigenvalue, it is said to be in that eigenstate. The mechanism by which the measurement forces an eigenstate is not understood and has various interpretations, but the process of selecting which eigenvalue shall become the measurement result appears to be perfectly random and governed statistically by probabilities that can be computed for each eigenvalue, specifically (since we are taking the eigenvectors to be unit vectors) the square of the dot product of the pre-measurement state vector and the corresponding eigenvector. The closer the pre-measurement state vector is to a given eigenvector, the more likely it is that the eigenvector’s corresponding eigenvalue will turn out to be the measurement result. An important property of matrices is that in general their products do not commute, i.e., if A and B are matrices, then in general AB ≠ BA. When the operators corresponding to two different observables do not commute, those observables cannot have sharp values at the same time. If the system is in an eigenstate of one operator, it must be in a superposition of the other’s. This is one way to express the Heisenberg Uncertainty Principle.
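To make the linear algebra above concrete, the brief sketch below (written in Python with the NumPy library purely for illustration; the particular 2×2 observable and state vector are arbitrary choices, not anything specified in the text) diagonalizes a Hermitian matrix, treats its eigenvalues as the possible measurement results, and computes the probability of each result as the squared magnitude of the projection of a normalized state vector onto the corresponding eigenvector.

```python
import numpy as np

# An illustrative 2x2 Hermitian observable (equal to its conjugate transpose).
A = np.array([[1.0, 2.0 - 1.0j],
              [2.0 + 1.0j, 3.0]])
assert np.allclose(A, A.conj().T)  # self-adjoint check

# eigh returns real eigenvalues (the possible measurement results) and
# orthonormal eigenvectors (the reference axes discussed above).
eigenvalues, eigenvectors = np.linalg.eigh(A)

# An arbitrary normalized pre-measurement state vector.
psi = np.array([0.6, 0.8j])
psi = psi / np.linalg.norm(psi)

# Probability of each outcome: squared magnitude of the projection of the
# state onto the corresponding eigenvector.
probabilities = np.abs(eigenvectors.conj().T @ psi) ** 2
print("possible results:", eigenvalues)
print("probabilities   :", probabilities, "sum =", probabilities.sum())

# Noncommuting observables: the Pauli matrices sigma_x and sigma_z.
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
print("commutator nonzero:", not np.allclose(sx @ sz, sz @ sx))
```

The probabilities necessarily sum to one because the eigenvectors of a Hermitian matrix form an orthonormal basis.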
For discrete spectra, noncommuting operators cannot be simultaneously in eigenstates (an analogous restriction pertains to continuous spectra). The operators discussed at the end of the previous section, L² and Lz, do commute, and therefore those observables can be simultaneously in eigenstates, which is why it is somewhat perplexing that the orbital angular momentum cannot be projected unchanged onto a coordinate axis parallel to it (unless the angular momentum is zero, but in that case the L vector is null). The classical view of orbital angular momentum as a spin vector with a definite direction has to be abandoned, and there is no deterministic macroscopically based model to replace it. There is a slightly different but related way
to view the situation, once one has accepted the Heisenberg Uncertainty Principle. In the usual three-dimensional case, there are three orthogonal projections of the orbital angular momentum, one per axis, Lx, Ly, and Lz, all three of which commute with L², but no two of which commute with each other. This means that if one of them is measured, the other two cannot have exact values; our knowledge of them must be blurred by the uncertainty involved in having to view them as spread out over multiple possible values with associated probabilities. But if somehow we were able to measure Lz² > 0 as exactly equal to L², then we would know that Lx and Ly were each exactly zero, and that is not permitted for nonzero orbital angular momentum. As a result, for L² > 0, L² and Lz² can never be measured as equal. We can define their axes to be the same and claim that therefore they must be equal, but that is not the result of a measurement, and classical assertions that conflict with quantum rules carry no weight in the quantum realm. As soon as we measure Lz, that measurement disturbs the system enough to destroy our claim that the axes are aligned. We should mention that considerable thought has been devoted to the question of whether L and Lz “really are equal” before we make the measurement, i.e., that as long as we don’t disturb the system, they can be equal by virtue of our definition of the z axis. In classical physics, that would be the case. In Quantum Mechanics, the system’s orbital angular momentum cannot be considered to have a well-defined value if it has never been measured. The fuzziness is not due to our ignorance, it is due to the nature of reality at the quantum level. This question is an essential aspect of “the measurement problem”, and the answer that has emerged is that the classical interpretation is incorrect. The blurry nature of the physical parameter is not generated by epistemic randomness stemming from uncertainty about which object we are observing in a population whose members each have sharp values differing from each other according to some distribution function, it is a property of the individual physical object. In the classical case, our ignorance has no bearing on the behavior of the physical object, only on our knowledge of its behavior. In the quantum case, the object behaves as though it shares in this ignorance, or perhaps a more accurate interpretation is that the distinctions we are trying to make are not relevant to the objective physical reality. The thing that bears some resemblance to one macroscopic object orbiting another is in fact a different kind of thing, a process that is propagated in time in such a manner that its properties that resemble macroscopic qualities are not relevant to the process at every instant of time but rather manifest themselves only in response to certain stimuli. In general, as long as we make no measurement, the process propagates through time in a manner consistent with being in a superposition of states, not in a manner consistent with being in some eigenstate of which we are merely ignorant.
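As a concrete check of these statements, the following minimal sketch (Python with NumPy, in units where ℏ = 1, for the illustrative case ℓ = 1) builds the standard 3×3 angular momentum matrices, confirms that L² commutes with Lz while Lx and Ly do not commute with each other, and shows that the largest possible value of Lz² (namely ℓ² = 1) falls short of L² = ℓ(ℓ+1) = 2, exactly as described above.

```python
import numpy as np

s = 1.0 / np.sqrt(2.0)
# Standard angular momentum matrices for l = 1 (hbar = 1).
Lx = s * np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=complex)
Ly = s * np.array([[0, -1j, 0], [1j, 0, -1j], [0, 1j, 0]], dtype=complex)
Lz = np.diag([1.0, 0.0, -1.0]).astype(complex)

L2 = Lx @ Lx + Ly @ Ly + Lz @ Lz     # L^2 = l(l+1) I = 2 I for l = 1

def commutator(a, b):
    return a @ b - b @ a

print("[L^2, Lz] = 0 :", np.allclose(commutator(L2, Lz), 0))
print("[Lx, Ly] = i Lz:", np.allclose(commutator(Lx, Ly), 1j * Lz))

# Largest possible measured value of Lz^2 is l^2 = 1, while L^2 = l(l+1) = 2,
# so Lz^2 can never be measured equal to L^2 for l > 0.
print("max eigenvalue of Lz^2:", np.linalg.eigvalsh(Lz @ Lz).max())
print("eigenvalues of L^2    :", np.linalg.eigvalsh(L2))
```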
As we will see below, laboratory experiments have now been carried out that can actually distinguish between these two interpretations (see section 5.12), and the verdict is unanimous in favor of Quantum Mechanics as embraced by the majority of modern physicists (i.e., according to current preferences regarding such things as causality, determinism, locality, and the possibility of “hidden variables”, all of which will be discussed further below).

5.4 The Discovery of Quantized Energy

Ever since the discovery of fire, it has been common knowledge that very hot things give off light. The sight of glowing coals in a fireplace or campfire is familiar to almost every culture. Horseshoes heated at a blacksmith’s forge may be seen to glow with an orange color similar to that
of fresh lava flowing from a volcano. To understand the temperature dependence of the color and intensity of light emitted from a hot object was a challenge taken up by the physicists of the late 19th century. It was known that light was an electromagnetic phenomenon because of the work of numerous researchers, culminating with Maxwell’s equations in 1864. As early as 1800, William Herschel had discovered that sunlight refracted by a prism into the colors of the rainbow contained light invisible to the human eye. A thermometer intended to measure the temperature of the air in the room read different temperatures when placed in different colors, and surprisingly read the highest temperature when placed outside of the visible spectrum below the red end (hence the name infrared, “below the red”). In 1801, Johann Wilhelm Ritter made a discovery at the other end of the spectrum: he found that exposing paper soaked in silver chloride crystals to whatever was above the violet end of the spectrum darkened it faster than the violet light (hence the name ultraviolet, “beyond the violet”). In the late 1880s Heinrich Hertz conducted a series of experiments that proved the existence of free-space electromagnetic waves, and in 1894 Guglielmo Marconi experienced success in using what we now call radio waves to convey encoded information. It became clear that electromagnetic wavelengths spanned a very large dynamic range. In the second half of the 19th century some progress was made toward understanding thermal radiation using Thermodynamics. The theoretical idealization known as a black body was defined in 1860 by Gustav Kirchhoff as a body that absorbed all incident light, converted it to thermal energy, and when in thermal equilibrium emitted light with a spectrum determined only by its temperature. Several ways to visualize a black body were developed. One of the most fruitful was a cavity inside a perfect light-absorbing medium at some temperature T > 0. As all heated materials were known to do, the walls would radiate light into the cavity, and eventually the walls and radiation field would presumably come to thermal equilibrium. Thermodynamic arguments showed that the properties of the radiation field in the cavity would not depend on the chemical nature of the cavity walls. Although Thermodynamic analysis could not yield the spectrum, some useful results were obtained. Josef Stefan established experimentally that the total electromagnetic energy depended on the fourth power of the temperature, and Ludwig Boltzmann subsequently provided a Thermodynamic derivation. In 1893 Wilhelm Wien used Thermodynamic arguments to show that the radiation energy density at the frequency ν in a cavity at temperature T varied as

u(\nu,T) = \nu^3\,F(\nu/T)    (5.2)
where F(ν/T) is an unknown function of the ratio of frequency to temperature, and u(ν,T) is the energy of the thermal radiation per unit volume per unit frequency, e.g., Joules per cubic meter per Hertz. The work of Michael Faraday and others had shown that moving an electrically charged object caused magnetic fields to be generated, which in turn generated more electric fields, an effect quantified theoretically by Maxwell. So it was natural to suppose that light was created by oscillating electric charge, and the power of statistical mechanics was brought to bear on the problem of thermal radiation. Here was an avenue for developing a description of electromagnetic radiation in terms of mechanical generators. The energy in light could be related to heat energy agitating electrically charged oscillators. Experimental physicists designed laboratory equipment to measure electromagnetic energy within wavelength or frequency bands, and this was applied to heated bodies to explore possible relationships between frequency, energy emitted at that frequency, and temperature.
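Equation 5.2 carries a strong constraint even before F is known: dividing by T³ shows that u(ν,T)/T³ can depend on ν and T only through the ratio ν/T. The short sketch below (Python with NumPy; the trial function F used here is an arbitrary stand-in chosen only to illustrate the scaling, not Wien's actual proposal) verifies this collapse numerically.

```python
import numpy as np

def F(x):
    # Arbitrary trial function of the ratio nu/T, used only to illustrate
    # the scaling structure of Equation 5.2.
    return np.exp(-x) / (1.0 + x)

def u(nu, T):
    # Wien's form, Equation 5.2: u(nu, T) = nu^3 * F(nu / T).
    return nu**3 * F(nu / T)

x = np.linspace(0.1, 10.0, 50)          # values of the ratio nu/T
for T in (1.0, 2.0, 5.0):
    nu = x * T                           # frequencies giving the same ratios
    scaled = u(nu, T) / T**3
    # u/T^3 depends only on nu/T, so every temperature gives the same curve.
    assert np.allclose(scaled, x**3 * F(x))
print("u(nu,T)/T^3 collapses onto a single function of nu/T")
```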
Although this work took place before the development of atomic theory, charged particles such as the electron had been discovered, and so it was generally agreed that the constitution of matter included electrically charged particles in some manner. It seemed reasonable to assume that these could be the “resonators” (as Planck called them) whose thermal oscillations generated the electromagnetic waves comprising thermal radiation. The physical nature of the electromagnetic waves inside the cavity was a mystery, but it too seemed to be comprised of resonators whose energy was the light energy. The black-body radiation resulted from all these resonators being in thermal equilibrium across all frequencies (an assumption to which Henri Poincaré later took exception; this will be discussed further below). We saw in section 3.3 in the discussion following Equation 3.6 (p. 108) that in thermal equilibrium, the average kinetic energy is equal in all mechanical degrees of freedom and has the value ½kT, where k is Boltzmann’s constant. This is an aspect of the more general principle known as equipartition of energy, and it was used in 1900 by Lord Rayleigh (John William Strutt, the 3rd Baron Rayleigh) to arrive at what is now called the Rayleigh-Jeans distribution,

u(\nu,T) = \frac{8\pi\nu^2}{c^3}\,kT    (5.3)
where c is the speed of light, and Jeans derived the coefficients multiplying kT. Rayleigh arrived at this formula by considering standing waves in the cavity, hence wavelengths that are integer submultiples of the cavity size, for which a given wavelength implies a corresponding number of nodes. Each such wavelength corresponds to a degree of freedom, and adding up the energy over all the probability-weighted energy states for each degree of freedom led to Equation 5.3. This impeccable application of classical statistical physics caused considerable consternation when the obvious fact was recognized that the formula predicts arbitrarily large energy densities at arbitrarily high frequencies, a behavior that came to be known as “the ultraviolet catastrophe”. Many work-arounds were tried, and numerous physicists, Lorentz, Ritz, and Einstein, to name a few, attempted to find the flaw in the derivation. All eventually accepted that the problem was in the classical theory, although not in the notion of equipartition of energy itself. Even if one argues that arbitrarily high frequencies are not physically realizable, measurements had been made at sufficiently high frequencies to show that this distribution diverged from observations to an extreme degree, although for relatively low frequencies, it fit the data quite well. Today it is known as the “Rayleigh-Jeans tail”, where “tail” refers to long wavelengths. Jeans suggested that the high-frequency radiation never actually achieves thermal equilibrium, but he eventually abandoned that argument. Einstein considered the more realistic situation in which the cavity is not a perfect vacuum and argued that the classical physics applied only in the limit of wavelengths much longer than the size of the particles in the gas inside the cavity, hence also the resonators in the walls. Four years before Rayleigh and Jeans produced Equation 5.3, Wien provided an estimate of the unknown function F(ν/T) in Equation 5.2 by a method that has been described variously as empirical, heuristic, or at least considerably less rigorous than the Thermodynamic derivation he used for Equation 5.2 itself (see, e.g., Treiman, 1999, or Kragh, 2002). The author’s conjecture on how Wien did this (without actually having seen this in the literature, but perhaps it is there somewhere) is that Wien noticed that Equation 5.2 is in danger of becoming the first ultraviolet catastrophe unless F(ν/T) does something to prevent it. Clearly F(ν/T) has to provide some kind of damping of the high
frequencies, preferably in a manner that interpolates very smoothly towards F(ν/T) → 0 as ν/T → ∞. Such an interpolation is accomplished by Wien’s suggestion, which has the form

F(\nu/T) = a\,e^{-b\nu/T}    (5.4)
where a and b are constants to be determined. Max Planck worked on computing the constants using arguments based on Clausius entropy (see the paragraph following Equation 3.4 on p.100) and radiation processes that evolve irreversibly to maximize that entropy (see, e.g., Planck, 1899). In modern notation, their solution for the coefficients in the combination of Equations 5.2 and 5.4 results in

a = \frac{8\pi h}{c^3}, \qquad b = \frac{h}{k}

which gives

u(\nu,T) = \frac{8\pi\nu^2}{c^3}\,\frac{h\nu}{e^{h\nu/kT}}    (5.5)
This became known as the Planck-Wien distribution. In the process of doing this work, Planck gave the name “Boltzmann’s constant” to k and identified the coefficient of frequency that yields energy, which he recognized as a new constant of Nature and which was later named for him and given the symbol h. In the last few years of the 19th century, this formula fit the empirical data well enough over the entire range measured, unlike the Rayleigh-Jeans formula that came just a few years later, but the success at the low-frequency end was due to the relatively large measurement uncertainties. In the year 1900, superior measurements had been obtained that showed a problem with Equation 5.5 at the low-frequency end. In what he described later as “an act of desperation”, Planck discovered that a simple fix yielded impressive agreement with all the measurements. This involved simply subtracting 1 from the exponential in the denominator to yield
u(\nu,T) = \frac{8\pi\nu^2}{c^3}\,\frac{h\nu}{e^{h\nu/kT} - 1}    (5.6)
This became known as the Planck radiation law, and Planck’s name disappeared from Equation 5.5, which became known simply as the Wien distribution. Planck was able to show that this change was consistent with a similarly ad hoc assumption about the dependence of Clausius entropy of the radiation field on its energy, but he remained extremely dissatisfied with such fixes to the Wien distribution and sought a solid theoretical foundation. Part of his “act of desperation” was to resort
to Statistical Mechanics, with which he was much less comfortable than his home field of Thermodynamics. Clausius entropy made perfect sense to him, but he had qualms about Boltzmann entropy and its reliance on probability theory. Nevertheless, he was desperate enough to wade into deep waters, and his courage was soon rewarded. A key ingredient in Planck’s success was his recognition that the relevant energy of an electromagnetic resonator was proportional to its oscillation frequency. This seems not at all obvious, because the conceptual model for these mysterious resonators was the linear harmonic oscillator, for which it was well known that the energy is proportional to the square of the frequency (see Appendix K). Of course, the whole “resonator” concept was sketchy at best, so one need not assume that if the resonator is a linear harmonic oscillator, it necessarily gives all of its energy to the radiation field. One view of the resonator is that the electromagnetic field acts as a damping agent, in which case energy lost to that agent is the relevant energy. But the linear dependence of the energy on frequency has to apply not only to the material resonators in the cavity wall, it must also apply to the resonators which Planck envisioned as existing in the otherwise empty space in the cavity. Since this first appearance of what has become called the Planck-Einstein relation, E = hν, is so crucial to Planck’s ultimate realization that the relevant probability distribution has to apply to discrete random variables, not continuous ones, how it came about merits a brief digression. The exponential argument in Equations 5.5 and 5.6 must be dimensionless, and since kT in the denominator has dimensions of energy, so must hν in the numerator. This linear dependence on frequency first appeared in Wien’s guess for F(ν/T), Equation 5.4, which was required by his rigorous derivation of Equation 5.2 to be a function only of the ratio ν/T. But instead of Equation 5.4, Wien could have tried, for example,

F(\nu/T) = a\,e^{-b(\nu/T)^2}    (5.7)
It seems clear that the reason why he did not propose this, or any other nonunit power of the ratio, is that when combined with Equation 5.2, what results simply cannot come close to fitting the empirical data. And in any case, since Equation 5.2 demands that ν and T appear raised to the same power, an energy proportional to T in the denominator is always going to require an energy proportional to ν in the numerator. So the very strong suggestion that the resonator energy is hν already exists in Wien’s requirement that F(ν/T) be a function only of the ratio, and this is closely related to how Planck explained his arriving at E = hν. He attributes it specifically to the Wien displacement law (Equation 5.8 below, although sometimes that name is also applied to Equation 5.2, e.g., Omnès, 1999) which Wien had developed from purely Thermodynamic arguments as part of his derivation of Equation 5.2 in 1893. Wien considered radiation in thermal equilibrium in a cavity whose walls were perfectly reflecting. The radiation would consist of standing waves whose frequencies were determined by the cavity dimensions. Then he considered what happens to a given mode with frequency ν when the cavity is expanded adiabatically (no energy added or subtracted) and quasistatically (always arbitrarily close to being in thermal equilibrium). He was able to show that the ratio ν/T is an adiabatic invariant, i.e., constant for a quasistatic adiabatic process. This applies to all modes in thermal equilibrium, but today the Wien displacement law is usually invoked only to describe the mode at which the energy density is maximal, the mode with frequency
\nu(u_{max}(\nu,T)) = \alpha T    (5.8)
where α is a constant that is computed by setting the frequency derivative of Equation 5.6 to zero and solving for frequency. This yields α ≈ 5.878925421×10¹⁰ Hz per degree Kelvin. If the Wien distribution, Equation 5.5, is used instead, the value is about 6.250985379×10¹⁰. But in 1893, Wien had neither of these formulas to use. The mere fact that ν/T was adiabatically invariant, together with the knowledge that the energy was proportional to T (shown by Statistical Mechanics to be ½kT per degree of freedom), sufficed to lead Planck to E = hν. A quarter of a century after Planck was able to feel his way through the dark, Erwin Schrödinger formulated an equation that could be applied to the linear harmonic oscillator to determine its energy eigenvalues, and these indeed turned out to be linear in the frequency, even though the classical energy, with its dependence on the square of the frequency, is used in the equation. But Planck had to make his way without the powerful tool later provided by Schrödinger. The difference between the classical and quantum energies is accompanied by a difference in the amplitude properties. The classical amplitude is continuous, whereas the amplitude states of the quantum linear harmonic oscillator are also quantized. An important difference between the quantum linear harmonic oscillator and Planck’s “resonator” is that the former’s energy eigenvalues are (n+½)hν, not the nhν eventually found by Planck for black-body radiation, i.e., the “zero-point” energy is not zero but rather ½hν. Nevertheless, both the quantum linear harmonic oscillator and Planck’s resonator can change energy states only in integer multiples of hν (the missing factor of ν is actually hidden in these integers, which happen to be eigenvalues of an operator whose definition includes a term with this factor; see Appendix K). We should mention that Equation 5.6 is written in a form consistent with energy density per unit frequency interval, i.e., there is an implicit dν on both sides of the equation. The form for energy density as a function of wavelength λ per unit wavelength interval is obtained by transforming the independent variable to λ = c/ν, so substituting c/λ for ν in Equation 5.6, and then accounting for the fact that dν = −c dλ/λ² yields the wavelength form of the Planck distribution:
u(\lambda,T) = \frac{8\pi hc}{\lambda^5}\,\frac{1}{e^{hc/\lambda kT} - 1}    (5.9)
Maximizing this with respect to λ yields
\lambda(u_{max}(\lambda,T)) = \frac{2.897772122\times10^{-3}}{T}    (5.10)
where λ(umax(λ,T)) is in meters. Thus λ(umax(λ,T)) ≈ 0.56825 c/ν(umax(ν,T)). Given that the model for the “resonators” is a one-dimensional linear harmonic oscillator, and given that these have both kinetic and potential energy, hence two degrees of freedom, each having an average value of ½kT in thermal equilibrium, the average energy per resonator mode with frequency ν is kT. The derivation of the Rayleigh-Jeans radiation law, Equation 5.3, must be considered correct according to classical Statistical Mechanics, so another way to express this law is
u(\nu,T) = \frac{8\pi\nu^2}{c^3}\,\langle E\rangle    (5.11)
where ⟨E⟩ is the average resonator energy for this mode. A qualitative feel for Planck’s revised justification of his radiation law, Equation 5.6, can be seen as a change to the way ⟨E⟩ is computed. We saw in Chapter 3 that for the canonical ensemble of classical Statistical Mechanics (see the discussion preceding Equation 3.7 and following Equation 3.9, pp. 109-110) the probability of a state with energy E being occupied by a system in equilibrium at temperature T is proportional to exp(-E/kT). Note that we are considering an energy state of a particular mode. This dependence does not generally describe the energy distribution over the entire system, since the different modes are not necessarily equally probable. For example, Equations 3.8 and 3.9 show that the distribution of momentum in a gas varies as exp(-E/kT), but Equation 3.12 shows that the system’s kinetic energy distribution varies as E exp(-E/kT). So here we are considering the energy distribution within a particular vibrational mode and will denote it fmode(E). Classically, the energy is a continuous parameter, and therefore the probability density function for the vibrational mode to have energy E is
f_{mode}(E) = \frac{e^{-E/kT}}{\int_0^\infty e^{-E/kT}\,dE}    (5.12)
so that the average energy for the vibrational mode is
\langle E\rangle = \frac{\int_0^\infty E\,e^{-E/kT}\,dE}{\int_0^\infty e^{-E/kT}\,dE} = \frac{k^2T^2}{kT} = kT    (5.13)
i.e., we recover the value ⟨E⟩ = kT that converts Equation 5.11 into the Rayleigh-Jeans distribution, Equation 5.3. Planck struggled with representing the probability W in the Boltzmann entropy k lnW and eventually realized that there was a light at the end of the tunnel if a resonator with frequency ν could possess energy only in discrete packets nhν, where n is a nonnegative integer. If we allow n to be continuous or h to be arbitrarily small (which is what Planck was trying to do at one point along his tortuous path), then nhν becomes continuous, and the classical result above is obtained. But if n is constrained to be a nonnegative integer and h is not allowed to be arbitrarily small, then we move from a continuous probability density function to a discrete probability distribution:
P_{mode}(nh\nu) = \frac{e^{-nh\nu/kT}}{\sum_{n=0}^{\infty} e^{-nh\nu/kT}}    (5.14)
so that the average energy for the vibrational mode with frequency ν becomes
\langle E\rangle = \sum_{n=0}^{\infty} nh\nu\,P_{mode}(nh\nu) = \frac{\sum_{n=0}^{\infty} nh\nu\,e^{-nh\nu/kT}}{\sum_{n=0}^{\infty} e^{-nh\nu/kT}} = h\nu\,\frac{\sum_{n=0}^{\infty} n\,e^{-nh\nu/kT}}{\sum_{n=0}^{\infty} e^{-nh\nu/kT}} = \frac{h\nu}{e^{h\nu/kT} - 1}    (5.15)
where we have used
\frac{\sum_{n=0}^{\infty} n\,e^{-nq}}{\sum_{n=0}^{\infty} e^{-nq}} = \frac{e^q/(e^q-1)^2}{e^q/(e^q-1)} = \frac{1}{e^q - 1}    (5.16)
When we use Equation 5.15 in Equation 5.11, we obtain the Planck distribution, Equation 5.6. There is no need to apply the arbitrary smoothly interpolating high-frequency-damping function in Equation 5.4, and we obtain the -1 in the denominator naturally, not as an ad hoc fix that improves the fit to empirical data. Of course, knowing that Equation 5.6 fits the data so well was crucial in motivating Planck to work backwards from that formula. Upon discovering this result, Planck’s mood was probably in a superposition of the states euphoric/anxious and relieved/skeptical, among possibly others. Everything was on absolutely solid ground except the hypothesis of discrete energy packets. The fact that this hypothesis led to such convincing agreement with the measurements persuaded him that something real and important had been encountered, but at first he suspected that it was just a quirk in the mechanism governing the interaction between mechanical oscillators and the radiation field. It took almost ten years before he accepted that the constant of Nature he had already published in 1899 (with what later became known as the Planck length, time, and mass; see Appendix H) was actually a discrete “quantum of action” and truly an intrinsic aspect of reality, and he accepted it then only because of the fruitfulness of the concept for other theorists working on what eventually became Quantum Mechanics. The Planck black-body radiation law is illustrated in Figure 5-1 as a function of frequency for three temperatures, 250°, 300°, and 400° Kelvin, all temperatures typical of the Earth vicinity of the solar system. Some characteristic features are that as temperature increases, the total energy (the area under the curves) increases rapidly, and the peak emission occurs at higher frequencies. These follow from the Stefan-Boltzmann law for total emission being proportional to the fourth power of temperature and from the Wien displacement law (Equation 5.8), respectively. Figure 5-2 shows the distribution as a function of wavelength for the same temperatures. The same total-power behavior with increasing temperature is obvious, and now the peak emission occurs at shorter wavelengths, as specified by Equation 5.10.
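The step from Equation 5.14 to the closed form in Equation 5.15 is easy to check numerically. The sketch below (Python; the infinite sums are truncated at a large n, and standard CODATA values of h and k are used, both choices of convenience rather than anything taken from the text) evaluates the Boltzmann-weighted average of the discrete energies nhν directly and compares it with hν/(e^{hν/kT} − 1) and with the classical value kT, which is approached when hν is much smaller than kT.

```python
import numpy as np

h = 6.62607015e-34        # Planck constant, J s
k = 1.380649e-23          # Boltzmann constant, J/K

def mean_energy_discrete(nu, T, nmax=200000):
    # Equations 5.14 and 5.15: Boltzmann-weighted average of the allowed
    # energies n*h*nu, with the infinite sum truncated at nmax.
    n = np.arange(nmax)
    w = np.exp(-n * h * nu / (k * T))
    return np.sum(n * h * nu * w) / np.sum(w)

def mean_energy_planck(nu, T):
    # Closed form from Equation 5.15.
    return h * nu / np.expm1(h * nu / (k * T))

T = 300.0
for nu in (1.0e11, 1.0e13, 1.0e14):
    direct = mean_energy_discrete(nu, T)
    closed = mean_energy_planck(nu, T)
    print(f"nu = {nu:.1e} Hz: sum = {direct:.3e}, "
          f"closed form = {closed:.3e}, classical kT = {k*T:.3e}")
```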
Figure 5-1 The Planck distribution describing black-body radiation as a function of frequency. The dashed curve is for T = 400° Kelvin; the solid curve is for T = 300°, about the temperature of solar-system objects the same distance from the sun as the Earth is; the dotted curve is for T = 250°. The abscissa is in units of Hz, and the ordinate is in units of Joules per cubic meter per Hz.
Figure 5-2 The Planck distribution describing black-body radiation as a function of wavelength. The dashed curve is for T = 400° Kelvin; the solid curve is for T = 300°, about the temperature of solar-system objects the same distance from the sun as the Earth is; the dotted curve is for T = 250°. The abscissa is in units of meters, and the ordinate is in units of Joules per cubic meter per meter.
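Readers who wish to reproduce the content of Figures 5-1 and 5-2 numerically can do so with a few lines of code. The following sketch (Python with NumPy; a brute-force grid search stands in for solving the transcendental peak condition exactly, so the results are approximate) evaluates Equations 5.6 and 5.9 at the three temperatures and locates the peak frequency and wavelength, which can then be compared with the Wien displacement constants quoted with Equations 5.8 and 5.10.

```python
import numpy as np

h = 6.62607015e-34   # J s
k = 1.380649e-23     # J/K
c = 2.99792458e8     # m/s

def u_nu(nu, T):
    # Planck law per unit frequency, Equation 5.6 (J m^-3 Hz^-1).
    return (8.0 * np.pi * nu**2 / c**3) * h * nu / np.expm1(h * nu / (k * T))

def u_lam(lam, T):
    # Planck law per unit wavelength, Equation 5.9 (J m^-3 per m).
    return (8.0 * np.pi * h * c / lam**5) / np.expm1(h * c / (lam * k * T))

nu = np.linspace(1e11, 2e14, 400000)      # Hz
lam = np.linspace(1e-6, 1e-4, 400000)     # m

for T in (250.0, 300.0, 400.0):
    nu_peak = nu[np.argmax(u_nu(nu, T))]
    lam_peak = lam[np.argmax(u_lam(lam, T))]
    print(f"T = {T:5.1f} K: nu_peak/T  = {nu_peak / T:.4e} Hz/K (Eq. 5.8)")
    print(f"             lam_peak*T = {lam_peak * T:.4e} m K  (Eq. 5.10)")
```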
When Planck announced this derivation of the black-body radiation law in 1900, it did not release an avalanche of interest and enthusiasm. He himself had already shown that a heuristic interpolation formula for the Clausius entropy could lead to the same formula, and although that did not rest on a rigorous physical argument, it called into question whether discrete energy bundles were the only physical model that could produce the formula. When Planck was asked about the uniqueness of his result, he replied that very few physical formalisms had been proved to be unique descriptions of their associated phenomena. The only prominent physicist to embrace quantized energy early on was Einstein, and he was not motivated by the Planck distribution; he was independently arriving at the same concept via his work on the entropy of the radiation field and the photoelectric effect, for which his early papers relied on the Wien distribution confined to the region in which it is a good approximation, large values of hν/kT. Even after Einstein’s 1905 paper describing the work for which he later received the Nobel Prize (Annalen der Physik 17 (6), 1905), Planck remained unable to accept the notion of quantized electromagnetic energy, and he ascribed the discrete bundles to the physical resonators. In 1912 he attempted a different approach to the derivation with the assumption that the resonators absorbed radiation continuously but emitted it in discrete bundles. As we now know, both the physical resonators and the electromagnetic field have quantized energy, so this new approach was incorrect, but it had its own remarkable aspect, the introduction of zero-point energy: the resonator energy at zero temperature was ½hν. Despite the fact that its foundation was later shown to be erroneous, zero-point energy fascinated many researchers, and its analysis led to some useful results. As mentioned above, zero-point energy was shown to be a property of harmonic oscillators over a decade later via the Schrödinger Equation (see Appendix K). The question whether Planck’s 1900 mechanism was both necessary and sufficient to derive Equation 5.6 from Statistical Mechanics was taken up by Henri Poincaré in the last year of his life (1912; for an excellent English review of Poincaré’s rather difficult paper, see Prentis, 1995; see also Lorentz, 1921). Besides being curious about quantized energy, he was dissatisfied with some aspects of the derivation, for example, the claim that the energy of standing waves of different frequencies in a cavity would come into thermal equilibrium with each other. Planck had shown convincingly that the quantized-energy mechanism was sufficient for the derivation of his formula, given the assumption of thermal equilibrium, but it remained to be shown that it was necessary on physical grounds (i.e., not allowing such things as entropy interpolation formulas with no physical foundation), including a justification for taking the radiation field to be in equilibrium with itself and the material in which the cavity existed. Poincaré proved not only that in this sense quantized energy was both necessary and sufficient for the Planck distribution, but that it was necessary for any thermal radiation law that does not predict that an infinite amount of energy will be radiated from heated objects. In other words, classical continuous energy is incompatible with finite thermal energy radiation.
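The incompatibility Poincaré identified can be illustrated with a simple integration exercise. The sketch below (Python with SciPy; it is only an illustration of the divergence, not a reproduction of Poincaré's argument) integrates the Rayleigh-Jeans form, Equation 5.3, up to ever larger cutoff frequencies, where the total energy density grows without bound, and then integrates the Planck law, Equation 5.6, over all frequencies, where the total converges to the Stefan-Boltzmann value proportional to the fourth power of temperature.

```python
import numpy as np
from scipy.integrate import quad

h = 6.62607015e-34   # J s
k = 1.380649e-23     # J/K
c = 2.99792458e8     # m/s
T = 300.0            # K

# Rayleigh-Jeans (Equation 5.3) integrated up to a cutoff frequency:
# the total energy density grows as the cube of the cutoff.
def u_rj(nu):
    return 8.0 * np.pi * nu**2 * k * T / c**3

for cutoff in (1.0e13, 1.0e14, 1.0e15):
    total_rj, _ = quad(u_rj, 0.0, cutoff)
    print(f"Rayleigh-Jeans total up to {cutoff:.0e} Hz: {total_rj:.3e} J/m^3")

# Planck (Equation 5.6) integrated over all frequencies, using the
# dimensionless variable x = h*nu/(k*T); the integral of x^3/(e^x - 1)
# over (0, inf) is pi^4/15, giving a finite total proportional to T^4.
prefactor = 8.0 * np.pi * (k * T) ** 4 / (c * h) ** 3
total_planck, _ = quad(lambda x: x**3 / np.expm1(x), 1.0e-12, np.inf)
print(f"Planck total (all frequencies): {prefactor * total_planck:.3e} J/m^3")
print(f"Stefan-Boltzmann closed form  : "
      f"{8.0 * np.pi**5 * (k * T)**4 / (15.0 * (c * h)**3):.3e} J/m^3")
```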
Poincaré suspected that the thermal equilibrium required for Planck’s derivation probably does develop, but since standing waves of different frequencies superpose linearly, there was no obvious way for waves with different frequencies to interact and exchange energy, something they must do in order to arrive at equilibrium with each other and the physical resonators. Poincaré called the physical resonators “atoms” and used the term “resonator” to indicate the agent that permits energy to be exchanged between matter and electromagnetic radiation. Despite Einstein’s Special Relativity having removed all support for the notion of absolute reference frames, hence for an absolutely
stationary aether, theories dealing with Maxwell’s Equations continued to make reference to aether as the medium in which the electromagnetic oscillations take place (e.g., Poincaré, 1912b), but typically neither “aether” nor “resonator” is more specifically defined. Poincaré was noted for his belief that the simplicity of a mathematical formalism was enhanced by its generality. Unfortunately, many physicists of his time preferred more direct references to the specific phenomena being described, and his proof of the necessity and sufficiency of quantized energy for thermal radiation proved quite difficult to follow. Poincaré had already established for himself a reputation of the highest order, however, and as a result, the conclusions of his analysis were widely accepted and became influential in persuading the mainstream of physicists to take seriously the idea that energy could be quantized. Deeper analyses of his paper reveal that he had anticipated aspects of Heisenberg’s matrix formalism and established the necessity of a radically new approach to mechanics, but those features were lost for the most part to everyone but historians of science. He was able to show that thermal equilibrium could be obtained via energy exchanges brought about by collisions between atoms and by Doppler shifting of the resonators. In fact, he had applied his preference for generality to show that the average energy partitioned between high-frequency resonators and atoms emitting radiation at low frequency had to be equal, but to do so while keeping the total energy finite, both energies had to be quantized. Without having to be concrete about how the mean energy depends on temperature, he showed that the very existence of a law covering all temperatures implied energy quantization. Given that Planck’s distribution was based on accepted principles of Statistical Mechanics other than the quantization of energy, Poincaré’s proof showed that Planck’s distribution necessarily implied quantized energy. Given that there is only one known and accepted law relating temperature to the mean energy per degree of freedom of a system in thermodynamic equilibrium, quantized energy implied Planck’s distribution. Each was therefore necessary and sufficient for the other. Even Jeans was convinced enough to abandon his resistance to the notion of quantized energy, and Louis de Broglie was motivated to switch his academic studies from history and law to physics.

5.5 Gradual Acceptance of Quantization

After 1900, Planck’s formula for black-body radiation was accepted as correct, but until Poincaré’s analysis in 1912, very few physicists attributed the apparent energy quantization to real physical effects fundamental to emission and absorption of light. As a result, the fact that a door to a new and unexpected view of reality had opened was missed by almost everyone, and even Planck continued to attempt variations that would reduce the glare of what seemed to be an absurd interpretation of sub-microscopic behavior. Among the few that embraced quantization was Einstein. As mentioned in section 3.7, a key element in his formulation of the photoelectric effect involved identifying the similarity between the entropy of the radiation field and that of molecular gases.
He saw clearly that both were exhibiting particle behavior, and when he learned of Planck’s derivation based on quantized energy, it did not shock him, and it fit perfectly within the interpretation he was developing for laboratory measurements of electrons emitted from metals exposed to light of various frequencies. He came to realize that the quantization of light energy was not a property forced by quantization of the energy of charged material oscillators; it was a property intrinsic to the electromagnetic energy itself. But
But even Einstein’s four papers published in 1905 took several years to soak in, and by 1910, mainstream physicists were still scratching their heads about these intrusive notions. Einstein’s persuasive description of the photoelectric effect, however, did convert many scientists to the view that energy quantization had to be taken seriously as a genuine feature of reality.

In that first decade of the 20th century, progress was made in understanding radioactivity, the electron as a sub-atomic particle, the alpha particle as doubly ionized helium, and the scattering of radioactive emissions by atoms. Ernest Rutherford conducted studies in which alpha and beta particles were scattered by various metal foils, and he published results (Rutherford, 1911) that relied heavily on laboratory measurements taken by Hans Geiger and Ernest Marsden (Geiger & Marsden, 1909). Rutherford already suspected that atoms contained a nucleus that comprised almost all of the mass. This was based on his identification of alpha particles as doubly ionized helium. But his analysis of the scattering of alpha particles from gold foil led him to advocate the nuclear model, because the scattering indicated that on rare occasions, the alpha particle encountered something heavier than itself that could scatter it back in the direction from which it had come, while the overall scattering pattern showed that this obstacle was extremely small. The solar-system atomic model became generally accepted, although there were several variations on this theme. It had to be taken on faith that some mechanism existed that prevented the orbitally accelerating electrons surrounding the nucleus from radiating energy the way Maxwell’s Equations indicated that they should, which would cause them to fall into the nucleus.

The fact that atoms emitted light in a large number of very narrow wavelength bands had been discovered and studied extensively. This was a primary purpose of the gas discharge tubes that also led to the discovery of cathode rays. Different gases in the tubes were subjected to strong electric fields which caused the gases to react by emitting light unlike that of the black-body radiation from substances in thermal equilibrium. When this light was passed through a slit and then through a prism or off of a diffraction grating, the different bands showed up on photographic plates or scintillation screens as “emission lines”. Spectral analysis of sunlight in the early 19th century had shown that it had such lines, but in absorption rather than emission. The lines appeared as dark features against a bright thermal continuum. By isolating different chemical species in discharge tubes, it was possible to determine which lines were being generated by which elements.

A pattern in the line wavelengths for alkali metals was found in 1888 by Johannes Rydberg. He had been cataloging the lines and found that using “wave number” made the bookkeeping easier. The wave number corresponding to the wavelength λ is just 1/λ (the definition 2π/λ is also encountered, sometimes under the name angular wave number or circular wave number, often as a vector with this magnitude that points in the velocity direction). He learned that just three years earlier, Johann Balmer had found a pattern for the most prominent hydrogen lines, now known as the Balmer series. Rydberg generalized this pattern to cover all the hydrogen line series and obtained the formula

$$
\frac{1}{\lambda} = R\left(\frac{1}{n_1^2} - \frac{1}{n_2^2}\right)
\tag{5.17}
$$
where n1 and n2 are integers such that 0 < n1 < n2, and R is a constant now called the Rydberg constant whose value depends on chemical species. For hydrogen it has a value of about 1.097×10⁷ m⁻¹. The pattern found by Balmer is essentially this same formula for n1 = 2.
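As a quick numerical illustration, the minimal sketch below (Python is assumed here purely for illustration; it is not part of the original text) evaluates Equation 5.17 for the Balmer series, n1 = 2, using the quoted hydrogen value R ≈ 1.097×10⁷ m⁻¹.

```python
# Sketch: Balmer-series wavelengths from the Rydberg formula (Equation 5.17),
# using the hydrogen value R ~ 1.097e7 m^-1 quoted in the text.
R_H = 1.097e7  # Rydberg constant for hydrogen, m^-1

def rydberg_wavelength(n1, n2, R=R_H):
    """Wavelength in meters for the transition n2 -> n1, from 1/lambda = R(1/n1^2 - 1/n2^2)."""
    return 1.0 / (R * (1.0 / n1**2 - 1.0 / n2**2))

for n2 in range(3, 8):  # Balmer series: n1 = 2, n2 = 3, 4, 5, ...
    print(f"n2 = {n2}: lambda = {rydberg_wavelength(2, n2) * 1e9:.1f} nm")
```

The first value returned, about 656 nm, is the familiar red hydrogen line at the head of the Balmer series; the remaining lines march toward the series limit near 365 nm.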
Niels Bohr (1913) took Rutherford’s planetary-system model of the atom and applied the new principle of energy quantization to the electron orbits. Although the reason for energy quantization remained completely mysterious, the phenomenon seemed to be a fact of life, and if the electrons could not radiate electromagnetic energy continuously, then apparently they could not spiral gracefully into the nucleus. Instead they occupied what Bohr called “stationary elliptical orbits” that were characterized by specific discrete energies. They could jump from one of these orbits to another, but only by changing their orbital energy in discrete amounts equal to the difference in orbital energy, and when moving from higher to lower orbits, electromagnetic energy was indeed radiated in the amount ΔE = hν, where h is Planck’s “quantum of action”, and ν is the frequency of the radiation. If the same energy were absorbed rather than emitted, the electron could move upward between the same two orbits.

Bohr applied his model to the hydrogen atom with the assumption that it consists of a single electron orbiting a nucleus with a single positive charge. He assumed that, like Planck’s “resonators”, a free electron with zero energy at a large distance from the nucleus would emit radiation with energy nhν as it is captured into a stationary-state orbit, where ν is related to the orbital frequency, hence higher for lower orbits. The orbit into which the electron is captured determines the value of n, which becomes its label and is now identified as a “quantum number” for that orbit. Since some energy must be radiated during the capture, we must have n > 0, unlike the quantized energy of thermal radiation. The lowest-energy orbit therefore has the quantum number n = 1. Combining electrodynamics and orbital mechanics, Bohr arrived at the following expression for the energy Wn required to move the electron from orbit n to infinity:

$$
W_n = \frac{2\pi^2 m_e Z^2 e^4}{n^2 h^2}
\tag{5.18}
$$
where me is the mass of the electron, e is the charge on the electron, and Z is the nuclear charge in units of absolute electron charge. Orbits with larger values of n therefore require less energy to ionize the atom, with the electron continuum corresponding to arbitrarily large n values. For hydrogen, Z = 1; we include Z in Equation 5.18 for purposes of generality. Using Z = 1 explicitly, ΔE for a jump from orbit n2 down to orbit n1 in a hydrogen atom involves a radiative loss of orbital energy given by
$$
\Delta E = W_{n_1} - W_{n_2} = \frac{2\pi^2 m_e e^4}{h^2}\left(\frac{1}{n_1^2} - \frac{1}{n_2^2}\right)
\tag{5.19}
$$
Since this is just hν = hc/λ, the relation to 1/λ in Equation 5.17 is straightforward, and the Rydberg constant for hydrogen can be computed from the known atomic constants. Thus Bohr’s model of the hydrogen atom explained the series of that element’s emission and absorption lines, and the early formulation of Quantum Mechanics was well underway. As it happens, Bohr really needed the “reduced mass” in place of me, where the reduced mass is given by meM/(me+M), and M is the mass of the nucleus, but since M >> me, M/(me+M) ≈ 1, so that is a small correction, and in fact other tweaks were needed, but the model’s impressive success secured the role of quantized energy in subsequent developments of physical theories. Bohr believed that the quantization of electromagnetic energy derived from the quantization
of atomic orbits, not some fundamental feature of electromagnetic energy itself. His attempts to extend the model to atoms with more electrons were less successful in reproducing observed spectra, and a more complete development of Quantum Mechanics was needed. Some hints of that formalism were present in the Bohr model, however. Based on the energy spacings of the orbits and some heuristic assumptions about the relationship between the frequency of the emitted electromagnetic radiation and the orbital frequency, Bohr deduced that the orbital angular momentum L was quantized, L = nħ. This gives an erroneous value for the lowest orbit, n = 1, for which the angular momentum should be zero, but it was a step in the right direction leading to the notion of quantized orbital angular momentum eigenstates and their spacing. In his doctoral dissertation, Louis de Broglie (1925) pointed out that this quantization is consistent with the electron behaving like a wave instead of a particle, with the quantization corresponding to standing waves over the stationary-state electron orbits.

During this period of a dozen years, much progress was made in many areas of physics such as cryogenics, optics, radioactivity, solid-state physics (including characterization of the specific heats of various chemical species), and an improved model of the hydrogen atom (including electron spin and different angular momentum states within orbital-energy levels). These all affected and were affected by the growing body of work on what Einstein (1922b) called “Quantum Mechanics” (or “Quantenmechanik” in the original German) in a paper in which he lamented the limitations of the current capability to describe composite systems (e.g., it had been generally expected that electrical resistance would become infinite as temperature approached absolute zero, since the presumed free electrons in a metal would become frozen to the crystal lattice, but the opposite effect, superconductivity, was discovered, and this was eventually described by a more developed quantum theory). Of course this period also saw Einstein’s General Theory of Relativity and its first empirical confirmations, but being a classical (i.e., nonquantum and completely deterministic) theory, it will not be of interest to us until we consider attempts to formulate Quantum Gravity. And the rate of progress was also impacted significantly by the First World War.

5.6 The Schrödinger Wave Equations

The wave-particle duality pointed out by de Broglie inspired Erwin Schrödinger to seek a quantum-mechanical counterpart to Newton’s law of motion F = ma. De Broglie had postulated that the wavelength λ associated with an electron was related to its momentum p according to p = h/λ. The entity that had this wavelength was called a “matter wave”, and its physical significance was completely mysterious, possibly even more so than the particle nature of light, since the possibility that light consisted of particles had been debated at least since Aristotle’s time. No one knew what a matter wave was, but it seemed reasonable that it should propagate along with the associated particle. The wave nature of free electrons was confirmed in the laboratory by Clinton Davisson and Lester Germer, who conducted experiments over a five-year period ending in 1927 showing that when electron beams were scattered off of nickel crystals, they exhibited diffraction patterns similar to those of high-energy electromagnetic radiation.
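For a rough sense of scale, the sketch below applies p = h/λ with a nonrelativistic kinetic energy to an electron beam. The 54 eV beam energy is the value commonly quoted for the Davisson-Germer experiment and is an assumption here, not a number taken from this text; Python and its standard math module are likewise assumed only for illustration.

```python
# Sketch: nonrelativistic de Broglie wavelength lambda = h/p with p = sqrt(2 m E).
import math

h   = 6.626e-34   # Planck constant, J s
m_e = 9.109e-31   # electron mass, kg
eV  = 1.602e-19   # joules per electron volt

E = 54 * eV                 # assumed beam energy (commonly quoted for Davisson-Germer)
p = math.sqrt(2 * m_e * E)  # nonrelativistic momentum
print(h / p)                # ~1.67e-10 m
```

A wavelength of roughly 1.7×10⁻¹⁰ m is of the same order as the atomic spacings in a crystal, which is why the nickel lattice could act as a diffraction grating for the electron beam.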
Einstein (1905b) had established mass-energy equivalence E = m₀c², and both he (1905) and Planck (1900) had established that the energy associated with electromagnetic radiation is E = hν. De Broglie had postulated that all objects possessing rest mass have both wave and particle attributes
linked by m₀c² = hν, where the mass m₀ and frequency ν are measured in the object’s rest frame. He used the Special Theory of Relativity to show that for an object moving relative to an observer’s rest frame, the frequency of the matter wave is reduced in the observer’s frame to yield a wavelength of λ = h/p = h/mv, where the mass m and speed v are as measured by the observer (care must be taken not to mistake the lower-case italic “v” for the frequency ν; the distinction should be clear by the context).

Guided by work in optics and mechanics done by William Rowan Hamilton almost a century earlier and the notion that something with wave properties should behave according to a wave equation, Schrödinger sought something resembling existing wave equations for optics. A cornerstone of modern physics, Hamilton’s formulation of Newtonian mechanics is introduced in undergraduate curricula, but as Schrödinger pointed out, Hamilton originally developed these equations for optics and later realized that there was a compelling analogy to mechanics. Hamilton was familiar with two descriptions of optics, one based on Fermat’s Principle involving rays of light encountering reflective surfaces and refractive media, and one based on the Huygens-Fresnel Principle involving wave fronts that propagate in such a way that slits can induce interference patterns. Schrödinger saw an analogy between two relationships: (a.) the relationship of the rays, which produce no diffraction patterns, to the waves, which cause interference effects like diffraction; (b.) the relationship of classical particles, which move along well-defined paths, to the interference effects implied by electron diffraction and atomic-electron-orbit standing waves. In both cases it seemed that a crucial ingredient was the ratio of the wavelength to the dimensions of the physical elements. Schrödinger observed that classical physics seemed to involve wavelengths that are very much smaller than the physical objects involved and thus give the appearance of particles moving along ray-like trajectories, whereas quantum-mechanical effects such as diffraction emerged in cases involving objects and wavelengths of more nearly similar size. Since Hamilton’s formulation of optics had connected the Fermat and Huygens-Fresnel regimes, a similar approach for describing the motion of massive particles seemed worth investigating.

Given λ = h/mv, it was straightforward to expect that extremely small-mass physical objects (such as atoms and electrons were known to be) would be more likely to have sizes on the same order as the wavelengths of their associated “matter waves” than large-mass objects. For example, using the modern values for c, h, and the mass of the electron, we find the electron’s rest-frame de Broglie wavelength to be about 2.29×10⁻¹⁵ m, which is about 40.7% of the electron’s classical diameter. For comparison to a macroscopic object, a cube 1 cm on each side of a substance with the density of water would weigh one gram and have a rest-frame de Broglie wavelength of about 2.09×10⁻³⁹ m, or 2.09×10⁻³⁵% of the length of the object. Such a wavelength would be smaller than the Planck length (see Appendix H) and is therefore considered unrealistic today, and the assumption that the collective mass of all the atoms in the cube is the appropriate mass to use in the de Broglie equation is not obviously justified, but the collective masses of some molecules have been shown to be applicable in this fashion. For example, Arndt et al.
(1999) found that C60 molecules moving at a speed of about 220 m/s displayed diffraction consistent with the laboratory-frame de Broglie wavelength 2.4×10⁻¹² m assuming that all 60 carbon atoms move as a collective mass of 1.2549×10⁻²⁴ kg.

The topic of Hamiltonian mechanics is much too large for us to investigate in the detail required to avoid oversimplification, so only a few qualitative remarks can be made. After Isaac Newton published his Philosophiæ Naturalis Principia Mathematica (the three versions were published in 1687, 1713, and 1726), Newtonian mechanics came to dominate the theory of particles moving under
the influence of forces. As usual, some debates concerning priority took place, but the most useful formalism became known as Newtonian mechanics, a term which today still applies to what is called “classical mechanics”. Its purpose is to compute the locations of the particles of interest as functions of time for any specified forces. Since the basic equation, F = ma, involves acceleration, which is the second derivative of position, it is necessary to solve second-order differential equations. Several subsequent alternative formulations were invented to make the equations easier to solve, most notably by Joseph-Louis Lagrange and later Hamilton. These reformulations involve a physical quantity called action, which has dimensions of energy multiplied by time (these are also the dimensions of Planck’s constant, hence the synonym quantum of action).

In 1744 both Pierre Louis Maupertuis and Leonhard Euler discussed a metaphysical proposition that has come to be known as the Principle of Least Action. An unsubstantiated claim was made by Samuel König that his friend Gottfried Leibniz had discussed this principle in a letter dated 1707 after Leibniz observed that when light is refracted, the path it takes to get to where it ultimately goes is that which minimizes the action, but Leibniz did not publish this idea.

In 1788, Lagrange introduced a formulation of Newtonian mechanics using generalized coordinates and velocities, denoted q and q̇ respectively, that simplified many problems, and in the process he introduced a new physical quantity L (not to be confused with orbital angular momentum) that has since been named after him, the Lagrangian, the total kinetic energy T of a system minus its total potential energy V, i.e., L = T − V (care must be taken not to confuse the kinetic energy T with temperature, nor V with volume; the distinctions should be clear from the context). Hamilton showed that Lagrange’s equations followed from what is now called Hamilton’s Principle, another name for the Principle of Least Action applied to the Lagrangian. The integral of the Lagrangian between two times is equal to the change in action during that interval, and Hamilton’s Principle states that a physical system will evolve during that interval in such a way that this action is stationary. Setting the variation of this integral to zero yields Lagrange’s equations (see, e.g., Goldstein, 1959, Chapter 2).

Building on this, Hamilton defined a physical quantity H that has come to be known as the Hamiltonian and is equal to the total energy of the system, usually kinetic plus potential, expressed parametrically, i.e., with the potential energy written as an explicit function of all relevant independent variables. The kinetic energy is usually just the expression ½mv² written in terms of the canonical coordinates p and q, the momenta and position coordinates, respectively, of the objects of interest. Hamilton showed that the time evolution of the system proceeds according to the canonical equations

$$
\frac{dp}{dt} = -\frac{\partial H}{\partial q}, \qquad \frac{dq}{dt} = \frac{\partial H}{\partial p}, \qquad \frac{\partial H}{\partial t} = -\frac{\partial L}{\partial t}
\tag{5.20}
$$

These equations have the advantage of being first-order differential equations. This is made possible by bringing the energy in explicitly.

In section 5.4 we saw that Planck introduced quantized energy into physics because doing so
made the equations work perfectly with empirical information, and the acceptance of quantized energy was sealed when Poincaré showed that continuous energy interactions implied the Rayleigh-Jeans law with its ultraviolet catastrophe. But no physical model yet existed that could provide human intuition with an understanding of why energy was quantized. Bohr’s use of quantized energy in his model of the hydrogen atom shed some light on why energy is quantized through its dependence on quantized electron orbits, together with de Broglie’s observation that this suggested a wave-like propagation around the orbits for which standing waves implied quantization. But there was no physical model of what was propagating like a wave, and therefore to say that physicists understood what was going on would be a gross overstatement. Mathematical models were developing, but physical models to correspond with them were not keeping up, and it was during this period that the physicist’s most powerful tool, intuition, lost its grip. Further mathematical development was possible, guided exclusively by how well the formalism described experimental results. For example, atomic models were found to be improved by assigning spin to the electrons, as this could be used to explain the broadening and splitting of certain spectral lines in the presence of electric and magnetic fields. The concept of spin was familiar from classical physics, but not the fact that the electron’s spin was yet another quantized parameter. The process of feeling one’s way with heuristic mathematical hypotheses continued, keeping what worked and jettisoning what did not. It was in this manner that Schrödinger arrived at his wave equations and how the correspondence between mathematical operators and physical observables arose.

Schrödinger began by considering known wave functions, such as those of a vibrating string. When such a string is clamped at both ends (e.g., a piano wire), displacements settle into standing waves that are characterized as the superposition of two waves traveling in opposite directions. For the simplest case, in which the two waves have the same wavelength and frequency, the wave functions have the form

$$
\psi_1(x,t) = A_1 \sin\!\left(\frac{2\pi x}{\lambda} - 2\pi\nu t\right), \qquad
\psi_2(x,t) = A_2 \sin\!\left(\frac{2\pi x}{\lambda} + 2\pi\nu t\right)
\tag{5.21}
$$
where ψ1 is traveling to the right (toward larger x), and ψ2 is traveling to the left. Note that λ and ν are not independent; they are coupled by the nature of the medium in which the wave propagates, i.e., the crests of ψ1 travel to the right with a speed given by the phase velocity vp = λν, where vp is a property of the medium and may vary with wavelength, in which case the function ν = f(k), where k is the wave number, is nonlinear, and the medium is said to be dispersive. Figure 5-3 shows ψ1 and ψ2 for the case A1 = A2 = 1, λ = 4, and ν = 1/5 in plots A and B, respectively, sampled at four times spaced by Δt = ¼ from t = 0 to t = ¾ shown as solid, dashed, dash-dot, and dotted curves, respectively. Plot C shows the superposition ψ1+ψ2, which is a standing wave, shown with the same time samples, along with additional samples at the same time spacing filling out half of a complete cycle and shown as thinner solid curves. Real vibrating strings have standing waves composed of the linear superposition of many wavelengths, but each wavelength forms its own standing wave, so the simpler case of a single wavelength will suffice here.
Figure 5-3
A. A sine wave moving to the right; λ = 4, ν = 1/5, amplitude = 1; solid curve is at t = 0, dashed at t = ¼, dash-dot at t = ½, dotted at t = ¾.
B. A sine wave similar to A except moving to the left, samples at the same times.
C. The superposition of the waves in A and B, sampled at the same times as A and B and with additional samples shown as thinner solid curves spaced at Δt = ¼ to fill out half of a cycle.
D. Similar to C but with the amplitude of the sine wave in A doubled.
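The behavior summarized in Figure 5-3 is easy to reproduce numerically. The following sketch (NumPy and Matplotlib are assumed to be available; they are illustrative choices, not part of the original text) superposes the two traveling waves of Equation 5.21 with λ = 4, ν = 1/5, and A1 = A2 = 1 over half a cycle; the nodes of the sum stay fixed, which is what makes it a standing wave.

```python
# Sketch: superposition of oppositely traveling sine waves (Equation 5.21),
# with lambda = 4, nu = 1/5, A1 = A2 = 1, as in Figure 5-3C.
import numpy as np
import matplotlib.pyplot as plt

lam, nu = 4.0, 0.2
A1, A2 = 1.0, 1.0
x = np.linspace(0.0, 12.0, 600)

def psi1(x, t):  # wave traveling to the right
    return A1 * np.sin(2 * np.pi * x / lam - 2 * np.pi * nu * t)

def psi2(x, t):  # wave traveling to the left
    return A2 * np.sin(2 * np.pi * x / lam + 2 * np.pi * nu * t)

for t in np.arange(0.0, 2.5 + 1e-9, 0.25):   # half a cycle (the period is 1/nu = 5)
    plt.plot(x, psi1(x, t) + psi2(x, t), lw=0.8)
plt.xlabel("x")
plt.ylabel("psi1 + psi2")
plt.show()   # the nodes do not move: a standing wave
```

Setting A1 = 2 while keeping A2 = 1 reproduces the washed-out pattern of Figure 5-3D.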
The fact that the superposition ψ ≡ ψ1+ψ2 is a standing wave for A1 = A2 can be expressed directly by using the trigonometric identities

$$
\sin(\theta_1 + \theta_2) = \sin\theta_1\cos\theta_2 + \cos\theta_1\sin\theta_2, \qquad
\sin(\theta_1 - \theta_2) = \sin\theta_1\cos\theta_2 - \cos\theta_1\sin\theta_2
\tag{5.22}
$$

Then the superposition can be computed as follows:

$$
\begin{aligned}
\psi_1(x,t) &= A_1\left[\sin\frac{2\pi x}{\lambda}\cos 2\pi\nu t - \cos\frac{2\pi x}{\lambda}\sin 2\pi\nu t\right] \\
\psi_2(x,t) &= A_2\left[\sin\frac{2\pi x}{\lambda}\cos 2\pi\nu t + \cos\frac{2\pi x}{\lambda}\sin 2\pi\nu t\right] \\
\psi(x,t) &= \psi_1(x,t) + \psi_2(x,t) = \left(A_1 + A_2\right)\sin\frac{2\pi x}{\lambda}\cos 2\pi\nu t + \left(A_2 - A_1\right)\cos\frac{2\pi x}{\lambda}\sin 2\pi\nu t
\end{aligned}
\tag{5.23}
$$

If A1 = A2, then the second term in the superposition is zero, leaving only a sine wave in space whose amplitude A1+A2 is modulated by a cosine function in time, as may be seen in Figure 5-3C. If A1 ≠ A2,
then the standing wave is washed out, as illustrated in Figure 5-3D for the case A1 = 2 and A2 = 1 with the same time samples and curve styles as Figure 5-3C. The wave functions in Equation 5.21 are convenient for discussing standing waves represented by a string clamped at x = 0, since the sine wave is zero there, but a sine wave, even one with a phase term controlled by a time dependence, is not the most general form for a single-frequency wave function, nor is it the easiest to manipulate formally. We could have used a cosine wave with an explicit phase term, but this still requires formal trigonometric manipulations. Classical physicists discovered that the mathematics could be simplified by using Euler’s formula
$$
e^{i\theta} = \cos\theta + i\sin\theta
\tag{5.24}
$$
It is also useful to reduce notational clutter with the widespread definitions k ≡ 2π/λ, ω ≡ 2πν, and ħ ≡ h/2π, where k is called the angular wave number (which we must be careful not to confuse with the Boltzmann constant), ω is called the angular frequency, and ħ is called the reduced Planck constant. Then a wave traveling to the right in one dimension can be expressed
$$
\psi(x,t) = B e^{i(kx - \omega t + \psi_0)}
\tag{5.25}
$$
where ψ0 is the phase term, which can be absorbed into the amplitude by allowing it to be complex, i.e.,

$$
A \equiv B e^{i\psi_0}
\tag{5.26}
$$
so that
$$
\psi(x,t) = A e^{i(kx - \omega t)}
\tag{5.27}
$$
For classical applications, this form can be used by extracting only the real components at the end. For example, with ψ0 = π/2, all the same results shown in Figure 5-3 can be obtained. But during the trial-and-error process of developing Quantum Mechanics, it was found that this form is not only convenient, it is essential. The imaginary part of the wave function makes a material contribution to obtaining results that agree with experiments. Assuming that a free particle moving in the positive x direction has an associated wave function, it is reasonable to expect Equation 5.27 to be that wave function. The de Broglie relations specify its momentum as p = h/λ = ħk and its energy E = hν = ħω. Since it is a freely moving particle, it has no forces on it, and so its energy is entirely kinetic relative to the frame in which its motion is measured, so E = mv²/2 = p²/2m, where m is the particle’s reference-frame mass. We observe that the time derivative of the wave function is
$$
\frac{\partial \psi(x,t)}{\partial t} = -i\omega A e^{i(kx - \omega t)} = -i\omega\,\psi(x,t) = -i\frac{E}{\hbar}\,\psi(x,t)
\tag{5.28}
$$
where we have substituted E/ħ for ω on the right-hand side. Multiplying both sides of the equation by iħ gives
$$
i\hbar\,\frac{\partial \psi(x,t)}{\partial t} = E\,\psi(x,t)
\tag{5.29}
$$
This allows us to associate the two multipliers of ψ(x,t) with each other, the differential operator on the left and the energy on the right. We call the former the energy operator and denote it E^. Since the Hamiltonian is equal to the total energy, this operator is also frequently denoted H^ and called the Hamiltonian operator:

$$
\hat{E} = \hat{H} = i\hbar\,\frac{\partial}{\partial t}
\tag{5.30}
$$

A similar approach using the derivative of ψ(x,t) with respect to x yields the momentum operator:

$$
\begin{aligned}
\frac{\partial \psi(x,t)}{\partial x} &= ikA e^{i(kx-\omega t)} = ik\,\psi(x,t) = i\frac{p}{\hbar}\,\psi(x,t) \\
-i\hbar\,\frac{\partial \psi(x,t)}{\partial x} &= p\,\psi(x,t) \\
\hat{p} &= -i\hbar\,\frac{\partial}{\partial x}
\end{aligned}
\tag{5.31}
$$
Other operators can be derived for orbital angular momentum, spin angular momentum, etc. The position operators are simply equal to the position variables, e.g., x^ = x. The advantage of these operators is that they permit us to write equations like Equation 5.29 and the middle line of Equation 5.31 in the form of eigenvalue equations:
$$
\hat{H}\psi = E\psi, \qquad \hat{p}\psi = p\psi
\tag{5.32}
$$
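A quick symbolic check of Equation 5.32 can be made by applying the two operators to the plane wave of Equation 5.27. The sketch below assumes SymPy is available (an illustrative choice, not something used in this text); it simply confirms that the plane wave is a simultaneous eigenfunction of the energy and momentum operators with eigenvalues ħω and ħk.

```python
# Sketch: the plane wave of Equation 5.27 is an eigenfunction of the energy
# operator i*hbar*d/dt and the momentum operator -i*hbar*d/dx (Equation 5.32).
import sympy as sp

x, t, k, omega, hbar, A = sp.symbols('x t k omega hbar A', positive=True)
psi = A * sp.exp(sp.I * (k * x - omega * t))     # Equation 5.27

E_op_psi = sp.I * hbar * sp.diff(psi, t)         # energy operator applied to psi
p_op_psi = -sp.I * hbar * sp.diff(psi, x)        # momentum operator applied to psi

print(sp.simplify(E_op_psi / psi))               # hbar*omega: the energy eigenvalue
print(sp.simplify(p_op_psi / psi))               # hbar*k: the momentum eigenvalue
```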
There are also operators for kinetic and potential energy. The latter is just the potential energy itself, V^ = V(x), which we assume is not an explicit function of time, i.e., the forces acting on the particles are conservative. The former is obtained from the momentum operator according to the usual definition of kinetic energy, T = p²/2m. The Hamiltonian operator can therefore also be expressed as the sum of the kinetic and potential operators

$$
\hat{H} = \hat{T} + \hat{V} = \frac{\hat{p}^2}{2m} + \hat{V} = -\frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2} + V(x)
\tag{5.33}
$$
If we take the next derivative of ψ(x,t) with respect to x, we obtain
$$
\frac{\partial^2 \psi(x,t)}{\partial x^2} = -k^2 A e^{i(kx-\omega t)} = -k^2\,\psi(x,t)
\tag{5.34}
$$
Using k² = p²/ħ²,

$$
\frac{\hbar^2 k^2}{2m} = \frac{p^2}{2m} = T = E - V(x), \qquad
k^2 = \frac{2m}{\hbar^2}\left[E - V(x)\right]
\tag{5.35}
$$
Equation 5.34 becomes
$$
\frac{\partial^2 \psi(x,t)}{\partial x^2} = \frac{2m}{\hbar^2}\left[V(x) - E\right]\psi(x,t)
\tag{5.36}
$$
This is the nonrelativistic time-independent Schrödinger Equation for one dimension. It was derived heuristically but found to work perfectly for stationary states, which is why it is called “time-independent”. Note that “stationary” does not mean that the particle is stationary, but rather that its state is stationary, like that of an electron in an atomic orbit, a free particle, a particle confined to a box, or a linear harmonic oscillator, each of which is characterized by its own specific V(x). This equation can be put into a different form by multiplying both sides by −ħ²/2m and rearranging to obtain
$$
\begin{aligned}
-\frac{\hbar^2}{2m}\frac{\partial^2 \psi(x,t)}{\partial x^2} + V(x)\,\psi(x,t) &= E\,\psi(x,t) \\
\left[\hat{T} + \hat{V}\right]\psi(x,t) &= E\,\psi(x,t) \\
\hat{H}\,\psi(x,t) &= E\,\psi(x,t)
\end{aligned}
\tag{5.37}
$$
This is Equation 5.36 written in the form of an eigenvalue equation, as in Equation 5.32. It shows that the particle’s energy must be an eigenvalue of the Hamiltonian operator. In practice, since V(x) must be spelled out in order to solve a particular problem, Equation 5.36 is the form most often employed. The demonstration above that this equation is consistent with wave functions of the form in Equation 5.27 depends on knowing the equation in advance and hence bears no resemblance to the path Schrödinger traveled to derive it. In a series of papers published in 1926, he took a much more circuitous journey whose steps we will not retrace other than to sketch them at a high level. He began with Hamilton’s principle applied to a single particle, computed the gradient of the action, interpreted the result geometrically as a series of surfaces of constant action, and considered how these surfaces evolve in time. What he saw was a propagating system of stationary waves similar to those in an optical medium that is isotropic and dispersive. Following Hamilton, he associated the trajectories of the vectors normal to the surfaces with paths of optical rays consistent with the Huygens-Fresnel theory of wave propagation. The new ingredient was perceiving these trajectories to be the possible
paths available to a material point subject to a force defined by the potential even in the long-wavelength regime where interference effects are important. As Schrödinger (1926b) says, “... the equations of ordinary mechanics will be of no more use for the study of these micro-mechanical wave phenomena than the rules of geometrical optics are for the study of diffraction phenomena.” From there Schrödinger introduced stationary wave functions whose phase is given by the action and was able to show that the particle associated with a given wave of frequency ν must have energy E = hν. He assumed that the waves obey the familiar wave equation, which he tested with a potential describing an electron affected by the Coulomb attraction of a proton. Requiring the wave function solutions to be finite, continuous, and single-valued constrains the energy solutions to two sets, one simply E > 0 (the continuum of free electrons), and the other being En = -Wn, binding energies corresponding to Bohr’s electron orbital energies for the hydrogen atom given by Equation 5.18. This convincing success opened the door to a vast array of subsequent developments whose details are not needed for our purposes herein, but we should mention that when the Hamiltonian operator in the form of Equation 5.29 is used on the left-hand side of Equation 5.37 and the one in Equation 5.33 is used for the energy on the right-hand side, we obtain the time-dependent Schrödinger Equation for one dimension
$$
i\hbar\,\frac{\partial \psi(x,t)}{\partial t} = -\frac{\hbar^2}{2m}\frac{\partial^2 \psi(x,t)}{\partial x^2} + V(x)\,\psi(x,t)
\tag{5.38}
$$
which can handle nonrelativistic problems involving nonstationary states. Finally we mention that relativistic forms of these equations were developed, notably by Oskar Klein and Walter Gordon in 1926 and by Paul Dirac in 1928, and that Werner Heisenberg developed a very different but equivalent (once properly modified by Born and Jordan, 1925) quantum-mechanical formalism based on matrices instead of waves in 1925. In Heisenberg’s formulation, the operators take the form of matrices as described in section 5.3 above. We will return to Heisenberg’s matrix mechanics below in the discussion of the famous uncertainty principle, but otherwise Schrödinger’s formulation is more appropriate for our purpose of exploring the role of randomness in physical processes. Since the Schrödinger Equations are linear in the wave function, if ψ1 and ψ2 are both solutions, then their sum ψ = ψ1 + ψ2 is also a solution. In typical applications, the wave function is a superposition of many simple wave functions of the form given in Equation 5.27:
$$
\psi(x,t) = \sum_n \psi_n(x,t) = \sum_n A_n\, e^{i(k_n x - \omega_n t)}
\tag{5.39}
$$
The exp(i(knx - ωnt)) are called the eigenfunctions of ψ, and since each contains a single frequency, it has its own associated energy En = ħωn. These are the eigenvalues of the energy operator, the energies that can result from a measurement. In addition, since the momentum pn = ħkn, a single wave number implies a single momentum, the eigenvalues of the momentum operator. So if ψ consists of a single eigenfunction, the energy and momentum of the particle’s state would be known exactly. In practice, the ψn must be computed by solving the Schrödinger Equation given the potential appropriate for the problem of interest. We have said above that any given particle has an associated wave. This terminology was used because historically physics dealt with objects that could be viewed as individual point or extended
particles or macroscopic conglomerations of particles. Such objects were considered the fundamental objects of study, including atoms once they were accepted as real objects. The “associated wave” gave rise to what was called the “wave-particle duality”. Today it is widely considered that there is no such duality; the fundamental object is the wave. The particle is what is “associated” with the wave as a way of picturing the manifestation of the wave as a familiar object. It is only because macroscopic objects are associated with such extremely small wavelengths that we continue to find it convenient to picture particles as being involved. Once experimental physics became capable of probing the atomic level, wavelengths long enough to produce interference effects in measurement results were finally encountered. Of course for effects involving light, this had been going on for a long time, and the debate concerning whether light consisted of waves or particles foreshadowed the discovery of a greater question. For now, we have only a picture of a process propagating in accordance with wave constraints and interacting in quantized events. The nature of the wave-propagating medium remains a mystery. The initial interpretation of Schrödinger’s waves was that they somehow guide the motion of material particles and that the various eigenvalues represent possible measurement outcomes, but the significance of the amplitudes was not clear, nor what it meant for a material particle to be in a superposition of multiple states. Early attempts to understand the latter included a study of “wave packets”.

5.7 Wave Packets

In classical physics, there was never any question about whether a particle had an exact location. In statistical mechanics, those locations had to be treated as random variables, but location was always assumed to be a straightforward property of each particle. Point particles had point coordinates, and extended particles could be located just as specifically by using the coordinates of their centers of mass as their addresses. Once particles became associated with waves, the picture became somewhat blurred. Waves such as those in Equation 5.27 extend over the entire real line. It was not clear what the location of an associated particle would be. But some superpositions of waves, such as in Equation 5.39, can have at least some localization in the sense that the net amplitude has dominant maxima in one or more confined regions. Such a concentration of net amplitude in a limited region is called a wave packet. Since each term in the summation corresponds to a single wave number, it is associated with a single momentum. It is well known from Fourier analysis that the more frequency components there are, the sharper and more confined the superposition can be. Ironically, this means that the more different momentum components there are in the superposition, the more localized the wave packet can be, i.e., the more nearly classically localized the associated particle is, the less classical its momentum state.

In the previous section we mentioned that when Schrödinger applied his formalism to an electron influenced by the Coulomb potential of a proton, he found two sets of solutions, one with discrete energies corresponding to Bohr’s hydrogen orbits, and one with continuous energies corresponding to electrons that are not bound to the proton despite being affected by its charge, just as the Earth is affected by Jupiter’s gravity but not captured in an orbit around Jupiter.
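The split into discrete bound energies plus a continuum can be illustrated with a very small numerical experiment. The sketch below is not Schrödinger’s Coulomb calculation; it solves Equation 5.36 by finite differences for a one-dimensional square well in natural units (ħ = m = 1), with the well depth, width, and grid chosen arbitrarily for illustration. The negative eigenvalues come out as a handful of discrete bound states, while the positive ones form a dense quasi-continuum, discretized only because the numerical box is finite.

```python
# Sketch: bound states of the time-independent Schrodinger equation (Equation 5.36)
# for a finite square well, by finite differences.  Natural units hbar = m = 1;
# the well parameters and grid are arbitrary illustrative choices.
import numpy as np

hbar, m = 1.0, 1.0
N, L = 1000, 20.0                        # grid points and box half-width
x = np.linspace(-L, L, N)
dx = x[1] - x[0]

V0, a = -5.0, 1.0                        # well depth and half-width
V = np.where(np.abs(x) < a, V0, 0.0)

# Hamiltonian -(hbar^2/2m) d^2/dx^2 + V(x) with a three-point second derivative.
diag = hbar**2 / (m * dx**2) + V
off  = -hbar**2 / (2 * m * dx**2) * np.ones(N - 1)
H = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)

E = np.linalg.eigvalsh(H)
print(E[E < 0])        # a few discrete negative energies: the bound states
print(len(E[E >= 0]))  # the rest: a dense quasi-continuum of unbound states
```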
Thus Quantum Mechanics does not say that all energy spectra are discrete. A particle that is not trapped in any way has a continuous kinetic energy spectrum, and in such a case, the summation in Equation 5.39 must be replaced by a corresponding integral:
$$
\psi(x,t) = \int A(k)\, e^{i(kx - \omega(k) t)}\, dk
\tag{5.40}
$$
Continuous energy implies an infinite number of states, and this is also true of discrete states for many kinds of observables, i.e., an infinite number of terms may be in the summation in Equation 5.39. Such situations allow the associated wave packet to have a single maximum, whereas if the number of component waves is finite, then the waveform will be periodic, with duplicated maxima on the real line. Just as with black-body radiation, the existence of an infinite number of states does not imply that all these states are significantly occupied in pre-measurement superpositions. The frequency in the equation E = ħω can be continuous, as it is in black-body radiation, but as seen in Equation 5.15 (p. 241), the contribution of arbitrarily high frequencies is vanishingly small. The energy quantization stems from the fact that photons are constrained to interact only in quantized amounts of size nħω, not from any quantization of ω itself in this case (although many models used to analyze black-body radiation are based on standing waves whose wavelengths are integer submultiples of the cavity size, implying quantization of wavelength and frequency, there is no need in general for the cavity to be geometrically regular).

For simplicity of illustration, we will consider a free-particle wave packet composed of a finite number of discrete states with the net maximum amplitude in the vicinity of the origin of the x axis, ignoring the periodic repetitions far from the origin as approximation errors due to the limited number of states. Including an infinite number of states would not change the qualitative features that we wish to illustrate, but such wave packets are very difficult to plot clearly because of the extremely dense fluctuations which blur together, leaving only the envelope visible. For a sample particle, we will use the C60 molecule mentioned in the previous section, for which the effective mass is 1.2549×10⁻²⁴ kg, the classical speed relative to the laboratory frame is 220 m/s, and the de Broglie wavelength is 2.4×10⁻¹² m. We will construct an artificial wave packet whose components all have unit amplitude. Since we are considering a nonrelativistic free particle with nonzero rest mass, the energy is all kinetic, and
$$
\begin{aligned}
E &= \frac{p^2}{2m} = \frac{\hbar^2 k^2}{2m} \\
\omega &= \frac{E}{\hbar} = \frac{\hbar k^2}{2m}
\end{aligned}
\tag{5.41}
$$
So the frequency is a nonlinear function of the wave number, and therefore as stated in section 5.6, the medium in which the wave propagates is dispersive, i.e., a wave packet at maximum confinement will grow broader as time increases. To see this specifically, we compute the speed vg of the particle associated with each wave component:

$$
p = \hbar k = m v_g, \qquad v_g = \frac{\hbar k}{m}
\tag{5.42}
$$
where the subscript g indicates that the particle moves at a speed given by what is called the group
velocity of the wave. Note that vg is the derivative of ω with respect to k. This is a general property, not an exclusive aspect of free particles. Although the group velocity of a monochromatic wave is not a particularly intuitive property of a wave, when that wave is part of a wave packet, the average group velocity determines the speed at which the packet itself moves, and since the packet is associated with a particle, this corresponds to the classical velocity of that particle.

A brief but important digression concerns the equations for a photon propagating in vacuum corresponding to the last two sets of equations for a massive particle. Since a photon has zero rest mass, Equations 5.41 and 5.42 are not applicable, but a photon nevertheless has energy and momentum, given in this case by E = pc = ħω and as usual p = ħk, respectively. So we have p = ħω/c = ħk, and ω = ck, the frequency is linear in the wave number, the medium is nondispersive, and photon wave packets in vacuum do not spread as time increases. Since the group velocity is the derivative of ω with respect to k, we have vg = c, which is also the phase velocity in this case. For photons not in vacuum, the situation is more complicated, but the media are generally dispersive, otherwise Herschel and Ritter could not have discovered infrared and ultraviolet electromagnetic radiation, respectively, as mentioned in section 5.4.

Returning to the free C60 molecule: the wavelength of 2.4×10⁻¹² m is the central wavelength for the packet, which we will denote λ0. The angular wave number is k0 = 2.62×10¹² radians/m. By the second line of Equation 5.41, the angular frequency is ω0 = 2.88×10¹⁴ radians/s. The frequencies of other packet components are given by the same formula as functions of their wave numbers. For our artificial wave packet, we will use 21 components with wavelengths evenly spaced from 4λ0/5 to 6λ0/5, a range of ±20%. This is a substantial spread of wavelengths, hence momenta, hence group velocities, and so this wave packet will disperse very rapidly. Since we are using uniformly spaced wavelengths, the wave numbers and momenta and group velocities are not uniformly spaced because they all vary as the inverse of wavelength. As a result, the initially symmetric wave packet will become asymmetric as it disperses. This is not a general property of wave packets; many different packet distributions of wave number have been studied, such as Gaussian distributions whose packet shapes remain symmetric as they broaden. Figure 5-4A shows the real part of the wave packet at three time samples, 0, 9×10⁻¹⁴, and 1.8×10⁻¹³ s, with solid, dashed, and dash-dot curve styles, respectively. The imaginary parts of these three samples are similar but 90° out of phase with their real counterparts. The packet is clearly dispersing rapidly as it moves to the right, as it was designed to do in order to put three significantly different samples on the same plot. Realistic wave packets generally do not disperse this rapidly because their components have amplitudes that are not all equal, usually falling off faster with difference in wavelength relative to the dominant component. Contemplating wave packets did not clear up the mysteries involved in attempts to interpret the new microphysical formalisms, but it did provide hints of what was soon to come: inescapable uncertainties in the simultaneous estimates of certain physical parameters and the emergence of the role of randomness in quantum-mechanical processes.
Figure 5-4
A. The real part of an artificial wave packet composed of 21 unit-amplitude components spanning a ±20% wavelength range for a C60 molecule moving at 220 m/s sampled at t = 0 (solid curve), t = 9×10⁻¹⁴ s (dashed curve), and t = 1.8×10⁻¹³ s (dash-dot curve); the packet may be seen to be rapidly dispersing.
B. The (unnormalized) position probability densities for the three samples in A, using the same curve styles.
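For readers who want to experiment, the sketch below rebuilds the artificial packet of Figure 5-4A from the numbers given in the text (21 unit-amplitude components spanning ±20% in wavelength around λ0 = 2.4×10⁻¹² m, effective mass 1.2549×10⁻²⁴ kg) and the dispersion relation of Equation 5.41. NumPy and Matplotlib are assumed for illustration; the plotting range is an arbitrary choice.

```python
# Sketch: the artificial C60 wave packet of Figure 5-4A.  Each component is a
# unit-amplitude plane wave exp(i(kx - omega*t)) with omega = hbar k^2 / (2 m),
# the free-particle dispersion relation of Equation 5.41.
import numpy as np
import matplotlib.pyplot as plt

hbar = 1.0546e-34              # reduced Planck constant, J s
m    = 1.2549e-24              # effective C60 mass used in the text, kg
lam0 = 2.4e-12                 # central de Broglie wavelength, m

lams = np.linspace(0.8 * lam0, 1.2 * lam0, 21)    # 21 evenly spaced wavelengths
ks   = 2.0 * np.pi / lams                         # angular wave numbers
oms  = hbar * ks**2 / (2.0 * m)                   # dispersive angular frequencies

x = np.linspace(-2e-11, 8e-11, 4000)
for t, style in [(0.0, '-'), (9e-14, '--'), (1.8e-13, '-.')]:
    psi = sum(np.exp(1j * (k * x - w * t)) for k, w in zip(ks, oms))
    plt.plot(x, psi.real, style, lw=0.8, label=f"t = {t:.1e} s")
plt.xlabel("x (m)")
plt.legend()
plt.show()
# The packet drifts to the right at roughly the mean group velocity (about 220 m/s)
# and broadens visibly, because the group velocity hbar*k/m differs across components.
```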
5.8 The Heisenberg Uncertainty Principle

The year before Schrödinger was able to publish his first paper on his wave-function approach to Quantum Mechanics, Werner Heisenberg (1925) published his own formalism using an entirely different kind of mathematics. Heisenberg was influenced by suggestions made by Wolfgang Pauli, Max Born, and Pascual Jordan that were aimed at alleviating the quandary regarding how electrons move in their atomic orbits. Bohr had shown that these orbits involved quantized energy levels and had made some heuristic calculations about orbital frequencies and angular momenta, but no acceptable solution had emerged to satisfy the classic goal of physics, a determination of a particle’s position as a function of time. As Pauli and Born (among others) pointed out, an electron’s position in its orbit at a given instant of time was not observable. Orbital energies could be deduced by observing spectral lines, and orbital angular momenta could be related to known electromagnetic interactions, but there was no way to measure the position of an electron in the sense that the position of the planet Mercury in its orbit around the Sun can be observed. A philosophical principle that denied physical meaning to unobservable phenomena began to gain traction, much as many physicists considered atoms not to be real around the end of the 19th century. This principle falls within the realm of what is called logical positivism, while the principle that aspects of physical reality may be meaningfully said to exist despite being unobserved or even unobservable became labeled realism. Of course, once elementary particles began to be observed, the consensus regarding atoms shifted, but not as a repudiation of positivism, simply because the idea that atoms exist began to satisfy the demands of positivism. The classical position of a free particle had to be considered observable, but the position of an electron in an atomic orbit could not, and so Heisenberg attempted to design a formalism whose application to the hydrogen atom would make no reference to the electron’s position, at least not in the final formulas, although such a position does appear and plays a crucial role in the derivation of those formulas.

As we will see below, recognizing the importance of random variables in quantum processes is generally attributed to Max Born in his interpretation of Schrödinger’s wave mechanics, but in fact probabilities made their appearance earlier than that in Heisenberg’s formalism, which also resembles Fourier synthesis, an essential ingredient of the wave packets described in the previous section. Given that the two approaches in their refined forms were soon proved to be mathematically equivalent, the initial appearance of a complete lack of resemblance was bound to be revealed as somewhat misleading. Both depend crucially on trigonometric and probabilistic concepts. It appears (to the author at least) that in both cases, the randomness was viewed early on as the epistemic randomness of a statistical ensemble, and only later did the idea of a single particle experiencing nonepistemic randomness arise and create a vigorous debate, as we will see in the next section.

Heisenberg expressed the energy differences in Bohr’s stationary states (Equation 5.19, p.
246) in terms of a level denoted n and a level denoted n-α (replacing n2 and n1, respectively, for positive α, but α may also be negative) and divided by ħ to get a formula for ω(n,n-α), the frequency of the photon emitted or absorbed in this transition. For n → n-α to represent a transition, we must have n > 0, n-α > 0, and α ≠ 0. The enforcement of these constraints is assumed in the summation in Equation 5.43 below. The set of values taken on by α for each n is infinite, and Heisenberg used this infinite set of frequencies as the basis for a trigonometric-series expansion of the electron position in one dimension during its periodic motion:
$$
x(n,t) = \sum_{\alpha} X(n,\,n-\alpha)\, e^{i\omega(n,\,n-\alpha)\,t}
\tag{5.43}
$$
where the X(n,n-α) serve a purpose similar to Fourier coefficients in a Fourier expansion. Heisenberg called these the transition amplitudes and identified them as the key observable parameters. This differs from the usual Fourier series, in which integer multiples of a fundamental frequency are used. He computed the power per unit time that would be emitted by an electron experiencing this accelerated motion if classical electromagnetic theory applied, then identified this power as being the product of the electromagnetic energy and a transition probability. He arranged the probabilities for all possible transitions in a two-dimensional table P(n,n-α). Since n is an element of an infinite set, Equation 5.43 describes an infinite ensemble. In one of the leaps taken in the paper, Heisenberg asserted that this is actually an ensemble of x(t). He considered how one would compute x(t)² and saw that it involved the product of two such infinite summations. At this point Heisenberg noticed that because the base frequencies do not combine in the same way as the harmonics of a standard Fourier series, given two quantities expressed as such series q1(t) and q2(t), the result for q1(t)q2(t) was not always the same as that for q2(t)q1(t). Similarly one can take the time derivative of Equation 5.43 to get the velocity and thus the electron’s momentum p, and it turned out that the expression for px was not the same as for xp. These anomalies seemed like signs that something was wrong. Nevertheless, he showed his work to Wolfgang Pauli and Max Born, who encouraged him to pursue it further, and so he published it in 1925. This paper is generally considered difficult to follow, because Heisenberg omitted detailed explanations for many of the steps in the derivation. For a good English-language commentary on the paper, see Aitchison, MacManus, and Snyder (2008).

After Max Born had studied it in more detail, he recognized Heisenberg’s tables as matrices that could be manipulated mathematically according to matrix algebra, with which most physicists were unfamiliar at the time. Born worked with Pascual Jordan to transform Heisenberg’s theory into a matrix-algebra representation and published their result that same year (Born and Jordan, 1925), and this paper contained the first appearance of what is now called the “commutation relation”, x^p^ - p^x^ = iħ. Before the year was out, all three published a more complete paper describing the new “matrix mechanics” formalism (Born, Heisenberg, and Jordan, 1925). After initial difficulties in applying the formalism quantitatively to the hydrogen atom, Pauli (1926) was able to show that it reproduced Bohr’s energy levels (Equations 5.18 and 5.19) and other effects such as spectral-line broadening and splitting due to electric and magnetic fields. The commutation relation was found to apply in general to quantum systems, not merely electrons in hydrogen atoms. Thus it is also true for free particles, for which the position is an observable parameter. To derive the commutation relation the way that Heisenberg did it would require too much of a digression into the details of matrix mechanics, but we can obtain a grasp of it from the wave mechanics described in section 5.6 above. For now we will just mention that in matrix mechanics, matrices act as operators analogous to differential operators in wave mechanics.
In both cases, a measurement of momentum followed by a measurement of position (for example) yields different results from those obtained when the measurements are done in the opposite order. This corresponds to applying the operators in different orders. Making a measurement of one observable on a system that has already had a different observable measured involves applying the former operator on an expression to which the latter operator has been applied, and this produces the
multiplication of the two operators. That different results may be obtained for different application orders in matrix mechanics can be seen in the very simple case of 2×2 matrices. Consider the basic operator matrices A and B; we denote the application order AB as C1 and the application order BA as C2:

$$
\begin{aligned}
A &= \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad
B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} \\
C_1 &= AB = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix} \\
C_2 &= BA = \begin{pmatrix} b_{11}a_{11} + b_{12}a_{21} & b_{11}a_{12} + b_{12}a_{22} \\ b_{21}a_{11} + b_{22}a_{21} & b_{21}a_{12} + b_{22}a_{22} \end{pmatrix} \\
C_1 - C_2 &= \begin{pmatrix} a_{12}b_{21} - a_{21}b_{12} & a_{11}b_{12} + a_{12}b_{22} - a_{12}b_{11} - a_{22}b_{12} \\ a_{21}b_{11} + a_{22}b_{21} - a_{11}b_{21} - a_{21}b_{22} & a_{21}b_{12} - a_{12}b_{21} \end{pmatrix}
\end{aligned}
\tag{5.44}
$$
If A and B commute, then the four matrix elements of C1 - C2 will be zero, e.g., a12b21 must equal a21b12, etc. These conditions are satisfied trivially when A and B are both diagonal matrices. One way to visualize noncommuting behavior is provided by rotations of physical objects, which can be represented as Euler rotation matrices. For example, if an airplane rotates 30° about its roll axis, then 45° about its pitch axis, it does not end up in the same orientation or flying direction as it would if it had first rotated 45° about its pitch axis and then 30° about its roll axis. The two Euler rotation matrices do not commute. If however, for example, both angles had been 180° (which makes the rotation matrices diagonal), the two rotations would commute, and the airplane ends up with the same orientation and flying direction either way, right-side up and flying in a direction opposite to the original (assuming it started out right-side up and had time to complete the pitch maneuver). In matrix mechanics, operator matrices are Hermitian with generally complex elements. They commute when their corresponding observables can be simultaneously measured as eigenstates, and otherwise they do not. By computing the commutator AB - BA (designated [A,B]) it can be determined theoretically whether simultaneous sharp knowledge of the two observable parameters is possible via measurement. For example, when A and B are the position and momentum operators, Heisenberg discovered that they do not commute, the results from a pair of measurements will depend on the order in which they are made, and it is impossible for the final system to be in an eigenstate of both observables. Since one of the parameters must be in a superposition of states, it will have a greater uncertainty, and at least one of the parameters must suffer this fate. This was somewhat shocking and took a couple of years to soak in. During that interval, Schrödinger published his 1926 paper, and the question of whether a similar result was to be found in his formalism was answered in the positive.

To his credit, Heisenberg (1927) attempted to interpret this phenomenon intuitively in classical terms. He described a laboratory measurement of the position and momentum of an electron. He made the plausible claim that the most accurate way to measure the position is to illuminate the electron and observe the Compton-scattered photon in a microscope.
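The airplane example above is easy to check numerically. The sketch below (NumPy assumed; the specific axis conventions for the rotation matrices are an illustrative assumption) shows that the 30° roll and 45° pitch rotations do not commute, while the two 180° rotations, whose matrices are diagonal, do.

```python
# Sketch: Euler rotation matrices generally do not commute, echoing the
# noncommuting operator matrices of matrix mechanics.
import numpy as np

def roll(angle_deg):           # rotation about the x (roll) axis
    a = np.radians(angle_deg)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(a), -np.sin(a)],
                     [0.0, np.sin(a),  np.cos(a)]])

def pitch(angle_deg):          # rotation about the y (pitch) axis
    a = np.radians(angle_deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

R, P = roll(30.0), pitch(45.0)
print(np.allclose(P @ R, R @ P))   # False: the commutator [roll, pitch] is nonzero
print(np.allclose(pitch(180.0) @ roll(180.0), roll(180.0) @ pitch(180.0)))  # True
```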
Assuming an absolutely ideal apparatus, the initial trajectory of the photon is perfectly known and is perpendicular to the x axis by definition of that axis, and the electron may be considered initially stationary on that axis to arbitrary accuracy. If the photon hits the electron, it will be scattered, and the electron’s position will be known from the photon’s initial path, but only with an uncertainty σx limited by the photon’s wavelength, so it would be best to use photons with very small wavelength. Such photons carry higher momentum, however, so the disturbance of the electron’s momentum will be proportionately greater, but the angle of deflection will indicate the amount of transverse momentum imparted to the electron. The measurement of this angle, however, is limited to a fundamental nonzero extent by diffraction in the microscope, producing a momentum uncertainty σp. Thus both the position and the momentum of the electron have inescapable accuracy limitations. The momentum uncertainty can be reduced by using photons with longer wavelengths, but only at the expense of greater position uncertainty. Improving one accuracy degrades the other, and by an argument based on simple classical physics, Heisenberg showed qualitatively that the absolute best that can be done is to limit the product of the uncertainties to σp σx ≈ h, where he was not specific about probability distributions or confidence intervals for the uncertainties. It was later shown that the right-hand side of this Heisenberg Uncertainty Principle takes different forms for different noncommuting observable pairs, physical situations, and error distributions, but discovering the fact that such a principle exists at all was monumentally historic. For the situation described by Heisenberg, Kennard (1927) showed that the precise statement involving 1σ uncertainties is σp σx ≥ ħ/2.

The Uncertainty Principle can be understood qualitatively to be implied by the commutation relation for noncommuting operators as follows: the commutation relation states clearly that the two different measurements yield different results when performed in opposite orders, hence putting the system into an eigenstate of the second operator prevents it from remaining in the eigenstate of the first, and when the first observable is forced into a superposition of states, its uncertainty must increase, because more states are possible. The uncertainty principle can be derived from the commutation relation, but this would require delving more deeply into the mathematics specific to Quantum Mechanics than is otherwise needed for our purposes, so we must omit it from our scope. It can be found in any standard textbook on Quantum Mechanics (e.g., Merzbacher, 1967). As already suggested, the Uncertainty Principle may also be seen lurking in the wave packet for the free particle discussed above: if there is only one frequency present, then the associated particle is in a corresponding momentum eigenstate with zero uncertainty, but the wave is distributed equally over the entire real line, and so the uncertainty of the particle’s position is unbounded (we will see below that this situation is not physically realistic because it violates certain normalization constraints on the wave packet).
We can reduce the position uncertainty by adding more frequency components to form a somewhat localized wave packet, but this creates a superposition of momentum states, since each new frequency brings with it a new momentum state, and so the momentum uncertainty is increased. The position and momentum uncertainties are at odds; reducing one increases the other. This is really a more fundamental view of the Uncertainty Principle than Heisenberg's classical explanation based on how any measurement “disturbs” the system, except in the sense of measurements “disturbing” wave functions. If it were possible to reduce the position uncertainty of a free particle without mechanical interference, the result would still be to increase the momentum uncertainty, because the more localized wave packet would necessarily have more frequency
components, hence more possible momentum states. This relationship becomes obvious when one contemplates the fundamental nature of wave packets. We have considered wave functions defined on the x axis, i.e., a position axis, but it is possible to change the independent variable from position to momentum via a Fourier Transform. Without getting into more details than necessary, it is well known that the more confined a function is (e.g., a wave packet on the x axis), the more spread out its Fourier Transform is (e.g., a wave packet on the p axis), and vice versa. This fact will ultimately defeat any attempt to measure both position and momentum simultaneously to arbitrary accuracy. Heisenberg’s classical interpretation assumes that after the measurement, the electron really does possess an exact momentum, and its position really is an exact function of time, but both have opposing epistemic uncertainties caused by a disturbance due to the measuring apparatus, a disturbance whose effects cannot be determined to arbitrary accuracy and which has come to be called the “Observer Effect”. The modern consensus, aside from some “hidden-variable” theories, is that the electron does not intrinsically possess simultaneously sharp position and momentum, rather its fundamental nature is 100% bound to the wave function, and the wave function demands the presence of multiple values of at least one of the two observables. The commutation relation can be derived using wave mechanics as follows. Recall that the position operator x^ is just the x variable itself, and the momentum operator p^ is the differential operator on the last line of Equation 5.31. Applying the commutator to the wave function ψ yields
$$[\hat{x},\hat{p}]\,\psi = (\hat{x}\hat{p}-\hat{p}\hat{x})\,\psi
= x\left(-i\hbar\,\frac{\partial \psi}{\partial x}\right) - \left(-i\hbar\,\frac{\partial}{\partial x}\right)(x\psi)
= -i\hbar\,x\,\frac{\partial \psi}{\partial x} + i\hbar\,\psi + i\hbar\,x\,\frac{\partial \psi}{\partial x}
= i\hbar\,\psi \qquad (5.45)$$
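The algebra in Equation 5.45 is easy to verify symbolically; the following sketch (my own aside, using SymPy, with ψ left as an unspecified function of x) reproduces the result iℏψ:

```python
import sympy as sp

x = sp.Symbol('x', real=True)
hbar = sp.Symbol('hbar', positive=True)
psi = sp.Function('psi')(x)

x_op = lambda f: x * f                           # position operator: multiplication by x
p_op = lambda f: -sp.I * hbar * sp.diff(f, x)    # momentum operator: -i*hbar d/dx

commutator = x_op(p_op(psi)) - p_op(x_op(psi))   # [x, p] applied to psi
print(sp.simplify(commutator))                   # prints I*hbar*psi(x)
```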
The commutation relation [x̂, p̂] = iℏ is obtained by equating the multipliers of ψ on each side of Equation 5.45. Two examples of other observables whose operators do not commute are: energy and time; kinetic energy and position. Such pairs of noncommuting parameters are called canonically conjugate variables (not to be confused with the complex conjugates introduced in section 5.3). When the formalism is extended to more than one dimension, it can be shown that the y position operator does commute with the x momentum operator, for example, and as mentioned in the last paragraph of section 5.2, the operators for the projection of the orbital angular momentum on the z axis (or any other Cartesian axis) and the square of the total orbital angular momentum do commute. At this point it might be advisable to re-read that paragraph if it was not entirely clear upon the first reading.

5.9 The Born Interpretation of Wave Mechanics

After Schrödinger published his concluding 1926 paper on wave mechanics, he believed he had established a connection between quantum and classical physics. One simply had to think of a particle as a density distribution whose location, extent, and motion were given by the square of the
wave function (by analogy to light energy being the square of the electromagnetic wave). But a dissenting opinion arose quickly. Before the year was out, Max Born (1926) published a paper applying Schrödinger's wave mechanics to collision processes. In this paper he introduced his interpretation of the wave function as a probability amplitude. This was inspired by an idea that Einstein had suggested regarding the connection between the wave and particle nature of light, namely to interpret the squared amplitude of an electromagnetic wave (i.e., the energy, as stated above) as a measure of the probability of a photon with that energy existing at the given location. Since Schrödinger's wave function is complex, its square has to be computed as the product of ψ with its complex conjugate, denoted ψ* (recall that a complex number a+bi, where a and b are real and i is √−1, has a complex conjugate given by a-bi). Thus Born postulated that ψ(x)ψ*(x)dx was the probability of the associated particle being between x and x+dx. This became known as the Born Rule. It made sense to him for collisional processes based on the analogy of a water wave being scattered by a piling supporting a pier: parts of the scattered wave are circular, but with an amplitude that varies with direction. He envisioned the same thing happening with plane waves associated with “a swarm of electrons” (from his 1954 Nobel lecture) coming from a large distance encountering a relatively heavy object such as an atom, suggesting that the smaller squared amplitudes in some directions represented lower probabilities that a particle would be scattered in that direction. He perceived that Schrödinger's wave equation even provided the probabilities for the collision inducing the atom's different stationary states. He then realized that Heisenberg's formalism, especially the Uncertainty Principle, added great support to a probabilistic interpretation of Quantum Mechanics in general. In fact, Born ascribed to Heisenberg the greatest measure of his commitment to the probabilistic interpretation. Nevertheless, the primary manifestation of this commitment became visible as an interpretation of the wave function. With this interpretation, Born declared the death of classical determinism and of the idea that microscopic objects possess macroscopic qualities. Of course, that obituary did not go without vehement opposition. Although it was not immediately clear what the scope of “probabilistic” was, unless it meant merely epistemic randomness in ensembles, the interpretation was not digestible to some of the principal architects of Quantum Mechanics. The idea that it was both nonepistemic and applicable to single particles was in the air but had yet to gain traction. Similarly, the idea that an electron could have several locations or no location property at all did not sit well. The primary antagonists to Born's interpretation were Einstein, Schrödinger, and de Broglie. On Born's side of the net were Heisenberg, Pauli, Dirac, Jordan, von Neumann, and Bohr. Dirac (1925) had published his own formulation of Quantum Mechanics based on ideas obtained from a lecture that Heisenberg had given in Cambridge, England, but Dirac invented his own mathematical tools that were distinct from matrix algebra. Nevertheless, his formalism was closely allied with Heisenberg's, not Schrödinger's.
Thus the proponents of the probabilistic interpretation were all physicists with no personal stake in wave mechanics, whereas the two main proponents of the wave nature of matter strongly resisted allowing this wave blurriness to be interpreted as indicating some fundamental randomness in Nature. The Born school did not immediately claim that the randomness requiring a probability formalism was necessarily nonepistemic. In fact, Born argued that the vaunted determinism of classical Newtonian mechanics was an illusion. Even setting aside the mathematical difficulties that prevent exact solutions (e.g., the fact that there is no closed-form solution to the general three-body problem), Born pointed out that classical mechanics can give results with infinite precision only if
provided with initial conditions of the same quality, and that would require perfection in the corresponding measurements, a physical impossibility because even in classical physics, noise is always present at some level. One might do thought experiments in infinite precision, but the results could never be applied to real physical objects, and therefore any claim that real physical objects possess the corresponding attributes is without foundation. For example, the idea that an electron has a location given by a point on the real line has no basis, because the measurement required to prove it can never achieve the required accuracy, even if the Universe were governed by classical physics. The idea that what cannot be observed cannot be said to exist extended to the exactness of the properties of real physical objects, and this was a consequence of classical physics, not Quantum Mechanics. On the other hand, Albrecht and Phillips (2014) argue that quantum-mechanical effects are indeed the cause of the randomness that causes the noise in macroscopic measurements and underlies Statistical Mechanics. They show that for plausible models, the tiny quantum uncertainties combine rapidly to macroscopic levels in such things as billiards and coin flips. For example, they compute the number of billiard-ball collisions needed for quantum randomness to dominate the physical layout to be eight. For random walks in phase space by nitrogen gas and water molecules, the path is governed by quantum randomness after a single collision by each particle. If the randomness in Quantum Mechanics is nonepistemic, then so also is the randomness in Statistical Mechanics. The positivist view of microphysical parameters became part of what has been called the “Copenhagen Interpretation” of Quantum Mechanics, and today this interpretation has the largest number of subscribers. In its most general form it states that the only physically meaningful quantities in Quantum Mechanics are the results of measurements expressed in macroscopic language, and that quantum events are nondeterministic and involve discrete transitions between states (because continuous distributions of states are not resolvable by measurements and because the transition from a superposition of states to an eigenstate is discontinuous). There are some variations among adherents, for example regarding whether the wave function is physically real or exists only in configuration space (e.g., as mentioned above, the wave function can be expressed in momentum space) and how strictly to apply the positivistic view that what cannot be observed cannot be said to exist. For example, Carl Friedrich von Weizsäcker reserved the freedom to make suitable assumptions about what cannot be observed in order to avoid paradoxes, and it is difficult to justify Heisenberg’s use of x(t) at the foundation of matrix mechanics if he considered it not to exist, since anything derived from something that does not exist would seem to be on shaky ground, and its success in predicting measurement outcomes could be seen as nothing more than a fluke. Closely linked to the Copenhagen Interpretation (largely through the role of Niels Bohr in formulating them) are the “Correspondence Principle” and “Complementarity”. 
The former states that Quantum Mechanics must approach classical mechanics asymptotically in the limit of arbitrarily large quantum numbers, and the latter is the principle that macroscopic manifestations of quantum behavior may exhibit wave behavior in some contexts and particle behavior in others because the quantum and classical realms are fundamentally different, and the former requires more than one way of being visualized in the latter. Given Born’s postulate that ψ(x)ψ*(x)dx is the probability of the associated particle being between x and x+dx, it follows that ψ(x)ψ*(x) must be unit-normalized, i.e., for the continuous and discrete cases respectively,
$$\int \psi(x)\,\psi^*(x)\,dx = 1, \qquad \sum_n \psi_n(x)\,\psi_n^*(x) = 1 \qquad (5.46)$$
We mentioned in section 5.3 that when the system's state is represented as a vector in the space defined by the operator's eigenvectors, it is convenient to use an arbitrary degree of freedom to assign all of these vectors unit magnitude. That is because this convention automatically provides the normalization condition on the second line above. The projection of the state vector onto eigenvector k is the dot product of the two unit vectors, hence the cosine of the angle between them. The probability that a measurement will produce eigenvalue k is proportional to the square of this dot product, namely ψk(x)ψk*(x) = cos²(θk), where θk is the angle between the state vector and eigenvector k. Thus the most likely states are the ones whose eigenvectors are the closest to the state vector before the measurement. The sum of cos²(θk) over k is just the squared magnitude of the state vector, which is 1 because all the vectors were chosen to be unit vectors. In the continuous case, normalization is usually enforced by dividing ψ(x)ψ*(x) by its integral over the domain. The constraints in Equation 5.46 cannot be applied to monochromatic wave functions such as Equation 5.27 (p. 252) or an incomplete wave packet (i.e., such as Equation 5.39, p. 255, with a finite number of components), because the former is not localized at all, and the latter has an infinite number of repetitions on the real line. Since ψn(x)ψn*(x) > 0 in an undiminishing manner in an infinite number of places, the corresponding integrals over the entire real line diverge, and ψn(x)ψn*(x) cannot be normalized. As a result, these wave functions cannot describe real physical situations under the Copenhagen Interpretation. Since this rules out monochromatic wave functions for free particles, it follows that a free particle cannot be in a single-momentum state that is perfectly known with no uncertainty (i.e., the momentum distribution cannot be a Dirac Delta Function). By the same token, it cannot be located at a known point, since this makes the momentum probability distribution flat and unnormalizable. In Heisenberg's Uncertainty Principle σp σx ≥ ℏ/2, we never have one uncertainty zero and the other infinite for continuous cases such as a free particle. The unnormalized ψ(x,t)ψ*(x,t) for the wave packet shown in Figure 5-4A is plotted for the same time samples in Figure 5-4B. For complete wave packets, the oscillations in ψ(x,t)ψ*(x,t) would not be present; the shapes would resemble the envelopes of the oscillations. Note that the oscillation frequencies in plot B are twice those of plot A. The latter is the real part of ψ(x,t), hence only the cosine contributions, whereas the importance of the imaginary part, the sine contributions, becomes visible in ψ(x,t)ψ*(x,t). Although these oscillations are blended into smooth envelopes for real wave packets, the importance of the imaginary components remains. Even with a properly normalized distribution, the probability that a free particle is located at a given point on the real line is not meaningful for the same reason that was pointed out for the classical case in section 2.1 regarding the amount of October rainfall in London: ψ(x,t)ψ*(x,t) is a probability density function and must be integrated between two limits to get a probability that the particle is within those limits. Since this demands a range of positions and momenta, we still have σp and σx both greater than zero.
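Returning briefly to the state-vector representation described earlier in this section, here is a small numerical sketch of my own (not from the text): an arbitrary state vector is normalized, projected onto an orthonormal set of eigenvectors, and the squared moduli of the projections are taken as the Born-rule probabilities, which then sum to 1 automatically:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4

# Orthonormal eigenvectors of some observable (here simply the standard basis).
eigenvectors = np.eye(dim)

# An arbitrary state vector with complex components, scaled to unit magnitude.
state = rng.normal(size=dim) + 1j * rng.normal(size=dim)
state /= np.linalg.norm(state)

# Born rule: probability of outcome k = squared modulus of the projection onto eigenvector k.
projections = eigenvectors.conj().T @ state
probabilities = np.abs(projections)**2

print(probabilities)          # one probability per eigenvalue
print(probabilities.sum())    # 1.0, because the state vector has unit magnitude
```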
For discrete eigenstates, it is theoretically possible for a measurement to yield an eigenvalue with zero uncertainty, implying an unbounded uncertainty in the other noncommuting parameter. For example, a system may be in an energy eigenstate with zero uncertainty, and the time at which the system has that energy is completely unknown. But in realistic cases, energy
eigenstates are typically not perfectly sharp, i.e., they have some uncertainty because of zero-point energy fluctuations, finite decay lifetimes, etc. The significance of the Heisenberg Uncertainty Principle is not that it places any practical limits on measurement accuracy involving macroscopic objects, since it operates at a level many orders of magnitude below the corresponding limitations of laboratory equipment. The significance is that it is implied by a theoretical formalism that is believed to be a more accurate description of the ultimate Nature of the Universe than classical physics, just as Statistical Mechanics is closer to Nature than Thermodynamics. The fact that entropy is capable of spontaneously reducing itself in the former but not the latter has no practical consequences for everyday life, but it is highly significant in its superior revelation of the details of physical behavior. Laboratory experiments have shown conclusively that when scattered electrons strike a target that reveals their scattering direction, different results are found for different electrons in an apparently perfectly random fashion with distributions predicted by ψ(x,t)ψ*(x,t). This is true even when only one electron at a time is in the beam traversing the apparatus. Every laboratory application of wave mechanics ever properly performed supports the Born Rule, and for this reason, we accept henceforth that the randomness implied by that interpretation is a real physical effect that must be dealt with by any formalism attempting to describe quantum processes. The early experiments did not establish conclusively whether the randomness is epistemic or nonepistemic, but the fact that single electrons exhibit the random behavior implies that it is not the usual epistemic randomness found in classical ensembles wherein chaotic interactions between ensemble elements generate a distribution. As a result, mainstream physicists have largely adopted the view that the behavior of individual objects is subject to this randomness and that it is nonepistemic, because there is no recognized physical mechanism for shuffling the deck analogous to the many unpredictable outcomes of collisions in the kinetic theory of gases. The formalism implies that electrons sent on identical initial trajectories through the same double-slit apparatus one at a time will not exhibit identical scattering but rather an interference-pattern distribution of scattering angles occurring at rates predicted by the wave function, but with each individual electron's exact path being unpredictable. This is indeed what is found in the laboratory. Repeating the same situation produces different outcomes. It is also observed that if one of the slits is closed, the interference pattern that otherwise appeared after the passage of many electrons is absent, leaving only a diffraction blur spot of the same type as that of electromagnetic radiation. The fact that the motion of electrons is governed mathematically by probabilistic wave mechanics is firmly established, but the interpretation of this in terms of physical objects has yet to be realized.

5.10 Probability Amplitude, Quantum Probability, and Interference Between Coherent States

Born called ψ(x) a probability amplitude because it is not a probability itself. To get a probability one must use ψ(x)ψ*(x), which is also called the squared modulus. Once a probability distribution is obtained, it is used in a manner analogous to a classical probability distribution.
But the way in which the probabilities arise for various quantum systems is generally very different from the classical case, because any linear combination of solutions to the wave equation is also a solution. For example, if a fair classical die is rolled, it can stop with any one and only one of six
numbers showing on top, each with probability 1/6. The probability of getting either a 1 or a 2 is the sum of the two separate probabilities, 1/3, and there is no chance of getting both. In Quantum Mechanics, it is the probability amplitudes that sum. In Equation 5.39, for example, we add together the various monochromatic terms, each of which satisfies the wave equation and represents a possible system state, to get the linear superposition that defines the total wave function. It is only after adding these terms, each with a probability amplitude An, that we take the squared modulus to get probabilities. The An are complex numbers which we will denote an + bn i with an and bn real numbers and i the square root of −1. To reduce clutter, we will use θn in place of knx − ωnt. Then to get the probability that the die will turn up a 1 or a 2, we take the squared modulus of that part of the wave function:

$$P(1\ \mathrm{or}\ 2) = (\psi_1+\psi_2)(\psi_1+\psi_2)^* = (\psi_1+\psi_2)(\psi_1^*+\psi_2^*) = \psi_1\psi_1^* + \psi_2\psi_2^* + \psi_1\psi_2^* + \psi_2\psi_1^* = P(1) + P(2) + \psi_1\psi_2^* + \psi_2\psi_1^* \qquad (5.47)$$
This is different from the classical probability, which would just be P(1)+P(2). We have two extra terms to deal with. The first, using the full definitions of the wave components, is
$$\psi_1\psi_2^* = (a_1 + b_1 i)\,e^{i\theta_1}(a_2 - b_2 i)\,e^{-i\theta_2}
= a_1 a_2\,e^{i(\theta_1-\theta_2)} + b_1 b_2\,e^{i(\theta_1-\theta_2)} + a_2 b_1\,i\,e^{i(\theta_1-\theta_2)} - a_1 b_2\,i\,e^{i(\theta_1-\theta_2)}
= (a_1 a_2 + b_1 b_2)\,e^{i(\theta_1-\theta_2)} + (a_2 b_1 - a_1 b_2)\,i\,e^{i(\theta_1-\theta_2)} \qquad (5.48)$$
where we have used the fact that since e^{iθ} = cosθ + i sinθ, the complex conjugate is cosθ − i sinθ, and since cosθ = cos(−θ) and −sinθ = sin(−θ), this complex conjugate is cos(−θ) + i sin(−θ) = e^{i(−θ)} = e^{−iθ}. The second extra term is
$$\psi_2\psi_1^* = (a_2 + b_2 i)\,e^{i\theta_2}(a_1 - b_1 i)\,e^{-i\theta_1}
= (a_1 a_2 + b_1 b_2)\,e^{-i(\theta_1-\theta_2)} - (a_2 b_1 - a_1 b_2)\,i\,e^{-i(\theta_1-\theta_2)} \qquad (5.49)$$

Adding the two extra terms together yields what we will call the interference term for the 1 and 2 faces of the die, I(1,2):
$$I(1,2) = (a_1 a_2 + b_1 b_2)\left(e^{i(\theta_1-\theta_2)} + e^{-i(\theta_1-\theta_2)}\right) + (a_2 b_1 - a_1 b_2)\,i\left(e^{i(\theta_1-\theta_2)} - e^{-i(\theta_1-\theta_2)}\right)
= 2(a_1 a_2 + b_1 b_2)\cos(\theta_1-\theta_2) - 2(a_2 b_1 - a_1 b_2)\sin(\theta_1-\theta_2) \qquad (5.50)$$
where we have used e^{iθ} + e^{−iθ} = 2cosθ and e^{iθ} − e^{−iθ} = 2i sinθ. The interference term has no counterpart in classical probability theory, and for objects like electrons passing through a double-slit apparatus, it can have a very strong effect, since the phase angles θ1 and θ2 can be very different at various locations on the scintillation screen due to the different path lengths from the two slits to that point. The variation of the phase difference over the
screen gives rise to the observed interference pattern. When one slit is closed, the corresponding part of the wave function is removed, and so is the interference term and the pattern it produces, leaving only a single-slit diffraction pattern. Of course, light passing through a double-slit apparatus has long been known to exhibit these effects, which are adequately described by classical wave optics, but modern laboratory equipment permits the experiment to be done with single photons, and after many repetitions, the interference pattern emerges, indicating that although only a single photon traverses the apparatus at any given time, it behaves as though it goes through both slits, and the location of its eventual absorption by the screen is randomly distributed according to the probability distribution derived from the wave function. The interference term is frequently discussed in the context of a two-slit apparatus into which a collimated beam of electrons is sent with the beam symmetrically placed between the two slits. In this case, P(1) = P(2), with the symmetry also requiring A1 = A2, which implies that a1 = a2 and b1 = b2. Equation 5.50 shows that this cancels the sine contribution, and the coefficient of the cosine contribution simply becomes 2P(1) = 2P(2), since P(1) = a1² + b1² = a1a2 + b1b2 = a2² + b2² = P(2):

$$I(1,2) = 2P(1)\cos(\theta_1-\theta_2) \qquad (5.51)$$
Note that this is a highly idealized model because we have included only two components of the wave function, each monochromatic and with the same wavelength, since the two components are produced by splitting a single monochromatic plane wave by passing it through two slits. In real wave packets, this sort of interference would be happening between all frequency components, but this simple model does illustrate the basic process. Furthermore, we have been considering wave functions defined on a single axis, whereas the two-slit apparatus requires the description to be expanded to at least two axes, so we add a y axis whose only role is to support the scattering of what was originally a plane wave propagating only in the x direction. To keep things as simple as possible, we assume that everything is constant in the z direction. Figure 5-5A shows a schematic of a two-slit apparatus splitting a monochromatic plane wave into two components that each propagate in a circular fashion (cylindrical if we consider all three dimensions) after passing through a slit. This of course depends on the slit width Δy being on the same order as the wavelength, and in fact many engineering details beyond our scope must be taken into account for a real laboratory device to function properly. The two sets of circularly propagating waves interfere with each other, sometimes reinforcing and sometimes canceling, and the scintillation (or other detector) screen records the pattern at the locus of its plane. Figure 5-5B shows the relative fraction of the total probability density function at the detector screen over a range of ±5×10⁻⁷ m about the average height of the two slits for an electron speed of 0.2c and laboratory wavelength of 1.2×10⁻¹¹ m, a slit-center separation of 1 mm, and a distance of 5 m from the slit screen to the detector screen. After a sufficiently large number of electrons have been detected, the interference pattern shown in Figure 5-5C emerges on the detector screen. The peak-to-peak distances are approximately 30 nm, about the size of a polio virus and easily resolvable in modern electron microscopes.
Figure 5-5 A. Schematic of a two-slit interference apparatus that splits an electron plane wave function into two circularly propagating waves that interfere with each other; the wavelength is exaggerated for clarity. B. Relative fraction of the total probability density function at the detector screen, sampled between ±5×10⁻⁷ m relative to the average height of the two slits, for an electron moving at 0.2c, slit-center separation of 1 mm, and a distance of 5 m from the slit screen to the detector screen. C. The interference pattern on the detector screen for the same vertical range as B, and a horizontal range of ±10 cm relative to the center of the detector screen.
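As a cross-check of the algebra behind Equations 5.47, 5.50, and 5.51 (a sketch of my own with arbitrary amplitudes and phases; it makes no attempt to reproduce Figure 5-5), the squared modulus of the summed amplitudes equals the classical sum of probabilities plus the interference term, and the symmetric two-slit case reduces to the pure cosine form:

```python
import numpy as np

rng = np.random.default_rng(1)
a1, b1, a2, b2 = rng.normal(size=4)            # real and imaginary parts of the two amplitudes
th1, th2 = rng.uniform(0, 2 * np.pi, size=2)   # the two phase angles

psi1 = (a1 + 1j * b1) * np.exp(1j * th1)
psi2 = (a2 + 1j * b2) * np.exp(1j * th2)

quantum   = abs(psi1 + psi2)**2                # amplitudes add first, then the squared modulus
classical = abs(psi1)**2 + abs(psi2)**2        # P(1) + P(2)
interference = (2 * (a1*a2 + b1*b2) * np.cos(th1 - th2)
                - 2 * (a2*b1 - a1*b2) * np.sin(th1 - th2))   # Equation 5.50

print(np.isclose(quantum, classical + interference))          # True

# Symmetric two-slit case: equal amplitudes, so I(1,2) = 2 P(1) cos(theta1 - theta2), Equation 5.51.
psi2_sym = (a1 + 1j * b1) * np.exp(1j * th2)
P1 = abs(psi1)**2
print(np.isclose(abs(psi1 + psi2_sym)**2,
                 2 * P1 + 2 * P1 * np.cos(th1 - th2)))        # True
```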
Between the slit screen and the detector screen, the electron's state is given by a superposition of coherent states, i.e., states between which a stable phase relation exists that allows them to interfere with each other. Once the electron hits the detector screen, its location becomes known and the wave function is said to collapse; the loss of the phase relationship between the two wave components is called decoherence. In the state-vector representation, the process corresponding to the collapse of the wave function is called state-vector reduction. These are controversial concepts whose interpretation varies considerably within the physics community even today. Weaving the mysterious activities of random processes into these dilemmas added further complications to the enigma known as the measurement problem, the question of how linear processes like the propagation of the wave function can result in a discontinuous process such as the collapse-inducing interaction between two quantum systems. In other words, the laboratory apparatus with which measurements are made must be viewed as yet another quantum system, since everything seems to be built up out of quantum systems whose linear waves should simply pass through each other without interacting, like waves on the surface of a pond. In order to have an interaction, some sort of nonlinear process is required, since nonlinear processes are able to couple to each other and transmit energy. Interference is not the same as interaction. Linear waves may experience constructive and destructive interference as they pass through each other, but afterwards each propagates as though the others had never existed. As tangled as this situation was, the last straw had not yet been placed on the camel's back. Being committed to determinism and realism, Einstein was convinced that what he saw as dislocated aspects of Quantum Mechanics proved that the formalism was incomplete. He began a series of attacks intended to show convincingly that key elements were missing, and his primary opponent in a sequence of debates extending over a period of almost ten years was Niels Bohr. The history of these intellectual clashes between two dear friends is a fascinating subject that we cannot take the time to recount in detail, but the reader is encouraged to take advantage of the excellent accounts that exist (e.g., Bohr, 1949; Einstein, 1949; Pais, 1983). Einstein concocted some truly ingenious thought experiments meant to undermine the Copenhagen Interpretation, and occasionally Bohr was hard pressed to counter them, but one by one, Bohr managed to dismantle each of Einstein's cleverly devised arguments, even using Einstein's own theory of General Relativity against him in one notable case, until finally Einstein found what seemed to be a true juggernaut, something both unassailable and devastating.

5.11 Quantum Entanglement and Nonlocality

In 1935, Einstein had been in permanent residence in the United States of America for two years and had arrived at the Princeton Institute for Advanced Study, abandoning Germany because of the rise to power there by Hitler and the Nazi party. In that year, he published one of the most important papers of his career (Einstein, Podolsky, and Rosen, 1935, hereinafter EPR). In this paper he argued in favor of realism, specifically that the physical parameters of quantum systems do have exact values even when not observed, including both noncommuting observables after they have been measured in either order.
The fact that, after the second measurement, the Heisenberg Uncertainty Principle prohibits exact knowledge of the value of the observable measured first was taken as evidence that Quantum Mechanics was not a complete physical theory as it stood.
Einstein continued to believe that the uncertainties implied by Quantum Mechanics were epistemic uncertainties applicable only to ensembles. By his definition of realism, physical parameters such as momentum could not be spread out over several values; the momentum of a real physical object always had one and only one value. Even if this value was not measured, if it could be predicted with certainty, it had to be considered real. So his task was to show that such an unobserved quantity could be predicted with certainty prior to being measured, with the measurement serving only to confirm the prediction, and that this could be done for either of two noncommuting observables by arbitrary choice without disturbing the physical object in any way, implying that both have exact values simultaneously. He used a deliberately simplified definition of the requirement for a theory to be “complete”: that every physical element that satisfies the definition of being real and is relevant to a theory must appear as part of the theory’s formalism. A physical element must be considered real if the value of the corresponding element in the theory can be correctly predicted exactly. A theory that refers to a physical element that may be considered real under some circumstances but not others is therefore an incomplete theory. In order to keep our summary of relevant topics in Quantum Mechanics as simple as possible, we have so far considered only single-particle wave functions. For these we have seen that a single particle may have multiple possible states that exist simultaneously in some fashion that is mysterious to macroscopically based intuition but which must be simultaneous nevertheless in order to interfere with each other as coherent states. To follow Einstein’s argument, we must expand the wave function to encompass the entangled states of two particles. This refers to coupling between the states of one particle and those of another with which it has interacted. For example, there are certain electronic transitions in atoms that give rise to the emission of not only one photon but two. These simultaneously emitted photons have polarization states that are not independent. Although both photons may be in superpositions of polarization states, each polarization state of one photon is correlated with a polarization state of the other in one-to-one correspondence over all polarization states. Since we have not discussed photon spin and polarization, we mention in passing that Maxwell’s equations show that electromagnetic waves are composed of oscillations of orthogonal electric and magnetic fields perpendicular to the direction of propagation, about which the field directions rotate. This rotation can be clockwise or counter-clockwise, giving rise to circular polarization that is left-handed or right-handed, respectively. But the electromagnetic wave can also be in a superposition of both states, and this results in linear polarization, since the field vectors of opposite rotation sum to form nonrotating vectors. The oscillating but nonrotating electric field vector may be horizontal, vertical, or in a superposition of both, but the direction for one of the photons corresponding to the two simultaneously emitted electromagnetic waves has an absolute correlation of 100% with that of the other. At least this is what a detailed analysis using Quantum Mechanics indicates. 
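The statement above that equal amounts of the two circular polarizations combine into a linear polarization can be illustrated with Jones vectors (a standard optics representation not used in the text; sign conventions vary). A minimal sketch of my own:

```python
import numpy as np

# Jones vectors for the two circular polarization states (one common sign convention);
# the two components are the complex field amplitudes along the horizontal and vertical axes.
left_circular  = np.array([1,  1j]) / np.sqrt(2)
right_circular = np.array([1, -1j]) / np.sqrt(2)

# Equal superposition of the two rotation senses:
superposition = (left_circular + right_circular) / np.sqrt(2)
print(superposition)                      # [1, 0]: purely horizontal, i.e. linear, polarization

# The real field at a fixed point oscillates along one axis instead of rotating:
for t in np.linspace(0, 2 * np.pi, 5):
    print(np.round(np.real(superposition * np.exp(-1j * t)), 3))
```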
One way to test Quantum Mechanics involves measuring the polarizations of such photons, and such tests have always shown that the formalism provides a correct description. The classical circular polarization of electromagnetic waves translates into photon spin in Quantum Mechanics, with right or left circular polarization corresponding to spin of +ℏ or −ℏ, respectively, and a photon may be in a superposition of spin states. In classical physics, such a superposition would yield a net spin of zero, but in Quantum Mechanics, both spins are present as coherent states corresponding to different eigenfunctions in the wave function. Photon spin is usually stated in units of ℏ, hence photons are described as having spin of unit magnitude. Particles with integer spin are called bosons, because they obey Bose-Einstein statistics, a
formalism developed in 1924 by Satyendra Bose in collaboration with Einstein. A special feature of this formalism is that it treats photons as indistinguishable particles, and this has a profound effect on how particle states are counted, hence on the probability distributions relevant to photons. Particles with half-integer spins obey Fermi-Dirac statistics and are called fermions (e.g., electrons, protons, and neutrons). A special feature of this formalism is that the probability of two fermions occupying the same state vanishes (known as the Pauli Exclusion Principle). This makes it impossible for two electrons to occupy an atomic orbit with all the same quantum numbers, for example. Two electrons with equal principal, angular momentum, and magnetic quantum numbers in the same atom must have opposite spins in order to be in different states. This forces the electrons bound to an atom to be arranged in shells, a feature that underlies all chemical reactions. In contrast to fermions, multiple bosons can occupy the same quantum state, making possible what are called Bose-Einstein condensates. Beyond the need to introduce photon spin and polarization for the discussion of entanglement, we have extended this digression to include Bose-Einstein condensates because they, like quantum entanglement, were also brought to the attention of the physics community by Einstein and his collaborators decades before unequivocal verification of the validity of either was possible in the laboratory. One disadvantage of being so far ahead of his time was that Einstein did not live to see either verification. The laboratory formation of Bose-Einstein condensates would no doubt have pleased him, but the same cannot be said of quantum entanglement, since the eventual result was to refute his argument, as we will see below. We should mention that Einstein did not use the term entanglement. That was introduced by Schrödinger in a letter to Einstein in which he called the phenomenon Verschränkung and later translated this to “entanglement” (Schrödinger and Born, 1935). The relevance of photon polarization lies in most of the laboratory experiments that eventually demonstrated quantum entanglement, not in the EPR paper itself, which used a pair of generic particles labeled I and II to demonstrate position and momentum entanglement between the two particles. Even the nature of the interaction between the two particles is left unspecified. It suffices that any such interaction took place. In classical physics, two objects with known initial states that proceed to interact via collision (for example) have final states that depend on each others’ initial states, and the final states are correlated by the laws of energy and momentum conservation. But the laws are deterministic, so there is no question of a probability that one of the bodies has a certain momentum or the other a specific location at a given instant, and any measurement performed on one object has no physical effect on the other. In the EPR paper, Einstein specifically took the pre-interaction states of the two particles to be known, so that each is associated with a fully specified wave function. He then defined a period of time during which they interact, and after that period, they moved off in different directions, but then, according to the laws of Quantum Mechanics, they must be described by a single wave function that depends on the coordinates of both particles. 
Because the various possible eigenstates before the interaction have various probability amplitudes, the effects of the interaction are random variables, and the joint probability distribution for the two particles cannot be separated into a product of marginal distributions. Only a single wave function can contain the information for the two particles, which are now subsystem components of a single quantum system with a single wave function of the form (compare to Equation 5.39, p. 255)
$$\Psi(x_1, x_2, t) = \sum_n \psi_n(x_1, t)\,\xi_n(x_2, t) \qquad (5.52)$$
where the ψn and ξn are the wave function components for particle I and particle II, respectively. Einstein then showed that if one measures the momentum of particle I, the mutual wave function collapses in such a way that the momentum of particle II becomes known, i.e., particle I cannot be forced via measurement into a momentum eigenstate without simultaneously forcing particle II into its corresponding eigenstate that has an absolute correlation of 100% with that of particle I, because both eigenstates correspond to the same term in the joint wave function. When all but one of the ψn collapse to zero, they take out all but one of the ξn with them. In this ideal case, there is only one surviving wavelength, hence only one momentum, for each particle. The joint wave function is reduced to the product ψk ξk, where k is the index of the eigenstates to which the joint wave function collapsed, hence factoring it into the product of separate wave functions, and the particles are left in correlated states but no longer entangled. The positions of the particles are in states of unbounded uncertainty, since both are maximally blurred, and the Copenhagen Interpretation assigns to them no real existence at all. Einstein took this to mean that the momentum of particle II can be known exactly without mechanically disturbing particle II with a measurement, and therefore the momentum of particle II satisfies Einstein's definition of a real physical property. But the choice to measure the momentum of particle I was arbitrary; its position could just as readily have been measured instead, in which case particle II would have been forced into a position eigenstate that has an absolute correlation of 100% with that of particle I, and the position of particle II would become known exactly without performing any measurement on that particle. Thus the position of particle II also satisfies Einstein's definition of a real physical property. The claim was then made that whether a given property of particle II is real cannot depend on what is done to particle I, which was quite distant from particle II and completely out of range of any physical interaction when the measurement was made. Einstein then applied the coup de grâce: because both the momentum and position of particle II were shown to be real but Quantum Mechanics cannot provide exact values for both of them, Quantum Mechanics does not satisfy the definition of a complete theory. He conceded that a complete theory did not exist but expressed a belief that such a theory is possible. In using the equations of Quantum Mechanics as he did in the EPR paper, Einstein's intention was not to endorse them but to turn them against themselves. He did not believe that a measurement on particle I could instantaneously affect the state of particle II, because his commitment to relativity theory ruled out physical influences propagating faster than the speed of light and demanded that the notion of simultaneity be dealt with more carefully. The phenomenon described in his paper became known as the EPR Paradox, and one of the paradoxical aspects was that some observers in differently moving reference frames could see particle II collapse into a sharp momentum state before the measurement was made on particle I. Did a spontaneous collapse at particle II cause the measurement to be made on particle I? Could the entangled collapse be used to communicate faster than the speed of light?
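The difference between an entangled joint wave function and a factored product, which is central to the argument above, can be probed numerically. The sketch below is my own aside; the singular-value (Schmidt-rank) test it uses is a standard linear-algebra criterion for separability, not something invoked in the text. A coefficient matrix with a single nonzero singular value factors into ψ(x1)ξ(x2); one with several nonzero singular values cannot:

```python
import numpy as np

def schmidt_rank(coeffs, tol=1e-12):
    """Number of nonzero singular values of the coefficient matrix c[n, m] of a joint
    state sum_{n,m} c[n,m] |n>_I |m>_II; a rank of 1 means the state is separable."""
    return int(np.sum(np.linalg.svd(coeffs, compute_uv=False) > tol))

# A single product term: particle I in one definite component, particle II in another.
product_state = np.outer([1, 0, 0], [0, 1, 0])

# Three correlated terms with equal amplitudes, as in Equation 5.52 with three components.
entangled_state = np.eye(3) / np.sqrt(3)

print(schmidt_rank(product_state))     # 1  -> the joint wave function factors; no entanglement
print(schmidt_rank(entangled_state))   # 3  -> no factorization into psi(x1) * xi(x2) is possible
```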
Einstein was confident that these paradoxical implications of the EPR scenario created a level of absurdity high enough to rule out absolutely any chance that Quantum Mechanics, as it stood in 1935, could be taken seriously as not only a correct but also a complete description of physical reality. The possibility of faster-than-light communication was especially anathema, because such a phenomenon could undermine the foundation of human sanity. In 1935, Special Relativity Theory was firmly established, and although General Relativity superseded it, the presence of weak gravitational fields does not eradicate the effects that it describes. Specifically, it uses Lorentz Transformations to define the relationships between the coordinates (X,Y,Z,T) and (X′,Y′,Z′,T′) in
two inertial reference frames. Since intervals are just coordinate differences, they too can be transformed between inertial reference systems. Taking the two sets of axes to be aligned and the primed system B to be moving in the +X direction at a positive speed v < c relative to the unprimed system A, we consider the time interval ΔT in system A required for transmission of a signal sent with a speed u from an observer at the origin of system A to an observer at the origin of system B at a distance ΔX at the instant the signal is received, i.e., u = ΔX/ΔT. The time interval in system B is ΔT′ given by (see, e.g., Rindler, 1969)
$$\Delta T' = \frac{1 - \dfrac{vu}{c^{2}}}{\sqrt{1 - \dfrac{v^{2}}{c^{2}}}}\;\Delta T \qquad (5.53)$$
If u > c²/v and ΔT > 0, then the signal transmission time is negative in system B. That observer receives the signal before the sender transmits it, as system-B time is measured in either system. Although ΔT > 0, the signal goes into observer B's past, i.e., observer A sees observer B receive a message from observer A that observer A has not yet sent, something that cannot happen with subluminal transmission speeds. By symmetry, observer B can reply to the message in the same way, sending the reply into observer A's past, so that observer A receives a reply to a message not yet sent. Presumably observer A could decide not to send the original message after all. At the very least it becomes ambiguous which observer is the original sender, since both messages could be replies. A variety of insane violations of causality could occur. For example, Penrose (1989) suggests that instead of observers, computers could be sending messages that consist simply of “Yes” or “No”, with computer A programmed to send whatever message it receives and computer B programmed to send the opposite of what it receives, so that the “same” message flip-flops every time through the “same” closed loop in spacetime. It may appear that there is a loophole in Equation 5.53: the signal transmission speed and time, u and ΔT, are not independent. Their product is the distance ΔX between the two systems at the moment when the signal is received, as measured in system A, where “instantaneous” communication could be taken to mean that u = ∞. Since this implies that ΔT = 0, does it also imply that ΔT′ = 0? In fact, that won't work. If we substitute ΔX/u for ΔT, we get
$$\Delta T' = \frac{\left(1 - \dfrac{vu}{c^{2}}\right)\dfrac{\Delta X}{u}}{\sqrt{1 - \dfrac{v^{2}}{c^{2}}}} = \frac{\left(\dfrac{1}{u} - \dfrac{v}{c^{2}}\right)\Delta X}{\sqrt{1 - \dfrac{v^{2}}{c^{2}}}} \qquad (5.54)$$
so that

$$\lim_{u\to\infty} \Delta T' = -\frac{v\,\Delta X}{c\sqrt{c^{2} - v^{2}}} \qquad (5.55)$$
and the simultaneous sending and receipt of the signal in system A transforms to some negative time interval in system B. The only way to achieve ΔT′ = 0 with u = ∞ is to make v zero, which does not solve the problem in general (clearly we cannot make ΔX zero and still have an entanglement setup).
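Equation 5.53 is simple enough to evaluate directly. The sketch below (my own illustration, with arbitrarily chosen speeds) shows the transformed interval changing sign once the signal speed exceeds c²/v:

```python
import numpy as np

c = 299_792_458.0                     # speed of light in m/s

def delta_t_primed(delta_t, v, u):
    """Equation 5.53: the interval in frame B corresponding to an interval delta_t in frame A,
    for a signal of speed u and a relative frame speed v."""
    return delta_t * (1 - v * u / c**2) / np.sqrt(1 - v**2 / c**2)

v, delta_t = 0.5 * c, 1.0             # frame B recedes at 0.5c; one second of transmission in frame A

for u in [0.5 * c, c, 1.9 * c, 2.1 * c, 100 * c]:
    print(f"u = {u / c:6.1f} c   ->   delta_T' = {delta_t_primed(delta_t, v, u):+8.3f} s")

# The sign flips between u = 1.9c and u = 2.1c, i.e. at u = c**2 / v = 2c:
# for faster signals, the reception in frame B precedes the emission.
```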
The consensus among professional physicists at the time and ever since has been that the absurdity of such causality violations is sufficient to rule them out, but the question of how ruling them out is consistent with the apparent ability of entangled particles to communicate “instantaneously” with each other has been a difficult one. Given the discontinuous nature of the unexplained “collapse” mechanism, it is perhaps surprising that apparently no one advocated the possibility that the collapse of particle I into a momentum eigenstate severs its connection to particle II before the wave function collapses, essentially factoring the joint wave function before one of the factors collapses. It is just as well that this approach was not taken, because as we will see below, it would eventually have been refuted experimentally. The old feature of instantaneous action at a distance that harassed Newton’s law of universal gravitation evolved into what is now called nonlocality. Just as the majority of physicists accepted quantized energy as something forced on them without explanation (other than stemming from the wave nature of matter and energy, which carries its own inexplicability), the idea that entangled quantum subsystems could exert nonlocal influences on each other was absorbed into the mainstream. Some pockets of resistance remained, however, consisting of advocates of what became called hiddenvariable theories. These were attempts to formulate additional constraints on quantum systems beyond the information contained in the wave function, some for the purpose of carrying the correlations via local effects, some for the purpose of making the randomness epistemic, some for both. We will discuss this topic further below. Even Einstein’s ally Schrödinger, who continued to resist nonepistemic randomness in natural processes, conspicuously did not reject nonlocality. He was the first prominent physicist to respond to the EPR paper. In a series of three papers in 1935 (which he characterized as possibly a lecture or possibly a general confession; for an English translation, see Trimmer, 1980), he investigated entanglement in detail and described the measurement process itself as an entanglement of an object and an instrument, both subject to the laws of Quantum Mechanics, with the only distinguishing feature being that the instrument had some kind of macroscopically visible pointer that could be read as a number corresponding to the measurement result. The reading of that number is what promulgates the collapse of the entangled wave function, leaving both the instrument and the object in their respective eigenstates. Among the mysteries that remained was: why is the pointer never observed to be in a superposition of states? It was in this series of papers that Schrödinger introduced his famous cat paradox: the cat lives or dies or does both according to whether an unobserved 50%-probable radioactive decay takes place, and until the cat is observed, it is apparently in a superposition of dead and alive eigenstates. His point was that confining superpositions to microscopic phenomena was artificial and had no basis in the formalism. The notion of a collapse was something he felt forced to embrace but never liked, if for no other reason than the indication that something additional was needed, something that could not be found in the wave equation itself, and perhaps worst of all, it implied the possibility that the behavior of physical objects was somehow determined by our conscious awareness. 
Schrödinger described the wave function metaphorically as a catalog of possible measurement outcomes, each with a corresponding probability of appearing if the appropriate measurement is made. One need only take advantage of the possibility of calculating the wave function in order to possess information that is maximal at that stage, in the sense that no greater knowledge of a quantum system is possible. Prior to a measurement, only probabilities of measurement outcomes can be obtained. One may choose which parameters of the physical object to measure, but in fact, only half of the
parameters are available for exact evaluation, because each of them has a conjugate-pair partner that will be left “blurred” (verschwommene, verwaschen) by the measurement. In choosing a parameter to measure, one is also choosing one to blur. The corresponding collapse will alter the wave function to produce a new catalog whose information remains maximal but different from before. He then examined the new catalog obtained from a p measurement and compared it to that following an x measurement. Both catalogs are maximal in information content, but he found that nothing in the formalism requires these two catalogs to contain the same information, and in fact, the information content is distinctly different. Schrödinger stressed that the formalism demands that the EPR particle pair be treated as a single quantum system described by a single wave function after the interaction and prior to any measurement. Thus a measurement on particle I does disturb particle II; they are components of a single system. A measurement on either particle collapses this single wave function, puts both particles into their entangled eigenstates, and breaks the entanglement, yielding two collapsed and now separate wave functions. If the momentum of particle I was measured and found to be P, then the momentum state of each particle has collapsed, so that a subsequent measurement of the momentum of particle II can only yield -P. The positions of both particles are blurred but no longer entangled, e.g., a measurement of particle I’s position yields no information about particle II’s position. The manner in which Schrödinger described the probabilistic nature of measurement outcomes did not reveal whether he had changed his mind about the randomness being epistemic. He spent a significant portion of the three-part paper comparing classical models to quantum models and discussing the limitations implicit in both, saying “Reality resists imitation through a model. So one lets go of naive realism and leans directly on the indubitable proposition that actually (for the physicist) after all is said and done there is only observation, measurement.” This suggests significant overlap with the Copenhagen Interpretation. A modern variant of this thought is sometimes encountered in the admonition “Shut up and calculate!” But at no point did Schrödinger straightforwardly deny the validity of the realist interpretation. In fact, in referring to “Reality” as something that exists and has properties, he implied an endorsement of the realist point of view and simply lamented the difficulty of translating reality into mathematical models. In letting go of “naive realism”, he did not necessarily abandon hope for a more sophisticated version. But that would have to involve an altogether new way of modeling, because he did accept that the means currently available certify Quantum Mechanics to be as complete a theory as any derived from that mindset can be. There is some hint of exasperation in his acknowledgment that the EPR phenomenon is apparently on solid theoretical ground for reasons that escape us, for example in his statement “Best possible knowledge of a whole does not include best possible knowledge of its parts and that is what keeps coming back to haunt us” and “I hope later to make clear that the reigning doctrine is born of distress. Meanwhile I continue to expound it” and “The big idea seems to be that all statements pertain to the intuitive model. 
But the useful statements are scarcely intuitive in it, and its intuitive aspects are of little worth.” Einstein made clear that he considered the ability to predict measurement outcomes a necessary property of an acceptable physical theory. He never suggested otherwise and even made this the centerpiece of his EPR argument. But he never acquiesced to the idea that predicting measurement outcomes should be the sole activity of a scientist. Independently of whether the motivation is naive, it is the author's impression that most if not all young people who commit to the effort of acquiring scientific expertise do so in the hope of gaining some understanding of how the Universe works. It
is possible that the advice “Study science, and one day you will be able to predict measurement outcomes” would not inspire enough youthful enthusiasm to keep the population of the physics community from collapsing. But Einstein’s legacy remains alive in the hope that the disjoint relationship between Quantum Mechanics and General Relativity can someday be resolved in a Quantum Gravity Theory that is a superset of both, a new formalism based on a more complete concept of the fabric of the Universe, realistic without being naive, that will carry human intuition to a higher level. But that is the subject of the next chapter. Before the year was out, Bohr (1935) published his reply to the EPR paper in the same journal and with the same title. This time Bohr found no fallacy in Einstein’s logic. Instead he argued against one of Einstein’s premises, claiming that there was an ambiguity in the notion of “physical reality” as used by Einstein. This would eventually become a point on which the two would agree to disagree, ending their debate for all practical purposes. Bohr did not accept the use of classical notions of reality in a quantum-mechanical context. He invoked “the necessity of a final renunciation of the classical ideal of causality and a radical revision of our attitude towards the problem of physical reality” based on the unavoidable role played by the quantum of action and its requirement for the Principle of Complementarity. One crucial ambiguity was seen to be whether a measurement on particle I could be said not to disturb particle II in any way. Bohr’s position was that the absence of a mechanical disturbance was not enough to claim that there was no disturbance. The fact that the value of a momentum measurement on particle II could be predicted exactly from a measurement of the momentum of particle I did not endow the position of particle II with any exemption from the rule that it must be blurred by a momentum measurement on either particle. Bohr pointed out that although one could choose to predict exactly either the momentum or the position of particle II via a corresponding measurement on particle I, the two types of measurement involved fundamentally different kinds of laboratory apparatus, each having a different effect on the system wave function, so that one could not claim that both the position and the momentum of particle II can be simultaneously exact. “Either/or” is not the same as “both”, and furthermore being able to predict a measurement outcome with certainty does not imply that the physical parameter has that sharp value prior to the measurement, because that would ignore the effect of the measurement. Einstein’s EPR argument was that once the momentum of particle I is measured, the momentum of particle II is known exactly, independently of whether it is measured. That was the reason for claiming that the momentum of particle II was real. A measurement is certain to yield the alreadyknown value, hence it must be in effect prior to the measurement. This is the core of the unresolvable conflict between his physical intuition and Bohr’s. The latter’s position was that there is a critical difference between the following two statements regarding the time period after the measurement on particle I and before the measurement on particle II: (a.) a momentum measurement of particle II is certain to yield the value opposite to that measured for particle I; (b.) prior to the measurement on particle II, it already possesses a sharp momentum with the predicted value. 
For Bohr, particle II has no definite momentum value prior to the measurement of its momentum; it acquires the definite value by virtue of the measurement process. The wave function may be collapsed, but it does not tell you the state of the system; it tells you what measurement results are possible. For him, particle II must wait for its definite momentum value to be bestowed by the measurement process, not by the entanglement of the particles. If its measurement is never made, particle II remains in its indefinite momentum state, or Bohr might say that one cannot assign any physical meaning to an unmeasured momentum. Nevertheless, the certainty of the prediction abides, because the
possibility that the measurement could be made enters the quantum picture in the same way that a double-slit interference setup provides the possibility that a particle could go through either slit, and if that possibility is removed by closing one slit, the interference pattern disappears, to be replaced by a diffraction pattern. Having two slits open doesn’t just produce two diffraction patterns. Giving the particle a choice of slits creates a whole new set of possible trajectories revealed by the interference pattern. In classical physics, what does happen obviously depends on what could happen, but one of the mysteries of Quantum Mechanics is that what could happen but doesn’t retains a profound influence on what does happen. In essence, Bohr simply dismissed Einstein’s case on the grounds that it had no standing in the court of Quantum Mechanics. In the end, it all boiled down to irreconcilable differences between the two men’s physical intuitions.

5.12 Hidden-Variable Theories and Bell’s Inequalities

As soon as Born published his probabilistic interpretation of the wave function in 1926, the physics community split into separate camps based on metaphysical preferences regarding whether the Universe behaved according to exclusively deterministic laws. Einstein led the forces defending strict determinism, and the first forays involved attacks on the completeness of Quantum Mechanics. As early as 1927, Einstein had worked out a supplemental formalism intended to remove the nonepistemic random elements by adding constraints to the physical canon governing microscopic processes. These constraints involved what were to become known as “hidden variables”, somewhat analogous to the phase space of classical Statistical Mechanics, i.e., the unobservable microscopic properties of molecules that give rise to the observable properties of a gas, such as temperature and pressure. He abandoned this approach, however, after he realized that it implied the inseparability of coupled quantum systems, a feature of Quantum Mechanics that he later placed at the foundation of his EPR argument.

In order for a hidden-variable theory to be successful, it had to reproduce all the observable effects of Quantum Mechanics while maintaining exact values for all physical parameters in principle, i.e., in the manner that the formalism of classical Statistical Mechanics regards the momentum and position of each particle in the system to exist objectively and with unique values at all times. Even though these parameters are not individually observed, they can be treated as epistemic random variables with distributions that allow statistical computations to yield observable quantities. Their random behavior is still constrained by Newtonian mechanics, which assumes determinism, causality, and realism.

The issue of nonlocality was not fully recognized early on and hence was not a primary focus, but causality violations such as those accompanying superluminal communication would clearly have to be ruled out, which is probably why Einstein abandoned his first attempt, choosing instead to make nonlocality work for him rather than against him by assigning it the role it plays in the quasi reductio ad absurdum argument in the EPR paper. And yet it was not actually clear that the nonlocality implicit in the EPR paradox necessarily leads to causality violations.
The two particles seem to communicate instantaneously via a secret channel, but this is not the same as two conscious beings (or two computers) exchanging information at superluminal speeds. As long as the latter could not be achieved by nonlocal effects, the absurdity of causality violations could not be used as an argument against all nonlocality. In order to communicate information in the usual sense, that information must be encoded as a decipherable pattern on some
carrier. If the correlated states of entangled particles were capable of carrying Morse code, for example, then the nonlocality could not be tolerated even in a hidden-variable theory. But that question was not yet answered. The first well-known hidden-variable theory was put forth by de Broglie (1927) and was called the “Pilot Wave Theory”. It applied only to single particles, which initially were represented as singularities in a real physical wave but were soon changed back to traditional particles whose motion was guided by a wave via a term added to the potential energy called the quantum potential. This potential supplied the quantum force that guides the particle along a path that is deterministic but subject to epistemic uncertainty. In order to mimic the quantum randomness, the quantum potential was determined by the probability density defined by the wave function. This formalism worked within the context considered by de Broglie, but it failed for multiple-particle systems, and de Broglie abandoned it after failing to find a way to fix it. Work on the basic pilot-wave approach was later resumed by David Bohm (1952), who expanded it for multiple-particle systems at the expense of embracing nonlocal effects, and the pilot wave was something existing in configuration space (the sample space of all relevant observables, hence typically infinite-dimensional) but physically acting on the particle, so that the pilot wave must be taken as real itself. These aspects made it implausible to most physicists, certainly to Einstein because of the nonlocal character. Bohm’s own position, however, was not that his formalism was correct in any absolute sense but rather that it demonstrated that a deterministic formalism could reproduce the results of Quantum Mechanics, apparently falsifying a theorem put forth by John von Neumann (1932) that was widely interpreted as a proof of the impossibility of hidden-variable theories (the purpose, applicability, and correctness of the proof have turned out to be somewhat controversial). Bohm’s formalism drew the attention of John Stewart Bell (1964; for an excellent discussion of this paper and supporting material written earlier by Bell but published later, see Jackiw and Shimony, 2001), who studied the possibility of removing the nonlocal effects and in the process derived Bell’s Theorem, which states that no local hidden-variable theory can reproduce the predictions of Quantum Mechanics, and Bell’s Inequality, the first of a series of inequalities applicable to various laboratory setups that eventually enabled experimental probing of the EPR paradox. Bell’s Inequality expresses a fundamental difference between the statistical predictions derivable from Quantum Mechanics (hereinafter the “quantum model”) and the statistical predictions of otherwise-quantum theory but based on deterministic local physical processes underlying epistemically random behavior (hereinafter the “local model”). Thus the local model discussed below does employ the Pauli Exclusion Principle and the collapse of spin states, but not any nonlocal effects on particle II due to anything that happened to particle I, and all randomness is epistemic. We are exclusively concerned with differences between the models regarding what happens after the particles separate. Bell’s Inequality applies to the local model, and the quantum model predicts violations of the inequality. 
Since statistical behavior is involved, many trials are required in order to obtain high statistical significance, just as testing the hypothesis that a coin is fair must employ more than a few coin flips. Another metaphor used by Schrödinger in his 1935 paper represented EPR particle II as an object being used by a teacher to test a pupil and particle I as the answer book from which the teacher gets the correct response for the chosen question. Similar illustrations have proven useful in subsequent descriptions of the EPR Paradox. The teacher selects a question, either “What is the position of particle II?” or “What is the momentum of particle II?” and then metaphorically consults the answer book by measuring the appropriate parameter of particle I. Thus the teacher knows the correct answer and
is always impressed by the pupil’s unerring accuracy in answering the question, but then the pupil invariably falters when asked the other question (because the particles are no longer entangled after the first question is answered, so the pupil’s answer to the second question is randomly related to what is in the teacher’s answer book). The pupil cannot know in advance which question is going to be asked first. Even the teacher may not know; it may be decided by a coin flip immediately before the answer book is consulted. Somehow the pupil seems to know which question is going to be asked first and always has the right answer ready, after which the pupil becomes temporarily incompetent until a new pair of particles is prepared and the process repeated. In order for the EPR particles to behave as described while conforming to deterministic local physical laws, they must both “know” the answers to both questions by the time they separate after interacting. This was what Einstein assumed in his definition of realism, and this assumption is the defining quality of the local model.

Bell allowed the teacher the option of asking the question corresponding to the answer already looked up or the other question, with the choice decided by a coin flip. Thus sometimes the student’s first answer does not pertain to the one looked up by the teacher, in which case perfect agreement cannot be expected, but statistical predictions regarding the rates of agreement and disagreement are still possible and turn out to be highly relevant. To make the two questions more comparable, Bell switched the EPR model to one suggested by Bohm and Aharonov (1957): instead of entanglement in momentum and position, the two particles are fermions whose half-integer spins must be entangled because of the interaction and opposite because of that and the Pauli Exclusion Principle. The spin directions can be measured for either or both particles via deflection by magnetic fields with any desired orientation. A crucial aspect of spin-direction measurement, however, is that it induces a spin that is either parallel or antiparallel to the measurement direction; information about the original spin state is lost. This is just the collapse mechanism inherent in measurements operating with the only two available eigenvalues, which are commonly represented as ±1 in units of ħ/2. So Quantum Mechanics predicts that if the spins of both particles are measured in opposite directions, the product of the resulting values is always +1 because of the Pauli Exclusion Principle and the entanglement of eigenstates, and if the measurement directions are parallel, the product is always -1 for the same reasons.

As we saw in Chapter 2, when two random variables are correlated, their joint distribution cannot be factored into a product of their separate marginal distributions. When computing the probability distribution for the observables of two previously interacting particles, there is a key difference between what the quantum model yields and what the local model yields. Both joint distributions describe correlation and cannot be factored into marginal distributions, but in the local model, the marginal distribution for particle II is not affected by anything done to particle I; a measurement on particle I simply locates the slice through the joint distribution that defines the particle II distribution given what has become known about particle I.
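To make the distinction concrete, here is a minimal numeric sketch in Python (illustrative only; the 2×2 array and variable names are not from the text). It encodes the joint distribution of two perfectly anticorrelated ±1 outcomes, shows that it does not factor into the product of its marginals, and shows how conditioning on one outcome simply selects a slice of the joint distribution.

```python
import numpy as np

# Joint distribution of two perfectly anticorrelated +-1 outcomes
# (rows: A = +1, -1; columns: B = +1, -1).
joint = np.array([[0.0, 0.5],
                  [0.5, 0.0]])

pA = joint.sum(axis=1)            # marginal for A: [0.5, 0.5]
pB = joint.sum(axis=0)            # marginal for B: [0.5, 0.5]

print(np.allclose(joint, np.outer(pA, pB)))  # False: the joint does not factor into its marginals
print(joint[0, :] / pA[0])                   # [0. 1.]: the slice given A = +1 makes B = -1 certain
```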
Even though the local model is formulated to include the collapse mechanism, since nonlocal effects are to be ruled out, the collapse of particle I sharpens only that particle’s marginal distribution (hence the joint distribution) but cannot sharpen the marginal distribution for particle II, i.e., cannot reduce its dispersion. In the quantum model, the joint distribution is defined by an entangled wave function, and the collapse of the particle I distribution to an eigenstate simultaneously sharpens the joint distribution in a manner that directly modifies the marginal distribution for particle II by forcing it also to collapse. Bell found a way to shine a light on this difference and bring it into focus in a manner that facilitates laboratory measurements. To demonstrate strong absolute correlation itself would not be difficult. The master stroke was
finding a way to show that the correlation produced by quantum entanglement could not be obtained by local realism. For example, if the spins of the two fermions mentioned above were measured many times in various parallel directions with perfect equipment, the quantum model predicts that the product of their results would be -1 in every case. In real life, laboratory instruments have imperfections, and this is one of the reasons why the number of trials required to achieve convincing statistical significance is large. Existing laboratory equipment will occasionally fail to behave perfectly, so perfect absolute correlation cannot be expected. Clearly the optimal representation of measurement error discussed in Chapter 4 is needed for a convincing evaluation of statistical significance. But assuming that we convince ourselves that the spins are strongly negatively correlated, that could be explained by the particles acquiring those spins during the interaction and keeping them until the measurements were done. We would have succeeded only in demonstrating the Pauli Exclusion Principle. In order to probe nonlocality, some more subtle nuances must be introduced, specifically, including more general angular separation in the directions for the two measurements and ensuring that the “pupil” has no way to anticipate which question the “teacher” is going to ask. For example, if we measure the spin of particle I in the north direction and that of particle II in the southeast direction, the quantum model does not predict that we will always get the same product of results even with perfect equipment, because the collapse of particle I does not provide a 100%probable result for particle II. The statistical results will depend on the underlying probability distributions, and the local model predicts results that are not identical to those of Quantum Mechanics. The fact that probabilities enter at all demands that the hidden variables be epistemic random variables drawn each time from an ensemble, deterministic but unknown, like the trajectories of gas particles in Statistical Mechanics. It was not necessary for Bell to assume a specific distribution for the hidden variables. He was able to make his point with just the requirement that the deterministic hidden variables are affected only by local physical effects. Unless they also behave nonlocally (as in the Bohm model), the hidden-variable model could not mimic Quantum Mechanics. The key aspect of the local hidden-variable model is that after the particles have separated, neither is affected by anything done to the other. The fact that the post-interaction spin states of the two particles are correlated by the Pauli Exclusion Principle just causes the two particles to follow the corresponding distributions appropriate to their respective measurement directions. For the quantum model, what is done to particle I in general modifies the entire joint probability distribution, hence the probabilities for the outcomes of all possible subsequent measurements on particle II. The purpose of a Bell experiment is therefore to detect the differences between the statistical behavior of these two models. Bell’s original paper has been paraphrased by a number of researchers who were able to simplify the idea while maintaining its essence (e.g., Mermin, 1985; Penrose, 1989; Jackiw and Shimony, 2001; Ghirardi, 2005; Zeilinger, 2010). 
Bell’s Theorem (see Appendix J) employed both the quantum model and the local model to show that the latter, with its strictly local effects, could not imitate the behavior of the former, but among his later variations on Bell’s Inequality, he focused exclusively on the local model to produce constraints that could be tested experimentally. If these constraints were violated, then the local model had to be wrong. Whether the quantum model was correct could be tested independently, and of course its properties guided the formulation of constraints on the local model that could be expected to be violated if the latter were incorrect. He considered two observables: one labeled A representing a spin measurement on particle I in the direction of a unit vector a; one labeled B representing a spin measurement on particle II in
the direction of a unit vector b. In the local model, the particles acquire their correlated spin states during the interaction and keep them unchanged as they separate to a distance too large for any further mechanical interaction. We know that their spin states are -100% correlated because of the Pauli Exclusion Principle, but we don’t know the actual states, nor what a measurement in any given direction will yield.

Bell likened this classical condition to a man arriving at some destination and discovering that he had brought only one glove with him; at first he does not know whether it is for his right hand or left hand, and so he also does not know which glove was left at home, but upon closer examination, he sees that he has his right-hand glove and instantly knows that the one at home is for his left hand without having to observe it. Classically, this corresponds to measuring the spin of particle I as (e.g.) +1 in the direction a and not being surprised upon hearing that a measurement of the spin of particle II in the direction b = a had yielded -1. Seeing that the glove brought along is for the right hand did not force the one at home to become left-handed, it was for the left hand all along, and seeing that the spin of particle I in the direction a is +1 did not force the spin of particle II in the direction b = a to become -1, it was that way all along because of the Pauli Exclusion Principle, not because of any nonlocal effects.

By considering a more general relationship between the directions a and b, Bell was able to reveal a weakness in the local model. Consider a case in which a is allowed to point only in one of three possible horizontal directions relative to north, θ1 = 0°, θ2 = 120°, and θ3 = 240°, and b is constrained to point only in one of those same three directions available to a. We take “horizontal” to be the xy plane. On any given one of N trials, each unit vector is chosen randomly from its three available directions, independently of the other, and immediately prior to actual measurement. Once chosen, the spin of particle I is measured, then the spin of particle II, with the delay too small for any subluminal communication between the two measurement instruments. Thus the choice of b cannot depend on what choice was made for a. The results of the ith trial are denoted Ai and Bi, i = 1 to N. These are recorded, and later the two lists are brought together so that the correlation coefficient ρAB can be computed (see section 2.10, especially Equation 2.23, p. 67).

We consider the simple case of a fermion prepared in a state with its spin direction in the xy plane at an unknown random angle θ relative to the x axis. Denoting the measurement direction θm relative to the x axis, m = 1, 2, or 3, and defining Δθ ≡ θ − θm, the probability that the spin will be measured as +1 in units of ħ/2 is cos²(Δθ/2), and the probability that the spin will be measured as -1 is sin²(Δθ/2) (see, e.g., Binney and Skinner, 2014, section 7.3.1; sometimes these are seen in the forms of their trigonometric identities, ½(1+cosΔθ) and ½(1-cosΔθ), respectively, e.g., Penrose, 1989). This applies to each of the entangled particles with their independent values of θm and their θ values separated by 180° because of the Pauli Exclusion Principle. In one third of the cases, a and b will point in the same direction by chance, and the product of their spins will be -1 in the quantum model.
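As a quick numerical check of the measurement rule just quoted, the following short Python sketch (the function name is illustrative, not from the text) evaluates cos²(Δθ/2) for a few angular separations between the prepared spin direction and the measurement direction.

```python
import math

def prob_plus(spin_deg, meas_deg):
    """Probability of measuring +1 (in units of hbar/2) for a spin prepared at angle
    spin_deg and measured along meas_deg, using the cos^2(dtheta/2) rule quoted above."""
    return math.cos(math.radians(spin_deg - meas_deg) / 2.0) ** 2

for dtheta in (0, 60, 120, 180):
    print(dtheta, prob_plus(dtheta, 0.0))   # 1.0, 0.75, 0.25, ~0.0
```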
Without hidden variables, the local model predicts a tendency for -1 but an imperfect correlation of ρAB = -½, because in this model the collapse of particle I does not set up the spin of particle II for a 100%-probable outcome. But if hidden variables are included, they could operate as an agreement made between the particles before they separate such as: particle I will react to angles 1, 2, and 3 by collapsing to the +1, -1, and +1 states, respectively, and particle II will react to these angles by collapsing to -1, +1, and -1, respectively. The states for particle I could be chosen randomly by each particle pair, with the states for particle II always being the corresponding opposites, so that this predetermined behavior will not be identical for every particle pair. The result is that including hidden variables can make the local model capable of yielding ρAB = -1
and thereby mimicking the quantum model for a = b. Thus some trials with nonparallel measurements are needed. In the other two thirds of the cases, a and b will point in different directions, and what happens in the quantum model is very different from what happens in the local model, even augmented with hidden variables. In the local model, the result for B is whatever it would have been regardless of the result for A, even if particle II is following the hidden-variable instructions mentioned above. Over many trials, with or without hidden variables, the local model will not be able to mimic the quantum model perfectly.

In the quantum model, the measurement on particle I changes the value of θ seen at B; θ for particle II becomes either the angle corresponding to a or -a, depending on whether the A result was -1 or +1, respectively. Since a ≠ b, particle I forces particle II into a spin direction that is either 120° or 60° away from b, respectively, so that its cos²(Δθ/2) probability to collapse to +1 is either 0.25 or 0.75, respectively. Thus there is a 75% chance that B will be -1 if A was -1, and a 75% chance that B will be +1 if A was +1. In other words, A = +1 puts the spin of particle II 60° away from b with a 75% chance of collapsing onto it to give B = +1; A = -1 puts the spin of particle II 120° away from b with a 25% chance of collapsing onto it, hence a 75% chance of collapsing antiparallel to b, giving B = -1. Either way, there is a 75% chance of B = A, and a positive correlation results which turns out to be ρAB = ½.

To summarize, we have nine combinations to consider, three models (quantum, local without hidden variables, and local with hidden variables), each for three measurement angle cases (all a and b, a = b, and a ≠ b). In all cases, the means and standard deviations of A and B are 0 and 1, respectively. The correlation ρAB varies as shown in the table below; a short simulation sketch follows the table.

                      Quantum     Local, no HV     Local with HV
  ρAB (all a & b)        0              0               -⅓
  ρAB (a = b)           -1             -½               -1
  ρAB (a ≠ b)            ½              ¼                0
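The table can be checked by direct simulation. The sketch below is a minimal Python Monte Carlo (with illustrative function names; it is not the author's code) that implements the three models as described in the text: the quantum model with collapse of particle II, the local model without it, and the local model with the simple predetermined instruction sets. It estimates ρAB for each of the three angle cases; with enough trials the estimates settle near the tabulated values.

```python
import math
import random

ANGLES = (0.0, 120.0, 240.0)   # the three allowed measurement directions, in degrees

def measure(spin_deg, meas_deg):
    """Return +1 with probability cos^2(dtheta/2), else -1 (the rule quoted in the text)."""
    half = math.radians(spin_deg - meas_deg) / 2.0
    return +1 if random.random() < math.cos(half) ** 2 else -1

def trial(model, theta_a, theta_b):
    """Measure one entangled pair at directions theta_a (particle I) and theta_b (particle II)."""
    theta = random.uniform(0.0, 360.0)          # unknown preparation angle of particle I
    if model == "local_hv":
        # Predetermined "instruction set": particle I answers each angle with a random sign,
        # particle II always answers with the opposite sign.
        s = {ang: random.choice((+1, -1)) for ang in ANGLES}
        return s[theta_a], -s[theta_b]
    A = measure(theta, theta_a)                 # measure particle I
    if model == "quantum":
        # Nonlocal collapse: particle II's spin becomes opposite to particle I's measured spin.
        spin_2 = theta_a + 180.0 if A == +1 else theta_a
    else:                                       # "local_no_hv": particle II keeps its original spin
        spin_2 = theta + 180.0
    return A, measure(spin_2, theta_b)

def rho(model, case, n=200_000):
    """Estimate E[AB]; with zero means and unit variances this equals the correlation rho_AB."""
    total = 0
    for _ in range(n):
        a = random.choice(ANGLES)
        if case == "a = b":
            b = a
        elif case == "a != b":
            b = random.choice([x for x in ANGLES if x != a])
        else:                                   # "all a & b": chosen independently
            b = random.choice(ANGLES)
        A, B = trial(model, a, b)
        total += A * B
    return total / n

for model in ("quantum", "local_no_hv", "local_hv"):
    print(model, {case: round(rho(model, case), 2) for case in ("all a & b", "a = b", "a != b")})
```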
Other hidden-variable rules can be plugged into the local model, for example, half of the time on a random basis A reacts to each unit vector with +1 while B reacts with -1, and vice versa. This will get ρAB to -1 for a = b but fail differently for a ≠ b with ρAB also -1 for that combination. The former is required because that is what the quantum model predicts, but it cannot be obtained via hidden variables without failing for a ≠ b unless nonlocal effects are included. Bell’s Inequality for the spin-correlated entangled-fermion case is derived in Appendix J:
\[ 1 + E(u_2, u_3, \lambda) \;\ge\; \bigl|\, E(u_1, u_2, \lambda) - E(u_1, u_3, \lambda) \,\bigr| \tag{5.56} \]
where E(u, v, λ) is the expectation value of the product AB as a function of any two measurement unit vectors u and v and any desired hidden variables λ for the local model. The corresponding expectation value for the quantum model, denoted Eq(u, v), can also be used in this inequality to see whether that causes violations. We can probe that question by using a Monte Carlo simulation of a laboratory experiment in
which a large number of spin measurements are made on both particles, each measurement using any one of three fixed measurement unit vectors denoted u1, u2, and u3 (i.e., corresponding to the three angles θ1, θ2, and θ3 above, respectively), chosen randomly on each trial. After 10 million trials, every unit vector combination will have been used more than a million times, and we can compute values for E(u1,u2,λ), E(u1,u3,λ), and E(u2,u3,λ) to sufficient statistical stability to test for violations of the inequality, and similarly for Eq(u1,u2), Eq(u1,u3), and Eq(u2,u3). We can try the local model without hidden variables and also with the simple ones mentioned above, namely with particle I forced to respond to each unit vector with +1 or -1 chosen randomly for each vector on each trial, with particle II forced to the corresponding opposite values, agreed upon while the particles are still interacting. The local model without hidden variables is identical to the quantum model with only one exception: particle II is not collapsed by the measurement on particle I, rather it reacts to its measurement as though particle I did not exist.

Not just any choice of u1, u2, and u3 will cause the quantum model to violate Bell’s Inequality. Aspect (1981) finds a particularly useful choice to be those corresponding to θ1 = 0°, θ2 = 22.5°, and θ3 = 45° (i.e., rather than 0°, 120°, and 240°), and so those were used in the Monte Carlo simulation. The results are shown in the table below, where we use ij to denote Eq(ui,uj) for the quantum model and E(ui,uj,λ) for the local models; a short arithmetic check follows the table.

                  Quantum           Local, no HV      Local with HV
  12           -0.9243684242      -0.4599470203       0.0000099048
  13           -0.7054906599      -0.3520477690       0.0011791052
  23           -0.9241747540      -0.4614735167       0.0004219106
  1+23          0.0758252460       0.5385264833       1.0004219106
  |12-13|       0.2188777643       0.1078992513       0.0011692004
  Violation?        Yes                 No                  No
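The violation test itself is simple arithmetic once the expectation values are in hand. A minimal Python check (illustrative only; it uses the quantum expectation value -cos(θi − θj) quoted in the next paragraph from Appendix J rather than a simulation) reproduces the quantum column of the table.

```python
import math

angles = {1: 0.0, 2: 22.5, 3: 45.0}   # degrees; the choice attributed to Aspect (1981) in the text

def E_q(i, j):
    """Quantum-model expectation of the product AB for measurement directions i and j."""
    return -math.cos(math.radians(angles[i] - angles[j]))

lhs = 1.0 + E_q(2, 3)                       # "1+23"      ~ 0.0761
rhs = abs(E_q(1, 2) - E_q(1, 3))            # "|12-13|"   ~ 0.2168
print(round(lhs, 4), round(rhs, 4), "violated:", rhs > lhs)   # Inequality 5.56 is violated
```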
As expected for the quantum model (see Appendix J, Equation J.3, p. 484), 12 ≈ -cos(22.5°), 13 ≈ -cos(45°), and 23 ≈ -cos(22.5°), with random fluctuations in the third significant digit. Instrumental errors were not simulated, so the effects of random fluctuations would be larger in real laboratory experiments. Even though Quantum Mechanics had earned an excellent track record in predicting measurement outcomes by the mid 20th century, proving experimentally that it could violate Bell’s Inequality remained too difficult for many years. The use of spin-correlated photons proved to be much more manageable than spin-correlated fermions, and eventually Freedman and Clauser (1972) were able to report violations at the 6σ level with no significant deviations from the predictions of Quantum Mechanics. Subsequent work focused on enforcing delayed-choice instrumental settings, i.e., ensuring that the selections for a and b from the set (u1, u2, u3) are made after the particles have moved away from each other too far for information to be exchanged between the measurement instruments by any signal traveling at the speed of light or slower (for oppositely moving photons, any time later than
when they are emitted will clearly suffice). Violations of Bell’s Inequality satisfying these constraints were reported at 5σ significance in a series of papers by Aspect et al. (1981, 1982). In all arguments that rely on statistical significance of experimental results, absolute certainty is not possible. All that can be done is to drive the significance higher and higher until the phenomenon in question can be considered verified beyond a reasonable doubt. Violation of Bell’s Inequality by Quantum Mechanics has not yet reached universal acceptance by modern physicists because of remaining loopholes. One of the main sources of doubt is the fact that modern detector efficiencies are nowhere near 100%, sometimes causing one of the two photons to go undetected, in which case the event must be discarded. This also forces us to recognize that there is some fraction of cases in which both photons escape detection. The fair sampling assumption is that the photon pairs that are detected are representative of all photon pairs, but this cannot be proved absolutely except by achieving 100% efficiency in detecting and measuring the entangled particles (progress toward this goal has been reported by Hensen et al., 2015). In addition to instrumental limitations, objections can be raised regarding the fact that Bell’s Inequality refers to three different correlations, i.e., three different unit-vector combinations, whereas any one pair of photons can provide data for only one combination. It is necessary to assume that all photon pairs are equivalent in their contributions to statistical behavior regarding the three possible unit-vector combinations. Most physicists consider that reasonable, but it is not a proven fact. Another loophole is that the determinism desired by Einstein could be operating with such efficiency that everything in the Universe is predetermined, including what seems to be the free or random choice of which unit vectors to use for a particular photon pair. This superdeterminism could presumably set the photon spins and the unit vectors that will be used to measure them and the measurement results themselves in such a way that the observed correlations arise, rendering any notion of a role for free will a complete illusion. Bell, himself uncomfortable with nondeterminism, nevertheless rejected superdeterminism as implausible and self-defeating. Zeilinger (2010) rejects it on the basis that in principle it negates the motivation for pursuing science: It was a basic assumption in our discussion that choice is not determined from the outside. This fundamental assumption is essential to doing science. If this were not true, then, I suggest, it would make no sense at all to ask nature questions in an experiment, since then nature could determine what our questions are, and that could guide our questions such that we arrive at a false picture of nature.
To assume that it does make sense to ask Nature questions in experiments is to assume the truth of what may be an unprovable proposition. Scientific work involves solving puzzles, and the solution of any nontrivial puzzle demands that the existence of a solution be accepted as a working hypothesis. Applying this to the question of whether the pursuit of science is worthwhile, Einstein expressed his intuitive reaction as quoted above: “a conviction, akin to a religious feeling, of the rationality or intelligibility of the world” and “a faith in the possibility that the regulations valid for the world of existence are rational”. This is not quite the same as true religious faith, however. It is not an unshakable belief that the working hypothesis is true; it is only an expectation that accepting it will probably lead to a justification for having done so, and failing that, at least possibly some understanding of why it was either false or undecidable. As for whether a scientist should assume the validity of
free will, the mathematician Edward Lorenz (1993) postulated doing so with this statement: We must wholeheartedly believe in free will. If free will is a reality, we shall have made the correct choice. If it is not, we shall still not have made an incorrect choice, because we shall not have made any choice at all, not having a free will to do so.
So we reject superdeterminism not because it has been proved false but because we are not prepared to surrender the pursuit of science. If superdeterminism is in effect, then there is nothing that we can do about it anyway, and we might as well indulge our fantasy that we are engaged in meaningful activity as we study Nature (and no one who knows the truth can find fault in our choice, since we never actually made one). There is no clear reason to expect that the automatons of a superdeterministic Universe cannot be led eventually to a proof that free will is an illusion. Meanwhile, evidence for the validity of the alternative hypothesis continues to pile up and drive its statistical significance ever higher. Even for scientists there is a threshold beyond which the word “believe” is acceptable for all practical purposes. In the author’s opinion, superdeterminism is not really separable from determinism in general as far as physical “laws” are concerned. Can there be a partial determinism that is somehow less allencompassing than superdeterminism? Can physical laws be 100% deterministic with consciousness operating outside the scope of physical laws? But in that case, since the history of the Universe is clearly affected by conscious decisions, there would have to be some mechanism by which consciousness is able to override the deterministic physical laws, and given the assumption that consciousness itself is not bound by deterministic laws, the Universe would still be proceeding under the influence of nondeterministic processes. We conclude that rejecting superdeterminism amounts to rejecting determinism in general as an absolute constraint on physical processes. This falls far short of explaining free will, but if the Universe itself does not “know” for certain what it is going to do next, there is at least an opening for the agency of consciousness to select from among the available options rather than leave the outcome to pure randomness. To go further will require a vast improvement in our understanding of what consciousness is. It also strikes the author that nonlocality and nondeterminism are essentially equally challenging to accept. Efforts to eradicate the latter via hidden-variable theories have so far left the former intact, and the great success of Quantum Mechanics suggests strongly that we are stuck with nonlocality. As we will see below, the inherent randomness of Quantum Mechanics prevents the nonlocal effects of quantum entanglement from carrying information that could lead to causal absurdities. This argues that these two black sheep are best kept together and leads us to embrace both the randomness and nonlocality of Quantum Mechanics as essential ingredients of the Universe’s behavior. That decision has now proved to be very fruitful for investigators studying quantum computing. The laboratory techniques developed originally for studying Bell’s Inequality have been expanded greatly to explore the ideas that arose in the early 1980s based on the realization that a physical object in a superposition of states is actually a physical unit capable of storing more information than a binary digit (or bit) stored in a single unit of conventional computer memory capable of only two states that can be used to represent on/off or 0/1. In order to support computation, it is necessary for the information in what is now known as a qubit to be transferred from one place to another, and quantum entanglement was a natural choice of mechanism for this. 
Although daunting engineering challenges have been involved, theoretical analysis has established that quantum computing could in principle
expand numerical capabilities vastly beyond those of the most advanced “classical” computers (where we use quotation marks because modern computers are themselves based on transistors, which exploit quantum-mechanical features, but not entanglement explicitly). While a fully operational general-purpose quantum computer has not yet been built, functioning circuits have been put together and tested successfully. The results have shown that some problems previously thought to be intractable because of the sheer number of arithmetic operations required will soon become routinely solvable, such as the factoring of gigantic integers used in cryptography. Exploiting entanglement for cryptography even provides a capability for the detection of eavesdropper activity. The advances in laboratory instruments driven by quantum computing have also lent themselves to more powerful tests of local hidden-variable theories, and as a result, further nails have been hammered into the latters’ coffins. Greenberger et al. (1990, 2007) investigated entangled systems of three or more photons and found experimental arrangements that do not require an accumulation of statistical significance. In one arrangement, three spin-entangled photons are generated with identical polarizations that are subsequently measured in three different directions for which Quantum Mechanics predicts 100% correlations that cannot be produced by any local hidden-variable theory. In this case, a single experiment suffices to rule out all possible local hidden-variable theories without the need to accumulate statistical significance over multiple trials. All that is needed is for an actual laboratory experiment to produce the results predicted by Quantum Mechanics. The state of the laboratory art had to be extended (which simultaneously served needs in quantum computing), and finally Bouwmeester et al. (1999) were able to report results confirming Quantum Mechanics. But the details of quantum computing are beyond our scope. Excellent discussions are given by Ghirardi (2005) and Zeilinger (2010). The point is that there is overwhelming evidence for the validity of quantum entanglement, nonlocality, and randomness. That leaves us with a number of questions to ponder. Can information be transferred at speeds exceeding the speed of light? In what sense do the nonlocal effects of entanglement operate “instantaneously”? How does the wave function “collapse”? Are nonlocal hidden-variable theories viable? We will address these in the following sections. To close this section, we note that although the use of expert polls is often criticized (especially by those whose opinions are not supported by the poll results), given the murky nature of the interpretational questions surrounding Quantum Mechanics, the author believes that ignoring the intuition of well-informed researchers is ill advised. Schlosshauer et al. (2013) present the results of such a poll of 33 physicists who are generally recognized as qualified to comment on such issues. Over half of these held the opinion that nonepistemic randomness and nonlocality are in fact inherent features of microscopic physical processes.
5.13 Nonlocal Effects and Information Transfer
The acceptance of nonlocal effects is demanded by the fact that Quantum Mechanics depends on them and has achieved ever more convincing success in predicting measurement outcomes, some of which violate classical intuition to such an extent that they would never have been predicted without Quantum Mechanics, and this adds greatly to the impressiveness when the effects actually appear in the laboratory. Bell’s Theorem shows that even if we can banish nonepistemic randomness via hidden-variable theories, we will still be left with nonlocality (or else something vastly less effective than Quantum Mechanics). And since nonlocality involves an action at a given location A having an effect at some arbitrarily distant location B, it was natural to suppose that entanglement could be used as a carrier of coded information to be transmitted at speeds greater than that of light. After all, a telegraph operator at A can press a key and cause a click to sound at B, and an entangled-spin measurement that produces a +1 at A can cause an identical measurement at B to produce a reliably predictable outcome and do it vastly more rapidly than a telegraph signal. In this section we will consider whether Quantum Mechanics allows instantaneous nonlocal action to be used for superluminal communication of information. The question of whether nonlocal hidden-variable theories allow it will be postponed to section 5.16.

Transfer of information requires encoding it in some fashion that can be transmitted and decoded at the receiving end. For example, modern computers map the capital letters of the English alphabet to 8-bit numbers ranging from 65 to 90 in decimal value (these constitute a subset of the ASCII characters, the American Standard Code for Information Interchange). If one can transmit a single bit (a binary digit that can take on the values 0 or 1), then one can transmit a stream of bits, and it follows that letters and words can be transmitted. If entanglement could be used to transmit on/off (hence 0/1) information, then it could transmit letters and words faster than any signal traveling at the speed of light. This appeared to be purely an engineering problem, and a number of enthusiastic proposals began appearing shortly after quantum entanglement had been identified, although it was not always clear whether the intent was to patent a process for superluminal communication or continue to press the original EPR argument about the alleged shortcomings of Quantum Mechanics.

Suppose a pair of physically separated researchers A and B have a source of streaming entangled fermions between them and have agreed to make spin measurements only in one and the same direction. They know that if A gets a +1 result, then B will get a -1 result, and vice versa. They agree that a +1 result corresponds to a binary 1 and a -1 result corresponds to a binary 0. Then A will attempt to send ASCII code to B “instantaneously”. The message, “Hello”, begins with “H”, which has the decimal ASCII code 72 corresponding to the binary 01001000. So A has to get a spin measurement of +1 so that B will get a -1 and therefore receive the first 0 in the code. But a problem arises immediately: researcher A cannot control the outcome of his spin measurement; +1 and -1 are equally likely.
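The encoding step itself is trivial; a short Python illustration (not tied to any particular laboratory scheme) shows the bit pattern that A would need B to receive.

```python
# "H" has decimal ASCII code 72, i.e. binary 01001000; the rest of the message follows the same rule.
message = "Hello"
print(ord("H"))                                            # 72
print(" ".join(format(ord(ch), "08b") for ch in message))  # 01001000 01100101 01101100 01101100 01101111
```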
The only way to guarantee a +1 is to “prepare” his fermion so that its spin is already aligned with his measurement direction, but such a preparation is itself a measurement that destroys the entanglement with B’s fermion. Furthermore, it might take several tries to get A’s fermion into the desired spin state, since that process is also random. While A is puttering around trying to get a fermion with the desired spin, B has no idea what is going on, and when A finally gets the right spin state, it is no longer correlated with B’s fermion anyway. They might agree that A will simply make measurements until the desired result is obtained and then immediately seal off the fermion source, so that B will know that after a halt in the fermion
stream, the last result obtained is the desired bit value, and after a pause, the fermion stream will be restarted so that the next bit can be transmitted in the same way. But this requires the fermion source to be close to A in order to provide time to react when the desired spin has been obtained. It would also be expected to leave some neglected particles in the stream that are already in flight, but presumably the sudden stoppage of the stream could be understood to imply that the last bit decoded before a known delay time was the bit intended. Anything of this nature, however, limits the transmission speed to the fermion speed, defeating the purpose of exploiting the “instantaneous” effect of entanglement. It was not immediately clear whether such problems were just more engineering challenges to be worked out. If the instantaneous nature of entanglement collapse is to be useful in rapid communication, clearly the information to be transmitted cannot be encoded into the stream of entangled particles at the source, because that would succeed only in patterns that travel at the stream speed. The particles must be left undisturbed as they travel to the two locations of A and B, and only upon arrival can they be manipulated. There would be periods during which no manipulation was taking place, so random noise would be all that B could detect, but this presents no problem, since detecting “garbage” on a communication channel has long been a solved problem, and the onset of meaningful carrier patterns is easily detected. But when information is being transmitted, it is obviously crucial that what is received at one end must reliably indicate systematic manipulation at the other. This is exactly what Ghirardi et al. (1980) showed to be impossible for any instrumental arrangement of the sort we are considering (for an extensive discussion, see Bassi and Ghirardi, 2003). Using completely general arguments based only on the assumption that Quantum Mechanics is valid, they showed that the expectation value of the result of the spin measurement at B does not depend even on whether any measurement was made at A. On average, B will observe whatever B observes independently of what is attempted at A, and the statistical results do not depend on which measurement is made first. The EPR mechanism is operative, the nonlocal effect does take place, the spin states at the two locations are correlated prior to any measurements that may be made, but assuming that both measurements are made, it is impossible to tell which measurement broke the entanglement until such time as the measurement results are brought together (necessarily subluminally) and measurement time tags in a mutually defined reference frame are compared. One measurement operates on the entangled system and the other operates on a post-entanglement state, and statistically neither result depends on which role its measurement played in the process. Only random “garbage” can be “transmitted” because of the irreducible randomness of Quantum Mechanics. Note that the A/B symmetry is consistent with the relativistic effect in which an observer moving with respect to both A and B may see the A measurement happen before the B measurement, while a differently moving observer may observe the reverse time order. If a message was being sent, the two moving observers would have opposite interpretations of the transmitter/receiver relationship. This applies even when A and B are at rest with respect to each other, i.e., v = 0 in Equations 5.53-5.55 (p. 
276), so that ΔT′ = ΔT, both zero for arbitrarily large u, and the collapses of any pair of entangled particle spins are instantaneous in both systems. The very fact that it is impossible to arrive at a consistent identification of who is sending and who is receiving valid for all reference frames argues that “instantaneous” transmission of messages is impossible, but this is not a proof, because it depends upon human sanity being in effect, hence it assumes that which is to be proved, the absence of causality violations. It is very difficult to arrive at a proof when human sanity cannot be assumed! But we do assume human sanity for the same reason that we assume free will: it allows us to proceed with our
study of Nature, and if we are wrong, we cannot do anything about it anyway, nor need we be concerned that some condemnation of our methodology will be made by logical argument. So we accept the proof given by Ghirardi et al., and perhaps we should be grateful for the (arguably ironic) way in which randomness in this case prevents nonlocality from precipitating causality violations in our Universe, especially given the superluminal mechanism discussed in the next section.

Of course spin correlations are not the only form of quantum entanglement. The original EPR argument involved position/momentum entanglement, and many other manifestations are possible. It was inevitable that the search for superluminal communication mechanisms would replace the pursuit of perpetual motion machines. Despite the desire expressed at the end of their 1980 paper, “to stop useless debates on this subject”, the work of Ghirardi et al. was just begun, and more papers followed (e.g., 1983 and 1988) in which attempts to get around the 1980 proof were shown to be lacking, usually due to inconsistent assumptions or misapplication of the quantum formalism.

5.14 “Instantaneous” Nonlocal Effects

The EPR phenomenon discussed in section 5.11 was presented in the language of nonrelativistic Quantum Mechanics. The attempts to make Quantum Mechanics consistent with Special Relativity began almost immediately after the breakthroughs by Heisenberg and Schrödinger in the mid 1920s. In section 5.6 we mentioned that de Broglie used Special Relativity to transform the rest-mass energy m0c² of an object with rest mass m0 to its energy in the laboratory frame in which the object is moving with speed v and has reference-frame mass m, obtaining hc/λ, where λ = h/mv, and that relativistic forms of Schrödinger’s wave equations were developed by Oskar Klein and Walter Gordon in 1926 and by Paul Dirac in 1928. But these were not global translations of Quantum Mechanics into relativistic form. In the late 1940s the quantum mechanical description of the interactions of charged particles with the electromagnetic field was eventually made compatible with Special Relativity to arrive at what is known as Quantum Electrodynamics through the work of a number of physicists, e.g., Sin-Itiro Tomonaga (1946), Julian Schwinger (1948), Richard Feynman (1949), and Freeman Dyson (1949). Subsequent developments in Quantum Field Theory further extended the compatibility of Quantum Mechanics with Special Relativity, and some progress on calculating quantum fields in curved space has been achieved, but so far the problem of unifying Quantum Mechanics and General Relativity remains unsolved.

The entanglement discussed by Einstein in the EPR paper did not involve any electromagnetic interactions, just a very generic wave function such as that in Equation 5.52 (p. 274) that could apply to any situation independently of whether relativistic considerations had been involved in arriving at it. Assuming that this wave function has the right form, the “collapse” associated with a measurement must occur for both particles at the same instant in some reference system. Einstein did not refer to any relative motion between the instruments, and although they need not be at rest relative to each other, that is how laboratory entanglement experiments have been done so far, so for the moment we will just assume that the instruments have a mutual rest frame with synchronized clocks.
Then the point is that when all but one eigenfunction belonging to a given particle acquire probability amplitudes of zero in Equation 5.52, all of the other particle’s corresponding eigenfunctions become multiplied by zero and are thus eliminated from the joint wave function. With v = 0 and arbitrarily large u, Equation 5.55 shows that ΔT approaches 0. With v = 0 and finite u in Equation 5.53, we have
ΔT′ = ΔT. Either way, even with the superluminal nonlocal effect, nothing in one system propagates into the past of the other system. Observers in uniform motion with respect to A and B may see different temporal orders of the collapses, but if they want to know what “really happened”, they will have to represent their observations in the A&B rest frame by applying Lorentz Transformations.

But what if the two instruments are in relatively moving rest frames? Even though superluminal signaling is ruled out because A’s spin measurement outcome cannot be controlled, nevertheless the result obtained by A does induce a nonlocal effect at B, and Equation 5.53 implies that information regarding A’s spin measurement result can propagate into B’s past if that information propagates to B with a speed u > c²/v. Perhaps Nature does not permit this atrocity, but we cannot rule it out. If we identify the speed v as the group velocity of everything in B’s laboratory reference frame relative to A’s rest frame, we should denote it as vg, as in Equation 5.42 (p. 257). More specifically, it is the average group velocity of the total wave packet describing everything in B’s laboratory. By symmetry, this has the same magnitude as the group velocity of the A laboratory reference frame measured in the B frame.

We consider entanglement of the two measurement instruments because, as Schrödinger, Bell, and others have pointed out, a measurement is an interaction between an instrument and what is measured, so the two become entangled. When two instruments each become entangled with one of two entangled particles, they become entangled with each other. One of the instruments may enter the picture sooner or later than the other, especially given the absence of absolute simultaneity, but entanglement experiments depend on both instruments eventually interacting with an entangled particle and thus becoming entangled with each other. The reason for considering entangled instruments rather than just entangled photons will become clear below.

Bell (2004, Chapter 23) criticized the artificial separation of what is measured and what does the measuring into a “system” S and the “rest of the world” R by pointing out that because R is also a quantum-mechanical system, eventually it must be absorbed into S: “It is like a snake trying to swallow itself by the tail. It can be done -- up to a point. But it becomes embarrassing for the spectators before it becomes uncomfortable for the snake.” In other words, self-consistency demands that S and R be treated as quantum-mechanical systems from the outset. Many physicists are skeptical about describing complete macroscopic objects such as laboratory instruments by wave packets. But as mentioned in section 5.6, Arndt et al. (1999) found that a C60 molecule displayed a de Broglie wavelength appropriate for the aggregation of 60 carbon atoms, so there is some evidence of quantum behavior for systems with more mass than simple electrons and protons.

Temporarily accepting the hypothesis that the apparatus in system B can be described by a wave packet, we discover that ΔT can indeed be negative. As stated in section 5.7, the group velocity is the derivative of the angular frequency ω with respect to the angular wave number k,
\[ v_g = \frac{\partial \omega}{\partial k} = \frac{\partial E}{\partial p} \tag{5.57} \]
For an object with mass m moving freely with speed vg = p/m, where m and vg are represented in some observer’s reference frame, Special Relativity gives the square of the total energy in that frame as (see, e.g., Penrose, 2004, section 18.7)
\[ E^2 = p^2 c^2 + m_0^2 c^4 \tag{5.58} \]
where m0 is the rest mass of the object; then

\[ v_g = \frac{\partial E}{\partial p} = \frac{p c^2}{\sqrt{p^2 c^2 + m_0^2 c^4}} = \frac{p c^2}{E} = \frac{c^2}{E/p} = \frac{c^2}{(h\nu)/(h/\lambda)} = \frac{c^2}{\lambda \nu} = \frac{c^2}{v_p} \tag{5.59} \]
where vp = λν is the phase velocity of the wave packet. As discussed in section 5.6, the magnitude of the phase velocity of a wave is the speed with which its crests propagate. Since vp = c²/vg, and since Special Relativity demands that vg < c, we have vp > c for quantum waves associated with an object whose rest mass is greater than zero.

Relative to system A, let us denote the phase and group velocity of the A instrument wave packet as vpA and vgA, respectively, and similarly those of the B instrument as vpB and vgB. Setting aside for the moment the fact that vgA = 0 seems to imply that vpA = ∞, let us just take vgA to be arbitrarily small, leaving vpA arbitrarily large. Then since vgAvpA = vgBvpB = c², the fact that vgB > vgA implies that vpA > vpB.

It is generally recognized that a wave can propagate information only at its group velocity, but that refers to information encoded on the wave by modulating it (and hence introducing new frequency components), not information about the wave itself. Clearly the wave crests traveling in both directions along the x axis describe the waves themselves, which is information about each wave, i.e., where its crests are, what the amplitude is, and how fast the crests are traveling. Part of the quandary facing us is that material objects are described by quantum waves, and information about the waves provides information about where the object might be found, what values could be measured for its momentum and energy, etc. As we saw in section 5.10, phase relationships between components of a wave packet play a crucial role in where an object associated with that wave packet will go. Coherent states exist because of these phase relationships. There appears to be some plausibility to the notion that when something happens to alter the crest structure of a given monochromatic wave (i.e., associated with a single eigenstate), in this case its crest amplitude discontinuously dropping to zero, the resulting effects would propagate at the phase velocity, i.e., the last surviving crest would continue traveling at the phase velocity, followed by nothing but zero level at that frequency.

So it seems that the v and u in Equation 5.53 can be identified as vgB and vpA, respectively, and since vpA > vpB and vgBvpB = c², we must have vgBvpA > c². Thus (1 − uv/c²) < 0, and we obtain ΔT < 0, i.e., information about A’s spin measurement does indeed propagate into B’s past. Bell also stressed that position states are the most fundamental states in Quantum Mechanics, because the result of a measurement is always some kind of position, e.g., the position of a pointer, the position of ink on a page, etc. It follows that position states of entangled instrument components
are relevant to the process and that information about the A measurement, specifically the uncontrollable collapse of certain wave-packet components of the joint wave function, could be “communicated” to the B instrument by the change in crest structure transmitted at the phase velocity. This is just a qualitative sketch of a basic model, but it strongly suggests an avenue for resolving the dilemma of the “simultaneous” collapse of entangled particle states. But the change in crest structure cannot be modulated to encode signals to be propagated at the phase velocity because, as shown by Ghirardi et al. and described in the previous section, it is impossible to control which eigenstates have their probability amplitudes forced to zero by a measurement.

In the case with both instruments at rest relative to each other, the classical nature of Special Relativity permits one to consider vgA to have a value of exactly zero, suggesting that vpA would be undefined. But the presence of quantum mechanical effects in the problem demands that we rule out exact values of velocity, because they would imply unbounded uncertainty in position due to the Heisenberg Uncertainty Principle. So we need to consider only |vgA| values that are extremely small but greater than zero, with |vpA| being proportionately very large, but still with vpAvgA = c².

An important aspect of this model is that it is the instruments that become entangled via the measurements of entangled particles and communicate with each other. The communication is not between the entangled particles themselves; they simply link the instruments into the entanglement. If this were not the case, then the model would not work for entangled photons, for which vp = vg = c. With the magnitude of the phase velocity equal to the speed of light, the fact that some wave crests had collapsed to zero at one instrument could not be transmitted fast enough to catch up to the other photon before its polarization was measured. Furthermore, it is clear that the group velocity vgB corresponds to the v in Equation 5.53, with u = vpA being the speed at which the zero-amplitude crest states are propagated from instrument A to instrument B.

5.15 Wave Function “Collapse” and Various Alternatives

The association of a measurement of some observable with a wave function “collapse” to an eigenstate of that observable (what Schrödinger called “Springen” in his three-part 1935 paper, usually translated “jump”) has been problematic for a variety of reasons ever since the advent of the Born Interpretation discussed in section 5.9 above. It does not correspond to anything in the Schrödinger Equation. The actual process as described by Dirac and von Neumann does not correspond to a unitary operator as generally desired for operators describing real physical processes (an operator A is unitary if its adjoint A† is also its inverse, i.e., A†A = AA† is the identity operator). Its failure to appear in the formalism led to a variety of interpretations lacking appeal to large fractions of the physics community, e.g., von Neumann suggested that the collapse was a manifestation of the interaction between physical matter and consciousness, a view that led to the famous question of whether the moon is still there when nobody is looking (see, e.g., Mermin, 1985). Although the early focus of the de Broglie-Bohm pilot-wave hidden-variable theory was to restore physical realism and eliminate the nonepistemic randomness of quantum processes, it also succeeded in eliminating the collapse mechanism.
Since the physical parameters of objects retain exact values at every instant, there is no need to jump to an eigenstate at the moment of measurement. This touches upon a problem that troubled Schrödinger deeply; he would have liked to replace massive “particles”
with “wave packets” throughout physics, but wave packets for free massive particles disperse along their trajectories, whereas (for example) traces in cloud chambers suggest that particles are well localized all along their paths. But the latter also suggests that some sort of continuous interaction takes place along the path, perhaps continually collapsing the wave function enough to maintain sufficient localization without inducing unbounded momentum uncertainty. This avenue led to the development of spontaneous localization and objective collapse models that propose physical processes that induce the collapse, along with the closely associated environmental decoherence models in which entanglement with the environment causes the loss of coherent states and eliminates the possibility of observing macroscopic grotesque states (e.g., Schrödinger’s Cat being both alive and dead at the same time) but does not explain collapse itself. All of these and other interpretations have been developed with variations, each leading to its own variations, until the situation has become much too complicated to fit within our scope, forcing us to leave the reader to blaze an individual trail through the abundant literature. Besides warning the reader that the plethora of creative explications includes no small number of fringe postulations, including some of what the author can regard only as significantly unhinged speculations to be found on some internet science blogs, we can only direct the reader’s attention to several of the more widely embraced hypotheses.

Of these, one of the subtler and more difficult to describe without introducing an inordinate amount of new mathematical formalism is known as the Consistent Histories Interpretation, the main thread of which is widely considered to be a refinement of the Copenhagen Interpretation (see, e.g., Omnès, 1999, Chapters 11-15 and 21). A “history” of a quantum system is represented as a time-ordered sequence of projection operators defined on the state-vector space that describes the quantum system (technically, a complex Hilbert space, typically infinite-dimensional, but such rigor is not needed for the qualitative description to which we are limited). A projection operator computes the projection of a vector in this space onto a subspace. Thus a history spans an interval of time in the life of a physical system and consists of a sequence of projectors that describe each event that occurs for that system at specific times during the interval. In classical physics, the projectors trace the path of the system through the usual phase space of Statistical Mechanics, and for each system, the history is unique because of the deterministic laws, but epistemic randomness introduces a need for classical probability theory which allows averages to be computed over ensembles. Nevertheless, the system can be modeled as evolving through a unique history, unlike the quantum case, wherein the system may take various branches at each event, each branch chosen by a nonepistemic random process. This makes it necessary to consider families of histories in the quantum case. It is rather nontrivial to define the quantum projection operators. Each is associated with a specific time, and its function is to project the quantum state at that time onto a subregion of the state space that corresponds within small tolerances to a value of an observable, thus mapping a quantum concept to a classical concept.
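To make the idea of a projection operator and a time-ordered history slightly more concrete, the following is a minimal numerical sketch in Python (an illustration added here, not part of the Consistent Histories literature or of this book's formalism; the two-level system, the rotation angle, and all names are arbitrary choices). It builds projectors onto the eigenstates of a two-state observable, forms the chain of projectors for a simple two-event history, and evaluates the trace construction that assigns the history its weight:

```python
import numpy as np

# Basis states |0>, |1> and the projectors onto them.
ket0 = np.array([1.0, 0.0], dtype=complex)
ket1 = np.array([0.0, 1.0], dtype=complex)
P0 = np.outer(ket0, ket0.conj())          # projector onto |0>
P1 = np.outer(ket1, ket1.conj())          # projector onto |1>

# An initial state and its density matrix.
psi = (ket0 + ket1) / np.sqrt(2.0)        # equal superposition
rho = np.outer(psi, psi.conj())

# A simple unitary evolution between the two events of the history
# (a rotation matrix with an arbitrarily chosen angle).
theta = np.pi / 3.0
U = np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
              [np.sin(theta / 2),  np.cos(theta / 2)]], dtype=complex)

# Chain operator for the two-event history "first |0>, then (after U) |1>":
# C = P1 U P0.  Its weight is Tr(C rho C†), the kind of trace expression
# involving the projectors, their adjoints, and the density matrix alluded
# to in the text.
C = P1 @ U @ P0
weight = np.real(np.trace(C @ rho @ C.conj().T))
print(f"weight of the history (|0> then |1>): {weight:.4f}")

# Sanity check: the weights of the four two-event histories built from
# P0/P1 at both times sum to 1.
total = 0.0
for first in (P0, P1):
    for second in (P0, P1):
        Ch = second @ U @ first
        total += np.real(np.trace(Ch @ rho @ Ch.conj().T))
print(f"sum over the family: {total:.4f}")
```

This sketch only evaluates the weights themselves; the full consistency condition mentioned below additionally requires the analogous cross terms between different histories to be negligible, which is not tested here.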
The word “corresponds” is no coincidence; the Consistent Histories formalism grew out of the attempt to put more flesh on the bones of the Correspondence Principle, that part of the Copenhagen Interpretation that asserts the classical limit of quantum systems in the transition from the microcosm to the macrocosm. Thus the notion of a “history” has one foot in each realm and attempts to bridge the gap, the elusive boundary claimed by the Complementarity Principle to exist between the quantum and classical worlds, a dichotomy that has been the subject of many complaints over the years, the demilitarized zone between a quantum object and the classical instrument used to measure it, what Bell called “the shifty split of the world into ‘system’ and ‘apparatus’.”
In order to be useful, a family of histories must be complete, i.e., it must contain every mutually exclusive history. The question of mutual exclusivity arises because some options available to the system at a given time along one branching path are not available to it along other branching paths. A single history containing such mutually exclusive events is not consistent and must be discarded. Whether a particular history is consistent is determined by the value of a mathematical expression involving the trace of the product of the projectors and their adjoints forming a similarity transformation of the density matrix (presenting this expression and defining its components would require too great a digression from our scope; for details, see, e.g., Omnès, 1999, Chapter 14). Each history in a family of consistent histories provides an interpretation of what happened to the system in classical terms during the time period covered, and a probability for it can be computed using the same ingredients as the consistency condition mentioned above. There is some resemblance between any given consistent history and the deterministic path taken by the system in the de Broglie-Bohm theory, the difference being that the latter’s randomness is purely epistemic and describes uncertainty in which unique path is taken, while the former’s is nonepistemic and characterizes only one of a complete family of such paths. What happens at each time associated with a projection operator can be thought of as a measurement performed by an imaginary instrument. One of the advantages of this approach is that histories can be conditioned on subsequent measurement results, narrowing the interpretations, and in general the tools of classical probability theory are made available. In any given history, the sequence of projectors generates a trajectory through the state space that is not burdened with grotesque superpositions, branching at each application of an imaginary instrument with the probability of the given “measurement” outcome contributing to the probability of the given history. Some questions remain concerning the physical mechanism corresponding to the projectors that function like measurements by imaginary instruments that never collapse to a grotesque superposition of states but do somehow collapse to a classical eigenstate. The collapse itself remains controversial, but the absence of grotesque states is widely interpreted as resulting from a phenomenon that has also been studied independently of the Consistent Histories formalism, the process mentioned above that is known as environmental decoherence.

Bell devoted considerable attention to two issues that especially concerned him. How can we eliminate the “shifty split” between quantum and classical systems? Why does “measurement” play a special role in Quantum Mechanics? He was joined in the pursuit of answers to these questions by many other scientists. The key ingredients were to recognize that a measurement entangles the “system” and the “apparatus”, thus treating both as quantum systems and eliminating the shifty split, and that any interaction between two physical objects entangles their states, thus making any physical interaction equivalent to a “measurement” even if there is no conscious awareness of the outcome (i.e., the moon does exist even when nobody is looking).
Together with the fact that it is almost impossible to keep a “system” isolated from its “environment”, this leads to joint wave functions involving phase modifications relative to the original separate wave functions. In almost all cases, the “environment” constitutes a gigantic system relative to the quantum objects we have been discussing. The number of environmental components can be typically on the order of Avogadro’s constant or even higher, and so actual calculations require approximate methods. Environmental decoherence is modeled formally by including terms in the Schrödinger-Equation Hamiltonian corresponding to dissipative processes and component couplings known as “Interaction
Hamiltonians” (e.g., for the hydrogen atom, the energy term expressing the Coulomb interaction between the proton and electron). The formalism is cast in such a way that both a decoherence term and a dissipative term emerge, and their coefficients are related by a positive scale factor. Just as Thermodynamic entropy increases irreversibly as a result of dissipative processes, for all practical purposes environmental decoherence is irreversible for the “system”. Its effect on the quantum system’s original eigenfunctions is to make the coherence between them vanish. This does not generally reduce the number of surviving eigenstates for the “system” to one; it just produces an essentially classical probability distribution of the original eigenstates, and so it does not amount to a “collapse” in itself, and something is still needed to fill in the blanks of the Consistent Histories Interpretation. Thus the location states available to the moon when nobody is looking are not grotesque superpositions, but we have yet to consider models of how it actually manages to get into one and only one of them.

The main current approaches for providing a mathematically expressed physical model for “wave function collapse” are the spontaneous localization and objective collapse formalisms. Of the former, the best known is the “Continuous Spontaneous Localization” (CSL) model developed by Ghirardi et al. (see, e.g., Ghirardi, Rimini, and Weber, 1986; Ghirardi, Grassi, and Rimini, 1990). This kinematic extension to standard Quantum Mechanics involves spontaneous localizations that occur as the result of a process that is modeled as the renormalized product of the wave function with a localization function (see, e.g., Ghirardi, 2005, Chapters 16 and 17). These localizations occur at times that are distributed randomly but with a mathematically precise distribution that depends on the scale of the system described by the wave function. The model is phenomenological, but that is hardly avoidable in Quantum Mechanics, wherein the phenomenon of quantized energy had to be accepted on the basis of unequivocal measurements, and the wave nature of matter had to be accepted as the only available explanation for energy quantization. Note that “localization” refers explicitly to collapsing the wave function in a manner such that position is made nonblurry. This is consistent with Bell’s statement that position is a particle’s most fundamental physical parameter (the same point was made elsewhere, notably by Einstein). Many other physical parameters are contextual, e.g., spin, which collapses in a measurement to a value allowed by the measurement apparatus; the result obtained depends on how the instrument was oriented. There are other noncontextual parameters, but as Bell pointed out, measurements of any parameters emerge macroscopically as position values for some feature of the instrument. The position along the wave function where localization occurs is random with a distribution given by the squared modulus of the wave function, i.e., the usual quantum probability distribution. In sharpening the position, the process necessarily blurs the momentum, and as a result, momentum and kinetic energy are not conserved in general. Some scientists consider this a defect, although the extent of the violations may be immeasurably small, and there are examples of energy nonconservation in other well-regarded physical theories (e.g., General Relativity).
The process is also irreversible, which may be seen as a defect or a virtue, depending on one’s preferences regarding the “arrow of time”. The CSL formalism adds a nonlinear process to Quantum Mechanics. The model involves only two new parameters, which are considered to be constants of nature. There are complications such as how to make the model consistent with relativity, how to incorporate systems of identical particles, etc., some of which have shown progress, but all of which are beyond our scope for detailed discussion. For reasons which include a need to prevent the tails of the post-localization wave function from extending to infinity, the localization function has the form of a Uniform distribution. Its width is one of the two parameters that define the model, and the average localization frequency is the other.
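As a rough picture of what a single localization “hit” does to a wave function (a sketch only; the Gaussian packet, the grid, and the window width are arbitrary choices made here for illustration, not parameter values from the CSL literature), one can multiply a discretized wave packet by a narrow Uniform window centered at a randomly chosen point and renormalize the product, as just described:

```python
import numpy as np

rng = np.random.default_rng(1)

# Discretized 1-D Gaussian wave packet (arbitrary units).
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
sigma = 2.0
psi = np.exp(-x**2 / (4.0 * sigma**2)).astype(complex)
psi /= np.sqrt(np.sum(np.abs(psi)**2) * dx)          # normalize

# Choose the localization center at random from |psi|^2, the usual
# quantum probability distribution.
prob = np.abs(psi)**2 * dx
center = rng.choice(x, p=prob / prob.sum())

# Localization function: following the description above, a narrow
# Uniform (boxcar) window of width a, much smaller than sigma.
a = 0.2
window = ((x > center - a / 2) & (x < center + a / 2)).astype(float)

# The "hit": renormalized product of the wave function with the window.
psi_after = psi * window
psi_after /= np.sqrt(np.sum(np.abs(psi_after)**2) * dx)

print(f"localization center: {center:+.3f}")
width_before = np.sqrt(np.sum(x**2 * np.abs(psi)**2) * dx)
mean_after = np.sum(x * np.abs(psi_after)**2) * dx
width_after = np.sqrt(np.sum((x - mean_after)**2 * np.abs(psi_after)**2) * dx)
print(f"position spread before: {width_before:.3f}, after: {width_after:.3f}")
```

Running the sketch shows the position spread shrinking from the packet width to roughly the window width, which is the sense in which the position is made nonblurry at the cost of blurring the momentum.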
The formalism treats the probability of spontaneous localization for any given particle as independent of that for any other particle, even in systems of entangled particles. These two parameters can be made to fit well with observations, and the result is that very small systems can maintain uncollapsed wave functions for long periods of time, whereas macroscopic systems cannot. The number of particles comprising a large system is typically not many orders of magnitude different from Avogadro’s constant, and these particles are entangled with each other. Although there is only a tiny probability that any one particle will undergo spontaneous localization within a time interval on the order of many years, with 10²⁰-10³⁰ particles, it is overwhelmingly likely that one of them will undergo localization within a small fraction of a second. Since the particles are entangled, the entire collection becomes localized. Unlike the disentanglement of the individual particles in Bell experiments, the particles making up the macroscopic body resume their entanglement immediately because of the interactions implicit in remaining bound to each other. Thus the fate of Schrödinger’s Cat is chosen very quickly, and the moon is essentially always in only one place at any given time.

The role of randomness at the quantum-mechanical level is expanded in the CSL formalism, because not only is the collapsed position chosen randomly, the time at which the localization occurs for a given particle is also random with a distribution that is the same as for radioactive decay. The Poisson distribution (Equation 2.19, p. 59) is an excellent model of this process, providing the probability that k events will occur in a situation for which the average number of events is λ. For a mean lifetime τ, the expected number of independent decays within a time interval Δt is λ = Δt/τ, and the probability that the radioactive particle will not decay within the time interval Δt is given by the Poisson probability P(0,λ) = exp(-λ), hence exp(-Δt/τ). The Poisson probability that no events will happen is the complement of the cumulative probability of the exponential distribution, for which τ is related to the half-life T½ by the equation τ = T½/ln 2, where “ln” is the natural logarithm. Thus (in principle) one can measure the time T½ required for half of a radioactive sample to decay and then compute an estimate of τ. Ghirardi suggests that for a proton, a reasonable estimate of τ is 10¹⁶ seconds (about 317 million years), so that for example the probability that its wave function will not undergo a collapse due to spontaneous localization in a time interval of a microsecond is exp(-10⁻²²), which misses absolute certainty by only the negligible amount of about one part in 10²². But if the proton is part of an assembly of particles whose proton count is Avogadro’s constant, then (ignoring any other particles that may be in the collection) we compute the probability that the entire assembly will not become spontaneously localized within a microsecond by raising exp(-10⁻²²) to the power 6×10²³, which yields about 8.8×10⁻²⁷, and a collapse within that time interval is essentially inevitable. Ghirardi has provided graphical representations of the action of a localization function on a wave function (e.g., Ghirardi, 2005, Chapter 16, Figure 16.7). The resemblance to Figure 4-11 (p. 185), the Parameter Refinement Theorem operating on Gaussian and Uniform distributions, is striking.
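The arithmetic behind the localization probabilities quoted above is easy to check. The short sketch below simply reproduces it, using the representative values given in the text (the variable names are choices made here; the independence of the individual localizations is as stated above):

```python
import math

tau = 1.0e16     # mean time between localizations for one proton, s (figure quoted above)
dt  = 1.0e-6     # time interval of interest: one microsecond
N   = 6.0e23     # particle count, roughly Avogadro's constant

lam = dt / tau                  # expected localizations per particle in dt (Poisson mean)
# For one particle, 1 - exp(-lam) ~ lam ~ 1e-22; exp(-lam) itself rounds to
# exactly 1.0 in double precision, so the deficit is reported directly.
print(f"single-particle deficit from certainty ~ {lam:.1e}")

# For N independently localizing particles, the probability that none is hit
# in dt is exp(-N*lam) = exp(-60), matching the figure quoted in the text.
p_none = math.exp(-N * lam)
print(f"probability that no particle is localized in one microsecond: {p_none:.2e}")
```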
Once the CSL time and location are selected, CSL and parameter refinement are mathematically identical processes, including renormalization following the product. Whereas Ghirardi illustrates a very general wave function, in fact Gaussian wave packets are frequently encountered in Quantum Mechanics, and so Figure 4-11 could be used as an example of CSL, although the width of the CSL Uniform distribution is generally much smaller than the standard deviation of the wave packet. The point is that the similarity between a CSL event and parameter refinement combining the information in the wave function with that from an “imaginary measuring instrument” in the Consistent Histories formalism is highly evocative of a Nature that not only makes its own measurements but uses each
new result to refine its knowledge of itself. Of the other general approaches to this same problem, the objective collapse models, the one that receives the most discussion involves the effect of gravity within the highly nonclassical situation in which a single object is in a superposition of location states, i.e., apparently occupies more than one position. Is a test body attracted to each position? Is the object in one position attracted to itself at the other positions? Does each location have a fraction of the object’s mass, or is the complete mass in each location, in which case, how does mass conservation enter the picture? The need for a theory of Quantum Gravity calls out. These and other related questions have been investigated to various extents, notably by Diósi (e.g., 1984, 1989) and Penrose (e.g., 1998, 2004, 2014).

General Relativity describes the effects of matter and energy on the curvature of a four-dimensional spacetime continuum. If a single object’s location is in a superposition of (e.g.) two positions, how can the two corresponding curvatures be described? As a superposition of two curvatures of a single spacetime? As a superposition of two spacetimes, each with one of the two curvatures? Neither of these possibilities has a straightforward representation in Einstein’s equation. Penrose suggests that the tension implicit in this situation produces a real physical effect that leads to a collapse of the superposition and that pursuing this clue is as likely to lead to a real Quantum Gravity Theory as any other available approach. Without an actual Quantum Gravity Theory, one can only probe heuristically, and one item that can be probed is the difference between the gravitational self-energies of the two superposed locations. To pursue this approach, Penrose (1998; 2004, Chapter 30; 2014) replaces Schrödinger’s Cat with “Schrödinger’s Lump”, a piece of matter whose position remains unchanged if a 50%-probable event does not occur and is otherwise moved to a different position by a mechanism triggered by the event. This replaces the cat’s alive/dead superposition with a location superposition that is more readily analyzed. He also replaces Schrödinger’s 50%-probable radioactive decay with a beam splitter that transmits a photon from a single-photon light source to the position-altering mechanism with 50% probability, making the experiment controllable: the photon is emitted, and the lump’s position is or is not altered with equal likelihood, leading quantum mechanically to the location superposition. Either possibility involves a lump that is now resting in some location with no further influences and hence in a stationary state. The question arises whether the superposition itself is a stationary state, and the answer based on Penrose’s heuristic investigation turns out to be negative, as discussed further below. Since the time parameter in General Relativity is one of the four axes of spacetime, hence subject itself to field curvature, the simple absolute time-derivative operator in Schrödinger’s Equation has to be replaced with a vector field, and for a stationary spacetime, this is what is known as a timelike Killing vector field (after the German mathematician Wilhelm Killing).
The first serious problem arises in the fact that for the two locations in the superposition, the two Killing fields are different, yielding two different spacetimes that cannot be “superposed”, because doing so would require a coordinate connection between the two spacetimes that assigns absolute meanings to the coordinates themselves, and this violates the principle of General Covariance that is essential to General Relativity. But since the moved location originates in a motion from the unmoved location, there must be some domain within which the rules of General Covariance can be ignored without too much approximation error, for example, at or near the Planck scale (see Appendix H). If this approach is taken, then when one constructs Schrödinger’s Equation with a gravitational potential in the Hamiltonian (Penrose considers both the Newtonian potential and one consistent with
General Relativity), the superposition is seen not to correspond to a stationary state but rather an unstable situation that must evolve in time. The gravitational self-energy EG of the difference between the two stationary states can be computed only approximately, but it seems plausible that the approximation should not be off by an order of magnitude, so that something at least qualitatively resembling the process of decay of the superposition is worth considering. Penrose obtains a solution in which EG operates as an energy uncertainty related to the decay time constant τ by the Heisenberg Uncertainty Principle, i.e., τ ≈ ħ/EG. The suggestion is that in a Quantum Gravity Theory, it will be seen that spacetime simply cannot sustain a location superposition very long, and the superposition must collapse to one of the eigenstates with a probability given in the usual way. Before the collapse, each position eigenstate is the location of the complete mass, but the mass at one position does not attract the mass at the other; neither eigenstate recognizes the presence of the other, so mass conservation is not violated. It is the spacetime inconsistency between the two eigenstates that creates the instability of the superposition and the uncertainty in gravitational self-energy that implies an uncertainty in how long the unstable superposition can last. The larger EG is, the smaller τ is, much like the effect of the object mass in the CSL model. There are many variations on the theme of gravitationally induced objective collapse, and Ghirardi and his co-workers have investigated the possibility of a gravitational connection to the CSL model. Ideas for experimental investigations into both approaches have been suggested and are in development.

Meanwhile, some indications that something of this sort happens have been seen in the laboratory. For example, the technology used for atomic clocks has become very advanced and has some side benefits that can be exploited to gain some insight into the collapse of location superpositions. This is closely interwoven into quantum computer studies, wherein superpositions are used to create the “qubits” mentioned in section 5.12 above. In order to employ superpositions without collapsing them prematurely, the technique known as interaction-free measurement has been developed (see, e.g., Elitzur and Vaidman, 1993; Kwiat et al., 1995). These methods use indirect observation to deduce a quantum state without disturbing it. Ion traps such as those used for atomic clocks can also be used to place the ion in a superposition of excitation states in which the different eigenstates have different magnetic dipole moments, inducing different precession rates that lead to different position states. The interaction-free measurement of such positions employs detecting the interference between the localized wave packets. Monroe et al. (1996) used this method to discover that the mean time required for collapse tends to vary as the square of the spatial separation of the superposed locations over the range available for measurement. This is not purely a function of the location superposition, however, because other aspects of the interaction-free measurement affect the process (e.g., the coupling of the superposed states to a thermal reservoir).
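To get a feeling for the orders of magnitude involved in the τ ≈ ħ/EG estimate, the following sketch evaluates it for a hypothetical micron-sized grain, using Gm²/R as a crude stand-in for the gravitational self-energy of the difference between two well-separated locations; the mass, radius, and this approximation are assumptions made here for illustration and are not taken from Penrose's papers:

```python
import math

G    = 6.674e-11      # gravitational constant, m^3 kg^-1 s^-2
hbar = 1.055e-34      # reduced Planck constant, J s

# A micron-scale grain (assumed values, for illustration only).
R   = 1.0e-6                          # radius, m
rho = 2.5e3                           # density, kg/m^3
m   = rho * (4.0 / 3.0) * math.pi * R**3

# Crude estimate: for two locations separated by more than a diameter,
# the gravitational self-energy difference is of order G m^2 / R.
E_G = G * m**2 / R

# Penrose's heuristic decay time for the superposition.
tau = hbar / E_G
print(f"mass ~ {m:.2e} kg, E_G ~ {E_G:.2e} J, tau ~ {tau:.2e} s")
```

Under these assumptions the superposition decays in a few hundredths of a second; scaling the same estimate down to a single electron gives a τ far longer than the age of the Universe, while scaling it up to everyday objects gives essentially instantaneous collapse, which is the qualitative behavior such models aim for.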
Besides the difference in gravitational self-energy in a location superposition, it can also happen that the two locations are at different gravitational potentials (e.g., vertically separated in the Earth’s gravitational field), in which case the frequencies of the two eigenfunctions will be affected differently by the fact that clock rate depends on gravitational potential in General Relativity. It is now possible with atomic clocks to detect the difference in gravitational time dilation due to an altitude difference of 30 centimeters in the laboratory to 2σ accuracy (Chou et al., 2010; Wineland, 2012). This separation is quite large relative to position differences in typical location superpositions (e.g., Monroe et al. studied separations of about 10⁻⁵ cm between wave packets with a full width of about 10⁻⁶ cm), but separations of this size are beginning to be achieved in the laboratory. Kovachy et al. (2015) report
separations of 54 cm sustained for one second using a Bose-Einstein condensate of rubidium atoms. It is clear that differential time dilation can be expected to affect the phases of coherent states, an effect studied by Margalit et al. (2015) using a magnetic field gradient to simulate the gravitational time dilation. As described above in the last paragraphs of sections 5.9 and 5.14 and illustrated in Figure 5.5 (p. 271), electrons sent through a two-slit apparatus display an interference pattern when both slits are available. If instead only one slit is open, it is known which slit was in the electron’s path, and the interference pattern disappears and is replaced with a simple diffraction blur. When both slits are open, the result is not two diffraction blurs, it is an interference pattern that reveals the presence of coherent states, the earmark of a superposition, and the location on the screen where an electron is detected does not reveal which slit was transited by the electron. If the electron’s path can be deduced by any means (of which closing one slit is the simplest but not the only means), the interference pattern disappears. If the paths through the slits are at different gravitational potentials, and if the electron is replaced with something that exhibits clock properties, then in principle, the path taken is revealed by the clock reading when the object is detected, assuming the particle speed and the two path lengths are known to sufficient accuracy. The mere possibility of determining which path was taken should destroy the interference pattern. Margalit et al. used atomic clocks (also employing rubidium atoms in a Bose-Einstein condensate) instead of electrons and simulated gravitational time dilation with a magnetic field gradient along one path. They were able to show that when the clock differences were sufficient to determine which path was taken, the interference pattern disappeared. Differential time dilation indeed affects the decoherence process, a piece of the “collapse” puzzle. Further experimental investigation of these phenomena is ongoing, with optimistic expectations as the state of the laboratory art continues to advance.

For the sake of completeness, we must include some discussion of an alternative perception of the wave function’s meaning in Quantum Mechanics that was introduced by Hugh Everett (1957) under the name relative state formulation, but whose collective variations have become grouped under the heading “Many Worlds Interpretation” (MWI). The key notion is that the wave function never collapses, but rather a physically real branching process occurs. In the simple case of Schrödinger’s Cat, instead of the cat collapsing into either a dead or alive state, the Universe bifurcates into two separate realities, one in which the cat is alive and one in which it is dead. The cat simply follows the instruction of Yogi Berra, “When you come to a fork in the road, take it.” The author is not aware of any topic in physics that is more polarizing and controversial than MWI. Adherents find great beauty in its formal simplicity and ability to sweep away so many troublesome notions (e.g., the need for some extra mechanism to produce “collapse”, the nondeterministic nature of that mechanism, special roles for measurements and observers).
Detractors point out the ambiguity regarding what happens at branches corresponding to a large number of eigenstates with unequal probabilities, problems with testability, and what may be perceived as a gross lack of plausibility. For example, Schlegel (1980) calls it a “monstrous assertion”, and Bell (2004, Chapter 15) labels it “extravagant” and points out regarding Maxwell-Lorentz theory, “Nobody ever felt any discomfort because the field was supposed to exist and propagate even at points where there was no particle. To have multiplied universes, to realize all possible configurations, would have seemed grotesque.” Since MWI is an interpretation, not a theory, it can be argued that demanding that it be able to survive criticisms applicable to theories is misplaced (there are, however, variations of MWI in
which the separate worlds can interact, making the hypothesis testable and hence actually a theory). The issue of plausibility must be settled by one’s own metaphysical preferences. The problem of how different eigenstate probabilities enter the picture will be addressed below. There is also disagreement regarding how to apply the powerful heuristic tool known as Occam’s Razor to MWI. This is the principle that when selecting from competing hypotheses, the one with the fewest assumptions should be favored. It is also sometimes stated as “the simplest explanation tends to be the correct one.” But it is not a law, it is only a heuristic aid, and it occasionally founders on the definition of “simplest”. If a clear definition is available and a single hypothesis is indeed simplest, then that hypothesis enjoys the quality of uniqueness, which makes it much more appealing than a hypothesis which is just one of many possible alternatives. MWI adherents identify the laws of physics as the place for simplicity, whereas detractors find an exponentially proliferating set of Universes impossible to describe as simple.

There are differing opinions regarding the exact nature of the separately branched Universes and what Everett actually meant by “relative states”. He did not use the phrase “many worlds”. He stressed the entanglement between an observer and what was observed, the former being “relative to” or “correlated with” the latter within each eigenstate of the observable. His purpose was to provide a foundation for Quantum Mechanics that would ease its union with General Relativity to form a single theory covering both subjects. The wave function appears as a physically real component of the Universe, unlike its role in standard Quantum Mechanics, wherein its nature is analogous to a force-generating potential, i.e., essentially a mathematical mnemonic device (since the role of potentials in classical physics is to be subject to differential operators, they need be defined only to within an unspecified zero point, what is known as gauge invariance, and hence they may have any numerical value and are not necessarily considered physically real). His mathematical definition of a relative state was precise, but as with so many aspects of Quantum Mechanics, interpreting it in terms familiar to human intuition was not straightforward. Everett died prematurely in 1982 and thus cannot supply further clarification of his own viewpoint as it might apply to the many variations of his basic idea that were subsequently developed by others. But the “many worlds” notion does seem at least to be suggested by his statement (Everett, 1957):

... some correspondents have raised the question of the “transition from possible to actual,” arguing that in “reality” there is — as our experience testifies — no such splitting of observers states, so that only one branch can ever actually exist. Since this point may occur to other readers the following is offered in explanation. The whole issue of the transition from “possible” to “actual” is taken care of in the theory in a very simple way — there is no such transition, nor is such a transition necessary for the theory to be in accord with our experience. From the viewpoint of the theory all elements of a superposition (all “branches”) are “actual,” none are any more “real” than the rest.
It is unnecessary to suppose that all but one are somehow destroyed, since all the separate elements of a superposition individually obey the wave equation with complete indifference to the presence or absence (“actuality” or not) of any other elements. This total lack of effect of one branch on another also implies that no observer will ever be aware of any “splitting” process.
Although he claimed that every eigenstate (“element of a superposition”) was equally real for some observer, Everett recognized that each individual observer experiences different measurement outcomes at different times with frequencies given by the probabilities associated with the Born Interpretation. He addressed this issue by assigning each eigenstate a “measure” or “weight” that is numerically equal to the conventional probability that a measurement will yield a specific outcome:

When interaction occurs, the result of the evolution in time is a superposition of states, each element of which assigns a different state to the memory of the observer. Judged by the state of the memory in almost [emphasis ours] all of the observer states, the probabilistic conclusions of the usual “external observation” formulation of quantum theory are valid.
The possibility seems to be left open that there is not a single branch for each eigenstate, but rather some integer number of branches greater than one and approximately proportional to its “measure”, possibly explaining the use of the word “almost” qualifying the observer states in the quotation above. Eigenstates with higher probability presumably yield more branches than eigenstates with lower probability, with some unknown proportionality constant. Perhaps there is not one dead and one living Schrödinger Cat, but rather 50 of each, or possibly even an arbitrarily large number of each. Other interpretations refer to the “weight” as a measure of the extent to which the corresponding branch is “real”. In the end, the abundance of nuance and ambiguity provides fertile ground for the development of many variations and in fact has done so.

Besides the plausibility objection, there is also a philosophical objection to MWI that follows from its implicit claim that everything that is physically capable of happening does in fact happen in one or more of the branches. We experience our actions as influencing the subsequent history of our world, and our actions correspond to our decisions to act, which may be moral or immoral. The tenets of MWI imply that all possible immoral decisions are made somewhere in the grand network of physical universes. A given branch may consist of exclusively moral decisions, but it is not clear that it matters in which branch immoral decisions are made, since they must be made somewhere in the totality of all that exists. As with superdeterminism, to the author this model of reality seems to destroy the meaningfulness of morality, reducing it to yet another illusion in much the same way that superdeterminism vitiates the notion of free will. This is of course a metaphysical objection.

Since the theme of this book is how randomness is relevant to human experience, and since the role of randomness in Quantum Mechanics is paramount in this context, we are engaging in a discussion about Quantum Mechanics, not a full definition of all aspects of Quantum Mechanics. But since the discussion about Quantum Mechanics necessarily involves the issue of wave-function collapse, we have had to cover the more prominent aspects of this notion in order to paint in the corners of the larger picture of interest to us. To delve deeper, however, would be to go beyond a manageable scope, and so we are omitting some topics that may be very important in a larger context but not appropriate here, such as the issue of a “preferred basis”, which touches on all of the above. We will say in passing that this is related to the fact that any vector system of coordinates can be rotated to another system whose axes are linear combinations of the former system. This has no effect on the physical situation, only its representation. The implication is that the Hilbert space whose axes correspond to classical eigenstates is not obviously unique for describing measurement outcomes, and so some justification for using it to represent actual measurement outcomes should be presented. Similarly, the desire exists to derive the Born Rule rather than simply assume it on the basis of its
apparent success. Many attempts to do this have been made within most of the contexts discussed above, but none have been universally accepted. These and other issues must be left to the reader’s own explorations, because pursuing them more deeply herein would not eliminate unresolved questions but rather introduce a new set of them.

5.16 The Status of Nonlocal Hidden-Variable Theories

We saw in section 5.12 that John Bell proved that “hidden-variable” theories could not reproduce the statistical results predicted by Quantum Mechanics unless they incorporated nonlocal effects. For many physicists, this was sufficient to dismiss such theories, but a group remains for which the value of deterministic mechanics and retention of physical realism outweighs the cost of embracing nonlocality, especially since nonlocality is woven into Quantum Mechanics anyway. One can conceive of hidden-variable theories that involve nonepistemic randomness, but such theories would really be interpretations of Quantum Mechanics involving physical realism, i.e., exact values for all physical parameters at all times. The purpose of hidden-variable theories has always been to convert the nonepistemic randomness of Quantum Mechanics into epistemic randomness by adding a mechanism to do this that does not exist in standard Quantum Mechanics. We therefore take the randomness encountered in Quantum Mechanics to be nonepistemic. Anything that would make it epistemic would make Quantum Mechanics equivalent to a hidden-variable theory. From this point of view, the choice between Quantum Mechanics and a hidden-variable theory implies a choice between nonepistemic and epistemic randomness in the fundamental physical processes of Nature.

The most familiar of the successful hidden-variable theories is the de Broglie-Bohm “Pilot Wave Theory”, for which a very complete elaboration has been provided by Peter Holland (1993), a student of David Bohm’s. Those who reject this theory do so on the basis of metaphysical preferences, since the theory works as well as standard nonrelativistic Quantum Mechanics, in fact so much so that many consider it an interpretation of Quantum Mechanics. But its formal derivation involves a number of distinct elements that separate it from Quantum Mechanics, so that it must (in the author’s opinion) be considered a theory unto itself. The main objections are the difficulty in falsifying it and the extent to which it may strike one as artificial. To formulate a relativistic version of it has also proven so far to be intractable, but the same can be said of most of the mechanisms discussed in the previous section, and indeed for Quantum Mechanics itself as far as full unification with General Relativity is concerned. John Bell found much to admire about the de Broglie-Bohm theory and advocated its inclusion in the standard university physics curriculum. Nevertheless, he was not completely satisfied with it, saying (2004, Chapter 10) “This scheme reproduced completely, and rather trivially, the whole of nonrelativistic quantum mechanics.” Bohm considered the value of the formulation to lie primarily in its demonstration that deterministic hidden-variable theories could be devised, not that this particular one was the final word on the subject.
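For readers who would like to see what the theory’s particle trajectories look like in the simplest case, the sketch below uses the standard textbook result for a free Gaussian wave packet with zero mean momentum, for which the de Broglie-Bohm trajectories simply ride the spreading width, x(t) = x(0)·σ(t)/σ(0); the electron mass and initial width chosen here are arbitrary, and this special case stands in for the general guidance equation, which is not reproduced here:

```python
import math

hbar = 1.055e-34      # J s
m    = 9.11e-31       # electron mass, kg (illustrative choice)
s0   = 1.0e-9         # initial packet width sigma_0, m (illustrative choice)

def sigma(t):
    """Spreading width of a free Gaussian packet at time t (seconds)."""
    return s0 * math.sqrt(1.0 + (hbar * t / (2.0 * m * s0**2))**2)

# In this special case each Bohmian trajectory scales with the width:
# a particle starting at the packet center stays there, while off-center
# particles ride the expanding envelope.  Trajectories never cross.
for x0 in (0.0, 0.5e-9, 1.0e-9, 2.0e-9):
    xs = [x0 * sigma(t) / s0 for t in (0.0, 1.0e-15, 1.0e-14, 1.0e-13)]
    print(f"x0 = {x0:.1e} m -> " + ", ".join(f"{x:.2e}" for x in xs))
```

The trajectories are fully deterministic; any apparent randomness enters only through ignorance of the initial position x(0), which is exactly the epistemic character of the theory’s randomness discussed below.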
Although Einstein’s views on determinism and physical realism generally led everyone to expect that he would endorse this theory, in fact he said in a letter to Max Born (1971, Letter 99) “Have you noticed that Bohm believes (as de Broglie did, by the way, 25 years ago) that he is able to interpret the quantum theory in deterministic terms? That way seems too cheap to me.” Born himself was somewhat bemused by Einstein’s remark, noting “Although this theory was quite in line with his own ideas, to interpret the quantum mechanical formulae in a simple, deterministic way seemed to him to be ‘too cheap’.” It seems likely (to the author at least) that Einstein
considered the recovery of determinism in the de Broglie-Bohm theory a Pyrrhic victory, because it came at the expense of embracing nonlocality, the real complaint of the EPR paper. If Einstein had lived to see the experimental verification of nonlocal effects, he might have changed his opinion. Pauli (1996) conceded to Bohm that the formalism succeeds in emulating Quantum Mechanics while remaining deterministic but saw little value in it, calling it “a check which cannot be cashed.” In the simplest possible terms, the theory involves a “quantum potential” that guides the motions of particles in a manner similar to any potential function in classical physics (e.g., a gravitational potential), but in this case the potential is determined by the wave function and operates instantaneously on all particles. The wave function is therefore considered to be a physically real process, and this is a source of some dissatisfaction, as it is with the Many Worlds Interpretation. For their part, the particles possess real positions and momenta with precise values at all times, and any randomness associated with these parameters is epistemic, as in Statistical Mechanics. But unlike Statistical Mechanics, precise measurement of (say) the position of a given particle causes a reaction through the quantum potential in a manner that affects how it determines the momentum, so that at the surface, quantum behavior in general and the Heisenberg Uncertainty Principle in particular arise. In this case, the position measurement does not “blur” the actual momentum state but rather our knowledge of that state, so the randomness introduced is epistemic. Bohm compared this randomness to the Brownian Motion of classical Statistical Mechanics, suggesting that mathematical chaos is responsible for the randomness exhibited in such finely focused situations as Zeilinger’s example of a quantum-mechanical random number generator (see section 4.6), a half-silvered mirror with a sequence of transmissions/reflections yielding a string of binary ones and zeroes. The way Bell (2004, Chapter 14) expressed it was “Note that the only use of probability here is, as in classical statistical mechanics, to take account of uncertainty in initial conditions.” As with all the models discussed in the previous section, numerous approaches to the derivation of the de Broglie-Bohm formalism have been explored and have yielded variations of the basic idea. In some, the particles are allowed to influence the wave function, altering its time evolution. Some of these formulations have proved amenable to Monte Carlo calculations of quantum-mechanical processes. Indeed the Monte Carlo simulation described in section 5.12 employed a simple nonlocal hidden-variable model to mimic the quantum-mechanical behavior of two spin-correlated fermions: rather than simulating the absence of a definite pre-measurement spin state for each fermion (a rather difficult thing to do), spin directions were merely assigned randomly, with the first being uniformly distributed over angle, and the second being assigned the opposite direction in order to conform to the Pauli Exclusion Principle, and then the nonlocality was simulated simply by forcing the second fermion’s spin into the direction opposite to the measurement result for the first in an ad hoc manner. As we saw therein, the results perfectly emulated the statistical predictions of Quantum Mechanics. 
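The ad hoc simulation recipe just described is easy to reconstruct. The sketch below is an independent reimplementation of that recipe as described (the author’s actual section 5.12 code is not reproduced here, and the sample size and all names are choices made for this illustration): it assigns the first fermion a uniformly random spin direction, gives the second the opposite direction, forces the second spin opposite to the first measurement result, and accumulates the spin correlation for several analyzer angles:

```python
import math
import random

def correlation(a, b, n=200_000, seed=12345):
    """Simulated spin correlation E(a, b) for analyzer angles a and b (radians)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        lam = rng.uniform(0.0, 2.0 * math.pi)      # hidden spin direction of fermion 1
        # Measure fermion 1 along a: +1 with probability cos^2 of half the angle.
        s1 = 1 if rng.random() < math.cos((a - lam) / 2.0) ** 2 else -1
        # Ad hoc nonlocal step: force fermion 2's spin opposite to the result at A.
        dir2 = a + math.pi if s1 == 1 else a
        # Measure fermion 2 along b using the same quantum rule.
        s2 = 1 if rng.random() < math.cos((b - dir2) / 2.0) ** 2 else -1
        total += s1 * s2
    return total / n

for deg in (0, 45, 90, 135, 180):
    theta = math.radians(deg)
    print(f"angle {deg:3d} deg: simulated E = {correlation(0.0, theta):+.3f}, "
          f"quantum prediction -cos = {-math.cos(theta):+.3f}")
```

The simulated correlations reproduce the quantum-mechanical −cos θ dependence to within statistical noise, which is the point made above about this kind of nonlocal hidden-variable model.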
Whether one should consider the de Broglie-Bohm theory or some variation of it as a viable description of how Nature actually works depends completely on one’s metaphysical preferences. The author is not aware of any proof that it is fatally flawed, unless one considers the fact that it is a nonrelativistic theory to be a fatal flaw. But many of the collapse mechanisms described in the previous section are taken seriously despite being nonrelativistic, so one is as free to embrace the de Broglie-Bohm theory as any of those. When a nonrelativistic theory is taken seriously, it just means that its acceptability is provisional pending generalization to compliance with relativity theory, and its description of situations in which relativistic effects are negligible is believed to be correct. The remaining question is whether the de Broglie-Bohm theory permits transmitting superluminal
information. As discussed in section 5.13, the reason why Quantum Mechanics cannot do that is the random nature of measurement outcomes. If one believes that quantum-mechanical randomness is epistemic, then one cannot rule out the possibility that one day the random influences will be overcome to a sufficient extent to encode information on the measurement of an entangled system and thereby break the light-speed barrier to communication. If this were to happen, then the mechanism described in section 5.14 implies that insane causality violations could be made to enter the realm of possibility. But the randomness in the de Broglie-Bohm theory is epistemic, and so this concern would have to be addressed by anyone endorsing that theory. The opinion has been expressed (e.g., Pauli, 1996; Ghirardi, 2005) that if a deterministic hidden-variable theory were to permit superluminal signaling, that would clearly make it not equivalent to Quantum Mechanics and thereby destroy its viability, since its purpose was presumably to be an exact replacement for Quantum Mechanics. It is not clear to the author that this is the case. It seems it would just put the deterministic hidden-variable theory into a position of superiority, able to do anything that Quantum Mechanics can do and more.

On the other hand, the author’s position is to endorse the approach of Zeilinger and Lorenz regarding free will and determinism quoted in section 5.12. Besides the need to maintain the motivation for doing science in the first place, there is something repugnant about repudiating the validity of our notion of morality. If all events that take place in the Universe obey purely deterministic laws, then there is no valid basis upon which we can be held responsible for our decisions. It follows that there are no valid moral consequences to the decisions we appear to make, so they may be any that appeal to our basest instincts. Paradoxically, the loss of free will bestows the greatest freedom possible: we may do anything at all without being held morally responsible (except possibly by other automatons who do not realize that free will is an illusion), because we never truly choose our actions. While this follows logically from the rejection of free will, it seems outrageous and every bit as insane as causality violations that rewrite the past by forming inconsistent loops in spacetime. We stress again that while strict determinism rules out free will, the presence of randomness in the Universe does not by itself establish free will. We have much to learn about the nature of consciousness before we could make that connection. In the author’s view, we should embrace the nonepistemic randomness of Quantum Mechanics because it leaves open the possibility that the notion of “free will” does have a valid meaning, and with it, the notion of morality. Rejecting strict determinism cannot be an erroneous choice, because if it is erroneous, then it is not a choice. On the other hand, embracing strict determinism could be an erroneous choice! We certainly do not advocate embracing a desired belief in the face of evidence contrary to that belief. But it is common in mathematical problems to encounter extra degrees of freedom for which arbitrary definitions may be made. For example, the zero point of the Newtonian gravitational potential is a logically unconstrained constant, but it is convenient to assign it a value such that the potential is zero at infinity, and so this is typically done.
If this were ever found to lead to a contradiction, a correction could be made. In the same way, neither Quantum Mechanics nor the de Broglie-Bohm theory can currently claim any relative theoretical or observational superiority, and so an arbitrary choice is acceptable as a working hypothesis. But it should be recognized that whatever choice one makes will set a context for intuition and can be expected to influence the path one takes in the further pursuit of science and other matters.
5.17 Summary

In the first section of this chapter we said “Sometimes the only apparent way to organize facts into a mathematical theory ends up producing a great challenge to interpretation.” Between that section and this one, we have attempted to make clear how that statement describes the situation in Quantum Mechanics. We would like to understand why the unassailable success of Quantum Mechanics follows from the properties of the basic ingredients of the Universe, but we have not been able to achieve this goal. When an object of desire may not be possessed, some comfort may be found in at least understanding why it cannot be possessed, and we have attempted herein to provide some solace of that kind. At the top of our list of reasons why we have not acquired a satisfying intuitive grasp of whatever truths underlie Quantum Mechanics is our obvious confusion about what the “basic ingredients of the Universe” are. Until this situation improves, having mathematical recipes for manipulating these ingredients falls as far short of “understanding” as the insight of the Sorcerer’s Apprentice regarding the reasons for the success of the spells learned from his master. Judging the role of randomness among the phenomena that comprise the world of our experience would surely be easier if we were not weighted down with these other interpretational difficulties, but making that judgment under the actual circumstances does not appear to be impossible. If someday we remove the veil of mystery from the matter/wave conundrum by grasping the true nature of the ingredients of the physical Universe, there is every reason to expect that the nonepistemic randomness found in Quantum Mechanics will remain essentially unchanged but expressed in different terms.

A simple context within which to ponder quantum-mechanical randomness is that of a pencil balanced on its point within a uniform gravitational field and isolated from all outside influences. Even within Quantum Mechanics, one may posit an ideal cylindrical pencil with a perfect conical point. In classical physics, this pencil may be balanced and left to stand forever. In Quantum Mechanics, the longer the pencil remains standing, the more accurately we know the position and momentum of its center of mass, and at some point, the Heisenberg Uncertainty Principle will be violated if the pencil does not fall. So the pencil cannot remain standing, but the question arises: in what direction shall it fall? In classical physics, the pencil cannot fall, because all directions are equivalent. This symmetry precludes the preference of any one direction over all others. But in Quantum Mechanics, something has got to give, and this symmetry is spontaneously broken without need of any logical justification for the direction in which it is broken. As we saw in sections 5.2 and 5.3 regarding orbital angular momentum L, “L cannot point along the direction in which it is said to point!” (Treiman, 1999). But in which direction does it point? Classically, we expect to be able to define an axis as parallel to L, but having done so, we find that the off-axis components of this vector are not zero, despite our protests. If we cannot even define an axis to be parallel to a particular vector, we probably should not expect to be able to balance a quantum-mechanical pencil perfectly. But how do these random nonzero magnitudes of the off-axis components come into being? It appears that they result from random fluctuations in space itself.
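An order-of-magnitude estimate of how long such a pencil could remain balanced can be sketched as follows (the pencil’s mass and length, the uniform-rod and linearized-tipping approximations, and the use of θ₀²Iω ≈ ħ as the limiting condition are all assumptions made here for illustration; nothing in this estimate is taken from the text):

```python
import math

hbar = 1.055e-34      # J s
g    = 9.81           # m/s^2

# Assumed pencil: treated as a uniform rod pivoting about its point.
m, L = 0.01, 0.15                         # kg, m
I = m * L**2 / 3.0                        # moment of inertia about the tip
omega = math.sqrt(3.0 * g / (2.0 * L))    # growth rate of the tipping instability

# Linearized tipping: theta(t) ~ theta0 * exp(omega * t) for small angles.
# The uncertainty principle forbids the initial tilt and the initial angular
# momentum from both being zero; taking theta0**2 * I * omega ~ hbar gives the
# smallest effective initial tilt, and the fall time follows from exponential
# growth up to theta ~ 1 radian.
theta0 = math.sqrt(hbar / (I * omega))
t_fall = math.log(1.0 / theta0) / omega
print(f"minimum effective initial tilt ~ {theta0:.1e} rad")
print(f"estimated maximum balancing time ~ {t_fall:.1f} s")
```

Under these assumptions the pencil cannot stay up for more than a few seconds, however carefully it is balanced, which is the point of the thought experiment.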
In section 5.4 we saw that quantum-mechanical linear harmonic oscillators have a zero-point energy ½hν, where ν is the resonant frequency (a brief but more complete discussion may be found in Appendix K). Any attempt to remove all energy from the oscillator must fail. The oscillator cannot be brought to absolute zero energy because that is not an eigenvalue. This property has been extended to empty space in the process of developing quantum field theories to reconcile Quantum Mechanics with Special Relativity (e.g., Quantum Electrodynamics, the relativistic quantum field theory of
electromagnetic interactions). Fourier analysis permits a quantum field to be seen as a sum or integral of linear harmonic oscillators, each one with a frequency whose amplitude is the corresponding Fourier coefficient. Removing all mass from a given space is equivalent to removing all energy from all those oscillators, and any attempt to do so will fail, because there will always be at least the zero-point energy present, and so there will always be vacuum energy in the space that is as empty as possible. Furthermore, this energy cannot be expected to remain constant in each region of space for arbitrarily long times, and so vacuum fluctuations will be present. In a typical Fourier expansion, there are an infinite number of frequencies, and so one might expect to encounter difficulties with the computation of the total vacuum energy yielding infinity. This problem indeed presented itself and required some retooling of the quantum field theories to make them renormalizable, allowing them to be used for calculations at various levels of approximation. This very large and very important topic is vastly beyond our scope, but we may point out that a satisfactory theoretical description of vacuum energy remains elusive, although the existence of vacuum fluctuations has been established experimentally (e.g., the Casimir Effect, in which two closely spaced parallel plates are drawn toward each other by their suppression of some modes in the frequency spectrum of the vacuum fluctuations in the space between them). The energy in vacuum fluctuations is important in cosmological models based on General Relativity, where its operation is similar to that of a cosmological constant (only “similar” because the distribution of this energy may be different from a simple constant). This energy exerts a negative pressure that results in an anti-gravity effect that contributes to the expansion of the Universe, possibly even producing a net acceleration of the expansion rate. But straightforward quantum-mechanical computations of this energy produce expansion rates shockingly contradictory to observation. Thus quantum field theory can claim both the most precise agreement between theory and measurement (the anomalous magnetic dipole moment of the electron, with agreement to better than ten decimal digits) and the most outrageous disagreement (the expansion rate of the Universe, with disagreement by a factor exceeding 10120). Much work remains to be done before we can claim to understand vacuum fluctuations, but we know they exist and have consequences, and our immediate interest is in whether their randomness is epistemic, like the agitations produced by the particles in a gas as described by classical Statistical Mechanics, or nonepistemic and thus beyond further modeling other than estimating probability distributions. On this question we are left once again with only metaphysical guidelines.
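As a sense of scale for the vacuum fluctuations just described, the following sketch evaluates the standard idealized Casimir result for two perfectly conducting parallel plates, an attractive pressure P = π²ħc/(240 d⁴); the plate separation d = 1 μm is an assumed value chosen only for illustration, and this is our own numerical aside rather than a calculation from the text.

import math

# Hypothetical numeric illustration (not from the text): the standard
# idealized Casimir pressure between two perfectly conducting parallel
# plates, P = pi^2 * hbar * c / (240 * d^4), at an assumed separation d.
hbar = 1.054571817e-34   # J s
c    = 2.99792458e8      # m/s
d    = 1.0e-6            # m (assumed plate separation)

pressure = math.pi**2 * hbar * c / (240.0 * d**4)   # N/m^2
print(f"Casimir pressure at d = 1 micron: {pressure:.2e} Pa")

The result is of order a millipascal, tiny but well within reach of modern laboratory measurements, which is why the Casimir Effect serves as experimental evidence that vacuum fluctuations are real.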
Chapter 6 The Quest for Quantum Gravity

6.1 Fields and Field Quantization

In the previous chapter we mentioned in passing that Newton's law of universal gravitation for two point masses (Equation 5.1, p. 223) assumes an absolute time (i.e., all clocks are synchronized and tick at the same rate independently of the observer's location and motion) and instantaneous propagation of an attractive force that varies as the product of the masses and inversely as the square of the distance between them. With this law and some initial conditions, the motion of any two bodies of known mass moving exclusively under the influence of mutual gravity can be computed in closed form. As soon as a third body is added, no closed-form exact solution is possible in general. This is just the nature of the nonlinear mathematics involved.

We also mentioned that if the instantaneous transmission of the force is removed so that it propagates at some finite speed, we are no longer able to define self-consistent nontrivial initial conditions, even with only two bodies in the system. We can assign positions and velocities arbitrarily for each body at t = 0, but we cannot compute the forces on each body at t = 0, since those forces originated at the other bodies when they were at unknown earlier positions at various past times that depend on how long the forces have taken to propagate. Without the forces, we do not know the accelerations, and without those, we cannot predict the evolution of velocity and position. Only artificial ad hoc starting conditions allow any calculation to proceed. The time-lagged one-way propagation times for the forces pulling any two bodies together typically result in those forces no longer pointing through the instantaneous center of mass, and the non-central-force motion that results does not conserve angular momentum. As a final blow, the explicit dependence of the gravitational potential on time destroys the conservation of mechanical energy. A finite propagation speed would also raise the possibility of gravitational waves, since it implies some reaction and relaxation time constants for the medium.

This was the situation before Einstein introduced relativity theory, wherein the clock rate of one system observed in another depends in general on location (because of position-dependent gravitational time dilation) and relative motion (because of the Lorentz transformation, which also affects lengths and masses), and gravitational influences propagate at the speed of light rather than instantaneously. These new phenomena did not cancel out any of the older obstacles, making the computation of gravitational interactions even more difficult than before. Fortunately, the power of computers has advanced at a prodigious rate since Einstein introduced relativity theory, and numerical methods have evolved to take advantage of it, so that the general unavailability of closed-form solutions has not prohibited many nontrivial physical situations from being described with high accuracy by excellent approximate numerical solutions. These calculations have allowed experimental tests of relativity theory that have served to establish great confidence in its correctness within the context to which it applies. Even without closed-form solutions, the philosophical interpretation of the main physical content of Einstein's field equation for General Relativity is straightforward: there are no gravitational forces involved; free-fall motion under the influence of gravity is along least-resistance
paths called geodesics in a space that possesses intrinsic curvature. The gravitational interaction of two bodies is through the contribution of each body’s energy (including mass) to the curvature, resulting in geodesics that are the trajectories of those bodies in the absence of actual forces. Of course, the notions of mass and acceleration exist in General Relativity, and so their product can still be defined as a force, but this is a derived notion, not a fundamental one as described by Newtonian gravity. There is a great deal more to be said about how General Relativity characterizes the physical Universe, but we cannot afford to open the door to the extremely large accumulation of highly specialized mathematics necessary to do justice to the subject. Instead we will focus on some qualitative features that are relevant to the questions of what a Quantum Gravity Theory might look like, how to go about formulating such a theory, and why this has not already been achieved. Since the essential features of Quantum Mechanics must be present, the question of what role randomness would play is especially pertinent. One bit of common ground shared by General Relativity and Quantum Mechanics is that both describe “empty” space as a dynamic entity. There are many different opinions about how to interpret the “curved spacetime” of General Relativity. One widespread view is that space is nothing more than a relationship between the positions of material bodies and has no intrinsic physical properties of its own; the “curvature” is a property of a “field” that permeates space, not the space itself. The other prevalent view is that space is a kind of fabric, a physical substance that serves as the canvas upon which physical reality is painted or the cloth from which the objects in the Universe are woven. Where no material body has been painted, the canvas is still there, and the “field” is just a way to characterize the properties of the spacetime fabric in the language of mathematics. The first view suggests that without any material bodies, space is undefined. But that is the view that Einstein abandoned after Willem de Sitter (1917) found a matter-free solution to the field equations. Even without any material objects, the curvature remains defined and determines the metric tensor field that generates geodesics. The discovery of gravitational waves (Abbott et al., 2016) that can propagate though “empty” space further supports the view that space is a physical substance whose properties lend themselves to a mathematical description in terms of a field. This experimental evidence of Einstein’s claim that the acceleration of matter causes radiation of gravitational waves echoes the discovery that the acceleration of charged particles causes radiation of electromagnetic waves. Both argue that “empty” space is not absolute nothingness; some kind of undulatory medium is there in which the waves propagate. Whereas Newton’s view that light consists of particles was consistent with space being absolute nothingness through which the particles moved, the light waves described by Christiaan Huygens, Thomas Young, and Augustin Jean Fresnel could not propagate in absolute nothingness, and light was observed to exhibit the interference and diffraction effects described by the wave theory. 
Furthermore, it was well known that waves propagate at a speed determined primarily by the vibrating medium, and measurements of the speed of light had already indicated that this speed was practically constant, whereas there was no apparent reason why Newton’s light particles should all move at the same speed, since no other material particle motion was constrained in this way. The concept of a “field” has been very important in physics ever since Michael Faraday described electricity and magnetism with the closely related phrase “lines of force” (1852, 1852b; he does not appear to have used the term “field” explicitly). Faraday discovered that passing an electric current through a coil of wire wrapped around an iron core about which a second coil of wire 311
had been wrapped would induce a current in the second coil. This raised the issue of action at a distance again, but this time with forces whose geometrical variations were much less difficult to measure than those of Newton’s gravitational force. Faraday sought an interpretation of such forces that did not involve action at a distance, especially because of the completely mysterious nature of the mechanism producing the magnetic force. This force had been known to exist since antiquity and had been generally associated with the poles of a bar magnet. Faraday observed the effect of such poles on a magnetic compass at various locations in the vicinity of the bar magnet and was led to the notion of lines of force surrounding the bar. These clearly originated at the poles but extended into the surrounding space, so that he perceived the magnetism of a given pole as affecting the space in its immediate vicinity, and that space proceeding to affect the space adjacent to it, with the space-tospace influence continuing to propagate the mysterious force along certain directions that eventually traced to the other pole. As he put it (1852b): I have recently been engaged in describing and defining the lines of magnetic force, i.e., those lines which are indicated in a general manner by the disposition of iron filings or small magnetic needles, around or between magnets; and I have shown, I hope satisfactorily, how these lines may be taken as exact representants of the magnetic power, both as to disposition and amount ... The definition then given had no reference to the physical nature of the force at the place of action, and will apply with equal accuracy whatever that may be; and this being very thoroughly understood, I am now about to leave the strict line of reasoning for a time, and enter upon a few speculations respecting the physical character of the lines of force, and the manner in which they may be supposed to be continued through space.
Faraday recognized that these magnetic lines of force may or may not have a real physical existence of their own. His use of the phrase “physical character of the lines of force” suggests that he was inclined toward a belief that they corresponded to a physically real feature of some medium, perhaps the luminiferous aether. Since it was known that light propagates at a finite speed, it was reasonable to expect that the magnetic force does so also, which in his view distinguished it from the gravitational force, which was supposed, although it was not certain, to propagate instantaneously, and he saw that as an indication that the gravitational lines of force were a purely mathematical construct. He noted other differences, such as gravity being always attractive and impossible to block with intervening substances. He expressed the belief that if the gravitational force were eventually found to require some nonzero time to propagate, then that would argue for a real physical medium supporting those lines of force also. In any case, the notion of lines of force was more palatable than pure action at a distance, since it involved the field at each point in space being affected by the field at neighboring points, making a connection between the source of a force and the object on which it acted. The field could be described mathematically by a vector function of position. The field concept was accepted only gradually into the physics mainstream, after which the field associated with Faraday’s magnetic lines of force became known as a vector field. A field may be any kind of function of position, and several kinds have been found useful in physics. For example, the Newtonian gravitational force field is a vector field, but it may also be represented by a scalar field, the gravitational potential field whose gradient is the force field. Whereas something about the force field had to be considered real, since it produced real effects on physical bodies, the potential field was not considered real in the same sense. It is clearly a mathematical mnemonic device whose function is to be acted on by the gradient operator, so its zero 312
point is arbitrary, and that could not be true of a real physical object. The absolute value of the potential is usually chosen to be zero at infinity, because then the negative of the potential equals the gravitational potential energy, the work that would need to be done against gravity in moving a body of unit mass from infinity to its actual location (negative because the force resisting gravity points in the direction opposite to the motion). This is considered real in the same sense that kinetic energy is considered real, because it is their sum that is conserved under gravitational interaction. The zero point of the gravitational potential energy is not arbitrary, because this energy contributes to energy density and must be calculated with the correct absolute value (this potential energy may also be used to calculate the force, in which case the force is defined as the negative of the gradient). Thus among mainstream scientists, some of the fields encountered in physics are regarded as corresponding exactly to existing components of objective reality, and some are clearly abstract mathematical devices that aid in calculations but have no direct counterparts among physically real objects. It is probably fair to say that there is no consensus regarding the ontological status of most fields, because it is not required for their productive mathematical utilization. Faraday’s work laid the foundation upon which James Clerk Maxwell built his formal theory of electromagnetism, with significant contributions primarily from the work of André-Marie Ampère, Carl Friedrich Gauss, Hendrik Antoon Lorentz, and Oliver Heaviside. Maxwell’s approach set the pattern for subsequent development of field theories, the simultaneous activities of finding the best definitions for the relevant fields and the form of the equations that govern their behavior. The fields involved in the modern form of Maxwell’s Equations are electric, magnetic, and electric current density vector fields, and a charge density scalar field. The equations are first-order differential equations or their equivalent integral equations. This formalism led to the identification of light as an electromagnetic phenomenon by describing the propagation of oscillating electric and magnetic field vectors. It was absorbed easily into the theory of Special Relativity shortly before the question of how it fit into Quantum Mechanics was taken up. Paul Dirac was the first to make significant progress in formulating a quantum-mechanical special-relativistic description of the interaction between electrons and light. This work introduced the next feature of quantum field theory, the simultaneous consideration of particles and fields, eventually leading to Quantum Electrodynamics, Electroweak Interaction Theory, and Quantum Chromodynamics, the quantum field theories that cover electromagnetic interactions and the weak and strong nuclear interactions, thus all known interactions other than gravity. To accomplish this, it was necessary to introduce a new kind of field, operator fields. As we saw in the previous chapter, the values of classical observables such as energy and momentum become eigenvalues of operators in Quantum Mechanics. The “vacuum” became a field of harmonic-oscillator operators with corresponding eigenvalues of the form (n+½)hν (see Appendix K), implying a nonzero minimum energy, the vacuum energy due to vacuum fluctuations. 
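For concreteness, a minimal sketch of the oscillator energy ladder just mentioned (the frequency below is an assumed optical-range value, not one taken from the text):

# Minimal sketch (assumed numbers, not from the text): the energy
# eigenvalues E_n = (n + 1/2) h nu of a quantum harmonic oscillator,
# showing that even n = 0 leaves the zero-point energy (1/2) h nu behind.
h  = 6.62607015e-34      # J s
nu = 5.0e14              # Hz (an assumed optical-range mode frequency)

for n in range(4):
    E_n = (n + 0.5) * h * nu
    print(f"n = {n}:  E = {E_n:.3e} J  ({E_n / 1.602176634e-19:.2f} eV)")
# n = 0 gives roughly 1 eV for this mode: energy that cannot be removed.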
This energy, together with the problem of the electron’s self energy (the electrostatic self-repulsion of extended electric charge), caused attempts to apply perturbation theory to the electron-light interaction to run into pathological infinities arising at every level of approximation. Finding ways to make these infinities cancel each other out at each level led to Quantum Electrodynamics and Nobel prizes for Sin-Itiro Tomonaga, Julian Schwinger, and Richard Feynman. This process was dubbed renormalization, and it is also required in the other quantum field theories, which are therefore called renormalizable theories. With the exception of some work that has made progress in probing how these theories are affected by being embedded in curved spacetime, they generally omit gravity as negligible and hence absent 313
from theoretical considerations, so that they need to be compatible only with Special Relativity. When the time came to “quantize” General Relativity, the fact that the equations determining the gravitational field are nonlinear caused difficulties not encountered with the linear equations of Quantum Mechanics. The attempt to do what had worked for electroweak and strong interactions could not be made to work for General Relativity; it is not a renormalizable theory. Because Einstein’s field equations are nonlinear, gravity interacts with itself. For example, gravitational waves can scatter off of each other, whereas electromagnetic waves cannot (with the very rare exception of photon-photon scattering caused by interaction with intermediate charged particles such as those that arise in pair production). This makes the straightforward quantization of the metric field and/or the stress tensor field of General Relativity nonrenormalizable. As they stand, Quantum Mechanics and General Relativity have irreconcilable differences. Neither can be hammered into the other’s mold, and to arrive at a unified Quantum Gravity Theory, some completely new approach will be required. 6.2 Essential Features of General Relativity In this section, we will attempt the difficult task of summarizing the essential features of General Relativity without oversimplifying them. Our scope limits us to a qualitative description of those aspects needed to obtain some intuitive appreciation of: (a.) the classical physical reality described by General Relativity that must be subsumed in some manner into a quantum theory; (b.) why the gravitational field cannot be quantized in the same manner that was successful in all the other quantum field theories. Two excellent introductory texts that manage to do a better job than we can do here are Lieber (1945) and the first three chapters of Wald (1977). Two examples of more extensive discussions suitable for graduate-level study are Misner, Thorne, and Wheeler (1973) and Wald (1984). The key conceptual ingredient of General Relativity, which Einstein called the happiest thought of his life when it first occurred to him in 1907, is known as the Principle of Equivalence. As with most conceptual ingredients in physics, controversy still reigns over some nuances in the interpretation of this principle. Certainly some aspects of it were not original with Einstein. An element of it is present in Galileo’s claim that when frictional effects can be effectively removed, all bodies fall at the same rate under Earth’s gravity, irrespective of weight. After Newton presented his laws of motion and universal gravitation, Galileo’s principle could be seen as equating the inertial mass, m in F = ma, with gravitational mass, the m1 and m2 in Equation 5.1 (p. 223). Einstein’s Principle of Equivalence, if true, would provide the reason why these masses are the same physical parameter. Einstein entertained two main thought experiments involving the similarity in the physical experiences of two observers in different circumstances, both in closed rooms: (a.) a person in a room floating freely far away from any significant gravitational effects, and a person in a room falling freely under the influence of gravity; (b.) a person like the first in (a.) except being uniformly accelerated in the direction of the ceiling by an external force, and a person in a room sitting on the surface of the Earth and thus feeling the pull of gravity. 
In both cases, if each person is holding an object and releases it, what happens next is the same for both observers. In case (a.) the objects float freely along with each observer, and in case (b.) each object moves toward the floor with an accelerated motion. In both cases, the observers are in equivalent situations. Einstein realized that these equivalences had to be considered local, since tidal effects would eventually reveal the difference between a directional gravitational force and a uniform acceleration.
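The locality of the equivalence can be given a rough numerical sense with the following sketch (the room height and the Newtonian point-mass approximation for the Earth are our own assumptions for illustration): the gravitational acceleration changes slightly between floor and ceiling, whereas a uniformly accelerated room has exactly the same effective acceleration everywhere.

# Hypothetical numbers (not from the text): how non-uniform Earth's gravity
# is across a room, which is what ultimately betrays the difference between
# a real gravitational field and a uniform acceleration.
# g(r) = G*M/r^2, so over a height h << R the change is about 2*G*M*h/R^3.
G   = 6.674e-11        # m^3 kg^-1 s^-2
M_E = 5.972e24         # kg
R_E = 6.371e6          # m
h   = 3.0              # m (assumed floor-to-ceiling height)

g_surface = G * M_E / R_E**2
delta_g   = 2.0 * G * M_E * h / R_E**3
print(f"g at the surface        : {g_surface:.3f} m/s^2")
print(f"change over {h:.0f} m of height: {delta_g:.1e} m/s^2 (about 1 part in 10^6)")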
The extent of this necessary locality is one area of disagreement in the interpretation of what Einstein meant (and according to some scientists, what he should have meant). We will attempt to skirt these issues here for the sake of brevity. The point is that there is some region small enough that the two situations in both cases are equivalent in the sense that no experiment is possible that would distinguish between them. For Einstein, this meant that the laws of physics must be the same for both situations in each case, and it is in this respect that he went beyond previous notions of equivalence and advocated a principle of general covariance, that the laws of physics take the same form in all coordinate systems that are related through differentiable coordinate transformations. This carries with it the implication that there is no preferred coordinate system in Nature. Other than notational appearance, the laws of physics do not depend on the specific coordinate system in which they are expressed.

A step in this direction had been taken by Einstein in his Special Theory of Relativity, which applies only to coordinate systems in uniform (i.e., unaccelerated) motion relative to each other, also called inertial systems. Consider two systems, S with coordinates (x,y,z,t) and S′ with coordinates (x′,y′,z′,t′), with the two sets of axes aligned and the latter moving with a speed v relative to the former in the x direction. The first three of each set of coordinates are space coordinates, and the fourth is a time coordinate, with the clocks in both systems running at the same rate within their own coordinate systems and having the same reading at the instant when the two origins coincide. Einstein (1905c) resolved a number of open problems by theorizing that physical events in systems described by S and S′ are related to each other by Lorentz Transformations rather than the classical Galilean transformations. The latter had greater intuitive appeal for the classical physicists and are given by
x' = x - vt, \qquad y' = y, \qquad z' = z, \qquad t' = t \qquad (6.1)

Newtonian mechanics was entirely consistent with the Galilean transformation, but at the period around 1900, the fact that Maxwell's Equations were not was perceived to be a problem with this formulation of electrodynamics. Einstein showed that Maxwell's Equations were in fact Lorentz-covariant, where the transformation is given by (see, e.g., Rindler, 1969)

x' = \gamma (x - vt), \qquad y' = y, \qquad z' = z, \qquad t' = \gamma \left( t - \frac{v x}{c^2} \right) \qquad (6.2)

where c is the speed of light in vacuum, a universal constant in all reference frames, and

\gamma = \frac{1}{\sqrt{1 - v^2/c^2}} \qquad (6.3)
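A minimal computational sketch of Equations 6.2 and 6.3 (not the author's code; the sample speeds are assumed values): it evaluates γ for a few speeds and checks that an event lying on a light ray in S maps to an event lying on a light ray in S′.

import math

c = 2.99792458e8  # m/s

def gamma(v):
    # Equation 6.3
    return 1.0 / math.sqrt(1.0 - (v / c) ** 2)

def boost(x, y, z, t, v):
    """Coordinates (x', y', z', t') in a frame moving at speed v along +x (Eq. 6.2)."""
    g = gamma(v)
    return g * (x - v * t), y, z, g * (t - v * x / c**2)

for beta in (0.1, 0.5, 0.9, 0.99):
    print(f"v = {beta:4.2f} c  ->  gamma = {gamma(beta * c):.4f}")

# An event on a light ray, x = c*t, maps to x' = c*t' (light speed is preserved):
x, t = c * 1.0, 1.0
xp, _, _, tp = boost(x, 0.0, 0.0, t, 0.5 * c)
print(f"x' / (c t') = {xp / (c * tp):.6f}")   # prints 1.000000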
The first line of Equation 6.2 was proposed in 1889 by George Fitzgerald as an ad hoc kinematic explanation of the failure of the Michelson-Morley experiment in 1887 to measure the motion of the Earth relative to an absolute reference frame associated with the luminiferous aether, the medium in which light was thought to propagate. Michelson and Morley had devised an interferometer using a light beam split into perpendicular directions and reflected back to a point where they interfered with each other. The direction of the “aether flow” was unknown, but it could not be parallel to both perpendicular directions nor perpendicular to both throughout the Earth’s rotation period. There should have been differences in the interference pattern at different orientations due to changes in which beam’s light moved more with and against the aether flow while the other was more transverse. No such effect was observed (an upper limit of 1/6 of the Earth’s orbital speed about the Sun was established), and other experiments had shown that the aether was not “dragged” along by the Earth. This led Fitzgerald to propose the “length contraction” in the first line of Equation 6.2 to adjust the distances traveled by the two perpendicular beams and cancel their effect on interference. The transformation was independently proposed by Hendrik Antoon Lorentz in 1892 for the same purpose in addition to a larger role in his electronic theory. In 1899 he added the “time dilation” on the fourth line of Equation 6.2, which Joseph Larmor had done independently in 1897. The various aspects of the Lorentz transformation had been floating around in largely uninterpreted form for 16 years when Einstein (1905c) tied them together in his Special Relativity Theory, although he stated later that explaining the Michelson-Morley results had not been among his goals at the time. In addition to explaining the failure of Michelson and Morley to detect the “drift” of the aether past the Earth, this postulate of Einstein’s also threw light on Maxwell’s Equations by showing that they were Lorentz covariant. The predicted time dilation and length contraction in the direction of motion were soon observed experimentally (e.g., extremely high-speed cosmic ray particles with very short decay times are observed to have their lifetimes extended long enough in a laboratory rest frame to survive the trip through the atmosphere, whose depth is foreshortened in their reference frames). Just as points can be transformed with Equation 6.2, intervals, which are just differences of points, can be transformed, and this is how Equation 5.53 (the Lorentz transform for a time interval, p. 276) was obtained. Once intervals are transformed, ratios of intervals can be computed, allowing velocity to be expressed in different inertial systems. Consider a velocity vector u expressed in system S. This can be transformed to the velocity u in system S as follows. The components of u are (ux, uy, uz), where ux = Δx/Δt, uy = Δy/Δt, and uz = Δz/Δt. Maintaining the convention that the S system is moving relative to S with speed v in the x direction, we transform the four intervals and compute the corresponding interval ratios in S , allowing us to compute the components of u as functions of (ux, uy, uz):
u'_x = \frac{u_x - v}{1 - \dfrac{u_x v}{c^2}}, \qquad
u'_y = \frac{u_y}{\gamma \left( 1 - \dfrac{u_x v}{c^2} \right)}, \qquad
u'_z = \frac{u_z}{\gamma \left( 1 - \dfrac{u_x v}{c^2} \right)} \qquad (6.4)
The squared magnitude of u is u² = u_x² + u_y² + u_z², and denoting the squared magnitude of u′ as u′² = u′_x² + u′_y² + u′_z², we compute the latter by applying Equation 6.4. After some regrouping, we find
c^2 - u'^2 = \frac{c^2 \, (c^2 - u^2)(c^2 - v^2)}{(c^2 - u_x v)^2} \qquad (6.5)
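A quick numerical sanity check of Equations 6.4 and 6.5 (a sketch with assumed sample velocities, not from the text), working in units with c = 1:

import math

c = 1.0          # work in units with c = 1
v = 0.5          # assumed boost speed of S' along +x

def transform(ux, uy, uz):
    # Equation 6.4 with c = 1
    g = 1.0 / math.sqrt(1.0 - v * v)
    d = 1.0 - ux * v            # the common denominator 1 - ux*v/c^2
    return (ux - v) / d, uy / (g * d), uz / (g * d)

for u in [(0.3, 0.4, 0.0), (0.6, 0.8, 0.0)]:          # |u| = 0.5 and |u| = 1
    up = transform(*u)
    u2, up2 = sum(x * x for x in u), sum(x * x for x in up)
    rhs = (1 - u2) * (1 - v * v) / (1 - u[0] * v) ** 2   # right-hand side of Eq. 6.5, c = 1
    print(f"|u| = {math.sqrt(u2):.3f} -> |u'| = {math.sqrt(up2):.6f}, "
          f"1 - u'^2 = {1 - up2:.6f}, Eq.(6.5) rhs = {rhs:.6f}")

The two printed values agree for each sample velocity, and the case |u| = 1 (i.e., |u| = c) comes out with |u′| = 1 as well.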
From this we can see that if |u | = c, then |u| = c also, and vice versa by symmetry. So the vacuum speed of light in one inertial system transforms to the same speed in any other inertial system, which is why Michelson and Morley were unable to detect the motion of the Earth relative to the luminiferous aether, and this is probably the most widely known fact about Special Relativity. Another well-known fact is that simultaneity is not absolute: two events that occur at the same time in one inertial system generally occur at separate times in other inertial systems, as is obvious from the last line of Equation 6.2: two events that have equal t values but different x values in S will generally have different t values in S . As a result of this, the luminiferous aether lost its status as an absolute standard of rest, and classical support for absolute time and space was removed. As Einstein noted, his theory did not prove that there is no luminiferous aether or a preferred standard of rest, but these notions played no role and were therefore dispensable within the context of Special Relativity. Most physicists continued to believe that some sort of medium was required for electromagnetic wave propagation, but perhaps surprisingly, an impressive amount of progress was possible without resolving this dilemma. Prior to Special Relativity, the interpretation of Maxwell’s Equations included separate identities for electric and magnetic fields. The equations showed how they were coupled, but did not assign them one identity. An important feature of Special Relativity is making this identification. Einstein (1905c) showed that what was an electric field in one inertial reference frame could be a magnetic field in another, and vice versa. Whatever medium was involved, observing its effects depended on the observer’s state of motion. In an article intended for laypersons (Einstein, 1929), he said: Up to that time the electric field and the magnetic field were regarded as existing separately even if a close causal correlation between the two types of field was provided by Maxwell's field equations. But the special theory of relativity showed that this causal correlation corresponds to an essential identity of the two types of field. In fact, the same condition of space, which in one coordinate system appears as a pure magnetic field, appears simultaneously in another coordinate system in relative motion as an electric field, and vice versa.
This parallels Einstein’s happy thought regarding gravitation and acceleration: for a person in the accelerating reference frame falling freely in a gravitational field, Einstein said there is no gravity, just as for a person moving without acceleration far from any significantly gravitating body. This is interpreted by some as Einstein claiming that gravity and spatial curvature are not the same thing. And yet Einstein did apparently consider “space” more than absolute nothingness. In the same article quoted above, he said “space thus gave up its passive role as a mere stage for physical events” and referred specifically to “the physical states of space itself”. We interpret this to mean that to Einstein, spatial curvature was similar to the electromagnetic field in that its physical manifestation depends on the state of motion of the observer. In a lecture given at Kyoto University (1922c), he said “A falling man does not feel his weight because in his reference frame there is a new gravitational field which cancels the gravitational field due to the Earth.” So relative uniform motion can cause a single field to manifest itself as electric in one reference frame and magnetic in another, and accelerated motion 317
can cause a single field to manifest itself as a gravitational field in one reference frame but to be absent in another, with the net gravitational field depending on whether an external gravitational field is present. By representing Maxwell's Equations in the context of Special Relativity, Einstein had arrived at a mathematical description of how the state of motion controlled the dual electric/magnetic manifestations of a single field, but in 1907 he had no such mathematical description of whatever it was that could manifest itself in a small enclosed space sometimes as gravity and sometimes as acceleration. His "happy thought", however, told him that this too was controlled by the state of motion.

Unlike Maxwell's Equations, Newtonian mechanics was consistent with the Galilean transformation, Equation 6.1, but not with the Lorentz transformation, Equation 6.2, assuming that one demands that both mass and momentum remain constant independently of reference frame. To see that Special Relativity cannot keep both quantities absolutely constant in all inertial reference frames, we follow a theme similar to one in Rindler (1969) by considering two arbitrarily small spherical bodies with identical mass m when measured in the same system. Then with the standard arrangement of axis alignment and clock synchronization, we place one mass at the origin of the S system and the other almost at the origin of the S′ system but off by just under one diameter in the y direction (which is also the y′ direction). We let S′ approach S from the -x direction with speed v so that the two masses experience a glancing collision. Prior to the collision, the primed speeds in Equation 6.4 are all zero, which gives us ux = v, uy = uz = 0. After the glancing collision, the two bodies will have some small opposite transverse velocities in both of their own x and y directions. Considering only the magnitudes of these transverse velocities in the common y direction, the first body will move with speed uy and y-axis momentum muy in the S system, and by symmetry the second body will move with the identical magnitudes in the S′ system. When we transform the second body's corresponding y′-axis speed and momentum to obtain their appearance in the S system, we obtain a transformed speed and a momentum equal to m′ times that speed, where m′ denotes the second body's mass as observed in S. Conservation of momentum demands that this transformed momentum equal muy. But the middle of Equation 6.4 shows that we actually have
\frac{m' u_y}{\gamma \left( 1 - \dfrac{u_x v}{c^2} \right)} = m u_y \qquad (6.6)

so that when we divide uy out of both sides, we have

m' = \gamma \left( 1 - \frac{u_x v}{c^2} \right) m \qquad (6.7)

We are free to make the collision more and more glancing, which makes ux as small as we like. The limit of Equation 6.7 as ux approaches zero is

m' = \gamma \, m \qquad (6.8)
so that when both masses are observed in the S system, the mass of the second one is increased by the factor γ. The alternative is to surrender momentum conservation, which is not supported by experiment, whereas the relativistic mass increase is. Furthermore, Lorentz transformation involves both time dilation and length contraction, altering two of the three most fundamental physical parameters that require units. The third such parameter is mass, and so it should perhaps not be surprising that its 318
value also depends on which inertial reference frame is used to observe it.

Special Relativity also introduced the notion of space and time being unified into a single four-dimensional medium called spacetime. As we will see below, the time parameter is scaled in General Relativity by the speed of light, so the axis corresponding to time is best represented as a ct axis (it is also common to do relativistic calculations in units for which c = 1, making this simply a t axis, but with position units). Certainly the concept of position as a function of time had been well known for centuries and carries the image of a space-time plane, but the Lorentz transforms for x and t above can be interpreted as due to a rotation of the (x,ct) plane about the (y,z) plane by an angle θ such that sin θ = v/c (in four dimensions, rotations are about planes in the same way as three-dimensional rotations are about lines and two-dimensional rotations are about points). Thus the x′ axis has components on both the unprimed x and unprimed ct axes, i.e., the x′ space direction in S′ is a linear superposition of space and time directions in S. This is illustrated in Figure 6-1, which for simplicity assumes that t = t′ = 0, and v = c/2, hence θ = 30° (the rotation may appear negative, but it is actually positive, because in this subspace projection, the axis of rotation points into the page).
Figure 6-1. Lorentz Transform showing rotation of the primed system by θ = sin⁻¹(v/c). The distance L′ on the x′ axis transforms to the distance L on the x axis, where L = L′ cos θ, hence L′ = L/cos θ = L/√(1 − v²/c²) = γL, where γ ≡ 1/√(1 − v²/c²).
Such rotations are frequently referred to as pseudorotations in order to discourage taking the geometrical interpretation too far, and yet it is suggestive of something real, since two bodies in relative uniform motion in General Relativity are moving along geodesics with a qualitatively similar angle between their instantaneous directions (depending on how one defines angle measurements and simultaneity in a space that is itself curved in general). Since the S observer is situated at the origin of that system, whatever the component of L′ on the ct axis is, it is in the S observer's past. With v > 0, the ct′ axis is rotated from the ct axis in the +x direction, and so as the two observers move forward
in time along their geodesics, the x distance between them grows larger.

Of course the modern concept of the role played by spatial curvature in gravitational interactions was not on Einstein's mind in 1907, but nevertheless the importance of geometry became clearer and clearer to him in the eight-year period that followed. He thought about how acceleration affected reference frames in Special Relativity: if the system S′ is not only moving relative to the system S but also accelerating, then the Lorentz contraction parameter γ in Equation 6.3 is getting larger with the passage of time (in either system), and as measured in system S, the x-direction length of a rigid body in system S′ is growing ever shorter as the acceleration continues. Even a standard ruler laid next to this rigid body in system S′ in the x direction is shrinking. It is not just the rigid body that is shrinking, it is the scale of the x′ coordinate as measured in system S. This meant that Euclidean geometry, which Special Relativity had carried over from classical mechanics, no longer applied to system S′ as observed in system S if acceleration is present. The problem is not that Euclidean geometry can't handle time-dependent relationships; in fact, the Lorentz transforms contain time as a parameter. But Euclidean geometry is not appropriate when the nature of the coordinates themselves depends on time. Accelerated motion violated the conditions of Special Relativity, and Euclidean geometry had to be replaced. As Einstein put it (1923):
Here at last was the counterpart to the dual manifestations of a single field as electric or magnetic based on uniform motion. His reference to “metric” in 1923 is probably after-the-fact relative to the period from 1907 to 1915, during which he gradually developed a full appreciation of the need for a geometry more general than Euclid provided. We will discuss “metric” further below. He also sensed the importance of invariant quantities, particularly the invariance of certain lengths under various coordinate transformations. This led him to study Gauss’s work regarding surfaces, and the work of Gregorio Ricci and Bernhard Riemann regarding higher-dimensional differential geometry. After a false start in which he mistakenly dismissed Riemannian geometry as incapable of providing what he needed, he corrected his error and focused on the methods used by Riemann, which had also been developed by Ricci and others such as Elwin Christoffel and Tullio Levi-Civita. Three of the central ideas needed by Einstein that had been found extremely useful by these mathematicians were manifolds, metrics, and tensors. At this point we stand before the door of an overstuffed closet, a door whose keyhole we may peer through but that we dare not open, because the scope of this book would not survive the ensuing avalanche of mathematical elaborations. For example, there is a good reason why the excellent textbook produced by Misner, Thorne, and Wheeler (1973) has been affectionately nicknamed “The Phone Book”. Our purpose is to acquire some intuition regarding the physics of gravitational interactions, and much of the very large collection of mathematical details that enter into General Relativity are involved with providing firm mathematical foundations for operations such as taking derivatives on curved spaces. Both the spaces and the derivatives must be carefully defined if mathematical absurdities 320
are to be avoided. We will be forced to take on faith that the fine print has been read and certified by experts, and we cannot enter in significant detail into any of the specific applications of General Relativity to physical situations. But we can give a concise sketch of the kinds of formal issues that are covered by all the legal specifications. First, we must be a little careful about how we define the space whose curvature we want to describe. The kind of space we need is called a manifold, and we need it because it possesses regions that are locally Euclidean. This requires that a (perhaps) surprising amount of fine details be hammered out. Once that has been done, we will be free to do things like use the Pythagorean Theorem in a curved space, we just have to be careful to account for the fact that we can do this only in a locally Euclidean neighborhood. We also need for our manifold to be differentiable, i.e., we need for all these Euclidean neighborhoods to be connected to each other smoothly. Then we can do things like take higher-dimensional derivatives. If we could not do this, we could not discuss meaningfully such things as the speed with which we drive along an interstate highway, since we do that on the surface of an approximately spherical Earth (not to mention local fine-scale fluctuations in the surface). Unlike the two-dimensional surface of the Earth, however, we need a manifold that can support four-dimensional spacetime, but that is no problem for Riemannian geometry, which was created to handle any number of dimensions. Next, we need to be able to describe curvature in our four-dimensional manifold in a sufficiently general way. This is the problem solved by Riemann et al. by using a mathematical object known as a metric. It turns out that we have been using a metric all along when we use Euclidean geometry. For example, the familiar Pythagorean Theorem in plane geometry, A2 + B2 = C2, where C is the hypotenuse of a right triangle and A and B are the other two sides, has always implicitly had scale factors on A and B that happen to have the value 1 in Euclidean geometry. With these constant unitmagnitude scale factors, we don’t need to worry about whether any significant property of our coordinate system itself is changing as we move along the hypotenuse, but if we wanted to be especially careful, we could consider arbitrarily small differential segments and formally include the scale factors, which we will denote g11 and g22 for the A and B directions respectively, by writing (dC)2 = g11 (dA)2 + g22 (dB)2, (where we enclose the differentials in parentheses in preparation for a need we will encounter below to distinguish between exponents and superscript coordinate indexes) and then integrate over A and B. If we were sticking to Euclidean geometry, this would be a complete waste of time, but here it sets us up for the transition to curved spaces in which lengths must be integrated along paths through space whose curvature is defined by the metric gμν, where Greek subscripts are commonly used for fourdimensional spacetime, and while only one index is needed in Euclidean geometry, we will need two for more general cases in which the curvature in one direction depends also on the other directions. 
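As a concrete illustration of a metric whose components vary from point to point (a standard textbook example, not taken from this chapter), consider the two-dimensional surface of a sphere of radius R, described by colatitude θ and longitude φ:

ds^2 = R^2 \, d\theta^2 + R^2 \sin^2\!\theta \, d\varphi^2,
\qquad \text{i.e.,} \qquad
g_{11} = R^2, \quad g_{22} = R^2 \sin^2\!\theta, \quad g_{12} = g_{21} = 0 .

Here g22 depends on position: a step dφ in longitude corresponds to a shorter physical distance near the poles than at the equator, and the length of any path must be obtained by integrating ds along it. This is exactly the bookkeeping that the general form introduced below in Equation 6.10 is designed to handle.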
Now we need to modify our coordinate notation: let A be defined on an axis denoted x^1, and let B be defined on an orthogonal axis denoted x^2, orthogonal because A and B are perpendicular in a right triangle, and we use coordinate indexes in superscript position rather than subscript position for a reason that will be explained below. Finally, let the hypotenuse C be defined on a path denoted s. Then our differential Pythagorean Theorem becomes
(ds)^2 = g_{11} \, (dx^1)^2 + g_{22} \, (dx^2)^2 = g_{11} \, dx^1 dx^1 + g_{22} \, dx^2 dx^2 \qquad (6.9)
Now using the fact that for Cartesian coordinates in Euclidean geometry, g12 = g21 = 0, we can write this more generally:
(ds)^2 = \sum_{\mu} \sum_{\nu} g_{\mu\nu} \, dx^{\mu} dx^{\nu} = g_{\mu\nu} \, dx^{\mu} dx^{\nu} \qquad (6.10)
where we have introduced Einstein notation on the right-hand side, which is defined such that summation is implied over any indexes that occur more than once in a term.

Not much is left before we will be able to discuss space curvature as it applies to General Relativity. First, the values taken on by μ and ν need to be extended from 1 and 2 above to 0, 1, 2, and 3 for four-dimensional spacetime (various conventions are used, but denoting the time axis with index 0 is very common). Finally, we need to expand on three more properties of gμν, specifically that it is symmetric, i.e., gμν = gνμ, that in general we will no longer have gμμ = 1 and gμν = 0 for μ ≠ ν, and that it is a tensor. Note that no exponents occur on the right-hand side of Equation 6.10, unlike the first part of Equation 6.9. This is important, because tensor notation makes use of indexes in both subscript and superscript locations, and the latter must not be mistaken for exponents. As we will see below, the two types of index are used to distinguish between covariant tensors and contravariant tensors. At first sight, a tensor looks just like a matrix. That's because matrices are encountered abundantly throughout physics and engineering, and many people are more familiar with them than with how they relate to tensors. Many matrices are also tensors, but not all. A tensor is a more general object with specific properties regarding coordinate transformations. Whereas the actual numerical values of the elements of ordinary vectors and matrices depend upon the basis in which they are expressed, one of the special qualities of a tensor is that it can be independent of basis. This makes tensors useful for expressing invariant quantities such as certain relationships between vectors. Given the importance of invariant quantities in relativity theory, we must go just a little deeper into that concept before finishing our exercise in gaining enough acquaintance with tensors to appreciate the role they play in General Relativity.

First we consider a very simple case of invariance in Euclidean Cartesian coordinates. It will be convenient occasionally to fall back to the (x,y,...) notation, and we will do so here. Figure 6-2 illustrates two such sets of coordinates, (x,y) and (x′,y′), shown with solid and dashed axes respectively, and with the latter's origin offset from that of the former such that the point x′ = y′ = 0 corresponds to x = 2, y = 1, and the primed system is rotated 30° relative to the unprimed system. Two points, P1 and P2, are plotted in both systems. The unprimed coordinates are (3,2) and (7,5) for P1 and P2, respectively. Given the offset and rotation of the primed system, the standard translate-and-rotate coordinate transformation for computing the primed coordinates from the unprimed coordinates is
\begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x - x_0 \\ y - y_0 \end{pmatrix} \qquad (6.11)
Figure 6-2. Two points P1 and P2 plotted in two different Cartesian coordinate systems: (x,y) with solid axes and (x′,y′) with dashed axes; relative to the former, the latter is rotated 30° and offset by 2 on the x axis and 1 on the y axis. The coordinates of P1 and P2 depend on which system is used, but the distance between them is invariant under a change of coordinates.
Plugging in the unprimed coordinates for P1 and P2, along with θ = 30°, x0 = 2, and y0 = 1, gives us the primed coordinates,

\text{P1: } \begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix} \dfrac{\sqrt{3} + 1}{2} \\[2mm] \dfrac{\sqrt{3} - 1}{2} \end{pmatrix} \approx
\begin{pmatrix} 1.366 \\ 0.366 \end{pmatrix}, \qquad
\text{P2: } \begin{pmatrix} x' \\ y' \end{pmatrix} =
\begin{pmatrix} \dfrac{5\sqrt{3} + 4}{2} \\[2mm] \dfrac{4\sqrt{3} - 5}{2} \end{pmatrix} \approx
\begin{pmatrix} 6.330 \\ 0.964 \end{pmatrix} \qquad (6.12)
Thus a vector from the xy origin to P2, for example, would have components (7,5), clearly different from the components of a vector from the x′y′ origin to P2, about (6.330, 0.964). Obviously the elements of a vector depend on the basis in which they are expressed. The distance between the two points is shown with a thick solid line, and this distance is the same in both coordinate systems. Denoting the square of this distance D², in the unprimed system we have

D^2 = (7 - 3)^2 + (5 - 2)^2 = 25 \qquad (6.13)

and in the primed system, we have
D^2 = \left( \frac{5\sqrt{3} + 4}{2} - \frac{\sqrt{3} + 1}{2} \right)^2 + \left( \frac{4\sqrt{3} - 5}{2} - \frac{\sqrt{3} - 1}{2} \right)^2
    = \left( \frac{4\sqrt{3} + 3}{2} \right)^2 + \left( \frac{3\sqrt{3} - 4}{2} \right)^2
    = \frac{48 + 24\sqrt{3} + 9}{4} + \frac{27 - 24\sqrt{3} + 16}{4}
    = \frac{100}{4} = 25 \qquad (6.14)
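The arithmetic of Equations 6.12 through 6.14 can be verified with a few lines of code (a sketch, not the author's calculation):

import math

# Numeric check of Equations 6.11-6.14: transform P1 and P2 with the
# translate-and-rotate of Equation 6.11 and confirm that the squared
# distance between them is 25 in both coordinate systems.
theta = math.radians(30.0)
x0, y0 = 2.0, 1.0

def to_primed(x, y):
    c, s = math.cos(theta), math.sin(theta)
    return c * (x - x0) + s * (y - y0), -s * (x - x0) + c * (y - y0)

P1, P2 = (3.0, 2.0), (7.0, 5.0)
P1p, P2p = to_primed(*P1), to_primed(*P2)
D2  = (P2[0] - P1[0]) ** 2 + (P2[1] - P1[1]) ** 2
D2p = (P2p[0] - P1p[0]) ** 2 + (P2p[1] - P1p[1]) ** 2
print(P1p, P2p)        # approximately (1.366, 0.366) and (6.330, 0.964)
print(D2, D2p)         # both 25, up to floating-point rounding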
Of course, one can simply look at Figure 6-2 and see that the value of D obviously does not depend on which coordinates are used. There are two points placed on a flat page, and we can overlay any number of Euclidean coordinate systems we like without changing where those points sit on the page or the distance between them. The invariance of D under a change from one of these coordinate systems to another is so obvious that it seems completely trivial. And it is, until we consider coordinate systems that are in relative motion. If the primed system S′ is moving relative to the unprimed system S with constant velocity v in the x direction, and if we use the Galilean transformation (Equation 6.1) to compute where the points are in the primed system at some instant t, then we will find that D is still invariant, because this situation is the same as in Figure 6-2 except that we have x − x0 − vt instead of just x − x0 in Equation 6.11, so that with the same points P1 and P2 as before, instead of Equation 6.12, we have

\text{P1: } x' = \frac{\sqrt{3} + 1}{2} - \frac{\sqrt{3}}{2}\,vt, \qquad y' = \frac{\sqrt{3} - 1}{2} + \frac{vt}{2}
\text{P2: } x' = \frac{5\sqrt{3} + 4}{2} - \frac{\sqrt{3}}{2}\,vt, \qquad y' = \frac{4\sqrt{3} - 5}{2} + \frac{vt}{2} \qquad (6.15)

and now given the cancellation of the "−vt" terms, D² in S′ looks just like D² in Equation 6.14. So D is invariant under a Galilean transformation. But that is of academic interest only, since Galilean transformations are not appropriate for situations requiring Special Relativity. We need to use the Lorentz Transformation, Equation 6.2, where we see that since we are considering only (x,y) coordinates of two points at the same time t, we need only the first two lines, and the second one is the same as in the Galilean transformation, Equation 6.1. So in Equation 6.11, we must replace x − x0 with γ(x − x0 − vt), and after performing that matrix-vector multiplication and plugging in all the same parameter values, we have
\text{P1: } x' = \frac{\sqrt{3}}{2}\,\gamma (1 - vt) + \frac{1}{2}, \qquad y' = \frac{\sqrt{3}}{2} - \frac{\gamma}{2} (1 - vt)
\text{P2: } x' = \frac{\sqrt{3}}{2}\,\gamma (5 - vt) + 2, \qquad y' = 2\sqrt{3} - \frac{\gamma}{2} (5 - vt) \qquad (6.16)
For illustration, we take v = c/2 as in Figure 6-2. This gives γ = 2/√3. Note that the terms containing vt will cancel when we subtract the coordinates of the two points. The squared distance in S′, which we must now distinguish from the D² in S by denoting it D′², becomes

D'^2 = \left[ \frac{\sqrt{3}}{2}\,\gamma (5 - vt) + 2 - \frac{\sqrt{3}}{2}\,\gamma (1 - vt) - \frac{1}{2} \right]^2
     + \left[ 2\sqrt{3} - \frac{\gamma}{2} (5 - vt) - \frac{\sqrt{3}}{2} + \frac{\gamma}{2} (1 - vt) \right]^2
     = \left( 2\sqrt{3}\,\gamma + \frac{3}{2} \right)^2 + \left( \frac{3\sqrt{3}}{2} - 2\gamma \right)^2
     = 16\gamma^2 + 9 = \frac{64}{3} + 9 = 30\tfrac{1}{3} \qquad (6.17)
So the distance between P1 and P2 in S′ must be larger than 5 in order for it to be 5 in S after Lorentz contraction. This distance is therefore not invariant under Lorentz transformation. If we seek invariant quantities in relativity theory, we must look beyond the distance between points fixed in space and the lengths of rigid rods.

When Einstein was a student at the Swiss Federal Institute of Technology in Zurich, one of his professors was Hermann Minkowski, who is on record as having evaluated Einstein as a "lazy dog" (e.g., Isaacson, 2007) because of Einstein's nonchalance regarding mathematics and class attendance. Not many years later, however, Minkowski became one of the first scientists to recognize the value of Special Relativity and was motivated to give it what he considered a firmer mathematical foundation. In the process he effectively advocated the concept of "spacetime" as a single unified medium, and he established what has become known as the Minkowski metric, a metric that describes a truly invariant quantity in Special Relativity, the spacetime interval Δs whose square is given by
(\Delta s)^2 = c^2 (\Delta t)^2 - (\Delta x)^2 - (\Delta y)^2 - (\Delta z)^2 \qquad (6.18)
To see the invariance of this quantity under Lorentz transformation, we must look again at the
intervals used to compute the velocity components of Equation 6.4. In the S system, we have simply
x x 2 x1 y y2 y1 z z2 z1
(6.19)
t t 2 t 1 These four intervals are used to form the displacement vector between two points in spacetime,
$$\mathbf{D} = \left(c\,\Delta t,\ \Delta x,\ \Delta y,\ \Delta z\right) \tag{6.20}$$
This is called a four-vector, and its length is not conserved under Lorentz transformation, just as we found above for the ordinary displacement magnitudes (D′ ≠ D). A relativistic four-vector is defined such that its components transform from one inertial system to another according to the Lorentz transformation, which we will now employ. To get the four-vector components in S′ corresponding to those in Equation 6.20, we operate on the four right-hand sides of Equation 6.19 with the Lorentz transformations in Equation 6.2. Since the transformation equations are linear in the unprimed coordinates, this amounts to
$$\begin{aligned} \Delta x' &= \gamma\left[(x_2 - x_1) - v\,(t_2 - t_1)\right] = \gamma\,(\Delta x - v\,\Delta t) \\ \Delta y' &= y_2 - y_1 = \Delta y \\ \Delta z' &= z_2 - z_1 = \Delta z \\ \Delta t' &= \gamma\left[(t_2 - t_1) - \tfrac{v}{c^2}\,(x_2 - x_1)\right] = \gamma\!\left(\Delta t - \tfrac{v\,\Delta x}{c^2}\right) \end{aligned} \tag{6.21}$$

Using these to compute the spacetime interval Δs′ in S′,
$$\begin{aligned} \Delta s'^2 &= c^2\,\Delta t'^2 - \Delta x'^2 - \Delta y'^2 - \Delta z'^2 \\ &= \gamma^2\!\left(c\,\Delta t - \tfrac{v\,\Delta x}{c}\right)^{\!2} - \gamma^2\,(\Delta x - v\,\Delta t)^2 - \Delta y^2 - \Delta z^2 \end{aligned} \tag{6.22}$$

Substituting the definition of γ (Equation 6.3),

$$\begin{aligned} \Delta s'^2 &= \frac{\left(c\,\Delta t - \tfrac{v\,\Delta x}{c}\right)^{2}}{1 - \tfrac{v^2}{c^2}} - \frac{(\Delta x - v\,\Delta t)^2}{1 - \tfrac{v^2}{c^2}} - \Delta y^2 - \Delta z^2 \\ &= \frac{c^2\Delta t^2 - 2v\,\Delta x\,\Delta t + \tfrac{v^2}{c^2}\Delta x^2}{1 - \tfrac{v^2}{c^2}} - \frac{\Delta x^2 - 2v\,\Delta x\,\Delta t + v^2\Delta t^2}{1 - \tfrac{v^2}{c^2}} - \Delta y^2 - \Delta z^2 \end{aligned} \tag{6.23}$$
Canceling the 2vΔxΔt terms and rearranging,
$$\begin{aligned} \Delta s'^2 &= \frac{c^2\Delta t^2 - v^2\Delta t^2 - \Delta x^2 + \tfrac{v^2}{c^2}\Delta x^2}{1 - \tfrac{v^2}{c^2}} - \Delta y^2 - \Delta z^2 \\ &= \frac{c^2\Delta t^2\left(1 - \tfrac{v^2}{c^2}\right) - \Delta x^2\left(1 - \tfrac{v^2}{c^2}\right)}{1 - \tfrac{v^2}{c^2}} - \Delta y^2 - \Delta z^2 \\ &= c^2\Delta t^2 - \Delta x^2 - \Delta y^2 - \Delta z^2 \end{aligned} \tag{6.24}$$
So the right-hand side is the same as that in Equation 6.18, and we have Δs′ = Δs. The spacetime interval is invariant under Lorentz transformation. This result is more general than we have shown. It applies even when the direction of motion is not parallel to the x axis (the proof is straightforward but not quite as simple algebraically). It is also more important than it might first appear. It tells us something about the physical relationship between two events in spacetime, where any given event is characterized by a single point (ct,x,y,z). Assuming that not all the intervals are zero, then when Δs = 0, we have c²Δt² = Δx² + Δy² + Δz², which means that the "time" part of the spacetime displacement is equal to the "space" part, and so the two events can be causally connected only by an influence that travels at the speed of light. Each event must be located on the other event's light cone, also called its null cone because the spacetime interval is zero. This is the locus of all events past or future such that anything present at both events must have traveled from one to the other at the speed of light in vacuum. Because the interval is invariant, if two events are connected at the speed of light in one inertial reference frame, they are connected at the speed of light in all inertial reference frames.

If Δs² > 0, then we have c²Δt² > Δx² + Δy² + Δz², and the "time" part of the spacetime displacement is greater than the "space" part. This means that even though a massive particle cannot move at the speed of light, it could make the trip from the earlier event to the later, and therefore causal connections are again possible, this time with something other than photons. This kind of interval is called timelike. The opposite is true if Δs² < 0, which means that Δs is purely an imaginary number. In this case, causal connections between the two events are not possible in classical physics (despite its revolutionary
nature, relativity theory is nevertheless considered classical physics because it is independent of Quantum Mechanics), and the interval is called spacelike.
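For readers who like to see such results numerically, the following Python sketch (not from the text; the event coordinates are arbitrary) boosts a pair of events along x, confirms that the interval of Equation 6.18 comes out the same for every boost speed, and classifies the interval as null, timelike, or spacelike.

import math

def boost_x(ct, x, y, z, beta):
    """Lorentz boost along x with speed v = beta*c, acting on (ct, x, y, z)."""
    gamma = 1.0 / math.sqrt(1.0 - beta**2)
    return (gamma * (ct - beta * x), gamma * (x - beta * ct), y, z)

def interval_sq(e1, e2):
    """Squared interval c^2 dt^2 - dx^2 - dy^2 - dz^2, signature (+ - - -)."""
    dct, dx, dy, dz = (a - b for a, b in zip(e2, e1))
    return dct**2 - dx**2 - dy**2 - dz**2

def classify(ds2, tol=1e-9):
    if abs(ds2) < tol:
        return "null (lightlike)"
    return "timelike" if ds2 > 0 else "spacelike"

# Two illustrative events in units where c = 1 (coordinates are arbitrary).
e1 = (0.0, 0.0, 0.0, 0.0)
e2 = (5.0, 3.0, 1.0, 0.0)

for beta in (0.0, 0.5, 0.9):
    b1, b2 = boost_x(*e1, beta), boost_x(*e2, beta)
    ds2 = interval_sq(b1, b2)
    print(f"beta={beta}: ds^2 = {ds2:+.6f} ({classify(ds2)})")
# The printed ds^2 is the same for every beta, as Equation 6.24 requires.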
Figure 6-3. Segments of the past and future light cones of a point P1 in spacetime with the z dimension suppressed. The point P2a is on the past light cone; P2b is inside the future light cone; P2c is outside the light cones.
Figure 6-3 illustrates the displacements corresponding to these three kinds of spacetime interval as seen in one particular inertial system. Segments of the past and future light cones of point P1 are shown with the z dimension suppressed in order to be able to project a three-dimensional space into the page. Three cases are shown for point P2: (a.) P2a is on the past light cone of P1, and so the spacetime interval connecting these points is null, i.e., only light travels fast enough to get from P2a to P1; (b.) P2b is inside the future light cone of P1, and so this interval is timelike, and a massive object moving at the appropriate (necessarily subluminal) speed could be present at both events; (c.) P2c lies outside the light cones, and so not even light is fast enough to get from P1 to P2c or vice versa, and the interval is spacelike. Although these three displacement four-vectors are generally different in different inertial systems, their corresponding three spacetime intervals will be of the same types discussed above relative to P1's light cones in all inertial reference frames because of the invariance of the spacetime interval under Lorentz transformation. Thus, according to Special Relativity, the possibility of causal connection between two events does not depend on the choice of inertial reference frame, hence the challenges posed by the quantum entanglement discussed in sections 5.11 through 5.14 above, wherein a causal effect appears to connect two events for which the spacetime interval is spacelike, allowing repercussions that threaten causality violations such as creating inconsistent closed-loop spacetime paths (see section 5.14 for a possible mechanism by which such a situation could arise).
We can make the coordinate intervals in Equation 6.18 as small as we like, so that the spacetime interval may be written in differential form as
$$ds^2 = d(ct)^2 - dx^2 - dy^2 - dz^2 \tag{6.25}$$
and since we are taking the spacetime coordinate convention (x⁰, x¹, x², x³) ≡ (ct, x, y, z), this can be written in the form of Equation 6.10,
$$ds^2 = g_{\mu\nu}\,dx^{\mu}\,dx^{\nu} \tag{6.26}$$
with the metric tensor given by
$$g_{\mu\nu} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix} \tag{6.27}$$
This is known as the Minkowski metric with signature (+ − − −). Other conventions are common, e.g., the entire matrix above multiplied by −1, hence signature (− + + +), the c factor in ct absorbed into the metric and/or defined to be unity, the ct component placed in the last position instead of position 0 (with the numbering changed from 0..3 to 1..4), etc. All that is required is that the convention be applied consistently. Here we are just interested in the concept of a metric. Note that the (implied) summation in Equation 6.26 exclusively involves terms that are quadratic in the coordinate differentials. This is a key part of the definition of a Riemannian metric. The other part is that Riemannian metrics are positive definite. But the determinant of the Minkowski metric above is negative, and because spacelike intervals arise in relativity theory (e.g., Equation 6.26 can result in a negative number), the metric is called pseudo-Riemannian, which means that it is not required to be positive definite. With this incorporated, the metric defined by the form of Equation 6.26 is as general as we will ever need in the context of a space that is locally pseudo-Euclidean. In what follows, references to Riemannian metrics are meant to include pseudo-Riemannian metrics, and after this brief discussion of the Minkowski metric, we will not stress the difference again until section 6.9.

The Minkowski metric established the pattern for the Lorentzian metrics encountered in General Relativity, which include effects of curvature but remain based on the spacetime interval. To appreciate the opposite signs of the time-related coordinate and the space-related coordinates in relativistic metrics, there is a barrier that must be overcome: the mental habit of instinctively using Euclidean geometry when thinking of the absolute distance separating two points in spacetime. It is natural to expect that as we integrate the distance along a path, each increment in a coordinate makes a positive contribution to the integrated absolute distance, not some positive and some negative contributions. As we will see below, spacetime is described by a Lorentzian metric in general, and so the Euclidean always-positive contributions of each coordinate increment to absolute distance no longer apply. In curved spacetime, the metric in general has nonzero off-diagonal elements, so cross terms in coordinate increments may make positive or negative contributions to the path length. But even in Special Relativity, despite having only zero off-diagonal elements, the Minkowski
metric nevertheless is not Euclidean. The geometry may be flat, but it is not Euclidean/Galilean, it is Lorentzian, and that couples space and time in a way that challenges the intuition. Note for example that Δx′ in Equation 6.21 reverses sign for vΔt > Δx, a situation that can occur in practice. For example, assuming a timelike interval for illustration, Δx is not some distance between two points at a single instant of time, it is a distance between two points that occur on a path that may be traveled by a particle that is at each endpoint at different times separated by Δt, and as we have seen, both spatial and temporal separations generally depend on which reference frame is used to observe them. Given that a positive spatial distance in one system can be a negative one in a different system, one must ponder more deeply the implications of how Special Relativity describes the real behavior of real physical situations as observed in differently moving coordinate systems. Given an object at rest in an inertial system, that object's geodesic between two events has Δx = Δy = Δz = 0, so that Δs = cΔt, and the interval between the two endpoints is just the Galilean expectation for the distance between two events separated only by time. But our Galilean intuition expects that distance to become larger, not smaller, if we introduce Δx > 0 in the same inertial frame. Nevertheless, Δs becomes smaller. The reason is that with both Δx and Δt greater than zero, the object is no longer at rest in this inertial reference frame, we now have Δx/Δt = v > 0, and the Lorentzian nature of the spacetime comes into play, with its length contraction, time dilation, and mass increase. The Galilean expectation that distance will increase if Δx does is an illusion in this Lorentzian situation.

When we ask ourselves what is "real" in a milieu of phenomena wherein most things change appearance when observed from different viewpoints, there is some appeal to the notion that what is real is what is invariant. Not that what is observed is decoupled from what is real, it may well have a real phenomenon underlying the observation, but its appearance may be altered by observer-specific effects, and so to discover what is real, we must remove those effects in order to arrive at what is invariant, since that is true for all observers.

The difference between the role played by a time interval and a space interval is especially difficult to digest. Einstein originally used coordinates (ict,x,y,z) to provide an equivalent approach, where i = √−1, since this yields the same spacetime interval as the Minkowski metric (with the signs reversed down the diagonal, but that convention is also used). This fit in with the fact that the nature of time is naturally mysterious and perhaps has some imaginary aspect, but before long this approach fell by the wayside except for certain special problems such as those that will be mentioned in sections 6.7 and 6.9. For one thing, why not use (ct,ix,iy,iz)? Both signatures, (+ − − −) and (− + + +), have been found useful in different contexts (e.g., particle physics versus cosmology), and obtaining the negative contributions to the spacetime interval was judged by most physicists to be more conveniently done directly from within the metric itself.
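As a small illustration of the convention point just made, the Python lines below (an addition for this discussion, with an arbitrary displacement) contract the same displacement with the two diagonal metrics; the results differ only by an overall sign, so the timelike/spacelike classification is unaffected as long as one convention is used consistently.

# Diagonals of the Minkowski metric in the two common sign conventions.
eta_pm = [ 1.0, -1.0, -1.0, -1.0]   # signature (+ - - -), as in Equation 6.27
eta_mp = [-1.0,  1.0,  1.0,  1.0]   # signature (- + + +)

d = [2.0, 1.0, 0.5, 0.0]            # an arbitrary displacement (c*dt, dx, dy, dz)

ds2_pm = sum(g * x * x for g, x in zip(eta_pm, d))
ds2_mp = sum(g * x * x for g, x in zip(eta_mp, d))
print(ds2_pm, ds2_mp)               # 2.75 and -2.75: same magnitude, opposite sign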
Although a number of theoretical interpretations of the fundamental nature of time have been put forth, for most people it retains its mysterious nature, and for now we seem best off simply accepting that it plays a special role in computing spacetime distances and hoping that a Quantum Gravity Theory will remove the confusion. In relativistic cases, to arrive at true event separations, we must integrate along a path using the metric that honors the only definition of an invariant spacetime interval. Given the Cartesian coordinates used to express the Minkowski metric, along with the flat nature of the space, its off-diagonal elements are all zero. The same is true in the two-dimensional space of Figure 6-2. If we use Equation 6.26 to integrate (ds)² in that space, the cross-terms dx¹dx² and dx²dx¹ make no contribution because the off-diagonal elements of the metric tensor are zero. This results not only from the plane being flat but also because we chose to use orthogonal coordinates. The latter
is a natural way to discuss the Pythagorean Theorem, since that involves a right triangle, but it is not a requirement for generally representing a point in flat space. If for some reason we chose to do so, we could use oblique coordinates such as those shown in Figure 6-4, in which the x and y axes meet at some angle θ ≠ 90°. The point at the end of the line ds from the origin has coordinates dx and dy. These form an obtuse triangle with sides dx, dy, and ds, with ds opposite an obtuse angle φ = 180° − θ. The right triangle with a base of length A and a height of length B has a hypotenuse of length ds, so the Pythagorean Theorem says that ds² = A² + B², and we could derive expressions for A and B as functions of dx, dy, and θ, but the law of cosines gives us immediately
$$ds^2 = dx^2 + dy^2 + 2\,dx\,dy\,\cos\theta \tag{6.28}$$
This can be represented in the form of Equation 6.26 with the metric tensor given by
$$g_{\mu\nu} = \begin{pmatrix} 1 & \cos\theta \\ \cos\theta & 1 \end{pmatrix} \tag{6.29}$$
Figure 6-4. A differential line element ds represented in Euclidean oblique coordinates.
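A quick numerical check of Equations 6.28 and 6.29 can be done in a few lines of Python (an addition for illustration; the angle and displacement values are arbitrary). The metric contraction of Equation 6.26 with the oblique metric reproduces the ordinary Euclidean length of the same displacement, which is the sense in which the space remains flat.

import math

def ds2_from_metric(g, d):
    """Contract ds^2 = g_{mu nu} dx^mu dx^nu for a small displacement d."""
    return sum(g[i][j] * d[i] * d[j] for i in range(len(d)) for j in range(len(d)))

theta = math.radians(60.0)            # oblique axes meeting at 60 degrees (arbitrary)
g_oblique = [[1.0, math.cos(theta)],
             [math.cos(theta), 1.0]]  # the metric of Equation 6.29

dx, dy = 0.3, 0.4                     # oblique components of a small displacement (arbitrary)

# Metric computation (Equation 6.26 with the metric of Equation 6.29)
ds2_metric = ds2_from_metric(g_oblique, (dx, dy))

# Direct Euclidean check: express the same displacement on orthonormal axes
e1 = (1.0, 0.0)
e2 = (math.cos(theta), math.sin(theta))
vec = (dx * e1[0] + dy * e2[0], dx * e1[1] + dy * e2[1])
ds2_euclid = vec[0]**2 + vec[1]**2

print(ds2_metric, ds2_euclid)         # the two agree: the space is still flat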
So the metric tensor can have nonzero off-diagonal elements even in a flat space. One may argue that oblique coordinates essentially "squeeze" the space, but it is still flat. It frequently happens that undesirable effects are caused by the choice of coordinates, e.g., singularities at the poles of spherical coordinates, and this is one of the reasons why the ability to transform between various systems is essential. The spirit of General Relativity includes keeping the physics independent of the coordinates, and the physics of gravity will produce off-diagonal elements in the metric tensor without a need to
introduce them artificially. We will look closer below at how that comes about, but first we need to pass from the notion of an inertial reference frame as it exists in Special Relativity to one that will work in General Relativity. In Special Relativity, an inertial reference frame is by definition unaccelerated. But the Equivalence Principle says that gravity and acceleration are two faces of the same coin, and since there is matter all around in the real world, there must be either gravity or acceleration or both all around. The globally flat nature of an inertial reference frame in Special Relativity places no boundaries on that frame; it extends to infinity in all directions. But as we saw above, acceleration alters the very nature of the coordinate system used to define the reference frame. We therefore have to restrict the domain of our inertial reference frame to the locally (pseudo-)Euclidean extent allowed by the (pseudo-)Riemannian manifold, leaving us with only a locally inertial reference frame. Within this context, we can employ the Pythagorean theorem to compute length along a geodesic by integrating Equation 6.26 with the appropriate metric tensor gμν, but in order to extend the notion of path length along the geodesic for arbitrarily large distances, we must use the full metric tensor (i.e., include any nonzero off-diagonal elements) and treat it as a function of position as we integrate. In other words, we need to be able to use differential geometry within a curved space, hence the need for all the caution. The Minkowski metric (Equation 6.27) looks exactly like an ordinary diagonal matrix. That is because the flat space it describes is simple enough not to need the full capabilities of a tensor. But our ultimate concern here is the nature of gravity, whose strength at any given location is known to depend on the distribution of the mass responsible for the gravitational field, i.e., the mass density distribution. But this density distribution in one inertial frame is not the same as in another relatively moving inertial frame because of the Lorentz contraction and relativistic mass increase. In fact, relative to an observer in any one inertial frame, a continuous fluid may have some parts moving with some velocities and other parts moving with other velocities, so that all these parts have different Lorentz contractions and mass increases affecting their densities, with different results in different inertial frames. We want to describe the density distribution and its effects on the spatial curvature in a way that is formally the same for all observers in all reference frames, and for that we need the more general capabilities of a tensor. So we see that density variation produces gravitational variation that is equivalent to acceleration variation that causes spatial curvature variation. The general result of this is that a rigid body moving through the space (i.e., being present at various events along the geodesic at increasing values of the time coordinate) experiences tidal effects as the curvature varies along its path. These effects amount to stresses like those in a deformed elastic medium wherein a force vector in a given direction generally causes pressure variations not only in the direction of the force but also in other directions. Such stresses are best expressed mathematically in the form of a tensor, because that structure provides an organized way to represent each aspect. 
In N dimensions, a force in a given direction can generally produce stresses in all N directions, and the stresses in directions other than the force direction populate the off-diagonal elements of the stress tensor. So far this just requires an N×N matrix. The tensor quality enters because we need to keep track of how such matrices (and vectors) transform from one coordinate system to another. There are two ways that a tensor can transform; these are known as covariant and contravariant transformations, and both play roles in differential geometry. Let us consider once again Figure 6-2: the points P1 and P2 are located at certain positions on the page that do not change as we represent them in different coordinate systems. Equation 6.11 shows how we compute the primed coordinates (dashed coordinate
axes) from the unprimed ones (solid coordinate axes). Let us return to the superscript-indexed coordinate notation and account for the offsets explicitly:
$$x^1 = x - x_0, \qquad x^2 = y - y_0, \qquad x'^1 = x', \qquad x'^2 = y' \tag{6.30}$$

Then Equation 6.11 can be expanded to
$$\begin{aligned} x'^1 &= \cos\theta\; x^1 + \sin\theta\; x^2 \\ x'^2 &= -\sin\theta\; x^1 + \cos\theta\; x^2 \end{aligned} \tag{6.31}$$
Taking the partial derivatives of the primed coordinates with respect to the unprimed coordinates,
$$\frac{\partial x'^1}{\partial x^1} = \cos\theta, \qquad \frac{\partial x'^1}{\partial x^2} = \sin\theta, \qquad \frac{\partial x'^2}{\partial x^1} = -\sin\theta, \qquad \frac{\partial x'^2}{\partial x^2} = \cos\theta \tag{6.32}$$

so that both lines of Equation 6.31 can be written, using the Einstein summation convention,
$$x'^{\mu} = \frac{\partial x'^{\mu}}{\partial x^{\nu}}\; x^{\nu} \tag{6.33}$$
where μ and ν both take on values of 1 and 2, and since ν occurs twice in the term on the right-hand side, summation over both of its values is implied. Note that this equation works in this situation because the partial derivatives are not functions of the unprimed coordinates and because we subtracted off the zero points x0 and y0 in Equation 6.30. In the more general case, we need the differential form:
$$dx'^{\mu} = \frac{\partial x'^{\mu}}{\partial x^{\nu}}\; dx^{\nu} \tag{6.34}$$
Equation 6.34 shows contravariant transformation, and the primed and unprimed vectors are called contravariant vectors (or contravariant tensors of order 1). For contrast, covariant transformation of a vector with components A_ν has the form
$$A'_{\mu} = \frac{\partial x^{\nu}}{\partial x'^{\mu}}\; A_{\nu} \tag{6.35}$$
i.e., the partial derivatives have the primed coordinate in the denominator, and the vectors are called covariant vectors (or covariant tensors of order 1, or just covectors), for which subscript indexes are traditional, as opposed to superscript indexes for contravariant tensors. Note that the coordinate differentials retain superscript indexes; their transformation properties have not changed. The coordinates xμ are given superscript indexes primarily to be consistent with the coordinate differentials. There also exist mixed tensors (e.g., products of tensors of different types) that have both subscript and superscript indexes, and there are operations performed on tensors in which they are contracted, as discussed below. Our scope prevents us from delving deeply into the sort of detail that requires these operations, however. The nature of a contravariant transformation is for the transformed coordinates to change in a way that is contrary to the way the coordinate system changes. In Figure 6-2, going from the unprimed coordinates (solid axes) to the primed coordinates (dashed axes) involves changing the coordinate system to one whose origin is off to the right and up relative to the starting system. As a result, the coordinates of the two points move to the left and down in the new system. The new axes are rotated relative to the old axes, and so the points rotate in the opposite direction in the new system. The points remain where they are on the page by taking on new coordinates that cancel the changes in the coordinate system used to represent them. If instead of leaving the points where they are, we wished to represent them as moving in the original coordinate system, then we would need to transform their coordinates in a covariant way, which is characterized by the transformed entity changing with the transformation instead of against it. One of the most important covariant objects in vector calculus is the gradient, the result of applying a vector differential operator to a scalar field. The gradient operator is the higher-dimensional generalization of the ordinary one-dimensional differential operator. To help visualize this, Figure 6-5 illustrates the distribution of a scalar variable over two dimensions described by Cartesian coordinates. We will take this scalar field to be a mass density distribution ρ(x,y) for concreteness and relevance to General Relativity. As shown in Figure 6-5A, this distribution varies over the part of the xy plane plotted. Many applications in physics require the slope of the surface of such distributions at various locations. For illustration, a point has been arbitrarily selected at x = 1 and y = 2 for evaluating the gradient; this is indicated by the white circle. The slope of a differentiable function of one independent variable is straightforward to picture mentally. For a given point on the abscissa, there is a straight line lying in the function plane and tangent to the function. The slope of the function at that point is equal to the slope of the tangent line, i.e., the ratio of the change in the ordinate of the straight line divided by the corresponding change in the independent variable. The slope of the line is given by the first derivative of the function with respect to the independent variable evaluated at the given abscissa value. But when there are two or more independent variables, there is no longer a single “function plane”, and a tangent line is no longer uniquely determined. 
On the surface shown in Figure 6-5A, at any given point there is a tangent plane within which a straight line can be drawn through the tangent point in any desired direction, so there is no unique straight line. But we can define two orthogonal planes that each contain the tangent point and are parallel to the function axis and one independent coordinate axis. In the example, these planes
Figure 6-5 A. 3-D plot of density ρ as a function of (x,y) position; the white circle is where the gradient is evaluated for illustration at the point x = 1, y = 2. B. Contour plot of the density distribution in A, with gradient vectors superimposed at sample locations; the black circle is where the gradient is evaluated for illustration at the point x = 1, y = 2. C. Slice through ρ(x,y) at y = 2; the thick straight line is tangent to ρ(x,2) at x = 1 with a slope of about -0.004, the x component of the gradient vector at ρ(1,2). D. Slice through ρ(x,y) at x = 1 (note that the vertical scale is different from C); the thick straight line is tangent to ρ(1,y) at y = 2 with a slope of about 0.011, the y component of the gradient vector at ρ(1,2).
are the ρx plane and the ρy plane that contain the tangent point. The tangent plane's intersections with these planes are straight lines. The gradient vector's components are the slopes of each of these lines relative to the corresponding independent-variable coordinate of the corresponding function plane. The gradient is therefore a vector with N components for functions of N independent variables. Figure 6-5B shows a set of contours for ρ(x,y) with the gradient vectors sampled uniformly over the space and superimposed as arrows whose lengths are proportional to the magnitudes of the gradient vectors and with arrowheads on the ascending end. The point where we will evaluate the gradient is the center of the black circle. Figure 6-5C shows the ρx slice through the distribution at the y value of 2. This function plane looks just like the case of a single independent variable. The tangent line at x = 1 and y = 2 is shown as the thick straight black line. To aid in visualization, this line is extended well beyond any locally Euclidean neighborhood of ρ(x,y). The slope of that line is about -0.004, which is consistent with the nearby gradient vectors in Figure 6-5B, which have a small negative x component. The situation in the ρy slice at x = 1 and y = 2 is shown in Figure 6-5D (whose vertical scale is compressed relative to that of Figure 6-5C in order to fit in the same space on the page). The slope of that tangent line is about 0.011, which is consistent with the nearby gradient vectors with significant positive y components. So the sample gradient vector has components of approximately (-0.004, 0.011). The formal definition of this two-dimensional gradient vector is
$$\nabla\rho = \left(\frac{\partial \rho}{\partial x},\ \frac{\partial \rho}{\partial y}\right) \tag{6.36}$$
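As a concrete illustration, the short Python sketch below estimates the two components of Equation 6.36 by finite differences. The density function used here is an invented stand-in, not the distribution plotted in Figure 6-5, so its numerical slopes will differ from the -0.004 and 0.011 quoted above.

import math

def rho(x, y):
    # Arbitrary smooth stand-in for a density distribution like that of Figure 6-5.
    return 0.05 * math.exp(-((x - 2.0)**2 + (y - 3.0)**2) / 8.0)

def gradient(f, x, y, h=1e-6):
    """Two-sided finite-difference estimate of (df/dx, df/dy) at (x, y)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return dfdx, dfdy

print(gradient(rho, 1.0, 2.0))   # the slopes of the two tangent lines at (1, 2)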
For functions of N independent variables, the tangent plane described above is generalized to an N-dimensional tangent space, and there are N two-dimensional subspaces formed by the function axis and one abscissa axis which the tangent space intersects as a straight line at the tangent point. The slopes of the lines in these N two-dimensional subspaces are the components of the N-dimensional gradient vector. We referred above to the gradient as a covariant vector. To see this, we examine how it transforms from one coordinate system to another. We will go back to the notation in which (x,y) is represented by (x¹,x²), or in general x^ν, ν = 1, 2, and this notation will extend directly into higher dimensions. Then the components of the gradient vector in Equation 6.36 can be represented by ∂ρ/∂x^ν. Since the primed coordinates to which we will transform the gradient are invertible functions of the unprimed coordinates (e.g., those in Equation 6.33), all the transformations we need in General Relativity for our purposes have inverses, so that we may treat the unprimed coordinates as functions of the primed coordinates. Then by the chain rule, we have
$$\frac{\partial \rho}{\partial x'^{\mu}} = \frac{\partial \rho}{\partial x^{\nu}}\,\frac{\partial x^{\nu}}{\partial x'^{\mu}} \tag{6.37}$$
so that ∂ρ/∂x^ν transforms in the same way as A_ν in Equation 6.35, hence the gradient is a covariant vector. One thing that tensors have in common with ordinary scalars, vectors, and matrices is that they may be subjected to arithmetic operations. Addition is straightforward, but for tensors of order greater than zero, multiplication takes several forms. For example, there is inner multiplication, e.g., the vector
dot product, which produces a scalar and hence reduces two tensors of order one to a tensor of order zero. The inner product of a covariant vector A and a contravariant vector B is
$$\mathbf{A}\cdot\mathbf{B} = \begin{pmatrix} A_1 & A_2 & \cdots & A_N \end{pmatrix} \begin{pmatrix} B^1 \\ B^2 \\ \vdots \\ B^N \end{pmatrix} = A_1 B^1 + \cdots + A_N B^N = A_{\mu} B^{\mu} = C \tag{6.38}$$
where we emphasize the Einstein summation convention one last time. We also note that we took the inner product of a covariant and contravariant vector because this combination yields the result analogous to ordinary dot products of vectors defined in flat spaces. In curved spaces, taking an algebraically similar product of two covariant or two contravariant vectors generally yields different results, because the way these vectors relate to the curved space is different. The two spaces in which covariant and contravariant vectors respectively are defined form a dual space analogous to the bra and ket vector spaces of Dirac notation (see Appendix K) in which dot products are taken between a vector in one of the spaces and a vector in the other (for different reasons involving complex conjugates, but that need not concern us here). There are ways to get around this inner-product limitation by raising a covariant index or lowering a contravariant index, as discussed below for tensor contraction. There is also outer multiplication. For the same two vectors, the outer product is
$$\mathbf{A}\otimes\mathbf{B} = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_N \end{pmatrix} \begin{pmatrix} B^1 & B^2 & \cdots & B^N \end{pmatrix} = \begin{pmatrix} A_1 B^1 & A_1 B^2 & \cdots & A_1 B^N \\ A_2 B^1 & A_2 B^2 & \cdots & A_2 B^N \\ \vdots & \vdots & \ddots & \vdots \\ A_N B^1 & A_N B^2 & \cdots & A_N B^N \end{pmatrix} = A_{\mu} B^{\nu} = C_{\mu}{}^{\nu} \tag{6.39}$$
where A_μ B^ν and C_μ^ν on the right indicate both individual elements (specific values of the indexes) and the complete tensor (indexes running over all possible values); the appropriate interpretation is intended to be apparent from the context. The difference between inner and outer multiplication is revealed in the nature of the indexes that appear, and so we will not continue to use the symbols "·" and "⊗"; A_μB^ν is an outer product, and A_μB^μ is an inner product. In the example above, we used covariant and contravariant vectors, but vectors of the same type follow the same rules for the outer product; we have A_μB_ν = C_μν, and we can mix orders, e.g.,

$$A_{\mu\nu}\,B^{\alpha} = C_{\mu\nu}{}^{\alpha}, \qquad A^{\mu}\,B_{\nu} = C^{\mu}{}_{\nu} \tag{6.40}$$
Mixed tensors transform in a mixed way according to their covariant and contravariant indexes. For example, the tensor in the first line above:
$$C'_{\mu\nu}{}^{\alpha} = \frac{\partial x^{\rho}}{\partial x'^{\mu}}\,\frac{\partial x^{\sigma}}{\partial x'^{\nu}}\,\frac{\partial x'^{\alpha}}{\partial x^{\lambda}}\; C_{\rho\sigma}{}^{\lambda} \tag{6.41}$$
A mixed tensor of at least second order can be contracted by making one of its covariant indexes and one of its contravariant indexes the same. According to the Einstein convention, we sum over such indexes. Consider the tensor Cμν in Equation 6.39; this is the outer product of the tensors Aμ and Bν. If we contract it by replacing ν with μ, then its elements change from AμBν to AμBμ, which is the same as in Equation 6.38, the inner product of the two tensors. Thus contracting a tensor is the same as taking the inner product of two of its first-order tensor components of opposite type, reducing the total order by two, one less covariant index and one less contravariant index. It may not be immediately obvious what happens when we contract a tensor of order greater than two. To visualize this, we consider again a tensor similar to the one in the first line of Equation 6.40, except to clarify the role of each component, we will expand the second-order tensor to show its origin as an outer product of two vectors, and in order to track the vectors through the process of multiplication and contraction, we will change the naming convention such that the three original vectors are Aμ, Bν, and Cα, with their outer product being AμBνCα. To keep the visualization as simple as possible, we will use two-dimensional vectors. The three vectors are illustrated in Figure 6-6A with their components enclosed in circles and each pair of components connected with a line. Each pair is laid out in a different direction to aid in following their assembly into the array structures symbolizing the outer products. These directions have nothing to do with the actual orientations of the vectors, which are determined by the values of the components as usual.
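The outer products and the contraction about to be illustrated in Figure 6-6 and Equation 6.42 can also be carried out directly with numpy's einsum, as in the sketch below (an illustration added here, with arbitrary component values; plain arrays do not keep track of covariant versus contravariant placement, which is harmless in this flat two-dimensional example).

import numpy as np

# Two-dimensional illustration, as in Figure 6-6: three vectors A, B, C.
A = np.array([1.0, 2.0])
B = np.array([3.0, 4.0])
C = np.array([5.0, 6.0])

outer_AB = np.einsum('i,j->ij', A, B)         # second-order tensor A_mu B_nu (a 2x2 square)
outer_ABC = np.einsum('i,j,k->ijk', A, B, C)  # third-order tensor A_mu B_nu C_alpha (a 2x2x2 cube)

# Contract on the last two indexes (set alpha = nu and sum), as in Equation 6.42:
contracted = np.einsum('ijj->i', outer_ABC)   # first-order tensor K * A_mu

K = np.dot(B, C)                              # K = B_1 C^1 + B_2 C^2
print(contracted, K * A)                      # identical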
Figure 6-6 A. Three vectors to be multiplied. B. Outer product of the first two vectors. C. Outer product of all three vectors. D. The result of contracting the outer product of all three vectors on the ν and α indexes by replacing α with ν and summing over the repeated index.
The first two vectors are multiplied exactly as in Equation 6.39 except for two covariant vectors in this illustration. The resulting second-order tensor A_μB_ν is illustrated as a square in Figure 6-6B. The result of taking the outer product of this tensor with the vector C_α to form the third-order tensor A_μB_νC_α is illustrated as a cube in Figure 6-6C. The result of contracting this tensor on the ν and α indexes is illustrated in Figure 6-6D. This contraction, A_μB_νC^ν, must be summed over the repeated ν indexes:
$$A_{\mu} B_{\nu} C^{\nu} = A_{\mu} \sum_{\nu=1}^{2} B_{\nu} C^{\nu} = A_{\mu}\left(B_1 C^1 + B_2 C^2\right) \tag{6.42}$$
Since we do not sum over the μ index, the right-hand side represents two vector components. The third-order tensor A_μB_νC_α has been contracted to a first-order tensor KA_μ, a vector with components (KA_1, KA_2), where K ≡ B_1C^1 + B_2C^2. If we wish to contract a pure (i.e., not mixed) tensor, for example A_μν, then we must first raise one of the covariant indexes, say μ (the result does not depend on which). This is done by taking the product with the inverse of the metric tensor. As described briefly below, the inverse of the covariant metric tensor g_μν is the contravariant tensor g^μν. We raise the μ index by computing g^μλ A_λν = A^μ_ν, and then we can contract by replacing ν with μ. Similarly, we can lower a contravariant index by computing g_μλ A^λν = A_μ^ν. The importance of contracting tensors will be seen below.

Now we turn to the last tensor operation for which we need to develop some intuition: differentiation. Matter density in General Relativity, like most distributions, is treated as a function of spacetime, i.e., all four dimensions, and so it is defined not only at a given spatial position (x¹,x²,x³) but also at each event (x⁰,x¹,x²,x³). The proper density of a continuous mass density distribution at a given event is the density that would be measured by an observer at rest with respect to the matter at that event. For other observers, the proper density must be transformed to their coordinate systems, and as mentioned above, this generally involves not only ordinary geometric transformation but also length contraction, time dilation, and relativistic mass increase for systems in relative motion. But unlike the density distribution shown in Figure 6-5, which is a scalar field with coordinates defined in a flat space, fields in General Relativity must be treated as functions of curved spacetime. The two-dimensional curved surface in Figure 6-5 is embedded in a flat higher-dimensional space and exhibits extrinsic curvature that can be seen within that flat space. In General Relativity as designed by Einstein, there is no assumption that four-dimensional spacetime itself is embedded in a higher-dimensional space (flat or otherwise); its curvature must be treated as intrinsic. Such curvature without a flat embedding space is impossible to visualize without resorting to preferred coordinate systems that violate the constraint of general covariance, and the many arguments about whether spacetime is "really curved" stem from disagreements about whether it is possible to have intrinsic curvature without extrinsic curvature. Different metaphysical preferences exist regarding this throughout the physics community and need not be reconciled until the issue arises in considerations affecting quantum theories of gravity (we will return to this in section 6.9).

It should be pointed out that a space may have extrinsic curvature without intrinsic curvature. We will use Figure 6-2 (p. 323) once again as an example of a flat Euclidean space. One can take that page and roll it into an open cylinder, giving both the primed and unprimed coordinates extrinsic curvature that can be seen in the three-dimensional embedding space. But within the two-dimensional space of Figure 6-2, all distances between points are preserved, so the intrinsic curvature is still zero, and Euclidean geometry still applies, because we have not stretched or compressed the space in any
way. Nothing measurable by an inhabitant confined to the two-dimensional space can reveal the extrinsic curvature. One may then ask how an inhabitant of an intrinsically curved space can detect such curvature and hopefully quantify it. We will give qualitative descriptions of two such methods, geodesic deviation and parallel transport. Figure 6-7 illustrates both of these for two cases: on the left, a two-dimensional flat space, i.e., a Euclidean plane, and on the right a two-dimensional curved space, in this case a sphere. Both show a pair of geodesics that are parallel at the labeled points A and B. In the plane, these remain parallel for arbitrarily large distances from the labeled points, so the geodesic deviation is zero. On the sphere, the geodesics are meridians, hence great circles, and are parallel only at the equator, meeting at the pole, so that the geodesic deviation is nonzero. In principle, sufficiently accurate measurements of the separation over extended distances along the lines could detect and measure the effects of any curvature.
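A minimal numerical sketch of the parallel-transport experiment, assuming the standard Christoffel symbols of the unit sphere (which follow from Equation 6.49 but are simply quoted here), is given below in Python. It transports a tangent vector once around a circle of constant colatitude and reports the angle between the initial and final vectors; the 60° colatitude is an arbitrary choice.

import math

# Parallel transport around a circle of constant colatitude theta0 on the unit sphere.
# Assumed 2-sphere Christoffel symbols: Gamma^theta_{phi phi} = -sin(theta)cos(theta),
# Gamma^phi_{theta phi} = Gamma^phi_{phi theta} = cos(theta)/sin(theta).
theta0 = math.radians(60.0)
sin0, cos0 = math.sin(theta0), math.cos(theta0)

def rhs(V):
    """dV/dlambda along the path theta = theta0, phi = lambda (so dphi/dlambda = 1)."""
    Vth, Vph = V
    return (sin0 * cos0 * Vph, -(cos0 / sin0) * Vth)

V = (1.0, 0.0)                       # start with the unit vector along e_theta
n = 200000
h = 2.0 * math.pi / n
for _ in range(n):                   # midpoint (RK2) integration around the closed path
    k1 = rhs(V)
    mid = (V[0] + 0.5 * h * k1[0], V[1] + 0.5 * h * k1[1])
    k2 = rhs(mid)
    V = (V[0] + h * k2[0], V[1] + h * k2[1])

def dot(a, b):                       # inner product with the metric diag(1, sin^2 theta0)
    return a[0] * b[0] + sin0**2 * a[1] * b[1]

V0 = (1.0, 0.0)
cos_angle = dot(V0, V) / math.sqrt(dot(V, V) * dot(V0, V0))
angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
print(angle)                         # about 180 degrees, i.e. 360*(1 - cos(theta0))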
Figure 6-7 Left. Parallel transport of a vector around a closed path and parallel lines A and B in a flat space. Right. Parallel transport of a vector around a closed path and locally parallel lines A and B on a spherical surface.

The rigorous definition of geodesic deviation is the acceleration of the minimal distance between one of the lines and a moving point on the other, so in fact the lines in Figure 6-7 on the left need not be parallel, just straight, so that the distance between them changes at a constant rate, hence zero acceleration. This will always be true of straight lines in a flat plane (note that in differential geometry, there are "planes" that are not flat!). By measuring the acceleration between the two lines in Figure 6-7 on the right, one can determine the kind of curvature, positive (e.g., on a sphere) or negative (e.g., on a hyperboloidal sheet). In general, the curvature need not be the same in all directions from a given point. Parallel transport refers to moving the foot of a vector along a path while keeping the vector
parallel to its original direction. In figure 6-7 on the left, the vectors are all in the plane, and the “tangent plane” at any point is just the plane itself. If we perform a parallel transport of the vector around a closed path, it will still be parallel to its original self when we return to the starting point. The illustration shows the transport beginning at position 1, where the vector is tangent to the circular path for illustration purposes only. After being transported around the closed path, it is obviously still parallel to its original self when it arrives back at position 1. This would be true for any closed path. For the spherical case in Figure 6-7 on the right, the closed path is composed of segments along lines of constant longitude and latitude to aid in visualizing the direction of the vector at each point. These vectors are drawn extending well beyond the locally Euclidean neighborhood of the point at which the foot is located, but in each case, within the local tangent plane. However, it is now impossible to keep the vector parallel to its other locations, because the orientation of the tangent plane changes along the path. Nevertheless, for the tangent planes at any two arbitrarily close points, there is a vector in one whose direction in the flat embedding space is minimally different from that of the other vector, and this is as close as we can come to keeping the vector parallel to itself at other locations as we move around the path. When we do this, upon returning to location 1 we find that not only is the vector not parallel to its original direction within the starting tangent plane, the discrepancy is generally different for different paths. The behavior of the discrepancy with respect to path is another measure of the curvature of the space containing the path. Note that if we roll the page containing Figure 6-7 into a cylinder as we contemplated doing for Figure 6-2 above, then the tangent planes along the closed circular path on the left will no longer all be the same, and viewing the arrow’s direction in the three-dimensional embedding space will reveal that it is not parallel to its original direction. But when the arrow is transported back to its original location, it will be once again parallel to its original self, and the intrinsic curvature will be seen to be zero. General Relativity attempts to describe the influence of gravity on what we actually experience within an intrinsically curved spacetime, and so our tools must operate within that context, not in a context of a flat embedding space. Whereas we used a flat embedding space to illustrate geodesic deviation and parallel transport, in fact these effects are observable completely from within the curved space, given appropriate tools. We do not rule out the possibility that spacetime is embedded in a flat space, but if so, it goes beyond the space within which we experience the Universe, and so the notion of the gradient in flat space must be generalized to the notion of a covariant derivative that can operate in curved spacetime. The use of locally flat neighborhoods in our mathematical formalism will still require the notion of flat tangent spaces, but each is not considered real beyond the locally flat neighborhood; they are mathematical idealizations that aid our calculations. The covariant derivative is defined subject to certain constraints. 
In order to be useful within the context of the problem we are attempting to solve with the set of tools at our disposal, we require the result of differentiating a tensor to be another tensor. This alone rules out ordinary differentiation in general, as may be seen by considering the Leibniz Product Rule for the functions u(x) and v(x):

$$\frac{d}{dx}(u\,v) = u\,\frac{dv}{dx} + v\,\frac{du}{dx} \tag{6.43}$$
Here we will follow the demonstration in Lieber (1945). We consider the transformation of the contravariant vector Aσ:
$$A'^{\mu} = \frac{\partial x'^{\mu}}{\partial x^{\sigma}}\; A^{\sigma} \tag{6.44}$$
We take the partial derivative with respect to x′^ν, applying the Leibniz Rule,
$$\frac{\partial A'^{\mu}}{\partial x'^{\nu}} = \frac{\partial}{\partial x'^{\nu}}\!\left(\frac{\partial x'^{\mu}}{\partial x^{\sigma}}\,A^{\sigma}\right) = \frac{\partial x'^{\mu}}{\partial x^{\sigma}}\,\frac{\partial A^{\sigma}}{\partial x'^{\nu}} + A^{\sigma}\,\frac{\partial^2 x'^{\mu}}{\partial x'^{\nu}\,\partial x^{\sigma}} \tag{6.45}$$
The chain rule gives us
$$\frac{\partial A^{\sigma}}{\partial x'^{\nu}} = \frac{\partial A^{\sigma}}{\partial x^{\lambda}}\,\frac{\partial x^{\lambda}}{\partial x'^{\nu}} \tag{6.46}$$
so that Equation 6.45 becomes

$$\frac{\partial A'^{\mu}}{\partial x'^{\nu}} = \frac{\partial x'^{\mu}}{\partial x^{\sigma}}\,\frac{\partial x^{\lambda}}{\partial x'^{\nu}}\,\frac{\partial A^{\sigma}}{\partial x^{\lambda}} + A^{\sigma}\,\frac{\partial^2 x'^{\mu}}{\partial x'^{\nu}\,\partial x^{\sigma}} \tag{6.47}$$
The first term on the right above has the form of a transformation of a second-order mixed tensor (say from B^σ_λ to B′^μ_ν). The second term, however, does not have the form of a covariant, contravariant, or mixed tensor transformation, and so the derivative on the left-hand side cannot be any such tensor unless the second term on the right is zero. That term will be zero if ∂x′^μ/∂x^ν is not a function of x^σ or if ∂x′^μ/∂x^σ is not a function of x′^ν, which is often true (e.g., in Equations 6.31-6.33) but not in general, and so the ordinary derivative of a tensor is not formally a covariant, contravariant, or mixed tensor in general. This is the problem that we must eliminate in constructing the covariant derivative. As with all derivatives, the covariant derivative will be a derivative with respect to some variable indicated by the notation. The covariant derivative with respect to x^μ is commonly denoted ∇_μ and must be consistent with the Leibniz Rule while yielding a true tensor, and also consistent with parallel transport as described qualitatively above. There are also other constraints, which we will illustrate in the list below with A and B representing tensors of any type and order.

Linearity: ∇_μ(A + B) = ∇_μ A + ∇_μ B.

Commutativity with contraction: the derivative of a contracted tensor must be equal to the corresponding contraction of the corresponding derivative of the tensor.

The derivative of a tensor field must be consistent with the notion of a tangent space as composed of directional derivatives.
∇_μ A must reduce to the ordinary derivative ∂A/∂x^μ when the curvature is zero.

Torsion-free: ∇_μ ∇_ν A = ∇_ν ∇_μ A.
The last constraint is applied in General Relativity but not in all modified theories of gravity. Torsion is a kind of curvature that causes the coordinate system to rotate in a helical fashion along the geodesic in addition to the squeezing and stretching typical of stresses. It has been thought to provide an approach toward including quantum-mechanical spin as an effect that couples to gravity (e.g., Cartan, 1922), and some work has been done to pursue that as an avenue toward Quantum Gravity (including work by Einstein to incorporate the electromagnetic field in General Relativity as part of his search for a "unified field theory").

We mentioned (with respect to Equation 6.10, p. 322) that the metric tensor g_μν is symmetric, and by using subscript indexes we have acknowledged in advance that it is a covariant tensor. We will also assume without proof that its components are real numbers and that it is nonsingular (its determinant is nonzero). There are actually cases in which the metric becomes singular (e.g., "black holes" in which the mass density is so great that even light cannot escape), but the tools we are describing cannot operate under those circumstances unless the singularity stems from the choice of coordinates and can be removed by a suitable transformation, so we will assume that the metric is nonsingular, and therefore an inverse exists. Among other facts that our scope does not permit us to prove or derive, but which it is hoped will be seen as plausible, are that the inverse of g_μν is a contravariant tensor, hence written with superscript indexes as g^μν, and is also symmetric. Like the product of a matrix with its inverse, g^μλ g_λν has unit diagonal elements and null off-diagonal elements, i.e., g^μλ g_λν = δ^μ_ν, a mixed tensor of order 2 whose elements are the Kronecker delta: δ^μ_ν = 1 for ν = μ, and δ^μ_ν = 0 for ν ≠ μ (this is actually the definition of the inverse). These properties of the metric tensor are woven into the definition of the covariant derivative, which should not be thought of as an ad hoc fixup for the failure of the ordinary derivative to yield a tensor (as shown in Equation 6.47) but rather as the general definition of a directional derivative that reduces to the ordinary derivative when the curvature vanishes. Of course, the need for such a generalized definition was not realized until the mid-to-late 19th century work on differential geometry by Riemann, Christoffel, Ricci, and Levi-Civita. It was introduced by Ricci and Levi-Civita (1900), who drew upon work by Christoffel (1869). For illustration, we consider the covariant derivative of a covariant vector A_μ and a contravariant vector B^μ, each with respect to x^ν. The definitions which were found to satisfy all the constraints described above for these two types of vector are
$$\begin{aligned} \nabla_{\nu} A_{\mu} &= \frac{\partial A_{\mu}}{\partial x^{\nu}} - \Gamma^{\rho}_{\mu\nu}\,A_{\rho} \\ \nabla_{\nu} B^{\mu} &= \frac{\partial B^{\mu}}{\partial x^{\nu}} + \Gamma^{\mu}_{\rho\nu}\,B^{\rho} \end{aligned} \tag{6.48}$$

where the Christoffel symbol Γ^ρ_μν is defined as
$$\Gamma^{\rho}_{\mu\nu} = \frac{1}{2}\,g^{\rho\sigma}\!\left(\frac{\partial g_{\sigma\mu}}{\partial x^{\nu}} + \frac{\partial g_{\sigma\nu}}{\partial x^{\mu}} - \frac{\partial g_{\mu\nu}}{\partial x^{\sigma}}\right) \tag{6.49}$$
and the Einstein summation convention applies to σ, since multiplying the expression out produces σ twice in each term. When the Christoffel symbols are applied to A_ρ and B^ρ in Equation 6.48, there will also be summations over ρ. Note that if the metric tensor is constant throughout the space, which is the situation in flat space (e.g., Equations 6.27, p. 329, and 6.29, p. 331), then the derivatives in the Christoffel symbol are all zero, and the second terms in Equation 6.48 vanish, leaving only the ordinary derivatives. Except for the torsion-free property, it is straightforward to verify that all of the other constraints on the definition of the covariant derivative are satisfied, but the proofs are beyond our scope. We will look closer at the torsion-free property below, however, since it plays an essential role in deriving an expression for the Riemann tensor, which contains all relevant curvature information. But first we note the following effects of applying the covariant derivative operator to various tensors, which we will illustrate with tensors of order 1 and 2: A_μ, A^μ, A_μν, A^μν, and A_μ^ν.
$$\nabla_{\nu} A_{\mu} = B_{\mu\nu}, \qquad \nabla_{\nu} A^{\mu} = B^{\mu}{}_{\nu} \tag{6.50}$$
For each first-order tensor, applying the covariant derivative adds one covariant index. This pattern continues for second-order tensors:
$$\nabla_{\rho} A_{\mu\nu} = B_{\mu\nu\rho}, \qquad \nabla_{\rho} A_{\mu}{}^{\nu} = B_{\mu}{}^{\nu}{}_{\rho}, \qquad \nabla_{\rho} A^{\mu\nu} = B^{\mu\nu}{}_{\rho} \tag{6.51}$$
So differentiating first-order tensors produces second-order tensors, whose next derivatives (hence the second derivatives of the first-order tensors) behave as shown on the first two lines of Equation 6.51 (we omit the contravariant tensor A^μν from that statement because it cannot be produced by the covariant derivative, since there must be at least one covariant index in the result). The torsion-free constraint in our definition of the covariant derivative is not forced upon us by differential geometry and is omitted in some theories of gravity. Einstein chose to include it in General Relativity because his approach was to try the simplest possible formalisms to see if they worked, i.e., produced a philosophically satisfying theory that was consistent with existing observations and could make testable predictions that are subsequently verified. In 1929 he described his approach as follows.

"The general problem is: Which are the simplest formal structures that can be attributed to a four-dimensional continuum and which are the simplest laws that may be conceived to govern these structures? We then look for the mathematical expression of the physical fields in these formal structures and for the field laws of physics - already known to a certain approximation from earlier researches - in the simplest laws governing this structure."
There are many more simple approaches than correct ones, and Einstein had to retreat from several dead ends and replot his course before eventually arriving at his destination. It so happens that the torsion-free constraint fit perfectly into his method in that it kept the theory as simple as possible
while still meeting his success criteria. Among the "already known" earlier approximations was Newton's law of gravity, to which some rough similarities were to be expected in the behavior of a falling object observed in an inertial reference frame. The torsion-free constraint turns out not to be identically true, e.g., if one carries out the calculation in detail for ∇_ρ(∇_ν A_μ) and ∇_ν(∇_ρ A_μ) separately, one does not derive expressions that are identically equal. There are many identical terms, but the ones that are not identical allow the torsion-free constraint to be translated into a constraint on the metric, ∇_ρ(∇_ν A_μ) − ∇_ν(∇_ρ A_μ) = 0. We will leave the full expansion of the two expressions and the subsequent cancellation of many terms to the textbooks (or to the reader as an exercise) and just present the final expression:
$$\nabla_{\rho}\left(\nabla_{\nu} A_{\mu}\right) - \nabla_{\nu}\left(\nabla_{\rho} A_{\mu}\right) = \left[\frac{\partial \Gamma^{\alpha}_{\mu\rho}}{\partial x^{\nu}} - \frac{\partial \Gamma^{\alpha}_{\mu\nu}}{\partial x^{\rho}} + \Gamma^{\alpha}_{\sigma\nu}\,\Gamma^{\sigma}_{\mu\rho} - \Gamma^{\alpha}_{\sigma\rho}\,\Gamma^{\sigma}_{\mu\nu}\right] A_{\alpha} \equiv R^{\alpha}{}_{\mu\nu\rho}\,A_{\alpha} \tag{6.52}$$
where we use R^α_μνρ to denote the expression in brackets on the first line. We know it must be a fourth-order tensor so that its inner product with the arbitrary covariant vector A_α will yield a covariant third-order tensor to match the second derivatives on the left-hand side, and since the inner-product summation index α appears inside the brackets as a contravariant index, that's where it must remain. Furthermore, since the left-hand side must be zero by the torsion-free constraint, and since A_α is an arbitrary vector, we must have R^α_μνρ = 0.

R^α_μνρ is called the Riemann-Christoffel curvature tensor, or often just the Riemann tensor. Since it is a fourth-order tensor defined in a four-dimensional space (in General Relativity), it has 256 elements. The number of independent elements is much smaller, however, because of some symmetries and antisymmetries. For example, it is antisymmetric in the last two covariant indexes, R^α_μνρ = −R^α_μρν (note that different indexing conventions exist, e.g., in Wald (1984) and Penrose (2004) the antisymmetry is in the first two covariant indexes), so one factor of 16 in the 256 drops to 6. After all redundancies are taken into account, the Riemann tensor is found to have 20 independent elements. The equation R^α_μνρ = 0 is satisfied trivially in flat space, because in that case all the derivatives of the metric vanish. Although the proof is beyond our scope, it is also true that this equation is sufficient to establish that the space is flat. It may appear strange that all the work expended to describe spacetime curvature has led us to a situation in which the space is flat. The point is that in general spacetime is not globally flat, it is only locally flat within the confines of a locally inertial reference frame. In nontrivial cases, the metric tensor (and hence also the Riemann tensor) is a function of position in spacetime, i.e., a metric field. In a globally flat space, every individual term inside the brackets of Equation 6.52 is zero. In the general case, the terms that add up to produce any given element of the Riemann tensor sum to zero, but the individual terms themselves are not zero. The curvature information in the Riemann tensor allows us to establish connections between locally flat inertial frames so that we may integrate Equation 6.26 to calculate geodesics once we discover how to obtain the metric tensor from the Riemann tensor for a given physical situation, a process we will describe briefly below. The nuances implicit in this topic are typical of why so much caution is needed in establishing the mathematical framework for General Relativity. A particularly good discussion may be found in Rindler (1969, Chapter 8). The vanishing of the Riemann tensor in a locally inertial frame is related to the flatness of the tangent space containing a vector that is being parallel-transported along an arbitrary
path. Following the connections between locally flat frames results in transporting the vector through a sequence of tangent spaces that are not parallel to each other globally but minimize the change in the vector’s direction along the path. Note in Equation 6.52 that the Riemann tensor consists of Christoffel symbols and their first derivatives. Equation 6.49 shows that a Christoffel symbol consists of first derivatives of the metric tensor scaled by half the inverse of the metric tensor, so the Riemann tensor consists of combinations of first and second derivatives of the metric tensor. The metric tensor determines the geodesics of spacetime, the trajectories in space and time of particles falling under the influence of gravity. In Newtonian gravity, the coordinates along the trajectory of a falling body are determined by forces that are proportional to the second time derivatives of the coordinates themselves. In both cases, the trajectories are determined by something that depends on second derivatives of coordinates or that which determines the coordinates of a geodesic. In the case of the Riemann tensor, there is also a dependence on the first derivatives of the metric, but of course some differences from Newtonian gravity are not only to be expected but actively sought. For now, we note only the qualitative similarity in trajectories depending on second derivatives. Appreciating this similarity requires something in General Relativity corresponding to the time derivative in Newtonian mechanics. The covariant derivative (Equation 6.48) provides all the needed space derivatives and already has a component related to time, namely the component for x0 that we have identified with ct, but ct has units of space, not time, and as we have seen, t is measured differently in different relatively moving frames. In Relativity theory, a proper time differential is defined as dτ = ds/c, where ds is the spacetime interval given in Equation 6.26 (p. 329; this is the appropriate definition for the conventions we are using, but some authors define the spacetime interval with the time component negative and the space components positive, in which case dτ = -ds/c). One way to visualize the proper time is to consider the Minkowski metric (Equation 6.27).
$d\tau^{2} = \dfrac{ds^{2}}{c^{2}} = \dfrac{1}{c^{2}}\, g_{\mu\nu}\, dx^{\mu} dx^{\nu} = dt^{2} - \dfrac{dx^{2} + dy^{2} + dz^{2}}{c^{2}}$        (6.53)
If two events with coordinates represented in a given inertial reference frame are connected by a path for which x¹, x², and x³ (i.e., x, y, and z) are constant, then the corresponding differentials are zero along the path, and the integral of the proper time along that path yields Δτ = Δt. In this case, an object on that path through spacetime maintains a stationary (x, y, z) position in the inertial frame, and its proper time is the same as that kept by the inertial frame's standard clock. The right-hand side of Equation 6.53 contains the metric explicitly and is more general than the case when the Minkowski metric is applicable. In the more general case, a geodesic connecting the two events may have varying x¹, x², and x³ values. Besides being the "closest possible path to a straight line", an important property of a geodesic is that it locally maximizes Δs, the integral of ds between two events, and so it also maximizes Δτ (for the dτ = -ds/c definition, Δs is minimized; in either case, Δτ is maximized). Relative to the geodesic, other paths that connect the same two events but do not locally maximize Δτ must therefore have a smaller amount of proper time elapsing between those events. Such paths correspond to bodies not falling freely, hence such bodies are accelerated by nongravitational interactions. An example is the twin who uses a rocket engine to accelerate away from a sibling and then turns around and comes back, experiencing a smaller elapsed proper time than the twin who remained on the geodesic between the two events where they are both present.
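The twin comparison can be made concrete with a small numerical sketch, a minimal special-relativistic calculation using Equation 6.53 with the Minkowski metric; the 0.8c cruising speed and the 20-year coordinate-time separation between the meeting events are arbitrary illustrative choices, and the brief turnaround acceleration is ignored.

```python
# Minimal sketch: proper time accumulated along two worldlines between the same
# two events, using d(tau) = dt * sqrt(1 - v^2/c^2) (Equation 6.53 with the
# Minkowski metric). Speeds and durations are arbitrary illustrative values;
# only |v(t)| matters for the proper-time integral.
import numpy as np

c = 299_792_458.0                       # m/s
year = 365.25 * 24 * 3600.0             # s
T = 20.0 * year                         # coordinate time between the two meeting events
t = np.linspace(0.0, T, 200_001)

def proper_time(v_of_t):
    """Trapezoid-rule integral of dt * sqrt(1 - v^2/c^2) along the worldline."""
    integrand = np.sqrt(1.0 - (v_of_t / c) ** 2)
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))

v_home = np.zeros_like(t)               # the geodesic twin: at rest in this frame
v_trav = np.full_like(t, 0.8 * c)       # out at 0.8c, back at 0.8c

print(f"stay-at-home twin: {proper_time(v_home) / year:5.2f} years of proper time")
print(f"traveling twin:    {proper_time(v_trav) / year:5.2f} years of proper time (smaller)")
```

With these choices the geodesic twin logs 20 years while the traveler logs 12; any other velocity profile connecting the same two events gives something between 0 and 20 but never more, which is the maximization property just described.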
The equations governing the maximization of proper time between two events may be expressed in either differential or integral form. The solutions of these equations are beyond our scope, and any simplification of the mathematical procedure would be misleading. A proper understanding of the actual mathematics requires a textbook-depth presentation, for which Chapters 10-13 of Misner, Thorne, and Wheeler (1973) are recommended. A qualitative description of the steps involved will have to suffice herein. For this, we need the concept of a parameterized path through spacetime. Examples of this may be seen in Figure 6-7 where parallel transport is illustrated for a flat space and a curved space. The paths therein are closed and have numeric labels along them in increasing order to lead us around the path and back to the starting points, because we wanted to illustrate how parallel transport of a vector in a curved space returns it to its starting point no longer parallel to its original self. But for present purposes, one can imagine the path ending at any point along the way, and the spacetime paths of interest to us are not closed, because that would require reverse time travel to get back to the initial event (an enticing topic but not relevant here). Furthermore, since we are primarily interested in real physical situations involving the motions of objects with nonzero rest mass, we will consider only timelike paths. The numeric labels in Figure 6-7 may be regarded as readouts of a continuous parameter that is a monotonic function of progress along the path. In a curved spacetime, it is convenient to make this parameter a linear function of the proper time. We will denote this parameter λ, and so we have λ = aτ + b, where a is determined by the time units and b is determined by the clock zero point. Then any path from an event P1(τ1) to P2(τ2>τ1) can be represented by a parameterized curve xμ = f μ(λ), where f μ(λ) indicates some function whose form is completely specified for each of the four values of the index μ, i.e., we can compute dxμ/dλ = df μ(λ)/dλ formally. We will see below how the metric tensor is determined by the physical situation. For now we assume that it is prescribed, so that Δτ can be computed for any given parameterized curve by integrating along the path from λ1 = aτ1 + b to λ2 = aτ2 + b:
$\Delta\tau = \dfrac{1}{c}\displaystyle\int_{\lambda_{1}}^{\lambda_{2}} \sqrt{\, g_{\mu\nu}\, \dfrac{dx^{\mu}}{d\lambda}\, \dfrac{dx^{\nu}}{d\lambda}\, }\; d\lambda$        (6.54)
One procedure for finding the curve that maximizes Δτ is to declare f μ(λ) to be a geodesic, then define a deformed curve that differs slightly from it, hμ(λ) = f μ(λ) + δf μ(λ), where δ is small but greater than zero, write the integral corresponding to Equation 6.54 for hμ(λ), subtract Equation 6.54 from that to obtain the proper time difference δτ, and solve for the conditions that make δτ vanish. While this is easily said, carrying out the procedure in mathematical detail is nontrivial, as may be seen in (e.g.) section 13.4 of Misner, Thorne, and Wheeler (1973). Here we simply note that knowledge of the metric tensor allows geodesics between pairs of events to be computed, and for a body present at a given event with known energy and momentum, it allows us to compute the subsequent free-fall path of that object. What remains to be described is how the metric tensor is determined for a given physical situation, i.e., distribution of mass in spacetime, where by "mass" we mean not only ordinary mass but also the mass equivalent of energy. It will probably come as no surprise that in order to be sufficiently general, this distribution must be described by a tensor field, because it must be representable in any
coordinate system expressing the locally inertial reference frame of any observer in any state of motion. The components of this tensor at any given event are determined by the total energy-momentum four-vector at that event. This four-vector is a relativistic generalization of the classical momentum vector, i.e., it has four components and transforms in the Lorentzian manner between locally inertial reference frames. The component in the time slot is the total energy density due to all physical processes operating at the event other than gravity itself, and the spatial (hence position) components are the corresponding momentum densities. These are the same action-pair variable associations that we saw in Quantum Mechanics (e.g., the Heisenberg Uncertainty Principle). Our scope does not permit detailed discussion of how the various forms of energy and momentum (e.g., those of the electromagnetic field, mechanical fluid fields, etc.) map in their own distinct ways into this tensor, which is known as the stress-energy tensor. Because it describes a distribution in spacetime, it is four-dimensional, and because it includes stresses, it is a second-order tensor. Its transformation properties show that it is a covariant tensor, and it is commonly denoted Tμν. Rigorous discussions of how it is constructed may be found in textbooks such as Wald (1984) and Misner, Thorne, and Wheeler (1973). Einstein realized that there must be a relationship between the Riemann tensor and the stress-energy tensor. The former, however, is a fourth-order tensor, so it was natural to look at contractions of the Riemann tensor as he searched "for the mathematical expression of the physical fields in these formal structures and for the field laws of physics". He found a useful result in what is known as the Ricci tensor, the contraction of the Riemann tensor over α (in our notation, $R_{\mu\nu} = R^{\alpha}{}_{\mu\nu\alpha}$). The symmetries in the Riemann tensor lead to symmetry in the Ricci tensor, Rμν = Rνμ, and so of its 16 elements, only ten are independent. A glance at Equations 6.52 and 6.49 will reveal that the formal fully expanded expression of the Ricci tensor is nontrivial, but just setting the ten equations equal to zero led Einstein (1915) to the first success in his formulation of gravity: the vacuum field equations of General Relativity, a special case in which the stress-energy tensor is null. These equations form a coupled system of nonlinear second-order partial differential equations for the metric tensor. The "vacuum" in this sense refers to the spacetime domain in which there is no mass; this domain can still be influenced by nearby mass. A spherically symmetric solution to these equations was quickly found by Karl Schwarzschild (1916). It describes the space surrounding a nonrotating uncharged spherical mass. To proceed from the vacuum case Rμν = 0 to the more general case involving an arbitrary stress tensor, Einstein had to take into consideration the local conservation of energy and momentum (four constraints), i.e., for a sufficiently small region of spacetime bounded by a surface, the flow of energy and momentum across the surface into the region must equal the flow across the surface out of the region. This constraint on the stress-energy tensor implies something similar for the corresponding curvature tensor.
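The chain from Equation 6.49 to Equation 6.52 to the Ricci tensor can be followed mechanically with a computer algebra system. The sketch below uses Python with SymPy (not the Maple used for the book's figures), adopts units with c = 1, and follows a common textbook sign and index convention that may differ from the book's in overall sign; it verifies that the Schwarzschild metric makes all ten independent components of the Ricci tensor vanish, i.e., that it solves the vacuum field equations.

```python
# SymPy sketch: metric -> Christoffel symbols -> Riemann tensor -> Ricci tensor
# for the Schwarzschild metric (units with c = 1; conventions may differ from the
# book's Equations 6.49 and 6.52 in overall sign, which does not affect the result).
import sympy as sp

t, r, th, ph, rs = sp.symbols('t r theta phi r_s', positive=True)
x = [t, r, th, ph]
f = 1 - rs / r
g = sp.diag(f, -1 / f, -r**2, -r**2 * sp.sin(th)**2)   # g_{mu nu}
ginv = g.inv()
n = 4

def christoffel(a, b, c):
    # Gamma^a_{bc} = (1/2) g^{ad} (d_b g_{dc} + d_c g_{db} - d_d g_{bc})
    return sp.simplify(sum(sp.Rational(1, 2) * ginv[a, d] *
                           (sp.diff(g[d, c], x[b]) + sp.diff(g[d, b], x[c]) - sp.diff(g[b, c], x[d]))
                           for d in range(n)))

Gamma = [[[christoffel(a, b, c) for c in range(n)] for b in range(n)] for a in range(n)]

def riemann(a, b, m, k):
    # R^a_{b m k} = d_m Gamma^a_{k b} - d_k Gamma^a_{m b}
    #               + Gamma^a_{m l} Gamma^l_{k b} - Gamma^a_{k l} Gamma^l_{m b}
    expr = sp.diff(Gamma[a][k][b], x[m]) - sp.diff(Gamma[a][m][b], x[k])
    expr += sum(Gamma[a][m][l] * Gamma[l][k][b] - Gamma[a][k][l] * Gamma[l][m][b]
                for l in range(n))
    return expr

def ricci(b, m):
    # Contract the upper index with the last lower index, as in the text's R_{mu nu}.
    return sp.simplify(sum(riemann(a, b, m, a) for a in range(n)))

# Should print True: every component of the Ricci tensor vanishes (vacuum solution).
print(all(ricci(b, m) == 0 for b in range(n) for m in range(n)))
```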
Besides the symmetries and antisymmetries in the Riemann tensor mentioned above, this tensor also obeys some differential symmetries known as the Bianchi identities which include the sort of zero-divergence property needed. But the Riemann tensor’s compliance with these identities is lost in most contractions. The only tensor that maintains the desired properties and is derived from contraction of the Riemann tensor is called the Einstein tensor, denoted Gμν, and is given by
$G_{\mu\nu} = R_{\mu\nu} - \dfrac{1}{2}\, g_{\mu\nu}\, R$        (6.55)

where R is called the Ricci scalar or curvature scalar, the contraction of the Ricci tensor,
$R = g^{\mu\nu}\, R_{\mu\nu} = R^{\mu}{}_{\mu}$        (6.56)
It is common to use the same letter "R" for the Riemann tensor, the Ricci tensor, and the Ricci scalar, with the distinction being the number of indexes. The same is true of the letter "G", which is used for the Einstein tensor above and also for the "Newton constant" in Newton's law of gravity, Equation 5.1 (p. 223), although it is not similarly related to the Einstein tensor via contraction. It appears in Einstein's field equations, which establish the relationship between the Einstein tensor and the stress-energy tensor:

$G_{\mu\nu} = \dfrac{8\pi G}{c^{4}}\, T_{\mu\nu}$        (6.57)
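The constant of proportionality in Equation 6.57 is extraordinarily small in SI units, which is one informal way of seeing why only astronomical concentrations of mass-energy curve spacetime appreciably. A quick evaluation with the standard values of G and c:

```python
# The coupling constant in Equation 6.57, 8*pi*G/c^4, in SI units. Its smallness
# (~2e-43) is a rough indication of how much stress-energy is needed to produce
# appreciable curvature. G and c are the standard reference values.
import math

G = 6.67430e-11          # m^3 kg^-1 s^-2
c = 299_792_458.0        # m/s
kappa = 8 * math.pi * G / c**4
print(f"8*pi*G/c^4 = {kappa:.3e} s^2 kg^-1 m^-1")
```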
This equation is sometimes referred to in the plural and sometimes in the singular, depending on how one thinks about the indexes. It satisfied all of Einstein's initial requirements for his law of gravity (a new requirement was added later, as discussed below). The two tensors are simply proportional to each other with the constant of proportionality shown. Tracing this equation back through Equation 6.52 shows that, like the vacuum field equations, it expands to a set of ten coupled nonlinear second-order partial differential equations for the metric tensor gμν. So once the stress-energy tensor is constructed, the work is cut out and amounts to a calculus problem. There is one catch, however: to solve Equation 6.57, one must know the stress-energy tensor, but the stress-energy tensor contains the information about the distribution of all mass (and its energy equivalents) throughout the spacetime domain, exactly what we want to figure out in the first place for some given initial configuration. This echoes the problem described in section 6.1 where we considered how Newtonian gravity would change if the force propagated at finite speed: we have to know where everything was in the past in order to compute the time-delayed forces acting on everything at the present moment. This catch-22 is actually not uncommon in problems involving coupled systems of nonlinear partial differential equations (also in systems of nonlinear simultaneous algebraic equations and even many single transcendental equations). There is no known general method for obtaining solutions directly, and so iterative solutions are needed, and these depend on initial estimates and algorithms for refining the estimates recursively on each iteration until convergence is achieved to within some acceptable tolerance. Success typically depends on the initial estimates being close enough to the eventual solutions so that the algorithmic procedure does not diverge, and the refinement of the estimates on each iteration must be graceful enough not to introduce oscillations that are too large to be damped by the algorithm. In many cases, it is difficult or impossible to prove the uniqueness of whatever solutions are eventually obtained, appeals to physical plausibility are often employed, and it can be difficult even to establish that convergence has been achieved. A very common approach to solving Equation 6.57 is to consider the only unknown to be the trajectory of a test particle, an object with size and mass both sufficiently small so that ignoring its effect on the stress tensor is an acceptable approximation over the spacetime volume of interest. Then the stress tensor can be constructed from the Newtonian solution within a limited domain, and the test particle does not contribute to it or alter it. Of course, the Newtonian solution is typically not perfectly
compatible with General Relativity. If that is important for the problem at hand, then the stress tensor must be computed iteratively as the trajectories of each body in the system are approximated more accurately. The simplest problem that can be solved in this way is the case of a single massive body that defines the stress tensor and one test particle falling freely under the influence of the other's gravity. This approach can employ the Schwarzschild solution and was used early on to describe the two-body problem, for which Newtonian gravity yields conic-section trajectories for the test body about the stationary massive body, a solution that extends over the entire infinite time domain, even if the test particle's mass is not negligible. For a test body bound in orbit about the massive body, the spatial trajectory is an ellipse with a focus at the center of the massive body, or at the system barycenter if the test body's mass cannot be ignored. (The spacetime trajectory is a helix.) The solution of Equation 6.57 for this case yields an almost-elliptical spatial trajectory for the test body, with the deviation being a precession of the periapsis (the point of closest approach) in the orbital direction. This is how a long-standing puzzle involving the orbit of the planet Mercury about the Sun was resolved. Although perturbations by additional bodies in the system also cause apsidal precessions even in Newtonian gravity, the state of the art of both astronomical measurement and orbital mechanics was so high in the latter half of the 19th century that a previously unexplainable residual precession of Mercury's perihelion was established with high statistical significance. Explaining this was the first success of General Relativity. The second success employed a similar calculation except not for an orbiting body but rather a photon passing close to the Sun. The photon's path is deflected under both gravity theories, but by an angle that is a factor of 2 larger in General Relativity. This effect was confirmed by comparing star positions measured near the Sun during total eclipse with the positions of the same stars when the Sun was in a distant part of the sky. Some repeated measurements were needed to achieve a comfortable statistical significance, but deflections by the predicted amounts were eventually established convincingly. In the first few years of its existence, the widespread reaction to General Relativity included both deep admiration and ridicule. The latter gradually evaporated for lack of substantive foundation, although the idea of "curved space" remained difficult to assimilate, and in a time and place of considerable anti-Semitism, the fact that such a breakthrough had been accomplished by someone of Jewish origin was found by many to be a bitter pill to swallow. The enthusiasm on the positive side soon produced much independent research into the implications of Equation 6.57, and some of the findings were confusing and controversial. Einstein himself was perplexed by the fact that early attempts to apply his field equations to the entire Universe led to unstable models that seemed to insist on either collapsing or expanding. At the time, the Universe was thought to be static and probably composed of a galaxy of gravitationally bound stars, or a gravitationally bound system of such galaxies, surrounded by an infinite void.
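Returning briefly to the two classical tests just described: both numbers are easy to reproduce from the standard first-order (weak-field) formulas, which are quoted here without derivation; the orbital elements and solar values below are rounded reference figures.

```python
# Two classic General Relativity numbers from the standard weak-field formulas
# (quoted, not derived, here). Orbital elements and solar values are rounded
# reference figures.
import math

G, c = 6.67430e-11, 299_792_458.0
M_sun = 1.989e30            # kg
R_sun = 6.957e8             # m
a = 5.7909e10               # Mercury's semi-major axis, m
e = 0.2056                  # Mercury's orbital eccentricity
P_days = 87.969             # Mercury's orbital period, days

# Perihelion advance per orbit: 6*pi*G*M / (c^2 * a * (1 - e^2)), in radians.
dphi = 6 * math.pi * G * M_sun / (c**2 * a * (1 - e**2))
orbits_per_century = 36525.0 / P_days
arcsec_per_century = math.degrees(dphi) * 3600 * orbits_per_century
print(f"relativistic perihelion advance ~ {arcsec_per_century:.0f} arcsec per century")  # ~43

# Deflection of a grazing light ray: 4*G*M / (c^2 * R), twice the Newtonian value.
defl = 4 * G * M_sun / (c**2 * R_sun)
print(f"solar-limb light deflection ~ {math.degrees(defl) * 3600:.2f} arcsec")            # ~1.75
```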
Given the difficulty in finding solutions to problems that Newtonian physics could handle relatively easily, it took some audacity to aspire to a physical model of the entire Universe based on General Relativity, but some encouragement was to be found in the observational fact that the Universe appeared to be isotropic at the largest scale. This allowed consideration of its large-scale structure, analogous to considering the Earth to be a slightly oblate spheroid. At the fine scale, Earth has mountains and valleys, oceans and continents, but as later determined by actual experience, from the moon the Earth looks very similar to a perfect sphere, and a lot can be learned by studying the planet at this scale.
The assumption of large-scale isotropy is called the Cosmological Principle. When applied within General Relativity, it implies that the Universe is also homogeneous so that it will be isotropic for all observers. This dispensed with the infinite void surrounding the observable stars and created a problem by implying that all lines of sight should eventually intersect a star, making the sky brightness arbitrarily large, clearly not an observational fact. This paradox was later resolved by the discovery that the Universe is expanding, and so light from those distant stars is not only redshifted below observability, it also has not had time to propagate through those arbitrarily large distances, given that the stars are not infinitely old. But when Einstein was struggling to make his equation compatible with a static Universe, the expansion of the Universe was not yet known, and theories of star formation and evolution were in their infancy. The Cosmological Principle allows a large-scale metric for the Universe to be computed. This is known as the Robertson-Walker metric, or sometimes as the Friedmann-Lemaître-Robertson-Walker metric, after the main theorists who developed it. It can take three different forms, each describing one of the three possible large-scale geometries for four-dimensional spacetime: spherical, flat, and hyperboloidal. These three cases are determined by the value of the parameter k in the metric. Its possible values are 1 (spherical), 0 (flat), and -1 (hyperboloidal). Here we follow Rindler (1969) with only slight changes in notation. The Robertson-Walker metric is commonly expressed using spherical coordinates (r,θ,φ) for the spatial components, which describe a Riemann space of constant curvature in three dimensions because of the isotropy condition:
$ds^{2} = c^{2} dt^{2} - a^{2}(t)\left(\dfrac{d\rho^{2}}{1 - k\rho^{2}} + \rho^{2} d\theta^{2} + \rho^{2}\sin^{2}\theta\, d\phi^{2}\right)$        (6.58)

where

$\rho = \dfrac{r}{1 + \tfrac{1}{4} k r^{2}}$        (6.59)
and a(t) is called the scale factor because it scales the three-dimensional space described inside the parentheses that it multiplies. The points in this space are called comoving because the entire space scales as a unit by the factor a(t), which is defined to be nonnegative. If a(t) increases with time, then the Universe is expanding, and if a(t) decreases with time, the Universe is contracting. Otherwise the Universe is static. It is fairly easy to see that if k = 0, then ρ = r, and the three-dimensional space has the ordinary Euclidean metric, i.e., the spatial geometry is flat. It is not as obvious without working through the algebra, but if k = 1, the space becomes a 3-sphere (i.e., a three-dimensional surface of a four-dimensional ball), and if k = -1, the space becomes the three-dimensional analog of a hyperboloidal sheet. Einstein was primarily interested only in the spherical case, which possesses a finite spatial volume at any given finite time, and for this reason it is also called "closed", with the other two cases called "open". The use of the symbol ρ in Equations 6.58 and 6.59 is traditional but a bit unfortunate, because the same symbol is also used traditionally to denote density (e.g., Figure 6-5, p. 335) and as a tensor index (e.g., Equation 6.40, p. 337). We can only advise caution and attention to context, since mass and energy densities are important parameters in General Relativity and arise often.
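What the parameter k means geometrically can be checked numerically from the spatial part of Equation 6.58 at fixed t (taking a(t) = 1 purely as an illustrative normalization): the ratio of a circle's circumference to its proper radius equals 2π only in the flat case, falls short of 2π for k = +1, and exceeds it for k = -1.

```python
# Spatial geometry of Equation 6.58 at fixed t with a(t) = 1 (illustrative
# normalization). The proper radial distance to coordinate rho is the integral of
# d(rho')/sqrt(1 - k*rho'^2); the circle at that rho has circumference 2*pi*rho.
import numpy as np

def circumference_over_radius(k, rho=0.5, n=100_001):
    x = np.linspace(0.0, rho, n)
    integrand = 1.0 / np.sqrt(1.0 - k * x * x)
    proper_radius = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x))
    return 2.0 * np.pi * rho / proper_radius

for k in (+1, 0, -1):
    print(f"k = {k:+d}:  circumference / proper radius = {circumference_over_radius(k):.4f}"
          f"   (2*pi = {2 * np.pi:.4f})")
```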
Straightforward attempts to use the Robertson-Walker metric with simple stress tensors based on seemingly reasonable models such as uniform mass density, radiation energy, etc., led to a result in which da(t)/dt was exclusively negative for the spherical case that Einstein preferred (for the hyperboloidal case, da(t)/dt becomes exclusively positive). For a detailed discussion of these models, the reader is referred to any standard textbook on General Relativity; an extensive discussion can be found in Peebles (1993). As things stood, the Universe could not be closed and static and still a solution to his equation, as Einstein believed it must be (this is the additional requirement to which we referred above). Although Einstein considered it to be defacing a beautiful theory, he introduced an additional term into his equation because he believed that some way to make the closed Universe static was mandatory. The new form of the equation is

$G_{\mu\nu} + \Lambda\, g_{\mu\nu} = \dfrac{8\pi G}{c^{4}}\, T_{\mu\nu}$        (6.60)
where Λ is called the cosmological constant. As a scalar multiplying the metric tensor, it has the correct tensor form, maintains local energy conservation, and is the least complicated modification that can allow da(t)/dt to be zero for the closed model when assigned a value appropriate for a specific stress tensor, thus allowing the closed static Universe that Einstein desired. Two problems remained, however. First, the static Universe models made possible by fine-tuning the cosmological constant were all unstable. Any perturbation anywhere in the Universe would upset the delicate balance and cause an irreversible expansion or contraction. Second, evidence that the Universe was not static but rather expanding was rapidly accumulating. In 1919, as Arthur Eddington was making the first attempt to measure the deflection of starlight by the Sun's gravitational field during a total eclipse, Edwin Hubble was beginning his career at Mount Wilson Observatory near Pasadena, California, where the powerful new 100-inch telescope was just beginning service. By 1923, Hubble had used the recently discovered correlation between the absolute brightness and fluctuation period of certain variable stars (derived primarily over the previous decade by Henrietta Leavitt) to show that some cloud-like astronomical objects were too distant to be part of the Milky Way Galaxy, which had recently been thought to comprise all matter in the Universe. The apparent brightness and variability period of such a star are fairly easy to measure, and since the period reveals the absolute brightness with which it is correlated, comparison to the apparent brightness reveals the distance (to first order; such things as absorption by intermediate gas and dust must be taken into account). It became clear that these cloud-like objects containing the variable stars were galaxies in their own right and at distances far greater than previously realized. The state of the art in astronomical spectroscopy had been advancing for decades, and the catalog of known atomic spectral lines was used by Hubble and his co-workers (primarily Vesto Slipher and Milton Humason) to measure the Doppler shifts in the light coming from these external galaxies. The Doppler shift in light is a change in frequency caused by the line-of-sight motion of the object emitting the light relative to the observer. A blueshift indicates that the object is approaching, and a redshift indicates that it is receding. As Einstein had pointed out, light is also redshifted when the emitter is in a "gravitational well" and observed from the outside, and this effect must be considered when interpreting astronomical redshifts. This has in fact produced some red herrings in the history of cosmology, but the current understanding is that the redshifts in the light from external galaxies are
strongly dominated by motion-induced Doppler shifting. Hubble's work was made more difficult by the fact that there are two distinct variable-star classes with different correlations between absolute brightness and variability period. This was not recognized at the time and introduced scatter into his results, but nevertheless by 1929 a compelling picture emerged from the fact that essentially all external galaxies showed redshifts exclusively, with more distant galaxies having higher redshifts that implied recession speeds proportional to distance. This could be understood only as a uniform expansion of the Universe, da(t)/dt > 0 (Lemaître, 1927; Robertson, 1928; Hubble, 1929); a toy numerical version of such a velocity-distance fit is sketched below. Given the irrelevance of static models of the Universe, Einstein pronounced the cosmological constant to be his greatest mistake. But the cosmological constant would not go away. For the spherical model, it was needed to explain the expansion. Willem de Sitter, who had found a matter-free solution to the field equations in 1917, claimed that nothing other than the cosmological constant could be the cause of the expansion of the Universe. Combining that with Einstein's declaration, it could be said that Einstein's greatest mistake was the time he thought he was wrong. The cosmological constant plays a crucial role in modern cosmology, being associated with "dark energy" that may stem from quantum-mechanical vacuum fluctuations. If this possible connection between General Relativity and Quantum Mechanics pans out, it should provide tantalizing clues for the development of Quantum Gravity, but in the process the most extreme disagreement known between theory and observation must be resolved: the approximate value of the cosmological constant deduced from observations of the expansion of the Universe is more than 120 orders of magnitude smaller than the value produced by the best known method for computing it quantum-mechanically (see, e.g., Rugh and Zinkernagel, 2000). Given the limits of current technology, the range of values consistent with the observed expansion makes the cosmological constant incapable of producing measurable effects on scales such as that of the solar system or smaller. When viewed as a consequence of the quantum-mechanical vacuum, the role of the cosmological constant can be played by a contributor to the stress tensor instead of the additional term that Einstein found so offensive, and so one may return to Equation 6.57. In summing up this brief survey of the essential aspects of General Relativity, we should remark on why it was necessary to refer to "local" conservation of energy and momentum expressed as a zero divergence of these parameters with respect to a closed surface in spacetime. Many notions that are unrestricted in other contexts are limited to "local" applicability in General Relativity. For example, two differently moving observers studying the same physical system will compute two different values for its total energy. In Special Relativity, two observers at rest with respect to each other but widely separated in space will measure the same total energy in the mutually observed system, but in General Relativity, the notion of "widely separated but mutually at rest" loses its meaning, and the concept of total energy cannot be associated with unrestricted spacetime volumes.
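Here is the toy velocity-distance fit promised above. The five distance-redshift pairs are invented for illustration (they are loosely consistent with a modern Hubble constant near 70 km/s per Mpc and are not Hubble's 1929 measurements), and the low-redshift approximation v = cz is assumed.

```python
# Toy illustration of extracting a Hubble constant from distance-redshift pairs.
# The data are invented for illustration only; v = c*z is the low-redshift
# Doppler approximation.
import numpy as np

c = 299_792.458                                               # km/s
d_mpc = np.array([10.0, 25.0, 50.0, 80.0, 120.0])             # distances, Mpc (illustrative)
z = np.array([0.0024, 0.0058, 0.0119, 0.0186, 0.0281])        # redshifts (illustrative)

v = c * z                                                     # recession speeds, km/s
H0, intercept = np.polyfit(d_mpc, v, 1)                       # slope of v versus d
print(f"fitted slope (Hubble constant) ~ {H0:.1f} km/s per Mpc")
```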
In Newtonian mechanics, momentum is a vector that points in a definite direction, whereas in General Relativity the notion of a vector's direction is limited to the tangent space of a locally Euclidean neighborhood and is not physically relevant to anything outside of that neighborhood. Except for certain special cases that are no longer believed to be useful in describing the real Universe (e.g., stationary models that are asymptotically flat at all scales), such considerations make it impossible to discuss total energy and momentum meaningfully in General Relativity on anything other than a "local" basis. As seen in Equation 6.52, the Riemann tensor consists of derivatives and products of Christoffel symbols, and as seen in Equation 6.49, each of these Christoffel symbols contains products of metric
tensor derivatives with its inverse. This all makes the field equations nonlinear in the metric and its derivatives, the main reason why they are notoriously difficult to solve and why they defeat attempts at renormalization under the same approach to field quantization that has produced successful results for the quantum field theories. This synopsis of General Relativity is analogous to summarizing the game of chess by listing each piece and describing its legal moves. Much of the flavor can be appreciated, and something of what goes on in a match can be imagined, but it must be understood that vast realms of possible games, advanced strategies, and historic contests have been left unexplored. In the case of General Relativity, many physical situations have been studied with an extensive variety of computational techniques. For readers driven to venture further into this territory, an abundant literature exists to satisfy their hunger, and an excellent starting point may be found among the references cited in this section. The hope herein is that we have established enough of a feel for the core aspects of General Relativity to provide some foundation for appreciating what it is that we wish to transcend with a formalism that subsumes both General Relativity and Quantum Mechanics.

6.3 Reasons Why a Quantum Gravity Theory Is Needed

In mathematics, a formal axiomatic system from which contradictory theorems can be derived is automatically considered invalid, even without any specific interpretation of the elements of the formalism. Because of the impressive usefulness of mathematics in developing physical theories that succeed in describing observed facts accurately and succinctly, physics has become a collection of self-consistent mathematical formalisms. While these generally share many common mathematical techniques, the separate theoretical fields within physics largely developed apart from each other historically and remained disjoint as long as the phenomena described by their formalisms did not overlap. The more recent history of physics, however, has seen numerous overlaps emerge, and these have led to unifications such as electroweak theory forming from the combination of quantum electrodynamics and the theory of weak nuclear interactions, to name one of many possible examples. When an internally consistent mathematical system yields theorems that appear to be incompatible with those of another such system, the two formalisms generally receive different interpretations of their elements. For example, we may have A×B = B×A in one system and A×B ≠ B×A in another, and there is no problem if A and B are ordinary integers in the first case but matrices in the second. In the set of mathematical formalisms that constitute physics, however, there is much less freedom to assign reconciliatory interpretations of the formal elements. For example, electrical charge in the theory of weak nuclear interactions is the same thing as electrical charge in the theory of strong nuclear interactions. If there were an inconsistency, grand unified theories would be impossible. Do such inconsistencies exist between Quantum Mechanics and General Relativity? That question is answered in different ways within the physics community. Some see nothing egregious, while others perceive a crisis. One serious outstanding issue involves quantum entanglement (see sections 5.11-5.14).
It has been established experimentally that nonlocal effects take place in Nature, i.e., events separated by spacelike intervals can be causally connected. It has been established theoretically that these effects, as understood so far, cannot support transmission of encoded information at superluminal speeds, eliminating a whole class of sanity-threatening causal anomalies, but only through the mechanism of nonepistemic randomness. If the mechanism described in section 5.14 is operative, then it is ironic
that randomness should be the savior of sanity! The evidence that nonlocal effects have taken place can be established after the fact through ordinary communication channels by comparisons of measurement results, although there can be ambiguity in individual cases regarding which event was the cause and which was the effect. This follows from the fact that simultaneity is not an absolute but rather depends on which frame of reference is used for observation, and for events separated by a spacelike interval, the roles of cause and effect can be reversed in different frames. Some physicists find the entanglement enigma tolerable, because when restricted to their domains of specific applicability, the separate theories work to their satisfaction, and beyond that, asking about "what is really going on" is considered naive. Other physicists view their discipline's primary purpose as casting light on the situation in which human consciousness finds itself, which can scarcely be done by refusing to address that question. The viewpoints herein are from the latter school, and as a result, we are interested in formalisms that provide insights into the true nature of what exists, not engineering approximations that are useful within a limited domain for artificial reasons. This excludes a very large collection of practical algorithms that make no attempt to provide philosophical content. In contrast, the notion of intrinsically curved spacetime offered by General Relativity is an example of what we mean by an insight into the true nature of what exists, as is the wave nature of physical substances that is essential to Quantum Mechanics. Our hypothesis is that these are clues to the fundamental elements of reality, not merely coincidental alignments of unrelated notions. If human intuition needs to be expanded during the quest to achieve these greater insights, so be it, and it has always been thus. For example, Special Relativity removed the need for an absolute standard of rest, but it did not prove that one cannot exist. General Relativity does not elevate any one coordinate system above all others, and it extended the democratic equality of all reference frames to include accelerating ones. But the symmetries of such formalisms can be broken in actual solutions of the equations. For example, rotational symmetries of liquid water molecules are broken in the phase transition to ice. Symmetry breaking is an important concept in modern physics and arises in numerous contexts. If the description of a given physical process is simpler and more natural in a particular coordinate system compared to the way it appears in others, then perhaps that coordinate system should be considered special for physical reasons rather than merely being mathematically more convenient. For example, Winterberg (1992) says "The existence of such a preferred reference system is supported by the distribution of galaxies in the universe, suggesting that matter derives its existence from a cosmological field at rest with the galaxies. The existence of a preferred reference system is also suggested to explain the faster-than-light quantum correlations in a rational way." So if a plausible physical motivation exists for preferring a particular reference frame, the roles of cause and effect for events separated by a spacelike interval can be assigned according to their temporal order in that frame and considered what "really happened".
For quantum-entanglement correlations, the center-of-mass frame of the particles' interaction seems compelling as the preferred frame, since both measurement events occur on the particles' trajectories, which are symmetrically defined in that frame, and the measurements are both separated from the interaction event by timelike or null spacetime intervals. Einstein introduced quantum entanglement in terms of its effects on the joint wave function of the particles that had interacted, and so the fact that quantum waves are involved is at the heart of the interpretation conundrum. This is one more manifestation of the unsolved quandary regarding what these quantum waves "really are". One desired aspect of a Quantum Gravity Theory would be insight into the nature of these waves, including how the medium through which they propagate is related to spacetime. This seems reasonable to request, because both Quantum Mechanics and General
Relativity deal with the vacuum as something other than complete emptiness. In the former it is a field with fluctuations associated with positive zero-point energy and "excitations" constituting what we think of as particles, and in the latter it is a field that can possess intrinsic curvature and produce what we think of as gravitational interactions. Apparently the vacuum is a versatile medium whose properties are manifested via a motley collection of "fields". Equation 6.57 expresses an equality between two tensors, one of which is the stress tensor Tμν multiplied by a constant scalar. The energy and momentum comprising Tμν are functions of position and time. All four of these quantities are subject to quantization and the Heisenberg Uncertainty Principle in Quantum Mechanics. This seems to require that the Einstein tensor Gμν be similarly constrained, since the quantum-mechanical properties of physical objects are well established experimentally and cannot be expected to vanish in a Quantum Gravity Theory that must produce all of the confirmed predictions of Quantum Mechanics and General Relativity. But the latter is founded on spacetime being a continuously differentiable manifold. How to quantize it and whether this is even possible are open questions. One approach has been to replace Tμν with its expectation value, ⟨Tμν⟩. This is known as semiclassical gravity and allows both sides of Equation 6.57 to remain continuous. For example, the Poisson distribution (Equation 2.19, p. 59) is a discrete distribution, but its mean is a continuous variable. The mean of a discrete random variable need not even be in that random variable's sample space. Semiclassical gravity has been used successfully for certain special cases such as the motion of test particles and the evolution of quantum fields in spacetimes curved by the expectation values of prescribed stress tensor fields. Among its predictions (yet to be confirmed unequivocally) are Hawking radiation from the event horizons of black holes (Hawking, 1974) and the Unruh effect (Fulling, 1973; Unruh, 1976) in which vacuum fluctuations cause an accelerating observer to see blackbody radiation that is not seen by inertial observers. Because of its ad hoc nature, semiclassical gravity is widely considered more of an engineering approach than a physically motivated description of reality. Furthermore, it leaves some important questions unanswered. For example, since it requires the Einstein tensor also to be an expectation value averaged over quantum fluctuations, what can be said of the causal connection between two events on the cusp of spacelike separation? Does causality also fluctuate quantum-mechanically? The St. Petersburg Paradox discussed in sections 1.8 and 1.9 showed us that it can be dangerous to consider only expectation values while ignoring the role of fluctuations. The value of Quantum Mechanics generally lies in its ability to provide the probabilities of various possible measurement outcomes, not an average over them that may not correspond to any one possible outcome, although this objection is weakened when one is interested only in macroscopically large systems. Still, semiclassical gravity does not satisfy our desire for a Quantum Gravity Theory. Attempting to retain the continuum of General Relativity is viewed by many to be counterproductive. The energy continuum of classical Statistical Mechanics was the source of the "ultraviolet catastrophe" (Equation 5.3, p. 236); the introduction of quantized energy removed the catastrophe and gave birth to Quantum Mechanics. To many physicists, the singularities implied by General Relativity for black holes and the Big Bang are as repugnant as the ultraviolet catastrophe. Such singularities are taken as evidence that the theory has "broken down" at extremely microscopic scales and must be supplanted with something that works at those scales. This certainly seems like an invitation for Quantum Mechanics to enter the picture in some extended form and remove the effects of continuous spacetime. In quantum field theories, the energy required to probe field structure at a given distance scale
increases as the distance decreases, and this aggravates the problem of renormalization. A "natural cutoff" at the low end of the distance scale would provide relief and is therefore desirable. While the study of entropy in "black hole thermodynamics" is still dependent to some extent on heuristic arguments (e.g., Bekenstein, 1973), it also suggests a quantization of spacetime in the proportionality between black hole entropy and the number of cells of Planck-length size (see Appendix H) required to tile the event horizon (Sorkin, 1998). Recent work has suggested that the event horizon of a black hole is not a smooth continuum but rather has fine structure at the Planck scale (Hawking et al., 2016) that has come to be called "soft hair" (because this contradicts the saying "a black hole has no hair", a reference to black hole properties being limited to mass, charge, and spin). This Planck-scale structure is expected to cause "echoes" ringing in gravitational waves created by black-hole mergers such as that reported by Abbott et al. (2016), and these echoes may have been seen at marginal statistical significance (Abedi et al., 2016). The clues therefore seem to point in the direction of modifying General Relativity to remove the continuum, leaving it as a limiting approximation, not a fundamental property of the Universe. Some attempts to do this for quantum field theories, however (e.g., Winterberg, 1992), have indicated that Lorentz invariance must be abandoned at the discretization scale, with possibly observable consequences (e.g., light dispersion in pure vacuum). The question of preferred frames of reference also arises yet again, the rest frame of the "atoms of spacetime". There is a question of whether the "size" of the smallest possible unit of spacetime becomes even smaller due to Lorentz contraction in a frame moving relative to this preferred rest frame. On the other hand, some discretization schemes have been proposed that do not have these symptoms (e.g., Loop Quantum Gravity, described in section 6.7 below). One aspect of Quantum Mechanics that is generally recognized as also needing improvement is known as the problem of time. This refers to the way in which time appears as a "background parameter", not a dynamical element as in General Relativity. Although the modern successful quantum field theories incorporate Special Relativity and hence may be viewed as treating time as part of spacetime to the same extent, the fact that this spacetime has no curvature permits time to remain part of the background rather than an active player in dynamical physical processes. This almost-Newtonian role of time in Quantum Mechanics is in stark contrast to that of time in General Relativity, where it defines one of four dimensions of equal substance. In other words, in General Relativity, the "time direction" has fundamentally the same nature as any of the three spatial directions, and like them, it participates in the intrinsic curvature that causes gravitational effects and thus acts as a dynamical element. There is some danger here of thinking too much in terms of ordinary Cartesian dimensions. Cartesian coordinates are only one of many choices of coordinate systems that are valid in General Relativity, and the relationships between the natural axes applicable to any two given reference frames depend on the relative motions of those frames.
One must not think of spacetime as a four-dimensional space characterized by an absolute Cartesian (ct,x,y,z) coordinate system in which time always points in the same absolute direction. The "time" direction for one frame may be a superposition of "time and space" directions for another, as in Figure 6-1 (p. 319). Unlike rotated axes in (e.g.) a temperature-altitude coordinate plot, these superpositions of time and space directions are not chimerical. In the spacetime of General Relativity, a coordinate axis in one system that corresponds to a superposition of coordinate axes in another system is still a pure spacetime axis. Similar remarks apply to non-Cartesian coordinates for which axis directions are defined only locally, especially since orthogonal coordinate axes in General Relativity are defined only locally anyway.
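The axis-mixing point can be seen numerically with a Lorentz boost; this is a minimal flat-spacetime sketch, and the 0.6c relative speed is an arbitrary illustrative choice. The unit vector along one frame's time axis acquires both time and space components in the other frame, yet it remains a perfectly good spacetime direction.

```python
# A numeric version of the axis-mixing point (compare Figure 6-1): a Lorentz boost
# along x acting on (ct, x) components turns the unprimed frame's pure time axis
# into a mixture of time and space components in the boosted frame.
import numpy as np

beta = 0.6                                   # v/c between the two frames (illustrative)
gamma = 1.0 / np.sqrt(1.0 - beta**2)

boost = np.array([[gamma,        -gamma * beta],
                  [-gamma * beta, gamma       ]])

time_axis = np.array([1.0, 0.0])             # unit "time direction" of the unprimed frame
print(boost @ time_axis)                     # -> [ 1.25, -0.75 ]: a mix of time and space
```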
If one is simply given a point in a spacetime with a completely specified stress tensor, one cannot immediately determine what is the "time direction" at that point. It is not implicit in the nature of the space in the manner that temperature has an obvious direction in a temperature-altitude space. An identification of a time direction depends on how the spacetime is decomposed into a "3+1" space, which is typically done on a physically motivated foundation that is generally different for test particles passing arbitrarily closely to the point with different velocities. The test particles have no significant effect on the stress tensor, and the stress tensor determines the Einstein tensor, but the Einstein tensor does not uniquely determine the metric tensors appropriate for the coordinate frames of the test particles; rather, it determines a coupled system of nonlinear second-order partial differential equations whose solution for a metric tensor requires boundary conditions, which in turn depend on the test particles' momenta. For technical details, Chapter 21 of Misner, Thorne, and Wheeler (1973) is recommended. The point here is that the essential nature of spacetime must be compatible with the need for different test particles passing arbitrarily closely to each other to have different time directions, and this precludes any fundamental difference between time directions and space directions, very much unlike the situation in Quantum Mechanics. There are even solutions to the field equations of General Relativity in which one cannot define "future" and "past" time directions continuously along certain geodesics, although such spacetimes are widely considered physically unrealistic because they are not simply connected (e.g., toroidal topologies such as Einstein-Rosen bridges, also called wormholes), although some members of the physics community do not in fact rule these out as unphysical. General Relativity has its own problem of time. Beginning with Einstein, many physicists have accepted as ontologically valid the concept now known as the "block Universe", a spacetime containing the entire Universe. Since time is one of the dimensions, the block Universe contains the complete history of all physical events. The amount of time spanned by this history may be infinite or finite, depending on whether the Universe has both a beginning and an end. The detailed history contained in the block Universe cannot be computed by human means, but many physicists regard this history as simply unknown but nevertheless existing in reality. In this interpretation, because all the physical laws are time-symmetric, there is no meaningful distinction between past and future. The perception of time flowing through a point labeled "now", or a point labeled "now" flowing along the time direction, is an illusion. The Universe is a book already written. The fact that we seem to experience our own chronological progress at each instant while we read through this book follows from the fact that this impression exists at every event along our "world line", our trajectory through spacetime, which exists like a graph of a projectile's height vs. time, i.e., the complete curve sits on the page in its entirety, not moving, merely existing. In other words, we are not really even reading through the book, the instance of "us" at each event just has that false perception. As one might expect, not everyone subscribes to this interpretation of the human experience.
For example, if time symmetry eliminates any distinction between past and future, then why do we remember the past but not the future? Is the formation of a memory similar to the process of an egg falling to the floor and breaking, which could conceivably happen in reverse? According to classical physics, all the microscopic events involved obey time-symmetric laws. The fact that the spontaneous reassembly of an egg never happens in our experience follows from the vanishingly small probability for the reverse process to follow exactly the right path through phase space. But the breaking of an egg involves an increase in entropy. For entropy to decrease spontaneously is impossible according to classical Thermodynamics, whereas in classical Statistical Mechanics it is possible but extremely
unlikely except for systems very close to equilibrium, where entropy fluctuates near the maximum thermodynamic value. Are memories of the past simply more probable than memories of the future? The egg argument suggests that the Universe began in a state of miraculously small entropy, and this is considered a weakness of the argument that the "arrow of time" is determined by spontaneous entropy changes. The proponents of that argument point out that solutions to physics problems are not normally required to justify their boundary conditions. In any case, the existence of memories of the past but not the future seriously erodes the plausibility of the claim that classical time symmetry makes past and future indistinguishable, but since the claim is supported by classical physics, the author's view is that this is another indication of the need to include nonclassical physics at a more fundamental theoretical level. As with the questions of free will, morality, and determinism that were discussed in sections 5.12, 5.13, 5.15, and 5.16, one must arrive at one's own working hypothesis on the basis of metaphysical preferences. If rejecting the block Universe is to be more than an empty gesture, we must assume that we have the freedom of will to make that choice, which is what we would like to prove. Of course, one may still be an automaton suffering the delusion of making a free choice, but as before, no one who understands what is actually going on can place blame for the mere act of going through pre-ordained motions. The author's conclusion once again is that the preferable course is to embrace the hypothesis that preserves one's reason for engaging in scientific activity in the first place. Among those who reject the block Universe it is generally agreed that one requirement for a Quantum Gravity Theory is to present a mechanism that generates a special instant of time that we recognize as "now". We will give brief descriptions of several such models below, but we will not be able to go very deeply into any of the proposed formalisms because of the sheer enormity of the collection, which ranges from highly developed and generally respected proposals to qualitative sketches of mechanistic models, descending from there into fringe speculations that abound in many science blogs on the internet. The more well-known approaches will be described in short separate sections, and several of the less well-known ones that the author found interesting will be treated together in the next section. The list of approaches will be highly incomplete, because there are so many with which the author is unfamiliar, and even a grasp of every single one would still leave open the question of where to draw the line to limit the scope to worthwhile ideas. Perhaps a severely limited scope is more acceptable when applied to formalisms that are universally incomplete. Here we can hope only to give some flavor of the attempts that are currently underway and still in the running. We will not include theories of gravity that are purely competitors of General Relativity, for example, Brans-Dicke-Jordan scalar-tensor theories. Such theories are important and interesting but are not directly relevant to the search for Quantum Gravity. Some goals for Quantum Gravity are more ambitious than others. The grand unified theories of particle physics seek not only to make the various interactions mutually compatible but also to merge them into a single interaction at certain energy scales.
This has worked very well for electricity, magnetism, and the weak nuclear interaction that underlies nucleon decay and atomic fission processes. Maxwell's Equations embody the unification of electricity and magnetism, and it has been shown convincingly that electromagnetism and the weak interaction are facets of the same process above an energy of about 246 GeV described by electroweak theory. Formalisms exist that unify these processes with the strong nuclear interaction at an energy of about 10¹⁶ GeV, but none of these have been verified experimentally. It has been suggested that unification of these processes with gravity might occur near the Planck scale, about 3×10¹⁹ GeV (see Appendix H). A unification of gravity with the other
interactions in this sense would be an interesting feature of any Quantum Gravity Theory that implies it, and seeking it might provide useful clues, but it is not one of the author's requirements. This summary of reasons for the desirability of a Quantum Gravity Theory is necessarily limited by our scope. For a more thorough discussion of why we need such a theory and the pitfalls that have been encountered while seeking it, the reader is encouraged to consult Chapter 14 of Wald (1984).

6.4 Miscellaneous Approaches

One of the earliest attempts to incorporate particle spin directly in General Relativity was that of Cartan (1922). This is one of several approaches that involve dropping the torsion-free constraint that requires the right-hand side of Equation 6.52 (p. 345) to be zero. This causes some symmetries to be lost, including that of the metric tensor. Einstein (1928) tried to use torsion to unify gravity and electromagnetism, hoping that he could simultaneously encompass Quantum Mechanics without the randomness he found so abhorrent. The loss of symmetry in the metric tensor brought a blessing and a curse: with the metric tensor now having 16 independent elements, there was room to accommodate electromagnetic components, but the coupled system of nonlinear second-order partial differential equations became larger and hence even more complicated and difficult to solve. Einstein eventually abandoned the effort when this avenue did not lead to a way to make Quantum Mechanics purely deterministic, no compelling way to tie the electromagnetic field to the extra six components of the metric tensor could be found, and he was also troubled by the fact that his new field equations were no longer compatible with the Schwarzschild solution. These efforts are referred to as Einstein-Cartan Theory, and one reason why they fall short of Quantum Gravity is that their formulation coalesced before Quantum Field Theory was developed. Work to update this approach was performed independently by Kibble (1961) and Sciama (1964), resulting in what is now known as Einstein-Cartan-Kibble-Sciama Theory. This remains a deterministic classical theory needing some way to incorporate quantum randomness and coherent states, goals which are still considered achievable in principle. Among its virtues is the ability to make fermions extended rather than pointlike, which eliminates many problems caused by singularities, but as simply a more complete classical theory of gravity, it has not been a dominant force in attracting researchers because, compared to General Relativity, not enough additional predictive power has been found to justify the additional difficulties. As Einstein was working on including torsion in General Relativity, the mathematician Theodor Kaluza was working on expanding General Relativity to encompass five dimensions, the new one being a spatial dimension "compactified" via intrinsic circular curvature that made it unobservable via ordinary means. This gave the metric tensor 25 elements, of which symmetry leaves 15 independent, hence five new degrees of freedom to use for electromagnetism. Kaluza found that he could indeed arrange for Maxwell's Equations to emerge from four of these five new components. The physicist Oskar Klein extended this approach to include Quantum Mechanics as understood in the year 1926.
Kaluza had given no physical interpretation of the fifth dimension, but Klein showed that it could be identified as the source of electric charge arising from standing waves in the circular curvature. This allowed him to calculate the radius of curvature, for which he obtained 10⁻³² m, explaining why the new spatial direction was so compact. Einstein initially admired this approach but was uncomfortable with the physical interpretation and with the fact that electromagnetism seemed mostly just concatenated
onto the gravity. He continued on his own path, attempting further generalizations and alternatives to his torsion approach, never arriving at anything acceptable to the general physics community. Kaluza-Klein theory remains of interest, although experimental verification of the fifth dimension has remained beyond the reach of current laboratory capabilities, and its field equations have been exceedingly difficult to work with, though the advent of modern mathematical software for symbolic manipulation has brought some success in this area.

Kleinert (1987, 2010) extended the analysis of Einstein-Cartan spacetimes in a context of crystal lattices and showed that a similarity exists between "defects" in crystals and curvature in spacetime. He employed a lattice spacing on the order of the Planck length and developed the formalism into what is now called the World Crystal, a model of the Universe in which material particles arise from crystal defects whose curvature produces gravitational interactions in the same way as in General Relativity. As formulated, torsion can be present or absent according to the choice of a gauge-fixing function ("gauge" is a kind of symmetry and is discussed briefly in section 6.5 below). Problems coupling the torsion to particle spins without creating adverse effects (e.g., nonzero rest mass for the photon) led Kleinert to employ a gauge for which the torsion vanishes. The significance here is the appearance of discrete physical elements at the Planck scale constituting the foundation of physical reality. Danielewski (2010) extended the model, which he calls the Planck-Kleinert Crystal, by adding processes such as viscosity and diffusion of the crystal defects. After assigning values to fundamental constants (e.g., crystal element masses, internal process frequencies, diffusion coefficients, etc.), he was able to show that longitudinal and transverse waves propagate in the crystal, with the former allowing mechanical properties to be derived and the latter corresponding to electromagnetic waves. The collective behavior patterns of neighboring crystal elements, called Planck particles therein, can be interpreted as what are commonly called elementary particles (e.g., electrons, neutrinos, etc.), and analysis of the internal processes of the Planck particles allows the time-dependent Schrödinger Equation to be derived.

The approaches described so far all start with General Relativity and try to move it towards Quantum Mechanics. Some approaches go in the other direction. One of these is known as Topological Quantum Field Theory. This is an attempt to remove the problem of time mentioned in the previous section, and in fact to repair a more general problem with quantum field theories, their lack of background independence. A set of axioms for Topological Quantum Field Theory was established by Atiyah (1988), and many mathematical physicists have worked in this area, with most of the official recognition coming in the form of mathematics prizes. An excellent summary of Topological Quantum Field Theory has been given by Baez (2001).

A mathematical description of physical processes expressed in spacetime is said to be "background independent" when its formalism does not depend on the spacetime geometry or any fields present therein. For example, General Relativity is background independent: the formalism itself does not depend on which coordinate systems are used or any particular aspect of the spacetime curvature.
The geometrical properties of spacetime are dynamical elements of the theory. They do not comprise an independent stage on which the evolution of physical processes takes place. In general, quantum field theories assume the Minkowski metric, which is relativistic, but its geometry is independent of the physical processes taking place on the stage it provides, and that background stage is implicit in the formalism describing those processes. Another feature of General Relativity is that the local degrees of freedom in its dynamical elements propagate causally. Local degrees of freedom are independent physical parameters that collectively
define the state of the system. The causal propagation constraint must also be satisfied by any background-independent quantum field theory that includes gravity. Topological Quantum Field Theory attempts to endow quantum field theory with these two attributes: background independence and causally propagating local degrees of freedom. This is done by considering n-dimensional manifolds for which every sub-manifold of n−1 dimensions has an associated Hilbert space, a complex vector space whose vectors are in one-to-one correspondence with the system's possible states. For n = 4, each sub-manifold can be considered the "space" part of spacetime locally orthogonal to a time axis in a 3+1 decomposition. The focus is on the topology of the space (e.g., whether it is simply connected vs. multiply connected), not on the details of its metric field. The states in the Hilbert space are topological states, and the processes of interest are changes in topology.

Much of the early work in Topological Quantum Field Theory was devoted to the case n = 3 with a 2+1 decomposition. This case enjoys the property that there are no local degrees of freedom, allowing the complete focus to be on background independence. It was found that the Hilbert spaces for the 2-dimensional sub-manifolds could be represented as a collection of connected points on a sphere, where the connection distances in units of the Planck length were multiples of half-integers. This constitutes a network, with the state points being its nodes. The mathematical formalism for this borrowed heavily from the study of angular momentum, whose quantized values are half-integer multiples of ħ, and so the construction became known as a spin network. Spin networks were found to be useful in other approaches to Quantum Gravity, including String Theories and Loop Quantum Gravity, which are described in sections 6.6 and 6.7 below, respectively. When generalizing to the n = 4 case, the spin networks become spin foams, and the mathematical challenges increase enormously. The work to date has involved the disciplines known as higher-dimensional algebra and category theory, both of which are much more nontrivial than their names might suggest. Like String Theory, much of the interest is in the mathematical value of Topological Quantum Field Theory, which provides a hedge against the possibility that a complete Quantum Gravity Theory may never emerge from approaches of these types.

An attempt to derive a mathematical formalism from which Quantum Mechanics (including quantum field theories) and General Relativity emerge in the appropriate limits has been undertaken by Finster (2015). The starting point is the Hilbert state space itself. This state space, together with its set of all finite self-adjoint operators and a positive measure defined over them, is called the Causal Fermion System. Instead of axiomatizing both the state space and the physical space, the latter is derived from the former; its properties are not assumed a priori. For a given Hilbert space (i.e., defined with certain operators of physical interest), a Lagrangian and a least-action principle can be formulated in terms of the operator eigenvalues. Spacetime is defined in terms of the operators, which induce its topology, and from there causal structure is derived. A rich mathematical formalism is developed whose details are far beyond our scope. This formalism can be analyzed in various limiting cases and found to have correspondences to the systems familiar from relativity and quantum theory.
The author's preference among these approaches leans toward the Planck-Kleinert World Crystal, because it derives the various particle interactions and quantum effects from the properties of the lowest-level components of the crystal lattice, whereas the others have more of the character of mathematical constructions designed specifically so that the familiar gravitational and quantum formalisms can be extracted from them without really springing from a single fundamental source. Although progress continues to be made in all these approaches to Quantum Gravity, the survey of that general topic given by Isham (1989) remains relevant. Since some mention of gravitons should
be given before closing this section, we will base a brief discussion of them on that reference. Because the quantum field theories that describe electromagnetic, weak, and strong interactions involve field quanta which "mediate" the interactions, it is widely expected that Quantum Gravity will also have a field quantum. In the first three cases, these quanta are all spin-1 bosons, specifically photons, W and Z bosons, and gluons, respectively. The name commonly given to the anticipated quantum of the gravitational field is "graviton". Some formalisms attempting to arrive at a Quantum Gravity Theory (e.g., string theories) predict the existence of gravitons, but some properties that gravitons are expected to have can be deduced simply from general arguments based on common properties of the "force-carrying bosons" of existing quantum field theories. For example, the graviton cannot be a fermion, because fermions are known to be incapable of mediating an interaction corresponding to a static force. It cannot be a spin-0 boson, because these are the quanta of scalar fields, not tensor fields such as those encountered in General Relativity. It cannot be a spin-1 boson, because these cannot mediate an interaction corresponding to an attractive force between identical particles. Spins greater than 2 are ruled out by arguments similar to that against fermions. All that is left is a spin-2 boson. Spin-2 quanta arise mathematically in a natural way as quanta of ten-component fields, matching the number of independent components of the stress tensor, a correspondence widely viewed as too close to be coincidental. Gravitons are therefore expected to be spin-2 bosons. Furthermore, they are expected to have zero mass because the range of the gravitational interaction is unlimited, unlike the short range of the weak interaction, whose force-carrying bosons have nonzero mass. Gravitons are expected to be exceedingly difficult to detect directly, however, and the theories which predict them are as yet unproven. As a result, gravitons must be considered hypothetical. A Quantum Gravity Theory that does not predict them could not be judged invalid on that basis alone.

6.5 Canonical Quantum Gravity

Canonical quantization refers to quantizing the physical parameters encountered in the Hamiltonian formulation of classical physics. This involves defining fields in terms of operators with quantized eigenvalues that represent possible states of the physical system, a procedure known as second quantization. As we saw in Chapter 5, in certain situations, physical parameters such as energy, momentum, etc., are "quantized" in Quantum Mechanics. Their possible values no longer correspond in general to the set of all real numbers. In many cases of physical interest, they are restricted to values confined to a set of numbers called a discrete spectrum. But classical energy and momentum are not fundamentally fields. Second quantization is the quantization of fields themselves, such as the electromagnetic field, whose quantum is the photon. Operator fields are characteristic of quantum field theories, which treat "particles" as quantized excitations of the corresponding field. Thus electrons are excitations of the electron field (which is not the same as the electromagnetic field). Every known "particle" belongs to a field of which it is an excitation that is quantized and controlled by creation and annihilation operators (such as those introduced in Appendix K).
In this view, the electron field exists everywhere, but its quantized excitations exist only at certain somewhat localized positions and times. Essentially any function of position and time that is a solution of a constraining equation can be converted to an operator field whose excitations correspond to the classical notion of whatever is described by that function, including particle positions and momenta.
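As a small illustration of such creation and annihilation operators (a hypothetical NumPy sketch added here; it is not taken from the text or from Appendix K), one can build truncated matrix representations in the number basis and confirm that the excitations are counted in integer steps:

import numpy as np

N = 8                                        # truncate the (infinite) number basis at N states
a = np.diag(np.sqrt(np.arange(1, N)), k=1)   # annihilation operator: a|n> = sqrt(n)|n-1>
a_dag = a.conj().T                           # creation operator

print(np.round(np.diag(a_dag @ a)))          # [0. 1. 2. ... 7.]: quantized excitation numbers

# the commutator [a, a_dag] equals the identity away from the truncation edge
comm = a @ a_dag - a_dag @ a
print(np.round(np.diag(comm))[:-1])          # [1. 1. 1. 1. 1. 1. 1.]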
The word "canonical" refers to the role played by Hamilton's canonical equations, Equation 5.20 (reproduced below as Equation 6.61), which are used to represent the physical situation of interest by assigning a corresponding definition to the Hamiltonian function (see section 5.6). Dirac (1925) established canonical quantization in his work that led to Quantum Electrodynamics. Hamilton's equations were introduced in Chapter 5 only to set the backdrop for a qualitative narrative of Schrödinger's conceptualization of a wave-particle duality that led to his wave equations (Equations 5.36 and 5.38, pp. 254-255). Our scope did not permit further elaboration there, nor does it allow very much here, but one aspect of Hamiltonian dynamics must be mentioned because of its relevance to the way in which Quantum Mechanics enters into Canonical Quantum Gravity, namely Poisson brackets. A detailed discussion of Hamiltonian dynamics can be found in any standard textbook on classical mechanics, e.g., Goldstein (1959).

Recall from section 5.8 that any two physical variables whose product has dimensions of action are called canonically conjugate variables. Action has the dimensions mass×length²/time, the same as angular momentum and Planck's constant. Two frequently encountered canonically conjugate variable pairs are: (a.) energy and time; (b.) momentum and position. The latter pair are denoted p and q respectively in Hamiltonian dynamics, and they are related to the Hamiltonian H and the Lagrangian L (see section 5.6) by the canonical equations shown in Equation 5.20 and reproduced below.
$$
\begin{aligned}
\frac{dp}{dt} &= -\frac{\partial H}{\partial q} \\
\frac{dq}{dt} &= \frac{\partial H}{\partial p} \\
\frac{\partial H}{\partial t} &= -\frac{\partial L}{\partial t}
\end{aligned}
\qquad (6.61)
$$
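To make Equation 6.61 concrete, the following minimal symbolic sketch (using Python's SymPy library; it is an added illustration, not from the original text) applies the first two canonical equations to a one-dimensional harmonic oscillator and recovers the familiar equations of motion:

import sympy as sp

q, p = sp.symbols('q p', real=True)
m, k = sp.symbols('m k', positive=True)

# Hamiltonian of a one-dimensional harmonic oscillator
H = p**2 / (2*m) + k*q**2 / 2

dq_dt = sp.diff(H, p)     # dq/dt =  dH/dp  ->  p/m  (the usual velocity)
dp_dt = -sp.diff(H, q)    # dp/dt = -dH/dq  -> -k*q  (Hooke's-law force)

print(dq_dt, dp_dt)       # p/m   -k*q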
The variables p and q are called canonical coordinates. Any coordinate pair for which the canonical equations are satisfied is a pair of canonical coordinates that is related to the p and q pair and all other such pairs through canonical transformations (also called contact transformations when they include transformation of the time parameter or when time plays no role). The context of Hamiltonian dynamics is that of a phase space such as we encountered in section 3.3, a space in which the components of the p and q vectors are each defined on their own dedicated coordinate axes. Thus for a system of N particles in a three-dimensional space, the phase space has 6N dimensions, and a single point in this space defines a complete state of the system. The purpose of Hamiltonian dynamics is to determine the path of the system through the phase space as the system evolves in time.

Poisson brackets provide an alternative way to represent the canonical equations, and their definition employs a separate counter for the p components and q components, with the total of each being equal and denoted n. For the 6N-dimensional phase space above, we have n = 3N, and the first two lines shown in Equation 6.61 above are actually a set of n equations when represented in component form. When taken over into General Relativity, the 3+1 space+time becomes four-dimensional spacetime defined on a manifold with a pseudo-Riemannian metric. The important points for our brief summary are: (a.) there is a one-to-one correspondence between the system trajectory in phase space and the paths of the particles through spacetime; (b.) the way spacetime is treated retains a dependence on
the 3+1 decomposition. The idea is to supplant the phase space with a quantized field by replacing p and q with their corresponding operators p̂ and q̂, establishing the appropriate rules governing these operators, and computing the effects on the spacetime. The process of doing this is facilitated by the use of Poisson brackets, which are defined as follows for any two functions f(p, q) and g(p, q) defined on the phase space:
$$
\{f, g\} \equiv \sum_{i=1}^{n} \left( \frac{\partial f}{\partial q_i}\frac{\partial g}{\partial p_i} - \frac{\partial f}{\partial p_i}\frac{\partial g}{\partial q_i} \right)
\qquad (6.62)
$$
where we use braces to denote the Poisson bracket for f and g (besides braces, “square” brackets and parentheses are both commonly encountered, but parentheses are not as distinctive, and we wish to reserve the square bracket character for the quantum-mechanical commutator below). The following equations can be derived straightforwardly from Equations 6.61 and 6.62:
$$
\begin{aligned}
\{q_i, p_j\} &= \delta_{ij} \\
\frac{dq_i}{dt} &= \{q_i, H\} \\
\frac{dp_i}{dt} &= \{p_i, H\}
\end{aligned}
\qquad (6.63)
$$
where δij is the Kronecker delta (equal to 1 when i = j and 0 otherwise). The first two lines of Equation 6.61 can be replaced by the last two lines of Equation 6.63 to obtain a much more symmetric representation of the canonical equations. This may not look like much of an improvement, but the algebra of Poisson brackets is highly developed, and this representation allows Hamiltonian dynamics to take advantage of that. Instead of attempting to quantize the metric and/or tensor fields of General Relativity directly, Canonical Quantum Gravity quantizes the phase space applicable to General Relativity and formulates the consequences for the metric and stress tensor fields.

Dirac's method of canonical quantization was developed further by a number of theorists. Some of the most influential work in applying it to gravity was done by Arnowitt, Deser, and Misner (ADM, 1959) and DeWitt (1967). The ADM formalism was developed for more general reasons than deriving a Quantum Gravity Theory, but its use of canonical coordinates made it especially appropriate for this purpose. Using the Hamiltonian formalism as an entry point for Quantum Mechanics into General Relativity involves replacing the classical phase space with a Hilbert state space such as we encountered in section 5.15, a complex vector space, possibly infinite-dimensional, whose axes correspond to possible states of the system. Classical trajectories of the system through phase space become time-dependent probability-amplitude distributions over the Hilbert space. The change from p, q, and H to their corresponding operators p̂, q̂, and Ĥ begins by replacing the first line of Equation 6.63 with Equation 5.45 (p. 264), changing the notation for the space coordinate from x to q, and using the definition of the commutator given in section 5.8, [q̂, p̂] ≡ q̂p̂ − p̂q̂ in the new notation:
$$
[\hat{q}_i, \hat{p}_j] = i\hbar\,\delta_{ij}
\qquad (6.64)
$$
We should note that there are less ad hoc ways to arrive at this (e.g., using Lie integration) that are beyond our scope. We should also note that, as with the famous "Dirac Delta Function", mathematicians have voiced reservations about the rigor of Dirac's approach, pointing out that Equation 6.64 assumes certain properties for the canonical coordinates that are absent under more general conditions. In the well-known cases employing this method, however (e.g., the Linear Harmonic Oscillator; see Appendix K), the needed properties are present. As in section 5.6 (but in our current notation), we have q̂ = q, Ĥ still corresponds to the total energy E (i.e., the eigenvalues of Ĥ are the energy eigenvalues of the system), and
$$
\hat{p} = -i\hbar\,\frac{\partial}{\partial q}, \qquad
\hat{H} = i\hbar\,\frac{\partial}{\partial t}
\qquad (6.65)
$$
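As a concreteness check on Equations 6.62 through 6.65 (a minimal SymPy sketch added for illustration; it is not part of the original text), one can verify both the classical bracket {q, p} = 1 and its quantum counterpart [q̂, p̂]ψ = iħψ, with p̂ represented as −iħ ∂/∂q acting on an arbitrary test function ψ(q):

import sympy as sp

q, p, hbar = sp.symbols('q p hbar', real=True)
psi = sp.Function('psi')(q)            # arbitrary test wave function

# classical Poisson bracket for a single degree of freedom (Equation 6.62)
def poisson(f, g):
    return sp.diff(f, q)*sp.diff(g, p) - sp.diff(f, p)*sp.diff(g, q)

print(poisson(q, p))                   # 1, i.e. {q, p} = 1

# quantum commutator [q^, p^] acting on psi, with p^ = -i*hbar*d/dq (Equation 6.65)
def p_hat(f):
    return -sp.I*hbar*sp.diff(f, q)

print(sp.simplify(q*p_hat(psi) - p_hat(q*psi)))   # I*hbar*psi(q), i.e. Equation 6.64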
Once the Hamiltonian has been written as the total energy of the system, obtaining a solution in the form of a wave function in the position representation (as throughout Chapter 5) involves solving the Schrödinger equations (Equations 5.36 and 5.38). As always when solving differential equations, boundary conditions are required in order to make the system solvable. In addition, the Hamiltonian must be constrained to prevent solutions from depending unphysically on such things as simple coordinate translations. In other words, the system must exhibit gauge invariance, where "gauge" refers to redundant degrees of freedom (a simple example is where to place the origin of a coordinate axis). Because General Relativity is formulated in a generally covariant manner, gauge invariance takes the form of diffeomorphism invariance. A diffeomorphism is an invertible function that provides a smooth mapping between two differentiable manifolds.

Gauge invariances are important in physics because they are related to conservation laws through Noether's Theorem, which establishes a correspondence between conservation laws and differentiable symmetries of physical systems. For example, in classical mechanics, invariance of the Lagrangian under translation of the time variable implies conservation of energy, and invariance under translation of the position variables implies momentum conservation. But as we saw in section 6.2 when we observed that in General Relativity "the notion of 'widely separated but mutually at rest' loses its meaning", the definition of a coordinate translation that is straightforward in Euclidean geometry gets murky in Riemannian geometry. The closest thing to a Euclidean coordinate translation in General Relativity is a translation in the tangent space of a locally Euclidean system, and in general one cannot go very far in this tangent space before leaving the locally Euclidean system. To be relevant to the spacetime in which one is trying to translate a coordinate, one must include the effects of the curvature of that spacetime. In other words, the notion of a coordinate translation must be supplanted with the notion of a diffeomorphism. Gauge invariance must be expressed in terms of diffeomorphism invariance.

One of the main original goals of the ADM formalism was to clarify the gauge-theoretic aspects of General Relativity. It turned out to lend itself exceptionally well to numerical solutions of classical general-relativistic problems and became one of the most popular methods for that purpose, independently of offering an avenue for deriving a Quantum Gravity Theory.
We mentioned in section 5.6 that Hamiltonian dynamics derives Lagrange's equations via the Least Action Principle, which states that a physical system will evolve between two times in a way that minimizes the action. The action is the integral of the Lagrangian between those two times, so the principle states that the variation of that integral between the two times is zero. The "variation" differs from a simple derivative in that the latter would refer to the variable of integration, whereas the variation includes all path dependence of the function inside the integral, allowing the derivative of the integral with respect to path to be taken and set to zero. Denoting the Lagrangian L = T − V (where T is the total kinetic energy of the system, and V is its total potential energy), and given that the Lagrangian is traditionally treated as a function of q and dq/dt, the action S is defined as

$$
S[q(t)] = \int_{t_1}^{t_2} L\!\left(q(t), \frac{dq}{dt}\right) dt
\qquad (6.66)
$$

The least action principle states that of all possible paths through phase space that take the system from where it is at t1 to where it is at t2, the path taken is the one that minimizes S. This is an extremization problem for which the Calculus of Variations was developed (see, e.g., Goldstein, 1959, Chapter 2). When the formula for the Lagrangian is inserted into the integral, computing the variation and setting it to zero yields the formulas defining the laws of motion for the system. What is needed in order to make the canonical formalism applicable to General Relativity is therefore the appropriate Lagrangian. This was found by Hilbert in 1915, and the resulting formula for the action is known as the Einstein-Hilbert Action:
$$
S[q(t)] = \int \left[ \frac{c^4}{16\pi G}\, R\,\sqrt{g} + L_M \right] d^4x
\qquad (6.67)
$$
where R is the Ricci scalar (see Equation 6.56, p. 349), g is the determinant of the metric tensor, L_M is the Lagrangian term for any non-gravitational components that may be part of the system, and the integral is taken over all spacetime coordinates. The Einstein-Hilbert Action is also commonly written with √(−g); this is the case when the signature of the metric is the opposite of what we have been using, e.g., when the diagonal elements of the Minkowski metric have signs opposite to those shown in Equation 6.27. Both signatures are widely used. The one we have been using is denoted (+−−−), and the opposite is denoted (−+++). The latter requires √(−g) in the Einstein-Hilbert Action. Computing the variation of Equation 6.67 and setting it to zero yields Einstein's field equations (Equation 6.60). For full details, see Arnowitt, Deser, and Misner (1959, and their subsequent publications that form a series of articles on this subject). Transforming the classical canonical representation to the operator fields above (Equations 6.64 and 6.65) yields Canonical Quantum Gravity.

In order not to be misleading, we hasten to add that many technical details remain to be resolved before Canonical Quantum Gravity can be said to satisfy the need for a Quantum Gravity Theory. Because the use of canonical coordinates and operators involves a fundamentally special treatment of the time variable, a 3+1 decomposition is implicit throughout, and the "problem of time" discussed in section 6.3 remains. There is disagreement on how much this implies a lack of background independence, and the questions it raises show that background independence is not something that is either present or absent, but rather something that can exist to varying degrees. The argument is made that Canonical
Quantum Gravity fully complies with the principle of general covariance, and that principle has apparently satisfied everyone that General Relativity itself is fully background-independent. Furthermore, it is common in General Relativity to decompose a spacetime into time-ordered sub-manifolds forming a stack of three-dimensional spaces (an example of what is called a foliation in differential geometry), and that is simply what Canonical Quantum Gravity imposes from the outset. In addition, since most definitions of causality require a special role for the time variable, some accommodation must be made for the eventual formal adoption of this assignment. One other criticism of Canonical Quantum Gravity is that it just makes General Relativity consistent with Quantum Mechanics; it does not unify the two in any way. As stated previously, unifying the theories of all known interactions is not on the author's list of absolute requirements, but it certainly would add confidence to see how all interactions proceed from a single source, and in this respect, the formalisms described in the next three sections have an advantage over Canonical Quantum Gravity.

6.6 String Theories

The title of this section uses the plural form because one of the most characteristic aspects of "String Theory" is that it is actually an extremely large family of theories with much in common but also a multiplicity of distinguishing peculiarities. One of the common components is the postulate that elementary particles are not pointlike but rather correspond to vibrations of a fundamental substance that has the form of a string embedded in a multi-dimensional space. The strings were originally thought of as having length but not thickness and spatially arranged either as open or closed into loops. Later developments involved "strings" with extensions in multiple dimensions called "branes". Another common constituent is the postulate of additional spatial dimensions. String theories tend to have many spatial dimensions beyond the usual three. These are typically compactified in a manner reminiscent of the fourth spatial dimension in Kaluza-Klein theory. The extra dimensions are required mathematically by consistency conditions imposed on the quantum-mechanical operators whose eigenvalues correspond to the string vibrational states. The original interest was in formalisms with operator definitions that result in behavior that reproduces observed scattering patterns of elementary particles.

String theories have been under development for over fifty years, beginning before the successful formulation of the Standard Model of Particle Physics (electroweak and strong-interaction theory). Since the mid-1960s, evidence had accumulated from inelastic proton collision experiments indicating that protons had internal structure and that the components were bound together by a force that grows stronger with distance and weaker with proximity, unlike electric, magnetic, and gravitational interactions. These components came to be called quarks, the subatomic particles that combine to form protons and neutrons (and other particles with nonzero mass known as baryons and mesons). Attempts to split up the quarks inside a proton were unsuccessful. It appeared that the farther apart the bound quarks were, the more energy it took to separate them farther. Before the quarks could be completely unbound, the energy being applied would cause new particles to form in the vacuum. This behavior was named quark confinement.
It was as if the attraction between quarks resembled the behavior of rubber bands that were so strong that snapping them replaced the entire physical situation. The earliest notion of strings was that they essentially were subatomic rubber-band-like connectors joining quarks to each other. Later developments implied that the quarks themselves corresponded
to vibrational states of the strings. But the component of the Standard Model known as Quantum Chromodynamics soon became the most successful theory of the strong interaction, supplanting string theory as a quark model. Before that solidified, however, string theory had gained a foothold among a group of theorists who saw the possibility of taking it to a wider set of applications. The quest to make the string model of elementary particles consistent with Special Relativity and Quantum Mechanics revealed the need for an embarrassingly large number of spatial dimensions, and some versions also required tachyons (faster-than-light particles). Early on, only strings corresponding to bosons could be fit into a self-consistent formalism. A way to include fermions was found by Pierre Ramond (1971) via the notion of supersymmetry, the postulate that for every boson there is a corresponding fermion and vice versa. This not only added fermions; it also removed the source of some sign inconsistencies whose circumvention had previously necessitated at least 26 dimensions. Supersymmetry reduced the required number to ten (nine spatial dimensions plus time), and as a welcome bonus, tachyons were no longer required.

In the decades that followed, the roots of string theory spawned an entire forest of trees with multitudes of branches. Developing all the formal variations involved many dead ends, backtracking, and provisional refitting, and like the activities themselves, the number of theorists who accomplished all this is much too large for us to list them all by name. We will focus therefore on the most influential instances of promising formalisms. It is noteworthy, however, that the history of string theory contains several chapters in which formalisms developed by physicists were found by mathematicians to be useful for purposes not directly related to physics. This cross-fertilization between string theory and pure mathematics has created a strong interest in the former among practitioners of the latter, giving considerable value to string theory even if it never gives birth to a true Quantum Gravity Theory.

The reason why so many current string theories involve six spatial dimensions in addition to the usual three is that a particular type of manifold was found to be extremely useful as a habitat in which strings could behave in ways that reproduce experimentally observed elementary particle behavior. These are called Calabi-Yau manifolds, and their description came about because of a conjecture by Eugenio Calabi (1954) regarding whether compact manifolds subject to certain constraints could have metrics corresponding to a Ricci-flat space, i.e., a generally curved Riemannian manifold in which the volume of a small wedge of a geodesic ball is the same as that of the corresponding ball in Euclidean space. A "ball" is an n-dimensional solid corresponding to the space enclosed by a spherical surface with n−1 dimensions. Just as a geodesic in a curved space can be loosely thought of as the closest thing to a straight line, a geodesic ball in a curved space is the closest thing to a ball in Euclidean space. In a Ricci-flat space, the geodesic ball is affected by positive curvature in some directions and negative curvature in others, with an arbitrarily small net effect on its wedge volumes relative to the corresponding Euclidean ball. Calabi's conjecture was that such manifolds could exist, and it was proved to be true by Shing-Tung Yau (1978), who set out expecting to prove it false.
The specific requirements on the manifolds are beyond our scope, but one important result of satisfying all the conditions is that these manifolds come only with n complex dimensions, hence 2n real dimensions. The dimensionality that yields the right number of vibrational degrees of freedom for string theories and satisfies the consistency requirements of supersymmetry is the case n = 3, hence the six additional dimensions. Their Ricci-flat quality makes them vacuum solutions of Einstein's field equations in the appropriate dimensionality, so they are consistent with General Relativity, and this in turn makes them potentially useful for Quantum Gravity. The Calabi requirements are discussed more completely in a manner intended for interested nonspecialists
in the book by Yau and Nadis (2010). The six-dimensional space of a Calabi-Yau manifold is connected to the ordinary three spatial dimensions in a manner that can be visualized by considering reduced-dimension cases. One can imagine a two-dimensional space such as an xy plane embedded in a flat three-dimensional space. In this case, the plane is a sub-manifold with its x and y directions extending to infinity. At every point in the plane, there is a z direction of the third dimension perpendicular to the plane, a straight line that extends to infinity in both directions away from the plane, because the three-dimensional space is flat. When looking down at the plane along such a line, one sees the line as a point, because it is a straight line, and one is looking directly down it. One can choose to look at the yz sub-manifold instead, and then one sees the z direction extending to infinity just as the x direction did in the previous picture. Although the xyz directions are mutually orthogonal, at a given xyz point, each direction has its own straight line that may be considered tangent to the point.

When considering a two-dimensional sub-manifold of the nine-dimensional space of a typical string theory, instead of taking two of the usual three dimensions, one can select one usual dimension and one Calabi-Yau dimension. We will call the latter the u direction. In this picture, the u direction does not extend to infinity as a straight line; it is compactified, i.e., rolled up. Every point in xyz is still connected to the u direction at a tangent point, and the rolled-up u direction loops around and comes back to that point. When viewing the ux sub-manifold (for example), the x direction appears as a straight line, and the u direction at a given point appears as a small loop tangent to the line. If at a given point one keeps the x coordinate constant and moves in the u direction, one will move around the loop and come back to the point where one started. One must keep in mind that although it may seem that some xyz coordinate must be changing as one goes around the loop, it is not: all xyz coordinates are specifically held constant, and all motion is in the u direction. The reason why it may seem that the motion in u induces a motion in xyz stems from a conceptual obstacle that must be avoided, namely picturing the u loop as if it were embedded in the xyz space. If it were, then the loops in the u directions at arbitrarily close xyz positions could intersect, like two parallel cylinders placed too closely to each other. Two cylindrical spaces intersecting each other would be an inconsistent geometry. The u loop is not embedded in the xyz space; it is an orthogonal direction at every point of the xyz space, and so the u loop at one point does not intersect any similar loops at other points, and motion confined to the u direction does not result in motion in any other direction.

A closed-loop string may or may not be wrapped around the u direction. In whatever space the string loop exists, since the time direction t is orthogonal to all the spatial directions, the time evolution of the string produces a tube in the ten-dimensional space. If the string is not closed, then a sheet results from the time evolution. To become realistic, the flat xyzt sub-manifold must be replaced with generally curved spacetime, and the simple tube traced out by a closed string becomes a bit more irregular because of the string's vibrations.
The tubes may split into multiple branches, and branches may merge into single tubes, and these forms of time evolution describe elementary-particle interactions. The corresponding reduced-dimension portrayals of particle interactions are very reminiscent of the Feynman diagrams of Quantum Electrodynamics, schematics of various ways in which electrons and photons can interact that all have to be added up with proper weights to obtain complete descriptions of processes like electron scattering. The Calabi-Yau dimensions are generally not rolled up into perfect cylinders. The Ricci-flat metrics provide room for a great deal of variety, and the string vibration patterns in the nine spatial dimensions can take on many forms depending on the metric and the specific definitions of the quantum-
mechanical operators. One of the drawbacks is that these models posit Calabi-Yau spaces rather than solve for their Ricci-flat metrics based on a ten-dimensional stress tensor. Such a tensor would contain 55 independent elements, so in principle one would have to solve a coupled system of 55 nonlinear second-order partial differential equations, a daunting task even with modern symbolic-math software. In actuality, since the Calabi-Yau manifolds are vacuum solutions, many elements of the stress tensor would be zero, but how Nature picks a particular vacuum solution for the Calabi-Yau sub-manifold of the ten-dimensional spacetime is not understood. Since the Calabi-Yau manifolds are known to be consistent with General Relativity ("tuned at the factory", so to speak), any given string theory can simply advocate a selected one and then explore the consequences for elementary-particle interactions. Some such theories come remarkably close to duplicating all of the Standard Model, an impressive feat. But the resulting space of possible models is immense. The number of distinct models is typically estimated at about 10⁵⁰⁰. This has become known as the string-theory landscape.

The enormity of this landscape is a source of dissatisfaction for many physicists because it raises doubts about whether any apparently successful string theory would be falsifiable, and it has fostered speculations advocating an anthropic nature of human experience, the idea that all of these models exist in some form of compound Universe, but only a few are amenable to consciousness as we know it. There are various forms of anthropic principles, but the common thread is that we experience the world we know because if it were otherwise, we would not exist. This has been criticized as explaining why we observe something rather than explaining what we observe as anything more than something that had to happen somewhere and was not incompatible with observability. On the positive side, many of these models are immune to the problem that plagues other quantum field theories, the need to renormalize in order to remove infinities that occur at various perturbation orders. This is another of the triumphs of string theory. On the negative side, the choice of a particular Calabi-Yau manifold constitutes the choice of a background, so these formalisms are not background-independent. Some of these theories are also troubled by predictions of supersymmetric particles that should have been observed but have not been, and some imply that the gravitational constant is not actually constant.

Different string theories may employ different gauge groups, families of continuous symmetries developed by the Norwegian mathematician Sophus Lie in the late 19th century. The word "group" has a special meaning in mathematics, where it refers to a set of elements governed by an operation that combines any two elements to form a third which is also an element of the set. In order to qualify as a group, the operation must be associative and have an inverse, and the set must contain an identity element. For example, if A, B, and C are elements of the group, then using the symbol ∘ to denote the operation, associativity means that (A ∘ B) ∘ C = A ∘ (B ∘ C), and if I is the identity element, then I ∘ A = A. The inverse of A, denoted A⁻¹, is defined such that A ∘ A⁻¹ = A⁻¹ ∘ A = I.
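As a minimal numerical illustration of these axioms (an added sketch; the choice of 2×2 plane rotation matrices, which form a group under matrix multiplication, is ours and not from the text):

import numpy as np

def rot(theta):
    # 2x2 rotation matrix; plane rotations form a group under matrix multiplication
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

A, B, C = rot(0.3), rot(1.1), rot(-0.7)
I = np.eye(2)

print(np.allclose((A @ B) @ C, A @ (B @ C)))   # associativity
print(np.allclose(I @ A, A))                   # identity element
print(np.allclose(A @ rot(-0.3), I))           # inverse: rot(-theta) undoes rot(theta)
print(np.allclose(A @ B, rot(0.3 + 1.1)))      # closure: the product is again a rotation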
We do not intend to delve deeply into these notions, but since the concept of a group will arise in several of the topics below, some minimal clarification of what constitutes a group is needed. Details concerning gauge groups and how they are used in string theories are beyond our scope, but we should mention that they also form part of the foundation of the Standard Model of Particle Physics. Gauge groups are concerned with continuous transformations of differentiable manifolds that preserve certain properties of the manifolds. As string theory in general matured, five models employing different combinations of gauge groups were developed and came to be recognized as the most promising for deriving a Quantum Gravity Theory. All of these are background-dependent, but
the hope has been that this blemish will eventually be removable. Some string theorists argue that the physics in string theories is intrinsically background-independent, but it is expressed in a background-dependent way only because the optimal mathematical description of the physics has not yet been found. Progress in that direction appears to be possible because of another mathematical concept known as duality whose usefulness in the context of string theory came into focus during the 1990s. It became evident that these five string theories may be intimately related and represent different asymptotic forms of a single theory which has come to be called M-Theory.

There are several forms of duality, but the common element is that two formalisms A and B may appear at face value to be unrelated, but there may exist an invertible mapping between them that allows solutions in one to be transformed into solutions in the other. A problem that is intractable in A may be related via duality to a problem that is relatively easily solved in B, and vice versa. So a roadblock in A may be circumvented by solving the corresponding problem in B and then mapping the solution to A. This is certainly convenient, but even more important is the fact that the existence of such a mapping between A and B indicates that they can both be subsumed into a unifying theory from which they can be extracted as special cases. The possible existence of such a unifying theory was introduced by Edward Witten (1995), who called it M-Theory but left the meaning of "M" unspecified. As described by Witten, M-Theory requires one more spatial dimension, giving spacetime 11 dimensions. One of its virtues is that it appears possible to make it background-independent. The appeal of M-Theory stirred great enthusiasm among string theorists and set off what is now called the Second Superstring Revolution. Much progress has been made, but the development of M-Theory remains largely incomplete.

Another remarkable duality was pointed out by Juan Maldacena (Maldacena et al., 1997; Maldacena, 1998). This is known as the Maldacena Duality and also as the AdS/CFT Correspondence, where "AdS" stands for "Anti-de Sitter", and "CFT" stands for "Conformal Field Theory". In section 6.1 we mentioned that Willem de Sitter found a cosmological vacuum solution to Einstein's field equations in 1917; this was a flat model based on the Robertson-Walker metric with a cosmological constant. It is also possible to construct a hyperboloidal vacuum solution, and such a manifold is called an Anti-de Sitter Space. As a solution of Einstein's field equations, it describes a gravitational field. A Conformal Field Theory is a field theory that is invariant under conformal transformations, i.e., transformations that preserve angles in a local sense (for example, a transformation from Cartesian coordinates to spherical coordinates, which both possess locally orthogonal axes). Although some work has been done to study the effects of embedding quantum field theories in curved spaces, none of the important quantum field theories (e.g., electroweak and Quantum Chromodynamics) explicitly make any reference to gravity. It was therefore quite unexpected that a duality should emerge between a field theory that includes gravity and one that includes only particle interactions.
While the AdS/CFT Correspondence has not been mathematically proved in general, it has been found useful in numerous specific applications, although some problems in quantum field theory (e.g., certain aspects of condensed matter physics) cannot take advantage of it because the field theories involved cannot be cast in a conformally invariant representation. But in some cases, the duality is between weak gravitational fields in the AdS and strong quantum fields in the CFT, and problems involving weak fields are typically easier to solve than problems involving strong fields. In such cases, the duality has shown itself to have practical applications. The duality exists for many string-theory
models and has been applied successfully to solve problems defined in those formalisms. At the present time it is the author's opinion that a successful Quantum Gravity Theory may lie somewhere in the string-theory landscape. The problem is that it may never be found, and if found, it may never be testable by falsifiability standards. These drawbacks are not unique to string theories, however, and it is not difficult to see why string theories cast a spell of unbreakable fascination on their advocates. For those who appreciate the concept of mathematical beauty, there is plenty to be found in string theories, and for those who adhere to the "shut up and calculate" dictum, string theories have provided many useful tools. For those who yearn for a deeper understanding of the fabric of the Universe, string theories take us past the unpalatable "point particles", albeit by introducing a new challenge to intuition, vibrating strings in invisible dimensions.

6.7 Loop Quantum Gravity and Causal Dynamical Triangulations

In section 6.4 we introduced the concept known as a spin network in the context of Topological Quantum Field Theory. In section 6.5 we described the Hamiltonian formulation of General Relativity by Arnowitt, Deser, and Misner (1959) and DeWitt (1967), along with the Einstein-Hilbert Action (Equation 6.67) and the extension of gauge invariance to diffeomorphism invariance. In section 6.6 we mentioned the use of gauge groups in constructing string theories. Together with the concept known as a spinor, these elements were extended and woven together by Abhay Ashtekar (1986) to cast General Relativity into a form that provided the foundation for a new approach to its reconciliation with Quantum Mechanics now known as Loop Quantum Gravity, commonly abbreviated LQG.

That list contains a concept that we have not previously encountered herein, namely "spinor", an element of a complex vector space. As with the Calabi-Yau spaces in the previous section, an n-dimensional complex vector space can be mapped to a 2n-dimensional real vector space. The original concept now known as a spinor was introduced in 1913 by Élie Cartan as part of his work on rotation groups. It was found to be extremely useful by Dirac and Pauli in their work involving particles with spin ½. It possesses peculiar rotation properties which distinguish it from a tensor, and this led to the name "spinor" in physical applications. These rotation properties are generally not intuitive to those familiar only with ordinary Euclidean rotations. For example, in Figure 6-2 (p. 323) we showed a primed coordinate system rotated 30° relative to an unprimed coordinate system. We did not specify how the 30° rotation was accomplished, e.g., it could have been the final result of three 10° rotations, or a 45° rotation followed by a −15° rotation, etc. It could even have resulted from rotating the starting system out of the page, then by various other rotations such that rotating back into the page produced the result shown. None of that would have mattered, because the geometrical properties of the points and vectors depend only on the final orientation, not the sequence of steps taken to get there. That is not true in general of spinors. Rotation operators can be defined in terms of spinors, and because complex vector spaces have some properties not found in real vector spaces, these rotations can have surprising results.
One of the most striking of these properties is the fact that rotating a spin-½ particle through 360° leaves its spin pointing opposite to the original direction. To restore the original spin direction, the rotation must be extended to a total of 720°. As illustrated in Figure 6-8, it is as though the spin vector were normal to the surface of a Möbius strip (a two-dimensional strip with a single twist of 180° about its long axis, then joined at the endpoints to form a loop). Keeping it normal to that surface and moving it
through 360° around the strip would bring it back to the starting location pointing in the opposite direction. Another trip around the strip would be needed to restore the original orientation. A closer analogy to the operation of a spinor would be to keep the foot of the vector fixed at a point in the three-dimensional space in which the Möbius strip is embedded, then rotate the strip about its center of mass while keeping the vector normal to the strip surface. These are just metaphors, however. In actuality, the Möbius strip is a nonorientable surface, and geometric visualizations of spinors in action are notoriously elusive.
Figure 6-8 A vector normal to the surface of a Möbius strip shown at successive positions A through I as it is moved 360° around the strip; when it arrives back at its starting location, it points in the direction opposite to its original direction.
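The sign reversal under a 360° rotation can be seen directly in the spin-½ rotation operator R(θ) = exp(−iθσz/2) (a short NumPy sketch added for illustration, not from the text): a rotation by 2π multiplies the spinor by −1, and only a rotation by 4π restores it exactly.

import numpy as np

def spin_half_rotation(theta):
    # rotation of a spin-1/2 state about z: R(theta) = exp(-i*theta*sigma_z/2);
    # sigma_z is diagonal, so the matrix exponential reduces to exponentials of its diagonal
    return np.diag(np.exp(-1j * theta * np.array([1.0, -1.0]) / 2))

spinor = np.array([1.0, 0.0], dtype=complex)   # "spin up" along z

print(spin_half_rotation(2*np.pi) @ spinor)    # approximately [-1, 0]: minus the original state
print(spin_half_rotation(4*np.pi) @ spinor)    # approximately [ 1, 0]: the original state restored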
Spinors now play a role in a great variety of mathematical and physical formalisms. Discussions of spinors are most straightforward in the context of Clifford algebras, which generalize a variety of number systems (e.g., real, complex, quaternions, etc., in addition to spinors). All of this is well beyond our scope, and so we must limit our discussion of Ashtekar's work to its qualitative aspects. For an excellent summary of spinors and Clifford algebras, see Penrose (2004, Chapter 11).

As we showed in section 6.5, Canonical Quantum Gravity is based on quantizing the canonical momentum and position variables p and q (see for example Equations 6.64 and 6.65), after which the Schrödinger Equation must be solved subject to constraints. One source of difficulty in applying this formalism is the fact that the constraint equations depend on the canonically conjugate variables in complicated ways. Ashtekar was able to transform the variables to a new kind based on spinors. This resulted in simplifying the constraint equations and opened the door to nonperturbative methods of solving them, bypassing the nonrenormalizability problem of other attempts to formulate Quantum Gravity. In addition, the new variables do not require a background structure but rather determine it. Thus spacetime is an emergent dynamical entity.

The notions that led to Loop Quantum Gravity were beginning to stir several decades before Ashtekar's work, however. In the previous section we saw that the early concept of strings involved rubber-band-like connectors binding quarks. This time the story begins in condensed matter theory, where closed loops of magnetic field lines play a role. In both cases, the concept behind the idiosyncratic name evolved from a relatively primitive state to something almost unrelated. The early "loop" image was based on the quantized loops of magnetic fields found in the BCS superconductor theory (Bardeen, Cooper, and Schrieffer, 1957). In section 6.1 we described how Faraday was led to the concept of "lines of force" to describe electric and magnetic fields in empty space. The magnetic lines of force were best described by closed loops that continued unbroken through the magnetized medium. In section 5.5 we mentioned the discovery of superconductivity, the surprising fact that near a temperature of absolute zero, instead of becoming immobile because of being frozen to the metal crystal of a conducting medium, electrons were able to flow without any resistance, a phenomenon that required Quantum Mechanics for its explanation. The quantum theory of superconducting metals was refined over the years, and closed loops of magnetic lines of force continued to play an important role. Powerful methods for describing their geometry in the solid state were developed, including the quantized nature of these loops resulting from quantization of the magnetic field inside a superconductor.

While string theory emerged from the refinement of the "rubber band" concept, a different approach for describing quark confinement was developed independently with quantized loops as its basis. Kenneth Wilson (1974) studied loop quantization using a "lattice" model, which in two dimensions is a bit like a checkerboard at the center of whose squares quarks could exist and be connected by forces that are quantized so that they act only along lines running vertically and horizontally through square centers while preserving gauge invariance. This kind of approach became known as lattice gauge theory.
Just as there is no clear consensus on the ontological status of the quantum-mechanical wave function despite its unambiguous mathematical formalism, there are differing interpretations of what physically corresponds to a lattice in the context of lattice gauge theory. In the Topological Quantum Field Theory described in section 6.4, the spin networks and spin foams are defined on the Hilbert space of topological states, not on spacetime directly. Wilson's "loops" describe magnetic flux lines lying on a lattice that is itself defined on spacetime. This discretizes the spacetime, eliminating the infinities that arise from operating on a continuous spacetime, and this property caught the attention
of theorists seeking a Theory of Quantum Gravity. But for Quantum Gravity, the lattice must be formulated as dynamical and time-evolving. It cannot be a fixed background structure. A key development in the evolution of these ideas into Loop Quantum Gravity was establishing that there is no a priori lattice; it is defined by the relationships between the loops that constitute the structure of space. This structure is described by a spin network formed from the loops and is subject to Einstein's field equations expressed in the form developed by Ashtekar. Unlike the loops in the quantum theory of superconductors, which describe electric and magnetic fields, and unlike the spin networks of Topological Quantum Field Theory, which define states in a Hilbert space, the spin networks of Loop Quantum Gravity describe areas and volumes in the discretized spacetime. Any given spin network defines a possible geometrical state of quantized spacetime. The edges of the network correspond to areas in units of LP², and the nodes correspond to volumes in units of LP³, where LP is the Planck length (see Appendix H). The Planck length defines the Planck scale, the distance domain at which General Relativity is said to "break down", i.e., is inconsistent with the known quantum behavior of Nature at such microscopic scales. Recall that the Hamiltonian form of General Relativity implicitly involves a 3+1 spacetime decomposition. Each node of the spin network at a given time can be thought of as corresponding to an indivisible "chunk" of 3-space (Rovelli, 2001), but the picture is not one of an existing discrete space onto which the spin network is mapped; it is the other way around. The spin network expresses relational aspects of the fundamental loops that form it, and the space emerges from the spin network. Thus the spin network is not located in space; location is not one of its parameters (ibid). The location of a given chunk of space is defined only in relative terms with respect to the neighboring chunks, and therefore the space formed by all the chunks does not constitute an a priori background. The spatial areas and volumes implicit in the spin network are eigenvalues of the operators defined in Ashtekar's quantized Hamiltonian formulation of General Relativity and have discrete spectra (Rovelli and Smolin, 1995). The volumes are not generally equal; they are simply quantized with a nonzero minimum size. The surfaces separating adjacent chunks have quantized areas. Bianchi (2008) constructed a length operator whose discrete eigenvalues are the lengths of curves formed by the intersections of the surfaces surrounding individual chunks. As the number of contiguous chunks of space becomes arbitrarily large, the loops form what is called a weave that appears to be a continuum at macroscopic levels. In the continuum limit of the weave, the geometrical description of space should become the same as that of General Relativity, but whether it does this has not yet been proven in general. Unlike string theories, Loop Quantum Gravity does not unify the gravitational interaction with electromagnetism or the strong and weak nuclear interactions, nor does it require additional spatial dimensions or supersymmetry, although such variations have been investigated. These aspects follow from the fact that it is formulated in four dimensions, which is considered by its adherents to be an advantage.
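To give a sense of the scales involved, the standard presentation of the LQG area operator assigns to a surface pierced by spin-network edges the eigenvalues A = 8πγLP² Σ √(j(j+1)), where the j are the half-integer spins labeling the piercing edges and γ is the Barbero-Immirzi parameter, a constant whose value is itself a subject of debate. The short Python sketch below simply evaluates this formula for a few small spin assignments; the value γ = 0.2375, sometimes quoted from black-hole entropy arguments, is used purely as a placeholder, and the reduced-constant convention is used for LP.

import math

G    = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
hbar = 1.055e-34        # reduced Planck constant, J s
c    = 2.998e8          # speed of light, m/s
lP   = math.sqrt(hbar * G / c**3)      # Planck length, reduced-constant convention
gamma = 0.2375                         # Barbero-Immirzi parameter (placeholder value)

def area_eigenvalue(spins):
    """LQG area eigenvalue for a surface punctured by edges carrying the given spins."""
    return 8 * math.pi * gamma * lP**2 * sum(math.sqrt(j * (j + 1)) for j in spins)

for spins in ([0.5], [1.0], [0.5, 0.5, 1.5]):
    print(spins, "->", area_eigenvalue(spins), "m^2")

The smallest nonzero area, a single j = 1/2 puncture, comes out of order 10⁻⁶⁹ m², which is one way of seeing why the granularity of space implied by these spectra is invisible at laboratory scales.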
The main achievement of Loop Quantum Gravity is providing a quantum field theory for gravity by introducing operators whose eigenvalues describe the microscopic structure of space. This creates constraints within which quantum theories of the other interactions must operate. Attempts to describe space as a structure of discrete elements had been made for decades before Loop Quantum Gravity became a serious field of study. Tullio Regge (1961) developed what is now known as the Regge Calculus, a way of constructing simplicial manifolds that satisfy Einstein's field equations. A simplicial manifold is a manifold consisting of simplexes, and a simplex is a geometrical
object embedded in an n-dimensional flat space constructed from triangles and having n+1 vertexes. For example, in two-dimensional space, a simplex is an ordinary triangle with three vertexes, and in three-dimensional space, it is a tetrahedron with four vertexes. Computer graphics artists commonly make use of simplex structures to approximate any desired shape to arbitrary accuracy. The Regge Calculus was part of the foundation of lattice gauge theory. Given a 3+1 manifold corresponding to a vacuum solution of General Relativity, the three-space can always be represented to arbitrary accuracy by a simplex construction. The process is called triangulation, and the Regge Calculus provides a method for doing this while preserving the metric. Curvature is defined within the Regge Calculus in terms of deficit angles. For example, on a two-dimensional surface, a deficit angle is the amount by which the sum of the angles formed by edges of triangles meeting at a common vertex falls short of 360°. The curvature of the surface is defined by the set of deficit angles at every vertex. While impossible to visualize, this can be extended to define the curvature of a three-dimensional space embedded in four dimensions, i.e., the spatial part of the 3+1 manifold. The Regge Calculus has been found to be an effective way to compute approximate solutions in General Relativity, but if the goal is to describe space as fundamentally granular, then the simplicial manifold is no longer the approximation; it is one way to model the actual granular space, and the continuous manifold is the approximation. The idea that fundamentally discrete spatial areas and volumes could be simplexes has been developed into a powerful approach to Quantum Gravity known as Causal Dynamical Triangulation (Loll, 1998; Ambjørn et al., 2002). The transition from a continuous to a discrete spacetime naturally presents some obstacles to cherished notions such as diffeomorphism invariance, since a diffeomorphism is a smooth mapping between differentiable manifolds. Its importance lies in its connection to conservation laws via Noether's Theorem. Different discretization schemes have different ways to alleviate this difficulty, such as defining the invariance on a strictly local basis, a basis already well woven into General Relativity. Loll (1998) discusses several aspects of this challenge, and we will quote two of her statements: "one may be able [to] recover the diffeomorphism invariance in a suitable continuum limit" and "One can try to interpret the lattice theory as a manifestly diffeomorphism-invariant construction, with the lattice representing an entire diffeomorphism equivalence class of lattices embedded in the continuum." In the case of Causal Dynamical Triangulations (CDT), the simplex-based formalism concerns only lengths, areas, volumes, and angles, not coordinates explicitly, and therefore diffeomorphism invariance can be claimed by the latter definition. Since these details are beyond our scope, we will simply mention that each approach to discretizing spacetime finds its own reconciliation with the issue of diffeomorphism invariance. The idea that such invariance is applicable only in the continuum limit is satisfactory to the author, whose heresies include the willingness to classify energy conservation as a result of the Law of Large Numbers that was mentioned in section 1.12. This opinion is consistent, for example, with viewing the Second Law of Thermodynamics as an approximation to Statistical Mechanics.
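A one-line calculation shows how little is lost in practice by this demotion of the Second Law from axiom to approximation. The probability that all N molecules of a gas are found by chance in, say, the left half of their container is 2⁻ᴺ; the short Python sketch below evaluates the base-10 logarithm of that probability for a few values of N, the last being Avogadro's number.

import math

def log10_prob_all_in_left_half(n_molecules):
    """log10 of the probability that every one of n independent molecules
    happens to be in the left half of its container at a given instant."""
    return -n_molecules * math.log10(2.0)

for n in (10, 100, 6.022e23):           # the last value is Avogadro's number
    print(f"N = {n:g}:  log10(probability) = {log10_prob_all_in_left_half(n):.3g}")

A probability whose logarithm is of order −10²³ is, for every practical purpose, a prohibition, which is why the idealized and the statistical statements of the Second Law are empirically indistinguishable even though only the latter is fundamental.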
The Second Law states that it is strictly impossible for the entropy of an isolated system to decrease spontaneously, while Statistical Mechanics says that such spontaneous decreases in entropy are merely extremely improbable in general and negligible in states of near equilibrium, but they do occur. The negligibility and extreme improbability follow from the large-number statistics of systems whose particle count is typically on the order of Avogadro's Constant, about 6×10²³ and clearly a "large number". Treating the Second Law of Thermodynamics as the Eleventh Commandment would preclude Statistical Mechanics, allowing an idealized mathematical approximation to obstruct progress toward
a more realistic physical theory. Any Quantum Gravity Theory whose foundation includes strict energy conservation as an axiom may turn out to be suffering from a similar impediment. With this in mind, the author is willing to accept diffeomorphism invariance as a local approximation and to have conservation principles and their associated invariant quantities take the form of emergent properties that arise asymptotically, not axiomatically. Two other issues that arise when going from a continuous spacetime to a discrete one are those of Lorentz invariance and Lorentz covariance. While the notion of “invariance” is fairly straightforward, we must be careful to distinguish among the various meanings of the term “covariance”. There is the sense involving correlated random variables (e.g., Equations 2.23, p. 67, and section 2.10 in general), and the sense in which tensors are classified as contravariant or covariant according to their transformation properties (Equations 6.34 and 6.35, pp. 333-334). None of this relates to Lorentz covariance. In section 6.2 we mentioned that Maxwell’s Equations are Lorentz-covariant (sometimes called form-invariant), which means that when these equations are expressed in a given inertial coordinate frame, applying a Lorentz transformation (Equation 6.2, p. 315) to express them in an inertial frame that is moving relative to the first leaves their form unchanged. Thus Maxwell’s Equations are compatible with Special Relativity in the sense that their form does not depend on which inertial reference frame is chosen to express them as long as Lorentz transformations are used to relate inertial frames to each other. The magnetic and electric fields described by the equations generally change under Lorentz transformation, but the mathematical relationships between them do not. In that same section we showed that the spacetime interval Δs (Equation 6.18, p. 325) retains its numerical value under Lorentz transformation, making it Lorentz invariant. To be Lorentz invariant is not a requirement for a quantity to be physically meaningful. For example, energy, clock rate, and many important quantities are not Lorentz invariant. But if a quantity that is Lorentz invariant in the continuous spacetime of Special Relativity is not Lorentz invariant in a discrete spacetime, it is generally considered that the theory incorporating that discrete spacetime is unacceptable unless a compelling reason can be given for relaxing that requirement. Similarly, doubts are cast on any theory whose mathematical formalism is not Lorentz-covariant, at least for describing equilibrium situations. The requirement for a viable Quantum Gravity Theory to support the foundational principles of General Relativity stems from the fact that otherwise it does not harmonize the latter with Quantum Mechanics but rather denies the need for such reconciliation. It is expected that both Quantum Mechanics and General Relativity will be extended with new concepts, but in ways that illuminate their existing principles without destroying them. By the same token, Quantum Gravity need not be more faithful to established principles than the existing theories are themselves, and General Relativity satisfies the definition of Lorentz covariance only in a local sense. While LQG does not have a “landscape” of models like that of string theory, there are different ways to construct LQG models (e.g., different choices of Lie groups to define gauge symmetries). 
Rovelli and Speziale (2011) have shown that among the degrees of freedom implicit in LQG, locally Lorentz-covariant formalisms exist. The issue of Lorentz invariance in LQG and CDT is less straightforward. Both involve spacetime with granular structure, and the argument has been made that light propagating through such a vacuum would exhibit dispersion effects, i.e., the speed of light in vacuum would depend on wavelength because different wavelengths would involve different interactions between the electromagnetic field and the spacetime structure. Different models of granular structure yield different dispersion relations, but in general such effects are to be expected. Assuming that the granular nature of spacetime is significant only near the Planck scale, however, dispersion effects would be significant only at very high energies.
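An order-of-magnitude estimate shows why. Under the simplest phenomenological ansatz, a correction to the photon's speed that is linear in the ratio of its energy to the Planck energy (a generic placeholder, not a prediction of any particular LQG or CDT model), the accumulated arrival-time shift over a distance D is roughly (E/EPlanck)(D/c). The Python sketch below evaluates this for an illustrative 10 GeV gamma ray traveling about a gigaparsec.

# Order-of-magnitude arrival-time shift for a Planck-suppressed, linear-in-energy
# modification of the photon dispersion relation: dt ~ (E / E_Planck) * (D / c).
E_photon_GeV = 10.0                    # illustrative gamma-ray energy
E_planck_GeV = 1.22e19                 # Planck energy in GeV
D_m          = 3.09e25                 # ~1 gigaparsec, in meters
c            = 2.998e8                 # m/s

delay_s = (E_photon_GeV / E_planck_GeV) * (D_m / c)
print(f"arrival-time shift ~ {delay_s:.2e} s over ~1 Gpc")   # of order 0.1 s

A shift of a few hundredths of a second accumulated over a billion parsecs must then be disentangled from the unknown intrinsic emission history of the source.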
At present, such effects have not been observed, but they are expected to be very difficult to measure experimentally (see, e.g., Myers and Pospelov, 2003). The issues concerning the necessity of Lorentz covariance and invariance are the subjects of ongoing analysis, and we will not pursue them further here. A key ingredient of the CDT formalism is the quantum-mechanical concept known as a path integral. In section 6.6 we mentioned Feynman diagrams, schematic graphs of each possible interaction between components of a quantum system. Quantum Mechanics is concerned with system transitions from a state A to a state B, where "state" may refer to location in space, energy levels in an atom, spin orientations, or any parameter that can describe an observable aspect of a physical system. Feynman developed a technique for computing the probability of such a transition by summing the probability amplitudes over all possible paths that lead from A to B in the appropriate state space, and this became known as the path integral method, one of several alternative formalisms of Quantum Mechanics. Hamiltonian dynamics plays a major role in Quantum Mechanics, including the path integral method. The Hamiltonian operator (Equation 5.30, p. 253) is also called the generator of time translations because, for example, applying it to the time-independent Schrödinger Equation (Equation 5.36, p. 254) produces the time-dependent Schrödinger Equation (Equation 5.38, p. 255), whose solution includes the time dependence of the system state. Hamiltonian dynamics depends on the Least Action Principle, also called Hamilton's Principle (see sections 5.6 and 6.5), which involves an integral of the Lagrangian called the action (ibid). Thus the path integral in CDT requires an appropriate action. Equation 6.67 shows the Einstein-Hilbert Action used in Canonical Quantum Gravity. In order to be useful in any theory incorporating discretized spacetime, it must be modified from its continuum form to be made compatible with the spatial properties involved. For CDT, this is the simplicial manifold, and the modification was accomplished by Regge as part of his development of the Regge Calculus. The use of the Regge Action in path integrals is the core of CDT, wherein the paths are piecewise linear (including the time parameter) and the states are the possible geometries of the simplicial manifold. As employed in CDT, the Regge Action takes the form of a sum of several terms that depend on the number and types of simplexes in the manifold, with each term scaled by a coupling constant. There are three of these, typically notated K₀, K₄, and Δ. K₀ is related to the inverse of the Newton constant, K₄ is related to the cosmological constant, and Δ is related to the difference between the lengths of time-like and space-like simplex links. We go into this much detail only because we will need these definitions in section 6.9. For a complete discussion, see Görlich (2010, especially Appendix A). The importance of K₀ and Δ for us is that varying them creates different phases of the spacetime, i.e., different geometrical relationships between the simplexes. One of these phases corresponds to a four-dimensional de Sitter spacetime, a matter-free form of the Robertson-Walker metric with a positive cosmological constant, hence it is called the "de Sitter phase" although it is spheroidal, not flat.
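In numerical practice the CDT path integral is evaluated by Monte Carlo simulation: the sum over geometries becomes a random walk through triangulations, with each proposed local change accepted or rejected according to the change it produces in the Wick-rotated action. The Python sketch below shows only that accept/reject logic, using a schematic stand-in for the Regge Action built from the simplex counts, couplings playing the roles of K₀, K₄, and Δ, and the volume-fixing term that actual simulations add; the true action distinguishes simplex types, and the true moves are geometric operations on the triangulation, so this is an illustration of the sampling idea and nothing more (see Görlich, 2010, for the genuine article).

import math
import random

random.seed(1)

# Schematic stand-in for the Wick-rotated Regge Action: a linear combination of
# simplex counts weighted by couplings like those discussed in the text, plus a
# volume-fixing term that keeps the number of 4-simplexes near a target value.
# The real CDT action distinguishes the two types of 4-simplexes; this does not.
def schematic_action(n0, n4, k0=2.2, k4=0.9, delta=0.6, v_target=200, eps=0.02):
    return -k0 * n0 + (k4 + delta) * n4 + eps * (n4 - v_target) ** 2

# Each "move" changes the vertex count n0 and 4-simplex count n4 together,
# loosely mimicking the fact that real CDT moves alter several counts at once.
MOVES = [(1, 6), (-1, -6), (0, 2), (0, -2)]

def metropolis_step(n0, n4):
    d0, d4 = random.choice(MOVES)
    if n0 + d0 < 1 or n4 + d4 < 1:
        return n0, n4                                    # reject unphysical counts
    dS = schematic_action(n0 + d0, n4 + d4) - schematic_action(n0, n4)
    if dS <= 0 or random.random() < math.exp(-dS):       # Metropolis rule
        return n0 + d0, n4 + d4
    return n0, n4

n0, n4 = 50, 200
for _ in range(20000):
    n0, n4 = metropolis_step(n0, n4)
print("final counts: N0 =", n0, " N4 =", n4)

Everything physically interesting lies in what the sketch omits: the catalogue of foliation-preserving moves, the correct bookkeeping of simplex types, and the measurement of observables on the sampled geometries.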
The de Sitter phase has strong similarities to three of the models described in the next section, and there are implications for the embedding problems discussed in the section after that. As with other Hamiltonian-based formalisms, a 3+1 spacetime foliation is implicit. This involves the time parameter being the proper time measured by a global clock, and the 3-space corresponding to each clock tick being a spacelike hypersurface of simultaneity in the corresponding reference system (e.g., as in the Robertson-Walker metric, Equation 6.58, p. 351). The 3-space at a given time is the simplicial manifold, which may take on many different geometries between the initial and final geometric states, which are separated by a time interval between two readings of the global clock. Causality is enforced by requiring each intermediate geometry to preserve the foliation. The path "integral"
is a sum of probability amplitudes over all possible distinct geometric evolutions that lead from the initial to the final state. The probability amplitude of a given geometry is proportional to exp(iS(t)), where S(t) is the Regge Action. It is natural to wonder whether the "chunks" of 3-space in LQG could have the form of the simplexes in CDT. As mentioned above, Bianchi (2008) developed a length operator for LQG and also studied its semiclassical behavior for a typical case for which such analysis is tractable. His remarks include the following: "the state obtained superposing spin network states ... actually appears to describe a configuration such that the chunk of space dual to the node n is peaked on the classical geometry of a flat regular tetrahedron"; "Numerical investigations indicate that the expectation value of the elementary length operator ... scales exactly as the classical length of an edge of the tetrahedron"; "This same semiclassical analysis can be extended to the case of a non-regular tetrahedron ... it would be interesting to show that the ... operator measured exactly the length of the edge ... shared by the triangular faces"; "A preliminary analysis has been performed and it confirms this expectation"; "This set of results strongly strengthens the relation between the quantum geometry of a spin network state and the classical simplicial geometry of piecewise-flat 3-metrics". It therefore appears plausible that a convergence of LQG and CDT is possible, and many theorists working on one also spend a considerable amount of time on the other. As with the other efforts, a great deal of work remains to be done on both LQG and CDT, but these are the approaches that the author finds most appealing, partly because this possibility of convergence extends to two of the models discussed in the next section. There are two other aspects of CDT that we must mention because of their relevance for the next two sections: some of the CDT models exhibit fractal-like spacetime structure on small scales (Loll, 1998; Görlich, 2010; Ambjørn et al., 2013), and the "flat" 3+1 spacetime employed is not Euclidean; it is Minkowski with a (−+++) signature.
6.8 Spacetime Phase Transitions, Chaotic Automata, and Block Universes
Since all the serious attempts to formulate a Quantum Theory of Gravity are incomplete, some tolerance for incompleteness is required unless one prefers to abandon the subject. In the author's opinion, the next desirable qualities for any acceptable formulation are: (a.) plausibility that completeness may someday be achieved; (b.) some appeal to physical intuition. In this section we will focus on item (b.), and the answer to the question of what is intuitive will admittedly be biased. Throughout this book we have stated a goal of understanding physical reality in terms of irreducible elements, including the possibility that irreducible randomness may exert an influence. We must accept the working hypothesis that such an understanding is humanly possible if this quest is to be meaningful. If such an understanding is ultimately proved to be impossible, then perhaps at least an understanding of why it is impossible may be achieved. But it is not likely that progress can be made without some evolution of human intuition. Quantum Mechanics has taught us that our macroscopic experience does not equip us with everything needed for an intuitive understanding of physical processes operating at the microscopic level.
The equations involved can be grasped completely as pure mathematics, but an intuitive physical picture of what the equations are describing is missing. Among other things, this means, for example, that we cannot continue to think of elementary "particles" as point masses or little nuggets of uniform mass density,
but rather we must consider them to be processes evolving in spacetime. Already in quantum field theories, particles are described as "excitations" of their corresponding fields. This is a step in the right direction but still well short of providing a satisfying intuitive picture of the processes taking place. What exactly is the physical constitution of the all-permeating fields whose excitations manifest themselves on the macroscopic scale as particles? Some may object that we are trying to reinstate a "classical" interpretation of the quantum-mechanical layer of physical reality. Verbal language is inherently ambiguous, so we must clarify that such a return to the past is not our goal. There was a time when "classical" intuition involved the Earth being flat and either infinite in extent or else limited by a precipice beyond which there was only some kind of void. For the people of that time, the idea that they were being held by something called "gravity" to the surface of an approximately spherical slightly wet spinning ball of rock hurtling at a speed of 30 km/s around a 4½-billion-year-old continuously detonating nuclear explosion would probably have been too terrifying to contemplate seriously as a description of reality. But today those facts are absorbed rather easily by school children. And when General Relativity introduced the notion that the space we inhabit has curvature (at least intrinsic, if not also extrinsic), that went beyond classical intuition, even though General Relativity is today classified as a "classical" theory. But the overwhelming experimental verification of General Relativity has required the intuition of physicists, and anyone else interested enough to become involved, to adapt to the notion of curved space. As discussed in the next section, this adaptation is not quite complete, but the effort has so far not been futile, and so we are only arguing for more progress in this general type of intuition expansion. In the last section we discussed formalisms in which spacetime is granular, composed of simplexes, or chunks, or "atoms of spacetime" (e.g., Sorkin, 1998) with sizes typically at the Planck scale. Although history contains episodes of erroneous identifications of irreducible elements (e.g., atoms turned out to be composed of protons, neutrons, and electrons; protons and neutrons turned out to be composed of quarks), such mistakes never brought the progress of science to a halt, and therefore it is reasonable to consider the hypothesis that these spacetime granules constitute the bottom layer of reductionist physical analysis. The hierarchy above this level may then be viewed as layers of processes generated by the properties of the granules. One problem is that the laboratory measurement of anything as small as these granules is made impossible by the fact that the energy density that must be focused to probe structure at a given size scale varies inversely with the size. A special feature of the Planck scale is that it is where the probing energy density is so great that it disrupts the spacetime structure. For example, determining the position of an object with a mass of 5.46×10⁻⁵ gm more accurately than one Planck length (4.05×10⁻³³ cm) would require enough energy to create another particle of that mass, and confining that mass within a region of two Planck lengths would cause a black hole to form. These are additional reasons to assume that the granules are irreducible.
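The figures just quoted are easy to check from the defining combinations of constants, and doing so also illustrates the convention issue discussed in Appendix H: the numbers in the text correspond to the ordinary Planck constant h, while the more commonly tabulated values use the reduced constant ħ. The short Python calculation below evaluates both.

import math

G    = 6.674e-11       # gravitational constant, m^3 kg^-1 s^-2
c    = 2.998e8         # speed of light, m/s
h    = 6.626e-34       # Planck constant, J s
hbar = h / (2 * math.pi)

for name, const in (("h", h), ("hbar", hbar)):
    length = math.sqrt(const * G / c**3)     # Planck length, m
    mass   = math.sqrt(const * c / G)        # Planck mass, kg
    time   = length / c                      # Planck time, s
    print(f"{name:>4}: L_P = {length*100:.3g} cm,  m_P = {mass*1000:.3g} g,  t_P = {time:.3g} s")

With h one obtains the 4.05×10⁻³³ cm and 5.46×10⁻⁵ gm quoted above; with ħ the familiar 1.62×10⁻³³ cm and 2.18×10⁻⁵ gm result, the two conventions differing by a factor of √(2π).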
Any study of the granules must therefore depend on examining the consequences of their properties for physical processes at larger size scales, and so the question of the uniqueness of any given theory will always be open. The plausibility of such theories must be judged based on the strength of their predictions, explanatory power, and the lack of close rivals. The hypothesis that the granules are irreducible constrains their possible properties. They cannot have any internal structure, since objects with internal structure are inherently reducible to the components of that structure. But having no internal structure limits the properties that the granules could possess, and this limits the processes that they can generate. It also limits the shapes that the granules could have, since the ability to take on different shapes implies some internal constitution. The simplest
shape free of all assumptions about internal composition is spherical. The size of these spheres is also constrained: it must be the same for all granules, because the ability to have different sizes implies subdivisibility of internal substance. Thus the diameter of a granule is the most fundamental length possible, and it is therefore plausible to associate it, at least tentatively, with the Planck length. The model that the author has spent the last 25 years working on (with admittedly very little publication because of the primitive state of its development) postulates that the granules have only one property: to be in one of two possible phases. In a manner somewhat analogous to supercooled water molecules, which may undergo a phase transition between liquid and solid states, the granules may undergo a phase transition between two states. As with the supercooled water, wherein molecules in the solid state have formed a crystal and thereby provide sites for molecules in the liquid state to attach themselves by undergoing the phase transition, spacetime granules in one state provide sites for granules in the other state to undergo the corresponding phase transition. But while a water molecule’s ability to undergo a phase transition depends on the properties of its constituent elements, the postulate that spacetime granules have no constituent elements rules out that sort of phase transition. The only two states available are “attached to other granules” and “not attached to other granules”. The only property that the lack of internal structure permits is that the granules may stick to each other once in contact. This assumes a continuous space within which the granules may move, but a discrete structure of attached granules, hence a discrete spacetime. Given that the granules that are attached to each other constitute the discrete spacetime, the others do not exist in the sense of being part of spacetime. Thus spacetime granules that exist provide sites where other spacetime granules can exist, but until they attach, they cannot be said to exist, because they are not part of the spacetime that we call the Universe. Whether they “exist” in some sense outside of the Universe cannot be known, since our knowledge is confined to things whose existence depends on being part of the Universe. But since they are at least potentially available to undergo the phase transition, then perhaps they can be viewed as inhabiting a reservoir of free granules in which the Universe is embedded, analogous to the supercooled water in which an ice crystal is embedded. Discussing the status of things not in the Universe is problematic. We will therefore view the phase transition as being a change from “doesn’t exist as part of the Universe” to “exists as part of the Universe”. The notion of a change of state implicitly involves a time dependence, hence a time parameter whose values index each discrete spacetime configuration. The normal view of four-dimensional spacetime is that it contains its own time parameter as one of its coordinates. The fact that the direction of the time axis in any given 3+1 decomposition of spacetime depends on the motion of the corresponding reference frame does not change this. Thus the time parameter involved in phase transitions leading to the formation of spacetime itself has the character of an external time. 
But since the change it describes is the process by which spacetime comes into existence, it may also be seen as necessarily external to spacetime, hence also something that doesn't exist within the Universe. The only manifestation of the phase transition within the Universe is the existence of new spacetime. This notion can be applied in various ways, but the one of most interest to the author, dubbed the Propagating Spacetime Phase Transition model, or PSPT, involves random aggregation of granules. It is well known (see, e.g., Wolfram, 2002, Chapter 7) that discrete random-aggregation models tend to yield assemblages that become ever more spherical (in whatever dimensionality applies) as the addition of new elements proceeds. Thus if the PSPT model begins with a single pair of attached granules in four-dimensional space, the attachment of more granules at random locations on the assemblage increases its size, and an initially irregularly shaped clump of spacetime would become more and more
spherical at the large scale, with the length fluctuations of radii in different directions yielding ever more negligible relative differences, i.e., ratios of different radii would approach unity. Of course, there would continue to be complicated structure on the fine scale at the surface of the assemblage. Since we are considering this process in four dimensions, that surface is three-dimensional and grows in volume with every new layer of granules formed by the phase transition. Depending on the nature of the sphere packing, each new layer has some discrete thickness, i.e., some quantized increase in the four-dimensional radius at each locus within the surface. This process bears a striking resemblance to the spherical case of the Robertson-Walker model of the Universe (Equation 6.58, p. 351). Each new layer provides a stratum in which the fine structure of spacetime can be slightly different from the previous one, thus supporting the evolution of physical processes. Each new layer therefore adds a tick to the clock keeping cosmic time, the t parameter in Equation 6.58. And each new layer has a larger volume than its predecessor, making the passage of cosmic time and the expansion of the Universe consequences of the same fundamental phenomenon, the propagating phase transition of spacetime. A key difference, however, is that the traditional concept of a spherical Robertson-Walker model is that of a “block universe” such as that discussed in section 6.3. The block universe contains the entire history of the Universe, in this case beginning almost immediately after the first instant of the Big Bang and lasting until cosmic time comes to an end, if it ever does. As discussed in section 6.3, this view omits any attempt to explain the psychological experience of “now”, a new moment being observed for the first time and continually “flowing” toward the future, other than to dismiss it as something that is common to every instant of time and involving a bias in the ability of consciousness to know the past but not the future. These issues, along with the problems raised for notions of free will and morality, were touched upon in section 6.3, and much more expanded discussions can be found in the literature, so we will not pursue them further here until we pick up the “block universe” topic again below. For now, we simply note that the aspects involving a pre-existing future are not present in the PSPT model. On the other hand, the PSPT model leaves the nature of consciousness as mysterious as ever. It is not clear that the model can be developed to the point at which consciousness can be seen to emerge from the phase transition, nor can that be ruled out. If consciousness is a manifestation of some agent that exists independently of spacetime, then perhaps its attention is simply drawn to the propagating phase transition in a manner similar to our tendency to focus our awareness on the region of intermediate phase in a row of falling dominoes. Although we are aware of the fallen tiles behind and the standing ones lying in wait ahead, it is the action taking place where the dominoes are in mid fall that fascinates us. Like a surfer shooting the curl of a perfect wave, consciousness may be swept along with the crest of the Propagating Spacetime Phase Transition as the restless flux of the current moment solidifies into the immutable past while pending future possibilities that can only be imagined sort themselves out to form the new, however fleeting, present. 
PSPT supports a role for randomness in this process, but the question of whether consciousness can influence outcomes otherwise defaulted to random selection remains open. Of course, the analogies to dominoes and surfing are incomplete. To be the same as PSPT, the water in front of the wave and the standing dominoes would not exist until the phase transition was upon them, and at that point their possible configurations would depend on the details of what had taken place leading up to their appearance on the scene. For example, when an ice crystal grows in supercooled water, certain imperfections form randomly, and these deviations from simple regular
crystal lattice structure evolve in each new layer of ice. This gives rise to "bubble tubes" in ice cubes. The form taken within the new layer by a given bubble process depends on the shape that the process has produced in the existing layer. In PSPT, this is analogous to the propagation and evolution of "particle" processes in time. The idea is that deformations (or "defects" as they are sometimes called in crystal theory) evolve in ways that yield not only particle propagation in time but also the interactions associated with electroweak, chromodynamic, and spacetime stress-tensor fields. In PSPT, the nonzero rest mass of an object does not cause spacetime curvature; it is the spacetime curvature accompanying a persistent particle process evolving in the local time direction. Such evolution is determined, at least stochastically, by whatever "laws" govern the transition of each granule as it attaches itself to the existing spacetime. So the crux of the matter is to determine these laws, and as in every other approach to Quantum Gravity, therein lies the rub. For PSPT, one clue is the postulate that only two states are available to spacetime granules. This aligns with other constraints derived independently (e.g., black hole entropy) and with progress in the field known as digital mechanics. The pattern formed by spherical granules in the spacetime "crystal" can be cast in the form of Voronoi cells, and those can be decomposed into simplexes. So the PSPT structure can be represented in the same simplicial-manifold form as that of CDT (Causal Dynamical Triangulation) in the previous section. Furthermore, there is already the hint of a connection between the "chunks" of spacetime in LQG (Loop Quantum Gravity) and CDT simplexes (Bianchi, 2008). Voronoi cell boundaries are the locus of points that are equidistant from the two nearest granule centers, so they are straight line segments in two dimensions, polygons in three dimensions, and polyhedra in four dimensions. Figure 6-9 shows a small random aggregation of granules in two dimensions with Voronoi cell boundaries.
Figure 6-9. An example of random aggregation in two dimensions; each granule is attached to at least one other granule at a random available location on each; Voronoi cell boundaries are shown with exterior cells open.
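Readers who wish to reproduce something like Figure 6-9 can do so with a few lines of Python: unit-diameter disks are attached one at a time at a random angle around a randomly chosen existing disk, placements that would overlap an existing disk are rejected, and the Voronoi diagram of the resulting centers is computed with scipy. This toy contains no physics and is only an illustrative stand-in for the PSPT aggregation; the disk count and random seed are arbitrary.

import numpy as np
from scipy.spatial import Voronoi

rng = np.random.default_rng(3)
DIAMETER = 1.0
centers = [np.array([0.0, 0.0]), np.array([DIAMETER, 0.0])]   # initial attached pair

def try_attach(centers):
    """Attach one new granule at a random spot on a randomly chosen granule,
    rejecting the placement if it would overlap any existing granule."""
    host = centers[rng.integers(len(centers))]
    angle = rng.uniform(0.0, 2.0 * np.pi)
    candidate = host + DIAMETER * np.array([np.cos(angle), np.sin(angle)])
    if all(np.linalg.norm(candidate - c) >= DIAMETER - 1e-9 for c in centers):
        centers.append(candidate)

while len(centers) < 60:
    try_attach(centers)

vor = Voronoi(np.array(centers))          # cell boundaries as in Figure 6-9
print(len(centers), "granules;", len(vor.vertices), "Voronoi vertices")

Plotting the result (scipy's voronoi_plot_2d does this directly) reproduces the qualitative appearance of Figure 6-9, and decomposing the cells further into simplexes is where the non-uniqueness discussed below enters.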
The CDT use of simplexes may need to be generalized, however, to geometric objects with more than three vertexes per face. The resolution of Voronoi cells into simplexes is generally non-unique, so many simplex states typically correspond to each collection of unresolved Voronoi cells, with some simplexes containing a granule, some not, and some part of a granule. "Part of a granule" violates the principle of granule indivisibility, and so complete Voronoi cells may be the minimal geometric object, at least to the extent that PSPT is involved. The LQG concept of spacetime "chunks" stems from the length/area/volume eigenvalues computed for the Ashtekar form of Canonical Quantum Gravity, and the evolution of the simplicial manifold of CDT is computed from path integrals over states with probabilities derived from the Regge action. Neither prescribes a clearly visualizable physical process that generates the actual spacetime states and their associated probabilities. That task can be seen as the goal of PSPT. So these three descriptions may be just three blind persons' perceptions of the same elephant, and we will see below that there is yet another formalism that may be a fourth impression of that anatomy. As we mentioned in section 3.5, Bekenstein (1973) derived an expression for the entropy of a black hole by postulating that when the smallest possible particle falls through the event horizon, the Shannon entropy increases by the smallest possible amount, ln 2. Since a black hole is an object so massive that even light cannot escape it, all information regarding the state of the particle is lost, thereby increasing the entropy. He also argued that since both classical thermodynamic entropy and classical black-hole event horizons can only grow larger with time, there must be some proportionality relationship between them, and he found that the simplest possible relationship that does not lead to contradictions is a linear model. Since Shannon entropy is dimensionless, whereas the event-horizon area has dimensions of length², the proportionality constant multiplying the area must have dimensions of length⁻², and taking the only truly fundamental length in Nature to be the Planck length LP, the proportionality constant must include 1/LP². While this derivation must be conceded to be somewhat heuristic, it has some support from Bekenstein's demonstration that under various typical scenarios it is consistent with the Second Law of Thermodynamics (that entropy is nondecreasing in closed systems) to within vacuum fluctuations at the event horizon. It also represents one of the few clues available in the search for Quantum Gravity, and so it has been taken quite seriously (e.g., Krasnov, 1996; Sorkin, 1998; Meissner, 2004; Rovelli, 2008; Carlip, 2015). Unfortunately, given the lack of unanimity regarding the definition of the Planck length (i.e., whether the reduced Planck constant should be used; see Appendix H), progress in relating entropy to actual microstate probability has been elusive, and there are also obstacles other than the choice of LP definition. For example, the increase in event-horizon area depends on the classical radius of the particle that is swallowed, and it is not clear whether this smallest-possible radius should be the standard Compton wavelength or the "reduced" Compton wavelength, so the choice of which Planck constant to use enters again.
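For orientation, the conventional form of the result, with the proportionality constant fixed at one quarter and the reduced constant used in LP, is S = kA/(4LP²). The Python sketch below evaluates it for a black hole of one solar mass; it resolves none of the ambiguities just described and is included only to show the magnitude of the numbers involved.

import math

G     = 6.674e-11       # m^3 kg^-1 s^-2
c     = 2.998e8         # m/s
hbar  = 1.055e-34       # J s
M_sun = 1.989e30        # kg

lP2 = hbar * G / c**3                      # squared Planck length (reduced convention)
r_s = 2 * G * M_sun / c**2                 # Schwarzschild radius, about 2.95 km
area = 4 * math.pi * r_s**2                # event-horizon area
entropy_over_k = area / (4 * lP2)          # Bekenstein-Hawking entropy in units of k

print(f"horizon area = {area:.3e} m^2")
print(f"S / k = {entropy_over_k:.3e}")     # of order 10^77

An entropy of order 10⁷⁷ k for a single solar-mass object vastly exceeds the ordinary thermodynamic entropy of the star that collapsed to form it, which is one reason the result is taken as a serious clue about the microscopic degrees of freedom of spacetime.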
Of course, the problem of how to count relevant microstates in string theory and LQG also requires facing the usual issues that arise in top-down approaches, e.g., choice of gauge groups, classifying microstates as distinguishable vs. indistinguishable, etc. The author's interest in PSPT is the possibility of a bottom-up procedure based on the simplest possible properties of the phase transition. Again, in the ice-crystal analogy this corresponds to computing the possible lattice structures around a "bubble tube" and then using those to compute the entropy of that structure. The physical study of phase transitions is a highly developed subject. Classically it is concerned with thermodynamic analysis of ordinary matter (e.g., Pimpinelli and Villain, 1998), and quantum
mechanically with fields (e.g., Sachdev, 1999). The concept known as spontaneous symmetry breaking is relevant to both cases, especially most recently in quantum phase transitions. A classical example is once again liquid water condensing into ice crystals, in which the isotropic orientations of the molecules in the liquid state assume preferred directions in the crystal. Classical and quantum-mechanical phase transitions are generally classified as first-order or second-order. In the modern usage of these terms, first-order transitions involve a discontinuity in the first derivative of the thermodynamic free energy, a latent heat, and an interface between regions in the two phases. The latent heat is energy absorbed or emitted at constant temperature during the transition. Second-order transitions (also called continuous transitions) have no such latent heat or discontinuity and have "infinite correlation length", which is obviously a mathematical idealization, but it means that the entire medium changes phase in a coherent fashion during a single transition period. The phase transition in PSPT is envisioned as first-order in the local time direction but second-order in the 3-space orthogonal to the time direction. That conforms to the nature of that 3-space as a Gaussian hypersurface of simultaneity and supports possible mechanisms for quantum entanglement. Whereas "heat" cannot have the usual thermodynamic meaning in this context, the equivalent of latent heat in the first-order time-directed phase transition could be manifested as the dark energy component of the current concordance model of cosmology. The time period during which a typical phase transition occurs is especially difficult to treat exactly (see, e.g., Johnson, 2002). One approach employs the digital mechanics mentioned above (see, e.g., Fredkin, 1990, and Wolfram, 2002). One of the tools used in digital mechanics is the cellular automaton (Ulam, 1952; von Neumann, 1966). There are many different types of cellular automata, but typically they employ a lattice whose nodes at each stage of evolution interact in discrete steps to produce a new "generation" of node states. Most cellular automata operate in 1+1 or 2+1 spacetimes, but 3+1 spacetimes have been investigated (Bays, 1987). The rules governing the automata may be deterministic or stochastic, and the behavior of the automata may be reversible or irreversible. When applied to problems in classical physics, deterministic reversible automata are often used (e.g., Fredkin, 1990). Cellular automata can be used to model behavior known as self-organized criticality (see, e.g., Jensen, 1998, and Ball, 1999). This has involved the study of cellular automata designed to behave roughly as physical systems with simple internal degrees of freedom that interact to produce complex behavior on larger scales. One of the best examples is the study of sandpiles to which grains are slowly added and which experience occasional avalanches. Computer simulations of such modeled physical activities have yielded large sets of data that can be analyzed for patterns such as the power spectrum of the energy released in avalanches. The main thrust of these investigations so far has been the attempt to derive more reliable models of earthquakes, forest fires, biological evolution, and so forth.
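The sandpile automaton mentioned above is simple enough to state completely. In the Bak-Tang-Wiesenfeld version, grains are dropped one at a time onto a grid; any cell holding four or more grains topples, sending one grain to each of its four neighbors (grains toppled off the edge are lost), and the number of topplings triggered by a single dropped grain is recorded as the avalanche size. The Python sketch below is a bare-bones implementation; the grid size, number of drops, and random seed are arbitrary.

import random

random.seed(0)
N = 20                                          # grid side length
grid = [[0] * N for _ in range(N)]

def drop_grain_and_relax(grid):
    """Drop one grain at a random cell, then topple until stable.
    Returns the avalanche size (total number of topplings)."""
    x, y = random.randrange(N), random.randrange(N)
    grid[x][y] += 1
    topplings = 0
    unstable = [(x, y)] if grid[x][y] >= 4 else []
    while unstable:
        i, j = unstable.pop()
        if grid[i][j] < 4:
            continue
        grid[i][j] -= 4
        topplings += 1
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < N and 0 <= nj < N:     # grains off the edge are lost
                grid[ni][nj] += 1
                if grid[ni][nj] >= 4:
                    unstable.append((ni, nj))
        if grid[i][j] >= 4:
            unstable.append((i, j))
    return topplings

avalanches = [drop_grain_and_relax(grid) for _ in range(20000)]
big = [a for a in avalanches if a > 0]
print("largest avalanche:", max(avalanches), " mean nonzero size:", sum(big) / len(big))

Collecting the avalanche sizes over a long run and examining their distribution, which is approximately a power law over a range of scales, is exactly the kind of macroscopic observable referred to in the discussion that follows.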
The interest herein is that these analyses provide vivid illustrations of how phenomenology can be used to derive conservation principles, statistical and causal relationships between macroscopic observables, large-scale descriptions in terms of mean-field theories, equations of motion and diffusion, dynamically driven renormalization groups, or in other words, the sort of activities that have taken place in humanity's attempt to develop a mathematically rigorous description of nature. If, for example, only the macroscopically observable data produced by the sandpile cellular automaton were available for analysis, with no access to information about the simple internal degrees of freedom defining the cellular automaton itself, the theories constructed in terms of conservation principles, power spectra, time-correlated parameters, etc., would be very
similar to actual published theories, depending on the realism of the automaton. But in these applications, the laws operating at the corresponding microscopic level are relatively well known in advance. For example, it is fairly straightforward to imagine simulating a sandpile as a collection of tiny irregular rocks subject to gravity and friction and reacting to sand being added to the pile. From these simple internal processes, a cellular automaton can be run and its output used to develop the statistical and asymptotic formalisms for comparison to known theories. One may need to adjust size distributions, coefficients of friction, etc., but that is merely fine tuning. For PSPT we need to start from the existing disjoint asymptotic descriptions of nature and deduce the underlying mechanism in ontologically fundamental terms. The hope is that the mechanism will be simple; at least that is a good way to begin. But simplicity does not necessarily imply uniqueness, and the job of discovering an effective set of laws for PSPT remains largely unfinished. The attempt to work backwards in this way is similar to the problem of inverting a nonlinear function; aside from some special cases, no general formal inversion is possible, leaving iterative methods as the only way to proceed. An attempt to guess the solution is made, and then a forward mapping from that model to the observable data is performed. Any discrepancies between the results of the forward mapping and observations must be used in a negative-feedback fashion to modify the model to eliminate or at least reduce the discrepancies. But iterative solutions to nonlinear problems generally require an initial estimate that is in the neighborhood of a solution, and when the “solution” is a set of laws, it helps to have some intuition about how varying the laws will affect the behavior. We will see below that this is the primary obstacle for PSPT. If one achieves a model that yields no discrepancies, then one must be concerned with whether the model is unique. A widely accepted criterion for uniqueness has traditionally been simplicity. For example, maximum-entropy methods are based on the appeal of minimizing the complexity of the solution. In the search for ultimate physical laws, the scientific community has already embraced the goal of minimizing the number of arbitrary constants required by a theory. The most promising way to do that is to design the model so that it can be characterized in purely geometrical terms to the maximum possible extent, so that traditional physical constants might be derivable from geometrical relationships. An example is given below of how the speed of light may emerge from the geometry of the spacetime phase transition. The most well-known cellular automaton is the “Game of Life” invented by John Conway as a modification of some work done by John von Neumann. The Game of Life was brought to the attention of the general public by Martin Gardner (1970). Wikipedia has an excellent page on the Game of Life at https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life#cite_note-1. One of the points to be stressed herein is that cellular automata may have remarkably simple rules but nevertheless can produce extremely rich and complicated behavior. The Wikipedia page has some animations that show this strikingly (see, e.g., Gosper’s glider gun, and invoke the animation of the puffer-type breeder). The standard Game of Life takes place in a 2+1 spacetime for which consecutive displays of the 2-space replace the time axis. 
The 2-space is a finite grid of equal-size square cells spanning the horizontal and vertical directions, typically with wrap-around (i.e., a toroidal topology), and each cell is capable of two states commonly called "alive" and "dead". These two states are displayed in different colors. The initial grid must have some but not all cells alive if any non-trivial action is to take place. Typically some fraction of the cells are set alive by random selection to get things started. Then each "generation" of cells (the complete 2-space at one tick of the game clock) determines which cells of the next generation will be alive or dead by two simple rules based on how many of each cell's eight
contiguous neighbors are alive in the current generation: (a.) if exactly 2 or 3 neighbors of a live cell are alive, then that cell remains alive in the next generation, otherwise it is dead; (b.) if exactly 3 neighbors of a dead cell are alive, then that cell becomes alive in the next generation, otherwise it remains dead. These rules produce behavior that is deterministic but irreversible. An impressive example of how such simple rules can produce "rich and complicated behavior" is a starting pattern invented by Andrew Wade in 2010 named Gemini. It has a repetition cycle of 33,699,586 generations, after which it has replicated itself exactly and erased the original copy. It operates in a stack of 2-spaces each containing about 1.78×10¹³ cells of which 846,278 are alive in the initial state. Each replicated copy has moved 5120 cells vertically and 1024 horizontally. If its Game of Life were operating at the Planck scale (cells of one Planck length per side, a new generation once per Planck time), it would occupy a sequence of 2-space areas about 1.7×10⁻²⁶ cm on each side, or about 1.65×10¹³ times smaller than the classical electron radius, and it would be moving at a speed of about 46.5 km/s, or 1.55×10⁻⁴c, about 1.55 times the Earth's orbital speed around the Sun. That assumes that no collisions with other patterns occur. Such collisions generally produce drastic effects in the Game of Life. The simplest moving self-replicating pattern in the Game of Life is the glider. It has a replication period of four generations, after which it has moved one cell horizontally and one cell vertically. It may have any of the four 90° rotations in the grid, so any given glider can move only in one of the four diagonal directions. A replication cycle is shown in Figure 6-10 for a glider moving to the right and down. At the Planck scale, a glider moves with a speed of 106,066 km/s, just over 35% of the speed of light.
Figure 6-10. A complete replication cycle of a glider in the Game of Life; starting at A, four generations later it has arrived at the same pattern but has moved one cell to the right and one cell down.
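The two rules just quoted translate into a few lines of Python, and running them confirms the replication cycle of Figure 6-10: after four generations the live cells form the same pattern shifted one cell to the right and one cell down. The sketch below represents the grid as a set of live-cell coordinates and ignores boundaries, which is adequate for a single glider far from any edge.

from collections import Counter

def step(live):
    """One Game of Life generation.  'live' is a set of (x, y) cells."""
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Rule (b.): birth with exactly 3 live neighbors.
    # Rule (a.): survival with 2 or 3 live neighbors.
    return {cell for cell, n in neighbor_counts.items()
            if n == 3 or (n == 2 and cell in live)}

# The glider of Figure 6-10, oriented to move one cell right and one cell down.
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}

state = set(glider)
for _ in range(4):
    state = step(state)

shifted = {(x + 1, y + 1) for (x, y) in glider}
print("same pattern, moved right and down:", state == shifted)   # prints True

Seeding the same step function with a random initial population on a large grid reproduces the qualitative behavior described below: persistent localized structures, occasional collisions, and eventually a sparse, slowly changing background.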
Figure 6-11 shows four gliders going in the four possible directions. This is shown in a stereo projection of the 2+1 space. We can do no better than to quote from the classic pair of textbooks by Morse & Feshbach (1953), who display various coordinate systems in such a 3-D format: "Several of the figures in this work, which have to do with three dimensions, are drawn for stereoscopic viewing... Those who have neither equipment nor sufficient ocular decoupling may consider these figures as ordinary perspective drawings unnecessarily duplicated. If not benefited, they will be at least not worse off by the duplication." The point here is to illustrate how patterns can persist over the passage of time by replicating themselves in successive surfaces, and how some of these patterns can "move" in the process, although motion other than in the time direction is not necessary.
Figure 6-11. Stereo pair projection of a 2+1 spacetime in the Game of Life; four gliders are moving outward from the central region of an otherwise-empty grid.
The glider is not a "particle"; it is a process evolving in the 2+1 Game of Life spacetime, and it serves as a toy model of how we interpret the fundamental particle processes of our own 3+1 spacetime. Where there appears to be empty space in Figure 6-11 there are actually dead cells. The space is not empty; most of the cells are simply in a state other than those of the glider processes. If the PSPT model is correct, particle processes and interactions must be vastly more complicated even than the Gemini process of the Game of Life. Their evolution must take place in a propagating phase transition that is not defined on a regular grid. One similarity is the creation of extended patterns stretched out over the corresponding world lines in spacetime. What appears to be empty space is really chaotic non-repeating patterns of granules that yield no persistent forms but serve as a medium through which causal influences may propagate. The particle processes and the vacuum fluctuations are constrained to follow stochastic laws that are yet to be completely formulated, with extremely large numbers providing stability for the particle processes. The large-scale uniformity of the Universe's 3-space spherical hypersurface follows from the simplicity of the granules' behavior, which applies everywhere. The Game of Life also demonstrates this: even if live cells are initially concentrated in an isolated region of the grid, different starting patterns typically evolve to similar large-scale distributions over the entire space after a sufficient number of generations. Past interactions between patterns in widely separated regions are not required for this uniformity to develop. A Universe driven by the PSPT automaton can develop large-scale homogeneity and isotropy without the need for an inflationary mechanism such as that based on spontaneous symmetry breaking of an inflaton field (a scalar field analogous to the Higgs field of particle physics), the process by which large-scale uniformity is thought to have come about in the consensus cosmological model known as the ΛCDM Concordance Model (see, e.g., Frieman, Turner, and Huterer, 2008). Like the Game of Life, the PSPT automaton is irreversible, and while it is stochastic, it satisfies relativistic causal constraints. The maximum speed at which a causal influence may propagate in the Game of Life is from one cell to a neighbor over the duration of one generation, giving a geometric origin to the speed of light. But in the Game of Life, light travels larger absolute distances per unit time in diagonal directions than in horizontal and vertical ones. In PSPT, the speed of light derives from a similar geometric condition, and its isotropy results from the possibility that a granule may attach itself to the aggregation at any point on a previously attached granule. The propagation speed associated with this depends on the aggregation rate, which is a parameter of a given PSPT model, but if the mean time between attachments is the Planck time, then a causal influence can propagate
no faster than a mean speed of one Planck length per Planck time, which is the vacuum speed of light. In the Game of Life, a given process like a glider or Gemini “moves” by replicating itself in a new location while erasing its previous self. In PSPT, the “motion” of a given particle process is defined only in terms of the change in its 3-space distance from another such process as the cosmic clock ticks, and for this relative motion to take place, each process needs only to replicate itself along its own local time axis. The motion of two processes relative to each other results from the two local surfaces of simultaneity not being parallel. On the large scale, the global surface of simultaneity is a 3-sphere, so at every event the local time axis has its own unique direction relative to all others, hence all particle processes move away from each other as their way of participating in the expansion of the Universe. But on smaller scales there are local curvature variations that give rise to angles between the time axes orthogonal to different surfaces of simultaneity of various particle processes. Inertia is the tendency of such processes to replicate only in their own time directions, i.e., material objects always remain at rest within their own inertial reference systems. For a particle process to change its motion relative to other processes, some kind of interaction between processes is required to change the orientation of local surfaces of simultaneity, hence also time axis orientations. As illustrated in Figure 6-1 (p. 319) for two coordinate systems in uniform relative motion, a nonzero angle between the time axes results in a foreshortening of each process’s time axis as observed in the other process’s rest system. Each process advances along its own time axis with each tick of the cosmic clock, but in each process’s rest frame, the other appears to make less progress in the local time direction, while the 3-space distance between the two systems changes uniformly. Such remarks, however, flirt with forbidden notions such as preferred reference frames and geometrical/dynamical relationships between two processes that are significantly separated in the context of a Riemannian manifold. These notions must therefore be regarded as mostly qualitative but possibly of heuristic value in constructing rules for PSPT models. The connection between a PSPT cosmological model and the spherical Robertson-Walker metric involves only a preferred direction for the time axis. The orientations of the 3-space axes are arbitrary other than being locally orthogonal to the time axis. The hint of a legitimate role for a preferred reference frame was established in section 6.3, however, and preferred time axes are also to be found in other formulations of Quantum Gravity, e.g., LQG, CDT, and the EBU (Evolving Block Universe) model discussed below. Nevertheless, because there is so far no highly developed PSPT model, we are forced to be more qualitative than quantitative, and the same applies to the following suggestion that PSPT particle processes and interactions may underlie quantum-mechanical wave functions and interaction-induced collapses. For a PSPT model to be correct, the particle processes it supports must propagate more as processes than localized particles. When a process’s extent in each 3-space hypersurface is measured in units of granules, the numbers must be astronomical. Even when localized by interactions/collisions with other particle processes, the numbers must remain huge. 
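To make the Game-of-Life side of this analogy concrete, here is a minimal sketch (in Python, used purely for illustration) of the standard Life update rule applied to a glider on an otherwise-dead grid; the grid size and boundary treatment are arbitrary choices. The glider reappears one cell diagonally displaced every four generations, and no influence ever spreads faster than one cell per generation, the toy "speed of light" discussed above.

```python
import numpy as np

def life_step(grid):
    """One generation of Conway's Game of Life on a grid with a dead boundary."""
    padded = np.pad(grid, 1)
    # Count the eight neighbors of every interior cell.
    neighbors = sum(np.roll(np.roll(padded, dy, 0), dx, 1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))[1:-1, 1:-1]
    # Birth on exactly 3 live neighbors; survival on 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

grid = np.zeros((12, 12), dtype=int)
# A glider placed near the upper-left corner.
for r, c in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
    grid[r, c] = 1

for gen in range(8):
    grid = life_step(grid)
print(np.argwhere(grid))   # after 8 generations: the same 5-cell shape, shifted 2 cells diagonally
```

The glider is never "moved"; each generation recomputes every cell from its neighbors, and the pattern simply reconstitutes itself at the displaced location, which is exactly the sense of "motion" invoked above.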
The nature of these interactions must involve the long-distance correlations of the 3-space 2nd-order phase transition drastically altering the process extent to correspond to a localized particle. Large-number statistics must constrain random effects, perhaps by nonlinear chaotic attractor properties of the phase transition, to produce a post-interaction process corresponding to one of the possible interaction/measurement outcomes manifested at the macroscopic level. This view of interaction-induced transformation of the particle-process pattern favors Bohr’s interpretation of entanglement over Einstein’s (see section 5.11). The crux of their disagreement was Einstein’s belief that if the outcome of a measurement could be predicted with certainty, then the system
was in that state prior to the measurement, whereas Bohr believed that prior to the measurement, the system could not be said to be in any particular state but rather had to wait for the interaction to force it into an eigenstate. The PSPT equivalent of Bohr’s argument is that even if the outcome of an interaction can be predicted exactly, one still needs the interaction to take place before the predicted state is actualized. Another parameter defining a specific PSPT model is the effect of existing granules on the probability that a new granule will attach at a given site. In general, granules may not overlap. In Figure 6-9 there is an open space just above and to the left of the center where a granule could just fit, but depending on the model laws, it may be impossible, or extremely unlikely, for a granule to arrive at that location. For any specific granule to which others have attached, the available sites for further attachment are clearly limited, and this affects the probability of further attachments. Whether the possibility of attaching to two existing granules makes that event more or less likely is another PSPT model parameter. Much analysis has been devoted to the problem of sphere packing in various dimensions and to the closely related “kissing problem”. For example, the way oranges are normally stacked on grocery shelves was conjectured by Johannes Kepler to yield the densest possible packing in three dimensions, but this was remarkably difficult to prove (see, e.g., Szpiro, 2003; here we assume identical spherical oranges). In this arrangement, interior oranges are “kissed” by twelve others. Since the two-dimensional situation is so obvious (exactly six pennies can “kiss” a central one with no slack in the arrangement), it seemed that the twelve oranges ought to do the same in three dimensions and that this should be the only possible arrangement with twelve oranges touching a central one. But this turned out not to be true. There are several ways to make twelve oranges touch a common one, some of which have some slack, so the locations of the twelve sphere centers relative to the central one are not unique. But none of these involve more than twelve oranges around the central one. Formal proof that twelve kissing oranges is the limit proved elusive, as did Kepler’s packing conjecture itself, which was finally proved in 1998 by Thomas Hales using a computer program that checked every relevant class of arrangements and whose logic was later confirmed by automated proof checking (Hales, 2005). In four dimensions (our main case of interest), the kissing problem was even more elusive. The kissing problem in higher dimensions is generally difficult, but some surprises lurk. For example, in eight dimensions an exact answer was found relatively early: 240 spheres can kiss a central one, and no more. The maximum is now known exactly in dimensions one through four, as well as in eight and twenty-four; in four dimensions it is 24 (a configuration realizing 24 is sketched below). Of course maximal density is not anticipated to occur very often in PSPT, since variation in pattern detail is essential for yielding versatility. For example, it would be reasonable to expect that there are longitudinal and transverse vibrational modes in the changes of spacetime structure from one surface of simultaneity to the next, vortex activity, and so on. Such degrees of freedom must eventually underlie electroweak and strong interactions, charge, spin, mass/energy, and all other observable physical parameters.
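As a small illustration of the four-dimensional kissing number quoted above, the following sketch (Python, chosen here purely for illustration) enumerates the 24 vectors of length √2 having exactly two nonzero entries equal to ±1. Spheres of radius √2/2 centered on these points all touch a central sphere of the same radius at the origin, and no two of them overlap, so 24 spheres can indeed kiss a central one in four dimensions; that 24 is also the maximum is the hard part and is not demonstrated here.

```python
import itertools
import numpy as np

# The 24 shortest vectors of the D4 lattice: two entries are +/-1, the other two are 0.
centers = []
for positions in itertools.combinations(range(4), 2):
    for signs in itertools.product((-1, 1), repeat=2):
        v = np.zeros(4)
        v[list(positions)] = signs
        centers.append(v)
centers = np.array(centers)
assert len(centers) == 24

touch = np.sqrt(2.0)                                         # center-to-center distance = 2 * radius
assert np.allclose(np.linalg.norm(centers, axis=1), touch)   # every sphere touches the central one

# Pairwise separations must be at least 'touch' so the outer spheres do not overlap each other.
diffs = centers[:, None, :] - centers[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)
off_diag = dists[~np.eye(24, dtype=bool)]
assert off_diag.min() >= touch - 1e-12
print("24 non-overlapping spheres, all kissing the central sphere")
```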
In the two-dimensional illustration in Figure 6-9, the spheres are not maximally packed, but even if they were, they would not fill the space. This results in the spacetime being fractal-like at the Planck scale, something PSPT has in common with some of the CDT models of the previous section. By “fractal-like”, we mean that the coordinate space is not exhausted, i.e., the Planck cells may connect in ways that leave holes of various sizes, and distances between events may depend on the routes of connected Planck cells joining the events. But this is only “fractal-like” because the infinite zooming-in property of true fractals becomes finite here and stops at the Planck scale. Consideration of these effects will resume in the next section. Some cellular automaton rules employ states in several previous generations to contribute to
the new generation’s states, and some employ different clocks for different directions in the grid. A property common to most cellular automata is that the behavior that will be produced by a given set of rules is generally impossible to anticipate. Fredkin (1990) says “there is not yet any magical method for designing just what you might want; you have to try it out and see if it does what you want.” Wolfram (2002) refers to this as computational irreducibility and says “this means that the only way to work out how the system will behave is essentially to perform this computation”. Some progress has been made to alleviate this for the Game of Life. The Gemini pattern was not simply discovered, it was put together with knowledge of how some previously constructed patterns might work in concert. It operates as a combination of two identical structures, each containing patterns known as Chapman-Greene construction arms. Of course, none of this translates directly to PSPT, but the Game of Life is used herein only as a reference to the sorts of things that are possible in some contexts. There is also the hope that quantum computing will eventually be implemented on a sufficient scale to allow some probing of what a PSPT model might do. During the last several decades, many independent attempts to organize similar notions into a theory of Quantum Gravity have been ongoing. In section 6.3 we said “The list of approaches will be highly incomplete, because there are so many with which the author is unfamiliar, and even a grasp of every single one would still leave open the question of where to draw the line to limit the scope to worthwhile ideas.” Two formalisms should be mentioned in passing, however, because of some kindred properties relative to PSPT. Leffert (2001) proposed a process by which the radius and volume of a 3-sphere grow in cosmic time as a result of a physical process he calls Spatial Condensation. This involves accumulation of Planck-sized granules that he dubs plancktons at the surface of a 4-ball that is embedded in a space of approximately ten dimensions. Macken (2015, 2015b) has developed a theory based on treating space evolving in a background time as a high-impedance elastic medium in which all forms of mass/energy are manifestations of spatial curvature. While neither of these formalisms constitutes a complete theory of Quantum Gravity, both are extensively developed, provide foundations for interpreting certain quantum effects, and make numerical predictions that are compatible with a substantial volume of observations. But both also diverge significantly from PSPT in fundamental aspects. In section 6.3 we said “Among those who reject the block Universe it is generally agreed that one requirement for a Quantum Gravity Theory is to present a mechanism that generates a special instant of time that we recognize as ‘now’. We will give brief descriptions of several such models below.” Both PSPT and Leffert’s Spatial Condensation model do this, i.e., generate an expanding 3-sphere with the passage of a background time. Of course, inside the 4-ball bounded by the 3-sphere, everything looks exactly like a block Universe, so one could say that PSPT or Spatial Condensation could have completed their tasks an infinitely long time ago, and an infinite number of conscious observers still find themselves experiencing each past tick of the cosmic clock as though it were something special. 
A counter-argument is that each tick of the cosmic clock was special at the background time when that particular 3-sphere was created as a new surface of the 4-ball. This interpretation of reality goes against Einstein’s opinion on the subject as expressed in one of his last quotations (Isaacson, 2007), “the distinction between past, present, and future is only a stubborn illusion.” But we have already observed that Einstein left us with some unresolved conflicts regarding his beliefs. For example, he stressed the value of morality, but he accepted that in a block Universe in which all history is already written, the notion of free will must also be an illusion, saying (quoted by Isaacson, 2007, p. 393), “I know that philosophically a murderer is not responsible for
his crime, but I prefer not to take tea with him.” Thus it would seem that morality must be classified among the other illusions, free will and a “now” that flows toward the future. Regarding Einstein’s views, we said in section 5.1, “It is therefore only with some disquiet and after extensive consideration that one should proceed to contradict his interpretation of any aspect of physics. But as a rebel and revolutionary himself, he would surely not approve of dogmatic devotion to his views, which were based on theories that he clearly stated he held to be incomplete.” So given the residual ambiguities and a general distaste for dismissing the most essential aspects of human experience as illusions, a number of physicists have proceeded to work on descriptions of reality in which our conscious perceptions can be interpreted as actual insights. Even if our perceptions are illusions, there is a responsibility to explain how the illusions come to exist. The predestination of the standard block Universe vitiates the notion of free will, and without free will, the notion of morality is empty of anything other than functionally positive behavior in terms of social benefit. But predestination is not the only obstacle to free will. In the author’s view, the main hindrance to a meaningful discussion of free will is the lack of a cogent definition. A typical extant definition is that free will is the ability to make choices that are not controlled by fate or God (two more terms that are more difficult to define than widely acknowledged). This definition does not get around the determinism inherent in making choices governed by natural constraints, such as a genetic preference for chocolate over vanilla, unless that falls into the category “fate”. It has been argued convincingly that randomness does not empower free will, it just removes the constraint of predeterminism and provides the possibility for random choices irrelevant to moral values. A coin flip within the internal domain of consciousness does not seem like a good substitute for the notion we are seeking. Perhaps a better definition is that in which free will involves the power to create something that did not exist prior to the exercise. For example, the biblical “Let there be light”: did God know what light was before it was allowed to be? Perhaps not entirely, since the standard continuation is “And God saw the light, that it was good.” The point here is not to advocate for literal belief in the Bible (even the Catholic Church eschews that), but rather that perhaps a more useful notion of free will arises in the context known to artists, whose business involves conjuring new images or music or flavors from some unknown reservoir, then deciding whether what was evoked is useful for the artistic purpose at hand. Like the draw of a random variable from a nonepistemic pool of possibilities free of the constraints of determinism and conservation principles, whatever answered the summons was only vaguely imagined and not specifically preselected by the artist, whose role is rather to decide whether to keep the commodity supplied by the unknown power that creates unpredictable products. Whether that power is anything more than a coin flip itself goes beyond what we can hope to establish here. We are simply taking the definition of free will to be the capability to achieve such an incompletely prescribed beckoning in the first place. 
In the context mentioned above, regarding the question of whether consciousness can influence outcomes otherwise left to random selection in the evolution of processes posited in the PSPT model: since PSPT describes a mechanism by which things become part of the Universe, such influence would amount to willing those things into existence, with all the concomitant moral implications. But we will content ourselves with eliminating the obstacle to free will presented by the predestination implicit in a block Universe. Such a removal is a necessary condition for free will even if not sufficient. Toward that goal, there is another model, this one proposed by George Ellis (2014), the Evolving Block Universe (EBU), with a variation known as the Crystalizing Block Universe (CBU). The adjective
“Evolving” casts the phrase “Block Universe” into a new sense, one aligned with that of PSPT and Leffert’s Spatial Condensation model. Specifically, at each passing moment of what Ellis calls global time, the Universe consists of a static past (the “block” part) and a temporally extended present in which a specific spacetime structure is forming out of a multiplicity of possible structures, qualitatively similar to PSPT but envisioned as a locus of collapsing quantum-mechanical wave function rather than as a region of intermediate phase transition involving spacetime granules. Ellis describes it as “a spacetime which grows and incorporates ever more events, ‘concretizing’ as time evolves along each world line.” Here “world line” refers to spacetime paths in a preferred coordinate system whose global clock Ellis defines as “measuring proper time along Ricci eigenlines from the start of the universe.” Ellis and his collaborators are among a number of physicists who reject the standard block Universe model for various reasons. Here are three relevant statements by Lee Smolin. (2001) “We cannot understand the world we see around us as something static. We must see it as something created, and under continual recreation, by an enormous number of processes acting together.” (2014) “The future is not now real and there can be no definite facts of the matter about the future.” (2016) “What is real is the process by which future events are generated out of present events.” These all reflect the considerations that underlie the PSPT model and can also be interpreted in terms of Ellis’s EBU model. Ellis considers the structure of the Universe at five size scales ranging from the smallest possible, which he calls the quantum gravity scale or Scale 0, to the largest possible, the cosmological scale or Scale 4. In between are Scale 1, the micro-level quantum-physics scale, Scale 2, the macro-level daily-life scale, and Scale 3, the astrophysical-structures scale. The usual block Universe model involves all of these scales being joined smoothly both temporally and spatially to form a complete self-contained spacetime with all of its parts connected via reversible Hamiltonian dynamics. This reversibility is consistent only with unitary evolution of physical states, hence unitary evolution of the local wave functions corresponding to all the spatial scales. A key element of Ellis’s argument is that quantum-mechanical effects demonstrably include occasional nonunitary evolution, i.e., wave-function collapse (see, e.g., Penrose, 1989, Chapter 8). He explicitly rejects models in which wave-function collapse does not take place, such as the Many Worlds Interpretation (see section 5.15). As with PSPT, wherein any interaction between propagating patterns generally results in the production of a drastically new pattern, the EBU model takes any interaction as a cause for wave-function collapse, independently of whether one considers the interaction to be a “measurement”. The effect is the same, as we saw, for example, in the formal mathematical parallel between the Continuous Spontaneous Localization collapse model (see section 5.15) and the Parameter Refinement Theorem (see section 4.8). The plethora of such interactions that take place naturally leads to what was called environmental decoherence in section 5.15. As mentioned therein, environmental decoherence is an irreversible dissipative process. In effect, when everything is entangled with everything else, it resembles no entanglement at all. 
Expressed in terms of the density matrix formulation of Quantum Mechanics (e.g., von Neumann, 1932), environmental decoherence causes the density matrix to become asymptotically diagonal, eliminating the elements corresponding to grotesque states, e.g., Schrödinger’s Cat being alive and dead at the same time. But while this leaves only classical states available for random selection, it does not itself collapse the wave function. That would require forcing all but one of the density matrix diagonal elements to become zero, clearly a nonunitary transition from being a nonsingular matrix to being a singular matrix, and not something environmental decoherence can do. As discussed in section 5.15, exactly how wave-function collapse does happen is not well understood,
but many physicists, including Ellis, consider its occurrence to be the only acceptable hypothesis. Because of the nonunitary nature of wave function collapse, the state of spacetime at t1 on the global clock cannot be used to reconstruct the state at t0 < t1, a state whose configuration defined the wave function whose irreversible collapse led to the state at t1. The latter state could have come from any state in the pre-collapse superposition immediately preceding t1. And the state at t0 can predict only the relative probabilities for the state that will be in effect at t1. It follows that states at different times and different size scales cannot be predicted from reversible processes at other times and other scales, invalidating the block Universe. As Ellis puts it, “The change from uncertainty to certainty takes place at the ever changing present, where the indefinite future becomes the determined past.” His description of the EBU model applies equally well to PSPT: “Spacetime itself is growing as time passes.” The moment that we call the “present” or “now” corresponds to the time on the global clock in the spacetime region where wave-function collapse is taking place, since that is the future boundary of spacetime beyond which nothing can exist because it does not yet know what form to take or even what the options are. Beyond “now” there is not yet any wave function to collapse, because the wave function depends on physical context, and that does not yet exist beyond “now” because it is “now” that the process of forming the necessary context is incomplete, and so there is no way to know what the possible states of spacetime beyond “now” will be. Ellis associates this process with Scale 1, but he points out: “The events at Scale 1 then underlie emergence of structures and function at Scale 2, thus they underlie the emergence of time at that level.” In the EBU model, as in PSPT, the relationship between consciousness and the physical Universe is an unsolved problem, but the widely reported human experience is that consciousness lives only in the “now” moment, a temporal region of intermediate wave-function collapse in EBU and of intermediate phase transition in PSPT. We can remember events of the past, but we no longer live in those moments. Two other parallels between EBU and PSPT can be seen in Ellis’s statements “Indeed quantum gravity can be based in EBU-like models, such as spin foam models based in a discrete spacetime picture” and “The point that emerges is we must distinguish between emergence of the spacetime itself, and the concretization of events within spacetime.” In EBU, the context for this distinction is provided by the different spatial scales, and in PSPT the distinction is between patterns propagating freely and patterns interacting to form entirely new patterns, both driven by the phase transition operating at EBU’s Scale 0. One reason why Ellis introduces the different spatial scales is that they permit the wave function to be defined locally, eliminating the need for a “universal wave function”, a problematic notion. 
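As a concrete illustration of the distinction drawn above between decoherence and collapse, the following sketch (Python, for illustration only; the damping rate and the measurement basis are arbitrary assumptions) starts from the density matrix of an equal superposition, damps its off-diagonal elements as environmental decoherence would, and then contrasts that with a collapse, which projects onto a single diagonal element, the nonunitary step that decoherence alone never supplies.

```python
import numpy as np

# Equal superposition |psi> = (|0> + |1>)/sqrt(2); its density matrix has off-diagonal "coherences".
psi = np.array([1.0, 1.0]) / np.sqrt(2.0)
rho = np.outer(psi, psi.conj())

def decohere(rho, gamma, t):
    """Exponential damping of the off-diagonal elements (a stand-in for environmental decoherence)."""
    out = rho.copy()
    out[0, 1] *= np.exp(-gamma * t)
    out[1, 0] *= np.exp(-gamma * t)
    return out

rho_late = decohere(rho, gamma=1.0, t=10.0)
print(np.round(rho_late, 4))     # nearly diagonal: diag(0.5, 0.5) -- a classical mixture
print(np.linalg.det(rho_late))   # about 0.25: nonsingular, so nothing has been "selected" yet

# Collapse onto the outcome |0>: a projection, which the damping above never produces.
P0 = np.array([[1.0, 0.0], [0.0, 0.0]])
rho_collapsed = (P0 @ rho_late @ P0) / np.trace(P0 @ rho_late)
print(rho_collapsed)             # diag(1, 0): singular, all but one diagonal element forced to zero
```

The damped matrix still assigns equal weight to both outcomes; only the projection, representing the interaction-induced collapse discussed above, reduces it to a single one.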
Working within the context of local regions also allows the following question to be answered: “The proposal made here is to use proper time along preferred timeline as the time of evolution for the EBU, but the issue then is on what scale of description?” His answer (Ellis, 2014) is: As far as spacetime is concerned, the effective time fixing the EBU structure must be determined on the scale that controls space time structure, that is, this determination must take place on the basis of the matter distribution at Scale 3, even though the wave function collapse leading to its existence must be based in quantum gravity processes (Scale 0). As far as macro entities at Scale 2 are concerned, they experience the proper time determined by the space time structure, but have negligible influence on that structure. The present time for them, when the indeterminate future concretizes to the determinate past at this scale, must
obviously lie within the space-time domain that has come into existence, but need not necessarily coincide with the gravitational present. It will emerge from the underlying quantum (level 1) structures, where, because of evidence provided by delayed choice experiments, the emergent present may be expected to have a crinkly nature characterized as a Crystalizing Block Universe (CBU) structure
Thus proper times for various observers moving relative to each other are generally different from the global time but subject to the usual relativistic constraints. The development of the EBU/CBU model by Ellis and his collaborators shows how this is possible, and again there are qualitative parallels to PSPT. EBU has been criticized with the statement (Ellis, 2014) “Time does not flow.” His answer is to agree: time does not “flow”, it “passes”, and there is a difference: “It is the passage of time that allows rivers to flow and other events to take place.” Why it seems to “flow” is tied up with the question of why human consciousness remains within the region of intermediate wave-function collapse (or phase transition in PSPT). No one has a nontrivial answer to that question so far, although Penrose (1989) has also pointed out the connection between wave-function collapse, consciousness, and the perception of time passage. A related objection to EBU is the assertion that there is no sensible answer to the question “How fast does time pass?” Ellis’s answer is that this is determined by the metric tensor defined locally along any world line, as given in Equation 6.54 (p. 347). For an event whose world line is a Ricci eigenline, time passes at the rate of the global clock. For observers moving relative to the global coordinate system, the “rate” at which time passes for them can be defined in terms of the clock keeping proper time in their reference frame relative to the global clock. The EBU model is much more developed than the PSPT model, but much work remains to be done to complete it. A primary need is Quantum Gravity Theory, since the EBU details depend on that at Scale 0. Although the PSPT model is currently mostly qualitative, to the author it supplies an intuitively appealing image of what EBU needs, as well as a conceptual mechanism that could underlie the formation of the discretized spacetime whose “chunk” size spectra can be computed in Loop Quantum Gravity and Causal Dynamical Triangulation.

6.9 Embeddings and Intrinsic vs. Extrinsic Curvature

The great power of mathematics allows the solution of problems corresponding to physical processes that cannot be fully visualized in terms of ordinary familiar objects. The macroscopic definitions of modern experimental setups are straightforward, and the final outcomes of measurements designed to test the mathematical formalisms of Quantum Mechanics can be understood intuitively. But except in certain hidden-variable theories, what actually goes on in the currently unimaginable microscopic domain between the input and the output is the province of mathematical abstraction alone. For example, in electron diffraction experiments (e.g., Figure 5-5, p. 271), a large number of objects described macroscopically as negatively charged particles are ejected one at a time from a source into a beam whose path contains an obstacle with two open parallel slits beyond which is a detector screen. Some of these particles subsequently collide with the screen, and the locations of the endpoints of their trajectories are recorded. After enough objects register at the detector screen,
their impact locations are observed to form an interference pattern similar to what would be expected of a wave phenomenon, not a stream of individual particles. Although the ultimate nature of the electrons and the electromagnetic field that accelerates them is not known, the basic idea of particles being ejected and eventually hitting a screen makes sense to human intuition. What happens in between does not. It is described variously as “the electron goes through both slits”, “the electron interferes with itself”, “localized position is not a property possessed by the electron along the trajectory”, etc. The mathematical equations used in Quantum Mechanics to compute the probabilities of various possible outcomes of the physical process are generally not difficult to set up and solve, and many physicists consider that fact tantamount to understanding electron diffraction, while others long for a more complete visualization of the entire actual process in terms of objectively real physical objects. The microscopic processes of Quantum Mechanics, however, are not the only context in which our intuition is unable to grasp fully what the mathematics is describing. From its beginning, Einstein’s General Relativity Theory aroused discomfort because of its dependence on the concept of curved space. Einstein faced charges of injecting “Jewish science” into physics (e.g., as documented by Isaacson, 2007, Chapter 14). Regardless of how one defines the “space” that one inhabits, the idea that it could be anything other than Euclidean was difficult to absorb, just as Galileo’s life was made more difficult by the fact that no one could feel the Earth moving. To this day, the strongest argument that the space of our experience could be curved is the magnificent success of General Relativity. Newton’s law of gravity also works to a lesser extent but nevertheless to an excellent approximation for most applications. Bodies falling freely under the influence of gravity do seem to move as if Newton’s law were in effect. As discussed in the first paragraph of section 6.1, we know now that besides its assumption of a single universal time parameter, Newton’s law of gravity has two fatal flaws that cancel each other out to an extent that appears miraculous: the instantaneous transmission of the force and its dependence on the inverse square of the distance between the point masses. To remove either flaw without removing the other destroys the approximation. The error implicit in each flaw depends on the error in the other to cancel it out in order for freely falling bodies to behave as if the law held. In section 6.2 we discussed the difference between intrinsic and extrinsic curvature. Many textbooks on General Relativity and differential geometry introduce the notion of a space’s intrinsic curvature by showing how it results from the extrinsic curvature that can be seen by embedding that space in a flat higher-dimensional space. The most straightforward and common example is an ordinary 2-sphere embedded in a 3-dimensional Euclidean space (e.g., Figure 6-7, p. 340, on the right). From that first impression, it is reasonable to get the idea that these two types of curvature always go hand in hand, and therefore that the intrinsic curvature of spacetime is simply an internal manifestation of extrinsic curvature in a higher-dimensional flat space. This is pleasing to the intuition, and it is natural to desire an intuitive understanding of how intrinsic curvature comes to exist. 
But such a model is viable only if it is possible in general to embed in flat space all intrinsic curvature that actually arises within General Relativity as applied to real physical situations. This turns out to be problematic for several reasons. We must caution the reader in advance that we will not be able to present a satisfying answer to the question of how intrinsic spacetime curvature arises in some plausible way acceptable to human intuition. When one cannot have what one desires, it can at least be somewhat comforting to understand why, and so this section will be limited to defining the problem and then showing how certain attempts to relate intrinsic spacetime curvature to extrinsic curvature in the assumed external objective physical reality have so far all failed. A few visualizations come tantalizingly close, but in the author’s view, none fully succeed.
First we must define what it means to embed a manifold in a higher-dimensional flat space. The example of a 2-sphere in three-dimensional Euclidean space is misleadingly simple, because we start out knowing the equation that describes the locus of the 2-sphere in that space. It is just the formula expressing the constraint that all points be some equal distance r from the center. This is easily expressed in Cartesian or spherical coordinates, and the result is called the embedding equation. It employs the extrinsic-curvature variables, and we don’t even have to know how to express the curvature intrinsically in terms of a metric. But in the general problem of embedding Riemannian manifolds, we start out knowing the metric fields of those manifolds from having solved Einstein’s field equations, and then we must deduce the embedding equations. This is a much more difficult problem. If a given spacetime manifold cannot be embedded in flat space, then its physically real intrinsic curvature cannot stem from some real extrinsic curvature in a higher-dimensional flat space, and the intuitive understanding we desire cannot be achieved. Most specialists in General Relativity regard the curvature of spacetime as intrinsic but not extrinsic, hence not as if it were curved in the usual intuitively understood sense. Rejecting any need for extrinsic curvature implies that a space with the same intrinsic curvature as a 2-sphere need not exist inside a space of three dimensions, not even in order to have room to curve back upon itself. The intrinsic curvature resulting from its metric at a given point can be described entirely in terms of quantities that are defined only within the two-dimensional surface, for example, the geodesic deviation. As we saw in section 6.2, this is the acceleration of the minimal distance from a point on one geodesic to another nearby geodesic as the first point moves along its geodesic. If the geodesics are locally parallel in the neighborhood, then as one moves along one of them away from where they are locally parallel, positive acceleration is caused by negative curvature, e.g., near a saddle point of a hyperboloidal sheet, while negative acceleration is caused by positive curvature, e.g., meridians near the equator of a sphere. Zero curvature produces zero acceleration, e.g., straight lines in a flat plane. If intrinsic curvature depended absolutely on extrinsic curvature in a higher-dimensional space, then among other things, compactified dimensions (e.g., as in Kaluza-Klein and superstring theories) would have to be embedded in spaces of higher dimensions. This would contradict the usual explanation for why we do not observe these extra dimensions, that their spaces are too small to see. The smallness of their spaces would nevertheless leave a need to explain why we do not observe the higher-dimensional flat spaces in which the compactified dimensions are embedded. Of course the same applies to four-dimensional spacetime. On the other hand, the nonexistence of such higher-dimensional spaces is not the only explanation possible for their not being observed. For example, Maxwell’s Equations involve vector operators defined for 3+1 spacetime; perhaps electromagnetic processes originating therein simply cannot leak out into the higher dimensions. Many other speculative possibilities exist, but to pursue them here would take us too far afield.
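The sign convention for geodesic deviation described above can be checked numerically. The sketch below (Python, illustrative only) launches two nearby meridians from the equator of a unit sphere, where they are locally parallel, and measures their separation as a function of arc length traveled; the second difference of the separation is negative, the signature of positive intrinsic curvature, whereas two parallel straight lines in a flat plane would give exactly zero.

```python
import numpy as np

def meridian_point(lon, t):
    """Point on the unit sphere a geodesic distance t due north of (latitude 0, longitude lon)."""
    return np.array([np.cos(t) * np.cos(lon), np.cos(t) * np.sin(lon), np.sin(t)])

def great_circle_distance(p, q):
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))

dlon = 1.0e-3                     # small initial separation of the two geodesics
ts = np.linspace(0.0, 1.0, 201)   # arc length traveled along each meridian
sep = np.array([great_circle_distance(meridian_point(0.0, t), meridian_point(dlon, t)) for t in ts])

h = ts[1] - ts[0]
accel = (sep[2:] - 2.0 * sep[1:-1] + sep[:-2]) / h**2   # second difference of the separation
print(accel[:5])                                 # negative: the meridians accelerate toward each other
print(np.allclose(accel, -sep[1:-1], rtol=1e-2))  # matches s''(t) = -s(t) for unit positive curvature
```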
Physicists of the “shut up and calculate” school dismiss such questions as simply pointless and even naive, whereas others have not given up on the possibility that human cognitive power might be expanded to the point of understanding either how intrinsic curvature could exist without the support of extrinsic curvature in higher dimensions or else how such extrinsic curvature is actually physically real. Some physicists appear to be saying “What’s to understand? It simply is, and that’s all there is to it.” This suggests that the phrase “curved space” is simply a notational convenience: space is neither curved nor flat; the phrase is merely a formalism for describing a metric field in terms of arbitrarily chosen coordinates, and that metric field determines the geometry of geodesics that behave as if the space were a physically real extrinsically curved subspace of a higher-dimensional flat space. So our question
becomes whether this is more or less plausible than physically real extrinsically curved spacetime. The need for a higher dimension might be problematic if modern physics did not already embrace the notion. For example, superstring theorists appear to take the physical reality of the six Calabi-Yau dimensions quite seriously, even though half of them are technically imaginary. The consensus among specialists in relativity seems to be that nothing is being held back by the lack of an intuitive explanation of how spacetime can be intrinsically but not extrinsically curved. There also seems to be some concern that insisting on real extrinsic curvature creates a demand for a physical interpretation, and that presents risks of introducing misconceptions. For example, Wheeler (1990, Chapter 8) stresses several times that the extra dimensions used to embed spacetime are “geometric”, exist only “in imagination”, and have “nothing whatsoever to do with time.” The latter admonition suggests a concern that physical properties may be assigned to the extra dimensions erroneously. The main explicit reason behind this rejection seems to be that intrinsic curvature can be determined without help from the embedding dimensions, and therefore introducing physically real extrinsic curvature into the problem is extraneous. Another more implicit reason appears to be the fact that embedding pseudo-Riemannian manifolds in Euclidean space is more problematic than for normal Riemannian manifolds. For example, embedding the entire four-dimensional spacetime isometrically (i.e., preserving geodesic distances and angles) generally requires a pseudo-Euclidean embedding space to support the pseudo-Riemannian manifold, as discussed further below. But the possibility remains that the higher dimensions needed to support extrinsic curvature of subspaces do indeed exist as part of objective physical reality. Four-dimensional spacetime might even be a subspace of an infinite-dimensional flat (possibly pseudo-Euclidean) Universe. This issue is relevant to those scientists whose goal is to understand the nature of physical reality. The fact that intrinsic curvature can be determined from within a curved space purely by measurements made within that space, however, suggests that it may be impossible to prove the existence or nonexistence of a flat embedding space, leaving extrinsic curvature in an embedding space as an optional interpretation, not a falsifiable physical theory. To be falsifiable, there would have to be some interaction between the higher dimensions and objects confined to the subspace, and there is no a priori reason why such interactions would have to take place. This is similar to considerations that arise in the context of the Many Worlds Interpretation (MWI) of Quantum Mechanics described in section 5.15. There are several variations on this theme, some of which involve interactions between separate “worlds”, in which case the “interpretation” becomes a theory, because in principle it can be disproved if it is false. But without such interactions, MWI remains an interpretation, which is why it is not called the “Many Worlds Theory”. In the same way, the hypothetical physical existence of higher-dimensional embedding spaces may remain an interpretation rather than a theory unless a mechanism is discovered that requires the higher dimensions to interact observably with four-dimensional spacetime. 
Until then, the physical reality of extrinsic spacetime curvature appears to remain a perfectly viable interpretation, assuming that one can show that the objectively real physical existence of the embedding space is plausible (no mean feat, as we will see below). Unlike the MWI, however, in which any one observer experiences only one “world”, observers in four-dimensional spacetime already have evidence of more than one dimension, opening the door to the possibility of additional ones. It could be argued that the “interaction” with the higher dimensions need be no more than their supplying the room for spacetime to exist with extrinsic curvature. After all, one need not assume
that the spaces provided by the higher dimensions contain anything more than the classical notion of void, the ancient idea of complete emptiness, nothing more than a context for geometrical relationships between non-void objects. The modern concept of vacuum is not void but rather a medium possessing properties, e.g., zero-point energy in Quantum Mechanics and a substance capable of having curvature and of conveying it from the vicinity of one spacetime event to another as described by General Relativity. Physical reality may consist of four-dimensional vacuum in these senses embedded in higher-dimensional void. As we will see below, the dimensionality of this embedding-space void must be greater than the four dimensions of spacetime. Here we assume that the only relevant non-void that exists is the fabric of the embedded four-dimensional curved spacetime. One may define an n-dimensional empty flat space mathematically as a space in which there are n orthogonal directions in which non-void objects could be arranged relative to each other, but one cannot determine whether this is physically meaningful without knowing the actual possibilities for geometrical relationships between such objects, and there may be no such objects other than four-dimensional spacetime whose extrinsic curvature may not exhaust those possibilities. There appears to be no consensus on which view is more plausible, and the aroma of metaphysics clearly permeates the issue, which is no doubt why many mainstream physicists find it unpalatable. One weakness of the interpretation, however, is that in general there is more than one way to embed an intrinsically curved (or flat) manifold in higher dimensions, and different embeddings of the same manifold may imply different physical evolutions (e.g., whether the spherical Robertson-Walker model is capable of collapsing, i.e., whether the radial coordinate is proportional to cosmic time and can therefore function as a parametric time because of actual coincidence with the time coordinate, as in the Spatial Condensation and Propagating Spacetime Phase Transition cosmological models; otherwise if time and radius are orthogonal, collapse is possible without destroying the past history of the Universe). So the interpretation that extrinsic curvature in a higher-dimensional embedding space is physically real does not imply uniqueness. Either the actual nature of the embedding must be determined or else it must be possible to appeal to a principle of maximum simplicity to select one embedding from the set of possibilities. An important question that quickly arises is: what is the number of extra dimensions needed for the four-dimensional spacetime of General Relativity to have intrinsic curvature stemming from extrinsic curvature? Applied to the general case of Riemannian manifolds, this question turns out to be highly nontrivial. It depends on such things as the differentiability of the metric field of the Riemannian manifold, the number of dimensions it possesses, whether the manifold is compact (for our purposes, a compact manifold is an n-dimensional manifold containing a finite n-volume, with or without a boundary), and the nature of the embedding (e.g., whether it preserves geodesic distances), to name a few. In our case we are interested only in isometric embeddings, i.e., embeddings that preserve all intrinsic geodesic distances and angles in the embedded manifold (an example of a non-isometric embedding is given in the next section, Figure 6-12, p. 414). 
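The notion of an isometric embedding can be made concrete with the familiar 2-sphere: the embedding map into Euclidean 3-space induces a metric, the matrix J^T J formed from its Jacobian, and isometry means that this induced metric equals the sphere's intrinsic metric diag(r^2, r^2 sin^2 θ). The sketch below (Python, with a numerically estimated Jacobian purely for illustration) checks that equality at an arbitrarily chosen sample point.

```python
import numpy as np

r = 2.0

def embed(theta, phi):
    """Embedding of the 2-sphere of radius r into Euclidean 3-space."""
    return r * np.array([np.sin(theta) * np.cos(phi),
                         np.sin(theta) * np.sin(phi),
                         np.cos(theta)])

def induced_metric(theta, phi, h=1e-6):
    """J^T J, where J is the (numerically estimated) Jacobian of the embedding map."""
    d_theta = (embed(theta + h, phi) - embed(theta - h, phi)) / (2 * h)
    d_phi = (embed(theta, phi + h) - embed(theta, phi - h)) / (2 * h)
    J = np.column_stack([d_theta, d_phi])
    return J.T @ J

theta, phi = 0.7, 1.3
g_induced = induced_metric(theta, phi)
g_intrinsic = np.diag([r**2, (r * np.sin(theta))**2])   # ds^2 = r^2 dtheta^2 + r^2 sin^2(theta) dphi^2
print(np.allclose(g_induced, g_intrinsic, atol=1e-6))   # True: this embedding is isometric
```

An embedding that distorted distances, such as the non-isometric example referred to above, would fail this comparison.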
Isometry may be defined locally or globally. For a function to be differentiable with respect to a given variable at a given point, it must be continuous in that variable and its derivative must exist (in particular, not diverge) there. If the derivative is itself a continuous function, the function is said to be continuously differentiable. Differentiability may also be defined locally or globally and is denoted C^k, where k is an integer that indicates how many orders of continuous derivatives exist. Generalization to higher-dimensional derivatives is straightforward.
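As a small worked example of the C^k classification just defined (the function is chosen here for illustration, not taken from the text): f(x) = x|x| is differentiable everywhere and its derivative is continuous, but its second derivative jumps at the origin, so f is C^1 but not C^2:

\[
f(x) = x\,|x| = \begin{cases} x^2, & x \ge 0 \\ -x^2, & x < 0 \end{cases},
\qquad
f'(x) = 2|x| \ (\text{continuous}),
\qquad
f''(x) = 2\,\mathrm{sgn}(x) \ (\text{discontinuous at } x = 0).
\]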
Since the latter half of the 19th century there has been interest in the problem of embedding manifolds in higher-dimensional Euclidean spaces. This involved not only how many Euclidean dimensions were required but also whether it was possible at all in general. We should note that outside of General Relativity, “embedding” may refer to functions in general, not only metric tensors, and the embedding space may not be Euclidean or even flat, but here we are interested at first only in Riemannian manifolds embedded in Euclidean spaces (we consider pseudo-Riemannian manifolds further below). An early result was a conjecture by Schlaefli (1873) that an n-dimensional manifold could be embedded in a Euclidean space of n(n+1)/2 dimensions (the number of independent elements in the metric tensor). This was contradicted by Hilbert (1901), who showed that the entire Lobachevsky plane (a two-dimensional infinitely extended hyperboloidal sheet) could not be isometrically embedded in a 3-dimensional Euclidean space (at least not using C^4 real functions representable as convergent power series). Much of the work in this area has involved conjectures, approximations, and upper bounds on the required number of dimensions, along with occasional impossibility proofs for special cases. Given the nature of some approaches to Quantum Gravity (e.g., Causal Dynamical Triangulation) involving a discrete spacetime approximating the continuous spacetime of General Relativity as an emergent property, we will be willing to consider approximate isometric embeddings. For example, portions of the hyperbolic plane can be approximately embedded isometrically in Euclidean 3-space by using a procedure described by Henderson (1998) in which the hyperbolic plane is constructed from paper annuli glued together in the appropriate way. The result is an approximation to part of the hyperbolic plane that can be made arbitrarily accurate (if one has the patience to work with arbitrarily narrow annuli) and clearly exists in Euclidean 3-space. The seminal work on embedding Riemannian manifolds was done by John Nash (1956). One of his conclusions was “Every compact Riemannian n-manifold is realizable as a sub-manifold of Euclidean (n/2)(3n+11)-space.” The number of Euclidean dimensions is an upper limit. As we have seen, a 2-sphere requires only a Euclidean 3-space, not 17 dimensions. But Nash answered the question of what is possible for compact Riemannian manifolds: they can all be embedded isometrically. After Einstein published General Relativity in 1915, interest in embedding manifolds intensified, and more focus was placed on pseudo-Riemannian manifolds (e.g., Greene, 1970). While specialists have avoided assigning physical significance to embedding spaces, they have uniformly embraced them as a tool for visualization, usually with sub-manifolds being what is embedded rather than the entire manifold. Often the sub-manifold is a hypersurface of simultaneity, since this is not only of physical interest but also eliminates the time axis that makes the pseudo-Riemannian manifolds of General Relativity so troublesome. With the overall signature (-+++), the hypersurface of simultaneity can be treated as Riemannian. That generally leaves compactness as the only limitation, along with the fact that these hypersurfaces are not connected to each other in the embedding space to form a 3+1 spacetime.
For example, the spherical case of the Robertson-Walker metric (Equation 6.58, p. 351) at one instant of finite cosmic time is a 3-sphere of finite radius, hence it is compact, and Nash’s result says that it can be embedded in a Euclidean space with 30 dimensions or less. It turns out to be a lot less, just four (Misner, Thorne, and Wheeler, 1973, Chapter 27). By dropping one more dimension from the subspace, hence leaving a 2-sphere, we can devote the third dimension to the Euclidean embedding space, and we can easily visualize Leffert’s Spatial Condensation cosmological model at one instant of cosmic time, and similarly for the Propagating Spacetime Phase Transition model.
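The dimension counts quoted in this passage follow directly from the two formulas already stated; a quick check (Python, purely arithmetic) reproduces both the 17 dimensions mentioned for the 2-sphere and the 30 mentioned for the 3-sphere, alongside Schlaefli's smaller conjectured count:

```python
def schlaefli_dimensions(n):
    # Schlaefli's conjectured embedding dimension: the number of independent metric components.
    return n * (n + 1) // 2

def nash_compact_bound(n):
    # Nash (1956): every compact Riemannian n-manifold embeds isometrically in Euclidean (n/2)(3n+11)-space.
    return n * (3 * n + 11) // 2

for n in (2, 3, 4):
    print(n, schlaefli_dimensions(n), nash_compact_bound(n))
# n = 2: Nash bound 17 (the 2-sphere actually needs only 3 dimensions)
# n = 3: Nash bound 30 (the 3-sphere actually needs only 4 dimensions)
```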
Since both of these models involve growth by aggregation, they are non-collapsing, and we can view their time evolution as a succession of expanding 3-spheres, something we cannot do for all spherical Robertson-Walker models, since in general they are mathematically capable of collapsing. The flat case of the Robertson-Walker metric at one instant of cosmic time is not compact, but being flat, it is already Euclidean on its own. The hyperboloidal case at one instant of cosmic time is also not compact and cannot be entirely embedded in four-dimensional Euclidean space, because it is a higher-dimensional version of the Lobachevsky plane, which Hilbert proved could not be entirely embedded in Euclidean 3-space. But it can be embedded in a flat Minkowski space of four dimensions. Thus enters the tantalizing possibility that our desire for embedding the pseudo-Riemannian manifolds of General Relativity in some flat higher-dimensional space can be satisfied, albeit with a kind of flat space that is alien to our intuitive visualization capabilities. Indeed, Greene (1970) has established that infinitely differentiable n-manifolds, compact or not, can be embedded in flat space of sufficiently high dimension. For Riemannian manifolds, the space is Euclidean with maximum dimension (2n+1)(6n+14), and for pseudo-Riemannian manifolds the space has a mixed signature with k plus signs and k minus signs, where k = (2n+1)(2n+6), i.e., the space is 2k-dimensional. So Minkowski spaces arise in situations beyond Special Relativity and may have more than four dimensions. Penrose (2004) devotes an entire chapter to them (Chapter 18) wherein he considers normal Minkowski space as a subspace of four-dimensional complex space. This has four real and four imaginary dimensions, hence eight total, from which 70 different four-dimensional subspaces can be extracted. The ones of special interest here are Euclidean with signature (++++) and Minkowski with signatures (-+++) and (+---), where the negative contributions to the spacetime interval come from the imaginary dimensions. Penrose’s main interest in the space of four real and four imaginary dimensions is the essential role it plays in his Twistor Theory, a topic with relevance to this chapter but which would require so extensive a review of supporting mathematics not needed elsewhere herein that we regrettably must omit it. To describe it in any meaningful way would be to reproduce the discussion already provided by Penrose (2004, Chapter 33), which is recommended to the reader’s attention. Twistor Theory started out, as Penrose says, “to extract features [of Quantum Field Theories] that mesh with those of Einstein’s conceptions, seeking hidden harmonies between relativity and quantum mechanics.” It has provided many insights into such connections and has introduced the possibility of nonlinear gravitons that could mediate Penrose’s gravitationally-induced wave-function collapse mechanism (section 5.15). But so far its most valuable contributions are in particle physics, quantum field theory, and pure mathematics. Among the qualitative statements that can be made about Twistor Theory is that it resolves the problem encountered in semiclassical gravity (mentioned in section 6.3) wherein quantizing spacetime itself results in quantum fluctuations between causal and non-causal connections linking events separated by nearly-null spacetime intervals.
Twistor Theory eliminates this by quantizing the twistor space instead of spacetime itself, and then spacetime becomes an emergent entity in a manner reminiscent of Loop Quantum Gravity (section 6.7). Both formalisms have intimate connections to spin networks and spin foams. Although Twistor Theory can be generalized to higher dimensions, in doing so it loses many of the properties Penrose prefers to keep, and its most natural form is consistent only with a four-dimensional spacetime. This puts it in opposition to Kaluza-Klein theories and superstring theories, including M theory (see section 6.6). In discussing Twistor Theory, however, Penrose makes statements
about his views on imaginary dimensions, and these will be of interest to us below. The Minkowski signature (-+++) brings us back to something mentioned in section 6.2 where we introduced the Minkowski metric (Equation 6.27, p. 329): the fact that Einstein originally used coordinates (ict,x,y,z), corresponding to the (-+++) subset of the complex manifold. This convention fell by the wayside, apparently because it did not provide anything that could not be had more easily by obtaining the negative contributions to the spacetime interval directly from the metric itself rather than by using an imaginary-variable coordinate axis. Perhaps even more seriously, when passing from the Minkowski metric of Special Relativity to the form it takes in General Relativity, a Lorentzian metric, the presence of nonzero off-diagonal elements based on (ict,x,y,z) axes results in (ds)^2 being a complex number, as may be seen in Equation 6.26 when, e.g., dx^μ is imaginary and dx^ν is real. Complex numbers are used extensively in physics, but it is generally considered bad form for the imaginary parts not to cancel out when values of physically real parameters are computed. Of course, with the (+---) signature, spacelike intervals in Special Relativity are purely imaginary, and this is not seen as a failure of the theory but rather a reason to dismiss their relevance for physically real interactions. But if a Lorentzian metric were to yield complex intervals, spacelike or timelike intervals with both real and imaginary parts nonzero, we would have some explaining to do. However, such intervals cannot arise in Minkowski embedding spaces of any dimensionality, because the off-diagonal elements of the metric tensor are always zero. The worst thing we have to worry about is explaining how purely imaginary dimensions can function as constituents of something that is supposed to be the objectively existing fundamental reality. If we continue along this line, we come to questions such as: can something that is mathematically imaginary be physically real? Is this what Wheeler meant when he said the embedding dimensions exist “in imagination”? Has human intuition fully absorbed the complete physically relevant epistemological content of imaginary numbers? In retrospect, it took a surprisingly long time for the concept of zero to become universal (see, e.g., Barrow, 2000). The Roman Empire clearly had not attained it. Ordinary human discourse still stumbles occasionally on the concept of nothingness, as for example in advertising claims such as “Nothing works better than our product.” There was also considerable resistance to the idea that negative numbers could be meaningful. Diophantus of Alexandria considered negative numbers absurd. They were gradually absorbed into Chinese and Indian mathematics, but European mathematicians held out until the 17th century. For example, Nahin (1998) describes the geometrical construction invented by Descartes for computing roots of quadratic equations. Expressed in terms of the modern quadratic formula, Descartes considered only the result of taking the positive square root to be legitimate if the other yielded a negative solution. He called negative roots of quadratic equations “false roots”, because his geometrical method for solving quadratic equations could not produce negative numbers. When both roots were positive, his method could yield them both, and he considered them both legitimate.
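Returning to the (ict, x, y, z) convention discussed above, the two observations made there can be written out explicitly (a minimal illustration; the single off-diagonal component g_{01} is an arbitrary assumption). With a diagonal metric the imaginary coordinate merely reproduces the usual real interval, but a nonzero off-diagonal element mixing the imaginary and real axes leaves an uncancelled imaginary term:

\[
ds^2 = \bigl(d(ict)\bigr)^2 + dx^2 + dy^2 + dz^2 = -c^2\,dt^2 + dx^2 + dy^2 + dz^2 ,
\]
\[
\text{but with } g_{01} \neq 0:\quad 2\,g_{01}\,d(ict)\,dx = 2\,i\,g_{01}\,c\,dt\,dx \quad (\text{purely imaginary, so } ds^2 \text{ becomes complex}).
\]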
Today negative numbers are considered to have as much relevance as positive ones, and the notion of zero is indispensable. Perhaps a similar transformation of our understanding of imaginary numbers will take place. Certainly they are already considered “meaningful” in complex analysis, and the role they play in rotations in the complex plane and in trigonometry problems is well understood. But this is mathematical meaningfulness. Even this much took a while to develop. Over a century after Descartes, Euler (1765, Chapter XIII) said of square roots of negative numbers “And, since all numbers which it is possible to conceive are either greater or less than 0, or are 0 itself, it is evident that we cannot rank the square root of a
negative number amongst possible numbers, and we must therefore say that it is an impossible quantity. In this manner we are led to the idea of numbers which from their nature are impossible; and therefore they are usually called imaginary quantities because they exist merely in the imagination.... and of such numbers we may truly assert that they are neither nothing, nor greater than nothing, nor less than nothing; which necessarily constitutes them imaginary, or impossible.” It is probably coincidental that Wheeler’s use of the phrase “in imagination” regarding embedding dimensions echoes Euler’s use of the phrase “in the imagination” regarding numbers whose square is negative. In Wheeler’s case, the specific use was in the context of Euclidean embedding space, so he clearly did not mean “imaginary” in the sense of complex variables. It seems to suggest, however, that he considered the dimensions of spacetime itself to be objectively real, otherwise there is not much distinction between its dimensions and those of the embedding space, whether Euclidean or Minkowski. But if the embedding dimensions exist only in imagination, there seems to be little loss if some of them are also imaginary in the mathematical sense. Nevertheless, Wheeler was talking exclusively about Euclidean space and making a distinction between the dimensions of spacetime and the embedding space regarding physical reality. The implication is that he embraced the hypothesis that intrinsically curved spacetime is physically real, an ontologically fundamental component of an objectively existing Universe that does not depend on human consciousness. Thus Wheeler did not seem to be advocating the view put forward by others claiming the nonexistence of objectively real physical space. This interpretation of the human experience holds that since everything we can deduce about existence is necessarily based on what comes to our consciousness through our senses, only those sensations can be regarded as existing without doubt, and we cannot know the external origin of those sensations, if in fact there is any such thing. Indeed, everything we experience may exist only “in imagination”. On the other hand, given that our sensory perceptions include observing the readouts of what seem to be measurement instruments capable of capturing symptoms of whatever generates sensations, and given that these symptoms reflect extremely subtle rules that elude our intuitive understanding in many cases (e.g., quantum entanglement), it might be argued that such a vast superiority of our unconscious minds compared to what our consciousness is capable of doing transcends plausibility. This is not a proof of anything, but in the author’s opinion, the hypothesis of an external objectively real physical Universe whose behavior may eventually be understood is not only the simplest interpretation of the human experience, it also maximizes the value of that experience by providing the most promising options for exploring it further. That is why embracing that hypothesis is implicit throughout this book and why we are driven to expand intuition to encompass the origin of intrinsic curvature of spacetime. Given that pseudo-Riemannian manifolds can be embedded in Minkowski spaces, and given the hint of an association of Minkowski spaces with imaginary dimensions, one possible avenue seems to be to pursue a deeper understanding of the relationships between imaginary and real numbers. 
We must point out that Euler’s seemingly disparaging remarks about imaginary numbers did not entail any refusal to work with them. In fact, Richard Feynman (1964, Chapter 22) called Euler’s formula, e^{ix} = cos x + i sin x, “the most remarkable formula in mathematics”. We have already seen how useful Euler’s formula is in Chapter 5 (Equation 5.24, p. 252), where it greatly simplified manipulating quantum-mechanical wave functions. One of its implications is Euler’s identity, e^{iπ} + 1 = 0, which has been admired for establishing a relationship between the five most important numbers in mathematics. Euler arrived at these results almost twenty years before he published the remarks quoted above about “impossible numbers”.
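As a quick numerical sanity check of these two expressions, the following minimal Python sketch (the function name is purely illustrative and not anything used elsewhere in this book) verifies Euler’s formula at an arbitrary point and confirms that e^{iπ} + 1 vanishes to machine precision:

```python
import cmath

def euler_check(x: float) -> complex:
    """Difference between exp(ix) and cos(x) + i*sin(x); should be ~0."""
    return cmath.exp(1j * x) - (cmath.cos(x) + 1j * cmath.sin(x))

# Euler's formula at an arbitrary sample point
print(abs(euler_check(0.7)))              # ~1e-16, i.e., zero to machine precision

# Euler's identity: e^{i*pi} + 1 = 0
print(abs(cmath.exp(1j * cmath.pi) + 1))  # ~1e-16
```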
Besides their value in reducing the need to deal directly with trigonometric functions in many problems, these expressions also greatly facilitate exploring the relationships between real and imaginary numbers. As noted above, this seems potentially useful if intuition regarding Minkowski space is to be improved. One of the most surprising discoveries at first sight is the nature of i^i. If one considers a number noted for being large, say a = 10^100, and then one anticipates the nature of a^a, one expects a vastly magnified quality of largeness. So one might expect that i^i could exhibit some magnified quality of imaginariness. The opposite is found, however: i^i is a 100% real number. Euler’s identity makes this easy to demonstrate:

\[
\begin{aligned}
-1 &= e^{i\pi}\\
i &= (-1)^{1/2} = e^{i\pi/2}\\
i^i &= \bigl(e^{i\pi/2}\bigr)^i = e^{-\pi/2} \approx 0.2078796\ldots
\end{aligned}
\tag{6.68}
\]
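A two-line numerical check (a minimal Python sketch; nothing here is part of the book’s own software) confirms that the principal value of i^i is indeed the purely real number e^{−π/2}:

```python
import math

principal = (1j) ** (1j)           # Python's principal branch of i^i
print(principal)                   # (0.20787957635076193+0j): purely real
print(math.exp(-math.pi / 2))      # 0.20787957635076193, matching Equation (6.68)
```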
In fact, since we could have written the first line as −1 = e^{(2n+1)iπ}, where n is an integer, we see that i^i is actually multivalued, with an unbounded number of real values, e^{−(2n+1)π/2}. For n → ∞, these form a decreasing sequence of values approaching zero. For example, for n = 999, the value is about 2.0222461×10^{−1364}. For n → −∞, the sequence diverges; for example, for n = −999, the value is about 2.1369260×10^{1362}.

Nahin (1998) explores many properties and applications of imaginary numbers, but we will mention only two more items here. The first involves a criticism issued by Paul Heyl (1938) of Einstein’s early use of ict as a time coordinate. Heyl was a respected physicist whose contributions included an improved value for Newton’s gravitation constant while working at the National Bureau of Standards. He acknowledged the usefulness of √−1 “as an essential cog in a mathematical device. In these legitimate cases, having done its work, it retires gracefully from the scene.” But regarding its use in a physical time coordinate ict by Einstein and Minkowski, he says “The criterion for distinguishing sense from nonsense has been lost; our minds are ready to tolerate anything, if it comes from a man of repute”. Nahin takes exception to this characterization, saying “√−1 has no less physical significance than do 0.017, 2, 10, or any other individual number (about which physicists do not usually write sarcastic essays).... Perhaps √−1 has at least as much physical significance as π”. And yet it is not clear that this physical significance of √−1 is actually more than its use as a flag signaling that certain solutions to equations describing physical situations have to be discarded, such as spacelike intervals as causal connections between events.

This seems to be indicated in the second item we include from Nahin (1998), which involves a man running to catch a bus. Keeping the algebra as simple as possible, the man runs toward the bus at constant speed v. The bus is stopped but begins to move away from the man with constant acceleration a at time t = 0, at which instant the man is at a distance d from the door of the bus, which remains open. We wish to know the time T at which the man overtakes the door and thus catches the bus. At this instant, the bus will have moved a distance aT²/2, making the door’s distance from where the man was at t = 0 equal to aT²/2 + d, and the man will have run a distance vT. To catch the bus, these two distances must be equal, so we have
\[
\frac{a}{2}T^2 - vT + d = 0 \tag{6.69}
\]

The quadratic formula gives us

\[
T = \frac{v \pm \sqrt{v^2 - 2ad}}{a} \tag{6.70}
\]
For example, suppose the man runs at 22 feet per second, the bus accelerates at 10 feet per second², and the man is 20 feet from the door at t = 0: then we have two solutions for T, about 1.2835 s and 3.1165 s. Presumably he catches the bus at the earlier time, but if he somehow overshoots it, he has a second chance to board the still-accelerating bus when it catches up to him at the later time (assuming that he is so surprised at running past the door that he forgets to slow down). But if he is 30 feet from the door at t = 0, then we find that he “catches” the bus at T ≈ 2.2 ± 1.077i seconds. Mathematically, he does catch the bus, just at a time that happens to be a complex number. As most scientists would, Nahin interprets this as not catching the bus at all, indicating that the physical significance of the imaginary part of T means that the solution must be discarded. There is no solution for the time at which he catches the bus, because he does not catch the bus. The real part of T does have physical significance, however: 2.2 s is the ratio v/a, which can be easily shown to be the time of closest approach by changing T in Equation 6.69 to the time variable t and changing the zero to the distance s as a function of t, then setting ds/dt equal to zero to find the extremum:
\[
s = \frac{a}{2}t^2 - vt + d \tag{6.71}
\]

\[
\frac{ds}{dt} = at - v = 0 \;\;\Rightarrow\;\; t = \frac{v}{a}
\]

This works only because the man does not catch the bus. Otherwise, the extremum can be a negative minimum after the man passes the bus and is therefore not the time of closest approach. But if the presence of an imaginary number in the time at which the man “catches” the bus implies that the man does not catch the bus at all, then the implication is that “imaginary” implies “not in the real world”, which bodes ill for imaginary dimensions forming part of physical reality.

Nevertheless, based on the quote above, Nahin does not dismiss imaginary numbers completely, and Penrose indicates a similar inclination, having made it clear on numerous occasions (e.g., 2004, 2016) that he cares a great deal about the nature of objective physical reality, considering it a meaningful concept worthy of our attempts at understanding. He identifies three realms (“worlds”, as he calls them; see, e.g., Penrose, 2004, Chapter 1) relevant to such study: the mental world that we experience directly, the “Platonic” realm of mathematical perfection available to our contemplation but whose truth seems not to depend on our awareness of it, and the objectively existing external world of physical reality that seems to mirror many aspects of the Platonic realm. These worlds are connected pairwise in various ways accessible to human consciousness, or at least that is the hope regarding the “real world”. Penrose has frequently used the adjective “magical” to describe the behavior of complex numbers, clearly in the same sense that it was used to describe randomness in the Preface to this book: to indicate something extremely remarkable and fascinating but near the boundary of human conceptualization, not something supernatural about which to be superstitious. There is no sorcery involved in demonstrating that i^i is a set of purely real numbers, but there is something magical, in the sense used herein, about an unlimited fountain of 100% “real” numbers springing forth from an expression involving only constants in a dense concentration of the very essence of imaginariness.
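To make the bus example concrete, here is a minimal numerical sketch (the parameter values are the illustrative ones used above, and the function name is ours, not Nahin’s): it solves Equation 6.69 for the two distances considered and reports the closest-approach time v/a.

```python
import cmath

def catch_time(v: float, a: float, d: float):
    """Roots of (a/2)T^2 - v*T + d = 0; complex roots mean the bus is never caught."""
    disc = cmath.sqrt(v**2 - 2 * a * d)
    return ((v - disc) / a, (v + disc) / a)

v, a = 22.0, 10.0                  # ft/s and ft/s^2, as in the text
print(catch_time(v, a, 20.0))      # real roots (printed with +0j): ~1.2835 s and ~3.1165 s
print(catch_time(v, a, 30.0))      # ~ (2.2 - 1.077j, 2.2 + 1.077j): the bus is never caught
print("closest approach at t =", v / a, "s")   # 2.2 s, the real part of T
```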
Regarding the relationship between “imaginary” numbers and physical reality, Penrose has words that are quite compatible with Nahin’s (Penrose, 2016, Appendix A.9): “However, the terminology is misleading, for it suggests that there is some greater ‘reality’ to these so-called real numbers than there is to the so-called imaginary numbers. ... the so-called imaginary numbers form just as consistent a mathematical structure as do the so-called real numbers, so, in this Platonic sense, they are also just as ‘real’. A separate (and indeed open) question is the extent to which either of these number systems precisely models the actual world.” That sounds like Penrose does not rule out the possibility that the intrinsic curvature of spacetime could arise from extrinsic curvature in a higher-dimensional flat spacetime with an “imaginary” time dimension. The link between the Platonic realm and physical reality presumably at least partly consists of “real” numbers, and since the Platonic realm seems to be the source of all the rules governing the behavior of the physical world, and given that “imaginary” numbers are full-fledged citizens of the former, perhaps that link includes both types of number. On the other hand, not everything in the former has been found mirrored in the latter, hence the question of whether physically real spacetime can encompass “imaginary” time is still “open”.

A similar notion actually arises in a subject of public debate about which Penrose and Stephen Hawking took opposite sides: the Hartle-Hawking No-Boundary Proposal (Hartle and Hawking, 1983; Hartle, Hawking, and Hertog, 2008). This proposal involves studies of the singularities encountered in black holes and the Big Bang cosmological models employing semiclassical quantum gravity (see section 6.3). Most theorists agree that General Relativity has lost its validity at some point prior to where its equations indicate the infinite curvature that characterizes a singularity. For one thing, the crucial notion of manifold differentiability does not apply. The No-Boundary Proposal exploits the fact that such singularities can be made to disappear under a Wick rotation of the time coordinate and a conversion to a new, but real, radial coordinate. A Wick rotation is a rotation of the time axis by 90° in the complex plane. It has the effect of simply replacing t by it, but it does so under the mantle of a generally accepted process, coordinate transformation, and therefore the impacts on the equations one is trying to solve are properly taken into account. It remains controversial in this case because it assumes that a coordinate believed by many to have physical significance can be considered part of a complex plane. This was not an obstacle to Hawking, however, because he attached no ontological significance to the coordinates anyway. He has made it clear on numerous occasions that his interpretation of the meaning of physics is that of logical positivism (see sections 5.8 and 5.9). For example, he has said (Hawking and Penrose, 1996): “I take the positivist viewpoint that a physical theory is just a mathematical model and that it is meaningless to ask whether it corresponds to reality. All that one can ask is that its predictions should be in agreement with observation.” Thus there is nothing personally inconsistent about Hawking’s use of Wick rotation, nor in its other uses (e.g., Statistical Mechanics, quantum field theories, etc.)
in which it is considered a mathematical convenience, not a statement about physical reality. In Hawking’s case, it entered into the evaluation of the Feynman path integral, a method of evaluating state probabilities also used in many areas of physics, e.g., Causal Dynamical Triangulation (see section 6.7). Hawking was seeking a quantum-mechanical wave function to describe the Universe, for which he needed to evaluate the Feynman path integral, and this required a boundary condition at the Universe’s beginning in time. In Minkowski (or Lorentzian) space, the probability amplitudes of the possible states could not be normalized in the presence of a singularity. With the imaginary time dimension and new radial coordinate,
the metric signature could be switched from Minkowski to Euclidean. The singularity vanished from the path integral, and normalized probability amplitudes emerged. Despite the fact that curved spacetimes remained involved, this approach became known as Euclideanization. A consequence of this is that the Universe could be viewed as having no time boundary condition because it had no beginning in time. Instead of a singularity at t = 0, there was no t at all. The Universe was contained in a point of zero spatial extent. The closed real interval of cosmic time was replaced with an open interval of imaginary numbers, i.e., a set not containing its bounding point and hence having no smallest value. Moving in the direction of increasing imaginary time traces out the earliest evolution of the Universe. To many critics of the idea, this bears too close a resemblance to the man catching the bus at a partly imaginary time. Another way of stating the objection is that this imaginary time could not provide causal connections between events in the evolving Universe. In the case of the man and the bus, the experiment can be done, in principle. We can place the man at various distances d at t = 0, and with the same running speed and bus accelerations, see which values of d and T result in a successful boarding. We can establish empirically that cases in which T is complex are the very same cases in which no boarding takes place, i.e., he gets as close as he is ever going to get at t = v/a, but there is a spatial gap that never closes. When the man does succeed in boarding the bus, he does so because of a causal connection between his states at t = 0 and t = T, and no causal connection exists between his state at t = 0 and the state “he caught the bus” when T is complex, because we see him miss the bus, after which his closest approach at t = v/a yields to larger and larger separations, since the bus is accelerating ever farther away. One may object that when the man fails to catch the bus, there is no “he caught the bus” event available for even a spacelike connection with his state at t = 0. But this assumes a real time coordinate. If time is allowed to be complex, then there clearly is such an event, because we can compute the time at which it happens. Experiments performed in the Universe wherein we live show the absence of the bus-catching event for values of d that are too large, so the conclusion is that time is real, not complex, in our Universe. The objection to the No-Boundary model Universe is that it may work fine with its imaginary time, but imaginary time can’t supply those causal connections in the Universe about which we really care. Hawking and his collaborators continued to work on the No-Boundary Proposal and the wave function of the Universe, producing some variations on the basic ideas, but most of these met similar or related objections (e.g., Deltete and Guy, 1996; Feldbrugge, Lehners, and Turok, 2017). At the time of their debate (Hawking and Penrose, 1996), Penrose did not object to the use of imaginary time per se, in keeping with what we have quoted of him above. Instead he objected to the mechanism Hawking had invoked for making a transition from imaginary time to real time. This transition was necessary in order for the model to evolve into a Universe compatible with modern observations. 
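To illustrate the signature switch just described in the simplest setting (a sketch for the flat line element only; the curved No-Boundary construction proceeds analogously), the Wick rotation amounts to substituting t = iτ, so that

\[
ds^2 = -c^2\,dt^2 + dx^2 + dy^2 + dz^2
\;\;\longrightarrow\;\;
ds^2 = c^2\,d\tau^2 + dx^2 + dy^2 + dz^2
\qquad (t = i\tau,\;\; dt^2 = -d\tau^2),
\]

turning the (−+++) Minkowski signature into the Euclidean (++++).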
Although at the time of the debate there was not yet any observational evidence for ongoing acceleration in the expansion of the Universe, nevertheless most cosmologists embraced the hypothesis that there had been an inflationary era in the Universe’s early history, as this was the most popular explanation for the large-scale uniformity that originally led to the Robertson-Walker metric (Equation 6.58, p. 351). Without going too far afield, we will mention that “inflation” is a spatial expansion caused by spontaneous symmetry breaking of the Higgs field, a quantum field required in the Standard Model of Particle Physics for certain particles to have nonzero mass. While fundamentally different from the cosmological constant, inflation does induce accelerated
expansion that presumably stops when the Higgs field reaches a state corresponding to a stable vacuum. There are a large number of variations on how that could come about, each with its own cosmological implications. We should point out that inflation is not inevitable in quantum cosmology, however; it is merely a mechanism that could yield the desired uniformity given the right initial geometry of the Higgs field (an unstable equilibrium called “false vacuum”). The inflationary mechanism is included in the cosmological model developed by Hawking and his collaborators in their work on the wave function of the Universe and the No-Boundary Proposal. During the inflationary era, its action has an effect that is similar to that of a cosmological constant. The model includes quantum fluctuations from which galaxies eventually form. These fluctuations are carried by inflation to vast spatial separations that would otherwise be too far apart for the matter and radiation fields in different parts of the sky ever to have interacted. But the Cosmic Microwave Background Radiation appears too uniform to have developed completely independently in different parts of the sky, and so inflation provides the needed connection. The Cosmic Microwave Background Radiation is a relic of the time when the hot ionized material in the very young Universe cooled sufficiently for electrons to attach themselves to primordial protons and alpha particles, forming neutral hydrogen and helium and thus rendering the medium transparent to electromagnetic radiation. Free electrons are highly efficient at scattering light, so the Universe had been essentially opaque up to that time, and the strongly interacting matter and radiation fields were at the same extremely high temperature everywhere in the infant Universe to within normal thermal fluctuations. The radiation field had a black-body spectrum as described by the Planck distribution (Equations 5.6, p. 237, and 5.9, p. 239). When the Universe was about 380,000 years old, the expansion had cooled that temperature to about 3000 K, the peak of the thermal radiation was at a wavelength of about 0.966 microns, and the photons were suddenly able to travel long distances without being scattered by free electrons. The radiation that we observe today has cooled to about 2.725 K and peaks at about 1.06 mm, i.e., in the microwave range. Its intensity is uniform over the sky to within fluctuations of about 0.001%. Getting a cosmological model to produce the right large-scale uniformity and the small-scale structure requires a delicate tuning of all the mechanisms involved. For Hawking’s model to get from being initially spherical with an imaginary-time Euclidean metric to a Lorentzian-metric de Sitter model approximately consistent with the large-scale structure observed today, he had to “glue together” the first half of the former with the later half of the latter and declare that, via a “tunneling” mechanism, the latter emerged from the former, with the time coordinate changing from imaginary to real in the process. “Tunneling” is the name given to a transition between quantum states that are separated by a barrier that would be impenetrable in classical physics. 
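As a quick arithmetic check of the black-body numbers quoted above, Wien’s displacement law (λ_peak = b/T, with b ≈ 2.898×10⁻³ m·K, which follows from the Planck distribution cited in the text) reproduces both peak wavelengths. The snippet below is only an illustrative calculation, not part of any analysis in this book:

```python
WIEN_B = 2.897771955e-3   # Wien displacement constant, m*K

def peak_wavelength(T_kelvin: float) -> float:
    """Wavelength (m) at which a black body at temperature T peaks."""
    return WIEN_B / T_kelvin

print(peak_wavelength(3000.0) * 1e6, "microns")   # ~0.966 microns at recombination
print(peak_wavelength(2.725) * 1e3, "mm")         # ~1.06 mm for today's 2.725 K CMB
```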
Given that this is now quantum cosmology, tunneling between states is on the table, although Hawking may have been using the term loosely, as no probability for the transition itself was quoted (the wave function for the Universe does of course provide probabilities associated with all the dominant Euclidean and Lorentzian history states). The term appears in a figure caption (Hawking and Penrose, 1996, Figure 5-7): “The tunneling to produce an expanding universe is described by joining half the Euclidean solution to half the Lorentzian solution.” Mathematically, this basic idea is not quite as outrageous as it might sound. Topologists “glue together” different topological spaces all the time, as do differential geometers with various manifolds. A logical positivist would not bat an eye at this, but an advocate of physical realism would find it unpalatable, and Penrose did, because the gluing was not done in a way consistent with the mathematical requirements of the Platonic realm, and so it could not be an accurate representation of the objectively
existing physical realm. Penrose (2004) indicates that his primary objection to the Hartle-Hawking model stems from its dependence on inescapable approximations made under circumstances of extreme mathematical instability exacerbated at the glued-together boundary. The whole point of “Euclideanization” was to let the Riemannian metric remove singularities that otherwise made computing Feynman path integrals uselessly fragile. But even so, it was necessary to identify dominant histories somewhat heuristically to ensure their inclusion in the integrals while ignoring a plethora of negligible contributors. This had to be done in both the “Euclidean” and Lorentzian models, and the boundary where they are glued together was discontinuous, potentially wreaking havoc with the approximations. Glued-together manifolds are not supposed to be discontinuous at the boundaries. Once all the parts are glued together, the resulting manifold has to be globally differentiable if it is not to be pathological. Penrose claimed, and Hawking agreed, that there should be a region of complex time in which the metric signature changes smoothly from (++++) to (−+++). Hawking believed that there was no reason to doubt the accuracy of his approximations, since his model described the latter-day Universe to within observational constraints, and as a positivist, he intended it only as a model that should agree with measurements. As a “realist”, Penrose needed something plausible as a faithful description of the actual physical Universe, and not only was this not that, it also suffered from association with other deficiencies regarding time-reversal symmetry and the prediction of curvature details that are radically different between the Big Bang and the endpoints of collapsing models (e.g., the miraculously low entropy of the early Universe versus the extreme entropy in a collapsed phase). But the point for us here is that Penrose did not object to the very notion of imaginary or complex time dimensions.

It seems therefore that if one can convince oneself that imaginary numbers can make it from the Platonic realm to the actual Universe, then intrinsic spacetime curvature can be considered to be due to extrinsic curvature in a flat higher-dimensional Minkowski space represented as Euclidean with an imaginary time dimension. If this is to be taken seriously, several other sticky points need to be addressed. In effect, this selects the (−+++) signature over the (+−−−) signature. The reader may have noticed that a lot of waffling has been going on regarding which signature to use. The author’s original preference was the latter, because it yields positive timelike intervals, which seem intuitively appealing in that the time dimension increases toward the future, i.e., toward those events that we do not yet remember. But this is not necessary, since what we actually experience as time is the proper time of our own reference frame, and this always increases toward the future for causally connected events. Recall that Equations 6.53 and 6.54 (pp. 346-347) define proper time for the (+−−−) signature. For the (−+++) signature, the definition is dτ = −ds/c, and other adjustments are also made explicitly, e.g., the Einstein-Hilbert action (Equation 6.67, p. 367) uses √(−g) in place of √g. So as a working hypothesis, let us adopt the (−+++) signature and also make a distinction between time and the time dimension. The latter has always been a bit mysterious in relativity theory, because it has units of space via the scale factor c.
This is consistent with the idea that all axes of spacetime are composed of the same “stuff”, and different 3+1 foliations for different reference frames are all equally legitimate, even though any given hypersurface of simultaneity contains events that are generally non-simultaneous under the different foliations of other reference frames. Next we make a distinction that will seem absurd to positivists: we associate the imaginary number i with the scale factor c, not the time t. This makes absolutely no mathematical difference within the framework of relativity theory, but it has great significance for physical realism. To see this, consider
once again the problem of the man trying to catch the bus. In this case, all the speeds between relative reference frames are much too small to need relativity theory. But if time is intrinsically imaginary, then we can represent it as t = it_r, where t_r is a real number. Then Equation 6.71 becomes
\[
s = -\frac{a}{2}t_r^2 - i v t_r + d \tag{6.72}
\]
Now a positive acceleration results in motion that produces a negative change in position, velocity produces an imaginary change in position, and distance is a complex variable. Some of this could presumably be alleviated by suitable redefinitions such as velocity as a change in position per unit imaginary time, with dimensions L/iT, for example, meters per imaginary second. But these nuisances can be avoided altogether if leaving out relativity means leaving out an imaginary time dimension, keeping the usual notion of time real (without the ic scale factor) while remaining consistent with its use in relativity theory (with the ic scale factor). Thus the mysterious imaginary quality of the time dimension in relativity theory is absorbed into the nature of changes in the spatial “stuff” that constitutes not only the time dimension but all four dimensions, leaving the ontological nature of time itself no more mysterious than it already is in nonrelativistic mechanics. The passage of proper time may be subject to the mysterious influence of how changes occur along the time dimension, but at least it is not intrinsically “imaginary” itself. Standard clocks in relativistic reference frames continue to be read out in numbers on the real line, and the invariance of the spacetime interval across different reference frames with different foliations enforces the self-consistency of each foliation’s time dimension having its own ic scale factor.

This still leaves us with something of a large pill to swallow. The invariance of the “stuff” of which spacetime is made under changes of foliation seems to require all volumes in hypersurfaces of simultaneity to be imaginary, since spacetime itself has dimensions of iL⁴, and the imaginary time dimension is different in different foliations, seemingly requiring the 3-space of any 3+1 decomposition to have dimensions of iL³. We should note that this all applies in the higher-dimensional embedding space whose Minkowski-like properties are derived through an imaginary time dimension. But nevertheless, any sub-manifold of that space would seem to be composed of the same reduced-dimension “stuff”, either purely real or purely imaginary, and for Quantum Gravity, the nature of this “stuff” seems rather certain to be relevant. On the other hand, we could consider an approach along these lines: (a.) for any given foliation appropriate to a given reference frame, the 3-space surface of simultaneity is described by the +++ part of the signature, hence it is completely Riemannian and embedded in a higher-dimensional Euclidean space made of real distances; (b.) to connect such hypersurfaces at different proper times to each other requires bridging a gap composed of imaginary distance; (c.) this gap is unique to the given foliation with its individual relationships between different hypersurfaces of simultaneity, hence it reflects something special about the relationship between events at different proper times within this foliation. The problem is that this implies that whether a given subspace of the four-dimensional Universe contains imaginary distances is dependent upon choice of foliation, violating the claim that all of spacetime is made up of the same “stuff” independent of foliation.
On the other hand, perhaps the argument can be made that there is one preferred foliation within which the hypersurface of simultaneity contains only purely real distances, and such hypersurfaces may indeed be glued together continuously with a purely imaginary-distance time dimension to produce a four-dimensional spacetime that we will denote S0 embedded in a higher-dimensional Minkowski-like space with an imaginary time dimension.
Then consider the fact that S0 can be foliated differently to produce hypersurfaces of simultaneity that intersect hypersurfaces in S0 that contain non-simultaneous events, hence that cut across the imaginary time dimension in S0. The argument is that these other (non-preferred) foliations with hypersurfaces of simultaneity containing non-simultaneous events in S0 nevertheless continue to satisfy the general covariance built into General Relativity. Einstein’s field equations and all other generally covariant equations of physics continue to work in these other foliations despite the fact that the “stuff” comprising the hypersurfaces of simultaneity includes an imaginary component. This cannot be detected from within such a foliation, and the actual “stuff” making up spacetime is determined by S0, wherein all non-temporal distances are real, and only the gluing of the separate preferred hypersurfaces of simultaneity in the embedding space involves an imaginary dimension. There is a feeling of grasping at straws in all of this. We are appealing to a preferred frame of reference and we are admitting a role for imaginary numbers in the physical makeup of the objectively real physical Universe. It is not clear that this cannot work, however, and it is the closest thing to an acceptable visualization of actual spacetime extrinsically curved and embedded in a higher-dimensional flat space that the author has encountered. We have mentioned other reasons why a preferred reference frame might be admissible despite the widespread belief that such things are forbidden. Ultimately, lack of support for them is not the same as ruling them out (e.g., Ellis’s EBU has one built in). The main leap demanded of human intuition is to accept the imaginary distance separating the S0 hypersurfaces of simultaneity. The alternative is that spacetime’s intrinsic curvature does not stem from extrinsic curvature in a flat higher-dimensional space. The only remaining visible alternative is that Lorentzian metrics stem from something in the nature of the “fabric” that comprises the space. The most obvious features that may be involved are those of the discrete spacetimes of certain approaches to Quantum Gravity, e.g., Loop Quantum Gravity, Causal Dynamical Triangulations, and the Propagating Spacetime Phase Transition model. Each of these involves a spacetime that exhibits some fractal-like features. We say “fractal-like” because a truly fractal space has no limit to the extent to which one can “zoom in”, whereas these spacetimes all hit hard stops at the Planck scale. There are some attempts at Quantum Gravity that involve truly fractal spacetime ab initio, but those of which the author is aware bear too much resemblance to jumping-off points for delving into fringe conjectures that tend to multiply beyond control and branch into incoherence, so they will not be discussed herein. One should keep an open mind about entertaining speculative hypotheses, but that does not mean that all bets are off. As noted earlier, there is a close, almost symbiotic, relationship between Loop Quantum Gravity and Causal Dynamical Triangulations. Both are generally formulated in the context of a Minkowski embedding space, which a Wick rotation should be able to convert to a Euclidean space with an imaginary time axis. 
CDT especially has exhibited aspects of fractal spacetime structure (Loll, 1998; Görlich, 2010; Ambjørn et al., 2013) and also something special about the way that hypersurfaces of simultaneity are connected to each other that stems from the causality condition. Four-dimensional spacetimes in CDT are composed of four-dimensional simplexes with any two hypersurfaces of simultaneity at consecutive ticks of the global clock connected by simplexes with vertices in one or the other hypersurface. Enforcing causality results in the lengths of timelike simplex edges being different from spacelike edges (Görlich, 2010). As mentioned in section 6.7 above, CDT employs three coupling constants in the Einstein-Hilbert action, and one of these, Δ, depends on the length ratio of timelike and spacelike simplex edges. Another,
K0, is related to the inverse of the Newton constant. Specific ranges of Δ and K0 values produce the CDT phase of great interest for cosmology, the de Sitter phase, which corresponds to a vacuum solution with a Robertson-Walker metric with a cosmological constant but spheroidal shape. The simplexes forming the three-dimensional hypersurfaces of simultaneity do not exhaust the geometric space, and a fractal dimension of 1.5 is typical. Similar results are found for Loop Quantum Gravity (Modesto, 2009). When the CDT hypersurfaces are glued together, the fractal nature of the resulting four-space largely disappears, indicating a special feature of the local time directions. Geodesics in fractal spaces are necessarily different from those in classical differential geometry. In CDT (Görlich, 2010), “we use a discrete geodesic distance defined as the length of the shortest path between successive centers of neighboring four-simplices. We expect that our definition of a geodesic will in the continuum limit, i.e. on scales sufficiently large compared to the cut-off scale, lead to the same geometric results as for exact geodesics.” Much of the work on computing geodesics in fractal spaces has been focused on truly fractal spaces (e.g., Hino, 2013; Lapidus and Sarhad, 2014), and so is not directly applicable to discrete spacetimes with structure at the Planck scale, although this issue is sometimes addressed (e.g., Hu, 2017). It is not clear (to the author, at least) whether fractal aspects at the Planck scale can explain how intrinsic curvature arises without extrinsic curvature embedded in a flat higher-dimensional space, but there is a hint of such a possibility. The other hint is the special nature of the connections between successive hypersurfaces of simultaneity. As we saw above, there is no problem embedding a hypersurface of simultaneity in a higher-dimensional Euclidean space. The problem is connecting them to each other in a manner consistent with the pseudo-Riemannian metric (i.e., Lorentzian metric, or in the flat embedding spaces of LQG and CDT, Minkowski metric). In Ellis’s Evolving Block Universe, these connections are performed by the collapsing of the wave function defined over a small range of global time. This is reminiscent of Hawking’s reference to a “tunneling” effect joining the two halves of his model Universe. The problem of getting from one hypersurface of simultaneity to the next may be solved by a quantum-mechanical “jump”. Perhaps it would be appropriate if the mysterious mechanism that glues together hypersurfaces of simultaneity should turn out to be the same mysterious process that induces non-unitary evolution of the wave function in ordinary Quantum Mechanics.

6.10 Summary

Within any one book, we cannot describe all of the interpretational quandaries in physics. In the previous chapter we focused on the unsolved mysteries surrounding quantum entanglement and the nature of the wave function, including how it “collapses”. One could add particle spin, chiral asymmetry, and many other phenomena to the list. In this chapter, we have focused on the origin of intrinsic spacetime curvature, since that seems most urgent to the author. The most intuitive mechanism is for intrinsic curvature to correspond to extrinsic curvature in a higher-dimensional embedding space, and this seems most straightforward in the context of isometric embeddings.
Figure 6-12 shows an example of how non-intuitive non-isometric embeddings can be (based on an example in Penrose, 2004, Chapter 18). This is the (ct, x, y) submanifold of half of a four-sphere in Minkowski space embedded in Euclidean three-space. The center of the sphere is at the origin. The (x,y,z) submanifold would just be an ordinary three-sphere, the locus of points all equally
distant from the origin. But the (ct, x, y) submanifold embedded in Euclidean three-space has the form of hyperboloidal sheets. Here only the upper sheet is shown, corresponding to a hemisphere in Minkowski space. The thick black lines are all radii of the sphere, hence they are all the same length in Minkowski space, but clearly not in Euclidean space, unless some special metric is defined along each radius that operates with a fractal dimension less than unity. On the other hand, if ct is replaced with ict, the surface does become hemispherical.
Figure 6-12. Non-isometric embedding in Euclidean 3-space of the (ct,x,y) submanifold of a hemisphere in Minkowski space; the four thick black lines from the origin to the surface are all equal in length in the corresponding Minkowski-space embedding; xy slices of the surface at ct = constant are circles (not shown).
We also cannot describe all of the serious approaches to Quantum Gravity in one book, much less one chapter. The goal has been to include the most noteworthy in terms of promise and number of active workers, along with the more interesting (to the author, admittedly) in terms of possible insight into physical realism. None of these efforts has yet produced a theory that smoothly connects Quantum Mechanics and General Relativity, and none has successfully made full contact with the foundation of physical reality. In the author’s view, the former incompleteness is at least partly due to the lack of effort at the latter. The “shut up and calculate” policy is not conducive to developing the physical intuition needed to find the fundamental elements of physical reality shared by Quantum Mechanics
and General Relativity that must comprise the intersection of the two in a unified theory. In section 5.14, we saw that John Bell rejected the notion of separate “classical” and “quantum” regimes. As the quotes of Einstein in section 5.1 indicate, there is a mysterious perfection to the physical Universe that makes sense only in the context of an underlying unity. Self-consistency is a hallmark of physical reality suggested by the glimpses of natural truth perceived by those who strive to observe them. There are good reasons to accept the hypothesis that a devotion to physical realism holds the most hope for the eventual achievement of an intuitive understanding of the fundamental building blocks of the Universe. Grasping whatever produces the behavior described so successfully but separately by Quantum Mechanics and General Relativity will surely require an expansion of human intuition, but that is not a new requirement in the history of science. To some, the idea of physical perfection is incompatible with irreducibly random processes. The idea is that randomness is unruly to the point of being sloppy, hardly “perfect” in any sense. But it has been pointed out on numerous occasions that the evolution of the quantum-mechanical wave function itself is completely deterministic. It is only the final selection of one state among the many possible for a physical system that is random. Of course the new state corresponds to a new configuration that defines a new wave function, so the randomness makes itself felt in determining future possibilities. As Max Born (1926b) said: “The motion of the particle follows the laws of probability, but the probability itself propagates in accord with causal laws.” To the author, this is not only not an imperfection but rather what makes the physical Universe perfect for sentient beings by allowing them to experience meaningful lives. What value is there to a Universe that permits only automatons to exist and act out a completely predetermined timetable of activities? That would be imperfect. Instead we have what seems to provide occasions for free will, and hence morality, to operate by means still not understood, but at least not ruled out.
Epilogue

In 1959, the physicist, philosopher, and Nobel laureate Eugene Wigner gave a lecture entitled “The unreasonable effectiveness of mathematics in the natural sciences” (Wigner, 1960). Like Penrose (1989) and others, Wigner was intrigued by the interaction between consciousness and quantum-mechanical phenomena, but the point of this lecture was broader than that. It focused on the fact that while mathematics is widely considered to be a human creation abstracted from physical experiences, the purity of its truths suggests that it is a manifestation of a transcendent domain that has been called the “Platonic Realm” (e.g., Penrose, 1989, 2004), a non-physical dominion of immaculate ideas that exists independently of whether sentient beings ponder its sublime intangible treasures. Wigner’s message was that it is very remarkable that this essentially divine preserve should supply such perfect descriptions of mundane phenomena. Wigner concludes that “It is difficult to avoid the impression that a miracle confronts us here”.

The public reactions to Wigner’s lecture ranged from “What did you expect?” to embracing and expanding on the fundamental idea, whose subsequent branches are beyond our scope. The point here is that the view of “randomness” expressed in this book falls well within Wigner’s wider concept. Logical discussions of randomness rely on mathematical expressions that reflect its place in the Platonic Realm, but the concept itself is hard to pin down in everyday terms. The human experience generally begins with no notion of determinism (or causality or conservation principles), so that randomness does not appear unusual to the infant mind. It is the opposite of randomness that the child learns and eventually comes to expect from the physical world. The idea that something could be random becomes the alien thought until the notion of epistemic randomness is appreciated, after which the classical precept of strict determinism takes root.

Epistemic randomness provides sufficient explanation of how all the advantages and disadvantages of diversity are generated. Exact duplications of classical objects are rare, simply because classical objects are assemblages of a vast number of microscopic components subject to random perturbations during the aggregation or growth process. Diversity is essential to evolution, since it provides the palette of disparate capabilities for accomplishing various feats vital to survival. Evolution itself is a brutal process that yields progress at a cost that is often individually painful, but the consensus of humanity seems to be that progress is a desirable thing, so the price must be paid. But in the author’s opinion, evolution is a scientific truth whose challenge must be overcome by evolving human consciousness. Its merciless operation must be replaced with intelligent self-interest. An example is the way in which German rocket technology was redirected from the craft of inflicting human death in World War II to the elevation of human science by the engineering successes supporting the space-exploration programs of many nations.

Diversity creates the distributions that result in a wide spectrum of human talents, and history shows that civilization advances or regresses primarily through the influence of that part of the population on the extreme tails of these distributions, members whose existence depends stochastically on having a sufficiently large population to produce a significant number of low-probability high-impact specimens.
Of course, this influence can’t operate without support from the rest of the population. Thus it is fallacious for elite individuals to adopt arrogant attitudes toward the bulk of humanity on whose shoulders they stand and without whose foundation they would probably not exist, just as it is misguided for other segments of the population to resent those blessed with
remarkable abilities that can make life better for everyone if properly applied. A population not divided against itself has the best opportunity for improving the quality of the “human experience” as it applies to all individuals. In section 4.1 we said “Not every worthwhile question has answers subject to rigorous application of scientific method.” Among these questions is the one regarding whether there is “meaning” to the human experience. Some dismiss such a question as pointless, thereby implicitly giving their answer. Those scientists who accept the challenge of addressing the question know well that their usual tools cannot yield a scientifically rigorous answer, but they may shine some light on certain aspects. A serious obstacle is encountered immediately: just what is the meaning of “meaning”? As with earlier discussions of free will in which we focused on what free will is not, we will look more closely at alternatives, and we recognize in advance the limitations of verbal language. For example, one cannot say “There is no such thing as ‘meaning’”, because either the statement must be false or alternatively, like everything else, meaningless. One may utter the words, but not meaningfully, in which case there is no point to making the statement. Such considerations overlap what we said previously about the notion of sanity (e.g., in sections 5.11 and 5.14). One must assume that the notion is both meaningful and applicable to healthy thought processes, since rigorous proof of such a claim would rely on underlying sanity to justify the use of logic. One cannot assume sanity in order to prove that it exists; such reasoning would be circular and hence fallacious. So we consider the alternative, an example of a situation that would preclude sanity without being particularly sensitive to the detailed definition. In a Universe wherein quantum entanglement could be used by one person to send superluminal messages into some other person’s past, inconsistent time loops would be possible, precluding any plausible concept of sanity. But we know that in our Universe, entangled particles exhibit delayed-choice correlations that demand superluminal communication of quantum state information. At face value, it appears that our Universe is capable of being hostile to sanity. It therefore seems remarkable, even unreasonably effective, that randomness should supply the protection that at least keeps sanity in the running. As pointed out in section 5.13, every attempt to design a method for superluminal communication based on quantum entanglement has been shown to fail because of one basic fact: the outcome of the measurements at the “sending” end cannot be controlled properly so that retrievable information can be encoded (see, e.g., Ghirardi et al., 1980). This limit is expressed in different ways for different instrumental setups, but the key ingredient in each case is irreducible quantum randomness. If Wigner’s conjecture about being confronted with a miracle is correct, this could be the best example. So while conceivably we are being constantly bombarded with unintelligible messages from other peoples’ futures and sending the same into other peoples’ pasts, such sterile background noise poses no threat to the proper functioning of a logical mind. On the other hand, perhaps we should temper our gratitude by acknowledging that superluminal communication alone is not sufficient for sending messages into someone’s past. 
There is also a requirement that the speed at which the message is sent must be greater than the speed of light squared divided by the speed of relative motion between sender and receiver, as shown in Equation 5.53 (p. 276) and the discussion immediately following. In section 5.14 following Equation 5.59 (p. 294), a model is described in which the product of the message speed and moving-observer speed is greater than c², resulting in ΔT in Equation 5.53 being less than zero. These speeds are interpreted in the context of Special Relativity, but they can
also be interpreted in the context of Quantum Mechanics. In this case, the message speed corresponds to the phase velocity of the observer’s instrumental wave packet in that observer’s rest frame, the unprimed system. The speed of the moving observer’s instrument in the unprimed system corresponds to its wave packet’s group velocity in that reference frame. These are relevant because the instruments become entangled when they interact with the original entangled particles. ΔT < 0 causes the information concerning the collapsed states of the entangled particles to go into the moving observer’s past, i.e., an instant in that observer’s proper time that has already passed according to the way in which that observer’s proper time is recorded in the unprimed system. The path of the message through spacetime is spacelike, and as shown in Equation 6.24 (p. 327), it is spacelike in all reference frames. Equation 5.55 (p. 276) gives an approximation of the value of ΔT for very large message speeds. This proposed mechanism has yet to be proved applicable and effective for all instrumental setups, but it suggests that it is good to have that bulwark against insanity provided by the nonepistemic irreducible randomness of Quantum Mechanics.

If one feature of a perfect Universe is allowing consciousness to engage in sane activity, then it seems that randomness is an essential ingredient of a perfect Universe. While nonepistemic randomness does not bestow free will or morality, it eliminates the obstacle of determinism. It does not bestow sanity, but it promotes the possibility by providing protection against inconsistent time loops. Free will, morality, and sanity are prerequisites for meaning and purpose in the human experience in the highest sense that we can attribute to those words.

We have seen that if we subscribe to the school of thought that embraces illuminating the human experience as the goal of science, then we must pay a lot of attention to randomness, because it suffuses every aspect of that experience. It supplies evolution with the raw materials needed for natural selection. Although it prevents us from making perfect measurements as we seek to quantify natural phenomena, it can be tamed through probability theory so that meaningful results are not only possible but accompanied by quantified uncertainty in our empirical knowledge. That same mathematical theory allows us to use Statistical Mechanics to compute macroscopic properties that stem from microscopic interactions too numerous to examine individually, interactions with fluctuations driven by random processes.

By accepting the challenges presented by randomness, empirical science was elevated to such a refined state that cracks in the armor of classical physics were exposed in high focus: the inability of Newtonian Mechanics to explain the tiny but highly significant discrepant residual advance in the perihelion of Mercury, and the failure of impeccable Classical Statistical Mechanics to explain blackbody radiation. Extremely accurate measurement theory and data analysis drove physics to General Relativity and Quantum Mechanics, where new challenges to human intuition were found, new glimpses into previously hidden aspects of Nature. Embracing the hypothesis that free will, morality, and sanity are possible imbues the adventure involved in the pursuit of knowledge, cooperative societies, and continued expansion of human consciousness with meaning in the fullest sense possible.
Ultimately, the human experience will be whatever we make of it with the gifts that Nature provides.
Appendix A  Moments of a Distribution

In physics and engineering, the word "moment" is frequently used not to refer to a point or duration of time but rather to refer to the effect of something operating at some distance from an origin, such as a "moment of force" due to pressure exerted perpendicular to a lever arm about a fulcrum, or a "moment of inertia" by which a mass density distribution causes resistance to a change of rotational state. In probability theory, the word "moment" is used in a manner similar to the latter, in keeping with the use of the words "mass" and "density" applied to probability.

In section 1.7, the terms "mean" and "expectation value" were introduced as they apply in probability theory, that is, as synonyms for the "average value" of the random variable. For a continuous random variable x with a probability density function px(x), the mean is
$$\bar{x} = \int_{-\infty}^{\infty} x\, p_x(x)\, dx \tag{A.1}$$
where the integration limits must enclose the random variable's domain, which is assumed here to be the entire real line. Here we assume that this integral exists, i.e., does not diverge, and we assume the same for the other integrals herein, noting that there are distributions for which this is not true. Thus the mean is the weighted average of the random variable taken over all of its possible values, with each such value weighted by the probability density for that value. For discrete random variables, the straightforward substitution of a summation for the integral is made, and since this is the only essential difference, we will consider only continuous random variables in this brief summary. Equation A.1 can be seen to be a special case of the following equation:
$$m_n = \int_{-\infty}^{\infty} x^n\, p_x(x)\, dx \tag{A.2}$$
where mn is called the nth moment of the distribution. So the mean is m1. Other moments are also important. The zeroth moment is simply the unit normalization, i.e., m0 = 1. Moments given by Equation A.2 are sometimes called “raw moments” to distinguish them from “central moments”, which are moments about the mean rather than about the point x = 0:
$$\mu_n = \int_{-\infty}^{\infty} \left(x - \bar{x}\right)^n p_x(x)\, dx \tag{A.3}$$
The cases of most importance are for n = 2, 3, and 4, since these are the most frequently used central moments of a random distribution. These central moments may be obtained from the raw moments via the following relationships:
$$\begin{aligned}
\mu_1 &= m_1 - \bar{x} = 0\\
\mu_2 &= m_2 - m_1^2\\
\mu_3 &= m_3 - 3 m_2 m_1 + 2 m_1^3\\
\mu_4 &= m_4 - 4 m_3 m_1 + 6 m_2 m_1^2 - 3 m_1^4
\end{aligned} \tag{A.4}$$

The variance is the average value of the squared distance from the mean, hence μ2. After the mean, this is the most important parameter of a distribution, and therefore it gets its own dedicated symbol, σ2, so that its square root, the standard deviation, is σ. The next two moments are used to define dimensionless parameters called the skewness and kurtosis. While these are important, they are not deemed worthy of Greek symbols and are usually just denoted s and k:
$$s = \frac{\mu_3}{\sigma^3} = \frac{\mu_3}{\mu_2^{3/2}}, \qquad k = \frac{\mu_4}{\sigma^4} = \frac{\mu_4}{\mu_2^{2}} \tag{A.5}$$
The skewness is a measure of the asymmetry in a density function. It is zero for distributions that are symmetric about the mean (for such distributions, all odd-numbered central moments are zero), positive for distributions that are “skewed to the right”, and negative for those that are “skewed to the left”. This is easiest to visualize for distributions with a single peak, as shown in Figure A-1. The “right” or “left” refers to the side of the mean that has the longer “tail”, i.e., the side for which the probability mass is stretched to more extreme distances from the mean. The effect is to make the distribution look like it is leaning in the opposite direction, i.e., a distribution that is skewed to the right appears to be leaning to the left, and vice versa.
Figure A-1. A: Negatively skewed distribution, s < 0, skewed to the left (leaning to the right). B: Positively skewed distribution, s > 0, skewed to the right (leaning to the left).
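As a concrete illustration, the moments and shape parameters defined above are easy to estimate from a sample. The following is a minimal sketch in Python (NumPy assumed; the function and variable names are ours, chosen only for this example, and the exponential and Gaussian test samples are arbitrary):

```python
# Estimate raw moments (Equation A.2), convert them to central moments
# (Equation A.4), and form the skewness and kurtosis of Equation A.5.
import numpy as np

def sample_shape(x):
    m1, m2, m3, m4 = (np.mean(x**n) for n in (1, 2, 3, 4))   # raw moments
    mu2 = m2 - m1**2                                          # variance
    mu3 = m3 - 3*m2*m1 + 2*m1**3
    mu4 = m4 - 4*m3*m1 + 6*m2*m1**2 - 3*m1**4
    s = mu3 / mu2**1.5                                        # skewness
    k = mu4 / mu2**2                                          # kurtosis
    return m1, mu2, s, k - 3.0                                # mean, variance, s, excess kurtosis

rng = np.random.default_rng(1)
print(sample_shape(rng.exponential(1.0, 1_000_000)))  # expect roughly (1, 1, 2, 6)
print(sample_shape(rng.normal(size=1_000_000)))       # expect roughly (0, 1, 0, 0)
```

The exponential draw is deliberately asymmetric (its skewness is 2 and its excess kurtosis is 6), while the Gaussian draw should return shape parameters near zero.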
The kurtosis is a measure of central concentration, i.e., the relative mass near the center of the distribution compared to the wings. The standard for central concentration is the Gaussian distribution, which has a kurtosis of 3. The "excess kurtosis" is defined as k-3. A distribution with zero excess kurtosis is called mesokurtic and has the same degree of central concentration as a Gaussian. Negative excess kurtosis is called platykurtic and indicates stronger central concentration and weaker wings, although the central concentration may appear weaker because of a less pronounced peak. Positive excess kurtosis is called leptokurtic and indicates stronger wings often accompanied by a narrower peak; such distributions are more prone to yield relatively extreme random draws.

Skewness and kurtosis are useful aids in deciding whether some sample's distribution is approximately Gaussian. The general bell-shaped curve is very common but does not necessarily imply that a Gaussian is a good approximation. For example, the bell-shaped curves in Figure 2-4 (p. 46) all appear more or less Gaussian, but in fact the approximation is generally too poor for most applications. The uniform distribution in Figure 2-4 is clearly not very Gaussian; it has excess kurtosis equal to -1.2. The triangular distribution has excess kurtosis equal to -0.6. The excess kurtosis values of the bell-shaped curves, going from left to right, are -0.4, -0.3, and -0.24, all significantly non-Gaussian. But when a bell-shaped curve has both skewness and excess kurtosis very near zero, it can often be approximated as Gaussian.

Every random variable has cumulative and mass/density functions; two other functions of interest can be defined by noting that Equation A.3 can be viewed as a special case of a more general equation:
$$\overline{f}(t) = \int_{-\infty}^{\infty} f(x,t)\, p_x(x)\, dx \tag{A.6}$$
In other words, any function of x can generally be averaged over the distribution of x (assuming lack of pathological behavior). Two important cases are
$$\phi_x(t) = \int_{-\infty}^{\infty} e^{itx}\, p_x(x)\, dx, \qquad M_x(t) = \int_{-\infty}^{\infty} e^{tx}\, p_x(x)\, dx \tag{A.7}$$
The first is known as the characteristic function, and the second is the moment-generating function. The characteristic function exists for every distribution, whereas the moment-generating function may not (for example, the Cauchy distribution shown in Equation 2.14, p. 55). The characteristic function is unique to each distribution, and so if it can be shown that two distributions have the same characteristic function, this proves that they are the same distribution. This is utilized in some derivations of the Central Limit Theorem. The moment-generating function is just the expectation value of etx and exists for well-behaved distributions; it is so named because it generates the mn of Equation A.2 according to
$$m_n = \left.\frac{d^{\,n} M_x(t)}{dt^{\,n}}\right|_{t=0} \tag{A.8}$$
i.e., the nth raw moment can be obtained by taking the nth derivative of the moment-generating function and evaluating it at t = 0. The reason why this works can be seen by considering the well-known series expansion for the exponential function:
$$e^u = 1 + u + \frac{u^2}{2!} + \frac{u^3}{3!} + \cdots = \sum_{n=0}^{\infty}\frac{u^n}{n!} \tag{A.9}$$
Using this for the exponential in the moment-generating function produces
$$\begin{aligned}
M_x(t) &= \int_{-\infty}^{\infty} p_x(x)\,dx + t\int_{-\infty}^{\infty} x\,p_x(x)\,dx + \frac{t^2}{2!}\int_{-\infty}^{\infty} x^2 p_x(x)\,dx + \frac{t^3}{3!}\int_{-\infty}^{\infty} x^3 p_x(x)\,dx + \cdots \\
&= 1 + t\,m_1 + \frac{t^2}{2!}\,m_2 + \frac{t^3}{3!}\,m_3 + \cdots
\end{aligned} \tag{A.10}$$
The nth term of the infinite summation is:
$$\frac{t^n}{n!}\, m_n \tag{A.11}$$
Since
$$\frac{d^{\,n}}{dt^{\,n}}\,\frac{t^n}{n!} = 1 \tag{A.12}$$
the nth derivative of the nth term with respect to t will simply remove t and the factorial in the denominator, leaving only mn, and all lower terms will disappear completely. Higher terms will contain factors of t and therefore disappear when t is set to zero.

Finally, two useful and frequently encountered parameters of a random distribution are the median and the mode. The definitions of these do not involve moments formally, although their values are often equal to certain moments. For example, any single-peaked symmetric density function has a median and mode equal to the mean. A good example is the Gaussian distribution. If one draws a sample of values from a population with a given distribution (where "sample" may be the entire population, which may be infinite) and sorts the values in increasing order, then the value at the midpoint is the median. If the sample contains an even number of values, then the average of the two values closest to the midpoint is the median (hence it can happen that the median
has a value that does not occur in the sample). The median plays a critical role in robust estimation, the estimation of population properties from a sample that contains outliers. It often happens in scientific data analysis that a set of measurements is contaminated by undesirable phenomena, such as a set of stellar brightness measurements taken from an image that was also struck by cosmic rays. Outliers can have a very significant effect on sample averages, whereas the median tends to be minimally impacted, unless the contamination overwhelms the sample. The mode is the most frequently occurring, or most probable, value of a distribution. The mode of a population is the value where the mass/density function peaks, if such a point is well defined. For example, the Uniform distribution has no mode, because all values are equally likely. A distribution with two peaks is called “bimodal” to mean that it has two modes. When dealing with a finite sample drawn from a population, the mode can be very difficult to estimate, since it may happen that no two values are equal; in such a case, some smoothing or histogramming can help identify ranges wherein values tend to cluster. The median and mode are generally useful for communicating typical values of a distribution that is asymmetric. For example, consider a small town of 50 families, 49 of which have annual incomes of $25,000 and one of which earns $1,000,000. The economic status of the town sounds reasonable if the average income of $44,500 is used to describe it. But the dire reality is more honestly presented by quoting the median income of $25,000. In fact, national economies of many countries involve a small percentage of the population in possession of a large percentage of the wealth, and the dichotomy between the mean and median income is very real. To understand the nature of life as experienced by the citizens, one is better served by the median income as a measure of economic status. A word about notation: we have used m and μ to indicate raw and central moments, whereas various conventions are found in the literature. These symbols are often seen both indicating central moments, the former referring to a sample and the latter to the population from which the sample was drawn. It is common to see the same symbol used for both raw and central moments, with the former indicated by a prime. Primes are not used herein in order to avoid confusion with exponents.
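The small-town income example is easy to reproduce numerically; the following minimal Python sketch (NumPy assumed, numbers taken from the example above) also shows how strongly a single outlier moves the mean while leaving the median untouched:

```python
import numpy as np
from collections import Counter

# 49 families at $25,000 and one at $1,000,000, as in the example above.
incomes = np.array([25_000.0] * 49 + [1_000_000.0])

print(np.mean(incomes))                   # 44500.0 -- pulled up by the single outlier
print(np.median(incomes))                 # 25000.0 -- essentially unaffected
print(Counter(incomes).most_common(1))    # [(25000.0, 49)] -- the sample mode
```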
Appendix B  Functions of Random Variables

Equation 2.16 (p. 57) in Chapter 2 gives the density function for the random variable that is the product of two independent zero-mean Gaussian random variables. We will derive that equation in this appendix as an illustration of the theory of functions of random variables. We will provide a brief qualitative description of the basic ideas employed by the theory, but rigorous derivation of its methods would take up more space than we can allocate. It is difficult to imagine a better exposition of this theory than what is presented by Papoulis (1965), Chapters 5 and 7. Anyone with even the slightest interest in this subject should have that book on the shelf; after reading those and other chapters, it will be found to be a very useful reference time and again.

We consider the case in which x and y are random variables with density functions px(x) and py(y), respectively. The random variable z is some function of x and y, g(x,y), and it has a density function pz(z), which we wish to obtain from px(x), py(y), and g(x,y). The general approach involves computing the cumulative distribution Pz(z) and then taking its derivative to obtain the density function. The cumulative distribution is conceptually easy to work with, since Pz(Z) is just the probability that z ≤ Z, where Z is any value that z can take on. This inequality corresponds to a region of the xy plane in which g(x,y) ≤ Z. The idea is to identify that region (which may be disjoint) and integrate the joint probability density function for x and y over the region with respect to one of the independent variables. This will eliminate that variable, and z can be substituted for the other via g(x,y). Examples of regions of the xy plane where g(x,y) ≤ Z are shown in Figure B-1 for two cases: (a.) g(x,y) = x + y; (b.) g(x,y) = xy.
Figure B-1. A: The shaded area illustrates z = g(x,y) = x + y ≤ Z for the case Z = 3. B: The shaded area illustrates z = g(x,y) = xy ≤ Z for the case Z = 3; note that if Z < 0 had been illustrated, the qualitative features would be reflected about the y axis relative to those shown.
The joint probability density is a standard and familiar notion, and it is absolutely required to be known if functions of the corresponding random variables are to be computed. It is simply the product of the separate density functions if x and y are independent, and otherwise a more complicated coupling is involved. Here we wish to illustrate the formal theory, not techniques for computing integrals in general, and so we will assume the simpler case of independent random variables. The form of the integral of the joint density over the region of interest depends on g(x,y).

Before examining z = g(x,y), it is useful to consider the simpler case z = g(x). A fundamental theorem in the study of functions of continuous random variables is that the density function for z depends on the density function for x, px(x), and all of the real roots of z - g(x) = 0 according to

$$p_z(z) = \sum_{i=1}^{N} \frac{p_x(x_i)}{\left|g'(x_i)\right|} \tag{B.1}$$
where the xi are the N roots of z - g(x) = 0, and g′(xi) is the derivative of g(x) with respect to x evaluated at xi and must be nonzero. For the linear function g(x) = Ax + B, there is only one root, and we have

$$p_z(z) = \frac{1}{|A|}\, p_x\!\left(\frac{z-B}{A}\right) \tag{B.2}$$

Returning to z = g(x,y), we can write Pz(Z) as the double integral of the joint density function over the region for which g(x,y) ≤ Z, denoting that region D:
$$P_z(Z) = \iint_D p_{xy}(x,y)\, dx\, dy \tag{B.3}$$
We are considering z = g(x,y) = xy, and so, guided by Figure B-1B, in which the y limits of integration depend on whether x is positive or negative, we write
$$P_z(Z) = \int_{-\infty}^{0}\int_{Z/x}^{\infty} p_{xy}(x,y)\, dy\, dx \;+\; \int_{0}^{\infty}\int_{-\infty}^{Z/x} p_{xy}(x,y)\, dy\, dx \tag{B.4}$$
Note that this arrangement of integration limits is based on Figure B-1B, which shows a case for Z > 0. For Z < 0, the y integration limits should be swapped, and the two Z ranges should be handled separately. In many cases, depending on the joint density functions, the two cases will come out the same in the end, and that is in fact what happens here because we are considering independent zero-mean Gaussian random variables for which pxy(x,y) = pxy(±|x|,±|y|), i.e., the joint density function is even in both x and y. This fact also causes the two integrals in Equation B.4 to be equal, and so we can write simply
$$P_z(Z) = 2\int_{0}^{\infty}\int_{-\infty}^{Z/x} p_{xy}(x,y)\, dy\, dx \tag{B.5}$$
Using Equation B.2 to substitute for y at a fixed value of x, and using z(y = Z/x) = Z,

$$P_z(Z) = 2\int_{0}^{\infty}\int_{-\infty}^{Z} \frac{1}{x}\, p_{xy}\!\left(x, \frac{z}{x}\right) dz\, dx \tag{B.6}$$
Differentiating with respect to Z yields
$$p_z(z) = 2\int_{0}^{\infty} \frac{1}{x}\, p_{xy}\!\left(x, \frac{z}{x}\right) dx \tag{B.7}$$
For the independent zero-mean Gaussian random variables, the joint density function is
$$p_{xy}(x,y) = p_x(x)\, p_y(y) = \frac{e^{-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}}}{2\pi\,\sigma_x \sigma_y} \tag{B.8}$$
Substituting this in Equation B.7 yields
$$p_z(z) = \frac{1}{\pi\,\sigma_x \sigma_y}\int_{0}^{\infty} e^{-\frac{x^2}{2\sigma_x^2} - \frac{z^2}{2\sigma_y^2 x^2}}\; \frac{dx}{x} \tag{B.9}$$
Making the substitutions
$$a = \frac{z^2}{2\sigma_y^2}, \qquad b = \frac{1}{2\sigma_x^2}, \qquad u = x^2, \qquad du = 2x\,dx, \qquad \frac{du}{2u} = \frac{dx}{x} \tag{B.10}$$
the integral in Equation B.9 becomes
$$\frac{1}{2}\int_{0}^{\infty} e^{-bu - \frac{a}{u}}\; \frac{du}{u} \tag{B.11}$$
This integral can be recognized as having the form
$$\int_{0}^{\infty} u^{\,n-1}\, e^{-bu - \frac{a}{u}}\, du = 2\left(\frac{a}{b}\right)^{n/2} K_n\!\left(2\sqrt{ab}\right) \tag{B.12}$$
with n = 0. Kn is a modified Bessel function of the second kind with order n. Replacing the substitutions (B.10) and taking the scale factors on the integrals into account produces
$$p_z(z) = \frac{K_0\!\left(\dfrac{|z|}{\sigma_x \sigma_y}\right)}{\pi\,\sigma_x \sigma_y} \tag{B.13}$$
which is Equation 2.16. By using the simple relationships between moments of independent random variables, we showed in Chapter 2 that
$$\bar{z} = \overline{xy} = \bar{x}\,\bar{y} = 0, \qquad \sigma_z^2 = \overline{(z-\bar{z})^2} = \overline{z^2} = \overline{x^2 y^2} = \overline{x^2}\;\overline{y^2} = \sigma_x^2\, \sigma_y^2 \tag{B.14}$$
Thus we don't actually need a density function to obtain the mean and variance of z, but a primary reason for computing pz(z) is to be able to relate confidence levels with distance from the mean. Such distances are commonly expressed in units of standard deviation. For example, the confidence that a random draw will produce a result within ±kσ of the mean is

$$C_z(k\sigma) = \int_{\bar{z}-k\sigma}^{\bar{z}+k\sigma} p_z(z)\, dz \tag{B.15}$$
In our case, the mean is zero, and for purposes of expressing confidence level as a function of distance from the mean in units of standard deviation, we may take σx = σy = 1, giving σz = 1, so that

$$C_z(k\sigma) = \frac{1}{\pi}\int_{-k}^{k} K_0\!\left(|z|\right) dz = \frac{2}{\pi}\int_{0}^{k} K_0(z)\, dz \tag{B.16}$$
Some sample values are compared to corresponding Gaussian values, CG(kσ), below.

  k     Cz(kσ)        CG(kσ)
  1     0.79100634    0.68268948
  2     0.93817111    0.95449988
  3     0.98036140    0.99730007
  4     0.99354037    0.99993663
  5     0.99782980    0.99999943
In most scientific and engineering applications, this much difference between two random distributions is significant and warrants the extra work needed to obtain the form of the density function. For example, there is only about one chance in 1,741,523 of getting a 5σ fluctuation from a Gaussian population, but with the product of two independent zero-mean Gaussian random variables, a 5σ fluctuation can be expected once out of every 461 draws.
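The entries in the table can be checked numerically. The following is a small sketch (Python, assuming SciPy's modified Bessel function and quadrature routines; this is not the code used to produce the values above) that evaluates Equation B.16 and the corresponding Gaussian central probability:

```python
# Compare C_z(k sigma) for the product of two unit-variance, zero-mean
# Gaussians (Equation B.16) with the Gaussian C_G(k sigma).
import numpy as np
from scipy.special import k0          # modified Bessel function K_0
from scipy.integrate import quad
from scipy.stats import norm

def Cz(k):
    # p_z(z) = K_0(|z|)/pi for sigma_x = sigma_y = 1; integrable log singularity at z = 0
    val, _ = quad(lambda z: k0(abs(z)) / np.pi, -k, k, points=[0.0])
    return val

for k in range(1, 6):
    print(k, round(Cz(k), 8), round(norm.cdf(k) - norm.cdf(-k), 8))
```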
We will give one more example of a function of two independent random variables, because this next one is much simpler and of great importance. Specifically, we consider z = x + y, the most common of all dependences in scientific data analysis, since it is very common for random errors to combine additively. The region of the xy plane corresponding to z = g(x,y) = x + y ≤ Z is shown in Figure B-1A (where Z = 3 for purposes of illustration). This is the region over which we must integrate the joint density function, which is still given by Equation B.8 above. For this case of g(x,y), Equation B.3 becomes

$$P_z(Z) = \int_{-\infty}^{\infty}\int_{-\infty}^{Z-x} p_x(x)\, p_y(y)\, dy\, dx \tag{B.17}$$
Differentiating with respect to Z produces the density function
$$p_z(z) = \int_{-\infty}^{\infty} p_x(x)\, p_y(z-x)\, dx \tag{B.18}$$
which is the convolution of the two input density functions. This applies to any two independent random variables, not just Gaussians (the extension to correlated random variables is straightforward; see, e.g., Papoulis, 1965). This is an extremely powerful and important general result. In the case under consideration, x and y independent Gaussian random variables, allowing the means to be nonzero introduces negligible complication, and so we will generalize Equation B.8 to
$$p_{xy}(x,y) = p_x(x)\, p_y(y) = \frac{e^{-\frac{(x-\bar{x})^2}{2\sigma_x^2} - \frac{(y-\bar{y})^2}{2\sigma_y^2}}}{2\pi\,\sigma_x \sigma_y} \tag{B.19}$$
Evaluating Equation B.18 with this joint density function is fairly simple and results in
$$p_z(z) = \frac{e^{-\frac{(z-\bar{z})^2}{2\sigma_z^2}}}{\sqrt{2\pi}\,\sigma_z}, \qquad \bar{z} = \bar{x} + \bar{y}, \qquad \sigma_z^2 = \sigma_x^2 + \sigma_y^2 \tag{B.20}$$

i.e., z is itself Gaussian with a mean and variance given by the sums of the corresponding parameters of x and y. These relationships for mean and variance do not depend on x and y being Gaussian; they are general properties for sums of independent random variables, but for z to have a strictly Gaussian distribution, x and y must both be Gaussian. Even when they are not, the density function for z will usually tend to be more bell-shaped than those of x and y, as illustrated in Figure 2-4 (p. 46).

Finally, we consider one more case of a function of one random variable, because this case is of immense importance for Monte Carlo analyses (see section 4.6) and involves one of the most beautiful results in all of probability theory. At first sight, the problem appears unsolvable as stated
because of insufficient information. The function involved is the cumulative distribution of a random variable, i.e., the functional dependence is through the appearance of the random variable as the upper limit of an integral. For the random variable x, we define y = g(x) as follows.

$$y = g(x) = \int_{-\infty}^{x} p_x(x')\, dx' = P_x(x) \tag{B.21}$$
where x′ is a dummy integration variable, px(x) is the probability density function for the random variable x, and Px(x) is the corresponding cumulative distribution function. We assume that the domain of x is the entire real line, but for random variables with limited domains, the lower limit could be used without altering the basic situation, and disjoint domains can be handled easily by breaking up the integral into separate regions. For simplicity, we also assume for the moment that Px(x) is a strictly increasing function of x, i.e., has no flat regions (recall that cumulative distribution functions are never decreasing functions, since that would require negative probability). This allows us to assume that an inverse function exists.

Given any random draw of x, there is a corresponding value of y, hence y is a function of the random variable x. The question is: what is the probability density function for y, py(y)? At first glance it may appear that this question cannot be answered without knowing the form of px(x), but in fact the answer does not depend on px(x) at all; we get the same result for py(y) independently of px(x). That is what is so remarkable about this theorem, which is sometimes called the Probability Integral Transform Theorem.

There are various approaches to proving this theorem (see, e.g., Angus, 1994). Given the conditions we have imposed here on Px(x), we can assume that an inverse function exists and that its argument has the domain from 0 to 1, since that is the range of Px(x), and therefore a simple derivation is possible by applying Equation B.1 directly. Equation B.21 permits one solution to y - g(x) = 0, namely

$$x_1 = P_x^{-1}(y) \tag{B.22}$$
Equation B.1 becomes
$$p_y(y) = \frac{p_x(x_1)}{\left|g'(x_1)\right|} = \frac{p_x(x_1)}{p_x(x_1)} = 1, \qquad 0 \le y \le 1 \tag{B.23}$$
Therefore y as defined in Equation B.21 is uniformly distributed between 0 and 1.

The reason why cumulative distributions are uniformly distributed from 0 to 1 is easily visualized. Figure B-2 shows the cumulative and density functions for a somewhat nontrivial distribution, the mixture of two Gaussian populations discussed in section 2.9, which we will use here for illustration. Excluding the endpoints, the entire range of the cumulative distribution has been sampled with a uniform spacing of 0.02, and each sample is mapped to the density function. This results in the density function being sampled more densely where it has its higher values. This is because the density function is the slope of the cumulative distribution, so that slope is steeper where the density function is larger. The steeper slope in turn presents a larger target for the sampling lines. This relationship exists for all distributions. But now we can see why we required the cumulative distribution to be strictly increasing: any flat regions would map to multiple locations within the domain of the random variable. However, these would be values which the random
variable cannot take on, since flat regions in the cumulative distribution are where the density function is zero. Technically, these are not in the domain of the random variable; rather its domain is disjoint. In such cases, the inverse function in Equation B.22 cannot be taken for granted, and more complicated derivations are needed, but the bottom line is still the same regarding y as defined in Equation B.21 being uniformly distributed between 0 and 1.
Figure B-2. Uniform sampling of the cumulative distribution maps into denser sampling of the density function near its larger values.
The value of the Probability Integral Transform Theorem for Monte Carlo analyses is now clear: random deviates can be generated for any population whose cumulative distribution function is known, either analytically or numerically. Generating uniformly distributed random deviates is a thoroughly solved problem, and from there, random deviates for any describable distribution can be obtained via simple table lookup. Of course, specialized pseudorandom generation algorithms have been developed for most commonly encountered distributions, and these are faster computationally than inverting a cumulative distribution, but given that the latter is always uniformly distributed between 0 and 1, a way is clear for handling the nonstandard distributions that one occasionally encounters. When mapping from the cumulative distribution to generate pseudorandom deviates, however, one must handle flat regions carefully if they exist. When presented as we have done above, i.e., defining a random variable y as a function of another random variable x with the latter appearing in the function definition as the upper limit of the integral of an unspecified density function, the Probability Integral Transform Theorem that results may seem like a rabbit pulled out of a hat. But viewed from a different perspective, the fact that all cumulative distributions are uniformly distributed can be seen as tautological. Normally we think of a cumulative distribution as having the same domain as the density function of the random variable (e.g., as in Figure B-2), and its range is from 0 to 1. But the random variable y defined in Equation B.21 has a domain from 0 to 1, which may be a little disorienting at first. If any cumulative distribution’s range is partitioned into segments of equal size (again, as in Figure B-2), each segment is equally likely a priori to correspond to a given draw of the random variable because of the cumulative nature of the distribution itself. One could not partition the range into a segment from 0% to 50% and a segment from 50% to 100% and then claim that something other than 50% of all draws are expected to correspond to the upper half. Each partition is equally likely a priori to contain any given draw by the intrinsic property of being “cumulative”. The same is true for any equal-sized partitions, and that is the definition of being uniformly distributed. It is just not as easy to see coming when the value of the cumulative distribution corresponding to a random draw is itself treated as a random variable with its own density function, e.g., when the question posed is “what will be the shape of a histogram of the cumulative-distribution values corresponding to a very large number of random draws from a single population?” We close by noting that all of the above applies strictly to density functions for continuous random variables. The density function for a continuous random variable with physical dimensions D has units of 1/D. For example, if the random variable corresponds to a position measured in inches, then its density function has units of probability mass per inch. Since probability mass is dimensionless, the density function has dimensions of inverse inches. This is an important difference from probability distributions for discrete random variables. Each point in the domain of a discrete random variable has some probability of occurring, not some probability density. It follows that probability distributions for discrete random variables are dimensionless. 
One result of this is that the factor of 1/|A| in Equation B.2 is omitted for discrete random variables. That factor converts units for continuous random variables, and such a conversion is not needed for discrete random variables. This fact enters into the function of Poisson random variables discussed in Appendix I.
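A minimal sketch of the inverse-transform recipe described above, in Python with NumPy assumed (the tabulated two-Gaussian mixture is only an illustrative choice, and the grid resolution and names are ours):

```python
# Probability Integral Transform sampling: draw uniform deviates on (0,1)
# and map them through the (numerically tabulated) inverse of a cumulative
# distribution to obtain deviates from that distribution.
import numpy as np

rng = np.random.default_rng(42)

# Tabulate an example density (mixture of two Gaussians) and its cumulative distribution.
x = np.linspace(-8.0, 8.0, 4001)
pdf = 0.7 * np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi) \
    + 0.3 * np.exp(-0.5 * ((x - 3.0) / 0.5)**2) / (0.5 * np.sqrt(2 * np.pi))
cdf = np.cumsum(pdf)
cdf /= cdf[-1]                      # crude normalization is adequate for a sketch

u = rng.uniform(size=100_000)       # uniform deviates on (0, 1)
samples = np.interp(u, cdf, x)      # table lookup: invert the cumulative distribution

print(samples.mean())               # should be near 0.7*0 + 0.3*3 = 0.9
```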
Appendix C  M Things Taken K at a Time With Replacement

Equation 1.2 in Chapter 1 (shown as C.1 below) gives the formula for the number of ways that m things can be taken k at a time. It is assumed in the context that when a thing is taken once, it cannot be taken again, and there is also a built-in usage of the fact that only two kinds of "things" are available, heads and tails. Rather than leave the impression that this is an entirely general expression, we will discuss this subject a little further in this appendix. The equation involved is:
$$N_m(k) = \binom{m}{k} = \frac{m!}{k!\,(m-k)!} \tag{C.1}$$
The m things in the discussion may be labeled Hn, n = 1 to m, meaning "heads on the nth coin toss". It is implicit that if one does not "take" heads on a given toss, then one is left with tails on that toss. It may seem trivial at first, but it is also implicit that if one does take heads on a given toss, one cannot then put that "thing" back into the pool from which future takings of things will be made. That is why the equation above is called the number of ways to take m things k at a time without replacement (the phrase "without repetition" is also used). There are times when things may be taken with replacement, and a different formula applies.

For illustration, we consider a situation in which more than two kinds of things are available for the taking: we have an apple (A), a banana (B), and a carrot (C), and you are told that you may have any two of them. What are the possibilities? We have m = 3 and k = 2; the formula says that "3 take 2" is 3!/(2!1!) = 6/2 = 3. The three possibilities are AB, AC, and BC. Note that BC is considered the same as CB, in the same way that getting heads on the 3rd flip and heads on the 7th flip is the same as getting heads on the 7th flip and heads on the 3rd flip.

Now consider a different situation in which we have an unlimited supply of apples, bananas, and carrots, and you are again told that you may take any two. In this case, you may choose to take two apples, which you are allowed to do, because there are enough to make that possible. This is called taking 3 things 2 at a time with replacement, because after you take one apple, the effect is the same as though you had put it back and taken it again. Taking an apple once does not eliminate your option to take an apple again. Now the possibilities are AA, AB, AC, BB, BC, and CC. So there are 6 ways to take 3 things 2 at a time with replacement. The formula for this is:

$$_r N_m(k) = \frac{(m+k-1)!}{(m-1)!\,k!} \tag{C.2}$$
We use the prefix subscript r to indicate "with replacement". Now with m = 3 and k = 2, we have (3+2-1)!/((3-1)!2!) = 4!/(2!2!) = 6. The formula in Equation C.2 is very important in its own right. For example, the number of independent elements in a symmetric N×N matrix is just N things taken 2 at a time with replacement, since the 2 things being taken are the matrix indexes, and each runs from 1 to N and may be taken more than once. An element in row 17, for example, may have column number 3, 8, 13, etc., so 17
is taken as many times as necessary. Using m = N and k = 2 in the formula gives the well-known result for the number of independent elements, N(N+1)/2. For a symmetric N×N×N rank-3 tensor, we have k = 3, resulting in N(N+1)(N+2)/6 independent elements. In combinatorial analysis as encountered in probability theory, Equation C.1 is applicable more often than Equation C.2. The point here is that when taking m things k at a time, one must be careful to consider whether this is done with or without replacement.
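Both counting formulas are easy to verify by brute-force enumeration; a short Python sketch using only the standard library (the helper names are ours):

```python
# Verify Equations C.1 and C.2 by enumeration.
from itertools import combinations, combinations_with_replacement
from math import comb, factorial

def take(m, k):                      # Equation C.1: without replacement
    return comb(m, k)

def take_with_replacement(m, k):     # Equation C.2
    return factorial(m + k - 1) // (factorial(m - 1) * factorial(k))

items = "ABC"                                          # apple, banana, carrot
print(list(combinations(items, 2)))                    # AB, AC, BC -> 3 ways
print(list(combinations_with_replacement(items, 2)))   # AA, AB, AC, BB, BC, CC -> 6 ways
print(take(3, 2), take_with_replacement(3, 2))         # 3 6

N = 4                                                  # symmetric N x N matrix
print(take_with_replacement(N, 2), N * (N + 1) // 2)   # both 10
```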
Appendix D  Chi-Square Minimization

Given N data points yi with uncertainties σi, i = 1 to N, and each data point associated with a significantly distinct abscissa xi (i.e., x sampled well enough not to cause singularities or precision problems), we wish to model y as a function of x with a functional dependence of the form

$$y = \sum_{k=1}^{M} p_k\, f_k(x) \tag{D.1}$$
where the pk are coefficients whose values must be found, and the fk are M arbitrarily chosen basis functions that define the model. For example, if we wish to use a cubic polynomial model, then fk(x) = x^(k-1), and M = 4. In order to solve for M model coefficients, we must have N ≥ M, and for chi-square to be meaningful, we must have N > M. In typical cases, N >> M.

We have deliberately made the functional dependence linear with respect to the pk coefficients. The basis functions may be anything at all as long as they can be evaluated at the abscissa values (hence they may not be explicit functions of pk); they may even be functions of both x and y, since the yi are also known, but since the yi are uncertain, any basis function that depends on them will also be uncertain, and in the chi-square minimization below, this will cause the pk to appear in the denominators, making the system of equations that we have to solve nonlinear. Nonlinear systems can be solved via iterative methods, but the subject is far removed from the present scope, and so we will use only basis functions that do not depend on y. We also assume that the abscissa values are known exactly, since if they also have uncertainties, that too results in the pk appearing in the denominators of the equations we have to solve, which would therefore again be nonlinear.

If the uncertainties in the data describe additive errors that may be approximated as zero-mean Gaussian random variables, and if the true values underlying the data points are indeed distributed according to the model, then the discrepancies between the data points and the maximum-likelihood model are Gaussian random variables acting as additive noise, and Gaussian estimation may be used to evaluate the coefficients pk. The maximum-likelihood model is that which maximizes the joint density function of the discrepancies between the data points and the model, hence it minimizes chi-square as defined below. For uncorrelated errors, chi-square is the sum of the squared differences between the model and the observed values divided by the corresponding uncertainty variances (see Appendix E):
$$\chi^2 = \sum_{i=1}^{N} \frac{\left(y_i - \sum_{k=1}^{M} p_k\, f_k(x_i)\right)^2}{\sigma_i^2} \tag{D.2}$$
The number of independent terms summed minus the number of parameters used in the fit is called the number of degrees of freedom of the χ2 random variable, DF, in this case N-M. Since χ2 is a function of the random errors in the yi, it is indeed a random variable itself, with its own probability density function and cumulative distribution. These two functions depend on DF, but two useful
features that all χ2 random variables have in common are that the mean is DF and the variance is 2DF. For the above expression to be a rigorously defined χ2 random variable, not only must the form of the model be appropriate, but the errors in the yi must also be truly Gaussian. Fortunately, in the vast majority of practical applications, treating errors as Gaussian even when that is not quite true is nevertheless an acceptable approximation, especially when it enters only as a cost function to be minimized in some fitting process. One must use more care when interpreting the significance of an observed fluctuation (i.e., a sample of the random variable). Furthermore, biased estimates generally result when deviations from the Gaussian distribution include being asymmetric, so the quality of the approximation must be understood. For example, biases at the 0.1% level may or may not be acceptable.

The expression above assumes that the errors in the yi are uncorrelated, i.e., that the expectation value of the product of the error in ym and the error in yn, m ≠ n, is zero. In fact, it often occurs that errors are correlated, and taking this into account involves using a more general form for χ2, but this slight complication will be postponed for the moment, and we will assume that the errors are uncorrelated. Ignoring significant correlations in fitting computations generally has a very small effect on the pk solutions; the main impact is on the characterization of the fitting uncertainty, where it typically cannot be neglected. Inclusion of explicitly correlated data errors will be formulated further below.

In order to solve for the values of pk that minimize χ2, we set the derivatives of the latter with respect to the former to zero:
$$\frac{\partial \chi^2}{\partial p_k} = -2\sum_{i=1}^{N} \frac{\left(y_i - \sum_{j=1}^{M} p_j\, f_j(x_i)\right) f_k(x_i)}{\sigma_i^2} = 0 \tag{D.3}$$
This results in M equations in the M unknown coefficients pk, forming a linear system that can be solved by standard methods. Each equation has the form

$$\begin{aligned}
\sum_{i=1}^{N} \frac{y_i\, f_k(x_i)}{\sigma_i^2} &= \sum_{i=1}^{N} \frac{f_k(x_i)\sum_{j=1}^{M} p_j\, f_j(x_i)}{\sigma_i^2} \\
&= \sum_{j=1}^{M} p_j \sum_{i=1}^{N} \frac{f_k(x_i)\, f_j(x_i)}{\sigma_i^2}
\end{aligned} \tag{D.4}$$
where we have interchanged the summation order of the right-hand side on the second line. With the definitions

$$b_k = \sum_{i=1}^{N} \frac{y_i\, f_k(x_i)}{\sigma_i^2}, \qquad a_{jk} = \sum_{i=1}^{N} \frac{f_j(x_i)\, f_k(x_i)}{\sigma_i^2} \tag{D.5}$$

Equation D.4 becomes
$$b_k = \sum_{j=1}^{M} a_{jk}\, p_j \tag{D.6}$$
which can be written in vector-matrix form as
$$\mathbf{b} = \mathbf{A}\,\mathbf{p} \tag{D.7}$$
where the matrix A has elements ajk and is called the coefficient matrix, because it contains the coefficients of the equations forming the M×M system of equations. Note that ajk = akj, i.e., A is a symmetric matrix. Equation D.7 can be solved by various techniques, one of which is to employ the inverse of the A matrix: multiplying both sides from the left by A-1, and dropping the identity matrix that results from A-1A, yields the equation
$$\mathbf{p} = \mathbf{A}^{-1}\,\mathbf{b} \tag{D.8}$$
If solutions for the pk are all that is desired, then this is not the most computationally efficient way to proceed, but in fact, we will need the inverse matrix anyway, and so we might as well compute it first and use it to obtain the solution. The reason why we need it is that we need the error covariance matrix for the pk, and this is just A-1 (a derivation will be given below). Fitting errors are typically significantly correlated, so when the model is used to compute a value of y for some chosen value of x, the uncertainty in that value of y should be computed using the full error covariance matrix:
$$\boldsymbol{\Omega} \equiv \mathbf{A}^{-1} = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1M}\\ v_{21} & v_{22} & \cdots & v_{2M}\\ \vdots & \vdots & \ddots & \vdots\\ v_{M1} & v_{M2} & \cdots & v_{MM} \end{pmatrix} \tag{D.9}$$
The diagonal elements vkk are the uncertainty variances of the pk solutions, also commonly denoted σ2(pk). These elements are always positive in practical applications, i.e., uncertainties are always real and greater than zero in actual life. Because A is symmetric, its inverse Ω is also. The off-diagonal elements are the error covariances for the model coefficients, i.e., expectation values for the product of the errors in pk and pj, k ≠ j, and may be positive, negative, or zero. Using an overparenthesis to indicate true values, and denoting errors by ε, we can write the model equation as
$$\overset{\frown}{y} + \varepsilon_y = \sum_{k=1}^{M} \left(\overset{\frown}{p}_k + \varepsilon_{p_k}\right) f_k(x) \tag{D.10}$$

Since
$$\overset{\frown}{y} = \sum_{k=1}^{M} \overset{\frown}{p}_k\, f_k(x) \tag{D.11}$$
we can subtract this from the previous equation to obtain
$$\varepsilon_y = \sum_{k=1}^{M} \varepsilon_{p_k}\, f_k(x) \tag{D.12}$$
We will see below that under the assumptions in effect, the εpk are zero-mean Gaussian random variables, and so εy is also, since it is a linear combination of rescaled zero-mean Gaussian random variables (see the discussion in Appendix B covering Equation B.2, p. 425, and Equations B.17 through B.20, p. 428). The uncertainty variance for y is the expectation value of the square of εy, where angle brackets denote expectation values:
$$\begin{aligned}
\sigma_y^2 = \left\langle \varepsilon_y^2 \right\rangle &= \left\langle \Bigl(\sum_{k=1}^{M} \varepsilon_{p_k} f_k(x)\Bigr)^{2} \right\rangle \\
&= \sum_{k=1}^{M} \bigl\langle \varepsilon_{p_k}^2 \bigr\rangle f_k^2(x) + 2\sum_{k=1}^{M-1}\sum_{j=k+1}^{M} \bigl\langle \varepsilon_{p_j}\varepsilon_{p_k} \bigr\rangle f_j(x)\, f_k(x) \\
&= \sum_{k=1}^{M} \sigma^2(p_k)\, f_k^2(x) + 2\sum_{k=1}^{M-1}\sum_{j=k+1}^{M} v_{jk}\, f_j(x)\, f_k(x)
\end{aligned} \tag{D.13}$$
Once the pk are known, Equation D.2 should be evaluated to verify that the value of χ2 is reasonable, i.e., close to DF. "Close" generally means within several standard deviations, where the value of the standard deviation is √(2DF). If this is not true, then the model is not trustworthy. Assuming error correlation was not inadvertently ignored, if χ2 is too small, then the original data uncertainties may have been overestimated. If χ2 is too large, then either the original data uncertainties were underestimated or the model formula is not a good representation of the data, or both.

It is important to appreciate that although the fitting uncertainties σ2(pk), and hence σ2y, may be small, this does not mean that the actual error in a value of y computed from the model will be small. The fitting uncertainties reflect only how close the resulting fit is to the best possible fit to the given model, not whether that model itself is a good representation of the data. The fitting error covariance matrix Ω, being the inverse of A, depends on the same parameters as A, and this dependence does not include any of the yi. A and Ω depend only on the ajk, and as shown in Equation D.5, the ajk depend only on the data uncertainties σi, the xi, and the model. The data uncertainties produce some uncertainty in what the best model parameter values are to fit to the data, but the best fit may be simply the best of a set of very bad fits because the form of the model is poorly chosen.

We have said that each yi is associated with an xi, and so these pairs can be plotted on a graph. We would normally choose a model whose plot on the same graph would pass closely to the data points,
and if there is real structure in the pattern of the data points, we would want the model to show similar structure. But imagine shuffling the yi to form a very different sequence. The new (xi, yi) pairs would form a very different pattern on the graph, and the fitting procedure would yield a very different set of solutions for the pk, but the uncertainties of this new set of pk would be exactly the same as before, because these uncertainties do not depend on the pairing between the xi and yi, simply because they do not depend on the yi at all. This assumes that the original pairs of xi and σi were kept together; if the yi and the σi are kept paired while the xi are shuffled, then generally some fluctuation in the uncertainties can be expected, but at a level negligible compared to the changes in the pk and the "goodness of fit" as measured by χ2. The uncertainty in whether we have obtained the best possible fit to the shuffled data pairs given the form of the model is essentially no different from before, even though the fit itself yields a model that probably looks nothing like the new data points on the graph.

Clearly some check on whether the form of the model yields a good fit, not merely a fit with well-determined parameters, is needed. This check is provided by the test on the value of χ2 described above. The value of χ2 will depend on whether we shuffled the data pairs, and it does depend on whether the model's shape follows the patterns of structure in the data. If the original data uncertainties were well estimated, and if the form of the model is appropriate for the actual dependence of y on x, then the fluctuations of the yi relative to the model should have statistical similarity to the original data uncertainties. The χ2 test measures whether this is the case and should always be performed. If the value of χ2 turns out to be large relative to DF, then the uncertainties attached to the model should be inflated, or a better model should be used unless it can be shown that the original uncertainties were underestimated. One way to inflate the model uncertainty is simply to rescale Ω by χ2/DF. If nothing more than this can be done, then at least this much should be done, but unless resources simply do not permit, the reason for the failure of the model should be determined and fixed. Note that if the goodness-of-fit metric indicates that the model is inappropriate for the data, then the assumption that the discrepancies in Equation D.2 are Gaussian is invalid, and taking the errors defined in Equations D.12 and D.16 to be Gaussian is almost certainly a bad approximation.

The convenient fact that the error covariance matrix Ω is the inverse of the coefficient matrix A can be derived as follows. We write Equation D.8 in summation form, using the elements of Ω explicitly, changing the dummy summation index from k to l and substituting the definition of bl:
$$p_k = \sum_{l=1}^{M} v_{lk}\, b_l = \sum_{l=1}^{M} v_{lk} \sum_{i=1}^{N} \frac{y_i\, f_l(x_i)}{\sigma_i^2} \tag{D.14}$$
We can write this in the “true plus error” form, along with the “true” itself:
$$\begin{aligned}
\overset{\frown}{p}_k + \varepsilon_{p_k} &= \sum_{l=1}^{M} v_{lk} \sum_{i=1}^{N} \frac{\left(\overset{\frown}{y}_i + \varepsilon_i\right) f_l(x_i)}{\sigma_i^2} \\
\overset{\frown}{p}_k &= \sum_{l=1}^{M} v_{lk} \sum_{i=1}^{N} \frac{\overset{\frown}{y}_i\, f_l(x_i)}{\sigma_i^2}
\end{aligned} \tag{D.15}$$
Subtracting the second line from the first gives
$$\varepsilon_{p_k} = \sum_{l=1}^{M} v_{lk} \sum_{i=1}^{N} \frac{\varepsilon_i\, f_l(x_i)}{\sigma_i^2} \tag{D.16}$$
i.e., εpk is a linear combination of the εi, so if the assumption that the latter are zero-mean Gaussian random variables is valid, εpk is also a zero-mean Gaussian random variable, because rescaling the εi does not destroy their Gaussian nature, and the density function for εpk is the convolution of the Gaussian density functions of the rescaled εi and thus also Gaussian (again, see the discussion in Appendix B covering Equation B.2 and Equations B.17 through B.20). The elements of the error covariance matrix are ⟨εpj εpk⟩, so we write the product of the above equation with a similar one that uses indexes m, j, and n in place of l, k, and i, respectively, and take expectation values:
$$\begin{aligned}
\left\langle \varepsilon_{p_j}\varepsilon_{p_k} \right\rangle &= \left\langle \sum_{m=1}^{M} v_{mj} \sum_{n=1}^{N} \frac{\varepsilon_n\, f_m(x_n)}{\sigma_n^2}\; \sum_{l=1}^{M} v_{lk} \sum_{i=1}^{N} \frac{\varepsilon_i\, f_l(x_i)}{\sigma_i^2} \right\rangle \\
&= \sum_{m=1}^{M}\sum_{l=1}^{M} v_{mj}\, v_{lk} \sum_{i=1}^{N} \frac{\left\langle \varepsilon_i^2 \right\rangle f_m(x_i)\, f_l(x_i)}{\sigma_i^4}
\end{aligned} \tag{D.17}$$
The cross terms in the product of the summations over i and n on the first line have been dropped in the second, because the expectation values of the products εn εi will be zero for i ≠ n. So we have
$$\begin{aligned}
\left\langle \varepsilon_{p_j}\varepsilon_{p_k} \right\rangle &= \sum_{m=1}^{M}\sum_{l=1}^{M} v_{mj}\, v_{lk} \sum_{i=1}^{N} \frac{\sigma_i^2\, f_m(x_i)\, f_l(x_i)}{\sigma_i^4} \\
&= \sum_{m=1}^{M}\sum_{l=1}^{M} v_{mj}\, v_{lk} \sum_{i=1}^{N} \frac{f_m(x_i)\, f_l(x_i)}{\sigma_i^2} \\
&= \sum_{m=1}^{M}\sum_{l=1}^{M} v_{mj}\, v_{lk}\, a_{ml} \\
&= \sum_{m=1}^{M} v_{mj} \sum_{l=1}^{M} v_{lk}\, a_{ml}
\end{aligned} \tag{D.18}$$
The second summation on the last line just yields the elements of the identity matrix, since the elements multiplied and summed belong to matrices that are inverses of each other; the result of the summation is therefore δkm, the Kronecker delta, whose value is 1 for k = m and 0 for k ≠ m. This results in
$$\left\langle \varepsilon_{p_j}\varepsilon_{p_k} \right\rangle = \sum_{m=1}^{M} v_{mj}\,\delta_{km} = v_{kj} = v_{jk} \qquad \mathrm{Q.E.D.} \tag{D.19}$$
Now we consider the case for correlated errors in the original data, for which a data error covariance matrix ΩD must be provided instead of just the N uncertainties σi, the squares of which are now the diagonal elements of this data error covariance matrix, whose off-diagonal elements are no longer taken to be zero. The more general expression for χ2 employs the following definitions.

$$\begin{aligned}
u_i &= y_i - \sum_{k=1}^{M} p_k\, f_k(x_i) \\
\mathbf{W} &= \boldsymbol{\Omega}_D^{-1} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1N}\\ w_{21} & w_{22} & \cdots & w_{2N}\\ \vdots & \vdots & \ddots & \vdots\\ w_{N1} & w_{N2} & \cdots & w_{NN} \end{pmatrix} \\
\chi^2 &= \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\, u_i u_j = \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij} \left(y_i - \sum_{k=1}^{M} p_k\, f_k(x_i)\right)\left(y_j - \sum_{k=1}^{M} p_k\, f_k(x_j)\right)
\end{aligned} \tag{D.20}$$
Again, we take the derivatives of χ2 with respect to the pk:

$$\frac{\partial \chi^2}{\partial p_k} = -\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left[ f_k(x_i)\left(y_j - \sum_{m=1}^{M} p_m\, f_m(x_j)\right) + f_k(x_j)\left(y_i - \sum_{m=1}^{M} p_m\, f_m(x_i)\right)\right] \tag{D.21}$$
where we have changed the dummy summation index k to m to distinguish it from the fk factors produced by differentiating with respect to pk. Setting this equal to zero and regrouping gives:

$$\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left[\,y_j\, f_k(x_i) + y_i\, f_k(x_j)\right] = \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left[ f_k(x_i)\sum_{m=1}^{M} p_m\, f_m(x_j) + f_k(x_j)\sum_{m=1}^{M} p_m\, f_m(x_i)\right] \tag{D.22}$$
Interchanging the summation order on the right-hand side results in

$$\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left[\,y_j\, f_k(x_i) + y_i\, f_k(x_j)\right] = \sum_{m=1}^{M} p_m \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left[ f_k(x_i)\, f_m(x_j) + f_k(x_j)\, f_m(x_i)\right] \tag{D.23}$$
With the following definitions
$$\begin{aligned}
b_k &= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left[\,y_j\, f_k(x_i) + y_i\, f_k(x_j)\right] \\
a_{mk} &= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left[ f_k(x_i)\, f_m(x_j) + f_k(x_j)\, f_m(x_i)\right]
\end{aligned} \tag{D.24}$$
we obtain

$$b_k = \sum_{m=1}^{M} a_{mk}\, p_m \tag{D.25}$$
which is the same as Equation D.6 and can be solved in the same way. The inverse of the coefficient matrix is still the error covariance matrix for the model parameters pk, but in this case, the derivation must retain the cross terms that were dropped in Equation D.17, since the expectation values are no longer zero. Note that if we do apply these equations to the special case of uncorrelated data errors that we first considered, then wij = 0 for i ≠ j, wii = 1/σi2, and Equation D.24 reduces to Equation D.5; the coefficients of ½ on the summations in Equation D.24 are there for this purpose. They cancel out in Equation D.25 but are required to keep the relationship between the coefficient matrix and error covariance matrix.

One interesting special case is that for M = 1 and f1(x) = 1, i.e., the model "equation" is just a constant, p1. Considering the case when data errors are uncorrelated, Equation D.5 becomes
$$b_1 = \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2}, \qquad a_{11} = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \tag{D.26}$$
and Equation D.6 becomes

$$\sum_{i=1}^{N} \frac{y_i}{\sigma_i^2} = \sum_{i=1}^{N} \frac{p_1}{\sigma_i^2} = p_1 \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \tag{D.27}$$

Solving for p1 and using Equation D.9 to get its uncertainty,
$$p_1 = \frac{\displaystyle\sum_{i=1}^{N} y_i/\sigma_i^2}{\displaystyle\sum_{i=1}^{N} 1/\sigma_i^2}, \qquad \sigma^2(p_1) = \frac{1}{\displaystyle\sum_{i=1}^{N} 1/\sigma_i^2} \tag{D.28}$$
So the model is just the inverse-variance-weighted average of the data values yi, and the uncertainty variance is the inverse of the coefficient “matrix”, which has been reduced to a scalar in this case. This is the familiar maximum-likelihood Gaussian refinement formula for averaging numbers that have uncertainties, and therefore chi-square minimization is seen to be a generalization of this for models with more elaborate functional dependence than a single scalar.
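For the uncorrelated-error case, the entire procedure of this appendix fits in a short sketch. The following Python code (NumPy assumed; the quadratic test model, the noise level, and all names are illustrative choices, not taken from the text) builds b and A as in Equation D.5, solves Equation D.8, takes the inverse of A as the error covariance matrix per Equation D.9, and checks χ2 against DF:

```python
# Weighted linear least squares (chi-square minimization), Equations D.2-D.9.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a quadratic plus Gaussian noise of known sigma.
x = np.linspace(0.0, 10.0, 50)
sigma = 0.5 * np.ones_like(x)
p_true = np.array([1.0, -2.0, 0.3])
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]   # the f_k(x)
y = sum(p * f(x) for p, f in zip(p_true, basis)) + rng.normal(0.0, sigma)

F = np.column_stack([f(x) for f in basis])   # N x M matrix of f_k(x_i)
w = 1.0 / sigma**2
A = (F * w[:, None]).T @ F                   # a_jk of Equation D.5
b = F.T @ (w * y)                            # b_k of Equation D.5
cov = np.linalg.inv(A)                       # error covariance matrix (Equation D.9)
p = cov @ b                                  # Equation D.8

chi2 = np.sum(w * (y - F @ p)**2)            # Equation D.2
dof = len(x) - len(p)
print(p)                                     # should be close to p_true
print(np.sqrt(np.diag(cov)))                 # 1-sigma coefficient uncertainties
print(chi2, dof)                             # chi2 should lie within a few sqrt(2*dof) of dof
```

If only the coefficients were needed, a numerically better route would be np.linalg.solve (or a QR/SVD-based solver), but as noted above, the inverse is wanted here anyway because it is the error covariance matrix of the fit.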
Appendix E  Chi-Square Distributions

The chi-square family of random-variable distributions was introduced in section 2.5 and defined for the case of independent terms in Equation 2.10 (p. 53):
$$\chi_N^2 = \sum_{i=1}^{N} \frac{\left(x_i - \bar{x}_i\right)^2}{\sigma_i^2} \tag{E.1}$$
where each xi is a Gaussian random variable with mean x i and standard deviation σi, and N is the number of degrees of freedom, which we will abbreviate DF. If the xi and x i are not independent, then DF is less than N according to the specific constraints in effect. A very common example of this is the case when the means are all equal because they are the mean computed from a sample of N random draws, and the standard deviations are all equal because they refer to the known standard deviation of the population from which the sample was drawn (see section 2.11). In this case, because the mean was computed from the sample, one degree of freedom is lost, and we have
$$\chi_{N-1}^2 = \sum_{i=1}^{N} \frac{\left(x_i - \bar{x}\right)^2}{\sigma^2} \tag{E.2}$$
If the xi are not independent, then in general they will be correlated (although exceptions exist), and a full N×N covariance matrix is needed (see section 2.10). Denoting the covariance matrix Ω and its inverse W (which we can assume exists, because in addition to being symmetric, in all nontrivial cases, covariance matrices are real and non-singular, and correlation coefficients are not ±100%),
$$\boldsymbol{\Omega} = \begin{pmatrix}
\left\langle (x_1-\bar{x})(x_1-\bar{x})\right\rangle & \left\langle (x_1-\bar{x})(x_2-\bar{x})\right\rangle & \cdots & \left\langle (x_1-\bar{x})(x_N-\bar{x})\right\rangle \\
\left\langle (x_2-\bar{x})(x_1-\bar{x})\right\rangle & \left\langle (x_2-\bar{x})(x_2-\bar{x})\right\rangle & \cdots & \left\langle (x_2-\bar{x})(x_N-\bar{x})\right\rangle \\
\vdots & \vdots & \ddots & \vdots \\
\left\langle (x_N-\bar{x})(x_1-\bar{x})\right\rangle & \left\langle (x_N-\bar{x})(x_2-\bar{x})\right\rangle & \cdots & \left\langle (x_N-\bar{x})(x_N-\bar{x})\right\rangle
\end{pmatrix}$$

$$\mathbf{W} = \boldsymbol{\Omega}^{-1} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1N}\\ w_{21} & w_{22} & \cdots & w_{2N}\\ \vdots & \vdots & \ddots & \vdots\\ w_{N1} & w_{N2} & \cdots & w_{NN}\end{pmatrix} \tag{E.3}$$
where we are considering the case in which all the means are equal, since it is rare for random variables drawn from populations with different means to be correlated (if they are, then the means
must retain subscripts matching the random variables from which they are subtracted). Then chi-square is defined as follows.
$$\chi_N^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\left(x_i - \bar{x}\right)\left(x_j - \bar{x}\right) \tag{E.4}$$
DF is not affected by the fact that the individual random variables are correlated, since correlation can always be removed via a coordinate rotation (see section 2.10), but that does not change the number of terms in the summation. To see this, consider the following: chi-square is just a sum of the squares of zero-mean unit-variance Gaussian random variables. If we translate the coordinates in Equation E.4 to make the means zero, we do not alter the numerical value of chi-square. If we then rotate coordinates to the system that diagonalizes the covariance matrix, Equation E.4 reduces to Equation E.2 (or Equation E.1, if the random variables come from different populations). The transformed variables will still be Gaussian, because the rotation produces linear combinations of the original variables, and a linear combination of Gaussians is Gaussian, and they will still be unit-variance, because each is divided by the properly transformed variance, namely the corresponding diagonal element of the diagonalized covariance matrix. Thus the number of squared zero-mean unit-variance Gaussian random variables in the summation is the same as before, so the DF is the same as before. The determinant of the covariance matrix is preserved in the rotation, and the volume of any ellipsoid corresponding to a contour of constant probability density in the N-dimensional space is proportional to the square root of this determinant, so the ellipsoid volume is also preserved.

Of course, if the random variables are correlated but this is overlooked in computing chi-square, then the value obtained will generally be wrong, and the variance is increased by the extra contribution of this error. For example, as we will see below, a chi-square random variable with two degrees of freedom has a mean of 2 and a variance of 4. If its two random variables are correlated with a coefficient ρ ≠ 0 and this is overlooked, i.e., if Equation E.2 is used instead of Equation E.4, then in general the value obtained for a given sample will be different (i.e., wrong). If somehow the correct x̄ (or x̄i if appropriate) and σi2 are used, however, then the error in Equation E.2 will be zero mean, so its expectation value is still 2, but its variance increases to 4(1+ρ2). In such a case, the expected mean of the ratio of the incorrect chi-square to the correct one is 1.0 with a variance of ρ2/2.

In general, a chi-square random variable with N degrees of freedom has a mean of N and a variance of 2N. If correlation exists but is overlooked while the correct means and variances are used, the chi-square mean remains N, but the chi-square variance is increased by 4 times the sum of the squares of the correlation coefficients in the upper triangle of the correlation matrix, i.e.,

$$\mathrm{var}\!\left(\chi_{Ne}^2\right) = 2N + 4\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} \rho_{ij}^2 \tag{E.5}$$
where the e subscript denotes the “erroneous” form of chi-square. A more compact expression can be constructed by noting that the diagonal elements of a correlation matrix are all 1.0, and summing over the entire matrix counts each symmetric element twice, and so
$$\mathrm{var}\!\left(\chi_{Ne}^2\right) = 2\sum_{i=1}^{N}\sum_{j=1}^{N} \rho_{ij}^2 \tag{E.6}$$
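The variance inflation described by Equations E.5 and E.6 is easy to see in a small Monte Carlo experiment. The following Python sketch (NumPy assumed; the value of ρ and the sample size are arbitrary choices) compares Equation E.2 with correlation ignored against Equation E.4 with the covariance included, for the two-variable case discussed above:

```python
# Ignoring a correlation rho between two unit-variance, zero-mean Gaussians
# leaves the chi-square mean at 2 but inflates its variance from 4 to
# 4*(1 + rho**2), as in Equation E.5 with N = 2.
import numpy as np

rng = np.random.default_rng(7)
rho = 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
W = np.linalg.inv(cov)

xy = rng.multivariate_normal([0.0, 0.0], cov, size=200_000)

chisq_naive = np.sum(xy**2, axis=1)                  # Equation E.2, correlation ignored
chisq_full = np.einsum('ni,ij,nj->n', xy, W, xy)     # Equation E.4, correlation included

print(chisq_naive.mean(), chisq_naive.var())         # ~2 and ~4*(1 + rho**2) = 5.44
print(chisq_full.mean(), chisq_full.var())           # ~2 and ~4
```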
Chi-square is used extensively in hypothesis testing, typically with a null hypothesis that some random variables are Gaussian and belong to a particular population. Chi-square computed for a sample drawn from that population should behave according to the chi-square distribution, and if it deviates to a statistically significant extent, that may be interpreted as evidence against the null hypothesis. As we see in Equations E.5 and E.6, one possible cause for such a deviation is that the random variables are correlated but this was not realized, or it may have been suspected but no covariance matrix was available. Another common cause is that the population’s distribution is significantly non-Gaussian. Since the average of the expression in Equation E.2 is generally not very sensitive to either of these conditions, it is advisable to draw multiple samples and compute chisquare for each, then examine the variance of these results to see if it is inflated relative to true chisquare variances. Another way that the null hypothesis can be false is that the elements of the sample were not actually drawn from the same population (or equivalently, the population was a mixture of significantly different basic distributions). This is more likely to lead to conspicuously implausible values for chi-square itself, and the statistical significance may be judged solely on the basis of the likelihood of obtaining the value observed given the null hypothesis. For this the cumulative distribution is needed. The general forms of the density and cumulative distributions are 2
$$p(\chi^2_N) = \frac{(\chi^2_N)^{N/2-1}\, e^{-\chi^2_N/2}}{2^{N/2}\,\Gamma(N/2)}, \qquad P(\chi^2_N) = \frac{\gamma\!\left(N/2,\;\chi^2_N/2\right)}{\Gamma(N/2)} \qquad\qquad \text{(E.7)}$$
where chi-square is non-negative, Γ is the Gamma function, and γ is the lower incomplete Gamma function. For even values of N, it is convenient to use the fact that Γ(N/2) = (N/2-1)! (i.e., factorial). A number of alternative forms, recurrence relations, and asymptotic formulas exist for the cumulative distribution; see, for example, Abramowitz and Stegun (1972), section 26.4. Although the formulas in Equation E.7 are nontrivial, in fact chi-square random variables exhibit a few simple properties, the most useful of which are the mean and variance, which are equal to the number of degrees of freedom and twice that number, respectively:
$$\left\langle \chi^2_N \right\rangle = \int_0^{\infty} \chi^2_N\; p(\chi^2_N)\, d\chi^2_N = N, \qquad
\sigma^2_{\chi^2_N} = \int_0^{\infty} \left(\chi^2_N - N\right)^2 p(\chi^2_N)\, d\chi^2_N = 2N \qquad\qquad \text{(E.8)}$$
The chi-square skewness and kurtosis are also simple functions of the number of degrees of freedom, namely √(8/N) and 3 + 12/N, respectively (i.e., the excess kurtosis is 12/N). Another useful property is that chi-square is asymptotically Gaussian. Whether a given N is sufficiently Gaussian for one’s purposes may be judged by whether the skewness and excess kurtosis are sufficiently close to zero.
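For readers who want to evaluate Equation E.7 numerically, the short sketch below is one way to do it in Python; it assumes SciPy is available (scipy.special.gammainc is the regularized lower incomplete Gamma function, i.e., γ(a,x)/Γ(a)), and it prints the simple moment properties quoted above for N = 12.

import math
from scipy.special import gammainc    # regularized lower incomplete Gamma: gamma(a, x)/Gamma(a)

N = 12
x = 10.0                              # a chi-square value at which to evaluate Equation E.7
p = x**(N/2 - 1) * math.exp(-x/2) / (2**(N/2) * math.gamma(N/2))   # density
P = gammainc(N/2, x/2)                                             # cumulative distribution
Q = 1.0 - P                                                        # high-tail probability

print(p, P, Q)
print(N, 2*N, math.sqrt(8/N), 3 + 12/N)    # mean, variance, skewness, kurtosis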
Appendix F Quantiles, Central Probabilities, and Tails of Distributions

When dealing with random variables, the following questions frequently arise. What is the probability that a random draw will be closer to the mean than some given distance? What fraction of all random draws will be greater than some given value? How far from the mean are the boundaries containing the most probable half of all cases? Such questions involve various ways to partition the cumulative distribution of the random variable, or equivalently, to partition areas under the density function. Figures 1-1 and 1-2 (p. 24) show the density functions and boundaries marking ±1σ from the mean for a Gaussian and Uniform distribution, respectively, and point out that for the Gaussian these boundaries contain about 68.27% of the probability, and for the Uniform about 57.74%. The fact that ±1σ boundaries enclose different fractions of the total is a consequence of the fact that the two distributions have different shapes. The area inside these boundaries is called central probability, and areas outside are called tail probabilities, with the one on the left called the low tail and the one on the right called the high tail. Terminology is not completely standardized for such partitions, however, and we should note furthermore that the phrase “central probability” also arises in analyses of the convergence of various distributions to asymptotically Gaussian form (e.g., the Central Limit Theorem). Two widely accepted conventions do exist: denoting the cumulative probability P and the high tail Q, i.e., Q = 1-P. Q is the most commonly used measure of statistical significance, since Q(x) is the fraction of all cases that would occur above x under a hypothesis that x is distributed according to the corresponding distribution P. For example, given the hypothesis that a coin is fair, i.e., the probability of landing with heads up is the same as tails (50%), then if 20 flips produce only 2 tails, the statistical significance is about 0.00020123. Coin flips follow the binomial distribution (Equation 2.3, p. 39), which gives a probability of 0.00020123 that a fair coin flipped 20 times will yield 18 or more heads, the complement to 17 or fewer heads. In other words, P(17) = 0.99979877, so the probability of more than 17 heads is Q(17) = 0.00020123. When an event belonging to such a low-probability tail of a distribution is observed to take place, the hypothesis on which the probability is based is usually taken to be erroneous. If the hypothesis were correct, then one would expect to have to perform about 4970 experiments involving 20 flips of the fair coin before witnessing this event even once on average. That is too implausible for most observers, and so they would reject the hypothesis that the coin is fair. But if 4970 different experimenters perform the 20 flips with a fair coin, it is more likely (about 63%) that at least one of them will witness this rare event than that none will. After all, winning a million-dollar lottery has even lower probability, but someone does eventually win. Note that we judged significance by looking at the probability of getting more than 17 heads, which includes getting 18, 19, or 20 heads. In the example, what we got was 18 heads, not 19 or 20. So why didn’t we just look at the probability of getting exactly 18 heads (about 0.0001812)? The reason is that the probability of any single event can be misleadingly small when the number of possible outcomes is large.
The most likely outcome, exactly 10 heads, has a probability of only about 0.176197, which doesn’t seem particularly large, but getting the most likely outcome is clearly
as insignificant as possible. For continuous random variables, the notion of the probability of some specific value is problematic in itself; some range about that value must be specified in order to have a well-defined probability. Using smallness of the Q value (or nearness of the P value to 1) to judge statistical significance removes the potential obscurity of single-event probabilities from the problem, and in the typical cases in which the tails roll off rapidly, including all other events in the tail usually does not enlarge the numerical value very much anyway. For example, the probability of exactly 18 heads is about 90% of the tail. In order to have a figure of merit for plausibility when something happens that seems unlikely under the operating hypothesis, considering the probability of what was observed or anything even less likely in the same way (e.g., having too many coin flips yield heads) adds some conservatism along with the more solid numerical foundation, and so the P and Q values have become the traditional quantification of plausibility. In this illustration, we used the probability given by the “high tail” of the distribution; this is also called a “one-tail” distribution. If we had instead asked “what is the probability that the number of heads will be within 5 of the most likely number?”, then we would have wanted a “central probability.” For a fair coin, the most likely number of heads in 20 tosses is 10, and so we seek the probability of at least 5 heads and no more than 15. This time we can use the “central” probability, which is P(15) - P(4), the probability of 15 or fewer heads minus the probability of 4 or fewer. For a fair coin, this is about 0.9881821, so about 98.82% of the time, 20 flips will yield 5 to 15 heads if the coin is fair. Note that because the binomial probability is symmetric about its mean, this is the same as 1 - 2Q(15). When this simpler formula is used, one must be sure that the argument of the Q function is greater than the mean. If we had asked “what is the probability that 20 tosses will yield a number of heads more than 5 away from the most likely number, 10”, then we would use the “two-tailed” probability P(4) + Q(15). Often the probabilities in the two tails are equal, as in this case, but they need not be. We can ask “what is the probability of less than 4 heads or more than 18 heads?”, in which case we want P(3) + Q(18), which is about 0.001288 + 0.000020 = 0.001308. The question of central probability arises most frequently in cases for which the density function is symmetric about its mean. For asymmetric density functions, it is advisable to define clearly what one means by the “center” of a distribution, since the median and the mean are generally not equal in such cases, and the mode is generally different from both of them, or there may even be more than one mode (i.e., the distribution may be multimodal). The “center” may be the mean, because that is the “center of probability mass”. But “center” may also refer to the median, since that is where P = 0.5, the center of the cumulative distribution, and since central probabilities and tails are most fundamentally expressed in terms of the cumulative distribution P and its complement Q, the median is the logical place to define as the “center” if the median, mean, and mode are not all equal. For a unimodal distribution, the mode often appears visually to be the center, because that is the peak of the density function.
For any of these definitions, if the density function is asymmetric about its mean, the boundaries marking ±x from the “center” may occur at different values of the density function, and the area in the +x side may not be equal to the area in the -x side, so the sum of the areas will not be a straightforward “central probability”. Similarly, if the areas on each side of the “center” are made equal, the boundaries will generally be at different distances from that center and at unequal density values. Caution is always advised when dealing with asymmetric distributions.
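The binomial numbers quoted above can be reproduced with a few lines of Python using only the standard library; the function names below are ours, not anything defined in the text.

from math import comb

def binom_P(k, n=20, p=0.5):
    # cumulative probability of k or fewer heads in n flips
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def binom_Q(k, n=20, p=0.5):
    # high-tail probability of more than k heads
    return 1.0 - binom_P(k, n, p)

print(binom_Q(17))                 # ~0.00020123, the significance of 18 or more heads
print(binom_P(15) - binom_P(4))    # ~0.988182, the central probability of 5 to 15 heads
print(1 - 2 * binom_Q(15))         # the same central probability via the symmetric shortcut
print(binom_P(3) + binom_Q(18))    # ~0.001308, the asymmetric two-tail example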
Figure F-1. The distribution discussed in Section 2.9, a mixture of two Gaussian distributions shown also in Figure B-2. The shaded area is the central 60% probability; its boundaries are at different density values, about 2.08 and 0.98; the mean is 2.35, and the median and mode are about 2.31 and 2.21, respectively.
Figure F-1 shows the same distribution that was discussed in section 2.9 and used as an example in Appendix B illustrating the uniform distribution of the random variable defined as the cumulative distribution of any random variable (see Figures 2-9, p. 64, and B-2, p. 430). Figure F-1
is based on Figure B-2 with several supplemental features. Three heavy lines are added to indicate the mode, median, and mean, respectively, in left-to-right order. These are all different for this asymmetric distribution and occur at abscissa values of about 2.21 (P ≈ 0.32), 2.31 (P = 0.5), and exactly 2.35 (P ≈ 0.55), respectively. The central 60% probability is shown as a shaded subset of the line area tracing the relationship between the cumulative and density functions above P = 0.2 and below P = 0.8, hence it is defined about the median at P = 0.5. The median makes sense as the “center”, since only about 45% of the probability mass lies above the mean (but has a larger overall “moment arm”, so that the mean is still the “center of mass” in the sense of being where the distribution would balance if placed on a pivot there). Although we call this a “central” 60% probability, it is nevertheless bounded by different values of the density, about 2.08 on the left and about 0.98 on the right, and different abscissa distances from the median, since the low tail boundary is at about 2.159, and the high-tail boundary is at about 2.547, so the distances from the median are about 0.151 and 0.237, respectively. This “central” 60% probability is therefore not the “most probable” 60%; the latter would be the region whose two boundaries occur at equal probability densities with an area of 0.6. This can be non-trivial to locate (for multi-modal distributions, it can even be disjoint). In effect the boundaries of the gray region on the cumulative ordinate would be shifted by equal amounts until the corresponding gray region straddles the density function at equal values. In this example, that would involve shifting the gray region down the cumulative ordinate until its boundaries were at about 0.065 and 0.665, at which point the corresponding density function has a value of about 1.18 at the two abscissa values of 2.08 and 2.42. In N-dimensional distributions, where N > 1, boundaries partitioning the distribution are usually set at loci of constant probability density, i.e., contour levels in 2 dimensions, surfaces in 3 dimensions, and hypersurfaces in higher dimensions. For Gaussian distributions, these loci are ellipsoids, because Gaussian probability density depends on the exponential of an expression that defines an ellipsoid in the given space. Examples in 2 dimensions may be seen in Figures 2-13A (p. 79), 2-14A (p. 80), and 2-15B (p. 81). Assuming that just any distribution is Gaussian can result in errors such as the application of the notion of “error ellipses” to distributions like those shown in Figure 2-11 (p. 71). Proper contours of asymmetric distributions may be seen in Figures 2-13B and 2-14B. The latter illustrate the difficulty and ambiguity that may be encountered in defining a central probability for such distributions. For example, the probability masses inside the contours in Figure 2-13A are central probabilities, whereas those in Figure 2-13B are not, but most people are more comfortable visualizing any distribution in terms of its contour levels, which have the virtue of showing the corresponding most probable regions within their boundaries. Quantiles result from segmenting the cumulative distribution into equal areas. For example, in Figure F-1, the lines tracing the relationship between the cumulative and density functions are spaced at 0.02 (for purposes of visibility). If this had been 0.01, then each section would be called a percentile.
The first line can be called the 2-percentile, since 2% of the cumulative distribution lies below it. So the central 60% probability lies between the 80- and 20-percentiles. Other terms are used such as quartiles, when the cumulative partitioning is into 0.25 segments, etc. Because of the importance of Gaussian distributions, most programming languages provide a built-in function that is useful for evaluating partitions such as those discussed above. This is the error function, denoted erf(u), defined as follows.
$$\mathrm{erf}(u) = \frac{2}{\sqrt{\pi}} \int_0^{u} e^{-t^2}\, dt \qquad\qquad \text{(F.1)}$$
For u > 0, this is the central probability for a zero-mean Gaussian random variable with a variance of ½. This is an odd function, i.e., erf(-u) = -erf(u). The complement function is erfc(u) = 1 - erf(u), hence for u > 0 it is the two-tailed distribution for that random variable. To use this to evaluate the cumulative distribution for a general Gaussian random variable with mean x̄ and variance σ²,
$$P(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(x'-\bar{x})^2}{2\sigma^2}}\, dx' = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x-\bar{x}}{\sigma\sqrt{2}}\right)\right] \qquad\qquad \text{(F.2)}$$
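In Python, for example, the built-in function is math.erf, and Equation F.2 reduces to a one-line helper; the sketch below (our own names) reproduces the 68.27% central probability for ±1σ mentioned earlier.

from math import erf, sqrt

def gaussian_P(x, mean=0.0, sigma=1.0):
    # cumulative distribution of a Gaussian random variable, Equation F.2
    return 0.5 * (1.0 + erf((x - mean) / (sigma * sqrt(2.0))))

def gaussian_Q(x, mean=0.0, sigma=1.0):
    # high-tail probability
    return 1.0 - gaussian_P(x, mean, sigma)

print(gaussian_P(1.0) - gaussian_P(-1.0))   # central probability within +-1 sigma, ~0.6827
print(erf(1.0 / sqrt(2.0)))                 # the same number directly from Equation F.1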
Appendix G Generating Correlated Pseudorandom Numbers

A central task in Monte Carlo computations is generating pseudorandom numbers. Occasionally this must be done to simulate correlated random variables. As always, a basic pseudorandom number generator is required. The generation of pseudorandom deviates is a large and well-documented subject which we will not duplicate here. We will instead assume that such a generator is available and denote it g_x(x̄,σ), since most random distributions have a functional form that is fully specified given a mean and standard deviation. It is convenient to separate the generator into the sum of the mean and a zero-mean generator, i.e., g_x(x̄,σ) = x̄ + G_x(σ). If we want to generate correlated draws, we must have a model of the correlations or covariances involved. We will demonstrate the process for correlated pairs of random variables denoted x and y. Given the covariance matrix Ω shown below, the algorithm is:
$$\Omega = \begin{pmatrix} v_x & v_{xy} \\ v_{xy} & v_y \end{pmatrix}$$
$$\sigma_x = \sqrt{v_x}$$
$$\sigma_y = \sqrt{v_y}$$
$$\rho = \frac{v_{xy}}{\sigma_x\,\sigma_y}$$
$$x \leftarrow \bar{x} + G_x(\sigma_x)$$
$$y \leftarrow \sqrt{1-\rho^2}\,\bigl[\bar{y}_G + G_y(\sigma_y)\bigr] + \rho\,\frac{\sigma_y}{\sigma_x}\,x \qquad\qquad \text{(G.1)}$$
where the mean for the y generator has a subscript G to indicate the mean y would have if there were no correlation, i.e., the mean used for the independent draw. The correlation will generally produce a different mean value for y, namely
$$\bar{y} = \sqrt{1-\rho^2}\;\bar{y}_G + \rho\,\frac{\sigma_y}{\sigma_x}\,\bar{x} \qquad\qquad \text{(G.2)}$$
The left arrows denote making a random draw from the generator. To show that the algorithm is governed by the error covariance matrix, we compute the expectation values ⟨(x-x̄)²⟩, ⟨(y-ȳ)²⟩, and ⟨(x-x̄)(y-ȳ)⟩. It is obvious that ⟨(x-x̄)²⟩ = v_x from the fact that x is drawn from a population with that variance (we could have swapped the roles of the variables, as long as one’s random draw was properly correlated with the other’s). For y,
$$\bigl\langle (y-\bar{y})^2 \bigr\rangle = \Bigl\langle \Bigl( \sqrt{1-\rho^2}\,\bigl[\bar{y}_G + G_y(\sigma_y)\bigr] + \rho\,\frac{\sigma_y}{\sigma_x}\,\bigl[\bar{x} + G_x(\sigma_x)\bigr] - \sqrt{1-\rho^2}\,\bar{y}_G - \rho\,\frac{\sigma_y}{\sigma_x}\,\bar{x} \Bigr)^{\!2} \Bigr\rangle$$
$$= \Bigl\langle \Bigl( \sqrt{1-\rho^2}\,G_y(\sigma_y) + \rho\,\frac{\sigma_y}{\sigma_x}\,G_x(\sigma_x) \Bigr)^{\!2} \Bigr\rangle = \bigl(1-\rho^2\bigr)\bigl\langle G_y(\sigma_y)^2 \bigr\rangle + \rho^2\,\frac{\sigma_y^2}{\sigma_x^2}\,\bigl\langle G_x(\sigma_x)^2 \bigr\rangle$$
$$= \bigl(1-\rho^2\bigr)\,\sigma_y^2 + \rho^2\,\frac{\sigma_y^2}{\sigma_x^2}\,\sigma_x^2 = \sigma_y^2 = v_y \qquad\qquad \text{(G.3)}$$
where we dropped the Gy(σy)Gx(σx) cross term because those two random draws are uncorrelated, so the expectation value of their product is zero. Similarly, we drop it in
$$\bigl\langle (x-\bar{x})(y-\bar{y}) \bigr\rangle = \Bigl\langle G_x(\sigma_x)\,\Bigl[ \sqrt{1-\rho^2}\,G_y(\sigma_y) + \rho\,\frac{\sigma_y}{\sigma_x}\,G_x(\sigma_x) \Bigr] \Bigr\rangle = \rho\,\frac{\sigma_y}{\sigma_x}\,\bigl\langle G_x(\sigma_x)^2 \bigr\rangle = \rho\,\sigma_x\,\sigma_y = v_{xy} \qquad\qquad \text{(G.4)}$$

This demonstrates that the prescription in Equation G.1 does produce random draws with the correct variances and correlation. But sometimes more than two correlated random draws are needed. In the general case of N random variables, it is useful to treat them as elements of an N-vector. There are several approaches to generating an N-vector with correlated components. Often one wishes to specify the correlations a priori and then generate pseudorandom vectors that will exhibit exactly those correlations if a sufficiently large Monte Carlo computation is performed. The 2-vector above is an example of that, since one is free to choose any value for ρ in the open interval (-1, +1). For vectors with more components, an N×N covariance matrix can be defined as needed and used to generate Gaussian vectors via an algorithm that employs Cholesky decomposition; this will be discussed below, but first it is advisable to develop some intuition for correlated elements of vectors.
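A minimal Python sketch of the prescription in Equation G.1, assuming NumPy and Gaussian generators (the function and variable names are ours), is given below; a large number of draws recovers v_x, v_y, and v_xy to within sampling fluctuations, as Equations G.3 and G.4 require.

import numpy as np

def correlated_pair(xbar, ybar_G, vx, vy, vxy, n, rng):
    # draw n correlated (x, y) pairs following Equation G.1 with Gaussian generators
    sx, sy = np.sqrt(vx), np.sqrt(vy)
    rho = vxy / (sx * sy)
    x = xbar + sx * rng.standard_normal(n)                     # x <- xbar + Gx(sigma_x)
    y = (np.sqrt(1.0 - rho**2) * (ybar_G + sy * rng.standard_normal(n))
         + rho * (sy / sx) * x)                                # y per Equation G.1
    return x, y

rng = np.random.default_rng(0)
x, y = correlated_pair(xbar=5.0, ybar_G=3.0, vx=4.0, vy=9.0, vxy=3.0, n=1_000_000, rng=rng)
print(x.var(), y.var(), np.cov(x, y)[0, 1])    # ~4, ~9, ~3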
First, consider exactly what a random N-vector is: it is a single point in N-dimensional space. For example, Figure 2-10 (p. 68) shows a plot in 2-space of ten vectors; each point is a 2-vector, and so it can be easily displayed in a plot. Displays of 3-vectors are also possible, especially via 3dimensional stereo pairs, but higher dimensions must be taken on faith in mathematical generalization and the fidelity of projections into spaces of lower dimensionality. Figure 2-11 (p. 71) shows two plots in 2-space, related by a rotation. Each contains 5000 randomly drawn 2-vectors. For these two examples, it is possible to compute sample covariance matrices (hence correlation matrices), because we have more than a single point in N-space. We cannot compute sample statistics (e.g., mean, variance, skewness, kurtosis) from a single point. This is why in practice we need a large sample size, and to simulate this process, we need a model of the population so that we can draw many vectors. If we compute sample statistics for a very large number of vectors, they should approach the population values. It might be good to review section 2.10 (Correlated Random Variables) and 2.11 (Sample Statistics). The latter discusses drawing random deviates from a population and distinguishes two cases: (a.) the population is fully specified, and the purpose is to predict what samples drawn from it will look like in terms of statistical properties; (b.) the population is unknown, and knowledge concerning it is the object of pursuit. In the latter case, one usually has a sample obtained empirically and attempts to characterize the population from which this sample came. As with many nonlinear estimation problems, it is often useful to estimate a solution and compute what would be observed if it were correct, then use differences between predicted and actual observations to refine the estimated solution. The need to simulate random vectors with correlated components typically arises when one is in the process of pursuing such an iterative solution. The process involves positing a population about which we know everything, then statistically predicting the properties of samples that could be drawn from it. As far as the posited population is concerned, we are omniscient, and we can give the population any properties we want by simply modeling it any way we like. For example, suppose one has measured the flux of a particular star on 12 consecutive nights. Then one has a sample composed of 12 elements that one may average to get a better estimate of the true flux of the star. One may also investigate whether the measured fluxes change in time in a statistically significant way, in which case the star may be variable, an extremely useful thing to know about a star. By “statistically significant”, we mean that the amplitude of variation is sufficiently large relative to the measurement uncertainty to establish that the observed variations are unlikely to come from purely random fluctuations in the measurements of a non-variable star. Thus establishing variability always depends on arranging for measurement uncertainties to be smaller than the amplitude of variation. Note that because we are doing science properly, we do have an estimate of the uncertainty based on a prior model of the measurement apparatus. 
This may be a single number that applies to all 12 measurements, or it may be 12 numbers in one-to-one correspondence with the 12 flux measurements, if for example the uncertainty varies due to atmospheric effects changing over the 12 observing nights. In that case, we must consider the population to be really a mixture of different random-variable distributions, with part of the randomization being which distribution we draw a random variable from on any given night. The goal is to decide whether the unknown population (or population mixture) involves random measurement errors about a single constant true flux. If the true flux is changing, then our notion of the population must adapt to that fact. We would have to model it as a two-dimensional mixture of true fluxes and random-variable distributions. The true fluxes might be fairly simple
functions of time. Whatever we think Nature might be up to, that is what we must model. So we may view the 12 flux values as a sample consisting of 12 scalar values, or we may view them as a single draw from a population of 12-vectors. But what if our 12 measurements are correlated? This complicates the analysis somewhat. Neither view allows us to determine whether the 12 numbers are correlated based on the properties of the sample itself. Since we have uncertainties, we can compute a sample chi-square, and that might give us a clue about variability. Appendix E describes chi-squares; it also points out that the variance of the random variable defined by Equation E.2 (p. 443) is affected by ignoring nonzero correlations. But we cannot compute a variance, because we have only one sample of the 12-vector, a single point in 12-dimensional space. We might possibly learn something by chronologically partitioning the sample into six 2-vectors and computing a covariance matrix from that. But our purpose here is not to explore algorithmic design of data-analysis software, so we will just focus on generating N-vectors with correlated components. At least that will allow us to explore what might be going on under various assumptions about the correlations that may be in effect. Since at least some of the elements are to be correlated, the off-diagonal elements of the covariance matrix will not all be zero. Depending on how the simulated correlations are to be defined in our omniscient domain, however, some care must be taken in setting up the covariance matrix, or equivalently, the correlation matrix. If the covariance matrix originates in actual physical measurements, then it is almost certainly safe to use for simulation. But if one is constructing the simulation from scratch, one must be careful to do so in such a way that produces a positive definite covariance matrix (i.e., all eigenvalues greater than zero). Just any symmetric matrix with a positive determinant is not necessarily positive definite; for example, it could have an even number of negative eigenvalues. A legitimate covariance matrix must not have any negative eigenvalues, since its eigenvalues are the variances of the uncorrelated random variables which can always be obtained through a suitable rotation of coordinates, and so all of the eigenvalues must be positive definite, i.e., greater than zero. Negative eigenvalues would correspond to imaginary standard deviations, and zero eigenvalues would make the covariance matrix singular (i.e., the determinant is the product of the eigenvalues, hence it would be zero, and since the determinant is invariant under rotation, that means that the covariance matrix would be singular in all rotated coordinate systems). Another way of saying this is: one cannot simply construct a correlation matrix larger than 2×2 with ones on the diagonal and off-diagonal elements in the open interval (-1, +1), because the result might not be positive definite. Keeping the correlation magnitude less than 1.0 is all that’s needed for a 2×2 case, and so the example at the beginning of this appendix will always work, and the desired correlation may be controlled exactly. But for example, one may not specify a 3×3 correlation matrix by simply declaring
$$\mathrm{P} = \begin{pmatrix} 1 & 0.6 & -0.6 \\ 0.6 & 1 & 0.6 \\ -0.6 & 0.6 & 1 \end{pmatrix} \qquad\qquad \text{(G.5)}$$
where we denote a complete correlation matrix with the Greek capital rho (which has the same appearance as the Latin P) because the lower-case rho is commonly used to denote elements of correlation matrices. In this example, the eigenvalues are -0.2, 1.6, and 1.6.
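This is easy to verify numerically; for example, with NumPy:

import numpy as np

P = np.array([[ 1.0,  0.6, -0.6],
              [ 0.6,  1.0,  0.6],
              [-0.6,  0.6,  1.0]])

print(np.linalg.eigvalsh(P))    # [-0.2, 1.6, 1.6]: one negative eigenvalue,
                                # so this proposed correlation matrix is not positive definite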
That such an arbitrary correlation matrix might be invalid is somewhat intuitive, since it is clear that off-diagonal elements of a correlation matrix cannot be completely independent in general. For example if x1 and x2 are positively correlated, and if x2 and x3 are positively correlated, then one cannot simply assume in general that one may declare x1 and x3 to be uncorrelated (or negatively correlated). This raises the question: is correlation transitive? For example, for real numbers, the inequality “greater than” is transitive: if A > B and B > C, then A > C. So if instead, A and B are correlated, and B and C are correlated, then are A and C correlated? The answer is “not necessarily”; a correlation matrix can be band-diagonal (i.e., zero everywhere except along the diagonal, the subdiagonal, and the superdiagonal; of course, in a correlation matrix the last two contain the same elements because the matrix is symmetric). A good way to develop a feel for this is to simulate correlation. We could construct various arbitrary correlation matrices, verify that they are positive definite, and then study their properties. But it might be better instead to start by asking “where does correlation come from in the first place?” and try simulating the answer in a more microscopic way. We will require all off-diagonal elements of a correlation matrix to be in the open interval (-1, +1), i.e., |ρ_jk| < 1 for j ≠ k, where ρ_jk is the coefficient of correlation between the jth and kth components of the N-vector. We make this requirement because if two different elements are ±100% correlated, then the requirements above for a positive definite matrix will not be met. Of course, we always have ρ_jj = 1. First of all, it is physically intuitive to consider that if two random variables are partially correlated, then they cannot both be produced by different irreducible processes. At least one must have two or more processes combining to generate it, one or more of which are common to the other variable. In the discussion of two correlated random variables given at the beginning of this appendix, we saw that x could be irreducible, i.e., the generator that produced it could involve a process that cannot be broken down into subprocesses. But y is obviously produced by two processes, its own generator and that of x, with some rescaling. Thus y is produced by a composite process, one that can be broken down into two subprocesses, and possibly more, depending on whether G_x and G_y are themselves composite processes. When we said “we could have swapped the roles of the variables”, the symmetry is not absolute in that example, since our choice left y as necessarily coming from a composite process, hence it was the sum of at least two random variables, hence it was at least one step down the road to being at least approximately Gaussian because of the Central Limit Theorem. Thus we could not, for example, declare that x and y are both uniformly distributed, since y could not be (except in the trivial case of G_y(σ_y) being a Dirac Delta Function, which would convolve with the uniform G_x(σ_x) to yield a uniform distribution, but being zero-mean, would add nothing to ȳ_G). Declaring certain random variables to be correlated necessarily implies some constraints on their distribution. In general, we must relinquish some control over the forms of the distributions if we are to force correlations in the desired amounts.
Note that any Gaussian random variable can be easily resolved into the sum of two other Gaussians just by partitioning the variance. This makes Gaussian random variables the easiest to correlate arbitrarily, and given the importance of this distribution, along with the fact that it is not lost when adding these variables, we will just work with this distribution for the remainder of this appendix (with one exception at the end). This will also ease checking whether our simulated correlated random variables behave as they should. One way to generate vectors with correlated components is to generate the components from
composite processes and allow some of the subprocesses to contribute to more than one vector component. We will consider an example of a 12-vector, since that maintains a correspondence to the example of the 12 flux measurements above, in which we hinted that correlated measurement errors could be important. We denote this vector R, with components rj, j = 1 to 12. Of the many approaches to generating a random 12-vector with correlated components, we will look into what is probably the simplest, one in which each component is the sum of M random variables, and the terms in the sums for some of the vector components are mutual. For concreteness and combinatorial convenience, we will set M equal to 12 also. It should become apparent that this is in no way a requirement, just as the fact that using the same M value for each component of the N-vector is not a requirement. There is a great deal of flexibility available when one is designing one’s own model of reality. So we may need up to 144 independent pseudorandom deviates per vector, depending on how many we assign to more than one vector element. We will denote the values obtained on these pseudorandom draws di, i = 1 to 144, and we will define a 144×12 matrix G whose elements are gij = 1 if di is one of the 12 contributors to vector element rj and 0 otherwise. While we are simplifying, let us start out with each of these being zero-mean unit-variance Gaussian draws, and hereinafter, we will use the simple term “random” to mean “pseudorandom”. Each vector component will then be
$$r_j = \sum_{i=1}^{144} g_{ij}\, d_i \qquad\qquad \text{(G.6)}$$
Since we are using 12 non-zero gij values for each j, and since each di is an independent draw from a zero-mean unit-variance distribution (Gaussian in this case), it follows that rj is zero-mean with variance 12 (and Gaussian). Thus the covariance matrix for R will have values of 12 down the diagonal. The off-diagonal elements will vary, depending on how we set the gij. The elements of the correlation matrix are all obtained in the same manner as the fourth line of Equation G.1. We are now free to set the gij to any desired pattern of 12 ones and 132 zeros per row. For a given value of i, whenever gij = 1 for more than one value of j, the corresponding rj will not be independent; they will also be correlated as long as the gij are allowed only values of 0 or 1. In this simple first example, only positive correlation can be induced (later we will consider more general values for gij). Figure G-1 shows six examples of this. Each gij cell is filled if the corresponding value is one, otherwise it is unfilled to indicate a value of zero. In case A, there is no filled-cell overlap between any vector components, and so they are all uncorrelated, and Ρ is the identity matrix.
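As a sketch of how such a G-matrix might be set up and exercised in code (our own construction, assuming NumPy; the column placements are one possible layout consistent with Case B, i.e., 12 contributors per component with a four-cell overlap between neighbors):

import numpy as np

rng = np.random.default_rng(2)
N, M, trials = 12, 144, 200_000

# one layout consistent with Case B: component j uses contributors 8j .. 8j+11,
# so adjacent components share exactly four contributors
G = np.zeros((M, N))
for j in range(N):
    G[8 * j : 8 * j + 12, j] = 1.0

d = rng.standard_normal((trials, M))     # independent zero-mean unit-variance draws
R = d @ G                                # Equation G.6, all trials at once

C = np.corrcoef(R, rowvar=False)
print(np.round(C[0, :4], 3))             # ~[1.0, 0.333, 0.0, 0.0]

The sample correlation between neighboring components comes out near +1/3 and essentially zero beyond, matching the tridiagonal correlation matrix given in Equation G.8 below.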
Figure G-1. Each grid has 144 columns, each representing an independent draw from a random-number generator, and all but D have 12 rows, each representing an element of a 12-vector whose value is the sum of the random draws whose cells are filled on that row. A. The elements are uncorrelated; the correlation matrix is the identity matrix. B. Each element is +33% correlated with its neighbors; the correlation matrix is tridiagonal. C. Each element is +67% correlated with its nearest neighbors and +33% correlated with its next-nearest neighbors; the correlation matrix is pentadiagonal. D. The first two elements are positively correlated > 0.5, but the first and third elements are uncorrelated because of the negative contributions indicated by dashes on the third row. E. Each element is +33% correlated with every other element; the correlation matrix is floordiagonal. F. 144 randomly selected cells distributed over the 12 rows, which range from 8 to 17 selected cells each; see text for the correlation matrix.
In case B, each vector component has four-cell overlap with its neighbors. The covariance for vector components rj and rk, vjk, is
$$v_{jk} = \langle r_j\, r_k \rangle = \Bigl\langle \sum_i g_{ij}\, d_i \sum_m g_{mk}\, d_m \Bigr\rangle = \sum_i \sum_m g_{ij}\, g_{mk}\, \langle d_i\, d_m \rangle = \sum_i \sum_m g_{ij}\, g_{mk}\, \delta_{im} = \sum_i g_{ij}\, g_{ik} \qquad\qquad \text{(G.7)}$$
where δ_im is the Kronecker delta, because di and dm are independent draws for i ≠ m, with zero covariance, whereas for i = m, this expectation value is 1 because di has unit variance. But now we see that the final expression in Equation G.7 is just the number of di common to rj and rk. In case B, there are four draws common to adjacent vector elements and none common to non-adjacent elements. The subdiagonal and superdiagonal elements of the correlation matrix are therefore 4/√(12×12) = 1/3. Besides these and the diagonal elements of 1, all other elements are zero, and Ρ is band-diagonal:
$$\mathrm{P} = \begin{pmatrix}
1 & \tfrac{1}{3} & 0 & \cdots & 0 & 0 \\
\tfrac{1}{3} & 1 & \tfrac{1}{3} & \cdots & 0 & 0 \\
0 & \tfrac{1}{3} & 1 & \ddots & 0 & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & \ddots & 1 & \tfrac{1}{3} \\
0 & 0 & 0 & \cdots & \tfrac{1}{3} & 1
\end{pmatrix} \qquad\qquad \text{(G.8)}$$

(the full 12×12 matrix has 1 on the diagonal, 1/3 on the first off-diagonals, and 0 elsewhere)
This assumes that there is some significance to the order of the vector elements. For example, each element may be a readout of an instrument with some hysteresis, and the elements are in time order. It is an example of what was said above: A and B are correlated, and B and C are correlated, but A and C can be uncorrelated, depending on the signs and magnitudes of the correlations. In case C, the di overlap between nearest-neighbor vector components is eight, and between next-nearest-neighbor components it is four. This leads to
$$\mathrm{P} = \begin{pmatrix}
1 & \tfrac{2}{3} & \tfrac{1}{3} & 0 & \cdots & 0 \\
\tfrac{2}{3} & 1 & \tfrac{2}{3} & \tfrac{1}{3} & \cdots & 0 \\
\tfrac{1}{3} & \tfrac{2}{3} & 1 & \tfrac{2}{3} & \ddots & \vdots \\
0 & \tfrac{1}{3} & \tfrac{2}{3} & 1 & \ddots & \tfrac{1}{3} \\
\vdots & \vdots & \ddots & \ddots & \ddots & \tfrac{2}{3} \\
0 & 0 & \cdots & \tfrac{1}{3} & \tfrac{2}{3} & 1
\end{pmatrix} \qquad\qquad \text{(G.9)}$$

(the full 12×12 matrix is pentadiagonal, with 1 on the diagonal, 2/3 on the first off-diagonals, and 1/3 on the second off-diagonals)
This shows that there is a limit beyond which we cannot push “A and B are correlated, and B and C are correlated, but A and C can be uncorrelated” as long as we keep the number of zero-mean unit-variance random contributors the same for all vector components and use only positive correlation. In such cases, as soon as ρ12 > 0.5 and ρ23 > 0.5, we must have ρ13 > 0. But if we allow negative correlation, we can indeed prevent ρ13 from becoming nonzero, i.e., we can make A and C uncorrelated or even negatively correlated. For example, in case D, only the third row is different from case C, so only the first three rows are shown. Vector component r1 has the usual 12 contributors, gi1 = 1 for i = 1 to 12, and r2 has the usual 12 contributors, gi2 = 1 for i = 5 to 16, but now r3 has the 12 contributors gi3 = -1 for i = 1 to 4 (indicated by dashes in those cells instead of the cells being filled) but +1 for i = 9 to 16 (12 contributors, the first four negative, the rest positive, with the four negative and four of the positive being in common with r1 and all eight positive in common with r2). The resulting correlations are ρ12 = 8/√(12×12) = 2/3, ρ23 = 8/√(12×12) = 2/3, and ρ13 = (4-4)/√(12×12) = 0. So we have ρ12 > 0.5, ρ23 > 0.5, but ρ13 = 0. This example shows that in general, even correlations greater than 0.5 between two elements don’t guarantee correlation with other adjacent elements. It is an example of two random variables, r1 and r3, being uncorrelated but not independent. Such an arrangement of composite random processes would be extremely difficult to detect empirically via sample statistics but seems unlikely to arise in practice. In case E, there are common contributors to all elements, di for i = 1 to 4, and no other contributors correspond to more than one vector element. This describes a situation in which 12 measurements were made with an instrument biased by some random amount. In the context of the 12 flux measurements of a star, this is not like atmospheric variations from night to night, since those would not be so consistently correlated over all 12 measurements. This is more like a systematic instrumental error in measuring the flux on each night. The effect is stronger than any of the previous cases, because this correlation extends across all of the vector elements, populating the entire correlation matrix with nonzero values. In this particular example, we have
$$\mathrm{P} = \begin{pmatrix}
1 & \tfrac{1}{3} & \tfrac{1}{3} & \cdots & \tfrac{1}{3} \\
\tfrac{1}{3} & 1 & \tfrac{1}{3} & \cdots & \tfrac{1}{3} \\
\tfrac{1}{3} & \tfrac{1}{3} & 1 & \cdots & \tfrac{1}{3} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} & \cdots & 1
\end{pmatrix} \qquad\qquad \text{(G.10)}$$

(a 12×12 matrix with 1 on the diagonal and 1/3 in every off-diagonal position)
For want of a better name, we will call this matrix form floor-diagonal, since the off-diagonal values are all equal, i.e., they form a flat background, or “floor” (this is a special case of a “bordered matrix”, except completely filled with positive off-diagonal elements; see, e.g., Wigner, 1955). This form is important, since it approximates several real situations, namely those in which relative measurement errors are estimated to be equal and all measurements share the same absolute measurement uncertainty. Besides the correlated flux error example, this form also arises when astrometric errors of point sources are the same for the relative position errors in a measurement frame whose absolute position on the sky is also uncertain. The one absolute position uncertainty is common to all the point sources (i.e., it is systematic), which have the additional relative position uncertainty. In all such cases, the covariance and correlation matrix elements have only two values respectively, one down the diagonal, and the other (smaller value) in all off-diagonal positions. The off-diagonal elements are in fact the variance of the systematic error, hence positive, and smaller than the diagonal elements, since they constitute only a portion of the latter summed in quadrature. It is very easy to compute the eigenvalues, inverses, and determinants for this form of matrix. Consider the N×N covariance matrix
$$\Omega = \begin{pmatrix}
v_{11} & v_{12} & v_{12} & \cdots & v_{12} \\
v_{12} & v_{11} & v_{12} & \cdots & v_{12} \\
v_{12} & v_{12} & v_{11} & \cdots & v_{12} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
v_{12} & v_{12} & v_{12} & \cdots & v_{11}
\end{pmatrix} \qquad\qquad \text{(G.11)}$$

with inverse
$$W = \begin{pmatrix}
w_{11} & w_{12} & w_{12} & \cdots & w_{12} \\
w_{12} & w_{11} & w_{12} & \cdots & w_{12} \\
w_{12} & w_{12} & w_{11} & \cdots & w_{12} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
w_{12} & w_{12} & w_{12} & \cdots & w_{11}
\end{pmatrix} \qquad\qquad \text{(G.12)}$$
where
$$\Omega\, W = I = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix} \qquad\qquad \text{(G.13)}$$
Then it can be shown that
$$w_{12} = \frac{-v_{12}}{(v_{11}-v_{12})\,\bigl(v_{12}(N-1)+v_{11}\bigr)}, \qquad
w_{11} = \frac{v_{11}+(N-2)\,v_{12}}{(v_{11}-v_{12})\,\bigl(v_{12}(N-1)+v_{11}\bigr)} \qquad\qquad \text{(G.14)}$$
and the eigenvalues of the covariance matrix consist of one principal value λ1 and N-1 degenerate eigenvalues λ2, where
$$\lambda_1 = v_{11} + (N-1)\,v_{12}, \qquad \lambda_2 = v_{11} - v_{12} \qquad\qquad \text{(G.15)}$$
So the determinant is λ1λ2^(N-1). The eigenvalues are obviously all positive, since v11 > v12 > 0. As v12 approaches v11 in value, however, the corresponding correlation coefficient approaches unity. So a correlation matrix can be filled with off-diagonal elements with values arbitrarily close to 1.0 (but less by some arbitrarily small amount, otherwise the matrix would be singular). This will be pursued below with Case G. Case F departs from the uniformity of 12 contributors per vector component. This G-matrix was generated by turning on 144 cells at uniformly distributed random locations in the rows and columns. The resulting numbers of contributors to each vector component are 10, 9, 14, 9, 16, 8, 11, 14, 11, 17, 14, and 11, respectively, and the random overlaps vary from 0 to 4. The off-diagonal elements of the correlation matrix vary as the ratios of 0 through 4 to the square root of the products of the numbers of contributors (see Equation G.16 below), so a full presentation of the disparate values in this matrix is not particularly informative, but this case will be included in the plausibility tests discussed below. The main point is that there is no need to keep the number of contributors the same for each vector element. There are many variations to explore with the G-matrix approach, far too many for the present
scope. As in Case D, we can make gij negative, and we could also not limit its magnitude to 1. In the latter case, the elements of the covariance matrix are still given by Equation G.7 as long as we retain the constraint that all di must be drawn from independent unit-variance populations. This allows more control over the correlation coefficients
$$\rho_{jk} = \frac{\displaystyle\sum_i g_{ij}\, g_{ik}}{\sqrt{\displaystyle\sum_i g_{ij}^2\;\sum_i g_{ik}^2}} \qquad\qquad \text{(G.16)}$$
This still leaves the task of achieving very precise correlations somewhat difficult, however, since modifying gij and gik can affect the numerator and the denominator in different ways and also other elements of the correlation matrix. The advantage of the G-matrix method, besides its intuitive nature, is that it is easy to produce a covariance matrix that is positive definite (assuming not all cells are equal). But given that one is trying to simulate random vectors that follow a preconceived exact correlation or covariance matrix, then one cannot escape the additional task of verifying that the matrix is positive definite, after which the method based on Cholesky decomposition can be used. That will be discussed below, but first we will examine some tests to probe whether the simulations above have succeeded in producing random vectors that behave correctly. A convenient test for correct behavior is based on the chi-square random variable. In particular, since it is not unusual to be faced with the task of evaluating chi-square for samples drawn from populations with unknown correlation properties, we will define a related variable which we will call xi-square:
$$\xi_N^2 = \sum_{j=1}^{N} \frac{\bigl(r_j - \bar{r}_j\bigr)^2}{\sigma_j^2} \qquad\qquad \text{(G.17)}$$
This definition is identical to that for chi-square in Equation E.1 (p. 443), with this exception: Equation E.1 defines chi-square if and only if all off-diagonal elements of the correlation matrix are zero; we define xi-square as shown independently of the correlation matrix. If that happens to be diagonal, then xi-square is chi-square, but otherwise not. It is still a well-defined function of the random variables, and the expectation values of its moments can be computed formally via the joint density function for the Gaussian random variables rj. For the general case in which there may be correlations, this joint Gaussian density function is
$$p_N(R_N) = \frac{e^{-\chi_N^2/2}}{(2\pi)^{N/2}\,\sqrt{\left|\Omega_N\right|}} \qquad\qquad \text{(G.18)}$$
where RN denotes a random N-vector (in our example, N = 12), χN2 is the true chi-square as given by Equation E.4 (p. 444), i.e., formally including all correlation, even though the actual numerical values may be unknown, and |ΩN | is the determinant of the formal covariance matrix. With this, we can compute the raw moments of xi-square (see Equation A.2, p. 419)
$$m_n = \bigl\langle (\xi_N^2)^n \bigr\rangle = \int (\xi_N^2)^n\, p_N(x_N)\, dx_N \qquad\qquad \text{(G.19)}$$
for n = 1 to 4, which will allow us to compute the mean, variance, skewness, and kurtosis, since the raw moments can be converted to central moments via Equation A.4, and then we can apply Equation A.5 (both p. 429). The results of these formal manipulations are:
$$\mathrm{mean}(\xi_N^2) = N$$
$$\mathrm{var}(\xi_N^2) = 2\sum_{j=1}^{N}\sum_{k=1}^{N}\rho_{jk}^2$$
$$\mathrm{skew}(\xi_N^2) = \frac{8\displaystyle\sum_{j=1}^{N}\sum_{k=1}^{N}\sum_{m=1}^{N}\rho_{jk}\,\rho_{km}\,\rho_{jm}}{\left(2\displaystyle\sum_{j=1}^{N}\sum_{k=1}^{N}\rho_{jk}^2\right)^{3/2}}$$
$$\mathrm{kurt}(\xi_N^2) = \frac{3\displaystyle\sum_{j=1}^{N}\sum_{k=1}^{N}\sum_{m=1}^{N}\sum_{n=1}^{N}\left(\rho_{jk}^2\,\rho_{mn}^2 + 4\,\rho_{jk}\,\rho_{jm}\,\rho_{kn}\,\rho_{mn}\right)}{\left(\displaystyle\sum_{j=1}^{N}\sum_{k=1}^{N}\rho_{jk}^2\right)^{2}} \qquad\qquad \text{(G.20)}$$
The fact that the mean of xi-square is the same as that of chi-square can be seen as obvious; the definition of xi-square is a sum of N terms, each of which is a squared deviation divided by its own expectation value, hence a sum of N terms whose expectation values are unity. We assume herein that the r̄_j and σ_j² values are correct, because we are simulating random draws based on known means and variances. This is a privilege that goes with simulating from the “omniscient” viewpoint. When the true means and/or variances are not known, such as in the non-simulation case wherein they must be computed from samples containing nonzero correlations that are ignored, then the chi-square mean will be generally incorrect, because correlations affect the real value of the sample mean (see, e.g., Equations 4.22 and 4.23, p. 164), and of course one degree of freedom is lost when the mean is computed from the sample. In our simulation, we know the correct population means and variances of the random variables going into Equation G.17, so the expectation value for the mean of xi-square is still N, but the variance, skewness, and kurtosis of xi-square all depend explicitly on the full correlation matrix as shown in Equation G.20. For example, the variance of chi-square is 2N, but the variance of xi-square can be almost (but not quite) 2N². If the off-diagonal terms are zero, then these statistics reduce to those of chi-square, namely
$$\mathrm{var}(\xi_N^2) = 2N = \mathrm{var}(\chi_N^2)$$
$$\mathrm{skew}(\xi_N^2) = \frac{8N}{(2N)^{3/2}} = \sqrt{\frac{8}{N}} = \mathrm{skew}(\chi_N^2)$$
$$\mathrm{kurt}(\xi_N^2) = \frac{12N^2 + 48N}{(2N)^2} = 3 + \frac{12}{N} = \mathrm{kurt}(\chi_N^2) \qquad\qquad \text{(G.21)}$$
For any given G-matrix, we can compute the theoretical correlation matrix via Equation G.16 and use it to evaluate the theoretical mean, variance, skewness, and kurtosis via Equation G.20. We will compare those values to corresponding simulated values obtained from a Monte Carlo calculation employing one million trials. The large number of trials allows us to calculate an actual sample covariance matrix for the 12 vector components by averaging rj (hence also rk in the equation below) and vjk over all the trials, and then we can compute the simulated correlation matrix:
$$v_{jk} = \bigl\langle (r_j - \bar{r}_j)(r_k - \bar{r}_k) \bigr\rangle, \qquad \rho_{jk} = \frac{v_{jk}}{\sqrt{v_{jj}\, v_{kk}}} \qquad\qquad \text{(G.22)}$$
Each trial consists of drawing the needed di values, followed by an evaluation of the R vector based on the G-matrix via Equation G.6. Then we compute xi-square using the known population mean (zero) and variance (the sum of the squares of gij), so that xi-square will have 12 degrees of freedom, i.e., N = 12 in Equation G.20:
$$\xi_{12}^2(\tau) = \sum_{j=1}^{12} \frac{r_j^2(\tau)}{\displaystyle\sum_{i=1}^{144} g_{ij}^2} \qquad\qquad \text{(G.23)}$$
where the index τ indicates the trial number, τ = 1 to 10⁶. We compute the following sums.
$$S_j = \sum_{\tau} r_j(\tau), \qquad S_{jk} = \sum_{\tau} r_j(\tau)\, r_k(\tau), \qquad T_n = \sum_{\tau} \bigl[\xi_{12}^2(\tau)\bigr]^n \qquad\qquad \text{(G.24)}$$
Sj and Sjk are used to compute the covariance matrix elements:
$$\bar{r}_j = \frac{S_j}{10^6}, \qquad v_{jk} = \frac{S_{jk}}{10^6} - \bar{r}_j\, \bar{r}_k \qquad\qquad \text{(G.25)}$$
from which ρjk is obtained via Equation G.22. The raw moments of xi-square are m_n = T_n/10⁶, n = 1 to 4, and the central moments μ_n are obtained from Equation A.4, after which Equation A.5 gives
$$\mathrm{mean}(\xi_{12}^2) = m_1, \qquad \mathrm{var}(\xi_{12}^2) = \mu_2, \qquad \mathrm{skew}(\xi_{12}^2) = \frac{\mu_3}{\mu_2^{3/2}}, \qquad \mathrm{kurt}(\xi_{12}^2) = \frac{\mu_4}{\mu_2^{2}} \qquad\qquad \text{(G.26)}$$
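A compressed rendering of this trial loop in Python might look like the sketch below (our own code, assuming NumPy; it uses Case E's layout of four contributors shared by all twelve components, and fewer trials than the text's 10⁶ to keep memory modest).

import numpy as np

rng = np.random.default_rng(3)
N, M, trials = 12, 144, 100_000

# Case E: contributors 0..3 feed every component; the remaining cells are used once each
G = np.zeros((M, N))
G[:4, :] = 1.0
for j in range(N):
    G[4 + 8 * j : 4 + 8 * j + 8, j] = 1.0

var_j = (G**2).sum(axis=0)               # population variance of each component (12 here)
d = rng.standard_normal((trials, M))
R = d @ G                                # Equation G.6
xi2 = (R**2 / var_j).sum(axis=1)         # Equation G.23 (the population means are zero)

m1 = xi2.mean()
mu2 = np.mean((xi2 - m1)**2)
mu3 = np.mean((xi2 - m1)**3)
mu4 = np.mean((xi2 - m1)**4)
print(m1, mu2, mu3 / mu2**1.5, mu4 / mu2**2)   # Equation G.26: ~12, ~53.3, ~2.15, ~11.0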
As estimates of the population from which they were drawn, we expect the relative accuracy of our simulated sample statistics to vary approximately as the inverse square root of the number of trials, so about one part in 10³. Indeed the simulated correlation matrix elements turn out to have this level of accuracy, generally ±1 in the third decimal place. To provide some illustration of this without getting too elaborate, we will present a portion of the simulated correlation matrix for Case C, whose exact theoretical values are shown in Equation G.9, pentadiagonal with values of 2/3 and 1/3 off the diagonal. We give the numerical results to six decimal places. Note that the diagonal elements are all 1.000000 because the denominator in Equation G.22 is the square root of the square of the same numerical quantity as the numerator, and this incurs very little roundoff error.
$$\mathrm{P}_C = \begin{pmatrix}
1.000000 & 0.667231 & 0.334315 & 0.000287 & 0.001121 & 0.001312 \\
0.667231 & 1.000000 & 0.667089 & 0.332662 & 0.001492 & 0.000235 \\
0.334315 & 0.667089 & 1.000000 & 0.666076 & 0.331228 & 0.000888 \\
0.000287 & 0.332662 & 0.666076 & 1.000000 & 0.665857 & 0.001880 \\
0.001121 & 0.001492 & 0.331228 & 0.665857 & 1.000000 & 0.000809 \\
0.001312 & 0.000235 & 0.000888 & 0.001880 & 0.000809 & 1.000000
\end{pmatrix} \qquad \text{(G.27)}$$
One more case was done using the G-matrix approach in order to investigate a situation involving extremely high positive correlation. The maximum off-diagonal correlation possible with 144 independent unit-variance random contributors is shown in Figure G-2: all cells in the matrix are turned on except for one different cell in each row. Thus each row has 143 cells turned on, all of which are in common with all other rows except one cell in each row. This results in a floor-diagonal correlation matrix with all off-diagonal elements equal to 142/143, or about 0.993. While this causes very large increases in variance, skewness, and kurtosis relative to chi-square, the mean nevertheless remains 12.
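For floor-diagonal correlation matrices, Equation G.20 is easy to evaluate by brute force; the small sketch below (our own, assuming NumPy) reproduces the theoretical Case E and Case G values quoted in the table that follows.

import numpy as np

def xi2_stats(P):
    # mean, variance, skewness, kurtosis of xi-square from Equation G.20
    N = P.shape[0]
    s2 = np.sum(P**2)             # double sum of rho_jk^2
    s3 = np.trace(P @ P @ P)      # triple sum of rho_jk rho_km rho_jm
    s4 = np.trace(P @ P @ P @ P)  # quadruple sum of rho_jk rho_jm rho_kn rho_mn
    var = 2.0 * s2
    return N, var, 8.0 * s3 / var**1.5, 3.0 * (s2**2 + 4.0 * s4) / s2**2

for rho in (1.0 / 3.0, 142.0 / 143.0):    # Case E and Case G
    P = np.full((12, 12), rho)
    np.fill_diagonal(P, 1.0)
    print([round(v, 3) for v in xi2_stats(P)])
# prints [12, 53.333, 2.154, 11.04] and [12, 284.321, 2.828, 15.0]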
Figure G-2. The G-matrix for Case G is shown at the top; all cells are turned on except for one different off-cell in each row. Below the grid, the histograms for three cases are shown. Case A has no correlation, hence xi-square is chi-square, and its histogram has the shape of a true chi-square distribution with 12 degrees of freedom. Case E, the floor-diagonal correlation matrix with 1/3 in all off-diagonal positions; the mean is still 12, but the skewness and kurtosis have increased relative to chi-square, with the peak shifting to the left. Case G is floor-diagonal with 142/143 in all off-diagonal positions; the extreme correlation moves the peak much farther to the left, but the mean remains 12, while the variance has increased to 284.321.
The plot in Figure G-2 shows the shapes of the histograms for Case A, Case E, and Case G. Case A has the shape of a true chi-square, since its correlation matrix is the identity matrix, and xi-square is chi-square. Case E has the moderate global correlation typical of measurements with a significant systematic error; even so, the distribution is only slightly distorted relative to chi-square. Case G is extremely non-chi-square, however, and its distribution is correspondingly different from that of chi-square. The peak occurs in the first histogram bin, but the mean remains 12, with the very
large increase in skewness that would be expected from that fact, as seen below. The xi-square statistics for the seven cases are shown in the following table, which gives both the theoretical values (denoted “T”) and the values obtained in the Monte Carlo simulation of one million trials (denoted “S”).

Case    Mean      Variance    Skewness    Kurtosis
A(T)    12.000     24.000      0.816       4.000
A(S)    12.001     24.045      0.821       4.008
B(T)    12.000     28.889      0.996       4.578
B(S)    12.003     28.911      1.004       4.632
C(T)    12.000     48.000      1.369       5.990
C(S)    12.005     48.025      1.379       6.062
D(T)    12.000     45.778      1.326       5.826
D(S)    12.006     45.737      1.333       5.887
E(T)    12.000     53.333      2.154      11.040
E(S)    12.004     53.168      2.149      10.988
F(T)    12.000     27.480      1.007       4.735
F(S)    11.997     27.466      1.002       4.701
G(T)    12.000    284.321      2.828      15.000
G(S)    11.997    283.569      2.818      14.828
Note that actual empirical results for these statistics, like Monte Carlo results from undisclosed populations, cannot by themselves reveal the underlying full correlation matrix, but they can give a qualitative idea about the general magnitude of overall correlation, especially whether correlation is negligible. This can be useful in actual situations when possible unknown correlations cannot be ruled out, but a statistically significant sample of xi-square values can be obtained, while the original data from which the xi-squares were computed are unavailable. Here we refer to “xi-square” values, because until significant correlations have been ruled out, one should not assume that one is dealing with chi-square values unless one knows for certain that the chi-square values were computed with correlation taken into account (e.g., Equation E.4, p. 444). If the set of xi-square values has excessive variance, skewness, and kurtosis for the number of degrees of freedom, then the presence of significant correlations is indicated, especially if those statistics fit the known xi-square pattern illustrated in the table above for 12 degrees of freedom (it is also possible that deviations from chi-square behavior could be caused by the fundamental processes being non-Gaussian). The reason why
xi-square values are often available is that they can be reconstructed from sample variances if an estimate of the population variance can be obtained. The close agreement between theoretical and simulated statistics in the table above shows that the G-matrix algorithm produces excellent correlated random variables. The slight differences are to be expected because of random fluctuations. The theoretical statistics were computed by using the theoretical correlation coefficients from Equation G.16 in the theoretical formulas given in Equation G.20, but the simulated correlation coefficients in Equation G.22 could also be used in Equation G.20, and the results are generally closer to the quoted theoretical values than the simulated values. Now we turn to the most general method for the task of generating correlated Gaussian random variables, given that one has a covariance matrix already in hand that is to govern the behavior of the random draws. This is the method based on Cholesky decomposition. An added advantage of this method is the fact that if Cholesky decomposition is possible (i.e., no divisions by zero or square roots of negative numbers occur during the process), then that alone establishes the fact that the matrix is positive definite (to within the usual computer-arithmetic roundoff error). Many areas of linear algebra involve factoring a matrix into two or more matrices whose product is the original matrix (e.g., singular value decomposition, lower-upper decomposition, etc.). When the matrix is symmetric and positive definite, the most efficient factorization method is Cholesky decomposition. An example of its application is solving linear systems such as chi-square minimization, which was discussed in Appendix D, where it was pointed out that the coefficient matrix is symmetric (Equation D.5, p. 435). We omitted any discussion of Cholesky decomposition therein because its main interest is efficiency stemming from not having to invert the coefficient matrix, whereas we wanted that inverse anyway for purposes of uncertainty analysis. Since this appendix is not about linear algebra, we will leave derivations and proofs for the standard literature. The application of interest herein is generating correlated Gaussian random variables, i.e., correlated elements of Gaussian vectors governed by a covariance matrix. We continue to focus our attention on Gaussian random variables because of their importance and the fact noted above that in the real world, correlations are induced by sharing common subprocesses, making the random variables sums of these subprocesses, hence tending toward being Gaussian because of the Central Limit Theorem. We will, however, take one look at a case in which the processes generating the random numbers are non-Gaussian in order to understand what the implications are. But first, we will describe the basic algorithm. Given a symmetric positive-definite N×N matrix A, Cholesky decomposition computes a lower triangular matrix L such that LLT = A, i.e., the product of L with its own transpose yields A. This similarity to squaring is why Cholesky decomposition is sometimes called “taking the square root” of a matrix. All elements of L above its diagonal are zero, which is why it is called a lower triangular matrix. The non-zero components are computed by alternately applying the two formulas
L_{ii} = \sqrt{ A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2 }
L_{ji} = \frac{1}{L_{ii}} \left( A_{ij} - \sum_{k=1}^{i-1} L_{ik} L_{jk} \right), \quad j = i+1 \text{ to } N          (G.28)
where it is understood that the summations on k are done only for i > 1. Thus for a given i value, the
first formula provides what is needed by the second formula for each value of j, which then provides what is needed in the first formula for the next i value. By starting with i = 1, this alternation allows the computation to proceed to i = j = N, leaving Lji = 0 for j < i. So the first column is computed first, then the second column, and so on. A simple 3×3 example is:
L = \begin{pmatrix}
\sqrt{A_{11}} & 0 & 0 \\
A_{12}/L_{11} & \sqrt{A_{22} - L_{21}^2} & 0 \\
A_{13}/L_{11} & (A_{23} - L_{21} L_{31})/L_{22} & \sqrt{A_{33} - L_{31}^2 - L_{32}^2}
\end{pmatrix}          (G.29)
where we express some elements of L in terms of other elements of L that have already been computed as we work our way down each column and from left to right in columns. For a given N, the elements could be expressed entirely in terms of elements of A, eliminating any dependence on the order of calculation, but the expressions become tedious quickly as N takes on larger values. Clearly, all arguments of the square root must be greater than zero if imaginary values are to be precluded and if divisions by zero are to be avoided. Satisfying that constraint is necessary and sufficient to establish that the A matrix is positive definite. The L matrix is used in a manner analogous to that of computing a scalar random deviate from a zero-mean unit-variance deviate: multiply the latter by the square root of the desired variance, then add the desired mean. So we perform a Cholesky decomposition on the covariance matrix that we want to govern subsequent random vector draws; this gives us the analogy to the square root of the variance. Then for each random vector, we begin with a vector whose elements are all independent zero-mean unit-variance deviates, which we designate as Z. Then we multiply that vector from the left by L to get the random vector whose statistical behavior is governed by the covariance matrix:
R = L Z          (G.30)
We can add a constant vector to the result if we wish non-zero-mean random vector elements. The covariance matrix carries no information about means, so we must save adding any non-zero means for the last step. If any of the covariance matrices generated in cases A-G above are used to generate one million random vectors via Cholesky decomposition, these will produce the same results as the G-matrix method to within the expected fluctuations at the third significant digit. We said above that the elements of Z must be drawn from a zero-mean unit-variance population. This population does not necessarily have to be Gaussian; any distribution may be used, and even different distributions for different vector elements, although if statistical stability of the simulated covariance matrix in a Monte Carlo calculation is desired, some unruly distributions can require a much larger number of trials than when Gaussians are used throughout. But as mentioned above, when forcing a given covariance/correlation behavior on random variables, one sacrifices some control over the distribution of the vector elements, since each is generally a linear
combination of the Z elements, hence a convolution of rescaled draws from that parent population (we say “generally” because the number of terms in the linear combination is equal to the vector element number, hence the first element’s distribution is not affected at all by the algorithm, and the effect of the Central Limit Theorem is felt more by the higher-numbered elements of the vector). If the correlation is weak, any non-Gaussian quality of the parent distribution may possibly be retained to a detectable extent in all vector elements, but generally one does not go to the trouble of generating correlated random vectors if the correlation is intended to be weak. When N is small, the number of terms in the linear combination is small, so the Central Limit Theorem has less opportunity to shape the output distributions. But even in these cases, it can be difficult to force not only correlation but also a specifically desired non-Gaussian distribution on the random vector elements. One attractive feature of Gaussian parent populations is that one need not worry about the distribution changing as part of the process. If Gaussians are acceptable, then one can dismiss issues about how the randomized vectors are distributed, since it is a simple case of “Gaussians in, Gaussians out” as seen above in the xi-square behavior. We noted previously that the example of a 2-vector which began this appendix involves an asymmetry: one element is made to include part of the other, i.e., forced to be a linear combination of two subprocesses, whereas the other is left with its own parent-population distribution. The same is true of the Cholesky-decomposition method, as noted above, where we said the first element is unaffected, i.e., it has the same distribution as the first element of Z. We will expand Equation G.30 for a simple case of N = 5 for purposes of illustration.
R_1 = L_{11} Z_1
R_2 = L_{21} Z_1 + L_{22} Z_2
R_3 = L_{31} Z_1 + L_{32} Z_2 + L_{33} Z_3          (G.31)
R_4 = L_{41} Z_1 + L_{42} Z_2 + L_{43} Z_3 + L_{44} Z_4
R_5 = L_{51} Z_1 + L_{52} Z_2 + L_{53} Z_3 + L_{54} Z_4 + L_{55} Z_5
Since the number of terms in the linear combination increases with vector element number, the Gaussian approximation can be expected to improve for the higher-numbered elements. If the Z distributions are significantly non-Gaussian, this forces a loss of those specific non-Gaussian qualities that increases with vector element number. Such an asymmetry may be completely arbitrary if the order of the vector elements was arbitrary. In such cases, one should be aware that the distribution of a given vector element depends on its location in the vector. To illustrate this, we will generate a Z vector for N = 12 using uniformly distributed deviates, then force the correlation matrix of Case E, in which all off-diagonal elements were 1/3. To keep the example simple, we will use the Case E correlation matrix as the covariance matrix that governs the randomization, i.e., that goes into the Cholesky decomposition. This just amounts to dividing the covariance matrix by 12, the variance that resulted from originally adding 12 zero-mean unit-variance Gaussians to form each random vector element. In other words, the correlation matrix is the covariance matrix if all the random variables are unit-variance. Since we need a zero-mean unit-variance population for Z in order to preserve the scale of the covariance matrix, we use a uniform distribution between −√3 and +√3. When we do this in a Monte Carlo calculation of one million trials, we find that the simulated covariance and correlation matrices
obtained previously in Cases A-G are reproduced to within the expected random fluctuations, but each simulated vector element has a different distribution, starting with a pure uniform and becoming progressively more Gaussian as we go down the vector (except for Case A, in which the vector elements were uncorrelated, so the uniform distribution survives unscathed throughout the vector).

In the current example based on the Case E correlation matrix, we histogram each of the 12 vector elements over the one million trials using a bin size of 0.01 and a range from -3.0 to +3.0. This range allows the more nearly Gaussian distributions room for 3σ fluctuations. The histograms are shown in Figure G-3 for element numbers 1, 2, 3, 5, 8, and 12. The first three are shown because of their distinctive character, after which the shape becomes progressively more nearly Gaussian by less obvious degrees. Element number 1 remains uniformly distributed between −√3 and +√3. Element number 2 shows the trapezoidal shape of two dissimilar uniform distributions convolved with each other. The Cholesky matrix for this case yields R2 = (1/3)Z1 + (√8/3)Z2, which allows fluctuations from about −2.2103 to +2.2103. Element number 3 shows the first appearance of curvature, but its domain is still smaller than the histogram range. Element number 5 shows additional progress toward a bell shape, but is still incapable of 3σ fluctuations, which begin with element number 6 (not shown). Element number 8 continues the progression and has 21 fluctuations outside of ±3σ in this simulation. Element number 12 is the closest to Gaussian, but with only 55 fluctuations outside of ±3σ, it falls far short of what a true Gaussian would be expected to produce, about 2700. Without these histograms, one might never notice that the vector elements follow different distributions, since the fact that Z was composed of uniformly distributed random variables instead of Gaussians is not evident in the output correlation matrix, which comes as close to Equation G.10 as the original Case E. Figure G-3 A and B illustrate an interesting corollary of Equation G.1 (and the second line in Equation G.31): two uniformly distributed random variables cannot be correlated with each other. In order to be correlated, at least one of them must be nonuniformly distributed by virtue of being a sum of two random variables. About the closest we can come is for one of them to have a trapezoidal distribution like that in Figure G-3B.
Figure G-3. Histograms of distributions of elements of a random vector generated for the Case E correlation matrix via Cholesky decomposition using uniformly distributed zero-mean unit-variance random vector elements in a Monte Carlo calculation of one million trials. The bin size is 0.01 over the range ±3.0. A. Vector element number 1, which remains uniformly distributed between −√3 and +√3. B. Vector element number 2, trapezoidal between about -2.21 and +2.21. C. Vector element number 3, first appearance of curvature. D. Vector element number 5, more bell-like, still no 3σ fluctuations. E. Vector element number 8, yet more bell-like, 21 fluctuations outside of ±3σ. F. Vector element number 12, closest to Gaussian, 55 fluctuations outside of ±3σ.
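The experiment shown in Figure G-3 is straightforward to reproduce. The following is a minimal Python sketch (assuming NumPy is available; the random seed is arbitrary, and this is an illustration rather than the code used for the figure): it implements Equation G.28 directly, applies Equation G.30 with uniformly distributed Z elements, and checks the resulting covariance matrix and the ±3σ behavior of the first and last vector elements.

import numpy as np

def cholesky_lower(A):
    # Lower-triangular L with L L^T = A, computed column by column via Equation G.28.
    N = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for i in range(N):
        L[i, i] = np.sqrt(A[i, i] - np.sum(L[i, :i] ** 2))
        for j in range(i + 1, N):
            L[j, i] = (A[i, j] - np.sum(L[i, :i] * L[j, :i])) / L[i, i]
    return L

rng = np.random.default_rng(12345)            # arbitrary seed
N, trials = 12, 1_000_000
cov = np.full((N, N), 1.0 / 3.0)              # Case E correlation matrix used as covariance
np.fill_diagonal(cov, 1.0)
L = cholesky_lower(cov)
print(np.allclose(L @ L.T, cov))              # True: the factorization reproduces the matrix

Z = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(trials, N))   # zero mean, unit variance
R = Z @ L.T                                   # Equation G.30 applied to each row of Z

print(np.cov(R, rowvar=False).round(3))       # close to 1/3 off the diagonal, 1 on it
print(np.sum(np.abs(R[:, 0]) > 3.0),          # 0: element 1 stays uniform on (-sqrt(3), +sqrt(3))
      np.sum(np.abs(R[:, 11]) > 3.0))         # a few dozen: element 12 is nearly Gaussian

NumPy's built-in numpy.linalg.cholesky performs the same factorization and can be used as a cross-check of the explicit loops.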
Appendix H The Planck Parameters

The Planck length LP, time TP, mass MP, frequency νP, and energy EP are computed by forming combinations of the gravitational constant G (sometimes called the Newton constant), Planck's constant h, and the speed of light c (sometimes called the Einstein constant) in such a way as to arrive at the correct dimensions. These are:
L_P = \left( \frac{G h}{c^3} \right)^{1/2}
T_P = \frac{L_P}{c} = \left( \frac{G h}{c^5} \right)^{1/2}
M_P = \left( \frac{h c}{G} \right)^{1/2}          (H.1)
\nu_P = \frac{1}{T_P} = \left( \frac{c^5}{G h} \right)^{1/2}
E_P = M_P c^2 = h \nu_P = \left( \frac{h c^5}{G} \right)^{1/2}
The first three of these were presented by Planck in 1899 and are fundamental parameters, and the last two are derived from them. The Planck time could be considered derived, given its definition as the time it takes light to travel one Planck length. The approximate values of these parameters are LP = 4.05×10⁻³³ cm, TP = 1.35×10⁻⁴³ sec, MP = 5.46×10⁻⁵ gm, νP = 7.41×10⁴² Hz, and EP = 4.90×10¹⁶ erg = 3.06×10²⁸ eV. Also found in the literature are definitions of the Planck parameters which use ħ ≡ h/(2π), also called the "reduced Planck constant", consistently replacing h in Equation H.1 with ħ. This has no effect on physical dimensions and results in the approximate values LP = 1.61×10⁻³³ cm, TP = 5.39×10⁻⁴⁴ sec, and MP = 2.18×10⁻⁵ gm. These may have first appeared in a paper by Wheeler (1955) describing his concept of "geons" ("gravitational-electromagnetic entities"), gravitationally bound bundles of vibrational energy for which radial sizes, frequencies, and other parameters depend on action (e.g., the product of time and energy), for which he assumes ħ as a unit (with different notation, however). All of the geon parameters that depend on ħ also have other factors in their definitions (e.g., radial azimuthal standing-wave number), and he makes no use of terms such as "Planck length", "Planck frequency", etc. There are many other instances in which ħ appears as part of a length definition without implying that the corresponding definition of LP by itself applies to a fundamental physical object.
For example, area and volume eigenvalues in Loop Quantum Gravity (see section 6.7) are typically written with powers of (Għ/c³) among the coefficients of the quantum-number expression. In this case, the coefficients include powers of 2π that have the effect of converting ħ back to h. The problem with replacing h with ħ in Equation H.1, however, is that the resulting definitions do not produce a consistent result for the Planck energy, i.e., we no longer have MP c² = hνP, since the two sides of the equation have 2π appearing in different places, one in the denominator and the other in the numerator. Writing ħ explicitly as h/(2π), we have instead
L_P = \left( \frac{G h}{2\pi c^3} \right)^{1/2}
T_P = \frac{L_P}{c} = \left( \frac{G h}{2\pi c^5} \right)^{1/2}
M_P = \left( \frac{h c}{2\pi G} \right)^{1/2}          (H.2)
\nu_P = \frac{1}{T_P} = \left( \frac{2\pi c^5}{G h} \right)^{1/2}
E_P = M_P c^2 = \left( \frac{h c^5}{2\pi G} \right)^{1/2} \neq h \nu_P = \left( \frac{2\pi h c^5}{G} \right)^{1/2}
Some authors who prefer the latter definitions attempt to salvage the Planck energy by asserting without foundation that the Planck frequency is actually an angular frequency, so that multiplying it by ħ instead of h yields equality with MP c², but one cannot convert a frequency in Hertz to an angular frequency in radians/s simply by declaring it to be such. When a process has a period T, the corresponding angular frequency is 2π/T, not 1/T, assuming angular frequency is relevant at all (e.g., the frequency with which one passes mile markers while driving 60 miles per hour on a highway has no relevant corresponding angular frequency, unless one invokes the frequency with which one is circling the earth, but these are quite different, since the former frequency is 1/60 Hz, while the latter is about 4.21×10⁻⁶ radians/s). We therefore consider the original definitions based on Planck's 1899 paper and given in Equation H.1 to be the only correct ones. The strong possibility exists that these parameters correspond directly to fundamental physical objects in Quantum Gravity (see, e.g., Sorkin, 1998), so it is important to get their values right. Another interesting relationship involves these parameters and two other important fundamental parameters, the Schwarzschild radius RS and the Compton wavelength LC. Compressing a given mass M into a region smaller than RS results in a black hole (technically, the Schwarzschild metric becomes singular at this radius). Determining the position of a particle of mass M to an accuracy better than LC requires enough energy to create another particle of that mass (Baez, 2001). These parameters have the values
R_S = \frac{2 G M}{c^2}
L_C = \frac{h}{M c}          (H.3)
For the Planck mass, 1
R_S = \frac{2G}{c^2} \left( \frac{h c}{G} \right)^{1/2} = 2 \left( \frac{G h}{c^3} \right)^{1/2} = 2 L_P          (H.4)
for either definition of the Planck length (i.e., with h or with ħ = h/(2π); the factor of 2π cancels out in this case). So for the Planck mass, the Schwarzschild radius is twice the Planck length. For the Planck mass, the Compton wavelength is equal to the Planck length:

L_C = \frac{h}{M_P c} = \frac{h}{c} \left( \frac{G}{h c} \right)^{1/2} = \left( \frac{G h}{c^3} \right)^{1/2} = L_P          (H.5)
The last two equations illustrate the fact that extreme physical effects are to be encountered at the Planck scale, including the onset of singularities, suggesting that General Relativity was being pushed beyond its boundaries and was "breaking down". In the first half of the 20th century it became clear that quantum-mechanical effects were important, even dominant, in this regime. The consensus among mainstream physicists today is that only a Quantum Gravity Theory can provide proper descriptions of physical behavior at distances, times, and energies characterized by the Planck parameters.

A final point of interest involves the fact that in 1900 Max Planck published his discovery that the energy emitted by the electromagnetic resonators that produce black-body radiation comes in discrete bundles, each quantized with values E = hν. His notation was actually different, e.g., U instead of E to represent energy, but it is fundamentally the same equation that Einstein introduced in the context of the photoelectric effect (Einstein, Annalen der Physik 17 (6), 132-148, 1905) for which Einstein received the Nobel Prize. It seems possible in principle for Planck to have noticed that in terms of the Planck parameters (which did not go by that name at the time),
\nu_P = \frac{1}{T_P} = \left( \frac{c^5}{G h} \right)^{1/2}
E = h \nu_P = \frac{h}{T_P} = \left( \frac{h c^5}{G} \right)^{1/2} = \left( \frac{h c}{G} \right)^{1/2} c^2 = M_P c^2          (H.6)
and thereby anticipate another 1905 Einstein equation by five or six years, E = mc² (Einstein, Annalen der Physik 17 (8) 639-641, 1905b). The reason why this did not happen is not that Planck's
definitions of these parameters used the second set above, in which hνP ≠ MP c², although he is sometimes erroneously cited as having done so. In his paper (Planck, 1899) he quoted numerical values for LP, MP, and TP that match those given above in the second paragraph to within about 2%, with the difference being the value he used for h (called b therein), 6.885×10⁻²⁷ erg sec (given in terms of "cm²gr/sec" therein, which is equivalent to erg sec) instead of the modern value of exactly 6.62607015×10⁻²⁷ erg sec. Of course, noticing a mass-energy equivalence for the Planck mass and Planck energy does not immediately lead to a similar equivalence for all masses and energies, but the seed of the idea would have been planted, and often physical principles are discovered in the form of special cases that are later generalized. The reason why Planck missed the boat regarding E = mc² is that catching it would have been practically miraculous, but he did use what we referred to above as the only correct definitions. On the other hand, it would have been very surprising if he had used a "reduced Planck constant", since that definition didn't appear until the development of real Quantum Mechanics in the mid 1920s (see, e.g., Llanes-Estrada, 2013). It was apparently first used by Paul Dirac, who in 1899 was three years into the future. It is unquestionably convenient in reducing notational clutter in certain equations but should not be allowed to introduce confusion into the definitions of the Planck parameters.
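As a quick numerical cross-check of the point made in this appendix (a sketch, not part of the book's calculations; the CGS values of G and c below are standard reference values), the following Python fragment builds the Planck parameters from Equation H.1 with h, and again with ħ = h/(2π), and tests whether E_P = M_P c² equals hν_P in each case.

import math

G = 6.674e-8            # cm^3 g^-1 s^-2
h = 6.62607015e-27      # erg s
c = 2.99792458e10       # cm/s

def planck_parameters(action):
    # Returns (L_P, T_P, M_P, nu_P, E_P) built from G, c, and the given action constant.
    L = math.sqrt(G * action / c**3)
    T = L / c
    M = math.sqrt(action * c / G)
    nu = 1.0 / T
    E = M * c**2
    return L, T, M, nu, E

for name, action in (("h", h), ("hbar", h / (2.0 * math.pi))):
    L, T, M, nu, E = planck_parameters(action)
    print(f"{name}: L_P={L:.3e} cm  T_P={T:.3e} s  M_P={M:.3e} g  E_P/(h*nu_P)={E / (h * nu):.4f}")

# Expected output (approximately):
#   h:    L_P=4.05e-33 cm, T_P=1.35e-43 s, M_P=5.46e-5 g, E_P/(h*nu_P) = 1.0000
#   hbar: L_P=1.62e-33 cm, T_P=5.39e-44 s, M_P=2.18e-5 g, E_P/(h*nu_P) = 0.1592 = 1/(2*pi)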
Appendix I Estimating the Mean of a Poisson Population From a Sample

As mentioned in Section 2.11, the topic called "Sample Statistics" arises in two contexts, one in which we know everything about the population and use that knowledge to compute statistical properties of samples drawn from the population, and one in which we know very little about the population and use an observed sample to estimate population properties. The second context corresponds to the typical activity in scientific research in which measurements have been made for the purpose of determining the value of some physical parameter. The measurements are random draws from a population whose members each consist of the value sought plus some zero-mean fluctuation characterized by a distribution. Since the average over all fluctuations in the population is zero, the population mean is the desired value of the physical parameter. The goal is to use the sample to estimate the mean of the population and also the random distribution, which is as important as the mean, since it determines the eventual uncertainty of the estimate of the mean.

The first context is generally an exercise in considering hypotheses. We establish a scenario in which we choose a value for the mean of the population and a random-variable distribution to define the fluctuations attached to the mean to create each member of the population. Then we use this model to simulate the contents of samples composed of randomly selected members of the population. We can study how the sample properties vary from one sample to the next, including variation in sample size. This can be used to compare the theoretical samples to actual sets of measurements in order to gain insight into how observed samples relate to the population underlying those measurements. As the sample size becomes larger, the sample takes on greater resemblance to the population, whose mean and random distribution can therefore be estimated more accurately. If any of this seems unfamiliar, a review of Section 2.11 is recommended.

In the case of Poisson populations, choosing the mean also determines the variance of the population distribution, since the Poisson mean is equal to that variance. The Poisson distribution is defined in Equation 2.19 (shown below as Equation I.1), whose notation we will use herein. Our goal is to answer the question: given a Poisson population of known mean, what is the probability distribution of the unweighted average over a sample of size N drawn from that population? The Poisson distribution with mean λ for the random variable k contained in the set of nonnegative integers is

P(k, \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}          (I.1)
The mean and variance are both λ > 0; the skewness is 1/√λ, and the kurtosis is 3+1/λ. Standard maximum-likelihood analysis shows that if a sample of size N is drawn from a population with this distribution, then the maximum-likelihood estimator λ̂ for the population mean based on the sample is the unweighted average:
\hat{\lambda} = \frac{1}{N} \sum_{i=1}^{N} k_i          (I.2)
where the sample consists of ki, i = 1 to N. This can be shown by maximizing either the joint probability or its logarithm with respect to λ, but the latter is easier:

\ln L(\lambda) = \ln \prod_{i=1}^{N} \frac{e^{-\lambda} \lambda^{k_i}}{k_i!} = -N\lambda + \ln(\lambda) \sum_{i=1}^{N} k_i - \sum_{i=1}^{N} \ln k_i!
\frac{d}{d\lambda} \ln L(\lambda) = -N + \frac{1}{\lambda} \sum_{i=1}^{N} k_i = 0          (I.3)
The solution for λ is the estimator in Equation I.2. We will omit further details here, but it is well known that this estimator is unbiased and has the minimum possible variance. Since this estimator is a sample statistic, it is a random variable. The mean and variance are known to be λ and λ/N, respectively. The skewness, kurtosis, and distribution function are not easily found in the literature, so we will derive the distribution function herein and use it to compute the skewness and kurtosis. We first consider the case N = 2, which is of special interest for Section 4.8 (The Parameter Refinement Theorem). The estimator is (k1 +k2)/2. We define
u = \frac{k_1 + k_2}{2} = u_1 + u_2, \quad u_1 = \frac{k_1}{2}, \quad u_2 = \frac{k_2}{2}          (I.4)
The desired distribution function is therefore that for u, which is contained in the set {k/2}, where k is the set of nonnegative integers. We know that k1 and k2 both follow the Poisson distribution. Since u1 and u2 are obtained by a simple scale-factor change on k1 and k2, we can obtain their distributions by applying the theory of functions of random variables (see Appendix B), specifically Equation B.2 (p. 425), except as noted in the last paragraph of Appendix B, that equation pertains to density functions, and for use with probability mass distributions, the density scaling factor 1/|A| is not applicable and is therefore omitted. This provides the distributions

P_1(u_1, \lambda) = \frac{e^{-\lambda} \lambda^{2 u_1}}{(2 u_1)!}, \qquad P_2(u_2, \lambda) = \frac{e^{-\lambda} \lambda^{2 u_2}}{(2 u_2)!}          (I.5)
Figure I-1 illustrates this form of distribution for a Poisson random variable k1 with a mean of 10 and another random variable u1 = k1/2. Clearly the probability of 2u1 is the same as the probability of k1. Since u is the sum of u1 and u2, its probability distribution is the convolution of P1(u1) and P2(u2), as shown in Appendix B, Equations B.17 and B.18 (p. 428) except with those integrals changed to summations, since we have discrete random variables:

P(u, \lambda) = \sum_{k_1=0}^{\infty} P_1\!\left( \frac{k_1}{2}, \lambda \right) P_2\!\left( u - \frac{k_1}{2}, \lambda \right)          (I.6)
Figure I-1. Poisson-distributed random variable k1 with a mean of 10. The random variable u1 = k1/2 is distributed such that the probability of 2u1 is the same as the probability of k1. Both random variables are discrete with probabilities shown as filled circles; the curve joining the circles is intended only to aid visibility.
The summations have to be over either k1 or k2, not u1 or u2, in order to include all the half-integer values in the domain of u. The summation in Equation I.6 evaluates to

P(u, \lambda) = \frac{e^{-2\lambda} (2\lambda)^{2u}}{(2u)!}          (I.7)
We can use this to compute moments in the usual way (see Appendix A) to obtain

\langle u \rangle = \lambda
\sigma_u^2 = \frac{\lambda}{2}
\mathrm{skewness}(u) = \frac{1}{\sqrt{2\lambda}}          (I.8)
\mathrm{kurtosis}(u) = 3 + \frac{1}{2\lambda}
Clearly u does not follow the Poisson distribution, since the mean is not equal to the variance. But there are strong resemblances. The mean is still λ, but the variance is half the mean, and the skewness and kurtosis have 2λ where the Poisson distribution has simply λ.
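These moments are easy to confirm numerically. The short Python fragment below (an illustrative check, not from the book; λ = 3 is an arbitrary choice) evaluates the mean, variance, skewness, and kurtosis directly from the probability mass function in Equation I.7 and compares them with the closed forms in Equation I.8.

import math

lam = 3.0                                   # arbitrary population mean
nmax = 100                                  # truncation of the support (ample for this lam)
probs = [math.exp(-2.0 * lam)]              # P(u = 0) = exp(-2 lam)
for k in range(1, nmax):
    probs.append(probs[-1] * 2.0 * lam / k) # builds exp(-2 lam) (2 lam)^k / k!
u_vals = [k / 2.0 for k in range(nmax)]     # u = k/2 on the half-integer support

mean = sum(p * u for p, u in zip(probs, u_vals))
var = sum(p * (u - mean) ** 2 for p, u in zip(probs, u_vals))
skew = sum(p * (u - mean) ** 3 for p, u in zip(probs, u_vals)) / var ** 1.5
kurt = sum(p * (u - mean) ** 4 for p, u in zip(probs, u_vals)) / var ** 2

print(mean, lam)                            # lambda
print(var, lam / 2.0)                       # lambda/2
print(skew, 1.0 / math.sqrt(2.0 * lam))     # 1/sqrt(2 lambda)
print(kurt, 3.0 + 1.0 / (2.0 * lam))        # 3 + 1/(2 lambda)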
Applying a similar procedure to the case N = 3 (u = u1 + u2 + u3, ui = ki/3, and convolution of three distributions instead of two) results in

P(u, \lambda) = \frac{e^{-3\lambda} (3\lambda)^{3u}}{(3u)!}          (I.9)
with moments

\langle u \rangle = \lambda
\sigma_u^2 = \frac{\lambda}{3}
\mathrm{skewness}(u) = \frac{1}{\sqrt{3\lambda}}          (I.10)
\mathrm{kurtosis}(u) = 3 + \frac{1}{3\lambda}
and for N = 4 we find

P(u, \lambda) = \frac{e^{-4\lambda} (4\lambda)^{4u}}{(4u)!}          (I.11)
with

\langle u \rangle = \lambda
\sigma_u^2 = \frac{\lambda}{4}
\mathrm{skewness}(u) = \frac{1}{\sqrt{4\lambda}}          (I.12)
\mathrm{kurtosis}(u) = 3 + \frac{1}{4\lambda}
The obvious pattern that emerges supports what seems to be a very safe conjecture:
P(u, \lambda, N) = \frac{e^{-N\lambda} (N\lambda)^{Nu}}{(Nu)!}
\langle u \rangle = \lambda
\sigma_u^2 = \frac{\lambda}{N}          (I.13)
\mathrm{skewness}(u) = \frac{1}{\sqrt{N\lambda}}
\mathrm{kurtosis}(u) = 3 + \frac{1}{N\lambda}
where we add N to the parameter list of the distribution function now that it is a variable, and now
u is in the set {k/N}, where k is the set of nonnegative integers. Note that although the factorial can be evaluated for non-integer numbers, it is not necessary here, because Nu is always an integer. Note further that in the special case of N = 1, Equation set I.13 reduces to the standard Poisson formulas. Because we did not actually prove the generalization to a sample size of N, a series of Monte Carlo tests were run, and Equation I.13 proved accurate for a wide range of N and λ values to the expected extent, i.e., the accuracy of the Monte Carlo calculations. All combinations of λ = 0.1, 1, 2, 3, 3.5, 4, 5, 10, 20, 50, and 100 with N = 2, 3, 5, 9, and 29 were computed, and in all cases agreement was found in the mean, variance, skewness, and kurtosis. In addition, histograms of the numerical results were generated and plotted with the distribution function in Equation I.13, and some samples are included herein. Figure I-2 shows four combinations of λ and N. In each, the normalized histogram of the Monte Carlo is plotted as open circles, and the solid curve is the corresponding distribution in Equation I.13. The four examples are (A) λ = 0.1, N = 2; (B) λ = 3, N = 9; (C) λ = 50, N = 9; (D) λ = 100, N = 5.
Figure I-2. P(u,λ,N) (Equation I.13, solid curve) plotted with normalized Monte Carlo histograms (open circles). A: λ = 0.1, N = 2. B: λ = 3, N = 9. C: λ = 50, N = 9. D: λ = 100, N = 5.
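A Monte Carlo check of this kind is easy to reproduce. The sketch below (illustrative only, not the author's code; the seed and the choice λ = 3, N = 9 are arbitrary, and NumPy and SciPy are assumed available) draws one million samples of size N, averages each, and compares the histogram of the averages and their moments with Equation I.13.

import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)              # arbitrary seed
lam, N, trials = 3.0, 9, 1_000_000

samples = rng.poisson(lam, size=(trials, N))
u = samples.mean(axis=1)                    # the estimator of Equation I.2 for each trial

k_vals = np.rint(u * N).astype(int)         # N*u is always a nonnegative integer
observed = np.bincount(k_vals) / trials
predicted = poisson.pmf(np.arange(k_vals.max() + 1), N * lam)   # Equation I.13 evaluated at u = k/N

print(np.abs(observed - predicted).max())   # small: Monte Carlo noise only
print(u.mean(), lam)                        # mean lambda
print(u.var(), lam / N)                     # variance lambda/N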
A similar analysis reveals that if the N random draws ki are made from N Poisson populations with generally different means λi, i = 1 to N, then Equation I.13 becomes
P(u, \lambda_{i=1..N}, N) = \frac{e^{-N\bar\lambda} (N\bar\lambda)^{Nu}}{(Nu)!}, \quad \text{where} \quad \bar\lambda = \frac{1}{N} \sum_{i=1}^{N} \lambda_i
\langle u \rangle = \bar\lambda
\sigma_u^2 = \frac{\bar\lambda}{N}          (I.14)
\mathrm{skewness}(u) = \frac{1}{\sqrt{N\bar\lambda}}
\mathrm{kurtosis}(u) = 3 + \frac{1}{N\bar\lambda}
This shows that Equation set I.13 is actually a special case of I.14, namely λi = λ for all i. The cumulative distribution can be written
C(u, \lambda, N) = \sum_{k=0}^{Nu} \frac{e^{-N\lambda} (N\lambda)^k}{k!} = \frac{\Gamma(Nu+1, N\lambda)}{(Nu)!}          (I.15)

where Γ is the upper incomplete Gamma function.
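Equation I.15 can be verified directly against the finite sum, since the regularized upper incomplete Gamma function is available in SciPy as gammaincc. The following fragment (an illustrative check with arbitrarily chosen λ, N, and u) compares the two forms.

from math import exp, factorial
from scipy.special import gammaincc

lam, N, u = 3.5, 5, 4.2                     # arbitrary values chosen so that N*u is an integer
Nu = round(N * u)                           # = 21

finite_sum = sum(exp(-N * lam) * (N * lam) ** k / factorial(k) for k in range(Nu + 1))
via_gamma = gammaincc(Nu + 1, N * lam)      # regularized: Gamma(Nu+1, N*lam) / (Nu)!

print(finite_sum, via_gamma)                # equal to within roundoff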
Appendix J Bell's Theorem and Bell's Inequalities

John Stewart Bell was an Irish physicist with a special interest in quantum entanglement (see section 5.11). He is sometimes thought of as the champion of Quantum Mechanics in its contest with classical intuition, because his work essentially destroyed all hope for physical theories based on deterministic local realism, qualities desired by Albert Einstein. But in fact he was very much cut from the same cloth as Einstein. Both of them strove to take Quantum Mechanics beyond the Copenhagen Interpretation of Niels Bohr and Max Born (see section 5.9) but succeeded instead in establishing that the visible routes to such an extension were all dead ends. Einstein (1935), with Boris Podolsky and Nathan Rosen, exposed what seemed to be a fatal flaw in Quantum Mechanics, fatal because it implied that not only did Quantum Mechanics include nondeterministic processes, it also implied the presence of nonlocal phenomena, i.e., the ability of one part of Nature to affect another part outside of its light cone by propagating an influence at a speed faster than that of light. Such influences have now been established as real in the laboratory, although so far no mechanism by which these influences could induce causality violations has been found (here we distinguish between causality violation and nondeterminism; the latter involves phenomena taking place with no cause, whereas the former involves inconsistency between cause and effect such as reversed roles and inconsistent loops in spacetime).

Bell was interested in the "hidden-variable" theory devised by David Bohm (1952) based on ideas suggested by Louis de Broglie. This theory successfully removed nonepistemic randomness in physical processes but left nonlocality intact. Bell's original intention was to find a way to remove the nonlocality, but the pursuit of this goal led him instead to what is now known as Bell's Theorem (Bell, 1964), a proof that hidden-variable theories based on local realism cannot provide the same statistical predictions as Quantum Mechanics. As part of this same work, he devised the first of what are now called Bell's Inequalities. These express constraints on local realistic physical theories (with or without hidden variables) that are violated by Quantum Mechanics under appropriate conditions. Experimental demonstrations of such violations serve to rule out the theory based on local realism for which the specific form of Bell's Inequality applies (different experimental arrangements lead to different forms of the basic inequality).

In his 1964 paper, Bell used the entanglement model discussed in section 5.12, spin correlation between two previously interacting fermions labeled particle I and particle II. The spins must be entangled because of the interaction and opposite because of that and the Pauli Exclusion Principle (technically this corresponds to fermions in the singlet state). The spin directions can be measured for either or both particles via deflection by magnetic fields with any desired orientation. There are two observables: one labeled A representing a spin measurement on particle I in the direction of a unit vector â; one labeled B representing a spin measurement on particle II in the direction of a unit vector b̂. The nature of fermion spin requires that A and B take on only values of ±1 in units of ħ/2.
Bell considered the possibility that the spin states could be controlled by hidden variables denoted λ that are epistemically random with a distribution given by a properly normalized density function f(λ) (where we are now modifying some of Bell's notation to avoid conflicts with the conventions in this book). So we have
A(\hat{a}, \lambda) = \pm 1, \qquad B(\hat{b}, \lambda) = \pm 1          (J.1)
The expectation value of the product AB as a function of a , b , and λ for the local model is denoted E (a ,b ,λ) and given by
E(\hat{a}, \hat{b}, \lambda) = \int A(\hat{a}, \lambda) B(\hat{b}, \lambda) f(\lambda) \, d\lambda          (J.2)
where the integral is to be understood as a definite integral over the domain of the hidden variables λ, which we assume for generality are continuous random variables; thus although the spin states are discrete, the expectation values must be computed with an integral over a density function, not a summation over a discrete probability distribution (for discrete hidden variables, the conversion to a discrete probability distribution is straightforward). We also explicitly require that A not be a function of b , and B not be a function of a , i.e., there are no nonlocal effects. This is the key feature of the local model that ultimately dooms it to failure. As shown further below, the expectation value of the product AB as a function of a and b for the quantum model is denoted Eq (a ,b ) and given by
E_q(\hat{a}, \hat{b}) = -\hat{a} \cdot \hat{b} = -\cos\theta          (J.3)
where θ is the angle between â and b̂. Although AB can take on only values of ±1 in any single pair of measurements, the average over an arbitrarily large number of such measurements is a continuous number in the range from -1 to +1, depending on the value of θ. As we saw in section 2.2 regarding coin flips, if we assign a value of 1 to heads and 0 to tails, any given flip results in one of the integers 0 or 1, but the average over many flips for a fair coin is ½. Coin flips follow the binomial distribution, which like some others (e.g., the Poisson distribution) can have a mean that is not in the sample space of the random variable, so that in general the "expectation value" is really not to be "expected" on any given trial. Before the spin direction of a fermion is measured, it is in a superposition of all possible directions, so the mean and variance of A are computed quantum mechanically as though the angle between the spin direction and â were uniformly distributed in the range from 0° to 360°. This makes A = 1 and A = -1 equally probable, so that A is a zero-mean unit-variance random variable. The same is true of B, even after the spin of particle II has been collapsed onto either â or -â by a measurement on particle I, since those two directions are equally likely on average. So we have ⟨A⟩ = ⟨B⟩ = 0 and σA² = σB² = 1. An interesting property of pairs of zero-mean unit-variance random variables is that their average product is equal to their covariance cov(A,B), which is equal to their correlation coefficient, ρAB:

\mathrm{cov}(A, B) = \langle AB \rangle - \langle A \rangle \langle B \rangle = \langle AB \rangle
\rho_{AB} = \frac{\mathrm{cov}(A, B)}{\sigma_A \sigma_B} = \mathrm{cov}(A, B) = \langle AB \rangle          (J.4)
For this reason, discussions of Bell's Theorem and Bell's Inequality sometimes refer to the correlation between A and B and other times to the average product of A and B; they are the same
thing for this spin-correlated entangled-fermion model. Quantum Mechanics dictates that once a spin measurement of particle I has been made, we will have either: (1) A = +1 with the spin of particle II collapsed onto -a ; (2) A = -1 with the spin of particle II collapsed onto +a . For case (1), a subsequent spin measurement on particle II in the b direction will reproject its spin onto either -b or +b to yield either AB = -1 or AB = +1 with probability ½(1+cosθ) or ½(1-cosθ), respectively (see, e.g., Penrose, 1989, chapter 6, or Binney and Skinner, 2014, section 7.3.1, where these probabilities are computed in their trigonometric-identity form cos2(θ/2) and sin2(θ/2), respectively). For case (2), a subsequent spin measurement on particle II in the b direction will reproject its spin onto either +b or -b to yield either AB = -1 or AB = +1 with probability ½(1+cosθ) or ½(1-cosθ), respectively. In either case, we can expect either AB = -1 with probability ½(1+cosθ) or AB = +1 with probability ½(1-cosθ). The expectation value of the product AB is therefore the probability-weighted average
E_q(\hat{a}, \hat{b}) = (-1) \frac{1 + \cos\theta}{2} + (+1) \frac{1 - \cos\theta}{2} = -\frac{1}{2} - \frac{\cos\theta}{2} + \frac{1}{2} - \frac{\cos\theta}{2} = -\cos\theta          (J.5)
yielding Equation J.3. Bell’s Theorem is a demonstration that Equation J.2 cannot perfectly mimic Equation J.3, because the latter incorporates the nonlocal effect on B caused by a measurement on A, i.e., B depends on both a and b , whereas Equation J.2 specifically prohibits B from having any dependence on a . Bell began by observing that Equation J.1 requires the minimum possible value for E (a , b ,λ) to be -1. Since Equation J.2 must apply for all values of a and b , it must apply for the case of a = b . Ignoring values of λ for which f(λ) = 0, E (b ,b ,λ) = -1 implies that B(b ,λ) = -A(b ,λ) (where by A(b ,λ), we mean A(a =b ,λ), and by E (b ,b ,λ), we mean E (a =b ,b ,λ), etc.), and Equation J.2 can be written
E(\hat{a}, \hat{b}, \lambda) = -\int A(\hat{a}, \lambda) A(\hat{b}, \lambda) f(\lambda) \, d\lambda          (J.6)
Now we consider a third unit vector c that can operate in the same way as b ,
E(\hat{a}, \hat{c}, \lambda) = -\int A(\hat{a}, \lambda) A(\hat{c}, \lambda) f(\lambda) \, d\lambda          (J.7)
so that the difference between these two expectation values is
E(\hat{a}, \hat{b}, \lambda) - E(\hat{a}, \hat{c}, \lambda) = \int \left[ A(\hat{a}, \lambda) A(\hat{c}, \lambda) - A(\hat{a}, \lambda) A(\hat{b}, \lambda) \right] f(\lambda) \, d\lambda          (J.8)
We see from Equation J.1 that |A(a ,λ)| = 1 and |B(b ,λ)| = 1 for any choice of unit vector, and so we have A2(b ,λ) = 1, and we can multiply the first term inside the integral’s parentheses by this to obtain
E(\hat{a}, \hat{b}, \lambda) - E(\hat{a}, \hat{c}, \lambda) = \int \left[ A(\hat{a}, \lambda) A^2(\hat{b}, \lambda) A(\hat{c}, \lambda) - A(\hat{a}, \lambda) A(\hat{b}, \lambda) \right] f(\lambda) \, d\lambda          (J.9)

Taking absolute values,
\left| E(\hat{a}, \hat{b}, \lambda) - E(\hat{a}, \hat{c}, \lambda) \right| = \left| \int A(\hat{a}, \lambda) A(\hat{b}, \lambda) \left[ A(\hat{b}, \lambda) A(\hat{c}, \lambda) - 1 \right] f(\lambda) \, d\lambda \right|
\le \int \left| A(\hat{a}, \lambda) A(\hat{b}, \lambda) \left[ A(\hat{b}, \lambda) A(\hat{c}, \lambda) - 1 \right] f(\lambda) \right| d\lambda
= \int \left| A(\hat{a}, \lambda) A(\hat{b}, \lambda) \left[ A(\hat{b}, \lambda) A(\hat{c}, \lambda) - 1 \right] \right| f(\lambda) \, d\lambda          (J.10)
where we used the fact that for a real function g(x),
\left| \int g(x) \, dx \right| \le \int \left| g(x) \right| dx          (J.11)

Since f(λ) ≥ 0, and since for any two real functions g1(x) and g2(x), |g1(x)g2(x)| = |g1(x)| |g2(x)|,
\left| E(\hat{a}, \hat{b}, \lambda) - E(\hat{a}, \hat{c}, \lambda) \right| \le \int \left| A(\hat{a}, \lambda) A(\hat{b}, \lambda) \right| \left| A(\hat{b}, \lambda) A(\hat{c}, \lambda) - 1 \right| f(\lambda) \, d\lambda          (J.12)
Because |A(â,λ)A(b̂,λ)| ≤ 1 and multiplies a nonnegative quantity, simply removing it from the integral can only enlarge the magnitude of what remains or leave it the same, so we have

\left| E(\hat{a}, \hat{b}, \lambda) - E(\hat{a}, \hat{c}, \lambda) \right| \le \int \left| 1 - A(\hat{b}, \lambda) A(\hat{c}, \lambda) \right| f(\lambda) \, d\lambda          (J.13)

where taking the absolute value has also allowed us to flip the sign of the argument of the absolute value in the integral. Noting that A(ĉ,λ) = -B(ĉ,λ), this becomes
\left| E(\hat{a}, \hat{b}, \lambda) - E(\hat{a}, \hat{c}, \lambda) \right| \le \int \left| 1 + A(\hat{b}, \lambda) B(\hat{c}, \lambda) \right| f(\lambda) \, d\lambda          (J.14)
Since 1 + A(b̂,λ)B(ĉ,λ) is always ≥ 0, we can drop the absolute magnitude, and the right-hand side may then be identified as 1 + E(b̂,ĉ,λ), and we have
1 + E(\hat{b}, \hat{c}, \lambda) \ge \left| E(\hat{a}, \hat{b}, \lambda) - E(\hat{a}, \hat{c}, \lambda) \right|          (J.15)
This is Bell's Inequality for the spin-correlated entangled-fermion case. In his 1964 paper, Bell proceeded to consider whether E(â,b̂,λ) and Eq(â,b̂) are arbitrarily close to being equal in all cases, i.e., whether ε defined by ε = |E(â,b̂,λ) - Eq(â,b̂)| was always arbitrarily small. He used a fairly general argument to show that cases existed in which ε could not be made arbitrarily small, thus proving what is now known as Bell's Theorem. But to prove that E(â,b̂,λ) cannot mimic Eq(â,b̂) perfectly, all we really need is a single case in which ε is not arbitrarily small. Schlegel (1980) provides a good example of this. Since Bell's Inequality was derived directly from the definition of E(â,b̂,λ), we know that the latter will never violate the former, so all we need is an example in which Eq(â,b̂) does. We are free to use any vectors we desire, and so following Schlegel, we choose â = (b̂ - ĉ)/|b̂ - ĉ|
with b̂ ⊥ ĉ.
Using Eq(â,b̂) defined in Equation J.3 instead of E(â,b̂,λ) in Equation J.15, we have

1 - \hat{b} \cdot \hat{c} \ge \left| \hat{a} \cdot \hat{b} - \hat{a} \cdot \hat{c} \right|          (J.16)
substituting for a ,
1 - \hat{b} \cdot \hat{c} \ge \left| \frac{(\hat{b} - \hat{c}) \cdot \hat{b}}{|\hat{b} - \hat{c}|} - \frac{(\hat{b} - \hat{c}) \cdot \hat{c}}{|\hat{b} - \hat{c}|} \right| = \frac{(\hat{b} - \hat{c}) \cdot (\hat{b} - \hat{c})}{|\hat{b} - \hat{c}|} = |\hat{b} - \hat{c}|          (J.17)
Since we can choose any desired b̂ and ĉ, we can choose them to be perpendicular, in which case |b̂ - ĉ| = √2 and b̂·ĉ = 0, leaving us with the clearly false statement that 1 ≥ √2. Thus Quantum Mechanics can violate Bell's Inequality, and Equation J.2 cannot substitute for Equation J.3.
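The violation is easy to exhibit numerically. The fragment below (an illustration, not from the book) evaluates both sides of Equation J.15 with the quantum expectation value of Equation J.3 and Schlegel's choice of vectors, reproducing the contradiction 1 ≥ √2.

import numpy as np

b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 1.0, 0.0])               # perpendicular to b, so b . c = 0
a = (b - c) / np.linalg.norm(b - c)         # Schlegel's choice of a

def E_q(x, y):
    # Quantum expectation value of the product AB, Equation J.3.
    return -np.dot(x, y)

lhs = 1.0 + E_q(b, c)                       # left-hand side of Equation J.15
rhs = abs(E_q(a, b) - E_q(a, c))            # right-hand side of Equation J.15
print(lhs, rhs, lhs >= rhs)                 # 1.0  1.414...  False: the inequality is violated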
Appendix K The Linear Harmonic Oscillator

One of the most important concepts in all of physics is the linear harmonic oscillator, an object with mass m moving under the influence of a force F that is a linear function of displacement |x - x0| with the force directed toward the position x0 where the object is stable, i.e., if placed at rest in that location, there will be no force and hence no motion. We will keep this discussion as simple as possible by considering only the one-dimensional case (generalization to higher dimensions is straightforward), and so we may drop the vector notation and deal only with magnitudes. The mathematical form of the magnitude of the force we have described is

F = -\kappa (x - x_0)          (K.1)
where κ is the constant of proportionality between force and displacement from x0, the equilibrium point, which we are free to assign as the origin of the x coordinate. Thus the equation F = -κx is sufficiently general for our purposes. This force can obviously be derived from the potential function U = -½κx² in the usual way:

F = \frac{\partial U}{\partial x} = \frac{\partial}{\partial x} \left( -\frac{\kappa x^2}{2} \right) = -\kappa x          (K.2)
Since the potential is not an explicit function of time, the force is conservative, i.e., it conserves the sum of kinetic and potential energy, where the potential energy is the negative of the force potential, because we have set the latter's arbitrary constant term to zero. The kinetic energy is the usual p²/2m, where p is the momentum. Thus the total mechanical energy E is

E = \frac{p^2}{2m} + \frac{\kappa x^2}{2}          (K.3)
Applying Newton’s law F = ma, where a is the acceleration, the second time derivative of x,
F = m \frac{d^2 x}{dt^2} = -\kappa x          (K.4)
and we can solve this differential equation for x(t). For the typical initial conditions in which the mass is pulled a distance A away from the equilibrium point and held motionless before being suddenly released at time t = 0, the solution is
x(t) = A \cos \omega t, \qquad \omega = \sqrt{\frac{\kappa}{m}}          (K.5)
So the motion is perfectly sinusoidal, hence the name harmonic oscillator. The frequency ω is independent of the amplitude A; this is a feature of the linear harmonic oscillator. The second line of the last equation says that κ = mω², so Equation K.3 can be written

E = \frac{p^2}{2m} + \frac{m \omega^2 x^2}{2}          (K.6)
Note that when x = A, the oscillating object is at a turn-around point and is instantaneously at rest, so that p = 0 and E = ½mω²A². Since the total energy is a constant, the latter formula is true at all times, and we have an expression for the energy that shows that it varies as the square of the frequency ω. This all applies to the classical linear harmonic oscillator. We stated in section 5.4 that the energy eigenvalues of the quantum-mechanical linear harmonic oscillator vary as the first power of frequency, not the second. One goal of this appendix is to resolve this apparent discrepancy.

To investigate the quantum-mechanical linear harmonic oscillator, we consider the Hamiltonian operator, which is just Equation K.6 expressed in terms of operators (see Equations 5.30 - 5.32 and the text they enclose, p. 253):

\hat{H} = \frac{\hat{p}^2}{2m} + \frac{m \omega^2 \hat{x}^2}{2}          (K.7)

This Hamiltonian can be used in Schrödinger's Equation in the form of the third line in Equation 5.37 (p. 254), H^ψ = Eψ, to solve for the energy eigenvalues and eigenfunctions. This equation has been solved in a variety of ways, none of them particularly simple. Since we are really interested only in the energy eigenvalues herein, we will use a method invented by Paul Dirac that leaves out the eigenfunctions but produces as a by-product the extremely important concepts known as creation and annihilation operators, critical ingredients of quantum field theories. This method is usually presented in terms of Dirac's bra-ket notation, a very useful way of manipulating complex vectors that we have avoided as an unnecessary elaboration until now because the mathematical depth into which we can explore Quantum Mechanics is severely restricted by our scope. But besides being specifically designed for complex vectors, the Dirac notation provides a convenient way to keep track of the fact that some operators do not commute, and therefore that the order of multiplication is highly significant. This will be important in the case of the scalar product of two vectors, for which we will make a special definition below. Therefore we will borrow just enough of the Dirac notation to get us through this appendix. The reader should be aware that there is much more to be said about Dirac's notation and complex vector spaces in general. We begin by defining the following dimensionless non-Hermitian operator
\hat{\eta} = \frac{\hat{p} + i m \omega \hat{x}}{\sqrt{2 m \hbar \omega}}          (K.8)

For a complex operator, the definition of its adjoint is similar to that of the complex conjugate of a complex number and is typically denoted with a dagger:

\hat{\eta}^\dagger = \frac{\hat{p} - i m \omega \hat{x}}{\sqrt{2 m \hbar \omega}}          (K.9)
Clearly η^ does not equal η^†, i.e., η^ is not self-adjoint, which is why we used the synonym "non-Hermitian" above. Note that because η^ and η^† are linear combinations of p^ and x^, which are both linear differential operators, η^ and η^† are also linear operators, i.e., for any two functions f and g,

\hat{\eta} (f + g) = \hat{\eta} f + \hat{\eta} g          (K.10)

Now we define the dimensionless operator N^ ≡ η^η^†:

\hat{N} = \hat{\eta} \hat{\eta}^\dagger = \frac{(\hat{p} + i m \omega \hat{x})(\hat{p} - i m \omega \hat{x})}{2 m \hbar \omega} = \frac{\hat{p}^2 + m^2 \omega^2 \hat{x}^2 - i m \omega (\hat{p}\hat{x} - \hat{x}\hat{p})}{2 m \hbar \omega}          (K.11)
The last term in the numerator would be zero if x^ and p^ commuted, but as we saw in Equation 5.45 (p. 264), they do not; instead we have
\hat{x}\hat{p} - \hat{p}\hat{x} = i \hbar          (K.12)
where the left-hand side is called the commutator for these operators and is denoted [x^, p^]. Using this in Equation K.11 gives us
\hat{N} = \frac{\hat{p}^2 + m^2 \omega^2 \hat{x}^2 - m \omega \hbar}{2 m \hbar \omega}
= \frac{1}{\hbar \omega} \left( \frac{\hat{p}^2}{2m} + \frac{m \omega^2 \hat{x}^2}{2} \right) - \frac{1}{2}
= \frac{\hat{H}}{\hbar \omega} - \frac{1}{2}          (K.13)
and we have

\hat{H} = \hbar \omega \left( \hat{N} + \frac{1}{2} \right)          (K.14)
The fact that H^ and N^ are related by a linear transformation allows us to assume that they have a common set of eigenvectors, so if we can solve for the eigenvalues of N^, this linear transformation will give us the eigenvalues of the Hamiltonian, which are the energy eigenvalues we seek. We will use n to denote the eigenvalues of N^, and then the plan is to show that the values taken on by n are the nonnegative integers. To do this, we take the following steps:
1. Prove that n ≥ 0, i.e., n is nonnegative.
2. Prove that if n = m is an eigenvalue, then n = m+1 is also an eigenvalue.
3. Prove that the minimum eigenvalue is exactly 0.
The only set of numbers that satisfy these conditions is the set of nonnegative integers, since step 3 will show that the minimum eigenvalue is 0, and step 2 will show that given this eigenvalue, 0+1 is also an eigenvalue, and 0+1+1 is another eigenvalue, and so on. We denote the kth eigenvalue nk, and its associated eigenvector u_k. Then the eigenvalue equation that we have to solve is

\hat{N} u_k = n_k u_k          (K.15)
Eigenvectors arise in eigenvalue problems with arbitrary but nonzero real magnitudes. We may assign them unit magnitude, and then they may be used to constitute the basis of a complex vector space, i.e., they may be used to represent an arbitrary vector in that space with complex components cj = aj + ibj on the u_j axis, where aj and bj are real numbers, and i is the square root of -1, and these components may be operated on by complex and possibly non-commuting operators. Since the basis vectors are real and orthonormal, their scalar product is the usual dot product with u_j · u_j = 1, and u_j · u_k = 0 for j ≠ k. Vectors in this complex vector space are called kets in Dirac notation and are indicated by a vertical line on the left and an angle bracket on the right, so we change our notation from u_j to |uj⟩. The ket space has a dual space in which vectors are called bras and for which the basis vectors are denoted ⟨uj|. In this notation, Equation K.15 becomes

\hat{N} |u_k\rangle = n_k |u_k\rangle          (K.16)
A general complex vector may be represented as a ket |ψ⟩:

|\psi\rangle = \sum_j c_j |u_j\rangle          (K.17)
Note that this is a vector summation, so that the result is also a vector. This vector is represented in the bra space by its adjoint:
\langle \psi | = \sum_j c_j^* \langle u_j |          (K.18)
where c*j is the complex conjugate of cj, i.e., c*j = aj - ibj. The definitions are set up this way so that the scalar product of ψ with itself, ⟨ψ|ψ⟩, will be the square of the norm of ψ:
\langle \psi | \psi \rangle = \sum_j c_j^* c_j \langle u_j | u_j \rangle = \sum_j c_j^* c_j = \| \psi \|^2          (K.19)
where we have used the fact that ⟨uj|uj⟩ = 1, and we have omitted all the cross-terms because ⟨uj|uk⟩ = 0 for j ≠ k. The norm of a vector is its absolute length and is always nonnegative. Since a vector in the bra space has components that are the complex conjugates of the corresponding components in the ket space, moving a complex operator from one space to the other also involves representing it there by its adjoint. For any two arbitrary vectors ⟨ψa| and |ψb⟩ and linear operator η^ (see, e.g., Merzbacher, 1967, Eq. 14.38),
\langle \psi_b | \hat{\eta}^\dagger \psi_a \rangle = \langle \hat{\eta} \psi_b | \psi_a \rangle          (K.20)
The same rules apply to real numbers, but since they are trivially self-adjoint, the dagger or complex-conjugate notation is not needed, and real numbers may be factored according to the rules of ordinary algebra. Since we know that the energy eigenvalues will be real, and since they are related to n by a real linear transformation, n will also take on only real values. Now we resume with Equation K.16, substituting η^η^† for N^, taking the scalar product with ⟨uk| on both sides, and applying the rules just discussed:

\langle u_k | \hat{\eta} \hat{\eta}^\dagger u_k \rangle = \langle u_k | n_k u_k \rangle = n_k \langle u_k | u_k \rangle = n_k
\langle \hat{\eta}^\dagger u_k | \hat{\eta}^\dagger u_k \rangle = \| \hat{\eta}^\dagger u_k \|^2 = n_k          (K.21)
Since all vector norms are nonnegative, we have nk ≥ 0, and step 1 is accomplished. For step 2, we need the commutator [η^†η^, η^η^†]. Switching the order of the factors in the numerator of the first line in Equation K.11 will change the sign of the noncommuting factor in the numerator of the second line, and proceeding in a manner similar to Equations K.12 and K.13 produces

\hat{\eta}^\dagger \hat{\eta} = \frac{\hat{H}}{\hbar \omega} + \frac{1}{2}          (K.22)
Subtracting the third line of Equation K.13 yields

\hat{\eta}^\dagger \hat{\eta} - \hat{\eta} \hat{\eta}^\dagger = 1, \qquad \text{i.e.,} \qquad \hat{\eta} \hat{\eta}^\dagger = \hat{\eta}^\dagger \hat{\eta} - 1          (K.23)
Now we add |uk⟩ to both sides of Equation K.16, again substituting η^η^† for N^,

\hat{\eta} \hat{\eta}^\dagger |u_k\rangle + |u_k\rangle = n_k |u_k\rangle + |u_k\rangle
\left( \hat{\eta} \hat{\eta}^\dagger + 1 \right) |u_k\rangle = (n_k + 1) |u_k\rangle          (K.24)
\hat{\eta}^\dagger \hat{\eta} |u_k\rangle = (n_k + 1) |u_k\rangle
where we used Equation K.23 on the left-hand side in going from the second to the third line above. Now we multiply both sides from the left by η^,

\hat{\eta} \hat{\eta}^\dagger \hat{\eta} |u_k\rangle = \hat{\eta} (n_k + 1) |u_k\rangle
\hat{\eta} \hat{\eta}^\dagger \left( \hat{\eta} |u_k\rangle \right) = (n_k + 1) \left( \hat{\eta} |u_k\rangle \right)          (K.25)
\hat{N} \left( \hat{\eta} |u_k\rangle \right) = (n_k + 1) \left( \hat{\eta} |u_k\rangle \right)
The third line has the same form as Equation K.16 except for the ket being |η^uk⟩ instead of |uk⟩ and the eigenvalue being nk+1 instead of nk. But the eigenvalue equation is satisfied for this new ket
and new eigenvalue, and so we see that if we apply the operator η^ to the eigenvector |uk , as long as |η^uk is not null, we obtain a new eigenvector associated with a new eigenvalue that is one greater than that associated with |uk . We can apply η^ to the new eigenvector and produce yet another new eigenvalue that is once again incremented by 1. This can be repeated indefinitely, and so step 2 is accomplished. Because of this property, η^ is called a raising operator or creation operator. To see that |η^uk cannot be null, we compute its squared norm as follows.
\| \hat{\eta} u_k \|^2 = \langle \hat{\eta} u_k | \hat{\eta} u_k \rangle
= \langle u_k | \hat{\eta}^\dagger \hat{\eta} u_k \rangle
= \left\langle u_k \,\middle|\, \left( \frac{\hat{H}}{\hbar \omega} + \frac{1}{2} \right) u_k \right\rangle          (K.26)
= \frac{\langle \hat{H} \rangle}{\hbar \omega} + \frac{1}{2}
where we used Equation K.22 to get from the second line above to the third. We know that the eigenvalues of the Hamiltonian are nonnegative, so the smallest value possible for the squared norm of |η^uk⟩ is ½. A calculation similar to Equations K.24 and K.25 can be used to show that

\hat{N} \left( \hat{\eta}^\dagger |u_k\rangle \right) = (n_k - 1) \left( \hat{\eta}^\dagger |u_k\rangle \right)          (K.27)
Specifically, starting with Equation K.16, again substituting η^η^† for N^, we multiply both sides from the left by η^†,

\hat{\eta}^\dagger \hat{\eta} \hat{\eta}^\dagger |u_k\rangle = \hat{\eta}^\dagger n_k |u_k\rangle
\left( \hat{\eta}^\dagger \hat{\eta} \right) \hat{\eta}^\dagger |u_k\rangle = n_k \hat{\eta}^\dagger |u_k\rangle          (K.28)
Now using Equation K.23,
\left( \hat{\eta} \hat{\eta}^\dagger + 1 \right) \hat{\eta}^\dagger |u_k\rangle = n_k \hat{\eta}^\dagger |u_k\rangle
\hat{\eta} \hat{\eta}^\dagger \left( \hat{\eta}^\dagger |u_k\rangle \right) + \hat{\eta}^\dagger |u_k\rangle = n_k \hat{\eta}^\dagger |u_k\rangle
\hat{\eta} \hat{\eta}^\dagger \left( \hat{\eta}^\dagger |u_k\rangle \right) = (n_k - 1) \left( \hat{\eta}^\dagger |u_k\rangle \right)          (K.29)
\hat{N} \left( \hat{\eta}^\dagger |u_k\rangle \right) = (n_k - 1) \left( \hat{\eta}^\dagger |u_k\rangle \right)
As long as |η^†uk⟩ is not null, it is an eigenvector whose eigenvalue is nk-1. η^† is called a lowering operator or annihilation operator, and each operator is called a ladder operator, because it raises or lowers eigenvalues up or down the equally spaced set of all possible eigenvalues. The reason why these are important in quantum field theories is that such theories treat particles as excitations of the
vacuum, and these operators create or annihilate excitations, thus creating or annihilating particles. But whereas there is no limit to how high η^ can go, there is a limit to how low η^† can go, because of the condition nk ≥ 0. We denote the smallest possible (hence nonnegative) eigenvalue n0 and its associated eigenvector |u0⟩. Let nk be some large legal eigenvalue, not assumed to be an integer; then beginning with this, we can apply η^† repeatedly until we arrive at n0. We must have 0 ≤ n0 < 1, because n0 must be nonnegative, and it must not be possible to subtract 1 from it without the result being negative (otherwise it would not be the lowest nonnegative eigenvalue). So when η^† operates on n0, it must fail, which implies that the ket |η^†u0⟩ cannot be an eigenvector and therefore must be null, i.e., its norm must be zero, or
\| \hat{\eta}^\dagger u_0 \|^2 = 0
\langle \hat{\eta}^\dagger u_0 | \hat{\eta}^\dagger u_0 \rangle = 0
\langle u_0 | \hat{\eta} \hat{\eta}^\dagger u_0 \rangle = 0
\left\langle u_0 \,\middle|\, \left( \frac{\hat{H}}{\hbar \omega} - \frac{1}{2} \right) u_0 \right\rangle = 0          (K.30)
\frac{\hat{H}}{\hbar \omega} - \frac{1}{2} = 0
\hat{H} = \frac{\hbar \omega}{2}
where we used the third line of Equation K.13 to get from the third line above to the fourth. Since the Hamiltonian operator is equal to a constant for the eigenvector |u0⟩, the associated Hamiltonian eigenvalue is that constant, and so we see from Equation K.14 that the corresponding N^ eigenvalue must be zero, so we have n0 = 0, completing step 3. Once the lowering operator arrives at an N^ eigenvalue of zero, applying it to the corresponding eigenvector yields a null result, and further applications would be attempts to operate on null vectors. The downward stepping hits a hard limit at n = 0. So the energy eigenvalues of the linear harmonic oscillator have the form ħω(n+½), or the equivalent hν(n+½) used in Chapter 5, where n is a nonnegative integer. Here we see the origin of the "zero-point energy" hν/2 in the noncommuting nature of x^ and p^. We mentioned in section 5.4 that Planck's recognition of the linear frequency dependence of the energy of his black-body resonators was remarkable because of the known fact that the classical linear harmonic oscillator's energy varies as the square of the frequency. Indeed, the classical energy, with its quadratic dependence on frequency, is what goes into the quantum-mechanical Hamiltonian operator. We see in the middle term of the first two lines of Equation K.13 that one factor of frequency is absorbed into the N^ operator, whose eigenvalues are the integers that multiply Planck's hν, so the missing factor of frequency is hiding inside those dimensionless integers (the same is true of m, which is in the classical energy explicitly but not the quantum-mechanical energy eigenvalues). So the failure of ω² to appear in the quantum-mechanical energy eigenvalues is a bit like using ω² = κ/m from Equation K.5 to write the classical energy as E = ½κA². The square of the frequency is really there in both cases.
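The spectrum derived above can also be checked by brute force with matrix mechanics. The sketch below (illustrative only; it uses the standard number-basis matrices for the ladder operators, which agree with the η^ operators of this appendix up to a phase convention, with units m = ω = ħ = 1) builds x^ and p^ from truncated ladder-operator matrices, forms the Hamiltonian of Equation K.7, and confirms that the low-lying eigenvalues are n + ½.

import numpy as np

M = 60                                      # size of the truncated number basis
a = np.diag(np.sqrt(np.arange(1.0, M)), k=1)   # annihilation (lowering) operator
adag = a.T                                  # creation (raising) operator

x = (a + adag) / np.sqrt(2.0)               # dimensionless position
p = 1j * (adag - a) / np.sqrt(2.0)          # dimensionless momentum

comm = x @ p - p @ x
print(np.allclose(comm[:-1, :-1], 1j * np.eye(M - 1)))   # [x, p] = i away from the truncation edge

H = 0.5 * (p @ p + x @ x)                   # Equation K.7 with m = omega = hbar = 1
evals = np.sort(np.linalg.eigvalsh(H))
print(evals[:6])                            # 0.5, 1.5, 2.5, 3.5, 4.5, 5.5: n + 1/2 as derived above
# One eigenvalue (coming from the last basis state) is a truncation artifact and should be ignored.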
Of course, Planck did not have the formula for the quantum-mechanical energy eigenvalues at the time he was struggling to find his radiation law. His formula E = hν was based on Wien’s demonstration that the ratio of frequency to temperature (hence energy) was an adiabatic invariant for the black-body radiation field. Wien arrived at this conclusion by investigating the Thermodynamics of standing waves in a cavity, and the quantized frequencies of standing waves were what led de Broglie to propose a wave nature to material objects once the notion of quantized energy became widely accepted. So the first clue pointing to the importance of energy quantization was actually unearthed by Wien.
Index acceleration, 249, 309-311, 314, 317-320, 332, 340, 398, 405-411, 488 action, 241, 246, 249, 254, 255, 279, 348, 364, 367, 473 action at a distance, 223, 226, 230, 231, 277, 312 adiabatic, 98-100, 238, 493 ADM formalism (Arnowitt, Deser, and Misner), 365-367, 373 AdS/CFT correspondence, 372 algebraic irrational number, 35 alpha particle, 230, 245, 409 alternative hypothesis, 135, 136, 139, 143, 288 Anderson, Philip, 228 angular momentum, 102, 122, 223, 224, 227, 231-234, 247, 253, 264, 308, 310, 362, 364 annihilation operator, 493 Anti-de Sitter space, 372 Antikythera Mechanism, 222 Arnowitt, R., 365, 367, 373 Ashtekar, Abhay, 373, 375, 385 Avogadro’s constant, 98, 105, 106, 117, 123-125, 297, 299, 377 axiomatic interpretation, 10, 135 ----------------------------------------------------Bachelier, Louis, 123 background parameter, 357 Balmer, Johann, 245 baryon, 368 Bayes, Thomas, 11, 12, 30, 60-63, 95 Bekenstein, Jacob, 121, 385 Bell, John Stewart, 281-288, 290, 293-299, 302, 305, 306, 483-487 bell curve, 17, 23, 25, 46, 47, 51, 56, 66, 421, 428, 471,l 472 Becquerel, Henri, 230 Bernouilli, Daniel, 20, 104 Bernouilli, Nicholas, 20 Bessel function, 57, 58, 81, 190, 191, 426 beta particle, 230, 245
Bianchi identities, 348 biased coin, 39-43 biased random walk, 32 binomial distribution, 25, 38, 39, 52, 59, 118, 119, 136, 155, 215, 446, 484 black-body spectrum, 236, 239-245, 257, 409, 475, 494, 495 Bohm, David, 281-283, 295, 297, 305-307, 483 Bohr, Niels, 246, 247, 250, 255, 256, 260, 261, 265, 266, 272, 279, 280, 390, 391, 483 Boltzmann, Ludwig, 28, 98, 111-114, 116122, 235-236, 238, 240, 241 Boltzmann entropy, 113, 114, 116-118, 238, 240 Born, Max, 255, 260, 261, 264-268, 280, 295, 304, 305, 415, 483 Born interpretation, 264-266, 295, 304 Bose, Satyendra, 273, 274, 302 Bose-Einstein condensate, 274, 302 boson, 274, 363, 369 Brahe, Tycho, 222, 223, 230 bra vector, 337, 489, 491 bra-ket notation, 489 Brown, Robert, 123 Brownian motion, 104, 122-124, 306 ----------------------------------------------------Calabi-Yau manifold, 369-371, 399 calculus of variations, 367 caloric, 96 canonical coordinates, 249, 364-367 canonical ensemble, 108, 109, 114, 240 canonical equations, 249, 364, 365 canonical quantum gravity, 363-368, 375, 379, 385 Carnot, Sadi, 96 Carnot cycle, 98-101 Casimir effect, 309 cathode rays, 230, 245 Cauchy distribution, 55, 56, 58, 84, 121, 421
Causal Dynamical Triangulation, 377-380. 384, 385, 390, 391, 396, 401, 407, 412, 413 Causal Fermion System, 362 causality, xi, 71, 73, 103, 234, 276, 277, 279, 280, 291, 292, 307, 328, 356, 368, 379, 412, 416, 483 celestial coordinates, 74, 75, 129, 131, 137, 138, 158, 160 central force motion, 223, 227, 310 central limit theorem, 51, 55, 56, 58, 65, 84, 86, 109, 111, 175, 207, 421, 446, 455, 468, 470 CGU distribution, 144-146, 150, 152, 178, 186, 187 characteristic function, 421 chi-square, 53, 54, 87, 88, 91, 92, 110, 121, 131-133, 139-143, 145, 147, 161, 163, 176, 193, 195, 196, 200-204, 207, 209-212, 434-445, 454, 462468 Cholesky decomposition, 452, 462, 468-472 Christoffel, Elwin, 320, 343 Christoffel symbol, xvi, 343-346, 353 classical interpretation (physics), xii, 234, 264, 381 classical interpretation (probability), 10 Clausius, Rudolph, 100 Clausius entropy, 28, 100, 116-118, 237, 238, 243 Clifford algebras, 375 closed universe, 351, 352 coarse graining, 105, 126 collapse of the wave function, 272, 275, 277-286, 277, 291-306, 390, 394396, 402, 413, 418, 484, 485 commutator, 261, 262, 264, 365, 490, 492 commuting variables, 231, 233, 234, 261264, 267, 272, 273, 365, 489-492, 494 comoving space, 351 compact manifold, 360, 368-370, 398, 400402 complementarity, 266, 279, 296
completeness, 137, 140, 141, 149, 219 complex manifold, 369, 402-404, 407, 410 conditional probability, 11, 30, 60-63, 135 conformal field theory, 372 consciousness, xiii, xvii, 225, 227, 288, 295, 307, 355, 371, 383, 393, 395, 396, 404, 406, 416, 418 conservation, 102, 103, 109, 224, 227, 274, 298, 300, 301, 310, 318, 348, 352, 353, 366, 377, 378, 386, 393, 416 Consistent Histories Interpretation, 296-299 contact transformation, 364 contextual parameters, 298 continuum (energy), 245, 246, 255, 356 continuum (numerical), 35, 36 continuum (spatial), 33, 96, 226, 300, 344, 356, 357, 376-379, 413 continuously differentiable, 315, 321, 334, 356, 366, 371, 377, 400, 402, 410 continuous spontaneous localization, 151. 296, 298, 299, 301, 394 contraction (length), 316, 318, 320, 325, 330, 332, 339, 357 contraction (tensor), 337-339, 342, 348, 349 contravariant tensor, 322, 332-334, 337-339, 341-345, 378 convolution, 45-52, 55, 63, 87, 145, 146, 149, 150, 175, 176, 207, 428, 439, 470, 478, 480 convolved-Gaussian-uniform distribution, 144-146, 150, 152, 178, 186, 187 Copenhagen interpretation, 266, 267, 272, 275, 278, 296, 483 Copernicus, 222 correlation, 62, 66, 67-71, 73-78, 81, 82, 92, 93, 134. 137, 143, 145, 147-152, 156, 158-160, 164-176, 199, 200, 206, 207, 218, 219, 273, 275, 277, 282-289, 292, 317, 352, 353, 355, 386, 390, 417, 435, 437, 444, 451472, 483,-84 correlation coefficient, 67, 69-71, 73, 77, 78, 92, 93, 200, 206, 213, 214, 284, 285, 443, 444, 459, 461, 462, 468, 484 509
correlation matrix, 454-472 Correspondence Principle, 266, 296 cosmological constant, 309, 352, 353, 372, 379, 408, 409, 413 cosmology, 103, 224, 330, 352, 353, 386, 409, 413 covariance, general, 300, 315, 339, 366, 368, 412 covariance matrix, 53, 67, 69-73, 77, 138142, 159-167, 170-174, 178, 195, 198-200, 205-207, 233, 436-445, 451-464, 468-470, 484 covariant derivative, xvi, 341-344, 346 covariant tensor, 322, 332-334, 336-339, 342, 343, 345, 348 covector, 334 creation operator, 493 cross-correlation function, 137, 145-152 cumulative distribution, 38, 48, 49, 82, 93, 132, 147, 179, 299, 421, 424, 429431, 434, 445-450, 482 Curie, Pierre and Marie, 230 ----------------------------------------------------dark energy, 353, 386 Davisson, Clinton, 247 de Broglie, Louis, 244, 247, 248, 250, 252, 257, 265, 281, 292, 293, 305, 483, 495 de Broglie-Bohm theory, 295, 297, 305-307, 483 decision theory, 11, 16, 30, 60, 134-136 declination, 74, 131, 132, 138, 139 decoherence, 272, 296-298, 302, 394 de Fermat, Pierre, 3, 5, 35, 153, 248 deferent, 221, 222 deficit angle, 377 Democritus, 104 density function, 3, 24, 35-38, 43-59, 63-66, 71, 72, 75-84, 87, 93, 109, 110, 120, 126, 129, 133-137, 139, 141-152, 159, 175, 176, 179-182, 185-188, 193, 195, 207, 215, 220, 223, 240, 267, 270, 271, 419-431, 434, 439, 446-449, 462, 478, 483, 484
Deser, S., 365, 367, 373 de Sitter, Willem, 228, 311, 353, 372 de Sitter solution, 372, 379, 409, 413 determinism, ix-xiv, 2, 10, 26, 27, 33, 67, 73, 95, 102, 103, 106, 122-126, 150, 157, 228, 233, 234, 247, 265, 266, 272, 274, 280-283, 287, 288, 296, 297, 302, 304-307, 359, 360, 386, 388, 393, 415, 416, 418, 483 diffeomorphism, 366, 373, 377, 378 differentiability, 400, 407 differential entropy, 121 diffraction, 143, 231, 245, 247, 248, 255, 263, 268, 270, 280, 302, 311, 396, 397 Dirac, Paul, 94, 255, 265, 267, 292, 295, 313, 337, 364-366, 374, 455, 476, 489, 491 Doppler shift, 244, 352, 353 duality, 247, 256, 364, 372 ----------------------------------------------------Eddington, Arthur, 352 eigenfunction, 255, 273, 292, 298, 301, 489 eigenstate, 231, 233, 234, 247, 262-268, 274-278, 282, 294-298, 301-304, 391 eigenvalue, 69, 70, 159, 198, 231, 233, 239, 253-256, 267, 282, 308, 313, 362, 363, 366, 368, 376, 385, 454, 460, 461, 489-495 eigenvector, 69, 233, 267, 490-494 Einstein, Albert, 33, 96, 104, 116, 120, 122126, 225-228, 230, 231, 236, 238, 243-247, 265, 272-282, 287, 292, 298, 305, 306, 310-317, 320, 325, 330, 339, 343, 244, 348-360, 390393, 397, 401-405, 415, 473, 475, 483 Einstein-Cartan-Kibble-Sciama theory, 360, 361 Einstein-Hilbert action, xvi 367, 373, 379, 410, 412 Einstein notation, 322, 333, 337, 338, 344
Einstein-Podolsky-Rosen, 272-275, 279, 280, 292 Einstein-Rosen bridge, 358 Einstein’s field equations, 300, 310, 314, 349, 352, 367, 369, 372, 376, 398, 412 Einstein tensor, 348, 349, 358 electromagnetism, 227, 313, 359, 360, 376 electron, 33, 85, 94, 110, 122, 123, 188, 195, 213, 228-231, 236, 244-250, 254256, 260-274, 293, 298, 302, 309, 313, 363, 370, 375, 381, 388, 397, 409 electron diffraction, xii, 231, 268-272, 302, 396, 397 electroweak, 313, 314, 354, 359, 368, 372, 384, 391 Ellis, George, 393-396, 412, 413 emergence, 228, 229, 375, 378, 395, 396, 401, 402 ensemble, xii, 8-10, 16, 75, 84, 85, 105-111, 114, 125, 127, 128, 135, 136, 152, 240, 260, 261, 265, 268, 273, 283, 296 entanglement, 272-279, 282, 283, 288-296, 299, 303, 328, 354, 355, 386, 390, 394, 404, 413, 417, 483 entropy, 27-29, 66, 96, 99-103, 106, 109112-122, 124, 126, 127, 237-240, 243, 244, 268, 298, 357-359, 377, 384, 385, 387, 410 environmental decoherence, 296-298, 302, 394 epicycle, 221, 222, 226 epistemic, xii, xiii, 10, 33, 34, 108, 116, 119, 121-129, 158, 219, 234, 260, 264268, 273, 277-283, 296, 297, 305309, 416, 483 EPR paradox (see Einstein-Podolsky-Rosen) equilibrium, 98-102, 106, 109-111, 123, 235-240, 243-245, 359, 377, 378, 409, 488 equipartition of energy, 236 equivalence principle, 314, 315, 332
error function, 93, 146, 186, 449, 450 estimator, xiv, xv, 59, 71, 72, 83-92, 160, 177-179, 189, 220, 477, 478 Euclidean, 320-324, 329-332, 336, 339-341, 351, 353, 366, 369, 373, 380, 397404, 408-414 Euclideanization, 408, 410 Euler, Leonhard, 249, 403-405 Euler rotation, 70, 262 Euler’s formula, 252, 404 Euler’s identity, 404, 405 Everett, Hugh, 302-304 expectation value, 17-22, 25, 27, 83, 87, 88, 129, 131, 159, 162, 177, 214, 233, 285, 291, 356, 380, 419, 421, 435441, 444, 451, 452, 458, 462, 463, 484, 485 exponential distribution, 109, 299 ----------------------------------------------------Faraday, Michael, 230, 235, 311-313, 375 Fermat’s Principle, 248 Fermi-Dirac statistics, 274 fermion, 274, 282-286, 290, 291, 306, 360, 363, 369, 483-486 Feynman, Richard, 292, 313, 379, 404 Feynman diagram, 370, 379, 407, 410 Fick, Adolph, 123, 124 Fisher, R.A, 91, 95, 96 Fisher z-transform, 92 flat universe, 103, 351, 372, 402 fluctuation, 21-23, 25-27, 30, 31, 45, 49, 54, 58, 60, 67-72, 83-86, 91-93, 107, 108, 131, 135, 136, 139, 150, 154, 156, 159, 162, 188, 192, 195, 207, 220, 268, 286, 308, 309, 313, 353, 356, 383, 385, 389, 402, 409, 418, 427, 435, 438, 453, 468, 469, 471, 472, 477 foliation, 368, 379, 410-412 Fourier analysis, 256, 260, 309 Fourier coefficient, 261, 309 Fourier transform, 52, 264 Franklin, Benjamin, 230
free will, xiii, 124, 126, 287, 288, 291, 304, 307, 383, 392, 393, 415, 418 frequentist interpretation, 10, 11, 17, 39, 60, 95, 119, 135 Fresnel, Augustin Jean, 248, 254, 311 Friedmann-Lemaître-RobertsonWalker metric (see Robertson-Walker metric) ----------------------------------------------------Galileo, 222, 314, 397 Gassendi, Pierre, 104 gauge groups, 361, 371, 373, 378, 385, gauge invariance, 303, 366, 373, 375 Gaussian distribution, 17, 23, 24, 36-38, 4759, 63-66, 75-93, 106, 109, 110, 121, 124, 129-133, 137-152, 157-163, 175188, 192, 195, 196, 203, 207, 209, 213, 215, 258, 299, 421, 422, 424429, 434-450, 452, 455, 456, 462, 467-472 Gaussian hypersurface (see hypersurface) Geiger, Hans, 58, 245 general covariance, 300, 315, 339, 368, 412 general relativity, 96, 122, 227, 228, 272, 275, 279, 292, 298, 300, 301, 303, 305, 309, 310-381, 397-403, 407, 412-415, 418, 475 geodesic, 311, 319, 320, 330, 332, 340, 343, 345-347, 358, 369, 398-400, 413 geodesic deviation, 340, 341, 398 geodesic ball, 369 Germer, Lester, 247 Gibbs, J. Willard, 112 Gibbs entropy, 112-116, 118-121 Gilbert, William, 230, 231 Goldstein, Eugen, 230 Gordon, Walter, 255, 292 graviton, 362, 363, 402 grotesque states, 296-298, 394 group theory, 371, 373, 378, 386 group velocity, 257, 258, 293-295, 418 ----------------------------------------------------Hamilton, William Rowan, 97, 248, 249, 254, 255
Hamiltonian, 249, 253, 297, 298, 300, 364, 366, 373, 376 Hamiltonian mechanics, 248, 363, 364, 365, 367, 379, 394 Hamiltonian operator, 253, 254, 379, 489, 490, 493, 494 Hamilton’s Principle, 249, 254, 379 Hartle-Hawking no-boundary proposal, 407, 408, 410 Hawking, Stephen, 356, 407-410, 413 heat bath, 98, 99 heat sink, 99, 102 heat source, 98, 99, 102 Heisenberg, Werner, 244, 255, 260-266, 292, 394 Heisenberg Uncertainty Principle, 127, 233, 234, 262-265, 267, 268, 272, 295, 301, 306, 308, 348, 356 Herschel, William, 235, 258 Hertz, Heinrich, 235 heteroskedastic, 63 Higgs field, 389, 408, 409 Hilbert, David, 367, 401, 402 Hilbert space, 296, 304, 362, 365, 375, 376 Hooke, Robert, 104, 123 Hubble, Edwin, 352, 353 Humason, Milton, 352 Huygens, Christiaan, 231, 311 Huygens-Fresnel Principle, 248, 254 hyperbolic plane, 351, 398, 401, 414 hyperboloidal universe, 351, 352, 372, 402 hypersurface (phase space), 105, 107, 108, 125 hypersurface (simultaneity), 379, 386, 389, 390, 401, 410-413 hypothesis, 92, 93, 133, 134-152, 169, 175, 176, 178-180, 189, 194, 215, 224, 281, 288, 293, 303, 359, 381, 404, 408, 410, 445, 446, 447 ----------------------------------------------------Ingenhousz, Jan, 123 independent events, 8-10, 13, 20, 29, 30, 43
independent random variables, 45, 48, 5158, 61-67, 70-73, 77-82, 111, 113, 117, 131, 146, 150, 158, 175-181, 184, 213-215, 273, 424-429, 434, 443, 451, 456-462, 465, 469 infrared, 85, 143, 148, 212, 222, 235, 258 inner product, 337, 338, 345 IRAS mission, 143-152, 179, 186, 222 irrational number, 35, 94, 95 irreversible, 101, 102, 111, 298, 352, 386, 388, 389, 394, 395 isothermal, 98-100, 116, 117 ----------------------------------------------------joint density function, 71, 72, 76-82, 109, 110, 129, 149, 151, 159, 176, 179181, 424-428, 434, 462 joint probability, 8, 9, 61, 62, 111, 274, 282, 283, 478 joint wave function, 275--277, 292, 295, 297, 355 Jordan, Pascual, 255, 260, 261, 265, 359 Joule, James Prescott, 102, 103 ----------------------------------------------------Kaluza, Theodor, 360 Kaluz-Klein theory, 361, 368, 398, 402 Kepler, Johannes, 102, 223, 224, 230, 391 ket vector, 337, 489-494 Kirchhoff, Gustav, 235 Klein, Oscar, 255, 292, 360, 361 Kleinert theory, 361, 362 König, Samuel, 249 Kronecker delta, 343, 365, 439, 458 kurtosis, 64, 65, 71, 91, 92, 188, 192, 420, 421, 445, 463-467, 477-482 ----------------------------------------------------Lagrange, Joseph-Louis, 97, 249 Langrangian, 249, 362-367, 379 landscape, string-theory, 371, 373, 378 large-scale structure of the Universe, 350, 409 Larmor, Joseph, 316 lattice gauge theory, 375, 377 Lavoisier, Antoine, 96 law of large numbers, 30, 377, 389
least action, 249, 362, 367, 379 Leavitt, Henrietta, 352 Leibniz, Gottfried, 3, 35, 249, 341, 342 Lemaître, Abbé-Georges, 351, 353 leptokurtic, 421 Leucippus, 104 Levi-Civita, Tullio, 320, 343 light cone, 327, 328, 483 likelihood ratio, 135, 136 Loop Quantum Gravity, 357, 362, 373-378, 380, 384, 385, 390, 396, 402, 412, 413, 474 Lorentz, Hendrik Antoon, 236, 302, 313, 316 Lorentz contraction, 316, 318, 320, 325, 332, 357 Lorentz covariant, 315, 316, 357, 378, 379 Lorentz distribution, 55 Lorentzian metric, 329, 330, 403, 407-413 Lorentz transform, 275, 293, 310, 315, 316, 318-320, 324-328, 348, 378 Loschmidt, Johann, 123 Lucretius, 123 luminiferous aether, 227, 231, 244, 312, 316, 317 ----------------------------------------------------Mach, Ernst, 122 Mach’s Principle, 227, 228 macrostate, 27-29, 40-43, 105, 111-117 Maldacena, Juan, 372 Maldacena duality, 372 Marsden, Ernest, 245 manifold, 320, 321, 332, 336, 362, 364, 366, 368-372, 376-378, 384, 385, 390, 398-404, 407, 409-414 marginal distribution, 76-82, 145, 146, 149, 176, 179, 181, 274, 282 marginal probability, 8 many worlds interpretation, 302, 303, 306, 394, 399 Marconi, Guglielmo, 235 Maupertuis, Pierre Louis, 240 Maxwell, James Clerk, 230, 235, 313
Maxwell’s equations, 244, 245, 273, 302, 313, 315-318, 359, 360, 398 mesokurtic, 421 meson, 368 metaphysics, 128, 221, 227, 228, 249, 289, 303-306, 309, 339, 359, 400, metric tensor, 311, 329-332, 339, 343-349, 352, 358, 360, 367, 396, 401, 403 microcanonical ensemble, 105-108, 111, 125 microstate, 27-29, 39-43, 105, 106, 111-120, 126, 127, 385 Minkowski, Hermann, 325 Minkowski metric, 325, 329-332, 346, 361, 367, 408, 413 Minkowski space, 380, 402-414 Misner, C., 228, 314, 320, 347, 348, 358, 365-367, 373, 401 Möbius strip, 373, 374 mole, 97, 98, 104-106, 124 moment-generating function, 421, 422 Monte Carlo methods, 7, 16, 20, 22, 55, 148, 152-157, 172-174, 190-193, 207, 210, 285, 286, 306, 428, 431, 451, 452, 464, 467-472, 481 morality, 124, 126, 304, 307, 359, 383, 392, 393, 415, 418 M-theory, 372 ----------------------------------------------------Nernst heat theorem, 116 neutron, 274, 368, 381 Newton, Isaac, 3, 35, 97, 104, 223, 231 Newtonian mechanics, xii, 104, 125, 127, 222, 231, 247-249, 265, 280, 314, 315, 318, 346, 353, 357, 418, 488 Newton’s law of universal gravitation, 102, 223-227, 230, 277, 300, 307, 310312, 314, 345, 346, 349, 350, 379, 397, 405, 413, 418, 473 No-boundary proposal, 407-409 noncommuting variables, 233, 262-264, 267, 272, 273, 492, 494
nonepistemic, xii, xiii, 33, 34, 108, 125, 126, 157, 226, 260, 265-268, 277, 280, 289, 290, 295-297, 305-309, 354, 393, 418, 483 nonlocality, xvi, 126, 127, 277, 280-294, 305, 306, 354, 355, 483-485 null cone (see light cone) null hypothesis, 92, 93, 135, 445 null interval, 328, 355, 402 ----------------------------------------------------objective collapse, 296, 298, 300, 301 Occam’s razor, 303 ontology, 221-228, 313, 358, 375, 387, 404, 407, 411 Ostwald, Wilhelm, 122 outer product, 337-339 overfitting, 196, 201-203, 208 ----------------------------------------------------parallel transport, 340-342, 345-347 parallel universes, 8-10, 135, 152 parameterized path, 347 parameter refinement, 150-152, 175, 176, 178, 190, 299 parameter refinement theorem, 178-193, 215, 299, 394, 478 Pascal, Blaise, 3-5, 35, 153 Pascal’s rule, 5, 38, 39, 153, 155, 157 Pascal’s triangle, 4, 5, 25 path integral, 379, 385, 407, 408, 410 Pauli, Wolfgang, 126, 260, 261, 265, 306, 307, 373 Pauli Exclusion Principle, 274, 281-284, 306, 483 Penrose, Roger, 276, 283, 300, 301, 345, 396, 402, 406-410, 413, 416 Perfect Cosmological Principle, 351 perfect gas law, 97, 108, 118, 124 phase space, 105-109, 112, 114, 125-127, 266, 280, 296, 358, 364-367 phase velocity, 250, 258, 294, 295, 418 phlogiston, 96 physical realism, xviii, 260, 272, 273, 278, 280-283, 295, 305, 409, 410, 414, 415, 483 514
Planck, Max, 99, 102, 103, 120, 124, 232, 236-249, 252, 409, 494, 495 Planck-Kleinert crystal, 361, 362 Planck length, 241, 248, 357, 361, 362, 376, 381, 382, 385, 390, 392 Planck parameters, 241, 389, 473-476, Planck scale, 300, 357, 359, 361, 376, 378, 381, 388, 391, 412, 413 Planck’s constant, 127, 231, 246, 249, 252, 364, 385 platykurtic, 421 Podolsky, Boris, 272, 483 Poincaré, Henri, 236, 243, 244, 250 point-spread function, 49, 50, 143 Poisson brackets, 364, 365 Poisson distribution, 58-50, 83, 106, 109, 118, 119, 121, 159, 176, 178, 188193, 219, 299, 356, 431, 477-482, 484 population, xv, 16, 51, 59, 63-67, 71, 72, 8293, 105, 134, 135, 152, 160, 189-193, 196, 203, 207, 209, 234, 416, 417, 422, 423, 427-431, 443-445, 451-454, 462-470, 477, 472 population mixtures, 51, 63-65 positivism, 260, 266, 407, 409, 410 posterior probability, 30, 61, 62 Principle of Least Action, 249, 342, 367, 379 prior probability, 17, 25, 26, 30, 61-63, 125, 431 problem of time, 357, 358, 361, 367 projection operator, 296, 297 proper density, 339 proper time, 346, 347, 379, 394-396, 410, 411, 418 proton, 33, 229, 255, 256, 274, 293, 298, 299, 368, 381, 409 pseudorandom, 2, 15, 16, 21, 31, 77, 125, 127, 152-156, 172, 431, 451-472 pseudo-Riemannian, 329, 332, 364, 399, 402, 404, 413 pseudorotation, 319 -----------------------------------------------------
quantized energy, 232, 234-260, 277, 298, 356, 495 Quantum chromodynamics, 313, 369, 372 quark, 33, 368, 369, 375, 381 quark confinement, 375 quasistatic, 98, 99, 238 ----------------------------------------------------random walk, 2, 3, 29-32, 43, 53, 78, 93, 124, 125, 207-219, 206 Rayleigh distribution, Rayleigh-Jeans distribution, 236-240, 250 realism, xviii, 260, 272, 273, 278-283, 295, 305, 409, 410, 414, 415, 483 reductionist, 228, 381 Regge action, 379, 380, 385 Regge calculus, 376, 377, 379 reliability, 137, 140, 141, 219 renormalization (distribution), 63, 181, 185, 298, 299 renormalization (field theory), 309, 313, 314, 354, 357, 371, 375, 386 response function, 130-133, 136, 159-161, 189, 212 reversible, 99-102, 116, 386, 394, 395 Ricci, Gregorio, 320, 343, 394, 396 Ricci-flat space, 369-371 Ricci scalar, 348, 349, 367 Ricci tensor, 348, 349 Riemann, Bernhard, 320, 321, 343 Riemannian manifold, 320, 321, 351, 366, 369, 390, 398-402, 411 Riemannian metric, 329, 332, 410, 411 Riemann tensor, 344-349, 353 right ascension, 74, 138 Ritter, Johann Wilhelm, 235, 258 Robertson-Walker metric, 103, 351, 352, 372, 379, 383, 390, 400-402, 408, 413 Rosen, Nathan, 272, 358, 483 Rutherford, Ernest, 230, 245, 246 Rydberg, Johannes, 245, 246 ----------------------------------------------------sample, xv, 66-73, 82-93, 477-482 sanity, 275, 291, 354, 355, 417, 418 515
Schrödinger, Erwin, 127, 239, 247, 248, 250, 254, 256, 274, 277-281, 295 Schrödinger equation, 243, 247-256, 265, 297, 300, 361, 366, 375, 379, 489 Schrödinger’s cat, 233, 277, 296, 299, 302, 304 Schwarzschild, Karl, 348 Schwarzschild solution, 348, 350, 360, 474, 475 semantics, 221, semiclassical gravity, 356, 402, 407 Shannon entropy, 121, 122, 385 signature of a metric, 329, 330, 367, 380, 401-403, 408, 410, 411 simplex, 376-381, 384, 385, 412, 413 simplicial manifold, 376, 377, 379, 380, 384, 385 skewness, 64, 65, 87, 91, 92, 188, 190, 192, 420, 421, 445, 463-467, 477-482 Slipher, Vesto, 352 Smolin, Lee, 376, 394 spacelike interval, 328, 329, 354-356, 379, 403, 405, 408, 412, 418 spin foam, 362, 375, 395, 402 spin network, 362, 373, 375, 376, 380, 402 spinor, 373-375 spontaneous process, 101, 102, 151, 268, 308, 358, 359, 377, 386, 389, 408 spontaneous localization, 296, 298, 299, 394 standard deviation, 22-26, 36, 48, 51, 57, 67, 69, 71, 72, 77, 83, 85, 90-93, 121, 155, 188, 194, 420, 451, 454 standard model of particle physics, 368-371, 408 state vector 233, 267, 272, 296 state vector reduction, 233, 267, 272 statistic, xiv, xv, 21, 67, 82-93 statistical significance, xv, 25, 26, 29, 53, 54, 67, 73, 90, 92, 93, 103, 132, 134, 175, 176, 212, 281, 283, 287-289, 350, 357, 445-447, 453, 467 statistical stability, 8, 9, 16, 22, 125, 286, 390, 469 Stefan, Josef, 235, 241
stress-energy tensor, 314, 332, 343, 348-358, 363, 365, 371, 384 String Theory, 362, 363, 368-373, 385, 398, 399, 402 Strutt, William (Lord Rayleigh), 236 subjective interpretation, 10, 11, 135 superposition of states, 233, 234, 250-257, 262, 263, 266, 269, 272, 273, 277, 288, 297-304, 395, 484 supersymmetry, 369, 376 systematic error, 45, 92, 130, 131, 157-175, 200, 219, 459, 460, 466 ----------------------------------------------------tachyon, 369 tangent plane, 334-336, 341 tangent space, 336, 341, 342-346, 353, 366 tensor, 311, 314, 320, 322, 329-363, 365, 367, 371, 373, 396 tensor contraction, 337-339, 342, 348, 349 Thiele, Thorwald N., 123 Thomson, J.J., 230 timelike interval, 327-330, 347, 355, 403, 410, 412 topological quantum field theory, 361, 362, 375, 376 transcendental irrational number, 35 triangulation, 377-379, 384, 396, 401, 407, 412 type 1 error, 137, 139-142, 147-149 type 2 error, 137, 140, 149-151 Twistor Theory, 402 ----------------------------------------------------ultraviolet, 235, 258 ultraviolet catastrophe, 236, 250, 356 unbiased estimator, xiv, 59, 67, 83, 87-89, 176, 189, 478 unbiased random walk, 2, 3, 29, 30, 124, 215 uncertainty (estimator), 72, 85, 89-92, 150, 158, 161-179, 184-189, 193-220, 435-438, 442, 477 uncertainty (measurement), xii, 36, 54, 59, 60, 74-76, 85, 86, 103, 129-132, 138145, 158, 161-177, 418, 434, 453 516
uncertainty (Heisenberg), 127, 233, 244, 255, 262-268, 272, 295, 301, 306, 308, 348, 356 uncorrelated random variables, 62, 66, 70-78, 81, 82, 87, 88, 92, 130-132, 162, 166170, 215, 434, 435, 441, 452-459 underfitting, 201, 203, 204 uniform distribution, 16, 23, 24, 42-48, 54, 72, 77, 82, 92, 109, 114, 119-125, 130, 143-153, 178, 183-187, 298, 299, 306, 308, 421, 423, 429-431, 446, 448, 455, 461, 470-472, 484 universal gas constant, 97, 98, 117, 124 Unruh effect, 356 utility, 20 ----------------------------------------------------vacuum energy, 103, 309, 313 vacuum field equations, 348, 349 van Leeuwenhoek, Antony, 123 von Neumann, John, von Weizsäcker, Carl Friedrich, 266 ----------------------------------------------------wave equation, 248, 250, 255, 265, 268, 269, 277, 292, 303, 364 wave function, 250-255, 264-282, 289, 292, 295, 297, 302-304, 306, 355, 366, 390, 395, 396, 407-409, 413, 415 wave function collapse, 394, 395, 402, 413 wave number, 245, 250-258, 293, 473 wave packet, 256-260, 263, 264, 267, 270, 293-296, 299, 301, 418 weave, 376 Wheeler, John Archibald, 399, 403, 404, 473 wheel of fortune, 26, 40-44, 55, 82 Wick rotation, 407, 412 Wien, Wilhelm, 235-239, 495 Wien displacement law, 238, 241 Wien distribution, 237, 239, 243 Wigner, Eugene, 416, 417 Winterberg, F., 355 Witten, Edward, 372 world crystal theory, 361, 362 wormholes, 358 -----------------------------------------------------
xi-square statistic, 462-470 ----------------------------------------------------Young, Thomas, 311 ----------------------------------------------------zero-point energy, 239, 243, 268, 308, 309, 356, 400, 494