Handbook of Statistics Volume 45
Information Geometry
Handbook of Statistics
Series Editors
C.R. Rao, AIMSCS, University of Hyderabad Campus, Hyderabad, India
Arni S.R. Srinivasa Rao, Medical College of Georgia, Augusta University, United States
Handbook of Statistics Volume 45
Information Geometry Edited by
Angelo Plastino National University La Plata (UNLP), IFLP-CCT-Conicet, La Plata, Argentina; Kido—Dynamics, Lausanne, Switzerland
Arni S.R. Srinivasa Rao Medical College of Georgia, Augusta, Georgia, United States
C.R. Rao AIMSCS, University of Hyderabad Campus, Hyderabad, India
North-Holland is an imprint of Elsevier Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom Copyright © 2021 Elsevier B.V. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. ISBN: 978-0-323-85567-9 ISSN: 0169-7161 For information on all North-Holland publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Zoe Kruze Acquisitions Editor: Sam Mahfoudh Developmental Editor: Naiza Ermin Mendoza Production Project Manager: Abdulla Sait Cover Designer: Victoria Pearson Esser Typeset by STRAIVE, India
Contents

Contributors xi
Preface xiii

Section I: Foundations of information geometry 1

1. Revisiting the connection between Fisher information and entropy's rate of change 3
   A.R. Plastino, Angelo Plastino, and F. Pennini
   1. Introduction 3
   2. Fisher information and Cramér–Rao inequality 4
   3. Fisher information and the rate of change of Boltzmann–Gibbs entropy 6
      3.1 Brownian particle with constant drag force 10
      3.2 Systems described by an N-dimensional Fokker–Planck equation 11
   4. Possible lines for future research 12
   5. Conclusions 13
   References 13

2. Pythagoras theorem in information geometry and applications to generalized linear models 15
   Shinto Eguchi
   1. Introduction 16
   2. Pythagoras theorems in information geometry 18
   3. Power entropy and divergence 24
   4. Linear regression model 31
   5. Generalized linear model 35
   6. Discussion 40
   References 41
   Further reading 42

3. Rao distances and conformal mapping 43
   Arni S.R. Srinivasa Rao and Steven G. Krantz
   1. Introduction 43
   2. Manifolds 44
      2.1 Conformality between two regions 46
   3. Rao distance 46
   4. Conformal mapping 49
   5. Applications 53
   Acknowledgments 55
   References 55

4. Cramér–Rao inequality for testing the suitability of divergent partition functions 57
   Angelo Plastino, Mario Carlos Rocca, and Diana Monteoliva
   1. Introduction 57
   2. A first illustrative example 58
      2.1 Evaluation of the partition function 59
      2.2 Instruction manual for using our procedure 61
      2.3 Evaluation of ⟨r⟩ 61
      2.4 Dealing with ⟨r²⟩ 62
      2.5 Obtaining the Fisher information measure 62
      2.6 The six steps to obtain a finite Fisher's information 63
      2.7 Cramér–Rao inequality (CRI) 64
      2.8 Numerical example 64
   3. A Brownian motion example 65
      3.1 The present partition function 65
      3.2 Mean values of x-powers 66
      3.3 Tackling Fisher 66
      3.4 The present Cramér–Rao inequality 67
   4. The harmonic oscillator (HO) in Tsallis' statistics 68
      4.1 The HO-Tsallis partition function 68
      4.2 HO-Tsallis mean values for r² 69
      4.3 Mean value of r 69
      4.4 Variance V 69
      4.5 The HO-Tsallis Fisher information measure 70
   5. Failure of the Boltzmann–Gibbs (BG) statistics for Newton's gravitation 70
      5.1 Tackling Z_ν 70
      5.2 Mean values derived from our partition function (PP) 71
      5.3 Variance Δr = ⟨r²⟩ − ⟨r⟩² 72
      5.4 Gravitational FIM 73
      5.5 Incompatibility between Boltzmann–Gibbs statistics (BGS) and long-range interactions 73
   6. Statistics of gravitation in Tsallis' statistics 73
      6.1 Gravity-Tsallis partition function 73
      6.2 Gravity-Tsallis mean values for r and r² 74
      6.3 Tsallis' gravity treatment and Fisher's information measure 76
      6.4 Tsallis' gravity treatment and Cramér–Rao inequality (CRI) 77
   7. Conclusions 77
   References 78

5. Information geometry and classical Cramér–Rao-type inequalities 79
   Kumar Vijay Mishra and M. Ashok Kumar
   1. Introduction 79
   2. I-divergence and Iα-divergence 82
      2.1 Extension to infinite 𝒳 84
      2.2 Bregman vs Csiszár 84
      2.3 Classical vs quantum CR inequality 85
   3. Information geometry from a divergence function 86
      3.1 Information geometry for α-CR inequality 88
      3.2 An α-version of Cramér–Rao inequality 90
      3.3 Generalized version of Cramér–Rao inequality 95
   4. Information geometry for Bayesian CR inequality and Barankin bound 98
   5. Information geometry for Bayesian α-CR inequality 101
   6. Information geometry for hybrid CR inequality 106
   7. Summary 106
   Acknowledgments 107
   Appendix 107
      A.1 Other generalizations of Cramér–Rao inequality 107
   References 110

Section II: Theoretical applications and physics 115

6. Principle of minimum loss of Fisher information, arising from the Cramér–Rao inequality: Its role in evolution of bio-physical laws, complex systems and universes 117
   B. Roy Frieden
   1. Introduction 118
      1.1 On learning, energy, sensory messages 118
      1.2 On variational approaches 119
      1.3 Vital role played by information 119
   2. Overview and comparisons of applications 120
      2.1 Classical dynamics 120
      2.2 Quantum physics 120
      2.3 Biology 121
      2.4 Thermodynamics 121
      2.5 Extending use of the principle of natural selection 122
      2.6 From biological cell to earth to solar system, galaxy, universe, and multiverse 123
      2.7 Creation of a multiverse by requiring its Fisher I to be maximized 124
      2.8 Analogy of a cancer "universe" 125
      2.9 What ultimately causes a multiverse to form? 125
      2.10 Is there empirical evidence for a multiverse having formed? 126
      2.11 Details of the process of growing successive universes 126
      2.12 How many universes N might exist in the multiverse? 128
      2.13 Annihilation of universes 129
      2.14 Growth of a bubble of nothing 129
      2.15 Counter-growth of new universes 130
      2.16 Possibility of many annihilation waves 130
      2.17 How large a number N of universes exist? 130
      2.18 Is the multiverse merely a theoretical construct? 131
      2.19 Should the fact that we do not, and have not observed life elsewhere in our universe affect a belief that we exist in a multiverse? 131
   3. Derivation of principle of maximum Fisher information (MFI) 132
      3.1 Cramér–Rao (C-R) inequality 132
      3.2 On derivation of the C-R inequality 132
      3.3 What do such data values (augmented by knowledge of a single equality obeyed by the system physics) have to say about the unknown system physics? 133
   4. Kantian view of Fisher information use to predict a physical law 135
      4.1 How principle of maximum information originates with Kant 135
      4.2 On significance of the information difference I − J 135
   5. Principle of minimum loss of Fisher information 136
      5.1 Verifying that minimum loss is actually achieved by the principle 137
      5.2 Summary and foundations of the Fisher approach to knowledge acquisition 137
      5.3 What is accomplished by use of the Fisher approach 140
   6. Commonality of information-based growths of cancer and viral infections 142
      6.1 MFI applied to early cancer growth 142
      6.2 Later-stage cancer growth 143
      6.3 MFI applied to early covid-19 growth 143
      6.4 Common biological causes of cancer- and covid-19 growth; the ACE2 link 144
   References 146

7. Quantum metrology and quantum correlations 149
   Diego G. Bussandri and Pedro W. Lamberti
   1. Quantum correlations 149
   2. Parameter estimation 152
   3. Cramér–Rao bound 153
   4. Quantum Fisher information 155
   5. Quantum correlations in estimation theory 156
      5.1 Heisenberg limit 157
      5.2 Interferometric power 159
   6. Conclusion 160
   References 160

8. Information, economics, and the Cramér–Rao bound 161
   Raymond J. Hawkins and B. Roy Frieden
   1. Introduction 161
   2. Shannon entropy and Fisher information 162
   3. Financial economics 164
      3.1 Discount factors and bonds 164
      3.2 Derivative securities 168
   4. Macroeconomics 171
   5. Discussion and summary 174
   Acknowledgments 175
   References 175

9. Zipf's law results from the scaling invariance of the Cramér–Rao inequality 179
   Alberto Hernando and Angelo Plastino
   1. Introduction 179
   2. Our goal 180
   3. Fisher's information measure (FIM) and its minimization 180
   4. Derivation of Zipf's law 180
   5. Zipf plots 181
   6. Summary 183
   References 183
   Further reading 183

Section III: Advanced statistical theory 185

10. λ-Deformed probability families with subtractive and divisive normalizations 187
    Jun Zhang and Ting-Kam Leonard Wong
    1. Introduction 187
       1.1 Deformation models 189
       1.2 Deformed probability families: General approach 191
       1.3 Chapter outline 192
    2. λ-Deformation of exponential and mixture families 193
       2.1 λ-Deformation 193
       2.2 Deformation: Subtractive approach 194
       2.3 Deformation: Divisive approach 195
       2.4 Relation between the two normalizations 196
       2.5 λ-Exponential and λ-mixture families 197
    3. Deforming Legendre duality: λ-Duality 199
       3.1 From Bregman divergence to λ-logarithmic divergence 199
       3.2 λ-Deformed Legendre duality 201
       3.3 Relationship between λ-conjugation and Legendre conjugation 202
       3.4 Information geometry of λ-logarithmic divergence 206
    4. λ-Deformed entropy and divergence 207
       4.1 Relation between potential functions and Rényi entropy 207
       4.2 Relation between λ-logarithmic divergence and Rényi divergence 208
       4.3 Entropy maximizing property of λ-exponential family 209
    5. Example: λ-Deformation of the probability simplex 210
       5.1 λ-Exponential representation 210
       5.2 λ-Mixture representation 211
    6. Summary and conclusion 212
    References 214

11. Some remarks on Fisher information, the Cramér–Rao inequality, and their applications to physics 217
    H.G. Miller, Angelo Plastino, and A.R. Plastino
    1. Introduction 217
    2. Diffusion equation 220
    3. Connection with Tsallis statistics 222
    4. Conclusions 225
    Appendix 226
       A.1 The Cramér–Rao bound 226
    References 227

Index 229
Contributors

Numbers in parentheses indicate the pages on which the authors' contributions begin.
Diego G. Bussandri (149), Instituto de Física La Plata (IFLP), CONICET, La Plata, Argentina
Shinto Eguchi (15), Institute of Statistical Mathematics, Tokyo, Japan
B. Roy Frieden (117, 161), Wyant College of Optical Sciences, University of Arizona, Tucson, AZ, United States
Raymond J. Hawkins (161), Department of Economics, University of California, Berkeley, CA; Wyant College of Optical Sciences, University of Arizona, Tucson, AZ, United States
Alberto Hernando (179), Kido Dynamics SA, Lausanne, Switzerland
Steven G. Krantz (43), Department of Mathematics, Washington University in St. Louis, St. Louis, MO, United States
M. Ashok Kumar (79), Department of Mathematics, Indian Institute of Technology Palakkad, Palakkad, India
Pedro W. Lamberti (149), Consejo Nacional de Investigaciones Científicas y Técnicas de la República Argentina, Argentina
H.G. Miller (217), Physics Department, SUNY Fredonia, Chautauqua, NY, United States
Kumar Vijay Mishra (79), United States CCDC Army Research Laboratory, Adelphi, MD, United States
Diana Monteoliva (57), National University La Plata (UNLP), IFLP-CCT-Conicet, La Plata; Comisión de Investigaciones Científicas Provincia de Buenos Aires, La Plata, Argentina
F. Pennini (3), Departamento de Física, Facultad de Ciencias Exactas y Naturales, Universidad Nacional de La Pampa, CONICET, Santa Rosa, La Pampa, Argentina; Departamento de Física, Universidad Católica del Norte, Antofagasta, Chile
Angelo Plastino (3, 57, 179, 217), National University La Plata (UNLP), IFLP-CCT-Conicet, La Plata, Argentina; Kido—Dynamics, Lausanne, Switzerland
A.R. Plastino (3, 217), National University La Plata (UNLP), IFLP-CCT-Conicet, La Plata; CeBio y Departamento de Ciencias Básicas, Universidad Nacional del Noroeste de la Prov. de Buenos Aires, UNNOBA, CONICET, Junin, Argentina
Mario Carlos Rocca (57), National University La Plata (UNLP), IFLP-CCT-Conicet; Departamento de Matemática, Universidad Nacional de La Plata; Consejo Nacional de Investigaciones Científicas y Tecnológicas (IFLP-CCT-CONICET)-C.C. 727, La Plata, Argentina
Arni S.R. Srinivasa Rao (43), Laboratory for Theory and Mathematical Modeling, Medical College of Georgia and Department of Mathematics, Augusta University, Augusta, GA, United States
Ting-Kam Leonard Wong (187), Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
Jun Zhang (187), Department of Psychology and Department of Statistics, University of Michigan, Ann Arbor, MI, United States
Preface

Volume 45 of the Handbook of Statistics series, Information Geometry, offers a fresh view of the subject with classical and advanced materials. The subject of information geometry blends several areas of statistics, computer science, physics, and mathematics. It evolved from the groundbreaking article published by the legendary statistician C.R. Rao in 1945. His works led to the creation of the Cramér–Rao bound, the Rao distance, and Rao–Blackwellization. Fisher–Rao metrics and Rao distances play a vital role in geodesics, econometric analysis, and modern-day business analytics. The chapters of this book are written by experts who have been promoting the field of information geometry and its applications. The topics of this volume range from the theoretical foundations of Rao distances, manifolds, and differential geometry to classical Cramér–Rao inequalities, applications in biophysical laws, quantum correlations, information economics, deformed probability families, and entropy's rate of change. This volume has been developed with brilliantly written chapters by authors researching the various aspects of information geometry. The authors have taken utmost care in making their chapters accessible to new researchers, advanced theoreticians, and promoters of applied information geometry. They have vast experience in the subject area and have been advocating geometrical techniques arising from statistical data analysis, mathematical analysis, theoretical physics, and economics.

This book is written keeping in mind both graduate students and advanced researchers. Graduate students can use this volume as an introduction to a new subject that has not been taught in their curricula. Faculty can refer to this volume for their advanced research in information geometry. There are 11 chapters in this volume, each written to attract readers to the theory of information geometry. Below, we provide a summary of all 11 chapters and their authors in the order in which they appear in this book. The chapters are divided into three sections:

Section I: Foundations of Information Geometry
Section II: Theoretical Applications and Physics
Section III: Advanced Statistical Theory

Section I contains five chapters. The first chapter, written by A.R. Plastino, A. Plastino, and F. Pennini, discusses the foundations of the Cramér–Rao inequality and provides the readers with an advanced
understanding of Boltzmann–Gibbs entropy. The chapter explains higher-dimensional Fokker–Planck equations and provides directions for future research in the subject area. In the second chapter, S. Eguchi provides an introduction ranging from the origin of the Pythagoras theorem in a space of probability densities to recent advances related to the maximum likelihood estimator: the Pythagoras foliation. The chapter provides historical accounts of the topic, including Amari's differential geometry methods. The third chapter, by A.S.R. Srinivasa Rao and S.G. Krantz, introduces the mathematical foundations required for Rao distances, manifolds, and differential geometry, and also provides historical facts on C.R. Rao's contributions. The authors link Rao distances to conformal mappings and describe new applications of the content to the virtual tourism industry. The fourth chapter, by A. Plastino, M.C. Rocca, and D. Monteoliva, discusses Fermi's partition functions and their relevance to statistical physics, showing a comprehensive association of these developments with the Cramér–Rao inequality. The chapter describes Tsallis statistics, Brownian motion, and the statistical theory of gravitation, and concludes with the association of present-day theoretical physics with the aforesaid topics. The fifth chapter, by K.V. Mishra and M.A. Kumar, provides a detailed analysis of various divergence functions and associated theories for advanced readers. The authors also discuss the Amari–Nagaoka framework of information geometry, Bayesian and hybrid Cramér–Rao lower bounds, and generalized versions of Cramér–Rao inequalities.

Section II contains four chapters. The first chapter, by B.R. Frieden, is a detailed discussion of complex systems arising out of biophysical laws, maximum randomness, and the development of models and analysis. The chapter provides an overview of classical dynamics and quantum physics and their applications in biology and thermodynamics. In the second chapter, D.G. Bussandri and P.W. Lamberti consider that in quantum mechanics it is possible to identify a fundamental unpredictability given by (1) Heisenberg's uncertainty relations and the indistinguishability of quantum states, together with (2) a practical uncertainty related to unavoidable errors in the measurement process. Accordingly, the estimation of unknown values of physical magnitudes has well-defined precision limits. The authors discuss how the Cramér–Rao bound is used as a cornerstone for the analysis of these restrictions and, thus, has become an irreplaceable tool to determine the most accurate measurement procedures. The third chapter, by R.J. Hawkins and B.R. Frieden, introduces the Cramér–Rao bound as a useful measure of risk and utility in economic data. The authors also discuss the technicalities of macroeconomics and derivative securities and the underlying principles by Shannon and Fisher. The fourth chapter, by A. Hernando and A. Plastino, presents a formal demonstration of the celebrated Zipf's law. One sees that it arises from a variational principle related to the scaling invariance of the Cramér–Rao inequality. Interesting applications illustrate the issue.
Section III contains two chapters. The first chapter, by J. Zhang and T.-K. L. Wong, discusses the detailed technicalities of newer techniques in deformed probability families, investigating deformations of the exponential family and the mixture family of probability density functions. The authors touch upon Tsallis statistics, generalized deformation models, and the advanced statistical theory of information geometry. The second chapter, by H.G. Miller, A. Plastino, and A.R. Plastino, discusses interesting aspects of the applications of the celebrated Cramér–Rao inequality to time evolution in certain fundamental physics problems and suggests some promising lines for further inquiry on such an endeavor.

Our sincere thanks go to Mr. Sam Mahfoudh, Acquisitions Editor (Elsevier and North-Holland), for his overall administrative support throughout the preparation of this volume. His valuable involvement in the timely handling of authors' queries is highly appreciated. We also thank Ms. Naiza Mendoza, Developmental Editor (Elsevier), for providing excellent assistance to the editors and authors in the various technical and editorial aspects throughout the preparation. Ms. Mendoza also provided valuable guidance during proofs and production. Our thanks also go to Md. Sait Abdulla, Project Manager, Book Production, RELX India Private Limited, Chennai, India, for leading the production and printing activities and for providing assistance to the authors. Our sincere thanks and gratitude go to all the authors for their hard work in developing brilliant chapters and meeting the requirements of the volume. We firmly believe that this volume is a well-timed publication, and we are convinced that this collection will be a valuable resource for beginners and advanced scientists in information geometry.

Angelo Plastino
Arni S.R. Srinivasa Rao
C.R. Rao
Section I
Foundations of information geometry
Chapter 1
Revisiting the connection between Fisher information and entropy's rate of change

A.R. Plastino (a), Angelo Plastino (b), and F. Pennini (c,d,*)
(a) CeBio y Departamento de Ciencias Básicas, Universidad Nacional del Noroeste de la Prov. de Buenos Aires, UNNOBA, CONICET, Junin, Argentina
(b) National University La Plata (UNLP), IFLP-CCT-Conicet, La Plata, Argentina
(c) Departamento de Física, Facultad de Ciencias Exactas y Naturales, Universidad Nacional de La Pampa, CONICET, Santa Rosa, La Pampa, Argentina
(d) Departamento de Física, Universidad Católica del Norte, Antofagasta, Chile
* Corresponding author: e-mail: [email protected]
Abstract
For systems described by a time-dependent probability density obeying a continuity equation, the rate of change of entropy admits an upper bound based on Fisher information. We provide an overview of this bound, summarize its main properties, and discuss an alternative version of it that yields stronger constraints on entropy's rate of change.
Keywords: Fisher information, Entropic changes, Continuity equation
1 Introduction
The Fisher information measure (Fisher, 1925) and the Cramér–Rao inequality (Plastino and Plastino, 2020; Rao, 1945) constitute nowadays essential components of the tool-box of scientists and engineers dealing with probabilistic concepts. Ideas revolving around Fisher information were first applied to the statistical analysis of experimental or observational data. More recently, thanks in particular to the pioneering efforts of Frieden (1989, 1990, 1992, 1998, 2004), it was realized that the Fisher measure and the Cramér–Rao inequality are also important for understanding the very fabric of fundamental physical theories, because they shed new light on basic aspects of quantum mechanics and statistical physics (Brody and Meister, 1995; Dehesa et al., 2012; Flego et al., 2003, 2011; Frieden and Soffer, 1995; Frieden et al., 1999; Hall, 2000; Nikolov and Frieden, 1994; Plastino and Plastino, 1995a, 1996; Plastino et al., 1996, 1997a; Sánchez-Moreno et al., 2011).
One of the first developments along the lines championed by Frieden and collaborators, concerned bounds on entropy’s increase. Fisher information yields upper bounds on the rate of increase of entropy. The bounds were explored by various authors (see, for instance, Bag, 2002; Brody and Meister, 1995; Mayoraga et al., 1997; Nikolov and Frieden, 1994; Plastino and Plastino, 1995a; Prokopenko and Einav, 2015; Prokopenko et al., 2015; Yamano, 2012 and references therein), but there still are aspects of the connection between Fisher information and entropy’s rate of change that deserve to be further scrutinized. The aim of this brief contribution is to re-visit the Fisher-based bounds on the rate of change of entropy, indicating some new possible directions of enquiry.
2 Fisher information and Cramér–Rao inequality

First we review basic ideas related to Fisher information, the Cramér–Rao inequality, and some of their applications to fundamental physics. Let us consider a one-dimensional system described by a continuous probability density $\mathcal{P}(x)$, where x is a relevant phase-space coordinate. The Fisher information we are going to consider here is

$$I \,=\, \int \left(\frac{\partial \mathcal{P}(x)}{\partial x}\right)^{2} \frac{1}{\mathcal{P}(x)}\, dx. \tag{1}$$

The above is not the most general form of Fisher information. In its most general version, Fisher information is a functional defined on a parameterized family of probability densities, depending on one or more parameters. For a family of probability densities $\mathcal{P}(x,\theta)$ depending on one parameter θ, Fisher information reads

$$I_{\theta} \,=\, \int \left(\frac{\partial \mathcal{P}(x,\theta)}{\partial \theta}\right)^{2} \frac{1}{\mathcal{P}(x,\theta)}\, dx. \tag{2}$$

The form (1) of Fisher's information is a particular instance of the general form (2). It corresponds to a family of probability densities whose members are all of the same shape, but are uniformly shifted from each other. The dependence of the probability density $\mathcal{P}(x,\theta)$ on the parameter θ, which characterizes the shift of the density $\mathcal{P}(x,\theta)$ with respect to the reference density $\mathcal{P}(x,\theta=0)$, is

$$\mathcal{P}(x,\theta) \,=\, \mathcal{P}(x-\theta). \tag{3}$$

It is plain that for the above family of densities $(\partial \mathcal{P}/\partial\theta)^2 = (\partial \mathcal{P}/\partial x)^2$, and the Fisher information (2) reduces to (1).

Scientists often have to estimate the value of a parameter θ characterizing a probability density $\mathcal{P}(x,\theta)$. They have to determine θ from observations of the random variable x. The Cramér–Rao bound establishes a lower bound on the error with which the unknown parameter can be determined. Let $\tilde{\theta}(x)$ be an unbiased estimator of θ. That is, an estimator whose mean value coincides with the real value of θ,

$$\langle \tilde{\theta}(x) \rangle \,=\, \int \mathcal{P}(x,\theta)\, \tilde{\theta}(x)\, dx \,=\, \theta. \tag{4}$$

The error with which the estimator $\tilde{\theta}(x)$ yields the value of θ is

$$e^2 \,=\, \int \mathcal{P}(x,\theta) \left(\tilde{\theta}(x) - \theta\right)^{2} dx. \tag{5}$$

The Cramér–Rao inequality says that the error $e^2$ is bounded from below by the inverse of Fisher's information,

$$e^2 \,\ge\, \frac{1}{I_{\theta}}. \tag{6}$$

The Fisher information measure and the associated Cramér–Rao inequality found some of their most fundamental physical applications in the fields of quantum mechanics and statistical physics. In the realm of quantum mechanics Frieden investigated a deep connection between Fisher information and the Schrödinger equation. There is a functional related to Fisher information whose optimization leads to the Schrödinger equation. Consider (in one dimension) the functional

$$A \,=\, \int \left(\frac{d\mathcal{P}(x)}{dx}\right)^{2} \frac{1}{\mathcal{P}(x)}\, dx \,+\, \lambda \int \left[E - V(x)\right] \mathcal{P}(x)\, dx, \tag{7}$$

where λ is a negative constant, E is a constant, and V(x) is a potential function. After making the identification $\mathcal{P}(x) = \Psi(x)^2$, the Euler–Lagrange equation corresponding to the variational principle $\delta A = 0$ can be cast as

$$\frac{1}{\lambda}\, \frac{\partial^2 \Psi}{\partial x^2} \,+\, \Psi(x)\, V(x) \,=\, E\, \Psi(x). \tag{8}$$

Identifying λ with $-2m/\hbar^2$, where ħ stands for Planck's constant, Eq. (8) is identical to the time-independent Schrödinger equation of a quantum particle of mass m and energy E moving in the potential V(x). The first term appearing in the functional A is the Fisher information, while the second term is proportional to the mean kinetic energy of the particle,

$$E_K \,=\, \int \left[E - V(x)\right] \mathcal{P}(x)\, dx. \tag{9}$$

Note that for the stationary states described by the Schrödinger equation (8), Ψ(x) can be regarded as real, and therefore $|\Psi(x)|^2 = \Psi(x)^2$. The connection between Fisher information and the Schrödinger equation was the starting point of an ambitious research program, with far-reaching implications, leading
eventually to an information-theoretical re-interpretation of the fundamental equations of physics (Frieden, 1998, 2004). Besides its strictly quantum mechanical significance, the connection between Fisher information and the Schrödinger equation generated new insights in statistical physics. It suggested ways of applying techniques from quantum mechanics to the study of problems in classical statistical physics, such as the study of solutions of the Boltzmann transport equation associated with the propagation of sound in dilute gases (Flego et al., 2003). Other applications of Fisher information to statistical physics dealt with the characterization of the arrow of time. Fisher information is related to the rate of change of various entropic or information measures. Also, for some physical processes, the Fisher information itself satisfies an H-theorem. For processes obeying the linear diffusion equation, there are links between the time dependences of the Boltzmann entropy, the Kullback relative entropy, and the Fisher information (Plastino et al., 1997a). The time derivative of the Boltzmann–Gibbs entropy is equal to the Fisher information, and the time derivative of the Fisher information is always nonpositive. That is, the Fisher information itself obeys an H-theorem (Plastino et al., 1997a). The Fisher H-theorem imposes some bounds on the behavior of the Boltzmann–Gibbs entropy, implying that this entropy can increase at most linearly with time. For systems obeying a Fokker–Planck equation, the symmetries of this evolution equation lead to H-theorems involving Fisher information (Plastino and Plastino, 1996). In general scenarios described by probability densities satisfying a continuity equation, the rate of change of the Boltzmann–Gibbs entropy admits a bound related to Fisher information (Brody and Meister, 1995; Plastino and Plastino, 1995a). Our aim here is to summarize and re-discuss some basic aspects of the Fisher-based bound on entropy's change.
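As a concrete numerical aside (an added illustration, not part of the original chapter, with arbitrarily chosen parameter values), the following minimal Python sketch checks the Cramér–Rao inequality (6) by Monte Carlo for a Gaussian shift family, for which the sample mean is an unbiased estimator of the location parameter and the bound is attained.

```python
import numpy as np

# Minimal sketch (added illustration): Monte Carlo check of the Cramer-Rao
# inequality e^2 >= 1/I_theta for the Gaussian shift family N(theta, sigma^2).
# For n i.i.d. observations the Fisher information is I_theta = n/sigma^2,
# and the sample mean saturates the bound.
rng = np.random.default_rng(0)
theta, sigma, n, trials = 1.5, 2.0, 50, 20000

estimates = rng.normal(theta, sigma, size=(trials, n)).mean(axis=1)
e2 = np.mean((estimates - theta) ** 2)   # empirical squared error e^2, Eq. (5)
I_theta = n / sigma**2                   # Fisher information of the sample
print(f"e^2 = {e2:.5f}  >=  1/I_theta = {1 / I_theta:.5f}")
```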
3 Fisher information and the rate of change of Boltzmann–Gibbs entropy

Consider a system described by a time-dependent probability density $\mathcal{P}(\mathbf{x},t)$, defined on an N-dimensional configuration or phase space whose points are represented by the vectors $\mathbf{x} = (x_1, \dots, x_N) \in \mathbb{R}^N$. In some applications the vector x represents the configuration space of the system (for instance, the spatial location of a particle). In other applications, the vector $\mathbf{x} = (q_1, \dots, q_n, p_1, \dots, p_n)$ represents the complete set of N = 2n phase-space canonical variables of a Hamiltonian system with n degrees of freedom. The density $\mathcal{P}$ has dimensions of inverse volume and is properly normalized. That is,

$$\int \mathcal{P}(\mathbf{x},t)\, d\mathbf{x} \,=\, 1, \tag{10}$$

where dx is the volume element in the relevant space where the density is defined. The evolution of $\mathcal{P}(\mathbf{x},t)$ is governed by the continuity equation

$$\frac{\partial \mathcal{P}}{\partial t} \,+\, \nabla \cdot \mathbf{J} \,=\, 0, \tag{11}$$

where $\mathbf{J} \in \mathbb{R}^N$ is the probability density current, and $\nabla = \left(\frac{\partial}{\partial x_1}, \dots, \frac{\partial}{\partial x_N}\right)$. In many applications to theoretical physics, particularly in those championed by Frieden, it is convenient to introduce a generalization of the Fisher information measure (1) to densities defined on N-dimensional spaces. The generalized information measure is defined as

$$I \,=\, \int \frac{1}{\mathcal{P}}\, (\nabla \mathcal{P})^{2}\, d\mathbf{x}. \tag{12}$$

The Fisher measure (12) turns out to be closely linked to the behavior of the Boltzmann–Gibbs entropy. The Boltzmann–Gibbs entropy associated with the probability density $\mathcal{P}$ can be cast as

$$S[\mathcal{P}] \,=\, -\int \mathcal{P} \ln\!\left(\frac{\mathcal{P}}{\mathcal{P}_0}\right) d\mathbf{x}, \tag{13}$$

where $\mathcal{P}_0$ is a constant with the same dimensions as $\mathcal{P}$. If one works with a dimensionless probability density, it is not necessary to introduce the constant $\mathcal{P}_0$. But, if one wants to explicitly keep the physical dimensions of $\mathcal{P}$, one has to introduce $\mathcal{P}_0$. According to the definition (13), the entropy S is a dimensionless quantity. The time derivative of the entropy is

$$\frac{dS}{dt} \,=\, -\int \frac{\partial \mathcal{P}}{\partial t} \ln\!\left(\frac{\mathcal{P}}{\mathcal{P}_0}\right) d\mathbf{x}. \tag{14}$$

Our aim is to explore an upper bound on the rate of entropy change, dS/dt, given by a functional depending only on the form of the probability density $\mathcal{P}$. The existence of such a bound seems impossible, because the rate of change of S is not, in general, completely determined by the instantaneous shape of the probability density $\mathcal{P}$. It depends also on the detailed features of the system's dynamics, embodied in the form of the probability density current J. To put it another way, two systems that at a given time have the same instantaneous probability density $\mathcal{P}$ may have different probability currents and, consequently, different rates of entropy change. For example, if the probability current is multiplied by α, the rate of entropy change is multiplied by α as well:

$$\mathbf{J} \to \alpha\, \mathbf{J} \quad \text{leads to} \quad \frac{dS}{dt} \to \alpha\, \frac{dS}{dt}. \tag{15}$$

This example suggests that the quantity whose upper bound might be completely determined by the instantaneous shape of the density $\mathcal{P}$ is not the entropy's rate of change itself, but a rate of change appropriately normalized by the global rate of change of the density. Such a normalized rate of change would be independent of the scale parameter α appearing in (15). To formulate an appropriately normalized rate of change of the entropy, it is convenient to recast the continuity Eq. (11) under the guise

$$\frac{\partial \mathcal{P}}{\partial t} \,+\, \nabla \cdot (\mathbf{v}\, \mathcal{P}) \,=\, 0, \tag{16}$$

where

$$\mathbf{J} \,=\, \mathbf{v}\, \mathcal{P}. \tag{17}$$

The vector field v(x) can be interpreted as an effective velocity field characterizing (at least formally) the system's dynamics. But the velocity field v(x) is not to be regarded as meaning that the system is literally moving with velocity v(x) when located at x. The field v(x) is only a mathematically convenient representation of the system's dynamics. The quantity

$$\bar{v} \,=\, \sqrt{\langle \mathbf{v}^2 \rangle} \,=\, \left[\int \mathbf{v}^2\, \mathcal{P}\, d\mathbf{x}\right]^{1/2} \,=\, \left[\int \frac{\mathbf{J}^2}{\mathcal{P}}\, d\mathbf{x}\right]^{1/2} \tag{18}$$

is arguably the simplest choice for a numerical indicator of the global amount of motion exhibited by the system. Note that quantities based on the mean value of the velocity field, $\langle \mathbf{v} \rangle = \int \mathbf{v}\, \mathcal{P}\, d\mathbf{x}$, would not do, because the velocities at different spatial regions may cancel each other, leading to a small or vanishing mean velocity, even for systems experiencing a large amount of motion. Then, adopting $\bar{v}$ as our indicator, one may consider the quantity

$$\frac{1}{\bar{v}} \left|\frac{dS}{dt}\right| \tag{19}$$

as a normalized rate of change for the entropy. Going back once more to the example encapsulated in the scaling relations (15), it is plain that the quantity given in (19) does not depend on the scaling factor α. Does it admit an upper bound depending only on the instantaneous shape of the probability density $\mathcal{P}$? It does. And the bound is given by the square root of the Fisher information measure:

$$\frac{1}{\bar{v}} \left|\frac{dS}{dt}\right| \,\le\, \sqrt{I[\mathcal{P}]}. \tag{20}$$

The above considerations do not, of course, constitute a proof of the inequality (20). They are intended only as an intuitive, heuristic explanation of the physical meaning and plausibility of the Fisher-based bound.
We proceed now to review the proof of (20), along the lines followed in Plastino and Plastino (1995a) and Brody and Meister (1995). We have

$$\begin{aligned}
\frac{dS}{dt} \,&=\, -\int \frac{\partial \mathcal{P}}{\partial t} \ln\!\left(\frac{\mathcal{P}}{\mathcal{P}_0}\right) d\mathbf{x} \\
&=\, \int (\nabla \cdot \mathbf{J}) \ln\!\left(\frac{\mathcal{P}}{\mathcal{P}_0}\right) d\mathbf{x} \\
&=\, -\int \mathbf{J} \cdot \nabla \ln\!\left(\frac{\mathcal{P}}{\mathcal{P}_0}\right) d\mathbf{x} \\
&=\, -\int \mathbf{J} \cdot \frac{1}{\mathcal{P}}\, \nabla \mathcal{P}\, d\mathbf{x} \\
&=\, -\int \frac{\mathbf{J}}{\sqrt{\mathcal{P}}} \cdot \frac{\nabla \mathcal{P}}{\sqrt{\mathcal{P}}}\, d\mathbf{x}.
\end{aligned} \tag{21}$$

In the third step of the above derivation, there is an integration by parts. We assume that $\mathcal{P} \to 0$ and $|\mathbf{J}| \to 0$ fast enough when $|\mathbf{x}| \to \infty$, so that the surface terms involved in the integration by parts vanish. Applying now the Schwartz inequality,

$$\begin{aligned}
\left|\frac{dS}{dt}\right| \,&=\, \left| \int \frac{\mathbf{J}}{\sqrt{\mathcal{P}}} \cdot \frac{\nabla \mathcal{P}}{\sqrt{\mathcal{P}}}\, d\mathbf{x} \right| \\
&\le\, \left[\int \left(\frac{\mathbf{J}}{\sqrt{\mathcal{P}}}\right)^{2} d\mathbf{x}\right]^{1/2} \left[\int \frac{1}{\mathcal{P}}\, (\nabla \mathcal{P})^{2}\, d\mathbf{x}\right]^{1/2} \\
&=\, \left[\int \frac{\mathbf{J}^2}{\mathcal{P}}\, d\mathbf{x}\right]^{1/2} \left[\int \frac{1}{\mathcal{P}}\, (\nabla \mathcal{P})^{2}\, d\mathbf{x}\right]^{1/2} \\
&=\, \bar{v}\, \sqrt{I[\mathcal{P}]}.
\end{aligned} \tag{22}$$

The rate of change of the entropy S can be expressed as the mean divergence of the effective velocity field,

$$\frac{dS}{dt} \,=\, \int (\nabla \cdot \mathbf{v})\, \mathcal{P}\, d\mathbf{x} \,=\, \langle \nabla \cdot \mathbf{v} \rangle, \tag{23}$$

from which it follows that adding, at a given time, a constant (space-independent) velocity $\mathbf{v}_0$ to the effective velocity field v(x) does not change the instantaneous rate of change of the entropy. That is, the velocity fields $\mathbf{v}(\mathbf{x})$ and $\mathbf{v}(\mathbf{x}) + \mathbf{v}_0$ (or the probability currents $\mathbf{J}(\mathbf{x})$ and $\mathbf{J}(\mathbf{x}) + \mathbf{v}_0\, \mathcal{P}(\mathbf{x})$) yield the same instantaneous time derivative of S. This implies that, if one adds at a given instant a constant term to the velocity field, the inequality (22) still holds. In particular, one can add the (space-independent) velocity $\mathbf{v}_0 = -\langle \mathbf{v} \rangle = -\int \mathbf{v}(\mathbf{x})\, \mathcal{P}\, d\mathbf{x}$, equal to minus the instantaneous mean value of v. Note that the mean value ⟨v⟩ depends on time (because both the density and the velocity field depend on time), but, obviously, it does not depend on x. Re-expressing the bound (22) in terms of the "shifted" velocity field $\mathbf{v} - \langle \mathbf{v} \rangle$ leads to the improved bound

$$\left|\frac{dS}{dt}\right| \,\le\, \sigma\, \sqrt{I[\mathcal{P}]}, \tag{24}$$

where $\sigma = \sqrt{\langle (\mathbf{v} - \langle \mathbf{v} \rangle)^2 \rangle}$. The bound (24) is clearly tighter than the bound discussed in Plastino and Plastino (1995a) and Brody and Meister (1995) (which was the bound (22)). We shall now consider a couple of examples of the Fisher-based bounds for entropy increase.
3.1 Brownian particle with constant drag force

As an illustration of the difference between the bounds (24) and (22), let us consider a Brownian particle moving in one spatial dimension under the effect of a constant drag force K. The probability of finding the particle at different locations evolves according to the Fokker–Planck equation

$$\frac{\partial \mathcal{P}}{\partial t} \,=\, D\, \frac{\partial^2 \mathcal{P}}{\partial x^2} \,-\, K\, \frac{\partial \mathcal{P}}{\partial x}, \tag{25}$$

where D is a diffusion constant. For a density $\mathcal{P}(x,t)$ obeying the evolution Eq. (25), the time derivative of the entropy $S[\mathcal{P}]$ is

$$\frac{dS}{dt} \,=\, D \int \frac{1}{\mathcal{P}} \left(\frac{\partial \mathcal{P}}{\partial x}\right)^{2} dx \,=\, D\, I[\mathcal{P}]. \tag{26}$$

The probability current corresponding to the continuity Eq. (25) is

$$J(x) \,=\, K\, \mathcal{P} \,-\, D\, \frac{\partial \mathcal{P}}{\partial x}, \tag{27}$$

with the associated effective velocity field

$$v(x) \,=\, \frac{J}{\mathcal{P}} \,=\, K \,-\, \frac{D}{\mathcal{P}}\, \frac{\partial \mathcal{P}}{\partial x}. \tag{28}$$

It is worth mentioning that the velocity field v(x) should not be interpreted as meaning that the particle, when located at x, is literally moving with velocity v(x). The velocity field v(x) is only a convenient, alternative mathematical representation of the probability current J. It follows from (28) that

$$\langle v \rangle \,=\, K, \qquad \sigma \,=\, \sqrt{\langle (v - \langle v \rangle)^2 \rangle} \,=\, D\, I[\mathcal{P}]^{1/2}, \qquad \bar{v} \,=\, \sqrt{\langle v^2 \rangle} \,=\, \sqrt{K^2 + D^2\, I[\mathcal{P}]}, \tag{29}$$

which, combined with (26), leads to

$$\frac{1}{\sigma} \left|\frac{dS}{dt}\right| \,=\, \sqrt{I[\mathcal{P}]} \tag{30}$$

and

$$\frac{1}{\bar{v}} \left|\frac{dS}{dt}\right| \,=\, \frac{I[\mathcal{P}]}{\sqrt{(K/D)^2 + I[\mathcal{P}]}} \,<\, \sqrt{I[\mathcal{P}]}. \tag{31}$$

This example clearly illustrates that the bound (24) is tighter than the bound (22), since the bound (24) (which in this example corresponds to Eq. 30) is saturated, while the bound (22) (corresponding to (31)) is not.
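As a hedged numerical check of this example (an added illustration, with arbitrarily chosen values of D, K, and t), one can evaluate Eqs. (26)–(31) on a grid for the exact Gaussian solution of Eq. (25) that starts from a point source, whose mean is Kt and whose variance is 2Dt:

```python
import numpy as np

# Added sketch: verify that bound (24) is saturated for the Brownian particle
# with constant drag, while bound (22) is not.  The solution of Eq. (25)
# started from a point source is Gaussian with mean K*t and variance 2*D*t.
D, K, t = 0.5, 1.2, 0.7
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
integ = lambda f: float(np.sum(f) * dx)      # simple grid quadrature

s = 2 * D * t
P = np.exp(-((x - K * t) ** 2) / (2 * s)) / np.sqrt(2 * np.pi * s)
dPdx = np.gradient(P, x)

I = integ(dPdx**2 / P)                       # Fisher information, Eq. (12)
v = K - D * dPdx / P                         # effective velocity field, Eq. (28)
v_mean = integ(v * P)                        # <v> = K, Eq. (29)
sigma = np.sqrt(integ((v - v_mean) ** 2 * P))
v_bar = np.sqrt(integ(v**2 * P))
dSdt = D * I                                 # Eq. (26)

print(f"<v> = {v_mean:.4f} (drag K = {K})")
print(f"|dS/dt| = {dSdt:.4f}, sigma*sqrt(I) = {sigma*np.sqrt(I):.4f}")  # equal, Eq. (30)
print(f"v_bar*sqrt(I) = {v_bar*np.sqrt(I):.4f}")                        # larger, Eq. (31)
```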
3.2 Systems described by an N-dimensional Fokker–Planck equation

We consider now physical processes described by an N-dimensional Fokker–Planck equation (Risken, 1989), for which the evolution equation of the time-dependent probability density $\mathcal{P}(\mathbf{x},t)$ reads

$$\frac{\partial \mathcal{P}}{\partial t} \,=\, D\, \nabla^2 \mathcal{P} \,-\, \nabla \cdot \left(\mathbf{K}(\mathbf{x})\, \mathcal{P}\right). \tag{32}$$

The probability density $\mathcal{P}(\mathbf{x},t)$ is defined on an N-dimensional space, whose points are represented by the vectors $\mathbf{x} \in \mathbb{R}^N$, with components $(x_1, \dots, x_N)$. The operator $\nabla = \left(\frac{\partial}{\partial x_1}, \dots, \frac{\partial}{\partial x_N}\right)$ is the N-dimensional ∇-operator, D stands for the diffusion constant, and K(x) is the drift force. We shall focus on systems for which the drift force field can be derived from a potential function U(x),

$$\mathbf{K} \,=\, -\nabla U. \tag{33}$$

The probability current J and the associated effective velocity field v are then

$$\mathbf{J} \,=\, \mathbf{K}\, \mathcal{P} \,-\, D\, \nabla \mathcal{P} \,=\, -(\nabla U)\, \mathcal{P} \,-\, D\, \nabla \mathcal{P}, \qquad \mathbf{v} \,=\, \frac{\mathbf{J}}{\mathcal{P}} \,=\, -(\nabla U) \,-\, D\, \frac{1}{\mathcal{P}}\, \nabla \mathcal{P}. \tag{34}$$

We shall now discuss how the Fisher-based bounds on entropy increase are related to the behavior of the free energy, which is defined as

$$F \,=\, \langle U \rangle \,-\, D\, S, \tag{35}$$

where $\langle U \rangle = \int U\, \mathcal{P}\, d\mathbf{x}$ is the mean value of the potential energy U, and S is the Boltzmann–Gibbs entropy (13). It follows from (32) that F is always a nonincreasing quantity. The nonincreasing behavior of the free energy constitutes one formulation of the H-theorem satisfied by the Fokker–Planck equation (Risken, 1989). In order to establish a connection between the behavior of F and the Fisher bounds, it is convenient to briefly review the proof that the free energy F is nonincreasing. The time derivative of the free energy is

$$\begin{aligned}
\frac{dF}{dt} \,&=\, \int \left[U + D \ln\!\left(\frac{\mathcal{P}}{\mathcal{P}_0}\right)\right] \frac{\partial \mathcal{P}}{\partial t}\, d\mathbf{x} \\
&=\, \int \left\{\nabla \cdot \left[D\, (\nabla \mathcal{P}) + (\nabla U)\, \mathcal{P}\right]\right\} \left[U + D \ln\!\left(\frac{\mathcal{P}}{\mathcal{P}_0}\right)\right] d\mathbf{x} \\
&=\, -\int \left[D\, (\nabla \mathcal{P}) + (\nabla U)\, \mathcal{P}\right] \cdot \left[(\nabla U) + \frac{D\, (\nabla \mathcal{P})}{\mathcal{P}}\right] d\mathbf{x} \\
&=\, -\int \mathcal{P} \left[(\nabla U) + \frac{D\, (\nabla \mathcal{P})}{\mathcal{P}}\right]^{2} d\mathbf{x} \\
&\le\, 0.
\end{aligned} \tag{36}$$

Comparing the next to last line of (36) with the second equation in (34), we find that

$$\frac{dF}{dt} \,=\, -\int \mathbf{v}^2\, \mathcal{P}\, d\mathbf{x} \,=\, -\langle \mathbf{v}^2 \rangle. \tag{37}$$

If we now combine (37) with (20) (remember that $\bar{v} = \sqrt{\langle \mathbf{v}^2 \rangle}$), we obtain

$$I[\mathcal{P}]\, \frac{dF}{dt} \,\le\, -\left(\frac{dS}{dt}\right)^{2}. \tag{38}$$

We see that, for time-dependent solutions of the Fokker–Planck equation, there is an inequality relating the Fisher information measure with the rates of change of the Boltzmann–Gibbs entropy and the free energy.
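A similar hedged check (an added illustration, not from the original derivation, with arbitrary constants) can be carried out for a one-dimensional Ornstein–Uhlenbeck process with U(x) = kx²/2, whose time-dependent solutions are Gaussian; evaluating the relevant functionals on a grid at a single instant confirms the H-theorem (36), the identity (37), and the inequality (38):

```python
import numpy as np

# Added sketch: check dF/dt = -<v^2> <= 0 and I*dF/dt <= -(dS/dt)^2 for an
# Ornstein-Uhlenbeck process (U = k*x^2/2, so K = -k*x), at a Gaussian state (m, s).
D, k, m, s = 0.5, 0.8, 1.0, 0.6
x = np.linspace(-15.0, 15.0, 200001)
dx = x[1] - x[0]
integ = lambda f: float(np.sum(f) * dx)     # simple grid quadrature

P = np.exp(-((x - m) ** 2) / (2 * s)) / np.sqrt(2 * np.pi * s)
dPdx = np.gradient(P, x)

v = -k * x - D * dPdx / P                   # effective velocity, Eq. (34)
I = integ(dPdx**2 / P)                      # Fisher information, Eq. (12)
dFdt = -integ(v**2 * P)                     # Eq. (37)
dSdt = integ(np.gradient(v, x) * P)         # Eq. (23): dS/dt = <div v>

print(f"dF/dt = {dFdt:.4f}  (nonpositive, Eq. 36)")
print(f"I*dF/dt = {I * dFdt:.4f}  <=  -(dS/dt)^2 = {-dSdt**2:.6f}  (Eq. 38)")
```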
4 Possible lines for future research

The above considerations suggest some promising lines for future research. The connection between the Fisher-based upper bound for the rate of change of entropy and the time derivative of the free energy F was established only in terms of the upper bound (20). It would be interesting to explore formulations of this connection using the improved bound (24). Moreover, all these considerations were made on the basis of linear Fokker–Planck equations. It would be worthwhile to extend them to nonlinear Fokker–Planck equations (Frank, 2005; Plastino and Plastino, 1995b; Wedemann and Plastino, 2019). The improved upper bound for the rate of change of entropy, when written as

$$I\, \langle (\mathbf{v} - \langle \mathbf{v} \rangle)^2 \rangle \,\ge\, \left|\frac{dS}{dt}\right|^{2}, \tag{39}$$

shows a suggestive formal resemblance with the Cramér–Rao inequality

$$I\, e^2 \,\ge\, 1. \tag{40}$$

It is not clear at this stage if it is just a formal resemblance, or if it has some deeper physical meaning. The inequalities (39) and (40) differ in some obvious ways. In the right-hand side of Eq. (39) one has the time-dependent quantity dS/dt, while in the right-hand side of Eq. (40) one has a constant. Besides, to make a connection with the Cramér–Rao inequality, one would need to relate the quantity $\langle (\mathbf{v} - \langle \mathbf{v} \rangle)^2 \rangle$ to the error in the estimation of a parameter. In spite of these difficulties, we feel that there might be an interesting connection between the inequalities (39) and (40) that deserves to be investigated.
5 Conclusions

We provided a brief overview of the connection between the Fisher information measure and the rate of change of the Boltzmann–Gibbs entropy, for systems described by a time-dependent probability density that satisfies a continuity equation. The scenarios considered include, for instance, those involving diffusion, Fokker–Planck, or Liouville equations. For these kinds of equations, the Fisher information measure provides upper bounds on the absolute value of the time derivative of the Boltzmann–Gibbs entropy. We also mentioned some possible directions for future research. We believe that the connection between Fisher information and entropy change is still worthy of further exploration.
References

Bag, B.C., 2002. Phys. Rev. E 65, 046118.
Brody, D., Meister, B., 1995. Phys. Lett. A 204, 93.
Dehesa, J.S., Plastino, A.R., Sánchez-Moreno, P., Vignat, C., 2012. Appl. Math. Lett. 25, 1689.
Fisher, R.A., 1925. Math. Proc. Cambridge Philos. Soc. 22, 700.
Flego, S.P., Frieden, B.R., Plastino, A., Plastino, A.R., Soffer, B.H., 2003. Phys. Rev. E 68, 016105.
Flego, S.P., Plastino, A., Plastino, A.R., 2011. Ann. Phys. 326, 2533.
Frank, T.D., 2005. Nonlinear Fokker-Planck Equations: Fundamentals and Applications. Springer, Berlin.
Frieden, B.R., 1989. Am. J. Phys. 57, 1004.
Frieden, B.R., 1990. Phys. Rev. A 41, 4265.
Frieden, B.R., 1992. Physica A 180, 359.
Frieden, B.R., 1998. Physics From Fisher Information: A Unification. Cambridge University Press, Cambridge.
Frieden, B.R., 2004. Science From Fisher Information: A Unification. Cambridge University Press, Cambridge.
Frieden, B.R., Soffer, B.H., 1995. Phys. Rev. E 52, 2274.
Frieden, B.R., Plastino, A., Plastino, A.R., Soffer, B.H., 1999. Phys. Rev. E 60, 48.
Hall, M.J.W., 2000. Phys. Rev. A 62, 012107.
Mayoraga, M., Romero-Salazar, L., Velasco, R.M., 1997. Physica A 246, 145.
Nikolov, B., Frieden, B.R., 1994. Phys. Rev. E 49, 4815.
Plastino, A.R., Plastino, A., 1995a. Phys. Rev. E 52, 4580.
Plastino, A.R., Plastino, A., 1995b. Physica A 222, 347.
Plastino, A.R., Plastino, A., 1996. Phys. Rev. E 54, 4423.
Plastino, A.R., Plastino, A., 2020. Significance 17, 39.
Plastino, A., Plastino, A.R., Miller, H.G., Khanna, F., 1996. Phys. Lett. A 221, 29.
Plastino, A.R., Plastino, A., Miller, H.G., 1997a. Phys. Lett. A 235, 129.
Prokopenko, M., Einav, I., 2015. Phys. Rev. E 91, 062143.
Prokopenko, M., Barnett, L., Harre, M., Lizier, J.T., Obst, O., Wang, X.R., 2015. Proc. R. Soc. A 471, 20150610.
Rao, C.R., 1945. Bull. Calcutta Math. Soc. 37, 81–91.
Risken, H., 1989. The Fokker-Planck Equation. Springer, New York.
Sánchez-Moreno, P., Plastino, A.R., Dehesa, J.S., 2011. J. Phys. A 44, 065301.
Wedemann, R.S., Plastino, A.R., 2019. In: LNCS 11727, International Conference on Artificial Neural Networks (ICANN), p. 43.
Yamano, T., 2012. J. Math. Phys. 53, 043301.
Chapter 2
Pythagoras theorem in information geometry and applications to generalized linear models

Shinto Eguchi*
Institute of Statistical Mathematics, Tokyo, Japan
* Corresponding author: e-mail: [email protected]
Abstract
We give a statistical introduction to Information Geometry, focusing on the Pythagoras theorem in a space of probability density or mass functions when the squared length is defined by the Kullback–Leibler divergence. It is reviewed that the Pythagoras theorem extends to a foliation structure of the subspace associated with the maximum likelihood estimator (MLE) under the assumption of an exponential model. We discuss this perspective in the framework of regression models. A simple example of the Pythagoras theorem arises from the Gauss least squares in a linear regression model, in which the assumption of a normal distribution is strongly connected with the MLE. On the other hand, we consider another estimator, called the minimum power estimator. We extend the couple of the normal distribution model and the MLE to another couple, the t-distribution model and the minimum power estimator, which exactly associates with the dualistic structure if the power defining the estimator is matched with the degrees of freedom of the t-distribution. These observations can be applied in the framework of the generalized linear model. Under an exponential-dispersion model including the Bernoulli, Poisson, and exponential distributions, the MLE leads to the Pythagoras foliation, which reveals a decomposition of the deviance statistics. This is parallel to the discussion of the residual sum of squares in the Pythagoras theorem.
Keywords: Foliation, Least squares, Linear connection, Pareto distribution, Power entropy, Power divergence, Pythagoras theorem, Student t-distribution
1 Introduction

In a narrow sense, Information Geometry is a differential geometric approach to integrating statistics, informatics, and other fields of mathematical science, whose terminology was coined by Amari (1982, 1985) with excellent contributions. The key idea is to explore a dualistic structure on the space of all probability distributions, which comes from an understanding of a couple of basic objects: statistical modeling and statistical inference. Such a dualistic structure gives stimulating motivations to differential geometry, see Kurose (1994) and Matsuzoe (1998). In effect, this framework provides a geometric insight into statistical notions such as Fisher information, sufficient statistics, and efficient estimators. In particular, the e-geodesic curvature measures the optimality of a statistical model; the m-geodesic curvature measures the optimality of statistical estimation, in which curvature tensors with respect to the e-geodesic and m-geodesic give geometric quantities for the loss of information integrating the two optimalities. Further, the framework has been enlarged to many fields of mathematical science accompanying randomness or uncertainty, for example quantum physics, machine learning, data science, and artificial intelligence, see Amari and Kawanabe (1997), Hayashi and Nagaoka (2003), and Murata et al. (2004).

This chapter provides an introductory topic on Information Geometry, see Amari (2016) and Ay et al. (2017) for more advanced and technical topics. We focus on the Pythagoras theorem as an essence of Information Geometry, aiming to give a chance to graduate students and researchers who would like to apply their themes of interest to Information Geometry. For this, we discuss a close relation between the Pythagoras theorem and regression analyses, in which Gauss exploited the method of least squares to predict orbits of planets at the end of the 18th century, and Galton gave an excellent understanding of the least squares as "regression to the mean" with the data set of heights for pairs of fathers and sons. The regression analysis has since become a popular tool to decipher data produced in various fields of the natural and social sciences.

It may sound strange, but the essence of Information Geometry can be reduced to the Pythagoras theorem on the space of all probability density functions. In effect, Pythagoras, the ancient Greek philosopher, proved the theorem in two-dimensional Euclidean space, while in the Information Geometry version of the theorem the squared Euclidean distance is replaced by the Kullback–Leibler (KL) divergence, with which the pair of e-geodesic and m-geodesic is concomitant (Amari and Nagaoka, 2007; Nagaoka and Amari, 1982). A triangle is a right triangle when the e-geodesic and m-geodesic lines orthogonally intersect at the hypotenuse with respect to the information metric, as discussed in more detail in the following section. The Pythagoras theorem leads to a foliation structure of the subspace associated with the maximum likelihood estimator (MLE) if a data set is assumed to follow an exponential model.
The dualistic structure playing between the exponential model and the MLE is one of the most fundamental perspectives in Information Geometry, in which the exponential model is the set of the maximum entropy distributions under a constraint for the mean of a given statistic, while the MLE is viewed as the minimization of the KL divergence from the true density function onto the exponential model. We note that the KL divergence is not the only divergence satisfying the Pythagoras theorem. We introduce the power entropy and the power cross entropy, in which the power divergence is defined by their difference. The maximum power entropy distributions associate with a statistical model other than the exponential model, typically the family of t-distributions and that of generalized Pareto distributions. We consider the minimum power estimator, defined by minimization of the power divergence. Then, we find another type of the Pythagoras theorem, defining the squared length by the power divergence. Similarly, this leads to a dualistic structure between the model of maximum power entropy distributions and the minimum power estimator, in which the Pythagoras theorem with a different couple of geodesic lines is established, see Fujisawa and Eguchi (2008) and Eguchi et al. (2014).

We next discuss a linear regression, in which we see that the Gauss method of least squares is closely related with the Pythagoras theorem. The least squares estimate (LSE) is equal to the maximum likelihood estimate (MLE) under the normality assumption for a response variable, in which the MLE leads to the Pythagoras theorem. In other words, the couple of the normal distribution model and the maximum likelihood associates with such a simple geometric structure as the Pythagoras theorem. If we fix the normal model, the MLE uniquely leads to the Pythagoras theorem beyond the original statement. If we adopt the γ-power divergence with a power index γ, then the minimum γ-power estimator does not meet such a geometric structure under the normal regression model, in which the minimum divergence method is shown to have a robust performance against model misspecification of the regression. However, if we assume a Student t-distribution regression model, we see that the minimum γ-power estimator equals the LSE when γ is matched with the degrees of freedom of the t-distribution. Hence, a specific choice of the t-distribution and the γ-power estimator meets the Pythagoras theorem, in which the couple of the t-distribution model and the MLE no longer meets the Pythagoras theorem. Thus, we have a clear understanding of a complementary relationship between a statistical model and an estimator. This complementarity will be explored in a generalized linear model including the Bernoulli logistic regression, the Poisson log-linear model, and so forth.

Finally, we give a list of notations discussed in this chapter.

$\mathcal{P}$: Space of all probability density/mass functions
$\mathcal{M}$: Statistical model
$L(\theta)$: Log-likelihood function
$H_0(p)$: Boltzmann–Shannon entropy
$C_0(p, q)$: Cross entropy
$D_0(p, q)$: Kullback–Leibler (KL) divergence
$\mathcal{L}(p_\theta)$: MLE leaf
$I(\theta)$: Fisher information matrix
$\mathcal{L}^{(e)}$: e-Geodesic
$\mathcal{L}^{(m)}$: m-Geodesic
$\mathcal{M}^{(e)}$: Exponential model
$\mathcal{N}^{(m)}(p)$: Moment match space
$\bigcup_{\theta} \mathcal{N}^{(m)}(p_\theta)$: Pythagoras foliation
$C_\gamma(p, q)$: γ-Power cross entropy
$D_\gamma(p, q)$: γ-Power divergence
$L_\gamma(\theta)$: γ-Power loss function
$\mathcal{M}^{(\gamma)}$: γ-Power model
$\hat{\beta}$: Least squares estimator
dev: Deviance statistic
RSS: Residual sum of squares
2 Pythagoras theorems in information geometry

Let $\mathcal{P}$ be the space of all Radon–Nikodým derivatives with a common support that are dominated by a σ-finite measure Λ. We typically consider the cases where Λ is fixed as the Lebesgue measure or the counting measure, so that $\mathcal{P}$ is the space of probability density functions, or that of probability mass functions. Then we consider a statistical model in $\mathcal{P}$ defined by

$$\mathcal{M} \,=\, \{p_\theta(x) \in \mathcal{P} : \theta \in \Theta\}, \tag{1}$$

where θ is a parameter vector of a parameter space Θ that is an open subset of $\mathbb{R}^d$. In a context of differential geometry, $\mathcal{M}$ can be viewed as a local chart of a differentiable manifold of d dimensions under smoothness and topological assumptions. However, we do not discuss such mathematical generality, in order to focus on statistical applications; see Ay et al. (2017) for a rigorous discussion in differential geometry. For a given data set $\{x_i\}_{i=1}^n$ the log-likelihood function is defined by

$$L(\theta) \,=\, \sum_{i=1}^{n} \log p_\theta(x_i), \tag{2}$$

and the maximum likelihood estimator $\hat{\theta}$ is defined by

$$\hat{\theta} \,=\, \operatorname*{argmax}_{\theta \in \Theta}\, L(\theta).$$

The Boltzmann–Shannon entropy for p(x) of $\mathcal{P}$ is defined by

$$H_0(p) \,=\, -\int p(x) \log p(x)\, \Lambda(dx), \tag{3}$$
and the cross entropy for p(x) and q(x) of $\mathcal{P}$ is given by

$$C_0(p, q) \,=\, -\int p(x) \log q(x)\, \Lambda(dx).$$

Thus, the Kullback–Leibler (KL) divergence is defined by the difference as

$$D_0(p, q) \,=\, C_0(p, q) - H_0(p), \tag{4}$$

in which the expected log-likelihood function equals n times the negative cross entropy, that is, $\mathbb{E}_p\{L(\theta)\} = -n\, C_0(p, p_\theta)$, when the underlying density function p(x) is not exactly in the model $\mathcal{M}$ defined in (1), where $\mathbb{E}_p$ denotes the expectation with respect to p. Hence, the maximum likelihood is closely connected with the minimum cross entropy, or minimum KL divergence, in which we define the subspace associated with the MLE by

$$\mathcal{L}(p_\theta) \,=\, \Big\{p \in \mathcal{P} : \theta = \operatorname*{argmin}_{\tilde{\theta} \in \Theta} D_0(p, p_{\tilde{\theta}})\Big\}, \tag{5}$$

called the MLE leaf with ray θ. Thus, $\mathcal{L}(p_\theta)$ is the set of all underlying density functions p(x) under which the maximum of the expected log-likelihood function is attained at θ. Assume that the MLE uniquely exists in the model $\mathcal{M}$ for any underlying density function of $\mathcal{P}$. Then, the union of all MLE leaves over the parameter space, $\bigcup_{\theta \in \Theta} \mathcal{L}(p_\theta)$, is a foliation such that $\mathcal{L}(p_\theta)$ and $\mathcal{L}(p_{\theta'})$ are disjoint for any distinct θ and θ′ in Θ, under the assumption of the uniqueness of the MLE. Thus, all MLE leaves separately transverse the model $\mathcal{M}$. In such a perspective the model validation is assessed by the minimum KL divergence, cf. Akaike's information criterion. We will explore the structure of the foliation when the model $\mathcal{M}$ is taken to be an exponential model.

Rao (1945) gave a geometric perspective such that a statistical model $\mathcal{M}$ is a Riemann manifold if the Riemann metric is defined by the Fisher information matrix

$$I(\theta) \,=\, \mathbb{E}_{p_\theta}\!\left[\frac{\partial \log p_\theta(X)}{\partial \theta}\, \frac{\partial \log p_\theta(X)}{\partial \theta^{\top}}\right]$$

for the θ-coordinate, called the information metric. This formulation gives insight related to the bound of statistical estimation, called the Cramér–Rao bound, in the sense that the inverse of I(θ) is exactly equal to the lower bound of variance matrices for all unbiased estimators of θ. It is a pioneering work of Information Geometry, on which the idea of dual geodesic lines is built, as given in the following discussion.
20
SECTION
I Foundations of information geometry
We have a quick review for Information Geometry as follows: Let p and q be in P. Then, two types of geodesic lines connection between p and q are defined by LðeÞ ¼ fð1 tÞpðxÞ + tqðxÞ : t ½0, 1g and LðmÞ ¼ fct exp fð1 tÞ log pðxÞ + t log qðxÞg : t ½0, 1g, called the e-geodesic and m-geodesic lines, where ct is a normalizing constant for having a total mass one. Furthermore, the e-geodesic and m-geodesic lines are extended to the e-connection and m-connection for a subspace of P , in which the second-order efficiency is characterized by the e-connection curvature for the model and the m-connection curvature for the subspace associated with the estimation, cf. Amari (1982). Thus, the optimality measure is decomposed into the two curvatures in which the m-connection curvature becomes vanishing for the MLE, in which the e-connection curvature becomes vanishing for the case of exponential model, which is called the statistical curvature by Efron (1975) for the setting of one-parameter curved exponential model. This theorem supports the second order efficiency of the MLE with such a geometric understanding on the basis of the work by Rao (1962), in which a class of second-order efficient estimators is characterized in addition to the MLE, see Eguchi (1983) for the one-parameter family of second-order efficient estimators. In this way, the theory of second-order efficiency has been excellently established by Information Geometry; however, such an idea of higher order asymptotics does not provide so helpful understandings for data analysts because the discussion assumes a complete model specification, which is basically difficult to confirm from the data. We now have a look at the Pythagoras theorem in Information Geometry as follows, see Nagaoka and Amari (1982). This theorem is not one of sophisticated techniques in differential geometry as the theory of the second-order efficiency, but is essential to expand helpful understandings for the statistical methodology in Information Geometry. Theorem 1. Let p, q, and r be in P. Consider the triangle with vertices p, q, and r. Then, if the m-geodesic line connecting between p and q and the e-geodesic line connecting r and q orthogonally intersect at q with respect to the information metric, then D0 ðp, qÞ + D0 ðq, rÞ ¼ D0 ðp, rÞ: Proof. We define a two-parameter model as pθ ðxÞ ¼ ð1 tÞpðxÞ + tcs exp fs log qðxÞ + ð1 sÞ log rðxÞg,
(6)
Pythagoras theorem in information geometry Chapter
2
21
where θ ¼ (t, s). Then, noting p(1, 1)(x) ¼ q(x), the orthogonality condition is given by pθ
∂ log pθ ðXÞ ∂ log pθ ðXÞ ¼ 0: ∂t ∂s θ¼ð1,1Þ
This condition is written by Z fpðxÞ qðxÞgf log rðxÞ log qðxÞgΛðdxÞ ¼ 0, □
which implies that (6). In Theorem 1, if the m-geodesics connecting p and q is written by pt ðxÞ ¼ ð1 tÞpðxÞ + tqðxÞ
(7)
and the e-geodesic connecting r and q is r t ðxÞ ¼ ct exp fð1 tÞ log rðxÞ + t log qðxÞg, then it holds for any s, t [0, 1] that D0 ðpt , qÞ ¼ D0 ðpt , qÞ + D0 ðq, r s Þ, see Fig. 1. Thus, the space P is viewed as a dualistic Riemann space with the m-geodesic and e-geodesic. Subsequently, we have a close look at this property in Theorem 1 on a linear regression model and the Gauss least squares method. We have a review a statistical understanding from Theorem 1 introducing an exponential model, which includes most of important distributions in statistics. Let ðeÞ
MðeÞ ¼ fpθ ðxÞ :¼ exp fθ> sðxÞ ψðθÞg : θ Θg,
FIG. 1 Pythagoras theorem.
(8)
22
SECTION
I Foundations of information geometry
where θ is called the canonical parameter and ψ(θ) is the cumulant function defined by Z ψðθÞ ¼ log exp fθ> sðxÞgΛðdxÞ: Here we consider the full parameter space Θ ¼ fθ p : ψðθÞ < ∞g. Then, the sufficient statistic s(X) associates with the mean equal space ðmÞ (9) N ðpÞ ¼ q P : q fsðXÞg ¼ p fsðXÞg : Let q be arbitrarily fixed in MðeÞ . Then, for any density functions r of MðeÞ the e-geodesic line connecting q and r is in MðeÞ ; for arbitrary density ðmÞ ðmÞ functions p of N ðqÞ the m-geodesic line connecting p and q is in N ðqÞ. ðmÞ In other words, MðeÞ is a totally e-geodesic subspace; N ðqÞ is a totally m-geodesic subspace. Furthermore, we observe a foliation satisfying p 6¼ q in M and P ¼ N
ðmÞ
)
N
ðMðeÞ Þ, where N
ðmÞ
ðMÞ ¼
ðmÞ
ðpÞ \ N
[ pM
N
ðmÞ
ðmÞ
ðqÞ ¼ ;
ðpÞ:
ðeÞ
If we arbitrarily fix q of P, then the Pythagoras theorem D0 ðp, rÞ ¼ D0 ðp, qÞ + D0 ðq, rÞ for any p N
ðmÞ
ðqÞ and any r MðeÞ . In other words, D0 ðp, qÞ ¼ min D0 ðp, rÞ: rMðeÞ
The KL projection of p onto MðeÞ is exactly given by q. Thus, N ðmÞ ðqÞ is the MLE leaf LðθÞ defined in (5) since the minimization of the KL divergence is equivalent to the maximization of the expected log-likelihood function as discussed around (5). In accordance, all MLE leaves are mutually disjointed and orthogonally transverseds to MðeÞ, in which we observe the excellent foliation with the ray MðeÞ as in Fig. 2. We next have a look at the maximum entropy principle from a view of Information Geometry, see Jaynes (1957). Let us consider the MLE leaf as N
ðmÞ
ðμÞ ¼ fp P : p fsðXÞg ¼ μg:
(10)
Then, we solve a problem to find a density function, say pμ to maximize the ðmÞ entropy H0 on a constraint N ðμÞ. The Euler-Lagrange equation reveals that there is a solution pμ in the exponential model MðeÞ such that pμ ¼ p(e) θ(μ),
Pythagoras theorem in information geometry Chapter
2
23
FIG. 2 Pythagoras foliation.
where θ(μ) is determined by the mean equal constraint (10). We observe that, ðmÞ for all p N ðμÞ ðeÞ
ðeÞ
H 0 ðpθðμÞ Þ H 0 ðpÞ ¼ D0 ðp, pθðμÞ Þ: Therefore, the maximum entropy with the mean constraint characterizes the exponential model. Let fx1 , …, xn g be a data set from a probability density or mass function p(e) θ (x) of (8). Then, if we consider the mean parameter μ ðeÞ of M with the transformation μ(θ) ¼ ∂ψ(θ)/∂θ, then the MLE μ^ for μ is given by the sample mean vector as n 1X sðxi Þ, say s: n i¼1
(11)
This is because the log-likelihood function for μ is given by LðμÞ ¼ nfθðμÞ> s ψðθðμÞÞg with the inverse transformation θ(μ) of μ(θ), which equals the negative cross entropy as ðeÞ ðeÞ LðμÞ ¼ C0 pθðsÞ , pθðμÞ , so that L(μ) is maximized uniquely at μ ¼ s because of the basic inequality ðeÞ ðeÞ ðeÞ C0 pθðsÞ , pθðμÞ H 0 pθðsÞ :
24
SECTION
I Foundations of information geometry
In this way, there is surprising simplicity such that the MLE s for μ is independent of observations xi’s so that it is sufficient to keep the p-dimensional vector s regardless of the data set. This is why s is called a sufficient statistic. The maximum entropy method discusses this form of the maximum entropy distribution to approximate the distribution describing a given problem, in which a variety of applications is conducted. For example, in an ecological study it is important to estimate a habitat distribution of a species on the earth, or some region based on presence data, see Phillips et al. (2006) for Maxent modeling. Let R be a study region for a target species, which is a finite set of sites. We are provided a data set on the presence of the species with a form fsðxi Þgni¼1, where s(x) is a feature vector defined at the site x of R with components consisting of environmental variables that have influence for the habitation. The sample mean s is easily obtained by taking the average of feature vectors s(xi)’s of the observation sites xi’s, so the estimated maximum entropy distribution pθðsÞ provides an informative knowledge in a global perspective. The prediction whether the target species is present or absent at a given site x with the feature vector s(x is conducted by whether a linear predictor θðsÞ> sðxÞ is greater or smaller than a threshold α. Or, the probability of the presence of the species in a given region R is estimated by X ðeÞ ^ ðRÞ ¼ pθðμÞ ðxÞ: xR
Through the above consideration, the complementary relationship between ðmÞ the exponential model MðeÞ and the MLE leaf N ðÞ is clarified to have an essential role on the statistical inference under the model by the couple of the e-connection and m-connection. This tells us that two methodologies of statistical modeling and estimation cannot be unified by a unique connection or the Riemann connection, in effect, the dual couple of connections is necessary to understand the interaction between the two methodologies. In this way, the dual geodesic lines reveal the complementary relationship between the model and estimation in Information Geometry. On the other hand, a four-dimensional spatio-temporal gravitational tensor metric gives the path of light as the unique geodesic line, constructing a magnificent mechanical world of general relativity in physics.
3 Power entropy and divergence We consider another type of divergence defined on P P rather than the KL divergence. In general, any divergence has a Riemann metric and a couple of linear connections on a differentiable manifold M if the definition domain of the divergence is restricted to M M, see Eguchi (1983, 1992). We discuss
Pythagoras theorem in information geometry Chapter
2
25
in this section that there is a class of divergence such that the Pythagoras theorem holds on P drawing by the couple of geodesic lines. If the divergence measure D(p, q) is symmetric, then the dual connections jointly reduce to the Riemann connection; if D(p, q) is asymmetric as seen in the KL divergence, then the dual connections do not coincide, which opens a rich world with the interplay between statistical modeling and estimation. In particular, the asymmetry of the KL divergence is elucidated to have a close relation to the Neyman–Pearson lemma for showing the optimal property of the likelihood ratio testing hypothesis, Eguchi and Copas (2006). There are many types of divergence with asymmetry, in which we focus on the class of power divergence as a typical example. We introduce the power entropy and divergence, while H0(p) and D0(p, q) defined in (3) and (4) can be called the log entropy and divergence since they are defined by a logarithmic function. The γ-power entropy is defined by minus the Lebesgue norm that is, Hγ (p) ¼ kpkLγ+1, where Z k pkLγ+1 ¼
pðxÞγ+1 ΛðdxÞ
1 γ+1
,
and the γ-power cross entropy is defined by ( )γ Z qðxÞ Cγ ðp, qÞ ¼ pðxÞ ΛðdxÞ, k qkLγ+1 in which we note that Cγ (p, p) ¼ Hγ (p). Thus, the γ-power divergence is defined by Dγ ðp, qÞ ¼ Cγ ðp, qÞ H γ ðpÞ: Here the H€ older inequality for the Lebesgue norm gives Z pðxÞqðxÞγ ΛðdxÞ k pkLγ+1 fk qkLγ+1 gγ since kqγkL(γ + 1) / γ ¼ {kqkLγ+1}γ . This implies that Cγ ðp, qÞ H γ ðpÞ with equality if and only if p ¼ q, or equivalently Dγ (p, q) is a divergence measure to satisfy the first axiom of distance. If the power γ is taken a limit to 0, then the γ-power entropy and divergence are reduced to the classical ones in the sense that lim
γ!0
Cγ ðp, qÞ +1 ¼ C0 ðp, qÞ: γ
26
SECTION
I Foundations of information geometry
It follows from the general formula (Eguchi, 1992) that Dγ (p, q) induces the Riemann metric g(γ) and the dual couple of linear connections Γ(γ) and *Γ(γ) on a given model M ¼ fpθ ðxÞ : θ Θg as follows: ðγÞ
∂2 Dγ ðpθ1 , pθ2 Þθ1 ¼θ,θ2 ¼θ , ∂θ1i ∂θ2j
Γij,k ðθÞ ¼
ðγÞ
∂3 D ðp , p Þ ∂θ1i ∂θ1j ∂θ2k γ θ1 θ2 θ1 ¼θ,θ2 ¼θ
* ðγÞ Γij,k ðθÞ
∂3 Dγ ðpθ2 , pθ1 Þθ1 ¼θ,θ2 ¼θ ∂θ1i ∂θ1j ∂θ2k
gij ðθÞ ¼
and ¼
with respect to θ-coordinate, where ðθi Þdi¼1 ¼ θ. Hence, they are written as
γ Z pθ ðxÞ ∂ ∂ ðγÞ gij ðθÞ ¼ pθ ðxÞ ΛðdxÞ (12) ∂θi ∂θj H γ ðpθ Þ
γ Z pθ ðxÞ ∂2 ∂ ðγÞ pθ ðxÞ ΛðdxÞ Γij,k ðθÞ ¼ ∂θk Hγ ðpθ Þ ∂θi ∂θj and * ðγÞ Γij,k ðθÞ
Z ¼
γ pθ ðxÞ ∂ ∂2 p ðxÞ ΛðdxÞ: ∂θk θ ∂θi ∂θj H γ ðpθ Þ
We note that these formulas can be written in a coordinate-free manner in Eguchi (1992). ðγÞ Let us take distinct p, q of P , then the γ-power geodesic line fpt ðxÞg connecting p and q are given
γ
γ 1γ pðxÞ qðxÞ ðγÞ pt ðxÞ ¼ ct ð1 tÞ +t , (13) H γ ðpÞ H γ ðqÞ where ct is a normalizing factor for rt to have a total mass one. Then, we have the following theorem, see Eguchi et al. (2014) for a more general class of U-divergence. Theorem 2. If the m-geodesic line connecting between p and q and the γ-power geodesic line connecting r and q are orthogonal at q with respect to the Riemann metric (12), then Dγ ðp, qÞ + Dγ ðq, rÞ ¼ Dγ ðp, rÞ: Proof. We define a two-parameter model as
γ
γ 1γ qðxÞ rðxÞ pθ ðxÞ ¼ ð1 tÞpðxÞ + tcs ð1 sÞ +s , H γ ðqÞ H γ ðrÞ
(14)
Pythagoras theorem in information geometry Chapter
2
27
where θ ¼ (t, s). Then, noting p(1, 1) ¼ q, the orthogonality condition is given by
γ Z pθ ðxÞ ∂ ∂ p ðxÞ ΛðdxÞjθ¼ð1,1Þ ¼ 0: ∂t θ ∂s H γ ðpθ Þ This condition is written by γ γ
Z qðxÞ rðxÞ fpðxÞ qðxÞg ΛðdxÞ ¼ 0, H γ ðqÞ Hγ ðrÞ which concludes (15), which completes the proof.
□
In Theorem 2, if the m-geodesic line connecting p and q is written by (7) and the γ-power geodesic line connecting r and q is ðγÞ r t ðxÞ
qðxÞ ¼ ct ð1 tÞ H γ ðqÞ
γ
γ 1γ rðxÞ +t , H γ ðrÞ
then it holds for any s, t [0, 1] that Dγ ðpt , qÞ ¼ Dγ ðpt , qÞ + Dγ ðq, r s Þ: Let us define the log-form of γ-power divergence by Z Z 1 1 γ+1 Δγ ðp, qÞ ¼ pðxÞqðxÞγ dΛðxÞ log pðxÞ dΛðxÞ log γ γðγ + 1Þ Z 1 log qðxÞγ+1 dΛðxÞ: + γ +1 It is interesting to note that, if p, q, and r satisfy the same condition in Theorem 2, then another form of the Pythagoras theorem holds as Δγ ðp, qÞ + Δγ ðq, rÞ ¼ Δγ ðp, rÞ:
(15)
see Proposition 1 discussed in Eguchi and Kato (2010). Let fx1 , …, xn g be a data set from a given model M ¼ fpθ ðxÞ : θ Θg. We introduce the γ-power loss function by
γ n X pθ ðxi Þ Lγ ðθÞ ¼ (16) H γ ðpθ Þ i¼1 ðγÞ and the minimum γ-power estimator is defined by θ^ ¼ argmin θΘ Lγ ðθÞ: The expected γ-power loss function is equal to n times of the γ-power cross entropy Cγ (p, pθ) if the underlying density or mass function is p. This ðγÞ shows immediately the consistency of θ^ for θ. The estimating equation is given by
28
SECTION
I Foundations of information geometry n X
pθ ðxi Þ
i¼1
γ
∂ log pθ ðxi Þ ∂ log H γ ðpθ Þ ∂θ ∂θ
¼ 0:
This is viewed as a weighted likelihood equation, in which the γ-powered weights {pθ(xi)γ } play a robust performance against a set of outliers. Furthermore, the minimum γ-power method can detect modes from a data set if the data set follows far from the parametric density function pθ(x), but follows a multimodal distribution like a normal mixture distribution, see Notsu et al. (2014) for the algorithm of spontaneous clustering. In a subsequent discussion we explore a specific property of the minimum γ-power estimator in a regression model. We discuss the maximum entropy method replacing from the classical entropy to the power entropy supposing the mean constraint for the statistic vector s(x) as N
ðmÞ
ðμÞ ¼ fp P : p fsðXÞg ¼ μg:
Then, an argument similar to that for the classical entropy leads to the maximal entropy density function pμ(x) with respect to the γ-power entropy Hγ (p) ðmÞ on the constraint N ðμÞ, see also Eguchi et al. (2014). In fact, the objective functional is given by LðpÞ ¼ H γ ðpÞ + θ> p fsðXÞ μg: The Euler-Lagrange equation is given by
1γ > pðγÞ μ ðxÞ ¼ cθ 1 + γθ sðxÞ :
(17)
where cθ is a normalizing factor and θ depends on μ satisfying the mean ðmÞ constraint. We observe that, for all p N ðμÞ ðγÞ H γ ðpðγÞ μ Þ H γ ðpÞ ¼ Dγ ðp, pμ Þ,
(18)
in which the maximum power entropy is given by 1
> γ +1 Hγ ðpðγÞ μ Þ ¼ ð1 + γθ μÞ :
Thus, we see that the maximum power entropy with the mean constraint characterizes a class of distributions (17) than the exponential model. For example, Student t-distribution and the generalized Pareto distribution are examples of the maximal power entropy distribution as follows: We consider the space of univariate density functions with a common support (∞, ∞). Then, the maximal γ-power entropy distribution has a density function ðγÞ pθ ðxÞ
ν +1 ν+1 2 2 Þ Γð 1 ðx μÞ 2 ¼ pffiffiffiffiffiffiffiffiffiffi ν 1 + , ν σ2 νπσ 2 Γð Þ 2
Pythagoras theorem in information geometry Chapter
2
29
which is nothing but the density function of the Student t-distribution, where θ ¼ (μ, σ) and the degrees ν of freedom is matched as γ ¼ 2/(ν + 1). This is immediate since we confirm that this p(γ) θ (x) satisfies (18). On the other hand, the classical maximal entropy is just the normal distribution with mean μ and variance σ 2, which are contrast with the t-distribution unless ν is infinite, or equivalently γ is a zero. If we consider a space of probability density functions of which the common support is [0, ∞), then the maximal γ-power entropy distribution is given by the generalized Pareto distribution with a density function 8 11ξ >
: 0 otherwise where the shape parameter ξ is matched as γ ¼ ξ/(ξ + 1). Similarly, it is easðγÞ ily seen for this pσ,ξ ðxÞ to satisfy (18). The classical maximal entropy distribution on [0, ∞) is not to mention the exponential distribution, which is again quite contrast with the generalize Pareto distribution for a finite shape parameter ξ. In a subsequent discussion, we assume the t-distribution for the conditional distribution in a linear regression model and observe an interesting phenomenon. We define a parametric model as
1 ðγÞ MðγÞ ¼ fpθ ðxÞ :¼ cθ 1 + γ θ> sðxÞ γ : θ Θg,
(19)
called a γ-power model, which is the family of density functions with the maximum γ-power entropy. Note that a γ-power model reduces to an exponential model defined in (8) if γ goes to 0. Thus, the model has a special property as follows, cf. Eguchi et al. (2014). Theorem 3. A γ-power model MðγÞ defined by (19) is a totally geodesic subspace in P with respect to the γ-power geodesic line (13). Proof. By definition a geodesic line connecting pθ0(γ) and pθ1(γ) of MðγÞ is given by " ( ðγÞ )γ ( ðγÞ )γ #1γ pθ0 ðxÞ pθ1 ðxÞ ct ð1 tÞ +t , ðγÞ ðγÞ H γ ðpθ1 Þ H γ ðpθ0 Þ which is written by
1γ C* 1 + γθ*> sðxÞ ,
(20)
where *
C ¼ ct
1t ðγÞ
fHγ ðpθ0 Þg
γ
+
!1γ
t ðγÞ
fH γ ðpθ1 Þg
γ
,
30
SECTION
I Foundations of information geometry
and θ∗ ¼ (1 T∗)θ0 + T∗θ1 with ∗
T ¼
1t
ðγÞ γ fH γ ðpθ0 Þg
+
!1
t
ðγÞ γ fH γ ðpθ1 Þg
t ðγÞ
fH γ ðpθ1 Þg
γ
:
Therefore, (20) concludes that the geodesic line connecting pθ0(γ) and pθ1(γ) is □ in MðγÞ , which completes the proof. Accordingly, we have a look at the Pythagoras foliation of the γ-power ðmÞ model MðγÞ and the mean equal space N ð Þ in (9). Let q be arbitrarily fixed in P. Then, similarly the Pythagoras theorem Dγ ðp, rÞ ¼ Dγ ðp, qÞ + Dγ ðq, rÞ ðmÞ
for any p N ðqÞ and any r MðγÞ . The projection of p onto MðγÞ by γ-power divergence is exactly given by q. We confirm a dualistic structure between the maximal power entropy model MðγÞ and the minimum γ-power estimation, which is exactly the same structure between the exponential model and the MLE. We discuss that the γ-power estimator for the γ-power model coincides with the MLE for the exponential model, cf. Eguchi et al. (2014). Theorem 4. We take a change of parameter from the canonical parameter θ to the mean parameter μ by the transformation μðθÞ ¼ pðγÞ fsðXÞg θ
in the γ-power model MðγÞ . Then, the minimum γ-power estimator for μ is exactly equal to the sample mean vector s defined in (11). Proof. The γ-power loss function is given by Lγ ðμÞ ¼ n
1 + γθðμÞ> s ðγÞ
fH γ ðpθðμÞ Þg
γ
with the inverse transformation θ(μ) of μ(θ), which equals the negative γ-power cross entropy as Lγ ðμÞ ¼ nCγ ðpθðμÞ , pθðsÞ Þ so that Lγ(μ) is maximized uniquely at μ ¼ s.
□
The maximum entropy method can be extended employing the γ-power estimator under this γ-power model, see Komori and Eguchi (2019) for detailed discussion. Thus, the space P is viewed as a dualistic Riemann space with the
Pythagoras theorem in information geometry Chapter
2
31
m-geodesic and γ-power geodesic subspaces. This is a reasonable extension from the geometry associated with the KL divergence to that with the γ-power divergence, which the e-geodesic line is a log-linear for the affine parameter; the γ-power geodesic line is a power-linear for the affine parameter.
4
Linear regression model
In data science, a regression analysis is one of the most familiar methods, which are routinely employed in various areas including statistics, machine learning, and artificial intelligence. The linear regression analysis aims at finding a relationship between a response variable and explanatory vector in a linear statistical modeling. The method of least squares determines the unique plane that minimizes the sum of squared distances between the given data and that plane. Thus, linear algebra is helpful to do the linear regression analysis, in which the LSE is given by the projection matrix from a data point onto the plane. We explore a viewpoint of Information Geometry beyond such linear algebra over Euclidean space. In a standard setting, a linear regression model for a response variable yi with an explanatory vector xi is formulated as y i ¼ β > xi + ε i
ði ¼ 1, …, nÞ
(21)
where β is a regression parameter and εi is an error variable. The linear Eq. (21) is written by a simultaneous equation on n as y ¼ Xβ + ε,
(22)
where y ¼ ðp1 , …, yn Þ> , ε ¼ ðE1 , …, En Þ> and X ¼ ½xi , …, xn > : In this way, we have the following assumption: A1. rank(X) ¼ p A2. ðεÞ ¼ 0 A3. ðεÞ ¼ σ 2 I (I: identity matrix of n-order) Then, the sum of squares ky Xβk mized by the LSE
2
with the Euclidean norm kk is mini-
1 β^ ¼ ðX> XÞ X> y:
(23)
~ 2 ¼k y Xβk ^ 2 + k Xðβ^ βÞk ~ 2 k y Xβk
(24)
In fact, for any β~ p
which is nothing but the original Pythagoras theorem on n . We define the mean squared error (MSE) for a given estimator β~ by ~ βÞ ¼ fk β~ βk2 g: MSEðβ,
32
SECTION
I Foundations of information geometry
Under the assumption A 1, A 2, A 3, the LSE β^ satisfies ^ βÞ MSEðβ, ~ βÞ MSEðβ, ~ which is known as the Gauss-Markov for any unbiased linear estimator β, theorem cf. Hastie et al. (2009). This inequality is immediate by the mean Pythagoras theorem ~ 2 g: ^ βÞ ¼ MSEðβ, ~ βÞ + fk β^ βk MSEðβ,
(25)
Thus, Gauss applied to the LSE for obtaining the prediction of an orbit of a planet, in which the theory of optimality (25) for the LSE is a timeless insight, cf. Abdulle and Wanner (2002). There are glorious contributions from the least squares method to the Gauss-Jordan elimination method for solving linear equations and the Gauss–Newton algorithm for the nonlinear least squares method. However, the notion of “regression” has established by Galton who was a geneticist at the end of the 19-th century. Then, the understanding for the regression to the mean was provided from an observation based on a data set of the pairs of father and son’ heights. Such an understanding is beyond the method to solve the linear equation, in which a basic idea for population genetics is led to a preserving mechanism in the alternation of generations. Fisher (1922) introduces the t-statistic for a hypothesis testing of the regression parameter under a normality assumption. Thereafter, the regression analysis has been widely applied to various sciences, decision-making, marketing, etc. Through the above discussion our overview is based on the Euclidean geometry as Eq. (24) is the Pythagoras theorem in the Euclid space n. However, we point that there is implicitly an idea of Information Geometry beyond the Euclidean geometry. In place of Assumption A2 and A3 we assume a normality as the conditional density function of the response vector y given the explanatory matrix X, that is,
k y Xβk2 1 pðyjXβ, σ 2 Þ ¼ exp : (26) n 2σ 2 ð2πσ 2 Þ2 Then the log-likelihood function is given by Lðβ, σ 2 Þ ¼
1 n k y Xβk2 log ð2πσ 2 Þ, 2 2σ 2
(27)
in which the MLE ðβ^0 , σ^20 Þ for (β, σ 2) is explicitly given, and in particular, β^0 is the same as the LSE (23) and σ^20 is given by the average residual sum of squares σ^20 ¼
1 k y Xβ^0 k2 : n
Let us take three normal density functions p ¼ p(jy, σ 2), q ¼ pðjXβ^0 , σ 2 Þ and r ¼ p(jXβ, σ 2). Then we observe that they satisfy the Pythagoras identity (6) in the sense of the Information Geometry.
Pythagoras theorem in information geometry Chapter
2
33
In the context of linear regression model, the γ-power loss function (16) is given by Lγ ðβ, σ 2 Þ ¼
n X
1 ð2πσ 2 Þ
γ γ +1
i¼1
n o 2 γ exp 2 ðyi β> xi Þ 2σ
and the minimum γ-power estimator is defined by ðβ^γ , σ^2γ Þ ¼ argmin Lγ ðβ, σ 2 Þ, ðβ, σ 2 Þ
in which the estimating equation for (β, σ 2) is given by n X
γ
i¼1 n X i¼1
e
>
e2σ2 ðyi β
2 γ ðy β> xi Þ 2σ 2 i
xi Þ
2
ðyi β> xi Þxi ¼ 0,
1 2 ðyi β xi Þ σ γ +1 2
>
(28)
¼ 0:
Hence, the simultaneous equation on n is written by X> Wγ ðβ, yÞðy XβÞ ¼ 0 where
(29)
2 n γ > Wγ ðβ, yÞ ¼ diag ðe2σ2 ðyi β xi Þ Þi¼1 :
Thus, the minimum γ-power estimator β^γ is defined by the solution of Eq. (28). In general, the solution is not known, and hence we employ an algorithm fβt gTt¼1 with an initial guess β1 as 1
βt +1 ¼ ðX> Wγ ðβt , yÞXÞ X> Wγ ðβt , yÞy, which is viewed as an iteratively reweighted least square method. If γ ¼ 0, then Wγ (β, y) becomes the identity matrix, which implies that just one ahead β1 is reduced to the LSE. In this way, a complicated procedure is necessary to get the estimator β^γ ; however, a robustness performance is supported as follows. For any observed response vector y n , the weight matrix has a bounded behavior as k Wγ ðβ, yÞðy XβÞk2
2nσ 2 : γ exp ð1Þ
(30)
This guarantees a robust performance of β^γ even if a pair of xi and yi is away from the linear model (21). Furthermore, if the i-th observation (xi, yi) has a very large residual as
34
SECTION
I Foundations of information geometry
y β> xi ≫ 1, i the i-th diagonal component of the weight matrix Wγ (β, y) becomes almost negligible in a strong sense that γ
jyi β> xi j ! ∞ ¼) e2σ2 ðyi β
>
xi Þ
2
j yi β> xi j ! 0,
and hence a reasonable estimate is given by the rest of data set. Thus, the γ-power estimator is super-robust against such heavy outlying. The graph of the γ-power loss function is flexible and sometimes non-convex in β, while the log likelihood function (27) is convex in which the Hesse matrix is constant, X>X. In fact, we see that the Hess matrix of the γ-power loss function is proportional to γ X> Wγ ðβ, yÞX X> Wγ ðβ, yÞðy XβÞðy XβÞ> Wγ ðβ, yÞX, 2 which is not always positive-definite. On the other hand, the LSE is sensitive to such a local perturbation, and have an unstable behavior. If γ goes to 0, the solution of (29) becomes near the LSE, and the boundedness in the inequality (30) breaks down. We next assume that the conditional distribution of yi given xi is a t-distribution of degrees ν of freedom as ) ν +1 ν +1 ( 2 2 Þ Γð ðyi β> xi Þ 2 pðyi jxi , β, σ Þ ¼ pffiffiffiffiffiffiffiffiffiffi ν 1 + νσ 2 νπσ 2 Γð Þ 2 2
(31)
for i ¼ 1, …, n. Under this assumption we adjust the power index γ and the degrees of freedom, ν to satisfy γ¼
2 : ν +1
(32)
Thus, this model is characterized by the maximal γ-power distribution with the matching condition discussed in the preceding section. The γ-power loss function is given by ( ) 2 n > X ðy β x Þ 1 i Lγ ðβ, σ 2 Þ ¼ 1+ i , γ νσ 2 ð2πσ 2 Þγ +1 i¼1 which is written by Lγ ðβ, σ 2 Þ ¼
1 ð2πσ 2 Þ
γ γ+1
n+
k y Xβk2 , νσ 2
which implies that the minimum γ-power estimator β^γ coincides with the LSE β^ since Lγ (β, σ 2) is proportional to the sum of squares ky Xβk2. Accordingly, the minimum γ-power estimator satisfies the Pythagoras foliation structure. On the other hand, the log-likelihood function under the model (31) is written by
Pythagoras theorem in information geometry Chapter
Lðβ, σ Þ ¼ 2
n X ν +1 i¼1
(
ðy β> xi Þ log 1 + i 2 νσ 2
2
2
35
)
n log σ 2 + const: 2
of which the estimating equation for (β, σ 2) is given by n X
νσ 2
> 2 i¼1 νσ +ðyi β xi Þ n ν +1 X νσ 2
2
i¼1
2
ðyi β> xi Þxi ¼ 0, 2
ðyi β> xi Þ n 2 ¼ 0, 2 > 2 Þ2 2σ 2 νðσ νσ +ðyi β xi Þ
Hence, the relationship between the MLE and the minimum γ-power estimator under the assumption (31) becomes reverse with that under (26), in which the MLE does not satisfy the foliation structure, but plays a robust performance since the weight function νσ 2/(νσ 2 + (yiβ>xi)2) becomes negligible for a large residual for the i-th observation. This phenomenon is closely related to the idea of M-estimation in the framework of robust statistics. We summarize the discussion above as follows: If the degrees ν of freedom goes to ∞ in (31) goes to a normal density function, then the power γ becomes 0 under the adjustment (32), and hence Lγ (β, σ) becomes a minus of the log-likelihood function L(β, σ) of (27). Thus, the MLE equals the LSE in this situation. We extend this observation to a general case of (ν, γ) satisfying (32), in which the minimum γ-power estimator equals the LSE under the t-distribution of ν-degrees of freedom.
5
Generalized linear model
A linear regression model does not cover a setting where a response variable Y has only nonnegative values or discrete values, for example, failure time, a population size, a number of events, survival/death, or case/control. It needed more than a century and a half from Gauss’ contribution that a logistic regression model was proposed for a regression analysis with a binary response, in which the regression function for a binary response Y with values 0 and 1 is given by rðxÞ :¼ ½YjX ¼ x ¼
1 , 1 + exp ðβ> xÞ
(33)
see Cox (1958). It is interesting that this function r(x) is basically equal to a sigmoid function in a single perceptron in a theory of neural networks, cf. Rosenblatt (1958). It is surprising to be thought of the same equation in the same year by Dr Cox, who pioneered survival analysis in statistics from the discussion, and Dr Rosenblatt, who gave the opportunity to move on to multilayer perceptron and deep learning in artificial intelligence. Let P be the space of all conditional density/mass functions, that is, P ¼ fpðyjxÞg , and we write the space of all regression functions by R ¼ frðxÞ :¼ ½YjX ¼ xg. If a linear predictor f(x, β) ¼ β>x is embedded
36
SECTION
I Foundations of information geometry
in R , then a natural modeling of a regression function is conducted. For example, if the conditional density function p(yjx) is modeled as a normal density of mean f(x, β) and variance σ 2, then we reduce to a linear regression model (21) discussed in the preceding section. If p(yjx) is modeled in an exponential type of distribution model, then the restriction for the response variable is relaxed under the dualistic Riemann structure discussed in Section 4. In a framework of generalized linear model (GLM), a conditional density/ mass function of Y given X ¼ x on a support S in is defined by
yθðxÞ ψðθðxÞÞ + cðy, ϕÞ , pðyjxÞ ¼ exp (34) ϕ where ϕ is a dispersion parameter, see Hastie et al. (2009) for detailed discussion. We assume that the range of the canonical parameter θ can be extended on , so the canonical link function is defined by θðxÞ ¼ β> x, in which the space of the regression parameter β is full, or the Euclidean space of dimension p. Accordingly, this linear modeling is justified even if Y is binary, or S ¼ {0, 1}. Also, GLM covers a case where Y is Gamma, Poisson, binomial, or negative binomial variable. The regression function is written by ½YjX ¼ x ¼ ηðβ> xÞ
(35)
where η(θ) ¼ (∂/∂θ)ψ(θ). For a given data set fðxi , yi Þgni¼1 from a GLM, the log-likelihood function for a regression parameter β is written by LðβÞ ¼
n X yi β> xi ψðβ> xi Þ + cðyi , ϕÞ ϕ i¼1
where ϕ is assumed to be known. Thus, the MLE β^0 is given by the solution of the estimating equation n X
fyi ηðβ> xi Þgxi ¼ 0
i¼1
of which the simultaneous equation is expressed as X> fy ηðXβÞg ¼ 0
(36)
by an argument similar to that from (21) to (22) for a linear model, where ηðXβÞ ¼ ðηðβ> x1 Þ, …, ηðβ> xn ÞÞ
>
In this framework we have a look at the Pythagoras theorem as observed in a linear regression model. We know from (36) that the MLE β^0 satisfies
Pythagoras theorem in information geometry Chapter
2
37
X> y ¼ X> ηðXβ^0 Þ and hence, for any β~ p ~ ¼ D0 ðXβ^ , XβÞ ~ Lðβ^0 Þ LðβÞ 0
(37)
~ is the KL divergence from p(jθ(y)) to pðjXβÞ ~ with where D0 ðXβ^0 , XβÞ pðzjXβÞ ¼
n Y
exp fzi β> xi ψðβ> xi Þg:
i¼1 >
Furthermore, if we write θðyÞ ¼ ðη1 ðy1 Þ, …, η1 ðyn ÞÞ vector y ¼ ðy1 , …, yn Þ> , Eq. (37) yields
for the response
~ ¼ D0 ðθðyÞ, Xβ^ Þ + D0 ðXβ^ , XβÞ ~ D0 ðθðyÞ, XβÞ 0 0
(38)
and so the Pythagoras foliation holds. If an identity link η(θ) ¼ θ holds, then Eq. (36) has a closed form of the solution, which is nothing but the LSE (23). In general, the learning algorithm ðβt ÞTt¼1 for obtaining the MLE β^0 is defined by iteratively 1
βt +1 ¼ ðX> Dðβt ÞXÞ X> fy ηðXβt Þg which provides a numeric solution of β^0 , where β1 is an initial guess, T is appropriately selected to stop the iteration, and DðβÞ ¼ diag η0 ðβ> x1 Þ, …, η0 ðβ> xn Þ In this way, the algorithm is called the iteratively reweighted least squares (IRLS) method. If we do not explicitly pose the distributional assumption (34), but pose the assumption (35), we call L(β) the quasi log-likelihood function, and β^0 the quasi MLE. For example, we derive the LSE without any normal distribution assumption, in which the LSE is viewed as the quasi MLE according to the terminology above. A statistic of goodness-fit-test is given by dev ¼ D0 ðθðyÞ, Xβ^0 Þ, called the deviance statistic, which is closely related to the Pythagoras (38). Typically, the deviance statistic in a normal linear model is called the residual sum of squares, that is, ^ 2, RSS ¼ k y Xβk which follows a chi-square distribution with (n p) degrees of freedom. We consider a case where a response variable Y follows a Bernoulli distribution, then we have an exponential expression p py ð1 pÞ1y ¼ exp fy log + log ð1 pÞg 1p
38
SECTION
I Foundations of information geometry
FIG. 3 A plot of logistic model.
which means θ ¼ log p=ð1 pÞ , or p ¼ 1=ð1 + exp ðθÞÞ , and hence the logistic model (33) is derived. We observe that the graph of the regression function forms a ruled surface in 2 ½0, 1 for a parameter β 2 as seen in Fig. 3. The conditional mass function of Y given X ¼ x defined by ðeÞ
pβ ðyjxÞ ¼
expfyβ> xg 1 + expðβ> xÞ
for y ¼ 0, 1. Given a data set fðxi , yi Þgni¼1 , the log-likelihood function is LðβÞ ¼
n X
yi β> xi log f1 + expðβ> xi Þg,
i¼1
of which the estimating equation is given by
n X expðyi β> xi Þ yi xi ¼ 0: 1 + expðβ> xi Þ i¼1
(39)
This equation is also efficiently solvable by the IRLS in open-source packages. On the other hand, the minimum γ-power estimator is given by the minimization of the γ-power loss function
γ n X expfðγ +1Þyi θðxi Þg γ +1 Lγ ðβÞ ¼ 1 + expððγ +1Þθðxi ÞÞ i¼1 see Hung et al. (2018). Hence, the estimating equation is given by
n X expfðγ +1Þθðxi Þg wγ ðxi , yi , βÞ yi x ¼ 0, 1 + expððγ +1Þθðxi ÞÞ i i¼1
(40)
where wγ ðx, y, βÞ ¼
expfðγ +1ÞyθðxÞg 1 + expððγ +1ÞθðxÞÞ
γ γ+1
:
We note that Eq. (40) is nonlinear in yi’s; while Eq. (39) is linear in yi’s. If γ is taken by a limit to 0, then the estimating equation is reduced to the log-likelihood Eq. (39).
Pythagoras theorem in information geometry Chapter
2
39
We next consider the γ-power model for the conditional mass function by 1
ðγÞ
pβ ðyjxÞ ¼
ð1 + γ y β> xÞγ
1
1 + ð1 + γ β> xÞγ
with a constraint 1 + γ β>x 0 for y ¼ 0, 1. The power log transformation of the odds ratio suggests as "( ðγÞ # )γ pβ ð1jxÞ 1 1 ¼ β> x, ðγÞ γ p ð0jxÞ β
which corresponds to a fact that the log odds ratio for the model p(e) β (yjx) is equal to the linear predictor β>x. The γ-power loss function Lγ ðβÞ ¼
n X i¼1
1 + γ yi β> xi f1 + γ
γ
γ +1 ðγÞ pβ ð1jxi Þβ> xi g
+ λi ð1 + γ β> xi Þ,
(41)
where λi’s are the Lagrange multipliers. Hence, the estimating equation is linear in yi’s as given by n X
γ yi fβ> xi Φ0 ðβ> xi Þ + Φðβ> xi Þg + Φ0 ðβ> xi Þ + γ λi xi ¼ 0,
i¼1
where
( ΦðθÞ ¼
1 + ð1 + γθÞ
γ +1 γ 1
1 + ð1 + γθÞγ
γ )γ +1
:
We note that Eq. (41) is linear in yi’s. We consider a case of exponential distributions, where the conditional density function of Y given X ¼ x is defined by ( θðβ> xÞ exp fyθðβ> xÞg if y 0 ðeÞ pβ ðyjxÞ ¼ (42) 0 otherwise with the log-link function θðβ> xÞ ¼ expðβ> xÞ . Given a data set the loglikelihood function is LðβÞ ¼
n X
yi expðβ> xi Þ + β> xi ,
i¼1
which is linear in the response variable yi. The estimating equation is given by n X i¼1
yi expðβ> xi Þ 1 xi ¼ 0,
(43)
40
SECTION
I Foundations of information geometry
which is also efficiently solved by the IRLS. On the other hand, the minimum γ-power estimator is given by the minimization of the γ-power loss function
n X γ > > Lγ ðβÞ ¼ β xi : exp γ yi expðβ xi Þ + γ +1 i¼1 Hence, the estimating equation is given by
n X γ wγ ðxi , yi , βÞ yi expðβ> xi Þ x ¼ 0: γ +1 i i¼1 where
γ > β x : wγ ðx, y, βÞ ¼ exp γ y expðβ xÞ + γ +1 >
If γ is taken by a limit to 0, then the estimating equation is reduced the loglikelihood equation. Similarly, we can consider the γ-power model, which is nothing but the generalized Pareto distribution model with density function ( 1 cγ θðβ> xÞf1 + γθðβ> xÞyg γ if y 0 ðγÞ pβ ðyjxÞ ¼ , (44) 0 otherwise with a parametrization different from a standard one, see Komori and Eguchi (2021) for the application to clustering. We note that the link function θ(β>x) in (44) may depend on γ. An argument similar to the binary regression yields that the γ-power loss function for the .γ-power model (44) is linear in the response variable yi’s, but we do not explore the further discussion here to avoid specific details.
6 Discussion In this chapter, we have focused on the Pythagoras theorem of Information Geometry with applications in the context of regression analysis. In fact, we consider a triangle connection among three density functions, in which the triangle relation is extended to the foliation structure in the space of all density functions. We took a closer look at the complementarity of the Pythagoras foliation, which comes from the duality between a statistical model and an estimator. When the assumption of the normal distribution is posed in the linear regression model, the maximum likelihood estimation and the LSE coincide. We found that the assumption of the Student’s t-distribution leads to matching the γ-power estimator with the LSE when adapted to that degrees of freedom of the t-distribution. Thus, we observed a different couple of the modeling and estimation method, in which both estimators are the same as the LSE.
Pythagoras theorem in information geometry Chapter
2
41
In physics, the space-time gravitational tensor metric gives the path of light as the Riemann geodesic that defines the shortest path, constructing a magnificent mechanical world image of general relativity. It is sufficient to consider the unique geodesic. On the other hand, it is necessary to consider the dual geodesic subspaces in order to reveal the complementary relationship between the model and the estimation in Information Geometry. The model determines the randomness with a data set; the estimator determines the way of learning the data set in which the two determinations need two different geodesic lines with duality. In the following sense, Information Geometry is a dualistic Euclidean geometry rather than dualistic Riemann geometry. We assume for arbitrarily fixed three density functions p, q, r there exists an exponential model MðeÞ defined in (8) which includes p, q, r. Then, we get D0 ðp, rÞ D0 ðp, qÞ D0 ðq, rÞ ¼ ðμp μq Þ> ðθq θr Þ where μ is the mean parameter of the exponential model, μp and μq are the mean parameter vectors designating p and q; θq and θr are the canonical parameter vectors designating q and r. Therefore, the Pythagoras theorem holds with the Euclidean orthogonality between the vectors μp μq and θq θr. In fact, the e-geodesic line connecting q with r is in MðeÞ ; while the m-geodesic line connecting q with p is not included in MðeÞ. It suffices to consider the line pθ(μt)(e) with μt ¼ (1 t)μp + tμq which is characterized by the projection of the m-geodesic line onto the exponential model via the KL divergence. However, the efficient way to build such an exponential model is not still established except for a case of probability mass functions.
References Abdulle, A., Wanner, G., 2002. 200 years of least squares method. Elem. Math. 57 (2), 45–60. Amari, S.I., 1982. Differential geometry of curved exponential families-curvatures and information loss. Ann. Stat. 10, 357–385. Amari, S.I., 1985. Differential-Geometrical Methods in Statistics. Lecture Notes on Statistics, vol. 28. Springer, p. 1. Amari, S.I., 2016. Information Geometry and Its Applications. vol. 194 Springer, Berlin. Amari, S.I., Kawanabe, M., 1997. Information geometry of estimating functions in semiparametric statistical models. Bernoulli 3 (1), 29–54. Amari, S.I., Nagaoka, H., 2007. Methods of Information Geometry. American Mathematical Society. Ay, N., Jost, J., Van Le, H., Schwachhofer, L., 2017. Information Geometry. Springer, Cham. Cox, D.R., 1958. The regression analysis of binary sequences. J. R. Stat. Soc. B 20 (2), 215–232. Efron, B., 1975. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Stat. 3 (6), 1189–1242. Eguchi, S., 1983. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 11 (3), 793–803. Eguchi, S., 1992. Geometry of minimum contrast. Hiroshima Math. J. 22 (3), 631–647.
42
SECTION
I Foundations of information geometry
Eguchi, S., Copas, J., 2006. Interpreting Kullback-Leibler divergence with the Neyman-Pearson lemma. J. Multivar. Anal. 97 (9), 2034–2040. Eguchi, S., Kato, S., 2010. Entropy and divergence associated with power function and the statistical application. Entropy 12 (2), 262–274. Eguchi, S., Komori, O., Ohara, A., 2014. Duality of maximum entropy and minimum divergence. Entropy 16 (7), 3552–3572. Fisher, R.A., 1922. The goodness of fit of regression formulae and the distribution of regression coefficients. J. R. Stat. Soc. 85, 597–612. Fujisawa, H., Eguchi, S., 2008. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 99, 2053–2081. Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media. Hayashi, M., Nagaoka, H., 2003. General formulas for capacity of classical-quantum channels. IEEE Trans. Inf. Theory 49 (7), 1753–1768. Hung, H., Jou, Z.Y., Huang, S.Y., 2018. Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics 74 (1), 145–154. Jaynes, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. II. 106 (4), 620–630. Komori, O., Eguchi, S., 2019. Statistical Methods for Imbalanced Data in Ecological and Biological Studies. Springer, Japan. Komori, O., Eguchi, S., 2021. A unified formulation of k-means, Fuzzy c-means and Gaussian mixture model by the Kolmogorov-Nagumo average. Entropy 23 (5), 518. Kurose, T., 1994. On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J. 46 (3), 427–433. Matsuzoe, H., 1998. On realization of conformally-projectively flat statistical manifolds and the divergences. Hokkaido Math. J. 27 (2), 409–421. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S., 2004. Information geometry of U-Boost and Bregman divergence. Neural Comput. 16 (7), 1437–1481. Nagaoka, H., Amari, S., 1982. Differential geometry of smooth families of probability distributions. University of Tokyo. Notsu, A., Komori, O., Eguchi, S., 2014. Spontaneous clustering via minimum gammadivergence. Neural Comput. 26 (2), 421–448. Phillips, S.J., Anderson, R.P., Schapire, R.E., 2006. Maximum entropy modeling of species geographic distributions. Ecol. Model. 190 (3–4), 231–259. Rao, C.R., 1945. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–89. Rao, C.R., 1962. Efficient estimates and optimum inference procedures in large samples. J. R. Stat. Soc. B (Methodol.) 24 (1), 46–63. Rosenblatt, F., 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65 (6), 386–408.
Further reading Amari, S.I., 1995. Information geometry of the EM and em algorithms for neural networks. Neural Netw. 8 (9), 1379–1408. Sonoda, S., Murata, N., 2019. Transport analysis of infinitely deep neural network. J. Mach. Learn. Res. 20 (1), 31–82.
Chapter 3
Rao distances and conformal mapping Arni S.R. Srinivasa Raoa,∗ and Steven G. Krantzb a
Laboratory for Theory and Mathematical Modeling, Medical College of Georgia and Department of Mathematics, Augusta University, Augusta, GA, United States b Department of Mathematics, Washington University in St. Louis, St. Louis, MO, United States ∗ Corresponding author: e-mail: [email protected]
Abstract In this chapter, we have described the Rao distance (due to C.R. Rao) and ideas of conformal mappings on 3D objects with angle preservations. Three propositions help us to construct distances between the points within the 3D objects in 3 and line integrals within complex planes. We highlight application of these concepts to virtual tourism. Keywords: Riemannian metric, Differential geometry, Conformal mapping, Probability density functions, Angle preservations, Virtual tourism, Complex analysis MSC: 53B12, 30C20
1
Introduction
C.R. Rao introduced his famous metric in 1949 (Rao, 1949) for measuring distances between probability densities arising from population parameters. This was later called by others as the Rao distance (see, for example, Atkinson and Mitchell, 1981; Rios et al., 1992). There are several articles available for the technicalities of Rao distance (see, for example, Amari, 1985; Chaudhuri, 2020; Chen et al., 2021; Jimenez Gamero et al., 2002; Nielsen, 2018) and its applications (see, for example, Maybank, 2005; Rao and Krantz, 2020; Taylor, 2019). An elementary exposition of the same appeared during his centenary in Plastino and Plastino (2020). Rao distances and other research contributions of renowned statistician C.R. Rao were recollected by those who celebrated his 100th birthday during 2020 (see, for example, Efron et al., 2020; Prakasa Rao and Majumder, 2020; Prakasa Rao et al., 2020). A selected list of Rao’s contributions in R programs was also made available during his centenary (Vinod, 2020). Handbook of Statistics, Vol. 45. https://doi.org/10.1016/bs.host.2021.06.002 Copyright © 2021 Elsevier B.V. All rights reserved.
43
44
SECTION
I Foundations of information geometry
Rao distances are constructed under the framework of a quadratic differential metric, Riemannian metric, and differential manifolds over probability density functions and the Fisher information matrix. C.R. Rao considered populations as abstract spaces which he called population spaces (Rao, 1949), and then he endeavored to obtain topological distances between two populations. In the next section, we will describe manifolds. Section 3 will highlight technicalities of Rao distances and Section 4 will treat conformal mappings and basic constructions. Section 5 will conclude the chapter with applications in virtual tourism.
2 Manifolds Let Df(a) denote the derivative of f at a for a n and f : n ! m : A function f : n ! m is differentiable at a n if there exists a linear transformation J : n ! m such that lim
h!0
kfða + hÞ fðaÞ JðhÞk ¼ 0: k hk
(1)
Here h n and fða + hÞ fðaÞ JðhÞ n : If f : n ! m is differentiable at a, then there exists a unique linear transformation J : n ! m such that (1) holds. The m n matrix created by DfðaÞ : n ! m is the Jacobian matrix, whose elements are 2 3 D1 f1 ðaÞ D2 f1 ðaÞ ⋯ Dn f1 ðaÞ 6 D f ðaÞ D f ðaÞ ⋯ D f ðaÞ 7 2 2 n 2 6 1 2 7 DfðaÞ ¼ 6 7 4 5 ⋮ ⋮ ⋮ D1 fm ðaÞ D2 fm ðaÞ ⋯
Dn fm ðaÞ
That is, J(h) ¼ Df(a). Since J is linear, we have J(b1λ1 + b2λ2) ¼ b1J(λ1) + b2J(λ2) for every λ1 , λ2 n and every pair of scalars b1 and b2. Also, the directional derivative of f at a in the direction of v for v n is denoted by D(f, v) is given by Dðf, vÞ ¼ lim
h!0
kfða + hvÞ fðaÞk khk
(2)
provided R.H.S.of (2) exists. When f is linear D(f, v)5f(v) for every v and every a. Since J(h) is linear, we can write fða + uÞ ¼ fðaÞ + Dðf, uÞ + kukΔa ðuÞ, where u n withkuk < r for r > 0, so that a + u Bða; rÞ for an n ball Bða; rÞ n ,
(3)
Rao distances and conformal mapping Chapter
Δa ðuÞ ¼
3
45
kfða + hÞ fðaÞk f 0 ðaÞ if h 6¼ 0, k hk Δa ðuÞ ! 0 as u ! 0:
When u ¼ hv in (3), we have fða + hvÞ fðaÞ ¼ hDðf, uÞ + khkkvkΔa ðuÞ
(4)
For further results on the Jacobian matrix and differentiability properties, refer to Tu (2011), Spivak (1964), and Apostol (1974). Consider a function f ¼ u + iv defined on the plane with u(z), vðzÞ for z ¼ ðx, yÞ : If there exists four partial derivatives ∂uðx, yÞ ∂vðx, yÞ ∂uðx, yÞ ∂vðx, yÞ , , , , ∂x ∂x ∂y ∂y
(5)
and these partial derivatives satisfy Cauchy–Riemann equations (6) ∂uðx, yÞ ∂vðx, yÞ ∂vðx, yÞ ∂uðx, yÞ ¼ and ¼ , ∂x ∂y ∂x ∂y
(6)
then Df ðaÞ ¼
∂uðx, yÞ ∂vðx, yÞ for u, v : +i ∂x ∂x
Theorem 1. Let f ¼ u(x, y) + iv(x, y) for u(x, y), v(x, y) defined on a subset Bδ(c) for δ, c, ðx, yÞ : Assume u(x, y) and v(x, y) are differentiable at an interior a ¼ ða1 , a2 Þ Bδ ðcÞ: Suppose the partial derivatives and lim ðx,yÞ!a vðx,ðx,yÞvðaÞ exists for a and these partial lim ðx,yÞ!a uðx,ðx,yÞuðaÞ yÞa yÞa derivatives satisfy Cauchy–Riemann equations at a. Then Df ðaÞ ¼
lim
ðu,vÞ!ða1 , a2 Þ
f ðu, vÞ f ðaÞ ðu, vÞ a
exists, and uðx, yÞ uðaÞ +i Df ðaÞ ¼ lim ðx, yÞ a ðx, yÞ!a
vðx, yÞ vðaÞ lim : ðx, yÞ!a ðx, yÞ a
If Df(a) exists for every Bδ(c) , then we say that f is holomorphic in Bδ(c) and is denoted as H(Bδ(c)). Readers are reminded that when f is a complex function in Bδ(c) that has a differential at every point of Bδ(c), then f H(Bδ(c)) if, and only if, the Cauchy–Riemann equations (6) are satisfied for every a Bδ(c). Refer to Krantz (2004, 2008), Rudin (1974), Apostol (1974); Krantz (2007) for other properties of holomorphic functions and their association with Cauchy–Riemann equations.
46
SECTION
I Foundations of information geometry
2.1 Conformality between two regions Holomorphic functions discussed above allow us to study conformal equivalences (i.e., angle preservation properties). Consider two regions Bδ ðcÞ, Bα ðdÞ for some c, d, δ, α : These two regions are conformally equivalence if there exists a function g H(Bδ(c)) such that g is one-toone in Bδ(c) and such that g(Bδ(c)) ¼ Bα(d). This means g is conformally one-to-one mapping if Bδ(c) onto Bα(d). The inverse of g is holomorphic in Bα(d). This implies g is a conformal mapping of Bα(d) onto Bδ(c). We will introduce conformal mappings in the next section. The two regions Bδ(c) and Bα(d) are homeomorphic under the conformality. The idea of manifolds is more general than the concept of a complex plane. It uses the concepts of the Jacobian matrix, diffeomorphism between m and n , and linear transformations. A set M n is called a manifold if for every a M, there exists a neighborhood U (open set) containing a and a diffeomorphism f1 : U !V for V n such that (7) f 1 ðU \ MÞ ¼ V \ k f0g The dimension of M is k. See Spivak (1964) and Tu (2011) for other details on manifolds. Further for an open set V1 k and a diffeomorphism f 2 : V1 ! n
(8)
such that Df2(b) has rank k for bV1. Remark 1. The image of the mapping is a k-dimensional manifold.
3 Rao distance A Riemannian metric is defined using an inner product function, manifolds, and the tangent space of the manifold considered. Definition 1. Riemannian metric: Let a M and TaM be the tangent space of M for each a. A Riemannian metric G on M is an inner product Ga : T a M T a M ! n constructed on each a. Here ðM, GÞ forms Riemannian space or Riemannian manifold. The tensor space can be imagined as collection of all the multilinear mappings from the elements in M as shown in Fig. 1. For general references on metric spaces refer to Kobayashi and Nomizu (1969) and Ambrosio and Tilli (2004). Let pðx, θ1 , θ2 , …, θn Þ be the probability density function of a random variable X such that x X, and θ1 , θ2 , …, θn are the parameters describing the
Rao distances and conformal mapping Chapter
3
47
FIG. 1 Mapping of elements in the manifold M in a metric space X to the tensor space TaM.
population. For different values of θ1 , θ2 , …, θn we will obtain different populations. Let us call P(x, Θn) the population space created by Θn for a chosen functional form of X. Here Θn ¼ fθ1 , θ2 , …, θn g: Let us consider another population space Pðx, Θn + ΔÞ, where Θn + Δ ¼ fθ1 + δθ1 , θ2 + δθ2 , …, θn + δθn g: Let ϕðx, Θn Þdx be the probability differential corresponding to P(x, Θn) and ϕðx, Θn + ΔÞdx be the probability differential corresponding to Pðx, Θn + ΔÞ. Let dϕðΘn Þ
(9)
be the differences in probability densities corresponding to Θn and Θn + Δ. In (9), C.R. Rao considered only the first-order differentials (Burbea and Rao, 1982; Micchelli and Noakes, 2005; Rao, 1949). The variance of the distribudϕ tion of is given by ϕ 2 X X dϕ d ¼ Fij dθi dθj (10) ϕ where Fij is the Fisher information matrix for 1 ∂ϕ 1 ∂ϕ Fij ¼ E ðfor E the expectationÞ: ϕ ∂θi ϕ ∂θj Constructions in Eq. (10) and other measures between probability distributions by C.R. Rao has played an important role in statistical inferences. Let f3 be a measurable function on X with differential ϕðx, Θn Þdx. This implies that f3 is defined on an interval S and there exists a sequence of step-functions {sn} on S such that
48
SECTION
I Foundations of information geometry
lim sn ðxÞ ¼ f3 ðxÞ almost everywhere on S
n!∞
for x X. If f3 is a σ-finite measure on X, then it satisfies Z Z dPðx, Θn Þ d dμ Pðx, Θn Þ dμ ¼ dθ dθi S S and d dθi
Z S
P0 ðx, Θn Þ d Pðx, Θn Þ dμ ¼ dθi Pðx, Θn Þ
Z Pðx, Θn Þdμ: S
Remark 2. Since the random variable X can be covered by the collection of sets Tn such that ∞ [
Tn ¼ X,
n¼1
μ is the σ-finite measure, and f2 ðxÞ > 0 and
Z f2 ðxÞμðdxÞ < ∞:
The idea of Rao distance can be used to compute the geodesic distances between two 2D spreadsheets on two different 3D objects as shown in Fig. 2. Burbea–Rao studied Rao distances and developed α- order entropy metrics for α (Burbea and Rao, 1982), given as 2 n X dϕ ðαÞ d ðθÞ ¼ ¼ Gij dθi dθj (11) ϕ α i, j
FIG. 2 Two 2D-shaped spreadsheets on 3 D objects. Metrics between such 2D-shaped spreadsheets can be studied based on Rao distances. The distance between the space of points of Xa located on the 3D shape A to the space of Xb located on the 3D object B can be measured using population spaces conceptualized in Rao distance.
Rao distances and conformal mapping Chapter
where ðαÞ
Gij ¼
Z
Pðx, Θn Þα ð∂θi log PÞ ∂θj log P dμ:
3
49
(12)
X
For the case of P(x, Θn) as a multinomial distribution where x X for a sample space X ¼ f1, 2, …, ng, Burbea and Rao (1982) showed that Z ðαÞ Gij ðθÞ ¼ Pðx, Θn Þα2 ð∂θi log PÞ ∂θj log P dμ: (13) X
The tensor of the metric in (13) is of rank n.
4
Conformal mapping
The storyline of this section is constructed around Figs. 3 and 4. First let us consider Fig. 3 for our understanding of conformal mapping property. Let z(t) be a complex-valued function for z(t) ¼ a t b for a, b : Suppose γ 1 is the arc constructed out of z(t) values. Suppose an arc Γ1 is formed by the mapping f4 with a representation f5 ðtÞ ¼ f4 ðzðtÞÞ for a t b: Let us consider an arbitrary point z(c) on γ 1 for a c b at which f4 is 0 holomorphic and f4 ðzðcÞÞ 6¼ 0: Let θ1 be the angle of inclination at c as shown 0 in Fig. 3, then we can write arg z0 (c) ¼ θ1. Let α1 be the angle at f4 ðzðcÞÞ, i.e., 0
arg f4 ðzðcÞÞ ¼ α1 :
FIG. 3 Mapping of points from the real line to an arc in the complex plane. Suppose γ 1 is the arc constructed out of z(t) values. An arbitrary point z(c) on γ 1 for a c b at which f4 is holo0 morphic and f 4 ðzðcÞÞ 6¼ 0 . Let θ1 be the angle of inclination at c. When we denote 0 0 arg f 4 ðzðcÞÞ ¼ α1 , it will lead to arg f 5 ðcÞ ¼ α1 + θ1
50
SECTION
I Foundations of information geometry
FIG. 4 3D objects and conformality with respect to different viewpoints. The angles θ1, θ2,…α, β1, β2 are all measured. The distances of the rays A0C0, A0C1, B0A0, B0C0, and B0C1 by assuming they are situated in a single 3 structure and also assuming they are situated in five different complex planes is computed. By visualizing the three objects are replicas of an actual tourist spot an application to virtual tourism is discussed in Section 5.
By this construction, 0
0
0
0
arg f5 ðcÞ ¼ α1 + θ1 , ðbecause arg f5 ðcÞ ¼ arg f4 ðzðcÞÞ + arg z ðcÞÞ 0
(14)
0
where arg f5 ðcÞ is the angle at f5 ðcÞ corresponding to Γ1. Suppose that γ 2 is another arc passing through z(c) and θ2 be the angle of inclination of the directed tangent line at γ 2. Let Γ2 be the arc corresponding to γ 2 and 0 0 arg f6 ðcÞ be the corresponding angle at f6 ðcÞ: Hence the two directed angles created corresponding to Γ1 and Γ2 are 0
arg f5 ðcÞ ¼ α1 + θ1 0
arg f6 ðcÞ ¼ α2 + θ2 This implies that 0
0
arg f6 ðcÞ arg f5 ðcÞ ¼ θ2 θ1 :
(15)
The angle created from Γ2 to Γ1 at f4(z(c)) is the same as the angle created at c on z(t) due to passing of two arcs γ 1 and γ 2 at c. Let A, B, and C be three 3D objects as shown in Fig. 4. Object A has a polygon-shaped structure with a pointed top located at A0. A pyramid-shaped
Rao distances and conformal mapping Chapter
3
51
structure B is located near object A and a cylinder-shaped object C. Object B has a pointed top located at B0. Let C0 be the nearest distance on C from B0 and C1 be the farthest distance C from B0. The norms of A0, B0, C0, and C1 are all assumed to be different. Suppose A0 ¼ (A01,A02, A03), B0 ¼ (B01, B02, B03), C0 ¼ (C01, C02, C03), C1 ¼ (C11,C12, C13). Various distances between these points are defined as below: " #1=2 3 X 2 A0 C0 ¼ kA0 C0 k ¼ ðA0i C0i Þ "
i¼1
3 X A0 C1 ¼ kA0 C1 k ¼ ðA0i C1i Þ2 i¼1
B0 A0 ¼ k B 0 A0 k ¼
" 3 X "
#1=2 ðB0i A0i Þ2
i¼1
3 X B0 C0 ¼ kB0 C0 k ¼ ðB0i C0i Þ2
"
#1=2
i¼1
3 X B0 C1 ¼ kB0 C1 k ¼ ðB0i C1i Þ2
(16) #1=2 #1=2
i¼1
Let α be the angle from the ray A0C1 to the ray A0C0 with reference to the point A0, β1 be the angle from the ray B0C1 to the ray B0C1 with reference to the point B0, and β2 be the angle from the ray B0A0 to the ray B0C0 with reference to the point B0. Proposition 1. All the four points A0, B0, C0, and C1 of Fig. 4 cannot be located in a single Complex plane. These points could exist together in 3 : Proof. Suppose the first coordinate of the plane represents the distance from x-axis, the second coordinate is the distance from y-axis, and the third coordinate represents the height of the 3D structures. Even if A03 ¼ B03 ¼ C03, still all the four points cannot be on the same plane because C03 cannot be equal to C13. Hence all the four points cannot be situated within a single complex plane. However, by the same construction, they all can be situated within a single 3D sphere or in 3 . □ Proposition 2. Suppose the norms and the third coordinates of A0, B0, C0, and C1 are all assumed to be different. Then, it requires five different complex planes, say, 1 , 2 , 3 , 4 , and 5 such that A0 , C0 1 , A0 , C1 2 , A0 , B0 3 , B0 , C0 4 , B0 , C1 5 . Proof. By Proposition 1 all the four points A0, B0, C0, and C1 cannot be in a single complex plane. Although the third coordinates are different two out of
52
SECTION
I Foundations of information geometry
four points can be considered such that they fall within a same complex plane. Hence, the five rays A0C0, A0C1, B0A0, B0C0, and B0C1 can be accommodated in five different complex planes. □ Proposition 3. The angles α, β1, β2 and five distances of (16) are preserved when A0, B0, C0, and C1 are situated together in 3 . Proof. The angle α is created while viewing the 3D structure C from point A0. The angle β1 is created while viewing the 3D structure C from the point B0. The angle β2 is created while viewing the 3D structure C from the point A0. These structures could be imagined to stand on a disc within a 3D sphere or in 3 even proportionately mapped to 3 . Under such a construction, without altering the ratios of various distances, the angles remain the same in the mapped 3 . □ Let us construct an arc A0C0(t1) ¼ a1 t1 b1 from the point A0 to C0 and call this arc C1. Here a1 , b1 and A0 , C0 1 . The points of C1 are A0C0(t1). The values of t1 can be generated using a parametric representation which could be a continuous random variable or a deterministic model. t1 ¼ ψ 1 ðτÞ for α1 τ β1 :
(17)
Then the arc length L(C1) for the arc C1 is obtained through the integral Z β1 A0 C0 ½ψ 1 ðτÞψ 0 ðτÞdτ: (18) LðC1 Þ ¼ 0 1 α1
Likewise, the arc lengths L(C2), L(C3), L(C4), and L(C5) for the arcs C2, C3, C4, and C5 are constructed as follows: Z β2 A0 C0 ½ψ 2 ðτÞψ 0 ðτÞdτ, (19) LðC2 Þ ¼ 1 2 α2
where A0C1(t2) ¼ a2 t2 b2 for a2 , b2 and A0 , C1 2 and with parametric representation t2 ¼ ψ 2(τ) for α2 τ β2. Z β3 B0 A0 ½ψ 3 ðτÞψ 0 ðτÞdτ, (20) LðC3 Þ ¼ 0 3 α3
where B0A0(t3) ¼ a3 t3 b3 for a3 , b3 and B0 , A0 3 and with parametric representation t3 ¼ ψ 3(τ) for α3 τ β3. Z β4 B0 C0 ½ψ 4 ðτÞψ 0 ðτÞdτ, (21) LðC4 Þ ¼ 0 4 α4
where B0C0(t4) ¼ a4 t4 b4 for a4 , b4 and B0 , C0 4 and with parametric representation t4 ¼ ψ 4(τ) for α4 τ β4.
Rao distances and conformal mapping Chapter
Z LðC5 Þ ¼
3
β5 α5
B0 C0 ½ψ 5 ðτÞψ 0 ðτÞdτ, 1
5
53
(22)
where B0C1(t5) ¼ a5 t5 b5 for a5 , b5 and B0 , C1 5 and with parametric representation t5 ¼ ψ 5(τ) for α5 τ β5. Remark 3. One could also consider a common parametric representation ψ i ðτÞ ¼ ψðτÞ for i ¼ 1, 2, …, 5 if that provides more realistic situation of modeling.
5
Applications
The angle preservation approach can be used in preserving the angles and depth of 3D images for actual 3D structures. Earlier Rao and Krantz (2020) proposed such measures in the virtual tourism industry. Advanced virtual tourism technology is in the early stage of development and it occupies a small fraction of the total tourism-related business. Due to the pandemics and other large-scale disruptions around tourist locations, there will be a high demand for virtual tourism facilities. One such was visualized during COVID-19 (Rao and Krantz, 2020). Let us consider a tourist location that has three 3D structured buildings as in Fig. 4. When a tourist visits the location in person, then such scenery can be seen directly from the ground level by standing in between the three structures or standing beside one of the structures. It is not always possible to see those features when standing above those buildings. Suppose a video recording is available that was recorded with regular video cameras; then the distances A0C0, A0C1, B0A0, B0C0, and B0C1 and angles α, β1, and β2 would not be possible to capture. That depth of the scenery and relative elevations and distances would not be accurately recorded. The in-person virtual experience at most can see the distance between the bottom structures of the tourist attractions. The same scenery of Fig. 4, when watched in person at some time of the day, would be different when it is watched at a different time due to the differences between day and night visions. The climatic conditions and weather would affect the in-person tourism experiences. All these can be overcome by having virtual tourism technologies proposed for this purpose (Rao and Krantz, 2020). The new technology called LAPO (live-streaming with actual proportionality of objects) would combine the precaptured videos and photos with live-streaming of the current situations using advanced drone technology. This would enhance the visual experience of live videos by mixing them with prerecorded videos. Such technologies will not only enhance the visualizations but also help in repeated seeing of the experiences and a closer look at selected parts of the videos. Mathematical formulations will assist in maintaining the exactness and consistency of the experiences. We hope that the
54
SECTION
I Foundations of information geometry
newer mathematical constructions, theories, and models will also emerge from these collaborations. The line integrals L(Ci) for i ¼ 1, 2, …, 5 are computed and the angles between the structures can be practically precomputed for each tourist location so that these can be mixed with the live streaming of the tourist locations. The angle preservation capabilities to maintain the angles between various base points can be preserved with actual measurements that will bring a real-time experience of watching the monuments. The virtual tourism industry has many potential advantages if it is supported by high-end technologies. Viewing the normal videos of tourist attractions through the Internet browser could be enriched with the new technology proposed (Rao and Krantz, 2020). These new technologies combined with more accurate preservations of the depth, angles, and relative distances would enhance the experiences of virtual tourists. Fig. 4 could be considered as a view of a tourist location. There are more realistic graphical descriptions available to understand the proposed technology LAPO using the information geometry and conformal mapping (Rao and Krantz, 2020). Apart from applying mathematical tools, there are advantages of virtual tourism. Although this discussion is out of scope for this article, we wish to highlight below a list of advantages and disadvantages of new virtual tourism technology taken from Rao and Krantz (2020). ADVANTAGES: (a) Environmental protection around ancient monuments; (b) Lesser disease spread at the high population density tourist locations; (c) Easy tour for physically challenged persons; (d) Creation of newer employment opportunities; (e) The safety of tourists; (f ) The possibility of the emergence of new software technologies. DISADVANTAGES: (a) Possible abuse of the technology that can harm the environment around the tourist locations; (b) Violation of individual privacy; (c) Misuse of drone technology. Overall there are plenty of advantages of developing this new technology and implementing it with proper care taken for protection against misuse. The importance of this technology is that it will have deeper mathematical principles and insights that were not utilized previously in the tourism industry. When the population mobility reduces due to pandemics the hospitality and business industry was seen to have severe financial losses. In such a situation, virtual tourism could provide an alternative source of financial activity. There are of course several advantages of real tourism too, like understanding the actual physical structures of the monuments, touching of the monuments (trees, stones, water, etc.), and feeling real climatic conditions.
Rao distances and conformal mapping Chapter
3
55
We are not describing here all the possible advantages and disadvantages between virtual vs real tourism experiences. The concept of Rao distance constructed on population spaces can be used to measure distances between two probability densities. One possible application is to virtual tourism. This article is anticipated to help understand various technicalities of Rao distances and conformal mappings in a clear way.
Acknowledgments ASRS Rao thanks to his friend Padala Ramu who taught him complex analysis and to all the students who had attended ASRSR’s courses on real and complex analysis.
References Amari, S., 1985. Differential Geometric Methods in Statistics. Lecture Notes in Statistics 28, Springer-Verlag. Ambrosio, L., Tilli, P., 2004. Topics on Analysis in Metric Spaces. Oxford Lecture Series in Mathematics and its Applications, 25, Oxford University Press, Oxford, ISBN: 0-19-852938-4. viii+133 pp. Apostol, T.M., 1974, Second ed. Addison-Wesley Publishing Co., Reading, Mass.-London-Don Mills, Ont. xvii+492 pp. Atkinson, C., Mitchell, A.F.S., 1981. Rao’s distance measure. Sankhya A 43 (3), 345–365. Burbea, J., Rao, C.R., 1982. Radhakrishna entropy differential metric, distance and divergence measures in probability spaces: a unified approach. J. Multivar. Anal. 12 (4), 575–596. Chaudhuri, P., 2020. C R Rao and Mahalanobis’ distance. Proc. Indian Acad. Sci. Math. Sci. 130 (1), 46. 5pp. Chen, X., Zhou, J., Hu, S., 2021. Upper bounds for rao distance on the manifold of multivariate elliptical distributions. Autom. J 129, 109604. IFAC. Efron, B., Amari, S.I., Rubin, D.B., Rao, A.S.R.S., Cox, D.R., 2020. C. R. Rao’s century. Significance 17, 36–38. https://doi.org/10.1111/1740-9713.01424. Jimenez Gamero, M.D., Mun˜oz Pichardo, J.M., Mu noz Garcı´a, J., Pascual Acosta, A., 2002. Rao distance as a measure of influence in the multivariate linear model. J. Appl. Stat. 29 (6), 841–854. Kobayashi, S., Nomizu, K., 1969. Foundations of Differential Geometry. Vol. II. Reprint of the 1969 Original. Wiley Classics Library. A Wiley-Interscience Publication. John Wiley & Sons, Inc., New York. xvi+468 pp. Krantz, S.G., 2004. Complex Analysis: The Geometric Viewpoint, second ed. Carus Mathematical Monographs, 23. Mathematical Association of America, Washington, DC. xviii+219 pp. Krantz, S.G., 2007. Complex Variables: A Physical Approach With Applications and MATLAB. Chapman & Hall/CRC. Krantz, S.G., 2008. A Guide to Complex Variables. Mathematical Association of America, Washington, DC. xviii+182 pp. Maybank, S.J., 2005. Int. J. Comput. Vis. 63, 191–206. Micchelli, C.A., Noakes, L., 2005. Rao distances. J. Multivar. Anal. 92 (1), 97–115 (English summary). Nielsen, F., 2018. An elementary introduction to information geometry. arXiv 2018, arXiv:1808.08271.
56
SECTION
I Foundations of information geometry
Plastino, A.R., Plastino, A., 2020. What’s the big idea? Cramer-Rao inequality and Rao distance. Significance 17, 39. https://doi.org/10.1111/1740-9713.01425. Prakasa Rao, B.L.S., Majumder, P.P., 2020. Preface [special issue in honour of professor Calyampudi Radhakrishna Rao’s birth centenary]. Proc. Indian Acad. Sci. Math. Sci. 130 (1), 38 (2pp). Prakasa Rao, B.L.S., Carter, R., Nielsen, F., Agresti, A., Ullah, A., Rao, T.J., 2020. C. R. Rao’s Foundational Contributions to Statistics: In Celebration of His Centennial Year. AMSTAT NEWS. https://magazine.amstat.org/blog/2020/09/01/crrao/. Rao, C.R., 1949. On the distance between two populations. Sankhya 9, 246–248. Rao, A.S.R.S., Krantz, S.G., 2020. Data science for virtual tourism using cutting edge visualizations: information geometry and conformal mapping. Cell Patterns. https://doi.org/10.1016/ j.patter.2020.100067. Rios, M., Villarroya, A., Oller, J.M., 1992. Rao distance between multivariate linear normal models and their application to the classification of response curves. Comput. Stat. Data Anal. 13 (4), 431–445. Rudin, W., 1974. Real and Complex Analysis, second ed. McGraw-Hill Series in Higher Mathematics. McGraw-Hill Book Co., New York-D€usseldorf-Johannesburg. xii+452 pp. Spivak, M., 1964. Calculus on Manifolds. A Modern Approach to Classical Theorems of Advanced Calculus. W. A. Benjamin, Inc., New York-Amsterdam. xii+144 pp. Taylor, S., 2019. Clustering financial return distributions using the Fisher information metric. Entropy 21, 110. Tu, L.W., 2011. An Introduction to Manifolds, second ed. Universitext. Springer, New York. xviii +411 pp. Vinod, H.D., 2020. Software-illustrated explanations of econometrics contributions by CR Rao for his 100-th birthday. J. Quant. Econ. 18 (2), 235–252. https://doi.org/10.1007/s40953-02000209-9.
Chapter 4
Cramer-Rao inequality for testing the suitability of divergent partition functions Angelo Plastinoa,e, Mario Carlos Roccaa,b,c,*, and Diana Monteolivaa,d a
National University La Plata (UNLP), IFLP-CCT-Conicet, La Plata, Argentina Departamento de Matema´tica, Universidad Nacional de La Plata, La Plata, Argentina c Consejo Nacional de Investigaciones Cientı´ficas y Tecnolo´gicas (IFLP-CCT-CONICET)-C. C. 727, La Plata, Argentina d Comisio´n de Investigaciones Cientı´ficas Provincia de Buenos AiresLa Plata, Argentina e Kido—Dynamics, Lausanne, Switzerland * Corresponding author: e-mail: [email protected] b
Abstract We tackle some interesting statistical physical tasks in which the partition function Z is divergent, as first noted by Fermi in 1924 (the hydrogen-atom instance). This fact generates obvious problems, even if the system at hand appears to be in a steady state. We will here regularize such divergent Z’s and find their associated Fisher information measures (FIMs). FIMs transform physical content into information and permit one to evaluate the Cramer-Rao inequality (CRI). Obedience to the CRI is a severe test that the concomitant probability density must pass. In the present case studies, these Fisher measures were unavailable before. Keywords: Divergent partition functions, Statistical mechanics, Fisher information, Cramer-Rao inequality
1
Introduction
In several important physical problems, the partition function Z diverges, as first noted by Fermi in 1924 (the hydrogen-atom instance) (Fermi, 1924). This fact entails that the canonical-ensemble Gibbs probability density, let us call it f(x), cannot be determined. Consequently, mean values of physical observables become unavailable. Further, the most important physical information quantifier, called the Fisher information measure (FIM) (Roy Frieden, 2004), is unavailable. Our information regarding these systems is in this sense null. Handbook of Statistics, Vol. 45. https://doi.org/10.1016/bs.host.2021.04.001 Copyright © 2021 Elsevier B.V. All rights reserved.
57
58
SECTION
I Foundations of information geometry
We will here regularize several partition functions Z’s and find their associated Fisher information quantifiers that transform physical content into informational measures. Then we will pass to evaluate the concomitant Cramer-Rao inequality and see if it is obeyed. In our present case studies, these previously unavailable Fisher measures are computed. Of course, to obtain a FIM you need a suitable density distribution (DD) f(x). We assume that our DD is derived from a Gibbs’ canonical ensemble treatment, whose main ingredient is a partition function (PF) Z. It is our goal here to discuss situations in which this PF diverges. For this endeavor, we appeal to a generalization (Plastino and Rocca, 2018; Plastino et al., 2018; Roy Frieden, 2004) of the well-known dimensional regularization approach (DRA). It constitutes one of the greatest advances in theoretical physics of the second part of the 20th century, with manifold applications in variegated fields of endeavor (Plastino and Rocca, 2018). In the generalized DRA of Plastino and Rocca (2018), the central idea is that, usually, many physical divergences take place for a certain space dimensionality ν, say three, but not for other νvalues. Thus, we compute Z not in three dimensions but in ν ones, i.e., we deal with a Z ν . Passing from Z ν to Z 3 is a complicated technical task that is described in all detail in Plastino et al. (2018). The procedure, however, is a standard one. Our present treatment is of an statistical nature. We appeal to a Gibbs’ canonical ensemble (of systems like the one that interests us) with a probability distribution proportional to exp βH, where H is the pertinent Hamiltonian and β the inverse temperature (T). We begin discussing our case studies below.
2 A first illustrative example Before we start to deal with more physically relevant cases, we will present a simple example so as to show how our generalization of dimensional regularization methodology (DRA) is used. Divergent integrals or integral with divergent kernels are amply utilized in applied sciences and engineering. Their proper mathematical interpretation is obtained by recourse to the theory of distributions of Gel’fand and Shilov (Gradshteyn and Ryzhik, 1980). One should also consult Schwartz (Halperin and Schwartz, 1952). One interesting feature of our generalized DRA procedure is the possibility of using free phase factors on occasion. This will be the case in our starting simplified model, given by the Hamiltonian H¼
U0 , r
(1)
for which the partition function can be cast, using this free phase, in terms of the inverse temperature β as Z ∞ βU 0 Z¼ e r d 3 x, (2) 0
Cramer-Rao inequality for testing the suitability Chapter
4
59
where U0 is a suitable coupling constant. The explanation for the introduction of this minus sign has a simple and mathematically rigorous explanation. The integral given by (2) does not make sense. Therefore, we must give it a physical meaning. In our method, we can multiply it by 1 or 1, in such a way that the partition function becomes positive. In our case, choosing the minus sign will render the correct result.
2.1 Evaluation of the partition function The key step is now to write the generalization of (2) to ν dimensions Z βU 0 Z ν ¼ e r dν x
(3)
The computation that follows may look intimidating. However, our main task in performing it was to look at a table of Integrals. Evaluating first the integral in the angles, we obtain ν Z ∞ βU0 2π 2 Z ν ¼ ν r ν1 e r dr: (4) 0 Γ 2 We consider now the integral Z ∞ β ν1 ν1 β μ1 βx ν1 +μ 2γ 2 2 ν , x ðx + γÞ e dx ¼ β γ Γð1 μ νÞe W ν1 2 +μ,2 γ 0
(5)
with jargðγÞj < π, Rð1 μ νÞ > 0, where W is the second Whittaker’s function, which is a special solution of Whittaker’s equation. We speak of a modified form of the confluent hypergeometric equation introduced by Whittaker in 1904. Note that the condition RðβÞ > 0 is not required, as made clear by Gradshteyn and Rizhik in their table (Gradshteyn and Ryzhik, 1980) [this formula appears in page 340, Eq. (7), called ET II 234(13)a, where reference is made to (Gradshteyn and Ryzhik, 1980) (Caltech’s Bateman Project). The last letter “a” indicates that an analytical extension has been performed]. Selecting μ ¼ 1 in (5), we obtain Z ∞ β β ν1 ν+1 β ν , (6) xν1 ex dx ¼ β 2 γ 2 ΓðνÞe2γ W ν+1 2 ,2 γ 0 The last relation is valid for ν 6¼ 0, 1, 2, 3, …. The following formula appears in the same table (Gradshteyn and Ryzhik, 1980) ν+1 β β β 2 2γβ ν ν+1 ν W ν+1 ¼ M ¼ e , (7) , , 2 2 2 2 γ γ γ where M is the first Whittaker’s function. We have then the result Z ∞ β xν1 ex dx ¼ βν ΓðνÞ 0
(8)
60
SECTION
I Foundations of information geometry
This integral can be evaluated using the generalized DRA obtained in Plastino and Rocca (2018). Note that for ν ¼ 1, 2, 3, …, the Gamma function diverges. Changing now β by β in (8) we have Z ∞ β xν1 ex dx ¼ ðβÞν ΓðνÞ: (9) 0
Eq. (9) displays a cut at RðβÞ > 0. One can therefore choose for β among three possibilities (β)ν ¼ eiπνβν, (β)ν ¼ eiπνβν, or ðβÞν ¼ cos ðπνÞβν. As the integral must be real for β to be real, we choose the last possibility of the three available ones and finally obtain Z ∞ β xν1 ex dx ¼ cos ðπνÞβν ΓðνÞ: (10) 0
We have then for (4) the result ν
Zν ¼
2π 2 ν cos πνðβU 0 Þ ΓðνÞ: Γ 2ν
(11)
We rewrite now this result in the form ν 2π 2 cos πνðβU 0 Þ Γð3 νÞ, Γ 2ν νðν 1Þðν 2Þ ν
Zν ¼
(12)
and define ν 2π 2 cos πνðβU0 Þ : Γ 2ν νðν 1Þðν 2Þ ν
f ðνÞ ¼
(13)
Notice then that Z n u ¼ f ðνÞΓðνÞ:
(14)
In order to obtain a finite result for the integral, we make use of the essential trick of the generalized DRA method, which consists in appealing to Laurent expansions in the parameter that causes divergences to occur, here ν. Thus, we Laurent-expand Z ν around ν ¼ 3. For this purpose, we first tackle f(ν) and find f ð3Þ ¼
2π ðβU0 Þ3 , 3
(15)
while for its derivative f 0 one has f ð3Þ 16 ln π + ln ðβU 0 Þ + C + 2 ln 2 f 0 ð3Þ ¼ : 2 3
(16)
Laurent’s expansion around ν ¼ 3 for f is then 0
f ðνÞ ¼ f ð3Þ + f ð3Þðν 3Þ +
∞ X n¼2
bn ðν 3Þn :
(17)
Cramer-Rao inequality for testing the suitability Chapter
4
61
For the partition function expansion, we need also the Laurent expansion of the Gamma function. From Gradshteyn and Ryzhik (1980) we obtain for Γ(3ν) ∞ X 1 C+ Γð3 νÞ ¼ cn ðν 3Þn : (18) 3ν n¼1 Multiplying the two distinct previous expansions we have for the Z-expansion ∞ X f ð3Þ f ð3ÞC f 0 ð3Þ+ Zν ¼ an ðν 3Þn : (19) 3ν n¼1 Our method’s tenets assert that the result of the divergent integral is then the independent term of the powers ν3, that is to say (C is the Euler-Mascheroni constant) Z ¼ f ð3ÞC f 0 ð3Þ: so that
π 16 3 2 Z ¼ ðβU 0 Þ ln π + ln ðβU0 Þ + 3C + 2 ln 2 : 2 3
(20)
(21)
If this technique sounds strange to you because you see it for the first time, remember that you can look at the 54 concomitant references cited in Plastino and Rocca (2018).
2.2 Instruction manual for using our procedure Having used it in the preceding paragraphs, we proceed now to provide the reader with a succinct scheme, i.e., an “instruction manual” for the generalized DRA. l
l
l l l l
You have a diverging function Z. The divergence takes place at, say, dimension ν ¼ ν0 (ν0 an integer). You write Z in ν dimensions (with ν real!) and compute it with the help of a really big table of functions and integrals. If you succeed, you have an expression for Z that is a function of ν. You Laurent-expand that ensuing function around ν ¼ ν0. You retain from the expansion only the ν0-independent terms. These terms constitute the result you were looking for.
2.3 Evaluation of hri This entails evaluation of still another integral. We pass now to evaluate the mean value of the distance r. We use the trick of multiplying Zhri to obtain a function Q of β and have then, in terms of a probability density f(r), Z Z βU 0 QðβÞ ¼ ZhriðβÞ ¼ re r d3 x ¼ rf ðrÞd 3 x: (22)
62
SECTION
I Foundations of information geometry
We have to apply now to Q(β) the rigmarole of six steps enunciated in Section 2.2, beginning by generalizing Q(β) to ν dimensions (second step) Z βU 0 QðβÞ ¼ Zhriν ¼ re r dν x: (23) As in the previous subsection, we tackle first the angular integral and obtain ν Z ∞ βU 0 2π 2 QðβÞ ¼ Zhriν ¼ ν r ν e r dr: (24) Γ 2 0 We work this integral out (second step) and obtain specific expression for Q(β) (third step), that is Laurent expand around ν ¼ 3 (fourth step) and retain the ν-independent terms (fifth step). We arrive then in the sixth step at h i 1π 37 ðβU0 Þ4 ln π + ln ðβU 0 Þ2 + 3C + 2 ln 2 hri ¼ : (25) Z6 6
2.4 Dealing with hr2i This entails applying our generalized DRA method to still another integral. In similar fashion as in the previous proceedings, working out the pertinent six steps, we calculate now hr2i. The result is h i 1 π 197 ðβU 0 Þ5 ln π + ln ðβU0 Þ2 + 3C + 2 ln 2 hr 2 i ¼ : (26) Z 60 30
2.5 Obtaining fisher information measure This is our main objective. Given a continuous probability distribution funcR tion (PDF) f(x) with x Δ and Δ f(x) dx ¼ 1, its concomitant Shannon Entropy S is Z Sð f Þ ¼ f ln ð f Þdx (27) Δ
a quantifier of global nature, as it is well known. The entropy is not very sensitive to strong changes in the distribution that may take place in a small-sized zone. Precisely, such is definitely not the case for FIM, that we denote by ðFÞ. It measures gradient contents (Roy Frieden, 2004). One has Z Z 1 df ðxÞ 2 dψðxÞ 2 (28) FðfÞ ¼ dx ¼ 4 dx dx Δ f ðxÞ Δ FIM can be looked at in several ways. (i) As a quantifier of the capacity to estimate a parameter. (ii) As the information amount of what can be gathered from a set of measurements. (iii) As quantifying the degree of order of a system (or phenomenon) (Roy Frieden, 2004), as has been strongly emphasized
Cramer-Rao inequality for testing the suitability Chapter
4
63
recently (Frieden and Hawkins, 2010). The division by f(x) in the FIM definition is best avoided if f(x) ! 0 at certain x-values. We overcome this trouble by appealing to real probability amplitudes f(x) ¼ ψ 2(x) (Roy Frieden, 2004), which yields a simpler form (with no divisors), as seen at the extreme right of the equation above. Accordingly, FIM is called a local measure (Roy Frieden, 2004). Let us begin by considering the density probability f βU 0
er f ¼ : Z
(29)
The information that it contains is conveyed by Fisher’s information measure, a functional of f, Z 1 ½rf 2 dx FðfÞ ¼ (30) Δ f ðxÞ In our case, we have FðfÞ ¼
1 ðβU 0 Þ2 Z
Z
βU 0
r 4 e r d3 x:
(31)
This integral must be regularized as well.
2.6 The six steps to obtain a finite Fisher’s information We pass (first step) to ν dimensions and face Z βU 0 1 F ð f Þ ¼ ðβU0 Þ2 r 4 e r dν x: Z Integrating over the angles first, as always, we have ν Z ∞ βU 0 1 2 2π 2 F ð f Þ ¼ ðβU 0 Þ ν r ν5 e r dr: Z 0 Γ 2
(32)
(33)
The result is now (second step) ν
1 2π 2 F ð f Þν ¼ ðβU 0 Þ2 ν cos πð5 νÞΓð5 νÞ, Z Γ 2
(34)
and we pass then to the third and fourth steps that yield, specifying ν ¼ 3, the fifth step relation FðfÞ ¼
4π ðβU 0 Þ2 , Z
(35)
a highly nontrivial result that strongly depends on the regularized partition function.
64
SECTION
I Foundations of information geometry
2.7 Cramer-Rao inequality (CRI) The CRI has been related to the celebrated Heisenberg uncertainty relation (HUR) for the D-dimensional quantum central problem (Sanchez Moreno et al., 2011). Additionally, Frieden has demonstrated that all UHRs can be deduced from the CRI (Roy Frieden, 2004). CRI is the product of the error jej in assessing r times FIM that must be > 1. Defining jej2 ¼ jhr 2 i hri2 j h i 197 1 π ðβU 0 Þ5 ln π + ln ðβU 0 Þ2 + 3C + 2 ln 2 ¼ Z 60 30 n h io 1π 37 2 4 2 ðβU 0 Þ ln π + ln ðβU 0 Þ + 3C + 2 ln 2 Z6 6
(36)
one immediately realizes that multiplying this by Fð fÞ ¼
4π ðβU 0 Þ2 , Z
(37)
one has the product F ð f Þe2 ¼
4π ðβU0 Þ2 ½hr 2 i hri2 , Z
(38)
that should be >1 to fulfill the Cramer-Rao inequality.
2.8 Numerical example Let k stand for Boltzmann’s constant. As a numerical example, we arbitrarily give the temperature, to facilitate computations, the form T ¼ k1 U 0 ð ln π + ln 2Þe 2 12 , 3C
37
(39)
entailing β ¼ ðkTÞ1 ¼ ½U0 ð ln π + ln 2Þe 2 12 3C
37
1
(40)
With this temperature, we have hri ¼ 0,
(41)
and hr 2 i ¼
5ðβU 0 Þ2 : 72
(42)
5π ðβU 0 Þ4 , 18Z
(43)
Thus, F ð f Þe2 ¼
Cramer-Rao inequality for testing the suitability Chapter
4
65
or equivalently, 2 F ð f Þe2 ¼ βU0 3
(44)
For the Cramer-Rao inequality to be verified, the following inequality must be satisfied 3C 37 1 2 2 βU 0 ¼ ½ð ln π + ln 2Þe 2 12 10=3 > 1, (45) 3 3 so that the CRI is respected.
3
A Brownian motion example
It is known that one can associate to Brownian motion in an external field the potential VðxÞ ¼ 1 U+ 0x2 , with U0 a convenient coupling constant. We will here look for the FIM linked to V (x).
3.1 The present partition function The partition function for this problem is Z ∞ βU 0 e1 + x2 dx, Z¼
(46)
∞
where, as before, β ¼ 1/(kT). k is, of course, Boltzmann’s constant. Set first of all y ¼ 1 + x2 and get Z ∞ 1 βU0 (47) Z¼ ðy 1Þ 2 e y dy: 1
Reiterating our consultation of the table (Gradshteyn and Ryzhik, 1980), we discover that we can write Z ∞ β β xν1 ðx uÞμ1 ex dx ¼ Bð1 μ ν,μÞuμ + ν1 ϕ 1 μ ν;1 ν, , (48) u u where B is the well-known Beta function. ϕ, instead, refers to the confluent hypergeometric function. Inspecting again (Gradshteyn and Ryzhik, 1980), we are able now to write lim ϕðα; γ; sÞ ¼ zαϕðα + 1; 2; zÞ,
γ!0
and then Z ¼ πβU 0 ϕ
1 ; 2; βU 0 : 2
(49) (50)
We have indeed managed to find a finite Z after following rather simple steps. Once again, the gist of our labor was to appeal to a convenient Table. We will next calculate with this Z some useful quantifiers in our route to the CramerRao expression.
66
SECTION
I Foundations of information geometry
3.2 Mean values of x-powers We pass to consider mean values of x-powers (moments) for the probability density (PDF) f(x) associated to our present Z. We had in (46) βU 0
e1+x2 : f ðxÞ ¼ Z
(51)
Odd momenta x2n+1, ðn ¼ 1, 2, 3, …::Þ will vanish on account of parity. Only momenta of the form x2n will then be considered. Z 1 ∞ 2n βU02 x e1 + x dx: hx2n i ¼ (52) Z ∞ As habitual, we set y ¼ 1 + x2 and write down Z 1 βU0 1 ∞ 2n hx i ¼ ðy 1Þn 2 e y dy, Z 1 which can be translated into βU 0 1 1 1 hx2n i ¼ Γ n + Γ n+ ϕ n; 2; βU 0 : 2 2 2 Z
(53)
(54)
We need the moment hx2 i ¼
πβU 0 1 ϕ ; 2; βU0 : 2 Z
(55)
For completeness’ sake, we give also hx4 i ¼
πβU0 3 ϕ ; 2; βU 0 : 2 Z
(56)
3.3 Tackling fisher For the PDF f of Eq. (46), one obtains the following expression for Fisher’s measure 2 βU 32 0 Z ∞ βU 0 2 de1 + x2 5 (57) Fð fÞ¼ e 1 + x2 4 dx, Z 0 dx that is Fð fÞ¼
8β2 U02 Z
Z
∞ 0
x2 ð1 + x2 Þ2
βU0
e1 + x2 dx:
(58)
As we should by now expect, we change variables in the fashion y ¼ 1 + x2 to encounter
Cramer-Rao inequality for testing the suitability Chapter
Fð fÞ ¼
4β2 U02 Z
Z
∞
1 βU0 y dy,
y2 ðy 1Þ2 e
4
67
(59)
1
that leads to F ð f Þ ¼ 2βU 0 ¼ 1:44x1023 U0 =T:
(60)
3.4 The present Cramer-Rao inequality It is worth to point out that the CRI has been related to the uncertainty relations for the D-dimensional central problem (Sanchez Moreno et al., 2011). Additionally, Frieden has demonstrated that all quantum uncertainties can be deduced from the CRI (Roy Frieden, 2004). We know already the value for jej2 ¼ jhx2ihxi2j. Our Cramer-Rao product jej2 F becomes accordingly 2πβ2 U 20 1 F ð f Þjej2 ¼ ϕ ; 2; βU0 1, (61) 2 Z which can be numerically demonstrated by giving values to the temperature (see figure below). Fig. 1 is very instructive. The CR inequality seems to be violated. But look at the concomitant temperatures at which this would happen. These are so high that could only be attained a fraction of a second after the Big Bang, when no ordinary matter existed yet. A quark gluon plasma prevailed at that time and, of course, our Brownian model loses all sense there. The Cramer-Rao relation seems to anticipate this nonsensical scenario by allowing for its violation.
FIG. 1 The Brownian Cramer-Rao product vs the temperature.
68
SECTION
I Foundations of information geometry
4 The harmonic oscillator (HO) in Tsallis’ statistics Tsallis statistics (TS) is a generalization of the conventional Boltzmann-Gibbs (BG) one (Tsallis, 2009). Tsallis entropies are characterized by a real parameter q such that if q ¼ 1, the ordinary BG entropic form is achieved (Tsallis, 2009). TS is a very active contemporary area of research. We also encounter TS divergences in the partition function. Tsallis entropy Sq is defined, for any real number q and a probability density p(x) in the fashion Z 1 q Sq ¼ 1 dxpðxÞ , (62) q1 and is successfully employed in variegated environments, replacing its Boltzmann-Gibbs counterpart (Tsallis, 2009).
4.1 The HO-Tsallis partition function We will consider the HO treatment for the Tsallis index q ¼ 4/3, that is appropriate for gravitational purposes (Plastino and Rocca, 2017). One has Z 1 Z ν ¼ ½1 + ðq 1Þβðp2 + r 2 Þ1q dν xd ν p: (63) In ν dimensions, we have (step 2), if q ¼ 43, Zν ¼
ν πν 3 Γð3 νÞ: 2 β
(64)
According to the tenets of reference (Plastino et al., 2018), if we work out the concomitant five IM steps, we get the three-dimensional expression Z¼
3 1 3π 3π ln +C : 2 β β
(65)
Since we demand that Z > 0, this leads to an upper bound for the temperature T to that effect T
0, we are again led to a temperature’s upper bound e3C , 3πkB 2
T
1: Zβ2
(77)
After some numerical manipulation, one sees that the CRI can be rewritten as 17 1023 > T½52:63 ln T,
(78)
which will be always satisfied. The second factor in the right-hand side becomes negative for T > 7:19 exp 22.
5 Failure of the Boltzmann-Gibbs (BG) statistics for Newton’s gravitation We emphasize the fact that our treatment is of an statistical character. We do not discuss a single system but a whole (Gibbs canonical) ensemble of them. Our observables are mean values taken over the associated canonical ensemble. We pass now to discuss the Fisher information and the CR product in the BG-Newton’s potential’s statistical instance. In it, we deal with an ensemble of two identical masses m that attract themselves. One of them is fixed at 2 p2 an appropriate “center.” Our Hamiltonian is H ¼ Gmr + 2m , where G is the gravitation constant, m is the pertinent mass, p is the momentum, and r measures the distance between masses.
5.1 Tackling Z ν The reader will not be surprised if, at this stage, we jump directly to the second step of the DR instruction manual so that our partition function is
Cramer-Rao inequality for testing the suitability Chapter
2
ν 2π 2
32
6 7 Z ν ¼ 4 ν5 Γ 2
Z
∞
r ν1
Z
0
∞
pν1 e
p2 Gm2 r 2m drdp,
4
71
(79)
0
and one passes to evaluate the concomitant integral. Perusing reference Plastino et al. (2018), we encounter that the pertinent procedure yields the desired partition function in the fashion ν ν ν 2π 2 2m 2 Z ν ¼ ν ðβGm2 Þ cos ðπνÞΓðνÞ, (80) β Γ 2 which is further re-elaborated in that reference, following steps 3–5 of the instruction manual (IM), to yield (see Eq. 80 in Plastino et al., 2018) our three-dimensionally renormalized gravitational partition function in the guise
5 3 π2 m 2 m 17 2 3 2 2 Z¼ ð2βGm Þ ln ð2πβGm Þ + 3C + ln 2 : (81) β 3 3 β
5.2 Mean values derived from our partition function (PP) We must follow the instruction manual (IM) for each mean value separately. We begin with the second IM step, by considering, in ν dimensions, the mean values for r and r2.
5.2.1 hri-Value We start with the product "
ν
2π 2 Z ν hri ¼ ν
Γ 2
#2 Z
∞ 0
rν
Z
∞
pν1 e
2 Gm2 p r 2m
drdp,
(82)
0
that yields, as shown in Plastino et al. (2018), 2ν ν ν 2π 2 2m Z ν hri ¼ ðβGm2 Þ cos ðπνÞΓðνÞ, ν β ð1 + νÞΓ 2
(83)
and reworking things exactly as detailed in Plastino et al. (2018) for steps 3–5 of the IM we find, for ν ¼ 3,
5 3 4 π2 m 2 m 2 4 2 2 hri ¼ ð2πβGm Þ + C + ln 2 2ψð5Þ : (84) ðβGm Þ ln Z3 β β
72
SECTION
I Foundations of information geometry
5.2.2 The hr2i instance We proceed in exactly the same way as above, starting with the product in ν dimensions " ν #2 Z Z ∞ ∞ 2 Gm2 p 2π 2 2 ν+1 Z ν hr i ¼ ν
r pν1 e r 2m drdp, (85) Γ 2 0 0 that yields (Plastino et al., 2018) ν
Z ν hr 2 i ¼
2π 2
ν ð1 + νÞð2 + νÞΓ 2
2ν ν 2m ðβGm2 Þ cos ðπνÞΓðνÞ, β
(86)
that, with further work, that is carefully detailed in Plastino et al. (2018), yields (steps 3–5 of the IM) the three-dimensional result
5 3 5 2 4 π2 m 2 m hr 2 i ¼ ð2πβGm2 Þ + C + ln 2 2ψð6Þ : (87) ðβGm2 Þ ln Z 15 β β We have now at hand the ingredients needed to evaluate the squared error e2 and the FIM.
5.3 Variance Δr 5 hr2i2hri2 We plot the variance in Fig. 2 vs the mass. The operating temperature is here T ¼ 3 K, the cosmic back-ground radiation one. As one can easily appreciate, SV positivity is restricted to an absurdly small range of masses. This indicates that the BG orthodox statistical mechanics framework (BGOSMF) completely fails here.
FIG. 2 Newton’s statistical variance (SV) Δr vs mass at T ¼ 3 K. Note that SV positivity is restricted to an absurdly small range of masses.
Cramer-Rao inequality for testing the suitability Chapter
4
73
5.4 Gravitational FIM This requires again the IM usage. We begin again discussing the ν-dimensional instance with " ν #2 Z ∞ Z ∞ 2 Gm2 p 2π 2 Z ν F ð f Þ ¼ ν β2 G2 m4 r ν5 pν1 e r 2m drdp Γ 2 0 0 (88) " ν #2 Z Z p2 2π 2 β2 ∞ ν5 ∞ ν+1 Gmr 2 2m + ν
r p e drdp: m2 0 Γ 2 0 It is shown in Plastino et al. (2018) just how the above relation leads us to the product ν ν2 4π ν 2m 2 Z ν F ð f Þ ¼ ν ðβGm2 Þ cos ðπνÞΓð4 νÞ β Γ 2 (89) ν 22+1 π ν ν ν 2ν+1 5ν2 1 + ν G β m cos ðπνÞΓðνÞ, Γ 2 which is reduced to three dimensions in Plastino et al. (2018), after steps 3–5, yielding n 9 5 1 7 1 22 π 2 β2 Gm2 Fð fÞ ¼ Z (90) o 3 5 5 13 + 22 π 2 G3 β2 m 2 ½ ln ð2π 2 G2 βm5 + 3C + 2 ln 2 5 :
5.5 Incompatibility between Boltzmann-Gibbs statistics (BGS) and long-range interactions If we compute Fisher’s information vs mass in Fig. 3 for T ¼ 3 K, disaster strikes. This is no surprise, as Tsallis and others have argued that BGS was devised for systems with short-range interactions (Tsallis, 2009). If we compute F for a temperature ¼ 3 K (cosmic background radiation), we find that it is positive ONLY for a mass m 0.21kg < m < 0.56kg., which makes no physical sense. As for the associated CRI, have a look at Fig. 3.
6
Statistics of gravitation in Tsallis’ statistics
6.1 Gravity-Tsallis partition function In Tsallis’ case (see Section 4), gravity’s partition function is given by 2 1 Z Gm p2 q1 ν ν Zν ¼ 1 + ðq 1Þβ d xd p: (91) r 2m +
74
SECTION
I Foundations of information geometry
FIG. 3 BG’s gravitation’s Cramer-Rao relation. The result is absurd.
Or, equivalently "
ν
2π 2 Z ν ¼ ν
Γ 2
#2 Z
∞
r
ν1
Z
∞
dr
0
p 0
ν1
2 1 Gm p2 q1 dp 1 + ðq 1Þβ (92) r 2m +
Assigning to q the value 43, and evaluating the precedent integral as in previous cases, we obtain " ν #2 ν ν 1 2π 2 3m 2 2m2 βG ν ν ν
Zν ¼ (93) B 4, B 2, ν 2 Γ 2 2β 3 2 2 Proceeding as before, we find Z¼
3 3 1 16π 3 3m 2 2m2 βG ½ ln ð2βπ 2 m5 G2 Þ + 2C ln 2 2, 2 352 2β 3
(94)
Z ¼ 73:39π 3 m15=2 β3=2 G3 ½ ln ð2βπ 2 m5 G2 Þ + 2C ln 2 2,
(95)
or
6.2 Gravity-Tsallis mean values for r and r2 We pass to calculate the mean value of r. It is given by 2 1 Z 1 Gm p2 q1 ν ν hriν ¼ d xd p, r 1 + ðq 1Þβ Z r 2m +
(96)
Cramer-Rao inequality for testing the suitability Chapter
75
4
or, equivalently, " ν #2 Z 2 1 Z ∞ ∞ 1 2π 2 Gm p2 q1 ν ν1
hriν ¼ r dr p dp 1 + ðq 1Þβ Z Γ 2ν r 2m + 0 0 (97) After evaluating the precedent integral we, arrive at " ν #2
ν+1 ν ν ν 1 β 2π 2 ν ð2mβÞ2 m2 G hriν ¼ B 4, B 2, 1 ν , Z 54 Γ 2 2 2
(98)
that can be recast as hri ¼
3
4 16π 2 β 32 ð2mβÞ2 Gm2 ln ½ð2mβÞðπm2 GÞ2 + 2C 2 63 567 Z
(99)
For the mean value of r2, instead, we deal with " # 5 X 3
5 128π 2 β2 1 2 2 2 2 2 ð2mβÞ2 Gm ln ½ð2mβÞðπm GÞ + 2C 2 hr i ¼ 2n 1 8505 Z n¼1 (100) Fig. 4 depicts the variance VT of our Tsallis’ probability distribution vs the mass at T ¼ 3 K. Notice that there is a lower bound to the mass of the order of 0.5 kg.
FIG. 4 Variance VT vs mass at T ¼ 3 K.
76
SECTION
I Foundations of information geometry
6.3 Tsallis’ Gravity treatment and Fisher’s information measure The Fisher’s information takes the form 2 Z 2 2 4 1 β G m 9p2 β Gm2 p2 F νð f Þ ¼ + dν xd ν p: 1 Z 3 r 2m r4 4m2
(101)
Evaluating the previous integral, we arrive at the result ν ν ν 7 1 1 ν 2 2 4 ð2GmÞ mβ 2 2 π β G m F νð f Þ ¼ πν ν ν 2Z 3 Γ Γ 3+ sin 2 2 2 2ν Γ ν 3 ΓðνÞ 9 1 νπ ν1 πν ν mβ 2 : ð2GmÞ + sin ν 24 Z m2 2 3 Γ 2
(102)
After the by now customary Laurent’s expansion we obtain 3 3 2 2 1 1 4 β2 G2 m4 1 1 π2 3 mβ 3 mβ π ð2GmÞ Fð fÞ ¼ ð2GmÞ 2 9 2Z 3 48 Z m 3 Γ 2
βm 3 3 11 ln ð2πGm2 Þ3 + ψ ψ + 2C : 3 2 2 3
(103)
Fig. 5 depicts the Fisher-Tsallis measure, that exhibits a pole at very low masses, in quite similar fashion as that of the Boltzmann-Gibbs statistics’ instance.
FIG. 5 Tsallis-Fisher measure vs mass at T ¼ 3 K.
Cramer-Rao inequality for testing the suitability Chapter
77
4
FIG. 6 Tsallis’ CRI vs mass at T ¼ 3 K.
6.4 Tsallis’ Gravity treatment and Cramer-Rao inequality (CRI) We have now all the ingredients needed for a CRI-computing, i.e., F ð f ÞV T ,
(104)
That is depicted in Fig. 6. The CRI is violated for very small masses.
7
Conclusions
In a rather simple fashion, we have successfully regularized divergent partition functions in several scenarios, validated by obedience to the Cramer-Rao inequality (CRI). Our regularization procedure allowed us to obtain the associated probability densities (PDs) that permit to construct the CRI, except for Boltzmann-Gibbs statistics in a gravitational scenario, as explained in Section 5. With these PDs we can construct the associated Fisher information measures (FIMs) that convey useful information regarding these scenarios. These FIM would remain inaccessible without our present treatment, that displays a remarkable feature, namely, l
l
l
The Boltzmann-Gibbs’ statistical treatment of Newton’s gravity does not work at all. Tsallis’ framework is the adequate one. At extremely high temperatures (of the order of 1022 K), we face a T-upper bound. This fact has already been reported, in another context, in Plastino et al. (2018) and Pennini et al. (2019). We also encounter a mass lower bound for Newton’s gravity in the Tsallis case.
78
SECTION
I Foundations of information geometry
References Fermi, E., 1924. Uber die wahrscheinlichkeit der quantenzust€ande. Z. Phys. 26, 54–56. Frieden, B.R., Hawkins, R.J., 2010. Quantifying system order for full and partial coarse graining. Phys. Rev. E 82, 066117–066125. Gradshteyn, I.S., Ryzhik, I.M., 1980. Table of Integrals, Series and Products. Academic Press, New York. Halperin, I., Schwartz, L., 1952. Introduction to the Theory of Distributions. Toronto University Press, Toronto. Pennini, F., Plastino, A., Rocca, M.C., Ferri, G.L., 2019. A review of the classical canonical ensemble treatment of Newton’s gravitation. Entropy 21, 677. Plastino, A., Rocca, M.C., 2017. Analysis of Tsallis’ classical partition function’s poles. Physica A 487, 196. Plastino, A., Rocca, M.C., 2018. Quantum field theory, feynman-, wheeler propagators, dimensional regularization in configuration space and convolution of lorentz invariant tempered distributions. J. Phys. Commun. 2, 115029–115033. Plastino, A., Rocca, M.C., Ferri, G.L., Zamora, D., 2018. Dimensionally regularized BoltzmannGibbs statistical mechanics and two-body Newton’s gravitation. Physica A 503, 793–801. 2918. Roy Frieden, B., 2004. Science From Fisher Information. University Press, UK, Cambridge. Sanchez Moreno, P., Plastino, A.R., Dehesa, J.S., 2011. A quantum uncertainty relation based on Fisher’s information. J. Phys. A Math. Theor. 44, 065301. Tsallis, C., 2009. Introduction to Nonextensive Statistical Mechanics Approaching a Complex World. Springer, Berlin.
Chapter 5
Information geometry r–Rao-type and classical Crame inequalities Kumar Vijay Mishraa,∗ and M. Ashok Kumarb a
United States CCDC Army Research Laboratory, Adelphi, MD, United States Department of Mathematics, Indian Institute of Technology Palakkad, Palakkad, India * Corresponding author: e-mail: [email protected] b
Abstract We examine the role of information geometry in the context of classical Cramer–Rao (CR) type inequalities. In particular, we focus on Eguchi’s theory of obtaining dualistic geometric structures from a divergence function and then applying Amari–Nagoaka’s theory to obtain a CR type inequality. The classical deterministic CR inequality is derived from Kullback–Leibler (KL) divergence. We show that this framework could be generalized to other CR type inequalities through four examples: α-version of CR inequality, generalized CR inequality, Bayesian CR inequality, and Bayesian α-CR inequality. These are obtained from, respectively, Iα-divergence (or relative α-entropy), generalized Csisza´r divergence, Bayesian KL divergence, and Bayesian Iα-divergence. Keywords: Bayesian bounds, Cramer–Rao lower bound, Renyi entropy, Fisher metric, Relative α-entropy
1
Introduction
Information geometry is a study of statistical models (families of probability distributions) from a Riemannian geometric perspective. In this framework, a statistical model plays the role of a manifold. Each point on the manifold is a probability distribution from the model. In a historical development, Prof. C R Rao introduced this idea in his seminal 1945 paper (Rao, 1945, Secs. 6,7). He also proposed Fisher information as a Riemannian metric on a statistical manifold as follows: Let P be the space of all probability distributions (strictly positive) on a state space . Assume that P is parametrized by a coordinate system θ. Then, the Fisher metric at a point pθ of P is
Handbook of Statistics, Vol. 45. https://doi.org/10.1016/bs.host.2021.07.005 Copyright © 2021 Elsevier B.V. All rights reserved.
79
80
SECTION
I Foundations of information geometry
Z
∂ ∂ p ðxÞ log pθ ðxÞ dx ∂θi θ ∂θj ∂ ∂ 0 Þ ¼ Iðp , p , ∂θi ∂θj0 θ θ 0
gi,j ðθÞ :¼ h∂i , ∂j ipθ :¼
(1)
(2)
θ¼θ
where I(pθ, pθ0 ) is the Kullback–Leibler (KL) divergence between pθ and pθ0 (or entropy of pθ relative to pθ0 ). Rao called the space based on such a metric a Riemann space and the geometry associated with this as the Riemannian geometry with its definitions of length, distance, and angle. Since then, information geometry has widely proliferated through several substantial contributions, for example, Efron (1975), Cencov (1981), Amari (1982), Amari (1985), Amari and Nagaoka (2000), and Eguchi (1992). Information-geometric concepts have garnered considerable interest in recent years with a wide range of books by Amari (2016), Ay et al. (2017, 2018), Barndorff-Nielsen (2014), Calin and Udris¸ te (2014), Kass and Vos (2011), Murray and Rice (2017), Nielsen (2021), Nielsen and Bhatia (2013), and Nielsen et al. (2017). This perspective is helpful in analyzing problems in engineering and sciences where parametric probability distributions are used, including (but not limited to) robust estimation of covariance matrices (Balaji et al., 2014), optimization (Amari and Yukawa, 2013), signal processing (Amari, 2016), neural networks (Amari, 1997, 2002), machine learning (Amari, 1998), optimal transport (Gangbo and McCann, 1996), quantum information (Grasselli and Streater, 2001), radar systems (Barbaresco, 2008; de Jong and Pribic, 2014), communications (Coutino et al., 2016), computer vision (Maybank et al., 2012), and covariant thermodynamics (Barbaresco, 2014, 2016). More recently, several developments in deep learning (Desjardins et al., 2015; Roux et al., 2008) that employ various approximations to the Fisher information matrix (FIM) to calculate the gradient descent have incorporated information-geometric concepts. We are aware of two strong motivations for studying information geometry. The first is the following: The pair of statistical models, namely linear and exponential families of probability distributions, play an important role in information geometry. These are dually flat in the sense that the former is flat with respect to the m-connection and the latter is flat with respect to the e-connection and the two connections are dual to each other with respect to Fisher metric (see Amari and Nagaoka, 2000, Sec. 2.3 and Chap. 3). We refer the reader to Kurose (1994) and Matsuzoe (1998) for further details on the importance of dualistic structures in Riemannian geometry. A close relationship between the linear and exponential families was known even without Riemannian geometry. These two families were shown to be “orthogonal” to each other in the sense that an exponential family intersects with the associated linear family in a single point at right angle, that is, a Pythagorean theorem with respect to the KL divergence holds at the point of intersection
Information geometry and classical Cram er–Rao-type inequalities Chapter
5
81
(see Fig. 1). This is interesting as it enables one to turn the problem of maximum likelihood estimation (MLE) on an exponential family into a problem of solving a set of linear equations (Csisza´r and Shields, 2004, Th. 3.3). This fact was extended to generalized exponential families and convex integral functionals (which includes Bregman divergences) by Csisza´r and Matu´sˇ Csisza´r and Matu´sˇ (2012, Sec. 4). An analogous fact was shown from a Riemannian geometric perspective for U divergences (a special form of Bregman divergences) and U models (student distributions are a special case) by Eguchi et al. (2014). A similar orthogonality relationship between power-law and linear families with respect to the Iα-divergence (or relative α-entropy) was established in Kumar and Sundaresan (2015a). The second motivation for information geometry (and also for this chapter) comes from the works of Amari and Nagaoka (2000, Sec. 2.5). Apart from showing that the e and m connections are dual to each other with respect to the Fisher metric, they also define, at every point p of a manifold S, a pair of ðeÞ ðmÞ spaces of vectors T ðmÞ p and T p and show that T p is flat with respect to the mconnection and T ðeÞ p is flat with respect to the e-connection and are “orthogonal” to each other with respect to the Fisher metric (see Fig. 2). Also the Fisher
FIG. 1 Orthogonality of exponential and linear families.
FIG. 2 Orthogonality of T ðmÞ and T ðeÞ p p .
82
SECTION
I Foundations of information geometry
metric in Eq. (1) for two tangent vectors X and Y can be given by hX, Yip ¼ ðeÞ ðeÞ hX(m), Y(e)ip, where XðmÞ T ðmÞ p , Y T p . They show that, for a smooth function f : S ! , kðdf Þp k2p ¼ ð∂i f Þp ð∂ j f Þp gi,j ðpÞ,
(3)
where gi, j are the entries of the inverse of the FIM defined in (1). This enables them to show that, for a random variable A : ! , V p ½A ¼ kðdE½AÞp k2p :
(4)
where E½A : P ! maps p 7! Ep[A], the expectation of A with respect to p and Vp[A], the variance (Amari and Nagaoka, 2000, Th. 2.8). This is interesting as this connects Riemannian geometry and statistics (as the left hand side is a statistical quantity and the right side is a differential geometric quantity). The above, when applied to a sub-manifold S of P, becomes V p ½A k ðdE½AÞp k2p :
(5)
Now, if θb ¼ ðθb1 , …, θbk Þ is an unbiased estimator of θ ¼ ðθ1 , …, θk Þ (assuming P that S is a k-dimensional manifold), then applying (5) to A ¼ i ci θbi for c ¼ ðc1 , …, ck ÞT , we get the classical Cramer–Rao lower bound (CRLB) cT V θ ðθb Þc cT GðθÞ1 c,
(6)
where V θ ðθb Þ is the covariance matrix of θ^ and G(θ) is the FIM. This is one among several ways of proving the Cramer–Rao (CR) inequality. This is interesting from a divergence function point of view as Fisher metric and the e and m connections can be derived from the KL divergence. Indeed, Eguchi (1992) proved that, given a (sufficiently smooth) divergence function, one can always come up with a metric and a pair of affine connections so that this triplet forms a dualistic structure on the underlying statistical manifold. In this chapter, we first apply Eguchi’s theory to the Iα-divergence and come up with a dualistic structure of a metric and a pair of affine connections. Subsequently, we apply Amari and Nagoaka’s above mentioned theory to establish an α-version of the Cramer–Rao inequality. We then extend this to generalized Csisza´r divergences and obtain a generalized Cramer–Rao inequality. We also establish the Bayesian counterparts of the α-Cramer–Rao inequality and the usual one by defining the appropriate divergence functions.
2 I-divergence and Iα-divergence In this section, we introduce Iα-divergence and its connection to Csisza´r divergences. We restrict ourselves to finite state space . However, all these may be extended to continuous densities using analogous functional analytic tools (see our remark on infinite in Section 2.1).
Information geometry and classical Cram er–Rao-type inequalities Chapter
83
5
The I-divergence between two probability distributions p and q on a finite state space, say ¼ f0, 1, 2, …, Mg, is defined as X X Iðp, qÞ :¼ pðxÞ log pðxÞ pðxÞ log qðxÞ, (7) x
where HðpÞ :¼
x
X
pðxÞ log pðxÞ
(8)
x
is the Shannon entropy and DðpkqÞ :¼
X
pðxÞ log qðxÞ
(9)
x
is the cross-entropy. Throughout the chapter, we shall assume that all probability distributions have common support . There are other measures of uncertainty that are used as alternatives to Shannon entropy. One of these is the Renyi entropy that was discovered by Alfred Renyi while attempting to find an axiomatic characterization to measures of uncertainty (Renyi et al., 1961). Later, Campbell gave an operational meaning to Renyi entropy (Campbell, 1965); he showed that Renyi entropy plays the role of Shannon entropy in a source coding problem where normalized cumulants of compressed lengths are considered. Blumer and McEliece (1988) and Sundaresan (2007) studied the mismatched (source distribution) version of this problem and showed that Iα-divergence plays the role of I-divergence in this problem. The Renyi entropy of p of order α, α 0, α6¼1, is defined as X 1 log pðxÞα : Hα ðpÞ :¼ 1α x The Iα-divergence (also known as Sundaresan’s divergence Sundaresan, 2002) between two probability distributions p and q is defined as I α ðp, qÞ
α1 X X qðxÞ 1 1 :¼ log log pðxÞ pðxÞα 1α αð1 αÞ kqkα x x ¼
(10)
X X X 1 1 1 log log pðxÞqðxÞα1 + log qðxÞα pðxÞα : 1α α αð1 αÞ x x x (11)
The first term in (10) is called the Renyi cross-entropy and is to be compared with the first term of (7). It should be noted that, as α ! 1, we have Iα(p, q) ! I(p, q) and Hα(p) ! H(p) (Kumar and Sundaresan, 2015a). Renyi entropy and Iα-divergence are related by the equation I α ðp, uÞ ¼ log jj H α ðpÞ.
84
SECTION
I Foundations of information geometry
The ubiquity of Renyi entropy and Iα-divergence in information theory was further noticed, for example, in guessing problems by Arıkan (1996), Sundaresan (2007), and Huleihel et al. (2017); and in encoding of tasks by Bunte and Lapidoth (2014). Iα-divergence arises in statistics as a generalized likelihood function robust to outliers ( Jones et al., 2001; Kumar and Sundaresan, 2015b). It has been referred variously as γ-divergence (Cichocki and Amari, 2010; Fujisawa and Eguchi, 2008; Notsu et al., 2014), projective power divergence (Eguchi and Kato, 2010; Eguchi et al., 2011), logarithmic density power divergence (Basu et al., 2011) and relative α-entropy (Kumar and Sundaresan, 2015a; Sundaresan, 2002). Throughout this chapter, we shall follow the nomenclature of Iα-divergence. Iα-divergence shares many interesting properties with I-divergence (see, e.g., Kumar and Sundaresan, 2015a, Sec. II for a summary of its properties and relationships to other divergences). For instance, analogous to I-divergence, Iα-divergence behaves like squared Euclidean distance and satisfies a Pythagorean property (Kumar and Mishra, 2018; Kumar and Sundaresan, 2015a). The Pythagorean property proved useful in arriving at a computation scheme (Kumar and Sundaresan, 2015b) for a robust estimation procedure (Fujisawa and Eguchi, 2008).
2.1 Extension to infinite The Cramer–Rao-type inequalities discussed in this chapter are obtained by applying Eguchi’s theory (Eguchi, 1992) followed by Amari–Nagaoka’s framework (Amari and Nagaoka, 2000, Sec. 2.5). While the former is applicable even for infinite , the latter (Amari and Nagaoka, 2000, Sec. 2.5) is applicable only for the finite case. This is a limitation on the applicability of the established bounds. Several works, notably Pistone (1995, 2007) have made significant contributions in this direction; see also Amari (2021) and Ay et al. (2017) for further details. A more interesting case from the applications perspective is when is infinite and S is finite-dimensional. It follows from the concluding remarks of (Amari and Nagaoka, 2000, Sec. 2.5) and via personal communication (dated June 29, 2021) with Prof. Nagaoka that the arguments of Amari and Nagaoka (2000, Sec. 2.5) would still “apply in its essence.” However, the formulation of these arguments in a mathematically rigorous way in the framework of infinite-dimensional differential geometry on PðÞ is worth investigating.
2.2 Bregman vs Csisza´r Bregman and Csisza´r are two classes of divergences with the I-divergence at their intersection. Our primary interest in this chapter is the geometry of Iα-divergence. This divergence differs from, but is related to, the usual Renyi divergence, which is a member of Csisza´r family. However, Iα-divergence is
Information geometry and classical Cram er–Rao-type inequalities Chapter
5
85
not a member of the Csisza´r family. Instead, it falls under a generalized form of Csisza´r f-divergences, whose geometry is different from that of Bregman and Csisza´r divergences (Zhang, 2004). In particular, Iα-divergence is closely related to the Csisza´r f-divergence Df as h i 1 I α ðp, qÞ ¼ log sgnð1 αÞ Df ðpðαÞ , qðαÞ Þ + 1 , (12) 1α where pðxÞα qðxÞα pðαÞ ðxÞ :¼ X , qðαÞ ðxÞ :¼ X , f ðuÞ ¼ sgnð1 αÞ ðu1=α 1Þ, u 0 α pðyÞ qðyÞα y
y
(cf. Kumar and Sundaresan, 2015a, Sec. II). The measures p(α) and q(α) are called α-escort or α-scaled measures (Karthik and Sundaresan, 2018; Tsallis et al., 1998). Observe from Eq. (12) that Iα-divergence is a monotone function of the Csisza´r divergence, not between p and q, but their escorts p(α) and q(α). For a strictly convex function f with f(1) ¼ 0, the Csisza´r f-divergence between two probability distributions p and q is defined as (also, see Csisza´r, 1991) X pðxÞ qðxÞf Df ðp, qÞ ¼ : qðxÞ x Note that the right side of (12) is Renyi divergence between p(α) and q(α) of order 1/α (Kumar and Sundaresan, 2015b). For an extensive study of properties of the Renyi divergence, we refer the reader to van Erven and Harremoe¨s (2014). The Csisza´r f-divergence is further related to the Bregman divergence Bf through X Df ðp, qÞ ¼ pðxÞBf ðqðxÞ=pðxÞ, 1Þ, (13) x
(Zhang, 2004). However, Iα-divergence differs from both Csisza´r and Bregman divergences because of the appearance of the escort distributions in (12).
2.3 Classical vs quantum CR inequality This chapter is concerned with the classical CR inequality to differentiate it with its quantum counterpart. In quantum metrology, the choice of measurement affects the probability distribution obtained. The implication of this effect is that the classical FIM becomes a function of measurement. In general, there may not be any measurement to attain the resulting quantum FIM (Braunstein and Caves, 1994). There are many quantum versions of classical FIM, e.g., based on the symmetric, left, and right derivatives. Petz (1996, 2007) showed that all quantum FIMs are a member of a family of Riemannian monotone metrics. Further, all quantum FIMs yield quantum CR inequalities
86
SECTION
I Foundations of information geometry
with different achievabilities (Liu et al., 2019). Quantum algorithms to estimate von Neumann’s entropy and α-Renyi entropy of quantum states (with Hartley, Shannon, and collision entropies as special cases for α ¼ 0, α ¼ 1, and α ¼ 2, respectively) have also been reported (Li and Wu, 2018). For geometric structure induced from a quantum divergence, we refer the reader to Amari and Nagaoka (2000, Chapter 7).
3 Information geometry from a divergence function In this section, we summarize the information-geometric concepts associated with a general divergence function. For detailed mathematical definitions, we refer the reader to Amari and Nagaoka (2000). For more intuitive explanations of information-geometric notions, one may refer to Amari’s recent book (Amari, 2016). We shall introduce the reader to a certain dualistic structure on a statistical manifold of probability distributions arising from a divergence function. For a detailed background on differential and Riemannian geometry, we refer the reader to Spivak (2005), Jost (2005), Gallot et al. (2004), and Do Carmo (1976). In information geometry, statistical models play the role of a manifold and the FIM and its various generalizations play the role of a Riemannian metric. A statistical manifold is a parametric family of probability distributions on with a “continuously varying” parameter space Θ (statistical model). A statistical manifold S is usually represented by S ¼ fpθ : θ ¼ ðθ1 , …, θn Þ Θ n g. Here, θ1 , …, θn are the coordinates of the point p in S and the mapping p! 7 ðθ1 ðpÞ, …, θn ðpÞÞ that takes a point p to its coordinates constitute a coordinate system. The “dimension” of the parameter space is the dimension of the manifold. For example, the set of all binomial probability distributions {B(r, θ) : θ (0, 1)}, where r is the (known) number of trials, is a onedimensional statistical manifold. Similarly, the family of normal distributions S ¼ fNðμ, σ 2 Þ : μ , σ 2 > 0g is a two-dimensional statistical manifold. The tangent space at a point p on a manifold S (denoted Tp(S)) is a linear space that corresponds to the “local linearization” of the manifold around the point p. The elements of Tp(S) are called tangent vectors of S at p. For a coordinate system θ, the (standard) basis vectors of a tangent space Tp are denoted by ð∂i Þp :¼ ð∂=∂θi Þp , i ¼ 1, …, n. A (Riemannian) metric at a point p is an inner product defined for any pair of tangent vectors of S at p. A metric is completely characterized by the matrix whose entries are the inner products between the basic tangent vectors. That is, it is characterized by the matrix GðθÞ ¼ ½gi,j ðθÞi,j¼1,…,n , where gi, j(θ) :¼ h∂i, ∂ji. An affine connection (denoted r) on a manifold is a correspondence between the tangent vectors at a point p to the tangent vectors at a “nearby” point p0 on the manifold. An affine connection is completely
Information geometry and classical Cram er–Rao-type inequalities Chapter
5
87
specified by specifying the n3 real numbers ðΓij,k Þp , i, j, k ¼ 1, …, n called the connection coefficients associated with a coordinate system θ. Let us restrict to statistical manifolds defined on a finite set ¼ fa1 , …, ad g. Let P :¼ PðÞ denote the space of all probability distributions on . Let S P be a submanifold. Let θ ¼ ðθ1 , …, θk Þ be a parameterization of S. Let D be a divergence function on S. By a divergence, we mean a nonnegative function D defined on S S such that D(p, q) ¼ 0 if p ¼ q. Let D* be another divergence function defined by D*(p, q) ¼ D(q, p). Given a (sufficiently smooth) divergence function on S, Eguchi (1992) defines a Riemannian metric on S by the matrix h i ðDÞ GðDÞ ðθÞ ¼ gi,j ðθÞ , where ðDÞ gi,j ðθÞ
∂ ∂ :¼ D½∂i , ∂j :¼ 0 Dðpθ , pθ0 Þ ∂θj ∂θi
θ¼θ0
where gi, j is the elements in the ith row and jth column of the matrix G, θ ¼ ðθ1 , …, θn Þ, θ0 ¼ ðθ01 , …, θ0n Þ, and dual affine connections r(D) and *
rðD Þ, with connection coefficients described by following Christoffel symbols ∂ ∂ ∂ ðDÞ 0 Dðpθ , pθ Þ Γij,k ðθÞ :¼ D½∂i ∂j , ∂k :¼ 0 ∂θi ∂θj ∂θk θ¼θ0 and ðD* Þ Γij,k ðθÞ
∂ ∂ ∂ 0 Þ :¼ D½∂k , ∂i ∂j :¼ Dðp , p θ θ ∂θk ∂θ0i ∂θj0
such that r(D) and rðD G(D) in the sense that
*
Þ
, θ¼θ0
are duals of each other with respect to the metric ðDÞ
ðDÞ
ðD* Þ
∂k gi,j ¼ Γki,j + Γkj,i :
(14)
When D(p, q) ¼ I(p,q), the resulting metric is called the Fisher information metric given by G(θ) ¼ [gi, j(θ)] with pθ ðxÞ ∂ ∂ X p ðxÞ log gi,j ðθÞ ¼ ∂θi ∂θj0 x θ pθ0 ðxÞ 0 θ ¼θ X ¼ ∂i pθ ðxÞ ∂j log pθ ðxÞ (15) x
¼ Eθ ½∂i log pθ ðXÞ ∂j log pθ ðXÞ ¼ Covθ ½∂i log pθ ðXÞ, ∂j log pθ ðXÞ:
88
SECTION
I Foundations of information geometry
The last equality follows from the fact that the expectation of the score function is zero, that is, Eθ ½∂i log pθ ðXÞ ¼ 0, i ¼ 1, …, n. The affine connection r(I) is called the m-connection (mixture connection) with connection coefficients X ðmÞ Γij,k ðθÞ ¼ ∂i ∂j pθ ðxÞ ∂k log pθ ðxÞ x
*
and is denoted r(m). The affine connection rðI Þ is called the e-connection (exponential connection) with connection coefficients X ðeÞ Γij,k ðθÞ ¼ ∂k pθ ðxÞ ∂i ∂j log pθ ðxÞ x
and is denoted r(e) (Amari and Nagaoka, 2000, Sec. 3.2).
3.1 Information geometry for α-CR inequality Set D ¼ Iα and apply the Eguchi framework. For simplicity, write G(α) for GðIα Þ . The Riemannian metric on S is specified by the matrix GðαÞ ðθÞ ¼ ðαÞ
½gi,j ðθÞ, where ðαÞ
ðI Þ
gi,j ðθÞ :¼ gi,jα
¼ ∂θ∂ 0 ∂θ∂ i I α ðpθ , pθ0 Þ j
(16) 0
θ ¼θ
" # X 1 ∂ ∂ α1 ¼ log pθ ðxÞpθ0 ðxÞ α 1 ∂θj0 ∂θi y θ0 ¼θ 2 3
(17)
6 7 pθ0 ðxÞα1 1 X 7 X ∂i pθ ðxÞ ∂j0 6 α1 5 4 α1 x pθ ðyÞpθ0 ðyÞ
(18)
¼
y
0
θ ¼θ 2 3 X X α2 α pθ ðxÞ ∂j pθ ðxÞ pθ ðyÞ pθ ðxÞα1 pθ ðyÞα1 ∂j pθ ðyÞ X 6 7 y y 7 ¼ ∂i pθ ðxÞ6 X 4 5 α 2 x ð pθ ðyÞ Þ y
(19) ¼ EθðαÞ ½∂i ð log pθ ðXÞÞ ∂j ð log pθ ðXÞÞ
EθðαÞ ½∂i log pθ ðXÞ EθðαÞ ½∂j log pθ ðXÞ ¼ Covθ ðαÞ ½∂i log pθ ðXÞ, ∂j log pθ ðXÞ ¼ ðαÞ
1 ðαÞ ðαÞ CovθðαÞ ½∂i log pθ ðXÞ, ∂j log pθ ðXÞ, α2
where pθ is the α-escort distribution associated with pθ,
(20) (21) (22)
Information geometry and classical Cram er–Rao-type inequalities Chapter
p ðxÞα ðαÞ pθ ðxÞ :¼ Xθ , pθ ðyÞα
5
89
(23)
y
ðαÞ
and EθðαÞ denotes expectation with respect to pθ . The equality (22) follows because 0 1 " # ðαÞ α X pðαÞ ðyÞ B pθ ðxÞ C pθ ðxÞ ðαÞ ðαÞ θ B C ¼α ∂i pθ ðxÞ ¼ ∂i @ X ∂ p ðxÞ pθ ðxÞ ∂i pθ ðyÞ : pθ ðxÞ i θ pθ ðyÞα A y pθ ðyÞ y
ðαÞ
If we define SðαÞ :¼ fpθ : pθ Sg, then (22) tells us that G(α) is essentially the usual Fisher information for the model S(α) up to the scale factor α. We shall call the metric defined by G(α) an α-information metric. We shall assume that G(α) is positive definite; see Kumar and Mishra (2020, pp. 39–40) for an example of a parameterization with respect to which this assumption holds. Let us now return to the general manifold S with a coordinate system θ. * Denote rðαÞ :¼ rðIα Þ and rðαÞ* :¼ rðIα Þ where the right-hand sides are as defined by Eguchi (1992) with D ¼ Iα. Motivated by the expression for the Riemannian metric in Eq. (16), define 0 1 α1 B C 0 p ðxÞ 1 ðαÞ 0B X θ : C ∂i ðpθ ðxÞÞ :¼ ∂ (24) α1 A α 1 i@ pθ ðyÞ pθ0 ðyÞ 0 y θ ¼θ
We now identify the corresponding connection coefficients as ðαÞ
ðI Þ
Γij,k :¼ Γij,kα
(25)
¼ I α ½∂i ∂j , ∂k " # X X 1 ðαÞ ðαÞ ¼ ∂ p ðxÞ ∂i ∂k ðpθ Þ + ∂i ∂j pθ ðxÞ ∂k ðpθ Þ α1 x j θ x
(26)
and ðαÞ*
ðI* Þ
Γij,k :¼ Γij,kα
(27)
90
SECTION
I Foundations of information geometry
¼ I α ½∂k , ∂i ∂j 2
1 α1 X B C 0 ðxÞ p 1 6 0 0B X θ 6 C ∂ p ðxÞ ∂i ∂j @ ¼ α1 A α 14 x k θ pθ ðyÞpθ0 ðyÞ y 0
3 7 7: 5
(28)
θ0 ¼θ
We also have Eq. (14) specialized to our setting: ðαÞ
ðαÞ
ðαÞ*
∂k gi,j ¼ Γki,j + Γkj,i :
(29)
(G , r , r ) forms a dualistic structure on S. We shall call the connecðαÞ tion r(α) with the connection coefficients Γij,k , an α-connection. When α ¼ 1, the metric G(α)(θ) coincides with the usual Fisher metric and the connections r(α) and r(α)* coincide with the m-connection r(m) and the e-connection r(e), respectively. A comparison of the expressions in Eqs. (15) and (22) suggests that the manifold S with the α-information metric may be equivalent to the Riemannian ðαÞ metric specified by the FIM on the manifold SðαÞ :¼ fpθ : θ Θ n g. This is true to some extent because the Riemannian metric on S(α) specified by the ðαÞ FIM is simply GðαÞ ðθÞ ¼ ½gij ðθÞ . However, our calculations indicate that the α-connection and its dual on S are not the same as the e- and the m-connections on S(α) except when α ¼ 1. The α-connection and its dual should therefore be thought of as a parametric generalization of the e- and m-connections. In addition, the α-connections in Eqs. (25) and (27) are different from the α-connection of Amari and Nagaoka (2000), which is a convex combination of the e- and m-connections. (α)
(α)
(α)*
3.2 An α-version of Cramer–Rao inequality We now apply Amari and Nagoaka’s theory (Amari and Nagaoka, 2000, 2.5) to derive the α-CR inequality. For this, we examine the geometry of P with *
respect to the metric G(α) and the dual affine connections r(α) and rðαÞ . Note P that P is an open subset of the affine subspace A1 :¼ fA : x AðxÞ ¼ 1g and the tangent space at each p P, T p ðPÞ is the linear space X AðxÞ ¼ 0g: A0 :¼ fA : x
For every tangent vector X T p ðPÞ, let XðeÞ p ðxÞ :¼ XðxÞ=pðxÞ at p and call it the exponential representation of X at p. The collection of exponential representations is then ðeÞ T ðeÞ p ðPÞ :¼ fXp : X T p ðPÞg ¼ fA : Ep ½A ¼ 0g,
(30)
Information geometry and classical Cram er–Rao-type inequalities Chapter
where the last equality is easy to check. Observe that Eq. (24) is 0 1 α1 B C ðαÞ p ðxÞ 0 1 ∂0i @X θ ∂i ðpθ ðxÞÞ ¼ α1 A pθ ðyÞpθ0 ðyÞα1 0 y θ ¼θ 3 2 X α1 6 pθ ðyÞα1 ∂i pθ ðyÞ7 7 6p ðxÞα2 ∂ p ðxÞ pθ ðxÞ y 7 6 ¼ 6 θ X i αθ 7 !2 7 6 p ðyÞ X θ 5 4 α y pθ ðyÞ "
5
91
(31)
y
# pθ ðxÞ pθ ðxÞ ¼ ∂ ð log pθ ðxÞÞ E ðαÞ ½∂i ð log pθ ðXÞÞ : pθ ðxÞ i pθ ðxÞ θ ðαÞ
ðαÞ
Define the above as an α-representation of ∂i at pθ. With this notation, the α-information metric is X ðαÞ ðαÞ gi,j ðθÞ ¼ ∂i pθ ðxÞ ∂j ðpθ ðxÞÞ: x
It should be noted that
ðαÞ Eθ ½∂i ðpθ ðXÞÞ
¼ 0. This follows since
ðαÞ
pθ ðαÞ ∂ log pθ : pθ i
ðαÞ
∂i ðpθ Þ ¼
When α ¼ 1, the right-hand side of Eq. (31) reduces to ∂i ð log pθ Þ. Motivated by Eq. (31), the α-representation of a tangent vector X at p is ðαÞ p ðxÞ ðeÞ pðαÞ ðxÞ ðeÞ XðαÞ ðxÞ :¼ ðxÞ ½X X E ðαÞ p p pðxÞ p pðxÞ p (32) ðαÞ p ðxÞ ðeÞ ðeÞ ¼ Xp ðxÞ EpðαÞ ½Xp : pðxÞ The collection of all such α-representations is T pðαÞ ðPÞ :¼ fXðαÞ p : X T p ðPÞg: Clearly Ep ½XpðαÞ ¼ 0. Also, since any A with Ep[A] ¼ 0 is ðαÞ
p B EpðαÞ ½B A¼ p e where e Ep ½B, with B ¼ B e :¼ BðxÞ
pðxÞ AðxÞ : pðαÞ ðxÞ
(33)
92
SECTION
I Foundations of information geometry
In view of Eq. (30), we have ðαÞ T ðeÞ p ðPÞ ¼ T p ðPÞ:
(34)
Now the inner product between any two tangent vectors X, Y T p ðPÞ defined by the α-information metric in Eq. (16) is hX, YipðαÞ :¼ Ep ½XðeÞ Y ðαÞ :
(35)
Consider now an n-dimensional statistical manifold S, a submanifold of P , together with the metric G(α) as in Eq. (35). Let T *p ðSÞ be the dual space (cotangent space) of the tangent space Tp(S) and let us consider for each Y Tp(S), the element ωY T *p ðSÞ which maps X to hX, Yi(α). The correspondence Y 7!ωY is a linear map between Tp(S) and T *p ðSÞ. An inner product and a norm on T *p ðSÞ are naturally inherited from Tp(S) by hωX , ωY ip :¼ hX, YiðαÞ p and k ωX kp :¼k XkðαÞ p ¼
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi hX, XipðαÞ :
Now, for a (smooth) real function f on S, the differential of f at p, (d f )p, is a member of T *p ðSÞ which maps X to X( f ). The gradient of f at p is the tangent vector corresponding to (df )p, hence, satisfies , ðd f Þp ðXÞ ¼ Xð f Þ ¼ hðgrad f Þp , XiðαÞ p
(36)
: k ðd f Þp k2p ¼ hðgrad f Þp , ðgrad f Þp iðαÞ p
(37)
and
Since grad f is a tangent vector, grad f ¼
n X
hi ∂ i
(38)
i¼1
for some scalars hi. Applying Eq. (36) with X ¼ ∂j, for each j ¼ 1, …, n, and using Eq. (38), we obtain * +ðαÞ n X ð∂j Þðf Þ ¼ hi ∂i , ∂j i¼1
n X ¼ hi h∂i , ∂j iðαÞ i¼1
n X ðαÞ ¼ hi gi, j , j ¼ 1, …,n: i¼1
Information geometry and classical Cram er–Rao-type inequalities Chapter
5
93
This yields h i1 ½h1 , …, hn T ¼ GðαÞ ½∂1 ð f Þ, …, ∂n ð f ÞT , and so grad f ¼
X ðαÞ ðgi,j Þ ∂j ð f Þ∂i :
(39)
i, j
From Eqs. (36), (37), and (39), we get X ðαÞ ðgi,j Þ ∂j ð f Þ∂i ð f Þ kðd f Þp k2p ¼
(40)
i, j
where (gi, j)(α) is the (i, j)th entry of the inverse of G(α). With these preliminaries, we state results analogous to those in Amari and Nagaoka (2000, Sec. 2.5). Theorem 1 (Kumar and Mishra, 2020). Let A : ! be any mapping (that is, a vector in . Let E½A : P ! be the mapping p 7! Ep[A]. We then have VarpðαÞ
p ðA E ½AÞ ¼kðdEp ½AÞp k2p , p pðαÞ
(41)
where the subscript p(α) in Var means variance with respect to p(α). Proof. For any tangent vector X T p ðPÞ, X XðxÞAðxÞ XðEp ½AÞ ¼ x
(42)
¼ Ep ½XðeÞ p A ¼ Ep ½XðeÞ p ðA Ep ½AÞ:
(43)
Since A Ep ½A T ðαÞ p ðPÞ (cf. 34), there exists Y T p ðPÞ such that A Ep ½A ¼ Y ðαÞ p , and grad(E[A]) ¼ Y. Hence we see that ðαÞ kðdE½AÞp k2p ¼ Ep ½Y ðeÞ p Yp
¼ Ep ½Y ðeÞ p ðA Ep ½AÞ
ðaÞ pðXÞ ðeÞ ½AÞ + E ½Y ½AÞ ðA E ¼ Ep ðA E ðαÞ p p p p pðαÞ ðXÞ ðbÞ pðXÞ ¼ Ep ðαÞ ðA Ep ½AÞðA Ep ½AÞ p ðXÞ
94
SECTION
I Foundations of information geometry
pðXÞ pðXÞ ½AÞ ½AÞ ðA E ðA E p p pðαÞ ðXÞ pðαÞ ðXÞ pðXÞ ¼ VarpðαÞ ðαÞ ðA Ep ½AÞ , p ðXÞ ¼ EpðαÞ
where the equality (a) is obtained by applying (32) to Y and (b) follows because Ep[A Ep[A]] ¼ 0. □ Corollary 1 (Kumar and Mishra, 2020). If S is a submanifold of P, then pðXÞ (44) ðA Ep ½AÞ kðdE½AjS Þp k2p VarpðαÞ ðαÞ p ðXÞ with equality if and only if ðαÞ A Ep ½A fXðαÞ p : X T p ðSÞg ¼: T p ðSÞ:
We use the aforementioned ideas to establish an α-version of the CR inequality for the α-escort of the underlying distribution. This gives a lower ðαÞ bound for the variance of the unbiased estimator θ^ in S(α). Theorem 2 (α-version of Cramer–Rao inequality (Kumar and Mishra, 2020)). ðαÞ Let S ¼ fpθ : θ ¼ ðθ1 , …, θm Þ Θg be the given statistical model. Let θ^ ¼
ðαÞ ðαÞ ðθ^1 , …, θ^m Þ be an unbiased estimator of θ ¼ ðθ1 , …, θm Þ for the statistical ðαÞ 1 ðαÞ model SðαÞ :¼ fp : pθ Sg . Then, Var ðαÞ ½θ^ ðXÞ ½GðαÞ , where θ(α) θ
ðαÞ
θ
denotes expectation with respect to pθ . On the other hand, given an unbiased estimator θ^ ¼ ðθ^1 , …, θ^m Þ of θ for S, there exists an unbiased estimator ðαÞ ðαÞ ðαÞ ðαÞ 1 θ^ ¼ ðθ^ , …, θ^ Þ of θ for S(α) such that Var ðαÞ ½θ^ ðXÞ ½GðαÞ . 1
θ
m
(We follow the convention that, for two matrices M and N, M N implies that M N is positive semidefinite.)
ðαÞ ðαÞ ðαÞ Proof. Given an unbiased estimator θ^ ¼ ðθ^1 , …, θ^m Þ of θ ¼ ðθ1 , …, θm Þ for the statistical model S(α), let ðαÞ
p ðXÞ ^ðαÞ θbi ðXÞ :¼ θ θ ðXÞ: pθ ðXÞ i
(45)
It is easy to check that θ^ is an unbiased estimator of θ for S. Hence, if we let P m b A¼ m i¼1 ci θ i , for c ¼ ðc1 , …, cm Þ , then from Eqs. (44) and (40), we have
Information geometry and classical Cram er–Rao-type inequalities Chapter 1 ðαÞ cVarθðαÞ ½θ^ ðXÞct c½GðαÞ ct :
5
95
(46)
This proves the first part. For the converse, consider an unbiased estimator θ^ ¼ ðθ^1 , …, θ^m Þ of θ for S. Let ðαÞ pθ ðXÞ b θ^i ðXÞ :¼ ðαÞ θi ðXÞ: pθ ðXÞ
(47)
This is an unbiased estimator of θi for S(α). Hence, the assertion follows from the first part of the proof. □ When α ¼ 1, the inequality in Eq. (46) reduces to the classical Cramer– Rao inequality. We see that Eq. (46) is, in fact, the Cramer–Rao inequality for the α-escort family S(α).
3.3 Generalized version of Cramer–Rao inequality We apply the result in Eq. (46) to a more general class of f-divergences. Observe from (12) that Iα-divergence is a monotone function of an f-divergence not of the actual distributions but their α-escort distributions. Motivated by this, we first define a more general f-divergence and then show that these divergences also lead to generalized CR inequality analogous to Eq. (46). Although these divergences are defined for positive measures, we restrict to probability measures here. Definition 1. Let f be a strictly convex, twice continuously differentiable real valued function defined on [0, ∞) with f(1) ¼ 0 and f 00 (1)6¼0. Let F be a function that maps a probability distribution p to another probability distribution F(p). Then the generalized f-divergence between two probability distributions p and q is defined by X FðpðxÞÞ 1 ðFÞ D f ðp, qÞ ¼ 00 FðqðxÞÞf : (48) FðqðxÞÞ f ð1Þ x Since f is convex, by Jensen’s inequality, ðFÞ Df ðp, qÞ
1 f 00 f ð1Þ ¼
1 f f 00 ð1Þ
X x
X
1 f ð1Þ f ð1Þ ¼ 0: ¼
00
x
! FðpðxÞÞ FðqðxÞÞ FðqðxÞÞ ! FðpðxÞÞ
96
SECTION
I Foundations of information geometry ðFÞ
Notice that, when F(p(x)) ¼ p(x), Df
becomes the usual Csisza´r divergence.
ðFÞ We now apply Eguchi’s theory to Df . The Riemannian ðf ,FÞ fied by the matrix Gðf ,FÞ ðθÞ ¼ ½gi,j ðθÞ, where ðD
ðf ,FÞ
ðFÞ
gi,j ðθÞ: ¼ gi,j f
Þ
metric on S is speci-
ðθÞ
ðFÞ ¼ ∂θ∂ 0 ∂θ∂ i Df ðpθ , pθ0 Þ j
θ0 ¼θ
X 1 Fðpθ ðxÞÞ ¼ Fðpθ0 ðxÞÞf Fðp 0 ðxÞÞ 00 θ ð1Þ f x θ0 ¼θ " # Fðpθ ðxÞÞ F0 ðpθ ðxÞÞ ∂ X 1 0 Fðpθ0 ðxÞÞ f 00 ¼ 0 ∂ p ðxÞ Fðpθ0 ðxÞÞ Fðpθ0 ðxÞÞ i θ ∂θj x f ð1Þ 0 θ ¼θ " # X Fðpθ ðxÞÞ 0 00 Fðpθ ðxÞÞ 0 F ðpθ ðxÞÞ f F ðpθ0 ðxÞÞ∂i pθ ðxÞ∂j pθ ðxÞ ¼ Fðpθ0 ðxÞÞ Fðpθ0 ðxÞÞ2 x θ0 ¼θ ∂θ∂ 0 ∂θ∂ i j
1 00 f ð1Þ X ¼ Fðpθ ðxÞÞ ∂i log Fðpθ ðxÞÞ ∂j log Fðpθ ðxÞÞ x
¼ EθðFÞ ½∂i log Fðpθ ðXÞÞ ∂j log Fðpθ ðXÞÞ, (49)
where θ stands for expectation with respect to the escort measure F(pθ). Although the generalized Csisza´r f-divergence is also a Csisza´rf-divergence, it is not between p and q. Rather, it is between the distributions F(p) and F(q). ðFÞ As a consequence, the metric induced by D f is different from the Fisher information metric, whereas the metric arising from all Csisza´r f-divergences is the Fisher information metric (Amari and Cichocki, 2010). The following theorem extends the result in Theorem 2 to a more general framework. Theorem 3. (Generalized version of Cramer–Rao inequality (Kumar and Mishra, 2020)). Let θ^ ¼ ðθ^1 , …, θ^m Þ be an unbiased estimator of θ ¼ ðθ1 , …, θm Þ for the statistical model S. Then there exists an unbiased estimator F ðFÞ θ^ of θ for the model S(F) ¼ {F(p) : p S} such that Var ðFÞ ½θ^ ðXÞ (F)
θ
1
½Gðf ,FÞ . Further, if S is such that its escort model S(F) is exponential, then there exists efficient estimators for the escort model. Proof. Following the same steps as in Theorems 1–2 and Corollary 1 produces 1 ðFÞ cVarθðFÞ ½θ^ ct c½Gð f ,FÞ ct
(50)
Information geometry and classical Cram er–Rao-type inequalities Chapter
5
97
ðFÞ for an unbiased estimator θ^ of θ for S(F). This proves the first assertion of the theorem. Now let us suppose that pθ is model such that
log Fðpθ ðxÞÞ ¼ cðxÞ +
k X
θi hi ðxÞ ψðθÞ:
(51)
i¼1
Then ∂i log Fðpθ ðxÞÞ ¼ hi ðxÞ ψðθÞ:
(52)
Let η^ðxÞ :¼ hi ðxÞ
and η :¼ EθðFÞ ½^ ηðXÞ:
Since EθðFÞ ½∂i log Fðpθ ðXÞÞ ¼ 0, we have ∂i ψðθÞ ¼ ηi :
(53)
gi,j ðθÞ ¼ EθðFÞ ½ð^ ηi ðXÞ ηi Þð^ ηj ðXÞ ηj Þ
(54)
Hence ðf ,FÞ
Moreover, since
1 ∂j log Fðpθ ðxÞÞ Fðpθ ðxÞÞ 1 1 ¼ ∂i Fðpθ ðxÞÞ∂j Fðpθ ðxÞÞ ∂i ∂j Fðpθ ðxÞÞ Fðpθ ðxÞÞ Fðpθ ðxÞÞ2
∂i ∂j log Fðpθ ðxÞÞ ¼ ∂i
¼
1 ∂ ∂ Fðpθ ðxÞÞ ∂i log Fðpθ ðxÞÞ∂j log Fðpθ ðxÞÞ, Fðpθ ðxÞÞ i j (55)
from (49), we have ðf ,FÞ
gi,j ðθÞ ¼ EθðFÞ ½∂i ∂j log Fðpθ ðXÞÞ:
(56)
Hence, from Eqs. (52) and (53), we have ðf ,FÞ
∂i ηj ¼ gi,j ðθÞ:
(57)
This implies that η is dual to θ. Hence the generalized FIM of η is equal to the inverse of the generalized FIM of θ. Thus from Eq. (54), η^ is an efficient estimator of η for the escort model. This further helps us to find efficient estimators for θ for the escort model. This completes the proof. □ Theorem 3 generalizes the dually flat structure of exponential and linear families with respect to the Fisher metric identified by Amari and Nagaoka (2000, Sec. 3.5) to other distributions and a more widely applicable metric (as in Definition 1).
98
SECTION
I Foundations of information geometry
4 Information geometry for Bayesian CR inequality and Barankin bound e We extend Eguchi’s theory in Section 3 to the space PðÞ of all positive e measures on , that is, P ¼ fe p : ! ð0, ∞Þg. Let S ¼ fpθ : θ ¼ ðθ1 , …, θk Þ Θg be a k-dimensional submanifold of P and let Se :¼ fe pθ ðxÞ ¼ pθ ðxÞλðθÞ : pθ Sg,
(58)
e For where λ is a probability distribution on Θ. Then Se is a submanifold of P. e peθ , peθ0 S, the KL-divergence between peθ and peθ0 is given by Iðe pθ k peθ0 Þ ¼
X X pe ðxÞ X peθ ðxÞ log θ peθ ðxÞ + peθ0 ðxÞ 0 peθ ðxÞ x x x
X p ðxÞλðθÞ λðθÞ + λðθ0 Þ: ¼ pθ ðxÞλðθÞ log θ pθ0 ðxÞλðθ0 Þ x
(59)
By following Eguchi, we define a Riemannian metric GðIÞ ðθÞ ¼ ½gi,j ðθÞ on Se by ðIÞ
ðI Þ
gi,jλ ðθÞ :¼ I½∂i k∂j ¼ ¼
∂θ∂ i ∂θ∂ 0 j X x
¼
X x
X x
pθ ðxÞλðθÞ log
pθ ðxÞλðθÞ pθ0 ðxÞλðθ0 Þ
θ0 ¼θ
∂i ðpθ ðxÞλðθÞÞ ∂j log ðpθ ðxÞλðθÞÞ pθ ðxÞλðθÞ∂i ð log pθ ðxÞλðθÞÞ ∂j ð log ðpθ ðxÞλðθÞÞÞ
¼ λðθÞ
(60)
X pθ ðxÞ½∂i ð log pθ ðxÞÞ + ∂i ð log λðθÞÞ x
½∂j ð log pθ ðxÞÞ + ∂j ð log λðθÞÞ ¼ λðθÞ Eθ ½∂i log pθ ðXÞ ∂j log pθ ðXÞ + ∂i ð log λðθÞÞ ∂j ð log λðθÞÞ n o ðeÞ ¼ λðθÞ gi,j ðθÞ + J λi,j ðθÞ ,
(61)
where ðeÞ
gi,j ðθÞ :¼ Eθ ½∂i log pθ ðXÞ ∂j log pθ ðXÞ,
(62)
J λi,j ðθÞ :¼ ∂i ð log λðθÞÞ ∂j ð log λðθÞÞ:
(63)
and
Information geometry and classical Cram er–Rao-type inequalities Chapter
99
5
ðeÞ
Let GðeÞ ðθÞ :¼ ½gi,j ðθÞ and J λ ðθÞ :¼ ½J λi,j ðθÞ. Then h i GðIÞ ðθÞ ¼ λðθÞ GðeÞ ðθÞ + J λ ðθÞ ,
(64)
e is an affine subset of e , where G(e)(θ) is the usual FIM. Observe that P
e :¼ [ fad +1 g . The tangent space at every point of P e is A0 :¼ where e P e ¼ A0 . Thus, proceeding with AðxÞ ¼ 0g . That is, T p ðPÞ fA : xe Amari and Nagoaka’s theory (Amari and Nagaoka, 2000, Sec. 2,5) (as in Section 3.1) with p replaced by pe , we get the following theorem and corollary. Theorem 4 (Kumar and Mishra, 2018). Let A : ! be any mapping e ! be the mapping pe 7! Ep~½A. We (that is, a vector in .) Let E½A : P then have VarðAÞ ¼kðdEp~½AÞp~ k 2p~ :
(65)
e then Corollary 2 (Kumar and Mishra, 2018). If S is a submanifold of P, Varp~½A kðdE½AjS Þp~k2p~
(66)
with equality if ðeÞ
ðeÞ
A Ep~½A fXp~ : X T p~ðSÞg ¼: T p~ ðSÞ: We state our main result in the following theorem. Theorem 5 (Kumar and Mishra, 2018). Let S and Se be as in (58). Let θ^ be an estimator of θ. Then (a) Bayesian Cram er–Rao: o1 n ^ λ ½GðeÞ ðθÞ + J λ ðθÞ λ Varθ ðθÞ , (67) ^ ¼ ½Covθ ðθ^i ðXÞ, θ^j ðXÞÞ is the covariance matrix and G(e)(θ) where Varθ ðθÞ λ and J (θ) are as in (62) and (63). (b) Deterministic Cram er–Rao : If θ^ is an unbiased estimator of θ, then ^ ½GðeÞ ðθÞ1 : Varθ ½θ
(68)
(c) Deterministic Cram er–Rao (biased version) : For any estimator θ^ of θ, ^ ð1 + B0 ðθÞÞ½GðeÞ ðθÞ1 ð1 + B0 ðθÞÞ + bðθÞbðθÞT , MSEθ ½θ ^ is the bias and 1 + B0 (θ) is the where bðθÞ ¼ ðb1 ðθÞ, …, bk ðθÞÞT :¼ θ ½θ matrix whose (i, j)th entry is 0 if i6¼j and is (1 + ∂ibi(θ)) if i ¼ j.
100 SECTION
I Foundations of information geometry
(d) Barankin Bound : (Scalar case) If θ^ be an unbiased estimator of θ, then " #2 n X ðlÞ al ðθ θÞ l¼1 ^ sup , (69) Varθ ½θ " #2 n n, al , θðlÞ X X al LθðlÞ ðxÞ pθ ðxÞ x
l¼1
where LθðlÞ ðxÞ :¼ pθðlÞ ðxÞ=pθ ðxÞ and the supremum is over all a1 , …, an , n , and θð1Þ , …, θðnÞ Θ. P Proof. (a) Let A ¼ ki¼1 ci θ^i , where θ^ ¼ ðθ^1 , …, θ^k Þ is an unbiased estimator of θ, in Corollary 2. Then, from Eq. (66), we have X X i,j ci cj Coveðθ^i , θ^j Þ ci cj ðgðIÞ Þ ðθÞ: θ i, j
i, j
This implies that λðθÞ
X X i,j ci cj Covθ ðθ^i , θ^j Þ ci cj ðgðIÞ Þ ðθÞ: i, j
(70)
i, j
Hence, integrating with respect to θ, from Eq. (64), we get X ci cj λ Covθ ðθ^i , θ^j Þ i, j
X
h i 1 ci cj λ ½GðeÞ ðθÞ + J λ ðθÞ :
i, j
That is,
h i ^ λ ½GðeÞ ðθÞ + J λ ðθÞ1 : λ Varθ ðθÞ
But h i h i1 1 λ ½GðeÞ ðθÞ + J λ ðθÞ λ ½GðeÞ ðθÞ + J λ ðθÞ by Groves and Rothenberg (1969). This proves the result. (b) This follows from Eq. (70) by taking λ(θ) ¼ 1. ^ (c) Let us Pkfirst observe that θ is an unbiased P estimator of θ + b(θ). Let ^ A ¼ i¼1 ci θi as before. Then ½A ¼ ki¼1 ci ðθi + bi ðθÞÞ . Then, from Corollary 2 and Eq. (66), we have 1
^ ð1 + B0 ðθÞÞ½GðeÞ ðθÞ ð1 + B0 ðθÞÞ Varθ ½θ
Information geometry and classical Cram er–Rao-type inequalities Chapter
5 101
^ ¼ Varθ ½θ ^ + bðθÞbðθÞT . This proves the assertion. But MSEθ ½θ (d) For fixed a1 , …, an and θð1Þ , …, θðnÞ Θ, let us define a metric by the following formula " #2 n X X (71) gðθÞ :¼ al LθðlÞ ðxÞ pθ ðxÞ: x
l¼1
^ Þ θ, where θ^ is an unbiased Let f be the mapping p7!p ½A. Let Að Þ ¼ θð estimator of θ in Corollary 2. Then, from Eq. (66), we have ! n X X ^ θÞ ðθðxÞ al L ðlÞ ðxÞ pθ ðxÞ θ
x
¼
n X
al
X x
l¼1
¼
n X
l¼1
p ðlÞ ðxÞ pθ ðxÞ ð^ θðxÞ θÞ θ pθ ðxÞ
!
al ðθðlÞ θÞ:
l¼1
Hence, from Corollary 2, we have " Varθ ½^ θ
n X al ðθðlÞ θÞ
"
#2
l¼1
#2 n X X al LθðlÞ ðxÞ pθ ðxÞ x
l¼1
Since al and θ are arbitrary, taking supremum over all a1 , …, an , n , and θð1Þ , …, θðnÞ Θ, we get (69). (l)
□ 5
Information geometry for Bayesian α-CR inequality
We now introduce Iα-divergence in the Bayesian case. Consider the setting of Section 4. Then, Iα-divergence between peθ with respect to peθ0 is X λðθÞ α1 I α ðe log pθ , peθ0 Þ :¼ pθ ðxÞðλðθ0 Þpθ0 ðxÞÞ + λðθ0 Þ 1α x X 2 3 log pθ ðxÞα X 1 x 6 7 λðθÞ4 pθ0 ðxÞα 5: f1 + log λðθÞg log α αð1 αÞ x We present the following Lemma 1 which shows that our definition of Bayesian Iα-divergence is not only a valid divergence function but also coincides with the KL-divergence as α ! 1.
102 SECTION
I Foundations of information geometry
Lemma 1 (Mishra and Kumar, 2020). 1. I α ðe pθ , peθ0 Þ 0 with equality if and only if peθ ¼ peθ0 2. I α ðe pθ , peθ0 Þ ! Iðe pθ , peθ0 Þ as α ! 1. Proof. (1) Let α > 1. Applying Holder’s inequality with Holder conjugates p ¼ α and q ¼ α/(α 1), we have X α1 α1 pθ ðxÞðλðθ0 Þpθ0 ðxÞÞ k pθ k λðθ0 Þ k p0θ kα1 , x
where kk denotes α-norm. When α < 1, the inequality is reversed. Hence X λðθÞ s1 log pθ ðxÞðλðθ0 Þpθ0 ðxÞÞ 1α x X λðθÞ log pθ ðxÞα X λðθÞ x log pθ0 ðxÞα λðθÞ log λðθ0 Þ α αð1 αÞ x X α λðθÞ log pθ ðxÞ λðθÞ log λðθÞ λðθÞ + λðθ0 Þ αð1 αÞ X λðθÞ log pθ0 ðxÞα α x X 2 3 log pθ ðxÞα X x 6 7 ¼ λðθÞ4 pθ0 ðxÞα 5 + λðθ0 Þ, f1 + log λðθÞg log αð1 αÞ x
x
where the second inequality follows because, for x, y 0, log
x y ¼ x log xðy=x 1Þ y + x, y x
and hence x log y x log x x + y: The conditions of equality follow from the same in Holder’s inequality and log x x 1. (2) This follows by applying L’H^ opital rule to the first term of Iα: "
# X α α1 0 lim λðθÞ log pθ ðxÞðλðθ Þpθ0 ðxÞÞ α!1 1 α x 2 3 X 6 1 α1 7 λðθÞ log ¼ lim 4 pθ ðxÞðλðθ0 Þpθ0 ðxÞÞ 5 1 α!1 x 1 α
Information geometry and classical Cram er–Rao-type inequalities Chapter
2
X
5 103
3 α1 pθ ðxÞðλðθ0 Þpθ0 ðxÞÞ log ðλðθ0 Þpθ0 ðxÞÞ 7 7 X 5 α1 0 pθ ðxÞðλðθ Þpθ0 ðxÞÞ
6 1 x ¼ λðθÞ lim 6 1 α!1 4 2 α x X ðλðθÞpθ ðxÞÞ log ðλðθ0 Þpθ0 ðxÞÞ, ¼ x
and since Renyi entropy coincides with Shannon entropy as α ! 1.
□
e We apply Eguchi’s theory provided in Section 3 to the space PðÞ of all e ¼ fe positive measures on , that is, P p : ! ð0, ∞Þg . Following Eguchi ðI Þ (1992), we define a Riemannian metric ½gi,jα ðθÞ on Se by ðI Þ
gi,jα ðθÞ
∂ ∂ I α ðe pθ , peθ0 Þ ¼ 0 ∂θj ∂θi
θ0 ¼θ
X 1 α1 0 0 ∂j ∂i λðθÞ log ¼ pθ ðxÞðλðθ Þpθ0 ðxÞÞ α1 0 y θ ¼θ X ∂i λðθÞ∂j0 log pθ0 ðxÞα 0 x θ ¼θ 8 2 3 > > α1 < X 6 7 ðλðθ0 Þpθ0 ðxÞÞ 1 7 X ¼ ∂i pθ ðxÞ ∂j0 6 λðθÞ α1 5 4 0 α 1> 0 ðxÞÞ p ðyÞðλðθ Þp > x θ θ : y
2X
9 > > =
3
p ðxÞ∂j0 ðλðθÞpθ0 ðxÞÞα1 6 x θ 7 7 + ∂i λðθÞ 6 4 X α1 5 0 pθ ðxÞðλðθ Þpθ0 ðxÞÞ x
∂i λðθÞ∂j0 log
X x
α pθ0 ðxÞ
0
θ ¼θ
θ0 ¼ θ
> > ;
θ0 ¼ θ
8X ∂i pθ ðxÞðλðθÞpθ ðxÞÞα2 ∂j ðλðθÞpθ ðxÞÞ > < x X ¼ λðθÞ > pθ ðxÞðλðθÞpθ ðxÞÞα1 : X
x
x
ð∂i pθ ðxÞÞpθ ðxÞα1 X pθ ðxÞα x
X x
pθ ðxÞðλðθÞpθ0 ðxÞÞα2 ∂j ðλðθÞpθ ðxÞÞ X pθ ðxÞðλðθÞpθ ðxÞÞα1 x
104 SECTION
I Foundations of information geometry
2X 3 pθ ðxÞ∂j0 ðλðθÞpθ0 ðxÞÞα1 6 x 7 7 + ∂i log λðθÞ 6 4 X α1 5 0 pθ ðxÞðλðθ Þpθ0 ðxÞÞ x
θ0 ¼ θ
9 > > = > > ;
∂i λðθÞEθðαÞ ½∂j log pθ ðXÞ ¼ λðθÞ EθðαÞ ½∂i log pθ ðXÞ∂j log pθ ðXÞ + ∂j log λðθÞEθðαÞ ½∂i log pθ ðXÞ EθðαÞ ½∂i log pθ ðXÞ EθðαÞ ½∂j log pθ ðXÞ + ∂j log λðθÞ + ∂i log λðθÞ EθðαÞ ½∂j log pθ ðXÞ + ∂j log λðθÞ ∂i λðθÞEθðαÞ ½∂j log pθ ðXÞ ¼ λðθÞ CovθðαÞ ½∂i log pθ ðXÞ, ∂j log pθ ðXÞ + ∂i log λðθÞ EθðαÞ ½∂j log pθ ðXÞ + ∂j log λðθÞ ∂i λðθÞEθðαÞ ½∂j log pθ ðXÞ ¼ λðθÞ CovθðαÞ ½∂i log pθ ðXÞ, ∂j log pθ ðXÞ + ∂i log λðθÞ∂j log λðθÞ ðαÞ ¼ λðθÞ½gi,j ðθÞ + J λi,j ðθÞ,
(72) (73)
where ðαÞ
gi,j ðθÞ :¼ CovθðαÞ ½∂i log pθ ðXÞ, ∂j log pθ ðXÞ,
(74)
J λi,j ðθÞ :¼ ∂i ð log λðθÞÞ ∂j ð log λðθÞÞ:
(75)
and
ðαÞ
Let GðαÞ ðθÞ :¼ ½gi,j ðθÞ , J λ ðθÞ :¼ ½J λi,j ðθÞ and Gλα ðθÞ :¼ GðαÞ ðθÞ + J λ ðθÞ .
Notice that, when α ¼ 1, Gλα becomes G(I), the usual FIM in the Bayesian case (cf. Kumar and Mishra, 2018). e with respect to the metric Gλ , we have the Examining the geometry of P α following results analogous to Theorem 1 and Corollary 1 derived in Section 3.1 for P. Theorem 6 (Mishra and Kumar, 2020). Let A : ! be any mapping (that e ! be the mapping pe 7! Ep~½A. We then have is, a vector in . Let E½A : P pe VarpðαÞ ðαÞ ðA Ep~½AÞ ¼ k ðdEp~½AÞp~k2p~: (76) p e then Corollary 3. Mishra and Kumar, 2020. If Se is a submanifold of P, peðXÞ (77) VarpðαÞ ðαÞ ðA Ep~½AÞ kðdE½AjS Þp~k2p~ p ðXÞ
Information geometry and classical Cram er–Rao-type inequalities Chapter
5 105
with equality if and only if ðαÞ
ðαÞ
A Ee ½A fX : X Te ðSÞg ¼: T ðSÞ: p p e e p p We use the aforementioned ideas to establish a Bayesian α-version of the CR inequality for the α-escort of the underlying distribution. The following theorem gives a Bayesian lower bound for the variance of an estimator of S(α) starting from an unbiased estimator of S. Theorem 7 (Bayesian α-Cramer–Rao inequality Mishra and Kumar, 2020). Let S ¼ fpθ : θ ¼ ðθ1 , …, θm Þ Θg be the given statistical model and let Se be as before. Let θ^ ¼ ðθ^1 , …, θ^m Þ be an unbiased estimator of θ ¼ ðθ1 , …, θm Þ for the statistical model S. Then "
Z VarθðαÞ
# n h io1 peθ ðXÞ ^ ðαÞ ð θðXÞ θÞ dθ E G , λ λ ðαÞ pθ ðXÞ
(78)
ðαÞ
where θ(α) denotes expectation with respect to pθ . Proof. Given an unbiased estimator θ^ of θ for Se , let A ¼ c ¼ ðc1 , …, cm Þ m . Then, from Eq. (44) and kðdf Þp~k2p~ ¼
X
Pm
ðαÞ
ðgi,j Þ ∂j ðf Þ∂i ðf Þ,
b , for
i¼1 ci θ i
(79)
i, j
we have
" cVarθðαÞ
# peθ ðXÞ ^ ðαÞ 1 ðθðXÞ θÞ ct cfλðθÞGλ g ct : ðαÞ pθ ðXÞ
Integrating the above over θ, we get " # Z peθ ðXÞ ^ ðθðXÞ θÞ dθ ct c VarθðαÞ ðαÞ pθ ðXÞ Z ðαÞ 1 c ½λðθÞGλ dθ ct : But
Z
n o1 ðαÞ 1 ðαÞ ½λðθÞGλ dθ λ ½Gλ ðθÞ
by Groves and Rothenberg (1969). This proves the result.
(80)
(81)
(82) □
106 SECTION
I Foundations of information geometry
The above result reduces to the usual Bayesian Cramer–Rao inequality when α ¼ 1 as in Kumar and Mishra (2018). When λ is the uniform distribution, we obtain the α-Cramer–Rao inequality as in Kumar and Mishra (2020). When α ¼ 1 and λ is the uniform distribution, this yields the usual deterministic Cramer–Rao inequality.
6 Information geometry for Hybrid CR inequality Hybrid CR inequality is a special case of Bayesian CR inequality where part of the unknown parameters are deterministic and the rest are random. This was first encountered by Rockah in a specific application (Rockah and Schultheiss, 1987a, b). Further properties of hybrid CR inequality were studied, for example, in Narasimhan and Krolik (1995), Noam and Messer (2009), and Messer (2006). Consider the setting in 4. The unknown parameter θ is now concatenation T
of two vectors θ1 and θ2, that is, θ ¼ ½θT1 , θT2 , where θ1 is an m-dimensional vector of deterministic parameters and θ2 is an n-dimensional vector of random parameters. Since θ1 is deterministic, the prior distribution λ(θ) is independent of θ1. As a consequence, the entries J λi,j in (63) corresponding to any of the components of θ1 vanish. The hybrid CR inequality takes a form that is same as the Bayesian one except that the Jλ matrix in (78) now becomes 0 0 (83) 0 J λ ðθ2 Þ, where Jλ(θ2) is the Jλ matrix for the random parameter vector θ2. In a similar way, one obtains the hybrid α-CR inequality from Theorem 7.
7 Summary In this chapter, we discussed information-geometric characterizations of various divergence functions linking them to the classical α-CRLB, generalized CRLB, Bayesian CRLB, Bayesian α-CRLB, hybrid CRLB, and hybrid α-CRLB (see Table 1). For the Bayesian CRLB, we exploited the definition of KL-divergence when the probability densities are not normalized. This is an improvement over Amari–Nagaoka framework (Amari and Nagaoka, 2000) on information geometry which only dealt with the notion of deterministic classical CRLB. In particular, we formulated an analogous inequality from the generalized Csisza´r f-divergence. This result leads the usual CR inequality to its escort F(p) by the transformation p7!F(p). Note that this reduction is not coincidental because the Riemannian metric derived from all Csisza´r f-divergences is the Fisher information metric and the divergence studied here is a Csisza´r f-divergence, not between p and q, but between F(p) and F(q). The generalized
Information geometry and classical Cram er–Rao-type inequalities Chapter
5 107
TABLE 1 Lower error bounds and corresponding information-geometric properties. cf. Section
Divergence
Riemannian metric
Deterministic CRLB (Amari and Nagaoka, 2000)
2, 3
I (p, q)
G(e)(θ)
Bayesian CRLB (Kumar and Mishra, 2018)
4
Iðe p θ k peθ0 Þ
λðθÞ½G ðeÞ ðθÞ+J λ ðθÞ
Hybrid CRLB
6
Iðe p θ k peθ0 Þ
λðθÞ½G ðeÞ ðθÞ+J λ ðθÞ
Barankin bound (Kumar and Mishra, 2018)
4
Not applicable
g(θ)
Deterministic α-CRLB (Kumar and Mishra, 2020)
3.2
Iα(p, q)
G(α)(θ)
General (f, F)-CRLB (Kumar and Mishra, 2020)
3.3
Df
ðFÞ
G(f,F)(θ)
Bayesian α-CRLB (Mishra and Kumar, 2020)
5
I α ðe p θ , peθ0 Þ
λðθÞ½G ðαÞ ðθÞ+J λ ðθÞ
Hybrid α-CRLB
6
I α ðe p θ , peθ0 Þ
λðθÞ½G ðαÞ ðθÞ+J λ ðθÞ
Bound
ðp, qÞ
version of the CR inequality enables us to find unbiased and efficient estimators for the escort of the underlying model. Finally, using the general definition of Iα-divergence in the Bayesian case, we derived Bayesian α-CRLB and hybrid CRLB. These improvements enable usage of information-geometric approaches for biased estimators and noisy situations as in radar and communications problems (Mishra and Eldar, 2017).
Acknowledgments The authors are sincerely grateful to the anonymous reviewers whose valuable comments greatly helped in improving the manuscript. K.V.M. acknowledges support from the National Academies of Sciences, Engineering, and Medicine via Army Research Laboratory Harry Diamond Distinguished Postdoctoral Fellowship.
Appendix A.1 Other generalizations of Cramer–Rao inequality Here we discuss commonalities of some of the earlier generalizations of CR inequality with the α-CR inequality mentioned in Section 3.2.
108 SECTION
I Foundations of information geometry
1. Jan Naudts suggests an alternative generalization of the usual Cramer–Rao inequality in the context of Tsallis’ thermostatistics (Naudts, 2004, Eq. (2.5)). Their inequality is closely analogous to ours. It enables us to find a bound for the variance of an estimator of the underlying model (with respect to the escort model) in terms of a generalized Fisher information (gkl(θ)) involving both the underlying (pθ) and its escort families (Pθ). ðαÞ Their Fisher information, when the escort is taken to be Pθ ¼ pθ , is given by X
gk,l ðθÞ ¼
x
1
∂k pθ ðxÞ∂l pθ ðxÞ: ðαÞ pθ ðxÞ
The same in our case is ðαÞ
gk,l ðθÞ ¼
X x
1 ðαÞ ðαÞ ∂k pθ ðxÞ∂l pθ ðxÞ: ðαÞ pθ ðxÞ
ðαÞ
Also, ∂i pθ and ∂ipθ are related by 0 1 ðαÞ ∂i pθ ðxÞ
" ðαÞ B pθ ðxÞα C C ¼ α pθ ðxÞ ∂i pθ ðxÞ X ¼ ∂i B @ p ðxÞ p ðyÞα A θ
y
θ
ðαÞ pθ ðxÞ
X pðαÞ ðyÞ θ
y
pθ ðyÞ
# ∂i pθ ðyÞ :
Moreover, while theirs bounds the variance of an estimator of the true distribution with respect to the escort distribution, ours bounds the variance of an estimator of the escort distribution itself. Their result is precisely the following. Theorem 2.1 of Naudts (2004) Let be given two families of pdfs ðpθ Þθ D and ðPθ Þθ D and corresponding expectations Eθ and Fθ. Let c be an estimator of ðpθ Þθ D , with scale function F. Assume that the regularity condition Fθ
1 ∂ p ðxÞ ¼ 0, Pθ ðxÞ ∂θk θ
holds. Let gkl(θ) be the information matrix introduced before. Then, for all u and v in n is uk ul ½Fθ ck cl ðFθ ck ÞðFθ cl Þ 1 k l : h i2 2 v v gkl ðθÞ uk vl ∂θ∂l ∂θk FðθÞ 2. Furuichi (2009) defines a generalized Fisher information based on the q-logarithmic function and gives a bound for the variance of an estimator with respect to the escort distribution. Given a random variable X with the probability density function f(x), they define the q-score function sq(x) based on the q-logarithmic function and q-Fisher information
Information geometry and classical Cram er–Rao-type inequalities Chapter
5 109
h i J q ðXÞ ¼ Eq sq ðXÞ2 , where Eq stands for expectation with respect to the escort distribution f (q) of f as in (23). Observe that h i J q ðXÞ ¼ Eq sq ðXÞ2 2 d 22q log f ðXÞÞ , ¼ Eq f ðXÞ dX whereas our Fisher information in this setup, following (20), is " 2 # 2 d d ðqÞ log f ðXÞ log f ðXÞ , g ðXÞ ¼ Eq Eq dX dX
(A.1)
(A.2)
Interestingly, they also bound the variance of an estimator of the escort model with respect to the escort model itself as in our case. Their main result is the following: Theorem 3.3 of Furuichi (2009): Given the random variable X with the probability density function p(x), ithe q-expectation value μq ¼ Eq[X], and h
2 2 the q-variance σ q ¼ Eq X μq , we have a q-Cram er–Rao inequality 0 1 J q ðXÞ
1B BZ σ 2q @
2 pðxÞq dx
C 1C A
for q ½0, 1Þ [ ð1, 3Þ:
Immediately, we have J q ðXÞ
1 σ 2q
for q ð1, 3Þ:
3. Lutwak et al. (2005) derive a Cramer–Rao inequality in connection with extending Stam’s inequality for the generalized Gaussian densities. Their inequality finds lower bound for the p-th moment of the given density (σ p[f ]) in terms of a generalized Fisher information. Their Fisher information ϕp, λ[f ], when specialized to p ¼ q ¼ 2, is given by ( " 2 #)12 d 2λ2 ϕ2,λ ½ f ¼ E f ðXÞ log f ðXÞ , dX which is closely related to that of Furuichi’s (A.1) up to a change of measure f 7! f (λ), which, in turn, related to ours (A.2). Moreover, while they use Iα-divergence to derive their moment-entropy inequality, they do not do so while defining their Fisher information and hence obtain a different Cramer–Rao inequality. Their result is reproduced as follows: Theorem 5 of Lutwak et al. (2005): Let p [1, ∞], λ (1/(1 + p), ∞), and f be a density. If p < ∞, then f is assumed to be absolutely continuous; if p ¼ ∞, then f λ is assumed to have bounded variation. If σ p[ f ], ϕp,λ[ f ] < ∞, then
110 SECTION
I Foundations of information geometry
σ p ½ f ϕp,λ ½ f σ p ½Gϕp,λ ½G, where Gis the generalized Gaussian density. 4. Bercher (2012) derived a two parameter extension of Fisher information and a generalized Cramer–Rao inequality which bounds the α moment of an estimator. Their Fisher information, when specialized to α ¼ β ¼ 2, reduces to ðqÞ 2 f ðX; θÞ ∂ I 2,q ½ f ; θ ¼ Eq log f ðqÞ ðX; θÞ , f ðx; θÞ ∂θ where Eq stands for expectation with respect to the escort distribution f (q). Whereas, following (22), our Fisher information in this setup is 2 1 ∂ ðqÞ ðqÞ g ðθÞ ¼ 2 Eq log f ðX; θÞ : ∂θ q Thus our Fisher information differs from his by the factor f(x;θ)/q2f (q)(x, θ) inside the expectation. Note that q in their result is analogous to α in our work. The main result of Bercher (2012) is reproduced verbatim as follows: Theorem 1 of Bercher (2012): Let f(x ; θ) be a univariate probability density function defined over a subset X of , and θ Θ a parameter of the density. Assume that f(x;θ) is a jointly measurable function of x and θ, is integrable with respect to x, is absolutely continuous with respect to θ, and that the derivative with respect to θ is locally integrable. Assume also that q > 0 and that ^ of θ, we have Mq[ f ; θ]is finite. For any estimator θðxÞ 1 ^ θ, ^ θjα α I β,q ½ f ; θβ1 1 + ∂ Eq ½θðxÞ E jθðxÞ ∂θ € with α and β Holder conjugates of each other, i.e., α1 + β1 ¼ 1, α 1, and where the quantity " β # f ðx ; θÞq1 ∂ f ðx ; θÞq I β,q ½ f ; θ ¼ E ln , Mq ½ f ; θ ∂θ Mq ½ f ; θ R where Mq ½ f ; θ :¼ f ðx; θÞq dx , is the generalized Fisher information of order (β, q) on the parameter θ.
References Amari, S., 1982. Differential geometry of curved exponential families-curvatures and information loss. Ann. Stat. 10 (2), 357–385. Amari, S., 1985. Differential-Geometrical Methods in Statistics, first ed. Lecture Notes on Statistics, vol. 28 Springer. Amari, S., 1997. Information geometry of neural networks: an overview. In: Ellacott, S.W., Mason, J.C., Anderson, I.J. (Eds.), Mathematics of Neural Networks. Operations Research/ Computer Science Interfaces Series, vol. 8. Springer US, pp. 15–23.
Information geometry and classical Cram er–Rao-type inequalities Chapter
5 111
Amari, S., 1998. Natural gradient works efficiently in learning. Neural Comput. 10 (2), 251–276. Amari, S., 2002. Information geometry of neural learning and belief propagation. In: IEEE International Conference on Neural Information Processing, vol. 2, p. 886-vol. Amari, S., 2016. Information Geometry and Its Applications. Springer. Amari, S., 2021. Information geometry. Jpn. J. Math. 16, 1–48. Amari, S., Cichocki, A., 2010. Information geometry of divergence functions. Bull. Polish Acad. Sci. Tech. Sci. 58 (1), 183–195. Amari, S., Nagaoka, H., 2000. Methods of Information Geometry. vol. 191 American Mathematical Society, Oxford University Press. Amari, S., Yukawa, M., 2013. Minkovskian gradient for sparse optimization. IEEE J. Sel. Top. Signal Process. 7 (4), 576–585. Arıkan, E., 1996. An inequality on guessing and its application to sequential decoding. IEEE Trans. Inf. Theory 42 (1), 99–105. Ay, N., Jost, J., V^an L^e, H., Schwachh€ofer, L., 2017. Information Geometry. Springer. Ay, N., Gibilisco, P., Matus, F., 2018. Information geometry and its applications. In: Springer Proceedings in Mathematics & Statistics, vol. 252. Springer. Balaji, B., Barbaresco, F., Decurninge, A., 2014. Information geometry and estimation of Toeplitz covariance matrices. In: IEEE Radar Conference, pp. 1–4. Barbaresco, F., 2008. Innovative tools for radar signal processing based on Cartan’s geometry of SPD matrices & information geometry. In: IEEE Radar Conference, pp. 1–6. Barbaresco, F., 2014. Koszul information geometry and Souriau geometric temperature/capacity of Lie group thermodynamics. Entropy 16 (8), 4521–4565. Barbaresco, F., 2016. Geometric theory of heat from Souriau Lie groups thermodynamics and Koszul Hessian geometry: applications in information geometry for exponential families. Entropy 18 (11), 386. Barndorff-Nielsen, O., 2014. Information and Exponential Families in Statistical Theory. John Wiley & Sons. Basu, A., Shioya, H., Park, C., 2011. Statistical Inference: The Minimum Distance Approach. Monographs on Statistics and Applied Probability, Chapman & Hall/CRC Press. Bercher, J.-F., 2012. On generalized Cramer-Rao inequalities, generalized Fisher information and characterizations of generalized q-Gaussian distributions. J. Phys. A Math. Theor 45 (25), 255303. Blumer, A.C., McEliece, R.J., 1988. The Renyi redundancy of generalized Huffman codes. IEEE Trans. Inf. Theory 34 (5), 1242–1249. Braunstein, S.L., Caves, C.M., 1994. Statistical distance and the geometry of quantum states. Phys. Rev. Lett. 72 (22), 3439. Bunte, C., Lapidoth, A., 2014. Codes for tasks and Renyi entropy. IEEE Trans. Inf. Theory 60 (9), 5065–5076. Calin, O., Udris¸ te, C., 2014. Geometric Modeling in Probability and Statistics. Springer. Campbell, L.L., 1965. A coding theorem and Renyi’s entropy. Inf. Control 8, 423–429. Cencov, N.N., 1981. Statistical Decision Rules and Optimal Inference. Translations of Mathematical Monographs, American Mathematical Society. 53. Cichocki, A., Amari, S., 2010. Families of alpha- beta- and gamma- divergences: flexible and robust measures of similarities. Entropy 12, 1532–1568. Coutino, M., Pribic, R., Leus, G., 2016. Direction of arrival estimation based on information geometry. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3066–3070.
112 SECTION
I Foundations of information geometry
Csisza´r, I., 1991. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. 19 (4), 2032–2066. Csisza´r, I., Matu´sˇ, F., 2012. Generalized minimizers of convex integral functionals, Bergman distance, Pythagorean identities. Kybernetika (Prague) 48, 637–689. Csisza´r, I., Shields, P., 2004. Information Theory and Statistics: A Tutorial. Foundations and Trends in Communications and Information Theory, vol. 1.4 Now Publishers, Inc, Hanover, USA. de Jong, E., Pribic, R., 2014. Design of radar grid cells with constant information distance. In: IEEE Radar Conference, pp. 1–5. Desjardins, G., Simonyan, K., Pascanu, R., et al., 2015. Natural neural networks. In: Advances in Neural Information Processing Systems, pp. 2071–2079. Do Carmo, M.P., 1976. Differential Geometry of Curves and Surfaces. Prentice-Hall. Efron, B., 1975. Defining the curvature of a statistical problem (with applications to second order efficiency). Ann. Stat. 3 (6), 1189–1242. Eguchi, S., 1992. Geometry of minimum contrast. Hiroshima Math. J. 22 (3), 631–647. Eguchi, S., Kato, S., 2010. Entropy and divergence associated with power function and the statistical application. Entropy 12 (2), 262–274. Eguchi, S., Komori, O., Kato, S., 2011. Projective power entropy and maximum Tsallis entropy distributions. Entropy 13 (10), 1746–1764. Eguchi, S., Komori, O., Ohara, A., 2014. Duality of maximum entropy and minimum divergence. Entropy 16 (7), 3552–3572. Fujisawa, H., Eguchi, S., 2008. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 99, 2053–2081. Furuichi, S., 2009. On the maximum entropy principle and the minimization of the Fisher information in Tsallis statistics. J. Math. Phys. 50 (013303), 1–12. Gallot, S., Hulin, D., Lafontaine, J., 2004. Riemannian Geometry. Springer. Gangbo, W., McCann, R.J., 1996. The geometry of optimal transportation. Acta Math. 177 (2), 113–161. Grasselli, M.R., Streater, R.F., 2001. On the uniqueness of the Chentsov metric in quantum information geometry. Infin. Dimens. Anal. Quantum Prob. Relat. Top. 4 (02), 173–182. Groves, T., Rothenberg, T., 1969. A note on the expected value of an inverse matrix. Biometrika 56, 690–691. Huleihel, W., Salamatian, S., Medard, M., 2017. Guessing with limited memory. In: IEEE International Symposium on Information Theory, pp. 2253–2257. Jones, M.C., Hjort, N.L., Harris, I.R., Basu, A., 2001. A comparison of related density based minimum divergence estimators. Biometrika 88 (3), 865–873. Jost, J., 2005. Riemannian Geometry and Geometric Analysis. Springer. Karthik, P.N., Sundaresan, R., 2018. On the equivalence of projections in relative α-entropy and Renyi divergence. In: National Conference on Communications, pp. 1–6. Kass, R.E., Vos, P.W., 2011. Geometrical Foundations of Asymptotic Inference. vol. 908 John Wiley & Sons. Kumar, M.A., Mishra, K.V., 2018. Information geometric approach to Bayesian lower error bounds. In: IEEE International Symposium on Information Theory, pp. 746–750. Kumar, M.A., Mishra, K.V., 2020. Cramer-Rao lower bounds arising from generalized csisza´r divergences. Inf. Geom. 3 (1), 33–59. Kumar, M.A., Sundaresan, R., 2015a. Minimization problems based on relative α-entropy I: forward projection. IEEE Trans. Inf. Theory 61 (9), 5063–5080.
Information geometry and classical Cram er–Rao-type inequalities Chapter
5 113
Kumar, M.A., Sundaresan, R., 2015b. Minimization problems based on relative α-entropy II: reverse projection. IEEE Trans. Inf. Theory 61 (9), 5081–5095. Kurose, T., 1994. On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J. 46 (3), 427–433. Li, T., Wu, X., 2018. Quantum query complexity of entropy estimation. IEEE Trans. Inf. Theory 65 (5), 2899–2921. Liu, J., Yuan, H., Lu, X.-M., Wang, X., 2019. Quantum fisher information matrix and multiparameter estimation. J. Phys. A Math. Theor. 53 (2), 023001. Lutwak, E., Yang, D., Zhang, G., 2005. Cramer-Rao and moment-entropy inequalities for Renyi entropy and generalized Fisher information. IEEE Trans. Inf. Theory 51 (1), 473–478. Matsuzoe, H., 1998. On realization of conformally-projectively flat statistical manifolds and the divergences. Hokkaido Math. J. 27 (2), 409–421. Maybank, S.J., Ieng, S., Benosman, R., 2012. A Fisher-Rao metric for paracatadioptric images of lines. Int. J. Comput. Vis. 99 (2), 147–165. Messer, H., 2006. The hybrid Cramer-Rao lower bound–from practice to theory. In: IEEE Workshop on Sensor Array and Multichannel Processing, pp. 304–307. Mishra, K.V., Eldar, Y.C., 2017. Performance of time delay estimation in a cognitive radar. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3141–3145. Mishra, K.V., Kumar, M.A., 2020. Generalized Bayesian Cramer-Rao inequality via information geometry of relative α-entropy. In: IEEE Annual Conference on Information Sciences and Systems, pp. 1–6. Murray, M.K., Rice, J.W., 2017. Differential Geometry and Statistics. Routledge. Narasimhan, S., Krolik, J.L., 1995. Fundamental limits on acoustic source range estimation performance in uncertain ocean channels. J. Acoust. Soc. Am. 97 (1), 215–226. Naudts, J., 2004. Estimators, escort probabilities, and ϕ-exponential families in statistical physics. J. Inequal. Pure Appl. Math. 5 (4), 1–15. Nielsen, F., 2021. Progress in Information Geometry: Theory and Applications. Springer. Nielsen, F., Bhatia, R., 2013. Matrix Information Geometry. Springer. Nielsen, F., Critchley, F., Dodson, C.T.J., 2017. Computational Information Geometry for Image and Signal Processing. Springer. Noam, Y., Messer, H., 2009. Notes on the tightness of the hybrid Cramer-Rao lower bound. IEEE Trans. Signal Process. 57 (6), 2074–2084. Notsu, A., Komori, O., Eguchi, S., 2014. Spontaneous clustering via minimum gammadivergence. Neural Comput. 26 (2), 421–448. Petz, D., 1996. Monotone metrics on matrix spaces. Linear Algebra Appl. 244, 81–96. Petz, D., 2007. Quantum Information Theory and Quantum Statistics. Springer Science & Business Media. Pistone, G., 1995. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 23 (5), 1543–1561. Pistone, G., 2007. Exponential statistical manifold. Ann. Inst. Stat. Math. 59, 27–56. Rao, C.R., 1945. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–91. Renyi, A., et al., 1961. On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pp. 547–561. Rockah, Y., Schultheiss, P., 1987a. Array shape calibration using sources in unknown locations— Part I: far-field sources. IEEE Trans. Acoust. Speech Signal Process. 35 (3), 286–299.
114 SECTION
I Foundations of information geometry
Rockah, Y., Schultheiss, P., 1987b. Array shape calibration using sources in unknown locations— Part II: near-field sources and estimator implementation. IEEE Trans. Acoust. Speech Signal Process. 35 (6), 724–735. Roux, N.L., Manzagol, P.-A., Bengio, Y., 2008. Topmoumoute online natural gradient algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856. Spivak, M., 2005. A Comprehensive Introduction to Differential Geometry–Volume I. Publish or Perish Inc. Sundaresan, R., 2002. A measure of discrimination and its geometric properties. In: Proc. of the 2002 IEEE International Symposium on Information Theory, Lausanne, Switzerland, June, p. 264. Sundaresan, R., 2007. Guessing under source uncertainty. IEEE Trans. Inf. Theory 53 (1), 269–287. Tsallis, C., Mendes, R.S., Plastino, A.R., 1998. The role of constraints within generalized nonextensive statistics. Phys. A 261, 534–554. van Erven, T., Harremoe¨s, P., 2014. Renyi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 60 (7), 3797–3820. Zhang, J., 2004. Divergence function, duality, and convex analysis. Neural Comput. 16 (1), 159–195.
Section II
Theoretical applications and physics
This page intentionally left blank
Chapter 6
Principle of minimum loss of Fisher information, arising from the Cramer-Rao inequality: Its role in evolution of bio-physical laws, complex systems and universes B. Roy Frieden* Wyant College of Optical Sciences, University of Arizona, Tucson, AZ, United States * Corresponding author: e-mail: [email protected]
Abstract A thermodynamically open system ordinarily obeys physical laws expressing maximum randomness. How, then, do real systems grow in ordered complexity, information content and size? Consider an “information channel”: this is a closed system, acting as a channel linking a vital input of information to an output device. An example is a cell membrane (CM) receiving vital input information about its environment, as carried by, say, an entering K+ ion. The information channel extends radially through the thickness of that CM; into the cell cytoplasm just beyond it. The total passage of the ion from input to output channel also describes a physical signal carrying Fisher information (FI) in the form of a charge current p(t) in time. Providing this information to the cell allows it to grow in complexity, and survive. During this time the CM is otherwise closed to inputs. It cannot receive any additional environmental information. Then, any change in the information so-carried can only be a loss. On the other hand, the growth in complexity of any well-defined system increases linearly with its (Fisher) information level. Hence the cell’s growth in complexity is maximized if the information level is maximized (obeys an MFI principle); or equivalently, if the above loss of channel information is minimized. As shown, thanks to the Cramer-Rao inequality, this MFI principle holds widely, normally over all time for all material channels. Such channels consist of evolving systems: sub-nuclear particles, quantum particles, classical particles, biological viruses such as the coronavirus, cells, plants, animals, economic systems, sociological systems, Handbook of Statistics, Vol. 45. https://doi.org/10.1016/bs.host.2021.07.002 Copyright © 2021 Elsevier B.V. All rights reserved.
117
118 SECTION
II Theoretical applications and physics
planets, galaxies and universes. Mathematically, demanding MFI (formerly termed “extreme physical information” or EPI) of a physical system yields a differential equation defining its probability law p(x) ¼ a∗(x) ∙ a(x), x 5 x, y, z, ct, through a real, complex or tensor amplitude law a(x), q(x), Ψ(x) or Ψ(xα), xα ¼ x, y, z, ct, respectively. Its outcomes x are randomly sampled during system use and evolution. The system is predicted to obey properties of continued, evolutionary growth and complexity (organization). Because each grows independently, to achieve even higher values of FI it splits off (give “birth to”) a subsidiary that continues evolving via MF. This is independent of its “parent” system, so their FI values can maximally add. In biology, a parent (plant, or animal, …) gives rise to an offspring. In economics, a parent company splits off a subsidiary. In cosmology, a universe splits off an “offspring” universe. The offspring universe so forms, and evolves independent of its parent, as required, when located in an incomplete vacuum (containing a finite, yet small, number of molecules/cubic meter). Even the most nearly perfect vacuum in intergalactic space obeys this property. To further satisfy the EPI requirement of maximized total Fisher (MFI) a large number of such universes Un, n ¼ 1, …, N,called a “multiverse,” independently form. Each can exist, and evolve on its own, since each obeys the same 26 universal physical constants as ours. The formation of universe Un initiates only after Un1 ships phenomena, identifying the 26 constants, to it. These are via a narrow Lorentz wormhole channel within Un1, stretching from its interior, to just outside its “bubble” surface. There a point P of incomplete vacuum is assumed to exist, allowing universe Un to start evolving. Effects encouraging or inhibiting or (even) annihilating such created universes are discussed. Keywords: Sciences, At all scales, From Fisher information, Natural selection from Fisher information, The multiverse from Fisher information
1 Introduction (Fisher, 1922; Frieden, 1998, 2004; Frieden and Gatenby, 2019)

1.1 On learning, energy, sensory messages

A child learns, from direct experience, that he/she is a distinct system, needing energy inputs, chiefly food, drink and warmth, as needed for physical growth and survival. The energy inputs are obtained out of sensory, i.e., biological, neuron-based messages; these messages provide us with information enabling us to use the inputs to satisfy these needs. We grow out of receipt and processing of the information. Later on, the child also learns that all learning ultimately ceases at some definite, finite time T. The learning system then stops acquiring information, so that to stop learning is "fatal" to it (or to any process of system growth). The cessation of learning is aptly called a mathematical "death process." This even extends to cosmological growth (Sections 2.6–2.17), whereby an "offspring" universe forms from a "parent" universe out of information fixing its 26 universal physical constants to those of the parent. There it is seen, likewise, that once a forming universe stops receiving this information its initial, rapid expansionary phase of growth ends (followed by a "big bang" phase of heat expansion, not taken up here).
Much later, we learn through science that not only is "information" vital to survival of our personal system, but it also provides the basis for us to form (Frieden, 1998, 2004; Frieden and Gatenby, 2007)—i.e., learn or know—the laws governing the actions of all physical "systems" (even including "universes"; see Sections 2.5–2.16). Such laws define both living and nonliving processes, existing interior and exterior to our bodies, including biological, physical and cosmological phenomena. We continue to so learn—as a vital function of life—until the (above) fateful time T ends our learning process.
1.2 On variational approaches

In graduate school we learn how the physical laws, in particular, can be formed out of properties of energy. This is through the usual mathematics of a Lagrange variational approach (the mathematics used here as well). This, and, for some of us, witnessing the end of WWII, leaves us in awe of the concept of energy: how it forms, is propagated and is used; also that it is vital to our personal survival as biological creatures.
1.3 Vital role played by information

However, from the above, we come to realize that, in contrast, physical (including biological) laws are actually expressions of information learned from data; and can, in fact, be shown to form out of information concepts, in particular that of maximum Fisher information (MFI) (Frieden, 1998, 2004; Frieden and Gatenby, 2007) in the data. Indeed, Erwin Schrodinger famously asked where his own, famous kinetic energy term in the Lagrangian for quantum particles came from. In fact it is precisely the Fisher information Eq. (8a) for the particle (Frieden, 1998, 2004). Why maximum in particular? Closed systems, such as human beings, microbes attached to biological cells through their cell membranes, planets, galaxies and (even) universes, have well-defined values of Fisher information that actually increase with time. In particular, these do not suffer the irreversible, lossy changes that open systems, instead obeying maximum entropy (disorder), generally suffer. In fact, the maximized Fisher (MFI) values connote increased levels of complexity as well; e.g., as attained in biology out of the process of "natural selection" (Section 2.3). On this basis of MFI virtually all standard physics through 2nd-year graduate work has been shown, in detail, to derive (Frieden, 1998, 2004; Frieden and Gatenby, 2007), case by case. The information I is that of R.A. Fisher (Fisher, 1922), and its maximization in each case is attained by solving a variational problem δ(I − J) = 0. Here J is prior knowledge of a least upper bound to the Fisher I for the phenomenon. This is always available, as shown below. The solutions to the variational problem generally obey
I − J = minimum,  I = kJ,  0 ≤ k ≤ 1,   (1)
where I is a convex functional and J = constant. The second equation, I = kJ, thereby defines, for each application, the efficiency k with which the prior knowledge of a maximum possible information value J of Fisher I comes through into the data. This result is confirmed in all applications. Thus the principle (1) is one of minimum loss of information or, equivalently, with J fixed and I ≤ J, one of maximum Fisher information (MFI). But, on what grounds is J the constant maximum value of the information? Also, in a given application, how do we know that the minimum required in Eq. (1) will actually be observed? And does principle (1) actually follow from the Cramer-Rao inequality? This is found in Sections 3.1–3.3. Other questions about foundations of the principle (1) are answered in Sections 4.1, 4.2 and 5. Meanwhile, some key applications of the use of principle (1) are defined in Sections 2.1–2.17. These start at the finest scales, for sub-nuclear quantum effects in atoms and the sub-micron level in living cells, proceeding outward to the macroscopic evolution of plants and animals, the planet, solar system, galaxy, universe and, finally, multiverse. That is, MFI predicts that all these phenomena are embraced by systems that optimally evolve (see below) during optimally extended lifetimes T. The latter aspect of evolution, in particular, seems a newly obtained, yet powerful, property of MFI.
2 Overview and comparisons of applications

2.1 Classical dynamics (Frieden, 1998, 2004; Frieden and Gatenby, 2007)

A particular class of physics (the classical variety, which ignores phase) obeys k = 1/2; i.e., its dynamics show a loss of 50% of the information level J.
2.2 Quantum physics (Frieden, 1998, 2004)

In all quantum scenarios, including those of sub-nuclear particles, it is found that the efficiency constant k = 1 (there is no loss of information level J). Taking the results of Sections 2.1 and 2.2 together: when you lose track of phase, you lose knowledge of 50% of what is actually there. All the familiar particle-wave equations of quantum mechanics are derived using the MFI principle Eq. (1). These are the non-relativistic equation of Schrodinger; the relativistic equation of Dirac for spin-1/2 particles, including electrons and quarks; and the Rarita-Schwinger equation for fermions of arbitrary spin. Schrodinger's "dilemma," as to what the origin of the Lagrangian (so-called kinetic energy) term might be, is resolved; it is simply the FI term in Eqs. (8a), (8b) or (8c) below.
Quantum mechanics is ordinarily taught as growing out of the de Broglie hypothesis, that a mass particle of momentum μ0 obeys quantum wave motion of wavelength λ = h/μ0, with h Planck's constant. (By comparison, we showed that quantum mechanics arises directly out of MFI (Frieden, 1998, 2004), with no need for the de Broglie hypothesis.) Nevertheless, a longstanding question is why the de Broglie hypothesis holds. In answer, we found (Frieden and Soffer, 2009) that seeking the unknown dynamics obeyed by a moving particle obeying the MFI principle (1) naturally gives two characteristic solutions for its motion, according to the size of its momentum "coordinate" μ: the particle obeys classical mechanics if μ = μ0 is macroscopic, or quantum mechanics if μ is microscopic. The well-known Fourier relation connecting particle position x and momentum μ is also derived (Frieden and Soffer, 2009). The C-R inequality (3) directly gives (Frieden, 1998, 2004) the Heisenberg uncertainty principle, when (3) is applied to describing mean-squared errors in position and in momentum using the preceding Fourier relation connecting particle position and momentum. That is, the Heisenberg principle, with all its incredible implications, is a direct expression of the C-R inequality.
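This last point is easy to illustrate numerically (a sketch added here, not part of the original derivation; the grid sizes and spread value sigma are arbitrary choices). The Python snippet below builds a Gaussian wave packet, obtains its momentum-space amplitude by FFT (the Fourier relation above), and checks that the spread product attains the Heisenberg minimum of 1/2 in units where hbar = 1:

    import numpy as np

    # Gaussian wave packet on a position grid (hbar = 1; values illustrative)
    N, L, sigma = 4096, 80.0, 1.7
    x = np.linspace(-L / 2, L / 2, N, endpoint=False)
    dx = x[1] - x[0]
    psi = (2 * np.pi * sigma**2) ** -0.25 * np.exp(-x**2 / (4 * sigma**2))

    var_x = np.sum(x**2 * np.abs(psi) ** 2) * dx      # <x^2>; mean is zero

    # Momentum-space amplitude via FFT (the Fourier relation cited above)
    k = 2 * np.pi * np.fft.fftfreq(N, d=dx)           # momentum grid
    dk = 2 * np.pi / (N * dx)
    phi = np.fft.fft(psi) * dx / np.sqrt(2 * np.pi)   # unitary normalization
    var_k = np.sum(k**2 * np.abs(phi) ** 2) * dk      # <mu^2>

    print(np.sqrt(var_x * var_k))                     # ~0.5, i.e., hbar/2

The Gaussian saturates the bound; any other packet shape gives a strictly larger product, mirroring the C-R route to the uncertainty principle described above.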
2.3 Biology (Darwin, 1859; Fisher, 1922; Frieden and Gatenby, 2020; Gatenby and Frieden, 2016; Hodgkin and Huxley, 1952)

Here all key-and-lock effects (e.g., all viral growth, as well as the Hodgkin and Huxley (1952) effect of cellular nutrient uptake from microfilaments) derive from the principle. On this basis, then, the simultaneous occurrence of all such optimization effects in biology makes up the biological process Darwin (1859) called "natural selection." Natural selection is thereby shown to be the systemic expression of maximum Fisher information (Frieden and Gatenby, 2020; Gatenby and Frieden, 2016) (a proposition that Fisher, a biologist himself, might well have believed).
2.4 Thermodynamics (Frieden et al., 1999)

The Legendre-transform structure of thermodynamics can be replicated, without change, if one replaces the statistical entropy measure S by Fisher's information measure I. Also, the important thermodynamic property of concavity is shown to be obeyed by I. Such trial use of the information measure leads to a Fisher-based thermodynamics that seems able to treat equilibrium and nonequilibrium situations in a manner entirely similar to the conventional one. Riccati- and Schroedinger-type equations easily emerge from the Fisher approach, and the shift-invariance assumed of the data implies statistics with no absolute origin.
2.5 Extending use of the principle of natural selection (Popper, 1963)

Natural selection is, by most definitions, a process of living, biological systems. But in fact, further consider this: physicists and engineers actually use a "natural selection" argument for explaining why certain systems (living or non-living) continue to function while others do not. In physics, this is the argument of K. Popper (1963) that any new mathematical law proposed to describe a physical effect can be regarded as a physical law only if it is ultimately capable of being verified out of observation (e.g., using a suitable sensory, i.e., biological, system as in Section 1). Otherwise it should be rejected as science, albeit perhaps accepted as a possibility if it seems important and promising. Popper's famous criterion is often restated in negative form, replacing "capable of being verified" by "capable of being contradicted" by experiment. Consider, on these grounds, the law of "survival of the fittest" (simply meaning producing the most offspring per generation that themselves reach maturity). This, of course, is justified out of empirical observations of the (related) process of natural selection. But it is not just a selection of living creatures, as follows. Of wider scope are applications of MFI to engineering and the sciences. Consider, e.g., the ultimate domination of the "species" automobile by one best-seller in a given year. Is this not, also, another instance of such "survival"? Also, how about devising a new physical theory, or model, that best explains (with the simplest, least necessary preconditions) known effects? Here "survival" simply means ultimate domination, now by a product or idea, out of a "natural" process of selection (purchase by discerning consumers, or winnowing out by skeptical scientists, etc.). In this sense, natural selection actually occurs in all well-defined systems, whether of plants, animals, cars, scientific theories, …, so long as one "species" ends up dominating. See Section 3, and beyond, for further extension of use of this generalized principle of natural selection. That is, all existing systems (biological or not) exist because they obey the concept of "fittest" for some defined "function."
Indeed, were they not "fittest" they wouldn't exist (a general "system-thropic" view replacing the usual anthropic one). As will be seen, these properties of MFI actually trace back to the Cramer-Rao inequality (Section 3.2, Eq. (5)). This is complemented by the philosopher I. Kant's view of knowledge acquisition in Sections 4.1–4.2. Kant's view is shown to tie in with that of communication theory: observed phenomena are but outputs of communication channels that normally operate imperfectly as conveyors of information to the receiver (i.e., to the observer). That is, the original information exists in perfect form
(and at its initial, maximum information value J) back at the input (which was just prior to entering the channel). Beyond this, as it travels through the channel to the observer, the information can only be reduced, by inevitable imperfections in the channel, to a value I. Thus I = kJ, 0 ≤ k ≤ 1, as modeled in Eq. (1). In fact, in all uses of MFI to date, the case k = 1 only holds for data from quantum-based phenomena. Classical physics obeys k = 1/2, with 50% of the Fisher information lost (Frieden, 1998, 2004; Frieden and Gatenby, 2007) to the observer.
2.6 From biological cell to earth to solar system, galaxy, universe, and multiverse

Consider, first, the reaction that allows us to live. Cells require various nutrients, hormones and ions, such as Na+ and K+, in order to live. Such ions are shipped to, and through, the cells within microfilaments and microtubules. These require transits of ions through the cell membrane (CM). The information I that a Na+ ion, e.g., is about to enter a cell through its CM is vital to cell function. The information channel is, here, a path normal to, and directly through, the thickness of the CM. The MFI principle Eqs. (1) and (8a) states that, although information is lost in traversing the CM thickness, it is a minimum loss (see Section 4.2). Thus, the MFI principle holds. Its solution is the resulting "flow" p(t) of the ion. This increases exponentially (Hodgkin and Huxley, 1952) with time t, indicating very quick growth (as also predicted by the Hodgkin-Huxley equation for the problem); a minimal numerical sketch follows below. The cell prospers, and its fitness is increased. Such microfilament-to-CM flows go on throughout the body, which thereby obeys optimum overall fitness and is, therefore, optimally selected for in its offspring. Thus, "natural selection." By such extended use of the concept of "natural selection," looking ever further outward in scale of size, the Fisher principle Eq. (1) also defines the forces and evolution governing, from smallest to largest scales: the sub-nuclear particles (which, in fact, obey quantum mechanics, out of the Fisher principle Eq. (1)); larger nonliving particles (atoms, molecules, etc.); living particles (as we saw), i.e., biology; and the earth, solar system, galaxy, universe and, ultimately, universes. These systems obey, on all levels of size, minimum loss of Fisher information I − J or, as we saw, maximization of its Fisher value I. In fact it is the latter requirement, that the total information I be maximized, coupled with the additivity of I for independent processes, that, by itself, would have caused the above transitions from smallest to largest scales of system size to actually occur. As an example, by the additivity of the information from independent systems, the huge mass of, e.g., germs in existence (each acting independently for the most part) must represent, as well, huge values of the Fisher I.
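The promised sketch of that exponential flux follows (added for illustration only; the growth rate r and initial flow p0 are hypothetical values, not taken from the Hodgkin-Huxley fit). A flow obeying dp/dt = r·p, the form of growth cited above, integrates to p(t) = p0·exp(rt):

    import numpy as np

    # Minimal sketch: Euler integration of dp/dt = r * p, compared with the
    # exact exponential solution. Values of r and p0 are illustrative only.
    r, p0 = 2.0, 1e-3
    t = np.linspace(0.0, 3.0, 301)
    dt = t[1] - t[0]

    p = np.empty_like(t)
    p[0] = p0
    for i in range(len(t) - 1):
        p[i + 1] = p[i] + dt * r * p[i]

    print(p[-1], p0 * np.exp(r * t[-1]))   # numerical vs. exact exponential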
Another consequence, regarding "natural selection," which (as we saw) is obeyed by all living organisms, is that it holds for all systems, no matter their size. Consider, e.g., formation of the most massive such system conceivable.
2.7 Creation of a multiverse (Popper, 1963) by requiring its Fisher I to be maximized

Suppose that a universe B exists as an isolated system. Also suppose it has, in isolation, evolved via principle Eq. (1) to attain its maximum Fisher level. This is a state of maximum complexity as well; see the discussion following Eq. (8c). We suppose the universe had the usual "Big Bang" beginning, an eternal inflationary process whereby it undergoes expansion of space-time. (The next phase, an exponential re-heating, is not taken up here.) According to eternal inflation, this inflationary phase of the universe's space-time lasts forever. Why is this so? According to MFI the inflationary phase persists simply because it enables the information, and so the complexity, of the expanding system to ever increase, as required by Eq. (1). But every increase in a system's complexity requires an ever higher level of energy (as in, e.g., a plant maturing via increasing inputs of solar energy). Finally there is some point at which the system cannot so utilize these inputs; e.g., when its size is already prohibitively large out of past information (and/or energy) demands. For example, the plant (above) has a genetically predetermined maximum size. It simply cannot grow any larger. We assume that individual universes are, likewise, limited in the possible extent of their in situ growth processes. For example, it appears that the ever-increasing expansion of our universe will ultimately suffer a "cold death," whereby all its particles are infinitely separated and forever noninteracting. So, following the strategy of biology, the system B "decides" (actually, we will see, is compelled by principle Eq. (1)) to reproduce. What has "enabled" evolution of our universe, call it A, to reach ever-higher levels of Fisher information I? There is evidence (Frieden and Gatenby, 2019) that these levels have been enabled by incorporating into the system the 26 known universal, unitless physical constants (e/m ratio, fine structure constant, etc.; see Wikipedia, "Standard model"). Certainly, there is no evidence that such evolution would have occurred out of any other set of the 26 constants. Instead, even very tiny departures from these 26 values would have acted as "mutations," distorting the evolution in other directions. Such departures would probably have resulted in decreases in Fisher I due to mathematical bottlenecks, non-convergences and/or contradictory event outcomes occurring during the evolution.
And so, we assume that universe B starts its reproduction process out of these very 26 constants as well, with the successful result A (our own universe). And, since B and A now have the same 26 constants, this could only be the case if, during the reproduction process B → A, universe B physically shipped its own 26 physical constants to A. This would be in the form of bosons, fermions and Higgs particles undergoing evolutionary reactions utilizing the 26. (There are actually, in total, 37 such constants, of which the last 11 define cosmological properties; these might eventually be shown to so evolve out of the original 26.) But then, would all such universe pairs B, A sharing the same 26 physical constants have identical developmental histories? No, since even in the presence of the same 26 constants each stage of evolution of a universe would actually be randomly different. This is because the outputs from any evolutionary process of reproduction and development are, in fact, random samples x from a probability law q(x) or Ψ(x) (defined as amplitudes in Eqs. (8a)–(8c) below). Although the laws are repeated in the forming universe A, it is the samples from them, of, e.g., energies, positions, …, that actually define its history. Hence even a reuse of the same 26 constants during each new formation of a universe would not form an exact repetition of the previous system's evolution.
2.8 Analogy of a cancer "universe"

Indeed, cancer is well known to evolve in this way (Section 6). That is, it begins as a statistical malfunction during gene replication from a functionally normal cell ("universe"). This is termed a "mutation" in genetics. Each cancer cell stops its normal functioning (e.g., a breast cancer cell stops making milk). Thus, it stops contributing to the "previous universe's" growth, instead channeling its energy into the new "universe" of the growing cancer mass. Meanwhile, the "universe" of cancer cells is intermingled with that of the functioning cells from which these mutated.
2.9 What ultimately causes a multiverse to form?

Note: At this point it is convenient to rename B to A2, and A to A1; etc., as in setout (2) below. What "motivation" is there behind occurrence of the transition process A2 → A1 (as renamed)? It does not occur only because A2, say, "knows" or "senses" that its ongoing expansion will ultimately lead to its demise as a cold, non-evolving space, with all its fundamental particles spaced indefinitely apart. It's simpler than that: the transition A2 → A1 is seen by A2 as integral to its ever-present requirement Eq. (1) of passing on to A1 maximum Fisher information.
(This occurs, as well, within the smaller systems comprising universe A2. For example, a developing elephant in A2 is a system ever subject to the MFI principle, and so it naturally reproduces, obeying Eq. (1), once permitted by its life cycle.) Likewise A2 was the "offspring" of a parent universe A3, etc., both of which (again) had the same 26 constants. That is, the maximization process Eqs. (1) and (8a)–(8c) grows a sequence of universes,

AN → AN−1 → AN−2 → … → A2 → A1 ≡ A (ours).   (2)
(The number N of prior universes is of course unknown, as discussed below, since it is ever changing.) Each new universe initiates growth at a specific time value and, if relativistic covariance holds on this scale, also at a specific space value (although the latter is not necessary to the theory). At each transition the same set of 26 constants is shipped from parent to offspring universe. How this is done is taken up below. Thus, our universe A has formed and evolved out of 26 physical constants of precise value. And these are presumed to have existed, as well, in all prior universes in the above chain (2).
2.10 Is there empirical evidence for a multiverse having formed?

There are as yet no "sightings" of other universes. But, in the absence of such confirming evidence on a colossal scale, is there, perhaps, an example of a smaller system, such as a galaxy, having originated out of a different galaxy? For example, did Andromeda (age about 10 billion years) originate out of our Milky Way, of age about 13.8 billion years (Wikipedia)? Although the difference in age of about 4 billion years might allow this, current thought seems to be that, instead, they both originated out of the accretion of existing stars. However, the above example of the growing cancer mass ("universe") suggests an alternative form to be taken by the offspring universe from its parent: that they co-mingle (say, spatially) and yet continue to develop independently. For example, in that cancer example, although the cancer mass exists and continues growing as a spatially distinct system (say A), it is still located within the human body mass of its origin (say B). Then perhaps, likewise, Andromeda and the Milky Way are evolving (spatially and temporally) quasi-independently (the independence required for continued increase in total Fisher I; Section 2.7 preceding).
2.11 Details of the process of growing successive universes (Frieden and Gatenby, 2019)

The successive universes are as listed in setout (2) preceding. 1. At the outset, each set of 26 physical constants (plus Higgs particles) is shipped from a parent universe An (temporarily call it B), to an offspring
universe An−1 (call it A). This is via Lorentzian "wormhole" pathways (Frieden and Gatenby, 2019) or equivalent means. The values of the 26 constants are defined by accompanying phenomena (as normally learned via lab experiments). The wormhole pathways extend from the interior of the parent (say B) to just outside its "bubble" surface. There, at a point P, incomplete vacuum is assumed to hold (see Fig. 1). And so, the point P becomes the origin for growth of the offspring universe A. In the same way universe B was likewise formed out of the imperfect vacuum of C; etc., as in setout (2) and Fig. 1. 2. Each universe in setout (2) initiates, grows and evolves, with statistical independence, as follows (Frieden and Gatenby, 2019):
FIG. 1 Two universes: ours, A (lower), formed by a neighbor, B (upper), via a Lorentzian wormhole forming an "umbilical pathway." The pathway is a Lorentz wormhole extending from somewhere internal to B to its surface, and then just outside it (as shown) to form the surface of the newly forming A. The external portion (shown) of the wormhole is assumed to maintain minimal length as A grows in size. From TAKE 27 LTD/Science Source.
(i) The universes form instantly, at a time t0 = 0, not after a time lag due to the finite speed c of light. Why? According to general relativity, in two gravito-electrically interacting particle ensembles, e.g., two planets or stars moving at constant velocity with respect to each other, each feels a force toward the instantaneous position of the other body, i.e., without a speed-of-light delay (Wikipedia). As seen below, this effect is vital to ultimate survival of the multiverse, even in the presence of a destruction wave due to pure vacuum somewhere else in the cosmos. (ii) The distribution p(x, y, z, t) of matter in such a universe emerges during its expansionary phase. In comparison with the time coordinate t in item (i) preceding, the space origin (x0, y0, z0) of the distribution is conventionally assumed not to have a well-defined (i.e., absolute) value. However, it is also possible to take the viewpoint that while such a spatial origin is currently unknown, it can ultimately be known (see Frieden and Gatenby, 2019, for details). The instantaneous nature of formation of the new universe (described previously) is assumed to define a specific time t0 = 0 of initiation of formation of its mass-energy distribution. The MFI solution for the t-dependence of p(x, y, z, t) is found (Frieden and Gatenby, 2019) to be a simple exponential (as conventionally thought as well). But by the condition of relativistic covariance, the (x, y, z) dependence should likewise exist, and so should likewise obey MFI theory. This covariant solution is found (Frieden and Gatenby, 2019), likewise via MFI, to be exponential in (x, y, z). It results that p(x, y, z, t) is exponential in all four of its variables. (iii) Then, how was the "first" universe AN in this multiverse formed, and what is the value of N? By step 2 above, this formation would likewise have required an "epsilon" of matter-energy to be present. Such a remnant might have been, say, the sole surviving piece of an otherwise totally annihilated, previous multiverse. In honor of his contributions to understanding the multiverse, we call the composite system AN → AN−1 → AN−2 → … → A2 → A1 ≡ A of universes the Guth (1981) multiverse.
2.12 How many universes N might exist in the multiverse?

In principle the (above) principle of maximum Fisher information I holds in the formation and evolution of each component universe A, B, C, …. Thus A forming out of B defines a Fisher information channel. On this basis, since these form independently, the total Fisher I in the multiverse is simply the sum (Frieden, 1998, 2004; Frieden and Gatenby, 2007) of the individual I values of the universes A, B, C, …. Finally, since these are all positive values,
the condition of maximum Fisher information overall for the entire multiverse implies, ideally, an unlimited number of universes. However, the number tends to be reduced by other effects, discussed next, that work to keep it finite. Estimates of the size of N are in Section 2.17 following.
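The additivity invoked here is the standard additivity of Fisher information over independent data. A small sketch checks it numerically via the score, whose variance is I (illustrative only; the Gaussian channel and all numerical values are assumptions, not from the chapter):

    import numpy as np

    rng = np.random.default_rng(0)
    a, sigma = 0.0, 1.0                 # true parameter and noise level
    # For one Gaussian sample, the score is d/da log p(y|a) = (y - a)/sigma^2,
    # and Fisher information I is the variance of the score. Scores of
    # independent samples add, so I_N = N * I_1: information is additive.
    for N in (1, 4, 16):
        y = rng.normal(a, sigma, size=(200_000, N))
        score = ((y - a) / sigma**2).sum(axis=1)
        print(N, score.var())           # ~ N * (1/sigma^2) = N here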
2.13 Annihilation of universes

There are, however, factors that limit the number of universes. As we saw, the existence of a multiverse depends on the existence of imperfect vacuum everywhere in its component universes. According to quantum mechanics, a particle can "tunnel" through a barrier between one region and another; this applies, as well, to the vacuum state. So a universe that is evolving in false vacuum could, via random quantum fluctuations, suddenly find part of itself within true vacuum. The possibility of such vacuum decay has arisen of late because recent experimental and theoretical work suggests that our vacuum is probably metastable: measurements of the Higgs boson and top quark masses indicate that a true, i.e., total, vacuum can exist, with less energy than the present one. Problems of vacuum instability may also arise. So it is possible that the current minimum of the scalar potential is merely a local one, so that a deeper minimum exists; or that the potential has a bottomless pit separated by a finite barrier. In this situation, the Universe should eventually tunnel out into some other state, in which the elementary particles and the laws of physics are different. This would mean "curtains" for our current existence. Both Linde and Guth, however, continue to support the inflationary theory and the multiverse (see the Linde quote in Section 2.18). According to Guth (1981): "It's hard to build models of inflation that don't lead to a multiverse. It's not impossible, so I think there's still certainly research that needs to be done. But most models of inflation do lead to a multiverse, and evidence for inflation will be pushing us in the direction of taking the idea of a multiverse seriously."
2.14 Growth of a bubble of nothing

But in fact, if there is, instead, perfect vacuum somewhere in the multiverse (say in our universe A), a different kind of "growth" would occur there: the unlimited growth of that vacuum. This takes the form of a growing "bubble of nothing" expanding outward at the finite speed of light (important!) in every direction. Moreover, since every other universe likewise requires incomplete vacuum, every pre-existing universe in the multiverse would seemingly be canceled (wiped out), one by one, as the spreading bubble of nothing engulfs it.
2.15 Counter-growth of new universes

However, new universes would, meanwhile, continue to grow in adjacent regions where there is still imperfect vacuum. So, multiverse growth would be competing with the annihilation due to the vacuum bubble. Which process will win out? As we saw in Section 2.7 on the growth of a new universe, each of these in fact forms instantly from its imperfect vacuum, whereas (as we noted) destruction by the annihilation wave occurs at a limited rate due to its finite speed c, that of light. Also, destruction by the annihilation wave of previously imperfect vacuum space is undone, in time, as new mass particles are regained, by Higgs bosons, in that space. From this viewpoint it would seem that, owing to its advantage of speed of formation, the Guth multiverse "growth wave" eventually wins out over the initial, single annihilation wave event.
2.16 Possibility of many annihilation waves

But what if, more realistically, many such vacuum-induced annihilation waves are independently launched, at a given finite rate r (and each traveling at the limited speed value c)? Which effect will now win out in the growth-destruction contest? The answer may depend upon the size of the destruction rate r vs. that of the imperfect-vacuum mass growth rate, call it r′. The ensuing problem of such competing rates seems, basically, to follow the rules of thermodynamics. We return next to fundamentals of the Fisher approach overall, in particular how it derives, based on use of the Cramer-Rao inequality. As we saw, what impels the "parent" universe to ship its 26 constants to its offspring universe is the principle (1), Eqs. (8a)–(8c), of maximized Fisher information, coupled with the statistical independence of the two universes. Under such independence their Fisher information values add, forming a value larger (of course) than either one. So, on this basis, the two universes are preferred. Moreover, on the same basis the two should each, themselves, reproduce, etc., forming a large number of universes, the multiverse. As these exist as independent conveyors of Fisher information, the result would be an extremely high level of Fisher information present in the multiverse.
2.17 How large a number N of universes exist (Linde and Vanchurin, 2010)?

The precise number is, of course, unknown, since the theory even behind formation of a single universe is unclear. Moreover, no one has yet sighted a source that was definitely exterior to our universe. However, physicists A. Linde and V. Vanchurin formed a physiological estimate (Linde and
Vanchurin, 2010) based on the number N of "things" that are distinguishably observable by the neurons of a human brain during its lifetime. On this basis, the effective number of observable universes has a value of about N = 10^(10^16), that is, 10 raised to the power 10,000,000,000,000,000. A truly colossal number. This number is a kind of limit to satisfying the Popper criterion (Popper, 1963) of "observability." It, or anything larger, could not be observed under any physical (or physiological) conditions. In fact much higher values formally exist (Linde and Vanchurin, 2010), but these ignore the preceding Popperian limit. One such estimate follows, based on our MFI growth analysis alone. A universe forms out of incomplete vacuum due to the existence of at least one particle per cubic meter (Section 2.11). There are estimated to be 3.28 × 10^80 particles in our universe (Wikipedia). Then, ultimately, there could be that many universes so formed from ours alone. (And, since this number is much less than the above physiological limit of Linde and Vanchurin, these could be distinguishable in a human lifetime.) However, if, by the same process, each of these gave rise to a like number, the result would be 3.28 × 10^80 raised to the power 3.28 × 10^80 universes. Even if physically valid, this number much exceeds the above Linde-Vanchurin physiological limit set by the human brain; and so it could never be verified out of direct observation, violating the Popper criterion (Popper, 1963). Nevertheless, indirect means could exist.
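The comparison of these enormous counts is easiest in base-10 logarithms, since the numbers themselves overflow ordinary floating point. A quick check (illustrative arithmetic only):

    import math

    # log10 of the Linde-Vanchurin limit 10^(10^16) is simply 10^16.
    # For n^n with n = 3.28e80, log10(n^n) = n * log10(n).
    n = 3.28e80
    print(n * math.log10(n))   # ~2.6e82, far above 1e16,
                               # so n^n vastly exceeds 10^(10^16)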
2.18 Is the multiverse merely a theoretical construct?

According to Linde and Vanchurin (2010), "It's possible to invent models of inflation that do not (my italics) allow a multiverse, but it's difficult. Every experiment that brings better credence to inflationary theory brings us much closer to hints that the multiverse is real."
2.19 Should the fact that we have not observed life elsewhere in our universe affect a belief that we exist in a multiverse?

Yes. If the multiverse, and the phenomena described in prior Sections 2.6–2.15 that explain it, are valid, then our existence as a living form here in universe A may be tenuous, by the following reasoning. Universe A is, as we have seen (Frieden and Gatenby, 2019), consistent with having developed from a sequence AN → AN−1 → AN−2 → … → A2 → A1 ≡ A of prior universes, each out of the same 26 physical constants, and each out of incomplete vacuum. Also, the raw materials needed for life (carbohydrates, oxygen and water floating as gases) have been observed in outer space, some only hundreds of light-years away.
There is water vapor in the Milky Way, although the total amount is 4000 times less than in a quasar, because most of the Milky Way's water is frozen in ice. A complication is that the Milky Way is roughly 100,000 light years across, so any sighting of water within it is delayed by up to about 100,000 years. Since the earth is about 4.5 billion years old, such observational delays are negligible on that time scale. Also, it is now apparent that about 35% of observed exo-planets contain up to half their mass as water. Some of these are even but a few astronomical units (earth-sun distances) away. On these bases, some observations we might make of life-like events in outer space ought only be delayed by hundreds of years from their inceptions. On this basis, life forms should be widely observed in the cosmos. But meanwhile, as we noted in Sections 2.13–2.15, if there are ever-expanding "bubbles of nothing" arising out of one (or more) instances of growth of pure vacuum, these bubbles would extinguish the life forms among all universes in AN, AN−1, …, A2, A that they engulf. These life forms may, in fact, already have been extinguished, thereby accounting for why we do not widely observe life elsewhere in the cosmos. Worse yet, as a corollary, we will be engulfed as well, once the "bubble" reaches us here. At this point it is perhaps better to return to theoretical considerations.
3 Derivation of the principle of maximum Fisher information (MFI)

Note that this principle, during its long-term development (Frieden, 1998, 2004; Frieden and Gatenby, 2007), has been called MFI, as well as Extreme Physical Information (EPI) and Minimum Loss of Fisher Information. In this chapter it is also simply called "the Fisher principle."
3.1 Cramer-Rao (C-R) inequality (Frieden, 1998, 2004; Frieden and Gatenby, 2020)

The Cramer-Rao inequality is

e²I ≥ 1.   (3)
Here, e² is the mean-squared error in estimating some fixed system parameter a out of data

y ≡ yn = a + xn,  n = 1, …, N.   (4)
These are conveniently represented as a vector y = a + x of the data y. From these the estimated value â(y) of the single system parameter a is formed. This can be, e.g., an arithmetic or geometric mean or, indeed, any analytic function of the data y.
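A small Monte Carlo sketch of inequality (3) may help fix ideas (added here for illustration; the Gaussian noise model and all numerical values are assumptions, not from the chapter). For Gaussian data the sample mean saturates the bound, e²I = 1, while a different estimator such as the median gives e²I > 1:

    import numpy as np

    rng = np.random.default_rng(1)
    a, sigma, N, trials = 3.0, 2.0, 25, 100_000
    I = N / sigma**2                     # Fisher information in N samples
    y = rng.normal(a, sigma, size=(trials, N))

    for name, est in (("mean", y.mean(axis=1)),
                      ("median", np.median(y, axis=1))):
        e2 = np.mean((est - a) ** 2)     # mean-squared error of the estimate
        print(name, e2 * I)              # >= 1 by Eq. (3); = 1 for the mean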
3.2 On derivation of the C-R inequality

Given data y, it turns out that the C-R Eq. (3) holds even for a system law p(y|a) that has finite (but constant) bias (taken up below). This is for any
system, of parameter value a, for which the mean departure of estimates â(y) from the true value a, over all possible data y, is biased away from a by a constant amount. That is, by the amount

∫ dy (â(y) − a) p(y|a) = b(a),   (5)

as long as the bias function b(a) = b = constant. Eq. (5) also shows that, with general bias b(a) present, the errors in the estimates â(y) of a do not, when averaged, generally mirror physical truth; they are on average biased away from it. Nevertheless, the resulting C-R inequality (Eq. (6) below) will still show an inverse dependence of the error in the estimates â(y) upon the Fisher information level I. This is extremely important. The steps in Frieden and Gatenby (2007) on pgs. 9–11 show that, even in the presence of general bias b(a), the minimum mean-squared error e²min attainable is

e²min = (1 + db/da)² / I.   (6)
Thus if the bias function b(a) = constant, the usual C-R inequality Eq. (3) holds. But better yet, even in the presence of general bias b(a), the minimum mean-squared error e²min attainable still decreases as 1/I. This indicates that the information I is the real key to defining the underlying physical effect. It holds independent of how the data y are processed, i.e., of the choice of estimator function â(y) in Eq. (5). The larger I is, the smaller is the error e²min, and this holds even if the estimates are biased. In other words, the information level I is set by nature; it is a property of the given natural effect. By comparison, the bias b(a) is set by the observer, by his choice of how "best" to estimate the key parameter a. And bias doesn't matter to e²min if I is large enough. Better yet, the C-R inequality can also be used to predict what physical law p(y|a) the system actually obeys! In general, this is even in the presence of the following three additional levels I–III of difficulty: I. The system whose data y obey Eq. (4) obeys unknown physics, expressed as the unknown probability law p(y|a). II. The system is in a generally unknown state a. III. The data are arbitrarily biased; that is, there is generally nonzero average bias b(a) in the estimates, as expressed by Eq. (5) above.
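The biased case Eq. (6) can be sketched the same way (again illustrative: a Gaussian model and a hypothetical shrinkage estimator â = c·ȳ, whose bias is b(a) = (c − 1)a, so that 1 + db/da = c):

    import numpy as np

    rng = np.random.default_rng(2)
    a, sigma, N, c = 3.0, 2.0, 25, 0.8
    I = N / sigma**2                   # Fisher information in N samples
    y = rng.normal(a, sigma, size=(100_000, N))

    est = c * y.mean(axis=1)           # biased estimator, b(a) = (c - 1) a
    print(est.var() * I)               # ~ (1 + db/da)^2 = c^2 = 0.64,
    print((1 + (c - 1)) ** 2)          # saturating the bound of Eq. (6)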
3.3 What do such data values (augmented by knowledge of a single equality obeyed by the system physics) have to say about the unknown system physics?

As will be seen, such an equality constraint will always be, equivalently, a statement of how large the Fisher information level I can be, namely J, for the given scenario. This is central to the derivation that follows of how problem I (of finding the unknown physics p(y|a)) may be solved.
3.3.1 Dependence of system knowledge on the arbitrary nature of forming the data

Most generally, the law on the data fluctuation sizes xn in Eq. (4) will depend on the size a of the system parameter (an example is a system obeying Poisson noise). But, for simplicity, let it be independent of a. That is, the system probability law on acquired data is shift-invariant, obeying

p(y|a) ≡ p(y − a) = p(x) ≡ q²(y − a) = q²(x)   (7)
for any shift xn ≡ x in Eq. (4). Because of its form this is also called an "additive noise" scenario (below Eq. (4)). Under these conditions (7), the system's Fisher information level I is most simply defined in terms of the amplitudes q(x):

I = 4 ∫ dx q′(x)²,   (8a)

the prime ′ operation denoting d/dx. This assumed a real-only, one-dimensional amplitude function q(x). More generally: if, as in quantum mechanics, it is instead a complex function Ψ(x), then q′² in Eq. (8a) is replaced by Ψ′*Ψ′ (the asterisk denoting a complex conjugate).
3.3.2 Dependence on dimensionality

Or, more generally, in 4-space x = (x, y, z, ict) it becomes

I = 4 ∫ dx ∇Ψ* · ∇Ψ   (8b)

in terms of a dot (inner) product of gradient operations ∇ ≡ (∂/∂x, ∂/∂y, ∂/∂z, ∂/∂(ict)) upon Ψ. Higher-order spaces form analogous expressions in terms of their inner-product gradients. Of further interest is the information form governing complex tensor amplitude functions (Frieden and Gatenby, 2007) Q^μ_αβ(xα) and Q^μβ_α(xα), α = 0–3,

I = 4NG ∫ dxμ (∂μ Q^μ_αβ)(∂^μ Q^μβ_α).   (8c)

Tensor summation notation is used, e.g., in nuclear theory, with NG the number of gluons. Forms (8a), (8b) and (8c) are all of the same form: a sum of squared gradients of a system amplitude function (whose "square" is the system probability law).
3.3.3 Dependence of system complexity (or order) upon Fisher I

The measures of "order" (or complexity) R and Fisher "information" I are linearly proportional (Frieden and Gatenby, 2011). Then, since each universe of the multiverse obeys, as the result of its evolution, maximum Fisher I, this also
represents a state of maximum order R (or complexity). Then the resulting total levels of information, order and complexity over all universes in the multiverse are maximum, as well. At least this is the tendency, … until bubbles of “nothing” (Sections 2.12–2.16) wipe out some (or all).
4 Kantian view of Fisher information use to predict a physical law

Eq. (6) states that, in the presence of any finite bias b(a), the minimum possible mean-squared error e²ms in estimating the state a obeys the proportionality

e²ms ∝ 1/I = minimum.   (9)
We immediately see from this that minimum error in the estimate is attained when the information I = maximum value, regardless of how its data are biased away from the required state parameter a. That is, relation (9) "wants us to know" that nature obeys laws conveying maximum information to us, no matter how poorly we observe it by distorting its observation with bias b(a). So, suppose that phenomenological data of Eq. (4) are taken. In view of the preceding, can the law q(x) governing physical fluctuations of the system be found from these considerations? This is the fundamental problem of fixing the physics from observed data.
4.1 How the principle of maximum information originates with Kant

The preceding suggests that nature tends to supply the observer with data that supply maximum information about each of its laws. This is in the sense of minimal loss from an ideal, maximum value J of it (see Section 4.2 following). Equivalently, it is assumed that "reality," in the form of the law p(y|a), or equivalently (as we saw in Eq. (8a)) in its amplitudes q(x), supplies maximum Fisher information about the system to the observer. Taking a cue from Kant (1781) (my thanks for a suggestion by Prof. E. J. Valentyne (Valentyne, 2021)), this turns out to be represented by a Fisher information I that differs minimally from an ideally large value of the information, call it value J (as follows).
4.2 On significance of the information difference I − J

As with I, this largest value J must also be a known function (actually, a "functional") of q(x). What is the defining property of information level J? Notion of an information-bearing channel: As alluded to in Section 2.5, the concept of J follows from the definition of a communication channel. Simply put, this consists, first, of an input "aperture" located just outside the
channel, where the ideal, fixed value of the parameter a exists. It is here, as well, where the maximum possible level J of the information about the parameter exists. The ideal value J of the information is thus, in principle, fixed, and independent of the fidelity of the system beyond it (the channel itself). Now communication channels operate as generally imperfect transporters of system parameter values to an output device. The price paid for this "service" is, as demanded by the 2nd law of thermodynamics, generally added noise. So, by definition of the channel, as the signal travels beyond the input aperture to inside the channel and toward the output, there can be no further (input) gain of information (although, somewhat ironically, this "noise" actually describes the physics of the medium or channel that is being sought). That is: within the system, the only change of information that can occur is its loss. Generally, the larger and more disordered the system, the larger the loss. Therefore, with the aim of preserving the maximum level J of information (present, as we saw, right at the system entrance), the "best" system obeys minimum loss I − J from level J. Recall that J is regarded as a fixed number defining prior knowledge of the scenario. However, its actual value does not have to be known except in quantum cases. So, the problem is solved by solving the minimization problem Eq. (1). As an example, consider a biological cell collecting nutritional inputs from calcium or sodium ions that travel into the cell via a microtubule. Each such ion must first traverse the cell membrane. The cell benefits maximally when, in traveling through the cell membrane, the ion loses a minimum amount of information (about the time, in this case). Using this principle results, correctly, in exponential growth in time of the ion flux, i.e., the Hodgkin-Huxley effect (Frieden and Gatenby, 2020). But what about a quantum effect? Here there really is no explicit "input aperture." The information loss I − J still occurs, but now, famously, by just the act of observing the effect. So the same principle of minimizing loss I − J from its known maximum level J holds (Frieden, 1998, 2004), as required.
5 Principle of minimum loss of Fisher information

From the preceding, regardless of physical scenario, the system amplitude law q(x) obeys a principle Eq. (1) of minimum loss of information I from an ideally large value J. But this makes sense as a tractable solution only if the large value J (Kant's ideal knowledge, see below) can actually be found. In fact it can, in all physical scenarios to date, out of prior knowledge. It should be mentioned that the coordinate space x is assumed continuous in all the preceding. Interestingly, if instead space is discrete, of values xn = nΔx, n = 1, 2, …, N, either intrinsically or as an approximation due to quantum
decoherence, Fisher information goes over into Shannon information (Frieden, 1998, 2004) (see also below Eq. (5) in Frieden and Gatenby (2020)).
5.1 Verifying that minimum loss is actually achieved by the principle

In practice, the problem Eq. (1) is generally solved by the use of the calculus of variations, i.e., by setting the mathematical variation

δ(I − J) = 0.   (10)
Now, the variation in Eq. (10) might conceivably give rise to a maximum rather than the minimum we want by principle Eq. (1). However, it is a minimum, as follows. (We carry this through for the simplest systems, having a single coordinate dimension.) The total integrand, called Int, for (I − J) in Eq. (10) is, by Eqs. (1) and (8a),

Int = 4q′(x)² − j(x).   (11)
The well-known condition for attaining a minimum value in problem (10) is to establish that Int is a convex function. Equivalently, that

∂²Int/∂q′² > 0   (12)
in this problem. Simplifying matters, note that j(x) in (11) does not have a dependence on q′(x), since j(x) always represents prior knowledge independent of the functional form of q(x) (and therefore of q′(x)). In fact, in all past applications (Frieden, 1998, 2004; Frieden and Gatenby, 2007) such independence has been the case. Then from Eq. (11) the partial derivative ∂Int/∂q′ is simply 8q′(x), so that

∂²Int/∂q′² = +8.   (13)
The condition (12) for accomplishing a minimum in principle Eq. (1) is thereby achieved in general.
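This convexity check is easy to reproduce symbolically; a minimal sketch (treating q′ as an independent symbol qp, as the variational calculus does):

    import sympy as sp

    qp = sp.symbols("qp")        # stands for q'(x) in the integrand
    j = sp.symbols("j")          # prior-knowledge term, independent of q'
    Int = 4 * qp**2 - j          # integrand of I - J, Eq. (11)

    print(sp.diff(Int, qp))      # 8*qp, matching the text
    print(sp.diff(Int, qp, 2))   # 8 > 0, the convexity condition (12)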
5.2 Summary and foundations of the Fisher approach to knowledge acquisition

Although nature generally supplies us with non-ideal (noise-prone and biased) data, it nevertheless allows us to increase our knowledge of it in the form of an analytically known physical law. Note: this does not violate the Heisenberg uncertainty principle, which expresses the limited joint knowledge of two fixed parameters (say, a particle's position and momentum). By contrast,
our unknowns here are uncertainties, or spreads, in parameter values, i.e., the probability law p(y|a). The analysis is via a chain of reasoning that proceeds stepwise from the simplest related problem to ever more realistic problems. First, suppose that just the state a of a system is to be inferred, by observing it at successive coordinates yn, n = 1, …, N, defined in Eq. (4). These N identities are conveniently denoted as a vector y = a + x. For simplicity, and because of its widespread occurrence (at least in approximation), we assume additive noise xn to be present. Then, for example, a could be estimated as a simple average of the data y. This would be a routine problem of statistics, except that, here, there are two unknown quantities: both (1) the system state a and (2) the system law p(y|a). How can both a and p(y|a) be determined? The latter is a deeper problem, essentially since the system law p(y|a) defines the physics of the system (a more basic, Nobel prize-winning problem). What, in particular, can be said about determining p(y|a)? Note that this use of data agrees in philosophy with that of Kant (1781). That is, the more information I you have about the nature of an effect, or system, the smaller will be the error you suffer in knowing a required observable of that system. Here the "observable" is the size of parameter a. Then, by Eqs. (8a)–(8c), the problem is optimally solved if the system's level of information I is as large as is allowed by its nature. How large may I be? (a) Adage of "the cave" (Plato). The EPI approach may remind philosophy students of the famous adage of "the cave" of the Greek philosopher Plato: a person born and raised in isolation in a cave sees shadows on the wall cast by people outside the cave. From these he concludes that the shadows ARE the people. Here is an example where the acquired information level I is much less than the intrinsic level J defining the people outside. That is, the information efficiency constant κ in Eq. (1) obeys κ ≪ 1. In practice, such tiny efficiency values never occur in MFI solutions Eq. (1). Nature, when addressed by a modern observer (who "knowingly" demands it to yield maximum information, by principle Eq. (1)), gives much better answers than it gave to the "innocent" people in Plato's day. (b) Utility of Kant's philosophy. This has a parallel in the philosophy of Kant (see above): man observes a phenomenon that is only a sensory version of the "true" effect; the latter is called a noumenon. Hence the noumenon is some unknown, perhaps unknowable, absolute statement of the nature of the effect. Man cannot know the absolute noumenon, and so contents himself with merely observing it as a phenomenon, within the limited dimensionality of some sensory framework. Various
frameworks have been used for this purpose through the ages: witchcraft, astrology, religion … differential equations! How does our Fisher approach fit within this scheme, and can it perhaps provide an absolute framework for defining the noumenon? What is the noumenon? This is keeping in mind that we actually do not need to know the entire noumenon, only its maximal Fisher information level J. This turns out to be possible, as next. (c) Knowledge basis. The framework is provided by the notion of observation. A noumenon is an unknown physical process. To be identified as a noumenon, it must first be observed. (Of course, some might be currently unobservable, as with the possibly parallel universes we derived in Sections 2.6–2.19.) The observation is regarded as an absolute truth about the noumenon. This is in the sense that the noumenon is suspected to exist, but this cannot yet be verified, out of insufficient data (i.e., of low information level I). On the other hand, we need only to know its maximum possible value J in quantum scenarios (Frieden, 2004). (d) Avoiding arbitrary models. Hence sufficient observation cuts a philosophical Gordian knot. It provides a tangible, absolute means for addressing or analyzing the noumenon without the need for invoking an arbitrary model for its a priori unknown details (such as whether it is composed of "strings," or "Cooper pair particles," etc.). Indeed, it allows us, via the Fisher principle, to infer these details. (e) Invaluable role of functional mathematics. With this absolute so identified, we turn next to a choice of description for the "observation(s)" and the noumenon. Owing to its unparalleled success in describing physical effects (Frieden, 1998, 2004; Frieden and Gatenby, 2007), this is by the use of Lagrangian mathematics coupled with that of Fisher information theory. (This is even despite Goedel's incompleteness theorem.) Hence the "observations" are data to be used as inputs to appropriate mathematics, namely the experimental measurements y in Eq. (4). To review, one absolute determinant of a noumenon is afforded by its repeated measurement. We now proceed to show how the Fisher principle can quantify the noumenon out of such measurements. (f) Source space as the noumenon of Kant. The EPI view of knowledge acquisition regards the information level J as the absolute level of information provided by "the source space." We have regarded the latter as the noumenon. Hence, within the dimensionality of the given measurements (see above), J is also regarded as the information level of the noumenon. Note that this identification can be made without knowing in detail what the noumenon is. (Indeed, our aim is to reconstruct it.)
Correspondingly, I is that of the phenomenon. Thus, the information difference I − J, previously called the "physical information" K, also measures the loss of information about the absolute truth that is suffered by the observer. (On this basis it might be called the "Kantian" as well.) (g) Amount of information lost. For example, an observer who views the positions of particles on a macroscopic level ignores valuable phase information on the micro-level, and thereby loses exactly half, k = 1/2 (see Sections 2.1–2.3), of the pre-existing information about position. Here the lost information is that due to uncertainty on the unseen quantum level. (h) Recapitulation. The Cramer-Rao result Eq. (9) on minimum rms error holds regardless of how a is estimated from the data (whether as a simple average, weighted average, rms average, median, …). That is, Eq. (9) indicates an optimally small rms error in the estimate of a system state parameter a if its Fisher information level I about the parameter is large. However, it is less well known that any existent system function p(x) obeys, in fact, a simple principle requiring that its Fisher information level I be optimally high. But, what system condition gives rise to this? (i) Connection of data with their underlying physics. In fact consider, in this light, a system obeying unknown physics. Its data, although (say) arbitrarily biased, connect data source values to the receiver (you). What do these data values (plus knowledge of a single Kantian equality constraint on the system) have to say about the unknown physics of the system?
5.3 What is accomplished by use of the Fisher approach It was shown, using the Cramer-Rao inequality, that minimizing the amount of Fisher information that is lost during flow of the information through an unknown physical system defines a variational problem whose solution is THE physical law governing that unknown system. So-minimizing the amount of Fisher information that is lost is equivalent to maximizing its gain. But, why does this turn out to work? It rests on a premise that the laws of nature supply us, one by one, with valuable packets of “information” that are each sufficient to imply its source as a physical law. So far, the approach works. But, why bother to do this? To live is to learn, and by this to enhance: (i) our personal “learning time” constants T (Section 1); (ii) the quality of our own personal survivals, via optimum fitness and comprehension of nature; and more broadly (iii) our
Principle of minimum loss of Fisher information Chapter
6 141
understanding of growth processes, in particular of (a) biological systems obeying natural selection, and of (b) physical systems, from sub-nuclear particles to biological cells to living beings, societies, planets, galaxies,… to universe(s)! Scale size does not matter to the underlying theory of maximum Fisher information. And this theory holds without recourse in ad hoc models such as a Cooper pair or string-like model. Basically all that is really needed is some knowledge in use of Fisher information per se and the noumenal effect under observation. Each universe that the Fisher theory proposes (Sections 2.5–2.8) evolves independently in time, and out of the same 26 universal physical constants. The evolution occurs in discrete steps, each obeying principle Eqs. (1) and (8a) or (8b) or (8c) of maximum Fisher information. This guarantees that each evolves with statistical independence. Since each was formed with (ideally) the same 26 universal physical constants as ours, the evolutionary processes should qualitatively resemble ours: i.e., be well-defined, tending to form stable systems. And generally function with increased complexity over time, the latter because the degree of complexity of a system is proportional to its Fisher information level (Frieden and Gatenby, 2011) and is, thus, being ever maximized by principle (1). Also, as with our systems these should tend to increase in size over time, from the tiniest, of merging subnuclear particles, to merging atoms (comprising the elements listed in the “periodic table”), to molecules (chemical reactions), to systems of terrestrial biology, to extraterrestrial biology, to reproduction of universes in Sections 2.5–2.18. This wide scope of applications ultimately traces back, remarkably, to the simplest of principles: the Cramer-Rao inequality (2). An excellent book on the subject of the cosmos and multiverse is Parallel Universes and the Deep Laws of the Cosmos (Greene, 2011), by B. Greene. The multiverse that is developed in our paper “parallels” that of the parallel multiverse described by Greene. The work (Rao, 1945) by C.R. Rao first derived the CramerRao inequality—since used worldwide to quantify knowledge of the accuracy attainable in estimating a statistical observable. The C-R inequality also provides a key basis for the information theory used in this chapter to derive laws of physics and cosmology. Finally it initiated use of the concept of information geometry to solve analytic problems in physics and engineering. The modern theory of information geometry (Amari, 1983) by Amari has been greatly influential in fostering the novel development of geometry-based methods in analytical statistics. We now turn to a current, all-to-well known problem of growth: that of covid-19, and its apparent co-dependence and emergence, from another growth process, that of cancer.
142 SECTION
II Theoretical applications and physics
6 Commonality of information-based growths of cancer and viral infections Covid-19 is a formidable, and probably long-term, viral disease of man. Cancer is, of course, another such enemy, and presumably not viral in origin. Nevertheless, both diseases share growth characteristics in early and (certain) later stages, as discussed next.
6.1 MFI applied to early cancer growth Aside from confirming known laws of growth in the physical sciences (Frieden, 1998, 2004; Frieden and Gatenby, 2007), the MFI principle also derives laws governing growth phenomena that are largely unexplained. An example is that of (untreated) early-onset cancer C in some organ B of a person. This was found to grow in mass mCB, via MFI principle Eq. (1), as (Frieden, 1998b) mCB ðtÞ ¼ Atφ , φ ¼ 1:618…
(14a)
t the time. The MFI growth result (14a) is seen to be a power-law in time t, where the power constant φ is the famous “Fibonacci golden mean” of biological growth. It is notable that the power constant φ here did not arise out of assuming a priori that biological growth was present. Rather, it was out of a two-step minimization process: (i) using principle Eq. (1) to get the general MFI solution for this problem. The solution has the form tb, with power b (at first) unknown. Next, (ii) power b is varied to further minimize (Frieden, 1998b) the FI. The result is b ¼ (1/2)(1+√5) ¼ 1.618… ≡ φ as in Eq. (14a). The famously biological constant derives anew in this application. It is interesting that the system here is a person. However, mCB(t) specifically describes early growth of cancerous cell tissue in, specifically, some organ B of the person. Of course, in reaction, the person’s defense systems (of T cells, K cells, etc.) try to suppress the cancerous mass growth mCB(t) so as to continue the healthy mass growth mB(t) of the organ in time. It is this functional growth mB(t) that is to be optimized using principle Eq. (1). That is, by the information channel model of Section 4.2 the “true” or “ideal” level of information is that of functional mass growth mB(t) of the organ. And the lossy “channel” degrading it is, here, that obeying cancer growth law mCB(t). Therefore, by the description in Section 4.2, it is this growth whose Fisher information level is minimized (Frieden, 1998b) via Eqs. (1) and (8a); giving rise (Frieden, 1998b) to growth Eq. (14a).
Principle of minimum loss of Fisher information Chapter
6 143
6.2 Later-stage cancer growth In contrast with the power-law result Eq. (14a) holding for early cancerous CB growth, later CB (and other in situ biological) growth is often exponential, of form mCB ðtÞ ¼ A expðatÞ, a ¼ const:
(14b)
Exponential growth is notably much faster than the power law growth Eq. (14a). In fact, virtually all key-in-lock in situ growth (biological or non-), e.g., by biological viruses, was found (Frieden, 1998b; Frieden and Gatenby, 2020; Gatenby and Frieden, 2016) by MFI to obey exponential growth on a long-term basis. On this basis, (untreated) long-term growth of covid-19 might approximate, to some extent, Eq. (14b). What about early, short-term growth?
6.3 MFI applied to early covid-19 growth There is, in fact, direct empirical evidence of power-law growth (14a) is early-stage growth of the covid-19 virus (in year 2020). See the epidemical growth curves below in Fig. 2, for both S. Korea and Japan.
FIG. 2 Early growth curves of form tμ (during year 2020) of covid-19 cases, by country. Note the particular growth exponents μ 1.7 for both countries Japan and Republic of Korea. These are remarkably close to the Fibonacci growth value φ ¼ 1.618…, defining early cancer growth tφ as well; valuable empirical evidence for a common origin.
144 SECTION
II Theoretical applications and physics
There it is found that the theoretical growth constant φ ¼ 1.618… in Eq. (14a) is very well-approximated by the single, observed value φ ¼ 1.7 for the two countries! It is also a minimum value over all countries shown; due to conscientious efforts to identify, and then isolate, infected people. So evidently, such early-stage isolation of infected people in Japan and S. Korea changed early-stage biological growth (14b) to mere power-law growth (14a). This was also the “flattening of the curve” effect so hopefully sought at that time. This “curve flattening”, by the efforts of motivated people, implies that man actually entered into (became an integral part of) the covid growth phenomenon at that early stage; of course with the aim of reducing its growth. Evidently, in its early stage, the disease spreads chiefly by contagion. Thus the S. Korean and Japanese programs of monitoring and isolating contagion, in particular, worked well in the early stages of covid-19. (It should also work well in later stages as well; see below.)
6.4 Common biological causes of cancer- and covid-19 growth; the ACE2 link As would be expected, there are biological reasons for the quantitative commonality of both early-stage cancer- and covid-19 growths (also with some later-stages of growth, as taken up below). ACE2, which stands for angiotensin-converting enzyme 2, is a protein that sits on the surface of many types of cells in the human body, e.g., in the heart, gut, lungs, eyes and nose. It is believed to, at normal levels, have beneficial anti-inflammatory properties, e.g., inhibiting covid-19 growth and lung disease. Healthy ACE2 inhibits the growth of tumors cells, and reduces local inflammation and angiogenesis in several types of cancer. Many conditions (Kasela et al., 2020), including genetic mutation of ACE2, obesity and cancer, predispose individuals to SARS-CoV-2 infection and its severe form covid-19. Thus, the ACE2 protein is a connecting link for both cancer and SARS-Cov-2 (also called covid-19). How does ACE2 bring about covid-19? In covid-19 infections, the SARS-CoV-2 virus hooks into ACE2 and uses it to invade and infect a wide range of cells throughout the body (typical of key-in-lock phenomena). But invasion of cells from a common source is a characteristic, as well, of a metastasizing cancer. Then, can ACE2 also facilitate metastasis and cancer growth? If so, how? The growth of ACE2 utilizes, among others, the “tumor suppressor” gene TP53. When damage to the DNA in a cell nucleus is too extensive to be repaired, tumor suppressor genes such as TP53 induce programmed cell death (apoptosis) so that the damage is not passed on as cancer. However, gene TP53 affects human cancers in different ways (Brady and Attardi, 2010), some good, some bad: First, mutations are frequent in most proliferative cancer; particularly during age-related cell senescence. And, in
Principle of minimum loss of Fisher information Chapter
6 145
opposition, TP53 is anti-proliferative by nature, impeding cancer growth (Frieden and Soffer, 2009; Levine and Oren, 2009; Locatelli et al., 2017). This makes TP53 a powerful weapon for impeding cancer growth (Frieden and Soffer, 2009; Levine and Oren, 2009; Locatelli et al., 2017). Conversely, a mutated TP53 gene must lessen ACE2 growth and, hence, lessen its ability to impede cancer growth. There is a saying, “The enemy of my enemy is my friend.” In this case, a healthy TP53 gene of ACE2 is the enemy of processes such as cell inflammation and senescence, both of which cause cancer. Therefore, as inflammation, senescence and cancer are “my enemies,” the healthy TP53 gene is “my friend.” So, on this basis, both cancer and covid-19 cases tend to increase owing to mutation of the TP53 gene of ACE2. Therefore, sharing a common functional cause, all cancer and covid-19 growth might likewise grow in mass in similar ways, at least in their early stages. And, as we saw in Eq. (14a), early cancer growth is power-law (both empirically and theoretically). In fact, early-stage covid-19 growth (shown empirically, below, by the graphs of Fig. 2 for the number of infected people vs. time) is likewise power-law; see “log–log” inset curves in particular. This is for roughly 10 days (e.g., days 20–30 for USA; 15–25 for Germany, Spain, France and Italy; 8–14 for Brazil, 20–35 for Japan). Hence, mathematically, both illnesses start out as the same growth effect: power-law Eq. (14a). But what about longer-term growth? Regarding covid-19, the growth curves in Fig. 2 show, after about 20 days, departure from pure power-law growth to approximately exponential growth; and beyond day 20 or 30 a leveling off (S. Korea, China). Also, over the long term, both illnesses grow exponentially. Of course long-term cancer, in its “mature” form, is notorious for growing exponentially, as in Eq. (14b), due to metastasizing (spreading spatially) from an originally affected organ B to other organs. In summary, we have found that both cancer and covid-19 share similar growth forms: in the short term, obeying Eq. (14a) at small times t; and, at worst, growing exponentially in the long term. A simplest simultaneous fit to both mass growth phenomena is, from Eqs. (14a) and (14b), simply their product mðtÞ ¼ Atφ exp½aðtÞ, with aðtÞ ¼ a1 t + a2 t2 + …
(15)
The constants a1, a2,… in polynomial a(t) are those characterizing the given “patient” (a covid-suffering country or a cancer-suffering patient). Note that the a1t term in Eq. (15) is just the at term in Eq. (14b); that is, at low t, a(t) ¼ a1t. Passage of time brings in the higher-order terms a2t2, etc., in polynomial a(t). What does this model Eq. (15) predict for covid growth curves following those given in Fig. 2? These should consist of smooth transitions from early times, over which m(t) increases slowly, as a power-law tφ, to when it
146 SECTION
II Theoretical applications and physics
increases rapidly, dominated by the exponential factor exp[a(t)]. The lower portions will be due to programs of “lockdown” or, at least, “social distancing”; as complemented by the more comprehensive programs of vigilance and isolation (mentioned above) in S. Korea and Japan. In the preceding, social distancing/lockdown was used for reducing covid-19 growth. What about in corresponding cases of cancer growth? Here, small values of the m(t) curve Eq. (15) at small times t result from dominance by the tφ (pure power-law) factor, as it did in Eq. (14a). At larger times t the use of chemotherapy can accomplish smaller values of the polynomial a(t). In fact, at t for which a(t) goes negative enough, m(t) can approach zero, overcoming the ever-present power-law growth factor tφ in Eq. (15). Overall growth curves at this time (beginning of 2021) resemble ascending staircases from left to right, with each “step” of the stairs rounded out according to the polynomial form a(t) in Eq. (15). Hopefully, the coming vaccines for covid-19 will work to finally flatten the curve to negligible growth values. Could these possibly suggest vaccines for cancer as well? This mathematical commonality Eq. (15) of temporal growth in cancer and in covid-19 suggest that cancer and covid-19 growths share, as well, a biological commonality: so that cancer might generally arise and/or grow out of viral infection; or vice-versa. And, as a result, a particular regimen for curing cancer might also be curative for covid-19 (and vice-versa).
References Amari, S.I., 1983. A foundation of information geometry, Electronics and Communications in Japan 66-A, 1-10. In: Paper Nominated for the Royal Society Winton Prize for Science Books for 2010. Brady, C.A., Attardi, L.D., 2010. P53 at a glance. J. Cell Sci. 123, 2527–2532. 2010. Darwin, C., 1859. The Origin of Species: Special Collector’s Edition with an Introduction by Charles Darwin. 8PMC6982146ISBN: 193682809X ISBN13: 9781936828098. Charles Darwin’s On the Origin of Species, published by London on 24 November 1859, is a work of scientific literature which is considered to be the foundation of evolutionary biology. Its full title was ’On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life.’ For the sixth edition of 1872, the short title was changed to ‘The Origin of Species.’ Fisher, R.A., 1922. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. A 222, 594–604. 309–368. https://doi.org/10.1098/rsta.1922.0009. Also Statistical Methods and Scientific Inference, 2nd edn. (Oliver and Boyd, London, UK, 1959). Frieden, B.R., 1998. Physics from Fisher Information. Cambridge University Press, Cambridge, UK. 1999. Frieden, B.R., 2004. Science from Fisher Information. Cambridge University Press, Cambridge, UK. Frieden, B.R., Gatenby, R.A. (Eds.), 2007. Exploratory Data Analysis Using Fisher Information. Springer-Verlag, London.
Principle of minimum loss of Fisher information Chapter
6 147
Frieden, B.R., Gatenby, R.A., 2011. Order in a multidimensional system. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 84 (1 Pt 1), 1–9, 011128. Published online July 19, 2011. https://doi.org/10.1103/PhysRevE.84.011128. Frieden, B.R., Gatenby, R.A., 2019. Spontaneous formation of universes from vacuum via information-induced holograms. arXiv, 11435 (physics.gen-ph). Frieden, B.R., Gatenby, R.A., 2020. Ion-based intracellular signal transmission, principles of minimum Information loss and evolution by natural selection. Published online, Int. J. Mol. Sci. 21 (1), 9. 2019 Dec 18. PMC6982146. https://doi.org/10.3390/ijms21010009. Frieden, B.R., Soffer, B.H., 2009. De Broglie’s wave hypothesis from fisher information. Physica A A388, 1315–1330. Frieden, B.R., Plastino, A., Plastino, A.R., Soffer, B.H., 1999. Fisher-based thermodynamics: its Legendre transform and concavity properties. Phys. Rev. E 60. @S1063-651X99! 03707-1#. Gatenby, R., Frieden, B.R., 2016. Investigating information dynamics in living systems through the structure and function of enzymes. PLoS One. https://doi.org/10.1371/journal.pone. 0154867. Greene, B., 2011. The Hidden Reality: Parallel Universes and the Deep Laws of the Cosmos. Vintage, New York. Multiple universes of 9 types are defined and discussed: quilted, inflationary, brane, cyclic, landscape, quantum and holographic. Guth, A., 1981. Inflationary universe: a possible solution to the horizon and flatness problems. Phys. Rev. D 23 (2), 347–356. Bibcode:1981PhRvD..23..347G. https://doi.org/10.1103/ PhysRevD.23.347. Hodgkin, A.L., Huxley, A.F., 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117 (4), 500–544. PMC 1392413. 12991237. https://doi.org/10.1113/jphysiol.1952.sp004764. Kant, I.J., 1781. Internet Encyclopedia of Philosophy 2021. Article Immanuel Kant: Metaphysics. “Kant is an empirical realist about the world we experience; we can know objects as they appear to us. He gives a robust defense of science and the study of the natural world from his argument about the mind’s role in making (forming a realistic) nature.”. Kasela, S., et al., 2020. Genetic and non-genetic factors affecting the expression of COVID-19 relevant genes in the large airway epithelium. medRxiv. preprint. https://www.medrxiv.org/ content/10.1101/2020.10.01.20202820v1. Levine, A.J., Oren, M., 2009. The first 30 years of p53: growing ever more complex. Nat. Rev. Cancer 9, 749–75824. Linde, A., Vanchurin, V., 2010. How many universes are in the multiverse? arXiv 81, 0910.1589v3 [hep-th]. https://doi.org/10.1103/physRevD81.083535. The authors argue that the total number of distinguishable, locally “Friedmann universes” generated by eternal inflation, is proportional to the exponent of the entropy of inflationary perturbations. Eternal inflation is a hypothetical, inflationary universe model, which is itself an outgrowth or extension of the Big Bang. According to eternal inflation, the inflationary phase of the universe’s expansion lasts forever. But, why does it? Our thesis (Sec. 2.6) is that it does this so as to ever increase its complexity and Fisher I level; and that it further accomplishes this by, at times, “reproducing” as a natural part of its evolution; where each of its offspring universes obeys eternal inflation as well. 
This leads to a “multiverse,” consisting of one or more chains of multiple universes, all ever-increasing their complexity and Fisher information levels; until, possibly, going extinct due to incursion of a random “bubble of nothingness” (Sec. 2.12, 2.13,2.14,2.15,2.16).
148 SECTION
II Theoretical applications and physics
Locatelli, F., Merli, P., Pagliara, D., Li Pira, G., Falco, M., Pende, D., et al., 2017. Outcome of children with acute leukemia given HLA-haploidentical HSCT after alphabeta T-cell and B-cell depletion. Blood 130, 677–685. https://doi.org/10.1182/blood-2017-04-779769. Popper, K., 1963. Conjectures and Refutations. Routledge and Kegan Paul, London. Notable quotation: “Scientific claims are empirically falsifiable, i.e., their truth should be incompatible with certain observations.” Rao, C.R., 1945. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–89. MR 0015748. Valentyne, E.J. (Ed.), 2021. Powers of Two: The Information Universe. In: Conference on Information Universe, Netherlands. Springer, NY.
Chapter 7
Quantum metrology and quantum correlations Diego G. Bussandria and Pedro W. Lambertib,* a
Instituto de Fı´sica La Plata (IFLP), CONICET, La Plata, Argentina Consejo Nacional de Investigaciones Cientı´ficas y T ecnicas de la Repu´blica Argentina, Argentina * Corresponding author: e-mail: [email protected] b
Abstract Given a physical system to define a suitable measurement process on it constitutes one of the most crucial and challenging issues in physics. According to the particular theory under consideration, taking measurements has more or fewer subtleties but one ubiquitous concept for every physical theory is the uncertainty. In quantum mechanics, it is possible to identify a fundamental unpredictability given by Heisenberg uncertainty relations and the indistinguishability of quantum states, together with a practical uncertainty related to unavoidable errors in the measurement process. Considering only these fundamental ties, the estimation of unknown values of physical magnitudes has welldefined precision limits. The Cramer–Rao bound is a cornerstone for the analysis of these restrictions and an irreplaceable tool to determine the most accurate measurement procedures. On the quantum side, there exist quantum correlations known as entanglement and quantum discord which bring new ways to overcome the classical precision limits. Quantum metrology is a relatively young emerging area whose main aim is to study how to improve parameter estimation theory by using quantum correlations present in multipartite systems. Keywords: Quantum metrology, Quantum discord, Entanglement, Cramer–Rao bound, Fisher information
1
Quantum correlations
Let us consider a system formed by two quantum subsystems A and B whose states belong to the Hilbert space H ¼ HA HB being HX the individual parts corresponding to the subsystem X {A, B}. In addition, let B+1 ðHÞ be the set of density matrices ρ defined over H.
Handbook of Statistics, Vol. 45. https://doi.org/10.1016/bs.host.2021.06.004 Copyright © 2021 Elsevier B.V. All rights reserved.
149
150 SECTION
II Theoretical applications and physics
A natural measure of quantum correlations between the systems A and B in a given state ρ is the mutual information I(ρ) (Adesso et al., 2016; Bussandri et al., 2019) IðρÞ ¼ SðρA Þ + SðρB Þ SðρÞ
(1)
with SðχÞ ¼ Tr½ρ log ρ the von Neumann entropy, ρA ¼ TrB ½ρ and ρB ¼ TrA ½ρ, and the corresponding partial trace. Now, we shall classify the states according to its correlations. The first class is Product states which can be written as ρp ¼ α β
(2)
with α B+1 ðHA Þ and β B+1 ðHB Þ. As in this case ρp is the Kronecker product between the marginal states ρA and ρB, it holds I(ρp) ¼ 0 because of S(α β) ¼ S(α) + S(β). Thus, product states have no correlations of any kind. On the other hand, given an arbitrary state ρ, the corresponding noncorrelated state is ρA ρB. One may ask how much information it is discarded, in comparison with ρ, in the state ρA ρB. The usual quantifier of this is I(ρ) that, additionally, satisfies IðρÞ ¼ SðρjjρA ρB Þ
(3)
being Sðχ 1 jj χ 2 Þ ¼ Tr ½ χ 1 ð log χ 1 log χ 2 Þ a measure of distinguishability between quantum states (χ 1 and χ 2 in this case) known as von Neumann relative entropy. Taking into account these considerations (and other reasons that exceed this chapter) (Adesso et al., 2016), the mutual information I(ρ) is recognized as the main quantifier of correlations between subsystems. Let us consider a particular procedure to obtain correlated quantum states. Let X and Y be two random discrete variables and let be Pij ¼ P(X ¼ xi ^ Y ¼ yj) the joint probability distribution. If one represents the outcomes of X and Y with two orthonormal basis fjiA igi and fj jB igj of HA and HB , respectively, then a suitable state representing the stochastic behavior of X and Y is X ρcc ¼ (4) Pij jiA ihiA j j jB ih jB j: ij
Moreover, it turns out Iðρcc Þ ¼ IðX, YÞ ¼
X ij
Pij log
Pij pi q j
(5)
where I(X, Y) P is the classicalPmutual information between the random variables, being pi ¼ j Pij and qj ¼ i Pij the marginal probabilities. Therefore, we can conclude that ρcc is a classical state (sometimes called as classical-classical state).
Quantum metrology and quantum correlations Chapter
7 151
Now, let us analyze a richer and complex situation. Let us suppose that we have one random variable X. If turns out X ¼ xi we choose jiA i (an element of the previous orthonormal base) as the state of the system A, and we prepare ρiB as state of the system B. Thus, the suitable joint state for the global system in this scenario is (Bussandri et al., 2019) X pi MAi ρiB (6) ρc ¼ i
where Mi is the orthonormal projector onto jiA i, given by Mi ¼ jiA ihiA j. All information about the behavior of system A is provided by the probability distribution p ¼ {pi}. However, system B has a quantum behavior determined by the probability distribution p and the ensemble of states fρiB g. Taken into account these considerations, ρc is called A-classical state or classicalquantum state. The subsequent correlations of ρc can be quantified by taking the von Neumann mutual information: X Iðρc Þ ¼ pi SðρiB jjρB Þ (7) i
P
being ρB ¼ i pi ρiB the marginal state of B. This kind of correlation is named as classical quantum. The last kind of states that we will introduce is the Separable one which are the natural generalization of classical-quantum states, that is, convex sums of product states: X ρs ¼ pi ρiA ρiB : (8) i
If a state ρ cannot be written as in the previous form, it is called an entangled state and it is one of the most important elements of quantum information theory. However, nowadays there is a remarkable corpus of studies pointing out that separable states can lead to truly quantum effects, demonstrating the presence of quantum correlations between systems A and B, in a similar way than entanglement do. On the other hand, there exist an enormous quantity of measures of quantum correlations. The most general one is the so-called Quantum Discord (Bussandri et al., 2019), defined as QðρÞ ¼ IðρÞ max IðρM Þ M
(9)
being M a generalized measurement and ρM the corresponding state after the measurement. On the other hand, it is important to remark that if one restricts to the set of pure states, then the only possible quantum correlation is entanglement (Horodecki et al., 2009). That is, given an entangled state then its correlations can be divided into two classes: A-classical (from now on: classical) correlations and entanglement. However, pure states have a particular
152 SECTION
II Theoretical applications and physics
property: they have the same quantity (measured by the mutual information) of classical correlations than entanglement. Correspondingly, if a state is not a product state, then it has entanglement. In addition, within the set of pure states fjψ ig, we can use the entanglement of formation in order to quantify entanglement, defined as EðψÞ ¼ SðρA Þ ¼ SðρB Þ: The extension of this measure to the set of mixed states X EðρÞ ¼ min pi Eðψ i Þ
(10) B+1 ðHÞ
is (11)
i
being ρ¼
X
pi j ψ i i h ψ i j
i
one representation of ρ as a convex sum of pure states and the minimum taken over these possible representations.
2 Parameter estimation Consider now the problem of estimating one parameter ϕ of a quantum system under some assumptions (Polino et al., 2020). Let us suppose that we know how the state of the system depends on the parameter ϕ. Besides this we will assume that this dependence can be described employing a unitary operation over a quantum state ρ0: ρϕ ¼ U ϕ ρ0 U {ϕ :
(12)
It is important to emphasize that the quantum systems used in this chapter are associated to finite degrees of freedom and the corresponding quantum states are density operators, that is to say, Hermitian positive semidefinite operators with trace one B+1 ðHÞ, defined over a finite-dimensional Hilbert space H. In order to carry out the estimation of ϕ, we perform a generalized quantum measurement M given by the measurement operators M ¼ fMi gni¼1 , n , P and i M{i Mi ¼ . These operators are in correspondence one to one with the values fx1 , …, xn g of an observable X of the quantum system i.e., Mi $ X ¼ xi. The key of the estimation process is the estimator Φ, that is a function whose aim is connecting the measurement outcomes with the value of ϕ. We will also assume that there exists the possibility of measuring Mν ! independent times and thus we obtain a set of outcomes x ¼ ðx1 , …, xν Þ . We will define this procedure as an experiment. Thus, it is possible to define ! a final estimator as Φð x Þ. In addition, this functional is called unbiased if its mean value is the parameter ϕ, i.e., X ! ! ¼ Φ Pð x jϕÞΦð x Þ ¼ ϕ: (13) !
x
Quantum metrology and quantum correlations Chapter
7 153
The average is taken over m experiments. On the other hand, we can obtain an expression for the probability of an outcome xi by using the Born rule: Pðxi jϕÞ ¼ Tr Mi ρϕ : (14) The previous quantity can be understood as a conditional probability because of it is a priori assumed the value of the parameter ϕ. Consequently, !
Pð x jϕÞ ¼
ν Y
Pðxi jϕÞ
(15)
i¼1 !
!
stands for the probability to obtain x in ν independent measurements. Pð x jϕÞ is called likelihood function. Another important definition related to estimators is the one of locally unbiased estimator ∂Φ ¼1 ∂ϕ
(16)
¼λ lim Φ
(17)
Additionally, if Φ satisfies m!∞
then it is called asymptotically unbiased. These two last requirements are less restrictive than the unbiased one. Now, let us consider the following question: given a measurement and an estimator Φ, how can we quantify the corresponding estimation process? The answer is the mean square error (MSE) that captures the average square differ! ! ence between the guess based on the outcomes x -Φð x Þ- and the parameter: X 2 ! ! ðΦð x Þ ϕÞ Pð x jϕÞ: (18) MSEðϕÞ ¼ !
x
Finally, if Φ is unbiased, then it holds X ! 2 ! ½Φð x Þ Φ Pð x jϕÞ: MSEðϕÞ ¼ Δϕ2 ¼
(19)
!
x
3
Cramer–Rao bound
The Cramer–Rao Bound and Fisher information (Khalid et al., 2018) are fundamental tools of parameter estimation theory. Given a particular measurement M ¼ fMi gi , the Fisher information captures the information encoded in the probabilities of the measurement outcomes, according to: 2 X ∂ log Pðxi jϕÞ FðϕÞ ¼ Pðxi jϕÞ : (20) ∂ϕ i
154 SECTION
II Theoretical applications and physics
Two important properties ofPF(ϕ) arePconvexity and additivity. The former P establishes that given ρ ¼ k ckρk, k ck ¼ 1, then it holds F(ϕ) k ckFk(ϕ) being Fk(ϕ) and F(ϕ) the corresponding Fisher information to the state ρk and ρ, respectively. By its side additivity has to do with more than one measurement. To be precise let us consider a collection of states {ρk}k and ν independent measurements Mk . The total Fisher information is given P by Ft(ϕ) ¼ k Fk(ϕ) where Fk(ϕ) is the resulting Fisher information when we use Mk and ρk. At the same time, this quantity is intimately related to the symmetric logarithmic derivative (SLD) Lϕ indirectly given by ∂ρ ρϕ Lϕ + Lϕ ρϕ ¼ : ∂ϕ 2
(21)
The SLD is an Hermitian operator related to the Fisher information in the following way: X Re Tr ρϕ Mi Lϕ 2 FðϕÞ ¼ : (22) Tr ρϕ Mi i From the definition (20), we can rewrite the corresponding expression in the next form: 2 ∂Pðxi jϕÞ X ∂ϕ FðϕÞ ¼ : (23) Pðx i jϕÞ i Looking at Eq. (23), it is possible to interpret the Fisher information as a quantifier of the sensitivity of the probabilities of the outcomes to a change in the parameter ϕ. Specifically, to obtain more information about ϕ is directly related to a greater variation of the probabilities when the parameter changes. A formal version of this statement is the Cramer–Rao bound. If we perform ν independent measurements M, it holds 2 Δϕ2
∂Φ ∂ϕ
:
(24)
1 : νFðϕÞ
(25)
νFðϕÞ
If Φ is locally unbiased, then we arrive at Δϕ2
One way to prove the Cramer–Rao bound is to define a suitable scalar product using the Cauchy Schwarz inequality. Lastly, an estimator that fulfills the equality 2 Δϕ ¼ 2
is called efficient.
∂Φ ∂ϕ
νFðϕÞ
Quantum metrology and quantum correlations Chapter
4
7 155
Quantum Fisher information
Both sides of Cramer–Rao bound depend on the measurement M. In particular, F(ϕ) is sensible to a change in the considered POVM. In order to reach the precision limits, we maximize F(ϕ) over the possible measurements. This new optimization gives the quantum Fisher information: FQ ðρϕ Þ ¼ max FðϕÞ:
(26)
M
Correspondingly, the Quantum Cramer–Rao bound for an asymptotically locally unbiased estimator states Δϕ2
1 : νFQ ðρϕ Þ
(27)
As the Fisher information, its corresponding quantum version also depends on the symmetric logarithm derivative Lϕ. It can be proven that h i (28) FQ ðρϕ Þ ¼ Tr ρϕ L2ϕ ¼ ðΔLϕ Þ2 : In addition, if we have U ϕ ¼ exp ðiϕHÞ being H an Hermitian operator, FQ(ρϕ) does not depend on the parameter ϕ, and it is a function depending only of ρ0, see Eq. (12),P and H. Considering now the spectral decomposition of the initial state ρ0 ¼ i pi jψ i ihψ i j, the quantum Fisher information under these conditions reads: X ðpi p j Þ2
ψ i jHjψ j 2 : FQ ðρϕ Þ ¼ 2 pi + p j
(29)
i6¼j
An interesting case arises when ρ0 is a pure state, ρ0 ¼ jΨ0 ihΨ0 j. In this case it holds: FQ ðρϕ Þ ¼ 4ðΔHÞ2
(30)
being D E ðΔHÞ2 ¼ ðH hH iρ0 Þ2 ¼ H2 ρ hH i2ρ0 ρ0
0
where hAiρ0 ¼ Tr½ρ0 A. Taking into account the properties of the Fisher information, we have an additional inequality FQ ðρϕ Þ 4ðΔHÞ2
(31)
where the left-hand side corresponds to the Fisher information of an arbitrary quantum state.
156 SECTION
II Theoretical applications and physics
5 Quantum correlations in estimation theory Having considered the optimization of Fisher information over the possible measurements M , now we will deal with the following question: What is the best strategy to improve the quantum Cramer–Rao bound? In the previous sections, we have considered the parameter estimation problem by using m sequential experiments. As we said before, each experiment consists in performing ν independent measurements over the state ρϕ. So, now we will follow a parallel strategy (Polino et al., 2020), schematized in Fig. 1. In order to carry out this procedure, we assume that we have m packages of ν copies of m each state ρi0 , i f1, …, mg and m black boxes fU iϕ gi¼1 whose interaction with ρi0 leads to ρiϕ ¼ U iϕ ρi0 U i,{ ϕ . Thus, we can construct a unitary global operation acting over the m states, resulting in: ρϕ ¼ Uρ0 U{
(32)
1 m with U ¼ U 1ϕ ⋯ U m ϕ and ρ0 ¼ ρ0 ⋯ ρ0 . In what follows, we shall coni iϕH sider identical black boxes U ϕ ¼ e . The precision Δϕ of this new estimation process according to the quantum Cramer–Rao bound is limited as in Eq. (27). By using the additivity property of the Fisher information, we can write (Polino et al., 2020)
FQ ðρϕ Þ ¼ FQ ðρ1ϕ ⋯ ρm ϕÞ ¼
m X
FQ ðρiϕ Þ mFQmax
(33)
i¼1
being FQmax ¼ max fFQ ðρiϕ Þgi . Thus, this parallel procedure leads to the following inequality known as Standard Quantum Limit (SQL): 1 1 Δϕ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : νmFQmax νFQ ðρϕ Þ
FIG. 1 Procedure corresponding to the standard quantum limit.
(34)
Quantum metrology and quantum correlations Chapter
7 157
Considering that FQmax is a constant quantity, the standard quantum limit is characterized by the error scaling Δϕ ∝ p1ffiffimffi.
5.1 Heisenberg limit We arrived to a new interesting issue. In the previous section, we employed noncorrelated initial states, i.e., ρ0 was a multipartite product state. Then, is it possible to improve the precision limits considering correlated states? Or alternatively, what type of correlations may lead to a better error scaling? In order to answer these questions, let us consider separable states. As we have seen before, this class of quantum states is constituted by convex sums of products states. Thus, under the previous unitary evolution, we have X X 1ðαÞ mðαÞ 1ðαÞ mðαÞ ρ0 ¼ c α ρ0 ⋯ ρ0 ! ρϕ ¼ c α ρϕ ⋯ ρϕ (35) α
α
P
with α cα ¼ 1 and ρϕ ¼ U ϕ ρ0 U {ϕ . However, by using the convexity and additivity properties of the Fisher information, it turns out that X 1ðαÞ mðαÞ FQ ðρϕ Þ c α F Q ρϕ ⋯ ρϕ (36) α
ð
X α
n o 1ðα0 Þ mðα0 Þ cα Þ max FQ ρϕ ⋯ ρϕ
α0
n o 1ðα0 Þ mðα0 Þ ¼ max FQ ρϕ ⋯ ρϕ : 0 α
(37) (38)
As we can see, the optimal separable state (i.e., the one that maximizes the Fisher information) is a product state. Correspondingly, separable states do not lead to an improvement of the standard quantum limit and it is not necessary to consider such class of states in order to reach the same bound. On the other hand, it is important to note that optimal measurements, i.e., the ones that maximize the Fisher information in Eq. (26), can be local measurements for each state ρiϕ ; thus it is not possible to overcome the SQL by using quantum correlations between the quantum systems, labeled by i f1, …, mg, in the measurement process. However, it is possible to find a place for quantum resources like entanglement in order to improve the precision limits (see Fig. 2). This kind of quantum correlations can enhance the sensitivity of the bound, beating the standard quantum limit and improving the error scaling (Khalid et al., 2018). In order to do this, we have to return to inequality (31) in which we can see that the state maximizing the fisher information is a pure one. Thus, according to Eq. (30), we have to choose ρ0 corresponding to greater values of the variance of the total Hamiltonian HT. In this scenario, we have that
158 SECTION
II Theoretical applications and physics
FIG. 2 Procedure corresponding to the Heisenberg limit.
HT ¼
Xm i¼1
Hi
(39)
m operators
zfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{ ⋯H ⋯: H i ¼ |fflfflfflffl{zfflfflfflffl}
(40)
i1 prods:
Defining jψ max i and jψ min i as the eigenstates of H corresponding to h max —the maximum eigenvalue–and h min –the minimum—, respectively, and ρ0 ¼ jΨ0 ihΨ0 j, with 1 jΨ0 i ¼ pffiffiffi jh max im + jh min im 2
(41)
then it holds 1 hH T iρ0 ¼ Tr½ρ0 H T ¼ hΨ0 jH T jΨ0 i ¼ ðmh max + mh min Þ 2 2 1 HT ρ ¼ hΨ0 jH 2T jΨ0 i ¼ m2 h2max + m2 h2min : 0 2
(42) (43)
So the variance of the total Hamiltonian turns out to be ðΔH T Þ2 ¼
m2 ðh h min Þ2 : 4 max
(44)
Correspondingly, the quantum Fisher information is given by FQ ðρϕ Þ ¼ m2 ðh max h min Þ2
(45)
and, lastly, the quantum Cramer–Rao bound reads (Polino et al., 2020) 1 1 ∝ : Δϕ pffiffiffi m νmðh max h min Þ
(46)
Quantum metrology and quantum correlations Chapter
7 159
As we can see, the particular election of the state ρ0 as a pure entangled one leads to an improvement in the precision limit and the error scaling. This new bound is known as Heisenberg limit.
5.2 Interferometric power Finally, we present another phase estimation problem but based in an interferometric setup. Let us consider a bipartite system A + B, initialized in an joint state ρ0 that undergoes a local unitary evolution U ¼ UA UB. Let us suppose that only subsystem A evolves according to U A ¼ exp½iϕHA , while subsystem B remains unaffected, U B ¼ B , namely, ρϕ ¼ ðU A B Þρ0 ðUA B Þ{ :
(47)
An alternative way to write the previous evolution is by using the total hamiltonian H T ¼ H A B and a global unitary operation U ¼ exp ½iϕHT acting over ρ0. Then, as have seen earlier, the precision estimating ϕ through ν independent measurements over identical copies is limited by the quantum Cramer–Rao bound, Δϕ2
1 νFQ ðρϕ Þ
being FQ ðρϕ Þ ¼ 2
X ðpi pj Þ2
hψ i jH T ψj 2 pi + pj
(48)
i6¼j
where fpi , jψ i ig are the eigenvalues and eigenvectors of ρ0, respectively. The Fisher information captures the information that ρϕ posseses about the parameter ϕ. One interesting question at this point is what states are more suitable to increase the Fisher information in this setup. First, let us suppose that we have no full knowledge of HA; only its spectrum is fixed and it is nondegenerated. Thus, one way to quantify the usefulness of a given state ρ0 in order to improve the estimation process is by taking the minimum of the quantum Fisher information over all Hamiltonians with the corresponding fixed spectrum min HA Fðρϕ Þ (Bromley et al., 2017; Girolami et al., 2014). By introducing a convenient normalization factor, we have P A ðρ0 Þ ¼
1 min Fðρϕ Þ 4 HA
(49)
known as interferometric power (Girolami et al., 2014) of the input state because of it captures the sensitivity that ρ0 allows in an interferometric configuration as we have. Moreover, P A vanishes if and only if ρ0 is a classicalquantum state, see Eq. (6), and it is contractive over Local Commutative Preserving Operations that are the corresponding free operations for a
160 SECTION
II Theoretical applications and physics
measure of general quantum correlations. Thus, together with other considerations that exceed this chapter, it is possible to conclude that this quantity is a natural measure of quantum (discord-type) correlations. In addition, the reason why this particular kind of correlation is useful to increase the precision estimation is that it increases the coherence of HA (in its eigenbasis) (Bromley et al., 2017), a fact directly associated with greater values of the quantum Fisher information.
6 Conclusion Quantum correlations are recognized as a fundamental area within quantum information theory. Entanglement is one of the most important types of correlations but not the only relevant one. Nowadays, a series of researches point out that quantum correlations beyond entanglement may overcome the corresponding classical limits in a varied list of cases (Adesso et al., 2016). In this chapter, we have introduced the problem of estimating parameters that define quantum states, stating a fundamental inequality known as the Cramer– Rao bound in terms of the Fisher information. This quantity stands for a natural quantifier of the information encoded in the state of the system about the parameter to be estimated. One remarkable aspect of the Fisher information is the relation with the correlations between the involved quantum systems. Moreover, here we were aimed to bring closer the reader to how this kind of correlations leads the information encoding in the quantum state, in an interferometric configuration. The corresponding area of quantum information theory exploring the resourcefulness of the quantum correlations to estimate parameters is known as Quantum Metrology encompassing a variety of works about these topics under different evolutions and systems.
References Adesso, G., Bromley, T.R., Cianciaruso, M., 2016. Measures and applications of quantum correlations. J. Phys. A Math. Theor. 49 (47), 473001. Bromley, T.R., et al., 2017. There is more to quantum interferometry than entanglement. Phys. Rev. A 95 (5), 052313. Bussandri, D.G., et al., 2019. Generalized approach to quantify correlations in bipartite quantum systems. Quantum Inf. Process. 18 (2), 57. Girolami, D., et al., 2014. Quantum discord determines the interferometric power of quantum states. Phys. Rev. Lett. 112 (21), 210401. Horodecki, R., et al., 2009. Quantum entanglement. Rev. Mod. Phys. 81 (2), 865. Khalid, U., Jeong, Y., Shin, H., 2018. Measurement-based quantum correlation in mixed-state quantum metrology. Quantum Inf. Process. 17 (12), 1–12. Polino, E., et al., 2020. Photonic quantum metrology. AVS Quantum Sci. 2 (2), 024703.
Chapter 8
Information, economics, r-Rao bound and the Crame Raymond J. Hawkinsa,b,* and B. Roy Friedenb a
Department of Economics, University of California, Berkeley, CA, United States Wyant College of Optical Sciences, University of Arizona, Tucson, AZ, United States * Corresponding author: e-mail: [email protected] b
Abstract In this chapter we explore the importance of the Cramer-Rao bound in the representation of risk in economics. As the Cramer-Rao bound is a function of Fisher information, we also examine the use of information-theoretic approaches to probability density extraction with a particular focus on the minimum Fisher information approach. The ubiquity of probability densities in economic theory in general and financial economics in particular recommends the use of information-theoretic approaches for their extraction and the Cramer-Rao bound as a practical measure of the risk implied by these distributions. Keywords: Cramer-Rao bound, Fisher information, Shannon information, Economics
1
Introduction
A substantial portion of economics deals, directly or indirectly, with probability densities. From Keynes’s (1937) comments concerning the distinction between risk and uncertainty in his development of macroeconomics “…the prospect of a European war is uncertain, or the price of copper and the rate of interest twenty years hence, or the obsolescence of a new invention, or the position of private wealth owners in the social system of 1970. About these matters there is no scientific basis on which to form any calculable probabilities whatever. We simply do not know. Nevertheless, the necessity for action and for decision compels us as practical men to do our best to overlook this awkward fact and to behave exactly as we should if we had behind us a good Benthamite calculation of a series of prospective advantages and disadvantages, each multiplied by its appropriate probability, waiting to be summed.”
Handbook of Statistics, Vol. 45. https://doi.org/10.1016/bs.host.2021.06.005 Copyright © 2021 Elsevier B.V. All rights reserved.
161
162 SECTION
II Theoretical applications and physics
to Osborne’s (1959) discovery of Brownian motion in the stock market which laid the foundation of modern financial economics (Bouchaud and Potters, 2003), to the more recent finding of the ubiquity of power laws in economics (Gabaix, 2009; Gabaix et al., 2003), the estimation and use of probability densities has been a central feature of theoretical and applied economics. The use of information theory in the analysis of probability densities in economics follows, in part, from the development of financial economics in the years after the publication of the celebrated Black–Scholes option-pricing equation (Black and Scholes, 1973; Merton, 1974), which illustrated the formal similarities between economics and statistical mechanics (Aoki and Yoshikawa, 2007; Schulz, 2003). The influence of information theory in statistical mechanicsa had, in part through the formal similarity mentioned above, a similar influence in economics, with Shannon entropy (Shannon, 1948) being adopted early in this development. While Shannon information has played a key role in economics in general and the analysis of equilibrium states in particular, the importance of dynamics and non-equilibrium states inspired our extension of this line of inquiry with Fisher information Fisher (1922). Among the important insights that Fisher information, IFi, provides is its relationship with the variance σ 2 via the celebrated Cramer-Rao bound (Cramer, 1946; Rao, 1945) σ 2 1=I Fi :
(1)
Although simple in form, the Cramer-Rao bound links two important concepts in economics: the information in a probability density and the risk, or quantifiable uncertainty, associated with an economic random variable such as price. In this chapter we explore some applications of Shannon entropy and Fisher information in economics, highlighting the importance of the Cramer-Rao bound in them. To this end we continue with a brief review of Shannon entropy and Fisher information in Section 2, and then move on to financial economic applications in Section 3 and macroeconomic applications in Section 4. We close in Section 5 with a discussion and summary.
2 Shannon entropy and Fisher information Probability distributions p(x) implicit in observed data d1, …dM ¼ {dm} that can be expressed as averages of known functions {fm(x)}: Z f m ðxÞPðxÞdx ¼ dm , m ¼ 1, …, M, (2)
a
See, for example, Jaynes (1957a, 1957b, 1968, 2003), Katz (1967), Balian (1982), and Ben-Naim (2008).
Information, economics, and the Cram er-Rao bound Chapter
8 163
are a classic inverse problem. Perhaps the best-known approach to this inverse problem is to use Shannon entropy HSh ≡ P(x) ln(P(x))dx to form the Lagrangian Z Z M X fm ðxÞPðxÞdx dm , L ¼ HSh J ¼ PðxÞ ln ðPðxÞÞdx λm (3) m¼1
where the {λi} are Lagrange multipliers and J is the known or intrinsic information in the problem, and to use variational calculus to obtain the density ( Jaynes, 1968) PM e m¼1 λm fm ðxÞ PSh ðxÞ ¼ Z : (4) PM e m¼1 λm fm ðxÞ dx This exponential density, the well-known Boltzmann density, gives the equilibrium density consistent with the known data. This density is obtained by extremizing the Lagrangian which is the difference, or asymmetry, between the Shannon entropy HSh and the intrinsic information. Thus, the probability density for the economy follows from extremizing the asymmetric information of the economy. While this density is extremely useful, the complete description of an economy often requires a treatment of dynamics including departures from equilibrium, and for this one is often better served by replacing Shannon entropy with Fisher information IFi given byb 2 Z Z dPðxÞ 1 2 0 I Fi ¼ 4 jψ ðxÞj dx ¼ , (5) dx Pð x Þ and ψ 0 (x) is the derivative of ψ(x) which solves " # M X dψ ðxÞ 1 ¼ λ0 + λm fm ðxÞ ψ ðxÞ dx 4 m¼1
(6)
and is related to the probability density PFi(x) by PFi(x) ¼ j ψ(x)j2. While the Fisher information can be calculated using PSh or PFi, the Fisher information associated with PFi is particularly useful since, via the CramerRao bound, the minimum Cramer-Rao bound will be larger. Since the Cramer-Rao bound provides a measure of risk in an economy, an understanding of the minimum risk associated with a given set of economic observables or other constraints is of particular utility.
b
See Frieden (1998, 2004, 2007), and references therein.
164 SECTION
II Theoretical applications and physics
3 Financial economics A central valuation paradigm in financial economics is the notion that the price of a future cash flow is the expected (probability-weighted), discounted value of that cash flow, and that the price of any security is the sum of the price of the constituent expected discounted cash flows. Probability densities (and Cramer-Rao bounds) appear on both aspects of these calculations. First, there is the obvious relationship between a probability density and the expected future cashflow. Future cashflows are a function of the future level of financial assets, a level regarding which there is uncertainty. Thus, when one speaks of future cashflows, one always speaks in terms of expectation values. Less obvious is the presence of probability densities in the discount function: the function that relates a unit of currency (e.g., one dollar) to be paid in the future to its price today. Since the discount factor is fundamental to all security pricing, we continue with this in Section 3.1 and follow with an exploration of derivative securities in Section 3.2.
3.1 Discount factors and bonds Discount factors are, perhaps, the most frequently encountered probability density encountered by the investing public. A discount factor answers the question “how much money would I need to put in a deposit account at a bank today to have one unit of currency (e.g., one euro) at a specified time in the future. As such, they embody the concept of being paid interest (money) by a bank for lending money to the bank. One often thinks of interest as the additional money one will have in the future: e.g., at a 5%/year annual interest rate one would have euro 1.05 one year from now. But when contemplating the need to make a payment in a year, one would observe that at this rate of interest one would need only to deposit 0.9524 (¼1/1.05) euro today to have one euro in the future. The relationship between the amount one deposits today, the present value (PV), the amount one expects to receive in the future, the future value (FV), and the discount factor (DF) is given by PV ¼ DFðtÞ FV ðtÞ
(7)
where the time argument indicates the length of time that the money accrues interest, and from this the notion of a probability density emerges. In the t ! 0 limit there is no time to accumulate interest and we have that PV ¼ FV(0) and DF(0) ¼ 1. For all t > 0 there is time to accumulate interest and DF(t) < 1. This is often expressed as Rt f ðsÞds DFðtÞ ¼ erðtÞt ¼ e 0 (8) where r(t) and f(t) are the continuously-compounded spot and forward interest rates, respectively. Brody and Hughston (2001, 2002) were the first to see this behavior of the discount factor as indicative of a cumulative probability density
Information, economics, and the Cram er-Rao bound Chapter
Z
∞
DFðtÞ ¼
Z pðsÞds ¼
t
∞
Θðs tÞpðsÞds
8 165
(9)
0
where Θ(x) is the Heaviside step function, and to develop this concept from the perspective of information geometry in the context of Shannon entropy. A consequence of this analysis is that given the constraints of (i) probability normalization, and reproduction of the price of both (ii) a perpetual annuity ξ Z ∞ ξ¼ tpðtÞdt (10) 0
and (iii) the level of the observed discount factor Z ∞ DFðtÞ ¼ Θðs tÞpðsÞds ,
(11)
0
maximizing the Shannon entropy yields (Hawkins et al., 2020) pðtÞ ¼
1 λ1 tλ2 ΘðtτÞ e Z
(12)
where the partition function Z is given by Z ∞ 1 Z¼ eλ1 tλ2 ΘðtτÞ dt ¼ 1 eλ1 τ + eλ2 eλ1 τ λ1 0 and the discount factor is given by λ2 λ1 t e e 1 DðtÞ ¼ λ1 Z eλ1 t eλ1 τ + eλ2 eλ1 τ
for t τ for t < τ
:
(13)
(14)
In addition to being sound from an information-theory perspective and computationally parsimonious, the approach of Brody and Hughston (2001, 2002) also provides insights into the behavioral economics of discount factors. There is considerable evidence, for example, that humans value current consumption more highly than future consumption and that this may be responsible for observed deviations from the approximation of a constant spot rate proposed by Samuelson (1937). Remarkably, for t > τ and with the identifications δ ¼ eλ1 t
(15)
eλ2 ð λ1 Z Þ
(16)
and β¼
it is identical to the β–δ modelc of behavioral economics (Frederick et al., 2002; Laibson, 1997).
c
This is also known as quasi-hyperbolic discounting.
166 SECTION
II Theoretical applications and physics
The Heaviside step function also has a behavioral-based probabilistic interpretation: that of the willingness of an economic agent to accept an amount less than the level of the discount factor as a function of time: before the time associated with the discount factor the probability is zero, while after the time associated with the discount factor the probability is one. This ideal behavior is rarely seen in humans; a smoother transition between zero and one being the typical observable (Hawkins et al., 2020; Luce and Suppes, 1965; Scharfenaker and Foley, 2017). And while we have modeled this (Hawkins et al., 2020) by leveraging the work of Scharfenaker and Foley (2017) in a Shannon-entropy context, a Fisher-information based analysis has yet to be completed. While these results can be related to Fisher information via Eq. (5), a result that guarantees a maximum value for the Cramer-Rao bound can be had by minimizing the Fisher information discussed above directly. An example of this is shown in the right-hand panels of Fig. 1 which shows the solution for and observed 10-year discount factor with a value of 0.35 (Frieden et al., 2007; Hawkins et al., 2005). Our solution of Eq. (6) exploits its similarity with the time-independent Schroedinger wave equation shown in the upper-left panel with the Heaviside function forming a potential and the normalizing Lagrange multiplier, λ0, being an analog of the associated energy.
FIG. 1 The Fisher information results for a discount factor (left column) and a fixed-coupon bond (right column). Reprinted from Hawkins, R.J., Frieden, B.R., D’Anna, J.L. 2005. Ab initio yield curve dynamics. Phys. Lett. A 344 (5), 317–323, with permission from Elsevier.
The probability density function is proportional to the ground state of the associated finite square well. In the lower-left panel we see the associated spot and forward interest rates that follow from Eq. (8). That these rates follow from the discount factor, which in turn follows from the Fisher-information-based probability density, illustrates the direct connection between the information-theoretic basis of probability estimation, the Cramer–Rao-bound-associated variance of that density, and the interest rates that economic agents use to make investment decisions. Since this solution was generated by minimizing the Fisher information, the Cramer–Rao bound ensures that we have a solution with maximum variance; this is the Fisher-information analog of the smoothness obtained by minimizing the Shannon information.

Having demonstrated the use of Fisher information on a single discount factor, let's see how this works for the most ubiquitous collection of discount factors: a fixed-coupon bond. When an investor buys a bond they are lending money to the entity that issued the bond. In exchange for being lent that money, the bond issuer agrees to make a periodic interest payment to the lender (the "coupon") and to return the money borrowed (the principal) along with the final coupon payment at the time in the future when the bond matures. The coupons and principal are the future cashflows, and for each of these cashflows the present value is the expected value multiplied by the discount factor for that time horizon.

A Fisher-information-based analysis of bonds is illustrated in the right-hand panels of Fig. 1, where the results for a 6.75%/year fixed-coupon bond making semi-annual payments, with a maturity date of November 15, 2006, a price of 77.055% of par, and a pricing date of October 31, 1994, are shown. For ease of illustration we have assumed that the cashflows of this bond are certain. The price of a bond is the present value of all of its future values. The future values are (i) the collection of coupon (interest) payments to be made at semi-annual intervals until the maturity of the bond and (ii) the return of the principal amount (the amount the investor lent to the bond issuer) at the maturity date. The associated potential shown in the upper-right panel begins with the accumulation of Heaviside functions at each coupon-payment date and, after 10 years, includes the return of principal. From a quantum-mechanical perspective this is another finite-well problem, with a ground state λ0 and PDF corresponding to our prior analysis. Also from a quantum-mechanical perspective, there may be excited states: deviations from equilibrium that are expected in the event of an economic shock and that follow a prescribed set of relaxation dynamics.

These results, together with the risk information carried by the Cramer–Rao bound, recommend the Fisher-information approach in a variety of applications. The discount factor is the foundation of the field of fixed income: bonds and interest rates. Embedded within it are a number of probability densities whose value and variance directly impact the global flow of funds. Fisher information at the Cramer–Rao bound provides a useful, and currently underexploited, approach to a variety of questions in this field.
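A minimal sketch of the cashflow decomposition just described: each coupon and the principal is a future value, and the bond price is the sum of these cashflows multiplied by their discount factors. The flat spot curve below is an arbitrary stand-in for the Fisher-information-derived discount function, and the 12-year schedule is only loosely modeled on the bond discussed above.

```python
import numpy as np

# Bond price as the present value of its future cashflows.
coupon_rate, principal, years = 0.0675, 100.0, 12.0   # illustrative terms
times = np.arange(0.5, years + 0.25, 0.5)             # semi-annual payment dates

cashflows = np.full(times.shape, 0.5 * coupon_rate * principal)
cashflows[-1] += principal                            # principal returned at maturity

r = 0.08                                              # assumed flat spot rate
DF = np.exp(-r * times)                               # discount factor per date, Eq. (8)

print(f"present value: {np.sum(cashflows * DF):.3f} per {principal:.0f} of principal")
```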
3.2 Derivative securities

Derivative securities (derivatives) are securities whose price is an explicit function of the price of another asset or index, referred to as the underlying. Common examples of these securities are options and futures. Derivatives are of particular interest as their prices often measure moments of the probability density of the future price of the underlying, and both Shannon- and Fisher-based methods have proved useful in reconstructing the probability densities implicit in the price of derivative securities.^d Consider, for example, the expressions for the price of a call option, c, and put option, p,

$$c(s, k, \sigma, r, t) = e^{-rt} \int_0^{\infty} \max(s' - k,\, 0)\, P(s, s', \sigma, t)\, ds' \tag{17}$$
and

$$p(s, k, \sigma, r, t) = e^{-rt} \int_0^{\infty} \max(k - s',\, 0)\, P(s, s', \sigma, t)\, ds' \tag{18}$$
that give the holder of the option the right to buy or sell the underlying (now trading at price s) at the strike price, k, respectively. The call and put prices are the present value of their expected payoff functions, max(s′ − k, 0) and max(k − s′, 0), respectively. Since the discount factors exp(−rt) are given by the fixed-income markets (cf. Section 3.1), the payoff functions are specified in the option contracts, and the current option prices (the left-hand sides of Eqs. (17) and (18)) are observable in the options market, the determination of P(s, s′, σ, t) is also a classic inverse problem that lends itself to our information-theoretic approach and Cramer–Rao bound analysis.

The basis of the Fisher-information solution is illustrated in Fig. 2 for the case of a single call option with a strike of 1.5 and its associated underlying asset. In panel (A) we see the payoff function of the underlying asset: a 1:1 relationship to its level in the future. The payoff function of the call option with a strike price, k, of 1.5 is shown in panel (B). It is not in the option holder's interest to exercise the right to buy at 1.5 if the underlying is trading for less than 1.5; they would instead let the option expire worthless. On the other hand, if the underlying is trading for more than 1.5 then the option holder would (i) exercise the right to buy the underlying at 1.5, (ii) immediately sell the underlying at the prevailing price s′, and (iii) keep the difference s′ − k. With these two securities a potential can be formed by being short the underlying and long two calls, as shown in panel (C). This will, whether using the Shannon or Fisher approach, result in a localized PDF, as shown in panel (D), where the equilibrium and first excited states obtained with the Fisher-information approach are displayed.
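The potential construction of panel (C) is easy to reproduce; the short sketch below (ours) combines a short position in the underlying with two long calls and verifies that the resulting payoff is a V-shaped well with its minimum at the strike.

```python
import numpy as np

# Potential formed from payoff functions: short the underlying, long two calls.
k = 1.5
s = np.linspace(0.0, 3.0, 301)           # level of the underlying asset

underlying = s                            # panel (A): 1:1 payoff
call = np.maximum(s - k, 0.0)             # panel (B): call payoff, max(s - k, 0)
potential = -underlying + 2.0 * call      # panel (C): V-shaped well

print("potential minimum at s =", s[np.argmin(potential)])   # at the strike
```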
^d See, for examples, Buchen and Kelly (1996), Hawkins et al. (1996), Hawkins (1997), and references therein.
FIG. 2 The payoff function of the underlying (upper-left) and a call option (upper-right) together with the potential formed from a linear combination of these payoff functions (lower-left) and some of the associated probability amplitudes from Eq. (6); the horizontal axis is the level of the underlying asset. Reprinted from Hawkins, R.J., Frieden, B.R., 2017. Quantization in financial economics: an information-theoretic approach. In: Haven, E., Khrennikov, A. (Eds.), The Palgrave Handbook of Quantum Models in Social Science. Springer, pp. 19–38, with permission from Elsevier.
An example of the Shannon-entropy version of this approach is illustrated in Fig. 3, where we see some snapshots of the evolution of the probability density implicit in the prices of options on the S&P 500 index. In each panel we plot the maximum Shannon-entropy result (the solid curve) together with a prior probability (dashed curve) given by the lognormal density implicit in the Black-Scholes option price for the option whose strike is closest to the current level of the S&P 500 index (also known as the at-the-money option). This prior was chosen because market practitioners would, in the absence of any other information, consider that density to be representative of the market at that time. This prior is materially different from the uniform prior assumed in a basic implementation of a maximum Shannon-entropy inversion, and suggested the use of the Kullback-Leibler divergence D_KL

$$D_{KL} = \int_0^{\infty} P(s) \ln\left[\frac{P(s)}{g(s)}\right] ds \tag{19}$$

where g(s) can be the lognormal density or whatever density is most likely in the absence of any option price information because, in the absence of any option price information, g(s) replaces the uniform density as the Shannon-entropy PDF.
FIG. 3 The probability density functions associated with S&P500 index options in (A) January of 1987, (B) December of 1987, and (C) January of 1990. Reprinted from Hawkins, R.J., Rubinstein, M., Daniell, G.J., 1996. Reconstruction of the probability density function implicit in option prices from incomplete and noisy data. In: Hanson, K. M., Silver, R.N. (Eds.), Maximum Entropy and Bayesian Methods. Fundamental Theories of Physics. Kluwer Academic, Dordrecht, pp. 1–8, with permission from Springer Nature.
Our calculation of the implied PDF also employed a variation on the usual constraint, with the Lagrangian

$$L = D_{KL} - \sum_{m=1}^{M} \lambda_m \left( c_{calc}(k_m) - c_{obs}(k_m) \right) \tag{20}$$
suggested by Skilling and Bryan (1984), who modified the Lagrangian to

$$L = D_{KL} - \frac{1}{2} \lambda \chi^2 \tag{21}$$
where

$$\chi^2 = \sum_{m=1}^{M} \left[ \frac{c_{calc}(k_m) - c_{obs}(k_m)}{\sigma(k_m)} \right]^2 \tag{22}$$
and where M is the number of observed call prices, c_calc(k_m) is the calculated call price at the mth strike price, k_m is the mth strike price, c_obs(k_m) is the
observed call price at the mth strike price, and σ(k_m) is the standard error of c_obs(k_m), which was proxied by the bid-ask spread for each strike.

The temporal evolution of the implied PDF of the S&P 500 index options responded in ways both expected and unexpected, as shown in Fig. 3. In panel (A) we see the densities in January of 1987. Both densities are essentially the same: one cannot distinguish them by eye. This means that the lognormal density calibrated to the one call option whose strike was at that time closest to the level of the S&P 500 index was able to price options across all strikes for this option expiration. Consequently, the Kullback-Leibler divergence in this panel is effectively zero. In panel (B) we see the densities in December of 1987, 2 months after the market crash of October 1987. The index has clearly declined in value (roughly 24% of the index value was lost in the crash) and the width of the PDF has increased, reflecting the increased uncertainty in the market at that time. Curiously, the Kullback-Leibler divergence in this panel is also effectively zero. Finally, in panel (C) we see the densities associated with the S&P 500 index options in January of 1990, 3 months after the "market correction" of October 1989. While the prior density has risen in mean and narrowed, the density needed to price options on all strikes differs dramatically from the prior: this Kullback-Leibler divergence is definitely nonzero. It would appear that the market participants are now pricing in a higher probability of a crash, as evidenced by the bump in the density seen in the low 300s.
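A rough sketch of this regularized inversion is given below: a density on a grid is kept close, in the sense of Eq. (19), to a lognormal prior while repricing a set of calls within their errors, with the χ² of Eq. (22) entering as a penalty (the multiplier's sign is flipped relative to Eq. (21) because we minimize rather than maximize). All market inputs (grid, prior parameters, strikes, prices, spreads, rate) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

s = np.linspace(150.0, 600.0, 181)                     # grid for the index level
ds = s[1] - s[0]

mu, sig, r, t = np.log(300.0), 0.15, 0.05, 0.25        # hypothetical prior and rate
g = np.exp(-((np.log(s) - mu) ** 2) / (2 * sig**2)) / (s * sig * np.sqrt(2 * np.pi))
g /= np.sum(g) * ds                                    # lognormal prior density

strikes = np.array([280.0, 300.0, 320.0])              # hypothetical market data
c_obs = np.array([26.0, 13.5, 5.8])
sigma_k = np.array([0.5, 0.4, 0.5])                    # bid-ask-spread proxies

def call_prices(P):
    payoff = np.maximum(s[None, :] - strikes[:, None], 0.0)
    return np.exp(-r * t) * payoff @ P * ds            # c_calc(k_m), cf. Eq. (17)

def objective(x, lam=10.0):
    P = np.exp(x) / (np.sum(np.exp(x)) * ds)           # positive and normalized
    dkl = np.sum(P * np.log(P / g)) * ds               # Eq. (19)
    chi2 = np.sum(((call_prices(P) - c_obs) / sigma_k) ** 2)   # Eq. (22)
    return dkl + 0.5 * lam * chi2

res = minimize(objective, x0=np.log(g), method="L-BFGS-B")
P_fit = np.exp(res.x) / (np.sum(np.exp(res.x)) * ds)
print("repriced calls:", np.round(call_prices(P_fit), 2), "observed:", c_obs)
```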
4 Macroeconomics
The development of our information-theoretic understanding of the macroeconomy and related areas of economics is due, largely, to the work led by Aoki (Aoki, 1998, 2001; Aoki and Yoshikawa, 2007), Foley (1994, 1996), Scharfenaker and Foley (2017), and others (Bahr and Passerini, 1998a, b; Helbing, 2010; Weidlich, 2000; Weidlich and Haag, 1983) for whom macroeconomics is but one instance of the challenge of predicting aggregate behavior from a limited micro-level understanding. An example that aligns well with our examination of the Cramer–Rao bound is the discussion in Yoshikawa (2003) and Aoki and Yoshikawa (2007) that considers an economy with S sectors, each of size n_s where s = 1, …, S. The economy has a resource constraint

$$\sum_{s=1}^{S} n_s = N \tag{23}$$
that follows from the endowment of total production factor N being divided among the sectors. Each sector produces output Y_s,

$$Y_s = c_s n_s, \tag{24}$$
where c_s is the productivity of sector s. Since productivity differs across sectors, the sectors are ordered such that c₁ < c₂ < ⋯ < c_S.

… {u : u_i > 0 ∀ i}. Note that when λ → 0, ϑ reduces to the usual exponential coordinates θ^i = log(u_i/u_0). By Proposition 4, the dual variable η = ∇_λ φ_λ(ϑ) gives the escort distribution:

$$\eta_i = \frac{(p(i|\vartheta))^{1-\lambda}}{\sum_{j=0}^{d} (p(j|\vartheta))^{1-\lambda}} = \frac{(u_i)^{1-\lambda}}{\sum_{j=0}^{d} (u_j)^{1-\lambda}}, \qquad i = 1, \ldots, d, \tag{58}$$

and we can define $\eta_0 = 1 - \sum_{i=1}^{d} \eta_i$.
5.2 λ-Mixture representation

Dually, for i ∈ X, let P_i(ζ) = δ_i(ζ) be the density of the point mass at i and consider the corresponding λ-mixture family p(·|η) given by (25) (not to be confused with p(·|ϑ)):

$$p(\zeta|\eta) = \frac{1}{Z_\lambda(\eta)} \left( \sum_{j=0}^{d} \eta_j\, \delta_j(\zeta) \right)^{1/(1-\lambda)}.$$

That is, p(ζ|η) = 0 for ζ ∉ {0, 1, ⋯, d} and

$$p(i|\eta) = \frac{(\eta_i)^{1/(1-\lambda)}}{\sum_{j=0}^{d} (\eta_j)^{1/(1-\lambda)}}, \qquad i = 0, 1, \cdots, d. \tag{59}$$
Note that if η is given by Eq. (58), then we have p(·|ϑ) = p(·|η). Thus the probability simplex is both a λ-exponential family and a λ-mixture family.

Consider the mixture variable η. Since the set {η ∈ (0,1)^{1+d} : η₀ + ⋯ + η_d = 1} is not open, we consider instead the open domain

$$\Omega_0 = \{\hat{\eta} = (\hat{\eta}_1, \ldots, \hat{\eta}_d) \in (0,1)^d : \hat{\eta}_1 + \cdots + \hat{\eta}_d < 1\},$$
where the notation η̂ signifies that the 0-th coordinate is dropped. Consider the negative Renyi entropy ψ, given by Eq. (46), as a function of η̂. Note that

$$\frac{\partial \psi}{\partial \hat{\eta}_i} = \frac{\partial \psi}{\partial \eta_i} - \frac{\partial \psi}{\partial \eta_0}, \qquad i = 1, \ldots, d. \tag{60}$$
Since $\frac{1}{\lambda}(e^{\lambda \psi(\hat{\eta})} - 1)$ is convex in η̂ (which is a linear transformation of η), we may consider the λ-duality. The following result completes the circle of ideas (see Wong and Zhang, 2021 for a more general result).

Proposition 8. With respect to the probability simplex Δ^d, the primal variable ϑ in (57), when Δ^d is considered as a λ-exponential family with the form (56), equals the conjugate variable ϑ̂ = ∇_λ ψ(η̂) when Δ^d is considered as a λ-mixture family with mixture variable η̂:

$$\vartheta^i = \hat{\vartheta}^i, \qquad i = 1, \ldots, d. \tag{61}$$

Proof. Given η̂ ∈ Ω₀, consider η where η_i = η̂_i for 1 ≤ i ≤ d and η₀ = 1 − Σ_i η̂_i. Using (9) and (60), we have

$$\psi(\hat{\eta}) = -\frac{1-\lambda}{\lambda} \log\left( \sum_{j=0}^{d} \eta_j^{1/(1-\lambda)} \right)$$

and

$$\frac{\partial \psi}{\partial \hat{\eta}_i} = -\frac{1}{\lambda}\, \frac{\eta_i^{\lambda/(1-\lambda)} - \eta_0^{\lambda/(1-\lambda)}}{\sum_{j=0}^{d} \eta_j^{1/(1-\lambda)}}.$$

So

$$1 + \lambda \sum_{i=1}^{d} \frac{\partial \psi}{\partial \hat{\eta}_i}\, \hat{\eta}_i = \frac{\eta_0^{1/(1-\lambda)-1}}{\sum_{j=0}^{d} \eta_j^{1/(1-\lambda)}}.$$

Computing the λ-gradient (31), we have, for any i = 1, …, d,

$$\hat{\vartheta}^i = \frac{1}{\lambda}\left[ \left( \frac{u_i}{u_0} \right)^{\lambda} - 1 \right].$$

Hence we obtain (61). □
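Equations (58) and (59), together with Proposition 8, imply that the λ-exponential and λ-mixture parameterizations of the simplex describe the same distributions. This is easy to verify numerically; a minimal sketch:

```python
import numpy as np

# Check that escort coordinates (Eq. (58)) fed into the lambda-mixture
# form (Eq. (59)) recover the original distribution p(i) = u_i / sum(u).
lam = 0.4                                 # any lambda != 0, 1
u = np.array([1.0, 2.5, 0.7, 4.2])        # arbitrary positive weights

p_exp = u / u.sum()                       # the lambda-exponential family member

eta = u**(1 - lam) / np.sum(u**(1 - lam))                    # Eq. (58)
p_mix = eta**(1 / (1 - lam)) / np.sum(eta**(1 / (1 - lam)))  # Eq. (59)

print(np.allclose(p_exp, p_mix))          # True
```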
6 Summary and conclusion

This chapter summarizes a novel approach for studying subtractive and divisive normalizations in deformed exponential family models. A typical example of the former is the q-exponential family with associated Tsallis entropy
whereas an example of the latter is the F^(α)-family and the associated Renyi entropy. Our first conclusion is that these two versions of deformation of an exponential family are two faces of the same coin: under a reparameterization, they are one and the same. Nevertheless, using different dualities leads to genuinely different mathematical structures.

We introduce the λ-exponential family, which reparameterizes (13) by

$$p^{(\lambda)}(\zeta|\cdot) = \exp_\lambda(\vartheta \cdot F(\zeta))\, e^{-\varphi_\lambda(\vartheta)} = \exp_\lambda(\theta \cdot F(\zeta) - \phi_\lambda(\theta)).$$

The λ-exponential family is also linked to the λ-mixture family, when λ ≠ 0, 1, via a reparameterization of the random functions F(ζ) above. A basic example is the finite simplex, which is both a λ-exponential family and a λ-mixture family. Several other examples are considered in Wong and Zhang (2021). The coincidence of these two parameterizations of the deformed family is associated with the λ-duality, which is the main focus of our exposition.

The λ-duality is a "deformation" of the usual Legendre duality reviewed in Section 3.1. In a nutshell, instead of convex functions, we work with λ-convex functions f such that $\frac{1}{\lambda}(e^{\lambda f} - 1)$ is convex, for a fixed λ ≠ 0. Also, instead of the convex conjugate, we use the λ-conjugate given by

$$f^{(\lambda)}(u) = \sup_x \left\{ \frac{1}{\lambda} \log(1 + \lambda\, x \cdot u) - f(x) \right\}.$$

Turning the above into an inequality leads to the nonnegative λ-logarithmic divergence associated to the λ-convexity (a generalization of the Bregman divergence associated to regular convexity):

$$L_{\lambda, f}(x, y) = f(x) - f(y) - \frac{1}{\lambda} \log\left(1 + \lambda\, \nabla f(y) \cdot (x - y)\right).$$
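As a small numerical illustration of these objects (our construction, not taken from the text), take the one-dimensional λ-convex function f(x) = (1/λ) log(1 + λx²/2), for which (1/λ)(e^{λf} − 1) = x²/2 is convex; the sketch below checks that the λ-logarithmic divergence displayed above is nonnegative on a grid and vanishes on the diagonal.

```python
import numpy as np

lam = 0.5
f = lambda x: np.log(1.0 + lam * x**2 / 2.0) / lam   # a lambda-convex function
df = lambda x: x / (1.0 + lam * x**2 / 2.0)          # its derivative

def L(x, y):
    """lambda-logarithmic divergence of f, as in the displayed formula."""
    return f(x) - f(y) - np.log(1.0 + lam * df(y) * (x - y)) / lam

xs = np.linspace(-1.0, 1.0, 21)
vals = [L(x, y) for x in xs for y in xs]
print("min over grid:", min(vals), " L(x, x) =", L(0.3, 0.3))   # >= 0, and 0
```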
Coming back to the probability families, we first verified that the subtractive potential ϕ_λ(θ) is convex in θ and the divisive potential φ_λ(ϑ) is λ-convex in ϑ. Subtractive normalization using ϕ_λ(θ) is associated with the regular Legendre duality, whereas multiplicative normalization using φ_λ(ϑ) is associated with the λ-duality. This gives an interpretation of the distinctiveness of the Renyi entropy (used in the latter) from the Tsallis entropy (used in the former), based on their intimate connections to the λ-duality (for λ ≠ 0) and to the Legendre duality, respectively. As λ is the parameter that controls the curvature in the Riemannian geometry of these probability families (see Wong, 2018), our framework provides a simple parametric deformation from the dually flat geometry (of the exponential model) to the dually projectively flat geometry (of the λ-exponential model). We expect that this framework will generate new insights in the applications of the q-exponential family and related concepts in statistical physics and information science.
References

Amari, S.-I., 2016. Information Geometry and Its Applications. Springer.
Amari, S.-I., Nagaoka, H., 2000. Methods of Information Geometry. American Mathematical Society.
Amari, S.-I., Ohara, A., 2011. Geometry of q-exponential family of probability distributions. Entropy 13 (6), 1170–1185.
Amari, S.-I., Ohara, A., Matsuzoe, H., 2012. Geometry of deformed exponential families: invariant, dually-flat and conformal geometries. Phys. A Stat. Mech. Appl. 391 (18), 4308–4319.
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J., 2005. Clustering with Bregman divergences. J. Mach. Learn. Res. 6 (Oct), 1705–1749.
Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. J. R. Stat. Soc. B (Methodol.) 26 (2), 211–243.
de Andrade, L.H.F., Vieira, F.L.J., Cavalcante, C.C., 2021. On normalization functions and ϕ-families of probability distributions. In: Progress in Information Geometry: Theory and Applications. Springer Nature, p. 19.
Eguchi, S., 2006. Information geometry and statistical pattern recognition. Sugaku Expo. 19 (2), 197–216.
Kaniadakis, G., 2001. Non-linear kinetics underlying generalized statistics. Phys. A Stat. Mech. Appl. 296 (3–4), 405–425.
Matsuzoe, H., 2014. Hessian structures on deformed exponential families and their conformal structures. Differ. Geom. Appl. 35, 323–333.
Montrucchio, L., Pistone, G., 2017. Deformed exponential bundle: the linear growth case. In: International Conference on Geometric Science of Information. Springer, pp. 239–246.
Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S., 2004. Information geometry of U-boost and Bregman divergence. Neural Comput. 16 (7), 1437–1481.
Naudts, J., 2004. Estimators, escort probabilities, and ϕ-exponential families in statistical physics. J. Inequalities Pure Appl. Math. 5 (4), 102.
Naudts, J., 2008. Generalised exponential families and associated entropy functions. Entropy 10 (3), 131–149.
Naudts, J., 2011. Generalised Thermostatistics. Springer.
Naudts, J., Zhang, J., 2018. Rho-tau embedding and gauge freedom in information geometry. Inf. Geom. 1 (1), 79–115.
Newton, N.J., 2012. An infinite-dimensional statistical manifold modelled on Hilbert space. J. Funct. Anal. 263 (6), 1661–1681.
Ohara, A., Matsuzoe, H., Amari, S.-I., 2012. Conformal geometry of escort probability and its applications. Mod. Phys. Lett. B 26 (10), 1250063.
Pal, S., Wong, T.-K.L., 2016. The geometry of relative arbitrage. Math. Financ. Econ. 10 (3), 263–293.
Pal, S., Wong, T.-K.L., 2018. Exponentially concave functions and a new information geometry. Ann. Probab. 46 (2), 1070–1113.
Pal, S., Wong, T.-K.L., 2020. Multiplicative Schrödinger problem and the Dirichlet transport. Probab. Theory Relat. Fields 178 (1), 613–654.
Renyi, A., 1961. On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics.
Tsallis, C., 1988. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 52 (1–2), 479–487.
Tsallis, C., 1994. What are the numbers that experiments provide. Quimica Nova 17 (6), 468–471.
Van Erven, T., Harremos, P., 2014. Renyi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 60 (7), 3797–3820.
Villani, C., 2003. Topics in Optimal Transportation. American Mathematical Society.
Villani, C., 2008. Optimal Transport: Old and New. Springer.
Wong, T.-K.L., 2018. Logarithmic divergences from optimal transport and Renyi geometry. Inf. Geom. 1 (1), 39–78.
Wong, T.-K.L., 2019. Information geometry in portfolio theory. In: Geometric Structures of Information. Springer, pp. 105–136.
Wong, T.-K.L., Yang, J., 2019. Logarithmic divergence: geometry and interpretation of curvature. In: International Conference on Geometric Science of Information. Springer, Cham, pp. 413–422.
Wong, T.-K.L., Yang, J., 2021. Pseudo-Riemannian geometry encodes information geometry in optimal transport. Inf. Geom. (in press).
Wong, T.-K.L., Zhang, J., 2021. Tsallis and Renyi deformations linked via a new λ-duality. arXiv preprint arXiv:2107.11925.
Zhang, J., 2004. Divergence function, duality, and convex analysis. Neural Comput. 16 (1), 159–195.
Zhang, J., 2005. Referential duality and representational duality on statistical manifolds. In: Proceedings of the Second International Symposium on Information Geometry and Its Applications, Tokyo, Japan, vol. 1216, pp. 58–67.
Zhang, J., 2013. Nonparametric information geometry: from divergence function to referential-representational biduality on statistical manifolds. Entropy 15 (12), 5384–5418.
Zhang, J., 2015. On monotone embedding in information geometry. Entropy 17 (7), 4485–4499.
Chapter 11

Some remarks on Fisher information, the Cramer–Rao inequality, and their applications to physics

H.G. Miller^a,*, Angelo Plastino^b, and A.R. Plastino^{b,c}
^a Physics Department, SUNY Fredonia, Chautauqua, NY, United States
^b National University La Plata (UNLP), IFLP-CCT-Conicet, La Plata, Argentina
^c CeBio y Departamento de Ciencias Básicas, Universidad Nacional del Noroeste de la Prov. de Buenos Aires, UNNOBA, CONICET, Junin, Argentina
* Corresponding author: e-mail: [email protected]
Abstract On the basis of a brief review of some of our previous work on Fisher information, we discuss some aspects of the applications of this information measure, and of the associated Cramer–Rao inequality, to Physics. We also suggest some promising lines for further enquiry on the physical implications of the Fisher measure. Keywords: Fisher information, Cramer–Rao inequality, Continuity equations
1 Introduction

There is much interest in the properties and physical applications of the Fisher information measure (Fisher, 1925) and of the associated Cramer–Rao inequality (Plastino and Plastino, 2020; Rao, 1945), due in large part to the work of Frieden (1998), Frieden (2004), Frieden (1989), Frieden (1990), and Frieden (1992). Of interest is its connection to estimation theory and its relevance in determining a lower bound on the estimates of statistical quantities (Frieden, 1989). The Fisher information measure and the Cramer–Rao inequality play a central role in connection with diverse essential aspects of quantum mechanics and statistical physics (Brody and Meister, 1995; Dehesa et al., 2012; Flego et al., 2003; Frieden and Soffer, 1995; Plastino and Plastino, 1995a, 1996; Plastino et al., 1996, 1997). In the case of Brownian motion, where the evolution of the associated probability density function P(x, t) is governed by the
diffusion equation, relationships exist between the behavior of the Fisher information measure, the Boltzmann entropy, and the Kullback relative entropy (Plastino et al., 1997). For the particular case of solutions to the diffusion equation, it has been proven (without recourse to symmetry arguments) that the time derivative of Fisher's information measure has a definite sign. Furthermore, the H-theorem satisfied by Fisher's information measure I is related to the time dependence of the Boltzmann–Gibbs entropy S, and the second time derivative of the Boltzmann–Gibbs entropy, d²S/dt², has a definite sign. This latter property allows for the determination of simple bounds on the Boltzmann entropy. In particular, S(t) increases at most linearly with time. The time behavior of Fisher's information measure can also be related to that of Kullback's relative information measure, under certain circumstances.

The Cramer–Rao bound and its relation to the Fisher entropy (Frieden, 1989) may simply be understood in the following manner. Consider an isolated many-particle system that is specified by a physical parameter x (position coordinate, velocity, etc.) and let P(x) describe the probability density for this parameter. Suppose the probability density is unknown, and one wishes to determine it. A measurement of the parameter gives y,

$$y = \eta + z \tag{1}$$

where η is the ensemble mean value for the parameter x and z is the random noise of the measurement. From this isolated measurement one forms the estimate η_est = η_est(y). Estimation theory asserts (Frieden, 1989) that the best possible estimator, η_est(y), after a very large number of samples is examined, is called the efficient estimator. Any other estimator must have a larger mean-square error. The only proviso to the above result is that all estimators be unbiased. The efficient estimator has a mean-square error, e², with respect to η that obeys a relationship involving Fisher's information measure, I, namely,

$$e^2 = \frac{1}{I} \tag{2}$$
where the information measure is

$$I = \int dx \left( \frac{\partial P(x)}{\partial x} \right)^2 \frac{1}{P(x)} \tag{3}$$

or

$$I = \left\langle \left[ \frac{1}{P(x)}\, \frac{\partial P(x)}{\partial x} \right]^2 \right\rangle \tag{4}$$

and all estimators are unbiased, i.e., satisfy

$$\langle \eta_{est}(y) \rangle = \eta. \tag{5}$$
Thus, Fisher’s information measure has a lower bound, in the sense that, no matter what parameter of the system we choose to measure, I has to be
larger than or equal to the inverse of the mean-square error associated with the concomitant experiment. This result, i.e.,

$$I \geq \frac{1}{e^2}, \tag{6}$$

is referred to as the Cramer–Rao bound (see Appendix).
Source: Eq. (6) was reproduced from Frieden, B.R., 1989. Fisher information as the basis for the Schrödinger wave equation. Am. J. Phys. 57 (11), 1004, with the permission of the American Association of Physics Teachers.

One of the most interesting applications of the Fisher entropy and the Cramer–Rao bound has been given by Frieden (1989). Assuming fluctuations of a particle's position that are smooth and broad, to the extent that the best estimate of the true position must exhibit a maximum mean-square error, directly implies the time-independent Schrödinger wave equation for the probability amplitude for the position x of the particle. Consider the average kinetic energy ⟨KE⟩ of a particle acting under the influence of a static potential, V(x), which must be positive:

$$\langle KE \rangle = \int P(x)\, [E - V(x)]\, dx > 0 \tag{7}$$

where E is the total mean energy and P(x) is the probability distribution of finding the particle at position x. To enforce this inequality, introduce a new merit function

$$\int dx \left( \frac{dP(x)}{dx} \right)^2 \frac{1}{P(x)} + \lambda \int dx\, P(x)\, [E - V(x)] = \min. \tag{8}$$

Choosing λ to be a suitable negative constant (λ = −2m/ħ², where m is the particle mass and ħ is Planck's constant divided by 2π), whose magnitude will minimize I and maximize ⟨KE⟩, will also satisfy the requirement that the latter quantity be positive. Setting

$$P(x) = \Phi(x)^2 \tag{9}$$
(Φ(x) is assumed to be real, consistently with the fact that, as we shall see, it satisfies the time-independent Schrödinger equation associated with the potential function V(x)) yields

$$\int dx \left( \frac{d\Phi(x)}{dx} \right)^2 + \lambda \int dx\, \Phi(x)^2\, [E - V(x)] = \min. \tag{10}$$

Using the Euler–Lagrange equation

$$\frac{d}{dx}\, \frac{\partial L}{\partial \Phi'} = \frac{\partial L}{\partial \Phi} \tag{11}$$

with

$$L = \Phi'^2 + \lambda \Phi^2\, [E - V(x)] \tag{12}$$

yields the time-independent Schrödinger equation
$$\Phi''(x) - \lambda \Phi(x)\, [E - V(x)] = 0. \tag{13}$$

This solution minimizes Eq. (10) since

$$\frac{\partial^2 L}{\partial (\Phi')^2} > 0. \tag{14}$$
Furthermore, it can be shown that the absolute minimum (or bound) of the Fisher entropy is given by the lowest eigensolution. This approach is interesting in that it suggests a kind of uncertainty principle on locating the true position of the particle based strictly on statistical measures. Next, consider a probability distribution function, P(x, t), whose evolution is determined by the diffusion equation (Plastino et al., 1997).
2 Diffusion equation

$$\frac{\partial P}{\partial t} = \frac{\partial^2 P}{\partial x^2} \tag{15}$$

where P and ∂P/∂x vanish at the boundaries. The time derivative of I is given by

$$\frac{dI}{dt} = -\int \frac{1}{P^2} \left( \frac{\partial P}{\partial x} \right)^2 \frac{\partial P}{\partial t}\, dx + \int \frac{2}{P}\, \frac{\partial P}{\partial x}\, \frac{\partial^2 P}{\partial x\, \partial t}\, dx, \tag{16}$$
which for the case of the diffusion equation (15), and after an integration by parts of the second term on the right-hand side, yields

$$\frac{dI}{dt} = \int \frac{1}{P^2} \left( \frac{\partial P}{\partial x} \right)^2 \frac{\partial^2 P}{\partial x^2}\, dx - 2 \int \frac{1}{P} \left( \frac{\partial^2 P}{\partial x^2} \right)^2 dx. \tag{17}$$

Integration by parts of the first term yields

$$\frac{dI}{dt} = 2 \int \frac{1}{P^3} \left( \frac{\partial P}{\partial x} \right)^4 dx - 2 \int \frac{1}{P^2} \left( \frac{\partial P}{\partial x} \right)^2 \frac{\partial^2 P}{\partial x^2}\, dx - 2 \int \frac{1}{P} \left( \frac{\partial^2 P}{\partial x^2} \right)^2 dx, \tag{18}$$

which after another integration by parts (in the second term of the right-hand side of (18)) yields

$$\frac{dI}{dt} = -2 \int P \left[ \frac{1}{P}\, \frac{\partial^2 P}{\partial x^2} - \frac{1}{P^2} \left( \frac{\partial P}{\partial x} \right)^2 \right]^2 dx, \tag{19}$$

or

$$\frac{dI}{dt} = -2 \left\langle \left( \frac{\partial^2 (\ln P)}{\partial x^2} \right)^2 \right\rangle. \tag{20}$$
In all the steps involving integration by parts it is assumed that, when |x| → ∞, P → 0 and ∂P/∂x → 0 fast enough for the boundary terms resulting from the integration by parts to vanish.
It follows from the above calculations that, for the diffusion equation, I satisfies an H-theorem since

$$\frac{dI}{dt} < 0. \tag{21}$$
For the diffusion equation (15) it is easy to show for the Boltzmann–Gibbs entropy

$$S = -\int P \ln P\, dx \tag{22}$$

that the time derivative of S is given by

$$\frac{dS}{dt} = \int \frac{1}{P} \left( \frac{\partial P}{\partial x} \right)^2 dx = I \geq 0, \tag{23}$$

and that the second derivative of S has a definite sign:

$$\frac{d^2 S}{dt^2} = \frac{dI}{dt} \leq 0. \tag{24}$$
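Equations (21)–(24) are easy to check numerically on the Gaussian (heat-kernel) solution of Eq. (15), for which I(t) = 1/(2t) in closed form; the sketch below confirms dS/dt = I and dI/dt < 0 by finite differences.

```python
import numpy as np

# Heat-kernel solution of Eq. (15): P(x, t) = exp(-x^2/4t) / sqrt(4 pi t).
x = np.linspace(-60.0, 60.0, 120001)

def P(t):
    return np.exp(-x**2 / (4.0 * t)) / np.sqrt(4.0 * np.pi * t)

def fisher(t):
    p = P(t)
    px = np.gradient(p, x)
    return np.trapz(px**2 / p, x)          # Fisher information, Eq. (3)

def entropy(t):
    p = P(t)
    return -np.trapz(p * np.log(p), x)     # Boltzmann-Gibbs entropy, Eq. (22)

t, h = 2.0, 1e-4
print("I(t) vs 1/(2t):", fisher(t), 1.0 / (2 * t))
print("dS/dt vs I   :", (entropy(t + h) - entropy(t - h)) / (2 * h), fisher(t))
print("dI/dt < 0    :", (fisher(t + h) - fisher(t - h)) / (2 * h) < 0.0)
```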
It is interesting to note that these bounds are different in character from those obtained in Brody and Meister (1995) and Plastino et al. (1996), where the bound on the time derivative of S is given by

$$\frac{dS}{dt} \leq \sqrt{\langle (j/P)^2 \rangle\, I}. \tag{25}$$

For the diffusion equation (15) one has

$$j = -\frac{\partial P}{\partial x} \tag{26}$$

and the bound simply reduces to

$$\frac{dS}{dt} \leq I. \tag{27}$$
A connection between the H-theorem fulfilled by the Fisher information I and the standard H-theorem satisfied by the Kullback relative information measure, K, exists (Risken, 1989). Assume that we have two different solutions P₁(x, t) and P₂(x, t) of the diffusion equation (15). The relative information entropy involving the two probability density functions is given by

$$K = \int P_1 \ln(P_1 / P_2)\, dx \tag{28}$$

where K can be understood as follows (Ravicule et al., 1997): assume that the total amount of available information concerning a given system S is called T. Associated with P₂ we have a Shannon missing information (Katz, 1967),
$$S(P_2) = -\int P_2 \ln(P_2)\, dx \tag{29}$$
which measures the difference between T and what we actually know once we have been able to ascertain that P₂ is our PDF (Katz, 1967). We consider now a new scenario in which further measurements are performed. From them we realize that P₁ is the PDF that better reflects our (new) state of knowledge. K measures the missing information that still remains in going from P₂ to P₁ (Ravicule et al., 1997). It is easy to show that the quantity K is a monotonically decreasing function of time and, consequently, satisfies an H-theorem as well, since

$$\frac{dK}{dt} = -\int P_1 \left[ \frac{1}{P_1}\, \frac{\partial P_1}{\partial x} - \frac{1}{P_2}\, \frac{\partial P_2}{\partial x} \right]^2 dx \leq 0. \tag{30}$$

Clearly the diffusion equation (15) possesses translational symmetry, since for a given solution P(x, t) and a given constant ε, the density P(x + ε, t) is also a solution. Suppose the two solutions are related by a simple translation:

$$P_2(x, t) = P_1(x + \varepsilon, t). \tag{31}$$

Expanding K in powers of ε yields

$$K \approx \frac{1}{2}\, \varepsilon^2 I, \tag{32}$$
to second order in ε. This establishes a connection between the time behavior of K and that of I, which has been obtained in a different context by Vstovsky (1995). It has also been shown that, for a system obeying conservation of flow, either classical or quantum mechanical, the maximal rate of change of the Shannon–Boltzmann entropy as a function of time is limited by the Fisher entropy (Plastino and Plastino, 1995a).
3 Connection with Tsallis statistics

In the case of Tsallis nonextensive thermostatistics (Tsallis, 1988, 2009),

$$S_q = (q-1)^{-1} \sum_{i=1}^{W} (p_i - p_i^q) \tag{33}$$

where S_q is the entropy, p_i is the probability of the ith microstate among the W different microstates (Σ_i p_i = 1), and q is a real parameter, the Cramer–Rao bound can be generalized (Plastino et al., 1997b). Note that for q = 1 one recovers the extensive Boltzmann–Shannon entropy, and that an experimental measurement of an observable A whose expectation value in a microstate i is a_i is given by

$$\langle A \rangle_q = \sum_{i=1}^{W} p_i^q\, a_i. \tag{34}$$

A suitable generalization of (4) is given by
$$I_q = \left\langle \left( \frac{1}{P}\, \frac{dP}{dx} \right)^2 \right\rangle_q, \tag{35}$$
where I_q is the generalized Fisher entropy and

$$\langle \eta_{est}(y) \rangle_q = \eta, \tag{36}$$

or

$$\int dy\, [P_\eta(y)]^q\, [\eta_{est}(y) - \eta] = 0, \tag{37}$$

is the best estimator in this case. Differentiating Eq. (37) with respect to η yields

$$\int dy\, \frac{\partial}{\partial \eta} \left\{ [P_\eta(y)]^q\, [\eta_{est}(y) - \eta] \right\} = qJ - Q, \tag{38}$$

with

$$J = \int dy\, P^{q-1}\, \frac{\partial P}{\partial \eta}\, [\eta_{est}(y) - \eta], \tag{39}$$

$$Q = \int dy\, P^q, \tag{40}$$

and

$$qJ - Q = 0. \tag{41}$$
Recasting J as

$$J = \int dy\, P^q\, \frac{\partial \ln P}{\partial \eta}\, [\eta_{est}(y) - \eta] \tag{42}$$

and squaring,

$$J^2 = \left( \int dy\, P^{q/2}\, \frac{\partial \ln P}{\partial \eta}\, P^{q/2}\, [\eta_{est}(y) - \eta] \right)^2, \tag{43}$$

which from Schwartz's inequality yields

$$J^2 \leq \int dy\, P^q \left( \frac{\partial \ln P}{\partial \eta} \right)^2 \int dy\, P^q\, [\eta_{est}(y) - \eta]^2. \tag{44}$$

Setting P(y) = P(y − η) and changing the integration variable to u = y − η yields

$$J^2 \leq e_q^2\, I_q, \tag{45}$$

with e_q² = ⟨[η_est(y) − η]²⟩_q. Combined with Eq. (41), this yields the nonextensive thermostatistics equivalent of the Cramer–Rao bound
$$I_q\, e_q^2 \geq \frac{Q^2}{q^2}. \tag{46}$$
Note that the integration bounds in (40) are infinite. However, only for q > 0 is the convergence of the integral guaranteed. As an example, consider a Gaussian probability

$$P(x) = \frac{1}{\sqrt{2\pi}\, \sigma}\, \exp(-x^2 / 2\sigma^2). \tag{47}$$

In this case

$$Q = \frac{(\sqrt{2\pi}\, \sigma)^{1-q}}{\sqrt{q}}, \tag{48}$$

with

$$e_q^2 = \frac{Q}{q}\, \sigma^2, \tag{49}$$

and

$$I_q = \frac{e_q^2}{\sigma^4}, \tag{50}$$

which tends toward the correct limit I = σ⁻² for q → 1.
For systems described by a general continuity equation, a bound on the Tsallis entropy increase exists and is given by (Plastino et al., 1997b)

$$\frac{dS_q}{dt} \leq q \sqrt{\left\langle (j/P)^2 \right\rangle_q\, I_q}, \tag{51}$$

where j is the current which obeys the continuity equation. For q → 1 one recovers

$$\frac{dS}{dt} \leq \sqrt{\left\langle (j/P)^2 \right\rangle\, I}, \tag{52}$$

which is the bound found in Brody and Meister (1995) and Plastino and Plastino (1995a).

Systems described by nonlinear Fokker–Planck equations (Frank, 2005) constitute a promising field of enquiry regarding the relationships between Fisher information, its generalizations, and the S_q entropy. It was shown in Plastino and Plastino (1995b) that Fokker–Planck equations with a power-law nonlinearity in the Laplacian term are closely related to the S_q entropies. The theoretical links between the nonlinear Fokker–Planck dynamics and the S_q-based thermostatistics have been the focus of considerable interest in recent years, and their implications have been investigated intensively (see Conroy and Miller, 2015; Lenzi et al., 2019; Souza et al., 2018; Troncoso et al., 2007
and references therein; this is only a very incomplete sample of the works devoted to this subject). These developments are central to some of the applications of the S_q-thermostatistics that are best understood (especially from the analytical point of view) and rest on the firmest theoretical basis. It was pointed out in Plastino and Plastino (1995b) that there is a relation linking the nonlinear Fokker–Planck equation, the S_q entropy, and the Fisher information. In one spatial dimension, the power-law nonlinear Fokker–Planck equation governing the evolution of a time-dependent density P(x, t) reads

$$\frac{\partial P}{\partial t} = \frac{D}{2}\, \frac{\partial^2}{\partial x^2}\, P^{2-q} - \frac{\partial}{\partial x}\, (K(x) P), \tag{53}$$
where D is the diffusion constant, K(x) is the drift force, and q < 2 is a real parameter characterizing the nonlinearity in the Laplacian term. As already mentioned, the evolution Eq. (53) is closely related with both the Tsallis entropy S_q = ∫(P − P^q) dx/(q − 1) and the Fisher information measure I (Plastino and Plastino, 1995b). Indeed, one has that

$$\frac{dS_q}{dt} = \left\langle \frac{dK}{dx} \right\rangle_q + \frac{1}{2}\, q(2-q)\, D \int P^q \left[ \frac{1}{P}\, \frac{\partial P}{\partial x} \right]^2 dx. \tag{54}$$

However, the theoretical implications of the links connecting the nonlinear Fokker–Planck equations with the Fisher information remain largely unexplored. Some work has been done in this direction (see, for example, Bercher, 2013; Ubriaco, 2009; Yamano, 2013 and references therein). A nonlinear diffusion equation based on Fisher information was proposed in Ubriaco (2009). Some aspects of the connection between nonlinear Fokker–Planck equations and Fisher information were investigated in Yamano (2013). The connection between Fisher information and the time derivative of S_q (Plastino and Plastino, 1995b) was revisited in Bercher (2013), and discussed in terms of extended de Bruijn identities. These efforts suggest that the study of Fisher information (or suitable generalizations) in relation with complex systems obeying a nonlinear Fokker–Planck dynamics may still lead to interesting insights. These matters certainly deserve further exploration.
4 Conclusions

Summing up, we provided a brief review of some of our previous work on the physical applications of Fisher information and the Cramer–Rao inequality. We discussed, in particular, the relation between the Fisher measure and the time dependence of the Boltzmann–Gibbs entropy of a time-dependent probability density obeying a continuity equation. Probability densities governed by continuity equations are common in physics. As examples we can mention those obeying the diffusion, the Fokker–Planck, or the Liouville equations. For these types of equations, Fisher information allows for the formulation of an upper bound on the absolute value of the time derivative of the Boltzmann entropy. In the case of the diffusion equation, the time derivative of
the Boltzmann entropy is directly given by the Fisher information. Moreover, for the diffusion equation the Fisher information itself satisfies an H-theorem, since its time derivative is always nonpositive. These results have been partially extended to the study of the time dependence of the S_q entropies, and to problems involving nonlinear diffusion. In particular, for time-dependent solutions of the nonlinear, power-law diffusion equation it was found that there is an S_q entropy such that its time derivative is given by the Fisher measure. In spite of some interesting developments, the connections between Fisher information and systems described by nonlinear continuity equations remain largely unexplored. Any new results along these lines would certainly be welcome.
Appendix

A.1 The Cramer–Rao bound (Frieden, 1989)

For the sake of completeness, we provide here a derivation of the Cramer–Rao bound. This derivation follows the standard lines of those presented in texts dealing with the application of Fisher information to the natural sciences (Frieden, 1989). A brief, intuitive discussion of the Cramer–Rao bound and its significance can be found in Plastino and Plastino (2020).

The Cramer–Rao bound establishes basic limitations on the ability to estimate a parameter η characterizing a parameterized family of probability densities P. Let η̂ be an unbiased estimate of η based on an observation y. This defines an estimation rule. Hence an average over all observations is given by

$$\langle \hat{\eta}(y) \rangle = \eta. \tag{A.1}$$

Therefore the variance of the error in η must be bounded from below. The Cramer–Rao bound is given by an inequality relating the Fisher information,

$$I = \int dy\, \frac{1}{P} \left( \frac{\partial P}{\partial \eta} \right)^2, \tag{A.2}$$

with the mean squared error

$$e^2 = \int dy\, P(y)\, (\hat{\eta}(y) - \eta)^2. \tag{A.3}$$
Let P(y) represent the probability density of an observation y if η is present. Then

$$\langle \hat{\eta}(y) - \eta \rangle = \int_{-\infty}^{\infty} dy\, P(y)\, (\hat{\eta}(y) - \eta) = 0. \tag{A.4}$$

Differentiating with respect to η, and making use of the fact that
$$\int dy\, P(y) = 1 \tag{A.5}$$

and

$$\frac{\partial P(y)}{\partial \eta} = \frac{\partial \ln P(y)}{\partial \eta}\, P(y), \tag{A.6}$$

yields

$$\int dy\, \frac{\partial \ln P(y)}{\partial \eta}\, P(y)\, (\hat{\eta}(y) - \eta) = 1. \tag{A.7}$$
Factorizing and squaring yields

$$\left[ \int dy\, \frac{\partial \ln P(y)}{\partial \eta}\, \sqrt{P(y)}\, \sqrt{P(y)}\, (\hat{\eta}(y) - \eta) \right]^2 = 1, \tag{A.8}$$

and by Schwartz's inequality

$$\left[ \int dy\, \frac{\partial \ln P(y)}{\partial \eta}\, \sqrt{P(y)}\, \sqrt{P(y)}\, (\hat{\eta}(y) - \eta) \right]^2 \leq \int dy \left[ \frac{\partial \ln P(y)}{\partial \eta}\, \sqrt{P(y)} \right]^2 \int dy \left[ \sqrt{P(y)}\, (\hat{\eta}(y) - \eta) \right]^2. \tag{A.9}$$
Using Eq. (A.8) it follows that

$$e^2 \geq 1/I, \tag{A.10}$$

where e² is the mean squared error (A.3) due to the estimate η̂ and I is the Fisher entropy given in Eq. (4).
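A simple Monte Carlo illustration of (A.10) (ours, not from the appendix): for a single observation y drawn from a Gaussian of known variance, the Fisher information of the location family is I = 1/σ², and the unbiased estimator η̂(y) = y attains the bound with equality.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, sigma = 1.3, 0.7                    # illustrative true parameter and noise
y = rng.normal(eta, sigma, size=1_000_000)

e2 = np.mean((y - eta) ** 2)             # mean squared error of eta_hat = y, Eq. (A.3)
I = 1.0 / sigma**2                       # Fisher information of the Gaussian location family

print("e^2 =", e2, " 1/I =", 1.0 / I)    # e^2 >= 1/I, with equality here
```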
References

Bercher, J.F., 2013. Physica A 392, 3140.
Brody, D., Meister, B., 1995. Phys. Lett. A 204, 93–98.
Conroy, J.M., Miller, H.G., 2015. Phys. Rev. E 91, 052112.
Dehesa, J.S., Plastino, A.R., Sanchez-Moreno, P., Vignat, C., 2012. Appl. Math. Lett. 25, 1689.
Fisher, R.A., 1925. Math. Proc. Cambridge Philos. Soc. 22, 700.
Flego, S.P., Frieden, B.R., Plastino, A., Plastino, A.R., Soffer, B.H., 2003. Phys. Rev. E 68, 016105.
Frank, T.D., 2005. Nonlinear Fokker-Planck Equations: Fundamentals and Applications. Springer, Berlin.
Frieden, B.R., 1989. Fisher information as the basis for the Schrödinger wave equation. Am. J. Phys. 57 (11), 1004.
Frieden, B.R., 1990. Phys. Rev. A 41, 4265.
Frieden, B.R., 1992. Physica A 180, 359.
Frieden, B.R., 1998. Physics From Fisher Information: A Unification. Cambridge University Press, Cambridge.
Frieden, B.R., 2004. Science From Fisher Information: A Unification. Cambridge University Press, Cambridge.
Frieden, B.R., Soffer, B.H., 1995. Phys. Rev. E 52, 2274.
Katz, A., 1967. Principles of Statistical Mechanics. Freeman, San Francisco.
Lenzi, E.K., Lenzi, M.K., Ribeiro, R.V., Evangelista, L.R., 2019. Proc. R. Soc. A 475, 20190432.
Plastino, A.R., Plastino, A., 1995a. Phys. Rev. E 52, 4580.
Plastino, A.R., Plastino, A., 1995b. Physica A 222, 347.
Plastino, A.R., Plastino, A., 1996. Phys. Rev. E 54, 4423.
Plastino, A.R., Plastino, A., 2020. Significance 17, 39.
Plastino, A., Plastino, A.R., Miller, H.G., Khanna, F., 1996. Phys. Lett. A 221, 29.
Plastino, A.R., Plastino, A., Miller, H.G., 1997. Phys. Lett. A 235, 129.
Plastino, A., Plastino, A.R., Miller, H.G., 1997b. Physica A 225, 577.
Rao, C.R., 1945. Bull. Calcutta Math. Soc. 37, 81–91.
Ravicule, M., Casas, M., Plastino, A., 1997. Phys. Rev. A 55, 1695.
Risken, H., 1989. The Fokker-Planck Equation. Springer, New York.
Souza, A.M.C., Andrade, R.F.S., Nobre, F.D., Curado, M.F.E., 2018. Physica A 491, 153.
Troncoso, P., Fierro, O., Curilef, S., Plastino, A.R., 2007. Physica A 375, 457.
Tsallis, C., 1988. J. Stat. Phys. 52, 479.
Tsallis, C., 2009. Introduction to Nonextensive Statistical Mechanics. Springer, New York.
Ubriaco, M.R., 2009. Phys. Lett. A 373, 4017.
Vstovsky, G.V., 1995. Phys. Rev. E 51, 975.
Yamano, T., 2013. Cent. Eur. J. Phys. 11, 910.
Index
Note: Page numbers followed by “f ” indicate figures.
A Airy function, 172–174 Akaike’s information criterion, 18–19 α-Cramer–Rao inequality, 88–95 Amari–Nagaoka’s framework, 84 Asymptotically unbiased estimator, 153
B Barankin bound, 98–101 Bayesian α-Cramer–Rao inequality, 101–106 Bayesian Cramer–Rao inequality, 98–101 BG orthodox statistical mechanics framework (BGOSMF), 72 Black–Scholes option-pricing equation, 162, 169 Boltzmann–Gibbs entropy, 6–12, 217–218, 221 Boltzmann–Gibbs (BG) statistics, 68, 70–73 Boltzmann–Shannon entropy, 18–19, 222 Box–Cox transformation, 194 Bregman divergence, 84–85, 199–207 Brownian motion, 65–67, 161–162, 217–218 Brownian particle, 10–11
C Canonical parameter, 21–22 Cauchy–Riemann equations, 45 Cauchy Schwarz inequality, 153–154 c-Duality, 193, 201 Classical Cramer–Rao inequality, 85–86 Conformal Hessian geometry, 191, 206–207 Conformal mapping, 46, 49–53 Constant curvature, 192 Cramer–Rao bound, 153–154, 162, 171–174, 218, 222–224, 226–227 Cramer–Rao inequality (CRI), 3–6, 12–13, 64, 81–82, 132 Brownian motion, 67, 67f generalized version, 95–97 physical applications, 217–218, 225–226 scale-invariance, 179 Tsallis’ gravity treatment, 77
Zipf’s law, 179–181 Cramer–Rao lower bound (CRLB), 81–82 Cross-entropy, 18–19, 22–24, 83 Csisza´r divergence, 82, 84–85 Cumulant function, 21–22
D Deformation models, 189–191 Deformed exponential families, 191–192, 194–195 Deformed probability families, 191–192 Density distribution (DD), 58 Derivative securities, 168–171 Diffeomorphism, 46 Diffusion equation, 220–221, 225–226 Dimensional regularization approach (DRA), 58–60 Discount factor (DF), 164–167 Divergent integrals, 58–59 Dualistic Riemann space, 21, 30–31
E Efficient estimator, 154, 218 e-geodesic curvature, 16–19, 22 Eguchi’s theory, 84 Entanglement of formation, 151–152, 160 Entropy’s rate of change, 4–6, 10–13 Fisher information, 6–12 Estimation theory, 218 Euclidean space, 31–32 Euler–Lagrange equation, 5, 22–24, 28–29, 219–220 Exponential families of probability distributions, 187–188 Exponential model, 22
F Financial economics, 164–171 Fisher information, 3–12, 162, 168–169, 225–227 Kantian view, 135–136
multiverse from, 124–125 natural selection, 121–123 Fisher information matrix (FIM), 19, 47, 80–82 Fisher information measure (FIM), 57–58, 62–63, 66–67, 180, 218–219 physical applications, 217–220, 225–226 Fisher metric, 79–82 Fisher–Rao metric, 187–188 Fokker–Planck equation, 6, 10, 12, 224–225 Foliation, 16–17, 23f, 30 Frieden–Soffer’s translational invariance, 180 Future cashflows, 164
G Gamma function, 60–61 Gaussian probability, 224 Gauss–Jordan elimination method, 31–32 Gauss–Markov theorem, 31–32 Gauss–Newton algorithm, 31–32 γ-deformation divisive approach, 195–196 Legendre duality, 201–202 normalizations, 196–197 subtractive approach, 194–195 Generalized linear model, 35–40 Generalized Pareto distribution, 28–29 Gibbs’ canonical ensemble, 58 Gibbs probability density, 57 γ-power cross entropy, 25–26 γ-power divergence, 25–27 γ-power entropy, 25–26 γ-power estimator, 17 γ-power loss function, 27–28, 33–34 γ-power model, 29–30
Index
Information geometry, 16–17, 19–20, 22–24, 79–80, 86–97 Bayesian α-Cramer–Rao inequality, 101–106 Bayesian Cramer–Rao inequality, 98–101 α-Cramer–Rao inequality, 88–90 hybrid Cramer–Rao inequality, 106 λ-logarithmic divergence, 206–207 Interferometric power, 159–160
J Jacobian matrix, 44, 46
K Keynes’s macroeconomics, 161–162 Kronecker product, 150 Kullback–Leibler (KL) divergence, 16–19, 22, 79–82, 169, 171, 187–189 Kullback relative information measure, 221–222
L Lagrange multiplier, 166–167, 170–171, 180–181 Laurent’s expansion, 59–61 λ-deformation, 210–212 Least squares estimate (LSE), 17, 31–32 Lebesgue measure, 18 Lebesgue norm, 25–26 Legendre duality, 199–207 Linear connection, 24–26 Linear regression model, 31–35 Live-streaming with actual proportionality of objects (LAPO), 53–54 Locally unbiased estimator, 153 Log-likelihood function, 18–19, 22–24, 32
H Heaviside step function, 166–167 Heisenberg limit, 157–159, 158f Heisenberg uncertainty relation (HUR), 64 Hermitian operator, 155 Hessian manifold, 189 Hess matrix, 33–34 Hilbert space, 149 Hodgkin–Huxley effect, 136 H-theorem, 6, 217–218, 221–222 Hybrid CR inequality, 106
I Iα-divergence, 82–86 I-divergence, 82–86
M Macroeconomics, 171–174 Maxent modeling, 24 Maximum entropy method, 24, 30–31 Maximum Fisher information (MFI), 119, 132–135 cancer growth, 142 early covid-19 growth, 143–144 Maximum likelihood estimator (MLE), 16–20, 22–24, 32 Mean square error (MSE), 31–32, 153, 226–227 Mean values of x-powers, 66 Method of least squares, 16 m-geodesic curvature, 16–19, 21–22
231
Index
Minimum Fisher information principle (MFI), 181 Mixture families of probability distributions, 188–189 MLE leaf, 18–19, 22–24
N N-dimensional Fokker–Planck equation, 11–12 Neyman–Pearson lemma, 24–25
O Orthogonality condition, 27, 81–82, 81f
P Pareto distribution, 28–29 Partition function (PF), 57–58 Brownian motion, 65 evaluation, 59–61 Population spaces, 44 Power entropy, 24–31 Principle of minimum loss of Fisher information, 136–141 Principle of natural selection, 122–123 Probability density current, 7–8 Probability density function (PDF), 46–47, 65, 170–171, 170f Probability distribution function (PDF), 62–63, 220 Product states, 150 Pythagoras theorem, 16–24, 21f, 27, 80–81
Q q-exponential family, 195 Quantum correlations, 149–152, 160 estimation theory, 156–160 Quantum Cramer–Rao inequality, 85–86 Quantum Fisher information, 155 Quantum physics, 120–121
R Radon–Nikodym derivative, 187–188 Rank plot. See Zipf plot Rao distance, 43–44, 46–49 applications, 53–55 conformal mapping, 46, 49–53 differential manifolds, 44–46 Rate of change of entropy. See Entropy’s rate of change Renyi cross-entropy, 83
Renyi entropy, 83–84, 190 Riemannian metric, 19, 24–25, 46, 79–82 R programs, 43
S Scale-invariance, 179 Schroedinger equation, 5–6, 166–167, 180–181, 219–220 Schwartz’s inequality, 223, 226–227 Second-order efficient estimators, 20 Shannon Boltzmann entropy, 222 Shannon entropy, 62–63, 83, 162–163, 172–174, 189 Space of probability density/mass functions, 18 Standard quantum limit (SQL), 156–157, 156f Statistical manifold, 86 Statistical mechanics, 72 Statistical model, 18–19 Student t-distribution, 28–29 Successive universes, 126–128 Sufficient statistic, 22–24 Sundaresan’s divergence. See Iα-divergence Symmetric logarithmic derivative (SLD), 153–154
T Thermodynamics, 121 Tsallis’ entropy, 190, 192, 195, 210, 212–213, 224 Tsallis nonextensive thermostatistics, 222 Tsallis statistics (TS), 68–70
U U-divergence, 26
V Virtual tourism technology, 53–54 von Neumann entropy, 150
W Weighted likelihood equation, 27–28 Whittaker’s function, 59–60
Z Zipf plot, 181–183, 182f Zipf’s law, 179–181