English Pages 431 Year 2020
Ruslan L. Stratonovich
Theory of Information and its Value Edited by Roman V. Belavkin Panos M. Pardalos · Jose C. Principe
Editors Roman V. Belavkin Faculty of Science and Technology Middlesex University London, UK
Panos M. Pardalos Industrial and Systems Engineering University of Florida Gainesville, FL, USA
Jose C. Principe Electrical & Computer Engineering University of Florida Gainesville, FL, USA Author Ruslan L. Stratonovich (Deceased)
ISBN 978-3-030-22832-3 ISBN 978-3-030-22833-0 (eBook) https://doi.org/10.1007/978-3-030-22833-0 Mathematics Subject Classification: 94A17, 94A05, 60G35 © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
It would be impossible for us to start this book without mentioning the main achievements of its remarkable author, Professor Ruslan Leontievich Stratonovich (RLS or Ruslan later). He was a brilliant mathematician, probabilist and theoretical physicist, best known for the development of the symmetrized version of stochastic calculus (an alternative to Itô calculus), with stochastic differential equations and integrals now bearing his name. His unique and beautiful approach to stochastic processes was invented in the 1950s during the time of his doctoral work on the solution to the notorious nonlinear filtering problem. The importance of this work was immediately recognized by the great Andrei Kolmogorov, who invited Ruslan, then a graduate student, for a discussion of his first papers. This work was so far ahead of its time that its initial reception in the Soviet mathematical community was mixed, mainly due to misunderstandings of the differences between the Itô and Stratonovich approaches. These and perhaps other factors related to the Cold War obscured some of the achievements of Stratonovich’s early papers on optimal nonlinear filtering, which, apart from the general solution to the nonlinear filtering problem, also contained the equation for the Kalman–Bucy filter as a special (linear) case as well as the forward–backward procedures for computing posterior probabilities, which were later rediscovered in the theory of hidden Markov models. Nonetheless, the main papers were quickly translated into English, and thanks to the remarkable hard work of RLS, by 1966 he had already published two monographs: Topics in the Theory of Random Noise (see [54] for a recent reprint) and Conditional Markov Processes [52]. These books were also promptly translated and had a better reception in the West, with the latter book being edited by Richard Bellman. In 1968, Professor W. Murray Wonham wrote in a letter to RLS ‘Perhaps you are the prophet, who is honored everywhere except in his own land’. Despite the difficulties, the following ten years in the 1960s were very productive for Ruslan, who quickly became recognized as one of the top scientists in the world in his field. He became a Professor by 1969, a post he held at the Department of Physics for the rest of his life. At that time in the 1960s, he managed to form a group of young and talented graduate students including Grishanin, B. A., Sosulin, Yu. G., Kulman, N. K., Kolosov, G. E., Mamayev, D. D., Platonov, A. A. and Belavkin, V. P.
Fig. 1 Stratonovich R. L. (second left) with his group in front of Physics Department, Moscow State University, 1971. From left to right: Kulman, N. K., Stratonovich, R. L., Mamayev, D. D., Sosulin, Yu. G., Kolosov, G. E., Grishanin, B. A. Picture taken by Slava (V. P.) Belavkin
(Figure 1). These students began working under his supervision in completely new areas of information theory, stochastic and adaptive optimal control, cybernetics and quantum information. Ruslan was young and had a somewhat legendary reputation among students and colleagues at the university, so that many students aspired to work with him, even though he was known to be a very hard-working and demanding supervisor. At the same time, he treated his students as equals and gave them a lot of freedom. Many of his former students recall that Ruslan had an amazing gift for predicting the solution before it was derived. Sometimes, he surprised his colleagues by giving an answer to some difficult problem, and when they asked him how he obtained it, his answer was ‘just verify it yourself, and you will see it is correct’. Ruslan and his students spent a lot of their spare time together, either playing tennis in the summer or skiing in the winter, and they developed long-lasting friendships. In the mid-1960s, Ruslan read several specialist courses on various topics, including adaptive Bayesian inference, logic and information theory. The latter course included a lot of original material, which emphasized the connection between information theory and statistical thermodynamics and introduced the Value of Information, which he pioneered in his 1965 paper [47] and later developed together with his students (mainly Grishanin, B. A.). This course motivated Ruslan to write his third book called ‘Theory of Information’. Its first draft was ready in 1967, and it is remarkable to note that RLS even had some negotiations with Springer to publish the book in English, which unfortunately did not happen, perhaps due to some bureaucratic difficulties in the Soviet Union. The monograph was quite large and included parts on quantum information theory, which he developed together with his new student Slava (V. P.) Belavkin. Although there was an agreement to publish the book with a leading Soviet scientific publisher, the publication was delayed for some unexplained reasons. In the end, an abridged version of the book was published by a different publisher (Soviet Radio) in 1975 [53], which did not include the quantum information parts! Nonetheless, the book had become a classic even without having been translated into English. Several anecdotes suggest that the book was used in the West and discussed at seminars with the help of Russian-speaking graduate students. For example, Professor Richard Bucy used translated parts of the book in his seminars on information theory, and in the 1990s he even suggested that the book be published in English. In fact, in 1994 and 1995 Ruslan visited his former student and collaborator Slava Belavkin at the University of Nottingham, United Kingdom, who worked there at the Department of Mathematics (Figure 2). They had a plan to publish a second edition of the book together in English and to include the parts on quantum information. Because quantum information theory had progressed in the 1970s and 1980s, it was necessary to update the quantum parts of the manuscript, and this became the Achilles’ heel of their plan. During Ruslan’s visit, they spent more time working on joint papers, which seemed a more urgent matter.
Recollections of Roman Belavkin
I also visited my father (V.P. Belavkin) in Nottingham during the summer of 1994, and I remember very clearly how happy Ruslan was during that visit (Figure 3), especially at being able to mow the lawn in the backyard, a completely new experience for someone who had lived in a small Moscow flat all his life. Two years later, in January 1997, Ruslan died after catching the flu during the winter examinations at the Moscow State University. I went to his funeral at the Department of Physics, from which I too had already graduated. It was a very big and sad event attended by a crowd of students and colleagues. In the next couple of years, my father collaborated with Valentina, Ruslan’s wife, on an English translation of the book, the first version of which was in fact finished. Valentina too passed away two years later, and my father never finished this project. Before telling the reader how the translation of this book eventually came about, I would like to write a few words about this book from my personal experience, how it became one of my favourite texts on information theory, and why I believe it is so relevant today. Having witnessed first-hand the development of quantum information and filtering theory in the 1980s (my father’s study in our small Moscow flat was also my bedroom), I decided that my career could do without non-commutative probability and stochastics. So, although I graduated from the same department as my father
Fig. 2 Stratonovich with his wife during their visit to Nottingham, England, 1994. From left to right: Robin Hudson, Slava Belavkin, Ruslan Stratonovich, Valentina Stratonovich, Nadezda Belavkina
and Ruslan, I became interested in Artificial Intelligence (AI), and a couple of years later I managed to get a scholarship to do a PhD in cognitive modelling of human learning at the University of Nottingham. I was fortunate enough to be in the same city as my parents, which allowed me to take a cycle ride through Wollaton Park and visit them either for lunch or dinner. Although we often had scientific discussions, I felt comfortable that my area of research was far away and independent of my father’s territory. That, however, turned out to be a false impression. During that time at the end of the 1990s, I came across many heuristics and learning algorithms using randomization in the form of the so-called soft-max rule, where decisions were sampled from a Boltzmann distribution with a temperature parameter controlling how much randomization was necessary. And although using these heuristics had clearly improved the performance of the algorithms and cognitive models, I was puzzled by these links with statistical physics and thermodynamics.
Fig. 3 Stratonovich with his wife at Belavkins’ home in Nottingham, England, 1994. From left to right: Nadezda Belavkina, Slava Belavkin, Roman Belavkin, Ruslan Stratonovich and Valentina Stratonovich
The fact that it was more than just a coincidence became clear when I saw that the performance of cognitive models could be improved by relating the temperature parameter dynamically to entropy. Of course, I could not help sharing these naive ideas with my father, and to my surprise he did not criticize them. Instead, he went to his study and brought an old copy of Ruslan’s Theory of Information. I spent the next few days going through various chapters of the book, and I was immediately impressed by the self-contained, and at the same time, very detailed and deep style of the presentation. Ruslan managed to start each chapter with basic and fundamental ideas, supported by very understandable examples, and then developed the material to such depth and detail that no questions seemed to remain unanswered. However, the main value of the book was in the ideas unifying theories of information, optimization and statistical physics. My main focus was on Chapters 3 and 9, which covered variational problems leading to optimal solutions in the form of exponential family distributions (the ‘soft-max’), defined and developed the value of information theory and explored many interesting examples. The value of information is an amalgamation of theories of optimal statistical decisions and information, and its applications go far beyond problems of information transmission. For example, the relation to machine learning and cognitive modelling was immediately clear: learning, from a mathematical point of view, was simply an optimization problem with information constraints (otherwise, there is nothing to learn), and a solution to such a problem could only be a randomized policy, where randomization was the consequence of incomplete information. Furthermore, the temperature parameter was simply the Lagrange multiplier defined by the constraint, which also meant that an optimal temperature could be derived (at least in theory), giving the solution to the notorious ‘exploration-exploitation’ dilemma in reinforcement learning theory. A few years later, I applied
these ideas to evolutionary systems and derived optimal control strategies for mutation rates in genetic algorithms (controlling randomization of DNA sequences). Similar applications can be developed to control learning rates in artificial neural networks and other data analysis algorithms.
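The soft-max rule mentioned above can be illustrated with a minimal sketch. It is not taken from the book or from any of the models discussed here: the function names and the utility values are hypothetical, chosen only to show how the temperature parameter of a Boltzmann distribution controls the amount of randomization, with entropy as its measure.

```python
import math
import random


def softmax_probabilities(utilities, temperature):
    """Boltzmann (soft-max) distribution: p_i is proportional to exp(u_i / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp((u - top) / temperature) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probabilities):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def sample_action(utilities, temperature, rng=random):
    """Draw one action index from the soft-max distribution."""
    probs = softmax_probabilities(utilities, temperature)
    return rng.choices(range(len(utilities)), weights=probs, k=1)[0]


# Hypothetical utilities of three actions (illustrative values only).
utilities = [1.0, 2.0, 3.0]
cold = softmax_probabilities(utilities, 0.1)    # low temperature: nearly greedy
hot = softmax_probabilities(utilities, 100.0)   # high temperature: nearly uniform
```

At low temperature almost all probability goes to the best action, so the entropy is close to zero. At high temperature the distribution approaches the uniform one and the entropy approaches its maximum, the logarithm of the number of actions.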
History of this translation
This publication is a new translation of the 1975 book, which incorporates some parts from the original translation by Ruslan’s wife, Valentina Stratonovich. The publication has become possible thanks to the initiatives of Professors Panos Pardalos and Jose Principe. The collaboration was initiated at the ‘First International Conference on Dynamics of Information Systems’ organized by Panos at the University of Florida in 2009 [22]. It is fair to say that at that time it was the only conference dedicated not only to traditional information-theoretic aspects of data and systems analysis, but also to the importance of analysing and understanding the value of information. Another very important achievement of this conference was the first attempt to develop a geometric approach to the value of information, which is why one of the invited speakers at the conference was Professor Shun-ichi Amari. It was at this conference that the editors of this book first met together. Panos, who by that time was the author and editor of dozens of books on global optimization and data science, expressed his amazement at the unfortunate fact that this book was still not available in English. Jose Principe, known for his pioneering work on information-theoretic learning, had already recognized the importance and relevance of this book to modern applications and was planning the translation of specific chapters. It was clear that there was a huge interest in the topic of value of information, and we began discussing the possibility of making a new English translation of this classic book, and finishing the project, which unfortunately was never completed by Ruslan Stratonovich and Slava Belavkin. Panos suggested that Vladimir Stozhkov, one of his Russian-speaking PhD students, should do the initial translation. Vladimir took on the bulk of this work. The equations for each chapter were coordinated by Matt Emigh and entered in LaTeX by students and visitors in the Department of Computational Neuro-Engineering Laboratory (CNEL), University of Florida, during the Summer and Fall of 2016 as follows: Sections 1.1–1.5 by Carlos Loza, 1.6–1.7 by Ryan Burt, 2.1–3.4 by Ying Ma, 3.5–4.3 by Zheng Cao, 4.4–5.3, 6.1–6.5 and 8.6–8.8 by Isaac Sledge, 5.4–5.7 by Catia Silva, 5.8–5.11 by John Henning, 6.6–6.7 by Eder Santana, 7.3–7.5 by Paulo Scalassara, 7.6–8.5 by Shulian Yu and Chapters 9–12 by Matt Emigh. This translation and the equations were then edited by Roman Belavkin, who also combined them with the translation by Valentina Stratonovich in order to achieve a better reflection of the original text and terminology. In particular, the introductory paragraphs of each chapter are largely based on Valentina’s translation. We would like to take the opportunity to thank Springer, and specifically Razia Amzad and Elizabeth Loew, for making the publication of this book possible. We
also acknowledge the help of the Laboratory of Algorithms and Technologies for Network Analysis (LATNA) at the Higher School of Economics in Nizhny Novgorod, Russia, for facilitating meetings and collaborations among the editors and accommodating many fruitful discussions on the topic in this beautiful Russian city. With the emergence of the data-driven economy, progress in machine learning and AI algorithms and increased computational resources, the need for a better understanding of information, its value and limitations is greater than ever. This is why we believe this book is even more relevant today than when it was first published. The vast number of examples pertaining to all kinds of stochastic processes and problems makes it a treasure trove of information for any researcher working in the areas of data science or machine learning. It is a great pleasure to be able to contribute a little to this project and see that finally this amazing book will be open to the rest of the world.
Roman Belavkin, London, UK
Panos Pardalos, Gainesville, FL, USA
Jose Principe, Gainesville, FL, USA
Preface
This book was inspired by the author’s lectures on information theory in the Department of Physics at Moscow State University, 1963–1965. Initially, the book was written in accordance with the content of those lectures. The plan was to organize the book in order to reflect all the paramount achievements of Shannon’s information theory. However, while working on the book the author ‘lost his way’ and used a more familiar style, in which the development of his own ideas dominated over a thorough recollection of existing results. That led to the inclusion of original material in the book and to an original interpretation of many central constructs of the theory. The original material displaced some of the established results that were about to be included in the book. For instance, the chapter devoted to known established methods of encoding and decoding in noisy channels was discarded. The material included in the book is organized in three stages: the first, second and third variational problems and the first, second and third asymptotic theorems, respectively. This creates a clear panorama of the most fundamental content of Shannon’s information theory. Every writing style has its own benefits and disadvantages. The disadvantage of the chosen style is that the work of many scientists in the field of interest remains unreflected (or insufficiently reflected). This fact should not be regarded as an indication of insufficient respect for them. As a rule, an assessment of the material’s originality and the attribution of the author’s ownership of the results are not given. The only exception is made for a few explicit facts. The book adopts ‘the principle of increasing complexity of material’, which means that simpler and easily understood material is placed at the beginning of the book (as well as at the beginning of each chapter). The reader is not required to be familiar with more difficult and specific material situated towards the end of a chapter or of the book. This principle allows the inclusion of complicated material in the book, while not making it difficult for a fairly wide range of readers. The hope is that many readers will gain useful knowledge for themselves from the book. While considering general questions, the author tried to state results in the greatest possible generality. To achieve this, he often used the language of measure theory. For example, he utilized the notation P(dx) for probability distribution. This
should not scare off those who have not mastered the specified language. The point is that, omitting the details, a reader can always use a simple dictionary which converts those ‘intimidating’ terms into more familiar ones. For instance, ‘probability measure P(dx)’ can be treated as probability P(x) in the case of a discrete random variable or as the product p(x) dx in the case of a continuous random variable, where p(x) is a probability density function and $dx = dx_1 \ldots dx_r$ is a differential corresponding to a space of dimensionality r. Focusing on various readers, the author did not attach significance to consistency of terminology. He thought that it did not matter whether we say ‘expectation’ or ‘mean value’, ‘probabilistic measure’ or ‘distribution’. If we employ the apparatus of generalized functions, then there always exists a probability density and it can be used. Often there is no need to distinguish between the minimization signs min and inf. By this we mean that if the infimum is not attained within a considered region then we can always ‘turn on’ an absolutely standard ε-procedure and, as a matter of fact, nothing essential will change as a result. In the book, we pass from a discrete probabilistic space to a continuous probabilistic space in a free manner. The author tried to spare the reader any concern with inessential details and distractions from the main ideas. The general constructs of the theory are illustrated in the book by numerous cases and examples. Because the theory and the examples are of general importance, their statement does not require special radiophysics terminology. If a reader is interested in applying the stated material to the problem of message transmission through radio channels, he/she should fill the abstract concepts with radiophysical content. For instance, when considering noisy channels (Chapter 7), an input stochastic process $x = \{x_t\}$ should be treated as a variable informational parameter of the signal $s(t, x_t)$ emitted by a radio transmitter. Also, an output process $y = \{y_t\}$ should be treated as a signal at a receiver input. A proper concretization of concepts is needed for the application of any mathematical theory. The author expresses his gratitude to the first reader of this book, B.A. Grishanin, who rendered significant assistance while the book was being prepared for print, and to Professor B.I. Tikhonov for discussions involving a variety of subjects and a number of precious comments. Moscow, Russia
Ruslan L. Stratonovich
Introduction
The term ‘information’ mentioned in the title of the book is understood here not in the broad sense in which the word is understood by people working in the press, radio, media, but in the narrow scientific sense of Claude Shannon’s theory. In other words, the subject of this book is the special mathematical discipline, Shannon’s information theory, which can solve its own quite specific problems. This discipline consists of abstractly formulated theorems and results, which can have different specializations in various branches of knowledge. Information theory has numerous applications in the theory of message transmission in the presence of noise, the theory of recording and registering devices, mathematical linguistics and other sciences including genetics. Information theory, together with other mathematical disciplines, such as the theory of optimal statistical decisions, the theory of optimal control, the theory of algorithms and automata, game theory and so on, is a part of theoretical cybernetics, a discipline dealing with problems of control. Each of the above disciplines is an independent field of science. However, this does not mean that they are completely separated from each other and cannot be bridged. Undoubtedly, the emergence of complex theories is possible and probable, where concepts and results from different theories are combined and which interconnect distinct disciplines. The picture resembles trees in a forest: their trunks stand apart, but their crowns intertwine. At first, they grow independently, but then their twigs and branches intertwine making their new common crown. Of course, generally speaking, the statement about uniting different disciplines is just an assertion, but, in fact, the merging of some initially disconnected fields of science is now an actual fact. As is evident from a number of works and from this book, the following three disciplines are inosculating:
1. statistical thermodynamics as a mathematical theory
2. Shannon’s information theory
3. the theory of optimal statistical decisions (together with its multi-step or sequential variations, such as optimal filtering and dynamic programming).
This book will demonstrate that the three disciplines mentioned above are cemented by ‘thermodynamic’ methods with typical attributes such as ‘thermodynamic’ parameters and potentials, Legendre transforms, extremum distributions, and the asymptotic nature of the most important theorems. Statistical thermodynamics can be counted among the cybernetic disciplines only conditionally. However, in some problems of statistical thermodynamics, its cybernetic nature manifests itself quite clearly. It is sufficient to recall the second law of thermodynamics and ‘Maxwell’s demon’, which is a typical automaton converting information into physical entropy. Information is ‘fuel’ for perpetual motion of the second kind. These points will be discussed in Chapter 12. If we consider statistical thermodynamics as a cybernetic discipline, then L. Boltzmann and J. C. Maxwell should be called the first outstanding cyberneticists. It is important to bear in mind that the formula expressing entropy in terms of probabilities was introduced by L. Boltzmann, who also introduced the probability distribution that was the solution to the first variational problem (of course, it does not matter what we call the function in this formula, energy or cost function). During the emergence of Shannon’s information theory, the appearance of a well-known notion of thermodynamics, namely entropy, was regarded by some scientists as a curious coincidence, and little attention was given to the fact. It was thought that this entropy had nothing to do with physical entropy (despite the work of Maxwell’s demon). In this connection, we can recall the countless quotation marks around the word ‘entropy’ in the first edition of the collection of Shannon’s papers translated into Russian (the collection under the editorship of A.N. Zheleznov, Foreign Literature, 1953). I believe that now even terms such as ‘temperature’ in information theory can be written without quotation marks and understood merely as a parameter incorporated in the expression for the extreme distribution. Similar laws are valid both in information theory and statistical physics, and we can conditionally call them ‘thermodynamics’. At the beginning (from 1948 until 1959), only one ‘thermodynamic’ notion appeared in Shannon’s information theory: entropy. There seemed to be no room for energy and other analogous thermodynamic potentials in it. In that regard, the theory looked feeble in comparison with statistical thermodynamics. This, however, was short-lived. The situation changed when scientists realized that in applied information theory, regarded as the theory of signal transmission, the cost function was the analogue of energy, and risk or average cost was the analogue of average energy. It became evident that a number of main concepts and the relations between them are similar in the two disciplines. In particular, if the first variational problem is considered, then we can speak about a resemblance, an ‘isomorphism’, of the two theories. Mathematical relationships between the corresponding notions of both disciplines are the same, and they are the contents of a mathematical theory that is considered in this book. The content of information theory is not limited to the specified relations. Besides entropy, the theory contains other notions such as Shannon’s amount of information. In addition to the first variational problem, related to an extremum of entropy under fixed risk (i.e. energy), there are also possible variational problems, in
which entropy is replaced by Shannon’s amount of information. Therefore, the content of information theory is broader than the mathematical content of statistical thermodynamics. New variational problems reveal a remarkable analogy with the first variational problem. They contain the same ensemble of conjugate parameters and potentials, and the same formulae linking potentials via the Legendre transform. And this is no surprise. It is possible to show that all those phenomena emerge when considering any non-singular variational problem. Let us consider this subject more thoroughly. Suppose that at least two functionals $\Phi_1[P]$, $\Phi_2[P]$ of the argument (‘distribution’) $P$ are given, and it is required to find an extremum of one functional subject to a fixed value constraint on the second one, say, $\Phi_1[P] = \mathrm{extr}_P$ subject to $\Phi_2[P] = A$. Introducing the Lagrange multiplier $\alpha$, which serves as a ‘thermodynamic’ parameter canonically conjugate with parameter $A$, we study an extremum of the expression
$$K = \Phi_1[P] + \alpha \Phi_2[P] = \mathrm{extr}_P \tag{1}$$
under a fixed value of $\alpha$. The extreme value $K(\alpha) = \Phi_1[P_{\mathrm{extr}}] + \alpha \Phi_2[P_{\mathrm{extr}}] = \Phi_1 + \alpha A$ serves as a ‘thermodynamic’ potential since $dK = d\Phi_1 + \alpha \, d\Phi_2 + \Phi_2 \, d\alpha = \Phi_2 \, d\alpha$, i.e.
$$\frac{dK}{d\alpha} = A. \tag{2}$$
The relation used,
$$d\Phi_1 + \alpha \, d\Phi_2 \equiv [\Phi_1(P_{\mathrm{extr}} + \delta P) - \Phi_1(P_{\mathrm{extr}})] + \alpha [\Phi_2(P_{\mathrm{extr}} + \delta P) - \Phi_2(P_{\mathrm{extr}})] = 0,$$
follows from (1). Here, we take a partial variation $\delta P = P_{\mathrm{extr}}(\alpha + d\alpha) - P_{\mathrm{extr}}(\alpha)$ due to the increment of parameter $A$, i.e. parameter $\alpha$. The function $L(A) = K - \alpha A = K - \alpha \, (dK/d\alpha)$ is a potential conjugate by Legendre with $K(\alpha)$. Meanwhile, it follows from (2) that $dL(A)/dA = -\alpha$. These relations, which are common in thermodynamics, are apparently valid regardless of the nature of the functionals $\Phi_1$, $\Phi_2$ and the argument $P$. The results of information theory discussed in this book are related to the three variational problems whose solution yields a number of relations, parameters and potentials.
Variational problems play an important role in theoretical cybernetics because the latter is concerned mainly with optimal constructions and procedures. As can be seen from the book contents, these variational problems are also related to the major laws (theorems) having an asymptotic nature (i.e. valid for large composite systems). The first variational problem corresponds to stability of the canonical distribution, which is essential in statistical physics, as well as to asymptotic equivalence of constraints imposed on specific and mean values of a function.
The second variational problem is related to the famous result of Shannon about asymptotic zero probability of error for the transmission of messages through noisy channels. The third variational problem is connected with asymptotic equivalence of the values of information of Shannon’s and Hartley’s type. The latter results are a splendid example of the unity of the discrete and continuous worlds. They are also a good example of how, when the complexity of a discrete system grows, it becomes convenient to describe it by continuous mathematical objects. Finally, they are an example of how a complex continuous system behaves asymptotically similarly to a complex discrete system. It would be tempting to observe something similar, say, in a future asymptotic theory of dynamical systems and automata.
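The Lagrange-multiplier relations (1) and (2) of this Introduction can be checked numerically on a small example. The sketch below is not part of the original text. It assumes the specific choice $\Phi_1[P] = H[P]$ (entropy in nats) and $\Phi_2[P] = \sum_x P(x) c(x)$ (average cost) over normalized distributions on a finite set, for which the extremal distribution is $P(x) \propto e^{\alpha c(x)}$ and $K(\alpha) = \ln \sum_x e^{\alpha c(x)}$. The cost values and all function names are hypothetical.

```python
import math

# Hypothetical cost ("energy") values c(x) on a four-point set.
COSTS = [0.0, 1.0, 2.0, 5.0]


def extremal_distribution(alpha):
    """Extremum of H[P] + alpha * E[c] over normalized P: p(x) ~ exp(alpha * c(x))."""
    weights = [math.exp(alpha * c) for c in COSTS]
    z = sum(weights)
    return [w / z for w in weights]


def entropy(p):
    """Phi_1[P]: Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def mean_cost(p):
    """Phi_2[P]: average cost."""
    return sum(q * c for q, c in zip(p, COSTS))


def constraint_A(alpha):
    """Value A of the constraint Phi_2 at the extremal distribution."""
    return mean_cost(extremal_distribution(alpha))


def potential_K(alpha):
    """K(alpha) = Phi_1 + alpha * Phi_2 at the extremal distribution."""
    p = extremal_distribution(alpha)
    return entropy(p) + alpha * mean_cost(p)


def potential_L(alpha):
    """Legendre-conjugate potential L = K - alpha * A, parametrized by alpha."""
    return potential_K(alpha) - alpha * constraint_A(alpha)


def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


alpha = -0.7  # a negative multiplier plays the role of -1/temperature here
A = constraint_A(alpha)
dK_dalpha = derivative(potential_K, alpha)  # relation (2): should equal A
# dL/dA via the chain rule along the curve alpha -> (A(alpha), L(alpha)).
dL_dA = derivative(potential_L, alpha) / derivative(constraint_A, alpha)
```

Under these assumptions `dK_dalpha` agrees with `A` and `dL_dA` agrees with `-alpha` up to finite-difference error, and `potential_L(alpha)` is just the entropy of the extremal distribution.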
Contents
1 Definition of information and entropy in the absence of noise
  1.1 Definition of entropy in the case of equiprobable outcomes
  1.2 Entropy and its properties in the case of non-equiprobable outcomes
  1.3 Conditional entropy. Hierarchical additivity
  1.4 Asymptotic equivalence of non-equiprobable and equiprobable outcomes
  1.5 Asymptotic equiprobability and entropic stability
  1.6 Definition of entropy of a continuous random variable
  1.7 Properties of entropy in the generalized version. Conditional entropy
2 Encoding of discrete information in the absence of noise and penalties
  2.1 Main principles of encoding discrete information
  2.2 Main theorems for encoding without noise. Independent identically distributed messages
  2.3 Optimal encoding by Huffman. Examples
  2.4 Errors of encoding without noise in the case of a finite code sequence length
3 Encoding in the presence of penalties. First variational problem
  3.1 Direct method of computing information capacity of a message for one example
  3.2 Discrete channel without noise and its capacity
  3.3 Solution of the first variational problem. Thermodynamic parameters and potentials
  3.4 Examples of application of general methods for computation of channel capacity
  3.5 Methods of potentials in the case of a large number of parameters
  3.6 Capacity of a noiseless channel with penalties in a generalized version
4 First asymptotic theorem and related results
  4.1 Potential Γ or the cumulant generating function
  4.2 Some asymptotic results of statistical thermodynamics. Stability of the canonical distribution
  4.3 Asymptotic equivalence of two types of constraints
  4.4 Some theorems about the characteristic potential
5 Computation of entropy for special cases. Entropy of stochastic processes
  5.1 Entropy of a segment of a stationary discrete process and entropy rate
  5.2 Entropy of a Markov chain
  5.3 Entropy rate of part of the components of a discrete Markov process and of a conditional Markov process
  5.4 Entropy of Gaussian random variables
  5.5 Entropy of a stationary sequence. Gaussian sequence
  5.6 Entropy of stochastic processes in continuous time. General concepts and relations
  5.7 Entropy of a Gaussian process in continuous time
  5.8 Entropy of a stochastic point process
  5.9 Entropy of a discrete Markov process in continuous time
  5.10 Entropy of diffusion Markov processes
  5.11 Entropy of a composite Markov process, a conditional process, and some components of a Markov process
6 Information in the presence of noise. Shannon’s amount of information
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1 Information losses under degenerate transformations and simple noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Mutual information for discrete random variables . . . . . . . . . . . . . . 6.3 Conditional mutual information. Hierarchical additivity of information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.4 Mutual information in the general case . . . . . . . . . . . . . . . . . . . . . . . 6.5 Mutual information for Gaussian variables . . . . . . . . . . . . . . . . . . . . 6.6 Information rate of stationary and stationary-connected processes. Gaussian processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.7 Mutual information of components of a Markov process . . . . . . . . Message transmission in the presence of noise. Second asymptotic theorem and its various formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Principles of information transmission and information reception in the presence of noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Random code and the mean probability of error . . . . . . . . . . . . . . . 7.3 Asymptotic zero probability of decoding error. Shannon’s theorem (second asymptotic theorem) . . . . . . . . . . . . . . . . . . . . . . . . 7.4 Asymptotic formula for the probability of error . . . . . . . . . . . . . . . .
77 78 82 89 94 103 104 107 113 123 128 134 137 144 153 157 161 173 173 178 181 187 189 196 202 217 218 221 225 228
Contents
xxi
7.5 7.6
Enhanced estimators for optimal decoding . . . . . . . . . . . . . . . . . . . . 232 Some general relations between entropies and mutual informations for encoding and decoding . . . . . . . . . . . . . . . . . . . . . . 243
8
9
10
11
Channel capacity. Important particular cases of channels . . . . . . . . . . 8.1 Definition of channel capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 Solution of the second variational problem. Relations for channel capacity and potential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.3 The type of optimal distribution and the partition function . . . . . . . 8.4 Symmetric channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5 Binary channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.6 Gaussian channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.7 Stationary Gaussian channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.8 Additive channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Definition of the value of information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.1 Reduction of average cost under uncertainty reduction . . . . . . . . . . 9.2 Value of Hartley’s information amount. An example . . . . . . . . . . . . 9.3 Definition of the value of Shannon’s information amount and α -information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.4 Solution of the third variational problem. The corresponding potentials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5 Solution of a variational problem under several additional assumptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6 Value of Boltzmann’s information amount . . . . . . . . . . . . . . . . . . . . 9.7 Another approach to defining the value of Shannon’s information Value of Shannon’s information for the most important Bayesian systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . 10.1 Two-state system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.2 Systems with translation invariant cost function . . . . . . . . . . . . . . . 10.3 Gaussian Bayesian systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10.4 Stationary Gaussian systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Asymptotic results about the value of information. Third asymptotic theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 On the distinction between the value functions of different types of information. Preliminary forms . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Theorem about asymptotic equivalence of the value functions of different types of information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.3 Rate of convergence between the values of Shannon’s and Hartley’s information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.4 Alternative forms of the main result. Generalizations and special cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.5 Generalized Shannon’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
249 249 252 259 262 264 267 277 284 289 290 294 300 304 313 318 321 327 327 331 338 346 353 354 357 369 379 385
xxii
12
A
Contents
Information theory and the second law of thermodynamics . . . . . . . . 12.1 Information about a physical system being in thermodynamic equilibrium. The generalized second law of thermodynamics . . . . 12.2 Influx of Shannon’s information and transformation of heat into work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.3 Energy costs of creating and recording information. An example . 12.4 Energy costs of creating and recording information. General formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12.5 Energy costs in physical channels . . . . . . . . . . . . . . . . . . . . . . . . . . .
391 392 395 399 403 405
Some matrix (operator) identities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409 A.1 Rules for operator transfer from left to right . . . . . . . . . . . . . . . . . . . 409 A.2 Determinant of a block matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
Chapter 1
Definition of information and entropy in the absence of noise
In modern science, engineering and public life, a major role is played by information and the operations associated with it: information reception, transmission, processing, storage and so on. The significance of information has seemingly outgrown the significance of the other important factor, which used to play a dominant role in the previous century, namely, energy. In the future, in view of the growing complexity of science, engineering, economics and other fields, the significance of correct control in these areas will grow and, therefore, the importance of information will increase as well.

What is information? Is a theory of information possible? Are there any general laws for information independent of its content, which can be quite diverse? Answers to these questions are far from obvious. Information appears to be a more difficult concept to formalize than, say, energy, which has a certain, long established place in physics.

There are two sides of information: quantitative and qualitative. Sometimes it is the total amount of information that is important, while other times it is its quality, its specific content. Besides, a transformation of information from one format into another is technically a more difficult problem than, say, transformation of energy from one form into another. All this complicates the development of information theory and its usage. It is quite possible that the general information theory will not bring any benefit to some practical problems, and they have to be tackled by independent engineering methods.

Nevertheless, general information theory exists, and so do standard situations and problems, in which the laws of general information theory play the main role. Therefore, information theory is important from a practical standpoint, as well as for fundamental science, philosophy and expanding the horizons of a researcher.
From this introduction one can gauge how difficult it was to discover the laws of information theory. In this regard, the most important milestone was the work of Claude Shannon [44, 45] published in 1948–1949 (the respective English originals are [38, 39]). His formulation of the problem and results were both perceived as a surprise. However, on closer investigation one can see that the new theory extends and develops former ideas, specifically, the ideas of statistical thermodynamics due to Boltzmann. The deep mathematical similarities between these two directions are not accidental. It is evidenced in the use of the same formulae (for instance, for entropy of a discrete random variable). Besides that, a logarithmic measure for the amount of information, which is fundamental in Shannon’s theory, was proposed for problems of communication as early as 1928 in the work of R. Hartley [19] (the English original is [18]).

In the present chapter, we introduce the logarithmic measure of the amount of information and state a number of important properties of information, which follow from that measure, such as the additivity property. The notion of the amount of information is closely related to the notion of entropy, which is a measure of uncertainty. Acquisition of information is accompanied by a decrease in uncertainty, so that the amount of information can be measured by the amount of uncertainty or entropy that has disappeared. In the case of a discrete message, i.e. a discrete random variable, entropy is defined by the Boltzmann formula

$$H_\xi = -\sum_\xi P(\xi) \ln P(\xi),$$
where $\xi$ is a random variable, and $P(\xi)$ is its probability distribution. In this chapter, we emphasize the fact that this formula is a corollary (in the asymptotic sense) of the simpler Hartley’s formula $H = \ln M$.

An essential result of information theory is that the set of realizations (the set of all possible values) of an entropically stable random variable (this property is defined in Section 1.5) can be partitioned into two subsets. The first subset has infinitesimal probability, and therefore it can be discarded. The second subset contains approximately $e^{H_\xi}$ realizations (i.e. variants of representation), and it is often substantially smaller than the total number of realizations. Realizations of that second subset can be approximately treated as equiprobable. According to the terminology introduced by Boltzmann, those equiprobable realizations can be called ‘micro states’.

Information theory is a mathematical theory that employs definitions and methods of probability theory. In these discussions, there is no particular need to hold to a special ‘informational’ terminology, which is usually used in applications. Without detriment to the theory, the notion of ‘message’ can be substituted by the notion of ‘random variable’, and the notion of ‘sequence of messages’ by ‘stochastic process’, etc. Then in order to apply the general theory, certainly, we require a proper formalization of the applicable concepts in the language of probability theory. We will not pay extra attention here to such a translation.

The main results of Sections 1.1–1.3, related to discrete random variables (or processes) having a finite or countable number of states, can be generalized to the case of continuous (and even arbitrary) random variables assuming values from a multidimensional real space. For such a generalization, one should overcome certain difficulties and put up with certain complications of the most important concepts and formulae. The main complication consists in the following. In contrast to the case of discrete random variables, where it is sufficient to define entropy using one measure or one probability distribution, in the general case it is necessary to introduce two measures to do so. Therefore, entropy is now related to two measures instead of one, and thus it characterizes the relationship between these measures. In our presentation, the general version of the formula for entropy is derived from the Boltzmann formula by using as an example the condensation of the points representing random variables.

There are several books on information theory, in particular, Goldman [16], Feinstein [12] and Fano [10] (the corresponding English originals are [11, 15] and [9]). These books conveniently introduce readers to the most basic notions (also see the book by Kullback [30] or its English original [29]).
1.1 Definition of entropy in the case of equiprobable outcomes

Suppose we have $M$ equiprobable outcomes of an experiment. For example, when we roll a standard die, $M = 6$. Of course, we cannot always perform the formalization of conditions so easily and accurately as in the case of a die. We assume though that the formalization has been performed and, indeed, one of $M$ outcomes is realized, and they are equivalent in probabilistic terms. Then there is a priori uncertainty directly connected with $M$ (i.e. the greater $M$ is, the higher the uncertainty is). The quantity measuring the above uncertainty is called entropy and is denoted by $H$:

$$H = f(M), \tag{1.1.1}$$

where $f(\cdot)$ is some increasing non-negative function defined at least for natural numbers.

When rolling a die and observing the outcome number, we obtain information whose amount is denoted by $I$. After that (i.e. a posteriori) there is no uncertainty left: the a posteriori number of outcomes is $M = 1$ and we must have $H_{ps} = f(1) = 0$. It is natural to measure the amount of information received by the value of disappeared uncertainty:

$$I = H_{pr} - H_{ps}. \tag{1.1.2}$$

Here, the subscript ‘pr’ means ‘a priori’, whereas ‘ps’ means ‘a posteriori’. We see that the amount of received information $I$ coincides with the initial entropy. In other cases (in particular, for formula (1.2.3) given below) a message having entropy $H$ can also transmit the amount of information $I$ equal to $H$.

In order to determine the form of function $f(\cdot)$ in (1.1.1) we employ a very natural additivity principle. In the case of a die it reads: the entropy of two throws of a die is twice as large as the entropy of one throw, the entropy of three throws of a die is three times as large as the entropy of one throw, etc. Applying the additivity principle to other cases means that the entropy of several independent systems is equal to the sum of the entropies of individual systems. However, the number $M$ of outcomes for a complex system is equal to the product of the numbers $m$ of outcomes for each one of the ‘simple’ (relative to the total system) subsystems. For two throws of a die, the number of various pairs $(\xi_1, \xi_2)$ (where $\xi_1$ and $\xi_2$ both take one out of six values) equals $36 = 6^2$. Generally, for $n$ throws the number of equivalent outcomes is $6^n$. Applying formula (1.1.1) to this number, we obtain entropy $f(6^n)$. According to the additivity principle, we find that $f(6^n) = n f(6)$. For other $m > 1$ the latter formula would take the form

$$f(m^n) = n f(m). \tag{1.1.3}$$

Denoting $x = m^n$, we have $n = \ln x / \ln m$. Then it follows from (1.1.3) that $f(x) = n f(m) = (\ln x / \ln m) f(m)$, i.e.

$$f(x) = K \ln x, \tag{1.1.4}$$
where $K = f(m)/\ln m$ is a positive constant, which is independent of $x$. It is related to the choice of the units of information. Thus, the form of function $f(\cdot)$ has been determined up to a choice of the measurement units. It is easy to verify that the aforementioned constraint $f(1) = 0$ is indeed satisfied.

Hartley was the first to introduce the logarithmic measure of information [19] (the English original is [18]). That is why the quantity $H = K \ln M$ is called Hartley’s amount of information. Let us indicate three main choices of the measurement units for information:

1. If we set $K = 1$ in (1.1.4), then the entropy will be measured in natural units (nats)¹:

$$H_{nat} = \ln M; \tag{1.1.5}$$

2. If we set $K = 1/\ln 2$, then we will have the entropy expressed in binary units (bits)²:

$$H_{bit} = \frac{1}{\ln 2} \ln M = \log_2 M; \tag{1.1.6}$$

3. Finally, we will have a physical scale if we take $K$ to be the Boltzmann constant $k = 1.38 \cdot 10^{-23}$ J/K. The entropy measured in those units will be equal to

$$S = H_{ph} = k \ln M. \tag{1.1.7}$$

It is easy to see from the comparison of (1.1.5) and (1.1.6) that 1 nat is greater than 1 bit by a factor of $\log_2 e = 1/\ln 2 \approx 1.44$. In what follows, we shall use natural units (formula (1.1.5)), dropping subscript ‘nat’, unless otherwise stipulated.

Suppose the random variable $\xi$ takes any of the $M$ equiprobable values, say, $1, \ldots, M$. Then the probability of each individual value is equal to $P(\xi) = 1/M$, $\xi = 1, \ldots, M$. Consequently, formula (1.1.5) can be rewritten as

$$H = -\ln P(\xi). \tag{1.1.8}$$

¹ ‘nat’ refers to natural digit that means natural unit.
² ‘bit’ refers to binary digit that means binary unit (sign).
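The three choices of units and the additivity principle can be illustrated with a minimal Python sketch. The function name `hartley_entropy` and the unit labels are assumptions made for this illustration; the physical scale uses the exact SI value of the Boltzmann constant, which the text rounds to $1.38 \cdot 10^{-23}$ J/K.

```python
import math

BOLTZMANN_K = 1.380649e-23  # J/K, exact SI value; the text rounds it to 1.38e-23


def hartley_entropy(num_outcomes: int, unit: str = "nat") -> float:
    """Entropy H = K ln M of M equiprobable outcomes, formulae (1.1.5)-(1.1.7)."""
    if num_outcomes < 1:
        raise ValueError("the number of outcomes M must be at least 1")
    # The constant K of (1.1.4) for each choice of measurement units.
    scale = {"nat": 1.0, "bit": 1.0 / math.log(2), "physical": BOLTZMANN_K}
    return scale[unit] * math.log(num_outcomes)


one_throw = hartley_entropy(6)          # one throw of a die, in nats
three_throws = hartley_entropy(6 ** 3)  # three throws give 6^3 equiprobable outcomes
print(one_throw, three_throws)          # the second value is three times the first
print(hartley_entropy(6, "bit"))        # the same entropy expressed in bits
print(hartley_entropy(1))               # a single outcome carries no uncertainty
```

The sketch only restates (1.1.3) and (1.1.4) numerically: multiplying the numbers of outcomes adds the entropies, whatever unit is chosen.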
1.2 Entropy and its properties in the case of non-equiprobable outcomes

1. Suppose now the probabilities of different outcomes are unequal. If, as earlier, the number of outcomes equals $M$, then we can consider a random variable $\xi$, which takes one of $M$ values. Taking the index of the corresponding outcome as $\xi$, we obtain that those values are nothing else but $1, \ldots, M$. Probabilities $P(\xi)$ of those values are non-negative and satisfy the normalization constraint: $\sum_\xi P(\xi) = 1$. If we formally apply equality (1.1.8) to this case, then each $\xi$ should have its own entropy

$$H(\xi) = -\ln P(\xi). \tag{1.2.1}$$

Thus, we attribute a certain value of entropy to each realization of the variable $\xi$. Since $\xi$ is a random variable, we can also regard this entropy as a random variable. As in Section 1.1, the a posteriori entropy, which remains after the realization of $\xi$ becomes known, is equal to zero. That is why the information we obtain once the realization is known is numerically equal to the initial entropy

$$I(\xi) = H(\xi) = -\ln P(\xi). \tag{1.2.2}$$

Similar to entropy $H(\xi)$, information $I$ depends on the actual realization (on the value of $\xi$), i.e., it is a random variable. One can see from the latter formula that information and entropy are both large when the a priori probability of the given realization is small and vice versa. This observation is quite consistent with intuitive ideas.

Example 1.1. Suppose we would like to know whether a certain student has passed an exam or not. Let the probabilities of these two events be

$$P(\text{pass}) = 7/8, \qquad P(\text{fail}) = 1/8.$$

One can see from these probabilities that the student is quite strong. If we were informed that the student had passed the exam, then we could say: ‘Your message has not given me a lot of information. I already expected that the student would pass the exam’. According to formula (1.2.2) the information of this message is quantitatively equal to

$$I(\text{pass}) = \log_2 (8/7) = 0.193 \text{ bits}.$$

If we were informed that the student had failed, then we would say ‘Really?’ and would feel that we have improved our knowledge to a greater extent. The amount of information of such a message is equal to

$$I(\text{fail}) = \log_2 8 = 3 \text{ bits}.$$
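The two values of Example 1.1 can be reproduced with a short Python sketch of formula (1.2.2) in bits. The helper name `self_information_bits` is an assumption introduced for this illustration.

```python
import math


def self_information_bits(probability: float) -> float:
    """Random information I = -log2 P of one realization, formula (1.2.2) in bits."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(probability)


# Probabilities of the two messages from Example 1.1.
exam = {"pass": 7 / 8, "fail": 1 / 8}
info = {outcome: self_information_bits(p) for outcome, p in exam.items()}
print(info)  # the unlikely message 'fail' carries far more information
```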
In theory, however, the main role is played not by the random entropy (1.2.1) (respectively, the random information (1.2.2)), but by the average entropy defined by the formula

$$H_\xi = E[H(\xi)] = -\sum_\xi P(\xi) \ln P(\xi). \tag{1.2.3}$$

We shall call it Boltzmann’s entropy or Boltzmann’s amount of information.³ In the aforementioned example averaging over both messages yields

$$I_\xi = H_\xi = \frac{7}{8} \cdot 0.193 + \frac{3}{8} = 0.544 \text{ bits}.$$

³ Boltzmann’s entropy is commonly referred to as ‘Shannon’s entropy’, or just ‘entropy’ within the field of information theory.

The subscript $\xi$ of symbol $H_\xi$ (as opposed to the argument of $H(\xi)$) is a dummy, i.e., the average entropy $H_\xi$ describes the random variable $\xi$ (depending on the probabilities $P(\xi)$), but it does not depend on the actual value of $\xi$, i.e., on the realization of $\xi$. Further, we shall use this notation system in its extended form including conditional entropies.

An indeterminate form of the type $0 \ln 0$ occurring in (1.2.3) for vanishing probabilities is always understood in the sense $0 \ln 0 = 0$. Consequently, a set of $M$ outcomes can always be complemented by any outcomes having zero probability. Also, deterministic (non-random) quantities can be added to the entropy index. For instance, the equality $H_\xi = H_{\xi, 7}$ is valid for every random variable $\xi$.

2. Properties of entropy.

Theorem 1.1. Both random and average entropies are always non-negative.

This property is connected with the fact that probability cannot exceed one and that the constant $K$ in (1.1.4) is necessarily positive. Since $P(\xi) \le 1$, we have $-\ln P(\xi) = H(\xi) \ge 0$. Certainly, this inequality remains valid after averaging as well.

Theorem 1.2. Entropy attains the maximum value of $\ln M$ when the outcomes (realizations) are equiprobable, i.e. when $P(\xi) = 1/M$.

Proof. This property is a consequence of Jensen’s inequality (for instance, see Rao [35] or its English original [34])

$$E[f(\zeta)] \le f(E[\zeta]) \tag{1.2.4}$$

that is valid for every concave function $f(x)$. (Function $f(x) = \ln x$ is concave for $x > 0$, because $f''(x) = -x^{-2} < 0$.) Indeed, denoting $\zeta = 1/P(\xi)$ we have

$$E[\zeta] = E\left[\frac{1}{P(\xi)}\right] = \sum_{\xi=1}^{M} P(\xi) \frac{1}{P(\xi)} = M, \tag{1.2.5}$$

$$E[f(\zeta)] = E\left[\ln \frac{1}{P(\xi)}\right] = E[H(\xi)] = H_\xi. \tag{1.2.6}$$

Substituting (1.2.5), (1.2.6) into (1.2.4), we obtain $H_\xi \le \ln M$.

For the particular form of function $f(\zeta) = \ln \zeta$ it is easy to verify inequality (1.2.4) directly. Averaging the obvious inequality

$$\ln \frac{\zeta}{E[\zeta]} \le \frac{\zeta}{E[\zeta]} - 1, \tag{1.2.7}$$

we obtain (1.2.4). In the general case, it is convenient to consider the tangent line $f(E[\zeta]) + f'(E[\zeta])(\zeta - E[\zeta])$ for function $f(\zeta)$ at point $\zeta = E[\zeta]$ in order to prove (1.2.4). Concavity implies $f(\zeta) \le f(E[\zeta]) + f'(E[\zeta])(\zeta - E[\zeta])$. Averaging the latter inequality, we derive (1.2.4).

As is seen from the provided proof, Theorem 1.2 would remain valid if we replaced the logarithmic function by any other concave function in the definition of entropy. Let us now consider the properties of entropy, which are specific for the logarithmic function, namely, properties related to the additivity of entropy.

Theorem 1.3. If random variables $\xi_1$, $\xi_2$ are independent, then the full (joint) entropy $H_{\xi_1 \xi_2}$ is decomposed into the sum of entropies:

$$H_{\xi_1 \xi_2} = H_{\xi_1} + H_{\xi_2}. \tag{1.2.8}$$
Proof. Suppose that $\xi_1$, $\xi_2$ are two random variables such that the first one assumes values $1, \ldots, m_1$ and the second one assumes values $1, \ldots, m_2$. There are $m_1 m_2$ pairs $\xi = (\xi_1, \xi_2)$ with probabilities $P(\xi_1, \xi_2)$. Numbering the pairs in an arbitrary order by indices $\xi = 1, \ldots, m_1 m_2$ we have

$$H_\xi = -E[\ln P(\xi)] = -E[\ln P(\xi_1, \xi_2)] = H_{\xi_1 \xi_2}.$$

In view of independence, we have $P(\xi_1, \xi_2) = P(\xi_1) P(\xi_2)$. Therefore,

$$\ln P(\xi_1, \xi_2) = \ln P(\xi_1) + \ln P(\xi_2); \qquad H(\xi_1 \xi_2) = H(\xi_1) + H(\xi_2).$$

Averaging the latter equality yields $H_\xi = H_{\xi_1 \xi_2} = H_{\xi_1} + H_{\xi_2}$, i.e. (1.2.8). The proof is complete.

If, instead of two independent random variables, there are two independent groups ($\eta_1, \ldots, \eta_r$ and $\zeta_1, \ldots, \zeta_s$) of random variables, i.e. $P(\eta_1, \ldots, \eta_r, \zeta_1, \ldots, \zeta_s) = P(\eta_1, \ldots, \eta_r) P(\zeta_1, \ldots, \zeta_s)$, then the provided reasoning is still applicable when denoting ensembles $(\eta_1, \ldots, \eta_r)$ and $(\zeta_1, \ldots, \zeta_s)$ as $\xi_1$ and $\xi_2$, respectively.
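Formula (1.2.3) and Theorems 1.2 and 1.3 can be checked numerically with a small Python sketch. The function name `entropy` and the distributions `p1`, `p2` are hypothetical choices made only for this illustration; the exam probabilities are those of Example 1.1.

```python
import math
from itertools import product


def entropy(probs, base=math.e):
    """Average entropy -sum p log p, formula (1.2.3), with the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


# Example 1.1: average information of the pass/fail message, in bits.
exam_bits = entropy([7 / 8, 1 / 8], base=2)

# Theorem 1.2: entropy does not exceed ln M and reaches it for equiprobable outcomes.
p1 = [0.5, 0.3, 0.2]
uniform = [1 / 3] * 3

# Theorem 1.3: for independent variables the joint distribution is the product
# of the marginal ones, and the joint entropy is the sum of the entropies.
p2 = [0.9, 0.1]
joint = [a * b for a, b in product(p1, p2)]

print(round(exam_bits, 3))
print(entropy(p1), entropy(uniform), math.log(3))
print(entropy(joint), entropy(p1) + entropy(p2))
```

The last two printed values agree up to rounding error, while replacing `joint` by a distribution with dependent components would in general break the equality.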
The property mentioned in Theorem 1.3 is a manifestation of the additivity principle, which was taken as a base principle in Section 1.1 and led to the logarithmic function (1.1.1). This property can be generalized to the case of several independent random variables $\xi_1, \ldots, \xi_n$, which yields

$$H_{\xi_1 \ldots \xi_n} = \sum_{j=1}^{n} H_{\xi_j}, \tag{1.2.9}$$

and is easy to prove by an analogous method.
1.3 Conditional entropy. Hierarchical additivity

Let us generalize formulae (1.2.1), (1.2.3) to the case of conditional probabilities. Let $\xi_1, \ldots, \xi_n$ be random variables described by the joint distribution $P(\xi_1, \ldots, \xi_n)$. The conditional probabilities

$$P(\xi_k, \ldots, \xi_n \mid \xi_1, \ldots, \xi_{k-1}) = \frac{P(\xi_1, \ldots, \xi_n)}{P(\xi_1, \ldots, \xi_{k-1})} \qquad (k \le n)$$

are associated with the random conditional entropy

$$H(\xi_k, \ldots, \xi_n \mid \xi_1, \ldots, \xi_{k-1}) = -\ln P(\xi_k, \ldots, \xi_n \mid \xi_1, \ldots, \xi_{k-1}). \tag{1.3.1}$$

Let us introduce a special notation for the result of averaging (1.3.1) over $\xi_k, \ldots, \xi_n$:

$$H_{\xi_k \ldots \xi_n}(\mid \xi_1, \ldots, \xi_{k-1}) = -\sum_{\xi_k \ldots \xi_n} P(\xi_k, \ldots, \xi_n \mid \xi_1, \ldots, \xi_{k-1}) \ln P(\xi_k, \ldots, \xi_n \mid \xi_1, \ldots, \xi_{k-1}), \tag{1.3.2}$$

and also for the result of total averaging:

$$H_{\xi_k, \ldots, \xi_n \mid \xi_1, \ldots, \xi_{k-1}} = E[H(\xi_k, \ldots, \xi_n \mid \xi_1, \ldots, \xi_{k-1})] = -\sum_{\xi_1 \ldots \xi_n} P(\xi_1, \ldots, \xi_n) \ln P(\xi_k, \ldots, \xi_n \mid \xi_1, \ldots, \xi_{k-1}). \tag{1.3.3}$$

Fig. 1.1 Partitioning stages and the decision tree in the general case
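For two variables ($k = n = 2$), definitions (1.3.2) and (1.3.3) can be sketched in Python as follows. The function name `conditional_entropies` and the joint table are hypothetical and serve only as an illustration; entropies are in nats.

```python
import math
from collections import defaultdict


def conditional_entropies(joint):
    """Given joint[(x1, x2)] = P(x1, x2), return (per_node, total).

    per_node[x1] is the entropy (1.3.2) for fixed xi1 = x1 (case k = n = 2);
    total is the fully averaged conditional entropy (1.3.3).
    """
    marginal = defaultdict(float)
    for (x1, _), p in joint.items():
        marginal[x1] += p  # P(x1) as the sum of P(x1, x2) over x2

    per_node = defaultdict(float)
    total = 0.0
    for (x1, _), p in joint.items():
        if p > 0:  # convention 0 ln 0 = 0
            cond = p / marginal[x1]  # conditional probability P(x2 | x1)
            per_node[x1] -= cond * math.log(cond)
            total -= p * math.log(cond)
    return dict(per_node), total


# Hypothetical joint distribution of two binary variables; both marginals
# of the first variable equal 0.5.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.25, (1, 1): 0.25}
per_node, total = conditional_entropies(joint)
averaged = 0.5 * per_node[0] + 0.5 * per_node[1]
print(per_node, total, averaged)
```

Averaging the per-node entropies with weights $P(\xi_1)$ reproduces the total conditional entropy, which is the relation between (1.3.2) and (1.3.3).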
If, in addition, we vary $k$ and $n$, then we will form a large number of different entropies, conditional and non-conditional, random and non-random. They are related by identities that will be considered below.

Before we formulate the main hierarchical equality (1.3.4), we show how to introduce a hierarchical set of random variables $\xi_1, \ldots, \xi_n$, even if there was just one random variable $\xi$ initially. Let $\xi$ take one of $M$ values with probabilities $P(\xi)$. The choice of one realization will be made in several stages. At the first stage, we indicate which subset (from a full ensemble of non-overlapping subsets $E_1, \ldots, E_{m_1}$) the realization belongs to. Let $\xi_1$ be the index of such a subset. At the second stage, each subset is partitioned into smaller subsets $E_{\xi_1 \xi_2}$. The second random variable $\xi_2$ indicates which smaller subset the realization of the random variable belongs to. In turn, those smaller subsets are further partitioned until we obtain subsets consisting of a single element. Evidently, the number of nontrivial partitioning stages $n$ cannot exceed $M - 1$. We can associate a fixed partitioning scheme with a ‘decision tree’ depicted in Figure 1.1. Further considerations will be associated with a particular selected ‘tree’.
Fig. 1.2 A ‘decision tree’ for one particular example
In order to indicate a realization of $\xi$, it is necessary and sufficient to fix the collection of realizations $(\xi_1, \xi_2, \ldots, \xi_n)$. In order to indicate the ‘node’ of the $(k+1)$th stage, we need to specify values $\xi_1, \ldots, \xi_k$. Then the value of $\xi_{k+1}$ points to the branch we use to get out from this node. As an example of a ‘decision tree’ we can give the simple ‘tree’ represented in Figure 1.2, which was considered by Shannon.

At each stage, the choice is associated with some uncertainty. Consider the entropy corresponding to this uncertainty. The first stage has one node, and the entropy of the choice is equal to $H_{\xi_1}$. Fixing $\xi_1$, we determine the node of the second stage. The probability of moving from the node $\xi_1$ along the branch $\xi_2$ is equal to the conditional probability $P(\xi_2 \mid \xi_1) = P(\xi_1, \xi_2)/P(\xi_1)$. The entropy associated with a selection of one branch emanating from this node is precisely the conditional entropy of type (1.3.2) for $n = 2$, $k = 2$:

$$H_{\xi_2}(\mid \xi_1) = -\sum_{\xi_2} P(\xi_2 \mid \xi_1) \ln P(\xi_2 \mid \xi_1).$$

As in (1.3.3), averaging over all second stage nodes yields the full selection entropy at the second stage:

$$H_{\xi_2 \mid \xi_1} = E[H_{\xi_2}(\mid \xi_1)] = \sum_{\xi_1} P(\xi_1) H_{\xi_2}(\mid \xi_1).$$

In general, the selection entropy at stage $k$ in the node defined by values $\xi_1, \ldots, \xi_{k-1}$ is equal to $H_{\xi_k}(\mid \xi_1, \ldots, \xi_{k-1})$. At the same time the total entropy of stage $k$ is equal to

$$H_{\xi_k \mid \xi_1, \ldots, \xi_{k-1}} = E[H_{\xi_k}(\mid \xi_1, \ldots, \xi_{k-1})].$$

For instance, in Figure 1.2 the first stage entropy is equal to $H_{\xi_1} = 1$ bit. Node A has entropy $H_{\xi_2}(\mid \xi_1 = 1) = 0$, and node B has entropy $H_{\xi_2}(\mid \xi_1 = 2) = \frac{2}{3} \log_2 \frac{3}{2} + \frac{1}{3} \log_2 3 = \log_2 3 - \frac{2}{3}$ bits. The average entropy at the second stage is evidently equal to $H_{\xi_2 \mid \xi_1} = \frac{1}{2} H_{\xi_2}(\mid 2) = \frac{1}{2} \log_2 3 - \frac{1}{3}$ bits. An important regularity is that the sum of entropies of all stages is equal to the full entropy $H_\xi$, which can be computed without partitioning a selection into stages. For the above example

$$H_{\xi_1 \xi_2} = \frac{1}{2} \log_2 2 + \frac{1}{3} \log_2 3 + \frac{1}{6} \log_2 6 = \frac{2}{3} + \frac{1}{2} \log_2 3 \text{ bits},$$

which does, indeed, coincide with the sum $H_{\xi_1} + H_{\xi_2 \mid \xi_1}$. This observation is general.

Theorem 1.4. Entropy possesses the property of hierarchical additivity:

$$H_{\xi_1, \ldots, \xi_n} = H_{\xi_1} + H_{\xi_2 \mid \xi_1} + H_{\xi_3 \mid \xi_1 \xi_2} + \cdots + H_{\xi_n \mid \xi_1, \ldots, \xi_{n-1}}. \tag{1.3.4}$$
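The two-stage example above can be verified with a short Python sketch. The branch probabilities are those implied by the formulae of the example (two equiprobable first-stage branches, node B splitting as 2/3 and 1/3); the function and variable names are assumptions made for this illustration.

```python
import math


def entropy_bits(probs):
    """Entropy -sum p log2 p in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# First stage: two equiprobable branches leading to nodes A and B.
first_stage = [1 / 2, 1 / 2]
# Second stage: node A has a single branch, node B splits as 2/3 and 1/3.
node_a = [1.0]
node_b = [2 / 3, 1 / 3]

h_first = entropy_bits(first_stage)
# Second-stage entropy averaged over the nodes with weights P(xi1).
h_second = first_stage[0] * entropy_bits(node_a) + first_stage[1] * entropy_bits(node_b)

# Probabilities of the final outcomes, multiplied along the branches: 1/2, 1/3, 1/6.
leaves = [first_stage[0] * q for q in node_a] + [first_stage[1] * q for q in node_b]
h_full = entropy_bits(leaves)

print(h_first, h_second, h_full)  # h_full equals h_first + h_second up to rounding
```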
Proof. By the definition of conditional probabilities, they possess the following property of hierarchical multiplicativity:
$$P(\xi_1, \ldots, \xi_n) = P(\xi_1) P(\xi_2 \mid \xi_1) P(\xi_3 \mid \xi_1 \xi_2) \cdots P(\xi_n \mid \xi_1, \ldots, \xi_{n-1}). \tag{1.3.5}$$
Taking the logarithm of (1.3.5) and taking into account definition (1.3.1) of conditional random entropy, we obtain
1.3 Conditional entropy. Hierarchical additivity
$$H(\xi_1, \ldots, \xi_n) = H(\xi_1) + H(\xi_2 \mid \xi_1) + H(\xi_3 \mid \xi_1 \xi_2) + \cdots + H(\xi_n \mid \xi_1, \ldots, \xi_{n-1}). \tag{1.3.6}$$
Averaging this equality according to (1.3.3) gives (1.3.4). This completes the proof.
The property in Theorem 1.4 is a reflection of the simple additivity principle that was stated in Section 1.1. This property is a consequence of the choice of the logarithmic function in (1.1.1). It is easy to see that the additivity property (1.2.8), (1.2.9) follows from it. Indeed, for independent random variables a conditional probability is equal to the corresponding non-conditional probability. Taking the logarithms of those probabilities, we have $H(\xi_2 \mid \xi_1) = H(\xi_2)$ and, thus, $H_{\xi_2 \mid \xi_1} = H_{\xi_2}$ after averaging. Therefore, the equality $H_{\xi_1 \xi_2} = H_{\xi_1} + H_{\xi_2 \mid \xi_1}$ turns into $H_{\xi_1 \xi_2} = H_{\xi_1} + H_{\xi_2}$.
A particular case of (1.3.4) for a two-stage partition was taken by Shannon [45] (the original in English is [38]) and also by Feinstein [12] (the original in English is [11]) as one of the axioms for deriving formula (1.2.3), i.e., in essence, for specifying the logarithmic measure of information. For other axiomatic ways of defining the amount of information it is also necessary to postulate the additivity property to some extent (perhaps in a weak or special form) for the logarithmic measure to be singled out.
In conclusion of this section, we shall prove a theorem involving conditional entropy. But first we shall state an important auxiliary proposition.
Theorem 1.5. Whatever the probability distributions $P(\xi)$ and $Q(\xi)$ are, the following inequality holds:
$$\sum_{\xi} P(\xi) \ln \frac{P(\xi)}{Q(\xi)} \geq 0. \tag{1.3.7}$$
Proof. The proof is similar to the proof of Theorem 1.2. It is based on inequality (1.2.4) for the function $f(x) = \ln x$. We set $\zeta = Q(\xi)/P(\xi)$ and perform averaging with weight $P(\xi)$. Then
$$E[\zeta] = \sum_{\xi} P(\xi) \frac{Q(\xi)}{P(\xi)} = \sum_{\xi} Q(\xi) = 1$$
and
$$E[f(\zeta)] = \sum_{\xi} P(\xi) \ln \frac{Q(\xi)}{P(\xi)}.$$
A substitution of these values into (1.2.4) yields
$$\sum_{\xi} P(\xi) \ln \frac{Q(\xi)}{P(\xi)} \leq \ln 1 = 0,$$
which completes the proof.
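Inequality (1.3.7) is easy to probe numerically. The sketch below draws random pairs of distributions and evaluates the left-hand side of (1.3.7); the helper names and the choice of five outcomes are assumptions made only for this illustration.

```python
import math
import random


def relative_entropy(p, q):
    """Left-hand side of (1.3.7): sum of p_i * ln(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def random_distribution(k, rng):
    """A random strictly positive distribution on k outcomes."""
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
values = [
    relative_entropy(random_distribution(5, rng), random_distribution(5, rng))
    for _ in range(1000)
]
min_value = min(values)

# The bound is attained when the two distributions coincide.
p = random_distribution(5, rng)
self_value = relative_entropy(p, p)

print(min_value, self_value)
```

In this experiment the smallest value over the random pairs stays non-negative, and the value is zero when Q coincides with P.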
Theorem 1.6. Conditional entropy cannot exceed the regular (non-conditional) one:
$$H_{\xi \mid \eta} \leq H_{\xi}. \tag{1.3.8}$$
Proof. In Theorem 1.5 we replace $P(\xi)$ and $Q(\xi)$ by $P(\xi \mid \eta)$ and $P(\xi)$, respectively. Then we obtain
$$-\sum_{\xi} P(\xi \mid \eta) \ln P(\xi \mid \eta) \leq -\sum_{\xi} P(\xi \mid \eta) \ln P(\xi).$$
Averaging this inequality over $\eta$ with weight $P(\eta)$ and using $\sum_{\eta} P(\eta) P(\xi \mid \eta) = P(\xi)$ results in the inequality
$$-\sum_{\eta} P(\eta) \sum_{\xi} P(\xi \mid \eta) \ln P(\xi \mid \eta) \leq -\sum_{\xi} P(\xi) \ln P(\xi),$$
which is equivalent to (1.3.8). The proof is complete.
The following theorem is proven similarly.
Theorem 1.6a. Adding conditions does not increase conditional entropy:
$$H_{\xi \mid \eta \zeta} \leq H_{\xi \mid \zeta}. \tag{1.3.9}$$
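As an illustration of Theorem 1.6, the following sketch evaluates both sides of (1.3.8) for a small joint distribution. The table of probabilities is hypothetical and chosen only so that the two variables are dependent.

```python
import math

# Hypothetical joint distribution P(xi, eta): rows index xi, columns index eta.
joint = [
    [0.10, 0.25, 0.05],
    [0.30, 0.05, 0.25],
]


def entropy(probs):
    """Entropy in nats of an iterable of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


p_xi = [sum(row) for row in joint]
p_eta = [sum(col) for col in zip(*joint)]

h_xi = entropy(p_xi)
# H_{xi|eta}: entropy of P(xi | eta) averaged over eta with weight P(eta).
h_xi_given_eta = sum(
    p_eta[j] * entropy([joint[i][j] / p_eta[j] for i in range(len(joint))])
    for j in range(len(p_eta))
)
# Hierarchical additivity (1.3.4) gives the same value as H_{xi,eta} - H_eta.
h_joint = entropy([p for row in joint for p in row])

print(h_xi_given_eta, h_xi, h_joint - entropy(p_eta))
```

For this table the conditional entropy is strictly smaller than the unconditional one, and it coincides with the difference of the joint entropy and the entropy of the condition.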
1.4 Asymptotic equivalence of non-equiprobable and equiprobable outcomes
The idea that the general case of non-equiprobable outcomes can be asymptotically reduced to the case of equiprobable outcomes is fundamental for information theory in the absence of noise. This idea belongs to Ludwig Boltzmann, who derived formula (1.2.3) for entropy. Claude Shannon revived this idea and used it broadly for the derivation of new results.
In considering this question here, we shall not try to reach generality, since these results form a particular case of the more general results of Section 1.5. Consider the set of independent realizations $\eta = (\xi_1, \ldots, \xi_n)$ of a random variable $\xi = \xi_j$, which assumes one of two values 1 or 0 with probabilities $P[\xi = 1] = p < 1/2$; $P[\xi = 0] = 1 - p = q$. Evidently, the number of such different combinations (realizations) is equal to $2^n$. Let realization $\eta_{n_1}$ contain $n_1$ ones and $n - n_1 = n_0$ zeros. Then its probability is given by
$$P(\eta_{n_1}) = p^{n_1} q^{n - n_1}. \tag{1.4.1}$$
Of course, these probabilities are different for different $n_1$. The ratio $P(\eta_0)/P(\eta_n) = (q/p)^n$ of the largest probability to the smallest one is large and increases fast as $n$ grows. What equiprobability can we talk about then? The thing is that due to
the Law of Large Numbers the number of ones $n_1 = \xi_1 + \cdots + \xi_n$ tends to take values that are close to its mean
$$E[n_1] = \sum_{j=1}^{n} E[\xi_j] = n E[\xi_j] = np.$$
Let us find the variance $\mathrm{Var}[n_1] = \mathrm{Var}[\xi_1 + \cdots + \xi_n]$ of the number of ones. Due to the independence of the summands we have
$$\mathrm{Var}(n_1) = n \mathrm{Var}(\xi) = n \left[ E[\xi^2] - (E[\xi])^2 \right],$$
and
$$E[\xi^2] = E[\xi] = p; \qquad E[\xi^2] - (E[\xi])^2 = p - p^2 = pq.$$
Therefore,
$$\mathrm{Var}(n_1) = npq; \qquad \mathrm{Var}(n_1/n) = pq/n. \tag{1.4.2}$$
Hence, we have obtained that the mean deviation
$$\Delta n_1 = n_1 - np \sim \sqrt{pqn}$$
increases with $n$, but more slowly than the mean value $np$ and the length of the entire range $0 \leq n_1 \leq n$ grow. A typical relative deviation $\Delta n_1/n_1$ decreases according to the law $\Delta n_1/n_1 \sim \sqrt{q/(np)}$. Within the bounds of the range $|n_1 - pn| \sim \sqrt{pqn}$ the difference between probabilities $P(\eta_{n_1})$ is still quite substantial:
$$\frac{P(\eta_{n_1 - \Delta n_1})}{P(\eta_{n_1})} = \left( \frac{q}{p} \right)^{\Delta n_1} \approx \left( \frac{q}{p} \right)^{\sqrt{pqn}}$$
(as follows from (1.4.1)) and increases as $n$ grows. However, this increase is relatively slow compared to the change of the probability itself. For $\Delta n_1 \approx \sqrt{pqn}$, the corresponding inequality
$$\ln \frac{P(\eta_{n_1 - \Delta n_1})}{P(\eta_{n_1})} \ll \ln \frac{1}{P(\eta_{n_1})}$$
becomes stronger and stronger as $n$ grows. Now we state the foregoing in more precise terms.
Theorem 1.7. All of the $2^n$ realizations of $\eta$ can be partitioned into two sets $A_n$ and $B_n$, so that
1. The total probability of realizations from the first set $A_n$ vanishes:
$$P(A_n) \to 0 \quad \text{as} \quad n \to \infty; \tag{1.4.3}$$
2. Realizations from the second set $B_n$ become relatively equiprobable in the following sense:
$$\left| \frac{\ln P(\eta) - \ln P(\eta')}{\ln P(\eta)} \right| \to 0, \qquad \eta \in B_n; \quad \eta' \in B_n. \tag{1.4.4}$$
Proof. Using Chebyshev’s inequality (for instance, see Gnedenko [13] or its translation to English [14]), we obtain
$$P[|n_1 - pn| \geq \varepsilon] \leq \frac{\mathrm{Var}(n_1)}{\varepsilon^2}.$$
Taking into account (1.4.2) and assigning $\varepsilon = n^{3/4}$, we obtain from here that
$$P[|n_1 - pn| \geq n^{3/4}] \leq pq\, n^{-1/2}. \tag{1.4.5}$$
We include realizations $\eta_{n_1}$ for which $|n_1 - pn| \geq n^{3/4}$ into set $A_n$, and the rest of them into set $B_n$. Then the left-hand side of (1.4.5) is nothing but $P(A_n)$, and passing to the limit $n \to \infty$ in (1.4.5) proves (1.4.3). For the realizations from the second set $B_n$, the inequality $pn - n^{3/4} < n_1 < pn + n^{3/4}$ holds, which, in view of (1.4.1), gives
$$-n(p \ln p + q \ln q) - n^{3/4} \ln \frac{q}{p} < -\ln P(\eta_{n_1}) < -n(p \ln p + q \ln q) + n^{3/4} \ln \frac{q}{p}. \tag{1.4.6}$$
Hence
$$|\ln P(\eta) - \ln P(\eta')| < 2 n^{3/4} \ln \frac{q}{p} \qquad \text{for} \quad \eta \in B_n, \quad \eta' \in B_n,$$
and
$$|\ln P(\eta)| > -n(p \ln p + q \ln q) - n^{3/4} \ln \frac{q}{p}.$$
Consequently,
$$\left| \frac{\ln P(\eta) - \ln P(\eta')}{\ln P(\eta)} \right| < \frac{2 n^{-1/4} \ln \frac{q}{p}}{-p \ln p - q \ln q - n^{-1/4} \ln \frac{q}{p}}.$$
In order to obtain (1.4.4), one should pass to the limit as $n \to \infty$. This ends the proof.
Inequality (1.4.6) also allows us to evaluate the number of elements of the set $B_n$.
Theorem 1.8. Let $B_n$ be the set described in Theorem 1.7. Its cardinality $M$ is such that
$$\frac{\ln M}{n} \to -p \ln p - q \ln q \equiv H_{\xi_1} \quad \text{as} \quad n \to \infty. \tag{1.4.7}$$
This theorem can also be formulated as follows.
Theorem 1.8a. If we suppose that realizations of set $A_n$ have zero probability, realizations of set $B_n$ are equiprobable, and compute the entropy $\tilde{H}_\eta$ by the simple formula $\tilde{H}_\eta = \ln M$ (see (1.1.5)), then the entropy rate $\tilde{H}_\eta/n$ in the limit will be equal to the entropy determined by formula (1.2.3), i.e.
$$\lim_{n \to \infty} \frac{\tilde{H}_\eta}{n} = -p \ln p - q \ln q. \tag{1.4.8}$$
Note that formula (1.2.3) can be obtained as a corollary of the simpler relation (1.1.5).
Proof. According to (1.4.5) the sum
$$\sum_{\eta \in B_n} P(\eta) = P(B_n)$$
can be bounded by the inequality
$$P(B_n) = 1 - P(A_n) \geq 1 - \frac{pq}{\sqrt{n}}.$$
The indicated sum involves the elements of set $B_n$ and has $M$ summands. In consequence of (1.4.6) each summand can be bounded above:
$$P(\eta) < \exp\left[ n(p \ln p + q \ln q) + n^{3/4} \ln \frac{q}{p} \right].$$
Therefore, the number of terms cannot be less than a certain number:
$$M \geq \frac{P(B_n)}{\max P(\eta)} > \left( 1 - \frac{pq}{\sqrt{n}} \right) \exp\left[ -n(p \ln p + q \ln q) - n^{3/4} \ln \frac{q}{p} \right]. \tag{1.4.9}$$
On the other hand, $\sum_{B_n} P(\eta) \leq 1$ and due to (1.4.6)
$$P(\eta) > \exp\left[ n(p \ln p + q \ln q) - n^{3/4} \ln \frac{q}{p} \right],$$
so that
$$M \leq \frac{P(B_n)}{\min P(\eta)} \leq \frac{1}{\min P(\eta)} < \exp\left[ -n(p \ln p + q \ln q) + n^{3/4} \ln \frac{q}{p} \right]. \tag{1.4.10}$$
Taking the logarithms of the derived inequalities (1.4.9), (1.4.10) we obtain
$$H_{\xi_1} - n^{-1/4} \ln \frac{q}{p} + \frac{1}{n} \ln\left( 1 - \frac{pq}{\sqrt{n}} \right) < \frac{\ln M}{n} < H_{\xi_1} + n^{-1/4} \ln \frac{q}{p}.$$
Passing to the limit as $n \to \infty$ proves the desired relations (1.4.7), (1.4.8).
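The statements of Theorems 1.7 and 1.8 can be observed numerically for moderate n. The sketch below counts the realizations in the set B_n, defined as in the proof by |n1 - pn| < n^(3/4), and evaluates P(A_n) exactly from the binomial distribution. The function name and the choice p = 0.2 are assumptions made for the illustration.

```python
import math


def typical_set_stats(n, p):
    """Return (ln M / n, P(A_n)) for B_n = {realizations with |n1 - p*n| < n**0.75}."""
    q = 1.0 - p
    width = n ** 0.75
    log_counts = []  # ln C(n, n1) for every n1 admitted into B_n
    prob_b = 0.0
    for n1 in range(n + 1):
        if abs(n1 - p * n) < width:
            log_c = math.lgamma(n + 1) - math.lgamma(n1 + 1) - math.lgamma(n - n1 + 1)
            log_counts.append(log_c)
            # C(n, n1) realizations, each with probability p**n1 * q**(n - n1).
            prob_b += math.exp(log_c + n1 * math.log(p) + (n - n1) * math.log(q))
    # ln M = ln(sum of binomial coefficients), computed stably in the log domain.
    top = max(log_counts)
    log_m = top + math.log(sum(math.exp(t - top) for t in log_counts))
    return log_m / n, 1.0 - prob_b


p = 0.2
q = 1.0 - p
h = -(p * math.log(p) + q * math.log(q))  # H_{xi1} in nats
results = {n: typical_set_stats(n, p) for n in (100, 1000, 10000)}

for n, (rate, prob_a) in results.items():
    print(n, rate, h, prob_a)
```

The convergence of ln M / n to the entropy is slow, of order n^(-1/4), in agreement with the two-sided bound obtained in the proof, whereas P(A_n) is typically far below the Chebyshev bound pq/√n.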
In the case when $\xi_1$ takes one out of $m$ values there are $m^n$ different realizations of the process $\eta = (\xi_1, \ldots, \xi_n)$ with independent components. According to the aforesaid, only $M = e^{n H_{\xi_1}}$ of them (which we can consider as equiprobable) deserve attention. When $P(\xi_1) = 1/m$ and $H_{\xi_1} = \ln m$, the numbers $m^n$ and $M$ are equal; otherwise (when $H_{\xi_1} < \ln m$) the fraction $e^{n H_{\xi_1}}/m^n$ of realizations deserving attention decreases without bound as $n$ grows. Therefore, the vast majority of realizations in this case is not essential and can be disregarded. This fact underlies coding theory (see Chapter 2). The asymptotic equiprobability takes place under more general assumptions in the case of ergodic stationary processes as well. Boltzmann called the above equiprobable realizations ‘micro states’, contrasting them with ‘macro states’ formed by an ensemble of ‘micro states’.
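The decay of the fraction of essential realizations can be made concrete with a short computation. The sketch below evaluates the logarithm of that fraction, n(H - ln m), for a hypothetical three-letter source; the distributions and the function name are assumptions made for the illustration.

```python
import math


def log_fraction_essential(probs, n):
    """ln(e^{n H} / m^n) = n * (H - ln m) for i.i.d. symbols with distribution `probs`."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return n * (h - math.log(len(probs)))


skewed = [0.7, 0.2, 0.1]          # hypothetical source with H < ln 3
uniform = [1 / 3, 1 / 3, 1 / 3]   # equiprobable case, H = ln 3

logs_skewed = [log_fraction_essential(skewed, n) for n in (10, 100, 1000)]
log_uniform = log_fraction_essential(uniform, 1000)

print(logs_skewed, log_uniform)
```

For the skewed source the logarithm of the fraction decreases linearly in n, so the fraction itself decays exponentially, while for the uniform source it stays at zero.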
1.5 Asymptotic equiprobability and entropic stability
1. The ideas of the preceding section concerning asymptotic equivalence of non-equiprobable and equiprobable outcomes can be extended to essentially more general cases of random sequences and processes. It is not necessary for the random variables $\xi_j$ forming the sequence $\eta^n = (\xi_1, \ldots, \xi_n)$ to take only one of two values and to have the same distribution law $P(\xi_j)$. There is also no need for $\xi_j$ to be statistically independent, or even for $\eta^n$ to be the sequence $(\xi_1, \ldots, \xi_n)$. So what is really necessary for the asymptotic equivalence?
In order to state the property of asymptotic equivalence of non-equiprobable and equiprobable outcomes in general terms we should use the notion of entropic stability of a family of random variables. A family of random variables $\{\eta^n\}$ is entropically stable if the ratio $H(\eta^n)/H_{\eta^n}$ converges in probability to one as $n \to \infty$. This means that whatever $\varepsilon > 0$, $\eta > 0$ are, there exists $N(\varepsilon, \eta)$ such that the inequality
$$P\{ |H(\eta^n)/H_{\eta^n} - 1| \geq \varepsilon \} < \eta \tag{1.5.1}$$
is satisfied for every $n \geq N(\varepsilon, \eta)$.
The above definition implies that $0 < H_{\eta^n} < \infty$ and that $H_{\eta^n}$ does not decrease with $n$. Usually $H_{\eta^n} \to \infty$. Asymptotic equiprobability can be expressed in terms of entropic stability in the form of the following general theorem.
Theorem 1.9. If a family of random variables $\{\eta^n\}$ is entropically stable, then the set of realizations of each random variable can be partitioned into two subsets $A_n$ and $B_n$ in such a way that
1. The total probability of realizations from subset $A_n$ vanishes:
$$P(A_n) \to 0 \quad \text{as} \quad n \to \infty; \tag{1.5.2}$$
2. Realizations of the second subset $B_n$ become relatively equiprobable in the following sense:
$$\left| \frac{\ln P(\eta) - \ln P(\eta')}{\ln P(\eta)} \right| \to 0 \quad \text{as} \quad n \to \infty, \qquad \eta \in B_n, \quad \eta' \in B_n; \tag{1.5.3}$$
3. The number $M_n$ of realizations from subset $B_n$ is associated with the entropy $H_{\eta^n}$ via the relation
$$\ln M_n / H_{\eta^n} \to 1 \quad \text{as} \quad n \to \infty. \tag{1.5.4}$$
Proof. Putting, for instance, $\varepsilon = 1/m$ and $\eta = 1/m$, we obtain from (1.5.1) that
$$P\{ |H(\eta^n)/H_{\eta^n} - 1| \geq 1/m \} < 1/m \tag{1.5.5}$$
for $n \geq N(1/m, 1/m)$. Let $m$ increase by ranging over consecutive integers. We define sets $A_n$ as sets of realizations satisfying the inequality
$$|H(\eta^n)/H_{\eta^n} - 1| \geq 1/m \tag{1.5.6}$$
when $N(1/m, 1/m) \leq n < N(1/(m+1), 1/(m+1))$. For such a definition, property (1.5.2) evidently holds true due to (1.5.5). For the complementary set $B_n$ we have
$$(1 - 1/m) H_{\eta^n} < -\ln P(\eta^n) < (1 + 1/m) H_{\eta^n} \qquad (\eta^n \in B_n), \tag{1.5.7}$$
which entails
$$\left| \frac{\ln P(\eta^n) - \ln P(\eta'^n)}{\ln P(\eta^n)} \right| \leq \frac{2 H_{\eta^n}/m}{-\ln P(\eta^n)} < \frac{2/m}{1 - 1/m}$$
for $\eta^n \in B_n$, $\eta'^n \in B_n$. This proves property (1.5.3), since as $n \to \infty$ the index $m$ determined by $N(1/m, 1/m) \leq n < N(1/(m+1), 1/(m+1))$ also tends to infinity. According to (1.5.7), the probabilities $P(\eta^n)$ of all realizations from set $B_n$ lie in the range
$$e^{-(1 + 1/m) H_{\eta^n}} < P(\eta^n) < e^{-(1 - 1/m) H_{\eta^n}}.$$
At the same time, the total probability $P(B_n) = \sum_{B_n} P(\eta^n)$ is enclosed between $1 - 1/m$ and 1. Hence we get the following range for the number of terms:
$$(1 - 1/m)\, e^{(1 - 1/m) H_{\eta^n}} < M_n < e^{(1 + 1/m) H_{\eta^n}}$$
(on the left-hand side the least number is divided by the greatest number, and on the right-hand side, vice versa). Therefore,
$$1 - \frac{1}{m} + \frac{\ln(1 - 1/m)}{H_{\eta^n}} < \frac{\ln M_n}{H_{\eta^n}} < 1 + \frac{1}{m},$$
which entails (1.5.4). Note that the term $\ln(1 - 1/m)/H_{\eta^n}$ converges to zero because the entropy $H_{\eta^n}$ does not decrease. The proof is complete.
The property of entropic stability, which plays a big role according to Theorem 1.9, can be conveniently checked for different examples by determining the variance of random entropy:
$$\mathrm{Var}(H(\eta^n)) = E[H^2(\eta^n)] - H_{\eta^n}^2.$$
If this variance does not increase too fast as $n$ grows, then applying Chebyshev’s inequality can prove (1.5.1), i.e. entropic stability. We now formulate three theorems related to this question.
Theorem 1.10. If there exists the null limit
$$\lim_{n \to \infty} \frac{\mathrm{Var}(H(\eta^n))}{H_{\eta^n}^2} = 0, \tag{1.5.8}$$
then the family of random variables $\{\eta^n\}$ is entropically stable.
Proof. According to Chebyshev’s inequality we have
$$P\{ |\xi - E[\xi]| \geq \varepsilon \} \leq \mathrm{Var}(\xi)/\varepsilon^2$$
for every random variable with a finite variance and every $\varepsilon > 0$. Supposing here that $\xi = H(\eta^n)/H_{\eta^n}$, so that $E[\xi] = 1$ and $\mathrm{Var}(\xi) = \mathrm{Var}(H(\eta^n))/H_{\eta^n}^2$, and taking into account (1.5.8), we obtain
$$P\left\{ \left| \frac{H(\eta^n)}{H_{\eta^n}} - 1 \right| \geq \varepsilon \right\} \to 0 \quad \text{as} \quad n \to \infty$$
for every $\varepsilon$. Hence (1.5.1) follows, i.e. entropic stability.
Theorem 1.11. If the entropy $H_{\eta^n}$ increases without bound and there exists a finite upper limit
$$\limsup_{n \to \infty} \frac{\mathrm{Var}(H(\eta^n))}{H_{\eta^n}} < C, \tag{1.5.9}$$
then the family of random variables is entropically stable.
Proof. To prove the theorem it suffices to note that the quantity $\mathrm{Var}(H(\eta^n))/H_{\eta^n}^2$, which on the strength of (1.5.9) can be bounded as follows:
$$\frac{\mathrm{Var}(H(\eta^n))}{H_{\eta^n}^2} \leq \frac{C + \varepsilon}{H_{\eta^n}} \qquad (\text{for } n > N(\varepsilon)),$$
tends to zero, because $H_{\eta^n}$ increases (and $\varepsilon$ is arbitrary), so that Theorem 1.10 can be applied to this case.
In many important practical cases the following limits exist:
$$H_1 = \lim_{n \to \infty} \frac{1}{n} H_{\eta^n}; \qquad D_1 = \lim_{n \to \infty} \frac{1}{n} \mathrm{Var}\, H(\eta^n), \tag{1.5.10}$$
which can be called the entropy rate and the variance rate, respectively. A number of more or less general methods have been developed to calculate these rate quantities. According to the theorem stated below, finiteness of these limits guarantees entropic stability.
Theorem 1.12. If limits (1.5.10) exist and are finite, the first of them being different from zero, then the corresponding family of random variables is entropically stable.
Proof. To prove the theorem we use the fact that formulae (1.5.10) imply that
$$H_{\eta^n} = H_1 n + o(n), \tag{1.5.11}$$
$$\mathrm{Var}(H(\eta^n)) = D_1 n + o(n). \tag{1.5.12}$$
Here, as usual, $o(n)$ means that $o(n)/n \to 0$. Since $H_1 > 0$, the entropy $H_{\eta^n}$ increases without bound. Dividing expression (1.5.12) by (1.5.11), we obtain a finite limit
$$\lim_{n \to \infty} \frac{\mathrm{Var}(H(\eta^n))}{H_{\eta^n}} = \lim_{n \to \infty} \frac{D_1 + o(1)}{H_1 + o(1)} = \frac{D_1}{H_1}.$$
Thus, the conditions of Theorem 1.11 are satisfied, which proves entropic stability.
Example 1.2. Suppose we have an infinite sequence $\{\xi_j\}$ of discrete random variables that are statistically independent but distributed differently. We assume that the entropies $H_{\xi_j}$ of all those variables are finite and uniformly bounded below:
$$H_{\xi_j} > C_1. \tag{1.5.13}$$
We also assume that the variances of their random entropies are uniformly bounded above:
$$\mathrm{Var}(H(\xi_j)) < C_2. \tag{1.5.14}$$
We define random variables $\eta^n$ as a selection of the first $n$ elements of the sequence, $\eta^n = \{\xi_1, \ldots, \xi_n\}$. Then due to (1.5.13), (1.5.14) and statistical independence, i.e. additivity of entropies, we will have
$$H_{\eta^n} > C_1 n; \qquad \mathrm{Var}(H(\eta^n)) < C_2 n.$$
In this case, the conditions of Theorem 1.11 are satisfied, and, thereby, entropic stability of the family $\{\eta^n\}$ follows from it.
Other more complex examples of entropically stable random variables, which cannot be decomposed into statistically independent variables, will be considered further. The concept of entropic stability can be applied (less rigorously though) to a single random variable $\eta$ instead of a sequence of random variables $\{\eta^n\}$. Then we perceive an entropically stable random variable $\eta$ as a random variable, for which the probability
$$P\left\{ \left| \frac{H(\eta)}{H_\eta} - 1 \right| < \varepsilon \right\}$$
is quite close to one for some small value of $\varepsilon$, i.e. $H(\eta)/H_\eta$ is sufficiently close to one. In what follows, we shall introduce other notions: informational stability, canonical stability and others, which resemble entropic stability in many respects.
2. In order to derive some results related to entropically stable random variables, it is sometimes convenient to consider the characteristic potential $\mu_0(\alpha)$ of entropy $H(\eta)$ defined by the formula
$$e^{\mu_0(\alpha)} = \sum_{\eta} e^{\alpha H(\eta)} P(\eta) = \sum_{\eta} P^{1 - \alpha}(\eta). \tag{1.5.15}$$
This potential is similar to the potentials that will be considered further (Sections 4.1 and 4.4). With the help of this potential it is convenient to investigate the rate of convergence (1.5.2)–(1.5.4) in Theorem 1.9. This subject is covered in the following theorem.
Theorem 1.13. Let potential (1.5.15) be defined and differentiable on the interval $s_1 < \alpha < s_2$ ($s_1 < 0$; $s_2 > 0$), and let the equation
$$\frac{d\mu_0}{ds}(s) = (1 + \varepsilon) H_\eta \qquad (\varepsilon > 0) \tag{1.5.16}$$
have a root $s \in [0, s_2]$. Then the subset $A$ of realizations of $\eta$ defined by the constraint
$$\left| \frac{H(\eta)}{H_\eta} - 1 \right| > \varepsilon \tag{1.5.17}$$
has the probability
$$P(A) \leq e^{-s \mu_0'(s) + \mu_0(s)}. \tag{1.5.18}$$
The rest of the realizations, constituting the complementary subset $B$, have probabilities related by
$$\left| \frac{\ln P(\eta) - \ln P(\eta')}{\ln P(\eta)} \right| < \frac{2\varepsilon}{1 - \varepsilon} \qquad (\eta, \eta' \in B), \tag{1.5.19}$$
and the number $M$ of such realizations satisfies the inequality
$$1 - \varepsilon + \frac{1}{H_\eta} \ln\left( 1 - e^{-s \mu_0'(s) + \mu_0(s)} \right) < \frac{\ln M}{H_\eta} < 1 + \varepsilon. \tag{1.5.20}$$
Proof. The proof is analogous to that of Theorem 1.9 in many aspects. Relation (1.5.19) follows from (1.5.17). The inequality opposite to (1.5.17), which holds for the realizations from $B$, is equivalent to the inequality
$$e^{-(1 + \varepsilon) H_\eta} < P(\eta) < e^{-(1 - \varepsilon) H_\eta}.$$
Taking it into account and considering the equality
$$\sum_{B} P(\eta) = 1 - P(A),$$
we find that the number of terms in the above sum (i.e. the number of realizations in $B$) satisfies the inequality
$$[1 - P(A)]\, e^{(1 - \varepsilon) H_\eta} < M < [1 - P(A)]\, e^{(1 + \varepsilon) H_\eta}.$$
Therefore,
$$(1 - \varepsilon) H_\eta + \ln[1 - P(A)] < \ln M < (1 + \varepsilon) H_\eta + \ln[1 - P(A)].$$
If we take into account the negativity of $\ln[1 - P(A)]$ and inequality (1.5.18), then (1.5.20) follows from the last formula. Thus, we only need to justify (1.5.18) in order to complete the proof. We will obtain the desired relation if we apply Theorem 4.7 (Section 4.4), which will be proven later on, with $\xi = \eta$, $B(\xi) = H(\eta)$. This completes the proof.
We can obtain a number of simple approximate relations from the formulae provided in the previous theorem if we use the condition that $\varepsilon$ is small. For $\varepsilon = 0$, there is a null root $s = 0$ of equation (1.5.16), because $\frac{d\mu_0}{ds}(0) = H_\eta$. For small values of $\varepsilon$, the value of $s$ is small as well. Thus, the left-hand side of equation (1.5.16) can be expanded into the Maclaurin series
$$\mu_0'(s) = \mu_0'(0) + \mu_0''(0) s + \cdots,$$
so that the root of the equation will take the form
$$s = \frac{\varepsilon H_\eta}{\mu_0''(0)} + O(\varepsilon^2). \tag{1.5.21}$$
Furthermore, we write down the Maclaurin expansion for the expression in the exponent of (1.5.18):
$$s \mu_0'(s) - \mu_0(s) = \frac{1}{2} \mu_0''(0) s^2 + O(s^3).$$
Substituting (1.5.21) into the above expression, we obtain
$$P(A) \leq \exp\left\{ -\frac{\varepsilon^2 H_\eta^2}{2 \mu_0''(0)} \right\} [1 + O(\varepsilon^3)]. \tag{1.5.22}$$
Here $\mu_0''(0) > 0$ is a positive quantity that is equal to the variance
$$\mu_0''(0) = \mathrm{Var}(H(\eta)) \tag{1.5.23}$$
due to general properties of the characteristic potentials.
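The properties of the characteristic potential used above can be verified on a small example. The sketch below assumes a hypothetical four-point distribution, checks that the first derivative of the potential at zero equals the entropy and that the second derivative equals the variance of the random entropy, and then solves (1.5.16) by bisection to evaluate the right-hand side of (1.5.18). For comparison it computes the probability of the upper tail H(η) > (1 + ε)H_η only; all names are illustrative.

```python
import math

probs = [0.5, 0.25, 0.125, 0.125]  # hypothetical distribution P(eta)


def mu0(alpha):
    """Characteristic potential (1.5.15): ln of the sum of P(eta)**(1 - alpha)."""
    return math.log(sum(p ** (1.0 - alpha) for p in probs))


def mu0_prime(alpha):
    """Exact derivative of mu0: weighted mean of H(eta) = -ln P(eta)."""
    weights = [p ** (1.0 - alpha) for p in probs]
    return sum(-math.log(p) * w for p, w in zip(probs, weights)) / sum(weights)


entropy = -sum(p * math.log(p) for p in probs)
variance = sum(p * math.log(p) ** 2 for p in probs) - entropy ** 2

# Second derivative at zero by a central finite difference.
step = 1e-4
second_derivative = (mu0(step) - 2 * mu0(0.0) + mu0(-step)) / step ** 2

# Solve mu0'(s) = (1 + eps) * H for s >= 0 by bisection (mu0' is non-decreasing).
eps = 0.2
lo, hi = 0.0, 5.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mu0_prime(mid) < (1 + eps) * entropy:
        lo = mid
    else:
        hi = mid
s = 0.5 * (lo + hi)

bound = math.exp(-s * mu0_prime(s) + mu0(s))  # right-hand side of (1.5.18)
upper_tail = sum(p for p in probs if -math.log(p) > (1 + eps) * entropy)

print(entropy, mu0_prime(0.0), variance, second_derivative, s, bound, upper_tail)
```

For a single random variable with so few outcomes the bound is loose; it becomes informative when the entropy is large, for example for long sequences.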
We see that the characteristic potential of entropy helps us to investigate problems related to entropic stability.
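Entropic stability in the sense of (1.5.1) can also be observed directly in the simplest case of independent binary symbols. The following sketch computes the probability in (1.5.1) exactly from the binomial distribution; the parameter values p = 0.2 and ε = 0.1 and the function name are assumptions made for the illustration.

```python
import math


def deviation_probability(n, p, eps):
    """Exact P{|H(eta^n)/H_{eta^n} - 1| >= eps} for n i.i.d. Bernoulli(p) symbols."""
    q = 1.0 - p
    mean_entropy = -n * (p * math.log(p) + q * math.log(q))  # H_{eta^n}
    total = 0.0
    for n1 in range(n + 1):
        # Random entropy of a realization with n1 ones: -ln(p**n1 * q**(n - n1)).
        random_entropy = -(n1 * math.log(p) + (n - n1) * math.log(q))
        if abs(random_entropy / mean_entropy - 1.0) >= eps:
            log_prob = (
                math.lgamma(n + 1) - math.lgamma(n1 + 1) - math.lgamma(n - n1 + 1)
                + n1 * math.log(p) + (n - n1) * math.log(q)
            )
            total += math.exp(log_prob)
    return total


p, eps = 0.2, 0.1
tail = {n: deviation_probability(n, p, eps) for n in (50, 500, 5000)}

for n, value in tail.items():
    print(n, value)
```

The computed probabilities decrease as n grows, as required by the definition of an entropically stable family.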
1.6 Definition of entropy of a continuous random variable
Up to now we have assumed that a random variable $\xi$, with entropy $H_\xi$, can take values from some discrete space consisting of either a finite or a countable number of elements, for instance, messages, symbols, etc. However, continuous variables are also widespread in engineering, i.e. variables (scalar or vector) which can take values from a continuous space $X$, most often from the space of real numbers. Such a random variable $\xi$ is described by the probability density function $p(\xi)$ that assigns the probability
$$\Delta P = \int_{\xi \in \Delta X} p(\xi)\, d\xi \approx p(A) \Delta V \qquad (A \in \Delta X)$$
of $\xi$ appearing in region $\Delta X$ of the specified space $X$ with volume $\Delta V$ ($d\xi = dV$ is a differential of the volume).
How can we define entropy $H_\xi$ for such a random variable? One of many possible formal ways is the following. In the formula
$$H_\xi = -\sum_{\xi} P(\xi) \ln P(\xi) = -E[\ln P(\xi)], \tag{1.6.1}$$
appropriate for a discrete variable, we formally replace probabilities $P(\xi)$ in the argument of the logarithm by the probability density and, thereby, consider the expression
$$H_\xi = -E[\ln p(\xi)] = -\int_X p(\xi) \ln p(\xi)\, d\xi. \tag{1.6.2}$$
This way of defining entropy is not well justified. It remains unclear how to define entropy in the combined case, when a continuous distribution in a continuous space coexists with concentrations of probability at single points, i.e. the probability density contains delta-shaped singularities. Entropy (1.6.2) also suffers from the drawback that it is not invariant, i.e. it changes under a non-degenerate transformation of variables η = f (ξ ) in contrast to entropy (1.6.1), which remains invariant under such transformations. That is why we deem it expedient to give a somewhat different definition of entropy for a continuous random variable. Let us approach this definition based on formula (1.6.1). We shall assume that ξ is a discrete random variable with probabilities concentrated at points Ai of a continuous space X: P(Ai ) = ωi . This means that the probability density has the form
$$p(\xi) = \sum_{i} \omega_i \delta(\xi - A_i), \qquad \xi \in X. \tag{1.6.3}$$
Employing formula (1.6.1) we have
$$H_\xi = -\sum_{i} P(A_i) \ln P(A_i). \tag{1.6.4}$$
Further, let points $A_i$ be located in space $X$ sufficiently densely, so that there is a quite large number $\Delta N$ of points within a relatively small region $\Delta X$ having volume $\Delta V$. Region $\Delta X$ is assumed to be small in the sense that all probabilities $P(A_i)$ of points within it are approximately equal:
$$P(A_i) \approx \frac{\Delta P}{\Delta N} \qquad \text{for} \quad A_i \in \Delta X. \tag{1.6.4a}$$
Then, for the sum over points lying within $\Delta X$, we have
$$-\sum_{A_i \in \Delta X} P(A_i) \ln P(A_i) \approx -\Delta P \ln \frac{\Delta P}{\Delta N}.$$
Summing with respect to all regions $\Delta X$, we see that the entropy (1.6.4) assumes the form
$$H_\xi \approx -\sum \Delta P \ln \frac{\Delta P}{\Delta N}. \tag{1.6.5}$$
If we introduce a measure $\nu_0(\xi)$ specifying the density of points $A_i$, and such that by integrating $\nu_0(\xi)$ we calculate the number of points
$$\Delta N = \int_{\xi \in \Delta X} \nu_0(\xi)\, d\xi$$
inside any region $\Delta X$, then entropy (1.6.4) can be written as
$$H_\xi = -\int_X p(\xi) \ln \frac{p(\xi)}{\nu_0(\xi)}\, d\xi. \tag{1.6.6}$$
Evidently, a delta-shaped density $\nu_0(\xi)$ of the form
$$\nu_0(\xi) = \sum_{i} \delta(\xi - A_i)$$
corresponds to the probability density of distribution (1.6.3). From these delta-shaped densities we can move to the smoothed densities
$$\tilde{p}(\xi) = \int K(\xi - \xi') p(\xi')\, d\xi', \qquad \tilde{\nu}_0(\xi) = \int K(\xi - \xi') \nu_0(\xi')\, d\xi'. \tag{1.6.7}$$
If the ‘smoothing radius’ $r_0$ corresponding to the width of function $K(x)$ is relatively small (it is necessary that the probabilities $P(A_i)$ do not change essentially over such distances), then smoothing has little effect on the probability density ratio:
$$\frac{\tilde{p}(\xi)}{\tilde{\nu}_0(\xi)} \approx \frac{p(\xi)}{\nu_0(\xi)}.$$
Therefore, we obtain from (1.6.6) that
$$H_\xi \approx -\int \tilde{p}(\xi) \ln \frac{\tilde{p}(\xi)}{\tilde{\nu}_0(\xi)}\, d\xi. \tag{1.6.8}$$
This formula can also be derived from (1.6.5), since $\Delta P \approx \tilde{p} \Delta V$, $\Delta N \approx \tilde{\nu}_0 \Delta V$ when the ‘radius’ $r_0$ is significantly smaller than the sizes of regions $\Delta X$. However, if the ‘smoothing radius’ $r_0$ is significantly larger than the mean distance between points $A_i$, then smoothed functions (1.6.7) will have a simple (smooth) representation, which was assumed in, say, formula (1.6.2). Discarding the tilde signs in (1.6.8) for both densities, instead of (1.6.2) we obtain the following formula for entropy $H_\xi$:
$$H_\xi = -\int_X p(\xi) \ln \frac{p(\xi)}{\nu_0(\xi)}\, d\xi. \tag{1.6.9}$$
Here $\nu_0(\xi)$ is some auxiliary density, which is assumed to be given. It is not necessarily normalized. Besides, according to the aforementioned interpretation of entropy (1.6.9) as a particular case of entropy (1.6.1), the normalizing integral
$$\int_X \nu_0(\xi)\, d\xi = N \tag{1.6.10}$$
is assumed to be a rather large number ($N \gg 1$) interpreted as the total number of points $A_i$, where probabilities $P(A_i)$ are concentrated. However, such an interpretation is not necessary and, in particular, we can put $N = 1$. If we introduce the normalized density
$$q(\xi) = \nu_0(\xi)/N \tag{1.6.11}$$
(it is possible to do so when $N$ is finite), then, evidently, it follows from formula (1.6.9) that
$$H_\xi = \ln N - \int p(\xi) \ln \frac{p(\xi)}{q(\xi)}\, d\xi. \tag{1.6.12}$$
Definition (1.6.9) of entropy $H_\xi$ can be generalized both to the combined case and to the general case of an abstract random variable as well. The latter is thought to be given if some space $X$ of points $\xi$ and a Borel field $F$ ($\sigma$-algebra) of its subsets are fixed. In such cases, it is said that a measurable space $(X, F)$ is given. Moreover, on this field a probability measure $P(A)$ ($A \in F$) is defined, such that $P(X) = 1$. In defining entropy we require that, on a measurable space $(X, F)$, an auxiliary measure
$\nu(A)$ should be given such that the measure $P$ is absolutely continuous with respect to $\nu$. A measure $P$ is called absolutely continuous with respect to measure $\nu$ if for every set $A$ from $F$ such that $\nu(A) = 0$, the equality $P(A) = 0$ holds. According to the well-known Radon–Nikodym theorem, it follows from the absolute continuity of measure $P$ with respect to measure $\nu$ that there exists an $F$-measurable function $f(x)$, denoted $dP/d\nu$ and called the Radon–Nikodym derivative, which generalizes the notion of probability density. It is defined for all points from space $X$ excluding, perhaps, some subset $\Lambda$, for which $\nu(\Lambda) = 0$ and therefore $P(\Lambda) = 0$. Thus, if the condition of absolute continuity is satisfied, then the entropy $H_\xi$ is defined with the help of the Radon–Nikodym derivative by the formula
$$H_\xi = -\int_{X - \Lambda - \Lambda_0} \ln \frac{dP}{d\nu}(\xi)\, P(d\xi). \tag{1.6.13}$$
The subset $\Lambda$, for which function $dP/d\nu$ is not defined, has no effect on the result of integration since it has null measure $P(\Lambda) = \nu(\Lambda) = 0$. Also, there is one more inessential subset $\Lambda_0$, namely, a subset on which function $dP/d\nu$ is defined but equal to zero, because
$$P(\Lambda_0) = \int_{\Lambda_0} \frac{dP}{d\nu}\, \nu(d\xi) = \int_{\Lambda_0} 0 \cdot \nu(d\xi) = 0$$
even if $\nu(\Lambda_0) \neq 0$. Therefore, some indefiniteness of the function $f(\xi) = dP/d\nu$ and infinite values of $\ln f(\xi)$ at points with $f(\xi) = 0$ do not make an impact on definition (1.6.13) of entropy $H_\xi$ in the case of absolute continuity of $P$ with respect to $\nu$. Therefore, the quantity
$$H(\xi) = -\ln \frac{dP}{d\nu}(\xi) \tag{1.6.14}$$
plays the role of random entropy analogous to random entropy (1.2.2). It is defined almost everywhere in $X$, i.e. in the entire space excluding, perhaps, sets $\Lambda + \Lambda_0$ of zero probability $P$.
By analogy with (1.6.11), if $N = \nu(X) < \infty$, then instead of $\nu(A)$ we can introduce a normalized (i.e. probability) measure
$$Q(A) = \nu(A)/N, \qquad A \in F, \tag{1.6.15}$$
and transform (1.6.13) to the form
$$H_\xi = \ln N - \int \ln \frac{dP}{dQ}(\xi)\, P(d\xi) \tag{1.6.16}$$
that is analogous to (1.6.12). The quantity
$$H_\xi^{P/Q} = \int \ln \frac{dP}{dQ}(\xi)\, P(d\xi) \tag{1.6.17}$$
is non-negative. This statement is analogous to Theorem 1.5 and can be proven by the same method. In consequence of the indicated non-negativity, quantity (1.6.17) can be called the entropy of probability distribution $P$ relative to probability distribution $Q$.
The definition of entropy (1.6.9), (1.6.13), given in this section, allows us to consider entropy (1.6.2) as a particular case of that general definition. In fact, formula (1.6.2) is entropy (1.6.13) for the case when measure $\nu$ corresponds to a uniform unit density $\nu_0(\xi) = d\nu/d\xi = 1$.
It is worth noting that entropy (1.6.17) of probability measure $P$ relative to probability measure $Q$ can be used as an indicator of the degree of difference between measures $P$ and $Q$ (refer to the book by Kullback [30] or its original in English [29] regarding this subject). This is facilitated by the fact that it becomes zero for coincident measures $P(\cdot) = Q(\cdot)$ and is positive for non-coincident ones. Another indicator of difference between measures, which possesses the same properties and may be regarded as a ‘distance’, is defined by the formula
$$s(P, Q) = \inf \int_P^Q ds, \tag{1.6.18}$$
where
$$ds^2 = 2 H^{P/P \pm \delta P} = 2 \int \ln \frac{P(d\zeta)}{P(d\zeta) \pm \delta P(d\zeta)}\, P(d\zeta). \tag{1.6.19}$$
By expanding the function $-\ln(1 \pm \delta P/P)$ in powers of $\delta P/P$, it is not difficult to verify that this metric can also be given by the equivalent formula
$$ds^2 = \int \frac{[\delta P(d\zeta)]^2}{P(d\zeta)} = \int [\delta \ln P(d\zeta)]^2\, P(d\zeta). \tag{1.6.20}$$
We connect points $P$ and $Q$ by a ‘curve’, that is, a family of points $P_\lambda$ dependent on parameter $\lambda$, so that $P_0 = P$, $P_1 = Q$. Then (1.6.18) can be rewritten as
$$s(P, Q) = \inf \int_0^1 \frac{ds}{d\lambda}\, d\lambda, \tag{1.6.21}$$
where due to (1.6.20) we have
$$\left( \frac{ds}{d\lambda} \right)^2 = \int \left[ \frac{\partial \ln P_\lambda(d\zeta)}{\partial \lambda} \right]^2 P_\lambda(d\zeta). \tag{1.6.22}$$
Here and further we assume that differentiability conditions are satisfied. It follows from (1.6.22) that
$$\left( \frac{ds}{d\lambda} \right)^2 = -\int \frac{\partial^2 \ln P_\lambda(d\zeta)}{\partial \lambda^2}\, P_\lambda(d\zeta). \tag{1.6.23}$$
Indeed, the difference of these expressions
$$\int \left\{ \left[ \frac{\partial \ln P_\lambda(d\zeta)}{\partial \lambda} \right]^2 + \frac{\partial^2 \ln P_\lambda(d\zeta)}{\partial \lambda^2} \right\} P_\lambda(d\zeta) = \frac{\partial}{\partial \lambda} \int \frac{\partial \ln P_\lambda(d\zeta)}{\partial \lambda}\, P_\lambda(d\zeta)$$
is equal to zero, because the expression
$$\int \frac{\partial \ln P_\lambda(d\zeta)}{\partial \lambda}\, P_\lambda(d\zeta) = \frac{\partial}{\partial \lambda} \int P_\lambda(d\zeta) = \frac{\partial}{\partial \lambda} 1 = 0$$
similarly disappears.
One can see from the definition above that, as points $P$ and $Q$ get closer, the entropies $H^{P/Q}$, $H^{Q/P}$ and the quantity $\frac{1}{2} s^2(P, Q)$ cease to differ. However, we can formulate a theorem that connects these quantities not only in the case of close points.
Theorem 1.14. The squared ‘distance’ (1.6.18) is bounded from above by the sum of entropies:
$$s^2(P, Q) \leq H^{P/Q} + H^{Q/P}. \tag{1.6.24}$$
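Proof. We connect points $P$ and $Q$ by the curve
$$P_\lambda(d\zeta) = e^{-\Gamma(\lambda)} [P(d\zeta)]^{1 - \lambda} [Q(d\zeta)]^{\lambda} \qquad \left( \Gamma(\lambda) = \ln \int [P(d\zeta)]^{1 - \lambda} [Q(d\zeta)]^{\lambda} \right). \tag{1.6.25}$$
We obtain
$$-\frac{\partial^2 \ln P_\lambda(d\zeta)}{\partial \lambda^2} = \frac{d^2 \Gamma(\lambda)}{d\lambda^2}.$$
Therefore, due to (1.6.23) we get
$$\left( \frac{ds}{d\lambda} \right)^2 = \frac{d^2 \Gamma}{d\lambda^2},$$
and also
$$-\frac{\partial^2 \ln P_\lambda(d\zeta)}{\partial \lambda^2} = \left( \frac{ds}{d\lambda} \right)^2. \tag{1.6.26}$$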
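Moreover, it is not difficult to verify that expression (1.6.17) can be rewritten in the following integral form:
$$H^{P/Q} = -\int_0^1 d\lambda \int \frac{\partial \ln P_\lambda(d\zeta)}{\partial \lambda}\, P_0(d\zeta) = -\int_0^1 d\lambda \int_0^{\lambda} d\lambda' \int \frac{\partial^2 \ln P_{\lambda'}(d\zeta)}{\partial \lambda'^2}\, P_0(d\zeta).$$
To derive the last formula we have taken into account that $\int [\partial (\ln P_\lambda(d\zeta))/\partial \lambda]\, P_0(d\zeta) = 0$ for $\lambda = 0$. Then we take into consideration (1.6.26) and obtain that
$$H^{P/Q} = \int_0^1 d\lambda \int_0^{\lambda} \left( \frac{ds}{d\lambda'} \right)^2 d\lambda' = \int_0^1 (1 - \lambda) \left( \frac{ds}{d\lambda} \right)^2 d\lambda. \tag{1.6.27}$$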
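Analogously, switching $P$ and $Q$ while keeping the connecting curve (1.6.25) intact, we obtain
$$H^{Q/P} = \int_0^1 \lambda \left( \frac{ds}{d\lambda} \right)^2 d\lambda. \tag{1.6.28}$$
Next, we add up (1.6.28), (1.6.27), which gives $H^{P/Q} + H^{Q/P} = \int_0^1 (ds/d\lambda)^2\, d\lambda$, and apply the Cauchy–Schwarz inequality:
$$\left( \int_0^1 \frac{ds}{d\lambda}\, d\lambda \right)^2 \leq \int_0^1 \left( \frac{ds}{d\lambda} \right)^2 d\lambda.$$
This will yield
$$\left( \int_0^1 \frac{ds}{d\lambda}\, d\lambda \right)^2 \leq H^{P/Q} + H^{Q/P},$$
and, therefore, also (1.6.24) if we account for (1.6.21). The proof is complete.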
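Inequality (1.6.24) can be checked for discrete distributions. The sketch below relies on one fact that is not derived in the text and is stated here as an assumption of the example: for the metric (1.6.20) on a finite probability simplex, the infimum in (1.6.18) has the closed form 2 arccos(Σ √(p_i q_i)). The distributions and names are illustrative.

```python
import math


def relative_entropy(p, q):
    """H^{P/Q} = sum of p_i * ln(p_i / q_i) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def geodesic_distance(p, q):
    """Distance induced by ds^2 = sum of (dp_i)^2 / p_i on the probability simplex.

    Uses the closed form 2 * arccos(sum of sqrt(p_i * q_i)), an assumption of
    this sketch that is not derived in the text.
    """
    overlap = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, overlap))


p = [0.5, 0.5]
q = [0.9, 0.1]
s_squared = geodesic_distance(p, q) ** 2
entropy_sum = relative_entropy(p, q) + relative_entropy(q, p)

# For close distributions the relative entropy approaches s^2 / 2.
q_near = [0.501, 0.499]
ratio_near = relative_entropy(p, q_near) / (0.5 * geodesic_distance(p, q_near) ** 2)

print(s_squared, entropy_sum, ratio_near)
```

For this pair the squared distance is slightly below the sum of the two relative entropies, and for the nearby pair the ratio of the relative entropy to half the squared distance is close to one.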
1.7 Properties of entropy in the generalized version. Conditional entropy
Entropy (1.6.13), (1.6.16) defined in the previous section possesses a set of properties which are analogous to the properties of the entropy of a discrete random variable considered earlier. Such an analogy is quite natural if we take into account the interpretation of entropy (1.6.13) (provided in Section 1.6) as an asymptotic case (for large $N$) of entropy (1.6.1) of a discrete random variable.
The non-negativity property of entropy, which was discussed in Theorem 1.1, is not always satisfied for entropy (1.6.13), (1.6.16) but holds true for sufficiently large $N$. The constraint
$$H_\xi^{P/Q} \leq \ln N$$
results in non-negativity of entropy $H_\xi$.
Now we move on to Theorem 1.2, which considered the maximum value of entropy. In the case of entropy (1.6.13), when comparing different distributions $P$ we need to keep measure $\nu$ fixed. As was mentioned, quantity (1.6.17) is non-negative and, thus, (1.6.16) entails the inequality
$$H_\xi \leq \ln N.$$
At the same time, if we suppose $P = Q$, then, evidently, we will have $H_\xi = \ln N$. This proves the following statement that is an analog of Theorem 1.2.
Theorem 1.15. Entropy (1.6.13) attains its maximum value equal to $\ln N$ when measure $P$ is proportional to measure $\nu$.
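The discrete interpretation behind (1.6.12) and Theorem 1.15 can be illustrated numerically. The sketch below places N points on a uniform grid of [0, 1] (so that q = 1), assigns them probabilities proportional to a density p, and compares the discrete entropy (1.6.1) with ln N minus the relative entropy. The density p(ξ) = 2ξ, for which the integral of p ln(p/q) equals ln 2 - 1/2, and the function name are assumptions made for the illustration.

```python
import math


def discrete_entropy_on_grid(density, n_points):
    """Entropy (1.6.1) of probabilities P(A_i) proportional to density(A_i) on a grid of [0, 1]."""
    points = [(i + 0.5) / n_points for i in range(n_points)]
    weights = [density(x) for x in points]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def linear(x):
    """Hypothetical density p(xi) = 2 * xi on [0, 1]."""
    return 2.0 * x


def uniform(x):
    """Density proportional to nu: the maximizing case of Theorem 1.15."""
    return 1.0


n_points = 10_000
h_linear = discrete_entropy_on_grid(linear, n_points)
h_uniform = discrete_entropy_on_grid(uniform, n_points)

# For p = 2 * xi and q = 1 the integral of p * ln(p / q) over [0, 1] is ln 2 - 1/2.
relative = math.log(2.0) - 0.5

print(h_uniform, math.log(n_points), h_linear, math.log(n_points) - relative)
```

The uniform case reproduces ln N, and the non-uniform case falls short of it by approximately the relative entropy, in line with (1.6.12).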
This result is rather natural in the light of the discrete interpretation of formula (1.6.13) given in Section 1.6. Indeed, proportionality of measures P and ν means exactly a uniform probability distribution on discrete points $A_i$ and, thereby, Theorem 1.15 becomes a paraphrase of Theorem 1.2. The following statement is an analog of Theorems 1.2 and 1.15 for entropy $H_\xi^{P/Q}$: entropy $H_\xi^{P/Q}$ attains a minimum value equal to zero when distribution P coincides with Q.

According to Theorem 1.15, it is reasonable to interpret entropy $H_\xi^{P/Q}$, defined by formula (1.6.17), as a deficit of entropy, i.e. as the amount by which entropy falls short of its maximum value.

So far we assumed that measure P is absolutely continuous with respect to measure Q or (which is the same for finite N) measure ν. This raises the question of how to define entropy $H_\xi$ or $H_\xi^{P/Q}$ when there is no such absolute continuity. The answer to this question can be obtained if formula (1.6.16) is considered as an asymptotic case (for very large N) of the discrete version (1.6.1). If in condensing the points $A_i$ (introduced in Section 1.6) we regard, contrary to formula $P(A_i) \approx \Delta P/(N \Delta Q)$ (see (1.6.4a)), the probabilities $P(A_i)$ of some points as finite: $P(A_i) > c > 0$ (c is independent of N), then measure $Q_N$, as N → ∞, will not be absolutely continuous with respect to measure P. In this case, the deficit $H_\xi^{P/Q}$ will increase without bound as N → ∞. This allows us to assume
$$H_\xi^{P/Q} = \infty$$
if measure P is not absolutely continuous with respect to Q (i.e. singular with respect to Q). However, the foregoing does not define the entropy $H_\xi$ in the absence of absolute continuity, since we have an indeterminacy of the type ∞ − ∞ according to (1.6.16). In order to eventually define it, we require a more detailed analysis of the passage to the limit N → ∞ related to condensing points $A_i$.

Other properties of the discrete version of entropy, mentioned in Theorems 1.3, 1.4, 1.6, are related to the entropy of many random variables and to conditional entropy. With a proper definition of the latter notions, the given properties will hold for the generalized version, based on definition (1.6.13), as well. Consider two random variables ξ, η. According to (1.6.13) their joint entropy is of the form
$$H_{\xi,\eta} = -\int \ln \frac{P(d\xi, d\eta)}{\nu(d\xi, d\eta)}\, P(d\xi, d\eta). \qquad (1.7.1)$$
At the same time, applying formula (1.6.13) to a single random variable ξ or η we obtain
$$H_\xi = -\int \ln \frac{P(d\xi)}{\nu_1(d\xi)}\, P(d\xi), \qquad H_\eta = -\int \ln \frac{P(d\eta)}{\nu_2(d\eta)}\, P(d\eta).$$
Here, $\nu_1$, $\nu_2$ are some measures; their relation to ν will be clarified later. We define conditional entropy $H_{\xi|\eta}$ as the difference
$$H_{\xi|\eta} = H_{\xi\eta} - H_\eta, \qquad (1.7.2)$$
i.e. in such a way that the additivity property is satisfied:
$$H_{\xi\eta} = H_\eta + H_{\xi|\eta}. \qquad (1.7.3)$$
Taking into account (1.6.13) and (1.7.2), it is easy to see that for $H_{\xi|\eta}$ we will have the formula
$$H_{\xi|\eta} = -\int \ln \frac{P(d\xi \mid \eta)}{\nu(d\xi \mid \eta)}\, P(d\xi\, d\eta), \qquad (1.7.4)$$
where $P(d\xi \mid \eta)$, $\nu(d\xi \mid \eta)$ are conditional measures defined as the Radon–Nikodym derivatives with the help of the standard relations
$$P(\xi \in A, \eta \in B) = \int_{\eta \in B} P(\xi \in A \mid \eta)\, P(d\eta), \qquad \nu(\xi \in A, \eta \in B) = \int_{\eta \in B} \nu(\xi \in A \mid \eta)\, \nu_2(d\eta)$$
(sets A, B are arbitrary). The definition in (1.7.4) uses the following random entropy:
$$H(\xi \mid \eta) = -\ln \frac{P(d\xi \mid \eta)}{\nu(d\xi \mid \eta)}.$$
The definition of conditional entropy $H_{\xi|\eta}$ provided above can be used multiple times when considering a chain of random variables $\xi_1, \xi_2, \ldots, \xi_n$ step by step. In this case, relation (1.7.3) leads to the hierarchical additivity property
$$H(\xi_1, \ldots, \xi_n) = H(\xi_1) + H(\xi_2 \mid \xi_1) + H(\xi_3 \mid \xi_1, \xi_2) + \cdots + H(\xi_n \mid \xi_1, \ldots, \xi_{n-1}),$$
$$H_{\xi_1, \ldots, \xi_n} = H_{\xi_1} + H_{\xi_2|\xi_1} + H_{\xi_3|\xi_1,\xi_2} + \cdots + H_{\xi_n|\xi_1, \ldots, \xi_{n-1}}, \qquad (1.7.4a)$$
that is analogous to (1.3.4).

In order that the entropies $H_\xi$, $H_{\xi|\eta}$, and other ones be interpreted as measures of uncertainty, it is necessary that in the generalized version the following inequality be valid:
$$H_\xi \ge H_{\xi|\eta}, \qquad (1.7.4b)$$
which is usual in the discrete case. However, we manage to prove this inequality only in a case that is not the most general one, as can be seen from formula (1.7.6) below.

Theorem 1.16. If the following condition is satisfied
$$\nu(d\xi, d\eta) \le \nu_1(d\xi)\, \nu_2(d\eta), \qquad (1.7.5)$$
then the inequality $H_\xi \ge H_{\xi|\eta}$ holds.
Proof. For the difference $H_\xi - H_{\xi|\eta}$ we will have
$$H_\xi - H_{\xi|\eta} = \int P(d\eta) \int \ln \frac{P(d\xi \mid \eta)}{P(d\xi)}\, P(d\xi \mid \eta) + E \ln \frac{\nu_1(d\xi)}{\nu(d\xi \mid \eta)}. \qquad (1.7.6)$$
But for all probability measures $P_1$, $P_2$ the following inequality holds:
$$\int \ln \frac{P_1(d\xi)}{P_2(d\xi)}\, P_1(d\xi) \ge 0. \qquad (1.7.7)$$
This inequality is a generalization of inequality (1.3.7) and can be proven in the same way. Applying (1.7.7) to (1.7.6) for $P_1(d\xi) = P(d\xi \mid \eta)$, $P_2(d\xi) = P(d\xi)$, we obtain that the first term of the right-hand side of (1.7.6) is non-negative. The second term can be represented as follows:
$$E \ln \frac{\nu_1(d\xi)}{\nu(d\xi \mid \eta)} = E \ln \frac{\nu_1(d\xi)\, \nu_2(d\eta)}{\nu(d\xi, d\eta)}.$$
Its non-negativity is seen from here due to condition (1.7.5). Therefore, the difference (1.7.6) is non-negative. The proof is complete.

Inequality (1.7.5) is satisfied naturally when ν(A) is interpreted, according to Section 1.6, as the number of discrete points of zero probability within set A. Then $\nu_1(\Delta\xi)$ and $\nu_2(\Delta\eta)$ are interpreted as the numbers of such points on the intervals $[\xi, \xi + \Delta\xi]$ and $[\eta, \eta + \Delta\eta]$, respectively. Naturally, there remain no more than $\nu_1(\Delta\xi)\, \nu_2(\Delta\eta)$ valid points within the rectangle $[\xi, \xi + \Delta\xi] \times [\eta, \eta + \Delta\eta]$.

If we do not follow the above interpretation, then we will have to postulate condition (1.7.5) independently. It is especially convenient to use the multiplicativity condition
$$\nu(d\xi, d\eta) = \nu_1(d\xi)\, \nu_2(d\eta), \qquad (1.7.8)$$
which we shall postulate in what follows.

Thus, the generalized version of entropy has the usual properties, i.e. it not only possesses the property of hierarchical additivity, but also satisfies the standard inequalities, if the multiplicativity condition (1.7.8) is satisfied.

When there are several random variables $\xi_1, \ldots, \xi_n$, it is convenient to select a measure ν that satisfies the total multiplicativity condition
$$\nu(d\xi_1, \ldots, d\xi_n) = \prod_{k=1}^{n} \nu_k(d\xi_k), \qquad (1.7.9)$$
which we understand as the condition of consistency of auxiliary measures. According to (1.7.9), measure ν of the combined random variable $\xi = (\xi_1, \ldots, \xi_n)$ is decomposed into a product of ‘elementary’ measures $\nu_k$. Thus, in compliance with the aforementioned formulae we will have
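$$H(\xi_k) = -\ln \frac{P(d\xi_k)}{\nu_k(d\xi_k)}, \qquad (1.7.10)$$
$$H_{\xi_k} = E[H(\xi_k)], \qquad (1.7.11)$$
$$H(\xi_k \mid \xi_1, \ldots, \xi_{k-1}) = -\ln \frac{P(d\xi_k \mid \xi_1, \ldots, \xi_{k-1})}{\nu_k(d\xi_k)}, \qquad (1.7.12)$$
$$H_{\xi_k|\xi_1, \ldots, \xi_{k-1}} = E[H(\xi_k \mid \xi_1, \ldots, \xi_{k-1})] = -E \ln \frac{P(d\xi_k \mid \xi_1, \ldots, \xi_{k-1})}{\nu_k(d\xi_k)}. \qquad (1.7.13)$$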
If random variables $\xi_1, \ldots, \xi_n$ are statistically independent, then $P(d\xi_k \mid \xi_1, \ldots, \xi_{k-1}) = P(d\xi_k)$ (with probability 1). Consequently, $H(\xi_k \mid \xi_1, \ldots, \xi_{k-1}) = H(\xi_k)$, $H_{\xi_k|\xi_1, \ldots, \xi_{k-1}} = H_{\xi_k}$ (k = 1, …, n), and (1.7.4a) implies the additivity property
$$H_{\xi_1, \ldots, \xi_n} = \sum_{k=1}^{n} H_{\xi_k},$$
coinciding with (1.2.9).

Let us now consider the normalized measures
$$Q(d\xi, d\eta) = \nu(d\xi, d\eta)/N, \qquad Q_1(d\xi) = \nu_1(d\xi)/N_1, \qquad Q_2(d\eta) = \nu_2(d\eta)/N_2,$$
where
$$N = \int \nu(d\xi, d\eta), \qquad N_1 = \int \nu_1(d\xi), \qquad N_2 = \int \nu_2(d\eta),$$
and the corresponding entropies (1.6.17). The following relations
$$H_{\xi\eta} = \ln N - H_{\xi\eta}^{P/Q}; \qquad H_\xi = \ln N_1 - H_\xi^{P/Q_1}; \qquad H_\eta = \ln N_2 - H_\eta^{P/Q_2} \qquad (1.7.14)$$
of type (1.6.16) are valid for them. Supposing
$$Q_1(d\xi) = Q(d\xi); \qquad Q_2(d\eta) = Q(d\eta) \qquad (1.7.15)$$
and assuming the multiplicativity condition
$$Q(d\xi\, d\eta) = Q(d\xi)\, Q(d\eta), \qquad N = N_1 N_2; \qquad (1.7.16)$$
we define conditional entropy by the natural formula
$$H_{\xi|\eta}^{P/Q} = \int \ln \frac{P(d\xi \mid \eta)}{Q(d\xi)}\, P(d\xi\, d\eta), \qquad (1.7.17)$$
so that
$$H_{\xi|\eta} = \ln N_1 - H_{\xi|\eta}^{P/Q}. \qquad (1.7.17a)$$
In this case, the additivity property holds:
$$H_{\xi\eta}^{P/Q} = H_\eta^{P/Q} + H_{\xi|\eta}^{P/Q}. \qquad (1.7.18)$$
Taking into account (1.7.2), we reduce inequality (1.7.4b) to the form $H_\xi \ge H_{\xi\eta} - H_\eta$. Using (1.7.14), in view of (1.7.15), (1.7.16), we obtain from the last formula that
$$H_{\xi\eta}^{P/Q} - H_\eta^{P/Q} \ge H_\xi^{P/Q}.$$
In consequence of (1.7.18), this inequality can be rewritten as
$$H_\xi^{P/Q} \le H_{\xi|\eta}^{P/Q}. \qquad (1.7.19)$$
Comparing it with (1.7.4b), it is easy to see that the sign ≥ has been replaced with the opposite one (≤) for entropy $H^{P/Q}$. This is convincing evidence that entropies $H_\xi^{P/Q}$, $H_{\xi|\eta}^{P/Q}$ cannot be regarded as measures of uncertainty, in contrast to entropies (1.6.1) or (1.6.13).

In the case of many random variables, it is expedient to impose constraints of type (1.7.15), (1.7.16) for many variables and use the hierarchical additivity property
$$H_{\xi_1, \ldots, \xi_n}^{P/Q} = H_{\xi_1}^{P/Q} + H_{\xi_2|\xi_1}^{P/Q} + H_{\xi_3|\xi_1,\xi_2}^{P/Q} + \cdots + H_{\xi_n|\xi_1, \ldots, \xi_{n-1}}^{P/Q}. \qquad (1.7.20)$$
This property is analogous to (1.3.4) and (1.7.4a).
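The opposite directions of inequalities (1.7.4b) and (1.7.19) can be illustrated on a small discrete example. The sketch below is an assumption-laden toy: the joint distribution `P` on a 3 × 2 grid is hypothetical, and ν is taken to be the counting measure (one unit per point), so that the multiplicativity condition (1.7.8) holds, $N_1 = 3$, $N_2 = 2$, and entropy (1.6.13) reduces to the usual discrete entropy.

```python
import math

# Hypothetical joint distribution P(xi, eta): rows index xi, columns index eta.
P = [[0.30, 0.05],
     [0.10, 0.25],
     [0.05, 0.25]]
N1, N2 = len(P), len(P[0])     # masses of the counting measures nu1, nu2

p_xi = [sum(row) for row in P]
p_eta = [sum(P[i][j] for i in range(N1)) for j in range(N2)]

def entropy(probs):
    """Entropy with respect to the counting measure, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

H_xi = entropy(p_xi)
H_eta = entropy(p_eta)
H_joint = entropy([p for row in P for p in row])
H_cond = H_joint - H_eta                     # definition (1.7.2)

# Deficits with respect to the uniform measure Q, via (1.7.14) and (1.7.17a).
D_xi = math.log(N1) - H_xi
D_cond = math.log(N1) - H_cond

# The same conditional deficit computed directly from formula (1.7.17).
D_cond_direct = sum(P[i][j] * math.log(P[i][j] / p_eta[j] * N1)
                    for i in range(N1) for j in range(N2))

print(f"H_xi       = {H_xi:.4f} >= H_xi|eta       = {H_cond:.4f}")
print(f"H^P/Q_xi   = {D_xi:.4f} <= H^P/Q_xi|eta   = {D_cond:.4f}")
print(f"direct evaluation of (1.7.17): {D_cond_direct:.4f}")
```

Conditioning lowers the entropy $H_\xi$ but raises the deficit $H_\xi^{P/Q}$, which is exactly why the latter cannot serve as a measure of uncertainty.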
Chapter 2
Encoding of discrete information in the absence of noise and penalties
The definition of the amount of information, given in Chapter 1, is justified when we deal with a transformation of information from one kind into another, i.e. when considering encoding of information. It is essential that the law of conservation of information amount holds under such a transformation. It is very useful to draw an analogy with the law of conservation of energy. The latter is the main argument for introducing the notion of energy. Of course, the law of conservation of information is more complex than the law of conservation of energy in two respects. The law of conservation of energy establishes an exact equality of energies, when one type of energy is transformed into another. However, in transforming information we have a more complex relation, namely ‘not greater’ (≤), i.e. the amount of information cannot increase. The equality sign corresponds to optimal encoding. Thus, when formulating the law of conservation of information, we have to point out that there possibly exists such an encoding, for which the equality of the amounts of information occurs.

The second complication is that the equality is not exact. It is approximate, asymptotic, valid for complex (large) messages and for composite random variables. The larger a system of messages is, the more exact such a relation becomes. The exact equality sign takes place only in the limiting case. In this respect, there is an analogy with the laws of statistical thermodynamics, which are valid for large thermodynamic systems consisting of a large number (of the order of the Avogadro number) of molecules.

When conducting encoding, we assume that a long sequence of messages $\xi_1, \xi_2, \ldots$ is given together with their probabilities, i.e. a sequence of random variables. Therefore, the amount of information (entropy H) corresponding to this sequence can be calculated. This information can be recorded and transmitted by different realizations of the sequence. If M is the number of such realizations, then the law of conservation of information can be expressed by the equality H = ln M, which is complicated by the two above-mentioned factors (i.e. actually, $H \lesssim \ln M$).

Two different approaches may be used for solving the encoding problem. One can perform encoding of an infinite sequence of messages, i.e. online (or ‘sliding’) encoding. The inverse procedure, i.e. decoding, will be performed analogously.

© Springer Nature Switzerland AG 2020 R. V. Belavkin et al. (eds.), Theory of Information and its Value, https://doi.org/10.1007/978-3-030-22833-0 2
Typically in online processing, the quantitative equality between the amount of information to be encoded and the amount of information that has been encoded is maintained only on average. With time, a random time lag is produced. For a message sequence of fixed length, the length of its code sequence will have a random spread increasing with time; and vice versa: for a fixed record length, the number of elementary messages transmitted will have an increasing random spread.

Another approach can be called ‘block’ or ‘batch’ encoding. A finite collection (a block) of elementary messages is encoded. If there are several blocks, then different blocks are encoded and decoded independently. In this approach, there is no increase in random time lag, but there is a loss of some realizations of the message. A small portion of message realizations cannot be encoded and is lost, because there are not enough code realizations. If the block is entropically stable, then the probability of such a loss is quite small. Therefore, when we use the block approach, we should investigate problems related to entropic stability.

Following tradition, we mostly consider online encoding in the present chapter. In the last section, we study the errors of the aforementioned type of encoding, which occur in the case of a finite length of an encoding sequence.
2.1 Main principles of encoding discrete information

Let us confirm the validity and efficiency of the definitions of entropy and the amount of information given earlier by considering a transformation of information from one form to another. Such a transformation is called encoding and is applied to transmitting and storing information.

Suppose that the message to be transmitted and recorded is represented as a sequence $\xi_1, \xi_2, \xi_3, \ldots$ of elementary messages or random variables. Let each elementary message be represented in the form of a discrete random variable ξ, which takes one out of m values (say, 1, …, m) with probability P(ξ). It is required to transform the message into a sequence $\eta_1, \eta_2, \ldots$ of letters of some alphabet $\mathcal{A} = (A, B, \ldots, Z)$. The number of letters in this alphabet is denoted by D. We treat a space, punctuation marks and so on as letters of the alphabet, so that a message is represented in one word. It is required to record a message in such a way that the message itself can be recovered from the record without any losses or errors. The respective theory has to show which conditions must be satisfied and how to do so.

Since there is no fixed constraint between the number of elementary messages and the number of letters in the alphabet, the trivial ‘encoding’ of one message by one special letter may not be the best strategy in general. We will associate each separate message, i.e. each realization ξ = i of a random variable, with the corresponding ‘word’ $V(i) = (\eta_1^i, \ldots, \eta_{l_i}^i)$ in the alphabet $\mathcal{A}$ ($l_i$ is the length of that word). The full set of such words (their number equals m) forms a code. Having defined the code, we can translate the message into letters of alphabet $\mathcal{A}$ from its realization $\xi_1, \xi_2, \ldots$, i.e. it will take the form
$$V(\xi_1) V(\xi_2) \ldots = \eta_1 \eta_2 \ldots,$$
where we treat $V(\xi_i)$ as a sequence of letters.

As all words $V(\xi_i)$ are written one after another in a single sequence of letters, there emerges a difficult problem of decomposing a sequence of letters $\eta_1, \eta_2, \ldots$ into words $V(\xi_i)$ [a separate representation of words would be merely a particular (and uneconomical) case of the aforementioned encoding, because a space can be considered as an ordinary letter]. This problem will be covered later on (refer to Section 2.2).

It is desirable to make a record of the message as short as possible as long as it can be done without loss of content. Information theory points out limits to which the length of the record can be compressed (or the time of message transmission if the message is transmitted over some time interval).

Next we explain the theoretical conclusions via argumentation using the results of Sections 1.4 and 1.5. Let the full message $(\xi_1, \ldots, \xi_n)$ consist of n identical independent random elementary messages $\xi_j = \xi$. Then $(\xi_1, \ldots, \xi_n) = \zeta$ will be an entropically stable random variable. If n is sufficiently large, then according to Sections 1.4 and 1.5 we can consider the following simplified model of the full message: it is supposed that only one out of $e^{nH_\xi}$ equiprobable messages occurs, where $H_\xi = -\sum_\xi P(\xi) \ln P(\xi)$. At the same time only one out of $D^L$ possibilities is realized for a literal record containing L consecutive letters. Equating these two quantities, we obtain that the length L of the record is determined by the equation
$$n H_\xi = L \ln D. \qquad (2.1.1)$$
If a shorter representation is used, then the message will be inevitably distorted in some cases. Besides, if a record of length greater than $nH_\xi/\ln D$ is used, then such a record will be uneconomical and non-optimal.

The conclusions from Sections 1.4 and 1.5 can be applied not only to the message sequence $\xi_1, \xi_2, \ldots, \xi_n$ but also to a literal sequence $\eta_1, \eta_2, \ldots, \eta_L$. The maximum entropy (equal to L ln D) of such a sequence is attained for a joint distribution such that adjacent letters are independent and all marginal distributions are equiprobable over the entire alphabet. But probabilities of letters are exactly determined by probabilities of messages and a choice of code. When probabilities of messages are fixed, the problem is to pick out code words and try to attain the aforementioned optimal distribution of letters. If the maximum entropy L ln D of a literal record is attained for some code, then there are $e^{L \ln D} = D^L$ essential realizations of the record and we can establish a one-to-one correspondence between these realizations and essential realizations of the message (the number of which is $e^{nH_\xi}$). Therefore, the optimal relation (2.1.1) is valid. Such codes are called optimal. Since record letters are independent for optimal codes we have $H(\eta_{j+1} \mid \eta_j, \eta_{j-1}, \ldots) = H(\eta_{j+1})$. Besides, since the equiprobable distribution $P(\eta_{j+1}) = 1/D$ over the alphabet is realized, every letter carries identical information
$$H_{\eta_{j+1}} = \ln D. \qquad (2.1.2)$$
This relationship tells us that every letter is used to its ‘full power’; it is an indication of an optimal code. Further, we consider some certain realization of message ξ = j. According to (1.2.1) it contains random information H(ξ) = −ln P(j). But for an optimal code every letter of the alphabet carries information ln D due to (2.1.2). It follows from here that the code word V(ξ) of this message (which also carries information H(ξ) = −ln P(ξ) in the optimal code) must consist of l(ξ) = −ln P(ξ)/ln D letters. For non-optimal encoding, every letter carries less information than ln D. That is why the length of the code word l(ξ) (which carries information −ln P(ξ) = H(ξ)) must be greater. Therefore, for each code $l(\xi) \ge H(\xi)/\ln D$. Averaging of this inequality yields
$$l_{av} \ge H_\xi / \ln D, \qquad (2.1.3)$$
which complies with (2.1.1), since $l_{av} = L/n$.

Thus, the theory points out what cannot be achieved (the length L of the record cannot be made less than $nH_\xi/\ln D$), what can be achieved (the length L can be made close to $nH_\xi/\ln D$) and how to achieve that (by selecting a code that provides, as far as possible, independent equiprobable distributions of letters over the alphabet).

Example 2.1. Let ξ be a random variable with the following values and probabilities:
ξ:      1    2    3    4    5     6     7     8
P(ξ):   1/4  1/4  1/8  1/8  1/16  1/16  1/16  1/16

The message is recorded in the binary alphabet (A, B), so that D = 2. We take the code words

V(1) = (AAA); V(2) = (AAB); V(3) = (ABA); V(4) = (ABB);
V(5) = (BAA); V(6) = (BAB); V(7) = (BBA); V(8) = (BBB).      (2.1.4)
In order to tell whether this code is good or not, we compare it with an optimal code, for which relation (2.1.1) is satisfied. Computing the entropy of an elementary message by formula (1.2.3), we obtain
$$H_\xi = \ln 2 + \tfrac{3}{4} \ln 2 + \ln 2 = 2.75 \ln 2 \text{ nat} = 2.75 \text{ bit}.$$
There are three letters per elementary message in code (2.1.4), whereas the same message requires only $L/n = H_\xi/\ln 2 = 2.75$ letters in the optimal code according to (2.1.1). Consequently, it is possible to compress the record by about 8.3%. As an optimal code we can choose the code
V(1) = (AA); V(2) = (AB); V(3) = (BAA); V(4) = (BAB);
V(5) = (BBAA); V(6) = (BBAB); V(7) = (BBBA); V(8) = (BBBB).      (2.1.5)

It is easy to check that the probability of occurrence of letter A at the first place is equal to 1/2. Thus, the entropy of the first character is equal to $H_{\eta_1} = 1$ bit. Further, when the first character is fixed, the conditional probability of occurrence of letter A at the second place is equal to 1/2 as well. This yields conditional entropies $H_{\eta_2}(\,|\,\eta_1) = 1$ bit; $H_{\eta_2|\eta_1} = 1$ bit. Analogously,
$$H_{\eta_3}(\,|\,\eta_1, \eta_2) = 1 \text{ bit}; \qquad H_{\eta_3|\eta_1,\eta_2} = 1 \text{ bit}; \qquad H_{\eta_4|\eta_1,\eta_2,\eta_3} = 1 \text{ bit}.$$
Thus, every letter carries the same information of 1 bit independently of its position. Under this condition the random information of an elementary message ξ is equal to the length of the corresponding code word:
$$H(\xi) = l(\xi) \text{ bit}. \qquad (2.1.6)$$
Therefore, the average length of the word is equal to the information of the elementary message: $H_\xi = l_{av}$ bit $= l_{av} \ln 2$ nat. This complies with (2.1.1) so that the code is optimal. Relation $P(\xi) = (1/2)^{l(\xi)}$, which is equivalent to equality (2.1.6), is a consequence of independence of letter distribution in a literal record of a message.

Relation (2.1.3) with the equality sign is valid under the assumption that n is large, when we can neglect the probability $P(A_n)$ of non-essential realizations of the sequence $\xi_1, \ldots, \xi_n$ (see Theorems 1.7, 1.9). In this sense it holds asymptotically. We cannot establish a similar relation for a finite n. Indeed, if we require precise reproduction of all literal messages of a finite length n (their number equals $m^n$), then the length L of the record will be determined from the formula $D^L = m^n$. It will be impossible to shorten it (for instance, to the value $nH_\xi/\ln D$) without loss of some messages (the probability of which is finite for finite n). On the other hand, supposing that the length $L = L(\xi_1, \ldots, \xi_n)$ is not constant for finite n, we can encode messages in such a way that (2.1.3) transforms into the reverse inequality $l_{av} < H_\xi/\ln D$.

We demonstrate the validity of the last fact in the case of n = 1. Let there be given three possibilities ξ = 1, 2, 3 with probabilities 1/2, 1/4, 1/4. By selecting the code

ξ:      1    2    3
V(ξ):   (A)  (B)  (AB)      (2.1.7)

corresponding to D = 2, we obtain $l_{av} = \tfrac{1}{2} \cdot 1 + \tfrac{1}{4} \cdot 1 + \tfrac{1}{4} \cdot 2 = 1.25$, whereas $H_\xi = \tfrac{1}{2} \ln 2 + \tfrac{2}{4} \ln 2 + \tfrac{2}{4} \ln 2 = 1.5$ bits. Therefore, for the given single message inequality (2.1.3) is violated.
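The arithmetic of Example 2.1 and of the single-message code (2.1.7) is easy to reproduce. The following sketch uses exact fractions for the probabilities; the helper names `entropy_bits` and `average_length` are hypothetical and serve only this illustration.

```python
import math
from fractions import Fraction as F

# Probabilities of Example 2.1 and the variable-length code (2.1.5).
probs = {1: F(1, 4), 2: F(1, 4), 3: F(1, 8), 4: F(1, 8),
         5: F(1, 16), 6: F(1, 16), 7: F(1, 16), 8: F(1, 16)}
code_215 = {1: "AA", 2: "AB", 3: "BAA", 4: "BAB",
            5: "BBAA", 6: "BBAB", 7: "BBBA", 8: "BBBB"}

def entropy_bits(p):
    """Entropy of a distribution in bits (logarithm to base 2)."""
    return -sum(float(v) * math.log2(float(v)) for v in p.values())

def average_length(p, code):
    """Average code-word length l_av = sum_x P(x) l(x)."""
    return sum(p[x] * len(code[x]) for x in p)

h_bits = entropy_bits(probs)
l_opt = average_length(probs, code_215)

# Relation P(xi) = (1/2)^l(xi), equivalent to (2.1.6).
dyadic = all(probs[x] == F(1, 2) ** len(code_215[x]) for x in probs)

# Single message with the code (2.1.7), which is not a prefix code.
probs3 = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4)}
code_217 = {1: "A", 2: "B", 3: "AB"}
h3_bits = entropy_bits(probs3)
l3 = average_length(probs3, code_217)

print("Example 2.1: H =", h_bits, "bit, l_av of (2.1.5) =", float(l_opt))
print("saving relative to 3 letters:", float(1 - l_opt / 3))
print("code (2.1.7): H =", h3_bits, "bit, l_av =", float(l3))
```

For code (2.1.5) the average length coincides with the entropy in bits, while for code (2.1.7) the average length is smaller than the entropy, which is possible only because that code cannot be used for a sequence of messages.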
2.2 Main theorems for encoding without noise. Independent identically distributed messages

Let us now consider the case of an unbounded (from one side) sequence $\xi_1, \xi_2, \xi_3, \ldots$ of identically distributed independent messages. It is required to encode the sequence. In order to apply the reasoning of the previous section, one has to decompose this sequence into intervals, called blocks, which contain n elementary messages each. Then we can employ the aforementioned general asymptotic (‘thermodynamic’) relations for large n. However, there also exists a different approach for studying encoding in the absence of noise: to avoid decomposing the sequence into blocks and discarding inessential realizations. The corresponding theory will be stated in this section.

Code (2.1.7) is suitable for transmitting (or recording) one single message but is not suitable for transmitting a sequence of such messages. For instance, the record ABAB can be simultaneously interpreted as the record V(1)V(2)V(3) of the message $(\xi_1, \xi_2, \xi_3) = (1\ 2\ 3)$ or the record V(3)V(1)V(2) of the message (3 1 2) (when n = 3), let alone the messages $(\xi_1, \xi_2) = (3\ 3)$; $(\xi_1, \xi_2, \xi_3, \xi_4) = (1\ 2\ 1\ 2)$, which correspond to different n. The code in question does not make it possible to unambiguously decode a message and, thereby, we have to reject it if we want to transmit a sequence of messages.

The necessary condition that a long sequence of code symbols can be uniquely decomposed into words restricts the class of feasible codes. Codes in which every semi-infinite sequence of code words is uniquely decomposed into words are called uniquely decodable. As it can be proven (for instance, in the book by Feinstein [12] or its English original [11]) the inequality
$$\sum_{\xi=1}^{m} D^{-l(\xi)} \le 1 \qquad (2.2.1)$$
is valid for such codes. It is more convenient to consider a somewhat narrower class of uniquely decodable codes; we call them Kraft’s uniquely decodable codes (see the work by Kraft [28]).
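Both the ambiguity of the record ABAB under code (2.1.7) and inequality (2.2.1) can be checked mechanically. The sketch below is illustrative only; `kraft_sum` and `count_parses` are hypothetical helper names, and the record used for code (2.1.5) is an arbitrary example.

```python
from fractions import Fraction

def kraft_sum(lengths, D):
    """Left-hand side of (2.2.1): the sum of D**(-l) over the word lengths."""
    return sum(Fraction(1, D ** l) for l in lengths)

def count_parses(record, words):
    """Number of ways to split `record` into a sequence of code words."""
    ways = [0] * (len(record) + 1)
    ways[0] = 1                                  # the empty prefix has one parse
    for end in range(1, len(record) + 1):
        for w in words:
            if end >= len(w) and record[end - len(w):end] == w:
                ways[end] += ways[end - len(w)]  # extend every parse of the prefix
    return ways[-1]

code_217 = ["A", "B", "AB"]
code_215 = ["AA", "AB", "BAA", "BAB", "BBAA", "BBAB", "BBBA", "BBBB"]

print("Kraft sum of (2.1.7):", kraft_sum(map(len, code_217), 2))
print("Kraft sum of (2.1.5):", kraft_sum(map(len, code_215), 2))
print("parses of ABAB under (2.1.7):", count_parses("ABAB", code_217))
print("parses of ABBAAAA under (2.1.5):", count_parses("ABBAAAA", code_215))
```

Code (2.1.7) violates (2.2.1), and the record ABAB admits exactly the four readings listed above, whereas code (2.1.5) satisfies (2.2.1) with equality and the sample record splits in a single way.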
Fig. 2.1 An example of a code ‘tree’
In such codes no code word can be the beginning (‘prefix’) of another word. In code (2.1.7) this condition is violated for word (A) because it is a prefix of word (AB). Codes can be drawn in the form of a ‘tree’ (Figure 2.1) similarly to the ‘trees’ represented on Figures 1.1 and 1.2. However (if we consider a code separately from a message ξ), in contrast to those ‘trees’, probabilities are not assigned to the ‘branches’ of a ‘code tree’. A branch choice is conducted in stages by recording the next letter η of the word. The end of the word corresponds to a special end node, which we denote by a triangle on Figure 2.1. For each code word there is a code line running from the start node to an end node. The ‘tree’ corresponding to code (2.1.5) is depicted on Figure 2.1. For Kraft’s uniquely decodable codes no end node belongs to another code line.

Theorem 2.1. Inequality (2.2.1) is a necessary and sufficient condition for existence of a Kraft uniquely decodable code.

Proof. At first, we prove that inequality (2.2.1) is satisfied for every Kraft’s code. We draw the maximum number (equal to D) of auxiliary lines (represented by wavy lines on Figure 2.2) from each end node of the code tree. We suppose that they multiply in the maximum possible way (each of them produces D offspring lines) at all subsequent stages. Next, we calculate the number of auxiliary lines at some high-order stage with index k. We assume that k is greater than the length of the longest word. The end node of a word of length l(ξ) will produce $D^{k-l(\xi)}$ auxiliary lines at stage k. The total number of auxiliary lines is equal to
$$\sum_\xi D^{k-l(\xi)}. \qquad (2.2.2)$$
Now we draw auxiliary lines not only from end nodes but also from every intermediate node if the number of code lines coming out from it is less than D. In turn, let those lines multiply in the maximum possible way at other stages. At the k-th stage the number of auxiliary lines will increase compared to (2.2.2) and will become equal to
$$D^k \ge \sum_\xi D^{k-l(\xi)}.$$

Fig. 2.2 Constructing and counting auxiliary lines at stage k (for k = 3, D = 2)
By dividing both sides of this inequality by $D^k$, we obtain (2.2.1). Evidently, the equality sign takes place in the case when all internal nodes of a code tree are used in the maximum possible way, i.e. D code lines come out of each node.

Now we prove that condition (2.2.1) is sufficient for Kraft’s decipherability, i.e. that a code tree with isolated end nodes can always be constructed if there is given a set of lengths l(ξ) satisfying (2.2.1). Construction of a code line of a fixed length on a tree means filling a word of a fixed length with letters. What obstacles can possibly be encountered here? Suppose there are $m_1 = m[l(\xi) = 1]$ one-letter words. While filling those words with letters, we can encounter an obstacle when and only when there are not enough letters in the alphabet, i.e. when $m_1 > D$. But this inequality contradicts condition (2.2.1). Indeed, keeping only terms with l(ξ) = 1 in the left-hand side of (2.2.1), we only strengthen the inequality:
$$\sum_{l(\xi)=1} D^{-l(\xi)} = \frac{m_1}{D} \le 1.$$
Hence, $m_1 \le D$ and the alphabet contains enough letters to fill out all one-letter words.

Further, we consider two-letter words. Apart from the letters already used at the first position, there remain $D - m_1$ letters which we can place at the first position. We can place any of the D letters after the first letter. In total, there are $(D - m_1)D$ possibilities: every one of the $D - m_1$ non-end nodes of the first stage produces D lines. The number of those two-letter combinations (lines) must be enough to fill out all two-letter words, whose number we denote by $m_2$. Indeed, keeping only terms with l(ξ) = 1 or l(ξ) = 2 in the left-hand side of (2.2.1), we have
$$D^{-1} m_1 + D^{-2} m_2 \le 1, \quad \text{i.e.} \quad m_2 \le D^2 - D m_1.$$
This exactly means that the number of two-letter combinations is enough for filling out the two-letter words. After that there are $D^2 - Dm_1 - m_2$ two-letter combinations left available, which we can use for adding new letters. The number of three-letter combinations, equal to $(D^2 - Dm_1 - m_2)D$, is enough for filling out the three-letter words, and so forth. Every time a part of the terms of the sum $\sum_\xi D^{-l(\xi)}$ is used in the proof. This completes the proof.

As is seen from the proof provided above, it is easy to actually construct a code (filling out words with letters) if an appropriate set of lengths l(1), …, l(m) is given. Next we move to the main theorems.

Theorem 2.2. The average word length $l_{av}$ cannot be less than $H_\xi/\ln D$ for any encoding.

Proof. We consider the difference
$$l_{av} \ln D - H_\xi = E[l(\xi) \ln D + \ln P(\xi)] = -E \ln \frac{D^{-l(\xi)}}{P(\xi)}.$$
Using the evident inequality
$$-\ln x \ge 1 - x \qquad (2.2.3)$$
we obtain
$$-E \ln \frac{D^{-l(\xi)}}{P(\xi)} \ge 1 - E\, \frac{D^{-l(\xi)}}{P(\xi)},$$
i.e.
$$-E \ln \frac{D^{-l(\xi)}}{P(\xi)} \ge 1 - \sum_\xi D^{-l(\xi)},$$
since
$$E\, \frac{D^{-l(\xi)}}{P(\xi)} = \sum_\xi P(\xi)\, \frac{D^{-l(\xi)}}{P(\xi)} = \sum_\xi D^{-l(\xi)}.$$
Taking into account (2.2.1) we obtain from here that
$$l_{av} \ln D - H_\xi \ge 1 - \sum_\xi D^{-l(\xi)} \ge 0.$$
The proof is complete.

Instead of inequality (2.2.3) we can use inequality (1.2.4).

The following theorems state that we can approach the value $H_\xi/\ln D$ (which bounds the average length $l_{av}$ from below) rather closely by selecting an appropriate encoding method.

Theorem 2.3. For independent identically distributed messages, it is possible to specify an encoding procedure such that lav
1 and split the sequence ξ1 , ξ2 , . . . into groups consisting of exactly n random variables. Next, we treat each of such groups as a random variable ζ = (ξ1 , . . . , ξn ) and apply Theorem 2.3 to it. Inequality (2.2.4) applied to ζ will take the form lavζ
1. 2. Suppose that n = 2. Then we will have the following possibilities and probabilities:
ζ:        1      2      3      4
ξ1 ξ2:    11     10     01     00
P(ζ):     0.015  0.110  0.110  0.765

Here ζ means an index of pair $(\xi_1, \xi_2)$. At the first ‘simplification’ we can merge realizations ζ = 1 and ζ = 2 into one having probability 0.015 + 0.110 = 0.125. Among realizations left after the ‘simplification’ and having probabilities 0.110, 0.125, 0.765 we merge two least
2.3 Optimal encoding by Huffman. Examples
likely ones again. The scheme of such ‘simplifications’ and the respective code ‘tree’ are represented on Figures 2.3 and 2.4. It just remains to position letters A, B along the branches in order to obtain the following optimal code:

ζ:          1    2    3    4
ξ1 ξ2:      11   10   01   00
V(ξ1 ξ2):   AAA  AAB  AB   B

Its average word length is equal to
$$l_{av\zeta} = 0.015 \cdot 3 + 0.110 \cdot 3 + 0.110 \cdot 2 + 0.765 = 1.36, \qquad l_{av} = \tfrac{1}{2}\, l_{av\zeta} = 0.68. \qquad (2.3.2)$$

Fig. 2.3 The ‘simplification’ scheme for n = 2

Fig. 2.4 The code ‘tree’ for the considered example with n = 2

3. Next we consider n = 3. Now we have the following possibilities
ζ:          1      2      3      4      5      6      7      8
ξ1 ξ2 ξ3:   111    110    101    011    100    010    001    000
P(ζ):       0.002  0.014  0.014  0.014  0.096  0.096  0.096  0.668

The corresponding code ‘tree’ is shown on Figure 2.5. Positioning letters along the branches we find the optimal code:

ζ:      1      2      3      4      5    6    7    8
V(ζ):   AAAAA  AAAAB  AAABA  AAABB  AAB  ABA  ABB  B

We calculate the average length of code words for it:
$$l_{av\zeta} = 5\,(0.002 + 3 \cdot 0.014) + 3 \cdot 3 \cdot 0.096 + 0.668 = 1.752, \qquad l_{av} = \tfrac{1}{3}\, l_{av\zeta} = 0.584.$$
This value is already considerably closer to the limit value lav = 0.544 (2.3.1) than lav = 0.68 (2.3.2). Increasing n and constructing optimal codes by the specified method, we can approach the value 0.544 as closely as desired. Of course, this is achieved at the price of a more complicated coding system.
Fig. 2.5 The code ‘tree’ for n = 3
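The ‘simplification’ procedure of this example can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it takes P(ξ = 1) = 0.125, a value inferred from the rounded probabilities in the tables above and not stated explicitly in this passage, and the function names are hypothetical. It builds binary Huffman code-word lengths for blocks of n symbols and reports the average number of letters per source symbol.

```python
import heapq
import itertools
import math


def huffman_lengths(probs):
    """Return the binary Huffman code-word length for each probability."""
    # Heap items: (probability, tie-breaker, indices of leaves in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        # Merging two subtrees adds one letter to every code word below them.
        for leaf in leaves1 + leaves2:
            lengths[leaf] += 1
        heapq.heappush(heap, (p1 + p2, counter, leaves1 + leaves2))
        counter += 1
    return lengths


def average_length_per_symbol(p_one, n):
    """Average Huffman code length per source symbol for blocks of size n."""
    blocks = list(itertools.product([1, 0], repeat=n))
    probs = [math.prod(p_one if s == 1 else 1.0 - p_one for s in block)
             for block in blocks]
    lengths = huffman_lengths(probs)
    return sum(p * l for p, l in zip(probs, lengths)) / n


P_ONE = 0.125  # assumed value of P(xi = 1), inferred from the tables
entropy_bits = -(P_ONE * math.log2(P_ONE)
                 + (1 - P_ONE) * math.log2(1 - P_ONE))

for n in (1, 2, 3):
    print(n, round(average_length_per_symbol(P_ONE, n), 3))
print("entropy bound:", round(entropy_bits, 3))
```

Under these assumptions the averages per symbol come out at about 0.68 for n = 2 and about 0.58 for n = 3, against a lower bound of about 0.544; small differences from hand calculations are due to the rounding of the tabulated probabilities.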
2.4 Errors of encoding without noise in the case of a finite code sequence length

The encoding methods described in Section 2.2 are such that the record length of a fixed number of messages is random. The theory presented there gives estimates of the average length but tells nothing about its deviation. Meanwhile, in practice a record length or a message transmission time can be bounded by technical specifications. It may occur that a message record does not fit within the permissible limits and, thus, the given realization of the message cannot be recorded (or transmitted). This results in certain losses of information and distortions (dependent on the deviation of the record length). The investigation of those phenomena is a special problem. As will be seen below, entropical stability of the random variables

$$\eta^n = (\xi_1, \ldots, \xi_n)$$

is a critical factor in the indicated setting. Suppose that the record length $l(\eta^n)$ of message $\eta^n$ cannot exceed a fixed value L. Next we choose a code defined by the inequalities

$$-\frac{\ln P(\eta^n)}{\ln D} \leqslant l(\eta^n) < -\frac{\ln P(\eta^n)}{\ln D} + 1$$

according to (2.2.5). Then the record of those messages, for which the constraint

$$-\frac{\ln P(\eta^n)}{\ln D} + 1 \leqslant L \tag{2.4.1}$$
holds, will automatically be within the fixed bounds. When decoding, those realizations will be recovered with no errors. Records of some realizations will not fit in. When decoding in such a case, we can settle on an arbitrary realization, which as a rule will entail errors. The question arises of how to estimate the probabilities of correct and erroneous decoding. Evidently, the probability of decoding error will be at most the probability of the inequality

$$-\frac{\ln P(\eta^n)}{\ln D} + 1 > L, \tag{2.4.2}$$

which is the reverse of (2.4.1). Inequality (2.4.2) can be represented in the form

$$\frac{H(\eta^n)}{H_{\eta^n}} > \frac{L - 1}{H_{\eta^n}} \ln D.$$

But the ratio $H(\eta^n)/H_{\eta^n}$ converges to one for entropically stable random variables. Taking into account the definition of entropically stable variables (Section 1.5), we obtain the following result.

Theorem 2.5. If random variables $\eta^n$ are entropically stable and the record length $L = L_n$ increases with n in such a way that the expression

$$\frac{L_n - 1}{H_{\eta^n}} \ln D - 1$$

is kept larger than some positive constant ε, then the probability $P_{er}$ of decoding error converges to zero as n → ∞.

It is apparent that the constraint

$$\frac{L_n - 1}{H_{\eta^n}} \ln D - 1 > \varepsilon$$

in the indicated theorem can be replaced with the simpler constraint

$$L_n \ln D > (1 + \varepsilon) H_{\eta^n}$$

if $H_{\eta^n} \to \infty$ as n → ∞. The realizations that have not fitted in, and have therefore been decoded erroneously, belong to the set $A_n$ of non-essential realizations. Recall that the set $A_n$ is treated in Theorem 1.9. Employing the results of Section 1.5 we can derive more detailed estimates of the error probability $P_{er}$ and estimate its rate of decrease.
Theorem 2.6. For the fixed record length $L > (H_{\eta^n} / \ln D) + 1$ the probability of decoding error satisfies the inequality

$$P_{er} \leqslant \frac{\operatorname{Var}(H(\eta^n))}{[(L - 1) \ln D - H_{\eta^n}]^2}. \tag{2.4.3}$$
Proof. We apply Chebyshev’s inequality

$$P[\,|\xi - E[\xi]| \geqslant \varepsilon\,] \leqslant \operatorname{Var}(\xi)/\varepsilon^2$$

(ε > 0 is arbitrary) to the random variable $\xi = H(\eta^n)/H_{\eta^n}$. We set

$$\varepsilon = \frac{L - 1}{H_{\eta^n}} \ln D - 1$$

and obtain from here that

$$P\left[\frac{H(\eta^n)}{H_{\eta^n}} > \frac{L - 1}{H_{\eta^n}} \ln D\right] \leqslant \frac{\operatorname{Var}(H(\eta^n))}{H_{\eta^n}^2} \left[\frac{L - 1}{H_{\eta^n}} \ln D - 1\right]^{-2},$$

which proves (2.4.3).

It follows from (2.4.3) that the probability $P_{er}$ decreases with n according to the law 1/n for a sequence of identically distributed independent messages, if L − 1 increases linearly with n and the variance Var[H(ξ)] is finite. Indeed, substituting

$$H_{\eta^n} = n H_\xi; \qquad \operatorname{Var}(H(\eta^n)) = n \operatorname{Var}(H(\xi)); \qquad L - 1 = L_1 n \tag{2.4.4}$$

into (2.4.3), we find that

$$P_{er} \leqslant \frac{1}{n}\, \frac{\operatorname{Var}(H(\xi))}{[L_1 \ln D - H_\xi]^2}. \tag{2.4.5}$$
Using Theorem 1.13, it is easy to show a faster, exponential law of decay for the probability $P_{er}$.

Theorem 2.7. For the fixed length $L > H_\eta / \ln D + 1$ the probability of decoding error satisfies the inequality

$$P_{er} \leqslant e^{\mu(s) - s \mu'(s)}, \tag{2.4.6}$$

where μ(α) is the potential defined by formula (1.5.15), and s is a positive root of the equation

$$\mu'(s) = (L - 1) \ln D$$

if such a root lies both in the domain and in the differentiability region of the potential.

In order to prove the theorem we just need to use formula (1.5.18) from Theorem 1.13 by writing

$$\varepsilon = \frac{L - 1}{H_\eta} \ln D - 1.$$

For small ε we can replace inequality (2.4.6) with the following inequality:

$$P_{er} \leqslant \exp\left\{-\frac{[(L - 1) \ln D - H_\eta]^2}{2 \operatorname{Var}(H(\eta))}\right\} \tag{2.4.7}$$

within the applicability of formula (1.5.22). In the case of identically distributed independent messages, when relations (2.4.4) and formula (2.4.5) are valid, due to (2.4.7) we obtain the following inequality:

$$P_{er} \leqslant \exp\left\{-\frac{n}{2}\, \frac{(L_1 \ln D - H_\xi)^2}{\operatorname{Var}(H(\xi))}\right\}. \tag{2.4.8}$$

It shows an exponential law of decay for the probability $P_{er}$ as n grows. Formulae (2.4.7) and (2.4.8) correspond to the case in which the probability distribution of entropy can be regarded as approximately Gaussian due to the central limit theorem.
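The estimates of this section can be illustrated for a concrete source. The sketch below is a hedged example, not part of the original derivation: it assumes independent binary messages with P(ξ = 1) = 0.125 and a record budget of 0.7 binary letters per message (so that L₁ ln D = 0.7 ln 2); both values and all function names are illustrative assumptions. It computes exactly the probability of inequality (2.4.2), which bounds the error probability from above, and compares it with the right-hand side of (2.4.5).

```python
import math


def overflow_probability(p_one, n, rate_nats):
    """Exact P[H(eta^n) > n * rate_nats] for n i.i.d. binary symbols."""
    h_one, h_zero = -math.log(p_one), -math.log(1.0 - p_one)
    total = 0.0
    for k in range(n + 1):  # k = number of ones in the block
        random_entropy = k * h_one + (n - k) * h_zero
        if random_entropy > n * rate_nats:
            total += math.comb(n, k) * p_one**k * (1.0 - p_one) ** (n - k)
    return total


def chebyshev_bound(p_one, n, rate_nats):
    """Right-hand side of (2.4.5), with rate_nats standing for L_1 ln D."""
    q = 1.0 - p_one
    entropy = -p_one * math.log(p_one) - q * math.log(q)   # H_xi
    variance = p_one * q * math.log(q / p_one) ** 2          # Var(H(xi))
    return variance / (n * (rate_nats - entropy) ** 2)


P_ONE = 0.125                  # assumed source probability P(xi = 1)
RATE = 0.7 * math.log(2.0)     # assumed budget: 0.7 binary letters per symbol

for n in (50, 100, 200, 400):
    print(n, overflow_probability(P_ONE, n, RATE), chebyshev_bound(P_ONE, n, RATE))
```

In such a run the exact probability stays below the Chebyshev-type bound and falls off much faster than 1/n, which is in line with the exponential estimates above.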
Chapter 3
Encoding in the presence of penalties. First variational problem
The amount of information that can be recorded or transmitted is defined by the logarithm of the number of distinct recording realizations or transmission realizations, respectively. However, calculation of this number is not always a simple task. It can be complicated by constraints imposed on feasible realizations. In many cases, instead of direct calculation of the number of realizations, it is reasonable to compute the maximum value of recording entropy via maximization over distributions compatible with conditions imposed on the expected value of some random cost. This maximum value of entropy is called the capacity of a channel without noise. This variational problem is the first of a set of variational problems playing an important role in information theory. Solving the specified variational problem results in relationships analogous to those of statistical thermodynamics (see, for instance, the textbooks by Leontovich [31, 32]). It is convenient to employ the methods developed by Leontovich in those textbooks in order to find a channel capacity and optimal distributions. These methods are based on systematic use of thermodynamic potentials and formulae in which a cost function plays the role of energy. Temperature, free energy and other thermodynamic concepts find their appropriate place in the theory. Thus the mathematical apparatus of this branch of information theory very closely resembles that of statistical thermodynamics. The same analogy will subsequently be observed in the second and the third variational problems (Sections 8.2 and 9.4, respectively). It should be emphasized that the indicated analogy is not just fundamental, but is also important for applications. It allows us to use methods and results of statistical thermodynamics. The solution of a number of particular problems (Example 3.3 from Section 3.4) of optimal encoding with penalties is mathematically equivalent to the one-dimensional Ising model well known in statistical thermodynamics. At the end of the chapter, the results obtained are extended to the case of a more general entropy definition, which is valid for continuous and arbitrary random variables.
© Springer Nature Switzerland AG 2020 R. V. Belavkin et al. (eds.), Theory of Information and its Value, https://doi.org/10.1007/978-3-030-22833-0 3
3.1 Direct method of computing information capacity of a message for one example

Consider a particular example of information encoding, for which the calculation of informational or channel capacity ln M, where M is the number of realizations of the code sequence, is nontrivial. This is the case in which the symbols have different durations. Let there be m symbols V1, . . . , Vm with lengths l(1), . . . , l(m), respectively. Possibly (but not necessarily) they are combinations of more elementary letters A1, . . . , AD of the same length, where l(i) is the number of such letters. Then, generally speaking, symbols Vi will be code words like those considered before. Now, however, their composition from letters is fixed and does not change during our consideration. We take four telegraph symbols as an example:

Symbol                     Vi        l(i)
Dot                        +−        2
Dash                       +++−      4
Spacing between letters    −−−       3
Spacing between words      −−−−−−    6
                                             (3.1.1)

The second column gives these symbols in a binary (D = 2) alphabet: A = + and B = −. The third column gives their length in this alphabet. To be sure, the symbols Vi can also be called not ‘words’, but ‘letters’ of a new, more complex alphabet. Let there be given a recording (or a transmission)

$$V_{i_1}, V_{i_2}, \ldots, V_{i_k} \tag{3.1.2}$$
that consists of k symbols. Its length is evidently equal to L = l(i1) + l(i2) + · · · + l(ik). We fix this total recording length and compute the number M(L) of distinct realizations of a recording with the specified length. If we drop the last symbol $V_{i_k}$, we obtain a sequence of smaller length L − l(ik). Each such truncated sequence can be realized in M(L − l(ik)) ways. Summing over the m possibilities for the last symbol, we obtain the equation

$$M(L) = \sum_{i=1}^{m} M(L - l(i)), \tag{3.1.3}$$

that allows us to find M(L) as a function of L. After M(L) has been found, it becomes easy to determine the information that can be transmitted by a recording of length L. As in the previous cases, maximum information is attained when all M(L) realizations are equiprobable. In that case

$$H^L / L = \ln M(L) / L,$$

where $H^L = H_{V_{i_1} \ldots V_{i_k}}$ is the entropy of the recording. We take the limit of this relationship as L → ∞ and thereby obtain the entropy of a recording per unit of length:

$$H_1 = \lim_{L \to \infty} \frac{\ln M(L)}{L}. \tag{3.1.4}$$
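Recurrence (3.1.3) is easy to iterate numerically. The following sketch is illustrative only: the starting value M(0) = 1 (the empty recording), with M(L) = 0 for negative L, is an assumption supplying the initial conditions, and the function name is hypothetical. It counts recordings built from the four telegraph symbols of (3.1.1) and prints ln M(L)/L for growing L.

```python
import math

SYMBOL_LENGTHS = (2, 4, 3, 6)  # dot, dash, letter space, word space from (3.1.1)


def count_recordings(total_length, lengths=SYMBOL_LENGTHS):
    """Number M(L) of symbol sequences whose lengths sum exactly to L."""
    # counts[L] follows recurrence (3.1.3); M(0) = 1 is the assumed
    # initial condition, and terms with negative argument are skipped.
    counts = [0] * (total_length + 1)
    counts[0] = 1
    for current in range(1, total_length + 1):
        counts[current] = sum(counts[current - l] for l in lengths if l <= current)
    return counts[total_length]


for L in (10, 50, 200, 400):
    print(L, math.log(count_recordings(L)) / L)
```

The printed ratios settle down as L grows, which is the limit H1 of (3.1.4) in nats per unit of length.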
As is seen from this formula, there is no need to find an exact solution of equation (3.1.3); it is sufficient to consider only the asymptotic behaviour of ln M(L) for large L. Equation (3.1.3) is linear and homogeneous. As for any such equation, we can look for a solution in the form

$$M(L) = C e^{\lambda L}. \tag{3.1.5}$$

Such a solution usually turns out to be possible only for certain (‘eigen’) values λ = λ1, λ2, . . .. With the help of the spectrum of these ‘eigenvalues’ a general solution can be represented as follows:

$$M(L) = \sum_{\alpha} C_\alpha e^{\lambda_\alpha L}, \tag{3.1.6}$$

where constants C1, C2, . . . are determined from initial conditions. Substituting (3.1.5) into (3.1.3) we obtain the ‘characteristic’ equation

$$1 = \sum_{i=1}^{m} e^{-\lambda l(i)}, \tag{3.1.7}$$
that allows us to find eigenvalues λ1, λ2, . . .. As is easily seen, the function

$$\Phi(X) \equiv \sum_{i} X^{-l(i)} \tag{3.1.8}$$

is a monotonically decreasing function of X on the entire half-line X > 0, i.e., this function decreases from ∞ to 0 since all l(i) > 0. Therefore, there exists exactly one positive root W of the equation

$$\sum_{i} W^{-l(i)} = 1. \tag{3.1.9}$$

Equations (3.1.7) and (3.1.9) coincide if we set

$$\lambda = \ln W. \tag{3.1.10}$$

Thus, equation (3.1.7) also has only one real root. For sufficiently large L the term in (3.1.6) with the largest real part of the eigenvalue λα will be dominant. We denote this eigenvalue as λm with index m:

$$\operatorname{Re} \lambda_m = \max_{\alpha} \operatorname{Re} \lambda_\alpha.$$

Furthermore, it follows from (3.1.6) that
$$M(L) \approx C_m e^{\lambda_m L}. \tag{3.1.11}$$

Since function M(L) cannot be complex-valued or alternate in sign, the eigenvalue must be real: Im λm = 0. But (3.1.10) is the only real eigenvalue. Consequently, the value ln W is the desired eigenvalue λm with the maximum real part. Formula (3.1.11) then takes the form $M(L) \approx C_m W^L$ and limit (3.1.4) turns out to be equal to

$$H_1 = \lambda_m = \ln W, \tag{3.1.12}$$

which yields the solution of the problem in question. This solution was first found by Shannon [45] (the English original is [38]).
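For the telegraph symbols of (3.1.1) the root W of equation (3.1.9) can be found numerically. The sketch below uses plain bisection, relying on the monotonic decrease of Φ(X) noted above; the bracketing interval, tolerance and function names are illustrative assumptions and not part of the original text.

```python
import math

SYMBOL_LENGTHS = (2, 4, 3, 6)  # lengths l(i) from table (3.1.1)


def phi(x, lengths=SYMBOL_LENGTHS):
    """Function (3.1.8): the sum of x**(-l(i)) over all symbols."""
    return sum(x ** (-l) for l in lengths)


def positive_root(lengths=SYMBOL_LENGTHS, tol=1e-12):
    """Solve phi(W) = 1 by bisection, using that phi decreases on X > 0."""
    low, high = 1.0, 2.0           # phi(1) equals the number of symbols, so phi(1) >= 1
    while phi(high, lengths) >= 1.0:
        high *= 2.0                # widen the bracket until phi(high) < 1
    while high - low > tol:
        mid = 0.5 * (low + high)
        if phi(mid, lengths) > 1.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


W = positive_root()
capacity_nats = math.log(W)        # H_1 = ln W according to (3.1.12)
print("W =", W)
print("capacity per unit length:", capacity_nats, "nats =",
      capacity_nats / math.log(2.0), "bits")
```

For these four lengths the root lies between 1.51 and 1.52, so the capacity is a little over 0.41 nats per unit of length.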
3.2 Discrete channel without noise and its capacity

The recording considered above or a transmission having a fixed length in the specified alphabet is an example of a noiseless discrete channel. Here we provide a more general definition of this notion. Let a variable y take values in a discrete set Y (not necessarily of finite cardinality). Further, let some numerical function c(y) be given on Y. For reasons that will become clear later we call it a cost function. Let there be given some number a and a fixed condition

$$c(y) \leqslant a. \tag{3.2.1}$$

System [Y, c(y), a] fully characterizes a noiseless discrete channel. Constraint (3.2.1) picks out some subset Ya from set Y. This subset Ya is assumed to consist of a finite number Ma of elements. These elements can be used for information recording and transmission. Number Ma is interpreted as the number of distinct realizations of a recording, similarly to number M(L) in Section 3.1. For the particular case considered in Section 3.1 set Y is the set of different recordings

$$y = V_{i_1}, \ldots, V_{i_k}. \tag{3.2.2}$$

In this case it is reasonable to choose the function

$$c(y) = l(i_1) + \cdots + l(i_k). \tag{3.2.3}$$

Thereby, long recordings are penalized more, which indirectly reflects a desire to reduce the recording length. With this choice, condition (3.2.1) coincides with the condition

$$l(i_1) + \cdots + l(i_k) \leqslant L \qquad (L \text{ plays the role of } a) \tag{3.2.4}$$
that is, with the requirement that a recording length does not exceed the specified value of L. The consideration of this example will be continued in Section 3.4 (Example 3.4). Variable Ma (the number of realizations) tells us about the capability of the given system (channel) to record or transmit information. The maximum amount of information that can be transmitted is evidently equal to ln Ma. This amount can be called information capacity or channel capacity. However, a direct calculation of Ma sometimes involves difficulties, as is seen from the example covered in Section 3.1. Thus, it is more convenient to define the notion of channel capacity not as ln Ma but a bit differently. We introduce a probability distribution P(y) on Y and replace condition (3.2.1) by an analogous condition for the mean

$$\sum_{y} c(y) P(y) \leqslant a, \tag{3.2.5}$$

which should be understood as a constraint imposed on distribution P(y). We define channel capacity C or information capacity of channel [Y, c(y), a] as the maximum value of the entropy

$$C = \sup_{P(y)} H_y. \tag{3.2.6}$$
Here maximization is conducted over different distributions P(y), which are compatible with constraint (3.2.5). Hence, channel capacity is defined as the solution of a variational problem. As will be seen from Section 4.3, there exists a direct asymptotic relationship between value (3.2.6) and Ma. Speaking of applications of these ideas to statistical physics, relationships (3.2.1) (taken with an equality sign) and (3.2.5) correspond to microcanonical and canonical distributions, respectively. Asymptotic equivalence of these distributions is well known in statistical physics. The above definitions of a channel and its capacity can be modified a little, for instance, by replacing inequalities (3.2.1), (3.2.5) with the following two-sided inequalities

$$a_1 \leqslant c(y) \leqslant a_2; \qquad a_1 \leqslant \sum_{y} c(y) P(y) \leqslant a_2 \tag{3.2.7}$$

or with inequalities in the opposite direction (if the number of realizations remains finite). Besides, in more complicated cases there can be given several numerical functions or a function taking values from some other space. None of these modifications usually involves fundamental changes, so we do not especially address them here. The cost function c(y) may have different physical or technical meanings in different problems. It can characterize ‘costs’ of certain symbols, or point to unequal costs incurred in recording or transmission of some symbol, for instance, a different amount of paint or electric power. It can also correspond to penalties placed on different adverse factors, for instance, it can penalize excessive height of letters and so on. In particular, if the length of symbols is penalized, then (3.2.3) holds, or c(y) = l(y) for y = i.
Introduction of costs, payments and so on is typical for mathematical statistics, optimal control theory and game theory. We will repeatedly discuss them later on. When function c(y) does actually have the physical meaning of penalties and costs incurred in information transmission, and incurring those costs is beneficial, then as a rule the maximum value of entropy in (3.2.6) is attained at the largest admissible cost, so that we can keep only the equality sign in (3.2.5) or (3.2.7), supposing that

$$\sum_{y} c(y) P(y) = a. \tag{3.2.8}$$
The variational problem (3.2.6), (3.2.8) is close to another, inverse variational problem: the problem of finding distribution P(y) that minimizes average losses (risk)

$$R \equiv \sum_{y} c(y) P(y) = \min \tag{3.2.9}$$

for the fixed value of entropy

$$-\sum_{y} P(y) \ln P(y) = I, \tag{3.2.10}$$

where I is a given number. Finally, a third problem statement is possible: it is required to maximize some linear combination of the indicated expressions

$$H_y - \beta R \equiv -\sum_{y} P(y) \ln P(y) - \beta \sum_{y} c(y) P(y) = \max. \tag{3.2.11}$$
It is essential that all three specified problem statements lead to the same solution if parameters a, I, β are coordinated properly. We call any of these mentioned statements the first variational problem. It is convenient to study the addressed questions with the help of thermodynamic potentials introduced below.
3.3 Solution of the first variational problem. Thermodynamic parameters and potentials

1. Variational problems (3.2.6), (3.2.8) formulated in the previous section and problems (3.2.9), (3.2.10) are typical problems on a conditional extremum. Thus, they can be solved via the method of Lagrange multipliers. Also, the normalization condition

$$\sum_{y} P(y) = 1 \tag{3.3.1}$$

must be added to the indicated conditions. Further, the requirement of non-negative probabilities

$$P(y) \geqslant 0, \qquad y \in Y \tag{3.3.2}$$
must be satisfied. However, this requirement need not be included in the set of constraints, since the solution of the problem with all other constraints retained but without this requirement turns out (as subsequent inspection will show) to satisfy it automatically. We introduce Lagrange multipliers β, γ and try to find an extremum of the expression

$$K = -\sum_{y} P(y) \ln P(y) - \beta \sum_{y} c(y) P(y) - \gamma \sum_{y} P(y). \tag{3.3.3}$$

Since the multipliers β, γ are undetermined, finding this extremum is equivalent to finding an extremum of the analogous expression for problem (3.2.9), (3.2.10) or (3.2.11) with the help of other undetermined multipliers. The condition for the extremum has the form

$$-\frac{\partial K}{\partial P(y)} = \ln P(y) + 1 + \beta c(y) + \gamma = 0. \tag{3.3.4}$$
Here differentiation is carried out with respect to those and only those P(y) that are different from zero in the extremum distribution P(·). We assume that this distribution exists and is unique. The subset of Y whose elements have non-zero probabilities in the extremum distribution is denoted by $\tilde{Y}$. We call $\tilde{Y}$ the ‘active’ domain. Hence, equation (3.3.4) is valid only for elements y belonging to $\tilde{Y}$. From (3.3.4) we get

$$P(y) = e^{-1 - \beta c(y) - \gamma}, \qquad y \in \tilde{Y}. \tag{3.3.5}$$

Probabilities are equal to zero for all other y outside $\tilde{Y}$. We see that the extremum probabilities (3.3.5) turn out to be non-negative, so constraint (3.3.2) can be disregarded as a side condition. The multiplier $e^{-1-\gamma}$ can be immediately determined from the normalization condition (3.3.1), which yields

$$P(y) = e^{-\beta c(y)} \Big/ \sum_{y} e^{-\beta c(y)}, \tag{3.3.6}$$

if the sum in the denominator converges. Parameter β is given in problem (3.2.11), so there is no need to determine it. In problems (3.2.6), (3.2.8) and (3.2.9), (3.2.10) it should be determined from the side condition (3.2.8) or (3.2.10), respectively.

2. The derived solution of the extremum problem can be simplified due to the following theorem.

Theorem 3.1. When solving the problem of maximum entropy (3.2.6) under the constraint (3.2.8), the probabilities of all elements of Y at which the cost function c(y) takes finite values are different from zero in the extremum distribution. Therefore, if function c(y) is finite for all elements, then set $\tilde{Y}$ coincides with the entire set Y.
Proof. Assume the contrary, i.e. assume that some elements $y \in Y$ have zero probability $P_0(y) = 0$ in the extremum distribution $P_0(y)$. Since distribution $P_0$ is an extremum, for another measure $P_1$ (even a non-normalized one) with the same nullity set $Y - \tilde{Y}$ the relationship

$$
\begin{aligned}
&-\sum_{y} P_1(y) \ln P_1(y) - \beta \sum_{y} c(y) P_1(y) - \gamma \sum_{y} P_1(y) \\
&\qquad + \sum_{y} P_0(y) \ln P_0(y) + \beta \sum_{y} c(y) P_0(y) + \gamma \sum_{y} P_0(y) \\
&\quad = -\frac{1}{2} \sum_{y \in \tilde{Y}} \frac{1}{P_0(y)} [P_1(y) - P_0(y)]^2 + \cdots = O\big((P_1 - P_0)^2\big)
\end{aligned}
\tag{3.3.7}
$$

is valid. Now we consider $P(y_1) \neq 0$, $y_1 \in Y - \tilde{Y}$, and suppose that P(y) = 0 at the other points of subset $Y - \tilde{Y}$. We choose the other probabilities P(y), $y \in \tilde{Y}$, in such a way that the constraints

$$\sum_{y} P(y) = 1; \qquad \sum_{y} c(y) P(y) = \sum_{y} c(y) P_0(y) \tag{3.3.8}$$

are satisfied and the differences $P(y) - P_0(y)$ ($y \in \tilde{Y}$) depend linearly on $P(y_1)$. The non-zero probability $P(y_1) \neq 0$ leads to the emergence of the extra term $-P(y_1) \ln P(y_1)$ in the expression for entropy $H_y$. Setting $P_1(y) = P(y)$ for $y \in \tilde{Y}$ and taking into account (3.3.7), we obtain that

$$
\begin{aligned}
&-\sum_{y} P(y) \ln P(y) - \beta \sum_{y} c(y) P(y) - \gamma \sum_{y} P(y) \\
&\qquad + \sum_{y} P_0(y) \ln P_0(y) + \beta \sum_{y} c(y) P_0(y) + \gamma \sum_{y} P_0(y) \\
&\quad = -P(y_1) \ln P(y_1) - [\beta c(y_1) + \gamma] P(y_1) + O\big((P_1 - P_0)^2\big).
\end{aligned}
$$

Further, we use (3.3.8), by which the terms with β and γ on the left-hand side cancel, to get

$$-\sum_{y} P(y) \ln P(y) + \sum_{y} P_0(y) \ln P_0(y) = P(y_1) \ln \frac{1}{P(y_1)} - [\beta c(y_1) + \gamma] P(y_1) + O\big((P(y_1))^2\big). \tag{3.3.9}$$

For sufficiently small $P(y_1)$ (when $-\ln P(y_1) - \beta c(y_1) - \gamma + O(P(y_1)) > 0$) the expression on the right-hand side of (3.3.9) is certainly positive. Thus, the entropy of distribution P satisfying the same constraints (3.3.8) will exceed the entropy of the extremum distribution $P_0$, which is impossible. So, an element $y_1$ having zero probability in the extremum distribution does not exist. The proof is complete.

Since, by Theorem 3.1, $\tilde{Y}$ and Y coincide when c(y) takes finite values, henceforth we use formula (3.3.5) or (3.3.6) for the entire set Y.
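The extremal property of distribution (3.3.6) can be tried out numerically. The sketch below is illustrative and rests on assumptions that are not in the text: the four costs are taken to be the symbol lengths of Section 3.1, the admissible average cost is set to a = 3, β is found by bisection on a fixed interval, and all names are hypothetical. It then perturbs the distribution along a direction that preserves both the normalization and the average cost and compares entropies.

```python
import math


def gibbs_distribution(costs, beta):
    """Distribution (3.3.6): P(y) proportional to exp(-beta * c(y))."""
    shift = min(beta * c for c in costs)  # shift exponents for numerical stability
    weights = [math.exp(-(beta * c - shift)) for c in costs]
    z = sum(weights)
    return [w / z for w in weights]


def mean_cost(costs, probs):
    return sum(c * p for c, p in zip(costs, probs))


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def solve_beta(costs, target, low=-50.0, high=50.0, tol=1e-12):
    """Find beta giving the target mean cost (the mean cost decreases in beta)."""
    while high - low > tol:
        mid = 0.5 * (low + high)
        if mean_cost(costs, gibbs_distribution(costs, mid)) > target:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


COSTS = (2.0, 4.0, 3.0, 6.0)   # assumed costs: the symbol lengths of Section 3.1
TARGET = 3.0                   # assumed admissible average cost a

beta = solve_beta(COSTS, TARGET)
optimal = gibbs_distribution(COSTS, beta)

# A direction that changes neither the normalization nor the mean cost:
# its components sum to zero and are orthogonal to the cost vector.
direction = (1.0, 1.0, -2.0, 0.0)
perturbed = [p + 0.01 * v for p, v in zip(optimal, direction)]

print("beta:", beta)
print("entropy of (3.3.6):", entropy(optimal))
print("entropy after perturbation:", entropy(perturbed))
```

Any admissible perturbation of this kind lowers the entropy, in agreement with the maximality argument; all four probabilities of the optimal distribution are strictly positive, as Theorem 3.1 requires for finite costs.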
Expression (3.3.6) is analogous to the corresponding expression for the Boltzmann distribution that is well known in statistical physics, i.e., c(y) is an analog of energy and β = 1/T is a parameter reciprocal to absolute temperature (if we use the scale in which the Boltzmann constant equals 1). With this analogy in mind we call T temperature and c(y) energy. The fact that the thermodynamic equilibrium distribution (3.3.6) serves as a solution to the problem of an extremum of entropy with a given fixed energy (i.e. to the extremum problem (3.2.6), (3.2.8)) is also well known in statistical physics.

3. Now we want to verify that the found distribution (3.3.5), (3.3.6) actually corresponds to the maximum (and not to, say, the minimum) of entropy $H_y$. Calculating the second-order derivatives of entropy, we have

$$\frac{\partial^2 H_y}{\partial P(y)\, \partial P(y')} = -\frac{1}{P(y)}\, \delta_{yy'}, \qquad y, y' \in \tilde{Y},$$

where $\delta_{yy'}$ is the Kronecker symbol. Thus, taking into account the vanishing of the first-order terms and neglecting terms of the third order in the deviation ΔP(y), $y \in \tilde{Y}$, from the extremum distribution P, we obtain

$$\Delta H_y = H_y - C = -\frac{1}{2} \sum_{y \in \tilde{Y}} \frac{1}{P(y)} [\Delta P(y)]^2. \tag{3.3.10}$$

Since all P(y) included here are positive, the difference $H_y - C$ is negative, which proves maximality of C. It can also be proven that if β > 0, then the extremum distribution found above corresponds to the minimum of average cost (3.2.9) with the fixed entropy (3.2.10). In order to prove it, we need to take into account that, analogously to (3.3.10), the relationship

$$\Delta K = K - K_{extr} = -\frac{1}{2} \sum_{y \in \tilde{Y}} \frac{1}{P(y)} [\Delta P(y)]^2 < 0$$

is valid for function (3.3.3) considered as a function of independent variables P(y), $y \in \tilde{Y}$ (suppose there are m of them). Hence, the maximum of the expression K takes place at the extremum point. Next we need to take into account that we are interested in a conditional extremum corresponding to conditions (3.2.10) and (3.3.1). The latter define a hypersurface of dimension m − 2 containing the extremum point in the m-dimensional space $R^m$. We should restrict ourselves to shifts from the extremum point within this hypersurface. But if the extremum point is a maximum with respect to arbitrary shifts, then it is also a maximum with respect to shifts of this particular type. Since the first and the third terms in (3.3.3) remain the same when shifting within the hypersurface, the extremum point turns out to be a point of conditional maximum of the expression $-\beta \sum c(y) P(y)$, i.e. a point of the minimum of average cost (if β > 0). This ends the proof.
4. The found distribution (3.3.6) allows us to determine channel capacity C and average cost R as functions of β or of the ‘temperature’ T = 1/β. Eliminating parameter β or T from these dependencies, it is simple to find the dependence of C on R = a for problem (3.2.6), (3.2.8) or the dependence of R on C = I for problem (3.2.9), (3.2.10). When solving these problems in practice it is convenient to use a technique that is employed in statistical physics. At first we calculate the sum (assumed to be finite)

$$Z = \sum_{y} e^{-c(y)/T}, \tag{3.3.11}$$

called a partition function. This function is essentially the normalization factor in the distribution (3.3.6). The free energy is expressed through it:

$$F(T) = -T \ln Z. \tag{3.3.12}$$
With the help of this free energy we can compute entropy C and energy R. We now prove a number of relations common in thermodynamics as separate theorems.

Theorem 3.2. Free energy is connected with entropy and average energy by the simple relationship

$$F = R - TC. \tag{3.3.13}$$

This is well known in thermodynamics.

Proof. Using formulae (3.3.12), (3.3.11) we rewrite the probability distribution (3.3.6) in the form

$$P(y) = e^{(F - c(y))/T} \tag{3.3.14}$$

(Gibbs distribution). According to (1.2.1) we find the random entropy

$$H(y) = \frac{c(y) - F}{T}.$$

Averaging this equality with respect to y leads to relationship (3.3.13). The proof is complete.

Theorem 3.3. Entropy can be computed by differentiating free energy with respect to temperature:

$$C = -dF/dT. \tag{3.3.15}$$

Proof. Differentiating expression (3.3.12) with respect to temperature and taking into account (3.3.11) we obtain that

$$-\frac{dF}{dT} = \ln Z + T Z^{-1} \frac{dZ}{dT} = \ln Z + T Z^{-1} \sum_{y} \frac{c(y)}{T^2}\, e^{-c(y)/T},$$

i.e. due to (3.3.6), (3.3.11) we have

$$-\frac{dF}{dT} = \ln Z + T^{-1} E[c(y)].$$

Taking account of equalities (3.3.12), (3.3.13) we find that

$$-\frac{dF}{dT} = T^{-1}(-F + R) = C.$$

This ends the proof.

As is seen from formulae (3.3.13), (3.3.15), energy (average cost) can be expressed using free energy as follows:

$$R = F - T \frac{dF}{dT}. \tag{3.3.16}$$

It is simple to verify that this formula can be rewritten in the following more compact form:

$$R = \frac{d}{d\beta}(\beta F) = -\frac{d \ln Z}{d\beta}, \qquad \beta = \frac{1}{T}. \tag{3.3.17}$$

After calculation of functions C(T), R(T) it is straightforward to find the channel capacity C = C(a) (3.2.6). Equation (3.2.8), i.e. the equation

$$R(T) = a \tag{3.3.18}$$

will determine T(a). So, the channel capacity will be equal to C(a) = C(T(a)). Similarly, we can find the minimum average cost (3.2.9) with the given amount of information I for problem (3.2.9), (3.2.10). Equation (3.2.10), taking the form

$$C(T) = I,$$

determines the temperature T(I) corresponding to the given amount of information I. In this case, the minimum average cost (3.2.9) is equal to R(T(I)). To conclude this section we formulate two theorems related to facts well known in thermodynamics.

Theorem 3.4. If distribution (3.3.6) exists, i.e. the sum (3.3.11) converges, then the formula

$$\frac{dR}{dC} = T \tag{3.3.19}$$

is valid, so that for T > 0 the average cost is an increasing function of entropy, and for T < 0 it is decreasing.

Proof. Taking the differential of expression (3.3.13) we obtain that

$$dR = dF + C\,dT + T\,dC.$$
But on account of (3.3.15) dF + C dT = 0, so that

$$dR = T\,dC. \tag{3.3.20}$$

The desired formula (3.3.19) follows from here. Equation (3.3.20) is nothing but the well-known thermodynamic equation dH = dQ/T, where dQ is the amount of heat entering the system and augmenting its internal energy by dR, and H = C is the entropy.

Theorem 3.5. If distribution (3.3.6) exists and c(y) is not constant within $\tilde{Y}$, then R turns out to be an increasing function of temperature. Also, channel capacity (entropy) C is an increasing function of T for T > 0.

Proof. Differentiating (3.3.17) we obtain

$$\frac{dR}{dT} = -\frac{1}{T^2}\, \frac{d^2(\beta F)}{d\beta^2}. \tag{3.3.21}$$

Combining this with (3.3.20), we have

$$\frac{dC}{dT} = -\frac{1}{T^3}\, \frac{d^2(\beta F)}{d\beta^2}. \tag{3.3.22}$$

In order to prove the theorem it remains to show that

$$-\frac{d^2(\beta F)}{d\beta^2} > 0, \tag{3.3.23}$$

i.e. βF is a concave function of β. According to (3.3.12), βF is nothing but −ln Z. Therefore,

$$-\frac{d^2(\beta F)}{d\beta^2} = \frac{1}{Z}\, \frac{d^2 Z}{d\beta^2} - \left(\frac{1}{Z}\, \frac{dZ}{d\beta}\right)^2. \tag{3.3.24}$$

Differentiating (3.3.11) we obtain that

$$
\begin{aligned}
\frac{1}{Z}\, \frac{dZ}{d\beta} &= -\sum_{y} c(y)\, e^{-\beta c(y)} \Big/ \sum_{y} e^{-\beta c(y)} = -E[c(y)], \\
\frac{1}{Z}\, \frac{d^2 Z}{d\beta^2} &= \sum_{y} c^2(y)\, e^{-\beta c(y)} \Big/ \sum_{y} e^{-\beta c(y)} = E[c^2(y)]
\end{aligned}
\tag{3.3.25}
$$

by virtue of (3.3.6). Hence, expression (3.3.24) is exactly the variance of costs $E[(c(y) - E[c(y)])^2]$, which is non-negative (and positive if c(y) is not constant within $\tilde{Y}$). That proves inequality (3.3.23) and also the entire theorem due to (3.3.21), (3.3.22).

As a result of (3.3.15) formula (3.3.22) yields
$$\frac{d^2 F}{dT^2} = -\frac{dC}{dT} = \frac{1}{T^3}\, \frac{d^2(\beta F)}{d\beta^2}.$$

From the latter formula, due to (3.3.23), we can conclude that function F(T) is concave for T > 0 and convex for T < 0. The facts mentioned above are a particular manifestation of the properties of thermodynamic potentials, examples of which are F(T) and ln Z(β). Hence, these potentials play a significant role not only in thermodynamics but also in information theory. In the next chapter, we will touch upon subjects related to asymptotic equivalence between constraints of types (3.2.1) and (3.2.5).
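The thermodynamic relations of this section can be verified numerically for any finite set of costs. The sketch below is a hedged illustration: the cost values (the symbol lengths of Section 3.1), the temperature T = 1.5 and the use of central finite differences for the derivatives are all assumptions made for the example, and the function names are hypothetical.

```python
import math

COSTS = (2.0, 4.0, 3.0, 6.0)  # assumed example costs (symbol lengths of Section 3.1)


def partition_function(temperature, costs=COSTS):
    """Sum (3.3.11)."""
    return sum(math.exp(-c / temperature) for c in costs)


def free_energy(temperature, costs=COSTS):
    """Free energy (3.3.12)."""
    return -temperature * math.log(partition_function(temperature, costs))


def average_cost(temperature, costs=COSTS):
    """Average cost R under distribution (3.3.6)."""
    z = partition_function(temperature, costs)
    return sum(c * math.exp(-c / temperature) for c in costs) / z


def entropy(temperature, costs=COSTS):
    """Entropy C of distribution (3.3.6)."""
    z = partition_function(temperature, costs)
    probs = [math.exp(-c / temperature) / z for c in costs]
    return -sum(p * math.log(p) for p in probs)


def derivative(func, x, step=1e-5):
    """Central finite difference, used only as a numerical check."""
    return (func(x + step) - func(x - step)) / (2.0 * step)


T = 1.5  # assumed temperature for the check
F, R, C = free_energy(T), average_cost(T), entropy(T)
print("F - (R - T*C):", F - (R - T * C))                 # relation (3.3.13)
print("C + dF/dT:", C + derivative(free_energy, T))      # relation (3.3.15)
print("dR/dT - T*dC/dT:",
      derivative(average_cost, T) - T * derivative(entropy, T))  # relation (3.3.20)
print("dR/dT:", derivative(average_cost, T))             # positive by Theorem 3.5
```

All three printed differences should vanish up to rounding and discretization error, and dR/dT should be positive because the costs are not all equal.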
3.4 Examples of application of general methods for computation of channel capacity

Example 3.1. For simplicity let there be only two symbols initially: m = 2; y = 1, 2, which correspond to different costs

$$c(1) = b - a, \qquad c(2) = b + a. \tag{3.4.1}$$
In this case, the partition function (3.3.11) is equal to

$$Z = e^{-\beta b + \beta a} + e^{-\beta b - \beta a} = 2 e^{-\beta b} \cosh \beta a.$$

By formula (3.3.12) the free energy

$$F = b - T \ln 2 - T \ln \cosh \frac{a}{T}$$

corresponds to it. Applying formulae (3.3.15), (3.3.16) we find entropy and average energy:

$$C = H_y(T) = \ln 2 + \ln \cosh \frac{a}{T} - \frac{a}{T} \tanh \frac{a}{T}, \qquad R(T) = b - a \tanh \frac{a}{T}. \tag{3.4.2}$$
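The closed-form expressions (3.4.2) can be compared with a direct evaluation from distribution (3.3.6). In the sketch below the parameter values a = 1, b = 3 and the list of temperatures are illustrative choices, not taken from the text, and the function names are hypothetical.

```python
import math


def closed_form(temperature, a, b):
    """Entropy C and average cost R from formulae (3.4.2)."""
    x = a / temperature
    capacity = math.log(2.0) + math.log(math.cosh(x)) - x * math.tanh(x)
    cost = b - a * math.tanh(x)
    return capacity, cost


def direct(temperature, a, b):
    """The same quantities computed directly from distribution (3.3.6)."""
    costs = (b - a, b + a)                       # cost function (3.4.1)
    weights = [math.exp(-c / temperature) for c in costs]
    z = sum(weights)
    probs = [w / z for w in weights]
    capacity = -sum(p * math.log(p) for p in probs)
    cost = sum(c * p for c, p in zip(costs, probs))
    return capacity, cost


A, B = 1.0, 3.0  # illustrative parameter values, not taken from the text
for T in (0.2, 1.0, 5.0, 100.0):
    print(T, closed_form(T, A, B), direct(T, A, B))
```

The two computations agree, and for large T the entropy approaches ln 2 while the average cost approaches b.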
The graphs of these functions are presented on Figure 3.1. It is also shown there how to determine channel capacity with the given level of costs R = R0 in a simple graphical way. When temperature changes from 0 to ∞, entropy changes from 0 to ln 2 and energy goes from b − a to b (where a > 0). Also in a more general case as T → ∞ entropy Hy goes to limit ln m. This value appears to be the channel capacity corresponding to the absence of constraints. Also, this value cannot be exceeded for any average cost. The additive parameter b implicated in the cost function (3.4.1) is not essential. Neither entropy nor a probability distribution depends on it. This corresponds to the fact that the additive constant from the expression for energy (recall that R is an
3 Encoding in the presence of penalties. First variational problem
Fig. 3.1 Thermodynamic functions for Example 3.1
analog of energy) can be chosen arbitrarily. In the example in question the optimal probability distribution has the form
$$P(1, 2) = \frac{e^{\pm a/T}}{2 \cosh (a/T)} \tag{3.4.2a}$$
according to (3.3.14). The determined functions (3.4.2) can be used in cases, for which there exists a sequence $y^L = (y_1, \dots, y_L)$ of length L that consists of the symbols described above. If the number of distinct elementary symbols equals 2, then the number of distinct sequences is evidently equal to $m = 2^L$. Next we assume that the cost imposed on an entire sequence is the sum of the costs imposed on the symbols that constitute this sequence. Hence,
$$c(y^L) = \sum_{i=1}^{L} c(y_i). \tag{3.4.3}$$
Application of the above-derived formula (3.3.6) to the sequence shows that in this case the optimal probability distribution for the sequence decomposes into a product of probabilities of the individual symbols. Consequently, the thermodynamic functions F, R, H of the sequence are equal to the sums of the corresponding thermodynamic functions of the constituent symbols. In the stationary case (when the same cost distribution corresponds to symbols at different positions) we have F = LF1; R = LR1; H = LH1, where F1, R1, H1 are the functions for a single symbol found earlier [for instance, see (3.4.2)].
Example 3.2. Now we consider a more difficult example, for which the principle of cost additivity (3.4.3) does not hold. Suppose the choice of symbol y = 1 or y = 2 is cost-free but a cost is incurred when switching symbols. If symbol 1 follows 1 or symbol 2 follows 2 in a sequence, then there is no cost. But if 1 follows 2 or 2 follows 1, then the cost 2d is incurred. It is easy to see that in this case the total cost of the entire sequence yL can be written as follows:
$$c(y^L) = 2d \sum_{j=1}^{L-1} \left(1 - \delta_{y_j y_{j+1}}\right). \tag{3.4.4}$$
Further, it is required to find the conditional capacity of such a channel and its optimal probability distribution. We begin with a calculation of the partition function (3.3.11):
$$Z = e^{-2\beta d (L-1)} \sum_{y_1, \dots, y_L} \prod_{j=1}^{L-1} e^{2\beta d\, \delta_{y_j, y_{j+1}}}. \tag{3.4.5}$$
It is convenient to express it by means of the matrix
$$V = \|V_{yy'}\| = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix} = \begin{pmatrix} 1 & e^{-2\beta d} \\ e^{-2\beta d} & 1 \end{pmatrix}$$
in the form
$$Z = \sum_{y_1, y_L} \left(V^{L-1}\right)_{y_1 y_L}. \tag{3.4.6}$$
If we consider the orthogonal transformation
$$U = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$
bringing the mentioned matrix to a diagonal form, then we can find
$$V^{L-1} = \frac{1}{2} \begin{pmatrix} (1+k)^{L-1} + (1-k)^{L-1} & (1+k)^{L-1} - (1-k)^{L-1} \\ (1+k)^{L-1} - (1-k)^{L-1} & (1+k)^{L-1} + (1-k)^{L-1} \end{pmatrix}, \qquad k = e^{-2\beta d},$$
and
$$Z = 2(1+k)^{L-1} = 2^L e^{-\beta d (L-1)} \cosh^{L-1} \beta d.$$
Consequently, due to formula (3.3.12) we have
$$F = -LT \ln 2 + (L-1) d - (L-1) T \ln \cosh \frac{d}{T}.$$
According to (3.3.15), (3.3.16) it follows from here that
$$C = L \ln 2 + (L-1) \ln \cosh \frac{d}{T} - (L-1) \frac{d}{T} \tanh \frac{d}{T},$$
$$R = (L-1) d - (L-1) d \tanh \frac{d}{T}.$$
With the help of these functions it is easy to find the channel capacity and average cost per symbol in the asymptotic limit L → ∞:
$$C_1 = \lim_{L \to \infty} \frac{C}{L} = \ln 2 + \ln \cosh \frac{d}{T} - \frac{d}{T} \tanh \frac{d}{T},$$
$$R_1 = d - d \tanh (d/T).$$
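For short sequences the closed-form partition function and entropy of Example 3.2 can be compared with a brute-force enumeration of all 2^L sequences. The sketch below is illustrative only; the function names and the parameter values are assumptions rather than anything prescribed by the text.

```python
import itertools
import math

def switching_cost(seq, d):
    """Cost (3.4.4): 2d for every position where consecutive symbols differ."""
    return 2 * d * sum(1 for x, y in zip(seq, seq[1:]) if x != y)

def Z_brute(L, d, beta):
    """Partition function by summing over all 2^L sequences."""
    return sum(math.exp(-beta * switching_cost(seq, d))
               for seq in itertools.product((1, 2), repeat=L))

def Z_closed(L, d, beta):
    """Closed form Z = 2 (1 + k)^(L-1) with k = exp(-2 beta d)."""
    k = math.exp(-2 * beta * d)
    return 2 * (1 + k) ** (L - 1)

def entropy_brute(L, d, T):
    """Entropy of the optimal (Boltzmann) distribution over sequences."""
    beta = 1.0 / T
    weights = [math.exp(-beta * switching_cost(seq, d))
               for seq in itertools.product((1, 2), repeat=L)]
    Z = sum(weights)
    return -sum((w / Z) * math.log(w / Z) for w in weights)

def entropy_closed(L, d, T):
    """Closed form C = L ln 2 + (L-1) [ln cosh(d/T) - (d/T) tanh(d/T)]."""
    x = d / T
    return L * math.log(2) + (L - 1) * (math.log(math.cosh(x)) - x * math.tanh(x))

L, d, T = 8, 0.5, 0.8   # assumed illustrative values
print(Z_brute(L, d, 1 / T), Z_closed(L, d, 1 / T))
print(entropy_brute(L, d, T), entropy_closed(L, d, T))
```

The enumeration grows as 2^L, so it only serves as a check for small L; the closed form is what gives the limit per symbol.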
Now an optimal probability distribution for symbols of the recording cannot be decomposed into a product of (marginal) probability distributions for separate symbols. In other words, the sequence $y_1, y_2, \dots, y_L$ is not a process with independent constituents. However, this sequence turns out to be a simple Markov chain.
Example 3.3. Let us now consider the combined case when there are costs of both types: additive costs (3.4.3) of the same type as in Example 3.1 and paired costs (3.4.4). The total cost function has the form
$$c(y^L) = bL + a \sum_{j=1}^{L} (-1)^{y_j} + 2d \sum_{j=1}^{L-1} \left(1 - \delta_{y_j y_{j+1}}\right).$$
As in Example 3.2, it is convenient to compute the partition function corresponding to this function via a matrix method by using the formula
$$Z = \sum_{y_1, y_L} \exp\{-\beta a (-1)^{y_1}\} \left(V^{L-1}\right)_{y_1 y_L},$$
which is a bit more complicated than (3.4.6). However, now the matrix has a more complicated representation
$$V = e^{-\beta b - \beta d} \begin{pmatrix} e^{\beta a + \beta d} & e^{-\beta a - \beta d} \\ e^{\beta a - \beta d} & e^{-\beta a + \beta d} \end{pmatrix}. \tag{3.4.7}$$
Only the asymptotic case of large L is of practical interest. If we restrict ourselves to this case in advance, then the calculations are greatly simplified. The most essential role in the sum Z is then played by the terms corresponding to the largest eigenvalue $\lambda_m$ of matrix (3.4.7). As in other similar cases (for instance, see Section 3.1), when taking the logarithm ln Z and dividing it by L only those terms affect the limit. Thus, we will have
$$\lim_{L \to \infty} \frac{\ln Z}{L} = \ln \lambda_m. \tag{3.4.8}$$
The characteristic equation corresponding to matrix (3.4.7) is
$$\begin{vmatrix} e^{\beta a + \beta d} - \lambda' & e^{-\beta a - \beta d} \\ e^{\beta a - \beta d} & e^{-\beta a + \beta d} - \lambda' \end{vmatrix} = 0 \qquad \left(\lambda = e^{-\beta b - \beta d} \lambda'\right),$$
that is the equation
$$\lambda'^2 - 2 \lambda' e^{\beta d} \cosh \beta a + e^{2\beta d} - e^{-2\beta d} = 0.$$
Having taken the largest root of this equation and taking into account (3.4.8), we find the limit free energy computed for one symbol
$$F_1 = -T \ln \lambda_m = b - T \ln \left[ \cosh \frac{a}{T} + \sqrt{\sinh^2 \frac{a}{T} + e^{-4d/T}} \right].$$
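The limit (3.4.8) and the closed form for F1 can be illustrated numerically. The sketch below builds matrix (3.4.7) entry by entry, takes its largest eigenvalue from the trace and determinant, and compares −T ln Z / L for a long sequence with F1. It is an illustration under assumptions: the function names and parameter values are hypothetical, and the weight of the first symbol is taken as exp(−β c(y1)) including the constant b, which does not affect the limit.

```python
import math

def transfer_matrix(a, b, d, beta):
    """Matrix (3.4.7): V[y][y'] = exp(-beta * (cost of y' + switching cost))."""
    sign = {1: -1.0, 2: 1.0}   # (-1)^y for y = 1, 2
    return [[math.exp(-beta * (b + a * sign[y2] + (0.0 if y1 == y2 else 2 * d)))
             for y2 in (1, 2)] for y1 in (1, 2)]

def largest_eigenvalue(V):
    """Largest root of the 2x2 characteristic equation."""
    tr = V[0][0] + V[1][1]
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    return tr / 2 + math.sqrt(tr * tr / 4 - det)

def free_energy_closed(a, b, d, T):
    """Closed form for the free energy per symbol F1."""
    x = a / T
    return b - T * math.log(math.cosh(x)
                            + math.sqrt(math.sinh(x) ** 2 + math.exp(-4 * d / T)))

def log_Z(a, b, d, beta, L):
    """ln Z for a sequence of length L, with rescaling to avoid underflow."""
    V = transfer_matrix(a, b, d, beta)
    v = [math.exp(-beta * (b - a)), math.exp(-beta * (b + a))]  # first symbol
    log_scale = 0.0
    for _ in range(L - 1):
        v = [v[0] * V[0][0] + v[1] * V[1][0],
             v[0] * V[0][1] + v[1] * V[1][1]]
        s = v[0] + v[1]
        log_scale += math.log(s)
        v = [v[0] / s, v[1] / s]
    return log_scale + math.log(v[0] + v[1])

a, b, d, T = 0.4, 1.0, 0.3, 0.9   # assumed illustrative values
lam = largest_eigenvalue(transfer_matrix(a, b, d, 1 / T))
print(-T * math.log(lam), free_energy_closed(a, b, d, T))
print(-T * log_Z(a, b, d, 1 / T, 4000) / 4000)
```

Setting d = 0 or a = 0 in this sketch recovers the per-symbol free energies of Examples 3.1 and 3.2, respectively.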
From here we can derive free energy for one symbol and respective channel capacity in the usual way. As in Example 3.2, an optimal probability distribution corresponds to a Markov process. A respective transition probability $P(y_{j+1} \mid y_j)$ is connected with matrix (3.4.7) and differs from this matrix by normalization factors. Next we present the resulting expressions
$$P(1 \mid 1) = A_1 e^{\beta a + \beta d}; \qquad P(2 \mid 1) = A_1 e^{-\beta a - \beta d};$$
$$P(1 \mid 2) = A_2 e^{\beta a - \beta d}; \qquad P(2 \mid 2) = A_2 e^{-\beta a + \beta d};$$
$$\left(A_1^{-1} = e^{\beta a + \beta d} + e^{-\beta a - \beta d}; \qquad A_2^{-1} = e^{\beta a - \beta d} + e^{-\beta a + \beta d}\right).$$
The statistical systems considered in the last two examples have been studied in statistical physics under the name of ‘Ising model’ (for instance, see Hill [21] (originally published in English [20]) and Stratonovich [46]).
Example 3.4. Now we apply the methods of the general theory to the case of different symbol lengths, which was investigated in Section 3.1. We suppose that y is an ensemble of variables $k, V_{i_1}, \dots, V_{i_k}$, where k is the number of consecutive symbols in a recording and $V_{i_j}$ is the length of the symbol situated at the j-th place. If l(i) is the length of symbol i, then we should consider function (3.2.4) as a cost function. According to the general method we calculate the partition function for the case in question
$$Z = \sum_{k} \sum_{i_1, \dots, i_k} \exp[-\beta l(i_1) - \cdots - \beta l(i_k)] = \sum_{k=0}^{\infty} Z_1^k(\beta) = \frac{1}{1 - Z_1(\beta)}$$
(if the convergence condition $Z_1(\beta) < 1$ is satisfied). Here we denote
$$Z_1(\beta) = \sum_{i} e^{-\beta l(i)}. \tag{3.4.9}$$
Applying formulae (3.3.17), (3.3.13) we obtain that
$$R = \frac{d}{d\beta} \ln (1 - Z_1) = -\frac{dZ_1}{d\beta}(\beta)\, (1 - Z_1)^{-1}, \qquad C = -\ln (1 - Z_1) - \beta \frac{dZ_1}{d\beta} (1 - Z_1)^{-1} \qquad (F = T \ln (1 - Z_1)). \tag{3.4.10}$$
Let L be a fixed recording length. Then condition (3.2.8), (3.3.18) will take the form of the equation
$$-\frac{dZ_1(\beta)}{d\beta} (1 - Z_1(\beta))^{-1} = L \tag{3.4.11}$$
from which β can be determined. Formulae (3.4.10), (3.3.11), (3.4.9) provide the solution to the problem of computing the channel capacity C(L). It is also of interest to find the channel capacity rate
$$C_1 = C/L = C/R = \beta - \beta (F/R) \tag{3.4.12}$$
[we have applied (3.3.13)] and, particularly, its limit value as L = R → ∞. Differentiating (3.4.9) it is easy to find out that $-Z_1^{-1}\, dZ_1/d\beta$ has the meaning of average symbol length:
$$-\frac{1}{Z_1} \frac{dZ_1}{d\beta} = \sum_{i} l(i) P(i) = l_{\mathrm{av}}$$
analogously to (3.3.25). That is why equation (3.4.11) can be reduced to the form $Z_1(\beta) / (1 - Z_1(\beta)) = L / l_{\mathrm{av}}$. It is easily seen from the latter equation that $Z_1(\beta)$ converges to one:
$$Z_1(\beta) \to 1 \quad \text{as} \quad L / l_{\mathrm{av}} \to \infty. \tag{3.4.13}$$
But it is evident now that
$$(1 - Z_1) \ln (1 - Z_1) \to 0 \quad \text{as} \quad Z_1 \to 1.$$
Therefore, due to (3.4.10) we have
$$-\frac{\beta F}{R} = \frac{1}{dZ_1/d\beta} (1 - Z_1) \ln (1 - Z_1) \to 0 \quad \text{as} \quad L \to \infty \tag{3.4.14}$$
provided that $dZ_1/d\beta$ (i.e. $l_{\mathrm{av}}$) tends to the finite value $dZ_1(\beta_0)/d\beta$. According to (3.4.12) and (3.4.14) we have
$$C_1 = \lim \beta = \beta_0 \tag{3.4.15}$$
in this case. Due to (3.4.13) the limit value $\beta_0$ is determined from the equation
$$Z_1(\beta_0) = 1. \tag{3.4.16}$$
In consequence of (3.4.9) this equation is nothing but equations (3.1.7), (3.1.9) derived earlier. At the same time formula (3.4.15) coincides with relationship (3.1.12). So, we see that the general standardized method yields the same results as the special method applied earlier.
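Equation (3.4.16) is easy to solve numerically, because Z1(β) is strictly decreasing in β. The following sketch uses bisection; the function names and the set of symbol lengths {1, 2} are assumptions chosen for illustration. For these lengths the equation reads x + x^2 = 1 with x = e^{-β}, so the root is the logarithm of the golden ratio.

```python
import math

def Z1(beta, lengths):
    """Single-symbol partition function (3.4.9)."""
    return sum(math.exp(-beta * l) for l in lengths)

def capacity_rate(lengths, lo=1e-9, hi=50.0, tol=1e-12):
    """Solve Z1(beta0) = 1 by bisection.

    Assumes at least two symbols with positive lengths, so that
    Z1(lo) > 1 > Z1(hi) on the default bracket."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Z1(mid, lengths) > 1.0:
            lo = mid          # root lies to the right
        else:
            hi = mid          # root lies to the left
    return 0.5 * (lo + hi)

# Hypothetical alphabet: one symbol of length 1 and one of length 2.
beta0 = capacity_rate([1, 2])
print(beta0, math.log((1 + math.sqrt(5)) / 2))
```

With all lengths equal to 1 and m symbols, the same routine returns ln m, the capacity per symbol of an unconstrained alphabet.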
3.5 Methods of potentials in the case of a large number of parameters
The method of potentials described in Sections 3.3 and 3.4 can be generalized to more difficult cases when there is a larger number of external parameters; in that form the method resembles the methods usually used in statistical thermodynamics even more closely.
Here we outline possible ways to realize the above generalization and postpone a more elaborate analysis to the next chapter. Let the cost function c(y) = c(y, a) now depend on a numeric parameter a and be differentiable with respect to this parameter. Then the free energy
$$F(T, a) = -T \ln \sum_{y} e^{-c(y, a)/T} \tag{3.5.1}$$
and other functions introduced in Section 3.4 will depend not only on temperature T (or parameter β = 1/T), but also on the value of a. The same holds true for the optimal distribution (3.3.14), which now has the form
$$P(y \mid a) = \exp\left[\frac{F(T, a) - c(y, a)}{T}\right]. \tag{3.5.2}$$
Certainly, formula (3.3.15) and other results from Section 3.3 remain valid if we treat parameter a as constant when varying parameter T, i.e. if ordinary derivatives are replaced with partial ones. Hence, the entropy of distribution (3.5.2) is equal to
$$H_y = -\frac{\partial F(T, a)}{\partial T}. \tag{3.5.3}$$
Now in addition to these results we can derive a formula for partial derivatives taken with respect to the new parameter a. We differentiate (3.5.1) with respect to a and find
$$\frac{\partial F(T, a)}{\partial a} = \left[\sum_{y} \exp\left(-\frac{c(y, a)}{T}\right)\right]^{-1} \sum_{y} \frac{\partial c(y, a)}{\partial a} \exp\left(-\frac{c(y, a)}{T}\right)$$
or, equivalently,
$$\frac{\partial F(T, a)}{\partial a} = \sum_{y} \frac{\partial c(y, a)}{\partial a} P(y \mid a) \equiv -E[B(y) \mid a], \tag{3.5.4}$$
if we take into account (3.5.2) and (3.5.1). The function
$$B(y) = -\frac{\partial c(y, a)}{\partial a}$$
is called a random internal thermodynamic parameter conjugate to the external parameter a. As a consequence of formulae (3.5.3), (3.5.4) the total differential of free energy can be written as
$$dF(T, a) = -H_y\, dT - A\, da. \tag{3.5.5}$$
In particular, it follows from here that
$$A = -\frac{\partial F}{\partial a}. \tag{3.5.5a}$$
Internal parameter A defined by such a formula is called conjugate to parameter a with respect to potential F. In formula (3.5.5) $H_y$ and A = E[B] appear as functions of T and a. However, they can be interpreted as independent variables. Then instead of F(T, a) we consider another potential $\Phi(H_y, A)$ expressed in terms of F(T, a) via the Legendre transform:
$$\Phi(H_y, A) = F + H_y T + A a = F - T \frac{\partial F}{\partial T} - a \frac{\partial F}{\partial a}. \tag{3.5.6}$$
Then parameters T, a can be regarded as functions of $H_y$, A obtained by differentiation:
$$T = \frac{\partial \Phi}{\partial H_y}, \qquad a = \frac{\partial \Phi}{\partial A}, \tag{3.5.7}$$
since (3.5.5), (3.5.6) entail $d\Phi = T\, dH_y + a\, dA$. Of course, the Legendre transform can also be performed in just one of the two variables. It is convenient to use the indicated potentials and their derivatives for solving various variational problems related to conditional channel capacity. It is also convenient to take as independent variables those variables that are given in the problem formulation. Further we consider one illustrative example from a variety of possible problems pertaining to the choice of optimal encoding under a given cost function and a number of constraints imposed on the amount of information, average cost and other quantities.
Example 3.5. The goal is to encode a recording by symbols ρ. Each symbol has cost $c_0(\rho)$ as an attribute. Besides this cost, it is required to take account of one more expenditure η(ρ), say, the amount of paint used for a given symbol. If we introduce the cost of paint a, then the total cost will take the form
$$c(\rho) = c_0(\rho) + a \eta(\rho). \tag{3.5.8}$$
Let it be required to find an encoding such that the given amount of information I (per symbol) is recorded (transmitted), the fixed amount of paint K is spent (on average per symbol) and, in addition, the average cost $E[c_0(\rho)]$ is minimized. In order to solve this problem we introduce T and a as auxiliary parameters, which are initially undetermined and then found from additional conditions. Paint consumption η(ρ) per symbol ρ is considered as the random variable η. As the second variable ζ we choose a variable complementing η to ρ so that ρ = (η, ζ). Thus, the cost function (3.5.8) can be rewritten as
$$c(\eta, \zeta) = c_0(\eta, \zeta) + a \eta.$$
Now we can apply formulae (3.5.1)–(3.5.7) and other ones, for which B = −η, to the problem in question. According to (3.5.1) free energy F(T, a) is determined by the formula
$$F(T, a) = -T \ln \sum_{\eta, \zeta} \exp[-\beta c_0(\eta, \zeta) - \beta a \eta] \qquad \left(\beta = \frac{1}{T}\right), \tag{3.5.9}$$
and the optimal distribution (3.5.2) has the form
$$P(\eta, \zeta) = \exp[\beta F - \beta c_0(\eta, \zeta) - \beta a \eta]. \tag{3.5.10}$$
For the final determination of probabilities, entropy and other variables it remains to fix parameters T and a using the conditions formulated above. Namely, the entropy $H_{\eta\zeta}$ and the average paint consumption E[η] are assumed to be fixed:
$$H_{\eta\zeta} = I; \qquad -A \equiv E[\eta] = K. \tag{3.5.11}$$
Using formulae (3.5.3) and (3.5.4) we obtain the system of two equations
$$-\frac{\partial F(T, a)}{\partial T} = I; \qquad \frac{\partial F(T, a)}{\partial a} = K \tag{3.5.12}$$
for finding parameters T = T(I, K) and a = a(I, K). The optimal distribution (3.5.10) minimizes the total average cost R = E[c(η, ζ)] as well as the partial average cost
$$\Phi = E[c_0(\eta, \zeta)] = R - a E[\eta] \equiv R + a A,$$
since the difference R − Φ = aK remains constant for any variations. In view of $R = F + T H_{\eta\zeta}$ it is easily seen that the partial cost $\Phi = F + T H_{\eta\zeta} + a A$ (considered as a function of $H_{\eta\zeta}$, A) can be derived via the Legendre transform (3.5.6) of F(T, a), i.e. this partial cost is an example of potential $\Phi(H_y, A)$. Taking into account formulae (3.5.7) we represent the optimal distribution (3.5.10) in the following form:
$$P(\rho) = \mathrm{const} \cdot \exp\left[-\frac{c_0(\rho) + \eta\, \dfrac{\partial \Phi}{\partial A}(I, -K)}{\dfrac{\partial \Phi}{\partial H_{\eta\zeta}}(I, -K)}\right].$$
After having determined optimal probabilities P(ρ) completely we can perform actual encoding by methods of Section 3.6.
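Relations (3.5.3) and (3.5.4), which underlie the system (3.5.12), can be checked by finite differences on a toy version of the paint example. The sketch below is illustrative only: the symbol table SYMBOLS, the function names, and the values of T and a are hypothetical.

```python
import math

# Hypothetical symbols: (base cost c0, paint amount eta).
SYMBOLS = [(1.0, 0.2), (1.5, 0.1), (0.5, 0.9), (2.0, 0.0)]

def free_energy(T, a):
    """F(T, a) = -T ln sum exp(-(c0 + a*eta)/T), cf. (3.5.9)."""
    return -T * math.log(sum(math.exp(-(c0 + a * eta) / T) for c0, eta in SYMBOLS))

def distribution(T, a):
    """Optimal distribution (3.5.10)."""
    F = free_energy(T, a)
    return [math.exp((F - c0 - a * eta) / T) for c0, eta in SYMBOLS]

def entropy_and_paint(T, a):
    """Entropy and average paint consumption E[eta] of the optimal distribution."""
    P = distribution(T, a)
    H = -sum(p * math.log(p) for p in P)
    K = sum(p * eta for p, (_, eta) in zip(P, SYMBOLS))
    return H, K

T, a, h = 0.8, 1.3, 1e-5   # assumed values; h is the finite-difference step
H, K = entropy_and_paint(T, a)
dF_dT = (free_energy(T + h, a) - free_energy(T - h, a)) / (2 * h)
dF_da = (free_energy(T, a + h) - free_energy(T, a - h)) / (2 * h)
print(-dF_dT, H)   # -dF/dT should match the entropy
print(dF_da, K)    # dF/da should match E[eta]
```

In an actual design problem one would go the other way, solving the two equations (3.5.12) for T and a given I and K, for example with a two-dimensional root finder.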
3.6 Capacity of a noiseless channel with penalties in a generalized version
1. In Sections 3.2 and 3.3 we considered the capacity of a discrete channel without noise, but with penalties. As was stated, computation of that channel capacity can be reduced to solving the first variational problem of information theory. The results derived there can be generalized to the case of arbitrary random variables, which can, in particular, assume continuous values. The formulae for the generalized version of entropy provided in Section 1.6 hint at how to do so. We say that a noiseless channel (not necessarily discrete) is given if there are given a measurable space (X, F), a random variable ξ taking values in this space, an F-measurable function c(ξ) called a cost function, and also a measure ν on (X, F) (normalization of ν is not required). We define channel capacity C(a) for the level of losses a as the maximum value of entropy (1.6.9):
$$H_\xi = -\int P(d\xi) \ln \frac{P(d\xi)}{\nu(d\xi)}, \tag{3.6.1}$$
compatible with the constraint
$$\int c(\xi) P(d\xi) = a. \tag{3.6.2}$$
Basically, the given variational problem is solved in the same way as it was done in Section 3.3. But note that partial derivatives are replaced with variational derivatives in this modified approach. After variational differentiation with respect to P(dx) instead of (3.3.4) we will have the extremum condition:
$$\ln \frac{P(d\xi)}{\nu(d\xi)} = \beta F - \beta c(\xi), \tag{3.6.3}$$
where βF = −1 − γ. From here we obtain the extremum distribution
$$P(d\xi) = e^{\beta F - \beta c(\xi)} \nu(d\xi). \tag{3.6.3a}$$
Averaging (3.6.3) and taking into account (3.6.1), (3.6.2) we obtain that
$$H_\xi = \beta E[c(\xi)] - \beta F, \qquad C = \beta R - \beta F. \tag{3.6.3b}$$
The latter formula coincides with equality (3.3.13) of the discrete version. As in Section 3.3 we can introduce the following partition function (integral)
$$Z = \int e^{-c(\xi)/T} \nu(d\xi), \tag{3.6.4}$$
serving as a generalization of (3.3.11). We also introduce free energy F(T) = −T ln Z here. Relationship (3.3.15) and other results are proven in the same way as in Section 3.3. Formulae from Section 3.4 are extended to the case in question analogously, and formulae (3.5.1), (3.5.2) are replaced with
$$F(T, a) = -T \ln \int e^{-c(\xi, a)/T} \nu(d\xi), \tag{3.6.5}$$
$$P(d\xi \mid a) = \exp\left[\frac{F(T, a) - c(\xi, a)}{T}\right] \nu(d\xi). \tag{3.6.6}$$
Finally, the resulting equalities (3.5.3), (3.5.5) and other ones remain intact.
2. As an example we consider the case when X is an r-dimensional real space, $\xi = (x_1, \dots, x_r)$, and function c(ξ) is represented as a linear quadratic form
$$c(\xi) = c_0 + \frac{1}{2} \sum_{i, j = 1}^{r} (x_i - b_i) c_{ij} (x_j - b_j),$$
where $\|c_{ij}\|$ is a non-singular positive definite matrix. We suppose that $\nu(d\xi) = d\xi = dx_1 \cdots dx_r$. Then formula (3.6.4) turns into
$$Z \equiv e^{-F/T} = \int \exp\left[-c_0 \beta - \frac{\beta}{2} \sum_{i, j} y_i c_{ij} y_j\right] dy_1 \cdots dy_r \qquad (y_i = x_i - b_i).$$
Calculating the integral and taking the logarithm we find that
$$F(T) = c_0 + \frac{T}{2} \ln \det \|\beta c_{ij}\| = c_0 - \frac{rT}{2} \ln T + \frac{T}{2} \ln \det \|c_{ij}\|.$$
Hence, taking account of (3.3.16), equation (3.3.18) takes the form
$$c_0 + rT/2 = a. \tag{3.6.7}$$
At the same time formula (3.3.15) results in
$$C(a) = \frac{r}{2} (1 + \ln T) - \frac{1}{2} \ln \det \|c_{ij}\|,$$
i.e. due to (3.6.7) we obtain that
$$C(a) = \frac{r}{2} \left[1 - \ln \frac{r}{2(a - c_0)}\right] - \frac{1}{2} \ln \det \|c_{ij}\|.$$
In this case the extremum distribution is Gaussian and its entropy C(a) can also be found with the help of formulae from Section 5.4.
Chapter 4
First asymptotic theorem and related results
In the previous chapter, for one particular example (see Sections 3.1 and 3.4) we showed that in calculating the maximum entropy (i.e. the capacity of a noiseless channel) the constraint c(y) ≤ a imposed on feasible realizations is equivalent, for a sufficiently long code sequence, to the constraint E[c(y)] ≤ a on the mean value E[c(y)]. In this chapter we prove (Section 4.3) that under certain assumptions such equivalence takes place in the general case; this is the assertion of the first asymptotic theorem. In what follows, we shall also consider the other two asymptotic theorems (Chapters 7 and 11), which are the most profound results of information theory. All of them have the following feature in common: ultimately all these theorems state that, for very large systems, the difference between the concepts of discreteness and continuity disappears, and that the characteristics of a large collection of discrete objects can be calculated using a continuous functional dependence involving averaged quantities. For the first variational problem, this feature is expressed by the fact that the discrete function H = ln M of a, which exists under the constraint c(y) ≤ a, is asymptotically replaced by a continuous function H(a) calculated by solving the first variational problem. As far as the proof is concerned, the first asymptotic theorem turns out to be related to the theorem on canonical distribution stability (Section 4.2), which is very important in statistical thermodynamics and which is actually proved there when the canonical distribution is derived from the microcanonical one. Here we consider it in a more general and abstract form. The relationship between the first asymptotic theorem and the theorem on the canonical distribution once more underlines the intrinsic unity of the mathematical apparatus of information theory and statistical thermodynamics. Potential Γ(α) and its properties are used in the process of proving the indicated theorems.
The material about this potential is presented in Section 4.1. It is related to the content of Section 3.3. However, instead of regular physical free energy F we consider dimensionless free energy, that is potential Γ = −F/T . Instead of parameters T , a2 , a3 , . . . common in thermodynamics we introduce symmetrically defined parameters α1 = −1/T , α2 = a2 /T , α3 = a3 /T , . . .. Under such choice the temperature is an ordinary thermodynamic parameter along with the others. © Springer Nature Switzerland AG 2020 R. V. Belavkin et al. (eds.), Theory of Information and its Value, https://doi.org/10.1007/978-3-030-22833-0 4
When ‘thermodynamic’ potentials are used systematically, it is convenient to treat the logarithm μ(s) = ln Θ(s) of the characteristic function Θ(s) as a characteristic potential. Indeed, this logarithm is the cumulant generating function, just as the potential Γ(α) is. Section 4.4 presents auxiliary theorems involving the characteristic potential. These theorems are subsequently used in proving the second and third asymptotic theorems.
4.1 Potential Γ or the cumulant generating function
Consider a thermodynamic system or an informational system, for which formula (3.5.1) is relevant. For such a system we introduce symmetrically defined parameters $\alpha_1, \dots, \alpha_r$ and the corresponding potential Γ(α) that is mathematically equivalent to free energy F, but has the following advantage over F: Γ is the cumulant generating function. For a physical thermodynamic system the equilibrium distribution is usually represented by Gibbs’ formula:
$$P(d\zeta) = \exp\left[\frac{F - \mathcal{H}(\zeta, a)}{T}\right] d\zeta \tag{4.1.1}$$
where ζ = (p, q) are dynamic variables (coordinates and momenta); $\mathcal{H}(\zeta, a)$ is a Hamilton’s function dependent on parameters $a_2, \dots, a_r$. Formula (4.1.1) is an analog of formula (3.5.2). Now assume that the Hamilton’s function is linear with respect to the specified parameters
$$\mathcal{H}(\zeta, a) = \mathcal{H}_0(\zeta) - a_2 F^2(\zeta) - \cdots - a_r F^r(\zeta). \tag{4.1.2}$$
Then (4.1.1) becomes
$$P(d\zeta) = \exp\left[\frac{F - \mathcal{H}_0(\zeta) + a_2 F^2(\zeta) + \cdots + a_r F^r(\zeta)}{T}\right] d\zeta. \tag{4.1.3}$$
Further we introduce new parameters
$$\alpha_1 = -1/T \equiv -\beta; \quad \alpha_2 = a_2/T; \quad \dots; \quad \alpha_r = a_r/T, \tag{4.1.4}$$
which we call canonical external parameters. Parameters
$$B_1 = \mathcal{H}_0; \quad B_2 = F^2; \quad \dots; \quad B_r = F^r$$
are called random internal parameters and also
$$A_1 = E[\mathcal{H}_0]; \quad A_2 = E[F^2]; \quad \dots; \quad A_r = E[F^r] \tag{4.1.5}$$
are called canonical internal parameters conjugate to external parameters $\alpha_1, \dots, \alpha_r$. Moreover, we introduce the potential Γ and rewrite distribution (4.1.3) in a canonical form:
$$P(d\zeta \mid \alpha) = \exp[-\Gamma(\alpha) + B_1(\zeta) \alpha_1 + \cdots + B_r(\zeta) \alpha_r]\, d\zeta. \tag{4.1.6}$$
Here −Γ(α) = F/T or, as it follows from the normalization constraint, equivalently,
$$\Gamma(\alpha) = \ln Z \tag{4.1.7}$$
where
$$Z = \int \exp[B_1 \alpha_1 + \cdots + B_r \alpha_r]\, d\zeta \tag{4.1.8}$$
is the corresponding partition function. In the symmetric form (4.1.6) the temperature parameter $\alpha_1$ is on equal terms with the other parameters $\alpha_2, \dots, \alpha_r$, which are connected with $a_2, \dots, a_r$. If c(y, a) is linear with respect to a, then expression (3.5.2) can also be represented in this form. We integrate (4.1.6) with respect to variables η such that (B, η) together form ζ. Then we obtain the distribution over the random internal parameters
$$P(dB \mid \alpha) = \exp[-\Gamma(\alpha) + B\alpha - \Phi(B)]\, dB \tag{4.1.9}$$
where
$$B\alpha = B_1 \alpha_1 + \cdots + B_r \alpha_r; \qquad \int_{\Delta B} e^{-\Phi(B)}\, dB = \int_{\Delta B} d\zeta.$$
In turn, we call the latter distribution canonical. In the case of the canonical distribution (4.1.6) it is easy to express the characteristic function
$$\Theta(iu) = \int e^{iuB} P(dB \mid \alpha) \tag{4.1.10}$$
of random variables $B_1, \dots, B_r$ by Γ. Indeed, we substitute (4.1.9) into (4.1.10) to obtain
$$\Theta(iu) = e^{-\Gamma(\alpha)} \int \exp[(iu + \alpha) B - \Phi(B)]\, dB.$$
Taking into account
$$e^{\Gamma(\alpha)} = \int \exp[B\alpha - \Phi(B)]\, dB \tag{4.1.10a}$$
and the normalization of distribution (4.1.9) it follows from here that
$$\Theta(iu) = \exp[\Gamma(\alpha + iu) - \Gamma(\alpha)]. \tag{4.1.11}$$
The logarithm
$$\mu(s) = \ln \Theta(s) = \ln \int e^{sB(\zeta)} P(d\zeta \mid \alpha) \tag{4.1.11a}$$
of characteristic function (4.1.10) represents the characteristic potential or the cumulant generating function. As is well known, the cumulants are calculated by differentiation:
$$k_{j_1, \dots, j_m} = \frac{\partial^m \mu}{\partial s_{j_1} \cdots \partial s_{j_m}}(0). \tag{4.1.12}$$
Substituting the equality following from (4.1.11)
$$\mu(s) = \Gamma(\alpha + s) - \Gamma(\alpha), \tag{4.1.12a}$$
into this formula, we eventually find that
$$k_{j_1, \dots, j_m} = \frac{\partial^m \Gamma(\alpha)}{\partial \alpha_{j_1} \cdots \partial \alpha_{j_m}}. \tag{4.1.13}$$
Hence we see that the potential Γ(α) is the cumulant generating function for the whole family of distributions P(dB | α). For m = 1 we have from (4.1.13) that
$$k_j \equiv A_j \equiv E[B_j] = \frac{\partial \Gamma(\alpha)}{\partial \alpha_j}. \tag{4.1.14}$$
The first of these formulae
$$E[B_1] \equiv A_1 = \frac{\partial \Gamma(\alpha)}{\partial \alpha_1}$$
is equivalent to formulae (3.3.16), (3.3.17). The other formulae
$$A_l = \frac{\partial \Gamma(\alpha)}{\partial \alpha_l}, \qquad l = 2, \dots, r$$
are equivalent to the equalities
$$A_l = -\frac{\partial F}{\partial a_l}, \qquad l = 2, \dots, r,$$
of type (3.5.5a). With the given definition of parameters, the relationship defining entropy has a particular form in which energy (average cost) and temperature appear as ordinary parameters. Substituting (4.1.1) into the formula for physical entropy
$$H = -E\left[\ln \frac{P(d\zeta)}{d\zeta}\right]$$
or applying an analogous formula for a discrete version, we get
$$H = -F/T + (1/T) E[\mathcal{H}(\zeta, a)]. \tag{4.1.14a}$$
Plugging (4.1.2) in here and taking account of notations (4.1.4), (4.1.5), we obtain H = Γ − αE[B] or, equivalently, if we take into account (4.1.14),
$$H = \Gamma(\alpha) - \sum_{j=1}^{r} \alpha_j \frac{\partial \Gamma(\alpha)}{\partial \alpha_j}. \tag{4.1.14b}$$
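Formulae (4.1.13) and (4.1.14b) can be illustrated numerically for a discrete one-parameter canonical family. In the sketch below the values of B, the weights NU of the reference measure, and the function names are hypothetical; the derivatives of Γ are approximated by central finite differences and compared with the mean and variance computed directly from the distribution.

```python
import math

B = [0.0, 1.0, 2.5, 4.0]      # hypothetical values of the internal parameter
NU = [1.0, 0.7, 0.9, 0.5]     # hypothetical weights of the reference measure

def gamma(alpha):
    """Potential Gamma(alpha) = ln sum nu(y) exp(alpha * B(y))."""
    return math.log(sum(n * math.exp(alpha * b) for n, b in zip(NU, B)))

def canonical(alpha):
    """Canonical distribution P(y | alpha) = exp(-Gamma + alpha * B(y)) nu(y)."""
    G = gamma(alpha)
    return [n * math.exp(-G + alpha * b) for n, b in zip(NU, B)]

alpha, h = -0.6, 1e-4          # assumed parameter value and step size
P = canonical(alpha)
mean = sum(p * b for p, b in zip(P, B))
var = sum(p * (b - mean) ** 2 for p, b in zip(P, B))
# Entropy relative to nu, as in (3.6.1): H = -sum P ln(P / nu).
H = -sum(p * math.log(p / n) for p, n in zip(P, NU))

d1 = (gamma(alpha + h) - gamma(alpha - h)) / (2 * h)                   # first cumulant
d2 = (gamma(alpha + h) - 2 * gamma(alpha) + gamma(alpha - h)) / h**2   # second cumulant
print(mean, d1)
print(var, d2)
print(H, gamma(alpha) - alpha * d1)   # relation (4.1.14b) for r = 1
```

The second derivative is a variance and hence non-negative, which is the numerical face of the convexity of Γ.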
Further we provide one more corollary from formula (4.1.13).
Theorem 4.1. Canonical potential Γ(α) is a convex function of parameters α defined in their domain $Q_a$.
Proof. We present our proof under the additional assumption that the potential is twice differentiable. We employ formula (4.1.13) for m = 2. It takes the form
$$k_{ij} = \frac{\partial^2 \Gamma(\alpha)}{\partial \alpha_i \partial \alpha_j}. \tag{4.1.15}$$
We also take into account that the correlation matrix $k_{ij}$ is positive semi-definite. Therefore, the matrix of the second derivatives $\partial^2 \Gamma / \partial \alpha_i \partial \alpha_j$ is positive semi-definite as well, which proves convexity of the potential. This ends the proof.
Corollary 4.1. In the presence of just one parameter $\alpha_1$, r = 1, function $H(A_1)$ defined by the Legendre transform
$$H(A_1) = \Gamma - \alpha_1(A_1) A_1 \qquad \left(A_1 = \frac{d\Gamma}{d\alpha_1}(\alpha_1)\right) \tag{4.1.16}$$
is concave. Indeed, as it follows from formula (4.1.16) and the formula of the inverse Legendre transform
$$\Gamma(\alpha_1) = H + \alpha_1 A_1(\alpha_1) \qquad \left(\alpha_1 = -\frac{dH}{dA_1}(A_1)\right),$$
the following relationships
$$\frac{d^2 \Gamma(\alpha_1)}{d\alpha_1^2} = \frac{dA_1}{d\alpha_1}; \qquad \frac{d^2 H(A_1)}{dA_1^2} = -\frac{d\alpha_1}{dA_1}$$
are valid. Since by virtue of Theorem 4.1 the inequality $d^2 \Gamma / d\alpha_1^2 \geq 0$ holds true (when the differentiability condition holds), we deduce that $dA_1 / d\alpha_1 \geq 0$ and thereby
$$\frac{d^2 H}{dA_1^2} = -\frac{d\alpha_1}{dA_1} \leq 0. \tag{4.1.16a}$$
This statement can also be proven without using the differentiability condition. In conclusion of this paragraph we provide formulae pertaining to a characteristic potential of entropy that has been defined earlier by the formula of type (1.5.15). Nonetheless, these formulae generalize the previous one.
Theorem 4.2. For the canonical family of distributions
$$P(d\zeta \mid \alpha) = e^{-\Gamma(\alpha) + \alpha B(\zeta)} \nu(d\zeta) \tag{4.1.17}$$
the characteristic potential
$$\mu_0(s_0) = \ln \int e^{s_0 H(\zeta \mid \alpha)} P(d\zeta \mid \alpha)$$
of random entropy
$$H(\zeta \mid \alpha) = -\ln \frac{P(d\zeta \mid \alpha)}{\nu(d\zeta)} \tag{4.1.18}$$
has the form
$$\mu_0(s_0) = \Gamma((1 - s_0) \alpha_1, \dots, (1 - s_0) \alpha_r) - (1 - s_0) \Gamma(\alpha_1, \dots, \alpha_r). \tag{4.1.19}$$
To prove this, it is sufficient to substitute (4.1.17) into (4.1.18) and take into consideration the formula $\Gamma(\alpha) = \ln \int e^{\alpha B(\zeta)} \nu(d\zeta)$, which is of type (4.1.10a) defining Γ(α). Differentiating (4.1.19) with respect to $s_0$ and setting $s_0$ to zero (analogously to (4.1.12) for m = 1), we can find the mean value of entropy, which coincides with (4.1.14b). Repeated differentiation will yield the expression for variance.
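Formula (4.1.19) can be verified directly on a small discrete family. The sketch below computes the characteristic potential of the random entropy from its definition and compares it with the right-hand side of (4.1.19); the values of B, the measure weights NU, and the function names are assumptions made for illustration.

```python
import math

B = [0.0, 1.0, 2.5, 4.0]      # hypothetical values of B(zeta)
NU = [1.0, 0.7, 0.9, 0.5]     # hypothetical weights of the measure nu

def gamma(alpha):
    """Gamma(alpha) = ln sum nu * exp(alpha * B)."""
    return math.log(sum(n * math.exp(alpha * b) for n, b in zip(NU, B)))

def mu0_direct(s0, alpha):
    """ln E[exp(s0 * H)] with random entropy H = -ln(P / nu) = Gamma - alpha * B."""
    G = gamma(alpha)
    total = 0.0
    for n, b in zip(NU, B):
        p = n * math.exp(-G + alpha * b)   # canonical probability (4.1.17)
        h = G - alpha * b                  # random entropy (4.1.18)
        total += math.exp(s0 * h) * p
    return math.log(total)

def mu0_formula(s0, alpha):
    """Right-hand side of (4.1.19) for a single parameter."""
    return gamma((1 - s0) * alpha) - (1 - s0) * gamma(alpha)

print(mu0_direct(0.3, -0.8), mu0_formula(0.3, -0.8))
```

Differentiating mu0_formula numerically at s0 = 0 reproduces the mean entropy Γ − α dΓ/dα of (4.1.14b).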
4.2 Some asymptotic results of statistical thermodynamics. Stability of the canonical distribution
The deepest results of information theory and statistical thermodynamics have an asymptotic nature, i.e. are represented in the form of limit theorems for a growing composite system. Before considering the first asymptotic theorem of information theory, we present a related (as is seen from the proof) result from statistical thermodynamics, namely, an important theorem about stability of the canonical distribution. In the case of just one parameter the latter distribution has the form
$$P(\xi \mid \alpha) = \exp[-\Gamma(\alpha) + \alpha B(\xi) - \varphi(\xi)]. \tag{4.2.1}$$
If $B(\xi) = \mathcal{H}(p, q)$ is understood as the energy of a system, that is, its Hamilton’s function, and φ(ξ) is supposed to be zero, then the indicated distribution becomes the canonical Gibbs distribution:
$$\exp\left[\frac{F(T) - \mathcal{H}(p, q)}{T}\right], \qquad F(T) = -T\, \Gamma\left(-\frac{1}{T}\right),$$
where T = −1/α is temperature. The theorem about stability of this distribution (i.e. about the fact that it is formed by a ‘microcanonical’ distribution for a composite system including a thermostat) is called Gibbs theorem. Adhering to the general and formal exposition style adopted in this chapter, we formulate the addressed theorem in abstract form. First, we introduce several additional notions. We call the conditional distribution
$$P_n(\xi_1, \dots, \xi_n \mid \alpha) \tag{4.2.2}$$
an n-th degree of the distribution
$$P_1(\xi_1 \mid \alpha), \tag{4.2.3}$$
if
$$P_n(\xi_1, \dots, \xi_n \mid \alpha) = P_1(\xi_1 \mid \alpha) \cdots P_1(\xi_n \mid \alpha). \tag{4.2.4}$$
Let the distribution (4.2.3) be canonical:
$$\ln P_1(\xi_1 \mid \alpha) = -\Gamma_1(\alpha) + \alpha B_1(\xi_1) - \varphi_1(\xi_1). \tag{4.2.5}$$
Then in consequence of (4.2.4) we obtain that the following equality holds true for the joint distribution (4.2.2):
$$\ln P_n(\xi_1, \dots, \xi_n \mid \alpha) = -n \Gamma_1(\alpha) + \alpha B_n(\xi_1, \dots, \xi_n) - \varphi_n(\xi_1, \dots, \xi_n), \tag{4.2.6}$$
where
$$B_n(\xi_1, \dots, \xi_n) = \sum_{k=1}^{n} B_1(\xi_k), \qquad \varphi_n(\xi_1, \dots, \xi_n) = \sum_{k=1}^{n} \varphi_1(\xi_k).$$
Evidently, it is canonical as well. There may be several parameters $\alpha = (\alpha_1, \dots, \alpha_r)$. In this case αB denotes the inner product $\sum_i \alpha_i B_i$. Canonical internal parameters $B_1(\xi_k)$, for which the constraint $B_n = \sum_k B_1(\xi_k)$ is valid, are called extensive. In turn, the respective external parameters α are called intensive. It is easy to show that if the n-th degree of a distribution is canonical, then the distribution itself is canonical. Indeed, applying (4.2.4) and the definition of canonicity we obtain that
$$\sum_{k=1}^{n} \ln P_1(\xi_k \mid \alpha) = -\Gamma_n(\alpha) + \alpha B_n(\xi_1, \dots, \xi_n) - \varphi_n(\xi_1, \dots, \xi_n).$$
Suppose here $\xi_1 = \xi_2 = \cdots = \xi_n$. This yields
$$n \ln P_1(\xi_1 \mid \alpha) = -\Gamma_n(\alpha) + \alpha B_n(\xi_1, \dots, \xi_1) - \varphi_n(\xi_1, \dots, \xi_1),$$
i.e. the condition of canonicity (4.2.1) is actually satisfied for the ‘small’ system (4.2.3), and
$$B_1(\xi_1) = \frac{1}{n} B_n(\xi_1, \dots, \xi_1); \qquad \varphi_1(\xi_1) = \frac{1}{n} \varphi_n(\xi_1, \dots, \xi_1).$$
The derivation of the canonical ‘small’ distribution from the canonical ‘large’ distribution is natural, of course. The following fact proven below is deeper: the canonical ‘small’ distribution is approximately formed from a non-canonical ‘large’ distribution. Therefore, the canonical form of a distribution appears to be stable in the sense that it is asymptotically formed from different ‘large’ distributions. In fact, this explains the important role of the canonical distribution in theory, particularly, in statistical physics.
Theorem 4.3. Let the functions $B_1(\xi)$, $\varphi_1(\xi)$ be given, for which the corresponding canonical distribution is of the form
$$\tilde{P}_1(\xi \mid \alpha) = \exp[-\Gamma(\alpha) + \alpha B_1(\xi) - \varphi_1(\xi)]. \tag{4.2.7}$$
Let the distribution of the ‘large’ system be also given:
$$P_n(\xi_1, \dots, \xi_n \mid A) = e^{-\Psi_n(A) - \sum_{k=1}^{n} \varphi_1(\xi_k)}\, \delta\left(\sum_{k=1}^{n} B_1(\xi_k) - nA\right), \tag{4.2.8}$$
where the functions $\Psi_n(A)$ are determined from the normalization constraint, and A plays the role of a parameter. Then the distribution of a ‘small’ system
$$P_1(\xi_1 \mid A) = \sum_{\xi_2, \dots, \xi_n} P_n(\xi_1, \dots, \xi_n \mid A), \tag{4.2.9}$$
which is formed from the initial distribution by summation, asymptotically transforms into the distribution (4.2.7), when the change α = α(A) of the parameter is made. Namely,
$$\ln P_1(\xi_1 \mid A) = -\psi(A) + \alpha(A) B_1(\xi_1) - \varphi_1(\xi_1) + \frac{1}{n} B_1^2 \chi(A) + O(n^{-2}). \tag{4.2.10}$$
The functions $\psi(A)$, $\alpha(A)$, $\chi(A)$ are defined by formulae (4.2.17) given below. We also assume that function (4.2.13) is differentiable a sufficient number of times and that equation (4.2.15) has a root.

A distribution of type (4.2.8) is called ‘microcanonical’ in statistical physics.

Proof. Using the integral representation of the delta-function
$$\delta(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i\kappa x}\, d\kappa, \tag{4.2.11}$$
we rewrite the ‘microcanonical’ distribution (4.2.8) in the form
4.2 Asymptotic results of statistical thermodynamics. Stability of the canonical distribution
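$$P_n(\xi_1, \ldots, \xi_n \mid A) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp\Bigl\{-\Psi_n(A) - i\kappa n A + \sum_{k=1}^{n} [i\kappa B_1(\xi_k) - \varphi_1(\xi_k)]\Bigr\}\, d\kappa$$
and substitute this equality into (4.2.9). In the resulting expression
$$P_1(\xi_1 \mid A) = \frac{1}{2\pi} \sum_{\xi_2, \ldots, \xi_n} \int_{-\infty}^{\infty} \exp\Bigl\{-\Psi_n(A) - i\kappa n A + \sum_{k=1}^{n} [i\kappa B_1(\xi_k) - \varphi_1(\xi_k)]\Bigr\}\, d\kappa$$
we perform summation over $\xi_2, \ldots, \xi_n$ employing formula (4.1.10a). That results in
$$P_1(\xi_1 \mid A) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp\{-\Psi_n(A) - i\kappa n A + (n-1)\Gamma(i\kappa) + i\kappa B_1(\xi_1) - \varphi_1(\xi_1)\}\, d\kappa \equiv \exp[-\Psi_n(A) - \varphi_1(\xi_1)]\, I, \tag{4.2.12}$$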
where
$$\Gamma(\alpha) = \ln \sum_{\xi_k} \exp[\alpha B_1(\xi_k) - \varphi_1(\xi_k)]. \tag{4.2.13}$$
We apply the method of the steepest descent to the integral in (4.2.12) using the fact that $n$ takes large values. Further, we determine a saddle point $i\kappa = \alpha_0$ from the extremum condition for the expression situated in the exponent of (4.2.12), i.e. from the equation
$$(n-1) \frac{d\Gamma}{d\alpha}(\alpha_0) = nA - B_1(\xi_1). \tag{4.2.14}$$
Since this point turns out to depend on $\xi_1$, it is convenient to also consider the point $\alpha_1$ independent of $\xi_1$ and defined by the equation
$$\frac{d\Gamma}{d\alpha}(\alpha_1) = A. \tag{4.2.15}$$
Point $\alpha_1$ is close to $\alpha_0$ for large $n$. It follows from (4.2.14), (4.2.15) that
$$\Gamma'(\alpha_0) - \Gamma'(\alpha_1) = \Gamma''(\alpha_1)\varepsilon + \frac{1}{2}\Gamma'''(\alpha_1)\varepsilon^2 + \cdots = \frac{A - B_1}{n-1} \qquad (\varepsilon = \alpha_0 - \alpha_1).$$
From here we have
$$\varepsilon = \frac{A - B_1}{(n-1)\Gamma''} - \frac{\Gamma'''}{2\Gamma''}\varepsilon^2 - \cdots = \frac{A - B_1}{(n-1)\Gamma''} - \frac{\Gamma'''(A - B_1)^2}{2(n-1)^2(\Gamma'')^3} + O(n^{-3}). \tag{4.2.16}$$
Since $\Gamma''(\alpha)$ can be interpreted as the variance $\sigma^2 = k_{11}$ of some random variable according to (4.1.15), the inequality
$$\Gamma''(\alpha_0) > 0 \tag{4.2.16a}$$
holds true and, consequently, the direction of the steepest descent of the function $(n-1)\Gamma(\alpha) - \alpha nA + \alpha B_1$ at point $\alpha_0$ (and also at point $\alpha_1$) is orthogonal to the real axis. Indeed, if the difference $\alpha - \alpha_0 = iy$ is imaginary, then
$$(n-1)\Gamma(\alpha) - n\alpha A + \alpha B_1 = (n-1)\Gamma(\alpha_0) - n\alpha_0 A + \alpha_0 B_1 - \frac{1}{2}(n-1)\Gamma''(\alpha_0) y^2 + O(y^3).$$
Drawing the contour of integration through point $\alpha_0$ in the direction of the steepest descent, we use the equality that follows from the formula
$$\int \exp\Bigl[-\frac{a}{2}x^2 + \frac{b}{6}x^3 + \frac{c}{24}x^4 + \cdots\Bigr] dx \approx (2\pi/a)^{1/2} \exp[c/8a^2 + \cdots], \tag{4.2.16b}$$
where $a > 0$, $b/a^{3/2} \ll 1$, $c/a^2 \ll 1$, .... This equality is
$$I \equiv \frac{1}{2\pi} \int \exp\{(n-1)\Gamma(i\kappa) + i\kappa[B_1(\xi_1) - nA]\}\, d\kappa = [2\pi(n-1)\Gamma''(\alpha_0)]^{-1/2} \exp\Bigl\{(n-1)\Gamma(\alpha_0) + \alpha_0[B_1(\xi_1) - nA] + \frac{\Gamma''''(\alpha_0)}{8(n-1)\Gamma''(\alpha_0)^2} + O(n^{-2})\Bigr\}. \tag{4.2.16c}$$
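Formula (4.2.16b) can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes illustrative values $a = 50$, $b = 0$, $c = -100$ (chosen so that $c/a^2$ is small and the integral converges), integrates the left-hand side by the trapezoidal rule and compares it with the right-hand side, with and without the factor $\exp[c/8a^2]$. The function names are hypothetical.

```python
import math

def log_integrand(x, a, b, c):
    # Exponent of the integrand in (4.2.16b), truncated after the quartic term.
    return -a * x**2 / 2 + b * x**3 / 6 + c * x**4 / 24

def numeric_integral(a, b, c, half_width=3.0, steps=20_000):
    """Trapezoidal rule on [-half_width, half_width]; the integrand is negligible outside."""
    h = 2 * half_width / steps
    values = [math.exp(log_integrand(-half_width + i * h, a, b, c))
              for i in range(steps + 1)]
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))

# Assumed illustrative parameters (not taken from the book).
A, B, C = 50.0, 0.0, -100.0

exact = numeric_integral(A, B, C)
gaussian = math.sqrt(2 * math.pi / A)              # leading-order term only
corrected = gaussian * math.exp(C / (8 * A**2))    # right-hand side of (4.2.16b)

print("numerical integral:", exact)
print("Gaussian term only:", gaussian)
print("with exp[c/8a^2]  :", corrected)
```

Under these assumptions the corrected value should lie noticeably closer to the numerical integral than the purely Gaussian term does, which is the property used in passing from (4.2.12) to (4.2.16c).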
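Employing (4.2.16) here, we can express the value $\alpha_0$ in terms of $\alpha_1$ by performing an expansion in series with respect to $\varepsilon = \alpha_0 - \alpha_1$ and taking into account the required number of terms. This leads to the result
$$I = [2\pi(n-1)\Gamma''(\alpha_1)]^{-1/2} \exp\Bigl\{(n-1)\Gamma(\alpha_1) + \alpha_1[B_1(\xi_1) - nA] - \frac{\Gamma'''(\alpha_1)[A - B_1(\xi_1)]}{2n\Gamma''(\alpha_1)^2} - \frac{[B_1(\xi_1) - A]^2}{2n\Gamma''(\alpha_1)} + \frac{\Gamma''''(\alpha_1)}{8n\Gamma''(\alpha_1)^2} + O(n^{-2})\Bigr\}.$$
Substituting this expression into (4.2.12) and denoting
$$\psi(A) = \Psi_n(A) - (n-1)\Gamma(\alpha_1) + n\alpha_1 A + \frac{1}{2}\ln[2\pi(n-1)\Gamma''(\alpha_1)] - \frac{\Gamma''''(\alpha_1)}{8n\Gamma''(\alpha_1)^2} + \frac{A^2}{2n\Gamma''(\alpha_1)} + \frac{A\Gamma'''(\alpha_1)}{2n\Gamma''(\alpha_1)^2};$$
$$\alpha(A) = \alpha_1 + \frac{A}{n\Gamma''(\alpha_1)} + \frac{\Gamma'''(\alpha_1)}{2n\Gamma''(\alpha_1)^2}; \qquad \chi(A) = -\frac{1}{2\Gamma''(\alpha_1)}, \tag{4.2.17}$$
we obtain (4.2.10). The proof is complete.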
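The statement of Theorem 4.3 can be illustrated on a toy system. The sketch below is only an illustration under stated assumptions, not the author's computation: it assumes the alphabet $\xi \in \{0, 1, 2\}$ with $B_1(\xi) = \xi$ and $\varphi_1 = 0$, takes $A = 1/2$, computes the marginal (4.2.9) of the ‘microcanonical’ distribution exactly by counting sequences, and compares it with the canonical distribution (4.2.7) whose parameter satisfies (4.2.15). All function names are hypothetical.

```python
import math

VALUES = (0, 1, 2)  # assumed toy alphabet with B1(xi) = xi and phi1(xi) = 0

def sum_counts(length, max_total):
    """counts[t] = number of sequences of `length` symbols from VALUES with sum t."""
    counts = [1] + [0] * max_total           # the empty sequence has sum 0
    for _ in range(length):
        new = [0] * (max_total + 1)
        for s, c in enumerate(counts):
            if c:
                for v in VALUES:
                    if s + v <= max_total:
                        new[s + v] += c
        counts = new
    return counts

def microcanonical_marginal(n, total):
    """Distribution of xi_1 when (xi_1, ..., xi_n) is uniform on {sum = total}, cf. (4.2.9)."""
    rest = sum_counts(n - 1, total)
    weights = [rest[total - b] if total - b >= 0 else 0 for b in VALUES]
    norm = sum(weights)
    return [w / norm for w in weights]

def canonical(mean):
    """Canonical distribution proportional to exp(alpha * b) with the given mean, cf. (4.2.15)."""
    def mean_of(alpha):
        w = [math.exp(alpha * b) for b in VALUES]
        return sum(b * x for b, x in zip(VALUES, w)) / sum(w)
    lo, hi = -30.0, 30.0
    for _ in range(200):                      # bisection; mean_of is increasing in alpha
        mid = 0.5 * (lo + hi)
        if mean_of(mid) < mean:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    w = [math.exp(alpha * b) for b in VALUES]
    return alpha, [x / sum(w) for x in w]

def max_gap(n):
    total = n // 2                            # A = 1/2, so the sum equals n * A
    p_micro = microcanonical_marginal(n, total)
    _, p_canon = canonical(total / n)
    return max(abs(a - b) for a, b in zip(p_micro, p_canon))

for n in (40, 400):
    print(n, max_gap(n))
```

The largest discrepancy between the two distributions should shrink roughly like $1/n$, in line with the correction terms of order $1/n$ in (4.2.10) and (4.2.17).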
It is not necessary to account for the first equality in (4.2.17) because function $\psi(A)$ is unambiguously determined by functions $\alpha(A)$, $\chi(A)$ due to the normalization constraint. Since a number of terms in (4.2.17) disappear in the limit $n \to \infty$, the limit expression (4.2.10) has the form
$$\ln P_1(\xi_1 \mid A) = \mathrm{const} + \alpha B_1(\xi_1) - \varphi_1(\xi_1) \qquad (\Gamma'(\alpha) = A),$$
or $\ln P_1(\xi_1 \mid A) = -\Gamma(\alpha) + \alpha B_1(\xi_1) - \varphi_1(\xi_1)$, as follows from considerations of normalization.

Theorem 4.3 can be generalized in different directions. A generalization is trivial in the case when there is not one but several ($r$) parameters $B_1, \ldots, B_r$. The delta-function in (4.2.8) needs to be replaced with a product of delta-functions and the expression $\alpha B$ should be understood as an inner product.

For another generalization the delta-function in formula (4.2.8) can be substituted by other functions. In order to apply the theory considered in Section 3.2 to the problem of computation of capacity of noiseless channels it is very important to use the functions
$$\vartheta_+(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \le 0, \end{cases} \qquad \vartheta_-(x) = 1 - \vartheta_+(x), \tag{4.2.18}$$
and also the functions of the type
$$\vartheta_+(x) - \vartheta_+(x - c) = \vartheta_-(x - c) - \vartheta_-(x).$$
Indeed, introduction of function $\vartheta_-(\cdot)$ to the distribution
$$P_n(\xi_1, \ldots, \xi_n \mid A) = \vartheta_-\Bigl(\sum_{k=1}^{n} B_1(\xi_k) - nA\Bigr) e^{-\Psi_n(A) - \sum_{k=1}^{n} \varphi_1(\xi_k)}$$
is equivalent to introduction of the constraint $\sum_{k=1}^{n} B_1(\xi_k) \le nA$, i.e. constraint (3.2.1) with $c(y) = \sum_{k=1}^{n} B_1(\xi_k)$ and $a = nA$. Analogously, the first constraint (3.2.7) corresponds to introduction of the function
$$\vartheta_-\Bigl(\sum_{k=1}^{n} B_1(\xi_k) - a_1\Bigr) - \vartheta_-\Bigl(\sum_{k=1}^{n} B_1(\xi_k) - a_2\Bigr).$$
In order to cover these cases, the ‘microcanonical’ distribution (4.2.8) has to be replaced with a more general distribution
$$P_n(\xi_1, \ldots, \xi_n \mid A) = \vartheta\Bigl(\sum_{k=1}^{n} B_1(\xi_k) - nA\Bigr) e^{-\Psi_n(A) - \sum_{k=1}^{n} \varphi_1(\xi_k)}. \tag{4.2.19}$$
Here $\vartheta(\cdot)$ is a given function having the spectral representation
$$\vartheta(z) = \int e^{i\kappa z}\, \theta(i\kappa)\, d\kappa. \tag{4.2.20}$$
Such a generalization requires only insignificant changes in the proof of the theorem. Expansion (4.2.11) needs to be replaced with expansion (4.2.20), so that the extra term $\ln \theta(i\kappa)$ appears in the exponent in formula (4.2.12) and others. This leads to a minor complication of the final formulae.

Results of Theorem 4.3 also allow a generalization in a different direction. It is not necessary to require that the factor $e^{-\sum_{k=1}^{n} \varphi_1(\xi_k)}$ (independent of $A$ and proportional to the conditional probability $P(\xi_1, \ldots, \xi_n \mid A, B_n) = P(\xi_1, \ldots, \xi_n \mid B_n)$) can be decomposed into a product of factors, especially identical ones, in formulae (4.2.8), (4.2.19). However, we need to introduce some weaker requirement to keep the asymptotic result valid. In order to formulate such a requirement, we introduce the notion of canonical stability of a random variable.

We call a sequence of random variables $\{\zeta_n, n = 1, 2, \ldots\}$ given by probability distributions $P_n(\zeta_n)$ in sample space $Z$ (independent of $n$) canonically stable regarding $P_n(\zeta_n)$ if the values of their characteristic potentials
$$\mu_n(\beta) = \ln \sum e^{\beta \zeta_n} P_n(\zeta_n) \tag{4.2.21}$$
increase indefinitely for various $\beta$ in proportion to each other (they converge to $\infty$ together). The latter means the following:
$$\mu_n(\beta) \to \infty, \qquad \frac{\mu_n(\beta)}{\mu_n(\beta')} \to \frac{\mu_0(\beta)}{\mu_0(\beta')} \qquad (\beta, \beta' \in Q)$$
as $n \to \infty$, where $\mu_0(\beta)$ is some function independent of $n$ [$Q$ is a feasible region of $\beta$, for which (4.2.21) makes sense]. It is easy to see that random variables equal to expanding sums
$$\zeta_n = \sum_{k=1}^{n} B_1(\xi_k)$$
of independent identically distributed random variables are canonically stable because $\mu_n(\beta) = n\mu_1(\beta)$ holds for them. However, other examples of canonically stable families of random variables are also possible. That is why the theorem formulated below is a generalization of Theorem 4.3.

Theorem 4.4. Let $P_n(\zeta_n, \eta_n)$, $n = 1, 2, \ldots$, be a sequence of probability distributions such that the sequence of random variables $\zeta_n$ is canonically stable. With the help of these distributions we construct the joint distribution
$$P(\xi_1, \zeta_n, \eta_n \mid A_n) = \vartheta(B_1(\xi_1) + \zeta_n - A_n)\, e^{-\Psi_n(A_n) - \varphi_1(\xi_1)} P_n(\zeta_n, \eta_n), \tag{4.2.22}$$
where $\vartheta(\cdot)$ is a function independent of $n$ with spectrum $\theta(i\kappa)$. Then summation of (4.2.22) over $\eta_n$ and $\zeta_n$ yields a distribution $P(\xi_1 \mid A_n)$ of type (4.2.10). In this expression, functions $\psi(A_n)$, $\alpha(A_n)$, $\chi(A_n)$ are determined from the corresponding formulae like (4.2.17); as $n \to \infty$, function $\alpha(A_n)$ turns into the function inverse to the function
$$A_n = \mu_n'(\alpha), \tag{4.2.23}$$
while function $\psi(A_n)$ turns into $\Gamma(\alpha(A_n))$, where
$$\mu_n(\beta) = \ln \sum_{\zeta_n} e^{\beta \zeta_n} P_n(\zeta_n), \qquad \Gamma(\alpha) = \ln \sum_{\xi_1} e^{\alpha B_1(\xi_1) - \varphi_1(\xi_1)}.$$
It is assumed that equation (4.2.23) has a root tending to a finite limit as $n \to \infty$.

Proof. The proof is analogous to the proof of Theorem 4.3. The only difference is that now there is an additional term $\ln \theta(i\kappa)$ and the expression $(n-1)\Gamma(i\kappa)$ must be replaced with $\mu_n(i\kappa)$. Instead of formula (4.2.12) now we have
$$P(\xi_1 \mid A_n) = e^{-\Psi_n(A_n) - \varphi_1(\xi_1)} \int \exp[i\kappa B_1(\xi_1) - i\kappa A_n + \mu_n(i\kappa) + \ln \theta(i\kappa)]\, d\kappa$$
after summation over $\eta_n$ and $\zeta_n$. The saddle point $i\kappa = \alpha_0$ is determined from the equation
$$\mu_n'(\alpha_0) - A_n + B_1(\xi_1) + \frac{\theta'(\alpha_0)}{\theta(\alpha_0)} = 0.$$
The root of the latter equation is asymptotically close to the root $\alpha$ of equation (4.2.23). Namely,
$$\alpha_0 - \alpha = -(\mu_n'')^{-1}\Bigl[B_1 + \frac{\theta'}{\theta}\Bigr] + \cdots = O(\mu^{-1}).$$
Other changes do not require explanation.

The theorems presented in this section characterize the role of canonical distributions just as the Central Limit Theorem characterizes the role of Gaussian distributions. As a matter of fact, this also explains the fundamental role of canonical distributions in statistical thermodynamics.
4.3 Asymptotic equivalence of two types of constraints

Consider asymptotic results related to the content of Chapter 3. We will show that when computing maximum entropy (capacity of a noiseless channel), constraints imposed on mean values and constraints imposed on exact values are asymptotically equivalent to each other. These results are closely related to the theorem about stability of the canonical distribution proven in Section 4.2.

We set out the material in a more general form than in Sections 3.2 and 3.3 by using an auxiliary measure ν(dx) similarly to Section 3.6. Let the space X with measure ν(dx) (not normalized to unity) be given. Entropy will be defined by the formula
$$H = -\int \ln \frac{P(dx)}{\nu(dx)}\, P(dx) \tag{4.3.1}$$
[see (1.6.13)]. Let the constraint
$$B(x) \le A \tag{4.3.2}$$
or, more generally,
$$B(x) \in E \tag{4.3.3}$$
be given, where $E$ is some (measurable) set and $B(x)$ is a given function. Entropy of level $A$ (or set $E$) is defined by the maximization
$$\tilde H = \sup_{P \in \tilde G} \Bigl[-\int P(dx) \ln \frac{P(dx)}{\nu(dx)}\Bigr]. \tag{4.3.4}$$
Here the set $\tilde G$ of feasible distributions $P(\cdot)$ is characterized by the fact that the probability is concentrated within the subspace $\tilde X \subset X$ defined by constraints (4.3.2) and (4.3.3), i.e.
$$P(\tilde X) = 1; \qquad P[B(x) \in E] = 1. \tag{4.3.5}$$
Constraints (4.3.2), (4.3.3) will be relaxed if we substitute them by analogous constraints for expectations:
$$\mathbb{E}[B(x)] \le A, \qquad \mathbb{E}[B(x)] \in E, \tag{4.3.6}$$
where the symbol $\mathbb{E}$ means averaging with respect to measure $P$. Averaging (4.3.2), (4.3.3) with respect to a measure $P$ from $\tilde G$, we obtain (4.3.6). Hence, the set $G$ of distributions $P$ defined by constraint (4.3.6) embodies the set $\tilde G$ defined by constraint (4.3.5). Therefore, if we introduce the following entropy:
$$H = \sup_{P \in G} \Bigl[-\int P(dx) \ln \frac{P(dx)}{\nu(dx)}\Bigr], \tag{4.3.7}$$
then we will have
$$\tilde H \le H. \tag{4.3.8}$$
Finding entropy (4.3.7) under constraint (4.3.6) is nothing else but the first variational problem (see Sections 3.3 and 3.6). Entropy $H$ coincides with the entropy of a certain canonical distribution
$$P(dx) = \exp[-\Gamma(\alpha) + \alpha B(x)]\, \nu(dx) \tag{4.3.9}$$
[this formula coincides with (3.6.3a) when $B(x) = c(x)$; $\xi = x$; $\alpha = -\beta$], i.e. $\mathbb{E}[B(x)] = d\Gamma/d\alpha \in E$. Usually the value of $\mathbb{E}[B(x)]$ coincides with either the maximum or the minimum point of interval $E$. It is easy to make certain that entropy (4.3.4) is determined by the formula
$$\tilde H = \ln \nu(\tilde X). \tag{4.3.10}$$
At the same time it follows from (4.3.9) that
$$H = \Gamma(\alpha) - \alpha A \qquad \Bigl(A = \frac{d\Gamma(\alpha)}{d\alpha}\Bigr) \tag{4.3.11}$$
holds for entropy (4.3.7) [see (3.6.3b)].

Deep asymptotic results related to the first variational problem state that entropies (4.3.10) and (4.3.11) are close to one another in some cases, so that computations of their values are interchangeable (i.e. in order to compute one of them it is sufficient to compute the other one). Usually it is convenient to compute entropy (4.3.11) of the canonical distribution (4.3.9) instead of (4.3.10) by applying the regular methods of statistical thermodynamics. These results, which affirm the important role of the canonical distribution (4.3.9), are close to the results stated in the previous section both in substance and in methods of proof.

We implement the ‘system’ (or noiseless channel) described above in two ways [$\nu(dx)$, $B(x)$, $A$ relate to the indicated system]. On the one hand, we suppose that there is given a ‘small system’ (channel), which contains $\nu_1(dx_1)$, $B_1(x_1)$, $A_1$. On the other hand, let there be a ‘large system’ (channel) for which it holds that
$$\nu_n(dx_1, \ldots, dx_n) = \nu_1(dx_1) \cdots \nu_1(dx_n); \qquad B_n(x_1, \ldots, x_n) = \sum_{k=1}^{n} B_1(x_k); \qquad A_n = nA_1.$$
The ‘large system’ appears to be the n-th degree of the ‘small system’. The aforementioned formulae (4.3.1)–(4.3.11) can be applied to both systems. Entropies $\tilde H_n$, $H_n$ can thus be defined both for the ‘small’ and for the ‘large’ systems. For the ‘small’ system, the values $\tilde H_1$, $H_1$ are essentially different, but for the ‘large’ system the values $\tilde H_n$, $H_n$, according to the foregoing, are relatively close to each other in the asymptotic sense
$$\tilde H_n / H_n \to 1, \qquad n \to \infty. \tag{4.3.12}$$
As is easy to see, $H_n = nH_1$. Because of that, relationship (4.3.12) can be rewritten as follows:
$$\frac{1}{n} \tilde H_n \to H_1. \tag{4.3.13}$$
This formula together with its various generalizations indeed represents the main result of the first asymptotic theorem.

For the ‘large system’ we consider constraints (4.3.3), (4.3.6) in the form
$$\sum_{k=1}^{n} B_1(x_k) - A_n \in E_0; \qquad \mathbb{E}\Bigl[\sum_{k=1}^{n} B_1(x_k) - A_n\Bigr] \in E_0,$$
where $E_0$ is independent of $n$. If we introduce the function
$$\vartheta(z) = \begin{cases} 1, & z \in E_0 \\ 0, & z \notin E_0, \end{cases}$$
then the constraint $\sum_k B_1(x_k) - A_n \in E_0$ can be substituted by the formula
$$P(dx_1, \ldots, dx_n \mid A_n) = \mathrm{const}\; \vartheta\Bigl(\sum_k B_1(x_k) - A_n\Bigr) P(dx_1, \ldots, dx_n).$$
The extremum distribution (the one yielding $\tilde H_n$) has the following form:
$$P(dx_1, \ldots, dx_n \mid A_n) = N^{-1} \vartheta\Bigl(\sum_k B_1(x_k) - A_n\Bigr) \nu_1(dx_1) \cdots \nu_1(dx_n), \tag{4.3.14}$$
where $N$ is a normalization constant quite simply associated with entropy $\tilde H_n$:
$$\tilde H_n = \ln N = \ln \int \vartheta\Bigl(\sum_k B_1(x_k) - A_n\Bigr) \nu_1(dx_1) \cdots \nu_1(dx_n). \tag{4.3.15}$$
Formula (4.3.14) is an evident analogy of both (4.2.19) and (4.2.22). It means that the problem of calculating entropy (4.3.15) is related to the problem of calculating the partial distribution (4.2.9) considered in the previous section. As there, the conditions of exact multiplicativity $\nu_n(dx_1, \ldots, dx_n) = \nu_1(dx_1) \cdots \nu_1(dx_n)$ and exact additivity
$$B_n(x_1, \ldots, x_n) = \sum_{k=1}^{n} B_1(x_k)$$
are not necessary for the proof of the main result (convergence (4.3.12)). Next we formulate the result directly in a general form by employing the notion (introduced in Section 4.2) of a canonically stable sequence of random variables ($\xi_n$ being understood as the set $x_1, \ldots, x_n$).

Theorem 4.5 (The first asymptotic theorem). Let $\nu_n(d\xi_n)$, $n = 1, 2, \ldots$, be a sequence of measures and $B_n(\xi_n)$ be a sequence of random variables that is canonically stable with respect to the distribution
$$P(d\xi_n) = \nu_n(d\xi_n) / \nu_n(\Xi_n) \tag{4.3.16}$$
(the measure $\nu_n(\Xi_n)$ of the entire space of values $\xi_n$ is assumed to be finite). Then the entropy $\tilde H_n$ can be computed by the asymptotic formula
$$\tilde H_n = -\gamma_n(A_n) - \frac{1}{2} \ln[\Gamma_n''(\alpha_1)] + O(1), \tag{4.3.17}$$
where
$$\gamma_n(A_n) = \alpha_1 A_n - \Gamma_n(\alpha_1) = -H_n \qquad (\Gamma_n'(\alpha_1) = A_n),$$
$$\Gamma_n(\alpha) = \ln \int \exp[\alpha B_n(\xi_n)]\, \nu_n(d\xi_n) \tag{4.3.18}$$
is the potential implicated in (4.3.11). Therefore, the convergence (4.3.12) takes place as $n \to \infty$.

Condition $\nu_n(\Xi_n) < \infty$ is needed only for the existence of probability measure (4.3.16). If it is not satisfied, then we only need to modify the definition of canonical stability for $B_n(\xi_n)$, namely, to require that potential (4.3.18), rather than a characteristic potential, converges to infinity as $n \to \infty$.

Proof. As in Section 4.2 we use the integral representation (4.2.20). We integrate the equality
$$\tilde H_n = \ln \int \vartheta(B_n(\xi_n) - A_n)\, \nu_n(d\xi_n)$$
over $\xi_n$ after substitution by (4.2.20). Due to (4.3.18) we have
$$\tilde H_n = \ln \int \exp[-i\kappa A_n + \Gamma_n(i\kappa) + \ln \theta(i\kappa)]\, d\kappa.$$
Computation of the integral can be carried out with the help of the saddle-point method (the method of the steepest descent), i.e. using formula (4.2.16b) with varying degrees of accuracy. The saddle point $i\kappa = \alpha_0$ is determined from the equation
$$\Gamma_n'(\alpha_0) = A_n - \theta'(\alpha_0)/\theta(\alpha_0).$$
Applicability of the saddle-point method is guaranteed by the condition of canonical stability. In order to prove Theorem 4.5 the following accuracy suffices:
$$\tilde H_n = -\alpha_0 A_n + \Gamma_n(\alpha_0) - \frac{1}{2} \ln\Bigl\{\frac{1}{2\pi} \Gamma_n''(\alpha_0)[1 + O(\Gamma^{-1})]\Bigr\} = -\alpha_1 A_n + \Gamma_n(\alpha_1) - \frac{1}{2} \ln \Gamma_n''(\alpha_1) + O(1).$$
Here $\alpha_1$ is the value determined from the equation $\Gamma_n'(\alpha_1) = A_n$. Also, $\alpha_1$ differs little from $\alpha_0$:
$$\alpha_1 - \alpha_0 = (\Gamma_n'')^{-1} \frac{\theta'}{\theta} + \cdots = O(\Gamma^{-1}).$$
The proof is complete.

Certainly, more accurate asymptotic results can also be derived similarly.

As is seen from the proof, the condition of canonical stability of random variables $B_n$ is sufficient for validity of the theorem. However, it in no way follows that this condition is necessary and that it is impossible to state any weaker condition for which the theorem is valid. Undoubtedly, Theorem 4.5 (and also Theorem 4.4, possibly) can be extended to a more general case. This is indirectly evidenced by the fact that for the example covered in Sections 3.1 and 3.4 the condition of canonical stability is not satisfied but, as we have seen, the passage to the limit (4.3.12) occurs as $L \to \infty$.

Along with other asymptotic theorems, which will be considered later on (Chapters 7 and 11), Theorems 4.3–4.5 constitute the principal content of information theory perceived as ‘thermodynamic’ in a broad sense, i.e. asymptotic theory. Many important notions and relationships of this theory take on their special significance during the passage to the limit pertaining to an expansion of the systems in consideration. Those notions and relationships appear to be asymptotic in the indicated sense.
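The convergence (4.3.12), (4.3.13) is easy to observe numerically. The sketch below is an illustration under stated assumptions rather than an example from the text: it assumes the ‘small system’ $x \in \{0, 1\}$ with counting measure $\nu_1$, $B_1(x) = x$ and $A_1 = 3/10$, so that the exact constraint is $\sum_k x_k \le nA_1$. Then $\tilde H_n$ is the logarithm of the number of admissible sequences [see (4.3.10)], while $H_1 = \Gamma(\alpha) - \alpha A_1$ with $\Gamma(\alpha) = \ln(1 + e^{\alpha})$ [see (4.3.11)]. The function names are hypothetical.

```python
import math
from fractions import Fraction

A1 = Fraction(3, 10)  # assumed level of the constraint for the 'small system'

def exact_entropy(n, a1=A1):
    """H~_n = ln nu_n(X~): log of the number of binary sequences with sum <= n * a1."""
    limit = math.floor(n * a1)
    count = sum(math.comb(n, j) for j in range(limit + 1))
    return math.log(count)

def canonical_entropy_per_symbol(a1=A1):
    """H_1 = Gamma(alpha) - alpha * A1, where Gamma(alpha) = ln(1 + e^alpha), Gamma'(alpha) = A1."""
    a = float(a1)
    alpha = math.log(a / (1 - a))            # root of e^alpha / (1 + e^alpha) = A1
    gamma = math.log(1 + math.exp(alpha))
    return gamma - alpha * a

h1 = canonical_entropy_per_symbol()
ratios = [exact_entropy(n) / (n * h1) for n in (10, 100, 1000, 10000)]
for n, r in zip((10, 100, 1000, 10000), ratios):
    print(n, r)
```

Under these assumptions the ratio $\tilde H_n / H_n$ should stay below one, in agreement with (4.3.8), and approach one as $n$ grows; the remaining gap is governed by the logarithmic term in (4.3.17).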
4.4 Some theorems about the characteristic potential

1. Characteristic potential $\mu(s) = \ln \Theta(s)$ of random variables $B_i(\xi)$ (or a cumulant generating function) was defined by formula (4.1.11a) earlier. If there is given the family of distributions $p(\xi \mid \alpha) = \exp[-\Gamma(\alpha) + \alpha B(\xi) - \varphi_0(\xi)]$, then $\mu(s)$ is expressed in terms of $\Gamma(\alpha)$ by formula (4.1.12a). If there is merely a given random variable $\xi$ with probability distribution $P(d\xi)$ instead of a family of distributions, then we can construct the following family of distributions: $P(d\xi \mid \alpha) = \mathrm{const}\, e^{\alpha B(\xi)} P(d\xi)$. Because of (4.1.11a) the normalization constant is expressed through $\mu(\alpha)$, so that
$$P(d\xi \mid \alpha) = \exp[-\mu(\alpha) + \alpha B(\xi)]\, P(d\xi) \tag{4.4.1}$$
($\ln p(\xi) = -\varphi_0(\xi)$). Thus, we have built the family $\{P(d\xi \mid \alpha), \alpha \in Q\}$ on the basis of $P(\xi)$. Besides the characteristic potential $\mu(s)$ of the initial distribution, we can find a characteristic potential of any distribution (4.4.1) from the indicated family by formula (4.1.12a), i.e. by the formula
$$\mu(s \mid \alpha) = \mu(\alpha + s) - \mu(\alpha). \tag{4.4.2}$$
2. At first consider an easy example of a single variable $B(\xi)$, $r = 1$, and prove a simple but useful theorem.

Theorem 4.6. Suppose that the domain $Q$ of the parameter implicated in (4.4.1) contains the interval $s_1 < \alpha < s_2$ ($s_1 < 0$, $s_2 > 0$), and the potential $\mu(\alpha)$ is differentiable on this interval. Then the cumulative distribution function
$$F(x) = P[B(\xi) < x] \tag{4.4.3}$$
satisfies Chernoff's inequality
$$F(x) \le \exp[-s\mu'(s) + \mu(s)] \tag{4.4.4}$$
where
$$\mu'(s) = x, \tag{4.4.5}$$
if the latter equation has a root $s \in (s_1, 0]$.

Proof. Taking into consideration (4.4.1), we rewrite (4.4.3) in the form
$$F(x) = \sum_{B(\xi) < x} \exp[\mu(\alpha) - \alpha B(\xi)]\, P(\xi \mid \alpha) \tag{4.4.6}$$
[…] $\mathbb{E}[B]$ such that the corresponding roots $s$ and $s'$ of equation (4.4.17) lie on segment $(s_1, s_2)$. Then we apply the famous inversion formula to them (the Lévy formula):
$$F(x') - F(x) = \lim_{c \to \infty} \frac{1}{2\pi} \int_{-c}^{c} \frac{e^{-itx} - e^{-itx'}}{it}\, e^{\mu(it)}\, dt. \tag{4.4.20}$$
Here $e^{\mu(it)} = e^{n\mu_1(it)}$ is a characteristic function. We represent the limit in the right-hand side of (4.4.20) as a limit of a difference of two integrals
$$\frac{1}{2\pi i} \int_{L} e^{-zx + \mu(z) - \ln z}\, dz - \frac{1}{2\pi i} \int_{L} e^{-zx' + \mu(z) - \ln z}\, dz = I - I'. \tag{4.4.21}$$
As a contour of integration $L$ we consider a contour connecting $-ic$ and $ic$ through the saddle point $z = z_0$ of the first integral. Note that this saddle point can be determined from the equation
$$\mu'(z_0) - x - \frac{1}{z_0} = 0, \qquad \mu_1'(z_0) - x_1 - \frac{1}{nz_0} = 0 \qquad (x_1 = x/n). \tag{4.4.22}$$
This equation is derived by equating the derivative of the expression located in the exponent to zero. We denote by $L'$ a contour going from $-ic$ to $ic$ and passing through the saddle point $z_0'$ of the second integral. This second saddle point can be found from the equation
$$\mu_1'(z_0') - x_1' - \frac{1}{nz_0'} = 0.$$
Now we try to substitute the contour of integration $L$ by $L'$ in the second integral from (4.4.21). Since $nx_1 < \mathbb{E}[B] = n\mu_1'(0)$ and also $nx_1' > \mathbb{E}[B] = n\mu_1'(0)$, the mentioned two saddle points lie on different sides of the origin $z = 0$ for sufficiently large $n$. Therefore, in order to replace the contour of integration ($L$ by $L'$) in the specified way, we need to take into account the residue at point $z = 0$ in the second integral from (4.4.21). This manipulation yields
$$I' = \frac{1}{2\pi i} \int_{L} e^{-zx' + \mu(z)}\, \frac{dz}{z} = \frac{1}{2\pi i} \int_{L'} e^{-zx' + \mu(z)}\, \frac{dz}{z} - e^{-\mu(0)}. \tag{4.4.23}$$
We compute integrals (4.4.21), (4.4.23) by the method of the steepest descent. Taking into account that $\mu_1''(z_0) \ge 0$ (Theorem 4.1), it is easy to see that the direction of the steepest descent is perpendicular to the real axis ($z = z_0 + iy$):
$$-zx + \mu(z) - \ln z = -z_0 x + \mu(z_0) - \ln z_0 - \frac{1}{2}\mu''(z_0) y^2 - \frac{y^2}{2z_0^2} + \frac{1}{6}\mu'''(z_0)(iy)^3 - \frac{1}{3}\Bigl(\frac{iy}{z_0}\Bigr)^3 + y^4 \cdots$$
We use the expansion
$$e^{-nzx_1 + n\mu_1(z) - \ln z} = \frac{1}{z_0} \exp\Bigl\{-z_0 x + \mu(z_0) - \frac{n}{2}\Bigl[\mu_1''(z_0) + \frac{1}{nz_0^2}\Bigr] y^2 + ny^3 \cdots\Bigr\} = \frac{1}{z_0} \exp\Bigl\{-z_0 x + \mu(z_0) - \frac{n}{2}\Bigl[\mu_1'' + \frac{1}{nz_0^2}\Bigr] y^2\Bigr\} [1 + ny^3 \cdots]$$
and obtain that
$$I \equiv \frac{1}{2\pi i} \int_{L} e^{-nzx_1 + n\mu_1(z)}\, \frac{dz}{z} = \frac{1}{z_0} \Bigl\{2\pi n \Bigl[\mu_1''(z_0) + \frac{1}{nz_0^2}\Bigr]\Bigr\}^{-1/2} e^{-z_0 x + \mu(z_0)} [1 + O(n^{-1})] \tag{4.4.24}$$
due to (4.2.16b) (note that the largest term of the remainder $O(n^{-1})$ is given by the fourth derivative $\Gamma^{(4)}$ and takes the form $\frac{1}{8n} \mu_1^{(4)} (\mu_1'')^{-2}$). Comparing (4.4.22) with the equation $\mu_1'(s) - x_1 = 0$ or, equivalently, with (4.4.17), we have
$$\mu_1''(s)(z_0 - s) + (z_0 - s)^2 \cdots = \frac{1}{nz_0}; \qquad z_0 - s = \frac{1}{nz_0 \mu_1''(s)} + O(n^{-2}). \tag{4.4.25}$$
In order to derive the desired relationship (4.4.19) it is sufficient to apply the formula resulting from (4.4.24) and (4.4.25)
$$I = \frac{1}{s}\, e^{-sx + \mu(s)} \{2\pi n \mu_1''(s)\}^{-1/2} [1 + O(n^{-1})] \tag{4.4.26}$$
where $sx - \mu(s) = \gamma(x)$, $\mu'(s) = x$ ($s < 0$). The second integral (4.4.23) is taken similarly, which yields
$$I' = \frac{1}{s'} \{2\pi \mu''(s')\}^{-1/2} e^{-\gamma(x')} [1 + O(n^{-1})] - 1 \qquad (s' > 0). \tag{4.4.27}$$
Substituting (4.4.26) and (4.4.27) into (4.4.21), we get
$$F(x') - F(x) = 1 - [2\pi (s')^2 \mu''(s')]^{-1/2} e^{-\gamma(x')} [1 + O(n^{-1})] - [2\pi s^2 \mu''(s)]^{-1/2} e^{-\gamma(x)} [1 + O(n^{-1})] \tag{4.4.28}$$
by virtue of (4.4.20) (it is taken into account here that $s < 0$; the roots $[\ldots]^{-1/2}$ are positive). The last equality determines $F(x)$ and $F(x')$ up to some additive constant:
$$F(x) = [2\pi \mu''(s) s^2]^{-1/2} e^{-\gamma(x)} [1 + O(n^{-1})] + K(n);$$
$$F(x') = 1 - [2\pi \mu''(s')(s')^2]^{-1/2} e^{-\gamma(x')} [1 + O(n^{-1})] + K(n).$$
In order to estimate the constant $K(n)$ we consider a point $s_*$ belonging to segment $(s_1, s)$, a point $s^*$ from segment $(s', s_2)$ and the values $x_*$, $x^*$ corresponding to them. We take into account the inequality $F(x_*) \le 1 - F(x^*) + F(x_*)$, substitute (4.4.28) into its right-hand side and, finally, obtain that
$$F(x_*) \le [2\pi \mu''(s_*) s_*^2]^{-1/2} e^{-\gamma(x_*)} [1 + O(n^{-1})] + [2\pi \mu''(s^*)(s^*)^2]^{-1/2} e^{-\gamma(x^*)} [1 + O(n^{-1})].$$
Hence,
$$F(x_*) = O(e^{-n\gamma_1(x_{*1})}) + O(e^{-n\gamma_1(x_1^*)})$$
($\gamma_1(x_1) = \gamma(x)/n = s x_1 - \mu_1(s)$; $x_{*1} = x_*/n$, $x_1^* = x^*/n$). Substituting this expression into the equality
$$1 - F(x') = -F(x_*) + [2\pi \mu''(s')(s')^2]^{-1/2} e^{-\gamma(x')} [1 + O(n^{-1})] + [2\pi \mu''(s_*) s_*^2]^{-1/2} e^{-\gamma(x_*)} [1 + O(n^{-1})],$$
that follows from (4.4.28) when $x = x_*$, we find that
$$1 - F(x') = [2\pi \mu''(s')(s')^2]^{-1/2} e^{-\gamma(x')} [1 + O(n^{-1})] + O(e^{-n\gamma_1(x_{*1})}) + O(e^{-n\gamma_1(x_1^*)}). \tag{4.4.29}$$
It is easy to obtain from (4.4.28), (4.4.29) that
$$F(x) = [2\pi \mu''(s) s^2]^{-1/2} e^{-\gamma(x)} [1 + O(n^{-1})] + O(e^{-n\gamma_1(x_{*1})}) + O(e^{-n\gamma_1(x_1^*)}). \tag{4.4.30}$$
If $x_{*1}$, $s_*$ and $x_1^*$, $s^*$ do not depend on $n$ and are chosen in such a way that
$$\gamma_1(x_{*1}) > \gamma_1(x_1); \qquad \gamma_1(x_1^*) > \gamma_1(x_1), \tag{4.4.31}$$
then the terms $O(e^{-n\gamma_1(x_{*1})})$, $O(e^{-n\gamma_1(x_1^*)})$ in (4.4.30) converge to zero as $n \to \infty$ faster than $e^{-n\gamma_1(x_1)} O(n^{-1})$. Thus, we obtain the first equality (4.4.19) from (4.4.30). The second equality follows from (4.4.29) when the inequalities
$$\gamma_1(x_{*1}) > \gamma_1(x_1'); \qquad \gamma_1(x_1^*) > \gamma_1(x_1') \tag{4.4.32}$$
are satisfied. Since $\gamma(x)$ is monotonic within segments $\mu'(s_1) \le x \le \mu'(0)$ and $\mu'(0) \le x \le \mu'(s_2)$, points $x_{*1}$ and $x_1^*$ (for which inequalities (4.4.31), (4.4.32) would be valid) can a fortiori be chosen if constraint (4.4.18) is satisfied. The proof is complete.

As is seen from the provided proof, the requirement (one of the conditions of Theorem 4.8) that random variable $B$ is equal to a sum of identically distributed independent random variables turns out not to be necessary. Mutual (proportional) increase of the resulting potential $\mu(s)$ is sufficient for terms similar to the term $\mu^{(4)}/(\mu'')^2$ in the right-hand side of (4.4.24) to be small. That is why formula (4.4.19) is valid even in a more general case if we replace $O(n^{-1})$ with $O(\mu^{-1})$ and understand this estimation only in the specified sense.

If we apply formula (4.4.19) to a distribution $p(B \mid \alpha)$ dependent on a parameter instead of distribution $p(B)$, then in consequence of (4.1.12a) we will have
$$\mu(s) = \Gamma(\alpha + s) - \Gamma(\alpha); \qquad \mu'(s) = \Gamma'(\alpha + s);$$
$$s\mu'(s) - \mu(s) = s\Gamma'(\alpha + s) - \Gamma(\alpha + s) + \Gamma(\alpha) = \gamma(x) - \alpha x + \Gamma(\alpha)$$
and, thereby, formula (4.4.19) becomes
$$\left.\begin{matrix} F(x), & x < \Gamma'(\alpha) \\ 1 - F(x), & x > \Gamma'(\alpha) \end{matrix}\right\} = [2\pi \Gamma''(\alpha + s) s^2]^{-1/2} e^{-\Gamma(\alpha) + \alpha x - \gamma(x)} [1 + O(\Gamma^{-1})], \tag{4.4.33}$$
where $\gamma(x) = ax - \Gamma(a)$ ($\Gamma'(a) = x$) is a Legendre transform.

Also, it is not difficult to generalize Theorem 4.8 to the case of many random variables $B_1, \ldots, B_r$. If we apply the same notations as in (4.4.13), then the respective generalization of formula (4.4.19) will take the form
$$P\{[B_1 - x_1]\,\mathrm{sign}\, s_1 \ge 0, \ldots, [B_r - x_r]\,\mathrm{sign}\, s_r \ge 0\} = {\det}^{-1/2} \Bigl\| 2\pi s_i s_j \frac{\partial^2 \mu(s)}{\partial s_i \partial s_j} \Bigr\|\, e^{-\gamma(x)} [1 + O(\Gamma^{-1})], \tag{4.4.34}$$
where $\gamma(x)$ is defined by (4.4.14).
Finally, the generalization of the last formula to the case of a parametric distribution [correspondingly, the generalization of (4.4.33) to a multivariate case] will be represented as follows:
$$P\{(B_1 - x_1)\,\mathrm{sign}\, s_1 \ge 0, \ldots, (B_r - x_r)\,\mathrm{sign}\, s_r \ge 0 \mid \alpha\} = (2\pi)^{-r/2}\, |s_1 \cdots s_r|^{-1}\, {\det}^{-1/2} \Bigl\| \frac{\partial^2 \Gamma(\alpha + s)}{\partial s_i \partial s_j} \Bigr\|\, e^{-\Gamma(\alpha) + \alpha x - \gamma(x)} [1 + O(\Gamma^{-1})]. \tag{4.4.35}$$
The aforementioned results speak for an important role of potentials and their images under the Legendre transform. The applied method of proof unites Theorem 4.8 with Theorems 4.3–4.5.
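The leading term of (4.4.30) can be compared with an exactly computable distribution function. The following sketch is an illustration under assumptions that are not in the text: $B$ is taken to be a sum of $n$ independent exponentially distributed variables with unit mean, so that $\mu(s) = -n \ln(1 - s)$ and the exact distribution function is available as a Poisson tail sum. The point $x = n/2$ lies below the mean, which gives a negative root $s$ of $\mu'(s) = x$. Function names are hypothetical.

```python
import math

def exact_cdf(n, x, extra_terms=400):
    """P[B < x] for a sum B of n independent Exp(1) variables.

    Uses the identity P[B < x] = sum_{k >= n} e^{-x} x^k / k!, evaluated in log space;
    the series is truncated after `extra_terms` terms, which is ample for x < n.
    """
    return sum(math.exp(-x + k * math.log(x) - math.lgamma(k + 1))
               for k in range(n, n + extra_terms))

def saddle_point_cdf(n, x):
    """Leading term [2 pi mu''(s) s^2]^(-1/2) exp(-gamma(x)) of (4.4.30)."""
    s = 1 - n / x                      # root of mu'(s) = n / (1 - s) = x; negative for x < n
    mu = -n * math.log(1 - s)          # characteristic potential of the sum
    mu2 = n / (1 - s) ** 2             # second derivative mu''(s)
    gamma = s * x - mu                 # Legendre transform gamma(x) = s x - mu(s)
    return math.exp(-gamma) / math.sqrt(2 * math.pi * mu2 * s * s)

ratio = {n: exact_cdf(n, n / 2) / saddle_point_cdf(n, n / 2) for n in (20, 200, 2000)}
for n, r in ratio.items():
    print(n, r)
```

In this assumed example the ratio of the exact value to the asymptotic expression should approach one with a relative error of order $1/n$, as the factor $[1 + O(n^{-1})]$ indicates. For lattice-valued variables the prefactor would need a separate treatment, which is why a continuous example is chosen here.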
Chapter 5
Computation of entropy for special cases. Entropy of stochastic processes
In the present chapter, we set out the methods for computation of entropy of many random variables or of a stochastic process in discrete and continuous time. From both fundamental and practical points of view, of particular interest are the stationary stochastic processes and their information-theoretic characteristics, specifically their entropy. Such processes are relatively simple objects, particularly a discrete process, i.e. a stationary process with discrete states and running in discrete time. Therefore, this process is a very good example for demonstrating the basic points of the theory, and so we shall start from its presentation. Our main attention will be drawn to the definition of such an important characteristic of a stationary process as the entropy rate, that is entropy per unit of time or per step. In addition, we introduce entropy Γ at the end of an interval. This entropy together with the entropy rate H1 defines the entropy of a long interval of length T by the approximate formula HT ≈ H1 T + 2Γ, which becomes more precise as T grows. Both constants H1 and Γ are calculated for a discrete Markov process. The generalized definition of entropy, given in Section 1.6, allows for the application of this notion to continuous random variables as well as to the case when the set of these random variables is a continuum, i.e. to a stochastic process with a continuous parameter (time). In what follows, we show that many results related to a discrete process can be extended both to the case of continuous sample space and to continuous time. For instance, we can introduce the entropy rate (not per one step but per unit of time) and the entropy of an end of an interval for continuous-time stationary processes. The entropy of a stochastic process on an interval is represented approximately in the form of two terms by analogy with the aforementioned formula.
For non-stationary continuous-time processes, instead of constant entropy rate, one should consider entropy density, which, generally speaking, is not constant in time.
© Springer Nature Switzerland AG 2020 R. V. Belavkin et al. (eds.), Theory of Information and its Value, https://doi.org/10.1007/978-3-030-22833-0 5
Entropy and its density are calculated for various important cases of continuous-time processes: Gaussian processes and diffusion Markov processes. The entropy computation for stochastic processes carried out here allows us to calculate Shannon's amount of information (this will be covered in Chapter 6) for stochastic processes.
5.1 Entropy of a segment of a stationary discrete process and entropy rate

Suppose that there is a sequence of random variables $\ldots, \xi_{k-1}, \xi_k, \xi_{k+1}, \ldots$. Index (parameter) $k$ can be interpreted as discrete time $t$ taking integer values $\ldots, k-1, k, k+1, \ldots$. The number of different values of the index can be unbounded from both sides: $-\infty < k < \infty$; unbounded from one side, for instance: $0 < k < \infty$; or finite: $1 \le k \le N$. The specified values constitute the feasible region $K$ of the parameter.

Suppose that every random variable $\xi_k$ takes one value from a finite or countable number of values, for instance, $\xi_k = (A_1, A_2, \ldots, A_m)$ (not so much finiteness of $m$ as finiteness of entropies $H_{\xi_k}$ is essential for future discussion). We will call the indicated process (a discrete random variable as a function of a discrete parameter $k$) a discrete process. A discrete process is stationary if all distribution laws $P(\xi_{k_1}, \ldots, \xi_{k_r})$ (of arbitrary multiplicity $r$) do not change under translation:
$$P[\xi_{k_1} = x_1, \ldots, \xi_{k_r} = x_r] = P[\xi_{k_1 + a} = x_1, \ldots, \xi_{k_r + a} = x_r] \tag{5.1.1}$$
where $a$ is any integer number. The magnitude $a$ of the translation is assumed to be such that the values $k_1 + a, \ldots, k_r + a$ remain in the domain $K$ of parameter $k$. Henceforth, we shall not stipulate this condition assuming that, for example, the parameter domain is not bounded in both directions.

Consider different conditional entropies of one of the random variables $\xi_k$ from a discrete stationary stochastic process. Due to property (5.1.1), its unconditional entropy $H_{\xi_k} = -\sum_{\xi_k} P(\xi_k) \ln P(\xi_k)$ is independent of the chosen value of parameter $k$. Analogously, according to (5.1.1), conditional entropy $H_{\xi_k \mid \xi_{k-1}}$ is independent of $k$. Applying Theorem 1.6, we obtain the inequality
$$H_{\xi_k \mid \xi_{k-1}} \le H_{\xi_k}$$
or, taking into account stationarity,
$$H_{\xi_2 \mid \xi_1} = H_{\xi_3 \mid \xi_2} = \cdots = H_{\xi_k \mid \xi_{k-1}} \le H_{\xi_1} = H_{\xi_2} = \cdots = H_{\xi_k}.$$
If we introduce conditional entropy $H_{\xi_k \mid \xi_{k-1} \xi_{k-2}}$, then applying Theorem 1.6a for $\xi = \xi_k$, $\eta = \xi_{k-1}$, $\zeta = \xi_{k-2}$ will yield the inequality
Hξk |ξk−1 ,ξk−2 Hξk |ξk−1 . Due to the stationarity condition (5.1.1) this entropy is independent of k. Similarly, when increasing a number of random variables in the condition, we will have monotonic change (non-increasing) of conditional entropy further: Hξk Hξk |ξk−1 Hξk |ξk−1 ,ξk−2 · · · Hξk |ξk−1 ,...,ξk−l · · · 0.
(5.1.2)
Besides, all conditional entropies are non-negative, i.e. bounded below. Hence, there exists the non-negative limit

$$H_1 = \lim_{l\to\infty} H_{\xi_k|\xi_{k-1},\ldots,\xi_{k-l}}, \tag{5.1.3}$$
which we also denote by $h$ in order to avoid an increase in the number of indices. We define this limit as the entropy rate per element of sequence $\{\xi_k\}$. The basis of such a definition is the following theorem.

Theorem 5.1. If $\{\xi_k\}$ is a stationary discrete process such that $H_{\xi_k} < \infty$, then the limit $\lim_{l\to\infty} H_{\xi_1\ldots\xi_l}/l$ exists and is equal to (5.1.3).

Proof. We consider entropy $H_{\xi_1\ldots\xi_{m+n}}$ and represent it in the form

$$H_{\xi_{m+n}\ldots\xi_1} = H_{\xi_m\ldots\xi_1} + H_{\xi_{m+1}|\xi_m\ldots\xi_1} + H_{\xi_{m+2}|\xi_{m+1}\ldots\xi_1} + \cdots + H_{\xi_{m+n}|\xi_{m+n-1}\ldots\xi_1}. \tag{5.1.4}$$

Since $H_{\xi_{m+l}|\xi_{m+l-1}\ldots\xi_1} = H_1 + o_{m+l}(1) = H_1 + o_m(1)$ $(l \ge 1)$ due to (5.1.3) (here $o_j(1) \to 0$ as $j \to \infty$), it follows from (5.1.4) (after dividing by $m+n$) that

$$\frac{H_{\xi_{m+n}\ldots\xi_1}}{m+n} = \frac{H_{\xi_m\ldots\xi_1}}{m+n} + \frac{n}{m+n} H_1 + \frac{n}{m+n}\, o_m(1). \tag{5.1.5}$$

Let $m$ and $n$ tend to infinity in such a way that $n/m \to \infty$. Then $n/(m+n)$ converges to 1, while the ratio $H_{\xi_m\ldots\xi_1}/(m+n)$, which can be estimated as

$$\frac{1}{m+n} H_{\xi_m\ldots\xi_1} \le \frac{m}{m+n} H_{\xi_1},$$

clearly converges to 0. Therefore, we obtain the statement of the theorem from equality (5.1.5). The proof is complete.

It is also easy to prove that as $l$ grows, the ratio $H_{\xi_1\ldots\xi_l}/l$ changes monotonically, i.e. does not increase. For that we construct the difference
$$\delta = \frac{1}{l} H_{\xi_1\ldots\xi_l} - \frac{1}{l+1} H_{\xi_1\ldots\xi_{l+1}} = \frac{1}{l} H_{\xi_1\ldots\xi_l} - \frac{1}{l+1}\left[H_{\xi_1\ldots\xi_l} + H_{\xi_{l+1}|\xi_1\ldots\xi_l}\right] = \frac{1}{l(l+1)} H_{\xi_1\ldots\xi_l} - \frac{1}{l+1} H_{\xi_{l+1}|\xi_1\ldots\xi_l}$$

and reduce it to

$$\delta = \frac{1}{l(l+1)}\left[H_{\xi_1\ldots\xi_l} - l H_{\xi_{l+1}|\xi_1\ldots\xi_l}\right] = \frac{1}{l(l+1)} \sum_{i=1}^{l} \left[H_{\xi_i|\xi_1\ldots\xi_{i-1}} - H_{\xi_i|\xi_{i-l}\ldots\xi_{i-1}}\right]. \tag{5.1.6}$$

In consequence of inequalities (5.1.2) the summands on the right-hand side of (5.1.6) are non-negative. Thus, non-negativity of difference $\delta$ follows.

By virtue of Theorem 5.1 the following equality holds:

$$H_{\xi_1\ldots\xi_n} = nH_1 + n\,o_n(1). \tag{5.1.7}$$
Further, we form the combination

$$H_{\xi_1\ldots\xi_m} + H_{\xi_1\ldots\xi_n} - H_{\xi_1\ldots\xi_{m+n}} = H_{\xi_1\ldots\xi_m} + H_{\xi_{m+1}\ldots\xi_{m+n}} - H_{\xi_1\ldots\xi_{m+n}}. \tag{5.1.8}$$

It is easy to prove that this combination is a monotonic function of $m$ with fixed $n$ (or vice versa). Indeed, it can be rewritten as follows:

$$H_{\xi_{m+1}\ldots\xi_{m+n}} - H_{\xi_{m+1}\ldots\xi_{m+n}|\xi_1\ldots\xi_m}. \tag{5.1.9}$$

Since conditional entropies do not increase as $m$ grows according to (5.1.2), expression (5.1.9) does not decrease when $m$ increases. The same holds for the dependence on $n$, because $m$ and $n$ enter (5.1.8) symmetrically. It evidently follows from (5.1.9) that combination (5.1.8) is non-negative (by virtue of Theorem 1.6 with $\xi = (\xi_{m+1}, \ldots, \xi_{m+n})$).

Consider the limit

$$2\Gamma = \lim_{n\to\infty} \lim_{m\to\infty} \left[H_{\xi_1\ldots\xi_m} + H_{\xi_1\ldots\xi_n} - H_{\xi_1\ldots\xi_{m+n}}\right] = \lim_{m\to\infty} \lim_{n\to\infty} \left[H_{\xi_1\ldots\xi_m} + H_{\xi_1\ldots\xi_n} - H_{\xi_1\ldots\xi_{m+n}}\right]. \tag{5.1.10}$$
We can switch the order of limits here due to the mentioned symmetry under interchange of $m$ and $n$. By virtue of the indicated monotonicity this limit (either finite or infinite) always exists. Passing from form (5.1.8) to form (5.1.9) and using the hierarchical relationship of type (1.3.4)

$$H_{\xi_{m+1}\ldots\xi_{m+n}|\xi_1,\ldots,\xi_m} = \sum_{i=1}^{n} H_{\xi_{m+i}|\xi_1,\ldots,\xi_{m+i-1}}$$

we perform the passage to the limit $m \to \infty$ and rewrite equality (5.1.10) as follows:

$$2\Gamma = \lim_{n\to\infty} \left[H_{\xi_1\ldots\xi_n} - nH_1\right], \tag{5.1.11}$$

since $H_{\xi_{m+i}|\xi_1,\ldots,\xi_{m+i-1}} \to H_1$ as $m \to \infty$. If we represent entropy $H_{\xi_1,\ldots,\xi_n}$ in the form of the hierarchical sum (1.3.4) here, then formula (5.1.11) will be reduced to

$$2\Gamma = \sum_{j=0}^{\infty} \left[H_{\xi_{j+1}|\xi_1\ldots\xi_j} - H_1\right]. \tag{5.1.12}$$

According to (5.1.11) we have

$$H_{\xi_1\ldots\xi_n} = nH_1 + 2\Gamma + o_n(1). \tag{5.1.13}$$

This formula refines (5.1.7). Since there is entropy $H_1$ on average for every element of sequence $\{\xi_k\}$, $nH_1$ accounts for $n$ such elements. Due to (5.1.13), for large $n$ entropy $H_{\xi_1\ldots\xi_n}$ differs from $nH_1$ by the value $2\Gamma$, which can be interpreted as entropy at the ends of the segment. Thus, entropy of each end of a segment of a stationary process is equal to $\Gamma$ in the limit. If we partition one long segment of a stationary process into two segments and eliminate the statistical ties (correlations) between the processes on these two segments, then entropy will increase approximately by $2\Gamma$, since two new ends will appear. Applying formula (5.1.13) to three segments of lengths $m$, $n$, $m+n$ and forming combination (5.1.8) we will have

$$H_{\xi_1\ldots\xi_m} + H_{\xi_1\ldots\xi_n} - H_{\xi_1\ldots\xi_{m+n}} = 2\Gamma + o_m(1) + o_n(1) + o_{m+n}(1).$$

This complies with (5.1.10) and confirms the above statement about the increase of entropy by $2\Gamma$.
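The monotonicity statements of this section can be checked numerically on a small case. The following is only a sketch under stated assumptions: the three-state chain `TRANS`, the symbol map `OBSERVE` and the cut-off `N` are hypothetical values chosen for illustration, and the quantities `h_est` and `two_gamma_est` are finite-$n$ stand-ins for the limits (5.1.3) and (5.1.11), not the limits themselves.

```python
import itertools
import math

# Hypothetical 3-state Markov chain (values chosen only for illustration).
TRANS = [[0.6, 0.3, 0.1],
         [0.2, 0.7, 0.1],
         [0.5, 0.1, 0.4]]
# The observed symbol merges hidden states 1 and 2, so the observed
# process is stationary but in general not Markov.
OBSERVE = [0, 1, 1]


def stationary(trans, iters=2000):
    """Stationary distribution by repeatedly applying P <- P * trans."""
    n = len(trans)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return p


P0 = stationary(TRANS)


def block_entropy(n):
    """Entropy (nats) of n consecutive observed symbols, by brute force."""
    probs = {}
    for path in itertools.product(range(len(TRANS)), repeat=n):
        p = P0[path[0]] if n > 0 else 1.0
        for a, b in zip(path, path[1:]):
            p *= TRANS[a][b]
        key = tuple(OBSERVE[s] for s in path)
        probs[key] = probs.get(key, 0.0) + p
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


N = 8
H = [block_entropy(n) for n in range(N + 1)]       # H[n]: entropy of n symbols
cond = [H[n] - H[n - 1] for n in range(1, N + 1)]  # conditional entropies of (5.1.2)
ratio = [H[n] / n for n in range(1, N + 1)]        # entropy of n symbols divided by n
h_est = cond[-1]                                   # finite-N stand-in for H_1
two_gamma_est = H[N] - N * h_est                   # finite-N stand-in for (5.1.11)

for n in range(1, N + 1):
    print(n, round(cond[n - 1], 6), round(ratio[n - 1], 6))
print("2*Gamma estimate:", round(two_gamma_est, 6))
```

Both printed columns should be non-increasing in $n$, with the ratio column never below the conditional-entropy column, in line with (5.1.2) and (5.1.6).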
5.2 Entropy of a Markov chain

1. Let the discrete (not necessarily stationary) process $\{\xi_k\}$ be Markov. This means that joint distribution laws of consecutive random variables can be decomposed into the product

$$P(\xi_k, \xi_{k+1}, \ldots, \xi_{k+l}) = P(\xi_k)\,\pi_k(\xi_k, \xi_{k+1}) \cdots \pi_{k+l-1}(\xi_{k+l-1}, \xi_{k+l}) \tag{5.2.1}$$

of functions $\pi_j(\xi, \xi') = P(\xi' \mid \xi)$, which are called transition probabilities. Probabilities $P(\xi_k)$ correspond to the marginal distribution of random variable $\xi_k$ in (5.2.1). Transition probability $\pi_j(\xi_j, \xi_{j+1})$ equals the conditional probability $P(\xi_{j+1} \mid \xi_j)$ and thereby it is non-negative and normalized:

$$\sum_{\xi'} \pi_j(\xi, \xi') = 1. \tag{5.2.2}$$
A discrete Markov process is also called a Markov chain. Passing from probabilities (5.2.1) to conditional probabilities, it is easy to find out that in the case of a Markov process we have

$$P(\xi_{k+m+1} \mid \xi_k, \ldots, \xi_{k+m}) = \pi_{k+m}(\xi_{k+m}, \xi_{k+m+1}) \qquad (m \ge 1)$$

and, consequently, $P(\xi_{k+m+1} \mid \xi_k, \ldots, \xi_{k+m}) = P(\xi_{k+m+1} \mid \xi_{k+m})$. Hence,

$$H(\xi_{k+m+1} \mid \xi_k, \ldots, \xi_{k+m}) = -\ln \pi_{k+m}(\xi_{k+m}, \xi_{k+m+1}) = H(\xi_{k+m+1} \mid \xi_{k+m}). \tag{5.2.3}$$

Due to (5.2.3) the hierarchical formula (1.3.6) takes the form

$$H(\xi_1, \ldots, \xi_n) = H(\xi_1) + H(\xi_2 \mid \xi_1) + H(\xi_3 \mid \xi_2) + \cdots + H(\xi_n \mid \xi_{n-1}). \tag{5.2.4}$$

Similarly, we have

$$H_{\xi_{k+m+1}|\xi_k\ldots\xi_{k+m}} = H_{\xi_{k+m+1}|\xi_{k+m}}, \tag{5.2.5}$$

$$H_{\xi_1\ldots\xi_n} = H_{\xi_1} + H_{\xi_2|\xi_1} + H_{\xi_3|\xi_2} + \cdots + H_{\xi_n|\xi_{n-1}} \tag{5.2.6}$$
for mean entropies.

A discrete Markov process is stationary if transition probabilities $\pi_k(\xi, \xi')$ do not depend on the value of parameter $k$ and all marginal distributions $P(\xi_k)$ are identical (equal to $P_{st}(\xi)$). Constructing joint distributions according to formula (5.2.1), it is easy to prove that they also satisfy the stationarity condition (5.1.1). Stationarity of marginal distribution $P_{st}(\xi)$ means that the equation

$$\sum_{\xi} P_{st}(\xi)\,\pi(\xi, \xi') = P_{st}(\xi') \tag{5.2.7}$$

is satisfied. The latter can be easily derived if we write the joint distribution $P(\xi_k, \xi_{k+1}) = P_{st}(\xi_k)\,\pi(\xi_k, \xi_{k+1})$ according to (5.2.1) and sum it over $\xi_k = \xi$, which yields $P(\xi_{k+1})$.

Taking into account (5.2.3) it is easy to see that entropy rate (5.1.3) for a stationary Markov process coincides with the entropy corresponding to transition probabilities $\pi(\xi, \xi')$, averaged with the stationary probability distribution:

$$H_1 = -\sum_{\xi} P_{st}(\xi) \sum_{\xi'} \pi(\xi, \xi') \ln \pi(\xi, \xi'). \tag{5.2.8}$$

Indeed, averaging (5.2.3) with stationary probabilities $P_{st}(\xi)\,\pi(\xi, \xi')$ we confirm that the same conditional entropy $H_{\xi_{k+m+1}|\xi_{k+1}\ldots\xi_{k+m}}$, equal to (5.2.8), corresponds to all values $m \ge 1$. Further, we average formula (5.2.4) with stationary probabilities. This will yield the equality
$$H_{\xi_1\ldots\xi_n} = H_{\xi_1} + (n-1)H_1. \tag{5.2.9}$$

Comparing the last equality with (5.1.13) we observe that in the stationary Markov case we have

$$2\Gamma = H_{\xi_1} - H_1 = \sum_{\xi, \xi'} P_{st}(\xi)\,\pi(\xi, \xi') \ln \frac{\pi(\xi, \xi')}{P_{st}(\xi')} \tag{5.2.10}$$

and $o_n(1) = 0$, i.e. the formula

$$H_{\xi_1\ldots\xi_n} = nH_1 + 2\Gamma \tag{5.2.11}$$

holds true exactly. We can also derive the result (5.2.10) with the help of (5.1.12). Indeed, because $H_{\xi_{j+1}|\xi_1\ldots\xi_j} = H_1$ for $j \ge 1$, as was noted earlier, there is only one non-zero term left on the right-hand side of (5.1.12): $2\Gamma = H_{\xi_1} - H_1$, which coincides with (5.2.10).

2. So, given the transition probability matrix

$$\pi = \|\pi(\xi, \xi')\|, \tag{5.2.12}$$
in order to calculate entropy, one should find the stationary distribution and then apply formulae (5.2.8), (5.2.9). Equation (5.2.7) defines the stationary distribution $P_{st}(\xi)$ uniquely if the Markov process is ergodic, i.e. if eigenvalue $\lambda = 1$ is a non-degenerate eigenvalue of matrix (5.2.12). According to the theorem about decomposition of determinants, equation $\det(\pi - 1) = 0$ entails (5.2.7), if we set

$$P_{st}(\xi) = A_{\xi\xi'} \Big/ \sum_{\xi''} A_{\xi''\xi'}, \tag{5.2.13}$$

where $A_{\xi\xi'}$ is the algebraic co-factor of element $a_{\xi\xi'}$ of matrix

$$\|a_{\xi\xi'}\| = \|\pi(\xi, \xi') - \delta_{\xi\xi'}\| \equiv \pi - 1.$$

Here the value of $\xi'$ is arbitrary, so that it is sufficient to calculate algebraic co-factors for one column of this matrix.

If a Markov process is non-ergodic, then the corresponding stationary distribution is not completely defined by the transition matrix, but also depends on an initial (or some other marginal) distribution. In this case, the algebraic co-factors in formula (5.2.13) equal zero, and thereby an indeterminate form of type 0/0 arises, i.e. the stationary distribution can be expressed in terms of lower-order minors and an initial distribution.

As is well known (see Doob in Russian [8] and the respective English original [7]), the states of a discrete-time Markov process can be put in such an order that its transition matrix has the following 'box' form:

$$\pi = \begin{pmatrix} \pi^{00} & \pi^{01} & \pi^{02} & \pi^{03} & \cdots \\ 0 & \pi^{11} & 0 & 0 & \cdots \\ 0 & 0 & \pi^{22} & 0 & \cdots \end{pmatrix} \equiv \Pi^{00} + \sum_i (\Pi^{0i} + \Pi^{ii}), \tag{5.2.14}$$
where

$$\Pi^{00} = \begin{pmatrix} \pi^{00} & 0 & \cdots \\ 0 & 0 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \qquad \Pi^{01} = \begin{pmatrix} 0 & \pi^{01} & 0 & \cdots \\ 0 & 0 & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix},$$

and so on. Here $\pi^{ij}$ denotes a matrix of dimensionality $r_i \times r_j$ that describes the transition from subset $E_i$ containing $r_i$ states to subset $E_j$ containing $r_j$ states. The remaining matrix cells contain zeros. Sets $E_1, E_2, \ldots$ constitute ergodic classes. Transitions occur from set $E_0$ to ergodic classes $E_1, E_2, \ldots$, and there is no exit from any of these classes. Hence, each class settles into its own stationary distribution. This distribution can be found by a formula of type (5.2.13):

$$P^i_{st}(\xi) = A^{ii}_{\xi\xi'} \Big/ \sum_{\xi'' \in E_i} A^{ii}_{\xi''\xi'}, \qquad \xi \in E_i, \tag{5.2.15}$$

with the only difference that now we consider algebraic co-factors of submatrix $\|\pi^{ll}_{\xi\xi'} - \delta_{\xi\xi'}\|$, $\xi, \xi' \in E_l$, which are not zero. Probabilities $P^l_{st}(\xi)$ pertain to $E_l$. The full stationary distribution appears to be the linear combination

$$P_{st}(\xi) = \sum_i q_i P^i_{st}(\xi) \tag{5.2.16}$$

of particular distributions (5.2.15), which are orthogonal, as is easy to see. Indeed,

$$\sum_{\xi} P^i_{st}(\xi)\,P^j_{st}(\xi) = 0, \qquad i \ne j.$$
Coefficients $q_i$ of this linear combination are determined by the initial distribution $P(\xi_1) = P_1(\xi_1)$. They coincide with the resultant probability of belonging to one or another ergodic class and satisfy the normalization constraint $\sum_i q_i = 1$. Taking into account the form (5.2.14) of the transition probability matrix and summing up transitions from $E_0$ to $E_i$ at different stages, we can find how they are explicitly expressed in terms of the initial distribution:

$$q_i = \sum_{\xi \in E_i} P_1(\xi) + \sum_{\xi, \xi'} P_1(\xi)\left[\Pi^{0i}_{\xi\xi'} + (\Pi^{00}\Pi^{0i})_{\xi\xi'} + (\Pi^{00}\Pi^{00}\Pi^{0i})_{\xi\xi'} + \cdots\right], \qquad i \ne 0.$$

In parentheses here we have a matrix product. By summing up powers of matrix $\Pi^{00}$ we get

$$q_i = \sum_{\xi \in E_i} P_1(\xi) + \sum_{\xi, \xi'} P_1(\xi)\left([1 - \Pi^{00}]^{-1}\Pi^{0i}\right)_{\xi\xi'}, \qquad q_0 = 0.$$
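The series for $q_i$ can be summed term by term on a small case. The sketch below assumes a hypothetical five-state chain in which states 0 and 1 form $E_0$, states 2 and 3 form $E_1$ and state 4 forms $E_2$; the matrix `PI`, the initial distribution `P1` and the function name are illustrative choices, not taken from the text. Instead of inverting $1 - \Pi^{00}$, the code accumulates the terms $\Pi^{0i}$, $\Pi^{00}\Pi^{0i}$, $\ldots$ directly.

```python
# Hypothetical chain: E0 = {0, 1} (passive), E1 = {2, 3}, E2 = {4}.
PI = [
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.3, 0.3, 0.0, 0.2, 0.2],
    [0.0, 0.0, 0.4, 0.6, 0.0],
    [0.0, 0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
E0 = [0, 1]
CLASSES = {1: [2, 3], 2: [4]}
P1 = [0.4, 0.3, 0.1, 0.1, 0.1]   # assumed initial distribution P_1


def class_probabilities(pi, e0, classes, p1, terms=500):
    """Sum the series for q_i: direct mass in E_i plus mass arriving from E0."""
    q = {i: sum(p1[s] for s in members) for i, members in classes.items()}
    mass = {s: p1[s] for s in e0}            # probability still inside E0
    for _ in range(terms):
        for i, members in classes.items():
            # contribution of the current term P1 * (Pi00)^k * Pi0i
            q[i] += sum(mass[s] * pi[s][t] for s in e0 for t in members)
        # move the remaining mass one more step inside E0: multiply by Pi00
        mass = {t: sum(mass[s] * pi[s][t] for s in e0) for t in e0}
    return q


q = class_probabilities(PI, E0, CLASSES, P1)
print(q)
```

For this particular matrix the $2 \times 2$ inverse can be written by hand, which gives $q_1 = 0.2 + 0.12/0.29$; the truncated series should agree with that value to rounding error.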
Taking (5.2.16) and (5.2.14) into account, we find it convenient to represent entropy rate (5.2.8) as the following sum:

$$H_1 = \sum_i q_i H^i_1, \tag{5.2.17}$$

where particular entropies

$$H^i_1 = -\sum_{\xi, \xi' \in E_i} P^i_{st}(\xi)\,\pi(\xi, \xi') \ln \pi(\xi, \xi') \tag{5.2.18}$$
are computed in much the same way as in the ergodic case. The reason is that a non-ergodic process is a statistical mixture (with probabilities $q_i$) of ergodic processes having a smaller number of states $r_i$. Summation in (5.2.17), (5.2.18) is carried out only over ergodic classes $E_1, E_2, \ldots$, since subspace $E_0$ has zero stationary probability. The union of all ergodic classes $E_1 + E_2 + \cdots = E_a$ (on which the stationary probability is concentrated) can be called an 'active' subspace. Distributions and transitions in the 'passive' space $E_0$ influence entropy rate $H_1$ only to the extent that they influence probabilities $q_i$. If the process is ergodic, i.e. there is just one ergodic class $E_1$ besides $E_0$, then $q_1 = 1$ and the passive space does not have any impact on the entropy rate. In this case the entropy rate of a Markov process in space $E_0 + E_a$ coincides with the entropy rate of a Markov process taking place in subspace $E_a = E_1$ and having the transition probability matrix $\pi^{11}$.

3. In order to illustrate application of the formulae derived above, we consider in this paragraph several simple examples.

Example 5.1. At first we consider the simplest discrete Markov process, the process with two states, i.e. matrix (5.2.12) is a $2 \times 2$ matrix. In consequence of the normalization constraint (5.2.2) its elements are not independent. There are just two independent parameters $\mu$ and $\nu$ that define matrix $\pi$:

$$\pi = \begin{pmatrix} 1-\mu & \mu \\ \nu & 1-\nu \end{pmatrix}.$$

Because in this case

$$a = \pi - 1 = \begin{pmatrix} -\mu & \mu \\ \nu & -\nu \end{pmatrix}; \qquad A_{11} = -\nu; \qquad A_{21} = -\mu,$$

the stationary distribution is derived according to (5.2.13):

$$P_{st}(1) = \frac{\nu}{\mu+\nu}; \qquad P_{st}(2) = \frac{\mu}{\mu+\nu}. \tag{5.2.19}$$

Furthermore,

$$\|P_{st}(\xi_k, \xi_{k+1})\| = \frac{1}{\mu+\nu} \begin{pmatrix} \nu - \mu\nu & \mu\nu \\ \mu\nu & \mu - \mu\nu \end{pmatrix}.$$

Applying formula (5.2.8), we find the entropy rate

$$H_1 = \frac{\nu}{\mu+\nu}\, h_2(\mu) + \frac{\mu}{\mu+\nu}\, h_2(\nu), \tag{5.2.20}$$

where

$$h_2(x) = -x \ln x - (1-x) \ln(1-x). \tag{5.2.20a}$$
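As a quick numerical cross-check of Example 5.1, the sketch below evaluates the co-factor formula (5.2.13) and the general entropy-rate formula (5.2.8) for a two-state chain and compares the result with the closed form (5.2.20). The parameter values $\mu = 0.2$, $\nu = 0.5$ and the function names are arbitrary illustrative choices.

```python
import math


def h2(x):
    """Binary entropy in nats, formula (5.2.20a); h2(0) = h2(1) = 0."""
    return -sum(t * math.log(t) for t in (x, 1.0 - x) if t > 0)


def two_state(mu, nu):
    """Stationary distribution via (5.2.13) and entropy rate via (5.2.8)."""
    pi = [[1.0 - mu, mu], [nu, 1.0 - nu]]
    # Co-factors of the first column of a = pi - 1 for a 2x2 matrix:
    # A11 = a22 = -nu and A21 = -a12 = -mu.
    cof_11 = pi[1][1] - 1.0
    cof_21 = -pi[0][1]
    total = cof_11 + cof_21
    p_st = [cof_11 / total, cof_21 / total]
    h1 = -sum(p_st[i] * pi[i][j] * math.log(pi[i][j])
              for i in range(2) for j in range(2) if pi[i][j] > 0)
    return p_st, h1


mu, nu = 0.2, 0.5
p_st, h1 = two_state(mu, nu)
h1_closed = (nu * h2(mu) + mu * h2(nu)) / (mu + nu)   # formula (5.2.20)
print(p_st, h1, h1_closed)
```

The two entropy values should coincide up to rounding, and `p_st` should reproduce (5.2.19).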
Next, one easily obtains the boundary entropy by formula (5.2.10) as follows:

$$2\Gamma = h_2\!\left(\frac{\mu}{\mu+\nu}\right) - \frac{\nu h_2(\mu) + \mu h_2(\nu)}{\mu+\nu}.$$

Example 5.2. Now suppose there is given a process with three states that has the transition probability matrix

$$\pi = \begin{pmatrix} 1-\mu & \mu' & \mu'' \\ \nu' & 1-\nu & \nu'' \\ \lambda' & \lambda'' & 1-\lambda \end{pmatrix} \qquad (\mu = \mu' + \mu'' \text{ etc.}).$$

At the same time it is evident that

$$a = \begin{pmatrix} -\mu & \mu' & \mu'' \\ \nu' & -\nu & \nu'' \\ \lambda' & \lambda'' & -\lambda \end{pmatrix}.$$

We find the respective stationary distribution by formula (5.2.13)

$$P_{st}(\xi) = A_{\xi 1}/(A_{11} + A_{21} + A_{31}),$$

where

$$A_{11} = \begin{vmatrix} -\nu & \nu'' \\ \lambda'' & -\lambda \end{vmatrix} = \lambda\nu - \lambda''\nu'' = \lambda'\nu' + \lambda'\nu'' + \lambda''\nu'; \quad A_{21} = \mu'\lambda' + \mu'\lambda'' + \lambda''\mu''; \quad A_{31} = \mu'\nu'' + \nu'\mu'' + \nu''\mu''. \tag{5.2.21}$$

The entropy rate (5.2.8) turns out to be equal to

$$H_1 = P_{st}(1)\,h_3(\mu', \mu'') + P_{st}(2)\,h_3(\nu', \nu'') + P_{st}(3)\,h_3(\lambda', \lambda''),$$

where we use the notation

$$h_3(\mu', \mu'') = -\mu' \ln \mu' - \mu'' \ln \mu'' - (1 - \mu' - \mu'') \ln(1 - \mu' - \mu''). \tag{5.2.22}$$

The given process with three states turns out to be non-ergodic if, for instance, $\lambda' = \lambda'' = 0$, $\mu'' = \nu'' = 0$, so that the transition probability matrix has the 'block' form

$$\pi = \begin{pmatrix} 1-\mu & \mu & 0 \\ \nu & 1-\nu & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

For such a matrix, the third state remains constant, and transitions are made only between the first and the second states. Algebraic co-factors (5.2.21) vanish. As is easy to see, the following distributions are stationary:

$$P^1_{st}(\xi) = \begin{cases} \nu/(\mu+\nu), & \text{for } \xi = 1; \\ \mu/(\mu+\nu), & \text{for } \xi = 2; \\ 0, & \text{for } \xi = 3; \end{cases} \qquad P^2_{st}(\xi) = \delta_{\xi 3}.$$

The first of them coincides with (5.2.19). Functions $P^1_{st}(\xi)$ and $P^2_{st}(\xi)$ are orthogonal. Using the given initial distribution $P_1(\xi)$, we find the resultant stationary distribution by formula (5.2.16):

$$P_{st}(\xi) = [P_1(1) + P_1(2)]\,P^1_{st}(\xi) + P_1(3)\,P^2_{st}(\xi).$$

Now, due to (5.2.17), it is easy to write the corresponding entropy rate as follows:

$$H_1 = [P_1(1) + P_1(2)]\,\frac{\nu h_2(\mu) + \mu h_2(\nu)}{\mu+\nu}.$$
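Example 5.2 can be reproduced numerically as follows. This is a sketch with hypothetical parameter values ($\mu' = 0.1$, $\mu'' = 0.2$, $\nu' = 0.3$, $\nu'' = 0.1$, $\lambda' = 0.2$, $\lambda'' = 0.4$ for the ergodic case; $\mu = 0.2$, $\nu = 0.5$ and an assumed initial distribution for the block case); the helper names are illustrative. The co-factors are computed from $2 \times 2$ minors, and the block matrix is used to show that they vanish in the non-ergodic case.

```python
import math


def minor2(m, skip_row, skip_col):
    """Determinant of the 2x2 minor of a 3x3 matrix."""
    r = [i for i in range(3) if i != skip_row]
    c = [j for j in range(3) if j != skip_col]
    return m[r[0]][c[0]] * m[r[1]][c[1]] - m[r[0]][c[1]] * m[r[1]][c[0]]


def cofactors_first_column(pi):
    """Algebraic co-factors A_11, A_21, A_31 of a = pi - 1, as in (5.2.21)."""
    a = [[pi[i][j] - (1.0 if i == j else 0.0) for j in range(3)] for i in range(3)]
    return [(-1) ** i * minor2(a, i, 0) for i in range(3)]


def stationary_by_cofactors(pi):
    """Formula (5.2.13); raises if the co-factors vanish (non-ergodic chain)."""
    cof = cofactors_first_column(pi)
    total = sum(cof)
    if abs(total) < 1e-12:
        raise ValueError("co-factors vanish: the chain is not ergodic")
    return [c / total for c in cof]


def entropy_rate(pi, p_st):
    """Formula (5.2.8) in nats."""
    return -sum(p_st[i] * pi[i][j] * math.log(pi[i][j])
                for i in range(3) for j in range(3) if pi[i][j] > 0)


def h2(x):
    return -sum(t * math.log(t) for t in (x, 1.0 - x) if t > 0)


# Ergodic case with hypothetical parameters.
ERGODIC = [[0.7, 0.1, 0.2],
           [0.3, 0.6, 0.1],
           [0.2, 0.4, 0.4]]
cof = cofactors_first_column(ERGODIC)
p_st = stationary_by_cofactors(ERGODIC)
h1 = entropy_rate(ERGODIC, p_st)

# Non-ergodic 'block' case: mixture of the two stationary distributions.
mu, nu = 0.2, 0.5
BLOCK = [[1 - mu, mu, 0.0], [nu, 1 - nu, 0.0], [0.0, 0.0, 1.0]]
P1 = [0.3, 0.3, 0.4]                      # assumed initial distribution
q1 = P1[0] + P1[1]
p_mix = [q1 * nu / (mu + nu), q1 * mu / (mu + nu), P1[2]]
h1_block = entropy_rate(BLOCK, p_mix)
h1_block_closed = q1 * (nu * h2(mu) + mu * h2(nu)) / (mu + nu)
print(cof, p_st, h1, h1_block, h1_block_closed)
```

With these values the co-factors come out as 0.20, 0.14 and 0.09, matching the three-term sums in (5.2.21), while calling `stationary_by_cofactors(BLOCK)` raises the error that signals the 0/0 indeterminacy.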
5.3 Entropy rate of part of the components of a discrete Markov process and of a conditional Markov process

1. At first we compute the entropy rate of a stationary non-Markov discrete process $\{y_k\}$, which can be complemented to a Markov process by appending additional components $x_k$. The set of pairs $\xi = (x, y)$ will constitute the phase space of a Markov process, and process $\{\xi_k\} = \{x_k, y_k\}$ will in turn be Markov. The results from the previous paragraph can be applied to the new process; in particular, we can find the entropy rate that we denote as $H^1_{xy} = h_{xy}$. Entropy $H_{y_1\ldots y_n}$ of the initial process differs from the entropy

$$H_{\xi_1\ldots\xi_n} = H_{y_1\ldots y_n} + H_{x_1\ldots x_n|y_1\ldots y_n} \tag{5.3.1}$$

of the Markov process by the conditional entropy $H_{x_1\ldots x_n|y_1\ldots y_n}$, called entropy of the conditional Markov process $\{x_k\}$ (for fixed $\{y_k\}$).

Along with entropy rate $h_{xy} = h_\xi$ of a stationary Markov process, we introduce the entropy rates of the initial y-process

$$h_y = \lim_{n\to\infty} \frac{1}{n} H_{y_1\ldots y_n} \tag{5.3.2}$$

and the conditional x-process

$$h_{x|y} = \lim_{n\to\infty} \frac{1}{n} H_{x_1\ldots x_n|y_1\ldots y_n}. \tag{5.3.3}$$

By virtue of (5.3.1) they are related to $h_{xy}$ as follows:
$$h_y + h_{x|y} = h_{xy}. \tag{5.3.4}$$

Conversely (in the general case), one may regard the x-process as the non-Markov a priori process and the y-process as the conditional process with fixed $x$. Their entropy rates will be, respectively,

$$h_x = \lim_{n\to\infty} \frac{1}{n} H_{x_1\ldots x_n}; \qquad h_{y|x} = \lim_{n\to\infty} \frac{1}{n} H_{y_1\ldots y_n|x_1\ldots x_n}. \tag{5.3.5}$$

The relationships (5.3.1), (5.3.4) are replaced by

$$H_{x_1\ldots x_n} + H_{y_1\ldots y_n|x_1\ldots x_n} = H_{\xi_1\ldots\xi_n}; \tag{5.3.6}$$

$$h_x + h_{y|x} = h_{xy}. \tag{5.3.7}$$
Since we already know how to find entropy of a Markov process, it suffices to learn how to calculate just one of the variables $h_y$ or $h_{x|y}$. The second one is found with the help of (5.3.4). Due to symmetry the corresponding variable out of $h_x$, $h_{y|x}$ can be computed in the same way, whereas the second one follows from (5.3.7). The method described below (clause 2) can be employed for calculating entropy of a conditional Markov process in both stationary and non-stationary cases. As a rule, stationary probability distributions and limits (5.3.2), (5.3.3), (5.3.5) exist only in the stationary case.

Conditional entropy $H_{x_1\ldots x_n|y_1\ldots y_n}$ can be represented in the form of the sum

$$H_{x_1\ldots x_n|y_1\ldots y_n} = H_{x_1|y_1\ldots y_n} + H_{x_2|x_1 y_1\ldots y_n} + H_{x_3|x_1 x_2 y_1\ldots y_n} + \cdots + H_{x_n|x_1\ldots x_{n-1} y_1\ldots y_n}.$$

Also, limit (5.3.3) can be replaced by the limit

$$h_{x|y} = \lim_{k\to\infty,\; n-k\to\infty} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots}. \tag{5.3.8}$$

Equivalence of two such representations of entropy rate was discussed in Theorem 5.1. Now, however, process $\{x_k\}$ is conditional, and hence non-stationary, so that we cannot apply Theorem 5.1 to this process directly. Hence, we need to generalize that theorem.

Theorem 5.2. For stationary process $\{\xi_k\} = \{x_k, y_k\}$ the limits (5.3.3) and (5.3.8) are equal.

Proof. By virtue of

$$\lim_{n-k\to\infty} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\xi_1\ldots\xi_{k-1} y_k y_{k+1}\ldots},$$

$$\lim_{k\to\infty} H_{x_k|\xi_1\ldots\xi_{k-1} y_k y_{k+1}\ldots} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots},$$

we have

$$H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\ldots\xi_{k-2}\xi_{k-1} y_k y_{k+1}\ldots} + o_k(1) + o_{n-k}(1) \tag{5.3.9}$$
(it is assumed that $o_{n-k}(1)$ is uniform with respect to $k$). Further, we represent $n$ as the sum of three numbers: $n = m + r + s$. Substituting (5.3.9) into the equality

$$H_{x_1\ldots x_n|y_1\ldots y_n} = H_{x_1\ldots x_m|y_1\ldots y_n} + H_{x_{m+1}|\xi_1\ldots\xi_m y_{m+1}\ldots y_n} + \cdots + H_{x_{m+r}|\xi_1\ldots\xi_{m+r-1} y_{m+r}\ldots y_n} + H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n}$$

we obtain that

$$\frac{1}{n} H_{x_1\ldots x_n|y_1\ldots y_n} = \frac{1}{m+r+s} H_{x_1\ldots x_m|y_1\ldots y_n} + \frac{r}{m+r+s} H_{x_k|\ldots\xi_{k-1} y_k\ldots} + \frac{r}{m+r+s}\left[o_m(1) + o_s(1)\right] + \frac{1}{m+r+s} H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n}. \tag{5.3.10}$$

Here we mean that

$$H_{x_1\ldots x_m|y_1\ldots y_n} = \sum_{k=1}^{m} H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} \le m H_{x_k}$$

and

$$H_{x_{m+r+1}\ldots x_n|\xi_1\ldots\xi_{m+r} y_{m+r+1}\ldots y_n} \le s H_{x_k},$$

because conditional entropy is less than or equal to the unconditional one. Thus, if we pass to the limit $m \to \infty$, $r \to \infty$, $s \to \infty$ in (5.3.10) in such a way that $r/m \to \infty$ and $r/s \to \infty$, then only one term $H_{x_k|\ldots\xi_{k-1} y_k\ldots}$ will be left in that expression. This proves the theorem.

As is seen from the above proof, Theorem 5.2 is valid not only in the case of a Markov joint process $\{x_k, y_k\}$. Furthermore, in consequence of the Markov condition we have

$$P(x_k \mid x_1 \ldots x_{k-1} y_1 \ldots y_n) = P(x_k \mid x_{k-1}, y_{k-1}, y_k, \ldots, y_n) \qquad (n \ge k)$$

and $H_{x_k|x_1\ldots x_{k-1} y_1\ldots y_n} = H_{x_k|\xi_{k-1} y_k\ldots y_n}$ in the case of a Markov process. Consequently, formula (5.3.8) takes the form

$$h_{x|y} = H_{x_k|x_{k-1} y_{k-1} y_k y_{k+1}\ldots}.$$

2. Let us now calculate entropies of the y-process and a conditional process induced by a Markov joint process. In order to do this, we consider the conditional entropy

$$H_{y_k|y_1\ldots y_{k-1}} = -E[\ln P(y_k \mid y_1 \ldots y_{k-1})], \tag{5.3.11}$$
that defines (according to (5.1.3)) the entropy rate $h_y$ in the limit

$$h_y = \lim_{k\to\infty} H_{y_k|y_1\ldots y_{k-1}}. \tag{5.3.12}$$

We use the Markov condition (5.2.1) and write down the multivariate probability distribution

$$P(\xi_1, \ldots, \xi_k) = P(\xi_1)\,\pi(\xi_1, \xi_2) \cdots \pi(\xi_{k-1}, \xi_k),$$

where $\xi_j$ denotes the pair $x_j, y_j$. Applying the formula of inverse probability (Bayes' formula) we obtain from here that

$$P(y_k \mid y_1 \ldots y_{k-1}) = \frac{\sum_{x_1\ldots x_k} P(\xi_1)\,\pi(\xi_1, \xi_2) \cdots \pi(\xi_{k-1}, \xi_k)}{\sum_{x_1\ldots x_{k-1}} P(\xi_1)\,\pi(\xi_1, \xi_2) \cdots \pi(\xi_{k-2}, \xi_{k-1})}. \tag{5.3.13}$$

According to the theory of conditional Markov processes (see Stratonovich [56]) we introduce final a posteriori probabilities

$$P(x_{k-1} \mid y_1 \ldots y_{k-1}) \equiv W_{k-1}(x_{k-1}) = \frac{\sum_{x_1\ldots x_{k-2}} P(\xi_1)\,\pi(\xi_1, \xi_2) \cdots \pi(\xi_{k-2}, \xi_{k-1})}{\sum_{x_1\ldots x_{k-1}} P(\xi_1)\,\pi(\xi_1, \xi_2) \cdots \pi(\xi_{k-2}, \xi_{k-1})}. \tag{5.3.14}$$

With the help of these probabilities expression (5.3.13) can be written as follows:

$$P(y_k \mid y_1, \ldots, y_{k-1}) = \sum_{x_{k-1}, x_k} W_{k-1}(x_{k-1})\,\pi(x_{k-1}, y_{k-1}; x_k, y_k)$$

or, equivalently,

$$P(y_k \mid y_1, \ldots, y_{k-1}) = \sum_x W_{k-1}(x)\,\pi(x, y_{k-1}; y_k), \tag{5.3.15}$$

if we denote $\sum_{x'} \pi(x, y; x', y') = \pi(x, y; y')$. Then formula (5.3.11) takes the form

$$H_{y_k|y_1\ldots y_{k-1}} = -E\left[\ln \sum_x W_{k-1}(x)\,\pi(x, y_{k-1}; y_k)\right].$$
Here the symbol $E$ denotes averaging over $y_1, \ldots, y_{k-1}, y_k$. Furthermore, if we substitute (5.3.15) into the formula

$$H_{y_k|y_1\ldots y_{k-1}} = -E\left[\sum_{y_k} P(y_k \mid y_1, \ldots, y_{k-1}) \ln P(y_k \mid y_1, \ldots, y_{k-1})\right],$$

then we obtain

$$H_{y_k|y_1\ldots y_{k-1}} = -E\left[\sum_{y_k} \sum_x W_{k-1}(x)\,\pi(x, y_{k-1}; y_k) \ln \sum_x W_{k-1}(x)\,\pi(x, y_{k-1}; y_k)\right], \tag{5.3.16}$$

where symbol $E$ denotes averaging over $y_1, \ldots, y_{k-1}$.

It is convenient to consider a posteriori probabilities $W_{k-1}(\cdot)$ as a distribution with respect to both variables $(x, y) = \xi$, redefining them by the formula

$$W_{k-1}(x, y) = \begin{cases} W_{k-1}(x), & \text{for } y = y_{k-1}, \\ 0, & \text{for } y \ne y_{k-1}. \end{cases} \tag{5.3.17}$$

Then formula (5.3.15) is reduced to

$$P(y_k \mid y_1 \ldots y_{k-1}) = \sum_{\xi} W_{k-1}(\xi)\,\pi(\xi, y_k) \tag{5.3.17a}$$

and we can also rewrite expression (5.3.16) as

$$H_{y_k|y_1\ldots y_{k-1}} = -E\left[\sum_{y_k} \sum_{\xi} W_{k-1}(\xi)\,\pi(\xi, y_k) \ln \sum_{\xi} W_{k-1}(\xi)\,\pi(\xi, y_k)\right]. \tag{5.3.18}$$

Here $E$, as in (5.3.16), denotes averaging over $y_1, \ldots, y_{k-1}$, but the expression to be averaged depends on $y_1, \ldots, y_{k-1}$ only via a posteriori probabilities $W_{k-1}(\cdot)$. That is why we can suppose that symbol $E$ in (5.3.18) corresponds to averaging over random variables $W_{k-1}(A_1), \ldots, W_{k-1}(A_L)$ ($A_1, \ldots, A_L$ is the state space of process $\xi = (x, y)$, and $L$ is the number of its states). Introducing the distribution law $P(dW_{k-1})$ for these random variables we reduce formula (5.3.18) to the form

$$H_{y_k|y_1\ldots y_{k-1}} = -\int P(dW_{k-1}) \sum_{y'} \sum_{\xi} W_{k-1}(\xi)\,\pi(\xi, y') \ln \sum_{\xi} W_{k-1}(\xi)\,\pi(\xi, y'). \tag{5.3.19}$$
These are the main results. We see that in order to calculate conditional entropy of some components of a Markov process we need to investigate posterior probabilities $\{W_{k-1}(\cdot)\}$ as a stochastic process in its own right. This process is studied in the theory of conditional Markov processes. It is well known that when $k$ increases, the corresponding probabilities are transformed by certain recurrence relations. In order to introduce them, let us write down the equality analogous to (5.3.14), replacing $k-1$ with $k$:

$$W_k(x_k) = \frac{\sum_{x_{k-1}} \left[\sum_{x_1\ldots x_{k-2}} P(\xi_1)\,\pi(\xi_1, \xi_2) \cdots \pi(\xi_{k-2}, \xi_{k-1})\right] \pi(\xi_{k-1}, \xi_k)}{\sum_{x_{k-1}, x_k} \left[\sum_{x_1\ldots x_{k-2}} P(\xi_1)\,\pi(\xi_1, \xi_2) \cdots \pi(\xi_{k-2}, \xi_{k-1})\right] \pi(\xi_{k-1}, \xi_k)}.$$

Substituting (5.3.14) into the last formula we obtain the following recurrence relations:

$$W_k(x_k) = \frac{\sum_{x_{k-1}} W_{k-1}(x_{k-1})\,\pi(x_{k-1}, y_{k-1}; x_k, y_k)}{\sum_{x_{k-1}, x_k} W_{k-1}(x_{k-1})\,\pi(x_{k-1}, y_{k-1}; x_k, y_k)}. \tag{5.3.20}$$

Process $\{W_{k-1}(\cdot)\}$ (considered as a stochastic process in its own right) is called a secondary a posteriori W-process. As is known from the theory of conditional Markov processes, this process is Markov. Let us consider its transition probabilities. Transformation (5.3.20), which can be represented as

$$W_k(\xi) = \frac{\delta_{y y_k} \sum_{\xi'} W_{k-1}(\xi')\,\pi(\xi', \xi)}{\sum_{\xi'} W_{k-1}(\xi')\,\pi(\xi', y_k)}, \qquad \xi = (x, y), \tag{5.3.21}$$

explicitly depends on random variable $y_k$. The a posteriori probabilities $P(y_k \mid y_1, \ldots, y_{k-1})$ of this variable coincide with (5.3.15) and, consequently, they are fully defined by the 'state' $W_{k-1}(\cdot)$ of the secondary a posteriori process at time $k-1$. Thus, every transformation (5.3.21) occurs with probability (5.3.15). This means that the 'transition probability density' of the secondary process $\{W_k\}$ can be written as
$$\pi^W(W_{k-1}, W_k) = \sum_{y_k} \delta\!\left(W_k(\xi) - \frac{\delta_{y y_k} \sum_{\xi'} W_{k-1}(\xi')\,\pi(\xi', \xi)}{\sum_{\xi'} W_{k-1}(\xi')\,\pi(\xi', y_k)}\right) \sum_{\xi'} W_{k-1}(\xi')\,\pi(\xi', y_k). \tag{5.3.22}$$

Here $\delta(W)$ is the $L$-dimensional $\delta$-function defined by the formula

$$\int f(W)\,\delta(W) \prod_{\xi} dW(\xi) = f(0).$$

This $\delta$-function corresponds to a measure concentrated at the origin of the $L$-dimensional space.

Hence, we see that the entropy of process $\{y_k\}$ (which is non-Markov but is part of some Markov process) can be found by methods of the theory of Markov processes after we pass to a secondary a posteriori process, which is Markov. In the stationary case, we can use a conventional approach to find the stationary distribution $P_{st}(dW)$, which is the limit as $k \to \infty$ of the distribution $P_k(dW)$ involved in (5.3.19). Substituting (5.3.19) into (5.3.12) and passing to the limit $k \to \infty$ we obtain the following entropy rate:

$$h_y = -\int P_{st}(dW) \sum_{y} \sum_{\xi} W(\xi)\,\pi(\xi, y) \ln \sum_{\xi} W(\xi)\,\pi(\xi, y). \tag{5.3.23}$$
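The role of the a posteriori probabilities can be illustrated by a direct computation. The sketch below is an illustration under stated assumptions, not the author's procedure: it takes a hypothetical three-state joint chain `PI` whose states map to two observed values through `OBS`, applies recursion (5.3.21) along every possible observation history, and accumulates the exact conditional entropies $H_{y_k|y_1\ldots y_{k-1}}$ from the predictive probabilities (5.3.17a). All names and numerical values are illustrative.

```python
import math

# Hypothetical joint chain over xi = 1, 2, 3 (list indices 0, 1, 2).
PI = [[0.6, 0.3, 0.1],
      [0.2, 0.7, 0.1],
      [0.5, 0.1, 0.4]]
OBS = [1, 2, 2]          # observed y for each state xi
Y_VALUES = (1, 2)
STATES = range(len(PI))


def stationary(pi, iters=2000):
    """Stationary distribution by repeatedly applying P <- P * pi."""
    p = [1.0 / len(pi)] * len(pi)
    for _ in range(iters):
        p = [sum(p[i] * pi[i][j] for i in STATES) for j in STATES]
    return p


def filter_step(w, y):
    """One step of (5.3.21): returns (P(y_k = y | past), W_k)."""
    unnorm = [sum(w[a] * PI[a][b] for a in STATES) if OBS[b] == y else 0.0
              for b in STATES]
    p_y = sum(unnorm)                       # predictive probability (5.3.17a)
    w_next = [u / p_y for u in unnorm] if p_y > 0 else None
    return p_y, w_next


def conditional_entropies(k_max):
    """Exact H of y_k given y_1..y_{k-1} for k = 2..k_max (nats)."""
    p0 = stationary(PI)
    frontier = []                           # pairs (P(history), posterior W)
    for y in Y_VALUES:
        mass = [p0[s] if OBS[s] == y else 0.0 for s in STATES]
        p = sum(mass)
        if p > 0:
            frontier.append((p, [m / p for m in mass]))
    result = []
    for _ in range(2, k_max + 1):
        h, new_frontier = 0.0, []
        for p_hist, w in frontier:
            for y in Y_VALUES:
                p_y, w_next = filter_step(w, y)
                if p_y > 0:
                    h -= p_hist * p_y * math.log(p_y)
                    new_frontier.append((p_hist * p_y, w_next))
        result.append(h)
        frontier = new_frontier
    return result


cond_entropies = conditional_entropies(10)
print([round(h, 6) for h in cond_entropies])
```

The printed values should form a non-increasing sequence, in agreement with (5.1.2); its limit is the entropy rate $h_y$ of (5.3.12). The number of histories doubles at each step, so this enumeration is only practical for short horizons.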
Further, we can find $h_{x|y}$ by formula (5.3.4), if required. Before we consider an example, let us prove the following theorem, which is directly related to the above.

Theorem 5.3. Entropy $H_{y_2\ldots y_l|y_1}$ of a non-Markov y-process coincides with the analogous entropy of the corresponding secondary a posteriori (Markov) process:

$$H_{y_2\ldots y_l|y_1} = H_{W_2\ldots W_l|W_1}. \tag{5.3.24}$$

Therefore, the entropy rates are also equal: $h_y = h_W$.

Proof. Since the equalities

$$H_{y_2\ldots y_l|y_1} = \sum_{k=2}^{l} H_{y_k|y_1\ldots y_{k-1}}, \qquad H_{W_2\ldots W_l|W_1} = \sum_{k=2}^{l} H_{W_k|W_1\ldots W_{k-1}}$$

are valid together with formula (5.3.12) and the analogous formula for $\{W_k\}$, it is sufficient to prove the equality

$$H_{y_k|y_1\ldots y_{k-1}} = H_{W_k|W_1\ldots W_{k-1}} = H_{W_k|W_{k-1}}. \tag{5.3.25}$$

Here we have $H_{W_k|W_1\ldots W_{k-1}} = H_{W_k|W_{k-1}}$ due to the Markov property. Let $S_0$ be some initial point from the sample space $W(\cdot)$. According to (5.3.21) transitions from this point to other points $S(y_k)$ may occur depending on the value of $y_k$ ($S(y_k)$ is the point with coordinates (5.3.21)). Those points are different for different values of $y_k$. Indeed, if we assume that two points from $S(y_k)$, say $S' = S(y')$ and $S'' = S(y'')$, coincide, i.e. the equality $W'_k(x, y) = W''_k(x, y)$ is valid, then it follows from the latter equality that

$$\sum_x W'_k(x, y) = \sum_x W''_k(x, y). \tag{5.3.26}$$

But due to (5.3.17) the relationship

$$\sum_x W_k(x, y) = \delta_{y y_k}$$

holds true, so that the two sides of (5.3.26) equal $\delta_{yy'}$ and $\delta_{yy''}$, respectively, and thereby equality (5.3.26) entails $y' = y''$ (a contradiction). Thus, we can set a one-to-one correspondence (bijection) between points $S(y_k)$ and values $y_k$. Hence,

$$P(S(y_k) \mid S_0) = P(y_k \mid y_1, \ldots, y_{k-1}), \qquad H_{S(y_k)}(\cdot \mid S_0) = H_{y_k}(\cdot \mid y_1, \ldots, y_{k-1}).$$

But, if we properly choose $S_0 = W_{k-1}$, then the events occurring as conditions coincide, and after averaging we derive (5.3.25). This ends the proof.
Example 5.3. Suppose that a non-Markov process $\{y_k\}$ is a process with two states, i.e. $y_k$ can take one out of two values, say 1 or 2. Further, suppose that this process can be turned into a Markov process by adding variable $x$ and thus decomposing state $y = 2$ into two states: $\xi = 2$ and $\xi = 3$. Namely, $\xi = 2$ corresponds to $x = 1$, $y = 2$; $\xi = 3$ corresponds to $x = 2$, $y = 2$. State $y = 1$ can be related to $\xi = 1$, $x = 1$, for instance. The joint process $\{\xi_k\} = \{x_k, y_k\}$ is stationary Markov and is described by the transition probability matrix

$$\|\pi_{\xi\xi'}\| = \begin{pmatrix} \pi_{11} & \pi_{12} & \pi_{13} \\ \pi_{21} & \pi_{22} & \pi_{23} \\ \pi_{31} & \pi_{32} & \pi_{33} \end{pmatrix}.$$

According to (5.3.17) a posteriori distributions $W_k(\xi) = W_k(x, y)$ have the form

$$(W_k(\xi = 1), W_k(\xi = 2), W_k(\xi = 3)) = \begin{cases} (1, 0, 0), & \text{for } y_k = 1, \\ (0, W_k(2), W_k(3)), & \text{for } y_k = 2. \end{cases} \tag{5.3.27}$$

Due to (5.3.21) value $y_k = 2$ corresponds to the transformation

$$W_k(1) = 0; \qquad W_k(2) = \frac{W_{k-1}(1)\pi_{12} + W_{k-1}(2)\pi_{22} + W_{k-1}(3)\pi_{32}}{W_{k-1}(1)(\pi_{12} + \pi_{13}) + W_{k-1}(2)(\pi_{22} + \pi_{23}) + W_{k-1}(3)(\pi_{32} + \pi_{33})}; \qquad W_k(3) = 1 - W_k(2). \tag{5.3.28}$$

We denote by $S_0$ the point $(1, 0, 0)$ that belongs to the sample space $(W(1), W(2), W(3))$ and corresponds to distribution (5.3.27) for $y_k = 1$. Further, we investigate possible transitions from that point. Substituting value $W_{k-1} = (1, 0, 0)$ into (5.3.28) we obtain the transition to the point

$$W(1) = 0; \qquad W(2) = \frac{\pi_{12}}{\pi_{12} + \pi_{13}}; \qquad W(3) = \frac{\pi_{13}}{\pi_{12} + \pi_{13}}, \tag{5.3.29}$$

which we denote as $S_1$. In consequence of (5.3.18) such a transition occurs with probability

$$p_1 = \pi_{12} + \pi_{13}. \tag{5.3.30}$$

The process stays at point $S_0$ with probability $1 - p_1 = \pi_{11}$. Now we consider transitions from point $S_1$. Replacing $W_{k-1}$ by values (5.3.29) in formula (5.3.28) with $y_k = 2$ we obtain the coordinates
$$W(1) = 0; \qquad W(2) = \frac{\pi_{12}\pi_{22} + \pi_{13}\pi_{32}}{\pi_{12}(\pi_{22} + \pi_{23}) + \pi_{13}(\pi_{32} + \pi_{33})}; \qquad W(3) = 1 - W(2) \tag{5.3.31}$$

of the new point $S_2$. The transition from $S_1$ to $S_2$ occurs with probability

$$p_2 = \frac{\pi_{12}(\pi_{22} + \pi_{23}) + \pi_{13}(\pi_{32} + \pi_{33})}{\pi_{12} + \pi_{13}}. \tag{5.3.32}$$

This expression is derived by plugging (5.3.29) into the formula

$$P(y_k = 2 \mid y_1, \ldots, y_{k-1}) = P(S_k \mid S_{k-1}) = W_{k-1}(2)(\pi_{22} + \pi_{23}) + W_{k-1}(3)(\pi_{32} + \pi_{33}), \tag{5.3.33}$$

obtained from (5.3.17a). The return to point $S_0$ occurs with probability $1 - p_2$. Similarly, substituting values (5.3.31) into (5.3.33) we obtain the probability

$$p_3 = \frac{(\pi_{12}\pi_{22} + \pi_{13}\pi_{32})(\pi_{22} + \pi_{23}) + (\pi_{12}\pi_{23} + \pi_{13}\pi_{33})(\pi_{32} + \pi_{33})}{\pi_{12}(\pi_{22} + \pi_{23}) + \pi_{13}(\pi_{32} + \pi_{33})} \tag{5.3.34}$$

of the transition from $S_2$ to the next point $S_3$ and so forth. Transition probabilities $p_k$ to the successive points are calculated consecutively as described. Each time a return to point $S_0$ occurs with probability $1 - p_k$. The probability that there has been no return yet to point $S_0$ at time $k$ is evidently equal to $p_1 p_2 \cdots p_k$. If consecutive values $p_k$ do not converge to 1, then the indicated probability converges to zero as $k \to \infty$. Therefore, usually a return to point $S_0$ eventually occurs with probability 1. If we had chosen some different point $S'_0$ as an initial point, then a walk over some different sequence of points would have been observed, but eventually the process would have returned to point $S_0$. After such a transition the aforementioned walk over points $S_0, S_1, S_2, \ldots$ (with already calculated transition probabilities) will be observed.

The indicated scheme of transitions of the secondary Markov process allows us to easily find a stationary probability distribution. It will be concentrated at points $S_0, S_1, S_2, \ldots$ For those points the transition probability matrix has the form

$$\pi^w = \begin{pmatrix} 1 - p_1 & p_1 & 0 & 0 & \cdots \\ 1 - p_2 & 0 & p_2 & 0 & \cdots \\ 1 - p_3 & 0 & 0 & p_3 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}. \tag{5.3.35}$$

Taking into account equalities $P_{st}(S_k) = p_k P_{st}(S_{k-1})$, $k = 1, 2, \ldots$, following from (5.2.7), (5.3.35), and the normalization constraint, we find the stationary probabilities for the specified points:
$$P_{st}(S_0) = \frac{1}{1 + p_1 + p_1 p_2 + \cdots}; \qquad P_{st}(S_k) = \frac{p_1 \cdots p_k}{1 + p_1 + p_1 p_2 + \cdots}, \qquad k = 1, 2, \ldots \tag{5.3.36}$$

In order to compute entropy rate (5.3.12) it only remains to substitute expressions (5.3.36) into the formula

$$h_y = -\sum_{k=0}^{\infty} P_{st}(S_k)\left[p_{k+1} \ln p_{k+1} + (1 - p_{k+1}) \ln(1 - p_{k+1})\right], \tag{5.3.37}$$
which is resulted from (5.3.23) or from Theorem 5.3 and formula (5.2.8) applied to transition probabilities (5.3.35). In the particular case when the value of x does not affect transitions from one value of y to another one we have
π22 + π23 = π32 + π33 .
(5.3.38)
Here both sides represent the probability $P(y_k = 2 \mid y_{k-1} = 2, x) = 1 - \nu$ (where $\nu$ is the transition probability $P(y_k = 1 \mid y_{k-1} = 2, x)$), which now does not depend on $x$. In this case (5.3.33) yields
\[
p_2 = p_3 = \cdots = [W_{k-1}(2) + W_{k-1}(3)](\pi_{22} + \pi_{23}) = \pi_{22} + \pi_{23} = 1 - \nu \tag{5.3.39}
\]
and, consequently, (5.3.37) entails
\[
h_y = P_{st}(S_0)\, h_2(\mu) + [1 - P_{st}(S_0)]\, h_2(\nu) \qquad (\mu = p_1 = \pi_{12} + \pi_{13}). \tag{5.3.40}
\]
Furthermore, due to (5.3.36), (5.3.39) we obtain
\[
P_{st}(S_0) = \left(1 + \frac{p_1}{1 - p_2}\right)^{-1} = \frac{\nu}{\nu + \mu}. \tag{5.3.41}
\]
That is why formula (5.3.40) coincides with (5.2.20), and this is natural because, when condition (5.3.38) is satisfied, process $\{y_k\}$ becomes Markov itself.

In the considered example, the stationary probabilities were concentrated on a countable set of points. Using the terminology of Section 5.2, we can say that the active space $E_a$ of the $W$-process consisted of points $S_0, S_1, S_2, \ldots$ and, consequently, was countable, although the full space $E_a + E_0$ was a continuum. We may expect the situation to be the same in other cases of discrete processes $\{x_k, y_k\}$; that is, the stationary distribution $P_{st}(dW)$ is concentrated on a countable set $E_a$ of points in the quite general case of a finite or countable number of states of the joint process $\{x_k, y_k\}$.
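The agreement between the series (5.3.36), (5.3.37) and the closed forms (5.3.40), (5.3.41) can be checked numerically. The following Python sketch is only an illustration: the values $\mu = 0.3$, $\nu = 0.4$, the truncation length and all function names are assumptions chosen for the example, not part of the text. It takes $p_1 = \mu$ and $p_k = 1 - \nu$ for $k \geqslant 2$, as in (5.3.39).

```python
import math


def h2(p):
    """Binary entropy in nats: -p ln p - (1 - p) ln(1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def truncated_series(mu, nu, n_terms=5000):
    """Evaluate (5.3.36) and (5.3.37) with p_1 = mu and p_k = 1 - nu for k >= 2.

    The infinite sums are cut off after n_terms terms (an assumption of this
    sketch; the products p_1 ... p_k decay geometrically, so the tail is tiny).
    """
    def p(k):
        return mu if k == 1 else 1.0 - nu

    weights = [1.0]  # weights[k] = p_1 ... p_k, weights[0] = 1 corresponds to S_0
    for k in range(1, n_terms + 1):
        weights.append(weights[-1] * p(k))
    norm = sum(weights)
    p_st = [w / norm for w in weights]
    h_y = sum(p_st[k] * h2(p(k + 1)) for k in range(n_terms + 1))
    return p_st[0], h_y


def closed_form(mu, nu):
    """Closed forms (5.3.41) and (5.3.40)."""
    p_st0 = nu / (nu + mu)
    return p_st0, p_st0 * h2(mu) + (1.0 - p_st0) * h2(nu)


mu, nu = 0.3, 0.4
series_p0, series_h = truncated_series(mu, nu)
exact_p0, exact_h = closed_form(mu, nu)
print(series_p0, exact_p0)
print(series_h, exact_h)
```

Both printed pairs should agree to within rounding error, which illustrates that under condition (5.3.38) the general formula reduces to the Markov result.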
5.4 Entropy of Gaussian random variables

1. Let us consider $l$ Gaussian random variables $\xi_1, \ldots, \xi_l$ that are described by the vector of mean values $E[\xi_k] = m_k$ and a non-singular correlation matrix
\[
R_{ij} = E[(\xi_i - m_i)(\xi_j - m_j)]. \tag{5.4.1}
\]
As is known, such random variables have the following joint probability density function:
\[
p(\xi_1, \ldots, \xi_l) = (2\pi)^{-l/2} \det{}^{-1/2}\|R_{ij}\| \exp\left\{-\frac{1}{2} \sum_{i,j=1}^{l} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j)\right\} \tag{5.4.2}
\]
where $\|a_{ij}\| = \|R_{ij}\|^{-1}$ is the inverse of the correlation matrix. In order to determine the entropy of the specified variables, we introduce an auxiliary measure $\nu(d\xi_1, \ldots, d\xi_l)$ satisfying the multiplicativity condition (1.7.9). For simplicity we choose measures $\nu_j(d\xi_j)$ with the same uniform density
\[
\frac{\nu_j(d\xi_j)}{d\xi_j} = v_1 = (v_0)^{1/l}. \tag{5.4.3}
\]
As will be seen further, it is convenient to set
\[
v_1 = (2\pi e)^{-1/2}. \tag{5.4.4}
\]
At the same time, $\nu(d\xi_1, \ldots, d\xi_l) = v_1^l\, d\xi_1 \cdots d\xi_l = (2\pi e)^{-l/2}\, d\xi_1 \cdots d\xi_l$. Taking into account (5.4.2), we can see that if the density $\nu(d\xi_1, \ldots, d\xi_l)/d\xi_1 \cdots d\xi_l$ is constant, then the condition of absolute continuity of measure $P$ with respect to measure $\nu$ is satisfied. Indeed, the equality $\nu(A) = 0$ for some set $A$ of points of the $l$-dimensional real space $\mathbb{R}^l$ means that the $l$-dimensional volume of set $A$ equals zero. But the probability $P(A)$ also equals zero for such sets. This absolute continuity condition would not have been satisfied if the correlation matrix (5.4.1) had been singular. Therefore, the non-singularity condition is essential. For the specified choice of measures, the random entropy (1.6.14) takes the form
\[
H(\xi_1, \ldots, \xi_l) = \frac{l}{2} \ln 2\pi + l \ln v_1 + \frac{1}{2} \ln\det\|R_{ij}\| + \frac{1}{2} \sum_{i,j=1}^{l} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j). \tag{5.4.5}
\]
To calculate entropy (1.6.13), one only needs to average the latter expression. Taking into account (5.4.1), we obtain
\[
H_{\xi_1,\ldots,\xi_l} = \frac{l}{2} \ln 2\pi + l \ln v_1 + \frac{1}{2} \ln\det\|R_{ij}\| + \frac{1}{2} \sum_{i,j=1}^{l} a_{ij} R_{ij}
= \frac{l}{2} \ln 2\pi + l \ln v_1 + \frac{1}{2} \ln\det\|R_{ij}\| + \frac{l}{2}. \tag{5.4.6}
\]
For the choice (5.4.4), the result takes the following simple form:
\[
H_{\xi_1,\ldots,\xi_l} = \frac{1}{2} \ln\det\|R_{ij}\|. \tag{5.4.6a}
\]
Matrix $R = \|R_{ij}\|$ is symmetric. Therefore, as is known, there exists a unitary transformation $U$ that diagonalizes this matrix:
\[
\sum_{i,j} U^*_{ir} R_{ij} U_{js} = \lambda_r \delta_{rs}.
\]
Here $\lambda_r$ are the eigenvalues of the correlation matrix. They satisfy the equation
\[
\sum_{k} R_{jk} U_{kr} = \lambda_r U_{jr}. \tag{5.4.7}
\]
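With the help of these eigenvalues we can rewrite the entropy $H_{\xi_1 \ldots \xi_l}$ as follows:
\[
H_{\xi_1,\ldots,\xi_l} = \frac{1}{2} \sum_{r=1}^{l} \ln \lambda_r. \tag{5.4.8}
\]
This formula can also be reduced to
\[
H_{\xi_1,\ldots,\xi_l} = \frac{1}{2} \operatorname{tr} \ln R. \tag{5.4.8a}
\]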
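As a small numerical illustration of (5.4.6), (5.4.6a) and (5.4.8), the following Python sketch evaluates the three expressions for a $2 \times 2$ correlation matrix. The matrix entries and the function names are assumptions made for the example; the eigenvalues are obtained from the standard closed form for a symmetric $2 \times 2$ matrix.

```python
import math


def entropy_general(det_r, l, v1):
    """Formula (5.4.6): (l/2) ln 2pi + l ln v1 + (1/2) ln det R + l/2."""
    return (0.5 * l * math.log(2.0 * math.pi) + l * math.log(v1)
            + 0.5 * math.log(det_r) + 0.5 * l)


def eigenvalues_sym_2x2(r11, r12, r22):
    """Eigenvalues of the symmetric matrix [[r11, r12], [r12, r22]]."""
    trace = r11 + r22
    det = r11 * r22 - r12 * r12
    root = math.sqrt(trace * trace - 4.0 * det)
    return 0.5 * (trace - root), 0.5 * (trace + root)


# Hypothetical correlation matrix (positive definite).
r11, r12, r22 = 2.0, 0.6, 1.0
det_r = r11 * r22 - r12 * r12

v1 = (2.0 * math.pi * math.e) ** -0.5          # the choice (5.4.4)
h_546 = entropy_general(det_r, 2, v1)           # formula (5.4.6)
h_546a = 0.5 * math.log(det_r)                  # formula (5.4.6a)
h_548 = 0.5 * sum(math.log(lam)                 # formula (5.4.8)
                  for lam in eigenvalues_sym_2x2(r11, r12, r22))
print(h_546, h_546a, h_548)
```

The three values coincide, since the choice (5.4.4) cancels the terms $\frac{l}{2}\ln 2\pi + \frac{l}{2}$ and the determinant equals the product of the eigenvalues.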
The derived result, in particular, allows us to find the conditional entropy $H_{\xi_k \mid \xi_1 \ldots \xi_{k-1}}$ easily. Using the hierarchical property of entropy, we can write
\[
H_{\xi_1,\ldots,\xi_k} = H_{\xi_1} + H_{\xi_2 \mid \xi_1} + \cdots + H_{\xi_k \mid \xi_1,\ldots,\xi_{k-1}},
\]
\[
H_{\xi_1,\ldots,\xi_{k-1}} = H_{\xi_1} + H_{\xi_2 \mid \xi_1} + \cdots + H_{\xi_{k-1} \mid \xi_1,\ldots,\xi_{k-2}}
\]
and, subtracting one expression from the other, we obtain
\[
H_{\xi_k \mid \xi_1,\ldots,\xi_{k-1}} = H_{\xi_1,\ldots,\xi_k} - H_{\xi_1,\ldots,\xi_{k-1}}.
\]
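Each of the two entropies on the right-hand side of this equation can be determined by formula (5.4.8). This leads to the relationship
\[
H_{\xi_k \mid \xi_1,\ldots,\xi_{k-1}} = \frac{1}{2} \operatorname{tr} \ln r^{(k)} - \frac{1}{2} \operatorname{tr} \ln r^{(k-1)} \tag{5.4.9}
\]
where
\[
r^{(k)} = \begin{pmatrix} R_{11} & \ldots & R_{1k} \\ \vdots & \ddots & \vdots \\ R_{k1} & \ldots & R_{kk} \end{pmatrix},
\qquad
r^{(k-1)} = \begin{pmatrix} R_{11} & \ldots & R_{1,k-1} \\ \vdots & \ddots & \vdots \\ R_{k-1,1} & \ldots & R_{k-1,k-1} \end{pmatrix}.
\]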
Similarly, we can determine the entropy $H_{\xi_k \xi_{k-1} \mid \xi_1,\ldots,\xi_{k-2}} = H_{\xi_1,\ldots,\xi_k} - H_{\xi_1,\ldots,\xi_{k-2}}$, and so on.

2. Let us now choose more complicated measures $\nu_k(d\xi_k)$. Specifically, let us assume that they have the densities $N_k q_k(\xi_k)$, where $q_k(\xi_k)$ is a Gaussian density
\[
q_k(\xi_k) = (2\pi \tilde\lambda_k)^{-1/2} \exp\left\{-\frac{1}{2} \frac{(\xi_k - \tilde m_k)^2}{\tilde\lambda_k}\right\}.
\]
Here we assume the multiplicativity condition (1.7.9). In this case, the random entropy (1.6.14) turns out to be equal to
\[
H(\xi_1, \ldots, \xi_l) = \ln N + \frac{1}{2} \ln\det\|R_{ij}\| - \frac{1}{2} \sum_k \ln \tilde\lambda_k - \frac{1}{2} \sum_k \frac{(\xi_k - \tilde m_k)^2}{\tilde\lambda_k} + \frac{1}{2} \sum_{i,j} (\xi_i - m_i)\, a_{ij}\, (\xi_j - m_j), \tag{5.4.10}
\]
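where $N = \prod_k N_k$. Averaging this entropy now yields the relationship
\[
H_{\xi_1,\ldots,\xi_l} = \ln N + \frac{1}{2} \ln\det\|R_{ij}\| - \frac{1}{2} \sum_k \ln \tilde\lambda_k - \frac{1}{2} \sum_k \frac{1}{\tilde\lambda_k}\left[R_{kk} + (m_k - \tilde m_k)^2\right] + \frac{l}{2}. \tag{5.4.11}
\]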
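Introducing the matrix $\tilde R = \|\tilde\lambda_k \delta_{kr}\|$ and employing matrix notation, we can reduce equality (5.4.11) to
\[
H_{\xi_1,\ldots,\xi_l} = \ln N + \frac{1}{2} \operatorname{tr} \ln(\tilde R^{-1} R) - \frac{1}{2} \operatorname{tr} \tilde R^{-1}(R - \tilde R) - \frac{1}{2} (m - \tilde m)^T \tilde R^{-1} (m - \tilde m). \tag{5.4.12}
\]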
Here we have taken into account that $\det \tilde R^{-1} \det R = \det \tilde R^{-1} R$; $\operatorname{tr} \tilde R^{-1} \tilde R = \operatorname{tr} I = l$; $m - \tilde m$ is a column-matrix; $T$ denotes transposition. Comparing (5.4.12) with (1.6.16), it is easy to see that we have thereby found the entropy $H^{P/Q}_{\xi_1 \ldots \xi_l}$ of distribution $P$ with respect to distribution $Q$. That entropy turns out to be equal to
\[
H^{P/Q}_{\xi_1,\ldots,\xi_l} = \operatorname{tr} G(\tilde R^{-1} R) + \frac{1}{2} (m - \tilde m)^T \tilde R^{-1} (m - \tilde m) \tag{5.4.13}
\]
where $G(x) = (x - 1 - \ln x)/2$. The latter formula has been derived under the assumption of multiplicativity (1.7.9) of measure $\nu(d\xi_1, \ldots, d\xi_l)$ and, consequently, multiplicativity of $Q(d\xi_1, \ldots, d\xi_l)$. However, it can easily be extended to a more general case. Let measure $Q$ be Gaussian and be defined by the vector $\tilde m_k = E_Q[\xi_k]$ and the correlation matrix $\tilde R_{kr}$, which is not necessarily diagonal. Then entropy $H^{P/Q}$ is invariant with respect to orthogonal transformations (and, more generally, with respect to non-singular linear transformations) of the $l$-dimensional real space $\mathbb{R}^l$. By performing a rotation, we can diagonalize matrix $\tilde R$ and afterwards apply formula (5.4.13). However, formula (5.4.13) is already invariant with respect to non-singular linear transformations, and thereby it is valid not only in the case of a diagonal matrix $\tilde R$ but also in the case of a non-diagonal matrix $\tilde R$. Indeed, for the linear transformation $\xi'_k = \sum_r C_{kr} \xi_r$ (i.e. $\xi' = C\xi$) the following transformations take place: $m' - \tilde m' = C(m - \tilde m)$, $R' = C R C^T$, $\tilde R' = C \tilde R C^T$. Consequently, $(\tilde R')^{-1} = (C^T)^{-1} \tilde R^{-1} C^{-1}$ and $(\tilde R')^{-1} R' = (C^T)^{-1} \tilde R^{-1} R\, C^T$ hold true as well. That is why the combinations $\operatorname{tr} f(\tilde R^{-1} R)$ and $(m - \tilde m)^T \tilde R^{-1} (m - \tilde m)$ remain invariant. This proves that formula (5.4.13) remains valid not only in the multiplicative case (1.7.9), but also in the more general case when formulae (5.4.8), (5.4.9), (5.4.10) may be invalid.

3. Concluding this section, let us calculate the variance of entropy, which is useful to know when studying entropic stability (see Section 1.5) of the family of Gaussian random variables. We begin by considering random entropy (5.4.5). Subtracting (5.4.6), we find the random deviation
\[
H(\xi_1, \ldots, \xi_l) - H_{\xi_1,\ldots,\xi_l} = \frac{1}{2} \sum_{j,k} a_{jk} (\eta_j \eta_k - R_{jk}) = \frac{1}{2} \eta^T a \eta - \frac{1}{2} \operatorname{tr} 1 = \frac{1}{2} \eta^T a \eta - \frac{l}{2}. \tag{5.4.14}
\]
The mean square of this random deviation coincides with the desired variance. Here we have denoted $\eta_j = \xi_j - m_j$. When averaging the square of the given expression, one needs to take into account that
\[
E[\eta_j \eta_k \eta_r \eta_s] = R_{jk} R_{rs} + R_{jr} R_{ks} + R_{js} R_{kr} \tag{5.4.15}
\]
according to well-known properties of Gaussian random variables. Therefore, we have
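\[
\operatorname{Var} H(\xi_1, \ldots, \xi_l) = E\left[\left(H(\xi_1, \ldots, \xi_l) - H_{\xi_1,\ldots,\xi_l}\right)^2\right] = \frac{1}{4} \sum_{j,k,r,s} a_{jk} a_{rs} (R_{jr} R_{ks} + R_{js} R_{kr}) = \frac{1}{2} \operatorname{tr} aRaR. \tag{5.4.16}
\]
That is,
\[
\operatorname{Var} H(\xi_1, \ldots, \xi_l) = \frac{1}{2} \operatorname{tr} 1 = \frac{l}{2}.
\]
Proceeding to random entropy (5.4.10), we have
\[
H(\xi_1, \ldots, \xi_l) - H_{\xi_1,\ldots,\xi_l} = \frac{1}{2} \eta^T (a - \tilde R^{-1}) \eta - \frac{1}{2} \operatorname{tr}(a - \tilde R^{-1}) R - \frac{1}{2} (m - \tilde m)^T \tilde R^{-1} \eta.
\]
In turn, using (5.4.15) we obtain
\[
\operatorname{Var} H(\xi_1, \ldots, \xi_l) = \frac{1}{2} \operatorname{tr}\left[\left((a - \tilde R^{-1}) R\right)^2\right] + \frac{1}{4} (m - \tilde m)^T \tilde R^{-1} R \tilde R^{-1} (m - \tilde m)
= \frac{1}{2} \operatorname{tr}(1 - \tilde R^{-1} R)^2 + \frac{1}{4} (m - \tilde m)^T \tilde R^{-1} R \tilde R^{-1} (m - \tilde m). \tag{5.4.17}
\]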
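The simpler result (5.4.16) can be verified directly from the moment formula (5.4.15). The Python sketch below is an illustration only; the $2 \times 2$ matrix and the helper names are assumptions. It evaluates the fourfold sum $\frac{1}{4} \sum a_{jk} a_{rs} (R_{jr} R_{ks} + R_{js} R_{kr})$ with $a = R^{-1}$ and compares it with $l/2$.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def entropy_variance(r):
    """Fourfold sum from (5.4.16) for a 2x2 correlation matrix r."""
    a = inverse_2x2(r)
    total = 0.0
    for j in range(2):
        for k in range(2):
            for p in range(2):
                for s in range(2):
                    total += a[j][k] * a[p][s] * (
                        r[j][p] * r[k][s] + r[j][s] * r[k][p])
    return total / 4.0


# Hypothetical correlation matrix; the result should not depend on its entries.
r = [[2.0, 0.6], [0.6, 1.0]]
variance = entropy_variance(r)
print(variance)  # l/2 with l = 2
```

The value does not depend on the particular matrix: each of the two terms in the sum equals $\operatorname{tr}(aRaR) = \operatorname{tr} 1 = l$.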
It is also not difficult to compute other statistical characteristics of the random entropy of Gaussian variables, in particular, its characteristic potential
\[
\mu_0(s) = \ln E\left[e^{sH(\xi_1,\ldots,\xi_l)}\right]
\]
(see (1.5.15), (4.1.18)). Thus, substituting (5.4.5) into the last formula and taking into consideration the form (5.4.2) of the probability density function, we obtain
\[
\mu_0(s) = \ln\left\{(2\pi)^{-l/2} (\det R)^{-1/2} \exp\left[sH_{\xi_1,\ldots,\xi_l} - \frac{sl}{2}\right] \int \exp\left[-\frac{1-s}{2} \sum_{i,j} \eta_i a_{ij} \eta_j\right] d\eta_1 \cdots d\eta_l\right\}
= sH_{\xi_1,\ldots,\xi_l} - \frac{sl}{2} + \ln\det{}^{-1/2}[(1-s)a] + \ln\det{}^{-1/2} R \tag{5.4.18}
\]
($\eta_i = \xi_i - m_i$). Here we used the formula
\[
\int \exp\left\{-\frac{1}{2} \sum_{i,j} \eta_i d_{ij} \eta_j\right\} d\eta_1 \cdots d\eta_l = \det{}^{-1/2}\left\|\frac{d_{ij}}{2\pi}\right\| \tag{5.4.19}
\]
which is valid for every non-singular positive definite matrix $\|d_{ij}\|$. Since $a = R^{-1}$, the terms with $\ln\det^{-1/2} R$ cancel out in (5.4.18) and we obtain
\[
\mu_0(s) = -\frac{l}{2} \ln(1 - s) - \frac{sl}{2} + sH_{\xi_1,\ldots,\xi_l}
\]
which holds true for $s < 1$. In particular, this result can be used to derive formula (5.4.16).
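The last remark can be illustrated numerically. Since $\mu_0(s)$ is the logarithm of the moment generating function of the random entropy, its first and second derivatives at $s = 0$ give the mean and the variance. The Python sketch below uses central finite differences; the values of $l$ and of the mean entropy, the step size and the function names are assumptions made for the example.

```python
import math


def mu0(s, l, h_mean):
    """Characteristic potential: -(l/2) ln(1 - s) - s l / 2 + s H, for s < 1."""
    if s >= 1.0:
        raise ValueError("the potential is defined only for s < 1")
    return -0.5 * l * math.log(1.0 - s) - 0.5 * s * l + s * h_mean


def derivatives_at_zero(l, h_mean, step=1e-4):
    """Central finite-difference estimates of the first two derivatives at s = 0."""
    f_plus = mu0(step, l, h_mean)
    f_zero = mu0(0.0, l, h_mean)
    f_minus = mu0(-step, l, h_mean)
    first = (f_plus - f_minus) / (2.0 * step)
    second = (f_plus - 2.0 * f_zero + f_minus) / step ** 2
    return first, second


l, h_mean = 3, 0.7          # hypothetical dimension and mean entropy
mean_estimate, variance_estimate = derivatives_at_zero(l, h_mean)
print(mean_estimate, variance_estimate)  # close to h_mean and l/2
```

The second derivative reproduces $\operatorname{Var} H = l/2$ from (5.4.16), up to the discretization error of the finite differences.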
5.5 Entropy of a stationary sequence. Gaussian sequence

1. In Section 5.1 we considered the entropy of a segment of a stationary process $\{\xi_k\}$ in discrete time, i.e. of a stationary sequence. There we assumed that each element of the sequence was a random variable itself. The generalization of the notion of entropy given in Section 1.6 allows us to consider the entropy of a stationary sequence that consists of arbitrary random variables (including continuous random variables), thereby generalizing the results of Section 5.1. If the auxiliary measure $\nu$ satisfies the multiplicativity condition (1.7.9), then (as was shown before) conditional entropies in the generalized version possess all the properties that conditional entropies in the discrete version possess. These properties (and essentially only they) were used in Section 5.1. That is why everything said in Section 5.1 carries over to arbitrary random variables and to entropy in the generalized sense. Measure $\nu$ is assumed to be multiplicative:
\[
\nu(d\xi_{k_1}, \ldots, d\xi_{k_r}) = \prod_{i=1}^{r} \nu_{k_i}(d\xi_{k_i}). \tag{5.5.1}
\]
At the same time, the 'elementary' measures $\nu_k$ are assumed to be identical in view of stationarity. We also assume that the condition of absolute continuity of probability measure $P(d\xi_k)$ with respect to $\nu_k(d\xi_k)$ is satisfied. Process $\{\xi_k\}$ is stationary with respect to distribution $P$, i.e. a condition of type (5.1.1) is valid for all $k_1, \ldots, k_r, a$. The entropy rate $H_1$ is introduced by formula (5.1.3). However, $H_{\xi_k \mid \xi_{k-l},\ldots,\xi_{k-1}}$ in this case should be understood as entropy (1.7.13). Hence, the mentioned definition corresponds to the formula
\[
H_1 = -\int \ln \frac{P(d\xi_k \mid \xi_{k-1}, \xi_{k-2}, \ldots)}{\nu_k(d\xi_k)}\, P(d\xi_k \mid \xi_{k-1}, \xi_{k-2}, \ldots). \tag{5.5.2}
\]
Theorem 5.1 is also valid in the generalized version and can be proven in the same way. Now it means the equality
\[
H_1 = -\lim_{l \to \infty} \frac{1}{l} \int \cdots \int \ln \frac{P(d\xi_1, \ldots, d\xi_l)}{\nu_1(d\xi_1) \cdots \nu_l(d\xi_l)}\, P(d\xi_1, \ldots, d\xi_l). \tag{5.5.3}
\]
Further, similarly to (5.1.11), (5.1.12) we can introduce the auxiliary variable
\[
2\Gamma = \lim_{n \to \infty} \left[H_{\xi_1,\ldots,\xi_n} - nH_1\right] = \sum_{j=0}^{\infty} \left[H_{\xi_{j+1} \mid \xi_j \ldots \xi_1} - H_1\right] \tag{5.5.4}
\]
which is non-negative because
\[
H_{\xi_{j+1} \mid \xi_j \ldots \xi_1} \geqslant H_{\xi_{j+1} \mid \xi_j \xi_{j-1} \ldots} = H_1 \tag{5.5.5}
\]
holds true. This variable can be interpreted as the entropy of the ends of the segment of the sequence under consideration.

Besides the aforementioned variables and relationships based on the definition of entropy (1.6.13), we can also consider analogous variables and relationships based on definition (1.6.17). Namely, similarly to (5.5.2), (5.5.4) we can introduce
\[
H_1^{P/Q} = H^{P/Q}_{\xi_k \mid \xi_{k-1} \xi_{k-2} \ldots} = \int \ln \frac{P(d\xi_k \mid \xi_{k-1}, \xi_{k-2}, \ldots)}{Q(d\xi_k)}\, P(d\xi_k \mid \xi_{k-1}, \xi_{k-2}, \ldots), \tag{5.5.6}
\]
\[
2\Gamma^{P/Q} = \sum_{j=0}^{\infty} \left[H^{P/Q}_{\xi_{j+1} \mid \xi_j \ldots \xi_1} - H_1^{P/Q}\right]. \tag{5.5.7}
\]
However, due to (1.7.19) the inequality
\[
H^{P/Q}_{\xi_{j+1} \mid \xi_j \ldots \xi_1} \leqslant H^{P/Q}_{\xi_{j+1} \mid \xi_j \ldots \xi_1, \ldots}
\]
(inverse to inequality (5.5.5)) takes place for entropy $H^{P/Q}$. Therefore, the 'entropy of the end' $\Gamma^{P/Q}$ has to be non-positive.

If we use expressions of types (1.7.13), (5.5.2) for conditional entropies, then it is easy to see that the difference $H_{\xi_{j+1} \mid \xi_1,\ldots,\xi_j} - H_1$ is independent of $\nu_k$. Consequently, the boundary entropy $\Gamma$ does not depend on $\nu_k$. Analogously, $\Gamma^{P/Q}$ is independent of $Q$ (if the multiplicativity condition is satisfied), and the equation $\Gamma^{P/Q} = -\Gamma$ is valid. This is useful to keep in mind when writing down formula (5.1.13) for both entropies. That formula takes the form
\[
H_{\xi_1,\ldots,\xi_l} = lH_1 + 2\Gamma + o_l(1), \qquad
H^{P/Q}_{\xi_1 \ldots \xi_l} = lH_1^{P/Q} - 2\Gamma + o_l(1). \tag{5.5.7a}
\]
These relationships allow us to find the entropy of a segment of a stationary process more precisely than by simply multiplying the entropy rate $H_1$ by the interval length $l$.

2. Now we find the entropy rate $H_1$ in the case of a stationary Gaussian sequence.

a) First, we assume that a stationary sequence $\xi_1, \ldots, \xi_l$ is given on a circle such that the distribution of the sequence is invariant with respect to rotations of the circle. In this case the elements of the correlation matrix $R_{jk}$ depend only on the difference $j - k$ and satisfy the periodicity condition: $R_{jk} = R_{j-k}$, $R_{j-k+l} = R_{j-k}$. Then equation (5.4.7) has the following solutions:
\[
U_{jr} = \frac{1}{\sqrt{l}}\, e^{2\pi i jr/l}, \tag{5.5.8}
\]
\[
\lambda_r = \sum_{s=0}^{l-1} R_s\, e^{-2\pi i sr/l}, \qquad r = 0, 1, \ldots, l - 1. \tag{5.5.9}
\]
Indeed, substitution of (5.5.8) into (5.4.7) yields
\[
\sum_{k} R_{j-k}\, e^{2\pi i kr/l} = \lambda_r\, e^{2\pi i jr/l},
\]
which is satisfied due to (5.5.9). So, (5.5.8) defines the transformation that diagonalizes the correlation matrix $R_{j-k}$. Its unitarity is easy to check. Indeed, the Hermitian conjugate operator
\[
U^+ = \left\|U^*_{jr}\right\| = \left\|\frac{1}{\sqrt{l}}\, e^{-2\pi i jr/l}\right\|
\]
coincides with the inverse operator $U^{-1}$ as a consequence of the equalities
\[
\sum_{r} U_{jr} U^*_{kr} = \frac{1}{l} \sum_{r=1}^{l} e^{2\pi i (j-k)r/l} = \frac{\varepsilon}{l} \sum_{r=0}^{l-1} \varepsilon^r = \frac{\varepsilon}{l}\, \frac{1 - \varepsilon^l}{1 - \varepsilon} = \delta_{jk}
\qquad (\varepsilon = e^{2\pi i (j-k)/l}).
\]
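Relations (5.5.8), (5.5.9) and the unitarity of $U$ can be checked numerically for a small circulant correlation matrix. The Python sketch below is only an illustration: the correlation values are assumed, indices run from $0$ to $l - 1$ rather than from $1$ to $l$ (which changes only an irrelevant phase factor), and all function names are hypothetical.

```python
import cmath
import math


def circulant_spectrum(corr):
    """Eigenvalues (5.5.9) and eigenvector matrix (5.5.8) for R_jk = corr[(j - k) mod l]."""
    l = len(corr)
    lambdas = [sum(corr[s] * cmath.exp(-2j * math.pi * s * r / l) for s in range(l))
               for r in range(l)]
    u = [[cmath.exp(2j * math.pi * j * r / l) / math.sqrt(l) for r in range(l)]
         for j in range(l)]
    return lambdas, u


def max_eigen_residual(corr, lambdas, u):
    """Largest violation of sum_k R_{j-k} U_kr = lambda_r U_jr, equation (5.4.7)."""
    l = len(corr)
    worst = 0.0
    for r in range(l):
        for j in range(l):
            lhs = sum(corr[(j - k) % l] * u[k][r] for k in range(l))
            worst = max(worst, abs(lhs - lambdas[r] * u[j][r]))
    return worst


def max_unitarity_error(u):
    """Largest violation of sum_r U_jr U*_kr = delta_jk."""
    l = len(u)
    worst = 0.0
    for j in range(l):
        for k in range(l):
            total = sum(u[j][r] * u[k][r].conjugate() for r in range(l))
            worst = max(worst, abs(total - (1.0 if j == k else 0.0)))
    return worst


# Hypothetical periodic correlation function with R_s = R_{l-s} (symmetric matrix).
corr = [2.0, 0.5, 0.1, 0.5]
lambdas, u = circulant_spectrum(corr)
residual = max_eigen_residual(corr, lambdas, u)
unitarity_error = max_unitarity_error(u)
entropy = 0.5 * sum(math.log(lam.real) for lam in lambdas)   # formula (5.4.8)
print([round(lam.real, 6) for lam in lambdas], residual, unitarity_error, entropy)
```

For this choice the eigenvalues are real and positive, so formula (5.4.8) applies directly.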
After the computation of the eigenvalues (5.5.9), one can apply formula (5.4.8) to obtain the entropy $H_{\xi_1,\ldots,\xi_l}$.

In the considered case of invariance with respect to rotations it is also easy to calculate entropy (5.4.13). Certainly, it is assumed that not only measure $P$ but also measure $Q$ possesses the described property of symmetry ('circular stationarity'). Consequently, the correlation matrix $\tilde R_{jk}$ of the latter measure has the same properties that $R_{jk}$ has. Besides, the mean values $m_k$, $\tilde m_k$ are constant for both measures (they are equal to $m$ and $\tilde m$, respectively). The unitary transformation $U = \|U_{jr}\|$ diagonalizes not only matrix $R$, but also matrix $\tilde R$ (even if the multiplicativity condition does not hold). Furthermore, similarly to (5.5.9), the eigenvalues of $\tilde R$ have the form
\[
\tilde\lambda_r = \sum_{s=0}^{l-1} \tilde R_s\, e^{-2\pi i sr/l}, \qquad r = 0, 1, \ldots, l - 1. \tag{5.5.10}
\]
This transformation turns vector $m - \tilde m$ into the vector
\[
U^+(m - \tilde m) = \frac{1}{\sqrt{l}} \begin{pmatrix} \sum_{j=1}^{l} e^{-2\pi i 0 j/l} (m_j - \tilde m_j) \\ \vdots \\ \sum_{j=1}^{l} e^{-2\pi i (l-1) j/l} (m_j - \tilde m_j) \end{pmatrix}
= \frac{m - \tilde m}{\sqrt{l}} \begin{pmatrix} \sum_{j=1}^{l} e^{-2\pi i 0 j/l} \\ \vdots \\ \sum_{j=1}^{l} e^{-2\pi i (l-1) j/l} \end{pmatrix}
= \begin{pmatrix} \sqrt{l}\,(m - \tilde m) \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
\]
As a result, due to formula (5.4.13) we obtain
\[
H^{P/Q}_{\xi_1 \ldots \xi_l} = \sum_{r=0}^{l-1} G\left(\frac{\lambda_r}{\tilde\lambda_r}\right) + \frac{l}{2}\, \frac{(m - \tilde m)^2}{\tilde\lambda_0}, \tag{5.5.11}
\]
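where $G(x) = \frac{1}{2}(x - 1 - \ln x)$. Consider the entropy per element of the sequence. According to (5.4.8), (5.5.11) we obtain
\[
H_1 = \frac{1}{2l} \sum_{r=0}^{l-1} \ln \lambda_r, \tag{5.5.12}
\]
\[
H_1^{P/Q} = \frac{1}{l} \sum_{r=0}^{l-1} G\left(\frac{\lambda_r}{\tilde\lambda_r}\right) + \frac{1}{2}\, \frac{(m - \tilde m)^2}{\tilde\lambda_0}. \tag{5.5.13}
\]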
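Formula (5.5.11) can be compared with a direct evaluation of (5.4.13) in the smallest non-trivial case $l = 2$. The Python sketch below is an illustration under stated assumptions: the numerical values and function names are hypothetical, and the direct evaluation uses the identity $\operatorname{tr} G(M) = \frac{1}{2}(\operatorname{tr} M - l - \ln\det M)$, which follows from $\operatorname{tr} \ln M = \ln\det M$.

```python
import math


def g(x):
    """G(x) = (x - 1 - ln x) / 2."""
    return 0.5 * (x - 1.0 - math.log(x))


def relative_entropy_direct(r0, r1, q0, q1, m, m_q):
    """Formula (5.4.13) for R = [[r0, r1], [r1, r0]] and Q-matrix [[q0, q1], [q1, q0]]."""
    det_q = q0 * q0 - q1 * q1
    inv_q = [[q0 / det_q, -q1 / det_q], [-q1 / det_q, q0 / det_q]]
    r = [[r0, r1], [r1, r0]]
    prod = [[sum(inv_q[i][k] * r[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
    trace = prod[0][0] + prod[1][1]
    det = prod[0][0] * prod[1][1] - prod[0][1] * prod[1][0]
    d = [m - m_q, m - m_q]                      # constant mean difference
    quad = sum(d[i] * inv_q[i][j] * d[j] for i in range(2) for j in range(2))
    return 0.5 * (trace - 2.0 - math.log(det)) + 0.5 * quad


def relative_entropy_spectral(r0, r1, q0, q1, m, m_q):
    """Formula (5.5.11) with eigenvalues (5.5.9), (5.5.10) for l = 2."""
    lam = [r0 + r1, r0 - r1]
    lam_q = [q0 + q1, q0 - q1]
    return sum(g(a / b) for a, b in zip(lam, lam_q)) + 0.5 * 2 * (m - m_q) ** 2 / lam_q[0]


args = (2.0, 0.5, 1.5, -0.3, 1.0, 0.2)   # hypothetical R_0, R_1, Q-values, m, m of Q
direct = relative_entropy_direct(*args)
spectral = relative_entropy_spectral(*args)
print(direct, spectral)
```

The two numbers coincide because both matrices are diagonalized by the same transformation (5.5.8).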
b) Now, let $\xi_1, \ldots, \xi_l$ constitute a segment of a stationary sequence such that $R_{jk} = R_{j-k}$ for $j, k = 1, \ldots, l$, but the periodicity condition $R_{j-k+l} = R_{j-k}$ is not satisfied. Then each of the two ends of the segment plays some role and contributes $\Gamma$ to the aggregate entropy (5.5.7a). If we neglect this contribution, then we can connect the ends of the segment and reduce this case to the case of circular symmetry covered before. In so doing, it is necessary to form a new correlation function
\[
\bar R_{j,j+s} = \bar R_s = \sum_{n=-\infty}^{\infty} R_{s+nl} \tag{5.5.14}
\]
which already possesses the periodicity property. If $R_s$ differs appreciably from zero only for $s \ll l$, then the supplementary terms in (5.5.14) will noticeably affect only a small number of elements of the correlation matrix $\bar R_{jk}$, namely those situated in the corners where $j \ll l$, $l - k \ll l$ or $l - j \ll l$, $k \ll l$.

After the transition to the correlation matrix (5.5.14) (and, if needed, after the analogous transition for the second matrix $\tilde R$) we can use formulae (5.5.12), (5.5.13), (5.5.9), (5.5.10) derived before. Taking into account (5.5.9) we will have
\[
\bar\lambda_r = \sum_{n=-\infty}^{\infty} \sum_{s=0}^{l-1} R_{s+nl}\, e^{-2\pi i sr/l} = \sum_{\sigma=-\infty}^{\infty} R_\sigma\, e^{-2\pi i \sigma r/l}
\]
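for the eigenvalues or, equivalently, $\bar\lambda_r = \varphi(r/l)$ if we introduce the notation
\[
\varphi(\mu) = \sum_{\sigma=-\infty}^{\infty} R_\sigma\, e^{-2\pi i \mu\sigma}. \tag{5.5.15}
\]
Analogously,
\[
\bar{\tilde\lambda}_r = \tilde\varphi\left(\frac{r}{l}\right) = \sum_{\sigma=-\infty}^{\infty} e^{-2\pi i \sigma r/l}\, \tilde R_\sigma.
\]
Furthermore, formulae (5.5.12), (5.5.13) take the form
\[
\bar H_1 = \frac{1}{2l} \sum_{r=0}^{l-1} \ln \varphi\left(\frac{r}{l}\right), \qquad
\bar H_1^{P/Q} = \frac{1}{l} \sum_{r=0}^{l-1} G\left(\frac{\varphi(r/l)}{\tilde\varphi(r/l)}\right) + \frac{1}{2}\, \frac{(m - \tilde m)^2}{\tilde\varphi(0)}. \tag{5.5.16}
\]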
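Next we increase the length $l$ of the chosen segment of the sequence. Passing to the limit $l \to \infty$ in the last formulae, we turn the sums into the integrals
\[
H_1 = \frac{1}{2} \int_0^1 \ln \varphi(\mu)\, d\mu = \frac{1}{2} \int_{-1/2}^{1/2} \ln \varphi(\mu)\, d\mu, \tag{5.5.17}
\]
\[
H_1^{P/Q} = \int_0^1 G\left(\frac{\varphi(\mu)}{\tilde\varphi(\mu)}\right) d\mu + \frac{1}{2}\, \frac{(m - \tilde m)^2}{\tilde\varphi(0)}
= \int_{-1/2}^{1/2} G\left(\frac{\varphi(\mu)}{\tilde\varphi(\mu)}\right) d\mu + \frac{(m - \tilde m)^2}{2\tilde\varphi(0)}. \tag{5.5.18}
\]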
Here, when changing the limits of integration, we use the property $\varphi(\mu + 1) = \varphi(\mu)$ that follows from (5.5.15).

For large $l$ the endpoints of the segment have a relatively small impact in comparison with the large full entropy, which is of order $lH_1$. Passage (5.5.14) to the correlation function $\bar R_s$ changes entropy $H_{\xi_1,\ldots,\xi_l}$ by some amount that does not increase as $l$ grows. Thus, the following limits are equal:
\[
\lim_{l \to \infty} \frac{1}{l} \bar H_{\xi_1,\ldots,\xi_l} = \lim_{l \to \infty} \frac{1}{l} H_{\xi_1,\ldots,\xi_l}.
\]
Here $H$ and $\bar H$ correspond to the correlation functions $R_s$ and $\bar R_s$, respectively. Therefore, expressions (5.5.17), (5.5.18) coincide with the above-defined entropy rates (5.5.3), (5.5.6).

Hence, we have determined the entropy rates for the case of stationary Gaussian sequences. These entropies turn out to be expressed in terms of the spectral densities $\varphi(\mu)$, $\tilde\varphi(\mu)$. The condition of absolute continuity of measure $P$ with respect to measure $Q$ takes the form of the condition of integrability of the function $G(\varphi(\mu)/\tilde\varphi(\mu))$ involved in (5.5.18).
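Formulae (5.5.17) and (5.5.7a) can be illustrated on a simple example that is not taken from the text: a Gaussian sequence with the assumed correlation function $R_\sigma = \rho^{|\sigma|}$, $|\rho| < 1$, for which (5.5.15) sums to $\varphi(\mu) = (1 - \rho^2)/(1 - 2\rho\cos 2\pi\mu + \rho^2)$. The Python sketch below (function names and the value $\rho = 0.6$ are assumptions) integrates $\frac{1}{2}\ln\varphi$ by the midpoint rule and compares the result with the segment entropy $\frac{1}{2}\ln\det R$ from (5.4.6a).

```python
import math


def spectral_density(mu, rho):
    """phi(mu) from (5.5.15) for the assumed correlation R_sigma = rho**abs(sigma)."""
    return (1.0 - rho * rho) / (1.0 - 2.0 * rho * math.cos(2.0 * math.pi * mu) + rho * rho)


def entropy_rate(rho, n=2000):
    """Midpoint-rule approximation of (5.5.17) on the interval [-1/2, 1/2]."""
    step = 1.0 / n
    total = sum(math.log(spectral_density(-0.5 + (i + 0.5) * step, rho)) for i in range(n))
    return 0.5 * total * step


def log_det(matrix):
    """ln det of a positive definite matrix by Gaussian elimination without pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    total = 0.0
    for i in range(n):
        total += math.log(a[i][i])
        for j in range(i + 1, n):
            factor = a[j][i] / a[i][i]
            for k in range(i, n):
                a[j][k] -= factor * a[i][k]
    return total


rho, length = 0.6, 12
h1 = entropy_rate(rho)
toeplitz = [[rho ** abs(j - k) for k in range(length)] for j in range(length)]
h_segment = 0.5 * log_det(toeplitz)          # formula (5.4.6a)
two_gamma = h_segment - length * h1          # boundary term in (5.5.7a)
print(h1, h_segment, two_gamma)
```

In this example $H_1 = \frac{1}{2}\ln(1 - \rho^2)$, which is negative because of the normalization (5.4.4) of the auxiliary measure, and the difference $H_{\xi_1,\ldots,\xi_l} - lH_1$ equals $-\frac{1}{2}\ln(1 - \rho^2) \geqslant 0$ for every $l$, in agreement with the non-negativity of $\Gamma$.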
Formula (5.5.18) is valid not only when the multiplicativity condition (5.5.1) is satisfied. For the stationary Gaussian case this condition means that matrix $\tilde R_{jk}$ is a multiple of the unit matrix: $\tilde R_{jk} = \tilde\varphi\, \delta_{jk}$, $\tilde\varphi(\mu) = \tilde\varphi = \text{const}$. Then formula (5.5.18) yields
\[
H_1^{P/Q} = \int_0^1 G\left(\frac{\varphi(\mu)}{\tilde\varphi}\right) d\mu + \frac{1}{2}\, \frac{(m - \tilde m)^2}{\tilde\varphi}.
\]
The provided results can also be generalized to the case when there is not just one random sequence $\{\ldots, \xi_1, \xi_2, \ldots\}$ but several ($r$) stationary and jointly stationary sequences $\{\ldots, \xi_1^\alpha, \xi_2^\alpha, \ldots\}$, $\alpha = 1, \ldots, r$, described by the correlation matrix
\[
\left\|R^{\alpha\beta}_{j-k}\right\|, \qquad \alpha, \beta = 1, \ldots, r,
\]
or by the matrix of spectral functions $\varphi(\mu) = \|\varphi^{\alpha\beta}(\mu)\|$, $\varphi^{\alpha\beta}(\mu) = \sum_{\sigma=-\infty}^{\infty} R^{\alpha\beta}_\sigma e^{-2\pi i \mu\sigma}$, and the column-vector (by index $\alpha$) of mean values $m = \|m^\alpha\|$. Now formula (5.5.17) is replaced by the matrix generalization
\[
H_1 = \frac{1}{2} \int_{-1/2}^{1/2} \operatorname{tr}[\ln \varphi(\mu)]\, d\mu = \frac{1}{2} \int_{-1/2}^{1/2} \ln[\det \varphi(\mu)]\, d\mu. \tag{5.5.19}
\]
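Let measure $Q$ be expressed via the matrix of spectral functions $\tilde\varphi(\mu) = \|\tilde\varphi^{\alpha\beta}(\mu)\|$ and the mean values $\tilde m = \|\tilde m^\alpha\|$. Then instead of (5.5.18) we will have the analogous matrix formula
\[
H_1^{P/Q} = \int_{-1/2}^{1/2} \operatorname{tr} G\left(\tilde\varphi^{-1}(\mu)\varphi(\mu)\right) d\mu + \frac{1}{2} (m - \tilde m)^T \tilde\varphi^{-1}(0) (m - \tilde m). \tag{5.5.20}
\]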
Certainly, the presented results follow from formulae (5.4.6a), (5.4.13). In form they represent the synthesis of (5.4.6a), (5.4.13) and (5.5.17), (5.5.18).

3. The obtained results allow us to draw a conclusion about entropic stability (see Section 1.5) of a family of random variables $\{\xi^l\}$, where $\xi^l = \{\xi_1, \ldots, \xi_l\}$ is a segment of a stationary Gaussian sequence. Entropy $H_{\xi_1,\ldots,\xi_l}$ increases approximately linearly as $l$ grows. According to (5.4.16), the variance of entropy also grows linearly. That is why the ratio $\operatorname{Var}[H(\xi_1, \ldots, \xi_l)]/H^2_{\xi_1,\ldots,\xi_l}$ converges to zero, so that the condition of entropic stability (1.5.8) for entropy (5.4.5) is satisfied.

Next we move to entropy (5.4.10). The conditions
\[
\frac{\operatorname{Var} H(\xi_1, \ldots, \xi_l)}{H^2_{\xi_1,\ldots,\xi_l}} \to 0, \qquad
\frac{\operatorname{Var} H^{P/Q}(\xi_1, \ldots, \xi_l)}{\left(H^{P/Q}_{\xi_1,\ldots,\xi_l}\right)^2} \to 0
\]
will be satisfied for it if the variance $\operatorname{Var}[H(\xi_1, \ldots, \xi_l)] = \operatorname{Var}[H^{P/Q}(\xi_1, \ldots, \xi_l)]$ (defined by formula (5.4.17)) increases approximately linearly as $l$ grows, i.e. if there exists the finite limit
\[
D_1 = \lim_{l \to \infty} \frac{1}{l} \operatorname{Var} H^{P/Q}(\xi_1, \ldots, \xi_l). \tag{5.5.21}
\]
In order to determine limit (5.5.21) of expression (5.4.17), we can apply the same methods as we used for calculating the limit $\lim_{l \to \infty} \frac{1}{l} H^{P/Q}_{\xi_1,\ldots,\xi_l}$ corresponding to expression (5.4.13). Similarly to the way (5.5.18) was obtained from (5.4.13), formula (5.4.17) can be used to find limit (5.5.21) as follows:
\[
D_1 = \frac{1}{2} \int_{-1/2}^{1/2} \left[1 - \frac{\varphi(\mu)}{\tilde\varphi(\mu)}\right]^2 d\mu + \frac{(m - \tilde m)^2\, \varphi(0)}{4\tilde\varphi^2(0)}. \tag{5.5.22}
\]
5.6 Entropy of stochastic processes in continuous time. General concepts and relations

1. The generalized definition of entropy, given in Section 1.6, allows us to calculate the entropy of stochastic processes $\{\xi_t\}$ dependent on a continuous (time) parameter $t$. We assume that process $\{\xi_t\}$ is given on some interval $a \leqslant t \leqslant b$. Consider an arbitrary subinterval $\alpha \leqslant t \leqslant \beta$ lying within the feasible interval of the process. We use the notation $\xi_\alpha^\beta = \{\xi_t,\ \alpha \leqslant t \leqslant \beta\}$ for it. Therefore, $\xi_\alpha^\beta$ denotes the set of values of process $\{\xi_t\}$ on subinterval $[\alpha, \beta]$.

The initial process $\{\xi_t\}$ is described by probability measure $P$. According to the definition of entropy given in Section 1.6, in order to determine entropy $H_{\xi_\alpha^\beta}$ for any distinct intervals $[\alpha, \beta]$ we need to introduce an auxiliary non-normalized measure $\nu$ or the corresponding probability measure $Q$. Measure $\nu$ (or $Q$) has to be defined on the same measurable space, i.e. on the same field of events related to the behaviour of process $\xi(t)$ on the entire interval $[a, b]$. The process $\{\xi(t)\}$ having probabilities $Q$ can be interpreted as a new auxiliary stochastic process $\{\eta(t)\}$ different from the original process $\{\xi(t)\}$.

Measure $P$ has to be absolutely continuous with respect to measure $Q$ (or $\nu$) for the entire field of events pertaining to the behaviour of process $\{\xi(t)\}$ on the whole feasible interval $[a, b]$. Consequently, the condition of absolute continuity will also be satisfied for any of its subintervals $[\alpha, \beta]$.

Applying formula (1.6.17) to the values of the stochastic process on some chosen subinterval $[\alpha, \beta]$, we obtain the following definition of the entropy of this interval:
\[
H^{P/Q}_{\xi_\alpha^\beta} = \int \ln \frac{P(d\xi_\alpha^\beta)}{Q(d\xi_\alpha^\beta)}\, P(d\xi_\alpha^\beta). \tag{5.6.1}
\]
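Furthermore, according to the contents of Section 1.7 (see (1.7.17)) we can introduce the conditional entropy
\[
H^{P/Q}_{\xi_\alpha^\beta \mid \xi_\gamma^\delta} = \int \ln \frac{P(d\xi_\alpha^\beta \mid \xi_\gamma^\delta)}{Q(d\xi_\alpha^\beta)}\, P(d\xi_\alpha^\beta\, d\xi_\gamma^\delta) \tag{5.6.2}
\]
where $[\gamma, \delta]$ is another subinterval not overlapping with $[\alpha, \beta]$.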
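The introduced entropies obey the usual relationships encountered in the discrete version. For instance, they obey the additivity condition
\[
H^{P/Q}_{\xi_\alpha^\delta} = H^{P/Q}_{\xi_\alpha^\beta} + H^{P/Q}_{\xi_\beta^\delta \mid \xi_\alpha^\beta}, \qquad \alpha < \beta < \delta. \tag{5.6.3}
\]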
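When writing formulae (5.6.2), (5.6.3) it is assumed that measures $Q$, $\nu$ satisfy the multiplicativity condition
\[
Q(d\xi_\alpha^\beta\, d\xi_\gamma^\delta) = Q(d\xi_\alpha^\beta)\, Q(d\xi_\gamma^\delta), \qquad
\nu(d\xi_\alpha^\beta\, d\xi_\gamma^\delta) = \nu_1(d\xi_\alpha^\beta)\, \nu_2(d\xi_\gamma^\delta) \tag{5.6.4}
\]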
($[\alpha, \beta]$ does not overlap with $[\gamma, \delta]$) that is analogous to (1.7.8). The indicated multiplicativity condition for measure $Q$ means that the auxiliary process $\{\eta_t\}$ is such that its values $\xi_\alpha^\beta$ and $\xi_\gamma^\delta$ for non-overlapping intervals $[\alpha, \beta]$, $[\gamma, \delta]$ must be independent. The multiplicativity condition for measure $\nu$ means in addition that the constants $N_\alpha^\beta = \int \nu(d\xi_\alpha^\beta)$ are defined by some increasing function $F(t)$ by the formula
\[
N_\alpha^\beta = e^{F(\beta) - F(\alpha)}. \tag{5.6.5}
\]
In the case of a stationary process, function $F(t)$ is linear, so that
\[
F(\beta) - F(\alpha) = (\beta - \alpha) h_\nu, \qquad N_\alpha^\beta = e^{(\beta - \alpha) h_\nu}.
\]
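Taking into account (5.6.5), regular entropy of type (1.6.16) and conditional entropy of type (1.7.4) can be found by the formulae
\[
H_{\xi_\alpha^\beta} = F(\beta) - F(\alpha) - H^{P/Q}_{\xi_\alpha^\beta}, \tag{5.6.6}
\]
\[
H_{\xi_\alpha^\beta \mid \xi_\gamma^\delta} = F(\beta) - F(\alpha) - H^{P/Q}_{\xi_\alpha^\beta \mid \xi_\gamma^\delta} \tag{5.6.7}
\]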
where the right-hand side variables are defined via relations (5.6.1), (5.6.2).

2. Next we consider a stationary process $\{\xi_t\}$ defined for all $t$. In this case it is natural to choose the auxiliary process $\{\eta_t\}$ to be stationary as well. Since the entropy in the generalized version possesses the same properties as the entropy in the discrete version when the multiplicativity condition is satisfied, the considerations and results related to a stationary process in discrete time and stated in Sections 5.1 and 5.5 can be extended to the continuous-time case.

Due to the general properties of entropy, the conditional entropy $H_{\xi_0^\tau \mid \xi_{-\sigma}^0}$ ($\sigma > 0$) is monotonically non-increasing as $\sigma$ grows. This fact entails the existence of the limit
\[
\lim_{\sigma \to \infty} H_{\xi_0^\tau \mid \xi_{-\sigma}^0} = H_{\xi_0^\tau \mid \xi_{-\infty}^0} \tag{5.6.8}
\]
which we define as $H_{\xi_0^\tau \mid \xi_{-\infty}^0}$.
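Because the general property (1.7.3) implies
\[
H_{\xi_0^{\tau_1+\tau_2} \mid \xi_{-\sigma}^0} = H_{\xi_0^{\tau_1} \mid \xi_{-\sigma}^0} + H_{\xi_{\tau_1}^{\tau_1+\tau_2} \mid \xi_{-\sigma}^{\tau_1}} \tag{5.6.9}
\]
while the stationarity condition implies
\[
H_{\xi_{\tau_1}^{\tau_1+\tau_2} \mid \xi_{-\sigma}^{\tau_1}} = H_{\xi_0^{\tau_2} \mid \xi_{-\tau_1-\sigma}^0},
\]
we can pass to the limit $\sigma \to \infty$ in (5.6.9) and obtain
\[
H_{\xi_0^{\tau_1+\tau_2} \mid \xi_{-\infty}^0} = H_{\xi_0^{\tau_1} \mid \xi_{-\infty}^0} + H_{\xi_0^{\tau_2} \mid \xi_{-\infty}^0}.
\]
Therefore, the conditional entropy $H_{\xi_0^\tau \mid \xi_{-\infty}^0}$ depends linearly on $\tau$. The corresponding proportionality coefficient is defined as the following entropy rate:
\[
h = \frac{1}{\tau} H_{\xi_0^\tau \mid \xi_{-\infty}^0}. \tag{5.6.10}
\]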
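The next theorem is an analogue of Theorem 5.1 in the continuous case.

Theorem 5.4. If entropy $H_{\xi_0^t}$ is finite, then the entropy rate (5.6.10) can be determined by the limit
\[
h = \lim_{t \to \infty} \frac{1}{t} H_{\xi_0^t}. \tag{5.6.11}
\]

Proof. The proof uses the same method as the proof of Theorem 5.1. We use the additivity property (1.7.4a) and thereby represent entropy $H_{\xi_0^t}$ in the form
\[
H_{\xi_0^t} = H_{\xi_0^\sigma} + H_{\xi_\sigma^{\sigma+1} \mid \xi_0^\sigma} + H_{\xi_{\sigma+1}^{\sigma+2} \mid \xi_0^{\sigma+1}} + \cdots + H_{\xi_{t-1}^t \mid \xi_0^{t-1}} \qquad (\sigma = t - n). \tag{5.6.12}
\]
Due to (5.6.8), (5.6.10) we have
\[
H_{\xi_{\sigma+k}^{\sigma+k+1} \mid \xi_0^{\sigma+k}} = h + o_{\sigma+k}(1) \qquad (0 \leqslant k \leqslant t - \sigma - 1), \tag{5.6.13}
\]
where $o_{\sigma+k}(1)$ converges to 0 as $\sigma \to \infty$. Substituting (5.6.13) into (5.6.12) we obtain
\[
\frac{1}{t} H_{\xi_0^t} = \frac{1}{\sigma + n} H_{\xi_0^\sigma} + \frac{n}{\sigma + n}\, h + \frac{n}{\sigma + n}\, o_\sigma(1). \tag{5.6.14}
\]
Here we let $\sigma$ and $n$ go to infinity in such a way that $n/\sigma \to \infty$. Since
\[
H_{\xi_0^\sigma} = H_{\xi_0^1} + H_{\xi_1^2 \mid \xi_0^1} + \cdots + H_{\xi_{m-1}^m \mid \xi_0^{m-1}} \leqslant m H_{\xi_0^1} \qquad (m \leqslant \sigma + 1),
\]
we observe that the term $\frac{1}{\sigma+n} H_{\xi_0^\sigma} \leqslant \frac{m}{\sigma+n} H_{\xi_0^1}$ converges to 0. The term $\frac{n}{\sigma+n}\, o_\sigma(1)$ also goes to 0. At the same time, $\frac{n}{\sigma+n}\, h$ converges to $h$ because $n/(\sigma + n) \to 1$. Therefore, the desired relation (5.6.11) follows from (5.6.14). The proof is complete.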
The statements from Section 5.1 related to the boundary entropy $\Gamma$ can be generalized to the continuous-time case. Similarly to (5.1.10), that entropy can be defined by the formula
\[
2\Gamma = \lim_{\sigma \to \infty,\, \tau \to \infty} \left[H_{\xi_0^\sigma} + H_{\xi_0^\tau} - H_{\xi_0^{\sigma+\tau}}\right] \tag{5.6.15}
\]
and be represented in the form
\[
\Gamma = \frac{1}{2} \int_0^\infty \left[H_{\xi_0^{d\tau} \mid \xi_{-\tau}^0} - h\, d\tau\right], \tag{5.6.16}
\]
analogous to (5.1.12).

With the help of the variables $h$, $\Gamma$, the entropy $H_{\xi_0^t}$ of a finite segment of a stationary process can be expressed as follows:
\[
H_{\xi_0^t} = th + 2\Gamma + o_t(1). \tag{5.6.17}
\]
If we take into account (5.6.1), then we can easily see from definition (5.6.15) of boundary entropy $\Gamma$ that this entropy is independent of the choice of measure $Q$ (or $\nu$), similarly to Section 5.4. If the multiplicativity conditions are satisfied, then the formula
\[
H^{P/Q}_{\xi_0^t} = t h^{P/Q} - 2\Gamma + o_t(1) \tag{5.6.18}
\]
will be valid for entropy $H^{P/Q}$ in analogy with (5.6.17). In the last formula $\Gamma$ is the same variable as the one in (5.6.17).
5.7 Entropy of a Gaussian process in continuous time

1. A Gaussian stochastic process $\xi(t)$ in continuous time $t$ ($a \leqslant t \leqslant b$) is characterized by the vector of mean values $E[\xi(t)] = m(t)$ and the correlation matrix $R(t, t') = E\{[\xi(t) - m(t)][\xi(t') - m(t')]\}$, similarly to the random variables considered in Section 5.4. The only difference is that now the vector is a function of $t$ defined on interval $[a, b]$, whereas in Section 5.4 the vector consisted of $l$ components. Furthermore, a matrix (say $R$) is a function of two arguments $t$, $t'$ that defines the linear transformation
\[
y(t) = \int_a^b R(t, t')\, x(t')\, dt' \qquad (t \in [a, b])
\]
of vector $x(t)$. As is well known, all the main results of the theory of finite-dimensional vectors can be used in this case. In so doing, it is required to make trivial changes in the formulas, such as replacing a sum by an integral, etc. The methods of calculating entropy given in Section 5.4 can be extended to the case of continuous time if we implement the above-mentioned changes. The resulting matrix
formulae (5.4.8a), (5.4.13) retain their meaning with the new understanding of matrices and vectors. Certainly, the indicated expressions need not be finite now. The condition of their finiteness is connected with the condition of absolute continuity of measure $P$ with respect to measure $\nu$ or $Q$.

If we understand vectors and matrices in a generalized sense, then formulae (5.4.8a), (5.4.13) are valid for both finite and infinite domain intervals $[a, b]$ of a process in stationary and non-stationary cases. Henceforth, we shall only consider stationary processes and determine their entropy rates $h$, $h^{P/Q}$. For this purpose we can apply the approach employed in clause 2 of Section 5.5. This approach uses the passage to a periodic stationary process. While considering a process on interval $[0, T]$ with correlation function $R(t, t') = R(t - t')$, one can construct the new correlation function
\[
\bar R(\tau) = \sum_{n=-\infty}^{\infty} R(\tau + nT) \tag{5.7.1}
\]
which, apart from stationarity, also possesses the periodicity property. Formula (5.7.1) is analogous to formula (5.5.14). The process
\[
\bar\xi(t) = \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} \xi(t + jT)
\]
has such a correlation function in the limit as $N \to \infty$.

The stationary periodic matrix $\bar R(t - t')$ of type (5.7.1) can be diagonalized by a unitary transformation $U$ with the matrix
\[
U_{tr} = \frac{1}{\sqrt{T}}\, e^{2\pi i tr/T}, \qquad t \in [0, T], \quad r = \ldots, -1, 0, 1, 2, \ldots
\]
Therefore, the eigenvalues $\lambda_r$ of matrix $\bar R(t - t')$ are
\[
\lambda_r = \int_0^T \bar R(\tau)\, e^{-2\pi i \tau r/T}\, d\tau.
\]
If we take into account (5.7.1), then from the latter formula we obtain
\[
\lambda_r = \int_{-\infty}^{\infty} e^{-2\pi i \tau r/T} R(\tau)\, d\tau = S\left(2\pi \frac{r}{T}\right) \tag{5.7.2}
\]
where $S(\omega)$ denotes the spectral density
\[
S(\omega) = \int_{-\infty}^{\infty} e^{-i\omega\tau} R(\tau)\, d\tau = S(-\omega) \tag{5.7.3}
\]
of process $\xi(t)$. Substituting (5.7.2) into (5.4.8a) we obtain
\[
\bar H_{\xi_0^T} = \frac{1}{2} \operatorname{tr} \ln \bar R = \frac{1}{2} \sum_r \ln \lambda_r = \frac{1}{2} \sum_r \ln S\left(2\pi \frac{r}{T}\right). \tag{5.7.4}
\]
Now we move to entropy $H^{P/Q}_{\xi_0^T}$
that is determined by formula (5.4.13). It is
0
convenient to apply this formula after diagonalizations of matrices R(t − t ) and − t ). Then R(t λr tr G(R−1 R) = ∑ G (5.7.5) λr r holds true, where r! , λr = S 2π T
S =
∞ −∞
τ )d τ e−iω t R(
(5.7.6)
in analogy with (5.7.2), (5.7.3). After diagonalization the second term in the right-hand side of (5.4.13) takes the form

$$\frac{1}{2}(m - \tilde{m})^T \tilde{R}^{-1}(m - \tilde{m}) = \frac{1}{2}\, c^{+} U^{-1} \tilde{R}^{-1} U c = \frac{1}{2} \sum_r \tilde{\lambda}_r^{-1} |c_r|^2$$

($c = U^{+}(m - \tilde{m})$). Because

$$c_r = \int U^{+}_{tr} [m(t) - \tilde{m}(t)]\, dt = \begin{cases} \sqrt{T}\,(m - \tilde{m}) & \text{for } r = 0 \\ 0 & \text{for } r \ne 0 \end{cases} \tag{5.7.7}$$

equation (5.4.13) yields

$$\bar{H}^{P/Q}_{\xi_0^T} = \sum_{r=-\infty}^{\infty} G\!\left(\frac{S(2\pi r/T)}{\tilde{S}(2\pi r/T)}\right) + T\, \frac{(m - \tilde{m})^2}{2\tilde{S}(0)} \tag{5.7.8}$$

in consequence of (5.7.5), (5.7.7). The summation over $r$ in the right-hand side of (5.7.8) contains an infinite number of terms. In order for the series to converge to a finite limit, it is necessary that the ratio $S(2\pi r/T)/\tilde{S}(2\pi r/T)$ goes to 1 as $|r| \to \infty$, because function $G(x) = (x - 1 - \ln x)/2$ turns to zero (in the region of positive values of $x$) only for $x = 1$. In the proximity of point $x = 1$ the function behaves like

$$G(x) = \frac{1}{4}(x - 1)^2 + O((x - 1)^3).$$

Therefore, when the constraints

$$\sum_{r=0}^{\infty} \left[\frac{S(2\pi r/T)}{\tilde{S}(2\pi r/T)} - 1\right]^2 < \infty,$$
0
(1−ε )Iξ n η n
dF(I) e(ε −μ )Iξ n η n .
Therefore, it goes to zero due to (7.3.11) and property A from the definition of informational stability. Considerations analogous to those used for the previous theorem complete the proof.
7 Message transmission in the presence of noise. Second asymptotic theorem
7.4 Asymptotic formula for the probability of error

In addition to the results of the previous section, we can obtain stronger results related to the rate with which the error probability vanishes. It turns out that the probability of error for satisfactory codes decreases mainly exponentially with growing $n$:

$$P_{\mathrm{er}} \sim e^{a - \alpha n} \tag{7.4.1}$$

where $a$ is a value weakly dependent on $n$ and $\alpha$ is a constant of main interest. Rather general formulae can be derived for the latter quantity.

1. Theorem 7.3. Under the conditions of Theorem 7.1 the following inequality is valid:

$$P_{\mathrm{er}} \le 2 e^{-[s\mu'(s) - \mu(s)]n} \tag{7.4.2}$$

where

$$\mu(t) = \ln \sum_{x,y} P^{1-t}(x, y)\, P^{t}(x)\, P^{t}(y) \tag{7.4.3}$$

[see (6.4.10), with argument $s$ replaced by $t$] and $s$ is a positive root of the equation

$$\mu'(s) = -R. \tag{7.4.4}$$
It is also assumed that $R$ is relatively close to $I_{xy}$ in order for the latter equation to have a solution. Besides, the value of $s$ is assumed to lie within a differentiability interval of potential $\mu(t)$.

Proof. At first, we introduce the skewed cumulative distribution function:

$$\tilde{F}(\lambda) = \frac{\int_{-\infty}^{\lambda} e^{-I}\, dF(I)}{\int_{-\infty}^{\infty} e^{-I}\, dF(I)}. \tag{7.4.5}$$

Formula (7.4.5) can be rewritten in the following form:

$$\int_{\lambda}^{\infty} e^{-I}\, dF(I) = \left[1 - \tilde{F}(\lambda)\right] \int_{-\infty}^{\infty} e^{-I}\, dF(I).$$

Substituting this equality to an inequality of type (7.2.13) (but simultaneously selecting a breakpoint $\lambda = nR$) we obtain

$$P_{\mathrm{er}} \le M \left[1 - \tilde{F}(nR)\right] \int_{-\infty}^{\infty} e^{-I}\, dF(I) + F(nR)$$

or

$$P_{\mathrm{er}} \le e^{nR} \left[1 - \tilde{F}(nR)\right] \int_{-\infty}^{\infty} e^{-I}\, dF(I) + F(nR) \tag{7.4.6}$$

in consequence of (7.3.1).
In order to estimate the obtained expression, we need to know how to estimate 'tails' of distribution functions $F(nR)$, $\tilde{F}(nR)$, which correspond to a sum of independent random variables. This constitutes the content of Theorems 4.6, 4.7. By applying Theorem 4.7 to function $F(nR)$ (for $B(\xi) = -I(\xi, \eta)$) we have

$$F(nR) = P[-I > -nR] \le e^{-[s\mu'(s) - \mu(s)]n} \tag{7.4.7}$$

where

$$e^{n\mu(t)} = \int e^{-tI} F(dI) \tag{7.4.8}$$

i.e.

$$e^{n\mu(t)} = E\!\left[e^{-tI(\xi, \eta)}\right] = \left[\sum_{x,y} e^{-tI(x,y)} P(x, y)\right]^{n} \tag{7.4.9}$$

and $s$ is a root of the equation

$$\mu'(s) = -R. \tag{7.4.10}$$
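Root $s$ is positive because $-R > -I_{xy} = \mu'(0)$ and $\mu'(t)$ is an increasing function: $\mu''(t) > 0$. Analogously, for the skewed distribution $1 - \tilde{F}(-x) = \tilde{P}[-I < x]$ it follows from Theorem 4.6 that

$$1 - \tilde{F}(nR) \le e^{-[\tilde{s}\tilde{\mu}'(\tilde{s}) - \tilde{\mu}(\tilde{s})]n} \tag{7.4.11}$$

where

$$e^{n\tilde{\mu}(\tilde{s})} = \int e^{-\tilde{s}I}\, \tilde{F}(dI) \tag{7.4.12}$$

and

$$\tilde{\mu}'(\tilde{s}) = -R, \qquad \tilde{s} < 0 \quad \text{since } -R < \tilde{\mu}'(0) \quad (\tilde{\mu}'(0) = \mu'(1) > 0). \tag{7.4.13}$$

Substituting (7.4.5) to (7.4.12) we find that

$$e^{n\tilde{\mu}(\tilde{s})} = \frac{\int_{-\infty}^{\infty} e^{-\tilde{s}I - I}\, dF(I)}{\int_{-\infty}^{\infty} e^{-I}\, dF(I)} = e^{n\mu(\tilde{s}+1) - n\mu(1)}. \tag{7.4.14}$$

The latter is valid by virtue of relationship (7.4.8), which entails

$$\int_{-\infty}^{\infty} e^{-I}\, dF(I) = e^{n\mu(1)}. \tag{7.4.15}$$

Therefore,

$$\tilde{\mu}(\tilde{s}) = \mu(\tilde{s} + 1) - \mu(1). \tag{7.4.16}$$

That is why (7.4.13) takes the form

$$\mu'(\tilde{s} + 1) = -R. \tag{7.4.17}$$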
A comparison of this equality with (7.4.10) yields $\tilde{s} + 1 = s$. Therefore, formula (7.4.11) can be rewritten as

$$1 - \tilde{F}(nR) \le e^{-[(s-1)\mu'(s) - \mu(s) + \mu(1)]n} \tag{7.4.18}$$

where (7.4.16) is taken into account. Substituting (7.4.7), (7.4.18), (7.4.15) to (7.4.6) and taking into account (7.4.10), we obtain the desired formula (7.4.2). Equality (7.4.3) follows from (7.4.9). The theorem has been proven.

2. According to the provided theorem, potential $\mu(s)$ defines the coefficient in the exponent of (7.4.1),

$$\alpha = s\mu'(s) - \mu(s), \tag{7.4.19}$$

as the Legendre transform of the characteristic potential:

$$\alpha(R) = -sR - \mu(s) \qquad (d\mu/ds = -R). \tag{7.4.20}$$
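The Legendre transform (7.4.19), (7.4.20) can be evaluated numerically for a concrete channel. The following is only a minimal sketch under stated assumptions: the channel is taken to be a binary symmetric channel with crossover probability 0.1 and uniform input (an illustrative choice, not one made in the text), the function names are hypothetical, and plain bisection is used to solve $\mu'(s) = -R$.

```python
import numpy as np

def bsc_joint(p):
    """Joint distribution P(x, y) of a binary symmetric channel with uniform input (assumed example)."""
    return 0.5 * np.array([[1.0 - p, p], [p, 1.0 - p]])

def information_table(P):
    """Random information I(x, y) = ln[P(x, y) / (P(x) P(y))] for every pair (x, y)."""
    Px = P.sum(axis=1, keepdims=True)
    Py = P.sum(axis=0, keepdims=True)
    return np.log(P / (Px * Py))

def mu(t, P):
    """Potential (7.4.3), written in the equivalent form ln sum_{x,y} exp(-t I(x, y)) P(x, y)."""
    return float(np.log(np.sum(P * np.exp(-t * information_table(P)))))

def mu_prime(t, P):
    """Derivative of mu: minus the mean of I under the weights exp(-t I) P."""
    I = information_table(P)
    w = P * np.exp(-t * I)
    return float(-np.sum(w * I) / np.sum(w))

def solve_s(R, P, hi=50.0, iters=200):
    """Bisection for the root s >= 0 of mu'(s) = -R (mu' is an increasing function)."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mu_prime(mid, P) < -R:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def alpha(R, P):
    """Exponent alpha(R) = -s R - mu(s), where s solves mu'(s) = -R."""
    s = solve_s(R, P)
    return -s * R - mu(s, P)

P = bsc_joint(0.1)
I_xy = float(np.sum(P * information_table(P)))
for R in (0.1, 0.2, 0.3, I_xy):
    print(f"R = {R:.4f} nats  alpha(R) = {alpha(R, P):.6f}")
```

In this sketch the exponent vanishes at $R = I_{xy}$ and grows as $R$ decreases, which is the behaviour described for $\alpha(R)$ in this clause.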
According to common features of Legendre transforms, convexity of function $\mu(s)$ results in convexity of function $\alpha(R)$. This fact can be justified similarly to the proof of inequality (4.1.16a). Using convexity, we extrapolate function $\alpha(R)$ with tangent lines. If function $\alpha(R)$ has already been computed, say, on interval $[R_1, I_{xy}]$, then we can perform the following extrapolation:

$$\alpha_{\tan}(R) = \begin{cases} \alpha(R) & \text{for } R_1 \le R \le I_{xy} \\ \alpha(R_1) + \dfrac{d\alpha(R_1)}{dR}(R - R_1) & \text{for } R \le R_1 \end{cases}$$

and replace the formula

$$P_{\mathrm{er}} \le 2 e^{-n\alpha(R)} \tag{7.4.21}$$

with a weaker formula

$$P_{\mathrm{er}} \le 2 e^{-n\alpha_{\tan}(R)} .$$

A typical character of dependence $\alpha(R)$ and the indicated extrapolation by a tangent line are shown in Figure 7.3. Point $R = I_{xy}$ corresponds to value $s = 0$, since equation (7.4.4) takes the form $\mu'(s) = \mu'(0)$ for such a value. Furthermore, $\mu(0) = 0$ follows from the definition of function $\mu(s)$. Hence, the null value of coefficient $\alpha = 0$ corresponds to value $R = I_{xy}$. Next we investigate the behaviour of dependence $\alpha(R)$ in proximity of the specified point. Differentiating (7.4.4) and (7.4.19), we obtain

$$-\mu''(s)\, ds = dR, \qquad d\alpha = s\mu''(s)\, ds.$$

Consequently, $d\alpha/dR = -s$.
Fig. 7.3 Typical behaviour of the coefficient $\alpha(R) = \lim_{n\to\infty} \frac{-\ln P_{\mathrm{er}}}{n}$ appearing in the exponent of formula (7.4.1)
Further, we take the differential of that derivative and divide it by $dR = -\mu''(s)\, ds$. Thereby we find that

$$\frac{d^2\alpha}{dR^2} = \frac{1}{\mu''(s)}.$$

In particular, we have

$$\frac{d^2\alpha}{dR^2}(I_{xy}) = \frac{1}{\mu''(0)} \tag{7.4.22}$$

at point $R = I_{xy}$, $s = 0$. As is easy to see from definition (7.4.9) of function $\mu(s)$, here $\mu''(0)$ coincides with the variance of random information $I(x, y)$:

$$\mu''(0) = E\,[I(x, y) - I_{xy}]^2 = \sum_{x,y} [I(x, y)]^2 P(x, y) - I_{xy}^2 .$$

Calculating the third-order derivative by the same approach, we have

$$\frac{d^3\alpha}{dR^3} = \frac{\mu'''(s)}{[\mu''(s)]^3}. \tag{7.4.23}$$

Representing function $\alpha(R)$ in the form of a Taylor expansion and disregarding terms with higher-order derivatives, we obtain

$$\alpha(R) = \frac{(I_{xy} - R)^2}{2\mu''(0)} - \frac{\mu'''(0)}{6[\mu''(0)]^3}(I_{xy} - R)^3 + \cdots, \qquad R < I_{xy} \tag{7.4.24}$$
according to (7.4.22), (7.4.23). The aforementioned results related to formula (7.4.1) can be complemented and improved in more ways than one. Several stronger results will be mentioned later on (see Section 7.5).
These results can be extended to a more general [in comparison with (7.1.1)] case of arbitrary informationally stable random variables, i.e. Theorem 7.2 can be enhanced in the direction of accounting for the rate of convergence of the error probability. We will not pursue this point but limit ourselves to indicating that the improvement in question can be realized in a completely standard way. Coefficient $\alpha$ should not be considered separately from $n$. Instead, we need to operate with the combination $\alpha_n = n\alpha$. Analogously, we should consider only the combinations

$$\mu_n(t) = n\mu(t), \qquad R_n = nR, \qquad I_{\xi^n \eta^n} = nI_{xy}.$$

Therefore formulae (7.4.2)–(7.4.4) will be replaced with the formulae

$$P_{\mathrm{er}} \le 2 e^{-s\mu_n'(s) + \mu_n(s)}, \qquad \mu_n'(s) = -R_n, \qquad \mu_n(t) = \ln \int P^{1-t}(d\xi^n, d\eta^n)\, P^{t}(d\xi^n)\, P^{t}(d\eta^n). \tag{7.4.24a}$$
Only the method of writing formulae will change in the above-said text. Formulae which assess a behaviour of the error probability were derived in the works of Shannon [42] (the English original is [40]) and Fano [10] (the English original is [9]).
7.5 Enhanced estimators for optimal decoding

1. In the preceding sections decoding was performed according to the principle of the maximum likelihood function (7.1.6) or, equivalently, the minimum distance (7.1.8). It is of great interest to study what estimate of the error probability results if decoding is performed on the basis of a 'distance' $D(\xi, \eta)$ defined somewhat differently. In the present paragraph we suppose that 'distance' $D(\xi, \eta)$ is some arbitrarily given function. Certainly, a transition to a new 'distance' cannot diminish the probability of decoding error but, in principle, it can decrease an upper bound for the specified probability.

Theorem 7.4. Suppose that we have a channel $[P(\eta \mid \xi), P(\xi)]$ (just as in Theorem 7.1), which is an $n$-th power of channel $[P(y \mid x), P(x)]$. Let the decoding be performed on the basis of the minimum distance

$$D(\xi, \eta) = \sum_{j=1}^{n} d(x_j, y_j) \tag{7.5.1}$$

where $d(x, y)$ is a given function. The amount $\ln M$ of transmitted information increases with $n$ according to the law

$$\ln M = \ln [e^{nR}] \approx nR \tag{7.5.2}$$
($R < I_{xy}$ is independent of $n$). Then there exists a sequence of codes having the probability of decoding error

$$P_{\mathrm{er}} \le 2 e^{-n[s_0\gamma'(s_0) - \gamma(s_0)]}. \tag{7.5.3}$$

Here $s_0$ is one of the roots $s_0$, $t_0$, $r_0$ of the following system of equations:

$$\begin{aligned} &\gamma'(s_0) = \varphi'_r(r_0, t_0), \\ &\varphi'_t(r_0, t_0) = 0, \\ &(s_0 - r_0)\gamma'(s_0) - \gamma(s_0) + \varphi(r_0, t_0) + R = 0. \end{aligned} \tag{7.5.4}$$

Also, $\gamma(s)$ and $\varphi(r, t)$ are the functions below:

$$\gamma(s) = \ln \sum_{x,y} e^{s d(x,y)} P(x) P(y \mid x) \tag{7.5.5}$$

$$\varphi(r, t) = \ln \sum_{x,y,x'} e^{(r-t)d(x,y) + t d(x',y)} P(x) P(y \mid x) P(x'). \tag{7.5.6}$$
Besides, $\varphi'_r = \partial\varphi/\partial r$, $\varphi'_t = \partial\varphi/\partial t$. It is also assumed that equations (7.5.4) have roots, which belong to the domains of definition and differentiability of functions (7.5.5), (7.5.6), and $s_0 > 0$, $r_0 < 0$, $t_0 < 0$.

Proof. As earlier, we will consider random codes and average the decoding error over them. First, we write down the inequalities for the average error, which are analogous to (7.2.1)–(7.2.4) but with an arbitrarily assigned distance $D(\xi, \eta)$. Now we perform averaging with respect to $\eta$ last:

$$P_{\mathrm{er}} = P_{\mathrm{er}}(\mid k) = \sum_{\eta} P_{\mathrm{er}}(\mid k, \eta) P(\eta) \tag{7.5.7}$$

where

$$P_{\mathrm{er}}(\mid k, \eta) = \sum_{\xi_1 \cdots \xi_M} \upsilon(k, \xi_1, \cdots, \xi_M, \eta)\, P(\xi_1) \cdots P(\xi_{k-1})\, P(\xi_k \mid \eta)\, P(\xi_{k+1}) \cdots P(\xi_M) \tag{7.5.8}$$

and

$$\upsilon(k, \xi_1, \cdots, \xi_M, \eta) = \begin{cases} 0, & \text{if all } D(\xi_l, \eta) > D(\xi_k, \eta),\ l \ne k, \\ 1, & \text{if at least one } D(\xi_l, \eta) \le D(\xi_k, \eta),\ l \ne k. \end{cases} \tag{7.5.9}$$

Substituting (7.5.9) to (7.5.8) in analogy with (7.2.5)–(7.2.8), we obtain
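$$P_{\mathrm{er}}(\mid k, \eta) = P_{\mathrm{er}}(\mid \eta) = 1 - \sum_{\xi_k} P(\xi_k \mid \eta) \prod_{l \ne k} \Biggl[\, \sum_{D(\xi_l,\eta) > D(\xi_k,\eta)} P(\xi_l) \Biggr] = \sum_{\xi_k} P(\xi_k \mid \eta)\, f\Biggl( \sum_{D(\xi_l,\eta) \le D(\xi_k,\eta)} P(\xi_l) \Biggr). \tag{7.5.10}$$

Here the inequality $D(\xi_l, \eta) \le D(\xi_k, \eta)$ is not strict. Including all those cases when $D(\xi_l, \eta) = D(\xi_k, \eta)$, we denote

$$F_{\eta}[\lambda] = \sum_{D(\xi_l,\eta) \le \lambda} P(\xi_l). \tag{7.5.11}$$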
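Next, if we use (7.2.11), then it will follow from (7.5.10) that

$$P_{\mathrm{er}}(\mid \eta) \le \sum_{\xi} P(\xi \mid \eta)\, f(F_{\eta}[D(\xi, \eta)]) \le E\bigl\{ \min\bigl[ M F_{\eta}[D(\xi, \eta)],\ 1 \bigr] \bigm| \eta \bigr\}. \tag{7.5.12}$$

We have omitted index $k$ here because the expressions in (7.5.10), (7.5.12) turn out to be independent of it. Selecting some boundary value $nd$ (independent of $\eta$) and using the inequality

$$\min\bigl\{ M F_{\eta}[D(\xi, \eta)],\ 1 \bigr\} \le \begin{cases} M F_{\eta}[D(\xi, \eta)] & \text{for } D(\xi, \eta) \le nd, \\ 1 & \text{for } D(\xi, \eta) > nd \end{cases} \tag{7.5.13}$$

we conclude from (7.5.12) that

$$P_{\mathrm{er}}(\mid \eta) \le \sum_{D(\xi,\eta) \le nd} M F_{\eta}[D(\xi, \eta)]\, P(\xi \mid \eta) + \sum_{D(\xi,\eta) > nd} P(\xi \mid \eta). \tag{7.5.14}$$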
Further, we average the latter inequality over $\eta$ and thereby obtain the estimator for the average probability of error

$$P_{\mathrm{er}} \le M P_1 + P_2 \tag{7.5.15}$$

where

$$P_2 = \sum_{D(\xi,\eta) > nd} P(\xi, \eta) \tag{7.5.16}$$

$$P_1 = \sum_{D(\xi,\eta) \le nd}\ \sum_{D(\xi_l,\eta) \le D(\xi,\eta)} P(\xi, \eta) P(\xi_l) \tag{7.5.17}$$

[(7.5.11) is taken into account in the last expression]. The cumulative distribution function (7.5.16) can be estimated with the help of Theorem 4.7. Under the assumption $d > \frac{1}{n} E[D(\xi, \eta)] = \gamma'(0)$ we have

$$P_2 \le e^{-n[s_0\gamma'(s_0) - \gamma(s_0)]} \tag{7.5.18}$$
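where

$$\gamma'(s_0) = d, \quad s_0 > 0; \qquad n\gamma(s) = \ln E\bigl[e^{sD(\xi,\eta)}\bigr], \quad \gamma(s) = \ln E\bigl[e^{sd(x,y)}\bigr] \tag{7.5.19}$$

is the function (7.5.5). The probability (7.5.17) can be expressed in terms of the bivariate joint distribution function

$$F(\lambda_1, \lambda_2) = \sum_{D(\xi,\eta) \le \lambda_1}\ \sum_{D(\xi_l,\eta) - D(\xi,\eta) \le \lambda_2} P(\xi) P(\eta \mid \xi) P(\xi_l). \tag{7.5.20}$$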
In order to derive an analogous estimator for it, we need to apply the multivariate generalization of Theorem 4.6 or 4.7, i.e. formula (4.4.13). This yields

$$P_1 \le e^{-n[r_0\varphi'_r + t_0\varphi'_t - \varphi(r_0, t_0)]}, \qquad \varphi'_r = \frac{\partial\varphi(r_0, t_0)}{\partial r}, \quad \varphi'_t = \frac{\partial\varphi(r_0, t_0)}{\partial t} \tag{7.5.21}$$

where $r_0$, $t_0$ are solutions of the respective equations

$$\varphi'_r(r_0, t_0) = d, \qquad \varphi'_t(r_0, t_0) = 0 \tag{7.5.22}$$

and $n\varphi(r, t)$ is a two-dimensional characteristic potential

$$n\varphi(r, t) = \ln E\bigl[e^{rD(\xi,\eta) + t[D(\xi_l,\eta) - D(\xi,\eta)]}\bigr] = n \ln E\bigl[e^{rd(x,y) + t[d(x',y) - d(x,y)]}\bigr] \tag{7.5.23}$$

[see (7.5.6)]. It is assumed that $r_0 < 0$, $t_0 < 0$.

We substitute (7.5.18), (7.5.21) to (7.5.15) and replace $M$ with $e^{nR}$. Employing the freedom of choosing constant $d$, we adjust it in such a way that the estimators for both terms in the right-hand side of the formula

$$P_{\mathrm{er}} \le e^{nR} P_1 + P_2 \tag{7.5.24}$$

are equal to each other. This will result in the equation

$$r_0\varphi'_r(r_0, t_0) - \varphi(r_0, t_0) - R = s_0\gamma'(s_0) - \gamma(s_0)$$

which constitutes the system of equations (7.5.4) together with other equations following from (7.5.19), (7.5.22). Simultaneously, inequality (7.5.24) turns into (7.5.3). The proof is complete.
2. Now we turn our attention to the particular case when $R$ ($< I_{xy}$) is so far from $I_{xy}$ that root $r_0$ [see equations (7.5.4)] becomes positive. Then it is reasonable to choose $\infty$ as $nd$ in inequality (7.5.13), so that the inequality takes the form

$$\min\bigl\{ M F_{\eta}[D(\xi, \eta)],\ 1 \bigr\} \le M F_{\eta}[D(\xi, \eta)].$$

Instead of (7.5.15)–(7.5.17) we will have

$$P_{\mathrm{er}} \le M \sum_{D(\xi_l,\eta) \le D(\xi,\eta)} P(\xi, \eta) P(\xi_l) = M F(\infty, 0) \tag{7.5.25}$$

… 0 and when formula (7.5.3) cannot be used. Taking into account the character of change of potentials $\varphi(r, t)$, $\gamma(s)$, we can make certain that the constraint $r_0 > 0$ is equivalent to the constraint $R < R^*$ or $s_0 > s_0^*$. After introducing the Legendre convex conjugate $\alpha(R)$ of function $\gamma(s)$ by equalities (7.4.19), (7.4.20), formulae (7.5.3), (7.5.30) can be represented as follows:

$$P_{\mathrm{er}} \le \begin{cases} 2 e^{-n\alpha(R)} & \text{for } R^* < R < I_{xy} \\ e^{-n[\alpha(R^*) + R^* - R]} & \text{for } R < R^*. \end{cases} \tag{7.5.31}$$
3. Now we consider an important corollary from the obtained results. We select a function $d(x, y)$ of the following type:

$$d(x, y) = -\ln P(y \mid x) + f(y) \tag{7.5.32}$$

where $f(y)$ is a function, which will be specified below. The corresponding distance $D(\xi, \eta) = \sum_i d(x_i, y_i)$ is more general than (7.1.8). By introducing the notation

$$\gamma_y(\beta) = \ln \sum_{x} P^{\beta}(y \mid x) P(x) \tag{7.5.33}$$

we can represent functions (7.5.5), (7.5.6) in this case as

$$\gamma(s) = \ln \sum_{y} e^{\gamma_y(1-s) + s f(y)}, \qquad \varphi(r, t) = \ln \sum_{y} \exp\bigl\{ \gamma_y(1 - r + t) + \gamma_y(-t) + r f(y) \bigr\}. \tag{7.5.34}$$

Further, the derivatives involved in (7.5.4) take the form

$$\gamma'(s_0) = e^{-\gamma(s_0)} \sum_{y} e^{\gamma_y(1-s_0) + s_0 f(y)} \bigl[ f(y) - \gamma_y'(1 - s_0) \bigr] \tag{7.5.35}$$

$$\varphi'_r(r_0, t_0) = e^{-\varphi(r_0,t_0)} \sum_{y} \exp\bigl[ \gamma_y(1 - r_0 + t_0) + \gamma_y(-t_0) + r_0 f(y) \bigr] \bigl[ f(y) - \gamma_y'(1 - r_0 + t_0) \bigr] \tag{7.5.36}$$

$$\varphi'_t(r_0, t_0) = e^{-\varphi(r_0,t_0)} \sum_{y} \exp\bigl[ \gamma_y(1 - r_0 + t_0) + \gamma_y(-t_0) + r_0 f(y) \bigr] \bigl[ \gamma_y'(1 - r_0 + t_0) - \gamma_y'(-t_0) \bigr]. \tag{7.5.37}$$

The second equation from (7.5.4) will be satisfied if we suppose that

$$1 + t - r = -t, \qquad r = 1 + 2t \tag{7.5.38}$$

in particular, $r_0 = 1 + 2t_0$, since in this case every term of the latter sum turns into zero. According to (7.5.35), (7.5.36) the first equation of (7.5.4) takes the form

$$e^{-\gamma(s_0)} \sum_{y} e^{\gamma_y(1-s_0) + s_0 f(y)} \bigl[ f(y) - \gamma_y'(1 - s_0) \bigr] = e^{-\varphi(1+2t_0,\, t_0)} \sum_{y} e^{2\gamma_y(-t_0) + (1+2t_0) f(y)} \bigl[ f(y) - \gamma_y'(-t_0) \bigr] \tag{7.5.39}$$
in the given case. In order to satisfy the latter equation we suppose that

$$1 - s_0 = -t_0 \tag{7.5.40}$$

$$\gamma_y(1 - s_0) + s_0 f(y) = 2\gamma_y(-t_0) + (1 + 2t_0) f(y) \tag{7.5.41}$$

i.e. we choose a certain type of function $f$:

$$f(y) = \frac{1}{1 - s_0}\, \gamma_y(1 - s_0). \tag{7.5.42}$$

Then the summations in the left-hand and right-hand sides of (7.5.39) will be equated and thereby this equation will be reduced to the equation

$$\gamma(s_0) = \varphi(1 + 2t_0, t_0) = \varphi(2s_0 - 1, s_0 - 1). \tag{7.5.43}$$

But the last equation is satisfied by virtue of the same relations (7.5.38), (7.5.40), (7.5.42), which can be easily verified by substituting them to (7.5.34). Hence, both equations of system (7.5.4) are satisfied. In consequence of (7.5.38), (7.5.40), (7.5.43) the remaining equation can be reduced to

$$(1 - s_0)\gamma'(s_0) + R = 0 \tag{7.5.44}$$

and due to (7.5.34), (7.5.42) we have

$$\gamma(s) = \ln \sum_{y} \exp\Bigl\{ \gamma_y(1 - s) + \frac{s}{1 - s_0}\, \gamma_y(1 - s_0) \Bigr\}. \tag{7.5.45}$$

Differentiating this expression or taking into account (7.5.35), we obtain that equation (7.5.44) can be rewritten as

$$R = e^{-\gamma(s_0)} \sum_{y} e^{\frac{1}{1-s_0}\gamma_y(1-s_0)} \bigl[ (1 - s_0)\gamma_y'(1 - s_0) - \gamma_y(1 - s_0) \bigr]. \tag{7.5.46}$$

It is left to check the signs of roots $s_0$, $r_0$, $t_0$. Assigning $s_0 = 0$ we find the boundary value from (7.5.46):

$$R_{\max} = -\gamma'(s_0)\big|_{s_0=0} = \sum_{y} e^{\gamma_y(1)} \bigl[ \gamma_y'(1) - \gamma_y(1) \bigr]. \tag{7.5.47}$$
This boundary value coincides with information $I_{xy}$. Indeed, in consequence of (7.5.33) it follows from

$$\gamma_y'(1) = \frac{1}{P(y)} \sum_{x} P(x, y) \ln P(y \mid x), \qquad \gamma_y(1) = \ln P(y),$$

$$\gamma_y'(1) - \gamma_y(1) = H(y) - \frac{1}{P(y)} \sum_{x} P(x, y) H(y \mid x)$$

that

$$\sum_{y} e^{\gamma_y(1)} \bigl[ \gamma_y'(1) - \gamma_y(1) \bigr] = H_y - H_{y|x} = I_{xy}. \tag{7.5.47a}$$
Analyzing expression (7.5.46), we can obtain that $s_0 > 0$ if $R < I_{xy}$. Due to (7.5.38), (7.5.40) the other roots are equal to

$$r_0 = 2s_0 - 1, \qquad t_0 = s_0 - 1. \tag{7.5.48}$$

Apparently, they are negative if $0 < s_0 < 1/2$, i.e. if $R$ is sufficiently close to $I_{xy}$. If $s_0$ exceeds 1/2, so that $r_0$ becomes positive because of the first equality of (7.5.48), then we should use formula (7.5.30) instead of (7.5.3), as was said in the previous clause. Due to (7.5.48) the values $s_0^*$, $t_0^*$ obtained from the condition $r_0 = 0$ turn out to be the following: $s_0^* = 1/2$, $t_0^* = -1/2$. The 'critical' value $R^*$ is obtained from equation (7.5.46) by substituting $s_0 = 1/2$, i.e. it turns out to be equal to

$$R^* = -\frac{1}{2}\gamma'\Bigl(\frac{1}{2}\Bigr) = e^{-\gamma(1/2)} \sum_{y} e^{2\gamma_y(1/2)} \Bigl[ \frac{1}{2}\gamma_y'\Bigl(\frac{1}{2}\Bigr) - \gamma_y\Bigl(\frac{1}{2}\Bigr) \Bigr]$$

or, equivalently,

$$R^* = \frac{\sum_{y} e^{2\gamma_y(1/2)} \bigl[ \frac{1}{2}\gamma_y'\bigl(\frac{1}{2}\bigr) - \gamma_y\bigl(\frac{1}{2}\bigr) \bigr]}{\sum_{y} e^{2\gamma_y(1/2)}} \tag{7.5.49}$$
where we take into account (7.5.35), (7.5.42). The results derived above can be formulated in the form of a theorem.

Theorem 7.5. Under the conditions of Theorems 7.1, 7.3 there exists a sequence of codes such that the probability of decoding error satisfies the inequality

$$P_{\mathrm{er}} \le \begin{cases} 2 e^{n[s_0 R/(1-s_0) + \gamma(s_0)]} & \text{for } R^* \le R < I_{xy} \\ e^{n[\gamma(1/2) + R]} & \text{for } R < R^* \end{cases} \tag{7.5.50}$$

where

$$\gamma(s_0) = \ln \sum_{y} e^{\gamma_y(1-s_0)/(1-s_0)} \equiv \ln \sum_{y} \Bigl[ \sum_{x} P(x) P^{1-s_0}(y \mid x) \Bigr]^{1/(1-s_0)},$$

$s_0 \in (0, 1/2)$ is a root of equation (7.5.46), and $R^*$ is the value (7.5.49).

Coefficient $\alpha_0 = -s_0 R/(1 - s_0) - \gamma(s_0)$ in exponent (7.5.50) is obtained by substituting equality (7.5.44) to the expression

$$\alpha_0 = s_0\gamma'(s_0) - \gamma(s_0) \tag{7.5.51}$$

situated in the exponents of formulae (7.5.3), (7.5.30), (7.5.31).
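The quantities entering Theorem 7.5 are straightforward to tabulate for a small discrete channel. The sketch below is illustrative only: it assumes a binary symmetric channel with crossover probability 0.1 and uniform input, the function names (`gamma_y`, `rate`, `exponent`) are hypothetical, and bisection over $s_0 \in (0, 1/2)$ stands in for solving equation (7.5.46), relying on the rate decreasing in $s_0$ for this example.

```python
import numpy as np

def gamma_y(beta, Px, Pyx):
    """gamma_y(beta) = ln sum_x P(y|x)^beta P(x), formula (7.5.33); one value per y."""
    return np.log(np.sum(Px[:, None] * Pyx ** beta, axis=0))

def gamma_y_prime(beta, Px, Pyx):
    """Derivative of gamma_y with respect to beta."""
    w = Px[:, None] * Pyx ** beta
    return np.sum(w * np.log(Pyx), axis=0) / np.sum(w, axis=0)

def gamma(s0, Px, Pyx):
    """gamma(s0) = ln sum_y exp[gamma_y(1 - s0) / (1 - s0)], as in Theorem 7.5."""
    return float(np.log(np.sum(np.exp(gamma_y(1.0 - s0, Px, Pyx) / (1.0 - s0)))))

def rate(s0, Px, Pyx):
    """Right-hand side of (7.5.46): the rate R corresponding to the root s0."""
    b = 1.0 - s0
    g = gamma_y(b, Px, Pyx)
    gp = gamma_y_prime(b, Px, Pyx)
    return float(np.exp(-gamma(s0, Px, Pyx)) * np.sum(np.exp(g / b) * (b * gp - g)))

def exponent(R, Px, Pyx, iters=200):
    """Coefficient multiplying -n in the exponent of the bound (7.5.50)."""
    R_star = rate(0.5, Px, Pyx)
    if R < R_star:
        return -gamma(0.5, Px, Pyx) - R
    lo, hi = 0.0, 0.5          # assumed: rate(s0) decreases from I_xy to R* on [0, 1/2]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rate(mid, Px, Pyx) > R:
            lo = mid
        else:
            hi = mid
    s0 = 0.5 * (lo + hi)
    return -s0 * R / (1.0 - s0) - gamma(s0, Px, Pyx)

p = 0.1                                         # illustrative crossover probability
Px = np.array([0.5, 0.5])                       # P(x)
Pyx = np.array([[1 - p, p], [p, 1 - p]])        # Pyx[x, y] = P(y | x)
print("I_xy =", rate(0.0, Px, Pyx), " R* =", rate(0.5, Px, Pyx))
for R in (0.05, 0.1, 0.2, 0.3):
    print(f"R = {R:.2f}  exponent = {exponent(R, Px, Pyx):.6f}")
```

At $s_0 = 0$ the function `rate` returns $I_{xy}$, in agreement with (7.5.47a), and the two branches of (7.5.50) join continuously at $R = R^*$.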
4. Next we will investigate the behaviour of the expressions provided in Theorem 7.5 for values of $R$ close to the limit value $I_{xy}$. This investigation will allow us to compare the indicated results with the results of Theorem 7.3. As is seen from (7.5.5), function $\gamma(s)$ possesses the property $\gamma(0) = 0$ for any distance $d(x, y)$. This property is also valid for the particular case (7.5.32), (7.5.42). That is why the decomposition of function (7.5.45) into a bivariate series with respect to $s$, $s_0$ will contain the following terms:

$$\gamma(s) = \gamma_{10} s + \gamma_{20} s^2 + \gamma_{11} s s_0 + \gamma_{30} s^3 + \gamma_{21} s^2 s_0 + \gamma_{12} s s_0^2 + \cdots . \tag{7.5.52}$$

In this case,

$$\gamma_{10} = -I_{xy} \tag{7.5.53}$$

due to (7.5.47), (7.5.47a). Substituting (7.5.52) to (7.5.51) we obtain

$$\alpha_0 = \gamma_{20} s_0^2 + (2\gamma_{30} + \gamma_{21}) s_0^3 + s_0^4 \cdots . \tag{7.5.54}$$

In order to express $s_0$ in terms of $I_{xy} - R$ we plug (7.5.52) into equation (7.5.44), which yields

$$-R = (1 - s_0)\bigl[ \gamma_{10} + (2\gamma_{20} + \gamma_{11}) s_0 + (3\gamma_{30} + 2\gamma_{21} + \gamma_{12}) s_0^2 + \cdots \bigr].$$

Taking into account (7.5.53), it follows from the latter formula that

$$I_{xy} - R = (2\gamma_{20} + \gamma_{11} - \gamma_{10}) s_0 + (3\gamma_{30} + 2\gamma_{21} + \gamma_{12} - 2\gamma_{20} - \gamma_{11}) s_0^2 + s_0^3 \cdots . \tag{7.5.55}$$

Then we can express $s_0$ as follows:

$$\begin{aligned} s_0 &= \frac{I_{xy} - R}{2\gamma_{20} + \gamma_{11} - \gamma_{10}} - \frac{3\gamma_{30} + 2\gamma_{21} + \gamma_{12} - 2\gamma_{20} - \gamma_{11}}{2\gamma_{20} + \gamma_{11} - \gamma_{10}}\, s_0^2 - s_0^3 \cdots \\ &= \frac{I_{xy} - R}{2\gamma_{20} + \gamma_{11} - \gamma_{10}} - \frac{3\gamma_{30} + 2\gamma_{21} + \gamma_{12} - 2\gamma_{20} - \gamma_{11}}{2\gamma_{20} + \gamma_{11} - \gamma_{10}} \left( \frac{I_{xy} - R}{2\gamma_{20} + \gamma_{11} - \gamma_{10}} \right)^{2} - \cdots . \end{aligned} \tag{7.5.56}$$
The substitution of this expression to (7.5.54) allows us to find value $\alpha_0$ up to terms of the order of magnitude $(I_{xy} - R)^3$. Coefficients $\gamma_{ik}$ can be computed with the help of (7.5.45). For convenience of computation we transform the latter expression to a somewhat different form by introducing the conditional characteristic potential of the random information:

$$\mu(t \mid y) = \ln \sum_{x} e^{-tI(x,y)} P(x \mid y) = \ln \sum_{x} P^{1-t}(y \mid x) P(x) P^{t-1}(y) \tag{7.5.57}$$

which is associated with function (7.5.33) via the evident relationship

$$\gamma_y(1 - t) = \mu(t \mid y) + (1 - t) \ln P(y). \tag{7.5.58}$$

The derivatives of (7.5.57) may be interpreted as conditional cumulants of $-I(x, y)$:

$$\begin{aligned} \mu(0 \mid y) &= 0, \\ \mu'(0 \mid y) &= -E[I(x, y) \mid y] = -m, \\ \mu''(0 \mid y) &= E[I^2(x, y) \mid y] - \{E[I(x, y) \mid y]\}^2 = \operatorname{Var}[I(x, y) \mid y] = D, \\ \mu'''(0 \mid y) &= -k, \quad \cdots \end{aligned} \tag{7.5.59}$$

[see formula (4.1.12)]. The substitution of (7.5.58) to (7.5.45) yields

$$\gamma(s) = \ln \sum_{y} \exp\Bigl\{ \mu(s \mid y) + \frac{s}{1 - s_0}\, \mu(s_0 \mid y) \Bigr\} P(y). \tag{7.5.60}$$
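Taking into account (7.5.59), we can represent the expression situated in the exponent in the form

$$\mu(s \mid y) + s(1 + s_0 + s_0^2 + \cdots)\mu(s_0 \mid y) = -ms + \frac{1}{2}Ds^2 - \frac{1}{6}ks^3 + \cdots + s\Bigl[ -ms_0 + \Bigl( \frac{1}{2}D - m \Bigr) s_0^2 + \cdots \Bigr].$$

Consequently,

$$\begin{aligned} \exp\Bigl\{ \mu(s \mid y) + \frac{s}{1 - s_0}\mu(s_0 \mid y) \Bigr\} &= 1 - ms + \frac{1}{2}m^2s^2 - \frac{1}{6}m^3s^3 + \frac{1}{2}Ds^2 + \cdots - \frac{1}{6}ks^3 + \cdots - \frac{1}{2}mDs^3 + m^2s^2s_0 - mss_0 + \cdots + \Bigl( \frac{1}{2}D - m \Bigr) ss_0^2 + \cdots \\ &= 1 - ms + \frac{1}{2}(D + m^2)s^2 - mss_0 - \frac{1}{6}(k + 3Dm + m^3)s^3 + m^2s^2s_0 + \Bigl( \frac{1}{2}D - m \Bigr) ss_0^2 + \cdots . \end{aligned}$$

After averaging the latter expression over $y$ according to (7.5.60) and denoting a mean value by an overline, we will have

$$\gamma(s) = \ln\Bigl\{ 1 - \bar{m}s + \frac{1}{2}(\bar{D} + \overline{m^2})s^2 - \bar{m}ss_0 - \frac{1}{6}(\bar{k} + 3\overline{Dm} + \overline{m^3})s^3 + \overline{m^2}s^2s_0 + \Bigl( \frac{1}{2}\bar{D} - \bar{m} \Bigr) ss_0^2 + \cdots \Bigr\}.$$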
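The last formula entails

$$\begin{aligned} \gamma(s) = {} & -\bar{m}s + \frac{1}{2}\bigl[ \bar{D} + \overline{m^2} - (\bar{m})^2 \bigr] s^2 - \bar{m}ss_0 - \frac{1}{6}\bigl[ \bar{k} + 3\overline{Dm} + \overline{m^3} + 2(\bar{m})^3 - 3(\bar{D} + \overline{m^2})\bar{m} \bigr] s^3 \\ & + \overline{m^2}s^2s_0 - (\bar{m})^2s^2s_0 + \Bigl( \frac{1}{2}\bar{D} - \bar{m} \Bigr) ss_0^2 + \cdots . \end{aligned} \tag{7.5.61}$$

If we assign $s_0 = 0$ in expression (7.5.60), then due to (7.5.57) it will turn into the characteristic potential

$$\mu(s) = \ln \sum_{x,y} e^{-sI(x,y)} P(x, y)$$

equivalent to (7.4.3). That is why the terms with $s$, $s^2$, $s^3$, … in (7.5.61) are automatically proportional to the full cumulants

$$\gamma_{10} = \mu'(0) = -I_{xy}, \qquad \gamma_{20} = \frac{1}{2}\mu''(0) = \frac{1}{2}\operatorname{Var}[I(x, y)], \qquad \gamma_{30} = \frac{1}{6}\mu'''(0) = -\frac{1}{6}K_3[I(x, y)], \quad \ldots \tag{7.5.62}$$

where $K_3$ means the third cumulant. In addition to the given relations, from (7.5.61) we have

$$\gamma_{11} = \gamma_{10} = -I_{xy}, \qquad \gamma_{21} = \overline{m^2} - I_{xy}^2, \qquad \gamma_{12} = \frac{1}{2}\bar{D} - I_{xy}.$$

Therefore, relationship (7.5.56) takes the form

$$s_0 = \frac{I_{xy} - R}{\mu''(0)} - \frac{\frac{1}{2}\mu'''(0) + 2\overline{m^2} - 2I_{xy}^2 + \frac{1}{2}\bar{D} - \mu''(0)}{[\mu''(0)]^3}\,(I_{xy} - R)^2 .$$

Further, we substitute this equality to formula (7.5.54), which takes the form

$$\alpha_0 = \frac{1}{2}\mu''(0)s_0^2 + \Bigl[ \frac{1}{3}\mu'''(0) + \overline{m^2} - I_{xy}^2 \Bigr] s_0^3 + \cdots .$$

As a result we obtain that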
$$\alpha_0 = \frac{(I_{xy} - R)^2}{2\mu''(0)} + \Bigl[ 2\mu''(0) - \frac{3}{2}\bar{D} - \frac{1}{6}\mu'''(0) \Bigr] \frac{(I_{xy} - R)^3}{[\mu''(0)]^3} \qquad (R < I_{xy}). \tag{7.5.63}$$

Here we have taken into account that, according to the third equality (7.5.59), it holds true that

$$\mu''(0) - \bar{D} = E[I^2(x, y)] - I_{xy}^2 - \bar{D} = \overline{m^2} - I_{xy}^2 .$$

Moreover, we compare the last result with formula (7.4.24), which is valid in the same approximation sense. At the same time we take into account that

$$\mu''(0) \ge 0 \qquad \text{and} \qquad \mu''(0) \ge \bar{D},$$

since the conditional variance does not exceed the regular (non-conditional) one $\mu''(0)$. In comparison with (7.4.24), equation (7.5.63) contains additional positive terms, because of which $\alpha_0 > \alpha$. Therefore, inequality (7.5.50) is stronger than inequality (7.4.2) (at least for values of $R$ that are sufficiently close to $I_{xy}$). Thus, Theorem 7.5 is stronger than Theorem 7.3. A number of other results giving an improved estimation of the behaviour of the probability of decoding error are provided in the book by Fano [10] (the English original is [9]).
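The comparison $\alpha_0 > \alpha$ can also be checked numerically on an example. The following sketch assumes a binary symmetric channel with crossover probability 0.1 and uniform input (an illustrative choice that is not part of the text); the helper names are hypothetical and bisection is used as a generic root finder for (7.4.4) and (7.5.46).

```python
import numpy as np

p = 0.1                                        # illustrative crossover probability
Px = np.array([0.5, 0.5])                      # input distribution P(x)
Pyx = np.array([[1 - p, p], [p, 1 - p]])       # Pyx[x, y] = P(y | x)
Pxy = Px[:, None] * Pyx                        # joint distribution P(x, y)
Py = Pxy.sum(axis=0)
I = np.log(Pyx / Py[None, :])                  # random information I(x, y)
I_xy = float(np.sum(Pxy * I))

def bisect(f, lo, hi, iters=200):
    """Root of a function f that increases from negative to positive on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def alpha_thm73(R):
    """alpha = s mu'(s) - mu(s) with mu'(s) = -R, formulae (7.4.19), (7.4.4)."""
    def mu(t):
        return np.log(np.sum(Pxy * np.exp(-t * I)))
    def mu_prime(t):
        w = Pxy * np.exp(-t * I)
        return -np.sum(w * I) / np.sum(w)
    s = bisect(lambda t: mu_prime(t) + R, 0.0, 50.0)
    return float(-s * R - mu(s))

def gamma(s0):
    """gamma(s0) of Theorem 7.5."""
    inner = np.sum(Px[:, None] * Pyx ** (1 - s0), axis=0)
    return float(np.log(np.sum(inner ** (1 / (1 - s0)))))

def rate(s0):
    """Right-hand side of (7.5.46)."""
    b = 1 - s0
    w = Px[:, None] * Pyx ** b
    g = np.log(w.sum(axis=0))                               # gamma_y(1 - s0)
    gp = np.sum(w * np.log(Pyx), axis=0) / w.sum(axis=0)    # gamma_y'(1 - s0)
    return float(np.exp(-gamma(s0)) * np.sum(np.exp(g / b) * (b * gp - g)))

def alpha_thm75(R):
    """alpha_0 = -s0 R / (1 - s0) - gamma(s0) for R* <= R < I_xy."""
    s0 = bisect(lambda s: R - rate(s), 0.0, 0.5)
    return float(-s0 * R / (1 - s0) - gamma(s0))

R_star = rate(0.5)
for R in np.linspace(R_star, I_xy, 6)[1:-1]:
    print(f"R = {R:.4f}  alpha = {alpha_thm73(R):.6f}  alpha_0 = {alpha_thm75(R):.6f}")
```

For this example the printed values of $\alpha_0$ exceed those of $\alpha$ on the whole interval $R^* < R < I_{xy}$, and the gap shrinks as $R$ approaches $I_{xy}$, as the cubic correction in (7.5.63) suggests.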
7.6 Some general relations between entropies and mutual informations for encoding and decoding

1. The noisy channel in consideration is characterized by conditional probabilities $P(\eta \mid \xi) = \prod_i P(y_i \mid x_i)$. The probabilities $P(\xi) = \prod_i P(x_i)$ of the input variable $\xi$ determine the method of obtaining the random code $\xi_1, \ldots, \xi_M$. The transmitted message indexed by $k = 1, \ldots, M$ is associated with the $k$-th code point $\xi_k$. During decoding, the observed random variable $\eta$ defines the index $l(\eta)$ of the received message. As was mentioned earlier, we select a code point $\xi_l$, which is the 'closest' to the observed point $\eta$ in terms of some 'distance'. The transformation $l = l(\eta)$ is degenerate. Therefore, due to inequality (6.3.9) we have

$$I_{k,l}(\mid \xi_1, \ldots, \xi_M) \le I_{k,\eta}(\mid \xi_1, \ldots, \xi_M). \tag{7.6.1}$$

Applying (6.3.9) we need to juxtapose $k$ with $y$, $l$ with $x_1$ and interpret $x_2$ as a random variable complementing $l$ to $\eta$ (in such a way that $\eta$ coincides with $x_1$, $x_2$). The code $\xi_1, \ldots, \xi_M$ in (7.6.1) is assumed to be fixed. Averaging (7.6.1) over various codes with weight $P(\xi_1) \ldots P(\xi_M)$ and denoting the corresponding results as $I_{kl|\xi_1 \ldots \xi_M}$, $I_{k\eta|\xi_1 \ldots \xi_M}$, we obtain

$$I_{kl|\xi_1, \ldots, \xi_M} \le I_{k\eta|\xi_1, \ldots, \xi_M}. \tag{7.6.2}$$
2. Let us now compare the amount of information $I_{k\eta|\xi_1 \ldots \xi_M}$ with $I_{\xi\eta}$. The former amount of information is defined by the formula

$$I_{k\eta|\xi_1, \ldots, \xi_M} = \sum I_{k\eta}(\mid \xi_1, \ldots, \xi_M)\, P(\xi_1) \cdots P(\xi_M), \tag{7.6.3}$$

where

$$I_{k\eta}(\mid \xi_1, \ldots, \xi_M) = H_{\eta}(\mid \xi_1, \ldots, \xi_M) - H_{\eta|k}(\mid \xi_1, \ldots, \xi_M) = \sum_{\eta} f\Bigl( \sum_{k} P(k) P(\eta \mid \xi_k) \Bigr) - \sum_{k} P(k) H_{\eta}(\mid \xi_k); \tag{7.6.4}$$

$$f(z) = -z \ln z. \tag{7.6.5}$$

In turn, information $I_{\xi\eta}$ can be rewritten as

$$I_{\xi\eta} = H_{\eta} - H_{\eta|\xi} = \sum_{\eta} f\Bigl( \sum_{\xi} P(\xi) P(\eta \mid \xi) \Bigr) - \sum_{\xi} P(\xi) H_{\eta}(\mid \xi). \tag{7.6.6}$$

It is easy to verify that after averaging (7.6.4) over $\xi_1, \ldots, \xi_M$ the second (subtrahend) term coincides with the second term of formula (7.6.6). Indeed, the expectation

$$\sum_{\xi_k} P(\xi_k) H_{\eta}(\mid \xi_k)$$

of entropy $H_{\eta}(\mid \xi_k)$ is independent of $k$ due to the parity of all $k$ (see Section 7.2), so that

$$\sum_{k} P(k) \sum_{\xi_k} P(\xi_k) H_{\eta}(\mid \xi_k) = \sum_{\xi} P(\xi) H_{\eta}(\mid \xi).$$
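Therefore, the difference of informations (7.6.6) and (7.6.3) is equal to

$$I_{\xi\eta} - I_{k\eta|\xi_1, \ldots, \xi_M} = \sum_{\eta} f\Bigl( \sum_{\xi} P(\xi) P(\eta \mid \xi) \Bigr) - E\Bigl[ \sum_{\eta} f\Bigl( \sum_{k} P(k) P(\eta \mid \xi_k) \Bigr) \Bigr], \tag{7.6.7}$$

where $E$ denotes the averaging over $\xi_1, \ldots, \xi_M$. In consequence of concavity of function (7.6.5), we can apply the formula $E[f(\zeta)] \le f(E[\zeta])$ [see (1.2.4)] for $\zeta = \sum_k P(k) P(\eta \mid \xi_k)$, i.e. the inequality

$$f\Bigl( \sum_{k} P(k) E[P(\eta \mid \xi_k)] \Bigr) - E\Bigl[ f\Bigl( \sum_{k} P(k) P(\eta \mid \xi_k) \Bigr) \Bigr] \ge 0.$$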
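But $E[P(\eta \mid \xi_k)] = \sum_{\xi_k} P(\xi_k) P(\eta \mid \xi_k)$ does not depend on $k$ and coincides with the argument of function $f$ from the first term of relation (7.6.7). Hence, for every $\eta$

$$f\Bigl( \sum_{\xi} P(\xi) P(\eta \mid \xi) \Bigr) - E\Bigl[ f\Bigl( \sum_{k} P(k) P(\eta \mid \xi_k) \Bigr) \Bigr] \ge 0$$

and it follows from (7.6.7) that

$$I_{\xi\eta} - I_{k\eta|\xi_1, \ldots, \xi_M} \ge 0. \tag{7.6.8}$$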
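Inequality (7.6.8) can be verified exactly on a toy example by enumerating every possible code. The sketch below assumes block length $n = 1$, a binary symmetric channel with crossover probability 0.1, input distribution $P(x) = (0.3, 0.7)$ and equiprobable messages $P(k) = 1/M$; these numbers and the function names are illustrative assumptions only.

```python
import itertools
import numpy as np

def entropy(q):
    """Entropy -sum q ln q in nats, ignoring zero entries."""
    q = np.asarray(q, dtype=float)
    q = q[q > 0]
    return float(-np.sum(q * np.log(q)))

def avg_code_information(Px, Pyx, M):
    """I_{k eta | xi_1..xi_M}: information (7.6.4) averaged over all codes with weight P(xi_1)...P(xi_M)."""
    total = 0.0
    for code in itertools.product(range(len(Px)), repeat=M):
        idx = list(code)
        weight = float(np.prod(Px[idx]))
        P_eta = np.mean(Pyx[idx], axis=0)                 # sum_k P(k) P(eta | xi_k), P(k) = 1/M
        H_cond = np.mean([entropy(Pyx[x]) for x in idx])  # sum_k P(k) H_eta(| xi_k)
        total += weight * (entropy(P_eta) - H_cond)
    return total

def channel_information(Px, Pyx):
    """I_{xi eta} = H_eta - H_{eta | xi}, formula (7.6.6)."""
    Py = Px @ Pyx
    return entropy(Py) - float(np.sum(Px * np.array([entropy(row) for row in Pyx])))

Px = np.array([0.3, 0.7])
Pyx = np.array([[0.9, 0.1], [0.1, 0.9]])   # Pyx[x, y] = P(y | x)
I_full = channel_information(Px, Pyx)
for M in (1, 2, 3, 4):
    print(f"M = {M}: averaged I_k_eta = {avg_code_information(Px, Pyx, M):.6f}, I_xi_eta = {I_full:.6f}")
```

For $M = 1$ the averaged information is zero, since a single message carries no uncertainty, and for every $M$ it stays below $I_{\xi\eta}$, as (7.6.8) requires.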
Uniting inequalities (7.6.2), (7.6.8) we obtain that

$$I_{kl} \le I_{k\eta} \le I_{\xi\eta}, \tag{7.6.9}$$

where the conditions $\mid \xi_1 \ldots \xi_M$ are omitted for brevity.

3. It is useful to relate the information $I_{kl}$ between input and output messages to the probability of error. Consider the entropy $H_l(\mid k, \xi_1, \ldots, \xi_M)$ corresponding to a given transmitted message $k$. After message $k$ has been sent, there is still some uncertainty about what message will be received. Employing the hierarchical property of entropy (Section 1.3), this uncertainty (which is numerically equal to $H_l(\mid k; \xi_1, \ldots, \xi_M)$) can be represented as a sum of two terms. Those terms correspond to a two-stage elimination of uncertainty. At the first stage, it is pointed out whether a received message is correct, i.e. whether $l$ and $k$ coincide. This uncertainty is equal to

$$-P_{\mathrm{er}}(k, \xi_1, \ldots, \xi_M) \ln P_{\mathrm{er}}(k, \xi_1, \ldots, \xi_M) - [1 - P_{\mathrm{er}}(k, \xi_1, \ldots, \xi_M)] \ln[1 - P_{\mathrm{er}}(k, \xi_1, \ldots, \xi_M)] \equiv h_2[P_{\mathrm{er}}(k, \xi_1, \ldots, \xi_M)],$$

where $P_{\mathrm{er}}(k, \xi_1, \ldots, \xi_M)$ is the probability of decoding error under the condition that message $k$ has been transmitted (a code is fixed). At the second stage, if $l \ne k$, then we should point out which of the remaining messages has been received. The corresponding uncertainty cannot be larger than $P_{\mathrm{er}} \ln(M - 1)$. Therefore,

$$H_l(\mid k, \xi_1, \ldots, \xi_M) \le h_2(P_{\mathrm{er}}(k, \xi_1, \ldots, \xi_M)) + P_{\mathrm{er}}(k, \xi_1, \ldots, \xi_M) \ln(M - 1).$$

Next, we average this inequality over $k$ by using the formula

$$E[h_2(\zeta)] \le h_2(E[\zeta]), \tag{7.6.10}$$

which is valid due to concavity of function $h_2(z) = -z \ln z - (1 - z) \ln(1 - z)$ and inequality (1.2.4). That yields

$$H_{l|k}(\mid \xi_1, \ldots, \xi_M) \le h_2(P_{\mathrm{er}}(\xi_1, \ldots, \xi_M)) + P_{\mathrm{er}}(\xi_1, \ldots, \xi_M) \ln(M - 1).$$
Further, we can perform averaging over an ensemble of random codes and analogously, using (7.6.10) one more time, obtain

$$H_{l|k,\xi_1, \ldots, \xi_M} \le h_2(P_{\mathrm{er}}) + P_{\mathrm{er}} \ln(M - 1). \tag{7.6.11}$$

Because $I_{kl|\xi_1, \ldots, \xi_M} = H_{l|\xi_1, \ldots, \xi_M} - H_{l|k,\xi_1, \ldots, \xi_M}$, it follows from (7.6.11) that

$$I_{kl|\xi_1, \ldots, \xi_M} \ge H_{l|\xi_1, \ldots, \xi_M} - P_{\mathrm{er}} \ln(M - 1) - h_2(P_{\mathrm{er}}). \tag{7.6.12}$$

The same reasoning is applicable if we switch $k$ and $l$. Then in analogy with (7.6.12) we will have

$$I_{kl|\xi_1, \ldots, \xi_M} \ge H_k - P_{\mathrm{er}} \ln(M - 1) - h_2(P_{\mathrm{er}}) \tag{7.6.13}$$

($P_{\mathrm{er}} = P(k \ne l) = E[P(k \ne l \mid k)] = E[P(k \ne l \mid l)]$). Presuming the equiprobability of all $M$ possible messages $k = 1, \ldots, M$ we have $H_k = \ln M$. Besides, it is evident that $h_2(P_{\mathrm{er}}) \le \ln 2 = 1$ bit. That is why (7.6.13) can be rewritten as

$$\ln M - P_{\mathrm{er}} \ln(M - 1) - \ln 2 \le I_{kl|\xi_1, \ldots, \xi_M}. \tag{7.6.14}$$

Earlier we supposed that $M = [e^{nR}]$; then

$$e^{nR} - 1 \le M, \qquad e^{nR} \ge M - 1,$$

and (7.6.14) takes the form

$$\ln(e^{nR} - 1) - P_{\mathrm{er}}\, nR - \ln 2 \le I_{kl|\xi_1, \ldots, \xi_M},$$

i.e.

$$1 + \ln(1 - e^{-nR})/nR - P_{\mathrm{er}} - \ln 2/nR \le I_{kl|\xi_1, \ldots, \xi_M}/nR.$$

Taking into account (7.6.9) and the relationship $I_{\xi\eta} = nI_{xy}$ we obtain the resultant inequality

$$P_{\mathrm{er}} \ge 1 - \frac{I_{xy}}{R} - \frac{\ln 2}{nR} + \frac{1}{nR} \ln(1 - e^{-nR}), \tag{7.6.15}$$

which defines a lower bound for the probability of decoding error. As $nR \to \infty$, the inequality (7.6.15) turns into the asymptotic formula

$$P_{\mathrm{er}} \ge 1 - I_{xy}/R.$$

It makes sense to use the latter formula when $R > I_{xy}$ (if $R < I_{xy}$, then the inequality becomes trivial). According to the last formula, errorless decoding inherently does not take place when $R > I_{xy}$, so that the boundary $I_{xy}$ for $R$ is substantial.

4. Uniting formulae (7.6.9), (7.6.14) and replacing the factor $\ln(M - 1)$ multiplying $P_{\mathrm{er}}$ with $\ln M$, we will obtain the result

$$\ln M - P_{\mathrm{er}} \ln M - \ln 2 \le I_{kl} \le I_{k\eta} \le I_{\xi\eta}. \tag{7.6.16}$$
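The lower bound (7.6.15) is simple to evaluate. The following is a minimal sketch with hypothetical names; the numerical values ($I_{xy} = 0.368$ nats, rates 0.3 and 0.5 nats per symbol) are chosen purely for illustration.

```python
import math

def error_lower_bound(I_xy, R, n):
    """Right-hand side of (7.6.15); I_xy and R are in nats per symbol, n is the block length."""
    nR = n * R
    return 1.0 - I_xy / R - math.log(2.0) / nR + math.log1p(-math.exp(-nR)) / nR

I_xy = 0.368                      # illustrative value of the information per symbol
for n in (10, 100, 10_000):
    print(n, error_lower_bound(I_xy, R=0.5, n=n), error_lower_bound(I_xy, R=0.3, n=n))
```

For $R > I_{xy}$ the bound approaches $1 - I_{xy}/R$ from below as $n$ grows, while for $R < I_{xy}$ it is negative and therefore carries no information, which is the trivial case mentioned above.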
The quantity $\ln M = I$ is the amount of information in Hartley's sense. It follows from the asymptotic faultlessness of decoding that this amount of information is close to Shannon's amount of information $I_{\xi\eta}$. Dividing (7.6.16) by $\ln M = I$, we have

$$1 - P_{\mathrm{er}} - \ln 2/I \le I_{kl}/I \le I_{k\eta}/I \le I_{\xi\eta}/I. \tag{7.6.17}$$

It can be concluded from the previous theorems in this chapter that one can increase $I$ and perform encoding and decoding in such a way that $I_{\xi\eta}/I \to 1$ as $n \to \infty$ and at the same time $P_{\mathrm{er}} \to 0$. Then, apparently, the length of the interval $\bigl[ 1 - P_{\mathrm{er}} - \frac{\ln 2}{I},\ \frac{I_{\xi\eta}}{I} \bigr]$ tends to zero:

$$I_{\xi\eta}/I - 1 + P_{\mathrm{er}} + \ln 2/I \to 0. \tag{7.6.18}$$

This means that with increasing $n$ the following approximations are valid with a greater degree of accuracy:

$$1 \approx I_{kl}/I \approx I_{k\eta}/I \approx I_{\xi\eta}/I \qquad \text{or} \qquad I/n \approx I_{kl}/n \approx I_{k\eta}/n \approx I_{\xi\eta}/n. \tag{7.6.19}$$

These asymptotic relations generalize equalities (6.1.17) concerned with simple noise (disturbances). Thus, arbitrary noise can be regarded as asymptotically equivalent to simple noise. Index $l$ of code region $G_l$ is an asymptotically sufficient coordinate (see Section 6.1). Just as in the case of simple noise (Section 6.1), when the use of Shannon's amount of information

$$I_{xy} = H_x - H_{x|y} \tag{7.6.20}$$

was justified (according to (6.1.17)) by its ability to be reduced to a simpler 'Boltzmann' amount of information

$$H_k = -\sum_{k} P(k) \ln P(k),$$

in the case of arbitrary noise the use of information amount (7.6.20) is most convincingly justified by the asymptotic equality $I_{\xi\eta}/\ln M \approx 1$.
Chapter 8
Channel capacity. Important particular cases of channels
This chapter is devoted to the second variational problem, in which we seek an extremum of Shannon's amount of information with respect to different input distributions. We assume that the channel, i.e. the conditional distribution of its output given a fixed input signal, is known. The maximum amount of information between the input and output signals is called channel capacity. Contrary to the conventional presentation, from the very beginning we introduce an additional constraint on the mean value of some function of input variables, i.e. we consider a conditional variational problem. Results for the case without the constraint are obtained as a particular case of the general results. Following the presentation style adopted in this book, we introduce potentials, in terms of which a conditional channel capacity is expressed. We consider more thoroughly a number of important particular cases of channels for which explicit results can be derived. For instance, in the case of Gaussian channels, the general formulae are obtained in matrix form using matrix techniques. In this chapter, the presentation concerns mainly the case of discrete random variables x, y. However, many considerations and results can be generalized directly by changing notation (for example, by replacing P(y | x), P(x) with P(dy | x), P(dx) and so on).
© Springer Nature Switzerland AG 2020. R. V. Belavkin et al. (eds.), Theory of Information and its Value, https://doi.org/10.1007/978-3-030-22833-0_8

8.1 Definition of channel capacity

In the previous chapter, it was assumed that not only the noise inside a channel (described by conditional probabilities $P(y \mid x)$) is statistically defined, but also the signals at the channel's input, which are described by a priori probabilities $P(x)$. That is why the system characterized by the ensemble of distributions $[P(y \mid x), P(x)]$ (or, equivalently, by the joint distribution $P(x, y)$) was considered as a communication channel. Usually, the distribution $P(x)$ is not an inherent part of a real communication channel, as distinct from the conditional distribution $P(y \mid x)$. Sometimes it makes sense not to fix the distribution $P(x)$ a priori but just to fix some technically important requirements, say of the form

$$a_1 \leq \sum_x c(x)P(x) \leq a_2, \tag{8.1.1}$$

where $c(x)$ is a known function. Usually, it is sufficient to consider only a one-sided constraint of the type

$$E[c(x)] \leq a_0. \tag{8.1.2}$$

For instance, a constraint on the average power of a transmitter belongs to the set of requirements of such a type. Thus, a channel is characterized by the conditional distribution $P(y \mid x)$ and condition (8.1.1) or (8.1.2). Accordingly, when referring to such a channel, we use the expressions 'channel $[P(y \mid x), c(x), a_1, a_2]$', 'channel $[P(y \mid x), c(x), a_0]$' or succinctly 'channel $[P(y \mid x), c(x)]$'. In some cases conditions (8.1.1), (8.1.2) may even be absent. Then a channel is characterized only by a conditional distribution $P(y \mid x)$.

As is seen from the results of the previous chapter, the most important informational characteristic of channel $[P(y \mid x), P(x)]$ is the quantity $I_{xy}$, which defines an upper limit for the amount of information that is transmitted without errors asymptotically. Capacity $C$ is an analogous quantity for channel $[P(y \mid x), c(x)]$. Using the freedom of choice of distribution $P(x)$, which remains after fixing either condition (8.1.1) or (8.1.2), it is natural to select the distribution that is most beneficial with respect to the amount of information $I_{xy}$. This leads to the following definition.

Definition 8.1. Capacity of channel $[P(y \mid x), c(x)]$ is the maximum amount of information between the input and output:

$$C = C[P(y \mid x), c(x)] = \sup_{P(x)} I_{xy}, \tag{8.1.3}$$

where the maximization is considered over all $P(x)$ compliant with condition (8.1.1) or (8.1.2).

As a result of the specified maximization, we can find the optimal distribution $P_0(x)$ for which

$$C = I_{xy} \tag{8.1.4}$$

or at least an $\varepsilon$-optimal $P_\varepsilon(x)$ for which $0 \leq C - I_{xy} < \varepsilon$, where $\varepsilon$ is infinitesimally small. After that we can consider system $[P(y \mid x), P_0(x)]$ or $[P(y \mid x), P_\varepsilon(x)]$ and apply the results of the previous chapter to it. Thus, Theorem 7.1 induces the following statement.

Theorem 8.1. Suppose there is a stationary channel, which is the n-th power of channel $[P(y \mid x), c(x)]$. Suppose that the amount $\ln M$ of transmitted information grows with $n \to \infty$ according to the law $\ln M = \ln[e^{nR}]$, where $R$ is a scalar independent of $n$ and satisfying the inequality

$$R < C, \tag{8.1.5}$$

and $C < \infty$ is the capacity of the channel $[P(y \mid x), c(x)]$. Then there exists a sequence of codes such that $P_{er} \to 0$ as $n \to \infty$.
In order to derive this theorem from Theorem 7.1, evidently, it suffices to select a distribution $P(x)$ that is consistent with condition (8.1.1) or (8.1.2) in such a way that the inequality $R < I_{xy} \leq C$ holds. This can be done in view of (8.1.3), (8.1.5). In a similar way, the other results of the previous chapter obtained for the channels $[P(y \mid x), P(x)]$ can be extended to the channels $[P(y \mid x), c(x)]$. We shall not dwell on this any longer.

According to definition (8.1.3) of capacity of a noisy channel, its calculation reduces to solving a certain extremum problem. The analogous situation was encountered in Sections 3.2, 3.3 and 3.6, where we considered capacity of a noiseless channel. The difference between these two cases is that entropy is maximized in the first one, whereas Shannon's amount of information is maximized in the other. In spite of this difference, the two extremum problems have a lot in common. In order to distinguish the two, the latter extremum problem will be called the second extremum problem of information theory.

The extremum (8.1.3) is usually achieved at the boundary of the feasible range (8.1.1) of average costs. Thereby, condition (8.1.1) can be replaced with a one-sided inequality of type (8.1.2) or even with the equality

$$E[c(x)] = a \tag{8.1.6}$$

(where $a$ coincides with $a_0$ or $a_2$). In this case the channel can be specified as a system $[P(y \mid x), c(x), a]$ and we say that the channel capacity corresponds to the level of losses (degree of blocking) $a$.

The channel in consideration does not need to be discrete. All the aforesaid also pertains to the case when random variables x, y are arbitrary: continuous, combined and so on. Under an abstract formulation of the problem we should consider an abstract space $\Omega$ (which consists of points $\omega$) and two Borel fields $F_1$, $F_2$ of its subsets (so-called $\sigma$-algebras). Fields $F_1$ and $F_2$ correspond to random variables x and y, respectively. The cost function $c(x) = c(\omega)$ is an $F_1$-measurable function of $\omega$. Besides, there must be defined a conditional distribution $P(\Lambda \mid F_1)$, $\Lambda \in F_2$. Then system $[P(\cdot \mid F_1), c(\cdot), a]$ constitutes an abstract channel. Its capacity is defined as the maximum value (8.1.3) of Shannon's amount of information

$$I_{xy} = \int\!\!\int \ln\frac{P(d\omega' \mid F_1)}{P(d\omega')}\,P(d\omega' \mid F_1)\,P(d\omega), \qquad d\omega' \in F_2,$$

$$P(\Lambda) = \int P(\Lambda \mid F_1)\,P(d\omega), \qquad \Lambda \in F_2,$$

in the generalized version (see Section 6.4). Therefore, we compare distinct distributions $P(\cdot)$ on $F_1$, which satisfy a condition of type (8.1.1), (8.1.2) or (8.1.6).
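Definition 8.1 can be illustrated numerically in the simplest discrete setting. The sketch below is only an illustration under stated assumptions: the channel has two input symbols, the supremum in (8.1.3) is approximated by a search over a uniform grid of $P(x_1) = q$, and the one-sided cost constraint (8.1.2) is applied by discarding inadmissible grid points. The function names, the example channel and the cost values are hypothetical.

```python
import math

def mutual_information(p_x, channel):
    """Shannon information I_xy in nats; channel[x][y] = P(y | x)."""
    n_y = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_y)]
    info = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_y):
            pyx = channel[x][y]
            if px > 0.0 and pyx > 0.0:
                info += px * pyx * math.log(pyx / p_y[y])
    return info

def capacity_two_inputs(channel, cost=None, a0=None, steps=10000):
    """Approximate sup of I_xy over P(x) = (q, 1 - q) on a uniform grid of q.

    If cost and a0 are given, only q with E[c(x)] <= a0 are admitted, as in (8.1.2).
    Returns (capacity, best_q).
    """
    best_c, best_q = -1.0, None
    for i in range(steps + 1):
        q = i / steps
        if cost is not None and q * cost[0] + (1.0 - q) * cost[1] > a0 + 1e-12:
            continue
        info = mutual_information([q, 1.0 - q], channel)
        if info > best_c:
            best_c, best_q = info, q
    return best_c, best_q

# Hypothetical example: binary symmetric channel with error probability 0.1.
bsc = [[0.9, 0.1], [0.1, 0.9]]
print(capacity_two_inputs(bsc))                            # unconstrained
print(capacity_two_inputs(bsc, cost=[1.0, 0.0], a0=0.2))   # E[c(x)] <= 0.2
```

In the constrained run the maximum is attained at the boundary of the admissible set, $q = 0.2$, which agrees with the remark that the extremum is usually achieved at the boundary of the feasible range of average costs.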
8.2 Solution of the second variational problem. Relations for channel capacity and potential

1. We use $X$ to denote the space of values that an input variable x can take. For the extremum distribution $P_0(dx)$ corresponding to capacity (8.1.3), the probability can be concentrated only in a part of the indicated space. Let us denote by $\tilde{X}$ the minimal subset $\tilde{X} \subseteq X$ for which $P_0(\tilde{X}) = 1$ (i.e. $P_0(X - \tilde{X}) = 0$). We shall call it an 'active domain'.

When solving the extremum problem we suppose, for convenience, that x is a discrete variable. Then we can consider probabilities $P(x)$ of individual points x and take partial derivatives with respect to them. Otherwise, we would have to introduce variational derivatives, which involves some complications, although not fundamental ones. We try to find a conditional extremum with respect to $P(x)$ of the expression

$$I_{xy} = \sum_{x,y} P(x)P(y \mid x)\ln\frac{P(y \mid x)}{\sum_{x'} P(x')P(y \mid x')}$$

under the extra constraints

$$\sum_x c(x)P(x) = a, \tag{8.2.1}$$

$$\sum_x P(x) = 1. \tag{8.2.2}$$

We may leave the non-negativity constraint for probabilities $P(x)$ aside for now and check that it is satisfied after having solved the problem. Introducing the indefinite Lagrange multipliers $-\beta$, $1 + \beta\varphi$, which will be determined from constraints (8.2.1), (8.2.2) later on, we construct the expression

$$K = \sum_{x,y} P(x)P(y \mid x)\ln\frac{P(y \mid x)}{\sum_{x'} P(x')P(y \mid x')} - \beta\sum_x c(x)P(x) + (1 + \beta\varphi)\sum_x P(x). \tag{8.2.3}$$

We will seek its extremum by varying values $P(x)$, $x \in \tilde{X}$, corresponding to the active domain $\tilde{X}$. Equating the partial derivative of (8.2.3) with respect to $P(x)$ to zero, we obtain the equation

$$\sum_y P(y \mid x)\ln\frac{P(y \mid x)}{P(y)} - \beta c(x) + \beta\varphi = 0 \quad\text{for } x \in \tilde{X}, \tag{8.2.4}$$

which is a necessary condition of extremum. Here

$$P(y) = \sum_x P(x)P(y \mid x). \tag{8.2.5}$$
Multiplying (8.2.4) by $P(x)$ and summing over x with account of (8.2.1), (8.2.2), we have $I_{xy} = \beta(a - \varphi)$, i.e.

$$C = \beta(a - \varphi). \tag{8.2.6}$$

This relation allows us to exclude $\varphi$ from equation (8.2.4) and thereby rewrite it in the form

$$\sum_y P(y \mid x)\ln\frac{P(y \mid x)}{P(y)} = C - \beta a + \beta c(x) \quad\text{for } x \in \tilde{X}. \tag{8.2.7}$$

Formulae (8.2.7), (8.2.1), (8.2.2) constitute the system of equations that serves for a joint determination of variables $C$, $\beta$, $P(x)$ ($x \in \tilde{X}$), if the region $\tilde{X}$ is already selected. For a proper selection of this region, solving the specified equations will yield positive probabilities $P(x)$, $x \in \tilde{X}$.

We can multiply the main equation (8.2.4) or (8.2.7) by $P(x)$ and, thus, rewrite it in the form

$$\sum_y P(x)P(y \mid x)\ln\frac{P(y \mid x)}{P(y)} = [C - \beta a + \beta c(x)]P(x), \tag{8.2.8}$$

which is convenient because it is valid for all values $x \in X$ and not only for $x \in \tilde{X}$. Equation (8.2.7) does not necessarily hold true beyond region $\tilde{X}$.

It is not difficult to write down a generalization of the provided equations for the case when random variables are not discrete. Instead of (8.2.7), (8.2.5) we will have

$$\int P(dy \mid x)\ln\frac{P(dy \mid x)}{P(dy)} = C - \beta a + \beta c(x), \quad x \in \tilde{X}; \qquad P(dy) = \int P(dy \mid x)P(dx). \tag{8.2.9}$$
In this case, the desired distribution turns out to be $P(dx) = p(x)dx$, $x \in \tilde{X}$.

2. Coming back to the discrete version, we prove the following statement.

Theorem 8.2. A solution to equations (8.2.7), (8.2.1), (8.2.2) corresponds to the maximum of information $I_{xy}$ with respect to variations of distribution $P(x)$ that leave the active domain $\tilde{X}$ invariant.

Proof. First, we compute the matrix of second derivatives of expression (8.2.3):

$$\frac{\partial^2 K}{\partial P(x)\,\partial P(x')} = \frac{\partial^2 I_{xy}}{\partial P(x)\,\partial P(x')} = -\sum_y \frac{P(y \mid x)P(y \mid x')}{P(y)}, \tag{8.2.10}$$

where $\beta$, $\varphi$ do not vary. It is easy to see that this matrix is negative semi-definite (since $P(y) \geq 0$), so that $K$ is a concave (convex upward) function of $P(x)$, $x \in \tilde{X}$. Indeed,

$$-\sum_{x,x'}\sum_y \frac{P(y \mid x)P(y \mid x')}{P(y)} f(x)f(x') = -\sum_y \frac{1}{P(y)}\Big[\sum_x P(y \mid x)f(x)\Big]^2 \leq 0$$

for every function $f(x)$. Preservation of region $\tilde{X}$ for different variations means that only probabilities $P(x)$, $x \in \tilde{X}$, are varied. For such variations function $K$ has null derivatives (8.2.4). Hence, it follows from the concavity of function $K$ that the function attains a maximum at the stationary point. Therefore, maximality also takes place for those partial variations of variables $P(x)$, $x \in \tilde{X}$, which leave equalities (8.2.1), (8.2.2) unchanged. The proof is complete.

Unfortunately, in writing and solving equation (8.2.7), the active domain $\tilde{X}$ is not known in advance, which complicates the problem. As a matter of fact, in order to find $\tilde{X}$, one should perform maximization of $C(\tilde{X})$ over $\tilde{X}$. In some important cases, for instance, when the indicated equations yield distribution $P(x) > 0$ on the entire space $\tilde{X} = X$, this can be avoided. Then $C(X)$ is the desired capacity $C$. Indeed, in consequence of the maximality of information $I_{xy}$ proven in Theorem 8.2, there is no need to consider smaller active domains $X_1 \subset X$ in the given case. Distribution $P_{01}(x)$ having any smaller active domain $X_1$ can be obtained as the limit of distributions $P'(x)$, for which $X$ is an active domain. But for $P'$ information $I'_{xy}$ is not greater than extreme information $I^0_{xy}$ due to Theorem 8.2. Therefore, the limit information $\lim I'_{xy}$ (which coincides with information $I^{01}_{xy}$ for $P_{01}$) is also not greater than the found extreme information $I^0_{xy}$. Thus, the specified solution is nothing else but the desired solution.

The statement of Theorem 8.2 for the discrete case can be extended to an arbitrary case corresponding to equations (8.2.9).

3. For the variational problem in consideration we can introduce thermodynamic potentials, which play the same role as the potentials in the first variational problem considered in Sections 3.2, 3.3 and 3.6. Relation (8.2.6) is already analogous to the well-known relation $H = (E - F)/T$ [see, for instance, (4.1.14a)] from thermodynamics. Therein, $C$ is an analog of entropy $H$, $a$ is an analog of internal energy $E$, $-\varphi$ is an analog of free energy $F$ and $\beta = 1/T$ is a parameter inverse to temperature. In what follows, the average cost $E[c(x)]$ as a function of thermodynamic parameters will be denoted by letter $R$. The next theorem confirms the indicated analogy.

Theorem 8.3. Channel capacity can be calculated by differentiating potential $\varphi$ with respect to temperature:

$$C = -d\varphi(T)/dT. \tag{8.2.11}$$
Proof. We will vary parameter $\beta = 1/T$ or, equivalently, parameter $a$ inside constraint (8.2.1). This variation is accompanied by variations of parameter $\varphi$ and distribution $P(x)$. We take an arbitrary point x of the active domain $\tilde{X}$. If variation $da$ is not too large, then the condition $P(x) + dP(x) > 0$ will remain valid after the variations, i.e. x will belong to the varied active domain. An equality of type (8.2.4) holds true at point x before and after the variation. We differentiate it and obtain the equation for variations

$$-\sum_{y \in Y} P(y \mid x)\frac{dP(y)}{P(y)} - [c(x) - \varphi]\,d\beta + \beta\,d\varphi = 0. \tag{8.2.12}$$

Here the summation is carried out over region $Y$, where $P(y) > 0$. Further, we multiply the latter equality by $P(x)$ and sum over x. Taking into account (8.2.5), (8.2.1) and keeping the normalization constraint

$$\sum_y dP(y) = d\sum_y P(y) = d1 = 0,$$

we obtain $(a - \varphi)\,d\beta = \beta\,d\varphi$. Due to (8.2.6) this yields

$$C = \beta^2\frac{d\varphi}{d\beta}, \tag{8.2.13}$$

and thereby (8.2.11) as well. This ends the proof.

If potential $\varphi(T)$ is known as a function of temperature $T$, then in order to determine the capacity by formula (8.2.11) it remains to specify the value of temperature $T$ (or parameter $\beta$) that corresponds to constraint (8.2.1). In order to do so, it is convenient to consider a new function

$$R = -T\frac{d\varphi}{dT} + \varphi. \tag{8.2.14}$$

Substituting (8.2.11) to (8.2.6), we obtain the equation

$$-T\frac{d\varphi(T)}{dT} + \varphi(T) = a, \quad\text{i.e.}\quad R = a, \tag{8.2.15}$$

serving to determine quantity $T$. It is convenient to consider formula (8.2.14) as the Legendre transform of function $\varphi(T)$:

$$R(S) = TS + \varphi(T(S)) \qquad \Big(S = -\frac{\partial\varphi}{\partial T}\Big).$$

Then, according to (8.2.15), capacity $C$ will be a root of the equation

$$R(C) = a. \tag{8.2.16}$$
Besides, if we consider the potential

$$\Gamma(\beta) = -\beta\varphi\Big(\frac{1}{\beta}\Big), \tag{8.2.17}$$

then equation (8.2.15) will take the form

$$\frac{d\Gamma(\beta)}{d\beta} = -a. \tag{8.2.18}$$

The value of $\beta$ can be found from the last equation. In so doing, formula (8.2.11), which determines channel capacity, can be transformed to

$$C = \Gamma(\beta) - \beta\frac{d\Gamma(\beta)}{d\beta}. \tag{8.2.19}$$

Formulae (8.2.18), (8.2.19) are analogous to formulae (4.1.14b), (3.2.9) if accounting for (4.1.14). This is a consequence of the fact that the same regular relations (8.2.6), (8.2.11) are satisfied in the case of the second variational problem as well as in the case of the first variational problem. Taking the differential of (8.2.15), we obtain

$$da/T = -d(d\varphi/dT) = dC, \tag{8.2.20}$$

which corresponds to the famous thermodynamic formula $dH = dE/T$ for a differential of entropy.

4. The following result turns out to be useful.

Theorem 8.4. Function $C(a)$ is concave (convex upward):

$$\frac{d^2C(a)}{da^2} \leq 0. \tag{8.2.21}$$
Proof. Let variation $da$ correspond to variations $dP(x)$ and $dP(y) = \sum_x P(y \mid x)\,dP(x)$. We multiply (8.2.12) by $dP(x)$ and sum it over $x \in \tilde{X}$ taking into account that

$$\sum_x dP(x) = 0, \qquad \sum_x c(x)\,dP(x) = da$$

due to (8.2.1), (8.2.2). This yields

$$\sum_{y \in Y}\frac{[dP(y)]^2}{P(y)} + da\,d\beta = 0. \tag{8.2.22}$$

Evidently, the first term cannot be negative herein. That is why $da\,d\beta \leq 0$. Dividing this inequality by positive quantity $(da)^2$, we have

$$d\beta/da \leq 0. \tag{8.2.23}$$

The desired inequality (8.2.21) follows from here if we take into account that $\beta = dC/da$ according to (8.2.20). The proof is complete.

As is seen from (8.2.22), the equality sign in (8.2.21), (8.2.23) relates to the case when all $dP(y)/da = 0$ within region $Y$.

A typical behaviour of curve $C(a)$ is represented in Figure 8.1. In consequence of relation (8.2.20), which can be written as $dC/da = \beta$, the maximum point of function $C(a)$ corresponds to the value $\beta = 0$. For this particular value of $\beta$ equation (8.2.7) takes the form

$$\sum_{y \in Y} P(y \mid x)\ln\frac{P(y \mid x)}{P(y)} = C. \tag{8.2.24}$$

Here $c(x)$ and $a$ are completely absent. Equation (8.2.24) corresponds to channel $P(y \mid x)$ defined without accounting for conditions (8.1.1), (8.1.2), (8.1.6). Indeed, solving the variational problem with no account of condition (8.1.6) leads exactly to equation (8.2.24), which needs to be complemented with constraint (8.2.2). We denote this maximum value $C$ by $C_{max}$.

Now we discuss corollaries of Theorem 8.4. Function $a = R(C)$ is an inverse function to function $C(a)$. Therefore, it is defined only for $C \leq C_{max}$ and is two-valued (at least on some interval adjacent to $C_{max}$), if sign $<$ takes place in (8.2.21) for $\beta = 0$. It holds for one branch that

$$\frac{dR(C)}{dC} > 0, \quad\text{i.e. } T > 0 \text{ or } \beta > 0.$$

We call this branch normal. For the other, anomalous branch we have

$$\frac{dR(C)}{dC} < 0, \qquad \beta < 0.$$

It is easy to comprehend that the normal branch is convex:

$$\frac{d^2R(C)}{dC^2} \equiv \frac{dT}{dC} > 0 \qquad (C < C_{max}).$$

In its turn, the anomalous branch is concave:

$$\frac{d^2R(C)}{dC^2} \equiv \frac{dT}{dC} < 0.$$
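Condition (8.2.24) can be checked numerically: at the optimal input distribution, the sum $\sum_y P(y \mid x)\ln[P(y \mid x)/P(y)]$ takes the same value $C$ for every x in the active domain. The sketch below finds the optimum with the Blahut-Arimoto iteration, a standard numerical scheme that is not part of the text and is used here only as a convenient solver. The function names and the example channel are hypothetical, and the example is chosen so that all input probabilities stay positive.

```python
import math

def row_divergences(p_x, channel):
    """D_x = sum_y P(y|x) ln[P(y|x)/P(y)], the left-hand side of (8.2.24)."""
    n_y = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_y)]
    return [
        sum(pyx * math.log(pyx / p_y[y]) for y, pyx in enumerate(row) if pyx > 0.0)
        for row in channel
    ]

def blahut_arimoto(channel, iterations=1000):
    """Iterate P(x) <- P(x) exp(D_x) / normalizer, starting from the uniform P(x)."""
    n_x = len(channel)
    p_x = [1.0 / n_x] * n_x
    for _ in range(iterations):
        d = row_divergences(p_x, channel)
        weights = [p * math.exp(dx) for p, dx in zip(p_x, d)]
        total = sum(weights)
        p_x = [w / total for w in weights]
    d = row_divergences(p_x, channel)
    capacity = sum(p * dx for p, dx in zip(p_x, d))  # I_xy at the final P(x)
    return capacity, p_x, d

# Hypothetical non-symmetric binary channel, no cost constraint (beta = 0).
channel = [[0.9, 0.1], [0.3, 0.7]]
capacity, p_x, d = blahut_arimoto(channel)
print("C =", capacity)
print("P(x) =", p_x)
print("left-hand sides of (8.2.24):", d)
```

At convergence both entries of `d` coincide with the printed capacity, which is exactly the statement of (8.2.24) for an active domain that covers the whole input space.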
8.4 Symmetric channels

In some particular cases, a channel capacity can be computed by methods simpler than those of Section 8.3. First, we consider the so-called symmetric channels. We provide an analysis in the discrete case. Channel $[P(y \mid x), c(x)]$ is called symmetric at input if every row of the matrix

$$P(y \mid x) = \begin{pmatrix} P(y_1 \mid x_1) & \cdots & P(y_r \mid x_1) \\ \vdots & & \vdots \\ P(y_1 \mid x_r) & \cdots & P(y_r \mid x_r) \end{pmatrix} \tag{8.4.1}$$

is a permutation of the same values $p_1, \ldots, p_r$ playing the role of probabilities. Evidently, for such channels the expression

$$H_y(\mid x) = -\sum_y P(y \mid x)\ln P(y \mid x)$$

is independent of x and equal to

$$H \equiv -\sum_j p_j\ln p_j. \tag{8.4.2}$$

It is not difficult to understand that (8.4.2) coincides with $H_{y|x}$. Therefore, formula (8.3.5) can be represented as

$$\sum_y P(y \mid x)\ln\sum_{x'} P(x')P(y \mid x') + C - \beta a + H_{y|x} = -\beta c(x) \qquad (x = x_1, \ldots, x_r), \tag{8.4.3}$$

or, equivalently,

$$\sum_y P(y \mid x)\ln\sum_{x'} P(x')P(y \mid x') + C + H_{y|x} = 0 \qquad (x = x_1, \ldots, x_r), \tag{8.4.4}$$

if $\beta = 0$ or a constraint associated with $c(x)$ is absent. Relations (8.4.3) constitute equations pertaining to unknown variables $P(x_1), \ldots, P(x_r)$.

Since $H_y(\mid x) = H_{y|x} = H$ for channels that are symmetric at input, the amount of information $I_{xy} = H_y - H_{y|x}$ for such channels is equal to $I_{xy} = H_y - H$. Hence, the maximization of $I_{xy}$ over $P(x)$ is reduced to the maximization of entropy $H_y$ over $P(x)$:

$$C = \max H_y - H.$$

Formula (8.4.4) can be simplified even more for completely symmetric channels. A channel is called completely symmetric if every column of matrix (8.4.1) is also a permutation of the same values (say, $p'_j$, $j = 1, \ldots, r$).

It is easy to verify that the uniform distribution

$$P(x) = \text{const} = 1/r \tag{8.4.5}$$

is a solution of equations (8.4.4) in the case of completely symmetric channels. Indeed, for completely symmetric channels the probabilities

$$P(y) = \sum_x P(x)P(y \mid x) = \frac{1}{r}\sum_x P(y \mid x)$$

are independent of y, because the sum

$$\sum_x P(y \mid x) = \sum_j p'_j$$

is the same for all columns. Therefore, the distribution over y is uniform:

$$P(y) = \text{const} = 1/r. \tag{8.4.6}$$

It follows from here that $H_y = \ln r$ and equalities (8.4.4) are satisfied if

$$C = \ln r - H. \tag{8.4.7}$$

Since

$$r = \sum_{j=1}^{r}\frac{p_j}{p_j} = E\Big[\frac{1}{p_j}\Big], \tag{8.4.8}$$

result (8.4.7) can be rewritten as

$$C = \ln E[1/p_j] - E[\ln 1/p_j]. \tag{8.4.9}$$
Here $E$ means averaging over j with weight $p_j$.

The provided consideration can also be generalized to the case when variables x and y have continuous nature. We can tailor a definition of symmetric channels for this case by replacing a permutation of elements of rows or columns of matrix $P(x, y)$ with a permutation of corresponding intervals. Formula (8.4.7) will have the form

$$C = \ln r + \int \ln\frac{P(dy)}{Q(dy)}\,P(dy) = \ln r - H^{P/Q},$$

where

$$r = \int\frac{Q(dy)}{P(dy)}\,P(dy) = \int Q(dy).$$

Instead of (8.4.9) we have

$$C = \ln E\Big[\frac{Q(dy)}{P(dy)}\Big] - E\Big[\ln\frac{Q(dy)}{P(dy)}\Big].$$

Here the averaging corresponds to weight P(dx); $Q(dy)$ is an auxiliary measure using which the permutation of intervals is carried out (equal intervals are interchanged in a scale, for which the measure $Q$ is uniform).
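For a discrete completely symmetric channel, formulae (8.4.7) and (8.4.9) are easy to check against a direct evaluation of $I_{xy}$ at the uniform input distribution. In the sketch below the row values `(0.7, 0.2, 0.1)`, the circulant construction of the matrix and all function names are illustrative assumptions; a circulant matrix is used only because each of its rows and columns is a permutation of the same values.

```python
import math

def entropy(probs):
    """Entropy in nats of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def symmetric_capacity(row):
    """C = ln r - H, formula (8.4.7), for a completely symmetric channel."""
    return math.log(len(row)) - entropy(row)

def capacity_via_expectations(row):
    """The same quantity in the form (8.4.9); averaging over j with weight p_j > 0."""
    mean_inverse = sum(p * (1.0 / p) for p in row)           # E[1/p_j] = r
    mean_log_inverse = sum(p * math.log(1.0 / p) for p in row)
    return math.log(mean_inverse) - mean_log_inverse

def information_uniform_input(channel):
    """I_xy for P(x) = 1/r, computed directly from the definition."""
    r = len(channel)
    p_y = [sum(row[y] for row in channel) / r for y in range(len(channel[0]))]
    return sum(
        (1.0 / r) * pyx * math.log(pyx / p_y[y])
        for row in channel
        for y, pyx in enumerate(row)
        if pyx > 0.0
    )

row = [0.7, 0.2, 0.1]
# Circulant matrix: every row and every column is a permutation of `row`.
channel = [row[-k:] + row[:-k] for k in range(len(row))]
print(symmetric_capacity(row))
print(capacity_via_expectations(row))
print(information_uniform_input(channel))
```

All three printed values coincide up to rounding, as (8.4.5) to (8.4.9) predict.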
8.5 Binary channels

1. The simplest particular case of a completely symmetric channel is a binary symmetric channel having two states both at input and output, and having the following matrix of transition probabilities:

$$P(y \mid x) = \begin{pmatrix} 1 - p & p \\ p & 1 - p \end{pmatrix}, \qquad p < 1/2. \tag{8.5.1}$$

According to (8.4.5), (8.4.6), the capacity of such a channel is attained for the uniform distributions

$$P(x) = 1/2, \qquad P(y) = 1/2. \tag{8.5.2}$$

Due to (8.4.7) it is equal to

$$C = \ln 2 - h_2(p) = \ln 2 + p\ln p + (1 - p)\ln(1 - p). \tag{8.5.3}$$

It follows from (8.5.1), (8.5.2) that

$$P(x, y) = \begin{pmatrix} (1 - p)/2 & p/2 \\ p/2 & (1 - p)/2 \end{pmatrix}. \tag{8.5.4}$$
Further, we compute thermodynamic potential $\mu(t)$ (7.4.3) for a binary symmetric channel. This thermodynamic potential allows us to estimate the probability of decoding error (7.4.2) according to Theorem 7.3. Substituting (8.5.2), (8.5.4) to (7.4.3) we obtain

$$\mu(t) = \ln\{[(1 - p)^{1-t} + p^{1-t}]\,2^{-t}\} = \ln[p^{1-t} + (1 - p)^{1-t}] - t\ln 2. \tag{8.5.5}$$

Due to (8.5.5) equation (7.4.4) takes the form

$$\frac{p^{1-s}\ln p + (1 - p)^{1-s}\ln(1 - p)}{p^{1-s} + (1 - p)^{1-s}} + \ln 2 = R, \tag{8.5.6}$$

and coefficient $\alpha$ in exponent (7.4.2) can be written as

$$\alpha = -sR - \ln[p^{1-s} + (1 - p)^{1-s}] + s\ln 2 = (1 - s)\frac{p^{1-s}\ln p + (1 - p)^{1-s}\ln(1 - p)}{p^{1-s} + (1 - p)^{1-s}} - \ln[p^{1-s} + (1 - p)^{1-s}] + \ln 2 - R. \tag{8.5.7}$$

We can rewrite formulae (8.5.7), (8.5.6) in a more convenient way by introducing a new variable

$$\rho = \frac{p^{1-s}}{p^{1-s} + (1 - p)^{1-s}}, \qquad 1 - \rho = \frac{(1 - p)^{1-s}}{p^{1-s} + (1 - p)^{1-s}}$$

instead of s. Then, taking into account

$$(1 - s)\ln p = \ln p^{1-s} = \ln\rho + \ln[p^{1-s} + (1 - p)^{1-s}],$$

$$(1 - s)\ln(1 - p) = \ln(1 - p)^{1-s} = \ln(1 - \rho) + \ln[p^{1-s} + (1 - p)^{1-s}],$$

we conclude from (8.5.7) that

$$\alpha = \rho\ln\rho + (1 - \rho)\ln(1 - \rho) + \ln 2 - R = \ln 2 - h_2(\rho) - R. \tag{8.5.8}$$

Moreover, we can simplify equation (8.5.6) similarly to get

$$\rho\ln p + (1 - \rho)\ln(1 - p) = R - \ln 2. \tag{8.5.9}$$
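Relations (8.5.7) to (8.5.9) lend themselves to a quick numerical cross-check. Equation (8.5.9) is linear in $\rho$, so $\rho$ can be found in closed form for a given rate $R$; the parameter $s$ is then recovered from the definition of $\rho$, and $\alpha$ can be evaluated both by (8.5.8) and by the first expression in (8.5.7). The sketch below does this under the assumption $0 < p < 1/2$ and $R$ close enough to $C$ that $0 < \rho < 1$; the function names and sample values are hypothetical.

```python
import math

def h2(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def bsc_exponent(p, R):
    """Return (rho, s, alpha) for a binary symmetric channel with 0 < p < 1/2."""
    # (8.5.9) solved for rho
    rho = (R - math.log(2.0) - math.log(1.0 - p)) / (math.log(p) - math.log(1.0 - p))
    # From the definition of rho: rho / (1 - rho) = (p / (1 - p)) ** (1 - s)
    s = 1.0 - math.log(rho / (1.0 - rho)) / math.log(p / (1.0 - p))
    alpha = math.log(2.0) - h2(rho) - R            # (8.5.8)
    return rho, s, alpha

def alpha_from_s(p, R, s):
    """First expression in (8.5.7)."""
    norm = p ** (1.0 - s) + (1.0 - p) ** (1.0 - s)
    return -s * R - math.log(norm) + s * math.log(2.0)

p, R = 0.1, 0.2                                    # here R < C = ln 2 - h2(p)
rho, s, alpha = bsc_exponent(p, R)
print("rho =", rho, "s =", s)
print("alpha by (8.5.8):", alpha)
print("alpha by (8.5.7):", alpha_from_s(p, R, s))
```

For $R = C$ the same functions return $\rho = p$, $s = 0$ and $\alpha = 0$, so the exponent vanishes exactly at capacity.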
Accounting for (8.5.9), (8.5.3) it is easy to see that the condition $R < C$ means the inequality

$$\rho\ln p + (1 - \rho)\ln(1 - p) < p\ln p + (1 - p)\ln(1 - p) \quad\text{or}\quad (\rho - p)\ln\frac{p}{1 - p} < 0,$$

i.e. $\rho > p$, since $\frac{p}{1 - p} < 1$ (because $p < \frac{1}{2}$). As is seen from (8.5.9), the value

$$\rho = \frac{1}{2} + \frac{\ln[4p(1 - p)]}{2\ln[(1 - p)/p]} = \frac{\ln[2(1 - p)]}{\ln[(1 - p)/p]}$$

corresponds to $R = 0$. When $R$ grows to $C$, the value of $\rho$ decreases to $p$. Stronger results can be derived with the help of Theorem 7.5 but we will not linger upon this subject.

2. Let us now consider a binary, but non-symmetric channel $[P(y \mid x), c(x)]$, which is characterized by the transition probabilities

$$P(y \mid x) = \begin{pmatrix} P(y_1 \mid x_1) & P(y_2 \mid x_1) \\ P(y_1 \mid x_2) & P(y_2 \mid x_2) \end{pmatrix} = \begin{pmatrix} 1 - \alpha & \alpha \\ \alpha' & 1 - \alpha' \end{pmatrix} \tag{8.5.10}$$

and the cost function

$$c(x) = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix}.$$

In this case transformation (8.3.2) is represented as follows:

$$Lf = \begin{pmatrix} (1 - \alpha)f_1 + \alpha f_2 \\ \alpha' f_1 + (1 - \alpha')f_2 \end{pmatrix}.$$

Its respective inverse transformation turns out to be

$$L^{-1}g = \frac{1}{D}\begin{pmatrix} 1 - \alpha' & -\alpha \\ -\alpha' & 1 - \alpha \end{pmatrix}\begin{pmatrix} g_1 \\ g_2 \end{pmatrix} = \frac{1}{D}\begin{pmatrix} (1 - \alpha')g_1 - \alpha g_2 \\ -\alpha' g_1 + (1 - \alpha)g_2 \end{pmatrix},$$

where $D = 1 - \alpha - \alpha'$. Furthermore, according to (8.5.10) we have

$$H_y(\mid 1) = h_2(\alpha), \qquad H_y(\mid 2) = h_2(\alpha').$$

That is why

$$L^{-1}[\beta c(x) + H_y(\mid x)] = \frac{1}{D}\begin{pmatrix} (1 - \alpha')[\beta c_1 + h_2(\alpha)] - \alpha[\beta c_2 + h_2(\alpha')] \\ -\alpha'[\beta c_1 + h_2(\alpha)] + (1 - \alpha)[\beta c_2 + h_2(\alpha')] \end{pmatrix},$$

and thereby formula (8.3.11) yields

$$Z = \exp\Big\{\frac{\alpha'}{D}[\beta c_1 + h_2(\alpha)] + \frac{\alpha}{D}[\beta c_2 + h_2(\alpha')]\Big\}(\xi_1 + \xi_2), \tag{8.5.11}$$

where

$$\xi_1 = \exp\Big\{-\frac{1}{D}[\beta c_1 + h_2(\alpha)]\Big\}, \qquad \xi_2 = \exp\Big\{-\frac{1}{D}[\beta c_2 + h_2(\alpha')]\Big\}.$$

Further, we employ (8.3.12) to obtain

$$a = -\frac{\alpha' c_1 + \alpha c_2}{D} + \frac{1}{D}\,\frac{c_1\xi_1 + c_2\xi_2}{\xi_1 + \xi_2}. \tag{8.5.12}$$

Next, since $C = \beta a + \Gamma = \beta a + \ln Z$ [see (8.2.6), (8.2.17)], we obtain from (8.5.11), (8.5.12) that

$$C = \frac{\alpha' h_2(\alpha) + \alpha h_2(\alpha')}{D} + \ln(\xi_1 + \xi_2) + \frac{\beta}{D}\,\frac{c_1\xi_1 + c_2\xi_2}{\xi_1 + \xi_2}. \tag{8.5.13}$$

In particular, if a constraint with $c(x)$ is absent, then, supposing $\beta = 0$, we find

$$C = \frac{\alpha' h_2(\alpha) + \alpha h_2(\alpha')}{1 - \alpha - \alpha'} + \ln\big[e^{-h_2(\alpha)/(1 - \alpha - \alpha')} + e^{-h_2(\alpha')/(1 - \alpha - \alpha')}\big].$$

In the other particular case, when $\alpha = \alpha'$; $c_1 = c_2$ (symmetric channel), it follows from (8.5.13) that

$$C = \frac{2\alpha h_2(\alpha)}{1 - 2\alpha} + \ln 2 - \frac{\beta c_1 + h_2(\alpha)}{1 - 2\alpha} + \frac{\beta c_1}{1 - 2\alpha} = \ln 2 - h_2(\alpha),$$

which naturally coincides with (8.5.3).
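The closed-form capacity of the non-symmetric binary channel for $\beta = 0$ can be compared with a direct maximization of $I_{xy} = H_y - H_{y|x}$ over the input distribution. In the sketch below the second crossover probability $P(y_1 \mid x_2)$ is written `alpha_prime`, the maximization is a plain grid search over $q = P(x_1)$, and $D = 1 - \alpha - \alpha'$ is assumed to be non-zero; the function names and numerical values are hypothetical.

```python
import math

def h2(p):
    """Binary entropy in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def asymmetric_capacity(alpha, alpha_prime):
    """Capacity for beta = 0 of the channel (8.5.10), i.e. (8.5.13) without the cost term."""
    D = 1.0 - alpha - alpha_prime
    h, h_prime = h2(alpha), h2(alpha_prime)
    return (alpha_prime * h + alpha * h_prime) / D + math.log(
        math.exp(-h / D) + math.exp(-h_prime / D)
    )

def mutual_information(q, alpha, alpha_prime):
    """I_xy = H_y - H_{y|x} for P(x1) = q and the transition matrix (8.5.10)."""
    p_y2 = q * alpha + (1.0 - q) * (1.0 - alpha_prime)
    return h2(p_y2) - q * h2(alpha) - (1.0 - q) * h2(alpha_prime)

def grid_capacity(alpha, alpha_prime, steps=20000):
    """Brute-force maximum of I_xy over a uniform grid of q."""
    return max(mutual_information(i / steps, alpha, alpha_prime) for i in range(steps + 1))

print(asymmetric_capacity(0.1, 0.3), grid_capacity(0.1, 0.3))
print(asymmetric_capacity(0.1, 0.1), math.log(2.0) - h2(0.1))   # symmetric case
```

The grid search slightly underestimates the maximum, so agreement is expected only up to the grid resolution; the symmetric case reproduces $\ln 2 - h_2(\alpha)$ to rounding accuracy.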
8.6 Gaussian channels

1. Let x and y be points from multidimensional Euclidean spaces $X$, $Y$ of dimensions r and s, respectively. We call a channel $[p(y \mid x), c(x)]$ Gaussian if:

1. The transition probability density is Gaussian:

$$p(y \mid x) = (2\pi)^{-s/2}\,{\det}^{1/2}\|A_{ij}\|\,\exp\Big\{-\frac{1}{2}\sum_{i,j} A_{ij}(y_i - m_i)(y_j - m_j)\Big\}, \tag{8.6.1}$$

2. The $A_{ij}$ are independent of x, and the mean values $m_i$ depend on x linearly:

$$m_i = m_i^0 + \sum_k d_{ik}x_k, \tag{8.6.2}$$

3. The cost function $c(x)$ is a polynomial of degree at most 2 in x:

$$c(x) = c^{(0)} + \sum_k c_k^{(1)}x_k + \frac{1}{2}\sum_{k,l} c_{kl}x_kx_l. \tag{8.6.3}$$

Obviously, we can choose the origin in the spaces $X$ and $Y$ (making the substitution $x_k + \sum_l c^{-1}_{kl}c_l^{(1)} \to x_k$, $y_i - m_i^0 \to y_i$), so that the terms in (8.6.3) and in the exponent (8.6.1), which are linear in x, y, vanish. Furthermore, the constant summand in (8.6.3) is inessential and can be omitted. Therefore, without loss of generality, we can keep only bilinear terms in (8.6.1), (8.6.3). Using matrix notation, let us write (8.6.1), (8.6.3) as follows:

$$p(y \mid x) = {\det}^{1/2}\Big(\frac{A}{2\pi}\Big)\exp\Big\{-\frac{1}{2}(y^T - x^Td^T)A(y - dx)\Big\}, \tag{8.6.4}$$

$$c(x) = \frac{1}{2}x^Tcx. \tag{8.6.5}$$

Here we imply a matrix product of two adjacent matrices. Character T denotes transposition; x, y are column-matrices and $x^T$, $y^T$ are row-matrices, correspondingly:

$$x^T = (x_1, \ldots, x_r), \qquad y^T = (y_1, \ldots, y_s).$$

Certainly, matrix $A$ is a non-singular positive definite matrix, which is inverse to the correlation matrix $K = A^{-1}$. Matrix $c$ is also assumed to be non-singular and positive definite. As is seen from (8.6.4), the action of disturbances in the channel reduces to the addition

$$y = dx + z \tag{8.6.6}$$

of noises $z^T = (z_1, \ldots, z_s)$ having a Gaussian distribution with zero mean vector and correlation matrix $K$:

$$E[z] = 0, \qquad E[zz^T] = K \tag{8.6.7}$$

(a matrix representation is applied in the latter formula as well).

2. Let us turn to computation of capacity $C$ and probability densities $p(x)$, $p(y)$ for the channel in consideration. To this end, consider equation (8.2.9), which (as was mentioned in Section 8.2) is valid in subspace $\tilde{X}$, where there are non-zero probabilities $p(x)\Delta x$. In our case, $\tilde{X}$ will be a Euclidean subspace of the original r-dimensional Euclidean space $X$. In that subspace, of course, matrix $c$ (we can define a scalar product with the help of it) will also be non-singular and positive definite. We shall seek the probability density function $p(x)$ in the Gaussian form:

$$p(x) = (2\pi)^{-\tilde{r}/2}\,{\det}^{-1/2}K_x\,\exp\Big\{-\frac{1}{2}x^TK_x^{-1}x\Big\}, \qquad x \in \tilde{X} \tag{8.6.8}$$

(so that

$$E[x] = 0, \qquad E[xx^T] = K_x). \tag{8.6.9}$$

Here $\tilde{r}$ is the dimension of space $\tilde{X}$; $K_x$ is an unknown (non-singular in $\tilde{X}$) positive definite correlation matrix. Mean values $E[x]$ are selected to be zeros according to (8.6.4), (8.6.5) on the basis of symmetry considerations.
Taking into account (8.6.6), it follows from the Gaussian nature of random variables x and z that y are Gaussian random variables as well. Therefore, averaging (8.6.6) and accounting for (8.6.7), (8.6.9), it is easy to find their mean value and correlation matrix

$$E[y] = 0, \qquad E[yy^T] = K_y = K + dK_xd^T. \tag{8.6.10}$$

Therefore,

$$p(y) = {\det}^{-1/2}(2\pi K_y)\exp\Big\{-\frac{1}{2}y^TK_y^{-1}y\Big\} = {\det}^{-1/2}[2\pi(K + dK_xd^T)]\exp\Big\{-\frac{1}{2}y^T(K + dK_xd^T)^{-1}y\Big\}. \tag{8.6.11}$$

Substituting (8.6.5), (8.6.11) to (8.2.9), we obtain

$$-\frac{1}{2}\ln[\det(A)\det(K_y)] + \frac{1}{2}E[(y^T - x^Td^T)A(y - dx) - y^TK_y^{-1}y \mid x] = \beta a - \frac{\beta}{2}x^Tcx - C, \qquad x \in \tilde{X}. \tag{8.6.12}$$

When taking a conditional expectation we account for

$$E[(y - dx)(y^T - x^Td^T) \mid x] = E[zz^T] = K,$$

$$E[yy^T \mid x] = E[(dx + z)(x^Td^T + z^T) \mid x] = dxx^Td^T + E[zz^T] = dxx^Td^T + K$$

due to (8.6.6), (8.6.7). That is why (8.6.12) takes the form

$$-\ln[\det K_yA] + \operatorname{tr} KA - \operatorname{tr} K_y^{-1}K - x^Td^TK_y^{-1}dx = 2\beta a - \beta x^Tcx - 2C, \qquad x \in \tilde{X}. \tag{8.6.13}$$

Supposing, in particular, that $x = 0$, we have

$$-\ln[\det K_yA] + \operatorname{tr}(A - K_y^{-1})K = 2\beta a - 2C \tag{8.6.14}$$

and [after comparing the latter equality with (8.6.13)]

$$x^Td^TK_y^{-1}dx = \beta x^Tcx, \qquad x \in \tilde{X}. \tag{8.6.15}$$

From this moment we perceive operators $c$, $d^TK_y^{-1}d$ and others as operators acting on vectors x from subspace $\tilde{X}$ and transforming them into vectors from the same subspace (that is, $\tilde{x}$-operators are understood as corresponding projections of initial x-operators). Then, due to the freedom of selection of x from $\tilde{X}$, equality (8.6.15) yields $d^TK_y^{-1}d = \beta c$. Substituting the equality
$$K_y^{-1} = [K(1_y + AdK_xd^T)]^{-1} = (1_y + AdK_xd^T)^{-1}A \tag{8.6.16}$$

following from (8.6.10) [since $K = A^{-1}$] to the last formula, we obtain

$$d^T(1_y + AdK_xd^T)^{-1}Ad = \beta c, \tag{8.6.17}$$

where $1_y$ is an identity y-operator. Employing the operator equality

$$A^*(1 + B^*A^*)^{-1} = (1 + A^*B^*)^{-1}A^* \tag{8.6.18}$$

for $A^* = d^T$, $B^* = AdK_x$ [refer to formula (A.1.1) from Appendix], we reduce equation (8.6.17) to the form

$$(1_x + \tilde{A}K_x)^{-1}\tilde{A} = \beta c. \tag{8.6.19}$$

Here

$$\tilde{A} = d^TAd \tag{8.6.20}$$

is an x-operator and $1_x$ is an identity x-operator.

Operator $c$ (both the initial r-dimensional operator and its $\tilde{r}$-dimensional projection) is non-singular and positive definite according to the aforesaid. It follows from (8.6.19) that the operator $(1_x + \tilde{A}K_x)^{-1}\tilde{A}$ is non-singular as well. It is not difficult to conclude from here that each of the operators $(1_x + \tilde{A}K_x)^{-1}$, $\tilde{A}$ is non-singular. Indeed, the determinant of the matrix product $(1_x + \tilde{A}K_x)^{-1}\tilde{A}$, which is equal to the product of the determinants of the respective matrices, could not be different from zero if at least one factor-determinant were equal to zero. Non-singularity of $\tilde{A}$ follows from the inequality $\det\tilde{A} \neq 0$. Thus, there exists an inverse operator $\tilde{A}^{-1} = (d^TAd)^{-1}$. Using this operator, we solve equation (8.6.19):

$$(1_x + \tilde{A}K_x)^{-1} = \beta c\tilde{A}^{-1}, \qquad 1_x + \tilde{A}K_x = \tilde{A}(\beta c)^{-1}. \tag{8.6.21}$$

We obtain from here that

$$K_x = \frac{1}{\beta}c^{-1} - \tilde{A}^{-1} = \frac{1}{\beta}c^{-1} - (d^TAd)^{-1}. \tag{8.6.22}$$

Further, taking into account (8.6.10) we have

$$K_y = A^{-1} - d(d^TAd)^{-1}d^T + \frac{1}{\beta}dc^{-1}d^T. \tag{8.6.23}$$
Hence, distributions $p(x)$ (8.6.8), $p(y)$ (8.6.11) have been found.

3. It remains to determine capacity $C$ and average energy (costs) $a = E[c(x)]$. Due to (8.6.5) the latter is evidently equal to

$$a = \frac{1}{2}E[x^Tcx] = \frac{1}{2}\operatorname{tr} cK_x \tag{8.6.24}$$

or, if we substitute (8.6.22) hereto,

$$a = \frac{1}{2}\operatorname{tr}\Big[\frac{1}{\beta}1_x - c\tilde{A}^{-1}\Big] = \frac{\tilde{r}}{2}T - \frac{1}{2}\operatorname{tr} c\tilde{A}^{-1}. \tag{8.6.25}$$

Here we have taken into account that the trace of the identity x-operator is equal to the dimension $\tilde{r}$ of space $\tilde{X}$, i.e. 'the number of active degrees of freedom' of random variable x. The corresponding 'thermal capacity' equals

$$da/dT = \frac{\tilde{r}}{2} \tag{8.6.26}$$

(if $\tilde{r}$ does not change for variation $dT$). Thus, according to the laws of classical statistical thermodynamics, there is average energy $T/2$ per every degree of freedom.

In order to determine the capacity, we can use formulae (8.6.14), (8.6.25) [when applying formulae (8.6.16), (8.6.21)] or regular formulae for the information of communication between Gaussian variables, which yield

$$C = \frac{1}{2}\ln\det K_yA = \frac{1}{2}\operatorname{tr}\ln K_yA.$$

Further, we substitute (8.6.23) hereto. It is easy to prove (moving d from left to right with the help of formula (8.6.41)) that

$$C = \frac{1}{2}\operatorname{tr}\ln[1 - d(d^TAd)^{-1}d^TA + Tdc^{-1}d^TA] = \frac{1}{2}\operatorname{tr}\ln(T\tilde{A}c^{-1}), \qquad (\tilde{A} = d^TAd).$$

Otherwise,

$$C = \frac{1}{2}\operatorname{tr}\ln(\tilde{A}c^{-1}) + \frac{1}{2}\ln(T)\operatorname{tr} 1_x = \frac{1}{2}\operatorname{tr}\ln(\tilde{A}c^{-1}) + \frac{\tilde{r}}{2}\ln T. \tag{8.6.27}$$
The logarithmic dependence of $C$ on temperature $T$ obtained above can already be derived from formula (8.6.25) if we take into account the general thermodynamic relationship (8.2.20). In the given case it takes the form $dC = da/T = \tilde r\, dT / 2T$ due to (8.6.26). There is a differential of $(\tilde r/2) \ln T + \text{const}$ on the right.

Now we find thermodynamic functions $\varphi(T)$, $C(a)$ for the channel in consideration. Using (8.2.6), (8.6.25), (8.6.27), (8.2.17), we have
$$-\varphi(T) = TC - a = \frac{\tilde r}{2} T \ln T + \frac{1}{2} T \operatorname{tr} \ln(\tilde A c^{-1}) - \frac{\tilde r}{2} T + \frac{1}{2} \operatorname{tr} \tilde A^{-1} c,$$
$$\Gamma(\beta) = \frac{1}{2} \operatorname{tr} \ln \frac{\tilde A c^{-1}}{\beta e} + \frac{\beta}{2} \operatorname{tr} \tilde A^{-1} c. \tag{8.6.28}$$
Further, excluding parameter $T$ from (8.6.25), (8.6.27), we obtain
$$C = \frac{\tilde r}{2} \ln\left[\frac{1}{\tilde r} \operatorname{tr} c \tilde A^{-1} + \frac{2}{\tilde r} a\right] + \frac{1}{2} \operatorname{tr} \ln(\tilde A c^{-1}). \tag{8.6.29}$$
These functions, as was pointed out in Section 8.2, simplify the use of conditions (8.1.1), (8.1.2). Quantity $C$ increases with a growth of $a$; according to the terminology of Section 8.2, the dependence between $C$ and $a$ is normal in the given case. If a condition is of type (8.1.1), then the channel capacity is determined from (8.6.29) by the substitution $a = a_2$.

4. Examples. The simplest particular case is when matrix $\tilde A c^{-1}$ is a multiple of the identity matrix:
$$\tilde A c^{-1} = \frac{1}{2N} 1_x. \tag{8.6.30}$$
This takes place in the case when, say,
$$c = 2 \cdot 1_x, \qquad K = N \cdot 1_y, \qquad d_{il} = \begin{cases} 1, & l = 1, \ldots, \tilde r, \\ 0, & l > \tilde r. \end{cases} \tag{8.6.31}$$
Substituting (8.6.30) to (8.6.29), we obtain in this case that
$$C = \frac{\tilde r}{2} \ln\left(1 + \frac{a}{\tilde r N}\right), \qquad \left(T = 2N + \frac{2a}{\tilde r}\right), \tag{8.6.32}$$
since
$$\operatorname{tr} c \tilde A^{-1} = 2N \operatorname{tr} 1_x = 2N \tilde r, \qquad \operatorname{tr} \ln(\tilde A c^{-1}) = -\tilde r \ln(2N).$$
According to formulae (8.6.22), (8.6.25) the correlation matrix at input is represented as
$$K_x = \left(\frac{T}{2} - N\right) 1_x = \frac{a}{\tilde r} 1_x.$$
Next, we consider a somewhat more difficult example. We suppose that spaces $X$, $Y$ coincide with each other, matrices $\frac{1}{2} c$, $d$ are identity matrices and matrix $K$ is diagonal but not a multiple of the identity matrix: $K = \|N_i \delta_{ij}\|$. Consequently, disturbances are independent but not identical. In this case
$$\tilde A c^{-1} = K^{-1} c^{-1} = \left\| \frac{\delta_{ij}}{2 N_i} \right\|. \tag{8.6.33}$$
Subspace $\tilde X$ is a space of a smaller dimensionality. For this subspace, not all components $(x_1, \ldots, x_r)$ are different from zero but only a portion of them, say, components $x_i$, $i \in L$. This means that if $i$ does not belong to set $L$, then $x_i = 0$ for $(x_1, \ldots, x_r) \in \tilde X$. Under such notations it is more appropriate to represent matrix (8.6.33) as follows:
$$\|\delta_{ij} / 2 N_i\|, \qquad i \in L, \quad j \in L.$$
Let us give a number of other relations with the help of the introduced set $\tilde X$. For the example in consideration, formula (8.6.22) takes the form
$$(K_x)_{ij} = \left(\frac{T}{2} - N_i\right) \delta_{ij} \tag{8.6.34}$$
and also due to (8.6.33) equalities (8.6.27), (8.6.24) yield
$$C = \frac{1}{2} \sum_{i \in L} \ln \frac{T}{2 N_i}, \qquad a = \sum_{i \in L} (K_x)_{ii} = \sum_{i \in L} \left(\frac{T}{2} - N_i\right). \tag{8.6.35}$$
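Formulae (8.6.34), (8.6.35) are easy to evaluate numerically. The following is a minimal sketch, not taken from the book: the function names `water_filling` and `temperature_for_energy` and the sample noise variances are hypothetical, and the active set $L$ is taken to be the indices with $N_i < T/2$, which is what positive definiteness of $K_x$ in (8.6.34) requires. Capacity is returned in nats.

```python
import math

def water_filling(noise, T):
    """Evaluate (8.6.34)-(8.6.35) for independent noise variances N_i at temperature T.

    Assumption: the active set L is {i : N_i < T/2}, i.e. the indices for which
    (K_x)_ii = T/2 - N_i is positive.
    """
    active = [i for i, n in enumerate(noise) if n < T / 2]
    kx = [T / 2 - n if n < T / 2 else 0.0 for n in noise]   # diagonal of K_x
    a = sum(kx)                                             # average energy, (8.6.35)
    C = 0.5 * sum(math.log(T / (2 * noise[i])) for i in active)
    return active, kx, a, C

def temperature_for_energy(noise, a_target, tol=1e-12):
    """Find T with a(T) = a_target by bisection (a(T) is continuous and non-decreasing)."""
    lo = 2 * min(noise)                  # a(lo) = 0
    hi = 2 * (max(noise) + a_target)     # a(hi) >= a_target
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if water_filling(noise, mid)[2] < a_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

noise = [1.0, 2.0, 4.0]                  # hypothetical N_i
T = temperature_for_energy(noise, a_target=2.0)
active, kx, a, C = water_filling(noise, T)
print(T, active, kx, a, C)
```

In this sample the level $T/2$ settles between the second and the third noise variance, so only the first two components carry signal energy.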
These formulae solve the problem completely, if we know set $L$ of indices $i$ corresponding to non-zero components of vectors $x$ from $\tilde X$. Let us demonstrate what considerations define that set. In the case of Gaussian distributions, probability densities $p(x)$, $p(y)$, of course, cannot be negative, and therefore subspace $\tilde X$ is not determined from the positiveness condition of probabilities (see Section 8.2). However, the condition of positive definiteness of matrix $K_x$ [which is given by formula (8.2.22)] may be violated, and it must be verified. Besides, the condition of non-degeneracy of operator (8.6.20) must also be satisfied. In the given example, we do not need to care much about the latter condition because $d = 1$. But the condition of positive definiteness of matrix $K_x$, i.e. the constraint
$$N_i < \frac{T}{2} \tag{8.6.36}$$
due to (8.6.34), is quite essential. For each fixed $T$ the set of indices $L$ is determined from constraint (8.6.36). Hence, we can replace $i \in L$ with $N_i < T/2$ under the summation signs in formulae (8.6.35).

The derived relations can be conveniently tracked on Figure 8.2. We draw indices $i$ and variances of disturbances $N_i$ (stepwise line) on abscissa and ordinate axes, respectively. A horizontal line corresponds to a fixed temperature $T/2$. Below it (between the horizontal line and the stepwise line) there are intervals $(K_x)_{ii}$ located according to (8.6.34). Intersection points of the two specified lines define boundaries of set $L$. The area between the horizontal and the stepwise lines is equal to the total useful energy $a$. In turn, the shaded area located between the stepwise line and the abscissa axis is equal to the total energy of disturbances $\sum_{i \in L} N_i$. Space $\tilde X$ (as a function of $T$) is determined by analogous methods for more complicated cases.

Fig. 8.2 Determination of channel capacity in the case of independent (spectral) components of a useful signal and additive noise with identically distributed components

5. Now we compute thermodynamic potential (7.4.3), defining estimation (7.4.2) of the probability of decoding error, for Gaussian channels. The information
$$I(x, y) = \ln \frac{p(y \mid x)}{p(y)} = \frac{1}{2} \ln \det(K_y A) - \frac{1}{2} (y^T - x^T d^T) A (y - d x) + \frac{1}{2} y^T K_y^{-1} y$$
has already been involved in formula (8.6.12). Substituting it to the equality
$$e^{\mu(s)} = \int e^{-s I(x, y)} p(x) p(y \mid x)\, dx\, dy$$
and taking into account (8.6.4), (8.6.8), we obtain
$$e^{\mu(s)} = (2\pi)^{-(q+r)/2} \det{}^{-1/2} K_x\, \det{}^{1/2} A\, \det{}^{-s/2}(K_y A) \int \exp\Big\{ -\frac{1}{2} x^T [K_x^{-1} + (1 - s) d^T A d] x + \frac{1 - s}{2} (y^T A d x + x^T d^T A y) - \frac{1}{2} y^T [(1 - s) A + s K_y^{-1}] y \Big\}\, dx\, dy. \tag{8.6.37}$$
The latter integral can be easily calculated with the help of the well-known matrix formula (5.4.19). It turns out to be equal to
$$e^{\mu(s)} = \det{}^{-s/2}(K_y A)\, \det{}^{-1/2}(K K_x)\, \det{}^{-1/2} B,$$
where
$$B = \begin{pmatrix} K_x^{-1} + (1 - s) d^T A d & -(1 - s) d^T A \\ -(1 - s) A d & (1 - s) A + s K_y^{-1} \end{pmatrix}. \tag{8.6.38}$$
In order to compute $\det B$ we apply the formula
$$\det \begin{pmatrix} b & c \\ c^T & d \end{pmatrix} = \det d\, \det(b - c d^{-1} c^T), \tag{8.6.39}$$
which follows from formula (A.2.4) from Appendix. This application yields
$$e^{\mu(s)} = \det{}^{-s/2}(K_y A)\, \det{}^{-1/2}(K K_x)\, \det{}^{-1/2}[(1 - s) A + s K_y^{-1}] \times \det{}^{-1/2}[K_x^{-1} + (1 - s) d^T A d - (1 - s)^2 d^T A (A - sA + s K_y^{-1})^{-1} A d]$$
$$= \det{}^{-s/2}(K_y A)\, \det{}^{-1/2}[1 - s + s K K_y^{-1}] \times \det{}^{-1/2}[1 + (1 - s) K_x d^T A d - (1 - s)^2 K_x d^T A (A - sA + s K_y^{-1})^{-1} A d].$$
Taking the logarithm of the last expression and accounting for formula (6.5.4), we find
$$\mu(s) = -\frac{s}{2} \operatorname{tr} \ln K_y A - \frac{1}{2} \operatorname{tr} \ln(1 - s + s K K_y^{-1}) - \frac{1}{2} \operatorname{tr} \ln[1 + (1 - s) s K_x d^T K_y^{-1} (1 - s + s K K_y^{-1})^{-1} d]. \tag{8.6.40}$$
We have simplified the latter term here by taking into account that
$$d^T A d - (1 - s) d^T A (A - sA + s K_y^{-1})^{-1} A d = d^T A \left[1 - \left(1 + \frac{s}{1 - s} A^{-1} K_y^{-1}\right)^{-1}\right] d = d^T A\, \frac{s}{1 - s} A^{-1} K_y^{-1} \left(1 + \frac{s}{1 - s} A^{-1} K_y^{-1}\right)^{-1} d.$$
If we next use the formula
$$\operatorname{tr} f(AB) = \operatorname{tr} f(BA) \tag{8.6.41}$$
for $f(z) = \ln(1 + z)$ [see (A.1.5), (A.1.6)], then we can move $d$ from left to right in the latter term in (8.6.40), so that the two last terms can be conveniently merged into one:
$$\operatorname{tr} \ln(1 - s + s K K_y^{-1}) + \operatorname{tr} \ln[1 + (1 - s) s\, d K_x d^T K_y^{-1} (1 - s + s K K_y^{-1})^{-1}] = \operatorname{tr} \ln[1 - s + s K K_y^{-1} + s (1 - s)\, d K_x d^T K_y^{-1}].$$
Besides, it is useful to substitute $d K_x d^T$ by $K_y - K$ [see (8.6.10)] here. Then equality (8.6.40) takes the form
$$\mu(s) = -\frac{s}{2} \operatorname{tr} \ln K_y A - \frac{1}{2} \operatorname{tr} \ln[1 - s^2 + s^2 K K_y^{-1}]. \tag{8.6.42}$$
Since $K K_y^{-1} = (K_y A)^{-1}$, function $\mu(s)$ has turned out to be expressed in terms of a single matrix $K_y A$. Taking into account (8.6.23), (8.6.22) we can transform formula (8.6.42) in such a way that $\mu(s)$ is expressed in terms of a single matrix $c \tilde A^{-1}$. Indeed, substituting (8.6.23) we have
$$\mu(s) = -\frac{s}{2} \operatorname{tr} \ln(1 - d \tilde A^{-1} d^T A + T d c^{-1} d^T A) - \frac{1}{2} \operatorname{tr} \ln[(1 - s^2) + s^2 (1 - d \tilde A^{-1} d^T A + T d c^{-1} d^T A)^{-1}].$$
Moving $d^T A$ from left to right according to formula (8.6.41) [in the second term for $f(z) = \ln[1 + (1 - s^2) z] - \ln(1 + z)$], we obtain
$$\mu(s) = -\frac{s}{2} \operatorname{tr} \ln(T \tilde A c^{-1}) - \frac{1}{2} \operatorname{tr} \ln\left[1 - s^2 + \frac{s^2}{T} c \tilde A^{-1}\right].$$
We can also represent this formula as follows:
$$\mu(s) = \frac{s}{2} \operatorname{tr} \ln(\beta c \tilde A^{-1}) - \frac{1}{2} \operatorname{tr} \ln[1 - s^2 + s^2 \beta c \tilde A^{-1}] \tag{8.6.43}$$
or
$$\mu(s) = \frac{1 - s}{2} \operatorname{tr} \ln(T \tilde A c^{-1}) - \frac{1}{2} \operatorname{tr} \ln[s^2 + (1 - s^2) T \tilde A c^{-1}].$$
We obtain from here, in particular, the variance of information
$$\mu''(0) = \operatorname{tr}\left(1 - \frac{1}{T} c \tilde A^{-1}\right) = \tilde r - \frac{1}{T} \operatorname{tr} c \tilde A^{-1}.$$
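If we apply (8.6.43) to the particular case of (8.6.31), then we will have
$$\mu(s) = -\frac{\tilde r s}{2} \ln \frac{T}{2N} - \frac{\tilde r}{2} \ln\left(1 - s^2 + \frac{2 s^2 N}{T}\right) = -\frac{s \tilde r}{2} \ln\left(1 + \frac{a}{\tilde r N}\right) - \frac{\tilde r}{2} \ln\left(1 - s^2 \frac{a}{a + \tilde r N}\right).$$
In turn, it is valid for the second considered example (8.6.33) that
$$\mu(s) = \frac{1}{2} \sum_{2 N_i < T} \left[ s \ln \frac{2 N_i}{T} - \ln\left(1 - s^2 + 2 s^2 \frac{N_i}{T}\right) \right].$$

$h_2 / k_2 g_2^2$. In the first case, when $\tilde U(\beta)$ has one dimension, we have
$$\tilde g^T = \begin{pmatrix} g_1 & 0 \end{pmatrix}; \qquad \tilde g = \begin{pmatrix} g_1 \\ 0 \end{pmatrix}; \qquad \tilde k_x = \tilde g^T k_x \tilde g = (k_1 g_1^2).$$
In the two-dimensional case, we have
$$\tilde k_x = \begin{pmatrix} k_1 g_1^2 & 0 \\ 0 & k_2 g_2^2 \end{pmatrix}.$$
Taking into account (10.3.10) we write formulae (10.3.30) for the given example as follows:
$$I = \begin{cases} \frac{1}{2} \ln \beta + \frac{1}{2} \ln(k_1 g_1^2 / h_1), & \text{for } h_1 / k_1 g_1^2 < \beta < h_2 / k_2 g_2^2, \\ \ln \beta + \frac{1}{2} \ln(k_1 g_1^2 / h_1) + \frac{1}{2} \ln(k_2 g_2^2 / h_2), & \text{for } \beta > h_2 / k_2 g_2^2, \end{cases}$$
$$R = \begin{cases} 1/2\beta - k_1 g_1^2 / 2 h_1 + E[c_0(x)], & \text{for } h_1 / k_1 g_1^2 < \beta < h_2 / k_2 g_2^2, \\ 1/\beta - k_1 g_1^2 / 2 h_1 - \frac{1}{2} k_2 g_2^2 / h_2 + E[c_0(x)], & \text{for } \beta > h_2 / k_2 g_2^2. \end{cases}$$
Excluding $\beta$ we find the value of information
$$V(I) = \begin{cases} \dfrac{1}{2} \dfrac{k_1 g_1^2}{h_1} (1 - e^{-2I}), & \text{for } 0 \le I \le I_2 = \dfrac{1}{2} \ln \dfrac{k_1 g_1^2 h_2}{k_2 g_2^2 h_1}, \\[2mm] \dfrac{1}{2} \left( \dfrac{k_1 g_1^2}{h_1} + \dfrac{k_2 g_2^2}{h_2} \right) - \sqrt{\dfrac{k_1 g_1^2}{h_1} \dfrac{k_2 g_2^2}{h_2}}\; e^{-I}, & \text{for } I \ge I_2. \end{cases} \tag{10.3.35}$$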
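The two branches of (10.3.35) can be checked numerically. The sketch below is an illustration only: the names `value_of_information` and `second_derivative` and the sample ratios $k_1 g_1^2/h_1 = 4$, $k_2 g_2^2/h_2 = 1$ are hypothetical. It evaluates $V(I)$ on both sides of $I_2$ and estimates the second derivative by central differences; $V$ itself is continuous at $I_2$, while the curvature differs on the two sides.

```python
import math

def value_of_information(I, p1, p2):
    """V(I) from (10.3.35); p1 = k1*g1^2/h1 and p2 = k2*g2^2/h2 with p1 >= p2 > 0."""
    I2 = 0.5 * math.log(p1 / p2)       # information at which the second component activates
    if I <= I2:
        return 0.5 * p1 * (1.0 - math.exp(-2.0 * I))
    return 0.5 * (p1 + p2) - math.sqrt(p1 * p2) * math.exp(-I)

def second_derivative(f, x, h=1e-4):
    """Central-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

p1, p2 = 4.0, 1.0                      # hypothetical sample ratios
I2 = 0.5 * math.log(p1 / p2)
V = lambda I: value_of_information(I, p1, p2)

left = second_derivative(V, I2 - 1e-3)    # close to -2 * p2
right = second_derivative(V, I2 + 1e-3)   # close to -p2
print(I2, V(I2), left, right)
```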
At the point $I = I_2$, $\beta = h_2 / k_2 g_2^2$ the second derivative
$$\frac{d^2 V}{dI^2} = \frac{d}{dI} \frac{1}{\beta} = \frac{dT}{dI} \tag{10.3.36}$$
undergoes a jump. It equals $V''(I_2 - 0) = -2V' = -2 k_2 g_2^2 / h_2$ to the left of the point $I = I_2$ and $V''(I_2 + 0) = -V' = -k_2 g_2^2 / h_2$ to the right of the point $I = I_2$. The obtained dependency is depicted in Figure 10.3. The dimensionality $\tilde r$ of the space $\tilde U(\beta)$ may be interpreted as the number of active degrees of freedom, which may vary with changes in temperature. This leads to a discontinuous jump of the second derivative (10.3.36), which is analogous to an abrupt change in heat capacity in thermodynamics (second-order phase transition).

5. In conclusion, let us compute function (10.2.22) for Gaussian systems. Taking into account (10.3.1), (10.3.4), (10.3.21) we get
$$\mu_x(t) = \frac{1}{2} \ln \det(a_u / 2\pi) + t c_0(x) + \ln \int_{\tilde U} \exp\left[ t x^T g u + \frac{t}{2} u^T h u - \frac{1}{2} u^T a_u u \right] du.$$
Applying formula (10.3.5) we obtain
$$\mu_x(t) = t c_0(x) - \frac{1}{2} \ln \det(1 - t h a_u^{-1}) + \frac{1}{2} t^2 x^T g (a_u - t h)^{-1} g^T x = t c_0(x) - \frac{1}{2} \operatorname{tr} \ln(1 - t h a_u^{-1}) + \frac{1}{2} t^2 x^T g a_u^{-1} (1 - t h a_u^{-1})^{-1} g^T x, \tag{10.3.37}$$
where $a_u^{-1}$ is determined from (10.3.18). Hence, we can find the expectation
$$E[\mu_x(t)] = t E[c_0(x)] - \frac{1}{2} \operatorname{tr} \ln(1 - t h a_u^{-1}) + \frac{1}{2} t^2 \operatorname{tr} a_u^{-1} (1 - t h a_u^{-1})^{-1} \tilde k_x \tag{10.3.38}$$
and the derivatives
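$$\mu_x'(t) = c_0(x) + \frac{1}{2} \operatorname{tr} \frac{h a_u^{-1}}{1 - t h a_u^{-1}} + \frac{1}{2} t\, x^T g a_u^{-1} \frac{2 - t h a_u^{-1}}{(1 - t h a_u^{-1})^2} g^T x, \tag{10.3.39}$$
$$E[\mu_x''(t)] = \frac{1}{2} \operatorname{tr} \frac{(h a_u^{-1})^2}{(1 - t h a_u^{-1})^2} + \operatorname{tr} a_u^{-1} (1 - t h a_u^{-1})^{-3} \tilde k_x. \tag{10.3.40}$$
If we further combine (10.3.37) with (10.3.39) and assign $t = -\beta$, then we obtain
$$\beta \mu_x'(-\beta) + \mu_x(-\beta) = \frac{1}{2} \operatorname{tr}\left[ \frac{\beta h a_u^{-1}}{1 + \beta h a_u^{-1}} - \ln(1 + \beta h a_u^{-1}) \right] - \frac{\beta^2}{2} x^T g a_u^{-1} (1 + \beta h a_u^{-1})^{-2} g^T x. \tag{10.3.41}$$
The relations, arising from (10.3.18),
$$1 + \frac{1}{\beta} a_u h^{-1} = \left(1 - \frac{1}{\beta} h \tilde k_x^{-1}\right)^{-1}; \qquad \beta h a_u^{-1} = \beta \tilde k_x h^{-1} - 1$$
allow us to pass the obtained results from the matrix $h a_u^{-1}$ to the matrix $\tilde k_x h^{-1}$. Thus, it follows from (10.3.40), (10.3.41) that
$$E[\mu_x''(-\beta)] = \frac{1}{2 \beta^2} \operatorname{tr}\left(1 - \frac{1}{\beta} h \tilde k_x^{-1}\right)^2 + \frac{1}{\beta^2} \operatorname{tr} \frac{\beta \tilde k_x h^{-1} - 1}{(\beta \tilde k_x h^{-1})^2}, \tag{10.3.42}$$
$$\beta \mu_x'(-\beta) + \mu_x(-\beta) = \frac{1}{2} \operatorname{tr}\left[ 1 - \frac{1}{\beta} h \tilde k_x^{-1} - \ln(\beta \tilde k_x h^{-1}) \right] - \frac{\beta}{2} x^T g h^{-1} \frac{\beta \tilde k_x h^{-1} - 1}{(\beta \tilde k_x h^{-1})^2} g^T x. \tag{10.3.43}$$
These formulae will be used in Section 11.4.

Fig. 10.3 The value of information for a Gaussian system with a 'phase transition' corresponding to formula (10.3.35)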
10.4 Stationary Gaussian systems

1. Let $x$ and $u$ consist of multiple components: $x = (x_1, \ldots, x_r)$, $u = (u_1, \ldots, u_r)$. A Bayesian system is stationary, if it is invariant under translation
$$x_1, \ldots, x_r \to x_{1+n}, \ldots, x_{r+n}; \qquad u_1, \ldots, u_r \to u_{1+n}, \ldots, u_{r+n}; \qquad (x_{k+r} = x_k, \quad u_{k+r} = u_k).$$
To ensure this, the stochastic process $x$ must be stationary, i.e. the density $p(x)$ has to satisfy the condition
$$p(x_1, \ldots, x_r) = p(x_{1+n}, \ldots, x_{r+n})$$
and the cost function must also satisfy the stationarity condition
$$c(x_1, \ldots, x_r; u_1, \ldots, u_r) = c(x_{1+n}, \ldots, x_{r+n}; u_{1+n}, \ldots, u_{r+n}).$$
When applied to Gaussian systems, the conditions of stationarity lead to the requirement that matrices
$$\|h_{ij}\|, \qquad \|g_{ij}\|, \qquad \|a_{ij}\| \tag{10.4.1}$$
in the expressions (10.3.1), (10.3.2) must be stationary, i.e. their elements have to depend only on the difference of indices:
$$h_{ij} = h_{i-j}, \qquad g_{ij} = g_{i-j}, \qquad a_{ij} = a_{i-j}, \qquad (k_x)_{ij} = (k_x)_{i-j}.$$
At first, let us consider the case when spaces have a finite dimensionality $r$. Matrices (10.4.1) can be transformed to the diagonal form by a unitary transformation
$$\bar x = U^+ x \tag{10.4.2}$$
with the matrix
$$U = \|U_{lj}\|; \qquad U_{lj} = \frac{1}{\sqrt r} e^{2\pi i l j / r}; \qquad l, j = 1, \ldots, r \tag{10.4.3}$$
satisfying
$$U^+ h U = \|\bar h_l \delta_{lm}\|; \qquad U^+ g U = \|\bar g_l \delta_{lm}\|; \qquad U^+ k_x U = \|\bar k_l \delta_{lm}\|, \tag{10.4.4}$$
where
$$\bar h_l = \sum_{k=1}^{r} e^{-2\pi i l k / r} h_k \tag{10.4.5}$$
by analogy with (5.5.9), (8.7.6). After diagonalization, formulae (10.3.30) take the form
$$I = \frac{1}{2} \sum_l \ln\left(\beta \bar k_l |\bar g_l|^2 / \bar h_l\right); \qquad R = \frac{1}{2} \sum_l \left[1/\beta - \bar k_l |\bar g_l|^2 / \bar h_l\right] + E[c_0(x)]. \tag{10.4.6}$$
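The diagonalization (10.4.5) and the sums (10.4.6) can be sketched in a few lines. This is an illustration under stated assumptions rather than the book's procedure: the helper names `spectrum` and `information_and_risk` are hypothetical, each stationary matrix is passed as its first row $h_0, \ldots, h_{r-1}$ (indices taken modulo $r$), the rows are assumed symmetric ($h_k = h_{r-k}$) so that the spectra are real, and only components with $\beta \bar k_l |\bar g_l|^2 / \bar h_l > 1$ are summed, since all other components are inactive. The returned risk omits the constant $E[c_0(x)]$.

```python
import cmath
import math

def spectrum(row):
    """Eigenvalues of a periodic stationary matrix from its first row, as in (10.4.5)."""
    r = len(row)
    return [sum(cmath.exp(-2j * math.pi * l * k / r) * row[k] for k in range(r))
            for l in range(r)]

def information_and_risk(k_row, g_row, h_row, beta):
    """I and R - E[c0] from (10.4.6), summed over the active components only."""
    I = R = 0.0
    active = 0
    for kl, gl, hl in zip(spectrum(k_row), spectrum(g_row), spectrum(h_row)):
        phi = kl.real * abs(gl) ** 2 / hl.real   # real because the rows are symmetric
        if beta * phi > 1.0:                     # active component
            I += 0.5 * math.log(beta * phi)
            R += 0.5 * (1.0 / beta - phi)
            active += 1
    return I, R, active

# Hypothetical sample: r = 4, correlation row (2, 1, 0, 1), g = -1, h = 1.
I, R, active = information_and_risk([2, 1, 0, 1], [-1, 0, 0, 0], [1, 0, 0, 0], beta=1.0)
print(I, R, active)
```

For this sample the spectrum of the correlation matrix is $(4, 2, 0, 2)$, so three of the four components are active at $\beta = 1$.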
The result of the unitary transformation (10.4.2) is that the new variables
$$\begin{bmatrix} \bar x_1 \\ \vdots \\ \bar x_r \end{bmatrix} = U^+ \begin{bmatrix} x_1 \\ \vdots \\ x_r \end{bmatrix}$$
have become uncorrelated, i.e. independent. Taking into consideration coordinates
$$\begin{bmatrix} \bar u_1 \\ \vdots \\ \bar u_r \end{bmatrix} = U^+ \begin{bmatrix} u_1 \\ \vdots \\ u_r \end{bmatrix},$$
it is convenient to characterize the active subspace $\tilde U$. Some of the coordinates are equal to zero. Due to (10.3.33) the only non-zero coordinates are those which satisfy the inequality
$$\beta \bar k_l |\bar g_l|^2 / \bar h_l > 1. \tag{10.4.7}$$
Thus, the summation in (10.4.6) has to be carried only over those indices $l$ for which the latter inequality is valid. In fact, the number of such indices is exactly $\tilde r$.

2. Now, let $x = \{x(t)\}$ be a random function on the interval $[0, T_0]$, i.e. space $x$ is infinite dimensional (functional). A Bayesian system is assumed to be strictly stationary with matrices $h$, $g$ having the form
$$h = \|h(t' - t'')\|; \qquad g = \|g(t' - t'')\| \tag{10.4.8}$$
and similarly for $a$, where $h(\tau)$, $g(\tau)$, $a(\tau)$ are periodic functions with period $T_0$. A unitary transformation that diagonalizes these matrices has the form of the following Fourier transformation:
$$(U^+ x)_l = \bar x_l = \bar x(\omega_l) = \frac{1}{\sqrt{T_0}} \int_0^{T_0} e^{-2\pi i l t / T_0} x(t)\, dt, \qquad (\omega_l = 2\pi l / T_0; \quad l = 0, 1, \ldots).$$
In this case, we have (see (8.7.12))
$$U^+ h U = \|\bar h(\omega_l) \delta_{lm}\|, \qquad \bar h(\omega_l) = \int_0^{T_0} e^{-i \omega_l \tau} h(\tau)\, d\tau = \int_{-T_0/2}^{T_0/2} e^{-i \omega_l \tau} h(\tau)\, d\tau, \tag{10.4.9}$$
and so on for the other matrices. Formulae (10.3.30) in this case will have the form similar to (10.4.6):
$$I_0 = \frac{I}{T_0} = \frac{1}{2 T_0} \sum_{l \in L} \ln\left[\beta \frac{\bar k(\omega_l) |\bar g(\omega_l)|^2}{\bar h(\omega_l)}\right], \qquad R_0 = \frac{R}{T_0} = -\frac{1}{2 T_0} \sum_{l \in L} \left[\frac{\bar k(\omega_l) |\bar g(\omega_l)|^2}{\bar h(\omega_l)} - \frac{1}{\beta}\right] + \frac{1}{T_0} E[c_0(x)], \tag{10.4.10}$$
with the only difference that index $l$ now can range over all possible integer values $\ldots, -1, 0, 1, 2, \ldots$, which satisfy the conditions
$$\beta \bar k(\omega_l) |\bar g(\omega_l)|^2 > \bar h(\omega_l) \tag{10.4.11}$$
similar to (10.4.7). Let $\Phi_{\max}$ be the maximum value
$$\Phi_{\max} = \max\left\{ \frac{\bar k(\omega_l) |\bar g(\omega_l)|^2}{\bar h(\omega_l)},\ l = \ldots, -1, 0, 1, 2, \ldots \right\}.$$
Then for $\beta < 1/\Phi_{\max}$ there are no indices $l$ for which the condition (10.4.11) is satisfied, and the summations in (10.4.10) will be absent, so that $I_0$ will be equal to zero and also
$$(R_0)_{I_0 = 0} = \frac{1}{T_0} E[c_0(x)]. \tag{10.4.12}$$
$I_0$ will take a non-zero value as soon as $\beta$ attains the value $1/\Phi_{\max}$ and surpasses it. Taking into account (10.4.12), by the usual formula $V_0(I_0) = (R_0)_{I_0 = 0} - R_0(I_0)$, we derive from (10.4.10) the expression for the value of information
$$V_0(I_0) = \frac{1}{2 T_0} \sum_{l \in L} \left[\frac{\bar k(\omega_l) |\bar g(\omega_l)|^2}{\bar h(\omega_l)} - \frac{1}{\beta}\right]. \tag{10.4.13}$$
The dependency $V_0(I_0)$ has been obtained via (10.4.10), (10.4.13) in a parametric form.

3. Let us consider the case when $x = \{\ldots, x_{-1}, x_0, x_1, x_2, \ldots\}$ is an infinite stationary sequence, and the elements of the matrices $h = \|h_{i-j}\|$, $g = \|g_{i-j}\|$ depend only on the difference $i - j$. Then these matrices can be diagonalized by a unitary transformation
$$(U^+ x)(\lambda) = \frac{1}{\sqrt{2\pi}} \sum_l e^{-i \lambda l} x_l, \tag{10.4.14}$$
where $-\pi \le \lambda \le \pi$. Similarly to (10.4.4), (10.4.5) we obtain in this case
$$U^+ h U = \|\bar h(\lambda) \delta(\lambda - \lambda')\|, \ldots, \qquad \bar h(\lambda) = \sum_k e^{-i \lambda k} h_k. \tag{10.4.15}$$
The transformation (10.4.14) may be considered as a limiting case (for $r \to \infty$) of the transformation (10.4.2), (10.4.3), where $\lambda = \lambda_l = 2\pi l / r$. It follows from the comparison of (10.4.5) and (10.4.15) that $\bar h_l = \bar h(2\pi l / r), \ldots$. Therefore, formulae (10.4.6) yield
$$I_1 = \frac{I}{r} = \frac{1}{2r} \sum_l \ln\left[\beta \frac{\bar k(2\pi l / r) |\bar g(2\pi l / r)|^2}{\bar h(2\pi l / r)}\right], \qquad R_1 = \frac{R}{r} = -\frac{1}{2r} \sum_l \left[\frac{\bar k(2\pi l / r) |\bar g(2\pi l / r)|^2}{\bar h(2\pi l / r)} - \frac{1}{\beta}\right] + \frac{1}{r} E[c_0(x)].$$
Taking the limit $r \to \infty$ and because $\Delta\lambda = 2\pi / r$, we obtain the integrals
$$I_1 = \frac{1}{4\pi} \int_{\lambda \in L(\beta)} \ln\left[\beta \frac{\bar k(\lambda) |\bar g(\lambda)|^2}{\bar h(\lambda)}\right] d\lambda, \qquad R_1 = -\frac{1}{4\pi} \int_{\lambda \in L(\beta)} \left[\frac{\bar k(\lambda) |\bar g(\lambda)|^2}{\bar h(\lambda)} - \frac{1}{\beta}\right] d\lambda + \text{const}. \tag{10.4.16}$$
The integration is carried over subinterval $L(\beta)$ of interval $(-\pi, \pi)$, for which $\beta \bar k(\lambda) |\bar g(\lambda)|^2 > \bar h(\lambda)$. The obtained formulae determine the rates $I_1$, $R_1$ corresponding on average to one element from sequences $\{\ldots, x_1, x_2, \ldots\}$, $\{\ldots, u_1, u_2, \ldots\}$. Denoting the length of the subinterval $L(\beta)$ by $|L|$, we obtain from (10.4.16) that
$$\frac{4\pi}{|L|} R_1 = -\Phi_1 + \Phi_2 e^{-4\pi I_1 / |L|} + \text{const}, \tag{10.4.17}$$
where
$$\Phi_1 = \frac{1}{|L|} \int_L \Phi(\lambda)\, d\lambda; \qquad \Phi_2 = \exp\left[\frac{1}{|L|} \int_L \ln \Phi(\lambda)\, d\lambda\right]$$
are some mean (in $L(\beta)$) values of functions $\Phi(\lambda) = \bar k(\lambda) |\bar g(\lambda)|^2 / \bar h(\lambda)$, $\ln \Phi(\lambda)$.

4. Finally, suppose there is a stationary process on the infinite continuous-time axis. Functions $h$, $g$ depend only on the differences in time, similarly to (10.4.8). This case can be considered as a limiting case of the system considered in paragraphs 2 or 3. Thus, in the formulae of paragraph 2 it is required to take the limit $T_0 \to \infty$, for which the points $\omega_l = 2\pi l / T_0$ on the axis $\omega$ become arbitrarily dense ($\Delta\omega = 2\pi / T_0$),
and the sums in (10.4.10), (10.4.13) are transformed into integrals:
$$I_0 = \frac{1}{4\pi} \int_{L(\beta)} \ln\left[\beta \frac{\bar k(\omega) |\bar g(\omega)|^2}{\bar h(\omega)}\right] d\omega, \qquad V_0 = \frac{1}{4\pi} \int_{L(\beta)} \left[\frac{\bar k(\omega) |\bar g(\omega)|^2}{\bar h(\omega)} - \frac{1}{\beta}\right] d\omega. \tag{10.4.18}$$
The functions $\bar h(\omega)$, $\bar g(\omega)$ included in the previous formulae are defined by the equality (10.4.9):
$$\bar h(\omega) = \int_{-\infty}^{\infty} e^{-i \omega \tau} h(\tau)\, d\tau, \ldots. \tag{10.4.19}$$
The integration in (10.4.18) is carried over the region $L(\beta)$ belonging to the axis $\omega$, where
$$\beta \bar k(\omega) |\bar g(\omega)|^2 > \bar h(\omega). \tag{10.4.20}$$
Denote the total length of that region by $|L|$. Then, similarly to (10.4.17), formulae (10.4.18) entail the following:
$$\frac{4\pi}{|L|} V_0 = \Phi_1 - \Phi_2 e^{-4\pi I_0 / |L|},$$
where
$$\Phi_1 = \frac{1}{|L|} \int_L \Phi(\omega)\, d\omega; \qquad \Phi_2 = \exp\left[\frac{1}{|L|} \int_L \ln \Phi(\omega)\, d\omega\right], \qquad \Phi(\omega) = \frac{\bar k(\omega) |\bar g(\omega)|^2}{\bar h(\omega)}. \tag{10.4.21}$$
These mean values turn out to be mildly dependent on $\beta$.

5. As an example, let us consider a stationary Gaussian process $x = \{x(t)\}$ having the following correlation function:
$$k(\tau) = \sigma^2 e^{-\gamma |\tau|}.$$
Thus, according to (10.4.19), we have
$$\bar k(\omega) = 2 \gamma \sigma^2 / (\gamma^2 + \omega^2). \tag{10.4.22}$$
Let us take the cost function in quadratic form
$$c(x, u) = \frac{1}{2} \int [x(t) - u(t)]^2\, dt$$
or, equivalently, using matrix notation
$$c(x, u) = \frac{1}{2} x^T x - x^T u + \frac{1}{2} u^T u. \tag{10.4.23}$$

Fig. 10.4 The rate of the value of information for the example with the cost function (10.4.23)

The matrices $h$, $g$ (as can be seen from comparison with (10.3.1)) then have the form $h = \|\delta(t' - t'')\|$, $g = -\|\delta(t' - t'')\|$, so that
$$\bar h(\omega) = 1; \qquad \bar g(\omega) = -1. \tag{10.4.24}$$
In this case, due to (10.4.22), (10.4.24), function (10.4.21) can be written as
$$\Phi(\omega) = 2 \gamma \sigma^2 / (\gamma^2 + \omega^2).$$
Then condition (10.4.20) takes the form $2 \beta \gamma \sigma^2 > \gamma^2 + \omega^2$, and, consequently, for a fixed value $\beta > \gamma / 2\sigma^2$ the region $L(\beta)$ is the interval
$$-\sqrt{2 \beta \gamma \sigma^2 - \gamma^2} < \omega < \sqrt{2 \beta \gamma \sigma^2 - \gamma^2}.$$
Instead of parameter $\beta$, we consider now parameter $y = \sqrt{2 \beta \sigma^2 / \gamma - 1}$. Then $\beta = \gamma (1 + y^2) / 2\sigma^2$, and the specified interval $L$ will have the form $-\gamma y < \omega < \gamma y$. Taking into account
$$\frac{1}{2y} \int_{-y}^{y} \frac{dx}{1 + x^2} = \frac{\arctan y}{y}$$
and
$$\frac{1}{2y} \int_{-y}^{y} \ln(1 + x^2)\, dx = \ln(1 + y^2) + 2 \left(\frac{\arctan y}{y} - 1\right),$$
it follows from formulae (10.4.18) that
$$I_0 = \frac{\gamma}{\pi} (y - \arctan y); \qquad V_0 = \frac{\sigma^2}{\pi} \left(\arctan y - \frac{y}{1 + y^2}\right). \tag{10.4.25}$$
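The closed forms (10.4.25) can be compared against a direct numerical evaluation of the integrals (10.4.18) for $\Phi(\omega) = 2\gamma\sigma^2/(\gamma^2 + \omega^2)$. The sketch below is only a sanity check under hypothetical parameter values; `midpoint`, `rates_numeric` and `rates_closed_form` are illustrative names, and a plain midpoint rule stands in for any more refined quadrature.

```python
import math

def midpoint(f, a, b, n=20000):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def rates_numeric(beta, sigma2, gamma):
    """I_0 and V_0 from the integrals (10.4.18) over L(beta) = (-w_max, w_max)."""
    phi = lambda w: 2.0 * gamma * sigma2 / (gamma ** 2 + w ** 2)
    w_max = math.sqrt(2.0 * beta * gamma * sigma2 - gamma ** 2)
    I0 = midpoint(lambda w: math.log(beta * phi(w)), -w_max, w_max) / (4.0 * math.pi)
    V0 = midpoint(lambda w: phi(w) - 1.0 / beta, -w_max, w_max) / (4.0 * math.pi)
    return I0, V0

def rates_closed_form(beta, sigma2, gamma):
    """I_0 and V_0 from (10.4.25) with y = sqrt(2*beta*sigma^2/gamma - 1)."""
    y = math.sqrt(2.0 * beta * sigma2 / gamma - 1.0)
    I0 = gamma / math.pi * (y - math.atan(y))
    V0 = sigma2 / math.pi * (math.atan(y) - y / (1.0 + y ** 2))
    return I0, V0

sigma2, gamma, beta = 1.5, 0.7, 2.0      # hypothetical values with beta > gamma / (2 sigma^2)
numeric = rates_numeric(beta, sigma2, gamma)
closed = rates_closed_form(beta, sigma2, gamma)
print(numeric, closed)
```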
Because of (10.4.23) and the stationarity of the process, the doubled rate of loss $2 R_0$, corresponding to a unit of time, coincides with the mean square error $2 R_0 = E[x(t) - u(t)]^2$. Furthermore, the doubled value $2 V_0(I_0) = \sigma^2 - 2 R_0(I_0)$ represents the maximum decrease of the mean square error that is possible for a given rate of information amount $I_0$. The graph of the value function, corresponding to formulae (10.4.25), is shown on Figure 10.4. These formulae can also be used to obtain approximate formulae for the dependency $V_0(I_0)$ under small and large values of parameter $y$ (or, equivalently, of the ratio $I_0 / \gamma$).

For small values $y \ll 1$, let us use the expansions $\arctan y = y - y^3/3 + y^5/5 - \cdots$ and $y / (1 + y^2) = y - y^3 + y^5 - \cdots$. Substituting them into (10.4.25) we obtain
$$I_0 = \frac{\gamma}{\pi} \left(\frac{y^3}{3} - \frac{y^5}{5} + \cdots\right), \qquad V_0 = \frac{\sigma^2}{\pi} \left(\frac{2}{3} y^3 - \frac{4}{5} y^5 + \cdots\right),$$
and, finally, after elimination of $y$:
$$V_0(I_0) = \frac{2\sigma^2}{\gamma} I_0 \left(1 - \frac{3}{5} y^2 + \cdots\right) = \frac{2\sigma^2}{\gamma} I_0 - \frac{6}{5} \sigma^2 (3\pi)^{2/3} \left(\frac{I_0}{\gamma}\right)^{5/3} + \cdots.$$
For large values $y \gg 1$, let us expand with respect to the reciprocal parameter, namely
$$\arctan y = \frac{\pi}{2} - \arctan \frac{1}{y} = \frac{\pi}{2} - y^{-1} + \frac{1}{3} y^{-3} - \cdots, \qquad \frac{y}{1 + y^2} = y^{-1} - y^{-3} + y^{-5} - \cdots.$$
Then formulae (10.4.25) yield
$$\pi I_0 / \gamma = y - \pi/2 + y^{-1} + \cdots, \qquad \pi V_0 / \sigma^2 = \pi/2 - 2 y^{-1} + (4/3) y^{-3} + \cdots.$$
Eliminating $y$ from the last formulae we obtain
$$V_0 / \sigma^2 = 1/2 - (2/\pi^2) (I_0/\gamma + 1/2)^{-1} + O(I_0^{-3}).$$
For this example, it is also straightforward to express potential (10.3.28).
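The two approximations can be compared with the exact parametric dependence (10.4.25). The following sketch is illustrative only; the function names and the sample values of $y$, $\sigma^2$, $\gamma$ are hypothetical. For each chosen $y$ it computes the exact pair $(I_0, V_0)$ and evaluates the matching approximation at that $I_0$.

```python
import math

def exact_point(y, sigma2, gamma):
    """Exact (I_0, V_0) from (10.4.25) for a given parameter y."""
    I0 = gamma / math.pi * (y - math.atan(y))
    V0 = sigma2 / math.pi * (math.atan(y) - y / (1.0 + y * y))
    return I0, V0

def small_rate_approx(I0, sigma2, gamma):
    """Two leading terms of V_0(I_0) for small I_0 / gamma."""
    return (2.0 * sigma2 / gamma * I0
            - 1.2 * sigma2 * (3.0 * math.pi) ** (2.0 / 3.0) * (I0 / gamma) ** (5.0 / 3.0))

def large_rate_approx(I0, sigma2, gamma):
    """Approximation of V_0(I_0) for large I_0 / gamma."""
    return sigma2 * (0.5 - (2.0 / math.pi ** 2) / (I0 / gamma + 0.5))

sigma2, gamma = 1.0, 1.0                         # hypothetical values
I_small, V_small = exact_point(0.05, sigma2, gamma)
I_large, V_large = exact_point(50.0, sigma2, gamma)
err_small = abs(small_rate_approx(I_small, sigma2, gamma) - V_small) / V_small
err_large = abs(large_rate_approx(I_large, sigma2, gamma) - V_large)
print(err_small, err_large)
```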
Chapter 11
Asymptotic results about the value of information. Third asymptotic theorem
The asymptotic equivalence of the values of various types of information (Hartley's, Boltzmann's or Shannon's information amounts) should be regarded as the main asymptotic result concerning the value of information, which holds true under very broad assumptions, such as the requirement of information stability. This fact cannot be reduced to the fact of asymptotically errorless information transmission through a noisy channel stated by Shannon's theorem (Chapter 7), but it is an independent and no less significant fact. The combination of these two facts leads to a generalized result, namely the generalized Shannon's theorem (Section 11.5). The latter concerns a general performance criterion determined by an arbitrary cost function and the risk corresponding to it. Historically, the fact of asymptotic equivalence of different values of information was first proven (1959) precisely in this composite and implicit form, combined with the second fact (asymptotically errorless transmission). Initially, this fact was not regarded as an independent one, and was, in essence, incorporated into the generalized Shannon's theorem.

In this chapter we follow another way of presentation and treat the fact of asymptotic equivalence of the values of different types of information as a special, completely independent fact that is more elementary than the generalized Shannon's theorem. We consider this way of presentation preferable both from fundamental and from pedagogical points of view. At the same time we can clearly observe the symmetry of information theory and the equal importance of the second and the third variational problems.

Apart from the fact of asymptotic equivalence of the values of information, it is also interesting and important to study the magnitude of their difference. Sections 11.3 and 11.4 give the first terms of the asymptotic expansion for this difference, which were found by the author. These terms are exact for a chosen random encoding, and they give an idea of the rate of decrease of the difference (as in any asymptotic semiconvergent expansion), even though the sum of all the remaining terms of the expansion is not evaluated. We pay special attention to the question about invariance of the results with respect to a transformation $c(\xi, \zeta) \to c(\xi, \zeta) + f(\xi)$ of the cost function, which does not influence information transmission and the value of information. Preference is given to those formulae which use variables and functions that are invariant with respect to the specified transformation. For instance, we take the ratio of the invariant difference $\tilde R - R = V - \tilde V$ over invariant variable $V$ rather than risk $R$, which is not invariant (see Theorem 11.2).

Certainly, research in the given direction can be supplemented and improved. Say, when the generalized Shannon's theorem is considered in Section 11.5, it is legitimate to raise the question about the rate of decrease of the difference in risks. This question, however, was not considered in this book.
11.1 On the distinction between the value functions of different types of information. Preliminary forms

Let $[P(dx), c(x, u)]$ be a Bayesian system. That is, there is a random variable $x$ from probability space $(X, F, P)$, and an $F \times G$-measurable cost function $c(x, u)$, where $u$ (from a measurable space $(U, G)$) is an estimator. In Chapter 9, we defined for such a system the value functions of different types of information (Hartley, Boltzmann, Shannon). These functions correspond to the minimum average cost $E[c(x, u)]$ attained for the specified amounts of information received. Hartley's information consists of the index indicating which region $E_k$ of the optimal partition $E_1 + \cdots + E_M = X$ ($E_k \in F$) the value of $x$ belongs to. The minimum losses
$$\tilde R(I) = \inf_{\sum E_k} E\left[\inf_u E[c(x, u) \mid E_k]\right] \tag{11.1.1}$$
are determined by minimization over both the estimators $u$ and different partitions. The upper bound $I$ on Hartley's information amount corresponds to the upper bound $M \le e^I$ on the number of the indicated regions. When bounding Shannon's amount of information, the following minimum costs are considered:
$$R(I) = \inf_{P(du \mid x)} \int c(x, u) P(dx) P(du \mid x), \tag{11.1.2}$$
where minimization is carried over conditional distributions $P(du \mid x)$ compatible with inequality $I_{xu} \le I$. The minimum losses $\hat R(I)$ with respect to the constraint on Boltzmann's information amount, according to (9.6.6), range between losses (11.1.1) and (11.1.2):
$$R(I) \le \hat R(I) \le \tilde R(I). \tag{11.1.3}$$
Therefore, all three functions $R(I)$, $\hat R(I)$, $\tilde R(I)$ will be close to each other when $R(I)$ and $\tilde R(I)$ are close.
Note that the definitions of functions $R(I)$, $\hat R(I)$, $\tilde R(I)$ imply only the inequality (11.1.3). A question that emerges here is how much the functions $\tilde R(I)$ and $R(I)$ differ from each other. If they do not differ much, then, instead of the function $\tilde R(I)$, which is difficult to compute, one may consider the function $R(I)$, which is much easier to compute and in addition has other convenient properties, such as differentiability. The study of the differences between the functions $\tilde R(I)$, $R(I)$ and, thus, between the value functions $\tilde V(I)$, $\hat V(I)$, $V(I)$, is the subject of the present chapter. It turns out that for Bayesian systems of a particular kind, namely Bayesian systems possessing the property of 'information stability', there is asymptotic equivalence of the above-mentioned functions. This profound asymptotic result (the third asymptotic theorem) is comparable in its depth and significance with the respective asymptotic results for the first and the second theorems.

Before giving the definition of informationally stable Bayesian systems, let us consider composite systems, a generalization of which are the informationally stable systems. We call a Bayesian system $[P_n(d\xi), c(\xi, \zeta)]$ an $n$-th power (or degree) of a system $[P_1(dx), c_1(x, u)]$, if the random variable $\xi$ is a tuple $(x_1, \ldots, x_n)$ consisting of $n$ independent identically distributed (i.i.d.) random variables that are copies of $x$, i.e.
$$P_n(d\xi) = P_1(dx_1) \cdots P_1(dx_n). \tag{11.1.4}$$
The estimator, $\zeta$, is a tuple of identical $u_1, \ldots, u_n$, while the cost function is the following sum:
$$c(\xi, \zeta) = \sum_{i=1}^{n} c_1(x_i, u_i). \tag{11.1.5}$$
Formulae (11.1.1), (11.1.2) can be applied to the system $[P_1, c_1]$ as well as to its $n$-th power. Naturally, the amount of information $I_n = n I_1$ for a composite system should be $n$ times the amount of the original system. Then the optimal distribution $P(d\zeta \mid \xi)$ (corresponding to formula (11.1.2) for a composite system) is factorized according to (9.4.21) into the product
$$P_{I_n}(d\zeta \mid \xi) = P_{I_1}(du_1 \mid x_1) \cdots P_{I_1}(du_n \mid x_n), \tag{11.1.6}$$
where $P_{I_1}(du \mid x)$ is the analogous optimal distribution for the original system. Following (11.1.2) we have
$$R_1(I_1) = \int c_1(x, u) P_1(dx) P_{I_1}(du \mid x). \tag{11.1.7}$$
Thus, for the $n$-th power system, as a consequence of (11.1.5), (11.1.7), we obtain
$$R_n(I_n) = n R_1(I_1). \tag{11.1.8}$$
The case of function (11.1.1) is more complicated. Functions $\tilde R_n(n I_1)$ and $\tilde R_1(I_1)$ of the composite and the original systems are no longer related so simply. The number of partition regions for the composite system can be assumed to be $M = [e^{n I_1}]$, whereas for the original system it is $m = [e^{I_1}]$ (the brackets $[\ ]$ here denote an integer part). It is clear that $M \ge m^n$, and therefore the partition
$$\sum E_k \cdots \sum E_z \tag{11.1.9}$$
of space $X^n = (X \times \cdots \times X)$ into $m^n$ regions, induced by the optimal partition $\sum E_k = X$ of the original system, is among feasible partitions searched over in the formula
$$\tilde R_n(n I_1) = \inf_{\sum_{k=1}^{M} G_k = X^n} E\left[\inf_\zeta E[c(\xi, \zeta) \mid G_k]\right]. \tag{11.1.10}$$
Hence,
$$\tilde R_n(n I_1) \le n \tilde R_1(I_1). \tag{11.1.11}$$
However, besides partitions (11.1.9) there is a large number of feasible partitions of another kind. Thus, it is reasonable to expect that $n \tilde R_1(I_1)$ is substantially larger than $\tilde R_n(n I_1)$. One can expect that for some systems the rates $\tilde R_n(n I_1)/n$ of average costs [clearly exceeding $R_1(I_1)$ on account of (11.1.3), (11.1.8)] decrease as $n$ increases and approach their plausible minimum, which turns out to coincide precisely with $R_n(n I_1)/n = R_1(I_1)$. It is this fact that is the essence of the main result (the third asymptotic theorem). Its derivation also yields another important result, namely, there emerges a method of finding a partition $\sum G_k$ close (in asymptotic sense) to the optimal. As it turns out, a procedure analogous to decoding via a random Shannon's code is suitable here (see Section 7.2). One takes $M$ code points $\zeta_1, \ldots, \zeta_M$ (we remind that each of them represents a block $(u_1, \ldots, u_n)$). These points are the result of $M$-fold random sampling with probabilities
$$P_{I_n}(d\zeta) = \int_{X^n} P(d\xi) P_{I_n}(d\zeta \mid \xi) = P_{I_1}(du_1) \cdots P_{I_1}(du_n), \tag{11.1.12}$$
where $P_{I_n}(d\zeta \mid \xi)$ is an optimal distribution (11.1.6). The number of code points is set to $M = [e^{\tilde I_n}]$ ($\tilde I_n = n \tilde I_1$, $\tilde I_1$ is independent of $n$). Moreover, to prove the next Theorem 11.1, we should take the quantity $\tilde I_n$ to be different from $I_n = n I_1$ occurring in (11.1.12). The specified code points and the 'distance' $c(\xi, \zeta)$ determine the partition
$$\sum_{k=1}^{M} G_k = X^n. \tag{11.1.13}$$
The region $G_k$ contains those points $\xi$ which are 'closer' to point $\zeta_k$ than to other points (equidistant points may be assigned to any of the competing regions by default). If in the specified partition (11.1.13) the estimator is chosen to be the point $\zeta_k$ used to construct the region $G_k$, instead of the point $\zeta$ minimizing the expression $E[c(\xi, \zeta) \mid G_k]$, then the result is, in general, no longer optimal, that is
$$E\left[\inf_\zeta E[c(\xi, \zeta) \mid G_k]\right] \le E\left[E[c(\xi, \zeta_k) \mid G_k]\right]. \tag{11.1.14}$$
Comparing the left-hand side of inequality (11.1.14) with (11.1.10), we evidently conclude that
$$\tilde R_n(n\tilde I_1) \leqslant E\Big[\inf_{\zeta} E[c(\xi,\zeta) \mid G_k]\Big] \leqslant E\big[E[c(\xi,\zeta_k) \mid G_k]\big].$$
In the expression $E[E[c(\xi,\zeta_k) \mid G_k]]$ the points $\zeta_1, \dots, \zeta_{\tilde M}$ are assumed to be fixed. Rewriting the latter expression in more detail as $E\{E[c(\xi,\zeta_k) \mid G_k] \mid \zeta_1, \dots, \zeta_{\tilde M}\}$, we obtain the inequality
$$\tilde R_n(n\tilde I_1) \leqslant E\{E[c(\xi,\zeta_k) \mid G_k] \mid \zeta_1, \dots, \zeta_{\tilde M}\}, \tag{11.1.15}$$
that will be useful in the next paragraph. In what follows, it is also useful to remember (see Chapter 9) that the optimal distribution (11.1.12) is related via formula (9.4.23), that is via
$$\gamma_n(\xi) = \ln \int e^{-\beta c(\xi,\zeta)} P_{I_n}(d\zeta) \tag{11.1.16}$$
or via
$$\gamma_1(x) = \ln \int e^{-\beta c_1(x,u)} P_{I_1}(du), \tag{11.1.17}$$
with the functions $\gamma_1(x)$, $\gamma_n(\xi) = \gamma_1(x_1) + \cdots + \gamma_1(x_n)$. Averaging them gives the potentials
$$\Gamma_1(\beta) = E[\gamma_1(x)] = \int P(dx) \ln \int e^{-\beta c_1(x,u)} P_{I_1}(du), \tag{11.1.18}$$
$$\Gamma_n(\beta) = E[\gamma_n(\xi)] = n\Gamma_1(\beta), \tag{11.1.19}$$
that allow for the computation of the average costs
$$R_n = -n\frac{d\Gamma_1}{d\beta}(\beta); \qquad R_1 = -\frac{d\Gamma_1}{d\beta}(\beta) \tag{11.1.20}$$
and the amount of information
$$I_n = nI_1; \qquad I_1 = \beta\frac{d\Gamma_1}{d\beta}(\beta) - \Gamma_1(\beta) \tag{11.1.21}$$
according to (9.4.10), (9.4.29), (9.4.30).
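As a numerical illustration of (11.1.18), (11.1.20) and (11.1.21), the sketch below evaluates the potential $\Gamma_1(\beta)$ for a finite system and obtains $R_1$ and $I_1$ from a central-difference derivative. The example system (a uniform binary $x$, a uniform marginal $P_{I_1}(du)$ and a cost equal to 0 when $u = x$ and 1 otherwise) is an assumption chosen for its symmetry, and the function names are hypothetical.

```python
import math


def potential(beta, p_x, p_u, cost):
    """Gamma_1(beta) = sum_x P(x) ln sum_u P(u) exp(-beta c1(x, u))."""
    return sum(
        px * math.log(sum(pu * math.exp(-beta * cost[i][j])
                          for j, pu in enumerate(p_u)))
        for i, px in enumerate(p_x)
    )


def cost_and_information(beta, p_x, p_u, cost, h=1e-5):
    """Return (R_1, I_1) from (11.1.20), (11.1.21) via a central difference."""
    d_gamma = (potential(beta + h, p_x, p_u, cost)
               - potential(beta - h, p_x, p_u, cost)) / (2 * h)
    r1 = -d_gamma
    i1 = beta * d_gamma - potential(beta, p_x, p_u, cost)
    return r1, i1


beta = 1.5
r1, i1 = cost_and_information(beta, [0.5, 0.5], [0.5, 0.5], [[0, 1], [1, 0]])
```

For this symmetric example the values can be checked by hand: $R_1 = e^{-\beta}/(1+e^{-\beta})$ and $I_1 = \ln 2 + R_1 \ln R_1 + (1-R_1)\ln(1-R_1)$.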
11.2 Theorem about asymptotic equivalence of the value functions of different types of information

1. Before we turn to the main asymptotic results related to the equivalence of different types of information, we introduce the useful inequalities (11.2.5), (11.2.13),
which allow us to find upper bounds for the minimum losses $\tilde R_n(n\tilde I_1)$ corresponding to Hartley's information amount. We use inequality (11.1.15), which is valid for every set of code points $\zeta_1, \dots, \zeta_{\tilde M}$. It remains true if we average with respect to a statistical ensemble of code points described by the probabilities $P_{I_n}(d\zeta_1) \cdots P_{I_n}(d\zeta_{\tilde M})$, $\tilde M = [e^{n\tilde I_1}]$. Thus we have
$$\tilde R_n(n\tilde I_1) \leqslant \int E\{E[c(\xi,\zeta_k) \mid G_k] \mid \zeta_1, \dots, \zeta_{\tilde M}\}\, P_{I_n}(d\zeta_1) \cdots P_{I_n}(d\zeta_{\tilde M}) \equiv L. \tag{11.2.1}$$
Let us write down the expression $L$ on the right-hand side in more detail. Since $G_k$ is the region of points $\xi$ for which the 'distance' $c(\xi,\zeta_k)$ is at most the 'distance' $c(\xi,\zeta_i)$ to any point $\zeta_i$ from $\zeta_1, \dots, \zeta_{\tilde M}$, we conclude that
$$E[c(\xi,\zeta_k) \mid G_k] = \frac{1}{P(G_k)} \int_{\substack{c(\xi,\zeta_k) \leqslant c(\xi,\zeta_1) \\ \cdots \\ c(\xi,\zeta_k) \leqslant c(\xi,\zeta_{\tilde M})}} c(\xi,\zeta_k)\,P(d\xi)$$
and, consequently, also
$$E\{E[c(\xi,\zeta_k) \mid G_k] \mid \zeta_1, \dots, \zeta_{\tilde M}\} = \sum_k P(G_k)\,E[c(\xi,\zeta_k) \mid G_k] = \sum_k \int_{\substack{c(\xi,\zeta_k) \leqslant c(\xi,\zeta_1) \\ \cdots \\ c(\xi,\zeta_k) \leqslant c(\xi,\zeta_{\tilde M})}} c(\xi,\zeta_k)\,P(d\xi).$$
Averaging over $\zeta_1, \dots, \zeta_{\tilde M}$ yields
$$L = \sum_k \int \cdots \int_{\substack{c(\xi,\zeta_k) \leqslant c(\xi,\zeta_1) \\ \cdots \\ c(\xi,\zeta_k) \leqslant c(\xi,\zeta_{\tilde M})}} c(\xi,\zeta_k)\,P(d\xi)\,P_{I_n}(d\zeta_1) \cdots P_{I_n}(d\zeta_{\tilde M}). \tag{11.2.2}$$
For each $k$-th term, it is convenient first of all to integrate with respect to the points $\zeta_i$ that do not coincide with $\zeta_k$. If we introduce the cumulative distribution function
$$1 - F_\xi(\lambda) = \begin{cases} \displaystyle\int_{c(\xi,\zeta) \geqslant \lambda} P_{I_n}(d\zeta) & \text{for } \lambda \geqslant 0, \\[2ex] \displaystyle\int_{c(\xi,\zeta) > \lambda} P_{I_n}(d\zeta) & \text{for } \lambda < 0, \end{cases} \tag{11.2.3}$$
then, after an $(\tilde M - 1)$-fold integration over the points $\zeta_i$ not coinciding with $\zeta_k$, equation (11.2.2) results in
$$L \leqslant \sum_{k=1}^{\tilde M} \int_\xi \int_{\zeta_k} c(\xi,\zeta_k)\,[1 - F_\xi(c(\xi,\zeta_k))]^{\tilde M - 1} P(d\xi)\,P_{I_n}(d\zeta_k) = \tilde M \int_\xi \int_\zeta c(\xi,\zeta)\,[1 - F_\xi(c(\xi,\zeta))]^{\tilde M - 1} P(d\xi)\,P_{I_n}(d\zeta). \tag{11.2.4}$$
The inequality sign has emerged because for $c(\xi,\zeta_k) \geqslant 0$ we slightly expanded the regions Gi (i = k) by adding to them all 'questionable' points $\xi$, for which $c(\xi,\zeta_k) = c(\xi,\zeta_i)$, and also for $c(\xi,\zeta_k) < 0$ we shrank those regions by dropping all such points. It is not hard to see that (11.2.4) can be rewritten as
$$L \leqslant \int P(d\xi) \int_{-\infty}^{\infty} \lambda\, d\{1 - [1 - F_\xi(\lambda)]^{\tilde M}\},$$
or, as a result of (11.2.1),
$$\tilde R_n(n\tilde I_1) \leqslant \int P(d\xi) \int \lambda\, dF_1(\lambda). \tag{11.2.5}$$
Here on the right-hand side we average over $\xi$ the expectation corresponding to the cumulative distribution function
$$F_1(\lambda) = 1 - [1 - F_\xi(\lambda)]^{\tilde M}. \tag{11.2.6}$$
It is evident that
$$F_1(\lambda) \geqslant F_\xi(\lambda), \quad \text{i.e.} \quad 1 - F_1 = [1 - F_\xi]^{\tilde M} \leqslant 1 - F_\xi, \tag{11.2.7}$$
since $1 - F_\xi(\lambda) \leqslant 1$. It is convenient to introduce a new cumulative distribution function close to $F_1(\lambda)$ (for large $n$):
$$F_2(\lambda) = \max\{1 - e^{-\tilde M F_\xi(\lambda)},\ F_\xi(\lambda)\}. \tag{11.2.8}$$
It is not hard to prove that
$$F_1 = 1 - (1 - F_\xi)^{\tilde M} \geqslant 1 - e^{-\tilde M F_\xi}. \tag{11.2.9}$$
Indeed, by using the inequality $1 - F_\xi \leqslant e^{-F_\xi}$, we get
$$(1 - F_\xi)^{\tilde M} \leqslant e^{-\tilde M F_\xi}, \tag{11.2.10}$$
that is equivalent to (11.2.9).
Employing (11.2.7), (11.2.9) we obtain that function (11.2.8) does not exceed function (11.2.6):
$$F_2(\lambda) \leqslant F_1(\lambda). \tag{11.2.11}$$
It follows from the last inequality that
$$\int \lambda\, dF_1(\lambda) \leqslant \int \lambda\, dF_2(\lambda). \tag{11.2.12}$$
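Inequalities (11.2.7), (11.2.11) and (11.2.12) are easy to check numerically. The sketch below is a minimal illustration, assuming a cost that takes four values and treating $F_\xi$ as an ordinary step distribution function, in which case $F_1$ of (11.2.6) is the distribution function of the smallest of $\tilde M$ independent draws. The helper names `bounds` and `stieltjes_mean` and the numbers are hypothetical.

```python
import math


def bounds(cdf, m):
    """Return (F_1, F_2) of (11.2.6), (11.2.8) on a grid of F_xi values."""
    f1 = [1.0 - (1.0 - f) ** m for f in cdf]
    f2 = [max(1.0 - math.exp(-m * f), f) for f in cdf]
    return f1, f2


def stieltjes_mean(values, cdf):
    """Mean (integral of lambda dF) for a step CDF with jumps at `values`."""
    mean, prev = 0.0, 0.0
    for v, f in zip(values, cdf):
        mean += v * (f - prev)
        prev = f
    return mean


values = [0.0, 1.0, 2.0, 3.0]   # possible costs c(xi, zeta), ascending
f_xi = [0.1, 0.3, 0.6, 1.0]     # F_xi at those values (assumed example)
f1, f2 = bounds(f_xi, m=5)
mean_1 = stieltjes_mean(values, f1)
mean_2 = stieltjes_mean(values, f2)
mean_xi = stieltjes_mean(values, f_xi)
```

In this example the means are ordered as `mean_1 <= mean_2 <= mean_xi`: taking the best of several code points lowers the expected cost, and replacing $F_1$ by $F_2$ only loosens the bound.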
In order to ascertain this, note that (11.2.11) entails the inequality $\lambda_1(F) \leqslant \lambda_2(F)$ for the inverse functions, due to which the difference $\int \lambda\, dF_2 - \int \lambda\, dF_1$ can be written as $\int_0^1 [\lambda_2(F) - \lambda_1(F)]\,dF$ and, consequently, turns out to be non-negative. So inequality (11.2.5) remains valid if we replace $\int \lambda\, dF_1$ on the right-hand side by the larger quantity $\int \lambda\, dF_2$. Therefore, we have
$$\tilde R_n(n\tilde I_1) \leqslant \int P(d\xi) \int \lambda\, dF_2(\lambda), \tag{11.2.13}$$
where $F_2(\lambda)$ is defined by formula (11.2.8) with $\tilde M = [e^{n\tilde I_1}]$. The derived inequality will be used in Section 11.3.

2. Theorem 11.1. Let $[P_n(d\xi), c(\xi,\zeta)]$ be Bayesian systems that are the $n$-th degree of a system $[P_1(dx), c_1(x,u)]$ with a bounded cost function:
$$|c_1(x,u)| \leqslant K_1. \tag{11.2.14}$$
Then the cost rates corresponding to different types of information coincide in the limit:
$$\frac{1}{n}\tilde R_n(n\tilde I_1) \to R_1(\tilde I_1) \tag{11.2.15}$$
as $n \to \infty$. In other words, there is asymptotic equivalence
$$\tilde V_n(n\tilde I_1)/V_n(n\tilde I_1) \to 1 \tag{11.2.16}$$
of the value functions of Shannon's and Hartley's information amounts. It is assumed here that the cost rates in (11.2.15) are finite and that the function $R_1(I_1)$ is continuous.

In the proof of this theorem we shall follow Shannon [43] (originally published in English [41]), who formulated this result in different terms.

Proof. 1. In consequence of (11.1.4) the random information
$$I(\xi,\zeta) = \sum_{i=1}^{n} I(x_i,u_i)$$
constitutes a sum of independent random variables for the extremum distribution (11.1.6). The cost function (11.1.5) takes the same form. Each of them has
a finite expectation. Applying the Law of Large Numbers (Khinchin's Theorem) we obtain
$$P\Big\{\Big|\frac{1}{n}c(\xi,\zeta) - \frac{1}{n}R_n(nI_1)\Big| < \varepsilon\Big\} \to 1, \qquad P\Big\{\Big|\frac{1}{n}I(\xi,\zeta) - I_1\Big| < \varepsilon\Big\} \to 1$$
for $n \to \infty$ and for any $\varepsilon > 0$. Thus, whatever $\delta > 0$ is, for sufficiently large $n > n(\varepsilon,\delta)$ we shall have
$$P\{c(\xi,\zeta)/n < R_1(I_1) + \varepsilon\} > 1 - \delta^2/2, \qquad P\{I(\xi,\zeta)/n < I_1 + \varepsilon\} > 1 - \delta^2/2. \tag{11.2.17}$$
Let us denote by $\Gamma$ the set of pairs $(\xi,\zeta)$ for which both inequalities
$$c(\xi,\zeta) < n[R_1(I_1) + \varepsilon], \tag{11.2.18}$$
$$I(\xi,\zeta) < n(I_1 + \varepsilon) \tag{11.2.19}$$
hold true simultaneously. Then it follows from (11.2.17) that
$$P(\Gamma) > 1 - \delta^2. \tag{11.2.20}$$
Indeed, the probability of $\bar A$ (the complementary event of (11.2.18)) and also the probability of $\bar B$ (the complementary event of (11.2.19)) are at most $\delta^2/2$. Therefore, we have $P(\bar A \cup \bar B) \leqslant P(\bar A) + P(\bar B) \leqslant \delta^2$ for their union. The complementary event of $\bar A \cup \bar B$ is precisely $\Gamma$ and, consequently, (11.2.20) is valid.

For a fixed $\xi$, let $Z_\xi$ be the set of those $\zeta$ which belong to $\Gamma$ in a pair with $\xi$. In other words, $Z_\xi$ is a section of the set $\Gamma$. With this definition we have $P(Z_\xi \mid \xi) = P(\Gamma \mid \xi)$. Consider the set $\Xi$ of those elements $\xi$ for which
$$P(Z_\xi \mid \xi) > 1 - \delta. \tag{11.2.21}$$
It follows from (11.2.20) that
$$P(\Xi) > 1 - \delta. \tag{11.2.22}$$
In order to confirm this, let us assume the contrary, that is
$$P(\Xi) \leqslant 1 - \delta; \qquad P(\bar\Xi) \geqslant \delta \tag{11.2.22a}$$
(here $\bar\Xi$ is the complement of $\Xi$). The probability $P(\bar\Gamma)$ of the complement of $\Gamma$ can be estimated in this case in the following manner. Let us use the following
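representation:
$$P(\bar\Gamma) = \int P(\bar\Gamma \mid \xi)\,P(d\xi),$$
split this integral into two parts and keep just one of them:
$$P(\bar\Gamma) = \int_{\Xi} P(\bar\Gamma \mid \xi)\,P(d\xi) + \int_{\bar\Xi} P(\bar\Gamma \mid \xi)\,P(d\xi) \geqslant \int_{\bar\Xi} P(\bar\Gamma \mid \xi)\,P(d\xi). \tag{11.2.22b}$$
Within the complementary set $\bar\Xi$ the inequality opposite to (11.2.21) holds true: $P(Z_\xi \mid \xi) \equiv P(\Gamma \mid \xi) \leqslant 1 - \delta$, i.e. $P(\bar\Gamma \mid \xi) \geqslant \delta$ when $\xi \in \bar\Xi$. Substituting this estimate into (11.2.22b) and taking into account (11.2.22a) we find that
$$P(\bar\Gamma) \geqslant \delta \int_{\bar\Xi} P(d\xi) = \delta P(\bar\Xi) \geqslant \delta^2.$$
This contradicts (11.2.20) and, consequently, the assumption (11.2.22a) is not true.

Inequality (11.2.19) means that
$$\ln \frac{P(\zeta \mid \xi)}{P(\zeta)} < n(I_1 + \varepsilon),$$
that is
$$P(\zeta) > e^{-n(I_1+\varepsilon)} P(\zeta \mid \xi).$$
Summing over $\zeta \in Z_\xi$ we obtain from here that
$$P(Z_\xi) > e^{-n(I_1+\varepsilon)} P(Z_\xi \mid \xi),$$
or, employing (11.2.21),
$$P(Z_\xi) > e^{-n(I_1+\varepsilon)}(1 - \delta) \qquad (\xi \in \Xi), \tag{11.2.23}$$
where $n > n(\varepsilon,\delta)$.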
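The step from (11.2.20) to (11.2.22) is an averaging argument and can be checked on a toy example. In the sketch below, the distribution of $\xi$ and the conditional probabilities $P(\Gamma \mid \xi)$ are made-up numbers used only for illustration, and the function name `section_probabilities` is hypothetical.

```python
def section_probabilities(p_xi, p_gamma_given_xi, delta):
    """Return (P(Gamma), P(Xi)) with Xi = {xi : P(Gamma | xi) > 1 - delta}."""
    p_gamma = sum(p * g for p, g in zip(p_xi, p_gamma_given_xi))
    p_big_xi = sum(p for p, g in zip(p_xi, p_gamma_given_xi) if g > 1.0 - delta)
    return p_gamma, p_big_xi


delta = 0.3
p_gamma, p_big_xi = section_probabilities(
    p_xi=[0.5, 0.3, 0.15, 0.05],
    p_gamma_given_xi=[1.0, 0.99, 0.9, 0.5],
    delta=delta,
)
```

Here $P(\Gamma)$ exceeds $1 - \delta^2$ and, as the argument above requires, the set $\Xi$ of 'good' points $\xi$ has probability above $1 - \delta$.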
2. Now we apply formula (11.2.5). Let us take the following cumulative distribution function:
$$F_3(\lambda) = \begin{cases} 0 & \text{for } \lambda < nR_1(I_1) + n\varepsilon, \\ F_1(nR_1(I_1) + n\varepsilon) & \text{for } nR_1(I_1) + n\varepsilon \leqslant \lambda < nK_1, \\ 1 & \text{for } \lambda \geqslant nK_1. \end{cases} \tag{11.2.24}$$
Because of the boundedness condition (11.2.14) of the cost function, the probability of the inequality $c(\xi,\zeta) > nK_1$ being valid is equal to zero, and it follows from (11.2.3) that $F_\xi(nK_1) = 1$. As a result of (11.2.6) we have $F_1(nK_1) = 1$. Hence, the functions $F_1(\lambda)$ and $F_3(\lambda)$ coincide within the interval $\lambda \geqslant nK_1$. Within the interval $nR_1(I_1) + n\varepsilon \leqslant \lambda < nK_1$ we have $F_3(\lambda) \leqslant F_1(\lambda)$, since
$$F_3(\lambda) = F_1(nR_1(I_1) + n\varepsilon) \leqslant F_1(\lambda) \qquad \text{where } nR_1(I_1) + n\varepsilon \leqslant \lambda$$
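by virtue of the non-decreasing character of the function $F_1(\lambda)$. Therefore, the inequality $F_3(\lambda) \leqslant F_1(\lambda)$ holds true for all values of $\lambda$. This inequality entails the opposite inequality for the means:
$$\int \lambda\, dF_3(\lambda) \geqslant \int \lambda\, dF_1(\lambda),$$
in the same way as (11.2.12) has followed from (11.2.11). That is why formula (11.2.5) yields
$$\tilde R_n(n\tilde I_1) \leqslant \int P(d\xi) \int \lambda\, dF_3(\lambda), \tag{11.2.25}$$
but
$$\int \lambda\, dF_3(\lambda) = [nR_1(I_1) + n\varepsilon]\,F_1(nR_1(I_1) + n\varepsilon) + nK_1[1 - F_1(nR_1(I_1) + n\varepsilon)]$$
due to (11.2.24). Consequently, (11.2.25) together with (11.2.6) results in
$$\frac{1}{n}\tilde R_n(n\tilde I_1) \leqslant R_1(I_1) + \varepsilon + \int [K_1 - R_1(I_1) - \varepsilon]\,[1 - F_\xi(nR_1(I_1) + n\varepsilon)]^{\tilde M} P(d\xi)$$
or
$$\frac{1}{n}\tilde R_n(n\tilde I_1) \leqslant R_1(I_1) + \varepsilon + [K_1 - R_1(I_1) - \varepsilon] \int P(d\xi)\, e^{-\tilde M F_\xi(nR_1(I_1) + n\varepsilon)}, \tag{11.2.26}$$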
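if (11.2.10) is taken into account. The validity of the presumed inequality $K_1 - R_1(I_1) - \varepsilon > 0$ is ensured by increasing $K_1$.

3. As a result of (11.2.3) we have
$$F_\xi(nR_1(I_1) + n\varepsilon) = \int_{c(\xi,\zeta) < nR_1(I_1) + n\varepsilon} P(d\zeta).$$
Every $\zeta \in Z_\xi$ satisfies (11.2.18), so this probability is at least $P(Z_\xi)$, and (11.2.23) gives
$$F_\xi(nR_1(I_1) + n\varepsilon) > e^{-n(I_1+\varepsilon)}(1 - \delta) \qquad (\xi \in \Xi). \tag{11.2.27}$$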
Let us split the integral in (11.2.26) into two parts: an integral over the set $\Xi$ and an integral over the complementary set $\bar\Xi$. We apply (11.2.27) to the first integral and replace the exponential by one in the second. This yields
$$\int P(d\xi) \exp[-\tilde M F_\xi(nR_1(I_1) + n\varepsilon)] \leqslant \int_{\Xi} P(d\xi) \exp[-\tilde M e^{-nI_1 - n\varepsilon}(1 - \delta)] + 1 - P(\Xi) \leqslant \exp[-\tilde M e^{-nI_1 - n\varepsilon}(1 - \delta)] + \delta, \tag{11.2.28}$$
where inequality (11.2.22) is taken into account. Here $\tilde M = [e^{n\tilde I_1}] \geqslant e^{n\tilde I_1} - 1$, and it is also expedient to set $\tilde I_1 = I_1 + 2\varepsilon$, $\varepsilon > 0$. Then inequality (11.2.28) takes the form
$$\int P(d\xi) \exp[-\tilde M F_\xi(nR_1(I_1) + n\varepsilon)] \leqslant \exp\{-e^{n\varepsilon}(1 - \delta)(1 - e^{-n\tilde I_1})\} + \delta.$$
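Plugging the latter expression into (11.2.26) we obtain that
$$\frac{1}{n}\tilde R_n(n\tilde I_1) \leqslant R_1(\tilde I_1 - 2\varepsilon) + \varepsilon + 2K_1\delta + 2K_1 \exp\{-e^{n\varepsilon}(1 - \delta)(1 - e^{-n\tilde I_1})\} \tag{11.2.29}$$
(here we used the fact that $|R_1| < K_1$ due to (11.2.14)). Thus, we have deduced that for every $\tilde I_1$, $\varepsilon > 0$, $\delta > 0$ independent of $n$, there exists $n(\varepsilon,\delta)$ such that inequality (11.2.29) is valid for all $n > n(\varepsilon,\delta)$. This means that
$$\limsup_{n\to\infty} \frac{1}{n}\tilde R_n(n\tilde I_1) \leqslant R_1(\tilde I_1 - 2\varepsilon) + \varepsilon + 2K_1\delta + \lim_{n\to\infty} 2K_1 \exp\{-e^{n\varepsilon}(1 - \delta)(1 - e^{-n\tilde I_1})\}.$$
However,
$$\lim_{n\to\infty} \exp\{-e^{n\varepsilon}(1 - \delta)(1 - e^{-n\tilde I_1})\} = 0$$
(here $\delta < 1$). Thus, we have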
$$\limsup_{n\to\infty} \frac{1}{n}\tilde R_n(n\tilde I_1) - R_1(\tilde I_1) \leqslant R_1(\tilde I_1 - 2\varepsilon) - R_1(\tilde I_1) + \varepsilon + 2K_1\delta. \tag{11.2.30}$$
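The last term of (11.2.29) decays doubly exponentially in $n$, which is why it drops out of (11.2.30). The following sketch simply tabulates that term for a few values of $n$; the parameter values are arbitrary assumptions, and the function name `residual` is hypothetical.

```python
import math


def residual(n, eps, delta, i_tilde, k1):
    """Last term of (11.2.29): 2 K1 exp{-e^{n eps} (1 - delta)(1 - e^{-n I~1})}."""
    inner = math.exp(n * eps) * (1.0 - delta) * (1.0 - math.exp(-n * i_tilde))
    return 2.0 * k1 * math.exp(-inner)


# Assumed illustrative parameters: eps = 0.05, delta = 0.1, I~1 = 0.5, K1 = 1.
terms = [residual(n, 0.05, 0.1, 0.5, 1.0) for n in (10, 50, 100, 200)]
```

Even with a small $\varepsilon$ the term is negligible once $n\varepsilon$ reaches a few units, so the bound is governed by $\varepsilon$ and $\delta$ alone.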
Because the function $R_1(I_1)$ is continuous, the expression on the right-hand side of (11.2.30) can be made arbitrarily small by considering sufficiently small $\varepsilon$ and $\delta$. Therefore,
$$\limsup_{n\to\infty} \frac{1}{n}\tilde R_n(n\tilde I_1) \leqslant R_1(\tilde I_1).$$
The above formula, together with the inequality
$$\liminf_{n\to\infty} \frac{1}{n}\tilde R_n(n\tilde I_1) \geqslant R_1(\tilde I_1), \tag{11.2.30a}$$
that follows from (11.1.3) and (11.1.8), proves the relation
$$\limsup_{n\to\infty} \frac{1}{n}\tilde R_n(n\tilde I_1) = \liminf_{n\to\infty} \frac{1}{n}\tilde R_n(n\tilde I_1) = R_1(\tilde I_1),$$
that is equation (11.2.15). The proof is complete.

3. Theorem 11.1 proven above allows for a natural generalization to those cases in which the system under consideration is not an $n$-th degree of some elementary system, but instead some other more general conditions are satisfied. The corresponding generalization is analogous to the generalization made during the transition from Theorem 7.1 to Theorem 7.2, where the requirement for a channel to be an $n$-th degree of an elementary channel was replaced by a requirement of informational stability. Therefore, let us impose on the Bayesian systems in question a condition such that no significant changes to the above proof are required. According to a common trick, we have to exclude from the proof the number $n$ and the rates $I_1$, $R_1$, ..., and instead consider only the combinations $I = nI_1$, $R = nR_1$ and so on. Let us require that the sequence of random variables $\xi$, $\zeta$ (dependent on $n$ or some other parameter) be informationally stable in terms of the definition given in Section 7.3. Besides, let us require the following convergence in probability:
$$[c(\xi,\zeta) - R]/V(I) \to 0 \quad \text{as} \quad n \to \infty. \tag{11.2.31}$$
It is easy to see that under such conditions inequalities of the type (11.2.17) will hold. Those inequalities now take the form
$$P\{c(\xi,\zeta) < R + \varepsilon_1 V(I)\} > 1 - \delta^2/2, \qquad P\{I(\xi,\zeta) < I + \varepsilon_2 I\} > 1 - \delta^2/2$$
for $n > n(\varepsilon_1,\varepsilon_2,\delta)$, where $\varepsilon_1 V = \varepsilon_2 I = n\varepsilon$. Instead of the previous relation $\tilde I_1 = I_1 + 2\varepsilon$ we now have $\tilde I = I + 2\varepsilon_2 I$. Let us take the boundedness condition (11.2.14) in the following form:
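$$|c(\xi,\zeta)| \leqslant KV(I), \tag{11.2.32}$$
where $K$ is independent of $n$. Only the way the formulae are written needs to be modified in the proof above. Now the relation (11.2.29) takes the form
$$\tilde R(\tilde I) \leqslant R(\tilde I - 2\varepsilon_2 I) + \varepsilon_1 V(I) + 2KV(I)\delta + 2KV(I)\exp\{-e^{\varepsilon_2 I}(1 - \delta)(1 - e^{-\tilde I})\}$$
or
$$\frac{V(\tilde I - 2\varepsilon_2 I) - \tilde V(\tilde I)}{V(\tilde I - 2\varepsilon_2 I)} \leqslant \varepsilon_1 + 2K\delta + 2K\exp\{-e^{\varepsilon_2 I}(1 - \delta)(1 - e^{-\tilde I})\} \tag{11.2.33}$$
for all values of $\varepsilon_1$, $\varepsilon_2$, $\delta$ that are independent of $n$, and for $n > n(\varepsilon_1,\varepsilon_2,\delta)$. In consequence of condition A from the definition of informational stability (see Section 7.3), for every $\varepsilon_2 > 0$ the expression $\exp\{-e^{\varepsilon_2 I}(1 - \delta)(1 - e^{-\tilde I})\}$, where $\tilde I = (1 + 2\varepsilon_2)I$, converges to zero as $n \to \infty$ if $\delta < 1$. Therefore, passing to the limit $n \to \infty$ in (11.2.33) results in
$$1 - \liminf_{n\to\infty} \frac{V(\tilde I)}{V(\tilde I - 2\varepsilon_2 I)}\,\frac{\tilde V(\tilde I)}{V(\tilde I)} \leqslant \varepsilon_1 + 2K\delta. \tag{11.2.33a}$$
We assume that there exists a limit
$$\lim \frac{V(\tilde I)}{V(y\tilde I)} = \varphi(y), \tag{11.2.34}$$
which is a continuous function of $y$. Then it follows from (11.2.33a) that
$$\liminf_{n\to\infty} \frac{\tilde V(\tilde I)}{V(\tilde I)}\,\varphi\Big(\frac{1}{1 + 2\varepsilon_2}\Big) \geqslant 1 - \varepsilon_1 - 2K\delta.$$
Because $\varepsilon_1$, $\varepsilon_2$, $\delta$ are arbitrary and the function $\varphi(y)$ is continuous, we obtain that
$$\liminf_{n\to\infty} \frac{\tilde V(\tilde I)}{V(\tilde I)} \geqslant 1. \tag{11.2.34a}$$
In order to prove the convergence
$$\frac{\tilde V(\tilde I)}{V(\tilde I)} \to 1 \tag{11.2.35}$$
it only remains to compare (11.2.34a) with the inequality
$$\limsup_{n\to\infty} \frac{\tilde V(\tilde I)}{V(\tilde I)} \leqslant 1,$$
which is a generalization of (11.2.30a).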
4. We noted in Section 9.6 that the value of information functions $V(I)$, $\tilde V(I)$ are invariant under the following transformation of the cost function:
$$c'(\xi,\zeta) = c(\xi,\zeta) + f(\xi) \tag{11.2.36}$$
(see Theorem 9.8). At the same time the difference of risks $R - \tilde R$ and also the regions $G_k$ defined in Section 11.1 remain invariant, if the code points $\zeta_k$ and the distribution $P(d\zeta)$ from which they are sampled do not change. Meanwhile, conditions (11.2.31), (11.2.32) are not invariant under the transformation defined by (11.2.36). Thus, (11.2.31) turns into
$$\{c'(\xi,\zeta) - E[c'(\xi,\zeta)] - f(\xi) + E[f(\xi)]\}/V(I) \to 0,$$
which evidently does not coincide with
$$\{c'(\xi,\zeta) - E[c'(\xi,\zeta)]\}/V(I) \to 0.$$
Taking this into account, one may take advantage of the arbitrary function $f(\xi)$ in transformation (11.2.36) and select it in such a way that conditions (11.2.31), (11.2.32) are satisfied, if they were not satisfied initially. This broadens the set of cases, for which the convergence (11.2.34) can be proven, and relaxes the conditions required for asymptotic equivalence of the value functions in the aforementioned theory.

In fact, using a particular choice of the function $f(\xi)$ in (11.2.36) one can eliminate the need for condition (11.2.31) altogether. Indeed, as one can see from (9.4.21), the following equation holds for the case of the extremum distribution $P(\zeta \mid \xi)$:
$$-I(\xi,\zeta) = \beta c(\xi,\zeta) + \gamma(\xi). \tag{11.2.37}$$
Using the freedom in the choice of the function $f(\xi)$ in (11.2.36), let us set $f(\xi) = \gamma(\xi)/\beta$. Then the new 'distance' $c'(\xi,\zeta)$ coincides with $-I(\xi,\zeta)/\beta$, and the convergence condition (11.2.31) need not be discussed, because it evidently follows from the informational stability condition
$$[I(\xi,\zeta) - I_{\xi\zeta}]/I_{\xi\zeta} \to 0 \quad \text{(in probability)} \tag{11.2.38}$$
provided, of course, that the ratio $I/\beta V$ does not grow to infinity.

We say that a sequence of Bayesian systems $[P(d\xi), c(\xi,\zeta)]$ is informationally stable, if the sequence of random variables $\xi$, $\zeta$ is informationally stable for extremum distributions. Hence, the next result follows from the above.

Theorem 11.2 (A general form of the third asymptotic theorem). If an informationally stable sequence of Bayesian systems satisfies the following boundedness condition:
$$|I(\xi,\zeta)| \leqslant K' V(I)/V'(I) \tag{11.2.39}$$
($K'$ is independent of $n$) and function (11.2.34) is continuous in $y$, then the convergence defined in (11.2.35) takes place.
Condition (11.2.39), as one can check, follows from (11.2.32) with $K' = \beta K V'(I)$.

5. Let us use the aforementioned theory for a more detailed analysis of the measuring-transmitting system considered in paragraph 3 of Section 9.2. Asymptotic equivalence of the values of Hartley's and Shannon's information allows one to improve the performance of the system shown in Figure 9.5, i.e. to decrease the average cost from the level
$$\tilde R(I) = \min_u E[c(x,u)] - \tilde V(I), \qquad I = \ln m,$$
to the level
$$R(I) = \min_u E[c(x,u)] - V(I),$$
defined by the value of Shannon's information. In order to achieve this, we need to activate the system many times and replace $x$ by the block $\xi = (x_1, \dots, x_n)$ and $u$ by the block $\zeta = (u_1, \dots, u_n)$ for sufficiently large $n$. The cumulative cost has to be taken as the penalty. After that we can apply Theorem 11.1 and construct blocks 1 and 2 shown in Figure 11.1 according to the same principles as in paragraph 3 of Section 9.2. Now, however, with the knowledge of the proof of Theorem 11.1 we can describe their operation in more detail. Since we do not strive to reach precise optimality, let us consider the aforementioned partition (11.1.13) as the required partition $\sum_k E_k = X^n$. The regions $G_k$ are defined here by the proximity to the random code points $\zeta_1, \dots, \zeta_{\tilde M}$. This partition is asymptotically optimal, as has been proven.

Fig. 11.1 Block diagram of a system subject to an information constraint. The channel capacity is n times greater than that in Figure 9.5. MD: measuring device

In other words, the system (see Figure 11.1), in which block 1 classifies the input signal $\xi$ according to its membership in the regions $G_k$ and outputs the index of the region containing $\xi$, is asymptotically optimal. It is readily seen that the specified operation of block 1 is entirely analogous to the operation of the decoder at the output of a noisy channel mentioned in Chapter 7. The only difference is that instead of the 'distance' defined by (7.1.8) we now consider the 'distance' $c(\xi,\zeta)$, and $\xi$, $\eta$ are replaced by $\zeta$, $\xi$. This analogy allows us to call block 1 a measuring decoder, while block 2 acts as a block of optimal estimation.

The described information system and more general systems were studied in the works of Stratonovich [47], Stratonovich and Grishanin [55], Grishanin and Stratonovich [17].

Information constraints of the type considered above (Figures 9.6 and 11.1) can be taken into account in various systems of optimal filtering, automatic control, dynamic programming and even game theory. Sometimes these constraints are related to a bounded information inflow, sometimes to memory limitations, sometimes to a bounded complexity of an automaton or a controller. The consideration of these constraints leads to the permeation of these theories by the concepts and methods of information theory and to their amalgamation with the theory of information.

In dynamic programming (for instance, see Bellman [2], the corresponding original in English is [1]) and often in other theories a number of actions occurring consecutively in time is considered. As a result, informational constraints in this case occur repeatedly. This leads to a generalization of the aforementioned results to sequences. A number of questions pertaining to this direction are studied in the works of Stratonovich [49, 51], Stratonovich and Grishanin [56].
11.3 Rate of convergence between the values of Shannon's and Hartley's information

Theorem 11.1 above establishes the asymptotic equivalence of the values of Shannon's and Hartley's information amounts. It is of interest how quickly the difference between these values vanishes. Recall that Chapter 7, right after Theorems 7.1 and 7.2, which established the asymptotic vanishing of the probability of decoding error, contains theorems in which the rate of vanishing of that probability was studied.

Undoubtedly, we can obtain a large number of results of various complexity and importance concerning the rate of vanishing of the difference $V(I) - \tilde V(I)$ (as in the problem of asymptotically zero probability of error). Various methods can also be used in solving this problem. Here we give some comparatively simple results concerning this problem. We shall calculate the first terms of the asymptotic expansion of the difference $V(I) - \tilde V(I)$ in powers of the small parameter $n^{-1}$. In so doing, we shall determine that the boundedness condition (11.2.14) of the cost function, stipulated in the proof of Theorem 11.1, is inessential for the asymptotic equivalence of the values of different kinds of information and is dictated only by the adopted proof method.

Consider formula (11.2.13), which can be rewritten using the notation $S = \int \lambda\, dF_2(\lambda)$ as follows:
$$\tilde R_n(n\tilde I_1) \leqslant \int S\,P(d\xi). \tag{11.3.1}$$
Hereinafter, we identify $\tilde I$ with $I$, $\tilde I_1$ with $I_1$ and $\tilde M$ with $M$, because it is now unnecessary to distinguish between them. Let us perform an asymptotic calculation of the expression on the right-hand side of the above inequality.

1. In view of (11.2.8) we have
$$S = S_1 + S_2 + S_3, \tag{11.3.2}$$
where
$$S_1 = -\int_{\lambda=-\infty}^{\bar c} \lambda\, de^{-MF_\xi(\lambda)}, \qquad S_2 = -\int_{\lambda=\bar c}^{\infty} \lambda\, de^{-MF_\xi(\lambda)}, \tag{11.3.3}$$
$$S_3 = \int_{F_\xi(\lambda) > 1 - e^{-MF_\xi(\lambda)}} \lambda\, d\big[F_\xi(\lambda) - 1 + e^{-MF_\xi(\lambda)}\big]. \tag{11.3.4}$$
It is convenient to introduce here the notation
$$\bar c = \int c(\xi,\zeta)\,P_{I_n}(d\zeta).$$
The distribution function $F_\xi(\lambda) = P[c(\xi,\zeta) \leqslant \lambda]$ ($\xi$ is fixed, $P$ corresponds to $P_{I_n}(d\zeta)$, see (11.2.3)) appearing in these formulae can be assessed using Theorem 4.8, or more precisely using its generalization that was discussed right after its proof. We assume for simplicity that the segment $[s_1, s_2]$ mentioned in Theorem 4.8 is rather large, so that equation (4.4.17), i.e. (11.3.7), has a root $s$ for all values $\lambda = x$. Then
$$F_\xi(\lambda) = \big[2\pi\mu_\xi''(s)s^2\big]^{-1/2} e^{-s\mu_\xi'(s) + \mu_\xi(s)}\big[1 + O(\mu_\xi^{-1})\big] \tag{11.3.5}$$
for $\lambda < \int c(\xi,\zeta)P_{I_n}(d\zeta) = \bar c$, and also
$$F_\xi(\lambda) = 1 - \big[2\pi\mu_\xi''(s)s^2\big]^{-1/2} e^{-s\mu_\xi'(s) + \mu_\xi(s)}\big[1 + O(\mu_\xi^{-1})\big] \tag{11.3.6}$$
for $\lambda > \bar c$, where $s$ is the root of the equation
$$\mu_\xi'(s) = \lambda, \tag{11.3.7}$$
and
$$\mu_\xi(t) = \ln \int e^{tc(\xi,\zeta)} P_{I_n}(d\zeta). \tag{11.3.8}$$
The different expressions (11.3.5), (11.3.6) correspond to different signs of the root $s$.

Due to (11.3.5)–(11.3.7) the first integral in (11.3.3) can be transformed into the form
$$S_1 = -\int_{s=s_*}^{0} \mu_\xi'(s)\, d\exp\Big\{-\exp\Big[nI_1 - s\mu_\xi'(s) + \mu_\xi(s) - \frac{1}{2}\ln\big(2\pi\mu_\xi''(s)s^2\big)\Big]\big[1 + O(n^{-1})\big]\Big\}, \tag{11.3.9}$$
where the integration is carried out with respect to $s$, and the lower bound of integration $s_*$ is determined by the formula $\lim_{s\to s_*} \mu_\xi'(s) = -\infty$.
For a better illustration (perhaps somewhat symbolically) we denote $I_1 = I/n$, $\mu_1 = \mu_\xi/n$ and substitute $O(n^{-1})$ for $O(\mu_\xi^{-1})$. In order to estimate the specified integral, we consider the point $r$ on the $s$-axis determined by the equation
$$I_1 - r\mu_1'(r) + \mu_1(r) - \frac{1}{2n}\ln\big[2\pi n\mu_1''(r)r^2\big] = 0, \tag{11.3.10}$$
where $\mu_1(t) = \mu_\xi(t)/n$. Then we perform a Taylor expansion about this point of the functions from equation (11.3.9):
$$\mu_1'(s) = \mu_1'(r) + \mu_1''(r)(s - r) + \frac{1}{2}\mu_1'''(r)(s - r)^2 + \cdots,$$
$$s\mu_1'(s) - \mu_1(s) = r\mu_1'(r) - \mu_1(r) + r\mu_1''(r)(s - r) + \frac{1}{2}\big[r\mu_1'''(r) + \mu_1''(r)\big](s - r)^2 + \cdots,$$
$$\ln\big[2\pi n\mu_1''(s)s^2\big] = \ln\big[2\pi n\mu_1''(r)r^2\big] + \big[\mu_1'''(r)/\mu_1''(r) + 2/r\big](s - r) + \cdots$$
Substituting these expansions into integral (11.3.9) we obtain that
$$\frac{1}{n}S_1 = \mu_1'(r)\big[1 - e^{-MF_\xi(\bar c)}\big] - \mu_1''(r)\int_{x=s_*-r}^{-r} x\, d\exp\Big\{-e^{n\alpha x + n\beta x^2 + \cdots}\big[1 + O(n^{-1})\big]\Big\} - \frac{1}{2}\mu_1'''(r)\int_{x=s_*-r}^{-r} x^2\, d\exp\Big\{-e^{n\alpha x + n\beta x^2 + \cdots}\big[1 + O(n^{-1})\big]\Big\} + \cdots \qquad (x = s - r), \tag{11.3.11}$$
where
$$\alpha = -r\mu_1''(r) + O(n^{-1}); \qquad 2\beta = -r\mu_1'''(r) - \mu_1''(r) + O(n^{-1}) \tag{11.3.12}$$
with the required precision. Further, we choose a negative root of equation (11.3.10):
$$r < 0. \tag{11.3.13}$$
Making the change of variable $e^{n\alpha x} = z$ we transform (11.3.11) into the form
$$\frac{1}{n}S_1 = \mu_1'(r)\big[1 - e^{-MF_\xi(\bar c)}\big] - \frac{\mu_1''(r)}{n\alpha}\int_{z=z_*}^{e^{-n\alpha r}} \ln(z)\, d\exp\Big\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\big[1 + O(n^{-1})\big]\Big\} - \frac{1}{2}\frac{\mu_1'''(r)}{n^2\alpha^2}\int_{z=z_*}^{e^{-n\alpha r}} \ln^2(z)\, d\exp\Big\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\big[1 + O(n^{-1})\big]\Big\} - \cdots \qquad \big(z = e^{n\alpha(s - r)}\big). \tag{11.3.14}$$
The above equation allows one to appreciate the magnitude of the various terms for large $n$. By expanding the exponent
$$\exp\Big\{-e^{\ln z + (\beta/n\alpha^2)\ln^2 z + \cdots}\Big\} \equiv \exp\big\{-e^{\ln z + \varepsilon}\big\}$$
into the Taylor series with respect to $\varepsilon = (\beta/n\alpha^2)\ln^2 z + \cdots$ we obtain
$$\exp(-ze^{\varepsilon}) = \exp\Big(-z - z\varepsilon - \frac{1}{2}z\varepsilon^2 - \cdots\Big) = e^{-z}\Big[1 - z\varepsilon\Big(1 + \frac{\varepsilon}{2} + \cdots\Big) + \frac{1}{2}z^2\varepsilon^2\Big(1 + \frac{\varepsilon}{2} + \cdots\Big)^2 - \cdots\Big]. \tag{11.3.15}$$
Plugging (11.3.15) into (11.3.14), let us retain only the terms of orders $1$, $n^{-1}$, $n^{-2}$ and obtain
$$\frac{1}{n}S_1 = \mu_1'(r)\big[1 - e^{-MF_\xi(\bar c)}\big] - \frac{\mu_1''(r)}{n\alpha}\int_{z=z_*}^{e^{-n\alpha r}} \ln(z)\, d\Big[e^{-z} - z\frac{\beta}{n\alpha^2}\ln^2(z)\,e^{-z}\Big] - \frac{\mu_1'''(r)}{2n^2\alpha^2}\int_{z=z_*}^{e^{-n\alpha r}} \ln^2(z)\, de^{-z} + n^{-3}\cdots$$
If we choose a smaller precision, then we obtain a simpler formula:
$$\frac{1}{n}S_1 = \mu_1'(r) - \frac{\mu_1''(r)}{n\alpha}\int_0^{\infty} \ln(z)\, de^{-z} + n^{-2}\cdots$$
or, taking into account (11.3.12),
$$\frac{1}{n}S_1 = \mu_1'(r) + \frac{1}{nr}\int_0^{\infty} \ln(z)\, de^{-z} + n^{-2}\cdots \tag{11.3.16}$$
Here we have neglected the term $-\mu_1'(r)e^{-MF_\xi(\bar c)}$, which vanishes with the growth of $n$ and $M = [e^{nI_1}]$ much faster than $n^{-2}$. Also we have neglected the integrals
$$\frac{\mu_1''(r)}{n\alpha}\int_{e^{-n\alpha r}}^{\infty} \ln(z)\,e^{-z}\,dz; \qquad \frac{\mu_1''(r)}{n\alpha}\int_0^{z_*} \ln(z)\,e^{-z}\,dz,$$
which decrease very quickly with the growth of $n$, because $e^{-n\alpha r}$ and $z_* = e^{n\alpha(s_* - r)}$ exponentially converge to $\infty$ and $0$, respectively (recall that $r < 0$, $s_* < r$). One can easily assess the values of these integrals, for instance, by omitting the factor $\ln z$ in the first integral and $e^{-z}$ in the second one, which have relatively small influence. However, we will not dwell on this any further.

The integral in (11.3.16) is equal to the Euler constant $C = 0.577\ldots$ Indeed, expressing this integral as the limit
$$\lim_{a\to\infty}\int_0^{a} \ln(z)\, d\big(e^{-z} - 1\big)$$
and integrating by parts
$$\int_0^{a} \ln(z)\, d(e^{-z} - 1) = \Big[(e^{-z} - 1)\ln(z)\Big]_0^{a} + \int_0^{a}(1 - e^{-z})\frac{dz}{z} = (e^{-a} - 1)\ln a + \int_0^{1}(1 - e^{-z})\frac{dz}{z} - \int_1^{a} e^{-z}\frac{dz}{z} + \ln a,$$
we can reduce it to the form
$$\int_0^{\infty} \ln(z)\, de^{-z} = \int_0^{1}(1 - e^{-z})\frac{dz}{z} - \int_1^{\infty} e^{-z}\frac{dz}{z} = C$$
(see p. 97 in [25], the corresponding book in English is [24]; here the upper limit is denoted by $a$ to avoid confusion with $\alpha$ of (11.3.12)). Therefore, result (11.3.16) acquires the form
$$\frac{1}{n}S_1 = \mu_1'(r) + \frac{C}{nr} + n^{-2}\cdots \tag{11.3.17}$$
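The value of the integral used in (11.3.17) can be confirmed numerically from the reduced form $\int_0^1 (1 - e^{-z})\,dz/z - \int_1^\infty e^{-z}\,dz/z$. The sketch below uses a plain composite midpoint rule and truncates the second integral at $z = 50$; both choices, and the helper name `midpoint`, are assumptions made for this illustration only.

```python
import math


def midpoint(f, a, b, steps):
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


# integral_0^1 (1 - e^{-z}) dz / z  (the integrand tends to 1 as z -> 0)
first = midpoint(lambda z: -math.expm1(-z) / z, 0.0, 1.0, 20_000)
# integral_1^inf e^{-z} dz / z, truncated at z = 50 where the tail is negligible
second = midpoint(lambda z: math.exp(-z) / z, 1.0, 50.0, 200_000)
euler_c = first - second
```

The difference `euler_c` agrees with the Euler constant $C = 0.5772\ldots$ to within the accuracy of the quadrature.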
Instead of the root $r$ of equation (11.3.10), it is convenient to consider the root $q_\xi$ of the simpler equation
$$q_\xi\mu_1'(q_\xi) - \mu_1(q_\xi) = I_1. \tag{11.3.18}$$
Comparing these two equations shows that
$$r - q_\xi = -\frac{1}{2n}\,\frac{\ln\big[2\pi n\mu_1''(q_\xi)q_\xi^2\big]}{q_\xi\mu_1''(q_\xi)} + n^{-2}\ln^2 n \cdots, \tag{11.3.19}$$
and the condition for the existence of the negative root (11.3.13) is equivalent (for large $n$) to the condition of the existence of the root
$$q_\xi < 0. \tag{11.3.20}$$
By differentiation, we can verify that the function $s\mu_\xi'(s) - \mu_\xi(s)$ attains its minimum value, equal to zero, at the point $s = 0$. Therefore, equation (11.3.18), at least for sufficiently small $I_1$, has two roots: positive and negative. Thus, it is possible to choose the negative root $q_\xi$. According to (11.3.19), formula (11.3.17) can be simplified to the form
$$\frac{1}{n}S_1 = \mu_1'(q_\xi) - \frac{1}{2nq_\xi}\ln\big[2\pi n\mu_1''(q_\xi)q_\xi^2\big] + \frac{C}{nq_\xi} + n^{-2}\ln^2 n \cdots \tag{11.3.21}$$
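To make equation (11.3.18) concrete, the sketch below finds its negative root by bisection for a simple assumed case: a per-symbol cost that equals 0 or 1 with probability 1/2 under $P_{I_1}(du)$ (for instance, a Hamming cost with a uniform binary $u$), so that $\mu_1(s) = \ln[(1 + e^s)/2]$ does not depend on $\xi$. The function names and the value of $I_1$ are hypothetical.

```python
import math


def mu1(s):
    """mu_1(s) = ln E[exp(s c_1)] for a cost equal to 0 or 1 with probability 1/2."""
    return math.log((1.0 + math.exp(s)) / 2.0)


def dmu1(s):
    """Derivative mu_1'(s)."""
    return math.exp(s) / (1.0 + math.exp(s))


def negative_root(i1, lo=-50.0, hi=0.0, iters=200):
    """Bisection for the root q < 0 of  q mu_1'(q) - mu_1(q) = I_1."""
    g = lambda s: s * dmu1(s) - mu1(s) - i1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:   # g decreases from ln 2 - I_1 to -I_1 on (-inf, 0]
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


i1 = 0.3                      # nats per symbol, must be below ln 2
q = negative_root(i1)
d = dmu1(q)                   # leading term mu_1'(q) of (11.3.21)
```

In this example the leading term $d = \mu_1'(q)$ satisfies $\ln 2 + d\ln d + (1 - d)\ln(1 - d) = I_1$, so it reproduces the cost rate that the potential formulas (11.1.20), (11.1.21) give for the same symmetric system; the remaining terms of (11.3.21) are corrections of order $n^{-1}\ln n$.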
2. Let us now turn to the estimation of integral (11.3.4). Denote by $\lambda_\Gamma$ the boundary point determined by the equation
$$F_\xi(\lambda_\Gamma) = 1 - e^{-MF_\xi(\lambda_\Gamma)}. \tag{11.3.22}$$
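For large $M$ (namely, $M \gg 1$) the quantity $1 - F_\xi(\lambda_\Gamma) = \varepsilon$ is small ($\varepsilon \ll 1$). Transforming equation (11.3.22) to the form
$$\varepsilon = e^{-M(1 - \varepsilon)} = e^{-M}\Big[1 + \varepsilon M + \frac{1}{2}\varepsilon^2 M^2 + \cdots\Big],$$
we find its approximate solution
$$1 - F_\xi(\lambda_\Gamma) = \varepsilon = e^{-M} + Me^{-2M} + \cdots, \tag{11.3.22a}$$
which confirms the small value of $\varepsilon$. Within the domain of integration $\lambda > \lambda_\Gamma$ in equation (11.3.4) the derivative
$$\frac{d}{d\lambda}\big[F_\xi(\lambda) - 1 + e^{-MF_\xi(\lambda)}\big] = \big[1 - Me^{-M}e^{M[1 - F_\xi(\lambda)]}\big]\frac{dF_\xi(\lambda)}{d\lambda}$$
is positive for $M \gg 1$, which can be checked by taking into account that $1 - F_\xi(\lambda) \leqslant \varepsilon \sim e^{-M}$ for $\lambda > \lambda_\Gamma$. Therefore, integral (11.3.4) can be bounded above as follows:
$$S_3 = \int_{\lambda_\Gamma}^{\infty} \lambda\, d\big[F_\xi(\lambda) - 1 + e^{-MF_\xi(\lambda)}\big] \leqslant \int_{\lambda_\Gamma}^{\infty} |\lambda|\, dF_\xi(\lambda) + S_3', \tag{11.3.23}$$
where $S_3' = \int_{\lambda_\Gamma}^{\infty} |\lambda|\, de^{-MF_\xi(\lambda)}$ is a negative surplus we neglect. Starting from some $M > M(\bar c)$ the value $\lambda_\Gamma$ certainly exceeds the average value $\bar c$, so we can apply estimate (11.3.6) to integral (11.3.23). This formula allows us to obtain
$$1 - F_\xi(\lambda) = [1 - F_\xi(\lambda_\Gamma)]\,e^{-n\rho(\lambda - \lambda_\Gamma) + \cdots}, \tag{11.3.24}$$
where
$$n\mu_1'(s_\Gamma) = \lambda_\Gamma, \qquad \rho = s_\Gamma\mu_1''(s_\Gamma) > 0 \tag{11.3.25}$$
and the dots correspond to other terms, the form of which is not important to us. Substituting (11.3.24) into (11.3.23) we obtain
$$S_3 \leqslant -\varepsilon\int_{x=0}^{\infty} |\lambda_\Gamma + x|\, de^{-n\rho x + \cdots} \leqslant -\varepsilon\int_{x=0}^{\infty} \big[|\lambda_\Gamma| + x\big]\, de^{-n\rho x + \cdots} = \varepsilon\Big[|\lambda_\Gamma| + \frac{1}{n\rho} + \cdots\Big].$$
The right-hand side of the above equation can be written using (11.3.6) as follows:
$$S_3 \leqslant e^{-n\tilde\mu(\lambda_\Gamma) + \cdots}\big[|\lambda_\Gamma| + \cdots\big]. \tag{11.3.26}$$
Here $\tilde\mu$ is the image under the Legendre transform:
$$\tilde\mu(\lambda) = s\mu_1'(s) - \mu_1(s) \qquad (n\mu_1'(s) = \lambda);$$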
and dots correspond to the less essential terms. If λΓ were independent of n, then, according to (11.3.26), the estimate of integral S3 would vanish with increasing n mainly exponentially. However, because of (11.3.22a), the value λΓ increases with n, and therefore μ˜ (λΓ ) increases as well ¯ That makes the (the function μ˜ (λ ), as is easily verified, is increasing at λ > c). expression in the right-hand side of (11.3.26) vanish even faster. Thus, the integral S3 vanishes with the growth of n more rapidly than the terms of the asymptotic expansion (11.3.21). Therefore, the term S3 in (11.3.2) can be neglected. 3. Let us now estimate the second integral S2 in (11.3.3). Choosing a positive number b > |c|, ¯ we represent this integral as a sum of two terms: S2 = S2 + S2 ; S2 = − Evidently,
b
λ =c
λ de−MFξ (λ ) ;
S2 = −
∞ λ =b
(11.3.27)
λ de−MFξ (λ ) .
S2 b e−MFξ (c) − e−MFξ (b) be−MFξ (c) .
(11.3.28)
To estimate the integral S2 we use (11.3.6) and the formula of the type (11.3.24): S2 = −
∞ b
) (
λ d exp −M 1 − 1 − Fξ (b) e−nρb (λ −b)+··· .
Here, analogously to (11.3.25), we have ρb > 0. However, ∞ ) ( − λ d exp −M + Ne−nρb (λ −b) = b ∞ 4 3 y −M = −e b+ d exp +Ne−y nρb 0 1 −M 1 = e−M eN − 1 b − e ln(z)deNz = T1 + T2 , nρb 0
(11.3.29)
where we denote N = M[1 − Fξ (b)] and z = e−y . For the first term in (11.3.29) we have the inequality T1 ≡ e−M eN − 1 b be−M+N = be−MFξ (b) .
(11.3.30)
Let us compute the second term in (11.3.29). By analytical continuation, from the formula
$$\int_0^1 e^{-\rho z}\ln z\,dz = -\frac{1}{\rho}(C + \ln\rho) + \frac{1}{\rho}\,\mathrm{Ei}(-\rho)$$
(see formulae (3.711.2) and (3.711.3) in Ryzhik and Gradstein [36], the corresponding translation to English is [37]) we obtain
$$\int_0^1 e^{Nz}\ln z\,dz = \frac{1}{N}(C + \ln N) - \frac{1}{N}\,\mathrm{Ei}(N),$$
whereas
$$\mathrm{Ei}(N) = \frac{e^N}{N}\left[1 + \frac{1}{N} + \frac{2}{N^2} + \cdots\right]$$
(see p. 48 in Jahnke and Emde [25], the corresponding book in English is [24]). Consequently, the main dependency of the second term T2 in (11.3.29) on M is determined by the exponential factor e−M+N = e−MFξ (b) . Thus, all three terms (11.3.28), (11.3.30) and T2 constituting S2 decrease with a growth of n quite rapidly. Just as S3 , they do not influence an asymptotic expansion of the type (11.3.21) over the powers of the small parameter n−1 (in combination with logarithms ln n). Therefore, we need to account only for one term in formula (11.3.2), so that due to (11.3.21) we have S μξ (qξ ) −
C
1 ln (2π )−1 μξ (qξ )q2ξ + + O n−1 ln2 n . 2qξ qξ
(11.3.31)
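The exponential-integral identity used above for the term $T_2$ can be checked numerically. The following is only an illustrative sketch: the function names, the series used for Ei, and the Simpson quadrature (after the substitution $z = e^{-y}$ from (11.3.29)) are assumptions made here and are not part of the original derivation.

```python
import math

EULER_C = 0.5772156649015329  # Euler's constant, denoted C in the text


def ei_series(x, terms=200):
    """Ei(x) for x > 0 from the series C + ln x + sum_{k>=1} x^k / (k * k!)."""
    total, term = 0.0, 1.0
    for k in range(1, terms + 1):
        term *= x / k  # term is now x^k / k!
        total += term / k
    return EULER_C + math.log(x) + total


def log_integral_numeric(n_par, upper=50.0, steps=20000):
    """Simpson estimate of int_0^1 exp(N z) ln z dz after substituting z = exp(-y)."""
    def integrand(y):
        return -y * math.exp(n_par * math.exp(-y) - y)

    h = upper / steps
    acc = integrand(0.0) + integrand(upper)
    for i in range(1, steps):
        acc += (4 if i % 2 else 2) * integrand(i * h)
    return acc * h / 3.0


def log_integral_closed(n_par):
    """Right-hand side (C + ln N - Ei(N)) / N of the identity."""
    return (EULER_C + math.log(n_par) - ei_series(n_par)) / n_par


def ei_asymptotic(n_par):
    """Leading terms (e^N / N)(1 + 1/N + 2/N^2) of the large-N expansion."""
    return math.exp(n_par) / n_par * (1 + 1 / n_par + 2 / n_par ** 2)


if __name__ == "__main__":
    for n_par in (0.5, 2.0, 5.0):
        print(n_par, log_integral_numeric(n_par), log_integral_closed(n_par))
    print(ei_series(20.0), ei_asymptotic(20.0))
```

For moderate $N$ the two evaluations of the integral should agree closely, and for large $N$ the three-term expansion should approach the series value of $\mathrm{Ei}(N)$, which is what makes the factor $e^{-M+N}$ dominate $T_2$.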
Consequently, inequality (11.3.1) takes the form
$$\frac{1}{n}\tilde R_n(nI_1) \le \int\left[\mu_1'(q_\xi) - \frac{1}{2nq_\xi}\ln\left(2\pi n\mu_1''(q_\xi)q_\xi^2\right) + \frac{C}{nq_\xi}\right]P(d\xi) + O(n^{-2}\ln^2 n)\cdots \tag{11.3.32}$$

4. Let us average over $\xi$ in compliance with the latter formula. This averaging becomes easier because function (11.3.8) is a sum of identically distributed random summands $\ln\int e^{tc_1(x_i,u)}P_{I_1}(du)$, while $\mu_1 = \mu_\xi/n$ is their arithmetic mean:
$$\mu_1(t) = \frac{1}{n}\sum_{i=1}^n \ln\int e^{tc_1(x_i,u)}P_{I_1}(du). \tag{11.3.33}$$
By virtue of the Law of Large Numbers this mean converges to the expectation
$$\nu_1(-t, \beta) \equiv E[\mu_1(t)] = \int P(dx)\ln\int e^{tc_1(x,u)}P_{I_1}(du), \tag{11.3.34}$$
which is naturally connected with the following potential (11.1.18):
$$\nu_1(-t, \beta) \equiv \nu_1(-t) = \Gamma_1(-t). \tag{11.3.35}$$
For each fixed $t$, according to the above law, we have the following convergence in probability:
$$\mu_1(t) \to \nu_1(-t), \tag{11.3.36}$$
and also for the derivatives
$$\mu_1^{(k)}(t) \to (-1)^k\nu_1^{(k)}(-t), \tag{11.3.37}$$
if they exist. Because of the convergence in (11.3.36), (11.3.37), the root $q_\xi$ of equation (11.3.18) converges in probability to the root $q = -z$ of the equation
$$-q\nu_1'(-q) - \nu_1(-q) = I_1,$$
where
$$z\nu_1'(z) - \nu_1(z) = I_1. \tag{11.3.38}$$
The last equation coincides with (11.1.21). That is why $z = \beta$, so that
$$q_\xi \to -\beta \quad \text{as } n \to \infty. \tag{11.3.39}$$
According to (9.4.37) we have $\beta = -dI/dR$, and the parameter $\beta$ is positive for the normal branch $R_+$ (9.3.10) (see Section 9.3). Hence we obtain the inequality $q_\xi < 0$ that complies with condition (11.3.20). Because of the convergence in (11.3.37), (11.3.39) and equalities (11.3.35), (11.1.20), the first term $E[\mu_1'(q_\xi)]$ in (11.3.32) tends to $R_1(I_1)$. That proves (if we also take into account the vanishing of the other terms) the convergence
$$\frac{1}{n}\tilde R(nI_1) \to R_1(I_1) \qquad (n \to \infty),$$
which was discussed in Theorem 11.1. In order to study the rate of convergence, let us consider the deviation of the random variables occurring in the averaged expression (11.3.32) from their non-random limits. We introduce a random deviation $\delta\mu_1 = \mu_1(-\beta) - \nu_1(\beta)$. Just as the random deviation $\delta\mu_1' = \mu_1'(-\beta) + \nu_1'(\beta)$, this deviation due to (11.3.33) has a null expected value and a variance of the order $n^{-1}$:
$$E[(\delta\mu_1)^2] = E[\mu_1^2(-\beta)] - \nu_1^2(\beta) = \frac{1}{n}\int P(dx)\ln^2\int e^{-\beta c_1(x,u)}P_{I_1}(du) - \frac{1}{n}\left[\int P(dx)\ln\int e^{-\beta c_1(x,u)}P_{I_1}(du)\right]^2$$
and analogously for $\delta\mu_1'$. In order to take into account the contribution of random deviations having the order $n^{-1}$ while averaging the variable $\mu_1'(q_\xi)$, let us express this quantity as an expansion in $\delta\mu_1$, $\delta\mu_1'$ including the quadratic terms. Denoting $q_\xi + \beta = \delta q$, expanding the right-hand side of equation (11.3.18) with respect to $\delta q$ and accounting for quadratic terms, we obtain
$$-\beta\mu_1'(-\beta) - \mu_1(-\beta) - \beta\mu_1''(-\beta)\,\delta q + \frac{1}{2}\left[\mu_1''(-\beta) - \beta\mu_1'''(-\beta)\right]\delta q^2 + \cdots = I_1.$$
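Substituting in the above
$$\mu_1'(-\beta) = -\nu_1'(\beta) + \delta\mu_1'; \qquad \mu_1(-\beta) = \nu_1(\beta) + \delta\mu_1; \qquad \mu_1''(-\beta) = \nu_1''(\beta) + \delta\mu_1'',$$
and neglecting the terms of order higher than quadratic we obtain that
$$\beta\nu_1'(\beta) - \nu_1(\beta) - \beta\,\delta\mu_1' - \delta\mu_1 - \beta\,\delta\mu_1''\,\delta q - \beta\nu_1''(\beta)\,\delta q + \frac{1}{2}\left[\nu_1''(\beta) + \beta\nu_1'''(\beta)\right]\delta q^2 + \cdots = I_1.$$
Zero-order terms cancel out here on the strength of (11.3.38), where $z = \beta$, and we have
$$\delta q = \frac{1}{\beta\nu_1''(\beta)}\left[-\beta\,\delta\mu_1' - \delta\mu_1 - \beta\,\delta\mu_1''\,\delta q + \frac{1}{2}\left(\nu_1''(\beta) + \beta\nu_1'''(\beta)\right)\delta q^2 + \cdots\right],$$
and, consequently,
$$\delta q = -\frac{\beta\,\delta\mu_1' + \delta\mu_1}{\beta\nu_1''(\beta)} + \frac{\delta\mu_1''\,(\beta\,\delta\mu_1' + \delta\mu_1)}{\beta\,[\nu_1''(\beta)]^2} + \frac{\nu_1''(\beta) + \beta\nu_1'''(\beta)}{2\beta^3[\nu_1''(\beta)]^3}\left(\beta\,\delta\mu_1' + \delta\mu_1\right)^2 + \cdots$$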
Substituting this result into the expansion
$$\mu_1'(q_\xi) = \mu_1'(-\beta) + \mu_1''(-\beta)\,\delta q + \frac{1}{2}\mu_1'''(-\beta)\,\delta q^2 + \cdots = -\nu_1'(\beta) + \delta\mu_1' + \nu_1''(\beta)\,\delta q + \delta\mu_1''\,\delta q - \frac{1}{2}\nu_1'''(\beta)\,\delta q^2,$$
after cancellations we obtain
$$E[\mu_1'(q_\xi)] = -\nu_1'(\beta) + \frac{E[(\beta\,\delta\mu_1' + \delta\mu_1)^2]}{2\beta^3\nu_1''(\beta)} + E[\delta\mu_1^3]\cdots$$
Here we take into account that $E[\delta\mu_1] = E[\delta\mu_1'] = 0$. The other terms entering into the expression under the integral sign in (11.3.32) require less accuracy in calculations. It is sufficient to substitute their limit values for the random ones. Finally, we obtain
$$0 \le \frac{1}{n}\tilde R_n(nI_1) - R_1(I_1) \le \frac{1}{2n\beta}\ln\left[2\pi n\nu_1''(\beta)\beta^2\right] - \frac{C}{n\beta} + \frac{E[(\beta\,\delta\mu_1' + \delta\mu_1)^2]}{2\beta^3\nu_1''(\beta)} + o(n^{-1}). \tag{11.3.40}$$
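The second-order bookkeeping behind (11.3.40) can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: the numerical values of $\beta$, $\nu_1'$, $\nu_1''$, $\nu_1'''$ and of the deviations are hypothetical, and all function names are invented here. It solves the quadratically truncated equation for $\delta q$ exactly, evaluates the expansion of $\mu_1'(q_\xi)$, and compares it with the second-order expression $-\nu_1'(\beta) - \delta\mu_1/\beta + (\beta\,\delta\mu_1' + \delta\mu_1)^2/(2\beta^3\nu_1''(\beta))$, whose linear term has zero mean. If the expansion is right, the discrepancy should shrink like the cube of the deviation size.

```python
import math


def delta_q(beta, nu2, nu3, d0, d1, d2):
    """Small root of the quadratically truncated equation for delta q.

    nu2, nu3 stand for nu_1''(beta), nu_1'''(beta);
    d0, d1, d2 stand for delta mu_1, delta mu_1', delta mu_1''.
    """
    a = 0.5 * (nu2 + beta * nu3)  # coefficient of delta q^2
    b = -beta * (nu2 + d2)        # coefficient of delta q
    c = -(beta * d1 + d0)         # free term
    disc = math.sqrt(b * b - 4.0 * a * c)
    # Numerically stable form of the root that vanishes together with c.
    return 2.0 * c / (-b - math.copysign(disc, b))


def mu_prime_direct(beta, nu1, nu2, nu3, d0, d1, d2):
    """mu_1'(q_xi) from its expansion around -beta, with delta q solved exactly."""
    dq = delta_q(beta, nu2, nu3, d0, d1, d2)
    return -nu1 + d1 + (nu2 + d2) * dq - 0.5 * nu3 * dq * dq


def mu_prime_second_order(beta, nu1, nu2, d0, d1):
    """Second-order expression before averaging over xi."""
    return -nu1 - d0 / beta + (beta * d1 + d0) ** 2 / (2.0 * beta ** 3 * nu2)


def truncation_error(eps):
    """Discrepancy between the two expressions for deviations of size eps."""
    beta, nu1, nu2, nu3 = 1.5, -0.7, 0.8, -0.3     # hypothetical values
    d0, d1, d2 = 0.4 * eps, -0.9 * eps, 0.6 * eps  # hypothetical deviations
    direct = mu_prime_direct(beta, nu1, nu2, nu3, d0, d1, d2)
    return abs(direct - mu_prime_second_order(beta, nu1, nu2, d0, d1))


if __name__ == "__main__":
    for eps in (1e-2, 5e-3, 2.5e-3):
        print(eps, truncation_error(eps))
```

Halving the deviation size should reduce the printed discrepancy by roughly a factor of eight, consistent with the neglected terms being of third order.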
Applying the same methods and conducting calculations with a greater level of accuracy, we can also find the terms of higher order in this asymptotic expansion. This way one can corroborate those points of the above-stated derivation, which may appear insufficiently grounded.

In the exposition above we assumed that the domain $(s_1, s_2)$ of definition and differentiability of the potential $\mu_\xi(t)$, which was mentioned in Theorem 4.8, is sufficiently large. However, only the vicinity of the point $s = -\beta$ is actually essential. The other parts of the straight line $s$ influence only exponential terms of the types (11.3.26), (11.3.28), (11.3.30), and they do not influence the asymptotic expansion (11.3.40). Of course, anomalous behaviour of the function $\mu_1(s)$ on these other parts of $s$ complicates the proof.

For formula (11.3.40) to be valid, condition (11.2.14) of boundedness of the cost function is not really necessary. However, the derivation of this formula can be somewhat simplified if we impose this condition. Then elaborate estimations of integrals $S_2$, $S_3$ and the integral over domain $\Xi$ will not be required. Instead, we can confine ourselves to proving that the probability of the appropriate regions of integration vanishes rapidly (exponentially). In this case, the value of the constant $K$ will be inessential, since it will not appear in the final result.

As one can see from the above derivation, the terms of the asymptotic expansion (11.3.40) are exact for the given random encoding. Higher-order terms can be found, but the terms already written cannot be improved unless the given encoding procedure is abandoned. The problem of interest is how close the estimate (11.3.40) is to the actual value of the difference $(1/n)\tilde R(nI_1) - R_1(I_1)$, and to what extent it can be refined if more elaborate encoding techniques are used.
11.4 Alternative forms of the main result. Generalizations and special cases

1. In the previous section we introduced the function $\mu_1(t) = (1/n)\mu_\xi(t)$ instead of function (11.3.8). This was done essentially for convenience and illustration, in order to emphasize the relative magnitude of terms. Almost nothing changes if the rate quantities $\mu_1$, $\nu_1$, $R_1$, $I_1$ and others enter only multiplied by $n$, that is if we do not introduce the rate quantities at all. Thus, instead of the main result formula (11.3.40), after multiplication by $n$ we obtain
$$0 \le \tilde R(I) - R(I) = V(I) - \tilde V(I) \le \frac{1}{2\beta}\ln\left[\frac{2\pi}{\gamma^2}\,\nu''(\beta)\beta^2\right] + \frac{E[(\beta\,\delta\mu' + \delta\mu)^2]}{2\beta^3\nu''(\beta)} + o(1), \tag{11.4.1}$$
where according to (11.3.34), (11.3.8), we have
$$\nu(-t) = E[\mu_\xi(t)] = \int P(d\xi)\ln\int e^{tc(\xi,\zeta)}P_I(d\zeta), \tag{11.4.2}$$
$$\delta\mu' = \delta\mu'(-\beta); \qquad \delta\mu = \delta\mu(-\beta); \qquad \delta\mu(t) = \mu_\xi(t) - \nu(-t) \tag{11.4.3}$$
(index $n$ is omitted, the term $C/\beta$ is moved under the logarithm; $\gamma = e^C = 1.781$). With these substitutions, just as in paragraph 3 in Section 11.2, it becomes unnecessary for the Bayesian system to be the $n$-th degree of some elementary Bayesian system. Differentiating twice function (11.4.2) at point $t = -\beta$ and taking into account (11.1.16) we obtain that
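$$\nu''(\beta) = \int P(d\xi)\left\{\int c^2(\xi,\zeta)\,e^{-\gamma(\xi)-\beta c(\xi,\zeta)}P_I(d\zeta) - \left[\int c(\xi,\zeta)\,e^{-\gamma(\xi)-\beta c(\xi,\zeta)}P_I(d\zeta)\right]^2\right\}.$$
Because of (9.4.21), the above integrals over $\zeta$ are conditional expectations taken with respect to the conditional probabilities $P(d\zeta \mid \xi)$. Therefore,
$$\nu''(\beta) = E\left\{E\left[c^2(\xi,\zeta) \mid \xi\right] - \left(E[c(\xi,\zeta) \mid \xi]\right)^2\right\} \equiv E\left[\mathrm{Var}[c(\xi,\zeta) \mid \xi]\right], \tag{11.4.4}$$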
where Var[. . . | ξ ] denotes conditional variance. As is easily verified, the average conditional variance (11.4.4) remains unchanged under the transformation (11.2.36). Consequently, conducting transformation (11.2.37) we obtain
$$\beta^2\nu''(\beta) = E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]. \tag{11.4.5}$$
Let us now consider the mean square term $E[(\beta\,\delta\mu' + \delta\mu)^2]$ in (11.4.1), which, due to (11.4.3), coincides with the variance
$$E\left[(\beta\,\delta\mu' + \delta\mu)^2\right] = \mathrm{Var}\left[\beta\mu_\xi'(-\beta) + \mu_\xi(-\beta)\right] \tag{11.4.6}$$
of the random variable $\beta\mu_\xi'(-\beta) + \mu_\xi(-\beta)$. Setting $t = -\beta$ in (11.3.8) and comparing the resulting expression with (9.4.23) we see that
$$\mu_\xi(-\beta) = \gamma(\xi). \tag{11.4.7}$$
Next, after differentiating (11.3.8) at point $t = -\beta$, we find that
$$\mu_\xi'(-\beta) = \int c(\xi,\zeta)\,e^{-\gamma(\xi)-\beta c(\xi,\zeta)}P_I(d\zeta).$$
In view of (9.4.21), this integral is nothing but averaging with respect to the conditional distribution $P(d\zeta \mid \xi)$, i.e.
$$\mu_\xi'(-\beta) = E[c(\xi,\zeta) \mid \xi].$$
It follows from the above and equation (11.4.7) that
$$\beta\mu_\xi'(-\beta) + \mu_\xi(-\beta) = E[\beta c(\xi,\zeta) + \gamma(\xi) \mid \xi].$$
However, $-\beta c(\xi,\zeta) - \gamma(\xi)$ coincides with the random information $I(\xi,\zeta)$ on the strength of (9.4.21), so that
$$\beta\mu_\xi'(-\beta) + \mu_\xi(-\beta) = -E[I(\xi,\zeta) \mid \xi] \equiv -I_{\zeta|}(\,|\,\xi). \tag{11.4.8}$$
Thus, we see that expression (11.4.6) is exactly the variance of a particularly averaged random information:
$$E\left[(\beta\,\delta\mu' + \delta\mu)^2\right] = \mathrm{Var}\left[\beta\mu_\xi'(-\beta) + \mu_\xi(-\beta)\right] = \mathrm{Var}\left[I_{\zeta|}(\,|\,\xi)\right]. \tag{11.4.9}$$
It follows from (11.4.5), (11.4.9) that the main formula (11.4.1) takes the form
$$0 \le 2\beta[V(I) - \tilde V(I)] \le \ln\left\{\frac{2\pi}{\gamma^2}E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]\right\} + \frac{\mathrm{Var}\left[I_{\zeta|}(\,|\,\xi)\right]}{E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]} + o(1). \tag{11.4.10}$$

2. In several particular cases the extremum distribution $P_I(d\zeta)$ does not depend on $\beta$, i.e. on $I$. Then, as can be seen from (9.4.23) and (11.3.8), the function $\mu_\xi(t) = \mu(\xi,t,\beta)$, which in general depends on all arguments $\xi$, $t$, $\beta$, turns out to be independent of $\beta$, while its dependence on $\xi$ and $t$ coincides with the dependence of function $\gamma(\xi) = \gamma(\xi,\beta)$ on $\xi$ and $-\beta$:
$$\mu(\xi,t,\beta) = \gamma(\xi,-t) \qquad \forall t. \tag{11.4.11}$$
By averaging this equality over $\xi$ and taking into account (9.4.10), (11.4.2) we obtain
$$\nu(-t) = \Gamma(-t) \qquad \forall t. \tag{11.4.12}$$
That is why we can replace $\nu''(\beta)$ by $\Gamma''(\beta)$ in formula (11.4.1). Further, it is useful to recall that due to (9.4.31) the value of $\beta$ is related to the derivatives of functions $R(I)$, $V(I)$:
$$\frac{1}{\beta} = -\frac{dR}{dI} = V'(I), \tag{11.4.13}$$
so that
$$\beta = 1/V'(I). \tag{11.4.14}$$
The second derivative $\Gamma''(\beta)$ can also be expressed in terms of the function $I(R)$ (or $V(I)$), because these functions are related by the Legendre transform [see (9.4.30) and (9.4.29)]. Differentiating (9.4.30) we have
$$\beta\,\Gamma''(\beta) = \frac{dI}{d\beta}.$$
This implies
$$\frac{1}{\beta^2\Gamma''(\beta)} = \frac{1}{\beta}\frac{d\beta}{dI}$$
or, equivalently, if we differentiate (11.4.14) and take into account (11.4.13),
$$1/(\beta^2\Gamma''(\beta)) = -V''(I)/V'(I). \tag{11.4.15}$$
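Relations (11.4.13)–(11.4.15) can be illustrated on a concrete value function. The sketch below assumes, purely for illustration, the familiar Gaussian case with squared-error cost, for which $R(I) = \sigma^2 e^{-2I}$ (information in nats) and hence $V(I) = \sigma^2(1 - e^{-2I})$; this choice, the finite-difference helpers and all names are assumptions made here, not part of the text. It checks that $(1/\beta)\,d\beta/dI$ computed from $\beta = 1/V'(I)$ agrees with $-V''(I)/V'(I)$.

```python
import math

SIGMA2 = 1.0  # assumed prior variance of the Gaussian variable


def value(i):
    """Assumed value of information V(I) = sigma^2 (1 - exp(-2 I)), I in nats."""
    return SIGMA2 * (1.0 - math.exp(-2.0 * i))


def first_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def beta(i):
    """beta = 1 / V'(I), as in (11.4.14)."""
    return 1.0 / first_derivative(value, i)


def lhs(i):
    """(1 / beta) d beta / dI, the middle expression leading to (11.4.15)."""
    return first_derivative(beta, i) / beta(i)


def rhs(i):
    """-V''(I) / V'(I), the right-hand side of (11.4.15)."""
    return -second_derivative(value, i) / first_derivative(value, i)


if __name__ == "__main__":
    for i in (0.3, 0.7, 1.5):
        print(i, lhs(i), rhs(i))
```

For this assumed $V(I)$ both sides should be close to 2 for every $I$, i.e. $\beta^2\Gamma''(\beta) = 1/2$ independently of $\sigma^2$.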
On the strength of (11.4.14), (11.4.15), (11.4.9) the main formula (11.4.1) can be written as follows:
$$0 \le V(I) - \tilde V(I) \le \frac{1}{2}V'(I)\ln\left[-\frac{2\pi}{\gamma^2}\frac{V'(I)}{V''(I)}\right] - \frac{1}{2}V''(I)\,\mathrm{Var}\left[I_{\zeta|}(\,|\,\xi)\right] + o(1). \tag{11.4.16}$$
Besides, the function (11.4.11) sometimes turns out to be independent of $\xi$. We encountered such a phenomenon in Section 10.2 where we derived formula (10.2.25) for one particular case. According to the latter we have $\mu_\xi(t) = \Gamma(-t)$, so that $\mu_\xi(t) = \nu(-t) = \Gamma(-t)$; $\delta\mu(t) = 0$, and averaging over $\xi$ becomes redundant. In this case, the variance (11.4.6), (11.4.9) vanishes and formula (11.4.16) can be somewhat simplified taking the form
$$V(I) \ge \tilde V(I) \ge V(I) - \frac{1}{2}V'(I)\ln\left[-\frac{2\pi}{\gamma^2}\frac{V'(I)}{V''(I)}\right] + o(1). \tag{11.4.17}$$
At the same time the analysis carried out in paragraph 4 of the previous section becomes redundant.

3. In some important cases the sequence of values of $I$ and the sequence of Bayesian systems $[P(d\xi), c(\xi,\zeta)]$ (dependent on $n$ or another parameter) are such that for an extremum distribution

A.
$$I = I_{\xi\zeta} \to \infty. \tag{11.4.18}$$
B. There exist finite non-zero limits
$$\lim\frac{E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]}{I}, \qquad \lim\frac{\mathrm{Var}\left[I_{\zeta|}(\,|\,\xi)\right]}{I}, \qquad \lim\frac{V(yI)}{I}, \qquad \lim\frac{dV(I)}{dI} \tag{11.4.19}$$
($y$ is arbitrary and independent of $I$). It is easy to see that the sum
$$E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right] + \mathrm{Var}\left[I_{\zeta|}(\,|\,\xi)\right] = E\left\{E[I^2(\xi,\zeta) \mid \xi] - \left(E[I(\xi,\zeta) \mid \xi]\right)^2\right\} + E\left(E[I(\xi,\zeta) \mid \xi]\right)^2 - \left(E\left[E[I(\xi,\zeta) \mid \xi]\right]\right)^2$$
coincides with the total variance $\mathrm{Var}[I(\xi,\zeta)]$. Consequently, the existence of the first two limits in (11.4.19) implies the existence of the finite limit
$$\lim\frac{1}{I}\mathrm{Var}[I(\xi,\zeta)].$$
Therefore, on one hand, A and B imply the conditions of information stability of $\xi$, $\zeta$ mentioned in Section 7.3 (this can be derived in a standard way by using Chebyshev’s inequality). Also, B implies (11.2.34). Thus, if in addition the boundedness condition (11.2.39) and continuity of function (11.2.34) with respect to $y$ are satisfied, then, according to Theorem 11.2, convergence (11.2.35) will take place. On the other hand, it follows from (11.4.18) together with the finiteness of limits (11.4.19) and equation (11.4.14) that
$$\frac{1}{2\beta V(I)}\frac{\mathrm{Var}\left[I_{\zeta|}(\,|\,\xi)\right]}{E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]} = \frac{1}{2I}\,\frac{dV}{dI}\,\frac{I}{V(I)}\,\frac{\mathrm{Var}\left[I_{\zeta|}(\,|\,\xi)\right]}{I}\,\frac{I}{E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]} \to 0. \tag{11.4.20}$$
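The decomposition of the total variance of the random information used above can be verified on a small discrete example. The joint distribution below is hypothetical and chosen only for illustration; the function name is likewise an assumption. The sketch computes $E[\mathrm{Var}[I(\xi,\zeta) \mid \xi]]$, the variance of the conditional mean $E[I(\xi,\zeta) \mid \xi]$, and $\mathrm{Var}[I(\xi,\zeta)]$ directly.

```python
import math

# Hypothetical joint distribution P(xi, zeta); rows index xi, columns index zeta.
JOINT = [
    [0.20, 0.05, 0.05],
    [0.10, 0.30, 0.05],
    [0.05, 0.05, 0.15],
]


def variance_decomposition(joint):
    """Return (E[Var[I | xi]], Var[E[I | xi]], Var[I], E[I]) for the random information."""
    p_xi = [sum(row) for row in joint]
    p_zeta = [sum(col) for col in zip(*joint)]
    # Random information I(xi, zeta) = ln P(xi, zeta) / (P(xi) P(zeta)).
    info = [[math.log(p / (p_xi[i] * p_zeta[j])) for j, p in enumerate(row)]
            for i, row in enumerate(joint)]

    mean_total = sum(p * info[i][j]
                     for i, row in enumerate(joint) for j, p in enumerate(row))
    var_total = sum(p * (info[i][j] - mean_total) ** 2
                    for i, row in enumerate(joint) for j, p in enumerate(row))

    mean_cond_var = 0.0   # E[Var[I | xi]]
    var_cond_mean = 0.0   # Var[E[I | xi]]
    for i, row in enumerate(joint):
        cond = [p / p_xi[i] for p in row]
        m = sum(c * info[i][j] for j, c in enumerate(cond))
        v = sum(c * (info[i][j] - m) ** 2 for j, c in enumerate(cond))
        mean_cond_var += p_xi[i] * v
        var_cond_mean += p_xi[i] * (m - mean_total) ** 2
    return mean_cond_var, var_cond_mean, var_total, mean_total


if __name__ == "__main__":
    e_var, var_mean, var_all, shannon_info = variance_decomposition(JOINT)
    print(e_var + var_mean, var_all, shannon_info)
```

The first two printed numbers should coincide up to rounding, and the third is the Shannon information $I_{\xi\zeta}$ of the assumed distribution.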
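Furthermore,
$$\frac{1}{I}\ln\left\{\frac{2\pi}{\gamma^2}E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]\right\} = \frac{1}{I}\ln\left\{\frac{2\pi}{\gamma^2}\frac{E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]}{I}\right\} + \frac{\ln I}{I} \to 0$$
as $I \to \infty$. Thus, for the logarithmic term in (11.4.10) we have
$$\frac{1}{2\beta V(I)}\ln\left\{\frac{2\pi}{\gamma^2}E\left[\mathrm{Var}[I(\xi,\zeta) \mid \xi]\right]\right\} \to 0. \tag{11.4.21}$$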
In this case, because of (11.4.20), (11.4.21), the convergence in (11.2.35) follows from (11.4.10), whereas the other terms replaced by $o(1)$ in the right-hand side of (11.4.10) decrease even faster. At the same time we can see that condition (11.2.39) is unnecessary.

4. Let us consider more thoroughly one special case: the case of Gaussian Bayesian systems, to which we dedicated Sections 10.3 and 10.4. In this case, generally speaking, we cannot employ the simplification from paragraph 2, and we have to refer to formula (11.4.10). The value of $\beta^2\nu''(\beta)$ appearing in (11.4.5) has already been found in Chapter 10. It is determined by equation (10.3.42), which can be easily transformed to the form
$$\beta^2\nu''(\beta) = \frac{1}{2}\,\mathrm{tr}\left[1_u - \left(\frac{1}{\beta}\,h\tilde k_x^{-1}\right)^2\right]. \tag{11.4.22}$$
Further, in order to compute the variance in (11.4.9) we should use formula (10.3.43). The variance of expressions quadratic in Gaussian variables was calculated earlier in Section 5.4. It is easy to see that by applying the same computational method
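[based on formulae (5.4.15), (5.4.16)] to (10.3.43) instead of (5.4.14) we obtain
$$\mathrm{Var}[\beta\mu_x'(-\beta) + \mu_x(-\beta)] = \frac{\beta^2}{2}\,\mathrm{tr}\left[g h^{-1}\frac{\beta\tilde k_x h^{-1} - 1_u}{(\beta\tilde k_x h^{-1})^2}\,g^T k_x\; g h^{-1}\frac{\beta\tilde k_x h^{-1} - 1_u}{(\beta\tilde k_x h^{-1})^2}\,g^T k_x\right] = \frac{1}{2}\,\mathrm{tr}\left(1_u - \frac{1}{\beta}\,h\tilde k_x^{-1}\right)^2 \tag{11.4.23}$$
(since $g^T k_x g = \tilde k_x$). After the substitution of (11.4.22), (11.4.23) into (11.4.1), (11.4.6) we will have
$$0 \le 2\beta[V(I) - \tilde V(I)] \le \ln\left\{\frac{\pi}{\gamma^2}\,\mathrm{tr}\left[1_u - \left(\frac{1}{\beta}\,h\tilde k_x^{-1}\right)^2\right]\right\} + \mathrm{tr}\left(1_u - \frac{1}{\beta}\,h\tilde k_x^{-1}\right)^2 \Big/ \mathrm{tr}\left[1_u - \left(\frac{1}{\beta}\,h\tilde k_x^{-1}\right)^2\right] + o(1). \tag{11.4.24}$$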
If in addition to the specified expressions we take into account formula (10.3.30), then condition A from the previous paragraph will take the form
$$\mathrm{tr}\,\ln\left(\beta\tilde k_x h^{-1}\right) \to \infty. \tag{11.4.25}$$
The requirement of existence of the limit $\lim(dV/dI)$ appearing in condition B (11.4.19) is equivalent, in view of (11.4.14), to the requirement of the existence of the limit
$$\lim\beta = \beta_0. \tag{11.4.26}$$
The first two limits in (11.4.19) can be rewritten as
$$\lim\frac{\mathrm{tr}\left[1_u - (\beta^{-1}h\tilde k_x^{-1})^2\right]}{\mathrm{tr}\,\ln(\beta\tilde k_x h^{-1})}, \qquad \lim\frac{\mathrm{tr}\left(1_u - \beta^{-1}h\tilde k_x^{-1}\right)^2}{\mathrm{tr}\,\ln(\beta\tilde k_x h^{-1})}. \tag{11.4.27}$$
Finally, the condition of the existence of the limit $V(yI)/I$, in view of (10.3.32), takes the form
$$\lim\,\mathrm{tr}\left(\tilde k_x h^{-1} - \beta_y^{-1}1_u\right)\big/\,\mathrm{tr}\,\ln\left(\beta\tilde k_x h^{-1}\right), \tag{11.4.28}$$
where $\beta_y$ is determined from the condition
$$\mathrm{tr}\,\ln\left(\beta_y\tilde k_x h^{-1}\right) = 2yI = y\,\mathrm{tr}\,\ln\left(\beta\tilde k_x h^{-1}\right). \tag{11.4.29}$$
As an example, let us consider a continuous-time Gaussian stochastic process that is periodic within the segment $[0, T_0]$. It can be derived from a non-periodic stationary process via the periodization (5.7.1) described in the beginning of Section 5.7. A Bayesian system with such a periodic process was considered in paragraph 2 of Section 10.4. Its traces in (10.3.30) were reduced to summations (10.4.10). We can also express the traces in (11.4.24) in a similar way. If we likewise replace summations by integrals (which is a valid approximation for large $T_0$) and let $T_0 \to \infty$ [see the derivation of formulae (10.4.18)], then the traces involved in (11.4.27)–(11.4.29) will be approximately proportional to $T_0$:
$$\mathrm{tr}\,f(\tilde k_x h^{-1}) \approx \frac{T_0}{2\pi}\int_{L(\omega)} f(\Phi(\omega))\,d\omega \tag{11.4.30}$$
(the condition of convergence of integrals is required to be satisfied as well). Here we have
$$\Phi(\omega) = \frac{\bar k(\omega)}{\bar h(\omega)}\,|g(\omega)|^2;$$
where $\bar k(\omega)$, $\bar h(\omega)$ and others have the same meaning as in Section 10.4 [see (10.4.19), (10.4.21)]. The integration domain $L(\omega)$ is defined by inequality (10.4.20). In this case, limits (11.4.27), (11.4.28) do exist and are equal to the ratios of the respective integrals. For instance, the first limit in (11.4.27) equals
$$\int_L\left[1 - \beta_0^{-2}\Phi^{-2}(\omega)\right]d\omega \Big/ \int_L \ln(\beta_0\Phi(\omega))\,d\omega \qquad (L = L(\beta_0)).$$
In view of (11.4.30), the difference $1 - \tilde V/V$ (as is easily seen from (11.4.24)) decreases with the growth of $T_0$ according to the law $\mathrm{const}\cdot T_0^{-1}\ln T_0 + \mathrm{const}\cdot T_0^{-1} + o(T_0^{-1})$.
11.5 Generalized Shannon’s theorem

In this section we consider a generalization of results from Section 7.3 (Theorems 7.1 and 7.2) and Section 8.1 (Theorem 8.1) to the case of an arbitrary quality criterion. We recall that in Chapter 7 only one quality criterion was discussed, namely, the quality of an informational system was characterized by the mean probability of accepting a false report. The works of Kolmogorov [26] (also translated to English [27]), Shannon [43] (the original was published in English [41]), Dobrushin [6] began the extension of the specified results to the case of a more general quality criterion characterized by an arbitrary cost function $c(\xi,\zeta)$. The corresponding theorem, formulated for an arbitrary function $c(\xi,\zeta)$, which in the particular case
$$c(\xi,\zeta) = -\delta_{\xi\zeta} = \begin{cases} -1, & \text{if } \xi = \zeta \\ 0, & \text{if } \xi \ne \zeta \end{cases} \tag{11.5.1}$$
($\xi$, $\zeta$ are discrete random variables) turns into the standard results, can be naturally named the generalized Shannon’s theorem.
The defined direction of research is closely related to the third variational problem and the material covered in Chapters 9–11, because it involves an introduction of the cost function $c(\xi,\zeta)$ and the assumption that distribution $P(d\xi)$ is given a priori. Results of different strength and detail can be obtained in this direction. We present here a single theorem, which almost immediately follows from the standard Shannon’s theorem and results of Section 11.2. Consequently, it will not require a new proof.

Let $\xi$, $\zeta$ be random variables defined by the joint distribution $P(d\xi\,d\zeta)$. Let also $[P(d\tilde\eta \mid \eta)]$ be a channel such that its output variable (or variables) $\tilde\eta$ is connected with its input variable $\eta$ by the conditional probabilities $P(d\tilde\eta \mid \eta)$. The goal is to conduct encoding $\xi \to \eta$ and decoding $\tilde\eta \to \tilde\zeta$ by selecting probabilities $P(d\eta \mid \xi)$ and $P(d\tilde\zeta \mid \tilde\eta)$, respectively, in such a way that the distribution $P(d\xi\,d\tilde\zeta)$ induced by the distributions $P(d\xi)$, $P(d\eta \mid \xi)$, $P(d\tilde\eta \mid \eta)$, $P(d\tilde\zeta \mid \tilde\eta)$ coincides with the initial distribution $P(d\xi\,d\zeta)$. One can see that the variables $\xi$, $\eta$, $\tilde\eta$, $\tilde\zeta$ connected by the scheme
$$\xi \to \eta \to \tilde\eta \to \tilde\zeta, \tag{11.5.2}$$
form a Markov chain with transition probabilities mentioned above. We apply formula (6.3.7) and obtain that
$$I_{(\xi\eta)(\tilde\eta\tilde\zeta)} = I_{(\xi\eta)(\tilde\zeta\tilde\eta)} = I_{\xi\tilde\zeta} + I_{\eta\tilde\zeta|\xi} + I_{\xi\tilde\eta|\tilde\zeta} + I_{\eta\tilde\eta|\xi\tilde\zeta}. \tag{11.5.3}$$
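At the same time using the same formula we can derive that
$$I_{(\eta\xi)(\tilde\eta\tilde\zeta)} = I_{\eta\tilde\eta} + I_{\xi\tilde\eta|\eta} + I_{\eta\tilde\zeta|\tilde\eta} + I_{\xi\tilde\zeta|\eta\tilde\eta}. \tag{11.5.4}$$
According to the Markov property, the future does not depend on the past, if the present is fixed. Consequently, mutual information between the past ($\xi$) and the future ($\tilde\eta$) with the fixed present ($\eta$) equals zero: $I_{\xi\tilde\eta|\eta} = 0$. By the same reasoning
$$I_{\eta\tilde\zeta|\tilde\eta} = 0; \qquad I_{\xi\tilde\zeta|\eta\tilde\eta} = 0. \tag{11.5.5}$$
Equating (11.5.4) with (11.5.3) and taking into account (11.5.5) we get
$$I_{\xi\tilde\zeta} + I_{\eta\tilde\zeta|\xi} + I_{\xi\tilde\eta|\tilde\zeta} + I_{\eta\tilde\eta|\xi\tilde\zeta} = I_{\eta\tilde\eta}. \tag{11.5.6}$$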
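However, the quantities $I_{\eta\tilde\zeta|\xi}$, $I_{\xi\tilde\eta|\tilde\zeta}$, $I_{\eta\tilde\eta|\xi\tilde\zeta}$ (like any Shannon’s amount of information) are non-negative. Thus, if we take into account the definition of the channel capacity $C = \sup I_{\eta\tilde\eta}$ [see (8.1.3)], then (11.5.6) implies the inequality
$$I_{\xi\tilde\zeta} \le I_{\eta\tilde\eta} \qquad \text{or} \qquad I_{\xi\tilde\zeta} \le C. \tag{11.5.7}$$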
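Inequality (11.5.7) can be illustrated on a small discrete Markov chain $\xi \to \eta \to \tilde\eta \to \tilde\zeta$. The sketch below is only an illustration: the prior, encoder, channel and decoder matrices are hypothetical, and the helper names are assumptions made here. It computes $I_{\xi\tilde\zeta}$ and $I_{\eta\tilde\eta}$ in nats; the channel capacity itself is not computed, since $I_{\eta\tilde\eta} \le C$ by definition.

```python
import math


def mutual_information(p_in, transition):
    """Shannon information (nats) between an input with law p_in and the output of transition."""
    n_out = len(transition[0])
    p_out = [sum(p_in[a] * transition[a][b] for a in range(len(p_in)))
             for b in range(n_out)]
    total = 0.0
    for a, pa in enumerate(p_in):
        for b in range(n_out):
            joint = pa * transition[a][b]
            if joint > 0.0:
                total += joint * math.log(joint / (pa * p_out[b]))
    return total


def compose(first, second):
    """Transition matrix of two successive random transformations."""
    return [[sum(first[a][m] * second[m][b] for m in range(len(second)))
             for b in range(len(second[0]))]
            for a in range(len(first))]


# Hypothetical chain xi -> eta -> eta~ -> zeta~ (every row sums to one).
P_XI = [0.5, 0.3, 0.2]
ENCODER = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # P(eta | xi)
CHANNEL = [[0.85, 0.15], [0.10, 0.90]]           # P(eta~ | eta)
DECODER = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]     # P(zeta~ | eta~)

p_eta = [sum(P_XI[x] * ENCODER[x][e] for x in range(len(P_XI)))
         for e in range(len(ENCODER[0]))]
i_channel = mutual_information(p_eta, CHANNEL)
i_end_to_end = mutual_information(P_XI, compose(compose(ENCODER, CHANNEL), DECODER))

if __name__ == "__main__":
    print(i_end_to_end, i_channel)
```

Whatever stochastic matrices are substituted, the first printed value should not exceed the second.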
It can be seen from here that the distribution $P(d\xi\,d\tilde\zeta)$ can copy the initial distribution $P(d\xi\,d\zeta)$ only under the following necessary condition:
$$I_{\xi\zeta} \le C. \tag{11.5.8}$$
Otherwise, methods of encoding and decoding, i.e. $P(d\eta \mid \xi)$, $P(d\tilde\zeta \mid \tilde\eta)$, such that $P(d\xi\,d\tilde\zeta)$ coincides with $P(d\xi\,d\zeta)$, do not exist. As we can see, this conclusion follows from the most general properties of the considered concepts. The next fact is less trivial: condition (11.5.8) or, more precisely, the condition
$$\limsup I_{\xi\zeta}/C < 1, \tag{11.5.9}$$
is sufficient in an asymptotic sense, if not for coincidence of $P(d\xi\,d\tilde\zeta)$ with $P(d\xi\,d\zeta)$, then at least for a relatively good (in a certain sense) quality of this distribution. In order to formulate the specified fact, we need to introduce a quality criterion, the cost function $c(\xi,\zeta)$. The condition of equivalence between $P(d\xi\,d\tilde\zeta)$ and $P(d\xi\,d\zeta)$ can be replaced by a weaker condition
$$|E[c(\xi,\zeta)]|^{-1}\,\vartheta_+\!\left(E[c(\xi,\tilde\zeta)] - E[c(\xi,\zeta)]\right) \to 0 \tag{11.5.10}$$
($2\vartheta_+(z) = z + |z|$), which points to the fact that the quality of distribution $P(d\xi\,d\tilde\zeta)$ is asymptotically not worse than the quality of $P(d\xi\,d\zeta)$. The primary statement is formulated for a sequence of schemes (11.5.2) dependent on $n$.

Theorem 11.3. Suppose that
1. the sequence of pairs of random variables $\xi$, $\zeta$ is informationally stable (see Section 7.3);
2. the convergence
$$\frac{c(\xi,\zeta) - E[c(\xi,\zeta)]}{E[c(\xi,\zeta)]} \to 0 \tag{11.5.11}$$
in probability $P(d\xi\,d\zeta)$ takes place [compare with (11.2.31)];
3. the sequence of cost functions satisfies the boundedness condition
$$|c(\xi,\zeta)| \le K\,|E[c(\xi,\zeta)]| \tag{11.5.12}$$
[compare with (11.2.32), $K$ does not depend on $n$];
4. the sequence of channels $[P(d\tilde\eta \mid \eta)]$ is informationally stable [i.e. $(\eta, \tilde\eta)$ is informationally stable for the extremum distribution];
5. condition (11.5.9) is satisfied.
Then there exist encoding and decoding methods ensuring asymptotic equivalence in the sense of convergence (11.5.10).

Proof. The proof uses the results derived earlier and points to the methods of encoding and decoding, i.e. it structurally defines $\tilde\zeta$. Conditions 1–3 of the theorem allow us to prove the inequality
$$\limsup_{n\to\infty}|R|^{-1}\left[\tilde R(I_{\xi\zeta} + 2\varepsilon_2 I_{\xi\zeta}) - R\right] \le \varepsilon_1 + 4K\delta \qquad (R = E[c(\xi,\zeta)]; \quad \varepsilon_1 = \varepsilon_2 I/R) \tag{11.5.13}$$
in exactly the same way that we used to prove (11.2.33a) on the basis of conditions of informational stability and conditions (11.2.31), (11.2.32) in Section 11.2 [see also the derivation of (11.2.30)]. In (11.5.13) $\varepsilon_2$, $\delta$ are arbitrarily small positive values independent of $n$ and $\tilde R(I_{\xi\zeta} + 2\varepsilon_2 I_{\xi\zeta})$ is the average cost:
$$\tilde R(I_{\xi\zeta} + 2\varepsilon_2 I_{\xi\zeta}) = E\left[\inf_\zeta E[c(\xi,\zeta) \mid E_k]\right], \tag{11.5.14}$$
corresponding to a certain partition $\sum E_k$ of the sample space of $\xi$ into $M = [e^{I_{\xi\zeta} + 2\varepsilon_2 I_{\xi\zeta}}]$ regions. These regions can be constructed with the help of random code points $\zeta_k$ occurring with probabilities $P(d\zeta)$ [see Sections 11.1 and 11.2]. It follows from (11.5.14) that there exist points $\zeta_k$ such that
$$\sum_k P(E_k)\,E[c(\xi,\zeta_k) \mid E_k] - \varepsilon_3 \le \tilde R(I_{\xi\zeta} + 2\varepsilon_2 I_{\xi\zeta})$$
holds true ($\varepsilon_3 > 0$ is infinitesimal) and due to (11.5.13) we have
$$\limsup_{n\to\infty}|R|^{-1}\left[\sum_k P(E_k)\,E[c(\xi,\zeta_k) \mid E_k] - R\right] \le \varepsilon_1 + 4K\delta + |R|^{-1}\varepsilon_3. \tag{11.5.15}$$
Further, we convey the message about the index of region $E_k$ containing $\xi$ through channel $P(d\tilde\eta \mid \eta)$. In consequence of (11.5.9) we can always choose $\varepsilon_2$ such that the inequality $[I_{\xi\zeta} + \varepsilon_2(2I_{\xi\zeta} + C)]/C < 1$ holds true $\forall n > N$ for some fixed $N$. Since $M = [\exp(I_{\xi\zeta} + 2\varepsilon_2 I_{\xi\zeta})]$, the last statement means that the inequality $\ln M/C < 1 - \varepsilon_2$ is valid, i.e. (8.1.5) is satisfied. The latter together with requirement (4) of Theorem 11.3 assures that Theorem 8.1 (generalized similarly to Theorem 7.2) can be applied here. According to Theorem 8.1 the probability of error for a message reception through a channel can be made arbitrarily small with the growth of $n$:
$$P_{\mathrm{ow}} < \varepsilon_4, \tag{11.5.16}$$
starting from some $n$ ($\varepsilon_4$ is arbitrary). Let $l$ be the number of a received message having the form $\xi \in E_l$. Having received this message, the decoder outputs signal $\tilde\zeta = \zeta_l$. Hence, the output variable $\tilde\zeta$ has the probability density function
$$P(\tilde\zeta) = \sum_l P(l)\,\delta(\tilde\zeta - \zeta_l).$$
The joint distribution $P(d\xi\,d\tilde\zeta)$ takes the form
$$P(d\xi\,d\tilde\zeta) = \sum_{k,l} P(d\xi)\,\vartheta_{E_k}(\xi)\,P(l \mid k)\,\delta(\tilde\zeta - \zeta_l)\,d\tilde\zeta, \tag{11.5.17}$$
$$(\vartheta_{E_k}(\xi) = 1 \ \text{for } \xi \in E_k; \qquad \vartheta_{E_k}(\xi) = 0 \ \text{for } \xi \notin E_k)$$
where $P(l \mid k)$ is the probability of receiving message $\xi \in E_l$ if message $\xi \in E_k$ has been sent. The average cost corresponding to distribution (11.5.17) is
$$E[c(\xi,\tilde\zeta)] = \sum_{k,l} P(E_k)P(l \mid k)\,E[c(\xi,\zeta_l) \mid E_k] = \sum_k P(E_k)\,E[c(\xi,\zeta_k) \mid E_k] + \sum_{k \ne l} P(E_k)P(l \mid k)\left\{E[c(\xi,\zeta_l) \mid E_k] - E[c(\xi,\zeta_k) \mid E_k]\right\}.$$
We majorize the latter sum using (11.5.12). This yields
$$|R|^{-1}\left|E[c(\xi,\tilde\zeta)] - \sum_k P(E_k)\,E[c(\xi,\zeta_k) \mid E_k]\right| \le 2K\sum_k P(E_k)\,P_{\mathrm{ow}}(\,|\,k) \qquad \left(P_{\mathrm{ow}}(\,|\,k) = \sum_{l,\,l \ne k} P(l \mid k)\right). \tag{11.5.18}$$
Averaging over the ensemble of random codes and taking into account (11.5.16) we obtain that
$$E\left[\sum_k P(E_k)\,P_{\mathrm{ow}}(\,|\,k)\right] = E[P_{\mathrm{ow}}(\,|\,k)] = P_{\mathrm{ow}} < \varepsilon_4.$$
From here we can conclude that there exists some random code that is not worse than the ensemble average, in the sense of the inequality
$$\sum_k P(E_k)\,P_{\mathrm{ow}}(\,|\,k) < \varepsilon_4. \tag{11.5.19}$$
Thus, by virtue of (11.5.18) we have
$$\left|E[c(\xi,\tilde\zeta)] - \sum_k P(E_k)\,E[c(\xi,\zeta_k) \mid E_k]\right| \le 2K\varepsilon_4|R|.$$
Combining this inequality with (11.5.15) we obtain
$$\limsup|R|^{-1}\left\{E[c(\xi,\tilde\zeta)] - E[c(\xi,\zeta)]\right\} \le \varepsilon_1 + 4K\delta + |R|^{-1}\varepsilon_3 + 2K\varepsilon_4.$$
Because $\varepsilon_2$ (and with it $\varepsilon_1$), $\delta$, $\varepsilon_3$, $\varepsilon_4$ can be considered as small as required, equation (11.5.10) follows from here. The proof is complete.

We assumed above that the initial distribution $P(d\xi\,d\zeta)$ is given. The provided consideration can be simply extended to the cases, for which there is a set of such distributions or the Bayesian system $[P(d\xi), c(\xi,\zeta)]$ with a fixed level of cost $E[c(\xi,\zeta)] \le a$.
In the latter case, we prove the convergence
$$\frac{1}{V}\,\vartheta_+\!\left(E[c(\xi,\tilde\zeta)] - a\right) \to 0.$$
Condition (11.5.9) is replaced by $\limsup I(a)/C < 1$; condition (11.5.12) is replaced by (11.2.39), while conditions (1) and (2) of Theorem 11.3 need to be replaced by the requirements of information stability of the Bayesian system (Section 11.2, paragraph 4). After these substitutions one can apply Theorem 11.2.

In conclusion, let us confirm that the regular formulation of Shannon’s theorem follows from the stated results. To this end, let us take for $\xi$ and $\zeta$ identical discrete random variables taking $M$ values. Let us choose the distribution $P(\xi,\zeta)$ as follows:
$$P(\xi,\zeta) = \frac{1}{M}\,\delta_{\xi\zeta}.$$
In this case we evidently have
$$I_{\xi\zeta} = H_\xi = H_\zeta = \ln M \tag{11.5.20}$$
so that the condition $I_{\xi\zeta} \to \infty$ is satisfied, if $M \to \infty$. Furthermore, $I(\xi,\zeta) = \ln M$ for $\xi = \zeta$. Hence, the second condition B of informational stability from Section 7.3 is trivially satisfied. Thus, requirement (1) of Theorem 11.3 is satisfied, if $M \to \infty$. Choosing the cost function (11.5.1) we ascertain that requirements (2), (3) are also satisfied. Due to (11.5.20) inequality (11.5.9) takes the form $\limsup \ln M/C < 1$ that is equivalent to (8.1.5). In the case of the cost function (11.5.1) we have $E[c(\xi,\zeta)] = -1$ and thereby (11.5.10) takes the following form:
$$1 - E[\delta_{\xi\tilde\zeta}] = \sum_{\xi \ne \tilde\zeta} P(\xi,\tilde\zeta) \to 0. \tag{11.5.21}$$
However,
$$\sum_{\xi \ne \tilde\zeta} P(\xi,\tilde\zeta) = \frac{1}{M}\sum_\xi P(\tilde\zeta \ne \xi \mid \xi)$$
is nothing else but the mean probability of error (7.1.11). Hence, convergence (11.5.21) coincides with (7.3.7). Thus we have obtained that in this particular case, Theorem 11.3 actually coincides with Theorem 7.2.
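The reduction of criterion (11.5.10) to the mean probability of error can be checked on a toy example. The sketch below assumes $M = 3$ equiprobable messages and a hypothetical end-to-end transition matrix $P(\tilde\zeta \mid \xi)$; the matrix and the function names are illustrative assumptions. It evaluates $|E[c(\xi,\zeta)]|^{-1}\vartheta_+(E[c(\xi,\tilde\zeta)] - E[c(\xi,\zeta)])$ for the cost (11.5.1), using $E[c(\xi,\zeta)] = -1$, and compares it with $(1/M)\sum_\xi P(\tilde\zeta \ne \xi \mid \xi)$.

```python
def theta_plus(z):
    """theta_+(z) = (z + |z|) / 2, as defined after (11.5.10)."""
    return 0.5 * (z + abs(z))


def cost(xi, zeta):
    """Cost function (11.5.1): -1 for a correct report, 0 otherwise."""
    return -1.0 if xi == zeta else 0.0


def expected_cost(transition):
    """E[c(xi, zeta~)] for equiprobable xi and the rows P(zeta~ | xi)."""
    m = len(transition)
    return sum(transition[x][z] * cost(x, z)
               for x in range(m) for z in range(m)) / m


def mean_error_probability(transition):
    """(1/M) * sum over xi of P(zeta~ != xi | xi)."""
    m = len(transition)
    return sum(1.0 - transition[x][x] for x in range(m)) / m


# Hypothetical end-to-end transition probabilities P(zeta~ | xi).
TRANSITION = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.00, 0.20, 0.80],
]

IDEAL_COST = -1.0  # E[c(xi, zeta)] when P(xi, zeta) = delta / M
criterion = theta_plus(expected_cost(TRANSITION) - IDEAL_COST) / abs(IDEAL_COST)

if __name__ == "__main__":
    print(criterion, mean_error_probability(TRANSITION))
```

Both printed values should coincide, which is the content of (11.5.21) for this assumed matrix.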
Chapter 12
Information theory and the second law of thermodynamics
In this chapter, we discuss a relation between the concept of the amount of information and that of physical entropy. As is well known, the latter allows us to express quantitatively the second law of thermodynamics, which forbids, in an isolated system, the existence of processes accompanied by a decrease of entropy. If there exists an influx of information dI about the system, i.e. if the physical system is isolated only thermally, but not informationally, then the above law should be generalized by substituting inequality dH ≥ 0 with inequality dH + dI ≥ 0. Therefore, if there is an influx of information, then the thermal energy of the system can be converted (without the help of a refrigerator) into mechanical energy. In other words, the existence of perpetual motion of the second kind powered by information becomes possible.

In Sections 12.1 and 12.2, the process of transformation of thermal energy into mechanical energy using information is analysed quantitatively. Furthermore, a particular mechanism allowing such conversion is described. This mechanism consists in installing impenetrable walls and in moving them in a special way inside the physical system. Thereby, the well-known semi-qualitative arguments related to this question and, for instance, contained in the book of Brillouin [4] (the corresponding book in English is [5]) acquire an exact quantitative confirmation.

The generalization of the second law of thermodynamics by no means cancels its initial formulation. Hence, in Section 12.3 the conclusion is made that, in practice, expenditure of energy is necessary for measuring the coordinates of a physical system and for recording this information. If the system is at temperature T, then in order to receive and record the amount of information I about this system, it is necessary to spend at least energy T dI. Otherwise, the combination of an automatic measuring device and an information converter of thermal energy into mechanical energy would result in perpetual motion of the second kind. The above general rule is corroborated for a particular model of measuring device, which is described in Section 12.3. The conclusion about the necessity of minimal energy expenditure is also extended to noisy physical channels corresponding to a given temperature T (Section 12.5). Hence, the second law of thermodynamics imposes some constraints on the possibility of physical implementation of informational systems: automata-meters and channels.

© Springer Nature Switzerland AG 2020 R. V. Belavkin et al. (eds.), Theory of Information and its Value, https://doi.org/10.1007/978-3-030-22833-0_12
12.1 Information about a physical system being in thermodynamic equilibrium. The generalized second law of thermodynamics
The theory of the value of information (Chapter 9) considers information about a coordinate x, which is a random variable with probability distribution p(x)dx. In this section, in order to reveal the connection between information theory and the laws of thermodynamics, we assume that x is a continuous coordinate of a physical system in thermodynamic equilibrium. The energy of the system is presumed to be a known function E(x) of this coordinate. Further, we consider a state corresponding to temperature T. In this case, the distribution is defined by the Boltzmann–Gibbs formula
$$p(x) = \exp\{[F - E(x)]/T\}, \tag{12.1.1}$$
where
$$F = -T \ln \int e^{-E(x)/T}\,dx \tag{12.1.2}$$
is the free energy of the system. The temperature T is measured in energy units, for which the Boltzmann constant is equal to 1. It is convenient to assume that we have a thermostat at temperature T, and that the distribution mentioned above reaches its steady state as a result of protracted contact with the thermostat. Within the framework of the general theory of the value of information, distribution (12.1.1) is a special case of the probability distribution appearing in the definition of a Bayesian system. Certainly, the general results obtained for arbitrary Bayesian systems in Chapters 9 and 10 can be extended naturally to this case. Besides, some special phenomena related to the Second law of thermodynamics can be investigated, because the system under consideration is a physical system. Here we are interested in the possibility of transforming thermal energy into mechanical energy, which is facilitated by the inflow of information about the coordinate x. In defining the values of Hartley’s and Boltzmann’s information amounts (Sections 9.2 and 9.6) we assumed that incoming information about the value of x has a simple form: it indicates which region $E_k$ of the specified partition $\sum_k E_k = X$ of the sample space X the point x belongs to. This information is equivalent to an indication of the index k of the region $E_k$. Let us show that such information does indeed facilitate the conversion of thermal energy into mechanical energy. When the region $E_k$ is specified, the a priori distribution (12.1.1) is transformed into the a posteriori distribution
$$p(x \mid E_k) = \begin{cases} \exp\{[F(E_k) - E(x)]/T\} & \text{for } x \in E_k \\ 0 & \text{for } x \notin E_k, \end{cases} \tag{12.1.3}$$
where
$$F(E_k) = -T \ln \int_{E_k} e^{-E(x)/T}\,dx \tag{12.1.4}$$
is a conditional free energy. Because it is known that x lies within the region $E_k$, this region can be surrounded by impenetrable walls, and the energy function E(x) is replaced by the function
$$E(x \mid k) = \begin{cases} E(x) & \text{if } x \in E_k \\ \infty & \text{if } x \notin E_k. \end{cases} \tag{12.1.5}$$
Distribution (12.1.3) is precisely the equilibrium distribution of type (12.1.1) for this function. Then we slowly move apart the walls encompassing the region $E_k$ until they go to infinity. Pressure is exerted on the walls, and the pressure force does work when the walls are moved apart. Energy in the form of mechanical work flows to external bodies mechanically connected with the walls. This energy is equal to the well-known thermodynamic integral of the type
$$A = \int p\,dv, \qquad p = -\frac{\partial F}{\partial v} \ \text{(pressure)}.$$
The differential dA of work can be determined by varying the region $E_k$ in expression (12.1.4). During the expansion of the region $E_k$ to the region $E_k' = E_k + dE_k$, the following mechanical energy is transferred to external bodies:
$$dA = F(E_k) - F(E_k + dE_k) = T \int_{dE_k} e^{[F(E_k) - E(x)]/T}\,dx. \tag{12.1.6}$$
Because the walls are moved apart slowly, the energy transfer occurs without changing the temperature of the system. This is the result of the influx of thermal energy from the thermostat, the contact with which must not be interrupted. The source of the mechanical energy leaving the system is then the thermal energy of the thermostat, which is converted into mechanical work. In order to calculate the total work $A_k$, it is necessary to sum the differentials (12.1.6). When the walls are moved to infinity, the region $E_k$ coincides with the entire space X, and the free energy $F(E_k)$ coincides with (12.1.2). Therefore, the total mechanical energy is equal to the difference between the free energies (12.1.4) and (12.1.2):
$$A_k = F(E_k) - F. \tag{12.1.7}$$
By integrating (12.1.1) over Ek we obtain P(Ek ), while the analogous integral for (12.1.3) is equal to one. This leads us to the formula
$$e^{[F - F(E_k)]/T} = P(E_k). \tag{12.1.8}$$
Taking into account (12.1.8), we derive from (12.1.7) the equation
$$A_k = -T \ln P(E_k) = T H(\,|\,E_k), \tag{12.1.9}$$
where $H(\,|\,E_k)$ is a conditional entropy. The formula just derived corresponds to the case when the point x appears in the region $E_k$, which occurs with probability $P(E_k)$. By averaging (12.1.9) over the different regions, it is not hard to calculate the average energy converted from the thermal form into the mechanical one:
$$A = \sum_k A_k P(E_k) = T H_{E_k}. \tag{12.1.9a}$$
Supplementing the formula derived above with an inequality sign to allow for a non-equilibrium process (one that does not occur slowly enough), we have
$$A \le T H_{E_k}. \tag{12.1.10}$$
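The chain (12.1.7)–(12.1.9a) can be checked numerically. The following is a minimal sketch, assuming a quadratic energy function, a three-region partition and a finite grid in place of the space X; these choices, and all names in the code, are illustrative and are not taken from the text.

```python
import numpy as np

def free_energy(E, dx, T):
    """F = -T ln of the integral of exp(-E(x)/T), approximated by a Riemann sum."""
    return -T * np.log(np.sum(np.exp(-E / T)) * dx)

T = 1.5                                   # temperature in energy units (k = 1)
x = np.linspace(-12.0, 12.0, 240001)      # grid standing in for the space X (assumption)
dx = x[1] - x[0]
E = 0.5 * x**2                            # hypothetical energy function E(x)

F = free_energy(E, dx, T)
p = np.exp((F - E) / T)                   # Boltzmann-Gibbs density (12.1.1)

edges = [-np.inf, -1.0, 0.5, np.inf]      # hypothetical partition into E_1, E_2, E_3
works, probs = [], []
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x < hi)
    F_k = free_energy(E[mask], dx, T)     # conditional free energy (12.1.4)
    works.append(F_k - F)                 # A_k = F(E_k) - F, formula (12.1.7)
    probs.append(np.sum(p[mask]) * dx)    # P(E_k)

works, probs = np.array(works), np.array(probs)
mean_work = np.sum(works * probs)         # left-hand side of (12.1.9a)
entropy = -np.sum(probs * np.log(probs))  # Boltzmann information H_{E_k} in nats

print(works, -T * np.log(probs))          # the two arrays agree, as in (12.1.9)
print(mean_work, T * entropy)             # the two numbers agree, as in (12.1.9a)
```

The agreement holds for any partition, because it only uses the identity (12.1.8); changing `edges` alters the amount of work but not the equality.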
Thus, we have obtained that the maximum amount of thermal energy turning into work is equal to the product of the absolute temperature and the Boltzmann amount of incoming information. The influx of information about the physical system facilitates the conversion of thermal energy into work without transferring part of the energy to the refrigerator. The assertion of the Second law of thermodynamics about the impossibility of such a process is valid only in the absence of information influx. If there is an information influx dI, then the standard form of the second law, permitting only those processes for which the total entropy of an isolated system does not decrease,
$$dH \ge 0, \tag{12.1.11}$$
becomes insufficient. Constraint (12.1.11) has to be replaced by the condition that the sum of entropy and information does not decrease:
$$dH + dI \ge 0. \tag{12.1.12}$$
In the above process of converting heat into work, there was the information inflow $\Delta I = H_{E_k}$. The entropy of the thermostat decreased by A/T, i.e. ΔH = −A/T, and as a result the energy of the system did not change. Consequently, condition (12.1.12) for the specified process is valid with the equality sign. The equality sign is due to the ideal nature of the process, which was specially constructed. If the walls encompassed a larger region instead of $E_k$, or if their motion apart were not infinitely slow, then there would be an inequality sign in condition (12.1.12). Also, the obtained amount of work would be less than (12.1.9a). In view of (12.1.12), it is impossible to produce from heat an amount of work larger than (12.1.9a). The idea of generalizing the Second law of thermodynamics to the case of systems with an influx of information appeared long ago in connection with
‘Maxwell’s demon’. The latter, by opening or closing a door in a wall between two vessels (depending on the speed of a molecule approaching this door), can create a difference of temperatures or a difference of pressures without doing any work, contrary to the second law of thermodynamics. For such an action the ‘demon’ requires an influx of information. The extent to which the ‘demon’ can violate the second law of thermodynamics is bounded by the amount of incoming information. According to what has been said above, we can state this not only qualitatively, but also in the form of a precise quantitative law (12.1.12). For processes not involving an influx of information, the second law of thermodynamics in its standard form (12.1.11) remains valid, of course. Furthermore, formula (12.1.10) corresponding to the generalized law (12.1.12) was, in fact, derived on the basis of the second law (12.1.11) applied to the expanding region $E_k$. In particular, formulae (12.1.6), (12.1.7) are nothing but corollaries of the second law (12.1.11). Let us now show this. Let us denote by dQ the amount of heat that arrived from the thermostat, and let us represent the change of entropy of the thermostat as follows:
$$dH_T = -dQ/T. \tag{12.1.13}$$
According to the First law of thermodynamics,
$$dA = dQ - dU, \tag{12.1.14}$$
where U = E[E(x)] is the internal energy of the system, which is related to the free energy F via the well-known equation $U = F + T H_x$. Differentiating the latter, we have
$$dF = dU - T\,dH_x. \tag{12.1.15}$$
In this case, the second law (12.1.11) has the form $dH_T + dH_x \ge 0$, i.e. $T\,dH_x - dQ \ge 0$, which, on the basis of (12.1.14), (12.1.15), is equivalent to
$$dA \le -dF. \tag{12.1.16}$$
Taking this relation with the equality sign (which corresponds to an ideal process), we obtain the first relation (12.1.6).
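The step from the second law to (12.1.16) can be written out explicitly. The lines below only restate (12.1.13)–(12.1.15) under the isothermal assumption already made in the text; the thermostat entropy is written here as $H_T$.

```latex
\begin{aligned}
dH_T + dH_x \ge 0
  \;&\Longleftrightarrow\; T\,dH_x - dQ \ge 0
  && \text{by (12.1.13)},\\
dA &= dQ - dU = dQ - dF - T\,dH_x
  && \text{by (12.1.14), (12.1.15)},\\
   &= -dF - \bigl(T\,dH_x - dQ\bigr) \;\le\; -dF .
\end{aligned}
```

Equality holds exactly when $T\,dH_x = dQ$, i.e. for a reversible (infinitely slow) process.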
12.2 Influx of Shannon’s information and transformation of heat into work
All that has been said above about converting thermal energy into mechanical energy due to an influx of information can be applied to the case in which the information is more complex than just a specification of the region $E_k$ to which x belongs. We have such a case if errors are made in specifying the region $E_k$. Suppose that $\tilde{k}$ is the index of the region reported with a possible error, and k is the true index of the region containing x. In this case, the amount of incoming
information is determined by Shannon’s formula $I = H_{E_k} - H_{E_k \mid \tilde{k}}$. This amount of information is less than the entropy $H_{E_k}$ considered in the previous section. Further, in a more general case, information about the value of x can come not in the form of an index of a region, but in the form of some other random variable y connected with x statistically. In this case, the amount of information is also determined by Shannon’s formula [see (6.2.1)]:
$$I = H_x - H_{x \mid y}. \tag{12.2.1}$$
The posterior distribution p(x | y) will now have a more complicated form than (12.1.3). Nevertheless, the generalized second law of thermodynamics will have the same representation (12.1.12), if I is understood as Shannon’s amount of information (12.2.1). Now formula (12.1.10) can be replaced by the following:
$$A \le T I. \tag{12.2.2}$$
In order to verify this, one should consider an infinitely slow isothermal transition from the state corresponding to the a posteriori distribution p(x | y) and having entropy $H_x(\,|\,y)$ to the initial (a priori) state with the given distribution p(x). This transition has to be carried out in compliance with the second law of thermodynamics (12.1.11), i.e. according to formulae (12.1.13), (12.1.16). Summing up the elementary works (12.1.16), we obtain that every observed value y corresponds to the work
$$A_y \le -F + F(y). \tag{12.2.3}$$
Here F is the free energy (12.1.2), and F(y) is the free energy
$$F(y) = E[E(x) \mid y] - T H_x(\,|\,y), \tag{12.2.4}$$
corresponding to the a posteriori distribution p(x | y), which is regarded as an equilibrium distribution with installed walls (see below). Substituting (12.2.4) into (12.2.3), and averaging over y, in view of the equation $F = E[E(x)] - T H_x$, we have
$$A \le T H_x - T H_{x \mid y}, \tag{12.2.5}$$
that is, inequality (12.2.2). However, in this more general case, the specific ideal thermodynamic process for which formula (12.2.2) is valid with an equality sign is more complicated than that in Section 12.1. Because the a posteriori probability p(x | y) is now not concentrated on one region $E_k$, the walls surrounding the region cannot be moved apart to infinity (this is possible only if, using the condition of information stability, we consider information without error). Nonetheless, the method of installing, moving and removing the walls used above can be applied here as well. As in Section 12.1, installing and removing the walls must be performed instantaneously, without changing energy and entropy, which explains why the physical equilibrium free energy becomes equal to expression (12.2.4) after the walls are installed.
Let us consider in greater detail how to realize a thermodynamic process close to ideal. We take the a posteriori distribution p(x | y) and install in the space X a system of walls, which partition X into cells $E_k$. Without the walls the distribution p(x | y) would relax to the equilibrium distribution p(x) (12.1.1). The system of walls keeps the system in the state with distribution p(x | y). The physical system exerts some mechanical forces on the walls. Let us now move the walls slowly in such a way that the actual distribution continuously passes from the non-equilibrium p(x | y) to the equilibrium (12.1.1). All intermediate states during this process must be states of thermodynamic equilibrium for the given configuration of the walls. The final state is an equilibrium state not only in the presence of the walls, but also in their absence. Therefore, in the end, one may remove the walls imperceptibly. During the process of moving the walls, the mechanical forces exerted on the walls do some mechanical work, the mean value of which can be computed via thermodynamic methods resembling (12.2.3)–(12.2.5). We shall have exactly the equality A = T I if we manage to choose the wall configurations in such a way that the initial distribution p(x | y) and the final distribution (12.1.1) are exactly the equilibrium ones for the initial and final configurations, respectively. When the distributions p(x), p(x | y) are not uniform, in order to approach the limit T I, it may be necessary to devise a special passage to the limit, in which the sizes of certain cells between the walls tend to zero. As an example, let us consider a simpler case, when an ideal thermodynamic process is possible with a small number of cells. Let X = [0, V] be a closed interval, and let the function E(x) ≡ 0 be constant, so that distribution (12.1.1) is uniform. Further, we measure which half of the interval [0, V] contains the point x.
Let information about this be transmitted with an error whose probability is equal to p. The amount of information in this case is
$$I = \ln 2 + p \ln p + (1 - p) \ln(1 - p). \tag{12.2.5a}$$
Assume that we have received a message that y = 1, i.e. x is situated in the left half: x ∈ [0, V/2]. This corresponds to the following posterior probability density:
$$p(x \mid y = 1) = \begin{cases} 2(1 - p)/V & \text{for } 0 \le x \le V/2 \\ 2p/V & \text{for } V/2 < x \le V. \end{cases} \tag{12.2.6}$$
We install a wall at the point $x = z_0 = V/2$, which we then move slowly. In order to find the forces acting on the wall we calculate the free energy for every location z of the wall. Since E(x) ≡ 0, the calculation of the free energy reduces to a computation of entropy. If the wall has been moved from the point x = V/2 to the point x = z, then probability density (12.2.6) should be replaced by the probability density
$$p(x \mid z, 1) = \begin{cases} \dfrac{1 - p}{z} & \text{for } 0 \le x < z \\[1ex] \dfrac{p}{V - z} & \text{for } z < x \le V, \end{cases} \tag{12.2.7}$$
which has the following entropy
$$H_x(\,|\,1, z) = (1 - p) \ln \frac{z}{1 - p} + p \ln \frac{V - z}{p}$$
and free energy
$$F(1, z) = -T (1 - p) \ln \frac{z}{1 - p} - T p \ln \frac{V - z}{p}. \tag{12.2.8}$$
Differentiating with respect to z, we find the force acting on the wall:
$$-\frac{\partial F(1, z)}{\partial z} = (1 - p) \frac{T}{z} - p \frac{T}{V - z}. \tag{12.2.9}$$
If the coordinate x were located on the left, then the acting force would be equal to T/z (by analogy with the formula for the pressure of an ideal gas, z plays the role of a volume); if x were on the right, then the force would be −T/(V − z). Formula (12.2.9) gives the posterior expectation of these forces, because 1 − p is the posterior probability of the inequality x < V/2. The work of the force (12.2.9) on the interval $[z_0, z_1]$ can be calculated by taking the difference of the potentials (12.2.8):
$$A_1 = -F(1, z_1) + F(1, z_0). \tag{12.2.10}$$
The initial position of the wall, as was mentioned above, is in the middle ($z_0 = V/2$). The final position is such that the probability density (12.2.7) becomes an equilibrium one. This yields
$$(1 - p)/z_1 = p/(V - z_1) = 1/V, \qquad z_1 = (1 - p) V.$$
Substituting these values of $z_0$, $z_1$ into (12.2.10), (12.2.8), we find the work
$$A_1 = T (1 - p) \ln[2(1 - p)] + T p \ln(2p). \tag{12.2.11}$$
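The result (12.2.11) can be cross-checked by integrating the force (12.2.9) directly over the path of the wall. This is a sketch only: the values of T, V and p are arbitrary illustrative choices, and the function names are not from the text.

```python
import numpy as np

def info_nats(p):
    """Shannon information (12.2.5a) of a binary message with error probability p."""
    return np.log(2.0) + p * np.log(p) + (1.0 - p) * np.log(1.0 - p)

def force(z, p, T, V):
    """Mean force on the wall at position z, formula (12.2.9)."""
    return (1.0 - p) * T / z - p * T / (V - z)

def work_by_quadrature(p, T, V, n=200001):
    """Integrate the force from z0 = V/2 to z1 = (1 - p) V with the trapezoidal rule."""
    z = np.linspace(V / 2.0, (1.0 - p) * V, n)
    f = force(z, p, T, V)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(z)))

T, V = 2.0, 3.0   # hypothetical temperature (energy units) and interval length
for p in (0.05, 0.2, 0.4):
    # The integrated work and T * I agree up to quadrature error.
    print(p, work_by_quadrature(p, T, V), T * info_nats(p))
```

The force vanishes at the end point $z_1 = (1 - p)V$, which is another way of seeing that the final density is an equilibrium one.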
A similar result takes place for the second message y = 2. In this case we should move the wall to the other part of the interval. The mean work A is determined by the same expression (12.2.11). A comparison of this formula with (12.2.5a) shows that relation (12.2.2) is valid with the equality sign. In conclusion, we have to point out that the condition of equilibrium of the initial probability density p(x) [see (12.1.1)] is optional. If the initial state p(x) of a physical system is non-equilibrium, then we should instantly (while the coordinate x has not changed) ‘switch on’ some new Hamiltonian $\tilde{E}(x)$, which differs from the original E(x) and is such that the corresponding probability density is an equilibrium one. After that, we can apply all of our previous reasoning. When the conversion of thermal energy into work is finished, we must instantly ‘turn off’ $\tilde{E}(x)$ by returning to E(x). The expenditures of energy for ‘turning on’ and ‘turning off’ the Hamiltonian $\tilde{E}(x)$ compensate each other. Thus, the informational generalization of the Second law of thermodynamics does not depend on the condition that the initial state be equilibrium. This generalization can be stated as follows:
Theorem 12.1. If a physical system S is thermally isolated and if there is an amount of information I about the system (it does not matter of what type: Hartley, Boltzmann or Shannon), then the only possible processes are those for which the change of total entropy is not less than −I: ΔH ≥ −I. The lower bound is physically attainable.
Here thermal isolation means that the thermal energy that can be converted into work is taken from the system S itself, i.e. the thermostat is included in S. As is well known, the second law of thermodynamics is asymptotic and not quite exact. It is violated for processes related to thermal fluctuations. In view of this, we can give an adjusted (relaxed) formulation of this law: in a heat-insulated system we cannot observe processes for which the entropy increment is
$$\Delta H \ll -1. \tag{12.2.12}$$
If we represent entropy in thermodynamic units according to (1.1.7), then 1 should be replaced by the Boltzmann constant k in the right-hand side of (12.2.12). Then (12.2.12) takes the form $\Delta H_{\mathrm{phys}} \ll -k$. Just as in the more exact statement of the Second law the condition ΔH < 0 is replaced by the stronger inequality ΔH ≪ −1, so we can change the assertion of Theorem 12.1: now the forbidden processes are those for which
$$\Delta H + I \ll -1 \quad\text{or}\quad \Delta H_{\mathrm{phys}} + kI \ll -k. \tag{12.2.13}$$
The term containing I is important if I ≫ 1. We could include analogous refinements in the other sections, but we shall not dwell on this.
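To give a feeling for the orders of magnitude involved when passing from energy units to thermodynamic units, the bound T I can be evaluated in joules. The sketch below assumes SI units and the exact SI value of the Boltzmann constant; the function names and the temperature of 300 K are illustrative choices.

```python
import math

BOLTZMANN_K = 1.380649e-23   # J/K, exact value in the 2019 SI

def entropy_phys(nats):
    """Convert an entropy or information amount from nats to J/K."""
    return BOLTZMANN_K * nats

def min_energy_joules(nats, temperature_kelvin):
    """Bound T * I of the generalized second law, expressed in joules."""
    return BOLTZMANN_K * temperature_kelvin * nats

one_bit = math.log(2.0)                      # one bit expressed in nats
print(min_energy_joules(one_bit, 300.0))     # about 2.87e-21 J at 300 K
```

At room temperature one bit of information therefore corresponds to an energy of the order of $10^{-21}$ J, which is why the term containing I matters only when I is large.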
12.3 Energy costs of creating and recording information. An example
In the previous sections the random variable y carrying information about the coordinate (or coordinates) x of a physical system was assumed to be known beforehand and statistically related to (correlated with) x. The problem of the physical realization of such an information-carrying random variable must be considered separately. It is natural to treat this variable as one (or several) of the coordinates (i.e. dynamic variables) of some other physical system $S_0$, which we shall refer to as the ‘measuring device’. The Second law of thermodynamics implies some assertions about the physical procedure for creating an information signal y statistically related to x. These assertions concern the energy costs necessary to create y. They are, in some way, the converse of the statements given in Sections 12.1 and 12.2. The fact is that the physical system S (with a thermostat) discussed above, which ‘converts information and heat into work’, and the measuring instrument $S_0$ (acting automatically), which creates the information, can be combined into one system, and then
the Second law of thermodynamics can be applied to it. Information no longer enters the combined system, so inequality (12.1.12) turns into its usual form (12.1.11) for it. If the measuring instrument or a thermostat (with which the instrument is in contact) has the same temperature T as the physical system with coordinate x, then, according to the usual Second law of thermodynamics, there is no overall conversion of heat into work. This means that mechanical energy of type (12.1.10) or (12.2.5) must be converted into the heat released in the measuring instrument. It follows that every measuring instrument at temperature T must convert into heat an energy of no less than T I in order to create the amount of information I about the coordinate of a physical system. Let us first check this inherent property of any physical instrument on one simple model of a real measuring instrument. Let us construct a meter such that measuring the coordinate x of a physical system does not influence the behaviour of this system. It is convenient to consider a two-dimensional or even a one-dimensional model. Let x be the coordinate (or coordinates) of the centre of a small metal ball that moves without friction inside a tube made of insulating material (or between parallel plates). Pairs of electrodes (metal plates across which a potential difference is applied) are embedded in the insulator, whose surface is finished by grinding. When the ball takes a certain position, it connects a pair of electrodes and thus closes the circuit (Figure 12.1). By watching the current, we can determine the position of the ball. Thus, the space of values of x is divided into regions $E_1, \ldots, E_M$ corresponding to the size of the electrodes. If the number of electrode pairs is equal to $M = 2^n$, i.e. an integer power of 2, then it is convenient to select n current sources and to connect the electrodes in such a way that the presence (or absence) of current from one source gives one bit of information about the index of the region among $E_1, \ldots, E_M$. An example of such a connection for n = 3 is shown in Figures 12.1 and 12.2. When the circuit corresponding to one bit of information is closed, the resulting current reverses the magnetization of the magnetic core that is placed inside the induction coil and plays the role of a memory cell. If the current is absent, the magnetization of the core is not changed. As a result, the index of the region $E_k$ is recorded on three magnetic cores in binary code.
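The recording scheme just described can be sketched in a few lines. The sketch assumes that the regions are numbered from 0 and that a plain binary code is used; the actual wiring in Figures 12.1 and 12.2 may assign bits to sources differently, and the function names are hypothetical.

```python
def region_to_bits(k, n):
    """Return the n binary digits (most significant first) of region index k, 0 <= k < 2**n.

    Bit r equal to 1 means that current source r is closed by the ball.
    """
    if not 0 <= k < 2 ** n:
        raise ValueError("region index out of range")
    return [(k >> r) & 1 for r in reversed(range(n))]

def record(k, n, cores=None):
    """Reverse the magnetization of a core when its source carries current.

    `cores` holds the initial magnetization states (0 or 1); by default all are 0,
    so that the final states spell out the index k in binary code.
    """
    cores = [0] * n if cores is None else list(cores)
    bits = region_to_bits(k, n)
    return [c ^ b for c, b in zip(cores, bits)]

print(record(5, 3))   # [1, 0, 1]
```

Because the current reverses the magnetization rather than setting it, the recorded code is readable only relative to a known initial state of the cores, which is why the default initial state is taken to be all zeros here.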
Fig. 12.1 Arrangement of measuring contact electrodes in the tube
The sizes of the regions $E_k$ (i.e. of the plates) can be selected from optimality considerations according to the theory of the value of information. If the goal is to produce maximum work (12.1.9a), then the regions $E_k$ have to be selected such that their probabilities are equal:
$$p(E_1) = \cdots = p(E_M) = 1/M. \tag{12.3.1}$$
In this case, Boltzmann’s and Hartley’s information amounts coincide, $H_{E_k} = \ln M$, and formula (12.1.9a) gives the maximum mechanical energy equal to T ln M. The logarithm ln M of the number of regions can be called the limit information capacity of the measuring instrument. In reality, the amount of information produced by the measuring instrument turns out to be smaller than ln M as a result of errors arising in the instrument. A source of these errors is thermal fluctuations, in particular fluctuations of the current flowing through the coils $L_r$. We suppose that the temperature T of the measuring device is given (T is measured in energy units). According to the laws of statistical physics, the mean energy of the fluctuation current $i_{\mathrm{fl}}$ flowing through the induction coil L is determined by the formula
$$\frac{1}{2} L\,E[i_{\mathrm{fl}}^2] = \frac{1}{2} T. \tag{12.3.2}$$
Thus, it is proportional to the absolute temperature T. The useful current $i_{\mathrm{us}}$ from the emf source $E_r$ (Figures 12.1 and 12.2) is added to the fluctuation current $i_{\mathrm{fl}}$. It is the average energy of the useful current, $L i_{\mathrm{us}}^2/2$, that constitutes in this case the energy costs that have been mentioned above and are inherent in any measuring instrument. Let us find the connection between the energy costs and the amount of information, taking into account the fluctuation current $i_{\mathrm{fl}}$.
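Formula (12.3.2) fixes the scale of the fluctuation current against which the useful current has to be distinguished. A small sketch in SI units follows; the inductance of 1 mH and the temperature of 300 K are hypothetical values chosen only for illustration.

```python
import math

BOLTZMANN_K = 1.380649e-23  # J/K, so that T in energy units is k * T_kelvin

def rms_fluctuation_current(inductance_henry, temperature_kelvin):
    """Root-mean-square current from (1/2) L E[i_fl^2] = (1/2) k T."""
    return math.sqrt(BOLTZMANN_K * temperature_kelvin / inductance_henry)

# Hypothetical coil: L = 1 mH at 300 K gives a current of roughly 2 nA.
print(rms_fluctuation_current(1e-3, 300.0))
```

A useful current must be comparable with or larger than this value for the recorded bit to be reliable, which is the origin of the energy cost discussed next.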
Fig. 12.2 Alternative ways of connecting the measuring contact electrodes
For a given useful current $i_{\mathrm{us}}$, the total current $i = i_{\mathrm{fl}} + i_{\mathrm{us}}$ has a Gaussian distribution with the variance found from (12.3.2), that is
$$p(i \mid i_{\mathrm{us}}) = \sqrt{\frac{L}{2\pi T}}\, e^{-L(i - i_{\mathrm{us}})^2/2T}. \tag{12.3.3}$$
For $M = 2^n$, the useful current [if (12.3.1) is satisfied] is equal to 0 with probability 1/2 or to some value $i_1$ with probability 1/2. Hence,
$$p(i) = \frac{1}{2} \sqrt{\frac{L}{2\pi T}} \left[ e^{-L i^2/2T} + e^{-L(i - i_1)^2/2T} \right]. \tag{12.3.4}$$
Let us calculate the mutual information $I_{i, i_{\mathrm{us}}}$. Taking into account (12.3.3), (12.3.4), we have
$$\ln \frac{p(i \mid 0)}{p(i)} = -\ln\left\{ \frac{1}{2}\left[ 1 + e^{(L i i_1/T) - (L i_1^2/2T)} \right] \right\}, \qquad \ln \frac{p(i \mid i_1)}{p(i)} = -\ln\left\{ \frac{1}{2}\left[ e^{-(L i i_1/T) + (L i_1^2/2T)} + 1 \right] \right\}.$$
The first expression has to be averaged with the weight p(i | 0), while the second with the weight $p(i \mid i_1)$. Introducing the variable $\xi = (i_1/2 - i)\sqrt{L/T}$ (and $\xi = (i - i_1/2)\sqrt{L/T}$ for the second integral), we obtain the same expression for both integrals:
$$\int p(i \mid 0) \ln \frac{p(i \mid 0)}{p(i)}\,di = \int p(i \mid i_1) \ln \frac{p(i \mid i_1)}{p(i)}\,di = -\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(\xi - \eta)^2} \ln\left[ \frac{1}{2} + \frac{1}{2} e^{-2\xi\eta} \right] d\xi,$$
where $\eta = (i_1/2)\sqrt{L/T}$. Therefore, the information in question is equal to the same expression. Let us bring it into the following form:
$$I_{i, i_{\mathrm{us}}} = \eta^2 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(\xi - \eta)^2} \ln\cosh(\xi\eta)\,d\xi.$$
The second term is obviously negative, because cosh ξη > 1, ln cosh ξη > 0 and, consequently,
$$I_{i, i_{\mathrm{us}}} \le \eta^2 = L i_1^2/4T. \tag{12.3.5}$$
However, the value $L i_1^2/4$ is nothing but the mean energy of the useful current:
$$E\left[\frac{1}{2} L i_{\mathrm{us}}^2\right] = \frac{1}{2} \cdot \frac{1}{2} L i_1^2 + \frac{1}{2} \cdot 0.$$
Thus, we have just proven that in order to obtain the information $I_{i, i_{\mathrm{us}}}$, the mean energy of the useful current must be no less than $T I_{i, i_{\mathrm{us}}}$. All that has been said above refers only to one coil. Summing over the different circuits shows that to obtain the total information $I = \sum_{r=1}^{n} (I_{i, i_{\mathrm{us}}})_r$, it is necessary to spend, on the average, an amount of energy that is not less than T I. Consequently, for the specified measuring instrument the above statement about the energy costs necessary for receiving information is confirmed. Taking into account thermal fluctuations in the other elements of the instrument makes the inequality A > T I even stronger.
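The inequality (12.3.5) can be illustrated numerically by evaluating the integral expression for the information of one coil as a function of η. The sketch below uses a plain trapezoidal rule on a truncated interval; the truncation width, grid size and function name are assumptions of this illustration.

```python
import numpy as np

def coil_information(eta, half_width=12.0, n=400001):
    """I = eta^2 - E[ln cosh(xi * eta)] with xi ~ N(eta, 1), by trapezoidal quadrature."""
    xi = np.linspace(eta - half_width, eta + half_width, n)
    weight = np.exp(-0.5 * (xi - eta) ** 2) / np.sqrt(2.0 * np.pi)
    a = np.abs(xi * eta)
    # Overflow-safe form of ln cosh(a) = a + ln(1 + exp(-2a)) - ln 2.
    log_cosh = a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0)
    integrand = weight * log_cosh
    mean_log_cosh = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(xi))
    return float(eta ** 2 - mean_log_cosh)

for eta in (0.1, 0.5, 1.0, 2.0, 4.0):
    # Columns: eta, information in nats, energy cost A/T = eta^2.
    print(eta, coil_information(eta), eta ** 2)
```

The information stays below both $\eta^2 = A/T$ and ln 2 (one bit per coil), so increasing the useful current beyond a few times the fluctuation level buys almost no additional information.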
12.4 Energy costs of creating and recording information. General formulation
In the example considered above, the informational coordinate y was given by the currents i flowing through the coils $L_1, L_2, \ldots$ (Figures 12.1 and 12.2). In order to deal with informational variables that are more stable and better preserved in time, it is reasonable to place inside the coils magnetic cores that become magnetized when the current flows. Then the variables y will be represented by the corresponding magnetizations $m_1, m_2, \ldots$. Because the magnetization $m_r$ is a function of the initial magnetization $m_{r0}$ (independent of x) and the current i ($m = f(m_0, i)$), it is clear that the amount of information $I_{mx}$ will not exceed the amount $I_{ix} + I_{m_0 x} = I_{ix}$, that is, the amount $I_{ix}$ considered earlier. Thus, the inequality $I_{ix} \le A/T$ (12.3.5) can only become stronger under the transition to $I_{mx}$. The case of the informational signal y represented by the magnetization degrees $m_1, m_2, \ldots$ of recording magnetic cores is quite similar to the case of recording information onto a magnetic tape, where the number of ‘elementary magnets’ is rather large. One can see from the example above that the process of information ‘creation’ by a physical measuring device is inseparable from the process of physically recording the information. The inequality IT ≤ A that was checked above for a particular example can be proven from general thermodynamic considerations, if a set of general and precise definitions is introduced. Let x be a subset of the variables ξ of a dynamical system S, and y be a subset of the coordinates η of a system $S_0$, referring to the same instant of time.
We call a physical process associated with systems S and $S_0$ interacting with each other and, perhaps, with other systems a normal physical recording of information if its initial state, characterized by a multiplicative joint distribution $p_1(\xi) p_2(\eta)$, is transformed into a final state with a joint distribution p(ξ, η) having the same marginal distributions $p_1(\xi)$ and $p_2(\eta)$. Prior to and after recording the information, the systems S, $S_0$ are assumed to be non-interacting. We can now state a general formulation of the main assertion.
Theorem 12.2. If a normal physical recording of information is carried out in contact with a thermostat at temperature T, then the following energy consumption and energy transfer to the thermostat (in the form of heat) are necessary:
$$A \ge IT, \tag{12.4.1}$$
where
$$I = H_x + H_y - H_{xy} \tag{12.4.2}$$
is Shannon’s amount of information.
Proof. Let us denote by $H_+$ the entropy of the combined system $S + S_0$, while $H_T$ will denote the entropy of the thermostat. Applying the Second law of thermodynamics to the process of information recording, we have
$$\Delta H_+ + \Delta H_T \ge 0. \tag{12.4.3}$$
In this case the change of entropy is evidently of the form
$$\Delta H_+ = H_{\xi\eta} - H_\xi - H_\eta = -I_{\xi\eta}. \tag{12.4.4}$$
Thus, the thermostat has received the entropy $\Delta H_T \ge I_{\xi\eta}$, and, consequently, the transferred thermal energy is $A \ge T I_{\xi\eta}$. Where has it come from? According to the conditions of the theorem, there is no interaction between the systems S, $S_0$ both in the beginning and in the end, so the mean total energy $U_+$ is the sum of the mean partial energies: $U_+ = E[E_1(\xi)] + E[E_2(\eta)]$. They too remain invariant, because the marginal distributions $p_1(\xi)$ and $p_2(\eta)$ do not change. Hence, $\Delta U_+ = 0$ and, thus, the energy A must come from some external non-thermal energy sources during the process of information recording. In order to obtain (12.4.1), it only remains to take into account the inequality $I_{xy} \le I_{\xi\eta}$. The proof is complete.
In conclusion of this section, we consider one more example of information recording, of a completely different kind from the example in Section 12.3. A comparison of these two examples shows that the sources of external energy can be of a completely different nature. Suppose that we need to receive and record information about the fluctuating coordinate x of an arrow rotating about an axis and held near equilibrium by a spring Π (Figure 12.3). A positively charged ball is placed at the end of the arrow.
Fig. 12.3 An example of creating and recording information by means of moving together and apart the arrows that interact with the springs Π
In order to perform the ‘measurement and recording’ of the coordinate x of the arrow, we bring to it a similar second arrow placed on the same axis, but with a negatively charged ball. By bringing them together, we obtain some energy $A_1$ due to the attraction of the balls. The approach is presumed to be very fast, so that
the coordinates x, y of the arrows do not have time to change during the process. The state obtained after this approach is a non-equilibrium one. It then passes into an equilibrium state, in which a correlation between the coordinates x, y of the arrows is established. It is convenient to assume that the attraction force between the balls is much stronger than the restoring forces of the springs, so that the correlation is very strong. The transition to equilibrium is accompanied by a reduction of the average interaction energy of the balls (‘descent into a potential well’) and by the transfer of some thermal energy to the thermostat (to the environment). After the equilibrium (correlated) distribution p(x, y) is established, we move the arrows apart quickly (both movements, together and apart, are performed along the axis of rotation and do not involve the rotation coordinate). In so doing, we spend the work $A_2$, which is, obviously, greater than the work $A_1$ we obtained before, since the absolute value of the difference |x − y| has become smaller on the average. The marginal distributions p(x) and p(y) are nearly the same if the mean $E[(x - y)^2]$ is small. As thermodynamic analysis (similar to the one provided in the proof of Theorem 12.2) shows, $A_2 - A_1 = A \ge IT$ in this example. After the arrows are moved apart, there is a correlation between them, but no force interaction. The process of ‘recording’ information is complete. Of course, the obtained recording can be converted into a different form; say, it can be recorded on a magnetic tape. In so doing, the amount of information can only decrease. In this example, the work necessary for information recording is done by a human or a device moving the arrows together or apart. According to the general theory, such work must be expended wherever there is a creation of new information, i.e. a creation of new correlations, and not simply a reprocessing of old ones.
The general theory in this case only gives a theoretical lower bound on these expenditures. In practice, the expenditure of energy can exceed this thermodynamic level, even considerably. A comparison of the actual expenditure with the minimum theoretical value allows us to judge the energy efficiency of real devices.
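As a rough numerical illustration of this lower bound, the sketch below makes two assumptions that are not taken from the text: the correlated coordinates x, y are jointly Gaussian with correlation coefficient `rho`, so that their Shannon information is $-\tfrac{1}{2}\ln(1-\rho^2)$ nats, and the thermodynamic bound has the form $A \geq TI$ with temperature in energy units. The function names are hypothetical.

```python
import math


def gaussian_mutual_information(rho: float) -> float:
    """Shannon information (nats) between two jointly Gaussian variables
    with correlation coefficient rho (an assumed model, not from the text)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log(1.0 - rho * rho)


def minimum_recording_work(temperature: float, rho: float) -> float:
    """Lower bound T * I on the recording work, assuming A >= T * I
    with temperature in energy units (Boltzmann constant equal to 1)."""
    return temperature * gaussian_mutual_information(rho)


if __name__ == "__main__":
    # The stronger the correlation established between the arrows,
    # the larger the minimum work that has to be spent.
    for rho in (0.5, 0.9, 0.99):
        print(rho, minimum_recording_work(1.0, rho))
```

A real device would be compared against this value: the ratio of the bound to the work actually spent gives one possible measure of its energy efficiency.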
12.5 Energy costs in physical channels
Energy costs are necessary not only for the creation and recording of information, but also for its transmission, if the latter occurs in the presence of fluctuation disturbances, for instance, thermal ones. As is known from statistical physics, in linear systems there is a certain mean equilibrium fluctuational energy for every degree of freedom. This energy equals $T_{\mathrm{fl}}/2$, where $T_{\mathrm{fl}}$ is the environment (thermostat) temperature. In a number of works (by Brillouin [4] (the corresponding English book is [5]) and others), researchers put forward the idea that in order to transmit 1 nat of information under these conditions it is necessary to have energy of at least $T_{\mathrm{fl}}$ (we use energy units for temperature, so that the Boltzmann constant is equal to 1). In this section we shall try to make this statement more exact and to prove it.
Let us call a channel described by transition probabilities p(y | x) and the cost function c(x) a physical channel, if the variable y has the meaning of a complete
set of dynamical variables of some physical system S. The Hamilton function (energy) of the system is denoted by $E(y)$ (it is non-negative). Taking this function into account, we can apply standard formulae to calculate the equilibrium potential
$$\Gamma(\beta) = -\beta F = \ln \int e^{-\beta E(y)}\, dy \tag{12.5.1}$$
and entropy as a function of the mean energy
$$H(R) = \Gamma + \beta R, \qquad R = -\frac{d\Gamma}{d\beta}(\beta) \geq 0. \tag{12.5.2}$$
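Formulae (12.5.1), (12.5.2) can be evaluated numerically. The following minimal sketch assumes, purely for illustration, a scalar variable y with the quadratic energy $E(y) = y^2/2$; the helper names, the integration grid and the finite-difference step are hypothetical choices. For this energy the closed forms are $\Gamma = \tfrac{1}{2}\ln(2\pi/\beta)$, $R = 1/(2\beta)$ and $H = \tfrac{1}{2}\ln(2\pi e/\beta)$, which the numerical values should reproduce.

```python
import numpy as np


def potential(beta, energy, grid):
    """Gamma(beta) = ln of the integral of exp(-beta * E(y)) dy (Riemann sum)."""
    dy = grid[1] - grid[0]
    return np.log(np.sum(np.exp(-beta * energy(grid))) * dy)


def mean_energy(beta, energy, grid, h=1e-4):
    """R = -dGamma/dbeta, estimated with a central finite difference."""
    upper = potential(beta + h, energy, grid)
    lower = potential(beta - h, energy, grid)
    return -(upper - lower) / (2.0 * h)


def entropy(beta, energy, grid):
    """H = Gamma + beta * R, as in (12.5.2)."""
    return potential(beta, energy, grid) + beta * mean_energy(beta, energy, grid)


def quadratic_energy(y):
    """Assumed example energy E(y) = y**2 / 2 (not taken from the text)."""
    return 0.5 * y ** 2


grid = np.linspace(-40.0, 40.0, 400001)
beta = 2.0
R = mean_energy(beta, quadratic_energy, grid)
H = entropy(beta, quadratic_energy, grid)
print("R =", R, " closed form:", 1.0 / (2.0 * beta))
print("H =", H, " closed form:", 0.5 * np.log(2.0 * np.pi * np.e / beta))
```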
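Theorem 12.3. The capacity (see Section 8.1) of a physical channel [p(y | x), c(x)] satisfies the inequality
$$T_{\mathrm{fl}}\, C(a) \leq \mathbb{E}[E(y)] - a_{\mathrm{fl}}, \tag{12.5.3}$$
where the level $a_{\mathrm{fl}}$ and the ‘fluctuation temperature’ $T_{\mathrm{fl}}$ are defined by the equations
$$T_{\mathrm{fl}}^{-1} = \frac{dH}{dR}(a_{\mathrm{fl}}); \qquad H(a_{\mathrm{fl}}) = H_{y|x}. \tag{12.5.4}$$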
The mean energy $\mathbb{E}[E(y)]$ and the conditional entropy $H_{y|x}$ are calculated using the extremum probability density $p_0(x)$ (which is assumed to exist) realizing the channel capacity. It is also assumed that the second equation (12.5.4) has a root $a_{\mathrm{fl}}$ belonging to the normal branch, where $T_{\mathrm{fl}} > 0$.
Proof. Formulae (12.5.1), (12.5.2) emerge in the solution to the first variational problem (for instance, see Section 3.6), namely the conditional extremum problem for the entropy $H_y$ with the constraint $\mathbb{E}[E(y)] = A$. Therefore, the following inequality holds:
$$H_y \leq H(\mathbb{E}[E(y)]) \tag{12.5.5}$$
(the level $\mathbb{E}[E(y)]$ is fixed). Further, as follows from the general theory (see the Corollary from Theorem 4.1), the function H(R) is concave. Consequently, its derivative
$$\beta(R) = \frac{dH(R)}{dR} \tag{12.5.6}$$
is a non-increasing function of R. The channel capacity coincides with Shannon’s amount of information
$$C(a) = H_y - H_{y|x} \tag{12.5.7}$$
for the extremum distribution $p_0(x)$, which takes this amount to a conditional maximum. From (12.5.5) and the usual inequality $H_y \geq H_{y|x}$ we find that
$$H_{y|x} \leq H(\mathbb{E}[E(y)]). \tag{12.5.8}$$
Because $a_{\mathrm{fl}}$ is a root of the equation $H_{y|x} = H(a_{\mathrm{fl}})$ in the normal branch of the concave function $H(\cdot)$, where the derivative is non-negative, regardless of which branch the value $\mathbb{E}[E(y)]$ belongs to, equation (12.5.4) together with (12.5.8) implies the inequality $a_{\mathrm{fl}} \leq \mathbb{E}[E(y)]$. Taking this inequality into account together with the non-increasing nature of the derivative (12.5.6) we obtain that
$$H(\mathbb{E}[E(y)]) - H(a_{\mathrm{fl}}) \leq \beta(a_{\mathrm{fl}})\,\bigl[\mathbb{E}[E(y)] - a_{\mathrm{fl}}\bigr].$$
The last inequality in combination with $C(a) \leq H(\mathbb{E}[E(y)]) - H(a_{\mathrm{fl}})$, resulting from (12.5.7) and (12.5.5), yields (12.5.3). The proof is complete.
Because $a_{\mathrm{fl}}$ is positive (which follows from the non-negative nature of the energy $E(y)$), the term $a_{\mathrm{fl}}$ in (12.5.3) can be discarded. The inequality then holds all the more, although it becomes less sharp. It is not always advisable to do so, however, because formula (12.5.3) allows for the following simple interpretation. The value $a_{\mathrm{fl}}$ can be treated as the mean ‘energy of fluctuational noise’, and the difference $\mathbb{E}[E(y)] - a_{\mathrm{fl}}$ as the ‘energy of a useful signal’ that remains after passing through a channel. Hence, according to (12.5.3), for every nat of the channel capacity, there is at least an amount $T_{\mathrm{fl}}$ of ‘energy of a useful signal’. If the energy $a$ of the useful signal before passing through the channel is not less than the remaining energy $\mathbb{E}[E(y)] - a_{\mathrm{fl}}$, as is natural to expect, then the inequality $a/C(a) \geq T_{\mathrm{fl}}$ will hold all the more.
It is most straightforward to specialize these ideas for the case of a quadratic energy
$$E(y) = \sum_{j,k} e_{jk}\, y_j y_k \equiv y^{T} e^{2} y \tag{12.5.9}$$
and additive independent noise $\zeta$ with a zero mean value:
$$y = x + \zeta. \tag{12.5.10}$$
Substituting (12.5.10) into (12.5.9) and averaging out we obtain that $\mathbb{E}[E(y)] = \mathbb{E}[x^{T} e^{2} x] + \mathbb{E}[\zeta^{T} e^{2} \zeta]$, i.e.
$$\mathbb{E}[E(y)] = \mathbb{E}[E(x)] + \mathbb{E}[E(\zeta)]. \tag{12.5.11}$$
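The averaging step behind (12.5.11) can be spelled out as follows. In this sketch $M$ denotes the coefficient matrix of the quadratic form (12.5.9); the symbol $M$ is introduced here only for illustration. The cross terms vanish because the noise is independent of the signal and has zero mean.

```latex
\begin{align*}
\mathbb{E}[E(y)]
  &= \mathbb{E}\bigl[(x+\zeta)^{T} M (x+\zeta)\bigr] \\
  &= \mathbb{E}[x^{T} M x] + \mathbb{E}[x^{T} M \zeta]
     + \mathbb{E}[\zeta^{T} M x] + \mathbb{E}[\zeta^{T} M \zeta], \\
\mathbb{E}[x^{T} M \zeta]
  &= \mathbb{E}[x]^{T} M \,\mathbb{E}[\zeta] = 0, \qquad
\mathbb{E}[\zeta^{T} M x]
   = \mathbb{E}[\zeta]^{T} M \,\mathbb{E}[x] = 0 .
\end{align*}
```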
In view of (12.5.10) and the independence of noise we next have
$$H_{y|x} = H_{\zeta}. \tag{12.5.12}$$
We recall that the first variational problem, a solution to which is either the aforementioned function H(R) or the inverse function R(H), can be interpreted (as is known from Section 3.2) as the problem of minimizing the mean energy $\mathbb{E}[E]$ with fixed entropy. Thus, $R(H_{\zeta})$ is the minimum value of energy possible for fixed entropy $H_{\zeta}$, i.e.
$$\mathbb{E}[E(\zeta)] \geq R(H_{\zeta}). \tag{12.5.13}$$
However, the value $R(H_{\zeta})$ due to (12.5.4), (12.5.12) is nothing but $a_{\mathrm{fl}}$, and therefore (12.5.13) can be rewritten as follows
$$\mathbb{E}[E(\zeta)] \geq a_{\mathrm{fl}}. \tag{12.5.14}$$
From (12.5.11) and (12.5.14) we obtain $\mathbb{E}[E(y)] - a_{\mathrm{fl}} \geq \mathbb{E}[E(x)]$, with equality when the noise attains the minimum energy in (12.5.13), i.e. $\mathbb{E}[E(\zeta)] = a_{\mathrm{fl}}$. In that case the basic derived inequality (12.5.3) takes the form $T_{\mathrm{fl}}\, C(a) \leq \mathbb{E}[E(x)]$ or
$$T_{\mathrm{fl}}\, C(a) \leq a, \tag{12.5.15}$$
if the cost function c(x) coincides with the energy E(x). The results given here concerning physical channels are closely connected to Theorem 8.5. In this theorem, however, the ‘temperature parameter’ 1/β has a formal mathematical meaning. In order for this parameter to have the meaning of physical temperature, the costs c(x) or b(y) have to be specified as physical energy. According to inequality (12.5.15), in order to transmit one nat of information through a Gaussian physical channel we need energy that is not less than $T_{\mathrm{fl}}$. It should be noted that we could not derive any universal inequality of the type (12.5.15) containing a real (rather than effective) temperature.
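Inequality (12.5.15) can be checked on the simplest Gaussian case. The sketch below is an illustration under stated assumptions rather than part of the original derivation: y is scalar, the energy is $E(y) = \kappa y^2$, the noise is Gaussian with variance `noise_var`, and the input is Gaussian with variance `signal_var`, so that the capacity is the familiar $\tfrac{1}{2}\ln(1 + \text{signal\_var}/\text{noise\_var})$ nats. For this energy $H(R) = \tfrac{1}{2}\ln(2\pi e R/\kappa)$, hence $a_{\mathrm{fl}} = \kappa \cdot \text{noise\_var}$ and $T_{\mathrm{fl}} = 2 a_{\mathrm{fl}}$. The function name and parameters are hypothetical.

```python
import math


def gaussian_channel_energy_check(kappa, noise_var, signal_var):
    """Return (T_fl * C, a) for y = x + zeta with energy E(y) = kappa * y**2.

    Assumes scalar Gaussian noise and a Gaussian input (illustrative model).
    """
    a_fl = kappa * noise_var   # root of H(a_fl) = H_zeta for this energy
    t_fl = 2.0 * a_fl          # T_fl = 1 / (dH/dR) evaluated at R = a_fl
    capacity = 0.5 * math.log(1.0 + signal_var / noise_var)  # in nats
    a = kappa * signal_var     # mean energy E[E(x)] of the input signal
    return t_fl * capacity, a


for snr in (0.01, 1.0, 100.0):
    lhs, a = gaussian_channel_energy_check(kappa=1.0, noise_var=1.0, signal_var=snr)
    print(f"signal/noise = {snr:g}: T_fl * C = {lhs:.4f}, a = {a:.4f}")
```

In this model the bound reduces to $\ln(1+u) \leq u$ with $u$ the signal-to-noise ratio, so it is nearly tight for weak signals and increasingly loose for strong ones.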
Appendix A
Some matrix (operator) identities
A.1 Rules for operator transfer from left to right
Suppose we have two arbitrary, not necessarily square, matrices A and B such that the matrix products AB and BA are defined. In operator language, this means the following: if A maps an element of space X into an element of space Y, then B maps an element of space Y into an element of space X, i.e. it defines a mapping in the opposite direction. Under broad assumptions regarding the function f(z), the following formula holds:
$$A f(BA) = f(AB) A. \tag{A.1.1}$$
Let us prove this formula under the assumption that f can be expressed as the Taylor series
$$f(z) = \sum_{n=0}^{\infty} \frac{1}{n!} f^{(n)}(0)\, z^{n}, \tag{A.1.2}$$
where z = AB or BA. (Note that a large number of generalizations can be obtained via the passage to the limit $\lim f_m = f$, where $\{f_m\}$ is a sequence of appropriate expandable functions.) Substituting (A.1.2) into (A.1.1) we observe that both the right-hand and the left-hand sides of the equality turn into the same expression
$$\sum_{n=0}^{\infty} \frac{1}{n!} f^{(n)}(0)\, ABA \cdots ABA,$$
in which the $n$th term contains $n$ factors B, since $A(BA)^{n} = (AB)^{n} A$.
This proves equality (A.1.1). As a matter of fact, matrices AB and BA have different dimensionalities, and actually they represent operators acting in different spaces (the first one in Y, the second one in X). The same is true regarding matrices f(AB) and f(BA). However, let us compare their traces. Using expansion (A.1.2), we obtain
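$$\operatorname{tr} f(AB) = \operatorname{tr} f(0) + f'(0) \operatorname{tr}(AB) + \tfrac{1}{2} f''(0) \operatorname{tr}(ABAB) + \cdots, \tag{A.1.3}$$
$$\operatorname{tr} f(BA) = \operatorname{tr} f(0) + f'(0) \operatorname{tr}(BA) + \tfrac{1}{2} f''(0) \operatorname{tr}(BABA) + \cdots \tag{A.1.4}$$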
However, $\operatorname{tr}(AB) = \operatorname{tr}(BA) = \sum_{ij} A_{ij} B_{ji}$ and, therefore, $\operatorname{tr}(A[(BA)^{k} B]) = \operatorname{tr}([(BA)^{k} B]A)$. This is why all the terms in (A.1.3), (A.1.4), apart from the first one, are identical. In general, the first terms $\operatorname{tr} f(0)$ are not identical, because the operator f(0) in the expansion of f(AB) and the same operator in the expansion of f(BA) are multiples of identity matrices of different dimensions. However, if the following condition is met
$$f(0) = 0, \tag{A.1.5}$$
then, consequently,
$$\operatorname{tr} f(AB) = \operatorname{tr} f(BA). \tag{A.1.6}$$
If we were interested in determinants, then for the corresponding equality
$$\det F(AB) = \det F(BA) \tag{A.1.7}$$
instead of (A.1.5), we would have the condition $F(0) = 1$, because $\ln(\det F) = \operatorname{tr}(\ln F)$ (here $\ln F = f$).
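The identities (A.1.1), (A.1.6) and (A.1.7) are easy to check numerically for rectangular matrices. The following sketch uses polynomial functions as a stand-in for the Taylor series (A.1.2): `f_coeffs` is a truncated expansion of $e^{z} - 1$ (so that f(0) = 0) and `F_coeffs` a truncated expansion of $e^{z}$ (so that F(0) = 1). The helper `matrix_function`, the matrix sizes and the random seed are illustrative assumptions.

```python
import numpy as np


def matrix_function(coeffs, M):
    """Evaluate the polynomial sum_n coeffs[n] * M**n for a square matrix M."""
    result = np.zeros_like(M)
    power = np.eye(M.shape[0])
    for c in coeffs:
        result = result + c * power
        power = power @ M
    return result


rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))   # maps a 5-dimensional space X into a 3-dimensional space Y
B = rng.normal(size=(5, 3))   # maps Y back into X

f_coeffs = [0.0, 1.0, 0.5, 1.0 / 6.0]   # f(0) = 0, condition (A.1.5)
F_coeffs = [1.0, 1.0, 0.5, 1.0 / 6.0]   # F(0) = 1, condition for (A.1.7)

# (A.1.1): A f(BA) = f(AB) A, both sides are 3 x 5 matrices
transfer_lhs = A @ matrix_function(f_coeffs, B @ A)
transfer_rhs = matrix_function(f_coeffs, A @ B) @ A

# (A.1.6): traces of the 3 x 3 and 5 x 5 matrices coincide when f(0) = 0
trace_AB = np.trace(matrix_function(f_coeffs, A @ B))
trace_BA = np.trace(matrix_function(f_coeffs, B @ A))

# (A.1.7): determinants coincide when F(0) = 1
det_AB = np.linalg.det(matrix_function(F_coeffs, A @ B))
det_BA = np.linalg.det(matrix_function(F_coeffs, B @ A))

print(np.allclose(transfer_lhs, transfer_rhs),
      np.isclose(trace_AB, trace_BA),
      np.isclose(det_AB, det_BA))
```

Replacing `f_coeffs[0]` by a non-zero value breaks the trace equality, because the constant term then contributes 3 f(0) on one side and 5 f(0) on the other.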
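A.2 Determinant of a block matrix
It is required to compute the determinant of the matrix
$$K = \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \tag{A.2.1}$$
where A, D are square matrices, and B, C are arbitrary matrices. Matrix D is assumed to be non-singular. Let us denote
$$L = \begin{pmatrix} A & B \\ D^{-1}C & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & D^{-1} \end{pmatrix} \begin{pmatrix} A & B \\ C & D \end{pmatrix}, \quad \text{so that} \quad K = \begin{pmatrix} 1 & 0 \\ 0 & D \end{pmatrix} L$$
(here 1 and 0 are matrices) and, therefore,
$$\det K = \det D \, \det L. \tag{A.2.2}$$
It is easy to check by direct multiplication that
$$\begin{pmatrix} A & B \\ D^{-1}C & 1 \end{pmatrix} = \begin{pmatrix} 1 & B \\ 0 & 1 \end{pmatrix} \begin{pmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & 1 \end{pmatrix}. \tag{A.2.3}$$
However,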
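$$\det \begin{pmatrix} 1 & B \\ 0 & 1 \end{pmatrix} = 1; \qquad \det \begin{pmatrix} A - BD^{-1}C & 0 \\ D^{-1}C & 1 \end{pmatrix} = \det(A - BD^{-1}C).$$
Therefore, (A.2.3) yields $\det L = \det(A - BD^{-1}C)$. Substituting this equality into (A.2.2) we obtain
$$\det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det D \, \det(A - BD^{-1}C). \tag{A.2.4}$$
Similarly, if $A^{-1}$ exists, then
$$\det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det A \, \det(D - CA^{-1}B). \tag{A.2.5}$$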
According to the above formulas, the problem of calculating the original determinant is reduced to the problem of calculating determinants of smaller dimension.
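Formulas (A.2.4) and (A.2.5) can be verified on random blocks. In the sketch below the block sizes, the random seed and the diagonal shift (added only to keep A and D comfortably non-singular) are illustrative assumptions; `np.linalg.solve` is used in place of an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4

# Square diagonal blocks, shifted to keep them well conditioned (assumption)
A = rng.normal(size=(n, n)) + 3.0 * np.eye(n)
D = rng.normal(size=(m, m)) + 3.0 * np.eye(m)
# Rectangular off-diagonal blocks
B = rng.normal(size=(n, m))
C = rng.normal(size=(m, n))

K = np.block([[A, B], [C, D]])
det_K = np.linalg.det(K)

# (A.2.4): det K = det D * det(A - B D^{-1} C)
det_via_D = np.linalg.det(D) * np.linalg.det(A - B @ np.linalg.solve(D, C))

# (A.2.5): det K = det A * det(D - C A^{-1} B)
det_via_A = np.linalg.det(A) * np.linalg.det(D - C @ np.linalg.solve(A, B))

print(det_K, det_via_D, det_via_A)
```

The determinant of the 7 × 7 matrix is thus obtained from determinants of 3 × 3 and 4 × 4 matrices only.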
References
1. Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton (1957)
2. Bellman, R.E.: Dynamic Programming (Translation to Russian). Inostrannaya Literatura, Moscow (1960)
3. Berger, T.: Rate Distortion Theory. A Mathematical Basis for Data Compression. Prentice Hall, Englewood Cliffs (1971)
4. Brillouin, L.: Science and Information Theory (Translation from French to Russian). Fizmatgiz, Moscow (1960)
5. Brillouin, L.: Science and Information Theory. Academic, New York (1962)
6. Dobrushin, R.L.: The general formulation of the fundamental theorem of Shannon in the theory of information. Usp. Mat. Nauk 14(6) (1959, in Russian)
7. Doob, J.: Stochastic Processes. Wiley Publications in Statistics. Wiley, New York (1953)
8. Doob, J.: Stochastic Processes (Translation to Russian). Inostrannaya Literatura, Moscow (1956)
9. Fano, R.M.: Transmission of Information: A Statistical Theory of Communications, 1st edn. MIT Press, Cambridge (1961)
10. Fano, R.M.: Transmission of Information: A Statistical Theory of Communications (Translation to Russian). Mir, Moscow (1965)
11. Feinstein, A.: Foundations of Information Theory. McGraw-Hill, New York (1958)
12. Feinstein, A.: Foundations of Information Theory (Translation to Russian). Inostrannaya Literatura, Moscow (1960)
13. Gnedenko, B.V.: The Course of the Theory of Probability. Fizmatgiz, Moscow (1961, in Russian)
14. Gnedenko, B.V.: The Theory of Probability (Translation from Russian). Chelsea, New York (1962)
15. Goldman, S.: Information Theory. Prentice Hall, Englewood Cliffs (1953)
16. Goldman, S.: Information Theory (Translation to Russian). Inostrannaya Literatura, Moscow (1957)
17. Grishanin, B.A., Stratonovich, R.L.: The value of information and sufficient statistics when observing a stochastic process. Izv. USSR Acad. Sci. Tech. Cybern. 6, 4–12 (1966, in Russian)
18. Hartley, R.V.L.: Transmission of information. Bell Syst. Tech. J. 7(3) (1928)
19. Hartley, R.V.L.: Transmission of information (Translation to Russian). In: A. Harkevich (ed.) Theory of Information and Its Applications. Fizmatgiz, Moscow (1959)
20. Hill, T.L.: Statistical Mechanics. McGraw-Hill Book Company Inc., New York (1956)
21. Hill, T.L.: Statistical Mechanics (Translation to Russian). Inostrannaya Literatura, Moscow (1960)
22. Hirsch, M.J., Pardalos, P.M., Murphey, R. (eds.): Dynamics of Information Systems: Theory and Applications. Springer Optimization and Its Applications Series, vol. 40. Springer, Berlin (2010)
23. Huffman, D.A.: A method for the construction of minimum redundancy codes. Proc. IRE 40(9), 1098–1101 (1952)
24. Jahnke, E., Emde, F.: Tables of Functions with Formulae and Curves. Dover Publications, New York (1945)
25. Jahnke, E., Emde, F.: Tables of Functions with Formulae and Curves (Translation from German to Russian). Gostekhizdat, Moscow (1949)
26. Kolmogorov, A.N.: Theory of transmission of information. In: USSR Academy of Sciences Session on Scientific Problems Related to Production Automation. USSR Academy of Sciences, Moscow (1957, in Russian)
27. Kolmogorov, A.N.: Theory of transmission of information (translation from Russian). Am. Math. Soc. Translat. Ser. 2(33) (1963)
28. Kraft, L.G.: A device for quantizing, grouping, and coding amplitude-modulated pulses. Master’s Thesis, Massachusetts Institute of Technology, Dept. of Electrical Engineering (1949)
29. Kullback, S.: Information Theory and Statistics. Wiley, New York (1959)
30. Kullback, S.: Information Theory and Statistics (Translation to Russian). Nauka, Moscow (1967)
31. Leontovich, M.A.: Statistical Physics. Gostekhizdat, Moscow (1944, in Russian)
32. Leontovich, M.A.: Introduction to Thermodynamics. GITTL, Moscow-Leningrad (1952, in Russian)
33. Pinsker, M.S.: The quantity of information about a Gaussian random stationary process, contained in a second process connected with it in a stationary manner. Dokl. Akad. Nauk USSR 99, 213–216 (1954, in Russian)
34. Rao, C.R.: Linear Statistical Inference and Its Applications. Wiley, New York (1965)
35. Rao, C.R.: Linear Statistical Inference and Its Applications (Translation to Russian). Inostrannaya Literatura, Moscow (1968)
36. Ryzhik, J.M., Gradstein, I.S.: Tables of Series, Products and Integrals. Gostekhizdat, Moscow (1951, in Russian)
37. Ryzhik, J.M., Gradstein, I.S.: Tables of Series, Products and Integrals (Translation from Russian). Academic, New York (1965)
38. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27 (1948)
39. Shannon, C.E.: Communication in the presence of noise. Proc. IRE 37(1), 10–21 (1949)
40. Shannon, C.E.: Certain results in coding theory for noisy channels. Inform. Control 1(1) (1957)
41. Shannon, C.E.: Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec. 4(1), 142–163 (1959)
42. Shannon, C.E.: Certain results in coding theory for noisy channels (translation to Russian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information Theory and Cybernetics. Inostrannaya Literatura, Moscow (1963)
43. Shannon, C.E.: Coding theorems for a discrete source with a fidelity criterion (translation to Russian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information Theory and Cybernetics. Inostrannaya Literatura, Moscow (1963)
44. Shannon, C.E.: Communication in the presence of noise (translation to Russian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information Theory and Cybernetics. Inostrannaya Literatura, Moscow (1963)
45. Shannon, C.E.: A mathematical theory of communication (translation to Russian). In: R.L. Dobrushin, O.B. Lupanov (eds.) Works on Information Theory and Cybernetics. Inostrannaya Literatura, Moscow (1963)
46. Stratonovich, R.L.: On statistics of magnetism in the Ising model (in Russian). Fizika Tvyordogo Tela 3(10) (1961)
47. Stratonovich, R.L.: On the value of information (in Russian). Izv. USSR Acad. Sci. Tech. Cybern. 5, 3–12 (1965)
48. Stratonovich, R.L.: Conditional Markov Processes and Their Application to the Theory of Optimal Control. Moscow State University, Moscow (1966, in Russian)
49. Stratonovich, R.L.: The value of information when observing a stochastic process in systems containing finite automata. Izv. USSR Acad. Sci. Tech. Cybern. 5, 3–13 (1966, in Russian)
50. Stratonovich, R.L.: Amount of information and entropy of segments of stationary Gaussian processes. Problemy Peredachi Informacii 3(2), 3–21 (1967, in Russian)
51. Stratonovich, R.L.: Extremal problems of information theory and dynamic programming. Izv. USSR Acad. Sci. Tech. Cybern. 5, 63–77 (1967, in Russian)
52. Stratonovich, R.L.: Conditional Markov Processes and Their Application to the Theory of Optimal Control (Translation from Russian). Modern Analytic and Computational Methods in Science and Mathematics. Elsevier, New York (1968)
53. Stratonovich, R.L.: Theory of Information. Sovetskoe Radio, USSR, Moscow (1975)
54. Stratonovich, R.L.: Topics in the Theory of Random Noise, vol. 1. Martino Fine Books, Eastford (2014)
55. Stratonovich, R.L., Grishanin, B.A.: The value of information when a direct observation of an estimated random variable is impossible. Izv. USSR Acad. Sci. Tech. Cybern. 3, 3–15 (1966, in Russian)
56. Stratonovich, R.L., Grishanin, B.A.: Game problems with constraints of an informational type. Izv. USSR Acad. Sci. Tech. Cybern. 1, 3–12 (1968, in Russian)