PROBABILITY, STATISTICS, AND RANDOM SIGNALS
CHARLES G. BONCELET JR.
University of Delaware
New York • Oxford
OXFORD UNIVERSITY PRESS
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide.

Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto

With offices in
Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam

Copyright © 2016 by Oxford University Press

For titles covered by Section 112 of the U.S. Higher Education Opportunity Act, please visit www.oup.com/us/he for the latest information about pricing and alternate formats.

Published by Oxford University Press
198 Madison Avenue, New York, NY 10016
http://www.oup.com

Oxford is a registered trademark of Oxford University Press.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of Oxford University Press.

Library of Congress Cataloging-in-Publication Data
Names: Boncelet, Charles G.
Title: Probability, statistics, and random signals / Charles G. Boncelet Jr.
Description: New York : Oxford University Press, [2017] | Series: The Oxford series in electrical and computer engineering | Includes index.
Identifiers: LCCN 2015034908 | ISBN 9780190200510
Subjects: LCSH: Mathematical statistics–Textbooks. | Probabilities–Textbooks. | Electrical engineering–Mathematics–Textbooks.
Classification: LCC QA276.18 .B66 2017 | DDC 519.5–dc23
LC record available at http://lccn.loc.gov/2015034908
Printing number: 9 8 7 6 5 4 3 2 1 Printed in the United States of America on acid-free paper
CONTENTS

PREFACE

1 PROBABILITY BASICS
  1.1 What Is Probability?
  1.2 Experiments, Outcomes, and Events
  1.3 Venn Diagrams
  1.4 Random Variables
  1.5 Basic Probability Rules
  1.6 Probability Formalized
  1.7 Simple Theorems
  1.8 Compound Experiments
  1.9 Independence
  1.10 Example: Can S Communicate With D?
    1.10.1 List All Outcomes
    1.10.2 Probability of a Union
    1.10.3 Probability of the Complement
  1.11 Example: Now Can S Communicate With D?
    1.11.1 A Big Table
    1.11.2 Break Into Pieces
    1.11.3 Probability of the Complement
  1.12 Computational Procedures
  Summary
  Problems

2 CONDITIONAL PROBABILITY
  2.1 Definitions of Conditional Probability
  2.2 Law of Total Probability and Bayes Theorem
  2.3 Example: Urn Models
  2.4 Example: A Binary Channel
  2.5 Example: Drug Testing
  2.6 Example: A Diamond Network
  Summary
  Problems

3 A LITTLE COMBINATORICS
  3.1 Basics of Counting
  3.2 Notes on Computation
  3.3 Combinations and the Binomial Coefficients
  3.4 The Binomial Theorem
  3.5 Multinomial Coefficient and Theorem
  3.6 The Birthday Paradox and Message Authentication
  3.7 Hypergeometric Probabilities and Card Games
  Summary
  Problems

4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
  4.1 Probability Mass Functions
  4.2 Cumulative Distribution Functions
  4.3 Expected Values
  4.4 Moment Generating Functions
  4.5 Several Important Discrete PMFs
    4.5.1 Uniform PMF
    4.5.2 Geometric PMF
    4.5.3 The Poisson Distribution
  4.6 Gambling and Financial Decision Making
  Summary
  Problems

5 MULTIPLE DISCRETE RANDOM VARIABLES
  5.1 Multiple Random Variables and PMFs
  5.2 Independence
  5.3 Moments and Expected Values
    5.3.1 Expected Values for Two Random Variables
    5.3.2 Moments for Two Random Variables
  5.4 Example: Two Discrete Random Variables
    5.4.1 Marginal PMFs and Expected Values
    5.4.2 Independence
    5.4.3 Joint CDF
    5.4.4 Transformations With One Output
    5.4.5 Transformations With Several Outputs
    5.4.6 Discussion
  5.5 Sums of Independent Random Variables
  5.6 Sample Probabilities, Mean, and Variance
  5.7 Histograms
  5.8 Entropy and Data Compression
    5.8.1 Entropy and Information Theory
    5.8.2 Variable Length Coding
    5.8.3 Encoding Binary Sequences
    5.8.4 Maximum Entropy
  Summary
  Problems

6 BINOMIAL PROBABILITIES
  6.1 Basics of the Binomial Distribution
  6.2 Computing Binomial Probabilities
  6.3 Moments of the Binomial Distribution
  6.4 Sums of Independent Binomial Random Variables
  6.5 Distributions Related to the Binomial
    6.5.1 Connections Between Binomial and Hypergeometric Probabilities
    6.5.2 Multinomial Probabilities
    6.5.3 The Negative Binomial Distribution
    6.5.4 The Poisson Distribution
  6.6 Binomial and Multinomial Estimation
  6.7 Alohanet
  6.8 Error Control Codes
    6.8.1 Repetition-by-Three Code
    6.8.2 General Linear Block Codes
    6.8.3 Conclusions
  Summary
  Problems

7 A CONTINUOUS RANDOM VARIABLE
  7.1 Basic Properties
  7.2 Example Calculations for One Random Variable
  7.3 Selected Continuous Distributions
    7.3.1 The Uniform Distribution
    7.3.2 The Exponential Distribution
  7.4 Conditional Probabilities
  7.5 Discrete PMFs and Delta Functions
  7.6 Quantization
  7.7 A Final Word
  Summary
  Problems

8 MULTIPLE CONTINUOUS RANDOM VARIABLES
  8.1 Joint Densities and Distribution Functions
  8.2 Expected Values and Moments
  8.3 Independence
  8.4 Conditional Probabilities for Multiple Random Variables
  8.5 Extended Example: Two Continuous Random Variables
  8.6 Sums of Independent Random Variables
  8.7 Random Sums
  8.8 General Transformations and the Jacobian
  8.9 Parameter Estimation for the Exponential Distribution
  8.10 Comparison of Discrete and Continuous Distributions
  Summary
  Problems

9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
  9.1 The Gaussian Distribution and Density
  9.2 Quantile Function
  9.3 Moments of the Gaussian Distribution
  9.4 The Central Limit Theorem
  9.5 Related Distributions
    9.5.1 The Laplace Distribution
    9.5.2 The Rayleigh Distribution
    9.5.3 The Chi-Squared and F Distributions
  9.6 Multiple Gaussian Random Variables
    9.6.1 Independent Gaussian Random Variables
    9.6.2 Transformation to Polar Coordinates
    9.6.3 Two Correlated Gaussian Random Variables
  9.7 Example: Digital Communications Using QAM
    9.7.1 Background
    9.7.2 Discrete Time Model
    9.7.3 Monte Carlo Exercise
    9.7.4 QAM Recap
  Summary
  Problems

10 ELEMENTS OF STATISTICS
  10.1 A Simple Election Poll
  10.2 Estimating the Mean and Variance
  10.3 Recursive Calculation of the Sample Mean
  10.4 Exponential Weighting
  10.5 Order Statistics and Robust Estimates
  10.6 Estimating the Distribution Function
  10.7 PMF and Density Estimates
  10.8 Confidence Intervals
  10.9 Significance Tests and p-Values
  10.10 Introduction to Estimation Theory
  10.11 Minimum Mean Squared Error Estimation
  10.12 Bayesian Estimation
  Problems

11 GAUSSIAN RANDOM VECTORS AND LINEAR REGRESSION
  11.1 Gaussian Random Vectors
  11.2 Linear Operations on Gaussian Random Vectors
  11.3 Linear Regression
    11.3.1 Linear Regression in Detail
    11.3.2 Statistics of the Linear Regression Estimates
    11.3.3 Computational Issues
    11.3.4 Linear Regression Examples
    11.3.5 Extensions of Linear Regression
  Summary
  Problems

12 HYPOTHESIS TESTING
  12.1 Hypothesis Testing: Basic Principles
  12.2 Example: Radar Detection
  12.3 Hypothesis Tests and Likelihood Ratios
  12.4 MAP Tests
  Summary
  Problems

13 RANDOM SIGNALS AND NOISE
  13.1 Introduction to Random Signals
  13.2 A Simple Random Process
  13.3 Fourier Transforms
  13.4 WSS Random Processes
  13.5 WSS Signals and Linear Filters
  13.6 Noise
    13.6.1 Probabilistic Properties of Noise
    13.6.2 Spectral Properties of Noise
  13.7 Example: Amplitude Modulation
  13.8 Example: Discrete Time Wiener Filter
  13.9 The Sampling Theorem for WSS Random Processes
    13.9.1 Discussion
    13.9.2 Example: Figure 13.4
    13.9.3 Proof of the Random Sampling Theorem
  Summary
  Problems

14 SELECTED RANDOM PROCESSES
  14.1 The Lightbulb Process
  14.2 The Poisson Process
  14.3 Markov Chains
  14.4 Kalman Filter
    14.4.1 The Optimal Filter and Example
    14.4.2 QR Method Applied to the Kalman Filter
  Summary
  Problems

A COMPUTATION EXAMPLES
  A.1 Matlab
  A.2 Python
  A.3 R

B ACRONYMS

C PROBABILITY TABLES
  C.1 Tables of Gaussian Probabilities

D BIBLIOGRAPHY

INDEX
PREFACE
I have many goals for this book, but this is foremost: I have always liked probability and have been fascinated by its application to predicting the future. I hope to encourage this generation of students to study, appreciate, and apply probability to the many applications they will face in the years ahead.

To the student: This book is written for you. The prose style is less formal than many textbooks use. This more engaging prose was chosen to encourage you to read the book. I firmly believe a good textbook should help you learn the material. But it will not help if you do not read it.

Whenever I ask my students what they want to see in a text, the answer is: "Examples. Lots of examples." I have tried to heed this advice and included "lots" of examples. Many are small, quick examples to illustrate a single concept. Others are long, detailed examples designed to demonstrate more sophisticated concepts. Finally, most chapters end in one or more longer examples that illustrate how the concepts of that chapter apply to engineering or scientific applications.

Almost all the concepts and equations are derived using routine algebra. Read the derivations, and reproduce them yourselves. A great learning technique is to read through a section, then write down the salient points. Read a derivation, and then reproduce it yourself. Repeat the sequence—read, then reproduce—until you get it right.

I have included many figures and graphics. The old expression, "a picture is worth a thousand words," is still true. I am a believer in Edward Tufte's graphics philosophy: maximize the data-ink ratio.¹ All graphics are carefully drawn. They each have enough ink to tell a story, but only enough ink.

To the instructor: This textbook has several advantages over other textbooks. It is the right size—not too big and not too small. It should cover the essential concepts for the level of the course, but should not cover too much. Part of the art of textbook writing is to decide what should be in and what should be out.

The selection of topics is, of course, a determination on the part of the author and represents the era in which the book is written. When I first started teaching my course more than two decades ago, the selection of topics favored continuous random variables and continuous time random processes. Over time, discrete random variables and discrete time random processes have grown in importance. Students today are expected to understand more statistics than in the past. Computation is much more important and more immediate. Each year I add a bit more computation to the course than the prior year.

I like computation. So do most students. Computation gives a reality to the theoretical concepts. It can also be fun. Throughout the book, there are computational examples and exercises. Unfortunately, not everyone uses the same computational packages. The book uses
¹ Edward Tufte, The Visual Display of Quantitative Information, 2nd ed. Cheshire, CT: Graphics Press, 2001. A great book, highly recommended.
three of the most popular: Matlab, Python, and R. For the most part, we alternate between Matlab and Python and postpone discussion of R until the statistics chapters.

Most chapters have a common format: introductory material, followed by deeper and more involved topics, and then one or more examples illustrating the application of the concepts, a summary of the main topics, and a list of homework problems. The instructor can choose how far into each chapter to go. For instance, I usually cover entropy (Chapter 5) and Aloha (Chapter 6), but skip error-correcting coding (also Chapter 6).

I am a firm believer that before statistics or random processes can be understood, the student must have a good knowledge of probability. A typical undergraduate class can cover the first nine chapters in about two-thirds of a semester, giving the student a good understanding of both discrete and continuous probability. The instructor can select topics from the later chapters to fill out the rest of the semester. If students have had basic probability in a prior course, the first nine chapters can be covered quickly and greater emphasis placed on the remaining chapters. Depending on the focus of the course, the instructor can choose to emphasize statistics by covering the material in Chapters 10 through 12. Alternatively, the instructor can emphasize random signals by covering Chapters 13 and 14.

The text can be used in a graduate class. Assuming the students have seen some probability as undergraduates, the first nine chapters can be covered quickly and more attention paid to the last five chapters. In my experience, most new graduate students need to refresh their probability knowledge. Reviewing the first nine chapters will be time well spent. Graduate students will also benefit from doing computational exercises and learning the similarities and differences in the three computational packages discussed, Matlab, Python, and R.
Chapter Coverage

Chapters 1 and 2 are a fairly standard introduction to probability. The first chapter introduces the basic definitions and the three axioms, proves a series of simple theorems, and concludes with detailed examples of calculating probabilities for simple networks. The second chapter covers conditional probability, Bayes theorem and the law of total probability, and several applications.

Chapter 3 is a detour into combinatorics. A knowledge of combinatorics is essential to understanding probability, especially discrete probability, but students often confuse the two, thinking combinatorics to be a branch of probability. The two are different, and we emphasize that. Much of the development of probability in history was driven by gambling. I, too, use examples from gambling and game play in this chapter (and in some later chapters as well). Students play games and occasionally gamble. Examples from these subjects help bring probability to the student life experience—and we show that gambling is unlikely to be profitable!

Chapters 4 and 5 introduce discrete probability mass functions, distribution functions, expected values, change of variables, and the uniform, geometric, and Poisson distributions. Chapter 4 culminates with a discussion of the financial considerations of gambling versus buying insurance. Chapter 5 ends with a long section on entropy and data compression. (It still amazes me that most textbooks targeting an electrical and computer engineering audience omit entropy.) Chapter 6 presents binomial, multinomial, negative binomial, hypergeometric, and Poisson probabilities and considers the connections between these important discrete probability distributions. It is punctuated by two optional sections, the first on the Aloha protocol and the second on error-correcting codes.

Chapters 7 and 8 present continuous random variables and their densities and distribution functions. Expected values and changes of variables are also presented, as is an extended example on quantization. Chapter 9 presents the Gaussian distribution. Moments, expected values, and change of variables are also presented here. The central limit theorem is motivated by multiple examples showing how the probability mass function or density function converges to the Gaussian density. Some of the related distributions, including the Laplace, Rayleigh, and chi-squared, are presented. The chapter concludes with an extended example on digital communications using quadrature amplitude modulation. Exact and approximate error rates are computed and compared to a Monte Carlo simulation.

The first nine chapters are typically covered in order at whatever speed is comfortable for the instructor and students. The remaining chapters can be divided into two subjects, statistics and random processes. Chapters 10, 11, and 12 comprise an introduction to statistics, linear regression, and hypothesis testing. Chapters 13 and 14 introduce random processes and random signals. These chapters are not serial; the instructor can pick and choose whichever chapters or sections to cover.

Chapter 10 presents basic statistics. At this point, the student should have a good understanding of probability and be ready to understand the "why" behind statistical procedures. Standard and robust estimates of mean and variance, density and distribution estimates, confidence intervals, and significance tests are presented. Finally, maximum likelihood, minimum mean squared estimation, and Bayes estimation are discussed.

Chapter 11 takes a linear algebra approach (vectors and matrices) to multivariate Gaussian random variables and uses this approach to study linear regression. Chapter 12 covers hypothesis testing from a traditional engineering point of view. MAP (maximum a posteriori), Neyman-Pearson, and Bayesian hypothesis tests are presented.

Chapter 13 studies random signals, with particular emphasis on those signals that appear in engineering applications. Wide sense stationary signals, noise, linear filters, and modulation are covered. The chapter ends with a discussion of the sampling theorem. Chapter 14 focuses on the Poisson process and Markov processes and includes a section on Kalman filtering.

Let me conclude this preface by repeating my overall goal: that the student will develop not only an understanding and appreciation of probability, statistics, and random processes but also a willingness to apply these concepts to the various problems that will occur in the years ahead.
Acknowledgments

I would like to thank the reviewers who helped shape this text during its development. Their many comments are much appreciated. They are the following:

Deva K. Borah, New Mexico State University
Petar M. Djurić, Stony Brook University
Jens Gregor, University of Tennessee
Eddie Jacobs, University of Memphis
JeongHee Kim, San José State University
Nicholas J. Kirsch, University of New Hampshire
Joerg Kliewer, New Jersey Institute of Technology
Sarah Koskie, Indiana University-Purdue University Indianapolis
Ioannis (John) Lambadaris, Carleton University
Eric Miller, Tufts University
Ali A. Minai, University of Cincinnati
Robert Morelos-Zaragoza, San José State University
Xiaoshu Qian, Santa Clara University
Danda B. Rawat, Georgia Southern University
Rodney Roberts, Florida State University
John M. Shea, University of Florida
Igor Tsukerman, The University of Akron

I would like to thank the following people from Oxford University Press who helped make this book a reality: Nancy Blaine, John Appeldorn, Megan Carlson, Christine Mahon, Daniel Kaveney, and Claudia Dukeshire.

Last, and definitely not least, I would like to thank my children, Matthew and Amy, and my wife, Carol, for their patience over the years while I worked on this book.

Charles Boncelet
CHAPTER 1

PROBABILITY BASICS
In this chapter, we introduce the formalism of probability, from experiments to outcomes to events. The three axioms of probability are introduced and used to prove a number of simple theorems. The chapter concludes with some examples.
1.1 WHAT IS PROBABILITY?

Probability refers to how likely something is. By convention, probabilities are real numbers between 0 and 1. A probability of 0 refers to something that never occurs; a probability of 1 refers to something that always occurs. Probabilities between 0 and 1 refer to things that sometimes occur. For instance, an ordinary coin when flipped will land heads up about half the time and land tails up about half the time. We say the probability of heads is 0.5; the probability of tails is also 0.5. As another example, a typical telephone line has a probability of sending a data bit correctly of around 0.9999, or 1 − 10⁻⁴. The probability the bit is incorrect is 10⁻⁴. A fiber-optic line may have a bit error rate as low as 10⁻¹⁵.

Imagine Alice sends a message to Bob. For Bob to receive any information (any new knowledge), the message must be unknown to Bob. If Bob knew the message before receiving it, then he gains no new knowledge from hearing it. Only if the message is random to Bob will Bob receive any information.

There are a great many applications where people try to predict the future. Stock markets, weather, sporting events, and elections all are random. Successful prediction of any of these would be immensely profitable, but each seems to have substantial randomness. Engineers worry about reliability of devices and systems. Engineers control complex systems, often without perfect knowledge of the inputs. People are building self-driving automobiles and aircraft. These devices must operate successfully even though all sorts of unpredictable events may occur.
Probabilities may be functions of other variables, such as time and space. The probability of someone getting cancer is a function of lots of things, including age, gender, genetics, dietary habits, whether the person smokes, and where the person lives. Noise in an electric circuit is a function of time and temperature. The number of questions answered correctly on an exam is a function of what questions are asked—and how prepared the test taker is!

In some problems, time is the relevant quantity. How many flips of a coin are required before the first head occurs? How many before the 100th head?

The point of this is that many experiments feature randomness, where the result of the experiment is not known in advance. Furthermore, repetitions of the same experiment may produce different results. Flipping a coin once and getting heads does not mean that a second flip will be heads (or tails). Probability is about understanding and quantifying this randomness.
Comment 1.1: Is a coin flip truly random? Is it unpredictable? Presumably, if we knew the mass distribution of the coin, the initial force (both linear and rotational) applied to the coin, the density of the air, and any air currents, we could use physics to compute the path of the coin and how it will land (heads or tails). From this point of view, the coin is not random.

In practice, we usually do not know these variables. Most coins are symmetric (or close to symmetric). As long as the number of rotations of the coin is large, we can reasonably assume the coin will land heads up half the time and tails up half the time, and we cannot predict which will occur on any given toss. From this point of view, the coin flip is random.

However, there are rules, even if they are usually unspoken. The coin flipper must make no attempt to control the flip (i.e., to control how many rotations the coin undergoes before landing). The flipper must also make no attempt to control the catch of the coin or its landing. These concerns are real. Magicians have been known to practice flipping coins until they can control the flip. (And sometimes they simply cheat.) Only if the rules are followed can we reasonably assume the coin flip is random.
EXAMPLE 1.1
Let us test this question: How many flips are required to get a head? Find a coin, and flip it until a head occurs. Record how many flips were required. Repeat the experiment again, and record the result. Do this at least 10 times. Each of these is referred to as a run, a sequence of tails ending with a heads. What is the longest run you observed? What is the shortest? What is the average run length? Theory tells us that the average run length will be about 2.0, though of course your average may be different.
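For readers who prefer to run the experiment in software rather than with a physical coin, here is a minimal simulation sketch in Python (one of the three computational packages used later in the book). The function name and the choice of 10,000 repetitions are ours, not the text's.

```python
import random

def run_length(p=0.5):
    """Flip a coin with Pr[heads] = p until a head occurs.
    Return the number of flips (the tails plus the final head)."""
    flips = 1
    while random.random() >= p:  # this flip was a tail (probability 1 - p)
        flips += 1
    return flips

# Repeat the experiment many times; theory predicts an average near 2.0.
runs = [run_length() for _ in range(10_000)]
print("shortest run:", min(runs))
print("longest run: ", max(runs))
print("average run: ", sum(runs) / len(runs))  # should be close to 2.0
```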
1.2 EXPERIMENTS, OUTCOMES, AND EVENTS

An experiment is whatever is done. It may be flipping a coin, rolling some dice, measuring a voltage or someone's height and weight, or numerous others. The experiment results in outcomes. The outcomes are the atomic results of the experiment. They cannot be divided further. For instance, for a coin flip, the outcomes are heads and tails; for a counting experiment (e.g., the number of electrons crossing a PN junction), the outcomes are the nonnegative integers, 0, 1, 2, 3, . . .. Outcomes are denoted with italic lowercase letters, perhaps with subscripts, such as x, n, a₁, a₂, etc.

The number of outcomes can be finite or infinite, as in the two examples mentioned in the paragraph above. Furthermore, the experiment can result in discrete outcomes, such as the integers, or continuous outcomes, such as a person's weight. For now, we postpone continuous experiments to Chapter 7 and consider only discrete experiments.

Sets of outcomes are known as events. Events are denoted with italic uppercase Roman letters, perhaps with subscripts, such as A, B, and Aᵢ. The outcomes in an event are listed with braces. For instance, A = {1,2,3,4} or B = {2,4,6}. A is the event containing the outcomes 1, 2, 3, and 4, while B is the event containing outcomes 2, 4, and 6.

The set of all possible outcomes is the sample space and is denoted by S. For example, the outcomes of a roll of an ordinary six-sided die¹ are 1, 2, 3, 4, 5, and 6. The sample space is S = {1,2,3,4,5,6}. The set containing no outcomes is the empty set and is denoted by ∅. The complement of an event A, denoted Ā, is the event containing every outcome not in A. The sample space is the complement of the empty set, and vice versa.

The usual rules of set arithmetic apply to events. The union of two events, A ∪ B, is the event containing outcomes in either A or B. The intersection of two events, A ∩ B or more simply AB, is the event containing all outcomes in both A and B. For any event A, A ∩ Ā = AĀ = ∅ and A ∪ Ā = S.

EXAMPLE 1.2
Consider a roll of an ordinary six-sided die, and let A = {1,2,3,4} and B = {2,4,6}. Then, A ∪ B = {1,2,3,4,6} and A ∩ B = {2,4}. Ā = {5,6} and B̄ = {1,3,5}.
EXAMPLE 1.3
Consider the following experiment: A coin is flipped three times. The outcomes are the eight flip sequences: hhh, hht, . . . , ttt. If A = {first flip is heads} = {hhh, hht, hth, htt}, then Ā = {ttt, tth, tht, thh}. If B = {exactly two heads} = {hht, hth, thh}, then A ∪ B = {hhh, hht, hth, htt, thh} and AB = {hht, hth}.
Comment 1.2: Be careful in defining events. In the coin flipping experiment above, an event might be specified as C = {two heads}. Is this “exactly two heads” or “at least two heads”? The former is {hht, hth, thh}, while the latter is {hht, hth, thh, hhh}.
¹ One is a die; two or more are dice.
Set arithmetic obeys DeMorgan's laws:

$$\overline{A \cup B} = \overline{A} \cap \overline{B} \tag{1.1}$$
$$\overline{A \cap B} = \overline{A} \cup \overline{B} \tag{1.2}$$
DeMorgan's laws are handy when the complements of events are easier to define and specify than the events themselves.

A is a subset of B, denoted A ⊂ B, if each outcome in A is also in B. For instance, if A = {1,2} and B = {1,2,4,6}, then A ⊂ B. Note that any set is a subset of itself, A ⊂ A. If A ⊂ B and B ⊂ A, then A = B.

Two events are disjoint (also known as mutually exclusive) if they have no outcomes in common, that is, if AB = ∅. A collection of events, Aᵢ for i = 1,2, . . ., is pairwise disjoint if each pair of events is disjoint, i.e., AᵢAⱼ = ∅ for all i ≠ j. A collection of events, Aᵢ for i = 1,2, . . ., forms a partition of S if the events are pairwise disjoint and the union of all events is the sample space:

$$A_i A_j = \emptyset \quad \text{for } i \neq j, \qquad \bigcup_{i=1}^{\infty} A_i = S$$
In the next chapter, we introduce the law of total probability, which uses a partition to divide a problem into pieces, with each Ai representing a piece. Each piece is solved and the pieces combined to get the total solution.
1.3 VENN DIAGRAMS

A useful tool for visualizing relationships between sets is the Venn diagram. Typically, Venn diagrams use a box for the sample space and circles (or circle-like figures) for the various events.

In Figure 1.1, we show a simple Venn diagram. The outer box, labeled S, denotes the sample space. All outcomes are in S. The two circles, A and B, represent two events. The shaded area is the union of these two events.

FIGURE 1.1 A Venn diagram of A ∪ B.
FIGURE 1.2 A Venn diagram "proof" of the second of DeMorgan's laws (Equation 1.2). The "dark" parts show $\overline{AB} = \overline{A} \cup \overline{B}$, while the "light" parts show AB = A ∩ B.
One can see that A = AB ∪ AB̄, that B = AB ∪ ĀB, and that A ∪ B = AB̄ ∪ AB ∪ ĀB.

Figure 1.2 presents a simple Venn diagram proof of Equation (1.2). The dark shaded area in the leftmost box represents $\overline{AB}$, and the shaded areas in the two rightmost boxes represent Ā and B̄, respectively. The left box is the logical OR of the two rightmost boxes. On the other hand, the light area on the left is AB. It is the logical AND of A and B.

Figure 1.3 shows a portion of the Venn diagram of A ∪ B ∪ C. The shaded area, representing the union, can be divided into seven parts. One part is ABC, another part is ABC̄, etc. Problem 1.13 asks the reader to complete the picture.

FIGURE 1.3 A Venn diagram of A ∪ B ∪ C. See Problem 1.13 for details.
1.4 RANDOM VARIABLES

It is often convenient to refer to the outcomes of experiments as numbers. For instance, it is convenient to refer to "heads" as 1 and "tails" as 0. The faces of most six-sided dice are labeled with pips (dots). We refer to the side with one pip as 1, to the side with two pips as 2, etc. In other experiments, the mapping is less clear because the outcomes are naturally numbers. A coin can be flipped n times and the number of heads counted. Or a large number
of bits can be transmitted across a wireless communications network and the number of bits received in error counted. A randomly chosen person's height, weight, age, temperature, and blood pressure can be measured. All these quantities are represented by numbers.

Random variables are mappings from outcomes to numbers. We denote random variables with bold-italic uppercase Roman letters (or sometimes Greek letters), such as X and Y, and sometimes with subscripts, such as X₁, X₂, etc. The outcomes are denoted with italic lowercase letters, such as x, y, and n. For instance,

$$X(\text{heads}) = 1 \qquad X(\text{tails}) = 0$$

Events, sets of outcomes, become relations on the random variables. For instance,

$$\{\text{heads}\} = \{X(\text{heads}) = 1\} = \{X = 1\}$$

where we simplify the notation and write just {X = 1}. As another example, let Y denote the number of heads in three flips of a coin. Then, various events are written as follows:

$$\{hhh\} = \{Y = 3\}$$
$$\{hht, hth, thh\} = \{Y = 2\}$$
$$\{hhh, hht, hth, thh\} = \{2 \le Y \le 3\} = \{Y = 2\} \cup \{Y = 3\}$$
1.5 BASIC PROBABILITY RULES In this section, we take an intuitive approach to the basic rules of probability. In the next section, we give a more formal approach to the basic rules. When the experiment is performed, one outcome is selected. Any event or events containing that outcome are true; all other events are false. This can be a confusing point: even though only one outcome is selected, many events can be true because many events can contain the selected outcome. For example, consider the experiment of rolling an ordinary six-sided die. The outcomes are the numbers 1, 2, 3, 4, 5, and 6. Let A = {1,2,3,4}, B = {2,4,6}, and C = {2}. Then, if the roll results in a 4, events A and B are true while C is false.
Comment 1.3: The operations of set arithmetic are analogous to those of Boolean algebra. Set union is analogous to Boolean Or, set intersection to Boolean And, and set complement to Boolean complement. For example, if C = A ∪ B, then C contains the selected outcome if either A or B (or both) contain the selected outcome. Alternatively, we say C is true if A is true or B is true.
Probability is a function of events that yields a number. If A is some event, then the probability of A, denoted Pr[A], is a number; that is,

$$\Pr[A] = \text{number} \tag{1.3}$$

Probabilities are computed as follows: Each outcome in S is assigned a probability between 0 and 1 such that the sum of all the outcome probabilities is 1. Then, for example, if A = {a₁, a₂, a₃}, the probability of A is the sum of the outcome probabilities in A; that is,

$$\Pr[A] = \Pr[\{a_1, a_2, a_3\}] = \Pr[a_1] + \Pr[a_2] + \Pr[a_3]$$

A probability of 0 means the event does not occur. The empty set ∅, for instance, has probability 0, or Pr[∅] = 0, since it has no outcomes. By definition, whatever outcome is selected is not in the empty set. Conversely, the sample space contains all outcomes. It is always true. Probabilities are normalized so that the probability of the sample space is 1:

$$\Pr[S] = 1$$
The probability of any event A is between 0 and 1; that is, 0 ≤ Pr[A] ≤ 1. Since A ∪ Ā = S, it is reasonable to expect that Pr[A] + Pr[Ā] = 1. This is indeed true and can be handy. Sometimes one of these probabilities, Pr[A] or Pr[Ā], is much easier to compute than the other one. Reiterating, for any event A,

$$0 \le \Pr[A] \le 1$$
$$\Pr[A] + \Pr[\overline{A}] = 1$$

The probabilities of nonoverlapping events add: if AB = ∅, then Pr[A ∪ B] = Pr[A] + Pr[B]. If the events overlap (i.e., have outcomes in common), then we must modify the formula to eliminate any double counting. There are two main ways of doing this. The first adds the two probabilities and then subtracts the probability of the overlapping region:

$$\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[AB]$$

The second avoids the overlap by breaking the union into nonoverlapping pieces:

$$\Pr[A \cup B] = \Pr[A\overline{B} \cup AB \cup \overline{A}B] = \Pr[A\overline{B}] + \Pr[AB] + \Pr[\overline{A}B]$$

Both formulas are useful.

A crucial notion in probability is that of independence. Independence means two events, A and B, do not affect each other. For example, flip a coin twice, and let A represent the event
the first coin is heads and B the event the second coin is heads. If the two coin flips are done in such a way that the result of the first flip does not affect the second flip (as coin flips are usually done), then we say the two flips are independent. When A and B are independent, the probabilities multiply:

$$\Pr[AB] = \Pr[A]\Pr[B] \qquad \text{if } A \text{ and } B \text{ are independent}$$

Another way of thinking about independence is that knowing A has occurred (or not occurred) does not give us any information about whether B has occurred, and conversely, knowing B does not give us information about A. See Chapter 2 for further discussion of this view of independence.

Comment 1.4: Sometimes probabilities are expressed as percentages. A probability of 0.5 might be expressed as a 50% chance of occurring. The notation Pr[A] is shorthand for the more complex "probability that event A is true," which itself is shorthand for the even more complex "probability that one of the outcomes in A is the result of the experiment." Similarly, intersections and unions can be thought of in terms of Boolean algebra: Pr[A ∪ B] means "the probability that event A is true or event B is true," and Pr[AB] means "the probability that event A is true and event B is true."
EXAMPLE 1.4
In Example 1.2, we defined two events, A and B, but said nothing about the probabilities. Assume each side of the die is equally likely. Since there are six sides and each side is equally likely, the probability of any one side must be 1/6:
$$\Pr[A] = \Pr[\{1,2,3,4\}] \qquad \text{(list the outcomes of } A\text{)}$$
$$= \Pr[1] + \Pr[2] + \Pr[3] + \Pr[4] \qquad \text{(break the event into its outcomes)}$$
$$= \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{4}{6} \qquad \text{(each side equally likely)}$$
$$\Pr[B] = \Pr[\{2,4,6\}] = \tfrac{3}{6} = \tfrac{1}{2}$$

Continuing, A ∪ B = {1,2,3,4,6} and AB = {2,4}. Thus,

$$\Pr[A \cup B] = \Pr[\{1,2,3,4,6\}] = \tfrac{5}{6} \qquad \text{(first, solve directly)}$$
$$= \Pr[A] + \Pr[B] - \Pr[AB] = \tfrac{4}{6} + \tfrac{3}{6} - \tfrac{2}{6} = \tfrac{5}{6} \qquad \text{(second, solve with union formula)}$$

Alternatively, AB̄ = {1,3}, ĀB = {6}, and

$$\Pr[A \cup B] = \Pr[A\overline{B}] + \Pr[AB] + \Pr[\overline{A}B] = \tfrac{2}{6} + \tfrac{2}{6} + \tfrac{1}{6} = \tfrac{5}{6}$$

EXAMPLE 1.5
In Example 1.4, we assumed all sides of the die are equally likely. The probabilities do not have to be equally likely. For instance, consider the following probabilities:
$$\Pr[1] = 0.5 \qquad \Pr[k] = 0.1 \quad \text{for } k = 2,3,4,5,6$$

Then, repeating the above calculations,

$$\Pr[A] = \Pr[\{1,2,3,4\}] \qquad \text{(list the outcomes of } A\text{)}$$
$$= \Pr[1] + \Pr[2] + \Pr[3] + \Pr[4] \qquad \text{(break the event into its outcomes)}$$
$$= \tfrac{1}{2} + \tfrac{1}{10} + \tfrac{1}{10} + \tfrac{1}{10} = \tfrac{8}{10} \qquad \text{(unequal probabilities)}$$
$$\Pr[B] = \Pr[\{2,4,6\}] = \tfrac{3}{10}$$

Continuing, A ∪ B = {1,2,3,4,6}, and AB = {2,4}. Thus,

$$\Pr[A \cup B] = \Pr[\{1,2,3,4,6\}] = \tfrac{9}{10} \qquad \text{(first, solve directly)}$$
$$= \Pr[A] + \Pr[B] - \Pr[AB] = \tfrac{8}{10} + \tfrac{3}{10} - \tfrac{2}{10} = \tfrac{9}{10} \qquad \text{(second, solve with union formula)}$$

Alternatively,

$$\Pr[A \cup B] = \Pr[AB] + \Pr[A\overline{B}] + \Pr[\overline{A}B] = \tfrac{2}{10} + \tfrac{6}{10} + \tfrac{1}{10} = \tfrac{9}{10}$$
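The outcome-summing recipe of Examples 1.4 and 1.5 is mechanical enough to hand to a computer. Below is a small Python sketch (our own illustration, not from the text) that evaluates Pr[A], Pr[B], and Pr[A ∪ B] for both the equally likely die and the unequal probabilities above.

```python
def prob(event, pmf):
    """The probability of an event (a set of outcomes) is the sum of
    the probabilities of the outcomes it contains."""
    return sum(pmf[outcome] for outcome in event)

A = {1, 2, 3, 4}
B = {2, 4, 6}

fair    = {k: 1/6 for k in range(1, 7)}                      # Example 1.4
unequal = {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1}   # Example 1.5

for pmf in (fair, unequal):
    direct  = prob(A | B, pmf)                                # sum over the union
    formula = prob(A, pmf) + prob(B, pmf) - prob(A & B, pmf)  # union formula
    print(direct, formula)   # both ways agree: 5/6, then 9/10
```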
1.6 PROBABILITY FORMALIZED

A formal development begins with three axioms. Axioms are truths that are unproven but accepted. We present the three axioms of probability, then use these axioms to prove several basic theorems. The first two axioms are simple, while the third is more complicated:

Axiom 1: Pr[A] ≥ 0 for any event A.

Axiom 2: Pr[S] = 1, where S is the sample space.

Axiom 3: If Aᵢ for i = 1,2, . . . are pairwise disjoint, then

$$\Pr\Bigl[\,\bigcup_{i=1}^{\infty} A_i\Bigr] = \sum_{i=1}^{\infty} \Pr[A_i] \tag{1.4}$$
From these three axioms, the basic theorems about probability are proved. The first axiom states that all probabilities are nonnegative. The second axiom states that the probability of the sample space is 1. Since the sample space contains all possible outcomes (by definition), the result of the experiment (the outcome that is selected) is contained in S. Thus, S is always true, and its probability is 1. The third axiom says that the probabilities of nonoverlapping events add; that is, if two or more events have no outcomes in common, then the probability of the union is the sum of the individual probabilities.

Probability is like mass. The first axiom says mass is nonnegative, the second says the mass of the universe is 1, and the third says the masses of nonoverlapping bodies add. In advanced texts, the word "measure" is often used in discussing probabilities.

This third axiom is handy in computing probabilities. Consider an event A containing outcomes a₁, a₂, . . . , aₙ. Then,

$$A = \{a_1, a_2, \ldots, a_n\} = \{a_1\} \cup \{a_2\} \cup \cdots \cup \{a_n\}$$
$$\Pr[A] = \Pr[\{a_1\}] + \Pr[\{a_2\}] + \cdots + \Pr[\{a_n\}]$$
since the events {aᵢ} are disjoint. When the context is clear, the clumsy notation Pr[{aᵢ}] is replaced by the simpler Pr[aᵢ]. In words, the paradigm is clear: divide the event into its outcomes, calculate the probability of each outcome (technically, of the event containing only that outcome), and sum the probabilities to obtain the probability of the event.

Comment 1.5: The third axiom is often presented in a finite form:

$$\Pr\Bigl[\,\bigcup_{i=1}^{n} A_i\Bigr] = \sum_{i=1}^{n} \Pr[A_i]$$

when the Aᵢ are pairwise disjoint. A common special case holds for two disjoint events: if AB = ∅, then Pr[A ∪ B] = Pr[A] + Pr[B]. Both of these are special cases of the third axiom (just let the excess Aᵢ = ∅). But, for technical reasons that are beyond this text, the finite version does not imply the infinite version.
1.7 SIMPLE THEOREMS

In this section, we use the axioms to prove a series of "simple theorems" about probabilities. These theorems are so simple that they are often mistaken to be axioms themselves.
Theorem 1.1: Pr[∅] = 0.

Proof:

$$1 = \Pr[S] \qquad \text{(the second axiom)}$$
$$= \Pr[S \cup \emptyset] \qquad (S = S \cup \emptyset)$$
$$= \Pr[S] + \Pr[\emptyset] \qquad \text{(by the third axiom)}$$
$$= 1 + \Pr[\emptyset] \qquad \text{(by the second axiom)}$$

The last implies Pr[∅] = 0.

This theorem provides a symmetry to Axiom 2, which states the probability of the sample space is 1. This theorem states the probability of the null space is 0.

The next theorem relates the probability of an event to the probability of its complement. The importance lies in the simple observation that one of these probabilities may be easier to compute than the other.
Theorem 1.2: Pr[Ā] = 1 − Pr[A]. In other words, Pr[A] + Pr[Ā] = 1.

Proof: By definition, A ∪ Ā = S. Combining the second and third axioms, one obtains

$$1 = \Pr[S] = \Pr[A \cup \overline{A}] = \Pr[A] + \Pr[\overline{A}]$$

A simple rearrangement yields Pr[Ā] = 1 − Pr[A].

This theorem is useful in practice. Calculate Pr[A] or Pr[Ā], whichever is easier, and then subtract from 1 if necessary.
Theorem 1.3: Pr[A] ≤ 1 for any event A.

Proof: Since 0 ≤ Pr[Ā] (by Axiom 1) and Pr[A] = 1 − Pr[Ā], it follows immediately that Pr[A] ≤ 1.

Combining Theorem 1.3 and Axiom 1, one obtains
0 ≤ Pr A ≤ 1
12 CHAPTER 1 PROBABILITY BASICS
for any event A. This bears repeating: all probabilities are between 0 and 1. One can combine this result with Axiom 2 and Theorem 1.3 to create a simple form:
0 = Pr ≤ Pr A ≤ Pr S = 1 The probability of the null event is 0 (it contains no outcomes, so the null event can never be true). The probability of the sample space is 1 (it contains all outcomes, so the sample space is always true). All other events are somewhere in between 0 and 1 inclusive. While it may seem counter-intuitive, it is reasonable in many experiments to define outcomes that have events, A and B, can be defined such zero probability. Then, nontrivial that A = but Pr A = 0 and B = S but Pr B = 1. Many probability applications depend on parameters. This theorem provides a sanity check on whether a supposed result can be correct. For instance, let the probability of a head be p. Since probabilities are between 0 and 1, it must be true that 0 ≤ p ≤ 1. Now, one might ask what is the probability of getting three heads in a row? One might guess the answer is 3p, but this answer is obviously incorrect since 3p > 1 when p > 1/3. If the coin flips are independent (discussed below in Section 1.9), the probability of three heads in a row is p3 . If 0 ≤ p ≤ 1, then 0 ≤ p3 ≤ 1. This answer is possibly correct: it is between 0 and 1 for all permissible values of p. Of course, lots of incorrect answers are also between 0 and 1. For instance, p2 , p/3, and cos(pπ/2) are all between 0 and 1 for 0 ≤ p ≤ 1, but none is correct.
Theorem 1.4: If A ⊂ B, then Pr[A] ≤ Pr[B].

Proof:

$$B = (A \cup \overline{A})B = AB \cup \overline{A}B = A \cup \overline{A}B \qquad (A \subset B \text{ implies } AB = A)$$
$$\Pr[B] = \Pr[A] + \Pr[\overline{A}B] \ge \Pr[A] \qquad \text{(since } \Pr[\overline{A}B] \ge 0\text{)}$$

■
Probability is an increasing function of the outcomes in an event. Adding more outcomes to the event may cause the probability to increase, but it will not cause the probability to decrease. (It is possible the additional outcomes have zero probability. Then, the events with and without those additional outcomes have the same probability.)

Theorem 1.5: For any two events A and B,

$$\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[AB] \tag{1.5}$$

This theorem generalizes Axiom 3. It does not require A and B to be disjoint. If they are, then AB = ∅, and the theorem reduces to Axiom 3.
Proof: This proof uses a series of basic results from set arithmetic. First,

$$A = AS = A(B \cup \overline{B}) = AB \cup A\overline{B}$$

Second, for B,

$$B = BS = B(A \cup \overline{A}) = AB \cup \overline{A}B$$

and

$$A \cup B = AB \cup A\overline{B} \cup AB \cup \overline{A}B = AB \cup A\overline{B} \cup \overline{A}B$$

Thus,

$$\Pr[A \cup B] = \Pr[AB] + \Pr[A\overline{B}] + \Pr[\overline{A}B] \tag{1.6}$$

Similarly,

$$\Pr[A] + \Pr[B] = \Pr[AB] + \Pr[A\overline{B}] + \Pr[AB] + \Pr[\overline{A}B] = \Pr[A \cup B] + \Pr[AB]$$

Rearranging the last equation yields the theorem: Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[AB]. (Note that Equation 1.6 is a useful alternative to this theorem. In some applications, Equation 1.6 is easier to use than Equation 1.5.)

The following theorem is known as the inclusion-exclusion formula. It is a generalization of the previous theorem. Logically, the theorem fits in this section, but it is not a "small" theorem given the complexity of its statement.

Theorem 1.6 (Inclusion-Exclusion Formula): For any events A₁, A₂, . . . , Aₙ,

$$\Pr[A_1 \cup A_2 \cup \cdots \cup A_n] = \sum_{i=1}^{n} \Pr[A_i] - \sum_{i=1}^{n}\sum_{j=1}^{i-1} \Pr[A_i A_j] + \sum_{i=1}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{j-1} \Pr[A_i A_j A_k] - \cdots \pm \Pr[A_1 A_2 \cdots A_n]$$
This theorem is actually easier to state in words than in symbols: the probability of the union is the sum of all the individual probabilities minus the sum of all pair probabilities plus the sum of all triple probabilities, etc., until the last term, which is the probability of the intersection of all events.

Proof: This proof is by induction. Induction proofs consist of two major steps: the basis step and the inductive step. An analogy is climbing a ladder. To climb a ladder, one must first get on the ladder. This is the basis step. Once on, one must be able to climb from the (n − 1)-st step to the nth step. This is the inductive step. One gets on the first step, then climbs to the second, then climbs to the third, and so on, as high as one desires.

The basis step is Theorem 1.5. Let A = A₁ and B = A₂; then Pr[A₁ ∪ A₂] = Pr[A₁] + Pr[A₂] − Pr[A₁A₂].
The inductive step assumes the theorem is true for n − 1 events. Given the theorem is true for n − 1 events, one must show that it is then true for n events. The argument is as follows (though we skip some of the more tedious steps). Let A = A₁ ∪ A₂ ∪ ··· ∪ Aₙ₋₁ and B = Aₙ. Then, Theorem 1.5 yields

$$\Pr[A_1 \cup A_2 \cup \cdots \cup A_n] = \Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[AB]$$
$$= \Pr[A_1 \cup A_2 \cup \cdots \cup A_{n-1}] + \Pr[A_n] - \Pr[(A_1 \cup A_2 \cup \cdots \cup A_{n-1})A_n]$$

This last equation contains everything needed. The first term is expanded using the inclusion-exclusion theorem (which is true by the inductive assumption). Similarly, the last term is expanded the same way. Finally, the terms are regrouped into the pattern in the theorem's statement. But we skip these tedious steps.

While the inclusion-exclusion theorem is true (it is a theorem after all), it may be tedious to keep track of which outcomes are in which intersecting events. One alternative to the inclusion-exclusion theorem is to write the complicated event as a union of disjoint events.

EXAMPLE 1.6
Consider the union A ∪ B ∪ C. Write it as follows:

$$A \cup B \cup C = ABC \cup AB\overline{C} \cup A\overline{B}C \cup A\overline{B}\,\overline{C} \cup \overline{A}BC \cup \overline{A}B\overline{C} \cup \overline{A}\,\overline{B}C$$
$$\Pr[A \cup B \cup C] = \Pr[ABC] + \Pr[AB\overline{C}] + \Pr[A\overline{B}C] + \Pr[A\overline{B}\,\overline{C}] + \Pr[\overline{A}BC] + \Pr[\overline{A}B\overline{C}] + \Pr[\overline{A}\,\overline{B}C]$$

In some problems, the probabilities of these events are easier to calculate than those in the inclusion-exclusion formula. Also, note that it may be easier to use DeMorgan's first law (Eq. 1.1) and Theorem 1.2:

$$\Pr[A \cup B \cup C] = 1 - \Pr[\overline{A \cup B \cup C}] = 1 - \Pr[\overline{A}\,\overline{B}\,\overline{C}]$$
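Because the bookkeeping in the inclusion-exclusion formula is easy to get wrong by hand, a short brute-force check is useful. The Python sketch below is our own illustration (the events and pmf are made up); it computes the union probability both directly and with the alternating-sign sums of Theorem 1.6.

```python
from itertools import combinations

def union_prob_inclusion_exclusion(events, pmf):
    """Inclusion-exclusion: alternately add and subtract the
    probabilities of all k-wise intersections of the events."""
    total = 0.0
    for k in range(1, len(events) + 1):
        sign = (-1) ** (k + 1)          # +, -, +, ...
        for group in combinations(events, k):
            inter = set.intersection(*group)
            total += sign * sum(pmf[x] for x in inter)
    return total

# Three overlapping events on a fair die.
pmf = {k: 1/6 for k in range(1, 7)}
A, B, C = {1, 2, 3}, {2, 4, 6}, {3, 4, 5}

direct = sum(pmf[x] for x in A | B | C)
print(direct, union_prob_inclusion_exclusion([A, B, C], pmf))  # both 1.0
```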
EXAMPLE 1.7
Let us look at the various ways we can compute Pr[A ∪ B]. For the experiment, select one card from a well-shuffled deck. Each card is equally likely, with probability equal to 1/52. Let A = {card is a ♦} and B = {card is a Q}. The event A ∪ B is the event the card is a diamond or is a Queen (Q). There are 16 cards in this event (13 diamonds plus 3 additional Q's). The straightforward solution is

$$\Pr[A \cup B] = \Pr[\text{one of 16 cards selected}] = \frac{16}{52}$$

Using Theorem 1.5, AB = {card is a ♦ and card is a Q} = {Q♦}, so

$$\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[AB] = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52}$$

Using A ∪ B = AB ∪ ĀB ∪ AB̄, where

$$\overline{A}B = \{\text{card is not a ♦ and card is a Q}\} = \{Q♣, Q♥, Q♠\}$$
$$A\overline{B} = \{\text{card is a ♦ and card is not a Q}\} = \{2♦,3♦,4♦,5♦,6♦,7♦,8♦,9♦,10♦,J♦,K♦,A♦\}$$

Then,

$$\Pr[A \cup B] = \Pr[AB] + \Pr[\overline{A}B] + \Pr[A\overline{B}] = \frac{1}{52} + \frac{3}{52} + \frac{12}{52} = \frac{16}{52}$$

Or, using A ∪ B = A ∪ ĀB = AB̄ ∪ B,

$$\Pr[A \cup B] = \Pr[A] + \Pr[\overline{A}B] = \frac{13}{52} + \frac{3}{52} = \frac{16}{52}$$
$$\Pr[A \cup B] = \Pr[A\overline{B}] + \Pr[B] = \frac{12}{52} + \frac{4}{52} = \frac{16}{52}$$

Lastly, using DeMorgan's first law (Eq. 1.1),

$$\Pr[A \cup B] = 1 - \Pr[\overline{A} \cap \overline{B}] = 1 - \Pr[\text{not ♦ and not Q}] = 1 - \frac{36}{52} = \frac{16}{52}$$
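All five calculations in Example 1.7 can be confirmed by enumerating the 52 equally likely cards. A minimal Python sketch follows (our own, with an ad hoc card encoding).

```python
# Encode each card as (rank, suit); all 52 cards are equally likely.
ranks = ['2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K', 'A']
suits = ['club', 'diamond', 'heart', 'spade']
deck = [(r, s) for r in ranks for s in suits]

A = {c for c in deck if c[1] == 'diamond'}   # card is a diamond
B = {c for c in deck if c[0] == 'Q'}         # card is a Queen

p = lambda event: len(event) / len(deck)     # equally likely outcomes

print(p(A | B))                              # direct count: 16/52
print(p(A) + p(B) - p(A & B))                # Theorem 1.5
print(1 - p({c for c in deck if c not in A and c not in B}))  # DeMorgan
```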
Finally, we conclude this section with a useful inequality, called Boole's inequality or the union bound.

Theorem 1.7:

$$\Pr[A_1 \cup A_2 \cup \cdots \cup A_n] \le \Pr[A_1] + \Pr[A_2] + \cdots + \Pr[A_n] \tag{1.7}$$
Proof: This proof is inductive. The basis step is trivial since the theorem is trivially true for n = 1: Pr[A₁] ≤ Pr[A₁]. The inductive step is as follows: Assume the inequality is true for n − 1 events; that is, Pr[A₁ ∪ A₂ ∪ ··· ∪ Aₙ₋₁] ≤ Pr[A₁] + Pr[A₂] + ··· + Pr[Aₙ₋₁]. Then,

$$\Pr[A_1 \cup A_2 \cup \cdots \cup A_n] = \Pr[A_1 \cup A_2 \cup \cdots \cup A_{n-1}] + \Pr[A_n] - \Pr[(A_1 \cup \cdots \cup A_{n-1})A_n] \quad \text{(by Theorem 1.5)}$$
$$\le \Pr[A_1 \cup A_2 \cup \cdots \cup A_{n-1}] + \Pr[A_n] \quad \text{(by Axiom 1)}$$
$$\le \Pr[A_1] + \Pr[A_2] + \cdots + \Pr[A_n] \quad \text{(by the inductive hypothesis)}$$

■
1.8 COMPOUND EXPERIMENTS

So far, we have discussed experiments, outcomes, and events. In this section, we extend the discussion to compound experiments. Compound experiments have multiple parts.
For instance, one may flip a coin and then roll a die. The compound experiment is the combination of flipping the coin and rolling the die. Let the first experiment be denoted as E₁ with sample space S₁ and the second as E₂ with sample space S₂. Then, the compound experiment E = E₁ × E₂ has sample space S = S₁ × S₂. In the example above, the coin flip has sample space S₁ = {h, t}, and the roll of the die has sample space S₂ = {1,2,3,4,5,6}. The compound experiment has 12 outcomes in its sample space:

$$S = \{(h,1), (h,2), \ldots, (h,6), (t,1), (t,2), \ldots, (t,6)\}$$

The outcomes are tuples, such as (h,1) or (t,3).

The primary complication with compound experiments is determining what is an outcome and what is an event. Recall, outcomes are the elementary results of the experiment. In the example above, (h,1) is an outcome of the compound experiment. The event "the roll of the die is 2" is not elementary. It is made up of two outcomes, {(h,2), (t,2)}. Even though it is an outcome of E₂, "the roll of the die is 2" is not an outcome of the compound experiment E.
1.9 INDEPENDENCE
An important concept is independence. Two events, A and B, are independent if Pr[AB] = Pr[A]Pr[B]. Independence is an elementary but important concept: Pr[AB], Pr[A], and Pr[B] are numbers. A and B are independent if the number on the left, Pr[AB], is the product of the two numbers, Pr[A] and Pr[B], on the right (i.e., if Pr[AB] = Pr[A]Pr[B]).

A common error is to confuse independence with mutually exclusive. Recall, two events are mutually exclusive if AB = ∅, which implies Pr[AB] = 0. Two events are independent if the probabilities multiply (i.e., Pr[AB] = Pr[A]Pr[B]). These are separate properties. The only way mutually exclusive events can be independent is if Pr[A] = 0 or if Pr[B] = 0.

Probabilistic independence often arises from the physical independence of events and experiments. One often assumes that coin flips are physically independent. The occurrence of a head or a tail on the first flip does not affect the second flip, etc. Therefore, if A depends only on the first flip and B only on the second flip, we can assume probabilities on AB are the product of probabilities of A times those of B.

Independence is especially important for compound experiments. Consider a compound experiment, E = E₁ × E₂. Frequently, E₁ is physically independent from E₂; that is, whatever happens in E₁ does not affect whatever happens in E₂, and vice versa. Then, the probabilities of the outcomes of E are the product of the probabilities of the outcomes of each experiment.

In Chapter 2, we present an alternative interpretation of independence. If A and B are independent, then knowing whether B has occurred does not change the probability of A occurring, and vice versa. In other words, knowing B has occurred does not give any information about whether A has also occurred.
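To see concretely that "mutually exclusive" and "independent" are different properties, consider a fair die. The short Python sketch below (the event choices are our own illustration) shows a disjoint pair that is not independent and an overlapping pair that is.

```python
pmf = {k: 1/6 for k in range(1, 7)}
prob = lambda ev: sum(pmf[x] for x in ev)

A, B = {1, 2}, {3, 4}         # disjoint: AB is empty
print(prob(A & B), prob(A) * prob(B))  # 0.0 vs 1/9 -> NOT independent

A, B = {1, 2, 3}, {3, 4}      # overlapping events
print(prob(A & B), prob(A) * prob(B))  # 1/6 vs (1/2)(1/3) = 1/6 -> independent
```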
EXAMPLE 1.8
A coin with the probability of heads equal to p is flipped three times. The three flips are independent. What is the probability that all three flips are heads? Letting hhh denote the event with three consecutive heads,
$$\Pr[hhh] = \Pr[h]\Pr[h]\Pr[h] = p^3$$

What is the probability of getting two heads and one tail in any order? The event consists of three outcomes, hht, hth, and thh. The probabilities of these are pp(1 − p), p(1 − p)p, and (1 − p)pp, respectively. Adding them up gives 3p²(1 − p). Similarly, the probability of one head and two tails is 3p(1 − p)² and of three tails is (1 − p)³. Adding these up yields

$$p^3 + 3p^2(1-p) + 3p(1-p)^2 + (1-p)^3 = 1$$

Description        Outcomes         Probability
3 Heads            hhh              p³
2 Heads, 1 Tail    hht, hth, thh    3p²(1 − p)
1 Head, 2 Tails    htt, tht, tth    3p(1 − p)²
3 Tails            ttt              (1 − p)³
Note that this compound experiment has eight outcomes. There are two outcomes in the first experiment, two in the second, and two in the third. Multiplying these together, 2 · 2 · 2 = 8, gives the total number of outcomes.
Comment 1.6: In this example, the events are specified as two heads and one tail in any order. One should be careful in specifying the events: Does order matter or not? Is hht equivalent to hth? An easy way to make mistakes is to imprecisely define the events.
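The eight-outcome table in Example 1.8 can be generated by enumeration. Here is a minimal Python sketch (our own illustration; the value p = 0.6 is an arbitrary choice, not from the text) that groups the outcomes by the number of heads and confirms the four probabilities sum to 1.

```python
from itertools import product

p = 0.6  # arbitrary illustrative value for Pr[heads]

# All eight outcomes of three independent flips; probabilities multiply.
totals = {}
for seq in product('ht', repeat=3):
    prob = 1.0
    for flip in seq:
        prob *= p if flip == 'h' else (1 - p)
    heads = seq.count('h')
    totals[heads] = totals.get(heads, 0.0) + prob

print(totals)                # {3: p^3, 2: 3p^2(1-p), 1: 3p(1-p)^2, 0: (1-p)^3}
print(sum(totals.values()))  # 1.0
```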
1.10 EXAMPLE: CAN S COMMUNICATE WITH D?

To illustrate the use of the probability formulas presented so far, we consider a simple network, consisting of three links, a source node, and a destination node, as shown in Figure 1.4. The central question is whether the source can send a message to the destination. The source node is labeled S, the destination D, and the three links L₁, L₂, and L₃. We make two assumptions: First, each link works with probability p. Second, the links are independent of each other. Given these assumptions, what is the probability that S is connected to D?

This network might model a communications network. S is a source, and D is a destination. Each link might represent a wireless connection. The unnamed node is a relay node. Can S send a message to D? Or, the network might represent a collection of pipes. Each pipe works (is not clogged) with probability p. Can water flow from S to D?
FIGURE 1.4 Two-path network.
These networks also model reliability networks. Each link represents a task or a piece of the whole. Then, the whole "works" if L₃ works or if L₁ and L₂ work (or all three). We can solve this problem, like many probability problems, in a variety of ways. We demonstrate some of them here.
1.10.1 List All Outcomes

The most direct method to solve this problem is to list all outcomes, calculate the probability of each outcome, and add those probabilities of outcomes in the event. Since the network has three links and each can be in one of two states, there are eight outcomes. We can begin by listing them in a table, as in Table 1.1 below:

TABLE 1.1 Start of table of outcomes.

L1  L2  L3
 0   0   0
 0   0   1
 0   1   0
 0   1   1
 1   0   0
 1   0   1
 1   1   0
 1   1   1
Note that we use the convention that “0” represents “failed” and “1” represents “working” and that we list the outcomes in binary order, counting from 0 (= 000) to 7 (= 111). It is important to be neat and systematic in creating tables like this. Now, we need to calculate probabilities of each outcome. Since the links are independent, the probabilities multiply. Thus,
$$\Pr[000] = \Pr[\{L_1 = 0\} \cap \{L_2 = 0\} \cap \{L_3 = 0\}]$$
$$= \Pr[0]\Pr[0]\Pr[0] \qquad \text{(by independence)}$$
$$= (1-p)(1-p)(1-p) = (1-p)^3$$
$$\Pr[001] = \Pr[0]\Pr[0]\Pr[1] = (1-p)(1-p)p = p(1-p)^2$$

and so on. Now, add another column to the table, as shown in Table 1.2 below.

TABLE 1.2 Table of outcomes with probabilities.

L1  L2  L3   Pr[L1, L2, L3]
 0   0   0   (1 − p)³
 0   0   1   p(1 − p)²
 0   1   0   p(1 − p)²
 0   1   1   p²(1 − p)
 1   0   0   p(1 − p)²
 1   0   1   p²(1 − p)
 1   1   0   p²(1 − p)
 1   1   1   p³
The last remaining step is to identify which outcomes result in a working path from S to D and sum the probabilities:

$$\{S \to D\} = \{110, 001, 011, 101, 111\}$$
$$\Pr[S \to D] = \Pr[\{110, 001, 011, 101, 111\}]$$
$$= \Pr[110] + \Pr[001] + \Pr[011] + \Pr[101] + \Pr[111]$$
$$= p^2(1-p) + p(1-p)^2 + p^2(1-p) + p^2(1-p) + p^3$$
$$= p(1-p)^2 + 3p^2(1-p) + p^3$$
$$= p + p^2 - p^3 \tag{1.8}$$

This method always works but has one significant drawback: the table may get really big. In networks like this, the number of rows in the table is exponential in the number of links (8 = 2³). If the network had 10 links, the table would have 1024 rows, etc. In some problems, the number of outcomes is infinite. The table then cannot be written down, and other methods must be employed.
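The list-all-outcomes method is exactly the kind of exhaustive bookkeeping a computer does well. Below is a minimal Python sketch (our illustration, not from the text; the value p = 0.3 is arbitrary) that enumerates the eight outcomes of Table 1.2 and checks Equation 1.8.

```python
from itertools import product

p = 0.3  # arbitrary illustrative link reliability

total = 0.0
for L1, L2, L3 in product((0, 1), repeat=3):
    # Links are independent, so the outcome probability is a product.
    prob = 1.0
    for link in (L1, L2, L3):
        prob *= p if link else (1 - p)
    # S reaches D if the serial pair L1, L2 works or if L3 works.
    if (L1 and L2) or L3:
        total += prob

print(total)               # enumeration over all eight outcomes
print(p + p**2 - p**3)     # Equation 1.8 -- the two values agree
```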
1.10.2 Probability of a Union

Let us introduce some notation. Let A be the event that the upper path, consisting of the serial connection of L1 and L2, works, and B the event that the bottom path, consisting of the single link L3, works. (In Figure 1.4, A is the top path and B is the bottom path.)
Furthermore, we will use the convention that if link Li works, then Li = 1, and if Li has failed, then Li = 0. Thus,

Pr[A] = Pr[L1 = 1 ∩ L2 = 1]              (both must work)
      = Pr[L1 = 1] Pr[L2 = 1]            (by independence)

Pr[B] = Pr[L3 = 1]

The event S → D can be written as {A ∪ B}. Thus,

Pr[S → D] = Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[AB]

Look at each event. A requires both L1 = 1 and L2 = 1, B requires only L3 = 1, while AB requires L1 = 1, L2 = 1, and L3 = 1. Thus,

Pr[A] = p²        Pr[B] = p        Pr[AB] = p³

Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[AB] = p + p² − p³
This sequence arises often. Rather than calculate a big table (with possibly an infinite number of rows), one divides the problem into pieces, solves each piece, and then combines the results to solve the overall problem.

EXAMPLE 1.9

A common method to solve these problems is to use Boolean algebra. Write the event {S → D} as L1L2 ∪ L3. (Recall, we use ∪ for logical OR.) Then, use probability properties as above:

Pr[S → D] = Pr[L1L2 ∪ L3]
          = Pr[L1L2] + Pr[L3] − Pr[L1L2L3]
          = p² + p − p³
1.10.3 Probability of the Complement

In some problems, it is easier to calculate the probability of the complement, or in this case, the probability that S is not connected to D. Using the second of DeMorgan’s laws (Equation 1.2) results in a simple solution:

Pr[A ∪ B] = 1 − Pr[(A ∪ B)ᶜ]
          = 1 − Pr[Aᶜ ∩ Bᶜ]              (DeMorgan’s law)
          = 1 − Pr[Aᶜ] Pr[Bᶜ]            (by independence)
          = 1 − (1 − p²)(1 − p)          (since Pr[A] = p² and Pr[B] = p)
          = p + p² − p³
In summary, we have solved this problem three ways: first, by calculating the probability of each outcome and summing those probabilities corresponding to outcomes in the event of interest; second, by breaking the event into simpler events and calculating the probability of the union; and third, by calculating the probability of the complement. It is important to become familiar with each of these methods, as it is often not obvious which will be simplest.
1.11 EXAMPLE: NOW CAN S COMMUNICATE WITH D?

In this example, we have added a third path, C, consisting of three more links, L4, L5, and L6, as shown in Figure 1.5. As before, the links are independent, and each works with probability p.

FIGURE 1.5 Three-path network: the top path L1–L2 and the middle path L3 of Figure 1.4, plus a new bottom path consisting of L4, L5, and L6 in series.
Let us answer the question with the same three methods as before.
1.11.1 A Big Table

There are six links. Therefore, the table has 2⁶ = 64 rows. Since the problem is more complicated, we add a few more columns to simplify the listing of outcomes and to denote whether or not there is a path. The first column, labeled a, gives a label to each outcome. The columns labeled A, B, and C are logical values denoting whether or not each event is true (i.e., whether the corresponding path works). The last column is the logical OR of the three previous columns. The table begins as shown in Table 1.3 below.

TABLE 1.3 First few rows of a big table.

a    L1  L2  L3  L4  L5  L6   Pr[L1 ∩ L2 ∩ ··· ∩ L6]   A  B  C  A ∪ B ∪ C
a0    0   0   0   0   0   0   (1 − p)⁶                  0  0  0      0
a1    0   0   0   0   0   1   p(1 − p)⁵                 0  0  0      0
a2    0   0   0   0   1   0   p(1 − p)⁵                 0  0  0      0
a3    0   0   0   0   1   1   p²(1 − p)⁴                0  0  0      0
a4    0   0   0   1   0   0   p(1 − p)⁵                 0  0  0      0
a5    0   0   0   1   0   1   p²(1 − p)⁴                0  0  0      0
a6    0   0   0   1   1   0   p²(1 − p)⁴                0  0  0      0
a7    0   0   0   1   1   1   p³(1 − p)³                0  0  1      1
a8    0   0   1   0   0   0   p(1 − p)⁵                 0  1  0      1
a9    0   0   1   0   0   1   p²(1 − p)⁴                0  1  0      1
The probability that a path exists from S to D can be calculated as follows:

Pr[S → D] = Pr[all outcomes resulting in a path]
          = Pr[{a7, a8, a9, and others}]
          = Pr[a7] + Pr[a8] + Pr[a9] + ··· + Pr[a63]
          = p³(1 − p)³ + p(1 − p)⁵ + p²(1 − p)⁴ + ··· + p⁶
          = p(1 − p)⁵ + 6p²(1 − p)⁴ + 14p³(1 − p)³ + 15p⁴(1 − p)² + 6p⁵(1 − p) + p⁶
          = p + p² − p⁴ − p⁵ + p⁶
This method works—as it always does if the number of rows is finite—but the table is practically unmanageable for a hand calculation. A computer, however, could deal with a table much larger than this one.
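Indeed, a short Python sketch (ours, again with an assumed value of p) can enumerate all 64 rows and confirm the closed-form answer:

from itertools import product

p = 0.7  # assumed link probability

total = 0.0
for links in product([0, 1], repeat=6):   # the 64 rows of Table 1.3
    l1, l2, l3, l4, l5, l6 = links
    prob = 1.0
    for link in links:                    # independent links
        prob *= p if link else (1 - p)
    a = l1 and l2                         # top path
    b = l3                                # middle path
    c = l4 and l5 and l6                  # bottom path
    if a or b or c:
        total += prob

print(total)                              # computed by enumeration
print(p + p**2 - p**4 - p**5 + p**6)      # closed form from the text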
1.11.2 Break Into Pieces

As in the previous example, breaking the problem into pieces will result in a simple and efficient solution method. Let A denote the top path, B the middle path, and C the bottom path. For C to work, we need L4 = 1, L5 = 1, and L6 = 1:

Pr[C] = Pr[L4 = 1 ∩ L5 = 1 ∩ L6 = 1]
      = Pr[L4 = 1] Pr[L5 = 1] Pr[L6 = 1]
      = p³

Since we have already solved the simpler network, we can use that result. Let G = A ∪ B. Thus,

Pr[S → D] = Pr[A ∪ B ∪ C]
          = Pr[G ∪ C]                    (using G = A ∪ B)
          = Pr[G] + Pr[C] − Pr[GC]

The events C and G are independent since C depends only on links L4, L5, and L6, while G = A ∪ B depends only on links L1, L2, and L3. Thus, Pr[GC] = Pr[G] Pr[C] = p³(p + p² − p³), and

Pr[S → D] = p + p² − p³ + p³ − p³(p + p² − p³) = p + p² − p⁴ − p⁵ + p⁶
As expected, this sequence yields an easy and quick solution.
1.11.3 Probability of the Complement

As with the simpler network, finding the probability of the complement is quick and easy. For the network to fail, all the paths must fail. Thus, we obtain a version of the first of DeMorgan’s laws (Equation 1.1) for three events:

(A ∪ B ∪ C)ᶜ = Aᶜ Bᶜ Cᶜ

Pr[(A ∪ B ∪ C)ᶜ] = Pr[Aᶜ Bᶜ Cᶜ]
                 = Pr[Aᶜ] Pr[Bᶜ] Pr[Cᶜ]                      (independence)
                 = (1 − Pr[A])(1 − Pr[B])(1 − Pr[C])
                 = (1 − p²)(1 − p)(1 − p³)

Thus, Pr[A ∪ B ∪ C] = 1 − Pr[(A ∪ B ∪ C)ᶜ] = 1 − (1 − p²)(1 − p)(1 − p³). One point to remember is that computing the complement may be easier than computing the original event.
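The polynomial expansions above are easy to verify symbolically. One possible check using the SymPy library (an assumption of ours; any symbolic package would do):

import sympy as sp

p = sp.symbols('p')

# Two-path network (Section 1.10.3)
print(sp.expand(1 - (1 - p**2) * (1 - p)))               # p + p**2 - p**3
# Three-path network (this section)
print(sp.expand(1 - (1 - p**2) * (1 - p) * (1 - p**3)))  # p + p**2 - p**4 - p**5 + p**6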
1.12 COMPUTATIONAL PROCEDURES

Many problems get too large for hand calculation. Computational software can be employed to facilitate calculations and experiments.

A good first place to start is with a spreadsheet program, such as Microsoft Excel. Spreadsheets are useful for organizing data and doing routine calculations. Simple plots can be made quickly. More complicated calculations and analyses require a numerical software package. In this text, we refer to three packages: Matlab, Python, and R. Many students have some familiarity with Matlab or Python, but probably not R. The three packages have their own advantages and disadvantages.

Matlab (http://www.mathworks.com/) is a commercial numerical computation package sold by Mathworks. It is widely used in engineering and scientific calculations. Matlab has an extensive set of libraries and considerable documentation, both official and user-generated. The open source package Octave (http://www.gnu.org/software/octave/) is a re-implementation of many of the features of Matlab and can be useful when Matlab is unavailable.

Python (https://www.python.org/) is an open source language that is popular for a variety of uses. By itself, Python’s numerical capabilities are limited. However, many libraries exist that expand Python into an excellent numerical package. The following libraries are needed:

• Numpy gives Python Matlab-like vectors and matrices.
• Matplotlib (also known as Pylab) gives Python Matlab-like plotting routines.
• Scipy has numerous functions. We are mostly interested in the Scipy.stats library. It provides many probability distributions and is useful for Monte Carlo experiments (using computers to simulate randomness).
• Ipython provides a better Python interpreter and a useful notebook (web browser) interface.
• Pandas gives Python a data-handling library. If your data looks like a table (with named columns and rows), Pandas can be handy.
• Statsmodels provides a modern set of model-fitting routines.

The reader is urged to consider installing one of the Python distributions that include many libraries. Canopy (https://www.enthought.com/products/canopy/) and Anaconda (https://store.continuum.io/cshop/anaconda/) are two popular distributions. Another alternative is the Sage (http://www.sagemath.org/) computational environment, which uses Python as its core language.

The syntax for Matlab and Python is similar. Most of our computational examples are given in either Matlab or Python, or sometimes both. We assume readers can translate the code to whichever package they may be using.

R (http://www.r-project.org/) is an open source data analysis and statistics package. It began as a recreation of the commercial package S and has been expanded with many additional libraries. R is widely used in many universities. We assume most readers are unfamiliar with R and defer most examples using this package until the later chapters.
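As a small taste of these tools, the following Python sketch (ours; it assumes Numpy and Scipy are installed) runs a simple Monte Carlo experiment with the Scipy.stats library, estimating the probability of heads for a fair coin:

from scipy import stats

# Simulate 10,000 flips of a fair coin and estimate Pr[heads].
flips = stats.bernoulli.rvs(p=0.5, size=10_000, random_state=1)
print(flips.mean())   # close to 0.5; the estimate improves as size grows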
SUMMARY

This chapter has introduced probability. We do an experiment. The experiment has a set of possible outcomes, called the sample space. After doing the experiment, one outcome will be true, while all others will be false. Events are sets of outcomes. Any event containing the true outcome is true, while any event without the true outcome is false.

Each outcome has a certain probability of being true. The probability is a measure of the likelihood of the outcome. The probability of an event is the sum of the probabilities of each outcome in the event. All probabilities are between 0 and 1. Probabilities of 0 are for events that never occur, while probabilities of 1 are for events that always occur.

Probabilities obey several fundamental relations:
• The probability of the empty set is 0, or Pr[∅] = 0.
• The probability of the sample space is 1, or Pr[S] = 1.
• All probabilities are between 0 and 1, or 0 ≤ Pr[A] ≤ 1.
• If events A and B are independent, Pr[AB] = Pr[A] Pr[B].
• The probability of a complement is Pr[Aᶜ] = 1 − Pr[A].
• Probabilities obey DeMorgan’s laws: Pr[(A ∪ B)ᶜ] = Pr[Aᶜ ∩ Bᶜ] and Pr[(AB)ᶜ] = Pr[Aᶜ ∪ Bᶜ].
• For any events A and B,

Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[AB]
          = Pr[ABᶜ] + Pr[AᶜB] + Pr[AB]
          = Pr[A] + Pr[AᶜB]
          = Pr[ABᶜ] + Pr[B]
          = 1 − Pr[Aᶜ ∩ Bᶜ]
PROBLEMS

1.1 Assume 0 ≤ p ≤ 1. Which of the following are valid probabilities, and if not, why not?
a. e^(−p)
b. e^p
c. cos(2πp)
d. p² − 2p + 1
e. 3p − 2

1.2 Assume 0 ≤ p ≤ 1. Which of the following are valid probabilities, and if not, why not?
a. p
b. 1/p
c. (p − 1)³
d. sin(πp)
e. 4p(1 − p)

1.3 Zipf’s Law says the probability of the kth most popular word in English (and other languages) is proportional to 1/k for k = 1, 2, 3, . . .. For example, the most common word in English is “the” and occurs about 7% of the time, the second most common word “of” occurs about half as frequently as “the,” and the third most common word “and” occurs a bit more than a third as often as “the,” etc.
a. Why is Zipf’s Law not a valid probability law?
b. List two ways you might modify Zipf’s Law to keep its essential behavior but make it a valid probability law.
1.4 You have a fair coin (probability of heads is 0.5) and flip it twice (with the flips independent).
a. What is the probability of getting two heads?
b. What is the probability of getting exactly one head?
c. What is the probability of getting no heads?
1.5 A fair coin is flipped until a head appears. Let the number of flips required be denoted N (the head appears on the Nth flip). Assume the flips are independent. Let the outcomes be denoted by k for k = 1, 2, 3, . . .. The event N = k means exactly k flips are required. The event N ≥ k means at least k flips are required.
a. How many outcomes are there?
b. What is Pr[N = k], i.e., the probability of a sequence of k − 1 tails followed by a head? (Hint: write a general expression for Pr[N = k] for any k = 1, 2, 3, . . ..)
c. Show the probabilities sum to 1; i.e., Σ_{k=1}^∞ Pr[N = k] = 1.
d. What is Pr[N ≥ l] for all l ≥ 1?
e. What is Pr[N ≤ l] for all l ≥ 1?
f. Do the answers to the previous two parts sum to 1? Should they?

1.6 This problem extends the previous one. Now, assume the coin comes up heads with probability p, 0 ≤ p ≤ 1.
a. What is Pr[N = k]?
b. Show the probabilities sum to 1; i.e., Σ_{k=1}^∞ Pr[N = k] = 1.
c. For p = 0.25, plot the probabilities Pr[N = k] for k = 1, 2, 3, . . ..

1.7 An ordinary deck of playing cards (52 cards, in 4 suits and 13 ranks) is shuffled, and one card is randomly chosen (all cards are equally likely).
a. What is Pr[♣]?
b. What is Pr[Q]?
c. What is Pr[Q♣]?
d. What is Pr[Q ∪ ♣]?
1.8 An ordinary six-sided die is rolled. Each side is equally likely to appear up. Let A = {1,2,3}, B = {2,4,6}, and C = {2,3,4,5}. Calculate the probabilities of the following events:
a. A, B, C
b. AB, AC, BC, ABC
c. A ∪ B, A ∪ C, B ∪ C, A ∪ B ∪ C, using both ways as in Example 1.4
d. (AB)ᶜ and (A ∪ B)ᶜ, directly and using DeMorgan’s laws

1.9 An ordinary six-sided die is rolled. Each side is equally likely to appear up. Let A = {1,2}, B = {4,6}, and C = {2,3,4,5,6}. Calculate the probabilities of the following events:
a. A, B, C
b. AB, AC, BC, ABC
c. A ∪ B, A ∪ C, B ∪ C, A ∪ B ∪ C, using both ways as in Example 1.4
d. (AB)ᶜ and (A ∪ B)ᶜ, directly and using DeMorgan’s laws
1.10 Assume a loaded die is rolled (a loaded die is weighted so that the sides are not equally likely). Let N refer to the side that is up, and let Pr[N = k] = ck for a constant c and k = 1, 2, 3, 4, 5, 6. That is, Pr[N = 2] = 2c, Pr[N = 3] = 3c, etc.
a. What is the value of c? (Hint: the probabilities must sum to 1.)
b. Plot the probabilities.
c. Using A, B, and C in Problem 1.8, what are Pr[A], Pr[B], Pr[A ∪ B], and Pr[AB]?

1.11 A four-sided die is rolled. Each side is equally likely. Let A = {1,2}, B = {2,3}, and C = {1,4}. Calculate the following:
a. Pr[A], Pr[B], Pr[C]
b. Pr[AB], Pr[A ∪ B]
c. Pr[ABC], Pr[A ∪ B ∪ C]

1.12 Repeat Problem 1.11, but with the unequal probabilities Pr[1] = 0.4, Pr[2] = 0.3, Pr[3] = 0.2, and Pr[4] = 0.1.

1.13 Figure 1.3 shows a Venn diagram of A ∪ B ∪ C. Fill in the missing labels.

1.14 Show that if A and B are independent, then so are A and Bᶜ, Aᶜ and B, and Aᶜ and Bᶜ.

1.15 Verify that the probabilities of all eight outcomes in Table 1.2 sum to 1.0.

1.16 Verify that the answer to Equation (1.8) in Section 1.10, p + p² − p³, is between 0 and 1 for all values of p between 0 and 1.

1.17 Fill in the remaining 64 − 10 = 54 rows in Table 1.3. (You may want to use spreadsheet software or write a short program to output the table.)

1.18 The graph in Figure 1.6 is the same as Figure 1.4 except that the links have different probabilities of working. Repeat the sequence of solutions in Section 1.10. Make sure your answers agree with Section 1.10 when p1 = p2 = p3 = p.
FIGURE 1.6 Two-path network with different probabilities for each link: L1 (probability p1) and L2 (probability p2) in series on top, and L3 (probability p3) on the bottom.
1.19 Why is it that when you throw a rock at a picket fence, you almost always hit a picket? Assume the pickets are three inches (7.5 cm) wide, the gap between pickets is also three inches (7.5 cm), and the rock is one inch (2.5 cm) in diameter. Assume you can throw the rock accurately enough to hit the fence but not so accurately as to control where it hits. What is a reasonable estimate of the probability that the rock passes through a gap between pickets without hitting a picket?
1.20 In the picket fence problem above, assume the pickets are two inches (5.0 cm) wide and the gap also two inches (5.0 cm). Now, what is the probability a rock one inch in diameter passes through a gap without hitting a picket? 1.21 List three real-life experiments and an event in each experiment where the probabilities of that event are: a. greater than 0 but less than 0.1, b. approximately 0.5. 1.22 The U.S. National Weather Service issues Probability of Precipitation (PoP) reports. A report might indicate a 30% chance of rain. a. What does this mean? b. What is the PoP definition of the following terms: slight chance, chance, likely? 1.23 A comic strip made the following statements:
• “There’s a 60% chance of rain on Saturday” (denoted S1).
• “There’s a 40% chance of rain on Sunday” (denoted S2).
• “Therefore, there’s a 100% chance of rain this weekend” (denoted S3).
a. Given that S1 and S2 are true, can the third statement, S3, be true? Why or why not?
b. Given S1 and S2 are true, what is the smallest possible probability of rain this weekend, and what is the largest possible probability of rain this weekend?
c. Explain how the lower and upper bounds in part b can be achieved.
d. If S1 and S2 are independent, what is the probability of rain this weekend?

1.24 Bob is painting Alice’s house and has just finished a first coat. Alice notices that Bob missed 2% of the surface area. Bob says, “Don’t worry. Probabilities multiply. After two coats, there will be only 2% × 2% = 0.04% missing area. You won’t notice it.” Why is Alice skeptical of Bob’s calculation?
CHAPTER 2

CONDITIONAL PROBABILITY
Conditional probability represents the idea that partial information may be available. How does this partial information affect probabilities and decisions one might make? For instance, assume one person is a professional basketball player and the other is a professional jockey. You need to guess whether person A or person B is the jockey. If no other information is given, then you are likely to say the probability that person A is the jockey is 0.5. If, however, you are told person A is seven feet tall, while person B is five feet tall, then your guess is likely to change. Person A is most likely to be the basketball player and person B the jockey. This could be wrong—it is still a guess after all—but it is likely to be true. In this chapter, we define and quantify conditional probability.
2.1 DEFINITIONS OF CONDITIONAL PROBABILITY

In many experiments, one might have partial information about which outcome is true (or which outcomes are false). This information may lead to updated probabilities and therefore changed decisions. Let A be the event in which one is interested and B the partial information (i.e., B is true). Then, the conditional probability of A given B is written as Pr[A | B]. Mathematically, the formula for conditional probability is

Pr[A | B] = Pr[AB] / Pr[B]       when Pr[B] > 0                   (2.1)

When Pr[B] = 0, Pr[AB] = 0 since Pr[AB] ≤ Pr[B]. The ratio is 0/0 and is undefined.

The conditional probability formula can be rearranged to yield a useful formula for the joint probability:

Pr[AB] = Pr[A | B] Pr[B]
One interpretation of conditional probabilities is that the sample space changes from S to B. The probabilities of each outcome in the new sample space must be renormalized (divided by Pr[B]) to ensure that the probabilities sum to 1.

Conditional probabilities are not, in general, symmetric. It is usually true that Pr[A | B] ≠ Pr[B | A], as

Pr[A | B] = Pr[AB] / Pr[B]       while       Pr[B | A] = Pr[AB] / Pr[A]

One can see the conditional probabilities are not equal unless Pr[A] = Pr[B] or Pr[AB] = 0.

Consider a simple example of the roll of a fair die, and let A = {1,2} and B = {2,4,6}. The event AB = {2} has probability Pr[AB] = 1/6. The conditional probability of A given B and of B given A are as follows:

Pr[A | B] = Pr[AB] / Pr[B] = (1/6) / (3/6) = 1/3
Pr[B | A] = Pr[AB] / Pr[A] = (1/6) / (2/6) = 1/2
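Small finite examples like this can be checked directly by counting outcomes. A possible Python sketch (ours, using exact fractions):

from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}          # fair die: each outcome has probability 1/6
A = {1, 2}
B = {2, 4, 6}

def pr(event):
    return Fraction(len(event), len(omega))

print(pr(A & B) / pr(B))   # Pr[A | B] = 1/3
print(pr(A & B) / pr(A))   # Pr[B | A] = 1/2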
Note that the two probabilities are different.

Figure 2.1 illustrates the conditional probability of A given B. The intersection AB is shaded, and the circle for event B is bold. The conditional probability is the probability of the intersection divided by the probability of B.

FIGURE 2.1 A Venn diagram of Pr[A | B] = Pr[AB] / Pr[B] = Pr[AB] / (Pr[AB] + Pr[AᶜB]). The new sample space is B, and only the portion of A that intersects B is relevant.
Conditional probabilities are probabilities. They satisfy all the axioms and theorems of probability. For instance, the three axioms are the following:

1. Pr[A | B] ≥ 0. This is obvious since Pr[AB] ≥ 0 and Pr[B] ≥ 0.
2. Pr[S | B] = Pr[B | B] = 1. The new sample space is B.
3. If A and C are disjoint given B, then Pr[A ∪ C | B] = Pr[A | B] + Pr[C | B]. This axiom is more complicated because the condition of disjointedness is more complicated: A and C must be disjoint given B. It is not necessary that A and C be disjoint. They can intersect, but they can only do so outside of B. The condition is

(AB) ∩ (CB) = ABC = ∅
A Venn diagram showing events A, B, and C satisfying these properties would show A and C overlapping each other only outside of B. If this is true, the third axiom is satisfied:

Pr[A ∪ C | B] = Pr[(A ∪ C)B] / Pr[B]
             = Pr[AB ∪ CB] / Pr[B]
             = Pr[AB] / Pr[B] + Pr[CB] / Pr[B]
             = Pr[A | B] + Pr[C | B]
Since the axioms hold, so do all the theorems presented in Section 1.7. In a Venn diagram of Pr[A ∪ C | B], the new sample space is B, and only the portions of A and C intersecting B are relevant.
EXAMPLE 2.1

Roll an ordinary six-sided die, and observe the number on top. Assume each number is equally likely. Let A = {1,2,3}, B = {3,4,5,6}, and C = {1,5,6}. Then, AB = {3}, BC = {5,6}, and ABC = ∅. Thus,

Pr[A ∪ C | B] = Pr[A | B] + Pr[C | B] = Pr[AB]/Pr[B] + Pr[CB]/Pr[B]
             = (1/6)/(4/6) + (2/6)/(4/6) = 1/4 + 2/4 = 3/4

Of course, the probability can be calculated directly. A ∪ C = {1,2,3,5,6}, and (A ∪ C)B = {3,5,6}. Thus,

Pr[A ∪ C | B] = Pr[(A ∪ C)B] / Pr[B] = (3/6)/(4/6) = 3/4
If A and B are independent, then by definition Pr[AB] = Pr[A] Pr[B]. In this case,

Pr[A | B] = Pr[AB] / Pr[B] = Pr[A] Pr[B] / Pr[B] = Pr[A]

In words, if A and B are independent, then the conditional probability of A given B is just Pr[A]. Knowing B is true does not change the probability of A being true when A and B are independent.
Comment 2.1: When A and B are independent, Pr[AB] = Pr[A] Pr[B]. When A and B are not independent (i.e., dependent), we can write Pr[AB] = Pr[A | B] Pr[B]. Thus, we see conditional probabilities give us a way to generalize the independence formula to dependent events.
Conditional probabilities extend as expected to more than two events. Here is an example of the chain rule for conditional probabilities:

Pr[ABC] = (Pr[ABC]/Pr[BC]) (Pr[BC]/Pr[C]) Pr[C] = Pr[A | BC] Pr[B | C] Pr[C]     (2.2)
In summary, conditional probability captures the notion of partial information. Given that event B is true, Equation (2.1) gives the adjusted probability of A. Since conditional probability obeys the three axioms of probability, all probability theorems also apply to conditional probability. If A and B are independent, Pr[A | B] = Pr[A]. The chain rule can be helpful for simplifying many calculations.
2.2 LAW OF TOTAL PROBABILITY AND BAYES THEOREM

Two important theorems concerning conditional probability are the law of total probability (LTP) and Bayes’ Theorem. The law of total probability is handy for calculating probabilities of complicated events. Bayes’ Theorem relates forward and reverse conditional probabilities.
Theorem 2.1 (LTP): Let the Bi for i = 1, 2, . . . be a partition of S; that is, BiBj = ∅ for i ≠ j and B1 ∪ B2 ∪ ··· = S. Then,

Pr[A] = Σ_{i=1}^∞ Pr[A | Bi] Pr[Bi] = Σ_{i=1}^∞ Pr[ABi]

Proof:

Pr[A] = Pr[AS]                              (A = AS)
      = Pr[A(B1 ∪ B2 ∪ ···)]                (S = B1 ∪ B2 ∪ ···)
      = Σ_{i=1}^∞ Pr[ABi]                   (the Bi are disjoint)
      = Σ_{i=1}^∞ Pr[A | Bi] Pr[Bi]         (Pr[ABi] = Pr[A | Bi] Pr[Bi])      ■
The LTP is useful in numerous problems for calculating probabilities. It is often the case that A is a complicated event, but dividing the experiment into pieces, represented by the Bi, makes A simpler. One then calculates the probability of A for each piece (i.e., Pr[A | Bi]), weighted by the probability of each piece (i.e., Pr[Bi]).

An analogy might be helpful. Consider the task of measuring the fraction of a painting that is red. One way to do this is to cut the painting into many pieces (perhaps like a jigsaw puzzle) and measure the fraction of red within each piece. Then, the total fraction is determined by summing the individual fractions, each weighted by the fractional area of the piece. If one had the freedom, one would try to select the pieces so that measuring the fraction of red within each piece would be easy. Perhaps some would be all red, others without any red. The same holds true in probabilities. If one has the luxury, choose the Bi to make the task of computing Pr[A | Bi] easy.

Bayes’ theorem relates a conditional probability Pr[A | B] with its reverse Pr[B | A].

Theorem 2.2 (Bayes’ Theorem):

Pr[B | A] = Pr[A | B] Pr[B] / Pr[A]                                   (2.3)

Proof: For such a useful theorem, the proof is surprisingly simple:

Pr[B | A] = Pr[AB] / Pr[A]                            (by definition)
          = (Pr[AB] / Pr[A]) · (Pr[B] / Pr[B])        (multiply by 1)
          = Pr[A | B] Pr[B] / Pr[A]                   (rearrange the terms)     ■
The LTP and Bayes’ Theorem are often combined as follows:

Pr[Bk | A] = Pr[A | Bk] Pr[Bk] / Σ_{i=1}^∞ Pr[A | Bi] Pr[Bi]
Comment 2.2: We use the LTP numerous times throughout the text. It and Bayes’ Theorem must be understood thoroughly.
2.3 EXAMPLE: URN MODELS

The classic example of the LTP is the urn model. Urn models are not engineering, nor are they scientific. However, they are simple and easy to comprehend. In deference to three centuries of probability theory, let us consider a simple urn model.

Urn models consist of a number of urns (jars or similar vessels) containing objects (e.g., marbles). The experimentalist selects an urn by some random mechanism and then selects one or more marbles. For instance, assume there are two urns with the following numbers of red and blue marbles:

Urn   Red   Blue
 1     5     5
 2     2     4

Now, consider an experiment: An urn is randomly selected. Let U1 be the event Urn 1 is selected and U2 the event Urn 2 is selected. Assume the selection probabilities are as follows:

Pr[U1] = 2/3        Pr[U2] = 1/3

After selecting an urn, a marble from that urn is selected (all marbles are equally likely). What is the probability the marble is red?

The probability calculation is tricky because the two urns have different numbers of marbles and, therefore, different probabilities of red and blue marbles, and it is not known in advance which urn is selected. However, the LTP lets us simplify the calculation by assuming we know which urn is selected. We calculate the probability for each urn and combine the results. Let R denote the event the selected marble is red, which gives

Pr[R] = Pr[R | U1] Pr[U1] + Pr[R | U2] Pr[U2]
      = (5/(5+5))(2/3) + (2/(2+4))(1/3) = 4/9 = 0.444
Now, let us change the experiment. Select an urn, select a marble and remove it from the urn, and then select another marble from the same urn. Let R1 be the event the first marble is red and R2 the event the second marble is red. What is the probability the second marble is red given the first marble is red?

Pr[R2 | R1] = Pr[R1R2] / Pr[R1]
            = (Pr[R1R2 | U1] Pr[U1] + Pr[R1R2 | U2] Pr[U2]) / Pr[R1]

We calculated the denominator above, Pr[R1] = 4/9, but the numerator is new. To continue, we need to evaluate the conditional probabilities. Using the chain rule for conditional probability (Equation 2.2), these probabilities can be simplified as follows:

Pr[R1R2 | U1] = Pr[R1R2U1] / Pr[U1]
              = Pr[R2 | R1U1] Pr[R1 | U1] Pr[U1] / Pr[U1]
              = Pr[R2 | R1U1] Pr[R1 | U1]

We recognize the middle step as the chain rule. The first term, Pr[R2 | R1U1], is the probability the second marble is red given the first marble is red and given the selection is from U1. U1 started with five red and five blue marbles. Now it has four red and five blue marbles. Therefore, this probability is 4/9. The second term, Pr[R1 | U1], is the probability the first marble selected from U1 is red. This is 5/10. Similarly, Pr[R2 | R1U2] is the probability the second marble chosen from U2 is red given the first marble was red. This is 1/5. Pr[R1 | U2] is the probability the first marble selected from Urn 2 is red, or 2/6. Combining everything,

Pr[R2 | R1] = ((4/9)(5/10)(2/3) + (1/5)(2/6)(1/3)) / (4/9) = 0.383

This probability is less than Pr[R1] because the second marble is less likely than the first to be red: one less red marble is available to be selected.

Finally, let us change the experiment one more time. Select an urn, and then select a marble and remove it from the urn. Repeat: select an urn independently of the first selection (it could be the same as the first or it could be different), and then select a marble. Unlike the previous experiment, the two urns could be different. Let V11 be the event Urn 1 is selected twice, V12 the event Urn 1 is selected first and Urn 2 second, V21 the event Urn 2 is selected first and Urn 1 second, and finally, V22 the event Urn 2 is selected twice. The probabilities of these urn selections are as follows:
Pr[V11] = Pr[U1U1] = Pr[U1] Pr[U1] = 4/9
Pr[V12] = Pr[U1U2] = Pr[U1] Pr[U2] = 2/9
Pr[V21] = Pr[U2U1] = Pr[U2] Pr[U1] = 2/9
Pr[V22] = Pr[U2U2] = Pr[U2] Pr[U2] = 1/9
As usual, we use the LTP to simplify the probability calculation:

Pr[R1R2] = Pr[R1R2 | V11] Pr[V11] + Pr[R1R2 | V12] Pr[V12]
         + Pr[R1R2 | V21] Pr[V21] + Pr[R1R2 | V22] Pr[V22]
         = (5/10)(4/9)(4/9) + (5/10)(2/6)(2/9) + (2/6)(5/10)(2/9) + (2/6)(1/5)(1/9)
         = 0.180

Pr[R2 | R1] = Pr[R1R2] / Pr[R1] = 0.180 / (4/9) = 0.406

This probability, 0.406, is larger than the previous probability, 0.383, because the same urn may not be chosen for both selections. A missing marble from U1 does not affect U2 probabilities, and vice versa.

Mathematicians have studied urn models for three centuries. Although they appear simplistic, urns and marbles can model numerous real experiments. Their study has led to many advances in our knowledge of probability.
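Urn experiments like these are natural candidates for simulation. A possible Python sketch (ours; the expected values in the comments are the ones computed above):

import random

rng = random.Random(1)
urns = {1: ['r'] * 5 + ['b'] * 5, 2: ['r'] * 2 + ['b'] * 4}

def draw_two(same_urn):
    """Draw two marbles without replacement; the second draw uses either
    the same urn or an independently reselected urn."""
    u1 = 1 if rng.random() < 2/3 else 2
    contents = list(urns[u1])
    first = contents.pop(rng.randrange(len(contents)))
    if not same_urn:                       # third experiment: reselect an urn
        u2 = 1 if rng.random() < 2/3 else 2
        if u2 != u1:
            contents = list(urns[u2])      # fresh urn, no marble missing
    second = contents[rng.randrange(len(contents))]
    return first, second

for same in (True, False):
    both = red1 = 0
    for _ in range(200_000):
        f, s = draw_two(same)
        if f == 'r':
            red1 += 1
            both += (s == 'r')
    print(both / red1)   # about 0.383 (same urn) and about 0.406 (urn reselected)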
2.4 EXAMPLE: A BINARY CHANNEL

A simple, but central, problem in communications is the transmission of a single bit. The communications channel may be wired, as in the wired telephone network, or wireless, as in cellular radio. Because of limited power available for transmission and noise in the channel, bit errors will sometimes happen. In this section, we consider a simple binary channel model and the communications problem from the point of view of the receiver.

A transmitter sends a bit X through a communications channel to a receiver. The receiver receives a bit Y, possibly not equal to X. A common model is shown in Figure 2.2.

FIGURE 2.2 Binary channel. The labeled paths represent conditional probabilities of Y given X: from X = 1 to Y = 1 with probability 1 − ν, from X = 1 to Y = 0 with probability ν, from X = 0 to Y = 1 with probability ε, and from X = 0 to Y = 0 with probability 1 − ε.
The links in Figure 2.2 represent conditional probabilities of Y given X. For instance,

Pr[Y = 1 | X = 1] = 1 − ν        Pr[Y = 0 | X = 1] = ν
Pr[Y = 1 | X = 0] = ε            Pr[Y = 0 | X = 0] = 1 − ε

Typically, the crossover probabilities, ε and ν, are small. If they are equal, the channel is known as a binary symmetric channel.

The communications problem for the receiver is to decide (guess) what X is given the received value Y. The receiver employs both Bayes theorem and the LTP in making this decision. Assume the received value is a 1; that is, Y = 1 (an analogous argument holds when the received value is a 0). A reasonable decision rule for the receiver to employ is the maximum a posteriori (MAP) rule: decide for X = 1 if the conditional probability of X = 1 given Y = 1 is greater than the conditional probability of X = 0 given Y = 1. Let X̂(Y = 1) denote the estimated bit. Note that X̂ is a function of the observation Y. The MAP rule says decide X̂(Y = 1) = 1 if

Pr[X = 1 | Y = 1] > Pr[X = 0 | Y = 1]

If this is not true, decide X̂(Y = 1) = 0. Let Pr[X = 1] = p. Using Bayes theorem and the LTP, the decision rule can be simplified as follows: First, use the LTP to calculate

Pr[Y = 1] = Pr[Y = 1 | X = 0] Pr[X = 0] + Pr[Y = 1 | X = 1] Pr[X = 1] = ε(1 − p) + (1 − ν)p

Now, calculate the a posteriori probabilities using Bayes theorem:

Pr[X = 1 | Y = 1] = Pr[Y = 1 | X = 1] Pr[X = 1] / Pr[Y = 1] = (1 − ν)p / (ε(1 − p) + (1 − ν)p)
Pr[X = 0 | Y = 1] = Pr[Y = 1 | X = 0] Pr[X = 0] / Pr[Y = 1] = ε(1 − p) / (ε(1 − p) + (1 − ν)p)

After comparing the two conditional probabilities and simplifying, the MAP decision rule becomes

Set X̂(Y = 1) = 1     if     p/(1 − p) > ε/(1 − ν)

Note that the “if” statement above compares one number to another number (so it is either always true or always false, depending on the values of p, ε, and ν). In the common special case when ε = ν, the rule simplifies further: set X̂(Y = 1) = 1 if p > ε. In other words, if the most common route to the observation Y = 1 is across the top, set X̂(Y = 1) = 1; if the most common route to Y = 1 starts at X = 0 and goes up, set X̂(Y = 1) = 0.

This example illustrates an important problem in statistical inference: hypothesis testing. In this case, there are two hypotheses. The first is that X = 1, and the second is that X = 0. The receiver receives Y and must decide which hypothesis is true. Often Bayes theorem and the LTP are essential tools in calculating the decision rule. Hypothesis testing is discussed further in Section 2.5 and in Chapter 12.
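The a posteriori probabilities and the MAP comparison take only a few lines of code. A possible Python sketch (the parameter values are assumed for illustration):

def map_decision(p, eps, nu):
    """MAP estimate of X given Y = 1 for the binary channel,
    with Pr[X = 1] = p and crossover probabilities eps and nu."""
    pr_y1 = eps * (1 - p) + (1 - nu) * p       # LTP
    post_x1 = (1 - nu) * p / pr_y1             # Bayes theorem
    post_x0 = eps * (1 - p) / pr_y1
    return 1 if post_x1 > post_x0 else 0

print(map_decision(p=0.5, eps=0.1, nu=0.1))    # 1: equal priors, Y = 1 favors X = 1
print(map_decision(p=0.05, eps=0.1, nu=0.1))   # 0: X = 1 is a priori too rare (p < eps)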
2.5 EXAMPLE: DRUG TESTING

Many employers test potential employees for illegal drugs. Fail the test, and the applicant does not get the job. Unfortunately, no test, whether for drugs, pregnancy, disease, incoming missiles, or anything else, is perfect. Some drug users will pass the test, and some nonusers will fail. Therefore, some drug users will in fact be hired, while other perfectly innocent people will be refused a job. In this section, we will analyze this effect using hypothetical tests. This text is not about blood or urine chemistry. We will not engage in how blood tests actually work, or in actual failure rates, but we will see how the general problem arises.

Assume the test, T, outputs one of two values. T = 1 indicates the person is a drug user, while T = 0 indicates the person is not a drug user. Let U = 1 if the person is a drug user and U = 0 if the person is not a drug user. Note that T is the (possibly incorrect) result of the test and that U is the correct value.

The performance of a test is defined by conditional probabilities. The false positive (or false alarm) rate is the probability of the test indicating the person is a drug user when in fact that person is not, or Pr[T = 1 | U = 0]. The false negative (or miss) rate is the probability the test indicates the person is not a drug user when in fact that person is, or Pr[T = 0 | U = 1]. A successful result is either a true positive or a true negative. See Table 2.1 below.

TABLE 2.1 A simple confusion matrix showing the relationships between true and false positives and negatives.

         U = 1              U = 0
T = 1    True Positive      False Positive
T = 0    False Negative     True Negative

Now, a miss is unfortunate for the employer (it is arguable how deleterious a drug user might be in most jobs), but a false alarm can be devastating to the applicant. The applicant is not hired, and his or her reputation may be ruined. Most tests try to keep the false positive rate low.

Pr[U = 1] is known as the a priori probability that a person is a drug user. Pr[U = 1 | T = 1] and Pr[U = 1 | T = 0] are known as the a posteriori probabilities of the person being a drug user. For a perfect test, Pr[T = 1 | U = 1] = 1 and Pr[T = 1 | U = 0] = 0. However, tests are not perfect. Let us assume the false positive rate is ε, or Pr[T = 1 | U = 0] = ε; the false negative rate is ν, or Pr[T = 0 | U = 1] = ν; and the a priori probability that a given person is a drug user is Pr[U = 1] = p. For a good test, both ε and ν are fairly small, often 0.05 or less. However, even 0.05 (i.e., a test that is 95% correct) may not be good enough if one is among those falsely accused.

One important question is the probability that a person is not a drug user given that he or she failed the test, or Pr[U = 0 | T = 1]. Using Bayes theorem,

Pr[U = 0 | T = 1] = Pr[T = 1 | U = 0] Pr[U = 0] / Pr[T = 1]
                  = Pr[T = 1 | U = 0] Pr[U = 0] / (Pr[T = 1 | U = 0] Pr[U = 0] + Pr[T = 1 | U = 1] Pr[U = 1])
                  = ε(1 − p) / (ε(1 − p) + (1 − ν)p)

This expression is hard to appreciate, so let us put in some numbers. First, assume that the false alarm rate equals the miss rate, or ε = ν (often tests that minimize the total number of errors have this property). In Figure 2.3, we plot Pr[U = 0 | T = 1] against Pr[U = 1] for two values of the false positive and false negative rates. The first curve is for ε = ν = 0.05, and the second curve is for ε = ν = 0.20. One can see that when p < ε, the majority of positive tests are false (i.e., most people who test positive are not drug users).
FIGURE 2.3 Probability that a person who fails the drug test is not a drug user, plotted versus the probability of the person being a drug user, for ε = ν = 0.05 and ε = ν = 0.20. The probability that a positive test is false is greater than 0.5 when p < ε.
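The curves in Figure 2.3, and the specific numbers quoted below, come from evaluating this expression. A small Python sketch (ours):

def pr_innocent_given_positive(p, eps, nu):
    """Pr[U = 0 | T = 1] via Bayes theorem and the LTP."""
    return eps * (1 - p) / (eps * (1 - p) + (1 - nu) * p)

print(pr_innocent_given_positive(p=0.01, eps=0.05, nu=0.05))  # about 0.84
print(pr_innocent_given_positive(p=0.20, eps=0.05, nu=0.05))  # about 0.17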
For instance, when p = 0.01 and ε = ν = 0.05, Pr[U = 0 | T = 1] = 0.84. In other words, 84% of those failing the test are not drug users. This is for a test that is 95% correct. How can this be? The test is correct for 95% of the persons tested, but that does not mean it is 95% correct for each subgroup when the subgroups are defined by the test results.

In actual practice, drug tests are often given in two parts. The first part is quick and cheap but has a large number of false positives. The second part is more precise (i.e., fewer false positives) but is more costly. Anyone who fails the first part is retested with the second part. However, some false positives will always occur.

Comment 2.3: Many tests are blind or double blind. As an example, consider a test of a new headache drug. A pool of subjects is divided (usually randomly) into two groups: a control group and a test group. The control group is given a placebo (a sugar pill), and
the test group is given the new drug. Both groups record whether or not their pill helps alleviate their headaches. In a blind test (also known as a single blind test), the subjects are unaware of which group they are in. They do not know if they are given the placebo or the test drug. If they did know, that knowledge might influence their perceived results. In a double blind trial, the test givers also do not know who is in the control or test group. They do not know which pill, the placebo or the new drug, they are distributing to each patient. This way, they cannot give subtle clues to test subjects that might affect their perceived results. The point of a double blind trial is that neither the test subjects nor the test givers know who is getting the new drug. Double blind trials are considered to be the best way to conduct many tests. They are often a required part of the new drug approval process.
2.6 EXAMPLE: A DIAMOND NETWORK

As one more example, consider the “diamond” network in Figure 2.4. It consists of a source S, a destination D, and five links. The question, as before, is what is the probability that S is connected with D? We make the same assumptions as before: each link works with probability p, and the links are independent.

FIGURE 2.4 A diamond network: L1 and L4 leave S, L2 and L5 arrive at D, and the vertical link L3 connects the two intermediate nodes.
Since this chapter is about conditional probability, we will use the LTP to help solve for the probability that S is connected to D. The question is what should we condition on? What information would we like to know to help us the most? In this network, the vertical link (L3) seems to be the greatest problem, so it would seem fruitful to condition on L3:

Pr[S → D] = Pr[S → D | L3 = 1] Pr[L3 = 1] + Pr[S → D | L3 = 0] Pr[L3 = 0]     (2.4)

When L3 = 1, the link works, and the network simplifies to that shown in Figure 2.5.
FIGURE 2.5 A diamond network with L3 working: L1 and L4 in parallel from S to the merged node U, followed by L2 and L5 in parallel from U to D.
The network simplifies to the serial concatenation of two networks. For S → D, it is necessary that S → U and U → D:

Pr[S → D | L3 = 1] = Pr[S → U] Pr[U → D]
                   = (p + p − p²)(p + p − p²)
                   = (2p − p²)²                                               (2.5)
Now, when L3 is not working, the network simplifies as shown in Figure 2.6. There are two paths, one of which must work for there to be a path from S to D:

Pr[S → D | L3 = 0] = p² + p² − p⁴                                             (2.6)
Combining Equations (2.4), (2.5), and (2.6) gives the final answer:

Pr[S → D] = (2p − p²)² p + (2p² − p⁴)(1 − p) = 2p² + 2p³ − 5p⁴ + 2p⁵
FIGURE 2.6 A diamond network with L3 not working: only the two-link paths L1–L2 and L4–L5 remain.
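As a check, the diamond network is small enough to enumerate exhaustively, as in Section 1.11.1. A possible Python sketch (with an assumed value of p and our own encoding of the topology):

import math
from itertools import product

p = 0.6  # assumed link probability

def connected(l1, l2, l3, l4, l5):
    # S--L1-->U1, S--L4-->U2, L3 joins U1 and U2, U1--L2-->D, U2--L5-->D
    u1 = l1 or (l4 and l3)   # can S reach node U1?
    u2 = l4 or (l1 and l3)   # can S reach node U2?
    return bool((u1 and l2) or (u2 and l5))

total = sum(
    math.prod(p if x else 1 - p for x in links)
    for links in product([0, 1], repeat=5)
    if connected(*links)
)
print(total)                                  # by enumeration
print(2*p**2 + 2*p**3 - 5*p**4 + 2*p**5)      # closed form from the text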
SUMMARY
Conditional probabilities capture the notion of partial information. For example, given you know the output of a communications channel, what can you say about the input? Or, given a randomly selected person’s weight, what can you say about that person’s height? Or, given a card is a club, what is the probability the card is a six? These are all questions that conditional probabilities help answer.
The conditional probability of A given B is Pr[A | B] = Pr[AB] / Pr[B] when Pr[B] ≠ 0. Sometimes it is more convenient to write it as Pr[AB] = Pr[A | B] Pr[B].

When A and B are independent, Pr[A | B] = Pr[A]; that is, the a posteriori probability is the same as the a priori probability. In other words, knowing B does not change the probability of A occurring.

If the Bi form a partition of S, the Law of Total Probability (LTP) often simplifies probability calculations:

Pr[A] = Σ_{i=1}^∞ Pr[ABi] = Σ_{i=1}^∞ Pr[A | Bi] Pr[Bi]

Bayes’ Theorem relates conditional probabilities:

Pr[B | A] = Pr[A | B] Pr[B] / Pr[A]

A binary test is a decision X̂(Y) = 1 or X̂(Y) = 0 depending on the value of the observation Y. The test is characterized by its conditional probabilities:

• The true positive rate, Pr[X̂ = 1 | X = 1]
• The false positive or false alarm rate, Pr[X̂ = 1 | X = 0]
• The true negative rate, Pr[X̂ = 0 | X = 0]
• The false negative or miss rate, Pr[X̂ = 0 | X = 1]
PROBLEMS

2.1 If a fair coin is flipped twice (the flips are independent) and you are told at least one of the flips is a head, what is the probability that both flips are heads?

2.2 A fair coin is flipped three times (the flips are independent).
a. What is the probability the first flip is a head given the second flip is heads?
b. What is the probability the first flip is a head given there is at least one head in the three flips?
c. What is the probability the first flip is a head given there are at least two heads in the three flips?

2.3 A coin with probability p of coming up heads is flipped three times (the flips are independent).
a. What is the probability of getting at least two heads given there is at least one head?
b. What is the probability of getting at least one head given there are at least two heads?

2.4 A coin with probability p of coming up heads is flipped three times (the flips are independent). Let A be the event the first flip is a head, B the event the second flip is a head, and C the event the third flip is a head. Demonstrate that the probabilities satisfy the chain rule (Equation 2.2):

Pr[ABC] = Pr[A | BC] Pr[B | C] Pr[C]
2.5 An ordinary six-sided die is rolled, and a side is selected. Assume all sides are equally likely. Let A = {1,2,3}, B = {2,3,4,5}, and C = {2,4,5,6}. Demonstrate that the probabilities satisfy the chain rule (Equation 2.2).

2.6 A small library has 250 books: 100 are on history, 80 on science, and 90 on other topics (20 of the books are on “history of science” and count in both the history and science categories). A book is selected (all books are equally likely to be selected). Let H denote that it is a history book and S that it is a science book.
a. What are Pr[H | S] and Pr[H | Sᶜ]?
b. Show Pr[H] can be calculated using the LTP.
c. What is Pr[S | H]?
d. Show that Pr[H | S] and Pr[S | H] are related by Bayes theorem.
What are Pr H S and Pr H S ? Show Pr H can be calculated using the LTP. What is Pr S H ? Show that Pr H S and Pr S H are related by Bayes theorem.
Assume there are two urns with the following numbers of red and blue marbles: Urn 1 2
Red
Blue
5 2
3 6
An urn is selected (Pr U1 = 0.25 and Pr U2 = 0.75), and then a marble is selected from that urn (assume all marbles in that urn are equally likely). Let R denote the event the marble is red and B the event the marble is blue. a. b. c. d. 2.8
What are Pr R and Pr B ? What are Pr U1 R and Pr U1 B ? What are Pr U2 R and Pr U2 B ? Which combinations of conditional probabilities above sum to 1? Why?
A small electronics manufacturer buys parts from two suppliers, A and B. Each widget the manufacturer produces requires two parts from A and one part from B. Parts from A fail randomly at a 10% rate, while parts from B fail at a 20% rate. Assume the parts fail independently. a. What is the probability all three parts in a selected widget work? b. What is the probability at least two of the three parts in a selected widget work? c. Given the widget has one failed part, what is the probability that failed part came from A? From B? d. Given the widget has two failed parts, what is the probability both failed parts came from A?
2.9 Continue the development in Section 2.5, and calculate:
a. Pr[U = 1 | T = 1]
b. Pr[U = 0 | T = 0]
c. Pr[U = 1 | T = 0]
d. Show these probabilities are consistent with the overall test being 95% accurate.
2.10 Solve the network shown in Figure 2.4 using two of the same techniques as before.
1. Compute a big table. This network has five links, so the table will have 2⁵ = 32 rows.
2. The network has four paths from S to D. Call them A, B, C, and D. Solve for Pr[A ∪ B ∪ C ∪ D]. Be careful: only one pair of paths is independent; all the rest are dependent (since they have overlapping links).
2.11 A and B are events with Pr[AB] = 0.5, Pr[ABᶜ] = 0.3, Pr[AᶜB] = 0.1, and Pr[AᶜBᶜ] = 0.1.
a. What are Pr[A] and Pr[B]?
b. What are Pr[A | B], Pr[A | Bᶜ], Pr[Aᶜ | B], and Pr[Aᶜ | Bᶜ]?
c. Calculate Pr[B | A] from Pr[A | B] using Bayes theorem.

2.12 A and B are events with Pr[AB] = 0.6, Pr[ABᶜ] = 0.3, Pr[AᶜB] = 0.1, and Pr[AᶜBᶜ] = 0.0.
a. What are Pr[A] and Pr[B]?
b. What are Pr[A | B], Pr[A | Bᶜ], Pr[Aᶜ | B], and Pr[Aᶜ | Bᶜ]?
c. Calculate Pr[B | A] from Pr[A | B] using Bayes theorem.

2.13 Consider a roll of a four-sided die with each side equally likely. Let A = {1,2,3}, B = {2,4}, and C = {1,3}.
a. What are Pr[A], Pr[B], and Pr[C]?
b. What are Pr[A | B], Pr[A | B ∪ C], and Pr[A | BC]?
c. What are Pr[B | A], Pr[B ∪ C | A], and Pr[BC | A]?

2.14 Consider a roll of an ordinary six-sided die. Let A = {2,3,4,5}, B = {1,2}, and C = {2,5,6}.
a. What are Pr[A], Pr[B], and Pr[C]?
b. What are Pr[A | B], Pr[A | B ∪ C], and Pr[A | BC]?
c. What are Pr[B | A], Pr[B ∪ C | A], and Pr[BC | A]?

2.15 The Monte Hall Problem: This is a famous (or infamous!) probability paradox. In the television game show Let’s Make a Deal, host Monte Hall was famous for offering contestants a deal and then trying to get them to change their minds. Consider the following: There are three doors. Behind one is a special prize (e.g., an expensive car), and behind the other two are booby prizes (on the show, often goats). The contestant picks a door, and then Monte Hall opens another door and shows that behind that door is a booby prize. Monte Hall then offers to allow the contestant to switch and pick the other (unopened) door. Should the contestant switch? Does it make a difference?

2.16 Assume a card is selected from an ordinary deck with all selections equally likely. Calculate the following:
a. Pr[Q | ♣]
b. Pr[♣ | Q]
c. Pr[Q | Q ∪ K ∪ A]
d. Pr[Q ∪ K ∪ A | Q]
e. Pr[A♣ | A ∪ ♣]
2.17 Two six-sided dice are rolled. Assume all 36 possibilities are equally likely. Let X1 equal the result of the first die, X2 the second, and S the sum of the two dice; that is, S = X1 + X2. Calculate the following:
a. Pr[S = k] for k = 2, 3, . . ., 12; that is, what are Pr[S = 2], Pr[S = 3], . . ., Pr[S = 12]?
b. Pr[X1 = 2 | S = k] for k = 2, 3, . . ., 12.
c. Pr[X1 = 6 | S = k] for k = 2, 3, . . ., 12.

2.18 Assume four teams, the Antelopes, Bobcats, Cougars, and Donkeys, play a tournament. In the first round, the Antelopes play the Bobcats, and the Cougars play the Donkeys. The two winners meet in the finals for the championship. The Antelopes are the best team. They win any game they play with probability 0.7. The other three teams are all equal. Each one can beat one of the other two with probability 0.5. Assume the games are independent. What is the probability that each team wins the tournament?

2.19 Discuss the logical fallacy below. State your answer in terms of events A and B and appropriate probabilities. A booklet says the statement “People who talk about photographing Martians rarely attempt to photograph Martians” is a myth (i.e., it is basically untrue) and offers the following fact as evidence: “Most who attempt to photograph Martians have given verbal clues.” Even if we accept the second statement as true, why is this a logical fallacy? In other words, why does the second statement not disprove the first statement?

2.20 Repeat the calculations in Section 2.4 when Y = 0, and find the MAP decision rule for X̂(Y = 0).

2.21 A test has a false alarm probability of 0.05 and a miss probability of 0.10.
a. If the rate of true positives in the population is 0.10, calculate the probability someone is a positive given he or she tested positive.
b. Repeat the calculation for a true positive rate of 0.01.

2.22 Consider two binary symmetric channels, each with crossover probability ε, placed back to back; that is, X enters the first channel and Y leaves, then Y enters the second channel and Z leaves. What are the following transition probabilities?
a. Pr[Z = 0 | X = 0]
b. Pr[Z = 0 | X = 1]
c. Pr[Z = 1 | X = 0]
d. Pr[Z = 1 | X = 1]
2.23 While double blind trials are considered to be the best way of testing, say, a new drug, they may raise ethical concerns. Discuss the ethical problem of conducting a double blind test of a new drug on a group of ill patients when the placebo is known or thought to be ineffective but the new drug may be effective.
2.24 In 2014, Facebook researchers published a paper describing a facial recognition system that achieves a success rate of about p = 97.3% correct.¹
a. Normally, a test like this is described by two numbers, typically the false alarm probability and the miss probability. The usual way of combining these two numbers into a single number is q = max(Pr[False Alarm], Pr[Miss]), where q = 1 − p = 2.7% for the Facebook system. In this case, what can be said about the false alarm and miss probabilities? In addition to their numerical values, describe in words what they mean.
b. The Facebook result assumes high-quality images. In practical situations, the success rate is likely to be much lower. Nevertheless, assume a facial recognition system like this is used at a sporting event with n = 70,000 spectators to identify a possible “bad guy” who may or may not be at the event. Assuming the false alarm and miss rates are both equal to q = 2.7%, describe the result that will likely occur (i) if the bad guy is present and (ii) if the bad guy is not present.
1. Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification.” https://www.cs.toronto.edu/∼ranzato/publications/taigman_cvpr14.pdf
CHAPTER 3

A LITTLE COMBINATORICS
This chapter represents a brief detour into the mathematics of counting: combinatorics. In many probability applications, one needs to know how many outcomes there are, or how many are in certain events. The basic results of counting are listed here. Before we get too far, some advice: combinatorics gets complicated quickly. Work out small examples where all possibilities can be listed and counted. Check these examples against general formulas to make sure the formulas—and your understanding—are correct.
3.1 BASICS OF COUNTING

Consider a set of objects. The objects are distinguishable if each object can be differentiated from the others. For instance, a penny, a nickel, and a dime are distinct, but two identical dice are not. For the most part, we are interested in sets of distinguishable objects.

The first principle of counting goes like this: if a first set of distinguishable objects has n items, a1, a2, . . ., an, and a second set of distinguishable objects has m items, b1, b2, . . ., bm, then there are nm pairs of objects, (a1, b1), (a1, b2), . . ., (a1, bm), (a2, b1), . . ., (an, bm).

Comment 3.1: This principle was introduced in Chapter 1 when talking about compound experiments. If the first experiment is to flip a coin and the second roll a die, then the compound experiment has 2 · 6 = 12 outcomes.
In applying this fundamental result, one must be careful to precisely specify how the experiment is done. For instance, consider the following simple experiment: From a set of n distinguishable items, k are selected. How many possible selections are there?
The number of possible selections depends on how the selection is done. There are two fundamental aspects of the selection process:

1. Is the selection ordered or unordered? An ordered selection is one where the order of selecting the items matters. For instance, flip a coin twice. One selection is a head, and one is a tail. With an ordered selection, getting a head on the first flip and a tail on the second is different from getting a tail on the first flip and a head on the second. With an unordered selection, all that matters is one is a head and the other a tail.

2. Is the selection with replacement or without replacement? Selections with replacement mean as each item is selected, it is replaced (put back) into the set and available to be selected for the next item. Selections without replacement mean an item, once selected, is not available for future selections.

To see how the selection matters, consider a selection of two items from a set of four items, {a, b, c, d}. If the selection is ordered and is done with replacement, then there are 16 possible selections. There are four choices for the first item and four choices for the second. The 16 selections are listed in Figure 3.1.

aa  ab  ac  ad
ba  bb  bc  bd
ca  cb  cc  cd
da  db  dc  dd

FIGURE 3.1 The possible selections of two items from four if the selection is ordered and with replacement.
If the selection is ordered and is done without replacement, there are 12 possible selections. There are four choices for the first item but only three for the second. These 12 selections are given in Figure 3.2. Since the selection is without replacement, the diagonal entries, aa, bb, cc, and dd, cannot happen.

    ab  ac  ad
ba      bc  bd
ca  cb      cd
da  db  dc

FIGURE 3.2 The possible selections of two items from four if the selection is ordered and without replacement.
If the selection is unordered and is done with replacement, there are 10 possible selections. These are given in Figure 3.3. Since ab and ba are equivalent (this is what an unordered selection means), only one is listed in the table. We use the convention of listing the first one alphabetically.

aa  ab  ac  ad
    bb  bc  bd
        cc  cd
            dd

FIGURE 3.3 The possible selections of two items from four if the selection is unordered and with replacement.

Finally, if the selection is unordered and is done without replacement, there are 6 possible selections. These are given in Figure 3.4.

ab  ac  ad
    bc  bd
        cd

FIGURE 3.4 The possible selections of two items from four if the selection is unordered and without replacement.

Now, let us consider in general the four cases of selecting k items from n.
Ordered and With Replacement: From n items, make an ordered selection of k items with replacement. The first item can be selected n ways. The second item can be selected n ways (since the first item is replaced and is available to be selected again), the third item in n ways, etc. Therefore, the k items can be selected in n · n · · · n = n^k ways.

For example, consider an ordered with-replacement selection of three cards from an ordinary deck of 52 cards. Select a card, look at it, then put it back in the deck. Do the same for a second and a third card. The three cards can be selected in 52 · 52 · 52 = 140,608 ways.

Ordered and Without Replacement: From n items, make an ordered selection of k items without replacement. An ordered selection without replacement is known as a permutation. The number of permutations of k items selected from n is sometimes written (n)k. The first item can be selected n ways. The second can be selected in n − 1 ways and the third in n − 2 ways. Therefore, the k items can be selected in

(n)k = n · (n − 1) · (n − 2) · · · (n − k + 1)                                (3.1)

ways. For example, the number of permutations of three cards taken from a deck of 52 is (52)3 = 52 · 51 · 50 = 132,600.

The number n · (n − 1) · · · 3 · 2 · 1 occurs often and is written n! and pronounced “n factorial.” By convention, 0! = 1 and n! = 0 if n < 0. Factorials grow rapidly, as Table 3.1 shows. Using the factorial notation, (n)k can be written as

(n)k = n! / (n − k)!                                                          (3.2)
TABLE 3.1 Some values of n!

n     n!
0     1
1     1
2     2
3     6
4     24
5     120
6     720
7     5040
8     40,320
9     362,880
10    3,628,800
20    2,432,902,008,176,640,000

Note that if all n items are selected (i.e., k = n), then the number of permutations (orderings) is n!. In other words, (n)n = n!
For example, there are 52! = 8 × 1067 possible orderings of an ordinary deck of 52 cards. Unordered and Without Replacement: From n items, make an unordered selection of k items without replacement. An unordered selection without replacement is known as a combination. The number of combinations of k items selected from n is written as nk . For reasons we discuss below, nk is also known as the binomial coefficient. Above, we saw the number of permutations (where order matters) of k items from n is (n−n!k)! . For each combination, there correspond k! permutations. For instance, the three items a, b, c can be ordered 3! = 6 ways: abc, acb, bac, bca, cab, and cba. Thus, to get the number of combinations, divide the number of permutations by k!; that is,
n (n)k = k! k
(3.3)
n! k!(n − k)! n · (n − 1) · (n − 2) · · · (n − k + 1) = k · (k − 1) · (k − 2) · · · 1 =
ways. For example, a combination of three cards from a deck of 52 can be selected in 22,100 ways.
EXAMPLE 3.1
(3.4) (3.5) 52 3
=
The number of sequences of k 1’s and n − k 0’s is nk . To see this, think of each digit as an item. The “items” are distinguishable (the first digit, the second digit, etc.). Each 1 means that digit was selected.
For instance, $\binom{4}{2} = 6$. The six sequences of two 1's and two 0's are 0011, 0101, 0110, 1001, 1010, and 1100. If the roles of the 1's and 0's are reversed (the 0's select the items), then the number of sequences is $\binom{n}{n-k}$. However, the sequences are the same. Thus,

$\binom{n}{k} = \binom{n}{n-k}$
Unordered and With Replacement: From n items, make an unordered selection of k items with replacement. To count the number of possible selections, we appeal to a construction:

1. Make a row of n + k slots. (In the schematics below, n = 5 and k = 3, and the slots are denoted with an "x.")

   x x x x x x x x

2. Put the last item in the last slot.

   x x x x x x x a5

3. Now, select k of the n + k − 1 slots, and put a marker in each.

   x | | x x | x a5

4. Fill in the remaining slots with the items in order.

   a1 | | a2 a3 | a4 a5

5. Select each item to the right of a marker. If a marker occurs multiple times, select the following item the same number of times. In this sample construction, the selection is a2, a2, and a4.

Each selection of markers corresponds to a selection of items, and vice versa. Thus, the number of ways k items can be selected from n in an unordered selection with replacement is the number of ways k slots (i.e., the markers) can be chosen out of n + k − 1 slots; that is,

$\binom{n+k-1}{k}$

Thus, unordered selection with replacement of three cards from a deck of 52 can be done in $\binom{52+3-1}{3} = \binom{54}{3} = 24{,}804$ ways.

The results from this section are summarized in Table 3.2.
TABLE 3.2 The number of ways of selecting k distinguishable objects from n distinguishable objects.

                        Ordered      Unordered
Without Replacement     $(n)_k$      $\binom{n}{k}$
With Replacement        $n^k$        $\binom{n+k-1}{k}$
Comment 3.2: There are two common alternative notations for permutations and combinations:

$(n)_k = {}_nP_k = P(n,k)$
$\binom{n}{k} = {}_nC_k = C(n,k)$
The “double subscript” notation is used in many calculators.
3.2 NOTES ON COMPUTATION

As we saw in Table 3.1, factorials grow rapidly. Computations involving factorials need to be done carefully, or else the intermediate numbers may grow too big. For instance, the number of permutations of 100 items taken two at a time is 100 · 99 = 9900. This number is easily computed. However, $100! = 9.33 \times 10^{157}$ and $98! = 9.42 \times 10^{153}$. Dividing the first number by the second yields 9904.45, close but incorrect. Not only is the first method much simpler, it is also more accurate! Thus, to compute the number of permutations, use Equation (3.1), not Equation (3.2):

$(n)_k = n(n-1)(n-2)\cdots(n-k+1)$
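This advice translates directly into code. Below is a minimal Python sketch (the function name falling_factorial is ours, for illustration) that computes $(n)_k$ by the product in Equation (3.1) rather than by dividing two large factorials:

def falling_factorial(n, k):
    """Compute (n)_k = n*(n-1)*...*(n-k+1) term by term."""
    count = 1
    for j in range(k):
        count *= n - j
    return count

print(falling_factorial(100, 2))   # 9900, computed exactly

Because Python integers are exact, the product form avoids the roundoff error of the factorial-quotient method entirely.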
Now, let us look at the number of combinations. The same basic conclusion holds: use Equation (3.5) rather than Equation (3.4). There are two further optimizations. First, since the binomial coefficient is symmetric, $\binom{n}{k} = \binom{n}{n-k}$, we can assume k ≤ n/2; if not, replace k with n − k. Second, the answer is always an integer. Yet a casual implementation may require floating point numbers. For instance, $\binom{10}{3} = 10 \cdot 9 \cdot 8/(3 \cdot 2 \cdot 1)$. One simple way to compute this number is to divide 10 by 3, multiply the result by 9, divide by 2, multiply by 8, and divide by 1 (obviously, this last division is unnecessary). Unfortunately, this sequence requires floating point numbers (10/3 = 3.3333, etc.). However, $\binom{10}{3}$ can be computed using only integers by reversing the denominator sequence: divide 10 by 1, multiply by 9, divide by 2, multiply by 8, and divide by 3. The intermediate values are 10, 90, 45, 360, and 120.
A simple Matlab function for computing a binomial coefficient might look something like Figure 3.5.

function count = choosenk(n, k)
% CHOOSENK calculate binomial coefficient
% returns number of combinations of k items taken from n
if k > n/2
    k = n - k;
end
if k < 0
    count = 0;
else
    count = 1;
    for j = 1:k
        count = count * (n - j + 1);
        count = count / j;
    end
end

FIGURE 3.5 A simple Matlab function for computing a binomial coefficient.
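For readers working in Python, a direct port of Figure 3.5 might look like the sketch below. It keeps the same multiply-then-divide order, and since each intermediate value is itself a binomial coefficient, the integer division is always exact:

def choosenk(n, k):
    """Binomial coefficient using the integer-only sequence of Section 3.2."""
    if k > n // 2:
        k = n - k
    if k < 0:
        return 0
    count = 1
    for j in range(1, k + 1):
        # count*(n-j+1) is divisible by j, so // is exact here
        count = count * (n - j + 1) // j
    return count

print(choosenk(10, 3))   # 120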
3.3 COMBINATIONS AND THE BINOMIAL COEFFICIENTS

The binomial coefficients have many interesting properties and numerous applications in probability. We present some of these properties here.

Perhaps the easiest interpretation of the binomial coefficient $\binom{n}{k}$ is that it is the number of ways n items can be divided into two piles, with k items in the first pile and n − k items in the second pile. For instance, the five items a, b, c, d, e can be divided into two piles with two items in the first pile and three items in the second pile in $\binom{5}{2} = 10$ ways. Those 10 ways are shown in Table 3.3.

Binomial coefficients obey a simple recursive relationship. Consider n slots, each containing a 0 or a 1. If the first slot contains a 0, then the remaining n − 1 slots must contain k 1's. If it contains a 1, then the remaining n − 1 slots must contain k − 1 1's. Thus,
$\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$  (3.6)
This formula is the basis for Pascal's triangle, an arrangement of binomial coefficients where each value is the sum of the two values above. For instance, in Figure 3.6, the binomial coefficients $\binom{4}{k}$ are 1, 4, 6, 4, and 1 for k = 0, 1, 2, 3, 4.
TABLE 3.3 Example showing the various ways five objects can be divided into two piles with two objects in the first pile and three objects in the second pile.

First Pile    Second Pile
a, b          c, d, e
a, c          b, d, e
a, d          b, c, e
a, e          b, c, d
b, c          a, d, e
b, d          a, c, e
b, e          a, c, d
c, d          a, b, e
c, e          a, b, d
d, e          a, b, c
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1

FIGURE 3.6 Pascal's triangle for n up to 5. Each number is the sum of the two numbers above it. The rows are the binomial coefficients for a given value of n.
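The recursion in Equation (3.6) also gives a simple way to generate the triangle by machine. A minimal Python sketch, in the spirit of Problem 3.5:

def pascal_rows(nmax):
    """Yield rows 0 through nmax of Pascal's triangle using Equation (3.6)."""
    row = [1]
    for n in range(nmax + 1):
        yield row
        # each interior entry is the sum of the two entries above it
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]

for row in pascal_rows(5):
    print(row)
# final row printed: [1, 5, 10, 10, 5, 1]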
3.4 THE BINOMIAL THEOREM

The binomial coefficients are so named because they are central to the binomial theorem.

Theorem 3.1 (Binomial Theorem): For integer n ≥ 0,

$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$  (3.7)
The binomial theorem is usually proved recursively. The basis, n = 0, is trivial: the theorem reduces to 1 = 1. The recursive step assumes the theorem is true for n − 1 and uses this to show the theorem is true for n. The proof starts like this:

$(a+b)^n = (a+b)(a+b)^{n-1}$  (3.8)
Now, substitute Equation (3.7) into Equation (3.8):

$\sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} = (a+b) \sum_{k=0}^{n-1} \binom{n-1}{k} a^k b^{n-1-k}$  (3.9)

The rest of the proof is to use Equation (3.6) and rearrange the right-hand side to look like the left.

The binomial theorem gives some simple properties of binomial coefficients:

$0 = (-1+1)^n = \sum_{k=0}^{n} \binom{n}{k} (-1)^k 1^{n-k} = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - \cdots + (-1)^n \binom{n}{n}$  (3.10)

$2^n = (1+1)^n = \sum_{k=0}^{n} \binom{n}{k} 1^k 1^{n-k} = \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n}$  (3.11)
We can check Equations (3.10) and (3.11) using the first few rows of Pascal's triangle:

0 = 1 − 1 = 1 − 2 + 1 = 1 − 3 + 3 − 1 = 1 − 4 + 6 − 4 + 1
$2^1$ = 2 = 1 + 1
$2^2$ = 4 = 1 + 2 + 1
$2^3$ = 8 = 1 + 3 + 3 + 1
$2^4$ = 16 = 1 + 4 + 6 + 4 + 1

Since all the binomial coefficients are nonnegative, Equation (3.11) gives a bound on the binomial coefficients:
$\binom{n}{k} \le 2^n$ for all k = 0, 1, . . . , n  (3.12)
The binomial theorem is central to the binomial distribution, the subject of Chapter 6.
3.5 MULTINOMIAL COEFFICIENT AND THEOREM

The binomial coefficient tells us the number of ways of dividing n distinguishable objects into two piles. What if we want to divide the n objects into more than two piles? This is the multinomial coefficient. In this section, we introduce the multinomial coefficient and show how it helps analyze various card games.

How many ways can n distinguishable objects be divided into three piles with $k_1$ in the first pile, $k_2$ in the second, and $k_3$ in the third with $k_1 + k_2 + k_3 = n$? This is a multinomial coefficient and is denoted as follows:
$\binom{n}{k_1, k_2, k_3}$
We develop this count in two stages. First, the n objects are divided into two piles, the first with k1 objects and the second with n − k1 = k2 + k3 objects. Then, the second pile is divided
into two piles with k2 in one pile and k3 in the other. The number of ways of doing this is the product of the two binomial coefficients:
$\binom{n}{k_1}\binom{k_2+k_3}{k_2} = \dfrac{n!}{k_1!(k_2+k_3)!} \cdot \dfrac{(k_2+k_3)!}{k_2!k_3!} = \dfrac{n!}{k_1!k_2!k_3!} = \binom{n}{k_1,k_2,k_3}$
If there are more than three piles, the formula extends simply:
$\binom{n}{k_1, k_2, \ldots, k_l} = \dfrac{n!}{k_1!k_2!\cdots k_l!}$  (3.13)
The binomial coefficient can be written in this form as well:
$\binom{n}{k} = \binom{n}{k, n-k}$
The binomial theorem is extended to the multinomial theorem:

Theorem 3.2 (Multinomial Theorem):

$(a_1 + a_2 + \cdots + a_m)^n = \sum_{k_1+k_2+\cdots+k_m = n} \binom{n}{k_1, k_2, \ldots, k_m} a_1^{k_1} a_2^{k_2} \cdots a_m^{k_m}$  (3.14)
where the summation is taken over all values of $k_i \ge 0$ such that $k_1 + k_2 + \cdots + k_m = n$.

The sum seems confusing, but really is not too bad. Consider $(a+b+c)^2$:

$(a+b+c)^2 = \sum_{j+k+l=2} \binom{2}{j,k,l} a^j b^k c^l$
$= \binom{2}{2,0,0} a^2b^0c^0 + \binom{2}{0,2,0} a^0b^2c^0 + \binom{2}{0,0,2} a^0b^0c^2 + \binom{2}{1,1,0} a^1b^1c^0 + \binom{2}{1,0,1} a^1b^0c^1 + \binom{2}{0,1,1} a^0b^1c^1$
$= a^2 + b^2 + c^2 + 2ab + 2ac + 2bc$
Let $a_1 = a_2 = \cdots = a_m = 1$. Then,

$m^n = (1 + 1 + \cdots + 1)^n = \sum_{k_1+k_2+\cdots+k_m = n} \binom{n}{k_1, k_2, \ldots, k_m}$  (3.15)

For example,

$3^2 = (1+1+1)^2 = \sum_{j+k+l=2} \binom{2}{j,k,l} 1^j 1^k 1^l$
$= \binom{2}{2,0,0} + \binom{2}{0,2,0} + \binom{2}{0,0,2} + \binom{2}{1,1,0} + \binom{2}{1,0,1} + \binom{2}{0,1,1}$
$= 1 + 1 + 1 + 2 + 2 + 2 = 9$

EXAMPLE 3.2
In the game of bridge, an ordinary 52-card deck of cards is dealt into four hands, each with 13 cards. The number of ways this can be done is
$\binom{52}{13, 13, 13, 13} = 5.4 \times 10^{28}$
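The bridge count is easy to reproduce with exact integer arithmetic. A quick sketch using the standard library's math.factorial (the function name multinomial is ours, for illustration):

from math import factorial

def multinomial(n, ks):
    """Multinomial coefficient n!/(k1! k2! ... km!); requires sum(ks) == n."""
    assert sum(ks) == n
    count = factorial(n)
    for k in ks:
        count //= factorial(k)   # exact integer division
    return count

print(multinomial(52, [13, 13, 13, 13]))   # about 5.4e28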
3.6 THE BIRTHDAY PARADOX AND MESSAGE AUTHENTICATION

A classic probability paradox is the birthday problem: How large must k be for a group of k people to be likely to have at least two people with the same birthday? In this section, we solve this problem and show how it relates to the problem of transmitting secure messages.

We assume three things:

• Years have 365 days. Leap years have 366, but they occur only once every four years.
• Birthdays are independent. Any one person's birthday does not affect anyone else's. In particular, we assume the group of people includes no twins, triplets, etc.
• All days are equally likely to be birthdays. This is not quite true, but the assumption spreads out the birthdays and minimizes the possibility of common birthdays. Births (in the United States, anyway) are about 10% higher in the late summer than in winter.¹ There are fewer births on weekends and holidays. Ironically, "Labor Day" (in September) has a relatively low birthrate.

We also take "likely" to mean a probability of 0.5 or more. How large must k be for a group of k people to have a probability of at least 0.5 of having at least two with the same birthday?

Let $A_k$ be the event that at least two people out of k have the same birthday. This event is complicated. Multiple pairs of people could have common birthdays, triples of people, etc. The complementary event, that no two people have the same birthday, is much simpler. Let $q(k) = \Pr[\overline{A_k}] = \Pr[\text{no common birthday with } k \text{ people}]$. Then, q(1) = 1 since one person cannot have a pair. What about q(2)? The first person's birthday is one of 365 days. The second person's birthday differs from the first's if it is one of the remaining 364 days. The probability of this happening is 364/365:

$q(2) = q(1) \cdot \dfrac{364}{365} = \dfrac{365}{365} \cdot \dfrac{364}{365} = \dfrac{364}{365}$

¹ National Vital Statistics Reports, vol. 55, no. 1 (September 2006). http://www.cdc.gov
The third person does not match either of the first two with probability 363/365:

$q(3) = q(2) \cdot \dfrac{363}{365} = \dfrac{365}{365} \cdot \dfrac{364}{365} \cdot \dfrac{363}{365}$
We can continue this process and get a recursive formula for q(k):

$q(k) = q(k-1) \cdot \dfrac{365+1-k}{365} = \dfrac{365 \cdot 364 \cdots (366-k)}{365^k} = \dfrac{(365)_k}{365^k}$  (3.16)
Note that the probability is the number of ways of making a permutation of k objects (days) taken from n = 365 divided by the number of ways of making an ordered with replacement selection of k objects from 365.

How large does k have to be so that q(k) < 0.5? A little arithmetic shows that k = 23 suffices since q(22) = 0.524 and q(23) = 0.493. The q(k) sequence is shown in Figure 3.7.

FIGURE 3.7 The probability of no pairs of birthdays for k people in a year with 365 days.
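The recursion (3.16) is a one-line loop. A minimal Python sketch that computes q(k) exactly (the function name q is ours):

def q(k, n=365):
    """Probability that k people all have different birthdays (Equation 3.16)."""
    prob = 1.0
    for j in range(k):
        prob *= (n - j) / n
    return prob

print(q(22), q(23))   # approximately 0.524 and 0.493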
Why is such a small number of people sufficient? Most people intuitively assume the number would be much closer to 182 = 365/2. To see why a small number of people is sufficient, first we generalize and let n denote the number of days in a year:

$q(k) = \dfrac{(n)_k}{n^k} = \dfrac{n(n-1)(n-2)\cdots(n-k+1)}{n \cdot n \cdot n \cdots n} = \dfrac{n}{n} \cdot \dfrac{n-1}{n} \cdot \dfrac{n-2}{n} \cdots \dfrac{n-k+1}{n} = \left(1-\dfrac{0}{n}\right)\left(1-\dfrac{1}{n}\right)\left(1-\dfrac{2}{n}\right)\cdots\left(1-\dfrac{k-1}{n}\right)$
The multiplication of all these terms is awkward to manipulate. However, taking logs converts those multiplications to additions:
$\log q(k) = \log\left(1-\dfrac{0}{n}\right) + \log\left(1-\dfrac{1}{n}\right) + \log\left(1-\dfrac{2}{n}\right) + \cdots + \log\left(1-\dfrac{k-1}{n}\right)$
Now, use an important fact about logs: when x is small,

$\log(1+x) \approx x$ for $x \approx 0$  (3.17)
This is shown in Figure 3.8. (We use this approximation several more times in later chapters.)

FIGURE 3.8 Plot showing log(1 + x) ≈ x when x is small.
Since $k \ll n$, we can use this approximation:

$\log q(k) \approx -\dfrac{0}{n} - \dfrac{1}{n} - \dfrac{2}{n} - \cdots - \dfrac{k-1}{n} = -\dfrac{1}{n}\sum_{l=0}^{k-1} l = -\dfrac{k(k-1)}{2n}$
Finally, invert the logarithm:

$q(k) \approx \exp\left(-\dfrac{k(k-1)}{2n}\right)$  (3.18)
When k = 23 and n = 365, the approximation evaluates to 0.50 (actually 0.499998), close to the actual 0.493. Alternatively, we can set q(k) = 0.5 and solve for k:

$0.5 = q(k)$
$\log(0.5) = -0.693 \approx -\dfrac{k(k-1)}{2n}$  (3.19)
Thus, k(k − 1)/2 ≈ 0.693n. For n = 365, k(k − 1)/2 ≈ 253. Hence, k = 23. The intuition in the birthday problem should be that k people define k(k − 1)/2 pairs of people and, at most, about 0.693n pairs can have different birthdays.
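Both the approximation (3.18) and the solution for k are easy to check numerically. A short sketch, assuming the q(k) function from the earlier sketch is available:

from math import ceil, exp, sqrt

n, k = 365, 23
print(exp(-k * (k - 1) / (2 * n)))   # about 0.4999, versus the exact q(23) = 0.493

# solve k(k-1)/2 = 0.693n; for moderate n, k is roughly sqrt(2 * 0.693 * n)
print(ceil(sqrt(2 * 0.693 * n)))     # 23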
Comment 3.3: We use "log" or "$\log_e$" to denote the natural logarithm. Some textbooks and many calculators use "ln" instead. That is, if $y = e^x$, then $x = \log(y)$. Later, we need $\log_{10}$ and $\log_2$. If $y = 10^x$, then $x = \log_{10}(y)$, and if $y = 2^x$, then $x = \log_2(y)$. See also Comment 5.9.
EXAMPLE 3.3

The birthday problem is a convenient example to introduce simulation, a computational procedure that mimics the randomness in real systems. The python command trial=randint(n,size=k) returns a vector of k random integers. All integers between 0 and n − 1 are equally likely. These are birthdays. The command unique(trial) returns a vector with the unique elements of trial. If the size of the unique vector equals k, then no birthdays are repeated. Sample python code follows:

from numpy import sqrt, unique
from numpy.random import randint

k, n = 23, 365
numtrials = 10000
count = 0
for i in range(numtrials):
    trial = randint(n, size=k)              # k birthdays, each in 0..n-1
    count += (unique(trial).size == k)      # no repeated birthday in this trial?
phat = count / numtrials
std = 1.0 / sqrt(2 * numtrials)
print(phat, phat - 1.96 * std, phat + 1.96 * std)

A sequence of 10 trials with n = 8 and k = 3 looks like this: [0 7 6], [3 6 6], [5 2 0], [0 4 4], [5 6 0], [2 5 6], [6 2 5], [1 7 0], [2 2 7], [6 0 3]. We see the second, fourth, and ninth trials have repeated birthdays; the other seven do not. The probability of no repeated birthday is estimated as $\hat{p} = 7/10 = 0.7$. The exact probability is p = 1 · (7/8) · (6/8) = 0.66.

The confidence interval is an estimate of the range of values in which we expect to find the correct value. If N is large enough, we expect the correct value to be in the interval $(\hat{p} - 1.96/\sqrt{2N},\ \hat{p} + 1.96/\sqrt{2N})$ approximately 95% of the time (where N is the number of trials). In 10,000 trials with n = 365 and k = 23, we observe 4868 trials with no repeated birthdays, giving an estimate $\hat{p} = 4868/10000 = 0.487$, which is pretty close to the exact value p = 0.493. The confidence interval is (0.473, 0.501), which does indeed contain the correct value. More information on probability estimates and confidence intervals can be found in Sections 5.6, 6.6, 8.9, 9.7.3, 10.1, 10.6, 10.7, and 10.8.

What does the birthday problem have to do with secure communications? When security is important, users often attach a message authentication code (MAC) to a message. The MAC is a b-bit signature with the following properties:

• The MAC is computed from the message and a password shared by sender and receiver.
• Two different messages should with high probability have different MACs.
• The MAC algorithm should be one-way. It should be relatively easy to compute a MAC from a message but difficult to compute a message that has a particular MAC.

Various MAC algorithms exist. These algorithms use cryptographic primitives. Typical values of b are 128, 196, and 256. The probability that two different messages have the same MAC is small, approximately $2^{-b}$.

When Alice wants to send a message (e.g., a legal contract) to Bob, Alice computes a MAC and sends it along with the message. Bob receives both and computes a MAC from the received message. He then compares the computed MAC to the received one. If they are the same, he concludes the received message is the same as the one sent. However, if the MACs differ, he rejects the received message. The sequence of steps looks like this:

1. Alice and Bob share a secret key, K.
2. Alice has a message, M. She computes a MAC h = H(M, K), where H is a MAC function.
3. Alice transmits M and h to Bob.
4. Bob receives M′ and h′. These possibly differ from M and h due to channel noise or an attack by an enemy.
5. Bob computes h″ = H(M′, K) from the received message and the secret key.
6. If h″ = h′, Bob assumes M′ = M. If not, Bob rejects M′.

If, however, Alice is dishonest, she may try to deceive Bob with a birthday attack. She computes a large number k of messages in trying to find two with the same MAC. Of the two with the same MAC, one contract is favorable to Bob, and one cheats Bob. Alice sends the favorable contract to Bob, and Bob approves it. Sometime later, Alice produces the cheating contract and falsely accuses Bob of reneging. She argues that Bob approved the cheating contract because the MAC matches the one he approved.

How many contracts must Alice create before she finds two that match? The approximation (Equation 3.18) answers this question. With $n = 2^b$ and k large, the approximation indicates (ignoring multiplicative constants that are close to 1):
$k \approx \sqrt{n}$

The birthday attack is much more efficient than trying to match a specific MAC. For b = 128, Alice would have to create about $n/2 = 1.7 \times 10^{38}$ contracts to match a specific MAC. However, to find any two that match, she has to create only about $k \approx \sqrt{n} = 1.8 \times 10^{19}$ contracts, a factor of $10^{19}$ faster. Making birthday attacks difficult is one important reason why b is usually chosen to be relatively large.
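The gap between the two attacks is easy to tabulate. A short sketch; each line prints the approximate work for a preimage attack (about n/2 tries) and for a birthday attack (about $\sqrt{n} = 2^{b/2}$ tries):

for b in (32, 64, 128, 256):
    n = 2 ** b
    print(b, float(n // 2), float(2 ** (b // 2)))

For b = 128, this prints roughly 1.7e38 and 1.8e19, the numbers quoted above.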
3.7 HYPERGEOMETRIC PROBABILITIES AND CARD GAMES

Hypergeometric probabilities use binomial coefficients to calculate probabilities of experiments that involve selecting items from various groups. In this section, we develop the hypergeometric probabilities and show how they are used to analyze various card games.
Consider the task of making an unordered selection without replacement of k1 items from a group of n1 items, selecting k2 items from another group of n2 items, etc., to km items from nm items. The number of ways this can be done is the product of m binomials:
$\binom{n_1}{k_1} \binom{n_2}{k_2} \cdots \binom{n_m}{k_m}$  (3.20)
Consider an unordered without replacement selection of k items from n. Let the n items be divided into groups with n1 , n2 , . . . ,nm in each. Similarly, divide the selection as k1 , k2 , . . . ,km . If all selections are equally likely, then the probability of this particular selection is a hypergeometric probability:
$\Pr[k_1, k_2, \ldots, k_m] = \dfrac{\binom{n_1}{k_1} \binom{n_2}{k_2} \cdots \binom{n_m}{k_m}}{\binom{n}{k}}$  (3.21)
The probability is the number of ways of making this particular selection divided by the number of ways of making any selection.

EXAMPLE 3.4
Consider a box with four red marbles and five blue marbles. The box is shaken, and someone reaches in and blindly selects six marbles. The probability of two red marbles and four blue ones is
$\Pr[\text{2 red and 4 blue marbles}] = \dfrac{\binom{4}{2}\binom{5}{4}}{\binom{9}{6}} = \dfrac{6 \cdot 5}{\frac{9 \cdot 8 \cdot 7}{3 \cdot 2 \cdot 1}} = \dfrac{30}{84} = \dfrac{5}{14}$

For the remainder of this section, we illustrate hypergeometric probabilities with some poker hands. In most versions of poker, each player must make his or her best five-card hand from some number of cards dealt to the player and some number of community cards. For example, in Seven-Card Stud, each player is dealt seven cards, and there are no community cards. In Texas Hold 'em, each player is dealt two cards, and there are five community cards. The poker hands are listed here from best to worst (assuming no wild cards are being used):

Straight Flush: The five best cards are all of the same suit (a flush) and in rank order (a straight). For instance, 5♦, 6♦, 7♦, 8♦, 9♦ is a "9-high" straight flush. When comparing two straight flushes, a "10-high" straight flush beats a "9-high" straight flush, etc. The highest possible hand is an "Ace-high" straight flush.

Four of a Kind: Four cards of one rank, and one other card. For example, four Aces (A) and one Jack (J), or four 3's and a 2.

Full House: Three cards of one rank, and two of another. For example, three Jacks and two 9's, pronounced "Jacks full of nines."

Flush: Five cards of the same suit.
Straight: Five cards in rank order (suits do not matter). The Ace usually plays as 1 or 14 for straights. For example, Ace, 2, 3, 4, 5 is the lowest possible straight, while 10, J, Queen (Q), King (K), A is the highest.

Three of a Kind: Three cards of the same rank, and two other cards (not a pair). For example, K♦, K♥, K♣, 2♦, Q♠.

Two Pair: Two cards of one rank, two of another, and one other card. For example, 10♠, 10♦, 9♥, 9♦, K♥, pronounced "tens over nines."

One Pair: Two cards of one rank and three other cards. For example, 6♠, 6♥, 7♥, 2♣, 3♣.

High Card: None of the above. The hand is ranked by its highest card, then its second highest, etc.

For instance, if five cards are dealt from a deck of 52, the probability of getting a full house with three Q's and two 8's is
$\Pr[\text{three Q's, two 8's}] = \dfrac{\binom{4}{3}\binom{4}{2}}{\binom{52}{5}} = \dfrac{4 \cdot 6}{2{,}598{,}960} = 9.2 \times 10^{-6}$
Comment 3.4: Sometimes it is easier to remember this formula if it is written as

$\Pr[\text{three Q's, two 8's}] = \dfrac{\binom{4}{3}\binom{4}{2}\binom{44}{0}}{\binom{52}{5}} = \dfrac{4 \cdot 6 \cdot 1}{2{,}598{,}960} = 9.2 \times 10^{-6}$

The idea is that the 52 cards are divided into three groups: Q's, 8's, and Others. Select three Q's, two 8's, and zero Others. The mnemonic is that 3 + 2 + 0 = 5 and 4 + 4 + 44 = 52.
The number of different full houses is $(13)_2 = 156$ (13 ranks for the triple and 12 ranks for the pair). Therefore, the probability of getting any full house is

$\Pr[\text{any full house}] = 156 \cdot 9.2 \times 10^{-6} = 1.4 \times 10^{-3} \approx \text{1 in 700 hands}$

If the player chooses his or her best five-card hand from seven cards, the probability of a full house is much higher (about 18 times higher). However, the calculation is much more involved.

First, list the different ways of getting a full house in seven cards. The most obvious way is to get three cards of one rank, two of another, one of a third rank, and one of a fourth. The second way is to get three cards in one rank, two in a second rank, and two in a third rank. (This full house is the triple and the higher-ranked pair.) The third way is to get three cards of one rank, three cards of a second rank, and one card in a third rank. (This full house is the higher-ranked triple and a pair from the lower triple.) We use the shorthand 3211, 322, and 331 to describe these three possibilities.
Second, calculate the number of ways of getting a full house. Let n(3211) be the number of ways of getting a specific 3211 full house, N(3211) the number of ways of getting any 3211 full house, and similarly for 322 and 331 full houses. Since there are four cards in each rank,

$n(3211) = \binom{4}{3}\binom{4}{2}\binom{4}{1}\binom{4}{1} = 4 \cdot 6 \cdot 4 \cdot 4 = 384$
$n(322) = \binom{4}{3}\binom{4}{2}\binom{4}{2} = 4 \cdot 6 \cdot 6 = 144$
$n(331) = \binom{4}{3}\binom{4}{3}\binom{4}{1} = 4 \cdot 4 \cdot 4 = 64$
Consider a 3211 full house. There are 13 ways of selecting the rank for the triple, 12 ways of selecting the rank for the pair (since one rank is unavailable), and $\binom{11}{2} = 55$ ways of selecting the two ranks for the single cards. Thus,

$N(3211) = 13 \cdot 12 \cdot \binom{11}{2} \cdot n(3211) = 8580 \cdot 384 = 3{,}294{,}720$
Comment 3.5: The number of ways of selecting the four ranks is not $\binom{13}{4}$ since the ranks are not equivalent. That is, three Q's and two K's is different from two Q's and three K's. The two extra ranks are equivalent, however. In other words, three Q's, two K's, one A, and one J is the same as three Q's, two K's, one J, and one A.
For a 322 full house, there are 13 ways of selecting the rank of the triple and $\binom{12}{2} = 66$ ways of selecting the ranks of the pairs:

$N(322) = 13 \cdot \binom{12}{2} \cdot n(322) = 858 \cdot 144 = 123{,}552$
Finally, for a 331 full house, there are $\binom{13}{2} = 78$ ways of selecting the ranks of the two triples and 11 ways of selecting the rank of the single card:

$N(331) = \binom{13}{2} \cdot 11 \cdot n(331) = 858 \cdot 64 = 54{,}912$

Now, let N denote the total number of full houses:

$N = N(3211) + N(322) + N(331) = 3{,}294{,}720 + 123{,}552 + 54{,}912 = 3{,}473{,}184$
Third, the probability of a full house is N divided by the number of ways of selecting seven cards from 52:

$\Pr[\text{full house}] = \dfrac{N}{\binom{52}{7}} = \dfrac{3{,}473{,}184}{133{,}784{,}560} = 0.02596 \approx \text{1 in 38.5 hands}$
So, getting a full house in a seven-card poker game is about 18 times as likely as getting one in a five-card game. Incidentally, the conditional probability of a 3211 full house given one has a full house is
$\Pr[\text{3211 full house} \mid \text{any full house}] = \dfrac{3{,}294{,}720}{3{,}473{,}184} = 0.95$
About 19 of every 20 full houses are the 3211 variety. Comment 3.6: We did not have to consider the (admittedly unlikely) hand of three cards in one rank and four cards in another rank. This hand is not a full house. It is four of a kind, and four of a kind beats a full house. See also Problem 3.30.
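The three-case count is exactly the kind of multi-step calculation worth double-checking by machine. A sketch with math.comb that reproduces the numbers above:

from math import comb

n3211 = comb(4, 3) * comb(4, 2) * comb(4, 1) * comb(4, 1)   # 384
n322 = comb(4, 3) * comb(4, 2) * comb(4, 2)                 # 144
n331 = comb(4, 3) * comb(4, 3) * comb(4, 1)                 # 64

N = (13 * 12 * comb(11, 2) * n3211   # N(3211) = 3,294,720
     + 13 * comb(12, 2) * n322       # N(322)  =   123,552
     + comb(13, 2) * 11 * n331)      # N(331)  =    54,912
print(N, N / comb(52, 7))            # 3473184 and about 0.02596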
In Texas Hold 'em, each player is initially dealt two cards. The best starting hand is two Aces, the probability of which is

$\Pr[\text{two Aces}] = \dfrac{\binom{4}{2}\binom{48}{0}}{\binom{52}{2}} = \dfrac{\binom{4}{2}}{\binom{52}{2}} = \dfrac{6}{1326} = \dfrac{1}{221}$
The worst starting hand is considered to be the "7 2 off-suit," or a 7 and a 2 in two different suits. (This hand is unlikely to make a straight or a flush, and any pairs it might make would likely lose to other, higher, pairs.) There are four ways to choose the suit for the 7 and three ways to choose the suit for the 2 (since the suits must differ), giving a probability of

$\Pr[\text{7 2 off-suit}] = \dfrac{4 \cdot 3}{\binom{52}{2}} = \dfrac{12}{1326} = \dfrac{2}{221}$
In a cruel irony, getting the worst possible starting hand is twice as likely as getting the best starting hand. Comment 3.7: Probability analysis can answer many poker questions but not all, including some really important ones. Because of the betting sequence and the fact that a player’s cards are hidden from the other players, simple questions like “Do I have a winning hand?” often cannot be answered with simple probabilistic analysis. In most cases, the answer depends on the playing decisions made by the other players at the table.
SUMMARY
In many experiments, the outcomes are equally likely. The probability of an event then is simply the number of outcomes in the event divided by the number of possible outcomes. Accordingly, we need to know how to count the outcomes.

Consider a selection of k items from n items. The selection is ordered if the order of the selection matters (i.e., if ab differs from ba). Otherwise, it is unordered. The selection is with replacement if each item can be selected multiple times (i.e., if aa is possible). Otherwise, the selection is without replacement. The four possibilities for a selection of k items from n are considered below:

• Ordered With Replacement: The number of selections is $n^k$.
• Ordered Without Replacement: These are called permutations. The number of permutations is $(n)_k = n!/(n-k)!$, where $n! = n(n-1)(n-2)\cdots 2 \cdot 1$, $0! = 1$, and $n! = 0$ for $n < 0$. n! is pronounced "n factorial."
• Unordered Without Replacement: These are combinations. The number of combinations of k items selected from n is $\binom{n}{k}$. This number is also known as a binomial coefficient:

$\binom{n}{k} = \dfrac{n!}{k!(n-k)!} = \binom{n}{n-k}$

• Unordered With Replacement: The number of selections is $\binom{n+k-1}{k}$.
The binomial coefficients are central to the binomial theorem:

$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$
The binomial coefficient can be generalized to the multinomial coefficient:
$\binom{n}{k_1, k_2, \ldots, k_m} = \dfrac{n!}{k_1! k_2! \cdots k_m!}$
The binomial theorem is extended to the multinomial theorem:

$(a_1 + a_2 + \cdots + a_m)^n = \sum_{k_1+k_2+\cdots+k_m = n} \binom{n}{k_1, k_2, \ldots, k_m} a_1^{k_1} a_2^{k_2} \cdots a_m^{k_m}$
where the summation is taken over all values of ki ≥ 0 such that k1 + k2 + · · · + km = n. Consider an unordered selection without replacement of k1 items from n1 , k2 from n2 , through km from nm . The number of selections is
$\binom{n_1}{k_1} \binom{n_2}{k_2} \cdots \binom{n_m}{k_m}$
The hypergeometric probabilities measure the likelihood of making this selection (assuming all selections are equally likely):

$\Pr[k_1, k_2, \ldots, k_m] = \dfrac{\binom{n_1}{k_1} \binom{n_2}{k_2} \cdots \binom{n_m}{k_m}}{\binom{n}{k}}$
where $n = n_1 + n_2 + \cdots + n_m$ and $k = k_1 + k_2 + \cdots + k_m$.

The birthday problem is a classic probability paradox: How large must k be for a group of k people to have two with the same birthday? The surprising answer is that a group of 23 is more likely than not to have at least two people with the same birthday. The general answer for k people in a year of n days is

$\Pr[\text{no pair}] = \dfrac{(n)_k}{n^k} \approx \exp\left(-\dfrac{k(k-1)}{2n}\right)$

Setting this probability equal to 0.5 and solving results in

$\dfrac{k(k-1)}{2} \approx 0.693\, n$
If n is large, $k \approx \sqrt{n}$.

Combinatorics can get tricky. Perhaps the best advice is to check any formulas using small problems for which the answer is known. Simplify the problem, count the outcomes, and make sure the count agrees with the formulas.

PROBLEMS

3.1 List the sequences of three 1's and two 0's.

3.2 Show

$k \binom{n}{k} = n \binom{n-1}{k-1}$  (3.22)

This formula comes in handy later when discussing binomial probabilities.

3.3 Prove the computational sequence in Figure 3.5 can be done using only integers.

3.4 Prove Equation (3.6) algebraically using Equation (3.4).

3.5 Write a computer function to calculate the nth row of Pascal's triangle iteratively from the (n − 1)-st row. (The computation step, using vector operations, can be done in one line.)

3.6 The sum of the integers 1 + 2 + · · · + n = n(n + 1)/2 is a binomial coefficient.
a. Which binomial coefficient is the sum?
b. Using Pascal's triangle, can you think of an argument why this is so?

3.7 Complete the missing steps in the proof of the binomial theorem, starting with Equation (3.9).
3.8 Using the multinomial theorem as in Equation (3.15):
a. Expand $4^2 = (1 + 1 + 1 + 1)^2$.
b. Expand $4^2 = (2 + 1 + 1)^2$.

3.9 Write a function to compute the multinomial coefficient for an arbitrary number of piles with k1 in the first, k2 in the second, etc. Demonstrate your program works by showing it gets the correct answer on several interesting examples.

3.10 If Alice can generate a billion contracts per second, how long will it take her to mount a birthday attack for b = 32, 64, 128, and 256? Express your answer in human units (e.g., years, days, hours, or seconds) as appropriate.

3.11 Assume Alice works for a well-funded organization that can purchase a million computers, each capable of generating a billion contracts per second. How long will a birthday attack take for b = 32, 64, 128, and 256?

3.12 For a year of 366 days, how many people are required to make it likely that a pair of birthdays exist?

3.13 A year on Mars is 687 (Earth) days. If we lived on Mars, how many people would need to be in the room for it to be likely that at least two of them have the same birthday?

3.14 A year on Mars is 669 (Mars) days. If we were born on and lived on Mars, how many people would need to be in a room for it to be likely at least two of them have the same birthday?

3.15 Factorials can get big, so big that computing them can be problematic. In your answers below, clearly specify what computing platform you are using and how you are doing your calculation.
a. What is the largest value of n for which you can compute n! using integers?
b. What is the largest value of n for which you can compute n! using floating point numbers?
c. How should you compute log(n!) for large n? Give a short program that computes log(n!). What is log(n!) for n = 50, 100, 200?

3.16 Stirling's formula gives an approximation for n!:

$n! \sim \sqrt{2\pi}\, n^{n+1/2} e^{-n}$  (3.23)

The approximation in Equation (3.23) is asymptotic. The ratio between the two terms goes to 1 as n → ∞:

$\lim_{n\to\infty} \dfrac{n!}{\sqrt{2\pi}\, n^{n+1/2} e^{-n}} = 1$

Plot n! and Stirling's formula on a log-log plot for n = 1, 3, 10, 30, 100, 300, 1000. Note that n! can get huge. You probably cannot compute n! and then log(n!). It is better to compute log(n!) directly (see Problem 3.15).

3.17 Show the following:

$\sum_{k=0}^{n} \binom{n}{k} 2^k = 3^n$
3.18 The approximation log(1 + x) ≈ x (Equation 3.17) is actually an inequality:

$\log(1+x) \le x$ for all $x > -1$  (3.24)

as careful examination of Figure 3.8 shows. Repeat the derivation of the approximation to q(k), but this time, develop an inequality.
a. What is that inequality?
b. Evaluate the exact probability and the inequality for several values of k and n to demonstrate the inequality.

3.19 The log inequality above (Equation 3.24) is sometimes written differently.
a. Show the log inequality (Equation 3.24) can be written as

$y - 1 \ge \log(y)$ for $y > 0$  (3.25)

b. Recreate Figure 3.8 using Equation (3.25).

3.20 Plot $f(x) = x - \log(1+x)$ versus x.

3.21 One way to prove the log inequality (Equation 3.24) is to minimize $f(x) = x - \log(1+x)$.
a. Use calculus to show that x = 0 is a possible minimum of f(x).
b. Show that x = 0 is a minimum (not a maximum) by evaluating f(0) and any other value, such as f(1).

3.22 Solve a different birthday problem:
a. How many other people are required for it to be likely that someone has the same birthday as you do?
b. Why is this number so much larger than 23?
c. Why is this number larger than 365/2 = 182?

3.23 In most developed countries, more children are born on Tuesdays than any other day. (Wednesdays, Thursdays, and to a lesser extent, Fridays are close to Tuesdays, while Saturdays and Sundays are much less so.) Why?

3.24 Using the code in Example 4.7, or a simple rewrite in Matlab or R, estimate the probabilities of no common birthday for the following:
a. n = 365 and k = 1, 2, . . . , 30. Plot your answers and the calculated probabilities on the same plot (see Figure 3.7).
b. n = 687 (Mars) and k = 5, 10, 15, 20, 25, 30, 35, 40, 45, 50. Plot your answers and the calculated probabilities on the same plot.

3.25 Sometimes "unique" identifiers are determined by keeping the least significant digits of a large number. For example, patents in the United States are numbered with seven-digit numbers (soon they will need eight digits). Lawyers refer to patents by their last three digits; for example, Edison's light bulb patent is No. 223,898. Lawyers might refer to it as the "898" patent. A patent infringement case might involve k patents. Use the approximation in Equation (3.18) to calculate the probability of a name collision for k = 2, 3, . . . , 10 patents.
3.26 Continuing Problem 3.25, organizations once commonly used the last four digits of a person’s Social Security number as a unique identifier. (This is not done so much today for fear of identity theft.) a. Use the approximation in Equation (3.18) to calculate the probability of an identifier collision for k = 10,30,100 people. b. What value of k people gives a probability of 0.5 of an identifier collision? 3.27 The binomial coefficient can be bounded as follows (for k > 0): n k
k
≤
n ne k ≤ k k
a. Prove the left-hand inequality. (The proof is straightforward.) b. The proof of the right-hand inequality is tricky. It begins with the following:
n n n k x ≤ x l = (1 + x ) n k l l=0
Justify the inequality and the equality. c. The next step in the proof is to show (1 + x)n ≤ exn . Use the inequality log(1 + x) ≤ x to show this. d. Finally, let x = k/n. Complete the proof. e. Evaluate all three terms for n = 10 and k = 0,1, . . . ,10. The inequality is best when k is small. When k is large, use the simple inequality n right-hand n k ≤ 2 from Equation (3.12). 3.28 In the game Yahtzee sold by Hasbro, a player throws five ordinary six-sided dice and tries to obtain various combinations of poker-like hands. In a throw of five dice, compute the following: a. All five dice showing the same number. b. Four dice showing one number and the other die showing a different number. c. Three dice showing one number and the other two dice showing the same number (in poker parlance, a full house). d. Three dice showing one number and the other two dice showing different numbers (three-of-a-kind). e. Two dice showing one number, two other dice showing a different number, and the fifth die showing a third number (two pair). f. Five different numbers on the five dice. g. Show the probabilities above sum to 1. 3.29 Write a short program to simulate the throw of five dice. Use the program to simulate a million throws of five dice and estimate the probabilities calculated in Problem 3.28. 3.30 What is the probability of getting three cards in one rank and four cards in another rank in a selection of seven cards from a standard deck of 52 cards if all combinations of 7 cards from 52 are equally likely?
3.31 In Texas Hold 'em, each player initially gets two cards, and then five community cards are dealt in the center of the table. Each player makes his or her best five-card hand from the seven cards.
a. How many initial two-card starting hands are there?
b. Many starting hands play the same way. For example, suits do not matter for a pair of Aces. Each pair of Aces plays the same (on average) as any other pair of Aces. The starting hands can be divided into three playing groups: both cards have the same rank (a "pair"), both cards are in the same suit ("suited"), or neither ("off-suit"). How many differently playing starting hands are in each group? How many differently playing starting hands are there in total? (Note that this answer is only about 1/8th the answer to the question above.)
c. How many initial two-card hands correspond to each differently playing starting hand in the question above? Add them up, and show they total the number in the first question.

3.32 You are playing a Texas Hold 'em game against one other player. Your opponent has a pair of 9's (somehow you know this). The five community cards have not yet been dealt.
a. Which two-card hand gives you the best chance of winning? (Hint: the answer is not a pair of Aces.)
b. If you do not have a pair, which (nonpair) hand gives you the best chance of winning?

3.33 In Texas Hold 'em, determine:
a. The probability of getting a flush given your first two cards are the same suit.
b. The probability of getting a flush given your first two cards are in different suits.
c. Which of the two hands above is more likely to result in a flush?

3.34 In the game of Blackjack (also known as "21"), both the player and dealer are initially dealt two cards. Tens and face cards (e.g., 10's, Jacks, Queens, and Kings) count as 10 points, and Aces count as either 1 or 11 points. A "blackjack" occurs when the first two cards sum to 21 (counting the Ace as 11). What is the probability of a blackjack?

3.35 Three cards are dealt without replacement from a well-shuffled, standard 52-card deck.
a. Directly calculate Pr[three of a kind].
b. Directly calculate Pr[a pair and one other card].
c. Directly calculate Pr[three cards of different ranks].
d. Show the three probabilities above sum to 1.

3.36 Three cards are dealt without replacement from a well-shuffled, standard 52-card deck.
a. Directly calculate Pr[three of the same suit].
b. Directly calculate Pr[two of one suit and one of a different suit].
c. Directly calculate Pr[three cards in three different suits].
d. Show the three probabilities above sum to 1.
3.37 A typical "Pick Six" lottery run by some states and countries works as follows: Six numbered balls are selected (unordered, without replacement) from 49. A player selects six balls. Calculate the following probabilities:
a. The player matches exactly three of the six selected balls.
b. The player matches exactly four of the six selected balls.
c. The player matches exactly five of the six selected balls.
d. The player matches all six of the six selected balls.

3.38 "Powerball" is a popular two-part lottery. The rules when this book was written are as follows: In the first part, five numbered white balls are selected (unordered, without replacement) from 59; in the second part, one numbered red ball is selected from 35 red balls (the selected red ball is the "Powerball"). Similarly, the player has a ticket with five white numbers selected from 1 to 59 and one red number from 1 to 35. Calculate the following probabilities:
a. The player matches the red Powerball and none of the white balls.
b. The player matches the red Powerball and exactly one of the white balls.
c. The player matches the red Powerball and exactly two of the white balls.
d. The player matches the red Powerball and all five of the white balls.
3.39 One "Instant" lottery works as follows: The player buys a ticket with five winning numbers selected from 1 to 50. For example, the winning numbers might be 4, 12, 13, 26, and 43. The ticket has 25 "trial" numbers, also selected from 1 to 50. If any of the trial numbers matches any of the winning numbers, the player wins a prize. The ticket, perhaps surprisingly, does not indicate how the trial numbers are chosen. We do know two things (about the ticket we purchased): first, the 25 trial numbers contain no duplicates, and second, none of the trial numbers match any of the winning numbers.
a. Calculate the probability there are no duplicates for an unordered selection with replacement of k = 25 numbers chosen from n = 50.
b. In your best estimate, was the selection of possibly winning numbers likely done with or without replacement?
c. Calculate the probability a randomly chosen ticket has no winners if all selections of k = 25 numbers from n = 50 (unordered without replacement) are equally likely.
d. Since the actual ticket had no winners, would you conclude all selections of 25 winning numbers were equally likely?

3.40 Consider selecting two cards from a well-shuffled deck (unordered and without replacement). Let $K_1$ denote the event the first card is a King and $K_2$ the event the second card is a King.
a. Calculate $\Pr[K_1 \cap K_2] = \Pr[K_1]\Pr[K_2 \mid K_1]$.
b. Compare to the formula of Equation (3.21) for calculating the same probability.

3.41 Continuing Problem 3.40, let X denote any card other than a King.
a. Use the LTP to calculate the probability of getting a King and any other card (i.e., exactly one King) in an unordered and without replacement selection of two cards from a well-shuffled deck.
b. Compare your answer in part a to Equation (3.21).
3.42 Show Equation (3.21) can be written as

$\Pr[k_1, k_2, \ldots, k_m] = \dfrac{\binom{n_1}{k_1}\binom{n_2}{k_2}\cdots\binom{n_m}{k_m}}{\binom{n}{k}} = \binom{k}{k_1, k_2, \ldots, k_m} \dfrac{(n_1)_{k_1} (n_2)_{k_2} \cdots (n_m)_{k_m}}{(n)_k}$  (3.26)
3.43 For m = 2, derive Equation (3.26) directly from first principles.

3.44 In World War II, Germany used an electromechanical encryption machine called Enigma. Enigma was an excellent machine for the time, and breaking its encryption was an important challenge for the Allied countries. The Enigma machine consisted of a plugboard, three (or, near the end of the war, four) rotors, and a reflector (and a keyboard and lights, but these do not affect the security of the system).
a. The plugboard consisted of 26 holes (labeled A to Z). Part of each day's key was a specification of k wires that connected one hole to another. For example, one wire might connect B to R, another might connect J to K, and a third might connect A to W. How many possible connections can be made with k wires, where k = 0, 1, . . . , 13? Evaluate the number for k = 10 (the most common value used by the Germans). Note that the wires were interchangeable; that is, a wire from A to B and one from C to D is the same as a wire from C to D and another from A to B. (Hint: for k = 2 and k = 3, there are 44,850 and 3,453,450 configurations, respectively.)
b. Each rotor consisted of two parts: a wiring matrix from 26 inputs to 26 outputs and a movable ring. The wiring consisted of 26 wires with each wire connecting one input to one output. That is, each input was wired to one and only one output. The wiring of a rotor was fixed at the time of manufacture and not changed afterward. How many possible rotors were there? (The Germans obviously did not manufacture this many different rotors. They only manufactured a few different rotors, but the Allies did not know which rotors were actually in use.)
c. For most of the war, Enigma used three different rotors placed left to right in order. One rotor was chosen for the first position, another rotor different from the first was chosen for the second position, and a third rotor different from the first two was chosen for the third position. How many possible selections of rotors were there? (Hint: this is a very large number.)
d. In operation, the three rotors were rotated to a daily starting configuration. Each rotor could be started in any of 26 positions. How many starting positions were there for the three rotors?
e. The two leftmost rotors had a moveable ring that could be placed in any of 26 positions. (The rightmost rotor rotated one position on each key press. The moveable rings determined when the middle and left rotors turned. Think of the dials in a mechanical car odometer or water meter.) How many different configurations of the two rings were there?
f. The reflector was a fixed wiring of 13 wires, with each wire connecting a letter to another letter. For example, a wire might connect C to G. How many ways could the reflector be wired? (Hint: this is the same as the number of plugboard connections with 13 wires.)
g. Multiply the following numbers together to get the overall complexity of the Enigma machine: (i) the number of possible selections of three rotors, (ii) the number of daily starting positions of the three rotors, (iii) the number of daily positions of the two rings, (iv) the number of plugboard configurations (assume k = 10 wires were used), and (v) the number of reflector configurations. What is the number? (Hint: this is a really, really large number.)
h. During the course of the war, the Allies captured several Enigma machines and learned several important things: The Germans used five different rotors (later, eight different rotors). Each day, three of the five rotors were placed left to right in the machine (this was part of the daily key). The Allies also learned the wiring of each rotor and were able to copy the rotors. They learned the wiring of the reflector. How many ways could three rotors be selected from five and placed into the machine? (Order mattered; a rotor configuration of 123 operated differently from 132, etc.)
i. After learning the wiring of the rotors and the wiring of the reflector, the remaining configuration variables (parts of the daily secret key) were (i) the placement of three rotors from five into the machine, (ii) the number of starting positions for three rotors, (iii) the position of the two rings (on the leftmost and middle rotors), and (iv) the plugboard configuration. Assuming k = 10 wires were used in the plugboard, how many possible Enigma configurations remained?
j. How important was capturing the several Enigma machines to breaking the encryption?

For more information, see A. Ray Miller, The Cryptographic Mathematics of Enigma (Fort Meade, MD: Center for Cryptologic History, 2012).
CHAPTER 4

DISCRETE PROBABILITIES AND RANDOM VARIABLES
Flip a coin until a head appears. How many flips are required? What is the average number of flips required? What is the probability that more than 10 flips are required? That the number of flips is even? Many questions in probability concern discrete experiments whose outcomes are most conveniently described by one or more numbers. This chapter introduces discrete probabilities and random variables.
4.1 PROBABILITY MASS FUNCTIONS

Consider an experiment that produces a discrete set of outcomes. Denote those outcomes as $x_0, x_1, x_2, \ldots$. For example, a binary bit is a 0 or 1. An ordinary die is 1, 2, 3, 4, 5, or 6. The number of voters in an election is an integer, as are the number of electrons crossing a PN junction in a unit of time or the number of telephone calls being made at a given instant of time. In many situations, the discrete values are themselves integers or can be easily mapped to integers; for example, $x_k = k$ or maybe $x_k = k\Delta$. In these cases, it is convenient to refer to the integers as the outcomes: 0, 1, 2, . . ..

Let X be a discrete random variable denoting the (random) result of the experiment. It is discrete if the experiment results in discrete outcomes. It is a random variable if its value is the result of a random experiment; that is, it is not known until the experiment is done.

Comment 4.1: Advanced texts distinguish between countable and uncountable sets. A set is countable if an integer can be assigned to each element (i.e., if one can count the elements). All the discrete random variables considered in this text are countable. The most important example of an uncountable set is an interval (e.g., the set of values x
such that 0 ≤ x ≤ 1). For our purposes, sets are either discrete or continuous (or an obvious combination of the two).
Comment 4.2: Random variables are denoted with uppercase bold letters, sometimes with subscripts, such as X, Y, N, X1 , and X2 . The values random variables take on are denoted with lowercase, italic letters, sometimes with subscripts, such as x, y, n, x1 , and x2 .
A probability mass function (PMF) is a mapping of probabilities to outcomes:

$p(k) = \Pr[X = x_k]$

for all values of k. PMFs possess two properties:

1. Each value is nonnegative:

$p(k) \ge 0$  (4.1)

2. The sum of the p(k) values is 1:

$\sum_k p(k) = 1$  (4.2)
Some comments about PMFs are in order:

• A PMF is a function of a discrete argument, k. For each value of k, p(k) is a number between 0 and 1. That is, p(1) is a number, p(2) is a possibly different number, p(3) is yet another number, etc.
• The simplest way of describing a PMF is to simply list the values: p(0), p(1), p(2), etc. Another way is to use a table:

k       0     1     2     3
p(k)    0.4   0.3   0.2   0.1

Still a third way is to use a case statement:

$p(k) = \begin{cases} 0.4 & k = 0 \\ 0.3 & k = 1 \\ 0.2 & k = 2 \\ 0.1 & k = 3 \end{cases}$
• Any collection of numbers satisfying Equations (4.1) and (4.2) is a PMF.
• In other words, there are an infinity of PMFs. Of this infinity, throughout the book we focus on a few PMFs that frequently occur in applications.
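In code, a PMF is just a table of numbers satisfying Equations (4.1) and (4.2). A minimal Python sketch using the PMF from the table above:

pmf = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}

assert all(p >= 0 for p in pmf.values())        # Equation (4.1)
assert abs(sum(pmf.values()) - 1.0) < 1e-12     # Equation (4.2)

print(pmf[2])   # Pr[X = 2] = 0.2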
4.2 CUMULATIVE DISTRIBUTION FUNCTIONS

A cumulative distribution function (CDF) is another mapping of probabilities to outcomes, but unlike a PMF, a CDF measures the probabilities of {X ≤ u} for all values of −∞ < u < ∞:

$F_X(u) = \Pr[X \le u] = \sum_{k:\, x_k \le u} p(k)$

where the sum is over all k such that $x_k \le u$. Note that even though the outcomes are discrete, u is a continuous parameter. All values from −∞ to ∞ are allowed.

CDFs possess several important properties:

1. The distribution function starts at 0: $F_X(-\infty) = 0$.
2. The distribution function is nondecreasing: for $u_1 < u_2$, $F_X(u_1) \le F_X(u_2)$.
3. The distribution function ends at 1: $F_X(\infty) = 1$.

A CDF is useful for calculating probabilities. The event $\{X \le u_1\}$ can be written as
$\{X \le u_0\} \cup \{u_0 < X \le u_1\}$. This is shown schematically below:

$\{X \le u_1\} = \{X \le u_0\} \cup \{u_0 < X \le u_1\}$

Probabilities follow immediately from Axiom 3:

$\Pr[X \le u_1] = \Pr[X \le u_0] + \Pr[u_0 < X \le u_1]$

The equation can be rearranged into a more common form:

$\Pr[u_0 < X \le u_1] = \Pr[X \le u_1] - \Pr[X \le u_0] = F(u_1) - F(u_0)$

Comment 4.3: The word distribution is used in (at least) two different senses in probability. The cumulative distribution function, or simply the distribution function, is $F_X(u) = \Pr[X \le u]$ for all values of u. We also use distribution to refer to the type of probabilities the random variable has. For instance, a random variable X might have the Poisson distribution (discussed later in this chapter) or the binomial distribution (discussed in Section 6.1), or it might have one of many other distributions.
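Building a CDF from a PMF, and interval probabilities from the CDF, takes only a few lines. A sketch reusing the example PMF from Section 4.1:

pmf = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}

def F(u):
    """F_X(u) = Pr[X <= u]: sum the PMF over all outcomes x_k <= u."""
    return sum(p for x, p in pmf.items() if x <= u)

# Pr[u0 < X <= u1] = F(u1) - F(u0)
print(F(2) - F(0))   # Pr[0 < X <= 2] = 0.3 + 0.2 = 0.5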
FIGURE 4.1 PMF and CDF for a simple Bernoulli distribution with $\Pr[X = 1] = p$ and $\Pr[X = 0] = 1 - p$. The PMF (bottom) is a stem (or comb) plot. The CDF (top) is a staircase function. The height of each step is the p(k) value at that value of k. For instance, $\Pr[X = 1]$ can be read off either plot. It is the height of the stem at k = 1 in the PMF or the height of the jump in the CDF.
EXAMPLE 4.1
The simplest interesting distribution is the Bernoulli distribution. The random variable X takes on two values with associated probabilities $\Pr[X = 1] = p$ and $\Pr[X = 0] = 1 - p$, where $0 \le p \le 1$. Bernoulli random variables model all sorts of binary experiments, such as coin flips, incidence of diseases, hit or miss, etc. The PMF and CDF for a Bernoulli distribution are shown in Figure 4.1.
4.3 EXPECTED VALUES

The expected value captures the notion of probabilistic average. Imagine an experiment that results in a discrete random variable, say, a six-sided die is rolled. If the experiment is repeated a great many times, what would be the average value of the roll? What is a typical deviation from the average? We can use the probabilities to compute the values we would
expect to get if we did do the experiment many times. In other words, we can analyze the experiment beforehand to estimate what will likely happen.

Let X be a discrete random variable with PMF p(k), and let g(X) be some function of X. Then, the expected value of g(X), denoted E[g(X)], is

$E[g(X)] = \sum_k g(x_k)\, p(k)$  (4.3)
The sum is over all values of k. For each value the random variable takes on, $x_k$, $g(x_k)$ is weighted by the probability of that value, p(k). Note that while X is random, E[g(X)] is not random; it is a number.

Comment 4.4: In Chapter 7, we discuss continuous random variables and define a density, $f_X(x)$, which corresponds to a PMF. The definition of expected values changes from Equation (4.3) to Equation (7.9) in Section 7.1:

$E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx$

Otherwise, all the properties of expected values developed below are the same for discrete or continuous random variables.
The two most important expected values are the mean and variance of X. The mean is the "average value" of X and is denoted $\mu_X$. For the mean, g(X) = X, and thus,

$\mu_X = E[X] = \sum_k x_k\, p(k)$  (4.4)
It is perhaps surprising that the expected value of X may not be one of the outcomes $x_k$. For instance, if $\Pr[X = 1] = p$ and $\Pr[X = 0] = 1 - p$, then $E[X] = 0 \cdot (1-p) + 1 \cdot p = p$. If p is any value other than 0 or 1, E[X] is not one of the $x_k$. As another example, in Example 4.2 below, the outcomes are 1, 2, 3, 4, 5, and 6, while the mean is 3.5.

The variance of X is a measure of how spread out the values of X are. It is denoted $\sigma_X^2$ and is computed using Equation (4.3) with $g(X) = (X - \mu_X)^2$:

$\sigma_X^2 = \mathrm{Var}[X] = E[(X - \mu_X)^2] = \sum_k (x_k - \mu_X)^2\, p(k)$  (4.5)
The standard deviation is the square root of the variance, $\sigma_X = \sqrt{\sigma_X^2}$. The mean of X is the probabilistic average value of X. It is an indication of where X is located. The variance is a measure of the spread of the distribution of X. When the variance is high, the observations of X are spread out; conversely, when the variance is small, the values of X are clustered tightly around the mean.

The values $E[X^k]$ are known as the moments of X (some authors describe them as the moments of the PMF of X). The first moment is the mean. The second moment is $E[X^2]$, the third moment is $E[X^3]$, etc.
Comment 4.5: We use summation notation throughout the text:

$\sum_k g(k) = \cdots + g(-1) + g(0) + g(1) + g(2) + \cdots$

Many of these g(k) may be 0. In some problems, k = 0, 1, 2, . . .; in others, k = 1, 2, 3, . . .. Some special sums are needed:

$\sum_{k=1}^{m} k = 1 + 2 + 3 + \cdots + m = \dfrac{m(m+1)}{2}$  (4.6)

$\sum_{k=1}^{m} k^2 = 1^2 + 2^2 + 3^2 + \cdots + m^2 = \dfrac{m(m+1)(2m+1)}{6}$  (4.7)

$\sum_{k=0}^{\infty} r^k = 1 + r + r^2 + \cdots = \dfrac{1}{1-r}$ for $|r| < 1$  (4.8)

EXAMPLE 4.2
Consider a roll of a standard six-sided die. Let X be a random variable denoting the side of the die that is up. X takes on one of the values 1, 2, 3, 4, 5, or 6, each with equal probability. That is, $\Pr[X = k] = 1/6$ for k = 1, 2, 3, 4, 5, and 6, and $\Pr[X = k] = 0$ for any other value of k. The mean and variance can be computed as

$E[X] = 1 \cdot \dfrac{1}{6} + 2 \cdot \dfrac{1}{6} + 3 \cdot \dfrac{1}{6} + 4 \cdot \dfrac{1}{6} + 5 \cdot \dfrac{1}{6} + 6 \cdot \dfrac{1}{6} = 3.5$

$\mathrm{Var}[X] = (1-3.5)^2 \dfrac{1}{6} + (2-3.5)^2 \dfrac{1}{6} + (3-3.5)^2 \dfrac{1}{6} + (4-3.5)^2 \dfrac{1}{6} + (5-3.5)^2 \dfrac{1}{6} + (6-3.5)^2 \dfrac{1}{6}$
$= \dfrac{2.5^2 + 1.5^2 + 0.5^2 + 0.5^2 + 1.5^2 + 2.5^2}{6} = \dfrac{35}{12} = 1.71^2$
The mean of X is 3.5, the variance 35/12, and the standard deviation 1.71. Expected values have a number of important properties. In many cases, expected values can be manipulated with the tools of ordinary algebra: 1. The expected value of a constant is the constant. If g (X ) = c, where c is a constant (not random), then
E c =c
(4.9)
4.3 Expected Values 81
2. Expected values scale by multiplicative constants. If g (X ) is multiplied by a constant, c, then
E cg (X ) = cE g (X )
(4.10)
3. Expected values are additive. If g (X ) = g1 (X ) + g2 (X ), then
E g1 (X ) + g2 (X ) = E g1 (X ) + E g2 (X )
(4.11)
4. Expected values are not multiplicative, however. If g (X ) = g1 (X )g2 (X ), then in general,
2
E g1 (X )g2 (X ) = E g1 (X ) E g2 (X )
(4.12)
For instance, in general, E X 2 = E X · E X = E X . 5. If g (X ) ≥ 0 for all values of X, then
E g (X ) ≥ 0
(4.13)
Forexample, E (X − a)2 ≥ 0 since (X − a)2 ≥ 0 for all values of X and a. In particular, Var X ≥ 0 since (X − μ)2 ≥ 0 for all values of X. 6. Expected values commute with derivatives and integrals in the following sense: if g is a function of X and a nonrandom parameter v, then d d E g (X,v) = E g (X,v) dv dv
(4.14)
Similarly,
E g (X,v) dv = E
EXAMPLE 4.3
g (X,v) dv
(4.15)
(Affine transformation): An affine transformation is Y = aX + b for constants a and b. Using the properties above, let us show μY = aμX + b: μy = E Y = E aX + b = E aX + E b = aE X + b
(definition of Y) (by Equation 4.11) (by Equations 4.10 and 4.9)
= aμx + b
Let us use these properties to prove an important theorem about variances. This theorem is handy for computing variances.
82 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
2
Theorem 4.1: σ2x = E X 2 − E X σ2x + μ2x .
= E X 2 − μ2x . It is often rearranged as E X 2 =
Proof: σ2x = E (X − μx )2 = E X 2 − 2X μx + μ2x = E X 2 + E − 2X μx + E μ2x = E X 2 − 2μx E X + μ2x = E X 2 − 2μ2x + μ2x = E X 2 − μ2x
EXAMPLE 4.4
(by Equation 4.5) (expanding the square) (by Equation 4.11) (by Equations 4.9 and 4.10) (by Equation 4.4) ■
Let us continue Example 4.2 and demonstrate Theorem 4.1 above:
E X2 =
12 + 22 + 32 + 42 + 52 + 62 91 = 6 6
2
Var X = E X 2 − E X
=
91 35 − 3.52 = 6 12
Generally speaking, calculating variances with this theorem is easier than using Equation (4.5).
Comment 4.6: The mean, variance, and Theorem 4.1 all have analogs in physics. If a PMF corresponds to a mass distribution, then the mean is the center of mass, the variance is the moment of inertia about the mean, and Theorem 4.1 is the parallel axis theorem about moments of inertia.
Chebyshev’s Inequality: A useful inequality involving probabilities and expected values is Chebyshev’s inequality: Var X Pr |X − μ| ≥ ≤ 2
(4.16)
The key to this inequality (and many others) is the indicator function IA (x):
IA (x) =
1 x∈A A 0 x∈
The expected value of the indicator function gives a probability:
E IA (X ) =
xk
IA (xk )Pr X = xk =
k:x k ∈A
1 · Pr X = xk = Pr X ∈ A
4.4 Moment Generating Functions 83
In the case of Chebyshev’s inequality, let A = |X − μ| ≥ . Then, the indicator function obeys an inequality, for all values of X: I|X −μ|≥ ≤
(X − μ)2 2
This is shown in Figure 4.2, in which two functions are equal when X − μ = ±.
FIGURE 4.2 Illustration of the inequality used in proving Chebyshev’s inequality.
Taking expected values gives the inequality:
E I|X −μ|≥ = Pr |X − μ| ≥ ≤ E
(X − μ)2 2
=
Var X 2
Note that Chebyshev’s inequality works for any random variable, whether discrete or continuous, that has a finite variance. Its advantage is its ubiquity; it always applies. But this is also a disadvantage. Chebyshev’s inequality is often loose; tighter bounds can be found by making additional assumptions about the random variable, such as assuming the random variable has a known distribution.
Comment 4.7: If Var X → 0, then Chebyshev’s inequality implies Pr |X − μ| ≥ → 0 for any > 0. In simple terms, if the variance is 0, then the random variable has no spread about the mean; that is, the random variable equals its mean with probability 1. We use this result several times later in the book.
4.4 MOMENT GENERATING FUNCTIONS For many PMFs, computing expected values is not easy. The sums may be difficult to compute. One alternative that often helps is to use a moment generating function (MGF), denoted MX (u): MX (u) = E euX = euk p(k)
(4.17)
k
When it is clear which random variable is being discussed, we will drop the subscript on MX (u) and write simply M (u).
84 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
The MGF is the Laplace transform of the PMF except that s is replaced by −u. Apart from this (mostly irrelevant) difference, the MGF has all the properties of the Laplace transform. Later on, we will use the fact that the MGF uniquely determines the PMF, just as the Laplace transform uniquely determines the signal. In signal processing, we rarely compute moments of signals, but in probability, we often compute moments of PMFs. We show how the MGF helps compute moments by two different arguments. The first argument begins by expanding the exponential in a Maclaurin series∗ and taking expected values, term by term:* u2 X 2 u3 X 3 + + ··· 2! 3! u2 X 2 u3 X 3 M (u) = E euX = E 1 + E uX + E +E + ··· 2! 3! 2 3 u u = 1 + uE X + E X 2 + E X 3 + · · · 2! 3! euX = 1 + uX +
Now, take a derivative with respect to u: u2 d M (u) = 0 + E X + uE X 2 + E X 3 + · · · du 2!
Finally, set u = 0: d M (u) = 0 + E X + 0 + 0 + ··· = E X u=0 du
Notice how the only term that “survives” both steps (taking the derivative and setting u = 0) is E X . The second moment is found by taking two derivatives, then setting u = 0:
E X2 =
d2 M u ( ) u=0 du2
In general, the kth moment can be found as
E Xk =
dk M u ( ) u=0 duk
(4.18)
The second argument showing why the MGF is useful in computing moments is to use properties of expected values directly: d d d uX E e euX = E XeuX =E M (u) = du du du
d M (u) = E Xe0X = E X u=0 du
The Maclaurin series of a function is f (x) = f (0) + f (0)x + f (0)x2 /2! + · · · . See Wolfram Mathworld, http://mathworld.wolfram.com/TaylorSeries.html. *
4.5 Several Important Discrete PMFs 85
Similarly, for the kth moment, dk dk uX M u = E e = E X k euX u=0 = E X k ( ) k k u = 0 u = 0 du du
The above argument also shows a useful fact for checking your calculations: M (0) = E e0X = E 1 = 1
Whenever you compute an MGF, take a moment and verify M (0) = 1. The MGF is helpful in computing moments for many, but not all, PMFs (see Example 4.5). For some distributions, the MGF either cannot be computed in closed form or evaluating the derivative is tricky (requiring L’Hospital’s rule, as discussed in Example 4.5). A general rule is to try to compute moments by Equation (4.3) first, then try the MGF if that fails. Comment 4.8: Three transforms are widely used in signal processing: the Laplace transform, the Fourier transform, and the Z transform. In probability, the analogs are MGF, the characteristic function, and the generating function. The MGF is the Laplace transform with s replaced by −u: M (u) = E euX = euk p(k) = L (−u) k
The characteristic function is the Fourier transform with the sign of the exponent reversed: C (ω) = E ejωX = ejωk p(k) = F − ω k
The generating function (or the probability generating function) is the Z transform with z−1 replaced by s: G (s) = E sX = sk p(k) = Z s−1 k
Mathematicians use the generating function extensively in analyzing combinatorics and discrete probabilities. In this text, we generally use the MGF in analyzing both discrete and continuous probabilities and the Laplace, Fourier, or Z transform, as appropriate, for analyzing signals.
4.5 SEVERAL IMPORTANT DISCRETE PMFs There are many discrete PMFs that appear in various applications. Four of the most important are the uniform, geometric, Poisson, and binomial PMFs. The uniform, geometric, and Poisson are discussed below. The binomial is presented in some detail in Chapter 6.
86 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
4.5.1 Uniform PMF For a uniform PMF, all nonzero probabilities are the same. Typically, X takes on integral values, with k = 1,2, . . . ,m:
Pr X = k =
1 m
for k = 1,2, . . . ,m for all other k
0
The uniform PMF with m = 6 (e.g., for a fair die) is illustrated in the stem plot below: 0.167
Pr X = k
1
3
k
6
Computing the mean and variance is straightforward (using Equations 4.6 and 4.7):
E X =
m
k
k=1
m 1 1 m(m + 1) m + 1 1 = k= · = m m k=1 m 2 2
(4.19)
The mean is the weighted average of the PMF values (note that if m is even, the mean is not one of the samples):
E X2 =
m k=1
k2
m 1 1 1 (2m + 1)(m + 1)m (2m + 1)(m + 1) = k2 = · = m m k=1 m 6 6
Now, compute the variance: 2 (2m + 1)(m + 1) (m + 1)2 (m + 1)(m − 1) σ2x = E X 2 − E X = − =
6
EXAMPLE 4.5
4
12
(4.20)
The uniform distribution is a bad example for the utility of computing moments with the MGF, requiring a level of calculus well beyond that required elsewhere in the text. With that said, here goes: M (u) =
m 1 euk m k=1
Ignore the 1/m for now, and focus on the sum. Multiplying by (eu − 1) allows us to sum the telescoping series. Then, set u = 0 to check our calculation: (eu − 1)
m
euk =
k=1 m k=1
m +1 k=2
euk =
euk −
m k=1
eu(m+1) − eu eu − 1
euk = eu(m+1) − eu
4.5 Several Important Discrete PMFs 87
e k=1 m
uk
eu(m+1) − eu 1−1 0 = = eu − 1 u=0 1 − 1 0
= u=0
Warning bells should be clanging: danger ahead, proceed with caution! In situations like this, we can use L’Hospital’s rule to evaluate the limit as u → 0. L’Hospital’s rule says differentiate the numerator and denominator separately, set u = 0 in each, and take the ratio: e k=1 m
uk
= u=0
d du
eu(m+1) − eu u=0
d du
eu − 1
=
(m + 1)eu(m+1) − eu ) u=0
u=0
eu |
u=0
=
m+1−1 =m 1
After dividingbym, the check works: M (0) = 1. Back to E X : take a derivative of M (u) (using the quotient rule), then set u = 0:
d meu(m+2) − (m + 1)eu(m+1) + eu M (u) = 2 du u=0 m eu − 1
= u=0
02 02
Back to L’Hospital’s rule, but this time we need to take two derivatives of the current numerator and denominator separately and set u = 0 (two derivatives are necessary because one still results in 0/0):
E X =
m(m + 2)2 − (m + 1)3 + 1 m(m + 1) m + 1 = = 2m 2m 2
Fortunately, the moments of the uniform distribution can be calculated easily by the direct formula as the MGF is surprisingly complicated. Nevertheless, the MGF is often easier to use than the direct formula.
4.5.2 Geometric PMF The geometric PMF captures the notion of flipping a coin until a head appears. Let p be the probability of a head on an individual flip, and assume the flips are independent (one flip does not affect any other flips). A sequence of flips might look like the following: 00001. This sequence has four zeros followed by a 1. In general, a sequence of length k will have k − 1 zeros followed by a 1. Letting X denote the number of flips required, the PMF of X is
Pr X = k =
p(1 − p)k−1 0
for k = 1,2, . . . k≤0
The PMF values are clearly nonnegative. They sum to 1 as follows: ∞
k=1
(1 − p)k−1 p = p
∞
l=0
(1 − p)l =
p p = =1 1 − (1 − p) p
88 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
The first 12 values of a geometric PMF with p = 0.3 are shown below:
0.3
Pr X = k versus k with p = 0.3
0.3 × 0.7 = 0.21 0.3 × 0.72 = 0.147 1
5
k
10
The mean is difficult to calculate directly:
E X =
∞
kp(1 − p)k−1
(4.21)
k=1
It is not obvious how to compute the sum. However, it can be calculated using the MGF: M (u) = E euX ∞ euk p(1 − p)k−1 = k=1
= peu
∞
eu(k−1) (1 − p)k−1
k=1
= peu
∞
(1 − p)eu
l
(changing variable l = k − 1)
(4.22)
l=0
Let r = (1 − p)eu , and using Equation (4.8), ∞ ∞ l (1 − p)eu = r l
l=0
(substituting r = (1 − p)eu )
l=0
=
1 1−r
=
1 1 − (1 − p)eu
for |r| < 1
Comment 4.9: We should verify that |r | = |(1 − p)eu | < 1 in a neighborhood around u = 0 (to show the series converges and that we can take a derivative of the MGF). Solving for u results in u < log(1/(1 − p)). If p > 0, log(1/(1 − p)) > 0. Thus, the series converges in a neighborhood of u = 0. We can therefore take the derivative of the MGF at u = 0.
Substituting this result into Equation (4.22) allows us to finish calculating M (u): M (u) =
−1 peu e− u · − u = p e− u − 1 + p u 1 − (1 − p)e e
Check the calculation: M (0) = p/(e−0 − 1 + p) = p/(1 − 1 + p) = p/p = 1.
(4.23)
4.5 Several Important Discrete PMFs 89
Computing the derivative and setting u = 0, −2 d M (u) = p(−1) e−u − 1 + p (−1)e−u du p 1 d E X = M (u) = 2= du p p u=0
On average, it takes 1/p flips to get the first head. If p = 0.5, then it takes an average of 2 flips; if p = 0.1, it takes an average of 10 flips. Taking a second derivation of M (u) and setting u = 0 yield E X 2 = (2 − p)/p. The variance can now be computed easily: 2 2 − p 1 1−p σ2X = E X 2 − E X = 2 − 2 = 2
p
p
p
The standard deviation is the square root of the variance, σX = (1 − p) p. A sequence of 50 random bits with p = 0.3 is shown below: 10100010100011110001000001111101000000010010110001 The runs end with the 1’s. This particular sequence has 20 runs: 1 · 01 · 0001 · 01 · 0001 · 1 · 1 · 1 · 0001 · 000001 · 1 · 1 · 1 · 1 · 01· 00000001 · 001 · 01 · 1 · 0001 The run lengths are the number of digits in each run: 12424111461111283214 The average run length of this particular sequence is 1 + 2 + 4 + 2 + 4 + 1 + 1 + 1 + 4 + 6 + 1 + 1 + 1 + 1 + 2 + 8 + 3 + 2 + 1 + 4 50 = = 2.5 20 20
This average is close to the expected value, E X = 1 p = 1 0.3 = 3.33. EXAMPLE 4.6
When a cell phone wants to connect to the network (e.g., make a phone call or access the Internet), the phone performs a “random access” procedure. The phone transmits a message. If the message is received correctly by the cell tower, the tower responds with an acknowledgment. If the message is not received correctly (e.g., because another cell phone is transmitting at the same time), the phone waits a random period of time and tries again. Under reasonable conditions, the success of each try can be modeled as independent Bernoulli random variables with probability p. Then, the expected number of tries needed is 1/p. In actual practice, when a random access message fails, the next message usually is sent at a higher power (the original message might not have been “loud” enough to be heard). In this case, the tries are not independent with a common probability of success, and the analysis of the expected number of messages needed is more complicated.
90 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
Comment 4.10: People are notoriously bad at generating random sequences by themselves. Let us demonstrate this. Write down a “random” sequence of 1’s and 0’s with a length of at least 50 (more is better) and ending with a 1 (probability of a 1 is 0.5). Do not use coins or a computer; just let the sequence flow from your head. Compute the run lengths, and compare the lengths of your runs to the expected lengths (half the runs should be length 1, one quarter of length 2, etc.). If you are like most people, you will have too many short runs and too few long runs.
4.5.3 The Poisson Distribution The Poisson distribution is widely used to describe counting experiments, such as the number of cancers in a certain area or the number of phone calls being made at any given time. The PMF of a Poisson random variable with parameter λ is
Pr X = k = p(k) =
λk
k!
e−λ for k = 0,1,2, . . .
(4.24)
The first 14 points of a Poisson PMF with λ = 5 are shown below:
0.175
λ=5
Pr X = k
0
4 5
9
k
13
To show the Poisson probabilities sum to 1, we start with the Taylor series for eλ : eλ = 1 + λ +
λ2
2!
+
λ3
3!
+ ···
Now, multiply both sides by e−λ : 1 = e−λ eλ = e−λ + λe−λ +
λ2
2!
e−λ +
λ3
3!
e−λ + · · ·
The terms on the right are the Poisson probabilities. The moments can be computed with the MGF: M (u) = E euX ∞ = euk p(k) k=0
=
∞
k=0
euk e−λ
λk
k!
(definition of MGF) (definition of expected value)
p(k) =
λk e−λ
k!
4.5 Several Important Discrete PMFs 91
= e−λ
∞ λe u k k=0
= eλ(e
u
(sum is power series for eλe )
k!
u −1)
(4.25)
The final form in Equation (4.25) is a bit peculiar, but it is easy to use. We can get moments by taking derivatives and setting u = 0, as follows: d λ(eu −1) λ(eu −1) d d u e M (u) = =e (λ(eu − 1)) = eλ(e −1) λeu du du du Setting u = 0,
E N =
d 0 M (u) = eλ(e −1) λe0 = λ du u=0
(4.26)
Similarly, one can compute E N 2 by taking two derivatives and then setting u = 0:
E N 2 = λ + λ2
2
2
The variance is σ2 = E N − E N mean and variance are both λ:
(4.27)
= λ + λ2 − λ2 = λ. The Poisson is unusual in that the
Var N = λ Taking ratios of successive p(k) gives a method to find the largest p(k) and a convenient method for computing the p(k): p(k) λ e−λ λk /k! = −λ k−1 = p(k − 1) e λ /(k − 1)! k Thus, p(k) > p(k − 1) if λ > k, p(k) = p(k − 1) if λ = k, and p(k) < p(k − 1) if λ < k. For instance, in the example above, λ = 5. The sequence increases from k = 0 to k = 4, is constant from k = 4 to k = 5, and decreases for k > 5. The ratio can be rearranged to yield a convenient method for computing the p(k) sequence: p(0) = e−λ ; for k = 1,2,. . .do p(k) =
λ
k
· p(k − 1);
end A sequence of 30 Poisson random variables with λ = 5 is 5 6 6 5 8 5 7 3 4 4 5 7 8 6 5 6 7 8 5 8 2 2 12 1 7 4 3 4 4 8 There are one 1, two 2’s, two 3’s, five 4’s, six 5’s, four 6’s, four 7’s, five 8’s, and one 12. For example, the probability of a 5 is 0.175. In a sequence of 30 random variables, we would expect to get about 30 × 0.175 = 5.25 fives. This sequence has six, which is close to
92 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
the expected number. As another example, the expected number of 2’s is
30 · Pr X = 2 = 30 ·
52 −5 e = 2.53 2!
The actual number is 2, again close to the expected number. It is not always true, however, that the agreement between expected and actual is so good. For instance, this sequence has five 8’s, while the expected number of 8’s is
30 · Pr X = 8 = 30 ·
58 −5 e = 1.95 8!
The sequence is random after all. While we expect the count of each number to be approximately equal to the expected number, we should not expect to observe exactly the “right” number. This example looked ahead a bit to multiple random variables. That is the topic of the next chapter. EXAMPLE 4.7
The image sensor in a digital camera is made up of pixels. Each pixel in the sensor counts the photons that hit it. If the shutter is held open for t seconds, the number of photons counted is Poisson with mean λt. Let X denote the number of photons counted. Then,
E X = λt
Var X = λt
2 The camera computes an average Z = X t. Therefore, E Z = λ and Var Z = (1/t )Var X = λ t. The performance of systems like this is typically measured as the signal-to-noise ratio (SNR), defined as the average power of the signal divided by the average of the noise power. We can write Z in signal plus noise form as
Z = λ + Z − λ = λ + N = signal + noise where λ is the signal and N = Z − λ is the noise. Put another way, SNR =
signal power λ2 λ2 = = λt = noise power Var N λ t
We see the SNR improves as λ increases (i.e., as there is more light) and as t increases (i.e., longer shutter times). Of course, long shutter times only work if the subject and the camera are reasonably stationary.
4.6 GAMBLING AND FINANCIAL DECISION MAKING Decision making under uncertainty is sometimes called gambling, or sometimes investing, but is often necessary. Sometimes we must make decisions about future events that we can only partially predict. In this section, we consider several examples.
4.6 Gambling and Financial Decision Making 93
Imagine you are presented with the following choice: spend 1 dollar to purchase a ticket, or not. With probability p, the ticket wins and returns w + 1 dollars (w dollars represent your winnings, and the extra 1 dollar is the return of your original ticket price). If the ticket loses, you lose your original dollar. Should you buy the ticket? This example represents many gambling situations. Bettors at horse racing buy tickets on one or more horses. Gamblers in casinos can wager on various card games (e.g., blackjack or baccarat), dice games (e.g., craps), roulette, and others. People can buy lottery tickets or place bets on sporting events. Let X represent the gambler’s gain (or loss). With probability p, you win w dollars, and with probability, 1 − p you lose one dollar. Thus, X=
−1
with probability 1 − p with probability p
w
E X = pw + (1 − p)(−1) = pw − (1 − p)
E X 2 = pw2 + (1 − p)(−1)2
2
Var X = E X 2 − E X
= p(1 − p)(w + 1)2
The wager is fair if E X = 0. If E X > 0, you should buy the ticket. On average, your win exceeds the cost. Conversely, if E X < 0, do not buy the ticket. Let pe represent the break-even probability for a given w and we the break-even win for a given p. Then, at the break-even point,
E X = 0 = pw − (1 − p) Rearranging yields we = (1 − p)/p or pe = 1/(w + 1). The ratio (1 − p)/p is known as the odds ratio. For example, if w = 3, the break-even probability is pe = 1/(3 + 1) = 0.25. If the actual probability is greater than this, make the bet; if not, do not. EXAMPLE 4.8
Imagine you are playing Texas Hold ‘em poker and know your two cards and the first four common cards. There is one more common card to come. You currently have a losing hand, but you have four cards to a flush. If the last card makes your flush, you judge you will win the pot. There are w dollars in the pot, and you have to bet another dollar to continue. Should you? Since there are 13 − 4 = 9 cards remaining in the flush suit (these are “outs” in poker parlance) and 52 − 6 = 46 cards overall, the probability of making the flush is p = 9/46 = 0.195. If the pot holds more than we = (1 − p)/p = 37/9 = 4.1 dollars, make the bet.
Comment 4.11: In gambling parlance, a wager that returns a win of w dollars for each dollar wagered offers w-to-1 odds. For example, “4-to-1” odds means the break-even probability is pe = 1/(4 + 1) = 1/5. Some sports betting, such as baseball betting, uses money lines. A money line is a positive or negative number. If positive, say, 140, the money line represents the
94 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
gambler’s potential win on a $100 bet; that is, the gambler bets $100 to win $140. If negative, say, −120, it represents the amount of money a gambler needs to wager to potentially win $100; that is, a $120 bet on the favorite would win $100 if successful. A baseball game might be listed as −140,+120, meaning a bet on the favorite requires $140 to win $100 and a bet of $100 on the underdog might win $120. (That the numbers are not the same represents a profit potential to the bookmaker.)
When placing a bet, the gambler risks a sure thing (the money in his or her pocket) to potentially win w dollars. The bookmaker will adjust the payout w so that, on average, the expected return to the gambler is negative. In effect, the gambler trades a lower expected value for an increased variance. For a typical wager,
E X = wp − (1 − p) < 0
Var X = p(1 − p)(w + 1)2 > 0 Buying insurance is the opposite bet, trading a lower expected value for a lower variance. Let c be the cost of the policy, p the probability of something bad happening (e.g., a tree falling on your house), and v the cost when the bad thing happens. Typically, p is small, and v relatively large. Before buying insurance, you face an expected value of −vp and a variance of v2 p(1 − p). After buying insurance, your expected value is −c whether or not the bad thing happens, and your variance is reduced to 0. Buying insurance reduces your expected value by c − vp > 0 but reduces your variance from v2 p(1 − p) to 0. In summary, buying insurance is a form of gambling, but the trade-off is different. The insurance buyer replaces a possible large loss (of size v) with a guaranteed small loss (of size c). The insurance broker profits, on average, by c − vp > 0. Comment 4.12: Many financial planners believe that reducing your variance by buying insurance is a good bet (if the insurance is not too costly). Lottery tickets (and other forms of gambling) are considered to be bad bets, as lowering your expected value to increase your variance is considered poor financial planning. For many people, however, buying lottery tickets and checking for winners is fun. Whether or not the fun factor is worth the cost is a personal decision outside the realm of probability. It is perhaps a sad commentary to note that government-run lotteries tend to be more costly than lotteries run by organized crime. For instance, in a “Pick 3” game with p = 1/1000, the typical government payout is 500, while the typical organized crime payout is 750. In real-world insurance problems, both p and v are unknown and must be estimated. Assessing risk like this is called actuarial science. Someone who does so is an actuary.
Summary 95
SUMMARY
A random variable is a variable whose value depends on the result of a random experiment. A discrete random variable can take on one of a discrete set of outcomes. Let X be a discrete random variable, and let xk for k = 0,1,2, . . . be the discrete outcomes. In many experiments, the outcomes are integers; that is, xk = k. A probability mass function (PMF) is the collection of discrete probabilities, p(k):
p(k) = Pr X = xk
The p(k) satisfy two important properties: p(k) ≥ 0
∞
and
p(k) = 1
k=0
A cumulative distribution function (CDF) measures the probability that X ≤ u for all values of u:
FX (u) = Pr X ≤ u =
p(k)
k: x k ≤u
For all u < v, 0 = FX (−∞) ≤ FX (u) ≤ FX (v) ≤ FX (∞) = 1 Expected values are probabilistic averages:
E g (X ) =
∞
g (xk )p(k)
k=0
Expected values are also linear:
E ag1 (X ) + bg2 (X ) = aE g1 (X ) + bE g2 (X ) However, they are not multiplicative in general:
E g1 (X )g2 (X ) = E g1 (X ) E g2 (X ) The mean is the probabilistic average value of X: ∞ μx = E X = xk p(k)
k=0
The variance of X is a measure of spread about the mean: σ2x = Var X = E (X − μx )2 = E X 2 − μ2x
The moment generating function (MPF) is the Laplace transform of the PMF (except that the sign of the exponent is flipped): ∞ MX (u) = E euX = euxk p(k)
k=0
96 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
The MGF can be used for computing moments:
E Xk =
dk M u ( ) u=0 duk
Four important discrete distributions are the Bernouli, uniform, geometric, and Poisson:
• The Bernoulli PMF models binary random variables with Pr X = 1 = p and Pr X = 0 = 1 − p, E X = p, and Var X = p(1 − p). • The uniform PMF are equally likely; that is, p(k) = 1/n for is used when the outcomes k = 1,2, . . . ,n, E X = (n + 1)/2, an Var X = (n + 1)(n − 1)/12. • The geometric PMF is used to model the number of trials needed for a result to occur (i.e., k−1 ( k ) = p ( 1 − p ) for k = 1,2, . . . , E X = 1/p, the number of flips required to get a heads): p and Var X = (1 − p)/p2 . • The Poisson distribution is used in many counting experiments, such as counting the number of cancers in a city: p(k) = λk e−λ /k! for k = 0,1,2, . . ., E X = λ, and Var X = λ.
PROBLEMS
4.1 X has the probabilities listed in the table below. What are E X and Var X ? k Pr X = k
1 0.5
2 0.2
3 0.1
4 0.2
4.2 For the probabilities in Problem 4.1, compute the MGF, and use it to compute the mean and variance. 4.3 X has the probabilities listed in the table below. What is the CDF of X? k Pr X = k
1 0.5
2 0.2
3 0.1
4 0.2
4.4 For the probabilities in Problem 4.3, compute the MGF, and use it to compute the mean and variance.
4.5 X has the probabilities listed in the table below. What are E X and Var X ? k Pr X = k
1 0.1
2 0.2
3 0.3
4 0.2
5 0.2
4.6 X has the probabilities listed in the table below. What is the CDF of X? k Pr X = k
1 0.1
2 0.2
3 0.3
4 0.2
5 0.2
Problems 97
4.7
Imagine you have a coin that comes up heads with probability 0.5 (and tails with probability 0.5). a. How can you use that coin to generate a bit with the probability of a 1 equal to 0.25? b. How might you generalize this to a probability of k/2n for any k between 0 and 2n ? (Hint: you can flip the coin as many times as you want and use those flips to determine whether the generated bit is 1 or 0.)
4.8
If X is uniform on k = 1 to k = m: a. What is the distribution function of X? b. Plot the distribution function.
4.9
We defined the uniform distribution on k = 1,2, . . . ,m. In some cases, the uniform PMF is defined on k = 0,1, . . . ,m − 1. What are the mean and variance in this case?
4.10 Let N be a geometric random variable with parameter p. What is Pr N ≥ k for arbitrary integer k > 0? Give a simple interpretation of your answer.
4.11 Let N be a geometric random variable with parameter p. Calculate Pr N = l N ≥ k for l ≥ k.
4.12 Let N be a geometric random variable with parameter p. Calculate Pr N ≤ M for M, a positive integer, and Pr N = k N ≤ M for k = 1,2, . . . ,M.
4.13 Let N be a geometric random variable with parameter p = 1/3. Calculate Pr N ≤ 2 , Pr N = 2 , and Pr N ≥ 2 . 4.14 Let N be a Poisson variable with parameter λ = 1. Calculate and plot Pr N = 0 , random Pr N = 1 , . . . ,Pr N = 6 . 4.15 Let N be a Poisson variable with parameter λ = 2. Calculate and plot Pr N = 0 , random Pr N = 1 , . . . ,Pr N = 6 . 4.16 Using plotting software, plot the distribution function of a Poisson random variable with parameter λ = 1.
4.17 For N Poisson with parameter λ, show E N 2 = λ + λ2 . 4.18 The expected value of the Poisson distribution can be calculated directly:
E X =
∞
k=0
ke−λ
λk
k!
=
∞
k=1
ke−λ
λk
k!
= λe−λ
∞ λk−1 = e−λ λeλ = λ ( k − 1 ) ! k=1
a. Use this technique to compute E X (X − 1) . (This is known as the second factorial moment.) b. Use the mean and second factorial moment to compute the variance. c. You have computed the first and second factorial moments (E X and E X (X − 1) ). Continue this pattern, and guess the kth factorial moment, E X (X − 1) · · · (X − k + 1) . 4.19 Consider a roll of a fair six-sided die. a. Calculate the mean, second moment, and variance of the roll using Equations (4.19) and (4.20). b. Compare the answers above to the direct calculations in Examples 4.2 and 4.4.
98 CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
4.20 Show the variance of Y = aX + b is σ2y = a2 σ2x . Note that the variance does not depend on b. 4.21 Let X be a discrete random variable. a. Show that the variance of X is nonnegative. b. Show the even moments E X 2k are nonnegative. c. What about the odd moments? Find a PMF whose mean is positive but whose third moment is negative.
4.22 What value of a minimizes E (X − a)2 ? Show this two ways.
a. Write E (X − a)2 in terms of σ2 , μ, and a (no expected values at this point), and find the value of a that minimizes the expression. b. Use calculus and Equation (4.14) to find the minimizing value of a. 4.23 Often random variables are normalized. Let Y = (X − μx )/σx . What are the mean and variance of Y? 4.24 Generate your own sequence of 20 runs (twenty 1’s in the sequence) with p = 0.3. The Matlab command rand(1,n) 0,
λki
e−λi k! u MX i (u) = E euX i = eλi (e −1)
Pr X i = k =
(from Equation 4.24)
For S = X 1 + X 2 + · · · + X n , the MGF is MS (u) = MX 1 (u) MX 2 (u) · · · MX n (u) = e(λ1 +λ2 +···+λn )(e
u −1)
Notice the MGF of S has the same form as the MGF of X except that λ is replaced by λ1 + λ2 + · · · + λn . Since the MGF uniquely determines the PMF, we have just determined an important result: a sum of independent Poisson random variables is Poisson with parameter λ1 + λ2 + · · · + λn . When the λ values are the same, S is Poisson with parameter nλ. Comment 5.7: Adding independent random variables means convolving their PMFs or multiplying their MGFs.
5.6 SAMPLE PROBABILITIES, MEAN, AND VARIANCE So far, we have assumed we know the PMF and have calculated various probabilities and moments, focusing most of our attention on the mean and variance. However, in many applications the situation is reversed: we have observations and need to estimate probabilities and moments. Estimation like this is one branch of statistics. In this section, we study the most common estimates of the mean and variance. More statistics are presented in Chapters 10, 11, and 12. Assume we have n observations X 1 , X 2 , . . . ,X n that are IID. Typically, we get data like this from doing n repetitions of the same experiment, with the repetitions physically independent of each other. The sample or empirical estimate of probability is simply the fraction of the data that satisfies the condition. For example, suppose we want to estimate the probability p = FX (x) = Pr X ≤ x for some value x. We would use the sample probability as an estimate. First, let Y i = 1 if X i ≤ x and Y i = 0 if X i > x. Then, pˆ =
Y1 + Y2 + ··· + Yn = fraction of true samples n
This idea generalizes to other probabilities. Simply let Y i = 1 if the event is true and Y i = 0 if false. The Y i are independent Bernoulli random variables (see Example 4.1). Sums of independent Bernoulli random variables are binomial random variables, which will be studied in Chapter 6.
118 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES
The most popular estimate (by far) of the mean is the sample average. We denote the estimate of the mean as μˆ and the sample mean as X n : ˆ = Xn = μ
1 X1 + X2 + · · · + Xn n
If the X i have mean μ and variance σ2 , then the mean and variance of X n are the following:
nμ 1 E X1 + E X2 + · · · + E Xn = =μ n n 1 nσ2 σ2 Var X n = 2 Var X 1 + Var X 2 + · · · + Var X n = 2 = n n n
E Xn =
The mean of X n equals (the unknown value) μ. In other words, the expected value of the sample average is the mean of the distribution. An estimator with this property is unbiased. The variance ofX n decreases with n. As n increases (more observations are made), 2 the squared error E X n − μ = Var X n = σ2 /n tends to 0. An estimator whose variance goes to 0 as n → ∞ is consistent. This is a good thing. Gathering more data means the estimate gets better (more accurate). (See Section 10.1 for more on unbiasedness and consistency.) For instance, the sample probability above is the sample average of the Y i values. Therefore, nE Y i np = =p E pˆ = n n nVar Y i p(1 − p) = Var pˆ = 2
n
n
As expected, the sample probability is an unbiased and consistent estimate of the underlying probability. 2 denote The sample variance is a common estimate of the underlying variance, σ2 . Let σ the estimate of the variance. Then, 2 = σ
1
n
n − 1 k=1
(X − X )2
(5.12)
2 is an unbiased estimate of σ2 ; that is, E σ 2 = σ2 . σ
For example, assume the observations are 7, 4, 4, 3, 7, 2, 4, and 7. The sample average and sample variance are 7 + 4 + 4 + 3 + 7 + 2 + 4 + 7 38 = = 4.75 8 8 2 2 2 2 = (7 − 4.75) + (4 − 4.75) + · · · + (7 − 4.75) = 3.93 σ 8−1
Xn =
The data were generated from a Poisson distribution with parameter λ = 5. Hence, the sample mean is close to the distribution mean, 4.75 ≈ 5. The sample variance, 3.93, is a bit further from the distribution variance, 5, but is reasonable (especially for only eight observations).
5.7 Histograms 119
The data above were generated with the following Python code: import scipy . stats as st import numpy as np l , n = 5 ,8 x = st . poisson ( l ). rvs ( n ) muhat = np . sum ( x )/ len ( x ) sighat = np . sum (( x - muhat )**2)/( len ( x ) -1) print ( muhat , sighat ) We will revisit mean and variance estimation in Sections 6.6, 8.9, 9.7.3, and 10.2.
5.7 HISTOGRAMS Histograms are frequently used techniques for estimating PMFs. Histograms are part graphical and part analytical. In this section, we introduce histograms in the context of estimating discrete PMFs. We revisit histograms in Chapter 10. Consider a sequence X 1 , X 2 , . . . ,X n of n IID random variables, with each X i uniform on 1, 2, . . . , 6. In Figure 5.1, we show the uniform PMF with m = 6 as bars (each bar has area 1/6) and two different histograms. The first is a sequence of 30 random variables. The counts of each outcome are 4, 7, 8, 1, 7, and 3. The second sequence has 60 random variables with counts 10, 6, 6, 14, 12, and 12. In both cases, the counts are divided by the number of observations
0.167
Pr X = k
0.167
1
2
3
4
k
5
6
FIGURE 5.1 Comparison of uniform PMF and histograms of uniform observations. The histograms have 30 (top) and 60 (bottom) observations. In general, as the number of observations increases, the histogram looks more and more like the PMF.
120 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES
(30 or 60) to obtain estimates of the PMF. As the number of observations increases, the histogram estimates generally get closer and closer to the PMF. Histograms are often drawn as bar graphs. Figure 5.2 shows the same data as Figure 5.1 with the observations as a bar graph and the true PMF shown as a line.
n = 30
0.167
n = 60 0.167
1
2
3
4
k
5
6
FIGURE 5.2 Histograms drawn as bar graphs. The height of each bar equals the number of observations of that value divided by the total number of observations (30 or 60). The bottom histogram for n = 60 is more uniform than the upper histogram for n = 30.
Comment 5.8: In all three computing environments, Matlab, R, and Python, the histogram command is hist(x, options), where x is a sequence of values. The options vary between the three versions. Both Matlab and Python default to 10 equally spaced bins. The default in R is a bit more complicated. It chooses the number of bins dynamically based on the number of points and the range of values. Note that if the data are discrete on a set of integers, the bins should be integer widths, with each bin ideally centered on the integers. If the bins have non-integer widths, some bins might be empty (if they do not include an integer) or may include more integers than other bins. Both problems are forms of aliasing. Bottom line: do not rely on the default bin values for computing histograms of discrete random variables.
5.8 ENTROPY AND DATA COMPRESSION Many engineering systems involve large data sets. These data sets are often compressed before storage or transmission. The compression is either lossless, meaning the original data
5.8 Entropy and Data Compression 121
can be reconstructed exactly, or lossy, meaning some information is lost and the original data cannot be reconstructed exactly. For instance, the standard facsimile compression standard is lossless (the scanning process introduces loss, but the black-and-white dots are compressed and transmitted losslessly). The Joint Photographic Experts Group (JPEG) image compression standard is a lossy technique. Gzip and Bzip2 are lossless, while MP3 is lossy. Curiously, most lossy compression algorithms incorporate lossless methods internally. In this section, we take a slight detour, discussing the problem of lossless data compression and presenting a measure of complexity called the entropy. A famous theorem says the expected number of bits needed to encode a source is lower bounded by the entropy. We also develop Huffman coding, an optimal coding technique.
5.8.1 Entropy and Information Theory Consider a source that emits a sequence of symbols, X 1 , X 2 , X 3 , etc. Assume the symbols are independent. (If the symbols are dependent, such as for English text, we will ignore that dependence.) Let X denote one such symbol. Assume X takes on one letter, ak , from an alphabet of m letters. Example alphabets include {0,1} (m= 2), a,b,c, . . . ,z (m = 26), {0,1, . . . ,9} (m = 10), and many others. Let p(k) = Pr X = ak be thePMF of the symbols. The entropy of X, H X , is a measure of how unpredictable (i.e., how random) the data are:
H X = E − log p(X ) = −
m
p(k) log p(k)
(5.13)
k=1
Random variables with higher entropy are “more random” and, therefore, more unpredicable than random variables with lower entropy. The “log” is usually taken to base 2. If so, the entropy is measured in bits. If the log is to base e, the entropy is in nats. Whenever we evaluate an entropy, we use bits. Comment 5.9: Most calculators compute logs to base e (often denoted ln(x )) or base 10 (log10 (x )). Converting from one base to another is simple: log2 (x ) =
log(x ) log(2)
where log(x ) and log(2) are base e or base 10, whichever is convenient. For instance, log2 (8) = 3 = loge (8)/loge (2) = 2.079/0.693.
It should be emphasized the entropy is a function of the probabilities, not of the alphabet. Two different alphabets having the same probabilities have the same entropy. The entropy is nonnegative and upper bounded by log(m):
0 ≤ H X ≤ log(m)
122 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES
The lower bound follows fromthe basic notion that all probabilities are between 0 and 1, 0 ≤ p(k) ≤ 1, which implies log p(k) ≤ 0. Therefore, − log p(k) ≥ 0. The lower bound is achieved if each term in the sum is 0. This happens if p(k) = 0 (the limit as p → 0 of plog(p) is 0; see the figure below) or p(k) = 1 (since log(1) = 0). The distribution that achieves this is degenerate: one outcome has probability 1; all other outcomes have probability 0. 0.531 =
e−1 log e (2)
− p log 2 (p )
0
1
e−1
We show the upper bound in Section 5.8.4 where we solve an optimization problem: maximize H X over all probability distributions. The maximizing distribution is the uniform distribution. In this sense, the uniform distribution is the most random of all distributions over m outcomes. For example, consider an m = 4 alphabet with probabilities 0.5, 0.2, 0.1, and 0.1. Since m = 4, the entropy is upper bounded by log(4) = 2 bits. The entropy of this distribution is
H X = −0.5log2 (0.5) − 0.2log2 (0.2) − 0.1log2 (0.1) − 0.1log2 (0.1) = 1.69 bits ≤ log2 (4) = 2 bits
Entropy is a crucial concept in communications. The entropy of X measures how much information X contains. When Alice transmits X to Bob, Bob receives H X bits of information. For useful communications to take place, it is necessary that H X > 0. The special case when X is binary (i.e., m = 2) occurs in many applications. Let X be binary with Pr X = 1 = p. Then,
H X = −plog2 (p) − (1 − p) log2 (1 − p) = h(p) h(p) is known as the binary entropy function. It is shown in Figure 5.3. 1 h (p ) 0.5
0 0.11
0.5
p
0.89 1
FIGURE 5.3 The binary entropy function h(p) = −plog2 (p) − (1 − p) log2 (1 − p) versus p.
(5.14)
5.8 Entropy and Data Compression 123
The binary entropy function obeys some simple properties: • • • •
h(p) = h(1 − p) for 0 ≤ p ≤ 1 h(0) = h(1) = 0 h(0.5) = 1 h(0.11) = h(0.89) = 0.5 The joint entropy function of X and Y is defined similarly to Equation (5.13):
H X,Y = E − log p(X,Y ) = −
k
pXY (k,l) log pXY (k,l)
l
The joint entropy is bounded from above by the sum of the individual entropies:
H X,Y ≤ H X + H Y
For instance, consider X and Y as defined in Section 5.4:
3 3 3 3 3 3 log log H X = − log2 − − 12 12 12 2 12 12 2 12 2 2 1 1 log2 log2 − − 12 12 12 12 = 2.23 bits 3 3 4 4 5 5 H Y = − log2 log log − − = 1.55 bits 12 12 12 2 12 12 2 12 1 1 H X,Y = −12 × log2 = 3.58 bits 12 12 ≤ H X + H Y = 2.23 + 1.55 = 3.78 bits If X and Y are independent, the joint entropy is the sum of individual entropies. This follows because log(ab) = log(a) + log(b):
H X,Y = E − log p(X,Y )
= E − log p(X )p(Y ) = E − log p(X ) + E − log p(Y ) = H X +H Y
(definition) (independence) (additivity of E · )
In summary, the entropy is a measure of the randomness of one or more random variables. All entropies are nonnegative. The distribution that maximizes the entropy is the uniform distribution. Entropies of independent random variables add.
5.8.2 Variable Length Coding The entropy is intimately related to the problem of efficiently encoding a data sequence for communication or storage. For example, consider a five-letter alphabet a,b,c,d,e with probabilities 0.3,0.3,0.2,0.1, and 0.1, respectively. The entropy of this source is
124 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES
H X = −0.3log2 (0.3) − 0.3log2 (0.3) − 0.2log2 (0.2) − 0.1log2 (0.1) − 0.1log2 (0.1) = 2.17 bits per symbol Since this source is not uniform, its entropy is less than log2 (5) = 2.32 bits. Consider the problem of encoding this data source with binary codes. There are five symbols, so each one can be encoded with three bits (23 = 8 ≥ 5). We illustrate the code with a binary tree: 0 0
1 0
1
0
1
0
1
0
a
b
c
d
e
1 0
1
1
Code words start at the root of the tree (top) and proceed to a leaf (bottom). For instance, the sequence aabec is encoded as 000 · 000 · 001 · 100 · 010 (the “dots” are shown only for exposition and are not transmitted). This seems wasteful, however, since only five of the eight possible code words would be used. We can prune the tree by eliminating the three unused leafs and shortening the remaining branch: 0 0
1 e
1
0
1
0
1
a
b
c
d
Now, the last letter has a different coding length than the others. The sequence aabec is now encoded as 000 · 000 · 001 · 1 · 010 for a savings of two bits. Define a random variable L representing the coding length for each symbol. The average coding length is the expected value of L:
E L = 0.3 × 3 + 0.3 × 3 + 0.2 × 3 + 0.1 × 3 + 0.1 × 1 = 2.8 bits This is a savings in bits. Rather than 3 bits per symbol, this code requires an average of only 2.8 bits per symbol. This is an example of a variable length code. Different letters can have different code lengths. The expected length of the code, E L , measures the code performance. A broad class of variable length codes can be represented by binary trees, as this one is. Is this the best code? Clearly not, since the shortest code is for the letter e even though e is the least frequently occurring letter. Using the shortest code for the most frequently
5.8 Entropy and Data Compression 125
occurring letter, a, would be better. In this case, the tree might look like this: 0
1
a
0
1
0
1
0
1
b
c
d
e
The expected coding length is now
E L = 0.3 × 1 + 0.3 × 3 + 0.2 × 3 + 0.1 × 3 + 0.1 × 3 = 2.4 bits Is this the best code? No. The best code is the Huffman code, which can be found by a simple recursive algorithm. First, list the probabilities in sorted order (it is convenient, though not necessary, to sort the probabilities):
0.3 0.3 0.2 0.1 0.1 Combine the two smallest probabilities: 0.2 0.3 0.3 0.2 0.1 0.1 Do it again: 0.4 0.2 0.3 0.3 0.2 0.1 0.1 And again: 0.4 0.6
0.2
0.3 0.3 0.2 0.1 0.1
126 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES
Finally, the last step: 1.0 0.4 0.6
0.2
0.3 0.3 0.2 0.1 0.1 This is the optimal tree. The letters have lengths 2, 2, 2, 3, and 3, respectively. The expected coding length is
E L = 0.3 × 2 + 0.3 × 2 + 0.2 × 2 + 0.1 × 3 + 0.1 × 3 = 2.2 bits Comment 5.10: If there is a tie in merging the nodes (i.e., three or more nodes have the same minimal probability), merge any pair. The specific trees will vary depending on which pair is merged, but each tree will result in an optimal code. In other words, if there are ties, the optimal code is not unique.
There is a famous theorem about coding efficiencies and entropy due to Claude Shannon.1 Theorem 5.3 (Shannon, 1948): For any decodable (tree) code, the expected coding length is lower bounded by the entropy:
E L ≥H X
(5.15)
In the example above, the coding length is 2.2 bits per symbol, which is slightly greater than the entropy of 2.17 bits per sample. When does the expected coding length equal the entropy? To answer this question, we can equate the two and compare the expressions term by term:
?
E L =H X m k=1
?
p(k)l(k) =
m k=1
p(k) − log2 (p(k))
We have explicitly used log2 because coding trees are binary. We see the expressions are equal if
l(k) = − log2 p(k) 1
Claude Elwood Shannon (1916–2001) was an American mathematician, electrical engineer, and cryptographer known as “the father of information theory.”
5.8 Entropy and Data Compression 127
or, equivalently, if p(k) = 2−l(k) For example, consider an m = 4 source with probabilities 0.5, 0.25, 0.125, and 0.125. These correspond to lengths 1, 2, 3, and 3, respectively. The Huffman tree is shown below: 1.0 0.50 0.25 0.5
0.25 0.125 0.125
The expected coding length and the entropy are both 1.75 bits per symbol.
5.8.3 Encoding Binary Sequences The encoding algorithms discussed above need to be modified to work with binary data. The problem is that there is only one tree with two leaves: 0
1
1−p p
The expected length is E L = (1 − p) · 1 + p · 1 = 1 for all values of p. There is no compression gain. The trick is to group successive input symbols together to form a supersymbol. If the input symbols are grouped two at a time, the sequence 0001101101 would be parsed as 00 · 01 · 10 · 11 · 01. Let Y denote a supersymbol formed from two normal symbols. Y then takes on one of the “letters”: 00, 01, 10, and 11, with probabilities (1 − p)2 , (1 − p)p, (1 − p)p, and p2 , respectively. As an example, the binary entropy function equals 0.5 when p = 0.11 or p = 0.89. Let us take p = 0.11 and see how well pairs of symbols can be encoded. Since the letters are independent, the probabilities of pairs are the products of the individual probabilities:
Pr 00 = Pr 0 · Pr 0 = 0.89 × 0.89 = 0.79 Pr 01 = Pr 0 · Pr 1 = 0.89 × 0.11 = 0.10 Pr 10 = Pr 1 · Pr 0 = 0.11 × 0.89 = 0.10 Pr 11 = Pr 1 · Pr 1 = 0.11 × 0.11 = 0.01
128 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES
The Huffman tree looks like this: 1.0 0.21 0.11 0.79 0.1 0.1 0.01 The expected length per input symbol is
E L =
0.79 × 1 + 0.1 × 2 + 0.1 × 3 + 0.01 × 3 1.32 = = 0.66 bits per symbol 2 2
By combining two input symbols into one supersymbol, the expected coding rate drops from 1 bit per symbol to 0.66 bits per symbol, a savings of about one-third fewer bits required. However, the coding rate, 0.66 bits per symbol, is still 32% higher than the theoretical rate of h(0.11) = 0.5 bits per symbol.
5.8.4 Maximum Entropy What distribution has maximum entropy? This question arises in many applications, including spectral analysis, tomography, and signal reconstruction. Here, we consider the simplest problem, that of maximizing entropy without other constraints, and use the method of Lagrange multipliers to find the maximum. The main entropy theorem is the following:
Theorem 5.4: 0 ≤ H X ≤ log(m). Furthermore, the distribution that achieves the maximum is the uniform. To prove the upper bound, we set up an optimization problem and solve it using the method of Lagrange multipliers. Lagrange multipliers are widely used in economics, operations research, and engineering to solve constrained optimization problems. Unfortunately, the Lagrange multiplier method gets insufficient attention in many undergraduate calculus sequences, so we review it here. Consider the following constrained optimization problem:
H X =− max p(k) s
m
p(k) log p(k)
subject to
k=1
m
p(k) = 1
k=1
The function being maximized (in this case, H X ) is the objective function, and the constraint is the restriction that the probabilities sum to 1. The entropy for any distribution is upper bounded by the maximum entropy (i.e., by the entropy of the distribution that solves this optimization problem).
5.8 Entropy and Data Compression 129
Now, rewrite the constraint as 1 − m k=1 p(k) = 0, introduce a Lagrange multiplier λ, and change the optimization problem from a constrained one to an unconstrained problem:
max −
p(k) s,λ
m
m p(k) log p(k) + λ 1 − p(k)
k=1
k=1
This unconstrained optimization problem can be solved by taking derivatives with respect to each variable and setting the derivatives to 0. The Lagrange multiplier λ is a variable, so we have to differentiate with respect to it as well. First, note that p d − plog(p) = − log(p) − = − log(p) − 1 dp p
The derivative with respect to p(l) (where l is arbitrary) looks like
0 = − log p(l) − 1 − λ for all l = 1,2, . . . ,m Solve this equation for p(l): p(l) = e−λ−1
for all l = 1,2, . . . ,m
Note that p(l) is a constant independent of l (the right-hand side of the equation above is not a function of l). In other words, all the p(l) values are the same. The derivative with respect to λ brings back the constraint: m
0 = 1−
p(k)
k=1
Since the p’s are constant, the constant must be p(l) = 1/m. The last step is to evaluate the entropy:
H X ≤−
m
k=1 m
1 1 log m m k=1 mlog(m) = m = log(m)
=−
p(k) log p(k)
Thus, the theorem is proved. Let us repeat the main result. The entropy is bounded as follows:
0 ≤ H X ≤ log(m) The distribution that achieves the lower bound is degenerate, with one p(k) = 1 and the rest equal to 0. The distribution that achieves the maximum is the uniform distribution, p(k) = 1/m for k = 1,2, . . . ,m.
130 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES
Comment 5.11: The maximum entropy optimization problem above also includes inequality constraints: p(k) ≥ 0. We ignored those (treated them implicitly) and solved the problem anyway. The resulting solution satisfies these constraints, p(k) = 1/m > 0, thus rewarding our laziness. Had the optimal solution not satisfied the inequality constraints, we would have had to impose the inequality constraints explicitly and solve the optimization problem with more sophisticated searching algorithms. As one might gather, such problems are substantially harder.
EXAMPLE 5.5
Here is a simple example to demonstrate the Lagrange multiplier method. Minimize x2 + y2 subject to x + 2y = 3. Using a Lagrange multiplier, λ, convert the problem to an optimization over three variables:
min x2 + y2 + λ(3 − x + 2y) x,y,λ
Differentiate the function with respect to each of the three variables, and obtain three equations: 0 = 2x − λ 0 = 2y − 2λ 3 = x + 2y The solution is x = 3/5, y = 6/5, and λ = 6/5, as shown in Figure 5.4.
(0.6,1.2)
x + 2y = 3
x2 + y2 = 1.8 FIGURE 5.4 Illustration of a Lagrange multiplier problem: Find the point on the line x + 2y= 3 that minimizes the distance to the origin. That point is (0.6,1.2), and it lies at a distance 1.8 from the origin.
Summary 131
SUMMARY
Let X and Y be two discrete random variables. The joint probability mass function is
pXY (k,l) = Pr X = xk ∩ Y = yl
for all values of k and l. The PMF values are nonnegative and sum to 1: pXY (k,l) ≥ 0 ∞ ∞
k=0 j=0
pXY (k,l) = 1
The marginal probability mass functions are found by summing over the unwanted variable:
l
pX (k) = Pr X = k =
pY (l) = Pr Y = l =
l
Pr X = xk ∩ Y = yl =
Pr X = xk ∩ Y = yl =
k
pXY (k,l)
pXY (k,l)
k
The joint distribution function is
FXY (u,v) = Pr X ≤ u ∩ Y ≤ v
X and Y are independent if the PMF factors: pXY (k,l) = pX (k)pY (l) for all k and l or, equivalently, if the distribution function factors: FXY (u,v) = FX (u)FY (v)
for all u and v
The expected value of g (X,Y ) is the probabilistic average:
E g (X,Y ) =
k
g (xk ,yl )pXY (k,l)
l
The correlation of X and Y is rxy = E XY if rxy = μx μy . The . X and Y are uncorrelated covariance of X and Y is σxy = Cov X,Y = E (X − μx )(Y − μy ) = rxy − μx μy . If X and Y are uncorrelated, then σxy = 0. Let Z = aX + bY. Then,
E Z = aE X + bE Y
Var Z = a2 Var X + 2abCov X,Y + b2 Var Y
If X and Y are independent, the PMF of S = X + Y is the convolution of the two marginal PMFs, and the MGF of S is the product of the MGFs of X and Y: pS (n) =
px (k)py (n − k)
k
MS (u) = MX (u) MY (v)
132 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES
Probabilities and moments can be estimated from samples. Let X i be n IID samples, and let Y i = 1 if the event is true and Y i = 0 if the event is false. Then, Y1 + Y2 + ··· + Yn n 1 ˆ = Xn = X1 + X2 + · · · + Xn μ n n 2 = 1 (X − X )2 σ n − 1 k=1 pˆ =
The entropy is a measure of the randomness of X:
H X = E − log p(X ) = −
p(k) log p(k)
k
If X has m outcomes, the entropy is between 0 and log(m); that is, 0 ≤ H (X ) ≤ log(m). The distribution that achieves the upper bound is the uniform distribution. Maximum entropy problems can often be solved by the method of Lagrange multipliers. The log is usually taken to base 2, log2 (x) = log(x)/ log(2). The expected length of a lossless compression code is lower bounded by the entropy:
E L ≥H X
The Huffman code is an optimal code. It builds a code tree by repeatedly combining the two least probable nodes.
PROBLEMS 5.1 Let X and Y have the following joint PMF:
y
2 1 0
0.1 0.1 0.0 0
0.0 0.0 0.1 1
0.1 0.1 0.2 2
0.0 0.1 0.2 3
x a. What are the marginal PMFs of X and Y? b. What are the conditional probabilities (computed directly) of X given Y and Y given X (compute them directly)? c. What are the conditional probabilities of Y given X from the conditional probabilities of X given Y using Bayes theorem? 5.2 Using the probabilities in Problem 5.1:
a. What are E X , E Y , Var X , Var Y , and Cov X,Y ? b. Are X and Y independent?
Problems 133
5.3
Let X and Y have the following joint PMF: y
1 0
0.1 0.0 0
0.1 0.1 1
0.1 0.2 2
0.1 0.3 3
x a. What are the marginal PMFs of X and Y? b. What are the conditional probabilities (computed directly) of X given Y and Y given X (compute them directly)? c. What are the conditional probabilities of Y given X from the conditional probabilities of X given Y using Bayes theorem? 5.4
Using the probabilities in Problem 5.3:
a. What are E X , E Y , Var X , Var Y , and Cov X,Y ? b. Are X and Y independent? 5.5
Continue the example in Section 5.4, and consider the joint transformation, U = min(X,Y ) (e.g., min(3,2) = 2), and W = max(X,Y ). For each transformation: a. What are the level curves (draw pictures)? b. What are the individual PMFs of U and W? c. What is the joint PMF of U and W?
5.6 Continue the example in Section 5.4, and consider the joint transformation V = 2X − Y and W = 2Y − X. For each transformation:
a. What are the level curves (draw pictures)?
b. What are the individual PMFs of V and W?
c. What is the joint PMF of V and W?
5.7 X and Y are jointly distributed as in the figure below. Each dot is equally likely.

[Figure: scatter of equally likely (X,Y) dots; X ranges over 0 to 4 and Y over 0 to 2.]

a. What are the first-order PMFs of X and Y?
b. What are E[X] and E[Y]?
c. What is Cov[X,Y]? Are X and Y independent?
d. If W = X − Y, what are the PMF of W and the mean and variance of W?
5.8 Find a joint PMF for X and Y such that X and Y are uncorrelated but not independent. (Hint: find a simple table of PMF values, as in Example 5.1, such that X and Y are uncorrelated but not independent.)
5.9 Prove Theorem 5.1.
5.10 Let $S = X_1 + X_2 + X_3$, with each $X_i$ IID uniform on the outcomes k = 1,2,3,4. What is the PMF of S?
5.11 What are the first four terms of the convolution of the infinite sequence [1/2, 1/4, 1/8, 1/16, ...] with itself?
5.12 What is the convolution of the infinite sequence [1, 1, 1, ...] with itself?
5.13 If X and Y are independent random variables with means $\mu_x$ and $\mu_y$, respectively, and variances $\sigma_x^2$ and $\sigma_y^2$, respectively, what are the mean and variance of Z = aX + bY for constants a and b?
5.14 Let $X_1$, $X_2$, and $X_3$ be IID Bernoulli random variables with $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p$. What are the PMF, mean, and variance of $S = X_1 + X_2 + X_3$?
5.15 Let $X_1$ and $X_2$ be independent geometric random variables with the same p. What is the PMF of $S = X_1 + X_2$?
5.16 Let $X_1$ and $X_2$ be independent Poisson random variables with the same λ. What is the PMF of $S = X_1 + X_2$?
5.17 Suppose $X_1$, $X_2$, and $X_3$ are IID uniform on k = 0,1,2,3 (i.e., $\Pr[X_i = k] = 0.25$ for k = 0,1,2,3). What is the PMF of $S = X_1 + X_2 + X_3$?
5.18 Generate a sequence of 50 IID Poisson random variables with λ = 5. Compute the sample mean and variance, and compare these values to the mean and variance of the Poisson distribution.
5.19 Generate a sequence of 100 IID Poisson random variables with λ = 10. Compute the sample mean and variance, and compare these values to the mean and variance of the Poisson distribution.
5.20 A sequence of 10 IID observations from a U(0,1) distribution are the following: 0.76, 0.92, 0.33, 0.81, 0.37, 0.05, 0.19, 0.10, 0.09, 0.31. Compute the sample mean and sample variance of the data, and compare these to the mean and variance of a U(0,1) distribution.
5.21 In the m = 5 Huffman coding example in Section 5.8.2, we showed codes with efficiencies of 3, 2.8, 2.4, and 2.2 bits per symbol.
a. Can you find a code with an efficiency of 2.3 bits per symbol?
b. What is the worst code (tree with five leaves) for these probabilities you can find?
5.22 Let a four-letter alphabet have probabilities p = [0.7, 0.1, 0.1, 0.1].
a. What is the entropy of this alphabet?
b. What is the Huffman code?
c. What is the Huffman code when symbols are taken two at a time?
5.23 Continue the binary Huffman coding example in Section 5.8.3, but with three input symbols per supersymbol.
a. What is the Huffman tree?
b. What is the expected coding length?
c. How far is this code from the theoretical limit?
5.24 Write a program to compute the Huffman code for a given input probability vector.
5.25 What is the entropy of a geometric random variable with parameter p?
5.26 Let X have mean $\mu_x$ and variance $\sigma_x^2$. Let Y have mean $\mu_y$ and variance $\sigma_y^2$. Let Z = X with probability p and Z = Y with probability 1 − p. What are E[Z] and Var[Z]? Here is an example to help understand this question: You flip a coin that has probability p of coming up heads. If it comes up heads, you select a part from box X and measure some quantity that has mean $\mu_x$ and variance $\sigma_x^2$; if it comes up tails, you select a part from box Y and measure some quantity that has mean $\mu_y$ and variance $\sigma_y^2$. What are the mean and variance of the measurement, taking into account the effect of the coin flip? In practice, most experimental designs (e.g., polls) try to avoid this problem by sampling and measuring X and Y separately and not relying on the whims of a coin flip.
5.27 The conditional entropy of X given Y is defined as
$$H(X|Y) = -\sum_k \sum_l p_{XY}(k,l)\log p_{X|Y}(k|l)$$
Show $H(X,Y) = H(Y) + H(X|Y)$. Interpret this result in words.
5.28 Consider two probability distributions, p(k) for k = 1,2,...,m and q(k) for k = 1,2,...,m. The Kullback-Leibler (KL) divergence between them is the following:
$$KL(P\|Q) = \sum_{i=1}^{m} p(i)\log\frac{p(i)}{q(i)} \qquad (5.16)$$
The KL divergence is always nonnegative:
$$KL(P\|Q) \ge 0 \qquad (5.17)$$
Use the log inequality given in Equation (3.25) to prove the KL inequality given in Equation (5.17). (Hint: show $-KL(P\|Q) \le 0$ instead, and rearrange the equation to "hide" the minus sign and then apply the inequality.) One application of the KL divergence: When data X have probability p(k) but are encoded with lengths designed for distribution q(k), the KL divergence tells us how many additional bits are required. In other words, the coding lengths $-\log q(k)$ are best if q(k) = p(k) for all k.
5.29 Solve the following optimization problems using Lagrange multipliers:
a. $\min_{x,y}\; x^2 + y^2$ such that $x - y = 3$
b. $\max_{x,y}\; x + y$ such that $x^2 + y^2 = 1$
c. $\min_{x,y,z}\; x^2 + y^2 + z^2$ such that $x + 2y + 3z = 6$
5.30 Prove the Cauchy-Schwarz inequality:
$$\Big(\sum_{i=1}^{n} x_i y_i\Big)^2 \le \Big(\sum_{i=1}^{n} x_i^2\Big)\cdot\Big(\sum_{i=1}^{n} y_i^2\Big)$$
where the x's and y's are arbitrary numbers. Hint: Start with the following inequality (why is this true?):
$$0 \le \sum_{i=1}^{n} (x_i - a y_i)^2 \quad \text{for all values of } a$$
Find the value of a that minimizes the right-hand side above, substitute that value into the same inequality, and rearrange the terms into the Cauchy-Schwarz inequality at the top.
5.31 Complete an alternative proof of Equation (5.6).
a. Show $\big(E[XY]\big)^2 \le E[X^2]\,E[Y^2]$ for any X and Y using the methods in Problem 5.30.
b. Show this implies $\mathrm{Cov}[X,Y]^2 \le \mathrm{Var}[X]\,\mathrm{Var}[Y]$ and hence $\rho_{xy}^2 \le 1$.
5.32 Consider the following maximum entropy problem: Among all distributions over the integers k = 1,2,3,... with known mean $\mu = \sum_{k=1}^{\infty} k\,p(k)$, which one has the maximum entropy? Clearly, the answer is not the uniform distribution. A uniform distribution over m = ∞ does not make sense, and even if it did, its mean would be ∞. The constrained optimization problem looks like the following:
$$\max_{p(k)\text{'s}}\; H[X] = -\sum_{k=1}^{\infty} p(k)\log p(k) \quad \text{subject to} \quad \sum_{k=1}^{\infty} p(k) = 1 \;\text{ and }\; \sum_{k=1}^{\infty} k\,p(k) = \mu$$
a. Introduce two Lagrange multipliers, λ and ψ, and convert the constrained problem to an unconstrained problem over the p(k) and λ and ψ. What is the unconstrained problem?
b. Show the p(k) satisfy the following:
$$0 = -\log p(k) - 1 - \lambda - k\psi \quad \text{for } k = 1,2,3,\ldots$$
c. What two other equations do the p(k) satisfy?
d. Show the p(k) correspond to the geometric distribution.
CHAPTER 6
BINOMIAL PROBABILITIES
Two teams, say, the Yankees and the Giants, play an n-game series. If the Yankees win each game with probability p independently of any other game, what is the probability the Yankees win the series (i.e., more than half the games)? This probability is a binomial probability. Binomial probabilities arise in numerous applications—not just baseball. In this chapter, we examine binomial probabilities and develop some of their properties. We also show how binomial probabilities apply to the problem of correcting errors in a digital communications system.
6.1 BASICS OF THE BINOMIAL DISTRIBUTION
In this section, we introduce the binomial distribution and compute its PMF by two methods. The binomial distribution arises from the sum of IID Bernoulli random variables (e.g., flips of a coin). Let $X_i$ for i = 1,2,...,n be IID Bernoulli random variables with $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p$ (throughout this chapter, we use the convention q = 1 − p):
$$S = X_1 + X_2 + \cdots + X_n$$
Then, S has a binomial distribution. The binomial PMF can be determined as follows:
$$\Pr[S = k] = \big(\text{number of sequences with } k \text{ 1's}\big)\times\Pr[\text{a particular sequence with } k \text{ 1's and } n-k \text{ 0's}]$$
Consider an arbitrary sequence with k 1's and n − k 0's. Since the flips are independent, the probabilities multiply. The probability of the sequence is $p^k q^{n-k}$. Note that each sequence with k 1's and n − k 0's has the same probability.
The number of sequences with k 1's and n − k 0's is $\binom{n}{k}$. Thus,
$$\Pr[S = k] = \binom{n}{k} p^k q^{n-k} \qquad (6.1)$$
PMF values must satisfy two properties. First, the PMF values are nonnegative, and second, the PMF values sum to 1. The binomial probabilities are clearly nonnegative, as each is the product of three nonnegative terms, $\binom{n}{k}$, $p^k$, and $q^{n-k}$. The probabilities sum to 1 by the binomial theorem (Equation 3.7):
$$\sum_{k=0}^{n} \Pr[S = k] = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} = (p+q)^n = 1^n = 1$$
The binomial PMF for n = 5 and p = 0.7 is shown in Figure 6.1. The PMF values are represented by the heights of each stem. An alternative is to use a bar graph, as shown in Figure 6.2, wherein the same PMF values are shown as bars. The height of each bar is the PMF value, and the width of each bar is 1 (so the area of each bar equals the probability). Also shown in Figure 6.1 are the mean (μ = 3.5) and standard deviation (σ = 1.02).
FIGURE 6.1 Binomial PMF for n = 5 and p = 0.7. The probabilities are proportional to the heights of each stem. Also shown are the mean (μ = 3.5) and standard deviation (σ = 1.02). The largest probability occurs for k = 4 and is equal to 0.360.
FIGURE 6.2 Binomial probabilities for n = 5 and p = 0.7 as a bar graph. Since the bars have width equal to 1, the area of each bar equals the probability of that value. Bar graphs are especially useful when comparing discrete probabilities to continuous probabilities.
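The PMF values in Figures 6.1 and 6.2 are easy to reproduce directly from Equation (6.1). Here is a short Python check (our own sketch, using the standard-library math.comb):

```python
# Compute the binomial PMF of Equation (6.1) for n = 5, p = 0.7.
from math import comb

n, p = 5, 0.7
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print([round(v, 3) for v in pmf])  # [0.002, 0.028, 0.132, 0.309, 0.36, 0.168]
print(round(sum(pmf), 6))          # 1.0, as required of a PMF
```

The largest value, 0.360 at k = 4, matches the peak in the figures.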
EXAMPLE 6.1
Consider a sequence of 56 Bernoulli IID p = 0.7 random variables, 1111111 · 1110101 · 1110010 · 1000111 · 0011011 · 0111111 · 0110011 · 0100110. Each group of seven bits is summed, yielding eight binomial observations, 7, 5, 4, 4, 4, 6, 4, 3 (sum across the rows in the table below).

    Bernoullis         Binomial
    1 1 1 1 1 1 1      7
    1 1 1 0 1 0 1      5
    1 1 1 0 0 1 0      4
    1 0 0 0 1 1 1      4
    0 0 1 1 0 1 1      4
    0 1 1 1 1 1 1      6
    0 1 1 0 0 1 1      4
    0 1 0 0 1 1 0      3
For example, the probability of getting a 4 is
$$\Pr[X = 4] = \binom{7}{4} 0.7^4\, 0.3^3 = 0.23$$
In a sequence of eight binomial random variables, we expect to get about 8 × 0.23 = 1.81 fours. In this sequence, we observed 4 fours.
The binomial probabilities satisfy an interesting and useful recursion. It is convenient to define a few quantities:
$$S_{n-1} = X_1 + X_2 + \cdots + X_{n-1} \qquad S_n = S_{n-1} + X_n \qquad b(n,k,p) = \Pr[S_n = k]$$
Note that $S_{n-1}$ and $X_n$ are independent since $X_n$ is independent of $X_1$ through $X_{n-1}$. The recursion is developed through the LTP:
$$\begin{aligned}
\Pr[S_n = k] &= \Pr[S_n = k \mid X_n = 1]\Pr[X_n = 1] + \Pr[S_n = k \mid X_n = 0]\Pr[X_n = 0] \\
&= \Pr[S_{n-1} = k-1 \mid X_n = 1]\,p + \Pr[S_{n-1} = k \mid X_n = 0]\,q \\
&= \Pr[S_{n-1} = k-1]\,p + \Pr[S_{n-1} = k]\,q
\end{aligned}$$
We used the independence of $S_{n-1}$ and $X_n$ to simplify the conditional probability. Using the b(n,k,p) notation gives a simple recursion:
$$b(n,k,p) = b(n-1,k-1,p)\cdot p + b(n-1,k,p)\cdot q \qquad (6.2)$$
Equation 6.2 gives us a Pascal's triangle-like method to calculate the binomial probabilities.
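For instance, here is one way the row-by-row recursion might be coded (a Python sketch of our own; Problem 6.12 asks for exactly such a function):

```python
# A sketch of the Pascal's-triangle-like recursion of Equation (6.2):
# each row b(n, ., p) is a weighted sum of the previous row with
# weights p and q = 1 - p.
def binomial_row(n, p):
    q = 1 - p
    row = [1.0]                       # n = 0: Pr[S = 0] = 1
    for _ in range(n):
        # b(n,k,p) = b(n-1,k-1,p)*p + b(n-1,k,p)*q,
        # with out-of-range entries treated as 0.
        row = [q * a + p * b for a, b in zip(row + [0.0], [0.0] + row)]
    return row

print([round(v, 3) for v in binomial_row(5, 0.7)])
# [0.002, 0.028, 0.132, 0.309, 0.36, 0.168], matching Figure 6.1
```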
FIGURE 6.3 Binomial probabilities organized in Pascal’s triangle for n up to 5. Each entry is a weighted sum with weights p and q = 1 − p of the two entries above it. For instance, 10p2 q3 = (4pq3 )p + (6p2 q2 )q.
Comment 6.1: The recursive development is just a long-winded way of saying the binomial probabilities are the repeated convolution of the Bernoulli probabilities. Here are the successive convolutions of [q, p] with itself for n = 2, n = 3, and n = 4:

    n = 1:  q     p
    n = 2:  q²    2pq    p²
    n = 3:  q³    3pq²   3p²q    p³
    n = 4:  q⁴    4pq³   6p²q²   4p³q   p⁴

The various intermediate rows list binomial probabilities (e.g., for n = 3, $\Pr[N = 1] = 3pq^2$ and $\Pr[N = 2] = 3p^2q$).
To demonstrate binomial probabilities, here is a sequence of 30 observations from an n = 5, p = 0.7 binomial distribution: 4, 3, 4, 5, 3, 4, 4, 4, 5, 2, 3, 5, 3, 3, 3, 4, 3, 5, 4, 3, 3, 3, 0, 4, 5, 4, 4, 2, 1, 4. There are one 0, one 1, two 2's, ten 3's, eleven 4's, and five 5's. Note the histogram approximates the PMF fairly well. Some differences are apparent (e.g., the sequence has a 0 even though a 0 is unlikely, $\Pr[X = 0] = 0.002$), but the overall shapes are pretty similar. The histogram and the PMF are plotted in Figure 6.4.
FIGURE 6.4 Plot of a binomial n = 5, p = 0.7 PMF (as a bar graph) and histogram (as dots) of 30 observations. Note the histogram reasonably approximates the PMF.
In summary, for independent flips of the same coin (IID random variables), Bernoulli probabilities answer the question of how many heads we get in one flip. Binomial probabilities answer the question of how many heads we get in n flips.
EXAMPLE 6.2
How likely is it that a sequence of 30 IID binomial n = 5, p = 0.7 random variables would have at least one 0? First, calculate the probability of getting a 0 in a single observation:
$$\Pr[X = 0] = \binom{5}{0} 0.7^0 (1.0 - 0.7)^5 = 0.3^5 = 0.00243$$
Second, calculate the probability of getting at least one 0 in 30 tries. Each trial has six possible outcomes: 0, 1, 2, 3, 4, and 5. However, we are only interested in 0's or not-0's. We just calculated the probability of a 0 in a single trial as 0.00243. Therefore, the probability of a not-0 is 1 − 0.00243. Thus, the probability of at least one 0 is 1 minus the probability of no 0's:
$$\Pr[\text{no 0's in 30 trials}] = (1 - 0.00243)^{30} = 0.9296$$
$$\Pr[\text{at least one 0 in 30 trials}] = 1 - 0.9296 = 0.0704$$
Thus, about 7% of the time, a sequence of 30 trials will contain at least one 0.
6.2 COMPUTING BINOMIAL PROBABILITIES
To compute the probability of an interval, say, l ≤ S ≤ m, one must sum the PMF values:
$$\Pr[l \le S \le m] = \sum_{k=l}^{m} b(n,k,p)$$
This calculation is facilitated by computing the b(n,k,p) recursively. First, look at the ratio:
$$\frac{b(n,k,p)}{b(n,k-1,p)} = \frac{\binom{n}{k} p^k q^{n-k}}{\binom{n}{k-1} p^{k-1} q^{n-k+1}} = \frac{n-k+1}{k}\cdot\frac{p}{q}$$
Thus,
$$b(n,k,p) = b(n,k-1,p)\cdot\frac{n-k+1}{k}\cdot\frac{p}{q} \qquad (6.3)$$
Using this formula, it is trivial for a computer to calculate binomial probabilities with thousands of terms.
The same argument allows one to analyze the sequence of binomial probabilities. b(n,k,p) is larger than b(n,k−1,p) if
$$\frac{(n-k+1)p}{kq} > 1$$
Rearranging the terms gives
$$k < (n+1)p \qquad (6.4)$$
Similarly, the terms b(n,k,p) and b(n,k−1,p) are equal if k = (n+1)p, and b(n,k,p) is less than b(n,k−1,p) if k > (n+1)p. For example, for n = 5, p = 0.7, and (n+1)p = 4.2, the b(5,k,0.7) sequence reaches its maximum at k = 4, then decreases for k = 5. This rise and fall are shown in Figure 6.1.
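As an illustration, here is a short Python sketch of the term-by-term recursion (our own code, assuming 0 < p < 1) used to sum an interval probability:

```python
# Compute Pr[l <= S <= m] using the recursion of Equation (6.3);
# no large factorials are ever formed. Assumes 0 < p < 1.
def binomial_interval(n, p, l, m):
    q = 1 - p
    b = q ** n                           # b(n, 0, p)
    total = b if l == 0 else 0.0
    for k in range(1, m + 1):
        b *= (n - k + 1) / k * (p / q)   # Equation (6.3)
        if k >= l:
            total += b
    return total

print(round(binomial_interval(5, 0.7, 4, 5), 3))   # Pr[4 <= S <= 5] = 0.528
```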
6.3 MOMENTS OF THE BINOMIAL DISTRIBUTION
The mean of the binomial distribution is np and the variance npq. In this section, we derive these values three ways. First, we use the fact that a binomial is a sum of IID Bernoulli random variables. Second, we perform a direct computation using binomial probabilities. Third, we use the MGF. Of course, all three methods lead to the same answers.
First, since S is a sum of IID Bernoulli random variables, the mean and variance of S are the sums of the means and variances of the $X_i$ (see Section 5.5):
$$\mu_s = E[S] = E[X_1] + E[X_2] + \cdots + E[X_n] = np$$
$$\mathrm{Var}[S] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2] + \cdots + \mathrm{Var}[X_n] = npq$$
since E[X] = p and Var[X] = pq. Thus, we see the mean and variance of the binomial distribution are np and npq, respectively. For example, Figure 6.1 shows the PMF for a binomial distribution with n = 5 and p = 0.7. The mean is μ = np = 5 × 0.7 = 3.5, and the variance is σ² = 5 × 0.7 × 0.3 = 1.05 = (1.02)².
Second, compute the mean directly using binomial probabilities, taking advantage of Equation (3.22), $k\binom{n}{k} = n\binom{n-1}{k-1}$:
$$\begin{aligned}
E[S] &= \sum_{k=0}^{n} k\cdot\Pr[X = k] \\
&= \sum_{k=1}^{n} k\binom{n}{k} p^k q^{n-k} && (k = 0 \text{ term is } 0) \\
&= \sum_{k=1}^{n} n\binom{n-1}{k-1} p^k q^{n-k} && \text{(using Equation 3.22)} \\
&= np\sum_{l=0}^{n-1}\binom{n-1}{l} p^l q^{n-1-l} && \text{(change of variables, } l = k-1\text{)} \\
&= np(p+q)^{n-1} && \text{(using the binomial theorem)} \\
&= np && \text{(since } p+q = 1\text{)}
\end{aligned}$$
It is tricky to compute $E[S^2]$ directly. It is easier to compute $E[S(S-1)]$ first and then adjust the formula for computing the variance:
$$\sigma_s^2 = E[S^2] - \big(E[S]\big)^2 = E[S(S-1)] + E[S] - \big(E[S]\big)^2$$
We will also need to extend Equation (3.22):
$$k(k-1)\binom{n}{k} = n(n-1)\binom{n-2}{k-2} \qquad (6.5)$$
Using this formula, we can compute $E[S(S-1)]$:
$$E[S(S-1)] = \sum_{k=0}^{n} k(k-1)\binom{n}{k} p^k q^{n-k} = n(n-1)p^2\sum_{k=2}^{n}\binom{n-2}{k-2} p^{k-2} q^{n-k} = n(n-1)p^2$$
Now, finish the calculation:
$$\sigma_s^2 = E[S(S-1)] + E[S] - \big(E[S]\big)^2 = n(n-1)p^2 + np - n^2p^2 = np(1-p) = npq$$
See Problem 4.18 for a similar approach in computing moments from a Poisson distribution. Third, compute the mean and variance using the MGF:
$$M_S(u) = E[e^{uS}] = \sum_{k=0}^{n}\binom{n}{k} p^k q^{n-k} e^{ku} = \sum_{k=0}^{n}\binom{n}{k}(pe^u)^k q^{n-k} = (pe^u + q)^n \qquad (6.6)$$
Now, compute the mean:
$$E[S] = \frac{d}{du} M_S(u)\Big|_{u=0} = n(pe^u + q)^{n-1} pe^u\Big|_{u=0} = n(p+q)^{n-1} p = np$$
In Problem 6.15, we continue this development and compute the variance using the MGF.
In summary, using three different methods, we have computed the mean and variance of the binomial distribution. The first method exploits the fact that the binomial is the sum of IID Bernoulli random variables. This method is quick because the moments of the Bernoulli distribution are computed easily. The second method calculates the moments directly from the binomial PMF. This straightforward method needs two nontrivial binomial coefficient identities (Equations 3.22 and 6.5). However, for many other distributions, the direct calculation proceeds quickly and easily. The third method uses the MGF. Calculate the MGF, differentiate, and evaluate the derivative at u = 0. For other problems, it is handy to be able to apply all three of these methods. It is often the case that at least one of the three is easy to apply, though sometimes it is not obvious beforehand which one.
6.4 SUMS OF INDEPENDENT BINOMIAL RANDOM VARIABLES
Consider the sum of two independent binomial random variables, $N = N_1 + N_2$, where $N_1$ and $N_2$ use the same value of p. The first might represent the number of heads in $n_1$ flips of a coin, the second the number of heads in $n_2$ flips of the same (or an identical) coin, and the sum the number of heads in $n = n_1 + n_2$ flips of the coin (or coins). All three of these random variables are binomial. By this counting argument, the sum of two independent binomial random variables is binomial. This is most easily shown with MGFs:
$$M_N(u) = M_{N_1}(u)\,M_{N_2}(u) = (pe^u + q)^{n_1}(pe^u + q)^{n_2} = (pe^u + q)^{n_1+n_2}$$
Since the latter expression is the MGF of a binomial random variable, N is binomial.
Now, let us ask the opposite question. Given that $N_1$ and $N_2$ are independent binomial random variables and the sum $N = N_1 + N_2$, what can we say about the conditional probability of $N_1$ given the sum N = m? It turns out the conditional probability is not binomial:
$$\begin{aligned}
\Pr[N_1 = k \mid N = m] &= \frac{\Pr[N_1 = k \cap N = m]}{\Pr[N = m]} && \text{(definition)} \\
&= \frac{\Pr[N_1 = k \cap N_2 = m-k]}{\Pr[N = m]} && (N_2 = N - N_1) \\
&= \frac{\Pr[N_1 = k]\,\Pr[N_2 = m-k]}{\Pr[N = m]} && \text{(by independence)} \\
&= \frac{\binom{n_1}{k} p^k q^{n_1-k}\,\binom{n_2}{m-k} p^{m-k} q^{n_2-m+k}}{\binom{n}{m} p^m q^{n-m}} \\
&= \frac{\binom{n_1}{k}\binom{n_2}{m-k}}{\binom{n}{m}} && (6.7)
\end{aligned}$$
In fact, the conditional probability is hypergeometric. This has a simple interpretation. There are $\binom{n}{m}$ sequences of m heads in n trials. Each sequence is equally likely. There are $\binom{n_1}{k}\binom{n_2}{m-k}$ sequences with k heads in the first $n_1$ positions and m − k heads in the last $n_2 = n - n_1$ positions. Thus, the probability is the number of sequences with k heads in the first $n_1$ flips and m − k heads in the next $n_2 = n - n_1$ flips, divided by the number of sequences with m heads in n flips.
For instance, let $n_1 = n_2 = 4$ and k = 4. So, four of the eight flips are heads. The probability of all four heads in the first four positions is
$$\Pr[N_1 = 4 \mid N = 4] = \frac{\binom{4}{4}\binom{4}{0}}{\binom{8}{4}} = \frac{1}{70}$$
The probability of an equal split, two heads in the first four flips and two in the second four flips, is
$$\Pr[N_1 = 2 \mid N = 4] = \frac{\binom{4}{2}\binom{4}{2}}{\binom{8}{4}} = \frac{36}{70}$$
Clearly, an equal split is much more likely than having all the heads in the first four (or the last four) positions. Comment 6.2: It is useful to note what happened here. N1 by itself is binomial, but N1 given the value of N1 + N2 is hypergeometric. By conditioning on the sum, N1 is restricted. A trivial example of this is when N = 0. In this case, it follows that N1 must also be 0 (since N1 + N2 = 0 implies N1 = 0). When N = 1, N1 must be 0 or 1.
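The numbers in this example are easy to verify numerically. A small Python check of Equation (6.7) (our own sketch):

```python
# Conditional split of m = 4 heads between two halves of n1 = n2 = 4 flips:
# Equation (6.7), the hypergeometric probabilities.
from math import comb

n1, n2, m = 4, 4, 4
n = n1 + n2
for k in range(m + 1):
    pk = comb(n1, k) * comb(n2, m - k) / comb(n, m)
    print(k, round(pk, 4))
# k = 2 (the even split) gives 36/70 = 0.5143;
# k = 0 and k = 4 each give only 1/70 = 0.0143.
```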
6.5 DISTRIBUTIONS RELATED TO THE BINOMIAL
The binomial is related to a number of other distributions. In this section, we discuss some of these: the hypergeometric, the multinomial, the negative binomial, and the Poisson. In Chapter 9, we discuss the connection between the binomial and the Gaussian distribution.
6.5.1 Connections Between Binomial and Hypergeometric Probabilities
The binomial and hypergeometric distributions answer similar questions. Consider a box containing n items, with $n_0$ labeled with 0's and $n_1$ labeled with 1's ($n = n_0 + n_1$). Make a selection of m items without replacement, and let N denote the number of 1's in the selection. Then, the probabilities are hypergeometric:
$$\Pr[N = k] = \frac{\binom{n_1}{k}\binom{n_0}{m-k}}{\binom{n}{m}}$$
The first item has probability $n_1/n$ of being a 1. The second item has conditional probability $(n_1-1)/(n-1)$ or $n_1/(n-1)$, depending on whether the first item selected was a 1 or a 0. In contrast, if the items are selected with replacement, then the probabilities are constant. The probability of a 1 is $n_1/n$ regardless of previous selections. In this case, the probability of a selection is binomial.
If $n_0$ and $n_1$ are large, then the probabilities are approximately constant. In this case, the hypergeometric probabilities are approximately binomial. For example, let $n_0 = n_1 = 5$ and m = 6. Selected probabilities are listed below:

    Probability   Binomial                                          Hypergeometric
    Pr[N = 3]     $\binom{6}{3} 0.5^3 0.5^3 = 20/64 = 0.313$        $\binom{5}{3}\binom{5}{3}/\binom{10}{6} = 100/210 = 0.476$
    Pr[N = 5]     $\binom{6}{5} 0.5^5 0.5^1 = 6/64 = 0.094$         $\binom{5}{5}\binom{5}{1}/\binom{10}{6} = 5/210 = 0.024$
In summary, if the selection is made with replacement, the probabilities are binomial; if the selection is made without replacement, the probabilities are hypergeometric. If the number of items is large, the hypergeometric probabilities are approximately binomial. As the selection size gets large, the hypergeometric distribution favors balanced selections (e.g., half 1’s and half 0’s) more than the binomial distribution. Conversely, unbalanced selections (e.g., all 1’s or all 0’s) are much more likely with the binomial distribution.
Binomial probabilities tend to be easier to manipulate than hypergeometric probabilities. It is sometimes useful to approximate hypergeometric probabilities by binomial probabilities. The approximation is valid when the number of each item is large compared with the number selected.
6.5.2 Multinomial Probabilities
Just as the multinomial coefficient (Equation 3.13) generalizes the binomial coefficient (Equation 3.3), the multinomial distribution generalizes the binomial distribution. The binomial distribution occurs in counting experiments with two outcomes in each trial (e.g., heads or tails). The multinomial distribution occurs in similar counting experiments but with two or more outcomes per trial. For example, the English language uses 26 letters (occurring in both uppercase and lowercase versions), 10 numbers, and various punctuation symbols. We can ask questions like "What is the probability of a letter t?" and "How many letter t's can we expect to see in a string of n letters?" These questions lead to multinomial probabilities.
Consider an experiment that generates a sequence of n symbols $X_1, X_2, \ldots, X_n$. For example, the symbols might be letters from an alphabet or the colors of a series of automobiles or many other things. For convenience, we will assume each symbol is an integer in the range from 0 to m − 1. (In other words, n is the sum of the counts, and m is the size of the alphabet.) Let the probability of outcome k be $p_k = \Pr[X_i = k]$. The probabilities sum to 1; that is, $p_0 + p_1 + \cdots + p_{m-1} = 1$. Let $N_k$ equal the number of $X_i = k$ for k = 0,1,...,m − 1. Thus, $N_0 + N_1 + \cdots + N_{m-1} = n$. For instance, n is the total number of cars, and $N_0$ might be the number of red cars, $N_1$ the number of blue cars, etc. The probability of a particular collection of counts is the multinomial probability:
$$\Pr[N_0 = k_0 \cap \cdots \cap N_{m-1} = k_{m-1}] = \binom{n}{k_0, k_1, \ldots, k_{m-1}} p_0^{k_0} p_1^{k_1}\cdots p_{m-1}^{k_{m-1}} \qquad (6.8)$$
For example, a source emits symbols from a four-letter alphabet with probabilities $p_0 = 0.4$, $p_1 = 0.3$, $p_2 = 0.2$, and $p_3 = 0.1$. One sequence of 20 symbols is 1, 2, 0, 2, 0, 3, 2, 1, 1, 0, 1, 1, 2, 1, 0, 2, 2, 3, 1, 1.¹ The counts are $N_0 = 4$, $N_1 = 8$, $N_2 = 6$, and $N_3 = 2$. The probability of this particular set of counts is
$$\Pr[N_0 = 4 \cap N_1 = 8 \cap N_2 = 6 \cap N_3 = 2] = \binom{20}{4,8,6,2} 0.4^4\, 0.3^8\, 0.2^6\, 0.1^2 = 0.002$$
This probability is small for two reasons. First, with n = 20, there are many possible sets of counts; any particular one is unlikely. Second, this particular sequence has relatively few 0's despite 0 being the most likely symbol. For comparison, the expected counts (8, 6, 4, 2) have probability 0.013, about six times as likely, but still occurring only about once every 75 trials.
¹ The first sequence I generated.
The mean and variance of each count are
$$E[N_i] = np_i \qquad (6.9)$$
$$\mathrm{Var}[N_i] = np_i(1-p_i) \qquad (6.10)$$
These are the same as for the binomial distribution. The covariance between $N_i$ and $N_j$ for $i \ne j$ is
$$\mathrm{Cov}[N_i, N_j] = -np_i p_j \qquad (6.11)$$
The covariance is a measure of how one variable varies with changes to the other variable. For the multinomial, the N's sum to a constant, $N_0 + N_1 + \cdots + N_{m-1} = n$. If $N_i$ is greater than its average, it is likely that $N_j$ is less than its average. That $\mathrm{Cov}[N_i, N_j] < 0$ for the multinomial follows from this simple observation. For instance, in the example above, $E[N_0] = 20 \times 0.4 = 8$, $E[N_1] = 20 \times 0.3 = 6$, $\mathrm{Var}[N_0] = 20 \times 0.4 \times 0.6 = 4.8$, and $\mathrm{Cov}[N_0, N_1] = -20 \times 0.4 \times 0.3 = -2.4$.
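The two probabilities quoted above are easy to check. A short Python sketch (our own) evaluates Equation (6.8) for both sets of counts:

```python
# Evaluate the multinomial PMF of Equation (6.8) for n = 20,
# p = [0.4, 0.3, 0.2, 0.1].
from math import factorial, prod

def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = factorial(n) // prod(factorial(k) for k in counts)  # multinomial coefficient
    return coef * prod(p**k for p, k in zip(probs, counts))

p = [0.4, 0.3, 0.2, 0.1]
print(round(multinomial_pmf([4, 8, 6, 2], p), 3))   # 0.002, the observed counts
print(round(multinomial_pmf([8, 6, 4, 2], p), 3))   # 0.013, the expected counts
```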
6.5.3 The Negative Binomial Distribution
The binomial distribution helps answer the question "In n independent flips, how many heads can one expect?" The negative binomial distribution helps answer the reverse question "To get k heads, how many independent flips are needed?"
Let N be the number of flips required to obtain k heads. The event {N = n} is a sequence of n flips: the first n − 1 flips contain k − 1 heads, and the nth flip is a head. The probability of this sequence is
$$\Pr[N = n] = \binom{n-1}{k-1} p^k q^{n-k} \quad \text{for } n = k, k+1, k+2, \ldots \qquad (6.12)$$
The first 12 terms of the negative binomial distribution for p = 0.5 and k = 3 are shown below:

[Figure: stem plot of Pr[N = n] for p = 0.5, k = 3, over n = 3 to 14; the peak value is 0.1875.]

For instance, $\Pr[N = 5] = \binom{5-1}{3-1} 0.5^3\, 0.5^{5-3} = \binom{4}{2} 0.5^5 = 6 \times 0.03125 = 0.1875$.
Just as the binomial is a sum of n Bernoulli random variables, the negative binomial is a sum of k geometric random variables. (Recall that the geometric is the number of flips required to get one head.) Therefore, the MGF of the negative binomial is the MGF of the geometric raised to the kth power:
$$M(u) = p^k (e^{-u} - 1 + p)^{-k} \qquad \text{(by Equations 4.23 and 5.11)} \qquad (6.13)$$
Moments of the negative binomial can be calculated easily from the geometric. Let X be a geometric random variable with mean 1/p and variance $(1-p)/p^2$. Then,
$$E[N] = k\cdot E[X] = \frac{k}{p} \qquad (6.14)$$
$$\mathrm{Var}[N] = k\cdot\mathrm{Var}[X] = \frac{k(1-p)}{p^2} \qquad (6.15)$$
EXAMPLE 6.3
In a baseball inning, each team sends a succession of players to bat. In a simplified version of the game, each batter either gets on base or makes an out. The team bats until there are three outs (k = 3). If we assume each batter makes an out with probability p = 0.7 and the batters are independent, then the number of batters is negative binomial. The first few probabilities are
$$\Pr[N = 3] = \binom{3-1}{3-1} 0.7^3\, 0.3^{3-3} = 0.343$$
$$\Pr[N = 4] = \binom{4-1}{3-1} 0.7^3\, 0.3^{4-3} = 0.309$$
$$\Pr[N = 5] = \binom{5-1}{3-1} 0.7^3\, 0.3^{5-3} = 0.185$$
$$\Pr[N = 6] = \binom{6-1}{3-1} 0.7^3\, 0.3^{6-3} = 0.093$$
The mean number of batters per team per inning is
$$E[N] = \frac{3}{0.7} = 4.3$$
150 CHAPTER 6 BINOMIAL PROBABILITIES
The probability N = k is
$$\Pr[N = k] = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}$$
Let us look at what happens when n is large, p is small, and λ = np is moderate:
$$\frac{n!}{(n-k)!} \approx n^k$$
$$\log(1-p)^{n-k} = (n-k)\log(1-\lambda/n) \approx n\log(1-\lambda/n) \approx -\lambda$$
using λ = np, n − k ≈ n, and log(1+x) ≈ x. Hence,
$$(1-p)^{n-k} \approx e^{-\lambda}$$
Putting it all together,
$$\Pr[N = k] = \binom{n}{k} p^k (1-p)^{n-k} \approx \frac{n^k p^k}{k!} e^{-\lambda} = \frac{\lambda^k}{k!} e^{-\lambda} \quad \text{for } k = 0,1,2,\ldots \qquad (6.16)$$
These are the Poisson probabilities. To summarize, the limit of binomial probabilities when n is large, p is small, and λ = np is moderate is Poisson.
Figure 6.5 shows the convergence of the binomial to the Poisson. The top graph shows a somewhat poor convergence when n = 10 and p = 0.5. The bottom graph shows a much better convergence with n = 50 and p = 0.1.
FIGURE 6.5 Comparison of binomial and Poisson PMFs. Both have the same λ = np. The top graph compares a binomial with n = 10, p = 0.5 to a Poisson with λ = 5. The agreement is poor. The bottom graph has n = 50 and p = 0.1 and shows much better agreement.
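The comparison in Figure 6.5 can be reproduced with a few lines of Python (our own sketch; both PMFs are computed directly):

```python
# Binomial (n = 50, p = 0.1) versus Poisson (lambda = np = 5), side by side.
from math import comb, exp, factorial

n, p = 50, 0.1
lam = n * p
for k in range(11):
    b = comb(n, k) * p**k * (1 - p)**(n - k)      # binomial PMF
    poi = lam**k * exp(-lam) / factorial(k)       # Poisson PMF
    print(k, round(b, 3), round(poi, 3))
# Near k = 5 the values peak at about 0.185 (binomial) and 0.175 (Poisson),
# matching the bottom panel of Figure 6.5.
```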
6.6 BINOMIAL AND MULTINOMIAL ESTIMATION
A common problem in statistics is estimating the parameters of a probability distribution. In this section, we consider the Bernoulli, binomial, and multinomial distributions.
As one example, consider a medical experiment to evaluate whether or not a new drug is helpful. The drug might be given to n patients. Of these n, k patients improved. What can we say about the probability the drug leads to an improvement?
Let $X_1, X_2, \ldots, X_n$ be n IID Bernoulli random variables with unknown probability p of a 1 (and probability q = 1 − p of a 0). Let $S = X_1 + \cdots + X_n$ be the sum of the random variables. Then, S is binomial with parameters n and p, and
$$E[S] = np \qquad \mathrm{Var}[S] = npq$$
We will use the notation $\hat{p}$ to denote an estimate of p. $\hat{p}$ is a random variable; its value depends on the outcomes of the experiment. Let k equal the actual number of 1's observed in the sequence and n − k equal the observed number of 0's. An obvious estimate of p is
$$\hat{p} = \frac{S}{n} = \frac{k}{n}$$
$$E[\hat{p}] = \frac{E[S]}{n} = \frac{np}{n} = p \qquad \mathrm{Var}[\hat{p}] = \frac{\mathrm{Var}[S]}{n^2} = \frac{npq}{n^2} = \frac{pq}{n}$$
Since the expected value of $\hat{p}$ is p, we say $\hat{p}$ is an unbiased estimate of p, and since the variance of $\hat{p}$ goes to 0 as n → ∞, we say $\hat{p}$ is a consistent estimate of p. Unbiased means the average value of the estimator equals the value being estimated; that is, there is no bias. Consistent means the variance of the estimator goes to 0 as the number of observations goes to infinity. In short, estimators that are both unbiased and consistent are likely to give good results.
Estimating the parameters of a multinomial distribution is similar. Let $X_i$ be an observation from an alphabet of m letters (from 0 to m − 1). Let $p_i = \Pr[X = i]$, and let $k_i$ be the number of i's in n observations. Then,
$$\hat{p}_i = \frac{k_i}{n} \qquad E[\hat{p}_i] = \frac{np_i}{n} = p_i \qquad \mathrm{Var}[\hat{p}_i] = \frac{np_i(1-p_i)}{n^2} = \frac{p_i(1-p_i)}{n}$$
As with the binomial distribution, $\hat{p}_i = k_i/n$ is an unbiased and consistent estimator of $p_i$. In summary, the obvious estimator of p in the binomial and multinomial distributions is the sample average of the n random variables, $X_1, X_2, \ldots, X_n$. Since it is an unbiased and
consistent estimator of p, the sample average is a good estimator and is commonly used. Furthermore, as we will see in later chapters, the sample average is often a good parameter estimate for other distributions as well.
Comment 6.3: It is especially important in estimation problems like these to distinguish between the random variables and the observations. X and S are random variables. Before we perform the experiment, we do not know their values. After the experiment, X and S have values, such as S = k. Before doing the experiment, $\hat{p} = S/n$ is a random variable. After doing the experiment, $\hat{p}$ has a particular value. When we say $\hat{p}$ is unbiased and consistent, we mean that if we did this experiment many times, the average value of $\hat{p}$ would be close to p.
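A small simulation illustrates consistency. The following Python sketch (the true p and sample sizes are our choices) shows the estimates concentrating around p as n grows:

```python
# Simulate p-hat = S/n for increasing n; the estimates approach the true p.
import random

p = 0.3
for n in [10, 100, 10000]:
    s = sum(random.random() < p for _ in range(n))  # S = number of 1's
    print(n, s / n)                                 # p-hat, closer to 0.3 as n grows
```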
6.7 ALOHANET
In 1970, the University of Hawaii built a radio network that connected four islands with the central campus in Oahu. This network was known as Alohanet and eventually led to the widely used Ethernet, Internet, and cellular networks of today. The Aloha protocol led to many advances in computer networks as researchers analyzed its strengths and weaknesses and developed improvements.
The original Aloha network was a star, something like the illustration below:

[Illustration: a star network; remote nodes A, B, C, and D each connect to the hub H.]

The individual nodes A, B, C, etc., communicate with the hub H in Oahu over one broadcast radio channel, and the hub communicates with the nodes over a second broadcast radio channel. The incoming channel was shared by all users. The original idea was for any remote user to send a packet (a short data burst) to the central hub whenever it had any data to send. If it received the packet correctly, the central hub broadcasted an acknowledgment (if the packet was destined for the hub) or rebroadcasted the packet (if the packet was destined for another node).
If two or more nodes transmitted packets that overlapped in time, the hub received none of the packets correctly. This is called a collision. Since the nodes could not hear each other (radios cannot both transmit and receive on the same channel at the same time),
collisions were detected by listening for the hub's response (either the acknowledgment or the rebroadcast). Consider a collision as in the illustration below:

[Illustration: a packet transmitted from $t_1$ to $t_2 = t_1 + T$ (solid line); packets starting anywhere between $t_0 = t_1 - T$ and $t_2$ (dashed lines) overlap it.]
One user sends a packet from time t1 to time t2 = t1 + T (solid line). If another user starts a packet at any time between t0 = t1 − T and t2 (dashed lines), the two packets will partially overlap, and both will need to be retransmitted. Shortly afterward, a significant improvement was realized in Slotted Aloha. As the name suggests, the transmission times became slotted. Packets were transmitted only on slot boundaries. In the example above, the first dashed packet would be transmitted at time t1 and would (completely) collide with the other packet. However, the second dashed packet would wait until t2 and not collide with the other packet. At low rates, Slotted Aloha reduces the collision rate in half. Let us calculate the efficiency of Slotted Aloha. Let there be n nodes sharing the communications channel. In any given slot, each node generates a packet with probability p. We assume the nodes are independent (i.e., whether a node has a packet to transmit is independent of whether any other nodes have packets to transmit). The network successfully transmits a packet if one and only one node transmits a packet. If no nodes transmit, nothing is received. If more than one node transmits, a collision occurs. Let N be the number of nodes transmitting. Then,
$$\Pr[N = 0] = (1-p)^n$$
$$\Pr[N = 1] = \binom{n}{1} p(1-p)^{n-1} = np(1-p)^{n-1}$$
$$\Pr[N \ge 2] = 1 - (1-p)^n - np(1-p)^{n-1}$$
Let λ = np be the offered packet rate, or the average number of packets attempted per slot. The Poisson approximation to the binomial in Equation (6.16) gives a simple expression:
$$\Pr[N = 1] \approx \lambda e^{-\lambda}$$
This throughput expression is plotted in Figure 6.6. The maximum throughput equals $e^{-1} = 0.368$ and occurs when λ = 1. Similarly, $\Pr[N = 0] \approx e^{-\lambda} = e^{-1}$ when λ = 1. So, Slotted Aloha has a maximum throughput of 0.37. This means about 37% of the time, exactly one node transmits and the packet is successfully transmitted; about 37% of the time, no node transmits; and about 26% of the time, collisions occur.
The maximum throughput of Slotted Aloha is rather low, but even 37% overstates the throughput. Consider what happens to an individual packet. Assume some node has a packet to transmit. The probability this packet gets through is the probability no other node has a packet to transmit. When n is large, this is
[Figure: plot of $\Pr[N = 1] \approx \lambda e^{-\lambda}$ versus λ, peaking at 0.368 at λ = 1.]
FIGURE 6.6 Slotted Aloha's throughput for large n.
$\Pr[N = 0] = e^{-\lambda}$. Let this number be r. In other words, with probability $r = e^{-\lambda}$, the packet is successfully transmitted, and with probability $1 - r = 1 - e^{-\lambda}$, it is blocked (it collides with another packet). If blocked, the node will wait (for a random amount of time) and retransmit. Again, the probability of success is r, and the probability of failure is 1 − r. If blocked again, the node will attempt a third time to transmit the packet, and so on. Let T denote the number of tries needed to transmit the packet. Then,
$$\Pr[T = 1] = r \qquad \Pr[T = 2] = r(1-r) \qquad \Pr[T = 3] = r(1-r)^2$$
and so on. In general, $\Pr[T = k] = r(1-r)^{k-1}$. T is a geometric random variable, and the mean of T is
$$E[T] = \frac{1}{r} = e^{\lambda}$$
On average, each new packet thus requires eλ tries before it is transmitted successfully. Since all nodes do this, more and more packets collide, and the throughput drops further until the point when all nodes transmit all the time and nothing gets through. The protocol is unstable unless rates well below 1/n are used. Aloha and Slotted Aloha are early protocols, though Slotted Aloha is still sometimes used (e.g., when a cellphone wakes up and wants to transmit). As mentioned, however, both led to many advances in computer networks that are in use today.
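The single-slot analysis is easy to check by simulation. Below is a rough Python sketch (our own; it simulates independent slots at the throughput-maximizing load λ = 1 and ignores retransmission dynamics):

```python
# Simulate Slotted Aloha slots: n = 50 nodes, each transmitting with
# probability p = 1/50, so lambda = np = 1.
import random

n, p, slots = 50, 1 / 50, 100_000
successes = 0
for _ in range(slots):
    transmitters = sum(random.random() < p for _ in range(n))
    if transmitters == 1:        # exactly one packet gets through
        successes += 1
print(successes / slots)
# Close to the exact value np(1-p)**(n-1) = 0.372 for n = 50,
# and to the large-n limit 1/e = 0.368.
```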
6.8 ERROR CONTROL CODES
Error correcting coding (ECC) is commonly used in communications systems to reduce the effects of channel errors. We assume the data being communicated are bits and each bit is possibly flipped by the channel. The basic idea of error control codes is to send additional bits, which help the receiver detect and correct for any transmission errors. As we shall see, the analysis of error control codes uses binomial probabilities.
ECC is used in many systems. Possibly the first consumer item to use nontrivial ECC was the compact disc (CD) player in the early 1980s. Since then, ECC has been incorporated
into numerous devices, including cell phones, digital television, wireless networks, and others. Pretty much any system that transmits bits uses ECC to combat noise. Throughout this section, we consider a basic class of codes, called linear block codes. Linear block codes are widely used and have many advantages. Other useful codes, such as nonlinear block codes and convolutional codes, also exist but are beyond this text.
6.8.1 Repetition-by-Three Code
As a simple example to illustrate how ECC works, consider the repetition-by-three code. Each input bit is replicated three times, as shown in the table below:

    Input    Output
    0        000
    1        111
The code consists of two code words, 000 and 111. The Hamming distance between code words is defined as the number of bits in which they differ. In this case, the distance is three. Let D(·, ·) be the Hamming distance; for example, D(000, 111) = 3.
Each bit is passed through a binary symmetric channel with crossover probability ε, as illustrated below:

[Diagram: binary symmetric channel. X = 1 goes to Y = 1 and X = 0 goes to Y = 0, each with probability 1 − ε; X = 1 goes to Y = 0 and X = 0 goes to Y = 1, each with probability ε.]
The number of bits in a code word that get flipped is binomial with parameters n = 3 and p = ε. Since ε is small, Equation (6.4) says that the b(n,k,p) sequence is decreasing. Getting no errors is more likely than getting one error, getting one error is more likely than getting two errors, and getting two errors is more likely than getting three errors. If W denotes the number of errors, then
$$\Pr[W = 0] > \Pr[W = 1] > \Pr[W = 2] > \Pr[W = 3]$$
The receiver receives one of eight words, 000, 001, . . ., 111. In Figure 6.7, the two code words are at the left and right. The three words a distance of one away from 000 are listed on the left side, and the three words a distance of one away from 111 are listed on the right.
Error control codes can be designed to detect errors or to correct them (or some combination of both). An error detection scheme is typically used with retransmissions. Upon detecting an error, the receiver asks the transmitter to repeat the communication. Error correction is used when retransmissions are impractical or impossible. The receiver tries to correct errors as best it can.
FIGURE 6.7 Decoding for a simple repetition-by-three code.
Error detection is easy: If one of the words that is not a code word is received, then an error has occurred. If an incorrect code word is received, the receiver decides that no error has occurred. This is called a miss. The probability of a miss is known as the miss rate. In this example, a miss requires all three bits to be received in error:
$$\Pr[\text{error missed}] = \epsilon^3 \qquad (6.17)$$
For ε = 10⁻³, the probability of an error being missed is 10⁻⁹, once every billion data bits.
Error correction is a bit more difficult, but not much. Define the distance between words as the number of bit positions in which they differ. Each particular received word is decoded to the code word to which it is closest. This is known as maximum likelihood decoding. When the error correction fails, a decoding error has occurred. The probability of failed error correction is the decoding error rate.
Consider a simple example of Alice trying to communicate to Bob:

    What Alice wants to communicate:        0     1     1     0
    What Alice transmits after encoding:    000   111   111   000
    What Bob receives:                      000   110   001   100
    What Bob believes was sent:             000   111   000   000
    What Bob decodes to:                    0     1     0     0
In this example, Alice transmits four bits. She encodes each bit with three bits and transmits these (now 12) bits. Due to channel noise, the 12 bits Bob receives differ from those Alice transmitted. In each group of three bits, if two or three are 0, Bob decides (guesses) a 0 was transmitted; if two or more are 1, Bob decides a 1 was transmitted. In total, Bob receives three bits correctly and one bit incorrectly. For instance, if the received word is 001, D(000,001) = 1 and D(111,001) = 2. Since 001 is closer to 000 than it is to 111, the word is decoded as 0. Similarly, the words 000, 010, and 100 are decoded as 0, while the words 111, 011, 101, and 110 are decoded as 1. In Figure 6.7, all the words on the left are decoded to 0, and all those on the right are decoded to 1.
The code word is decoded correctly if the number of errors is zero or one. It is decoded incorrectly if the number of errors is two or three. The decoder error rate is therefore
$$\Pr[W = 2 \cup W = 3] = \Pr[W = 2] + \Pr[W = 3] = \binom{3}{2}\epsilon^2(1-\epsilon)^{3-2} + \binom{3}{3}\epsilon^3(1-\epsilon)^{3-3} = 3\epsilon^2(1-\epsilon) + \epsilon^3 = 3\epsilon^2 - 2\epsilon^3$$
Since ε ≈ 0, this can be approximated as
$$\Pr[W = 2 \cup W = 3] \approx 3\epsilon^2 \quad \text{when } \epsilon \approx 0 \qquad (6.18)$$
Without coding, the probability of a bit being received in error is ε. ECC reduces the decoder error rate if $3\epsilon^2 - 2\epsilon^3 < \epsilon$. This is true for all 0 < ε < 0.5. However, this reduction occurs at the cost of increasing the number of bits transmitted by a factor of three.
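For concreteness, the rates in Equations (6.17) and (6.18) can be evaluated for ε = 10⁻³ (a short Python sketch, our own):

```python
# Miss rate and decoder error rate for the repetition-by-three code.
eps = 1e-3
miss_rate = eps**3                            # Equation (6.17)
decode_err = 3 * eps**2 * (1 - eps) + eps**3  # = 3*eps**2 - 2*eps**3
print(miss_rate)     # 1e-09
print(decode_err)    # about 3e-06, versus the uncoded error rate 1e-03
```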
6.8.2 General Linear Block Codes
The basic ideas presented above can be generalized to better and more elaborate codes. An (n,k) linear block code breaks the message sequence into blocks of k bits at a time. k can be as small as 1 or as big as thousands. Each block of k input bits is encoded as n output bits, as shown below:

    k bits → ECC → n bits
The coding rate or, more simply, the rate of the code is k/n. This is the ratio of input bits to output bits. The rate gives the communications efficiency of the code. Other things being equal, high-rate codes (k/n close to 1) are better than low-rate codes (k/n close to 0).
Since the input is taken k bits at a time, there are $2^k$ possible inputs. To each of these inputs, an n-bit-long code word is assigned. The collection of code words is a code. For instance, a (5,2) code is shown in Table 6.1. Notice the first two bits of each code word are the data bits. Codes with this property are systematic. The remaining bits are parity bits.

TABLE 6.1 A (5,2) block code.

    Label    Data    Code Word
    c1       00      00000
    c2       01      01101
    c3       10      10011
    c4       11      11110
The parity bits can be computed easily using modulo 2 arithmetic. In modulo 2 arithmetic, 0 + 0 = 0, 0 + 1 = 1 + 0 = 1, 1 + 1 = 0, 0 · 0 = 0, 0 · 1 = 1 · 0 = 0, and 1 · 1 = 1. In modulo 2 arithmetic, subtraction is the same as addition.
Comment 6.4: We use three types of algebra in this text. In ordinary arithmetic, 1 + 1 = 2. In Boolean algebra (used in set arithmetic), 1 + 1 = 1. And in modulo 2 arithmetic, 1 + 1 = 0. In modulo 2 arithmetic, subtraction is the same as addition: 1 − 1 = 1 + 1 = 0, 1 − 0 = 1 + 0 = 1, etc. Arithmetic on code words (bit vectors) is performed bitwise, meaning without carries: 0110 + 0011 = 0101, 1101 − 0110 = 1011.
Let the data bits be $X = (X_1, X_2, \ldots, X_k)$, the parity bits P, and the code bits C = (X, P). Define a generator matrix, G:
$$P = XG$$
For instance, in the (5,2) code above,
$$G = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix}$$
For example, if X = (1, 1),
$$P = (1,1)\begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \end{pmatrix} = (1\cdot 0 + 1\cdot 1,\; 1\cdot 1 + 1\cdot 0,\; 1\cdot 1 + 1\cdot 1) = (0+1,\; 1+0,\; 1+1) = (1, 1, 0)$$
The code word is the concatenation of the data bits and the parity bits: C = (1,1,1,1,0). Comment 6.5: The convention in ECC is to use row vectors and multiply the vector on the left of the matrix to give a row vector as the output. We will see this same convention later in discussing Markov random processes. Most other fields use the convention of column vectors and multiplication on the right.
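The matrix encoding is easy to express with numpy. The following sketch (our own; numpy is assumed available) reproduces Table 6.1:

```python
# Systematic encoding for the (5,2) code: parity P = XG in modulo 2 arithmetic.
import numpy as np

G = np.array([[0, 1, 1],
              [1, 0, 1]])

def encode(data_bits):
    x = np.array(data_bits)
    parity = x @ G % 2              # modulo 2 matrix multiply
    return np.concatenate([x, parity])

for data in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(data, encode(data))       # 00000, 01101, 10011, 11110, as in Table 6.1
```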
The advantage of using the generator matrix comes with large codes. Rather than compute a table with $2^k$ rows and n − k columns (only the parity columns are needed), the generator matrix has k rows and n − k columns.
The distance between two code words is measured with the Hamming distance, the number of bits in which they differ. The distance of a code is the minimum distance between any two code words. In the (5,2) code above, the distance of the code is three. The weight of a code word is the number of 1's in the code word. The distance between two code words, $C_i$ and $C_j$, is the weight of $C_i - C_j$.
Computing the distance for a linear code is considerably easier than comparing all pairs of code words. A code is linear if the addition of any two code words is a code word, i.e., if
$C_i + C_j$ is a code word for all $C_i$ and $C_j$. The distance of a linear code is therefore the minimum
weight of the code's nonzero code words. For each received word (n bits), the closest code word is calculated. For many linear block codes, this computation uses matrix arithmetic (or polynomial arithmetic with coefficients calculated modulo 2). But for simple codes, the search can be done by brute force.
A code with distance d can detect d − 1 errors or correct d = d−2 1 errors. (Since some pairs of code words might be farther than d apart, it may sometimes be possible to detect more errors, or to correct more, but we ignore this possibility.) For example, a code with d = 3 can detect 3 − 1 = 2 errors or correct d = 3−2 1 = 1 error. As above, a miss is when the code fails to detect an error. A decoding error occurs when the code fails to correct an error. The probability of a miss is the probability of getting d or more errors in n positions. Again, let W denote the number of errors. A miss occurs when W ≥ d. Thus, n n Pr W ≥ d = l (1 − )n−l
l=d
≈
l
n d (1 − )n−d d
( ≈ 0 means first term dominates)
≈
n d d
(6.19)
The decoding error rate is the probability of getting d + 1 or more errors in the n bits:
n
Pr W ≥ d + 1 =
l=d +1
≈
n l (1 − )n−l l
d + 1
≈
n
n
d +1 (1 − )n−d −1
d + 1
d +1
(6.20)
In the repetition code, n = d, so $\binom{n}{d} = 1$, and Equation (6.19) reduces to Equation (6.17) when d = 3. The result in Equation (6.20) similarly reduces to Equation (6.18).
6.8.3 Conclusions
This section is merely an introduction to the theory of error control codes. Over the years, many codes have been developed. As one might expect, as n and k increase, the performance goes up, but so does the complexity. In some cases (turbo and similar codes), n and k can be in the thousands. For codes that large, complexity issues become extremely important.
Even as k and n increase, however, the coding performance does not increase forever. It is bounded by the channel capacity (similar to the entropy, but a function of the channel, not the input).
Even the simple codes presented here have substantial gains in performance. The decoding error rate without ECC is ε; for the repetition-by-three code, the rate is proportional to ε². For ε ≈ 10⁻³, ECC lowers the overall decoding error rate by a factor of about 1000. With more complicated codes, such as the Hamming and Golay codes (see Problems 6.36, 6.37, and 6.38), the decoding error rate can drop even further.
SUMMARY
The binomial distribution is an important discrete distribution. It occurs in many applications. The random variable N is binomial with parameters n and p if
$$\Pr[N = k] = \binom{n}{k} p^k (1-p)^{n-k}$$
for k = 0,1,...,n. Often it is convenient to let q = 1 − p.
One interpretation of the binomial distribution is that it is a sum of n IID Bernoulli random variables. Let $X_i$ for i = 1,2,...,n be a sequence of IID Bernoulli random variables with $\Pr[X_i = 1] = p$ and $\Pr[X_i = 0] = 1 - p = q$. Then, $S = X_1 + X_2 + \cdots + X_n$ is binomial with parameters n and p. The number of heads obtained in n independent flips of a coin with probability p of coming up heads is binomial.
Let $b(n,k,p) = \Pr[N = k]$. Then b(n,k,p) can be calculated recursively in two useful ways:
$$b(n,k,p) = b(n-1,k,p)\cdot q + b(n-1,k-1,p)\cdot p$$
$$b(n,k,p) = b(n,k-1,p)\cdot\frac{n-k+1}{k}\cdot\frac{p}{q}$$
The first gives a Pascal's triangle-like recursion, allowing whole rows to be computed at once using vector operations, such as in Matlab. The second gives a simple way to compute successive terms one by one.
The mean and variance of N are E[N] = np and Var[N] = np(1−p) = npq. The MGF of N is $M_N(u) = (pe^u + q)^n$.
A sum of independent binomial random variables with the same p is binomial. Let $N_1$ and $N_2$ be independent binomials with the same p and parameters $n_1$ and $n_2$, respectively.
Then, $N = N_1 + N_2$ is binomial with parameters $n = n_1 + n_2$ and p. This is easily shown with MGFs:
$$M_{N_1+N_2}(u) = M_{N_1}(u)\,M_{N_2}(u) = (pe^u + q)^{n_1}(pe^u + q)^{n_2} = (pe^u + q)^{n_1+n_2}$$
The latter expression is the MGF of a binomial random variable with parameters $n_1 + n_2$ and p.
The conditional probability of $N_1 = k$ given $N = N_1 + N_2 = m$, where $N_1$ and $N_2$ are independent binomial with the same p, is hypergeometric:
$$\Pr[N_1 = k \mid N = m] = \frac{\binom{n_1}{k}\binom{n_2}{m-k}}{\binom{n}{m}}$$
The generalization of the binomial distribution to more than two classes is the multinomial distribution:
$$\Pr[N_0 = k_0 \cap \cdots \cap N_{m-1} = k_{m-1}] = \binom{n}{k_0, k_1, \ldots, k_{m-1}} p_0^{k_0} p_1^{k_1}\cdots p_{m-1}^{k_{m-1}}$$
$$E[N_i] = np_i \qquad \mathrm{Var}[N_i] = np_i(1-p_i) \qquad \mathrm{Cov}[N_i, N_j] = -np_i p_j$$
The negative binomial distribution helps answer the reverse question: How many flips are required to obtain k heads? The negative binomial is the sum of k geometric distributions:
$$\Pr[N = n] = \binom{n-1}{k-1} p^k q^{n-k} \quad \text{for } n = k, k+1, k+2, \ldots$$
$$E[N] = \frac{k}{p} \qquad \mathrm{Var}[N] = \frac{k(1-p)}{p^2}$$
Binomial probabilities are approximately Poisson when n is large and p is small, but λ = np is moderate:
$$\Pr[N = k] \approx \frac{\lambda^k}{k!} e^{-\lambda}$$
The Alohanet is an early communications network. The throughput of Slotted Aloha is $\lambda e^{-\lambda}$. Maximum throughput occurs when λ = 1.
Error correcting codes transmit extra bits to combat channel errors. Block codes transmit k data bits and n − k extra bits. A code with a distance of d can detect d − 1 errors and correct $d^* = \lfloor (d-1)/2 \rfloor$ errors. The number of errors in a block is binomial. Therefore, the probability of correctly decoding a block is a binomial probability.
PROBLEMS
6.1 If N is binomial with parameters n = 4 and p = 0.33, what are Pr[N = 0], Pr[N = 1], Pr[N = 2], Pr[N = 3], and Pr[N = 4]?
6.2 If N is binomial with parameters n = 5 and p = 0.4, what are Pr[N = 0], Pr[N = 1], Pr[N = 2], Pr[N = 3], Pr[N = 4], and Pr[N = 5]?
6.3 If X and Y are independent Bernoulli random variables with parameters $p_x$ and $p_y$, respectively (it is possible $p_x \ne p_y$), what is the PMF of S = X + Y?
6.4 If N is negative binomial with parameters k = 3 and p = 0.3, what are Pr[N = 3], Pr[N = 4], Pr[N = 5], Pr[N = 6], and Pr[N = 7]?
6.5 If N is negative binomial with parameters k = 3 and p = 0.5, what are Pr[N = 3], Pr[N = 4], Pr[N = 5], Pr[N = 6], and Pr[N = 7]?
6.6 A sequence of letters are drawn from a three-letter alphabet, "0", "1", and " " (space). Assume the letters are drawn independently with probabilities 0.5, 0.4, and 0.1, respectively.
a. What is the probability a selection of five letters contains three 0's, one 1, and one space?
b. What is the probability a selection of five letters contains all 0's?
c. What is the expected number of 0's in a sequence of 20 letters? What is the expected number of 1's in a sequence of 20 letters?
6.7 A multiple-choice exam consists of 40 questions, each with five possible answers. Assume the student guesses randomly on each question.
a. What is the probability the student answers at least 20 questions correctly?
b. What is the probability the student answers at least 32 questions correctly?
6.8 A multiple-choice exam consists of 50 questions, each with four possible answers. Assume the student guesses randomly on each question.
a. What is the probability the student answers at least 25 questions correctly?
b. What is the probability the student answers at least 40 questions correctly?
6.9 A box contains five red marbles and seven blue marbles. Assume the box is shaken and you make no attempt to "see" or otherwise control the selection of marbles.
a. You reach in and draw two marbles. What is the probability both marbles are red?
b. You reach in, draw a marble, look at it, put it back into the box, and then repeat. What is the probability both marbles are red?
6.10 A box contains five red, four blue, and three yellow marbles. Assume the box is shaken and you make no attempt to "see" or otherwise control the selection of marbles.
a. You reach in and draw two marbles. What is the probability one marble is red and one marble is blue?
b. You reach in, draw a marble, look at it, put it back into the box, and then repeat. What is the probability one marble is red and the other blue (in either order)?
6.11 In basketball, sometimes a player shoots free throws "one and one," meaning if the first free throw goes in the player shoots a second, but if the first misses the player does not shoot a second. Sometimes the player shoots two free throws (the player shoots the second whether or not the first one went in). Each successful free throw is worth one point. Assume the free throws are independent, each with probability p of going in.
a. What is the expected number of points shooting "one and one"?
b. What is the expected number of points shooting two free throws?
c. What is the value of p that maximizes the difference between the two answers above?
6.12 Write a function to compute binomial probabilities using the Pascal's triangle-like recursion. In Matlab and Python, the recursive step (going from n − 1 to n) can be written in one line.
6.13 Show that the binomial probabilities of Equation (6.1) solve the recursion of Equation (6.2).
6.14 Show that Equation (6.5), $k(k-1)\binom{n}{k} = n(n-1)\binom{n-2}{k-2}$, is true.
6.15 For S binomial with parameters n and p, use the MGF to compute E S2 , then finish by computing σ2S . 6.16 Plot binomial and Poisson probabilities for n = 10 and p = 0.1. 6.17 Plot binomial and Poisson probabilities for n = 20 and p = 0.25. 6.18 The variance of a binomial random variable is npq (with q = 1 − p). What value of p maximizes the variance? 6.19 An alternative recursion to Equation (6.3) can be found by considering the ratio b(n,k,p)/b(n − 1,k,p). a. What is this recursion? b. How would you use this recursion to calculate binomial probabilities? 6.20 Let N 0 , N 1 , . . . ,N m−1 be multinomial with parameters n and p0 , p1 , . . . ,pm−1 .
a. Define a covariance matrix, C, such that Ci,i = Var N i and Ci,j = Cov N i ,N j for i = j. That is, the variances are down the main diagonal, and the covariances are the off-diagonal elements. For example, for m = 3, the matrix is ⎛
p 1 (1 − p 1 ) C = n ⎝ −p 1 p 2 −p 1 p 3
−p 1 p 2 p 2 (1 − p 2 ) −p 2 p 3
⎞ −p 1 p 3 −p 2 p 3 ⎠ p 3 (1 − p 3 )
b. Since the counts obey a simple linear constraint, N 0 + N 1 + · · · + N m−1 = n, the covariance matrix is singular. Show this (i.e., that the covariance matrix is singular for any value of n > 1). (Hint: one simple way to show a matrix is singular is to find a vector x = 0 such that Cx = 0.) 6.21 Use the MGF to calculate the mean and variance of a negative binomial random variable N with parameters k and p.
164 CHAPTER 6 BINOMIAL PROBABILITIES
6.22 In a recent five-year period, a small town in the western United States saw 11 children diagnosed with childhood leukemia. The normal rate for towns that size would be one to two cases per five-year period. (No clear cause of this cluster has been identified. Fortunately, the rate of new cancers seems to have reverted to the norm.) a. Incidences of cancer are often modeled with Poisson probabilities. Calculate the probability that a similar town with n = 2000 children would have k = 11 cancers using two different values for the probability, p1 = 1/2000 and p2 = 2/2000. Use both the binomial probability and the Poisson probability. b. Which formula is easier to use? c. Most so-called cancer clusters are the result of chance statistics. Does this cluster seem likely to be a chance event? 6.23 Alice and Bob like to play games. To determine who is the better player, they play a “best k of n” competition, where n = 2k − 1. Typical competitions are “best 2 of 3” and “best 4 of 7”, though as large as “best 8 of 15” are sometimes used (sometimes even larger). If the probability Alice beats Bob in an individual game is p and the games are independent, your goal is to calculate the probability Alice wins the competition. a. Such competitions are usually conducted as follows: As soon as either player, Alice or Bob, wins k games, the competition is over. The remaining games are not played. What is the probability Alice wins? (Hint: this probability is not binomial.) b. Consider an alternative. All n games are played. Whichever player has won k or more games wins the competition. Now, what is the probability Alice wins the competition? (Hint: this probability is binomial.) c. Show the two probabilities calculated above are the same. 6.24 A group of n students engage in a competition to see who is the “best” coin flipper. Each one flips a coin with probability p of coming up heads. If the coin comes up tails, the student stops flipping; if the coin comes up heads, the student continues flipping. The last student flipping is the winner. Assume all the flips are independent. What is the probability at least k of the n students are still flipping after t flips?
6.25 The throughput for Slotted Aloha for finite n is Pr N = 1 = λ(1 − λ/n)n−1 . a. What value of λ maximizes the throughput? b. Plot the maximum throughput for various values of n to show the throughput’s convergence to e−1 when n → ∞. 6.26 In the Aloha collision resolution discussion, we mentioned the node waits a random amount of time (number of slots) before retransmitting the packet. Why does the node wait a random amount of time? 6.27 If S is binomial with parameters n and p, what is the probability S is even? (Hint: Manipulate (p + q)n and (p − q)n to isolate the even terms.) 6.28 Consider an alternative solution to Problem 6.27. Let Sn = Sn−1 + X n , where X n is a Bernoulli random variable that is independent of Sn−1 . Solve for the Pr Sn = even in terms of Pr Sn−1 = even , and then solve the recursion.
Problems 165
6.29 The probability of getting no heads in n flips is (1 − p)n . This probability can be bounded as follows: 1 − np ≤ (1 − p)n ≤ e−np a. Prove the left-hand inequality. One way is to consider the function h(x) = (1 − x)n + nx and use calculus to show h(x) achieves its minimum value when x = 0. Another way is by induction. Assume the inequality is true for n − 1 (i.e., (1 − p)n−1 ≥ 1 − (n − 1)p), and show this implies it is true for n. b. Prove the right-hand side. Use the inequality log(1 + x) ≤ x. c. Evaluate both inequalities for different combinations of n and p with np = 0.5 and np = 0.1. 6.30 Another way to prove the left-hand inequality in Problem 6.29, 1 − np ≤ (1 − p)n , is to use the union bound (Equation 1.7). Let X 1 , X 2 , . . . ,X n be IID Bernoulli random variables. Let S = X 1 + X 2 + · · · X n . Then, S is binomial. Use the union bound to show Pr S > 0 ≤ np, and rearrange to show the left-hand inequality. 6.31 Here’s a problem first solved by Isaac Newton (who did it without calculators or computers). Which is more likely: getting at least one 6 in a throw of six dice, getting at least two 6’s in a throw of 12 dice, or getting at least three 6’s in a throw of 18 dice? 6.32 In the game of Chuck-a-luck, three dice are rolled. The player selects a number between 1 and 6. If the player’s number comes up on exactly one die, the player wins $1. If the player’s number comes up on exactly two dice, the player wins $2. If the player’s number comes up on all three dice, the player wins $3. If the player’s number does not come up, the player loses $1. Let X denote the player’s win or loss.
a. What is E X ? b. In some versions, the player wins $10 if all three dice show the player’s number. Now what is E X ? 6.33 Write a short program to simulate a player’s fortune while playing the standard version of Chuck-a-luck as described in Problem 6.32. Assume the player starts out with $100 and each play costs $1. Let N be the number of plays before a player loses all of his or her money.
a. Generate a large number of trials, and estimate E N and Var N . b. If each play of the actual game takes 30 seconds, how long (in time) will the player typically play? 6.34 Generalize the repetition-by-three ECC code to a repetition-by-r code. a. b. c. d.
What is the probability of a miss? If r is odd, what is the probability of decoding error? What happens if r is even? (It is helpful to consider r = 2 and r = 4 in detail.) Make a decision about what to do if r is even, and calculate the decoding error probability.
166 CHAPTER 6 BINOMIAL PROBABILITIES
6.35 For the (5,2) code in Table 6.1: a. Compute the distance between all pairs of code words, and show the distance of the code is three. b. Show the difference between any pair of code words is a code word. 6.36 A famous code is the Hamming (7,4) code, which uses a generator matrix: ⎛
1 ⎜1 ⎜ G=⎝ 0 1
1 0 1 1
⎞
0 1⎟ ⎟ 1⎠ 1
a. What are all 24 = 16 code words? b. What is the distance of the code? c. What are its miss and decoding error rate probabilities? 6.37 Another famous code is the Golay (23,12) code. It has about the same rate as the Hamming (7,4) code but a distance of seven. a. What are its miss and decoding error probabilities? b. Make a log-log plot of the decoding error rate of the Golay (23,12) code and the Huffman (7,4) code. (Hint: some Matlab commands you might find useful are loglog and logspace. Span the range from 10−4 ≤ ≤ 10−1 .) 6.38 Compare the coding rates and the error decoding rates for the (3,1) repetition code, the (5,2) code above, the (7,4) Hamming code, and the (23,12) Golay code. Is one better than the others?
CHAPTER
7
A CONTINUOUS RANDOM VARIABLE
What is the probability a randomly selected person weighs exactly 150 pounds? What is the probability the person is exactly 6 feet tall? What is the probability the temperature outside is exactly 45◦ F? The answer to all of these questions is the same: a probability of 0. No one weighs exactly 150 pounds. Each person is an assemblage of an astronomical number of particles. The likelihood of having exactly 150 pounds worth of particles is negligible. What is meant by questions like “What is the probability a person weighs 150 pounds?” is “What is the probability a person weighs about 150 pounds?” where “about” is defined as some convenient level of precision. In some applications, 150 ± 5 pounds is good enough, while others might require 150 ± 1 pounds or even 150 ± 0.001 pounds. Engineers are accustomed to measurement precision. This chapter addresses continuous random variables and deals with concepts like “about.”
7.1 BASIC PROPERTIES
A random variable X is continuous if Pr X = x = 0 for all values of x. What is important for continuous random variables are the probabilities of intervals, say, Pr a < X ≤ b . Comment 7.1: Before performing the experiment, the probability that X takes on any specific value is 0, but after performing the experiment, X has a specific value, whatever value actually occurred. The problem is that there are too many real numbers to assign each one a positive probability before the experiment is conducted. Nevertheless, one of the values will occur.
167
168 CHAPTER 7 A CONTINUOUS RANDOM VARIABLE
The cumulative distribution function (CDF) (or simply the distribution function) of X is Fx (x) = Pr X ≤ x for all −∞ < x < ∞. The probability that a < X ≤ b is
Pr a < X ≤ b = FX (b) − FX (a)
(7.1)
Note that for continuous random variables, Pr X = a = 0 = Pr X = b . Thus, all the following are equivalent:
Pr a < X ≤ b = Pr a < X < b
= Pr a ≤ X < b = Pr a ≤ X ≤ b
Comment 7.2: For discrete random variables, the endpoints matter; that is, Pr X < b = Pr X ≤ b in general. The formula
Pr a < X ≤ b = FX (b) − FX (a) is true for both continuous and discrete random variables.
Since X is continuous, the concept of a PMF does not make sense. The equivalent concept is the density, also known more formally as the probability density function (PDF): fX (x) =
d FX (x) dx
(7.2)
Since the density is the derivative of the distribution function, the distribution function is the integral of the density: FX (x) =
x −∞
fX (y) dy
(7.3)
The distribution function obeys all the usual properties: FX (−∞) = 0 FX (v) ≥ FX (u) for v ≥ u FX (∞) = 1 Densities have two important mathematical properties. First, since the distribution function is nondecreasing, the density is nonnegative: fX (x) ≥ 0 for −∞ < x < ∞
(7.4)
Second, the integral of the density everywhere is 1: ∞ −∞
fX (x) dx = 1
(7.5)
7.1 Basic Properties 169
Probabilities can be found by integrating the density function:
Pr a < X ≤ b =
b a
fX (x) dx
(7.6)
Combining Equations (7.1) and (7.6), the probability that a < X ≤ b can be found from the distribution function or by integrating the density is
Pr a < X ≤ b = FX (b) − FX (a) =
b a
fX (x) dx
(7.7)
Figures 7.1 and 7.2 illustrate both approaches. 1.0 F (x ) 0.8 0.5
f (x ) 0.5 a f (x ) 0.8 b
x
FIGURE 7.1 Probabilities can be found from the distribution function or by integrating the density function.
The probability that X “equals” x can be approximated by
Pr X ≈ x = Pr x < X ≤ x + dx =
x+ dx
fX (y) dy ≈ fX (x) (x + dx) − x x
= fX (x) dx
(7.8)
170 CHAPTER 7 A CONTINUOUS RANDOM VARIABLE
1.0 F (x ) 0.8 0.3 = 0.8 − 0.5
0.5
f (x )
0.3 a
b
FIGURE 7.2 Probabilities of intervals can be found by differencing the distribution function or by integrating the density function between the two limits.
when dx is small. Since the length of the integral is small, the integrand is approximately constant. The integral is approximately the height of the integrand, fX (x), times the width of the interval, dx. Comment 7.3: In Equation (7.8) above, think of dx as the “precision” of the approximation. Returning to the example in the introduction to this chapter, we might ask what is the probability a randomly selected person weighs 150 pounds. The answer, of course, is a probability of 0. No one weighs exactly 150 pounds. However, that is not what is normally meant by the question. We want to know “approximately 150 pounds,” where “approximately” might mean within 10 pounds, or within 1 pound, or within 0.01 pounds, etc. This is the role of dx, to represent the precision in the estimate.
Expected values for continuous random variables use the density. For a function g (X ),
E g (X ) =
∞
−∞
g (x)fX (x) dx
(7.9)
Expected values of discrete random variables are sums of the function values times the PMF; expected values of continuous random variables are integrals of the function values times the density. Otherwise, all the properties of expected values (additivity, independence, linear transformations, etc.) are the same for both continuous and discrete random variables. Some random variables are mixed. That is, they are part continuous and part discrete. For instance, let X denote a continuous random variable that varies over a range that = = max ( 0,X ) . Then, Pr Y = 0 includes both positive and negative numbers, and let Y Pr X ≤ 0 , so Y has a discrete component but is continuous when X > 0. In cases like this,
7.2 Example Calculations for One Random Variable 171
expected values are sums over the discrete component and integrals over the continuous part. For example,
E Y = 0 · Pr Y = 0 +
∞ 0
yfX (y) dy
(7.10)
As an observation, mixed random variables occur infrequently. Comment 7.4: There is a tendency to equate the density with a PMF. They are similar, but they are not the same. Both must be nonnegative everywhere, and both must “sum” to 1. One way they differ is that the PMF values must be less than or equal to 1 (since the sum is 1). A density has no such restriction. The density can even be infinite; that is, f (x ) → ∞ as long as the integral of the density is 1 (i.e., as long as the area under the density curve is 1). For instance,
f (x ) =
1 −1/2 2x
for 0 ≤ x ≤ 1
0
otherwise
As x → 0, f (x ) → ∞, but F (x ) is well behaved: ⎧ ⎪ ⎪ ⎨0 & F (x ) = 0x 21 y −1/2 dy ⎪ ⎪ ⎩1
for x < 0 for 0 ≤ x ≤ 1 for 1 < x
=
⎧ ⎪ ⎪ ⎨0
x 1/2
⎪ ⎪ ⎩1
for x < 0 for 0 ≤ x ≤ 1 for 1 < x
Comment 7.5: For discrete random variables, the fundamental probability is PrX = xk ; for continuous random variables, it is the probability of intervals, Pr a < X ≤ b . For discrete random variables, probabilities and expected values are sums against the PMFs; for continuous random variables, probabilities and expected values are integrals against the density functions.
7.2 EXAMPLE CALCULATIONS FOR ONE RANDOM VARIABLE An example of the calculations for a single random variable can be helpful. Let the density be the following:
fX (x) =
c x
0
for 1 ≤ x ≤ 2 for x < 0
172 CHAPTER 7 A CONTINUOUS RANDOM VARIABLE
The density is shown below:
The first question is the value of c. The integral of the density is 1: 1=
∞ −∞
f (x) dx =
2 1
2 c dx = c logx = c log2 x x=1
Therefore, c = 1/ log(2) = 1.44. Probabilities are computed by integrating the density:
Pr X ≤ 1.5 =
1.5 1
1.44 dx = 0.58 x
The distribution function is the integral of the density: ⎧ ⎪ ⎪0 ⎪ x ⎨x 1.44 dy F (x) = f (y) dy = ⎪ −∞ ⎪ 1 y ⎪ ⎩
1
x2
x2
Comment 7.6: In the integral above, x appears in the upper limit. We therefore used a different dummy variable, y, in the integral. Always use a different dummy variable in sums and integrals rather than the other variables in the problem.
The distribution function is shown below:
7.2 Example Calculations for One Random Variable 173
Expected values are computed by integrating against the density function. For instance, here are the mean and variance:
∞
2
2 1.44 dx = 1.44x = 1.44 × (2 − 1) = 1.44 x x=1 −∞ 1 ∞ 2 2 1.44 1.44 × (4 − 1) x2 2 E X = dx = 1.44 = x2 f (x) dx = x2 = 2.16 x 2 2 = 1 x −∞ 1
E X =
xf (x) dx =
2
Var X = E X 2 − E X
x
= 2.16 − 1.442 = 0.086
The standard deviation is the square root of the variance, σ = 0.086 = 0.29. Consider a common transformation, Y = X 2 . What is the density of Y, fY (y)? The simplest way to solve for the density is to solve for the distribution function first, then differentiate to solve for the density. (There are formulas for computing the density directly, but they can be difficult to remember. See Section 8.8.) Since the interesting values of X are 1 ≤ X ≤ 2, the interesting values of Y are 1 ≤ Y ≤ 4. Thus,
FY (y) = Pr Y ≤ y = Pr X 2 ≤ y = Pr − y ≤ X ≤ y = Pr 0 ≤ X ≤ y = 1.44log y = 0.72logy
for 1 ≤ y ≤ 4 d 0.72 for 1 ≤ y ≤ 4 fY (y) = FY (y) = dy y Of course, the complete answer is the following:
fY (y) =
⎧ ⎪ 0 ⎪ ⎪ ⎨ 0.72 ⎪ y ⎪ ⎪ ⎩
0
y4
The mean of Y can be computed two different ways:
E Y =
∞ −∞
yfY (y) dy =
= E X 2 = 2.16
4
y 1
0.72 dy = 0.72 × (4 − 1) = 2.16 y
As this example indicates, most of the calculations are straightforward applications of basic calculus. Integrate the density to get the distribution function. Differentiate the distribution function to get the density. Integrate the density times a function to get the expected value. Most transformations of one random variable to another can be done by first calculating the distribution function of the new random variable and then differentiating to get the density. The main “gotcha” is to make sure the limits of the integrals are correct. Integrate over the region where the density is nonzero. Unfortunately, while setting up the integrals may be straightforward, computing them may be hard—or even impossible. Numerical techniques may be needed.
174 CHAPTER 7 A CONTINUOUS RANDOM VARIABLE
7.3 SELECTED CONTINUOUS DISTRIBUTIONS Of the infinity of continuous distributions, we concentrate in this section on a few that appear in a wide variety of applications. The uniform distribution models experiments where the outcomes are equally likely. The exponential models a wide variety of waiting experiments (i.e., wait until something happens). The most important continuous distribution (by far!) is the Gaussian. It gets its own chapter, Chapter 9. The Laplace or double exponential distribution is used in a number of signal processing contexts and is also presented in Chapter 9.
7.3.1 The Uniform Distribution The simplest continuous distribution is the uniform distribution. We say that X is uniform on (a,b), denoted X ∼ U (a,b) and pronounced “X is distributed as uniform on (a,b)” or, more simply, “X is uniform on (a,b),” if
fX (x) =
⎧ 0 ⎪ ⎪ ⎨
x0 a0
a 0 and a < 0.
8.8 General Transformations and the Jacobian 209
The 1/a in the above expression is the Jacobian of the transformation. It is the derivative of x with respect to y and measures the change in size of differential elements. The absolute value of 1/a is needed to make the density nonnegative: dx dy = |J (x;y)| dy
dx =
dy
A second derivation of Equation (8.10) uses the transformation property of integrals. The transformation Y = aX + b can be considered as a change of variables in an integral. Let u < X ≤ v be an interval where the density is nonzero. In the integral, make a change of variables, y = ax + b. Invert the transformation, x = (y − b)/a, and find the Jacobian, J (x;y) = dx dy = 1/a. Thus,
Pr u ≤ X ≤ v = = =
If a > 0, we can identify fY (y) = a1 fX and use the absolute value:
v u
fX (x) dx
av+b
fX
au+b
av+b
y−b 1 dy a a
fY (y) dy
au+b
(8.11)
y−b a
Pr u ≤ X ≤ v =
. If a < 0, we have to flip the limits of integration
au+b av+b
fX
y−b 1 dy a |a|
(8.12)
Since u and v are arbitrary, it must be true that the densities in Equations (8.11) and (8.12) are the same:
fY (y) = fX
y−b 1 a |a|
(8.13)
Thus, by two different methods, Equations (8.10) and (8.13), we found the density of Y = aX + b. First, we computed the distribution function of Y and differentiated to find the density. Second, we changed variables in an integral over the density function. Now, consider a different transformation, Y = X 2 , as shown in Figure 8.4. The inverse transformation has two solutions:
X = + Y or X = − Y The interval in Y leads to two intervals in X: {u ≤ Y ≤ v} =
u≤X≤ v ∪ − v≤X≤− u
Compute probabilities:
Pr u ≤ Y ≤ v = Pr − v ≤ X ≤ − u + Pr
u≤X≤ v
210 CHAPTER 8 MULTIPLE CONTINUOUS RANDOM VARIABLES
Y = X2 v Y u
– v – u
X
FIGURE 8.4 Illustration of {u ≤ Y ≤ v } = − v ≤ X ≤ − u ∪
u
v
u≤X≤ v .
Replace all probabilities by integrals using the Jacobian, J (x;y) = v u
fY (y) dy = =
−u − v
− u −v
fX (x) dx +
dx dy
= ±0.5y−0.5 :
v
u
fX (x) dx
fX (− y) − 0.5y−0.5 dy +
v u
fX ( y) 0.5y−0.5 dy
The density of Y is the sum of two terms: 1 fY (y) = fX (− y) + fX ( y) 2 y The transformation rules for a change in variable from one variable X to another Y are the following: 1. Invert the transformation Y = g (X ) to get X = h(Y ). There may be multiple solutions to this equation. dx d 2. Compute the Jacobian for each solution, J (x;y) = = h(y). dy dy dx 3. The density is the sum of terms. Each term looks like fX h(y) . dy
EXAMPLE 8.4
Let U ∼ U (0,1), and change variables Y = − log(U )/λ with λ > 0. Since 0 < U < 1, 0 < Y < ∞. −λy . Therefore, The inverse transformation is U = e−λY . The Jacobian is J (u;y) = du dy = −λe the density of Y is fY (y) = fU (e−λy )|−λe−λy | = λe−λy
for 0 < y < ∞
(uniform density, fU e−λy = 1)
8.8 General Transformations and the Jacobian 211
Thus, Y has an exponential density. Computers can easily generate uniform random variables. This transformation gives an easy way to change those uniform random variables to exponential random variables.
EXAMPLE 8.5
An Erlang random variable S is the sum of IID exponential random variables. Since an exponential is easy to generate, so is an Erlang: S = Y1 + Y2 + ··· + Yn = =
− logU 1 − logU 2 − · · · − logU n λ − log(U 1 U 2 · · · U n ) λ
Generate n uniform random variables, multiply them together, take minus the log, and divide by lambda. Transformations for multidimensional densities are similar, but the Jacobian is more complicated. Let X 1 , X 2 , . . . ,X n be the original n random variables and Y 1 , Y 2 , . . . ,Y n the new random variables. The first step is to invert the transformation: X 1 = X 1 (Y 1 ,Y 2 , . . . ,Y n ) X 2 = X 2 (Y 1 ,Y 2 , . . . ,Y n ) .. . X n = X n (Y 1 ,Y 2 , . . . ,Y n ) Next, substitute x for X and y for Y, and compute the partial derivative of xi with respect to yj : jij =
∂xi ∂yj
The determinant of the matrix of partial derivatives is the Jacobian: j 11 j 21 J = . .. jn1
j12 j22 .. .
··· ···
jn2
···
..
.
j1n j2n .. .
jnn
The density of the Y values is fY (y) = fX x(y) |J (x;y)| where we introduced a simple vector notation X = [X 1 ,X 2 , . . . ,X n ]T , x = [x1 ,x2 , . . . ,xn ]T ,
Y = [Y 1 ,Y 2 , . . . ,Y n ]T , and y = [y1 ,y2 , . . . ,yn ]T .
212 CHAPTER 8 MULTIPLE CONTINUOUS RANDOM VARIABLES
EXAMPLE 8.6
Consider a transformation from rectangular coordinates, (X,Y ), to polar coordinates, (R, Θ). The transformation is R=
X2 + Y 2 Y Θ = tan−1 X The inverse transformation is X = Rcos Θ Y = Rsin Θ The Jacobian is ∂x ∂r J (x,y;r, θ) = ∂y ∂r
∂x ∂θ cos θ = ∂y sin θ ∂θ
−r sin θ = r cos2 θ + r sin2 θ = r r cos θ
The density in polar coordinates is
fRΘ (r, θ) = r · fXY r cos θ,r sin θ
For instance, if fXY (x,y) = 1/π for x2 + y2 ≤ 1 (the shaded area in the circle above), fRΘ (r, θ) = r · fXY (r cos θ,r sin θ) = r fR (r) = FR (r) = fΘ (θ) =
2π 0
r 0
1 0
fRΘ (r, θ) dθ =
fR (s) ds = r π
dr =
r
1 2π
0
2π 0
1 π
r π
2sds = r 2
=
r π
dθ = 2π
for 0 ≤ r ≤ 1 and 0 ≤ θ ≤ 2π r π
= 2r
for 0 < r < 1
for 0 < r < 1
for 0 ≤ θ ≤ 2π
For comparison, see Section 9.6.2 where we solve the transformation from rectangular to polar coordinates by first computing the joint distribution function and then the density functions.
8.8 General Transformations and the Jacobian 213
Comment 8.6: Notice in Example 8.6 that the Jacobian is applied when converting from (X,Y) to (R, Θ), but is not applied after the transformation is completed—that is, when computing fR (r ) from fRΘ (r, θ).
EXAMPLE 8.7
Let S = X + Y. To calculate fX |S (x|S = s) = fXS (x,s)/fS (s), we need the transformed density, fXS (x,s). The initial variables are (X,Y ), and the transformed variables are (X,S). The transformation can be written with matrix-vector arithmetic:
x 1 0 x = s 1 1 y
The inverse transformation is
x 1 0 x = −1 1 s y
The Jacobian is the determinant of the matrix: 1 J (x,y;x,s) = −1
0 = 1 · 1 − 0 · (−1) = 1 1
The conditional density is fX |S (x|S = s) =
fXS (x,s) fXY (x,s − x) fXY (x,s − x) = ·1 = fS (s) fS (s) fS (s)
For instance, if X and Y are IID exponential random variables, then S is an Erlang random variable: fX |S (x|S = s) =
fXY (x,s − x) fX (x)fY (s − x) λe−λx λe−λ(s−x) 1 = = = fS (s) fS (s) s λ2 se−λs
The conditional density of X given S = s is uniform on (0,s). In summary, transformations from one variable to another, or from one set of variables to another set, occur in many applications of probability. If the transformation is simple enough, it is often easiest to compute the distribution function for the new variable and then differentiate to obtain the density. If the problem is more complicated, the transformed density can be found by inverting the transformation (solving for the old variables in terms of the new ones) and then computing the Jacobian of the transformation. The Jacobian measures the change in size of a differential volume element from the old to the new coordinate system. More information about transformations and Jacobians can be found in advanced calculus texts.
214 CHAPTER 8 MULTIPLE CONTINUOUS RANDOM VARIABLES
8.9 PARAMETER ESTIMATION FOR THE EXPONENTIAL DISTRIBUTION In Section 6.6, we considered the problem of estimating the parameter p in a Bernoulli distribution. The estimate is the sample average of the observations. Here, we consider the problem of estimating the parameter in the exponential distribution. Let X 1 , X 2 , . . . ,X n be IID exponential random variables with parameter λ:
E X i = 1/λ This suggests an estimator: ˆ= λ
n X1 + X2 + · · · + Xn
This is a reasonable estimator, and we will say more about it in Section10.10. ˆ and Var λ ˆ . For The problem with this estimator is that it is difficult to calculate E λ these reasons, this estimator is not often used. Most statisticians reparamaterize the density by replacing λ by 1/μ: fx (x) =
1 μ
e−x/μ
for x ≥ 0
In this paramaterization, E X = μ and Var X = μ2 . An obvious estimate of μ is X1 + X2 + · · · + Xn n
ˆ= μ E μˆ = μ
Var μˆ =
n · Var X i μ2 = 2 n n
We see that the sample mean μˆ is an unbiased and consistent estimator of μ.
8.10 COMPARISON OF DISCRETE AND CONTINUOUS DISTRIBUTIONS Of the limitless set of probability distributions, we have emphasized a small subset. These occur in many applications and are characteristic of the various fields in which probability is needed. Table 8.2 lists some common distributions. (The Gaussian distribution is discussed in Chapter 9. The Poisson process is continuous in time but discrete in probability, as will be discussed in Section 14.2.) An understanding of the commonalities and differences between these distributions is important for understanding the application of probability.
Summary 215 TABLE 8.2 Common discrete and continuous distributions. Discrete
Continuous
Comment
Uniform Geometric Negative binomial Binomial Multinomial Poisson
Uniform Exponential Erlang Gaussian Multivariate Gaussian Poisson process
All values equally likely Waiting time for an event Waiting time for k events Sums of small effects Vector of observations Counts of rare events
SUMMARY
Multiple random variables are a generalization of a single random variable:
FXY (x,y) = Pr X ≤ x ∩ Y ≤ y FX (x) = FXY (x, ∞)
FY (y) = FXY (∞,y) ∂ ∂ ∂ ∂ FXY (x,y) = FXY (x,y) ∂x ∂y ∂y ∂x x y FXY (x,y) = fXY (u,v) dv du −∞ −∞ ∞ fX (x) = fXY (x,y) dy −∞ ∞ fY (y) = fXY (x,y) dx
fXY (x,y) =
−∞
The density and cumulative distribution functions factor when X and Y are independent. For all u and v, fXY (u,v) = fX (u)fY (v) FXY (u,v) = FX (u)FY (v) Expected values generalize as expected. Special expected values include the correlation and covariance:
E g (X,Y ) =
∞ ∞
−∞ −∞
g (x,y)fXY (x,y) dx dy
rxy = E XY =
∞ ∞
−∞ −∞
xyfXY (x,y) dx dy
σxy = E (X − μx )(Y − μy ) = rxy − μx μy
The mean and variance for a sum of random variables can be computed:
E S = E X + Y = E X + E Y = μx + μy 2
Var S = E S2 − E S = σ2x + 2σxy + σ2y
216 CHAPTER 8 MULTIPLE CONTINUOUS RANDOM VARIABLES
Conditional probabilities are built around the densities: ∂ ∂ FXY (x,y) ∂x ∂y fXY (x,y) fXY (x,y) ∂ fX |Y (x|y) = FX |Y (x|y) = = = &∞ ∂x fY (y) fY (y) −∞ fXY (s,y) ds
The LTP for continuous random variables is fX (x) =
∞
−∞
fX |Y (x|y)fY (y) dy
The density of the sum Z = X + Y of independent random variables is the convolution of the individual densities: fZ = fX ∗ fY To compute the density of a transformed random variable, Y = g (X ), it is usually best to compute FY (y), then differentiate to get fY (y). If that is not easy, then the density can be computed using the inverse transform, x = h(y), and the Jacobian, J (x,y) = dx/ dy:
dx
fY (y) = fX h(y)
dy
PROBLEMS 8.1 Let X and Y have joint density fXY (x,y) = c(x2 + 2xy) for 0 < x < 1 and 0 < y < 1 and fXY (x,y) = 0 elsewhere. a. b. c. d.
What is c? What are fX (x) and fY (y)? What are E X , E Y , Var X , and Var Y ? What is E XY ? Are X and Y independent?
8.2 Let X and Y have joint density fXY (x,y) = cx2 y for x and y in the triangle defined by 0 < x < 1, 0 < y < 1, 0 < x + y < 1 and fXY (x,y) = 0 elsewhere. a. b. c. d.
What is c? What are fX (x) and fY (y)? What are E X , E Y , Var X , and Var Y ? What is E XY ? Are X and Y independent?
8.3 If X and Y are IID U (0,1) random variables: a. What is the density of Z = X + Y? b. What are the mean and variance of Z? 8.4 Let X and Y be IID U (0,1) random variables. What are the density, mean, and variance of Z = 2X + Y? 8.5 Let X and Y be independent with X ∼ U (0,1) and Y ∼ U (0,2). What are the mean, variance, density, and distribution of Z = X + Y?
Problems 217
8.6
Let X and Y be independent with X ∼ U (0,1) and Y exponential with parameter λ. What are the mean, variance, density, and distribution function of Z = X + Y?
8.7
Let X and Y be independent with X exponential with parameter λx and Y exponential with parameter λy = λx . What are the mean, variance, density, and distribution function of Z = X + Y?
8.8
Let X and Y have means μx and μy , variances σ2x and σ2y , and covariance σxy , and let S = aX + bY, where a and b are constants. a. What are the mean and variance of S? b. Use this result to compute the variance of X − Y.
8.9
Let X 1 , X 2 , . . . ,X n be IID with mean μ and variance σ2 . What are the mean and variance of S = X 1 + X 2 + · · · + X n ?
8.10 If X and Y are IID U (0,1), let Z = XY. a. What are the mean and variance of Z? (Hint: you can answer this without knowing the density of Z.) b. (hard) What are the density and distribution function of Z? c. (hard) Calculate the mean and variance of Z by integrating against the density. 8.11 The Cauchy distribution arises from the ratio of two standard normal (Gaussian) random variables. Among other things, it shows that a distribution does not always have a mean and finite variance. The (standard) Cauchy density is f X (x ) =
1 π(1 + x2 )
for −∞ < x < ∞
a. What is the Cauchy CDF? b. Show the mean of a Cauchy random variable is undefined and the variance is infinite. 8.12 Check that fY (y) in (equation 8.9) is correct by showing its integral is 1. 8.13 Plot fX (x) (equation 8.8) and fY (y) (equation 8.9). 8.14 For the same triangular region in the example in Section 8.5, let fXY (x,y) = c(1 − x). Calculate the following: a. b. c. d. e. f. g. h. i.
f X (x ) f Y (y ) f X | Y ( x |y ) f Y | X ( y |x ) E X E Y Var X Var Y Cov X,Y
218 CHAPTER 8 MULTIPLE CONTINUOUS RANDOM VARIABLES
8.15 What are the mean and variance of an Erlang random variable?
8.16 Let S = N k=1 X i be a random sum. Assume N is Poisson with parameter λ and the X k are IID exponential with parameter γ. Furthermore, assume the X k are independent of N. What are the mean and variance of S? The next four problems refer to the picture below. The density is uniform in the shaded semicircle. radius = 1
−1
0
1
8.17 Select a random point in the semicircle, and represent it by the usual Cartesian coordinates, X and Y. a. b. c. d.
What is the functional form of the joint density function, fXY (x,y)? What are the marginal densities, fX (x) and fY (y)? Are X and Y independent? What are the means of X and Y?
8.18 For the same semicircle: a. What are the conditional densities, fX |Y (x|Y = y) and fY |X (y|X = x)? b. What are the conditional means, EX |Y X |Y = y and EY |X y|X = x ? 8.19 The distribution functions of X and Y are difficult to calculate directly (by integrating the marginal densities). They can also be calculated by integrating the joint density. Let’s consider FX (x). a. Draw the regions of integration for x = −0.4, x = 0, and x = 0.2. b. Sketch the distribution function (qualitatively). On the same sketch, add the distribution function for a U (−1,1) random variable (perhaps a dashed line), showing clearly how the two distribution functions differ. c. FX (x) can be calculated geometrically from the area of a pie slice and a triangle. Draw pictures showing this, making sure to consider both x < 0 and x > 0. What is FX (x)? Note that in this problem, the density is a constant. Integrals over a region become the constant times the area of the region. If the density is not a constant, the integral must be computed. 8.20 For the same semicircle, represent the same random point in polar coordinates, R and Θ.
a. What is the region of integration of FRΘ (r, θ) = Pr R ≤ r ∩ Θ ≤ θ ? b. Do the integral. What is the joint distribution function?
Problems 219
c. d. e. f.
What is the joint density? What are the marginal densities? Are R and Θ independent? What are the means and variances of R and Θ?
8.21 What is your favorite computational package’s command for generating exponential random variables? Write a short program to compute the sample mean and variance of n IID exponential random variables. Test your program with small values of n, and compare the results to hand calculations. a. How large can n be with the computation still taking less than one second? b. How large can n be with the computation still taking less than one minute?
8.22 Let X be an IIDexponential variable with parameter λ = 3. What are Pr X ≥ 3 , random Pr X ≥ 4 X ≥ 3 , and E X X ≥ 3 ? Do the following Monte Carlo experiment: generate n = 1000 exponential random variables λ = 3, X 1 , X 2 , . . . ,X n . What are the sample with parameter probabilities Pr X i ≥ 3 and Pr X i ≥ 4 X i ≥ 3 and the sample conditional mean E X i X i ≥ 4 ? 8.23 Compare your Monte Carlo estimates to the computed probabilities in Problem 8.22. 8.24 Let X 1 , X 2 , . . . ,X n be n IID exponential random variables with parameter λ. Let x0 > 0 be a known value. What is the probability at least k of the n random variables are greater than x0 ?
8.25 Consider the random sum S = N i=1 X i , where the X i are IID Bernoulli random variables with parameter p and N is a Poisson random variable with parameter λ. N is independent of the X i values. a. Calculate the MGF of S. b. Show S is Poisson with parameter λp. Here is one interpretation of this result: If the number of people with a certain disease is Poisson with parameter λ and each person tests positive for the disease with probability p, then the number of people who test positive is Poisson with parameter λp. 8.26 Repeat Example 8.7 with X and Y being IID U (0,1) random variables. 8.27 Let X and Y be independent with X Bernoulli with parameter p and Y ∼ U (0,1). Let Z = X + Y. a. What are the mean and variance of Z? b. What are the density and distribution function of Z? c. Use the density of Z to compute the mean and variance of Z.
220 CHAPTER 8 MULTIPLE CONTINUOUS RANDOM VARIABLES
8.28 Let X and Y be independent with X binomial with parameters n and p and Y ∼ U (0,1). Let Z = X + Y. a. What are the mean and variance of Z? b. What are the density and distribution function of Z? c. Use the density of Z to compute the mean and variance of Z. 8.29 Let X and Y be IID U (0,1) random variables and let Z = X /Y. a. What are the density and distribution functions of Z? b. What is the median of the distribution of Z? c. What is the mean of Z?
CHAPTER
9
THE GAUSSIAN AND RELATED DISTRIBUTIONS
The Gaussian distribution is arguably the most important probability distribution. It occurs in numerous applications in engineering, statistics, and science. It is so common, it is also referred to as the “normal” distribution.
9.1 THE GAUSSIAN DISTRIBUTION AND DENSITY X is Gaussian with mean μ and variance σ2 if 1 2 2 fX (x) = e−(x−μ) /2σ σ 2π
for −∞ < x < ∞
(9.1)
The Gaussian distribution is also known as the normal distribution. We use the notation X ∼ N (μ, σ2 ) to indicate that X is Gaussian (normal) with mean μ and variance σ2 . A standard normal random variable, Z, has mean 0 and variance 1; that is, Z ∼ N (0,1). The Gaussian density has the familiar “bell curve” shape, as shown in Figure 9.1.
σ
f (x) =
2σ
1 2π
μ
1 σ
2π
e–(x–μ)2/2σ 2
x
FIGURE 9.1 The Gaussian density.
221
222 CHAPTER 9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
Shown below is a comparison of two Gaussians with the same means but different σ values: σ=1 σ=2
Notice the larger σ results in the peak of the density shrinking and the width spreading out. The Gaussian distribution function is the integral of the density function: FX (x) =
x
1
−∞
σ 2π
e−(y−μ)
2 /2σ2
dy
(9.2)
Unfortunately, this function cannot be computed in closed form with elementary functions (basically, we cannot write down a simple expression for the integral). However, the function can be computed numerically and is presented in Tables C.1 and C.2 in Appendix C. The distribution function can be simplified significantly, nevertheless. First, define φ(z) and Φ(z) as 1
2
φ(z) = e−z /2 2π z Φ(z) = φ(y) dy
for −∞ < z < ∞
(9.3) (9.4)
−∞
That is, φ(z) and Φ(z) are the density and distribution function of a standard normal random variable (i.e., μ = 0 and σ = 1). (Φ(z) is shown in Figure 9.2.) Then, Equation (9.2) can be 0.999 0.98 0.84 Φ(z ) 0.5
0.16 0.02 −3
−2
−1
0
1
2
FIGURE 9.2 The standard normal (Gaussian) distribution function.
z
3
9.1 The Gaussian Distribution and Density 223
written as 1 φ (x − μ)/σ σ x x−μ x−μ σ 1 φ (y − μ)/σ dy = FX (x) = φ(z) dz = Φ σ −∞ σ −∞
fX (x) =
(9.5)
where we made the substitution z = (y − μ)/σ and dz = dy/σ. This development is handy. First, it shows that Φ(z) is sufficient to calculate all Gaussian probabilities. So, even though the Gaussian density cannot be integrated in closed form, only one table of Gaussian probabilities needs to be calculated. (The alternative would be separate tables for each combination of μ and σ.) Second, it shows that if Z ∼ N (0,1), then X = σZ + μ ∼ N (μ, σ2 ). In other words, a linear operation on a Gaussian random variable yields a Gaussian random variable (with possibly a different mean and variance). Later in this chapter, we show that this generalizes to multiple Gaussian random variables: linear operations on Gaussian random variables yield Gaussian random variables. Here is an alternative development using Equations (9.4) and (9.5) to calculate a probability of X:
Pr a < X ≤ b = Pr a − μ < X − μ ≤ b − μ a−μ X −μ b−μ < ≤ = Pr
(subtract μ everywhere) (divide everything by σ)
σ σ σ a−μ b−μ 0
(since E Z 2 = 1)
Var S = 2k
(since Var Z 2 = 2)
The Gamma function, Γ(x), appears in the density of a chi-squared random variable. It is a continuous version of the factorial function: Γ(x) =
∞
t x−1 e−t dt
(definition)
(9.26)
0
= (x − 1)Γ(x − 1) Γ(n) = (n − 1)! Γ(0.5) = π
(recursive formula)
for n a positive integer
(4 − 1)! = 6 Γ(x ) (3 − 1)! = 2 π = 1.77 (2 − 1)! = (1 − 1)! = 1 0.5 1
2
x
3
4
The chi-squared density is shown in Figure 9.10 for k = 1, k = 2, k = 5, and k = 10. Notice that as k increases, the density moves to the right (E S = k) and spreads out (Var S = 2k). The distribution function of the chi-squared density is computed numerically. It is tabulated for various values of the degrees-of-freedom parameter, k. The Matlab command for evaluating the chi-squared distribution function is chi2cdf. For instance, if S is chi-squared with k = 5 degrees of freedom, then
Pr S ≥ 8 = 1 − chi2cdf(8,5) = 1 − 0.844 = 0.156 The chi-squared density for k = 5 is shown in Figure 9.11. The shaded area represents the probability S ≥ 8. The F distribution arises from the ratio of independent chi-squared random variables. If S and T are independent chi-squared random variables with ds and dt degrees of freedom,
9.5 Related Distributions 239
0.75
k=1
0.50
k=2
0.25
k=5 k = 10
0 0
5
10
x
15
FIGURE 9.10 Comparison of several chi-squared densities.
f S (x ) =
0.20
1 2 5/2
Γ(5/2)
x 1.5 e − x /2 = 0.133 x 1.5 e − x /2
0.15 0.10
0.156
0.05 0 0
5
10
x
15
FIGURE 9.11 Chi-squared density for k = 5.
respectively, then Z = S/T has the F distribution with parameters ds and dt :
Z=
S ∼ F (ds ,dt ) T
The density and distribution function are complicated. Probabilities are computed numerically. The F distribution is named after famed statistician Ronald Fisher (1890–1962). It arises in certain statistical tests when checking the “goodness of fit” of models to data.
240 CHAPTER 9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
9.6 MULTIPLE GAUSSIAN RANDOM VARIABLES The Gaussian distribution extends straightforwardly to two or more random variables. Multiple random variables occur in many signal processing, communications, and statistical applications. We first discuss independent Gaussian random variables, then two correlated Gaussian random variables. In Chapter 11, we introduce random vectors and use them to represent the general case of many Gaussian random variables.
9.6.1 Independent Gaussian Random Variables The first generalization is from one random variable to two independent random variables. Let X ∼ N (μx , σ2x ) and Y ∼ N (μy , σ2y ) be independent Gaussian random variables. The joint density and distribution are the product of the individual densities and distribution functions, respectively: fXY (x,y) = fX (x)fY (y) =
1 2πσx σy
exp −
(x − μx )2
2σ2x
FXY (x,y) = Pr X ≤ x ∩ Y ≤ y = FX (x)FY (y) = Φ
−
(y − μy )2
2σ2y
x − μx σx
y − μy Φ σy
In Figure 9.12, we show 1000 randomly selected Gaussian points. The three circles have radii 1σ, 2σ, and 3σ and contain about 39%, 86%, and 99% of the points, respectively. Notice the “cloud” is denser in the center and has no obvious angular pattern.
FIGURE 9.12 “Cloud” of 1000 two-dimensional Gaussian points.
9.6 Multiple Gaussian Random Variables 241
The joint density of n independent random variables is the product of the individual densities. For simplicity, let the X i be IID N (0, σ2 ), and let s2n = x12 + x22 + · · · + xn2 . Then, 2 x1 + x22 + · · · + xn2 fx1 ···xn (x1 ,x2 , . . . ,xn ) = (2πσ ) exp − 2σ2 2 s = (2πσ2 )−n/2 exp − n 2 2σ 2 −n/2
Notice the joint density depends only on sn and not on the individual xi . Since s2n is the distance from the origin to the observation, the density depends only on the distance and not on the direction. In other words, the joint density function for IID N (0, σ2 ) Gaussian random variables is circularly symmetric. Comment 9.4: The circularly symmetric property of the Gaussian distribution gives a convenient method for generating uniform points on a hypersphere. (In two dimensions, a hypersphere is a circle; in three dimensions, it is an ordinary sphere.) Let n be the dimension of the hypersphere, let Zi for i = 1,2, . . . ,n by a sequence of n IID N(0,1) random variables, and let S2n = Z21 + Z22 + · · · + Z2n . Normalize each observation by Yi = Xi /Sn . Then, (Y1 ,Y2 , . . . ,Yn ) is a uniformly chosen point on the n-dimensional hypersphere of radius 1.
9.6.2 Transformation to Polar Coordinates The transformation to polar coordinates is interesting in several ways. First, we derive the distributions of the radius and angle and show the two polar coordinates are independent of each other. Second, we use the polar coordinate result to show that 1/ 2π is the correct normalizing constant for the Gaussian density. Third, we use the polar coordinate result to develop one technique for the computer generation of Gaussian pseudo-random variables. Let X and Y be IID N (0, σ2 ) random variables. Consider the transformation to polar coordinates, R = X 2 + Y 2 and Θ = tan−1 (Y /X ), as shown below:
(X,Y)
R
Θ
242 CHAPTER 9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
The joint distribution function of R and Θ can be found by integrating the Gaussian over the “pie slice”:
r
θ
The two dimensional integral of the joint Gaussian density over the pie slice can be computed by switching to polar coordinates: 2 x + y2 exp − dy dx FRΘ (r, θ) = 2πσ2 2σ2 θ r 1 = exp(−s2 /2σ2 )sdsdν 2 0 0 2πσ θ r 1 1 −s2 /σ2 = dν e sds 2 2π 0 σ 0 θ 2 2 = 1 − e− r / 2 σ 2π
1
= FΘ (θ)FR (r)
(9.27)
Equation (9.27) shows several things: 1. FRΘ (r, θ) = FR (r)FΘ (θ). Thus, R and Θ are independent. In other words, for two IID Gaussian random variables, the distance from the origin and the angle are independent. 2. Θ ∼ U (0,2π). Thus, all angles are equally likely. The joint density of two IID Gaussians is circularly symmetric. 3. R = X 2 + Y 2 has a Rayleigh distribution. 4. We can show the area under the Gaussian density is 1. Even though we cannot integrate a single Gaussian density, we can easily integrate the bivariate density of two independent Gaussian random variables: ?
1=
∞ ∞ −∞ −∞
1 −(x2 +y2 )/2σ2 e dx dy 2πσ2
θ 2 2 = · 1 − e− r / 2 σ r=∞ 2π θ=2π
= 1·1 = 1
We have shown that two IID zero-mean Gaussian random variables are equivalent to a Rayleigh and a uniform. One way for computers to generate Gaussian random variables
9.6 Multiple Gaussian Random Variables 243
is to generate a Rayleigh and a uniform and then invert the relation. This is known as the Box-Muller method. Let U 1 and U 2 be two U (0,1) random variables. (Computers can generate pseudo-random uniform distributions fairly easily.) Let FR (r) be the Rayleigh distribution and form R = FR−1 (U 1 ). First, we show R is Rayleigh:
Pr R ≤ r = Pr FR−1 (U 1 ) ≤ r
= Pr FR (FR−1 (U 1 )) ≤ FR (r) = Pr U 1 ≤ FR (r) = FU (FR (r))
(FR (FR−1 (U 1 )) = U 1 ) (definition of FU )
= FR (r)
(FU (u) = u)
Now, we have to invert the relation. Since R = FR−1 (U 1 ), U 1 = FR (R): U 1 = FR (R) = 1 − e−R
2 /2σ2
) R = −2σ2 log(1 − U 1 )
(solving for R)
Since 1 − U 1 is also U (0,1), the formula can be simplified slightly: R=
)
−2σ2 log(U 1 )
(9.28)
Generating Θ is easy: Θ = 2πU 2
(9.29)
The last part is to invert the polar relation to get X and Y: X = Rcos(Θ)
(9.30)
Y = Rsin(Θ)
(9.31)
Comment 9.5: Every modern computer language and numerical package has some library function for generating Gaussian random variables (often randn). The Box-Muller method is not the fastest way (each iteration requires the computation of a square root, a logarithm, a sine, and a cosine), but it is easy to implement. In Section 9.7.3, we modify the Box-Muller method to generate conditional Gaussians.
9.6.3 Two Correlated Gaussian Random Variables In this section, we look at two correlated Gaussian random variables and study how the correlation between them affects the pattern of random samples. Let X ∼ N (μx , σ2x ) and Y ∼ N (μy , σ2y ) be Gaussian random variables with covariance σxy . Introduce normalized random variables U and V: U=
X − μx σx
V=
Y − μy
(9.32)
σy
U and V are both standard normal variables with covariance ρ = σxy (σx σy ).
244 CHAPTER 9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
The joint density of U and V is 2 u − 2ρ uv + v2 fUV (u,v) = exp − 2(1 − ρ 2 ) 2π 1 − ρ 2
1
(9.33)
When ρ = 0, the joint density reduces to the product of fU (u) times fV (u). That is, when ρ = 0, U and V are independent and so therefore are X and Y. The conditional density of U given V = v is fUV (u,v) fV (v) 1
fU |V (u|v) =
2 u − 2ρ uv + v2 v2 + exp − = 2(1 − ρ 2 ) 2 2π(1 − ρ 2 ) 2 2 2 1 u − 2ρ uv + ρ v exp − = 2 2(1 − ρ 2 ) 2π(1 − ρ ) −(u − ρ v)2 1 = exp 2(1 − ρ 2 ) 2π(1 − ρ 2 )
(9.34)
From this, we conclude that U given V = v is N (ρ v,1 − ρ 2 ). Notice also that the conditional mean of U depends on v, but the conditional variance does not. We can convert back to X and Y if desired. For instance,
E X Y = y = E (σx U + μx )V = (y − μy )/σy = μx + = μx +
σxy σ2y
σx ρ (y − μy ) σy
(y − μy )
Var X Y = y = Var (σx U + μx )V = (y − μy )/σy = σ2x (1 − ρ 2 ) There are many examples of correlated Gaussian random variables. One common statistical exercise is to estimate one random variable given the observation of the other. For instance, given a person’s height, one might estimate that person’s weight. A common estimate is to select the value that minimizes the expected squared error:
Xˆ = argmin E (X − Xˆ )2 Y = y
a
Differentiating with respect to Xˆ and setting the derivative to 0 yield
Xˆ = E X Y = y = μx +
σxy σ2y
(y − μy )
The estimate can be interpreted in a predictor-corrector fashion: the estimate is the prediction μx plus the correction, which is the gain σxy /σ2y times the innovation y − μy . Comment 9.6: When doing calculations on Gaussian random variables, it is almost always easier to convert to normalized versions, as we did in (9.32), do the probability calculations on the normalized versions, and then convert back if needed.
9.6 Multiple Gaussian Random Variables 245
If ρ = 0, then the random variables are independent. Figure 9.12 in Section 9.6.1 shows the circular pattern of 1000 Gaussian pairs. Note the lack of angular dependence in that figure. On the other hand, the figure below shows 500 Gaussian pairs, each with ρ = 0.5:
v ρ = 0.5
u
The ellipses correspond to the circles in Figure 9.12, and the dependence between U and V is clear. As U increases, so does V (at least on average). Most points cluster roughly along the line u = v. The figure below shows 500 Gaussian pairs with ρ = 0.9: v ρ = 0.9
u
Note how the data points are tightly bunched near the line u = v.
246 CHAPTER 9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
When ρ is negative, the points cluster the other way, as the figure below shows: v ρ = − 0.9
u
In this case, when U is large and positive, V is likely to be large and negative, and vice versa.
9.7 EXAMPLE: DIGITAL COMMUNICATIONS USING QAM In this section, we consider an important application of Gaussian probabilities, the communication of data over noisy channels. A common technique for digital communication is called quadrature amplitude modulation (QAM). Here, we consider a typical example of QAM-modulated data transmitted over a communications channel that adds Gaussian noise.
9.7.1 Background Let S(t ) be the signal that is sent, R(t ) the received signal, and N (t ) the noise. Thus, N (t ) S(t )
+
R(t )
R(t ) = S(t ) + N (t ) It turns out that since sin(θ) and cos(θ) are orthogonal over a complete period, 2π 0
cos θ sin θ dθ = 0
9.7 Digital Communications Using QAM 247
the signals can be broken up into a cosine term and a sine term. These are known as the quadrature components: S(t ) = Sx (t ) cos(ωc t ) + Sy (t ) sin(ωc t ) N (t ) = N x (t ) cos(ωc t ) + N y (t ) sin(ωc t ) R(t ) = Rx (t ) cos(ωc t ) + Ry (t ) sin(ωc t ) = Sx (t ) + N x (t ) cos(ωc t ) + Sy (t ) + N y (t ) sin(ωc t ) where ωc is the carrier frequency in radians per second, ωc = 2πfc , where fc is the carrier frequency in hertz (cycles per second). The receiver demodulates the received signal and separates the quadrature components: Rx (t ) = Sx (t ) + N x (t ) Ry (t ) = Sy (t ) + N y (t ) In the QAM application considered here, Sx (t ) = Sy (t ) =
∞
n=−∞ ∞
n=−∞
Ax (n)p(t − nT ) Ay (n)p(t − nT )
The pair (Ax , Ay ) represents the data sent. This is the data the receiver tries to recover (much more about this below). The p(t ) is a pulse-shape function. While p(t ) can be complicated, in many applications it is just a simple rectangle function:
p(t ) =
1 0 Δ /Pr error = exp(Δ2 /2σ2 ) = exp(ψ2 /2). Selected values of exp(ψ2 /2) are shown in Table 9.1. When ψ = 3, the speedup is a factor of 90; when ψ = 4, the speedup factor is almost 3000.
9.7 Digital Communications Using QAM 257
FIGURE 9.17 Illustration of Monte Carlo study for error rate of a rectangle region, but conditioned on being outside the circle of radius Δ. The illustration is drawn for ψ = Δ/σ = 2.0. TABLE 9.1 Table of Monte Carlo speedups attained versus ψ. ψ
1.0
1.5
2.0
2.5
3.0
3.5
4.0
exp(ψ2 /2)
1.6
3.1
7.4
23
90
437
2981
Recall that in Section 9.6.2 we showed the Box-Muller method can generate two IID zero-mean Gaussian random variables from a Rayleigh distribution and a uniform distribution. Here, we modify the Box-Muller method to generate conditional Gaussians and dramatically speed up the Monte Carlo calculation. Generate a Rayleigh distribution conditioned on being greater than Δ (i.e., conditioned on being greater than the radius of the circle): Pr Δ ≤ R ≤ r Pr R ≤ r R ≥ Δ = Pr R ≥ Δ
=
e−Δ
2 /2σ2
− e− r
2 /2σ2
e−Δ2 /2σ2
= 1 − e−(r
2 −Δ2 )/2σ2
The conditional density can be calculated by differentiating Equation (9.46): fR (r|R ≥ Δ) =
r σ2
e−(r
2 −Δ2 )/2σ2
for r > Δ
(9.46)
258 CHAPTER 9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
To generate the conditional Rayleigh distribution, we must invert the CDF:
U 1 = 1 − e−(R
2 −Δ2 )/2σ2
=⇒ R =
)
Δ2 − 2σ2 log(1 − U 1 )
(9.47)
The formula for R can also be written as )
R = σ ψ2 − 2log(1 − U 1 ) After dividing by σ, R/σ can be generated as R σ
=
)
ψ2 − 2log(1 − U 1 )
This form is easy to use in Equation (9.45). As before, 1 − U 1 can be replaced by U 1 to slightly simplify the computation. Comment 9.8: One might object that doing a Monte Carlo simulation on a problem for which we know the exact answer is a waste of time. Not true at all. In any numerical problem, it is essential to test your code on problems for which you know the answer before trusting your code on problems for which you do not know the answer. Test early, and test often.
9.7.4 QAM Recap This QAM example has been lengthy. Let us recap what we have done: • First, we discussed the digital communications problem showing how the quadrature components are used. • Second, after allowing the receiver to perform “several signal processing steps,” a continuous time signal was converted to a discrete time signal. (In fact, the symbols are separated in time and we can study one time instance.) We calculated probabilities of error for three popular signal constellations. • Third, we used a Monte Carlo simulation to calculate the same probabilities of error. A more extensive digital communications course would relax our various assumptions (e.g., that the time instances are truly separated), but we have studied many of the fundamentals.
Summary 259
SUMMARY
The Gaussian (normal) distribution occurs in numerous applications. X is Gaussian with mean μ and variance σ2 , written X ∼ N (μ, σ2 ), if 1 2 2 f (x) = e−(x−μ) /2σ σ 2π The Gaussian density has the familiar bell shape. Z is standard normal if Z ∼ N (0,1). X and Z are related by the following linear transformations: X = σZ + μ
Z=
X −μ σ
The distribution function of a standard normal random variable is F (z) = Φ(z). Probabilities of X can be computed as
Pr X ≤ x = Φ((x − μ)/σ) Pr a ≤ X ≤ b = Φ (x − b)/σ − Φ (x − a)/σ
The moments of Z are
E Z
n
=
0 n odd (n − 1)(n − 3) · · · 3 · 1 n even
The moment generating function (MGF) of Z is MZ (u) = eu
2 /2
An important reason why the Gaussian distribution is so common is the Central Limit Theorem (CLT), which says that a sum of IID random variables with mean μ and finite variance σ2 < ∞ is approximately Gaussian: Sn = X 1 + X 2 + · · · + X n
Sn − E Sn
Yn = )
S n − nμ = σn ≈ N (0,1) Var Sn
When using the CLT to approximate a discrete random variable, it is best to shift the endpoints by 0.5. For example, if S is binomial,
Pr k ≤ S ≤ l ≈ Φ
l − np + 0.5 k − np − 0.5 −Φ npq npq
The Gaussian is related to many other distributions. The Laplace distribution is symmetric about its mean, as is the Gaussian, but its tails are heavier. The Laplace distribution has density fX (x) =
λ
2
e−λ|x−μ|
260 CHAPTER 9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
If X and Y are N (0,s2 ), then R =
X 2 + Y 2 is a Rayleigh distribution: fR (r) =
r −r 2 /2s2 e s2
The chi-squared distribution arises from the sum of squares of zero-mean Gaussian random variables. If S = Z 21 + Z 22 + · · · + Z 2k , fS (x) =
1 2k/2 Γ(k/2)
x(k/2)−1 e−x/2
E S =k
Var S = 2k The joint density of two standard normal random variables with covariance ρ is fUV (u,v) =
1
2π 1 − ρ 2
exp −
u2 − 2ρ uv + v2 2(1 − ρ 2 )
The conditional distribution of U given V = v is N(ρv, 1 − ρ²).

The noise in many communications problems is modeled as Gaussian. For instance, in QAM, the received signal is demodulated to quadrature components corrupted by additive Gaussian noise.

Monte Carlo experiments are often employed to estimate complicated probabilities or moments. A computer generates thousands or even millions of random trials and keeps track of the estimated probabilities and moments.

PROBLEMS

9.1 If X ∼ N(1,4), compute the following:
a. Pr X < 0
b. Pr X ≤ 0
c. Pr 0 ≤ X ≤ 4

9.2 If X ∼ N(−1,5), compute the following:
a. Pr X < 0
b. Pr X > 0
c. Pr −2 < X < 0
d. Pr −2 < X < 2
9.3 If X ∼ N(1,4) and Y = 2X − 1:
a. What are the mean and variance of Y?
b. What is the density of Y?
9.4 If X ∼ N(3,16) and Y = 3X + 4:
a. What are the mean and variance of Y?
b. What is the density of Y?

9.5 How would you use erfcinv to compute Φ⁻¹(p) for 0 ≤ p ≤ 1?

9.6 The Gaussian quantile function is often tabulated for p > 0.5. How would you use that table to compute Q(p) for p < 0.5?

9.7 For X = σZ + μ with Z ∼ N(0,1), use Equations (9.13) and (9.14) to calculate E X, E X², E X³, and E X⁴.

9.8 Use the MGF (Equation 9.15) to calculate E Z, E Z², E Z³, and E Z⁴ for Z ∼ N(0,1).

9.9 If X = σZ + μ with Z ∼ N(0,1), use the MGF of Z to derive the MGF of X. (Hint: you do not need to compute any integrals.) Use the MGF of X to compute E X, E X², E X³, and E X⁴.
9.10 Where do the Gaussian and Laplacian densities intersect? Assume both have zero mean, and let λ = 1 and σ = 1.

9.11 What is the CDF of the Laplace distribution?
9.12 What are Pr X ≥ k for k = 1, 2, 3, 4 for X Gaussian (σ = 1) and X Laplacian (λ = 1), both with zero mean? See how much heavier the Laplacian's tails are in comparison.

9.13 Let X ∼ N(0, σ²). What is the density of Y = X²? (Hint: the solution for σ = 1 is in the chapter, if you want to check your answer.)

9.14 Let X ∼ N(μ, σ²) and Y = e^X. What is the density of Y? (Comment: Y has the log-normal distribution with parameters μ and σ².)

9.15 As in Problem 9.14, let X ∼ N(0, σ²) and Y = e^X. What are the mean and variance of Y?

9.16 Generalize Problem 9.14 by letting Y = e^{aX} with a > 0 and X ∼ N(μ, σ²).
a. What is the density of Y?
b. What are the mean and variance of Y?

9.17 Let Z ∼ N(0,1). Then,
Pr Z > z₀ = ∫_{z₀}^{∞} φ(z) dz ≤ ∫_{z₀}^{∞} (z/z₀) φ(z) dz
a. Justify the inequality above.
b. Compute the integral, and derive a useful inequality for Gaussian tail probabilities. What is it?
c. Use Table C.2 in Appendix C to compare this bound to exact tail probabilities for z₀ = 1, 2, 3, 4, and 5.

9.18 Let X ∼ N(μ, σ²) and let W = aX + b.
a. What is the density of W?
b. What is the distribution function of W?
c. What is the MGF of W?
9.19 Let X be chi-squared with k degrees of freedom, and let Y = sX, where s > 0 is a known scaling constant. What is the density of Y?

9.20 Let X be chi-squared with k = 5 degrees of freedom.
a. What is Pr X > 1?
b. As k gets large, a chi-squared random variable can be approximated by a Gaussian. How accurate is the approximation for k = 5? Evaluate the probability Pr X > x₀ for various values of x₀ using the chi-squared distribution and the Gaussian approximation.

9.21 Let Y = |Z|, where Z ∼ N(0,1). What are the moments of Y?

9.22 Molecules of a gas in thermal equilibrium move at random speeds V with the Maxwell-Boltzmann distribution, having density f(v) = √(2/π) (v²/a³) e^{−v²/(2a²)} for v > 0. Use the results of Problem 9.21 to calculate the mean and variance of a Maxwell-Boltzmann random variable.
9.23 If Z ∼ N(0,1), show Var Z² = 2.

9.24 Use the properties of the Gamma function, (9.26) and what follows, to calculate Γ(1.5) and Γ(2.5).

9.25 Evaluate exact binomial probabilities and approximations (Equations 9.16 and 9.17) for the following:
a. n = 10, p = 0.5, k = 3, l = 6
b. n = 10, p = 0.2, k = 3, l = 6
c. n = 20, p = 0.5, k = 6, l = 12
d. n = 20, p = 0.2, k = 6, l = 12
How well do the approximations perform?

9.26 Use the Gaussian approximation to estimate the multiple-choice probabilities in Problem 6.7.

9.27 Use the Gaussian approximation to estimate the multiple-choice probabilities in Problem 6.8.

9.28 Plot on the same axes the distribution function of an Erlang random variable with n = 5 and that of a Gaussian random variable with the same mean and variance.
9.29 Let T = X₁X₂···Xₙ. Assume Xᵢ = 1 + Yᵢ with the Yᵢ IID and E Yᵢ = 0 and Var Yᵢ = σᵢ².

An estimator is consistent if it converges in probability to the quantity being estimated, that is, if Pr |θ̂ₙ − θ| > ε → 0 as n → ∞ for all ε > 0. Chebyshev's inequality implies that if the estimator is unbiased and its variance approaches 0 as n → ∞, then the estimator is consistent. This is usually the easiest way to demonstrate that an estimator is consistent. In this example, the sample average is an unbiased and consistent estimator of p. Equation (10.2) shows the estimator is unbiased, and Equation (10.3) shows the variance goes to 0 as n tends to infinity.

Both unbiasedness and consistency are generally considered to be good characteristics of an estimator, but just how good is p̂ = X̄ₙ? Or, to turn the question around, how large does n have to be in order for us to have confidence in our prediction? The variance of X̄ₙ depends on the unknown p, but we can bound it as follows: p(1 − p) ≤ 0.25, with equality when p = 0.5. Since the unknown p is likely to be near 0.5, the bound is likely to be close; that is, p(1 − p) ≈ 0.25.

[Figure: plot of p(1 − p) versus p on 0 ≤ p ≤ 1, peaking at 0.25 when p = 0.5.]
Now, we can use Chebyshev's inequality (Equation 4.16) to estimate how good this estimator is:

Pr |p̂ − p| > ε ≤ Var p̂ / ε² = p(1 − p)/(nε²) ≤ 1/(4nε²)

Let us put in some numbers. If 95% of the time we want to be within ε = 0.1, then

Pr |p̂ − p| > 0.1 ≤ 1/(4n · 0.1²) ≤ 1 − 0.95
Solving this yields n ≥ 500. To get this level of accuracy (−0.1 ≤ p̂ − p ≤ 0.1) and be wrong no more than 5% of the time, we need to sample n = 500 voters. Some might object that getting to within 0.1 is unimpressive, especially for an election prediction. If we tighten the bound to ε = 0.03, the number of voters we need to sample becomes n ≥ 5555. Sampling 500 voters might be doable, but sampling more than 5000 sounds like a lot of work. Is n ≥ 5555 the best estimate we can find? The answer is no. We can find a better estimate. We used Chebyshev's inequality to bound n. The primary advantage of the Chebyshev bound is that it applies to any random variable with a finite variance. However, this comes at a cost: the Chebyshev bound is often loose. By making a stronger assumption (in this case, that the sample average is approximately Gaussian), we can get a better estimate of n. In this example, p̂ = X̄ₙ is the average of n IID Bernoulli random variables. Hence, X̄ₙ is binomial (scaled by 1/n), and since n is large, p̂ is approximately Gaussian by the CLT: p̂ ∼ N(p, pq/n). Thus,
Pr −ε ≤ p̂ − p ≤ ε = Pr −ε/√(pq/n) ≤ (X̄ₙ − p)/√(pq/n) ≤ ε/√(pq/n)    (normalize)
                  ≈ Pr −ε/√(1/(4n)) ≤ (X̄ₙ − p)/√(1/(4n)) ≤ ε/√(1/(4n))    (bound pq)
                  ≈ Pr −2ε√n ≤ Z ≤ 2ε√n    (Z ∼ N(0,1))
                  = Φ(2ε√n) − Φ(−2ε√n)
                  = 2Φ(2ε√n) − 1
                  = 2Φ(z) − 1    (z = 2ε√n)
Solving 2Φ(z) − 1 = 0.95 yields z ≈ 1.96.

[Figure: standard normal density φ(z), with the central 95% of the area between z = −1.96 and z = 1.96.]
Now, solve for n:

1.96 = z = 2ε√n
n = 1.96²/(4ε²) = 0.96/ε² ≈ ε⁻²    (10.4)

For ε = 0.03, n = 1067, which is about one-fifth as large as the Chebyshev inequality indicated. Thus, for a simple election poll to be within ε = 0.03 of the correct value 95% of the time, a sample of about n = 1067 people is required.
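Both sample-size calculations are easy to check numerically. Here is a minimal sketch in Python (the helper functions are ours, and SciPy is assumed for the Gaussian quantile function):

import math
from scipy.stats import norm

def n_chebyshev(eps, tau):
    # Chebyshev: Pr(|p_hat - p| > eps) <= 1/(4*n*eps**2) <= tau
    return math.ceil(1 / (4 * tau * eps**2))

def n_gaussian(eps, tau):
    # CLT: 2*Phi(2*eps*sqrt(n)) - 1 >= 1 - tau, so 2*eps*sqrt(n) >= Q(1 - tau/2)
    z = norm.ppf(1 - tau / 2)           # 1.96 when tau = 0.05
    return math.ceil((z / (2 * eps)) ** 2)

print(n_chebyshev(0.03, 0.05))   # 5556: the Chebyshev bound (n >= 5555, up to rounding)
print(n_gaussian(0.03, 0.05))    # 1068: the CLT estimate (about n = 1067)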
FIGURE 10.1 Effect of sample size on confidence intervals. As n increases, the confidence interval shrinks, in this example, from 0.35 ≤ pˆ ≤ 0.55 for n = 100 to 0.4 ≤ pˆ ≤ 0.5 for n = 400 and to 0.425 ≤ pˆ ≤ 0.475 for n = 1600. Each time n is increased by a factor of four, the width of the confidence interval shrinks by a factor of two (and the height of the density doubles). See Example 10.1 for further discussion.
EXAMPLE 10.1
Let's illustrate how the confidence intervals vary with the number of samples, n. As above, consider a simple poll. We are interested in who is likely to win the election. In other words, we want to know whether p < 0.5 or p > 0.5. How large must n be so that, with probability at least 0.95, p̂ falls on the same side of 0.5 as p? In this example, we assume p = 0.45. The distance between p and 0.5 (the dividing line between winning and losing) is ε = 0.5 − 0.45 = 0.05. From Equation (10.4),

n ≈ ε⁻² = (0.05)⁻² = 400

If ε is halved, n increases by a factor of four; if ε is doubled, n decreases by a factor of four. Thus, for n = 100, ε = 0.1; for n = 1600, ε = 0.025. The 95% confidence intervals for n = 100, for n = 400, and for n = 1600 are shown in Figure 10.1. For n = 100, we are not able to predict a winner, as the confidence interval extends from p̂ = 0.35 to p̂ = 0.55; that is, the confidence interval is on both sides of p = 0.5. For n = 400, however, the confidence interval is smaller, from p̂ = 0.40 to p̂ = 0.50. We can now, with 95% confidence, predict that Bob will win the election.
Comment 10.1: In the Monte Carlo exercise in Section 9.7.3, p is small and varies over several orders of magnitude, from about 10⁻¹ to 10⁻⁴. Accordingly, we used a relative error to assess the accuracy of p̂ (see Equation 9.43):

Pr −ε ≤ (p̂ − p)/p ≤ ε ≥ 0.95    (10.5)

In the election poll, p ≈ 0.5. We used a simple absolute, not relative, error:

Pr −ε ≤ p̂ − p ≤ ε ≥ 0.95

There is no one rule for all situations: choose the accuracy criterion appropriate for the problem at hand.
10.2 ESTIMATING THE MEAN AND VARIANCE

The election poll example above illustrated many aspects of statistical analysis. In this section, we present some basic ideas of data analysis, concentrating on the simplest, most common problem: estimating the mean and variance.

Let X₁, X₂, . . . , Xₙ be n IID samples from a distribution F(x) with mean μ and variance σ². Let μ̂ denote an estimate of μ. The most common estimate of μ is the sample mean:

μ̂ = X̄ₙ = (1/n) Σ_{i=1}^{n} Xᵢ
E μ̂ = E X̄ₙ = μ
Var μ̂ = Var X̄ₙ = σ²/n
Thus, the sample mean is an unbiased estimator of μ. It is also a consistent estimator of μ since it is unbiased and its variance goes to 0 as n → ∞. The combination of these two properties is known as the weak law of large numbers.

Comment 10.2: The weak law of large numbers says the sample mean of n IID observations, each with finite variance, converges to the underlying mean, μ. For any ε > 0,

Pr |X̄ₙ − μ| < ε → 1 as n → ∞

As one might suspect, there is also a strong law of large numbers. It says the same basic thing (i.e., the sample mean converges to μ), but the mathematical context is stronger:

Pr lim_{n→∞} X̄ₙ = μ = 1
The reasons why the strong law is stronger than the weak law are beyond this text, but are often covered in advanced texts. Nevertheless, in practice, both laws indicate that if the data are IID with finite variance, then the sample average converges to the mean.
The sample mean can be derived as the estimate that minimizes the squared error. Consider the following minimization problem:

min_a Q(a) = Σ_{i=1}^{n} (Xᵢ − a)²

Differentiate Q(a) with respect to a, set the derivative to 0, and solve for a, replacing a by â:

0 = (d/da) Q(a) |_{a=â} = −2 Σ_{i=1}^{n} (Xᵢ − â) = −2 (Σ_{i=1}^{n} Xᵢ − nâ)
â = (1/n) Σ_{i=1}^{n} Xᵢ = X̄ₙ
We see the sample mean is the value that minimizes the squared error; that is, the sample mean is the best constant approximating the Xᵢ, where "best" means minimal squared error.

Estimating the variance is a bit tricky. Let σ̂² denote our estimator of σ². First, we assume the mean is known and let the estimate be the obvious sample mean of the squared errors:

σ̂² = (1/n) Σ_{k=1}^{n} (Xₖ − μ)²
E σ̂² = (1/n) Σ_{k=1}^{n} E (Xₖ − μ)² = (1/n) Σ_{k=1}^{n} σ² = σ²

So, σ̂² is an unbiased estimator of σ². In this estimation, we assumed that the mean, but not the variance, was known. This can happen, but often both are unknown and need to be estimated. The obvious generalization is to replace μ by its estimate X̄ₙ, but this leads to a complication:

E Σ_{k=1}^{n} (Xₖ − X̄ₙ)² = (n − 1)σ²    (10.6)
An unbiased estimator of the variance is

σ̂² = (1/(n − 1)) Σ_{k=1}^{n} (Xₖ − X̄ₙ)²

In the statistics literature, this estimate is commonly denoted S², or sometimes S²ₙ:

S²ₙ = (1/(n − 1)) Σ_{k=1}^{n} (Xₖ − X̄ₙ)²    (10.7)
E S²ₙ = σ²
It is worth emphasizing: the unbiased estimate of variance divides the sample squared error by n − 1, not by n. As we shall see in Section 10.10, the maximum likelihood estimate of variance is a biased estimate as it divides by n. In practice, the unbiased estimate (dividing by n − 1) is more commonly used.
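As a quick illustration (a minimal sketch, assuming NumPy), the ddof argument selects the divisor: ddof=0 divides by n (the biased, maximum likelihood form), while ddof=1 divides by n − 1 (the unbiased S²ₙ of Equation 10.7):

import numpy as np

x = np.array([130.0, 180.0, 200.0, 145.0])   # any small data set

print(x.mean())          # sample mean
print(x.var(ddof=0))     # divides by n (biased)
print(x.var(ddof=1))     # divides by n - 1 (unbiased)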
10.3 RECURSIVE CALCULATION OF THE SAMPLE MEAN

In many engineering applications, a new observation arrives each discrete time step, and the sample mean needs to be updated for each new observation. In this section, we consider recursive algorithms for computing the sample mean.

Let X₁, X₂, . . . , Xₙ₋₁ be the first n − 1 samples. The sample mean after these n − 1 samples is

X̄ₙ₋₁ = (X₁ + X₂ + ··· + Xₙ₋₁)/(n − 1) = (1/(n − 1)) Σ_{k=1}^{n−1} Xₖ

At time n, a new sample arrives, and a new sample mean is computed:

X̄ₙ = (X₁ + X₂ + ··· + Xₙ)/n = (1/n) Σ_{k=1}^{n} Xₖ
When n gets large, this approach is wasteful. Surely, all that work in computing X̄ₙ₋₁ can be useful to simplify computing X̄ₙ. We present two approaches for reducing the computation. For the first recursive approach, define the sum of the observations as Tₙ = X₁ + X₂ + ··· + Xₙ. Then, Tₙ can be computed recursively:

Tₙ = Tₙ₋₁ + Xₙ

A recursive computation uses previous values of the "same" quantity to compute the new values of the quantity. In this case, the previous value Tₙ₋₁ is used to compute Tₙ.

EXAMPLE 10.2
The classic example of a recursive function is the Fibonacci sequence. Let f(n) be the nth Fibonacci number. Then,

f(0) = 0
f(1) = 1
f(n) = f(n − 1) + f(n − 2) for n = 2, 3, . . .

The recurrence relation can be solved, yielding the sequence of numbers f(n) = 0, 1, 1, 2, 3, 5, 8, 13, . . . .
The sample average can be computed easily from Tₙ:

X̄ₙ = Tₙ/n

The algorithm is simple: at time n, compute Tₙ = Tₙ₋₁ + Xₙ and divide by n: X̄ₙ = Tₙ/n.

EXAMPLE 10.3
Bowling leagues rate players by their average score. To update the averages, the leagues keep track of each bowler's total pins (the sum of his or her scores) and each bowler's number of games. The average is computed by dividing the total pins by the number of games bowled.

Note that the recursive algorithm does not need to keep track of all the samples. It only needs to keep track of two quantities, T and n. Regardless of how large n becomes, only these two values need to be stored.

For instance, assume a bowler has scored 130, 180, 200, and 145 pins. Then,

T₀ = 0
T₁ = T₀ + 130 = 0 + 130 = 130
T₂ = T₁ + 180 = 130 + 180 = 310
T₃ = T₂ + 200 = 310 + 200 = 510
T₄ = T₃ + 145 = 510 + 145 = 655

The sequence of sample averages is

X̄₁ = T₁/1 = 130
X̄₂ = T₂/2 = 155
X̄₃ = T₃/3 = 170
X̄₄ = T₄/4 = 163.75
The second recursive approach develops the sample mean in a predictor-corrector fashion. Consider the following:

X̄ₙ = (X₁ + X₂ + ··· + Xₙ)/n
   = (X₁ + X₂ + ··· + Xₙ₋₁)/n + Xₙ/n
   = ((n − 1)/n) · (X₁ + X₂ + ··· + Xₙ₋₁)/(n − 1) + Xₙ/n
   = ((n − 1)/n) · X̄ₙ₋₁ + Xₙ/n
   = X̄ₙ₋₁ + (1/n)(Xₙ − X̄ₙ₋₁)    (10.8)
   = prediction + gain · innovation

In words, the new estimate equals the old estimate (the prediction) plus a correction term. The correction is a product of a gain (1/n) times an innovation (Xₙ − X̄ₙ₋₁). The innovation is what is new about the latest observation. The new estimate is larger than the old if the new observation is larger than the old estimate, and vice versa.
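A minimal Python sketch of the predictor-corrector update in Equation (10.8), checked against the bowling scores of Example 10.3 (the function name is ours):

def update_mean(prev_mean, x_new, n):
    # prediction + gain * innovation (Equation 10.8)
    return prev_mean + (x_new - prev_mean) / n

mean = 0.0
for n, x in enumerate([130, 180, 200, 145], start=1):
    mean = update_mean(mean, x, n)
    print(n, mean)    # 130, 155, 170, 163.75, as in Example 10.3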
EXAMPLE 10.4
The predictor-corrector form is useful in hardware implementations. As n gets large, T might increase toward infinity (or minus infinity). The number of bits needed to represent T can get large: too large for many microprocessors. The predictor-corrector form is usually better behaved. For example, assume the Xₖ are IID uniform on 0, 1, . . . , 255. The Xₖ are eight-bit numbers with an average value of 127.5. When n = 1024, 0 ≤ T₁₀₂₄ ≤ 255 × 1024 = 261,120. It takes 18 bits to represent these numbers. When n = 10⁶, it takes 28 bits to represent T. The sample average is between 0 and 255 (since all the Xₖ are between 0 and 255). The innovation term, Xₙ − X̄ₙ₋₁, is between −255 and 255. Thus, the innovation requires nine bits, regardless of how large n becomes.
Comment 10.3: Getting a million samples sounds like a lot, and for many statistics problems, it is. For many signal processing problems, however, it is not. For example, CD-quality audio is sampled at 44,100 samples per second. At this rate, a million samples are collected every 23 seconds!
10.4 EXPONENTIAL WEIGHTING

In some applications, the mean can be considered quasi-constant: the mean varies in time, but slowly in comparison to the rate at which samples arrive. One approach is to employ exponential weighting. Let α be a weighting factor between 0 and 1 (a typical value is 0.95). Then,

μ̂ₙ = (Xₙ + αXₙ₋₁ + α²Xₙ₋₂ + ··· + αⁿ⁻¹X₁)/(1 + α + α² + ··· + αⁿ⁻¹)

Notice how the importance of old observations decreases exponentially. We can calculate the exponentially weighted estimate recursively in two ways:

Sₙ = Xₙ + αSₙ₋₁
wₙ = 1 + αwₙ₋₁
μ̂ₙ = Sₙ/wₙ = μ̂ₙ₋₁ + (1/wₙ)(Xₙ − μ̂ₙ₋₁)

In the limit, as n → ∞, wₙ → 1/(1 − α). With this approximation, the expression for μ̂ₙ simplifies considerably:

μ̂ₙ ≈ μ̂ₙ₋₁ + (1 − α)(Xₙ − μ̂ₙ₋₁) = αμ̂ₙ₋₁ + (1 − α)Xₙ

This form is useful when processing power is limited.
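A sketch of both recursions in Python (the function names are ours; the steady-state version seeds the estimate with the first sample, a common but arbitrary choice):

def ew_mean_exact(samples, alpha=0.95):
    # exact form: mu_n = S_n / w_n with S_n = x_n + alpha*S_{n-1}, w_n = 1 + alpha*w_{n-1}
    s = w = 0.0
    for x in samples:
        s = x + alpha * s
        w = 1 + alpha * w
    return s / w

def ew_mean_steady(samples, alpha=0.95):
    # steady-state approximation: mu_n = alpha*mu_{n-1} + (1 - alpha)*x_n
    mu = samples[0]
    for x in samples[1:]:
        mu = alpha * mu + (1 - alpha) * x
    return mu

data = [130, 180, 200, 145]
print(ew_mean_exact(data), ew_mean_steady(data))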
Comment 10.4: It is an engineering observation that many real-life problems benefit from some exponential weighting (e.g., 0.95 ≤ α ≤ 0.99) even though theory indicates the underlying mean should be constant.
10.5 ORDER STATISTICS AND ROBUST ESTIMATES

In some applications, the observations may be corrupted by outliers, which are values that are inordinately large (positive or negative). The sample mean, since it weights all samples equally, can be corrupted by the outliers. Even one bad sample can cause the sample average to deviate substantially from its correct value. Similarly, estimates of the variance that square the data values are especially sensitive to outliers. Estimators that are insensitive to these outliers are termed robust. A large class of robust estimators are based on the order statistics.

Let X₁, X₂, . . . , Xₙ be a sequence of random variables. Typically, the random variables are IID, but that is not necessary. The order statistics are the sorted data values:

X₁, X₂, . . . , Xₙ  →(sort)→  X_(1), X_(2), . . . , X_(n)

The sorted values are ordered from low to high:

X_(1) ≤ X_(2) ≤ ··· ≤ X_(n)

The minimum value is X_(1), and the maximum value is X_(n). The median is the middle value. If n is odd, the median is X_((n+1)/2); if n is even, the median is usually defined to be the average of the two middle values, (X_(n/2) + X_(n/2+1))/2. For instance, if the data values are 5, 4, 3, 8, 10, and 1, the sorted values are 1, 3, 4, 5, 8, and 10. The minimum value is 1, the maximum is 10, and the median is (4 + 5)/2 = 4.5.

Probabilities of order statistics are relatively easy to calculate if the original data are IID. First, note that even for IID data, the order statistics are neither independent nor identically distributed. They are dependent because of the ordering; that is, X_(1) ≤ X_(2), etc. For instance, X_(2) cannot be smaller than X_(1). The order statistics are not identically distributed for the same reason.

We calculate the distribution functions of order statistics using binomial probabilities. The event X_(k) ≤ x is the event that at least k of the n random variables are less than or equal to x. The probability any one of the Xᵢ is less than or equal to x is F(x); that is, p = F(x). Thus,
Pr X_(k) ≤ x = Σ_{j=k}^{n} (n choose j) p^j (1 − p)^{n−j} = Σ_{j=k}^{n} (n choose j) F^j(x)(1 − F(x))^{n−j}    (10.9)
For instance, the probability the first-order statistic (the minimum value) is less than or equal to x is

Pr X_(1) ≤ x = Σ_{j=1}^{n} (n choose j) p^j (1 − p)^{n−j}
             = 1 − (1 − p)^n    (the sum is 1 minus the missing j = 0 term)
             = 1 − (1 − F(x))^n

The distribution function of the maximum is
Pr X_(n) ≤ x = p^n = F^n(x)

The median is sometimes used as an alternative to the mean for an estimate of location. The median is the middle of the data, half below and half above. Therefore, it has the advantage that even with almost half the data being outliers, the estimate is still well behaved. Another robust location estimate is the α-trimmed mean, which is formed by dropping the αn smallest and largest order statistics and then taking the average of the rest:

μ̂_α = (1/(n − 2αn)) Σ_{k=αn+1}^{n−αn} X_(k)

The α-trimmed mean can tolerate up to αn outliers. A robust estimate of spread is the interquartile distance (sometimes called the interquartile range):

σ̂ = (X_(3n/4) − X_(n/4))/c
where c is a constant chosen so that E σ̂ = σ. (Note the interquartile distance is an estimator of σ, not σ².) When the data are approximately Gaussian, c = 1.349. Using the Gaussian quantile function, c is calculated as follows:

c = Q(0.75) − Q(0.25) = 0.6745 − (−0.6745) = 1.349

We used the following reasoning: If Xᵢ ∼ N(0,1), then Q(0.75) = Φ⁻¹(3/4) = 0.6745 gives the average value of the (3n/4)-order statistic, X_(3n/4). Similarly, Q(0.25) gives the average value of the (n/4)-order statistic. The difference between these two gives the average value of the interquartile range; dividing by it makes E σ̂ ≈ σ for Gaussian data.

As an example, consider the 11 data points −1.91, −0.62, −0.46, −0.44, −0.18, −0.16, −0.07, 0.33, 0.75, 1.60, 4.00. The first 10 are samples from an N(0,1) distribution; the 11th sample is an outlier. The sample mean and sample variance of these 11 values are

X̄ₙ = (1/11) Σ_{k=1}^{11} Xₖ = 0.26
S²ₙ = (1/(11 − 1)) Σ_{k=1}^{11} (Xₖ − 0.26)² = 2.30
Neither of these results is particularly close to the correct values, 0 and 1. The median is X_(6) = −0.16, and the interquartile estimate of spread is

σ̂ = (X_(8) − X_(3))/1.349    (3 × 11/4 ≈ 8 and 11/4 ≈ 3)
   = (0.33 − (−0.46))/1.349
   = 0.59
Both the median and interquartile distance give better estimates (on this data set, at least) than the sample mean and variance. To summarize, the sample mean and variance are usually good estimates. They work well when the data are Gaussian, for instance. However, when the data contain outliers, the sample mean and variance can be poor estimates. Robust estimates sacrifice some performance on Gaussian data to achieve better performance when there are outliers. Many robust estimates, including the sample median and the interquartile distance, are based on order statistics.
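These estimates are simple to compute. A sketch using NumPy on the 11-point data set above (note that np.percentile interpolates between order statistics, so its quartiles differ slightly from the X_(8) and X_(3) convention used in the text):

import numpy as np

x = np.array([-1.91, -0.62, -0.46, -0.44, -0.18, -0.16,
              -0.07, 0.33, 0.75, 1.60, 4.00])    # one outlier at 4.00

print(np.mean(x), np.var(x, ddof=1))    # pulled away from 0 and 1 by the outlier
print(np.median(x))                     # robust location estimate
q25, q75 = np.percentile(x, [25, 75])
print((q75 - q25) / 1.349)              # robust spread estimate of sigma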
10.6 ESTIMATING THE DISTRIBUTION FUNCTION

After mean and variance, the next most important estimates are of the PMF, the density, and the distribution function. In this section, we present some general background for estimating these three, then focus on the distribution function. In the next section, we consider specific methods of estimating densities and PMFs.

There are two main approaches for estimating the PMF, the density, and the distribution function: the parametric and the nonparametric. A parametric approach assumes a particular form for the density (or distribution function) with one or more unknown parameters. The data are used to estimate the unknown parameters. For instance, the data may be assumed to be Gaussian with an unknown mean and variance. The mean can be estimated by the sample mean and the variance by the sample variance. The density or distribution function is estimated by the assumed density or distribution function (e.g., Gaussian) with the parameter estimates. Nonparametric approaches do not assume a known form for the density or distribution function, but instead try to estimate it directly. Often nonparametric approaches are graphical. The estimated density or distribution function is plotted, perhaps against a trial form (e.g., Gaussian). The unknown distribution is assumed to be Gaussian only if the estimated distribution function is sufficiently close to an actual Gaussian distribution function.

Consider a nonparametric estimation of a distribution function. Let X₁, X₂, . . . , Xₙ be n IID observations from an unknown common distribution F(x). Let p(x) = Pr X ≤ x = F(x), and let N(x) equal the number of Xᵢ that are less than or equal to x. N(x) is binomial with parameters (n, p(x)). The empirical distribution function or sample distribution function is the following:

F̂(x) = N(x)/n for all −∞ < x < ∞    (10.10)
Since N(x) is binomial, the mean and variance of F̂(x) are easily computed:

E F̂(x) = np(x)/n = p(x) = F(x)    (unbiased)
Var F̂(x) = np(x)(1 − p(x))/n² = F(x)(1 − F(x))/n    (consistent)
We see the empirical distribution function, F̂(x), is an unbiased and consistent estimator of the distribution function, F(x).

For example, let [0.70, 0.92, −0.28, 0.93, 0.40, −1.64, 1.77, 0.40, −0.46, −0.31, 0.38, 0.63, −0.79, 0.07, −2.03, −0.29, −0.68, 1.78, −1.83, 0.95] be 20 samples from an N(0,1) distribution quantized to two decimal places. The first step in computing the sample distribution function is to sort the data:

−2.03, −1.83, −1.64, −0.79, −0.68, −0.46, −0.31, −0.29, −0.28, 0.07, 0.38, 0.40, 0.40, 0.63, 0.70, 0.92, 0.93, 0.95, 1.77, 1.78

The jumps in the N(x) sequence are computed from the sorted sequence:

N(−2.03) = 1
N(−1.83) = 2
N(−1.64) = 3
...
N(1.78) = 20

The sample distribution is simply F̂(x) = N(x)/20.

[Figure: the sample distribution function F̂(x) plotted as a staircase against the reference Gaussian distribution function Φ(x) for −2 ≤ x ≤ 2.]
Since the empirical distribution function is an unbiased and consistent estimator of the distribution function, the two curves will get closer and closer together as n → ∞. For instance, here is a plot for 100 points:

[Figure: sample distribution function for 100 points plotted against the reference Gaussian distribution function.]

As we can see in the plot above, with 100 points the agreement between the sample distribution and the reference Gaussian is good. The two curves are close throughout.

Comment 10.5: In the example data set above, two samples are listed as 0.40. For a continuous source such as the Gaussian used here, the probability of two samples being equal is 0. However, when the data are quantized, as they are here, to two decimal places, duplicates can happen. The actual samples, to three decimal places, are 0.396 and 0.401. They are not the same.
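Computing F̂(x) amounts to sorting and counting. A minimal sketch, assuming NumPy (the ecdf helper is ours):

import numpy as np

def ecdf(samples):
    # sample distribution function: jump points x and heights F_hat(x) = N(x)/n
    x = np.sort(samples)
    f = np.arange(1, len(x) + 1) / len(x)
    return x, f

x, f = ecdf(np.random.standard_normal(100))
# (x[i], f[i]) are the corners of the staircase F_hat(x)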
10.7 PMF AND DENSITY ESTIMATES

As discussed above, PMF and density estimates fall into two categories: parametric and nonparametric. We start with parametric estimates, then consider some nonparametric estimates.

Parametric estimates form a density estimate by assuming a known form for the density, estimating the unknown parameters of the density from observations, and substituting the estimated parameters into the density function. The process is usually straightforward. For example, in the previous section, 20 samples from an N(0,1) distribution were obtained. Using these 20 samples, we would estimate the mean and variance as follows:

μ̂ = (0.70 + 0.92 + ··· + 0.95)/20 = 0.62/20 = 0.031
σ̂² = ((0.70 − 0.031)² + (0.92 − 0.031)² + ··· + (0.95 − 0.031)²)/(20 − 1) = 21.97/19 = 1.16

The parametric estimate of the density is N(0.031, 1.16). Figure 10.2 shows the estimated and exact densities, with individual observations shown as tick marks. The estimated density is fairly close to the exact density.
FIGURE 10.2 Estimated Gaussian density using parametric estimates of the mean and variance. The tick marks are the observations.
Parametric density estimates work well when the form of the density is known and the parameter estimates converge quickly to their correct values. However, when either of these is not true (usually the form of the density is not known), nonparametric estimates can be used. A simple nonparametric density estimate is a histogram of the data. If there are enough data and the bin sizes are chosen well, the histogram can give a reasonable estimate. Often, however, a better density estimate is a kernel density estimate (KDE). Let X₁, X₂, . . . , Xₙ be n observations. A density estimate is

f̂(x) = (1/n) Σ_{i=1}^{n} (1/h)K((x − Xᵢ)/h)

where h is a smoothing parameter and (1/h)K(x/h) is a kernel function. The kernel function has unit area and is centered on 0:

1 = ∫_{−∞}^{∞} K(x) dx = ∫_{−∞}^{∞} (1/h)K(x/h) dx

The KDE puts a kernel on each sample and sums the various kernels. The effect of the kernel is to "spread out" or "blur" each observation. The most popular kernel is a Gaussian density (smooth and differentiable everywhere), K(x) = φ(x), where φ(x) is the Gaussian density function. When h is large, the estimate is smooth; when h is small, the estimate can become noisy (bumpy). One popular choice for h is h = 1.06σ̂n^{−1/5}, where σ̂ is an estimate of the standard deviation (this rule is known as Silverman's rule). Note that as n → ∞, h → 0. That is, the amount of smoothing decreases as n increases.

Comment 10.6: The KDE has a simple signal processing interpretation. Start with a sequence of impulses at each sample value xᵢ. Filter the impulses (to smooth them out) using a lowpass filter with impulse response (1/h)K(x/h).
Figure 10.3 shows a KDE using a Gaussian kernel with the same 20 points as in Figure 10.2. Clearly, the KDE is worse than the parametric estimate, but it does not assume a known form for the density. The parametric density estimate is better because it correctly assumes the underlying density is Gaussian. The KDE is most useful when the shape of the underlying density is unknown.
FIGURE 10.3 A kernel density estimate of a probability density using a Gaussian kernel and Silverman’s rule for choosing h.
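A sketch of a Gaussian-kernel KDE with Silverman's rule, along the lines of the estimate plotted in Figure 10.3 (the kde helper is ours, and fresh random samples stand in for the 20 points of the example):

import numpy as np

def kde(grid, samples, h=None):
    n = len(samples)
    if h is None:
        h = 1.06 * np.std(samples, ddof=1) * n ** (-0.2)   # Silverman's rule
    u = (grid[:, None] - samples[None, :]) / h             # all (x, X_i) pairs
    K = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)           # Gaussian kernel
    return K.sum(axis=1) / (n * h)                         # f_hat on the grid

samples = np.random.standard_normal(20)
grid = np.linspace(-3, 3, 61)
f_hat = kde(grid, samples)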
10.8 CONFIDENCE INTERVALS

One measure of the quality of an estimate is the confidence interval. The confidence interval is a range of values that are close to the estimate. A small confidence interval means the estimate is likely to be close to the correct value; a larger confidence interval indicates the estimate may be inaccurate. As expected, confidence intervals shrink as more data are obtained. In this section, we illustrate confidence intervals with two related examples. The first is the confidence interval for an estimate of the mean when the variance is known, and the second is the confidence interval for the mean when the variance is unknown.

Formally, a confidence interval for an estimate of the mean is an interval (L(X), U(X)) such that

Pr L(X) ≤ μ̂ − μ ≤ U(X) ≥ 1 − τ    (10.11)

where τ is a threshold, typically 0.01 to 0.05. Usually, we will simply write L and U instead of L(X) and U(X). Among the many possible confidence intervals, we usually select the shortest one; that is, the one minimizing U − L. For example, for Xᵢ IID N(μ, σ²) with μ unknown and estimated by μ̂ = X̄ₙ and σ² known, the distribution of μ̂ − μ is N(0, σ²/n). In particular, it is symmetric about 0. Hence, the upper and lower limits of the confidence interval are also symmetric; that is, L = −U:
Pr L ≤ μ̂ − μ ≤ U = Pr −U ≤ μ̂ − μ ≤ U = 2Φ(U√n/σ) − 1 = 1 − τ

Solving for U, we obtain

U = σΦ⁻¹(1 − τ/2)/√n = σQ(1 − τ/2)/√n
where Q is the Gaussian quantile function. When τ = 0.05, Q(1 − 0.05/2) = 1.96. Finally, we obtain the confidence interval:

Pr −1.96σ/√n ≤ μ̂ − μ ≤ 1.96σ/√n = 0.95    (10.12)
Generally, we desire the confidence interval to be small. This requires a small σ and a large n. We get the expected result that more data lead to better estimates. Note that Equation (10.12) is widely used even if the data are only approximately Gaussian (or if n is large enough that CLT approximations are valid).

TABLE 10.1 Thresholds for the t-distribution with 1 − τ = 0.95 for various values of v degrees of freedom.

v = n − 1     4       9       19      49      99      ∞
threshold     2.776   2.262   2.093   2.010   1.984   1.960

When the variance is also unknown, the confidence interval uses the estimated variance (Equation 10.7):
Pr −U ≤ (μ̂ − μ)/(S/√n) ≤ U ≥ 1 − τ    (10.13)
Let T = (μ̂ − μ)/(S/√n). Unfortunately, T is not Gaussian. It has the Student's t-distribution with v = n − 1 degrees of freedom. The t-distribution was discovered by William Gosset in 1908. It is known as "Student's t-distribution" because Gosset published under the pseudonym Student. The density of the t-distribution has the form

f_T(t) = C(v)(1 + t²/v)^{−(v+1)/2}    (10.14)

where v is the number of degrees of freedom and C(v) is a normalizing constant (so the integral of the density is 1). Except for a few special cases, the distribution function of the t-distribution must be calculated numerically. When v is large, the t-distribution is approximately N(0,1). Some Python code to calculate t-distribution properties is as follows:
where v is the number of degrees of freedom and C(v) is a normalizing constant (so the integral of the density is 1). Except for a few special cases, the distribution function of the t-distribution must be calculated numerically. When v is large, the t-distribution is approximately N (0,1). Some Python code to calculate t-distribution properties is as follows: from scipy import stats t = stats . t prob = t . cdf (x , v ) # Pr (T < x ) with df = v threshold = t . ppf (0.95 , v ) # quantile function print threshold
For example, with v = 4, Pr T < 2.132 = 0.95. The command print(t.ppf(0.95, 4)) returns 2.132 (with rounding).

As we mentioned above, the normalized deviation T = (μ̂ − μ)/(S/√n) has v = n − 1 degrees of freedom. Several values of the quantile function of the t-distribution for 1 − τ = 0.95 are listed in Table 10.1. For instance, for n = 10, the confidence interval is

Pr −2.262 ≤ (μ̂ − μ)/(S/√n) ≤ 2.262 = 0.95
The confidence interval is larger when σ is unknown because its estimate S may differ from the correct value. Therefore, the ratio in Equation (10.13) varies more than if the variance were known. As a result, we need to make the confidence interval larger.
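Putting the pieces together, a minimal sketch of a 95% confidence interval for the mean with unknown variance, using SciPy's t quantiles (the data here are arbitrary):

import numpy as np
from scipy import stats

x = np.random.standard_normal(10)      # example data, n = 10
n, xbar, s = len(x), x.mean(), x.std(ddof=1)

t = stats.t.ppf(0.975, n - 1)          # 2.262 for v = 9, as in Table 10.1
half = t * s / np.sqrt(n)
print(xbar - half, xbar + half)        # 95% confidence interval for mu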
A confidence interval shows the likely range of errors in the estimate. In Section 10.9, we show that confidence intervals are related to significance tests.
10.9 SIGNIFICANCE TESTS AND p-VALUES

After computing estimates such as the sample mean or variance, many statisticians then do a test: Is the estimated quantity statistically significant? Are the data sufficiently compelling that the estimate is accepted, or is it possible that a simpler explanation exists? For example, consider estimating the mean of a Gaussian distribution. Let Xᵢ ∼ N(μ, σ²) for i = 1, 2, . . . , n be an IID sequence of Gaussian random variables. The obvious estimate of μ is the sample mean, μ̂ = X̄ₙ. The statistician might propose a null hypothesis, that the mean is 0, and ask whether the observed value is significantly different from zero. To answer this question, the statistician computes a p-value. Let x be the observed value of the sample mean:

p = Pr X̄ₙ ≥ x | X ∼ N(0, σ²)    (10.15)

The p-value is the probability of making observations that are at least as extreme (different from zero) as the observations, where the probability is calculated using the assumption the data are N(0, σ²). If the p-value is less than 0.05, the statistician decides the sample mean is significantly different from zero (technically, the statistician rejects the null hypothesis that the mean is 0); if greater than 0.05, the statistician decides the observed sample mean is not significantly different from zero. The idea is that the p-value measures how likely the data are given the null hypothesis. If that probability is less than 0.05, the statistician is willing to reject the notion that the null hypothesis adequately describes the data. Under the null hypothesis, X̄ₙ ∼ N(0, σ²/n),
Pr X̄ₙ ≥ x | X ∼ N(0, σ²) = 1 − Φ(x√n/σ) ≤ 0.05

which is equivalent to

x√n/σ ≥ Q(0.95) = 1.64

If x is greater than or equal to 1.64σ/√n, decide the sample mean is significantly different from zero. Strictly speaking, this is a one-sided test, appropriate when the deviation from zero is in the positive direction. For instance, the test might be whether or not a drug has a significantly positive effect. (We reject the drug if the effect is negative; we are only interested in positive effects.) A two-sided test looks for deviations in either direction, positive or negative, and chooses a symmetric region. A two-sided Gaussian test looks like the following:
Pr |X̄ₙ| ≥ x | X ∼ N(0, σ²) = 2(1 − Φ(x√n/σ)) ≤ 0.05    (10.16)
x√n/σ ≥ Q(0.975) = 1.96    (10.17)

If |X̄ₙ| ≥ 1.96σ/√n, reject the hypothesis that the data have zero mean.
The method can be summarized as follows:

1. Choose an appropriate null hypothesis. The null hypothesis should reflect a lack of surprise. For instance, in a medical test of the efficacy of a new drug, the null hypothesis might be that the drug is no better than the placebo.
2. Calculate the probability of getting the data observed, or data more extreme than were observed, given the null hypothesis is true.
3. If this probability is less than an accepted threshold, typically 0.05, decide the observations are statistically significant. If this probability is greater than the threshold, the data are not statistically significant.

Two-sided significance tests are related to confidence intervals. The calculation in Equations (10.16) and (10.17) is the same as in Equation (10.12) except for the direction of the inequality. If the null hypothesis is within the confidence interval, then the data cannot reject the null hypothesis. Conversely, if the null hypothesis is outside the confidence interval, then the data support rejecting the null hypothesis.

Many standard significance tests have been proposed. Here are a few of the more important ones:

One-sample z-test: This is a slight generalization of the Gaussian test above. It allows the mean of the null hypothesis to be nonzero. The test statistic is Z = (X̄ₙ − μ₀)/(σ/√n) ∼ N(0,1), where μ₀ is the mean of the null hypothesis.

One-sample t-test: This test generalizes the z-test above by allowing the variance to be unknown. It tests the sample mean against the null hypothesis mean. The test statistic is T = (X̄ₙ − μ₀)/(S/√n), where S is the square root of the sample variance (Equation 10.7). The test statistic follows the Student t-distribution with n − 1 degrees of freedom. The Student t-distribution is discussed in Equation (10.14) and the following paragraphs.

One-proportion z-test: Let the number of successes in a binomial experiment be N and the estimate p̂ = N/n. If np₀ is reasonably large, the CLT indicates Z will be approximately N(0,1), where Z is
Z = √n (p̂ − p₀)/√(p₀(1 − p₀))
Chi-squared variance test: Let Xᵢ be IID N(0, σ²) under the null hypothesis. The test is whether or not the sample variance equals σ². The test statistic is χ² = (n − 1)s²/σ², where s² is the sample variance. The statistic has the chi-squared distribution with v = n − 1 degrees of freedom.

Pearson's chi-squared test: This test is used for two different purposes. The first is to test whether observed data fit a known multinomial distribution, and the second is whether observed two-dimensional data can reasonably be modeled as independent. In both cases, the test statistic is chi-squared with v degrees of freedom.

First, let X be multinomial, taking on values 1, 2, . . . , m. Let Oᵢ be the observed number of times i is observed and Eᵢ = npᵢ the expected number of times i should be observed
if the null hypothesis is true. Thus,

χ² = Σ_{i=1}^{m} (Oᵢ − Eᵢ)²/Eᵢ
Then, χ² is approximately chi-squared with v = m − 1 degrees of freedom.

For example, assume a coin is flipped 50 times, and 30 heads and 20 tails are observed. Can we reject the null hypothesis that the coin is fair? Is 30 significantly different from the expected number, 25? The χ² statistic is computed as follows:

χ² = (30 − 25)²/25 + (20 − 25)²/25 = 2
There are v = m − 1 = 2 − 1 = 1 degrees of freedom. The Matlab command chi2cdf(2,1) returns 0.843, which means the tail probability is 1 − 0.843 = 0.157. Since this is not less than 0.05, we cannot reject the null hypothesis that the coin is fair.

This is not surprising. Under the null hypothesis, X is binomial with parameters n = 50 and p = 0.5. The standard deviation is √(np(1 − p)) = 3.53. The deviation from the expected value, 30 − 25 = 5, is less than two standard deviations. A deviation this small is not normally a cause for concern.

The second approach is a test for independence. Assume a table of observations with r rows and c columns. Under the null hypothesis, the row and column random variables are independent. Therefore, the probability of observing the pair (i,j) factors into the product of the individual probabilities, pᵢ and qⱼ. The expected number of observations of (i,j) is Eᵢⱼ = npᵢqⱼ. Let Oᵢⱼ be the actual number of observations. Then, the test statistic
χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (Oᵢⱼ − Eᵢⱼ)²/Eᵢⱼ
is approximately chi-squared with v = (r − 1)(c − 1) degrees of freedom. Note that in both these Pearson’s chi-squared tests, the test becomes more accurate as more observations are made. An often-quoted rule of thumb is that the expected number of observations in each cell should be five or more to apply these tests. These tests are illustrative of the variety of tests that have been proposed. Tests of means with known variances are Gaussian or asymptotically Gaussian (by the CLT). Tests of means with unknown variances are t-tests. Tests of sums of squares of Gaussians are chi-squared tests. Finally, Pearson’s tests are asymptotically chi-squared and should be applied only if the number of observations is large enough.
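The coin-flip test above can be reproduced in Python with scipy.stats.chisquare, which returns both the χ² statistic and the tail probability directly (the Python counterpart of the Matlab chi2cdf computation):

from scipy import stats

observed = [30, 20]     # heads and tails in 50 flips
expected = [25, 25]     # fair-coin expectation
chi2, p = stats.chisquare(observed, f_exp=expected)
print(chi2, p)          # 2.0 and about 0.157: cannot reject fairness at 0.05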
Comment 10.7: Both p-values and tests of statistical significance are widely used in the social sciences, economics, medicine, and many other areas, though somewhat less so in engineering. Nevertheless, there are many criticisms. Here are a few of them:
• The tortured language is unfortunate. Either the data "support rejecting the null hypothesis" or the data "do not support rejecting the null hypothesis." Saying the "data support accepting the null hypothesis" is considered improper.
• The choice of the null hypothesis is somewhat arbitrary. Sometimes it is obvious, but sometimes the statistician can choose among several "reasonable" null hypotheses. Then, whether the data are significant may depend on which null hypothesis is chosen.
• The choice of the threshold, typically 0.05, is arbitrary. Other values are sometimes used.
• The p-value is sometimes referred to as the probability of the null hypothesis being true. This is incorrect. It is the probability of getting the data observed (or worse) given the null hypothesis is true.
• The p-value and the result of the test depend on data not actually observed; that is, the p-value is the probability of getting the data observed or more extreme data. Many statisticians argue that tests should depend only on the data actually observed.
• "Statistically significant" does not imply significant in any other sense. For instance, a drug might be statistically significant compared to a placebo, but that does not mean the drug will significantly improve health outcomes. For example, a drug might "statistically significantly" lengthen a cancer patient's life by a day, but is it significant if the drug is terribly costly or has other side effects?

Over the years, statisticians have proposed numerous alternatives to p-values and significance tests, but p-values and significance tests are still popular.
EXAMPLE 10.5
Not all statistical tests use a threshold of 0.05. The search for the Higgs boson requires the creation of a tremendous number of high-energy particle collisions. If a Higgs boson is created, it will decay almost instantly into a series of other particles, which also decay. These decay “signatures” are detected. Since the likelihood of a Higgs boson being created in any single experiment is only about 1 in 10 billion, scientists insist on a tiny significance threshold. The accepted value is 5σ (for a Gaussian distribution), or about a 1-in-3.5-million chance of falsely claiming the existence of the Higgs boson.
10.10 INTRODUCTION TO ESTIMATION THEORY

There are three main estimation methods: maximum likelihood estimation, minimum mean squared error (MMSE) estimation, and Bayesian estimation. All three are widely used,
all three have advantages, and all three have disadvantages. In this section, we discuss maximum likelihood estimation; in the next two sections, we discuss MMSE estimation and Bayesian estimation.

It is conventional to denote the unknown parameter or parameters by θ. For example, θ might be μ if the mean is unknown, or μ and σ² if both the mean and variance are unknown. Assume we have a sequence of IID observations, X₁ = x₁, X₂ = x₂, . . . , Xₙ = xₙ, with a common density f(x; θ). The density depends on the unknown parameter, θ. The density of the n observations is

f(x₁, x₂, . . . , xₙ; θ) = f(x₁; θ)f(x₂; θ) ··· f(xₙ; θ)

The likelihood function, L(θ), is the n-dimensional density function thought of as a function of θ with x₁, x₂, through xₙ as known parameters:

L(θ) = f(x₁, x₂, . . . , xₙ; θ)

For many densities, it is more convenient to consider the log-likelihood function, l(θ):

l(θ) = log L(θ) = log f(x₁, x₂, . . . , xₙ; θ) = log f(x₁; θ) + log f(x₂; θ) + ··· + log f(xₙ; θ)

The principle behind the maximum likelihood estimate (MLE) is to choose the parameter θ to maximize the likelihood function. In other words, select the parameter value that maximizes the probability of the particular observation sequence, x₁, x₂, . . . , xₙ. We write the maximizing value of θ as θ̂. Computing the MLE consists of the following steps:

1. Compute the likelihood function.
2. If it appears beneficial, compute the log-likelihood function.
3. Differentiate the likelihood or log-likelihood function with respect to θ, set the derivative to 0, and solve for θ̂. If θ consists of multiple components, differentiate separately with respect to each component, and solve the set of equations.
EXAMPLE 10.6
(Gaussian with Unknown Mean) Let Xᵢ be IID N(μ, σ²) with μ unknown and to be estimated. Thus,

L(μ) = (2πσ²)^{−n/2} exp(−((x₁ − μ)² + (x₂ − μ)² + ··· + (xₙ − μ)²)/(2σ²))
l(μ) = log(L(μ)) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (xᵢ − μ)²
0 = (d/dμ) l(μ) |_{μ=μ̂} = 0 + (1/σ²) Σ_{i=1}^{n} (xᵢ − μ̂)
μ̂ = (1/n) Σ_{i=1}^{n} xᵢ = X̄ₙ

Not surprisingly, the MLE of μ is the sample mean X̄ₙ.
EXAMPLE 10.7
(Gaussian with Unknown Mean and Variance) Let Xᵢ be IID N(μ, σ²) with both μ and σ unknown and to be estimated. We can start with l(μ, σ²) above, but we have to differentiate once with respect to μ and once with respect to σ, yielding two equations to be solved:

L(μ, σ) = (2πσ²)^{−n/2} exp(−((x₁ − μ)² + (x₂ − μ)² + ··· + (xₙ − μ)²)/(2σ²))
l(μ, σ) = log(L(μ, σ)) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} (xᵢ − μ)²
0 = (∂/∂μ) l(μ, σ) |_{μ=μ̂, σ=σ̂} = 0 + (1/σ̂²) Σ_{i=1}^{n} (xᵢ − μ̂)
0 = (∂/∂σ) l(μ, σ) |_{μ=μ̂, σ=σ̂} = −n/σ̂ + (1/σ̂³) Σ_{i=1}^{n} (xᵢ − μ̂)²
μ̂ = (1/n) Σ_{i=1}^{n} xᵢ = X̄ₙ
σ̂ = √((1/n) Σ_{i=1}^{n} (xᵢ − μ̂)²) = √((1/n) Σ_{i=1}^{n} (xᵢ − X̄ₙ)²)

In this example, the first equation can be solved for μ̂ and then the second equation for σ̂. This is a property of the Gaussian distribution; it is not a property of maximum likelihood estimation. In most cases, the two equations must be solved simultaneously.

Note that the MLE of σ is σ̂ and that the MLE of σ² is σ̂². This is a property that is not often shared by other estimation methods. Note also the MLE of σ² is a biased estimator of σ². The unbiased estimator divides by n − 1, not n.
EXAMPLE 10.8
(Poisson PMF with Unknown Parameter) Let Xᵢ be Poisson with parameter λ. The MLE of λ can be found as follows:

L(λ) = (λ^{x₁} e^{−λ}/x₁!) · (λ^{x₂} e^{−λ}/x₂!) ··· (λ^{xₙ} e^{−λ}/xₙ!)
     = λ^{x₁+x₂+···+xₙ} e^{−nλ} / (x₁! x₂! ··· xₙ!)
l(λ) = log(L(λ)) = (x₁ + x₂ + ··· + xₙ) log λ − nλ − Σ_{i=1}^{n} log(xᵢ!)
0 = (d/dλ) l(λ) |_{λ=λ̂} = (x₁ + x₂ + ··· + xₙ)/λ̂ − n
λ̂ = (x₁ + x₂ + ··· + xₙ)/n = X̄ₙ

As expected, the MLE of λ is the sample mean.
EXAMPLE 10.9
(Exponential Density with Unknown Parameter) Let Xᵢ be IID exponential with parameter λ. The problem is to estimate λ:

L(λ) = λⁿ e^{−λ(x₁+x₂+···+xₙ)}
l(λ) = log(L(λ)) = n log λ − λ(x₁ + x₂ + ··· + xₙ)
0 = (d/dλ) l(λ) |_{λ=λ̂} = n/λ̂ − (x₁ + x₂ + ··· + xₙ)
λ̂ = n/(x₁ + x₂ + ··· + xₙ) = 1/X̄ₙ

For an exponential, the expected value of X is λ⁻¹. Therefore, λ̂⁻¹ = X̄ₙ makes sense.

EXAMPLE 10.10
(Censored Exponential) Consider the following waiting time experiment: At time 0, n lightbulbs are turned on. Each one fails at a random time Tᵢ. We assume the Tᵢ are IID exponential with parameter λ. However, the experiment is terminated at time t_max, and k of the lightbulbs are still running. Clearly, these lightbulbs give information about λ, but how do we incorporate that information into our estimate?

The order of the random variables is unimportant (since they are IID). We assume the random variables are ordered so that the first n − k represent lightbulbs that failed and the last k represent lightbulbs that did not fail. The first n − k are continuous random variables, while the last k are discrete. The likelihood function is the product of the n − k densities for the first group of random variables times the product of the k probabilities of the discrete random variables:

f_T(t) = λe^{−λt}
Pr T > t_max = ∫_{t_max}^{∞} f_T(t) dt = e^{−λt_max}
L(λ) = Π_{i=1}^{n−k} f_T(tᵢ) · Π_{i=n−k+1}^{n} Pr T > t_max
     = Π_{i=1}^{n−k} λe^{−λtᵢ} · Π_{i=n−k+1}^{n} e^{−λt_max}
     = λ^{n−k} e^{−λ(t₁+t₂+···+t_{n−k})} · e^{−kλt_max}
l(λ) = log(L(λ)) = (n − k) log λ − λ(t₁ + t₂ + ··· + t_{n−k}) − kλt_max
0 = (d/dλ) l(λ) |_{λ=λ̂} = (n − k)/λ̂ − (t₁ + t₂ + ··· + t_{n−k}) − kt_max
λ̂ = (n − k)/(t₁ + t₂ + ··· + t_{n−k} + kt_max)
As the examples above indicate, maximum likelihood estimation is especially popular in parameter estimation. While statisticians have devised pathological examples where it can be bad, in practice the MLE is almost always well behaved: it is usually a good estimate, and it converges to the true parameter value as n → ∞.
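A closed-form MLE such as the censored-exponential estimate of Example 10.10 is easy to sanity-check numerically by maximizing the log-likelihood directly. A sketch, assuming NumPy and SciPy (the parameter values are illustrative only):

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, lam_true, t_max = 1000, 1.0, 2.0
t = rng.exponential(1 / lam_true, size=n)      # true failure times
obs = t[t <= t_max]                            # observed failures
k = n - len(obs)                               # bulbs still running at t_max

lam_hat = len(obs) / (obs.sum() + k * t_max)   # closed-form MLE from Example 10.10

# numerical check: minimize the negative log-likelihood -l(lambda)
nll = lambda lam: -(len(obs) * np.log(lam) - lam * obs.sum() - k * lam * t_max)
res = minimize_scalar(nll, bounds=(1e-6, 10.0), method="bounded")
print(lam_hat, res.x)                          # the two estimates agree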
10.11 MINIMUM MEAN SQUARED ERROR ESTIMATION

This section discusses the most popular estimation rule, especially in engineering and scientific applications: choose the estimator that minimizes the mean squared error.

Let θ represent the unknown value or values to be estimated. For instance, θ might be μ if the mean is unknown, or (μ, σ²) if both the mean and variance are unknown and to be estimated. The estimate of θ is denoted θ̂. The estimate may or may not depend on random variables (i.e., random observations); when it does, θ̂ is itself a random variable, and when it does not, θ̂ is a constant.

The error is θ̂ − θ, the squared error is (θ̂ − θ)², and the mean squared error (MSE) is E (θ̂ − θ)². The value that minimizes the MSE is the MMSE estimate:

θ̂ − θ = error
(θ̂ − θ)² = squared error
E (θ̂ − θ)² = mean squared error
min_{θ̂} E (θ̂ − θ)² = minimum mean squared error
We start here with the simplest example, which is to estimate a random variable by a constant, and we progress to more complicated estimators. Let the random variable be Y and the estimate be θ̂ (since θ̂ is a constant, it is not a random variable). Then,

Q(θ̂; Y) = E (Y − θ̂)²    (10.18)

Q is a function of θ̂. It measures the expected loss as a function of θ̂ and is minimized by setting its first derivative to 0:

0 = (d/dθ) Q(θ; Y) |_{θ=θ̂} = −2E (Y − θ̂) = −2(E Y − θ̂)
which implies the optimal estimate is the mean, θ̂ = E Y. The Q function is known as the mean squared error (MSE) and the estimator as the minimum mean squared error (MMSE) or least mean squared error (LMSE) estimator.

As another example, let a random variable Y be estimated by a function g(X) of another random variable, X. Using Equation (8.7), the MSE can be written as

E (Y − g(X))² = ∫_{−∞}^{∞} E_{Y|X}[(Y − g(X))² | X = x] f_X(x) dx

The conditional expected value inside the integral can be minimized separately for each value of x (a sum of positive terms is minimized when each term is minimized). Letting θ = g(x),

0 = (d/dθ) E_{Y|X}[(Y − θ)² | X = x] |_{θ=θ̂} = −2E_{Y|X}[(Y − θ̂) | X = x]

which implies that the estimate of Y is the conditional mean of Y given X = x:

θ̂(x) = E_{Y|X}[Y | X = x]
For instance, if Y represents the weight and X the height of a randomly selected person, then θ̂(x) = E_{Y|X}[Y | X = x] is the average weight of a person given the height, x. One fully expects that the average weight of people who are five feet tall would be different from the average weight of those who are six feet tall. This is what the conditional mean tells us.

In general, θ̂(x) = E_{Y|X}[Y | X = x] is nonlinear in x. (It is unlikely that people who are six feet tall are exactly 20% heavier than those who are five feet tall, or that seven-footers are exactly 40% heavier than five-footers.) Sometimes it is desirable to find the best linear function of X that estimates Y. Let Ŷ = aX + b with a and b to be determined. Then,
min_{a,b} Q(a, b; Y, X) = min_{a,b} E (Y − aX − b)²

(Technically, this is an affine function because linearity requires b = 0, but it is often referred to as linear even though it is not.) This minimization requires setting two derivatives to 0 and solving the two equations:

0 = (∂/∂a) E (Y − aX − b)² |_{a=â, b=b̂} = −2E (XY − âX² − b̂X)    (10.19)
0 = (∂/∂b) E (Y − aX − b)² |_{a=â, b=b̂} = −2E (Y − âX − b̂)    (10.20)
The simplest way to solve these equations is to multiply the second by E X and subtract the result from the first, obtaining

E XY − E X E Y = â (E X² − (E X)²)

Solving for â,

â = (E XY − E X E Y)/(E X² − (E X)²) = Cov(X,Y)/Var(X) = σ_xy/σ_x²    (10.21)
The value for â can then be substituted into Equation (10.20):

b̂ = E Y − â E X = μ_y − (σ_xy/σ_x²) μ_x    (10.22)
It is convenient to introduce normalized quantities. Letting ρ be the correlation coefficient, ρ = σ_xy/(σ_x σ_y),

(Ŷ − μ_y)/σ_y = ρ (x − μ_x)/σ_x    (10.23)
The term in parentheses on the left is the normalized estimate of Y, and the one on the right is the normalized deviation of the observed value X. Normalizing estimates like this is common because it eliminates scale differences between X and Y. For instance, it is possible that X and Y have different units. The normalized quantities are dimensionless. Note the important role ρ plays in the estimate. Recall that ρ is a measure of how closely related X and Y are. When ρ is close to 1 or −1, the size of the expected normalized deviation of Y is almost the same as the observed deviation of X. However, when ρ is close to 0, X and Y are unrelated, and the observation X is of little use in estimating Y. Comment 10.8: Equation (10.23), while useful in many problems, has also been misinterpreted. For instance, in the middle of the 20th century, this equation was used to estimate a son’s intelligence quotient (IQ) score from the father’s IQ score. The conclusion was that over time, the population was becoming more average (since ρ < 1). The fallacy of this reasoning is illustrated by switching the roles of father and son and then predicting backward in time. In this case, the fathers are predicted to be more average than their sons. What actually happens is that the best prediction is more average than the observation, but this does not mean the random variable Y is more average than the random variable X. The population statistics do not necessarily change with time. (In fact, raw scores on IQ tests have risen over time, reflecting that people have gotten either smarter or better at taking standardized tests. The published IQ score is normalized to eliminate this rise.) Why not mothers and daughters, you might ask? At the time, the military gave IQ tests to recruits, almost all of whom were male. As a result, many more IQ results were available for males than females.
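Equations (10.21) and (10.22) translate directly into code. A sketch on synthetic data (the true coefficients 2 and 1 are ours, purely for illustration):

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
y = 2.0 * x + 1.0 + rng.standard_normal(1000)     # noisy linear relation

a_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)    # Cov(X,Y)/Var(X), Equation (10.21)
b_hat = y.mean() - a_hat * x.mean()               # Equation (10.22)
print(a_hat, b_hat)                               # close to 2 and 1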
10.12 BAYESIAN ESTIMATION

In traditional estimation, θ is assumed to be an unknown but nonrandom quantity. In contrast, in Bayesian estimation, the unknown quantity θ is assumed to be a random value (or values) with a known density (or PMF) f(θ).
Bayesian estimation is like minimum mean squared error estimation except that the unknown parameter or parameters are considered to be random themselves with a known probability distribution. Bayesian estimation is growing in popularity among engineers and statisticians, and in this section, we present two simple examples, the first estimating the probability of a binomial random variable and the second estimating the mean of a Gaussian random variable with known variance.

Let θ represent the unknown random parameters, and let f(θ) be the a priori probability density of θ. Let the observations be x with conditional density f(x|θ), and let f(x) denote the density of x. (We use the convention in Bayes estimation of dropping the subscripts on the various densities.) Using Bayes theorem, the a posteriori probability can be written as

f(θ|x) = f(x|θ)f(θ)/f(x) = f(x|θ)f(θ) / ∫_{−∞}^{∞} f(x|ν)f(ν) dν    (10.24)
Perhaps surprisingly, f(x) plays a relatively small role in Bayesian estimation. It is a constant as far as θ goes, and it is needed to normalize f(θ|x) so the integral is 1. Otherwise, however, it is generally unimportant. The minimum mean squared estimator of θ is the conditional mean:

θ̂ = E[θ | X = x] = ∫_{−∞}^{∞} θ f(θ|x) dθ    (10.25)
In principle, computing this estimate is an exercise in integral calculus. In practice, however, the computation often falls to one of two extremes: either the computation is easy (can be done in closed form), or it is so difficult that numerical techniques must be employed. The computation is easy if f(θ) is chosen as the conjugate distribution to f(x|θ). We give two examples below.
For the first example, let θ represent the unknown probability of a 1 in a series of n Bernoulli trials, and let x = k equal the number of 1’s observed (and n − k the number of 0’s). The conditional density f(k|θ) is the usual binomial PMF. Thus,

f(k|θ) = (n choose k) θᵏ(1 − θ)ⁿ⁻ᵏ
The conjugate density to the binomial is the beta density. The beta density has the form

f(θ) = [Γ(α + β)/(Γ(α)Γ(β))] θ^(α−1)(1 − θ)^(β−1)   for 0 < θ < 1    (10.26)
where the Gamma function is defined in Equation (9.26) and α ≥ 0 and β ≥ 0 are nonnegative parameters. The beta density is a generalization of the uniform density. When α = β = 1, the beta density equals the uniform density. The density is symmetric about 0.5 whenever α = β. Typical values of α and β are 0.5 and 1.0. When α = β = 0.5, the density is peaked at the edges; when α = β = 1.0, the density is uniform for 0 ≤ θ ≤ 1. The mean of the beta density is α/(α + β).
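These properties are easy to check numerically. The following short sketch (assuming SciPy is available; the chosen (α, β) pairs are just the typical values mentioned above) evaluates the beta density and confirms the mean α/(α + β):

from scipy.stats import beta

for a, b in [(0.5, 0.5), (1.0, 1.0), (2.0, 2.0)]:
    d = beta(a, b)
    # d.mean() equals a / (a + b); d.pdf shows the shape near the edge and center
    print(a, b, d.mean(), d.pdf(0.1), d.pdf(0.5))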
The magic of the conjugate density is that the a posteriori density is also a beta density:

f(θ|k) = f(k|θ)f(θ)/f(k)
       = (1/f(k)) (n choose k) θᵏ(1 − θ)ⁿ⁻ᵏ · [Γ(α + β)/(Γ(α)Γ(β))] θ^(α−1)(1 − θ)^(β−1)
       = [(1/f(k)) (n choose k) Γ(α + β)/(Γ(α)Γ(β))] · θ^(k+α−1)(1 − θ)^(n−k+β−1)

The term in the large parentheses is a normalizing constant independent of θ (it depends on α, β, n, and k, but not on θ). It can be shown to equal Γ(α + β + n)/(Γ(α + k)Γ(β + n − k)). Thus, the a posteriori density can be written as

f(θ|k) = [Γ(α + β + n)/(Γ(α + k)Γ(β + n − k))] θ^(k+α−1)(1 − θ)^(n−k+β−1)
which we recognize as a beta density with parameters k + α and n − k + β. The estimate is the conditional mean and is therefore

θ̂ = E[θ | X = k] = (α + k)/(α + β + n)    (10.27)
Thus, we see the Bayes estimate has a familiar form: If α = β = 0, the estimate reduces to the sample mean, k/n. With nonzero α and β, the estimate is biased toward the a priori estimate, α/(α + β).

EXAMPLE 10.11
In Chapter 5, we considered the problem of encoding an IID binary sequence with a known probability of a 1 equal to p and developed the optimal Huffman code. In many situations, however, p is unknown and must be estimated. A common technique is to use the Bayesian sequential estimator described above. The (n + 1)’st bit is encoded with a probability estimate determined from the first n bits. Let p̂n be the estimated probability of the (n + 1)’st bit. It is a function of the first n bits. Common values for α and β are α = β = 1.
Assume, for example, the input sequence is 0110···. The first bit is estimated using the a priori estimate:

p̂0 = (α + 0)/(α + β + 0) = (1 + 0)/(1 + 1 + 0) = 0.5

For the second bit, n = 1 and k = 0, and the estimate is updated as follows:

p̂1 = (α + 0)/(α + β + 1) = 1/3 ≈ 0.33

For the third bit, n = 2 and k = 1, so

p̂2 = (α + 1)/(α + β + 2) = 2/4 = 0.5
For the fourth bit, n = 3 and k = 2, and

p̂3 = (α + 2)/(α + β + 3) = 3/5 = 0.6
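The updates above are easy to automate. A minimal Python sketch (an illustration, not the book's code) computes the same sequence of estimates directly from Equation (10.27):

def sequential_estimates(bits, alpha=1.0, beta=1.0):
    # p_hat_n = (alpha + k) / (alpha + beta + n), computed before each bit,
    # where k is the number of 1's among the first n bits (Equation 10.27).
    estimates = []
    k = 0
    for n, bit in enumerate(bits):
        estimates.append((alpha + k) / (alpha + beta + n))
        k += bit
    return estimates

print(sequential_estimates([0, 1, 1, 0]))   # [0.5, 0.333..., 0.5, 0.6]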
This process can continue forever, each time updating the probability estimate with the new information. In practice, the probability update is often written in a predictor-corrector form.
For the second example, consider estimating the mean of a Gaussian distribution with known variance. The conjugate distribution to the Gaussian with unknown mean and known variance is the Gaussian distribution, θ ∼ N(μ0, σ0²). (The conjugate distribution is different if the variance is also unknown.) The a posteriori density is

f(θ|x) = f(x|θ)f(θ)/f(x)
       = (1/f(x)) · [1/(σ√(2π))] exp(−(x − θ)²/(2σ²)) · [1/(σ0√(2π))] exp(−(θ − μ0)²/(2σ0²))
       = [1/(2πσσ0 f(x))] exp(−(x − θ)²/(2σ²) − (θ − μ0)²/(2σ0²))

With much tedious algebra, this density can be shown to be a Gaussian density on θ with mean (xσ0² + μ0σ²)/(σ² + σ0²) and variance (σ⁻² + σ0⁻²)⁻¹. The Bayesian estimate is therefore

θ̂ = E[θ | X = x] = (xσ0² + μ0σ²)/(σ² + σ0²) = x · σ0²/(σ² + σ0²) + μ0 · σ²/(σ² + σ0²)    (10.28)
Thus, we see the estimate is a weighted combination of the observation x and the a priori value μ0. If multiple observations of X are made, we can replace X by the sample mean, X̄n, and use its variance, σ²/n, to generalize the result as follows:

θ̂ = E[θ | X̄n = x̄n] = x̄n · nσ0²/(σ² + nσ0²) + μ0 · σ²/(σ² + nσ0²)    (10.29)
As more observations are made (i.e., as n → ∞), the estimate puts increasing weight on the observations and less on the a priori value.
In both examples, the Bayesian estimate is a linear combination of the observation and the a priori estimate. Both estimates have an easy sequential interpretation. Before any data are observed, the estimate is the a priori value, either α/(α + β) for the binomial or μ0 for the Gaussian. As data are observed, the estimate is updated using Equations (10.27) and (10.29). Both these estimates are commonly employed in engineering applications when data arrive sequentially and estimates are updated with each new observation.
We mentioned above that the computation in Equation (10.25) can be hard, especially if conjugate densities are not used (or do not exist). Many numerical techniques have been developed, the most popular of which is Markov chain Monte Carlo, but a study of these is beyond this text.
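As a quick illustration of the Gaussian case, a small sketch of Equation (10.29) (the function name and the example numbers are hypothetical) shows the weighting between the sample mean and the a priori value:

def posterior_mean(xbar_n, n, mu0, sigma2, sigma02):
    # Equation (10.29): weight the sample mean against the a priori mean.
    # As n grows, w -> 1 and the data dominate the prior.
    w = n * sigma02 / (sigma2 + n * sigma02)
    return w * xbar_n + (1.0 - w) * mu0

print(posterior_mean(xbar_n=1.2, n=10, mu0=0.0, sigma2=1.0, sigma02=1.0))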
Comment 10.9: Bayesian estimation is controversial in traditional statistics. Many statisticians object to the idea that θ is random and furthermore argue the a priori distribution f (θ) represents the statistician’s biases and should be avoided. The counterargument made by many engineers and Bayesian statisticians is that a random θ is perfectly reasonable in many applications. They also argue that the a priori distribution represents the engineer’s or statistician’s prior knowledge gained from past experience. Regardless of the philosophical debate, Bayesian estimation is growing in popularity. It is especially handy in sequential estimation, when the estimate is updated with each new observation.
PROBLEMS

10.1 List several practical (nonstatistical) reasons why election polls can be misleading.

10.2 Ten data samples are 1.47, 2.08, 3.77, 1.01, 0.42, 0.77, 3.17, 2.89, 2.42, and −0.65. Compute the following:
a. the sample mean
b. the sample variance
c. the sample distribution function
d. a parametric estimate of the density, assuming the data are Gaussian with the sample mean and variance
10.3 Ten data samples are 1.44, 7.62, 15.80, 14.14, 3.54, 11.76, 14.40, 12.33, 7.08, and 3.40. Compute the following:
a. the sample mean
b. the sample variance
c. the sample distribution function
d. a parametric estimate of the density, assuming the data are exponential with the sample mean
10.4 Generate 100 samples from an N(0,1) distribution.
a. Compute the sample mean and variance.
b. Plot the sample distribution function against a Gaussian distribution function.
c. Plot the parametric estimate of the density using the sample mean and variance against the actual Gaussian density.

10.5 Generate 100 samples from an exponential distribution with λ = 1.
a. Compute the sample mean and variance.
b. Plot the sample distribution function against an exponential distribution function.
c. Plot the parametric estimate of the density using the sample mean against the actual exponential density.
10.6 The samples below are IID from one of the three distributions: N(μ, σ²), exponential with parameter λ, or uniform U(a,b) where b > a. For each data set, determine which distribution best describes the data (including values for the unknown parameters). Justify your answers.
a. Xi = [0.30, 0.48, −0.24, −0.04, 0.023, −0.37, −0.18, −0.02, 0.47, 0.46]
b. Xi = [2.41, 2.41, 5.01, 1.97, 3.53, 3.14, 4.74, 3.03, 2.02, 4.01]
c. Xi = [1.87, 2.13, 1.82, 0.07, 0.22, 1.43, 0.74, 1.20, 0.61, 1.38]
d. Xi = [0.46, −0.61, −0.95, 1.98, −0.13, 1.57, 1.01, 1.44, 2.68, −3.31]
10.7 The samples below are IID from one of the three distributions: N(μ, σ²), exponential with parameter λ, or uniform U(a,b) where b > a. For each data set, determine which distribution best describes the data (including values for the unknown parameters). Justify your answers.
a. Xi = [1.50, 0.52, 0.88, 0.18, 1.24, 0.41, 0.32, 0.14, 0.23, 0.96]
b. Xi = [3.82, 4.39, 4.75, 0.74, 3.08, 1.48, 3.60, 1.45, 2.71, −1.75]
c. Xi = [−0.18, −0.22, 0.45, −0.04, 0.32, 0.38, −0.48, −0.01, −0.45, −0.23]
d. Xi = [−0.85, 0.62, −0.22, −0.69, −1.94, 0.80, −1.23, −0.03, 0.01, −1.28]
10.8 Let Xi for i = 1, 2, ..., n be a sequence of independent, but not necessarily identically distributed, random variables with common mean E[Xi] = μ and different variances Var[Xi] = σi². Let T = Σᵢ₌₁ⁿ αiXi be an estimator of μ. Use Lagrange multipliers to minimize the variance of T subject to the constraint E[T] = μ. What are the resulting αi? (This problem shows how to combine observations of the same quantity, where the observations have different accuracies.)

10.9 Show Equation (10.6), which is an excellent exercise in the algebra of expected values. A crucial step is the expected value of XiXj:

E[XiXj] = σ² + μ²   if i = j
        = μ²        if i ≠ j
10.10 The exponential weighting in Section 10.4 can be interpreted as a lowpass filter operating on a discrete time sequence. Compute the frequency response of the filter, and plot its magnitude and phase.

10.11 Let Xi be n IID U(0,1) random variables. What are the mean and variance of the minimum-order and maximum-order statistics?

10.12 Let Xi be n IID U(0,1) random variables. Plot, on the same axes, the distribution functions of X, of X(1), and of X(n).

10.13 Let X1, X2, ..., Xn be n independent, but not necessarily identically distributed, random variables. Let X(k;m) denote the kth-order statistic taken from the first m random variables. Then,
Pr[X(k;n) ≤ x] = Pr[X(k−1;n−1) ≤ x] · Pr[Xn ≤ x] + Pr[X(k;n−1) ≤ x] · Pr[Xn > x]

a. Justify the recursion above. Why is it true?
b. Rewrite the recursion using distribution functions.
c. Recursive calculations need boundary conditions. What are the boundary conditions for the recursion?

10.14 Write a short program using the recursion in Problem 10.13 to calculate order statistic distributions.
a. Use your program to calculate the distribution function of the median of n = 9 Gaussian N(0,1) random variables.
b. Compare the computer calculation in part a, above, to Equation (10.9). For example, show the recursion calculates the same values as Equation (10.9).

10.15 Use the program you developed in Problem 10.14 to compute the mean of the median-order statistic of n = 5 random variables:
a. Xi ∼ N(0,1) for i = 1, 2, ..., 5.
b. Xi ∼ N(0,1) for i = 1, 2, ..., 4 and X5 ∼ N(1,1).
c. Xi ∼ N(0,1) for i = 1, 2, 3 and Xi ∼ N(1,1) for i = 4, 5.

10.16 If X1, X2, ..., Xn are n IID continuous random variables, then the density of the kth-order statistic can be written as

fX(k)(x) dx = (n; k−1, 1, n−k) F^(k−1)(x) · f(x) dx · (1 − F(x))^(n−k)

where (n; k−1, 1, n−k) is the multinomial coefficient.
a. Justify the formula above. Why is it true?
b. Use the formula to compute and plot the density of the median of n = 5 N(0,1) random variables.

10.17 Repeat Example 10.10, but using the parameterization of the exponential distribution in Equation (8.14). In other words, what is the censored estimate of μ?

10.18 Assuming X is continuous, what value of x maximizes the variance of the distribution function estimate F̂(x) (Equation 10.10)?

10.19 Write a short computer function to compute a KDE using a Gaussian kernel and Silverman’s rule for h. Your program should take two inputs: a sequence of observations and a sequence of target values. It should output a sequence of density estimates, one density estimate for each value of the target sequence. Test your program by reproducing Figure 10.3.

10.20 Use the KDE function in Problem 10.19 to compute a KDE of the density for the data in Problem 10.2.

10.21 Use the values for â (Equation 10.21) and b̂ (Equation 10.22) to derive Equation (10.23).

10.22 Repeat the steps of Example 10.10 for the alternative parameterization of the exponential fT(t) = (1/t0)e^(−t/t0) to find the MLE of t0.

10.23 Let Xi be IID U(0, θ), where θ > 0 is an unknown parameter that is to be estimated. What is the MLE of θ?

10.24 Repeat the calculation of the sequence of probability estimates of p in Example 10.11 using α = β = 0.5.
CHAPTER 11

GAUSSIAN RANDOM VECTORS AND LINEAR REGRESSION
Multiple Gaussian random variables are best dealt with as random vectors. Many manipulations of Gaussian random vectors reduce to standard operations on vectors and matrices. This chapter also introduces linear regression, a common technique for estimating the parameters in linear models.
11.1 GAUSSIAN RANDOM VECTORS

Multiple Gaussian random variables occur in many applications. The easiest way to manipulate multiple random variables is to introduce random vectors. This section begins by discussing multiple Gaussian random variables, then introduces random vectors and concludes with some properties of Gaussian random vectors.
Let X, Y, and Z be independent Gaussian random variables. Then, the joint distribution function is the product of the individual distribution functions:

FXYZ(x,y,z) = Pr[{X ≤ x} ∩ {Y ≤ y} ∩ {Z ≤ z}] = Pr[X ≤ x] · Pr[Y ≤ y] · Pr[Z ≤ z] = FX(x)FY(y)FZ(z)
The joint density is

fXYZ(x,y,z) = ∂³FXYZ(x,y,z)/∂x∂y∂z
            = (d/dx)FX(x) · (d/dy)FY(y) · (d/dz)FZ(z)    (by independence)
            = fX(x)fY(y)fZ(z)
            = [1/(σxσyσz(2π)^(3/2))] exp(−(x − μx)²/(2σx²) − (y − μy)²/(2σy²) − (z − μz)²/(2σz²))
These expressions rapidly get unwieldy as the number of random variables grows. To simplify the notation, and improve the understanding, it is easier to use random vectors.
We use three different notations for vectors and matrices, depending on the situation. From simplest to most complicated, a vector can be represented as

a = [ai] = (a1, a2, ..., an)ᵀ

and an n × m matrix as

A = [aij],   with aij the entry in row i and column j

A random vector is a vector, x, whose components are random variables and can be represented as

x = (X1, X2, ..., Xn)ᵀ

where each Xi is a random variable.
where each X i is a random variable. Comment 11.1: In this chapter, we deviate slightly from our usual practice of writing random variables as bold-italic uppercase letters. We write random vectors as bold-italic lowercase letters adorned with the vector symbol (the small arrow on top of the letter). This allows us to use lowercase letters for vectors and uppercase letters for matrices. We still use bold-italic uppercase letters for the components of random vectors, as these components are random variables.
Expected values of random vectors are defined componentwise:

E[x] = (E[X1], E[X2], ..., E[Xn])ᵀ = (μ1, μ2, ..., μn)ᵀ = μ
Correlations and covariances are defined in terms of matrices. The outer product xxᵀ is the n × n matrix whose (i,j) entry is XiXj:

xxᵀ = [XiXj]    (11.1)

The autocorrelation matrix is its expected value:

Rxx = E[xxᵀ] = [E[XiXj]] = [rij]    (11.2)

that is, the n × n matrix with entries rij = E[XiXj].
Where possible, we drop the subscripts on R, μ, and C (see below). The covariance matrix, C = Cxx, is

C = E[(x − μ)(x − μ)ᵀ] = E[xxᵀ] − μμᵀ = R − μμᵀ    (11.3)
Recall that a matrix A is symmetric if A = Aᵀ. It is nonnegative definite if aᵀAa ≥ 0 for all a. It is positive definite if aᵀAa > 0 for all a ≠ 0.
Since XiXj = XjXi (ordinary multiplication commutes), rij = rji. Thus, R is symmetric. Similarly, C is symmetric.
R and C are also nonnegative definite. To see this, let a be an arbitrary nonzero vector, and let Y = xᵀa. Then,

aᵀRa = aᵀE[xxᵀ]a = E[aᵀxxᵀa] = E[(aᵀx)(xᵀa)] = E[YᵀY] = E[Y²] ≥ 0

In the above argument, Y = xᵀa is a 1 × 1 matrix (i.e., a scalar). Therefore, Yᵀ = Y and YᵀY = Y². The same argument shows that C is nonnegative definite.
Sometimes, there are two different random vectors, x and y. The joint correlation and covariance (subscripts are needed here) are the following:

Rxy = E[xyᵀ]
Cxy = E[(x − μx)(y − μy)ᵀ] = Rxy − μxμyᵀ

Rxy and Cxy are, in general, neither symmetric nor nonnegative definite.
The determinant of a matrix is denoted |C|. If the covariance matrix is positive definite, then it is also invertible. If so, then we say x ∼ N(μ, C) if

fx(x) = [1/√((2π)ⁿ|C|)] exp(−(x − μ)ᵀC⁻¹(x − μ)/2)    (11.4)
If C is diagonal, the density simplifies considerably:

C = diag(σ1², σ2², ..., σn²)
|C| = σ1²σ2²···σn²,   √|C| = σ1σ2···σn
C⁻¹ = diag(1/σ1², 1/σ2², ..., 1/σn²)

The quadratic form in the exponent becomes

(x − μ)ᵀC⁻¹(x − μ) = Σᵢ₌₁ⁿ (xi − μi)²/σi²

The overall density then factors into a product of individual densities:

fx(x) = [1/(σ1σ2···σn√((2π)ⁿ))] exp(−Σᵢ₌₁ⁿ (xi − μi)²/(2σi²))
      = [1/(σ1√(2π))] exp(−(x1 − μ1)²/(2σ1²)) ··· [1/(σn√(2π))] exp(−(xn − μn)²/(2σn²))
      = fX1(x1)fX2(x2)···fXn(xn)
Thus, if C is diagonal, the X i are independent. This is an important result. It shows that if Gaussian random variables are uncorrelated, which means C is diagonal, then they are independent. In practice, it is difficult to verify that random variables are independent, but it is often easier to show the random variables are uncorrelated. Comment 11.2: In Comment 8.2, we pointed out that to show two random variables are independent, it is necessary to show the joint density factors for all values of x and y. When X and Y are jointly Gaussian, the process is simpler. All that needs to be done is to show X and Y are uncorrelated (i.e., that E XY = E X · E Y ). Normally, uncorrelated random variables are not necessarily independent, but for Gaussian random variables, uncorrelated means independent.
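This factorization is easy to verify numerically. The sketch below (assuming SciPy is available; the means, standard deviations, and test point are arbitrary) evaluates the joint density (11.4) with a diagonal C and compares it with the product of the univariate Gaussian densities:

import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0, 0.5])
sig = np.array([1.0, 2.0, 0.5])     # standard deviations
C = np.diag(sig ** 2)               # diagonal covariance matrix

x = np.array([0.3, -1.0, 1.2])
joint = multivariate_normal(mu, C).pdf(x)
product = np.prod(norm.pdf(x, mu, sig))
print(joint, product)               # equal up to floating-point error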
The MGF of a vector Gaussian random variable is useful for calculating moments and showing that linear operations on Gaussian random variables result in Gaussian random variables. First, recall the MGF of a single N(μ, σ²) random variable is Equation (9.15):

M(u) = E[e^(uX)] = exp(uμ + σ²u²/2)

If X1, X2, ..., Xn are IID N(μ, σ²), then the MGF of the vector x is the product of the individual MGFs:

M(u1, u2, ..., un) = E[e^(u1X1 + u2X2 + ··· + unXn)]
                   = E[e^(u1X1)] E[e^(u2X2)] ··· E[e^(unXn)]    (by independence)
                   = M(u1) M(u2) ··· M(un)
                   = exp((u1 + u2 + ··· + un)μ) · exp((u1² + u2² + ··· + un²)σ²/2)

The last expression can be simplified with matrices and vectors:

M(u) = E[e^(uᵀx)] = exp(uᵀμ + σ²uᵀu/2)    (11.5)

where u = [u1, u2, ..., un]ᵀ and μ = [μ, μ, ..., μ]ᵀ.
The MGF of the general case x ∼ N(μ, C) is the same as Equation (11.5), but with C replacing σ²I:

M(u) = exp(uᵀμ + uᵀCu/2)    (11.6)
In Table 11.1, we list several equivalents between formulas for one Gaussian random variable and formulas for Gaussian vectors.
TABLE 11.1 Table of equivalents for one Gaussian random variable and n Gaussian random variables.

1 Gaussian Random Variable                    n Gaussian Random Variables
X ∼ N(μ, σ²)                                  x ∼ N(μ, C)
E[X] = μ                                      E[x] = μ
Var[X] = σ²                                   Cov[x] = C
fX(x) = [1/√(2πσ²)] exp(−(x−μ)²/(2σ²))        fx(x) = [1/√((2π)ⁿ|C|)] exp(−(x−μ)ᵀC⁻¹(x−μ)/2)
M(u) = exp(uμ + σ²u²/2)                       M(u) = exp(uᵀμ + uᵀCu/2)
Comment 11.3: The covariance matrix of x may not be invertible. A simple example is when X1 = X2 with each N(0,1). Then,

C = [1 1; 1 1]   is not invertible

In cases like this one, the PDF (Equation 11.4) is undefined, but the MGF (Equation 11.6) is fine. Advanced treatments define a random vector as Gaussian if its MGF is Equation (11.6) and then derive the PDF from the MGF if the covariance is invertible. In other words, advanced treatments first define the MGF because the MGF does not require the covariance matrix to be invertible. If the covariance happens to be invertible, then the density is defined.
11.2 LINEAR OPERATIONS ON GAUSSIAN RANDOM VECTORS

One important property of the Gaussian distribution is that linear operations on Gaussian random variables yield Gaussian random variables. In this section, we explore this concept.
Let x be Gaussian with mean μx and covariance Cxx, and let y be a linear (affine) transformation of x:

y = Ax + b

What are the mean and covariance of y? Fortunately, both are easily calculated:

E[y] = E[Ax + b] = AE[x] + b = Aμx + b    (E[·] and linear operators commute)
Cyy = E[(y − μy)(y − μy)ᵀ]
    = E[(Ax + b − Aμx − b)(Ax + b − Aμx − b)ᵀ]
    = E[A(x − μx)(A(x − μx))ᵀ]
    = E[A(x − μx)(x − μx)ᵀAᵀ]
    = AE[(x − μx)(x − μx)ᵀ]Aᵀ
    = ACxxAᵀ
Thus, y = Ax + b is Gaussian, with mean Aμx + b and covariance ACxxAᵀ.
For instance, if

μx = [3, 1]ᵀ,   b = [2, 1]ᵀ,   Cxx = [2 1; 1 2],   and A = [1 1; 0 1],

then

μy = Aμx + b = [4, 1]ᵀ + [2, 1]ᵀ = [6, 2]ᵀ
Cyy = ACxxAᵀ = [6 3; 3 2]

Therefore, y is Gaussian with mean [6, 2]ᵀ and covariance [6 3; 3 2].
As another example, computers can generate IID Gaussian random variables (see Section 9.6.2). Let x be a vector of IID N(0,1) Gaussian random variables, and let y = Ax be a transformed vector. Then, the covariance of y is Cyy = ACxxAᵀ = AIAᵀ = AAᵀ since Cxx = I. Thus, to generate a vector y with a specified covariance Cyy, it is necessary to find a matrix A that is the “square root” of Cyy. Since Cyy is symmetric and nonnegative definite, square roots exist (this is a known result in linear algebra). In fact, if Cyy is 2 × 2 or larger, there are an infinite number of square roots. One particularly simple method of finding square roots is the Cholesky decomposition, which finds square roots that are triangular matrices. See Problem 11.6 for further information on the Cholesky decomposition and matrix square roots.
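A minimal NumPy sketch of this recipe (an illustration, not the book's code; the target covariance is the one from the example above) generates samples with a specified covariance via the Cholesky square root:

import numpy as np

rng = np.random.default_rng(1)
C_yy = np.array([[6.0, 3.0],
                 [3.0, 2.0]])            # desired covariance (from the example)
A = np.linalg.cholesky(C_yy)             # lower-triangular square root: A @ A.T = C_yy

x = rng.standard_normal((2, 100000))     # columns are IID N(0,1) vectors
y = A @ x                                # transformed vectors with covariance ~ C_yy
print(np.cov(y))                         # approximately [[6, 3], [3, 2]]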
11.3 LINEAR REGRESSION

We introduce the linear regression problem. Linear regression generalizes the sample mean (Section 10.2) and the MMSE (Section 10.11) by estimating a vector of unknowns β1, β2, ..., βm from a vector of observations, y1, y2, ..., yn. We concentrate mostly on linear models:

yi ≈ xi,1β1 + xi,2β2 + ··· + xi,mβm   for i = 1, 2, ..., n

where the xi,j are known values and the βj are coefficients to be determined.
Linear regression problems appear in numerous applications in science, engineering, economics, and many other fields. It is fair to say that linear regression is one of the most important problems in computational science and data analysis.
Before considering a general linear regression problem, we start with a simple one, estimating the sample mean (Section 10.2). Recall that the sample mean can be derived as follows: Let Yi be a series of observations with the simple model Yi = μ + εi. An estimate of μ can be obtained as the value that minimizes the sum of squared errors:

μ̂ = argmin_μ Σᵢ₌₁ⁿ (Yi − μ)²

The optimal estimate is found by differentiating the sum with respect to μ, setting the derivative to 0, and solving for μ̂, yielding the sample mean:

μ̂ = (1/n) Σᵢ₌₁ⁿ Yi
Here, we present a quick overview of the main linear regression results. Instead of one unknown, there are m unknowns, organized as a vector β, and the model generalizes:

y = Xβ + ε

The estimate is chosen to minimize the total squared error:

Q(β) = ‖y − Xβ‖² = (y − Xβ)ᵀ(y − Xβ)

After differentiating, we obtain the normal equations (these are m linear equations in m unknowns):

0 = Xᵀ(y − Xβ̂)

Assuming X is full rank, the normal equations can be solved for the optimal estimate:

β̂ = (XᵀX)⁻¹Xᵀy
In the subsections below, we consider the linear regression problem in more detail, study the statistics of the linear regression estimate, present several applications, consider the computational procedures for obtaining the estimate, and discuss a few extensions.
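As a preview, here is a minimal NumPy sketch (an illustration under made-up data, not the book's code) of the normal-equations solution β̂ = (XᵀX)⁻¹Xᵀy:

import numpy as np

rng = np.random.default_rng(2)
n, m = 50, 3
X = rng.standard_normal((n, m))          # known design matrix
beta_true = np.array([1.0, -2.0, 0.5])   # "unknown" parameters for the demo
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Solve the normal equations X^T X betahat = X^T y rather than inverting X^T X.
betahat = np.linalg.solve(X.T @ X, X.T @ y)
print(betahat)                           # close to beta_true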
11.3.1 Linear Regression in Detail

Linear regression is a class of parameter estimation problems. Typically, there is a linear relation between the observations and unknowns. Since there are more observations (equations) than unknowns, the equations cannot be solved exactly, and an approximate solution is sought.
Let β be an m × 1 vector of unknown parameters that are to be estimated, let X be an n × m matrix of known values, and let y be an n × 1 vector of observations. Then, as shown
FIGURE 11.1 Illustration of the linear regression problem, y = Xβ + ε. There are n equations in m unknowns. Typically, X is a “tall” matrix with n > m.
in Figure 11.1, the linear regression model is

y = Xβ + ε    (11.7)

with the following terms:

• y is a size n random vector of known observations.
• X is an n × m matrix of known (nonrandom) values.
• β is a vector of size m of unknown parameters to be estimated.
• ε is a vector of size n of IID Gaussian random variables, each with mean 0 and variance σ²; that is, ε ∼ N(0, σ²I). Note that ε is not observed directly, only indirectly through y. Typically, ε is thought of as additive observation noise.
• The estimate of β is denoted β̂. It is a random vector dependent on the random noise ε.
• The estimate of Xβ (what the observations would be if there were no noise) is ŷ = Xβ̂.

The linear regression problem is to estimate β from the observations y. What makes the problem difficult is that there are (usually) more equations than there are unknowns (i.e., n > m) and there is no solution to the equation y = Xβ.
Since there are usually no solutions to Equation (11.7), we opt to find the best approximate solution, with “best” meaning the vector that minimizes the squared error. This is known as the least squares solution. Define the squared value of a vector, r, as

‖r‖² = rᵀr = Σᵢ₌₁ⁿ ri²

Denote the squared error as a function of β as Q(β). Letting r = y − Xβ, the squared error is

Q(β) = ‖y − Xβ‖² = (y − Xβ)ᵀ(y − Xβ) = yᵀy − yᵀXβ − βᵀXᵀy + βᵀXᵀXβ
All the terms are 1 × 1 objects. Therefore,

yᵀXβ = (yᵀXβ)ᵀ = βᵀXᵀy

Thus, the two middle terms can be combined, yielding

Q(β) = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ

The linear least squares, or simply the least squares, estimate of β is the value of β that minimizes the squared error. It is denoted β̂.

β̂ = argmin_β Q(β)    (11.8)
  = argmin_β (yᵀy − 2βᵀXᵀy + βᵀXᵀXβ)    (11.9)
The squared error is a function of the m components of β; that is, βj, for j = 1, 2, ..., m. Minimizing a multivariable function is logically the same as minimizing a scalar function: differentiate the function with respect to the unknown variables, and set the derivatives to 0. In this case, we have to differentiate with respect to the vector β. This is known as the gradient and is denoted as

∇Q(β) = (d/dβ)Q(β) = [dQ(β)/dβ1, dQ(β)/dβ2, ..., dQ(β)/dβm]ᵀ

We can differentiate each term separately:

∇(yᵀy) = 0
∇(βᵀXᵀy) = Xᵀy
∇(βᵀXᵀXβ) = 2XᵀXβ
Combining and dividing by 2, the optimal estimate satisfies the following equation:

0 = Xᵀ(y − Xβ̂)    (11.10)

These are known as the normal equations. In this case, normal means perpendicular (or orthogonal). Since β̂ is the estimate of β, it is natural to take ŷ = Xβ̂ as the estimate of Xβ.
The normal equations can be multiplied on the left by β̂ᵀ, giving

0 = β̂ᵀXᵀ(y − Xβ̂) = ŷᵀ(y − Xβ̂) = ŷᵀ(y − ŷ)

We see the error y − ŷ is orthogonal to ŷ. That the error is orthogonal to the estimate of y is a special case of a more general result known as the projection theorem. See Figure 11.2 for a pictorial representation of the normal equations.
FIGURE 11.2 Illustration of the normal equations. XᵀX is a square m × m matrix, Xᵀy is an m × 1 vector, and β̂ is an m × 1 vector.
To solve for β̂, first we rewrite the normal equations as

XᵀXβ̂ = Xᵀy

If XᵀX is invertible, then

β̂ = (XᵀX)⁻¹Xᵀy    (11.11)
Comment 11.4: Here is a quick review of some linear algebra:

• The rank of a matrix is the maximal number of its linearly independent rows or columns: Rank[X] ≤ min(m,n).
• A matrix is full rank if its rank equals min(n,m).
• In linear regression, m ≤ n, so Rank[X] ≤ m.
• The rank of XᵀX equals the rank of X.
• A square matrix is invertible if it is full rank.

Thus, XᵀX is invertible if n ≥ m and X is full rank.
Comment 11.5: If m > 1 and X is not full rank, the normal equations still have a solution (in fact, there are an infinite number of solutions). Linear algebraic techniques can find the solutions, but we will not discuss it further as most applications in which we are interested are full rank.
One mistake many students make is to try to simplify the solution:

β̂ = (XᵀX)⁻¹Xᵀy  =?  X⁻¹X⁻ᵀXᵀy  =?  X⁻¹y

This works only if X is invertible, which is true only if X is square (n = m) and full rank. In the usual case with n > m, X⁻¹ does not exist.
11.3.2 Statistics of the Linear Regression Estimates

In linear regression, we obtain two estimates: β̂, the estimate of the unknown parameter vector, and ŷ = Xβ̂, the estimate of the noise-free observations (as if the noise were 0). In this section, we study the statistical properties of these two estimates.
Recall that the model we consider is y = Xβ + ε with ε ∼ N(0, σ²I). Our discussion of statistical properties assumes this model is true, in other words, that it accurately reflects both the structure of the situation (Xβ) and the noise (ε). If the model is wrong, the estimates and their statistical properties may be inaccurate and misleading.
The mean and covariance of β̂ are straightforward:

E[β̂] = E[(XᵀX)⁻¹Xᵀy]
     = E[(XᵀX)⁻¹Xᵀ(Xβ + ε)]
     = β + (XᵀX)⁻¹XᵀE[ε]
     = β    (11.12)

Cov[β̂] = E[(β̂ − β)(β̂ − β)ᵀ]
       = (XᵀX)⁻¹XᵀE[εεᵀ]X(XᵀX)⁻¹
       = (XᵀX)⁻¹Xᵀσ²IX(XᵀX)⁻¹
       = σ²(XᵀX)⁻¹XᵀX(XᵀX)⁻¹
       = σ²(XᵀX)⁻¹    (11.13)

We see from Equation (11.12) that β̂ is an unbiased estimate of β and from Equation (11.13) that it has covariance σ²(XᵀX)⁻¹. Since ε is Gaussian, so is β̂: β̂ ∼ N(β, σ²(XᵀX)⁻¹). Some comments:
• (XᵀX)⁻¹ plays the role in regression that 1/n does in the sample mean. Intuitively, we want (XᵀX)⁻¹ → 0 as n → ∞. More precisely, we want the largest eigenvalue of (XᵀX)⁻¹ to tend to 0. Since the eigenvalues of (XᵀX)⁻¹ are the inverses of those of XᵀX, we want the smallest eigenvalue of XᵀX → ∞ as n → ∞.
• If XᵀX is diagonal, then (XᵀX)⁻¹ is also diagonal. In this case, the individual components of β̂ are uncorrelated with each other. Generally speaking, this is a desirable result. If possible, try to assure the columns of X are orthogonal to each other. This will result in a diagonal covariance matrix.
• When the noise ε is Gaussian, the estimate β̂ is Gaussian, and when ε is non-Gaussian, β̂ is non-Gaussian. However, if the noise is “approximately” Gaussian and n is much larger than m, CLT arguments lead to the conclusion that β̂ is approximately Gaussian with mean β and covariance σ²(XᵀX)⁻¹.

We note again that β is unknown, and so is Xβ. Since β̂ is an obvious estimate of β, ŷ = Xβ̂ is an obvious estimate of Xβ:

ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy
where we define the n × n matrix H = X(XᵀX)⁻¹Xᵀ. H is the “hat” matrix (because it puts a “hat” on y). H has a number of interesting properties:

• H is symmetric; that is, Hᵀ = H.
• H is idempotent, meaning H² = H. Similarly, (I − H)² = (I − H), and (I − H)H = 0.
• Rank[H] = m since Rank[X] = m.
• The eigenvalues of H are 0 or 1. If λ is an eigenvalue of H, then idempotency implies λ² = λ and, thus, λ = 0 or λ = 1. Furthermore, since Rank[H] = m, H has m unit eigenvalues and n − m zero eigenvalues. Also, HX = X, which means the columns of X are the eigenvectors corresponding to the unit eigenvalues.

Using H, the mean and covariance of ŷ are easily computed:
E[ŷ] = E[Hy] = E[H(Xβ + ε)] = Xβ
Cov[ŷ] = E[(ŷ − Xβ)(ŷ − Xβ)ᵀ] = HE[εεᵀ]H = σ²H

ŷ is an unbiased estimator of Xβ. The covariance of ŷ does not, however, tend to 0. Even if we knew β perfectly, y would still be contaminated by noise, ε.
We can use the properties of the hat matrix to show the orthogonality of the prediction and the prediction error. First, consider three squared values. All three of these can be computed from the observations and the estimated values:

‖y‖² = yᵀy
‖y − ŷ‖² = yᵀ(I − H)ᵀ(I − H)y = yᵀ(I − H)y
‖ŷ‖² = yᵀHᵀHy = yᵀHy

With these definitions, we get Pythagoras’s formula for the squared observation:

‖y‖² = yᵀ(I − H + H)y = yᵀ(I − H)y + yᵀHy = ‖y − ŷ‖² + ‖ŷ‖²    (11.14)
Figure 11.3 illustrates the orthogonality condition that y − ŷ is orthogonal to ŷ = Xβ̂. The line y = Xβ shows the possible values as β varies. The least squares estimate is the value that

FIGURE 11.3 Illustration of the orthogonality condition that y − ŷ is orthogonal to ŷ = Xβ̂.
is closest to y. The Pythagorean theorem says the sum of the squared lengths of the two sides, ŷ = Xβ̂ and y − ŷ, equals the squared length of the hypotenuse, y.
The expected value of ‖y − ŷ‖² gives us an estimate of σ². First, we need to look more carefully at y − ŷ:

y − ŷ = (I − H)y = (I − H)(Xβ + ε) = (I − H)ε

since HX = X. Continuing with the squared norm,

‖y − ŷ‖² = εᵀ(I − H)ᵀ(I − H)ε = εᵀ(I − H)ε

We are now in position to take an expected value, but the expected value of εᵀ(I − H)ε is unlike anything we have seen. However, we can reason our way to the answer. The n components of ε are IID N(0, σ²). An n-dimensional Gaussian distribution is circularly symmetric: the vector ε is equally likely to point in any direction. The I − H matrix has n − m unit eigenvalues (λ = 1) and n − m corresponding eigenvectors. On average, the n components of ε match n − m of these eigenvectors. Therefore,

E[‖y − ŷ‖²] = (n − m)σ²

This suggests a reasonable estimate of σ²:

σ̂² = ‖y − ŷ‖²/(n − m) = (y − ŷ)ᵀ(y − ŷ)/(n − m)    (11.15)
For example, the sample mean corresponds to a simple regression problem with m = 1. We have seen that the squared sample error divided by n − 1 is an unbiased estimate of σ² (Equation 10.7). This formula is a direct generalization of that one.
A popular statistical measure of how well the estimate “explains” the data is R² (sometimes written as R2). Let ȳ = Σᵢ₌₁ⁿ yi/n. Then,

R² = 1 − Σᵢ₌₁ⁿ (yi − ŷi)² / Σᵢ₌₁ⁿ (yi − ȳ)²    (11.16)
   = 1 − (sum of squared residuals)/(sum of squared observation deviations)
Clearly, R2 ≤ 1. Larger values of R2 (closer to 1) indicate a better fit of the model to the data.
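Both diagnostics are one-liners in practice. Here is a small sketch (an illustration, not the book's code; it assumes X, y, and betahat are NumPy arrays as in the earlier sketch):

import numpy as np

def regression_diagnostics(X, y, betahat):
    # sigma2_hat from Equation (11.15); R^2 from Equation (11.16)
    n, m = X.shape
    resid = y - X @ betahat
    sigma2_hat = (resid @ resid) / (n - m)
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return sigma2_hat, r2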
11.3.3 Computational Issues

Linear regression is a common computational task. Many problems have hundreds or even thousands of rows and columns. It is important to compute the estimate efficiently and accurately.
The main computational tasks are computing XᵀX and inverting it. XᵀX is an m × m symmetric matrix. It has m(m + 1)/2 unique elements (the indices of the elements are an ordered selection with replacement of two items from m). Computing each element requires approximately n operations (where an operation is a floating point multiplication followed by an addition). Thus, computing the matrix requires approximately nm²/2 operations. Inverting XᵀX is usually done by factoring it into RᵀR, with R being an
m × m upper triangular matrix (Cholesky factorization). This step requires approximately m³/6 operations. After factoring, the normal equations can be solved in approximately m² operations. Thus, solving the normal equations requires approximately nm²/2 + m³/6 operations (plus lower-order terms we generally ignore).
The computational time to compute β̂ drops dramatically if the columns of X are orthogonal. Then, the off-diagonal terms of XᵀX are zero. Only the m diagonal terms need be computed and inverted. The m diagonal terms can be computed in nm operations. Inverting the diagonal matrix requires m operations.
For large problems with a nondiagonal XᵀX, computational accuracy is paramount. One preferred technique is the QR method. Let Q be an orthogonal matrix; that is, QᵀQ = I. The crucial idea behind the QR method is that multiplying by an orthogonal matrix leaves the squared error unchanged. Let u be a vector and v = Qu. Then,

‖v‖² = vᵀv = uᵀQᵀQu = uᵀu = ‖u‖²

Multiplying by an orthogonal matrix transforms the linear regression problem into an equivalent one:

‖y − Xβ‖² = ‖Qy − QXβ‖²
In the QR method, the orthogonal matrix is chosen to satisfy two criteria:

1. QX can be computed efficiently.
2. QX has a special form: the first m rows are an m × m upper triangular matrix R̃, and the lower n − m rows are all 0’s:

QX = R = [R̃; 0]   and   Qy = [y∥; y⊥]

QX is illustrated in Figure 11.4. The squared error then simplifies:

‖y − Xβ‖² = ‖Qy − QXβ‖² = ‖y∥ − R̃β‖² + ‖y⊥‖²

The last term is independent of β, and the penultimate term can be solved exactly:

y∥ − R̃β̂ = 0   ⟹   β̂ = R̃⁻¹y∥
FIGURE 11.4 Illustration of the QR algorithm showing QX = R, with R consisting of an upper triangular matrix R̃ and rows of zeros.
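A minimal NumPy/SciPy sketch of the QR solution (an illustration, not the book's code; it uses the "reduced" QR, which is the full QR with the rows of zeros already dropped):

import numpy as np
from scipy.linalg import solve_triangular

def lstsq_qr(X, y):
    # Reduced QR: X = Q Rtilde, with Q (n x m) having orthonormal columns
    # and Rtilde (m x m) upper triangular.
    Q, Rtilde = np.linalg.qr(X)
    y_par = Q.T @ y                   # component of y in the column space of X
    return solve_triangular(Rtilde, y_par)   # betahat = Rtilde^{-1} y_par

Because Rtilde is triangular, the final solve is a simple back-substitution, and XᵀX is never formed.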
The QR method essentially consists of the following steps:

1. Compute QX = R. In practice, Q is computed as a product of simple orthogonal matrices. A product of orthogonal matrices is an orthogonal matrix. Each one “zeros” out some portion of X. Popular choices are Householder and Givens matrices.
2. Compute β̂ = R̃⁻¹y∥.
3. The residual squared error is ‖y − ŷ‖² = ‖y⊥‖².

The QR method requires a bit more work than the direct solution, but it is more accurate. It avoids “squaring” X in computing XᵀX. Consider, for example, two numbers, 1 and 10. Squaring them results in 1 and 100. Doing computations on the squares requires more computational precision than doing computations on the original numbers. The QR method never computes XᵀX. See Problems 11.13 to 11.16 for further information on computing the QR method with Givens matrices.
In Matlab and Python, the user creates the y vector and the X matrix and calls a solver. In Matlab, the solver is easy:

betahat = X \ y;

Python uses a library function. In addition to the estimate, it also returns the sum of squared residuals, rank of X, and singular values of X:

import numpy.linalg as lg
betahat, resid, rank, singvals = lg.lstsq(X, y)

In R, the syntax is a bit different. Instead of creating a matrix X, the user creates one or more columns and tells R to fit a linear model. Let x1, x2, x3, and y be columns of data. The regression of y against the three columns is

m <- lm(y ~ x1 + x2 + x3)
CHAPTER 12

HYPOTHESIS TESTING

12.2 EXAMPLE: RADAR DETECTION

In this example,

H0 : X ∼ N(0, σ²)    (12.3)
H1 : X ∼ N(s, σ²)    (12.4)
where s > 0 is the signal level. Figure 12.1 shows the densities of the two alternatives (drawn for s = σ = 1). Since H1 has a shift s = 1, its density is shifted to the right.
The obvious question is “How should we make the decision?” Looking at Figure 12.1, we see the left side favors H0, while the right side favors H1. In the communications problem, H0 corresponds to a 0 being sent, while H1 corresponds to a 1 being sent. Usually, both hypotheses are equally likely (a 0 being sent is as likely as a 1 being sent), and the two errors are equally important. In this case, it makes sense to minimize the number of errors. This happens when Ĥ = 0 if X ≤ 0.5 and Ĥ = 1 if X > 0.5 (Figure 12.1 is symmetrical about X = 0.5).
In the radar problem, and in many others, it is difficult to estimate the prior probabilities of H0 and H1, and the two errors can have very different costs. For instance, a radar miss usually is not too serious for
FIGURE 12.1 The densities for H0 and H1 as described in Equations (12.3) and (12.4). H1 is shifted to the right by s = 1. The decision threshold xT = 1.5 is illustrated. If the observation x is less than xT, decide Ĥ = 0; otherwise, decide Ĥ = 1. The top graph shows Pr[TP] (the light gray region) and Pr[TN] (the dark gray region). The bottom graph shows Pr[FN] (light gray) and Pr[FP] (dark gray). Note that the two light gray regions sum to 1 (Pr[FN] + Pr[TP] = 1) and that the two dark gray regions sum to 1 (Pr[TN] + Pr[FP] = 1). With the decision threshold xT = 1.5 (as illustrated here), Pr[FN] is much larger than Pr[FP].
the simple reason that the radar test will be repeated in the next second or so (as the radar antenna spins around). A false alarm, on the other hand, can be very serious. It may result in a missile launch or the scrambling of fighter jets. A false positive in a drug test might result in a person not being hired. A false positive in a cancer test might result in the person undergoing unnecessary surgeries or chemotherapy.
To accommodate the many different scenarios, we will not focus on how the decision should be made in any specific circumstance, but on the general procedure for making decisions.
Returning to Figure 12.1, the obvious procedure is to select a threshold, xT, and decide for the alternative (Ĥ = 1) if X > xT; otherwise, decide for the null hypothesis. The figure shows a threshold xT = 1.5.
One convention in hypothesis testing is to focus on the false-positive probability, Pr[FP], and the true-positive probability, Pr[TP] = 1 − Pr[FN]. These are the right side regions in Figure 12.1 and are also illustrated in Figure 12.2. The false-positive probability is the dark gray region, and the true-positive probability is the light gray region (partially hidden by the dark gray region in Figure 12.2).
As the threshold varies, so do the true-positive and false-positive probabilities. Figure 12.3 shows four versions of the hypothesis densities with four different values of the threshold.
FIGURE 12.2 The same situation as in Figure 12.1, but showing only the probabilities of a true positive and a false positive. The dark gray region is the probability of a false positive. The light gray region (partially obscured by the dark gray region) is the probability of a true positive.
FIGURE 12.3 Hypothesis densities with four different values of the decision threshold, xT = 0, 1, 2, and 3. Note that as the threshold is increased (moves to the right), both the probability of a false positive and the probability of a true positive decrease, though not at the same rate.
When xT = 0, both probabilities are fairly large. As xT increases, both probabilities shrink, though not at the same rate. In light of this, the user picks a threshold value that results in acceptable false-positive and true-positive probabilities (or, equivalently, false alarm and miss probabilities).
A graphical technique to help characterize the decision is the receiver operating characteristic (ROC) graph. A ROC graph is a plot of the true-positive probability versus the false-positive probability as the threshold is changed. For the radar problem, the true-positive and false-positive probabilities are the following:
Pr[TP] = Pr[X > xT | H1]                     (definition of TP)
       = Pr[(X − s)/σ > (xT − s)/σ | H1]      (normalize)
       = 1 − Φ((xT − s)/σ)                    (under H1, X ∼ N(s, σ²))

Pr[FP] = Pr[X > xT | H0]                      (definition of FP)
       = Pr[X/σ > xT/σ | H0]                  (normalize)
       = 1 − Φ(xT/σ)                          (under H0, X ∼ N(0, σ²))
The ROC curve can be found by solving the second equation above for xT and substituting into the first:

xT = σΦ⁻¹(1 − Pr[FP])
Pr[TP] = 1 − Φ((σΦ⁻¹(1 − Pr[FP]) − s)/σ) = 1 − Φ(Φ⁻¹(1 − Pr[FP]) − s/σ)    (12.5)
An ROC plot for s = σ is shown in Figure 12.4. As the threshold changes, both Pr[FP] and Pr[TP] change. An ideal test would have Pr[FP] = 0 and Pr[TP] = 1. In an ROC plot, this is the upper left corner. The closer the ROC curve is to the upper left corner, the better is the test.
If the performance of the test is insufficient (i.e., not close enough to the upper left corner), the ratio s/σ must be increased. Because s is related to the strength of the reflection from the target, perhaps the radar power can be increased, which will increase s. Because σ comes from the noise in the system, the noise can be reduced by repeating the experiment several times and replacing X by X̄n. If the repetitions are independent, the variance of the sample average is σ²/n. The standard deviation is reduced to σ/√n, and therefore s/σ is replaced by √n·s/σ. The ROC curves for s/σ = 1, 2, and 3 are shown in Figure 12.5. As s/σ increases, the curves move toward the upper left corner, where Pr[TP] is large and Pr[FP] is small.
Other hypothesis tests are solved the same way: First, characterize the two hypotheses in terms of the measurement, X. Next, look at different decision rules, typically comparing X to a threshold, and then determine the best trade-off of true versus false positives.
FIGURE 12.4 The ROC curve for the hypothesis test depicted in Figure 12.1. The labeled dots correspond to the four subfigures in Figure 12.3.
FIGURE 12.5 ROC curves for s/σ = 1, 2, and 3.
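Curves like those in Figures 12.4 and 12.5 are easy to generate numerically. A minimal sketch of Equation (12.5) (an illustration, not the book's code; it assumes SciPy for Φ and Φ⁻¹):

import numpy as np
from scipy.stats import norm

def roc(s_over_sigma, p_fp):
    # Equation (12.5): Pr[TP] as a function of Pr[FP]
    return 1.0 - norm.cdf(norm.ppf(1.0 - p_fp) - s_over_sigma)

p_fp = np.linspace(0.001, 0.999, 999)
for ratio in (1.0, 2.0, 3.0):
    p_tp = roc(ratio, p_fp)          # one ROC curve per value of s/sigma
    print(ratio, p_tp[499])          # Pr[TP] at Pr[FP] = 0.5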
EXAMPLE 12.1
Consider a simple hypothesis testing example. Under H0, X ∼ U(0,1), and under H1, X ∼ U(0.5,2.5). The two densities are shown in Figure 12.6. For the simple test comparing X to a threshold xT, what are the false-positive and false-negative probabilities?

Pr[FP] = Pr[X > xT | H0] = 1               for xT < 0
                         = 1 − xT          for 0 ≤ xT ≤ 1
                         = 0               for xT > 1

Pr[FN] = Pr[X ≤ xT | H1] = 0               for xT < 0.5
                         = (xT − 0.5)/2    for 0.5 ≤ xT ≤ 2.5
                         = 1               for xT > 2.5
The interesting range of values of xT is from 0.5 to 1.0. Outside this range, one or both of the densities are zero. The ROC curve is shown in Figure 12.7.
FIGURE 12.6 Densities of the two hypotheses in Example 12.1. When xT = 0.8, the false-positive probability is 0.2, and the false-negative probability is 0.15.
FIGURE 12.7 The ROC curve for Example 12.1, from xT = 1.0 at the lower left to xT = 0.5 at the upper right.
In summary, hypothesis tests decide between two (or more) alternatives. Typically, the observation X is compared to a threshold xT. As xT is varied, the true-positive and the false-positive rates vary. A plot of these two rates is the ROC curve. The user selects a value of xT, and the system operates at that point on the ROC curve.
12.3 HYPOTHESIS TESTS AND LIKELIHOOD RATIOS

In the previous section, we introduced the basic ideas of hypothesis tests and solved the example problem somewhat informally: “Looking at Figure 12.1, we see the left side favors H0, while the right side favors H1.” In this section, we introduce likelihood ratio tests as a way of replacing our intuition with a formal procedure.
Under H0, the observation has density f0(x), and under H1, the observation has density f1(x). The likelihood ratio is the ratio of densities:

L(x) = f1(x)/f0(x)    (12.6)

For many densities, however, it is more convenient to compute the log-likelihood ratio:

l(x) = log L(x) = log(f1(x)/f0(x))    (12.7)

In the example in the previous section,

H0 : X ∼ N(0, σ²)
H1 : X ∼ N(s, σ²)

Therefore, the two densities are

H0 : f0(x) = [1/(σ√(2π))] e^(−x²/(2σ²))
H1 : f1(x) = [1/(σ√(2π))] e^(−(x−s)²/(2σ²))
The likelihood ratio is

L(x) = f1(x)/f0(x)
     = [1/(σ√(2π))] e^(−(x−s)²/(2σ²)) / ([1/(σ√(2π))] e^(−x²/(2σ²)))
     = e^(−(x² − 2sx + s²)/(2σ²)) · e^(x²/(2σ²))
     = e^((2sx − s²)/(2σ²))
The log-likelihood ratio is

l(x) = log L(x) = log(e^((2sx − s²)/(2σ²))) = (2sx − s²)/(2σ²)
The Neyman-Pearson likelihood ratio test¹ says choose Ĥ = 1 when L(x) ≥ L0 or, equivalently, when l(x) ≥ l0 for an appropriately chosen L0 (or, equivalently, l0 = log(L0)). Often, L0 is chosen so that the false-alarm probability has a desired value; that is, Pr[FP] = α, where α is a desired value.
In this example, it is convenient to use the log-likelihood ratio. Choose Ĥ = 1 if
l(x) = (2sx − s²)/(2σ²) ≥ l0

This can be simplified as follows (solving for x):

(2sx − s²)/(2σ²) ≥ l0
2sx − s² ≥ 2σ²l0
2sx ≥ 2σ²l0 + s²
x ≥ (2σ²l0 + s²)/(2s) = xT

In the last step, we replaced σ²l0/s + s/2 by the constant xT. Therefore, the test becomes choose Ĥ = 1 if x ≥ xT. This test is shown in Figure 12.8.
If α is specified, we can solve for xT as follows:

α = Pr[FP] = Pr[Ĥ = 1 | H = 0]
  = Pr[X ≥ xT | H = 0]    (Ĥ = 1 when x ≥ xT)
  = 1 − Φ(xT/σ)           (under H0, X ∼ N(0, σ²))
¹Jerzy Neyman and Egon Pearson published this test in 1933.
FIGURE 12.8 Simplified log-likelihood ratio for the Gaussian versus Gaussian test. The threshold xT is shown on the horizontal axis. When X < xT, the test decides Ĥ = 0; when X > xT, the test decides Ĥ = 1.
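The threshold calculation is a one-liner numerically. A minimal sketch (an illustration, not the book's code; it assumes SciPy, and the function name is hypothetical) picks xT for a desired α and reports the resulting detection probability:

from scipy.stats import norm

def np_threshold(alpha, s, sigma):
    # Choose x_T so that Pr[FP] = alpha, then report the resulting Pr[TP].
    x_t = sigma * norm.ppf(1.0 - alpha)
    p_tp = 1.0 - norm.cdf((x_t - s) / sigma)
    return x_t, p_tp

print(np_threshold(0.0668, s=1.0, sigma=1.0))   # x_T is about 1.5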
Therefore,

xT = σΦ⁻¹(1 − α)

In Figure 12.2, σ = 1 and xT = 1.5; therefore, α = 1 − Φ(1.5/1) = 1 − 0.9332 = 0.0668.
As another example, consider the problem of deciding whether an observation is Gaussian or Laplacian, both with mean 0 and scale parameter 1:

H0 : X is Gaussian
H1 : X is Laplacian

Therefore, the two densities are

H0 : f0(x) = [1/√(2π)] e^(−x²/2)
H1 : f1(x) = (1/2) e^(−|x|)

The likelihood and log-likelihood ratios are

L(x) = f1(x)/f0(x) = √(π/2) e^(x²/2 − |x|)
l(x) = (1/2)log(π/2) + x²/2 − |x|

The Neyman-Pearson likelihood ratio test becomes choose Ĥ = 1 when l(x) ≥ l0:

l(x) ≥ l0
FIGURE 12.9 Simplified log-likelihood ratio, x² − 2|x|, for the Laplacian versus Gaussian test.
l(x) = (1/2)log(π/2) + x²/2 − |x| ≥ l0
x² − 2|x| ≥ 2l0 − log(π/2) = l′

The simplified log-likelihood function is shown in Figure 12.9. Notice the peculiar shape of the curve. Compare this with Figure 9.8, which shows the Laplacian and Gaussian densities. The log-likelihood ratio is high in the tails (large |x|) because the Laplacian density has heavier tails than the Gaussian. The log-likelihood ratio has a smaller peak near 0, again because the Laplacian is higher near 0 than the Gaussian.
The test for this example is more complicated than the previous test for Gaussian versus Gaussian. If l′ is large, decide for Ĥ = 1 when x is large positive or large negative, as shown in Figure 12.10. When l′ is small, the test results in five regions on x, the first, third, and fifth favoring Ĥ = 1 and the second and fourth favoring Ĥ = 0. These are shown in Figure 12.11.
FIGURE 12.10 Log-likelihood test when l′ is large. There are three decision regions on x, the left and right favoring Ĥ = 1 and the middle favoring Ĥ = 0.
FIGURE 12.11 Log-likelihood test when l′ is small. There are five decision regions, with the first, third, and fifth favoring Ĥ = 1 and the second and fourth favoring Ĥ = 0.
The Neyman-Pearson likelihood ratio test is optimal in the sense that no other test with Pr[FP] = α has a higher Pr[TP]. The steps in the test are as follows:

1. Specify the hypotheses and their densities, f0(x) and f1(x). If the problem is discrete, specify the PMFs.
2. Compute the likelihood ratio, L(x) = f1(x)/f0(x).
3. Compare the likelihood ratio to a threshold, L(x) ≥ L0.
4. Algebraically simplify the relation. It is often the case that computing logs is helpful to simplify the relation.
5. Determine the threshold value necessary to have the right Pr[FP].

The Neyman-Pearson likelihood ratio test is widely used in engineering and statistical applications. It is optimal and straightforward, though the algebra sometimes is formidable.
12.4 MAP TESTS

In many hypothesis testing problems, the hypotheses can be considered random themselves. For instance, H0 has a priori probability Pr[H0] of being true, and H1 has a priori probability Pr[H1] of being true, with Pr[H0] + Pr[H1] = 1. A simple example is a binary communications problem: sending a single bit over a noisy channel. H0 corresponds to a 0 being sent and H1 a 1.
Let X denote the observation, and let Pr[X|H0] and Pr[X|H1] denote the probability of X under each hypothesis. The maximum a posteriori (MAP) test selects the hypothesis that has the maximum probability given the observation X. Choose Ĥ = 1 if

Pr[H1 | X] > Pr[H0 | X]    (12.8)

Otherwise, choose Ĥ = 0.
Using Bayes theorem, the test can be rewritten as

Pr[X|H1]Pr[H1]/Pr[X] > Pr[X|H0]Pr[H0]/Pr[X]

Note that Pr[X] divides both sides and can be cancelled out. The test can be rewritten in terms of the likelihood ratio. Choose Ĥ = 1 if
L(X) = Pr[X|H1]/Pr[X|H0] > Pr[H0]/Pr[H1]

Otherwise, choose Ĥ = 0.
Consider, for example, a simple optical communications problem. Under H1, X ∼ Poisson(λ1), and under H0, X ∼ Poisson(λ0). For convenience, we assume λ1 > λ0. The likelihood ratio for X = k is

L(X = k) = Pr[X = k|H1]/Pr[X = k|H0] = (λ1ᵏ e^(−λ1)/k!)/(λ0ᵏ e^(−λ0)/k!) = (λ1/λ0)ᵏ e^(λ0−λ1)
After solving for k, the MAP test becomes choose Ĥ = 1 if

k > kT = [log(Pr[H0]/Pr[H1]) + (λ1 − λ0)] / log(λ1/λ0)
Otherwise, choose Ĥ = 0.
In summary, the MAP test assumes the hypotheses are random with a priori probabilities Pr[Hi] for i = 0, 1. The test says select the hypothesis with the largest a posteriori probability Pr[Hi | X]. Bayes theorem allows the test to be written in terms of the conditional probabilities Pr[X | Hi] and the a priori probabilities.
The MAP test is widely employed in engineering and scientific applications. It minimizes the probability of error by maximizing the probability of correct decisions. It also generalizes easily to more than two hypotheses: for each value of X, choose the hypothesis with the largest a posteriori probability.
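The Poisson MAP threshold above is simple enough to compute directly. A minimal sketch (an illustration, not the book's code; the function name is hypothetical):

import math

def map_threshold(lam0, lam1, p0, p1):
    # k_T from the MAP test for Poisson(lam1) versus Poisson(lam0), lam1 > lam0;
    # p0 and p1 are the a priori probabilities Pr[H0] and Pr[H1].
    return (math.log(p0 / p1) + (lam1 - lam0)) / math.log(lam1 / lam0)

kT = map_threshold(1.0, 2.0, 0.5, 0.5)
print(kT)   # about 1.44; decide H = 1 when k > kT, i.e., when k >= 2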
SUMMARY
Hypothesis testing is about choosing between two or more hypotheses. In the usual problem with two hypotheses, one is the null hypothesis, H0, and the other is the alternative hypothesis, H1. Hypothesis testing has an amazing amount of terminology, including:

• A true positive (TP) is when we decide Ĥ = 1 when H1 is true. The true-positive rate is the probability of a true positive, Pr[TP]. It is also known as the sensitivity of the test.
• A false positive (FP), also known as a false alarm or as a type I error, is when we decide Ĥ = 1 when H0 is true (i.e., when we decide there is a target when there is none). The false-positive rate is the probability of a false positive, Pr[FP], and is often denoted α.
• A true negative (TN) is when we decide Ĥ = 0 when H0 is true. The true-negative rate, Pr[TN], is known as the specificity of the test.
• A false negative (FN), also known as a miss or a type II error, is when we decide Ĥ = 0 when H1 is true (i.e., when we decide there is no target when there is one). The false-negative rate is the probability of a false negative, Pr[FN], and is often denoted β.
The probabilities of a true positive, a false positive, a true negative, and a false negative are, respectively,

Pr[TP] = Pr[Ĥ = 1 | H1] = 1 − β
Pr[FP] = Pr[Ĥ = 1 | H0] = α
Pr[TN] = Pr[Ĥ = 0 | H0] = 1 − α
Pr[FN] = Pr[Ĥ = 0 | H1] = β
The user picks a threshold value that results in acceptable false-positive and true-positive probabilities (or, equivalently, false alarm and miss probabilities). A graphical technique to help characterize the decision is the receiver operating characteristic (ROC) graph. An ROC graph is a plot of the true-positive probability versus the false-positive probability as the threshold is changed.
Under H0, the observation has density f0(x), and under H1, the observation has density f1(x). The likelihood ratio is the ratio of densities:

L(x) = f1(x)/f0(x)    (12.9)

For many densities, it is more convenient to compute the log-likelihood ratio:

l(x) = log L(x) = log(f1(x)/f0(x))    (12.10)
The Neyman-Pearson likelihood ratio test says choose Ĥ = 1 when L(x) ≥ L0 or, equivalently, when l(x) ≥ l0 for an appropriately chosen L0 (or, equivalently, l0 = log(L0)). Often, L0 is chosen so that the false alarm probability has a desired value; that is, Pr[FP] = α, where α is a desired value.
The maximum a posteriori (MAP) test selects the hypothesis that has the maximum probability given the observation X. Choose Ĥ = 1 if

Pr[H1 | X] > Pr[H0 | X]

Otherwise, choose Ĥ = 0.
PROBLEMS 12.1 Use Equation (12.5) and its development to answer the following questions:
a. If s/σ = 1, what value of xT results in Pr TP = 0.95? What is the value of Pr FP for that value of xT ? b. What value of s/σ is required to have Pr TP = 0.99 for Pr FP = 0.5?
338 CHAPTER 12 HYPOTHESIS TESTING
12.2 One measure of the overall success of a test is the area under the ROC curve (AUC), defined as AUC =
1 0
Pr TP (v) dv
where Pr TP is a function of v = Pr FP . A poor test has an AUC ≈ 0.5, while a good test has an AUC ≈ 1. a. Use a numerical package to compute the AUC for s/σ equal to 1, 2, and 3 in Equation (12.5). b. For each curve in Figure 12.5, calculate the AUC. 12.3 For the Gaussian hypothesis problem (Equations 12.3 and 12.4), calculate the false-alarm rate, Pr false alarm , as a function of s/σ when the miss rate is 0.5 (i.e., Pr miss = 0.5). 12.4 Under H1 , N is Poisson with parameter λ = 2.0; under H0 , N is Poisson with parameter λ = 1.0. a. What is the log-likelihood ratio? ˆ = 1 if N > 1.5, and choose H ˆ = 0 if b. If a threshold of 1.5 is chosen—that H is, choose N < 1.5—calculate Pr FP , Pr TP , Pr FN , and Pr TN . 12.5 Consider a simple discrete hypothesis test. Under H0 , N is uniform on the integers from 0 to 5; under H1 , N is binomial with parameters n = 5 and p. a. What is the likelihood ratio of this test? b. Calculate and plot the ROC curve assuming p = 0.5. c. Assuming the two hypotheses are equally likely, what is the MAP test? 12.6 Consider a simple hypothesis test. Under H0 , X is exponential with parameter λ = 2; under H1 , X ∼ U (0,1). a. What is the likelihood ratio of this test? b. Calculate and plot the ROC curve. c. Assuming the two hypotheses are equally likely, what is the MAP detector? 12.7 Consider a simple hypothesis test. Under H0 , X is exponential with parameter λ = 1; under H1 , X is exponential with parameter λ = 2. a. What is the likelihood ratio of this test? b. Calculate and plot the ROC curve. c. Assuming the two hypotheses are equally likely, what is the MAP detector? 12.8 Consider a simple hypothesis test. Under H0 , X ∼ N (0, σ20 ) ; under H1 , X ∼ N (0, σ21 ) with σ21 > σ20 . a. b. c. d. e.
What is the likelihood ratio of this test? What is the Neyman-Pearson likelihood ratio test? Calculate Pr FP and Pr TP as functions of the test. Plot the ROC curve assuming σ21 = 2σ20 . Assuming the two hypotheses are equally likely, what is the MAP detector?
Problems 339
12.9 In any hypothesis test, it is possible to achieve Pr TP = Pr FP = p, where 0 ≤ p ≤ 1, by inserting some randomness into the decision procedure. a. How is this possible? b. Draw the ROC curve for such a procedure. (Hint: the AUC is 0.5 for such a test; see Problem 12.2.) 12.10 For the hypothesis testing example shown in Figure 12.1 with xT = 1.5, calculate the following probabilities: a. b. c. d.
Pr FP Pr TN Pr FN Pr TP
12.11 Repeat the radar example, but with Laplacian rather than Gaussian noise. That is, assume H0 : X = s + N with s > 0 and H1 : X = N with N Laplacian with parameter λ. a. Compute and plot the log-likelihood ratio. b. Compute and plot the ROC curve. Note that since the log-likelihood ratio has three regimes, the ROC curve will have three regimes.
12.12 Consider a binary communications problem Y = X + N, where Pr X = 1 = Pr X = −1 = 0.5, N ∼ N (0, σ2 ), and X and N are independent. a. What is the MAP detector for this problem? b. What is the error rate for this detector? c. What happens to the MAP detector if Pr X = 1 = p > 0.5 and Pr X = −1 = 1 − p < 0.5? That is, how does the optimal decision rule change? d. What happens to the probability of error when p > 0.5? What is the limit as p → 1?
CHAPTER
13
RANDOM SIGNALS AND NOISE
We talk, we listen, we see. All the interesting signals we experience are random. In truth, nonrandom signals are uninteresting; they are completely predictable. Random signals are unknown and unpredictable and therefore much more interesting. If Alice wants to communicate information to Bob, then that information must be unknown to Bob (before the communication takes place). In other words, the signal is random to Bob. This chapter introduces random signals and presents their probabilistic and spectral properties. We pay particular attention to a class of random signals called wide sense stationary (WSS), as these appear in many engineering applications. We also discuss noise and linear filters.
13.1 INTRODUCTION TO RANDOM SIGNALS A random signal is a random function of time. Engineering has many examples of random signals. For instance, a spoken fricative (e.g., the “th” or “f ” sound) is made by passing turbulent air through a narrow opening such as between the teeth and lips. Fricatives are noise-like. Other examples are the random voltages in an electric circuit, temperatures in the atmosphere, photon counts in a pixel sensor, and radio signals. In this section, we give a quick introduction to random signals and the more general class, random processes. Let X (t ) be a random process. Fix t as, say, t0 . Then, X (t0 ) is a random variable. Fix another time, t1 . Then, X (t1 ) is another random variable. In general, let t1 , t2 , . . . ,tn be times. Then, X (t1 ), X (t2 ), . . . ,X (tn ) are random variables. Let f (x;t ) denote the density of X (t ). The joint density of X (t1 ) and X (t2 ) is f (x1 ,x2 ;t1 ,t2 ). In general, the nth-order density is f (x1 ,x2 , . . . ,xn ;t1 ,t2 , . . . ,tn ). The mean of X (t ) is μ(t ): μ(t ) = E X (t ) =
340
∞ −∞
xf (x;t ) dx
13.2 A Simple Random Process 341
In the integral, the density is a function of time. Therefore, the mean is a function of time. Similarly, the second moment and variance are functions of time: ∞ E X (t ) = E X (t )X (t ) = x2 f (x;t ) dx −∞ σ2 (t ) = Var X (t ) = E X 2 (t ) − μ(t )2
2
The autocorrelation Rxx (t1 ,t2 ) and autocovariance Cxx (t1 ,t2 ) of a random process are
Rxx (t1 ,t2 ) = E X (t1 )X (t2 )
Cxx (t1 ,t2 ) = E X (t1 ) − μ(t1 ) X (t2 ) − μ(t2 ) = E X (t1 )X (t2 ) − μ(t1 )μ(t2 ) = Rxx (t1 ,t2 ) − μ(t1 )μ(t2 )
(13.1)
The autocorrelation and autocovariance are measures of how correlated the random process is at two different times. The autocorrelation and covariance functions are symmetric in their two arguments. The proof follows because multiplication commutes:
Rxx (t1 ,t2 ) = E X (t1 )X (t2 )
= E X (t2 )X (t1 ) = Rxx (t2 ,t1 )
Cxx (t1 ,t2 ) = Rxx (t1 ,t2 ) − E X (t1 ) E X (t2 ) = Rxx (t2 ,t1 ) − E X (t2 ) E X (t1 ) = Cxx (t2 ,t1 )
(13.2)
(13.3)
When it is clear, what random process is being referred to, we drop the subscripts on Rxx and Cxx : R(t1 ,t2 ) = Rxx (t1 ,t2 ) C(t1 ,t2 ) = Cxx (t1 ,t2 ) The cross-correlation between two random processes is
Rsx (t1 ,t2 ) = E S(t1 )X (t2 )
13.2 A SIMPLE RANDOM PROCESS To illustrate the calculations, we present a simple random process and compute its probabilities and moments. Let X (t ) = X 0 for −∞ < t < ∞, where X 0 is a random variable with some density function f0 (x), mean μ0 , and variance σ20 . Since the random process does not vary with time, the density f (x;t ) is simply the density of X 0 ; that is, f (x;t ) = f0 (x):
E X (t ) =
∞ −∞
xf (x;t ) dt =
∞ −∞
xf0 (x) dx = μ
342 CHAPTER 13 RANDOM SIGNALS AND NOISE
E X 2 (t ) =
∞ −∞
x2 f (x;t ) dx =
∞ −∞
x2 f0 (x) dx = E X 20
Var X (t ) = E X 2 (t ) − μ2 (t ) = E X 20 − μ20 = σ20
R(t1 ,t2 ) = E X (t1 )X (t2 ) = E X 0 X 0 = E X 20 = σ20 + μ20 C(t1 ,t2 ) = R(t1 ,t2 ) − μ(t1 )μ(t2 ) = σ20 + μ20 − μ20 = σ20 As expected, the mean, variance, autocorrelation, and covariance are all constants. The random process is constant, so therefore the moments are as well. Comment 13.1: The joint density f (x1 ,x2 ;t1 ,t2 ) is tricky since X2 (t2 ) = X1 (t1 ) for all values of t1 and t2 . Since x1 = x2 , f (x1 ,x2 ;t1 ,t2 ) = f0 (x1 )δ(x1 − x2 ) Note that we did not compute R(t1 ,t2 ) by integrating against this density function; we used simple properties of expected values. Here is how we would integrate against this density function: R(t1 ,t2 ) = = =
∞ ∞ −∞ −∞ ∞ ∞
−∞ ∞ −∞
−∞
x1 x2 f (x1 ,x2 ;t1 ,t2 ) dx1 dx2 x1 x2 f0 (x1 )δ(x1 − x2 ) dx2 dx1
x12 f0 (x1 ) dx1
= E X20
Even though we can integrate and get the answer does not mean we should do it this way. Avoid doing the integral. If possible, use properties of expected values. It is faster and easier.
13.3 FOURIER TRANSFORMS Before moving on to discuss the random signals used in signal processing and communications, we take a brief detour and discuss Fourier transforms. Fourier transforms relate a time domain signal to an equivalent frequency domain signal. Time can be continuous or discrete. First, we discuss continuous time Fourier transforms, then move on to discrete time transforms. Let x(t ) be a continuous time signal, −∞ < t < ∞. The Fourier transform of x(t ) is X (ω):
X (ω) = F x(t ) =
∞ −∞
x(t )e−jωt dt
where ω is a frequency variable measured in radians per second, −∞ < ω < ∞.
13.3 Fourier Transforms 343
The inverse Fourier transform is the inverse transform, which takes a frequency function and computes a time function: x(t ) = F
−1
1 X (ω) = 2π
∞ −∞
X (ω)ejωt dω
In words, the inverse Fourier transform says that a time signal x(t ) can be written as a sum (integral) of complex exponentials ejωt . This is handy in many signal processing contexts. The response of a linear system to a signal can be decomposed into the sum of responses of the system to complex exponentials. The Fourier transform is a complex function even if x(t ) is real. We can write X (ω) as X (ω) = XR (ω) + jXI (ω) where XR (ω) is the real part of X (ω) and XI (ω) is the imaginary part of X (ω). Assuming x(t ) is real, we can use Euler’s formula, ejω = cos(ω) + jsin(ω), to solve for XR and XI : XR (ω) = XI (ω) =
∞ −∞ ∞ −∞
x(t ) cos(ωt ) dt x(t ) sin(ωt ) dt
The Fourier transform can also be represented in magnitude and phase form. The magnitude of X (ω) is |X (ω)|, where |X (ω)| =
)
XR2 (ω) + XI2 (ω)
The phase of X (ω) is X (ω), where
X (ω) = tan−1 XI (ω)/XR (ω)
The phase is in radians and is bounded by −π ≤ X (ω) ≤ π. As an example, let x(t ) = e−a|t| for a > 0. Then, X (ω) = = = =
∞ −∞ ∞ −∞
0
−∞
0
−∞
x(t )e−jωt dt e−a|t| e−jωt dt eat e−jωt dt + e(a−jω)t dt +
∞
e−at e−jωt dt
0
∞
e−(a+jω)t dt
0
1 1 + a − jω a + jω 2a = 2 a + ω2 =
Since x(t ) is real and symmetric, X (ω) is also real and symmetric. In other words, XR (ω) = 2a/(a2 + ω2 ) = XR (−ω) and XI (ω) = 0.
344 CHAPTER 13 RANDOM SIGNALS AND NOISE
The Fourier transform converts convolution to multiplication. If y(t ) = x(t ) ∗ h(t ), then Y (ω) = X (ω)H (ω):
Y (ω) = F y(t )
= F x(t ) ∗ h(t ) ∞ ∞ x(t − s)h(s) ds e−jωt dt = −∞ −∞ ∞ ∞ −jω(t −s) = x(t − s)e dt h(s)e−jωs ds −∞ −∞ ∞ = X (ω)h(s)e−jωs ds −∞ ∞ = X (ω) h(s)e−jωs ds −∞
= X (ω)H (ω)
Comment 13.2: This development is often loosely described as “the Fourier transform turns convolution into multiplication.” We used the similar property of the Laplace transform in Sections 5.5 and 8.6 to show the MGF of a sum of independent random variables is the product of the individual MGFs.
Table 13.1 lists various properties of the Fourier transform, concentrating on complex exponentials and delta functions. For example, let us calculate the Fourier transform of x(t ) cos(ωc t ). We use Euler’s formula for cosine: e j ω c t + e− j ω c t 2 1 F x(t ) cos(ωc t ) = F x(t )(ejωc t + e−jωc t ) 2 1 = F x(t )ejωc t + F x(t )e−jωc t 2 X (ω − ωc ) + X (ω + ωc ) = 2 cos(ωc t ) =
TABLE 13.1 Continuous time Fourier transform properties. Comment Definition Linearity Modulation Time shift Delta function Shifted delta Constant Complex delta
Property
F x(t ) = X (ω) F αx(t ) + βy(t ) = αX (ω) + βY (ω) F x(t )ejωc t = X (ω − ωc ) F x(t − t0 ) = X (ω)e−jωt0 F δ(t ) = 1 F δ(t − t0 ) = ejωt0 F 1 = 2πδ(ω) F ejωc t = 2πδ(ω − ωc )
(13.4)
13.3 Fourier Transforms 345
Thus, modulating x(t ) by cos(ωc t ) results in two terms. The first shifts X (ω) to X (ω − ωc )/2 and the second to X (ω + ωc )/2. This process is shown schematically below: X (ω)
cos(ωc t )
X (ω − ωc )/2
×
x(t ) 0
X (ω + ωc )/2
−ωc
ω
0
ωc
ω
We use this result in Section 13.7 in analyzing amplitude modulation. Let x(n), −∞ < n < ∞, be a discrete time signal. The discrete time Fourier transform (DTFT) is ∞ F x(n) = X (λ) = x(n)e−jλn
n=−∞
where λ is a continuous frequency variable. X (λ) is a periodic function of λ with period 2π. We adopt the convention that −π ≤ λ ≤ π. The inverse DTFT transform is the following integral: 1 F −1 X (λ) = 2π
π −π
X (λ)ejλn dλ = x(n) for −∞ < n < ∞
(Because λ is continuous, the inverse transform uses an integral, not a sum.) This integral computes one sample at a time. To compute x(n + 1), replace n with n + 1 in the integral, etc. The DTFT has many of the same properties as the continuous time Fourier transform, some properties of which are listed in Table 13.2.
TABLE 13.2 Discrete time Fourier transform properties. Comment Definition Linearity Modulation Time shift Delta function Shifted delta Constant Complex exponential
Property
F x(n) = X (λ) F αx(n) + βy(n) = αX (λ) + βY (λ) F x(n)ejλc n = X (λ − λc ) F x(n − n0 ) = X (λ)e−jλn0 F δ(n) = 1 F δ(n − n0 ) = ejωn0 F 1 = 2πδ(λ) F ejλc n = 2πδ(λ − λc )
346 CHAPTER 13 RANDOM SIGNALS AND NOISE
Comment 13.3: There are two Fourier transforms in discrete time. We are interested in the DTFT. The other transform is the discrete Fourier transform (DFT). The DFT is a finite sample version of the DTFT with a discretized frequency variable. In many signal processing problems, the DFT approximates the DTFT. The fast Fourier transform computes the same numbers as the DFT, but faster.
Comment 13.4: We are using the convention of employing variables like t, s, and τ for continuous time and ω for the continuous time frequency variable. We represent discrete time with n, m, η, or similar. The discrete time frequency variable is λ. In this chapter, we are mostly interested in continuous time signals, though discrete time signals are discussed.
13.4 WSS RANDOM PROCESSES Most signal processing and communications processes use a particular class of random processes called wide sense stationary (WSS). A random process is WSS if it satisfies two conditions:
1. The mean is constant in time; that is, E X (t ) = μ = constant. 2. The autocorrelation depends only on the difference in the two times, not on both times separately; that is, R(t1 ,t2 ) = Rxx (t2 − t1 ). Comment 13.5: Discrete time processes can also be WSS. The two conditions above are interpreted in discrete time. The mean is constant, and the autocorrelation depends only on the time difference, n2 − n1 . For the most part in this chapter, we consider continuous time processes, but the basic properties apply to discrete time processes as well.
The autocovariance process of a WSS process is also a function of the time difference, t2 − t1 : Cxx (t1 ,t2 ) = Rxx (t1 ,t2 ) − μ(t1 )μ(t2 ) = Rxx (t2 − t1 ) − μ2 = Cxx (t2 − t1 )
(using Equation 13.1) (definition of WSS process)
The variance of a WSS process is constant: σ2 (t ) = E X (t )2 − μ(t )2 = Rxx (t,t ) − μ2 = Rxx (0) − μ2 = σ2
13.4 WSS Random Processes 347
In summary, for a WSS process, the mean, second moment, and variance are constant in time, and the autocorrelation and autocovariance functions are functions of the time difference, t2 − t1 . Frequently, the notation is changed a bit, with t1 and t2 replaced with t and t + τ, respectively: R(t,t + τ) = R(τ) C(t,t + τ) = C(τ) If the mean is 0, then R(τ) = C(τ). The autocorrelation and autocovariance functions of WSS random processes satisfy several properties: and 1. R(τ) = R(−τ) C(τ) = C(−τ). These follow from Equations (13.2) and (13.3). 2. R(0) = E X 2 (t ) = σ2 + μ2 ≥ 0, C(0) = σ2 ≥ 0. This shows that both E X 2 (t ) and σ2 (t ) = σ2 are constant in time. 3. R(0) ≥ |R(τ)| for all τ. 4. If E X (t ) = 0 and Y (t ) = μ + X (t ), then Ryy (τ) = μ2 + Rxx (τ). Next, we consider two important transformations on WSS random processes: a linear (affine) transformation, Y (t ) = aX (t ) + b, and the sum of two independent WSS random processes, Z(t ) = X (t ) + Y (t ). In both cases, the resulting random process is WSS. First, let X (t ) be a WSS random process with mean μx and autocorrelation Rxx (τ) and let us determine the mean and autocorrelation of Y (t ) = aX (t ) + b and determine whether Y (t ) is WSS:
E Y (t ) = E aX (t ) + b = aE X (t ) + b = aμx + b
Ryy (t,t + τ) = E Y (t )Y (t + τ)
= E (aX (t ) + b)(aX (t + τ) + b) = a2 E X (t )X (t + τ) + ab E X (t ) + E X (t + τ) + b2 = a2 Rxx (t,t + τ) + ab μx (t ) + μx (t + τ) + b2 = a2 Rxx (τ) + 2abμx + b2
Since the mean of Y is constant and the autocorrelation of Y depends only on τ and not on t, then Y is WSS. Thus, a linear (affine) transformation of a WSS random process yields a WSS random process. Since linear transformations occur in many applications, this is an important property and one reason why WSS random processes are so important. Second, let X (t ) and Y (t ) be independent WSS random processes and let us determine the mean and autocorrelation of Z(t ) = X (t ) + Y (t ) and whether Z(t ) is WSS:
E Z(t ) = E X (t ) + Y (t ) = E X (t ) + E Y (t ) = μx + μy
Rzz (t,t + τ) = E Z(t )Z(t + τ)
= E X (t ) + Y (t ) X (t + τ) + Y (t + τ)
348 CHAPTER 13 RANDOM SIGNALS AND NOISE
= E X (t )X (t + τ) + E X (t )Y (t + τ) + E Y (t )X (t + τ) + E Y (t )Y (t + τ) = Rxx (τ) + μx μy + μy μx + Ryy (τ) = Rxx (τ) + 2μx μy + Ryy (τ)
Since the mean is constant and the autocorrelation depends only on τ and not also on t, Z(t ) is WSS. In general, the sum of independent WSS random processes is WSS. The average power of a random process (not necessarily WSS) is the mean of X (t )2 :
Average power = E X (t )2
If the random process is WSS, we can simplify this expression as follows:
Average power = E X (t )X (t ) = Rxx (0) = σ2 + μ2 Note that the average power of a WSS random process is constant in time. More generally, the average power in a WSS random process is described by the power spectral density (PSD). The PSD Sxx (ω) of a WSS random process is the Fourier transform of the autocorrelation function. The PSD measures the random process’s power per unit frequency:
S(ω) = F R(τ) =
∞ −∞
R(τ)e−ωτ dτ
Since the Fourier transform is invertible, the autocorrelation function is the inverse transform of the PSD:
R(τ) = F −1 S(ω) =
1 2π
∞ −∞
S(ω)eωτ dω
Since R(τ) is real and symmetric, S(ω) is also real and symmetric. That is, S(−ω) = S(ω) and S∗ (ω) = S(ω): R(τ) = R∗ (τ) ⇐⇒ S(ω) = S(−ω) R(τ) = R(−τ) ⇐⇒ S(ω) = S∗ (ω) We show later that the PSD is nonnegative for all ω. The PSD measures the average power in the signal per angular frequency (that is why the PSD is nonnegative). The average power E X (t )2 is the integral of the PSD:
E X (t )2 = R(0) =
1 2π
∞ −∞
S(ω)ejω0 dω =
1 2π
∞ −∞
S(ω) dω
13.4 WSS Random Processes 349
Comment 13.6: We have three ways to compute the average power of a WSS random process: 1. Average power = σ2 + μ2 . 2. Average power = R(0&). ∞ 3. Average power = 21π −∞ S(ω) dω. Use whichever method is easiest.
Comment 13.7: In many applications, it is more convenient to replace ω (radians per second) with f = ω/(2π) (cycles per second or Hertz):
S(f ) = S(ω)f = ω
2π
S(f ) has the interpretation of average power per Hertz.
Comment 13.8: Even though S(ω) and R(τ) are Fourier transform pairs, they are not equivalent. In particular, R(τ) can be negative, but S(ω) cannot. All of the following are perfectly valid autocorrelation functions:
• R(τ) = cos(ωc τ), where ωc is a constant. sin(ωc τ) • R(τ) = . ωc τ
• R ( n) =
1
n=0
−0.5
n=1
for n a discrete time index.
WSS random processes are specified by their time domain (or, equivalently, frequency domain) properties and their statistical (probabilistic) properties. For example, a random process might have a certain autocorrelation function and a uniform density. Another random process might have the same autocorrelation function but a Laplacian density. In many applications, the signals are WSS with a Gaussian density. For these signals, let t1 , t2 , . . . ,tk be an arbitrary sequence of times. Then, X (t1 ), X (t2 ), . . . ,X (tk ) are a sequence of Gaussian random variables. The joint density depends on the mean and autocorrelation function. In summary, a signal is WSS if the mean is constant and the autocorrelation function depends only on the time difference, not on the two times separately. The PSD is the Fourier transform of the autocorrelation function. The PSD S(f ) measures the average power of the signal per Hertz. Specifying a WSS signal requires specifying its time properties (the autocorrelation function or the PSD) and its probability properties. Many common signals are Gaussian.
350 CHAPTER 13 RANDOM SIGNALS AND NOISE
13.5 WSS SIGNALS AND LINEAR FILTERS Let X (t ) be a continuous time WSS random process that is input to a linear filter, h(t ), and let the output be Y (t ) = X (t ) ∗ h(t ). Thus, Y (t ) = X (t ) ∗ h(t ) =
∞ −∞
X (t − s)h(s) ds
Is Y (t ) WSS? Recall that we need to show that the mean is constant and the autocorrelation depends only on the time difference, τ:
E Y (t ) = E X (t ) ∗ h(t )
∞ =E X (t − s)h(s) ds ∞ −∞ = E X (t − s) h(s) ds −∞ ∞ = μx h(s) ds −∞ ∞ = μx h(s) ds −∞
Thus, E Y (t ) does not depend on t. The autocorrelation of Y (t ) can be computed as follows: Ryy (t,t + τ) = E Y (t )Y (t + τ) ∞ ∞ =E X (t − s)h(s) ds X (t + τ − v)h(v) dv −∞ ∞ ∞ −∞ = E X (t − s)X (t + τ − v) h(s)h(v) dsdv −∞ −∞ ∞ ∞ = Rxx (τ + s − v)h(s)h(v) dsdv −∞ −∞
(13.5)
This expression is sufficient to show that the autocorrelation function depends only on
τ and not also on t. Since the mean is also constant, it follows that Y (t ) is WSS. In other
words, linear filters operating on WSS processes produce WSS processes. This last expression is a two-dimensional convolution of the autocorrelation function against h(·). It turns out that with some work, the expression for the PSD simplifies greatly:
Syy (ω) = F Ryy (τ) = = =
∞
Ryy (τ)e−jωτ dτ
−∞ ∞ ∞ −∞
∞
−∞ −∞
∞ ∞
−∞ −∞
Rxx (τ + s − v)h(s)h(v) dsdv e−jωτ dτ
h(s)h(v)
∞ −∞
Rxx (τ + s − v)e
−jωτ
dτ dsdv
13.5 WSS Signals and Linear Filters 351
Consider the term in parentheses. Let u = τ + s − v and τ = u − s + v. In this integral, s and v are constants. Therefore, du = dτ: ∞ −∞
Rxx (τ + s − v)e
−jωτ
dτ = e
−jω(s−v)
∞ −∞
Rxx (u)e−jωu du
= e−jω(s−v) Sxx (ω)
Now insert this result above: Syy (ω) = Sxx (ω) = Sxx (ω)
∞ ∞ −∞ −∞ ∞ −∞
h(s)h(v)e−jω(s−v) dsdv
h(s)e
−jωs
∞
ds
= Sxx (ω)H (ω)H ∗ (ω)
−∞
h(v)ejωv dv
= |H (ω)|2 Sxx (ω)
(13.6)
The PSD of Y (t ) is the PSD of X (t ) multiplied by the filter magnitude squared, |H (ω)|2 . EXAMPLE 13.1
Consider a discrete time WSS random process X (n) with mean 0 and PSD Sxx (λ), where −π ≤ λ ≤ π. Let Y (n) = X (n) + αX (n − 1). What is the PSD of Y? Note that Y (n) is the convolution of X (n) with h(k) = [1, α]. The Fourier tranform of h is H (λ) =
∞
h(k)e−jλk
k=−∞
= 1 · e− j λ 0 + αe− j λ 1 = 1 + αe− j λ |H (λ)2 | = H (λ)H ∗ (λ) = (1 + αe−jλ )(1 + αejλ ) = 1 + α(e−jλ + e−jλ ) + α2 = 1 + 2α cos(λ) + α2
Thus, the PSD of Y (n) is
Syy (λ) = 1 + 2α cos(λ) + α2 Sxx (λ) If X (t ) is Gaussian, then Y (t ) = h(t ) ∗ X (t ) is a linear combination of the values of X (t ). In other words, Y (t ) is a linear combination of Gaussian random variables. Therefore, Y (t ) is Gaussian. Gaussian signals input into linear filters yield Gaussian signals. To summarize, when a Gaussian WSS signal is input to a linear filter, the output is also a Gaussian WSS signal. The output signal’s PSD is Syy (ω) = |H (ω)|2 Sxx (ω)
352 CHAPTER 13 RANDOM SIGNALS AND NOISE
The output signal’s power can be computed by integrating Syy (ω) from ω = −∞ to ω = ∞: Pyy =
1 2π
∞
−∞
Syy (ω) dω =
1 2π
∞
−∞
|H (ω)|2 Sxx (ω) dω
13.6 NOISE Noise is all around us. In electrical circuits, noise arises from three main sources: thermal oscillations of electrons, variations in photon and electron counts, and various unwanted interferers and background sources. This section is organized into two parts. The first considers the probabilistic properties of noise, and the second considers the spectral properties of noise.
13.6.1 Probabilistic Properties of Noise The three most important noises in typical electrical systems are quantization noise, Poisson (shot) noise, and Gaussian thermal noise. In this section, we explore the probabilistic properties of these three noises. Quantization noise is presented in Section 7.6. Quantization refers to the process of converting a continuous valued quantity to a discrete valued quantity. Quantization noise is usually modeled as uniform between −Δ/2 and Δ/2, where Δ is the quantizer step size. Letting Q denote the quantization noise,
E Q =0
Var Q =
Δ2
12
To reduce quantization noise, it is usually necessary to increase the number of quantization levels and thereby reduce Δ. Poisson (shot) noise is presented in Sections 4.5.3 and 14.2. It arises in counting processes due to the random nature of the Poisson process. Sometimes counts are high, sometimes low. Letting N = λ + N − λ = λ + S refer to the count and S = N − λ refer to the shot noise,
E N =λ E S =0
Var N = λ Var S = λ
As shown in Example 4.7, reducing the relative effect of Poisson noise requires increasing λ. Gaussian thermal noise arises from the random fluctuations in the movement of electrical charges due to thermal fluctuations. Typically, the electrical system observes the macro state of huge number of charged particles (e.g., electrical current or radio waves from vibrating atmospheric charges). By the CLT, the noise is Gaussian. Letting X represent the thermal noise,
E X =0
Var X = cT
where T is the temperature and c is a constant depending on the physics of the situation. Higher temperatures result in more noise than lower temperatures due to the more aggressive vibrations of the charged particles.
13.6 Noise 353
The performance of electrical systems is usually measured in signal-to-noise ratio (SNR), defined as the signal power divided by the noise power:
SNR = 10log10
signal power noise power
Typically, SNR is measured in decibels. The power in an electrical system is proportional to the voltage squared. When the noise has zero mean, the expected squared value is the variance. Thus, the noise power is proportional to the noise variance. When the noises are uncorrelated, the variances add: N = N1 + N2 + ···Nk Var N = Var N 1 + Var N 2 + · · · + Var N k
for uncorrelated noises
For example, a signal may be corrupted by thermal noise, and quantization noise. Under reasonable conditions, the two noises are independent, and the noise powers add. In summary, noise in electrical systems can have a variety of probability distributions. Quantization noise is uniform, shot noise is Poisson, and thermal noise is Gaussian. Systems are characterized by SNRs. Noise powers for uncorrelated sources add. EXAMPLE 13.2
Electrical currents are the sum of a vast number of charged particles. Resistors impede this flow and introduce some randomness to the current flow. The randomness introduces a noise voltage with a Gaussian distribution and variance proportional to the product of the resistance value and the temperature. We say more about resistor noise in Section 13.6.2.
13.6.2 Spectral Properties of Noise In the previous section, we discussed the probabilistic properties of noise, often uniform, Poisson, or Gaussian. In this section, we discuss the spectral properties of noise. We study continuous time noise signals and then examine discrete time noise signals. Recall that the PSD of a WSS signal is the Fourier transform of the autocorrelation function:
Sxx (ω) = F Rxx (τ) =
∞
−∞
Rxx (τ)e−jωτ
The most important example of noise is white noise, which has a PSD that is constant for all frequencies, Sxx (ω) = N0 for −∞ < ω < ∞. The autocorrelation function, therefore, is a delta function:
Rxx (τ) = F −1 Sxx (ω) = N0 δ(τ) The simple form of the PSD and the autocorrelation function mean that manipulating white noise is relatively easy. For example, let N (t ) denote Gaussian white noise, with PSD Sxx (ω) = N0 input to a filter with impulse response h(t ):
2
2
Syy (ω) = Sxx (ω)H (ω) = N0 H (ω)
354 CHAPTER 13 RANDOM SIGNALS AND NOISE
Curiously, white noise has infinite power: Average power =
∞ −∞
S(ω) dω = ∞
Since white noise has infinite power, it cannot exist as a physical process. However, white noise is a convenient approximation in many systems. As long as the bandwidth of the noise is wider than the bandwidth of the system, the noise can be approximated as white noise. Noise with a PSD other than white noise can be modeled as white noise passed through a filter. For example, pink noise has a power spectrum that is proportional to ω−1 . It can be generated by passing white noise through a filter with a magnitude response proportional to ω−0.5 . White noise is fundamentally simpler in discrete time. The PSD is constant over −π ≤ ω ≤ π. The autocorrelation function is proportional to a Kronecker delta function, δ(n) = 1 for n = 0 and δ(n) = 0 for n = 0. Discrete time white noise can exist and has finite power. In discrete time, the power in a white noise signal is computed as follows: Px =
1 2π
π
−π
Sxx (ω) dω = σ2
Comment 13.9: White noise in discrete time is fairly easy to understand: samples at different times are uncorrelated with each other. The autocorrelation function is therefore a delta function, and the PSD is a constant. White noise in continuous time, however, is more subtle: samples at different times, no matter how closely together in time they are, are uncorrelated. If t and s are two different times, even if extremely close in time, then Rxx (t,s) = σ2 δ(t − s) = 0. Advanced treatments discuss concepts such as the continuity and differentiability of continuous time white noise, but we will skip these.
To summarize, many common noises have a constant PSD. These are known as white noise. Samples of white noise, no matter how closely together in time, are uncorrelated. Other noise spectra can be modeled as white noise passed through an appropriate filter.
13.7 EXAMPLE: AMPLITUDE MODULATION An important example of a random signal application is amplitude modulation (AM). In this section, we discuss amplitude modulation with special emphasis on its spectral properties. First, consider a random sinusoid, X (t ) = cos(ωc t + θ), where ωc = 2πfc , fc is the carrier frequency (e.g., the frequency to which you tune your radio) and θ is a random variable. We assume θ is uniform on (0,2π). In practice, θ represents a time synchronization uncertainty between sender and receiver. Typical carrier frequencies are fc = 106 Hz and higher (often much higher). For the sender and receiver to agree on phase, clock synchronization uncertainties must be much less than 1/fc = 10−6 seconds. Such synchronizations are difficult to achieve with inexpensive hardware.
13.7 Example: Amplitude Modulation 355
To determine whether X (t ) is WSS, we need to verify two things: that the mean is constant and that the autocorrelation function is a function only of the time difference. The mean is the integral of the function against the density of θ:
E X (t ) = E cos(ωc t + θ) =
1 2π
2π 0
cos(ωc t + θ) dθ = 0
Thus, the mean is constant in time. The autocorrelation function is the following:
Rxx (t,t + τ) = E X (t )X (t + τ) = E cos(ωc t + θ) cos(ωc (t + τ) + θ) To evaluate this expected value, we need a standard trigonometric identity: 1 1 cos(α) cos(β) = cos(α + β) + cos(β − α) 2 2
With α = ωc t and β = ωc (t + τ), α + β = ωc (2t + τ) + 2θ and β − α = ωc τ. Thus, 1 E cos(ωc (2t + τ) + 2θ) + E cos(ωc τ) 2 cos(ωc τ) = 0+ 2 = R(τ) (function of τ, not t)
Rxx (t,t + τ) =
The first expected value, over cos(ωc (2t + τ) + 2θ), is 0 because the integral is over two full periods of the sine wave. The second integral, over cos(ωc τ), is simply cos(ωc τ) since it is a constant as far as θ is concerned. Thus, we conclude X (t ) is WSS. The PSD of X (t ) is the Fourier transform of the autocorrelation function. Using some of the properties in Table 13.1,
Sxx (ω) = F Rxx (τ) =
2πδ(ω − ωc ) + 2πδ(ω + ωc ) 2
The PSD of X (t ) is shown in Figure 13.1.
−ωc
0
ωc
FIGURE 13.1 Power spectral density of X(t) = cos(ωc t + θ) consists of two impulses, one at −ωc and one at ωc .
Now, let us consider the amplitude modulation of the sinusoid by a WSS signal A(t ) with known mean μa and autocorrelation function Raa (τ). We assume the signal A(t ) is independent of the random phase θ. Thus,
E Y (t ) = E A(t ) cos(2πωc t + θ)
= E A(t ) E cos(2πωc t + θ)
= μa · 0 =0
356 CHAPTER 13 RANDOM SIGNALS AND NOISE
Ryy (t,t + τ) = E A(t )X (t )A(t + τ)X (t + τ) = E A(t )A(t + τ) E X (t )X (t + τ)
(definition) (by independence)
= Raa (τ)Rxx (τ)
Raa (τ) cos(ωc τ) 2 = Ryy (τ) =
(function of τ, not t)
Therefore, we conclude Y (t ) = A(t ) cos(ωc t + θ) is WSS. Using Equation (13.4), we can compute the PSD of Y (t ):
Syy (ω) = F Ryy (τ) 1 = F Raa (τ) cos(ωc τ) 2 Saa (ω − ωc ) + Saa (ω + ωc ) = 2 The relationship between the original PSD and the modulated PSD is shown in Figure 13.2. In the Americas, broadcast AM radio stations transmit a baseband signal that includes frequencies from about −5 KHz to 5 KHz. AM shifts the baseband to be centered on the carrier frequency. Thus, the bandwidth occupied by a standard broadcast AM station is (fc + 5000) − (fc − 5000) = 10 KHz. Radio stations are separated by 10 KHz to avoid interfering with each other. Sometimes when adjacent channels are unoccupied by other stations in the same market, AM stations can transmit wider baseband signals. In most of the rest of the world, broadcast AM stations transmit a baseband signal that includes frequencies from −4.5 KHz to 4.5 KHz. Thus, stations are 9 KHz apart.
Saa (ω)
0 modulation Syy (ω)
−ωc
0
ωc
FIGURE 13.2 Power spectral density of Y(t) = A(t) cos(ωc t + θ) consists of two replications of Saa (ω)/2, one at −ωc and one at ωc .
13.9 The Sampling Theorem for WSS Random 357
13.8 EXAMPLE: DISCRETE TIME WIENER FILTER A common signal processing problem is to estimate a signal from a related signal. For example, a communications receiver tries to estimate the signal that was sent, S, from an observed signal, X. X may be a noisy version of S, a filtered version of X, or both. As another example, a digital camera might observe a blurry and noisy version of the underlying signal. The camera applies a filter designed to best meet the contrary goals of minimizing the noise (generally a smoothing filter) and inverting the blurring (generally a sharpening filter). In this section, we develop the discrete time Wiener filter. The Wiener filter is an MMSE estimate of S from X, where the filter is restricted to be a causal finite impulse response filter. Let X (n) be a WSS discrete time signal, and let S(n) be a desired discrete time WSS signal. Furthermore, assume the cross-correlation between X and S is stationary in this sense: E X (m)S(n) = Rxs (m,n) = Rxs (n − m). The Wiener filter estimates S by a linear combination of delayed samples of X: Sˆ (n) =
p k=0
ak X (n − k)
To select the optimal coefficients, minimize the MSE:
2
E S(n) − Sˆ (n)
p 2 = E S(n) − ak X (n − k) k=0
Next, differentiate with respect to al for l = 0,1, . . . ,p. Set the derivatives to 0, and obtain the normal equations:
0 = E X (n − l) S(n) −
p k=0
ak X (n − k)
p = E X (n − l)S(n) − ak E X (n − l)X (n − k)
k=0
= Rxs (l) −
p
ak Rxx (l − k) for l = 0,1, . . . ,p
(13.7)
k=0
The set of equations (Equation 13.7) are known as the Wiener-Hopf equations. They consist of l + 1 equations in l + 1 coefficients. A particularly fast solution to the Wiener-Hopf equations is the Levinson-Durbin algorithm. Useful Matlab commands are levinson and lpc.
13.9 THE SAMPLING THEOREM FOR WSS RANDOM PROCESSES The sampling theorem is perhaps the most important result in signal processing. It states that under the right conditions, signals can be converted from analog to digital and back again without loss of information. In other words, the sampling theorem is the mathematical underpinning for the use of digital computers to process analog signals.
358 CHAPTER 13 RANDOM SIGNALS AND NOISE
13.9.1 Discussion The sampling theorem is usually presented for deterministic signals. In this section, we review the deterministic sampling theorem and show how it extends to WSS random signals. The proof is complicated (and we ignore the mathematical niceties of taking appropriate limits, etc.), but we address the main points. Let xa (t ) be a continuous time deterministic signal, and let Xa (ω) be the Fourier transform of xa (t ). Thus,
Xa (ω) = F xa (t ) =
∞ −∞
xa (t )e−jωt dt
x(t ) is bandlimited to W Hz if X (ω) = 0 for |ω| > 2πW. Now, let xa (t ) be sampled at rate 1/T, forming a discrete time signal x(n):
x(n) = xa (t )t=nT = xa (nT ) The sampling theorem says that xa (t ) can be reconstructed from the samples x(n): Theorem 13.1 (Shannon, Whittaker, Katelnikov): If xa (t ) is bandlimited to W Hz and xa (t ) is sampled at rate 1/T > 2W, then xa (t ) can be reconstructed from its samples, x(n) = xa (nT ). The reconstruction is as follows: ∞
sin π(t − nT )/T x(n) xa (t ) = π(t − nT )/T n=−∞ =
∞
n=−∞
x(n) sinc (t − nT )/T
where sincx = sin(πx)/πx (with the limit sinc(0) = 1). As an example, let xa (t ) = cos(2πft ) with f = 0.4 be sampled with T = 1. Since 1/T = 1 > 2f = 0.8, xa (t ) can be reconstructed from its samples, x(n). In Figure 13.3, we show two reconstructions of xa (t ). The N = 3 reconstruction uses three samples, n = −1,0,1, and the N = 7 reconstruction uses seven samples, n = −3, −2, −1,0,1,2,3. The sampling theorem can be extended to WSS random signals. Let X a (t ) be a WSS random signal with autocorrelation function R(τ) and PSD S(ω). Then, X a (t ) is bandlimited to W Hz if the PSD S(ω) = 0 for |ω| > 2πW. Theorem 13.2: If X a (t ) is bandlimited to W Hz and sampled at rate 1/T > 2W, then X a (t ) can be reconstructed from its samples, X (n) = X a (nT ): Xˆ a (t ) =
∞
n=−∞
X (n)
∞ sin(π(t − nT )/T ) = X (n) sinc π(t − nT )/T π(t − nT )/T n=−∞
The reconstruction Xˆ a (t ) approximates X a (t ) in the mean squared sense:
2
E X a (t ) − Xˆ a (t )
=0
13.9 The Sampling Theorem for WSS Random Processes 359
y(t) = sinc(t)
–5
–4
–3
–2
1
–1
2
3
t
4
5
xa (t) = 1 N=3
N=7
xa (t) = cos(2π f t) N=7
N=3
FIGURE 13.3 The top graph shows the sinc function. The middle and bottom graphs show reconstructions of different functions, a constant and a cosine function. The N = 3 reconstructions use three samples, x (−1), x (0), and x (1), and the N = 7 reconstructions use seven samples, x (−3), x (−2), . . . ,x (3). The reconstructions improve as more samples are used.
By Chebyshev’s inequality (Equation 4.16), ˆ a (t ) 2 E X a (t ) − X ˆ Pr |X a (t ) − X a (t )| > ≤ =0 2
for all > 0
Since Xˆ a (t ) approximates X a (t ) in the mean squared sense, the probability that Xˆ a (t ) differs from X a (t ) is 0 for all t. Comment 13.10: While we cannot say Xha (t) = Xa (t) for all t, we can say the probability they are the same is 1. This is a mathematical technicality that does not diminish the utility of the sampling theorem for random signals.
13.9.2 Example: Figure 13.4 Figure 13.4 shows an example of the sampling theorem applied to a random signal. The top graph shows a Gaussian WSS signal. The middle graph shows the signal sampled with sampling time T = 1. The bottom graph shows the sinc reconstruction.
360 CHAPTER 13 RANDOM SIGNALS AND NOISE
1
Xa(t)
0
t
–1 X(n) n
Xa(t) t 0
5
10
15
20
25
FIGURE 13.4 Sampling theorem example for a random signal. The top graph shows the original signal, the middle graph the sampled signal, and the bottom graph the reconstructed signal.
Since 1/T = 1, the highest frequency in X (t ) must be 0.5 or less. We chose 0.4. In practice, a little oversampling helps compensate for the three problems in sinc reconstruction: The signal is only approximately bandlimited, the sinc function must be discretized, and the sinc function must be truncated. Even though the first graph shows a continuous signal, it is drawn by oversampling a discrete signal (in this case, oversampled by a factor of four) and “connecting the dots.” For 25 seconds, we generate 101 samples (25 · 4 + 1; the signal goes from t = 0 to t = 25, inclusive). The signal is generated with the following Matlab command: x = randn (1 ,101); Since the signal is oversampled by a factor of 4, the upper frequency must be adjusted to 0.4/4 = 0.1. A filter (just about any lowpass filter with a cutoff of 0.1 would work) is designed by b = firrcos (30 , 0.1 , 0.02 , 1); The signal is filtered and the central portion extracted: xa = conv (x , b ); xa = xa (30:130); This signal is shown in the top graph (after connecting the dots). The sampled signal (middle graph) is created by zeroing the samples we do not want: xn = xa ; xn (2:4: end ) = 0; xn (3:4: end ) = 0; xn (4:4: end ) = 0;
13.9 The Sampling Theorem for WSS Random Processes 361
For the reconstruction, the sinc function must be discretized and truncated: s = sinc ( -5:0.25:5); Note that the sinc function is sampled at four samples per period, the same rate at which the signal is sampled. The reconstruction is created by convolving the sinc function with the sampled signal: xahat = conv ( xn , s ); xahat = xahat (20:121);
13.9.3 Proof of the Random Sampling Theorem In this section, we present the basic steps of the proof of the sampling theorem for WSS random signals. The MSE can be expanded as
2
E X a (t ) − Xˆ a (t )
2 ˆ a (t ) ˆ a (t ) + E X = E X 2a (t ) − 2E X a (t )X
The first term simplifies to
E X 2a (t ) = R(0)
2
If Xˆ a (t ) ≈ X a (t ), then it must be true that E X a (t )Xˆ a (t ) ≈ R(0) and E Xˆ a (t ) ≈ R(0). If these are true, then
2
E X a (t ) − Xˆ a (t )
= R(0) − 2R(0) + R(0) = 0
2
The crux of the proof is to show these two equalities, E X a (t )Xˆ a (t ) ≈ R(0) and E Xˆ a (t ) ≈ R(0). The proof of these equalities relies on several facts. First, note that the autocorrelation function, R(τ), can be thought of as a deterministic bandlimited signal (the Fourier transform of R is bandlimited). Therefore, R(τ) can be written in a sinc expansion: R(τ) =
∞
n=−∞
R(n) sinc (τ − nT )/T
Second, R is symmetric, R(τ) = R(−τ). Third, the sinc function is also symmetric, sinc(t ) = sinc(−t ). Consider the first equality:
E X a (t )Xˆ a (t ) = E X a (t ) = =
∞
n=−∞ ∞
n=−∞
∞
n=−∞
X (n) sinc (t − nT )/T
E X a (t )X (n) sinc (t − nT )/T
R(nT − t ) sinc (t − nT )/T
(13.8)
362 CHAPTER 13 RANDOM SIGNALS AND NOISE
Let R (nT ) = R(nT − t ). Since it is just a time-shifted version of R, R is bandlimited (the Fourier transform of R is phase-shifted, but the magnitude is the same as that of R). Then,
∞
E X a (t )Xˆ a (t ) =
n=−∞
R (nT ) sinc (t − nT )/T
= R (t )
(the sampling theorem) (replace R (t ) = R(t − t ))
= R(t − t ) = R(0)
Now, look at the second equality:
2 E Xˆ a (t ) = E
= = =
∞
n=−∞
∞
X (n) sinc (t − nT )/T
∞
n=−∞ m=−∞ ∞
∞
n=−∞ m=−∞
∞
m=−∞
X (m) sinc (t − mT )/T
E X (n)X (m) sinc (t − nT )/T sinc (t − mT )/T
R(mT − nT ) sinc (t − nT )/T sinc (t − mT )/T
∞ ∞
n=−∞ m=−∞
R(mT − nT ) sinc (t − mT )/T sinc (t − nT )/T
Consider the expression in the large parentheses. As above, let R (mT ) = R(mT − nT ). Then, the expression is a sinc expansion of R (t ):
2 E Xˆ a (t ) =
= =
∞
n=−∞ ∞
n=−∞ ∞
n=−∞
R (t ) sinc (t − nT )/T
R(t − nT ) sinc (t − nT )/T R(nT − t ) sinc (t − nT )/T
(same as Equation 13.8)
= R(0)
Thus, we have shown both equalities, which implies
2
E X a (t ) − Xˆ a (t )
= R(0) − 2R(0) + R(0) = 0
In summary, the sampling theorem applies to bandlimited WSS random signals as well as bandlimited deterministic signals. In practice, the sampling theorem provides mathematical support for converting back and forth between continuous and discrete time. SUMMARY
A random signal (also known as a random process) is a random function of time. Let f (x;t ) denote the density of X (t ). The joint density of X (t1 ) and X (t2 ) is f (x1 ,x2 ;t1 ,t2 ). The first and second moments of X (t ) are μ(t ) = E X (t ) =
∞
−∞
xf (x;t ) dx
Summary 363
E X 2 (t ) = E X (t )X (t ) =
∞ −∞
x2 f (x;t ) dx
Var X (t ) = E X 2 (t ) − μ(t )2 The autocorrelation Rxx (t1 ,t2 ) and autocovariance Cxx (t1 ,t2 ) of a random process are
Rxx (t1 ,t2 ) = E X (t1 )X (t2 ) Cxx (t1 ,t2 ) = E X (t1 ) − μ(t1 ) X (t2 ) − μ(t2 ) = Rxx (t1 ,t2 ) − μ(t1 )μ(t2 )
The Fourier transform of x(t ) is X (ω):
X (ω) = F x(t ) =
∞ −∞
x(t )e−jωt dt
The inverse Fourier transform is the inverse transform, which takes a frequency function and computes a time function:
x(t ) = F −1 X (ω) =
1 2π
∞ −∞
X (ω)ejωt dω
A random process is wide sense stationary (WSS) if it satisfies two conditions:
1. The mean is constant in time; that is, E X (t ) = μ = constant. 2. The autocorrelation depends only on the difference in the two times, not on both times separately; that is, R(t1 ,t2 ) = Rxx (t2 − t1 ). The power spectral density (PSD) Sxx (ω) of a WSS random process is the Fourier transform of the autocorrelation function:
S(ω) = F R(τ) =
∞
−∞
R(τ)e−ωτ dτ
We have three ways to compute the average power of a WSS random process: average &∞ power = σ2 + μ2 = R(0) = 21π −∞ S(ω) dω. Let X (t ) be a continuous time WSS random process that is input to a linear filter, h(t ), and let the output be Y (t ) = X (t ) ∗ h(t ). The PSD of Y (t ) is the PSD of X (t ) multiplied by the filter magnitude squared, |H (ω)|2 . The three most important types of noise in typical electrical systems are quantization noise (uniform), Poisson (shot) noise, and Gaussian thermal noise. When the noises are uncorrelated, the variances add: N = N1 + N2 + ···Nk Var N = Var N 1 + Var N 2 + · · · + Var N k
White noise has a PSD that is constant for all frequencies, Sxx (ω) = N0 for −∞ < ω < ∞. The autocorrelation function, therefore, is a delta function:
Rxx (τ) = F −1 Sxx (ω) = N0 δ(τ)
364 CHAPTER 13 RANDOM SIGNALS AND NOISE
The amplitude modulation of a random sinusoid by a WSS signal Y (t ) = A(t ) cos(ωc t +θ) is WSS, with PSD SYY (ω) = SAA (ω − ωc ) + SAA (ω + ωc ) /2. The discrete time Wiener filter estimates S by a linear combination of delayed samples of X: Sˆ (n) =
p k=0
ak X (n − k)
The optimal coefficients satisfy the Wiener-Hopf equations: 0 = Rxs (l) −
p
ak Rxx (l − k) for l = 0,1, . . . ,p
k=0
The sampling theorem is the theoretical underpinning of using digital computers to process analog signals. It applies to WSS signals as well: Theorem 13.3: If X a (t ) is bandlimited to W Hz and sampled at rate 1/T > 2W, then X a (t ) can be reconstructed from its samples, X (n) = X a (nT ): Xˆ a (t ) =
∞
n=−∞
X (n)
∞ sin(π(t − nT )/T ) = X (n) sinc π(t − nT )/T π(t − nT )/T n=−∞
The reconstruction Xˆ a (t ) approximates X a (t ) in the mean squared sense:
2
E X a (t ) − Xˆ a (t )
=0
PROBLEMS 13.1 If X (t ) and Y (t ) are independent WSS signals, is X (t ) − Y (t ) WSS? 13.2 A drunken random walk is a random process defined as follows: let S(n) = 0 for n ≤ 0 and S(n) = X 1 + X 2 + · · · + X n , where the X i are IID with probability Pr X i = 1 = Pr X i = −1 = 0.5. a. b. c. d. e.
What are the mean and variance of S(n)? What is the autocorrelation function of S(n)? Is S(n) WSS? What does the CLT say about S(n) for large n? Use a numerical package to generate and plot an example sequence S(n) for n = 0,1, . . . ,100.
13.3 Used to model price movements of financial instruments, the Gaussian random walk is a random process defined as follows: let S(n) = 0 for n ≤ 0 and S(n) = X 1 + X 2 + · · · + X n , where the X i are IID N (0, σ2 ). a. What are the mean and variance of S(n)? b. What is the autocorrelation function of S(n)?
Problems 365
c. Is S(n) WSS? d. Use a numerical package to generate and plot an example sequence S(n) for n = 0,1, . . . ,100. Use σ2 = 1. 13.4 What is the Fourier transform of x(t ) sin(ωc t )? 13.5 The filtered signal PSD result in Equation (13.6) holds in discrete time as well as continuous time. Repeat the sequence in Section 13.5 with sums instead of integrals to show this. 13.6 Let X (t ) be a WSS signal, and let θ ∼ U (0,2π) be independent of X (t ). Form X c (t ) = X (t )cos(ωt + θ) and X s (t ) = X (t )sin(ωt + θ). a. Are X c (t ) and X s (t ) WSS? If so, what are their autocorrelation functions? b. What is the cross-correlation between X c (t ) and X s (t )? 13.7 Let X (t ) be a Gaussian white noise with variance σ2 . It is filtered by a perfect lowpass filter with magnitude |H (ω)| = 1 for |ω| < ωc and |H (ω)| = 0 for |ω| > ωc . What is the autocorrelation function of the filtered signal? 13.8 Let X (t ) be a Gaussian white noise with variance σ2 . It is filtered by a perfect bandpass filter with magnitude |H (ω)| = 1 for ω1 < |ω| < ω2 and |H (ω)| = 0 for other values of ω. What is the autocorrelation function of the filtered signal? 13.9 The advantage of sending an unmodulated carrier is that receivers can be built inexpensively with simple hardware. This was especially important in the early days of radio. The disadvantage is the unmodulated carrier requires a considerable fraction of the total available power at the transmitter. Standard broadcast AM radio stations transmit an unmodulated carrier as well as the modulated carrier. Let W (t ) be a WSS baseband signal (i.e., the voice or music) scaled so that |W (t )| ≤ 1. Then, A(t ) = W (t ) + 1. a. What are the autocorrelation and power spectral densities of A(t ) and Y (t )? b. Redraw Figure 13.2 to reflect the presence of the carrier. 13.10 A common example of the Wiener filter is when S(n) = X (n + 1)—that is, when the desired signal S(n) is a prediction of X (n + 1). What do the Wiener-Hopf equations looks like in this case? (These equations are known as the Yule-Walker equations.) 13.11 Create your own version of Figure 13.4.
CHAPTER
14
SELECTED RANDOM PROCESSES
A great variety of random processes occur in applications. Advances in computers have allowed the simulation and processing of increasingly large and complicated models. In this chapter, we consider three important random processes: the Poisson process, Markov chains, and the Kalman filter.
14.1 THE LIGHTBULB PROCESS An elementary discrete random process is the lightbulb process. This simple process helps illustrate the kinds of calculations done in studying random processes. A lightbulb is turned on at time 0 and continues until it fails at some random time T. Let X (t ) = 1 if the lightbulb is working and X (t ) = 0 if it has failed. We assume the failure time T is exponentially distributed. Its distribution function is
Pr T ≤ t = FT (t ) = 1 − e−λt Therefore, the probability mass function of X (t ) is
Pr X (t ) = 1 = Pr t ≤ T = 1 − FT (t ) = e−λt Pr X (t ) = 0 = Pr T < t = FT (t ) = 1 − e−λt A sample realization of the random process is shown in Figure 14.1. X (t ) starts out at 1, stays at 1 for awhile, then switches to X (t ) = 0 at T = t and stays at 0 forever afterward. What are the properties of X (t )? The mean and variance are easily computed:
E X (t ) = 0 ∗ Pr X (t ) = 0 + 1 ∗ Pr X (t ) = 1 = e−λt
E X (t )2 = 02 ∗ Pr X (t ) = 0 + 12 ∗ Pr X (t ) = 1 = e−λt
2
Var X (t ) = E X (t )2 − E X (t ) = e−λt − e−2λt
E X (t ) and Var X (t ) are shown in Figure 14.2. 366
14.1 The Lightbulb Process 367
1 X(t)
T
t
FIGURE 14.1 Example of the lightbulb process. X(t) = 1 when the lightbulb is working, and X(t) = 0 when the bulb has failed.
1
E[X(t)]
Var[X(t)]
t FIGURE 14.2 Mean and variance of X(t) for the lightbulb process.
The autocorrelation and autocovariance functions require the joint probability mass function. Let t1 and t2 be two times, and let p(i,j;t1 ,t2 ) = Pr X (t1 ) = i ∩ X (t2 ) = j . Then,
p(0,0;t1 ,t2 ) = Pr X (t1 ) = 0 ∩ X (t2 ) = 0 = Pr T < t1 ∩ T < t2 = Pr T < min(t1 ,t2 )
= 1 − e−λ min(t1 ,t2 ) p(0,1;t1 ,t2 ) = Pr X (t1 ) = 0 ∩ X (t2 ) = 1 = Pr T < t1 ∩ t2 ≤ T = Pr t2 ≤ T < t1 0 for t1 ≤ t2 = −λt −λt 1 2 e −e for t1 > t2 p(1,0;t1 ,t2 ) = Pr X (t1 ) = 1 ∩ X (t2 ) = 0 = Pr t1 ≤ T ∩ T < t2 = Pr t1 ≤ T < t2 e−λt1 − e−λt2 for t1 < t2 = 0 for t1 ≥ t2
368 CHAPTER 14 SELECTED RANDOM PROCESSES
p(1,1;t1 ,t2 ) = Pr X (t1 ) = 1 ∩ X (t2 ) = 1 = Pr t1 ≤ T ∩ t2 ≤ T = Pr max(t1 ,t2 ) ≤ T = e−λmax(t1 ,t2 )
The autocorrelation and autocovariance functions are computed as follows:
R(t1 ,t2 ) = E X (t1 )X (t2 )
= 0 · 0 · Pr 0,0;t1 ,t2 + 0 · 1 · Pr 0,1;t1 ,t2 + 1 · 0 · Pr 1,0;t1 ,t2 + 1 · 1 · Pr 1,1;t1 ,t2
= e−λmax(t1 ,t2 )
C(t1 ,t2 ) = R(t1 ,t2 ) − E X (t1 )X (t2 )
= e−λmax(t1 ,t2 ) − e−λt1 e−λt2
The lightbulb process is not WSS. It fails both criteria. The mean is not constant, and the autocorrelation function is not a function of only the time difference, t2 − t1 . Other, more interesting random processes are more complicated than the lightbulb process and require more involved calculations. Nevertheless, the lightbulb process serves as a useful example of the kinds of calculations needed.
14.2 THE POISSON PROCESS The Poisson process is an example of a wide class of random processes that possess independent increments. The basic idea is that points occur randomly in time and the number of points in any interval is independent of the number that occurs in any other nonoverlapping interval. A Poisson process also has the property that the number of points in any time interval (s,t ) is a Poisson random variable with parameter λ(t − s). Another way of looking at it is that the mean number of points in the interval (s,t ) is proportional to the size of the interval, t − s. As one example of a Poisson process, consider a sequence of photons incident on a detector. The mean number of photons incident in (s,t ) is the intensity of the light, λ, times the size of the interval, t − s. As another example, the number of diseases (e.g., cancer) in a population versus time is often modeled as a Poisson process. In this section, we explore some of the basic properties of the Poisson process. We calculate moments and two interesting conditional probabilities. The first is the probability of getting l points in the interval (0,t ) given k points in the shorter interval (0,s), with s < t. The second is the reverse, the probability of getting k points in (0,s) given l points in (0,t ).
14.2 The Poisson Process 369
Comment 14.1: It is conventional to refer to point, and the Poisson process is often referred to as a Poisson point process. In ordinary English, one might use the word “events” to describe the things that happen, but we have already used this word to refer to a set of outcomes.
Let (s,t ) be a time interval with 0 < s < t, and let N (s,t ) be the number of points in the interval (s,t ). For convenience, let N (s) = N (0,s) and N (t ) = N (0,t ). We also assume N (0) = 0. Then, N (s,t ) = N (t ) − N (s) This can be rearranged as N (t ) = N (s) + N (s,t ). This is shown graphically in Figure 14.3. N (t ) N (s)
N (s,t ) s
0
t
FIGURE 14.3 An example of a Poisson process. The points are shown as closed dots: N(s) = 5, N(s,t) = 2, and N(t) = 7.
As mentioned above, a Poisson process has two important properties. First, it has independent increments, meaning that if (s,t ) and (u,v) are disjoint (nonoverlapping) intervals, then N (s,t ) and N (u,v) are independent random variables. In particular, N (s) and N (s,t ) are independent since the intervals (0,s) and (s,t ) do not overlap. N (s,t ) is Poisson with parameter λ(t − s). For a Poisson random variable, Second, E N (s,t ) = λ(t − s). This assumption means the expected number of points in an interval is proportional to the size of the interval. Using the Poisson assumption, means and variances are easily computed:
E N (t ) = Var N (t ) = λt E N (s,t ) = Var N (s,t ) = λ(t − s) Computing autocorrelations and autocovariances uses both the Poisson and independent increments assumptions:
R(s,t ) = E N (s)N (t )
= E N (s) N (s) + N (s,t ) = E N (s)N (s) + E N (s)N (s,t ) 2 = Var N (s) + E N (s) + E N (s) E N (s,t ) = λs + λ2 s2 + (λs)(λ(t − s)) = λs(1 + λt )
370 CHAPTER 14 SELECTED RANDOM PROCESSES
C(s,t ) = R(s,t ) − E N (s) E N (t ) = λs
An interesting probability is the one that no points occur in an interval:
Pr N (s,t ) = 0 = e−λ(t−s) This is the same probability as for an exponential random variable not occurring in (s,t ). Thus, we conclude the interarrival times for the Poisson process are exponential. That the waiting times are exponential gives a simple interpretation to the Poisson process. Starting from time 0, wait an exponential time T 1 for the first point. Then, wait an additional exponential time T 2 for the second point. The process continues, with each point occurring an exponential waiting time after the previous point. The idea that the waiting times are exponential gives an easy method of generating a realization of a Poisson process: 1. Generate a sequence of U (0,1) random variables U 1 , U 2 , . . .. 2. Transform each as T 1 = − log(U 1 )/λ, T 2 = − log(U 2 )/λ, . . .. The T i are exponential random variables. 3. The points occur at times T 1 , T 1 + T 2 , T 1 + T 2 + T 3 , etc. (This is how the points in Figures 14.3 and 14.4 were generated.) We also calculate two conditional probabilities. The first uses the orthogonal increments property to show that the future of the Poisson process is independent of its past:
We also calculate two conditional probabilities. The first uses the independent increments property to show that the future of the Poisson process is independent of its past:

Pr[N(t) = l | N(s) = k] = Pr[N(t) = l ∩ N(s) = k] / Pr[N(s) = k]
    = Pr[N(s,t) = l − k ∩ N(s) = k] / Pr[N(s) = k]
    = Pr[N(s,t) = l − k] · Pr[N(s) = k] / Pr[N(s) = k]   (by independence)
    = Pr[N(s,t) = l − k]
Given there are l points at time t, how many points were there at time s? The reverse conditional probability helps answer this question and gives yet another connection between the Poisson and binomial distributions:

Pr[N(s) = k | N(t) = l] = Pr[N(t) = l ∩ N(s) = k] / Pr[N(t) = l]
    = Pr[N(s,t) = l − k ∩ N(s) = k] / Pr[N(t) = l]
    = Pr[N(s,t) = l − k] · Pr[N(s) = k] / Pr[N(t) = l]
    = [ ((λ(t−s))^(l−k) e^(−λ(t−s)) / (l−k)!) · ((λs)^k e^(−λs) / k!) ] / ( (λt)^l e^(−λt) / l! )
    = ( l! / ((l−k)! k!) ) · (s/t)^k ((t−s)/t)^(l−k)    (14.1)
This probability is binomial with parameters l and s/t. Curiously, the probability does not depend on λ.

Figure 14.4 shows a realization of a Poisson process created as described above with λ = 1. In the time interval from t = 0 to t = 25, we expect to get about E[N(25)] = 25λ = 25 points. We obtained 31 points, slightly greater than one standard deviation (σ = √(λt) = 5) above the mean. For large t, the Poisson distribution converges to a Gaussian by the CLT. On average, we expect N(t) to be within one standard deviation of the mean about two-thirds of the time.

In summary, the Poisson process is used in many counting experiments. N(s,t) is the number of points that occur in the interval from s to t and is Poisson with parameter λ(t − s). The separation between successive points is exponential; that is, the waiting time until the next point occurs is exponential with parameter λ (the same λ as in the Poisson probabilities). The mean of N(s,t) is λ(t − s); the variance of N(s,t) is also λ(t − s). A typical N(t) sequence is shown in Figure 14.4.
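The binomial form of Equation (14.1) can be checked by simulation. The sketch below (ours, not the text's; the parameter values and the seed are arbitrary) generates realizations from exponential interarrivals, keeps those with N(t) = l, and compares the conditional frequencies of N(s) with the binomial PMF with parameters l and s/t:

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
lam, s, t, l = 1.0, 4.0, 10.0, 12
counts = np.zeros(l + 1)
for _ in range(100_000):
    # 60 interarrivals is far more than enough to pass t = 10 when lam = 1
    arrivals = np.cumsum(rng.exponential(1 / lam, size=60))
    if np.sum(arrivals < t) == l:
        counts[np.sum(arrivals < s)] += 1
empirical = counts / counts.sum()
exact = binom.pmf(np.arange(l + 1), l, s / t)
print(np.abs(empirical - exact).max())   # small, and independent of lam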
FIGURE 14.4 A realization of a Poisson process with λ = 1. The 31 points are shown below the plot. The plot also shows E[N(t)] = λt and one standard deviation (√(λt)) above and below.
Comment 14.2: Higher-dimensional Poisson processes are also common. For instance, a common health care problem is to study the number of incidents of a disease that occur in a given area. A common model is that the number of incidents is Poisson, with the rate proportional to the population of that area. Too many incidents might be indicative of an abnormal cluster caused by a pathogen or other unknown source.
14.3 MARKOV CHAINS

Markov chains¹ are a class of random processes that have grown in importance over time. They are used to model a wide variety of signals and random machines. Simulating Markov processes is an important application of numerical programming. Many Markov chains have a stationary probability distribution over the states of the chain. In a large number of numerical applications, the researcher generates samples from the probability distribution by generating samples from the Markov chain. Accordingly, we spend some time discussing stationary distributions.

Let X(n) be a discrete time and discrete outcome process. We are interested in how X(n) varies in time given the past observations of the process. X(n) is Markov if
Pr[X(n+1) = j | X(n) = i, X(n−1) = k, . . . , X(0) = l] = Pr[X(n+1) = j | X(n) = i]   (14.2, 14.3)
The idea is the probability of the next value depends only on the current value and not on all the past values. In other words, it is only necessary to know the current value of X (n) and not how the Markov chain got there. We make two simplifying assumptions. First, we assume the sample space is finite, S = {1,2, . . . ,m} (the alternative is to allow an infinite number of outcomes). Second, we assume the transition probabilities are constant in time. A Markov chain is homogeneous if
Pr[X(n+1) = j | X(n) = i] = p_{i,j}   (14.4)
The idea is that the p_{i,j} are constant in time. The p_{i,j} are the transition probabilities from X(n) = i to X(n+1) = j. The alternative is time-inhomogeneous Markov chains, with transition probabilities that depend on time; that is, Pr[X(n+1) = j | X(n) = i] = p_{i,j}(n) is a function of time n. While time-inhomogeneous Markov chains have a greater range of behaviors than time-homogeneous ones, we nevertheless focus on time-homogeneous Markov chains (with constant transition probabilities) as they appear in many applications.
¹ After the Russian mathematician Andrey Markov (1856–1922).
It is convenient to introduce vectors and matrices to describe the probabilities. Let p_i(n) = Pr[X(n) = i]. Furthermore, define the probability vector as

p(n) = ( p₁(n)  p₂(n)  ···  p_m(n) )
Note that p(n) is a row vector. Also, p(0) describes the initial state of the system.

Comment 14.3: Even though the transition probabilities are constant in time, in general, the state probabilities vary with time. The Markov chain may start in a given state and then evolve as time passes. The state probabilities then change over time. As we discuss below, we are particularly interested in the subset of Markov chains whose state probabilities approach a constant vector as n → ∞; that is, p(n) → p(∞).
The transition probabilities can be organized into a matrix:

P = [ p_{1,1}  p_{1,2}  ···  p_{1,m}
      p_{2,1}  p_{2,2}  ···  p_{2,m}
        ⋮        ⋮       ⋱     ⋮
      p_{m,1}  p_{m,2}  ···  p_{m,m} ]
The rows of P sum to 1; that is, p_{i,1} + p_{i,2} + ··· + p_{i,m} = 1. Matrices with this property are known as stochastic. In some examples, the columns also sum to 1. In those cases, the matrix is known as doubly stochastic.

For example, Figure 14.5 shows a two-state Markov chain. The two states are labeled simply as 0 and 1. Let X(n) denote the state of the system at time n. The transition probabilities can be read directly from the diagram:
Pr[X(n+1) = 0 | X(n) = 0] = 1 − p
Pr[X(n+1) = 1 | X(n) = 0] = p
Pr[X(n+1) = 0 | X(n) = 1] = q
Pr[X(n+1) = 1 | X(n) = 1] = 1 − q
FIGURE 14.5 A simple two-state Markov chain. With probabilities p and q, the machine changes state from 0 to 1 and from 1 to 0, respectively.
The transition matrix is

P = [ 1−p   p
       q   1−q ]
Note that the rows of P sum to 1.

The probabilities for a homogeneous Markov chain evolve in time as follows:

p(1) = p(0)P
p(2) = p(1)P = p(0)P²
p(3) = p(2)P = p(0)P³

In general, p(n) = p(0)Pⁿ.
Since P is the one-step transition probability matrix, P² is the two-step probability transition matrix. In general, Pᵏ is the k-step transition matrix.

Comment 14.4: There are two unfortunate conventions in analyzing Markov chains. First, the probability vector p(n) is a row vector. The row vector then multiplies the probability transition matrix on the left. Most other applications use column vectors and multiplication on the right. Second, as we see below, the stationary probability distribution is denoted π. The reader must be aware that this use of π does not refer to the transcendental number 3.1415. . ..
For example, let p = 0.1 and q = 0.2 in Figure 14.5. Then,

P = [ 0.9  0.1
      0.2  0.8 ]

P² = P · P = [ 0.83  0.17
               0.34  0.66 ]

P³ = P² · P = [ 0.781  0.219
                0.438  0.562 ]

P⁴ = P² · P² = [ 0.747  0.253
                 0.507  0.493 ]
⋮
P¹⁰ = [ 0.676  0.324
        0.648  0.352 ]
⋮
P²⁰ = [ 0.667  0.333
        0.666  0.334 ]
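These powers are easy to reproduce in Python (a sketch in the style of Appendix A, not code from the text):

import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
for k in (2, 3, 4, 10, 20):
    print(k)
    print(np.linalg.matrix_power(P, k).round(3))
# both rows of P**20 are close to (0.667, 0.333)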
It appears the rows of Pᵏ are approaching (0.667, 0.333) as k → ∞. This is indeed the case, as discussed below.

In the example above, the k-step transition matrix Pᵏ approaches a limit as k → ∞. Furthermore, in the limit all rows are the same. Some Markov chains have these two properties, some have one but not both, and some have neither. When Pᵏ approaches a limit with all identical rows, the probability vector approaches a stationary distribution independently of the starting distribution. Let the rows of P^∞ be denoted π. Then, for any probability vector p(0),

p(∞) = p(0)P^∞ = π
This is a desirable property in many numerical experiments. As such, we are interested in which Markov chains have this property and which do not. A simple example of a Markov chain that does not approach a limit is the following:
P = [ 0  1
      1  0 ]

P² = [ 1  0
       0  1 ]

P³ = [ 0  1
       1  0 ]

P⁴ = [ 1  0
       0  1 ]
and so on. Pᵏ alternates between the two forms, depending on whether k is even or odd.

A simple example of a Markov chain that approaches a limit that does not have identical rows is the identity matrix, P = I. Then, Pᵏ = I = P. Clearly, however, the rows of P = I are different.

Now, consider the case when Pᵏ approaches a limit with identical rows. Let π denote the row. Then,

Pᵏ → P^∞ = [ ←π→
             ←π→
              ⋮
             ←π→ ]   as k → ∞
A direct calculation shows π is a left eigenvector of P. Since each row of P^∞ is π and π sums to 1, πP^∞ = π. Then, using P^∞P = P^∞,

πP = (πP^∞)P = π(P^∞P) = πP^∞ = π   (14.5)
Thus, we see π is a left eigenvector of P with eigenvalue 1.

Comment 14.5: All stochastic matrices have a left eigenvector with eigenvalue 1. It is a property of square matrices that the left and right eigenvalues are the same (though usually the left and right eigenvectors are different). Since the sum over any row of the entries of a stochastic matrix is 1, all stochastic matrices have a right eigenvalue of 1 and a corresponding right eigenvector that is all 1's (perhaps normalized). Unfortunately, the all-1's vector is not usually a left eigenvector of a stochastic matrix. The left eigenvector corresponding to eigenvalue 1 must be computed.

Not all stochastic matrices, however, approach a limit with identical rows as k → ∞. We are particularly interested in those Markov chains that do approach a limit with identical rows.
Let π_j denote the jth component of π. Then, π_j satisfies the following equation:

π_j = Σ_{i=1}^{m} π_i p_{i,j}

which can be written as

(1 − p_{j,j}) π_j = Σ_{i=1, i≠j}^{m} π_i p_{i,j}

Since the p_{j,k} sum to 1 (over k), the left-hand side can be rewritten as

π_j Σ_{k=1, k≠j}^{m} p_{j,k} = Σ_{i=1, i≠j}^{m} π_i p_{i,j}
The term on the left is the probability of leaving state j, and the term on the right is the probability of entering state j. This equation has a simple flow interpretation: in steady state, the probability flow into state j equals the probability flow leaving state j.

Since the rows of P sum to 1, the flow equations are linearly dependent. Any one of the m flow equations can be derived from the other m − 1. To solve for π, we need the additional equation that the sum of the π_i is 1.

The flow equations can be found graphically. Pick m − 1 of the m states, and draw a closed curve around each one. For each, set the flow entering equal to the flow leaving. For example, in Figure 14.6, a closed curve is drawn around the 0 state. The flow equation for this state is

π₀ p = π₁ q
(A loop around the other node leads to the same equation.) Coupled with the sum equation (π₀ + π₁ = 1), the steady-state probabilities are determined as follows:

π₀ = q / (p + q)    π₁ = p / (p + q)

When p = 0.1 and q = 0.2, π₀ = 2/3 and π₁ = 1/3.
FIGURE 14.6 Binary Markov chain with a closed curve around the 0 state.
At this point, we have seen that some Markov chains have a stationary distribution and that the stationary distribution can be computed three different ways:

1. Compute Pᵏ for some large value k. The rows will be approximately a stationary distribution. Note that usually the fastest way to compute Pᵏ for a large k is to employ repeated squaring: compute P² = P · P, P⁴ = P² · P², P⁸ = P⁴ · P⁴, etc.
2. Compute the left eigenvectors of P. The stationary distribution is the eigenvector corresponding to the eigenvalue 1.
3. Set up and solve the flow equations.

To determine which Markov chains have a stationary distribution (and which do not), we need to examine the chains more closely:

• A state i is absorbing if p_{i,i} = 1. If the Markov chain ever reaches the absorbing state i, it will never leave the absorbing state.
• A state j is accessible from state i if there is a nonzero probability of getting to state j given the Markov chain started in state i (i.e., if Pr[X(n) = j | X(0) = i] > 0 for some n). Note that state j does not have to be one-step accessible, but there must be a path with nonzero probability that starts in state i and ends in state j.
• The Markov chain is irreducible if each state is accessible from any initial state (i.e., if it is possible to get to any state from any starting state).
• A state i is aperiodic if a time n exists such that Pr[X(l) = i | X(0) = i] > 0 for all l ≥ n. In other words, the state is aperiodic if it is possible to return to the state at all possible times later than n. For a counterexample, consider the Markov chain that rotates among three states: a → b → c → a → b, etc. It is possible to return to any state, and in fact, the Markov chain returns infinitely often to each state. However, the Markov chain cannot return at every time (after a suitable waiting period). If the chain starts in state a, then it returns to a when n = 3,6,9, . . .. It will never be in state a if n is not a multiple of 3.
• The Markov chain is aperiodic if all states are aperiodic.

A Markov chain has a unique stationary distribution if it is irreducible and aperiodic. Conceptually, a Markov chain being irreducible (any state can get to any other) and aperiodic (can return to any state at all future times after a finite waiting period) means a time n exists such that
Pr[X(l) = j | X(0) = i] > 0   for all i and j and all l ≥ n

In other words, the Markov chain has a unique stationary distribution if the matrix Pⁿ has all nonzero entries for some n. The Markov chain in Figure 14.5 has a stationary distribution if 0 < p < 1 and 0 < q < 1; then the entries in P are all nonzero.
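For the first method listed above, the repeated-squaring trick looks like this in Python (our sketch; for brevity it assumes k is a power of 2, while np.linalg.matrix_power handles general k):

import numpy as np

def power_by_squaring(P, k):
    """Compute P**k by repeated squaring; k is assumed to be a power of 2."""
    result = P.copy()
    while k > 1:
        result = result @ result   # P**2, P**4, P**8, ...
        k //= 2
    return result

P = np.array([[0.9, 0.1], [0.2, 0.8]])
print(power_by_squaring(P, 64).round(4))   # rows approach the stationary distribution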
EXAMPLE 14.1
The Markov chain in Figure 14.5 has a surprisingly wide variety of behaviors depending on p and q:

• If p = q = 0, the Markov chain stays in whatever state it started in.
• If p = 1 − q, the sequence of states is IID Bernoulli with parameter p. That is, Pr[X(n+1) = 1 | X(n) = i] = p for i = 0,1.
• If p = q = 1, the Markov chain alternates between 0 and 1.
• If 0 < p < 1 and q = 1, the Markov chain produces runs of 0's followed by isolated 1's. Here is a realization with p = 0.25:
01000000100001000000000100001000100000100000100000
• If 0 < p < 1 and 0 < q < 1, the Markov chain produces alternating runs of 0's followed by runs of 1's. Here is a realization for p = 0.1 and q = 0.2:
01101100100000000000000000000011100111111100000000
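Realizations like these can be produced with a short simulation (our own sketch; compare Problem 14.5, which asks for exactly such a program):

import numpy as np

def simulate_chain(P, p0, n, rng=None):
    """Return n states of a Markov chain with transition matrix P and initial pmf p0."""
    rng = rng or np.random.default_rng()
    states = np.empty(n, dtype=int)
    states[0] = rng.choice(len(p0), p=p0)
    for k in range(1, n):
        states[k] = rng.choice(P.shape[0], p=P[states[k - 1]])
    return states

P = np.array([[0.9, 0.1], [0.2, 0.8]])     # p = 0.1, q = 0.2
print("".join(map(str, simulate_chain(P, [0.5, 0.5], 50))))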
EXAMPLE 14.2
Consider the three-state Markov chain shown in Figure 14.7. A question arises whether or not this Markov chain has a unique stationary distribution. Is it irreducible and aperiodic? It is possible to get from one state to any other, so it is irreducible. To be aperiodic, however, requires that it can go from state i to state i for every time after a suitable waiting period. The loop on the left allows this. The Markov chain can pause and rotate around the loop a few times to arrive at state i at a desired future time. For instance, the Markov chain can start in state 1 and return to state 1 by any of these sequences (and many more):

1 → 2 → 0 → 1
1 → 2 → 0 → 0 → 1
1 → 2 → 0 → 0 → 0 → 1

Here is the probability transition matrix with p = 0.1:

P = [ 0.1  0.9  0
       0    0   1
       1    0   0 ]

FIGURE 14.7 A three-state Markov chain as discussed in Example 14.2. This Markov chain has a stationary distribution.
Look at the first few powers of P:

P² = [ 0.01  0.09  0.9
       1     0     0
       0.1   0.9   0 ]

P³ = [ 0.901  0.009  0.09
       0.1    0.9    0
       0.01   0.09   0.9 ]

P⁴ = [ 0.1801  0.8109  0.009
       0.01    0.09    0.9
       0.901   0.009   0.09 ]
P⁴ has all nonzero entries. Therefore, the Markov chain has a unique stationary distribution. Unfortunately, Pᵏ converges slowly to a stationary distribution. Even P³⁰ has not converged to constant rows yet:

P³⁰ = [ 0.43  0.20  0.37
        0.41  0.41  0.18
        0.22  0.37  0.41 ]
However, the eigenvector corresponding to the eigenvalue equal to 1 gives the answer:

π = ( 0.357  0.321  0.321 )
Comment 14.6: Here are a series of helpful Python commands for Example 14.2:

import numpy as np
P = np.array([[0.1, 0.9, 0], [0, 0, 1], [1, 0, 0]])
P30 = np.linalg.matrix_power(P, 30)
# eig computes right eigenvectors, so work with the transpose of P
l, v = np.linalg.eig(P.T)
i = np.argmax(l.real)                      # index of the eigenvalue 1
vpi = v[:, i].real / sum(v[:, i].real)     # normalize the eigenvector
# check the eigenvalue and eigenvector
print(l[i], vpi - np.dot(vpi, P))
EXAMPLE 14.3
(Win by 2): Consider a simple competition between two sides: the winner is the first one to win two more points or games than the other. Such competitions often decide close games or sets (e.g., volleyball). A state diagram illustrating this "win by 2" competition is shown in Figure 14.8. The play starts in state 0 and stops when it reaches state −2 (a loss) or state 2 (a win). Both states, −2 and 2, are absorbing states. If it started in either of the absorbing states, the Markov chain would stay there forever. Therefore, this Markov chain does not have a unique stationary distribution.
FIGURE 14.8 State diagram for the "win by 2" competition. This Markov chain does not have a unique stationary distribution.
The probability transition matrix is the following (states ordered −2, −1, 0, 1, 2):

P = [ 1     0     0     0    0
      1−p   0     p     0    0
      0     1−p   0     p    0
      0     0     1−p   0    p
      0     0     0     0    1 ]
It is interesting to start with the initial probability vector p(0) = (0 0 1 0 0) and propagate it forward in time to see how the probability vector evolves. For concreteness, let p = 0.6. Then,

p(1) = p(0)P = ( 0  0.4  0  0.6  0 )
p(2) = p(1)P = ( 0.16  0  0.48  0  0.36 )
p(4) = p(3)P = ( 0.237  0  0.230  0  0.533 )
⋮
p(12) = p(11)P = ( 0.304  0  0.012  0  0.684 )

There is only a 1.2% chance the game is still undecided after 12 points. With p = 0.6, the probability of a win is approximately 0.69. The win probability versus p is shown in Figure 14.9.
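The propagation is a short loop in Python (a sketch under our own naming; the state order is (−2, −1, 0, 1, 2)):

import numpy as np

p = 0.6
P = np.array([[1,     0,     0,     0,     0],
              [1 - p, 0,     p,     0,     0],
              [0,     1 - p, 0,     p,     0],
              [0,     0,     1 - p, 0,     p],
              [0,     0,     0,     0,     1]])
prob = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # p(0): start in state 0
for _ in range(12):
    prob = prob @ P                          # p(n) = p(n-1) P
print(prob.round(3))                         # [0.304 0.    0.012 0.    0.684]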
FIGURE 14.9 Probability of a win versus p for the "win by 2" competition. When p = 0.6, the win rate is 0.69.
In summary, Markov chains model a large variety of random processes. The probability of moving from state i to state j is pi,j . The probabilities are usually arranged into a stochastic matrix P. The rows of P sum to 1. A particularly interesting class of Markov chains are those whose k-step transition matrix Pk approaches a limit with constant rows. Researchers use these to generate random samples with the Markov chain’s stationary distribution.
14.4 KALMAN FILTER

The Kalman filter is an optimal state estimator that is commonly used in tracking and control applications. It was largely developed in the 1960s for the aircraft industry. Nowadays, the Kalman filter is used in all sorts of applications, from tracking vehicles to predicting stock market prices to following the performance of sports teams, to name but a few. In this section, we introduce the Kalman filter and give a simple example. The section concludes with a discussion of an efficient implementation based on the QR method for solving linear least squares problems.
14.4.1 The Optimal Filter and Example

Formally, the Kalman filter consists of two parts:

1. A dynamical model describing the evolution of a state vector and observations, perhaps indirect, of the state
2. An optimal procedure for estimating the state vector and its error covariance

We consider the discrete time Kalman filter. While there is a continuous time Kalman filter, the discrete time filter is much more common. Let the time index be k, for k = 0,1,2, . . .. Let x_k be a vector of state variables at time k. In an aircraft control problem, for instance, the vector might consist of three position variables, three velocity variables, three rotational position variables (roll, pitch, and yaw), and three rotational velocity variables. The state evolves in time according to a linear equation:

x_{k+1} = F_k x_k + G_k w_k   (14.6)
where F_k and G_k are known matrices and w_k is Gaussian with mean 0 and covariance Q_k. We assume Q_k is invertible. In many problems, G_k = I. Observations of the state are made:

z_k = H_k x_k + v_k   (14.7)
where H_k is a known matrix. (Note that some authors write H_k^T in this equation. In that case, the transpose on H in the state and error covariance updates below must be changed.) The noise v_k is Gaussian with mean 0 and covariance R_k. As with the state noise, we assume R_k is invertible. Another thing to note is that the number of observations may not be the same as the size of x_k. There may be fewer observations, an equal number, or more.
We assume all the noises are independent and Gaussian with mean 0 and known covariances, R_k and Q_k. If the means are not 0, Equations (14.6) and (14.7) can be updated with known offsets and the noises replaced by zero-mean versions with the same covariances.

As an example, consider an object that is allowed to move in one dimension. Let X_k be the position of the object and V_k the velocity of the object. Then, the state vector is

x_k = [ X_k
        V_k ]   (14.8)
The state evolves according to a simple dynamical model subject to random positional errors and random accelerations:

X_{k+1} = X_k + ΔT·V_k + ε_k
V_{k+1} = V_k + ν_k

where ΔT is the size of the time step, ε_k ∼ N(0, σ²_x), and ν_k ∼ N(0, σ²_v). Thus, the state equation can be written as

x_{k+1} = [ 1  ΔT
            0   1 ] x_k + w_k   (14.9)
The various matrices are

F_k = [ 1  ΔT      G_k = [ 1  0      Q_k = [ σ²_x   0
        0   1 ]            0  1 ] = I        0    σ²_v ]
In this example, the position, but not the velocity, can be observed:

Z_k = X_k + v_k

where v_k ∼ N(0, σ²_z). The observation equation can be written as

z_k = ( 1  0 ) x_k + v_k = H_k x_k + v_k   (14.10)
Note that the matrices F_k, G_k, H_k, Q_k, and R_k are all constant in time. When we discuss the optimal estimates for this example, we will drop the subscript k on these matrices. (It often happens that these matrices are time independent.)

The second part of the Kalman filter is a procedure for optimally estimating the state vector and the error covariance of the state vector. Define x̂_{k|l} as the estimate of x_k using observations through (and including) time l. While we use the term filtering broadly to mean estimating x_k from observations, sometimes the terms smoothing, filtering, and prediction are used more precisely: if l > k (future observations are used to estimate x_k), x̂_{k|l} is a smoothed estimate of x_k; if l = k, x̂_{k|k} is a filtered estimate; and if l < k, x̂_{k|l} is a prediction of x_k.

The Kalman filter is recursive: at time k − 1, an estimate x̂_{k−1|k−1} and its error covariance Σ_{k−1|k−1} are available. The first step is to generate a prediction, x̂_{k|k−1}, and its error covariance, Σ_{k|k−1}. This step uses the state equation (Equation 14.6). The second step is to generate a filtered estimate, x̂_{k|k}, and its error covariance, Σ_{k|k}, using the observation
equation (Equation 14.7). This process (i.e., predict and then filter) can be repeated as long as desired.

The Kalman filter starts with an estimate of the initial state, x̂_{0|0}. We assume x̂_{0|0} ∼ N(x₀, Σ_{0|0}). We assume the error in x̂_{0|0} is independent of the w_k and v_k for k > 0.

At time k − 1, assume x̂_{k−1|k−1} ∼ N(x_{k−1}, Σ_{k−1|k−1}). Knowing nothing else, the best estimate of the noises in the state equation is 0. Therefore, the prediction is

x̂_{k|k−1} = F_{k−1} x̂_{k−1|k−1}   (14.11)
The error is as follows:

x̂_{k|k−1} − x_k = F_{k−1}(x̂_{k−1|k−1} − x_{k−1}) − G_{k−1} w_{k−1}

The two noises are independent. Therefore, the error covariance is

Σ_{k|k−1} = F_{k−1} Σ_{k−1|k−1} F_{k−1}^T + G_{k−1} Q_{k−1} G_{k−1}^T   (14.12)
The observation update step, going from xˆ k|k−1 to xˆ k|k , is more complicated. We state, without proof, the answer:
x̂_{k|k} = x̂_{k|k−1} + Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1} (z_k − H_k x̂_{k|k−1})   (14.13)

Σ_{k|k} = Σ_{k|k−1} − Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1} H_k Σ_{k|k−1}   (14.14)
The update equations are simplified by defining a gain matrix, Kk . Then,
K_k = Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1}   (14.15)
x̂_{k|k} = x̂_{k|k−1} + K_k (z_k − H_k x̂_{k|k−1})   (14.16)
Σ_{k|k} = (I − K_k H_k) Σ_{k|k−1}   (14.17)
The state prediction update equation (Equation 14.16) has the familiar predictor-corrector form. The updated state estimate is the predicted estimate plus the gain times the innovation. Equations (14.11), (14.12), (14.15), (14.16), and (14.17) constitute what is generally called the Kalman filter. The state and observation equations are linear and corrupted by additive Gaussian noise. The state estimates are linear in the observations and are optimal in the sense of minimizing the MSE. The state estimates are unbiased with the covariances above.

Let us continue the example and calculate the updates. Assume the initial estimate is x̂_{0|0} = (0 0)^T with error covariance Σ_{0|0} = I. For simplicity, we also assume ΔT = 0.9 and σ²_x = σ²_v = 0.5. Then,
x̂_{1|0} = [ 1  0.9 ] [ 0 ] = [ 0 ]
           [ 0   1  ] [ 0 ]   [ 0 ]

Σ_{1|0} = [ 1  0.9 ] [ 1  0 ] [ 1  0.9 ]^T + [ 0.5   0  ] = [ 2.31  0.9 ]
          [ 0   1  ] [ 0  1 ] [ 0   1  ]     [  0   0.5 ]   [ 0.9   1.5 ]
Letting σ²_z = 0.33, we obtain an observation z₁ = 0.37. First, we need to calculate the gain matrix:

K₁ = Σ_{1|0} H^T (H Σ_{1|0} H^T + R)^{−1} = [ 2.31 ] (2.31 + 0.33)^{−1} = [ 0.875 ]
                                            [ 0.9  ]                     [ 0.34  ]
The innovation is

z₁ − H₁ x̂_{1|0} = 0.37 − 0 = 0.37
Therefore, the new state estimate and its error covariance are the following:
x̂_{1|1} = [ 0 ] + [ 0.875 ] (0.37) = [ 0.323 ]
           [ 0 ]   [ 0.34  ]          [ 0.126 ]

Σ_{1|1} = ( [ 1  0 ] − [ 0.875 ] ( 1  0 ) ) [ 2.31  0.9 ] ≈ [ 0.29  0.11 ]
            [ 0  1 ]   [ 0.34  ]            [ 0.9   1.5 ]   [ 0.11  1.19 ]
The Kalman estimate (0.323 0.126) is reasonable given the observation Z 1 = 0.37. Note that the estimated covariance Σ1|1 is smaller than Σ1|0 and that both are symmetric (as are all covariance matrices).
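The whole example fits in a few lines of numpy. The sketch below (our code, not the text's) applies the prediction equations (14.11) and (14.12) and the update equations (14.15)–(14.17) to the numbers above:

import numpy as np

F = np.array([[1.0, 0.9], [0.0, 1.0]])   # Delta T = 0.9
Q = 0.5 * np.eye(2)                      # state noise covariance (G = I)
H = np.array([[1.0, 0.0]])               # observe position only
R = np.array([[0.33]])                   # observation noise covariance

x = np.zeros((2, 1))                     # x_hat(0|0)
Sigma = np.eye(2)                        # Sigma(0|0)
z = np.array([[0.37]])                   # observation z_1

x_pred = F @ x                           # (14.11)
S_pred = F @ Sigma @ F.T + Q             # (14.12)
K = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + R)   # (14.15)
x = x_pred + K @ (z - H @ x_pred)        # (14.16)
Sigma = (np.eye(2) - K @ H) @ S_pred     # (14.17)
print(x.ravel().round(3))                # close to (0.323, 0.126)
print(Sigma.round(2))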
14.4.2 QR Method Applied to the Kalman Filter

The Kalman filter is the optimal MSE estimator, but the straightforward implementation sometimes has numerical problems. In this section, we show how the QR method, an implementation that is fast, easy to program, and accurate, can be used to compute the Kalman estimates.

In the development below, we make the simplifying assumption that G_k = I for all k. If this is not true, we can set up and solve a constrained linear regression problem, but that development is beyond this text.

Recall that we introduced the QR method in Section 11.3.3 as a way of solving a linear regression problem. The linear regression problem is to choose β to minimize ‖y − Xβ‖². The optimal solution is β̂ = (X^T X)^{−1} X^T y. Alternately, the solution can be found by applying an orthogonal matrix to X and y, converting X to an upper triangular matrix R̃ with zeros below (Figure 11.4). The solution can then be found by solving the equation R̃ β̂ = y_∥. The work in the QR method is applying the orthogonal matrix to X. Solving the resulting triangular system of equations is easy.

The Kalman recursion starts with an estimate, x̂_{0|0} ∼ N(x₀, Σ_{0|0}). We need to represent this relation in the general form of the linear regression problem. To do this, we first factor Σ_{0|0} as P^T P, where P is an upper triangular matrix. There are several ways of doing this, the most popular of which is the Cholesky factorization (see Problem 11.6). The original estimate can be written as x̂_{0|0} = x₀ + w₀, where w₀ has covariance Σ_{0|0} = P^T P. These equations can be transformed as

P^{−T} x̂_{0|0} = P^{−T} x₀ + ε₀

where ε₀ ∼ N(0, I). We use the notation P^{−T} to denote the transpose of the inverse of P; that is, P^{−T} = (P^{−1})^T.
To simplify the remaining steps, we change the notation a bit. Instead of writing the factorization of Σ_{0|0} as P^T P, we write it as Σ_{0|0}^{T/2} Σ_{0|0}^{1/2}. Similarly, we factor Q = Q^{T/2} Q^{1/2} and R = R^{T/2} R^{1/2}. The Kalman state equations can be written as

0 = Q_{k−1}^{−T/2} (x_k − F_{k−1} x_{k−1}) + ε_k

and the Kalman observation equations as

R_k^{−T/2} z_k = R_k^{−T/2} H_k x_k + η_k

The noises in both equations are Gaussian with mean 0 and identity covariances. The initial estimate and the state equations combine to form a linear regression problem as follows:
X = [ Σ_{0|0}^{−T/2}          0
      −Q₀^{−T/2} F₀     Q₀^{−T/2} ]

β = [ x₀        y = [ Σ_{0|0}^{−T/2} x̂_{0|0}
      x₁ ]            0                      ]   (14.18)
The optimal solution is

β̂ = [ x̂_{0|0}  ] = (X^T X)^{−1} X^T y   (14.19)
     [ x̂_{1|0} ]
We can add the Kalman observation equations:

X = [ Σ_{0|0}^{−T/2}          0
      −Q₀^{−T/2} F₀     Q₀^{−T/2}
       0                R₁^{−T/2} H₁ ]

β = [ x₀        y = [ Σ_{0|0}^{−T/2} x̂_{0|0}
      x₁ ]            0
                      R₁^{−T/2} z₁           ]   (14.20)
The least squares solution to Equation (14.20) is

β̂ = [ x̂_{0|1}  ] = (X^T X)^{−1} X^T y   (14.21)
     [ x̂_{1|1} ]
The linear regression problem (Equation 14.20) can be solved by the QR method (note that the "Q" in the QR method is different from the noise covariance Q_k). After multiplying X and y by a suitably chosen orthogonal matrix, we obtain the following:

QX = [ R̃_{0|0}  R̃_{0|1}        Qy = [ y_{0|1}
       0        R̃_{1|1}               y_{1|1}
       0        0       ]             0       ]   (14.22)
The filtered estimate x̂_{1|1} can be found as x̂_{1|1} = R̃_{1|1}^{−1} y_{1|1}. The error covariance is Σ_{1|1} = R̃_{1|1}^{−1} R̃_{1|1}^{−T} (see Problem 14.12). This process can be continued. In general, at time k, the regression problem is the following:

X = [ R̃_{k−1|k−1}                  0
      −Q_{k−1}^{−T/2} F_{k−1}   Q_{k−1}^{−T/2}
       0                        R_k^{−T/2} H_k ]

β = [ x_{k−1}      y = [ y_{k−1|k−1}
      x_k     ]          0
                         R_k^{−T/2} z_k ]   (14.23)
After multiplying by Q,

x̂_{k|k} = R̃_{k|k}^{−1} y_{k|k}   (14.24)
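As a concrete sketch of one step (our own code, with G_k = I and the example's constant matrices; the helper name is ours), the stacked regression (14.23) can be formed and solved with numpy's QR routine:

import numpy as np

F = np.array([[1.0, 0.9], [0.0, 1.0]])
Q = 0.5 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.33]])

# Q = L L^T with L lower triangular, so L plays the role of Q^{T/2}
Qi = np.linalg.inv(np.linalg.cholesky(Q))   # Q^{-T/2}
Ri = np.linalg.inv(np.linalg.cholesky(R))   # R^{-T/2}

def qr_kalman_step(R_prev, y_prev, z):
    """One square root information filter step: returns R_tilde(k|k), y(k|k), x_hat(k|k)."""
    X = np.block([[R_prev,           np.zeros((2, 2))],
                  [-Qi @ F,          Qi],
                  [np.zeros((1, 2)), Ri @ H]])
    y = np.concatenate([y_prev, np.zeros(2), Ri @ z])
    Qmat, Rmat = np.linalg.qr(X)             # orthogonal Qmat, triangular Rmat
    y_rot = Qmat.T @ y
    R_new, y_new = Rmat[2:4, 2:4], y_rot[2:4]
    return R_new, y_new, np.linalg.solve(R_new, y_new)   # (14.24)

R0, y0 = np.eye(2), np.zeros(2)              # Sigma(0|0) = I, x_hat(0|0) = 0
R1, y1, x11 = qr_kalman_step(R0, y0, np.array([0.37]))
print(x11.round(3))                          # matches the standard recursions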
Let us close with some comments on the QR method solving the Kalman filtering problem:

• The QR method applied to the Kalman filtering problem is called a square root information filter.
• The terms Q_{k−1}^{−T/2}, F_{k−1}, and R_k^{−T/2} H_k look ugly, but they do not depend on the data and can be precomputed. In many applications, Q and R are diagonal, and computing the square root matrices is trivial.
• The standard solution (Equations 14.11, 14.12, and 14.15 to 14.17) constitutes explicit formulas for computing the Kalman estimate. The QR method (Equations 14.23 and 14.24) constitutes a method for computing the estimate.
• The QR method is numerically accurate. As in the linear regression problem, the QR method never forms the "square" matrix X^T X. In general, computing the Kalman estimate with the QR method is numerically more accurate than using the explicit formulas.

In summary, the Kalman filter is widely employed in engineering and other fields to track a state vector over time given observations. We studied two methods for finding the optimal estimate: explicit formulas and a method for computing the estimate.

SUMMARY
Random processes are characterized by their time and probabilistic behaviors. The lightbulb process is a simple example of a random process that takes one of two values, 0 and 1:
Pr[X(t) = 1] = Pr[T > t] = e^(−λt)
Pr[X(t) = 0] = Pr[T ≤ t] = 1 − e^(−λt)

E[X(t)] = e^(−λt)
Var[X(t)] = e^(−λt) − e^(−2λt)
R(t₁,t₂) = e^(−λ max(t₁,t₂))
C(t₁,t₂) = e^(−λ max(t₁,t₂)) − e^(−λt₁) e^(−λt₂)

The Poisson process models a wide variety of physical processes. Let N(s) = N(0,s) be the number of points in the interval (0,s) and N(s,t) be the number of points in the interval (s,t). N(s,t) is Poisson with parameter λ(t − s):
E[N(t)] = Var[N(t)] = λt
E[N(s,t)] = Var[N(s,t)] = λ(t − s)
R(s,t) = λs(1 + λt)
C(s,t) = λs

An interesting probability is the one that no points occur in an interval:

Pr[N(s,t) = 0] = e^(−λ(t−s))
Two interesting conditional probabilities for the Poisson process are the following:
Pr[N(t) = l | N(s) = k] = Pr[N(s,t) = l − k]
Pr[N(s) = k | N(t) = l] = ( l! / ((l − k)! k!) ) · (s/t)^k ((t − s)/t)^(l−k)
A Markov chain is a random process in which the conditional probability of the next value given the entire past depends only on the current value. X(n) is Markov if
Pr[X(n+1) = j | X(n) = i, X(n−1) = k, . . . , X(0) = l] = Pr[X(n+1) = j | X(n) = i]
A Markov chain is homogeneous if
Pr[X(n+1) = j | X(n) = i] = p_{i,j}

The probabilities for a homogeneous Markov chain evolve in time as

p(n) = p(0)Pⁿ
where P is the one-step transition probability matrix. Some Markov chains have a stationary distribution. That stationary distribution can be computed three different ways:

1. Compute Pᵏ for some large value k.
2. Compute the left eigenvectors of P. The stationary distribution is the eigenvector corresponding to the eigenvalue 1.
3. Set up and solve the flow equations.

A Markov chain has a unique stationary distribution if it is irreducible and aperiodic. Conceptually, a Markov chain being irreducible and aperiodic means a time n exists such that
Pr[X(l) = j | X(0) = i] > 0   for all i and j and all l ≥ n

In other words, the Markov chain has a unique stationary distribution if the matrix Pⁿ has all nonzero entries for some n.

The Kalman filter is a minimum mean squared error estimator of a state vector given observations corrupted by additive Gaussian noise. The state evolves in time according to a linear equation:

x_{k+1} = F_k x_k + G_k w_k

Observations are made:

z_k = H_k x_k + v_k

The Kalman filter is recursive. Given an estimate x̂_{k−1|k−1}, the prediction of x_k is

x̂_{k|k−1} = F_{k−1} x̂_{k−1|k−1}
The error covariance of the prediction is

Σ_{k|k−1} = F_{k−1} Σ_{k−1|k−1} F_{k−1}^T + G_{k−1} Q_{k−1} G_{k−1}^T
The observation update equations are simplified by defining a gain matrix, Kk :
K_k = Σ_{k|k−1} H_k^T (H_k Σ_{k|k−1} H_k^T + R_k)^{−1}
x̂_{k|k} = x̂_{k|k−1} + K_k (z_k − H_k x̂_{k|k−1})
Σ_{k|k} = (I − K_k H_k) Σ_{k|k−1}
The state prediction update equation has the familiar predictor-corrector form. One numerically accurate, fast, and easy-to-program implementation of the Kalman filter is to apply the QR factorization to a linear regression problem formed from the Kalman state and observation equations.

PROBLEMS

14.1 For the lightbulb process in Section 14.1:
a. What value of t maximizes Var[X(t)]?
b. What is the value of E[X(t)] at this value of t?
c. Why does this answer make sense?

14.2 Is the Poisson process WSS? Why or why not?

14.3 Generate your own version of Figure 14.4.

14.4 Another way of generating a realization of a Poisson process uses Equation (14.1). Given N(t) = l, N(s) is binomial with parameters n = l and p = s/t. A binomial is a sum of Bernoullis, and a Bernoulli can be generated by comparing a uniform to a threshold. Combining these ideas, a realization of a Poisson process can be generated with these two steps:
1. Generate a Poisson random variable with parameter λt. Call the value n.
2. Generate n uniform random variables on the interval (0,t). The values of the n random variables are the points of a Poisson process.
Use this technique to generate your own version of Figure 14.4.

14.5 Write a program to implement a Markov chain. The Markov chain function should accept an initial probability vector, a probability transition matrix, and a time n and then output the sequence of states visited by the Markov chain. Test your program on simple Markov chains, and verify that a histogram of the states visited approximates the stationary distribution.

14.6 Why is the Markov chain in Example 14.2 converging so slowly?

14.7 In Problem 6.23, we directly solved the "first to k competition" (as phrased in that question, the first to win k of n games wins the competition) using binomial probabilities. Assume the games are independent and one team wins each game with probability p. Set up a Markov chain to solve the "first to 2 wins" competition.
a. How many states are required?
b. What is the probability transition matrix?
c. Compute the probability of winning versus p for several values of p using the Markov chain and the direct formula, and show the answers are the same.

14.8 For the "win by 2" Markov chain in Example 14.3, compute the expected number of games to result in a win or loss as a function of p.

14.9 In Problems 6.27 and 6.28, we computed the probability a binomial random variable S_n with parameters n and p is even. Here, we use a Markov chain to compute the same probability:
1. Set up a two-state Markov chain with states even and odd. Draw the state diagram. What is the state transition matrix P? What is the initial probability vector p(0)?
2. Find the first few state probabilities by raising P to the nth power for n = 1,2,3,4.
3. Set up the flow equations, and solve for the steady-state probability S_n is even (i.e., as n → ∞).
4. Solve for the steady-state probabilities from the eigenvalue and eigenvector decomposition of P using p = 0.5,0.6,0.7,0.8,0.9.

14.10 Simulate the example Kalman filter problem in Section 14.4.1 for k = 0,1, . . . ,10 using the standard Kalman recursions.
a. What is the final estimate?
b. How close is the final estimate to the actual value?

14.11 Simulate the example Kalman filter problem in Section 14.4.1 for k = 0,1, . . . ,10 using the QR method. Set up multiple linear regression problems, and use a library function to solve them.
a. What is the final estimate?
b. How close is the final estimate to the actual value?

14.12 The error covariance in a linear regression problem is σ²(X^T X)^{−1}, where σ² is the variance of the noise (often σ² = 1, which we assume in this problem). In the QR method for computing the Kalman filter update (Section 14.4), we said, "The error covariance is Σ_{1|1} = R̃_{1|1}^{−1} R̃_{1|1}^{−T}." Show this.
(Hints: Σ^{−1} = X^T X = R̃^T R̃. Write Σ^{−1} in terms of R̃; calculate R̃^{−1} in terms of R̃_{0|0}, R̃_{0|1}, and R̃_{1|1}; and then compute Σ. Identify the portion of Σ that is Σ_{1|1}. Also, remember that matrix multiplication does not commute: AB ≠ BA in general.)

14.13 Write the state and observation equations for an automobile cruise control. Assume the velocity of the vehicle can be measured, but not the slope of the road. How might the control system incorporate a Kalman filter?
APPENDIX A

COMPUTATION EXAMPLES
Throughout the text, we have used three computation packages: Matlab, Python, and R. In this appendix, we briefly demonstrate how to use these packages. For more information, consult each package’s documentation and the many websites devoted to each one. All the examples in the text have been developed using one or more of these three packages (mostly Python). However, all the plots and graphics, except those in this appendix, have been recoded in a graphics macro package for a consistent look.
A.1 MATLAB

Matlab is a popular numerical computing package widely used at many universities. To save a little space, we have eliminated blank lines and compressed multiple answers into a single line in the Matlab results below.

Start with some simple calculations:

>> conv([1,1,1,1],[1,1,1,1])
ans = 1 2 3 4 3 2 1
>> for k = 0:5
nchoosek(5,k)
end
ans = 1 5 10 10 5 1
Here is some code for the Markov chain example in Section 14.3:

>> P = [[0.9, 0.1]; [0.2, 0.8]]
P =
    0.9000    0.1000
    0.2000    0.8000
>> P2 = P * P
P2 =
    0.8300    0.1700
    0.3400    0.6600
>> P4 = P2 * P2
P4 =
    0.7467    0.2533
    0.5066    0.4934

The probability a standard normal random variable is between −1.96 and 1.96 is 0.95:

>> cdf('Normal',1.96,0,1)-cdf('Normal',-1.96,0,1)
ans = 0.9500

The probability of getting three heads in a throw of six coins is 0.3125:

>> pdf('Binomial',3,6,0.5)
ans = 0.3125

Compute basic statistics with the data from Section 10.6. The data exceed the linewidth of this page and are broken into pieces for display:

>> data = [0.70, 0.92, -0.28, 0.93, 0.40, -1.64, 1.77, 0.40, -0.46, -0.31, 0.38, 0.63, -0.79, 0.07, -2.03, -0.29, -0.68, 1.78, -1.83, 0.95];

Compute the mean, median, sample variance, and interquartile distance:

>> mean(data), median(data), var(data), iqr(data)
ans = 0.0310, 0.2250, 1.1562, 1.3800

The sample distribution function of this data can be computed and plotted as follows:

>> x = linspace(-3,3,101);
>> y = cdf('Normal',x,0,1);
>> plot(x,y,'LineWidth',3,'Color',[0.75,0.75,0.75])
>> [fs, xs] = ecdf(data);
>> hold on;
>> stairs(xs,fs,'LineWidth',1.5,'Color',[0,0,0])
>> xlabel('x','FontSize',14), ylabel('CDF','FontSize',14)
>> title('Sample CDF vs Gaussian CDF','FontSize',18)
>> saveas(gcf,'MatlabCDFplot','epsc')
The plot is shown below. We used options to make the continuous curve a wide gray color, to make the sample curve a bit narrower and black, and to increase the size of the fonts used in the title and labels.
[Figure: "Sample CDF vs. Gaussian CDF," with the sample CDF as black stairs and the Gaussian CDF as a wide gray curve; x runs from −3 to 3 and the vertical axis is the CDF.]
A.2 PYTHON

Python is an open source, general-purpose computing language. It is currently the most common first language taught at universities in the United States. The core Python language has limited numerical and data analysis capabilities. However, the core language is supplemented with numerous libraries, giving the "batteries included" Python similar capabilities to R and Matlab. In Python, numpy and matplotlib give Python linear algebra and plotting capabilities similar to those of Matlab. scipy.stats and statistics are basic statistical packages.

Start Python, and import the libraries and functions we need:

>>> import numpy as np
... import matplotlib.pyplot as plt
... import scipy.stats as st
... from statistics import mean, median, variance, stdev

Do routine analysis on the data from Section 10.6:
>>> data = [0.70, 0.92, -0.28, 0.93, 0.40, -1.64,
...     1.77, 0.40, -0.46, -0.31, 0.38, 0.63, -0.79, 0.07,
...     -2.03, -0.29, -0.68, 1.78, -1.83, 0.95]
>>> print(mean(data), median(data), variance(data), stdev(data))
0.03100000000000001 0.225 1.15622 1.0752767085731934
Define a Gaussian random variable Z ∼ N(0,1). Generate a vector of 100 samples from the distribution, and plot a histogram of the data against the density:

>>> x = np.linspace(-3,3,101)
... Z = st.norm()    # N(0,1) random variable
... plt.plot(x, Z.pdf(x), linewidth=2, color='k')
... ns, bins, patches = plt.hist(Z.rvs(100), density=True,
...     range=(-3.25,3.25), bins=13, color='0.9')
... plt.xlabel('x')
... plt.ylabel('Normalized Counts')
... plt.savefig('PythonHistCDF.pdf')
Now, use the same Gaussian random variable to compute probabilities:

>>> # some basic Gaussian probabilities
... Z.cdf(1.96)-Z.cdf(-1.96), Z.ppf(0.975)
(0.95000420970355903, 1.959963984540054)

Do some of the calculations of the birthday problem in Section 3.6. For n = 365 people, a group of 23 has a better than 0.5 chance of a common birthday. That is, for the first 22 group sizes, Pr(no pair) > 0.5:

>>> n = 365
... days = np.arange(1,n+1)
... p = 1.0*(n+1-days)/n
... probnopair = np.cumprod(p)
... np.sum(probnopair>0.5)
22
Plot the probability of no pair versus the number of people in the group:

>>> xupper = 30
... cutoff = 23
... plt.plot(days[:xupper], probnopair[:xupper], days[:xupper],
...     probnopair[:xupper], '.')
... plt.axvline(x=cutoff, color='r')
... plt.axhline(y=0.5)
... plt.xlabel('Number of People in Group')
... plt.ylabel('Probability of No Match')
... plt.savefig('PythonBirthday.pdf')
A.3 R

R is a popular open source statistics and data analysis package. It is widely used in universities, and its use is growing in industry. R's syntax is somewhat different from that in Matlab and Python. For further study, the reader is urged to consult the many texts and online resources devoted to R.

Start R:

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)
...

We begin with some simple probability calculations. The probability of two Aces in a selection of two cards from a standard deck is 0.00452 = 1/221:

> choose(5,3)
[1] 10
> choose(5,c(0,1,2,3,4,5))
[1]  1  5 10 10  5  1
> choose(4,2)/choose(52,2)
[1] 0.004524887
> choose(52,2)/choose(4,2)
[1] 221

Note that the syntax above for creating a vector, c(0,1,2,3,4,5), differs from the syntax in Matlab and Python. The probability a standard normal random variable is between −1.96 and 1.96 is 0.95:

> pnorm(1.96); qnorm(0.975)
[1] 0.9750021
[1] 1.959964
> pnorm(1.96)-pnorm(-1.96)
[1] 0.9500042

To illustrate R's data handling capabilities, import data from a Comma Separated Value (CSV) file. The data are grades for 20 students from two midterms:

> grades = read.csv('Grades.csv',header=TRUE)
> grades
   Midterm1 Midterm2
1        53       59
2        38       50
3        53       56
4        55       61
5        32       18
6        48       57
7        56       39
8        47       24
9        44       22
10       94       86
11       66       18
12       62       57
13       56       45
14       94       63
15       70       51
16       88       89
17       56       47
18      100       96
19       75       67
20       88       86
The summary command is a simple way to summarize the data:

> summary(grades)
    Midterm1         Midterm2
 Min.   : 32.00   Min.   :18.00
 1st Qu.: 51.75   1st Qu.:43.50
 Median : 56.00   Median :56.50
 Mean   : 63.75   Mean   :54.55
 3rd Qu.: 78.25   3rd Qu.:64.00
 Max.   :100.00   Max.   :96.00
Clearly, grades on Midterm2 are, on average, lower than those on Midterm1. It is easiest to access the columns if we attach the data frame:

> attach(grades)

For instance, the correlation between the two columns is 0.78:

> cor(Midterm1, Midterm2)
[1] 0.7784707

A stem-and-leaf plot is an interesting way to represent the data. It combines features of a sorted list and a histogram:

> stem(Midterm1,scale=2)

  The decimal point is 1 digit(s) to the right of the |

   3 | 28
   4 | 478
   5 | 335666
   6 | 26
   7 | 05
   8 | 88
   9 | 44
  10 | 0
The sorted data can be read from the stem plot: 32, 38, 44, etc. Of the 20 scores, six were in the 50s. (Note, the stem command in R differs from the stem command in Matlab and Python. Neither Matlab nor Python has a stem-and-leaf plot command standard, but such a function is easy to write in either language.)

Fit a linear model to the data, and plot both the data and the fitted line. To make the plot more attractive, we use a few of the plotting options:

> lmfit = lm(Midterm2 ~ Midterm1)
> pdf('grades.pdf')
> plot(Midterm1,Midterm2,cex=1.5,cex.axis=1.5,cex.lab=1.5,las=1)
> abline(lmfit)
> dev.off()
[Figure: scatter plot of Midterm2 versus Midterm1 with the fitted regression line; Midterm1 runs from 30 to 100 and Midterm2 from 20 to 80.]
APPENDIX B

ACRONYMS

AUC     Area Under the ROC Curve
CDF     Cumulative Distribution Function
CLT     Central Limit Theorem
CSV     Comma Separated Value
DFT     Discrete Fourier Transform
DTFT    Discrete Time Fourier Transform
ECC     Error Correcting Coding
IID     Independent and Identically Distributed
IQ      Intelligence Quotient
JPEG    Joint Photographic Experts Group
KDE     Kernel Density Estimate
KL      Kullback-Leibler Divergence
LTP     Law of Total Probability
MAC     Message Authentication Code
MAP     Maximum a Posteriori
MGF     Moment Generating Function
MLE     Maximum Likelihood Estimate
MMSE    Minimum Mean Squared Error
MSE     Mean Squared Error
PDF     Probability Density Function
PMF     Probability Mass Function
PSD     Power Spectral Density
PSK     Phase Shift Keying
QAM     Quadrature Amplitude Modulation
ROC     Receiver Operating Characteristic
SNR     Signal-to-Noise Ratio
WSS     Wide Sense Stationary
8PSK    Eight-Point Phase Shift Keying
4QAM    Four-Point Quadrature Amplitude Modulation
16QAM   16-Point Quadrature Amplitude Modulation
APPENDIX C

PROBABILITY TABLES
C.1 TABLES OF GAUSSIAN PROBABILITIES

TABLE C.1 Values of the Standard Normal Distribution Function

  z    Φ(z)   |  z    Φ(z)   |  z    Φ(z)   |  z    Φ(z)
 0.00  0.5000 | 1.00  0.8413 | 2.00  0.9772 | 3.00  0.9987
 0.05  0.5199 | 1.05  0.8531 | 2.05  0.9798 | 3.05  0.9989
 0.10  0.5398 | 1.10  0.8643 | 2.10  0.9821 | 3.10  0.9990
 0.15  0.5596 | 1.15  0.8749 | 2.15  0.9842 | 3.15  0.9992
 0.20  0.5793 | 1.20  0.8849 | 2.20  0.9861 | 3.20  0.9993
 0.25  0.5987 | 1.25  0.8944 | 2.25  0.9878 | 3.25  0.9994
 0.30  0.6179 | 1.30  0.9032 | 2.30  0.9893 | 3.30  0.9995
 0.35  0.6368 | 1.35  0.9115 | 2.35  0.9906 | 3.35  0.9996
 0.40  0.6554 | 1.40  0.9192 | 2.40  0.9918 | 3.40  0.9997
 0.45  0.6736 | 1.45  0.9265 | 2.45  0.9929 | 3.45  0.9997
 0.50  0.6915 | 1.50  0.9332 | 2.50  0.9938 | 3.50  0.9998
 0.55  0.7088 | 1.55  0.9394 | 2.55  0.9946 | 3.55  0.9998
 0.60  0.7257 | 1.60  0.9452 | 2.60  0.9953 | 3.60  0.9998
 0.65  0.7422 | 1.65  0.9505 | 2.65  0.9960 | 3.65  0.9999
 0.70  0.7580 | 1.70  0.9554 | 2.70  0.9965 | 3.70  0.9999
 0.75  0.7734 | 1.75  0.9599 | 2.75  0.9970 | 3.75  0.9999
 0.80  0.7881 | 1.80  0.9641 | 2.80  0.9974 | 3.80  0.9999
 0.85  0.8023 | 1.85  0.9678 | 2.85  0.9978 | 3.85  0.9999
 0.90  0.8159 | 1.90  0.9713 | 2.90  0.9981 | 3.90  1.0000
 0.95  0.8289 | 1.95  0.9744 | 2.95  0.9984 | 3.95  1.0000
TABLE C.2 Values of the Standard Normal Tail Probabilities

  z    1 − Φ(z)   |  z    1 − Φ(z)   |  z    1 − Φ(z)
 0.00  5.0000e-01 | 2.00  2.2750e-02 | 4.00  3.1671e-05
 0.10  4.6017e-01 | 2.10  1.7864e-02 | 4.10  2.0658e-05
 0.20  4.2074e-01 | 2.20  1.3903e-02 | 4.20  1.3346e-05
 0.30  3.8209e-01 | 2.30  1.0724e-02 | 4.30  8.5399e-06
 0.40  3.4458e-01 | 2.40  8.1975e-03 | 4.40  5.4125e-06
 0.50  3.0854e-01 | 2.50  6.2097e-03 | 4.50  3.3977e-06
 0.60  2.7425e-01 | 2.60  4.6612e-03 | 4.60  2.1125e-06
 0.70  2.4196e-01 | 2.70  3.4670e-03 | 4.70  1.3008e-06
 0.80  2.1186e-01 | 2.80  2.5551e-03 | 4.80  7.9333e-07
 0.90  1.8406e-01 | 2.90  1.8658e-03 | 4.90  4.7918e-07
 1.00  1.5866e-01 | 3.00  1.3499e-03 | 5.00  2.8665e-07
 1.10  1.3567e-01 | 3.10  9.6760e-04 | 5.10  1.6983e-07
 1.20  1.1507e-01 | 3.20  6.8714e-04 | 5.20  9.9644e-08
 1.30  9.6800e-02 | 3.30  4.8342e-04 | 5.30  5.7901e-08
 1.40  8.0757e-02 | 3.40  3.3693e-04 | 5.40  3.3320e-08
 1.50  6.6807e-02 | 3.50  2.3263e-04 | 5.50  1.8990e-08
 1.60  5.4799e-02 | 3.60  1.5911e-04 | 5.60  1.0718e-08
 1.70  4.4565e-02 | 3.70  1.0780e-04 | 5.70  5.9904e-09
 1.80  3.5930e-02 | 3.80  7.2348e-05 | 5.80  3.3157e-09
 1.90  2.8717e-02 | 3.90  4.8096e-05 | 5.90  1.8175e-09
TABLE C.3 Values of the Standard Normal Quantile Function

  p    Q(p)  |  p    Q(p)  |  p    Q(p)
 0.50  0.000 | 0.67  0.440 | 0.84  0.994
 0.51  0.025 | 0.68  0.468 | 0.85  1.036
 0.52  0.050 | 0.69  0.496 | 0.86  1.080
 0.53  0.075 | 0.70  0.524 | 0.87  1.126
 0.54  0.100 | 0.71  0.553 | 0.88  1.175
 0.55  0.126 | 0.72  0.583 | 0.89  1.227
 0.56  0.151 | 0.73  0.613 | 0.90  1.282
 0.57  0.176 | 0.74  0.643 | 0.91  1.341
 0.58  0.202 | 0.75  0.674 | 0.92  1.405
 0.59  0.228 | 0.76  0.706 | 0.93  1.476
 0.60  0.253 | 0.77  0.739 | 0.94  1.555
 0.61  0.279 | 0.78  0.772 | 0.95  1.645
 0.62  0.305 | 0.79  0.806 | 0.96  1.751
 0.63  0.332 | 0.80  0.842 | 0.97  1.881
 0.64  0.358 | 0.81  0.878 | 0.98  2.054
 0.65  0.385 | 0.82  0.915 | 0.99  2.326
 0.66  0.412 | 0.83  0.954 | 1.00  ∞
APPENDIX D

BIBLIOGRAPHY
Isaac Newton is reported to have said, “If I have seen further than others, it is by standing upon the shoulders of giants.” I have had the opportunity to learn from and rely upon many great texts. Here are those that helped me. When I was a student, I learned from several classics: • Peter J. Bickel and Kjell A. Doksum. Mathematical Statistics: Basic Ideas and Selected Topics. Holden-Day, 1977. • Patrick Billingsley. Probability and Measure. John Wiley & Sons, 1979. • William Feller. An Introduction to Probability Theory and Its Applications. John Wiley & Sons, 1968. • Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill Series in Systems Science. McGraw-Hill, 1965. • Michael Woodroofe. Probability with Applications. McGraw-Hill, 1975. As an instructor, I have had the opportunity to teach from a number of texts, including: • George R. Cooper and Clare D. McGillem. Probabilistic Methods for Signal and System Analysis. Oxford University Press, 1999. • Carl W. Helstrom. Probability and Stochastic Processes for Engineers. Macmillan, 1991. • Peyton Z. Peebles, Jr. Probability, Random Variance, and Random Signal Principles. McGraw-Hill, 1993. • Roy D. Yates and David J. Goodman. Probability and Stochastic Processes. John Wiley & Sons, 2005. Other books I have consulted include: • Brian D. O. Anderson and John B. Moore. Optimal Filtering. Information and System Sciences Series. Prentice-Hall, 1979.
• Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991. • E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003. • Samuel Karlin and Howard M. Taylor. A First Course in Stochastic Processes. Academic Press, 1975. • Alberto Leon-Garcia. Probability, Statistics, and Random Processes for Electrical Engineering. Pearson Education, 2008. • Masud Mansuripur. Introduction to Information Theory. Prentice-Hall, 1987. • D. S. Sivia and J. Skilling. Data Analysis: A Bayesian Tutorial. Oxford University Press, 2008. • Henry Stark and John W. Woods. Probability, Statistics, and Random Processes for Engineers. Pearson Education, 2012. • Edward R. Tufte. The Visual Display of Quantitative Information. Graphics Press, 1983. I would be remiss if I did not acknowledge some Internet resources: • Wikipedia (https://www.wikipedia.org/) has been particularly helpful. I have consulted it many times. • Wolfram Mathworld (http://mathworld.wolfram.com/) is another useful reference. • Numerous websites for Matlab, Python, and R. It is impossible to list them all. To those authors whose texts I have read and studied from but accidentally omitted, please accept my apologies.
INDEX
References to tables and figures are denoted by an italicized t and f 16-point quadrature amplitude modulation (16QAM), 249–50, 253–54, 263–64 accessible, 377 acronyms, 399–400 affine transformation, 81, 290 Alohanet, 152–54, 161 alternative hypothesis, 324, 326, 336 amplitude modulation (AM), 354–56, 364 Anaconda, 24 aperiodic, 377 a posteriori density, 293, 294 a posteriori probability Bayes theorem, 37, 292 drug testing, 38 a priori distribution, 295 a priori probability drug testing, 38 a priori value, 294 α-trimmed mean, 275 autocorrelation random process, 341, 363 wide sense stationary process, 346, 347 autocorrelation matrix, 300 autocovariance random process, 341, 363 wide sense stationary process, 346, 347 average power, random process, 348, 349, 363 axioms, probability, 9–10 bandlimited, 358, 360–62, 364 Bayesian estimation, 291–95 Bayes theorem, 32–34, 42, 196, 336 a posteriori probability, 37 definition of, 198t
Bernoulli distribution, 78, 96 Bernoulli probabilities, 140, 141 Bernoulli random variables, 117, 134 beta density, 292, 293 bias, 151 big table, 21–22 binary channel, 36–37 binary entropy function, 122, 123 binary Markov chain, 376 binary sequences, encoding, 127–28 binary symmetric channel, 37 binomial, 137, 160 binomial coefficient, 66, 70 combinations and, 53–54 Matlab function for computing, 53f binomial distribution, 77, 160 basics of, 137–41 moments of, 142–44 negative, 148–49, 161 parameter estimation, 151–52 binomial probability, 160 Alohanet, 152–54, 161 basics of binomial distribution, 137–41 computing, 141–42 connections between, and hypergeometric probabilities, 146–47 distributions, 146–50 error control codes, 154–60 independent binomial random variables, 144–45 multinomial probabilities, 147–48 negative binomial distribution, 148–49, 161 parameter estimation for distributions, 151–52 405
406
INDEX
Pascal’s triangle, 140f Poisson distribution, 149–50 binomial random variables, sums of independent, 144–45 binomial theorem, 54–55, 66 birthday problem, 57–61, 67, 69 Blackjack (21), 71 blind test, 39–40 Boole’s inequality, 15 bowling leagues, 272 Box-Muller method, 242, 243, 257, 262 bridge, 57 Bzip2, 121 Canopy, 24 card games, hypergeometric probabilities and, 61–65 Poisson distribution, 149–50 poker hands, 62–65 Seven-Card Stud, 62 Texas Hold ’em, 62, 65, 71, 93 Cauchy distribution, 217 Cauchy-Schwarz inequality, 135, 136 cell phone network, 89 censored exponential, 288–89 center of mass, 82 central limit theorem (CLT), 230–35, 259 chain rule, 32, 35 channel capacity, 160 characteristic function, 85, 230 Chebyshev’s inequality, 82–83, 266–67, 359 chi-squared (χ2 ) distribution, 235, 238–39, 260 chi-squared variance test, 283 Cholesky decomposition, 304, 321 Cholesky factorization, 312 Chuck-a-luck game, 165 circularly symmetric, 241 cloud, Gaussian points, 240, 240f, 264 code/coding coding rate, 157 coding trees, 126, 128 encoding binary sequences, 127–28 error control codes, 154–60 error correcting coding (ECC), 154–55, 157–58, 160, 161, 165
general linear block codes, 157–60 Golay, 160, 166 Huffman, 125, 127, 128, 132 linear block codes, 155 repetition-by-three code, 155–57, 165 variable length coding, 123–27 words, 124 see also decoding coin flip, 2, 3, 25–26 collision, 152–53 collision resolution, 164 combinations, 50, 66 binomial coefficients and, 53–54 combinatorics basics of counting, 47–52 binomial theorem, 54–55 birthday paradox, 57–61, 67, 69 card games, 61–65 computation notes, 52–53 hypergeometric probabilities, 61–65, 67 message authentication, 60–61 multinomial coefficient and theorem, 55–57 ordered and with replacement, 49, 66 ordered and without replacement, 49–50, 66 unordered and with replacement, 51–52, 66 unordered and without replacement, 50, 66 values of factorials n!, 50t communications Alohanet, 152–54 cell phone network. 89 entropy, 122 network, 17 outcomes, 18–19, 19t, 21t probability of complement, 20–21, 23 probability of union, 19–20 reliability networks, 18 source node with destination node, 17–23 see also digital communications complement, 3 probability of, 20–21, 23 complex, 343
INDEX 407
complimentary error function, 225 compound experiments, 15–16 computations binomial probabilities, 141–42 convolutions, 115 linear regression, 311–13 Matlab, 313, 391–93 notes on, 52–53 Octave, 23 procedures, 23–24 Python, 313, 393–95 QR method, 312–13 R, 313, 395–98 single random variable, 171–73 conditional density, 180, 200, 201, 292 calculation of, 196, 197f conditional expected value, 197, 200, 201 conditional probability, 29, 41–42, 189 Bayes theorem, 32–34, 37 binary channel, 36–37 chain rule, 32, 35 continuous random variable, 179–82 definitions of, 29–32, 198t diamond network example, 40–41 drug testing, 38–40 law of total probability (LTP), 32–34, 37 multiple continuous random variables, 195–98 urn models, 34–36 Venn diagram, 30f confidence intervals, 60, 255 sample size effect, 268f statistics quality measure, 280–82 confidence level, 255 conjugate distribution, 292 consistent, 118, 255, 266 consistent estimate, 151 constellation, 248–54, 259, 263–64 constraint, 128 continuous, 167, 187 continuous distributions, 214, 215t exponential distribution, 176–79 uniform distribution, 174–76 continuous random variable, 167–71 calculations for single random variable, 171–73
conditional probabilities for, 179–82 continuous distributions, 174–79 cumulative distribution function (CDF), 168 delta functions, 182–83 density function, 169f, 170f discrete PMFs, 182–83 distribution function, 168 expected values for, 170–71 probability density function (PDF), 168–69, 170f quantization, 184–87, 191 selected continuous distributions, 174–79 continuous signal, 360 continuous time Fourier transform properties, 344t convolutions, computing, 115 correction, 244 correlated, 106 correlation, 106, 131 multiple continuous random variables, 194, 215 random vectors, 300 correlation coefficient, 107 counting, basics of, 47–52 covariance, 106, 131 error, 315, 388 matrix, 300, 301, 303 multiple continuous random variables, 194, 215 random vectors, 300 crossover probabilities, 37 cumulative distribution function (CDF), 95, 187–88 Bernoulli distribution, 78f continuous random variable, 168 mapping probabilities to outcomes, 77–78 joint CDF, 110 decibels (dB), 187 decoding error, 156 error rate, 156, 159 maximum likelihood decoding, 156
repetition-by-three code, 156f see also code/coding degrees of freedom, 238, 262, 281, 283–84 delta functions, 182–83, 363 DeMorgan’s laws, 4, 5f, 20, 24 DeMorgan’s theorem, 14 density, 79 density estimates, 278–80 density functions, 171, 187, 188 integrating the, 169, 170f, 173 probability, 168–69 destination node, 17 determinant, 301 diamond network, 40–41 digital communications background, 246–47 discrete time model, 247–53 eight-point phase shift keying (8PSK), 248–51, 263 four-point quadrature amplitude modulation (4QAM), 248–53, 263–64 Monte Carlo experiments, 253–58 phase shift keying (PSK), 248–49, 263 16-point quadrature amplitude modulation (16QAM), 249–50, 253–54, 263–64 quadrature amplitude modulation (QAM), 246–58, 260 digital-to-analog converter, 184 Dirac delta function, 182–83, 226 discrete distributions, 214, 215t discrete random variable, 75–76 example of two, 108–13 discrete probabilities expected values, 78–83, 95 see also probability mass functions (PMFs) discrete time Fourier transform (DTFT), 345–46, 345t discrete time Kalman filter, 381 discrete time model, 247–53 discrete time Wiener filter, 357, 364, 365 disjoint, 4 distinguishable items, 47 distribution, 77, 182
distribution function, 77, 168, 177f, 187 empirical, 276–77 estimating the, 276–78 half-life, 179 sample, 276–77 double blind trial, 39–40, 45 double exponential distribution, 236 double subscript, 52 doubly stochastic, 373 drug testing, 38–40 drunken random walk, 364 Edison’s light bulb, 69 eigenvectors, 311 eight-point phase shift keying (8PSK), 248–51, 263 election poll (simple), 265–69 empirical, 117 empirical distribution function, 276–77 empty set, 3 encryption machine, Enigma, 73–74 entropy, 121, 132, 134 encoding binary sequences, 127–28 information theory and, 121–23 maximum, 128–30 variable length coding, 123–27 Erlang density, 203–5, 211, 215t, 218, 232 Erlang random variable, 262 error, 289 function, 225 prediction, 310 probability, 251–53, 254f error control codes, 154–60 general linear block codes, 157–60 repetition-by-three code, 155–57 error correcting coding (ECC), 154–55, 157–58, 160, 161, 165 estimation Bayesian, 291–95 introduction to theory, 285–89 minimum mean squared error, 289–91 estimator, 266 Euler’s formula, 343, 344 even quantizer, 185, 191 events, 3–4, 24 expected values, 131
continuous random variable, 170–71 density function, 188 multiple continuous random variables, 194, 215 probabilistic average, 78–83, 95 random vectors, 299 tools of ordinary algebra, 80–81 two random variables, 105–6 experiment, 3–4, 24 explicit formulas, 386 exponential, 176 exponential density, 177f exponential density with unknown parameter, 288 exponential distribution, 176–79 parameter estimation of, 214 exponential weighting, 273–74, 296 Facebook, 46 facial recognition system, 46 fair, 93 false alarm, 38, 42, 45, 46, 325, 336 false negative (FN), 38, 325, 337 false-negative rate, 325, 337 false positive (FP), 38, 42, 325, 336 false-positive rate, 325, 336 F distribution, 235, 238–39 Fibonacci sequence, 271 filtering, 382 financial decision making, 92–94 Fisher, Ronald, 239 Fourier transforms, 342–46, 363 continuous time properties, 344t discrete time properties, 345t, 346 fast, 346 frequency, 342–43 imaginary part, 343 inverse, 343, 363 phase, 343 real part, 343 four-point quadrature amplitude modulation (4QAM), 248–53, 263–64 frequency variable, 342
gain, 272 gain matrix, 383–84, 388 gambling, 92–94 Gamma function, 238, 262, 292 Gaussian density, 221, 223, 226, 231–32, 235f, 237, 259 “bell curve” shape, 221f estimated, 279 Gaussian distribution, 221–26, 259 central limit theorem (CLT), 230–35, 259 central probabilities, 225 chi-squared (χ2 ), 235, 238–39 circularly symmetric, 241 double exponential distribution, 236 estimating Gaussian distribution function, 222, 277–78 F distributions, 235, 238–39 Gaussian density, 221f, 223, 226, 231–32, 235f, 237, 259 Laplace distribution, 236, 236f moments of, 228–30, 259 multiple Gaussian random variables, 240–46 one-sided Gaussian probabilities, 226f probabilities, 223–27, 401t, 402t quantile function, 227f, 227–28 Rayleigh distribution, 235, 236f, 236–37, 242–43, 256–58, 260 related distributions, 235–39 z-scores, 223 Gaussian random variables, 303t “cloud” of 1000 two-dimensional Gaussian points, 240f, 245, 246 independent, 240–41 moment generating function (MGF) of vector, 302 multiple, 240–46 Python, 374 transformation to polar coordinates, 241–43 two correlated, 243–46 Gaussian random vectors, 298–303 linear operations on, 303–4 Gaussian random walk, 364–65 Gaussian thermal noise, 233, 352, 363 Gaussian with unknown mean, 286–87
Gaussian with unknown mean and variance, 287 general linear block codes, 155, 157–60 generating function, 85 geometric probability mass function (PMF), 87–89, 96 geometric random variables, 134 Givens rotation, 322, 323 Golay code, 160, 166 Gosset, William, 281 gradient, linear regression, 307 Gray codes, 263 Gzip, 121 half-life, 179 Hamming distance, 155, 158, 166 hertz, 247 H ("hat" matrix), 310–11 Higgs boson, 285 histograms as bar graphs, 120f characterizing quantization error, 186–87 comparing uniform probability mass function (PMF) and, 119f computing environments, 120 estimating PMFs, 119–20 homogeneous Markov chains, 372, 374, 387 Huffman code, 125, 132, 134, 293 Huffman tree, 127, 128 hypergeometric probabilities, 67, 145, 161 card games, 61–65 connection between binomial and, 146–47 hypotheses, 37 conventions in estimation, 326 maximum a posteriori (MAP) tests, 335–36, 337 radar detection, 326–31 receiver operating characteristic (ROC) graph, 328–31, 337 hypothesis testing alternative, 324, 326, 336 likelihood ratios, 331–35 Neyman-Pearson test, 332, 333, 335, 337 null hypothesis, 282, 285, 324, 326, 336
principles, 324–26 radar detection example, 326–31 sensitivity, 325, 336, 337 specificity, 325 idempotent, 310 imaginary part, Fourier transform, 343 inclusion-exclusion formula, 13–14 independence, 7, 16–17 multiple continuous random variables, 194–95 multiple random variables, 104–5, 109 independent and identically distributed (IID), 104–5 IID Bernoulli random variables, 137, 142, 144, 151, 160 independent binomial random variables, sums of, 144–45 independent Gaussian random variables, 240–41 independent increments, 368, 369 independent random variables, sums of, 113–17, 202–5 indicator function, 82 induction, 13–14 inequality, 15 Cauchy-Schwarz, 135, 136 Chebyshev’s, 82–83, 266–67, 359 information theory, entropy and, 121–23 innovation, 272 "Instant" lottery, 72 insurance, buying, 94 integrals, 171, 187 intelligence quotient (IQ), 291 interarrival times, 370 interquartile distance, 275–76 interquartile range, 275 intervals, 167 probabilities of, 170f inverse Fourier transform, 343, 363 Ipython, 24 irreducible, 377 Jacobian of the transformation, 201, 209–10, 212–13, 216
joint density, 192–93, 196 joint distribution function joint density and, 192–93 multiple continuous random variables, 192–93 multiple discrete random variables, 102–3, 131 joint entropy function, 123 Joint Photographic Experts Group (JPEG), 121 joint probability mass function, 101, 130–31, 132 Kalman filter, 381, 387–88, 389 discrete time, 381 equations, 383, 385 optimal filter, 381–84 QR method for, 384–86 kernel density estimate (KDE), 279, 280f, 297 kernel function, 279 Kronecker delta function, 354 Kullback-Leibler divergence (KL), 135 Lagrange multipliers, 128, 129, 130 Laplace distribution, 236 Laplace transform, 116 law of total probability (LTP), 32–34, 37, 42, 202 continuous version of, 196–97, 205, 216 definition, 198t least absolute deviations regression, 318 least absolute residuals, 318 least absolute value regression, 318 least mean squared error (LMSE), 290 least squares, 306, 307 Levinson-Durbin algorithm, 357 L’Hospital’s rule, 85 life table, 181–82, 190, 191 lightbulb process, 366–68, 386 likelihood function, 286 likelihood ratios, 336, 337 hypothesis tests and, 331–35 linear, 290 linear algebra, 304, 308
linear block codes, 155, 157–60 linear filters, wide sense stationary (WSS) signals and, 350–52 linear least squares, 307 linear operations, Gaussian random vectors, 303–4 linear regression, 304–19 class of parameter estimation problems, 305–8 computational tasks, 311–13 error covariance, 315 estimated error variance, 315–16 examples, 313–17 extensions of, 317–19 gradient, 307 illustration of problem, 306f Kalman filter, 385 least absolute value regression, 318 linear least squares, 307 model, 306f, 315–16 noise variance, 315 nonlinear least squares regression, 319 normal equations, 305, 307–8, 312, 317, 320 problem, 304–6, 306f, 312, 316, 317, 320 projection theorem, 307, 308f recursive, 318 regression model, 315 statistics of, 309–11 straight-line example, 315f temperature example, 316–17 weighted, 317, 322 loaded die, 27 log-likelihood function, 286 log-likelihood ratio, 331, 332, 334f, 335f, 337 Gaussian vs. Gaussian test, 333f Laplacian vs. Gaussian test, 334f looping code, 263 magnitude, Fourier transform, 343 MAP (maximum a posteriori) tests, 335–36, 337 marginal density, 193, 195, 199–201, 218 marginal probability mass functions, 101, 109, 131
Markov, Andrey, 372n1 Markov chains binary, 376f homogeneous, 372, 374, 387 Monte Carlo, 294, 319 Python commands, 379 random process, 372–81, 387 simple two-state, 373f state diagram, 380f stationary distribution, 377–81 three-state, 378f time inhomogeneous, 372 transition probabilities, 372, 373, 380 “win by 2” competition, 379–80, 380f, 389 Matlab, 23, 53, 120 computation package, 391–93 computing binomial coefficient, 53f estimating probability of error, 254f Gaussian distribution, 224–25 looping code, 263 Pascal’s triangle-like recursion, 160 QR algorithm, 313 sampling theorem, 360–61 Matplotlib, 23 maximum a posteriori (MAP) tests, 335–36, 337 maximum entropy, 128–30 maximum likelihood decoding, 156 maximum likelihood estimate (MLE), 286 Maxwell-Boltzmann distribution, 262 mean, 79, 82, 95, 117–19 continuous random variables, 194, 215 estimating, 269–71 Gaussian distribution, 228–30 mean squared error (MSE), 289–90 median, 274 message authentication code (MAC), 60–61 Microsoft Excel, 23 minimum mean squared error (MMSE), 292 estimation, 289–91 miss, 38, 42, 46, 325, 337 moment generating function (MGF), 90, 95, 188, 190 calculating moments, 188, 190
computing moments, 178 expected values, 83–85 Gaussian distribution, 259 independent random variables, 203–4 mean and variance using, 143, 144 vector Gaussian random variable, 302 moment of inertia, 82 moments, 96, 131 binomial distribution, 142–44 Gaussian distribution, 228–30, 259 multiple continuous random variables, 194 two random variables, 106–8 Monte Carlo, 24 Monte Carlo experiments, 264, 269 computing probability of error, 253–58, 260 Matlab code for estimating error probability, 254f table of speedups, 257t Monty Hall problem, 44 multinomial coefficient, 55–56, 66 multinomial distribution, 147, 161 parameter estimation, 151–52 multinomial probability, 147–48 multinomial theorem, 56–57 multiple continuous random variables comparing discrete and continuous distributions, 214, 215t conditional probabilities for, 195–98 expected values, 194, 215 extended example, 198–202 general transformations, 207–13 independence, 194–95 Jacobian of the transformation, 201, 209–10, 212–13, 216 joint density, 192–93 joint distribution functions, 192–93 mean, 194, 215 moments, 194 parameter estimation for exponential distribution, 214 random sums, 205–7 sums of independent, 202–5 variance, 194, 215
multiple Gaussian random variables, 240–46 independent, 240–41 transformation to polar coordinates, 241–43 two correlated, 243–46 multiple random variables entropy and data compression, 120–30 example of two discrete, 108–13 histograms, 119–20 independence, 104–5, 109 joint cumulative distribution function (CDF), 110 moments and expected values, 105–8 probabilities, mean and variance, 117–19 probability mass functions (PMFs), 101–3 transformations with one output, 110–12 transformations with several outputs, 112 mutually exclusive, 4 nearest neighbor, 250 negative binomial distribution, 148–49, 161 network communications, 17 diamond, 40f, 40–41 reliability, 18 three-path, 21f two-path, 18f, 27f see also digital communications Newton, Isaac, 165, 403 Neyman, Jerzy, 332n1 Neyman-Pearson test, 332, 333, 335, 337 n! (factorial), 49, 50t noise, 352 Gaussian thermal, 352, 363 Poisson, 352, 363 probabilistic properties of, 352–53, 363 quantization, 352, 363 shot, 352, 363 signal-to-noise ratio (SNR), 92, 187, 353 spectral properties of, 353–54 white, 353–54, 363 nonlinear least squares regression, 319 nonnegative definite, 300
nonparametric approach, density or distribution function, 276, 279 nonuniform quantizer, 185 normal, 307 normal distribution, 221, 227, 259 normal equations, 305, 307, 320 null hypothesis, 282, 285, 324, 326, 336 Numpy, 23 objective, 128 Octave, package, 23 odd quantizer, 185, 191 offered packet rate, 153 one-proportion z-test, 283 one-sample t-test, 283 one-sample z-test, 283 one-sided test, 282 optimal constellation problem, 253 optimal filter, Kalman, 381–84 ordered, 48, 66 with replacement, 49, 52t, 66 without replacement, 49, 52t, 66 order statistics, 274–76 orthogonality, 310, 310f outcomes, 3–4, 24 big table of, 21–22 listing all, 18–19 outliers, 274 pairwise disjoint, 4 Pandas, 24 parallel axis theorem, 82 parametric approach, density or distribution function, 276, 278–79 parity bits, 157 partition, 4 Pascal’s triangle, 53, 54f, 55, 67, 116 binomial probabilities, 139, 140f, 160, 163 Pearson, Egon, 332n1 Pearson’s chi-squared test, 283–84 permutation, 49, 52, 66 repetition-by-three code, decoding, 156f
phase, Fourier transform, 343 phase shift keying (PSK), 248–49, 263 physical independence, 16 "Pick Six" lottery, 72 plugboard, Enigma, 73 Poisson distribution, 77, 117, 118, 143, 146, 149–50 counting, 90–92, 96 examples, 97, 100 expected value of, 92, 97 Poisson noise, 352, 363 Poisson PMF with unknown parameter, 287–88 Poisson point process, 369 Poisson process, 368–72, 386 Poisson random variables, 134 polar coordinates, transformation to, 241–43 positive definite, 300 "Powerball" lottery, 72 power per unit frequency, random process, 348 power spectral density (PSD) amplitude modulation, 355, 355f, 356f random process, 348, 363 wide sense stationary (WSS) random process, 350–51 precision, 170 prediction, 244, 272, 382 prediction error, 310 predictor-corrector fashion, 244, 272–73 probabilistic average, 78 probability, 1–2, 24, 117–19 basic rules, 6–8 complement, 20–21, 23 connections between binomial and hypergeometric, 146–47 of correct detection, 251–53 expressing as percentages, 8 formalized, 9–10 independence, 7–8 integrating density function, 193 joint distribution function, 192–93 multinomial, 147–48 tables, 401t, 402t union, 19–20
see also binomial probability; conditional probability probability density function (PDF), 168–69 probability distribution function (PDF), 188 probability generating function, 85 probability mass function (PMF), 76, 85–92, 95, 187 Bernoulli distribution, 78f, 96 binomial PMF, 138–41, 234f comparing binomial and Poisson, 150f comparing uniform PMF and histogram, 119f discrete, and delta functions, 182–83 discrete random variable and, 75–76 estimates, 278–80 Gaussian approximations, 234–35 geometric, 87–89, 96 joint PMF, 101, 130–31, 132 marginal PMFs and expected values, 109 multiple random variables and, 101–3 Poisson distribution, 90–92, 96 uniform PMF, 86–87, 96 Probability of Precipitation (PoP), 28 probability transition matrix, 374, 378, 380, 388–89 problems, breaking into pieces, 22 projection theorem, 307, 308f p-value, 282–83, 285 Pylab, 23 Pythagoras’s formula, 310 Pythagorean theorem, 311 Python, 23–24, 120 birthday problem, 60 code, 119 computation package, 393–95 singular value decomposition algorithm, 313 t-distribution, 281 three-state Markov chain, 379 QR algorithm, 312f, 313, 318, 320 QR method Kalman filter, 384–86 linear regression, 312–13 quadrature, 247
quadrature amplitude modulation (QAM), 246, 260 digital communications using, 246–58 discrete time model, 247–53 Monte Carlo exercise, 253–58 recap, 258 quality, confidence intervals, 280–82 quantile function Gaussian distribution, 227f, 227–28 values of standard normal, 402t quantization continuous random variable, 184–87, 191 error, 185–86, 187, 191 histograms characterizing error, 186–87 noise, 352, 363 quantizers, 184–85, 191 Quinto, lottery game, 99 R, 23, 24, 120 computation package, 395–98 QR algorithm, 313 R², measure explaining data, 311 radar detection densities of two hypotheses, 328f, 330f hypothesis testing example, 326–31 receiver operating characteristic (ROC) curve, 329f, 330f, 331f ROC graph, 328–31, 337 radians per second, 247, 342 random, 92 random access procedure, 89 randomness, 2, 132 random process, 362–63, 386–88 amplitude modulation example, 354–56 average power, 348 discrete time Wiener filter, 357 drunken random walk, 364 Gaussian random walk, 364–65 Kalman filter, 381–86 lightbulb process, 366–68, 386 Markov chains, 372–81 Poisson process, 368–72, 386 simple, 341–42
WSS, 340, 346–49 random signals, 340, 362 discrete time Fourier transform properties, 345t Fourier transforms, 342–46 introduction to, 340–41 simple process, 341–42 see also random process random sums, 205–7 random variables, 5–6, 75–76, 95 example of two discrete, 108–13 expected values for two, 105–6 independence, 109 joint cumulative distribution function (CDF), 110 marginal PMFs and expected values, 109 moments for two, 106–8 multiple Gaussian, 240–46 sums of independent, 113–17 transformations with one output, 110–12 transformations with several outputs, 112 random vectors, 298, 299 autocorrelation matrix, 300 correlations and covariances, 300 expected values of, 299 Gaussian, 298–303 linear operations, 303–4 rank, matrix, 308 rate, code, 157 Rayleigh distribution, 236–37, 242–43, 256–58, 260 Monte Carlo exercise, 256–58 Rayleigh density, 237f transformation to polar coordinates, 242–43 real part, Fourier transform, 343 receiver operating characteristic (ROC), 328–31, 337 recursion, 139, 140, 160, 163 recursive Fibonacci sequence, 271 linear regression, 318 sample mean computation, 271–73 reflector, Enigma, 73–74 regular, constellations, 253
relationships, Venn diagrams, 4–5 reliability networks, 18 repetition-by-three code, 155–57, 165 robust estimates, 274–76 rotors, Enigma, 73–74 sample, 117 sample average, 118, 266 sample distribution function, 276–77 sample mean estimating, 269–71 recursive calculation of, 271–73 sampler, 184 sample space, 3, 24 sampling theorem, 364 discussion, 358–59 proof of random, 361–62 random signal, 359–61 Shannon, Whittaker, Kotelnikov, 358 wide sense stationary (WSS) random processes, 357–62 Scipy, 24 Scipy.stats library, 24 sensitivity, 325, 336, 337 set arithmetic, 7 Seven-Card Stud, 62 Shannon, Claude Elwood, 126n1 Shannon, Whittaker, Kotelnikov theorem, 358 shot noise, 352, 363 signal-to-noise ratio (SNR), 92, 187, 353 significance tests, 282–85 chi-squared variance test, 283 one-proportion z-test, 283 one-sample t-test, 283 one-sample z-test, 283 one-sided test, 282 Pearson’s chi-squared test, 283–84 p-values, 282, 285 two-sided test, 282–83 Silverman’s rule, 279, 280f, 297 simple election poll, 265–69 simple random process, 341–42 simple theorems, 11–15 single blind test, 40
single random variable, calculations for, 171–73 singular value decomposition algorithm, 313 Slotted Aloha, 153–54, 164 smoothed, 382 smoothing, 382 software computation, 23–24 source node, 17 specificity, 325 squared error, 289 square root information filter, 386 standard deviation, 79 standard normal, 221–22, 227, 259–60 “bell curve” shape, 221, 259 distribution function, 222f, 401t Gaussian density, 221f Gaussian distribution and density, 221–27 quantile function, 402t tail probabilities, 402t star, Aloha network, 152 statistically significant, 282–83, 285 statistics, 265 Bayesian estimation, 291–95 confidence intervals, 60, 255, 268f, 280–82 estimating the distribution function, 276–78 estimating the mean and variance, 269–71 estimation theory, 285–89 exponential weighting, 273–74, 296 linear regression estimates, 309–11 minimum mean squared error estimation, 289–91 order statistics, 274–76 PMF and density estimates, 278–80 recursive calculation of sample mean, 271–73 robust estimates, 274–76 significance tests and p-values, 282–85 simple election poll, 265–69 Statsmodels, 24 stem-and-leaf plot, 397 step size, 185, 191
Stirling’s formula, 68 stochastic, 373 strong law of large numbers, 269–70 Student’s t-distribution, 281 subset, 4 sufficient statistic, 195 summary command, R, 397 summation notation, 80 sums, 171 independent random variables, 113–17, 202–5 random, 205–7 supersymbol, 127 symmetric, 300 systematic, 157 t-distribution, 281 table of thresholds for, 281t temperature, New York City, 316–17 Texas Hold ’em, 62, 65, 71, 93 theorems Bayes, 32–34, 37, 42, 196, 336 binomial, 54–55, 66 central limit theorem (CLT), 230–35, 259 DeMorgan’s, 14 inclusion-exclusion formula, 13–15 multinomial, 56–57 projection, 307, 308f Pythagorean, 311 sampling, of wide sense stationary (WSS), 357–62 simple, 11–15 three-bit uniform quantizer, 185f three-path network, 21f tie, 126 time inhomogeneous, Markov chains, 372 transition probabilities, 372, 373 trials, 254 true-negative rate, 325, 326, 327f, 337 true negative (TN), 38, 42, 325, 337 true-positive rate, 325, 336 true positive (TP), 38, 42, 325, 336 two-path network, 18f, 27f two-sided test, 282–83 type I error, 325, 336 type II error, 325, 337
unbiased, 118, 255, 266 unbiased estimate, 151 unbiased estimator of variance, 270, 271 uniform density, 188 uniform distribution, 174–76 uniform probability mass function (PMF), 86–87, 96 uniform quantizer, 185 union, 15 probability of, 19–20 unit step function, 183 unordered, 48, 66 with replacement, 51, 52t, 66 without replacement, 50, 52t, 66 urn, 43 urn models, 34–36 variable length code, 124 variable length coding, 123–27 variables, random, 5–6 variance, 79, 82, 95, 117–19 continuous random variables, 194, 215 estimated error, 316 estimating, 269–71 Gaussian distribution, 228–30 noise, 315 wide sense stationary process, 346, 347 vector moments, 319 Venn diagram, 4–5, 30f, 31 A ∪ B ∪ C, 6f conditional probability, 30f, 31 DeMorgan’s laws, 5f Voronoi diagram, 250 weak law of large numbers, 269 weight, code, 158 weighted linear regression, 317, 322 white noise, 353–54, 363, 365
wide sense stationary (WSS), 340, 363 and linear filters, 350–52 random processes, 346–49 sampling theorem of WSS random processes, 357–62 Wiener-Hopf equations, 357, 364, 365 win probability, 380
Yahtzee game, 70, 100 Yule-Walker equations, 365
Zipf’s Law, 25 z-scores, 223