

PROBABILITY, STATISTICS, AND RANDOM SIGNALS

PROBABILITY, STATISTICS, AND RANDOM SIGNALS

CHARLES G. BONCELET JR. University of Delaware

New York • Oxford OXFORD UNIVERSITY PRESS

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide. Oxford New York Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam Copyright© 2016 by Oxford University Press

For titles covered by Section 112 of the U.S. Higher Education Opportunity Act, please visit www.oup.com/us/he for the latest information about pricing and alternate formats.

Published by Oxford University Press, 198 Madison Avenue, New York, NY 10016
http://www.oup.com

Oxford is a registered trademark of Oxford University Press.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of Oxford University Press.

Library of Congress Cataloging in Publication Data
Names: Boncelet, Charles G.
Title: Probability, statistics, and random signals / Charles G. Boncelet Jr.
Description: New York: Oxford University Press, [2017] | Series: The Oxford series in electrical and computer engineering | Includes index.
Identifiers: LCCN 2015034908 | ISBN 9780190200510
Subjects: LCSH: Mathematical statistics-Textbooks. | Probabilities-Textbooks. | Electrical engineering-Mathematics-Textbooks.
Classification: LCC QA276.18 .B66 2017 | DDC 519.5-dc23
LC record available at http://lccn.loc.gov/2015034908

Printing number: 9 8 7 6 5 4 3 2 1
Printed in the United States of America on acid-free paper

CONTENTS

PREFACE

1 PROBABILITY BASICS
  1.1 What Is Probability?
  1.2 Experiments, Outcomes, and Events
  1.3 Venn Diagrams
  1.4 Random Variables
  1.5 Basic Probability Rules
  1.6 Probability Formalized
  1.7 Simple Theorems
  1.8 Compound Experiments
  1.9 Independence
  1.10 Example: Can S Communicate With D?
    1.10.1 List All Outcomes
    1.10.2 Probability of a Union
    1.10.3 Probability of the Complement
  1.11 Example: Now Can S Communicate With D?
    1.11.1 A Big Table
    1.11.2 Break Into Pieces
    1.11.3 Probability of the Complement
  1.12 Computational Procedures
  Summary
  Problems

2 CONDITIONAL PROBABILITY
  2.1 Definitions of Conditional Probability
  2.2 Law of Total Probability and Bayes Theorem
  2.3 Example: Urn Models
  2.4 Example: A Binary Channel
  2.5 Example: Drug Testing
  2.6 Example: A Diamond Network
  Summary
  Problems

3 A LITTLE COMBINATORICS
  3.1 Basics of Counting
  3.2 Notes on Computation
  3.3 Combinations and the Binomial Coefficients
  3.4 The Binomial Theorem
  3.5 Multinomial Coefficient and Theorem
  3.6 The Birthday Paradox and Message Authentication
  3.7 Hypergeometric Probabilities and Card Games
  Summary
  Problems

4 DISCRETE PROBABILITIES AND RANDOM VARIABLES
  4.1 Probability Mass Functions
  4.2 Cumulative Distribution Functions
  4.3 Expected Values
  4.4 Moment Generating Functions
  4.5 Several Important Discrete PMFs
    4.5.1 Uniform PMF
    4.5.2 Geometric PMF
    4.5.3 The Poisson Distribution
  4.6 Gambling and Financial Decision Making
  Summary
  Problems

5 MULTIPLE DISCRETE RANDOM VARIABLES
  5.1 Multiple Random Variables and PMFs
  5.2 Independence
  5.3 Moments and Expected Values
    5.3.1 Expected Values for Two Random Variables
    5.3.2 Moments for Two Random Variables
  5.4 Example: Two Discrete Random Variables
    5.4.1 Marginal PMFs and Expected Values
    5.4.2 Independence
    5.4.3 Joint CDF
    5.4.4 Transformations With One Output
    5.4.5 Transformations With Several Outputs
    5.4.6 Discussion
  5.5 Sums of Independent Random Variables
  5.6 Sample Probabilities, Mean, and Variance
  5.7 Histograms
  5.8 Entropy and Data Compression
    5.8.1 Entropy and Information Theory
    5.8.2 Variable Length Coding
    5.8.3 Encoding Binary Sequences
    5.8.4 Maximum Entropy
  Summary
  Problems

6 BINOMIAL PROBABILITIES
  6.1 Basics of the Binomial Distribution
  6.2 Computing Binomial Probabilities
  6.3 Moments of the Binomial Distribution
  6.4 Sums of Independent Binomial Random Variables
  6.5 Distributions Related to the Binomial
    6.5.1 Connections Between Binomial and Hypergeometric Probabilities
    6.5.2 Multinomial Probabilities
    6.5.3 The Negative Binomial Distribution
    6.5.4 The Poisson Distribution
  6.6 Binomial and Multinomial Estimation
  6.7 Alohanet
  6.8 Error Control Codes
    6.8.1 Repetition-by-Three Code
    6.8.2 General Linear Block Codes
    6.8.3 Conclusions
  Summary
  Problems

7 A CONTINUOUS RANDOM VARIABLE
  7.1 Basic Properties
  7.2 Example Calculations for One Random Variable
  7.3 Selected Continuous Distributions
    7.3.1 The Uniform Distribution
    7.3.2 The Exponential Distribution
  7.4 Conditional Probabilities
  7.5 Discrete PMFs and Delta Functions
  7.6 Quantization
  7.7 A Final Word
  Summary
  Problems

8 MULTIPLE CONTINUOUS RANDOM VARIABLES
  8.1 Joint Densities and Distribution Functions
  8.2 Expected Values and Moments
  8.3 Independence
  8.4 Conditional Probabilities for Multiple Random Variables
  8.5 Extended Example: Two Continuous Random Variables
  8.6 Sums of Independent Random Variables
  8.7 Random Sums
  8.8 General Transformations and the Jacobian
  8.9 Parameter Estimation for the Exponential Distribution
  8.10 Comparison of Discrete and Continuous Distributions
  Summary
  Problems

9 THE GAUSSIAN AND RELATED DISTRIBUTIONS
  9.1 The Gaussian Distribution and Density
  9.2 Quantile Function
  9.3 Moments of the Gaussian Distribution
  9.4 The Central Limit Theorem
  9.5 Related Distributions
    9.5.1 The Laplace Distribution
    9.5.2 The Rayleigh Distribution
    9.5.3 The Chi-Squared and F Distributions
  9.6 Multiple Gaussian Random Variables
    9.6.1 Independent Gaussian Random Variables
    9.6.2 Transformation to Polar Coordinates
    9.6.3 Two Correlated Gaussian Random Variables
  9.7 Example: Digital Communications Using QAM
    9.7.1 Background
    9.7.2 Discrete Time Model
    9.7.3 Monte Carlo Exercise
    9.7.4 QAM Recap
  Summary
  Problems

10 ELEMENTS OF STATISTICS
  10.1 A Simple Election Poll
  10.2 Estimating the Mean and Variance
  10.3 Recursive Calculation of the Sample Mean
  10.4 Exponential Weighting
  10.5 Order Statistics and Robust Estimates
  10.6 Estimating the Distribution Function
  10.7 PMF and Density Estimates
  10.8 Confidence Intervals
  10.9 Significance Tests and p-Values
  10.10 Introduction to Estimation Theory
  10.11 Minimum Mean Squared Error Estimation
  10.12 Bayesian Estimation
  Problems

11 GAUSSIAN RANDOM VECTORS AND LINEAR REGRESSION
  11.1 Gaussian Random Vectors
  11.2 Linear Operations on Gaussian Random Vectors
  11.3 Linear Regression
    11.3.1 Linear Regression in Detail
    11.3.2 Statistics of the Linear Regression Estimates
    11.3.3 Computational Issues
    11.3.4 Linear Regression Examples
    11.3.5 Extensions of Linear Regression
  Summary
  Problems

12 HYPOTHESIS TESTING
  12.1 Hypothesis Testing: Basic Principles
  12.2 Example: Radar Detection
  12.3 Hypothesis Tests and Likelihood Ratios
  12.4 MAP Tests
  Summary
  Problems

13 RANDOM SIGNALS AND NOISE
  13.1 Introduction to Random Signals
  13.2 A Simple Random Process
  13.3 Fourier Transforms
  13.4 WSS Random Processes
  13.5 WSS Signals and Linear Filters
  13.6 Noise
    13.6.1 Probabilistic Properties of Noise
    13.6.2 Spectral Properties of Noise
  13.7 Example: Amplitude Modulation
  13.8 Example: Discrete Time Wiener Filter
  13.9 The Sampling Theorem for WSS Random Processes
    13.9.1 Discussion
    13.9.2 Example: Figure 13.4
    13.9.3 Proof of the Random Sampling Theorem
  Summary
  Problems

14 SELECTED RANDOM PROCESSES
  14.1 The Lightbulb Process
  14.2 The Poisson Process
  14.3 Markov Chains
  14.4 Kalman Filter
    14.4.1 The Optimal Filter and Example
    14.4.2 QR Method Applied to the Kalman Filter
  Summary
  Problems

A COMPUTATION EXAMPLES
  A.1 Matlab
  A.2 Python
  A.3 R

B ACRONYMS

C PROBABILITY TABLES
  C.1 Tables of Gaussian Probabilities

D BIBLIOGRAPHY

INDEX

PREFACE

I have many goals for this book, but this is foremost: I have always liked probability and have been fascinated by its application to predicting the future. I hope to encourage this generation of students to study, appreciate, and apply probability to the many applications they will face in the years ahead.

To the student: This book is written for you. The prose style is less formal than many textbooks use. This more engaging prose was chosen to encourage you to read the book. I firmly believe a good textbook should help you learn the material. But it will not help if you do not read it.

Whenever I ask my students what they want to see in a text, the answer is: "Examples. Lots of examples." I have tried to heed this advice and included "lots" of examples. Many are small, quick examples to illustrate a single concept. Others are long, detailed examples designed to demonstrate more sophisticated concepts. Finally, most chapters end in one or more longer examples that illustrate how the concepts of that chapter apply to engineering or scientific applications.

Almost all the concepts and equations are derived using routine algebra. Read the derivations, and reproduce them yourselves. A great learning technique is to read through a section, then write down the salient points. Read a derivation, and then reproduce it yourself. Repeat the sequence-read, then reproduce-until you get it right.

I have included many figures and graphics. The old expression, "a picture is worth a thousand words," is still true. I am a believer in Edward Tufte's graphics philosophy: maximize the data-ink ratio.¹ All graphics are carefully drawn. They each have enough ink to tell a story, but only enough ink.

To the instructor: This textbook has several advantages over other textbooks. It is the right size-not too big and not too small. It should cover the essential concepts for the level of the course, but should not cover too much. Part of the art of textbook writing is to decide what should be in and what should be out. The selection of topics is, of course, a determination on the part of the author and represents the era in which the book is written. When I first started teaching my course more than two decades ago, the selection of topics favored continuous random variables and continuous time random processes. Over time, discrete random variables and discrete time random processes have grown in importance. Students today are expected to understand more statistics than in the past. Computation is much more important and more immediate. Each year I add a bit more computation to the course than the prior year.

I like computation. So do most students. Computation gives a reality to the theoretical concepts. It can also be fun. Throughout the book, there are computational examples and exercises. Unfortunately, not everyone uses the same computational packages. The book uses three of the most popular: Matlab, Python, and R. For the most part, we alternate between Matlab and Python and postpone discussion of R until the statistics chapters.

Most chapters have a common format: introductory material, followed by deeper and more involved topics, and then one or more examples illustrating the application of the concepts, a summary of the main topics, and a list of homework problems. The instructor can choose how far into each chapter to go. For instance, I usually cover entropy (Chapter 5) and Aloha (Chapter 6), but skip error-correcting coding (also Chapter 6).

I am a firm believer that before statistics or random processes can be understood, the student must have a good knowledge of probability. A typical undergraduate class can cover the first nine chapters in about two-thirds of a semester, giving the student a good understanding of both discrete and continuous probability. The instructor can select topics from the later chapters to fill out the rest of the semester. If students have had basic probability in a prior course, the first nine chapters can be covered quickly and greater emphasis placed on the remaining chapters. Depending on the focus of the course, the instructor can choose to emphasize statistics by covering the material in Chapters 10 through 12. Alternatively, the instructor can emphasize random signals by covering Chapters 13 and 14.

The text can be used in a graduate class. Assuming the students have seen some probability as undergraduates, the first nine chapters can be covered quickly and more attention paid to the last five chapters. In my experience, most new graduate students need to refresh their probability knowledge. Reviewing the first nine chapters will be time well spent. Graduate students will also benefit from doing computational exercises and learning the similarities and differences in the three computational packages discussed, Matlab, Python, and R.

¹ Edward Tufte, The Visual Display of Quantitative Information, 2nd ed. Cheshire, CT: Graphics Press, 2001. A great book, highly recommended.

Chapter Coverage

Chapters 1 and 2 are a fairly standard introduction to probability. The first chapter introduces the basic definitions and the three axioms, proves a series of simple theorems, and concludes with detailed examples of calculating probabilities for simple networks. The second chapter covers conditional probability, Bayes theorem and the law of total probability, and several applications.

Chapter 3 is a detour into combinatorics. A knowledge of combinatorics is essential to understanding probability, especially discrete probability, but students often confuse the two, thinking combinatorics to be a branch of probability. The two are different, and we emphasize that. Much of the development of probability in history was driven by gambling. I, too, use examples from gambling and game play in this chapter (and in some later chapters as well). Students play games and occasionally gamble. Examples from these subjects help bring probability to the student life experience-and we show that gambling is unlikely to be profitable!

Chapters 4 and 5 introduce discrete probability mass functions, distribution functions, expected values, change of variables, and the uniform, geometric, and Poisson distributions. Chapter 4 culminates with a discussion of the financial considerations of gambling versus buying insurance. Chapter 5 ends with a long section on entropy and data compression. (It still amazes me that most textbooks targeting an electrical and computer engineering audience omit entropy.)

Chapter 6 presents binomial, multinomial, negative binomial, hypergeometric, and Poisson probabilities and considers the connections between these important discrete probability distributions. It is punctuated by two optional sections, the first on the Aloha protocol and the second on error-correcting codes.

Chapters 7 and 8 present continuous random variables and their densities and distribution functions. Expected values and changes of variables are also presented, as is an extended example on quantization.

Chapter 9 presents the Gaussian distribution. Moments, expected values, and change of variables are also presented here. The central limit theorem is motivated by multiple examples showing how the probability mass function or density function converges to the Gaussian density. Some of the related distributions, including the Laplace, Rayleigh, and chi-squared, are presented. The chapter concludes with an extended example on digital communications using quadrature amplitude modulation. Exact and approximate error rates are computed and compared to a Monte Carlo simulation.

The first nine chapters are typically covered in order at whatever speed is comfortable for the instructor and students. The remaining chapters can be divided into two subjects, statistics and random processes. Chapters 10, 11, and 12 comprise an introduction to statistics, linear regression, and hypothesis testing. Chapters 13 and 14 introduce random processes and random signals. These chapters are not serial; the instructor can pick and choose whichever chapters or sections to cover.

Chapter 10 presents basic statistics. At this point, the student should have a good understanding of probability and be ready to understand the "why" behind statistical procedures. Standard and robust estimates of mean and variance, density and distribution estimates, confidence intervals, and significance tests are presented. Finally, maximum likelihood, minimum mean squared estimation, and Bayes estimation are discussed.

Chapter 11 takes a linear algebra approach (vectors and matrices) to multivariate Gaussian random variables and uses this approach to study linear regression. Chapter 12 covers hypothesis testing from a traditional engineering point of view. MAP (maximum a posteriori), Neyman-Pearson, and Bayesian hypothesis tests are presented.

Chapter 13 studies random signals, with particular emphasis on those signals that appear in engineering applications. Wide sense stationary signals, noise, linear filters, and modulation are covered. The chapter ends with a discussion of the sampling theorem. Chapter 14 focuses on the Poisson process and Markov processes and includes a section on Kalman filtering.

Let me conclude this preface by repeating my overall goal: that the student will develop not only an understanding and appreciation of probability, statistics, and random processes but also a willingness to apply these concepts to the various problems that will occur in the years ahead.

Acknowledgments

I would like to thank the reviewers who helped shape this text during its development. Their many comments are much appreciated. They are the following:

Deva K Borah, New Mexico State University
Petar M. Djuric, Stony Brook University
Jens Gregor, University of Tennessee
Eddie Jacobs, University of Memphis
JeongHee Kim, San Jose State University
Nicholas J. Kirsch, University of New Hampshire
Joerg Kliewer, New Jersey Institute of Technology
Sarah Koskie, Indiana University-Purdue University Indianapolis
Ioannis (John) Lambadaris, Carleton University
Eric Miller, Tufts University
Ali A. Minai, University of Cincinnati
Robert Morelos-Zaragoza, San Jose State University
Xiaoshu Qian, Santa Clara University
Danda B. Rawat, Georgia Southern University
Rodney Roberts, Florida State University
John M. Shea, University of Florida
Igor Tsukerman, The University of Akron

I would like to thank the following people from Oxford University Press who helped make this book a reality: Nancy Blaine, John Appeldorn, Megan Carlson, Christine Mahon, Daniel Kaveney, and Claudia Dukeshire.

Last, and definitely not least, I would like to thank my children, Matthew and Amy, and my wife, Carol, for their patience over the years while I worked on this book.

Charles Boncelet

CHAPTER 1

PROBABILITY BASICS

In this chapter, we introduce the formalism of probability, from experiments to outcomes to events. The three axioms of probability are introduced and used to prove a number of simple theorems. The chapter concludes with some examples.

1.1 WHAT IS PROBABILITY?

Probability refers to how likely something is. By convention, probabilities are real numbers between 0 and 1. A probability of 0 refers to something that never occurs; a probability of 1 refers to something that always occurs. Probabilities between 0 and 1 refer to things that sometimes occur. For instance, an ordinary coin when flipped will land heads up about half the time and land tails up about half the time. We say the probability of heads is 0.5; the probability of tails is also 0.5. As another example, a typical telephone line has a probability of sending a data bit correctly of around 0.9999, or 1 - 10⁻⁴. The probability the bit is incorrect is 10⁻⁴. A fiber-optic line may have a bit error rate as low as 10⁻¹⁵.

Imagine Alice sends a message to Bob. For Bob to receive any information (any new knowledge), the message must be unknown to Bob. If Bob knew the message before receiving it, then he gains no new knowledge from hearing it. Only if the message is random to Bob will Bob receive any information.

There are a great many applications where people try to predict the future. Stock markets, weather, sporting events, and elections all are random. Successful prediction of any of these would be immensely profitable, but each seems to have substantial randomness. Engineers worry about reliability of devices and systems. Engineers control complex systems, often without perfect knowledge of the inputs. People are building self-driving automobiles and aircraft. These devices must operate successfully even though all sorts of unpredictable events may occur.

Probabilities may be functions of other variables, such as time and space. The probability of someone getting cancer is a function of lots of things, including age, gender, genetics, dietary habits, whether the person smokes, and where the person lives. Noise in an electric circuit is a function of time and temperature. The number of questions answered correctly on an exam is a function of what questions are asked-and how prepared the test taker is!

In some problems, time is the relevant quantity. How many flips of a coin are required before the first head occurs? How many before the 100th head?

The point of this is that many experiments feature randomness, where the result of the experiment is not known in advance. Furthermore, repetitions of the same experiment may produce different results. Flipping a coin once and getting heads does not mean that a second flip will be heads (or tails). Probability is about understanding and quantifying this randomness.

Comment 1.1: Is a coin flip truly random? Is it unpredictable?

Presumably, if we knew the mass distribution of the coin, the initial force (both linear and rotational) applied to the coin, the density of the air, and any air currents, we could use physics to compute the path of the coin and how it will land (heads or tails). From this point of view, the coin is not random.

In practice, we usually do not know these variables. Most coins are symmetric (or close to symmetric). As long as the number of rotations of the coin is large, we can reasonably assume the coin will land heads up half the time and tails up half the time, and we cannot predict which will occur on any given toss. From this point of view, the coin flip is random.

However, there are rules, even if they are usually unspoken. The coin flipper must make no attempt to control the flip (i.e., to control how many rotations the coin undergoes before landing). The flipper must also make no attempt to control the catch of the coin or its landing. These concerns are real. Magicians have been known to practice flipping coins until they can control the flip. (And sometimes they simply cheat.) Only if the rules are followed can we reasonably assume the coin flip is random.

EXAMPLE 1.1

Let us test this question: How many flips are required to get a head? Find a coin, and flip it until a head occurs. Record how many flips were required. Repeat the experiment again, and record the result. Do this at least 10 times. Each of these is referred to as a run, a sequence of tails ending with a heads. What is the longest run you observed? What is the shortest? What is the average run length? Theory tells us that the average run length will be about 2.0, though of course your average may be different.
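This experiment is easy to simulate. The following is a short illustrative sketch in Python (one of the computational packages used in this book); the 10,000-run sample size and the function name are our own choices. The average should come out near the 2.0 that theory predicts.

import random

def run_length():
    """Flip a fair coin until the first head; return the number of flips."""
    flips = 1
    while random.random() < 0.5:   # tails with probability 0.5, so keep flipping
        flips += 1
    return flips

runs = [run_length() for _ in range(10_000)]
print("shortest run:", min(runs))
print("longest run: ", max(runs))
print("average run: ", sum(runs) / len(runs))   # close to 2.0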


1.2 EXPERIMENTS, OUTCOMES, AND EVENTS

An experiment is whatever is done. It may be flipping a coin, rolling some dice, measuring a voltage or someone's height and weight, or numerous others. The experiment results in outcomes. The outcomes are the atomic results of the experiment. They cannot be divided further. For instance, for a coin flip, the outcomes are heads and tails; for a counting experiment (e.g., the number of electrons crossing a PN junction), the outcomes are the nonnegative integers, 0, 1, 2, 3, .... Outcomes are denoted with italic lowercase letters, perhaps with subscripts, such as x, n, a₁, a₂, etc.

The number of outcomes can be finite or infinite, as in the two examples mentioned in the paragraph above. Furthermore, the experiment can result in discrete outcomes, such as the integers, or continuous outcomes, such as a person's weight. For now, we postpone continuous experiments to Chapter 7 and consider only discrete experiments.

Sets of outcomes are known as events. Events are denoted with italic uppercase Roman letters, perhaps with subscripts, such as A, B, and Aᵢ. The outcomes in an event are listed with braces. For instance, A = {1, 2, 3, 4} or B = {2, 4, 6}. A is the event containing the outcomes 1, 2, 3, and 4, while B is the event containing outcomes 2, 4, and 6.

The set of all possible outcomes is the sample space and is denoted by S. For example, the outcomes of a roll of an ordinary six-sided die¹ are 1, 2, 3, 4, 5, and 6. The sample space is S = {1, 2, 3, 4, 5, 6}. The set containing no outcomes is the empty set and is denoted by ∅. The complement of an event A, denoted Ā, is the event containing every outcome not in A. The sample space is the complement of the empty set, and vice versa.

The usual rules of set arithmetic apply to events. The union of two events, A ∪ B, is the event containing outcomes in either A or B. The intersection of two events, A ∩ B or more simply AB, is the event containing all outcomes in both A and B. For any event A, A ∩ Ā = AĀ = ∅ and A ∪ Ā = S.

EXAMPLE 1.2

Consider a roll of an ordinary six-sided die, and let A = {1,2,3,4} and B = {2,4,6}. Then, A ∪ B = {1,2,3,4,6} and A ∩ B = {2,4}. Ā = {5,6} and B̄ = {1,3,5}.

EXAMPLE 1.3

Consider the following experiment: A coin is flipped three times. The outcomes are the eight flip sequences: hhh, hht, ..., ttt. If A = {first flip is head} = {hhh, hht, hth, htt}, then Ā = {ttt, tth, tht, thh}. If B = {exactly two heads} = {hht, hth, thh}, then A ∪ B = {hhh, hht, hth, htt, thh} and AB = {hht, hth}.

Comment 1.2: Be careful in defining events. In the coin flipping experiment above, an event might be specified as C = {two heads}. Is this "exactly two heads" or "at least two heads"? The former is {hht, hth, thh}, while the latter is {hht, hth, thh, hhh}.

¹ One is a die; two or more are dice.


Set arithmetic obeys DeMorgan's laws:

    (A ∪ B)‾ = Ā ∩ B̄        (1.1)
    (A ∩ B)‾ = Ā ∪ B̄        (1.2)

DeMorgan's laws are handy when the complements of events are easier to define and specify than the events themselves.

A is a subset of B, denoted A ⊂ B, if each outcome in A is also in B. For instance, if A = {1, 2} and B = {1, 2, 4, 6}, then A ⊂ B. Note that any set is a subset of itself, A ⊂ A. If A ⊂ B and B ⊂ A, then A = B.

Two events are disjoint (also known as mutually exclusive) if they have no outcomes in common, that is, if AB = ∅. A collection of events, Aᵢ for i = 1, 2, ..., is pairwise disjoint if each pair of events is disjoint, i.e., AᵢAⱼ = ∅ for all i ≠ j. A collection of events, Aᵢ for i = 1, 2, ..., forms a partition of S if the events are pairwise disjoint and the union of all events is the sample space:

    AᵢAⱼ = ∅   for i ≠ j
    ⋃ᵢ Aᵢ = S

In the next chapter, we introduce the law of total probability, which uses a partition to divide a problem into pieces, with each Aᵢ representing a piece. Each piece is solved and the pieces combined to get the total solution.
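These set relations are easy to check numerically. Below is a small illustrative sketch in Python using the die events of Example 1.2; the variable names and the particular partition are our own choices, not anything specified in the text.

S = {1, 2, 3, 4, 5, 6}          # sample space for one roll of a die
A = {1, 2, 3, 4}
B = {2, 4, 6}

comp = lambda E: S - E           # complement with respect to S

# DeMorgan's laws (Equations 1.1 and 1.2)
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)

# A partition of S: pairwise disjoint events whose union is S
parts = [{1, 2}, {3, 4}, {5, 6}]
assert all(P & Q == set() for P in parts for Q in parts if P is not Q)
assert set().union(*parts) == S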

1.3 VENN DIAGRAMS

A useful tool for visualizing relationships between sets is the Venn diagram. Typically, Venn diagrams use a box for the sample space and circles (or circle-like figures) for the various events.

In Figure 1.1, we show a simple Venn diagram. The outer box, labeled S, denotes the sample space. All outcomes are in S. The two circles, A and B, represent two events. The shaded area is the union of these two events. One can see that A = AB ∪ AB̄, that B = AB ∪ ĀB, and that A ∪ B = AB ∪ AB̄ ∪ ĀB.

FIGURE 1.1 A Venn diagram of A ∪ B.

Figure 1.2 presents a simple Venn diagram proof of Equation (1.2). The dark shaded area in the leftmost box represents (AB)‾, and the shaded areas in the two rightmost boxes represent Ā and B̄, respectively. The left box is the logical OR of the two rightmost boxes. On the other hand, the light area on the left is AB. It is the logical AND of A and B.

FIGURE 1.2 A Venn diagram "proof" of the second of DeMorgan's laws (Equation 1.2). The "dark" parts show (AB)‾ = Ā ∪ B̄, while the "light" parts show AB = A ∩ B.

Figure 1.3 shows a portion of the Venn diagram of A ∪ B ∪ C. The shaded area, representing the union, can be divided into seven parts. One part is ABC, another part is ABC̄, etc. Problem 1.13 asks the reader to complete the picture.

FIGURE 1.3 A Venn diagram of A ∪ B ∪ C. See Problem 1.13 for details.

1.4 RANDOM VARIABLES

It is often convenient to refer to the outcomes of experiments as numbers. For instance, it is convenient to refer to "heads" as 1 and "tails" as 0. The faces of most six-sided dice are labeled with pips (dots). We refer to the side with one pip as 1, to the side with two pips as 2, etc.

In other experiments, the mapping is less clear because the outcomes are naturally numbers. A coin can be flipped n times and the number of heads counted. Or a large number of bits can be transmitted across a wireless communications network and the number of bits received in error counted. A randomly chosen person's height, weight, age, temperature, and blood pressure can be measured. All these quantities are represented by numbers.

Random variables are mappings from outcomes to numbers. We denote random variables with bold-italic uppercase Roman letters (or sometimes Greek letters), such as X and Y, and sometimes with subscripts, such as X₁, X₂, etc. The outcomes are denoted with italic lowercase letters, such as x, y, and n. For instance,

    X(heads) = 1
    X(tails) = 0

Events, sets of outcomes, become relations on the random variables. For instance,

    {heads} = {X(heads) = 1} = {X = 1}

where we simplify the notation and write just {X = 1}. As another example, let Y denote the number of heads in three flips of a coin. Then, various events are written as follows:

    {hhh} = {Y = 3}
    {hht, hth, thh} = {Y = 2}
    {hhh, hht, hth, thh} = {2 ≤ Y ≤ 3} = {Y = 2} ∪ {Y = 3}

In some experiments, the variables are discrete (e.g., counting experiments), and in others, the variables are continuous (e.g., height and weight). In still others, both types of random variables can be present. A person's height and weight are continuous quantities, but a person's gender is discrete, say, 0 = male and 1 = female.

A crucial distinction is that between the random variable, say, N, and the outcomes, say, k = 0, 1, 2, 3. Before the experiment is done, the value of N is unknown. It could be any of the outcomes. After the experiment is done, N is one of the values. The probabilities of N refer to before the experiment; that is, Pr[N = k] is the probability the experiment results in the outcome k (i.e., that outcome k is the selected outcome).

Discrete random variables are considered in detail in Chapters 4, 5, and 6 and continuous random variables in Chapters 7, 8, and 9.
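As an illustrative sketch (in Python; the representation of outcomes as tuples of 'h' and 't' is our own convention, not the book's), the random variable Y for three coin flips can be built explicitly as a mapping from outcomes to numbers, and events then become relations on Y:

from itertools import product

# The eight outcomes of three coin flips, e.g. ('h', 'h', 't')
outcomes = list(product("ht", repeat=3))

# The random variable Y maps each outcome to the number of heads
Y = {w: w.count("h") for w in outcomes}

# Events written as relations on Y
event_Y_eq_2 = {w for w in outcomes if Y[w] == 2}
event_2_le_Y_le_3 = {w for w in outcomes if 2 <= Y[w] <= 3}

print(sorted(event_Y_eq_2))        # the three outcomes hht, hth, thh
print(len(event_2_le_Y_le_3))      # 4 outcomes: hhh, hht, hth, thh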

1.5 BASIC PROBABILITY RULES

In this section, we take an intuitive approach to the basic rules of probability. In the next section, we give a more formal approach to the basic rules.

When the experiment is performed, one outcome is selected. Any event or events containing that outcome are true; all other events are false. This can be a confusing point: even though only one outcome is selected, many events can be true because many events can contain the selected outcome. For example, consider the experiment of rolling an ordinary six-sided die. The outcomes are the numbers 1, 2, 3, 4, 5, and 6. Let A = {1,2,3,4}, B = {2,4,6}, and C = {2}. Then, if the roll results in a 4, events A and B are true while C is false.

Comment 1.3: The operations of set arithmetic are analogous to those of Boolean algebra. Set union is analogous to Boolean Or, set intersection to Boolean And, and set complement to Boolean complement. For example, if C = A ∪ B, then C contains the selected outcome if either A or B (or both) contain the selected outcome. Alternatively, we say C is true if A is true or B is true.

Probability is a function of events that yields a number. If A is some event, then the probability of A, denoted Pr[A], is a number; that is,

    Pr[A] = number        (1.3)

Probabilities are computed as follows: Each outcome in S is assigned a probability between 0 and 1 such that the sum of all the outcome probabilities is 1. Then, for example, if A = {a₁, a₂, a₃}, the probability of A is the sum of the outcome probabilities in A; that is,

    Pr[A] = Pr[a₁] + Pr[a₂] + Pr[a₃]

A probability of 0 means the event does not occur. The empty set ∅, for instance, has probability 0, or Pr[∅] = 0, since it has no outcomes. By definition whatever outcome is selected is not in the empty set. Conversely, the sample space contains all outcomes. It is always true. Probabilities are normalized so that the probability of the sample space is 1:

    Pr[S] = 1

The probability of any event A is between 0 and 1; that is, 0 ≤ Pr[A] ≤ 1. Since A ∪ Ā = S, it is reasonable to expect that Pr[A] + Pr[Ā] = 1. This is indeed true and can be handy. Sometimes one of these probabilities, Pr[A] or Pr[Ā], is much easier to compute than the other one. Reiterating, for any event A,

    0 ≤ Pr[A] ≤ 1
    Pr[A] + Pr[Ā] = 1

The probabilities of nonoverlapping events add: if AB = ∅, then Pr[A ∪ B] = Pr[A] + Pr[B]. If the events overlap (i.e., have outcomes in common), then we must modify the formula to eliminate any double counting. There are two main ways of doing this. The first adds the two probabilities and then subtracts the probability of the overlapping region:

    Pr[A ∪ B] = Pr[A] + Pr[B] - Pr[AB]

The second avoids the overlap by breaking the union into nonoverlapping pieces:

    Pr[A ∪ B] = Pr[AB ∪ AB̄ ∪ ĀB] = Pr[AB] + Pr[AB̄] + Pr[ĀB]

Both formulas are useful.

A crucial notion in probability is that of independence. Independence means two events, A and B, do not affect each other. For example, flip a coin twice, and let A represent the event the first coin is heads and B the event the second coin is heads. If the two coin flips are done in such a way that the result of the first flip does not affect the second flip (as coin flips are usually done), then we say the two flips are independent. When A and B are independent, the probabilities multiply:

    Pr[AB] = Pr[A] Pr[B]     if A and B are independent

Another way of thinking about independence is that knowing A has occurred (or not occurred) does not give us any information about whether B has occurred, and conversely, knowing B does not give us information about A. See Chapter 2 for further discussion of this view of independence.
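A quick simulation illustrates the multiplication rule. This is an illustrative Python sketch (the sample size and variable names are our own): for two independent fair coin flips, the relative frequency of AB should be close to the product of the relative frequencies of A and B.

import random

N = 100_000
count_A = count_B = count_AB = 0
for _ in range(N):
    first = random.random() < 0.5    # event A: first flip is heads
    second = random.random() < 0.5   # event B: second flip is heads
    count_A += first
    count_B += second
    count_AB += first and second

print(count_AB / N)                      # estimate of Pr[AB], near 0.25
print((count_A / N) * (count_B / N))     # estimate of Pr[A] * Pr[B], also near 0.25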

Comment 1.4: Sometimes probabilities are expressed as percentages. A probability of 0.5 might be expressed as a 50% chance of occurring.

The notation, Pr[A], is shorthand for the more complex "probability that event A is true," which itself is shorthand for the even more complex "probability that one of the outcomes in A is the result of the experiment." Similarly, intersections and unions can be thought of in terms of Boolean algebra: Pr[A ∪ B] means "the probability that event A is true or event B is true," and Pr[AB] means "the probability that event A is true and event B is true."

EXAMPLE 1.4

In Example 1.2, we defined two events, A and B, but said nothing about the probabilities. Assume each side of the die is equally likely. Since there are six sides and each side is equally likely, the probability of any one side must be 1/6:

    Pr[A] = Pr[{1,2,3,4}]                              (list the outcomes of A)
          = Pr[1] + Pr[2] + Pr[3] + Pr[4]              (break the event into its outcomes)
          = 1/6 + 1/6 + 1/6 + 1/6 = 4/6                (each side equally likely)

    Pr[B] = Pr[{2,4,6}] = 3/6 = 1/2

Continuing, A ∪ B = {1,2,3,4,6} and AB = {2,4}. Thus,

    Pr[A ∪ B] = Pr[{1,2,3,4,6}] = 5/6                  (first, solve directly)
              = Pr[A] + Pr[B] - Pr[AB]                 (second, solve with union formula)
              = 4/6 + 3/6 - 2/6 = 5/6

Alternatively, AB̄ = {1,3}, ĀB = {6}, and

    Pr[A ∪ B] = Pr[AB] + Pr[AB̄] + Pr[ĀB] = 2/6 + 2/6 + 1/6 = 5/6

EXAMPLE 1.5

In Example 1.4, we assumed all sides of the die are equally likely. The probabilities do not have to be equally likely. For instance, consider the following probabilities:

    Pr[1] = 0.5
    Pr[k] = 0.1   for k = 2, 3, 4, 5, 6

Then, repeating the above calculations,

    Pr[A] = Pr[{1,2,3,4}]                              (list the outcomes of A)
          = Pr[1] + Pr[2] + Pr[3] + Pr[4]              (break the event into its outcomes)
          = 1/2 + 1/10 + 1/10 + 1/10 = 8/10            (unequal probabilities)

    Pr[B] = Pr[{2,4,6}] = 3/10

Continuing, A ∪ B = {1,2,3,4,6}, and AB = {2,4}. Thus,

    Pr[A ∪ B] = Pr[{1,2,3,4,6}] = 9/10                 (first, solve directly)
              = Pr[A] + Pr[B] - Pr[AB]                 (second, solve with union formula)
              = 8/10 + 3/10 - 2/10 = 9/10

Alternatively,

    Pr[A ∪ B] = Pr[AB] + Pr[AB̄] + Pr[ĀB] = 2/10 + 6/10 + 1/10 = 9/10
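The paradigm of summing outcome probabilities is easy to automate. The sketch below (in Python; the dictionaries encoding the two dice of Examples 1.4 and 1.5 are our own representation) computes Pr[A ∪ B] directly and with the union formula, giving 5/6 and then 9/10 as above.

def prob(event, p):
    """Probability of an event = sum of its outcome probabilities."""
    return sum(p[w] for w in event)

A, B = {1, 2, 3, 4}, {2, 4, 6}

fair = {w: 1/6 for w in range(1, 7)}                       # Example 1.4
loaded = {1: 0.5, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.1}  # Example 1.5

for p in (fair, loaded):
    direct = prob(A | B, p)
    union_formula = prob(A, p) + prob(B, p) - prob(A & B, p)
    print(round(direct, 4), round(union_formula, 4))   # 0.8333 0.8333, then 0.9 0.9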

1.6 PROBABILITY FORMALIZED

A formal development begins with three axioms. Axioms are truths that are unproven but accepted. We present the three axioms of probability, then use these axioms to prove several basic theorems. The first two axioms are simple, while the third is more complicated:

Axiom 1: Pr[A] ≥ 0 for any event A.

Axiom 2: Pr[S] = 1, where S is the sample space.

Axiom 3: If Aᵢ for i = 1, 2, ... are pairwise disjoint, then

    Pr[⋃ᵢ Aᵢ] = Σᵢ Pr[Aᵢ]        (1.4)

From these three axioms, the basic theorems about probability are proved.

The first axiom states that all probabilities are nonnegative. The second axiom states that the probability of the sample space is 1. Since the sample space contains all possible outcomes (by definition), the result of the experiment (the outcome that is selected) is contained in S. Thus, S is always true, and its probability is 1. The third axiom says that the probabilities of nonoverlapping events add; that is, if two or more events have no outcomes in common, then the probability of the union is the sum of the individual probabilities.

Probability is like mass. The first axiom says mass is nonnegative, the second says the mass of the universe is 1, and the third says the masses of nonoverlapping bodies add. In advanced texts, the word "measure" is often used in discussing probabilities.

This third axiom is handy in computing probabilities. Consider an event A containing outcomes, a₁, a₂, ..., aₙ. Then,

    A = {a₁, a₂, ..., aₙ} = {a₁} ∪ {a₂} ∪ ··· ∪ {aₙ}
    Pr[A] = Pr[{a₁}] + Pr[{a₂}] + ··· + Pr[{aₙ}]

since the events, {aᵢ}, are disjoint. When the context is clear, the clumsy notation Pr[{aᵢ}] is replaced by the simpler Pr[aᵢ]. In words, the paradigm is clear: divide the event into its outcomes, calculate the probability of each outcome (technically, of the event containing only that outcome), and sum the probabilities to obtain the probability of the event.

Comment 1.5: The third axiom is often presented in a finite form:

    Pr[A₁ ∪ A₂ ∪ ··· ∪ Aₙ] = Σᵢ₌₁ⁿ Pr[Aᵢ]

when the Aᵢ are pairwise disjoint. A common special case holds for two disjoint events: if AB = ∅, then Pr[A ∪ B] = Pr[A] + Pr[B]. Both of these are special cases of the third axiom (just let the excess Aᵢ = ∅). But, for technical reasons that are beyond this text, the finite version does not imply the infinite version.


1.7 SIMPLE THEOREMS

In this section, we use the axioms to prove a series of "simple theorems" about probabilities. These theorems are so simple that they are often mistaken to be axioms themselves.

Theorem 1.1: Pr[∅] = 0.

PROOF:

    1 = Pr[S]                (the second axiom)
      = Pr[S ∪ ∅]            (S = S ∪ ∅)
      = Pr[S] + Pr[∅]        (by the third axiom)
      = 1 + Pr[∅]            (by the second axiom)

The last implies Pr[∅] = 0. •

This theorem provides a symmetry to Axiom 2, which states the probability of the sample space is 1. This theorem states the probability of the null space is 0.

The next theorem relates the probability of an event to the probability of its complement. The importance lies in the simple observation that one of these probabilities may be easier to compute than the other.

Theorem 1.2: Pr[Ā] = 1 - Pr[A]. In other words, Pr[A] + Pr[Ā] = 1.

PROOF: By definition, A ∪ Ā = S. Combining the second and third axioms, one obtains

    1 = Pr[S] = Pr[A ∪ Ā] = Pr[A] + Pr[Ā]

A simple rearrangement yields Pr[Ā] = 1 - Pr[A]. •

This theorem is useful in practice. Calculate Pr[A] or Pr[Ā], whichever is easier, and then subtract from 1 if necessary.

Theorem 1.3: Pr[A] ≤ 1 for any event A.

PROOF: Since 0 ≤ Pr[Ā] (by Axiom 1) and Pr[Ā] = 1 - Pr[A], it follows immediately that Pr[A] ≤ 1. •

Combining Theorem 1.3 and Axiom 1, one obtains

    0 ≤ Pr[A] ≤ 1

for any event A. This bears repeating: all probabilities are between 0 and 1. One can combine this result with Axiom 2 and Theorem 1.3 to create a simple form:

    0 = Pr[∅] ≤ Pr[A] ≤ Pr[S] = 1

The probability of the null event is 0 (it contains no outcomes, so the null event can never be true). The probability of the sample space is 1 (it contains all outcomes, so the sample space is always true). All other events are somewhere in between 0 and 1 inclusive. While it may seem counter-intuitive, it is reasonable in many experiments to define outcomes that have zero probability. Then, nontrivial events, A and B, can be defined such that A ≠ ∅ but Pr[A] = 0 and B ≠ S but Pr[B] = 1.

Many probability applications depend on parameters. This theorem provides a sanity check on whether a supposed result can be correct. For instance, let the probability of a head be p. Since probabilities are between 0 and 1, it must be true that 0 ≤ p ≤ 1. Now, one might ask what is the probability of getting three heads in a row? One might guess the answer is 3p, but this answer is obviously incorrect since 3p > 1 when p > 1/3. If the coin flips are independent (discussed below in Section 1.9), the probability of three heads in a row is p³. If 0 ≤ p ≤ 1, then 0 ≤ p³ ≤ 1. This answer is possibly correct: it is between 0 and 1 for all permissible values of p. Of course, lots of incorrect answers are also between 0 and 1. For instance, p², p/3, and cos(pπ/2) are all between 0 and 1 for 0 ≤ p ≤ 1, but none is correct.

Theorem 1.4: If A ⊂ B, then Pr[A] ≤ Pr[B].

PROOF:

    B = (A ∪ Ā)B = AB ∪ ĀB = A ∪ ĀB         (A ⊂ B implies AB = A)
    Pr[B] = Pr[A] + Pr[ĀB] ≥ Pr[A]           (since Pr[ĀB] ≥ 0)
•

Probability is an increasing function of the outcomes in an event. Adding more outcomes to the event may cause the probability to increase, but it will not cause the probability to decrease. (It is possible the additional outcomes have zero probability. Then, the events with and without those additional outcomes have the same probability.)

Theorem 1.5: For any two events A and B,

    Pr[A ∪ B] = Pr[A] + Pr[B] - Pr[AB]        (1.5)

This theorem generalizes Axiom 3. It does not require A and B to be disjoint. If they are, then AB = ∅, and the theorem reduces to Axiom 3.


PROOF: This proof uses a series of basic results from set arithmetic. First,

    A = AS = A(B ∪ B̄) = AB ∪ AB̄

Second, for B,

    B = BS = B(A ∪ Ā) = AB ∪ ĀB

and

    A ∪ B = AB ∪ AB̄ ∪ AB ∪ ĀB = AB ∪ AB̄ ∪ ĀB

Thus,

    Pr[A ∪ B] = Pr[AB] + Pr[AB̄] + Pr[ĀB]        (1.6)

Similarly,

    Pr[A] + Pr[B] = Pr[AB] + Pr[AB̄] + Pr[AB] + Pr[ĀB] = Pr[A ∪ B] + Pr[AB]

Rearranging the last equation yields the theorem: Pr[A ∪ B] = Pr[A] + Pr[B] - Pr[AB]. (Note that Equation 1.6 is a useful alternative to this theorem. In some applications, Equation 1.6 is easier to use than Equation 1.5.) •

The following theorem is known as the inclusion-exclusion formula. It is a generalization of the previous theorem. Logically, the theorem fits in this section, but it is not a "small" theorem given the complexity of its statement.

Theorem 1.6 (Inclusion-Exclusion Formula): For any events A₁, A₂, ..., Aₙ,

    Pr[A₁ ∪ A₂ ∪ ··· ∪ Aₙ] = Σ Pr[Aᵢ] - ΣΣ Pr[AᵢAⱼ] + ΣΣΣ Pr[AᵢAⱼAₖ] - ··· ± Pr[A₁A₂···Aₙ]

where the single sum runs over i = 1, ..., n, the double sum over j < i, and the triple sum over k < j < i.

This theorem is actually easier to state in words than in symbols: the probability of the union is the sum of all the individual probabilities minus the sum of all pair probabilities plus the sum of all triple probabilities, etc., until the last term, which is the probability of the intersection of all events.

PROOF: This proof is by induction. Induction proofs consist of two major steps: the basis step and the inductive step. An analogy is climbing a ladder. To climb a ladder, one must first get on the ladder. This is the basis step. Once on, one must be able to climb from the (n - 1)-st step to the nth step. This is the inductive step. One gets on the first step, then climbs to the second, then climbs to the third, and so on, as high as one desires.

The basis step is Theorem 1.5. Let A = A₁ and B = A₂; then Pr[A₁ ∪ A₂] = Pr[A₁] + Pr[A₂] - Pr[A₁A₂].


The inductive step assumes the theorem is true for n - 1 events. Given the theorem is true for n - 1 events, one must show that it is then true for n events. The argument is as follows (though we skip some of the more tedious steps). Let A = A₁ ∪ A₂ ∪ ··· ∪ Aₙ₋₁ and B = Aₙ. Then, Theorem 1.5 yields

    Pr[A₁ ∪ A₂ ∪ ··· ∪ Aₙ] = Pr[A ∪ B]
                           = Pr[A] + Pr[B] - Pr[AB]
                           = Pr[A₁ ∪ A₂ ∪ ··· ∪ Aₙ₋₁] + Pr[Aₙ] - Pr[(A₁ ∪ A₂ ∪ ··· ∪ Aₙ₋₁)Aₙ]

This last equation contains everything needed. The first term is expanded using the inclusion-exclusion theorem (which is true by the inductive assumption). Similarly, the last term is expanded the same way. Finally, the terms are regrouped into the pattern in the theorem's statement. But we skip these tedious steps. •

While the inclusion-exclusion theorem is true (it is a theorem after all), it may be tedious to keep track of which outcomes are in which intersecting events. One alternative to the inclusion-exclusion theorem is to write the complicated event as a union of disjoint events.

EXAMPLE 1.6

Consider the union A ∪ B ∪ C. Write it as follows:

    A ∪ B ∪ C = ABC ∪ ABC̄ ∪ AB̄C ∪ ĀBC ∪ AB̄C̄ ∪ ĀBC̄ ∪ ĀB̄C

    Pr[A ∪ B ∪ C] = Pr[ABC] + Pr[ABC̄] + Pr[AB̄C] + Pr[ĀBC] + Pr[AB̄C̄] + Pr[ĀBC̄] + Pr[ĀB̄C]

In some problems, the probabilities of these events are easier to calculate than those in the inclusion-exclusion formula. Also, note that it may be easier to use DeMorgan's first law (Eq. 1.1) and Theorem 1.2:

    Pr[A ∪ B ∪ C] = 1 - Pr[(A ∪ B ∪ C)‾] = 1 - Pr[ĀB̄C̄]

EXAMPLE 1.7

Let us look at the various ways we can compute Pr[A ∪ B]. For the experiment, select one card from a well-shuffled deck. Each card is equally likely, with probability equal to 1/52. Let A = {card is a ♦} and B = {card is a Q}. The event A ∪ B is the event the card is a diamond or is a Queen (Q). There are 16 cards in this event (13 diamonds plus 3 additional Q's). The straightforward solution is

    Pr[A ∪ B] = Pr[one of 16 cards selected] = 16/52

Using Theorem 1.5, AB = {card is the Q♦}, and

    Pr[A ∪ B] = Pr[A] + Pr[B] - Pr[AB] = 13/52 + 4/52 - 1/52 = 16/52
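The same answer can be obtained by brute-force enumeration of the deck. The following is an illustrative Python sketch; the particular encoding of ranks and suits is our own, not from the text.

from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))                 # 52 equally likely cards

A = {c for c in deck if c[1] == "diamonds"}        # card is a diamond
B = {c for c in deck if c[0] == "Q"}               # card is a Queen

pr = lambda event: Fraction(len(event), len(deck))
print(pr(A | B))                                   # 16/52 = 4/13
print(pr(A) + pr(B) - pr(A & B))                   # same answer, via Theorem 1.5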

Note that the "if" statement above compares one number to another number (so it is either always true or always false, depending on the values of p, ε, and ν). In the common special case when ε = ν, the rule simplifies further to set X̂(Y = 1) = 1 if p > ε. In other words, if the most common route to the observation Y = 1 is across the top, set X̂(Y = 1) = 1; if the most common route to Y = 1 starts at X = 0 and goes up, set X̂(Y = 1) = 0.

This example illustrates an important problem in statistical inference: hypothesis testing. In this case, there are two hypotheses. The first is that X = 1, and the second is that X = 0. The receiver receives Y and must decide which hypothesis is true. Often Bayes theorem and the LTP are essential tools in calculating the decision rule. Hypothesis testing is discussed further in Section 2.5 and in Chapter 12.


2.5 EXAMPLE: DRUG TESTING

Many employers test potential employees for illegal drugs. Fail the test, and the applicant does not get the job. Unfortunately, no test, whether for drugs, pregnancy, disease, incoming missiles, or anything else, is perfect. Some drug users will pass the test, and some nonusers will fail. Therefore, some drug users will in fact be hired, while other perfectly innocent people will be refused a job. In this section, we will analyze this effect using hypothetical tests. This text is not about blood or urine chemistry. We will not engage in how blood tests actually work, or in actual failure rates, but we will see how the general problem arises.

Assume the test, T, outputs one of two values. T = 1 indicates the person is a drug user, while T = 0 indicates the person is not a drug user. Let U = 1 if the person is a drug user and U = 0 if the person is not a drug user. Note that T is the (possibly incorrect) result of the test and that U is the correct value.

The performance of a test is defined by conditional probabilities. The false positive (or false alarm) rate is the probability of the test indicating the person is a drug user when in fact that person is not, or Pr[T = 1 | U = 0]. The false negative (or miss) rate is the probability the test indicates the person is not a drug user when in fact that person is, or Pr[T = 0 | U = 1]. A successful result is either a true positive or a true negative. See Table 2.1, below.

TABLE 2.1 A simple confusion matrix showing the relationships between true and false positives and negatives.

              U = 1             U = 0
    T = 1     True Positive     False Positive
    T = 0     False Negative    True Negative

Now, a miss is unfortunate for the employer (it is arguable how deleterious a drug user might be in most jobs), but a false alarm can be devastating to the applicant. The applicant is not hired, and his or her reputation may be ruined. Most tests try to keep the false positive rate low.

Pr[U = 1] is known as the a priori probability that a person is a drug user. Pr[U = 1 | T = 1] and Pr[U = 1 | T = 0] are known as the a posteriori probabilities of the person being a drug user. For a perfect test, Pr[T = 1 | U = 1] = 1 and Pr[T = 1 | U = 0] = 0. However, tests are not perfect. Let us assume the false positive rate is ε, or Pr[T = 1 | U = 0] = ε; the false negative rate is ν, or Pr[T = 0 | U = 1] = ν; and the a priori probability that a given person is a drug user is Pr[U = 1] = p. For a good test, both ε and ν are fairly small, often 0.05 or less. However, even 0.05 (i.e., test is 95% correct) may not be good enough if one is among those falsely accused.

One important question is what is the probability that a person is not a drug user given that he or she failed the test, or Pr[U = 0 | T = 1]. Using Bayes theorem,

    Pr[U = 0 | T = 1] = Pr[T = 1 | U = 0] Pr[U = 0] / Pr[T = 1]
                      = Pr[T = 1 | U = 0] Pr[U = 0] / (Pr[T = 1 | U = 0] Pr[U = 0] + Pr[T = 1 | U = 1] Pr[U = 1])
                      = ε(1 - p) / (ε(1 - p) + (1 - ν)p)

This expression is hard to appreciate, so let us put in some numbers. First, assume that the false alarm rate equals the miss rate, or ε = ν (often tests that minimize the total number of errors have this property). In Figure 2.3, we plot Pr[U = 0 | T = 1] against Pr[U = 1] for two values of false positive and false negative rates. The first curve is for ε = ν = 0.05, and the second curve is for ε = ν = 0.20. One can see that when p
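The expression is also easy to evaluate numerically. The following Python sketch (the particular values of p are our own choices, not the book's) computes Pr[U = 0 | T = 1] for the two tests plotted in Figure 2.3; note how large the probability of a false accusation is when p is small.

def prob_innocent_given_positive(p, eps, nu):
    """Pr[U = 0 | T = 1] via Bayes theorem and the law of total probability.
    p   = Pr[U = 1]          (a priori probability of being a drug user)
    eps = Pr[T = 1 | U = 0]  (false positive rate)
    nu  = Pr[T = 0 | U = 1]  (false negative rate)
    """
    return eps * (1 - p) / (eps * (1 - p) + (1 - nu) * p)

for eps in (0.05, 0.20):
    for p in (0.01, 0.05, 0.20, 0.50):
        print(f"eps=nu={eps:.2f}  p={p:.2f}  "
              f"Pr[U=0|T=1]={prob_innocent_given_positive(p, eps, eps):.3f}")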

                1
              1   1
            1   2   1
          1   3   3   1
        1   4   6   4   1
      1   5  10  10   5   1

FIGURE 3.6 Pascal's triangle for n up to 5. Each number is the sum of the two numbers above it. The rows are the binomial coefficients for a given value of n.
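Pascal's triangle is simple to generate programmatically. The following is an illustrative Python sketch (the function name is our own) that builds each row by summing adjacent entries of the row above, reproducing the rows of Figure 3.6.

def pascal_rows(n_max):
    """Yield rows 0..n_max of Pascal's triangle."""
    row = [1]
    for _ in range(n_max + 1):
        yield row
        # each interior entry is the sum of the two entries above it
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]

for row in pascal_rows(5):
    print(row)
# [1], [1, 1], [1, 2, 1], [1, 3, 3, 1], [1, 4, 6, 4, 1], [1, 5, 10, 10, 5, 1]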

3.4 THE BINOMIAL THEOREM

The binomial coefficients are so named because they are central to the binomial theorem.

Theorem 3.1 (Binomial Theorem): For integer n ≥ 0,

    (a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}        (3.7)

The binomial theorem is usually proved recursively. The basis, n = 0, is trivial: the theorem reduces to 1 = 1. The recursive step assumes the theorem is true for n - 1 and uses this to show the theorem is true for n. The proof starts like this:

    (a + b)^n = (a + b)(a + b)^{n-1}        (3.8)


Now, substitute Equation (3.7) into Equation (3.8):

    \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} = (a + b) \sum_{k=0}^{n-1} \binom{n-1}{k} a^k b^{n-1-k}        (3.9)

The rest of the proof is to use Equation (3.6) and rearrange the right-hand side to look like the left.

The binomial theorem gives some simple properties of binomial coefficients:

    0 = ((-1) + 1)^n = \sum_{k=0}^{n} \binom{n}{k} (-1)^k 1^{n-k} = \binom{n}{0} - \binom{n}{1} + \binom{n}{2} - ··· + (-1)^n \binom{n}{n}        (3.10)

    2^n = (1 + 1)^n = \sum_{k=0}^{n} \binom{n}{k} 1^k 1^{n-k} = \sum_{k=0}^{n} \binom{n}{k} = \binom{n}{0} + \binom{n}{1} + \binom{n}{2} + ··· + \binom{n}{n}        (3.11)

We can check Equations (3.10) and (3.11) using the first few rows of Pascal's triangle:

    0 = 1 - 1 = 1 - 2 + 1 = 1 - 3 + 3 - 1 = 1 - 4 + 6 - 4 + 1
    2^1 = 2 = 1 + 1
    2^2 = 4 = 1 + 2 + 1
    2^3 = 8 = 1 + 3 + 3 + 1
    2^4 = 16 = 1 + 4 + 6 + 4 + 1

Since all the binomial coefficients are nonnegative, Equation (3.11) gives a bound on the binomial coefficients:

    \binom{n}{k} ≤ 2^n   for all k = 0, 1, ..., n        (3.12)

The binomial theorem is central to the binomial distribution, the subject of Chapter 6.
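Equations (3.10) through (3.12) can also be verified numerically. The short Python sketch below (illustrative only) checks them for the first few values of n using the standard library's binomial coefficient function.

from math import comb

for n in range(1, 5):
    assert sum((-1) ** k * comb(n, k) for k in range(n + 1)) == 0      # Equation (3.10)
    assert sum(comb(n, k) for k in range(n + 1)) == 2 ** n             # Equation (3.11)
    assert all(comb(n, k) <= 2 ** n for k in range(n + 1))             # Equation (3.12)
print("Equations (3.10)-(3.12) check out for n = 1, ..., 4")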

3.5 MULTINOMIAL COEFFICIENT AND THEOREM

The binomial coefficient tells us the number of ways of dividing n distinguishable objects into two piles. What if we want to divide the n objects into more than two piles? This is the multinomial coefficient. In this section, we introduce the multinomial coefficient and show how it helps analyze various card games.

How many ways can n distinguishable objects be divided into three piles with k₁ in the first pile, k₂ in the second, and k₃ in the third with k₁ + k₂ + k₃ = n? This is a multinomial coefficient and is denoted as follows:

    \binom{n}{k_1, k_2, k_3}

We develop this count in two stages. First, the n objects are divided into two piles, the first with k₁ objects and the second with n - k₁ = k₂ + k₃ objects. Then, the second pile is divided into two piles with k₂ in one pile and k₃ in the other. The number of ways of doing this is the product of the two binomial coefficients:

    \binom{n}{k_1, k_2, k_3} = \binom{n}{k_1} \binom{k_2 + k_3}{k_2} = \frac{n!}{k_1! k_2! k_3!}

If there are more than three piles, the formula extends simply:

    \binom{n}{k_1, k_2, ..., k_m} = \frac{n!}{k_1! k_2! ··· k_m!}        (3.13)

The binomial coefficient can be written in this form as well:

    \binom{n}{k} = \binom{n}{k, n-k}

The binomial theorem is extended to the multinomial theorem:

Theorem 3.2 (Multinomial Theorem):

    (a_1 + a_2 + ··· + a_m)^n = \sum \binom{n}{k_1, k_2, ..., k_m} a_1^{k_1} a_2^{k_2} ··· a_m^{k_m}        (3.14)

where the summation is taken over all values of kᵢ ≥ 0 such that k₁ + k₂ + ··· + kₘ = n. The sum seems confusing, but really is not too bad. Consider (a + b + c)²:

    (a + b + c)^2 = \binom{2}{2,0,0} a^2 b^0 c^0 + \binom{2}{0,2,0} a^0 b^2 c^0 + \binom{2}{0,0,2} a^0 b^0 c^2
                    + \binom{2}{1,1,0} a^1 b^1 c^0 + \binom{2}{1,0,1} a^1 b^0 c^1 + \binom{2}{0,1,1} a^0 b^1 c^1
                  = a^2 + b^2 + c^2 + 2ab + 2ac + 2bc

Let a₁ = a₂ = ··· = aₘ = 1. Then,

    m^n = \sum \binom{n}{k_1, k_2, ..., k_m}

For example,

    3^2 = \binom{2}{2,0,0} + \binom{2}{0,2,0} + \binom{2}{0,0,2} + \binom{2}{1,1,0} + \binom{2}{1,0,1} + \binom{2}{0,1,1}
        = 1 + 1 + 1 + 2 + 2 + 2
        = 9        (3.15)

In the game of bridge, an ordinary 52-card deck of cards is dealt into four hands, each with 13 cards. The number of ways this can be done is

    \binom{52}{13, 13, 13, 13} = 5.4 × 10^28

3.6 THE BIRTHDAY PARADOX AND MESSAGE AUTHENTICATION A classic probability paradox is the birthday problem: How large must k be for a group of k people to be likely to have at least two people with the same birthday? In this section, we solve this problem and show how it relates to the problem of transmitting secure messages. We assume three things: • Years have 365 days. Leap years have 366, but they occur only once every four years. • Birthdays are independent. Any one person's birthday does not affect anyone else's. In particular, we assume the group of people includes no twins, or triplets, etc. • All days are equally likely to be birthdays. This is not quite true, but the assumption spreads out the birthdays and minimizes the possibility of common birthdays. Births (in the United States anyway) are about 10% higher in the late summer than in winter. 1 There are fewer births on weekends and holidays. Ironically, "Labor Day" (in September) has a relatively low birthrate. We also take "likely" to mean a probability of 0.5 or more. How large must k be for a group of k people to have a probability of at least 0.5 of having at least two with the same birthday? Let Ak be the event that at least two people out of k have the same birthday. This event is complicated. Multiple pairs of people could have common birthdays, triples of people, etc. The complementary event-that no two people have the same birthday-is much simpler. Let q(k) = Pr[Ak] = Pr[no common birthday with k people]. Then, q(l) = 1 since one person cannot have a pair. What about q(2)? The first person's birthday is one of 365 days. The second person's birthday differs from the first's if it is one of the remaining 364 days. The probability of this happening is 364/365: q(2 ) 1

364 365 364 364 = q(l) . 365 = 365 . 365 = 365

National Vital Statistics Reports, vol. 55, no. 1 (September 2006). http://www.cdc.gov

58

CHAPTER 3 A LllTLE COMBINATORICS

The third person does not match either of the first two with probability (363/365): 363 365 364 363 q( 3) = q( 2 ). 365 = 365 . 365 . 365 We can continue this process and get a recursive formula for q(k): q(k) = q(k- 1) .

365 + 1 - k 365

365. 364 ... (366- k) 365k

(365lk 365k

(3.16)

Note that the probability is the number of ways of making a permutation of k objects (days) taken from n = 365 divided by the number of ways of making an ordered with replacement selection of k objects from 365. How large does k have to be so that q(k) < 0.5? A little arithmetic shows that k = 23 suffices since q(22) = 0.524 and q(23) = 0.493. The q(k) sequence is shown in Figure 3.7.

Pr[no pair] 0.5

0 5

10

15

20

25

30

k FIGURE 3.7 The probability of no pairs of birthdays fork people in a year with 365 days.

Why is such a small number of people sufficient? Most people intuitively assume the number would be much closer to 182 = 365/2. To see why a small number of people is sufficient, first we generalize and let n denote the number of days in a year: q(k) = (nlk nk

n(n- 1)(n- 2) · · · (n- k + 1) nnn···n n n-1 n-2 n-k+1 n n n n

=(1-~)(1-~)(1-~)···(1-k:1) The multiplication of all these terms is awkward to manipulate. However, taking logs converts those multiplications to additions: log (q(k)) = log (

1- ~) +log (1- ~) +log ( 1- ~) + ... +log (1- k: 1 )

3.6 The Birthday Paradox and Message Authentication

59

Now, use an important fact about logs: when xis small, log(l +x)

for

"-'X

X"-'

0

(3.17)

This is shown in Figure 3.8. (We use this approximation several more times in later chapters.)

y=x

y = log(l +x)

0.693

-0.5 X

-0.693

FIGURE 3.8 Plot showing log(1 +x)

"'X

when xis small.

Since k « n, we can use this approximation: 0 1 2 k- 1 1 k-l k(k -1) -log(qCkl)"-' -+-+-+···+- =- .[l= - n n n n n l=O 2n Finally, invert the logarithm:

k(k -1))

q(k)"-'exp ( -~

(3.18)

When k = 23 and n = 365, the approximation evaluates to 0.50 (actually 0.499998), close to the actual 0.493. Alternatively, we can set q(k) = 0.5 and solve for k: 0.5 = q(k) -k(k-1) log(0.5) = -0.693 "' - -n2

(3.19)

Thus, k(k -1)12 "'0.693n. For n = 365, k(k -1)12 "'253. Hence, k = 23. The intuition in the birthday problem should be that k people define k(k -1) /2 pairs of people and, at most, about 0.693n pairs can have different birthdays.

60

CHAPTER 3 A LllTLE COMBINATORICS

Comment 3.3: We use "log" or "log." to denote the natural logarithm. Some

textbooks and many calculators use "In" instead. That is, ify = e", then x = log(y). Later, we need log 10 and log 2 . lfy = 10x, then x = log10 (y), and ify = 2x, then x = log2 (y). See also Comment 5.9.

EXAMPLE3.3

The birthday problem is a convenient example to introduce simulation, a computational procedure that mimics the randomness in real systems. The python command trial=randint (n, size=k) returns a vector of k random integers. All integers between 0 and n- 1 are equally likely. These are birthdays. The command unique (trial) returns a vector with the unique elements of trial. If the size of the unique vector equals k, then no birthdays are repeated. Sample python code follows: k,

n

=

23,

365

numtrials = 10000 count = 0 for i in range(numtrials): trial = randint(n, size=k) count+= (unique(trial).size k) phat = count/numtrials std = 1.0/sqrt(2*numtrials) print phat, phat-1.96*std, phat+1.96*std A sequence of 10 trials with n = 8 and k = 3looks like this: [0 7 6], [3 6 6], [52 0], [0 4 4], [56 0], [2 56], [6 2 5], [1 7 0], [2 2 7], [6 0 3]. We see the second, fourth, and ninth trials have repeated birthdays; the other seven do not. The probability of no repeated birthday is estimated asp= 7110 = 0.7. The exact probability is p = 1· (718) · (618) = 0.66. The confidence interval is an estimate of the range of values in which we expect to find the correct value. If n is large enough, we expect the correct value to be in the interval $ - 1. 96 I ../iN, p+ 1. 96 I v'2NJ approximately 9 5% of the time (where N is the number of trials). In 10,000 trials with n = 365 and k = 23, we observe 4868 trials with no repeated birthdays, giving an estimate p = 4868110000 = 0.487, which is pretty close to the exact value p = 0.493. The confidence interval is (0.473,0.501), which does indeed contain the correct value. More information on probability estimates and confidence intervals can be found in Sections 5.6, 6.6, 8.9, 9.7.3, 10.1, 10.6, 10.7, and 10.8. What does the birthday problem have to do with secure communications? When security is important, users often attach a message authentication code (MAC) to a message. The MAC is a b-bit signature with the following properties: • The MAC is computed from the message and a password shared by sender and receiver. • Two different messages should with high probability have different MACs.

3.7 Hypergeometric Probabilities and Card Games

61

• The MAC algorithm should be one-way. It should be relatively easy to compute a MAC from a message but difficult to compute a message that has a particular MAC. Various MAC algorithms exist. These algorithms use cryptographic primitives. Typical values of bare 128, 196, and 256. The probability that two different messages have the same MAC is small, approximately 2-b. When Alice wants to send a message (e.g., a legal contract) to Bob, Alice computes a MAC and sends it along with the message. Bob receives both and computes a MAC from the received message. He then compares the computed MAC to the received one. If they are the same, he concludes the received message is the same as the one sent. However, if the MACs differ, he rejects the received message. The sequence of steps looks like this: 1. Alice and Bob share a secret key, K. 2. Alice has a message, M. She computes a MACh= H(M,K), where His a MAC function. 3. Alice transmits M and h to Bob. 4. Bob receives M' and h'. These possibly differ from M and h due to channel noise or an attack by an enemy. 5. Bob computes h" = H(M',K) from the received message and the secret key. 6. If h" = h', Bob assumes M' = M. If not, Bob rejects M'.

If, however, Alice is dishonest, she may try to deceive Bob with a birthday attack. She computes a large number k of messages in trying to find two with the same MAC. Of the two with the same MAC, one contract is favorable to Bob, and one cheats Bob. Alice sends the favorable contract to Bob, and Bob approves it. Sometime later, Alice produces the cheating contract and falsely accuses Bob of reneging. She argues that Bob approved the cheating contract because the MAC matches the one he approved. How many contracts must Alice create before she finds two that match? The approximation (Equation 3.18) answers this question. With n = 2b and k large, the approximation indicates (ignoring multiplicative constants that are close to 1):

The birthday attack is much more efficient than trying to match a specific MAC. For b = 128, Alice would have to create about n/2 = 1.7 x 10 38 contracts to match a specific MAC. However, to find any two that match, she has to create about k"' yn = 1.8 x 10 19 contracts, a factor of 10 19 faster. Making birthday attacks difficult is one important reason why b is usually chosen to be relatively large.

3.7 HYPERGEOMETRIC PROBABILITIES AND CARD GAMES Hypergeometric probabilities use binomial coefficients to calculate probabilities of experiments that involve selecting items from various groups. In this section, we develop the hypergeometric probabilities and show how they are used to analyze various card games.

62

CHAPTER 3 A LllTLE COMBINATORICS

Consider the task of making an unordered selection without replacement of k 1 items from a group of n 1 items, selecting k2 items from another group of n2 items, etc., to km items from nm items. The number of ways this can be done is the product of m binomials: (3.20) Consider an unordered without replacement selection of k items from n. Let the n items be divided into groups with nl> n2, ... ,nm in each. Similarly, divide the selection as kl> k2, ... ,km. If all selections are equally likely, then the probability of this particular selection is a hypergeometric probability:

(3.21)

The probability is the number of ways of making this particular selection divided by the number of ways of making any selection. EXAMPLE3.4

Consider a box with four red marbles and five blue marbles. The box is shaken, and someone reaches in and blindly selects six marbles. The probability of two red marbles and four blue ones is

(~)(!) 6·5 5 Pr 2 red and 4 blue marbles = -9( = ( 9 .8 . 7 ) = ) 14 6 3·2·1

[

l

For the remainder of this section, we illustrate hypergeometric probabilities with some poker hands. In most versions of poker, each player must make his or her best five-card hand from some number of cards dealt to the player and some number of community cards. For example, in Seven-Card Stud, each player is dealt seven cards, and there are no community cards. In Texas Hold 'em, each player is dealt two cards, and there are five community cards. The poker hands are listed here from best to worst (assuming no wild cards are being used):

Straight Flush The five best cards are all of the same suit (a flush) and in rank order (a straight). For instance, 5

l

Pr [ three Qs, two 8s =

(i)(i) 4 .6 -6 es2) = 2598960 = 9.2 X 10

Comment 3.4: Sometimes it is easier to remember this formula if it is written as

The idea is that the 52 cards are divided into three groups: Q's, 8's, and Others. Select three Q's, two 8's, and zero Others. The mnemonic is that 3 + 2 + 0 = 5 and 4+4+44=52.

The number of different full houses is (13h = 156 (13 ranks for the triple and 12 ranks for the pair). Therefore, the probability of getting any full house is Pr [any Full Housel = 156. 9.2

X

10- 6 = 1.4 X 10- 3

"' 1 in 700 hands If the player chooses his or her best five-card hand from seven cards, the probability of a full house is much higher (about 18 times higher). However, the calculation is much more involved. First, list the different ways of getting a full house in seven cards. The most obvious way is to get three cards of one rank, two of another, one of a third rank, and one of a fourth. The second way is get three cards in one rank, two in a second rank, and two in a third rank. (This full house is the triple and the higher-ranked pair.) The third way is to get three cards of one rank, three cards of a second rank, and one card in a third rank. (This full house is the higher-ranked triple and a pair from the lower triple.) We use the shorthand 3211, 322, and 331 to describe these three possibilities.

64

CHAPTER 3 A LllTLE COMBINATORICS

Second, calculate the number ways of getting a full house. Let n(3211) be the number of ways of getting a specific 3211 full house, N(3211) the number of ways of getting any 3211 full house, and similarly for 322 and 331 full houses. Since there are four cards in each rank,

n(3211) =

(!)(~)(;)(;) = 4 ·6 ·4 ·4 = 384

n(322) =

(:)(~)(~) = 4 ·6 ·6 = 144

n(331) =

(!)(!)(;)

= 4 ·4 ·4 = 64

Consider a 3211 full house. There are 13 ways of selecting the rank for the triple, 12 ways of selecting the rank for the pair (since one rank is unavailable), and C21) =55 ways of selecting the two ranks for the single cards. Thus,

N(3211) = 13 ·12 · (

~1 ) · n(3211) = 8580 · 384 = 3,294,720

Comment 3.5: The number of ways of selecting the four ranks is not(~) since the ranks are not equivalent. That is, three Q's and two K's is different from two Q's and three K's. The two extra ranks are equivalent, however. In other words, three Q's, two K's, one A, and oneJ is the same as three Q's, two K's, oneJ, and one A.

For a 322 full house, there are 13 ways of selecting the rank of the triple and ways of selecting the ranks of the pairs:

N(322) = 13 ·

C

2 ) 2

= 66

c;) ·

n(322) = 858 · 144 = 123,552

Finally, for a 331 full house, there are C{) = 78 ways of selecting the ranks of the two triples and 11 ways of selecting the rank of the single card:

N(331)

=en

·11· n(331) = 858 · 64 = 54,912

Now, let N denote the total number of full houses: N = N(3211) + N(322) + N(331) = 3,294, 720 + 123,552 + 54,912 = 3,473,184

3.7 Hypergeometric Probabilities and Card Games 65

Third, the probability of a full house is N divided by the number of ways of selecting seven cards from 52: Pr [fu ll h ouse l =

3,473,183

(

52

)

3,473,183

=

133,784,560

= 0.02596

7 "' 1 in 38.5 hands

So, getting a full house in a seven -card poker game is about 18 times as likely as getting one in a five-card game. Incidentally, the conditional probability of a 3211 full house given one has a full house is

[

l

I

Pr 3211 full house any full house =

3,294,720 3,473,184

= 0.95

About 19 of every 20 full houses are the 3211 variety.

Comment 3.6: We did not have to consider the (admittedly unlikely) hand of three

cards in one rank and four cards in another rank . This hand is not a full house . It is four of a kind, and four of a kind beats a full house . See also Problem 3.30.

In Texas Hold 'em, each player is initially dealt two cards. The best starting hand is two Aces, the probability of which is 8

(~)(~ ) Pr[two Aces]=

(~)

e22) = e22) =

6 1 1326 = 221

The worst starting hand is considered to be the "7 2 off-suit;' or a 7 and a 2 in two different suits. (This hand is unlikely to make a straight or a flush, and any pairs it might make would likely lose to other, higher, pairs.) There are four ways to choose the suit for the 7 and three ways to choose the suit for the 2 (since the suits must differ), giving a probability of

" ff "] 4. 3 2 Pr [ 7 2 o -suit = ( ) =

522

221

In a cruel irony, getting the worst possible starting hand is twice as likely as getting the best starting hand.

Comment 3.7: Probability analysis can answer many poker questions but not all,

including some really important ones. Because of the betting sequence and the fact that a player's cards are hidden from the other players, simple questions like "Do I have a winning hand?" often cannot be answered with simple probabilistic analysis. In most cases, the answer depends on the playing decisions made by the other players at the table.

66

CHAPTER 3 A LllTLE COMBINATORICS

In many experiments, the outcomes are equally likely. The probability of an event then is simply the number of outcomes in the event divided by the number of possible outcomes. Accordingly, we need to know how to count the outcomes. Consider a selection of k items from n items. The selection is ordered if the order of the selection matters (i.e., if ab differs from ba). Otherwise, it is unordered. The selection is with replacement if each item can be selected multiple times (i.e., if aa is possible). Otherwise, the selection is without replacement. The four possibilities for a selection of k items from n are considered below: • Ordered With Replacement: The number of selections is nk. • Ordered Without Replacement: These are called permutations. The number of permutations is (nlk = n!l (n- k)!, where n! = n(n -1)(n- 2) · · · 2 ·1, 0! = 1, and n! = 0 for n < 0. n!

is pronounced "n factorial:' • Unordered Without Replacement: These are combinations. The number of combinations of k items selected from n is G). This number is also known as a binomial coefficient:

(n)k -

n!

(

n)

(n-k)!k!- n-k

• Unordered With Replacement: The number of selections is

(n+z- 1).

The binomial coefficients are central to the binomial theorem:

The binomial coefficient can be generalized to the multinomial coefficient:

The binomial theorem is extended to the multinomial theorem:

where the summation is taken over all values of k; ~ 0 such that k 1 + k 2 + · · · + km = n. Consider an unordered selection without replacement of k 1 items from n 1 , k 2 from n 2 , through km from nm. The number of selections is

Problems 67 The hypergeometric probabilities measure the likelihood of making this selection (assuming all selections are equally likely):

where n = n1 + n2 + · · · + nm and k = k1 + k2 + · · · + km. The birthday problem is a classic probability paradox: How large must k be for a group of k people to have two with the same birthday? The surprising answer is a group of 23 is more likely than not to have at least two people with the same birthday. The general answer for k people in a year of n days is

(nh ( k(k-1)) Pr[no pair]= k "'exp - - - n

2n

Setting this probability equal to 0.5 and solving result in

k(k-1)

- -2

"'0.693n

If n is large, k "' yn. Combinatorics can get tricky. Perhaps the best advice is to check any formulas using small problems for which the answer is known. Simplify the problem, count the outcomes, and make sure the count agrees with the formulas.

3.1

List the sequences of three 1'sand two O's.

3.2 Show (3.22) This formula comes in handy later when discussing binomial probabilities. 3.3

Prove the computational sequence in Figure 3.5 can be done using only integers.

3.4

Prove Equation (3.6) algebraically using Equation (3.4).

3.5

Write a computer function to calculate the nth row of Pascal's triangle iteratively from the (n -1)-st row. (The computation step, using vector operations, can be done in one line.)

3.6

The sum of the integers 1 + 2 + · · · + n = n(n + 1) /2 is a binomial coefficient. a. Which binomial coefficient is the sum? b. Using Pascal's triangle, can you think of an argument why this is so?

3.7

Complete the missing steps in the proof of the binomial theorem, starting with Equation (3.9).

68

CHAPTER 3 A LllTLE COMBINATORICS

3.8

Using the multinomial theorem as in Equation (3.15): a. Expand 42 = (1 + 1 + 1 + 1) 2 . b. Expand 42 = (2 + 1 + 1) 2 .

3.9

Write a function to compute the multinomial coefficient for an arbitrary number of piles with k 1 in the first, k 2 in the second, etc. Demonstrate your program works by showing it gets the correct answer on several interesting examples.

3.10

If Alice can generate a billion contracts per second, how long will it take her to mount a birthday attack for b = 32, 64, 128, and 256? Express your answer in human units (e.g., years, days, hours, or seconds) as appropriate.

3.11

Assume Alice works for a well-funded organization that can purchase a million computers, each capable of generating a billion contracts per second. How long will a birthday attack take forb= 32, 64, 128, and 256?

3.12

For a year of 366 days, how many people are required to make it likely that a pair of birthdays exist?

3.13

A year on Mars is 687 (Earth) days. If we lived on Mars, how many people would need to be in the room for it to be likely that at least two of them have the same birthday?

3.14

A year on Mars is 669 (Mars) days. If we were born on and lived on Mars, how many people would need to be in a room for it to be likely at least two of them have the same birthday?

3.15

Factorials can get big, so big that computing them can be problematic. In your answers below, clearly specify what computing platform you are using and how you are doing your calculation. a. What is the largest value of n for which you can compute n! using integers? b. What is the largest value of n for which you can compute n! using floating point numbers? c. How should you compute log(n!) for large n? Give a short program that computes log(n!). What is log(n!) for n =50, 100,200?

3.16

Stirling's formula gives an approximation for n!: (3.23) The approximation in Equation (3.23) is asymptotic. The ratio between the two terms goes to 1 as n- oo:

Plot n! and Stirling's formula on a log-log plot for n = 1,3, 10,30, 100,300,1000. Note that n! can get huge. You probably cannot compute n! and then log(n!). It is better to compute log(n!) directly (see Problem 3.15). 3.17

Show the following:

Problems 69 3.18

The approximation log(l +x) "'x (Equation 3.17) is actually an inequality: log(l +x)

~

x

for all x > -1

(3.24)

as careful examination of Figure 3.8 shows. Repeat the derivation of the approximation to q(k), but this time, develop an inequality. a. What is that inequality? b. Evaluate the exact probability and the inequality for several values of k and n to demonstrate the inequality. 3.19

The log inequality above (Equation 3.24) is sometimes written differently. a. Show the log inequality (Equation 3.24) can be written as y-1

~

log(y)

(3.25)

for y>O

b. Recreate Figure 3.8 using Equation (3.25).

= x -log(l +x) versus x.

3.20

Plotj(x)

3.21

One way to prove the log inequality (Equation 3.24) is to minimize j(x)

=x

- log

(1 +x).

a. Use calculus to show that x

= 0 is a possible minimum ofj(x).

b. Show that x = 0 is a minimum (not a maximum) by evaluatingj(O) and any other value, such as j(l). 3.22

Solve a different birthday problem: a. How many other people are required for it to be likely that someone has the same birthday as you do? b. Why is this number so much larger than 23? c. Why is this number larger than 365/2

= 182?

3.23

In most developed countries, more children are born on Tuesdays than any other day. (Wednesdays, Thursdays, and to a lesser extent, Fridays are close to Tuesdays, while Saturdays and Sundays are much less so.) Why?

3.24

Using the code in Example 4. 7, or a simple rewrite in Matlab orR, estimate the probabilities of no common birthday for the following: a. n = 365 and k = 1,2, ... ,30. Plot your answers and the calculated probabilities on the same plot (see Figure 3.7). b. n = 687 (Mars) and k = 5,10,15,20,25,30,35,40,45,50. Plot your answers and the calculated probabilities on the same plot.

3.25

Sometimes "unique" identifiers are determined by keeping the least significant digits of a large number. For example, patents in the United States are numbered with seven-digit numbers (soon they will need eight digits). Lawyers refer to patents by their last three digits; for example, Edison's light bulb patent is No. 223,898. Lawyers might refer to it as the "898" patent. A patent infringement case might involve k patents. Use the approximation in Equation (3.18) to calculate the probability of a name collision fork= 2,3, ... ,10 patents.

70

CHAPTER 3 A LllTLE COMBINATORICS

3.26

Continuing Problem 3.25, organizations once commonly used the last four digits of a person's Social Security number as a unique identifier. (This is not done so much today for fear of identity theft.) a. Use the approximation in Equation (3.18) to calculate the probability of an identifier collision fork= 10, 30,100 people. b. What value of k people gives a probability ofO.S of an identifier collision?

3.27

The binomial coefficient can be bounded as follows (fork> 0):

a. Prove the left-hand inequality. (The proof is straightforward.) b. The proof of the right -hand inequality is tricky. It begins with the following:

Justify the inequality and the equality. c. The next step in the proof is to show (1 + x)" show this.

~ exn.

Use the inequality log(l + x)

~

x to

d. Finally, let x =kin. Complete the proof. e. Evaluate all three terms for n = 10 and k = 0, 1, ... , 10. The right-hand inequality is best when k is small. When k is large, use the simple inequality ~ 2" from Equation (3.12).

m

3.28

In the game Yahtzee sold by Hasbro, a player throws five ordinary six-sided dice and tries to obtain various combinations of poker-like hands. In a throw of five dice, compute the following: a. All five dice showing the same number. b. Four dice showing one number and the other die showing a different number. c. Three dice showing one number and the other two dice showing the same number (in poker parlance, a full house). d. Three dice showing one number and the other two dice showing different numbers (three-of-a-kind). e. Two dice showing one number, two other dice showing a different number, and the fifth die showing a third number (two pair). f. Five different numbers on the five dice. g. Show the probabilities above sum to 1.

3.29

Write a short program to simulate the throw of five dice. Use the program to simulate a million throws of five dice and estimate the probabilities calculated in Problem 3.28.

3.30

What is the probability of getting three cards in one rank and four cards in another rank in a selection of seven cards from a standard deck of 52 cards if all combinations of? cards from 52 are equally likely?

Problems 71 3.31

In Texas Hold 'em, each player initially gets two cards, and then five community cards are dealt in the center of the table. Each player makes his or her best five-card hand from the seven cards. a. How many initial two-card starting hands are there? b. Many starting hands play the same way. For example, suits do not matter for a pair of Aces. Each pair of Aces plays the same (on average) as any other pair of Aces. The starting hands can be divided into three playing groups: both cards have the same rank (a "pair"), both cards are in the same suit ("suited"), or neither ("off-suit"). How many differently playing starting hands are in each group? How many differently playing starting hands are there in total? (Note that this answer is only about l/8th the answer to the question above.) c. How many initial two-card hands correspond to each differently playing starting hand in the question above? Add them up, and show they total the number in the first question.

3.32

You are playing a Texas Hold 'em game against one other player. Your opponent has a pair of 9's (somehow you know this). The five community cards have not yet been dealt. a. Which two-card hand gives you the best chance of winning? (Hint: the answer is not a pair of Aces.) b. If you do not have a pair, which (non pair) hand gives you the best chance of winning?

3.33

In Texas Hold 'em, determine: a. The probability of getting a flush given your first two cards are the same suit. b. The probability of getting a flush given your first two cards are in different suits. c. Which of the two hands above is more likely to result in a flush?

3.34

In the game of Blackjack (also known as "21 "),both the player and dealer are initially dealt two cards. Tens and face cards (e.g., 10's, Jacks, Queens, and Kings) count as 10 points, and Aces count as either 1 or 11 points. A "blackjack" occurs when the first two cards sum to 21 (counting the Ace as 11). What is the probability of a blackjack?

3.35

Three cards are dealt without replacement from a well-shuffled, standard 52-card deck. a. Directly calculate Pr [three of a kind]. b. Directly calculate Pr [a pair and one other card]. c. Directly calculate Pr [three cards of different ranks]. d. Show the three probabilities above sum to 1.

3.36

Three cards are dealt without replacement from a well-shuffled, standard 52-card deck. a. Directly calculate Pr [three of the same suit]. b. Directly calculate Pr [two of one suit and one of a different suit]. c. Directly calculate Pr [three cards in three different suits]. d. Show the three probabilities above sum to 1.

72

CHAPTER 3 A LllTLE COMBINATORICS

3.37

A typical "Pick Six" lottery run by some states and countries works as follows: Six numbered balls are selected (unordered, without replacement) from 49. A player selects six balls. Calculate the following probabilities: a. The player matches exactly three of the six selected balls. b. The player matches exactly four of the six selected balls. c. The player matches exactly five of the six selected balls. d. The player matches all six of the six selected balls.

3.38

"Powerball" is a popular two-part lottery. The rules when this book was written are as follows: In the first part, five numbered white balls are selected (unordered, without replacement) from 59; in the second part, one numbered red ball is selected from 35 red balls (the selected red ball is the "Powerball"). Similarly, the player has a ticket with five white numbers selected from 1 to 59 and one red number from 1 to 35. Calculate the following probabilities: a. The player matches the red Powerball and none of the white balls. b. The player matches the red Powerball and exactly one of the white balls. c. The player matches the red Powerball and exactly two of the white balls. d. The player matches the red Powerball and all five of the white balls.

3.39

One "Instant" lottery works as follows: The player buys a ticket with five winning numbers selected from 1 to 50. For example, the winning numbers might be 4, 12, 13, 26, and 43. The ticket has 25 "trial" numbers, also selected from 1 to 50. If any of the trial numbers matches any of the winning numbers, the player wins a prize. The ticket, perhaps surprisingly, does not indicate how the trial numbers are chosen. We do know two things (about the ticket we purchased): first, the 25 trial numbers contain no duplicates, and second, none of the trial numbers match any of the winning numbers. a. Calculate the probability there are no duplicates for an unordered selection with replacement of k = 25 numbers chosen from n = 50. b. In your best estimate, was the selection of possibly winning numbers likely done with or without replacement? c. Calculate the probability a randomly chosen ticket has no winners if all selections of k = 25 numbers from n =50 (unordered without replacement) are equally likely. d. Since the actual ticket had no winners, would you conclude all selections of25 winning numbers were equally likely?

3.40

Consider selecting two cards from a well-shuffled deck (unordered and without replacement). Let K1 denote the event the first card is a King and K 2 the event the second card is a King. a. Calculate Pr[K1 nK2] = Pr[K!]Pr[K2I K!]. b. Compare to the formula of Equation (3.21) for calculating the same probability.

3.41

Continuing the Problem 3.40, let X denote any card other than a King. a. Use the LTP to calculate the probability of getting a King and any other card (i.e., exactly one King) in an unordered and without replacement selection of two cards from a well-shuffled deck. b. Compare your answer in part a to Equation (3.21 ).

Problems 3.42

73

Show Equation (3.21) can be written as

(3.26)

= 2, derive Equation (3.26) directly from first principles.

3.43

For m

3.44

In World War II, Germany used an electromechanical encryption machine called Enigma. Enigma was an excellent machine for the time, and breaking its encryption was an important challenge for the Allied countries. The Enigma machine consisted of a plugboard, three (or, near the end of the war, four) rotors, and a reflector (and a keyboard and lights, but these do not affect the security of the system). a. The plugboard consisted of 26 holes (labeled A to Z). Part of each day's key was a specification of k wires that connected one hole to another. For example, one wire might connect B to R, another might connect J to K, and a third might connect A to W How many possible connections can be made with k wires, where k = 0, 1, ... , 13? Evaluate the number fork= 10 (the most common value used by the Germans). Note that the wires were interchangeable; that is, a wire from A to B and one from C to D is the same as a wire from C to D and another from A to B. (Hint: for k = 2 and k = 3, there are 44,850 and 3,453,450 configurations, respectively.) b. Each rotor consisted of two parts: a wiring matrix from 26 inputs to 26 outputs and a movable ring. The wiring consisted of 26 wires with each wire connecting one input to one output. That is, each input was wired to one and only one output. The wiring of a rotor was fixed at the time of manufacture and not changed afterward. How many possible rotors were there? (The Germans obviously did not manufacture this many different rotors. They only manufactured a few different rotors, but the Allies did not know which rotors were actually in use.) c. For most of the war, Enigma used three different rotors placed left to right in order. One rotor was chosen for the first position, another rotor different from the first was chosen for the second position, and a third rotor different from the first two was chosen for the third position. How many possible selections of rotors were there? (Hint: this is a very large number.) d. In operation, the three rotors were rotated to a daily starting configuration. Each rotor could be started in any of 26 positions. How many starting positions were there for the three rotors? e. The two leftmost rotors had a moveable ring that could be placed in any of 26 positions. (The rightmost rotor rotated one position on each key press. The moveable rings determined when the middle and left rotors turned. Think of the dials in a mechanical car odometer or water meter.) How many different configurations of the two rings were there?

74

CHAPTER 3 A LllTLE COMBINATORICS

f. The reflector was a fixed wiring of 13 wires, with each wire connecting a letter to another letter. For example, a wire might connect C to G. How many ways could the reflector be wired? (Hint: this is the same as the number of plugboard connections with 13 wires.) g. Multiply the following numbers together to get the overall complexity of the Enigma machine: (i) the number of possible selections of three rotors, (ii) the number of daily starting positions of the three rotors, (iii) the number of daily positions of the two rings, (iv) the number of plugboard configurations (assume k = 10 wires were used), and (v) the number of reflector configurations. What is the number? (Hint: this is a really, really large number.) h. During the course of the war, the Allies captured several Enigma machines and learned several important things: The Germans used five different rotors (later, eight different rotors). Each day, three of the five rotors were placed left to right in the machine (this was part of the daily key). The Allies also learned the wiring of each rotor and were able to copy the rotors. They learned the wiring of the reflector. How many ways could three rotors be selected from five and placed into the machine? (Order mattered; a rotor configuration of 123 operated differently from 132, etc.) i. After learning the wiring of the rotors and the wiring of the reflector, the remaining

configuration variables (parts of the daily secret key) were (i) the placement of three rotors from five into the machine, (ii) the number of starting positions for three rotors, (iii) the position of the two rings (on the leftmost and middle rotors), and (iv) the plugboard configuration. Assuming k = 10 wires were used in the plugboard, how many possible Enigma configurations remained?

j. How important was capturing the several Enigma machines to breaking the encryption? For more information, see A. Ray Miller, The Cryptographic Mathematics of Enigma (Fort Meade, MD: Center for Cryptologic History, 2012).

CHAPTER

DISCRETE PROBABILITIES AND RANDOM VARIABLES

Flip a coin until a head appears. How many flips are required? What is the average number of flips required? What is the probability that more than 10 flips are required? That the number of flips is even? Many questions in probability concern discrete experiments whose outcomes are most conveniently described by one or more numbers. This chapter introduces discrete probabilities and random variables.

4.1 PROBABILITY MASS FUNCTIONS Consider an experiment that produces a discrete set of outcomes. Denote those outcomes as x 0 , x 1 , x 2 , ••• • For example, a binary bit is a 0 or 1. An ordinary die is 1, 2, 3, 4, 5, or 6. The number of voters in an election is an integer, as are the number of electrons crossing a PN junction in a unit of time or the number of telephone calls being made at a given instant of time. In many situations, the discrete values are themselves integers or can be easily mapped to integers; for example, Xk = k or maybe Xk = k!'... In these cases, it is convenient to refer to the integers as the outcomes, or 0, 1,2, .... Let X be a discrete random variable denoting the (random) result of the experiment. It is discrete if the experiment results in discrete outcomes. It is a random variable if its value is the result of a random experiment; that is, it is not known until the experiment is done.

Comment 4.1: Advanced texts distinguish between countable and uncountable sets. A set is countable if an integer can be assigned to each element (i .e., if one can count the elements). All the discrete random variables considered in this text are countable. The most important example of an uncountable set is an interval (e.g., the set ofvaluesx

75

76

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

such that 0 ~x ~ 1 ). For our purposes, sets are either discrete or continuous (or an obvious combination of the two).

Comment 4.2: Random variables are denoted with uppercase bold letters, sometimes with subscripts, such as X, Y, N, X,, and X2 . The values random variables take on are denoted with lowercase, italic letters, sometimes with subscripts, such as x,y, n, x,, and x2 .

A probability mass function (PMF) is a mapping of probabilities to outcomes:

p(k) = Pr[X = Xk] for all values of k. PMFs possess two properties: 1. Each value is nonnegative:

p(k)

~0

(4.1)

2. The sum of the p(k) values is 1: (4.2)

LP(k) = 1 k

Some comments about PMFs are in order: • A PMF is a function of a discrete argument, k. For each value of k, p(k) is a number between 0 and 1. That is, p(l) is a number, p(2) is a possibly different number, p(3) is yet another number, etc. • The simplest way of describing a PMF is to simply list the values: p(O), p(l), p(3), etc. Another way is to use a table:

k p(k) I

0~4

1

0.3

2 0.2

3 0.1

Still a third way is to use a case statement:

p(k)=

r 0.3 0.2 0.1

k=O k=1 k=2 k=3

• Any collection of numbers satisfying Equations (4.1) and (4.2) is a PMF. • In other words, there are an infinity of PMFs. Of this infinity, throughout the book we focus on a few PMFs that frequently occur in applications.

4.2 Cumulative Distribution Functions

77

4.2 CUMULATIVE DISTRIBUTION FUNCTIONS A cumulative distribution function (CDF) is another mapping of probabilities to outcomes, but unlike a PMF, a CDF measures the probabilities of {X :5 u) for all values of-=< u < =: Fx(u) =

Pr[X :5 u] = .[ p(k) k:xk'5: u

where the sum is over all k such that Xk :5 u. Note that even though the outcomes are discrete, u is a continuous parameter. All values from -= to= are allowed. CDFs possess several important properties: 1. The distribution function starts at 0:

FxC-=l = 0 2. The distribution function is nondecreasing. For u 1 < u2 ,

3. The distribution function ends at 1:

FxC=l = 1 A CDF is useful for calculating probabilities. The event {X :5 ud can be written as {X :5 uo) u {uo "kp(k)

(4.17)

k

When it is clear which random variable is being discussed, we will drop the subscript on Jtx (u) and write simply Jt (u).

84

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

The MGF is the Laplace transform of the PMF except that s is replaced by -u. Apart from this (mostly irrelevant) difference, the MGF has all the properties of the Laplace transform. Later on, we will use the fact that the MGF uniquely determines the PMF, just as the Laplace transform uniquely determines the signal. In signal processing, we rarely compute moments of signals, but in probability, we often compute moments of PMFs. We show how the MGF helps compute moments by two different arguments. The first argument begins by expanding the exponential in a Maclaurin series* and taking expected values, term by term:'

Now, take a derivative with respect to u: d ~ -d Jt(u) =O+E[X] +uE[X 2 ] + -E[X 3 ] +··· u 2!

Finally, set u = 0:

d Jt(u)l =O+E[X]+O+O+···=E[X] u=O dU Notice how the only term that "survives" both steps (taking the derivative and setting u = 0) isE[X]. The second moment is found by taking two derivatives, then setting u = 0:

In general, the kth moment can be found as (4.18)

The second argument showing why the MGF is useful in computing moments is to use properties of expected values directly:

!!_Jt(u) = !!_E[euX] =E[!!_euX] =E[XeuX] du du du

:u Jt(u) lu=O= E[Xe xj = E[Xj 0

"The Maclaurin series of a function isj(x) = j(O) + j'(O)x+ j"(O)x2 /2! +···.See Wolfram Mathworld, http://mathworld.wolfram.com/TaylorSeries.html.

4.5 Several Important Discrete PMFs 85

Similarly, for the kth moment,

The above argument also shows a useful fact for checking your calculations: J£(0) =E[e0Xj =E[lj = 1

Whenever you compute an MGF, take a moment and verify J£(0) = 1. The MGF is helpful in computing moments for many, but not all, PMFs (see Example 4.5). For some distributions, the MGF either cannot be computed in closed form or evaluating the derivative is tricky (requiring I:Hospital's rule, as discussed in Example 4.5). A general rule is to try to compute moments by Equation (4.3) first, then try the MGF if that fails.

Comment 4.8: Three transforms are widely used in signal processing: the Laplace

transform, the Fourier transform, and the Z transform . In probability, the analogs are MGF, the characteristic function, and the generating function . The MGF is the Laplace transform with s replaced by -u:

Jt(u) = f[e"X] = L,e"kp(k) = 5l'(-u) k

The characteristic {Unction is the Fourier transform with the sign of the exponent reversed: 'i&'(w) = E[eiwX] = L,eiwkp(k) =

$(- w)

k

The generating {Unction (or the probability generating {Unction) is the Z transform with z- 1 replaced by s: W(s) =£[?] = L,lp(k) =.I(s- 1 ) k

Mathematicians use the generating function extensively in analyzing combinatorics and discrete probabilities. In this text, we generally use the MGF in analyzing both discrete and continuous probabilities and the Laplace, Fourier, or Z transform, as appropriate, for analyzing signals.

4.5 SEVERAL IMPORTANT DISCRETE PMFs There are many discrete PMFs that appear in various applications. Four of the most important are the uniform, geometric, Poisson, and binomial PMFs. The uniform, geometric, and Poisson are discussed below. The binomial is presented in some detail in Chapter 6.

86

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

4.5. 1 Uniform PMF For a uniform PMF, all nonzero probabilities are the same. Typically, X takes on integral values, with k = 1,2, ... ,m: Pr[X=k] = {;

fork=1,2, ... ,m for all other k

The uniform PMF with m = 6 (e.g., for a fair die) is illustrated in the stem plot below:

3

6

k

Computing the mean and variance is straightforward (using Equations 4.6 and 4.7): (4.19) The mean is the weighted average of the PMF values (note that if m is even, the mean is not one of the samples):

Now, compute the variance: 2

a2=E[X2]-E[X]2= (2m+1)(m+l) _ (m+l) = (m+1)(m-l) X 6 4 12

EXAMPLE4.5

(4.20)

The uniform distribution is a bad example for the utility of computing moments with the MGF, requiring a level of calculus well beyond that required elsewhere in the text. With that said, here goes: 1 m Jt(u) = euk mk=i

L

Ignore the 11m for now, and focus on the sum. Multiplying by (eu -1) allows us to sum the telescoping series. Then, set u = 0 to check our calculation: (eu -1)

m

m+ l

m

k=i

k=2

k=i

I: euk = I: euk- I: euk = eu(m+l)- eu m eu(m+l)- eu I:euk=---k=i eu -1

4.5 Several Important Discrete PMFs

eu(m+l ) - e" e" -1

I

0 0

1-1

u=O

87

1-1

Warning bells should be clanging: danger ahead, proceed with caution! In situations like this, we can use LHospital's rule to evaluate the limit as u ~ 0. LHospital's rule says differentiate the numerator and denominator separately, set u = 0 in each, and take the ratio:

..4.. (eu (m+ I) - e") I

u=O

du

-Ju (e" -1)lu=O

(em+ l)eu(m+ l)- e"l)lu=O m+ 1-1

---=m

e"lu=O

After dividing by m, the check works: Jt (0) = 1. Back to E[X]: take a derivative of Jt'(u) (using the quotient rule), then set u = 0: meu(m+2) - (m + l)eu (m+l ) + e" m(e" -1)2

I u=O

Back to LHospital's rule, but this time we need to take two derivatives of the current numerator and denominator separately and set u = 0 (two derivatives are necessary because one still results in 0/0): 2 3 E[X] = m(m+2) -(m+l) +1 = m(m+l) = m+1 2m 2m 2

Fortunately, the moments of the uniform distribution can be calculated easily by the direct formula as the MG F is surprisingly complicated. Nevertheless, the MGF is often easier to use than the direct formula.

4.5.2 Geometric PMF The geometric PMF captures the notion of flipping a coin until a head appears. Let p be the probability of a head on an individual flip, and assume the flips are independent (one flip does not affect any other flips). A sequence of flips might look like the following: 00001. This sequence has four zeros followed by a 1. In general, a sequence oflength k will have k - 1 zeros followed by a 1. Letting X denote the number of flips required, the PMF of X is fork= 1,2, ... k~O

The PMF values are clearly nonnegative. They sum to 1 as follows:

88

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

The first 12 values of a geometric PMF with p = 0.3 are shown below:

0.3 0.3

X

X

03 " -~ 0.7 = 0.21 --~ 2

0.7 = 0.147

---r r

Pr[X = k] versus k with p = 0.3

r

T T , • • • •

5

k

10

The mean is difficult to calculate directly: E[X] =

f:

kp(l- p)k- 1 k=1

(4.21)

It is not obvious how to compute the sum. However, it can be calculated using the MGF:

...t{(u) =E[euX] =

f:

eukp(l- p)k-1 k=1

= peu

f:

eu(k-1)(1- p)k-1 k=1 oo

=peui:(O-p)eu)

I

(changing variable l = k - 1)

(4.22)

1=0

Let r = (1- p)eu, and using Equation (4.8), oo

L (0-p)eu) 1=0

I

oo

(substituting r = (1- p)eu)

= I:rl 1=0

1

for lrl < 1

1-r 1

1- (1- p)eu

= 1(1 -p)e"l < 1 in a neighborhood around u = 0 (to show the series converges and that we can take a derivative of the MGF) . Solving for u results in u < log(1i(1 -p)) . lfp > 0, log(1i(1 -p)) > 0. Thus, the series converges in a neighborhood of u = 0 . We can therefore take the derivative of the MGF at u = 0.

Comment 4.9: We should verity that lrl

Substituting this result into Equation (4.22) allows us to finish calculating ...t{(u):

...t{(u)=

peu 1 - (1 - p)eu

e-u -1 ·-=p(e-u-1+p) e-u

Check the calculation: ...t{(O) =pi (e-o -1 + p) = pi(l-1 + p) =pip= 1.

(4.23)

4.5 Several Important Discrete PMFs 89 Computing the derivative and setting u = 0,

!

.it(u) = p( -1) (e-u- 1 + pr2

d I E[X] = -.it(u) du

(-l)e-u

p = -1 = 2"

u=O

P

P

On average, it takes 1I p flips to get the first head. If p = 0.5, then it takes an average of2 flips; if p = 0.1, it takes an average of 10 flips. Taking a second derivation of .it(u) and setting u = 0 yield E[X 2 ] = (2- p)lp. The variance can now be computed easily: a2 x

2

2-p

1

1-p

=E[X2 ]-E[X] =p2- -p2 - =p2-

The standard deviation is the square root of the variance, ax= y'O- p)jp. A sequence of 50 random bits with p = 0.3 is shown below: 10100010100011110001000001111101000000010010110001 The runs end with the l's. This particular sequence has 20 runs: 1·01·0001·01·0001·1·1·1·0001·000001·1·1·1·1·01· 00000001·001·01·1·0001 The run lengths are the number of digits in each run: 12424111461111283214 The average run length of this particular sequence is 1+2+4+2+4+1+1+1+4+6+1+1+1+1+2+8+3+2+1+4 20

~

- - - - - - - - - - - - - - - - - - - - - - - = - = 2.5

This average is close to the expected value, E[X] EXAMPLE4.6

20

= 1/p = 1/0.3 = 3.33.

When a cell phone wants to connect to the network (e.g., make a phone call or access the Internet), the phone performs a "random access" procedure. The phone transmits a message. If the message is received correctly by the cell tower, the tower responds with an acknowledgment. If the message is not received correctly (e.g., because another cell phone is transmitting at the same time), the phone waits a random period of time and tries again. Under reasonable conditions, the success of each try can be modeled as independent Bernoulli random variables with probability p. Then, the expected number of tries needed is 1/p. In actual practice, when a random access message fails, the next message usually is sent at a higher power (the original message might not have been "loud" enough to be heard). In this case, the tries are not independent with a common probability of success, and the analysis of the expected number of messages needed is more complicated.

90

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

Comment 4.10: People are notoriously bad at generating random sequences by

themselves. Let us demonstrate this. Write down a "random" sequence of 1 'sand O's with a length of at least SO (more is better) and ending with a 1 (probability of a 1 is O.S). Do not use coins or a computer; just let the sequence flow from your head. Compute the run lengths, and compare the lengths of your runs to the expected lengths (half the runs should be length 1, one quarter of length 2, etc.). If you are like most people, you will have too many short runs and too few long runs.

4.5.3 The Poisson Distribution The Poisson distribution is widely used to describe counting experiments, such as the number of cancers in a certain area or the number of phone calls being made at any given time. The PMF of a Poisson random variable with parameter 1\, is ;~,k

Pr[X = k] = p(k) = k!e--' fork= 0, 1,2, ...

(4.24)

The first 14 points of a Poisson PMF with 1\, = 5 are shown below: 0.175

/1,=5

Pr[X=k]

• T 4

0

5

k

r T• • 9

• •

l3

To show the Poisson probabilities sum to 1, we start with the Taylor series fore-': 1\,2 /1,3 e-'=1+1\,+-+-+··· 2! 3!

Now, multiply both sides by e--':

The terms on the right are the Poisson probabilities. The moments can be computed with the MGF: Jt(u) =E[e"x]

=

(definition ofMGF)

f: e"kp(k)

(definition of expected value)

k=O 00

1\, k

= I:e"ke--'k=O

k!

-. 5. The ratio can be rearranged to yield a convenient method for computing the p(k) sequence: p(O) = e-,t;

fork= 1,2, ... do 1

p(k) =

(~) ·p(k-1);

end A sequence of 30 Poisson random variables with 1\, = 5 is 5665857344578656785822121743448 There are one 1, two 2's, two 3's, five 4's, six 5's, four 6's, four 7's, five 8's, and one 12. For example, the probability of a 5 is 0.175. In a sequence of 30 random variables, we would expect to get about 30 x 0.175 = 5.25 fives. This sequence has six, which is close to

92

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

the expected number. As another example, the expected number of 2's is

52

30·Pr[X=2] =30·

21 e-

5

=2.53

The actual number is 2, again close to the expected number. It is not always true, however, that the agreement between expected and actual is so good. For instance, this sequence has five 8's, while the expected number of 8's is 30·Pr[X=8] =30·

ss

B!e- 5 = 1.95

The sequence is random after all. While we expect the count of each number to be approximately equal to the expected number, we should not expect to observe exactly the "right" number. This example looked ahead a bit to multiple random variables. That is the topic of the next chapter. EXAMPLE4.7

The image sensor in a digital camera is made up of pixels. Each pixel in the sensor counts the photons that hit it. If the shutter is held open for t seconds, the number of photons counted is Poisson with mean .Itt. Let X denote the number of photons counted. Then, E [X] =.Itt Var[X] =.Itt ThecameracomputesanaverageZ =X ft. Therefore,E [Z ] =It and Var[Z] = (l/t 2 )Var [X] = ltjt. The performance of systems like this is typically measured as the signal-to-noise ratio (SNR), defined as the average power of the signal divided by the average of the noise power. We can write Z in signal plus noise form as Z = It + (Z - It) = It + N = signal + noise

where It is the signal and N = Z - It is the noise. Put another way, SNR=

signal power It 2 It 2 =--=-=.Itt noise power Var[N] lt/t

We see the SNR improves as It increases (i.e., as there is more light) and as t increases (i.e., longer shutter times). Of course, long shutter times only work if the subject and the camera are reasonably stationary.

4.6 GAMBLING AND FINANCIAL DECISION MAKING Decision making under uncertainty is sometimes called gambling, or sometimes investing, but is often necessary. Sometimes we must make decisions about future events that we can only partially predict. In this section, we consider several examples.

4.6 Gambling and Financial Decision Making 93

Imagine you are presented with the following choice: spend 1 dollar to purchase a ticket, or not. With probability p, the ticket wins and returns w + 1 dollars (w dollars represent your winnings, and the extra 1 dollar is the return of your original ticket price). If the ticket loses, you lose your original dollar. Should you buy the ticket? This example represents many gambling situations. Bettors at horse racing buy tickets on one or more horses. Gamblers in casinos can wager on various card games (e.g., blackjack or baccarat), dice games (e.g., craps), roulette, and others. People can buy lottery tickets or place bets on sporting events. Let X represent the gambler's gain (or loss). With probability p, you win w dollars, and with probability, 1 - p you lose one dollar. Thus, -1

X=

{w

with probability 1 - p with probability p

E[X] =pw+(l-p)(-1) =pw-(1-p) E[X2j = pw2 + (1- p)( -1)2 2 Var[X] = E[X 2 j-E[Xj = p(l- p)(w+ 1) 2 The wager is fair if E [X] = 0. If E [X] > 0, you should buy the ticket. On average, your win exceeds the cost. Conversely, if E[X] < 0, do not buy the ticket. Let Pe represent the break-even probability for a given wand We the break-even win for a given p. Then, at the break-even point, E[X] =O=pw-(1-p) Rearranging yields We = (1 - p) I p or Pe = 1I (w + 1). The ratio (1 - p) I p is known as the odds

ratio. For example, if w = 3, the break-even probability is Pe = 1 I (3 + 1) = 0.25. If the actual probability is greater than this, make the bet; if not, do not. EXAMPLE4.8

Imagine you are playing Texas Hold 'em poker and know your two cards and the first four common cards. There is one more common card to come. You currently have a losing hand, but you have four cards to a flush. If the last card makes your flush, you judge you will win the pot. There are w dollars in the pot, and you have to bet another dollar to continue. Should you? Since there are 13-4 = 9 cards remaining in the flush suit (these are "outs" in poker parlance) and 52- 6 = 46 cards overall, the probability of making the flush is p = 9146 = 0.195. If the pot holds more than We= (1- p)lp = 3719 = 4.1 dollars, make the bet.

Comment 4.11: In gambling parlance, a wager that returns a win ofw dollars for each

dollar wagered offers w-to-1 odds. For example, "4-to-1" odds means the break-even probability is Pe = 1 1(4 + 1) = 115. Some sports betting, such as baseball betting, uses money lines. A money line is a positive or negative number. If positive, say, 140, the money line represents the

94

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

gambler's potential win on a $1 00 bet; that is, the gambler bets $100 to win $140. If negative, say, -120, it represents the amount of money a gambler needs to wager to potentially win $1 00; that is, a $120 bet on the favorite would win $100 if successful. A baseball game might be listed as -140,+ 120, meaning a bet on the favorite requires $140 to win $100 and a bet of$100 on the underdog might win $120 . (That the numbers are not the same represents a profit potential to the bookmaker.)

When placing a bet, the gambler risks a sure thing (the money in his or her pocket) to potentially win w dollars. The bookmaker will adjust the payout w so that, on average, the expected return to the gambler is negative. In effect, the gambler trades a lower expected value for an increased variance. For a typical wager, E[X] = wp-(1-p) 0 Buying insurance is the opposite bet, trading a lower expected value for a lower variance. Let c be the cost of the policy, p the probability of something bad happening (e.g., a tree falling on your house), and v the cost when the bad thing happens. Typically, p is small, and v relatively large. Before buying insurance, you face an expected value of -vp and a variance of v 2p(l- p). After buying insurance, your expected value is -c whether or not the bad thing happens, and your variance is reduced to 0. Buying insurance reduces your expected value by c - vp > 0 but reduces your variance from v 2p(l- p) to 0. In summary, buying insurance is a form of gambling, but the trade-off is different. The insurance buyer replaces a possible large loss (of size v) with a guaranteed small loss (of size c). The insurance broker profits, on average, by c- vp > 0.

Comment 4.12: Many financial planners believe that reducing your variance by buying

insurance is a good bet (if the insurance is not too costly) . Lottery tickets (and other forms of gambling) are considered to be bad bets, as lowering your expected value to increase your variance is considered poor financial planning. For many people, however, buying lottery tickets and checking for winners is fun . Whether or not the fun factor is worth the cost is a personal decision outside the realm of probability. It is perhaps a sad commentary to note that government-run lotteries tend to be more costly than lotteries run by organized crime. For instance, in a "Pick 3" game with p = 1/1000, the typical government payout is SOD, while the typical organized crime payout is ?SO . In real-world insurance problems, both p and v are unknown and must be estimated . Assessing risk like this is called actuarial science. Someone who does so is an actuary.

Summary

95

A random variable is a variable whose value depends on the result of a random experiment. A discrete random variable can take on one of a discrete set of outcomes. Let X be a discrete random variable, and letxk fork= 0, 1,2, ... be the discrete outcomes. In many experiments, the outcomes are integers; that is, Xk = k. A probability mass function (PMF) is the collection of discrete probabilities, p(k):

p(k) = Pr[X = xk] The p(k) satisfy two important properties: 00

p(k) ~ 0

LP(k) = 1 k=O

and

A cumulative distribution function (CDF) measures the probability that X values of u:

Fx(u) = Pr[X :5 u] = L

:5

u for all

p(k)

k:xk'5:u

For all u < v,

0 = Fx( -=l :5 Fx(u)

:5

Fx(v)

:5

FxC=l = 1

Expected values are probabilistic averages: 00

E[gCXl] = LK(Xk)p(k) k=O Expected values are also linear:

E[ag1(X)+ bgz(X)j = aE[g1 (X)]+ bE[gz(X)] However, they are not multiplicative in general:

E[g1 (X)gz(X)] f:. E[g1 (X) ]E[gz(X)] The mean is the probabilistic average value of X: 00

llx=E[X] = LXkp(k) k=O The variance of X is a measure of spread about the mean:

The moment generatingfunction (MPF) is the Laplace transform of the PMF (except that the sign of the exponent is flipped): 00

Jtx (u) = E[ euX] = L euxkp(k) k=O

96

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

The MGF can be used for computing moments:

Four important discrete distributions are the Bernouli, uniform, geometric, and Poisson: • The Bernoulli PMF models binary random variables with Pr[X = 1- p, E[X] = p, and Var[X] = p(l- p).

1] = p and Pr[X = 0]

=

• The uniform PMF is used when the outcomes are equally likely; that is, p(k) = 11 n for k = 1,2, ... ,n, E[X] = (n + 1)/2, an Var[X] = (n + 1)(n -1)112. • The geometric PMF is used to model the number of trials needed for a result to occur (i.e., the number of flips required to get a heads): p(k) = p(l- p)k-l fork= 1,2, ... , E[X] =lip, and Var[X] = (l-p)!p 2 . • The Poisson distribution is used in many counting experiments, such as counting the number of cancers in a city: p(k) = Jtke-"-1 k! fork= 0, 1,2, ... , E[X] = Jt, and Var[X] = Jt.

4.1

X has the probabilities listed in the table below. What are E[X] and Var[X]? k Pr [X = k]

I

1

0.5

2

3

4

0.2

0.1

0.2

4.2

For the probabilities in Problem 4.1, compute the MGF, and use it to compute the mean and variance.

4.3

X has the probabilities listed in the table below. What is the CDF of X? k Pr [X = k]

I

1

0.5

2

3

4

0.2

0.1

0.2

4.4

For the probabilities in Problem 4.3, compute the MGF, and use it to compute the mean and variance.

4.5

X has the probabilities listed in the table below. What are E [X] and Var[X]? k Pr [X=k]

4.6

I

1

0.1

2

3

4

0.2

0.3

0.2

0.2

X has the probabilities listed in the table below. What is the CDF of X? k Pr [X=k]

I

1

0.1

2 0.2

3 0.3

4 0.2

0.2

Problems 97 4.7

Imagine you have a coin that comes up heads with probability 0.5 (and tails with probability 0.5). a. How can you use that coin to generate a bit with the probability of a 1 equal to 0.25? b. How might you generalize this to a probability of k/2n for any k between 0 and 2n? (Hint: you can flip the coin as many times as you want and use those flips to determine whether the generated bit is 1 or 0.)

4.8

If X is uniform on k = 1 to k = m: a. What is the distribution function of X? b. Plot the distribution function.

4.9

We defined the uniform distribution on k = 1,2, ... ,m. In some cases, the uniform PMF is defined on k = 0, 1, ... , m- 1. What are the mean and variance in this case?

4.10

Let N be a geometric random variable with parameter p. What is Pr[N ~ k] for arbitrary integer k > 0? Give a simple interpretation of your answer.

4.11

Let N be a geometric random variable with parameter p. Calculate Pr[N =liN~ k] for l~k.

4.12

Let N be a geometric random variable with parameter p. Calculate Pr[N::; M] forM, a positive integer, and Pr[N =kIN::; M] fork= 1,2, ... ,M.

4.13

Let N be a geometric random variable with parameter Pr[N = 2], and Pr[N ~ 2].

4.14

Let N be a Poisson random variable with parameter A= 1. Calculate and plot Pr[N = 0], Pr[N = 1], ... ,Pr[N = 6].

4.15

Let N be a Poisson random variable with parameter A= 2. Calculate and plot Pr[N = 0], Pr[N = 1], ... ,Pr[N = 6].

4.16

Using plotting software, plot the distribution function of a Poisson random variable with parameter A = 1.

4.17

For N Poisson with parameter A, show E[N2 ] = A+A2 .

4.18

The expected value of the Poisson distribution can be calculated directly:

a. Use this technique to compute E[X(X moment.)

-1)].

p=

113. Calculate Pr[N::; 2],

(This is known as the second factorial

b. Use the mean and second factorial moment to compute the variance. c. You have computed the first and second factorial moments (E [X] and E [X(X- 1)] ). Continue this pattern, and guess the kth factorial moment, E [X(X -1) ···(X- k + 1) ]. 4.19

Consider a roll of a fair six-sided die. a. Calculate the mean, second moment, and variance of the roll using Equations (4.19) and (4.20). b. Compare the answers above to the direct calculations in Examples 4.2 and 4.4.

98

CHAPTER 4 DISCRETE PROBABILITIES AND RANDOM VARIABLES

4.20

Show the variance of Y =aX+ b is uJ on b.

4.21

Let X be a discrete random variable.

=a

2

u~. Note that the variance does not depend

a. Show that the variance of X is nonnegative. b. Show the even moments E[X2k] are nonnegative. c. What about the odd moments? Find a PMF whose mean is positive but whose third moment is negative. 4.22

What value of a minimizes E [(X- a) 2 ]? Show this two ways. a. Write E [(X- a) 2 ] in terms of u 2 , f.l, and a (no expected values at this point), and find the value of a that minimizes the expression. b. Use calculus and Equation (4.14) to find the minimizing value of a.

4.23

Often random variables are normalized. Let Y =(X- Jlx)lux. What are the mean and variance of Y?

4.24

Generate your own sequence of 20 runs (twenty 1's in the sequence) with p = 0.3. The Matlab command rand (1 , n) Yl)p(k,l) k

(5.3)

l

Comment 5.2: In Chapter 8, we consider multiple continuous random variables. The

generalization of Equation (5.3) to continuous random variables replaces the double sum with a double integral over the two-dimensional density function (Equation 8.3 in Section 8.2): E[g(X, Y)]

=

L:L:

g(x,y)fXY(x,y)o/dx

With this change, all the properties developed below hold for continuous as well as discrete random variables.

Expected values have some basic properties: 1. If g(X, Y) = g 1(X, Y) + g2 (X, Y), then the expected value is additive: E[g1 (X, Y)

+ g2(X, Y)] = E[g1 (X, Y)] + E[g2(X, Y)]

In particular, means add: E[X + Y] = E[X] +E[Y]. 2. In general, the expected value is multiplicative only if g(X, Y) factors and X and Yare independent, which means the PMF factors; that is, PXY(k,l) = Px(k)py(l). If g(X, Y) = g 1(X)g2(Y) and X and Yare independent,

106

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

(5.4) If X and Yare dependent, then Equation (5.4) is, in general, not true.

5.3.2 Moments for Two Random Variables In addition to the moments (e.g., mean and variance) for the individual random variables, there are joint analogs of the mean and variance. These are the correlation and covariance. The correlation of X and Y is rxy = E[XY]. X and Yare uncorrelated if rxy = JlxJly (i.e., if E[XY] = E[X]E[Yj). Are X and Yin Example 5.1 uncorrelated? We need to compute three moments: 3

E[X] = L,k·px(k) =Ox 0.1 + 1 x 0.1 +2 x 0.5+3 x 0.3 =2.0 k=O I

E[Y] = L,l·p1 ([) =Ox 0.7+ 1 x 0.3 =0.3 1=0 3

I

E[XY] = L,L,k·l·px1 (k,[) k=OI=O

= 0 X 0 X 0.0 + 0 X 1 X 0.1 + 1 X 0 X 0.1 + 1 X 1 X 0.0 + 2 X 0 X 0.4 + 2 X 1 X 0.1 + 3 X 0 X 0.2 + 3 X 1 X 0.1 =0.5 Since E[XY] = 0.5 f:. E[X]E[Y] = 2.0 x 0.3 = 0.6, X andY are not uncorrelated; they are correlated. Independence implies uncorrelated, but the converse is not necessarily true. If X and Yare independent, then Pxy(k,l) = Px(k)py(l) for all k and I. This implies that E[XY] = E[X]E[Y]. Hence, X and Yare uncorrelated. See Problem 5.8 for an example of two uncorrelated random variables that are not independent. The covariance of X andY is O"xy = Cov[X, Y] =E[ (X- JlxlCY- p1 )]. Like the variance, the covariance obeys a decomposition theorem: Theorem 5.1: For all X and Y, Cov[X,Y] =E[XY]-E[X]E[Y] = rxy - JlxJly

This theorem is an extension of Theorem 4.1. One corollary of Theorem 5.1 is an alternate definition of uncorrelated: if X and Yare uncorrelated, then axy = 0, and vice versa. To summarize, if X andY are uncorrelated, then all the following are true (they all say the same thing, just in different ways or with different notations):

rxy = JlxJly E[XY] =E[X]E[Y]

5.3 Moments and Expected Values

107

axy =0 E[ (X -pxlCY -py)] = 0

Comment5.3: lfY=X,

In other words, the covariance between X and itself is the variance of X.

Linear combinations of random variables occur frequently. Let Z = aX+ bY. Then, E[Zj =E[aX+bY] =aE[X] +bE[Y] = apx + bpy

Var[Z] =Var[aX+bY] = E[{CaX +bY)- (ajlx + bpy)) 2

2

]

2

= E[a (X -px) +2ab(X -px)(Y -py)

+ b2 (Y -pyl 2 ]

= a 2 Var[X] + 2ab Cov[X, Y] + b2 Var[ Y] 2

2

= a a; + 2abaxy + b aJ

The correlation coefficient is axy Pxy=-axay

(5.5)

Sometimes this is rearranged as a xy = Pxya xa y· Theorem 5.2: The correlation coefficient is between -1 and 1, -1 :5 Pxy :5 1

(5.6)

Define normalized versions of X and Y as follows: X = (X - llxl I ax and Y = (Y -py)lay. Then, both X and Yhavezeromean and unit variance, and the correlation between X and Y is the correlation coefficient between X and Y: PROOF:

- -]- [--]- [X-px Y-py]- Cov[X,Y]_ Cov [X,Y -E XY -E - - - - -Pxy ax ay axay

The variance of X± Y is nonnegative (as are all variances): 0 :5 Var[X ± Y] = Var[X] ± 2 Cov[X, Y] + Var[ Y]

108

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

= 1 ±2Pxy + 1 = 2(1 ±Pxy)

The only way this is true is if -1

:5



p xy :5 1.

Comment 5.4: For two random variables, we focus on five moments, /lx, /ly, a;, a'j, and axy, and the two related values, rxy and Pxy· These five (or seven) appear in many applications. Much of statistics is about estimating these moments or using them to estimate other things.

5.4 EXAMPLE: TWO DISCRETE RANDOM VARIABLES We can illustrate these ideas with an example. Let X and Y be discrete random variables whose PMF is uniform on the 12 points shown below:

2











2

3 k 4

l

0 0

The idea is that X and Yare restricted to these 12 points. Each of these points is equally likely. Pr[X= kn Y= l] = p(k,l) = {

~

for k and l highlighted above otherwise

Probabilities of events are computed by summing the probabilities of the outcomes that comprise the event. For example, {2 :o; X :o; 3 n 1 :o; Y :o; 2} ={X= 2 n Y = 1} u {X= 2 n Y = 2} u {X= 3 n Y = 1}

The dots within the square shown below are those that satisfy the event in which we are interested:













D •





5.4 Example: Two Discrete Random Variables

109

The probability is the sum of the dots:

1] +Pr[X=2n Y=2] +Pr[X= 3n Y= 1]

Pr[2 :o;X:o; 3n 1:5 Y:o;2] = Pr[X=2n Y=

3 =- +- +- =12 12 12 12 1

1

1

5.4. 1 Marginal PMFs and Expected Values The first exercise is to calculate the marginal PMFs. px(k) is calculated by summingpXY(k, [) along I; that is, by summing columns:

3

12

3

12

3

12

2

12

I

12

Px(k)

For example, Px(O) = PxCll = Px(2) = 3112, Px(3) = 2/12, and Px(4) = 1112. py([) is computed by summing over the rows: 3

py(l)

12 4

12 5

12 Computing the means of X and Y is straightforward: 1 19 3 3 3 2 E[X] =0·- +1·- +2· - + 3 · - +4·- = 12 12 12 12 12 12 5 4 3 10 E[Y] =0· - +1· - +2· - = 12 12 12 12

5.4.2 Independence Are X andY independent? Independence requires the PMF, PXY(k,l), factor as Px(k)py([) for all k and I. On the other hand, showing X andY are dependent (i.e., not independent) takes only one value of k and I that does not factor. For instance, PXY(O,O) = 1112 f:. Px(O)py(O) = (5112) · (3112). Therefore, X andY are dependent.

110

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

5.4.3 Joint CDF Computing the joint CDF of X and Y is straightforward, though a bit tedious:

L L

Fxy(u,v)=Pr[X:o;unY:o;v] =

Pxr(k,[)

k:xk ~U f:YI'5:V

In the figure below, the region where {X :5 2.6 n Y





1.4} is highlighted in gray:

:5

• •

The probability of this event (the gray region) is computed by summing the probabilities of the dots within the region: Fxy(2.6, 1.4) =

6 12

Since the PMF is discrete, the CDF is constant for all (2.0 :5 u < 3.0) n (1.0 :5 v < 2.0). One way to illustrate the CDF is to replace the dots with the CDF value at that point, as shown below:

EXAMPLES.3











12

6













Replace the rest of the dots with their CDF values. Describe the region where the CDF is equal to l.

5.4.4 Transformations With One Output Let Z =X+ Y. What is pz(m)? The first step is to note that 0 :5 Z :54. Next, look at the event {Z = m) and then at the probability of that event: !Z=m) ={X+ Y=m) ={(X= mn Y=O)u (X= m-1 n Y= l) u ... u(X=On Y= m)) ={X= m n Y = 0} u {X= m -1 n Y = 1} u .. · u {X= 0 n Y = m)

pz(m) = Pr[Z = m]

=Pr[X+Y=m]

5.4 Example: Two Discrete Random Variables

111

= 1]

= Pr[X = m n Y = 0] + Pr[X = m -1 n Y + · · · + Pr[X = 0 n Y = m] = p(m,O) + p(m -1, 1) + · ·· + p(O,m)

Thus, the level curves of Z = m are diagonal lines, as shown below:

I

2

12

3

12

3

12

12

3

12

Pz(m)

The mean of Z can be calculated two different ways: 1 2 3 3 3 E[Zj = - ·0+- ·1 + - · 2 + - ·3+- ·4 12 12 12 12 12 29 12 =E[X+Y]

=E[X] +E[Y] (additivity of expected value) 19 10 29 =-+-=12 12 12 One transformation that occurs in many applications is the max function; that is, W = max(X, Y). For example, max(3,2) = 3, max(-2,5) = 5, etc. Often, we are interested in the largest or worst of many observations, and the max(·) transformation gives this value: Pr[W= w] = Pr[max(X,Y) = w] = Pr[CX= wn Y < w) u(X < wn Y=w) u(X= wn Y=wl] The level curves are right angles, as shown below:

I

12

3

12

5

12 Pw(w)

2

12

I

12

112

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

For example, Pr[ W = 2] =

E[Wj

fz. Thus, 1 3 5 2 1 23 = - ·0+- ·1 + - ·2+ - · 3 + - ·4=12 12 12 12 12 12

5.4.5 Transformations With Several Outputs Transformations can also have as many outputs as inputs. Consider a joint transformation

Z =X+ Y and Z' =X- Y. What is the joint PMF of Z and Z '? The simplest solution to this problem is to compute a table of all the values:

X

y

z

0

0

0

Z' 0 -I

0

-2

2

0

0 0 -I

2 2

0

2 4

0

4

3 2

4

4

2

2 2

2

3 3

0

4

0

These Z and Z' values can be plotted as dots:



4



3



2



Z' 1





0 2 -1 -2



3



4

z



Now, one can compute the individual PMFs, expected values, and joint CDF as before.

5.5 Sums of Independent Random Variables

113

5.4.6 Discussion In this example, we worked with a two-dimensional distribution. The same techniques work in higher dimensions as well. The level curves become subspaces (e.g., surfaces in three dimensions). One aspect to pay attention to is the probability distribution. In this example, the distribution was uniform. All values were the same. Summing the probabilities simply amounted to counting the dots and multiplying by 1/12. In more general examples, the probabilities will vary. Instead of counting the dots, one must sum the probabilities of each dot.

5.5 SUMS OF INDEPENDENT RANDOM VARIABLES Sums of independent random variables occur frequently. Here, we show three important results. First, the mean of a sum is the sum of the means. Second, the variance of the sum is the sum of the variances. Third, the PMF is the convolution of the individual PMFs. LetX 1,X 2 , ••• ,Xn be random variables, and letS =X 1 +X2 + · · · +Xn be the sum of the random variables. The means add even if the random variables are not independent:

This result does not require the random variables to be independent. It follows from the additivity of expected values and applies to any set of random variables. The variances add if the covariances between the random variables are 0. The usual way this occurs is if the random variables are independent because independent random variables are uncorrelated. We show this below for three random variables, S = X 1 + X 2 + X 3 , but the result extends to any number: Var[S] =E[CS-E[S]l 2 ] = E[{CX1 -p1l + (Xz -pz) + (X3 -p3l)

2 ]

= E[ (XI -p1l 2 ] + 2E[ (XI -pl)(Xz -pz) j + 2E[ (XI -p1)(X3 -p3)] + E[ (Xz -pzl 2 ] + 2E[ (Xz -pz)(X3 -p3)] +E[CX3-/l3l 2 ] =Var[Xi] +2Cov[X 1,X 2 ] +2Cov[Xi>X 3] + Var[X 2 ] +2Cov[Xz,X3] + Var[X 3] If the X values are uncorrelated, then the covariances are 0, and the variance is the sum of the variances: Var[S] = Var[Xi] + Var[X 2 ] + Var[X 3] This result generalizes to the sum of n random variables: if the random variables are uncorrelated, then the variance of the sum is the sum of the variances.

114

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

Computing PMFs of a sum of random variables is, in general, difficult. However, in the special case when the random variables are independent, the PMF of the sum is the convolution of the individual PMFs. Similarly, the MGF of the sum is the product of the individual MGFs. Let X and Y be possibly dependent random variables with joint PMF PXY(k, [),and let S =X+ Y. The PMF of Sis Ps(n) = Pr[S = n]

= Pr[X+ Y= n] = I:Pr[X+ Y=

ni Y= I]Pr[Y= I]

(LTP)

I

= I:Pr[X =n-Il Y = I]Pr[Y =I]

(using Y= I)

I

If we further assume that X and Y are independent, the conditional PMF reduces to the PMF of X, Pr[X = n -II Y =I]= Pr[X = n -I], and we get a fundamental result: Ps(n) = I:Pr[X= n -I]Pr[Y= I] I

(5.7)

= LPx(n- [)py(l) I

Thus, we see the PMF of Sis the convolution of the PMF of X with the PMF of Y. This result generalizes: if S = X 1 + X 2 + · · · + Xn and the X; are independent, each with PMF p;(·), then the PMF of Sis the convolution of the various PMFs: Ps = P1

* P2 * ··· * Pn

(5.8)

Consider a simple example of two four-sided dice being thrown. LetX 1 andX 2 be the numbers from each die, and letS= X 1 + X 2 equal the sum of the two numbers. (A four-sided die is a tetrahedron. Each corner corresponds to a number. When thrown, one corner is up.) Assume each X; is uniform and the two dice are independent. Then, 1 1 1 Pr[S=2] =Pr[X 1 = 1]Pr[X2 = 1] =- ·- = 4 4 16 Pr[S=3] =Pr[X 1 = 1]Pr[X2 =2] +Pr[X1 =2]Pr[X2 = 1] = 1 1 1 1 1 1 3 Pr[S = 4] =- ·- +- ·- +- ·- = 4 4 4 4 4 4 16 Pr[S=5]=

4

16 3 Pr[S=6]= 16

2

16

5.5 Sums of Independent Random Vari abl es

115

2

Pr[S=7] = 16 1 Pr[S=8] = 16

This is shown schematically below:

r r r r

*

, T

r r r r

i Ji

T,

Comment 5.5: Many people get confused about computing convolutions. It is easy, howeve r, if you organize the computation the right way. Assume we have two vectors of data (either from signal processing or PMFs) that we wish to convolve: z =x *y . We will use a simple vector notation , such asx = [1 , 2, 3 , -4] andy= [3 , 4, 5] . Define two vector operations: sum [x] = 1 + 2 + 3-4 = 2 is the sum of the elements of x, and length[x] = 4 is the number of elements ofx (from the first nonzero element to the last nonzero element) . Convolutions obey two simple relations that are useful fo r checking one's calculations. lfz =x *y , then sum[z] = sum[x] · sum{y]

(5.9) (5.10)

length [z] =length [x] +length {yl- 1 It is convenient to assume length[x] Then:

~

length{yl; if not, reverse the roles ofx andy .

1. Make a table with x along the top andy going down the left side . 2 . Fill in the rows of the table by multiplyingx by each value ofy and shifting the rows over one place for each place downward . 3 . Sum the columns. Do not "carry" as if you were doing long multiplication . The sum isz=X*Y· For instance, for x andy as above , the table looks like this:

3 4 5

1 3

3

2 6 4 10

3 9 8 5 22

-4

-12 12 10

-16 15

-20

10

-1

-20

Thus, z= [3 , 10, 22 , 10, -1 , -20] = [1 , 2, 3 , -4]

* [3, 4 , 5] =X*Y

sum[z] = 24 = 2 ·12 = sum[x]· sum{y] length [z] = 6 = 4 + 3- 1 =length [x] +length {yl- 1

116

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

EXAMPLE SA

Pascal's triangle is really the repeated convolution of [1, 1] with itself. Here is the table:

2

2 2

3 3 3 4

6

4

= [1, 1] * [1, 1] * [1, 1] * [1, 1]

[1,4,6,4, 1]

sum[1,4,6,4, 1] = 16 = 24 = sum[1, 1] · sum[1, 1]· sum[1, 1]· sum[1, 1]

Comment 5.6: Another way of organizing the table is like long multiplication:

3

4 6

3

10

2 3

3 4

-4

15 -16

-20

9

10 12 -12

22

10

-1

-20

5 8

5

Multiply the terms as if doing long multiplication, but do not carry.

As we mentioned earlier, the MGF corresponds to the Laplace transform in signal processing (except for the change of sign in the exponent). Just as the Laplace transform converts convolution to multiplication, so does the MGF: 8

Jts (u) = E [ e" ]

= E[ eu(X, +Xz+···+Xnl] = E [ e"x, euXz ... e"Xn

l

= E[ e"X']E[ e"X2 ]· • ·E[ e" Xn] = Jt, (u)Jtz (u) · · ·Jtn (u)

(by independence) (5.11)

The MGF of the sum is the product of the individual MGFs, all with the same value of u.

5.6 Sample Probabilities, Mean, and Variance 117 As an example, let X 1 i, i = 1,2, ... ,n, be a sequence of independent random variables with the Poisson distribution, each with a different It;. For k = 0, 1, 2, ... and It; > 0, ..:tk

Pr[X; =

kj = k; e--';

.,t{X; (u) =E[euX;] = e-';x. Then,

p=

Y 1 +Y2 +···+Y n = fraction of true samples n

This idea generalizes to other probabilities. Simply let Y; = 1 if the event is true and Y; = 0 if false. The Y; are independent Bernoulli random variables (see Example 4.1). Sums of independent Bernoulli random variables are binomial random variables, which will be studied in Chapter 6.

118

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

The most popular estimate (by far) of the mean is the sample average. We denote the estimate of the mean as fl and the sample mean as Xn:

If the Xi have mean Jl and variance a 2 , then the mean and variance of Xn are the following: E[Xn] = -

~{E[X!] +E[X2] + ··· +E[Xn]) = n

np = Jl

n

~

1

~

Var[Xn] = 2{Var[X!] + Var[X2] + · · · + Var[Xn]) = - 2 = n n n The mean of Xn equals (the unknown value) Jl. In other words, the expected value of the sample average is the mean of the distribution. An estimator with this property is unbiased. The variance of Xn decreases with n. As n increases (more observations are made), 2 the squared error E[(Xn- Jl) ] = Var[Xn] = a 2 In tends to 0. An estimator whose variance goes to 0 as n ~ = is consistent. This is a good thing. Gathering more data means the estimate gets better (more accurate). (See Section 10.1 for more on unbiasedness and consistency.) For instance, the sample probability above is the sample average of the Yi values. Therefore, E[p] = nE[Yi] = np = p n n Var[p] = nVar[Yd = p(l-p) n2 n As expected, the sample probability is an unbiased and consistent estimate of the underlying probability. The sample variance is a common estimate of the underlying variance, a 2 . Let~ denote the estimate of the variance. Then, ~

1

n

n -I

k= i

-

a 2 = - - .[CX-X) 2

(5.12)

~is an unbiased estimate of a 2 ; that is, E[~] = a 2 • For example, assume the observations are 7, 4, 4, 3, 7, 2, 4, and 7. The sample average and sample variance are 38 8 ~ (7- 4.75) 2 + (4- 4.75) 2 + ... + (7- 4.75) 2 a2 = = 3.93 8-1

Xn =

7+ 4+ 4+ 3+ 7+ 2+ 4+ 7

8

= - =4.75

The data were generated from a Poisson distribution with parameter /1, = 5. Hence, the sample mean is close to the distribution mean, 4.75 "' 5. The sample variance, 3.93, is a bit further from the distribution variance, 5, but is reasonable (especially for only eight observations).

5.7 Histograms

119

The data above were generated with the following Python code:

import scipy.stats as st import numpy as np l,n = 5,8 x = st . poisson(l) . rvs(n) muhat = np . sum(x)/len(x) sighat = np . sum((x-muhat)**2)/(len(x)-1) print(muhat, sighat) We will revisit mean and variance estimation in Sections 6.6, 8.9, 9.7.3, and 10.2.

5.7 HISTOGRAMS Histograms are frequently used techniques for estimating PMFs. Histograms are part graphical and part analytical. In this section, we introduce histograms in the context of estimating discrete PMFs. We revisit histograms in Chapter 10. Consider a sequence X 1 , X 2 , ••• ,Xn of n liD random variables, with each X; uniform on1,2, ... ,6. In Figure 5.1, we show the uniform PMF with m = 6 as bars (each bar has area 116) and two different histograms. The first is a sequence of 30 random variables. The counts of each outcome are 4, 7, 8, 1, 7, and 3. The second sequence has 60 random variables with counts 10, 6, 6, 14, 12, and 12. In both cases, the counts are divided by the number of observations

0.167

Pr[X = k]

0.167

2

3

4

k

5

FIGURE 5.1 Comparison of uniform PMF and histograms of uniform observations. The histograms have 30 (top) and 60 (bottom) observations. In general, as the number of observations increases, the histogram looks more and more like the PM F.

120 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES (30 or 60) to obtain estimates of the PMF. As the number of observations increases, the histogram estimates generally get closer and closer to the PMF. Histograms are often drawn as bar graphs. Figure 5.2 shows the same data as Figure 5.1 with the observations as a bar graph and the true PMF shown as a line.

0.167

0.167

2

3

4

k

5

6

FIGURE 5.2 Histograms drawn as bar graphs. The height of each bar equals the number of observations ofthat value divided by the total number of observations (30 or 60) . The bottom histogram for n = 60 is more uniform than the upper histogram for n = 30.

Comment 5.8: In all three computing environments, Matlab, R, and Python , the histogram command is hist(x,options), where xis a sequence of values. The options vary between the three versions. Both Matlab and Python default to 10 equally spaced bins. The default in R is a bit more complicated . It chooses the number of bins dynamically based on the number of points and the range of values. Note that if the data are discrete on a set of integers, the bins should be integer widths, with each bin ideally centered on the integers. If the bins have non-integer widths, some bins might be empty (if they do not include an integer) or may include more integers than other bins. Both problems are forms of aliasing. Bottom line: do not rely on the default bin values for computing histograms of discrete random variables.

5.8 ENTROPY AND DATA COMPRESSION Many engineering systems involve large data sets. These data sets are often compressed before storage or transmission. The compression is either lossless, meaning the original data

5.8 Entropy and Data Compression

121

can be reconstructed exactly, or lossy, meaning some information is lost and the original data cannot be reconstructed exactly. For instance, the standard facsimile compression standard is lossless (the scanning process introduces loss, but the black-and-white dots are compressed and transmitted losslessly). The Joint Photographic Experts Group (JPEG) image compression standard is a lossy technique. Gzip and Bzip2 are lossless, while MP3 is lossy. Curiously, most lossy compression algorithms incorporate lossless methods internally. In this section, we take a slight detour, discussing the problem of lossless data compression and presenting a measure of complexity called the entropy. A famous theorem says the expected number ofbits needed to encode a source is lower bounded by the entropy. We also develop Huffman coding, an optimal coding technique.

5.8. 1 Entropy and Information Theory Consider a source that emits a sequence of symbols, X 1 , X 2 , X 3 , etc. Assume the symbols are independent. (If the symbols are dependent, such as for English text, we will ignore that dependence.) Let X denote one such symbol. Assume X takes on one letter, ak> from an alphabet of m letters. Example alphabets include {0,1} (m = 2), {a,b,c, . . . ,z} (m = 26), {0,1, ... ,9} (m = 10), and many others. Let p(k) = Pr[X = ak] be the PMF of the symbols. The entropy of X, H[X], is a measure of how unpredictable (i.e., how random) the data are: m

H[X] =E[ -log{p(Xl)] =- LP(k)log{p(k))

(5.13)

k= l

Random variables with higher entropy are "more random'' and, therefore, more unpredicable than random variables with lower entropy. The "log" is usually taken to base 2. If so, the entropy is measured in bits. If the log is to base e, the entropy is in nats. Whenever we evaluate an entropy, we use bits.

Comment 5.9: Most calculators compute logs to base e (often denoted ln(x)) or base 10 (log 10 (x)). Converting from one base to another is simple: log 2 (x)

log(x)

=-

1og(2)

where log(x) and log(2) are base e or base 10, whichever is convenient. For instance, log 2 (8) = 3 = loge(8)/log.(2) = 2.079/0.693 .

It should be emphasized the entropy is a function of the probabilities, not of the alphabet. Two different alphabets having the same probabilities have the same entropy. The entropy is nonnegative and upper bounded by log(m): 0 :5 H[X] :o;log(m)

122 CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES The lower bound follows from the basic notion that all probabilities are between 0 and 1, 0 :5 p(k) :5 1, which implies log {p(k)) :5 0. Therefore, -log {p(k)) ~ 0. The lower bound is achieved if each term in the sum is 0. This happens if p(k) = 0 (the limit asp~ 0 of plog(p) is 0; see the figure below) or p(k) = 1 (since log(l) = 0). The distribution that achieves this is degenerate: one outcome has probability 1; all other outcomes have probability 0.

0.531

P -1 ~

= _e-

-

log2CP l

loge(2)

)

We show the upper bound in Section 5.8.4 where we solve an optimization problem: maximize H[X] over all probability distributions. The maximizing distribution is the uniform distribution. In this sense, the uniform distribution is the most random of all distributions over m outcomes. For example, consider an m = 4 alphabet with probabilities 0.5, 0.2, 0.1, and 0.1. Since m = 4, the entropy is upper bounded by log( 4) = 2 bits. The entropy of this distribution is H[X] = -0.5log 2(0.5)- 0.2log 2(0.2)- O.llog 2(0.1)- O.llog 2(0.1) = 1.69 bits :5

log2(4) = 2 bits

Entropy is a crucial concept in communications. The entropy of X measures how much information X contains. When Alice transmits X to Bob, Bob receives H[X] bits of information. For useful communications to take place, it is necessary that H[X] > 0. The special case when X is binary (i.e., m = 2) occurs in many applications. Let X be binary with Pr[X = 1] = p. Then, H[X] = -p1og 2 (p)- (1- p)log 2(1- p) = h(p)

h(p) is known as the binary entropy function. It is shown in Figure 5.3.

h(p)

0.5

0 0.1 1

0.5

p

0.89 1

FIGURE 5.3 The binary entropy function h(p) = -p log 2 (p)- (1 - p) log 2 (1 - p) versus p.

(5.14)

5.8 Entropy and Data Compression

123

The binary entropy function obeys some simple properties: • h(p) = h(l -

p) for 0 ~ p ~ 1

• h(O) = h(l) = 0 • h(0.5) = 1

• h(0.1l) = h(0.89) = 0.5 The joint entropy function of X andY is defined similarly to Equation (5.13): H[X, Yj = E[ -log{p(X, Yl)] =- LL)XY(k,l)log{pXY(k,l)) k

I

The joint entropy is bounded from above by the sum of the individual entropies: H[X,Y] ~H[X] +H[Y] For instance, consider X and Y as defined in Section 5.4: H[X]

=-~log 2 (~)-~log 2 (~)-~log 2 (~) 12 12 12 12 12 12 -~log 2 (~)-_.!._log 2 (_.!._) 12 12 12 12 = 2.23 bits

H[Y]=-~log 2 (~)-.i_log 2 (.i_)_2log2 (2)=l.55bits 12 12 12 12 12 12 H[X, Y] = -12 x _.!._ log 2 (_.!._) = 3.58 bits 12 12 ~ H[X] + H[Y] = 2.23 + 1.55 = 3.78 bits If X and Y are independent, the joint entropy is the sum of individual entropies. This follows because log(ab) = log(a) + log(b): H[X,Y] =E[ -log{p(X,Yl)]

(definition)

=E[ -log{pCXlpCYl)]

(independence)

=E[ -log{pCXl)] +E[ -log{pCYl)]

(additivity of E[ · j)

=H[X]+H[Y] In summary, the entropy is a measure of the randomness of one or more random variables. All entropies are nonnegative. The distribution that maximizes the entropy is the uniform distribution. Entropies of independent random variables add.

5.8.2 Variable Length Coding The entropy is intimately related to the problem of efficiently encoding a data sequence for communication or storage. For example, consider a five-letter alphabet {a, b, c,d, e} with probabilities 0.3, 0.3, 0.2, 0.1, and 0.1, respectively. The entropy of this source is

124

CHAPTER

5

MULTIPLE DISCRETE RANDOM VARIABLES

H[X] = -0.3log 2 (0.3) -0.3log 2 (0.3) -0.2log 2 (0.2) -O.llog 2 (0.1) -O.llog 2 (0.1) = 2.17 bits per symbol

Since this source is not uniform, its entropy is less than log2 (5) = 2.32 bits. Consider the problem of encoding this data source with binary codes. There are five symbols, so each one can be encoded with three bits (2 3 = 8 ~ 5). We illustrate the code with a binary tree: 0

0

a

b

c

e

d

Code words start at the root of the tree (top) and proceed to a leaf (bottom). For instance, the sequence aabec is encoded as 000 · 000 · 001 · 100 · 010 (the "dots" are shown only for exposition and are not transmitted). This seems wasteful, however, since only five of the eight possible code words would be used. We can prune the tree by eliminating the three unused leafs and shortening the remaining branch:

~

A a

b

,

c

d

Now, the last letter has a different coding length than the others. The sequence aabec is now encoded as 000 · 000 · 001 · 1 · 010 for a savings of two bits. Define a random variable L representing the coding length for each symbol. The average coding length is the expected value of L: E[L] = 0.3 x 3 +0.3 x 3 +0.2 x 3 +0.1 x 3 +0.1 x 1 = 2.8 bits This is a savings in bits. Rather than 3 bits per symbol, this code requires an average of only 2.8 bits per symbol. This is an example of a variable length code. Different letters can have different code lengths. The expected length of the code, E [L], measures the code performance. A broad class of variable length codes can be represented by binary trees, as this one is. Is this the best code? Clearly not, since the shortest code is for the letter e even though e is the least frequently occurring letter. Using the shortest code for the most frequently

5.8 Entropy and Data Compression

125

occurring letter, a, would be better. In this case, the tree might look like this: 0

0

a

0

0

b

c

d

e

The expected coding length is now E[L] = 0.3 x 1 +0.3 x 3 +0.2 x 3 +0.1 x 3 +0.1 x 3 = 2.4 bits Is this the best code? No. The best code is the Huffman code, which can be found by a simple recursive algorithm. First, list the probabilities in sorted order (it is convenient, though not necessary, to sort the probabilities):

0.3

0.3

0.2

0.1

0.1

Combine the two smallest probabilities:

/\2 0.3

0.3

0.2

0.1

0.1

Do it again:

/ \.4"

.I .1\..2

0.3

0.3

0.2

0.1

0.1

0.3

0.3

0.2

0.1

0.1

And again:

126

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

Finally, the last step: 1.0

0.6

0.3

0.3

0.2

0.1

0.1

This is the optimal tree. The letters have lengths 2, 2, 2, 3, and 3, respectively. The expected coding length is E[L] = 0.3 x 2 +0.3 x 2 +0.2 x 2 +0.1 x 3 +0.1 x 3 = 2.2 bits

Comment 5.10: If there is a tie in merging the nodes (i .e., three or more nodes have the same minimal probability), merge any pair. The specific trees will vary depending on which pair is merged, but each tree will result in an optimal code. In other words, if there are ties, the optimal code is not unique.

There is a famous theorem about coding efficiencies and entropy due to Claude Shannon. 1 Theorem 5.3 (Shannon, 1948): For any decodable (tree) code, the expected coding length is lower bounded by the entropy: E[L] ~H[X]

(5.15)

In the example above, the coding length is 2.2 bits per symbol, which is slightly greater than the entropy of 2.17 bits per sample. When does the expected coding length equal the entropy? To answer this question, we can equate the two and compare the expressions term by term: E[L] m

,

L,p(k)l(k) k= i

J,H[X] m

=I,pCkl( -log Cp(kll) 2

k= i

We have explicitly used log 2 because coding trees are binary. We see the expressions are equal if l(k) = -log 2 {p(k)) 1 Claude Elwood Shannon (1916-2001) was an American mathematician, electrical engineer, and cryptographer known as "the father of information theory:'

5.8 Entropy and Data Compression

127

or, equivalently, if p(k) =Tl(kl

For example, consider an m = 4 source with probabilities 0.5, 0.25, 0.125, and 0.125. These correspond to lengths 1, 2, 3, and 3, respectively. The Huffman tree is shown below:

0.5

0.25 0.125 0.125

The expected coding length and the entropy are both 1.75 bits per symbol.

5.8.3 Encoding Binary Sequences The encoding algorithms discussed above need to be modified to work with binary data. The problem is that there is only one tree with two leaves:

~

1-p p The expected length is E [L] = (1- p) · 1 + p · 1 = 1 for all values of p. There is no compression gain. The trick is to group successive input symbols together to form a supersymbol. If the input symbols are grouped two at a time, the sequence 0001101101 would be parsed as 00 · 01 · 10 · 11 · 01. Let Y denote a supersymbol formed from two normal symbols. Y then takes on one of the "letters": 00, 01, 10, and 11, with probabilities (1- p) 2 , (1- p)p, (1- p)p, and p2 , respectively. As an example, the binary entropy function equals 0.5 when p = 0.11 or p = 0.89. Let us take p = 0.11 and see how well pairs of symbols can be encoded. Since the letters are independent, the probabilities of pairs are the products of the individual probabilities: Pr[oo] = Pr[O] · Pr[o] = 0.89 x 0.89 = 0.79 Pr[01] = Pr[O] ·Pr[1] = 0.89 x 0.11 = 0.10 Pr[lO] = Pr[1 ] · Pr[O] = 0.11 x 0.89 = 0.10 Pr[11] = Pr[1 ] · Pr[1] = 0.11 x 0.11 = 0.01

128

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

The Huffman tree looks like this:

0.79 0.1

0.1 0.01

The expected length per input symbol is

[l

E L =

0.79x1+0.1x2+0.1x3+0.01x3 1.32 b b l = - - = 0.66 its per sym o 2 2

By combining two input symbols into one supersymbol, the expected coding rate drops from 1 bit per symbol to 0.66 bits per symbol, a savings of about one-third fewer bits required. However, the coding rate, 0.66 bits per symbol, is still 32% higher than the theoretical rate of h(0.11) = 0.5 bits per symbol.

5.8.4 Maximum Entropy What distribution has maximum entropy? This question arises in many applications, including spectral analysis, tomography, and signal reconstruction. Here, we consider the simplest problem, that of maximizing entropy without other constraints, and use the method of Lagrange multipliers to find the maximum. The main entropy theorem is the following: Theorem 5.4: 0 :5 H[X] :o;log(m). Furthermore, the distribution that achieves the maximum is the uniform.

To prove the upper bound, we set up an optimization problem and solve it using the method of Lagrange multipliers. Lagrange multipliers are widely used in economics, opera-

tions research, and engineering to solve constrained optimization problems. Unfortunately, the Lagrange multiplier method gets insufficient attention in many undergraduate calculus sequences, so we review it here. Consider the following constrained optimization problem: m

maxH[X] =- LP(k)log{p(k)) p (k )' s

k=l

m

subject to

LP(k) = 1 k= l

The function being maximized (in this case, H[Xj) is the objective function, and the constraint is the restriction that the probabilities sum to 1. The entropy for any distribution is upper bounded by the maximum entropy (i.e., by the entropy of the distribution that solves this optimization problem).

5.8 Entropy and Data Compression

129

Now, rewrite the constraint as 1- L.'k=1 p(k) = 0, introduce a Lagrange multiplier It, and change the optimization problem from a constrained one to an unconstrained problem:

This unconstrained optimization problem can be solved by taking derivatives with respect to each variable and setting the derivatives to 0. The Lagrange multiplier It is a variable, so we have to differentiate with respect to it as well. First, note that d p - ( - plog(pl) = -log(p)-- = -log(p) -1

dp

p

The derivative with respect to p(l) (where lis arbitrary) looks like 0 = -log{pCll) -1-/t

for alii= 1,2, ... ,m

Solve this equation for p(l): p(l) = e--'- 1

for alii= 1,2, ... ,m

Note that p(l) is a constant independent of l (the right-hand side of the equation above is not a function of l). In other words, all the p(l) values are the same. The derivative with respect to It brings back the constraint: m

0= 1- LP(k) k= !

Since the p's are constant, the constant must be p(l) = 1I m. The last step is to evaluate the entropy: H[X]

m :5- L,p(k)log{p(k)) k= !

=-

1 ( 1) Lm -log-

k= !m

m

mlog(m)

m =log(m) Thus, the theorem is proved. Let us repeat the main result. The entropy is bounded as follows: 0 :5 H[X] :o;log(m) The distribution that achieves the lower bound is degenerate, with one p(k) = 1 and the rest equal to 0. The distribution that achieves the maximum is the uniform distribution, p(k) =lim fork= 1,2, ... ,m.

130

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

Comment 5.11: The maximum entropy optimization problem above also includes

inequality constraints: p(k) ~ 0. We ignored those (treated them implicitly) and solved the problem anyway. The resulting solution satisfies these constraints, p(k) = 1/m > 0, thus rewarding our laziness. Had the optimal solution not satisfied the inequality constraints, we would have had to impose the inequality constraints explicitly and solve the optimization problem with more sophisticated searching algorithms. As one might gather, such problems are substantially harder.

EXAMPLES.S

Here is a simple example to demonstrate the Lagrange multiplier method. Minimize x 2 + y 2 subject to x+ 2y = 3. Using a Lagrange multiplier, II., convert the problem to an optimization over three variables: min(x2 + x,y,-1.

l

+ 11.(3 -x+2yl)

Differentiate the function with respect to each of the three variables, and obtain three equations:

0=2x-l\. 0 = 2y-211. 3 =x+2y The solution is x = 3 I 5, y = 615, and II. = 615, as shown in Figure 5.4.

x+2y= 3

FIGURE 5.4 Illustration of a Lagrange multiplier problem: Find the point on the line x + 2y = 3 that minimizes the distance to the origin . That point is (0.6, 1.2), and it lies at a distance v'1Jl from the origin .

Summary

131

Let X and Y be two discrete random variables. The joint probability mass function is

PXY(k,[) = Pr[X = xk n Y = y!] for all values of k and l. The PMF values are nonnegative and sum to 1: ~0

PXY(k,[) 00

00

I: LPXY(k,[) = 1 k=Oj=O

The marginal probability mass functions are found by summing over the unwanted variable:

Px(k) = Pr[X=k] = I:Pr[X=xkn Y= y!] = LPXY(k,l) I

I

py(l) = Pr[Y =I]= I:Pr[X = Xk n Y = y!] = LPXY(k,l) k

k

The joint distribution function is

FXY(u, v) = Pr[X :5 u n Y

:5

v]

X and Y are independent if the PMF factors:

PXY(k,l) = Px(k)py(l)

for all k and l

or, equivalently, if the distribution function factors:

FXY(u, v) = Fx(u)Fy(v)

for all u and v

The expected value of g(X, Y) is the probabilistic average:

The correlation of X and Y is rxy = E[XY]. X and Yare uncorrelated if rxy = JlxJly· The covariance of X andY is a xy = Cov[X, Y] = E[ (X- JlxHY- Jly)] = rxy- JlxJly· If X andY are uncorrelated, then a xy = 0. Let Z =aX+ bY. Then, E[Zj =aE[X] +bE[Y] Var[Z] = a 2 Var[X] + 2abCov[X, Y] + b2 Var[Y] If X and Yare independent, the PMF of S =X+ Y is the convolution of the two marginal PMFs, and the MGF of Sis the product of the MGFs of X and Y:

Ps(n) = LPx(k)p1 (n- k) k

J{s(u) =J{x(u)J{y(v)

132

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

Probabilities and moments can be estimated from samples. Let Xi be n liD samples, and let Yi = 1 if the event is true and Yi = 0 if the event is false. Then,

n -

1

fl =Xn =-{Xi +X2 + ··· +Xn) n

~

u2

1

n

-

= - .[CX-X} 2 n -1 k= i

The entropy is a measure of the randomness of X: H[X] =E[ -log{pCXl)] =- .[p(k)log{p(k)) k

If X has m outcomes, the entropy is between 0 and log(m); that is, 0 :5 H(X} :o;log(m). The distribution that achieves the upper bound is the uniform distribution. Maximum entropy problems can often be solved by the method of Lagrange multipliers. The log is usually taken to base 2, log 2 (x) = log(x)!log(2). The expected length of a lossless compression code is lower bounded by the entropy:

E[L]

~H[X]

The Huffman code is an optimal code. It builds a code tree by repeatedly combining the two least probable nodes.

5.1

Let X and Y have the following joint PMF:

y

~

0.1 0.1 I 0.0 0

0.0 0.0 0.1

0.1 0.1 0.2 2

0.0 0.1 0.2 3

X

a. What are the marginal PMFs of X and Y? b. What are the conditional probabilities (computed directly) of X given Y and Y given X (compute them directly)?

c. What are the conditional probabilities of Y given X from the conditional probabilities of X given Y using Bayes theorem? 5.2

Using the probabilities in Problem 5.1: a. What are E[X], E[Y], Var[X], Var[Y], and Cov[X, Y]? b. Are X and Y independent?

Problems 5.3

133

Let X and Y have the following joint PMF:

y

I 0.1

1 0

0.0

0.1 0.1

0.1 0.2

0.1 0.3

---oo;-------;;-------;2~-----;3~

X

a. What are the marginal PMFs of X andY? b. What are the conditional probabilities (computed directly) of X given Y and Y given X (compute them directly)? c. What are the conditional probabilities of Y given X from the conditional probabilities of X given Y using Bayes theorem? 5.4

Using the probabilities in Problem 5.3: a. What are E[X], E[Y], Var[X], Var[Y], and Cov[X, Y]? b. Are X and Y independent?

5.5

Continue the example in Section 5.4, and consider the joint transformation, U = min(X, Y) (e.g., min(3,2) = 2), and W = max(X, Y). For each transformation: a. What are the level curves (draw pictures)? b. What are the individual PMFs of U and W? c. What is the joint PMF of U and W?

5.6

Continue the example in Section 5.4, and consider the joint transformation V = 2X- Y and V' = 2 Y- X. For each transformation: a. What are the level curves (draw pictures)? b. What are the individual PMFs of V and V'? c. What is the joint PMF of V and V'?

5.7

X andY are jointly distributed as in the figure below. Each dot is equally likely.



2

y







0 .___.___.___.___._____ 0

2

3

X

~

4

a. What are the first-order PMFs of X andY? b. WhatareE[X] andE[Y]? c. What is Cov[X, Y]? Are X andY independent? d. If W =X- Y, what are the PMF of Wand the mean and variance of W? 5.8

Find a joint PMF for X and Y such that X and Y are uncorrelated but not independent. (Hint: find a simple table of PMF values as in Example 5.1 such that X and Y are uncorrelated but not independent.)

134

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

5.9

Prove Theorem 5.1.

5.10

Let S = X 1 +X 2 + X 3, with each X; liD uniform on the outcomes k = 1, 2, 3, 4. What is the PMF ofS?

5.11

What are the first four terms of the convolution of the infinite sequence [112, 114,118, 1116, ... ] with itself?

5.12

What is the convolution of the infinite sequence [1, 1, 1, ... ] with itself?

5.13

If X and Y are independent random variables with means J.lx and J.ly, respectively, and variances a~ and respectively, what are the mean and variance of Z =aX+ bY for constants a and b?

5.14

Let X1, Xz, and X3 be liD Bernoulli random variables with Pr[X; = 1] = Pr[X; = 0] = 1- p. What are the PMF, mean, and variance of S =X1 +Xz +X3?

5.15

Let X1 and Xz be independent geometric random variables with the same p. What is the PMFofS=X1 +Xz?

5.16

Let X1 and Xz be independent Poisson random variables with the same A. What is the PMFofS=X1 +Xz?

5.17

Suppose X1, Xz, and X3 are liD uniform on k = 0, 1,2,3 (i.e., Pr[X; = k] = 0.25 fork= 0,1,2,3). What is the PMF ofS =X1 +Xz +X3?

5.18

Generate a sequence of 50 liD Poisson random variables with A= 5. Compute the sample mean and variance, and compare these values to the mean and variance of the Poisson distribution.

5.19

Generate a sequence of 100 liD Poisson random variables with A = 10. Compute the sample mean and variance, and compare these values to the mean and variance of the Poisson distribution.

5.20

A sequence of 10 liD observations from a U(O, 1) distribution are the following: 0.76 0.92 0.33 0.81 0.37 0.05, 0.19, 0.10, 0.09, 0.31. Compute the sample mean and sample variance of the data, and compare these to the mean and variance of a U(O, 1) distribution.

5.21

In them= 5 Huffman coding example in Section 5.8.2, we showed codes with efficiencies of 3, 2.8, 2.4, and 2.2 bits per symbol.

aJ,

p and

a. Can you find a code with an efficiency of2.3 bits per symbol? b. What is the worst code (tree with five leaves) for these probabilities you can find? 5.22

Let a four-letter alphabet have probabilities p = [0.7, 0.1, 0.1, 0.1]. a. What is the entropy of this alphabet? b. What is the Huffman code? c. What is the Huffman code when symbols are taken two at a time?

5.23

Continue the binary Huffman coding example in Section 5.8.3, but with three input symbols per supersymbol. a. What is the Huffman tree? b. What is the expected coding length? c. How far is this code from the theoretical limit?

Problems

135

5.24

Write a program to compute the Huffman code for a given input probability vector.

5.25

What is the entropy of a geometric random variable with parameter p?

5.26

Let X have mean J.lx and variance a~. Let Y have mean J.ly and variance Let Z =X with probability p and Z = Y with probability 1- p. What are E[Z] and Var[Z]? Here is an example to help understand this question: You flip a coin that has probability p of coming up heads. If it comes up heads, you select a part from box X and measure some quantity that has mean J.lx and variance a~; if it comes up tails, you select a What are part from box Y and measure some quantity that has mean J.ly and variance the mean and variance of the measurement taking into account the effect of the coin flip? In practice, most experimental designs (e.g., polls) try to avoid this problem by sampling and measuring X and Y separately and not relying on the whims of a coin flip.

aJ.

aJ.

5.27

The conditional entropy of X given Y is defined as H(XIY) =- LLPXY(k,l)log(pxiY(klll) k

Show H(X, Y) 5.28

I

= H(Y) + H(XIY). Interpret this result in words.

Consider two probability distributions, p(k) fork= 1,2, .. . ,m and q(k) fork= 1,2, .. . ,m. The Kullback-Leibler divergence (KL) between them is the following: KL(PIIQl

m

(i)

i= i

q(t)

= LP(i)log(L)

(5.16)

~0

(5.17)

The KL divergence is always nonnegative: KL(PIIQl

Use the log inequality given in Equation (3.25) to prove the KL inequality given in Equation (5.17). (Hint: show -KL(PIIQl ~ 0 instead, and rearrange the equation to "hide" the minus sign and then apply the inequality.) One application of the KL divergence: When data X have probability p(k) but are encoded with lengths designed for distribution q(k), the KL divergence tells us how many additional bits are required. In other words, the coding lengths -log (q(k)) are best if q(k) = p(k) for all k. 5.29

Solve the following optimization problems using Lagrange multipliers: a. minx2 + l such that x- y = 3 x,y

b. maxx + y such that x 2 + l x,y

=1

c. minx + l + z such that x + 2y+ 3z = 6 2

x,y,z

2

136

CHAPTER 5 MULTIPLE DISCRETE RANDOM VARIABLES

5.30

Prove the Cauchy-Schwarz inequality:

where the x's andy's are arbitrary numbers. Hint: Start with the following inequality (why is this true?): 0~

n

L (x; - ay;) 2

for all values of a

i= l

Find the value of a that minimizes the right -hand side above, substitute that value into the same inequality, and rearrange the terms into the Cauchy-Schwarz inequality at the top. 5.31

Complete an alternative proof of Equation (5.6). a. Show E[XY] ~E[X 2 ]E[Y 2 ] for any X and Yusing the methods in Problem 5.30. 2

b. Show this implies Cov[X, Y] ~ Var[X]Var[Y] and hence P~y ~ 1. 2

5.32

Consider the following maximum entropy problem: Among all distributions over the integers, k = 1,2,3, ... with known mean J.l = .[b 1 kp(k), which one has the maximum entropy? Clearly, the answer is not the uniform distribution. A uniform distribution over m = oo does not make sense, and even if it did, its mean would be oo. The constrained optimization problem looks like the following: 00

maxp Ckl'sH[X] =- :[p(k)log(p(k)) k=i 00

subject to :[p(k)

00

= 1 and

k=i

L kp(k) = J.l k=i

a. Introduce two Lagrange multipliers, A and 1/f, and convert the constrained problem to an unconstrained problem over the p(k) and A and 1/f. What is the unconstrained problem? b. Show the p(k) satisfy the following: 0 = -log(p(kl) -1- A -k1f!

fork= 1,2,3, ...

c. What two other equations do the p(k) satisfy? d. Show the p(k) correspond to the geometric distribution.

CHAPTER

BINOMIAL PROBABILITIES

Two teams, say, the Yankees and the Giants, play an n-game series. If the Yankees win each game with probability p independently of any other game, what is the probability the Yankees win the series (i.e., more than half the games)? This probability is a binomial probability. Binomial probabilities arise in numerous applications-not just baseball. In this chapter, we examine binomial probabilities and develop some of their properties. We also show how binomial probabilities apply to the problem of correcting errors in a digital communications system.

6.1 BASICS OF THE BINOMIAL DISTRIBUTION In this section, we introduce the binomial distribution and compute its PMF by two methods. The binomial distribution arises from the sum of liD Bernoulli random variables (e.g., flips of a coin). Let X; fori= 1,2, .. . ,n be liD Bernoulli random variables with Pr[X; = 1] = p and Pr[X; = D] = 1- p (throughout this chapter, we use the convention q = 1- p):

Then, S has a binomial distribution. The binomial PMF can be determined as follows: Pr[S = k] =

L

Pr[sequence with k l's and n -k D's]

sequ ences with k 1's

Consider an arbitrary sequence with k 1's and n- k D's. Since the flips are independent, the probabilities multiply. The probability of the sequence is pkqn-k. Note that each sequence with k 1's and n- k D's has the same probability.

137

138

CHAPTER 6 BINOMIAL PROBABILITIES

The number of sequences with k 1's and n- k D's is

G). Thus, (6.1)

PMF values must satisfy two properties. First, the PMF values are nonnegative, and second, the PMF values sum to 1. The binomial probabilities are clearly nonnegative as each is the product of three nonnegative terms, (~), pk, and q"-k. The probabilities sum to 1 by the binomial theorem (Equation 3.7):

The binomial PMF for n = 5 and p = 0.7 is shown in Figure 6.1. The PMF values are represented by the heights of each stem. An alternative is to use a bar graph, as shown in Figure 6.2, wherein the same PMF values are shown as bars. The height of each bar is the PMF value, and the width of each bar is 1 (so the area of each bar equals the probability). Also shown in Figure 6.1 are the mean (Jl = 3.5) and standard deviation (a = 1.02).


FIGURE 6.1 Binomial PMF for n = 5 and p = 0.7. The probabilities are proportional to the heights of each stem. Also shown are the mean (μ = 3.5) and standard deviation (σ = 1.02). The largest probability occurs for k = 4 and is equal to 0.360.


FIGURE 6.2 Binomial probabilities for n = 5 and p = 0.7 as a bar graph. Since the bars have width equal to 1, the area of each bar equals the probability of that value. Bar graphs are especially useful when comparing discrete probabilities to continuous probabilities.

EXAMPLE 6.1

Consider a sequence of 56 Bernoulli IID p = 0.7 random variables, 1111111 · 1110101 · 1110010 · 1000111 · 0011011 · 0111111 · 0110011 · 0100110. Each group of seven bits is summed, yielding eight binomial observations, 7 5 4 4 4 6 4 3 (sum across the rows in the table below).

Bernoullis          Binomial
1 1 1 1 1 1 1       7
1 1 1 0 1 0 1       5
1 1 1 0 0 1 0       4
1 0 0 0 1 1 1       4
0 0 1 1 0 1 1       4
0 1 1 1 1 1 1       6
0 1 1 0 0 1 1       4
0 1 0 0 1 1 0       3

For example, the probability of getting a 4 in a single group of seven is

Pr[S = 4] = \binom{7}{4} (0.7)^4 (0.3)^3 ≈ 0.23

In a sequence of eight binomial random variables, we expect to get about 8 × 0.23 = 1.81 fours. In this sequence, we observed 4 fours.

The binomial probabilities satisfy an interesting and useful recursion. It is convenient to define a few quantities:

S_{n-1} = X_1 + X_2 + ... + X_{n-1}
S_n = S_{n-1} + X_n
b(n, k, p) = Pr[S_n = k]

Note that S_{n-1} and X_n are independent since X_n is independent of X_1 through X_{n-1}. The recursion is developed through the LTP:

Pr[S_n = k] = Pr[S_n = k | X_n = 1] Pr[X_n = 1] + Pr[S_n = k | X_n = 0] Pr[X_n = 0]
            = Pr[S_{n-1} = k-1 | X_n = 1] p + Pr[S_{n-1} = k | X_n = 0] q
            = Pr[S_{n-1} = k-1] p + Pr[S_{n-1} = k] q

We used the independence of S_{n-1} and X_n to simplify the conditional probability. Using the b(n,k,p) notation gives a simple recursion:

b(n, k, p) = b(n-1, k-1, p) · p + b(n-1, k, p) · q     (6.2)

Equation 6.2 gives us a Pascal's triangle-like method to calculate the binomial probabilities.


FIGURE 6.3 Binomial probabilities organized in Pascal's triangle for n up to 5. Each entry is a weighted sum with weights p and q = 1 - p of the two entries above it. For instance, 10p^2q^3 = (4pq^3)p + (6p^2q^2)q.
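To make the recursion concrete, here is a short computational sketch (in Python, used only for illustration; the function name binomial_pmf_recursive is ours). It builds the PMF row by row with Equation (6.2), exactly as in the Pascal's-triangle construction of Figure 6.3.

def binomial_pmf_recursive(n, p):
    """Build the binomial PMF b(n, k, p) for k = 0, ..., n using Equation (6.2)."""
    q = 1.0 - p
    row = [1.0]                          # n = 0: Pr[S = 0] = 1
    for m in range(1, n + 1):            # build row m from row m - 1
        new_row = [0.0] * (m + 1)
        for k in range(m + 1):
            left = row[k - 1] if k >= 1 else 0.0     # b(m-1, k-1, p)
            right = row[k] if k <= m - 1 else 0.0    # b(m-1, k, p)
            new_row[k] = left * p + right * q        # Equation (6.2)
        row = new_row
    return row

# Example: n = 5, p = 0.7 reproduces the PMF of Figure 6.1
print([round(v, 3) for v in binomial_pmf_recursive(5, 0.7)])
# [0.002, 0.028, 0.132, 0.309, 0.36, 0.168]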

Comment 6.1: The recursive development is just a long-winded way of saying the binomial probabilities are the repeated convolution of the Bernoulli probabilities. Here is a convolution table for n = 2, n = 3, and n = 4 (each row is the convolution of the row above with the Bernoulli probabilities q, p):

n = 1:   q      p
n = 2:   q^2    2pq     p^2
n = 3:   q^3    3pq^2   3p^2q    p^3
n = 4:   q^4    4pq^3   6p^2q^2  4p^3q   p^4

The various intermediate rows list binomial probabilities (e.g., for n = 3, Pr[N = 1] = 3pq^2 and Pr[N = 2] = 3p^2q).

To demonstrate binomial probabilities, here is a sequence of 30 observations from an n = 5, p = 0.7 binomial distribution: 4, 3, 4, 5, 3, 4, 4, 4, 5, 2, 3, 5, 3, 3, 3, 4, 3, 5, 4, 3, 3, 3, 0, 4, 5, 4, 4, 2, 1, 4. There are one 0, one 1, two 2's, ten 3's, eleven 4's, and five 5's. Note the histogram approximates the PMF fairly well. Some differences are apparent (e.g., the sequence has a 0


even though a 0 is unlikely, Pr[X = 0] = 0.002), but the overall shapes are pretty similar. The histogram and the PMF are plotted in Figure 6.4.


FIGURE 6.4 Plot of a binomial n = 5, p = 0.7 PMF (as a bar graph) and histogram (as dots) of 30 observations. Note the histogram reasonably approximates the PMF.

In summary, for independent flips of the same coin (IID random variables), Bernoulli probabilities answer the question of how many heads we get in one flip. Binomial probabilities answer the question of how many heads we get in n flips.

EXAMPLE 6.2

How likely is it that a sequence of 30 IID binomial n = 5, p = 0.7 random variables would have at least one 0? First, calculate the probability of getting a 0 in a single observation:

Pr[S = 0] = (0.3)^5 = 0.00243

Second, calculate the probability of getting at least one 0 in 30 tries. Each trial has six possible outcomes, 0, 1, 2, 3, 4, and 5. However, we are only interested in 0's or not-0's. We just calculated the probability of a 0 in a single trial as 0.00243. Therefore, the probability of a not-0 is 1 - 0.00243. Thus, the probability of at least one 0 is 1 minus the probability of no 0's:

Pr[no 0's in 30 trials] = (1 - 0.00243)^{30} = 0.9296
Pr[at least one 0 in 30 trials] = 1 - 0.9296 = 0.0704

Thus, about 7% of the time, a sequence of 30 trials will contain at least one 0.

6.2 COMPUTING BINOMIAL PROBABILITIES

To compute the probability of an interval, say, l ≤ S ≤ m, one must sum the PMF values:

Pr[l ≤ S ≤ m] = \sum_{k=l}^{m} b(n, k, p)


This calculation is facilitated by computing the b(n,k,p) recursively. First, look at the ratio:

b(n,k,p) / b(n,k-1,p) = [\binom{n}{k} p^k q^{n-k}] / [\binom{n}{k-1} p^{k-1} q^{n-k+1}] = ((n-k+1)/k) · (p/q)

Thus,

b(n,k,p) = b(n,k-1,p) · ((n-k+1)/k) · (p/q)     (6.3)

Using this formula, it is trivial for a computer to calculate binomial probabilities with thousands of terms. The same argument allows one to analyze the sequence of binomial probabilities. b(n,k,p) is larger than b(n,k-1,p) if

(n-k+1)p / (kq) > 1

Rearranging the terms gives

k < (n+1)p     (6.4)

Similarly, the terms b(n,k,p) and b(n,k-1,p) are equal if k = (n+1)p, and b(n,k,p) is less than b(n,k-1,p) if k > (n+1)p. For example, for n = 5, p = 0.7, and (n+1)p = 4.2, the b(5,k,0.7) sequence reaches its maximum at k = 4, then decreases for k = 5. This rise and fall are shown in Figure 6.1.
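Equation (6.3) translates directly into a few lines of code. The following sketch (Python for illustration; binomial_interval_prob is our name, and it assumes 0 < p < 1) accumulates Pr[l ≤ S ≤ m] while stepping through the ratio recursion.

def binomial_interval_prob(n, p, lo, hi):
    """Pr[lo <= S <= hi] for S ~ binomial(n, p), via the ratio recursion (6.3)."""
    q = 1.0 - p
    b = q ** n                                   # b(n, 0, p)
    total = b if lo <= 0 <= hi else 0.0
    for k in range(1, n + 1):
        b *= (n - k + 1) / k * (p / q)           # Equation (6.3)
        if lo <= k <= hi:
            total += b
    return total

# Example: Pr[3 <= S <= 5] for n = 5, p = 0.7
print(round(binomial_interval_prob(5, 0.7, 3, 5), 4))   # 0.8369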

6.3 MOMENTS OF THE BINOMIAL DISTRIBUTION

The mean of the binomial distribution is np and the variance npq. In this section, we derive these values three ways. First, we use the fact that a binomial is a sum of IID Bernoulli random variables. Second, we perform a direct computation using binomial probabilities. Third, we use the MGF. Of course, all three methods lead to the same answers.

First, since S is a sum of IID Bernoulli random variables, the mean and variance of S are the sum of the means and variances of the X_i (see Section 5.5):

μ_S = E[S] = E[X_1] + E[X_2] + ... + E[X_n] = np
Var[S] = Var[X_1] + Var[X_2] + ... + Var[X_n] = npq

since E[X] = p and Var[X] = pq. Thus, we see the mean and variance of the binomial distribution are np and npq, respectively. For example, Figure 6.1 shows the PMF for a binomial distribution with n = 5 and p = 0.7. The mean is μ = np = 5 × 0.7 = 3.5, and the variance is σ^2 = 5 × 0.7 × 0.3 = 1.05 ≈ (1.02)^2.


Second, compute the mean directly using binomial probabilities, taking advantage of Equation (3.22), k\binom{n}{k} = n\binom{n-1}{k-1}:

E[S] = \sum_{k=0}^{n} k · Pr[S = k]
     = \sum_{k=1}^{n} k · Pr[S = k]                          (k = 0 term is 0)
     = \sum_{k=1}^{n} n \binom{n-1}{k-1} p^k q^{n-k}          (using Equation 3.22)
     = np \sum_{l=0}^{n-1} \binom{n-1}{l} p^l q^{n-1-l}       (change of variables, l = k - 1)
     = np (p + q)^{n-1}                                      (using the binomial theorem)
     = np                                                    (since p + q = 1)

It is tricky to compute E[S^2] directly. It is easier to compute E[S(S-1)] first and then adjust the formula for computing the variance:

Var[S] = E[S^2] - E[S]^2 = E[S(S-1)] + E[S] - E[S]^2

We will also need to extend Equation (3.22):

k(k-1)\binom{n}{k} = n(n-1)\binom{n-2}{k-2}     (6.5)

Using this formula, we can compute E[S(S-1)]:

E[S(S-1)] = \sum_{k=0}^{n} k(k-1) \binom{n}{k} p^k q^{n-k}
          = n(n-1)p^2 \sum_{k=2}^{n} \binom{n-2}{k-2} p^{k-2} q^{n-k}
          = n(n-1)p^2

Now, finish the calculation:

σ_S^2 = E[S(S-1)] + E[S] - E[S]^2 = n(n-1)p^2 + np - n^2p^2 = np(1-p) = npq

See Problem 4.18 for a similar approach in computing moments from a Poisson distribution.

Third, compute the mean and variance using the MGF:

M_S(u) = E[e^{uS}] = (pe^u + q)^n


Now, compute the mean:

E[S] = (d/du) M_S(u) |_{u=0} = n(pe^u + q)^{n-1} pe^u |_{u=0} = n(p + q)^{n-1} p = np

In Problem 6.15, we continue this development and compute the variance using the MGF.

In summary, using three different methods, we have computed the mean and variance of the binomial distribution. The first method exploits the fact that the binomial is the sum of IID Bernoulli random variables. This method is quick because the moments of the Bernoulli distribution are computed easily. The second method calculates the moments directly from the binomial PMF. This straightforward method needs two nontrivial binomial coefficient identities (Equations 3.22 and 6.5). However, for many other distributions, the direct calculation proceeds quickly and easily. The third method uses the MGF. Calculate the MGF, differentiate, and evaluate the derivative at u = 0. For other problems, it is handy to be able to apply all three of these methods. It is often the case that at least one of the three is easy to apply—though sometimes it is not obvious beforehand which one.

6.4 SUMS OF INDEPENDENT BINOMIAL RANDOM VARIABLES

Consider the sum of two independent binomial random variables, N = N_1 + N_2, where N_1 and N_2 use the same value of p. The first might represent the number of heads in n_1 flips of a coin, the second the number of heads in n_2 flips of the same (or an identical) coin, and the sum the number of heads in n = n_1 + n_2 flips of the coin (or coins). All three of these random variables are binomial. By this counting argument, the sum of two independent binomial random variables is binomial. This is most easily shown with MGFs:

M_N(u) = M_{N_1}(u) M_{N_2}(u) = (pe^u + q)^{n_1} (pe^u + q)^{n_2} = (pe^u + q)^{n_1+n_2}

Since the latter expression is the MGF of a binomial random variable, N is binomial.

Now, let us ask the opposite question. Given that N_1 and N_2 are independent binomial random variables and the sum N = N_1 + N_2, what can we say about the conditional probability of N_1 given the sum N = m? It turns out the conditional probability is not binomial:

Pr[N_1 = k | N = m] = Pr[N_1 = k ∩ N = m] / Pr[N = m]                          (definition)
                    = Pr[N_1 = k ∩ N_2 = m-k] / Pr[N = m]
                    = Pr[N_1 = k] Pr[N_2 = m-k] / Pr[N = m]                    (by independence)
                    = \binom{n_1}{k} p^k q^{n_1-k} \binom{n_2}{m-k} p^{m-k} q^{n_2-m+k} / (\binom{n}{m} p^m q^{n-m})
                    = \binom{n_1}{k} \binom{n_2}{m-k} / \binom{n}{m}           (6.7)

In fact, the conditional probability is hypergeometric. This has a simple interpretation. There are \binom{n}{m} sequences of m heads in n trials. Each sequence is equally likely. There are \binom{n_1}{k} \binom{n_2}{m-k} sequences with k heads in the first n_1 positions and m-k heads in the last n_2 = n - n_1 positions. Thus, the probability is the number of sequences with k heads in the first n_1 flips and m-k heads in the next n_2 = n - n_1 flips, divided by the number of sequences with m heads in n flips.

For instance, let n_1 = n_2 = 4 and k = 4. So, four of the eight flips are heads. The probability of all four heads in the first four positions is

Pr[N_1 = 4 | N = 4] = \binom{4}{4}\binom{4}{0} / \binom{8}{4} = 1/70 ≈ 0.014

The probability of an equal split, two heads in the first four flips and two in the second four flips, is

Pr[N_1 = 2 | N = 4] = \binom{4}{2}\binom{4}{2} / \binom{8}{4} = 36/70 ≈ 0.514

Clearly, an equal split is much more likely than having all the heads in the first four (or the last four) positions.
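Equation (6.7) is easy to check numerically. A small sketch (Python for illustration; cond_prob is our name):

from math import comb

def cond_prob(k, m, n1, n2):
    """Pr[N1 = k | N1 + N2 = m] from Equation (6.7) -- a hypergeometric probability."""
    return comb(n1, k) * comb(n2, m - k) / comb(n1 + n2, m)

# n1 = n2 = 4 with m = 4 heads total, as in the example above
print(round(cond_prob(4, 4, 4, 4), 3))   # 0.014  all four heads in the first four flips
print(round(cond_prob(2, 4, 4, 4), 3))   # 0.514  an even split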

Comment 6.2: It is useful to note what happened here. N_1 by itself is binomial, but N_1 given the value of N_1 + N_2 is hypergeometric. By conditioning on the sum, N_1 is restricted. A trivial example of this is when N = 0. In this case, it follows that N_1 must also be 0 (since N_1 + N_2 = 0 implies N_1 = 0). When N = 1, N_1 must be 0 or 1.


6.5 DISTRIBUTIONS RELATED TO THE BINOMIAL The binomial is related to a number of other distributions. In this section, we discuss some of these: the hypergeometric, the multinomial, the negative binomial, and the Poisson. In Chapter 9, we discuss the connection between the binomial and the Gaussian distribution.

6.5.1 Connections Between Binomial and Hypergeometric Probabilities

The binomial and hypergeometric distributions answer similar questions. Consider a box containing n items, with n_0 labeled with 0's and n_1 labeled with 1's (n = n_0 + n_1). Make a selection of m items without replacement, and let N denote the number of 1's in the selection. Then, the probabilities are hypergeometric:

Pr[N = k] = \binom{n_1}{k} \binom{n_0}{m-k} / \binom{n}{m}

The first item has probability n_1/n of being a 1. The second item has conditional probability (n_1 - 1)/(n - 1) or n_1/(n - 1) depending on whether the first item selected was a 1 or a 0. In contrast, if the items are selected with replacement, then the probabilities are constant. The probability of a 1 is n_1/n regardless of previous selections. In this case, the probability of a selection is binomial. If n_0 and n_1 are large, then the probabilities are approximately constant. In this case, the hypergeometric probabilities are approximately binomial. For example, let n_0 = n_1 = 5 and m = 6. Selected probabilities are listed below:

Probability    Binomial                                         Hypergeometric
Pr[N = 3]      \binom{6}{3} 0.5^3 0.5^3 = 20/64 = 0.313          \binom{5}{3}\binom{5}{3}/\binom{10}{6} = 100/210 = 0.476
Pr[N = 5]      \binom{6}{5} 0.5^5 0.5^1 = 6/64 = 0.094           \binom{5}{5}\binom{5}{1}/\binom{10}{6} = 5/210 = 0.024

In summary, if the selection is made with replacement, the probabilities are binomial; if the selection is made without replacement, the probabilities are hypergeometric. If the number of items is large, the hypergeometric probabilities are approximately binomial. As the selection size gets large, the hypergeometric distribution favors balanced selections (e.g., half 1'sand half O's) more than the binomial distribution. Conversely, unbalanced selections (e.g., alii's or all O's) are much more likely with the binomial distribution.


Binomial probabilities tend to be easier to manipulate than hypergeometric probabilities. It is sometimes useful to approximate hypergeometric probabilities by binomial probabilities. The approximation is valid when the number of each item is large compared with the number of each selected.

6.5.2 Multinomial Probabilities

Just as the multinomial coefficient (Equation 3.13) generalizes the binomial coefficient (Equation 3.3), the multinomial distribution generalizes the binomial distribution. The binomial distribution occurs in counting experiments with two outcomes in each trial (e.g., heads or tails). The multinomial distribution occurs in similar counting experiments but with two or more outcomes per trial. For example, the English language uses 26 letters (occurring in both uppercase and lowercase versions), 10 numbers, and various punctuation symbols. We can ask questions like "What is the probability of a letter t?" and "How many letter t's can we expect in a string of n letters?" These questions lead to multinomial probabilities.

Consider an experiment that generates a sequence of n symbols X_1, X_2, ..., X_n. For example, the symbols might be letters from an alphabet or the colors of a series of automobiles or many other things. For convenience, we will assume each symbol is an integer in the range from 0 to m-1. (In other words, n is the sum of the counts, and m is the size of the alphabet.) Let the probability of outcome k be p_k = Pr[X_i = k]. The probabilities sum to 1; that is, p_0 + p_1 + ... + p_{m-1} = 1. Let N_k equal the number of X_i = k for k = 0, 1, ..., m-1. Thus, N_0 + N_1 + ... + N_{m-1} = n. For instance, n is the total number of cars, and N_0 might be the number of red cars, N_1 the number of blue cars, etc. The probability of a particular collection of counts is the multinomial probability:

Pr[N_0 = k_0 ∩ ... ∩ N_{m-1} = k_{m-1}] = \binom{n}{k_0, k_1, ..., k_{m-1}} p_0^{k_0} p_1^{k_1} ... p_{m-1}^{k_{m-1}}     (6.8)

For example, a source emits symbols from a four-letter alphabet with probabilities p_0 = 0.4, p_1 = 0.3, p_2 = 0.2, and p_3 = 0.1. One sequence of 20 symbols is 1, 2, 0, 2, 0, 3, 2, 1, 1, 0, 1, 1, 2, 1, 0, 2, 2, 3, 1, 1.^1 The counts are N_0 = 4, N_1 = 8, N_2 = 6, and N_3 = 2. The probability of this particular set of counts is

Pr[N_0 = 4 ∩ N_1 = 8 ∩ N_2 = 6 ∩ N_3 = 2] = \binom{20}{4, 8, 6, 2} 0.4^4 0.3^8 0.2^6 0.1^2 = 0.002

This probability is small for two reasons. First, with n = 20, there are many possible sets of counts; any particular one is unlikely. Second, this particular sequence has relatively few 0's despite 0 being the most likely symbol. For comparison, the expected counts (8,6,4,2) have probability 0.013, about six times as likely, but still occurring only about once every 75 trials.

The first sequence I generated.
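Equation (6.8) can be evaluated directly. The sketch below (Python for illustration; multinomial_pmf is our name) reproduces the two probabilities quoted above.

from math import comb, prod

def multinomial_pmf(counts, probs):
    """Multinomial probability of a vector of counts, Equation (6.8)."""
    coeff, remaining = 1, sum(counts)
    for k in counts:
        coeff *= comb(remaining, k)      # multinomial coefficient as a product of binomials
        remaining -= k
    return coeff * prod(p ** k for p, k in zip(probs, counts))

probs = (0.4, 0.3, 0.2, 0.1)
print(round(multinomial_pmf((4, 8, 6, 2), probs), 4))   # 0.0019, the observed counts
print(round(multinomial_pmf((8, 6, 4, 2), probs), 4))   # 0.0133, the expected counts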


The mean and variance of each count are

E[N_i] = n p_i                      (6.9)
Var[N_i] = n p_i (1 - p_i)          (6.10)

These are the same as for the binomial distribution. The covariance between N_i and N_j for i ≠ j is

Cov[N_i, N_j] = -n p_i p_j          (6.11)

The covariance is a measure of how one variable varies with changes to the other variable. For the multinomial, the N's sum to a constant, N_0 + N_1 + ... + N_{m-1} = n. If N_i is greater than its average, it is likely that N_j is less than its average. That Cov[N_i, N_j] < 0 for the multinomial follows from this simple observation. For instance, in the example above, E[N_0] = 20 × 0.4 = 8, E[N_1] = 20 × 0.3 = 6, Var[N_0] = 20 × 0.4 × 0.6 = 4.8, and Cov[N_0, N_1] = -20 × 0.4 × 0.3 = -2.4.

6.5.3 The Negative Binomial Distribution

The binomial distribution helps answer the question "In n independent flips, how many heads can one expect?" The negative binomial distribution helps answer the reverse question "To get k heads, how many independent flips are needed?"

Let N be the number of flips required to obtain k heads. The event {N = n} is a sequence of n-1 flips containing k-1 heads, followed by a head on the nth flip. The probability of this sequence is

Pr[N = n] = \binom{n-1}{k-1} p^k q^{n-k}   for n = k, k+1, k+2, ...     (6.12)

The first 12 terms of the negative binomial distribution for p = 0.5 and k = 3 are shown below:

(Stem plot of Pr[N = n] for p = 0.5, k = 3, over n = 3, 4, ..., 14.)

For instance, Pr[N = 5] = \binom{5-1}{3-1} 0.5^3 0.5^{5-3} = \binom{4}{2} 0.5^5 = 6 × 0.03125 = 0.1875.

Just as the binomial is a sum of n Bernoulli random variables, the negative binomial is a sum of k geometric random variables. (Recall that the geometric is the number of flips required to get one head.) Therefore, the MGF of the negative binomial is the MGF of the geometric raised to the kth power:

M_N(u) = ( pe^u / (1 - qe^u) )^k     (by Equations 4.23 and 5.11)     (6.13)


Moments of the negative binomial can be calculated easily from the geometric. Let X be a geometric random variable with mean 1/p and variance (1-p)/p^2. Then,

E[N] = k · E[X] = k/p                    (6.14)
Var[N] = k · Var[X] = k(1-p)/p^2         (6.15)

EXAMPLE 6.3

In a baseball inning, each team sends a succession of players to bat. In a simplified version of the game, each batter either gets on base or makes an out. The team bats until there are three outs (k = 3). If we assume each batter makes an out with probability p = 0.7 and the batters are independent, then the number of batters is negative binomial. The first few probabilities are

Pr[N = 3] = \binom{3-1}{3-1} 0.7^3 0.3^{3-3} = 0.343
Pr[N = 4] = \binom{4-1}{3-1} 0.7^3 0.3^{4-3} = 0.309
Pr[N = 5] = \binom{5-1}{3-1} 0.7^3 0.3^{5-3} = 0.185
Pr[N = 6] = \binom{6-1}{3-1} 0.7^3 0.3^{6-3} = 0.093

The mean number of batters per team per inning is

E[N] = 3/0.7 = 4.3
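The negative binomial PMF of Equation (6.12) is a one-liner to compute. A sketch (Python for illustration; negbinom_pmf is our name) reproduces the inning probabilities above.

from math import comb

def negbinom_pmf(n, k, p):
    """Pr[N = n]: probability that the k-th success occurs on trial n, Equation (6.12)."""
    return comb(n - 1, k - 1) * p ** k * (1 - p) ** (n - k)

# k = 3 outs, each batter is out with probability p = 0.7
for n in range(3, 7):
    print(n, round(negbinom_pmf(n, 3, 0.7), 3))   # 0.343, 0.309, 0.185, 0.093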

6.5.4 The Poisson Distribution

The Poisson distribution is a counting distribution (see Section 4.5.3). It is the limit of the binomial distribution when n is large and p is small. The advantage of the correspondence is that binomial probabilities can be approximated by the easier-to-compute Poisson probabilities.

Let N be binomial with parameters n and p, and let λ = np = E[N]. We are interested in cases when n is large, p is small, but λ is moderate. For instance, the number of telephones in an area code is in the hundreds of thousands, the probability of any given phone being used at a particular time is small, but the average number of phones in use is moderate. As another example, consider transmitting a large file over a noisy wireless channel. The file contains millions of bits, the probability of any given bit being in error is small, but the number of errors per file is moderate.


The probability N = k is

Pr[N = k] = \binom{n}{k} p^k (1-p)^{n-k} = (n! / (k!(n-k)!)) p^k (1-p)^{n-k}

Let us look at what happens when n is large, p is small, and λ = np is moderate:

n!/(n-k)! ≈ n^k

log (1-p)^{n-k} = (n-k) log(1 - λ/n)      (λ = np)
               ≈ n log(1 - λ/n)           (n-k ≈ n)
               ≈ -λ                        (log(1+x) ≈ x)

(1-p)^{n-k} ≈ e^{-λ}

Putting it all together, for k = 0, 1, 2, ...

Pr[N = k] ≈ (n^k / k!) p^k e^{-λ} = λ^k e^{-λ} / k!     (6.16)

These are the Poisson probabilities. To summarize, the limit of binomial probabilities when n is large, p is small, and λ = np is moderate is Poisson. Figure 6.5 shows the convergence of the binomial to the Poisson. The top graph shows a somewhat poor convergence when n = 10 and p = 0.5. The bottom graph shows a much better convergence with n = 50 and p = 0.1.


FIGURE 6.5 Comparison of binomial and Poisson PMFs. Both have the same λ = np. The top graph compares a binomial with n = 10, p = 0.5 to a Poisson with λ = 5. The agreement is poor. The bottom graph has n = 50 and p = 0.1 and shows much better agreement.
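The quality of the Poisson approximation is easy to quantify. The sketch below (Python for illustration; the function names are ours) computes the largest difference between the two PMFs for the two cases in Figure 6.5.

from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

for n, p in [(10, 0.5), (50, 0.1)]:      # both have lambda = n*p = 5
    worst = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, n * p)) for k in range(n + 1))
    print(n, p, round(worst, 3))         # about 0.071 for n = 10, about 0.009 for n = 50

The approximation improves as n grows and p shrinks with λ = np held fixed.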


6.6 BINOMIAL AND MULTINOMIAL ESTIMATION

A common problem in statistics is estimating the parameters of a probability distribution. In this section, we consider the Bernoulli, binomial, and multinomial distributions. As one example, consider a medical experiment to evaluate whether or not a new drug is helpful. The drug might be given to n patients. Of these n, k patients improved. What can we say about the probability the drug leads to an improvement?

Let X_1, X_2, ..., X_n be n IID Bernoulli random variables with unknown probability p of a 1 (and probability q = 1-p of a 0). Let S = X_1 + ... + X_n be the sum of the random variables. Then, S is binomial with parameters n and p, and

E[S] = np        Var[S] = npq

We will use the notation \hat{p} to denote an estimate of p. \hat{p} is bold because it is a random variable; its value depends on the outcomes of the experiment. Let k equal the actual number of 1's observed in the sequence and n-k equal the observed number of 0's. An obvious estimate of p is

\hat{p} = S/n = k/n

E[\hat{p}] = E[S]/n = np/n = p

Var[\hat{p}] = Var[S]/n^2 = npq/n^2 = pq/n

Since the expected value of \hat{p} is p, we say \hat{p} is an unbiased estimate of p, and since the variance of \hat{p} goes to 0 as n → ∞, we say \hat{p} is a consistent estimate of p. Unbiased means the average value of the estimator equals the value being estimated; that is, there is no bias. Consistent means the variance of the estimator goes to 0 as the number of observations goes to infinity. In short, estimators that are both unbiased and consistent are likely to give good results.

Estimating the parameters of a multinomial distribution is similar. Let X_i be an observation from an alphabet of m letters (from 0 to m-1). Let p_i = Pr[X = i], and let k_i be the number of i's in n observations. Then, \hat{p}_i is

\hat{p}_i = k_i / n

As with the binomial distribution, \hat{p}_i = k_i/n is an unbiased and consistent estimator of p_i.

In summary, the obvious estimator of p in the binomial and multinomial distributions is the sample average of the n random variables, X_1, X_2, ..., X_n. Since it is an unbiased and


consistent estimator of p, the sample average is a good estimator and is commonly used. Furthermore, as we will see in later chapters, the sample average is often a good parameter estimate for other distributions as well.

Comment 6.3: It is especially important in estimation problems like these to

distinguish between the random variables and the observations. X and S are random variables. Before we perform the experiment, we do not know their values. After the experiment, X and S have values, such as S = k. Before doing the experiment, \hat{p} = S/n is a random variable. After doing the experiment, \hat{p} has a particular value. When we say \hat{p} is unbiased and consistent, we mean that if we did this experiment many times, the average value of \hat{p} would be close to p.
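A quick simulation illustrates both properties. This sketch (Python for illustration; estimate_p is our name, and the trial count is arbitrary) repeats the experiment many times and reports the average and spread of \hat{p} = S/n.

import random

def estimate_p(n, p, trials=2000):
    """Simulate p_hat = S/n for many experiments; return its sample mean and variance."""
    ests = [sum(random.random() < p for _ in range(n)) / n for _ in range(trials)]
    mean = sum(ests) / trials
    var = sum((e - mean) ** 2 for e in ests) / trials
    return mean, var

random.seed(1)
for n in (10, 100, 1000):
    mean, var = estimate_p(n, p=0.3)
    print(n, round(mean, 3), round(var, 5))
# the mean stays near p = 0.3 (unbiased); the variance shrinks like pq/n (consistent)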

6.7 ALOHANET In 1970, the University of Hawaii built a radio network that connected four islands with the central campus in Oahu. This network was known as Alohanet and eventually led to the widely used Ethernet, Internet, and cellular networks of today. The Aloha protocol led to many advances in computer networks as researchers analyzed its strengths and weaknesses and developed improvements. The original Aloha network was a star, something like the illustration below:


The individual nodes A, B, C, etc., communicate with the hub H in Oahu over one broadcast radio channel, and the hub communicates with the nodes over a second broadcast radio channel. The incoming channel was shared by all users. The original idea was for any remote user to send a packet (a short data burst) to the central hub whenever it had any data to send. If it received the packet correctly, the central hub broadcasted an acknowledgment (if the packet was destined for the hub) or rebroadcasted the packet (if the packet was destined for another node). If two or more nodes transmitted packets that overlapped in time, the hub received none of the packets correctly. This is called a collision. Since the nodes could not hear each other (radios cannot both transmit and receive on the same channel at the same time),


collisions were detected by listening for the hub's response (either the acknowledgment or the rebroadcast). Consider a collision as in the illustration below:

One user sends a packet from time t 1 to time t2 = t 1 + T (solid line). If another user starts a packet at any time between t0 = t 1 - T and t 2 (dashed lines), the two packets will partially overlap, and both will need to be retransmitted. Shortly afterward, a significant improvement was realized in Slotted Aloha. As the name suggests, the transmission times became slotted. Packets were transmitted only on slot boundaries. In the example above, the first dashed packet would be transmitted at time t 1 and would (completely) collide with the other packet. However, the second dashed packet would wait until t2 and not collide with the other packet. At low rates, Slotted Aloha reduces the collision rate in half. Let us calculate the efficiency of Slotted Aloha. Let there be n nodes sharing the communications channel. In any given slot, each node generates a packet with probability p. We assume the nodes are independent (i.e., whether a node has a packet to transmit is independent of whether any other nodes have packets to transmit). The network successfully transmits a packet if one and only one node transmits a packet. If no nodes transmit, nothing is received. If more than one node transmits, a collision occurs. Let N be the number of nodes transmitting. Then, Pr[N=O] =(1-p)"

Pr[N = 1] = \binom{n}{1} p (1-p)^{n-1} = np(1-p)^{n-1}

Pr[N ≥ 2] = 1 - (1-p)^n - np(1-p)^{n-1}

Let λ = np be the offered packet rate, or the average number of packets attempted per slot. The Poisson approximation to the binomial in Equation (6.16) gives a simple expression:

Pr[N = 1] ≈ λe^{-λ}

This throughput expression is plotted in Figure 6.6. The maximum throughput equals e^{-1} = 0.368 and occurs when λ = 1. Similarly, Pr[N = 0] ≈ e^{-λ} = e^{-1} when λ = 1. So, Slotted Aloha has a maximum throughput of 0.37. This means about 37% of the time, exactly one node transmits and the packet is successfully transmitted; about 37% of the time, no node transmits; and about 26% of the time, collisions occur.

The maximum throughput of Slotted Aloha is rather low, but even 37% overstates the throughput. Consider what happens to an individual packet. Assume some node has a packet to transmit. The probability this packet gets through is the probability no other node has a packet to transmit. When n is large, this is


FIGURE 6.6 Slotted Aloha's throughput for large n.

Pr[N = 0] = e^{-λ}. Let this number be r. In other words, with probability r = e^{-λ}, the packet is successfully transmitted, and with probability 1 - r = 1 - e^{-λ}, it is blocked (it collides with another packet). If blocked, the node will wait (for a random amount of time) and retransmit. Again, the probability of success is r, and the probability of failure is 1 - r. If blocked again, the node will attempt a third time to transmit the packet, and so on. Let T denote the number of tries needed to transmit the packet. Then,

Pr[T = 1] = r,   Pr[T = 2] = r(1-r),   Pr[T = 3] = r(1-r)^2

and so on. In general, Pr[T = k] = r(1-r)^{k-1}. T is a geometric random variable, and the mean of T is

E[T] = 1/r = e^{λ}

On average, each new packet thus requires e^{λ} tries before it is transmitted successfully. Since all nodes do this, more and more packets collide, and the throughput drops further until the point when all nodes transmit all the time and nothing gets through. The protocol is unstable unless rates well below 1/n are used. Aloha and Slotted Aloha are early protocols, though Slotted Aloha is still sometimes used (e.g., when a cellphone wakes up and wants to transmit). As mentioned, however, both led to many advances in computer networks that are in use today.
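The throughput formula is simple to evaluate for finite n and to compare against the Poisson approximation. A sketch (Python for illustration; slotted_aloha_throughput is our name, and n = 50 is an arbitrary network size):

from math import exp

def slotted_aloha_throughput(n, p):
    """Pr[N = 1]: exactly one of the n nodes transmits in a slot."""
    return n * p * (1 - p) ** (n - 1)

n = 50
for lam in (0.5, 1.0, 1.5, 2.0):
    exact = slotted_aloha_throughput(n, lam / n)
    print(lam, round(exact, 3), round(lam * exp(-lam), 3))   # exact vs. lambda*exp(-lambda)

The maximum of both expressions occurs near λ = 1, consistent with Figure 6.6.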

6.8 ERROR CONTROL CODES Error correcting coding (ECC) is commonly used in communications systems to reduce the

effects of channel errors. We assume the data being communicated are bits and each bit is possibly flipped by the channel. The basic idea of error control codes is to send additional bits, which help the receiver detect and correct for any transmission errors. As we shall see, the analysis of error control codes uses binomial probabilities. ECC is used in many systems. Possibly the first consumer item to use nontrivial ECC was the compact disc (CD) player in the early 1980s. Since then, ECC has been incorporated


into numerous devices, including cell phones, digital television, wireless networks, and others. Pretty much any system that transmits bits uses ECC to combat noise. Throughout this section, we consider a basic class of codes, called linear block codes. Linear block codes are widely used and have many advantages. Other useful codes, such as nonlinear block codes and convolutional codes, also exist but are beyond this text.

6.8.1 Repetition-by-Three Code

As a simple example to illustrate how ECC works, consider the repetition-by-three code. Each input bit is replicated three times, as shown in the table below:

Input    Output
0        000
1        111

The code consists of two code words, 000 and 111. The Hamming distance between code words is defined as the number of bits in which they differ. In this case, the distance is three. Let D(·,·) be the Hamming distance; for example, D(000, 111) = 3. Each bit is passed through a binary symmetric channel with crossover probability ε, as illustrated below:

(Binary symmetric channel: the input bit X is received as Y; each bit is received correctly with probability 1 - ε and flipped with probability ε.)

The number of bits in a code word that get flipped is binomial with parameters n = 3 and p = ε. Since ε is small, Equation (6.4) says that the b(n,k,p) sequence is decreasing. Getting no errors is more likely than getting one error, getting one error is more likely than getting two errors, and getting two errors is more likely than getting three errors. If W denotes the number of errors, then

Pr[W = 0] > Pr[W = 1] > Pr[W = 2] > Pr[W = 3]

The receiver receives one of eight words, 000, 001, ... , 111. In Figure 6.7, the two code words are at the left and right. The three words a distance of one away from 000 are listed on the left side, and the three words a distance of one away from 111 are listed on the right. Error control codes can be designed to detect errors or to correct them (or some combination of both). An error detection scheme is typically used with retransmissions. Upon detecting an error, the receiver asks the transmitter to repeat the communication. Error correction is used when retransmissions are impractical or impossible. The receiver tries to correct errors as best it can.


> 1). (Hint: one simple way to show a matrix is singular is to find a vector x̄ ≠ 0 such that Cx̄ = 0.)

6.21

Use the MGF to calculate the mean and variance of a negative binomial random variable N with parameters k and p.


6.22

In a recent five-year period, a small town in the western United States saw 11 children diagnosed with childhood leukemia. The normal rate for towns that size would be one to two cases per five-year period. (No clear cause of this cluster has been identified. Fortunately, the rate of new cancers seems to have reverted to the norm.) a. Incidences of cancer are often modeled with Poisson probabilities. Calculate the probability that a similar town with n = 2000 children would have k = 11 cancers using two different values for the probability, Pi = 1/2000 and pz = 2/2000. Use both the binomial probability and the Poisson probability. b. Which formula is easier to use? c. Most so-called cancer clusters are the result of chance statistics. Does this cluster seem likely to be a chance event?

6.23

Alice and Bob like to play games. To determine who is the better player, they play a "best k of n" competition, where n = 2k- 1. Typical competitions are "best 2 of 3" and "best 4 of7", though as large as "best 8 of 15" are sometimes used (sometimes even larger). If the probability Alice beats Bob in an individual game is p and the games are independent, your goal is to calculate the probability Alice wins the competition. a. Such competitions are usually conducted as follows: As soon as either player, Alice or Bob, wins k games, the competition is over. The remaining games are not played. What is the probability Alice wins? (Hint: this probability is not binomial.) b. Consider an alternative. All n games are played. Whichever player has won k or more games wins the competition. Now, what is the probability Alice wins the competition? (Hint: this probability is binomial.) c. Show the two probabilities calculated above are the same.

6.24

A group of n students engage in a competition to see who is the "best" coin flipper. Each one flips a coin with probability p of coming up heads. If the coin comes up tails, the student stops flipping; if the coin comes up heads, the student continues flipping. The last student flipping is the winner. Assume all the flips are independent. What is the probability at least k of the n students are still flipping after t flips?

6.25

The throughput for Slotted Aloha for finite n is Pr[N = 1] = λ(1 - λ/n)^{n-1}.

a. What value of λ maximizes the throughput?

b. Plot the maximum throughput for various values of n to show the throughput's convergence to e^{-1} when n → ∞.

6.26

In the Aloha collision resolution discussion, we mentioned the node waits a random amount of time (number of slots) before retransmitting the packet. Why does the node wait a random amount of time?

6.27

If S is binomial with parameters n and p, what is the probability S is even? (Hint: Manipulate (p + q)n and (p- q)n to isolate the even terms.)

6.28

Consider an alternative solution to Problem 6.27. Let Sn = Sn-1 + Xn, where Xn is a Bernoulli random variable that is independent of Sn-1· Solve for the Pr [Sn = even] in terms of Pr [Sn-l =even], and then solve the recursion.

6.29

The probability of getting no heads in n flips is (1-p)^n. This probability can be bounded as follows:

1 - np ≤ (1-p)^n ≤ e^{-np}

a. Prove the left-hand inequality. One way is to consider the function h(x) = (1-x)^n + nx and use calculus to show h(x) achieves its minimum value when x = 0. Another way is by induction. Assume the inequality is true for n-1 (i.e., (1-p)^{n-1} ≥ 1 - (n-1)p), and show this implies it is true for n.

b. Prove the right-hand side. Use the inequality log(1+x) ≤ x.

c. Evaluate both inequalities for different combinations of n and p with np = 0.5 and np = 0.1. 6.30

Another way to prove the left-hand inequality in Problem 6.29, 1 - np ≤ (1-p)^n, is to use the union bound (Equation 1.7). Let X_1, X_2, ..., X_n be IID Bernoulli random variables. Let S = X_1 + X_2 + ... + X_n. Then, S is binomial. Use the union bound to show Pr[S > 0] ≤ np, and rearrange to show the left-hand inequality.

6.31

Here's a problem first solved by Isaac Newton (who did it without calculators or computers). Which is more likely: getting at least one 6 in a throw of six dice, getting at least two 6's in a throw of 12 dice, or getting at least three 6's in a throw of 18 dice?

6.32

In the game of Chuck-a-luck, three dice are rolled. The player selects a number between 1 and 6. If the player's number comes up on exactly one die, the player wins $1. If the player's number comes up on exactly two dice, the player wins $2. If the player's number comes up on all three dice, the player wins $3. If the player's number does not come up, the player loses $1. Let X denote the player's win or loss. a. What is E[X]? b. In some versions, the player wins $10 if all three dice show the player's number. Now whatisE[X]?

6.33

Write a short program to simulate a player's fortune while playing the standard version of Chuck-a-luck as described in Problem 6.32. Assume the player starts out with $100 and each play costs $1. Let N be the number of plays before a player loses all of his or her money. a. Generate a large number of trials, and estimate E[N] and Var[N]. b. If each play of the actual game takes 30 seconds, how long (in time) will the player typically play?

6.34

Generalize the repetition-by-three ECC code to a repetition-by-r code.

a. What is the probability of a miss?

b. If r is odd, what is the probability of decoding error?

c. What happens if r is even? (It is helpful to consider r = 2 and r = 4 in detail.)

d. Make a decision about what to do if r is even, and calculate the decoding error probability.


6.35

For the (5,2) code in Table 6.1: a. Compute the distance between all pairs of code words, and show the distance of the code is three. b. Show the difference between any pair of code words is a code word.

6.36

A famous code is the Hamming (7,4) code, which uses a generator matrix:

[generator matrix]

a. What are all 2^4 = 16 code words?

b. What is the distance of the code? c. What are its miss and decoding error rate probabilities? 6.37

Another famous code is the Golay (23,12) code. It has about the same rate as the Hamming (7,4) code but a distance of seven.

a. What are its miss and decoding error probabilities?

b. Make a log-log plot of the decoding error rate of the Golay (23,12) code and the Hamming (7,4) code. (Hint: some Matlab commands you might find useful are loglog and logspace. Span the range 10^{-4} ≤ ε ≤ 10^{-1}.)

6.38

Compare the coding rates and the error decoding rates for the (3,1) repetition code, the (5,2) code above, the (7,4) Hamming code, and the (23,12) Golay code. Is one better than the others?

CHAPTER 7

A CONTINUOUS RANDOM VARIABLE

What is the probability a randomly selected person weighs exactly 150 pounds? What is the probability the person is exactly 6 feet tall? What is the probability the temperature outside is exactly 45°F? The answer to all of these questions is the same: a probability of 0. No one weighs exactly 150 pounds. Each person is an assemblage of an astronomical number of particles. The likelihood of having exactly 150 pounds worth of particles is negligible. What is meant by questions like "What is the probability a person weighs 150 pounds?" is "What is the probability a person weighs about 150 pounds?" where "about" is defined as some convenient level of precision. In some applications, 150 ± 5 pounds is good enough, while others might require 150 ± 1 pounds or even 150 ± 0.001 pounds. Engineers are accustomed to measurement precision. This chapter addresses continuous random variables and deals with concepts like "about:'

7.1 BASIC PROPERTIES

A random variable X is continuous if Pr[X = x] = 0 for all values of x. What is important for continuous random variables are the probabilities of intervals, say, Pr[a < X ≤ b].

Let x_0 > 0 be a known value. What is the probability at least k of the n random variables are greater than x_0?

8.25

Consider the random sum S = \sum_{i=1}^{N} X_i, where the X_i are IID Bernoulli random variables with parameter p and N is a Poisson random variable with parameter λ. N is independent of the X_i values.

a. Calculate the MGF of S.

b. Show S is Poisson with parameter λp. Here is one interpretation of this result: If the number of people with a certain disease is Poisson with parameter λ and each person tests positive for the disease with probability p, then the number of people who test positive is Poisson with parameter λp.

8.26

Repeat Example 8.7 with X and Y being IID U(0,1) random variables.

8.27

Let X and Y be independent with X Bernoulli with parameter p and Y - U(O, 1). Let Z=X+Y. a. What are the mean and variance of Z? b. What are the density and distribution function of Z? c. Use the density of Z to compute the mean and variance of Z.


8.28

Let X and Y be independent with X binomial with parameters n and p and Y- U(O, 1). LetZ =X+ Y. a. What are the mean and variance of Z? b. What are the density and distribution function of Z? c. Use the density of Z to compute the mean and variance of Z.

8.29

Let X and Y be IID U(0,1) random variables, and let Z = X/Y.

a. What are the density and distribution functions of Z? b. What is the median of the distribution of Z? c. What is the mean of Z?

CHAPTER 9

THE GAUSSIAN AND RELATED DISTRIBUTIONS

The Gaussian distribution is arguably the most important probability distribution. It occurs in numerous applications in engineering, statistics, and science. It is so common, it is also referred to as the "normal" distribution.

9.1 THE GAUSSIAN DISTRIBUTION AND DENSITY

X is Gaussian with mean μ and variance σ^2 if

f_X(x) = (1/\sqrt{2πσ^2}) e^{-(x-μ)^2/(2σ^2)}

r = s\sqrt{2 log 2} = 1.177s     (9.20)



FIGURE 9.9 Rayleigh density and its median, mean, and standard deviation. Note that the median is less than the mean (median = 1.177s < 1.253s =mean).

f_R(r) = (r/s^2) e^{-r^2/(2s^2)},   0 ≤ r < ∞     (9.21)

E[R] = \int_0^{∞} (r^2/s^2) e^{-r^2/(2s^2)} dr     (9.22)
     = (s\sqrt{2π}/2) \int_{-∞}^{∞} (r^2/(s^3\sqrt{2π})) e^{-r^2/(2s^2)} dr
     = s\sqrt{π/2} = 1.253s     (9.23)

E[R^2] = E[X^2 + Y^2] = 2s^2     (9.24)

Var[R] = 2s^2 - (π/2)s^2 = ((4-π)/2) s^2 = (0.655s)^2     (9.25)

The integral in Equation (9.22) is the second moment of a Gaussian density and is therefore equal to 1 (by Equation 9.14).

EXAMPLE 9.3

In Figure 9.12 in Section 9.6, 14 of the 1000 points have a radius of greater than or equal to 3.0. (The point on the circle at five o'clock has a radius of 3.004.) Are 14 such points unusual? The radius is Rayleigh with parameter s = 1. The probability of any point having a radius greater than 3.0 is

p = Pr[R ≥ 3.0] = 1 - Pr[R < 3.0] = e^{-3.0^2/2} = 0.011

R(t) = S(t) + N(t)

It turns out that since sin(θ) and cos(θ) are orthogonal over a complete period,

\int_0^{2π} cos θ sin θ dθ = 0


the signals can be broken up into a cosine term and a sine term. These are known as the quadrature components:

S(t) = S_x(t) cos(ω_c t) + S_y(t) sin(ω_c t)
N(t) = N_x(t) cos(ω_c t) + N_y(t) sin(ω_c t)
R(t) = R_x(t) cos(ω_c t) + R_y(t) sin(ω_c t) = (S_x(t) + N_x(t)) cos(ω_c t) + (S_y(t) + N_y(t)) sin(ω_c t)

where ω_c is the carrier frequency in radians per second, ω_c = 2πf_c, where f_c is the carrier frequency in hertz (cycles per second). The receiver demodulates the received signal and separates the quadrature components:

R_x(t) = S_x(t) + N_x(t)
R_y(t) = S_y(t) + N_y(t)

In the QAM application considered here,

S_x(t) = \sum_{n=-∞}^{∞} A_x(n) p(t - nT)
S_y(t) = \sum_{n=-∞}^{∞} A_y(n) p(t - nT)

The pair (A_x, A_y) represents the data sent. This is the data the receiver tries to recover (much more about this below). The p(t) is a pulse-shape function. While p(t) can be complicated, in many applications it is just a simple rectangle function:

p(t) = 1 for 0 ≤ t < T, and p(t) = 0 otherwise

the speedup is exp(Δ^2/(2σ^2)) = exp(ψ^2/2). Selected values of exp(ψ^2/2) are shown in Table 9.1. When ψ = 3, the speedup is a factor of 90; when ψ = 4, the speedup factor is almost 3000.







FIGURE 9.17 Illustration of Monte Carlo study for error rate of a rectangle region, but conditioned on being outside the circle of radius Δ. The illustration is drawn for ψ = Δ/σ = 2.0.

TABLE 9.1 Table of Monte Carlo speedups attained versus ψ.

ψ         1.0   1.5   2.0   2.5   3.0   3.5   4.0
Speedup   1.6   3.1   7.4   23    90    437   2981

2981

Recall that in Section 9.6.2 we showed the Box-Muller method can generate two IID zero-mean Gaussian random variables from a Rayleigh distribution and a uniform distribution. Here, we modify the Box-Muller method to generate conditional Gaussians and dramatically speed up the Monte Carlo calculation. Generate a Rayleigh distribution conditioned on being greater than Δ (i.e., conditioned on being greater than the radius of the circle): Pr[R ≤ r | R ≥ Δ] =

Pr[|\hat{p} - p| > ε] → 0 as n → ∞ for all ε > 0. Chebyshev's inequality implies that if the estimator is unbiased and its variance approaches 0 as n → ∞, then the estimator is consistent. This is usually the easiest way to demonstrate that an estimator is consistent. In this example, the sample average is an unbiased and consistent estimator of p. Equation (10.2) shows the estimator is unbiased, and Equation (10.3) shows the variance goes to 0 as n tends to infinity. Both unbiasedness and consistency are generally considered to be good characteristics of an estimator, but just how good is \hat{p} = \bar{X}_n? Or, to turn the question around, how large does n have to be in order for us to have confidence in our prediction? The variance of \bar{X}_n depends on the unknown p, but we can bound it as follows, p(1-p) ≤ 0.25 with equality when p = 0.5. Since the unknown p is likely to be near 0.5, the bound is likely to be close; that is, p(1-p) ≈ 0.25.


p

Now, we can use Chebyshev's inequality (Equation 4.16) to estimate how good this estimator is:

Pr[|\hat{p} - p| > ε] ≤ Var[\hat{p}]/ε^2 = p(1-p)/(nε^2) ≤ 1/(4nε^2)

Let us put in some numbers. If 95% of the time we want to be within ε = 0.1, then

Pr[|\hat{p} - p| > 0.1] ≤ 1/(4n · 0.1^2) ≤ 1 - 0.95


Solving this yields n ≥ 500. To get this level of accuracy (-0.1 ≤ \hat{p} - p ≤ 0.1) and be wrong no more than 5% of the time, we need to sample n = 500 voters. Some might object that getting to within 0.1 is unimpressive, especially for an election prediction. If we tighten the bound to ε = 0.03, the number of voters we need to sample now becomes n ≥ 5555. Sampling 500 voters might be doable, but sampling 5000 sounds like a lot of work. Is n > 5555 the best estimate we can find? The answer is no. We can find a better estimate. We used Chebyshev's inequality to bound n. The primary advantage of the Chebyshev bound is that it applies to any random variable with a finite variance. However, this comes at a cost: the Chebyshev bound is often loose. By making a stronger assumption—in this case, that the sample average is approximately Gaussian—we can get a better estimate of n. In this example, \hat{p} = \bar{X}_n is the average of n IID Bernoulli random variables. Hence, \bar{X}_n is binomial (scaled by 1/n), and since n is large, \hat{p} is approximately Gaussian by the CLT, \hat{p} ~ N(p, pq/n). Thus,

Pr[-ε ≤ \hat{p} - p ≤ ε] = Pr[ -ε/\sqrt{pq/n} ≤ (\bar{X}_n - p)/\sqrt{pq/n} ≤ ε/\sqrt{pq/n} ]     (normalize)
                        ≈ Pr[ -ε/\sqrt{1/4n} ≤ (\bar{X}_n - p)/\sqrt{pq/n} ≤ ε/\sqrt{1/4n} ]     (bound pq)
                        ≈ Pr[ -2ε\sqrt{n} ≤ Z ≤ 2ε\sqrt{n} ]                                     (Z ~ N(0,1))
                        = Φ(2ε\sqrt{n}) - Φ(-2ε\sqrt{n}) = 2Φ(2ε\sqrt{n}) - 1
                        = 2Φ(z) - 1                                                              (z = 2ε\sqrt{n})

Solving 2Φ(z) - 1 = 0.95 yields z ≈ 1.96 (the standard normal density has 95% of its probability between -1.96 and 1.96).

Now, solve for n:

1.96 = z = 2ε\sqrt{n}

n = 1.96^2/(4ε^2) = 0.96/ε^2 ≈ ε^{-2}     (10.4)

For ε = 0.03, n = 1067, which is about one-fifth as large as the Chebyshev inequality indicated. Thus, for a simple election poll to be within ε = 0.03 of the correct value 95% of the time, a sample of about n = 1067 people is required.
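Equation (10.4) gives a quick rule of thumb for poll sizes. A sketch (Python for illustration; poll_sample_size is our name):

def poll_sample_size(eps, z=1.96):
    """Approximate sample size from Equation (10.4): n = z^2 / (4 eps^2), using pq <= 1/4."""
    return z ** 2 / (4 * eps ** 2)

for eps in (0.1, 0.05, 0.03):
    print(eps, round(poll_sample_size(eps)))   # 96, 384, 1067

Halving ε quadruples the required sample size, which is the scaling used in Example 10.1.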



0.55

'

FIGURE 10.1 Effect of sample size on confidence intervals. As n increases, the confidence interval shrinks, in this example, from 0.35 ≤ p ≤ 0.55 for n = 100 to 0.4 ≤ p ≤ 0.5 for n = 400 and to 0.425 ≤ p ≤ 0.475 for n = 1600. Each time n is increased by a factor of four, the width of the confidence interval shrinks by a factor of two (and the height of the density doubles). See Example 10.1 for further discussion.

EXAMPLE 10.1

Let's illustrate how the confidence intervals vary with the number of samples, n. As above, consider a simple poll. We are interested in who is likely to win the election. In other words, we want to know if p < 0.5 or p > 0.5. How large must n be so the probability of correctly concluding {p < 0.5} or {p > 0.5} is at least 0.95? In this example, we assume p = 0.45. The distance between p and 0.5 (the dividing line between winning and losing) is ε = 0.5 - 0.45 = 0.05. From Equation (10.4),

n ≈ ε^{-2} = (0.05)^{-2} = 400

If ε is halved, n increases by a factor of four; if ε is doubled, n decreases by a factor of four. Thus, for n = 100, ε = 0.1; for n = 1600, ε = 0.025. The 95% confidence intervals for n = 100, for n = 400, and for n = 1600 are shown in Figure 10.1. For n = 100, we are not able to predict a winner as the confidence interval extends from p = 0.35 to p = 0.55; that is, the confidence interval is on both sides of p = 0.5. For n = 400, however, the confidence interval is smaller, from p = 0.40 to p = 0.50. We can now, with 95% confidence, predict that Bob will win the election.


Comment 10.1: In the Monte Carlo exercise in Section 9.7.3, p is small and varies over several orders of magnitude, from about 10^{-1} to 10^{-4}. Accordingly, we used a relative error to assess the accuracy of \hat{p} (see Equation 9.43):

Pr[-ε ≤ (\hat{p} - p)/p ≤ ε] ≥ 0.95     (10.5)

In the election poll, p ≈ 0.5. We used a simple absolute, not relative, error:

Pr[-ε ≤ \hat{p} - p ≤ ε] ≥ 0.95

There is no one rule for all situations: choose the accuracy criterion appropriate for the problem at hand.

10.2 ESTIMATING THE MEAN AND VARIANCE

The election poll example above illustrated many aspects of statistical analysis. In this section, we present some basic ideas of data analysis, concentrating on the simplest, most common problem: estimating the mean and variance.

Let X_1, X_2, ..., X_n be n IID samples from a distribution F(x) with mean μ and variance σ^2. Let \hat{μ} denote an estimate of μ. The most common estimate of μ is the sample mean:

\hat{μ} = \bar{X}_n = (1/n) \sum_{i=1}^{n} X_i

E[\hat{μ}] = E[\bar{X}_n] = μ

Var[\hat{μ}] = Var[\bar{X}_n] = σ^2/n

Thus, the sample mean is an unbiased estimator of μ. It is also a consistent estimator of μ since it is unbiased and its variance goes to 0 as n → ∞. The combination of these two properties is known as the weak law of large numbers.

Comment 10.2: The weak law of large numbers says the sample mean of n IID observations, each with finite variance, converges to the underlying mean, μ. For any ε > 0,

lim_{n→∞} Pr[|\bar{X}_n - μ| > ε] = 0

As one might suspect, there is also a strong law of large numbers. It says the same basic thing (i.e., the sample mean converges to μ), but the mathematical context is stronger:

Pr[ lim_{n→∞} \bar{X}_n = μ ] = 1


The reasons why the strong law is stronger than the weak law are beyond this text, but are often covered in advanced texts. Nevertheless, in practice, both laws indicate that if the data are liD with finite variance, then the sample average converges to the mean.

The sample mean can be derived as the estimate that minimizes the squared error. Consider the following minimization problem:

min_a Q(a) = \sum_{i=1}^{n} (X_i - a)^2

Differentiate Q(a) with respect to a, set the derivative to 0, and solve for a, replacing a by \hat{a}:

0 = (d/da) Q(a) |_{a=\hat{a}} = -2 \sum_{i=1}^{n} (X_i - \hat{a}) = -2( \sum_{i=1}^{n} X_i - n\hat{a} )

\hat{a} = (1/n) \sum_{i=1}^{n} X_i = \bar{X}_n

We see the sample mean is the value that minimizes the squared error; that is, the sample mean is the best constant approximating the X_i, where "best" means minimal squared error.

Estimating the variance is a bit tricky. Let \hat{σ}^2 denote our estimator of σ^2. First, we assume the mean is known and let the estimate be the obvious sample mean of the squared errors:

\hat{σ}^2 = (1/n) \sum_{k=1}^{n} (X_k - μ)^2

E[\hat{σ}^2] = (1/n) \sum_{k=1}^{n} E[(X_k - μ)^2] = σ^2

So, \hat{σ}^2 is an unbiased estimator of σ^2. In this estimation, we assumed that the mean, but not the variance, was known. This can happen, but often both are unknown and need to be estimated. The obvious generalization is to replace μ by its estimate \bar{X}_n, but this leads to a complication:

E[ \sum_{k=1}^{n} (X_k - \bar{X}_n)^2 ] = (n-1)σ^2     (10.6)

An unbiased estimator of the variance is

\hat{σ}^2 = (1/(n-1)) \sum_{k=1}^{n} (X_k - \bar{X}_n)^2     (10.7)

In the statistics literature, this estimate is commonly denoted S^2, or sometimes S_n^2:

S_n^2 = (1/(n-1)) \sum_{k=1}^{n} (X_k - \bar{X}_n)^2

E[S_n^2] = σ^2


It is worth emphasizing: the unbiased estimate of variance divides the sample squared error by n- 1, not by n. As we shall see in Section 10.10, the maximum likelihood estimate of variance is a biased estimate as it divides by n. In practice, the unbiased estimate (dividing by n- 1) is more commonly used.

10.3 RECURSIVE CALCULATION OF THE SAMPLE MEAN

In many engineering applications, a new observation arrives each discrete time step, and the sample mean needs to be updated for each new observation. In this section, we consider recursive algorithms for computing the sample mean.

Let X_1, X_2, ..., X_{n-1} be the first n-1 samples. The sample mean after these n-1 samples is

\bar{X}_{n-1} = (1/(n-1)) \sum_{k=1}^{n-1} X_k

At time n, a new sample arrives, and a new sample mean is computed:

\bar{X}_n = (1/n) \sum_{k=1}^{n} X_k

When n gets large, this approach is wasteful. Surely, all that work in computing \bar{X}_{n-1} can be useful to simplify computing \bar{X}_n. We present two approaches for reducing the computation.

For the first recursive approach, define the sum of the observations as T_n = X_1 + X_2 + ... + X_n. Then, T_n can be computed recursively:

T_n = T_{n-1} + X_n

A recursive computation uses previous values of the "same" quantity to compute the new values of the quantity. In this case, the previous value T n-l is used to compute T n· EXAMPLE 10.2

The classic example of a recursive function is the Fibonacci sequence. Let f(n) be the nth Fibonacci number. Then,

f(0) = 0
f(1) = 1
f(n) = f(n-1) + f(n-2)   for n = 2, 3, ...

The recurrence relation can be solved, yielding the sequence of numbers f(n) = 0, 1, 1, 2, 3, 5, 8, 13, ....


The sample average can be computed easily from T_n:

\bar{X}_n = T_n / n

The algorithm is simple: at time n, compute T_n = T_{n-1} + X_n and divide by n, \bar{X}_n = T_n/n.

EXAMPLE 10.3

Bowling leagues rate players by their average score. To update the averages, the leagues keep track of each bowler's total pins (the sum of his or her scores) and each bowler's number of games. The average is computed by dividing the total pins by the number of games bowled. Note that the recursive algorithm does not need to keep track of all the samples. It only needs to keep track of two quantities, T and n. Regardless of how large n becomes, only these two values need to be stored. For instance, assume a bowler has scored 130, 180, 200, and 145 pins. Then,

T_0 = 0
T_1 = T_0 + 130 = 0 + 130 = 130
T_2 = T_1 + 180 = 130 + 180 = 310
T_3 = T_2 + 200 = 310 + 200 = 510
T_4 = T_3 + 145 = 510 + 145 = 655

The sequence of sample averages is

\bar{X}_1 = 130/1 = 130,  \bar{X}_2 = 310/2 = 155,  \bar{X}_3 = 510/3 = 170,  \bar{X}_4 = 655/4 = 163.75

The second recursive approach develops the sample mean in a predictor-corrector fashion. Consider the following:

\bar{X}_n = (X_1 + X_2 + ... + X_n)/n
          = (X_1 + X_2 + ... + X_{n-1})/n + X_n/n
          = ((n-1)/n) · (X_1 + X_2 + ... + X_{n-1})/(n-1) + X_n/n
          = ((n-1)/n) · \bar{X}_{n-1} + X_n/n
          = \bar{X}_{n-1} + (1/n)(X_n - \bar{X}_{n-1})
          = prediction + gain · innovation     (10.8)

In words, the new estimate equals the old estimate (the prediction) plus a correction term. The correction is a product of a gain (1/n) times an innovation (X_n - \bar{X}_{n-1}). The innovation is what is new about the latest observation. The new estimate is larger than the old if the new observation is larger than the old estimate, and vice versa.
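Equation (10.8) leads to a one-pass update that never stores the running sum. A sketch (Python for illustration; running_mean is our name), applied to the bowling scores of Example 10.3:

def running_mean(samples):
    """Yield the sample mean after each observation using Equation (10.8)."""
    mean = 0.0
    for n, x in enumerate(samples, start=1):
        mean += (x - mean) / n            # prediction + gain * innovation
        yield mean

print(list(running_mean([130, 180, 200, 145])))   # [130.0, 155.0, 170.0, 163.75]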

EXAMPLE 10.4

The predictor-corrector form is useful in hardware implementations. As n gets large, T might increase toward infinity (or minus infinity). The number of bits needed to represent T can get large—too large for many microprocessors. The predictor-corrector form is usually better behaved. For example, assume the X_k are IID uniform on 0, 1, ..., 255. The X_k are eight-bit numbers with an average value of 127.5. When n = 1024, 0 ≤ T_1024 ≤ 255 × 1024 = 261,120. It takes 18 bits to represent these numbers. When n = 10^6, it takes 28 bits to represent T. The sample average is between 0 and 255 (since all the X_k are between 0 and 255). The innovation term, X_n - \bar{X}_{n-1}, is between -255 and 255. Thus, the innovation requires nine bits, regardless of how large n becomes.

Comment 10.3: Getting a million samples sounds like a lot, and for many statistics

problems, it is. For many signal processing problems, however, it is not. For example, CD-quality audio is sampled at 44,100 samples per second. At this rate, a million samples are collected every 23 seconds!

10.4 EXPONENTIAL WEIGHTING

In some applications, the mean can be considered quasi-constant: the mean varies in time, but slowly in comparison to the rate at which samples arrive. One approach is to employ exponential weighting. Let a be a weighting factor between 0 and 1 (a typical value is 0.95). Then,

\hat{μ}_n = (X_n + a X_{n-1} + a^2 X_{n-2} + ... + a^{n-1} X_1) / (1 + a + a^2 + ... + a^{n-1})

Notice how the importance of old observations decreases exponentially. We can calculate the exponentially weighted estimate recursively in two ways:

S_n = X_n + a S_{n-1}
W_n = 1 + a W_{n-1}
\hat{μ}_n = S_n / W_n

In the limit, as n → ∞, W_n simplifies considerably: W_n → 1/(1-a). With this approximation, the expression for \hat{μ}_n becomes

\hat{μ}_n ≈ \hat{μ}_{n-1} + (1-a)(X_n - \hat{μ}_{n-1}) = a \hat{μ}_{n-1} + (1-a) X_n

This form is useful when processing power is limited.

274

CHAPTER 10 ELEMENTS OF STATISTICS

Comment 10.4: It is an engineering observation that many real-life problems benefit

from some exponential weighting (e.g., 0.95 the underlying mean should be constant.

~a~

0.99) even though theory indicates

10.5 ORDER STATISTICS AND ROBUST ESTIMATES In some applications, the observations may be corrupted by outliers, which are values that are inordinately large (positive or negative). The sample mean, since it weights all samples equally, can be corrupted by the outliers. Even one bad sample can cause the sample average to deviate substantially from its correct value. Similarly, estimates of the variance that square the data values are especially sensitive to outliers. Estimators that are insensitive to these outliers are termed robust. A large class of robust estimators are based on the order statistics. Let Xi> X 2 , ... ,Xn be a sequence of random variables. Typically, the random variables are liD, but that is not necessary. The order statistics are the sorted data values:

The sorted values are ordered from low to high:

The minimum value is X (I), and the maximum value is X (n ) . The median is the middle value. If n is odd, the median isX((n+l l/ 2); if n is even, the median is usually defined to be the average of the two middle values, (X(n/2) + X (n/2+1))/2. For instance, if the data values are 5, 4, 3, 8, 10, and l. The sorted values are l, 3, 4, 5, 8, and 10. The minimum value is l, the maximum is 10, and the median is (4 + 5)/2 = 4.5. Probabilities of order statistics are relatively easy to calculate if the original data are liD. First, note that even for liD data, the order statistics are neither independent nor identically distributed. They are dependent because of the ordering; that is, Xo l ~X ,2l, etc. For instance, x,2l cannot be smaller than X(l). The order statistics are not identically distributed for the same reason. We calculate the distribution functions of order statistics using binomial probabilities. The event {X(kl ~ x} is the event that at least k of the n random variables are less than or equal to x. The probability anyone of the X; is less than or equal tox is F(x); that is, p = F(x). Thus,

Pr[X(kl

~ x] =f. (~)pio- p)n-J =f. (~)Fi(x)(l- F(x))n-J j =k

1

j =k

1

(10.9)

10.5 Order Statistics and Robust Estimates

275

For instance, the probability the first-order statistic (the minimum value) is less than or equal tox is Pr[X(I) :5 x]

=f. (~)pi 1

(1- p)n-J

j=l

= 1- (1-p)n

(sum is 1 minus the missing term)

= 1- (1-FCxlr

The distribution function of the maximum is

The median is sometimes used as an alternative to the mean for an estimate oflocation. The median is the middle of the data, half below and half above. Therefore, it has the advantage that even with almost half the data being outliers, the estimate is still well behaved. Another robust location estimate is the a-trimmed mean, which is formed by dropping the an smallest and largest order statistics and then taking the average of the rest: 1

n-an

p - - - ' x k( l a- n- 2an L.. k=an

The a-trimmed mean can tolerate up to an outliers. A robust estimate of spread is the interquartile distance (sometimes called the interquartile range):

u = c(X(3 nt4l - X (nt4J ) where cis a constant chosen so that E [ u] = a. (Note the interquartile distance is an estimator of a, not a 2 .) When the data are approximately Gaussian, c = 1.349. Using the Gaussian quantile function, c is calculated as follows:

=

c Q(0.75)- Q(0.25)

=0.6745- (-0.6745) =1.349

We used the following reasoning: If X;- N(O, 1), then Q(0.75) = - 1 (3/4) = 0.6745 gives the average value of the (3nl 4)-order statistic, X (3 n /4l · Similarly, Q(0.25) gives the average value of the (n/4)-order statistic. The difference between these two gives the average value of the interquartile range. As an example, consider the 11 data points -1.91, -0.62, -0.46, -0.44, -0.18, -0.16, -0.07, 0.33, 0.75, 1.60, 4.00. The first 10 are samples from an N(O, 1) distribution; the 11th sample is an outlier. The sample mean and sample variance of these 11 values are _ 1 II Xn = - LX;= 0.26 11 k= l 1 II s~ = cxk- o.26l 2 = 2.3o n -1 k= l

--I:

276

CHAPTER 10 ELEMENTS OF STATISTICS

Neither of these results is particularly close to the correct values, 0 and 1. The median is X

1= l

l= n-k+ l

10.11 Minimum Mean Squared Error Estimation l(/1,) = log(L(/1,)) = (n- k)loglt- lt(t1

0=

d~ l(/1,) I

A

-'=A

/1,

= n~k -

A

289

+ t2 + · · · + tn-kl- klttmax

Ct! + t2 + ... + tn-kl- ktmax

n-k

A

A=--------:-tl + t2 + · · · + tn-k + ktmax As the examples above indicate, maximum likelihood estimation is especially popular in parameter estimation. While statisticians have devised pathological examples where it can be bad, in practice the MLE is almost always a good estimate. The MLE is almost always well behaved and converges to the true parameter value as n ~ =·

10.11 MINIMUM MEAN SQUARED ERROR ESTIMATION This section discusses the most popular estimation rule, especially in engineering and scientific applications: choose the estimator that minimizes the mean squared error. Let 8 represent the unknown value or values to be estimated. For instance, 8 might be Jl if the mean is unknown, or (p,a 2 ) if both the mean and variance are unknown and to be estimated. The estimate of is denoted either 8 or {j, depending on whether or not the estimate depends on random variables (i.e., random observations). If it does, we denote the estimate 8. (We also use 8 when referring to general estimates.) If it does not depend on any random observations, we denote the estimate as 2 The error is 8- the squared error is (8- 8) , and the mean squared error (MSE) is

e

e.

e,

E [ (8 - 8)

2

].

The value that minimizes the MSE is the MMSE estimate:

8 -e =error

(8 - e) 2 = squared error E [ (8 - 8 = mean squared error

n

mjnE[(8()

en=

minimum mean squared error

We start here with the simplest example, which is to estimate a random variable by a constant, and we progress to more complicated estimators. Let the random variable be Y and the estimate be {j (since {j is a constant, it is not a random variable). Then, (10.18)

Q is a function of {j. It measures the expected loss as a function of setting its first derivative to 0:

e and is minimized by

290

CHAPTER 10 ELEMENTS OF STATISTICS

which implies the optimal estimate is the mean, {J = E [ Y ]. The Q function is known as the mean squared error (MSE) and the estimator as the minimum mean squared error (MMSE) or least mean squared error (LMSE) estimator. As another example, let a random variable Y be estimated by a functiong(X} of another random variable, X. Using Equation (8.7), the MSE can be written as

The conditional expected value inside the integral can be minimized separately for each value of x (a sum of positive terms is minimized when each term is minimized). Letting =gCX),

e

2

0= !Erlx[CY-8) iX=xJI

8

=

0

= 2Erlx[CY -OJ IX =x] which implies that the estimate of Y is the conditional mean of Y given X = x:

For instance, if Y represents the weight and X the height of a randomly selected person, then O(x) = Er 1x[YIX = x] is the average weight of a person given the height, x. One fully expects that the average weight of people who are five feet tall would be different from the average weight of those who are six feet tall. This is what the conditional mean tells us. In general, O(x) = Er 1x[YIX = x] is nonlinear in x. (It is unlikely that people who are six feet tall are exactly 20% heavier than those who are five feet tall, or that seven-footers are exactly 40% heavier than five-footers). Sometimes it is desirable to find the best linear function of X that estimates Y. Let iT =aX+ b with a and b to be determined. Then, minQ(a,b; Y,X) = minE[CY -aX- bl 2 ] a,b

a,b

(Technically, this is an affine function because linearity requires b = 0, but it is often referred to as linear even though it is not.) This minimization requires setting two derivatives to 0 and solving the two equations: o=i.E[CY-aX-bl 2 JI

aa

0=i.E[CY-aX-b) 2 jl

ab

__ =2E[XY-ax 2 -bx]

(10.19)

. =2E[Y-ax-bj

(10.20)

a=a,b=b

_

a=a,b=b

The simplest way to solve these equations is to multiply the second byE [X] and subtract the result from the first, obtaining

Solving for a, ,

a-

E[XY]-E[X]E[Y] E[X2 ]-E[Xj 2

-

Cov[X,Yj Var[X]

a xy

-- a~

(10.21)

10.12 BayesianEstimation The value for

291

a can then be substituted into Equation (10.20): (10.22)

It is convenient to introduce normalized quantities, and letting p be the correlation coefficient, p =a xy l (a xa y), (10.23) The term in parentheses on the left is the normalized estimate of Y, and the one on the right is the normalized deviation of the observed value X. Normalizing estimates like this is common because it eliminates scale differences between X and Y. For instance, it is possible that X and Y have different units. The normalized quantities are dimensionless. Note the important role p plays in the estimate. Recall that p is a measure of how closely related X and Yare. When p is close to 1 or -1, the size of the expected normalized deviation of Y is almost the same as the observed deviation of X. However, when p is close to 0, X and Yare unrelated, and the observation X is of little use in estimating Y.

Comment 10.8: Equation (10.23), while useful in many problems, has also been misinterpreted . For instance, in the middle of the 20th century, this equation was used to estimate a son's intelligence quotient (IQ) score from the father's IQ score. The conclusion was that over time, the population was becoming more average (since p< 1). The fallacy of this reasoning is illustrated by switching the roles offather and son and then predicting backward in time. In this case, the fathers are predicted to be more average than their sons. What actually happens is that the best prediction is more average than the observation, but this does not mean the random variable Y is more average than the random variable X. The population statistics do not necessarily change with time. (In fact, raw scores on IQ tests have risen over time, reflecting that people have gotten either smarter or better at taking standardized tests. The published IQ score is normalized to eliminate this rise.) Why not mothers and daughters, you might ask? At the time, the military gave IQ tests to recruits, almost all of whom were male . As a result, many more IQ results were available for males than females.

10.12 BAYESIAN ESTIMATION

e

In traditional estimation, is assumed to be an unknown but nonrandom quantity. In contrast, in Bayesian estimation, the unknown quantity is assumed to be a random value (or values) with a known density (or PMF) j(8).

e

292

CHAPTER 10 ELEMENTS OF STATISTICS

Bayesian estimation is like minimum mean squared error estimation except that the unknown parameter or parameters are considered to be random themselves with a known probability distribution. Bayesian estimation is growing in popularity among engineers and statisticians, and in this section, we present two simple examples, the first estimating the probability of a binomial random variable and the second estimating the mean of a Gaussian random variable with known variance. Let 8 represent the unknown random parameters, and 1etj(8) be the a priori probability density of 8. Let the observations be x with conditional density j(xi8), and let j(x) denote the density of x. (We use the convention in Bayes estimation of dropping the subscripts on the various densities.) Using Bayes theorem, the a posteriori probability can be written as

f

8

x _ j(xi8)j(8) _ j(xi8)j(8) j(x) - f:JCx iv)j(v) dv

( I )-

(10.24)

Perhaps surprisingly,f(x) plays a relatively small role in Bayesian estimation. It is a constant as far as 8 goes, and it is needed to normalize j(8 ix) so the integral is 1. Otherwise, however, it is generally unimportant. The minimum mean squared estimator of 8 is the conditional mean: (10.25)

In principle, computing this estimate is an exercise in integral calculus. In practice, however, the computation often falls to one of two extremes: either the computation is easy (can be done in closed form), or it is so difficult that numerical techniques must be employed. The computation is easy if the j(8) is chosen as the conjugate distribution to j(xi8). We give two examples below. For the first example, let 8 represent the unknown probability of a 1 in a series of n Bernoulli trials, and letx = k equal the number of l's observed (and n- k the number ofO's). The conditional density j(ki8) is the usual binomial PMF. Thus,

The conjugate density to the binomial is the beta density. The beta density has the form j(8) = rca+ {3) ea-l (1- 8)/l-l for 0 < e < 1 f(a)f({J)

(10.26)

where the Gamma function is defined in Equation (9.26) and a~ 0 and f3 ~are nonnegative parameters. The beta density is a generalization of the uniform density. When a = f3 = 1, the beta density equals the uniform density. The density is symmetric about 0.5 whenever a = {3. Typical values of a and f3 are 0.5 and 1.0. When a = f3 = 0.5, the density is peaked at the edges; when a = f3 = 1.0, the density is uniform for 0 :58 :5 1. The mean of the beta density is al(a + {3).

10.12 Bayesian Estimation

293

The magic of the conjugate density is that the a posteriori density is also a beta density:

j(8ik) = j(ki8)j(8) j(k) = _1_ (n)8k(l- 8)n-k f(a + f:ll 8a-l (1- 8){3-1 j(k) k f(a)f({:l) =

(-1-

(n) f(a + f:ll ) . 8k+a-l (1- 8)n-k+{J-I j(k) k f(a)f({:l)

The term in the large parentheses is a normalizing constant independent of 8 (it depends on a, f:l, n, and k, but not on 8). It can be shown to equal f(a + f:l + nl/(rca + k)f({:l + n- k) ). Thus, the a posteriori density can be written as

j(8ik)=

f(a+f:l+n) 8k+a-l(l-8)"-k+{J-I f(a+k)f({:l+n-k)

which we recognize as a beta density with parameters k + a and n - k + {:!. The estimate is the conditional mean and is therefore

a+k 8 = E[8 IX= k] = -,-----a+f:l+n A

(10.27)

Thus, see the Bayes estimate has a familiar form: If a = f:l = 0, the estimate reduces to the sample mean, kl n. With nonzero a and f:l, the estimate is biased toward the a priori estimate, a I (a + f:ll. EXAMPLE 10.11

In Chapter 5, we considered the problem of encoding an liD binary sequence with a known probability of a 1 equal top and developed the optimal Huffman code. In many situations, however, pis unknown and must be estimated. A common technique is to use the Bayesian sequential estimator described above. The (n + 1)'st bit is encoded with a probability estimate determined from the first n bits. Let be the estimated probability of the (n + 1)'st bit. It is a function of the first n bits. Common values for a and f:l are a = f:l = 1. Assume, for example, the input sequence is 0110 · · ·. The first bit is estimated using the a priori estimate:

Pn

A

a+O 1+0 ----05 -a+f:l+0-1+1+0- ·

p0

For the second bit, n = 1 and k = 0, and the estimate is updated as follows: A

P1 =

a+O a+f:l+1

1

=- =0.33 3

For the third bit, n = 2 and k = 1, so 2 - a+ 1 ---05 P2 -a+f:l+2_4_ · A

294

CHAPTER 10 ELEMENTS OF STATISTICS

For the fourth bit, n = 3 and k = 2, and 3 - a+2 ---06 P3 -a+f:l+3_5_ · A

This process can continue forever, each time updating the probability estimate with the new information. In practice, the probability update is often written in a predictor-corrector form. For the second example, consider estimating the mean of a Gaussian distribution with known variance. The conjugate distribution to the Gaussian with unknown mean and known variance is the Gaussian distribution, 8- N(jlo,a~). (The conjugate distribution is different if the variance is also unknown.) The a posteriori density is X _

f (8 I ) -

j(xj8)j(8) j(x)

1

1

2

(

(x- 8) )

(

(x- 8)

1

(

= j(x) v'2iia exp -~ v'2iiao exp 1

2

= 2naaof(x) exp -~-

(8 - jlo)

2a~

2

2

(8 -j1 0 ) )

2a~

)

With much tedious algebra, this density can be shown to be a Gaussian density on 8 with mean (xa~ + lloa 2 )! (a 2 +a~) and variance (a- 2 +a 02 )- 1 . The Bayesian estimate is therefore (10.28) Thus, we see the estimate is a weighted combination of the observation x and the a priori value jlo. If multiple observations of X are made, we can replace X by the sample mean, Xn, and use its variance, a 2 In, to generalize the result as follows: (10.29) As more observations are made (i.e., as n ~=),the estimate puts increasing weight on the observations and less on the a priori value. In both examples, the Bayesian estimate is a linear combination of the observation and the a priori estimate. Both estimates have an easy sequential interpretation. Before any data are observed, the estimate is the a priori value, either a/ (a+ f:ll for the binomial or llo for the Gaussian. As data are observed, the estimate is updated using Equations (10.27) and (10.29). Both these estimates are commonly employed in engineering applications when data arrive sequentially and estimates are updated with each new observation. We mentioned above that the computation in Equation (10.25) can be hard, especially if conjugate densities are not used (or do not exist). Many numerical techniques have been developed, the most popular of which is the Markov chain Monte Carlo, but study of these is beyond this text.

Problems

295

Comment 10.9: Bayesian estimation is controversial in traditional statistics. Many

statisticians object to the idea that(} is random and furthermore argue the a priori distribution f({}) represents the statistician's biases and should be avoided. The counterargument made by many engineers and Bayesian statisticians is that a random (} is perfectly reasonable in many applications. They also argue that the a priori distribution represents the engineer's or statistician's prior knowledge gained from past experience. Regardless of the philosophical debate, Bayesian estimation is growing in popularity. It is especially handy in sequential estimation, when the estimate is updated with each new observation.

10.1

List several practical (nonstatistical) reasons why election polls can be misleading.

10.2

Ten data samples are 1.47, 2.08, 3.77, 1.01, 0.42, 0.77, 3.17, 2.89, 2.42, and -0.65. Compute the following: a. the sample mean b. the sample variance c. the sample distribution function d. a parametric estimate of the density, assuming the data are Gaussian with the sample mean and variance

10.3

Ten data samples are 1.44, 7.62, 15.80, 14.14, 3.54, 11.76, 14.40, 12.33, 7.08, and 3.40. Compute the following: a. the sample mean b. the sample variance c. the sample distribution function d. a parametric estimate of the density, assuming the data are exponential with the sample mean

10.4

Generate 100 samples from an N(O, 1) distribution. a. Compute the sample mean and variance. b. Plot the sample distribution function against a Gaussian distribution function. c. Plot the parametric estimate of the density using the sample mean and variance against the actual Gaussian density.

10.5

Generate 100 samples from an exponential distribution with A.= 1. a. Compute the sample mean and variance. b. Plot the sample distribution function against an exponential distribution function. c. Plot the parametric estimate of the density using the sample mean against the actual exponential density.

296

CHAPTER 10 ELEMENTS OF STATISTICS 10.6

The samples below are liD from one of the three distributions: N(Jl,a 2 ), exponential with parameter A, or uniform U(a,b) where b >a. For each data set, determine which distribution best describes the data (including values for the unknown parameters). Justify your answers. a. X;

= [0.30, 0.48, -0.24, -0.04, 0.023, -0.37, -0.18, -0.02, 0.47,0.46]

b. X;= [2.41,2.41,5.01,1.97,3.53,3.14,4.74,3.03,2.02,4.01] c. X;= [1.87,2.13, 1.82,0.07,0.22, 1.43,0.74, 1.20,0.61, 1.38]

d. X; 10.7

= [0.46, -0.61, -0.95, 1.98, -0.13, 1.57, 1.01, 1.44,2.68, -3.31]

The samples below are liD from one of the three distributions: N(Jl,a 2 ), exponential with parameter A, or uniform U(a,b) where b >a. For each data set, determine which distribution best describes the data (including values for the unknown parameters). Justify your answers. a. X;= [1.50,0.52,0.88,0.18, 1.24,0.41,0.32,0.14,0.23,0.96] b. X;= [3.82,4.39,4.75,0.74,3.08,1.48,3.60,1.45,2.71,-1.75] c. X;

d. X;

= [-0.18, -0.22, 0.45, -0.04,0.32, 0.38, -0.48, -0.01, -0.45, -0.23] = [-0.85, 0.62, -0.22, -0.69, -1.94,0.80, -1.23, -0.03,0.0 1, -1.28]

10.8

Let X; for i = 1,2, ... ,n be a sequence of independent, but not necessarily identically distributed, random variables with common mean E[Xi] = Jl and different variances Var[Xi] = af. Let T = 2:7= 1 a;X; be an estimator of Jl. Use Lagrange multipliers to minimize the variance ofT subject to the constraint E [ T] = Jl. What are the resulting a;? (This problem shows how to combine observations of the same quantity, where the observations have different accuracies.)

10.9

Show Equation (10.6), which is an excellent exercise in the algebra of expected values. A crucial step is the expected value of X;Xr

i=j it'-} 10.10

The exponential weighting in Section 10.4 can be interpreted as a lowpass filter operating on a discrete time sequence. Compute the frequency response of the filter, and plot its magnitude and phase.

10.11

Let X; ben liD U(O,l) random variables. What are the mean and variance of the minimum-order and maximum-order statistics?

10.12

Let X; be n liD U(O, 1) random variables. Plot, on the same axes, the distribution functions of X, ofX( 1l, and ofX(n)·

10.13

Let X 1, Xz, .. . ,Xn be n independent, but not necessarily identically distributed, random variables. Let X(k;m) denote the kth-order statistic taken from the first m random variables. Then, Pr[X(k;n) ~x] =Pr[X(k-1;n-1) ~x]Pr[Xn ~x] +Pr[X(k;n-1) ~x]Pr[Xn >x] a. Justify the recursion above. Why is it true? b. Rewrite the recursion using distribution functions.

Problems

297

c. Recursive calculations need boundary conditions. What are the boundary conditions for the recursion? IO.I4

Write a short program using the recursion in Problem 10.13 to calculate order statistic distributions. a. Use your program to calculate the distribution function of the median of n = 9 Gaussian N(O, 1) random variables. b. Compare the computer calculation in part a, above, to Equation (10.9). For example, show the recursion calculates the same values as Equation (10.9).

IO.I5

Use the program you developed in Problem (10.14) to compute the mean of the median-order statistic of n = 5 random variables: a. X;-N(O,l)fori=1,2, ... ,5. b. X; -N(O,l) fori= 1,2, ... ,4andXs -N(l,l). c. X;- N(O, 1) fori= 1,2,3 and X;- N(l, 1) fori= 4,5.

IO.I6

IfX 1 ,X2 , .. . ,Xn are n liD continuous random variables, then the density of the kth-order statistic can be written as

fx ckl (x)dx=(k

n k)pk-i(x)-j(x)dx·(l-F(x))"-k -1,1,n-

a. Justify the formula above. Why is it true? b. Use the formula to compute and plot the density of the median of n = 5 N (0, 1) random variables. IO.I7

Repeat Example 10.1 0, but using the parameterization of the exponential distribution in Equation (8.14). In other words, what is the censored estimate of J.L?

IO.IS

Assuming X is continuous, what value of x maximizes the variance of distribution function estimate F(x) (Equation 10.10)?

IO.I9

Write a short computer function to compute a KDE using a Gaussian kernel and Silverman's rule for h. Your program should take two inputs: a sequence of observations and a sequence of target values. It should output a sequence of density estimates, one density estimate for each value of the target sequence. Test your program by reproducing Figure 10.3.

I0.20

Use the KDE function in Problem 10.19 to compute a KDE of the density for the data in Problem 10.2.

I0.2I

Use the values for a (Equation 10.21) and b (Equation 10.22) to derive Equation (10.23).

I 0.22

Repeat the steps ofExample 10.10 for the alternative parameterization of the exponential fr(t) = (llto)e-tlto to find the MLE of to.

I0.23

Let X; be liD U(0,8), where 8 > 0 is an unknown parameter that is to be estimated. What is the MLE of 8?

I 0.24

Repeat the calculation of the sequence of probability estimates of p in Example 10.11 using a = fJ = 0.5.

CHAPTER

GAUSSIAN RANDOM VECTORS AND LINEAR REGRESSION

Multiple Gaussian random variables are best dealt with as random vectors. Many of the properties of Gaussian random vectors become the properties of vectors and matrices. This chapter also introduces linear regression, a common technique for estimating the parameters in linear models.

11.1 GAUSSIAN RANDOM VECTORS Multiple Gaussian random variables occur in many applications. The easiest way to manipulate multiple random variables is to introduce random vectors. This section begins by discussing multiple Gaussian random variables, then introduces random vectors and concludes with some properties of Gaussian random vectors. Let X, Y, and Z be independent Gaussian random variables. Then, the joint distribution function is the product of the individual densities:

{Y :5 y} n {Z :5 z}] = Pr[X :5 x]· Pr[Y :5 y]· Pr[Z :5 zj

FXYz(x,y,z) = Pr[ {X :5 x) n

= Fx(x)Fy(y)Fz(z)

The joint density is

fxyz(x,y,z) =

aaa

ax ay & FXYz(x,y,z) d

d

d

= dxFx(x)dJFy(y)"J;Fz(z)

298

(by independence)

11.1 Gaussian Random Vectors

299

= fx(x)jy(y)jz(z) 2

2

2

( (x -p.xl (y -p.y) (z -p.zl ) = axayaz(2n)312 exp -~-~-~ 1

These expressions rapidly get unwieldy as the number of random variables grows. To simplify the notation-and improve the understanding-it is easier to use random vectors. We use three different notations for vectors and matrices, depending on the situation. From simplest to most complicated, a vector can be represented as

and a matrix as au a21

A= [a;jl =

:

( ani

A random vector is a vector, represented as

x,

whose components are random variables and can be

where each X; is a random variable.

Comment 11.1: In this chapter, we deviate slightly from our usual practice of writing random variables as bold-italic uppercase letters. We write random vectors as bold-italic lowercase letters adorned with the vector symbol (the small arrow on top of the letter) . This allows us to use lowercase letters for vectors and uppercase letters for matrices. We still use bold-italic uppercase letters for the components of random vectors, as these components are random variables.

Expected values of random vectors are defined componentwise:

300 CHAPTER 11 GAUSSIAN RANDOM VECTORS AND LINEAR REGRESSION

Correlations and covariances are defined in terms of matrices:

XXT =

[x,x, XzX1 .

XnXI

x,x.l XzXn

X1Xz XzXz

(11.1)

XnXz

...

XnXn

The autocorrelation matrix is

Rxx=E[xxT]

[E[x,x,J E[XzX!]

E[X1X2]

E[XzXz]

E[XzXn]

E[XnX!]

E[XnXz]

E[XnXn]

['"

r21

rnl

rl2 rzz rnz

E[x,x.[l

'rzn" l

(11.2)

rnn

Where possible, we drop the subscripts on R, Jl, and C (see below). The covariance matrix, C= Cxx> is (11.3)

Recall that a matrix A is symmetric if A =A r. It is nonnegative definite if xTAx~ 0 for all x f:. 0. It is positive definite ifXTAx> 0 for all x f:. 0. Since X;Xj = XjXi (ordinary multiplication commutes), r;j = rji· Thus, R is symmetric. Similarly, C is symmetric. R and C are also nonnegative definite. To see this, let a be an arbitrary nonzero vector. Then,

arRa = arE[:xxr]a =E[aTXiTaj = E[ carxlc:xYal]

=E[YTYj =E[Y

2

]

(letting Y = :xra)

(YT = Y since Y is a 1 x 1 matrix)

~0

In the above argument, Y = :xra is a 1 x 1 matrix (i.e., a scalar). Therefore, yT = Y and yT Y = Y 2 . The same argument shows that Cis nonnegative definite.

11.1 Gaussian Random Vectors

301

Sometimes, there are two different random vectors, x andy. The joint correlation and covariance-and subscripts are needed here-are the following: Rxy =E[xyTj Cxy = E [ (x- JlxHY- Py) T)] = Rxy - PxPy Rxy and Cxy are, in general, neither symmetric nor nonnegative definite. The determinant of a matrix is denoted ICI. If the covariance matrix is positive definite, then it is also invertible. If so, then we say x- N(P,, C) if _ jx(x) =

1

ex- P.lrc- cx- P.l) 1

(

exp - - - ' - - - - - ' - V(2n)"ICI 2

(11.4)

If Cis diagonal, the density simplifies considerably:

C=

[

~i

0 a~

. 0

0

C'=[T

lla~

0

The quadratic form in the exponent becomes

0 0

l[X!- Jlll

1/~~

The overall density then factors into a product of individual densities:

c-l

r. x = Jx

1

O"J0"2···anJC2n)"

~ ex; exp ( - L.. i=l 2ai2

Jl;l2)

X2- Jl2

Xn

~

Jln

302

CHAPTER 11 GAUSSIAN RANDOM VECTORS AND LINEAR REGRESSION

Thus, if C is diagonal, the X; are independent. This is an important result. It shows that if Gaussian random variables are uncorrelated, which means C is diagonal, then they are independent. In practice, it is difficult to verify that random variables are independent, but it is often easier to show the random variables are uncorrelated.

Comment 11.2: In Comment 8 .2, we pointed out that to show two random variables are independent, it is necessary to show the joint density factors for all values of x andy. When X and Yare jointly Gaussian, the process is simpler. All that needs to be done is to show X and Yare uncorrelated (i .e., that f(XY] = E[X]· E[Y]) . Normally, uncorrelated random variables are not necessarily independent, but for Gaussian random variables, uncorrelated means independent.

The MGF of a vector Gaussian random variable is useful for calculating moments and showing that linear operations on Gaussian random variables result in Gaussian random variables. First, recall the MGF of a single N(p,a 2) random variable is Equation (9.15):

.it(u) = E[ e"x] = exp (up+ a 2u212) If X 1 , X 2, ... ,Xn are liD N(p,a 2), then the MGF of the vector xis the product of the individual MGFs:

.it(u,' u2, ... 'Un) = E [e"'x' +u2X2 +···+unXn l = E[ e" 1x 1]E[ e"2x2]· · ·E[ e""x"]

(by independence)

=.it (u Jl..U Cu2l ···.it (un) = exp {u 1 p + ufa 212) · exp {u2p + u~a 2 12) · · · exp {UnJl + u~a 2 12) 2

= exp (Cu, + u2 + · · · + Un)Jl) · exp {Cuf + u~ + · · · + u~)a 12)

The last expression can be simplified with matrices and vectors:

.it(u) =E[e"rx ] =exp(urji+

a2~Tu)

(ll.S)

where u = [u 1 , u2, .. . , unlT and ji = [Jl,Jl, ... ,Jl]T. The MGF of the general case x- N(ji,C) is the same as Equation (ll.S), but with C replacing a 2 I:

..U(u) =exp(urji+ ur2cu)

(11.6)

In Table 11.1, we list several equivalents between formulas for one Gaussian random variance and formulas for Gaussian vectors.

11.2 Linear Operations on Gaussian Random Vectors

303

TABLE 11.1 Table of equivalents for one Gaussian random variable and n Gaussian random variables. 1 Gaussian

Random Variable

n Gaussian Random Variables

X-N(f.l,a 2 )

x-N(jl,C)

E [X]=f.l

E[x] =11 Cov[x] =C

Var [X] =a 2 2

(X-f.l) ) Jx(x) = -1- exp ( - V2na2 2a 2

Jx(x) = - -1- - ex (- (x-jl)Tc-'(X-jl))

.it (u) = exp ( Uf.l + a22u2 )

..it(u) = exp(uTjl + z 1.5, and choose fi = 0 if N < 1.5-calculate Pr[FP], Pr[TP], Pr[FN], and Pr[TN].

12.5

Consider a simple discrete hypothesis test. Under Ho, N is uniform on the integers from 0 to 5; under H1, N is binomial with parameters n = 5 and p. a. What is the likelihood ratio of this test? b. Calculate and plot the ROC curve assumingp = 0.5. c. Assuming the two hypotheses are equally likely, what is the MAP test?

12.6

Consider a simple hypothesis test. Under Ho, X is exponential with parameter A = 2; under H 1 ,X- U(O,l). a. What is the likelihood ratio of this test? b. Calculate and plot the ROC curve. c. Assuming the two hypotheses are equally likely, what is the MAP detector?

12.7

Consider a simple hypothesis test. Under Ho, X is exponential with parameter A= 1; under H1, X is exponential with parameter A= 2. a. What is the likelihood ratio of this test? b. Calculate and plot the ROC curve. c. Assuming the two hypotheses are equally likely, what is the MAP detector?

12.8

Considerasimplehypothesis test. Under H 0 ,X -N(O,a~); under H 1 ,X -N(O,afl with af >a~. a. What is the likelihood ratio of this test? b. What is the Neyman-Pearson likelihood ratio test? c. Calculate Pr[FP] and Pr[TP] as functions of the test. d. Plot the ROC curve assuming at = 2a~. e. Assuming the two hypotheses are equally likely, what is the MAP detector?

Problems 12.9

339

In any hypothesis test, it is possible to achieve Pr[TP] = Pr[FP] = p, where 0,; p,; 1, by inserting some randomness into the decision procedure. a. How is this possible? b. Draw the ROC curve for such a procedure. (Hint: the AUC is 0.5 for such a test; see Problem 12.2.)

12.10

For the hypothesis testing example shown in Figure 12.1 with following probabilities:

xy

= 1.5, calculate the

a. Pr[FP] b. Pr[TN] c. Pr[FN] d. Pr[TP] 12.11

Repeat the radar example, but with Laplacian rather than Gaussian noise. That is, assume Ho: X= s + N with s > 0 and H1 :X= N with N Laplacian with parameter A.. a. Compute and plot the log-likelihood ratio. b. Compute and plot the ROC curve. Note that since the log-likelihood ratio has three regimes, the ROC curve will have three regimes.

12.12

Consider a binary communications problemY =X +N, where Pr[X = 1] = Pr[X = -1] = 0.5, N- N(O,u 2 ), and X and N are independent. a. What is the MAP detector for this problem? b. What is the error rate for this detector? c. What happens to theMAPdetectorifPr[X = 1] =p > 0.5 and Pr[X = -1] = 1-p < 0.5? That is, how does the optimal decision rule change? d. What happens to the probability of error when p > 0.5? What is the limit asp- 1?

CHAPTER

RANDOM SIGNALS AND NOISE

We talk, we listen, we see. All the interesting signals we experience are random. In truth, nonrandom signals are uninteresting; they are completely predictable. Random signals are unknown and unpredictable and therefore much more interesting. If Alice wants to communicate information to Bob, then that information must be unknown to Bob (before the communication takes place). In other words, the signal is random to Bob. This chapter introduces random signals and presents their probabilistic and spectral properties. We pay particular attention to a class of random signals called wide sense stationary (WSS), as these appear in many engineering applications. We also discuss noise and linear filters.

13.1 INTRODUCTION TO RANDOM SIGNALS A random signal is a random function of time. Engineering has many examples of random signals. For instance, a spoken fricative (e.g., the "th'' or "f" sound) is made by passing turbulent air through a narrow opening such as between the teeth and lips. Fricatives are noise-like. Other examples are the random voltages in an electric circuit, temperatures in the atmosphere, photon counts in a pixel sensor, and radio signals. In this section, we give a quick introduction to random signals and the more general class, random processes. Let X(t) be a random process. Fix t as, say, t 0 . Then, X(t 0 ) is a random variable. Fix another time, t 1 . Then, X(t!l is another random variable. In general, let t1> t2, . .. , tn be times. Then, X(tJl, X(t2), . .. ,XCtnl are random variables. Letj(x; t) denote the densityofX(t). The joint densityofX(tJl andX(t2 ) is j(xl>x 2; t1> t 2). In general, the nth-order density is j(x 1 ,x2, ... ,xn; t 1 , t2, ... , tn). The mean of X(t) is p.(t): p.(t)

340

=E[XCtl] =

L:

xj(x;t)dx

13.2 A Simple Random Process

341

In the integral, the density is a function of time. Therefore, the mean is a function of time. Similarly, the second moment and variance are functions of time:

E[X2Ctl] =E[X(t)X(tl] =

L:

x 2j(x;t)dx

a 2 (t) =Var[XCtl] =E[X 2Ctl]-p.Ctl 2 The autocorrelation RxxCti> t 2) and autocovariance CxxCti> t 2) of a random process are

RxxCti> t2l = E[X(t1)X(t2)] CxxCtJ, t2l = E[ (XCt1l -p.Ct1l) (XCt2l -p.Ct2l)]

= E[XCt1)X(t2) j-p.(tJlp.Ct2l = RxxCt!, t2l -p.(tJlp.Ct2l

(13.1)

The autocorrelation and autocovariance are measures of how correlated the random process is at two different times. The autocorrelation and covariance functions are symmetric in their two arguments. The proof follows because multiplication commutes:

RxxCt!, t2l = E [XCtJlXCt2l]

= E [XCt2lXCtJl] = RxxCt2, t1)

(13.2)

CxxCtJ, t2l = RxxCtJ, t2l- E[XCt1l ]E[XCt2l]

= RxxCt2, t1)- E[XCt2l ]E[XCtJl] = CxxCt2, tJl

(13.3)

When it is clear, what random process is being referred to, we drop the subscripts on Rxx and Cxx:

R(tl, t2) = RxxCt!, t2) C(tl, t2) = CxxCti> t2) The cross-correlation between two random processes is

13.2 A SIMPLE RANDOM PROCESS To illustrate the calculations, we present a simple random process and compute its probabilities and moments. Let X(t) = Xo for -= < t £] :5

E[{x Ctl -.X 0

Since Xa (t) approximates Xa(t) in the mean squared sense, the probability that Xa (t) differs from X a (t) is 0 for all t.

Comment 13.10: While we cannot say Xh 0 (t) =X0 (t) for all t, we can say the probability they are the same is 1. This is a mathematical technicality that does not diminish the utility of the sampling theorem for random signals.

13.9.2 Example: Figure 13.4 Figure 13.4 shows an example of the sampling theorem applied to a random signal. The top graph shows a Gaussian WSS signal. The middle graph shows the signal sampled with sampling time T = 1. The bottom graph shows the sine reconstruction.

360

CHAPTER 13 RANDOM SIGNALS AND NOISE

X(n)

FIGURE 13.4 Sampling theorem example for a random signal. The top graph shows the original signal, the middle graph the sampled signal, and the bottom graph the reconstructed signal.

Since 11 T = 1, the highestfrequency in X(t) must be 0.5 orless. We chose 0.4. In practice, a little oversampling helps compensate for the three problems in sine reconstruction: The signal is only approximately bandlimited, the sine function must be discretized, and the sine function must be truncated. Even though the first graph shows a continuous signal, it is drawn by oversampling a discrete signal (in this case, oversampled by a factor of four) and "connecting the dots:' For 25 seconds, we generate 101 samples (25 ·4+ 1; the signal goes from t = 0 tot= 25, inclusive). The signal is generated with the following Matlab command:

x

=

randn(1,101);

Since the signal is oversampled by a factor of 4, the upper frequency must be adjusted to 0.4/4 = 0.1. A filter (just about any lowpass filter with a cutoffof0.1 would work) is designed by

b = firrcos(30,

0.1, 0.02,

1);

The signal is filtered and the central portion extracted:

xa xa

= =

conv(x, b); xa(30:130);

This signal is shown in the top graph (after connecting the dots). The sampled signal (middle graph) is created by zeroing the samples we do not want:

xn = xa; xn(2:4:end) xn(3:4:end) xn(4:4:end)

O· ' O· ' O· '

13.9 The Sampling Theorem for WSS Random Processes

361

For the reconstruction, the sine function must be discretized and truncated:

s

=

sinc(-5:0.25 : 5);

Note that the sine function is sampled at four samples per period, the same rate at which the signal is sampled. The reconstruction is created by convolving the sine function with the sampled signal:

xahat xahat

=

conv(xn,s); xahat(20 : 121);

13.9.3 Proof of the Random Sampling Theorem In this section, we present the basic steps of the proof of the sampling theorem for WSS random signals. The MSE can be expanded as

The first term simplifies to E[X~(t)] = R(O)

If XaCtl ~ Xa(t), then it must be true that E[XaCtlXaCtl] ~ R(O) and E[X!Ctl] ~ R(O). If these are true, then E[(Xa(tl- XaCtl)

2

]

= R(O)- 2R(O) + R(O) = 0

The crux of the proof is to show these two equalities, E[XaCtlXa(t)] ~ R(O) and E [.X! (t)] ~ R(O).

The proof of these equalities relies on several facts. First, note that the autocorrelation function, R(T), can be thought of as a deterministic bandlimited signal (the Fourier transform of R is bandlimited). Therefore, R(r) can be written in a sine expansion: 00

R(r)=

L

R(n)sinc(Cr-nD!T)

n= -oo

Second, R is symmetric, R(T) = R( -T). Third, the sine function is also symmetric, sinc(t) = sinc(-t). Consider the first equality:

n= -oo 00

=

L

E[XaCt)X(n)]sinc(Ct-nD!T)

n= -oo 00

=

L n= -oo

R(nT- t)sinc(Ct- nD!T)

(13.8)

362

CHAPTER 13 RANDOM SIGNALS AND NOISE

Let R'(nT) = R(nT- t). Since it is just a time-shifted version of R, R' is bandlimited (the Fourier transform of R' is phase-shifted, but the magnitude is the same as that of R). Then,

f:

E[XaCtliaCtl] =

R'CnDsinc(Ct-nD!T)

n = -oo

=R'(t)

(the sampling theorem)

=R(t-t)

(replaceR'(t) =R(t-t))

=R(O)

Now, look at the second equality: A2

00

00

L

E[XaCtl] =E[

X(n)sinc(Ct-nD!T)

n= -oo 00

X(m)sinc((t-mT)!T)]

m = -oo

00

L L

=

L

E[X(n)X(m)]sinc(Ct-nD!T)sinc(Ct-mD!T)

n= -oom = -oo 00

=

00

I: I:

R(mT-nDsinc(Ct-nD!T)sinc(Ct-mD!T)

n= -oom = -oo

f: ( f:

=

R(mT-nDsinc(Ct-mD!T))sinc(Ct-nD!T)

n= -oo m=-oo

Consider the expression in the large parentheses. As above, let R' (mD = R(mT- nD. Then, the expression is a sine expansion of R'(t): 2

00

E[iaCtl] =

L

R'(t)sinc((t-nT)!T)

n= -oo 00

=

L

R(t-nDsinc(Ct-nD!T)

n= -oo 00

=

L

(same as Equation 13.8)

R(nT-t)sinc(Ct-nD!T)

n= -oo

=R(O)

Thus, we have shown both equalities, which implies

In summary, the sampling theorem applies to bandlimited WSS random signals as well as bandlimited deterministic signals. In practice, the sampling theorem provides mathematical support for converting back and forth between continuous and discrete time.

A random signal (also known as a random process) is a random function oftime. Let j(x; t) denote the density of X(t). The joint density of X(t 1 ) and X(t 2 ) isj(x 1 ,x2 ; t 1 , t2 ). The first and second moments of X(t) are p.(t) =E[XCtl] =

L:

xj(x;t)dx

Summary

363

E[X2 Ctl] =E[X(t)X(tl] =I: x 2f(x;t)dx Var[XCt}] = E[X 2 (t) j-p.(t) 2 The autocorrelation RxxCti> tz) and autocovariance CxxCti> tz) of a random process are

RxxCtJ,tz) =E[X(ti)X(tz)] CxxCtl, tz) = E[ {XCtJl-p.(tl) ){X(tz) -p.(tz))] = RxxCtl, tz) -p.(tJlp.(tz)

The Fourier transform of x(t) is X(w):

X(w) =$(xCtl) =I: x(t)e-jwtdt The inverse Fourier transform is the inverse transform, which takes a frequency function and computes a time function:

A random process is wide sense stationary (WSS) if it satisfies two conditions: 1. The mean is constant in time; that is, E[XCt}] = p. =constant.

2. The autocorrelation depends only on the difference in the two times, not on both times separately; that is, R(t1, tz) = RxxCtz - t1). The power spectral density (PSD) SxxCw) of a WSS random process is the Fourier transform of the autocorrelation function:

S(w) = ${RCrl) =I: R(r)e-wr dr We have three ways to compute the average power of a WSS random process: average power= a 2 + p. 2 = R(O) = f,; j:C,S(w)dw. Let X(t) be a continuous time WSS random process that is input to a linear filter, h(t), and let the output be Y(t) = X(t) * h(t). The PSD of Y(t) is the PSD of X(t) multiplied by the filter magnitude squared, IH(w)l 2 • The three most important types of noise in typical electrical systems are quantization noise (uniform), Poisson (shot) noise, and Gaussian thermal noise. When the noises are uncorrelated, the variances add:

N=N1+Nz+···Nk Var[N] = Var[N 1 ] + Var[N 2 ] + · · · + Var[Nk]

White noise has a PSD that is constant for all frequencies, SxxCw) = N 0 for-=< w < =· The autocorrelation function, therefore, is a delta function:

364

CHAPTER 13 RANDOM SIGNALS AND NOISE

The amplitude modulation of a random sinusoid by a WSS signal Y(t) = A(t) cos(wct+8) is WSS, with PSD Syy(W) = (SAA(W -We) +SAA(W +wcl)12. The discrete time Wiener filter estimates S by a linear combination of delayed samples of X: A

S(n) =

p

L akX(n- k)

k=O

The optimal coefficients satisfy the Wiener-Hopf equations: p

O=RxsCll- LakRxxCl-k) k=O

forl=O,l, .. . ,p

The sampling theorem is the theoretical underpinning of using digital computers to process analog signals. It applies to WSS signals as well: Theorem 13.3: IfXaCtl is bandlimited toW Hz and sampled at rate liT> 2W, then XaCtl can be reconstructed from its samples, X(n) = XaCnD:

iaCtl =

oo

L

n=-oo

X(n)

sin(n(t-nDID n (t - nT) IT

=

oo

L

n=-oo

X(n)sinc(nCt-nDIT)

The reconstruction iaCtl approximates XaCtl in the mean squared sense:

E[(XaCtl-iauln =O

13.1

IfX(t) and Y(t) are independent WSS signals, is X(t)- Y(t) WSS?

13.2

A drunken random walk is a random process defined as follows: let S(n) = 0 for n ~ 0 and S(n) = X 1 + X 2 + · · · + Xn, where the X; are liD with probability Pr[X; = 1] = Pr[X; = -1] =0.5.

a. What are the mean and variance of S(n)? b. What is the autocorrelation function of S(n)? c. Is S(n) WSS? d. What does the CLT say about S(n) for large n? e. Use a numerical package to generate and plot an example sequence S(n) for n = 0,1, ... ,100. 13.3

Used to model price movements of financial instruments, the Gaussian random walk is a random process defined as follows: let S(n) = 0 for n ~ 0 and S(n) =X 1 +X2 + · ·· +Xn, where the X; are liD N(O,a 2 ). a. What are the mean and variance of S(n)? b. What is the autocorrelation function of S(n)?

Problems

365

c. Is S(n) WSS? d. Use a numerical package to generate and plot an example sequence S(n) for n 0,1, ... ,100. Use u 2 = 1.

=

13.4

What is the Fourier transform of x(t) sin(wetl?

13.5

The filtered signal PSD result in Equation (13.6) holds in discrete time as well as continuous time. Repeat the sequence in Section 13.5 with sums instead of integrals to show this.

13.6

Let X(t) be a WSS signal, and let(} - U(0,2n) be independent of X(t). Form Xe(t) X(t)cos(wt + 6) and X, (t) = X(t)sin(wt + 6).

=

a. Are Xe (t) and X, (t) WSS? If so, what are their autocorrelation functions?

b. What is the cross-correlation between Xe (t) and X, (t)? 13.7

Let X(t) be a Gaussian white noise with variance u 2 . It is filtered by a perfect lowpass filter with magnitude IH(w)l = 1 for lwl We. What is the autocorrelation function of the filtered signal?

13.8

Let X(t) be a Gaussian white noise with variance u 2 . It is filtered by a perfect bandpass filter with magnitude IH(w)l = 1 for w1 < lwl < wz and IH(w)l = 0 for other values of w. What is the autocorrelation function of the filtered signal?

13.9

The advantage of sending an unmodulated carrier is that receivers can be built inexpensively with simple hardware. This was especially important in the early days of radio. The disadvantage is the unmodulated carrier requires a considerable fraction of the total available power at the transmitter. Standard broadcast AM radio stations transmit an unmodulated carrier as well as the modulated carrier. Let W(t) be a WSS baseband signal (i.e., the voice or music) scaled so that IW(t)l:::; 1. Then,A(t) = W(t) + 1. a. What are the autocorrelation and power spectral densities of A(t) and Y(t)?

b. Redraw Figure 13.2 to reflect the presence of the carrier. 13.10

A common example of the Wiener filter is when S(n) =X(n+ 1)-thatis, when the desired signal S(n) is a prediction of X(n + 1). What do the Wiener-Hopf equations looks like in this case? (These equations are known as the Yule-Walker equations.)

13.11

Create your own version of Figure 13.4.

CHAPTER

SELECTED RANDOM PROCESSES

A great variety of random processes occur in applications. Advances in computers have allowed the simulation and processing of increasingly large and complicated models. In this chapter, we consider three important random processes: the Poisson process, Markov chains, and the Kalman filter.

14.1 THE LIGHTBULB PROCESS An elementary discrete random process is the lightbulb process. This simple process helps illustrate the kinds of calculations done in studying random processes. A lightbulb is turned on at time 0 and continues until it fails at some random time T. Let X(t) = 1 if the lightbulb is working and X(t) = 0 if it has failed. We assume the failure time Tis exponentially distributed. Its distribution function is Pr[T :5 t] = Fy(t) = 1- e-M Therefore, the probability mass function of X(t) is Pr[XCt} = 1] = Pr[t :5 Tj = 1- Fy(t) =e-M Pr[XCt} = 0] = Pr[T < t] = Fy(t) = 1- e-M A sample realization of the random process is shown in Figure 14.1. X(t) starts out at 1, stays at 1 for awhile, then switches to X(t) = 0 at T = t and stays at 0 forever afterward. What are the properties of X(t)? The mean and variance are easily computed: E[XCtl] = 0 * Pr[XCt} = 0] + 1 * Pr[XCt} = 1] =e-M E[XCtl 2 ] = 0 2 * Pr[XCt} = 0] + 12 * Pr[XCt} = 1] =e-M 2

Var[XCt}] = E[XCtl 2 ] - E[XCt} ] =e-M- e-nt E[X(t)] and Var[XCt}] are shown in Figure 14.2.

366

14.1 The Lightbulb Process

367

X(t)

T FIGURE 14.1 Example ofthe lightbulb process. X(t) = 1 when the lightbulb is working, and X(t) = 0 when the bulb has failed .

Var[X(t) ]

FIGURE 14.2 Mean and variance of X(t) for the lightbulb process.

The autocorrelation and autocovariance functions require the joint probability mass function. Let t 1 and t 2 be two times, and let p(i,j; t 1, t 2) = Pr[XCtJl = i n X(t 2) = j]. Then, p(O,O;t1 ,t2)

= Pr[X(t,) = OnXCt2l = 0] =Pr[T t 2 ] + 0 ·1 · Pr[O, 1; t 1 , t2 ]

+ 1 · 0 · Pr[ 1,0; t 1 , t2 ] + 1 · 1 · Pr[ 1, l;t1 ,tz]

The lightbulb process is not WSS. It fails both criteria. The mean is not constant, and the autocorrelation function is not a function of only the time difference, t2 - t 1 . Other, more interesting random processes are more complicated than the lightbulb process and require more involved calculations. Nevertheless, the lightbulb process serves as a useful example of the kinds of calculations needed.

14.2 THE POISSON PROCESS The Poisson process is an example of a wide class of random processes that possess independent increments. The basic idea is that points occur randomly in time and the number of points in any interval is independent of the number that occurs in any other nonoverlapping interval. A Poisson process also has the property that the number of points in any time interval (s, t) is a Poisson random variable with parameter lt(t- s). Another way of looking at it is that the mean number of points in the interval (s, t) is proportional to the size of the interval, t-s. As one example of a Poisson process, consider a sequence of photons incident on a detector. The mean number of photons incident in (s, t) is the intensity of the light, It, times the size of the interval, t -s. As another example, the number of diseases (e.g., cancer) in a population versus time is often modeled as a Poisson process. In this section, we explore some of the basic properties of the Poisson process. We calculate moments and two interesting conditional probabilities. The first is the probability of getting l points in the interval (0, t) given k points in the shorter interval (O,s), with s < t. The second is the reverse, the probability of getting k points in (O,s) given l points in (O,t).

14.2 The Poisson Process 369

Comment 14.1: It is conventional to refer to point, and the Poisson process is often referred to as a Poisson point process. In ordinary English, one might use the word "events" to describe the things that happen, but we have already used this word to refer to a set of outcomes.

Let (s, t) be a time interval with 0 < s < t, and let N(s, t) be the number of points in the interval (s,t). For convenience, letN(s) =N(O,s) andN(t) =N(O,t). WealsoassumeN(O) = 0. Then, N(s, t) = N(t)- N(s)

This can be rearranged as N(t) = N(s) + N(s, t). This is shown graphically in Figure 14.3. 1 + - - - - - - - - - N(t) - - - - - - - - - + 1 ~ N(s)

_ __,-1+----- N(s,t) - - - - > 1

s

0

FIGURE 14.3 An example of a Poisson process. The points are shown as closed dots: N(s) = 5, 2, and N(t) = 7.

N(s,t) =

As mentioned above, a Poisson process has two important properties. First, it has independent increments, meaning that if (s, t) and (u, v) are disjoint (nonoverlapping) intervals, then N(s, t) and N(u, v) are independent random variables. In particular, N(s) and N(s, t) are independent since the intervals (O,s) and (s, t) do not overlap. Second, N(s, t) is Poisson with parameter /t(t - s). For a Poisson random variable, E[N(s, t)] = lt(t -s). This assumption means the expected number of points in an interval is

proportional to the size of the interval. Using the Poisson assumption, means and variances are easily computed: E[N(t)] = Var[N(t)] =Itt E[N(s, t)] = Var[N(s, t)] = 1\,(t -s)

Computing autocorrelations and autocovariances uses both the Poisson and independent increments assumptions: R(s,t) =E[N(s)N(t)j

= E[N(s)(N(s) +N(s,tl)] =E[N(s)N(s)j +E[N(s)N(s,t)j 2

= Var[NCsl] +E[NCsl] +E[N(s)jE[N(s,t)j 2 2

= lts+lt s + (lts)(lt(t-s)) =Its (I +Itt)

370

CHAPTER 14 SELECTED RANDOM PROCESSES

C(s, t) = R(s, t)- E[N(s) ]E[N(t)]

=Its An interesting probability is the one that no points occur in an interval: Pr[N(s, t) = Oj = e--t(t - s) This is the same probability as for an exponential random variable not occurring in (s, t). Thus, we conclude the interarrival times for the Poisson process are exponential. That the waiting times are exponential gives a simple interpretation to the Poisson process. Starting from time 0, wait an exponential time T 1 for the first point. Then, wait an additional exponential time T 2 for the second point. The process continues, with each point occurring an exponential waiting time after the previous point. The idea that the waiting times are exponential gives an easy method of generating a realization of a Poisson process: 1. Generate a sequence of U(O, 1) random variables U 1 , U 2, .. .. 2. Transform each as T 1 = -log(UJ)/.1\,, T2 = -log(U2l/lt, .... The Ti are exponential random variables. 3. The points occur at times Tl> T 1 + T 2 , T 1 + T 2 + T 3 , etc.

(This is how the points in Figures 14.3 and 14.4 were generated.) We also calculate two conditional probabilities. The first uses the orthogonal increments property to show that the future of the Poisson process is independent of its past:

[

I

l

Pr N(t) =I N(s) = k =

Pr[N(t) = lnN(s) = k] [ ] Pr N(s) =k Pr[N(s,t) = 1-knN(s) = k] Pr[N(s) =k] Pr[N(s,t) = 1-k]· Pr[N(s) = k] Pr[N(s) =k]

(by independence)

= Pr[N(s, t) = 1- k] Given there are I points at time t, how many points were there at time s? The reverse conditional probability helps answer this question and gives yet another connection between the Poisson and Binomial distributions:

[

I

l

Pr N(s) = k N(t) =I =

Pr[NCt)=lnN(s)=k] [ ] Pr N(t) =I Pr[N(s,t) = 1-knN(s) =k] Pr[N(t) =I] Pr[N(s,t) = 1-k]· Pr[N(s) = k] Pr[N(t) =I]

14.2 The Poisson Process

371

(1\,(t _ s))l-ke-.l(t-s) (1\,s)ke-.ls

k!

(l-k)! (Itt) I e-M

I! I!

(s)k(t-s)l-k

= (l-k)!k!.

t

(14.1)

-t-

This probability is binomial with parameters I and sit. Curiously, the probability does not depend on lt. Figure 14.4 shows a realization of a Poisson process created as described above with It = 1. In the time interval from t = 0 to t = 25, we expect to get about E [N (25)] = 25/t = 25 points. We obtained 31 points, slightly greater than one standard deviation (a= VIi= 5) above the mean. For large t, the Poisson distribution converges to a Gaussian by the CLT. On average, we expect N(t) to be within one standard deviation of the mean about two-thirds of the time. In summary, the Poisson process is used in many counting experiments. N(s, t) is the number of points that occur in the interval from s tot and is Poisson with parameter 1\,(t-s). The separation between successive points is exponential; that is, the waiting time until the next point occurs is exponential with parameter It (the same It in the Poisson probabilities). The mean of N(s, t) is 1\,(t- s); the variance of N(s, t) is also lt(t- s). A typical N(t) sequence is shown in Figure 14.4.

30

vTt 25

Jit

A-t 20 15 10 5

5

Ill

10

Ill I I II I IIIII I

15

I

20

II

II Ill I I

25 Gk> Hk> Qk> and Rk are all constant in time. When we discuss the optimal estimates for this example, we will drop the subscript k on these matrices. (It often happens that these matrices are time independent.) The second part of the Kalman filter is a procedure for optimally estimating the state vector and the error covariance of the state vector. Define Xkil as the estimate of Xk using observations through (and including) time l. While we use the term filtering broadly to mean estimating Xk from observations, sometimes the terms smoothing, filtering, and prediction are used more precisely: if l > k (future observations are used to estimate Xk), Xk il is a smoothed estimate of xk; if l = k, Xkik is a filtered estimate; and if l < k, Xkil is a prediction ofXk.

The Kalman filter is recursive: at time k - 1, an estimate x̂_{k-1|k-1} and its error covariance are available. The first step is to generate a prediction, x̂_{k|k-1}, and its error covariance, Σ_{k|k-1}. This step uses the state equation (Equation 14.6). The second step is to generate a filtered estimate, x̂_{k|k}, and its error covariance, Σ_{k|k}, using the observation


equation (Equation 14.7). This process (i.e., predict and then filter) can be repeated as long as desired.

The Kalman filter starts with an estimate of the initial state, x̂_{0|0}. We assume x̂_{0|0} ~ N(x_0, Σ_{0|0}) and that the error in x̂_{0|0} is independent of the w_k and v_k for k > 0. At time k - 1, assume x̂_{k-1|k-1} ~ N(x_{k-1}, Σ_{k-1|k-1}). Knowing nothing else, the best estimate of the noises in the state equation is 0. Therefore, the prediction is given by Equation (14.11). The prediction error combines the error in x̂_{k-1|k-1} with the state noise.

The two noises are independent. Therefore, the error covariance of the prediction is given by Equation (14.12).

The observation update step, going from x̂_{k|k-1} to x̂_{k|k}, is more complicated. We state the answer, without proof, in Equations (14.13) and (14.14). The update equations are simplified by defining a gain matrix, K_k. Then,

    K_k = Σ_{k|k-1} H_k^T (H_k Σ_{k|k-1} H_k^T + R_k)^{-1}        (14.15)
    x̂_{k|k} = x̂_{k|k-1} + K_k (z_k - H_k x̂_{k|k-1})              (14.16)
    Σ_{k|k} = (I - K_k H_k) Σ_{k|k-1}                             (14.17)
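To make the recursion concrete, here is a minimal numpy sketch of one predict/filter cycle. The state-transition matrix is written generically as F (the text's state equation, Equation 14.6, is not reproduced in this excerpt), G_k = I is assumed, and the variable P plays the role of the error covariance Σ:

    import numpy as np

    def kalman_predict(x, P, F, Q):
        """Prediction step: propagate the state estimate and its error covariance."""
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        return x_pred, P_pred

    def kalman_update(x_pred, P_pred, z, H, R):
        """Observation update: gain, innovation correction, covariance (Eqs. 14.15-14.17)."""
        S = H @ P_pred @ H.T + R                         # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)              # gain matrix (Eq. 14.15)
        x_new = x_pred + K @ (z - H @ x_pred)            # predictor-corrector form (Eq. 14.16)
        P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred   # Eq. 14.17
        return x_new, P_new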

The state prediction update equation (Equation 14.16) has the familiar predictor-corrector form. The updated state estimate is the predicted estimate plus the gain times the innovation. Equations (14.11), (14.12), (14.15), (14.16), and (14.17) constitute what is generally called the Kalman filter. The state and observation equations are linear and corrupted by additive Gaussian noise. The state estimates are linear in the observations and are optimal in the sense of minimizing the MSE. The state estimates are unbiased with the covariances above.

Let us continue the example and calculate the updates. Assume the initial estimate is x̂_{0|0} = (0 0)^T with error covariance Σ_{0|0} = I. For simplicity, we also assume Δt = 0.9 and state noise variances σ_1^2 = σ_2^2 = 0.5 (so Q = 0.5 I). Then,

    x̂_{1|0} = (0 0)^T        Σ_{1|0} = [2.31  0.9; 0.9  1.5]

Letting the observation noise variance be R = 0.33, we obtain an observation z_1 = 0.37. First, we need to calculate the gain matrix:

    K_1 = Σ_{1|0} H^T (H Σ_{1|0} H^T + R)^{-1} = [2.31; 0.9] (2.31 + 0.33)^{-1} = [0.875; 0.34]

The innovation is z_1 - H x̂_{1|0} = 0.37 - 0 = 0.37. Therefore, the new state estimate and its error covariance are the following:

    x̂_{1|1} = [0; 0] + [0.875; 0.34] (0.37) = [0.323; 0.126]

    Σ_{1|1} = (I - [0.875; 0.34][1  0]) [2.31  0.9; 0.9  1.5] ≈ [0.29  0.11; 0.11  1.19]

The Kalman estimate (0.323 0.126)^T is reasonable given the observation z_1 = 0.37. Note that the estimated covariance Σ_{1|1} is smaller than Σ_{1|0} and that both are symmetric (as are all covariance matrices).
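The one-step example above can be reproduced with the kalman_predict and kalman_update sketches given after Equation (14.17). The state matrix F below is an assumption on our part (it is not stated in this excerpt); it was chosen, together with Q = 0.5 I, H = (1 0), and R = 0.33, to be consistent with the printed covariance Σ_{1|0}:

    import numpy as np

    # Assumed model matrices for this example (F is inferred, not taken from the text).
    F = np.array([[1.0, 0.9], [0.0, 1.0]])
    Q = 0.5 * np.eye(2)
    H = np.array([[1.0, 0.0]])
    R = np.array([[0.33]])

    x0 = np.zeros((2, 1))
    P0 = np.eye(2)
    x_pred, P_pred = kalman_predict(x0, P0, F, Q)   # P_pred = [[2.31, 0.9], [0.9, 1.5]]
    x1, P1 = kalman_update(x_pred, P_pred, np.array([[0.37]]), H, R)
    print(x1.ravel())    # approximately [0.324, 0.126]
    print(P1)            # approximately [[0.29, 0.11], [0.11, 1.19]]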

14.4.2 QR Method Applied to the Kalman Filter

The Kalman filter is the optimal MSE estimator, but the straightforward implementation sometimes has numerical problems. In this section, we show how the QR method, an implementation that is fast, easy to program, and accurate, can be used to compute the Kalman estimates. In the development below, we make the simplifying assumption that G_k = I for all k. If this is not true, we can set up and solve a constrained linear regression problem, but that development is beyond this text.

Recall that we introduced the QR method in Section 11.3.3 as a way of solving a linear regression problem. The linear regression problem is to choose β to minimize ‖y - Xβ‖². The optimal solution is β̂ = (X^T X)^{-1} X^T y. Alternately, the solution can be found by applying an orthogonal matrix to X and y, converting X to an upper triangular matrix R with zeros below (Figure 11.4). The solution can then be found by solving the triangular system of equations R β̂ = ỹ. The work in the QR method is applying the orthogonal matrix to X. Solving the resulting triangular system of equations is easy.

The Kalman recursion starts with an estimate, x̂_{0|0} ~ N(x_0, Σ_{0|0}). We need to represent this relation in the general form of the linear regression problem. To do this, we first factor Σ_{0|0} as P^T P, where P is an upper triangular matrix. There are several ways of doing this, the most popular of which is the Cholesky factorization (see Problem 11.6). The original estimate can be written as x̂_{0|0} = x_0 + w_0, where w_0 has covariance Σ_{0|0} = P^T P. These equations can be transformed as

    P^{-T} x̂_{0|0} = P^{-T} x_0 + ε_0

where ε_0 ~ N(0, I). We use the notation P^{-T} to denote the transpose of the inverse of P; that is, P^{-T} = (P^{-1})^T.
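As a reminder of the mechanics from Section 11.3.3, here is a small self-contained sketch (Python with numpy; the data are synthetic) of solving a linear regression problem by the QR method and checking it against the normal equations:

    import numpy as np

    # Synthetic regression problem: choose beta to minimize ||y - X beta||^2.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=20)

    Q, R = np.linalg.qr(X)                        # X = Q R, with R upper triangular
    beta_qr = np.linalg.solve(R, Q.T @ y)         # solve R beta = Q^T y
    beta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # normal equations, for comparison
    print(beta_qr, beta_ne)                       # the two solutions agree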


To simplify the remaining steps, we change the notation a bit. Instead of writing the factorization of Σ_{0|0} as P^T P, we write it as Σ_{0|0}^{T/2} Σ_{0|0}^{1/2}. Similarly, we factor Q = Q^{T/2} Q^{1/2} and R = R^{T/2} R^{1/2}. The Kalman state equation (Equation 14.6, with G_k = I) can then be rewritten by multiplying through by Q_k^{-T/2} so that its noise term is ε_1 ~ N(0, I); likewise, the observation equation becomes R_k^{-T/2} z_k = R_k^{-T/2} H_k x_k + ε_2. The noises in both equations are Gaussian with mean 0 and identity covariances. The initial estimate and the state equations combine to form a linear regression problem.

SUMMARY

For the lightbulb process of Section 14.1, the bulb's lifetime T is exponential with parameter λ:

    Pr[T ≤ t] = 1 - e^{-λt}
    E[X(t)] = e^{-λt}
    Var[X(t)] = e^{-λt} - e^{-2λt}
    R(t_1, t_2) = e^{-λ max(t_1, t_2)}

A Markov chain with probability transition matrix P has a unique stationary distribution if there is an n such that

    (P^l)_{ij} > 0    for all i and j and all l ≥ n

In other words, the Markov chain has a unique stationary distribution if the matrix P^n has all nonzero entries for some n.

The Kalman filter is a minimum mean squared error estimator of a state vector given observations corrupted by additive Gaussian noise. The state evolves in time according to a linear state equation (Equation 14.6), and observations are made through a linear observation equation (Equation 14.7).

The Kalman filter is recursive. Given an estimate x̂_{k-1|k-1}, the prediction of x_k is given by Equation (14.11).


The error covariance of the prediction is given by Equation (14.12).

The observation update equations are simplified by defining a gain matrix, K_k:

    K_k = Σ_{k|k-1} H_k^T (H_k Σ_{k|k-1} H_k^T + R_k)^{-1}
    x̂_{k|k} = x̂_{k|k-1} + K_k (z_k - H_k x̂_{k|k-1})
    Σ_{k|k} = (I - K_k H_k) Σ_{k|k-1}

The state prediction update equation has the familiar predictor-corrector form. One numerically accurate, fast, and easy-to-program implementation of the Kalman filter is to apply the QR factorization to a linear regression problem formed from the Kalman state and observation equations.

14.1

For the lightbulb process in Section 14.1:
a. What value of t maximizes Var[X(t)]?
b. What is the value of E[X(t)] at this value of t?
c. Why does this answer make sense?

14.2

Is the Poisson process WSS? Why or why not?

14.3

Generate your own version of Figure 14.4.

14.4

Another way of generating a realization of a Poisson process uses Equation (14.1). Given N(t) = l, N(s) is binomial with parameters n = l and p = s/t. A binomial is a sum of Bernoullis, and a Bernoulli can be generated by comparing a uniform to a threshold. Combining these ideas, a realization of a Poisson process can be generated with these two steps (a code sketch follows the list):

1. Generate a Poisson random variable with parameter λt. Call the value n.
2. Generate n uniform random variables on the interval (0, t). The values of the n random variables are the points of a Poisson process.
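A minimal sketch of these two steps (Python with numpy; λ = 1 and the interval (0, 25) are illustrative choices matching Figure 14.4):

    import numpy as np

    rng = np.random.default_rng()
    lam, t_max = 1.0, 25.0
    n = rng.poisson(lam * t_max)                        # step 1: number of points
    points = np.sort(rng.uniform(0.0, t_max, size=n))   # step 2: uniform locations
    print(n, points[:3])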

Use this technique to generate your own version of Figure 14.4.

14.5

Write a program to implement a Markov chain. The Markov chain function should accept an initial probability vector, a probability transition matrix, and a time n and then output the sequence of states visited by the Markov chain. Test your program on simple Markov chains, and verify that a histogram of the states visited approximates the stationary distribution.

14.6

Why is the Markov chain in Example 14.2 converging so slowly?

14.7

In Problem 6.23, we directly solved the "first to k competition" (as phrased in that question, the first to win k of n games wins the competition) using binomial probabilities. Assume the games are independent and one team wins each game with probability p. Set up a Markov chain to solve the "first to 2 wins" competition.
a. How many states are required?


b. What is the probability transition matrix?
c. Compute the probability of winning versus p for several values of p using the Markov chain and the direct formula, and show the answers are the same.

14.8

For the "win by 2" Markov chain in Example 14.3, compute the expected number of games to result in a win or loss as a function of p.

14.9

In Problems 6.27 and 6.28, we computed the probability that a binomial random variable S_n with parameters n and p is even. Here, we use a Markov chain to compute the same probability:

1. Set up a two-state Markov chain with states even and odd. Draw the state diagram. What is the state transition matrix P? What is the initial probability vector p(0)?
2. Find the first few state probabilities by raising P to the nth power for n = 1, 2, 3, 4.
3. Set up the flow equations, and solve for the steady-state probability that S_n is even (i.e., as n → ∞).
4. Solve for the steady-state probabilities from the eigenvalue and eigenvector decomposition of P using p = 0.5, 0.6, 0.7, 0.8, 0.9.

14.10

Simulate the example Kalman filter problem in Section 14.4.1 for k = 0, 1, ..., 10 using the standard Kalman recursions.
a. What is the final estimate?
b. How close is the final estimate to the actual value?

14.11

Simulate the example Kalman filter problem in Section 14.4.1 for k = 0, 1, ..., 10 using the QR method. Set up multiple linear regression problems, and use a library function to solve them.
a. What is the final estimate?
b. How close is the final estimate to the actual value?

14.12

The error covariance in a linear regression problem is σ²(X^T X)^{-1}, where σ² is the variance of the noise (often σ² = 1, which we assume in this problem). In the QR method for computing the Kalman filter update (Section 14.4), we said, "The error covariance is Σ_{1|1} = R_{1|1}^{-1} R_{1|1}^{-T}." Show this. (Hints: Σ^{-1} = X^T X = R^T R. Write Σ in terms of R^{-1}; calculate R^{-1} in terms of R_{0|0}, R_{0|1}, and R_{1|1}; and then compute Σ. Identify the portion of Σ that is Σ_{1|1}. Also, remember that matrix multiplication does not commute: AB ≠ BA in general.)


14.13

Write the state and observation equations for an automobile cruise control. Assume the velocity of the vehicle can be measured, but not the slope of the road. How might the control system incorporate a Kalman filter?

APPENDIX A

COMPUTATION EXAMPLES

Throughout the text, we have used three computation packages: Matlab, Python, and R. In this appendix, we briefly demonstrate how to use these packages. For more information, consult each package's documentation and the many websites devoted to each one. All the examples in the text have been developed using one or more of these three packages (mostly Python). However, all the plots and graphics, except those in this appendix, have been recoded in a graphics macro package for a consistent look.

A.1 MATLAB

Matlab is a popular numerical computing package widely used at many universities. To save a little space, we have eliminated blank lines and compressed multiple answers into a single line in the Matlab results below. Start with some simple calculations:

>> conv([1,1,1,1], [1,1,1,1])
ans =
     1     2     3     4     3     2     1
>> for k = 0:5
nchoosek(5,k)
end
ans =
     1     5    10    10     5     1

Here is some code for the Markov chain example in Section 14.3:

>> P = [[0.9, 0.1]; [0.2, 0.8]]
P =
    0.9000    0.1000
    0.2000    0.8000
>> P2 = P * P
P2 =
    0.8300    0.1700
    0.3400    0.6600
>> P4 = P2 * P2
P4 =
    0.7467    0.2533
    0.5066    0.4934

The probability a standard normal random variable is between -1.96 and 1.96 is 0.95:

>> cdf('Normal',1.96,0,1) - cdf('Normal',-1.96,0,1)
ans =
    0.9500

The probability of getting three heads in a throw of six coins is 0.3125:

>> pdf('Binomial',3,6,0.5)
ans =
    0.3125

Compute basic statistics with the data from Section 10.6. The data exceed the linewidth of this page and are broken into pieces for display:

>> data = [0.70, 0.92, -0.28, 0.93, 0.40, -1.64, 1.77, 0.40, -0.46, -0.31, ...
    0.38, 0.63, -0.79, 0.07, -2.03, -0.29, -0.68, 1.78, -1.83, 0.95];

Compute the mean, median, sample variance, and interquartile distance:

>> mean(data), median(data), var(data), iqr(data)
ans =
    0.0310, 0.2250, 1.1562, 1.3800

The sample distribution function of this data can be computed and plotted as follows:

>> x = linspace(-3,3,101);
>> y = cdf('Normal',x,0,1);
>> plot(x,y,'LineWidth',3,'Color',[0.75,0.75,0.75])
>> [fs, xs] = ecdf(data);
>> hold on;
>> stairs(xs,fs,'LineWidth',1.5,'Color',[0,0,0])
>> xlabel('x','FontSize',14), ylabel('CDF','FontSize',14)
>> title('Sample CDF vs Gaussian CDF','FontSize',18)
>> saveas(gcf,'MatlabCDFplot','epsc')


The plot is shown below. We used options to make the continuous curve a wide gray color, to make the sample curve a bit narrower and black, and to increase the size of the fonts used in the title and labels.

[Plot: "Sample CDF vs. Gaussian CDF" shows the empirical CDF of the data (black staircase) overlaid on the N(0,1) CDF (gray curve); x axis from -3 to 3, y axis labeled CDF.]

A.2 PYTHON

Python is an open source, general-purpose computing language. It is currently the most common first language taught at universities in the United States. The core Python language has limited numerical and data analysis capabilities. However, the core language is supplemented with numerous libraries, giving the "batteries included" Python similar capabilities to R and Matlab. In Python, numpy and matplotlib give Python linear algebra and plotting capabilities similar to those of Matlab. scipy.stats and statistics are basic statistical packages.

Start Python, and import the libraries and functions we need:

>>> import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as st
    from statistics import mean, median, variance, stdev

Do routine analysis on the data from Section 10.6:

>>> data = [0.70, 0.92, -0.28, 0.93, 0.40, -1.64, 1.77, 0.40, -0.46, -0.31,
            0.38, 0.63, -0.79, 0.07, -2.03, -0.29, -0.68, 1.78, -1.83, 0.95]
>>> print(mean(data), median(data), variance(data), stdev(data))
0.03100000000000001 0.225 1.15622 1.0752767085731934


Define a Gaussian random variable Z ~ N(0, 1). Generate a vector of 100 samples from the distribution, and plot a histogram of the data against the density:

>>> x = np.linspace(-3,3,101)
    Z = st.norm()   # N(0,1) random variable
    plt.plot(x, Z.pdf(x), linewidth=2, color='k')
    ns, bins, patches = plt.hist(Z.rvs(100), normed=True,
        range=(-3.25,3.25), bins=13, color='0.9')
    plt.xlabel('x')
    plt.ylabel('Normalized Counts')
    plt.savefig('PythonHistCDF.pdf')

[Plot: normalized histogram of the 100 samples overlaid on the N(0,1) density; x axis x, y axis Normalized Counts.]

Now, use the same Gaussian random variable to compute probabilities:

>>> # some basic Gaussian probabilities
    Z.cdf(1.96) - Z.cdf(-1.96), Z.ppf(0.975)
(0.95000420970355903, 1.959963984540054)

Do some of the calculations of the birthday problem in Section 3.6. With n = 365 days, a group of 23 people has a better than 0.5 chance of a common birthday. That is, for the first 22 group sizes, Pr(no pair) > 0.5:

>>> n = 365
    days = np.arange(1, n+1)
    p = 1.0*(n+1-days)/n
    probnopair = np.cumprod(p)
    np.sum(probnopair > 0.5)
22


Plot the probability of no pair versus the number of people in the group:

>>> xupper = 30
    cutoff = 23
    plt.plot(days[:xupper], probnopair[:xupper],
             days[:xupper], probnopair[:xupper], '.')
    plt.axvline(x=cutoff, color='r')
    plt.axhline(y=0.5)
    plt.xlabel('Number of People in Group')
    plt.ylabel('Probability of No Match')
    plt.savefig('PythonBirthday.pdf')

[Plot: probability of no common birthday versus group size (1 to 30), with a horizontal line at 0.5 and a vertical line at 23; x axis Number of People in Group, y axis Probability of No Match.]

A.3 R

R is a popular open source statistics and data analysis package. It is widely used in universities, and its use is growing in industry. R's syntax is somewhat different from that in Matlab and Python. For further study, the reader is urged to consult the many texts and online resources devoted to R.

Start R:

R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

We begin with some simple probability calculations. The probability of two Aces in a selection of two cards from a standard deck is 0.00452 = 1/221:

> choose(5,3)
[1] 10
> choose(5,c(0,1,2,3,4,5))
[1]  1  5 10 10  5  1
> choose(4,2)/choose(52,2)
[1] 0.004524887
> choose(52,2)/choose(4,2)
[1] 221

Note that the syntax above for creating a vector, c(0,1,2,3,4,5), differs from the syntax in Matlab and Python.

The probability a standard normal random variable is between -1.96 and 1.96 is 0.95:

> pnorm(1.96); qnorm(0.975)
[1] 0.9750021
[1] 1.959964
> pnorm(1.96) - pnorm(-1.96)
[1] 0.9500042

To illustrate R's data handling capabilities, import data from a Comma Separated Value (CSV) file. The data are grades for 20 students from two midterms:

> grades = read.csv('Grades.csv', header=TRUE)
> grades
   Midterm1 Midterm2
1        53       59
2        38       50
3        53       56
4        55       61
5        32       18
6        48       57
7        56       39
8        47       24
9        44       22
10       94       86
11       66       18
12       62       57
13       56       45
14       94       63
15       70       51
16       88       89
17       56       47
18      100       96
19       75       67
20       88       86

The summary command is a simple way to summarize the data:

> summary(grades)
    Midterm1         Midterm2
 Min.   : 32.00   Min.   :18.00
 1st Qu.: 51.75   1st Qu.:43.50
 Median : 56.00   Median :56.50
 Mean   : 63.75   Mean   :54.55
 3rd Qu.: 78.25   3rd Qu.:64.00
 Max.   :100.00   Max.   :96.00

Clearly, grades on Midterm2 are, on average, lower than those on Midterm1. It is easiest to access the columns if we attach the data frame:

> attach(grades) For instance, the correlation between the two columns is 0.78:

> cor(Midterm1, Midterm2)
[1] 0.7784707

A stem-and-leaf plot is an interesting way to represent the data. It combines features of a sorted list and a histogram:

> stem(Midterm1, scale=2)

  The decimal point is 1 digit(s) to the right of the |

   3 | 28
   4 | 478
   5 | 335666
   6 | 26
   7 | 05
   8 | 88
   9 | 44
  10 | 0

The sorted data can be read from the stem plot: 32, 38, 44, etc. Of the 20 scores, six were in the 50s. (Note, the stem command in R differs from the stem command in Matlab and Python. Neither Matlab nor Python has a standard stem-and-leaf plot command, but such a function is easy to write in either language.)

Fit a linear model to the data, and plot both the data and the fitted line. To make the plot more attractive, we use a few of the plotting options:

> lmfit = lm(Midterm2 ~ Midterm1)
> pdf('grades.pdf')


> plot(Midterm1, Midterm2, cex=1.5, cex.axis=1.5, cex.lab=1.5, las=1)
> abline(lmfit)
> dev.off()

[Plot: scatter plot of Midterm2 versus Midterm1 (open circles) with the fitted regression line; x axis Midterm1 from 30 to 100, y axis Midterm2 from roughly 20 to 100.]

APPENDIX B

ACRONYMS

AUC     Area Under the ROC Curve
CDF     Cumulative Distribution Function
CLT     Central Limit Theorem
CSV     Comma Separated Value
DFT     Discrete Fourier Transform
DTFT    Discrete Time Fourier Transform
ECC     Error Correcting Coding
IID     Independent and Identically Distributed
IQ      Intelligence Quotient
JPEG    Joint Photographic Experts Group
KDE     Kernel Density Estimate
KL      Kullback-Leibler Divergence
LTP     Law of Total Probability
MAC     Message Authentication Code
MAP     Maximum a Posteriori
MGF     Moment Generating Function
MLE     Maximum Likelihood Estimate
MMSE    Minimum Mean Squared Error
MSE     Mean Squared Error
PDF     Probability Density Function
PMF     Probability Mass Function
PSD     Power Spectral Density
PSK     Phase Shift Keying
QAM     Quadrature Amplitude Modulation
ROC     Receiver Operating Characteristic
SNR     Signal-to-Noise Ratio
WSS     Wide Sense Stationary
8PSK    Eight-Point Phase Shift Keying
4QAM    Four-Point Quadrature Amplitude Modulation
16QAM   16-Point Quadrature Amplitude Modulation

APPENDIX C

PROBABILITY TABLES

C.1 TABLES OF GAUSSIAN PROBABILITIES

TABLE C.1 Values of the Standard Normal Distribution Function

   z     Φ(z)       z     Φ(z)       z     Φ(z)       z     Φ(z)
 0.00   0.5000    1.00   0.8413    2.00   0.9772    3.00   0.9987
 0.05   0.5199    1.05   0.8531    2.05   0.9798    3.05   0.9989
 0.10   0.5398    1.10   0.8643    2.10   0.9821    3.10   0.9990
 0.15   0.5596    1.15   0.8749    2.15   0.9842    3.15   0.9992
 0.20   0.5793    1.20   0.8849    2.20   0.9861    3.20   0.9993
 0.25   0.5987    1.25   0.8944    2.25   0.9878    3.25   0.9994
 0.30   0.6179    1.30   0.9032    2.30   0.9893    3.30   0.9995
 0.35   0.6368    1.35   0.9115    2.35   0.9906    3.35   0.9996
 0.40   0.6554    1.40   0.9192    2.40   0.9918    3.40   0.9997
 0.45   0.6736    1.45   0.9265    2.45   0.9929    3.45   0.9997
 0.50   0.6915    1.50   0.9332    2.50   0.9938    3.50   0.9998
 0.55   0.7088    1.55   0.9394    2.55   0.9946    3.55   0.9998
 0.60   0.7257    1.60   0.9452    2.60   0.9953    3.60   0.9998
 0.65   0.7422    1.65   0.9505    2.65   0.9960    3.65   0.9999
 0.70   0.7580    1.70   0.9554    2.70   0.9965    3.70   0.9999
 0.75   0.7734    1.75   0.9599    2.75   0.9970    3.75   0.9999
 0.80   0.7881    1.80   0.9641    2.80   0.9974    3.80   0.9999
 0.85   0.8023    1.85   0.9678    2.85   0.9978    3.85   0.9999
 0.90   0.8159    1.90   0.9713    2.90   0.9981    3.90   1.0000
 0.95   0.8289    1.95   0.9744    2.95   0.9984    3.95   1.0000



TABLE C.2 Values of the Standard Normal Tail Probabilities

   z     1 - Φ(z)