157 80 45MB
English Pages [1225]
PROBABILITY, MATHEMATICAL STATISTICS, AND STOCHASTIC PROCESSES
Kyle Siegrist University of Alabama in huntsville
Book: Probability, Mathematical Statistics, Stochastic Processes
This text is disseminated via the Open Education Resource (OER) LibreTexts Project (https://LibreTexts.org) and like the hundreds of other texts available within this powerful platform, it is freely available for reading, printing and "consuming." Most, but not all, pages in the library have licenses that may allow individuals to make changes, save, and print this book. Carefully consult the applicable license(s) before pursuing such effects. Instructors can adopt existing LibreTexts texts or Remix them to quickly build course-specific resources to meet the needs of their students. Unlike traditional textbooks, LibreTexts’ web based origins allow powerful integration of advanced features and new technologies to support learning.
The LibreTexts mission is to unite students, faculty and scholars in a cooperative effort to develop an easy-to-use online platform for the construction, customization, and dissemination of OER content to reduce the burdens of unreasonable textbook costs to our students and society. The LibreTexts project is a multi-institutional collaborative venture to develop the next generation of openaccess texts to improve postsecondary education at all levels of higher learning by developing an Open Access Resource environment. The project currently consists of 14 independently operating and interconnected libraries that are constantly being optimized by students, faculty, and outside experts to supplant conventional paper-based books. These free textbook alternatives are organized within a central environment that is both vertically (from advance to basic level) and horizontally (across different fields) integrated. The LibreTexts libraries are Powered by NICE CXOne and are supported by the Department of Education Open Textbook Pilot Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable Learning Solutions Program, and Merlot. This material is based upon work supported by the National Science Foundation under Grant No. 1246120, 1525057, and 1413739. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation nor the US Department of Education. Have questions or comments? For information about adoptions or adaptions contact [email protected]. More information on our activities can be found via Facebook (https://facebook.com/Libretexts), Twitter (https://twitter.com/libretexts), or our blog (http://Blog.Libretexts.org). This text was compiled on 10/17/2023
Introduction
1
https://stats.libretexts.org/@go/page/10315
TABLE OF CONTENTS Introduction Licensing Object Library Credits Sources and Resources
1: Foundations 1.1: Sets 1.2: Functions 1.3: Relations 1.4: Partial Orders 1.5: Equivalence Relations 1.6: Cardinality 1.7: Counting Measure 1.8: Combinatorial Structures 1.9: Topological Spaces 1.10: Metric Spaces 1.11: Measurable Spaces 1.12: Special Set Structures
2: Probability Spaces 2.1: Random Experiments 2.2: Events and Random Variables 2.3: Probability Measures 2.4: Conditional Probability 2.5: Independence 2.6: Convergence 2.7: Measure Spaces 2.8: Existence and Uniqueness 2.9: Probability Spaces Revisited 2.10: Stochastic Processes 2.11: Filtrations and Stopping Times
3: Distributions 3.1: Discrete Distributions 3.2: Continuous Distributions 3.3: Mixed Distributions 3.4: Joint Distributions 3.5: Conditional Distributions 3.6: Distribution and Quantile Functions 3.7: Transformations of Random Variables 3.8: Convergence in Distribution 3.9: General Distribution Functions
1
https://stats.libretexts.org/@go/page/25718
3.10: The Integral With Respect to a Measure 3.11: Properties of the Integral 3.12: General Measures 3.13: Absolute Continuity and Density Functions 3.14: Function Spaces
4: Expected Value 4.1: Definitions and Basic Properties 4.2: Additional Properties 4.3: Variance 4.4: Skewness and Kurtosis 4.5: Covariance and Correlation 4.6: Generating Functions 4.7: Conditional Expected Value 4.8: Expected Value and Covariance Matrices 4.9: Expected Value as an Integral 4.10: Conditional Expected Value Revisited 4.11: Vector Spaces of Random Variables 4.12: Uniformly Integrable Variables 4.13: Kernels and Operators
5: Special Distributions 5.1: Location-Scale Families 5.2: General Exponential Families 5.3: Stable Distributions 5.4: Infinitely Divisible Distributions 5.5: Power Series Distributions 5.6: The Normal Distribution 5.7: The Multivariate Normal Distribution 5.8: The Gamma Distribution 5.9: Chi-Square and Related Distribution 5.10: The Student t Distribution 5.11: The F Distribution 5.12: The Lognormal Distribution 5.13: The Folded Normal Distribution 5.14: The Rayleigh Distribution 5.15: The Maxwell Distribution 5.16: The Lévy Distribution 5.17: The Beta Distribution 5.18: The Beta Prime Distribution 5.19: The Arcsine Distribution 5.20: General Uniform Distributions 5.21: The Uniform Distribution on an Interval 5.22: Discrete Uniform Distributions 5.23: The Semicircle Distribution 5.24: The Triangle Distribution 5.25: The Irwin-Hall Distribution 5.26: The U-Power Distribution 5.27: The Sine Distribution 5.28: The Laplace Distribution 5.29: The Logistic Distribution
2
https://stats.libretexts.org/@go/page/25718
5.30: The Extreme Value Distribution 5.31: The Hyperbolic Secant Distribution 5.32: The Cauchy Distribution 5.33: The Exponential-Logarithmic Distribution 5.34: The Gompertz Distribution 5.35: The Log-Logistic Distribution 5.36: The Pareto Distribution 5.37: The Wald Distribution 5.38: The Weibull Distribution 5.39: Benford's Law 5.40: The Zeta Distribution 5.41: The Logarithmic Series Distribution
6: Random Samples 6.1: Introduction 6.2: The Sample Mean 6.3: The Law of Large Numbers 6.4: The Central Limit Theorem 6.5: The Sample Variance 6.6: Order Statistics 6.7: Sample Correlation and Regression 6.8: Special Properties of Normal Samples
7: Point Estimation 7.1: Estimators 7.2: The Method of Moments 7.3: Maximum Likelihood 7.4: Bayesian Estimation 7.5: Best Unbiased Estimators 7.6: Sufficient, Complete and Ancillary Statistics
8: Set Estimation 8.1: Introduction to Set Estimation 8.2: Estimation the Normal Model 8.3: Estimation in the Bernoulli Model 8.4: Estimation in the Two-Sample Normal Model 8.5: Bayesian Set Estimation
9: Hypothesis Testing 9.1: Introduction to Hypothesis Testing 9.2: Tests in the Normal Model 9.3: Tests in the Bernoulli Model 9.4: Tests in the Two-Sample Normal Model 9.5: Likelihood Ratio Tests 9.6: Chi-Square Tests
10: Geometric Models 10.1: Buffon's Problems 10.2: Bertrand's Paradox 10.3: Random Triangles
3
https://stats.libretexts.org/@go/page/25718
11: Bernoulli Trials 11.1: Introduction to Bernoulli Trials 11.2: The Binomial Distribution 11.3: The Geometric Distribution 11.4: The Negative Binomial Distribution 11.5: The Multinomial Distribution 11.6: The Simple Random Walk 11.7: The Beta-Bernoulli Process
12: Finite Sampling Models 12.1: Introduction to Finite Sampling Models 12.2: The Hypergeometric Distribution 12.3: The Multivariate Hypergeometric Distribution 12.4: Order Statistics 12.5: The Matching Problem 12.6: The Birthday Problem 12.7: The Coupon Collector Problem 12.8: Pólya's Urn Process 12.9: The Secretary Problem
13: Games of Chance 13.1: Introduction to Games of Chance 13.2: Poker 13.3: Simple Dice Games 13.4: Craps 13.5: Roulette 13.6: The Monty Hall Problem 13.7: Lotteries 13.8: The Red and Black Game 13.9: Timid Play 13.10: Bold Play 13.11: Optimal Strategies
14: The Poisson Process 14.1: Introduction to the Poisson Process 14.2: The Exponential Distribution 14.3: The Gamma Distribution 14.4: The Poisson Distribution 14.5: Thinning and Superpositon 14.6: Non-homogeneous Poisson Processes 14.7: Compound Poisson Processes 14.8: Poisson Processes on General Spaces
15: Renewal Processes 15.1: Introduction 15.2: Renewal Equations 15.3: Renewal Limit Theorems 15.4: Delayed Renewal Processes 15.5: Alternating Renewal Processes 15.6: Renewal Reward Processes
4
https://stats.libretexts.org/@go/page/25718
16: Markov Processes 16.1: Introduction to Markov Processes 16.2: Potentials and Generators for General Markov Processes 16.3: Introduction to Discrete-Time Chains 16.4: Transience and Recurrence for Discrete-Time Chains 16.5: Periodicity of Discrete-Time Chains 16.6: Stationary and Limiting Distributions of Discrete-Time Chains 16.7: Time Reversal in Discrete-Time Chains 16.8: The Ehrenfest Chains 16.9: The Bernoulli-Laplace Chain 16.10: Discrete-Time Reliability Chains 16.11: Discrete-Time Branching Chain 16.12: Discrete-Time Queuing Chains 16.13: Discrete-Time Birth-Death Chains 16.14: Random Walks on Graphs 16.15: Introduction to Continuous-Time Markov Chains 16.16: Transition Matrices and Generators of Continuous-Time Chains 16.17: Potential Matrices 16.18: Stationary and Limting Distributions of Continuous-Time Chains 16.19: Time Reversal in Continuous-Time Chains 16.20: Chains Subordinate to the Poisson Process 16.21: Continuous-Time Birth-Death Chains 16.22: Continuous-Time Queuing Chains 16.23: Continuous-Time Branching Chains
17: Martingales 17.1: Introduction to Martingalges 17.2: Properties and Constructions 17.3: Stopping Times 17.4: Inequalities 17.5: Convergence 17.6: Backwards Martingales
18: Brownian Motion 18.1: Standard Brownian Motion 18.2: Brownian Motion with Drift and Scaling 18.3: The Brownian Bridge 18.4: Geometric Brownian Motion
Index Glossary Detailed Licensing
5
https://stats.libretexts.org/@go/page/25718
Licensing A detailed breakdown of this resource's licensing can be found in Back Matter/Detailed Licensing.
1
https://stats.libretexts.org/@go/page/32585
Table of Contents http://www.randomservices.org/random/index.html Random is a website devoted to probability, mathematical statistics, and stochastic processes, and is intended for teachers and students of these subjects. The site consists of an integrated set of components that includes expository text, interactive web apps, data sets, biographical sketches, and an object library. Please read the Introduction for more information about the content, structure, mathematical prerequisites, technologies, and organization of the project.
1: Foundations 1.1: Sets 1.2: Functions 1.3: Relations 1.4: Partial Orders 1.5: Equivalence Relations 1.6: Cardinality 1.7: Counting Measure 1.8: Combinatorial Structures 1.9: Topological Spaces 1.10: Metric Spaces 1.11: Measurable Spaces 1.12: Special Set Structures
2: Probability Spaces 2.1: Random Experiments 2.2: Events and Random Variables 2.3: Probability Measures 2.4: Conditional Probability 2.5: Independence 2.6: Convergence 2.7: Measure Spaces 2.8: Existence and Uniqueness 2.9: Probability Spaces Revisited 2.10: Stochastic Processes 2.11: Filtrations and Stopping Times
3: Distributions 3.1: Discrete Distributions 3.2: Continuous Distributions 3.3: Mixed Distributions 3.4: Joint Distributions 3.5: Conditional Distributions 3.6: Distribution and Quantile Functions 3.7: Transformations of Random Variables 3.8: Convergence in Distribution 3.9: General Distribution Functions 3.10: The Integral With Respect to a Measure 3.11: Properties of the Integral 3.12: General Measures 3.13: Absolute Continuity and Density Functions 3.14: Function Spaces
1
https://stats.libretexts.org/@go/page/10321
4: Expected Value 4.1: New Page 4.2: New Page 4.3: New Page 4.4: New Page 4.5: New Page 4.6: New Page 4.7: New Page 4.8: New Page 4.9: New Page 4.10: New Page
5: Special Distributions 5.1: New Page 5.2: New Page 5.3: New Page 5.4: New Page 5.5: New Page 5.6: New Page 5.7: New Page 5.8: New Page 5.9: New Page 5.10: New Page
6: Random Samples 6.1: New Page 6.2: New Page 6.3: New Page 6.4: New Page 6.5: New Page 6.6: New Page 6.7: New Page 6.8: New Page 6.9: New Page 6.10: New Page
7: Point Estimation 7.1: New Page 7.2: New Page 7.3: New Page 7.4: New Page 7.5: New Page 7.6: New Page 7.7: New Page 7.8: New Page 7.9: New Page 7.10: New Page
8: Set Estimation 8.1: New Page 8.2: New Page 8.3: New Page
2
https://stats.libretexts.org/@go/page/10321
8.4: New Page 8.5: New Page 8.6: New Page 8.7: New Page 8.8: New Page 8.9: New Page 8.10: New Page
9: Hypothesis Testing 9.1: New Page 9.2: New Page 9.3: New Page 9.4: New Page 9.5: New Page 9.6: New Page 9.7: New Page 9.8: New Page 9.9: New Page 9.10: New Page
10: Geometric Models 10.1: New Page 10.2: New Page 10.3: New Page 10.4: New Page 10.5: New Page 10.6: New Page 10.7: New Page 10.8: New Page 10.9: New Page 10.10: New Page
11: Bernoulli Trials 11.1: New Page 11.2: New Page 11.3: New Page 11.4: New Page 11.5: New Page 11.6: New Page 11.7: New Page 11.8: New Page 11.9: New Page 11.10: New Page
12: Finite Sampling Models 12.1: New Page 12.2: New Page 12.3: New Page 12.4: New Page 12.5: New Page 12.6: New Page 12.7: New Page 12.8: New Page
3
https://stats.libretexts.org/@go/page/10321
12.9: New Page 12.10: New Page
13: Games of Chance 13.1: New Page 13.2: New Page 13.3: New Page 13.4: New Page 13.5: New Page 13.6: New Page 13.7: New Page 13.8: New Page 13.9: New Page 13.10: New Page
14: The Poisson Process 14.1: New Page 14.2: New Page 14.3: New Page 14.4: New Page 14.5: New Page 14.6: New Page 14.7: New Page 14.8: New Page 14.9: New Page 14.10: New Page
15: Renewal Processes 15.1: New Page 15.2: New Page 15.3: New Page 15.4: New Page 15.5: New Page 15.6: New Page 15.7: New Page 15.8: New Page 15.9: New Page 15.10: New Page
16: Markov Processes 16.1: New Page 16.2: New Page 16.3: New Page 16.4: New Page 16.5: New Page 16.6: New Page 16.7: New Page 16.8: New Page 16.9: New Page 16.10: New Page
4
https://stats.libretexts.org/@go/page/10321
17: Martingales 17.1: New Page 17.2: New Page 17.3: New Page 17.4: New Page 17.5: New Page 17.6: New Page 17.7: New Page 17.8: New Page 17.9: New Page 17.10: New Page
18: Brownian Motion
5
https://stats.libretexts.org/@go/page/10321
Object Library
1
https://stats.libretexts.org/@go/page/10316
Credits
1
https://stats.libretexts.org/@go/page/10317
Sources and Resources
1
https://stats.libretexts.org/@go/page/10318
CHAPTER OVERVIEW 1: Foundations In this chapter we review several mathematical topics that form the foundation of probability and mathematical statistics. These include the algebra of sets and functions, general relations with special emphasis on equivalence relations and partial orders, counting measure, and some basic combinatorial structures such as permuations and combinations. We also discuss some advanced topics from topology and measure theory. You may wish to review the topics in this chapter as the need arises. 1.1: Sets 1.2: Functions 1.3: Relations 1.4: Partial Orders 1.5: Equivalence Relations 1.6: Cardinality 1.7: Counting Measure 1.8: Combinatorial Structures 1.9: Topological Spaces 1.10: Metric Spaces 1.11: Measurable Spaces 1.12: Special Set Structures
This page titled 1: Foundations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1
1.1: Sets Set theory is the foundation of probability and statistics, as it is for almost every branch of mathematics.
Sets and subsets In this text, sets and their elements are primitive, self-evident concepts, an approach that is sometimes referred to as naive set theory. A set is simply a collection of objects; the objects are referred to as elements of the set. The statement that x is an element of set S is written x ∈ S , and the negation that x is not an element of S is written as x ∉ S . By definition, a set is completely determined by its elements; thus sets A and B are equal if they have the same elements: A = B if and only if x ∈ A
⟺
x ∈ B
(1.1.1)
Our next definition is the subset relation, another very basic concept. If A and B are sets then A is a subset of B if every element of A is also an element of B : A ⊆ B if and only if x ∈ A
⟹
x ∈ B
(1.1.2)
Concepts in set theory are often illustrated with small, schematic sketches known as Venn diagrams, named for John Venn. The Venn diagram in the picture below illustrates the subset relation.
Figure 1.1.1 : A ⊆ B
As noted earlier, membership is a primitive, undefined concept in naive set theory. However, the following construction, known as Russell's paradox, after the mathematician and philosopher Bertrand Russell, shows that we cannot be too cavalier in the construction of sets. Let R be the set of all sets A such that A ∉ A . Then R ∈ R if and only if R ∉ R . Proof Usually, the sets under discussion in a particular context are all subsets of a well-defined, specified set S , often called a universal set. The use of a universal set prevents the type of problem that arises in Russell's paradox. That is, if S is a given set and p(x) is a predicate on S (that is, a valid mathematical statement that is either true or false for each x ∈ S ), then {x ∈ S : p(x)} is a valid subset of S . Defining a set in this way is known as predicate form. The other basic way to define a set is simply be listing its elements; this method is known as list form. In contrast to a universal set, the empty set, denoted ∅, is the set with no elements. ∅ ⊆A
for every set A .
Proof One step up from the empty set is a set with just one element. Such a set is called a singleton set. The subset relation is a partial order on the collection of subsets of S . Suppose that A , B and C are subsets of a set S . Then 1. A ⊆ A (the reflexive property). 2. If A ⊆ B and B ⊆ A then A = B (the anti-symmetric property). 3. If A ⊆ B and B ⊆ C then A ⊆ C (the transitive property).
1.1.1
https://stats.libretexts.org/@go/page/10116
Here are a couple of variations on the subset relation. Suppose that A and B are sets. 1. If A ⊆ B and A ≠ B , then A is a strict subset of B and we sometimes write A ⊂ B . 2. If ∅ ⊂ A ⊂ B , then A is called a proper subset of B . The collection of all subsets of a given set frequently plays an important role, particularly when the given set is the universal set. If S is a set, then the set of all subsets of S is known as the power set of S and is denoted P(S) .
Special Sets The following special sets are used throughout this text. Defining them will also give us practice using list and predicate form. Special Sets 1. R denotes the set of real numbers and is the universal set for the other subsets in this list. 2. N = {0, 1, 2, …} is the set of natural numbers 3. N = {1, 2, 3, …} is the set of positive integers 4. Z = {… , −2, −1, 0, 1, 2, …}is the set of integers 5. Q = {m/n : m ∈ Z and n ∈ N } is the set of rational numbers 6. A = {x ∈ R : p(x) = 0 for some polynomial p with integer coefficients} is the set of algebraic numbers. +
+
Note that N ⊂ N ⊂ Z ⊂ Q ⊂ A ⊂ R . We will also occasionally need the set of complex numbers C = {x + iy : x, where i is the imaginary unit. The following special rational numbers turn out to be useful for various constructions. +
y ∈ R}
For n ∈ N , a rational number of the form j/2 where j ∈ Z is odd is a dyadic rational (or binary rational) of rank n . n
1. For n ∈ N , the set of dyadic rationals of rank n or less is D = {j/2 2. The set of all dyadic rationals is D = {j/2 : j ∈ Z and n ∈ N} .
n
n
: j ∈ Z}
.
n
Note that D = Z and D ⊂ D for n ∈ N , and of course, D ⊂ Q . We use the usual notation for intervals of real numbers, but again the definitions provide practice with predicate notation. 0
Suppose that a,
n
b ∈ R
n+1
with a < b .
1. [a, b] = {x ∈ R : a ≤ x ≤ b} . This interval is closed. 2. (a, b) = {x ∈ R : a < x < b} . This interval is open. 3. [a, b) = {x ∈ R : a ≤ x < b} . This interval is closed-open. 4. (a, b] = {x ∈ R : a < x ≤ b} . This interval is open-closed. The terms open and closed are actually topological concepts. You may recall that x ∈ R is rational if and only if the decimal expansion of x either terminates or forms a repeating block. The binary rationals have simple binary expansions (that is, expansions in the base 2 number system). A number x ∈ R is a binary rational of rank (after the separator).
n ∈ N+
if and only if the binary expansion of
x
is finite, with
1
in position
n
Proof
Set Operations We are now ready to review the basic operations of set theory. For the following definitions, suppose that A and B are subsets of a universal set, which we will denote by S . The union of A and B is the set obtained by combining the elements of A and B . A ∪ B = {x ∈ S : x ∈ A or x ∈ B}
1.1.2
(1.1.3)
https://stats.libretexts.org/@go/page/10116
The intersection of A and B is the set of elements common to both A and B : A ∩ B = {x ∈ S : x ∈ A and x ∈ B}
(1.1.4)
If A ∩ B = ∅ then A and B are disjoint. So A and B are disjoint if the two sets have no elements in common. The set difference of B and A is the set of elements that are in B but not in A : B ∖ A = {x ∈ S : x ∈ B and x ∉ A}
Sometimes (particularly in older works and particularly when A ⊆ B , B − A is known as proper set difference.
A ⊆B
), the notation
(1.1.5)
B−A
is used instead of
B∖A
. When
The complement of A is the set of elements that are not in A : c
A
= {x ∈ S : x ∉ A}
(1.1.6)
Note that union, intersection, and difference are binary set operations, while complement is a unary set operation. In the Venn diagram app, select each of the following and note the shaded area in the diagram. 1. A 2. B 3. A 4. B 5. A ∪ B 6. A ∩ B c
c
Basic Rules In the following theorems, A , B , and C are subsets of a universal set S . The proofs are straightforward, and just use the definitions and basic logic. Try the proofs yourself before reading the ones in the text. A∩B ⊆ A ⊆ A∪B
.
The identity laws: 1. A ∪ ∅ = A 2. A ∩ S = A So the empty set acts as an identity relative to the union operation, and the universal set acts as an identiy relative to the intersection operation. The idempotent laws: 1. A ∪ A = A 2. A ∩ A = A The complement laws: 1. A ∪ A 2. A ∩ A
c c
=S =∅
The double complement law: (A
c
c
)
=A
The commutative laws: 1. A ∪ B = B ∪ A
1.1.3
https://stats.libretexts.org/@go/page/10116
2. A ∩ B = B ∩ A Proof The associative laws: 1. A ∪ (B ∪ C ) = (A ∪ B) ∪ C 2. A ∩ (B ∩ C ) = (A ∩ B) ∩ C Proof Thus, we can write A ∪ B ∪ C without ambiguity. Note that x is an element of this set if and only if x is an element of at least one of the three given sets. Similarly, we can write A ∩ B ∩ C without ambiguity. Note that x is an element of this set if and only if x is an element of all three of the given sets. The distributive laws: 1. A ∩ (B ∪ C ) = (A ∩ B) ∪ (A ∩ C ) 2. A ∪ (B ∩ C ) = (A ∪ B) ∩ (A ∪ C ) Proof So intersection distributes over union, and union distributes over intersection. It's interesting to compare the distributive properties of set theory with those of the real number system. If x, y, z ∈ R , then x(y + z) = (xy) + (xz) , so multiplication distributes over addition, but it is not true that x + (yz) = (x + y)(x + z) , so addition does not distribute over multiplication. The following results are particularly important in probability theory. DeMorgan's laws (named after Agustus DeMorgan): 1. (A ∪ B) 2. (A ∩ B)
c c
c
=A
c
=A
c
∩B
c
∪B
.
Proof The following result explores the connections between the subset relation and the set operations. The following statements are equivalent: 1. A ⊆ B 2. B ⊆ A 3. A ∪ B = B 4. A ∩ B = A 5. A ∖ B = ∅ c
c
Proof In addition to the special sets defined earlier, we also have the following: More special sets 1. R ∖ Q is the set of irrational numbers 2. R ∖ A is the set of transcendental numbers Since Q ⊂ A ⊂ R it follows that R ∖ A ⊂ R ∖ Q , that is, every transcendental number is also irrational. Set difference can be expressed in terms of complement and intersection. All of the other set operations (complement, union, and intersection) can be expressed in terms of difference. Results for set difference: 1. B ∖ A = B ∩ A 2. A = S ∖ A 3. A ∩ B = A ∖ (A ∖ B) 4. A ∪ B = S ∖ {(S ∖ A) ∖ [(S ∖ A) ∖ (S ∖ B)]} c
c
1.1.4
https://stats.libretexts.org/@go/page/10116
Proof So in principle, we could do all of set theory using the one operation of set difference. But as (c) and (d) suggest, the results would be hideous. .
(A ∪ B) ∖ (A ∩ B) = (A ∖ B) ∪ (B ∖ A)
Proof The set in the previous result is called the symmetric difference of A and B , and is sometimes denoted A △ B . The elements of this set belong to one but not both of the given sets. Thus, the symmetric difference corresponds to exclusive or in the same way that union corresponds to inclusive or. That is, x ∈ A ∪ B if and only if x ∈ A or x ∈ B (or both); x ∈ A △ B if and only if x ∈ A or x ∈ B , but not both. On the other hand, the complement of the symmetric difference consists of the elements that belong to both or neither of the given sets: c
(A △ B)
c
= (A ∩ B) ∪ (A
c
c
∩ B ) = (A
c
∪ B) ∩ (B
∪ A)
Proof There are 16 different (in general) sets that can be constructed from two given events A and B . Proof Open the Venn diagram app. This app lists the 16 sets that can be constructed from given sets operations. 1. Select each of the four subsets in the proof of the last exercise: A ∩ B , A ∩ B , A ∩ B , and A disjoint and their union is S . 2. Select each of the other 12 sets and show how each is a union of some of the sets in (a). c
c
c
A
c
∩B
and
B
using the set
. Note that these are
General Operations The operations of union and intersection can easily be extended to a finite or even an infinite collection of sets.
Definitions Suppose that A is a nonempty collection of subsets of a universal set S . In some cases, the subsets in A may be naturally indexed by a nonempty index set I , so that A = {A : i ∈ I } . (In a technical sense, any collection of subsets can be indexed.) i
The union of the collection of sets A is the set obtained by combining the elements of the sets in A : ⋃ A = {x ∈ S : x ∈ A for some A ∈ A }
If A
= { Ai : i ∈ I }
(1.1.16)
, so that the collection of sets is indexed, then we use the more natural notation: ⋃ Ai = {x ∈ S : x ∈ Ai for some i ∈ I }
(1.1.17)
i∈I
The intersection of the collection of sets A is the set of elements common to all of the sets in A : ⋂ A = {x ∈ S : x ∈ A for all A ∈ A }
If A
= { Ai : i ∈ I }
(1.1.18)
, so that the collection of sets is indexed, then we use the more natural notation: ⋂ Ai = {x ∈ S : x ∈ Ai for all i ∈ I }
(1.1.19)
i∈I
Often the index set is an “integer interval” of N. In such cases, an even more natural notation is to use the upper and lower limits of the index set. For example, if the collection is {A : i ∈ N } then we would write ⋃ A for the union and ⋂ A for the intersection. Similarly, if the collection is {A : i ∈ {1, 2, … , n}} for some n ∈ N , we would write ⋃ A for the union and ⋂ A for the intersection. ∞
i
+
i=1
i
+
n
i=1
∞
i
i=1
n
i=1
i
i
i
1.1.5
https://stats.libretexts.org/@go/page/10116
A collection of sets A is pairwise disjoint if the intersection of any two sets in the collection is empty: A ∩ B = ∅ for every A, B ∈ A with A ≠ B . A collection of sets A is said to partition a set B if the collection A is pairwise disjoint and ⋃ A
=B
.
Partitions are intimately related to equivalence relations. As an example, for n ∈ N , the set j Dn = {[
n
2
j+1 ,
n
) : j ∈ Z}
(1.1.20)
2
is a partition of R into intervals of equal length 1/2 . Note that the endpoints are the dyadic rationals of rank n or less, and that D can be obtained from D by dividing each interval into two equal parts. This sequence of partitions is one of the reasons that the dyadic rationals are important. n
n+1
n
Basic Rules In the following problems, A a subset of S .
= { Ai : i ∈ I }
is a collection of subsets of a universal set S , indexed by a nonempty set I , and B is
The general distributive laws: 1. (⋃ 2. (⋂
i∈I
Ai ) ∩ B = ⋃
(Ai ∩ B)
i∈I
Ai ) ∪ B = ⋂
(Ai ∪ B)
i∈I i∈I
Restate the laws in the notation where the collection A is not indexed. Proof The general De Morgan's laws: 1. (⋃ 2. (⋂
i∈I
i∈I
c
Ai )
c
= ⋂i∈I A
i
c
Ai )
c
= ⋃i∈I A
i
Restate the laws in the notation where the collection A is not indexed. Proof Suppose that the collection A partitions S . For any subset B , the collection {A ∩ B : A ∈ A } partitions B . Proof
Figure 1.1.2: A partition of S induces a partition of B Suppose that {A
i
1. ⋂ 2. ⋃
∞ n=1 ∞ n=1
∞
: i ∈ N+ }
is a collection of subsets of a universal set S
⋃
Ak = {x ∈ S : x ∈ Ak for infinitely many k ∈ N+ }
⋂
Ak = {x ∈ S : x ∈ Ak for all but finitely many k ∈ N+ }
k=n ∞ k=n
Proof The sets in the previous result turn out to be important in the study of probability.
Product sets Definitions Product sets are sets of sequences. The defining property of a sequence, of course, is that order as well as membership is important.
1.1.6
https://stats.libretexts.org/@go/page/10116
Let us start with ordered pairs. In this case, the defining property is that (a, b) = (c, d) if and only if a = c and b = d . Interestingly, the structure of an ordered pair can be defined just using set theory. The construction in the result below is due to Kazimierz Kuratowski Define (a, b) = {{a}, {a, b}}. This definition captures the defining property of an ordered pair. Proof Of course, it's important not to confuse the ordered pair (a, b) with the open interval (a, b), since the same notation is used for both. Usually it's clear form context which type of object is referred to. For ordered triples, the defining property is (a, b, c) = (d, e, f ) if and only if a = d , b = e , and c = f . Ordered triples can be defined in terms of ordered pairs, which via the last result, uses only set theory. Define (a, b, c) = (a, (b, c)) . This definition captures the defining property of an ordered triple. Proof All of this is just to show how complicated structures can be built from simpler ones, and ultimately from set theory. But enough of that! More generally, two ordered sequences of the same size (finite or infinite) are the same if and only if their corresponding coordinates agree. Thus for n ∈ N , the definition for n -tuples is (x , x , … , x ) = (y , y , … , y ) if and only if x = y for all i ∈ {1, 2, … , n}. For infinite sequences, (x , x , …) = (y , y , …) if and only if x = y for all i ∈ N . +
1
1
2
1
Suppose now that we have a sequence of n sets, (S as follows:
1,
2
n
1
2
i
2
n
i
i
i
+
, where n ∈ N . The Cartesian product of the sets is defined
S2 , … , Sn )
+
S1 × S2 × ⋯ × Sn = {(x1 , x2 , … , xn ) : xi ∈ Si for i ∈ {1, 2, … , n}}
(1.1.22)
Cartesian products are named for René Descartes. If S = S for each i, then the Cartesian product set can be written compactly as S , a Cartesian power. In particular, recall that R denotes the set of real numbers so that R is n -dimensional Euclidean space, named after Euclid, of course. The elements of {0, 1} are called bit strings of length n . As the name suggests, we sometimes represent elements of this product set as strings rather than sequences (that is, we omit the parentheses and commas). Since the coordinates just take two values, there is no risk of confusion. i
n
n
n
Suppose that we have an infinite sequence of sets (S
1,
. The Cartesian product of the sets is defined by
S2 , …)
S1 × S2 × ⋯ = {(x1 , x2 , …) : xi ∈ Si for each i ∈ {1, 2, …}}
(1.1.23)
When S = S for i ∈ N , the Cartesian product set is sometimes written as a Cartesian power as S or as S . An explanation for the last notation, as well as a much more general construction for products of sets, is given in the next section on functions. Also, notation similar to that of general union and intersection is often used for Cartesian product, with ∏ as the operator. So i
∞
+
n
N+
∞
∏ Si = S1 × S2 × ⋯ × Sn ,
∏ Si = S1 × S2 × ⋯
i=1
i=1
(1.1.24)
Rules for Product Sets We will now see how the set operations relate to the Cartesian product operation. Suppose that S and T are sets and that A ⊆ S , B ⊆ S and C ⊆ T , D ⊆ T . The sets in the theorems below are subsets of S × T . The most important rules that relate Cartesian product with union, intersection, and difference are the distributive rules: Distributive rules for product sets 1. A × (C ∪ D) = (A × C ) ∪ (A × D) 2. (A ∪ B) × C = (A × C ) ∪ (B × C ) 3. A × (C ∩ D) = (A × C ) ∩ (A × D) 4. (A ∩ B) × C = (A × C ) ∩ (B × C ) 5. A × (C ∖ D) = (A × C ) ∖ (A × D) 6. (A ∖ B) × C = (A × C ) ∖ (B × C ) Proof
1.1.7
https://stats.libretexts.org/@go/page/10116
In general, the product of unions is larger than the corresponding union of products. (A ∪ B) × (C ∪ D) = (A × C ) ∪ (A × D) ∪ (B × C ) ∪ (B × D)
Proof So in particular it follows that (A × C ) ∪ (B × D) ⊆ (A ∪ B) × (C ∪ D) same as the corresponding intersection of products.
. On the other hand, the product of intersections is the
(A × C ) ∩ (B × D) = (A ∩ B) × (C ∩ D)
Proof In general, the product of differences is smaller than the corresponding difference of products. (A ∖ B) × (C ∖ D) = [(A × C ) ∖ (A × D)] ∖ [(B × C ) ∖ (B × D)]
Proof So in particular it follows that (A ∖ B) × (C ∖ D) ⊆ (A × C ) ∖ (B × D)
,
Projections and Cross Sections In this discussion, suppose again that S and T are nonempty sets, and that C
⊆ S ×T
.
Cross Sections 1. The cross section of C in the first coordinate at x ∈ S is C = {y ∈ T 2. The cross section of C at in the second coordinate at y ∈ T is x
C
Note that C
x
⊆T
for x ∈ S and C
y
⊆S
y
: (x, y) ∈ C }
= {x ∈ S : (x, y) ∈ C }
(1.1.25)
for y ∈ T .
Projections 1. The projection of C onto T is C 2. The projection of C onto S is C
T S
. = {x ∈ S : (x, y) ∈ C for some y ∈ T } .
= {y ∈ T : (x, y) ∈ C for some x ∈ S}
The projections are the unions of the appropriate cross sections. Unions 1. C 2. C
T S
= ⋃x∈S Cx =⋃
y∈T
C
y
Cross sections are preserved under the set operations. We state the result for cross sections at analgous results hold for cross sections at y ∈ T . Suppose that C , 1. (C ∪ D) 2. (C ∩ D) 3. (C ∖ D)
D ⊆ S ×T
x
= Cx ∪ Dx
x
= Cx ∩ Dx
x
x ∈ S
. By symmetry, of course,
. Then for x ∈ S ,
= Cx ∖ Dx
Proof For projections, the results are a bit more complicated. We give the results for projections onto projections onto S are analogous. Suppose again that C , 1. (C ∪ D) 2. (C ∩ D)
D ⊆ S ×T
T
= CT ∪ DT
T
⊆ CT ∩ DT
T
; naturally the results for
. Then
1.1.8
https://stats.libretexts.org/@go/page/10116
3. (C
T
c
)
c
⊆ (C )T
Proof It's easy to see that equality does not hold in general in parts (b) and (c). In part (b) for example, suppose that A , A ⊆ S are nonempty and disjoint and B ⊆ T is nonempty. Let C = A × B and D = A × B . Then C ∩ D = ∅ so (C ∩ D) = ∅ . But C =D = B . In part (c) for example, suppose that A is a nonempty proper subset of S and B is a nonempty proper subset of T . Let C = A × B . Then C = B so (C ) = B . On the other hand, C = (A × B) ∪ (A × B ) ∪ (A × B ) , so (C ) = T . 1
1
T
2
2
T
T
c
T
c
c
c
c
c
c
T
c
T
Cross sections and projections will be extended to very general product sets in the next section on Functions.
Computational Exercises Subsets of R The universal set is [0, ∞). Let A = [0, 5] and B = (3, 7) . Express each of the following in terms of intervals: 1. A ∩ B 2. A ∪ B 3. A ∖ B 4. B ∖ A 5. A c
Answer The universal set is N. Let A = {n ∈ N : n is even} and let B = {n ∈ N : n ≤ 9} . Give each of the following: 1. A ∩ B in list form 2. A ∪ B in predicate form 3. A ∖ B in list form 4. B ∖ A in list form 5. A in predicate form 6. B in list form c
c
Answer
Coins and Dice Let S = {1, 2, 3, 4} × {1, 2, 3, 4, 5, 6}. This is the set of outcomes when a 4-sided die and a 6-sided die are tossed. Further let A = {(x, y) ∈ S : x = 2} and B = {(x, y) ∈ S : x + y = 7} . Give each of the following sets in list form: 1. A 2. B 3. A ∩ B 4. A ∪ B 5. A ∖ B 6. B ∖ A Answer Let S = {0, 1} . This is the set of outcomes when a coin is tossed 3 times (0 denotes tails and 1 denotes heads). Further let A = {(x , x , x ) ∈ S : x = 1} and B = {(x , x , x ) ∈ S : x + x + x = 2} . Give each of the following sets in list form, using bit-string notation: 3
1
2
3
2
1
2
3
1
2
3
1. S 2. A 3. B 4. A 5. B 6. A ∩ B 7. A ∪ B c
c
1.1.9
https://stats.libretexts.org/@go/page/10116
8. A ∖ B 9. B ∖ A Answer Let S = {0, 1} . This is the set of outcomes when a coin is tossed twice (0 denotes tails and 1 denotes heads). Give P(S) in list form. 2
Answer
Cards A standard card deck can be modeled by the Cartesian product set D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠}
(1.1.26)
where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q♡ for the queen of hearts). For the problems in this subsection, the card deck D is the universal set. Let H denote the set of hearts and F the set of face cards. Find each of the following: 1. H ∩ F 2. H ∖ F 3. F ∖ H 4. H △ F Answer A bridge hand is a subset of D with 13 cards. Often bridge hands are described by giving the cross sections by suit. Suppose that N is a bridge hand, held by a player named North, defined by N
♣
= {2, 5, q}, N
♢
= {1, 5, 8, q, k}, N
♡
= {8, 10, j, q}, N
♠
= {1}
(1.1.27)
Find each of the following: 1. The nonempty cross sections of N by denomination. 2. The projection of N onto the set of suits. 3. The projection of N onto the set of denominations Answer By contrast, it is usually more useful to describe a poker hand by giving the cross sections by denomination. In the usual version of draw poker, a hand is a subset of D with 5 cards. Suppose that B is a poker hand, held by a player named Bill, with B1 = {♣, ♠}, B8 = {♣, ♠}, Bq = {♡}
(1.1.28)
Find each of the following: 1. The nonempty cross sections of B by suit. 2. The projection of B onto the set of suits. 3. The projection of B onto the set of denominations Answer The poker hand in the last exercise is known as a dead man's hand. Legend has it that Wild Bill Hickock held this hand at the time of his murder in 1876.
General unions and intersections For the problems in this subsection, the universal set is R. Let A
n
= [0, 1 −
1 n
]
for n ∈ N . Find +
1.1.10
https://stats.libretexts.org/@go/page/10116
1. ⋂ 2. ⋃ 3. ⋂ 4. ⋃
∞ n=1 ∞ n=1 ∞ n=1 ∞ n=1
An An c
An c
An
Answer Let A
n
1. ⋂ 2. ⋃ 3. ⋂ 4. ⋃
= (2 −
∞ n=1 ∞ n=1 ∞ n=1 ∞ n=1
1 n
1
,5+
n
)
for n ∈ N . Find +
An An c
An c
An
Answer
Subsets of R
2
Let T be the closed triangular region in R with vertices (0, 0), (1, 0), and (1, 1). Find each of the following: 2
1. The cross section T 2. The cross section T 3. The projection of T 4. The projection of T
for x ∈ R for y ∈ R onto the horizontal axis onto the vertical axis
x
y
Answer This page titled 1.1: Sets is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.1.11
https://stats.libretexts.org/@go/page/10116
1.2: Functions Functions play a central role in probability and statistics, as they do in every other branch of mathematics. For the most part, the proofs in this section are straightforward, so be sure to try them yourself before reading the ones in the text.
Definitions and Properties Basic Definitions We start with the formal, technical definition of a function. It's not very intuitive, but has the advantage that it only requires set theory. A function f from a set S into a set T is a subset of the product set S × T with the property that for each element x ∈ S , there exists a unique element y ∈ T such that (x, y) ∈ f . If f is a function from S to T we write f : S → T . If (x, y) ∈ f we write y = f (x). Less formally, a function f from S into T is a “rule” (or “procedure” or “algorithm”) that assigns to each x ∈ S a unique element f (x) ∈ T . The definition of a function as a set of ordered pairs, is due to Kazimierz Kuratowski. The term map or mapping is also used in place of function, so we could say that f maps S into T .
Figure 1.2.1: A function f from S into T The sets S and T in the definition are clearly important. Suppose that f
: S → T
.
1. The set S is the domain of f . 2. The set T is the range space or co-domain of f . 3. The range of f is the set of function values. That is, range (f ) = {y ∈ T
: y = f (x) for some x ∈ S}
.
The domain and range are completely specified by a function. That's not true of the co-domain: if f is a function from S into T , and U is another set with T ⊆ U , then we can also think of f as a function from S into U . The following definitions are natural and important. Suppose again that f
: S → T
.
1. f maps S onto T if range (f ) = T . That is, for each y ∈ T there exists x ∈ S such that f (x) = y . 2. f is one-to-one if distinct elements in the domain are mapped to distinct elements in the range. That is, if u, u ≠ v then f (u) ≠ f (v) . Clearly a function always maps its domain onto its range. Note also that u, v ∈ S .
f
is one-to-one if
f (u) = f (v)
v∈ S
implies
and
u =v
for
Inverse functions A funtion that is one-to-one and onto can be “reversed” in a sense. If f maps S one-to-one onto T , the inverse of f is the function f f
−1
(y) = x
⟺
−1
from T onto S given by
f (x) = y;
1.2.1
x ∈ S, y ∈ T
(1.2.1)
https://stats.libretexts.org/@go/page/10117
If you like to think of a function as a set of ordered pairs, then f = {(y, x) ∈ T × S : (x, y) ∈ f } . The fact that f is one-to-one and onto ensures that f is a valid function from T onto S . Sets S and T are in one-to-one correspondence if there exists a oneto-one function from S onto T . One-to-one correspondence plays an essential role in the study of cardinality. −1
−1
Restrictions The domain of a function can be restricted to create a new funtion. Suppose that f to A .
: S → T
and that A ⊆ S . The function f
A
As a set of ordered pairs, note that f
A
: A → T
= {(x, y) ∈ f : x ∈ A}
defined by f
A
(x) = f (x)
for x ∈ A is the restriction of f
.
Composition Composition is perhaps the most important way to combine two functions to create another function. Suppose that g : R → S and f
: S → T
. The composition of f with g is the function f ∘ g : R → T defined by (f ∘ g) (x) = f (g(x)) ,
x ∈ R
(1.2.2)
Composition is associative: Suppose that h : R → S , g : S → T , and f
: T → U
. Then
f ∘ (g ∘ h) = (f ∘ g) ∘ h
(1.2.3)
Proof Thus we can write f ∘ g ∘ h without ambiguity. On the other hand, composition is not commutative. Indeed depending on the domains and co-domains, f ∘ g might be defined when g ∘ f is not. Even when both are defined, they may have different domains and co-domains, and so of course cannot be the same function. Even when both are defined and have the same domains and codomains, the two compositions will not be the same in general. Examples of all of these cases are given in the computational exercises below. Suppose that g : R → S and f
: S → T
.
1. If f and g are one-to-one then f ∘ g is one-to-one. 2. If f and g are onto then f ∘ g is onto. Proof The identity function on a set S is the function I from S onto S defined by I
S (x)
S
=x
for x ∈ S
The identity function acts like an identity with respect to the operation of composition. If f
: S → T
then
1. f ∘ I = f 2. I ∘ f = f S
T
Proof The inverse of a function is really the inverse with respect to composition. Suppose that f is a one-to-one function from S onto T . Then 1. f ∘ f 2. f ∘ f −1
−1
= IS = IT
Proof An element x ∈ S can be thought of as a function from {1, 2, … , n} into S . Similarly, an element x ∈ S can be thought of as a function from N into S . For such a sequence x, of course, we usually write x instead of x(i). More generally, if S and T are n
∞
+
i
1.2.2
https://stats.libretexts.org/@go/page/10117
sets, then the set of all functions from more accurately) written as S .
S
into
T
is denoted by
T
S
. In particular, as we noted in the last section,
S
∞
is also (and
N+
Suppose that (f ∘ g)
−1
=g
is a one-to-one function from ∘f .
g −1
R
onto
S
and that
f
is a one-to-one function from
S
onto
T
. Then
−1
Proof
Inverse Images Inverse images of a function play a fundamental role in probability, particularly in the context of random variables. Suppose that f
. If A ⊆ T , the inverse image of A under f is the subset of S given by
: S → T
f
So f
−1
(A)
−1
(A) = {x ∈ S : f (x) ∈ A}
(1.2.4)
is the subset of S consisting of those elements that map into A .
Figure 1.2.2: The inverse image of A under f Technically, the inverse images define a new function from P(T ) into P(S) . We use the same notation as for the inverse function, which is defined when f is one-to-one and onto. These are very different functions, but usually no confusion results. The following important theorem shows that inverse images preserve all set operations. Suppose that f
, and that A,
: S → T
B ⊆T
. Then
1. f (A ∪ B) = f (A) ∪ f (B) 2. f (A ∩ B) = f (A) ∩ f (B) 3. f (A ∖ B) = f (A) ∖ f (B) 4. If A ⊆ B then f (A) ⊆ f (B) 5. If A and B are disjoint, so are f (A) and f −1
−1
−1
−1
−1
−1 −1
−1
−1
−1
−1
−1
−1
(B)
Proof The result in part (a) holds for arbitrary unions, and the result in part (b) holds for arbitrary intersections. No new ideas are involved; only the notation is more complicated. Suppose that {A
i
1. f 2. f
−1
is a collection of subsets of T , where I is a nonempty index set. Then
(⋃
Ai ) = ⋃
f
(⋂
Ai ) = ⋂
f
i∈I
−1
: i ∈ I}
i∈I
i∈I i∈I
−1 −1
(Ai ) (Ai )
Proof
Forward Images Forward images of a function are a naturally complement to inverse images. Suppose again that f
: S → T
. If A ⊆ S , the forward image of A under f is the subset of T given by f (A) = {f (x) : x ∈ A}
(1.2.5)
So f (A) is the range of f restricted to A .
1.2.3
https://stats.libretexts.org/@go/page/10117
Figure 1.2.3: The forward image of A under f Technically, the forward images define a new function from P(S) into P(T ) , but we use the same symbol f for this new function as for the underlying function from S into T that we started with. Again, the two functions are very different, but usually no confusion results. It might seem that forward images are more natural than inverse images, but in fact, the inverse images are much more important than the forward ones (at least in probability and measure theory). Fortunately, the inverse images are nicer as well—unlike the inverse images, the forward images do not preserve all of the set operations. Suppose that f
: S → T
, and that A,
B ⊆S
. Then
1. f (A ∪ B) = f (A) ∪ f (B) . 2. f (A ∩ B) ⊆ f (A) ∩ f (B) . Equality holds if f is one-to-one. 3. f (A) ∖ f (B) ⊆ f (A ∖ B) . Equality holds if f is one-to-one. 4. If A ⊆ B then f (A) ⊆ f (B) . Proof The result in part (a) hold for arbitrary unions, and the results in part (b) hold for arbitrary intersections. No new ideas are involved; only the notation is more complicated. Suppose that {A
i
1. f (⋃ 2. f (⋂
: i ∈ I}
is a collection of subsets of S , where I is a nonempty index set. Then
i∈I
Ai ) = ⋃
i∈I
Ai ) ⊆ ⋂i∈I f (Ai )
i∈I
f (Ai )
. . Equality holds if f is one-to-one.
Proof Suppose again that f : S → T . As noted earlier, the forward images of f define a function from P(S) into P(T ) and the inverse images define a function from P(T ) into P(S) . One might hope that these functions are inverses of one another, but alas no. Suppose that f 1. A ⊆ f 2. f [ f
−1
−1
: S → T
[f (A)]
(B)] ⊆ B
.
for A ⊆ S . Equality holds if f is one-to-one. for B ⊆ T . Equality holds if f is onto.
Proof
Spaces of Real Functions Real-valued function on a given set pointwise. Suppose thatf ,
g : S → R
S
are of particular importance. The usual arithmetic operations on such functions are defined
and c ∈ R , then f + g,
f − g, f g, cf , f /g : S → R
are defined as follows for all x ∈ S .
1. (f + g)(x) = f (x) + g(x) 2. (f − g)(x) = f (x) − g(x) 3. (f g)(x) = f (x)g(x) 4. (cf )(x) = cf (x) 5. (f /g)(x) = f (x)/g(x) assuming that g(x) ≠ 0 for x ∈ S . Now let V denote the collection of all functions from the given set S into R. A fact that is very important in probability as well as other branches of analysis is that V , with addition and scalar multiplication as defined above, is a vector space. The zero function 0 is defined, of course, by 0(x) = 0 for all x ∈ S .
1.2.4
https://stats.libretexts.org/@go/page/10117
(V , +, ⋅)
is a vector space over R. That is, for all f ,
g, h ∈ V
and a,
b ∈ R
1. f + g = g + f , the commutative property of vector addition. 2. f + (g + h) = (f + g) + h , the associative property of vector addition. 3. a(f + g) = af + ag , scalar multiplication distributes over vector addition. 4. (a + b)f = af + bf , scalar multiplication distributive over scalar addition. 5. f + 0 = f , the existence of an zero vector. 6. f + (−f ) = 0 , the existence of additive inverses. 7. 1 ⋅ f = f , the unity property. Proof Various subspaces of V are important in probability as well. We will return to the discussion of vector spaces of functions in the sections on partial orders and in the advanced sections on metric spaces and measure theory.
Indicator Functions For our next discussion, suppose that S is the universal set, so that all other sets mentioned are subsets of S . Suppose that A ⊆ S . The indicator function of A is the function 1
: S → {0, 1}
A
1A (x) = {
1,
x ∈ A
0,
x ∉ A
Thus, the indicator function of A simply indicates whether or not takes the values 0 and 1 is an indicator function. If f
: S → {0, 1}
x ∈ A
then f is the indicator function of the set A = f
−1
for each
defined as follows: (1.2.6)
x ∈ S
. Conversely, any function on
{1} = {x ∈ S : f (x) = 1}
S
.
Thus, there is a one-to-one correspondence between P(S) , the power set of S , and the collection of indicator functions The next result shows how the set algebra of subsets corresponds to the arithmetic algebra of the indicator functions. Suppose that A,
B ⊆S
A
B
A
A∪B
A
c
.
S
{0, 1}
. Then
1. 1 =1 1 = min { 1 , 1 } 2. 1 = 1 − (1 − 1 ) (1 − 1 ) = max { 1 3. 1 = 1 − 1 4. 1 =1 (1 − 1 ) 5. A ⊆ B if and only if 1 ≤ 1 A∩B
that just
B
A,
B
1B }
A
A
A
A∖B
B
A
B
Proof The results in part (a) extends to arbitrary intersections and the results in part (b) extends to arbitrary unions. Suppose that {A
i
1. 1 2. 1
⋂
i∈I
Ai
⋃i∈I Ai
=∏
: i ∈ I}
i∈I
1A
i
= 1 −∏
i∈I
is a collection of subsets of S , where I is a nonempty index set. Then
= min { 1A
i
: i ∈ I}
(1 − 1A ) = max { 1A i
i
: i ∈ I}
Proof
Multisets A multiset is like an ordinary set, except that elements may be repeated. A multiset A (with elements from a universal set S ) can be uniquely associated with its multiplicity function m : S → N , where m (x) is the number of times that element x is in A for each x ∈ S . So the multiplicity function of a multiset plays the same role that an indicator function does for an ordinary set. Multisets arise naturally when objects are sampled with replacement (but without regard to order) from a population. Various sampling models are explored in the section on Combinatorial Structures. We will not go into detail about the operations on multisets, but the definitions are straightforward generalizations of those for ordinary sets. A
A
Suppose that A and B are multisets with elements from the universal set S . Then
1.2.5
https://stats.libretexts.org/@go/page/10117
1. A ⊆ B if and only if m ≤ m 2. m = max{ m , m } 3. m = min{ m , m } 4. m =m +m A
A∪B
A
A∩B
A
A+B
A
B
B
B
B
Product Spaces Using functions, we can generalize the Cartesian products studied earlier. In this discussion, we suppose that S is a set for each i in a nonempty index set I . i
Define the product set ∏ Si = {x : x is a function from I into ⋃ Si such that x(i) ∈ Si for each i ∈ I } i∈I
(1.2.7)
i∈I
Note that except for being nonempty, there are no assumptions on the cardinality of the index set I . Of course, if I = {1, 2 … , n} for some n ∈ N , or if I = N then this construction reduces to S × S × ⋯ × S and to S × S × ⋯ , respectively. Since we want to make the notation more closely resemble that of simple Cartesian products, we will write x instead of x(i) for the value of the function x at i ∈ I , and we sometimes refer to this value as the ith coordinate of x. Finally, note that if S = S for each i ∈ I , then ∏ S is simply the set of all functions from I into S , which we denoted by S above. +
+
1
2
n
1
2
i
i
I
i∈I
i
For j ∈ I define the projection p
j
: ∏
i∈I
Si → Sj
by p
j (x)
for x ∈ ∏
= xj
i∈I
Si
.
So p (x) is just the j th coordinate of x. The projections are of basic importance for product spaces. In particular, we have a better way of looking at projections of a subset of a product set. j
For A ⊆ ∏
i∈I
Si
and j ∈ I , the forward image p
j (A)
is the projection of A onto S . j
Proof So the properties of projection that we studied in the last section are just special cases of the properties of forward images. Projections also allow us to get coordinate functions in a simple way. Suppose that R is a set, and that f
: R → ∏
i∈I
Si
. If j ∈ I then p
j
∘ f : R → Sj
is the j th coordinate function of f .
Proof This will look more familiar for a simple cartesian product. If f f : R → S is the j th coordinate function for j ∈ {1, 2, … , n}. j
: R → S1 × S2 × ⋯ × Sn
, then
f = (f1 , f2 , … , fn )
where
i
Cross sections of a subset of a product set can be expressed in terms of inverse images of a function. First we need some additional notation. Suppose that our index set I has at least two elements. For j ∈ I and u ∈ S , define j : ∏ S → ∏ S by j (x) = y where y = x for i ∈ I − {j} and y = u . In words, j takes a point x ∈ ∏ S and assigns u to coordinate j to produce the point y ∈ ∏ S . j
u
i
i
i∈I
j
u
i∈I−{j}
u
i∈I−{j}
i
i∈I
i
i
i
In the setting above, if j ∈ I , u ∈ S and A ⊆ ∏ j
i∈I
Si
then j
−1 (A) u
is the cross section of A in the j th coordinate at u.
Proof Let's look at this for the product of two sets S and T . For x ∈ S , the function 1 : T → S × T is given by 1 (y) = (x, y). Similarly, for y ∈ T , the function 2 : S → S × T is given by 2 (x) = (x, y). Suppose now that A ⊆ S × T . If x ∈ S , then 1 (A) = {y ∈ T : (x, y) ∈ A} , the very definition of the cross section of A in the first coordinate at x. Similarly, if y ∈ T , then 2 (A) = {x ∈ S : (x, y) ∈ A} , the very definition of the cross section of A in the second coordinate at y . This construction is not particularly important except to show that cross sections are inverse images. Thus the fact that cross sections preserve all of the set operations is a simple consequence of the fact that inverse images generally preserve set operations. x
y
x
y
−1 x
−1 y
Operators Sometimes functions have special interpretations in certain settings.
1.2.6
https://stats.libretexts.org/@go/page/10117
Suppose that S is a set. 1. A function f : S → S is sometimes called a unary operator on S . 2. A function g : S × S → S is sometimes called a binary operator on S . As the names suggests, a unary operator f operates on an element x ∈ S to produce a new element f (x) ∈ S . Similarly, a binary operator g operates on a pair of elements (x, y) ∈ S × S to produce a new element g(x, y) ∈ S . The arithmetic operators are quintessential examples: The following are operators on R: 1. minus(x) = −x is a unary operator. 2. sum(x, y) = x + y is a binary operator. 3. product(x, y) = x y is a binary operator. 4. difference(x, y) = x − y is a binary operator. For a fixed universal set S , the set operators studied in the section on Sets provide other examples. For a given set S , the following are operators on P(S) : 1. complement(A) = A is a unary operator. 2. union(A, B) = A ∪ B is a binary operator. 3. intersect(A, B) = A ∩ B is a binary operator. 4. difference(A, B) = A ∖ B is a binary operator. c
As these examples illustrate, a binary operator is often written as x f y rather than f (x, y). Still, it is useful to know that operators are simply functions of a special type. Suppose that f is a unary operator on a set S , g is a binary operator on S , and that A ⊆ S . 1. A is closed under f if x ∈ A implies f (x) ∈ A. 2. A is closed under g if (x, y) ∈ A × A implies g(x, y) ∈ A . Thus if A is closed under the unary operator f , then f restricted to A is unary operator on A . Similary if A is closed under the binary operator g , then g restricted to A × A is a binary operator on A . Let's return to our most basic example. For the arithmetic operatoes on R, 1. N is closed under plus and times, but not under minus and difference. 2. Z is closed under plus, times, minus, and difference. 3. Q is closed under plus, times, minus, and difference. Many properties that you are familiar with for special operators (such as the arithmetic and set operators) can now be formulated generally. Suppose that f and g are binary operators on a set S . In the following definitions, x, y , and z are arbitrary elements of S . 1. f is commutative if f (x, y) = f (y, x), that is, x f y = y f x 2. f is associative if f (x, f (y, z)) = f (f (x, y), z), that is, x f (y f z) = (x f y) f z. 3. g distributes over f (on the left) if g(x, f (y, z)) = f (g(x, y), g(x, z)), that is, x g (y f z) = (x g y) f (x g z)
The Axiom of Choice

Suppose that 𝒮 is a collection of nonempty subsets of a set S. The axiom of choice states that there exists a function f : 𝒮 → S with the property that f(A) ∈ A for each A ∈ 𝒮. The function f is known as a choice function.

Stripped of most of the mathematical jargon, the idea is very simple. Since each set A ∈ 𝒮 is nonempty, we can select an element of A; we will call the element we select f(A), and thus our selections define a function. In fact, you may wonder why we need an
axiom at all. The problem is that we have not given a rule (or procedure or algorithm) for selecting the elements of the sets in the collection. Indeed, we may not know enough about the sets in the collection to define a specific rule, so in such a case, the axiom of choice simply guarantees the existence of a choice function. Some mathematicians, known as constructivists, do not accept the axiom of choice, and insist on well-defined rules for constructing functions.

A nice consequence of the axiom of choice is a type of duality between one-to-one functions and onto functions.

Suppose that f is a function from a set S onto a set T. Then there exists a one-to-one function g from T into S.

Proof

Suppose that f is a one-to-one function from a set S into a set T. Then there exists a function g from T onto S.

Proof
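For finite sets no axiom is needed: a choice function can be computed directly. The sketch below (the example function and names are assumptions) builds a one-to-one g : T → S from an onto f : S → T by choosing one element from each nonempty inverse image f^{−1}{t}:

S = {1, 2, 3, 4, 5, 6}
f = lambda x: x % 3        # maps S onto T = {0, 1, 2}

def right_inverse(f, S, T):
    # for each t, choose an element of the fiber f^{-1}{t} (here, the minimum)
    return {t: min(x for x in S if f(x) == t) for t in T}

g = right_inverse(f, S, {0, 1, 2})
print(g)                              # e.g. {0: 3, 1: 1, 2: 2}
print(all(f(g[t]) == t for t in g))   # True: f(g(t)) = t, so g is one-to-one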
Computational Exercises

Some Elementary Functions

Each of the following rules defines a function from R into R:

f(x) = x²
g(x) = sin(x)
h(x) = ⌊x⌋
u(x) = e^x / (1 + e^x)

Find the range of the function and determine if the function is one-to-one in each of the following cases:
1. f
2. g
3. h
4. u
Answer

Find the following inverse images:
1. f^{−1}[4, 9]
2. g^{−1}{0}
3. h^{−1}{2, 3, 4}
Answer

The function u is one-to-one. Find (that is, give the domain and rule for) the inverse function u^{−1}.
Answer

Give the rule and find the range for each of the following functions:
1. f ∘ g
2. g ∘ f
3. h ∘ g ∘ f
Answer

Note that f ∘ g and g ∘ f are well-defined functions from R into R, but f ∘ g ≠ g ∘ f.
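As a numerical sketch (not the exercises' official answers; the inverse formula is the standard logit, stated here as an assumption), one can check the inverse of u and see that the two compositions differ:

import math

u = lambda x: math.exp(x) / (1 + math.exp(x))
u_inv = lambda p: math.log(p / (1 - p))     # the logit function on (0, 1)

print(u_inv(u(2.5)))                        # 2.5, up to rounding

f = lambda x: x ** 2
g = math.sin
x = 1.0
print(f(g(x)), g(f(x)))                     # sin(1)² ≈ 0.708 vs sin(1) ≈ 0.841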
Dice

Let S = {1, 2, 3, 4, 5, 6}². This is the set of possible outcomes when a pair of standard dice are thrown. Let f, g, u, and v be the functions from S into Z defined by the following rules:

f(x, y) = x + y
g(x, y) = y − x
u(x, y) = min{x, y}
v(x, y) = max{x, y}

In addition, let F and U be the functions defined by F = (f, g) and U = (u, v).

Find the range of each of the following functions:
1. f
2. g
3. u
4. v
5. U
Answer

Give each of the following inverse images in list form:
1. f^{−1}{6}
2. u^{−1}{3}
3. v^{−1}{4}
4. U^{−1}{(3, 4)}
Answer

Find each of the following compositions:
1. f ∘ U
2. g ∘ U
3. u ∘ F
4. v ∘ F
5. F ∘ U
6. U ∘ F
Answer

Note that while f ∘ U is well-defined, U ∘ f is not. Note also that f ∘ U = f even though U is not the identity function on S.
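These finite computations are easy to automate. A minimal sketch (the helper names are illustrative) that computes ranges and inverse images over S = {1, …, 6}²:

from itertools import product

S = list(product(range(1, 7), repeat=2))   # outcomes for a pair of dice

f = lambda x, y: x + y
u = lambda x, y: min(x, y)

def rng(h):
    return sorted({h(x, y) for (x, y) in S})

def inv_image(h, B):
    return [(x, y) for (x, y) in S if h(x, y) in B]

print(rng(f))              # [2, 3, ..., 12]
print(inv_image(f, {6}))   # the 5 outcomes that sum to 6
print(inv_image(u, {3}))   # the 7 outcomes with minimum 3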
Bit Strings

Let n ∈ N_+ and let S = {0, 1}^n and T = {0, 1, …, n}. Recall that the elements of S are bit strings of length n, and could represent the possible outcomes of n tosses of a coin (where 1 means heads and 0 means tails). Let f : S → T be the function defined by f(x_1, x_2, …, x_n) = ∑_{i=1}^n x_i. Note that f(x) is just the number of 1s in the bit string x. Let g : T → S be the function defined by g(k) = x_k, where x_k denotes the bit string with k 1s followed by n − k 0s.

Find each of the following:
1. f ∘ g
2. g ∘ f
Answer

In the previous exercise, note that f ∘ g and g ∘ f are both well-defined, but have different domains, and so of course are not the same. Note also that f ∘ g is the identity function on T, but f is not the inverse of g. Indeed f is not one-to-one, and so does not have an inverse. However, f restricted to {x_k : k ∈ T} (the range of g) is one-to-one and is the inverse of g.
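A quick sketch (bit strings encoded as tuples, an assumed convention) confirming that f ∘ g is the identity on T while g ∘ f is not the identity on S:

n = 4
f = lambda x: sum(x)                             # number of 1s in the bit string
g = lambda k: (1,) * k + (0,) * (n - k)          # k ones followed by n - k zeros

print(all(f(g(k)) == k for k in range(n + 1)))   # True: f ∘ g = identity on T
x = (0, 1, 0, 1)
print(g(f(x)))                                   # (1, 1, 0, 0), which is not x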
Let n = 4. Give f^{−1}({k}) in list form for each k ∈ T.
Answer

Again let n = 4. Let A = {1000, 1010} and B = {1000, 1100}. Give each of the following in list form:
1. f(A)
2. f(B)
3. f(A ∩ B)
4. f(A) ∩ f(B)
5. f^{−1}(f(A))
Answer

In the previous exercise, note that f(A ∩ B) ⊂ f(A) ∩ f(B) and A ⊂ f^{−1}(f(A)).
Indicator Functions

Suppose that A and B are subsets of a universal set S. Express, in terms of 1_A and 1_B, the indicator function of each of the 14 non-trivial sets that can be constructed from A and B. Use the Venn diagram app to help.
Answer

Suppose that A, B, and C are subsets of a universal set S. Give the indicator function of each of the following, in terms of 1_A, 1_B, and 1_C, in sum-product form:
1. D = {x ∈ S : x is an element of exactly one of the given sets}
2. E = {x ∈ S : x is an element of exactly two of the given sets}
Answer
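Indicator algebra is easy to check numerically. A sketch (the example sets are assumptions) verifying the familiar identity 1_{A∪B} = 1_A + 1_B − 1_A 1_B pointwise:

S = set(range(10))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

ind = lambda C: (lambda x: 1 if x in C else 0)   # indicator function of C
oneA, oneB, oneAB = ind(A), ind(B), ind(A | B)

print(all(oneAB(x) == oneA(x) + oneB(x) - oneA(x) * oneB(x) for x in S))  # True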
Operators

Recall the standard arithmetic operators on R discussed above. We all know that sum is commutative and associative, and that product distributes over sum.
1. Is difference commutative?
2. Is difference associative?
3. Does product distribute over difference?
4. Does sum distribute over product?
Answer
Multisets

Express in list form the multiset A that has the multiplicity function m : {a, b, c, d, e} → N given by m(a) = 2, m(b) = 3, m(c) = 1, m(d) = 0, m(e) = 4.
Answer

Express the prime factors of 360 as a multiset in list form.
Answer

This page titled 1.2: Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.3: Relations

Relations play a fundamental role in probability theory, as in most other areas of mathematics.
Definitions and Constructions

Suppose that S and T are sets. A relation R from S to T is a subset of the product set S × T.
1. The domain of R is the set of first coordinates: domain(R) = {x ∈ S : (x, y) ∈ R for some y ∈ T}.
2. The range of R is the set of second coordinates: range(R) = {y ∈ T : (x, y) ∈ R for some x ∈ S}.

A relation from a set S to itself is a relation on S. As the name suggests, a relation R from S to T is supposed to define a relationship between the elements of S and the elements of T, and so we usually use the more suggestive notation x R y when (x, y) ∈ R. Note that the domain of R is the projection of R onto S and the range of R is the projection of R onto T.
Basic Examples

Suppose that S is a set and recall that P(S) denotes the power set of S, the collection of all subsets of S. The membership relation ∈ from S to P(S) is perhaps the most important and basic relationship in mathematics. Indeed, for us, it's a primitive (undefined) relationship—given x and A, we assume that we understand the meaning of the statement x ∈ A.

Another basic primitive relation is the equality relation = on a given set of objects S. That is, given two objects x and y, we assume that we understand the meaning of the statement x = y.

Other basic relations that you have seen are:
1. The subset relation ⊆ on P(S).
2. The order relation ≤ on R.

These two belong to a special class of relations known as partial orders that we will study in the next section. Note that a function f from S into T is a special type of relation. To compare the two types of notation (relation and function), note that x f y means that y = f(x).
Constructions

Since a relation is just a set of ordered pairs, the set operations can be used to build new relations from existing ones. If Q and R are relations from S to T, then so are Q ∪ R, Q ∩ R, and Q ∖ R.
1. x (Q ∪ R) y if and only if x Q y or x R y.
2. x (Q ∩ R) y if and only if x Q y and x R y.
3. x (Q ∖ R) y if and only if x Q y but not x R y.
4. If Q ⊆ R then x Q y implies x R y.

If R is a relation from S to T and Q ⊆ R, then Q is a relation from S to T.

The restriction of a relation defines a new relation. If R is a relation on S and A ⊆ S, then R_A = R ∩ (A × A) is a relation on A, called the restriction of R to A.
The inverse of a relation also defines a new relation. If R is a relation from S to T, the inverse of R is the relation from T to S defined by

y R^{−1} x if and only if x R y (1.3.1)

Equivalently, R^{−1} = {(y, x) : (x, y) ∈ R}. Note that any function f from S into T has an inverse relation, but only when f is one-to-one is the inverse relation also a function (the inverse function). Composition is another natural way to create new relations
from existing ones. Suppose that Q is a relation from S to T and that R is a relation from T to U . The composition Q ∘ R is the relation from S to U defined as follows: for x ∈ S and z ∈ U , x(Q ∘ R)z if and only if there exists y ∈ T such that x Q y and y R z . Note that the notation is inconsistent with the notation used for composition of functions, essentially because relations are read from left to right, while functions are read from right to left. Hopefully, the inconsistency will not cause confusion, since we will always use function notation for functions.
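Since relations are just sets of ordered pairs, these constructions can be written out directly. A minimal sketch (the example relations are assumptions) of the inverse and the composition:

Q = {(1, 'a'), (2, 'a'), (2, 'b')}          # a relation from S to T
R = {('a', 'x'), ('b', 'y')}                # a relation from T to U

inverse = {(y, x) for (x, y) in Q}
compose = {(x, z) for (x, y1) in Q for (y2, z) in R if y1 == y2}   # Q ∘ R

print(inverse)   # {('a', 1), ('a', 2), ('b', 2)}
print(compose)   # {(1, 'x'), (2, 'x'), (2, 'y')}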
Basic Properties

The important classes of relations that we will study in the next couple of sections are characterized by certain basic properties. Here are the definitions:

Suppose that R is a relation on S.
1. R is reflexive if x R x for all x ∈ S.
2. R is irreflexive if no x ∈ S satisfies x R x.
3. R is symmetric if x R y implies y R x for all x, y ∈ S.
4. R is anti-symmetric if x R y and y R x implies x = y for all x, y ∈ S.
5. R is transitive if x R y and y R z implies x R z for all x, y, z ∈ S.

The proofs of the following results are straightforward, so be sure to try them yourself before reading the ones in the text.

A relation R on S is reflexive if and only if the equality relation = on S is a subset of R.
Proof

A relation R on S is symmetric if and only if R^{−1} = R.
Proof

A relation R on S is transitive if and only if R ∘ R ⊆ R.
Proof

A relation R on S is antisymmetric if and only if R ∩ R^{−1} is a subset of the equality relation = on S.
Proof

Suppose that Q and R are relations on S. For each property below, if both Q and R have the property, then so does Q ∩ R.
1. reflexive
2. symmetric
3. transitive
Proof

Suppose that R is a relation on a set S.
1. Give an explicit definition for the property R is not reflexive.
2. Give an explicit definition for the property R is not irreflexive.
3. Are any of the properties R is reflexive, R is not reflexive, R is irreflexive, R is not irreflexive equivalent?
Answer

Suppose that R is a relation on a set S.
1. Give an explicit definition for the property R is not symmetric.
2. Give an explicit definition for the property R is not antisymmetric.
3. Are any of the properties R is symmetric, R is not symmetric, R is antisymmetric, R is not antisymmetric equivalent?
Answer
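For a finite relation these properties can be tested by brute force, using the set characterizations above. A sketch (the example relation is an assumption):

S = {0, 1, 2}
R = {(0, 0), (1, 1), (2, 2), (0, 1), (1, 0)}   # an example relation on S

reflexive  = all((x, x) in R for x in S)
symmetric  = all((y, x) in R for (x, y) in R)              # R^{-1} = R
transitive = all((x, z) in R
                 for (x, y1) in R for (y2, z) in R if y1 == y2)   # R ∘ R ⊆ R

print(reflexive, symmetric, transitive)   # True True True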
Computational Exercises

Let R be the relation defined on R by x R y if and only if sin(x) = sin(y). Determine if R has each of the following properties:
1. reflexive
2. symmetric
3. transitive
4. irreflexive
5. antisymmetric
Answer

The relation R in the previous exercise belongs to an important class of relations, the equivalence relations. Let R be the relation defined on R by x R y if and only if x² + y² ≤ 1. Determine if R has each of the following properties:
1. reflexive
2. symmetric
3. transitive
4. irreflexive
5. antisymmetric
Answer

This page titled 1.3: Relations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.4: Partial Orders

Partial orders are a special class of relations that play an important role in probability theory.
Basic Theory

Definitions

A partial order on a set S is a relation ⪯ on S that is reflexive, anti-symmetric, and transitive. The pair (S, ⪯) is called a partially ordered set. So for all x, y, z ∈ S:
1. x ⪯ x, the reflexive property
2. If x ⪯ y and y ⪯ x then x = y, the antisymmetric property
3. If x ⪯ y and y ⪯ z then x ⪯ z, the transitive property

As the name and notation suggest, a partial order is a type of ordering of the elements of S. Partial orders occur naturally in many areas of mathematics, including probability. A partial order on a set naturally gives rise to several other relations on the set.

Suppose that ⪯ is a partial order on a set S. The relations ⪰, ≺, ≻, ⊥, and ∥ are defined as follows:
1. x ⪰ y if and only if y ⪯ x.
2. x ≺ y if and only if x ⪯ y and x ≠ y.
3. x ≻ y if and only if y ≺ x.
4. x ⊥ y if and only if x ⪯ y or y ⪯ x.
5. x ∥ y if and only if neither x ⪯ y nor y ⪯ x.

Note that ⪰ is the inverse of ⪯, and ≻ is the inverse of ≺. Note also that x ⪯ y if and only if either x ≺ y or x = y, so the relation ≺ completely determines the relation ⪯. The relation ≺ is sometimes called a strict or strong partial order to distinguish it from the ordinary (weak) partial order ⪯. Finally, note that x ⊥ y means that x and y are related in the partial order, while x ∥ y means that x and y are unrelated in the partial order. Thus, the relations ⊥ and ∥ are complements of each other, as sets of ordered pairs. A total or linear order is a partial order in which there are no unrelated elements.

A partial order ⪯ on S is a total order or linear order if for every x, y ∈ S, either x ⪯ y or y ⪯ x.

Suppose that ⪯_1 and ⪯_2 are partial orders on a set S. Then ⪯_1 is a sub-order of ⪯_2, or equivalently ⪯_2 is an extension of ⪯_1, if and only if x ⪯_1 y implies x ⪯_2 y for x, y ∈ S.

Thus if ⪯_1 is a sub-order of ⪯_2, then as sets of ordered pairs, ⪯_1 is a subset of ⪯_2. We need one more relation that arises naturally from a partial order.

Suppose that ⪯ is a partial order on a set S. For x, y ∈ S, y is said to cover x if x ≺ y but no element z ∈ S satisfies x ≺ z ≺ y.

If S is finite, the covering relation completely determines the partial order, by virtue of the transitive property.

Suppose that ⪯ is a partial order on a finite set S. The covering graph or Hasse graph of (S, ⪯) is the directed graph with vertex set S and directed edge set E, where (x, y) ∈ E if and only if y covers x.
Thus, x ≺ y if and only if there is a directed path in the graph from x to y . Hasse graphs are named for the German mathematician Helmut Hasse. The graphs are often drawn with the edges directed upward. In this way, the directions can be inferred without having to actually draw arrows.
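For a finite poset the covering relation can be computed directly from the definition. A sketch (the example set, the divisors of 12, is an assumption chosen to differ from the exercise set below):

from itertools import permutations

S = [1, 2, 3, 4, 6, 12]                 # divisors of 12
leq = lambda x, y: y % x == 0           # the division partial order

def covers(S, leq):
    strict = {(x, y) for x, y in permutations(S, 2) if leq(x, y)}
    return {(x, y) for (x, y) in strict
            if not any((x, z) in strict and (z, y) in strict for z in S)}

print(sorted(covers(S, leq)))
# [(1, 2), (1, 3), (2, 4), (2, 6), (3, 6), (4, 12), (6, 12)]: the Hasse edges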
Basic Examples

Of course, the ordinary order ≤ is a total order on the set of real numbers R. The subset partial order is one of the most important in probability theory:

Suppose that S is a set. The subset relation ⊆ is a partial order on P(S), the power set of S.
Proof

Here is a partial order that arises naturally from arithmetic. Let ∣ denote the division relation on the set of positive integers N_+. That is, m ∣ n if and only if there exists k ∈ N_+ such that n = km. Then
1. ∣ is a partial order on N_+.
2. ∣ is a sub-order of the ordinary order ≤.
Proof

The set of functions from a set into a partially ordered set can itself be partially ordered in a natural way.

Suppose that S is a set and that (T, ⪯_T) is a partially ordered set, and let T^S denote the set of functions f : S → T. The relation ⪯ on T^S defined by f ⪯ g if and only if f(x) ⪯_T g(x) for all x ∈ S is a partial order on T^S.
Proof

Note that we don't need a partial order on the domain S.

Basic Properties

The proofs of the following basic properties are straightforward. Be sure to try them yourself before reading the ones in the text.

The inverse of a partial order is also a partial order.
Proof

If ⪯ is a partial order on S and A is a subset of S, then the restriction of ⪯ to A is a partial order on A.
Proof

The following theorem characterizes relations that correspond to strict orders.

Let S be a set. A relation ⪯ on S is a partial order if and only if the corresponding strict relation ≺ is transitive and irreflexive.
Proof
Monotone Sets and Functions

Partial orders form a natural setting for increasing and decreasing sets and functions. Here are the definitions:

Suppose that ⪯ is a partial order on a set S and that A ⊆ S. In the following definitions, x, y are arbitrary elements of S.
1. A is increasing if x ∈ A and x ⪯ y imply y ∈ A.
2. A is decreasing if y ∈ A and x ⪯ y imply x ∈ A.

Suppose that S is a set with partial order ⪯_S, T is a set with partial order ⪯_T, and that f : S → T. In the following definitions, x, y are arbitrary elements of S.
1. f is increasing if and only if x ⪯_S y implies f(x) ⪯_T f(y).
2. f is decreasing if and only if x ⪯_S y implies f(x) ⪰_T f(y).
3. f is strictly increasing if and only if x ≺_S y implies f(x) ≺_T f(y).
4. f is strictly decreasing if and only if x ≺_S y implies f(x) ≻_T f(y).

Recall the definition of the indicator function 1_A associated with a subset A of a universal set S: For x ∈ S, 1_A(x) = 1 if x ∈ A and 1_A(x) = 0 if x ∉ A.
Suppose that ⪯ is a partial order on a set S and that A ⊆ S. Then
1. A is increasing if and only if 1_A is increasing.
2. A is decreasing if and only if 1_A is decreasing.
Proof
Isomorphism

Two partially ordered sets (S, ⪯_S) and (T, ⪯_T) are said to be isomorphic if there exists a one-to-one function f from S onto T such that x ⪯_S y if and only if f(x) ⪯_T f(y), for all x, y ∈ S. The function f is an isomorphism.

Generally, a mathematical space often consists of a set and various structures defined in terms of the set, such as relations, operators, or a collection of subsets. Loosely speaking, two mathematical spaces of the same type are isomorphic if there exists a one-to-one function from one of the sets onto the other that preserves the structures, and again, the function is called an isomorphism. The basic idea is that isomorphic spaces are mathematically identical, except for superficial matters of appearance. The word isomorphism is from the Greek and means equal shape.

Suppose that the partially ordered sets (S, ⪯_S) and (T, ⪯_T) are isomorphic, and that f : S → T is an isomorphism. Then f and f^{−1} are strictly increasing.
Proof

In a sense, the subset partial order is universal—every partially ordered set is isomorphic to (𝒮, ⊆) for some collection of sets 𝒮.

Suppose that ⪯ is a partial order on a set S. Then there exists 𝒮 ⊆ P(S) such that (S, ⪯) is isomorphic to (𝒮, ⊆).
Proof
Extremal Elements

Various types of extremal elements play important roles in partially ordered sets. Here are the definitions:

Suppose that ⪯ is a partial order on a set S and that A ⊆ S.
1. An element a ∈ A is the minimum element of A if and only if a ⪯ x for every x ∈ A.
2. An element a ∈ A is a minimal element of A if and only if no x ∈ A satisfies x ≺ a.
3. An element b ∈ A is the maximum element of A if and only if b ⪰ x for every x ∈ A.
4. An element b ∈ A is a maximal element of A if and only if no x ∈ A satisfies x ≻ b.

In general, a set can have several maximal and minimal elements (or none). On the other hand,

The minimum and maximum elements of A, if they exist, are unique. They are denoted min(A) and max(A), respectively.
Proof

Minimal, maximal, minimum, and maximum elements of a set must belong to that set. The following definitions relate to upper and lower bounds of a set, which do not have to belong to the set.

Suppose again that ⪯ is a partial order on a set S and that A ⊆ S. Then
1. An element u ∈ S is a lower bound for A if and only if u ⪯ x for every x ∈ A.
2. An element v ∈ S is an upper bound for A if and only if v ⪰ x for every x ∈ A.
3. The greatest lower bound or infimum of A, if it exists, is the maximum of the set of lower bounds of A.
4. The least upper bound or supremum of A, if it exists, is the minimum of the set of upper bounds of A.

By (20), the greatest lower bound of A is unique, if it exists. It is denoted glb(A) or inf(A). Similarly, the least upper bound of A is unique, if it exists, and is denoted lub(A) or sup(A). Note that every element of S is a lower bound and an upper bound for ∅, since the conditions in the definition hold vacuously. The symbols ∧ and ∨ are also used for infimum and supremum, respectively, so ⋀A = inf(A) and ⋁A = sup(A) if they exist. In particular, for x, y ∈ S, operator notation is more commonly used, so x ∧ y = inf{x, y} and x ∨ y = sup{x, y}. Partially
ordered sets for which these elements always exist are important, and have a special name.

Suppose that ⪯ is a partial order on a set S. Then (S, ⪯) is a lattice if x ∧ y and x ∨ y exist for every x, y ∈ S.

For the subset partial order, the inf and sup operators correspond to intersection and union, respectively:

Let S be a set and consider the subset partial order ⊆ on P(S), the power set of S. Let 𝒜 be a nonempty subset of P(S), that is, a nonempty collection of subsets of S. Then
1. inf(𝒜) = ⋂𝒜
2. sup(𝒜) = ⋃𝒜
Proof

In particular, A ∧ B = A ∩ B and A ∨ B = A ∪ B, so (P(S), ⊆) is a lattice.

Consider the division partial order ∣ on the set of positive integers N_+ and let A be a nonempty subset of N_+.
1. inf(A) is the greatest common divisor of A, usually denoted gcd(A) in this context.
2. If A is infinite then sup(A) does not exist. If A is finite then sup(A) is the least common multiple of A, usually denoted lcm(A) in this context.
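In the division order, then, the infimum and supremum of a finite set are computable with gcd and lcm. A quick sketch (the example set is an assumption) using the standard library:

from functools import reduce
from math import gcd

A = {4, 6, 10}
lcm2 = lambda a, b: a * b // gcd(a, b)

print(reduce(gcd, A))    # 2: inf(A) in the division order
print(reduce(lcm2, A))   # 60: sup(A) in the division order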
Suppose that S is a set and that f : S → S. An element z ∈ S is said to be a fixed point of f if f(z) = z.

The following result is a basic fixed point theorem for a partially ordered set. The theorem is important in the study of cardinality.

Suppose that ⪯ is a partial order on a set S with the property that sup(A) exists for every A ⊆ S. If f : S → S is increasing, then f has a fixed point.
Proof

If ⪯ is a total order on a set S with the property that every nonempty subset of S has a minimum element, then S is said to be well ordered by ⪯. One of the most important examples is N_+, which is well ordered by the ordinary order ≤. On the other hand, the well ordering principle, which is equivalent to the axiom of choice, states that every nonempty set can be well ordered.
Orders on Product Spaces

Suppose that S and T are sets with partial orders ⪯_S and ⪯_T respectively. Define the relation ⪯ on S × T by (x, y) ⪯ (z, w) if and only if x ⪯_S z and y ⪯_T w.
1. The relation ⪯ is a partial order on S × T, called, appropriately enough, the product order.
2. Suppose that (S, ⪯_S) = (T, ⪯_T). If S has at least 2 elements, then ⪯ is not a total order on S².
Proof
Figure 1.4.1: The product order on R². The region shaded red is the set of points ⪰ (x, y). The region shaded blue is the set of points ⪯ (x, y). The region shaded white is the set of points that are not comparable with (x, y).
Product order extends in a straightforward way to the Cartesian product of a finite or an infinite sequence of partially ordered spaces. For example, suppose that S_i is a set with partial order ⪯_i for each i ∈ {1, 2, …, n}, where n ∈ N_+. The product order ⪯ on the product set S_1 × S_2 × ⋯ × S_n is defined as follows: for x = (x_1, x_2, …, x_n) and y = (y_1, y_2, …, y_n) in the product set, x ⪯ y if and only if x_i ⪯_i y_i for each i ∈ {1, 2, …, n}.

We can generalize this further to arbitrary product sets. Suppose that S_i is a set for each i in a nonempty (but otherwise arbitrary) index set I. Recall that

∏_{i∈I} S_i = {x : x is a function from I into ⋃_{i∈I} S_i such that x(i) ∈ S_i for each i ∈ I} (1.4.2)

To make the notation look more like a simple Cartesian product, we will write x_i instead of x(i) for the value of a function x in the product set at i ∈ I.
Suppose that S_i is a set with partial order ⪯_i for each i in a nonempty index set I. Define the relation ⪯ on ∏_{i∈I} S_i by x ⪯ y if and only if x_i ⪯_i y_i for each i ∈ I. Then ⪯ is a partial order on the product set, known again as the product order.
Proof

Note again that no assumptions are made on the index set I, other than that it be nonempty. In particular, no order is necessary on I. The next result gives a very different type of order on a product space.

Suppose again that S and T are sets with partial orders ⪯_S and ⪯_T respectively. Define the relation ⪯ on S × T by (x, y) ⪯ (z, w) if and only if either x ≺_S z, or x = z and y ⪯_T w.
1. The relation ⪯ is a partial order on S × T, called the lexicographic order or dictionary order.
2. If ⪯_S and ⪯_T are total orders on S and T, respectively, then ⪯ is a total order on S × T.
Proof

Figure 1.4.2: The lexicographic order on R². The region shaded red is the set of points ⪰ (x, y). The region shaded blue is the set of points ⪯ (x, y).

As with the product order, the lexicographic order can be generalized to a collection of partially ordered spaces. However, we need the index set to be totally ordered.

Suppose that S_i is a set with partial order ⪯_i for each i in a nonempty index set I. Suppose also that ≤ is a total order on I. Define the relation ⪯ on the product set ∏_{i∈I} S_i as follows: x ≺ y if and only if there exists j ∈ I such that x_i = y_i if i < j and x_j ≺_j y_j. Then
1. ⪯ is a partial order on ∏_{i∈I} S_i, known again as the lexicographic order.
2. If ⪯_i is a total order for each i ∈ I, and I is well ordered by ≤, then ⪯ is a total order on ∏_{i∈I} S_i.
Proof

The term lexicographic comes from the way that we order words alphabetically: We look at the first letter; if these are different, we know how to order the words. If the first letters are the same, we look at the second letter; if these are different, we know how to order the words. We continue in this way until we find letters that are different, and we can order the words. In fact, the lexicographic order is sometimes referred to as the first difference order. Note also that if S_i is a set and ⪯_i a total order on S_i for i ∈ I, then by the well ordering principle, there exists a well ordering ≤ of I, and hence there exists a lexicographic total order on the product space ∏_{i∈I} S_i. As a mathematical structure, the lexicographic order is not as obscure as you might think.

(R, ≤) is isomorphic to the lexicographic product of (Z, ≤) with ([0, 1), ≤), where ≤ is the ordinary order for real numbers.
Proof
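Lexicographic order on products of totally ordered sets is exactly how Python compares tuples, which gives a quick way to experiment (a sketch, not part of the text):

pairs = [(2, 1), (1, 5), (1, 2), (2, 0)]
print(sorted(pairs))   # [(1, 2), (1, 5), (2, 0), (2, 1)]: first-difference order

# the product order, by contrast, leaves (1, 5) and (2, 0) unrelated:
product_leq = lambda a, b: a[0] <= b[0] and a[1] <= b[1]
print(product_leq((1, 5), (2, 0)), product_leq((2, 0), (1, 5)))   # False False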
Limits of Sequences of Real Numbers

Suppose that (a_1, a_2, …) is a sequence of real numbers.

The sequence inf{a_n, a_{n+1}, …} is increasing in n ∈ N_+.

Since the sequence of infimums in the last result is increasing, the limit exists in R ∪ {∞}, and is called the limit inferior of the original sequence:

lim inf_{n→∞} a_n = lim_{n→∞} inf{a_n, a_{n+1}, …} (1.4.3)

The sequence sup{a_n, a_{n+1}, …} is decreasing in n ∈ N_+.

Since the sequence of supremums in the last result is decreasing, the limit exists in R ∪ {−∞}, and is called the limit superior of the original sequence:

lim sup_{n→∞} a_n = lim_{n→∞} sup{a_n, a_{n+1}, …} (1.4.4)

Note that lim inf_{n→∞} a_n ≤ lim sup_{n→∞} a_n, and equality holds if and only if lim_{n→∞} a_n exists (and is the common value).
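A numerical sketch (the example sequence is an assumption, and only finite truncations of the tails can be computed): for a_n = (−1)^n (1 + 1/n), the tail infimums increase toward −1 and the tail supremums decrease toward 1, so lim inf = −1 < 1 = lim sup and the sequence has no limit.

a = [(-1) ** n * (1 + 1 / n) for n in range(1, 1001)]
tails = [a[m:] for m in (0, 10, 100, 500)]
print([min(t) for t in tails])   # increasing toward -1
print([max(t) for t in tails])   # decreasing toward 1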
Vector Spaces of Functions

Suppose that S is a nonempty set, and recall that the set V of functions f : S → R is a vector space, under the usual pointwise definition of addition and scalar multiplication. As noted in (9), V is also a partially ordered set, under the pointwise partial order: f ⪯ g if and only if f(x) ≤ g(x) for all x ∈ S. Consistent with the definitions (19), f ∈ V is bounded if there exists C ∈ (0, ∞) such that |f(x)| ≤ C for all x ∈ S. Now let U denote the set of bounded functions f : S → R, and for f ∈ U define

∥f∥ = sup{|f(x)| : x ∈ S} (1.4.5)

U is a vector subspace of V and ∥ ⋅ ∥ is a norm on U.
Proof

Appropriately enough, ∥ ⋅ ∥ is called the supremum norm on U. Vector spaces of bounded, real-valued functions, with the supremum norm, are especially important in probability and random processes. We will return to this discussion again in the advanced sections on metric spaces and measure theory.
Computational Exercises

Let S = {2, 3, 4, 6, 12}.
1. Sketch the Hasse graph corresponding to the ordinary order ≤ on S.
2. Sketch the Hasse graph corresponding to the division partial order ∣ on S.
Answer

Consider the ordinary order ≤ on the set of real numbers R, and let A = [a, b) where a < b. Find each of the following that exist:
1. The set of minimal elements of A
2. The set of maximal elements of A
3. min(A)
4. max(A)
5. The set of lower bounds of A
6. The set of upper bounds of A
7. inf(A)
8. sup(A)
Answer
Again consider the division partial order ∣ on the set of positive integers N_+ and let A = {2, 3, 4, 6, 12}. Find each of the following that exist:
1. The set of minimal elements of A
2. The set of maximal elements of A
3. min(A)
4. max(A)
5. The set of lower bounds of A
6. The set of upper bounds of A
7. inf(A)
8. sup(A)
Answer

Let S = {a, b, c}.
1. Give P(S) in list form.
2. Describe the Hasse graph of (P(S), ⊆).
Answer

Let S = {a, b, c, d}.
1. Give P(S) in list form.
2. Describe the Hasse graph of (P(S), ⊆).
Answer

Suppose that A and B are subsets of a universal set S. Let 𝒜 denote the collection of the 16 subsets of S that can be constructed from A and B using the set operations. Show that (𝒜, ⊆) is isomorphic to the partially ordered set in the previous exercise. Use the Venn diagram app to help.
Proof

This page titled 1.4: Partial Orders is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.5: Equivalence Relations

Basic Theory

Definitions

A relation ≈ on a nonempty set S that is reflexive, symmetric, and transitive is an equivalence relation on S. Thus, for all x, y, z ∈ S:
1. x ≈ x, the reflexive property.
2. If x ≈ y then y ≈ x, the symmetric property.
3. If x ≈ y and y ≈ z then x ≈ z, the transitive property.

As the name and notation suggest, an equivalence relation is intended to define a type of equivalence among the elements of S. Like partial orders, equivalence relations occur naturally in most areas of mathematics, including probability.

Suppose that ≈ is an equivalence relation on S. The equivalence class of an element x ∈ S is the set of all elements that are equivalent to x, and is denoted

[x] = {y ∈ S : y ≈ x} (1.5.1)

Results

The most important result is that an equivalence relation on a set S defines a partition of S, by means of the equivalence classes.

Suppose that ≈ is an equivalence relation on a set S.
1. If x ≈ y then [x] = [y].
2. If x ≉ y then [x] ∩ [y] = ∅.
3. The collection of (distinct) equivalence classes is a partition of S into nonempty sets.
Proof

Sometimes the set 𝒮 of equivalence classes is denoted S/≈. The idea is that the equivalence classes are new “objects” obtained by “identifying” elements in S that are equivalent. Conversely, every partition of a set defines an equivalence relation on the set.

Suppose that 𝒮 is a collection of nonempty sets that partition a given set S. Define the relation ≈ on S by x ≈ y if and only if x ∈ A and y ∈ A for some A ∈ 𝒮.
1. ≈ is an equivalence relation.
2. 𝒮 is the set of equivalence classes.
Proof
Sometimes the equivalence relation ≈ associated with a given partition S is denoted S/S . The idea, of course, is that elements in the same set of the partition are equivalent. The process of forming a partition from an equivalence relation, and the process of forming an equivalence relation from a partition are inverses of each other.
1.5.1
https://stats.libretexts.org/@go/page/10120
1. If we start with an equivalence relation ≈ on S , form the associated partition, and then construct the equivalence relation associated with the partition, then we end up with the original equivalence relation. In modular notation, S/(S/ ≈) is the same as ≈. 2. If we start with a partition S of S , form the associated equivalence relation, and then form the partition associated with the equivalence relation, then we end up with the original partition. In modular notation, S/(S/S ) is the same as S . Suppose that S is a nonempty set. The most basic equivalence relation on S is the equality relation =. In this case [x] = {x} for each x ∈ S . At the other extreme is the trivial relation ≈ defined by x ≈ y for all x, y ∈ S . In this case S is the only equivalence class. Every function f defines an equivalence relation on its domain, known as the equivalence relation associated with f . Moreover, the equivalence classes have a simple description in terms of the inverse images of f . Suppose that f
: S → T
. Define the relation ≈ on S by x ≈ y if and only if f (x) = f (y).
1. The relation ≈ is an equivalence relation on S . 2. The set of equivalences classes is S = {f {t} : t ∈ range(f )} . 3. The function F : S → T defined by F ([x]) = f (x) is well defined and is one-to-one. −1
Proof
Figure 1.5.2 : The equivalence relation on S associated with f
Suppose again that f
: S → T
: S → T
.
1. If f is one-to-one then the equivalence relation associated with f is the equality relation, and hence [x] = {x} for each x ∈ S. 2. If f is a constant function then the equivalence relation associated with f is the trivial relation, and hence S is the only equivalence class. Proof Equivalence relations associated with functions are universal: every equivalence relation is of this form: Suppose that ≈ is an equivalence relation on a set S . Define f associated with f .
: S → P(S)
by f (x) = [x]. Then ≈ is the equivalence relation
Proof The intersection of two equivalence relations is another equivalence relation. Suppose that ≈ and ≅ are equivalence relations on a set S . Let ≡ denote the intersection of ≈ and ≅ (thought of as subsets of S × S ). Equivalently, x ≡ y if and only if x ≈ y and x ≅y . 1. ≡ is an equivalence relation on S . 2. [x ] = [x ] ∩ [x ] . ≡
≈
≅
Suppose that we have a relation that is reflexive and transitive, but fails to be a partial order because it's not anti-symmetric. The relation and its inverse naturally lead to an equivalence relation, and then in turn, the original relation defines a true partial order on the equivalence classes. This is a common construction, and the details are given in the next theorem. Suppose that ⪯ is a relation on a set S that is reflexive and transitive. Define the relation ≈ on S by x ≈ y if and only if x ⪯ y and y ⪯ x . 1. ≈ is an equivalence relation on S . 2. If A and B are equivalence classes and x ⪯ y for some x ∈ A and y ∈ B , then u ⪯ v for all u ∈ A and v ∈ B .
1.5.2
https://stats.libretexts.org/@go/page/10120
3. Define the relation ⪯ on the collection of equivalence classes S by A ⪯ B if and only if x ⪯ y for some (and hence all) x ∈ A and y ∈ B . Then ⪯ is a partial order on S . Proof A prime example of the construction in the previous theorem occurs when we have a function whose range space is partially ordered. We can construct a partial order on the equivalence classes in the domain that are associated with the function. Suppose that S and T are sets and that ⪯ is a partial order on T . Suppose also that f by x ⪯ y if and only if f (x) ⪯ f (y). T
S
: S → T
. Define the relation ⪯ on S S
T
1. ⪯ is reflexive and transitive. 2. The equivalence relation on S constructed in (10) is the equivalence relation associated with f , as in (6). 3. ⪯ can be extended to a partial order on the equivalence classes corresponding to f . S
S
Proof
Examples and Applications

Simple Functions

Give the equivalence classes explicitly for the functions from R into R defined below:
1. f(x) = x².
2. g(x) = ⌊x⌋.
3. h(x) = sin(x).
Answer
Calculus

Suppose that I is a fixed interval of R, and that S is the set of differentiable functions from I into R. Consider the equivalence relation associated with the derivative operator D on S, so that D(f) = f′. For f ∈ S, give a simple description of [f].
Answer
Congruence

Recall the division relation ∣ from N_+ to Z: For d ∈ N_+ and n ∈ Z, d ∣ n means that n = kd for some k ∈ Z. In words, d divides n, or equivalently, n is a multiple of d. In the previous section, we showed that ∣ is a partial order on N_+.

Fix d ∈ N_+.
1. Define the relation ≡_d on Z by m ≡_d n if and only if d ∣ (n − m). The relation ≡_d is known as congruence modulo d.
2. Let r_d : Z → {0, 1, …, d − 1} be defined so that r_d(n) is the remainder when n is divided by d.

Recall that by the Euclidean division theorem, named for Euclid of course, n ∈ Z can be written uniquely in the form n = kd + q where k ∈ Z and q ∈ {0, 1, …, d − 1}, and then r_d(n) = q.

Congruence modulo d:
1. ≡_d is the equivalence relation associated with the function r_d.
2. There are d distinct equivalence classes, given by [q] = {q + kd : k ∈ Z} for q ∈ {0, 1, …, d − 1}.
Proof

Explicitly give the equivalence classes for ≡_4, congruence mod 4.
Answer
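A brief sketch (only a finite window of Z is shown, since the true classes are infinite) grouping integers by their remainder mod 4:

from collections import defaultdict

d = 4
classes = defaultdict(list)
for n in range(-8, 9):            # a finite window of Z, for illustration
    classes[n % d].append(n)      # Python's % always returns a value in {0, ..., d-1}

for q in sorted(classes):
    print(q, classes[q])          # e.g. 0 [-8, -4, 0, 4, 8]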
Linear Algebra

Linear algebra provides several examples of important and interesting equivalence relations. To set the stage, let R^{m×n} denote the set of m × n matrices with real entries, for m, n ∈ N_+.
Recall that the following are row operations on a matrix:
1. Multiply a row by a non-zero real number.
2. Interchange two rows.
3. Add a multiple of a row to another row.

Row operations are essential for inverting matrices and solving systems of linear equations.

Matrices A, B ∈ R^{m×n} are row equivalent if A can be transformed into B by a finite sequence of row operations. Row equivalence is an equivalence relation on R^{m×n}.
Proof

Our next relation involves similarity, which is very important in the study of linear transformations, change of basis, and the theory of eigenvalues and eigenvectors.

Matrices A, B ∈ R^{n×n} are similar if there exists an invertible P ∈ R^{n×n} such that P^{−1}AP = B. Similarity is an equivalence relation on R^{n×n}.
Proof

Next recall that for A ∈ R^{m×n}, the transpose of A is the matrix A^T ∈ R^{n×m} with the property that the (i, j) entry of A^T is the (j, i) entry of A. Simply stated, A^T is the matrix whose rows are the columns of A. For the theorem that follows, we need to remember that (AB)^T = B^T A^T for A ∈ R^{m×n} and B ∈ R^{n×k}, and that (A^T)^{−1} = (A^{−1})^T if A ∈ R^{n×n} is invertible.

Matrices A, B ∈ R^{n×n} are congruent if there exists an invertible P ∈ R^{n×n} such that B = P^T A P. Congruence is an equivalence relation on R^{n×n}.
Proof

Congruence is important in the study of orthogonal matrices and change of basis. Of course, the term congruence applied to matrices should not be confused with the same term applied to integers.
Number Systems

Equivalence relations play an important role in the construction of complex mathematical structures from simpler ones. Often the objects in the new structure are equivalence classes of objects constructed from the simpler structures, modulo an equivalence relation that captures the essential properties of the new objects. The construction of number systems is a prime example of this general idea. The next exercise explores the construction of rational numbers from integers.

Define a relation ≈ on Z × N_+ by (j, k) ≈ (m, n) if and only if j n = k m.
1. ≈ is an equivalence relation.
2. Define m/n = [(m, n)], the equivalence class generated by (m, n), for m ∈ Z and n ∈ N_+. This definition captures the essential properties of the rational numbers.
Proof

This page titled 1.5: Equivalence Relations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.6: Cardinality

Basic Theory

Definitions

Suppose that 𝒮 is a non-empty collection of sets. We define a relation ≈ on 𝒮 by A ≈ B if and only if there exists a one-to-one function f from A onto B.

The relation ≈ is an equivalence relation on 𝒮. That is, for all A, B, C ∈ 𝒮:
1. A ≈ A, the reflexive property
2. If A ≈ B then B ≈ A, the symmetric property
3. If A ≈ B and B ≈ C then A ≈ C, the transitive property
Proof

A one-to-one function f from A onto B is sometimes called a bijection. Thus if A ≈ B then A and B are in one-to-one correspondence and are said to have the same cardinality. The equivalence classes under this equivalence relation capture the notion of having the same number of elements.

Let N_0 = ∅, and for k ∈ N_+, let N_k = {0, 1, …, k − 1}. As always, N = {0, 1, 2, …} is the set of all natural numbers.

Suppose that A is a set.
1. A is finite if A ≈ N_k for some k ∈ N, in which case k is the cardinality of A, and we write #(A) = k.
2. A is infinite if A is not finite.
3. A is countably infinite if A ≈ N.
4. A is countable if A is finite or countably infinite.
5. A is uncountable if A is not countable.

In part (a), think of N_k as a reference set with k elements; any other set with k elements must be equivalent to this one. We will study the cardinality of finite sets in the next two sections on Counting Measure and Combinatorial Structures. In this section, we will concentrate primarily on infinite sets. In part (d), a countable set is one that can be enumerated or counted by putting the elements into one-to-one correspondence with N_k for some k ∈ N or with all of N. An uncountable set is one that cannot be so counted. Countable sets play a special role in probability theory, as in many other branches of mathematics. A priori, it's not clear that there are uncountable sets, but we will soon see examples.

Preliminary Examples

If S is a set, recall that P(S) denotes the power set of S (the set of all subsets of S). If A and B are sets, then A^B is the set of all functions from B into A. In particular, {0, 1}^S denotes the set of functions from S into {0, 1}.

If S is a set then P(S) ≈ {0, 1}^S.
Proof

Next are some examples of countably infinite sets.

The following sets are countably infinite:
1. The set of even natural numbers E = {0, 2, 4, …}
2. The set of integers Z
Proof

At one level, it might seem that E has only half as many elements as N, while Z has twice as many elements as N. As the previous result shows, that point of view is incorrect: N, E, and Z all have the same cardinality (and are countably infinite). The next example shows that there are indeed uncountable sets.

If A is a set with at least two elements then S = A^N, the set of all functions from N into A, is uncountable.
Proof
Subsets of Infinite Sets

Surely a set must be at least as large as any of its subsets, in terms of cardinality. On the other hand, by example (4), the set of natural numbers N, the set of even natural numbers E, and the set of integers Z all have exactly the same cardinality, even though E ⊂ N ⊂ Z. In this subsection, we will explore some interesting and somewhat paradoxical results that relate to subsets of infinite sets. Along the way, we will see that the countable infinity is the “smallest” of the infinities.

If S is an infinite set then S has a countably infinite subset.
Proof

A set S is infinite if and only if S is equivalent to a proper subset of S.
Proof

When S was infinite in the proof of the previous result, not only did we map S one-to-one onto a proper subset, we actually threw away a countably infinite subset and still maintained equivalence. Similarly, we can add a countably infinite set to an infinite set S without changing the cardinality.

If S is an infinite set and B is a countable set, then S ≈ S ∪ B.
Proof

In particular, if S is uncountable and B is countable then S ∪ B and S ∖ B have the same cardinality as S, and in particular are uncountable. In terms of the dichotomies finite-infinite and countable-uncountable, a set is indeed at least as large as a subset. First we need a preliminary result.

If S is countably infinite and A ⊆ S then A is countable.
Proof

Suppose that A ⊆ B.
1. If B is finite then A is finite.
2. If A is infinite then B is infinite.
3. If B is countable then A is countable.
4. If A is uncountable then B is uncountable.
Proof
Comparisons by one-to-one and onto functions

We will look deeper at the general question of when one set is “at least as big” as another, in the sense of cardinality. Not surprisingly, this will eventually lead to a partial order on the cardinality equivalence classes. First note that if there exists a function that maps a set A one-to-one into a set B, then in a sense, there is a copy of A contained in B. Hence B should be at least as large as A.

Suppose that f : A → B is one-to-one.
1. If B is finite then A is finite.
2. If A is infinite then B is infinite.
3. If B is countable then A is countable.
4. If A is uncountable then B is uncountable.
Proof

On the other hand, if there exists a function that maps a set A onto a set B, then in a sense, there is a copy of B contained in A. Hence A should be at least as large as B.

Suppose that f : A → B is onto.
1. If A is finite then B is finite.
2. If B is infinite then A is infinite.
3. If A is countable then B is countable.
4. If B is uncountable then A is uncountable.
Proof

The previous exercise also could be proved from the one before, since if there exists a function f mapping A onto B, then there exists a function g mapping B one-to-one into A. This duality is proven in the discussion of the axiom of choice. A simple and useful corollary of the previous two theorems is that if B is a given countably infinite set, then a set A is countable if and only if there exists a one-to-one function f from A into B, if and only if there exists a function g from B onto A.

If A_i is a countable set for each i in a countable index set I, then ⋃_{i∈I} A_i is countable.
Proof

If A and B are countable then A × B is countable.
Proof

The last result could also be proven from the one before, by noting that

A × B = ⋃_{a∈A} {a} × B (1.6.1)
Both proofs work because the set M is essentially a copy of N × N, embedded inside of N. The last theorem generalizes to the statement that a finite product of countable sets is still countable. But, from (5), a product of infinitely many sets (with at least 2 elements each) will be uncountable.

The set of rational numbers Q is countably infinite.
Proof

A real number is algebraic if it is the root of a polynomial function (of degree 1 or more) with integer coefficients. Rational numbers are algebraic, as are rational roots of rational numbers (when defined). Moreover, the algebraic numbers are closed under addition, multiplication, and division. A real number is transcendental if it's not algebraic. The numbers e and π are transcendental, but we don't know very many other transcendental numbers by name. However, as we will see, most (in the sense of cardinality) real numbers are transcendental.

The set of algebraic numbers A is countably infinite.
Proof

Now let's look at some uncountable sets.

The interval [0, 1) is uncountable.
Proof

The following sets have the same cardinality, and in particular all are uncountable:
1. R, the set of real numbers.
2. Any interval I of R, as long as the interval is not empty or a single point.
3. R ∖ Q, the set of irrational numbers.
4. R ∖ A, the set of transcendental numbers.
5. P(N), the power set of N.
Proof
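The embedding of N × N inside N can be made explicit. A sketch using the standard Cantor pairing function (a well-known choice, assumed here rather than taken from the text), checked for injectivity on a finite grid:

pair = lambda m, n: (m + n) * (m + n + 1) // 2 + n   # Cantor pairing: N × N -> N

grid = [(m, n) for m in range(50) for n in range(50)]
codes = [pair(m, n) for (m, n) in grid]
print(len(codes) == len(set(codes)))   # True: no collisions, so the map is one-to-one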
The Cardinality Partial Order

Suppose that 𝒮 is a nonempty collection of sets. We define the relation ⪯ on 𝒮 by A ⪯ B if and only if there exists a one-to-one function f from A into B, if and only if there exists a function g from B onto A. In light of the previous subsection, A ⪯ B should capture the notion that B is at least as big as A, in the sense of cardinality.
The relation ⪯ is reflexive and transitive.
Proof

Thus, we can use the construction in the section on Equivalence Relations to first define an equivalence relation on 𝒮, and then extend ⪯ to a true partial order on the collection of equivalence classes. The only question that remains is whether the equivalence relation we obtain in this way is the same as the one that we have been using in our study of cardinality. Rephrased, the question is this: If there exists a one-to-one function from A into B and a one-to-one function from B into A, does there necessarily exist a one-to-one function from A onto B? Fortunately, the answer is yes; the result is known as the Schröder-Bernstein Theorem, named for Ernst Schröder and Felix Bernstein.

If A ⪯ B and B ⪯ A then A ≈ B.
Proof

We will write A ≺ B if A ⪯ B but A ≉ B. That is, there exists a one-to-one function from A into B, but there does not exist a function from A onto B. Note that ≺ would have its usual meaning if applied to the equivalence classes. That is, [A] ≺ [B] if and only if [A] ⪯ [B] but [A] ≠ [B]. Intuitively, of course, A ≺ B means that B is strictly larger than A, in the sense of cardinality.

A ≺ B in each of the following cases:
1. A and B are finite and #(A) < #(B).
2. A is finite and B is countably infinite.
3. A is countably infinite and B is uncountable.

We close our discussion with the observation that for any set, there is always a larger set.

If S is a set then S ≺ P(S).
Proof

The proof that a set cannot be mapped onto its power set is similar to the Russell paradox, named for Bertrand Russell. The continuum hypothesis is the statement that there is no set whose cardinality is strictly between that of N and R. The continuum hypothesis actually started out as the continuum conjecture, until it was shown to be consistent with the usual axioms of the real number system (by Kurt Gödel in 1940), and independent of those axioms (by Paul Cohen in 1963).

Assuming the continuum hypothesis, if S is uncountable then there exists A ⊆ S such that A and A^c are uncountable.
Proof

There is a more complicated proof of the last result, without the continuum hypothesis and just using the axiom of choice.

This page titled 1.6: Cardinality is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.7: Counting Measure

Basic Theory

For our first discussion, we assume that the universal set S is finite. Recall the following definition from the section on cardinality.

For A ⊆ S, the cardinality of A is the number of elements in A, and is denoted #(A). The function # on P(S) is called counting measure.
Counting measure plays a fundamental role in discrete probability structures, and particularly those that involve sampling from a finite set. The set S is typically very large, hence efficient counting methods are essential. The first combinatorial problem is attributed to the Greek mathematician Xenocrates. In many cases, a set of objects can be counted by establishing a one-to-one correspondence between the given set and some other set. Naturally, the two sets have the same number of elements, but for various reasons, the second set may be easier to count.
The Addition Rule

The addition rule of combinatorics is simply the additivity axiom of counting measure.

If {A_1, A_2, …, A_n} is a collection of disjoint subsets of S then

#(⋃_{i=1}^n A_i) = ∑_{i=1}^n #(A_i) (1.7.1)
Figure 1.7.1 : The addition rule
The following counting rules are simple consequences of the addition rule. Be sure to try the proofs yourself before reading the ones in the text.

#(A^c) = #(S) − #(A). This is the complement rule.
Proof

Figure 1.7.2: The complement rule

#(B ∖ A) = #(B) − #(A ∩ B). This is the difference rule.
Proof

If A ⊆ B then #(B ∖ A) = #(B) − #(A). This is the proper difference rule.
Proof

If A ⊆ B then #(A) ≤ #(B).
Proof

Thus, # is an increasing function, relative to the subset partial order ⊆ on P(S), and the ordinary order ≤ on N.
Inequalities

Our next discussion concerns two inequalities that are useful for obtaining bounds on the number of elements in a set. The first is Boole's inequality (named after George Boole), which gives an upper bound on the cardinality of a union.

If {A_1, A_2, …, A_n} is a finite collection of subsets of S then

#(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n #(A_i) (1.7.2)

Proof

Intuitively, Boole's inequality holds because parts of the union have been counted more than once in the expression on the right. The second inequality is Bonferroni's inequality (named after Carlo Bonferroni), which gives a lower bound on the cardinality of an intersection.

If {A_1, A_2, …, A_n} is a finite collection of subsets of S then

#(⋂_{i=1}^n A_i) ≥ #(S) − ∑_{i=1}^n [#(S) − #(A_i)] (1.7.4)

Proof
The Inclusion-Exclusion Formula

The inclusion-exclusion formula gives the cardinality of a union of sets in terms of the cardinality of the various intersections of the sets. The formula is useful because intersections are often easier to count. We start with the special cases of two sets and three sets. As usual, we assume that the sets are subsets of a finite universal set S.

If A and B are subsets of S then #(A ∪ B) = #(A) + #(B) − #(A ∩ B).
Proof
Figure 1.7.3 : The inclusion-exclusion theorem for two sets
If A, B, C are subsets of S then

#(A ∪ B ∪ C) = #(A) + #(B) + #(C) − #(A ∩ B) − #(A ∩ C) − #(B ∩ C) + #(A ∩ B ∩ C).

Proof
Figure 1.7.4 : The inclusion-exclusion theorem for three sets
The inclusion-exclusion rule for two and three sets can be generalized to a union of n sets; the generalization is known as the (general) inclusion-exclusion formula.

Suppose that {A_i : i ∈ I} is a collection of subsets of S where I is an index set with #(I) = n. Then

#(⋃_{i∈I} A_i) = ∑_{k=1}^n (−1)^{k−1} ∑_{J⊆I, #(J)=k} #(⋂_{j∈J} A_j) (1.7.6)

Proof
The general Bonferroni inequalities, named again for Carlo Bonferroni, state that if the sum on the right is truncated after k terms (k < n), then the truncated sum is an upper bound for the cardinality of the union if k is odd (so that the last term has a positive sign) and is a lower bound for the cardinality of the union if k is even (so that the last term has a negative sign).
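For concrete finite sets, the inclusion-exclusion formula is easy to verify by brute force. A sketch (the example sets are assumptions) comparing the formula against a direct count of the union:

from itertools import combinations

A = [{1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}]   # subsets of a finite universe

total = 0
for k in range(1, len(A) + 1):
    for J in combinations(A, k):
        total += (-1) ** (k - 1) * len(set.intersection(*J))

print(total, len(set.union(*A)))   # 7 7: the formula agrees with the direct count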
The Multiplication Rule

The multiplication rule of combinatorics is based on the formulation of a procedure (or algorithm) that generates the objects to be counted.

Suppose that a procedure consists of k steps, performed sequentially, and that for each j ∈ {1, 2, …, k}, step j can be performed in n_j ways, regardless of the choices made on the previous steps. Then the number of ways to perform the entire procedure is n_1 n_2 ⋯ n_k.

The key to a successful application of the multiplication rule to a counting problem is the clear formulation of an algorithm that generates the objects being counted, so that each object is generated once and only once. That is, we must neither over count nor under count. It's also important to notice that the set of choices available at step j may well depend on the previous steps; the assumption is only that the number of choices available does not depend on the previous steps. The first two results below give equivalent formulations of the multiplication principle.

Suppose that S is a set of sequences of length k, and that we denote a generic element of S by (x_1, x_2, …, x_k). Suppose that for each j ∈ {1, 2, …, k}, x_j has n_j different values, regardless of the values of the previous coordinates. Then #(S) = n_1 n_2 ⋯ n_k.
Proof

Suppose that T is an ordered tree with depth k and that each vertex at level i − 1 has n_i children for i ∈ {1, 2, …, k}. Then the number of endpoints of the tree is n_1 n_2 ⋯ n_k.
Proof
Product Sets

If S_i is a set with n_i elements for i ∈ {1, 2, …, k} then

#(S_1 × S_2 × ⋯ × S_k) = n_1 n_2 ⋯ n_k (1.7.10)

Proof

If S is a set with n elements, then S^k has n^k elements.
Proof

In (16), note that the elements of S^k can be thought of as ordered samples of size k that can be chosen with replacement from a population of n objects. Elements of {0, 1}^n are sometimes called bit strings of length n. Thus, there are 2^n bit strings of length n.
Functions The number of functions from a set A of m elements into a set B of n elements is n . m
Proof
Recall that the set of functions from a set A into a set B (regardless of whether the sets are finite or infinite) is denoted B^A. This theorem is the motivation for the notation. Note also that if S is a set with n elements, then the elements in the Cartesian power set S^k can be thought of as functions from {1, 2, …, k} into S. So the counting formula for sequences can be thought of as a corollary of the counting formula for functions.
Subsets
Suppose that S is a set with n elements, where n ∈ N. There are 2^n subsets of S.
Proof from the multiplication principle
Proof using indicator functions
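A short illustrative sketch (an addition; the function name is hypothetical): each subset of {0, 1, …, n−1} corresponds to a bit string of length n, which is the idea behind the multiplication-principle proof.

```python
from itertools import product

def all_subsets(n):
    """Enumerate subsets of {0, ..., n-1} via bit strings of length n:
    bit j decides whether element j is in the subset."""
    for bits in product([0, 1], repeat=n):
        yield {j for j, b in enumerate(bits) if b == 1}

n = 4
subsets = list(all_subsets(n))
assert len(subsets) == 2 ** n  # 2^n subsets, one per bit string
```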
Suppose that {A_1, A_2, …, A_k} is a collection of k subsets of a set S, where k ∈ N_+. There are 2^{2^k} different (in general) sets that can be constructed from the k given sets, using the operations of union, intersection, and complement. These sets form the algebra generated by the given sets.
Proof
Open the Venn diagram app.
1. Select each of the 4 disjoint sets A ∩ B, A ∩ B^c, A^c ∩ B, A^c ∩ B^c.
2. Select each of the 12 other subsets of S. Note how each is a union of some of the sets in (a).
Suppose that S is a set with n elements and that A is a subset of S with k elements, where n, k ∈ N and k ≤ n. The number of subsets of S that contain A is 2^{n−k}.
Proof
Our last result in this discussion generalizes the basic subset result above.
Suppose that n, k ∈ N_+ and that S is a set with n elements. The number of sequences of subsets (A_1, A_2, …, A_k) with A_1 ⊆ A_2 ⊆ ⋯ ⊆ A_k ⊆ S is (k + 1)^n.
Proof
When k = 1 we get 2^n as the number of subsets of S, as before.
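Here is a small brute-force check of this count (an illustrative addition; the helper name is hypothetical). A chain A_1 ⊆ ⋯ ⊆ A_k is determined by assigning to each element of S the index of the first set that contains it (or k + 1 if none does), giving (k + 1)^n choices:

```python
from itertools import product

def count_subset_chains(n, k):
    """Count chains A_1 ⊆ A_2 ⊆ ... ⊆ A_k ⊆ S with #(S) = n by assigning
    each element of S a label in {1, ..., k+1}: the first set containing it."""
    return sum(1 for _ in product(range(k + 1), repeat=n))

n, k = 3, 2
assert count_subset_chains(n, k) == (k + 1) ** n  # 27 chains
```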
Computational Exercises
Identification Numbers
A license number consists of two letters (uppercase) followed by five digits. How many different license numbers are there?
Answer
Suppose that a Personal Identification Number (PIN) is a four-symbol code word in which each entry is either a letter (uppercase) or a digit. How many PINs are there?
Answer
Cards, Dice, and Coins
In the board game Clue, Mr. Boddy has been murdered. There are 6 suspects, 6 possible weapons, and 9 possible rooms for the murder.
1. The game includes a card for each suspect, each weapon, and each room. How many cards are there?
2. The outcome of the game is a sequence consisting of a suspect, a weapon, and a room (for example, Colonel Mustard with the knife in the billiard room). How many outcomes are there?
3. Once the three cards that constitute the outcome have been randomly chosen, the remaining cards are dealt to the players. Suppose that you are dealt 5 cards. In trying to guess the outcome, what hand of cards would be best?
Answer
An experiment consists of rolling a standard die, drawing a card from a standard deck, and tossing a standard coin. How many outcomes are there?
Answer
A standard die is rolled 5 times and the sequence of scores recorded. How many outcomes are there?
Answer
In the card game Set, each card has 4 properties: number (one, two, or three), shape (diamond, oval, or squiggle), color (red, blue, or green), and shading (solid, open, or striped). The deck has one card of each (number, shape, color, shading) configuration. A set in the game is defined as a set of three cards for which, for each property, the cards are either all the same or all different.
1. How many cards are in a deck?
2. How many sets are there?
Answer
A coin is tossed 10 times and the sequence of scores recorded. How many sequences are there?
Answer
The die-coin experiment consists of rolling a die and then tossing a coin the number of times shown on the die. The sequence of coin results is recorded.
1. How many outcomes are there?
2. How many outcomes are there with all heads?
3. How many outcomes are there with exactly one head?
Answer
Run the die-coin experiment 100 times and observe the outcomes.
Consider a deck of cards as a set D with 52 elements.
1. How many subsets of D are there?
2. How many functions are there from D into the set {1, 2, 3, 4}?
Answer
Birthdays
Consider a group of 10 persons.
1. If we record the birth month of each person, how many outcomes are there?
2. If we record the birthday of each person (ignoring leap day), how many outcomes are there?
Answer
Reliability
In the usual model of structural reliability, a system consists of components, each of which is either working or defective. The system as a whole is also either working or defective, depending on the states of the components and how the components are connected.
A string of lights has 20 bulbs, each of which may be good or defective. How many configurations are there?
Answer
If the components are connected in series, then the system as a whole is working if and only if each component is working. If the components are connected in parallel, then the system as a whole is working if and only if at least one component is working.
A system consists of three subsystems with 6, 5, and 4 components, respectively. Find the number of component states for which the system is working in each of the following cases:
1. The components in each subsystem are in parallel and the subsystems are in series.
2. The components in each subsystem are in series and the subsystems are in parallel.
Answer
Menus
Suppose that a sandwich at a restaurant consists of bread, meat, cheese, and various toppings. There are 4 choices for the bread, 3 choices for the meat, 5 choices for the cheese, and 10 different toppings (each of which may be chosen or not). How many sandwiches are there?
Answer
At a wedding dinner, there are three choices for the entrée, four choices for the beverage, and two choices for the dessert.
1. How many different meals are there?
2. If there are 50 guests at the wedding and we record the meal requested for each guest, how many possible outcomes are there?
Answer
Braille
Braille is a tactile writing system used by people who are visually impaired. The system is named for the French educator Louis Braille and uses raised dots in a 3 × 2 grid to encode characters. How many meaningful Braille configurations are there?
Answer
Figure 1.7.5 : The Braille encoding of the number 2 and the letter b
Personality Typing
The Myers-Briggs personality typing is based on four dichotomies: A person is typed as either extroversion (E) or introversion (I), either sensing (S) or intuition (N), either thinking (T) or feeling (F), and either judgement (J) or perception (P).
1. How many Myers-Briggs personality types are there? List them.
2. Suppose that we list the personality types of 10 persons. How many possible outcomes are there?
Answer
The Galton Board
The Galton Board, named after Francis Galton, is a triangular array of pegs. Galton, apparently too modest to name the device after himself, called it a quincunx from the Latin word for five twelfths (go figure). The rows are numbered, from the top down, by (0, 1, …). Row n has n + 1 pegs that are labeled, from left to right, by (0, 1, …, n). Thus, a peg can be uniquely identified by an ordered pair (n, k) where n is the row number and k is the peg number in that row. A ball is dropped onto the top peg (0, 0) of the Galton board. In general, when the ball hits peg (n, k), it either bounces to the left to peg (n + 1, k) or to the right to peg (n + 1, k + 1). The sequence of pegs that the ball hits is a path in the Galton board.
There is a one-to-one correspondence between each pair of the following three collections:
1. Bit strings of length n
2. Paths in the Galton board from (0, 0) to any peg in row n
3. Subsets of a set with n elements
Thus, each of these collections has 2^n elements.
Open the Galton board app.
1. Move the ball from (0, 0) to (10, 6) along a path of your choice. Note the corresponding bit string and subset.
2. Generate the bit string 0111001010. Note the corresponding subset and path.
3. Generate the subset {2, 4, 5, 9, 10}. Note the corresponding bit string and path.
4. Generate all paths from (0, 0) to (4, 2). How many paths are there?
Answer
This page titled 1.7: Counting Measure is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.8: Combinatorial Structures
The purpose of this section is to study several combinatorial structures that are of basic importance in probability.
Permutations
Suppose that D is a set with n ∈ N elements. A permutation of length k ∈ {0, 1, …, n} from D is an ordered sequence of distinct elements of D; that is, a sequence of the form (x_1, x_2, …, x_k) where x_i ∈ D for each i and x_i ≠ x_j for i ≠ j.
Statistically, a permutation of length k from D corresponds to an ordered sample of size k chosen without replacement from the population D.
The number of permutations of length k from an n element set is
n^{(k)} = n(n − 1) ⋯ (n − k + 1)   (1.8.1)
Proof
By convention, n^{(0)} = 1. Recall that, in general, a product over an empty index set is 1. Note that n^{(k)} has k factors, starting at n, and with each subsequent factor one less than the previous factor. Some alternate notations for the number of permutations of size k from a set of n objects are P(n, k), P_{n,k}, and _nP_k.
The number of permutations of length n from the n element set D (these are called simply permutations of D) is
n! = n^{(n)} = n(n − 1) ⋯ (1)   (1.8.2)
The function on N given by n ↦ n! is the factorial function. The general permutation formula in (2) can be written in terms of factorials:
For n ∈ N and k ∈ {0, 1, …, n},
n^{(k)} = n! / (n − k)!   (1.8.3)
Although this formula is succinct, it's not always a good idea numerically. If n and n − k are large, n! and (n − k)! are enormous, and division of the first by the second can lead to significant round-off errors.
Note that the basic permutation formula in (2) is defined for every real number n and nonnegative integer k. This extension is sometimes referred to as the generalized permutation formula. Actually, we will sometimes need an even more general formula of this type (particularly in the sections on Pólya's urn and the beta-Bernoulli process). For a ∈ R, s ∈ R, and k ∈ N, define
a^{(s,k)} = a(a + s)(a + 2s) ⋯ [a + (k − 1)s]   (1.8.4)
1. a^{(0,k)} = a^k
2. a^{(−1,k)} = a^{(k)}
3. a^{(1,k)} = a(a + 1) ⋯ (a + k − 1)
4. 1^{(1,k)} = k!
The product a^{(−1,k)} = a^{(k)} (our ordinary permutation formula) is sometimes called the falling power of a of order k, while a^{(1,k)} is called the rising power of a of order k, and is sometimes denoted a^{[k]}. Note that a^{(0,k)} is the ordinary kth power of a. In general, note that a^{(s,k)} has k factors, starting at a and with each subsequent factor obtained by adding s to the previous factor.
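As an illustrative addition (the helper name is hypothetical), the generalized product a^{(s,k)} is easy to compute directly:

```python
def gen_perm(a, s, k):
    """Generalized permutation product a^{(s,k)} = a(a+s)(a+2s)...[a+(k-1)s].
    s = -1 gives the falling power a^{(k)}; s = 1 gives the rising power."""
    result = 1
    for i in range(k):
        result *= a + i * s
    return result

assert gen_perm(5, -1, 3) == 5 * 4 * 3   # falling power 5^{(3)} = 60
assert gen_perm(5, 1, 3) == 5 * 6 * 7    # rising power = 210
assert gen_perm(1, 1, 4) == 24           # 1^{(1,k)} = k!
assert gen_perm(5, 0, 3) == 5 ** 3       # ordinary power
```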
Combinations
Consider again a set D with n ∈ N elements. A combination of size k ∈ {0, 1, …, n} from D is an (unordered) subset of k distinct elements of D. Thus, a combination of size k from D has the form {x_1, x_2, …, x_k}, where x_i ∈ D for each i and x_i ≠ x_j for i ≠ j.
Statistically, a combination of size k from D corresponds to an unordered sample of size k chosen without replacement from the population D. Note that for each combination of size k from D, there are k! distinct orderings of the elements of that combination. Each of these is a permutation of length k from D. The number of combinations of size k from an n-element set is denoted by \binom{n}{k}. Some alternate notations are C(n, k), C_{n,k}, and _nC_k.
The number of combinations of size k from an n element set is
\binom{n}{k} = n^{(k)} / k! = n! / [k!(n − k)!]   (1.8.5)
Proof
The number \binom{n}{k} is called a binomial coefficient. Note that the formula makes sense for any real number n and nonnegative integer k, since this is true of the generalized permutation formula n^{(k)}. With this extension, \binom{n}{k} is called the generalized binomial coefficient. Note that if n and k are positive integers and k > n then \binom{n}{k} = 0. By convention, we will also define \binom{n}{k} = 0 if k < 0. This convention sometimes simplifies formulas.
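A brief sketch (an addition; names are hypothetical) of the generalized binomial coefficient via the falling power, which agrees with math.comb in the integer case:

```python
from fractions import Fraction
from math import comb, factorial

def gen_binom(n, k):
    """Generalized binomial coefficient n^{(k)}/k!, defined for any real
    (here rational) n and integer k; equals 0 when k < 0 by convention."""
    if k < 0:
        return Fraction(0)
    falling = Fraction(1)
    for i in range(k):
        falling *= Fraction(n) - i
    return falling / factorial(k)

assert gen_binom(10, 3) == comb(10, 3)              # matches the integer case
assert gen_binom(-1, 4) == 1                        # (-1 choose k) = (-1)^k
assert gen_binom(Fraction(1, 2), 2) == Fraction(-1, 8)
```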
Properties of Binomial Coefficients
For some of the identities below, there are two possible proofs. An algebraic proof, of course, should be based on (5). A combinatorial proof is constructed by showing that the left and right sides of the identity are two different ways of counting the same collection.
\binom{n}{n} = \binom{n}{0} = 1.
Algebraically, the last result is trivial. It also makes sense combinatorially: there is only one way to select a subset of size n from D with n elements (D itself), and there is only one way to select a subset of size 0 from D (the empty set ∅).
If n, k ∈ N with k ≤ n then
\binom{n}{k} = \binom{n}{n − k}   (1.8.6)
Combinatorial Proof
The next result is one of the most famous and most important. It's known as Pascal's rule and is named for Blaise Pascal.
If n, k ∈ N_+ with k ≤ n then
\binom{n}{k} = \binom{n−1}{k−1} + \binom{n−1}{k}   (1.8.7)
Combinatorial Proof
Recall that the Galton board is a triangular array of pegs: the rows are numbered n = 0, 1, … and the pegs in row n are numbered k = 0, 1, …, n. If each peg in the Galton board is replaced by the corresponding binomial coefficient, the resulting table of numbers is known as Pascal's triangle, named again for Pascal. By (8), each interior number in Pascal's triangle is the sum of the two numbers directly above it.
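A minimal sketch (an addition; names are hypothetical) that builds Pascal's triangle row by row using Pascal's rule:

```python
from math import comb

def pascal_triangle(n_max):
    """Rows 0..n_max of Pascal's triangle; each interior entry is the sum
    of the two entries directly above it (Pascal's rule)."""
    rows = [[1]]
    for n in range(1, n_max + 1):
        prev = rows[-1]
        row = [1] + [prev[k - 1] + prev[k] for k in range(1, n)] + [1]
        rows.append(row)
    return rows

for n, row in enumerate(pascal_triangle(10)):
    assert row == [comb(n, k) for k in range(n + 1)]
```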
The following result is the binomial theorem, and is the reason for the term binomial coefficient.
If a, b ∈ R and n ∈ N, then
(a + b)^n = ∑_{k=0}^{n} \binom{n}{k} a^k b^{n−k}   (1.8.8)
Combinatorial Proof
If j, k, n ∈ N_+ with j ≤ k ≤ n then
k^{(j)} \binom{n}{k} = n^{(j)} \binom{n−j}{k−j}   (1.8.9)
Combinatorial Proof
The following result is known as Vandermonde's identity, named for Alexandre-Théophile Vandermonde.
If m, n, k ∈ N with k ≤ m + n, then
∑_{j=0}^{k} \binom{m}{j} \binom{n}{k−j} = \binom{m+n}{k}   (1.8.10)
Combinatorial Proof
The next result is a general identity for the sum of binomial coefficients.
If m, n ∈ N with n ≤ m then
∑_{j=n}^{m} \binom{j}{n} = \binom{m+1}{n+1}   (1.8.11)
Combinatorial Proof
For an even more general version of the last result, see the section on Order Statistics in the chapter on Finite Sampling Models. The following identity for the sum of the first m positive integers is a special case of the last result.
If m ∈ N_+ then
∑_{j=1}^{m} j = \binom{m+1}{2} = (m + 1)m / 2   (1.8.12)
Proof
There is a one-to-one correspondence between each pair of the following collections. Hence the number of objects in each of these collections is \binom{n}{k}.
1. Subsets of size k from a set of n elements.
2. Bit strings of length n with exactly k 1's.
3. Paths in the Galton board from (0, 0) to (n, k).
Proof
The following identity is known as the alternating sum identity for binomial coefficients. It turns out to be useful in the Irwin-Hall probability distribution. We give the identity in two equivalent forms, one for falling powers and one for ordinary powers.
If n ∈ N_+, j ∈ {0, 1, …, n − 1} then
1. ∑_{k=0}^{n} \binom{n}{k} (−1)^k k^{(j)} = 0   (1.8.13)
2. ∑_{k=0}^{n} \binom{n}{k} (−1)^k k^j = 0   (1.8.14)
Proof
Our next identity deals with a generalized binomial coefficient.
If n, k ∈ N then
\binom{−n}{k} = (−1)^k \binom{n+k−1}{k}   (1.8.16)
Proof
In particular, note that \binom{−1}{k} = (−1)^k. Our last result in this discussion concerns the binomial operator and its inverse.
The binomial operator takes a sequence of real numbers a = (a_0, a_1, a_2, …) and returns the sequence of real numbers b = (b_0, b_1, b_2, …) by means of the formula
b_n = ∑_{k=0}^{n} \binom{n}{k} a_k,   n ∈ N   (1.8.19)
The inverse binomial operator recovers the sequence a from the sequence b by means of the formula
a_n = ∑_{k=0}^{n} (−1)^{n−k} \binom{n}{k} b_k,   n ∈ N   (1.8.20)
Proof
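The following sketch (an addition; function names are hypothetical) applies the binomial operator and its inverse, and checks that the round trip recovers the original sequence:

```python
from math import comb

def binomial_transform(a):
    """b_n = sum_k C(n, k) a_k."""
    return [sum(comb(n, k) * a[k] for k in range(n + 1)) for n in range(len(a))]

def inverse_binomial_transform(b):
    """a_n = sum_k (-1)^(n-k) C(n, k) b_k."""
    return [sum((-1) ** (n - k) * comb(n, k) * b[k] for k in range(n + 1))
            for n in range(len(b))]

a = [3, 1, 4, 1, 5, 9]
assert inverse_binomial_transform(binomial_transform(a)) == a
```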
Samples
The experiment of drawing a sample from a population is basic and important. There are two essential attributes of samples: whether or not order is important, and whether or not a sampled object is replaced in the population before the next draw. Suppose now that the population D contains n objects and we are interested in drawing a sample of k objects. Let's review what we know so far:
If order is important and sampled objects are replaced, then the samples are just elements of the product set D^k. Hence the number of samples is n^k.
If order is important and sampled objects are not replaced, then the samples are just permutations of size k chosen from D. Hence the number of samples is n^{(k)}.
If order is not important and sampled objects are not replaced, then the samples are just combinations of size k chosen from D. Hence the number of samples is \binom{n}{k}.
Thus, we have one case left to consider.
Unordered Samples With Replacement
An unordered sample chosen with replacement from D is called a multiset. A multiset is like an ordinary set except that elements may be repeated.
There is a one-to-one correspondence between each pair of the following collections:
1. Multisets of size k from a population D of n elements.
2. Bit strings of length n + k − 1 with exactly k 1s.
3. Nonnegative integer solutions (x_1, x_2, …, x_n) of the equation x_1 + x_2 + ⋯ + x_n = k.
Each of these collections has \binom{n+k−1}{k} members.
Proof
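A quick check of the multiset count using the standard library (an illustrative addition):

```python
from itertools import combinations_with_replacement
from math import comb

n, k = 5, 3  # population of 5; unordered samples of size 3 with replacement
multisets = list(combinations_with_replacement(range(n), k))
assert len(multisets) == comb(n + k - 1, k)  # C(7, 3) = 35
```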
Summary of Sampling Formulas
The following table summarizes the formulas for the number of samples of size k chosen from a population of n elements, based on the criteria of order and replacement.

Sampling formulas
| Number of samples | With order | Without order |
| With replacement | n^k | \binom{n+k−1}{k} |
| Without replacement | n^{(k)} | \binom{n}{k} |
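These four formulas are easy to evaluate directly; here is a small sketch (an addition) that computes the table for n = 10 and k = 4, matching a later exercise:

```python
from math import comb, perm

n, k = 10, 4
print("ordered, with replacement:    ", n ** k)              # 10000
print("ordered, without replacement: ", perm(n, k))          # 5040
print("unordered, with replacement:  ", comb(n + k - 1, k))  # 715
print("unordered, without replacement:", comb(n, k))         # 210
```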
Multinomial Coefficients
Partitions of a Set
Recall that the binomial coefficient \binom{n}{j} is the number of subsets of size j from a set S of n elements. Note also that when we select a subset A of size j from S, we effectively partition S into two disjoint subsets of sizes j and n − j, namely A and A^c. A natural generalization is to partition S into a union of k distinct, pairwise disjoint subsets (A_1, A_2, …, A_k) where #(A_i) = n_i for each i ∈ {1, 2, …, k}. Of course we must have n_1 + n_2 + ⋯ + n_k = n.
The number of ways to partition a set of n elements into a sequence of k sets of sizes (n_1, n_2, …, n_k) is
\binom{n}{n_1} \binom{n − n_1}{n_2} ⋯ \binom{n − n_1 − ⋯ − n_{k−1}}{n_k} = n! / (n_1! n_2! ⋯ n_k!)   (1.8.22)
Proof
The number in (18) is called a multinomial coefficient and is denoted by
\binom{n}{n_1, n_2, ⋯, n_k} = n! / (n_1! n_2! ⋯ n_k!)   (1.8.23)
If n, k ∈ N with k ≤ n then
\binom{n}{k, n − k} = \binom{n}{k}   (1.8.24)
Combinatorial Proof
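A direct sketch (an addition; the function name is hypothetical) of the multinomial coefficient as a product of binomial coefficients:

```python
from math import comb, factorial

def multinomial(*sizes):
    """C(n; n_1, ..., n_k) = C(n, n_1) C(n - n_1, n_2) ... = n!/(n_1! ... n_k!)."""
    n = sum(sizes)
    result, remaining = 1, n
    for size in sizes:
        result *= comb(remaining, size)
        remaining -= size
    return result

assert multinomial(2, 3, 4) == factorial(9) // (factorial(2) * factorial(3) * factorial(4))
assert multinomial(4, 6) == comb(10, 4)  # the binomial special case
```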
Sequences
Consider now the set T = {1, 2, …, k}^n. Elements of this set are sequences of length n in which each coordinate is one of k values. Thus, these sequences generalize the bit strings of length n. Again, let (n_1, n_2, …, n_k) be a sequence of nonnegative integers with ∑_{i=1}^{k} n_i = n.
There is a one-to-one correspondence between the following collections:
1. Partitions of S into pairwise disjoint subsets (A_1, A_2, …, A_k) where #(A_j) = n_j for each j ∈ {1, 2, …, k}.
2. Sequences in {1, 2, …, k}^n in which j occurs n_j times for each j ∈ {1, 2, …, k}.
Proof
It follows that the number of elements in both of these collections is
\binom{n}{n_1, n_2, ⋯, n_k} = n! / (n_1! n_2! ⋯ n_k!)   (1.8.25)
Permutations with Indistinguishable Objects
Suppose now that we have n objects of k different types, with n_i elements of type i for each i ∈ {1, 2, …, k}. Moreover, objects of a given type are considered identical. There is a one-to-one correspondence between the following collections:
1. Sequences in {1, 2, …, k}^n in which j occurs n_j times for each j ∈ {1, 2, …, k}.
2. Distinguishable permutations of the n objects.
Proof
Once again, it follows that the number of elements in both collections is
\binom{n}{n_1, n_2, ⋯, n_k} = n! / (n_1! n_2! ⋯ n_k!)   (1.8.26)
The Multinomial Theorem
The following result is the multinomial theorem, which is the reason for the name of the coefficients.
If x_1, x_2, …, x_k ∈ R and n ∈ N then
(x_1 + x_2 + ⋯ + x_k)^n = ∑ \binom{n}{n_1, n_2, ⋯, n_k} x_1^{n_1} x_2^{n_2} ⋯ x_k^{n_k}   (1.8.27)
The sum is over sequences of nonnegative integers (n_1, n_2, …, n_k) with n_1 + n_2 + ⋯ + n_k = n. There are \binom{n+k−1}{n} terms in this sum.
Combinatorial Proof
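A numeric spot-check of the multinomial theorem (an illustrative addition; the helper name is hypothetical):

```python
from itertools import product
from math import factorial

def multinomial(*sizes):
    """n!/(n_1! ... n_k!) for nonnegative integer sizes summing to n."""
    result = factorial(sum(sizes))
    for size in sizes:
        result //= factorial(size)
    return result

x, n = (2.0, -1.0, 3.0), 4
lhs = sum(x) ** n
rhs = sum(
    multinomial(*counts) * x[0] ** counts[0] * x[1] ** counts[1] * x[2] ** counts[2]
    for counts in product(range(n + 1), repeat=3)
    if sum(counts) == n
)
assert abs(lhs - rhs) < 1e-9  # (2 - 1 + 3)^4 = 256
```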
Computational Exercises
Arrangements
In a race with 10 horses, the first, second, and third place finishers are noted. How many outcomes are there?
Answer
Eight persons, consisting of four male-female couples, are to be seated in a row of eight chairs. How many seating arrangements are there in each of the following cases:
1. There are no other restrictions.
2. The men must sit together and the women must sit together.
3. The men must sit together.
4. Each couple must sit together.
Answer
Suppose that n people are to be seated at a round table. How many seating arrangements are there? The mathematical significance of a round table is that there is no dedicated first chair.
Answer
Twelve books, consisting of 5 math books, 4 science books, and 3 history books, are arranged on a bookshelf. Find the number of arrangements in each of the following cases:
1. There are no restrictions.
2. The books of each type must be together.
3. The math books must be together.
Answer
Find the number of distinct arrangements of the letters in each of the following words:
1. statistics
2. probability
3. mississippi
4. tennessee
5. alabama
Answer
A child has 12 blocks; 5 are red, 4 are green, and 3 are blue. In how many ways can the blocks be arranged in a line if blocks of a given color are considered identical?
Answer
Code Words
A license tag consists of 2 capital letters and 5 digits. Find the number of tags in each of the following cases:
1. There are no other restrictions.
2. The letters and digits are all different.
Answer
Committees
A club has 20 members; 12 are women and 8 are men. A committee of 6 members is to be chosen. Find the number of different committees in each of the following cases:
1. There are no other restrictions.
2. The committee must have 4 women and 2 men.
3. The committee must have at least 2 women and at least 2 men.
Answer
Suppose that a club with 20 members plans to form 3 distinct committees with 6, 5, and 4 members, respectively. In how many ways can this be done?
Answer
Cards
A standard card deck can be modeled by the Cartesian product set
D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠}   (1.8.28)
where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and the second coordinate encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q♡).
A poker hand (in draw poker) consists of 5 cards dealt without replacement and without regard to order from a deck of 52 cards. Find the number of poker hands in each of the following cases:
1. There are no restrictions.
2. The hand is a full house (3 cards of one kind and 2 of another kind).
3. The hand has 4 of a kind.
4. The cards are all in the same suit (so the hand is a flush or a straight flush).
Answer
The game of poker is studied in detail in the chapter on Games of Chance.
A bridge hand consists of 13 cards dealt without replacement and without regard to order from a deck of 52 cards. Find the number of bridge hands in each of the following cases:
1. There are no restrictions.
2. The hand has exactly 4 spades.
3. The hand has exactly 4 spades and 3 hearts.
4. The hand has exactly 4 spades, 3 hearts, and 2 diamonds.
Answer
A hand of cards that has no cards in a particular suit is said to be void in that suit. Use the inclusion-exclusion formula to find each of the following:
1. The number of poker hands that are void in at least one suit.
2. The number of bridge hands that are void in at least one suit.
Answer
A bridge hand that has no honor cards (cards of denomination 10, jack, queen, king, or ace) is said to be a Yarborough, in honor of the Second Earl of Yarborough. Find the number of Yarboroughs.
Answer
A bridge deal consists of dealing 13 cards (a bridge hand) to each of 4 distinct players (generically referred to as north, south, east, and west) from a standard deck of 52 cards. Find the number of bridge deals.
Answer
This staggering number is about the same order of magnitude as the number of atoms in your body, and is one of the reasons that bridge is a rich and interesting game.
Find the number of permutations of the cards in a standard deck.
Answer
This number is even more staggering. Indeed, if you perform the experiment of dealing all 52 cards from a well-shuffled deck, you may well generate a pattern of cards that has never been generated before, thereby ensuring your immortality. Actually, this experiment shows that, in a sense, rare events can be very common. By the way, Persi Diaconis has shown that it takes about seven standard riffle shuffles to thoroughly randomize a deck of cards.
Dice and Coins
Suppose that 5 distinct, standard dice are rolled and the sequence of scores recorded.
1. Find the number of sequences.
2. Find the number of sequences with the scores all different.
Answer
Suppose that 5 identical, standard dice are rolled. How many outcomes are there?
Answer
A coin is tossed 10 times and the outcome is recorded as a bit string (where 1 denotes heads and 0 tails).
1. Find the number of outcomes.
2. Find the number of outcomes with exactly 4 heads.
3. Find the number of outcomes with at least 8 heads.
Answer
Polynomial Coefficients
Find the coefficient of x^3 y^4 in (2x − 4y)^7.
Answer
Find the coefficient of x^5 in (2 + 3x)^8.
Answer
Find the coefficient of x^3 y^7 z^5 in (x + y + z)^{15}.
Answer
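These coefficients follow directly from the binomial and multinomial theorems; a worked sketch (an addition):

```python
from math import comb, factorial

# Coefficient of x^3 y^4 in (2x - 4y)^7: C(7, 3) * 2^3 * (-4)^4
print(comb(7, 3) * 2 ** 3 * (-4) ** 4)   # 71680

# Coefficient of x^5 in (2 + 3x)^8: C(8, 5) * 3^5 * 2^3
print(comb(8, 5) * 3 ** 5 * 2 ** 3)      # 108864

# Coefficient of x^3 y^7 z^5 in (x + y + z)^15: 15!/(3! 7! 5!)
print(factorial(15) // (factorial(3) * factorial(7) * factorial(5)))  # 360360
```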
The Galton Board
In the Galton board game,
1. Move the ball from (0, 0) to (10, 6) along a path of your choice. Note the corresponding bit string and subset.
2. Generate the bit string 0011101001. Note the corresponding subset and path.
3. Generate the subset {1, 4, 5, 7, 8, 10}. Note the corresponding bit string and path.
4. Generate all paths from (0, 0) to (5, 3). How many paths are there?
Answer
Generate Pascal's triangle up to n = 10.
Samples
A shipment contains 12 good and 8 defective items. A sample of 5 items is selected. Find the number of samples that contain exactly 3 good items.
Answer
In the (n, k) lottery, k numbers are chosen without replacement from the set of integers from 1 to n (where n, k ∈ N_+ and k < n). Order does not matter.
1. Find the number of outcomes in the general (n, k) lottery.
2. Explicitly compute the number of outcomes in the (44, 6) lottery (a common format).
Answer
For more on this topic, see the section on Lotteries in the chapter on Games of Chance.
Explicitly compute each formula in the sampling table above when n = 10 and k = 4.
Answer
Greetings
Suppose there are n people who shake hands with each other. How many handshakes are there?
Answer
There are m men and n women. The men shake hands with each other; the women hug each other; and each man bows to each woman.
1. How many handshakes are there?
2. How many hugs are there?
3. How many bows are there?
4. How many greetings are there?
Answer
Integer Solutions
Find the number of integer solutions of x_1 + x_2 + x_3 = 10 in each of the following cases:
1. x_i ≥ 0 for each i.
2. x_i > 0 for each i.
Answer
Generalized Coefficients
Compute each of the following:
1. (−5)^{(3)}
2. (1/2)^{(4)}
3. (−1/3)^{(5)}
Answer
Compute each of the following:
1. \binom{1/2}{3}
2. \binom{−5}{4}
3. \binom{−1/3}{5}
Answer
Birthdays
Suppose that n persons are selected and their birthdays noted. (Ignore leap years, so that a year has 365 days.)
1. Find the number of outcomes.
2. Find the number of outcomes with distinct birthdays.
Answer
Chess
Note that the squares of a chessboard are distinct, and in fact are often identified with the Cartesian product set
{a, b, c, d, e, f, g, h} × {1, 2, 3, 4, 5, 6, 7, 8}   (1.8.29)
Find the number of ways of placing 8 rooks on a chessboard so that no rook can capture another in each of the following cases:
1. The rooks are distinguishable.
2. The rooks are indistinguishable.
Answer
Gifts
Suppose that 20 identical candies are distributed to 4 children. Find the number of distributions in each of the following cases:
1. There are no restrictions.
2. Each child must get at least one candy.
Answer
In the song The Twelve Days of Christmas, find the number of gifts given to the singer by her true love. (Note that the singer starts afresh with gifts each day, so that, for example, the true love gets a new partridge in a pear tree on each of the 12 days.)
Answer
Teams
Suppose that 10 kids are divided into two teams of 5 each for a game of basketball. In how many ways can this be done in each of the following cases:
1. The teams are distinguishable (for example, one team is labeled “Alabama” and the other team is labeled “Auburn”).
2. The teams are not distinguishable.
Answer
This page titled 1.8: Combinatorial Structures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.9: Topological Spaces
Topology is one of the major branches of mathematics, along with other such branches as algebra (in the broad sense of algebraic structures) and analysis. Topology deals with spatial concepts involving distance, closeness, separation, convergence, and continuity. Needless to say, entire series of books have been written about the subject. Our goal in this section and the next is simply to review the basic definitions and concepts of topology that we will need for our study of probability and stochastic processes. You may want to refer to this section as needed.
Basic Theory
Definitions
A topological space consists of a nonempty set S and a collection S of subsets of S that satisfy the following properties:
1. S ∈ S and ∅ ∈ S
2. If A ⊆ S then ⋃ A ∈ S
3. If A ⊆ S and A is finite, then ⋂ A
∈ S
If A ∈ S, then A is said to be open and A^c is said to be closed. The collection S of open sets is a topology on S.
So the union of an arbitrary number of open sets is still open, as is the intersection of a finite number of open sets. The universal set S and the empty set ∅ are both open and closed. There may or may not exist other subsets of S with this property.
Suppose that S is a nonempty set, and that S and T are topologies on S. If S ⊆ T then T is finer than S, and S is coarser than T.
Coarser than defines a partial order on the collection of topologies on S. That is, if R, S, T are topologies on S then
1. R is coarser than R, the reflexive property.
2. If R is coarser than S and S is coarser than R then R = S, the anti-symmetric property.
3. If R is coarser than S and S is coarser than T then R is coarser than T, the transitive property.
A topology can be characterized just as easily by means of closed sets as open sets.
Suppose that S is a nonempty set. A collection of subsets C is the collection of closed sets for a topology on S if and only if
1. S ∈ C and ∅ ∈ C
2. If A ⊆ C then ⋂ A ∈ C
3. If A ⊆ C and A is finite then ⋃ A
∈ C
.
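As a concrete illustration (an addition to the text; the function name is hypothetical), the open-set axioms are easy to check by brute force on a finite set:

```python
def is_topology(S, opens):
    """Check the open-set axioms on a finite set S; for a finite collection
    of sets, closure under pairwise unions and intersections suffices."""
    S, opens = frozenset(S), {frozenset(A) for A in opens}
    if S not in opens or frozenset() not in opens:
        return False
    for A in opens:
        for B in opens:
            if A | B not in opens or A & B not in opens:
                return False
    return True

assert is_topology({1, 2, 3}, [set(), {1}, {1, 2}, {1, 2, 3}])
assert not is_topology({1, 2, 3}, [set(), {1}, {2}, {1, 2, 3}])  # {1} ∪ {2} missing
```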
Proof Suppose that (S, S ) is a topological space, and that x ∈ S . A set A ⊆ S is a neighborhood of x if there exists x ∈ U ⊆A .
U ∈ S
with
So a neighborhood of a point x ∈ S is simply a set with an open subset that contains x. The idea is that points in a “small” neighborhood of x are “close” to x in a sense. An open set can be defined in terms of the neighborhoods of the points in the set. Suppose again that (S, S ) is a topological space. A set U
⊆S
is open if and only if U is a neighborhood of every x ∈ U
Proof Although the proof seems trivial, the neighborhood concept is how you should think of openness. A set U is open if every point in U has a set of “nearby points” that are also in U . Our next three definitions deal with topological sets that are naturally associated with a given subset. Suppose again that (S, S ) is a topological space and that A ⊆ S . The closure of A is the set cl(A) = ⋂{B ⊆ S : B is closed and A ⊆ B}
(1.9.1)
This is the smallest closed set containing A : 1. cl(A) is closed. 2. A ⊆ cl(A) . 3. If B is closed and A ⊆ B then cl(A) ⊆ B Proof Of course, if A is closed then A = cl(A) . Complementary to the closure of a set is the interior of the set. Suppose again that (S, S ) is a topological space and that A ⊆ S . The interior of A is the set int(A) = ⋃{U ⊆ S : U is open and U ⊆ A}
(1.9.2)
This set is the largest open subset of A : 1. int(A) is open. 2. int(A) ⊆ A . 3. If U is open and U
⊆A
then U
⊆ int(A)
Proof Of course, if A is open then A = int(A) . The boundary of a set is the set difference between the closure and the interior. Suppose again that (S, S ) is a topological space. The boundary of A is ∂(A) = cl(A) ∖ int(A) . This set is closed. Proof A topology on a set induces a natural topology on any subset of the set. Suppose that (S, S ) is a topological space and that R is a nonempty subset of S . Then R = {A ∩ R : A ∈ S } is a topology on R , known as the relative topology induced by S . Proof In the context of the previous result, note that if R is itself open, then the relative topology is R = {A ∈ S of R that are open in the original topology.
: A ⊆ R}
, the subsets
Separation Properties Separation properties refer to the ability to separate points or sets with disjoint open sets. Our first definition deals with separating two points. Suppose that (S, S ) is a topological space and that x, y are distinct points in S . Then x and y can be separated if there exist disjoint open sets U and V with x ∈ U and y ∈ V . If every pair of distinct points in S can be separated, then (S, S ) is called a Hausdorff space. Hausdorff spaces are named for the German mathematician Felix Hausdorff. There are weaker separation properties. For example, there could be an open set U that contains x but not y , and an open set V that contains y but not x, but no disjoint open sets that contain x and y . Clearly if every open set that contains one of the points also contains the other, then the points are indistinguishable from a topological viewpoint. In a Hausdorff space, singletons are closed. Suppose that (S, S ) is a Hausdorff space. Then {x} is closed for each x ∈ S . Proof Our next definition deals with separating a point from a closed set. Suppose again that (S, S ) is a topological space. A nonempty closed set A ⊆ S and a point x ∈ A can be separated if there exist disjoint open sets U and V with A ⊆ U and x ∈ V . If every nonempty closed set A and point x ∈ A can be separated, then the space (S, S ) is regular. c
c
Clearly if (S, S ) is a regular space and singleton sets are closed, then (S, S ) is a Hausdorff space.
1.9.2
https://stats.libretexts.org/@go/page/10124
Bases Topologies, like other set structures, are often defined by first giving some basic sets that should belong to the collection, and the extending the collection so that the defining axioms are satisfied. This idea is motivation for the following definition: Suppose again that (S, S ) is a topological space. A collection B ⊆ S is a base for S if every set in S can be written as a union of sets in B . So, a base is a smaller collection of open sets with the property that every other open set can be written as a union of basic open sets. But again, we often want to start with the basic open sets and extend this collection to a topology. The following theorem gives the conditions under which this can be done. Suppose that S is a nonempty set. A collection B of subsets of S is a base for a topology on S if and only if 1. S = ⋃ B 2. If A, B ∈ B and x ∈ A ∩ B , there exists C
∈ B
with x ∈ C
⊆ A∩B
Proof Here is a slightly weaker condition, but one that is often satisfied in practice. Suppose that S is a nonempty set. A collection B of subsets of S that satisfies the following properties is a base for a topology on S : 1. S = ⋃ B 2. If A, B ∈ B then A ∩ B ∈ B Part (b) means that B is closed under finite intersections.
Compactness Our next discussion considers another very important type of set. Some additional terminology will make the discussion easier. Suppose that S is a set and A ⊆ S . A collection of subsets A of S is said to cover A if A ⊆ A . So the word cover simply means a collection of sets whose union contains a given set. In a topological space, we can have open an open cover (that is, a cover with open sets), a closed cover (that is, a cover with closed sets), and so forth. Suppose again that (S, S ) is a topological space. A set C ⊆ S is compact if every open cover of That is, if A ⊆ S with C ⊆ ⋃ A then there exists a finite B ⊆ A with C ⊆ ⋃ B .
C
has a finite sub-cover.
So intuitively, a compact set is compact in the ordinary sense of the word. No matter how “small” are the open sets in the covering of C , there will always exist a finite number of the open sets that cover C . Suppose again that (S, S ) is a topological space and that C
⊆S
is a compact. If B ⊆ C is closed, then B is also compact.
Proof Compactness is also preserved under finite unions. Suppose again that (S, S ) is a topological space, and that C =⋃ C is compact. i∈I
Ci ⊆ S
is compact for each
i
in a finite index set I . Then
i
Proof As we saw above, closed subsets of a compact set are themselves compact. In a Hausdorff space, a compact set is itself closed. Suppose that (S, S ) is a Hausdorff space. If C
⊆S
is compact then C is closed.
Proof Also in a Hausdorff space, a point can be separated from a compact set that does not contain the point. Suppose that (S, S ) is a Hausdorff space. If x ∈ S , C V with x ∈ U and C ⊆ V
⊆S
is compact, and x ∉ C , then there exist disjoint open sets U and
1.9.3
https://stats.libretexts.org/@go/page/10124
Proof In a Hausdorff space, if a point has a neighborhood with a compact boundary, then there is a smaller, closed neighborhood. Suppose again that (S, S ) is a Hausdorff space. If x ∈ S and A is a neighborhood of x with ∂(A) compact, then there exists a closed neighborhood B of x with B ⊆ A . Proof Generally, local properties in a topological space refer to properties that hold on the neighborhoods of a point x ∈ S . A topological space (S, S ) is locally compact if every point x ∈ S has a compact neighborhood. This definition is important because many of the topological spaces that occur in applications (like probability) are not compact, but are locally compact. Locally compact Hausdorff spaces have a number of nice properties. In particular, in a locally compact Hausdorff space, there are arbitrarily “small” compact neighborhoods of a point. Suppose that (S, S ) is a locally compact Hausdorff space. If x ∈ S and A is a neighborhood of x, then there exists a compact neighborhood B of x with B ⊆ A . Proof
Countability Axioms Our next discussion concerns topologies that can be “countably constructed” in a certain sense. Such axioms limit the “size” of the topology in a way, and are often satisfied by important topological spaces that occur in applications. We start with an important preliminary definition. Suppose that (S, S ) is a topological space. A set D ⊆ S is dense if U ∩ D is nonempty for every nonempty U
∈ S
.
Equivalently, D is dense if every neighborhood of a point x ∈ S contains an element of D. So in this sense, one can find elements of D “arbitrarily close” to a point x ∈ S . Of course, the entire space S is dense, but we are usually interested in topological spaces that have dense sets of limited cardinality. Suppose again that (S, S ) is a topological space. A set D ⊆ S is dense if and only if cl(D) = S . Proof Here is our first countability axiom: A topological space (S, S ) is separable if there exists a countable dense subset. So in a separable space, there is a countable set D with the property that there are points in D “arbitrarily close” to every x ∈ S . Unfortunately, the term separable is similar to separating points that we discussed above in the definition of a Hausdorff space. But clearly the concepts are very different. Here is another important countability axiom. A topological space (S, S ) is second countable if it has a countable base. So in a second countable space, there is a countable collection of open sets B with the property that every other open set is a union of sets in B . Here is how the two properties are related: If a topological space (S, S ) is second countable then it is separable. Proof As the terminology suggests, there are other axioms of countability (such as first countable), but the two we have discussed are the most important.
Connected and Disconnected Spaces This discussion deals with the situation in which a topological space falls into two or more separated pieces, in a sense.
1.9.4
https://stats.libretexts.org/@go/page/10124
A topological space (S, S) is disconnected if there exist nonempty, disjoint, open sets U and V with S = U ∪ V. If (S, S) is not disconnected, then it is connected. Since U = V^c, it follows that U and V are also closed. So the space is disconnected if and only if there exists a proper subset U that is open and closed (sadly, such sets are sometimes called clopen). If S is disconnected, then S consists of two pieces U and V, and the points in U are not “close” to the points in V, in a sense. To study S topologically, we could simply study U and V separately, with their relative topologies.
Convergence There is a natural definition for a convergent sequence in a topological space, but the concept is not as useful as one might expect. Suppose again that (S, S ) is a topological space. A sequence of points (x : n ∈ N neighborhood A of x there exists m ∈ N such that x ∈ A for n > m . We write x
+)
n
+
n
n
in S converges to as n → ∞ .
x ∈ S
if for every
→ x
So for every neighborhood of x, regardless of how “small”, all but finitely many of the terms of the sequence will be in the neighborhood. One would naturally hope that limits, when they exist, are unique, but this will only be the case if points in the space can be separated. Suppose that (S, S ) is a Hausdorff space. If x → y ∈ S as n → ∞ , then x = y .
(xn : n ∈ N+ )
is a sequence of points in
S
with
xn → x ∈ S
as
n → ∞
and
n
Proof On the other hand, if distinct points x,
y ∈ S
cannot be separated, then any sequence that converges to x will also converge to y .
Continuity Continuity of functions is one of the most important concepts to come out of general topology. The idea, of course, is that if two points are close together in the domain, then the functional values should be close together in the range. The abstract topological definition, based on inverse images is very simple, but not very intuitive at first. Suppose that A ∈ T .
(S, S )
and
(T , T )
are topological spaces. A function
f : S → T
is continuous if
f
−1
(A) ∈ S
for every
So a continuous function has the property that the inverse image of an open set (in the range space) is also open (in the domain space). Continuity can equivalently be expressed in terms of closed subsets. Suppose again that (S, S ) and (T , T ) are topological spaces. A function f closed subset of S for every closed subset A of T .
: S → T
is continuous if and only if f
−1
(A)
is a
Proof Continuity preserves limits. Suppose again that (S, S ) and (T , T ) are topological spaces, and that f : S → T is continuous. If sequence of points in S with x → x ∈ S as n → ∞ , then f (x ) → f (x) as n → ∞ . n
(xn : n ∈ N+ )
is a
n
Proof The converse of the last result is not true, so continuity of functions in a general topological space cannot be characterized in terms of convergent sequences. There are objects like sequences but more general, known as nets, that do characterize continuity, but we will not study these. Composition, the most important way to combine functions, preserves continuity. Suppose that g∘ f : S → U
, (T , T ), and is continuous.
(S, S )
(U , U )
are topological spaces. If
f : S → T
and
g : T → U
are continuous, then
Proof The next definition is very important. A recurring theme in mathematics is to recognize when two mathematical structures of a certain type are fundamentally the same, even though they may appear to be different.
1.9.5
https://stats.libretexts.org/@go/page/10124
Suppose again that (S, S ) and (T , T ) are topological spaces. A one-to-one function f that maps S onto T with both f and f continuous is a homeomorphism from (S, S ) to (T , T ). When such a function exists, the topological spaces are said to be homeomorphic. −1
Note that in this definition, f refers to the inverse function, not the mapping of inverse images. If f is a homeomorphism, then A is open in S if and only if f (A) is open in T . It follows that the topological spaces are essentially equivalent: any purely topological property can be characterized in terms of open sets and therefore any such property is shared by the two spaces. −1
Being homeomorphic is an equivalence relation on the collection of topological spaces. That is, for spaces and (U , U ) ,
,
(S, S ) (T , T )
,
1. (S, S ) is homeomorphic to (S, S ) (the reflexive property). 2. If (S, S ) is homeomorphic to (T , T ) then (T , T ) is homeomorphic to (S, S ) (the symmetric property). 3. If (S, S ) is homeomorphic to (T , T ) and (T , T ) is homeomorphic to (U , U ) then (S, S ) is homeomorphic to (U , U ) (the transitive property). Proof Continuity can also be defined locally, by restricting attention to the neighborhoods of a point. Suppose again that (S, S ) and (T , T ) are topological spaces, and that x ∈ S . A function f : S → T is continuous at x if f (B) is a neighborhood of x in S whenever B is a neighborhood of f (x) in T . If A ⊆ S , then f is continuous on A is f is continuous at each x ∈ A . −1
Suppose again that (S, S ) and continuous at each x ∈ S .
(T , T )
are topological spaces, and that
f : S → T
. Then
f
is continuous if and only if
f
is
Proof Properties that are defined for a topological space can be applied to a subset of the space, with the relative topology. But one has to be careful. Suppose again that (S, S ) are topological spaces and that f : S → T . Suppose also that A ⊆ S , and let A denote the relative topology on A induced by S , and let f denote the restriction of f to A . If f is continuous on A then f is continuous relative to the spaces (A, A ) and (T , T ). The converse is not generally true. A
A
Proof
Product Spaces Cartesian product sets are ubiquitous in mathematics, so a natural question is this: given topological spaces what is a natural topology for S × T ? The answer is very simple using the concept of a base above.
(S, S )
and
(T , T )
,
Suppose that (S, S ) and (T , T ) are topological spaces. The collection B = {A × B : A ∈ S , B ∈ T } is a base for a topology on S × T , called the product topology associated with the given spaces. Proof So basically, we want the product of open sets to be open in the product space. The product topology is the smallest topology that makes this happen. The definition above can be extended to very general product spaces, but to state the extension, let's recall how general product sets are constructed. Suppose that S is a set for each i in a nonempty index set I . Then the product set ∏ S is the set of all functions x : I → ⋃ S such that x(i) ∈ S for i ∈ I . i
i∈I
Suppose that (S
i,
Si )
i
i∈I
i
i
is a topological space for each i in a nonempty index set I . Then
B = { ∏ Ai : Ai ∈ Si for all i ∈ I and Ai = Si for all but finitely many i ∈ I }
(1.9.8)
i∈I
is a base for a topology on ∏
i∈I
Si
, known as the product topology associated with the given spaces.
Proof
1.9.6
https://stats.libretexts.org/@go/page/10124
Suppose again that S is a set for each i in a nonempty index set I . For j ∈ I , recall that projection function defined by p (x) = x(j) . i
pj : ∏
i∈I
Si → Sj
is
j
Suppose again that (S , S ) is a topological space for each The projection function p is continuous for each j ∈ I . i
i
i ∈ I
, and give the product spacee
∏
i∈I
Si
the product topology.
j
Proof As a special case of all this, suppose that (S, S ) is a topological space, and that S = S for all i ∈ I . Then the product space ∏ S is the set of all functions from I to S , sometimes denoted S . In this case, the base for the product topology on S is i
I
i∈I
I
i
B = { ∏ Ai : Ai ∈ S for all i ∈ I and Ai = S for all but finitely many i ∈ I }
(1.9.9)
i∈I
For j ∈ I , the projection function p just returns the value of a function x : I continuous. Note in particular that no topology is necessary on the domain I . j
→ S
at j : p
j (x)
= x(j)
. This projection function is
Examples and Special Cases The Trivial Topology Suppose that S is a nonempty set. Then {S, ∅} is a topology on S , known as the trivial topology. With the trivial topology, no two distinct points can be separated. So the topology cannot distinguish between points, in a sense, and all points in S are close to each other. Clearly, this topology is not very interesting, except as a place to start. Since there is only one nonempty open set (S itself), the space is connected, and every subset of S is compact. A sequence in S converges to every point in S . Suppose that S has the trivial topology and that (T , T ) is another topological space. 1. Every function from T to S is continuous. 2. If (T , T ) is a Hausdorff space then the only continuous functions from S to T are constant functions. Proof
The Discrete Topology At the opposite extreme from the trivial topology, with the smallest collection of open sets, is the discrete topology, with the largest collection of open sets. Suppose that S is a nonempty set. The power set topology.
P(S)
(consisting of all subsets of
S
) is a topology, known as the discrete
So in the discrete topology, every set is both open and closed. All points are separated, and in a sense, widely so. No point is close to another point. With the discrete topology, S is Hausdorff, disconnected, and the compact subsets are the finite subsets. A sequence in S converges to x ∈ S , if and only if all but finitely many terms of the sequence are x. Suppose that S has the discrete topology and that (T , S ) is another topological space. 1. Every function from S to T is continuous. 2. If (T , T ) is connected, then the only continuous functions from T to S are constant functions. Proof
Euclidean Spaces The standard topologies used in the Euclidean spaces are the topologies built from open sets that you familiar with. For the set of real numbers R, let B = {(a, b) : a, b ∈ R, topology R on R, known as the Euclidean topology.
a < b}
, the collection of open intervals. Then
B
is a base for a
Proof
1.9.7
https://stats.libretexts.org/@go/page/10124
The space (R, R) satisfies many properties that are motivations for definitions in topology in the first place. The convergence of a sequence in R, in the topological sense given above, is the same as the definition of convergence in calculus. The same statement holds for the continuity of a function f from R to R. Before listing other topological properties, we give a characterization of compact sets, known as the Heine-Borel theorem, named for Eduard Heine and Émile Borel. Recall that A ⊆ R is bounded if A ⊆ [a, b] for some a, b ∈ R with a < b . A subset C
⊆R
is compact if and only if C is closed and bounded.
So in particular, closed, bounded intervals of the form [a, b] with a,
b ∈ R
and a < b are compact.
The space (R, R) has the following properties: 1. Hausdorff. 2. Connected. 3. Locally compact. 4. Second countable. Proof As noted in the proof,
, the set of rationals, is countable and is dense in R. Another countable, dense subset is , the set of dyadic rationals (or binary rationals). For the higher-dimensional Euclidean spaces, we can use the product topology based on the topology of the real numbers. n
D = {j/ 2
Q
: n ∈ N and j ∈ Z}
For n ∈ {2, 3, …}, let (R topology on R .
n
, Rn )
be the n -fold product space corresponding to the space
(R, R)
. Then
Rn
is the Euclidean
n
A subset A ⊆ R “block”.
is bounded if there exists a,
n
A subset C
n
⊆R
The space (R
n
b ∈ R
with a < b such that
n
A ⊆ [a, b]
, so that
A
fits inside of an n -dimensional
is compact if and only if C is closed and bounded.
, Rn )
has the following properties:
1. Hausdorff. 2. Connected. 3. Locally compact. 4. Second countable. This page titled 1.9: Topological Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.10: Metric Spaces
Basic Theory
Most of the important topological spaces that occur in applications (like probability) have an additional structure that gives a distance between points in the space.
Definitions A metric space consists of a nonempty set x, y, z ∈ S ,
S
and a function
d : S × S → [0, ∞)
that satisfies the following axioms: For
1. d(x, y) = 0 if and only if x = y . 2. d(x, y) = d(y, x). 3. d(x, z) ≤ d(x, y) + d(y, z) . The function d is known as a metric or a distance function. So as the name suggests, d(x, y) is the distance between points x, y ∈ S . The axioms are intended to capture the essential properties of distance from geometry. Part (a) is the positive property; the distance is strictly positive if and only if the points are distinct. Part (b) is the symmetric property; the distance from x to y is the same as the distance from y to x. Part (c) is the triangle inequality; going from x to z cannot be longer than going from x to z by way of a third point y . Note that if (S, d) is a metric space, and A is a nonempty subset of S , then the set A with space (known as a subspace). The next definitions also come naturally from geometry:
d
restricted to
A×A
is also a metric
x ∈ U
there exists
r ∈ (0, ∞)
Suppose that (S, d) is a metric space, and that x ∈ S and r ∈ (0, ∞). 1. B(x, r) = {y ∈ S : d(x, y) < r} is the open ball with center x and radius r. 2. C (x, r) = {y ∈ S : d(x, y) ≤ r} is the closed ball with center x and radius r. A metric on a space induces a topology on the space in a natural way. Suppose that (S, d) is a metric space. By definition, a set U ⊆ S is open if for every B(x, r) ⊆ U . The collection S of open subsets of S is a topology.
such that
d
Proof As the names suggests, an open ball is in fact open and a closed ball is in fact closed. Suppose again that (S, d) is a metric space, and that x ∈ S and r ∈ (0, ∞). Then 1. B(x, r) is open. 2. C (x, r) is closed. Proof Recall that for a general topological space, a neighborhood of a point x ∈ S is a set A ⊆ S with the property that there exists an open set U with x ∈ U ⊆ A . It follows that in a metric space, A ⊆ S is a neighborhood of x if and only if there exists r > 0 such that B(x, r) ⊆ A . In words, a neighborhood of a point must contain an open ball about that point. It's easy to construct new metrics from ones that we already have. Here's one such result. Suppose that S is a nonempty set, and that d,
e
are metrics on S , and c ∈ (0, ∞). Then the following are also metrics on S :
1. cd 2. d + e Proof Since a metric space produces a topological space, all of the definitions for general topological spaces apply to metric spaces as well. In particular, in a metric space, distinct points can always be separated.
A metric space (S, d) is a Hausdorff space. Proof
Metrizable Spaces Again, every metric space is a topological space, but not conversely. A non-Hausdorff space, for example, cannot correspond to a metric space. We know there are such spaces; a set S with more than one point, and with the trivial topology S = {S, ∅} is nonHausdorff. Suppose that metrizable.
(S, S )
is a topological space. If there exists a metric
d
on
S
such that
S = Sd
, then
(S, S )
is said to be
It's easy to see that different metrics can induce the same topology. For example, if d is a metric and c ∈ (0, ∞), then the metrics and cd induce the same topology. Let S be a nonempty set. Metrics d and e on S are equivalent, and we write d ≡ e , if equivalence relation on the collection of metrics on S . That is, for metrics d, e, f on S ,
Sd = Se
. The relation
≡
d
is an
1. d ≡ d, the reflexive property.
2. If d ≡ e then e ≡ d, the symmetric property.
3. If d ≡ e and e ≡ f then d ≡ f, the transitive property.

There is a simple condition that characterizes when the topology of one metric is finer than the topology of another metric, and this in turn leads to a condition for equivalence of metrics.

Suppose again that S is a nonempty set and that d, e are metrics on S. Then S_e is finer than S_d if and only if every open ball relative to d contains an open ball relative to e.
Proof

It follows that metrics d and e on S are equivalent if and only if every open ball relative to one of the metrics contains an open ball relative to the other metric. So every metrizable topology on S corresponds to an equivalence class of metrics that produce that topology. Sometimes we want to know that a topological space is metrizable, because of the nice properties that it will have, but we don't really need to use a specific metric that generates the topology. At any rate, it's important to have conditions that are sufficient for a topological space to be metrizable. The most famous such result is the Urysohn metrization theorem, named for the Russian mathematician Pavel Urysohn:

Suppose that (S, S) is a regular, second-countable, Hausdorff space. Then (S, S) is metrizable.
Convergence

With a distance function, the convergence of a sequence can be characterized in a manner that is just like calculus. Recall that for a general topological space (S, S), if (x_n : n ∈ N_+) is a sequence of points in S and x ∈ S, then x_n → x as n → ∞ means that for every neighborhood U of x, there exists m ∈ N_+ such that x_n ∈ U for n > m.

Suppose that (S, d) is a metric space, and that (x_n : n ∈ N_+) is a sequence of points in S and x ∈ S. Then x_n → x as n → ∞ if and only if for every ϵ > 0 there exists m ∈ N_+ such that if n > m then d(x_n, x) < ϵ. Equivalently, x_n → x as n → ∞ if and only if d(x_n, x) → 0 as n → ∞ (in the usual calculus sense).
Proof

So, no matter how tiny ϵ > 0 may be, all but finitely many terms of the sequence are within ϵ distance of x. As one might hope, limits are unique.

Suppose again that (S, d) is a metric space. Suppose also that (x_n : n ∈ N_+) is a sequence of points in S and that x, y ∈ S. If x_n → x as n → ∞ and x_n → y as n → ∞ then x = y.
Proof
Convergence of a sequence is a topological property, and so is preserved under equivalence of metrics.

Suppose that d, e are equivalent metrics on S, and that (x_n : n ∈ N_+) is a sequence of points in S and x ∈ S. Then x_n → x as n → ∞ relative to d if and only if x_n → x as n → ∞ relative to e.
Closed subsets of a metric space have a simple characterization in terms of convergent sequences, and this characterization is more intuitive than the abstract axioms in a general topological space.

Suppose again that (S, d) is a metric space. Then A ⊆ S is closed if and only if whenever a sequence of points in A converges, the limit is also in A.
Proof

The following definition also shows up in standard calculus. The idea is to have a criterion for convergence of a sequence that does not require knowing a priori the limit. But for metric spaces, this definition takes on added importance.

Suppose again that (S, d) is a metric space. A sequence of points (x_n : n ∈ N_+) in S is a Cauchy sequence if for every ϵ > 0 there exists k ∈ N_+ such that if m, n ∈ N_+ with m > k and n > k then d(x_m, x_n) < ϵ.
Cauchy sequences are named for the ubiquitous Augustin Cauchy. So for a Cauchy sequence, no matter how tiny ϵ > 0 may be, all but finitely many terms of the sequence will be within ϵ distance of each other. A convergent sequence is always Cauchy.

Suppose again that (S, d) is a metric space. If a sequence of points (x_n : n ∈ N_+) in S converges, then the sequence is Cauchy.
Proof

Conversely, one might think that a Cauchy sequence should converge, but it's relatively trivial to create a situation where this is false. Suppose that (S, d) is a metric space, and that there is a point x ∈ S that is the limit of a sequence of points in S that are all distinct from x. Then the space T = S − {x}, with the metric d restricted to T × T, has a Cauchy sequence that does not converge. Essentially, we have created a “convergence hole”. So our next definition is very natural and very important.

Suppose again that (S, d) is a metric space and that A ⊆ S. Then A is complete if every Cauchy sequence in A converges to a point in A.
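To make the “convergence hole” concrete, here is a minimal Python sketch (an illustration we add here, with the subspace S = (0, 1] chosen for the example, not taken from the text): the sequence x_n = 1/n is Cauchy in S, but its only candidate limit, 0, has been deleted from the space, so the sequence does not converge in S.

def d(x, y):
    # the usual metric on the reals, restricted to the subspace S = (0, 1]
    return abs(x - y)

def x(n):
    # x_n = 1/n lies in S = (0, 1] for every positive integer n
    return 1.0 / n

# Cauchy property: for m, n > k, d(x_m, x_n) < 1/k, which is small for large k.
for k in [10, 100, 1000]:
    worst = max(d(x(m), x(n))
                for m in range(k + 1, k + 51) for n in range(k + 1, k + 51))
    print(k, worst)

# The terms approach 0, but 0 is not a point of S = (0, 1], so this Cauchy
# sequence has no limit in the subspace: a "convergence hole".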
Of course, completeness can be applied to the entire space S. Trivially, a complete set must be closed.

Suppose again that (S, d) is a metric space, and that A ⊆ S. If A is complete, then A is closed.
Proof

Completeness is such a crucial property that it is often imposed as an assumption on metric spaces that occur in applications. Even though a Cauchy sequence may not converge, here is a partial result that will be useful later: if a Cauchy sequence has a convergent subsequence, then the sequence itself converges.

Suppose again that (S, d) is a metric space, and that (x_n : n ∈ N_+) is a Cauchy sequence in S. If there exists a subsequence (x_{n_k} : k ∈ N_+) such that x_{n_k} → x ∈ S as k → ∞, then x_n → x as n → ∞.
Proof
Continuity

In metric spaces, continuity of functions also has simple characterizations in terms of ideas that are familiar from calculus. We start with local continuity. Recall that the general topological definition is that f : S → T is continuous at x ∈ S if f^{-1}(V) is a neighborhood of x in S for every open set V containing f(x) in T.
Suppose that (S, d) and (T, e) are metric spaces, and that f : S → T. The continuity of f at x ∈ S is equivalent to each of the following conditions:
1. If (x_n : n ∈ N_+) is a sequence in S with x_n → x as n → ∞ then f(x_n) → f(x) as n → ∞.
2. For every ϵ > 0, there exists δ > 0 such that if y ∈ S and d(x, y) < δ then e[f(x), f(y)] < ϵ.
Proof
More generally, recall that f continuous on A ⊆ S means that f is continuous at each x ∈ A, and that f continuous means that f is continuous on S. So general continuity can be characterized in terms of sequential continuity and the ϵ-δ condition.
On a metric space, there are stronger versions of continuity.

Suppose again that (S, d) and (T, e) are metric spaces and that f : S → T. Then f is uniformly continuous if for every ϵ > 0 there exists δ > 0 such that if x, y ∈ S with d(x, y) < δ then e[f(x), f(y)] < ϵ.

In the ϵ-δ formulation of ordinary pointwise continuity above, δ depends on the point x in addition to ϵ. With uniform continuity, there exists a δ depending only on ϵ that works uniformly in x ∈ S.

Suppose again that (S, d) and (T, e) are metric spaces, and that f : S → T. If f is uniformly continuous then f is continuous.
Here is an even stronger version of continuity.

Suppose again that (S, d) and (T, e) are metric spaces, and that f : S → T. Then f is Hölder continuous with exponent α ∈ (0, ∞) if there exists C ∈ (0, ∞) such that e[f(x), f(y)] ≤ C[d(x, y)]^α for all x, y ∈ S.
The definition is named for Otto Hölder. The exponent α is more important than the constant C, which generally does not have a name. If α = 1, f is said to be Lipschitz continuous, named for the German mathematician Rudolf Lipschitz.

Suppose again that (S, d) and (T, e) are metric spaces, and that f : S → T. If f is Hölder continuous with exponent α > 0 then f is uniformly continuous.

The case where α = 1 and C < 1 is particularly important.

Suppose again that (S, d) and (T, e) are metric spaces. A function f : S → T is a contraction if there exists C ∈ (0, 1) such that
e[f(x), f(y)] ≤ C d(x, y),    x, y ∈ S    (1.10.6)
So contractions shrink distance. By the result above, a contraction is uniformly continuous. Part of the importance of contraction maps is due to the famous Banach fixed-point theorem, named for Stefan Banach.

Suppose that (S, d) is a complete metric space and that f : S → S is a contraction. Then f has a unique fixed point. That is, there exists exactly one x* ∈ S with f(x*) = x*. Moreover, let x_0 ∈ S, and recursively define x_n = f(x_{n−1}) for n ∈ N_+. Then x_n → x* as n → ∞.
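The theorem suggests an algorithm: iterate f from any starting point and the iterates converge to the unique fixed point. Here is a minimal Python sketch (the choice of cos and the tolerance are our own illustrative assumptions): cos is a contraction on [0, 1], since it maps [0, 1] into itself and |cos′(x)| = |sin(x)| ≤ sin(1) < 1 there.

import math

def fixed_point(f, x0, tol=1e-12, max_iter=1000):
    # recursively compute x_n = f(x_{n-1}) until successive terms are within tol
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx - x) < tol:
            return fx
        x = fx
    return x

# The iteration converges to the unique point x* with cos(x*) = x*.
x_star = fixed_point(math.cos, 1.0)
print(x_star, math.cos(x_star))  # both approximately 0.7390851332...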
Functions that preserve distance are particularly important. The term isometry means distance-preserving.

Suppose again that (S, d) and (T, e) are metric spaces, and that f : S → T. Then f is an isometry if e[f(x), f(y)] = d(x, y) for every x, y ∈ S.

Suppose again that (S, d) and (T, e) are metric spaces, and that f : S → T. If f is an isometry, then f is one-to-one and Lipschitz continuous.
Proof

In particular, an isometry f is uniformly continuous. If one metric space can be mapped isometrically onto another metric space, the spaces are essentially the same.

Metric spaces (S, d) and (T, e) are isometric if there exists an isometry f that maps S onto T. Isometry is an equivalence relation on metric spaces. That is, for metric spaces (S, d), (T, e), and (U, ρ),
1. (S, d) is isometric to (S, d), the reflexive property.
2. If (S, d) is isometric to (T, e) then (T, e) is isometric to (S, d), the symmetric property.
3. If (S, d) is isometric to (T, e) and (T, e) is isometric to (U, ρ), then (S, d) is isometric to (U, ρ), the transitive property.
Proof
In particular, if metric spaces (S, d) and (T , e) are isometric, then as topological spaces, they are homeomorphic.
Compactness and Boundedness

In a metric space, various definitions related to a set being bounded are natural, and are related to the general concept of compactness.

Suppose again that (S, d) is a metric space, and that A ⊆ S. Then A is bounded if there exists r ∈ (0, ∞) such that d(x, y) ≤ r for all x, y ∈ A. The diameter of A is
diam(A) = inf{r > 0 : d(x, y) < r for all x, y ∈ A}    (1.10.7)
Additional details

So A is bounded if and only if diam(A) < ∞. Diameter is an increasing function relative to the subset partial order.

Suppose again that (S, d) is a metric space, and that A ⊆ B ⊆ S. Then diam(A) ≤ diam(B).

Our next definition is stronger, but first let's review some terminology that we used for general topological spaces: If S is a set, A a subset of S, and 𝒜 a collection of subsets of S, then 𝒜 is said to cover A if A ⊆ ⋃𝒜. So with this terminology, we can talk about open covers, closed covers, finite covers, disjoint covers, and so on.

Suppose again that (S, d) is a metric space, and that A ⊆ S. Then A is totally bounded if for every r > 0 there is a finite cover of A with open balls of radius r.

Recall that for a general topological space, a set A is compact if every open cover of A has a finite subcover. So in a metric space, the term precompact is sometimes used instead of totally bounded: A is totally bounded if for each r > 0, the cover of A by all open balls of radius r has a finite subcover.

Suppose again that (S, d) is a metric space. If A ⊆ S is totally bounded then A is bounded.
Proof

Since a metric space is a Hausdorff space, a compact subset of a metric space is closed. Compactness also has a simple characterization in terms of convergence of sequences.

Suppose again that (S, d) is a metric space. A subset C ⊆ S is compact if and only if every sequence of points in C has a subsequence that converges to a point in C.
Proof
Hausdorff Measure and Dimension

Our last discussion is somewhat advanced, but is important for the study of certain random processes, particularly Brownian motion. The idea is to measure the “size” of a set in a metric space in a topological way, and then use this measure to define a type of “dimension”. We need a preliminary definition, using our convenient cover terminology. If (S, d) is a metric space, A ⊆ S, and δ ∈ (0, ∞), then a countable δ cover of A is a countable cover 𝓑 of A with the property that diam(B) < δ for each B ∈ 𝓑.

Suppose again that (S, d) is a metric space and that A ⊆ S. For δ ∈ (0, ∞) and k ∈ [0, ∞), define
H_δ^k(A) = inf{∑_{B∈𝓑} [diam(B)]^k : 𝓑 is a countable δ cover of A}    (1.10.9)
The k-dimensional Hausdorff measure of A is
H^k(A) = sup{H_δ^k(A) : δ > 0} = lim_{δ↓0} H_δ^k(A)    (1.10.10)
Additional details Note that the k -dimensional Hausdorff measure is defined for every k ∈ [0, ∞), not just nonnegative integers. Nonetheless, the integer dimensions are interesting. The 0-dimensional measure of A is the number of points in A . In Euclidean space, which we consider in (36), the measures of dimension 1, 2, and 3 are related to length, area, and volume, respectively.
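For intuition, here is a small Python sketch (our own illustration, not from the text) for A = [0, 1]: covering A by intervals of a common small length gives an upper bound on H_δ^k(A), and the behavior of the bound as δ decreases already hints that the Hausdorff dimension of [0, 1] is 1.

import math

# Cover [0, 1] by N = ceil(1/delta) intervals of length delta; the sum of
# diam^k over this cover is an upper bound on H^k_delta([0, 1]) (up to the
# strict inequality diam < delta, which does not affect the limits).
for k in [0.5, 1.0, 2.0]:
    for delta in [0.1, 0.01, 0.001]:
        n_intervals = math.ceil(1 / delta)
        bound = n_intervals * delta ** k
        print(f"k = {k}, delta = {delta}: sum of diam^k = {bound:.6g}")

# As delta decreases, the sums blow up for k < 1 and tend to 0 for k > 1,
# while for k = 1 they stay near 1, consistent with dimension 1 for [0, 1].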
Suppose again that (S, d) is a metric space and that A ⊆ S. The Hausdorff dimension of A is
dim_H(A) = inf{k ∈ [0, ∞) : H^k(A) = 0}    (1.10.12)

Of special interest, as before, is the case when S = R^n for some n ∈ N_+ and d is the standard Euclidean distance, reviewed in (36). As you might guess, the Hausdorff dimension of a point is 0, the Hausdorff dimension of a “simple curve” is 1, the Hausdorff dimension of a “simple surface” is 2, and so on. But there are also sets with fractional Hausdorff dimension, and the stochastic process Brownian motion provides some fascinating examples. The graph of standard Brownian motion has Hausdorff dimension 3/2 while the set of zeros has Hausdorff dimension 1/2.
Examples and Special Cases

Normed Vector Spaces

A norm on a vector space generates a metric on the space in a very simple, natural way.

Suppose that (S, +, ⋅) is a vector space, and that ∥ ⋅ ∥ is a norm on the space. Then d defined by d(x, y) = ∥y − x∥ for x, y ∈ S is a metric on S.
Proof

On R^n, we have a variety of norms, and hence a variety of metrics.
For n ∈ N_+ and k ∈ [1, ∞), the function d_k given below is a metric on R^n:
d_k(x, y) = (∑_{i=1}^n |x_i − y_i|^k)^{1/k},    x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n    (1.10.13)
Proof

Of course, the metric d_2 is Euclidean distance, named for Euclid. This is the most important one, in a practical sense because it's the usual one that we use in the real world, and in a mathematical sense because the associated norm corresponds to the standard inner product on R^n given by
⟨x, y⟩ = ∑_{i=1}^n x_i y_i,    x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n    (1.10.15)
For n ∈ N_+, the function d_∞ defined below is a metric on R^n:
d_∞(x, y) = max{|x_i − y_i| : i ∈ {1, 2, …, n}},    x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n    (1.10.16)
Proof

To justify the notation, recall that ∥x∥_k → ∥x∥_∞ as k → ∞ for x ∈ R^n, and hence d_k(x, y) → d_∞(x, y) as k → ∞ for x, y ∈ R^n.
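The following Python sketch (an illustration we add here; the sample points are arbitrary) computes d_k for several values of k and shows the convergence to d_∞ numerically.

def d_k(x, y, k):
    # the k-metric on R^n for k in [1, infinity)
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

def d_inf(x, y):
    # the maximum metric on R^n
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.5)
for k in [1, 2, 3, 10, 100]:
    print(k, d_k(x, y, k))
print("infinity", d_inf(x, y))  # d_k(x, y) decreases to d_inf(x, y) = 4.0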
Figure 1.10.1: From inside out, the boundaries of the unit balls centered at the origin in R^2 for the metrics d_k with k ∈ {3/4, 1, 2, 3, ∞}.
Suppose now that S is a nonempty set. Recall that the collection V of all functions f : S → R is a vector space under the usual pointwise definition of addition and scalar multiplication. That is, if f , g ∈ V and c ∈ R , then f + g ∈ V and cf ∈ V are defined by (f + g)(x) = f (x) + g(x) and (cf )(x) = cf (x) for x ∈ S . Recall further that the collection U of bounded functions f : S → R
is a vector subspace of V, and moreover, ∥ ⋅ ∥ defined by ∥f∥ = sup{|f(x)| : x ∈ S} is a norm on U, known as the supremum norm. It follows that U is a metric space with the metric d defined by
d(f, g) = ∥f − g∥ = sup{|f(x) − g(x)| : x ∈ S}    (1.10.18)
Vector spaces of bounded, real-valued functions with the supremum norm are very important in the study of probability and stochastic processes. Note that the supremum norm on U generalizes the maximum norm on R^n, since we can think of a point in R^n as a function from {1, 2, …, n} into R. Later, as part of our discussion on integration with respect to a positive measure, we will see how to generalize the k norms on R^n to spaces of functions.
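Here is a minimal Python sketch of the supremum metric (our own illustration; since a true supremum over an infinite set cannot be computed exactly, we approximate it by maximizing over a finite grid of sample points).

import math

def sup_distance(f, g, points):
    # approximates d(f, g) = sup{|f(x) - g(x)| : x in S} over sample points
    return max(abs(f(x) - g(x)) for x in points)

grid = [i / 1000 for i in range(1001)]  # sample points in S = [0, 1]
print(sup_distance(math.sin, math.cos, grid))  # approximately 1, attained at x = 0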
Products of Metric Spaces

If we have a finite number of metric spaces, then we can combine the individual metrics, together with a norm on the vector space R^n, to create a metric on the Cartesian product space.

Suppose n ∈ {2, 3, …}, and that (S_i, d_i) is a metric space for each i ∈ {1, 2, …, n}. Suppose also that ∥ ⋅ ∥ is a norm on R^n. Then the function d given as follows is a metric on S = S_1 × S_2 × ⋯ × S_n:
d(x, y) = ∥(d_1(x_1, y_1), d_2(x_2, y_2), …, d_n(x_n, y_n))∥,    x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ S    (1.10.19)
Proof
Graphs

Recall that a graph (in the combinatorial sense) consists of a countable set S of vertices and a set E ⊆ S × S of edges. In this discussion, we assume that the graph is undirected in the sense that (x, y) ∈ E if and only if (y, x) ∈ E, and has no loops so that (x, x) ∉ E for x ∈ S. Finally, recall that a path of length n ∈ N_+ from x ∈ S to y ∈ S is a sequence (x_0, x_1, …, x_n) ∈ S^{n+1} such that x_0 = x, x_n = y, and (x_{i−1}, x_i) ∈ E for i ∈ {1, 2, …, n}. The graph is connected if there exists a path of finite length between any two distinct vertices in S. Such a graph has a natural metric:

Suppose that G = (S, E) is a connected graph. Then d defined as follows is a metric on S: d(x, x) = 0 for x ∈ S, and d(x, y) is the length of the shortest path from x to y for distinct x, y ∈ S.
Proof
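A sketch of the graph metric in Python (our own illustration), computing the shortest-path distance with breadth-first search on a small undirected graph:

from collections import deque

def graph_distance(edges, x, y):
    # length of the shortest path from x to y, via breadth-first search
    if x == y:
        return 0
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    queue, seen = deque([(x, 0)]), {x}
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == y:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return float("inf")  # unreachable; cannot happen in a connected graph

# A 5-cycle: the shortest path from 0 to 3 goes the short way around.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(graph_distance(cycle, 0, 3))  # 2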
The Discrete Topology

Suppose that S is a nonempty set. Recall that the discrete topology on S is P(S), the power set of S, so that every subset of S is open (and closed). The discrete topology is metrizable, and there are lots of metrics that generate this topology.
Suppose again that S is a nonempty set. A metric d on S with the property that there exists c ∈ (0, ∞) such that d(x, y) ≥ c for distinct x, y ∈ S generates the discrete topology. Proof So any metric that is bounded from below (for distinct points) generates the discrete topology. It's easy to see that there are such metrics. Suppose again that S is a nonempty set. The function d on S × S defined by d(x, x) = 0 for x ∈ S and d(x, y) = 1 for distinct x, y ∈ S is a metric on S , known as the discrete metric. This metric generates the discrete topology. Proof In probability applications, the discrete topology is often appropriate when S is countable. Note also that the discrete metric is the graph distance if S is made into the complete graph, so that (x, y) is an edge for every pair of distinct vertices x, y ∈ S . This page titled 1.10: Metric Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.11: Measurable Spaces In this section we discuss some topics from measure theory that are a bit more advanced than the topics in the early sections of this chapter. However, measure-theoretic ideas are essential for a deep understanding of probability, since probability is itself a measure. The most important of the definitions is the σ-algebra, a collection of subsets of a set with certain closure properties. Such collections play a fundamental role, even for applied probability, in encoding the state of information about a random experiment. On the other hand, we won't be overly pedantic about measure-theoretic details in this text. Unless we say otherwise, we assume that all sets that appear are measurable (that is, members of the appropriate σ-algebras), and that all functions are measurable (relative to the appropriate σ-algebras). Although this section is somewhat abstract, many of the proofs are straightforward. Be sure to try the proofs yourself before reading the ones in the text.
Algebras and σ-Algebras Suppose that S is a set, playing the role of a universal set for a particular mathematical model. It is sometimes impossible to include all subsets of S in our model, particularly when S is uncountable. In a sense, the more sets that we include, the harder it is to have consistent theories. However, we almost always want the collection of admissible subsets to be closed under the basic set operations. This leads to some important definitions.
Algebras of Sets

Suppose that S is a nonempty collection of subsets of S. Then S is an algebra (or field) if it is closed under complement and union:
1. If A ∈ S then A^c ∈ S.
2. If A ∈ S and B ∈ S then A ∪ B ∈ S.
If S is an algebra of subsets of S then
1. S ∈ S
2. ∅ ∈ S
Proof

Suppose that S is an algebra of subsets of S and that A_i ∈ S for each i in a finite index set I. Then
1. ⋃_{i∈I} A_i ∈ S
2. ⋂_{i∈I} A_i ∈ S
Proof

Thus it follows that an algebra of sets is closed under a finite number of set operations. That is, if we start with a finite number of sets in the algebra S, and build a new set with a finite number of set operations (union, intersection, complement), then the new set is also in S. However in many mathematical theories, probability in particular, this is not sufficient; we often need the collection of admissible subsets to be closed under a countable number of set operations.
σ-Algebras of Sets

Suppose that S is a nonempty collection of subsets of S. Then S is a σ-algebra (or σ-field) if the following axioms are satisfied:
1. If A ∈ S then A^c ∈ S.
2. If A_i ∈ S for each i in a countable index set I, then ⋃_{i∈I} A_i ∈ S.
Clearly a σ-algebra of subsets is also an algebra of subsets, so the basic results for algebras above still hold. In particular, S ∈ S and ∅ ∈ S.

If A_i ∈ S for each i in a countable index set I, then ⋂_{i∈I} A_i ∈ S.
Proof

Thus a σ-algebra of subsets of S is closed under countable unions and intersections. This is the reason for the symbol σ in the name. As mentioned in the introductory paragraph, σ-algebras are of fundamental importance in mathematics generally and probability theory specifically, and thus deserve a special definition:

If S is a set and S a σ-algebra of subsets of S, then the pair (S, S) is called a measurable space.

The term measurable space will make more sense in the next chapter, when we discuss positive measures (and in particular, probability measures) on such spaces.

Suppose that S is a set and that S is a finite algebra of subsets of S. Then S is also a σ-algebra.
Proof

However, there are algebras that are not σ-algebras. Here is the classic example:

Suppose that S is an infinite set. The collection of finite and co-finite subsets of S defined below is an algebra of subsets of S, but not a σ-algebra:
F = {A ⊆ S : A is finite or A^c is finite}    (1.11.1)
Proof
General Constructions

Recall that P(S) denotes the collection of all subsets of S, called the power set of S. Trivially, P(S) is the largest σ-algebra of subsets of S. The power set is often the appropriate σ-algebra if S is countable, but as noted above, is sometimes too large to be useful if S is uncountable. At the other extreme, the smallest σ-algebra of subsets of S is given in the following result:

The collection {∅, S} is a σ-algebra.
Proof

In many cases, we want to construct a σ-algebra that contains certain basic sets. The next two results show how to do this.

Suppose that S_i is a σ-algebra of subsets of S for each i in a nonempty index set I. Then S = ⋂_{i∈I} S_i is also a σ-algebra of subsets of S.
Proof

Note that no restrictions are placed on the index set I, other than that it be nonempty, so in particular it may well be uncountable.

Suppose that S is a set and that B is a collection of subsets of S. The σ-algebra generated by B is
σ(B) = ⋂{S : S is a σ-algebra of subsets of S and B ⊆ S}    (1.11.2)
If B is countable then σ(B) is said to be countably generated.

So the σ-algebra generated by B is the intersection of all σ-algebras that contain B, which by the previous result really is a σ-algebra. Note that the collection of σ-algebras in the intersection is not empty, since P(S) is in the collection. Think of the sets in B as basic sets that we want to be measurable, but that do not form a σ-algebra.

The σ-algebra σ(B) is the smallest σ-algebra containing B:
1. B ⊆ σ(B)
2. If S is a σ-algebra of subsets of S and B ⊆ S then σ(B) ⊆ S.
Proof

Note that the conditions in the last theorem completely characterize σ(B). If S_1 and S_2 satisfy the conditions, then by (a), B ⊆ S_1 and B ⊆ S_2. But then by (b), S_1 ⊆ S_2 and S_2 ⊆ S_1.

If A is a subset of S then σ{A} = {∅, A, A^c, S}.
Proof

We can generalize the previous result. Recall that a collection of subsets 𝒜 = {A_i : i ∈ I} is a partition of S if A_i ∩ A_j = ∅ for i, j ∈ I with i ≠ j, and ⋃_{i∈I} A_i = S.

Suppose that 𝒜 = {A_i : i ∈ I} is a countable partition of S into nonempty subsets. Then σ(𝒜) is the collection of all unions of sets in 𝒜. That is,
σ(𝒜) = {⋃_{j∈J} A_j : J ⊆ I}    (1.11.3)
Proof

A σ-algebra of this form is said to be generated by a countable partition. Note that since A_i ≠ ∅ for i ∈ I, the representation of a set in σ(𝒜) as a union of sets in 𝒜 is unique. That is, if J, K ⊆ I and J ≠ K then ⋃_{j∈J} A_j ≠ ⋃_{k∈K} A_k. In particular, if there are n nonempty sets in 𝒜, so that #(I) = n, then there are 2^n subsets of I and hence 2^n sets in σ(𝒜).
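The result is easy to see computationally. Here is a Python sketch (our own illustration with a small finite partition) that enumerates σ(𝒜) as the collection of all unions of blocks, giving 2^n sets for n blocks.

from itertools import combinations

def sigma_from_partition(partition):
    # all unions of blocks of a finite partition: 2^n sets for n blocks
    sets = []
    for r in range(len(partition) + 1):
        for blocks in combinations(partition, r):
            union = frozenset().union(*blocks) if blocks else frozenset()
            sets.append(union)
    return sets

partition = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5, 6})]
sigma = sigma_from_partition(partition)
print(len(sigma))  # 2^3 = 8 sets, from the empty set up to S = {1, ..., 6}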
Suppose now that 𝒜 = {A_1, A_2, …, A_n} is a collection of n subsets of S (not necessarily disjoint). To describe the σ-algebra generated by 𝒜 we need a bit more notation. For x = (x_1, x_2, …, x_n) ∈ {0, 1}^n (a bit string of length n), let B_x = ⋂_{i=1}^n A_i^{x_i} where A_i^1 = A_i and A_i^0 = A_i^c.
In the setting above,
1. 𝓑 = {B_x : x ∈ {0, 1}^n} partitions S.
2. A_i = ⋃{B_x : x ∈ {0, 1}^n, x_i = 1} for i ∈ {1, 2, …, n}.
3. σ(𝒜) = σ(𝓑) = {⋃_{x∈J} B_x : J ⊆ {0, 1}^n}.
Proof

Recall that there are 2^n bit strings of length n. The sets in 𝒜 are said to be in general position if the sets in 𝓑 are distinct (and hence there are 2^n of them) and are nonempty. In this case, there are 2^{2^n} sets in σ(𝒜).
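A short Python sketch (our own illustration) computes the atoms B_x for a small example; when all 2^n atoms are distinct and nonempty, the sets are in general position and σ(𝒜) has 2^{2^n} members.

from itertools import product

def atoms(S, subsets):
    # the nonempty sets B_x = intersection of each A_i or its complement,
    # as the bit string x ranges over {0, 1}^n; these atoms partition S
    result = {}
    for x in product([0, 1], repeat=len(subsets)):
        B = set(S)
        for bit, A in zip(x, subsets):
            B &= A if bit == 1 else (S - A)
        if B:
            result[x] = B
    return result

S = set(range(8))
A1, A2 = {0, 1, 2, 3}, {2, 3, 4, 5}
for x, B in atoms(S, [A1, A2]).items():
    print(x, sorted(B))
# Here all 4 atoms are nonempty and distinct, so A1, A2 are in general
# position and the generated sigma-algebra has 2^4 = 16 sets.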
Open the Venn diagram app. This app shows two subsets A and B of S in general position, and lists the 16 sets in σ{A, B}.
1. Select each of the 4 sets that partition S: A ∩ B, A ∩ B^c, A^c ∩ B, A^c ∩ B^c.
2. Select each of the other 12 sets in σ{A, B} and note how each is a union of some of the sets in (a).

Sketch a Venn diagram with sets A_1, A_2, A_3 in general position. Identify the set B_x for each x ∈ {0, 1}^3.
If a σ-algebra is generated by a collection of basic sets, then each set in the σ-algebra is generated by a countable number of the basic sets.

Suppose that S is a set and B a nonempty collection of subsets of S. Then
σ(B) = {A ⊆ S : A ∈ σ(C) for some countable C ⊆ B}    (1.11.4)
Proof

A σ-algebra on a set naturally leads to a σ-algebra on a subset.

Suppose that (S, S) is a measurable space, and that R ⊆ S. Let R = {A ∩ R : A ∈ S}. Then
1. R is a σ-algebra of subsets of R.
2. If R ∈ S then R = {B ∈ S : B ⊆ R}.
Proof

The σ-algebra R is the σ-algebra on R induced by S. The following construction is useful for counterexamples. Compare this example with the one for finite and co-finite sets.
Let S be a nonempty set. The collection of countable and co-countable subsets of S is
C = {A ⊆ S : A is countable or A^c is countable}    (1.11.5)
1. C is a σ-algebra.
2. C = σ{{x} : x ∈ S}, the σ-algebra generated by the singleton sets.
Proof

Of course, if S is itself countable then C = P(S). On the other hand, if S is uncountable, then there exists A ⊆ S such that A and A^c are uncountable. Thus, A ∉ C, but A = ⋃_{x∈A} {x}, and of course {x} ∈ C for each x. Thus, we have an example of a σ-algebra that is not closed under general unions.
Topology and Measure

One of the most important ways to generate a σ-algebra is by means of topology. Recall that a topological space consists of a set S and a topology S, the collection of open subsets of S. Most spaces that occur in probability and stochastic processes are topological spaces, so it's crucial that the topological and measure-theoretic structures are compatible.

Suppose that (S, S) is a topological space. Then σ(S) is the Borel σ-algebra on S, and (S, σ(S)) is a Borel measurable space.
So the Borel σ-algebra on S, named for Émile Borel, is generated by the open subsets of S. Thus, a topological space (S, S) naturally leads to a measurable space (S, σ(S)). Since a closed set is simply the complement of an open set, the Borel σ-algebra contains the closed sets as well (and in fact is generated by the closed sets). Here are some other sets that are in the Borel σ-algebra:

Suppose again that (S, S) is a topological space and that I is a countable index set.
1. If A_i is open for each i ∈ I then ⋂_{i∈I} A_i ∈ σ(S). Such sets are called G_δ sets.
2. If A_i is closed for each i ∈ I then ⋃_{i∈I} A_i ∈ σ(S). Such sets are called F_σ sets.
3. If (S, S) is Hausdorff then {x} ∈ σ(S) for every x ∈ S.
Proof

In terms of part (c), recall that a topological space is Hausdorff, named for Felix Hausdorff, if the topology can distinguish individual points. Specifically, if x, y ∈ S are distinct then there exist disjoint open sets U, V with x ∈ U and y ∈ V. This is a very basic property possessed by almost all topological spaces that occur in applications. A simple corollary of (c) is that if the topological space (S, S) is Hausdorff then A ∈ σ(S) for every countable A ⊆ S. Let's note the extreme cases. If S has the discrete topology P(S), so that every set is open (and closed), then of course the Borel σ-algebra is also P(S). As noted above, this is often the appropriate σ-algebra if S is countable, but is often too large if S is uncountable. If S has the trivial topology {S, ∅}, then the Borel σ-algebra is also {S, ∅}, and so is also trivial. Recall that a base for a topological space (S, T) is a collection B ⊆ T with the property that every set in T is a union of a collection of sets in B. In short, every open set is a union of some of the basic open sets.
Suppose that (S, S) is a topological space with a countable base B. Then σ(B) = σ(S).
Proof

The topological spaces that occur in probability and stochastic processes are usually assumed to have a countable base (along with other nice properties such as the Hausdorff property and local compactness). The σ-algebra used for such a space is usually the Borel σ-algebra, which by the previous result is countably generated.
Measurable Functions

Recall that a set usually comes with a σ-algebra of admissible subsets. A natural requirement on a function is that the inverse image of an admissible set in the range space be admissible in the domain space. Here is the formal definition.

Suppose that (S, S) and (T, T) are measurable spaces. A function f : S → T is measurable if f^{-1}(A) ∈ S for every A ∈ T.
If the σ-algebra in the range space is generated by a collection of basic sets, then to check the measurability of a function, we need only consider inverse images of basic sets:
Suppose again that (S, S) and (T, T) are measurable spaces, and that T = σ(B) for a collection of subsets B of T. Then f : S → T is measurable if and only if f^{-1}(B) ∈ S for every B ∈ B.
Proof

If you have reviewed the section on topology then you may have noticed a striking parallel between the definition of continuity for functions on topological spaces and the definition of measurability for functions on measurable spaces: A function from one topological space to another is continuous if the inverse image of an open set in the range space is open in the domain space. A function from one measurable space to another is measurable if the inverse image of a measurable set in the range space is measurable in the domain space. If we start with topological spaces, which we often do, and use the Borel σ-algebras to get measurable spaces, then we get the following (hardly surprising) connection.

Suppose that (S, S) and (T, T) are topological spaces, and that we give S and T the Borel σ-algebras σ(S) and σ(T) respectively. If f : S → T is continuous, then f is measurable.
Proof

Measurability is preserved under composition, the most important method for combining functions.

Suppose that (R, R), (S, S), and (T, T) are measurable spaces. If f : R → S is measurable and g : S → T is measurable, then g ∘ f : R → T is measurable.
Proof

If T is given the smallest possible σ-algebra or if S is given the largest one, then any function from S into T is measurable.

Every function f : S → T is measurable in each of the following cases:
1. T = {∅, T} and S is an arbitrary σ-algebra of subsets of S.
2. S = P(S) and T is an arbitrary σ-algebra of subsets of T.
Proof

When there are several σ-algebras for the same set, we use the phrase with respect to so that we can be precise. If a function is measurable with respect to a given σ-algebra on its domain, then it's measurable with respect to any larger σ-algebra on this space. If the function is measurable with respect to a σ-algebra on the range space then it's measurable with respect to any smaller σ-algebra on this space.

Suppose that S has σ-algebras R and S with R ⊆ S, and that T has σ-algebras T and U with T ⊆ U. If f : S → T is measurable with respect to R and U, then f is measurable with respect to S and T.
Proof

The following construction is particularly important in probability theory:

Suppose that S is a set and (T, T) is a measurable space. Suppose also that f : S → T, and define σ(f) = {f^{-1}(A) : A ∈ T}. Then
1. σ(f) is a σ-algebra on S.
2. σ(f) is the smallest σ-algebra on S that makes f measurable.
Proof

Appropriately enough, σ(f) is called the σ-algebra generated by f. Often, S will have a given σ-algebra S and f will be measurable with respect to S and T. In this case, σ(f) ⊆ S. We can generalize to an arbitrary collection of functions on S.

Suppose S is a set and that (T_i, T_i) is a measurable space for each i in a nonempty index set I. Suppose also that f_i : S → T_i for each i ∈ I. The σ-algebra generated by this collection of functions is
σ{f_i : i ∈ I} = σ{σ(f_i) : i ∈ I} = σ{f_i^{-1}(A) : i ∈ I, A ∈ T_i}    (1.11.6)

Again, this is the smallest σ-algebra on S that makes f_i measurable for each i ∈ I.
https://stats.libretexts.org/@go/page/10126
Product Sets Product sets arise naturally in the form of the higher-dimensional Euclidean spaces R for n ∈ {2, 3, …}. In addition, product spaces are particularly important in probability, where they are used to describe the spaces associated with sequences of random variables. More general product spaces arise in the study of stochastic processes. We start with the product of two sets; the generalization to products of n sets and to general products is straightforward, although the notation gets more complicated. n
Suppose that (S, S ) and (T , T ) are measurable spaces. The product σ-algebra on S × T is S ⊗ T = σ{A × B : A ∈ S , B ∈ T }
(1.11.7)
So the definition is natural: the product σ-algebra is generated by products of measurable sets. Our next goal is to consider the measurability of functions defined on, or mapping into, product spaces. Of basic importance are the projection functions. If S and T are sets, let p : S × T → S and p : S × T → T be defined by p (x, y) = x and p (x, y) = y for (x, y) ∈ S × T . Recall that p is the projection onto the first coordinate and p is the projection onto the second coordinate. The product σ algebra is the smallest σ-algebra that makes the projections measurable: 1
2
1
1
2
2
Suppose again that (S, S ) and (T , T ) are measurable spaces. Then S
⊗ T = σ{ p1 , p2 }
.
Proof Projection functions make it easy to study functions mapping into a product space. Suppose that (R, R) , (S, S ) and (T , T ) are measurable spaces, and that S × T is given the product σ-algebra S ⊗ T . Suppose also that f : R → S × T , so that f (x) = (f (x), f (x)) for x ∈ R, where f : R → S and f : R → T are the coordinate functions. Then f is measurable if and only if f and f are measurable. 1
2
1
1
2
2
Proof Our next goal is to consider cross sections of sets in a product space and cross sections of functions defined on a product space. It will help to introduce some new functions, which in a sense are complementary to the projection functions. Suppose again that (S, S ) and (T , T ) are measurable spaces, and that S × T is given the product σ-algebra S 1. For x ∈ S the function 1 2. For y ∈ T the function 2
x
: T → S ×T
y
: S → S ×T
, defined by 1 , defined by 2
x (y)
= (x, y)
y (x)
= (x, y)
⊗T
.
for y ∈ T , is measurable. for x ∈ S , is measurable.
Proof Now our work is easy. Suppose again that (S, S ) and (T , T ) are measurable spaces, and that C
∈ S ⊗T
. Then
1. For x ∈ S , {y ∈ T : (x, y) ∈ C } ∈ T . 2. For y ∈ T , {x ∈ S : (x, y) ∈ C } ∈ S . Proof The set in (a) is the cross section of C in the first coordinate at x, and the set in (b) is the cross section of C in the second coordinate at y . As a simple corollary to the theorem, note that if A ⊆ S , B ⊆ T and A × B ∈ S ⊗ T then A ∈ S and B ∈ T . That is, the only measurable product sets are products of measurable sets. Here is the measurability result for cross-sectional functions: Suppose again that (S, S ) and (T , T ) are measurable spaces, and that S × T is given the product Suppose also that (U , U ) is another measurable space, and that f : S × T → U is measurable. Then
σ
-algebra
S ⊗T
.
1. The function y ↦ f (x, y) from T to U is measurable for each x ∈ S . 2. The function x ↦ f (x, y) from S to U is measurable for each y ∈ T . Proof The results for products of two spaces generalize in a completely straightforward way to a product of n spaces.
1.11.6
https://stats.libretexts.org/@go/page/10126
Suppose n ∈ N and that (S , S ) is a measurable space for each product set S × S × ⋯ × S is +
i
1
2
i
. The product σ-algebra on the Cartesian
i ∈ {1, 2, … , n}
n
S1 ⊗ S2 ⊗ ⋯ ⊗ Sn = σ { A1 × A2 × ⋯ × An : Ai ∈ Si for all i ∈ {1, 2, … , n}}
(1.11.8)
So again, the product σ-algebra is generated by products of measurable sets. Results analogous to the theorems above hold. In the special case that (S , S ) = (S, S ) for i ∈ {1, 2, … , n}, the Cartesian product becomes S and the corresponding product σalgebra is denoted S . The notation is natural, but potentially confusing. Note that S is not the Cartesian product of S n times, but rather the σ-algebra generated by sets of the form A × A × ⋯ × A where A ∈ S for i ∈ {1, 2, … , n}. n
i
i
n
n
1
2
n
i
We can also extend these ideas to a general product. To recall the definition, suppose that S is a set for each i in a nonempty index set I . The product set ∏ S consists of all functions x : I → ⋃ S such that x(i) ∈ S for each i ∈ I . To make the notation look more like a simple Cartesian product, we often write x instead of x(i) for the value of a function in the product set at i ∈ I . The next definition gives the appropriate σ-algebra for the product set. i
i
i∈I
i
i∈I
i
i
Suppose that (S ∏ S is
i,
i∈I
Si )
is a measurable space for each i in a nonempty index set I . The product σ-algebra on the product set
i
σ { ∏ Ai : Ai ∈ Si for each i ∈ I and Ai = Si for all but finitely many i ∈ I }
(1.11.9)
i∈I
If you have reviewed the section on topology, the definition should look familiar. If the spaces were topological spaces instead of measurable spaces, with S the topology of S for i ∈ I , then the set of products in the displayed expression above is a base for the product topology on ∏ S . i
i
i
i∈I
The definition can also be understood in terms of projections. Recall that the projection onto coordinate j ∈ I is the function p : ∏ S → S given by p (x) = x . The product σ-algebra is the smallest σ-algebra on the product set that makes all of the projections measurable. j
i∈I
i
j
j
j
Suppose again that (S , S ) is a measurable space for each i in a nonempty index set I , and let algebra on the product set S = ∏ S . Then S = σ{p : i ∈ I } . i
i
I
i∈I
i
S
denote the product σ-
i
Proof In the special case that (S, S ) is a fixed measurable space and (S , S ) = (S, S ) for all i ∈ I , the product set ∏ S is just the collection of functions from I into S , often denoted S . The product σ-algebra is then denoted S , a notation that is natural, but again potentially confusing. Here is the main measurability result for a function mapping into a product space. i
i
i∈I
I
I
Suppose that (R, R) is a measurable space, and that (S , S ) is a measurable space for each i in a nonempty index set I . As before, let ∏ S have the product σ-algebra. Suppose now that f : R → ∏ S . For i ∈ I let f : R → S denote the ith coordinate function of f , so that f (x) = [f (x)] for x ∈ R . Then f is measurable if and only if f is measurable for each i ∈ I. i
i
i
i∈I
i∈I
i
i
i
i
i
i
Proof Just as with the product of two sets, cross-sectional sets and functions are measurable with respect to the product measure. Again, it's best to work with some special functions. Suppose that (S , S ) is a measurable space for each i in an index set I with at least two elements. For j ∈ I and u ∈ S , define the function j : ∏ → ∏ S by j (x) = y where y = x for i ≠ j and y = u . Then j is measurable with respect to the product σ-algebras. i
i
u
j
i∈I−{j}
i∈I
i
u
i
i
j
u
Proof In words, for j ∈ I and u ∈ S , the function j takes a point in the product set ∏ S and assigns u to coordinate j to give a point in ∏ S . If A ⊆ ∏ S , then j (A) is the cross section of A in coordinate j at u. So it follows immediately from the previous result that the cross sections of a measurable set are measurable. Cross sections of measurable functions are also j
i∈I
i
i∈I
u
i
i∈I−{j}
i
−1 u
1.11.7
https://stats.libretexts.org/@go/page/10126
measurable. Suppose that (T , T ) is another measurable space, and that f : ∏ S → T is measurable. The cross section of f in coordinate j ∈ I at u ∈ S is simply f ∘ j : S → T , a composition of measurable functions. i∈I
j
u
i
I−{j}
However, a non-measurable set can have measurable cross sections, even in a product of two spaces. Suppose that S is an uncountable set with the σ-algebra C of countable and co-countable sets as in (21). Consider S × S with the product σ-algebra C ⊗ C . Let D = {(x, x) : x ∈ S} , the diagonal of S × S . Then D has measurable cross sections, but D is not measurable. Proof
Special Cases Most of the sets encountered in applied probability are either countable, or subsets of R for some n , or more generally, subsets of a product of a countable number of sets of these types. In the study of stochastic processes, various spaces of functions play an important role. In this subsection, we will explore the most important special cases. n
Discrete Spaces If S is countable and S
= P(S)
is the collection of all subsets of S , then (S, S ) is a discrete measurable space.
Thus if (S, S ) is discrete, all subsets of S are measurable and every function from S to another measurable space is measurable. The power set is also the discrete topology on S , so S is a Borel σ-algebra as well. As a topological space, (S, S ) is complete, locally compact, Hausdorff, and since S is countable, separable. Moreover, the discrete topology corresponds to the discrete metric d , defined by d(x, x) = 0 for x ∈ S and d(x, y) = 1 for x, y ∈ S with x ≠ y .
Euclidean Spaces Recall that for n ∈ N , the Euclidean topology on R is generated by the standard Euclidean metric d given by n
+
n
−−−−−−−−− − n
2
dn (x, y) = √ ∑(xi − yi )
n
,
x = (x1 , x2 , … , xn ), y = (y1 , y2 , … , yn ) ∈ R
(1.11.10)
i=1
With this topology, R is complete, connected, locally compact, Hausdorff, and separable. n
For n ∈ N , the n -dimensional Euclidean measurable space is the standard Euclidean topology on R . +
n
(R , Rn )
where
Rn
is the Borel σ-algebra corresponding to
n
The one-dimensional case is particularly important. In this case, the standard Euclidean metric d is given by d(x, y) = |x − y| for x, y ∈ R. The Borel σ-algebra R can be generated by various collections of intervals. Each of the following collections generates R . 1. B 2. B 3. B
1
= {I ⊆ R : I is an interval}
2
= {(a, b] : a, b ∈ R, a < b}
3
= {(−∞, b] : b ∈ R}
Proof Since the Euclidean topology has a countable base, R is countably generated. In fact each collection of intervals above, but with endpoints restricted to Q, generates R . Moreover, R can also be constructed from σ-algebras that are generated by countable partitions. First recall that for n ∈ N , the set of dyadic rationals (or binary rationals) of rank n or less is D = {j/2 : j ∈ Z} . Note that D is countable and D ⊆ D for n ∈ N . Moreover, the set D = ⋃ D of all dyadic rationals is dense in R. The dyadic rationals are often useful in various applications because D has the natural ordered enumeration j ↦ j/2 for each n ∈ N . Now let n
n
n
n
n+1
n∈N
n
n
n
n
n
Dn = {(j/ 2 , (j + 1)/ 2 ] : j ∈ Z} ,
n ∈ N
(1.11.11)
Then D is a countable partition of R into nonempty intervals of equal size 1/2 , so E = σ(D ) consists of unions of sets in D as described above. Every set D is the union of two sets in D so clearly E ⊆ E for n ∈ N . Finally, the Borel σ-algebra on R is R = σ (⋃ E ) = σ (⋃ D ) . This construction turns out to be useful in a number of settings. n
n
n
n
∞
n=0
n+1
n
n
n
n+1
∞
n
n=0
n
1.11.8
https://stats.libretexts.org/@go/page/10126
For n ∈ {2, 3, …}, the Euclidean topology on R is the n -fold product topology formed from the Euclidean topology on R. So the Borel σ-algebra R is also the n -fold power σ-algebra formed from R . Finally, R can be generated by n -fold products of sets in any of the three collections in the previous theorem. n
n
n
Space of Real Functions Suppose that (S, S ) is a measurable space. From our general discussion of functions, recall that the usual arithmetic operations on functions from S into R are defined pointwise. If f : S → R and measurable:
g : S → R
are measurable and
a ∈ R
, then each of the following functions from
S
into
R
is also
1. f + g 2. f − g 3. f g 4. af Proof Similarly, if f : S → R ∖ {0} is measurable, then so is 1/f. Recall that the set of functions from S into R is a vector space, under the pointwise definitions of addition and scalar multiplication. But once again, we usually want to restrict our attention to measurable functions. Thus, it's nice to know that the measurable functions from S into R also form a vector space. This follows immediately from the closure properties (a) and (d) of the previous theorem. Of particular importance in probability and stochastic processes is the vector space of bounded, measurable functions f : S → R , with the supremum norm ∥f ∥ = sup {|f (x)| : x ∈ S}
(1.11.12)
The elementary functions that we encounter in calculus and other areas of applied mathematics are functions from subsets of R into R . The elementary functions include algebraic functions (which in turn include the polynomial and rational functions), the usual transcendental functions (exponential, logarithm, trigonometric), and the usual functions constructed from these by composition, the arithmetic operations, and by piecing together. As we might hope, all of the elementary functions are measurable. This page titled 1.11: Measurable Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.11.9
https://stats.libretexts.org/@go/page/10126
1.12: Special Set Structures There are several other types of algebraic set structures that are weaker than σ-algebras. These are not particularly important in themselves, but are important for constructing σ-algebras and the measures on these σ-algebras. You may want to skip this section if you are not intersted in questions of existence and uniqueness of positive measures.
Basic Theory Definitions Throughout this section, we assume that S is a set and S is a nonempty collection of subsets of S . Here are the main definitions we will need. S
is a π-system if S is closed under finite intersections: if A,
B ∈ S
then A ∩ B ∈ S .
Closure under intersection is clearly a very simple property, but π systems turn out to be useful enough to deserve a name. S
is a λ -system if it is closed under complements and countable disjoint unions.
1. If A ∈ S then A ∈ S . 2. If A ∈ S for i in a countable index set I and A c
i
S
i
∩ Aj = ∅
for i ≠ j then ⋃
i∈I
Ai ∈ S
.
is a semi-algebra if it is closed under intersection and if complements can be written as finite, disjoint unions:
1. If A, B ∈ S then A ∩ B ∈ S . 2. If A ∈ S then there exists a finite, disjoint collection {B
i
: i ∈ I} ⊆ S
such that A
c
=⋃
i∈I
Bi
.
For our final structure, recall that a sequence (A , A , …) of subsets of S is increasing if A ⊆ A for all n ∈ N . The sequence is decreasing if A ⊆A for all n ∈ N . Of course, these are the standard meanings of increasing and decreasing relative to the ordinary order ≤ on N and the subset partial order ⊆ on P(S) . 1
n+1
n
2
n
n+1
+
+
+
S
is a monotone class if it is closed under increasing unions and decreasing intersections:
1. If (A 2. If (A
is an increasing sequence of sets in S then ⋃ , …) is a decreasing sequence of sets in S then ⋂
1,
A2 , …)
1,
A2
∞
n=1 ∞ n=1
An ∈ S
An ∈ S
.
.
∞
If (A , A , …) is an increasing sequence of sets then we sometimes write ⋃ A = lim A . Similarly, if (A , A …) is a decreasing sequence of sets we sometimes write ⋂ A = lim A . The reason for this notation will become clear in the section on Convergence in the chapter on Probability Spaces. With this notation, a monotone class S is defined by the condition that if (A , A , …) is an increasing or decreasing sequence of sets in S then lim A ∈ S . 1
2
n=1
1
n
n=1
∞
n
n→∞
n→∞
n
1
2
n
2
n→∞
n
Basic Theorems Our most important set structure, the σ-algebra, has all of the properties in the definitions above. If S is a σ-algebra then S is a π-system, a λ -system, a semi-algebra, and a monotone class. If S is a λ -system then S ∈ S and ∅ ∈ S . Proof Any type of algebraic structure on subsets of S that is defined purely in terms of closure properties will be preserved under intersection. That is, we will have results that are analogous to how σ-algebras are generated from more basic sets, with completely straightforward and analgous proofs. In the following two theorems, the term system could mean π-system, λ -system, or monotone class of subsets of S . If S is a system for each i in an index set I and ⋂ i
i∈I
Si
is nonempty, then ⋂
i∈I
1.12.1
Si
is a system of the same type.
https://stats.libretexts.org/@go/page/10127
The condition that ⋂ S be nonempty is unnecessary for a λ -system, by the result above. Now suppose that B is a nonempty collection of subsets of S , thought of as basic sets of some sort. Then the system generated by B is the intersection of all systems that contain B . i∈I
i
The system S generated by B is the smallest system containing B , and is characterized by the following properties: 1. B ⊆ S . 2. If T is a system and B ⊆ T then S
⊆T
.
Note however, that the previous two results do not apply to semi-algebras, because the semi-algebra is not defined purely in terms of closure properties (the condition on A is not a closure property). c
If S is a monotone class and an algebra, then S is a σ-algebra. Proof By definition, a semi-algebra is a π-system. More importantly, a semi-algebra can be used to construct an algebra. Suppose that S is a semi-algebra of subsets of S . Then the collection S of finite, disjoint unions of sets in S is an algebra. ∗
Proof We will say that our nonempty collection S is closed under proper set difference if A, B ∈ S and A ⊆ B implies The following theorem gives the basic relationship between λ -systems and monotone classes.
B∖A ∈ S
.
Suppose that S is a nonempty collection of subsets of S . 1. If S is a λ -system then S is a monotone class and is closed under proper set difference. 2. If S is a monotone class, is closed under proper set difference, and contains S , then S is a λ -system. Proof The following theorem is known as the monotone class theorem, and is due to the mathematician Paul Halmos. Suppose that A is an algebra, M is a monotone class, and A
⊆M
. Then σ(A ) ⊆ M .
Proof As noted in (5), a σ-algebra is both a π-system and a studying these structures.
λ
-system. The converse is also true, and is one of the main reasons for
If S is a π-system and a λ -system then S is a σ-algebra. Proof The importance of π-systems and λ -systems stems in part from Dynkin's π-λ theorem given next. It's named for the mathematician Eugene Dynkin. Suppose that A is a π-system of subsets of S , B is a λ -system of subsets of S , and A
⊆B
. Then σ(A ) ⊆ B .
Proof
Examples and Special Cases Suppose that S is a set and A is a finite partition of S . Then S
= {∅} ∪ A
is a semi-algebra of subsets of S .
Proof
Euclidean Spaces The following example is particulalry important because it will be used to construct positive measures on R. Let B = {(a, b] : a, b ∈ R, a < b} ∪ {(−∞, b] : b ∈ R} ∪ {(a, ∞) : a ∈ R}
B
(1.12.3)
is a semi-algebra of subsets of R.
Proof
1.12.2
https://stats.libretexts.org/@go/page/10127
It follows from the theorem above that the collection A of finite disjoint unions of intervals in B is an algebra. Recall also that σ(B) = σ(A ) is the Borel σ-algebra of R , named for Émile Borel. We can generalize all of this to R for n ∈ N n
+
n
The collection B
= { ∏i=1 Ai : Ai ∈ B for each i ∈ {1, 2, … , n}}
n
Recall also that σ(B
n)
is a semi-algebra of subsets of R . n
is the σ-algebra of Borel sets of R . n
Product Spaces The examples in this discussion are important for constructing positive measures on product spaces. Suppose that S is a semi-algebra of subsets of a set S and that T is a semi-algebra of subsets of a set T . Then U = {A × B : A ∈ S , B ∈ T }
(1.12.4)
is a semi-algebra of subsets of S × T . Proof This result extends in a completely straightforward way to a product of a finite number of sets. Suppose that n ∈ N
+
and that S is a semi-algebra of subsets of a set S for i ∈ {1, 2, … , n}. Then i
i
n
U = { ∏ Ai : Ai ∈ Si for all i ∈ {1, 2, … , n}}
(1.12.7)
i=1
is a semi-algebra of subsets of ∏
n i=1
Si
.
Note that the semi-algebra of products of intervals in infinite sequence of sets, the result is bit more tricky.
n
R
described above is a special case of this result. For the product of an
Suppose that S is a semi-algebra of subsets of a set S for i ∈ N . Then i
i
+
∞
U = { ∏ Ai : Ai ∈ Si for all i ∈ N+ and Ai = Si for all but finitely many i ∈ N+ }
(1.12.8)
i=1
is a semi-algebra of subsets of ∏
n i=1
Si
.
Proof ∞
Note that this result would not be true with U = {∏ cannot be written as a finite disjoint union of sets in U .
i=1
Ai : Ai ∈ Si for all i ∈ N+ }
. In general, the complement of a set in U
This page titled 1.12: Special Set Structures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1.12.3
https://stats.libretexts.org/@go/page/10127
CHAPTER OVERVIEW 2: Probability Spaces The basic topics in this chapter are fundamental to probability theory, and should be accessible to new students of probability. We start with the paradigm of the random experiment and its mathematical model, the probability space. The main objects in this model are sample spaces, events, random variables, and probability measures. We also study several concepts of fundamental importance: conditional probability and independence. The advanced topics can be skipped if you are a new student of probability, or can be studied later, as the need arises. These topics include the convergence of random variables, the measure-theoretic foundations of probability theory, and the existence and construction of probability measures and random processes. 2.1: Random Experiments 2.2: Events and Random Variables 2.3: Probability Measures 2.4: Conditional Probability 2.5: Independence 2.6: Convergence 2.7: Measure Spaces 2.8: Existence and Uniqueness 2.9: Probability Spaces Revisited 2.10: Stochastic Processes 2.11: Filtrations and Stopping Times
This page titled 2: Probability Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1
2.1: Random Experiments Experiments Probability theory is based on the paradigm of a random experiment; that is, an experiment whose outcome cannot be predicted with certainty, before the experiment is run. In classical or frequency-based probability theory, we also assume that the experiment can be repeated indefinitely under essentially the same conditions. The repetitions can be in time (as when we toss a single coin over and over again) or in space (as when we toss a bunch of similar coins all at once). The repeatability assumption is important because the classical theory is concerned with the long-term behavior as the experiment is replicated. By contrast, subjective or belief-based probability theory is concerned with measures of belief about what will happen when we run the experiment. In this view, repeatability is a less crucial assumption. In any event, a complete description of a random experiment requires a careful definition of precisely what information about the experiment is being recorded, that is, a careful definition of what constitutes an outcome. The term parameter refers to a non-random quantity in a model that, once chosen, remains constant. Many probability models of random experiments have one or more parameters that can be adjusted to fit the physical experiment being modeled. The subjects of probability and statistics have an inverse relationship of sorts. In probability, we start with a completely specified mathematical model of a random experiment. Our goal is perform various computations that help us understand the random experiment, help us predict what will happen when we run the experiment. In statistics, by contrast, we start with an incompletely specified mathematical model (one or more parameters may be unknown, for example). We run the experiment to collect data, and then use the data to draw inferences about the unknown factors in the mathematical model.
Compound Experiments

Suppose that we have n experiments (E₁, E₂, …, Eₙ). We can form a new, compound experiment by performing the n experiments in sequence, E₁ first, and then E₂ and so on, independently of one another. The term independent means, intuitively, that the outcome of one experiment has no influence over any of the other experiments. We will make the term mathematically precise later.
In particular, suppose that we have a basic experiment. A fixed number (or even an infinite number) of independent replications of the basic experiment is a new, compound experiment. Many experiments turn out to be compound experiments and moreover, as noted above, (classical) probability theory itself is based on the idea of replicating an experiment. In particular, suppose that we have a simple experiment with two outcomes. Independent replications of this experiment are referred to as Bernoulli trials, named for Jacob Bernoulli. This is one of the simplest, but most important models in probability. More generally, suppose that we have a simple experiment with k possible outcomes. Independent replications of this experiment are referred to as multinomial trials. Sometimes an experiment occurs in well-defined stages, but in a dependent way, in the sense that the outcome of a given stage is influenced by the outcomes of the previous stages.
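To make the idea of independent replications concrete, here is a minimal simulation sketch (added here; not part of the original text). It treats Bernoulli and multinomial trials as n independent repetitions of a basic experiment; the parameter names n, p, and k are illustrative.

```python
import random

def bernoulli_trials(n, p):
    """Simulate n independent Bernoulli trials: 1 (success) with probability p."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def multinomial_trials(n, k):
    """Simulate n independent throws of a fair k-sided die (outcomes 1..k)."""
    return [random.randint(1, k) for _ in range(n)]

print(bernoulli_trials(10, 0.5))   # e.g. [1, 0, 0, 1, ...]
print(multinomial_trials(5, 6))    # e.g. [3, 6, 1, 1, 5]
```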
Sampling Experiments

In most statistical studies, we start with a population of objects of interest. The objects may be people, memory chips, or acres of corn, for example. Usually there are one or more numerical measurements of interest to us—the height and weight of a person, the lifetime of a memory chip, the amount of rain, amount of fertilizer, and yield of an acre of corn. Although our interest is in the entire population of objects, this set is usually too large and too amorphous to study. Instead, we collect a random sample of objects from the population and record the measurements of interest for each object in the sample.

There are two basic types of sampling. If we sample with replacement, each item is replaced in the population before the next draw; thus, a single object may occur several times in the sample. If we sample without replacement, objects are not replaced in the population. The chapter on Finite Sampling Models explores a number of models based on sampling from a finite population.

Sampling with replacement can be thought of as a compound experiment, based on independent replications of the simple experiment of drawing a single object from the population and recording the measurements of interest. Conversely, a compound experiment that consists of n independent replications of a simple experiment can usually be thought of as a sampling experiment.
On the other hand, sampling without replacement is an experiment that consists of dependent stages, because the population changes with each draw.
Examples and Applications

Probability theory is often illustrated using simple devices from games of chance: coins, dice, cards, spinners, urns with balls, and so forth. Examples based on such devices are pedagogically valuable because of their simplicity and conceptual clarity. On the other hand, it would be a terrible shame if you were to think that probability is only about gambling and games of chance. Rather, try to see problems involving coins, dice, etc. as metaphors for more complex and realistic problems.
Coins and Dice

In terms of probability, the important fact about a coin is simply that when tossed it lands on one side or the other. Coins in Western societies, dating to antiquity, usually have the head of a prominent person engraved on one side and something of lesser importance on the other. In non-Western societies, coins often did not have a head on either side, but did have distinct engravings on the two sides, one typically more important than the other. Nonetheless, heads and tails are the ubiquitous terms used in probability theory to distinguish the front or obverse side of the coin from the back or reverse side of the coin.
Figure 2.1.1 : Obverse and reverse sides of a Roman coin, about 241 CE, from Wikipedia
Consider the coin experiment of tossing a coin n times and recording the score (1 for heads or 0 for tails) for each toss.
1. Identify a parameter of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
4. Interpret the experiment as n Bernoulli trials.
Answer

In the simulation of the coin experiment, set n = 5. Run the simulation 100 times and observe the outcomes.

Dice are randomizing devices that, like coins, date to antiquity and come in a variety of sizes and shapes. Typically, the faces of a die have numbers or other symbols engraved on them. Again, the important fact is that when a die is thrown, a unique face is chosen (usually the upward face, but sometimes the downward one). For more on dice, see the introductory section in the chapter on Games of Chance.

Consider the dice experiment of throwing a k-sided die (with faces numbered 1 to k), n times and recording the scores for each throw.
1. Identify the parameters of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
4. Identify the experiment as n multinomial trials.
Answer

In reality, most dice are Platonic solids (named for Plato of course) with 4, 6, 8, 12, or 20 sides. The six-sided die is the standard die.
Figure 2.1.2 : Blue Platonic dice
In the simulation of the dice experiment, set n = 5 . Run the simulation 100 times and observe the outcomes.
In the die-coin experiment, a standard die is thrown and then a coin is tossed the number of times shown on the die. The sequence of coin scores is recorded (1 for heads and 0 for tails). Interpret the experiment as a compound experiment.
Answer

Note that this experiment can be obtained by randomizing the parameter n in the basic coin experiment in (1).

Run the simulation of the die-coin experiment 100 times and observe the outcomes.

In the coin-die experiment, a coin is tossed. If the coin lands heads, a red die is thrown and if the coin lands tails, a green die is thrown. The coin score (1 for heads and 0 for tails) and the die score are recorded. Interpret the experiment as a compound experiment.
Answer

Run the simulation of the coin-die experiment 100 times and observe the outcomes.
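A minimal sketch of the die-coin experiment (added for illustration; not part of the original text):

```python
import random

def die_coin():
    """One run of the die-coin experiment: throw a standard die,
    then toss a coin that many times; return the coin scores."""
    n = random.randint(1, 6)
    return [random.randint(0, 1) for _ in range(n)]

# 100 runs, as in the exercise above
outcomes = [die_coin() for _ in range(100)]
print(outcomes[:5])
```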
Cards

Playing cards, like coins and dice, date to antiquity. From the point of view of probability, the important fact is that a playing card encodes a number of properties or attributes on the front of the card that are hidden on the back of the card. (Later in this chapter, these properties will become random variables.) In particular, a standard card deck can be modeled by the Cartesian product set

D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠}   (2.1.1)

where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q♡ rather than (q, ♡) for the queen of hearts). Some other properties, derived from the two main ones, are color (diamonds and hearts are red, clubs and spades are black), face (jacks, queens, and kings have faces, the other cards do not), and suit order (from least to highest rank: ♣, ♢, ♡, ♠).

Consider the card experiment that consists of dealing n cards from a standard deck (without replacement).
1. Identify a parameter of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
Answer

In the simulation of the card experiment, set n = 5. Run the simulation 100 times and observe the outcomes.

The special case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment.

Open each of the following to see depictions of card playing in some famous paintings.
1. Cheat with the Ace of Clubs by Georges de La Tour
2. The Cardsharps by Michelangelo Caravaggio
3. The Card Players by Paul Cézanne
4. His Station and Four Aces by CM Coolidge
5. Waterloo by CM Coolidge
Urn Models

Urn models are often used in probability as simple metaphors for sampling from a finite population.

An urn contains m distinct balls, labeled from 1 to m. The experiment consists of selecting n balls from the urn, without replacement, and recording the sequence of ball numbers.
1. Identify the parameters of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
Answer

Consider the basic urn model of the previous exercise. Suppose that r of the m balls are red and the remaining m − r balls are green. Identify an additional parameter of the model. This experiment is a metaphor for sampling from a general dichotomous population.
Answer

In the simulation of the urn experiment, set m = 100, r = 40, and n = 25. Run the experiment 100 times and observe the results.
An urn initially contains m balls; r are red and m − r are green. A ball is selected from the urn and removed, and then replaced with k balls of the same color. The process is repeated. This is known as Pólya's urn model, named after George Pólya.
1. Identify the parameters of the experiment.
2. Interpret the case k = 0 as a sampling experiment.
3. Interpret the case k = 1 as a sampling experiment.
Answer

Open the image of the painting Allegory of Fortune by Dosso Dossi. Presumably the young man has chosen lottery tickets from an urn.
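A sketch of Pólya's urn as described above (added here, not in the original; assumes the urn never empties, so for k = 0 the number of steps should not exceed m):

```python
import random

def polya_urn(m, r, k, steps):
    """Simulate Polya's urn: a ball is drawn and removed, then replaced
    by k balls of the same color, so k = 0 is sampling without replacement
    and k = 1 is sampling with replacement. Returns the colors drawn."""
    red, green = r, m - r
    draws = []
    for _ in range(steps):
        if random.random() < red / (red + green):
            draws.append('R')
            red += k - 1      # one red removed, k reds added
        else:
            draws.append('G')
            green += k - 1
    return draws

print(polya_urn(m=10, r=4, k=2, steps=20))
```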
Buffon's Coin Experiment

Buffon's coin experiment consists of tossing a coin with radius r ≤ 1/2 on a floor covered with square tiles of side length 1. The coordinates of the center of the coin are recorded, relative to axes through the center of the square, parallel to the sides. The experiment is named for comte de Buffon.
1. Identify a parameter of the experiment.
2. Interpret the experiment as a compound experiment.
3. Interpret the experiment as a sampling experiment.
Answer

In the simulation of Buffon's coin experiment, set r = 0.1. Run the experiment 100 times and observe the outcomes.
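A minimal simulation sketch of Buffon's coin experiment (added here; not part of the original text):

```python
import random

def buffon_coin(r):
    """One toss in Buffon's coin experiment with coin radius r <= 1/2:
    return the (x, y) coordinates of the coin's center, uniform on the
    square [-1/2, 1/2] x [-1/2, 1/2]."""
    return random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)

# 100 tosses with r = 0.1, as in the exercise above
tosses = [buffon_coin(0.1) for _ in range(100)]
print(tosses[:3])
```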
Reliability

In the usual model of structural reliability, a system consists of n components, each of which is either working or failed. The states of the components are uncertain, and hence define a random experiment. The system as a whole is also either working or failed, depending on the states of the components and how the components are connected. For example, a series system works if and only if each component works, while a parallel system works if and only if at least one component works. More generally, a k out of n system works if at least k components work.

Consider the k out of n reliability model.
1. Identify two parameters.
2. What value of k gives a series system?
3. What value of k gives a parallel system?
Answer

The reliability model above is a static model. It can be extended to a dynamic model by assuming that each component is initially working, but has a random time until failure. The system as a whole would also have a random time until failure that would depend on the component failure times and the structure of the system.
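The k out of n model above amounts to a simple threshold rule; here is a sketch (added for illustration; not in the original):

```python
def k_out_of_n_works(states, k):
    """State of a k out of n system given component states (list of 0/1):
    working (1) iff at least k components work. k = n gives a series
    system, k = 1 a parallel system."""
    return 1 if sum(states) >= k else 0

print(k_out_of_n_works([1, 0, 1], 2))  # 1: at least 2 of 3 work
print(k_out_of_n_works([1, 0, 1], 3))  # 0: the series system fails
```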
Genetics
In ordinary sexual reproduction, the genetic material of a child is a random combination of the genetic material of the parents. Thus, the birth of a child is a random experiment with respect to outcomes such as eye color, hair type, and many other physical traits. We are often particularly interested in the random transmission of traits and the random transmission of genetic disorders.

For example, let's consider an overly simplified model of an inherited trait that has two possible states (phenotypes), say a pea plant whose pods are either green or yellow. The term allele refers to alternate forms of a particular gene, so we are assuming that there is a gene that determines pod color, with two alleles: g for green and y for yellow. A pea plant has two alleles for the trait (one from each parent), so the possible genotypes are
gg, alleles for green pods from each parent.
gy, an allele for green pods from one parent and an allele for yellow pods from the other (we usually cannot observe which parent contributed which allele).
yy, alleles for yellow pods from each parent.
The genotypes gg and yy are called homozygous because the two alleles are the same, while the genotype gy is called heterozygous because the two alleles are different. Typically, one of the alleles of the inherited trait is dominant and the other recessive. Thus, for example, if g is the dominant allele for pod color, then a plant with genotype gg or gy has green pods, while a plant with genotype yy has yellow pods. Genes are passed from parent to child in a random manner, so each new plant is a random experiment with respect to pod color.

Pod color in peas was actually one of the first examples of an inherited trait studied by Gregor Mendel, who is considered the father of modern genetics. Mendel also studied the color of the flowers (yellow or purple), the length of the stems (short or long), and the texture of the seeds (round or wrinkled).

For another example, the ABO blood type in humans is controlled by three alleles: a, b, and o. Thus, the possible genotypes are aa, ab, ao, bb, bo and oo. The alleles a and b are co-dominant and o is recessive. Thus there are four possible blood types (phenotypes):
Type A: genotype aa or ao
Type B: genotype bb or bo
Type AB: genotype ab
Type O: genotype oo

Of course, blood may be typed in much more extensive ways than the simple ABO typing. The Rh factor (positive or negative) is the most well-known example.

For our third example, consider a sex-linked hereditary disorder in humans. This is a disorder due to a defect on the X chromosome (one of the two chromosomes that determine gender). Suppose that h denotes the healthy allele and d the defective allele for the gene linked to the disorder. Women have two X chromosomes, and typically d is recessive. Thus, a woman with genotype hh is completely normal with respect to the condition; a woman with genotype hd does not have the disorder, but is a carrier, since she can pass the defective allele to her children; and a woman with genotype dd has the disorder. A man has only one X chromosome (his other sex chromosome, the Y chromosome, typically plays no role in the disorder). A man with genotype h is normal and a man with genotype d has the disorder. Examples of sex-linked hereditary disorders are dichromatism, the most common form of color-blindness, and hemophilia, a bleeding disorder. Again, genes are passed from parent to child in a random manner, so the birth of a child is a random experiment in terms of the disorder.
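The random transmission of alleles is easy to simulate; here is a sketch for the pea-plant example (added here, not in the original):

```python
import random

def child_genotype(parent1, parent2):
    """Random genotype of a child: one allele chosen at random from each
    parent. Genotypes are two-character strings, e.g. 'gy'."""
    return random.choice(parent1) + random.choice(parent2)

# Cross two heterozygous pea plants ('gy' x 'gy'); with g dominant,
# pods are green unless the genotype is 'yy'.
children = [child_genotype('gy', 'gy') for _ in range(10)]
print(children, ['green' if 'g' in c else 'yellow' for c in children])
```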
Point Processes

There are a number of important processes that generate “random points in time”. Often the random points are referred to as arrivals. Here are some specific examples:
times that a piece of radioactive material emits elementary particles
times that customers arrive at a store
times that requests arrive at a web server
failure times of a device

To formalize an experiment, we might record the number of arrivals during a specified interval of time or we might record the times of successive arrivals.

There are other processes that produce “random points in space”. For example,
flaws in a piece of sheet metal
errors in a string of symbols (in a computer program, for example)
raisins in a cake
misprints on a page
stars in a region of space

Again, to formalize an experiment, we might record the number of points in a given region of space.
Statistical Experiments

In 1879, Albert Michelson constructed an experiment for measuring the speed of light with an interferometer. The velocity of light data set contains the results of 100 repetitions of Michelson's experiment. Explore the data set and explain, in a general way, the variability of the data.
Answer

In 1998, two students at the University of Alabama in Huntsville designed the following experiment: purchase a bag of M&Ms (of a specified advertised size) and record the counts for red, green, blue, orange, and yellow candies, and the net weight (in grams). Explore the M&M data set and explain, in a general way, the variability of the data.
Answer

In 1999, two researchers at Belmont University designed the following experiment: capture a cicada in the Middle Tennessee area, and record the body weight (in grams), the wing length, wing width, and body length (in millimeters), the gender, and the species type. The cicada data set contains the results of 104 repetitions of this experiment. Explore the cicada data and explain, in a general way, the variability of the data.
Answer

On June 6, 1761, James Short made 53 measurements of the parallax of the sun, based on the transit of Venus. Explore the Short data set and explain, in a general way, the variability of the data.
Answer

In 1954, two massive field trials were conducted in an attempt to determine the effectiveness of the new vaccine developed by Jonas Salk for the prevention of polio. In both trials, a treatment group of children were given the vaccine while a control group of children were not. The incidence of polio in each group was measured. Explore the polio field trial data set and explain, in a general way, the underlying random experiment.
Answer

Each year from 1969 to 1972 a lottery was held in the US to determine who would be drafted for military service. Essentially, the lottery was a ball and urn model and became famous because many believed that the process was not sufficiently random. Explore the Vietnam draft lottery data set and speculate on how one might judge the degree of randomness.
Answer
Deterministic Versus Probabilistic Models

One could argue that some of the examples discussed above are inherently deterministic. In tossing a coin, for example, if we know the initial conditions (involving position, velocity, rotation, etc.), the forces acting on the coin (gravity, air resistance, etc.), and the makeup of the coin (shape, mass density, center of mass, etc.), then the laws of physics should allow us to predict precisely how the coin will land. This is true in a technical, theoretical sense, but false in a very real sense. Coins, dice, and many more complicated and important systems are chaotic in the sense that the outcomes of interest depend in a very sensitive way on the initial conditions and other parameters. In such situations, it might well be impossible to ever know the initial conditions and forces accurately enough to use deterministic methods. In the coin experiment, for example, even if we strip away most of the real world complexity, we are still left with an essentially random experiment. Joseph Keller in his article “The Probability of Heads” deterministically analyzed the toss of a coin under a number of ideal assumptions:
1. The coin is a perfect circle and has negligible thickness.
2. The center of gravity of the coin is the geometric center.
3. The coin is initially heads up and is given an initial upward velocity u and angular velocity ω.
4. In flight, the coin rotates about a horizontal axis along a diameter of the coin.
5. In flight, the coin is governed only by the force of gravity. All other possible forces (air resistance or wind, for example) are neglected.
6. The coin does not bounce or roll after landing (as might be the case if it lands in sand or mud).

Of course, few of these ideal assumptions are valid for real coins tossed by humans. Let t = u/g where g is the acceleration of gravity (in appropriate units). Note that t has units of time (in seconds) and hence is independent of how distance is measured. The scaled parameter t actually represents the time required for the coin to reach its maximum height. Keller showed that the regions of the parameter space (t, ω) where the coin lands either heads up or tails up are separated by the curves

ω = (2n ± 1/2) π / (2t),  n ∈ ℕ   (2.1.2)
The parameter n is the total number of revolutions in the toss. A plot of some of these curves is given below. The largest region, in the lower left corner, corresponds to the event that the coin does not complete even one rotation, and so of course lands heads up, just as it started. The next region corresponds to one rotation, with the coin landing tails up. In general, the regions alternate between heads and tails.
Figure 2.1.3 : Regions of heads and tails
The important point, of course, is that for even moderate values of t and ω, the curves are very close together, so that a small change in the initial conditions can easily shift the outcome from heads up to tails up or conversely. As noted in Keller's article, the probabilist and statistician Persi Diaconis determined experimentally that typical values of the initial conditions for a real coin toss are t = 1/4 seconds and ω = 76π ≈ 238.8 radians per second. These values correspond to n = 19 revolutions in the toss. Of course, this parameter point is far beyond the region shown in our graph, in a region where the curves are exquisitely close together.
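As a quick illustration of the sensitivity (a sketch added here, not in the original text): under Keller's ideal model the coin starts heads up, stays in the air for time 2t, and turns through a total angle of 2ωt; it lands heads up when that angle is within π/2 of an even multiple of π, i.e. when cos(2ωt) > 0.

```python
import math

def keller_outcome(t, omega):
    """Outcome of Keller's ideal coin toss: the coin starts heads up,
    is in the air for time 2*t, and turns through angle 2*omega*t.
    It lands heads up exactly when cos(2*omega*t) > 0."""
    return 'H' if math.cos(2 * omega * t) > 0 else 'T'

# Diaconis's typical values: t = 1/4 s, omega = 76*pi rad/s (19 revolutions)
print(keller_outcome(0.25, 76 * math.pi))    # 'H'
# A change of about 1.5% in omega flips the outcome
print(keller_outcome(0.25, 77.2 * math.pi))  # 'T'
```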
This page titled 2.1: Random Experiments is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.2: Events and Random Variables

The purpose of this section is to study two basic types of objects that form part of the model of a random experiment. If you are a new student of probability, just ignore the measure-theoretic terminology and skip the technical details.
Sample Spaces

The Set of Outcomes

Recall that in a random experiment, the outcome cannot be predicted with certainty, before the experiment is run. On the other hand:

We assume that we can identify a fixed set S that includes all possible outcomes of a random experiment.

This set plays the role of the universal set when modeling the experiment. For simple experiments, S may be precisely the set of possible outcomes. More often, for complex experiments, S is a mathematically convenient set that includes the possible outcomes and perhaps other elements as well. For example, if the experiment is to throw a standard die and record the score that occurs, we would let S = {1, 2, 3, 4, 5, 6}, the set of possible outcomes. On the other hand, if the experiment is to capture a cicada and measure its body weight (in milligrams), we might conveniently take S = [0, ∞), even though most elements of this set are impossible (we hope!). The problem is that we may not know exactly the outcomes that are possible. Can a light bulb burn without failure for one thousand hours? For one thousand days? For one thousand years?

Often the outcome of a random experiment consists of one or more real measurements, and thus S consists of all possible measurement sequences, a subset of ℝⁿ for some n ∈ ℕ₊. More generally, suppose that we have n experiments and that Sᵢ is the set of outcomes for experiment i ∈ {1, 2, …, n}. Then the Cartesian product S₁ × S₂ × ⋯ × Sₙ is the natural set of outcomes for the compound experiment that consists of performing the n experiments in sequence. In particular, if we have a basic experiment with S as the set of outcomes, then Sⁿ is the natural set of outcomes for the compound experiment that consists of n replications of the basic experiment. Similarly, if we have an infinite sequence of experiments and Sᵢ is the set of outcomes for experiment i ∈ ℕ₊, then S₁ × S₂ × ⋯ is the natural set of outcomes for the compound experiment that consists of performing the given experiments in sequence. In particular, the set of outcomes for the compound experiment that consists of indefinite replications of a basic experiment is S^∞ = S × S × ⋯. This is an essential special case, because (classical) probability theory is based on the idea of replicating a given experiment.
Events

Consider again a random experiment with S as the set of outcomes. Certain subsets of S are referred to as events. Suppose that A ⊆ S is a given event, and that the experiment is run, resulting in outcome s ∈ S.
1. If s ∈ A then we say that A occurs.
2. If s ∉ A then we say that A does not occur.

Intuitively, you should think of an event as a meaningful statement about the experiment: every such statement translates into an event, namely the set of outcomes for which the statement is true. In particular, S itself is an event; by definition it always occurs. At the other extreme, the empty set ∅ is also an event; by definition it never occurs.

For a note on terminology, recall that a mathematical space consists of a set together with other mathematical structures defined on the set. An example you may be familiar with is a vector space, which consists of a set (the vectors) together with the operations of addition and scalar multiplication. In probability theory, many authors use the term sample space for the set of outcomes of a random experiment, but here is the more careful definition:

The sample space of an experiment is (S, 𝒮) where S is the set of outcomes and 𝒮 is the collection of events.
Details
The Algebra of Events

The standard algebra of sets leads to a grammar for discussing random experiments and allows us to construct new events from given events. In the following results, suppose that S is the set of outcomes of a random experiment, and that A and B are events.

A ⊆ B if and only if the occurrence of A implies the occurrence of B.
Proof

A ∪ B is the event that occurs if and only if A occurs or B occurs.
Proof

A ∩ B is the event that occurs if and only if A occurs and B occurs.
Proof

A and B are disjoint if and only if they are mutually exclusive; they cannot both occur on the same run of the experiment.
Proof

Aᶜ is the event that occurs if and only if A does not occur.
Proof

A ∖ B is the event that occurs if and only if A occurs and B does not occur.
Proof

(A ∩ Bᶜ) ∪ (B ∩ Aᶜ) is the event that occurs if and only if one but not both of the given events occurs.
Proof

Recall that the event in the previous result is the symmetric difference of A and B, and is sometimes denoted A Δ B. This event corresponds to exclusive or, as opposed to the ordinary union A ∪ B which corresponds to inclusive or.

(A ∩ B) ∪ (Aᶜ ∩ Bᶜ) is the event that occurs if and only if both or neither of the given events occurs.
Proof

In the Venn diagram app, observe the diagram of each of the 16 events that can be constructed from A and B.

Suppose now that 𝒜 = {Aᵢ : i ∈ I} is a collection of events for the random experiment, where I is a countable index set.

⋃𝒜 = ⋃_{i∈I} Aᵢ is the event that occurs if and only if at least one event in the collection occurs.
Proof

⋂𝒜 = ⋂_{i∈I} Aᵢ is the event that occurs if and only if every event in the collection occurs.
Proof

𝒜 is a pairwise disjoint collection if and only if the events are mutually exclusive; at most one of the events could occur on a given run of the experiment.
Proof

Suppose now that (A₁, A₂, …) is an infinite sequence of events.

⋂_{n=1}^∞ ⋃_{i=n}^∞ Aᵢ is the event that occurs if and only if infinitely many of the given events occur. This event is sometimes called the limit superior of (A₁, A₂, …).
Proof

⋃_{n=1}^∞ ⋂_{i=n}^∞ Aᵢ is the event that occurs if and only if all but finitely many of the given events occur. This event is sometimes called the limit inferior of (A₁, A₂, …).
Proof
Limit superiors and inferiors are discussed in more detail in the section on convergence.
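To see this grammar concretely, here is a small illustration using Python sets (added here; not part of the original text). The two-dice events A and B anticipate the dice exercises later in this section.

```python
# Event algebra with Python sets, for the experiment of throwing two dice
S = {(i, j) for i in range(1, 7) for j in range(1, 7)}   # set of outcomes
A = {(i, j) for (i, j) in S if i == 1}                   # first score is 1
B = {(i, j) for (i, j) in S if i + j == 7}               # sum is 7

print(A | B)            # union: A or B occurs
print(A & B)            # intersection: A and B occur
print(S - A)            # complement of A
print(A.isdisjoint(B))  # mutually exclusive?
```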
Random Variables

Intuitively, a random variable is a measurement of interest in the context of the experiment. Simple examples include the number of heads when a coin is tossed several times, the sum of the scores when a pair of dice are thrown, the lifetime of a device subject to random stress, and the weight of a person chosen from a population. Many more examples are given in the exercises below. Mathematically, a random variable is a function defined on the set of outcomes.

A function X from S into a set T is a random variable for the experiment with values in T.
Details

Probability has its own notation, very different from other branches of mathematics. As a case in point, random variables, even though they are functions, are usually denoted by capital letters near the end of the alphabet. The use of a letter near the end of the alphabet is intended to emphasize the idea that the object is a variable in the context of the experiment. The use of a capital letter is intended to emphasize the fact that it is not an ordinary algebraic variable to which we can assign a specific value, but rather a random variable whose value is indeterminate until we run the experiment. Specifically, when we run the experiment an outcome s ∈ S occurs, and random variable X takes the value X(s) ∈ T.
Figure 2.2.1 : A random variable as a function defined on the set of outcomes.
If B ⊆ T, we use the notation {X ∈ B} for the inverse image {s ∈ S : X(s) ∈ B}, rather than X⁻¹(B). Again, the notation is more natural since we think of X as a variable in the experiment. Think of {X ∈ B} as a statement about X, which then translates into the event {s ∈ S : X(s) ∈ B}.
Figure 2.2.2 : The event {X ∈ B} corresponding to B ⊆ T
Again, every statement about a random variable X with values in T translates into an inverse image of the form {X ∈ B} for some B ⊆ T. So, for example, if x ∈ T then {X = x} = {X ∈ {x}} = {s ∈ S : X(s) = x}. If X is a real-valued random variable and a, b ∈ ℝ with a < b then {a ≤ X ≤ b} = {X ∈ [a, b]} = {s ∈ S : a ≤ X(s) ≤ b}.

Suppose that X is a random variable taking values in T, and that A, B ⊆ T. Then
1. {X ∈ A ∪ B} = {X ∈ A} ∪ {X ∈ B}
2. {X ∈ A ∩ B} = {X ∈ A} ∩ {X ∈ B}
3. {X ∈ A ∖ B} = {X ∈ A} ∖ {X ∈ B}
4. A ⊆ B ⟹ {X ∈ A} ⊆ {X ∈ B}
5. If A and B are disjoint, then so are {X ∈ A} and {X ∈ B}.
Proof

As with a general function, the result in part (a) holds for the union of a countable collection of subsets, and the result in part (b) holds for the intersection of a countable collection of subsets. No new ideas are involved; only the notation is more complicated.

Often, a random variable takes values in a subset T of ℝᵏ for some k ∈ ℕ₊. We might express such a random variable as X = (X₁, X₂, …, Xₖ) where Xᵢ is a real-valued random variable for each i ∈ {1, 2, …, k}. In this case, we usually refer to X as a random vector, to emphasize its higher-dimensional character. A random variable can have an even more complicated structure. For example, if the experiment is to select n objects from a population and record various real measurements for each object, then the outcome of the experiment is a vector of vectors: X = (X₁, X₂, …, Xₙ) where Xᵢ is the vector of measurements for the ith object. There are other possibilities; a random variable could be an infinite sequence, or could be set-valued. Specific
examples are given in the computational exercises below. However, the important point is simply that a random variable is a function defined on the set of outcomes S.

The outcome of the experiment itself can be thought of as a random variable. Specifically, let T = S and let X denote the identity function on S so that X(s) = s for s ∈ S. Then trivially X is a random variable, and the events that can be defined in terms of X are simply the original events of the experiment. That is, if A is an event then {X ∈ A} = A.

Conversely, every random variable effectively defines a new random experiment. In the general setting above, a random variable X defines a new random experiment with T as the new set of outcomes and subsets of T as the new collection of events.
Details

In fact, often a random experiment is modeled by specifying the random variables of interest, in the language of the experiment. Then, a mathematical definition of the random variables specifies the sample space. A function (or transformation) of a random variable defines a new random variable.

Suppose that X is a random variable for the experiment with values in T and that g is a function from T into another set U. Then Y = g(X) is a random variable with values in U.
Details Note that, as functions, g(X) = g ∘ X , the composition of g with X. But again, thinking of X and Y as variables in the context of the experiment, the notation Y = g(X) is much more natural.
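As an illustration of random variables as functions on S (a sketch added here; not in the original text):

```python
import random

S = [(i, j) for i in range(1, 7) for j in range(1, 7)]  # outcomes for two dice

def Y(s):
    """A random variable: the sum of the two scores."""
    return s[0] + s[1]

def g(y):
    return y % 2          # a function of the random variable

s = random.choice(S)      # run the experiment
print(s, Y(s), g(Y(s)))   # outcome, value of Y, value of g(Y)
```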
Indicator Variables

For an event A, the indicator function of A is called the indicator variable of A. The value of this random variable tells us whether or not A has occurred:

1_A = 1 if A occurs, and 1_A = 0 if A does not occur.   (2.2.1)

That is, as a function on S,

1_A(s) = 1 if s ∈ A, and 1_A(s) = 0 if s ∉ A.   (2.2.2)

If X is a random variable that takes values 0 and 1, then X is the indicator variable of the event {X = 1}.
Proof

Recall also that the set algebra of events translates into the arithmetic algebra of indicator variables. Suppose that A and B are events.
1. 1_{A∩B} = 1_A 1_B = min{1_A, 1_B}
2. 1_{A∪B} = 1 − (1 − 1_A)(1 − 1_B) = max{1_A, 1_B}
3. 1_{B∖A} = 1_B(1 − 1_A)
4. 1_{Aᶜ} = 1 − 1_A
5. A ⊆ B if and only if 1_A ≤ 1_B

The result in part (a) extends to arbitrary intersections and the result in part (b) extends to arbitrary unions. If the event A has a complicated description, we sometimes use 1(A) for the indicator variable rather than 1_A.
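A minimal numerical check of the indicator algebra above (added here; not part of the original text), again using the two-dice events:

```python
def indicator(A, s):
    """Indicator variable of event A evaluated at outcome s."""
    return 1 if s in A else 0

S = {(i, j) for i in range(1, 7) for j in range(1, 7)}
A = {s for s in S if s[0] == 1}
B = {s for s in S if s[0] + s[1] == 7}

s = (1, 6)
# 1_{A∩B} = 1_A * 1_B and 1_{A∪B} = 1 - (1 - 1_A)(1 - 1_B)
assert indicator(A & B, s) == indicator(A, s) * indicator(B, s)
assert indicator(A | B, s) == 1 - (1 - indicator(A, s)) * (1 - indicator(B, s))
print(indicator(A, s), indicator(B, s))
```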
Examples and Applications

Recall that probability theory is often illustrated using simple devices from games of chance: coins, dice, cards, spinners, urns with balls, and so forth. Examples based on such devices are pedagogically valuable because of their simplicity and conceptual clarity. On the other hand, remember that probability is not only about gambling and games of chance. Rather, try to see problems involving coins, dice, etc. as metaphors for more complex and realistic problems.
Coins and Dice

The basic coin experiment consists of tossing a coin n times and recording the sequence of scores (X₁, X₂, …, Xₙ) (where 1 denotes heads and 0 denotes tails). This experiment is a generic example of n Bernoulli trials, named for Jacob Bernoulli.
Consider the coin experiment with n = 4, and let Y denote the number of heads.
1. Give the set of outcomes S in list form.
2. Give the event {Y = k} in list form for each k ∈ {0, 1, 2, 3, 4}.
Answer
n =4
. Run the experiment 100 times and count the number of times that the
Now consider the general coin experiment with the coin tossed n times, and let Y denote the number of heads. 1. Give the set of outcomes S in Cartesian product form, and give the cardinality of S . 2. Express Y as a function on S . 3. Find #{Y = k} (as a subset of S ) for k ∈ {0, 1, … , n} Answer The basic dice experiment consists of throwing n distinct k -sided dice (with faces numbered from 1 to k ) and recording the sequence of scores (X , X , … , X ). This experiment is a generic example of n multinomial trials. The special case k = 6 corresponds to standard dice. 1
2
n
Consider the dice experiment with n = 2 standard dice. Let S denote the set of outcomes, A the event that the first die score is 1, and B the event that the sum of the scores is 7. Give each of the following events in the form indicated: 1. S in Cartesian product form 2. A in list form 3. B in list form 4. A ∪ B in list form 5. A ∩ B in list form 6. A ∩ B in predicate form c
c
Answer In the simulation of the dice experiment, set n = 2 . Run the experiment 100 times and count the number of times each event in the previous exercise occurs. Consider the dice experiment with n = 2 standard dice, and let S denote the set of outcomes, Y the sum of the scores, minimum score, and V the maximum score.
U
the
1. Express Y as a function on S and give the set of possible values in list form. 2. Express U as a function on S and give the set of possible values in list form. 3. Express V as a function on the S and give the set of possible values in list form. 4. Give the set of possible values of (U , V ) in predicate from Answer Consider again the dice experiment with n = 2 standard dice, and let S denote the set of outcomes, Y the sum of the scores, U the minimum score, and V the maximum score. Give each of the following as subsets of S , in list form. 1. {X < 3, X 2. {Y = 7} 3. {U = 2} 4. {V = 4} 5. {U = V } 1
2
> 4}
Answer
2.2.5
https://stats.libretexts.org/@go/page/10130
In the dice experiment, set exercise occurred.
n =2
. Run the experiment 100 times. Count the number of times each event in the previous
In the general dice experiment with n distinct k -sided dice, let Y denote the sum of the scores, U the minimum score, and V the maximum score. 1. Give the set of outcomes S and find #(S) . 2. Express Y as a function on S , and give the set of possible values in list form. 3. Express U as a function on S , and give the set of possible values in list form. 4. Express V as a function on S , and give the set of possible values in list form. 5. Give the set of possible values of (U , V ) in predicate from. Answer The set of outcomes of a random experiment depends of course on what information is recorded. The following exercise is an illustration. An experiment consists of throwing a pair of standard dice repeatedly until the sum of the two scores is either 5 or 7. Let denote the event that the sum is 5 rather than 7 on the final throw. Experiments of this type arise in the casino game craps.
A
1. Suppose that the pair of scores on each throw is recorded. Define the set of outcomes of the experiment and describe A as a subset of this set. 2. Suppose that the pair of scores on the final throw is recorded. Define the set of outcomes of the experiment and describe A as a subset of this set. Answer Suppose that 3 standard dice are rolled and the sequence of scores (X , X , X ) is recorded. A person pays $1 to play. If some of the dice come up 6, then the player receives her $1 back, plus $1 for each 6. Otherwise she loses her $1. Let W denote the person's net winnings. This is the game of chuck-a-luck and is treated in more detail in the chapter on Games of Chance. 1
2
3
1. Give the set of outcomes S in Cartesian product form. 2. Express W as a function on S and give the set of possible values in list form. Answer Play the chuck-a-luck experiment a few times and see how you do. In the die-coin experiment, a standard die is rolled and then a coin is tossed the number of times shown on the die. The sequence of coin scores X is recorded (0 for tails, 1 for heads). Let N denote the die score and Y the number of heads. 1. Give the set of outcomes S in terms of Cartesian powers and find #(S) . 2. Express N as a function on S and give the set of possible values in list form. 3. Express Y as a function on S and give the set of possible values in list form. 4. Give the event A that all tosses result in heads in list form. Answer Run the simulation of the die-coin experiment 10 times. For each run, give the values of the random variables X, N , and Y of the previous exercise. Count the number of times the event A occurs. In the coin-die experiment, we have a coin and two distinct dice, say one red and one green. First the coin is tossed, and then if the result is heads the red die is thrown, while if the result is tails the green die is thrown. The coin score X and the score of the chosen die Y are recorded. Suppose now that the red die is a standard 6-sided die, and the green die a 4-sided die. 1. Give the set of outcomes S in list form. 2. Express X as a function on S . 3. Express Y as a function on S . 4. Give the event {Y ≥ 3} as a subset of S in list form. Answer
2.2.6
https://stats.libretexts.org/@go/page/10130
Run the coin-die experiment 100 times, with various types of dice.
Sampling Models

Recall that many random experiments can be thought of as sampling experiments. For the general finite sampling model, we start with a population D with m (distinct) objects. We select a sample of n objects from the population. If the sampling is done in a random way, then we have a random experiment with the sample as the basic outcome. Thus, the set of outcomes of the experiment is literally the set of samples; this is the historical origin of the term sample space. There are four common types of sampling from a finite population, based on the criteria of order and replacement. Recall the following facts from the section on combinatorial structures, where C(m, n) = m!/[n!(m − n)!] denotes the binomial coefficient:

Samples of size n chosen from a population with m elements.
1. If the sampling is with replacement and with regard to order, then the set of samples is the Cartesian power Dⁿ. The number of samples is mⁿ.
2. If the sampling is without replacement and with regard to order, then the set of samples is the set of all permutations of size n from D. The number of samples is m⁽ⁿ⁾ = m(m − 1) ⋯ [m − (n − 1)].
3. If the sampling is without replacement and without regard to order, then the set of samples is the set of all combinations (or subsets) of size n from D. The number of samples is C(m, n).
4. If the sampling is with replacement and without regard to order, then the set of samples is the set of all multisets of size n from D. The number of samples is C(m + n − 1, n).
If we sample with replacement, the sample size n can be any positive integer. If we sample without replacement, the sample size cannot exceed the population size, so we must have n ∈ {1, 2, …, m}. The basic coin and dice experiments are examples of sampling with replacement. If we toss a coin n times and record the sequence of scores (where as usual, 0 denotes tails and 1 denotes heads), then we generate an ordered sample of size n with replacement from the population {0, 1}. If we throw n (distinct) standard dice and record the sequence of scores, then we generate an ordered sample of size n with replacement from the population {1, 2, 3, 4, 5, 6}.

Suppose that the sampling is without replacement (the most common case). If we record the ordered sample X = (X₁, X₂, …, Xₙ), then the unordered sample W = {X₁, X₂, …, Xₙ} is a random variable (that is, a function of X). On the other hand, if we just record the unordered sample W in the first place, then we cannot recover the ordered sample. Note also that the number of ordered samples of size n is simply n! times the number of unordered samples of size n. No such simple relationship exists when the sampling is with replacement. This will turn out to be an important point when we study probability models based on random samples, in the next section.
Consider a sample of size n = 3 chosen without replacement from the population {a, b, c, d, e}. 1. Give T , the set of unordered samples in list form. 2. Give in list form the set of all ordered samples that correspond to the unordered sample {b, c, e}. 3. Note that for every unordered sample, there are 6 ordered samples. 4. Give the cardinality of S , the set of ordered samples. Answer Traditionally in probability theory, an urn containing balls is often used as a metaphor for a finite population. Suppose that an urn contains 50 (distinct) balls. A sample of 10 balls is selected from the urn. Find the number of samples in each of the following cases: 1. Ordered samples with replacement 2. Ordered samples without replacement 3. Unordered samples without replacement 4. Unordered samples with replacement Answer
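The four sampling counts above can be computed directly (an added sketch, not in the original; uses math.perm and math.comb, available in Python 3.8+), here with the illustrative values m = 50 and n = 10:

```python
from math import comb, perm

m, n = 50, 10
print(m ** n)                # ordered, with replacement
print(perm(m, n))            # ordered, without replacement: m^(n)
print(comb(m, n))            # unordered, without replacement
print(comb(m + n - 1, n))    # unordered, with replacement (multisets)
```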
Suppose again that we have a population D with m (distinct) objects, but suppose now that each object is one of two types—either type 1 or type 0. Such populations are said to be dichotomous. Here are some specific examples:
The population consists of persons, each either male or female.
The population consists of voters, each either democrat or republican.
The population consists of devices, each either good or defective.
The population consists of balls, each either red or green.

Suppose that the population D has r type 1 objects and hence m − r type 0 objects. Of course, we must have r ∈ {0, 1, …, m}. Now suppose that we select a sample of size n without replacement from the population. Note that this model has three parameters: the population size m, the number of type 1 objects in the population r, and the sample size n. Let Y denote the number of type 1 objects in the sample. Then
1. #{Y = k} = C(n, k) r⁽ᵏ⁾ (m − r)⁽ⁿ⁻ᵏ⁾ for each k ∈ {0, 1, …, n}, if the event is considered as a subset of S, the set of ordered samples.
2. #{Y = k} = C(r, k) C(m − r, n − k) for each k ∈ {0, 1, …, n}, if the event is considered as a subset of T, the set of unordered samples.
3. The expression in (a) is n! times the expression in (b).
Proof A batch of 50 components consists of 40 good components and 10 defective components. A sample of 5 components is selected, without replacement. Let Y denote the number of defectives in the sample. 1. Let S denote the set of ordered samples. Find #(S) . 2. Let T denote the set of unordered samples. Find #(T ). 3. As a subset of T , find #{Y = k} for each k ∈ {0, 1, 2, 3, 4, 5}. Answer Run the simulation of the ball and urn experiment 100 times for the parameter values in the last exercise: n = 5 . Note the values of the random variable Y .
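As a quick numerical check of the counting identity above (a sketch added here, not in the original), with the illustrative values m = 50, r = 10, n = 5:

```python
from math import comb, perm, factorial

m, r, n = 50, 10, 5
for k in range(n + 1):
    ordered = comb(n, k) * perm(r, k) * perm(m - r, n - k)
    unordered = comb(r, k) * comb(m - r, n - k)
    assert ordered == factorial(n) * unordered   # the identity in part (c)
    print(k, unordered)
```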
m = 50
,
r = 10
,
Cards

Recall that a standard card deck can be modeled by the Cartesian product set

D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠}   (2.2.8)

where the first coordinate encodes the denomination or kind (ace, 2–10, jack, queen, king) and where the second coordinate encodes the suit (clubs, diamonds, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q♡ rather than (q, ♡) for the queen of hearts). Most card games involve sampling without replacement from the deck D, which plays the role of the population. Thus, the basic card experiment consists of dealing n cards from a standard deck without replacement; in this special context, the sample of cards is often referred to as a hand. Just as in the general sampling model, if we record the ordered hand X = (X₁, X₂, …, Xₙ), then the unordered hand W = {X₁, X₂, …, Xₙ} is a random variable (that is, a function of X). On the other hand, if we just record the unordered hand W in the first place, then we cannot recover the ordered hand. Finally, recall that n = 5 is the poker experiment and n = 13 is the bridge experiment. The game of poker is treated in more detail in the chapter on Games of Chance.
n
Suppose that a single card is dealt from a standard deck. Let Q denote the event that the card is a queen and H the event that the card is a heart. Give each of the following events in list form: 1. Q 2. H 3. Q ∪ H 4. Q ∩ H 5. Q ∖ H Answer
In the card experiment, set n = 1. Run the experiment 100 times and count the number of times each event in the previous exercise occurs.
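A minimal sketch of the card experiment, with the deck as the Cartesian product above (added here; not part of the original text):

```python
import random

deck = [(d, s) for d in list(range(1, 11)) + ['j', 'q', 'k']
        for s in '♣♢♡♠']

hand = random.sample(deck, 5)          # ordered hand, without replacement
Q = [c for c in hand if c[0] == 'q']   # queens in the hand
H = [c for c in hand if c[1] == '♡']   # hearts in the hand
print(hand, len(Q), len(H))
```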
Suppose that two cards are dealt from a standard deck and the sequence of cards recorded. Let S denote the set of outcomes, and let Qᵢ denote the event that the ith card is a queen and Hᵢ the event that the ith card is a heart for i ∈ {1, 2}. Find the number of outcomes in each of the following events:
1. S
2. H₁
3. H₂
4. H₁ ∩ H₂
5. Q₁ ∩ H₁
6. Q₁ ∩ H₂
7. H₁ ∪ H₂
Answer

Consider the general card experiment in which n cards are dealt from a standard deck, and the ordered hand X is recorded.
1. Give the cardinality of S, the set of values of the ordered hand X.
2. Give the cardinality of T, the set of values of the unordered hand W.
3. How many ordered hands correspond to a given unordered hand?
4. Explicitly compute the numbers in (a) and (b) when n = 5 (poker).
5. Explicitly compute the numbers in (a) and (b) when n = 13 (bridge).
Answer

Consider the bridge experiment of dealing 13 cards from a deck and recording the unordered hand. In the most common point counting system, an ace is worth 4 points, a king 3 points, a queen 2 points, and a jack 1 point. The other cards are worth 0 points. Let S denote the set of outcomes of the experiment and V the point value of the hand.
1. Find the set of possible values of V.
2. Find the cardinality of the event {V = 0} as a subset of S.
Answer

In the card experiment, set n = 13 and run the experiment 100 times. For each run, compute the value of the random variable V in the previous exercise.

Consider the poker experiment of dealing 5 cards from a deck. Find the cardinality of each of the events below, as a subset of the set of unordered hands.
1. A: the event that the hand is a full house (3 cards of one kind and 2 of another kind).
2. B: the event that the hand has 4 of a kind (4 cards of one kind and 1 of another kind).
3. C: the event that all cards in the hand are in the same suit (the hand is a flush or a straight flush).
Answer

Run the poker experiment 1000 times. Note the number of times that the events A, B, and C in the previous exercise occurred.

Consider the bridge experiment of dealing 13 cards from a standard deck. Let S denote the set of unordered hands, Y the number of hearts in the hand, and Z the number of queens in the hand.
1. Find the cardinality of the event {Y = y} as a subset of S for each y ∈ {0, 1, … , 13}. 2. Find the cardinality of the event {Z = z} as a subset of S for each z ∈ {0, 1, 2, 3, 4}. Answer
Geometric Models

In the experiments that we have considered so far, the sample spaces have all been discrete (so that the set of outcomes is finite or countably infinite). In this subsection, we consider Euclidean sample spaces where the set of outcomes S is continuous in a sense that we will make clear later. The experiments we consider are sometimes referred to as geometric models because they involve selecting a point at random from a Euclidean set.

We first consider Buffon's coin experiment, which consists of tossing a coin with radius r ≤ 1/2 randomly on a floor covered with square tiles of side length 1. The coordinates (X, Y) of the center of the coin are recorded relative to axes through the center of the square in which the coin lands. Buffon's experiments are studied in more detail in the chapter on Geometric Models and are named for comte de Buffon.
Figure 2.2.3 : Buffon's
coin experiment
In Buffon's coin experiment, let S denote the set of outcomes, A the event that the coin does not touch the sides of the square, and let Z denote the distance form the center of the coin to the center of the square. 1. Describe S as a Cartesian product. 2. Describe A as a subset of S . 3. Describe A as a subset of S . 4. Express Z as a function on S . 5. Express the event {X < Y } as a subset of S . 6. Express the event {Z ≤ } as a subset of S . c
1 2
Answer Run Buffon's coin experiment 100 times with random variable Z .
r = 0.2
. For each run, note whether event
A
occurs and compute the value of
A point (X, Y ) is chosen at random in the circular region of radius 1 in R centered at the origin. Let S denote the set of outcomes. Let A denote the event that the point is in the inscribed square region centered at the origin, with sides parallel to the coordinate axes. Let B denote the event that the point is in the inscribed square with vertices (±1, 0), (0, ±1). 2
1. Describe S mathematically and sketch the set. 2. Describe A mathematically and sketch the set. 3. Describe B mathematically and sketch the set. 4. Sketch A ∪ B 5. Sketch A ∩ B 6. Sketch A ∩ B c
Answer
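A sketch of this geometric model (added here; not part of the original text), using rejection sampling for the uniform point in the disk:

```python
import random

def random_point_in_disk():
    """Rejection sampling: a uniform point in the unit disk."""
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

x, y = random_point_in_disk()
in_A = max(abs(x), abs(y)) <= 1 / (2 ** 0.5)   # axis-aligned inscribed square
in_B = abs(x) + abs(y) <= 1                    # square with vertices (±1,0), (0,±1)
print((x, y), in_A, in_B)
```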
Reliability

In the simple model of structural reliability, a system is composed of n components, each of which is either working or failed. The state of component i is an indicator random variable Xᵢ, where 1 means working and 0 means failure. Thus, X = (X₁, X₂, …, Xₙ) is a vector of indicator random variables that specifies the states of all of the components, and therefore the set of outcomes of the experiment is S = {0, 1}ⁿ. The system as a whole is also either working or failed, depending only on the states of the components and how the components are connected together. Thus, the state of the system is also an indicator random variable and is a function of X. The state of the system (working or failed) as a function of the states of the components is the structure function.
A series system is working if and only if each component is working. The state of the system is U = X1 X2 ⋯ Xn = min { X1 , X2 , … , Xn }
(2.2.9)
A parallel system is working if and only if at least one component is working. The state of the system is V = 1 − (1 − X1 ) (1 − X2 ) ⋯ (1 − Xn ) = max { X1 , X2 , … , Xn }
(2.2.10)
More generally, a k out of n system is working if and only if at least k of the n components are working. Note that a parallel system is a 1 out of n system and a series system is an n out of n system. A k out of 2k system is a majority rules system.

The state of the k out of n system is U_{n,k} = 1(∑_{i=1}^n Xᵢ ≥ k). The structure function can also be expressed as a polynomial in the variables.

Explicitly give the state of the k out of 3 system, as a polynomial function of the component states (X₁, X₂, X₃), for each k ∈ {1, 2, 3}.
Answer In some cases, the system can be represented as a graph or network. The edges represent the components and the vertices the connections between the components. The system functions if and only if there is a working path between two designated vertices, which we will denote by a and b . Find the state of the Wheatstone bridge network shown below, as a function of the component states. The network is named for Charles Wheatstone. Answer
Figure 2.2.4 : The Wheatstone bridge network
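A sketch of the bridge network's structure function (added here; not in the original). The component numbering is an assumption, since the figure is not reproduced: components 1 and 2 are the left edges, 4 and 5 the right edges, and 3 the crossing bridge edge; the system works iff at least one working path joins a to b.

```python
def bridge_state(x1, x2, x3, x4, x5):
    """State of a Wheatstone-style bridge network (numbering assumed as
    described above). Each path works iff all of its edges work; the
    system works iff some path works."""
    paths = [x1 * x4, x2 * x5, x1 * x3 * x5, x2 * x3 * x4]
    return max(paths)

print(bridge_state(1, 0, 1, 0, 1))   # path 1-3-5 works, so the system works
```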
Not every function u : {0, 1}ⁿ → {0, 1} makes sense as a structure function. Explain why the following properties might be desirable:
1. u(0, 0, …, 0) = 0 and u(1, 1, …, 1) = 1.
2. u is an increasing function, where {0, 1} is given the ordinary order and {0, 1}ⁿ the corresponding product order.
3. For each i ∈ {1, 2, …, n}, there exist x and y in {0, 1}ⁿ all of whose coordinates agree, except xᵢ = 0 and yᵢ = 1, and u(x) = 0 while u(y) = 1.
Answer The model just discussed is a static model. We can extend it to a dynamic model by assuming that component i is initially working, but has a random time to failure T , taking values in [0, ∞), for each i ∈ {1, 2, … , n}. Thus, the basic outcome of the experiment is the random vector of failure times (T , T , … , T ), and so the set of outcomes is [0, ∞) . i
n
1
2
n
Consider the dynamic reliability model for a system with structure function u (valid in the sense of the previous exercise). 1. The state of component i at time t ≥ 0 is X (t) = 1 (T > t) . 2. The state of the system at time t is X(t) = s [X (t), X (t), … , X 3. The time to failure of the system is T = min{t ≥ 0 : X(t) = 0} . i
i
1
n (t)]
2
.
Suppose that we have two devices and that we record (X, Y ), where X is the failure time of device 1 and Y is the failure time of device 2. Both variables take values in the interval [0, ∞), where the units are in hundreds of hours. Sketch each of the following events: 1. The set of outcomes S 2. {X < Y } 3. {X + Y > 2}
2.2.11
https://stats.libretexts.org/@go/page/10130
Answer
Genetics

Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this section.

Recall first that the ABO blood type in humans is determined by three alleles: a, b, and o. Furthermore, o is recessive and a and b are co-dominant. Suppose that a person is chosen at random and his genotype is recorded. Give each of the following in list form.
1. The set of outcomes S
2. The event that the person is type A
3. The event that the person is type B
4. The event that the person is type AB
5. The event that the person is type O
Answer

Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and that g is dominant. Suppose that n (distinct) pea plants are collected and the sequence of pod color genotypes is recorded.
1. Give the set of outcomes S in Cartesian product form and find #(S).
2. Let N denote the number of plants with green pods. Find #(N = k) (as a subset of S) for each k ∈ {0, 1, …, n}.
Answer

Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele and d the defective allele for the gene linked to the disorder. Recall that d is recessive for women.
Suppose that n women are sampled and the sequence of genotypes is recorded.
1. Give the set of outcomes S in Cartesian product form and find #(S).
2. Let N denote the number of women who are completely healthy (genotype hh). Find #(N = k) (as a subset of S) for each k ∈ {0, 1, …, n}.
Answer
Radioactive Emissions
The emission of elementary particles from a sample of radioactive material occurs in a random way. Suppose that the time of emission of the ith particle is a random variable T_i taking values in (0, ∞). If we measure these arrival times, then the basic outcome vector is (T_1, T_2, …) and so the set of outcomes is S = {(t_1, t_2, …) : 0 < t_1 < t_2 < ⋯}.
Run the simulation of the gamma experiment in single-step mode for different values of the parameters. Observe the arrival times. Now let N_t denote the number of emissions in the interval (0, t]. Then
1. N_t = max{n ∈ N_+ : T_n ≤ t}.
2. N_t ≥ n if and only if T_n ≤ t.
Run the simulation of the Poisson experiment in single-step mode for different parameter values. Observe the arrivals in the specified time interval.
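As a rough numerical illustration of the two properties above, the following sketch (ours) simulates arrival times with unit-rate exponential interarrival times, a standard assumption behind the Poisson experiment referenced here, and checks that N_t ≥ n exactly when T_n ≤ t.

```python
import random

def arrival_times(n, rate=1.0):
    """First n arrival times: cumulative sums of exponential interarrivals."""
    times, total = [], 0.0
    for _ in range(n):
        total += random.expovariate(rate)
        times.append(total)
    return times

T = arrival_times(20)
t = 5.0
N_t = sum(1 for s in T if s <= t)        # number of emissions in (0, t]
for n in range(1, len(T) + 1):
    assert (N_t >= n) == (T[n - 1] <= t)  # property 2 holds for every n
```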
Statistical Experiments In the basic cicada experiment, a cicada in the Middle Tennessee area is captured and the following measurements recorded: body weight (in grams), wing length, wing width, and body length (in millimeters), species type, and gender. The cicada data set gives the results of 104 repetitions of this experiment.
1. Define the set of outcomes S for the basic experiment. 2. Let F be the event that a cicada is female. Describe F as a subset of S . Determine whether F occurs for each cicada in the data set. 3. Let V denote the ratio of wing length to wing width. Compute V for each cicada. 4. Give the set of outcomes for the compound experiment that consists of 104 repetitions of the basic experiment. Answer In the basic M&M experiment, a bag of M&Ms (of a specified size) is purchased and the following measurements recorded: the number of red, green, blue, yellow, orange, and brown candies, and the net weight (in grams). The M&M data set gives the results of 30 repetitions of this experiment. 1. Define the set of outcomes S for the basic experiment. 2. Let A be the event that a bag contains at least 57 candies. Describe A as a subset of S . 3. Determine whether A occurs for each bag in the data set. 4. Let N denote the total number of candies. Compute N for each bag in the data set. 5. Give the set of outcomes for the compound experiment that consists of 30 repetitions of the basic experiment. Answer This page titled 2.2: Events and Random Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.3: Probability Measures
This section contains the final and most important ingredient in the basic model of a random experiment. If you are a new student of probability, you may want to skip the technical details.
Definitions and Interpretations Suppose that we have a random experiment with sample space (S, S ), so that S is the set of outcomes of the experiment and S is the collection of events. When we run the experiment, a given event A either occurs or does not occur, depending on whether the outcome of the experiment is in A or not. Intuitively, the probability of an event is a measure of how likely the event is to occur when we run the experiment. Mathematically, probability is a function on the collection of events that satisfies certain axioms.
Definition
A probability measure (or probability distribution) P on the sample space (S, S) is a real-valued function defined on the collection of events S that satisfies the following axioms:
1. P(A) ≥ 0 for every event A.
2. P(S) = 1.
3. If {A_i : i ∈ I} is a countable, pairwise disjoint collection of events then
P(⋃_{i∈I} A_i) = ∑_{i∈I} P(A_i)   (2.3.1)
Details
Axiom (c) is known as countable additivity, and states that the probability of a union of a finite or countably infinite collection of disjoint events is the sum of the corresponding probabilities. The axioms are known as the Kolmogorov axioms, in honor of Andrei Kolmogorov who was the first to formalize probability theory in an axiomatic way. More informally, we say that P is a probability measure (or distribution) on S, the collection of events S usually being understood. Axioms (a) and (b) are really just a matter of convention; we choose to measure the probability of an event with a number between 0 and 1 (as opposed, say, to a number between −5 and 7). Axiom (c) however, is fundamental and inescapable. It is required for probability for precisely the same reason that it is required for other measures of the "size" of a set, such as cardinality for finite sets, length for subsets of R, area for subsets of R^2, and volume for subsets of R^3. In all these cases, the size of a set that is composed of countably many disjoint pieces is the sum of the sizes of the pieces.
Figure 2.3.1 :The union of 4 disjoint events
On the other hand, uncountable additivity (the extension of axiom (c) to an uncountable index set I) is unreasonable for probability, just as it is for other measures. For example, an interval of positive length in R is a union of uncountably many points, each of which has length 0. We have now defined the three essential ingredients for the model of a random experiment:
A probability space (S, S, P) consists of
1. A set of outcomes S
2. A collection of events S
3. A probability measure P on the sample space (S, S)
Details
The Law of Large Numbers
Intuitively, the probability of an event is supposed to measure the long-term relative frequency of the event; in fact, this concept was taken as the definition of probability by Richard von Mises. Here are the relevant definitions:
Suppose that the experiment is repeated indefinitely, and that A is an event. For n ∈ N_+,
1. Let N_n(A) denote the number of times that A occurred. This is the frequency of A in the first n runs.
2. Let P_n(A) = N_n(A)/n. This is the relative frequency or empirical probability of A in the first n runs.
Note that repeating the original experiment indefinitely creates a new, compound experiment, and that N_n(A) and P_n(A) are random variables for the new experiment. In particular, the values of these variables are uncertain until the experiment is run n times. The basic idea is that if we have chosen the correct probability measure for the experiment, then in some sense we expect that the relative frequency of an event should converge to the probability of the event. That is,
P_n(A) → P(A) as n → ∞,  A ∈ S   (2.3.2)
regardless of the uncertainty of the relative frequencies on the left. The precise statement of this is the law of large numbers or law of averages, one of the fundamental theorems in probability. To emphasize the point, note that in general there will be lots of possible probability measures for an experiment, in the sense of the axioms. However, only the probability measure that models the experiment correctly will satisfy the law of large numbers. Given the data from n runs of the experiment, the empirical probability function P_n is a probability measure on S.
Proof
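The law of large numbers is easy to watch numerically. The following sketch (ours, with an arbitrarily chosen event) estimates P_n(A) for the event that a fair die shows at most 2, where P(A) = 1/3.

```python
import random

random.seed(2023)
for n in (100, 10_000, 1_000_000):
    hits = sum(random.randint(1, 6) <= 2 for _ in range(n))
    print(n, hits / n)   # the empirical probability P_n(A) drifts toward 1/3
```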
The Distribution of a Random Variable Suppose now that X is a random variable for the experiment, taking values in a set T . Recall that mathematically, X is a function from S into T , and {X ∈ B} denotes the event {s ∈ S : X(s) ∈ B} for B ⊆ T . Intuitively, X is a variable of interest for the experiment, and every meaningful statement about X defines an event. The function B ↦ P(X ∈ B) for B ⊆ T defines a probability measure on T . Proof
Figure 2.3.2 : A set B ∈ T corresponds to the event {X ∈ B} ∈ S
The probability measure in (5) is called the probability distribution of X, so we have all of the ingredients for a new probability space.
A random variable X with values in T defines a new probability space:
1. T is the set of outcomes.
2. Subsets of T are the events.
3. The probability distribution of X is the probability measure on T.
This probability space corresponds to the new random experiment in which the outcome is X. Moreover, recall that the outcome of the experiment itself can be thought of as a random variable. Specifically, if we let T = S and let X be the identity function on S, so that X(s) = s for s ∈ S, then X is a random variable with values in S and P(X ∈ A) = P(A) for each event A. Thus, every probability measure can be thought of as the distribution of a random variable.
Constructions
Measures
How can we construct probability measures? As noted briefly above, there are other measures of the "size" of sets; in many cases, these can be converted into probability measures. First, a positive measure μ on the sample space (S, S) is a real-valued function defined on S that satisfies axioms (a) and (c) in (1), and then (S, S, μ) is a measure space. In general, μ(A) is allowed to be infinite. However, if μ(S) is positive and finite (so that μ is a finite positive measure), then μ can easily be re-scaled into a probability measure.
If μ is a positive measure on S with 0 < μ(S) < ∞ then P defined below is a probability measure.
P(A) = μ(A)/μ(S),  A ∈ S   (2.3.3)
Proof
In this context, μ(S) is called the normalizing constant. In the next two subsections, we consider some very important special cases.
Discrete Distributions
In this discussion, we assume that the sample space (S, S) is discrete. Recall that this means that the set of outcomes S is countable and that S = P(S) is the collection of all subsets of S, so that every subset is an event. The standard measure on a discrete space is counting measure #, so that #(A) is the number of elements in A for A ⊆ S. When S is finite, the probability measure corresponding to counting measure as constructed above is particularly important in combinatorial and sampling experiments.
Suppose that S is a finite, nonempty set. The discrete uniform distribution on S is given by
P(A) = #(A)/#(S),  A ⊆ S   (2.3.5)
The underlying model is referred to as the classical probability model, because historically the very first problems in probability (involving coins and dice) fit this model. In the general discrete case, if P is a probability measure on S, then since S is countable, it follows from countable additivity that P is completely determined by its values on the singleton events. Specifically, if we define f(x) = P({x}) for x ∈ S, then P(A) = ∑_{x∈A} f(x) for every A ⊆ S. By axiom (a), f(x) ≥ 0 for x ∈ S and by axiom (b), ∑_{x∈S} f(x) = 1. Conversely, we can give a general construction for defining a probability measure on a discrete space.
Suppose that g : S → [0, ∞). Then μ defined by μ(A) = ∑_{x∈A} g(x) for A ⊆ S is a positive measure on S. If 0 < μ(S) < ∞ then P defined as follows is a probability measure on S.
P(A) = μ(A)/μ(S) = ∑_{x∈A} g(x) / ∑_{x∈S} g(x),  A ⊆ S   (2.3.6)
Proof
In the context of our previous remarks, f(x) = g(x)/μ(S) = g(x)/∑_{y∈S} g(y) for x ∈ S. Distributions of this type are said to be discrete. Discrete distributions are studied in detail in the chapter on Distributions.
If S is finite and g is a constant function, then the probability measure P associated with g is the discrete uniform distribution on S . Proof
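As a small illustration of the construction above, the following Python sketch (ours; the weight function g is hypothetical) normalizes g by μ(S) to obtain a probability measure on a three-point space.

```python
g = {"a": 2.0, "b": 1.0, "c": 1.0}          # hypothetical weights on S
mu_S = sum(g.values())                       # normalizing constant mu(S)
f = {x: gx / mu_S for x, gx in g.items()}    # f(x) = g(x)/mu(S)

def P(A):
    """Probability of an event A (a subset of S) under the normalized measure."""
    return sum(f[x] for x in A)

print(P({"a", "b"}))   # 0.75
print(P(set(g)))       # 1.0, as axiom (b) requires
```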
Continuous Distributions
The probability distributions that we will construct next are continuous distributions on R^n for n ∈ N_+ and require some calculus.
For n ∈ N_+, the standard measure λ_n on R^n is given by
λ_n(A) = ∫_A 1 dx,  A ⊆ R^n   (2.3.8)
In particular, λ_1(A) is the length of A ⊆ R, λ_2(A) is the area of A ⊆ R^2, and λ_3(A) is the volume of A ⊆ R^3.
Details
When n > 3, λ_n(A) is sometimes called the n-dimensional volume of A ⊆ R^n. The probability measure associated with a set with positive, finite n-dimensional volume is particularly important.
Suppose that S ⊆ R^n with 0 < λ_n(S) < ∞.
2.4: Conditional Probability
Suppose that A and B are events with P(B) > 0.
1. If B ⊆ A then P(A ∣ B) = 1.
2. If A ⊆ B then P(A ∣ B) = P(A)/P(B).
3. If A and B are disjoint then P(A ∣ B) = 0.
Proof
Parts (a) and (c) certainly make sense. Suppose that we know that event B has occurred. If B ⊆ A then A becomes a certain event. If A ∩ B = ∅ then A becomes an impossible event. A conditional probability can be computed relative to a probability measure that is itself a conditional probability measure. The following result is a consistency condition.
Suppose that A, B, and C are events with P(B ∩ C) > 0. The probability of A given B, relative to P(⋅ ∣ C), is the same as the probability of A given B and C (relative to P). That is,
P(A ∩ B ∣ C)/P(B ∣ C) = P(A ∣ B ∩ C)   (2.4.10)
Proof
Correlation
Our next discussion concerns an important concept that deals with how two events are related, in a probabilistic sense.
Suppose that A and B are events with P(A) > 0 and P(B) > 0.
1. P(A ∣ B) > P(A) if and only if P(B ∣ A) > P(B) if and only if P(A ∩ B) > P(A)P(B). In this case, A and B are positively correlated.
2. P(A ∣ B) < P(A) if and only if P(B ∣ A) < P(B) if and only if P(A ∩ B) < P(A)P(B). In this case, A and B are negatively correlated.
3. P(A ∣ B) = P(A) if and only if P(B ∣ A) = P(B) if and only if P(A ∩ B) = P(A)P(B). In this case, A and B are uncorrelated or independent.
Proof
Intuitively, if A and B are positively correlated, then the occurrence of either event means that the other event is more likely. If A and B are negatively correlated, then the occurrence of either event means that the other event is less likely. If A and B are uncorrelated, then the occurrence of either event does not change the probability of the other event. Independence is a fundamental concept that can be extended to more than two events and to random variables; these generalizations are studied in the next section on Independence. A much more general version of correlation, for random variables, is explored in the section on Covariance and Correlation in the chapter on Expected Value.
Suppose that A and B are events. Note from (4) that if A ⊆ B or B ⊆ A then A and B are positively correlated. If A and B are disjoint then A and B are negatively correlated.
Suppose that A and B are events in a random experiment.
1. A and B have the same correlation (positive, negative, or zero) as A^c and B^c.
2. A and B have the opposite correlation as A and B^c (that is, positive-negative, negative-positive, or 0-0).
Proof
The Multiplication Rule
Sometimes conditional probabilities are known and can be used to find the probabilities of other events. Note first that if A and B are events with positive probability, then by the very definition of conditional probability,
P(A ∩ B) = P(A)P(B ∣ A) = P(B)P(A ∣ B)   (2.4.15)
The following generalization is known as the multiplication rule of probability. As usual, we assume that any event conditioned on has positive probability.
Suppose that (A_1, A_2, …, A_n) is a sequence of events. Then
P(A_1 ∩ A_2 ∩ ⋯ ∩ A_n) = P(A_1) P(A_2 ∣ A_1) P(A_3 ∣ A_1 ∩ A_2) ⋯ P(A_n ∣ A_1 ∩ A_2 ∩ ⋯ ∩ A_{n−1})   (2.4.16)
Proof
The multiplication rule is particularly useful for experiments that consist of dependent stages, where A_i is an event in stage i. Compare the multiplication rule of probability with the multiplication rule of combinatorics.
As with any other result, the multiplication rule can be applied to a conditional probability measure. In the context above, if E is another event, then
P(A_1 ∩ A_2 ∩ ⋯ ∩ A_n ∣ E) = P(A_1 ∣ E) P(A_2 ∣ A_1 ∩ E) P(A_3 ∣ A_1 ∩ A_2 ∩ E) ⋯ P(A_n ∣ A_1 ∩ A_2 ∩ ⋯ ∩ A_{n−1} ∩ E)   (2.4.17)
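For instance, the multiplication rule gives the probability of dealing three hearts in a row from a standard deck. The short sketch below (ours) chains the conditional probabilities with exact fractions.

```python
from fractions import Fraction

# P(H1 ∩ H2 ∩ H3) = P(H1) P(H2 | H1) P(H3 | H1 ∩ H2)
hearts, deck = 13, 52
p = Fraction(1)
for i in range(3):
    p *= Fraction(hearts - i, deck - i)   # one fewer heart and one fewer card each stage
print(p, float(p))                        # 11/850 ≈ 0.0129
```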
Conditioning and Bayes' Theorem
Suppose that A = {A_i : i ∈ I} is a countable collection of events that partition the sample space S, and that P(A_i) > 0 for each i ∈ I.
Figure 2.4.2 : A partition of S induces a partition of B .
The following theorem is known as the law of total probability. If B is an event then
P(B) = ∑_{i∈I} P(A_i) P(B ∣ A_i)   (2.4.18)
Proof
The following theorem is known as Bayes' Theorem, named after Thomas Bayes: If B is an event then
P(A_j ∣ B) = P(A_j) P(B ∣ A_j) / ∑_{i∈I} P(A_i) P(B ∣ A_i),  j ∈ I   (2.4.20)
Proof
These two theorems are most useful, of course, when we know P(A_i) and P(B ∣ A_i) for each i ∈ I. When we compute the probability P(B) by the law of total probability, we say that we are conditioning on the partition A. Note that we can think of the sum as a weighted average of the conditional probabilities P(B ∣ A_i) over i ∈ I, where P(A_i), i ∈ I are the weight factors. In the context of Bayes' theorem, P(A_j) is the prior probability of A_j and P(A_j ∣ B) is the posterior probability of A_j for j ∈ I. We will study more general versions of conditioning and Bayes' theorem in the section on Discrete Distributions in the chapter on Distributions, and again in the section on Conditional Expected Value in the chapter on Expected Value.
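The two theorems translate directly into code. The sketch below (ours, with made-up priors and likelihoods for a three-event partition) computes P(B) by the law of total probability and the posterior probabilities by Bayes' theorem.

```python
prior = [0.5, 0.3, 0.2]        # P(A_i) for a partition A_1, A_2, A_3 of S
likelihood = [0.9, 0.5, 0.1]   # P(B | A_i)

P_B = sum(p * l for p, l in zip(prior, likelihood))           # law of total probability
posterior = [p * l / P_B for p, l in zip(prior, likelihood)]  # Bayes' theorem

print(P_B)        # 0.62
print(posterior)  # posterior probabilities P(A_j | B); they sum to 1
```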
Once again, the law of total probability and Bayes' theorem can be applied to a conditional probability measure. So, if E is another event with P(A_i ∩ E) > 0 for i ∈ I then
P(B ∣ E) = ∑_{i∈I} P(A_i ∣ E) P(B ∣ A_i ∩ E)   (2.4.21)
P(A_j ∣ B ∩ E) = P(A_j ∣ E) P(B ∣ A_j ∩ E) / ∑_{i∈I} P(A_i ∣ E) P(B ∣ A_i ∩ E),  j ∈ I   (2.4.22)
Examples and Applications
Basic Rules
Suppose that A and B are events in an experiment with P(A) = 1/3, P(B) = 1/4, P(A ∩ B) = 1/10. Find each of the following:
1. P(A ∣ B)
2. P(B ∣ A)
3. P(A^c ∣ B)
4. P(B^c ∣ A)
5. P(A^c ∣ B^c)
Answer
Suppose that A, B, and C are events in a random experiment with P(A ∣ C) = 1/2, P(B ∣ C) = 1/3, and P(A ∩ B ∣ C) = 1/4. Find each of the following:
1. P(B ∖ A ∣ C)
2. P(A ∪ B ∣ C)
3. P(A^c ∩ B^c ∣ C)
4. P(A^c ∪ B^c ∣ C)
5. P(A^c ∪ B ∣ C)
6. P(A ∣ B ∩ C)
Answer Suppose that A and B are events in a random experiment with P(A) =
1 2
, P(B) =
1 3
, and P(A ∣ B) =
3 4
.
1. Find P(A ∩ B) 2. Find P(A ∪ B) 3. Find P(B ∪ A ) 4. Find P(B ∣ A) 5. Are A and B positively correlated, negatively correlated, or independent? c
Answer Open the conditional probability experiment. 1. Given P(A) , P(B) , and P(A ∩ B) , in the table, verify all of the other probabilities in the table. 2. Run the experiment 1000 times and compare the probabilities with the relative frequencies.
Simple Populations
In a certain population, 30% of the persons smoke cigarettes and 8% have COPD (Chronic Obstructive Pulmonary Disease). Moreover, 12% of the persons who smoke have COPD.
1. What percentage of the population smoke and have COPD?
2. What percentage of the population with COPD also smoke?
3. Are smoking and COPD positively correlated, negatively correlated, or independent?
Answer
A company has 200 employees: 120 are women and 80 are men. Of the 120 female employees, 30 are classified as managers, while 20 of the 80 male employees are managers. Suppose that an employee is chosen at random.
1. Find the probability that the employee is female.
2. Find the probability that the employee is a manager.
3. Find the conditional probability that the employee is a manager given that the employee is female.
4. Find the conditional probability that the employee is female given that the employee is a manager.
5. Are the events female and manager positively correlated, negatively correlated, or independent?
Answer
Dice and Coins
Consider the experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores X = (X_1, X_2). Let Y denote the sum of the scores. For each of the following pairs of events, find the probability of each event and the conditional probability of each event given the other. Determine whether the events are positively correlated, negatively correlated, or independent.
1. {X_1 = 3}, {Y = 5}
2. {X_1 = 3}, {Y = 7}
3. {X_1 = 2}, {Y = 5}
4. {X_1 = 3}, {X_1 = 2}
Answer
Note that positive correlation is not a transitive relation. From the previous exercise, for example, note that {X_1 = 3} and {Y = 5} are positively correlated, {Y = 5} and {X_1 = 2} are positively correlated, but {X_1 = 3} and {X_1 = 2} are negatively correlated (in fact, disjoint).
In the dice experiment, set n = 2. Run the experiment 1000 times. Compute the empirical conditional probabilities corresponding to the conditional probabilities in the last exercise.
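Since two fair dice have only 36 equally likely outcomes, the correlations in the exercise above can also be checked by brute-force enumeration, as in this sketch (ours).

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # the 36 equally likely outcomes

def prob(event):
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A = lambda w: w[0] == 3              # {X1 = 3}
B = lambda w: w[0] + w[1] == 5       # {Y = 5}
pAB = prob(lambda w: A(w) and B(w))
print(prob(A) * prob(B), pAB)        # 0.0185... < 0.0277...: positively correlated
```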
Consider again the experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores X = (X_1, X_2). Let Y denote the sum of the scores, U the minimum score, and V the maximum score.
1. Find P(U = u ∣ V = 4) for the appropriate values of u.
2. Find P(Y = y ∣ V = 4) for the appropriate values of y.
3. Find P(V = v ∣ Y = 8) for the appropriate values of v.
4. Find P(U = u ∣ Y = 8) for the appropriate values of u.
5. Find P[(X_1, X_2) = (x_1, x_2) ∣ Y = 8] for the appropriate values of (x_1, x_2).
Answer In the die-coin experiment, a standard, fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let N denote the die score and H the event that all coin tosses result in heads. 1. Find P(H ). 2. Find P(N = n ∣ H ) for n ∈ {1, 2, 3, 4, 5, 6}. 3. Compare the results in (b) with P(N = n) for n ∈ {1, 2, 3, 4, 5, 6}. In each case, note whether the events H and {N are positively correlated, negatively correlated, or independent.
= n}
Answer
Run the die-coin experiment 1000 times. Let H and N be as defined in the previous exercise.
1. Compute the empirical probability of H. Compare with the true probability in the previous exercise.
2. Compute the empirical probability of {N = n} given H, for n ∈ {1, 2, 3, 4, 5, 6}. Compare with the true probabilities in the previous exercise.
Suppose that a bag contains 12 coins: 5 are fair, 4 are biased with probability of heads 1/3; and 3 are two-headed. A coin is chosen at random from the bag and tossed.
1. Find the probability that the coin is heads.
2. Given that the coin is heads, find the conditional probability of each coin type.
Answer
Compare the die-coin experiment and the bag of coins experiment. In the die-coin experiment, we toss a coin with a fixed probability of heads a random number of times. In the bag of coins experiment, we effectively toss a coin with a random probability of heads a fixed number of times. The random experiment of tossing a coin with a fixed probability of heads p a fixed number of times n is known as the binomial experiment with parameters n and p. This is a very basic and important experiment that is studied in more detail in the section on the binomial distribution in the chapter on Bernoulli Trials. Thus, the die-coin and bag of coins experiments can be thought of as modifications of the binomial experiment in which a parameter has been randomized. In general, interesting new random experiments can often be constructed by randomizing one or more parameters in another random experiment.
In the coin-die experiment, a fair coin is tossed. If the coin lands tails, a fair die is rolled. If the coin lands heads, an ace-six flat die is tossed (faces 1 and 6 have probability 1/4 each, while faces 2, 3, 4, and 5 have probability 1/8 each). Let H denote the event that the coin lands heads, and let Y denote the score when the chosen die is tossed.
1. Find P(Y = y) for y ∈ {1, 2, 3, 4, 5, 6}.
2. Find P(H ∣ Y = y) for y ∈ {1, 2, 3, 4, 5, 6}.
3. Compare each probability in part (b) with P(H). In each case, note whether the events H and {Y = y} are positively correlated, negatively correlated, or independent.
Answer Run the coin-die experiment 1000 times. Let H and Y be as defined in the previous exercise. 1. Compute the empirical probability of {Y = y} , for each y , and compare with the true probability in the previous exercise 2. Compute the empirical probability of H given {Y = y} for each y , and compare with the true probability in the previous exercise.
Cards
Consider the card experiment that consists of dealing 2 cards from a standard deck and recording the sequence of cards dealt. For i ∈ {1, 2}, let Q_i be the event that card i is a queen and H_i the event that card i is a heart. For each of the following pairs of events, compute the probability of each event, and the conditional probability of each event given the other. Determine whether the events are positively correlated, negatively correlated, or independent.
1. Q_1, H_1
2. Q_1, Q_2
3. Q_2, H_2
4. Q_1, H_2
Answer
In the card experiment, set n = 2. Run the experiment 500 times. Compute the conditional relative frequencies corresponding to the conditional probabilities in the last exercise.
Consider the card experiment that consists of dealing 3 cards from a standard deck and recording the sequence of cards dealt. Find the probability of the following events:
1. All three cards are hearts.
2. The first two cards are hearts and the third is a spade.
3. The first and third cards are hearts and the second is a spade.
Proof
In the card experiment, set n = 3 and run the simulation 1000 times. Compute the empirical probability of each event in the previous exercise and compare with the true probability.
Bivariate Uniform Distributions
Recall that Buffon's coin experiment consists of tossing a coin with radius r ≤ 1/2 randomly on a floor covered with square tiles of side length 1. The coordinates (X, Y) of the center of the coin are recorded relative to axes through the center of the square, parallel to the sides. Since the coin is dropped randomly, the basic modeling assumption is that (X, Y) is uniformly distributed on the square [−1/2, 1/2]^2.
Figure 2.4.3 : Buffon's coin experiment
In Buffon's coin experiment,
1. Find P(Y > 0 ∣ X < Y)
2. Find the conditional distribution of (X, Y) given that the coin does not touch the sides of the square.
Answer
Run Buffon's coin experiment 500 times. Compute the empirical probability that Y > 0 given that X < Y. Compare with the probability in the last exercise.
Suppose that T denotes the lifetime of a light bulb (in 1000 hour units).
1. Find P(T > 3)
2. Find P(T > 5 ∣ T > 2)
Answer
Suppose again that T denotes the lifetime of a light bulb (in 1000 hour units), but that T is uniformly distributed on the interval [0, 10].
1. Find P(T > 3)
2. Find P(T > 5 ∣ T > 2)
Answer
Genetics
Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this section. Recall first that the ABO blood type in humans is determined by three alleles: a, b, and o. Furthermore, a and b are co-dominant and o is recessive. Suppose that the probability distribution for the set of blood genotypes in a certain population is given in the following table:

Genotype     aa     ab     ao     bb     bo     oo
Probability  0.050  0.038  0.310  0.007  0.116  0.479
Suppose that a person is chosen at random from the population. Let A , B , AB, and O be the events that the person is type A , type B , type AB, and type O respectively. Let H be the event that the person is homozygous, and let D denote the event that the person has an o allele. Find each of the following: 1. P(A) , P(B) , P(AB), P(O) , P(H ), P(D) 2. P (A ∩ H ) , P (A ∣ H ) , P (H ∣ A) . Are the events A and H positively correlated, negatively correlated, or independent?
3. P(B ∩ H), P(B ∣ H), P(H ∣ B). Are the events B and H positively correlated, negatively correlated, or independent?
4. P(A ∩ D), P(A ∣ D), P(D ∣ A). Are the events A and D positively correlated, negatively correlated, or independent?
5. P(B ∩ D), P(B ∣ D), P(D ∣ B). Are the events B and D positively correlated, negatively correlated, or independent?
6. P(H ∩ D), P(H ∣ D), P(D ∣ H). Are the events H and D positively correlated, negatively correlated, or independent?
Answer
Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and that g is dominant and y recessive. Suppose that a green-pod plant and a yellow-pod plant are bred together. Suppose further that the green-pod plant has a 1/4 chance of carrying the recessive yellow-pod allele.
1. Find the probability that a child plant will have green pods.
2. Given that a child plant has green pods, find the updated probability that the green-pod parent has the recessive allele.
Answer
Suppose that two green-pod plants are bred together. Suppose further that with probability 1/3 neither plant has the recessive allele, with probability 1/2 one plant has the recessive allele, and with probability 1/6 both plants have the recessive allele.
1. Find the probability that a child plant has green pods.
2. Given that a child plant has green pods, find the updated probability that both parents have the recessive gene.
Answer
Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele and d the defective allele for the gene linked to the disorder. Recall that h is dominant and d recessive for women. Suppose that in a certain population, 50% are male and 50% are female. Moreover, suppose that 10% of males are color blind but only 1% of females are color blind.
1. Find the percentage of color blind persons in the population.
2. Find the percentage of color blind persons that are male.
Answer
Since color blindness is a sex-linked hereditary disorder, note that it's reasonable in the previous exercise that the probability that a female is color blind is the square of the probability that a male is color blind. If p is the probability of the defective allele on the X chromosome, then p is also the probability that a male will be color blind. But since the defective allele is recessive, a woman would need two copies of the defective allele to be color blind, and assuming independence, the probability of this event is p^2.
A man and a woman do not have a certain sex-linked hereditary disorder, but the woman has a 1/3 chance of being a carrier.
1. Find the probability that a son born to the couple will be normal. 2. Find the probability that a daughter born to the couple will be a carrier. 3. Given that a son born to the couple is normal, find the updated probability that the mother is a carrier. Answer
Urn Models
Urn 1 contains 4 red and 6 green balls while urn 2 contains 7 red and 3 green balls. An urn is chosen at random and then a ball is chosen at random from the selected urn.
1. Find the probability that the ball is green.
2. Given that the ball is green, find the conditional probability that urn 1 was selected.
Answer
Urn 1 contains 4 red and 6 green balls while urn 2 contains 6 red and 3 green balls. A ball is selected at random from urn 1 and transferred to urn 2. Then a ball is selected at random from urn 2.
1. Find the probability that the ball from urn 2 is green.
2. Given that the ball from urn 2 is green, find the conditional probability that the ball from urn 1 was green.
Answer
An urn initially contains 6 red and 4 green balls. A ball is chosen at random from the urn and its color is recorded. It is then replaced in the urn and 2 new balls of the same color are added to the urn. The process is repeated. Find the probability of each of the following events:
1. Balls 1 and 2 are red and ball 3 is green.
2. Balls 1 and 3 are red and ball 2 is green.
3. Ball 1 is green and balls 2 and 3 are red.
4. Ball 2 is red.
5. Ball 1 is red given that ball 2 is red.
Answer
Think about the results in the previous exercise. Note in particular that the answers to parts (a), (b), and (c) are the same, and that the probability that the second ball is red in part (d) is the same as the probability that the first ball is red. More generally, the probabilities of events do not depend on the order of the draws. For example, the probability of an event involving the first, second, and third draws is the same as the probability of the corresponding event involving the seventh, tenth and fifth draws. Technically, the sequence of events (R_1, R_2, …) is exchangeable. The random process described in this exercise is a special case of Pólya's urn scheme, named after George Pólya. We will study Pólya's urn in more detail in the chapter on Finite Sampling Models.
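A short simulation (ours) makes the exchangeability in Pólya's urn visible: the estimated probability that the second ball is red stays near 6/10, the same as for the first ball.

```python
import random

def polya_draws(red=6, green=4, add=2, n=3):
    """Draw n balls; each drawn ball is replaced along with `add` like-colored balls."""
    colors = []
    for _ in range(n):
        ball = "red" if random.random() < red / (red + green) else "green"
        colors.append(ball)
        if ball == "red":
            red += add
        else:
            green += add
    return colors

trials = 100_000
hits = sum(polya_draws()[1] == "red" for _ in range(trials))
print(hits / trials)   # close to 0.6, just like the first draw
```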
An urn initially contains 6 red and 4 green balls. A ball is chosen at random from the urn and its color is recorded. It is then replaced in the urn and two new balls of the other color are added to the urn. The process is repeated. Find the probability of each of the following events:
1. Balls 1 and 2 are red and ball 3 is green.
2. Balls 1 and 3 are red and ball 2 is green.
3. Ball 1 is green and balls 2 and 3 are red.
4. Ball 2 is red.
5. Ball 1 is red given that ball 2 is red.
Answer
Think about the results in the previous exercise, and compare with Pólya's urn. Note that the answers to parts (a), (b), and (c) are not all the same, and that the probability that the second ball is red in part (d) is not the same as the probability that the first ball is red. In short, the sequence of events (R_1, R_2, …) is not exchangeable.
Diagnostic Testing
Suppose that we have a random experiment with an event A of interest. When we run the experiment, of course, event A will either occur or not occur. However, suppose that we are not able to observe the occurrence or non-occurrence of A directly. Instead we have a diagnostic test designed to indicate the occurrence of event A; thus the test can be either positive for A or negative for A. The test also has an element of randomness, and in particular can be in error. Here are some typical examples of the type of situation we have in mind:
The event is that a person has a certain disease and the test is a blood test for the disease.
The event is that a woman is pregnant and the test is a home pregnancy test.
The event is that a person is lying and the test is a lie-detector test.
The event is that a device is defective and the test consists of a sensor reading.
The event is that a missile is in a certain region of airspace and the test consists of radar signals.
The event is that a person has committed a crime, and the test is a jury trial with evidence presented for and against the event.
Let T be the event that the test is positive for the occurrence of A. The conditional probability P(T ∣ A) is called the sensitivity of the test. The complementary probability
P(T^c ∣ A) = 1 − P(T ∣ A)   (2.4.24)
is the false negative probability. The conditional probability P(T^c ∣ A^c) is called the specificity of the test. The complementary probability
P(T ∣ A^c) = 1 − P(T^c ∣ A^c)   (2.4.25)
is the false positive probability. In many cases, the sensitivity and specificity of the test are known, as a result of the development of the test. However, the user of the test is interested in the opposite conditional probabilities, namely P(A ∣ T), the probability of the event of interest, given a positive test, and P(A^c ∣ T^c), the probability of the complementary event, given a negative test. Of course, if we know P(A ∣ T) then we also have P(A^c ∣ T) = 1 − P(A ∣ T), the probability of the complementary event given a positive test. Similarly, if we know P(A^c ∣ T^c) then we also have P(A ∣ T^c), the probability of the event given a negative test. Computing the probabilities of interest is simply a special case of Bayes' theorem.
The probability that the event occurs, given a positive test is
P(A ∣ T) = P(A)P(T ∣ A) / [P(A)P(T ∣ A) + P(A^c)P(T ∣ A^c)]   (2.4.26)
The probability that the event does not occur, given a negative test is
P(A^c ∣ T^c) = P(A^c)P(T^c ∣ A^c) / [P(A)P(T^c ∣ A) + P(A^c)P(T^c ∣ A^c)]   (2.4.27)
There is often a trade-off between sensitivity and specificity. An attempt to make a test more sensitive may result in the test being less specific, and an attempt to make a test more specific may result in the test being less sensitive. As an extreme example, consider the worthless test that always returns positive, no matter what the evidence. Then T = S so the test has sensitivity 1, but specificity 0. At the opposite extreme is the worthless test that always returns negative, no matter what the evidence. Then T = ∅ so the test has specificity 1 but sensitivity 0. In between these extremes are helpful tests that are actually based on evidence of some sort.
Suppose that the sensitivity a = P(T ∣ A) ∈ (0, 1) and the specificity b = P(T^c ∣ A^c) ∈ (0, 1) are fixed. Let p = P(A) denote the prior probability of the event A and P = P(A ∣ T) the posterior probability of A given a positive test.
P as a function of p is given by
P = ap / [(a + b − 1)p + (1 − b)],  p ∈ [0, 1]   (2.4.28)
1. P increases continuously from 0 to 1 as p increases from 0 to 1.
2. P is concave downward if a + b > 1. In this case A and T are positively correlated.
3. P is concave upward if a + b < 1. In this case A and T are negatively correlated.
4. P = p if a + b = 1. In this case, A and T are uncorrelated (independent).
Proof
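The posterior function is one line of code. This sketch (ours) evaluates it for the sensitivity and specificity used in the next exercise, making the dependence on the prior p plain.

```python
def posterior(p, a=0.99, b=0.95):
    """P(A | T) from prior p, sensitivity a, specificity b, via Bayes' theorem."""
    return a * p / (a * p + (1 - b) * (1 - p))

for p in (0.001, 0.01, 0.2, 0.5, 0.7, 0.9):
    print(p, round(posterior(p), 4))
```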
Figure 2.4.4: P = P(A ∣ T) as a function of p = P(A) in the three cases
Suppose that a diagnostic test has sensitivity 0.99 and specificity 0.95. Find P(A ∣ T) for each of the following values of P(A):
1. 0.001
2. 0.01
3. 0.2
4. 0.5
5. 0.7
6. 0.9
Answer
With sensitivity 0.99 and specificity 0.95, the test in the last exercise superficially looks good. However the small value of P(A ∣ T) for small values of P(A) is striking (but inevitable given the properties above). The moral, of course, is that P(A ∣ T) depends critically on P(A), not just on the sensitivity and specificity of the test. Moreover, the correct comparison is P(A ∣ T) with P(A), as in the exercise, not P(A ∣ T) with P(T ∣ A). Beware of the fallacy of the transposed conditional! In terms of the correct comparison, the test does indeed work well; P(A ∣ T) is significantly larger than P(A) in all cases.
A woman initially believes that there is an even chance that she is or is not pregnant. She takes a home pregnancy test with sensitivity 0.95 and specificity 0.90 (which are reasonable values for a home pregnancy test). Find the updated probability that the woman is pregnant in each of the following cases.
1. The test is positive.
2. The test is negative.
Answer
Suppose that 70% of defendants brought to trial for a certain type of crime are guilty. Moreover, historical data show that juries convict guilty persons 80% of the time and convict innocent persons 10% of the time. Suppose that a person is tried for a crime of this type. Find the updated probability that the person is guilty in each of the following cases:
1. The person is convicted.
2. The person is acquitted.
Answer
The "Check Engine" light on your car has turned on. Without the information from the light, you believe that there is a 10% chance that your car has a serious engine problem. You learn that if the car has such a problem, the light will come on with probability 0.99, but if the car does not have a serious problem, the light will still come on, under circumstances similar to yours, with probability 0.3. Find the updated probability that you have an engine problem.
Answer
The standard test for HIV is the ELISA (Enzyme-Linked Immunosorbent Assay) test. It has sensitivity and specificity of 0.999. Suppose that a person is selected at random from a population in which 1% are infected with HIV, and given the ELISA test. Find the probability that the person has HIV in each of the following cases:
1. The test is positive.
2. The test is negative.
Answer
The ELISA test for HIV is a very good one. Let's look at another test, this one for prostate cancer, that's rather bad. The PSA test for prostate cancer is based on a blood marker known as the Prostate Specific Antigen. An elevated level of PSA is evidence for prostate cancer. To have a diagnostic test, in the sense that we are discussing here, we must decide on a definite level of PSA, above which we declare the test to be positive. A positive test would typically lead to other more invasive tests (such as biopsy) which, of course, carry risks and cost. The PSA test with cutoff 2.6 ng/ml has sensitivity 0.40 and specificity 0.81. The overall incidence of prostate cancer among males is 156 per 100000. Suppose that a man, with no particular risk factors, has the PSA test. Find the probability that the man has prostate cancer in each of the following cases:
1. The test is positive.
2. The test is negative. Answer Diagnostic testing is closely related to a general statistical procedure known as hypothesis testing. A separate chapter on hypothesis testing explores this procedure in detail.
Data Analysis Exercises For the M&M data set, find the empirical probability that a bag has at least 10 reds, given that the weight of the bag is at least 48 grams. Answer Consider the Cicada data. 1. Find the empirical probability that a cicada weighs at least 0.25 grams given that the cicada is male. 2. Find the empirical probability that a cicada weighs at least 0.25 grams given that the cicada is the tredecula species. Answer This page titled 2.4: Conditional Probability is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.5: Independence In this section, we will discuss independence, one of the fundamental concepts in probability theory. Independence is frequently invoked as a modeling assumption, and moreover, (classical) probability itself is based on the idea of independent replications of the experiment. As usual, if you are a new student of probability, you may want to skip the technical details.
Basic Theory As usual, our starting point is a random experiment modeled by a probability space (S, S , P) so that S is the set of outcomes, S the collection of events, and P the probability measure on the sample space (S, S ). We will define independence for two events, then for collections of events, and then for collections of random variables. In each case, the basic idea is the same.
Independence of Two Events
Two events A and B are independent if
P(A ∩ B) = P(A)P(B)   (2.5.1)
If both of the events have positive probability, then independence is equivalent to the statement that the conditional probability of one event given the other is the same as the unconditional probability of the event:
P(A ∣ B) = P(A) ⟺ P(B ∣ A) = P(B) ⟺ P(A ∩ B) = P(A)P(B)   (2.5.2)
This is how you should think of independence: knowledge that one event has occurred does not change the probability assigned to the other event. Independence of two events was discussed in the last section in the context of correlation. In particular, for two events, independent and uncorrelated mean the same thing. The terms independent and disjoint sound vaguely similar but they are actually very different. First, note that disjointness is purely a set-theory concept while independence is a probability (measure-theoretic) concept. Indeed, two events can be independent relative to one probability measure and dependent relative to another. But most importantly, two disjoint events can never be independent, except in the trivial case that one of the events is null. Suppose that A and B are disjoint events, each with positive probability. Then negatively correlated.
A
and
B
are dependent, and in fact are
Proof If A and B are independent events then intuitively it seems clear that any event that can be constructed from A should be independent of any event that can be constructed from B . This is the case, as the next result shows. Moreover, this basic idea is essential for the generalization of independence that we will consider shortly. If A and B are independent events, then each of the following pairs of events is independent: 1. A , B 2. B , A 3. A , B c
c
c
c
Proof An event that is “essentially deterministic”, that is, has probability 0 or 1, is independent of any other event, even itself. Suppose that A and B are events. 1. If P(A) = 0 or P(A) = 1 , then A and B are independent. 2. A is independent of itself if and only if P(A) = 0 or P(A) = 1 . Proof
2.5.1
https://stats.libretexts.org/@go/page/10133
General Independence of Events To extend the definition of independence to more than two events, we might think that we could just require pairwise independence, the independence of each pair of events. However, this is not sufficient for the strong type of independence that we have in mind. For example, suppose that we have three events A , B , and C . Mutual independence of these events should not only mean that each pair is independent, but also that an event that can be constructed from A and B (for example A ∪ B ) should be independent of C . Pairwise independence does not achieve this; an exercise below gives three events that are pairwise independent, but the intersection of two of the events is related to the third event in the strongest possible sense. c
Another possible generalization would be to simply require the probability of the intersection of the events to be the product of the probabilities of the events. However, this condition does not even guarantee pairwise independence. An exercise below gives an example. However, the definition of independence for two events does generalize in a natural way to an arbitrary collection of events. Suppose that finite J ⊆ I ,
Ai
is an event for each
i
in an index set I . Then the collection
A = { Ai : i ∈ I }
is independent if for every
P ( ⋂ Aj ) = ∏ P(Aj ) j∈J
(2.5.4)
j∈J
Independence of a collection of events is much stronger than mere pairwise independence of the events in the collection. The basic inheritance property in the following result follows immediately from the definition. Suppose that A is a collection of events. 1. If A is independent, then B is independent for every B ⊆ A . 2. If B is independent for every finite B ⊆ A then A is independent. For a finite collection of events, the number of conditions required for mutual independence grows exponentially with the number of events. There are 2
n
−n−1
non-trivial conditions in the definition of the independence of n events.
1. Explicitly give the 4 conditions that must be satisfied for events A , B , and C to be independent. 2. Explicitly give the 11 conditions that must be satisfied for events A , B , C , and D to be independent. Answer If the events A
1,
A2 , … , An
are independent, then it follows immediately from the definition that n
n
P ( ⋂ Ai ) = ∏ P(Ai ) i=1
(2.5.5)
i=1
This is known as the multiplication rule for independent events. Compare this with the general multiplication rule for conditional probability. The collection of essentially deterministic events D = {A ∈ S
: P(A) = 0 or P(A) = 1}
is independent.
Proof The next result generalizes the theorem above on the complements of two independent events. Suppose that A = {A : i ∈ I } and B = {B : i ∈ I } are two collections of events with the property that for each either B = A or B = A . Then A is independent if and only if B is an independent. i
i
i ∈ I
,
c
i
i
i
i
Proof The last theorem in turn leads to the type of strong independence that we want. The following exercise gives examples. If A , B , C , and D are independent events, then
2.5.2
https://stats.libretexts.org/@go/page/10133
1. A ∪ B , C , D are independent. 2. A ∪ B , C ∪ D are independent. c
c
c
c
Proof The complete generalization of these results is a bit complicated, but roughly means that if we start with a collection of indpendent events, and form new events from disjoint subcollections (using the set operations of union, intersection, and complment), then the new events are independent. For a precise statement, see the section on measure spaces. The importance of the complement theorem lies in the fact that any event that can be defined in terms of a finite collection of events {A : i ∈ I } can be written as a disjoint union of events of the form ⋂ B where B = A or B = A for each i ∈ I . i
c
i∈I
i
i
i
i
i
Another consequence of the general complement theorem is a formula for the probability of the union of a collection of independent events that is much nicer than the inclusion-exclusion formula.
If A_1, A_2, …, A_n are independent events, then
P(⋃_{i=1}^n A_i) = 1 − ∏_{i=1}^n [1 − P(A_i)]   (2.5.10)
Proof
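The complement-product formula is also convenient computationally. The sketch below (ours, with arbitrary probabilities) checks it against inclusion-exclusion for three independent events.

```python
p = [0.3, 0.4, 0.8]   # hypothetical P(A_1), P(A_2), P(A_3)

prod = 1.0
for pi in p:
    prod *= 1 - pi
print(1 - prod)       # 0.916 by the complement-product formula

incl_excl = sum(p) - (p[0]*p[1] + p[0]*p[2] + p[1]*p[2]) + p[0]*p[1]*p[2]
print(incl_excl)      # 0.916 again, by inclusion-exclusion
```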
Independence of Random Variables
Suppose now that X_i is a random variable for the experiment with values in a set T_i for each i in a nonempty index set I. Mathematically, X_i is a function from S into T_i, and recall that {X_i ∈ B} denotes the event {s ∈ S : X_i(s) ∈ B} for B ⊆ T_i. Intuitively, X_i is a variable of interest in the experiment, and every meaningful statement about X_i defines an event. Intuitively, the random variables are independent if information about some of the variables tells us nothing about the other variables. Mathematically, independence of a collection of random variables can be reduced to the independence of collections of events.
The collection of random variables X = {X_i : i ∈ I} is independent if the collection of events {{X_i ∈ B_i} : i ∈ I} is independent for every choice of B_i ⊆ T_i for i ∈ I. Equivalently then, X is independent if for every finite J ⊆ I, and for every choice of B_j ⊆ T_j for j ∈ J we have
P(⋂_{j∈J} {X_j ∈ B_j}) = ∏_{j∈J} P(X_j ∈ B_j)   (2.5.12)
Details
Suppose that X is a collection of random variables.
1. If X is independent, then Y is independent for every Y ⊆ X.
2. If Y is independent for every finite Y ⊆ X then X is independent.
It would seem almost obvious that if a collection of random variables is independent, and we transform each variable in a deterministic way, then the new collection of random variables should still be independent.
i
i
i
: i ∈ I}
is independent, then {g
i (Xi )
: i ∈ I}
is
Proof
As with events, the (mutual) independence of random variables is a very strong property. If a collection of random variables is independent, then any subcollection is also independent. New random variables formed from disjoint subcollections are independent. For a simple example, suppose that X, Y, and Z are independent real-valued random variables. Then
1. sin(X), cos(Y), and e^Z are independent.
2. (X, Y) and Z are independent.
3. X^2 + Y^2 and arctan(Z) are independent.
4. X and Z are independent.
5. Y and Z are independent.
In particular, note that statement 2 in the list above is much stronger than the conjunction of statements 4 and 5. Contrapositively, if X and Z are dependent, then (X, Y) and Z are also dependent. Independence of random variables subsumes independence of events.
A collection of events A is independent if and only if the corresponding collection of indicator variables {1_A : A ∈ A} is independent.
Proof
Many of the concepts that we have been using informally can now be made precise. A compound experiment that consists of "independent stages" is essentially just an experiment whose outcome is a sequence of independent random variables X = (X_1, X_2, …) where X_i is the outcome of the ith stage.
In particular, suppose that we have a basic experiment with outcome variable X. By definition, the outcome of the experiment that consists of "independent replications" of the basic experiment is a sequence of independent random variables X = (X_1, X_2, …) each with the same probability distribution as X. This is fundamental to the very concept of probability, as expressed in the law of large numbers. From a statistical point of view, suppose that we have a population of objects and a vector of measurements X of interest for the objects in the sample. The sequence X above corresponds to sampling from the distribution of X; that is, X_i is the vector of measurements for the ith object drawn from the sample. When we sample from a finite population, sampling with replacement generates independent random variables while sampling without replacement generates dependent random variables.
Conditional Independence and Conditional Probability
As noted at the beginning of our discussion, independence of events or random variables depends on the underlying probability measure. Thus, suppose that B is an event with positive probability. A collection of events or a collection of random variables is conditionally independent given B if the collection is independent relative to the conditional probability measure A ↦ P(A ∣ B). For example, a collection of events {A_i : i ∈ I} is conditionally independent given B if for every finite J ⊆ I,
P(⋂_{j∈J} A_j ∣ B) = ∏_{j∈J} P(A_j ∣ B)   (2.5.13)
Note that the definitions and theorems of this section would still be true, but with all probabilities conditioned on B. Conversely, conditional probability has a nice interpretation in terms of independent replications of the experiment. Thus, suppose that we start with a basic experiment with S as the set of outcomes. We let X denote the outcome random variable, so that mathematically X is simply the identity function on S. In particular, if A is an event then trivially, P(X ∈ A) = P(A). Suppose now that we replicate the experiment independently. This results in a new, compound experiment with a sequence of independent random variables (X_1, X_2, …), each with the same distribution as X. That is, X_i is the outcome of the ith repetition of the experiment.
Suppose now that A and B are events in the basic experiment with P(B) > 0. In the compound experiment, the event that "when B occurs for the first time, A also occurs" has probability
P(A ∩ B)/P(B) = P(A ∣ B)   (2.5.14)
Proof
Heuristic Argument
Suppose that A and B are disjoint events in a basic experiment with P(A) > 0 and P(B) > 0. In the compound experiment obtained by replicating the basic experiment, the event that "A occurs before B" has probability
P(A)/[P(A) + P(B)]   (2.5.18)
Proof
Examples and Applications
Basic Rules
Suppose that A, B, and C are independent events in an experiment with P(A) = 0.3, P(B) = 0.4, and P(C) = 0.8. Express each of the following events in set notation and find its probability:
1. All three events occur.
2. None of the three events occurs.
3. At least one of the three events occurs.
4. At least one of the three events does not occur.
5. Exactly one of the three events occurs.
6. Exactly two of the three events occur.
Answer
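With only three independent events, each of the probabilities above can be checked by summing over the 8 occurrence patterns, as in this sketch (ours).

```python
pA, pB, pC = 0.3, 0.4, 0.8

def pattern_prob(a, b, c):
    """Probability that A, B, C occur/fail according to the 0-1 pattern (a, b, c)."""
    return (pA if a else 1 - pA) * (pB if b else 1 - pB) * (pC if c else 1 - pC)

patterns = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
print(sum(pattern_prob(*w) for w in patterns if sum(w) == 3))   # all three
print(sum(pattern_prob(*w) for w in patterns if sum(w) == 0))   # none
print(sum(pattern_prob(*w) for w in patterns if sum(w) >= 1))   # at least one
print(sum(pattern_prob(*w) for w in patterns if sum(w) == 1))   # exactly one
```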
P(A) =
1 3
,
P(B) =
1 4
, and
P(C ) =
1 5
. Find the
c
c
c
c
Answer
Simple Populations
A small company has 100 employees; 40 are men and 60 are women. There are 6 male executives. How many female executives should there be if gender and rank are independent? The underlying experiment is to choose an employee at random.
Answer
Suppose that a farm has four orchards that produce peaches, and that peaches are classified by size as small, medium, and large. The table below gives the total number of peaches in a recent harvest by orchard and by size. Fill in the body of the table with counts for the various intersections, so that orchard and size are independent variables. The underlying experiment is to select a peach at random from the farm.

Frequency    Small   Medium   Large   Total
Orchard 1                              400
Orchard 2                              600
Orchard 3                              300
Orchard 4                              700
Total         400     1000     600    2000
Answer
Note from the last two exercises that you cannot "see" independence in a Venn diagram. Again, independence is a measure-theoretic concept, not a set-theoretic concept.
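Filling in the orchard table is mechanical once independence is imposed: each cell must be (row total × column total)/grand total. A sketch (ours):

```python
rows = {"Orchard 1": 400, "Orchard 2": 600, "Orchard 3": 300, "Orchard 4": 700}
cols = {"Small": 400, "Medium": 1000, "Large": 600}
grand = 2000

for r, rt in rows.items():
    # independence: cell count = row total * column total / grand total
    print(r, {c: rt * ct // grand for c, ct in cols.items()})
```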
Bernoulli Trials
A Bernoulli trials sequence is a sequence X = (X_1, X_2, …) of independent, identically distributed indicator variables. Random variable X_i is the outcome of trial i, where in the usual terminology of reliability theory, 1 denotes success and 0 denotes failure. The canonical example is the sequence of scores when a coin (not necessarily fair) is tossed repeatedly. Another basic example arises whenever we start with a basic experiment and an event A of interest, and then repeat the experiment. In this setting, X_i is the indicator variable for event A on the ith run of the experiment. The Bernoulli trials process is named for Jacob Bernoulli, and has a single basic parameter p = P(X_i = 1). This random process is studied in detail in the chapter on Bernoulli trials.
For (x_1, x_2, …, x_n) ∈ {0, 1}^n,
P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = p^(x_1 + x_2 + ⋯ + x_n) (1 − p)^(n − (x_1 + x_2 + ⋯ + x_n))    (2.5.19)
Proof

Note that the sequence of indicator random variables X is exchangeable. That is, if the sequence (x_1, x_2, …, x_n) in the previous result is permuted, the probability does not change. On the other hand, there are exchangeable sequences of indicator random variables that are dependent, as Pólya's urn model so dramatically illustrates.
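In code, equation (2.5.19) says that the joint probability depends on the bit string only through the number of ones, which makes the exchangeability visible. A minimal Python sketch (the function name is an arbitrary choice):

```python
def bernoulli_joint(bits, p):
    """P(X_1 = x_1, ..., X_n = x_n) for Bernoulli trials with success probability p."""
    k = sum(bits)                           # number of successes in the string
    return p**k * (1 - p)**(len(bits) - k)

# Any permutation of the string gives the same probability (exchangeability).
print(bernoulli_joint([1, 0, 1], 0.5))  # 0.125
print(bernoulli_joint([1, 1, 0], 0.5))  # 0.125
```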
Let Y denote the number of successes in the first n trials. Then
P(Y = y) = C(n, y) p^y (1 − p)^(n − y),  y ∈ {0, 1, …, n}    (2.5.20)
where C(n, y) is the binomial coefficient.
Proof

The distribution of Y is called the binomial distribution with parameters n and p. The binomial distribution is studied in more detail in the chapter on Bernoulli Trials.
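The binomial probability function is straightforward to compute. A sketch using Python's standard library:

```python
from math import comb

def binom_pmf(y, n, p):
    """P(Y = y) for the number of successes Y in n Bernoulli trials."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

# The probabilities sum to 1 over y = 0, ..., n.
print(sum(binom_pmf(y, 5, 1/3) for y in range(6)))  # 1.0 (up to rounding)
```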
More generally, a multinomial trials sequence is a sequence X = (X_1, X_2, …) of independent, identically distributed random variables, each taking values in a finite set S. The canonical example is the sequence of scores when a k-sided die (not necessarily fair) is thrown repeatedly. Multinomial trials are also studied in detail in the chapter on Bernoulli trials.
Cards

Consider the experiment that consists of dealing 2 cards at random from a standard deck and recording the sequence of cards dealt. For i ∈ {1, 2}, let Q_i be the event that card i is a queen and H_i the event that card i is a heart. Compute the appropriate probabilities to verify the following results. Reflect on these results.
1. Q_1 and H_1 are independent.
2. Q_2 and H_2 are independent.
3. Q_1 and Q_2 are negatively correlated.
4. H_1 and H_2 are negatively correlated.
5. Q_1 and H_2 are independent.
6. H_1 and Q_2 are independent.
Answer

In the card experiment, set n = 2. Run the simulation 500 times. For each pair of events in the previous exercise, compute the product of the empirical probabilities and the empirical probability of the intersection. Compare the results.
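A stand-in for the card simulation can be written in a few lines of Python. The encoding of ranks and suits below is an arbitrary assumption made for the sketch; it checks part 1 by comparing the product of empirical probabilities with the empirical probability of the intersection.

```python
import random

rng = random.Random(2)
deck = [(rank, suit) for rank in range(13) for suit in range(4)]  # rank 11 = queen, suit 2 = hearts (arbitrary coding)

n_runs, q1, h1, both = 100_000, 0, 0, 0
for _ in range(n_runs):
    c1, c2 = rng.sample(deck, 2)   # deal two cards without replacement
    q1 += (c1[0] == 11)
    h1 += (c1[1] == 2)
    both += (c1[0] == 11 and c1[1] == 2)

# Independence of Q1 and H1: both quantities should be near 1/52 ≈ 0.0192.
print(q1 / n_runs * h1 / n_runs, both / n_runs)
```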
Dice

The following exercise gives three events that are pairwise independent, but not (mutually) independent.

Consider the dice experiment that consists of rolling 2 standard, fair dice and recording the sequence of scores. Let A denote the event that the first score is 3, B the event that the second score is 4, and C the event that the sum of the scores is 7. Then
1. A, B, C are pairwise independent.
2. A ∩ B implies (is a subset of) C and hence these events are dependent in the strongest possible sense.
Answer

In the dice experiment, set n = 2. Run the experiment 500 times. For each pair of events in the previous exercise, compute the product of the empirical probabilities and the empirical probability of the intersection. Compare the results.

The following exercise gives an example of three events with the property that the probability of the intersection is the product of the probabilities, but the events are not pairwise independent.

Suppose that we throw a standard, fair die one time. Let A = {1, 2, 3, 4} and B = C = {4, 5, 6}. Then
1. P(A ∩ B ∩ C) = P(A)P(B)P(C).
2. B and C are the same event, and hence are dependent in the strongest possible sense.
Answer

Suppose that a standard, fair die is thrown 4 times. Find the probability of the following events.
1. Six does not occur.
2. Six occurs at least once.
3. The sum of the first two scores is 5 and the sum of the last two scores is 7.
Answer

Suppose that a pair of standard, fair dice are thrown 8 times. Find the probability of each of the following events.
1. Double six does not occur.
2. Double six occurs at least once.
3. Double six does not occur on the first 4 throws but occurs at least once in the last 4 throws.
Answer

Consider the dice experiment that consists of rolling n, k-sided dice and recording the sequence of scores X = (X_1, X_2, …, X_n). The following conditions are equivalent (and correspond to the assumption that the dice are fair):
1. X is uniformly distributed on {1, 2, …, k}^n.
2. X is a sequence of independent variables, and X_i is uniformly distributed on {1, 2, …, k} for each i.
Proof

A pair of standard, fair dice are thrown repeatedly. Find the probability of each of the following events.
1. A sum of 4 occurs before a sum of 7.
2. A sum of 5 occurs before a sum of 7.
3. A sum of 6 occurs before a sum of 7.
4. When a sum of 8 occurs the first time, it occurs "the hard way" as (4, 4).
Answer

Problems of the type in the last exercise are important in the game of craps. Craps is studied in more detail in the chapter on Games of Chance.
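Parts 1 to 3 of the last exercise follow directly from the "A before B" formula above. A short Python sketch with exact arithmetic (the helper names are arbitrary):

```python
from fractions import Fraction

def sum_prob(s):
    """P(sum of two fair dice equals s)."""
    return Fraction(sum(1 for i in range(1, 7) for j in range(1, 7) if i + j == s), 36)

def before(a, b):
    """P(sum a occurs before sum b), using P(A) / (P(A) + P(B))."""
    return sum_prob(a) / (sum_prob(a) + sum_prob(b))

print(before(4, 7), before(5, 7), before(6, 7))  # 1/3, 2/5, 5/11
```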
Coins

A biased coin with probability of heads 1/3 is tossed 5 times. Let X denote the outcome of the tosses (encoded as a bit string) and let Y denote the number of heads. Find each of the following:
1. P(X = x) for each x ∈ {0, 1}^5.
2. P(Y = y) for each y ∈ {0, 1, 2, 3, 4, 5}.
3. P(1 ≤ Y ≤ 3)
Answer

A box contains a fair coin and a two-headed coin. A coin is chosen at random from the box and tossed repeatedly. Let F denote the event that the fair coin is chosen, and let H_i denote the event that the ith toss results in heads. Then
1. (H_1, H_2, …) are conditionally independent given F, with P(H_i ∣ F) = 1/2 for each i.
2. (H_1, H_2, …) are conditionally independent given F^c, with P(H_i ∣ F^c) = 1 for each i.
3. P(H_i) = 3/4 for each i.
4. P(H_1 ∩ H_2 ∩ ⋯ ∩ H_n) = 1/2^(n+1) + 1/2.
5. (H_1, H_2, …) are dependent.
6. P(F ∣ H_1 ∩ H_2 ∩ ⋯ ∩ H_n) = 1/(2^n + 1).
7. P(F ∣ H_1 ∩ H_2 ∩ ⋯ ∩ H_n) → 0 as n → ∞.
Proof
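Part 6 is a direct Bayes' rule computation, easy to verify exactly. A minimal Python sketch (function name is an arbitrary choice):

```python
from fractions import Fraction

def p_fair_given_heads(n):
    """P(fair coin | first n tosses all heads), by Bayes' rule."""
    prior = Fraction(1, 2)
    like_fair = Fraction(1, 2) ** n     # fair coin: heads with probability 1/2 each toss
    like_two_headed = Fraction(1, 1)    # two-headed coin: heads always
    num = prior * like_fair
    return num / (num + prior * like_two_headed)   # simplifies to 1/(2^n + 1)

print([p_fair_given_heads(n) for n in range(1, 5)])  # 1/3, 1/5, 1/9, 1/17
```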
Consider again the box in the previous exercise, but we change the experiment as follows: a coin is chosen at random from the box and tossed and the result recorded. The coin is returned to the box and the process is repeated. As before, let H_i denote the event that toss i results in heads. Then
1. (H_1, H_2, …) are independent.
2. P(H_i) = 3/4 for each i.
3. P(H_1 ∩ H_2 ∩ ⋯ ∩ H_n) = (3/4)^n.
Proof Think carefully about the results in the previous two exercises, and the differences between the two models. Tossing a coin produces independent random variables if the probability of heads is fixed (that is, non-random even if unknown). Tossing a coin with a random probability of heads generally does not produce independent random variables; the result of a toss gives information about the probability of heads which in turn gives information about subsequent tosses.
Uniform Distributions

Recall that Buffon's coin experiment consists of tossing a coin with radius r ≤ 1/2 randomly on a floor covered with square tiles of side length 1. The coordinates (X, Y) of the center of the coin are recorded relative to axes through the center of the square in which the coin lands. The following conditions are equivalent:
1. (X, Y) is uniformly distributed on [−1/2, 1/2]^2.
2. X and Y are independent and each is uniformly distributed on [−1/2, 1/2].

Figure 2.5.1: Buffon's coin experiment
Proof

Compare this result with the result above for fair dice.

In Buffon's coin experiment, set r = 0.3. Run the simulation 500 times. For the events {X > 0} and {Y < 0}, compute the product of the empirical probabilities and the empirical probability of the intersection. Compare the results.
The arrival time X of the A train is uniformly distributed on the interval (0, 30), while the arrival time Y of the B train is uniformly distributed on the interval (15, 30). (The arrival times are in minutes, after 8:00 AM). Moreover, the arrival times are independent. Find the probability of each of the following events: 1. The A train arrives first. 2. Both trains arrive sometime after 20 minutes. Answer
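A Monte Carlo sketch in Python can be used to check your answer to part 1 (the sample size and seed are arbitrary):

```python
import random

rng = random.Random(3)
n = 200_000
# X ~ uniform(0, 30), Y ~ uniform(15, 30), independent
count_first = sum(rng.uniform(0, 30) < rng.uniform(15, 30) for _ in range(n))
print(count_first / n)   # ≈ 0.75
```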
Reliability

Recall the simple model of structural reliability in which a system is composed of n components. Suppose in addition that the components operate independently of each other. As before, let X_i denote the state of component i, where 1 means working and 0 means failure. Thus, our basic assumption is that the state vector X = (X_1, X_2, …, X_n) is a sequence of independent indicator random variables. We assume that the state of the system (either working or failed) depends only on the states of the components. Thus, the state of the system is an indicator random variable
Y = y(X_1, X_2, …, X_n)    (2.5.25)
where y : {0, 1}^n → {0, 1} is the structure function. Generally, the probability that a device is working is the reliability of the device. Thus, we will denote the reliability of component i by p_i = P(X_i = 1) so that the vector of component reliabilities is p = (p_1, p_2, …, p_n). By independence, the system reliability r is a function of the component reliabilities:
r(p_1, p_2, …, p_n) = P(Y = 1)    (2.5.26)
Appropriately enough, this function is known as the reliability function. Our challenge is usually to find the reliability function, given the structure function. When the components all have the same probability p then of course the system reliability r is just a function of p. In this case, the state vector X = (X_1, X_2, …, X_n) forms a sequence of Bernoulli trials.
Comment on the independence assumption for real systems, such as your car or your computer.

Recall that a series system is working if and only if each component is working.
1. The state of the system is U = X_1 X_2 ⋯ X_n = min{X_1, X_2, …, X_n}.
2. The reliability is P(U = 1) = p_1 p_2 ⋯ p_n.

Recall that a parallel system is working if and only if at least one component is working.
1. The state of the system is V = 1 − (1 − X_1)(1 − X_2) ⋯ (1 − X_n) = max{X_1, X_2, …, X_n}.
2. The reliability is P(V = 1) = 1 − (1 − p_1)(1 − p_2) ⋯ (1 − p_n).
Recall that a k out of n system is working if and only if at least k of the n components are working. Thus, a parallel system is a 1 out of n system and a series system is an n out of n system. A k out of 2k − 1 system is a majority rules system. The reliability function of a general k out of n system is a mess. However, if the component reliabilities are the same, the function has a reasonably simple form.

For a k out of n system with common component reliability p, the system reliability is
r(p) = ∑_{i=k}^{n} C(n, i) p^i (1 − p)^(n − i)    (2.5.27)
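Equation (2.5.27) translates directly into code; the series and parallel systems fall out as special cases. A Python sketch:

```python
from math import comb

def k_out_of_n(k, n, p):
    """Reliability of a k out of n system with common component reliability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.8
print(k_out_of_n(1, 3, p))  # parallel system (1 out of 3)
print(k_out_of_n(2, 3, p))  # majority rules (2 out of 3)
print(k_out_of_n(3, 3, p))  # series system (3 out of 3) = 0.8**3
```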
Consider a system of 3 independent components with common reliability p = 0.8. Find the reliability of each of the following:
1. The parallel system.
2. The 2 out of 3 system.
3. The series system.
Answer

Consider a system of 3 independent components with reliabilities p_1 = 0.8, p_2 = 0.8, p_3 = 0.7. Find the reliability of each of the following:
1. The parallel system.
2. The 2 out of 3 system.
3. The series system.
Answer

Consider an airplane with an odd number of engines, each with reliability p. Suppose that the airplane is a majority rules system, so that the airplane needs a majority of working engines in order to fly.
1. Find the reliability of a 3 engine plane as a function of p.
2. Find the reliability of a 5 engine plane as a function of p.
3. For what values of p is a 5 engine plane preferable to a 3 engine plane?
Answer

The graph below is known as the Wheatstone bridge network and is named for Charles Wheatstone. The edges represent components, and the system works if and only if there is a working path from vertex a to vertex b.
1. Find the structure function.
2. Find the reliability function.

Figure 2.5.2: The Wheatstone bridge network
Answer

A system consists of 3 components, connected in parallel. Because of environmental factors, the components do not operate independently, so our usual assumption does not hold. However, we will assume that under low stress conditions, the components are independent, each with reliability 0.9; under medium stress conditions, the components are independent, each with reliability 0.8; and under high stress conditions, the components are independent, each with reliability 0.7. The probability of low stress is 0.5, of medium stress is 0.3, and of high stress is 0.2.
1. Find the reliability of the system.
2. Given that the system works, find the conditional probability of each stress level.
Answer

Suppose that bits are transmitted across a noisy communications channel. Each bit that is sent, independently of the others, is received correctly with probability 0.9 and changed to the complementary bit with probability 0.1. Using redundancy to improve reliability, suppose that a given bit will be sent 3 times. We naturally want to compute the probability that we correctly identify the bit that was sent. Assume we have no prior knowledge of the bit, so we assign probability 1/2 each to the event that 000 was sent and the event that 111 was sent. Now find the conditional probability that 111 was sent given each of the 8 possible bit strings received.
Answer
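The repetition-code exercise is a Bayes' rule computation over the 8 possible received strings. A Python sketch (the helper names are arbitrary; the channel parameters are the ones in the exercise):

```python
from itertools import product

p_correct = 0.9

def likelihood(received, sent):
    """P(received | sent) for a memoryless binary channel."""
    out = 1.0
    for r, s in zip(received, sent):
        out *= p_correct if r == s else 1 - p_correct
    return out

for r in product("01", repeat=3):
    num = 0.5 * likelihood(r, "111")
    den = num + 0.5 * likelihood(r, "000")
    print("".join(r), round(num / den, 4))   # P(111 sent | r received)
```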
Diagnostic Testing

Recall the discussion of diagnostic testing in the section on Conditional Probability. Thus, we have an event A for a random experiment whose occurrence or non-occurrence we cannot observe directly. Suppose now that we have n tests for the occurrence of A, labeled from 1 to n. We will let T_i denote the event that test i is positive for A. The tests are independent in the following sense:

If A occurs, then (T_1, T_2, …, T_n) are (conditionally) independent and test i has sensitivity a_i = P(T_i ∣ A).
If A does not occur, then (T_1, T_2, …, T_n) are (conditionally) independent and test i has specificity b_i = P(T_i^c ∣ A^c).

Note that unconditionally, it is not reasonable to assume that the tests are independent. For example, a positive result for a given test presumably is evidence that the condition A has occurred, which in turn is evidence that a subsequent test will be positive. In short, we expect that T_i and T_j should be positively correlated.
We can form a new, compound test by giving a decision rule in terms of the individual test results. In other words, the event T that the compound test is positive for A is a function of (T_1, T_2, …, T_n). The typical decision rules are very similar to the reliability structures discussed above. A special case of interest is when the n tests are independent applications of a given basic test. In this case, a_i = a and b_i = b for each i.

Consider the compound test that is positive for A if and only if each of the n tests is positive for A.
1. T = T_1 ∩ T_2 ∩ ⋯ ∩ T_n
2. The sensitivity is P(T ∣ A) = a_1 a_2 ⋯ a_n.
3. The specificity is P(T^c ∣ A^c) = 1 − (1 − b_1)(1 − b_2) ⋯ (1 − b_n)

Consider the compound test that is positive for A if and only if at least one of the n tests is positive for A.
1. T = T_1 ∪ T_2 ∪ ⋯ ∪ T_n
2. The sensitivity is P(T ∣ A) = 1 − (1 − a_1)(1 − a_2) ⋯ (1 − a_n).
3. The specificity is P(T^c ∣ A^c) = b_1 b_2 ⋯ b_n.
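When the n tests are independent applications of one basic test, the sensitivity and specificity of a general k out of n decision rule can be computed with the binomial formula, exactly as for reliability. A Python sketch (the function name is an arbitrary choice):

```python
from math import comb

def k_out_of_n_test(k, n, a, b):
    """Sensitivity and specificity of a k out of n compound test built from
    n independent applications of a test with sensitivity a, specificity b."""
    sens = sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))
    # Given A^c, each test is (falsely) positive with probability 1 - b.
    false_pos = sum(comb(n, i) * (1 - b)**i * b**(n - i) for i in range(k, n + 1))
    return sens, 1 - false_pos

print(k_out_of_n_test(2, 3, 0.95, 0.90))  # majority rules test: (0.99275, 0.972)
```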
More generally, we could define the compound k out of n test that is positive for A if and only if at least k of the individual tests are positive for A. The series test is the n out of n test, while the parallel test is the 1 out of n test. The k out of 2k − 1 test is the majority rules test.

Suppose that a woman initially believes that there is an even chance that she is or is not pregnant. She buys three identical pregnancy tests with sensitivity 0.95 and specificity 0.90. Tests 1 and 3 are positive and test 2 is negative.
1. Find the updated probability that the woman is pregnant.
2. Can we just say that tests 2 and 3 cancel each other out? Find the probability that the woman is pregnant given just one positive test, and compare the answer with the answer to part (a).
Answer

Suppose that 3 independent, identical tests for an event A are applied, each with sensitivity a and specificity b. Find the sensitivity and specificity of the following tests:
1. 1 out of 3 test
2. 2 out of 3 test
3. 3 out of 3 test
Answer

In a criminal trial, the defendant is convicted if and only if all 6 jurors vote guilty. Assume that if the defendant really is guilty, the jurors vote guilty, independently, with probability 0.95, while if the defendant is really innocent, the jurors vote not guilty, independently with probability 0.8. Suppose that 70% of defendants brought to trial are guilty.
1. Find the probability that the defendant is convicted.
2. Given that the defendant is convicted, find the probability that the defendant is guilty.
3. Comment on the assumption that the jurors act independently.
Answer
Genetics

Please refer to the discussion of genetics in the section on random experiments if you need to review some of the definitions in this section.

Recall first that the ABO blood type in humans is determined by three alleles: a, b, and o. Furthermore, a and b are co-dominant and o is recessive. Suppose that in a certain population, the proportion of a, b, and o alleles are p, q, and r respectively. Of course we must have p > 0, q > 0, r > 0 and p + q + r = 1. Suppose that the blood genotype in a person is the result of independent alleles, chosen with probabilities p, q, and r as above.

1. The probability distribution of the genotypes is given in the following table:

Genotype      aa     ab     ao     bb     bo     oo
Probability   p^2    2pq    2pr    q^2    2qr    r^2

2. The probability distribution of the blood types is given in the following table:

Blood type    A           B           AB     O
Probability   p^2 + 2pr   q^2 + 2qr   2pq    r^2
Proof

The discussion above is related to the Hardy-Weinberg model of genetics. The model is named for the English mathematician Godfrey Hardy and the German physician Wilhelm Weinberg.

Suppose that the probability distribution for the set of blood types in a certain population is given in the following table:

Blood type    A       B       AB      O
Probability   0.360   0.123   0.038   0.479
Find p, q, and r.
Answer
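One route to the allele frequencies uses the blood type table above: P(O) = r^2, P(B) + P(O) = (q + r)^2, and P(A) + P(O) = (p + r)^2. A Python sketch of the resulting computation:

```python
from math import sqrt

P_A, P_B, P_AB, P_O = 0.360, 0.123, 0.038, 0.479

r = sqrt(P_O)                  # P(O) = r^2
q = sqrt(P_B + P_O) - r        # P(B) + P(O) = (q + r)^2
p = sqrt(P_A + P_O) - r        # P(A) + P(O) = (p + r)^2
print(round(p, 3), round(q, 3), round(r, 3), round(p + q + r, 3))  # sums to ≈ 1
```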
Suppose next that pod color in a certain type of pea plant is determined by a gene with two alleles: g for green and y for yellow, and that g is dominant and y recessive. Suppose that 2 green-pod plants are bred together. Suppose further that each plant, independently, has the recessive yellow-pod allele with probability 1/4.
1. Find the probability that 3 offspring plants will have green pods.
2. Given that the 3 offspring plants have green pods, find the updated probability that both parents have the recessive allele.
Answer

Next consider a sex-linked hereditary disorder in humans (such as colorblindness or hemophilia). Let h denote the healthy allele and d the defective allele for the gene linked to the disorder. Recall that h is dominant and d recessive for women. Suppose that a healthy woman initially has a 1/2 chance of being a carrier. (This would be the case, for example, if her mother and father are healthy but she has a brother with the disorder, so that her mother must be a carrier.)
1. Find the probability that the first two sons of the woman will be healthy.
2. Given that the first two sons are healthy, compute the updated probability that she is a carrier.
3. Given that the first two sons are healthy, compute the conditional probability that the third son will be healthy.
Answer
Laplace's Rule of Succession

Suppose that we have m + 1 coins, labeled 0, 1, …, m. Coin i lands heads with probability i/m for each i. The experiment is to choose a coin at random (so that each coin is equally likely to be chosen) and then toss the chosen coin repeatedly.
1. The probability that the first n tosses are all heads is p_{m,n} = [1/(m + 1)] ∑_{i=0}^{m} (i/m)^n.
2. p_{m,n} → 1/(n + 1) as m → ∞.
3. The conditional probability that toss n + 1 is heads given that the previous n tosses were all heads is p_{m,n+1}/p_{m,n}.
4. p_{m,n+1}/p_{m,n} → (n + 1)/(n + 2) as m → ∞.
Proof
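The convergence in parts 2 and 4 is easy to see numerically. A minimal Python sketch with exact arithmetic (m and n below are arbitrary choices):

```python
from fractions import Fraction

def p(m, n):
    """p_{m,n}: probability that the first n tosses are all heads."""
    return Fraction(1, m + 1) * sum(Fraction(i, m)**n for i in range(m + 1))

m, n = 1000, 10
print(float(p(m, n + 1) / p(m, n)))   # ≈ 11/12, Laplace's rule (n+1)/(n+2)
print(Fraction(n + 1, n + 2))          # 11/12 ≈ 0.9167
```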
Note that coin 0 is two-tailed, the probability of heads increases with i, and coin m is two-headed. The limiting conditional probability in part (d) is called Laplace's Rule of Succession, named after Pierre-Simon Laplace. This rule was used by Laplace and others as a general principle for estimating the conditional probability that an event will occur on time n + 1, given that the event has occurred n times in succession.

Suppose that a missile has had 10 successful tests in a row. Compute Laplace's estimate that the 11th test will be successful. Does this make sense?
Answer

This page titled 2.5: Independence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.6: Convergence

This is the first of several sections in this chapter that are more advanced than the basic topics in the first five sections. In this section we discuss several topics related to convergence of events and random variables, a subject of fundamental importance in probability theory. In particular the results that we obtain will be important for:

Properties of distribution functions,
The weak law of large numbers,
The strong law of large numbers.

As usual, our starting point is a random experiment modeled by a probability space (Ω, F, P). So to review, Ω is the set of outcomes, F the σ-algebra of events, and P the probability measure on the sample space (Ω, F).
Basic Theory

Sequences of events

Our first discussion deals with sequences of events and various types of limits of such sequences. The limits are also events. We start with two simple definitions.

Suppose that (A_1, A_2, …) is a sequence of events.
1. The sequence is increasing if A_n ⊆ A_{n+1} for every n ∈ N_+.
2. The sequence is decreasing if A_{n+1} ⊆ A_n for every n ∈ N_+.

Note that these are the standard definitions of increasing and decreasing, relative to the ordinary total order ≤ on the index set N_+ and the subset partial order ⊆ on the collection of events. The terminology is also justified by the corresponding indicator variables.
Suppose that (A_1, A_2, …) is a sequence of events, and let I_n = 1_{A_n} denote the indicator variable of the event A_n for n ∈ N_+.
1. The sequence of events is increasing if and only if the sequence of indicator variables is increasing in the ordinary sense. That is, I_n ≤ I_{n+1} for each n ∈ N_+.
2. The sequence of events is decreasing if and only if the sequence of indicator variables is decreasing in the ordinary sense. That is, I_{n+1} ≤ I_n for each n ∈ N_+.
Figure 2.6.1: A sequence of increasing events and their union
Figure 2.6.2: A sequence of decreasing events and their intersection If a sequence of events is either increasing or decreasing, we can define the limit of the sequence in a way that turns out to be quite natural. Suppose that (A
1,
A2 , …)
is a sequence of events.
1. If the sequence is increasing, we define lim
n→∞
∞
An = ⋃
n=1
An
2.6.1
.
https://stats.libretexts.org/@go/page/10134
2. If the sequence is decreasing, we define lim
∞
An = ⋂
n→∞
n=1
An
.
Once again, the terminology is clarified by the corresponding indicator variables. Suppose again that (A
1,
is a sequence of events, and let I
A2 , …)
n
1. If the sequence of events is increasing, then lim 2. If the sequence of events is decreasing, then lim
n→∞
denote the indicator variable of A for n ∈ N . n
is the indicator variable of ⋃ is the indicator variable of ⋂
∞
In
n→∞
= 1An
n=1 ∞
In
n=1
+
An An
Proof An arbitrary union of events can always be written as a union of increasing events, and an arbitrary intersection of events can always be written as an intersection of decreasing events: Suppose that (A
1,
n
1. ⋃ 2. ⋂
i=1 n i=1
Ai Ai
A2 , …)
is a sequence of events. Then ∞
n
is increasing in n ∈ N and ⋃ is decreasing in n ∈ N and ⋂ +
i=1 ∞
+
i=1
Ai = limn→∞ ⋃i=1 Ai n
Ai = limn→∞ ⋂i=1 Ai
. .
Proof There is a more interesting and useful way to generate increasing and decreasing sequences from an arbitrary sequence of events, using the tail segment of the sequence rather than the initial segment. Suppose that (A
1,
∞
1. ⋃ 2. ⋂
i=n ∞ i=n
Ai Ai
A2 , …)
is a sequence of events. Then
is decreasing in n ∈ N . is increasing in n ∈ N . +
+
Proof Since the new sequences defined in the previous results are decreasing and increasing, respectively, we can take their limits. These are the limit superior and limit inferior, respectively, of the original sequence. Suppose that (A
1,
A2 , …)
is a sequence of events. Define
1. lim sup A = lim many values of n . 2. lim inf A = lim finitely many values of n . n
n→∞
n→∞
∞
n→∞
n
n→∞
⋃
∞
i=n
∞
⋂
i=n
Ai = ⋂
∞
n=1
∞
Ai = ⋃
n=1
⋃
i=n
∞
⋂
i=n
. This is the event that occurs if an only if A occurs for infinitely
Ai
Ai
n
. This is the event that occurs if an only if A occurs for all but n
Proof Once again, the terminology and notation are clarified by the corresponding indicator variables. You may need to review limit inferior and limit superior for sequences of real numbers in the section on Partial Orders. Suppose that (A
1,
1. lim sup 2. lim inf
is a sequence of events, and et I
In
= 1An
n
is the indicator variable of lim sup is the indicator variable of lim inf
In
n→∞
n→∞
A2 , …)
n→∞
n→∞
An
An
denote the indicator variable of A for n ∈ N . Then n
+
.
.
Proof Suppose that (A
1,
A2 , …)
is a sequence of events. Then lim inf
A2 , …)
is a sequence of events. Then
n→∞
An ⊆ lim sup n→∞ An
.
Proof Suppose that (A
1,
1. (lim sup 2. (lim inf
n→∞
n→∞
An )
An )
c
c
c
= lim infn→∞ An c
= lim sup n→∞ An
.
Proof
2.6.2
https://stats.libretexts.org/@go/page/10134
The Continuity Theorems Generally speaking, a function is continuous if it preserves limits. Thus, the following results are the continuity theorems of probability. Part (a) is the continuity theorem for increasing events and part (b) the continuity theorem for decreasing events. Suppose that (A
1,
A2 , …)
is a sequence of events.
1. If the sequence is increasing then lim 2. If the sequence is decreasing then lim
∞
n→∞
P(An ) = P (limn→∞ An ) = P (⋃
n→∞
n=1 ∞
P(An ) = P (limn→∞ An ) = P (⋂
n=1
An ) An )
Proof The continuity theorems can be applied to the increasing and decreasing sequences that we constructed earlier from an arbitrary sequence of events. Suppose that (A
1,
∞
A2 , …)
is a sequence of events. n
1. P (⋃ 2. P (⋂
i=1 ∞ i=1
Ai ) = limn→∞ P (⋃i=1 Ai ) n
Ai ) = limn→∞ P (⋂i=1 Ai )
Proof Suppose that (A
1,
1. P (lim sup 2. P (lim inf
A2 , …)
is a sequence of events. Then ∞
n→∞
An ) = limn→∞ P (⋃i=n Ai ) ∞
n→∞
An ) = limn→∞ P (⋂i=n Ai )
Proof The next result shows that the countable additivity axiom for a probability measure is equivalent to finite additivity and the continuity property for increasing events. Temporarily, suppose that countably additive.
P
is only finitely additive, but satisfies the continuity property for increasing events. Then
P
is
Proof There are a few mathematicians who reject the countable additivity axiom of probability measure in favor of the weaker finite additivity axiom. Whatever the philosophical arguments may be, life is certainly much harder without the continuity theorems.
The Borel-Cantelli Lemmas The Borel-Cantelli Lemmas, named after Emil Borel and Francessco Cantelli, are very important tools in probability theory. The first lemma gives a condition that is sufficient to conclude that infinitely many events occur with probability 0. First Borel-Cantelli Lemma. Suppose that P (lim sup A ) = 0. n→∞
(A1 , A2 , …)
is a sequence of events. If
∞
∑
n=1
P(An ) < ∞
then
n
Proof The second lemma gives a condition that is sufficient to conclude that infinitely many independent events occur with probability 1. Second Borel-Cantelli Lemma. Suppose that P (lim sup A ) = 1. n→∞
(A1 , A2 , …)
is a sequence of independent events. If
∞
∑
n=1
P(An ) = ∞
then
n
Proof For independent events, both Borel-Cantelli lemmas apply of course, and lead to a zero-one law. If (A
1,
A2 , …) ∞
1. If ∑ 2. If ∑
n=1 ∞ n=1
is a sequence of independent events then lim sup
P(An ) < ∞ P(An ) = ∞
then P (lim sup then P (lim sup
n→∞
An ) = 0
n→∞
An ) = 1
n→∞
An
has probability 0 or 1:
. .
2.6.3
https://stats.libretexts.org/@go/page/10134
This result is actually a special case of a more general zero-one law, known as the Kolmogorov zero-one law, and named for Andrei Kolmogorov. This law is studied in the more advanced section on measure. Also, we can use the zero-one law to derive a calculus theorem that relates infinite series and infinite products. This derivation is an example of the probabilistic method—the use of probability to obtain results, seemingly unrelated to probability, in other areas of mathematics.

Suppose that p_i ∈ (0, 1) for each i ∈ N_+. Then
∏_{i=1}^{∞} p_i > 0 if and only if ∑_{i=1}^{∞} (1 − p_i) < ∞    (2.6.6)
Proof Our next result is a simple application of the second Borel-Cantelli lemma to independent replications of a basic experiment. Suppose that A is an event in a basic random experiment with P(A) > 0 . In the compound experiment that consists of independent replications of the basic experiment, the event “A occurs infinitely often” has probability 1. Proof
Convergence of Random Variables

Our next discussion concerns two ways that a sequence of random variables defined for our experiment can "converge". These are fundamentally important concepts, since some of the deepest results in probability theory are limit theorems involving random variables. The most important special case is when the random variables are real valued, but the proofs are essentially the same for variables with values in a metric space, so we will use the more general setting. Thus, suppose that (S, d) is a metric space, and that S is the corresponding Borel σ-algebra (that is, the σ-algebra generated by the topology), so that our measurable space is (S, S). Here is the most important special case: For n ∈ N_+, the n-dimensional Euclidean space is (R^n, d_n) where
d_n(x, y) = √(∑_{i=1}^{n} (y_i − x_i)^2),  x = (x_1, x_2, …, x_n), y = (y_1, y_2, …, y_n) ∈ R^n    (2.6.7)
Euclidean spaces are named for Euclid, of course. As noted above, the one-dimensional case where d(x, y) = |y − x| for x, y ∈ R is particularly important. Returning to the general metric space, recall that if (x_1, x_2, …) is a sequence in S and x ∈ S, then x_n → x as n → ∞ means that d(x_n, x) → 0 as n → ∞ (in the usual calculus sense). For the rest of our discussion, we assume that (X_1, X_2, …) is a sequence of random variables with values in S and X is a random variable with values in S, all defined on the probability space (Ω, F, P).
We say that X_n → X as n → ∞ with probability 1 if the event that X_n → X as n → ∞ has probability 1. That is,
P{ω ∈ Ω : X_n(ω) → X(ω) as n → ∞} = 1    (2.6.8)
Details

As good probabilists, we usually suppress references to the sample space and write the definition simply as P(X_n → X as n → ∞) = 1. The statement that an event has probability 1 is usually the strongest affirmative statement that we can make in probability theory. Thus, convergence with probability 1 is the strongest form of convergence. The phrases almost surely and almost everywhere are sometimes used instead of the phrase with probability 1.
Recall that metrics d and e on S are equivalent if they generate the same topology on S. Recall also that convergence of a sequence is a topological property. That is, if (x_1, x_2, …) is a sequence in S and x ∈ S, and if d, e are equivalent metrics on S, then x_n → x as n → ∞ relative to d if and only if x_n → x as n → ∞ relative to e. So for our random variables as defined above, it follows that X_n → X as n → ∞ with probability 1 relative to d if and only if X_n → X as n → ∞ with probability 1 relative to e.
The following statements are equivalent:
1. X_n → X as n → ∞ with probability 1.
2. P[d(X_n, X) > ϵ for infinitely many n ∈ N_+] = 0 for every rational ϵ > 0.
3. P[d(X_n, X) > ϵ for infinitely many n ∈ N_+] = 0 for every ϵ > 0.
4. P[d(X_k, X) > ϵ for some k ≥ n] → 0 as n → ∞ for every ϵ > 0.
∞ n=1
P [d(Xn , X) > ϵ] < ∞
for every ϵ > 0 then X
n
→ X
as n → ∞ with probability 1.
Proof Here is our next mode of convergence. We say that X
n
→ X
as n → ∞ in probability if P [d(Xn , X) > ϵ] → 0 as n → ∞ for each ϵ > 0
(2.6.11)
The phrase in probability sounds superficially like the phrase with probability 1. However, as we will soon see, convergence in probability is much weaker than convergence with probability 1. Indeed, convergence with probability 1 is often called strong convergence, while convergence in probability is often called weak convergence. If X
n
→ X
as n → ∞ with probability 1 then X
n
→ X
as n → ∞ in probability.
Proof The converse fails with a passion. A simple counterexample is given below. However, there is a partial converse that is very useful. If X → X as n → ∞ in probability, then there exists a subsequence (n with probability 1.
1,
n
n2 , n3 …)
of N such that +
Xnk → X
as k → ∞
Proof There are two other modes of convergence that we will discuss later: Convergence in distribution. Convergence in mean,
Examples and Applications Coins Suppose that we have an infinite sequence of coins labeled 1, 2, … Moreover, coin n has probability of heads 1/n for each n ∈ N , where a > 0 is a parameter. We toss each coin in sequence one time. In terms of a , find the probability of the following events: a
+
1. infinitely many heads occur 2. infinitely many tails occur Answer The following exercise gives a simple example of a sequence of random variables that converge in probability but not with probability 1. Naturally, we are assuming the standard metric on R. Suppose again that we have a sequence of coins labeled 1, 2, …, and that coin n lands heads up with probability We toss the coins in order to produce a sequence (X , X , …) of independent indicator random variables with 1
n
for each n .
2
1 P(Xn = 1) =
1 n
1 , P(Xn = 0) = 1 −
; n
n ∈ N+
(2.6.12)
1. P(X = 0 for infinitely many n) = 1 , so that infinitely many tails occur with probability 1. 2. P(X = 1 for infinitely many n) = 1 , so that infinitely many heads occur with probability 1. 3. P(X does not converge as n → ∞) = 1 . 4. X → 0 as n → ∞ in probability. n n n
n
2.6.5
https://stats.libretexts.org/@go/page/10134
Proof
Discrete Spaces Recall that a measurable space (S, S ) is discrete if S is countable and S is the collection of all subsets of S (the power set of S ). Moreover, S is the Borel σ-algebra corresponding to the discrete metric d on S given by d(x, x) = 0 for x ∈ S and d(x, y) = 1 for distinct x, y ∈ S . How do convergence with probability 1 and convergence in probability work for the discrete metric? Suppose that (S, S ) is a discrete space. Suppose further that (X , X , …) is a sequence of random variables with values in S and X is a random variable with values in S , all defined on the probability space (Ω, F , P). Relative to the discrete metric d , 1
1. X 2. X
n
→ X
n
→ X
2
as n → ∞ with probability 1 if and only if P(X = X for all but finitely many n ∈ N as n → ∞ in probability if and only if P(X ≠ X) → 0 as n → ∞ . n
+)
=1
.
n
Proof Of course, it's important to realize that a discrete space can be the Borel space for metrics other than the discrete metric. This page titled 2.6: Convergence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.6.6
https://stats.libretexts.org/@go/page/10134
2.7: Measure Spaces In this section we discuss positive measure spaces (which include probability spaces) from a more advanced point of view. The sections on Measure Theory and Special Set Structures in the chapter on Foundations are essential prerequisites. On the other hand, if you are not interested in the measure-theoretic aspects of probability, you can safely skip this section.
Positive Measure Definitions

Suppose that S is a set, playing the role of a universal set for a mathematical theory. As we have noted before, S usually comes with a σ-algebra S of admissible subsets of S, so that (S, S) is a measurable space. In particular, this is the case for the model of a random experiment, where S is the set of outcomes and S the σ-algebra of events, so that the measurable space (S, S) is the sample space of the experiment. A probability measure is a special case of a more general object known as a positive measure.

A positive measure on (S, S) is a function μ : S → [0, ∞] that satisfies the following axioms:
1. μ(∅) = 0
2. If {A_i : i ∈ I} is a countable, pairwise disjoint collection of sets in S then
μ(⋃_{i∈I} A_i) = ∑_{i∈I} μ(A_i)    (2.7.1)
The triple (S, S , μ) is a measure space. Axiom (b) is called countable additivity, and is the essential property. The measure of a set that consists of a countable union of disjoint pieces is the sum of the measures of the pieces. Note also that since the terms in the sum are positive, there is no issue with the order of the terms in the sum, although of course, ∞ is a possible value.
Figure 2.7.1 : A union of four disjoint sets
So perhaps the term measurable space for (S, S) makes a little more sense now—a measurable space is one that can have a positive measure defined on it.
Suppose that (S, S , μ) is a measure space. 1. If μ(S) < ∞ then (S, S , μ) is a finite measure space. 2. If μ(S) = 1 then (S, S , μ) is a probability space. So probability measures are positive measures, but positive measures are important beyond the application to probability. The standard measures on the Euclidean spaces are all positive measures: the extension of length for measurable subsets of R, the extension of area for measurable subsets of R , the extension of volume for measurable subsets of R , and the higher dimensional analogues. We will actually construct these measures in the next section on Existence and Uniqueness. In addition, counting measure # is a positive measure on the subsets of a set S . Even more general measures that can take positive and negative values are explored in the chapter on Distributions. 2
3
Properties The following results give some simple properties of a positive measure space (S, S , μ). The proofs are essentially identical to the proofs of the corresponding properties of probability, except that the measure of a set may be infinite so we must be careful to avoid the dreaded indeterminate form ∞ − ∞ . If A,
B ∈ S
, then μ(B) = μ(A ∩ B) + μ(B ∖ A) .
Proof If A,
B ∈ S
and A ⊆ B then
1. μ(B) = μ(A) + μ(B ∖ A) 2. μ(A) ≤ μ(B) Proof Thus μ is an increasing function, relative to the subset partial order ⊆ on S and the ordinary order ≤ on [0, ∞]. In particular, if μ is a finite measure, then μ(A) < ∞ for every A ∈ S . Note also that if A, B ∈ S and μ(B) < ∞ then μ(B ∖ A) = μ(B) − μ(A ∩ B) . In the special case that A ⊆ B , this becomes μ(B ∖ A) = μ(B) − μ(A) . In particular, these results holds for a finite measure and are just like the difference rules for probability. If μ is a finite measure, then μ(A ) = μ(S) − μ(A) . This is the analogue of the complement rule in probability, with but with μ(S) replacing 1. c
The following result is the analogue of Boole's inequality for probability. For a general positive measure, the result is referred to as the subadditive property. Suppose that A
∈ S
i
for i in a countable index set I . Then μ ( ⋃ Ai ) ≤ ∑ μ(Ai ) i∈I
(2.7.2)
i∈I
Proof For a union of sets with finite measure, the inclusion-exclusion formula holds, and the proof is just like the one for probability. Suppose that A
∈ S
i
for each i ∈ I where #(I ) = n , and that μ(A
i)
0 2. If B ∈ S and B ⊆ A then either μ(B) = μ(A) or μ(B) = 0 . A measure space that has no atoms is called non-atomic or diffuse. In probability theory, we are often particularly interested in atoms that are singleton sets. Note that {x} ∈ S is an atom if and only if μ({x}) > 0, since the only subsets of {x} are {x} itself and ∅.
Constructions There are several simple ways to construct new positive measures from existing ones. As usual, we start with a measurable space (S, S ). Suppose that (R, R) is a measurable subspace of (S, S ). If μ is a positive measure on (S, S ) then positive measure on (R, R) . If μ is a finite measure on (S, S ) then μ is a finite measure on (R, R) .
μ
restricted to
R
is a
Proof However, if μ is σ-finite on (S, S ), it is not necessarily true that μ is σ-finite on (R, R) . A counterexample is given below. The previous theorem would apply, in particular, when R = S so that R is a sub σ-algebra of S . Next, a positive multiple of a positive measure gives another positive measure. If μ is a positive measure on (S, S ) and c ∈ (0, ∞), then cμ is also a positive measure on (S, S ). If μ is finite (σ-finite) then cμ is finite (σ-finite) respectively. Proof A nontrivial finite positive measure μ is practically just like a probability measure, and in fact can be re-scaled into a probability measure P, as was done in the section on Probability Measures: Suppose that μ is a positive measure on (S, S ) with 0 < μ(S) < ∞ . Then P defined by P(A) = μ(A)/μ(S) for A ∈ S is a probability measure on (S, S ). Proof Sums of positive measures are also positive measures. If μ is a positive measure on (S, S ). i
(S, S )
for each
i
in a countable index set
2.7.3
I
then
μ =∑
i∈I
μi
is also a positive measure on
https://stats.libretexts.org/@go/page/10135
1. If I is finite and μ is finite for each i ∈ I then μ is finite. 2. If I is finite and μ is σ-finite for each i ∈ I then μ is σ-finite. i i
Proof In the context of the last result, if I is countably infinite and μ is finite for each i ∈ I , then μ is not necessarily σ-finite. A counterexample is given below. In this case, μ is said to be s -finite, but we've had enough definitions, so we won't pursue this one. From scaling and sum properties, note that a positive linear combination of positive measures is a positive measure. The next method is sometimes referred to as a change of variables. i
Suppose that (S, S , μ) is a measure space. Suppose also that (T , T ) is another measurable space and that measurable. Then ν defined as follows is a positive measure on (T , T ) ν (B) = μ [ f
−1
(B)] ,
f : S → T
B ∈ T
is
(2.7.17)
If μ is finite then ν is finite. Proof In the context of the last result, if μ is σ-finite on (S, S ), it is not necessarily true that ν is σ-finite on (T , T ), even if f is one-toone. A counterexample is given below. The takeaway is that σ-finiteness of ν depends very much on the nature of the σ-algebra T . Our next result shows that it's easy to explicitly construct a positive measure on a countably generated σ-algebra, that is, a σalgebra generated by a countable partition. Such σ-algebras are important for counterexamples and to gain insight, and also because many σ-algebras that occur in applications can be constructed from them. Suppose that A = {A : i ∈ I } is a countable partition of S into nonempty sets, and that S = σ(A ) , the generated by the partition. For i ∈ I , define μ(A ) ∈ [0, ∞] arbitrarily. For A = ⋃ A where J ⊆ I , define i
i
j∈J
σ
-algebra
j
μ(A) = ∑ μ(Aj )
(2.7.19)
j∈J
Then μ is a positive measure on (S, S ). 1. The atoms of the measure are the sets of the form A = ⋃ A where J ⊆ I and where μ(A j∈ J . 2. If μ(A ) < ∞ for i ∈ I and I is finite then μ is finite. 3. If μ(A ) < ∞ for i ∈ I and I is countably infinite then μ is σ-finite. j∈J
j
j)
>0
for one and only one
i i
Proof One of the most general ways to construct new measures from old ones is via the theory of integration with respect to a positive measure, which is explored in the chapter on Distributions. The construction of positive measures more or less “from scratch” is considered in the next section on Existence and Uniqueness. We close this discussion with a simple result that is useful for counterexamples. Suppose that the measure space (S, S , μ) has an atom A ∈ S with μ(A) = ∞ . Then the space is not σ-finite. Proof
Measure and Topology Often the spaces that occur in probability and stochastic processes are topological spaces. Recall that a topological space (S, T ) consists of a set S and a topology T on S (the collection of open sets). The topology as well as the measure theory plays an important role, so it's natural to want these two types of structures to be compatible. We have already seen the most important step in this direction: Recall that S = σ(T ) , the σ-algebra generated by the topology, is the Borel σ-algebra on S , named for Émile Borel. Since the complement of an open set is a closed set, S is also the σ-algebra generated by the collection of closed sets. Moreover, S contains countable intersections of open sets (called G sets) and countable unions of closed sets (called F sets). δ
σ
Suppose that (S, T ) is a topological space and let S = σ(T ) be the Borel σ-algebra. A positive measure Borel measure and then (S, S , μ) is a Borel measure space.
2.7.4
μ
on
(S, S )
is a
https://stats.libretexts.org/@go/page/10135
The next definition concerns the subset on which a Borel measure is concentrated, in a certain sense. Suppose that (S, S , μ) is a Borel measure space. The support of μ is supp(μ) = {x ∈ S : μ(U ) > 0 for every open neighborhood U of x}
(2.7.21)
The set supp(μ) is closed. Proof The term Borel measure has different definitions in the literature. Often the topological space is required to be locally compact, Hausdorff, and with a countable base (LCCB). Then a Borel measure μ is required to have the additional condition that μ(C ) < ∞ if C ⊆ S is compact. In this text, we use the term Borel measures in this more restricted sense. Suppose that (S, S , μ) is a Borel measure space corresponding to an LCCB topolgy. Then the space is σ-finite. Proof Here are a couple of other definitions that are important for Borel measures, again linking topology and measure in natural ways. Suppose again that (S, S , μ) is a Borel measure space. 1. μ is inner regular if μ(A) = sup{μ(C ) : C is compact and C ⊆ A} for A ∈ S . 2. μ is outer regular if μ(A) = inf{μ(U ) : U is open and A ⊆ U } for A ∈ S . 3. μ is regular if it is both inner regular and outer regular. The measure spaces that occur in probability and stochastic processes are usually regular Borel spaces associated with LCCB topologies.
Null Sets and Equivalence

Sets of measure 0 in a measure space turn out to be very important precisely because we can often ignore the differences between mathematical objects on such sets. In this discussion, we assume that we have a fixed measure space (S, S, μ). A set A ∈ S is null if μ(A) = 0. Consider a measurable "statement" with x ∈ S as a free variable. (Technically, such a statement is a predicate on S.) If the statement is true for all x ∈ S except for x in a null set, we say that the statement holds almost everywhere on S. This terminology is used often in measure theory and captures the importance of the definition. Let D = {A ∈ S
c
: μ(A) = 0 or μ(A ) = 0}
, the collection of null and co-null sets. Then D is a sub σ-algebra of S .
Proof Of course μ restricted to D is not very interesting since μ(A) = 0 or μ(A) = μ(S) for every A ∈ S . Our next definition is a type of equivalence between sets in S . To make this precise, recall first that the symmetric difference between subsets A and B of S is A △ B = (A ∖ B) ∪ (B ∖ A) . This is the set that consists of points in one of the two sets, but not both, and corresponds to exclusive or. Sets A,
B ∈ S
are equivalent if μ(A △ B) = 0 , and we denote this by A ≡ B .
Thus A ≡ B if and only if μ(A △ B) = μ(A ∖ B) + μ(B ∖ A) = 0 terminology mentioned above, the statement
if and only if
x ∈ A if and only if x ∈ B
μ(A ∖ B) = μ(B ∖ A) = 0
. In the predicate
(2.7.22)
is true for almost every x ∈ S . As the name suggests, the relation ≡ really is an equivalence relation on S and hence S is partitioned into disjoint classes of mutually equivalent sets. Two sets in the same equivalence class differ by a set of measure 0. The relation ≡ is an equivalence relation on S . That is, for A,
B, C ∈ S
,
1. A ≡ A (the reflexive property). 2. If A ≡ B then B ≡ A (the symmetric property).
2.7.5
https://stats.libretexts.org/@go/page/10135
3. If A ≡ B and B ≡ C then A ≡ C (the transitive property). Proof Equivalence is preserved under the standard set operations. If A,
B ∈ S
and A ≡ B then A
c
c
≡B
.
Proof Suppose that A
i,
1. ⋃ 2. ⋂
Bi ∈ S
i∈I
Ai ≡ ⋃
Bi
i∈I
Ai ≡ ⋂
Bi
i∈I i∈I
and that A
i
≡ Bi
for i in a countable index set I . Then
Proof Equivalent sets have the same measure. If A,
B ∈ S
and A ≡ B then μ(A) = μ(B) .
Proof The converse trivially fails, and a counterexample is given below. However, the collection of null sets and the collection of co-null sets do form equivalence classes. Suppose that A ∈ S . 1. If μ(A) = 0 then A ≡ B if and only if μ(B) = 0 . 2. If μ(A ) = 0 then A ≡ B if and only if μ(B ) = 0 . c
c
Proof We can extend the notion of equivalence to measruable functions with a common range space. Thus suppose that (T , T ) is another measurable space. If f , g : S → T are measurable, then (f , g) : S → T × T is measurable with respect the usual product σalgebra T ⊗ T . We assume that the diagonal set D = {(y, y) : y ∈ T } ∈ T ⊗ T , which is almost always true in applications. Measurable functions f ,
g : S → T
are equivalent if μ{x ∈ S : f (x) ≠ g(x)} = 0 . Again we write f
≡g
.
Details In the terminology discussed earlier, f ≡ g means that f (x) = g(x) almost everywhere on S . As with measurable sets, the relation ≡ really does define an equivalence relation on the collection of measurable functions from S to T . Thus, the collection of such functions is partitioned into disjoint classes of mutually equivalent variables. The relation
≡
f , g, h : S → T
is an equivalence relation on the collection of measurable functions from ,
S
to
. That is, for measurable
T
1. f ≡ f (the reflexive property). 2. If f ≡ g then g ≡ f (the symmetric property). 3. If f ≡ g and g ≡ h then f ≡ h (the transitive property). Proof Suppose agaom that f ,
g : S → T
are measurable and that f
≡g
. Then for every B ∈ T , the sets f
−1
(B) ≡ g
−1
(B)
.
Proof Thus if f , g : S → T are measurable and f ≡ g , then by the previous result, associated with f and g , as above. Again, the converse fails with a passion.
νf = νg
where
νf , νg
are the measures on
(T , T )
It often happens that a definition for functions subsumes the corresponding definition for sets, by considering the indicator functons of the sets. So it is with equivalence. In the following result, we can take T = {0, 1} with T the collection of all subsets. Suppose that A,
B ∈ S
. Then A ≡ B if and only if 1
A
≡ 1B
.
Proof
2.7.6
https://stats.libretexts.org/@go/page/10135
Equivalence is preserved under composition. For the next result, suppose that (U , U ) is yet another measurable space. Suppose that f ,
g : S → T
are measurable and that h : T
→ U
is measurable. If f
≡g
then h ∘ f
≡h∘g
.
Proof Suppose again that (S, S , μ) is a measure space. Let V denote the collection of all measurable real-valued random functions from S into R . (As usual, R is given the Borel σ-algebra.) From our previous discussion of measure theory, we know that with the usual definitions of addition and scalar multiplication, (V , +, ⋅) is a vector space. However, in measure theory, we often do not want to distinguish between functions that are equivalent, so it's nice to know that the vector space structure is preserved when we identify equivalent functions. Formally, let [f ] denote the equivalence class generated by f ∈ V , and let W denote the collection of all such equivalence classes. In modular notation, W is V / ≡ . We define addition and scalar multiplication on W by [f ] + [g] = [f + g], c[f ] = [cf ]; (W , +, ⋅)
f, g ∈ V , c ∈ R
(2.7.27)
is a vector space.
Proof Often we don't bother to use the special notation for the equivalence class associated with a function. Rather, it's understood that equivalent functions represent the same object. Spaces of functions in a measure space are studied further in the chapter on Distributions.
Completion Suppose that (S, S , μ) is a measure space and let N = {A ∈ S : μ(A) = 0} denote the collection of null sets of the space. If A ∈ N and B ∈ S is a subset of A , then we know that μ(B) = 0 so B ∈ N also. However, in general there might be subsets of A that are not in S . This leads naturally to the following definition. The measure space (S, S , μ) is complete if A ∈ N and B ⊆ A imply B ∈ S (and hence B ∈ N ). Our goal in this discussion is to show that if (S, S , μ) is a σ-finite measure that is not complete, then it can be completed. That is μ can be extended to σ-algebra that includes all of the sets in S and all subsets of null sets. The first step is to extend the equivalence relation defined in our previous discussion to P(S) . For A, B ⊆ S , define A ≡ B if and only if there exists relation on P(S) : For A, B, C ⊆ S ,
N ∈ N
such that
A △ B ⊆N
. The relation
≡
is an equivalence
1. A ≡ A (the reflexive property). 2. If A ≡ B then B ≡ A (the symmetric property). 3. If A ≡ B and B ≡ C then A ≡ C (the transitive property). Proof So the equivalence relation ≡ partitions P(S) into mutually disjoint equivalence classes. Two sets in an equivalence class differ by a subset of a null set. In particular, A ≡ ∅ if and only if A ⊆ N for some N ∈ N . The extended relation ≡ is preserved under the set operations, just as before. Our next step is to enlarge the σ-algebra S by adding any set that is equivalent to a set in S . Let S = {A ⊆ S : A ≡ B for some B ∈ S } . Then S is a σ-algebra of subsets of S , and in fact is the σ-algebra generated by S ∪ {A ⊆ S : A ≡ ∅} . 0
0
Proof Our last step is to extend μ to a positive measure on the enlarged σ-algebra S . 0
Suppose that A ∈ S so that A ≡ B for some B ∈ S . Define μ
0 (A)
0
= μ(B)
. Then
1. μ is well defined. 2. μ (A) = μ(A) for A ∈ S . 3. μ is a positive measure on S . 0 0 0
0
The measure space (S, S
0,
μ0 )
is complete and is known as the completion of (S, S , μ).
2.7.7
https://stats.libretexts.org/@go/page/10135
Proof
Examples and Exercises As always, be sure to try the computational exercises and proofs yourself before reading the answers and proofs in the text. Recall that a discrete measure space consists of a countable set, with the σ-algebra of all subsets, and with counting measure #.
Counterexamples

The continuity theorem for decreasing events can fail if the events do not have finite measure.

Consider Z with counting measure # on the σ-algebra of all subsets. Let A_n = {z ∈ Z : z ≤ −n} for n ∈ N_+. The continuity theorem fails for (A_1, A_2, …).
2
Proof Equal measure certainly does not imply equivalent sets. Suppose that
is a measure space with the property that there exist disjoint sets . Then A and B are not equivalent.
(S, S , μ)
μ(A) = μ(B) > 0
A, B ∈ S
such that
Proof For a concrete example, we could take S = {0, 1} with counting measure # on σ-algebra of all subsets, and A = {0} , B = {1} . The σ-finite property is not necessarily inherited by a sub-measure space. To set the stage for the counterexample, let R denote the Borel σ-algebra of R, that is, the σ-algebra generated by the standard Euclidean topology. There exists a positive measure λ on (R, R) that generalizes length. The measure λ , known as Lebesgue measure, is constructed in the section on Existence. Next let C denote the σ-algebra of countable and co-countable sets: c
C = {A ⊆ R : A is countable or A is countable}
(2.7.31)
That C is a σ-algebra was shown in the section on measure theory in the chapter on foundations. (R, C )
is a subspace of (R, R). Moreover, (R, R, λ) is σ-finite but (R, C , λ) is not.
Proof A sum of finite measures may not be σ-finite. Let S be a nonempty, finite set with the σ-algebra S of all subsets. Let μ = # be counting measure on (S, S ) for n ∈ N . Then μ is a finite measure for each n ∈ N , but μ = ∑ μ is not σ-finite. n
n
+
+
n
n∈N+
Proof
Basic Properties

In the following problems, μ is a positive measure on the measurable space (S, S).

Suppose that μ(S) = 20 and that A, B ∈ S with μ(A) = 5, μ(B) = 6, and μ(A ∩ B) = 2. Find the measure of each of the following sets:
1. A ∖ B
2. A ∪ B
3. A^c ∪ B^c
4. A^c ∩ B^c
5. A ∪ B^c
Answer
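The computations in exercises like these all reduce to the difference rule and inclusion-exclusion. A Python sketch of the arithmetic, using the data of the previous exercise (the helper name is an arbitrary choice; finite values only):

```python
def union_measure(mA, mB, mAB):
    """μ(A ∪ B) by inclusion-exclusion for a finite measure."""
    return mA + mB - mAB

mS, mA, mB, mAB = 20, 5, 6, 2
print(mA - mAB)                          # μ(A \ B), by the difference rule
print(union_measure(mA, mB, mAB))        # μ(A ∪ B)
print(mS - mAB)                          # μ(A^c ∪ B^c) = μ((A ∩ B)^c)
print(mS - union_measure(mA, mB, mAB))   # μ(A^c ∩ B^c) = μ((A ∪ B)^c)
```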
Suppose that μ(S) = ∞ and that A, B ∈ S with μ(A ∖ B) = 2, μ(B ∖ A) = 3, and μ(A ∩ B) = 4. Find the measure of each of the following sets:
1. A
2. B
3. A ∪ B
4. A^c ∩ B^c
5. A^c ∪ B^c
Answer

Suppose that μ(S) = 10 and that A, B ∈ S with μ(A) = 3, μ(A ∪ B) = 7, and μ(A ∩ B) = 2. Find the measure of each of the following sets:
1. B
2. A ∖ B
3. B ∖ A
4. A^c ∪ B^c
5. A^c ∩ B^c
Answer

Suppose that A, B, C ∈ S with μ(A) = 10, μ(B) = 12, μ(C) = 15, μ(A ∩ B) = 3, μ(A ∩ C) = 4, μ(B ∩ C) = 5, and μ(A ∩ B ∩ C) = 1. Find the measures of the various unions:
1. A ∪ B
2. A ∪ C
3. B ∪ C
4. A ∪ B ∪ C
Answer

This page titled 2.7: Measure Spaces is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.7.9
https://stats.libretexts.org/@go/page/10135
2.8: Existence and Uniqueness Suppose that S is a set and S a σ-algebra of subsets of S , so that (S, S ) is a measurable space. In many cases, it is impossible to define a positive measure μ on S explicitly, by giving a “formula” for computing μ(A) for each A ∈ S . Rather, we often know how the measure μ should work on some class of sets B that generates S . We would then like to know that μ can be extended to a positive measure on S , and that this extension is unique. The purpose of this section is to discuss the basic results on this topic. To understand this section you will need to review the sections on Measure Theory and Special Set Structures in the chapter on Foundations, and the section on Measure Spaces in this chapter. If you are not interested in questions of existence and uniqueness of positive measures, you can safely skip this section.
Basic Theory Positive Measures on Algebras Suppose first that A is an algebra of subsets of S . Recall that this means that A is a collection of subsets that contains closed under complements and finite unions (and hence also finite intersections). Here is our first definition: A positive measure on A is a function μ : A
→ [0, ∞]
S
and is
that satisfies the following properties:
1. μ(∅) = 0 2. If {A : i ∈ I } is a countable, disjoint collection of sets in A and if ⋃ i
i∈I
Ai ∈ A
then
μ ( ⋃ Ai ) = ∑ μ(Ai ) i∈I
(2.8.1)
i∈I
Clearly the definition of a positive measure on an algebra is very similar to the definition for a σ-algebra. If the collection of sets in (b) is finite, then ⋃ A must be in the algebra A . Thus, μ is finitely additive. If the collection is countably infinite, then there is no guarantee that the union is in A . If it is however, then μ must be additive over this collection. Given the similarity, it is not surprising that μ shares many of the basic properties of a positive measure on a σ-algebra, with proofs that are almost identical. i∈I
If A,
B ∈ A
i
, then μ(B) = μ(A ∩ B) + μ(B ∖ A) .
Proof If A,
B ∈ A
and A ⊆ B then
1. μ(B) = μ(A) + μ(B ∖ A) 2. μ(A) ≤ μ(B) Proof Thus μ is increasing, relative to the subset partial order ⊆ on A and the ordinary order ≤ on [0, ∞]. Note also that if A, B ∈ A and μ(B) < ∞ then μ(B ∖ A) = μ(B) − μ(A ∩ B) . In the special case that A ⊆ B , this becomes μ(B ∖ A) = μ(B) − μ(A) . If μ(S) < ∞ then μ(A ) = μ(S) − μ(A) . These are the familiar difference and complement rules. c
The following result is the subadditive property for a positive measure μ on an algebra A . Suppose that {A
i
: i ∈ I}
is a countable collection of sets in A and that ⋃
i∈I
Ai ∈ A
. Then
μ ( ⋃ Ai ) ≤ ∑ μ(Ai ) i∈I
(2.8.2)
i∈I
Proof For a finite union of sets with finite measure, the inclusion-exclusion formula holds, and the proof is just like the one for a probability measure. Suppose that {A
i
: i ∈ I}
is a finite collection of sets in A where #(I ) = n ∈ N , and that μ(A +
2.8.1
i)
s} is an open neighborhood of ∞. That is, T is the one-point compactification of T . The reason for this is to preserve the meaning of time converging to infinity. That is, if (t , t , …) is a sequence in T then t → ∞ as n → ∞ if and only if, for every t ∈ T there exists m ∈ N such that t > t for n > m . We then give T the Borel σ-algebra T as before. In discrete time, this is once again the discrete σ-algebra, so that all subsets are measurable. In both cases, we now have an enhanced time space is (T , T ). A random variable τ taking values in T is called a random time. ∞
∞
∞
∞
∞ 1
+
n
2
∞
n
∞
∞
∞
Suppose that F = {F for each t ∈ T .
t
: t ∈ T}
∞
is a filtration on
∞
. A random time
(Ω, F )
τ
is a stopping time relative to
F
if
{τ ≤ t} ∈ Ft
In a sense, a stopping time is a random time that does not require that we see into the future. That is, we can tell whether or not τ ≤ t from our information at time t . The term stopping time comes from gambling. Consider a gambler betting on games of chance. The gambler's decision to stop gambling at some point in time and accept his fortune must define a stopping time. That is, the gambler can base his decision to stop gambling on all of the information that he has at that point in time, but not on what will happen in the future. The terms Markov time and optional time are sometimes used instead of stopping time. If τ is a stopping time relative to a filtration, then it is also a stoping time relative to any finer filtration:
2.11.4
https://stats.libretexts.org/@go/page/10139
Suppose that F = {F : t ∈ T } and G = {G : t ∈ T } are filtrations on (Ω, F ), and that G is finer than F. If a random time τ is a stopping time relative to F then τ is a stopping time relative to G. t
t
Proof So, the finer the filtration, the larger the collection of stopping times. In fact, every random time is a stopping time relative to the finest filtration F where F = F for every t ∈ T . But this filtration corresponds to having complete information from the beginning of time, which of course is usually not sensible. At the other extreme, for the coarsest filtration F where F = {Ω, ∅} for every t ∈ T , the only stopping times are constants. That is, random times of the form τ (ω) = t for every ω ∈ Ω , for some t ∈ T . t
t
∞
Suppose again that F = {F : t ∈ T } is a filtration on (Ω, F ). A random time τ is a stopping time relative to F if and only if {τ > t} ∈ F for each t ∈ T . t
t
Proof Suppose again that F = {F
t
1. {τ 2. {τ 3. {τ
is a filtration on (Ω, F ), and that τ is a stopping time relative to F. Then
: t ∈ T}
for every t ∈ T . for every t ∈ T . for every t ∈ T .
< t} ∈ Ft ≥ t} ∈ Ft = t} ∈ Ft
Proof Note that when T = N , we actually showed that {τ < t} ∈ F and {τ ≥ t} ∈ F . The converse to part (a) (or equivalently (b)) is not true, but in continuous time there is a connection to the right-continuous refinement of the filtration. t−1
t−1
Suppose that T = [0, ∞) and that F = {F : t ∈ [0, ∞)} is a filtration on relative to F if and only if {τ < t} ∈ F for every t ∈ [0, ∞). t
+
. A random time
(Ω, F )
τ
is a stopping time
t
Proof If F = {F : t ∈ [0, ∞)} is a filtration and τ is a random time that satisfies {τ < t} ∈ F for every t ∈ T , then some authors call τ a weak stopping time or say that τ is weakly optional for the filtration F . But to me, the increase in jargon is not worthwhile, and it's better to simply say that τ is a stopping time for the filtration F . The following corollary now follows. t
t
+
Suppose that T = [0, ∞) and that F = {F : t ∈ [0, ∞)} is a right-continuous filtration. A random time τ is a stopping time relative to F if and only if {τ < t} ∈ F for every t ∈ [0, ∞). t
t
The converse to part (c) of the result above holds in discrete time. Suppose that T = N and that F = {F : n ∈ N} is a filtration on only if {τ = n} ∈ F for every n ∈ N . n
. A random time
(Ω, F )
τ
is a stopping time for
F
if and
n
Proof
Basic Constructions As noted above, a constant element of T
∞
Suppose s ∈ T
∞
is a stopping time, but not a very interesting one.
and that τ (ω) = s for all ω ∈ Ω . The τ is a stopping time relative to any filtration on (Ω, F ).
Proof If the filtration {F : t ∈ T } is complete, then a random time that is almost certainly a constant is also a stopping time. The following theorems give some basic ways of constructing new stopping times from ones we already have. t
Suppose that F = {F : t ∈ T } is a filtration on (Ω, F ) and that τ and τ are stopping times relative to F. Then each of the following is also a stopping time relative to F: t
1. τ 2. τ 3. τ
1
∨ τ2 = max{ τ1 , τ2 }
1
∧ τ2 = min{ τ1 , τ2 }
1
+ τ2
1
2.11.5
2
https://stats.libretexts.org/@go/page/10139
Proof It follows that if (τ time relative to F:
1,
τ2 , … , τn )
is a finite sequence of stopping times relative to F, then each of the following is also a stopping
τ1 ∨ τ2 ∨ ⋯ ∨ τn τ1 ∧ τ2 ∧ ⋯ ∧ τn τ1 + τ2 + ⋯ + τn
We have to be careful when we try to extend these results to infinite sequences. Suppose that F = {F Then sup{τ : n ∈ N
t
is a filtration on (Ω, F ), and that (τ is also a stopping time relative to F.
: t ∈ T}
+}
n
n
is a sequence of stopping times relative to F.
: n ∈ N+ )
Proof Suppose that F = {F : t ∈ T } is a filtration on (Ω, F ), and that relative to F. Then lim τ is a stopping time relative to F. t
n→∞
(τn : n ∈ N+ )
is an increasing sequence of stopping times
n
Proof Suppose that T = [0, ∞) and that F = {F : t ∈ T } is a filtration on (Ω, F ). If times relative to F, then each of the following is a stopping time relative to F : t
(τn : n ∈ N+ )
is a sequence of stopping
+
1. inf {τ : n ∈ N 2. lim inf τ 3. lim sup τ
+}
n
n→∞
n
n→∞
n
Proof As a simple corollary, we have the following results: Suppose that T = [0, ∞) and that F = {F : t ∈ T } is a right-continuous filtration on (Ω, F ). If sequence of stopping times relative to F, then each of the following is a also a stopping time relative to F: t
(τn : n ∈ N+ )
is a
1. inf {τ : n ∈ N 2. lim inf τ 3. lim sup τ
+}
n
n→∞
n
n→∞
n
The σ-Algebra of a Stopping Time Consider again the general setting of a filtration F = {F : t ∈ T } on the sample space (Ω, F ), and suppose that τ is a stopping time relative to F. We want to define the σ-algebra F of events up to the random time τ , analagous to F the σ-algebra of events up to a fixed time t ∈ T . Here is the appropriate definition: t
τ
Suppose that
F = { Ft : t ∈ T }
t
is a filtration on (Ω, F ) and that τ is a stopping time relative to . Then F is a σ-algebra.
Fτ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft for all t ∈ T }
F
.
Define
τ
Proof Thus, an event A is in F if we can determine if A and τ ≤ t both occurred given our information at time t . If τ is constant, then F reduces to the corresponding member of the original filtration, which clealry should be the case, and is additional motivation for the definition. τ
τ
Suppose again that F =F . τ
F = { Ft : t ∈ T }
is a filtration on
. Fix
(Ω, F )
s ∈ T
and define
τ (ω) = s
for all
ω ∈ Ω
. Then
s
Proof Clearly, if we have the information available in F , then we should know the value of τ itself. This is also true: τ
Suppose again that F = {F : t ∈ T } is a filtration on measureable with respect to F . t
(Ω, F )
and that
τ
is a stopping time relative to
F
. Then
τ
is
τ
2.11.6
https://stats.libretexts.org/@go/page/10139
Proof Here are other results that relate the σ-algebra of a stopping time to the original filtration. Suppose again that F = {F t ∈ T ,
t
1. A ∩ {τ 2. A ∩ {τ
: t ∈ T}
is a filtration on (Ω, F ) and that τ is a stopping time relative to F. If A ∈ F then for τ
< t} ∈ Ft = t} ∈ Ft
Proof The σ-algebra of a stopping time relative to a filtration is related to the σ-algebra of the stopping time relative to a finer filtration in the natural way. Suppose that F = {F : t ∈ T } and G = {G time relative to F then F ⊆ G . t
t
τ
: t ∈ T}
are filtrations on (Ω, F ) and that G is finer than F. If τ is a stopping
τ
Proof When two stopping times are ordered, their σ-algebras are also ordered. Suppose that F ⊆F . ρ
F = { Ft : t ∈ T }
is a filtration on
(Ω, F )
and that
ρ
and
τ
are stopping times for
F
with
ρ ≤τ
. Then
τ
Proof Suppose again that F = {F : t ∈ T } is a filtration on following events is in F and in F . t
τ
, and that
(Ω, F )
ρ
and
τ
are stopping times for F. Then each of the
ρ
1. {ρ < τ } 2. {ρ = τ } 3. {ρ > τ } 4. {ρ ≤ τ } 5. {ρ ≥ τ } Proof We can “stop” a filtration at a stopping time. In the next subsection, we will stop a stochastic process in the same way. Suppose again that F = {F F =F . Then F = {F t
τ
t∧τ
is a filtration on (Ω, F ), and that : t ∈ T } is a filtration and is coarser than F .
: t ∈ T}
t
τ
τ
t
τ
is a stopping times for
F
. For
t ∈ T
define
Proof
Stochastic Processes As usual, the most common setting is when we have a stochastic process X = {X : t ∈ T } defined on our sample space (Ω, F ) and with state space (S, S ). If τ is a random time, we are often interested in the state X at the random time. But there are two issues. First, τ may take the value infinity, in which case X is not defined. The usual solution is to introduce a new “death state” δ , and define X = δ . The σ-algebra S on S is extended to S = S ∪ {δ} in the natural way, namely S = σ(S ∪ {δ}) . t
τ
τ
∞
δ
δ
Our other problem is that we naturally expect X to be a random variable (that is, measurable), just as X is a random variable for a deterministic t ∈ T . Moreover, if X is adapted to a filtration F = {F : t ∈ T } , then we would naturally also expect X to be measurable with respect to F , just as X is measurable with respect to F for deterministic t ∈ T . But this is not obvious, and in fact is not true without additional assumptions. Note that X is a random state at a random time, and so depends on an outcome ω ∈ Ω in two ways: X (ω). τ
t
t
τ
t
τ
t
τ
τ(ω)
Suppose that X = {X : t ∈ T } is a stochastic process on the sample space (Ω, F ) with state space (S, S ), and that measurable. If τ is a finite random time, then X is measurable. That is, X is a random variable with values in S . t
τ
X
is
τ
Proof
2.11.7
https://stats.libretexts.org/@go/page/10139
This result is one of the main reasons for the definition of a measurable process in the first place. Sometimes we literally want to stop the random process at a random time τ . As you might guess, this is the origin of the term stopping time. Suppose again that X = {X : t ∈ T } is a stochastic process on the sample space (Ω, F ) with state space (S, S ), and that X is measurable. If τ is a random time, then the process X = {X : t ∈ T } defined by X = X for t ∈ T is the process X stopped at τ . t
τ
τ
τ
t
t
t∧τ
Proof When the original process is progressively measurable, so is the stopped process. Suppose again that X = {X : t ∈ T } is a stochastic process on the sample space (Ω, F ) with state space (S, S ), and that X is progressively measurable with respect to a filtration F = {F : t ∈ T } . If τ is a stopping time relative to F, then the stopped process X = {X : t ∈ T } is progressively measurable with respect to the stopped filtration F . t
t
τ
τ
τ
t
Since F is finer than F , it follows that X is also progressively measurable with respect to F. τ
τ
Suppose again that X = {X : t ∈ T } is a stochastic process on the sample space (Ω, F ) with state space (S, S ), and that X is progressively measurable with respect to a filtration F = {F : t ∈ T } on (Ω, F ). If τ is a finite stopping time relative to F then X is measurable with respect to F . t
t
τ
τ
For many random processes, the first time that the process enters or hits a set of states is particularly important. In the discussion that follows, let T = {t ∈ T : t > 0} , the set of positive times. +
Suppose that X = {X
t
1. ρ 2. τ
: t ∈ T}
is a stochastic process on (Ω, F ) with state space (S, S ). For A ∈ S , define
, the first entry time to A . , the first hitting time to A .
A
= inf{t ∈ T : Xt ∈ A}
A
= inf{t ∈ T+ : Xt ∈ A}
As usual, inf(∅) = ∞ so ρ = ∞ if X ∉ A for all t ∈ T , so that the process never enters A , and t ∈ T , so that the process never hits A . In discrete time, it's easy to see that these are stopping times. A
t
if
τA = ∞
Xt ∉ A
for all
+
Suppose that {X : n ∈ N} is a stochastic process on (Ω, F ) with state space (S, S ). If A ∈ S then τ and ρ are stopping times relative to the natural filtration F . n
A
A
0
Proof So of course in discrete time, τ and ρ are stopping times relative to any filtration F to which X is adapted. You might think that τ and ρ should always be a stopping times, since τ ≤ t if and only if X ∈ A for some s ∈ T with s ≤ t , and ρ ≤ t if and only if X ∈ A for some s ∈ T with s ≤ t . It would seem that these events are known if one is allowed to observe the process up to time t . The problem is that when T = [0, ∞) , these are uncountable unions, so we need to make additional assumptions on the stochastic process X or the filtration F, or both. A
A
A
A
A
s
+
A
s
Suppose that S has an LCCB topology, and that S is the σ-algebra of Borel sets. Suppose also that X = {X right continuous and has left limits. Then τ and ρ are stopping times relative to F for every open A ∈ S .
t
: t ∈ [0, ∞)}
is
0
A
A
+
Here is another result that requires less of the stochastic process X, but more of the filtration F. Suppose that X = {X : t ∈ [0, ∞)} is a stochastic process on (Ω, F ) that is progressively measurable relative to a complete, right-continuous filtration F = {F : t ∈ [0, ∞)} . If A ∈ S then ρ and τ are stopping times relative to F. t
t
A
A
This page titled 2.11: Filtrations and Stopping Times is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
2.11.8
https://stats.libretexts.org/@go/page/10139
CHAPTER OVERVIEW 3: Distributions Recall that a probability distribution is just another name for a probability measure. Most distributions are associated with random variables, and in fact every distribution can be associated with a random variable. In this chapter we explore the basic types of probability distributions (discrete, continuous, mixed), and the ways that distributions can be defined using density functions, distribution functions, and quantile functions. We also study the relationship between the distribution of a random vector and the distributions of its components, conditional distributions, and how the distribution of a random variable changes when the variable is transformed. In the advanced sections, we study convergence in distribution, one of the most important types of convergence. We also construct the abstract integral with respect to a positive measure and study the basic properties of the integral. This leads in turn to general (signed measures), absolute continuity and singularity, and the existence of density functions. Finally, we study various vector spaces of functions that are defined by integral pro 3.1: Discrete Distributions 3.2: Continuous Distributions 3.3: Mixed Distributions 3.4: Joint Distributions 3.5: Conditional Distributions 3.6: Distribution and Quantile Functions 3.7: Transformations of Random Variables 3.8: Convergence in Distribution 3.9: General Distribution Functions 3.10: The Integral With Respect to a Measure 3.11: Properties of the Integral 3.12: General Measures 3.13: Absolute Continuity and Density Functions 3.14: Function Spaces
This page titled 3: Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1
3.1: Discrete Distributions Basic Theory Definitions and Basic Properties As usual, our starting point is a random experiment modeled by a probability space (S, S , P). So to review, S is the set of outcomes, S the collection of events, and P the probability measure on the sample space (S, S ). We use the terms probability measure and probability distribution synonymously in this text. Also, since we use a general definition of random variable, every probability measure can be thought of as the probability distribution of a random variable, so we can always take this point of view if we like. Indeed, most probability measures naturally have random variables associated with them. Recall that the sample space (S, S ) is discrete if S is countable and S = P(S) is the collection of all subsets of case, P is a discrete distribution and (S, S , P) is a discrete probabiity space.
S
. In this
For the remainder or our discussion we assume that (S, S , P) is a discrete probability space. In the picture below, the blue dots are intended to represent points of positive probability.
Figure 3.1.1 : A discrete distribution
It's very simple to describe a discrete probability distribution with the function that assigns probabilities to the individual points in S. The function f on S defined by f (x) = P({x}) for x ∈ S is the probability density function of P, and satisfies the following properties: 1. f (x) ≥ 0, x ∈ S 2. ∑ f (x) = 1 3. ∑ f (x) = P(A) for A ⊆ S x∈S
x∈A
Proof Property (c) is particularly important since it shows that a discrete probability distribution is completely determined by its probability density function. Conversely, any function that satisfies properties (a) and (b) can be used to construct a discrete probability distribution on S via property (c). A nonnegative function f on S that satisfies ∑ defined as follows is a probability measure on S .
x∈S
f (x) = 1
is a (discrete) probability density function on
P(A) = ∑ f (x),
A ⊆S
S
, and then
P
(3.1.1)
x∈A
Proof Technically, f is the density of P relative to counting measure section on absolute continuity and density functions.
#
on S . The technicalities are discussed in detail in the advanced
3.1.1
https://stats.libretexts.org/@go/page/10141
Figure 3.1.2 : A discrete distribution is completely determined by its probability density function.
The set of outcomes S is often a countable subset of some larger set, such as R for some n ∈ N . But not always. We might want to consider a random variable with values in a deck of cards, or a set of words, or some other discrete population of objects. Of course, we can always map a countable set S one-to-one into a Euclidean set, but it might be contrived or unnatural to do so. In any event, if S is a subset of a larger set, we can always extend a probability density function f , if we want, to the larger set by defining f (x) = 0 for x ∉ S . Sometimes this extension simplifies formulas and notation. Put another way, the “set of values” is often a convenience set that includes the points with positive probability, but perhaps other points as well. n
+
Suppose that f is a probability density function on S . Then {x ∈ S : f (x) > 0} is the support set of the distribution. Values of x that maximize the probability density function are important enough to deserve a name. Suppose again that f is a probability density function on S . An element x ∈ S that maximizes f is a mode of the distribution. When there is only one mode, it is sometimes used as a measure of the center of the distribution. A discrete probability distribution defined by a probability density function f is equivalent to a discrete mass distribution, with total mass 1. In this analogy, S is the (countable) set of point masses, and f (x) is the mass of the point at x ∈ S . Property (c) in (2) above simply means that the mass of a set A can be found by adding the masses of the points in A . But let's consider a probabilistic interpretation, rather than one from physics. We start with a basic random variable X for an experiment, defined on a probability space (Ω, F , P). Suppose that X has a discrete distribution on S with probability density function f . So in this setting, f (x) = P(X = x) for x ∈ S . We create a new, compound experiment by conducting independent repetitions of the original experiment. So in the compound experiment, we have a sequence of independent random variables (X , X , …) each with the same distribution as X; in statistical terms, we are sampling from the distribution of X. Define 1
2
1 fn (x) =
n
1 # {i ∈ {1, 2, … , n} : Xi = x} =
n
n
∑ 1(Xi = x),
x ∈ S
(3.1.3)
i=1
Note that f (x) is the relative frequency of outcome x ∈ S in the first n runs. Note also that f (x) is a random variable for the compound experiment for each x ∈ S . By the law of large numbers, f (x) should converge to f (x), in some sense, as n → ∞ . The function f is called the empirical probability density function, and it is in fact a (random) probability density function, since it satisfies properties (a) and (b) of (2). Empirical probability density functions are displayed in most of the simulation apps that deal with discrete variables. n
n
n
n
It's easy to construct discrete probability density functions from other nonnegative functions defined on a countable set. Suppose that g is a nonnegative function defined on S , and let c = ∑ g(x)
(3.1.4)
x∈S
If 0 < c < ∞ , then the function f defined by f (x) =
1 c
g(x)
for x ∈ S is a discrete probability density function on S .
Proof Note that since we are assuming that g is nonnegative, c = 0 if and only if g(x) = 0 for every x ∈ S . At the other extreme, c = ∞ could only occur if S is infinite (and the infinite series diverges). When 0 < c < ∞ (so that we can construct the probability density function f ), c is sometimes called the normalizing constant. This result is useful for constructing probability density functions with desired functional properties (domain, shape, symmetry, and so on).
3.1.2
https://stats.libretexts.org/@go/page/10141
Conditional Densities Suppose again that X is a random variable on a probability space (Ω, F , P) and that X takes values in our discrete set S . The distributionn of X (and hence the probability density function of X) is based on the underlying probability measure on the sample space (Ω, F ). This measure could be a conditional probability measure, conditioned on a given event E ∈ F (with P(E) > 0 ). The probability density function in this case is f (x ∣ E) = P(X = x ∣ E),
x ∈ S
(3.1.6)
Except for notation, no new concepts are involved. Therefore, all results that hold for discrete probability density functions in general have analogies for conditional discrete probability density functions. For fixed E ∈ F with P(E) > 0 the function x ↦ f (x ∣ E) is a discrete probability density function on S That is, 1. f (x ∣ E) ≥ 0 for x ∈ S . 2. ∑ f (x ∣ E) = 1 3. ∑ f (x ∣ E) = P(X ∈ A ∣ E) for ⊆ S x∈S
x∈A
Proof In particular, the event E could be an event defined in terms of the random variable X itself. Suppose that B ⊆ S and P(X ∈ B) > 0 . The conditional probability density function of X given X ∈ B is the function on B defined by f (x) f (x ∣ X ∈ B) =
f (x) =
P(X ∈ B)
, ∑
y∈B
x ∈ B
(3.1.7)
f (y)
Proof Note that the denominator is simply the normalizing constant for f restricted to B . Of course, f (x ∣ B) = 0 for x ∈ B . c
Conditioning and Bayes' Theorem Suppose again that X is a random variable defined on a probability space (Ω, F , P) and that X has a discrete distribution on S , with probability density function f . We assume that f (x) > 0 for x ∈ S so that the distribution has support S . The versions of the law of total probability and Bayes' theorem given in the following theorems follow immediately from the corresponding results in the section on Conditional Probability. Only the notation is different. Law of Total Probability. If E ∈ F is an event then P(E) = ∑ f (x)P(E ∣ X = x)
(3.1.8)
x∈S
Proof This result is useful, naturally, when the distribution of X and the conditional probability of E given the values of X are known. When we compute P(E) in this way, we say that we are conditioning on X. Note that P(E), as expressed by the formula, is a weighted average of P(E ∣ X = x) , with weight factors f (x), over x ∈ S . Bayes' Theorem. If E ∈ F is an event with P(E) > 0 then f (x)P(E ∣ X = x) f (x ∣ E) =
, ∑
y∈S
x ∈ S
(3.1.10)
f (y)P(E ∣ X = y)
Proof Bayes' theorem, named for Thomas Bayes, is a formula for the conditional probability density function of X given E . Again, it is useful when the quantities on the right are known. In the context of Bayes' theorem, the (unconditional) distribution of X is referred to as the prior distribution and the conditional distribution as the posterior distribution. Note that the denominator in Bayes' formula is P(E) and is simply the normalizing constant for the function x ↦ f (x)P(E ∣ X = x) .
3.1.3
https://stats.libretexts.org/@go/page/10141
Examples and Special Cases We start with some simple (albeit somewhat artificial) discrete distributions. After that, we study three special parametric models— the discrete uniform distribution, hypergeometric distributions, and Bernoulli trials. These models are very important, so when working the computational problems that follow, try to see if the problem fits one of these models. As always, be sure to try the problems yourself before looking at the answers and proofs in the text.
Simple Discrete Distributions Let g be the function defined by g(n) = n(10 − n) for n ∈ {1, 2, … , 9}. 1. Find the probability density function f that is proportional to g as in . 2. Sketch the graph of f and find the mode of the distribution. 3. Find P(3 ≤ N ≤ 6) where N has probability density function f . Answer Let g be the function defined by g(n) = n
2
(10 − n)
for n ∈ {1, 2 … , 10}.
1. Find the probability density function f that is proportional to g . 2. Sketch the graph of f and find the mode of the distribution. 3. Find P(3 ≤ N ≤ 6) where N has probability density function f . Answer Let g be the function defined by g(x, y) = x + y for (x, y) ∈ {1, 2, 3} . 2
1. Sketch the domain of g . 2. Find the probability density function f that is proportional to g . 3. Find the mode of the distribution. 4. Find P(X > Y ) where (X, Y ) has probability density function f . Answer Let g be the function defined by g(x, y) = xy for (x, y) ∈ {(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)}. 1. Sketch the domain of g . 2. Find the probability density function f that is proportional to g . 3. Find the mode of the distribution. 4. Find P [(X, Y ) ∈ {(1, 2), (1, 3), (2, 2), (2, 3)}]where (X, Y ) has probability density function f . Answer Consider the following game: An urn initially contains one red and one green ball. A ball is selected at random, and if the ball is green, the game is over. If the ball is red, the ball is returned to the urn, another red ball is added, and the game continues. At each stage, a ball is selected at random, and if the ball is green, the game is over. If the ball is red, the ball is returned to the urn, another red ball is added, and the game continues. Let X denote the length of the game (that is, the number of selections required to obtain a green ball). Find the probability density function of X. Solution
Discrete Uniform Distributions An element X is chosen at random from a finite set S . The distribution of X is the discrete uniform distribution on S . 1. X has probability density function f given by f (x) = 1/#(S) for x ∈ S . 2. P(X ∈ A) = #(A)/#(S) for A ⊆ S . Proof Many random variables that arise in sampling or combinatorial experiments are transformations of uniformly distributed variables. The next few exercises review the standard methods of sampling from a finite population. The parameters m and n are positive inteters.
3.1.4
https://stats.libretexts.org/@go/page/10141
Suppose that n elements are chosen at random, with replacement from a set D with m elements. Let X denote the ordered sequence of elements chosen. Then X is uniformly distributed on the Cartesian power set S = D , and has probability density function f given by n
1 f (x) = m
n
,
x ∈ S
(3.1.13)
Proof Suppose that n elements are chosen at random, without replacement from a set D with m elements (so n ≤ m ). Let X denote the ordered sequence of elements chosen. Then X is uniformly distributed on the set S of permutations of size n chosen from D, and has probability density function f given by 1 f (x) = m
(n)
,
x ∈ S
(3.1.14)
Proof Suppose that n elements are chosen at random, without replacement, from a set D with m elements (so n ≤ m ). Let W denote the unordered set of elements chosen. Then W is uniformly distributed on the set T of combinations of size n chosen from D, and has probability density function f given by 1 f (w) =
m
(
n
,
w ∈ T
(3.1.15)
)
Proof Suppose that X is uniformly distributed on a finite set distribution of X given X ∈ B is uniform on B .
S
and that
B
is a nonempty subset of
S
. Then the conditional
Proof
Hypergeometric Models Suppose that a dichotomous population consists of m objects of two different types: r of the objects are type 1 and m − r are type 0. Here are some typical examples: The objects are persons, each either male or female. The objects are voters, each either a democrat or a republican. The objects are devices of some sort, each either good or defective. The objects are fish in a lake, each either tagged or untagged. The objects are balls in an urn, each either red or green. A sample of n objects is chosen at random (without replacement) from the population. Recall that this means that the samples, either ordered or unordered are equally likely. Note that this probability model has three parameters: the population size m, the number of type 1 objects r, and the sample size n . Each is a nonnegative integer with r ≤ m and n ≤ m . Now, suppose that we keep track of order, and let X denote the type of the ith object chosen, for i ∈ {1, 2, … , n}. Thus, X is an indicator variable (that is, a variable that just takes values 0 and 1). i
X = (X1 , X2 , … , Xn )
i
has probability density function f given by (y)
r
(n−y)
(m − r)
f (x1 , x2 , … , xn ) =
, m
(n)
n
(x1 , x2 , … , xn ) ∈ {0, 1 } where y = x1 + x2 + ⋯ + xn
(3.1.17)
Proof Note that the value of f (x , x , … , x ) depends only on y = x + x + ⋯ + x , and hence is unchanged if (x , x , … , x ) is permuted. This means that (X , X , … , X ) is exchangeable. In particular, the distribution of X is the same as the distribution of X , so P(X = 1) = . Thus, the variables are identically distributed. Also the distribution of (X , X ) is the same as the 1
2
1
n
2
1
2
n
n
1
2
n
i
r
1
i
distribution of
i
m
(X1 , X2 )
, so
r(r−1)
P(Xi = 1, Xj = 1) =
m(m−1)
. Thus,
Xi
and
Xj
j
are not independent, and in fact are negatively
correlated.
3.1.5
https://stats.libretexts.org/@go/page/10141
Now let Y denote the number of type 1 objects in the sample. Note that Y sum of indicator variables. Y
n
=∑
i=1
Xi
. Any counting variable can be written as a
has probability density function g given by. r
m−r
y
n−y
( )( g(y) =
m
(
n
) ,
y ∈ {0, 1, … , n}
(3.1.18)
)
1. g(y − 1) < g(y) if and only if y < t where t = (r + 1)(n + 1)/(m + 2) . 2. If t is not a positive integer, there is a single mode at ⌊t⌋. 3. If t is a positive integer, then there are two modes, at t − 1 and t . Proof The distribution defined by the probability density function in the last result is the hypergeometric distributions with parameters m, r , and n . The term hypergeometric comes from a certain class of special functions, but is not particularly helpful in terms of remembering the model. Nonetheless, we are stuck with it. The set of values {0, 1, … , n} is a convenience set: it contains all of the values that have positive probability, but depending on the parameters, some of the values may have probability 0. Recall our convention for binomial coefficients: for j, k ∈ N , ( ) = 0 if j > k . Note also that the hypergeometric distribution is unimodal: the probability density function increases and then decreases, with either a single mode or two adjacent modes. k
+
j
We can extend the hypergeometric model to a population of three types. Thus, suppose that our population consists of m objects; r of the objects are type 1, s are type 2, and m − r − s are type 0. Here are some examples: The objects are voters, each a democrat, a republican, or an independent. The objects are cicadas, each one of three species: tredecula, tredecassini, or tredecim The objects are peaches, each classified as small, medium, or large. The objects are faculty members at a university, each an assistant professor, or an associate professor, or a full professor. Once again, a sample of n objects is chosen at random (without replacement). The probability model now has four parameters: the population size m, the type sizes r and s , and the sample size n . All are nonnegative integers with r + s ≤ m and n ≤ m . Moreover, we now need two random variables to keep track of the counts for the three types in the sample. Let Y denote the number of type 1 objects in the sample and Z the number of type 2 objects in the sample. (Y , Z)
has probability density function h given by r
s
y
z
m−r−s
( )( )( h(y, z) =
n−y−z
m
(
n
) ,
2
(y, z) ∈ {0, 1, … , n} with y + z ≤ n
(3.1.19)
)
Proof The distribution defined by the density function in the last exericse is the bivariate hypergeometric distribution with parameters m, r , s , and n . Once again, the domain given is a convenience set; it includes the set of points with positive probability, but depending on the parameters, may include points with probability 0. Clearly, the same general pattern applies to populations with even more types. However, because of all of the parameters, the formulas are not worthing remembering in detail; rather, just note the pattern, and remember the combinatorial meaning of the binomial coefficient. The hypergeometric model will be revisited later in this chapter, in the section on joint distributions and in the section on conditional distributions. The hypergeometric distribution and the multivariate hypergeometric distribution are studied in detail in the chapter on Finite Sampling Models. This chapter contains a variety of distributions that are based on discrete uniform distributions.
Bernoulli Trials A Bernoulli trials sequence is a sequence (X , X , …) of independent, identically distributed indicator variables. Random variable X is the outcome of trial i, where in the usual terminology of reliability, 1 denotes success while 0 denotes failure, The process is named for Jacob Bernoulli. Let p = P(X = 1) ∈ [0, 1] denote the success parameter of the process. Note that the indicator variables in the hypergeometric model satisfy one of the assumptions of Bernoulli trials (identical distributions) but not the other (independence). 1
2
i
i
X = (X1 , X2 , … , Xn )
has probability density function f given by
3.1.6
https://stats.libretexts.org/@go/page/10141
y
n−y
f (x1 , x2 , … , xn ) = p (1 − p )
,
n
(x1 , x2 , … , xn ) ∈ {0, 1 } , where y = x1 + x2 + ⋯ + xn
(3.1.20)
Proof Now let Y denote the number of successes in the first n trials. Note that Y = ∑ X , so we see again that a complicated random variable can be written as a sum of simpler ones. In particular, a counting variable can always be written as a sum of indicator variables. n
i=1
Y
i
has probability density function g given by g(y) = (
n y n−y ) p (1 − p ) ,
y ∈ {0, 1, … , n}
(3.1.21)
y
1. g(y − 1) < g(y) if and only if y < t , wher t = (n + 1)p . 2. If t is not a positive integer, there is a single mode at ⌊t⌋. 3. If t is a positive integer, then there are two modes, at t − 1 and t . Proof The distribution defined by the probability density function in the last theorem is called the binomial distribution with parameters n and p. The distribution is unimodal: the probability density function at first increases and then decreases, with either a single mode or two adjacent modes. The binomial distribution is studied in detail in the chapter on Bernoulli Trials. Suppose that p > 0 and let N denote the trial number of the first success. Then N has probability density function h given by n−1
h(n) = (1 − p )
p,
n ∈ N+
(3.1.22)
The probability density function h is decreasing and the mode is n = 1 . Proof The distribution defined by the probability density function in the last exercise is the geometric distribution on N with parameter p . The geometric distribution is studied in detail in the chapter on Bernoulli Trials. +
Sampling Problems In the following exercises, be sure to check if the problem fits one of the general models above. An urn contains 30 red and 20 green balls. A sample of 5 balls is selected at random, without replacement. Let number of red balls in the sample.
Y
denote the
1. Compute the probability density function of Y explicitly and identify the distribution by name and parameter values. 2. Graph the probability density function and identify the mode(s). 3. Find P(Y > 3) . Answer In the ball and urn experiment, select sampling without replacement and set m = 50 , r = 30 , and n = 5 . Run the experiment 1000 times and note the agreement between the empirical density function of Y and the probability density function. An urn contains 30 red and 20 green balls. A sample of 5 balls is selected at random, with replacement. Let number of red balls in the sample.
Y
denote the
1. Compute the probability density function of Y explicitly and identify the distribution by name and parameter values. 2. Graph the probability density function and identify the mode(s). 3. Find P(Y > 3) . Answer In the ball and urn experiment, select sampling with replacement and set m = 50 , r = 30 , and n = 5 . Run the experiment 1000 times and note the agreement between the empirical density function of Y and the probability density function.
3.1.7
https://stats.libretexts.org/@go/page/10141
A group of voters consists of 50 democrats, 40 republicans, and 30 independents. A sample of 10 voters is chosen at random, without replacement. Let X denote the number of democrats in the sample and Y the number of republicans in the sample. 1. Give the probability density function of X. 2. Give the probability density function of Y . 3. Give the probability density function of (X, Y ). 4. Find the probability that the sample has at least 4 democrats and at least 4 republicans. Answer The Math Club at Enormous State University (ESU) has 20 freshmen, 40 sophomores, 30 juniors, and 10 seniors. A committee of 8 club members is chosen at random, without replacement to organize π-day activities. Let X denote the number of freshman in the sample, Y the number of sophomores, and Z the number of juniors. 1. Give the probability density function of X. 2. Give the probability density function of Y . 3. Give the probability density function of Z . 4. Give the probability density function of (X, Y ). 5. Give the probability density function of (X, Y , Z) . 6. Find the probability that the committee has no seniors. Answer
Coins and Dice Suppose that a coin with probability of heads p is tossed repeatedly, and the sequence of heads and tails is recorded. 1. Identify the underlying probability model by name and parameter. 2. Let Y denote the number of heads in the first n tosses. Give the probability density function of Y and identify the distribution by name and parameters. 3. Let N denote the number of tosses needed to get the first head. Give the probability density function of N and identify the distribution by name and parameter. Answer Suppose that a coin with probability of heads p = 0.4 is tossed 5 times. Let Y denote the number of heads. 1. Compute the probability density function of Y explicitly. 2. Graph the probability density function and identify the mode. 3. Find P(Y > 3) . Answer In the binomial coin experiment, set n = 5 and p = 0.4 . Run the experiment 1000 times and compare the empirical density function of Y with the probability density function. Suppose that a coin with probability of heads p = 0.2 is tossed until heads occurs. Let N denote the number of tosses. 1. Find the probability density function of N . 2. Find P(N ≤ 5) . Answer In the negative binomial experiment, set k = 1 and density function with the probability density function.
p = 0.2
. Run the experiment 1000 times and compare the empirical
Suppose that two fair, standard dice are tossed and the sequence of scores (X , X ) recorded. Let Y = X sum of the scores, U = min{X , X } the minimum score, and V = max{X , X } the maximum score. 1
1
2
1
1. Find the probability density function of (X 2. Find the probability density function of Y . 3. Find the probability density function of U .
1,
X2 )
2
1
+ X2
denote the
2
. Identify the distribution by name.
3.1.8
https://stats.libretexts.org/@go/page/10141
4. Find the probability density function of V . 5. Find the probability density function of (U , V ). Answer Note that (U , V ) in the last exercise could serve as the outcome of the experiment that consists of throwing two standard dice if we did not bother to record order. Note from the previous exercise that this random vector does not have a uniform distribution when the dice are fair. The mistaken idea that this vector should have the uniform distribution was the cause of difficulties in the early development of probability. In the dice experiment, select n = 2 fair dice. Select the following random variables and note the shape and location of the probability density function. Run the experiment 1000 times. For each of the following variables, compare the empirical density function with the probability density function. 1. Y , the sum of the scores. 2. U , the minimum score. 3. V , the maximum score. In the die-coin experiment, a fair, standard die is rolled and then a fair coin is tossed the number of times showing on the die. Let N denote the die score and Y the number of heads. 1. Find the probability density function of N . Identify the distribution by name. 2. Find the probability density function of Y . Answer Run the die-coin experiment 1000 times. For the number of heads, compare the empirical density function with the probability density function. Suppose that a bag contains 12 coins: 5 are fair, 4 are biased with probability of heads ; and 3 are two-headed. A coin is chosen at random from the bag and tossed 5 times. Let V denote the probability of heads of the selected coin and let Y denote the number of heads. 1 3
1. Find the probability density function of V . 2. Find the probability density function of Y . Answer Compare thedie-coin experiment with the bag of coins experiment. In the first experiment, we toss a coin with a fixed probability of heads a random number of times. In second experiment, we effectively toss a coin with a random probability of heads a fixed number of times. In both cases, we can think of starting with a binomial distribution and randomizing one of the parameters. In the coin-die experiment, a fair coin is tossed. If the coin lands tails, a fair die is rolled. If the coin lands heads, an ace-six flat die is tossed (faces 1 and 6 have probability each, while faces 2, 3, 4, 5 have probability each). Find the probability density function of the die score Y . 1
1
4
8
Answer Run the coin-die experiment 1000 times, with the settings in the previous exercise. Compare the empirical density function with the probability density function. Suppose that a standard die is thrown 10 times. Let Y denote the number of times an ace or a six occurred. Give the probability density function of Y and identify the distribution by name and parameter values in each of the following cases: 1. The die is fair. 2. The die is an ace-six flat. Answer Suppose that a standard die is thrown until an ace or a six occurs. Let N denote the number of throws. Give the probability density function of N and identify the distribution by name and parameter values in each of the following cases:
3.1.9
https://stats.libretexts.org/@go/page/10141
1. The die is fair. 2. The die is an ace-six flat. Answer Fred and Wilma takes turns tossing a coin with probability of heads p ∈ (0, 1): Fred first, then Wilma, then Fred again, and so forth. The first person to toss heads wins the game. Let N denote the number of tosses, and W the event that Wilma wins. 1. Give the probability density function of N and identify the distribution by name. 2. Compute P(W ) and sketch the graph of this probability as a function of p. 3. Find the conditional probability density function of N given W . Answer The alternating coin tossing game is studied in more detail in the section on The Geometric Distribution in the chapter on Bernoulli trials. Suppose that k players each have a coin with probability of heads p, where k ∈ {2, 3, …} and where p ∈ (0, 1). 1. Suppose that the players toss their coins at the same time. Find the probability that there is an odd man, that is, one player with a different outcome than all the rest. 2. Suppose now that the players repeat the procedure in part (a) until there is an odd man. Find the probability density function of N , the number of rounds played, and identify the distribution by name. Answer The odd man out game is treated in more detail in the section on the Geometric Distribution in the chapter on Bernoulli Trials.
Cards Recall that a poker hand consists of 5 cards chosen at random and without replacement from a standard deck of 52 cards. Let X denote the number of spades in the hand and Y the number of hearts in the hand. Give the probability density function of each of the following random variables, and identify the distribution by name: 1. X 2. Y 3. (X, Y ) Answer Recall that a bridge hand consists of 13 cards chosen at random and without replacement from a standard deck of 52 cards. An honor card is a card of denomination ace, king, queen, jack or 10. Let N denote the number of honor cards in the hand. 1. Find the probability density function of N and identify the distribution by name. 2. Find the probability that the hand has no honor cards. A hand of this kind is known as a Yarborough, in honor of Second Earl of Yarborough. Answer In the most common high card point system in bridge, an ace is worth 4 points, a king is worth 3 points, a queen is worth 2 points, and a jack is worth 1 point. Find the probability density function of V , the point value of a random bridge hand.
Reliability Suppose that in a batch of 500 components, 20 are defective and the rest are good. A sample of 10 components is selected at random and tested. Let X denote the number of defectives in the sample. 1. Find the probability density function of X and identify the distribution by name and parameter values. 2. Find the probability that the sample contains at least one defective component. Answer A plant has 3 assembly lines that produce a certain type of component. Line 1 produces 50% of the components and has a defective rate of 4%; line 2 has produces 30% of the components and has a defective rate of 5%; line 3 produces 20% of the
3.1.10
https://stats.libretexts.org/@go/page/10141
components and has a defective rate of 1%. A component is chosen at random from the plant and tested. 1. Find the probability that the component is defective. 2. Given that the component is defective, find the conditional probability density function of the line that produced the component. Answer Recall that in the standard model of structural reliability, a systems consists of n components, each of which, independently of the others, is either working for failed. Let X denote the state of component i, where 1 means working and 0 means failed. Thus, the state vector is X = (X , X , … , X ). The system as a whole is also either working or failed, depending only on the states of the components. Thus, the state of the system is an indicator random variable U = u(X) that depends on the states of the components according to a structure function u : {0, 1} → {0, 1}. In a series system, the system works if and only if every components works. In a parallel system, the system works if and only if at least one component works. In a k out of n system, the system works if and only if at least k of the n components work. i
1
2
n
n
The reliability of a device is the probability that it is working. Let p = P(X = 1) denote the reliability of component i, so that p = (p , p , … , p ) is the vector of component reliabilities. Because of the independence assumption, the system reliability depends only on the component reliabilities, according to a reliability function r(p) = P(U = 1) . Note that when all component reliabilities have the same value p, the states of the components form a sequence of n Bernoulli trials. In this case, the system reliability is, of course, a function of the common component reliability p. i
1
2
i
n
Suppose that the component reliabilities all have the same value p. Let X denote the state vector and Y denote the number of working components. 1. Give the probability density function of X. 2. Give the probability density function of Y and identify the distribution by name and parameter. 3. Find the reliability of the k out of n system. Answer Suppose that we have 4 independent components, with common reliability components.
p = 0.8
. Let
Y
denote the number of working
1. Find the probability density function of Y explicitly. 2. Find the reliability of the parallel system. 3. Find the reliability of the 2 out of 4 system. 4. Find the reliability of the 3 out of 4 system. 5. Find the reliability of the series system. Answer Suppose that we have 4 independent components, with reliabilities p the number of working components.
1
= 0.6
,p
2
= 0.7
,p
3
= 0.8
, and p
4
= 0.9
. Let Y denote
1. Find the probability density function of Y . 2. Find the reliability of the parallel system. 3. Find the reliability of the 2 out of 4 system. 4. Find the reliability of the 3 out of 4 system. 5. Find the reliability of the series system. Answer
The Poisson Distribution Suppose that a > 0 . Define f by n
f (n) = e
−a
a
,
n ∈ N
(3.1.24)
n!
1. f is a probability density function. 2. f (n − 1) < f (n) if and only if n < a .
3.1.11
https://stats.libretexts.org/@go/page/10141
3. If a is not a positive integer, there is a single mode at ⌊a⌋ 4. If a is a positive integer, there are two modes at a − 1 and a . Proof The distribution defined by the probability density function in the previous exercise is the Poisson distribution with parameter a , named after Simeon Poisson. Note that like the other named distributions we studied above (hypergeometric and binomial), the Poisson distribution is unimodal: the probability density function at first increases and then decreases, with either a single mode or two adjacent modes. The Poisson distribution is studied in detail in the Chapter on Poisson Processes, and is used to model the number of “random points” in a region of time or space, under certain ideal conditions. The parameter a is proportional to the size of the region of time or space. Suppose that the customers arrive at a service station according to the Poisson model, at an average rate of 4 per hour. Thus, the number of customers N who arrive in a 2-hour period has the Poisson distribution with parameter 8. 1. Find the modes. 2. Find P(N ≥ 6) . Answer In the Poisson experiment, set r = 4 and t = 2 . Run the simulation 1000 times and compare the empirical density function to the probability density function. Suppose that the number of flaws N in a piece of fabric of a certain size has the Poisson distribution with parameter 2.5. 1. Find the mode. 2. Find P(N > 4) . Answer Suppose that the number of raisins N in a piece of cake has the Poisson distribution with parameter 10. 1. Find the modes. 2. Find P(8 ≤ N ≤ 12) . Answer
A Zeta Distribution Let g be the function defined by g(n) =
1 2
n
for n ∈ N . +
1. Find the probability density function f that is proportional to g . 2. Find the mode of the distribution. 3. Find P(N ≤ 5) where N has probability density function f . Answer The distribution defined in the previous exercise is a member of the zeta family of distributions. Zeta distributions are used to model sizes or ranks of certain types of objects, and are studied in more detail in the chapter on Special Distributions.
Benford's Law Let f be the function defined by f (d) = log(d + 1) − log(d) = log(1 + the base 10 common logarithm, not the base e natural logarithm.)
1 d
)
for d ∈ {1, 2, … , 9}. (The logarithm function is
1. Show that f is a probability density function. 2. Compute the values of f explicitly, and sketch the graph. 3. Find P(X ≤ 3) where X has probability density function f . Answer The distribution defined in the previous exercise is known as Benford's law, and is named for the American physicist and engineer Frank Benford. This distribution governs the leading digit in many real sets of data. Benford's law is studied in more detail in the
3.1.12
https://stats.libretexts.org/@go/page/10141
chapter on Special Distributions.
Data Analysis Exercises In the M&M data, let R denote the number of red candies and empirical probability density function of each of the following: 1. R 2. N 3. R given N
N
the total number of candies. Compute and graph the
> 57
Answer In the Cicada data, let G denotes gender, density function of each of the following: 1. G 2. S 3. (G, S) 4. G given W
> 0.20
S
species type, and
W
body weight (in grams). Compute the empirical probability
grams.
Answer This page titled 3.1: Discrete Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
3.2: Continuous Distributions In the previous section, we considered discrete distributions. In this section, we study a complementary type of distribution. As usual, if you are a new student of probability, you may want to skip the technical details.
Basic Theory

Definitions and Basic Properties

As usual, our starting point is a random experiment modeled by a probability space (S, S, P). So to review, S is the set of outcomes, S the collection of events, and P the probability measure on the sample space (S, S). We use the terms probability measure and probability distribution synonymously in this text. Also, since we use a general definition of random variable, every probability measure can be thought of as the probability distribution of a random variable, so we can always take this point of view if we like. Indeed, most probability measures naturally have random variables associated with them. In this section, we assume that S ⊆ Rⁿ for some n ∈ N₊.

Details

Here is our first fundamental definition.

The probability measure P is continuous if P({x}) = 0 for all x ∈ S.

The fact that each point is assigned probability 0 might seem impossible or paradoxical at first, but soon we will see very familiar analogies. If P is a continuous distribution then P(C) = 0 for every countable C ⊆ S.
Proof Thus, continuous distributions are in complete contrast with discrete distributions, for which all of the probability mass is concentrated on the points in a discrete set. For a continuous distribution, the probability mass is continuously spread over S in some sense. In the picture below, the light blue shading is intended to suggest a continuous distribution of probability.
Figure 3.2.1 : A continuous probability distribution on S
Typically, S is a region of Rⁿ defined by inequalities involving elementary functions, for example an interval in R, a circular region in R², and a conical region in R³. Suppose that P is a continuous probability measure on S. The fact that each point in S has probability 0 is conceptually the same as the fact that an interval of R can have positive length even though it is composed of points each of which has 0 length. Similarly, a region of R² can have positive area even though it is composed of points (or curves) each of which has area 0. In the one-dimensional case, continuous distributions are used to model random variables that take values in intervals of R, variables that can, in principle, be measured with any degree of accuracy. Such variables abound in applications and include
length, area, volume, and distance
time
mass and weight
charge, voltage, and current
resistance, capacitance, and inductance
velocity and acceleration
energy, force, and work

Usually a continuous distribution can be described by a certain type of function.
Suppose again that P is a continuous distribution on S. A function f : S → [0, ∞) is a probability density function for P if
P(A) = ∫_A f(x) dx, A ∈ S (3.2.2)

Details

So the probability distribution P is completely determined by the probability density function f. As a special case, note that ∫_S f(x) dx = P(S) = 1. Conversely, a nonnegative function on S with this property defines a probability measure.
A function f : S → [0, ∞) that satisfies ∫_S f(x) dx = 1 is a probability density function on S, and then P defined as follows is a continuous probability measure on S:
P(A) = ∫_A f(x) dx, A ∈ S (3.2.3)
Proof
Figure 3.2.2 : A continuous distribution is completely determined by its probability density function
Note that we can always extend f to a probability density function on a subset of Rⁿ that contains S, or to all of Rⁿ, by defining f(x) = 0 for x ∉ S. This extension sometimes simplifies notation. Put another way, we can be a bit sloppy about the “set of values” of the random variable. So for example if a, b ∈ R with a < b and X has a continuous distribution on the interval [a, b], then we could also say that X has a continuous distribution on (a, b), [a, b), or (a, b].
The points x ∈ S that maximize the probability density function f are important, just as in the discrete case.

Suppose that P is a continuous distribution on S with probability density function f. An element x ∈ S that maximizes f is a mode of the distribution.
If there is only one mode, it is sometimes used as a measure of the center of the distribution. You have probably noticed that probability density functions for continuous distributions are analogous to probability density functions for discrete distributions, with integrals replacing sums. However, there are essential differences. First, every discrete distribution has a unique probability density function f given by f(x) = P({x}) for x ∈ S. For a continuous distribution, the existence of a probability density function is not guaranteed. The advanced section on absolute continuity and density functions has several examples of continuous distributions that do not have density functions, and gives conditions that are necessary and sufficient for the existence of a probability density function. Even if a probability density function f exists, it is never unique. Note that the values of f on a finite (or even countably infinite) set of points could be changed to other nonnegative values and the new function would still be a probability density function for the same distribution. The critical fact is that only integrals of f are important. Second, the values of the PDF f for a discrete distribution are probabilities, and in particular f(x) ≤ 1 for x ∈ S. For a continuous distribution the values are not probabilities, and in fact it's possible that f(x) > 1 for some or even all x ∈ S. Further, f can be unbounded on S. In the typical calculus interpretation, f(x) really is probability density at x. That is, f(x) dx is approximately the probability of a “small” region of size dx about x.
Constructing Probability Density Functions

Just as in the discrete case, a nonnegative function on S can often be scaled to produce a probability density function.

Suppose that g : S → [0, ∞) and let
c = ∫_S g(x) dx (3.2.4)
If 0 < c < ∞ then f defined by f(x) = (1/c) g(x) for x ∈ S is a probability density function for a continuous distribution on S.
Proof Note again that f is just a scaled version of g . So this result can be used to construct probability density functions with desired properties (domain, shape, symmetry, and so on). The constant c is sometimes called the normalizing constant of g .
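When the integral of g has no convenient closed form, the normalizing constant can be approximated numerically. A minimal sketch, with g(x) = e^{−x³} on S = [0, ∞) chosen purely for illustration:

```python
from scipy.integrate import quad
import numpy as np

g = lambda x: np.exp(-x**3)            # a nonnegative function on S = [0, inf)
c, _ = quad(g, 0, np.inf)              # normalizing constant c = integral of g over S
f = lambda x: g(x) / c                 # probability density function proportional to g
print(quad(f, 0, np.inf)[0])           # check: f integrates to 1
```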
Conditional Densities

Suppose now that X is a random variable defined on a probability space (Ω, F, P) and that X has a continuous distribution on S. A probability density function for X is based on the underlying probability measure on the sample space (Ω, F). This measure could be a conditional probability measure, conditioned on a given event E ∈ F with P(E) > 0. Assuming that the conditional probability density function exists, the usual notation is
f(x ∣ E), x ∈ S (3.2.6)
Note, however, that except for notation, no new concepts are involved. The defining property is
∫_A f(x ∣ E) dx = P(X ∈ A ∣ E), A ∈ S (3.2.7)
and all results that hold for probability density functions in general hold for conditional probability density functions. The event E could be an event described in terms of the random variable X itself:

Suppose that X has a continuous distribution on S with probability density function f and that B ∈ S with P(X ∈ B) > 0. The conditional probability density function of X given X ∈ B is the function on B given by
f(x ∣ X ∈ B) = f(x) / P(X ∈ B), x ∈ B (3.2.8)

Proof

Of course, P(X ∈ B) = ∫_B f(x) dx, and hence is the normalizing constant for the restriction of f to B, as in (8).
Examples and Applications

As always, try the problems yourself before looking at the answers.
The Exponential Distribution

Let f be the function defined by f(t) = re^{−rt} for t ∈ [0, ∞), where r ∈ (0, ∞) is a parameter.
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.

Proof

The distribution defined by the probability density function in the previous exercise is called the exponential distribution with rate parameter r. This distribution is frequently used to model random times, under certain assumptions. Specifically, in the Poisson model of random points in time, the times between successive arrivals have independent exponential distributions, and the parameter r is the average rate of arrivals. The exponential distribution is studied in detail in the chapter on Poisson Processes.

The lifetime T of a certain device (in 1000 hour units) has the exponential distribution with parameter r = 1/2. Find
1. P(T > 2)
2. P(T > 3 ∣ T > 1)
Answer

In the gamma experiment, set n = 1 to get the exponential distribution. Vary the rate parameter r and note the shape of the probability density function. For various values of r, run the simulation 1000 times and compare the empirical density function with the probability density function.
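As a numerical check on the lifetime exercise, note that P(T > t) = e^{−rt} and that P(T > 3 ∣ T > 1) = P(T > 2) by the memoryless property of the exponential distribution. A sketch with r = 1/2:

```python
import math

r = 0.5
p_gt = lambda t: math.exp(-r * t)      # P(T > t) for the exponential distribution
print(p_gt(2))                         # P(T > 2)
print(p_gt(3) / p_gt(1))               # P(T > 3 | T > 1); equals P(T > 2) by memorylessness
```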
A Random Angle

In Bertrand's problem, a certain random angle Θ has probability density function f given by f(θ) = sin θ for θ ∈ [0, π/2].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
3. Find P(Θ < π/4).
Answer

Bertrand's problem is named for Joseph Louis Bertrand and is studied in more detail in the chapter on Geometric Models. In Bertrand's experiment, select the model with uniform distance. Run the simulation 1000 times and compute the empirical probability of the event {Θ < π/4}. Compare with the true probability in the previous exercise.
Gamma Distributions

Let gₙ be the function defined by gₙ(t) = e^{−t} tⁿ / n! for t ∈ [0, ∞), where n ∈ N is a parameter.
1. Show that gₙ is a probability density function for each n ∈ N.
2. Draw a careful sketch of the graph of gₙ, and state the important qualitative features.

Proof

Interestingly, we showed in the last section on discrete distributions that fₜ(n) = gₙ(t) is a probability density function on N for each t ≥ 0 (it's the Poisson distribution with parameter t). The distribution defined by the probability density function gₙ belongs to the family of Erlang distributions, named for Agner Erlang; n + 1 is known as the shape parameter. The Erlang distribution is studied in more detail in the chapter on the Poisson Process. In turn the Erlang distribution belongs to the more general family of gamma distributions. The gamma distribution is studied in more detail in the chapter on Special Distributions.
In the gamma experiment, keep the default rate parameter r = 1. Vary the shape parameter and note the shape and location of the probability density function. For various values of the shape parameter, run the simulation 1000 times and compare the empirical density function with the probability density function.

Suppose that the lifetime T of a device (in 1000 hour units) has the gamma distribution above with n = 2. Find each of the following:
1. P(T > 3)
2. P(T ≤ 2)
3. P(1 ≤ T ≤ 4)
Answer
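The density g₂ corresponds to the gamma distribution with shape parameter 3 and rate 1, so the probabilities in the last exercise can be checked with software; a sketch using scipy:

```python
from scipy.stats import gamma

T = gamma(a=3)                          # shape n + 1 = 3, rate 1 (scale 1)
print(T.sf(3))                          # P(T > 3)
print(T.cdf(2))                         # P(T <= 2)
print(T.cdf(4) - T.cdf(1))              # P(1 <= T <= 4)
```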
Beta Distributions

Let f be the function defined by f(x) = 6x(1 − x) for x ∈ [0, 1].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

Let f be the function defined by f(x) = 12x²(1 − x) for x ∈ [0, 1].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

The distributions defined in the last two exercises are examples of beta distributions. These distributions are widely used to model random proportions and probabilities, and physical quantities that take values in bounded intervals (which, after a change of units, can be taken to be [0, 1]). Beta distributions are studied in detail in the chapter on Special Distributions.
In the special distribution simulator, select the beta distribution. For the following parameter values, note the shape of the probability density function. Run the simulation 1000 times and compare the empirical density function with the probability density function.
1. a = 2, b = 2. This gives the first beta distribution above.
2. a = 3, b = 2. This gives the second beta distribution above.

Suppose that P is a random proportion. Find P(1/4 ≤ P ≤ 3/4) in each of the following cases:
1. P has the first beta distribution above.
2. P has the second beta distribution above.
Answer

Let f be the function defined by
f(x) = 1 / (π √(x(1 − x))), x ∈ (0, 1) (3.2.10)
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

The distribution defined in the last exercise is also a member of the beta family of distributions. But it is also known as the (standard) arcsine distribution, because of the arcsine function that arises in the proof that f is a probability density function. The arcsine distribution has applications to a very important random process known as Brownian motion, named for the Scottish botanist Robert Brown. Arcsine distributions are studied in more generality in the chapter on Special Distributions.

In the special distribution simulator, select the (continuous) arcsine distribution and keep the default parameter values. Run the simulation 1000 times and compare the empirical density function with the probability density function.

Suppose that Xₜ represents the change in the price of a stock at time t, relative to the value at an initial reference time 0. We treat t as a continuous variable measured in weeks. Let T = max{t ∈ [0, 1] : Xₜ = 0}, the last time during the first week that the stock price was unchanged over its initial value. Under certain ideal conditions, T will have the arcsine distribution. Find each of the following:
1. P(T < 1/4)
2. P(T ≥ 1/2)
3. P(T ≤ 3/4)
Answer Open the Brownian motion experiment and select the last zero variable. Run the experiment in single step mode a few times. The random process that you observe models the price of the stock in the previous exercise. Now run the experiment 1000 times and compute the empirical probability of each event in the previous exercise.
The Pareto Distribution

Let g be the function defined by g(x) = 1/xᵇ for x ∈ [1, ∞), where b ∈ (0, ∞) is a parameter.
1. Draw a careful sketch of the graph of g, and state the important qualitative features.
2. Find the values of b for which there exists a probability density function f proportional to g. Identify the mode.
Answer

Note that the qualitative features of g are the same, regardless of the value of the parameter b > 0, but only when b > 1 can g be normalized into a probability density function. In this case, the distribution is known as the Pareto distribution, named for Vilfredo Pareto. The parameter a = b − 1, so that a > 0, is known as the shape parameter. Thus, the Pareto distribution with shape parameter a has probability density function
f(x) = a / x^{a+1}, x ∈ [1, ∞) (3.2.13)
The Pareto distribution is widely used to model certain economic variables and is studied in detail in the chapter on Special Distributions.

In the special distribution simulator, select the Pareto distribution. Leave the scale parameter fixed, but vary the shape parameter, and note the shape of the probability density function. For various values of the shape parameter, run the simulation 1000 times and compare the empirical density function with the probability density function.

Suppose that the income X (in appropriate units) of a person randomly selected from a population has the Pareto distribution with shape parameter a = 2. Find each of the following:
1. P(X > 2)
2. P(X ≤ 4)
3. P(3 ≤ X ≤ 5)
Answer
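Since the Pareto distribution with shape parameter a = 2 has P(X > x) = 1/x² for x ≥ 1, the last exercise can be checked directly; a sketch using scipy:

```python
from scipy.stats import pareto

X = pareto(b=2)                         # Pareto distribution with shape a = 2 on [1, inf)
print(X.sf(2))                          # P(X > 2) = 1/4
print(X.cdf(4))                         # P(X <= 4) = 15/16
print(X.cdf(5) - X.cdf(3))              # P(3 <= X <= 5)
```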
The Cauchy Distribution

Let f be the function defined by
f(x) = 1 / (π(x² + 1)), x ∈ R (3.2.14)
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

The distribution constructed in the previous exercise is known as the (standard) Cauchy distribution, named after Augustin Cauchy. It might also be called the arctangent distribution, because of the appearance of the arctangent function in the proof that f is a probability density function. In this regard, note the similarity to the arcsine distribution above. The Cauchy distribution is studied in more generality in the chapter on Special Distributions. Note also that the Cauchy distribution is obtained by normalizing the function x ↦ 1/(1 + x²); the graph of this function is known as the witch of Agnesi, in honor of Maria Agnesi.
In the special distribution simulator, select the Cauchy distribution with the default parameter values. Run the simulation 1000 times and compare the empirical density function with the probability density function.

A light source is 1 meter away from position 0 on an infinite, straight wall. The angle Θ that the light beam makes with the perpendicular to the wall is randomly chosen from the interval (−π/2, π/2). The position X = tan(Θ) of the light beam on the wall has the standard Cauchy distribution. Find each of the following:
1. P(−1 < X < 1)
2. P(X ≥ 1/√3)
3. P(X ≤ √3)
Answer The Cauchy experiment (with the default parameter values) is a simulation of the experiment in the last exercise. 1. Run the experiment a few times in single step mode. 2. Run the experiment 1000 times and compare the empirical density function with the probability density function. 3. Using the data from (b), compute the relative frequency of each event in the previous exercise, and compare with the true probability.
The Standard Normal Distribution

Let ϕ be the function defined by ϕ(z) = (1/√(2π)) e^{−z²/2} for z ∈ R.
1. Show that ϕ is a probability density function.
2. Draw a careful sketch of the graph of ϕ, and state the important qualitative features.

Proof

The distribution defined in the last exercise is the standard normal distribution, perhaps the most important distribution in probability and statistics. Its importance stems largely from the central limit theorem, one of the fundamental theorems in probability. In particular, normal distributions are widely used to model physical measurements that are subject to small, random errors. The family of normal distributions is studied in more generality in the chapter on Special Distributions.

In the special distribution simulator, select the normal distribution and keep the default parameter values. Run the simulation 1000 times and compare the empirical density function and the probability density function.

The function z ↦ e^{−z²/2} is a notorious example of an integrable function that does not have an antiderivative that can be expressed in closed form in terms of other elementary functions. (That's why we had to resort to the polar coordinate trick to show that ϕ is a probability density function.) So probabilities involving the normal distribution are usually computed using mathematical or statistical software.
Suppose that the error Z in the length of a certain machined part (in millimeters) has the standard normal distribution. Use mathematical software to approximate each of the following: 1. P(−1 ≤ Z ≤ 1) 2. P(Z > 2) 3. P(Z < −3) Answer
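For example, one option is Python's scipy.stats; a minimal sketch for the three probabilities above:

```python
from scipy.stats import norm

print(norm.cdf(1) - norm.cdf(-1))       # P(-1 <= Z <= 1), about 0.6827
print(norm.sf(2))                       # P(Z > 2), about 0.0228
print(norm.cdf(-3))                     # P(Z < -3), about 0.0013
```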
The Extreme Value Distribution

Let f be the function defined by f(x) = e^{−x} e^{−e^{−x}} for x ∈ R.
1. Show that f is a probability density function. 2. Draw a careful sketch of the graph of f , and state the important qualitative features. 3. Find P(X > 0) , where X has probability density function f . Answer The distribution in the last exercise is the (standard) type 1 extreme value distribution, also known as the Gumbel distribution in honor of Emil Gumbel. Extreme value distributions are studied in more generality in the chapter on Special Distributions. In the special distribution simulator, select the extreme value distribution. Keep the default parameter values and note the shape and location of the probability density function. Run the simulation 1000 times and compare the empirical density function with the probability density function.
The Logistic Distribution

Let f be the function defined by
f(x) = eˣ / (1 + eˣ)², x ∈ R (3.2.19)
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
3. Find P(X > 1), where X has probability density function f.
Answer

The distribution in the last exercise is the (standard) logistic distribution. Logistic distributions are studied in more generality in the chapter on Special Distributions.
In the special distribution simulator, select the logistic distribution. Keep the default parameter values and note the shape and location of the probability density function. Run the simulation 1000 times and compare the empirical density function with the probability density function.
Weibull Distributions

Let f be the function defined by f(t) = 2t e^{−t²} for t ∈ [0, ∞).
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

Let f be the function defined by f(t) = 3t² e^{−t³} for t ≥ 0.
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
Answer

The distributions in the last two exercises are examples of Weibull distributions, named for Waloddi Weibull. Weibull distributions are studied in more generality in the chapter on Special Distributions. They are often used to model random failure times of devices (in appropriately scaled units).

In the special distribution simulator, select the Weibull distribution. For each of the following values of the shape parameter k, note the shape and location of the probability density function. Run the simulation 1000 times and compare the empirical density function with the probability density function.
1. k = 2. This gives the first Weibull distribution above.
2. k = 3. This gives the second Weibull distribution above.

Suppose that T is the failure time of a device (in 1000 hour units). Find P(T > 1/2) in each of the following cases:
1. T has the first Weibull distribution above. 2. T has the second Weibull distribution above. Answer
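Both Weibull densities above have survival functions of the simple form P(T > t) = e^{−tᵏ}, so the last exercise is easy to check numerically:

```python
import math

p_gt = lambda t, k: math.exp(-t**k)     # P(T > t) for the Weibull distribution with shape k
print(p_gt(0.5, 2))                     # first distribution (k = 2)
print(p_gt(0.5, 3))                     # second distribution (k = 3)
```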
Additional Examples

Let f be the function defined by f(x) = −ln x for x ∈ (0, 1].
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and state the important qualitative features.
3. Find P(1/3 ≤ X ≤ 1/2) where X has the probability density function in (a).
Answer

Let f be the function defined by f(x) = 2e^{−x}(1 − e^{−x}) for x ∈ [0, ∞).
1. Show that f is a probability density function.
2. Draw a careful sketch of the graph of f, and give the important qualitative features.
3. Find P(X ≥ 1) where X has the probability density function in (a).
Answer

The following problems deal with two and three dimensional random vectors having continuous distributions. The idea of normalizing a function to form a probability density function is important for some of the problems. The relationship between the distribution of a vector and the distribution of its components will be discussed later, in the section on joint distributions.

Let f be the function defined by f(x, y) = x + y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
1. Show that f is a probability density function, and identify the mode.
2. Find P(Y ≥ X) where (X, Y) has the probability density function in (a).
3. Find the conditional density of (X, Y) given {X < 1/2, Y < 1/2}.
Answer

Let g be the function defined by g(x, y) = x + y for 0 ≤ x ≤ y ≤ 1.
1. Find the probability density function f that is proportional to g.
2. Find P(Y ≥ 2X) where (X, Y) has the probability density function in (a).
Answer

Let g be the function defined by g(x, y) = x²y for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1.
1. Find the probability density function f that is proportional to g.
2. Find P(Y ≥ X) where (X, Y) has the probability density function in (a).
Answer

Let g be the function defined by g(x, y) = x²y for 0 ≤ x ≤ y ≤ 1.
1. Find the probability density function f that is proportional to g.
2. Find P(Y ≥ 2X) where (X, Y) has the probability density function in (a).
Answer

Let g be the function defined by g(x, y, z) = x + 2y + 3z for 0 ≤ x ≤ 1, 0 ≤ y ≤ 1, 0 ≤ z ≤ 1.
1. Find the probability density function f that is proportional to g.
2. Find P(X ≤ Y ≤ Z) where (X, Y, Z) has the probability density function in (a).
Answer

Let g be the function defined by g(x, y) = e^{−x} e^{−y} for 0 ≤ x ≤ y < ∞.
1. Find the probability density function f that is proportional to g.
2. Find P(X + Y < 1) where (X, Y) has the probability density function in (a).
Answer
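The normalizing constants and probabilities in the bivariate exercises above are double integrals that can be approximated numerically. A sketch for the density proportional to g(x, y) = x + y on 0 ≤ x ≤ y ≤ 1 (the same pattern works for the other exercises):

```python
from scipy.integrate import dblquad

g = lambda x, y: x + y
# dblquad integrates func(y, x) with x as the outer variable;
# the region here is 0 <= x <= y <= 1
c, _ = dblquad(lambda y, x: g(x, y), 0, 1, lambda x: x, lambda x: 1)
f = lambda x, y: g(x, y) / c              # normalized probability density function
# P(Y >= 2X): x runs from 0 to 1/2, y from 2x to 1
p, _ = dblquad(lambda y, x: f(x, y), 0, 0.5, lambda x: 2 * x, lambda x: 1)
print(c, p)                               # c = 1/2, p = 5/12
```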
Continuous Uniform Distributions

Our next discussion will focus on an important class of continuous distributions that are defined purely in terms of geometry. We need a preliminary definition.

For n ∈ N₊, the standard measure λₙ on Rⁿ is given by
λₙ(A) = ∫_A 1 dx, A ⊆ Rⁿ (3.2.23)
In particular, λ₁(A) is the length of A ⊆ R, λ₂(A) is the area of A ⊆ R², and λ₃(A) is the volume of A ⊆ R³.
Details

Note that if n > 1, the integral above is a multiple integral. Generally, λₙ(A) is referred to as the n-dimensional volume of A ⊆ Rⁿ.
Suppose that S ⊆ Rⁿ for some n ∈ N₊ with 0 < λₙ(S) < ∞.

3.3: Mixed Distributions

Suppose that P is a probability measure on S. Then P has a distribution of mixed type if S can be partitioned into sets D and C such that:
1. D is countable and P({x}) > 0 for every x ∈ D.
2. C ⊆ Rⁿ for some n ∈ N₊ and P({x}) = 0 for every x ∈ C.
Details

Recall that the term partition means that D and C are disjoint and S = D ∪ C. As always, the collection of events S is required to be a σ-algebra. The set C is a measurable subset of Rⁿ and then the elements of S have the form A ∪ B where A ⊆ D and B is a measurable subset of C. Typically in applications, C is defined by a finite number of inequalities involving elementary functions.

Often the discrete set D is a subset of Rⁿ also, but that's not a requirement. Note that since D and C are complements, 0 < P(C) < 1 also. Thus, part of the distribution is concentrated at points in a discrete set D; the rest of the distribution is continuously spread over C. In the picture below, the light blue shading is intended to represent a continuous distribution of probability while the darker blue dots are intended to represent points of positive probability.
Figure 3.3.1 : A mixed distribution on S
The following result is essentially equivalent to the definition.

Suppose that P is a probability measure on S of mixed type as in (1).
1. The conditional probability measure A ↦ P(A ∣ D) = P(A)/P(D) for A ⊆ D is a discrete distribution on D.
2. The conditional probability measure A ↦ P(A ∣ C) = P(A)/P(C) for A ⊆ C is a continuous distribution on C.

Proof

Note that
P(A) = P(D)P(A ∣ D) + P(C)P(A ∣ C), A ∈ S (3.3.1)
Thus, the probability measure P really is a mixture of a discrete distribution and a continuous distribution. Mixtures are studied in more generality in the section on conditional distributions. We can define a function on D that is a partial probability density function for the discrete part of the distribution.

Suppose that P is a probability measure on S of mixed type as in (1). Let g be the function defined by g(x) = P({x}) for x ∈ D. Then
1. g(x) ≥ 0 for x ∈ D
2. ∑_{x∈D} g(x) = P(D)
3. P(A) = ∑_{x∈A} g(x) for A ⊆ D
Proof

Clearly, the normalized function x ↦ g(x)/P(D) is the probability density function of the conditional distribution given D discussed in (2). Often, the continuous part of the distribution is also described by a partial probability density function.

A partial probability density function for the continuous part of P is a nonnegative function h : C → [0, ∞) such that
P(A) = ∫_A h(x) dx, A ∈ C (3.3.2)
Details

Technically, h is required to be measurable, and is a density function with respect to Lebesgue measure λₙ on C, the standard measure on Rⁿ.
Clearly, the normalized function x ↦ h(x)/P(C) is the probability density function of the conditional distribution given C discussed in (2). As with purely continuous distributions, the existence of a probability density function for the continuous part of a mixed distribution is not guaranteed. And when it does exist, a density function for the continuous part is not unique. Note that the values of h could be changed to other nonnegative values on a countable subset of C, and the displayed equation above would still hold, because only integrals of h are important. The probability measure P is completely determined by the partial probability density functions.

Suppose that P has partial probability density functions g and h for the discrete and continuous parts, respectively. Then
P(A) = ∑_{x∈A∩D} g(x) + ∫_{A∩C} h(x) dx, A ∈ S (3.3.3)
Proof
Figure 3.3.2 : A mixed distribution is completely determined by its partial density functions.
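As a numerical illustration of the last result, the following sketch computes P(A) for a hypothetical mixed distribution with an atom g(0) = 1/2 at 0 and partial continuous density h(x) = (1/2)e^{−x} on [0, ∞):

```python
from scipy.integrate import quad
import math

g = {0: 0.5}                                   # partial density of the discrete part (atom at 0)
h = lambda x: 0.5 * math.exp(-x)               # partial density of the continuous part

def prob(points, a, b):
    """P(A) for A = {points} union [a, b]: sum the discrete part, integrate the continuous part."""
    return sum(g.get(x, 0) for x in points) + quad(h, a, b)[0]

print(prob([0], 1, 2))                         # P({0} union [1, 2])
```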
Truncated Variables

Distributions of mixed type occur naturally when a random variable with a continuous distribution is truncated in a certain way. For example, suppose that T is the random lifetime of a device, and has a continuous distribution with probability density function f that is positive on [0, ∞). In a test of the device, we can't wait forever, so we might select a positive constant a and record the random variable U, defined by truncating T at a: U = T if T < a and U = a if T ≥ a.

3.5: Conditional Distributions

Suppose that X is a random variable with values in S and probability density function g, and that E is an event with P(E) > 0. The conditional probability density function x ↦ g(x ∣ E) of X given E can be computed as follows:
1. If X has a discrete distribution then
g(x ∣ E) = g(x)P(E ∣ X = x) / ∑_{s∈S} g(s)P(E ∣ X = s), x ∈ S (3.5.9)
2. If X has a continuous distribution then
g(x ∣ E) = g(x)P(E ∣ X = x) / ∫_S g(s)P(E ∣ X = s) ds, x ∈ S (3.5.10)
Proof

In the context of Bayes' theorem, g is called the prior probability density function of X and x ↦ g(x ∣ E) is the posterior probability density function of X given E. Note also that the conditional probability density function of X given E is proportional to the function x ↦ g(x)P(E ∣ X = x); the sum or integral of this function that occurs in the denominator is simply the normalizing constant. As with the law of total probability, Bayes' theorem is useful when P(E ∣ X = x) and g(x) are known for x ∈ S.
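A minimal numerical sketch of part (a), with a hypothetical prior g on S = {1, 2, 3} and likelihood values P(E ∣ X = x) chosen purely for illustration:

```python
# prior g(x) and likelihood P(E | X = x) for x in S = {1, 2, 3} (illustrative values)
prior = {1: 0.5, 2: 0.3, 3: 0.2}
likelihood = {1: 0.9, 2: 0.5, 3: 0.1}

joint = {x: prior[x] * likelihood[x] for x in prior}
c = sum(joint.values())                        # normalizing constant (denominator in Bayes' theorem)
posterior = {x: joint[x] / c for x in joint}   # posterior density g(x | E)
print(posterior)
```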
Conditional Probability Density Functions

The definitions and results above apply, of course, if E is an event defined in terms of another random variable for our experiment. Here is the setup: Suppose that X and Y are random variables on the probability space, with values in sets S and T, respectively, so that (X, Y) is a random variable with values in S × T. We assume that (X, Y) has probability density function f, as discussed in the section on Joint Distributions. Recall that X has probability density function g defined as follows:
1. If Y has a discrete distribution on the countable set T then
g(x) = ∑_{y∈T} f(x, y), x ∈ S (3.5.12)
2. If Y has a continuous distribution on T ⊆ Rᵏ then
g(x) = ∫_T f(x, y) dy, x ∈ S (3.5.13)
Similarly, the probability density function h of Y can be obtained by summing f over x ∈ S if X has a discrete distribution, or integrating f over S if X has a continuous distribution.

Suppose that x ∈ S and that g(x) > 0. The function y ↦ h(y ∣ x) defined below is a probability density function on T:
h(y ∣ x) = f(x, y) / g(x), y ∈ T (3.5.14)
Proof

The distribution that corresponds to this probability density function is what you would expect:

For x ∈ S, the function y ↦ h(y ∣ x) is the conditional probability density function of Y given X = x. That is,
1. If Y has a discrete distribution then
P(Y ∈ B ∣ X = x) = ∑_{y∈B} h(y ∣ x), B ⊆ T (3.5.17)
2. If Y has a continuous distribution then
P(Y ∈ B ∣ X = x) = ∫_B h(y ∣ x) dy, B ⊆ T (3.5.18)
Proof

The following theorem gives Bayes' theorem for probability density functions. We use the notation established above.

Bayes' Theorem. For y ∈ T, the conditional probability density function x ↦ g(x ∣ y) of X given Y = y can be computed as follows:
1. If X has a discrete distribution then
g(x ∣ y) = g(x)h(y ∣ x) / ∑_{s∈S} g(s)h(y ∣ s), x ∈ S (3.5.24)
2. If X has a continuous distribution then
g(x ∣ y) = g(x)h(y ∣ x) / ∫_S g(s)h(y ∣ s) ds, x ∈ S (3.5.25)
Proof In the context of Bayes' theorem, g is the prior probability density function of X and x ↦ g(x ∣ y) is the posterior probability density function of X given Y = y for y ∈ T . Note that the posterior probability density function x ↦ g(x ∣ y) is proportional to the function x ↦ g(x)h(y ∣ x). The sum or integral in the denominator is the normalizing constant.
Independence

Intuitively, X and Y should be independent if and only if the conditional distributions are the same as the corresponding unconditional distributions.

The following conditions are equivalent:
1. X and Y are independent.
2. f(x, y) = g(x)h(y) for x ∈ S, y ∈ T
3. h(y ∣ x) = h(y) for x ∈ S, y ∈ T
4. g(x ∣ y) = g(x) for x ∈ S, y ∈ T

Proof
Examples and Applications

In the exercises that follow, look for special models and distributions that we have studied. A special distribution may be embedded in a larger problem, as a conditional distribution, for example. In particular, a conditional distribution sometimes arises when a
parameter of a standard distribution is randomized. A couple of special distributions will occur frequently in the exercises. First, recall that the discrete uniform distribution on a finite, nonempty set S has probability density function f given by f(x) = 1/#(S) for x ∈ S. This distribution governs an element selected at random from S. Recall also that Bernoulli trials (named for Jacob Bernoulli) are independent trials, each with two possible outcomes generically called success and failure. The probability of success p ∈ [0, 1] is the same for each trial, and is the basic parameter of the random process. The number of successes in n ∈ N₊ Bernoulli trials has the binomial distribution with parameters n and p. This distribution has probability density function f given by f(x) = C(n, x) pˣ(1 − p)ⁿ⁻ˣ for x ∈ {0, 1, …, n}, where C(n, x) is the binomial coefficient. The binomial distribution is studied in more detail in the chapter on Bernoulli trials.
Coins and Dice

Suppose that two standard, fair dice are rolled and the sequence of scores (X₁, X₂) is recorded. Let U = min{X₁, X₂} and V = max{X₁, X₂} denote the minimum and maximum scores, respectively.
1. Find the conditional probability density function of U given V = v for each v ∈ {1, 2, 3, 4, 5, 6}.
2. Find the conditional probability density function of V given U = u for each u ∈ {1, 2, 3, 4, 5, 6}.
Answer

In the die-coin experiment, a standard, fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let N denote the die score and Y the number of heads.
1. Find the joint probability density function of (N, Y).
2. Find the probability density function of Y.
3. Find the conditional probability density function of N given Y = y for each y ∈ {0, 1, 2, 3, 4, 5, 6}.
Answer

In the die-coin experiment, select the fair die and coin.
1. Run the simulation 1000 times and compare the empirical density function of Y with the true probability density function in the previous exercise.
2. Run the simulation 1000 times and compute the empirical conditional density function of N given Y = 3. Compare with the conditional probability density functions in the previous exercise.

In the coin-die experiment, a fair coin is tossed. If the coin is tails, a standard, fair die is rolled. If the coin is heads, a standard, ace-six flat die is rolled (faces 1 and 6 have probability 1/4 each and faces 2, 3, 4, 5 have probability 1/8 each). Let X denote the coin score (0 for tails and 1 for heads) and Y the die score.
1. Find the joint probability density function of (X, Y).
2. Find the probability density function of Y.
3. Find the conditional probability density function of X given Y = y for each y ∈ {1, 2, 3, 4, 5, 6}.
Answer

In the coin-die experiment, select the settings of the previous exercise.
1. Run the simulation 1000 times and compare the empirical density function of Y with the true probability density function in the previous exercise.
2. Run the simulation 100 times and compute the empirical conditional probability density function of X given Y = 2. Compare with the conditional probability density function in the previous exercise.

Suppose that a box contains 12 coins: 5 are fair, 4 are biased so that heads comes up with probability 1/3, and 3 are two-headed. A coin is chosen at random and tossed 2 times. Let P denote the probability of heads of the selected coin, and X the number of heads.
1. Find the joint probability density function of (P, X).
2. Find the probability density function of X.
3. Find the conditional probability density function of P given X = x for x ∈ {0, 1, 2}.
Answer

Compare the die-coin experiment with the box of coins experiment. In the first experiment, we toss a coin with a fixed probability of heads a random number of times. In the second experiment, we effectively toss a coin with a random probability of heads a fixed number of times.

Suppose that P has probability density function g(p) = 6p(1 − p) for p ∈ [0, 1]. Given P = p, a coin with probability of heads p is tossed 3 times. Let X denote the number of heads.
1. Find the joint probability density function of (P, X).
2. Find the probability density function of X.
3. Find the conditional probability density function of P given X = x for x ∈ {0, 1, 2, 3}. Graph these on the same axes.
Answer

Compare the box of coins experiment with the last experiment. In the second experiment, we effectively choose a coin from a box with a continuous infinity of coin types. The prior distribution of P and each of the posterior distributions of P in part (c) are members of the family of beta distributions, one of the reasons for the importance of the beta family. Beta distributions are studied in more detail in the chapter on Special Distributions.

In the simulation of the beta coin experiment, set a = b = 2 and n = 3 to get the experiment studied in the previous exercise. For various “true values” of p, run the experiment in single step mode a few times and observe the posterior probability density function on each run.
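In the last exercise the prior of P is the beta distribution with parameters (2, 2), and conjugacy gives the posterior of P given X = x as the beta distribution with parameters (x + 2, 5 − x). A sketch using scipy:

```python
from scipy.stats import beta

n = 3
for x in range(n + 1):
    post = beta(x + 2, 5 - x)        # posterior of P given X = x (Beta(2,2) prior, 3 tosses)
    print(x, post.mean())            # posterior mean of the probability of heads
```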
Simple Mixed Distributions

Recall that the exponential distribution with rate parameter r ∈ (0, ∞) has probability density function f given by f(t) = re^{−rt} for t ∈ [0, ∞). The exponential distribution is often used to model random times, under certain assumptions. The exponential distribution is studied in more detail in the chapter on the Poisson Process. Recall also that for a, b ∈ R with a < b, the continuous uniform distribution on the interval [a, b] has probability density function f given by f(x) = 1/(b − a) for x ∈ [a, b]. This distribution governs a point selected at random from the interval.
Suppose that there are 5 light bulbs in a box, labeled 1 to 5. The lifetime of bulb n (in months) has the exponential distribution with rate parameter n . A bulb is selected at random from the box and tested. 1. Find the probability that the selected bulb will last more than one month. 2. Given that the bulb lasts more than one month, find the conditional probability density function of the bulb number. Answer Suppose that X is uniformly distributed on {1, 2, 3}, and given X = x ∈ {1, 2, 3}, random variable Y is uniformly distributed on the interval [0, x]. 1. Find the joint probability density function of (X, Y ). 2. Find the probability density function of Y . 3. Find the conditional probability density function of X given Y
= y for y ∈ [0, 3].
Answer
The Poisson Distribution

Recall that the Poisson distribution with parameter a ∈ (0, ∞) has probability density function g(n) = e^{−a} aⁿ / n! for n ∈ N. This distribution is widely used to model the number of “random points” in a region of time or space; the parameter a is proportional to the size of the region. The Poisson distribution is named for Simeon Poisson, and is studied in more detail in the chapter on the Poisson Process.
Suppose that N is the number of elementary particles emitted by a sample of radioactive material in a specified period of time, and has the Poisson distribution with parameter a . Each particle emitted, independently of the others, is detected by a counter with probability p ∈ (0, 1) and missed with probability 1 − p . Let Y denote the number of particles detected by the counter.
1. For n ∈ N , argue that the conditional distribution of Y given N = n is binomial with parameters n and p. 2. Find the joint probability density function of (N , Y ). 3. Find the probability density function of Y . 4. For y ∈ N , find the conditional probability density function of N given Y = y . Answer The fact that Y also has a Poisson distribution is an interesting and characteristic property of the distribution. This property is explored in more depth in the section on thinning the Poisson process.
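The thinning property can also be checked by simulation. The following sketch, with illustrative values a = 5 and p = 0.3, compares the detected counts with the Poisson distribution with parameter ap:

```python
import numpy as np

rng = np.random.default_rng(0)
a, p, reps = 5.0, 0.3, 100_000
N = rng.poisson(a, size=reps)            # emitted particles
Y = rng.binomial(N, p)                   # detected particles: binomial given N = n
print(Y.mean())                          # should be close to a * p = 1.5
print((Y == 0).mean(), np.exp(-a * p))   # empirical vs exact P(Y = 0) under Poisson(ap)
```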
Simple Continuous Distributions

Suppose that (X, Y) has probability density function f defined by f(x, y) = x + y for (x, y) ∈ (0, 1)².
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Find P(1/4 ≤ Y ≤ 3/4 ∣ X = 1/3).
4. Are X and Y independent?
Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 2(x + y) for 0 < x < y < 1.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Find P(Y ≥ 3/4 ∣ X = 1/2).
4. Are X and Y independent?
Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 15x²y for 0 < x < y < 1.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Find P(X ≤ 1/4 ∣ Y = 1/3).
4. Are X and Y independent?
Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 6x²y for 0 < x < 1 and 0 < y < 1.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, 1).
3. Are X and Y independent?
Answer

Suppose that (X, Y) has probability density function f defined by f(x, y) = 2e^{−x} e^{−y} for 0 < x < y < ∞.
1. Find the conditional probability density function of X given Y = y for y ∈ (0, ∞).
2. Find the conditional probability density function of Y given X = x for x ∈ (0, ∞).
3. Are X and Y independent?
Answer

Suppose that X is uniformly distributed on the interval (0, 1), and that given X = x, Y is uniformly distributed on the interval (0, x).
1. Find the joint probability density function of (X, Y).
2. Find the probability density function of Y.
3. Find the conditional probability density function of X given Y = y for y ∈ (0, 1).
4. Are X and Y independent?
Answer
Suppose that X has probability density function g defined by g(x) = 3x² for x ∈ (0, 1). The conditional probability density function of Y given X = x is h(y ∣ x) = 3y²/x³ for y ∈ (0, x).
1. Find the joint probability density function of (X, Y).
2. Find the probability density function of Y.
3. Find the conditional probability density function of X given Y = y.
4. Are X and Y independent?
Answer
Multivariate Uniform Distributions

Multivariate uniform distributions give a geometric interpretation of some of the concepts in this section. Recall that for n ∈ N₊, the standard measure λₙ on Rⁿ is given by
λₙ(A) = ∫_A 1 dx, A ⊆ Rⁿ (3.5.29)
In particular, λ₁(A) is the length of A ⊆ R, λ₂(A) is the area of A ⊆ R², and λ₃(A) is the volume of A ⊆ R³.
Details

Suppose now that X takes values in Rʲ, Y takes values in Rᵏ, and that (X, Y) is uniformly distributed on a set R ⊆ Rʲ⁺ᵏ.

3.9: General Distribution Functions

F is discontinuous at x if and only if μ{x} > 0, so that μ has an atom at x. So μ is a discrete measure (recall that this means that μ has countable support) if and only if F is a step function.

Suppose again that F is a distribution function and μ is the positive measure on (R, R) associated with F. If a ∈ R then
1. μ(a, ∞) = F(∞) − F(a)
2. μ[a, ∞) = F(∞) − F(a⁻)
3. μ(−∞, a] = F(a) − F(−∞)
4. μ(−∞, a) = F(a⁻) − F(−∞)
5. μ(R) = F(∞) − F(−∞)
Proof
Distribution Functions on [0, ∞)

Positive measures and distribution functions on [0, ∞) are particularly important in renewal theory and Poisson processes, because they model random times.

The discrete case. Suppose that G is discrete, so that there exists a countable set C ⊂ [0, ∞) with G(Cᶜ) = 0. Let g(t) = G{t} for t ∈ C, so that g is the density function of G with respect to counting measure on C. If u : [0, ∞) → R is locally bounded then
∫₀ᵗ u(s) dG(s) = ∑_{s∈C∩[0,t]} u(s)g(s) (3.9.7)
Figure 3.9.1 : A discrete measure
In the discrete case, the distribution is often arithmetic. Recall that this means that the countable set C is of the form {nd : n ∈ N} for some d ∈ (0, ∞).

The continuous case. Suppose that G is absolutely continuous with respect to Lebesgue measure on [0, ∞) with density function g : [0, ∞) → [0, ∞). If u : [0, ∞) → R is locally bounded then
∫₀ᵗ u(s) dG(s) = ∫₀ᵗ u(s)g(s) ds (3.9.8)
Figure 3.9.2 : A continuous measure
The mixed case. Suppose that there exists a countable set C ⊂ [0, ∞) with G(C) > 0 and G(Cᶜ) > 0, and that G restricted to subsets of Cᶜ is absolutely continuous with respect to Lebesgue measure. Let g(t) = G{t} for t ∈ C and let h be a density with respect to Lebesgue measure of G restricted to subsets of Cᶜ. If u : [0, ∞) → R is locally bounded then
∫₀ᵗ u(s) dG(s) = ∑_{s∈C∩[0,t]} u(s)g(s) + ∫₀ᵗ u(s)h(s) ds (3.9.9)
Figure 3.9.3 : A mixed measure
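As a numerical illustration of the mixed case (3.9.9), the following sketch evaluates the integral of u(s) dG(s) over [0, t] for a hypothetical mixed measure with atoms at 0 and 1 and continuous density h(s) = (1/2)e^{−s}:

```python
from scipy.integrate import quad
import math

atoms = {0.0: 0.25, 1.0: 0.25}                 # discrete part g on C = {0, 1}
h = lambda s: 0.5 * math.exp(-s)               # density of the continuous part

def integral_dG(u, t):
    """Integral of u(s) dG(s) over [0, t]: sum over atoms plus an ordinary integral."""
    disc = sum(u(s) * w for s, w in atoms.items() if s <= t)
    cont = quad(lambda s: u(s) * h(s), 0, t)[0]
    return disc + cont

print(integral_dG(lambda s: s**2, 2.0))
```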
The three special cases do not exhaust the possibilities, but are by far the most common cases in applied problems. This page titled 3.9: General Distribution Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
3.10: The Integral With Respect to a Measure

Probability density functions have very different interpretations for discrete distributions as opposed to continuous distributions. For a discrete distribution, the probability of an event is computed by summing the density function over the outcomes in the event, while for a continuous distribution, the probability is computed by integrating the density function over the outcomes. For a mixed distribution, we have partial discrete and continuous density functions, and the probability of an event is computed by summing and integrating. The various types of density functions can be unified under a general theory of integration, which is the subject of this section. This theory has enormous importance in probability, far beyond just density functions. Expected value, which we consider in the next chapter, can be interpreted as an integral with respect to a probability measure. Beyond probability, the general theory of integration is of fundamental importance in many areas of mathematics.
Basic Theory

Definitions

Our starting point is a measure space (S, S, μ). That is, S is a set, S is a σ-algebra of subsets of S, and μ is a positive measure on S. As usual, the most important special cases are
Euclidean space: S = Rⁿ for some n ∈ N₊, S = Rⁿ (the σ-algebra of Lebesgue measurable subsets of Rⁿ), and μ = λₙ, standard n-dimensional Lebesgue measure.
Discrete space: S is a countable set, S is the collection of all subsets of S, and μ = #, counting measure.
Probability space: S is the set of outcomes of a random experiment, S is the σ-algebra of events, and μ = P, a probability measure.
The following definition reflects the fact that in measure theory, sets of measure 0 are often considered unimportant. Consider a statement with x ∈ S as a free variable. Technically such a statement is a predicate on S. Suppose that A ∈ S.
1. The statement holds on A if it is true for every x ∈ A.
2. The statement holds almost everywhere on A (with respect to μ) if there exists B ∈ S with B ⊆ A such that the statement holds on B and μ(A ∖ B) = 0.

A typical statement that we have in mind is an equation or an inequality with x ∈ S as a free variable. Our goal is to define the integral of certain measurable functions f : S → R, with respect to the measure μ. The integral may exist as a number in R (in which case we say that f is integrable), or may exist as ∞ or −∞, or may not exist at all. When it exists, the integral is denoted variously by
∫_S f dμ, ∫_S f(x) dμ(x), ∫_S f(x) μ(dx) (3.10.1)
We will use the first two. Since the set of extended real numbers R* = R ∪ {−∞, ∞} plays an important role in the theory, we need to recall the arithmetic of ∞ and −∞. Here are the conventions that are appropriate for integration:

Arithmetic on R*
1. If a ∈ (0, ∞] then a · ∞ = ∞ and a · (−∞) = −∞
2. If a ∈ [−∞, 0) then a · ∞ = −∞ and a · (−∞) = ∞
3. 0 · ∞ = 0 and 0 · (−∞) = 0
4. If a ∈ R then a + ∞ = ∞ and a + (−∞) = −∞
5. ∞ + ∞ = ∞
6. −∞ + (−∞) = −∞

However, ∞ − ∞ is not defined (because it does not make consistent sense) and we must be careful never to produce this indeterminate form. You might recall from calculus that 0 · ∞ is also an indeterminate form. However, for the theory of integration, the convention that 0 · ∞ = 0 is convenient and consistent. In terms of order, of course, −∞ < a < ∞ for a ∈ R.
We also need to extend topology and measure to R*. In terms of the first, (a, ∞] is an open neighborhood of ∞ and [−∞, a) is an open neighborhood of −∞ for every a ∈ R. This ensures that if xₙ ∈ R for n ∈ N₊ then xₙ → ∞ or xₙ → −∞ as n → ∞ has its usual calculus meaning. Technically this topology results in the two-point compactification of R. Now we can give R* the Borel σ-algebra R*, that is, the σ-algebra generated by the topology. Basically, this simply means that if A ∈ R then A ∪ {∞}, A ∪ {−∞}, and A ∪ {−∞, ∞} are all in R*.
Desired Properties

As motivation for the definition, every version of integration should satisfy some basic properties. First, the integral of the indicator function of a measurable set should simply be the size of the set, as measured by μ. This gives our first definition:

If A ∈ S then ∫_S 1_A dμ = μ(A).
This definition hints at the intimate relationship between measure and integration. We will construct the integral from the measure μ in this section, but this first property shows that if we started with the integral, we could recover the measure. This property also shows why we need ∞ as a possible value of the integral, and coupled with some of the properties below, why −∞ is also needed. Here is a simple corollary of our first definition. ∫
S
0 dμ = 0
Proof We give three more essential properties that we want. First are the linearity properties in two parts—part (a) is the additive property and part (b) is the scaling property. If f ,
are measurable functions whose integrals exist, and c ∈ R , then
g : S → R
1. ∫ 2. ∫
S S
(f + g) dμ = ∫
S
cf dμ = c ∫
S
f dμ + ∫
f dμ
S
.
g dμ
as long as the right side is not of the form ∞ − ∞
The additive property almost implies the scaling property To be more explicit, we want the additivity property (a) to hold if at least one of the integrals on the right is finite, or if both are ∞ or if both are −∞ . What is ruled out are the two cases where one integral is ∞ and the other is −∞ , and this is what is meant by the indeterminate form ∞ − ∞ . Our next essential properties are the order properties, again in two parts—part (a) is the positive property and part (b) is the increasing property. Suppose that f ,
g : S → R
are measurable.
1. If f ≥ 0 on S then ∫ f dμ ≥ 0 . 2. If the integrals of f and g exist and f S
≤g
on S then ∫
S
f dμ ≤ ∫
S
g dμ
The positive property and the additive property imply the increasing property Our last essential property is perhaps the least intuitive, but is a type of continuity property of integration, and is closely related to the continuity property of positive measure. The official name is the monotone convergence theorem. Suppose that f
n
: S → [0, ∞)
is measurable for n ∈ N
+
∫ S
and that f is increasing in n . Then n
lim fn dμ = lim ∫
n→∞
n→∞
fn dμ
(3.10.3)
S
Note that since f is increasing in n , lim f (x) exists in R ∪ {∞} for each x ∈ R (and the limit defines a measurable function). This property shows that it is sometimes convenient to allow nonnegative functions to take the value ∞. Note also that by the increasing property, ∫ f dμ is increasing in n and hence also has a limit in R ∪ {∞} . n
n→∞
S
n
n
To see the connection with measure, suppose that (A , A , …) is an increasing sequence of sets in S , and let A = ⋃ A . Note that 1 is increasing in n ∈ N and 1 → 1 as n → ∞ . For this reason, the union A is sometimes called the limit of A as n → ∞ . The continuity theorem of positive measure states that μ(A ) → μ(A) as n → ∞ . Equivalently, ∫ 1 dμ → ∫ 1 dμ as n → ∞ , so the continuity theorem of positive measure is a special case of the monotone convergence theorem. ∞
1
An
+
An
2
i
i=1
A
n
n
3.10.2
S
An
S
A
https://stats.libretexts.org/@go/page/10150
Armed with the properties that we want, the definition of the integral is fairly straightforward, and proceeds in stages. We give the definition successively for 1. Nonnegative simple functions 2. Nonnegative measurable functions 3. Measurable real-valued functions Of course, each definition should agree with the previous one on the functions that are in both collections.
Simple Functions A simple function on S is simply a measurable, real-valued function with finite range. Simple functions are usually expressed as linear combinations of indicator functions. Representations of simple functions 1. Suppose that I is a finite index set, a ∈ R for each i ∈ I , and {A : i ∈ I } is a collection of sets in S that partition S . Then f = ∑ a 1 is a simple function. Expressing a simple function in this form is a representation of f . 2. A simple function f has a unique representation as f = ∑ b 1 where J is a finite index set, {b : j ∈ J} is a set of distinct real numbers, and {B : j ∈ J} is a collection of nonempty sets in S that partition S . This representation is known as the canonical representation. i
i
i∈I
i
Ai
j
j∈J
Bj
j
j
Proof You might wonder why we don't just always use the canonical representation for simple functions. The problem is that even if we start with canonical representations, when we combine simple functions in various ways, the resulting representations may not be canonical. The collection of simple functions is closed under the basic arithmetic operations, and in particular, forms a vector space. Suppose that f and g are simple functions with representations f 1. f + g is simple, with representation f + g = ∑ 2. f g is simple, with representation f g = ∑ (a b 3. cf is simple, with representation cf = ∑ ca 1 .
(i,j)∈I×J
i
(i,j)∈I×J
i∈I
i
=∑
i∈I
(ai + bj )1Ai ∩Bj
j )1Ai ∩Bj
and g = ∑
ai 1Ai
j∈J
bj 1Bj
, and that c ∈ R . Then
.
.
Ai
Proof As we alluded to earlier, note that even if the representations of f and g are canonical, the representations for f + g and not be. The next result treats composition, and will be important for the change of variables theorem in the next section. Suppose that (T , T ) is another measurable space, and that f : S → T is measurable. If g is a simple function on representation g = ∑ b 1 , then g ∘ f is a simple function on S with representation g ∘ f = ∑ b 1 . i∈I
i
Bi
i∈I
i
f
−1
fg
T
may
with
( Bi )
Proof Given the definition of the integral of an indicator function in (3) and that we want the linearity property (5) to hold, there is no question as to how we should define the integral of a nonnegative simple function. Suppose that f is a nonnegative simple function, with the representation f ∫ S
=∑
i∈I
f dμ = ∑ ai μ(Ai )
ai 1A
i
where a
i
≥0
for i ∈ I . We define (3.10.4)
i∈I
The definition is consistent Note that if f is a nonnegative simple function, then ∫ f dμ exists in [0, ∞], so the order properties holds. We next show that the linearity properties are satisfied for nonnegative simple functions. S
Suppose that f and g are nonnegative simple functions, and that c ∈ [0, ∞). Then 1. ∫ 2. ∫
S S
(f + g) dμ = ∫
S
cf dμ = c ∫
S
f dμ + ∫
S
g dμ
f dμ
3.10.3
https://stats.libretexts.org/@go/page/10150
Proof The increasing property holds for nonnegative simple functions. Suppose that f and g are nonnegative simple functions and f
≤g
on S . Then ∫
S
f dμ ≤ ∫
S
g dμ
Proof Next we give a version of the continuity theorem in (7) for simple functions. It's not completely general, but will be needed for the next subsection where we do prove the general version. Suppose that ∞
A =⋃
n=1
is a nonnegative simple function and that . then
f
An
∫
1An f dμ → ∫
S
(A1 , A2 , …)
is an increasing sequence of sets in
1A f dμ as n → ∞
S
with
(3.10.12)
S
Proof Note that 1 theorem.
An
f
is increasing in n ∈ N
+
and 1
An
f → 1A f
as n → ∞ , so this really is a special case of the monotone convergence
Nonnegative Functions Next we will consider nonnegative measurable functions on S . First we note that a function of this type is the limit of nonnegative simple functions. Suppose that f : S → [0, ∞) is measurable. Then there exists an increasing sequence functions with f → f on S as n → ∞ .
(f1 , f2 , …)
of nonnegative simple
n
Proof The last result points the way towards the definition of the integral of a measurable function f : S → [0, ∞) in terms of the integrals of simple functions. If g is a nonnegative simple function with g ≤ f , then by the order property, we need ∫ g dμ ≤ ∫ f dμ . On the other hand, there exists a sequence of nonnegative simple function converging to f . Thus the continuity property suggests the following definition: S
If f
S
: S → [0, ∞)
is measurable, define ∫
f dμ = sup {∫
S
g dμ : g is simple and 0 ≤ g ≤ f }
(3.10.15)
S
Note that ∫ f dμ exists in [0, ∞] so the positive property holds. Note also that if f is simple, the new definition agrees with the old one. As always, we need to establish the essential properties. First, the increasing property holds. S
If f ,
g : S → [0, ∞)
are measurable and f
≤g
on S then ∫
S
f dμ ≤ ∫
S
g dμ
.
Proof We can now prove the continuity property known as the monotone convergence theorem in full generality. Suppose that f
n
: S → [0, ∞)
is measurable for n ∈ N
+
∫ S
and that f is increasing in n . Then n
lim fn dμ = lim ∫
n→∞
n→∞
fn dμ
(3.10.17)
S
Proof If f : S → [0, ∞) is measurable, then by the theorem above, there exists an increasing sequence (f , f , …) of simple functions with f → f as n → ∞ . By the monotone convergence theorem in (18), ∫ f dμ → ∫ f dμ as n → ∞ . These two facts can be used to establish other properties of the integral of a nonnegative function based on our knowledge that the properties hold for simple functions. This type of argument is known as bootstrapping. We use bootstrapping to show that the linearity properties hold: 1
n
S
3.10.4
n
2
S
https://stats.libretexts.org/@go/page/10150
If f ,
are measurable and c ∈ [0, ∞), then
g : S → [0, ∞)
1. ∫ 2. ∫
S S
(f + g) dμ = ∫
S
cf dμ = c ∫
S
f dμ + ∫
S
g dμ
f dμ
Proof
General Functions Our final step is to define the integral of a measurable function f +
x
. First, recall the positive and negative parts of x ∈ R:
: S → R −
= max{x, 0}, x
= max{−x, 0}
(3.10.19)
Note that x ≥ 0 , x ≥ 0 , x = x − x , and |x| = x + x . Given that we want the integral to have the linearity properties in (5), there is no question as to how we should define the integral of f in terms of the integrals of f and f , which being nonnegative, are defined by the previous subsection. +
−
+
−
+
−
+
If f
: S → R
−
is measurable, we define ∫
f dμ = ∫
S
f
+
dμ − ∫
S
−
f
dμ
(3.10.20)
S
assuming that at least one of the integrals on the right is finite. If both are finite, then f is said to be integrable. Assuming that either the integral of the positive part or the integral of the negative part is finite ensures that we do not get the dreaded indeterminate form ∞ − ∞ . Suppose that f
: S → R
is measurable. Then f is integrable if and only if ∫
S
|f | dμ < ∞
.
Proof Note that if f is nonnegative, then our new definition agrees with our old one, since f integral has the same basic form as for nonnegative simple functions: Suppose that f is a simple function with the representation f ∫
=∑
i∈I
ai 1A
i
+
=f
and f
−
=0
. For simple functions the
. Then
f dμ = ∑ ai μ(Ai )
S
(3.10.21)
i∈I
assuming that the sum does not have both ∞ and −∞ terms. Proof Once again, we need to establish the essential properties. Our first result is an intermediate step towards linearity. If f , g : S → [0, ∞) are measurable then right is finite.
∫
S
(f − g) dμ = ∫
S
f dμ − ∫
S
g dμ
as long as at least one of the integrals on the
Proof We finally have the linearity properties in full generality. If f ,
g : S → R
1. ∫ 2. ∫
S S
are measurable functions whose integrals exist, and c ∈ R , then
(f + g) dμ = ∫
S
cf dμ = c ∫
S
f dμ + ∫
S
g dμ
as long as the right side is not of the form ∞ − ∞ .
f dμ
Proof In particular, note that if f and g are integrable, then so are f + g and cf for c ∈ R . Thus, the set of integrable functions on (S, S , μ) forms a vector space, which is denoted L (S, S , μ). The L is in honor of Henri Lebesgue, who first developed the theory. This vector space, and other related ones, will be studied in more detail in the section on function spaces. We also have the increasing property in full generality.
3.10.5
https://stats.libretexts.org/@go/page/10150
If f ,
g : S → R
are measurable functions whose integrals exist, and if f
≤g
on S then ∫
S
f dμ ≤ ∫
S
g dμ
Proof
The Integral Over a Set Now that we have defined the integral of a measurable function f over all of S , there is a natural extension to the integral of f over a measurable subset If f
: S → R
is measurable and A ∈ S , we define ∫
f dμ = ∫
A
1A f dμ
(3.10.31)
S
assuming that the integral on the right exists. If f
: S → R
is a measurable function whose integral exists and A ∈ S , then the integral of f over A exists.
Proof On the other hand, it's clearly possible for ∫
A
to exist for some A ∈ S , but not ∫
f dμ
S
f dμ
.
We could also simply think of ∫ f dμ as the integral of a measurable function f : A → R over the measure space (A, S , μ ), where S = {B ∈ S : B ⊆ A} = {C ∩ A : C ∈ S } is the σ-algebra of measurable subsets of A , and where μ is the restriction of μ to S . It follows that all of the essential properties hold for integrals over A : the linearity properties, the order properties, and the monotone convergence theorem. The following property is a simple consequence of the general additive property, and is known as additive property for disjoint domains. A
A
A
A
A
A
Suppose that f
: S → R
is a measurable function whose integral exists, and that A, ∫
f dμ = ∫
A∪B
f dμ + ∫
A
B ∈ S
are disjoint. then
f dμ
(3.10.32)
B
Proof By induction, the additive property holds for a finite collection of disjoint domains. The extension to a countably infinite collection of disjoint domains will be considered in the next section on properties of the integral.
Special Cases Discrete Spaces Recall again that the measure space (S, S , #) is discrete if S is countable, S is the collection of all subsets of S , and # is counting measure on S . Thus all functions f : S → R are measurable, and and as we will see, integrals with respect to # are simply sums. If f
: S → R
then ∫ S
f d# = ∑ f (x)
(3.10.34)
x∈S
as long as either the sum of the positive terms or the sum of the negative terms in finite. Proof If the sum of the positive terms and the sum of the negative terms are both finite, then f is integrable with respect to #, but the usual term from calculus is that the series ∑ f (x) is absolutely convergent. The result will look more familiar in the special case S = N . Functions on S are simply sequences, so we can use the more familiar notation a rather than a(i) for a function a : S → R . Part (b) of the proof (with A = {1, 2, … , n}) is just the definition of an infinite series of nonnegative terms as the limit of the partial sums: x∈S
+
i
n
∞
n
∑ ai = lim ∑ ai
(3.10.36)
n→∞
i=1
i=1
3.10.6
https://stats.libretexts.org/@go/page/10150
Part (c) of the proof is just the definition of a general infinite series ∞
∞
∞ +
∑ ai = ∑ a
i
i=1
−
−∑a
(3.10.37)
i
i=1
i=1
as long as one of the series on the right is finite. Again, when both are finite, the series is absolutely convergent. In calculus we also consider conditionally convergent series. This means that ∑ a = ∞ , ∑ a = ∞ , but lim ∑ a exists in R . Such series have no place in general integration theory. Also, you may recall that such series are pathological in the sense that, given any number in R , there exists a rearrangement of the terms so that the rearranged series converges to the given number. ∞
+
∞
−
i=1
i
i=1
i
n
n→∞
i=1
i
∗
The Lebesgue and Riemann Integrals on R Consider the one-dimensional Euclidean space (R, R, λ) where R is the usual σ-algebra of Lebesgue measurable sets and λ is Lebesgue measure. The theory developed above applies, of course, for the integral ∫ f dμ of a measurable function f : R → R over a set A ∈ R . It's not surprising that in this special case, the theory of integration is referred to as Lebesgue integration in honor of our good friend Henri Lebesgue, who first developed the theory. A
On the other hand, we already have a theory of integration on R, namely the Riemann integral of calculus, named for our other good friend Georg Riemann. For a suitable function f and domain A this integral is denoted ∫ f (x) dx, as we all remember from calculus. How are the two integrals related? As we will see, the Lebesgue integral generalizes the Riemann integral. A
To understand the connection we need to review the definition of the Riemann integral. Consider first the standard case where the domain of integration is a closed, bounded interval. Here are the preliminary definitions that we will need. Suppose that f
: [a, b] → R
, where a,
b ∈ R
and a < b .
1. A partition A = {A : i ∈ I } of [a, b] is a finite collection of disjoint subintervals whose union is [a, b]. 2. The norm of a partition A is ∥A∥ = max{λ(A ) : i ∈ I } , the length of the largest subinterval of A . 3. A set of points B = {x : i ∈ I } where x ∈ A for each i ∈ I is said to be associated with the partition A . 4. The Riemann sum of f corresponding to a partition A and and a set B associated with A is i
i
i
i
i
R (f , A , B) = ∑ f (xi )λ(Ai )
(3.10.38)
i∈I
Note that the Riemann sum is simply the integral of the simple function g = ∑ f (x )1 . Moreover, since A is an interval for each i ∈ I , g is a step function, since it is constant on a finite collection of disjoint intervals. Moreover, again since A is an interval for each i ∈ I , λ(A ) is simply the length of the subinterval A , so of course measure theory per se is not needed for Riemann integration. Now for the definition from calculus: i∈I
i
Ai
i
i
i
i
is Riemann integrable on [a, b] if there exists r ∈ R with the property that for every ϵ > 0 there exists δ > 0 such that if A is a partition of [a, b] with ∥A∥ < δ then |r − R (f , A , B)| < ϵ for every set of points B associated with A . Then of course we define the integral by f
b
∫
f (x) dx = r
(3.10.39)
a
Here is our main theorem of this subsection. If f
: [a, b] → R
is Riemann integrable on [a, b] then f is Lebesgue integrable on [a, b] and b
∫ [a,b]
f dλ = ∫
f (x) dx
(3.10.40)
a
On the other hand, there are lots of functions that are Lebesgue integrable but not Riemann integrable. In fact there are indicator functions of this type, the simplest of functions from the point of view of Lebesgue integration. Consider the function 1 where as usual, Q is the set of rational number in R. Then Q
1. ∫
R
1Q dλ = 0
.
3.10.7
https://stats.libretexts.org/@go/page/10150
2. 1 is not Riemann integrable on any interval [a, b] with a < b . Q
Proof The following fundamental theorem completes the picture. f : [a, b] → R
is Riemann integrable on [a, b] if and only if f is bounded on [a, b] and f is continuous almost everywhere on
.
[a, b]
Now that the Riemann integral is defined for a closed bounded interval, it can be extended to other domains. Extensions of the Riemann integral. b
t
1. If f is defined on [a, b) and Riemann integrable on [a, t] for a < t < b , we define ∫ f (x) dx = lim ∫ f (x) dx if the limit exists in R . 2. If f is defined on (a, b] and Riemann integrable on [t, b] for a < t < b , we define ∫ f (x) dx = lim ∫ f (x) dx if the limit exists in R . 3. If f is defined on (a, b), we select c ∈ (a, b) and define ∫ f (x) dx = ∫ f (x) dx + ∫ f (x), dx if the integrals on the right exist in R by (a) and (b), and are not of the form ∞ − ∞ . 4. If f is defined an [a, ∞) and Riemann integrable on [a, t] for a < t < ∞ we define ∫ f (x) dx = lim ∫ f (x) dx. 5. if f is defined on (−∞, b] and Riemann integrable on [t, b] for −∞ < t < b we define t↑b
a
a
∗
b
b
t↓a
a
t
∗
b
c
a
b
a
c
∗
∞
a
b
b
t
t→∞
a
if the limit exists in R 6. if f is defined on R we select c ∈ R and define ∫ f (x) dx = ∫ f (x) dx + ∫ f (x) dx if both integrals on the right exist by (d) and (e), and are not of the form ∞ − ∞ . 7. The integral is be defined for a domain that is the union of a finite collection of disjoint intervals by the requirement that the integral be additive over disjoint domains ∫
−∞
f (x) dx = limt→−∞ ∫
t
∗
f (x) dx
∞
c
−∞
−∞
∞
c
As another indication of its superiority, note that none of these convolutions is necessary for the Lebesgue integral. Once and for all, we have defined ∫ f (x) dx for a general measurable function f : R → R and a general domain A ∈ R A
The Lebesgue-Stieltjes Integral Consider again the measurable space (R, R) where R is the usual σ-algebra of Lebesgue measurable subsets of R. Suppose that F : R → R is a general distribution function, so that by definition, F is increasing and continuous from the right. Recall that the Lebesgue-Stieltjes measure μ associated with F is the unique measure on R that satisfies μ(a, b] = F (b) − F (a);
a, b ∈ R, a < b
(3.10.42)
Recall that F satisfies some, but not necessarily all of the properties of a probability distribution function. The properties not necessarily satisfied are the normalizing properties F (x) → 0 F (x) → 1
as x → −∞ as x → ∞
If F does satisfy these two additional properties, then μ is a probability measure and F its probability distribution function. The integral with respect to the measure μ is, appropriately enough, referred to as the Lebesgue-Stieltjes integral with respect to F , and like the measure, is named for the ubiquitous Henri Lebesgue and for Thomas Stieltjes. In addition to our usual notation ∫ f dμ , the Lebesgue-Stieltjes integral is also denoted ∫ f dF and ∫ f (x) dF (x). S
S
S
Probability Spaces Suppose that (S, S , P) is a probability space, so that S is the set of outcomes of a random experiment, S is the σ-algebra of events, and P the probability measure on the sample space (S, S ). A measurable, real-valued function X on S is, of course, a realvalued random variable. The integral with respect to P, if it exists, is the expected value of X and is denoted E(X) = ∫
X dP
(3.10.43)
S
3.10.8
https://stats.libretexts.org/@go/page/10150
This concept is of fundamental importance in probability theory and is studied in detail in a separate chapter on Expected Value, mostly from an elementary point of view that does not involve abstract integration. However an advanced section treats expected value as an integral over the underlying probability measure, as above. Suppose next that (T , T , #) is a discrete space and that X is a random variable for the experiment, taking values in T . In this case X has a discrete distribution and the probability density function f of X is given by f (x) = P(X = x) for x ∈ T . More generally, P(X ∈ A) = ∑ f (x) = ∫
f d#,
A ⊆T
(3.10.44)
A
x∈A
On the other hand, suppose that X is a random variable with values in R , where as usual, (R , R , λ ) is Euclidean space. If X has a continuous distribution, then f : T → [0, ∞) is a probability density function of X if n
n
n
P(X ∈ A) = ∫
f dλn ,
A ∈ R
n
n
n
-dimensional
(3.10.45)
A
Technically, f is the density function of X with respect to counting measure # in the discrete case, and f is the density function of X with respect to Lebesgue measure λ in the continuous case. In both cases, the probability of an event A is computed by integrating the density function, with respect to the appropriate measure, over A . There are still differences, however. In the discrete case, the existence of the density function with respect to counting measure is guaranteed, and indeed we have an explicit formula for it. In the continuous case, the existence of a density function with respect to Lebesgue measure is not guaranteed, and indeed there might not be one. More generally, suppose that we have a measure space (T , T , μ) and a random variable X with values in T . A measurable function f : T → [0, ∞) is a probability density function of X (or more precisely, the distribution of X) with respect to μ if n
P(X ∈ A) = ∫
f dμ,
A ∈ T
(3.10.46)
A
This fundamental question of the existence of a density function will be clarified in the section on absolute continuity and density functions. Suppose again that X is a real-valued random variable with distribution function F . Then, by definition, the distribution of X is the Lebesgue-Stieltjes measure associated with F : P(a < X ≤ b) = F (b) − F (a),
a, b ∈ R, a < b
(3.10.47)
regardless of whether the distribution is discrete, continuous, or mixed. Trivially, P(X ∈ A) = ∫ 1 dF for A ∈ R and the expected value of X defined above can also be written as E(X) = ∫ x dF (x). Again, all of this will be explained in much more detail in the next chapter on Expected Value. S
A
R
Computational Exercises Let g(x) =
1 2
1+x
for x ∈ R.
∞
1. Find ∫ g(x) dx . 2. Show that ∫ xg(x) dx does not exist. −∞
∞
−∞
Answer You may recall that the function g in the last exercise is important in the study of the Cauchy distribution, named for Augustin Cauchy. You may also remember that the graph of g is known as the witch of Agnesi, named for Maria Agnesi. Let g(x) =
1 b
x
for x ∈ [1, ∞) where b > 0 is a parameter. Find ∫
∞
1
g(x) dx
Answer You may recall that the function Pareto.
g
in the last exercise is important in the study of the Pareto distribution, named for Vilfredo
Suppose that f (x) = 0 if x ∈ Q and f (x) = sin(x) if x ∈ R − Q . 1. Find ∫
[0,π]
f (x) dλ(x)
3.10.9
https://stats.libretexts.org/@go/page/10150
2. Does ∫
π
0
f (x) dx
exist?
Answer This page titled 3.10: The Integral With Respect to a Measure is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
3.10.10
https://stats.libretexts.org/@go/page/10150
3.11: Properties of the Integral Basic Theory Again our starting point is a measure space measure on S .
. That is,
(S, S , μ)
is a set,
S
is a σ-algebra of subsets of
S
S
, and
μ
is a positive
Definition In the last section we defined the integral of certain measurable functions f : S → R with respect to the measure μ . Recall that the integral, denoted ∫ f dμ , may exist as a number in R (in which case f is integrable), or may exist as ∞ or −∞ , or may fail to exist. Here is a review of how the definition is built up in stages: S
Definition of the integral 1. If f is a nonnegative simple function, so that f = ∑ { A : i ∈ I } is measurable partition of S , then
i∈I
where I is a finite index set, a
ai 1A
i
i
∈ [0, ∞)
for i ∈ I , and
i
∫
f dμ = ∑ ai μ(Ai )
S
2. If f
: S → [0, ∞)
is measurable, then ∫
f dμ = sup {∫
S
3. If f
: S → R
(3.11.1)
i∈I
g dμ : g is simple and 0 ≤ g ≤ f }
(3.11.2)
S
is measurable, then ∫
f dμ = ∫
S
f
+
dμ − ∫
S
f
−
dμ
(3.11.3)
S
as long as the right side is not of the form ∞ − ∞ , and where f and f denote the positive and negative parts of f . 4. If f : S → R is measurable and A ∈ S , then the integral of f over A is defined by +
∫
f dμ = ∫
A
−
1A f dμ
(3.11.4)
S
assuming that the integral on the right exists. Consider a statement on the elements of S , for example an equation or an inequality with x ∈ S as a free variable. (Technically such a statement is a predicate on S .) For A ∈ S , we say that the statement holds on A if it is true for every x ∈ A . We say that the statement holds almost everywhere on A (with respect to μ ) if there exists B ∈ S with B ⊆ A such that the statement holds on B and μ(A ∖ B) = 0 .
Basic Properties A few properties of the integral that were essential to the motivation of the definition were given in the last section. In this section, we extend some of those properties and we study a number of new ones. As a review, here is what we know so far. Properties of the integral 1. If f , g : S → R are measurable functions whose integrals exist, then ∫ (f + g) dμ = ∫ f dμ + ∫ g dμ as long as the right side is not of the form ∞ − ∞ . 2. If f : S → R is a measurable function whose integral exists and c ∈ R , then ∫ cf dμ = c ∫ f dμ . 3. If f : S → R is measurable and f ≥ 0 on S then ∫ f dμ ≥ 0 . 4. If f , g : S → R are measurable functions whose integrals exist and f ≤ g on S then ∫ f dμ ≤ ∫ g dμ 5. If f : S → [0, ∞) is measurable for n ∈ N and f is increasing in n on S then ∫ lim f dμ = lim ∫ f dμ . 6. f : S → R is measurable and the the integral of f on A ∪ B exists, where A, B ∈ S are disjoint, then ∫ f dμ = ∫ f dμ + ∫ f dμ . S
S
S
S
S
S
S
n
A∪B
+
A
n
S
S
n→∞
n
n→∞
S
n
B
3.11.1
https://stats.libretexts.org/@go/page/10151
Parts (a) and (b) are the linearity properties; part (a) is the additivity property and part (b) is the scaling property. Parts (c) and (d) are the order properties; part (c) is the positive property and part (d) is the increasing property. Part (e) is a continuity property known as the monotone convergence theorem. Part (f) is the additive property for disjoint domains. Properties (a)–(e) hold with S replaced by A ∈ S .
Equality and Order Our first new results are extensions dealing with equality and order. The integral of a function over a null set is 0: Suppose that f
: S → R
is measurable and A ∈ S with μ(A) = 0 . Then ∫
A
f dμ = 0
.
Proof Two functions that are indistinguishable from the point of view of μ must have the same integral. Suppose that f : S → R is a measurable function whose integral exists. If everywhere on S , then ∫ g dμ = ∫ f dμ . S
is measurable and
g : S → R
g =f
almost
S
Proof Next we have a simple extension of the positive property. Suppose that f 1. ∫ 2. ∫
S S
: S → R
is measurable and f
≥0
almost everywhere on S . Then
f dμ ≥ 0 f =0
if and only if f
=0
almost everywhere on S .
Proof So, if f ≥ 0 almost everywhere on S then ∫ f dμ > 0 if and only if μ{x ∈ S : f (x) > 0} > 0 . The simple extension of the positive property in turn leads to a simple extension of the increasing property. S
Suppose that f ,
g : S → R
are measurable functions whose integrals exist, and that f
1. ∫ f ≤ ∫ g 2. Except in the case that both integrals are ∞ or both −∞ , ∫ S
S
f dμ = ∫
S
S
g dμ
≤g
almost everywhere on S . Then
if and only if f
=g
almost everywhere on S .
Proof So if
almost everywhere on S then, except in the two cases mentioned, ∫ f dμ < ∫ g dμ if and only if . The exclusion when both integrals are ∞ or −∞ is important. A counterexample when this condition does not hold is given below. The next result is the absolute value inequality. f ≤g
S
S
μ{x ∈ S : f (x) < g(x)} > 0
Suppose that f
: S → R
is a measurable function whose integral exists. Then ∣ ∣ ∣∫ f dμ∣ ≤ ∫ |f | dμ ∣ S ∣ S
If f is integrable, then equality holds if and only if f
≥0
almost everywhere on S or f
(3.11.9)
≤0
almost everywhere on S .
Proof
Change of Variables Suppose that (T , T ) is another measurable space and that measures, ν defined by
u : S → T
−1
ν (B) = μ [ u
(B)] ,
is measurable. As we saw in our first study of positive
B ∈ T
(3.11.11)
is a positive measure on (T , T ). The following result is known as the change of variables theorem. If f
: T → R
is measurable then, assuming that the integrals exist, ∫ T
f dν = ∫
(f ∘ u) dμ
(3.11.12)
S
3.11.2
https://stats.libretexts.org/@go/page/10151
Proof The change of variables theorem will look more familiar if we give the variables explicitly. Thus, suppose that we want to evaluate ∫
f [u(x)] dμ(x)
(3.11.17)
S
where again, u : S → T and f
: T → R
. One way is to use the substitution u = u(x), find the new measure ν , and then evaluate ∫
g(u) dν (u)
(3.11.18)
T
Convergence Properties We start with a simple but important corollary of the monotone convergence theorem that extends the additivity property to a countably infinite sum of nonnegative functions. Suppose that f
n
: S → [0, ∞)
is measurable for n ∈ N . Then +
∞
∫
∞
∑ fn dμ = ∑ ∫
S n=1
n=1
fn dμ
(3.11.19)
S
Proof A theorem below gives a related result that relaxes the assumption that f be nonnegative, but imposes a stricter integrability requirement. Our next result is the additivity of the integral over a countably infinite collection of disjoint domains. Suppose that f : S → R is a measurable function whose integral exists, and that {A in S . Let A = ⋃ A . Then
n
: n ∈ N+ }
is a disjoint collection of sets
∞
n=1
n
∞
∫
f dμ = ∑ ∫
A
n=1
f dμ
(3.11.20)
An
Proof Of course, the previous theorem applies if f is nonnegative or if f is integrable. Next we give a minor extension of the monotone convergence theorem that relaxes the assumption that the functions be nonnegative. Monotone Convergence Theorem. Suppose that f : S → R is a measurable function whose integral exists for each n ∈ N and that f is increasing in n on S . If ∫ f dμ > −∞ then n
n
S
+
1
∫ S
lim fn dμ = lim ∫
n→∞
n→∞
fn dμ
(3.11.24)
S
Proof Here is the complementary result for decreasing functions. Suppose that f : S → R is a measurable function whose integral exists for each n ∈ N If ∫ f dμ < ∞ then n
S
+
and that f is decreasing in n on S . n
1
∫ S
lim fn dμ = lim ∫
n→∞
n→∞
fn dμ
(3.11.25)
S
Proof The additional assumptions on the integral of f in the last two extensions of the monotone convergence theorem are necessary. An example is given in below. 1
Our next result is also a consequence of the montone convergence theorem, and is called Fatou's lemma in honor of Pierre Fatou. Its usefulness stems from the fact that no assumptions are placed on the integrand functions, except that they be nonnegative and measurable.
3.11.3
https://stats.libretexts.org/@go/page/10151
Fatou's Lemma. Suppose that f
n
is measurable for n ∈ N . Then
: S → [0, ∞)
∫
+
lim inf fn dμ ≤ lim inf ∫ n→∞
S
n→∞
fn dμ
(3.11.26)
S
Proof Given the weakness of the hypotheses, it's hardly surprising that strict inequality can easily occur in Fatou's lemma. An example is given below. Our next convergence result is one of the most important and is known as the dominated convergence theorem. It's sometimes also known as Lebesgue's dominated convergence theorem in honor of Henri Lebesgue, who first developed all of this stuff in the context of R . The dominated convergence theorem gives a basic condition under which we may interchange the limit and integration operators. n
Dominated Convergence Theorem. Suppose that f : S → R is measurable for Suppose also that |f | ≤ g for n ∈ N where g : S → [0, ∞) is integrable. Then n
n ∈ N+
and that
limn→∞ fn
exists on
S
.
n
∫
lim fn dμ = lim ∫
n→∞
S
n→∞
fn dμ
(3.11.29)
S
Proof As you might guess, the assumption that |f | is uniformly bounded in n by an integrable function is critical. A counterexample when this assumption is missing is given below when this assumption is missing. The dominated convergence theorem remains true if lim f exists almost everywhere on S . The follow corollary of the dominated convergence theorem gives a condition for the interchange of infinite sum and integral. n
n→∞
n
Suppose that f
i
: S → R
is measurable for i ∈ N and that ∑
∞
+
i=1
∞
∫
is integrable. then
| fi | ∞
∑ fi dμ = ∑ ∫
S
i=1
fi dμ
(3.11.30)
S
i=1
Proof The following corollary of the dominated convergence theorem is known as the bounded convergence theorem. Bounded Convergence Theorem. Suppose that f : S → R is measurable for μ(A) < ∞ , lim f exists on A , and |f | is bounded in n ∈ N on A . Then n
n→∞
n
n
n ∈ N+
and there exists
A ∈ S
such that
+
∫
lim fn dμ = lim ∫
n→∞
A
n→∞
fn dμ
(3.11.31)
A
Proof Again, the bounded convergence remains true if lim f exists almost everywhere on particular for a probability space), the condition that μ(A) < ∞ automatically holds. n→∞
n
A
. For a finite measure space (and in
Product Spaces Suppose now that (S, S , μ) and (T , T , ν ) are σ-finite measure spaces. Please recall the basic facts about the product σ-algebra S ⊗ T of subsets of S × T , and the product measure μ ⊗ ν on S ⊗ T . The product measure space (S × T , S ⊗ T , μ ⊗ ν ) is the standard one that we use for product spaces. If f : S × T → R is measurable, there are three integrals we might consider. First, of course, is the integral of f with respect to the product measure μ ⊗ ν ∫
f (x, y) d(μ ⊗ ν )(x, y)
(3.11.32)
S×T
sometimes called a double integral in this context. But also we have the nested or iterated integrals where we integrate with respect to one variable at a time: ∫ S
(∫
f (x, y) dν (y)) dμ(x),
∫
T
T
3.11.4
(∫
f (x, y)dμ(x)) dν (y)
(3.11.33)
S
https://stats.libretexts.org/@go/page/10151
How are these integrals related? Well, just as in calculus with ordinary Riemann integrals, under mild conditions the three integrals are the same. The resulting important theorem is known as Fubini's Theorem in honor of the Italian mathematician Guido Fubini. Fubini's Theorem. Suppose that f ∫
: S ×T → R
is measurable. If the double integral on the left exists, then
f (x, y) d(μ ⊗ ν )(x, y) = ∫
S×T
∫
S
f (x, y) dν (y) dμ(x) = ∫
T
∫
T
f (x, y) dμ(x) dν (y)
(3.11.34)
S
Proof Of course, the double integral exists, and so Fubini's theorem applies, if either f is nonnegative or integrable with respect to μ ⊗ ν . When f is nonnegative, the result is sometimes called Tonelli's theorem in honor of another Italian mathematician, Leonida Tonelli. On the other hand, the iterated integrals may exist, and may be different, when the double integral does not exist. A counterexample and a second counterexample are given below. A special case of Fubini's theorem (and indeed part of the proof) is that we can compute the measure of a set in the product space by integrating the cross-sectional measures. If C
then
∈ S ⊗T
(μ ⊗ ν )(C ) = ∫
ν (Cx ) dμ(x) = ∫
S
where C
x
= {y ∈ T : (x, y) ∈ C }
for x ∈ S , and C
y
μ (C
y
) dν (y)
(3.11.39)
T
= {x ∈ S : (x, y) ∈ C }
for y ∈ T .
In particular, if C , D ∈ S ⊗ T have the property that ν (C ) = ν (D ) for all x ∈ S , or μ (C ) = μ (D ) for all y ∈ T (that is, C and D have the same cross-sectional measures with respect to one of the variables), then (μ ⊗ ν )(C ) = (μ ⊗ ν )(D) . In R with area, and in R with volume (Lebesgue measure in both cases), this is known as Cavalieri's principle, named for Bonaventura Cavalieri, yet a third Italian mathematician. Clearly, Italian mathematicians cornered the market on theorems of this sort. y
x
y
x
2
3
A simple corollary of Fubini's theorem is that the double integral of a product function over a product set is the product of the integrals. This result has important applications to independent random variables. Suppose that g : S → R and respectively. Then
h : T → R
∫
are measurable, and are either nonnegative or integrable with respect to
g(x)h(y)d(μ ⊗ ν )(x, y) = ( ∫
S×T
g(x) dμ(x)) ( ∫
S
h(y) dν (y))
μ
and ν ,
(3.11.40)
T
Recall that a discrete measure space consists of a countable set with the σ-algebra of all subsets and with counting measure. In such a space, integrals are simply sums and so Fubini's theorem allows us to rearrange the order of summation in a double sum. Suppose that I and J are countable and that negative terms is finite, then
aij ∈ R
∑ (i,j)∈I×J
for
i ∈ I
and
j∈ J
. If the sum of the positive terms or the sum of the
aij = ∑ ∑ aij = ∑ ∑ aij i∈I
j∈J
j∈J
(3.11.41)
i∈I
Often I = J = N , and in this case, a can be viewed as an infinite array, with i ∈ N the row number and j ∈ N number: +
ij
+
+
a11
a12
a13
…
a21
a22
a23
…
a31
a32
a33
…
⋮
⋮
⋮
⋮
3.11.5
the column
https://stats.libretexts.org/@go/page/10151
The significant point is that N is totally ordered. While there is no implied order of summation in the double sum ∑ a , the iterated sum ∑ ∑ a is obtained by summing over the rows in order and then summing the results by column in order, while the iterated sum ∑ ∑ a is obtained by summing over the columns in order and then summing the results by row in order. 2
+
∞
i=1
(i,j)∈N+
ij
∞
ij
j=1 ∞
∞
j=1
i=1
ij
Of course, only one of the product spaces might be discrete. Theorems (9) and (15) which give conditions for the interchange of sum and integral can be viewed as applications of Fubini's theorem, where one of the measure spaces is (S, S , μ) and the other is N with counting measure. +
Examples and Applications Probability Spaces Suppose that (Ω, F , P) is a probability space, so that Ω is the set of outcomes of a random experiment, F is the σ-algebra of events, and P is a probability measure on the sample space (Ω, F ). Suppose also that (S, S ) is another measurable space, and that X is a random variable for the experiment, taking values in S . Of course, this simply means that X is a measurable function from Ω to S . Recall that the probability distribution of X is the probability measure P on (S, S ) defined by X
PX (A) = P(X ∈ A),
A ∈ S
(3.11.42)
Since {X ∈ A} is just probability notation for the inverse image of A under X, P is simply a special case of constructing a new positive measure from a given positive measure via a change of variables. Suppose now that r : S → R is measurable, so that r(X) is a real-valued random variable. The integral of r(X) (assuming that it exists) is known as the expected value of r(X) and is of fundamental importance. We will study expected values in detail in the next chapter. Here, we simply note different ways to write the integral. By the change of variables formula (8) we have X
∫
r [X(ω)] dP(ω) = ∫
Ω
r(x) dPX (x)
(3.11.43)
S
Now let F denote the distribution function of Y = r(X) . By another change of variables, Y has a probability distribution P on R , which is also a Lebesgue-Stieltjes measure, named for Henri Lebesgue and Thomas Stiletjes. Recall that this probability measure is characterized by Y
Y
PY (a, b] = P(a < Y ≤ b) = FY (b) − FY (a);
a, b ∈ R, a < b
(3.11.44)
With another application of our change of variables theorem, we can add to our chain of integrals: ∫
r [X(ω)] dP(ω) = ∫
Ω
r(x) dPX (x) = ∫
S
R
y dPY (y) = ∫
y dFY (y)
(3.11.45)
R
Of course, the last two integrals are simply different notations for exactly the same thing. In the section on absolute continuity and density functions, we will see other ways to write the integral.
Counterexamples In the first three exercises below, (R, R, λ) is the standard one-dimensional Euclidean space, so Lebesgue measurabel sets and λ is Lebesgue measure. Let f
= 1[1,∞)
and g = 1
[0,∞)
mathscrR
is
σ
-algebra of
. Show that
1. f ≤ g on R 2. λ{x ∈ R : f (x) < g(x)} = 1 3. ∫ f dλ = ∫ g dλ = ∞ R
R
This example shows that the strict increasing property can fail when the integrals are infinite. Let f
n
= 1[n,∞)
for n ∈ N . Show that +
1. f is decreasing in n ∈ N on R. 2. f → 0 as n → ∞ on R. 3. ∫ f dλ = ∞ for each n ∈ N . n
+
n
R
n
+
3.11.6
https://stats.libretexts.org/@go/page/10151
This example shows that the monotone convergence theorem can fail if the first integral is infinite. It also illustrates strict inequality in Fatou's lemma. Let f
n
= 1[n,n+1]
for n ∈ N . Show that +
1. lim f = 0 on R so ∫ lim 2. ∫ f dλ = 1 for n ∈ N so lim 3. sup{f : n ∈ N } = 1 on R n→∞
R
n
n→∞
R
n
+
n
+
fn dμ = 0
n→∞
∫
R
fn dλ = 1
[1,∞)
This example shows that the dominated convergence theorem can fail if shows that strict inequality can hold in Fatou's lemma.
| fn |
is not bounded by an integrable function. It also
Consider the product space [0, 1] with the usual Lebesgue measurable subsets and Lebesgue measure. Let f defined by 2
2
x f (x, y) =
2
(x
−y
2
: [0, 1 ]
→ R
be
2
2
(3.11.46)
2
+y )
Show that 1. ∫
2
[0,1]
2. ∫
1
3. ∫
1
0
0
∫
f (x, y) d(x, y)
1
0
∫
1
0
does not exist.
f (x, y) dx dy = − f (x, y) dy dx =
π 4
π 4
This example shows that Fubini's theorem can fail if the double integral does not exist. For i,
j ∈ N+
define the sequence a as follows: a ij
ii
=1
and a
= −1
i+1,i
1. Give a in array form with i ∈ N as the row number and j ∈ N 2. Show that ∑ a does not exist 3. Show that ∑ ∑ a = 1 4. Show that ∑ ∑ a = 0 ij
+
2
(i,j)∈N+
+
for i ∈ N , a +
ij
=0
otherwise.
as the column number
ij
∞
∞
i=1
j=1
∞
∞
j=1
i=1
ij
ij
This example shows that the iterated sums can exist and be different when the double sum does not exist, a counterexample to the corollary to Fubini's theorem for sums when the hypotheses are not satisfied.
Computational Exercises Compute ∫
D
f (x, y) d(x, y)
1. f (x, y) = e 2. f (x, y) = e
−2x
−2x
e e
−3y
−3y
in each case below for the given D ⊆ R and f 2
: D → R
.
, D = [0, ∞) × [0, ∞) , D = {(x, y) ∈ R : 0 ≤ x ≤ y < ∞} 2
Integrals of the type in the last exercise are useful in the study of exponential distributions. This page titled 3.11: Properties of the Integral is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
3.11.7
https://stats.libretexts.org/@go/page/10151
3.12: General Measures Basic Theory Our starting point in this section is a measurable space (S, S ). That is, S is a set and S is a σ-algebra of subsets of S . So far, we have only considered positive measures on such spaces. Positive measures have applications, as we know, to length, area, volume, mass, probability, counting, and similar concepts of the nonnegative “size” of a set. Moreover, we have defined the integral of a measurable function f : S → R with respect to a positive measure, and we have studied properties of the integral.
Definition But now we will consider measures that can take negative values as well as positive values. These measures have applications to electric charge, monetary value, and other similar concepts of the “content” of a set that might be positive or negative. Also, this generalization will help in our study of density functions in the next section. The definition is exactly the same as for a positive measure, except that values in R = R ∪ {−∞, ∞} are allowed. ∗
A measure on (S, S ) is a function μ : S
∗
→ R
that satisfies the following properties:
1. μ(∅) = 0 2. If {A : i ∈ I } is a countable, disjoint collection of sets in S then μ (⋃ i
i∈I
Ai ) = ∑
i∈I
μ(Ai )
As before, (b) is known as countable additivity and is the critical assumption: the measure of a set that consists of a countable number of disjoint pieces is the sum of the measures of the pieces. Implicit in the statement of this assumption is that the sum in (b) exists for every countable disjoint collection {A : i ∈ I } . That is, either the sum of the positive terms is finite or the sum of the negative terms is finite. In turn, this means that the order of the terms in the sum does not matter (a good thing, since there is no implied order). The term signed measure is used by many, but we will just use the simple term measure, and add appropriate adjectives for the special cases. Note that if μ(A) ≥ 0 for all A ∈ S , then μ is a positive measure, the kind we have already studied (and so the new definition really is a generalization). In this case, the sum in (b) always exists in [0, ∞]. If μ(A) ∈ R for all A ∈ S then μ is a finite measure. Note that in this case, the sum in (b) is absolutely convergent for every countable disjoint collection {A : i ∈ I } . If μ is a positive measure and μ(S) = 1 then μ is a probability measure, our favorite kind. Finally, as with positive measures, μ is σ-finite if there exists a countable collection {A : i ∈ I } of sets in S such that S = ⋃ A and μ(A ) ∈ R for i ∈ I . i
i
i
i∈I
i
i
Basic Properties We give a few simple properties of general measures; hopefully many of these will look familiar. Throughout, we assume that μ is a measure on (S, S ). Our first result is that although μ can take the value ∞ or −∞ , it turns out that it cannot take both of these values. Either μ(A) > −∞ for all A ∈ S or μ(A) < ∞ for all A ∈ S . Proof We will say that two measures are of the same type if neither takes the value ∞ or if neither takes the value −∞ . Being of the same type is trivially an equivalence relation on the collection of measures on (S, S ). The difference rule holds, as long as the sets have finite measure: Suppose that A,
B ∈ S
. If μ(B) ∈ R then μ(B ∖ A) = μ(B) − μ(A ∩ B) .
Proof The following corollary is the difference rule for subsets, and will be needed below. Suppose that A,
B ∈ S
and A ⊆ B . If μ(B) ∈ R then μ(A) ∈ R and μ(B ∖ A) = μ(B) − μ(A) .
Proof
3.12.1
https://stats.libretexts.org/@go/page/10152
As a consequence, suppose that A, B ∈ S and A ⊆ B . If μ(A) = ∞ , then by the infinity rule we cannot have μ(B) = −∞ and by the difference rule we cannot have μ(B) ∈ R , so we must have μ(B) = ∞ . Similarly, if μ(A) = −∞ then μ(B) = −∞ . The inclusion-exclusion rules hold for general measures, as long as the sets have finite measure. Suppose that A
i
∈ S
for each i ∈ I where #(I ) = n , and that μ(A
i)
∈ R
for i ∈ I . Then
n k−1
μ ( ⋃ Ai ) = ∑(−1 ) i∈I
k=1
∑
μ ( ⋂ Aj )
(3.12.1)
j∈J
J⊆I, #(J)=k
Proof The continuity properties hold for general measures. Part (a) is the continuity property for increasing sets, and part (b) is the continuity property for decreasing sets. Suppose that A
n
1. If A 2. If A
n
∈ S
⊆ An+1
n+1
⊆ An
for n ∈ N . +
for n ∈ N for n ∈ N
+ +
then lim and μ(A
n→∞
∞
μ(An ) = μ (⋃
1) ∈ R
, then lim
i=1
Ai )
. ∞
n→∞
μ(An ) = μ (⋂
i=1
Ai )
Proof Recall that a positive measure is an increasing function, relative to the subset partial order on S and the ordinary order on [0, ∞], and this property follows from the difference rule. But for general measures, the increasing property fails, and so do other properties that flow from it, including the subadditive property (Boole's inequality in probability) and the Bonferroni inequalities.
Constructions It's easy to construct general measures as differences of positive measures. Suppose that μ and ν are positive measures on (S, S ) and that at least one of them is finite. Then δ = μ − ν is a measure. Proof The collection of measures on our space is closed under scalar multiplication. If μ is a measure on (S, S ) and c ∈ R , then cμ is a measure on (S, S ) Proof If μ is a finite measure, then so is cμ for c ∈ R . If μ is not finite then μ and cμ are of the same type if c > 0 and are of opposite types if c < 0 . We can add two measures to get another measure, as long as they are of the same type. In particular, the collection of finite measures is closed under addition as well as scalar multiplication, and hence forms a vector space. If μ and ν are measures on (S, S ) of the same type then μ + ν is a measure on (S, S ). Proof Finally, it is easy to explicitly construct measures on a σ-algebra generated by a countable partition. Such σ-algebras are important for counterexamples and to gain insight, and also because many σ-algebras that occur in applications can be constructed from them. Suppose that
is a countable partition of S into nonempty sets, and that S = σ(A ) . For i ∈ I , define μ(A ) ∈ R arbitrarily, subject only to the condition that the sum of the positive terms is finite, or the sum of the negative terms is finite. For A = ⋃ A where J ⊆ I , define A = { Ai : i ∈ I }
∗
i
j∈J
j
μ(A) = ∑ μ(Aj )
(3.12.7)
j∈J
Then μ is a measure on (S, S ). Proof
3.12.2
https://stats.libretexts.org/@go/page/10152
Positive, Negative, and Null Sets To understand the structure of general measures, we need some basic definitions and properties. As before, we assume that measure on (S, S ).
μ
is a
Definitions 1. A ∈ S is a positive set for μ if μ(B) ≥ 0 for every B ∈ S with B ⊆ A . 2. A ∈ S is a negative set for μ if μ(B) ≤ 0 for every B ∈ S with B ⊆ A . 3. A ∈ S is a null set for μ if μ(B) = 0 for every B ∈ S with B ⊆ A . Note that positive and negative are used in the weak sense (just as we use the terms increasing and decreasing in this text). Of course, if μ is a positive measure, then every A ∈ S is positive for μ , and A ∈ S is negative for μ if and only if A is null for μ if and only if μ(A) = 0 . For a general measure, A ∈ S is both positive and negative for μ if and only if A is null for μ . In particular, ∅ is null for μ . A set A ∈ S is a support set for μ if and only if A is a null set for μ . A support set is a set where the measure “lives” in a sense. Positive, negative, and null sets for μ have a basic inheritance property that is essentially equivalent to the definition. c
Suppose A ∈ S . 1. If A is positive for μ then B is positive for μ for every B ∈ S with B ⊆ A . 2. If A is negative for μ then B is negative for μ for every B ∈ S with B ⊆ A . 3. If A is null for μ then B is null for μ for every B ∈ S with B ⊆ A . The collections of positive sets, negative sets, and null sets for μ are closed under countable unions. Suppose that {A
i
: i ∈ I}
is a countable collection of sets in S .
1. If A is positive for μ for i ∈ I then ⋃ A is positive for μ . 2. If A is negative for μ for i ∈ I then ⋃ A is negative for μ . 3. If A is null for μ for i ∈ I then ⋃ A is null for μ . i
i∈I
i
i∈I
i
i
i∈I
i
i
Proof It's easy to see what happens to the positive, negative, and null sets when a measure is multiplied by a non-zero constant. Suppose that μ is a measure on (S, S ), c ∈ R , and A ∈ S . 1. If c > 0 then A is positive (negative) for μ if and only if A is positive (negative) for cμ. 2. If c < 0 then A is positive (negative) for μ if and only if A is negative (positive) for cμ. 3. If c ≠ 0 then A is null for μ if and only if A is null for cμ Positive, negative, and null sets are also preserved under countable sums, assuming that the measures make senes. Suppose that μ is a measure on (S, S ) for each i in a countable index set I , and that μ = ∑ on (S, S ). Let A ∈ S . i
i∈I
μi
is a well-defined measure
1. If A is positive for μ for every i ∈ I then A is positive for μ . 2. If A is negative for μ for every i ∈ I then A is negative for μ . 3. If A is null for μ for every i ∈ I then A is null for μ . i
i
i
In particular, note that μ = ∑ μ is a well-defined measure if μ is a positive measure for each i ∈ I , or if I is finite and μ is a finite measure for each i ∈ I . It's easy to understand the positive, negative, and null sets for a σ-algebra generated by a countable partition. i∈I
i
Suppose that A = {A : i ∈ I } is a countable partition of measure on (S, S ). Define i
i
S
into nonempty sets, and that
i
S = σ(A )
I+ = {i ∈ I : μ(Ai ) > 0}, I− = {i ∈ I : μ(Ai ) < 0}, I0 = {i ∈ I : μ(Ai ) = 0}
3.12.3
. Suppose that
μ
is a
(3.12.9)
https://stats.libretexts.org/@go/page/10152
Let A ∈ S , so that A = ⋃
j∈J
Aj
for some J ⊆ I (and this representation is unique). Then
1. A is positive for μ if and only if J ⊆ I 2. A is negative for μ if and only if J ⊆ I 3. A is null for μ if and only if J ⊆ I .
∪ I0
+ −
∪ I0
. .
0
The Hahn Decomposition The fundamental results in this section and the next are two decomposition theorems that show precisely the relationship between general measures and positive measures. First we show that if a set has finite, positive measure, then it has a positive subset with at least that measure. If A ∈ S and 0 ≤ μ(A) < ∞ then there exists P
∈ S
with P
⊆A
such that P is positive for μ and μ(P ) ≥ μ(A) .
Proof The assumption that μ(A) < ∞ is critical; a counterexample is given below. Our first decomposition result is the Hahn decomposition theorem, named for the Austrian mathematician Hans Hahn. It states that S can be partitioned into a positive set and a negative set, and this decomposition is essentially unique. Hahn Decomposition Theorem. There exists P ∈ S such that P is positive for μ and P is negative for μ . The pair (P , P is a Hahn decomposition of S . If (Q, Q ) is another Hahn decomposition, then P △ Q is null for μ . c
c
)
c
Proof It's easy to see the Hahn decomposition for a measure on a σ-algebra generated by a countable partition. Suppose that A = {A : i ∈ I } is a countable partition of S into nonempty sets, and that S measure on (S, S ). Let I = {i ∈ I : μ(A ) > 0} and I = {i ∈ I : μ(A ) = 0 . Then (P , P μ if and only if the positive set P has the form P = ⋃ A where J = I ∪ K and K ⊆ I . i
+
i
0
j∈J
i
j
+
. Suppose that μ is a is a Hahn decomposition of
= σ(A ) c
)
0
The Jordan Decomposition The Hahn decomposition leads to another decomposition theorem called the Jordan decomposition theorem, named for the French mathematician Camille Jordan. This one shows that every measure is the difference of positive measures. Once again we assume that μ is a measure on (S, S ). Jordan Decomposition Theorem. The measure μ can be written uniquely in the form μ = μ − μ where μ and μ are positive measures, at least one finite, and with the property that if (P , P ) is any Hahn decomposition of S , then P is a null set of μ and P is a null set of μ . The pair (μ , μ ) is the Jordan decomposition of μ . +
c
+
−
+
−
+
−
c
−
Proof The Jordan decomposition leads to an important set of new definitions. Suppose that μ has Jordan decomposition μ = μ
+
− μ−
.
1. The positive measure μ is called the positive variation measure of μ . 2. The positive measure μ is called the negative variation measure of μ . 3. The positive measure |μ| = μ + μ is called the total variation measure of μ . 4. ∥μ∥ = |μ| (S) is the total variation of μ . + −
+
−
Note that, in spite of the similarity in notation, μ (A) and μ (A) are not simply the positive and negative parts of the (extended) real number μ(A) , nor is |μ| (A) the absolute value of μ(A) . Also, be careful not to confuse the total variation of μ , a number in [0, ∞], with the total variation measure. The positive, negative, and total variation measures can be written directly in terms of μ . +
−
For A ∈ S , 1. μ (A) = sup{μ(B) : B ∈ S , B ⊆ A} 2. μ (A) = − inf{μ(B) : B ∈ S , B ⊆ A} 3. |μ(A)| = sup {∑ μ(A ) : {A : i ∈ I } is a finite, measurable partition of A} + −
i∈I
i
i
3.12.4
https://stats.libretexts.org/@go/page/10152
4. ∥μ∥ = sup {∑
i∈I
μ(Ai ) : { Ai : i ∈ I } is a finite, measurable partition of S}
The total variation measure is related to sum and scalar multiples of measures in a natural way. Suppose that μ and ν are measures of the same type and that c ∈ R . Then 1. |μ| = 0 if and only if μ = 0 (the zero measure). 2. |cμ| = |c| |μ| 3. |μ + ν | ≤ |μ| + |ν | Proof You may have noticed that the properties in the last result look a bit like norm properties. In fact, total variation really is a norm on the vector space of finite measures on (S, S ): Suppose that μ and ν are measures of the same type and that c ∈ R . Then 1. ∥μ∥ = 0 if and only if μ = 0 (the zero property) 2. ∥cμ∥ = |c| ∥μ∥ (the scaling property) 3. ∥μ + ν ∥ ≤ ∥μ∥ + ∥ν ∥ (the triangle inequality) Proof Every norm on a vector space leads to a corresponding measure of distance (a metric). Let M denote the collection of finite measures on (S, S ). Then M , under the usual definition of addition and scalar multiplication of measures, is a vector space, and as the last theorem shows, ∥ ⋅ ∥ is a norm on M . Here are the corresponding metric space properties: Suppose that μ,
ν, ρ ∈ M
and c ∈ R . Then
1. ∥μ − ν ∥ = ∥ν − μ∥ , the symmetric property 2. ∥μ∥ = 0 if and only if μ = 0 , the zero property 3. ∥μ − ρ∥ ≤ ∥μ − ν ∥ + ∥ν − ρ∥ , the triangle inequality Now that we have a metric, we have a corresponding criterion for convergence. Suppose that n → ∞.
μn ∈ M
for
n ∈ N+
and
μ ∈ M
. We say that
Of course, M includes the probability measures on we have studied or will study. Here is a list:
μn → μ
as
n → ∞
in total variation if
∥ μn − μ∥ → 0
as
, so we have a new notion of convergence to go along with the others
(S, S )
convergence with probability 1 convergence in probability convergence in distribution convergence in k th mean convergence in total variation
The Integral Armed with the Jordan decomposition, the integral can be extended to general measures in a natural way. Suppose that μ is a measure on (S, S ) and that f ∫ S
is measurable. We define
: S → R
f dμ = ∫
f dμ+ − ∫
S
f dμ−
(3.12.14)
S
assuming that the integrals on the right exist and that the right side is not of the form ∞ − ∞ . We will not pursue this extension, but as you might guess, the essential properties of the integral hold.
3.12.5
https://stats.libretexts.org/@go/page/10152
Complex Measures Again, suppose that (S, S ) is a measurable space. The same axioms that work for general measures can be used to define complex measures. Recall that C = {x + iy : x, y ∈ R} denotes the set of complex numbers, where i is the imaginary unit. A complex measure on (S, S ) is a function μ : S
→ C
that satisfies the following properties:
1. μ(∅) = 0 2. If {A : i ∈ I } is a countable, disjoint collection of sets in S then μ (⋃ i
i∈I
Ai ) = ∑
i∈I
μ(Ai )
Clearly a complex measure μ can be decomposed as μ = ν + iρ where ν and ρ are finite (real) measures on (S, S ). We will have no use for complex measures in this text, but from the decomposition into finite measures, it's easy to see how to develop the theory.
Computational Exercises Counterexamples The lemma needed for the Hahn decomposition theorem can fail without the assumption that μ(A) < ∞ . Let S be a set with subsets A and B satisfying ∅ ⊂ B ⊂ A ⊂ S . Let Define μ(B) = −1 , μ(A ∖ B) = ∞ , μ(A ) = 1 .
S = σ{A, B}
be the σ-algebra generated by
.
{A, B}
c
1. Draw the Venn diagram of A , B , S . 2. List the sets in S . 3. Using additivity, give the value of μ on each set in S . 4. Show that A does not have a positive subset P ∈ S with μ(P ) ≥ μ(A) . This page titled 3.12: General Measures is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
3.12.6
https://stats.libretexts.org/@go/page/10152
3.13: Absolute Continuity and Density Functions Basic Theory Our starting point is a measurable space (S, S ). That is S is a set and S is a σ-algebra of subsets of S . In the last section, we discussed general measures on (S, S ) that can take positive and negative values. Special cases are positive measures, finite measures, and our favorite kind, probability measures. In particular, we studied properties of general measures, ways to construct them, special sets (positive, negative, and null), and the Hahn and Jordan decompositions. In this section, we see how to construct a new measure from a given positive measure using a density function, and we answer the fundamental question of when a measure has a density function relative to the given positive measure.
Relations on Measures The answer to the question involves two important relations on the collection of measures on (S, S ) that are defined in terms of null sets. Recall that A ∈ S is null for a measure μ on (S, S ) if μ(B) = 0 for every B ∈ S with B ⊆ A . At the other extreme, A ∈ S is a support set for μ if A is a null set. Here are the basic definitions: c
Suppose that μ and ν are measures on (S, S ). 1. ν is absolutely continuous with respect to μ if every null set of μ is also a null set of ν . We write ν ≪ μ . 2. μ and ν are mutually singular if there exists A ∈ S such that A is null for μ and A is null for ν . We write μ ⊥ ν . c
Thus ν
≪ μ
if every support support set of μ is a support set of ν . At the opposite end, μ ⊥ ν if μ and ν have disjoint support sets.
Suppose that μ , ν , and ρ are measures on (S, S ). Then 1. μ ≪ μ , the reflexive property. 2. If μ ≪ ν and ν ≪ ρ then μ ≪ ρ , the transitive property. Recall that every relation that is reflexive and transitive leads to an equivalence relation, and then in turn, the original relation can be extended to a partial order on the collection of equivalence classes. This general theorem on relations leads to the following two results. Measures μ and ν on (S, S ) are equivalent if μ ≪ ν and ν ≪ μ , and we write μ ≡ ν . The relation relation on the collection of measures on (S, S ). That is, if μ , ν , and ρ are measures on (S, S ) then
≡
is an equivalence
1. μ ≡ μ , the reflexive property 2. If μ ≡ ν then ν ≡ μ , the symmetric property 3. If μ ≡ ν and ν ≡ ρ then μ ≡ ρ , the transitive property Thus, μ and ν are equivalent if they have the same null sets and thus the same support sets. This equivalence relation is rather weak: equivalent measures have the same support sets, but the values assigned to these sets can be very different. As usual, we will write [μ] for the equivalence class of a measure μ on (S, S ), under the equivalence relation ≡. If μ and ν are measures on (S, S ), we write [μ] ⪯ [ν ] if μ ≪ ν . The definition is consistent, and defines a partial order on the collection of equivalence classes. That is, if μ , ν , and ρ are measures on (S, S ) then 1. [μ] ⪯ [μ] , the reflexive property. 2. If [μ] ⪯ [ν ] and [ν ] ⪯ [μ] then [μ] = [ν ] , the antisymmetric property. 3. If [μ] ⪯ [ν ] and [ν ] ⪯ [ρ] then [μ] ⪯ [ρ], the transitive property The singularity relation is trivially symmetric and is almost anti-reflexive. Suppose that μ and ν are measures on (S, S ). Then 1. If μ ⊥ ν then ν ⊥ μ , the symmetric property. 2. μ ⊥ μ if and only if μ = 0 , the zero measure.
3.13.1
https://stats.libretexts.org/@go/page/10153
Proof Absolute continuity and singularity are preserved under multiplication by nonzero constants. Suppose that μ and ν are measures on (S, S ) and that a, 1. ν 2. ν
≪ μ ⊥μ
b ∈ R ∖ {0}
. Then
if and only if aν ≪ bμ . if and only if aν ⊥ bμ .
Proof There is a corresponding result for sums of measures. Suppose that μ is a measure on (S, S ) and that ν is a measure on (S, S ) for each i in a countable index set I . Suppose also that ν = ∑ ν is a well-defined measure on (S, S ). i
i∈I
1. If ν 2. If ν
i
≪ μ
i
⊥μ
i
for every i ∈ I then ν ≪ μ . for every i ∈ I then ν ⊥ μ .
Proof As before, note that ν = ∑ ν is well-defined if ν is a positive measure for each i ∈ I or if I is finite and ν is a finite measure for each i ∈ I . We close this subsection with a couple of results that involve both the absolute continuity relation and the singularity relation i∈I
i
i
i
Suppose that μ , ν , and ρ are measures on (S, S ). If ν
≪ μ
and μ ⊥ ρ then ν
⊥ρ
.
Proof Suppose that μ and ν are measures on (S, S ). If ν
≪ μ
and ν
⊥μ
then ν
=0
.
Proof
Density Functions We are now ready for our study of density functions. Throughout this subsection, we assume that μ is a positive, σ-finite measure on our measurable space (S, S ). Recall that if f : S → R is measurable, then the integral of f with respect to μ may exist as a number in R = R ∪ {−∞, ∞} or may fail to exist. ∗
Suppose that f
: S → R
is a measurable function whose integral with respect to μ exists. Then function ν defined by ν (A) = ∫
f dμ,
A ∈ S
(3.13.1)
A
is a σ-finite measure on relative to μ .
(S, S )
that is absolutely continuous with respect to
μ
. The function
f
is a density function of
ν
Proof The following three special cases are the most important: 1. If f is nonnegative (so that the integral exists in R ∪ {∞} ) then ν is a positive measure since ν (A) ≥ 0 for A ∈ S . 2. If f is integrable (so that the integral exists in R), then ν is a finite measure since ν (A) ∈ R for A ∈ S . 3. If f is nonnegative and ∫ f dμ = 1 then ν is a probability measure since ν (A) ≥ 0 for A ∈ S and ν (S) = 1 . S
In case 3, f is the probability density function of functions are essentially unique.
ν
relative to μ , our favorite kind of density function. When they exist, density
Suppose that ν is a σ-finite measure on (S, S ) and that ν has density function f with respect to μ . Then density function of ν with respect to μ if and only if f = g almost everywhere on S with respect to μ .
g : S → R
is a
Proof The essential uniqueness of density functions can fail if the positive measure space (S, S , μ) is not σ-finite. A simple example is given below. Our next result answers the question of when a measure has a density function with respect to μ , and is the fundamental theorem of this section. The theorem is in two parts: Part (a) is the Lebesgue decomposition theorem, named for our
3.13.2
https://stats.libretexts.org/@go/page/10153
old friend Henri Lebesgue. Part (b) is the Radon-Nikodym theorem, named for Johann Radon and Otto Nikodym. We combine the theorems because our proofs of the two results are inextricably linked. Suppose that ν is a σ-finite measure on (S, S ). 1. Lebesgue Decomposition Theorem. ν can be uniquely decomposed as ν 2. Radon-Nikodym Theorem. ν has a density function with respect to μ .
= νc + νs
where ν
c
≪ μ
and ν
s
⊥μ
.
c
Proof In particular, a measure ν on (S, S ) has a density function with respect to μ if and only if ν ≪ μ . The density function in this case is also referred to as the Radon-Nikodym derivative of ν with respect to μ and is sometimes written in derivative notation as dν /dμ. This notation, however, can be a bit misleading because we need to remember that a density function is unique only up to a μ -null set. Also, the Radon-Nikodym theorem can fail if the positive measure space (S, S , μ) is not σ-finite. A couple of examples are given below. Next we characterize the Hahn decomposition and the Jordan decomposition of ν in terms of the density function. Suppose that P
is a measure on (S, S ) with ν ≪ μ , and that ν has density function = {x ∈ S : f (x) ≥ 0} , and let f and f denote the positive and negative parts of f . ν
+
f
with respect to
μ
. Let
−
1. A Hahn decomposition of ν is (P , P ). 2. The Jordan decomposition is ν = ν − ν c
+
−
where ν
+ (A)
=∫
A
f
+
dμ
and ν
− (A)
=∫
A
f
−
dμ
, for A ∈ S .
Proof The following result is a basic change of variables theorem for integrals. Suppose that ν is a positive measure on (S, S ) with ν ≪ μ and that ν has density function f with respect to μ . If g : S → R is a measurable function whose integral with respect to ν exists, then ∫
g dν = ∫
S
gf dμ
(3.13.6)
S
Proof In differential notation, the change of variables theorem has the familiar form dν = f dμ , and this is really the justification for the derivative notation f = dν /dμ in the first place. The following result gives the scalar multiple rule for density functions. Suppose that ν is a measure on (S, S ) with ν density function cf with respect to μ .
≪ μ
and that ν has density function f with respect to μ . If c ∈ R , then cν has
Proof Of course, we already knew that ν ≪ μ implies cν ≪ μ for c ∈ R , so the new information is the relation between the density functions. In derivative notation, the scalar multiple rule has the familiar form d(cν )
dν =c
dμ
(3.13.9) dμ
The following result gives the sum rule for density functions. Recall that two measures are of the same type if neither takes the value ∞ or if neither takes the value −∞ . Suppose that ν and ρ are measures on (S, S ) of the same type with ν ≪ μ and ρ ≪ μ , and that ν and functions f and g with respect to μ , respectively. Then ν + ρ has density function f + g with respect to μ .
ρ
have density
Proof Of course, we already knew that ν ≪ μ and ρ ≪ μ imply ν + ρ ≪ μ , so the new information is the relation between the density functions. In derivative notation, the sum rule has the familiar form d(ν + ρ)
dν =
dμ
dρ +
dμ
3.13.3
(3.13.11) dμ
https://stats.libretexts.org/@go/page/10153
The following result is the chain rule for density functions. Suppose that ν is a positive measure on (S, S ) with ν ≪ μ and that ν has density function f with respect to μ . Suppose ρ is a measure on (S, S ) with ρ ≪ ν and that ρ has density function g with respect to ν . Then ρ has density function gf with respect to μ . Proof Of course, we already knew that ν ≪ μ and ρ ≪ ν imply ρ ≪ μ , so once again the new information is the relation between the density functions. In derivative notation, the chan rule has the familiar form dρ
dρ dν =
dμ
(3.13.12) dν dμ
The following related result is the inverse rule for density functions. Suppose that ν is a positive measure on (S, S ) with ν ≪ μ and respect to μ then μ has density function 1/f with respect to ν .
(so that
μ ≪ ν
ν ≡μ
). If
ν
has density function
f
with
Proof In derivative notation, the inverse rule has the familiar form dμ
1 =
dν
(3.13.13) dν /dμ
Examples and Special Cases Discrete Spaces Recall that a discrete measure space (S, S , #) consists of a countable set S with the σ-algebra S = P(S) of all subsets of S , and with counting measure #. Of course # is a positive measure and is trivially σ-finite since S is countable. Note also that ∅ is the only set that is null for #. If ν is a measure on S , then by definition, ν (∅) = 0 , so ν is absolutely continuous relative to μ . Thus, by the Radon-Nikodym theorem, ν can be written in the form ν (A) = ∑ f (x),
A ⊆S
(3.13.14)
x∈A
for a unique f : S → R . Of course, this is obvious by a direct argument. If we define equation follows by the countable additivity of ν .
f (x) = ν {x}
for
x ∈ S
then the displayed
Spaces Generated by Countable Partitions We can generalize the last discussion to spaces generated by countable partitions. Suppose that S is a set and that A = { A : i ∈ I } is a countable partition of S into nonempty sets. Let S = σ(A ) and recall that every A ∈ S has a unique representation of the form A = ⋃ A where J ⊆ I . Suppse now that μ is a positive measure on S with 0 < μ(A ) < ∞ for every i ∈ I . Then once again, the measure space (S, S , μ) is σ-finite and ∅ is the only null set. Hence if ν is a measure on (S, S ) then ν is absolutely continuous with respect to μ and hence has unique density function f with respect to μ : i
j∈J
j
i
ν (A) = ∫
f dμ,
A ∈ S
(3.13.15)
A
Once again, we can construct the density function explicitly. In the setting above, define f to μ .
: S → R
by f (x) = ν (A
i )/μ(Ai )
for x ∈ A and i ∈ I . Then f is the density of ν with respect i
Proof Often positive measure spaces that occur in applications can be decomposed into spaces generated by countable partitions. In the section on Convergence in the chapter on Martingales, we show that more general density functions can be obtained as limits of density functions of the type in the last theorem.
3.13.4
https://stats.libretexts.org/@go/page/10153
Probability Spaces Suppose that (Ω, F , P) is a probability space and that X is a random variable taking values in a measurable space (S, S ). Recall that the distribution of X is the probability measure P on (S, S ) given by X
PX (A) = P(X ∈ A),
A ∈ S
(3.13.17)
If μ is a positive measure, σ-finite measure on (S, S ), then the theory of this section applies, of course. The Radon-Nikodym theorem tells us precisely when (the distribution of) X has a probability density function with respect to μ : we need the distribution to be absolutely continuous with respect to μ : if μ(A) = 0 then P (A) = P(X ∈ A) = 0 for A ∈ S . X
Suppose that r : S → R is measurable, so that r(X) is a real-valued random variable. The integral of r(X) (assuming that it exists) is of fundamental importance, and is knowns as the expected value of r(X). We will study expected values in detail in the next chapter, but here we just note different ways to write the integral. By the change of variables theorem in the last section we have ∫
r[X(ω)]dP(ω) = ∫
Ω
Assuming that P , the distribution of chain of integrals using Theorem (14): X
∫
X
r(x)dPX (x)
(3.13.18)
S
, is absolutely continuous with respect to μ , with density function f , we can add to our
r[X(ω)]dP(ω) = ∫
Ω
r(x)dPX (x) = ∫
S
r(x)f (x)dμ(x)
(3.13.19)
S
Specializing, suppose that (S, S , #) is a discrete measure space. Thus X has a discrete distribution and (as noted in the previous subsection), the distribution of X is absolutely continuous with respect to #, with probability density function f given by f (x) = P(X = x) for x ∈ S . In this case the integral simplifies: ∫
r[X(ω)]dP(ω) = ∑ r(x)f (x)
Ω
(3.13.20)
x∈S
Recall next that for n ∈ N , the n -dimensional Euclidean measure space is (R , R , λ ) where R is the σ-algebra of Lebesgue measurable sets and λ is Lebesgue measure. Suppose now that S ∈ R and that S is the σ-algebra of Lebesgue measurable subsets of S , and that once again, X is a random variable with values in S . By definition, X has a continuous distribution if P(X = x) = 0 for x ∈ S . But we now know that this is not enough to ensure that the distribution of X has a density function with respect to λ . We need the distribution to be absolutely continuous, so that if λ (A) = 0 then P(X ∈ A) = 0 for A ∈ S . Of course λ {x} = 0 for x ∈ S , so absolute continuity implies continuity, but not conversely. Continuity of the distribution is a (much) weaker condition than absolute continuity of the distribution. If the distribution of X is continuous but not absolutely so, then the distribution will not have a density function with respect to λ . n
+
n
n
n
n
n
n
n
n
n
For example, suppose that λ (S) = 0 . Then the distribution of X and λ are mutually singular since P(X ∈ S) = 1 and so X will not have a density function with respect to λ . This will always be the case if S is countable, so that the distribution of X is discrete. But it is also possible for X to have a continuous distribution on an uncountable set S ∈ R with λ (S) = 0 . In such a case, the continuous distribution of X is said to be degenerate. There are a couple of natural ways in which this can happen that are illustrated in the following exercises. n
n
n
n
Suppose that Θ is uniformly distributed on the interval [0, 2π). Let X = cos Θ , Y 1. (X, Y ) has a continuous distribution on the circle C = {(x, y) : x 2. The distribution of (X, Y ) and λ are mutually singular. 3. Find P(Y > X) .
2
+y
2
= 1}
= sin Θ
n
.
.
2
Solution The last example is artificial since (X, Y ) has a one-dimensional distribution in a sense, in spite of taking values in course Θ has a probability density function f with repsect λ given by f (θ) = 1/2π for θ ∈ [0, 2π).
2
R
. And of
1
Suppose that X is uniformly distributed on the set {0, 1, 2}, Y is uniformly distributed on the interval [0, 2], and that X and Y are independent. 1. (X, Y ) has a continuous distribution on the product set S = {0, 1, 2} × [0, 2].
3.13.5
https://stats.libretexts.org/@go/page/10153
2. The distribution of (X, Y ) and λ are mutually singular. 3. Find P(Y > X) . 2
Solution The last exercise is artificial since X has a discrete distribution on {0, 1, 2} (with all subsets measureable and with #), and Y a continuous distribution on the Euclidean space [0, 2] (with Lebesgue mearuable subsets and with λ ). Both are absolutely continuous; X has density function g given by g(x) = 1/3 for x ∈ {0, 1, 2} and Y has density function h given by h(y) = 1/2 for y ∈ [0, 2]. So really, the proper measure space on S is the product measure space formed from these two spaces. Relative to this product space (X, Y ) has a density f given by f (x, y) = 1/6 for (x, y) ∈ S. It is also possible to have a continuous distribution on S ⊆ R with λ (S) > 0 , yet still with no probability density function, a much more interesting situation. We will give a classical construction. Let (X , X , …) be a sequence of Bernoulli trials with success parameter p ∈ (0, 1). We will indicate the dependence of the probability measure P on the parameter p with a subscript. Thus, we have a sequence of independent indicator variables with n
n
1
Pp (Xi = 1) = p,
2
Pp (Xi = 0) = 1 − p
(3.13.21)
We interpret X as the ith binary digit (bit) of a random variable X taking values in (0, 1). That is, X = ∑ X /2 . Conversely, recall that every number x ∈ (0, 1) can be written in binary form as x = ∑ x /2 where x ∈ {0, 1} for each i ∈ N . This representation is unique except when x is a binary rational of the form x = k/2 for n ∈ N and k ∈ {1, 3, … 2 − 1} . In this case, there are two representations, one in which the bits are eventually 0 and one in which the bits are eventually 1. Note, however, that the set of binary rationals is countable. Finally, note that the uniform distribution on (0, 1) is the same as Lebesgue measure on (0, 1). ∞
i
i
i=1
∞
i
i
i
i=1
i
+
n
n
+
X
has a continuous distribution on (0, 1) for every value of the parameter p ∈ (0, 1). Moreover,
1. If p, q ∈ (0, 1) and p ≠ q then the distribution of X with parameter p and the distribution of X with parameter q are mutually singular. 2. If p = , X has the uniform distribution on (0, 1). 3. If p ≠ , then the distribution of X is singular with respect to Lebesgue measure on (0, 1), and hence has no probability density function in the usual sense. 1 2 1 2
Proof For an application of some of the ideas in this example, see Bold Play in the game of Red and Black.
Counterexamples The essential uniqueness of density functions can fail if the underlying positive measure counterexample:
μ
is not
σ
-finite. Here is a trivial
Suppose that S is a nonempty set and that S = {S, ∅} is the trivial σ-algebra. Define the positive measure μ on (S, S ) by μ(∅) = 0 , μ(S) = ∞ . Let ν denote the measure on (S, S ) with constant density function c ∈ R with respect to μ . c
1. (S, S , μ) is not σ-finite. 2. ν = μ for every c ∈ (0, ∞). c
The Radon-Nikodym theorem can fail if the measure counterexample:
μ
is not
σ
-finite, even if
ν
is finite. Here are a couple of standard
Suppose that S is an uncountable set and S is the σ-algebra of countable and co-countable sets: c
S = {A ⊆ S : A is countable or A is countable}
As usual, let # denote counting measure on countable. Then
S
, and define
ν
on
S
by
ν (A) = 0
if
A
(3.13.23)
is countable and
ν (A) = 1
if
c
A
is
1. (S, S , #) is not σ-finite. 2. ν is a finite, positive measure on (S, S ). 3. ν is absolutely continuous with respect to #.
3.13.6
https://stats.libretexts.org/@go/page/10153
4. ν does not have a density function with respect to #. Proof Let R denote the standard Borel σ-algebra on respectively. Then
R
. Let
#
and
λ
denote counting measure and Lebesgue measure on
(R, R)
,
1. (R, R, #) is not σ-finite. 2. λ is absolutely continuous with respect to #. 3. λ does not have a density function with respect to #. Proof This page titled 3.13: Absolute Continuity and Density Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
3.13.7
https://stats.libretexts.org/@go/page/10153
3.14: Function Spaces Basic Theory Our starting point is a positive measure space (S, S , μ). That is S is a set, measure on (S, S ). As usual, the most important special cases are
S
is a σ-algebra of subsets of
S
, and μ is a positive
Euclidean space: S is a Lebesgue measurable subset of R for some n ∈ N , S is the σ-algebra of Lebesgue measurable subsets of S , and μ = λ is n -dimensional Lebesgue measure. Discrete space: S is a countable set, S = P(S) is the collection of all subsets of S , and μ = # is counting measure. Probability space: S is the set of outcomes of a random experiment, S is the σ-algebra of events, and μ = P is a probability measure. n
+
n
In previous sections, we defined the integral of certain measurable functions f : S → R with respect to μ , and we studied properties of the integral. In this section, we will study vector spaces of functions that are defined in terms of certain integrability conditions. These function spaces are of fundamental importance in all areas of analysis, including probability. In particular, the results of this section will reappear in the form of spaces of random variables in our study of expected value.
Definitions and Basic Properties Consider a statement on the elements of S , for example an equation or an inequality with x ∈ S as a free variable. (Technically such a statement is a predicate on S .) For A ∈ S , we say that the statement holds on A if it is true for every x ∈ A . We say that the statement holds almost everywhere on A (with respect to μ ) if there exists B ∈ S with B ⊆ A such that the statement holds on B and μ(A ∖ B) = 0 . Measurable functions f , g : S → R are equivalent if f = g almost everywhere on relation ≡ is an equivalence relation on the collection of measurable functions from measurable then
S S
, in which case we write f ≡ g . The to R. That is, if f , g, h : S → R are
1. f ≡ f , the reflexive property. 2. If f ≡ g then g ≡ f , the symmetric property. 3. If f ≡ g and g ≡ h then f ≡ h , the transitive property. Thus, equivalent functions are indistinguishable from the point of view of the measure μ . As with any equivalence relation, ≡ partitions the underlying set (in this case the collection of real-valued measurable functions on S ) into equivalence classes of mutually equivalent elements. As we will see, we often view these equivalence classes as the basic objects of study. Our next task is to define measures of the “size” of a function; these will become norms in our spaces. Suppose that f
: S → R
is measurable. For p ∈ (0, ∞) we define 1/p
∥f ∥p = ( ∫
p
|f |
dμ)
(3.14.1)
S
We also define ∥f ∥
∞
Since
= inf {b ∈ [0, ∞] : |f | ≤ b almost everywhere on S}
.
is a nonnegative, measurable function for p ∈ (0, ∞), ∫ |f | dμ exists in [0, ∞], and hence so does ∥f ∥ . Clearly ∥f ∥ also exists in [0, ∞] and is known as the essential supremum of f . A number b ∈ [0, ∞] such that |f | ≤ b almost everywhere on S is an essential bound of f and so, appropriately enough, the essential supremum of f is the infimum of the essential bounds of f . Thus, we have defined ∥f ∥ for all p ∈ (0, ∞]. The definition for p = ∞ is special, but we will see that it's the appropriate one. p
p
|f |
p
S
∞
p
For p ∈ (0, ∞], let L denote the collection of measurable functions f p
: S → R
such ∥f ∥
p
0) > 0 then E(X) > 0 .
4.1.4
https://stats.libretexts.org/@go/page/10156
Proof Next is the increasing property, perhaps the most important property of expected value, after linearity. Suppose that P(X ≤ Y ) = 1 . Then 1. E(X) ≤ E(Y ) 2. If P(X < Y ) > 0 then E(X) < E(Y ) . Proof Absolute value inequalities: 1. |E(X)| ≤ E (|X|) 2. If P(X > 0) > 0 and P(X < 0) > 0 then |E(X)| < E (|X|). Proof Only in Lake Woebegone are all of the children above average: If P [X ≠ E(X)] > 0 then 1. P [X > E(X)] > 0 2. P [X < E(X)] > 0 Proof Thus, if X is not a constant (with probability 1), then X must take values greater than its mean with positive probability and values less than its mean with positive probability.
Symmetry Again, suppose that X is a random variable taking values in R. The distribution of X is symmetric about a ∈ R if the distribution of a − X is the same as the distribution of X − a . Suppose that the distribution of X is symmetric about a ∈ R . If E(X) exists, then E(X) = a . Proof The previous result applies if X has a continuous distribution on R with a probability density f that is symmetric about a ; that is, f (a + x) = f (a − x) for x ∈ R .
Independence If X and Y are independent real-valued random variables then E(XY ) = E(X)E(Y ) . Proof It follows from the last result that independent random variables are uncorrelated (a concept that we will study in a later section). Moreover, this result is more powerful than might first appear. Suppose that X and Y are independent random variables taking values in general spaces S and T respectively, and that u : S → R and v : T → R . Then u(X) and v(Y ) are independent, realvalued random variables and hence E [u(X)v(Y )] = E [u(X)] E [v(Y )]
(4.1.14)
Examples and Applications As always, be sure to try the proofs and computations yourself before reading the proof and answers in the text.
Uniform Distributions Discrete uniform distributions are widely used in combinatorial probability, and model a point chosen at random from a finite set. Suppose that X has the discrete uniform distribution on a finite set S ⊆ R . 1. E(X) is the arithmetic average of the numbers in S . 2. If the points in S are evenly spaced with endpoints a, b, then E(X) =
4.1.5
a+b 2
, the average of the endpoints.
https://stats.libretexts.org/@go/page/10156
Proof The previous results are easy to see if we think of E(X) as the center of mass, since the discrete uniform distribution corresponds to a finite set of points with equal mass. Open the special distribution simulator, and select the discrete uniform distribution. This is the uniform distribution on n points, starting at a , evenly spaced at distance h . Vary the parameters and note the location of the mean in relation to the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean to the distribution mean. Next, recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the interval. Continuous uniform distributions arise in geometric probability and a variety of other applied problems. Suppose that X has the continuous uniform distribution on an interval [a, b], where a, 1. E(X) = 2. E (X ) =
a+b 2
and a < b .
, the midpoint of the interval. 1
n
b ∈ R
n+1
n
(a
n−1
+a
n−1
b + ⋯ + ab
n
+b )
for n ∈ N .
Proof Part (a) is easy to see if we think of the mean as the center of mass, since the uniform distribution corresponds to a uniform distribution of mass on the interval. Open the special distribution simulator, and select the continuous uniform distribution. This is the uniform distribution the interval [a, a + w] . Vary the parameters and note the location of the mean in relation to the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean to the distribution mean. Next, the average value of a function on an interval, as defined in calculus, has a nice interpretation in terms of the uniform distribution. Suppose that X is uniformly distributed on the interval E [g(X)] is the average value of g on [a, b]:
, and that
[a, b]
∫ b −a
is an integrable function from
[a, b]
into
R
. Then
b
1 E [g(X)] =
g
g(x)dx
(4.1.19)
a
Proof Find the average value of the following functions on the given intervals: 1. f (x) = x on [2, 4] 2. g(x) = x on [0, 1] 3. h(x) = sin(x) on [0, π]. 2
Answer The next exercise illustrates the value of the change of variables theorem in computing expected values. Suppose that X is uniformly distributed on [−1, 3]. 1. Give the probability density function of X. 2. Find the probability density function of X . 3. Find E (X ) using the probability density function in (b). 4. Find E (X ) using the change of variables theorem. 2
2
2
Answer The discrete uniform distribution and the continuous uniform distribution are studied in more detail in the chapter on Special Distributions.
4.1.6
https://stats.libretexts.org/@go/page/10156
Dice Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability each, and faces 2, 3, 4, and 5 have probability each. 1
1
4
8
Two standard, fair dice are thrown, and the scores variables.
(X1 , X2 )
recorded. Find the expected value of each of the following
1. Y = X + X , the sum of the scores. 2. M = (X + X ) , the average of the scores. 3. Z = X X , the product of the scores. 4. U = min{X , X } , the minimum score 5. V = max{X , X } , the maximum score. 1
2
1
1
2
1
2
2
1
2
1
2
Answer In the dice experiment, select two fair die. Note the shape of the probability density function and the location of the mean for the sum, minimum, and maximum variables. Run the experiment 1000 times and compare the sample mean and the distribution mean for each of these variables. Two standard, ace-six flat dice are thrown, and the scores (X variables.
1,
X2 )
recorded. Find the expected value of each of the following
1. Y = X + X , the sum of the scores. 2. M = (X + X ) , the average of the scores. 3. Z = X X , the product of the scores. 4. U = min{X , X } , the minimum score 5. V = max{X , X } , the maximum score. 1
2
1 2
1
1
2
2
1
1
2
2
Answer In the dice experiment, select two ace-six flat die. Note the shape of the probability density function and the location of the mean for the sum, minimum, and maximum variables. Run the experiment 1000 times and compare the sample mean and the distribution mean for each of these variables.
Bernoulli Trials Recall that a Bernoulli trials process is a sequence X = (X , X , …) of independent, identically distributed indicator random variables. In the usual language of reliability, X denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The probability of success p = P(X = 1) ∈ [0, 1] is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on the Bernoulli Trials explores this process in detail. 1
2
i
i
n
For n ∈ N , the number of successes in the first n trials is Y = ∑ X . Recall that this random variable has the binomial distribution with parameters n and p, and has probability density function f given by +
i=1
f (y) = (
n y n−y ) p (1 − p ) , y
i
y ∈ {0, 1, … , n}
(4.1.20)
If Y has the binomial distribution with parameters n and p then E(Y ) = np Proof from the definition Proof using the additive property Note the superiority of the second proof to the first. The result also makes intuitive sense: in n trials with success probability p, we expect np successes. In the binomial coin experiment, vary n and p and note the shape of the probability density function and the location of the mean. For selected values of n and p, run the experiment 1000 times and compare the sample mean to the distribution mean.
4.1.7
https://stats.libretexts.org/@go/page/10156
Suppose that p ∈ (0, 1], and let N denote the trial number of the first success. This random variable has the geometric distribution on N with parameter p, and has probability density function g given by +
n−1
g(n) = p(1 − p )
,
n ∈ N+
(4.1.23)
If N has the geometric distribution on N with parameter p ∈ (0, 1] then E(N ) = 1/p . +
Proof Again, the result makes intuitive sense. Since p is the probability of success, we expect a success to occur after 1/p trials. In the negative binomial experiment, select k = 1 to get the geometric distribution. Vary p and note the shape of the probability density function and the location of the mean. For selected values of p, run the experiment 1000 times and compare the sample mean to the distribution mean.
The Hypergeometric Distribution Suppose that a population consists of m objects; r of the objects are type 1 and m − r are type 0. A sample of n objects is chosen at random, without replacement. The parameters m, r, n ∈ N with r ≤ m and n ≤ m . Let X denote the type of the ith object selected. Recall that X = (X , X , … , X ) is a sequence of identically distributed (but not independent) indicator random variable with P(X = 1) = r/m for each i ∈ {1, 2, … , n}. i
1
2
n
i
Let Y denote the number of type 1 objects in the sample, so that Y which has probability density function f given by r
m−r
( )( y
f (y) =
n−y
m
(
n
n
=∑
i=1
Xi
. Recall that Y has the hypergeometric distribution,
) ,
y ∈ {0, 1, … , n}
(4.1.25)
)
If Y has the hypergeometric distribution with parameters m, n , and r then E(Y ) = n
r m
.
Proof from the definition Proof using the additive property In the ball and urn experiment, vary n , r, and m and note the shape of the probability density function and the location of the mean. For selected values of the parameters, run the experiment 1000 times and compare the sample mean to the distribution mean. Note that if we select the objects with replacement, then X would be a sequence of Bernoulli trials, and hence binomial distribution with parameters n and p = . Thus, the mean would still be E(Y ) = n . r
r
m
m
Y
would have the
The Poisson Distribution Recall that the Poisson distribution has probability density function f given by n
f (n) = e
−a
a
,
n ∈ N
(4.1.29)
n!
where a ∈ (0, ∞) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number of “random points” in a region of time or space; the parameter a is proportional to the size of the region. The Poisson distribution is studied in detail in the chapter on the Poisson Process. If N has the Poisson distribution with parameter a then E(N ) = a . Thus, the parameter of the Poisson distribution is the mean of the distribution. Proof In the Poisson experiment, the parameter is a = rt . Vary the parameter and note the shape of the probability density function and the location of the mean. For various values of the parameter, run the experiment 1000 times and compare the sample mean to the distribution mean.
4.1.8
https://stats.libretexts.org/@go/page/10156
The Exponential Distribution Recall that the exponential distribution is a continuous distribution with probability density function f given by f (t) = re
−rt
,
t ∈ [0, ∞)
(4.1.31)
where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”; in particular, the distribution governs the time between arrivals in the Poisson model. The exponential distribution is studied in detail in the chapter on the Poisson Process. Suppose that T has the exponential distribution with rate parameter r. Then E(T ) = 1/r . Proof Recall that the mode of T is 0 and the median of T is ln 2/r. Note how these measures of center are ordered: 0 < ln 2/r < 1/r In the gamma experiment, set n = 1 to get the exponential distribution. This app simulates the first arrival in a Poisson process. Vary r with the scroll bar and note the position of the mean relative to the graph of the probability density function. For selected values of r, run the experiment 1000 times and compare the sample mean to the distribution mean. Suppose again that T has the exponential distribution with rate parameter r and suppose that t > 0 . Find E(T
∣ T > t)
.
Answer
The Gamma Distribution Recall that the gamma distribution is a continuous distribution with probability density function f given by n−1
n
t
f (t) = r
e
−rt
,
t ∈ [0, ∞)
(4.1.33)
(n − 1)!
where n ∈ N is the shape parameter and r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”, and in particular, models the n th arrival in the Poisson process. Thus it follows that if (X , X , … , X ) is a sequence of independent random variables, each having the exponential distribution with rate parameter r, then T = ∑ X has the gamma distribution with shape parameter n and rate parameter r. The gamma distribution is studied in more generality, with non-integer shape parameters, in the chapter on the Special Distributions. +
1
2
n
n
i=1
i
Suppose that T has the gamma distribution with shape parameter n and rate parameter r. Then E(T ) = n/r . Proof from the definition Proof using the additive property Note again how much easier and more intuitive the second proof is than the first. Open the gamma experiment, which simulates the arrival times in the Poisson process. Vary the parameters and note the position of the mean relative to the graph of the probability density function. For selected parameter values, run the experiment 1000 times and compare the sample mean to the distribution mean.
Beta Distributions The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions. Suppose that X has probability density function f given by f (x) = 3x for x ∈ [0, 1]. 2
1. Find the mean of X. 2. Find the mode of X. 3. Find the median of X. 4. Sketch the graph of f and show the location of the mean, median, and mode on the x-axis. Answer
4.1.9
https://stats.libretexts.org/@go/page/10156
In the special distribution simulator, select the beta distribution and set a = 3 and b = 1 to get the distribution in the last exercise. Run the experiment 1000 times and compare the sample mean to the distribution mean. Suppose that a sphere has a random radius R with probability density function f given by Find the expected value of each of the following:
2
f (r) = 12 r (1 − r)
for
.
r ∈ [0, 1]
1. The circumference C = 2πR 2. The surface area A = 4πR 3. The volume V = π R 2
4
3
3
Answer Suppose that X has probability density function f given by f (x) =
1 π√x(1−x)
for x ∈ (0, 1).
1. Find the mean of X. 2. Find median of X. 3. Note that f is unbounded, so X does not have a mode. 4. Sketch the graph of f and show the location of the mean and median on the x-axis. Answer The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the chapter on Special Distributions. Open the Brownian motion experiment and select the last zero. Run the simulation 1000 times and compare the sample mean to the distribution mean. Suppose that the grades on a test are described by the random variable Y = 100X where X has the beta distribution with probability density function f given by f (x) = 12x(1 − x) for x ∈ [0, 1]. The grades are generally low, so the teacher − − − − decides to “curve” the grades using the transformation Z = 10√Y = 100√X . Find the expected value of each of the following variables 2
1. X 2. Y 3. Z Answer
The Pareto Distribution Recall that the Pareto distribution is a continuous distribution with probability density function f given by a f (x) =
a+1
,
x ∈ [1, ∞)
(4.1.36)
x
where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model certain financial variables. The Pareto distribution is studied in detail in the chapter on Special Distributions. Suppose that X has the Pareto distribution with shape parameter a . Then 1. E(X) = ∞ if 0 < a ≤ 1 2. E(X) = if a > 1 a
a−1
Proof The previous exercise gives us our first example of a distribution whose mean is infinite. In the special distribution simulator, select the Pareto distribution. Note the shape of the probability density function and the location of the mean. For the following values of the shape parameter a , run the experiment 1000 times and note the behavior of the empirical mean.
4.1.10
https://stats.libretexts.org/@go/page/10156
1. a = 1 2. a = 2 3. a = 3 .
The Cauchy Distribution Recall that the (standard) Cauchy distribution has probability density function f given by 1 f (x) =
,
2
x ∈ R
(4.1.39)
π (1 + x )
This distribution is named for Augustin Cauchy. The Cauchy distributions is studied in detail in the chapter on Special Distributions. If X has the Cauchy distribution then E(X) does not exist. Proof Note that the graph of f is symmetric about 0 and is unimodal. Thus, the mode and median of X are both 0. By the symmetry result, if X had a mean, the mean would be 0 also, but alas the mean does not exist. Moreover, the non-existence of the mean is not just a pedantic technicality. If we think of the probability distribution as a mass distribution, then the moment to the right of a is ∫ (x − a)f (x) dx = ∞ and the moment to the left of a is ∫ (x − a)f (x) dx = −∞ for every a ∈ R . The center of mass simply does not exist. Probabilisitically, the law of large numbers fails, as you can see in the following simulation exercise: ∞
a
a
−∞
In the Cauchy experiment (with the default parameter values), a light sources is 1 unit from position 0 on an infinite straight wall. The angle that the light makes with the perpendicular is uniformly distributed on the interval ( , ), so that the position of the light beam on the wall has the Cauchy distribution. Run the simulation 1000 times and note the behavior of the empirical mean. −π
π
2
2
The Normal Distribution Recall that the standard normal distribution is a continuous distribution with density function ϕ given by 1 ϕ(z) =
− −e √2π
−
1 2
z
2
,
z ∈ R
(4.1.41)
Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in the chapter on Special Distributions. If Z has the standard normal distribution then E(X) = 0 . Proof The standard normal distribution is unimodal and symmetric about 0. Thus, the median, mean, and mode all agree. More generally, for μ ∈ (−∞, ∞) and σ ∈ (0, ∞), recall that X = μ + σZ has the normal distribution with location parameter μ and scale parameter σ. X has probability density function f given by 1 f (x) =
− − √2πσ
1 exp[−
x −μ (
2
2
) ],
x ∈ R
(4.1.43)
σ
The location parameter is the mean of the distribution: If X has the normal distribution with location parameter μ ∈ R and scale parameter σ ∈ (0, ∞), then E(X) = μ Proof In the special distribution simulator, select the normal distribution. Vary the parameters and note the location of the mean. For selected parameter values, run the simulation 1000 times and compare the sample mean to the distribution mean.
4.1.11
https://stats.libretexts.org/@go/page/10156
Additional Exercises Suppose that (X, Y ) has probability density function following expected values:
f
given by
f (x, y) = x + y
for
. Find the
(x, y) ∈ [0, 1] × [0, 1]
1. E(X) 2. E (X Y ) 3. E (X + Y ) 4. E(XY ∣ Y > X) 2
2
2
Answer Suppose that
has a discrete distribution with probability density function n ∈ {1, 2, 3, 4}. Find each of the following: N
f
given by
f (n) =
1 50
2
n (5 − n)
for
1. The median of N . 2. The mode of N 3. E(N ). 4. E (N ) 5. E(1/N ). 6. E (1/N ). 2
2
Answer Suppose that X and Y are real-valued random variables with E(X) = 5 and E(Y ) = −2 . Find E(3X + 4Y
− 7)
.
Answer Suppose that
X
and
E [(3X − 4)(2Y + 7)]
Y
are real-valued, independent random variables, and that
E(X) = 5
and
E(Y ) = −2
. Find
.
Answer Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each hunter selects one duck at random and shoots. Find the expected number of ducks killed. Solution For a more complete analysis of the duck hunter problem, see The Number of Distinct Sample Values in the chapter on Finite Sampling Models. Consider the following game: An urn initially contains one red and one green ball. A ball is selected at random, and if the ball is green, the game is over. If the ball is red, the ball is returned to the urn, another red ball is added, and the game continues. At each stage, a ball is selected at random, and if the ball is green, the game is over. If the ball is red, the ball is returned to the urn, another red ball is added, and the game continues. Let X denote the length of the game (that is, the number of selections required to obtain a green ball). Find E(X). Solution This page titled 4.1: Definitions and Basic Properties is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.1.12
https://stats.libretexts.org/@go/page/10156
4.2: Additional Properties In this section, we study some properties of expected value that are a bit more specialized than the basic properties considered in the previous section. Nonetheless, the new results are also very important. They include two fundamental inequalities as well as special formulas for the expected value of a nonnegative variable. As usual, unless otherwise noted, we assume that the referenced expected values exist.
Basic Theory Markov's Inequality Our first result is known as Markov's inequality (named after Andrei Markov). It gives an upper bound for the tail probability of a nonnegative random variable in terms of the expected value of the variable. If X is a nonnegative random variable, then E(X) P(X ≥ x) ≤
,
x >0
(4.2.1)
x
Proof The upper bound in Markov's inequality may be rather crude. In fact, it's quite possible that E(X)/x ≥ 1 , in which case the bound is worthless. However, the real value of Markov's inequality lies in the fact that it holds with no assumptions whatsoever on the distribution of X (other than that X be nonnegative). Also, as an example below shows, the inequality is tight in the sense that equality can hold for a given x. Here is a simple corollary of Markov's inequality. If X is a real-valued random variable and k ∈ (0, ∞) then k
E (|X| ) P(|X| ≥ x) ≤
x >0
(4.2.2)
xk
Proof In this corollary of Markov's inequality, we could try to find
k >0
so that
k
k
E (|X| ) / x
is minimized, thus giving the tightest
bound on P (|X|) ≥ x) .
Right Distribution Function Our next few results give alternative ways to compute the expected value of a nonnegative random variable by means of the righttail distribution function. This function also known as the reliability function if the variable represents the lifetime of a device. If X is a nonnegative random variable then ∞
E(X) = ∫
P(X > x) dx
(4.2.4)
0
Proof Here is a slightly more general result: If X is a nonnegative random variable and k ∈ (0, ∞) then ∞
E(X
k
) =∫
k−1
kx
P(X > x) dx
(4.2.6)
0
Proof The following result is similar to the theorem above, but is specialized to nonnegative integer valued variables: Suppose that N has a discrete distribution, taking values in N. Then
4.2.1
https://stats.libretexts.org/@go/page/10157
∞
∞
E(N ) = ∑ P(N > n) = ∑ P(N ≥ n) n=0
(4.2.8)
n=1
Proof
A General Definition The special expected value formula for nonnegative variables can be used as the basis of a general formulation of expected value that would work for discrete, continuous, or even mixed distributions, and would not require the assumption of the existence of probability density functions. First, the special formula is taken as the definition of E(X) if X is nonnegative. If X is a nonnegative random variable, define ∞
E(X) = ∫
P(X > x) dx
(4.2.10)
0
Next, for x ∈ R, recall that the positive and negative parts of x are x
+
= max{x, 0}
and x
−
.
= max{0, −x}
For x ∈ R, 1. x ≥ 0 , x ≥ 0 2. x = x − x 3. |x| = x + x +
−
+
−
+
−
Now, if X is a real-valued random variable, then X and X , the positive and negative parts of X, are nonnegative random variables, so their expected values are defined as above. The definition of E(X) is then natural, anticipating of course the linearity property. +
If X is a real-valued random variable, define E(X) = E (X the right is finite.
−
+
) − E (X
−
)
, assuming that at least one of the expected values on
The usual formulas for expected value in terms of the probability density function, for discrete, continuous, or mixed distributions, would now be proven as theorems. We will not go further in this direction, however, since the most complete and general definition of expected value is given in the advanced section on expected value as an integral.
The Change of Variables Theorem Suppose that X takes values in S and has probability density function f . Suppose also that r : S → R , so that r(X) is a realvalued random variable. The change of variables theorem gives a formula for computing E [r(X)] without having to first find the probability density function of r(X). If S is countable, so that X has a discrete distribution, then E [r(X)] = ∑ r(x)f (x)
(4.2.11)
x∈S
If S ⊆ R and X has a continuous distribution on S then n
E [r(X)] = ∫
r(x)f (x) dx
(4.2.12)
S
In both cases, of course, we assume that the expected values exist. In the previous section on basic properties, we proved the change of variables theorem when X has a discrete distribution and when X has a continuous distribution but r has countable range. Now we can finally finish our proof in the continuous case. Suppose that X has a continuous distribution on S with probability density function f , and r : S → R . Then E [r(X)] = ∫
r(x)f (x) dx
(4.2.13)
S
Proof
4.2.2
https://stats.libretexts.org/@go/page/10157
Jensens's Inequality Our next sequence of exercises will establish an important inequality known as Jensen's inequality, named for Johan Jensen. First we need a definition. A real-valued function g defined on an interval S ⊆ R is said to be convex (or concave upward) on S if for each exist numbers a and b (that may depend on t ), such that
t ∈ S
, there
1. a + bt = g(t) 2. a + bx ≤ g(x) for all x ∈ S The graph of x ↦ a + bx is called a supporting line for g at t . Thus, a convex function has at least one supporting line at each point in the domain
Figure 4.2.1 : A convex function and several supporting lines
You may be more familiar with convexity in terms of the following theorem from calculus: If g has a continuous, non-negative second derivative on S , then g is convex on S (since the tangent line at t is a supporting line at t for each t ∈ S ). The next result is the single variable version of Jensen's inequality If X takes values in an interval S and g : S → R is convex on S , then E [g(X)] ≥ g [E(X)]
(4.2.17)
Proof Jensens's inequality extends easily to higher dimensions. The 2-dimensional version is particularly important, because it will be used to derive several special inequalities in the section on vector spaces of random variables. We need two definitions. A set
is convex if for every pair of points in and p ∈ [0, 1] then px + (1 − p)y ∈ S . n
S ⊆R
x, y ∈ S
S
, the line segment connecting those points also lies in
S
. That is, if
Figure 4.2.2 : A convex subset of R
2
Suppose that S ⊆ R is convex. A function g : S → R on a ∈ R and b ∈ R (depending on t ) such that n
S
is convex (or concave upward) if for each
t ∈ S
, there exist
n
1. a + b ⋅ t = g(t) 2. a + b ⋅ x ≤ g(x) for all x ∈ S The graph of x ↦ a + b ⋅ x is called a supporting hyperplane for g at t . In R a supporting hyperplane is an ordinary plane. From calculus, if g has continuous second derivatives on S and has a positive non-definite second derivative matrix, then g is convex on S . Suppose now that X = (X , X , … , X ) takes values in S ⊆ R , 2
n
1
4.2.3
2
n
https://stats.libretexts.org/@go/page/10157
and let E(X) = (E(X
1 ),
. The following result is the general version of Jensen's inequlaity.
E(X2 ), … , E(Xn ))
If S is convex and g : S → R is convex on S then E [g(X)] ≥ g [E(X)]
(4.2.19)
Proof We will study the expected value of random vectors and matrices in more detail in a later section. In both the one and n dimensional cases, a function g : S → R is concave (or concave downward) if the inequality in the definition is reversed. Jensen's inequality also reverses.
Expected Value in Terms of the Quantile Function If X has a continuous distribution with support on an interval of R, then there is a simple (but not well known) formula for the expected value of X as the integral the quantile function of X. Here is the general result: Suppose that X has a continuous distribution with support on an interval (a, b) ⊆ R . Let F denote the cumulative distribution function of X so that F is the quantile function of X. If g : (a, b) → R then (assuming that the expected value exists), −1
1
E[g(X)] = ∫
g [F
−1
(p)] dp,
n ∈ N
(4.2.21)
0
Proof So in particular, E(X) = ∫
1
0
F
−1
(p) dp
.
Examples and Applications Let a ∈ (0, ∞) and let equality at x = a .
P(X = a) = 1
, so that
X
is a constant random variable. Show that Markov's inequality is in fact
Solution
The Exponential Distribution Recall that the exponential distribution is a continuous distribution with probability density function f given by f (t) = re
−rt
,
t ∈ [0, ∞)
(4.2.23)
where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other “arrival times”; in particular, the distribution governs the time between arrivals in the Poisson model. The exponential distribution is studied in detail in the chapter on the Poisson Process. Suppose that X has exponential distribution with rate parameter r. 1. Find E(X) using the right distribution formula. 2. Find E(X) using the quantile function formula. 3. Compute both sides of Markov's inequality. Answer Open the gamma experiment. Keep the default value of the stopping parameter (n = 1 ), which gives the exponential distribution. Vary the rate parameter r and note the shape of the probability density function and the location of the mean. For various values of the rate parameter, run the experiment 1000 times and compare the sample mean with the distribution mean.
The Geometric Distribution Recall that Bernoulli trials are independent trials each with two outcomes, which in the language of reliability, are called success and failure. The probability of success on each trial is p ∈ [0, 1]. A separate chapter on Bernoulli Trials explores this random process in more detail. It is named for Jacob Bernoulli. If p ∈ (0, 1), the trial number N of the first success has the geometric distribution on N with success parameter p. The probability density function f of N is given by +
n−1
f (n) = p(1 − p )
4.2.4
,
n ∈ N+
(4.2.24)
https://stats.libretexts.org/@go/page/10157
Suppose that N has the geometric distribution on N with parameter p ∈ (0, 1). +
1. Find E(N ) using the right distribution function formula. 2. Compute both sides of Markov's inequality. 3. Find E(N ∣ N is even ) . Answer Open the negative binomial experiment. Keep the default value of the stopping parameter (k = 1 ), which gives the geometric distribution. Vary the success parameter p and note the shape of the probability density function and the location of the mean. For various values of the success parameter, run the experiment 1000 times and compare the sample mean with the distribution mean.
The Pareto Distribution Recall that the Pareto distribution is a continuous distribution with probability density function f given by a f (x) =
a+1
,
x ∈ [1, ∞)
(4.2.25)
x
where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model certain financial variables. The Pareto distribution is studied in detail in the chapter on Special Distributions. Suppose that X has the Pareto distribution with parameter a > 1 . 1. Find E(X) using the right distribution function formula. 2. Find E(X) using the quantile function formula. 3. Find E(1/X). 4. Show that x ↦ 1/x is convex on (0, ∞). 5. Verify Jensen's inequality by comparing E(1/X) and 1/E(X). Answer Open the special distribution simulator and select the Pareto distribution. Keep the default value of the scale parameter. Vary the shape parameter and note the shape of the probability density function and the location of the mean. For various values of the shape parameter, run the experiment 1000 times and compare the sample mean with the distribution mean.
A Bivariate Distribution Suppose that (X, Y ) has probability density function f given by f (x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1 . 1. Show that the domain of f is a convex set. 2. Show that (x, y) ↦ x + y is convex on the domain of f . 3. Compute E (X + Y ) . 4. Compute [E(X)] + [E(Y )] . 5. Verify Jensen's inequality by comparing (b) and (c). 2
2
2
2
2
2
Answer
The Arithmetic and Geometric Means Suppose that {x
1,
x2 , … , xn }
is a set of positive numbers. The arithmetic mean is at least as large as the geometric mean: n
1/n
1
( ∏ xi )
≤
i=1
n
n
∑ xi
(4.2.26)
i=1
Proof This page titled 4.2: Additional Properties is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.2.5
https://stats.libretexts.org/@go/page/10157
4.3: Variance Recall the expected value of a real-valued random variable is the mean of the variable, and is a measure of the center of the distribution. Recall also that by taking the expected value of various transformations of the variable, we can measure other interesting characteristics of the distribution. In this section, we will study expected values that measure the spread of the distribution about the mean.
Basic Theory Definitions and Interpretations As usual, we start with a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose that X is a random variable for the experiment, taking values in S ⊆ R . Recall that E(X), the expected value (or mean) of X gives the center of the distribution of X. The variance and standard deviation of X are defined by 1. var(X) = E ([X − E(X)] 2.
2
)
− − − − − − sd(X) = √var(X)
Implicit in the definition is the assumption that the mean E(X) exists, as a real number. If this is not the case, then var(X) (and hence also sd(X)) are undefined. Even if E(X) does exist as a real number, it's possible that var(X) = ∞ . For the remainder of our discussion of the basic theory, we will assume that expected values that are mentioned exist as real numbers. The variance and standard deviation of X are both measures of the spread of the distribution about the mean. Variance (as we will see) has nicer mathematical properties, but its physical unit is the square of that of X. Standard deviation, on the other hand, is not as nice mathematically, but has the advantage that its physical unit is the same as that of X. When the random variable X is understood, the standard deviation is often denoted by σ, so that the variance is σ . 2
Recall that the second moment of X about a ∈ R is E [(X − a) ] . Thus, the variance is the second moment of X about the mean μ = E(X) , or equivalently, the second central moment of X. In general, the second moment of X about a ∈ R can also be thought of as the mean square error if the constant a is used as an estimate of X. In addition, second moments have a nice interpretation in physics. If we think of the distribution of X as a mass distribution in R, then the second moment of X about a ∈ R is the moment of inertia of the mass distribution about a . This is a measure of the resistance of the mass distribution to any change in its rotational motion about a . In particular, the variance of X is the moment of inertia of the mass distribution about the center of mass μ . 2
Figure 4.3.1 : The moment of inertia about a .
The mean square error (or equivalently the moment of inertia) about a is minimized when a = μ : Let mse(a) = E [(X − a)
2
]
for a ∈ R . Then mse is minimized when a = μ , and the minimum value is σ . 2
Proof
Figure 4.3.2 : mse(a) is minimized when a = μ .
4.3.1
https://stats.libretexts.org/@go/page/10158
The relationship between measures of center and measures of spread is studied in more detail in the advanced section on vector spaces of random variables.
Properties The following exercises give some basic properties of variance, which in turn rely on basic properties of expected value. As usual, be sure to try the proofs yourself before reading the ones in the text. Our first results are computational formulas based on the change of variables formula for expected value Let μ = E(X) . 1. If X has a discrete distribution with probability density function f , then var(X) = ∑ 2. If X has a continuous distribution with probability density function f , then var(X) = ∫
2
x∈S S
(x − μ) f (x)
.
2
(x − μ) f (x)dx
Proof Our next result is a variance formula that is usually better than the definition for computational purposes. var(X) = E(X
2
2
) − [E(X)]
.
Proof Of course, by the change of variables formula, E (X ) = ∑ x f (x) if X has a discrete distribution, and E (X ) = ∫ x f (x) dx if X has a continuous distribution. In both cases, f is the probability density function of X. 2
2
x∈S
2
2
S
Variance is always nonnegative, since it's the expected value of a nonnegative random variable. Moreover, any random variable that really is random (not a constant) will have strictly positive variance. The nonnegative property. 1. var(X) ≥ 0 2. var(X) = 0 if and only if P(X = c) = 1 for some constant c (and then of course, E(X) = c ). Proof Our next result shows how the variance and standard deviation are changed by a linear transformation of the random variable. In particular, note that variance, unlike general expected value, is not a linear operation. This is not really surprising since the variance is the expected value of a nonlinear function of the variable: x ↦ (x − μ) . 2
If a,
b ∈ R
then
1. var(a + bX) = b var(X) 2. sd(a + bX) = |b| sd(X) 2
Proof Recall that when b > 0 , the linear transformation x ↦ a + bx is called a location-scale transformation and often corresponds to a change of location and change of scale in the physical units. For example, the change from inches to centimeters in a measurement of length is a scale transformation, and the change from Fahrenheit to Celsius in a measurement of temperature is both a location and scale transformation. The previous result shows that when a location-scale transformation is applied to a random variable, the standard deviation does not depend on the location parameter, but is multiplied by the scale factor. There is a particularly important location-scale transformation. Suppose that X is a random variable with mean μ and variance σ . The random variable Z defined as follows is the standard score of X. 2
X −μ Z =
(4.3.2) σ
1. E(Z) = 0 2. var(Z) = 1 Proof
4.3.2
https://stats.libretexts.org/@go/page/10158
Since X and its mean and standard deviation all have the same physical units, the standard score the directed distance from E(X) to X in terms of standard deviations. Let Z denote the standard score of X, and suppose that Y
where a,
= a + bX
b ∈ R
Z
is dimensionless. It measures
and b ≠ 0 .
1. If b > 0 , the standard score of Y is Z . 2. If b < 0 , the standard score of Y is −Z . Proof As just noted, when b > 0 , the variable Y = a + bX is a location-scale transformation and often corresponds to a change of physical units. Since the standard score is dimensionless, it's reasonable that the standard scores of X and Y are the same. Here is another standardized measure of dispersion: Suppose that X is a random variable with mean:
E(X) ≠ 0
. The coefficient of variation is the ratio of the standard deviation to the
sd(X) cv(X) =
(4.3.4) E(X)
The coefficient of variation is also dimensionless, and is sometimes used to compare variability for random variables with different means. We will learn how to compute the variance of the sum of two random variables in the section on covariance.
Chebyshev's Inequality Chebyshev's inequality (named after Pafnuty Chebyshev) gives an upper bound on the probability that a random variable will be more than a specified distance from its mean. This is often useful in applied problems where the distribution is unknown, but the mean and variance are known (at least approximately). In the following two results, suppose that X is a real-valued random variable with mean μ = E(X) ∈ R and standard deviation σ = sd(X) ∈ (0, ∞) . Chebyshev's inequality 1. σ P (|X − μ| ≥ t) ≤
2
,
2
t >0
(4.3.5)
t
Proof
Figure 4.3.3 : Chebyshev's inequality
Here's an alternate version, with the distance in terms of standard deviation. Chebyshev's inequality 2. 1 P (|X − μ| ≥ kσ) ≤
2
,
k >0
(4.3.6)
k
Proof The usefulness of the Chebyshev inequality comes from the fact that it holds for any distribution (assuming only that the mean and variance exist). The tradeoff is that for many specific distributions, the Chebyshev bound is rather crude. Note in particular that the first inequality is useless when t ≤ σ , and the second inequality is useless when k ≤ 1 , since 1 is an upper bound for the probability of any event. On the other hand, it's easy to construct a distribution for which Chebyshev's inequality is sharp for a specified value of t ∈ (0, ∞) . Such a distribution is given in an exercise below.
4.3.3
https://stats.libretexts.org/@go/page/10158
Examples and Applications As always, be sure to try the problems yourself before looking at the solutions and answers.
Indicator Variables Suppose that X is an indicator variable with p = P(X = 1) , where p ∈ [0, 1]. Then 1. E(X) = p 2. var(X) = p(1 − p) Proof The graph of var(X) as a function of p is a parabola, opening downward, with roots at 0 and 1. Thus the minimum value of var(X) is 0, and occurs when p = 0 and p = 1 (when X is deterministic, of course). The maximum value is and occurs when p = . 1 4
1 2
Figure 4.3.4 : The variance of an indicator variable as a function of p.
Uniform Distributions Discrete uniform distributions are widely used in combinatorial probability, and model a point chosen at random from a finite set. The mean and variance have simple forms for the discrete uniform distribution on a set of evenly spaced points (sometimes referred to as a discrete interval): Suppose that X has the discrete uniform distribution on n ∈ N . Let b = a + (n − 1)h , the right endpoint. Then
{a, a + h, … , a + (n − 1)h}
where
a ∈ R
,
h ∈ (0, ∞)
, and
+
1. E(X) = (a + b) . 2. var(X) = (b − a)(b − a + 2h) . 1 2
1
12
Proof Note that mean is simply the average of the endpoints, while the variance depends only on difference between the endpoints and the step size. Open the special distribution simulator, and select the discrete uniform distribution. Vary the parameters and note the location and size of the mean ± standard deviation bar in relation to the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. Next, recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the interval. Continuous uniform distributions arise in geometric probability and a variety of other applied problems. Suppose that X has the continuous uniform distribution on the interval [a, b] where a,
b ∈ R
with a < b . Then
1. E(X) = (a + b) 2. var(X) = (b − a) 1 2
1
2
12
Proof Note that the mean is the midpoint of the interval and the variance depends only on the length of the interval. Compare this with the results in the discrete case.
4.3.4
https://stats.libretexts.org/@go/page/10158
Open the special distribution simulator, and select the continuous uniform distribution. This is the uniform distribution the interval [a, a + w] . Vary the parameters and note the location and size of the mean ± standard deviation bar in relation to the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Dice Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice. Here are three: An ace-six flat die is a six-sided die in which faces 1 and 6 have probability each while faces 2, 3, 4, and 5 have probability each. A two-five flat die is a six-sided die in which faces 2 and 5 have probability each while faces 1, 3, 4, and 6 have probability each. A three-four flat die is a six-sided die in which faces 3 and 4 have probability each while faces 1, 2, 5, and 6 have probability each. 1 4
1 8
1
1
4
8
1 4
1 8
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use ( and ) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities that the other four faces. Flat dice are sometimes used by gamblers to cheat. In the following problems, you will compute the mean and variance for each of the various types of dice. Be sure to compare the results. 1
1
4
8
A standard, fair die is thrown and the score each of the following:
is recorded. Sketch the graph of the probability density function and compute
X
1. E(X) 2. var(X) Answer An ace-six flat die is thrown and the score each of the following:
X
is recorded. Sketch the graph of the probability density function and compute
X
is recorded. Sketch the graph of the probability density function and compute
1. E(X) 2. var(X) Answer A two-five flat die is thrown and the score each of the following: 1. E(X) 2. var(X) Answer A three-four flat die is thrown and the score each of the following:
X
is recorded. Sketch the graph of the probability density function and compute
1. E(X) 2. var(X) Answer In the dice experiment, select one die. For each of the following cases, note the location and size of the mean ± standard deviation bar in relation to the probability density function. Run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. 1. Fair die 2. Ace-six flat die 3. Two-five flat die 4. Three-four flat die
4.3.5
https://stats.libretexts.org/@go/page/10158
The Poisson Distribution Recall that the Poisson distribution is a discrete distribution on N with probability density function f given by n
f (n) = e
−a
a
,
n ∈ N
(4.3.10)
n!
where a ∈ (0, ∞) is a parameter. The Poisson distribution is named after Simeon Poisson and is widely used to model the number of “random points” in a region of time or space; the parameter a is proportional to the size of the region. The Poisson distribution is studied in detail in the chapter on the Poisson Process. Suppose that N has the Poisson distribution with parameter a . Then 1. E(N ) = a 2. var(N ) = a Proof Thus, the parameter of the Poisson distribution is both the mean and the variance of the distribution. In the Poisson experiment, the parameter is a = rt . Vary the parameter and note the size and location of the mean ± standard deviation bar in relation to the probability density function. For selected values of the parameter, run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
The Geometric Distribution Recall that Bernoulli trials are independent trials each with two outcomes, which in the language of reliability, are called success and failure. The probability of success on each trial is p ∈ [0, 1]. A separate chapter on Bernoulli Trials explores this random process in more detail. It is named for Jacob Bernoulli. If p ∈ (0, 1], the trial number N of the first success has the geometric distribution on N with success parameter p. The probability density function f of N is given by +
n−1
f (n) = p(1 − p )
,
n ∈ N+
(4.3.13)
Suppose that N has the geometric distribution on N with success parameter p ∈ (0, 1]. Then +
1. E(N ) =
1 p
2. var(N ) =
1−p p
2
Proof Note that the variance is 0 when p = 1 , not surprising since X is deterministic in this case. In the negative binomial experiment, set k = 1 to get the geometric distribution . Vary p with the scroll bar and note the size and location of the mean ± standard deviation bar in relation to the probability density function. For selected values of p, run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. Suppose that N has the geometric distribution with parameter p = . Compute the true value and the Chebyshev bound for the probability that N is at least 2 standard deviations away from the mean. 3 4
Answer
The Exponential Distribution Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by f (t) = re
−rt
,
t ∈ [0, ∞)
(4.3.16)
where r ∈ (0, ∞) is the with rate parameter. This distribution is widely used to model failure times and other “arrival times”. The exponential distribution is studied in detail in the chapter on the Poisson Process. Suppose that T has the exponential distribution with rate parameter r. Then
4.3.6
https://stats.libretexts.org/@go/page/10158
1. E(T ) = . 2. var(T ) = . 1 r
1
r2
Proof Thus, for the exponential distribution, the mean and standard deviation are the same. In the gamma experiment, set k = 1 to get the exponential distribution. Vary r with the scroll bar and note the size and location of the mean ± standard deviation bar in relation to the probability density function. For selected values of r, run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. Suppose that X has the exponential distribution with rate parameter r > 0 . Compute the true value and the Chebyshev bound for the probability that X is at least k standard deviations away from the mean. Answer
The Pareto Distribution Recall that the Pareto distribution is a continuous distribution on [1, ∞) with probability density function f given by a f (x) =
a+1
,
x ∈ [1, ∞)
(4.3.19)
x
where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model financial variables such as income. The Pareto distribution is studied in detail in the chapter on Special Distributions. Suppose that X has the Pareto distribution with shape parameter a . Then 1. E(X) = ∞ if 0 < a ≤ 1 and E(X) = if 1 < a < ∞ 2. var(X) is undefined if 0 < a ≤ 1 , var(X) = ∞ if 1 < a ≤ 2 , and var(X) = a
a−1
a 2
if 2 < a < ∞
(a−1 ) (a−2)
Proof In the special distribution simuator, select the Pareto distribution. Vary a with the scroll bar and note the size and location of the mean ± standard deviation bar. For each of the following values of a , run the experiment 1000 times and note the behavior of the empirical mean and standard deviation. 1. a = 1 2. a = 2 3. a = 3
The Normal Distribution Recall that the standard normal distribution is a continuous distribution on R with probability density function ϕ given by 1 ϕ(z) =
− − √2π
e
−
1 2
z
2
,
z ∈ R
(4.3.22)
Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in the chapter on Special Distributions. Suppose that Z has the standard normal distribution. Then 1. E(Z) = 0 2. var(Z) = 1 Proof More generally, for μ ∈ R and σ ∈ (0, ∞), recall that the normal distribution with location parameter μ and scale parameter a continuous distribution on R with probability density function f given by
4.3.7
σ
is
https://stats.libretexts.org/@go/page/10158
1 f (x) =
− − √2πσ
1 exp[−
2
x −μ (
2
) ],
x ∈ R
(4.3.25)
σ
Moreover, if Z has the standard normal distribution, then X = μ + σZ has the normal distribution with location parameter μ and scale parameter σ. As the notation suggests, the location parameter is the mean of the distribution and the scale parameter is the standard deviation. Suppose that X has the normal distribution with location parameter μ and scale parameter σ. Then 1. E(X) = μ 2. var(X) = σ
2
Proof So to summarize, if X has a normal distribution, then its standard score Z has the standard normal distribution. In the special distribution simulator, select the normal distribution. Vary the parameters and note the shape and location of the mean ± standard deviation bar in relation to the probability density function. For selected parameter values, run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Beta Distributions The distributions in this subsection belong to the family of beta distributions, which are widely used to model random proportions and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions. Suppose that X has a beta distribution with probability density function f . In each case below, graph f below and compute the mean and variance. 1. f (x) = 6x(1 − x) for x ∈ [0, 1] 2. f (x) = 12x (1 − x) for x ∈ [0, 1] 3. f (x) = 12x(1 − x) for x ∈ [0, 1] 2
2
Answer In the special distribution simulator, select the beta distribution. The parameter values below give the distributions in the previous exercise. In each case, note the location and size of the mean ± standard deviation bar. Run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. 1. a = 2 , b = 2 2. a = 3 , b = 2 3. a = 2 , b = 3 Suppose that a sphere has a random radius R with probability density function f given by Find the mean and standard deviation of each of the following:
2
f (r) = 12 r (1 − r)
for
.
r ∈ [0, 1]
1. The circumference C = 2πR 2. The surface area A = 4πR 3. The volume V = π R 2
4
3
3
Answer Suppose that X has probability density function f given by f (x) =
1 π√x(1−x)
for x ∈ (0, 1). Find
1. E(X) 2. var(X) Answer The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the chapter on Special Distributions.
4.3.8
https://stats.libretexts.org/@go/page/10158
Open the Brownian motion experiment and select the last zero. Note the location and size of the mean ± standard deviation bar in relation to the probability density function. Run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. Suppose that the grades on a test are described by the random variable Y = 100X where X has the beta distribution with probability density function f given by f (x) = 12x(1 − x) for x ∈ [0, 1]. The grades are generally low, so the teacher − − − − decides to “curve” the grades using the transformation Z = 10√Y = 100√X . Find the mean and standard deviation of each of the following variables: 2
1. X 2. Y 3. Z Answer
Exercises on Basic Properties Suppose that X is a real-valued random variable with E(X) = 5 and var(X) = 4 . Find each of the following: 1. var(3X − 2) 2. E(X ) 2
Answer Suppose that X is a real-valued random variable with E(X) = 2 and E [X(X − 1)] = 8 . Find each of the following: 1. E(X ) 2. var(X) 2
Answer The expected value E [X(X − 1)] is an example of a factorial moment. Suppose that X and Then
X2
1
1. E (X X ) = μ μ 2. var (X X ) = σ σ 1
2
1
1
2
are independent, real-valued random variables with
E(Xi ) = μi
and
var(Xi ) = σ
2
i
for
.
i ∈ {1, 2}
2
2
2
1
2
2
2
1
2
+σ μ
2
2
2
1
+σ μ
Proof Marilyn Vos Savant has an IQ of 228. Assuming that the distribution of IQ scores has mean 100 and standard deviation 15, find Marilyn's standard score. Answer Fix
t ∈ (0, ∞)
. Suppose that X is the discrete random variable with probability density function defined by , P(X = 0) = 1 − 2p , where p ∈ (0, ). Then equality holds in Chebyshev's inequality at t . 1
P(X = t) = P(X = −t) = p
2
Proof This page titled 4.3: Variance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.3.9
https://stats.libretexts.org/@go/page/10158
4.4: Skewness and Kurtosis As usual, our starting point is a random experiment, modeled by a probability space (Ω, F , P ). So to review, Ω is the set of outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose that X is a real-valued random variable for the experiment. Recall that the mean of X is a measure of the center of the distribution of X. Furthermore, the variance of X is the second moment of X about the mean, and measures the spread of the distribution of X about the mean. The third and fourth moments of X about the mean also measure interesting (but more subtle) features of the distribution. The third moment measures skewness, the lack of symmetry, while the fourth moment measures kurtosis, roughly a measure of the fatness in the tails. The actual numerical measures of these characteristics are standardized to eliminate the physical units, by dividing by an appropriate power of the standard deviation. As usual, we assume that all expected values given below exist, and we will let μ = E(X) and σ = var(X) . We assume that σ > 0 , so that the random variable is really random. 2
Basic Theory Skewness The skewness of X is the third moment of the standard score of X: X −μ skew(X) = E [ (
3
) ]
(4.4.1)
σ
The distribution of X is said to be positively skewed, negatively skewed or unskewed depending on whether positive, negative, or 0.
skew(X)
is
In the unimodal case, if the distribution is positively skewed then the probability density function has a long tail to the right, and if the distribution is negatively skewed then the probability density function has a long tail to the left. A symmetric distribution is unskewed. Suppose that the distribution of X is symmetric about a . Then 1. E(X) = a 2. skew(X) = 0 . Proof The converse is not true—a non-symmetric distribution can have skewness 0. Examples are given in Exercises (30) and (31) below. skew(X)
can be expressed in terms of the first three moments of X. E (X
3
) − 3μE (X
skew(X) = σ
2
3
) + 2μ
E (X
3
) − 3μσ
=
3
σ
3
2
3
−μ
(4.4.2)
Proof Since skewness is defined in terms of an odd power of the standard score, it's invariant under a linear transformation with positve slope (a location-scale transformation of the distribution). On the other hand, if the slope is negative, skewness changes sign. Suppose that a ∈ R and b ∈ R ∖ {0} . Then 1. skew(a + bX) = skew(X) if b > 0 2. skew(a + bX) = −skew(X) if b < 0 Proof Recall that location-scale transformations often arise when physical units are changed, such as inches to centimeters, or degrees Fahrenheit to degrees Celsius.
4.4.1
https://stats.libretexts.org/@go/page/10159
Kurtosis The kurtosis of X is the fourth moment of the standard score: X −μ kurt(X) = E [ (
4
) ]
(4.4.4)
σ
Kurtosis comes from the Greek word for bulging. Kurtosis is always positive, since we have assumed that σ > 0 (the random variable really is random), and therefore P(X ≠ μ) > 0 . In the unimodal case, the probability density function of a distribution with large kurtosis has fatter tails, compared with the probability density function of a distribution with smaller kurtosis. kurt(X)
can be expressed in terms of the first four moments of X. E (X
4
) − 4μE (X
3
2
) + 6 μ E (X
kurt(X) = σ
2
4
) − 3μ
E (X
4
) − 4μE (X
=
4
σ
3
2
) + 6μ σ
2
4
+ 3μ
(4.4.5)
4
Proof Since kurtosis is defined in terms of an even power of the standard score, it's invariant under linear transformations. Suppose that a ∈ R and b ∈ R ∖ {0} . Then kurt(a + bX) = kurt(X) . Proof We will show in below that the kurtosis of the standard normal distribution is 3. Using the standard normal distribution as a benchmark, the excess kurtosis of a random variable X is defined to be kurt(X) − 3 . Some authors use the term kurtosis to mean what we have defined as excess kurtosis.
Computational Exercises As always, be sure to try the exercises yourself before expanding the solutions and answers in the text.
Indicator Variables Recall that an indicator random variable is one that just takes the values 0 and 1. Indicator variables are the building blocks of many counting random variables. The corresponding distribution is known as the Bernoulli distribution, named for Jacob Bernoulli. Suppose that X is an indicator variable with P(X = 1) = p where p ∈ (0, 1). Then 1. E(X) = p 2. var(X) = p(1 − p) 3. skew(X) = 4. kurt(X) =
1−2p √p(1−p) 1−3p+3p
2
p(1−p)
Proof Open the binomial coin experiment and set n = 1 to get an indicator variable. Vary p and note the change in the shape of the probability density function.
Dice Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice. Here are three: An ace-six flat die is a six-sided die in which faces 1 and 6 have probability each. A two-five flat die is a six-sided die in which faces 2 and 5 have probability each.
1 4
each while faces 2, 3, 4, and 5 have probability
1 8
4.4.2
1 4
each while faces 1, 3, 4, and 6 have probability
1 8
https://stats.libretexts.org/@go/page/10159
A three-four flat die is a six-sided die in which faces 3 and 4 have probability each.
1 4
each while faces 1, 2, 5, and 6 have probability
1 8
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use ( and ) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities that the other four faces. Flat dice are sometimes used by gamblers to cheat. 1
1
4
8
A standard, fair die is thrown and the score X is recorded. Compute each of the following: 1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer An ace-six flat die is thrown and the score X is recorded. Compute each of the following: 1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer A two-five flat die is thrown and the score X is recorded. Compute each of the following: 1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer A three-four flat die is thrown and the score X is recorded. Compute each of the following: 1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer All four die distributions above have the same mean kurtosis.
7 2
and are symmetric (and hence have skewness 0), but differ in variance and
Open the dice experiment and set n = 1 to get a single die. Select each of the following, and note the shape of the probability density function in comparison with the computational results above. In each case, run the experiment 1000 times and compare the empirical density function to the probability density function. 1. fair 2. ace-six flat 3. two-five flat 4. three-four flat
Uniform Distributions Recall that the continuous uniform distribution on a bounded interval corresponds to selecting a point at random from the interval. Continuous uniform distributions arise in geometric probability and a variety of other applied problems. Suppose that X has uniform distribution on the interval [a, b], where a,
4.4.3
b ∈ R
and a < b . Then
https://stats.libretexts.org/@go/page/10159
1. E(X) = (a + b) 2. var(X) = (b − a) 3. skew(X) = 0 4. kurt(X) = 1 2
1
2
12
9 5
Proof Open the special distribution simulator, and select the continuous uniform distribution. Vary the parameters and note the shape of the probability density function in comparison with the moment results in the last exercise. For selected values of the parameter, run the simulation 1000 times and compare the empirical density function to the probability density function.
The Exponential Distribution Recall that the exponential distribution is a continuous distribution on [0, ∞)with probability density function f given by f (t) = re
−rt
,
t ∈ [0, ∞)
(4.4.7)
where r ∈ (0, ∞) is the with rate parameter. This distribution is widely used to model failure times and other “arrival times”. The exponential distribution is studied in detail in the chapter on the Poisson Process. Suppose that X has the exponential distribution with rate parameter r > 0 . Then 1. E(X) = 2. var(X) = 3. skew(X) = 2 4. kurt(X) = 9 1 r
1
r2
Proof Note that the skewness and kurtosis do not depend on the rate parameter r. That's because exponential distribution
1/r
is a scale parameter for the
Open the gamma experiment and set n = 1 to get the exponential distribution. Vary the rate parameter and note the shape of the probability density function in comparison to the moment results in the last exercise. For selected values of the parameter, run the experiment 1000 times and compare the empirical density function to the true probability density function.
Pareto Distribution Recall that the Pareto distribution is a continuous distribution on [1, ∞) with probability density function f given by a f (x) =
a+1
,
x ∈ [1, ∞)
(4.4.8)
x
where a ∈ (0, ∞) is a parameter. The Pareto distribution is named for Vilfredo Pareto. It is a heavy-tailed distribution that is widely used to model financial variables such as income. The Pareto distribution is studied in detail in the chapter on Special Distributions. Suppose that X has the Pareto distribution with shape parameter a > 0 . Then 1. E(X) = 2. var(X) =
a
a−1
a 2
(a−1 ) (a−2)
3. skew(X) = 4. kurt(X) =
if a > 1
a−3
if a > 2
− − − − −
2(1+a)
√1 − 2
2 a
3(a−2)(3 a +a+2) a(a−3)(a−4)
if a > 3 if a > 4
Proof Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the probability density function in comparison to the moment results in the last exercise. For selected values of the parameter, run the experiment 1000 times and compare the empirical density function to the true probability density function.
4.4.4
https://stats.libretexts.org/@go/page/10159
The Normal Distribution Recall that the standard normal distribution is a continuous distribution on R with probability density function ϕ given by 1 ϕ(z) =
− − √2π
e
−
1 2
z
2
,
z ∈ R
(4.4.9)
Normal distributions are widely used to model physical measurements subject to small, random errors and are studied in detail in the chapter on Special Distributions. Suppose that Z has the standard normal distribution. Then 1. E(Z) = 0 2. var(Z) = 1 3. skew(Z) = 0 4. kurt(Z) = 3 Proof More generally, for μ ∈ R and σ ∈ (0, ∞), recall that the normal distribution with mean continuous distribution on R with probability density function f given by
μ
and standard deviation
σ
is a
2
f (x) =
1 1 x −μ ( ) ], − − exp[− 2 σ √2πσ
x ∈ R
(4.4.10)
However, we also know that μ and σ are location and scale parameters, respectively. That is, if distribution then X = μ + σZ has the normal distribution with mean μ and standard deviation σ.
Z
has the standard normal
If X has the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞), then 1. skew(X) = 0 2. kurt(X) = 3 Proof Open the special distribution simulator and select the normal distribution. Vary the parameters and note the shape of the probability density function in comparison to the moment results in the last exercise. For selected values of the parameters, run the experiment 1000 times and compare the empirical density function to the true probability density function.
The Beta Distribution The distributions in this subsection belong to the family of beta distributions, which are continuous distributions on [0, 1] widely used to model random proportions and probabilities. The beta distribution is studied in detail in the chapter on Special Distributions. Suppose that X has probability density function f given by f (x) = 6x(1 − x) for x ∈ [0, 1]. Find each of the following: 1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer Suppose that X has probability density function f given by f (x) = 12x
2
(1 − x)
for x ∈ [0, 1]. Find each of the following:
1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer
4.4.5
https://stats.libretexts.org/@go/page/10159
Suppose that X has probability density function f given by f (x) = 12x(1 − x) for x ∈ [0, 1]. Find each of the following: 2
1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer Open the special distribution simulator and select the beta distribution. Select the parameter values below to get the distributions in the last three exercises. In each case, note the shape of the probability density function in relation to the calculated moment results. Run the simulation 1000 times and compare the empirical density function to the probability density function. 1. a = 2 , b = 2 2. a = 3 , b = 2 3. a = 2 , b = 3 Suppose that X has probability density function f given by f (x) =
1 π√x(1−x)
for x ∈ (0, 1). Find
1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer The particular beta distribution in the last exercise is also known as the (standard) arcsine distribution. It governs the last time that the Brownian motion process hits 0 during the time interval [0, 1]. The arcsine distribution is studied in more generality in the chapter on Special Distributions. Open the Brownian motion experiment and select the last zero. Note the shape of the probability density function in relation to the moment results in the last exercise. Run the simulation 1000 times and compare the empirical density function to the probability density function.
Counterexamples The following exercise gives a simple example of a discrete distribution that is not symmetric but has skewness 0. Suppose that X is a discrete random variable with probability density function f given by f (2) = . Find each of the following and then show that the distribution of X is not symmetric.
f (−3) =
1 10
,
f (−1) =
1 2
,
2 5
1. E(X) 2. var(X) 3. skew(X) 4. kurt(X) Answer The following exercise gives a more complicated continuous distribution that is not symmetric but has skewness 0. It is one of a collection of distributions constructed by Erik Meijer. Suppose that U , V , and I are independent random variables, and that U is normally distributed with mean μ = −2 and variance σ = 1 , V is normally distributed with mean ν = 1 and variance τ = 2 , and I is an indicator variable with P(I = 1) = p = . Let X = I U + (1 − I )V . Find each of the following and then show that the distribution of X is not symmetric. 2
2
1 3
1. E(X) 2. var(X) 3. skew(X)
4.4.6
https://stats.libretexts.org/@go/page/10159
4. kurt(X) Solution This page titled 4.4: Skewness and Kurtosis is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.4.7
https://stats.libretexts.org/@go/page/10159
4.5: Covariance and Correlation Recall that by taking the expected value of various transformations of a random variable, we can measure many interesting characteristics of the distribution of the variable. In this section, we will study an expected value that measures a special type of relationship between two real-valued variables. This relationship is very important both in probability and statistics.
Basic Theory Definitions As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). Unless otherwise noted, we assume that all expected values mentioned in this section exist. Suppose now that X and Y are real-valued random variables for the experiment (that is, defined on the probability space) with means E(X), E(Y ) and variances var(X), var(Y ), respectively. The covariance of (X, Y ) is defined by cov(X, Y ) = E ([X − E(X)] [Y − E(Y )])
(4.5.1)
and, assuming the variances are positive, the correlation of (X, Y ) is defined by cov(X, Y ) cor(X, Y ) =
(4.5.2) sd(X)sd(Y )
1. If cov(X, Y ) > 0 then X and Y are positively correlated. 2. If cov(X, Y ) < 0 then X and Y are negatively correlated. 3. If cov(X, Y ) = 0 then X and Y are uncorrelated. Correlation is a scaled version of covariance; note that the two parameters always have the same sign (positive, negative, or 0). Note also that correlation is dimensionless, since the numerator and denominator have the same physical units, namely the product of the units of X and Y . As these terms suggest, covariance and correlation measure a certain kind of dependence between the variables. One of our goals is a deeper understanding of this dependence. As a start, note that (E(X), E(Y )) is the center of the joint distribution of (X, Y ), and the vertical and horizontal lines through this point separate R into four quadrants. The function (x, y) ↦ [x − E(X)] [y − E(Y )] is positive on the first and third quadrants and negative on the second and fourth. 2
Figure 4.5.1 : A joint distribution with (E(X), E(Y ))as the center of mass
Properties of Covariance The following theorems give some basic properties of covariance. The main tool that we will need is the fact that expected value is a linear operation. Other important properties will be derived below, in the subsection on the best linear predictor. As usual, be sure to try the proofs yourself before reading the ones in the text. Once again, we assume that the random variables are defined on the common sample space, are real-valued, and that the indicated expected values exist (as real numbers). Our first result is a formula that is better than the definition for computational purposes, but gives less insight.
4.5.1
https://stats.libretexts.org/@go/page/10160
cov(X, Y ) = E(XY ) − E(X)E(Y )
.
Proof From (2), we see that X and Y are uncorrelated if and only if E(XY ) = E(X)E(Y ) , so here is a simple but important corollary: If X and Y are independent, then they are uncorrelated. Proof However, the converse fails with a passion: Exercise (31) gives an example of two variables that are functionally related (the strongest form of dependence), yet uncorrelated. The computational exercises give other examples of dependent yet uncorrelated variables also. Note also that if one of the variables has mean 0, then the covariance is simply the expected product. Trivially, covariance is a symmetric operation. cov(X, Y ) = cov(Y , X)
.
As the name suggests, covariance generalizes variance. cov(X, X) = var(X)
.
Proof Covariance is a linear operation in the first argument, if the second argument is fixed. If X, Y , Z are random variables, and c is a constant, then 1. cov(X + Y , Z) = cov(X, Z) + cov(Y , Z) 2. cov(cX, Y ) = c cov(X, Y ) Proof By symmetry, covariance is also a linear operation in the second argument, with the first argument fixed. Thus, the covariance operator is bi-linear. The general version of this property is given in the following theorem. Suppose that
and are constants. Then
(X1 , X2 , … , Xn )
(b1 , b2 , … , bm )
(Y1 , Y2 , … , Ym )
n
are sequences of random variables, and that
m
n
(a1 , a2 , … , an )
m
cov ( ∑ ai Xi , ∑ bj Yj ) = ∑ ∑ ai bj cov(Xi , Yj ) i=1
j=1
i=1
and
(4.5.7)
j=1
The following result shows how covariance is changed under a linear transformation of one of the variables. This is simply a special case of the basic properties, but is worth stating. If a,
b ∈ R
then cov(a + bX, Y ) = b cov(X, Y ) .
Proof Of course, by symmetry, the same property holds in the second argument. Putting the two together we have that if then cov(a + bX, c + dY ) = bd cov(X, Y ) .
a, b, c, d ∈ R
Properties of Correlation Next we will establish some basic properties of correlation. Most of these follow easily from corresponding properties of covariance above. We assume that var(X) > 0 and var(Y ) > 0 , so that the random variable really are random and hence the correlation is well defined. The correlation between X and Y is the covariance of the corresponding standard scores: X − E(X) cor(X, Y ) = cov (
Y − E(Y ) ,
sd(X)
X − E(X)
Y − E(Y )
sd(X)
sd(Y )
) =E( sd(Y )
)
(4.5.8)
Proof
4.5.2
https://stats.libretexts.org/@go/page/10160
This shows again that correlation is dimensionless, since of course, the standard scores are dimensionless. Also, correlation is symmetric: cor(X, Y ) = cor(Y , X)
.
Under a linear transformation of one of the variables, the correlation is unchanged if the slope is positve and changes sign if the slope is negative: If a,
b ∈ R
and b ≠ 0 then
1. cor(a + bX, Y ) = cor(X, Y ) if b > 0 2. cor(a + bX, Y ) = −cor(X, Y ) if b < 0 Proof This result reinforces the fact that correlation is a standardized measure of association, since multiplying the variable by a positive constant is equivalent to a change of scale, and adding a contant to a variable is equivalent to a change of location. For example, in the Challenger data, the underlying variables are temperature at the time of launch (in degrees Fahrenheit) and O-ring erosion (in millimeters). The correlation between these two variables is of fundamental importance. If we decide to measure temperature in degrees Celsius and O-ring erosion in inches, the correlation is unchanged. Of course, the same property holds in the second argument, so if a, b, c, d ∈ R with b ≠ 0 and d ≠ 0 , then cor(a + bX, c + dY ) = cor(X, Y ) if bd > 0 and cor(a + bX, c + dY ) = −cor(X, Y ) if bd < 0 . The most important properties of covariance and correlation will emerge from our study of the best linear predictor below.
The Variance of a Sum We will now show that the variance of a sum of variables is the sum of the pairwise covariances. This result is very useful since many random variables with special distributions can be written as sums of simpler random variables (see in particular the binomial distribution and hypergeometric distribution below). If (X
1,
X2 , … , Xn )
is a sequence of real-valued random variables then n
n
n
n
var ( ∑ Xi ) = ∑ ∑ cov(Xi , Xj ) = ∑ var(Xi ) + 2 i=1
i=1
j=1
i=1
∑
cov(Xi , Xj )
(4.5.10)
{(i,j):i 0 . In statistical terms, the variables form a random sample from the common distribution. 1
For n ∈ N+ , let Y
n
n
1. E (Y ) = nμ 2. var (Y ) = nσ
=∑
i=1
2
.
Xi
n
n
2
Proof For n ∈ N , let M +
n
= Yn /n =
1 n
n
∑
i=1
Xi
, so that M is the sample mean of (X
1,
n
.
X2 , … , Xn )
1. E (M ) = μ 2. var (M ) = σ /n 3. var (M ) → 0 as n → ∞ 4. P (|M − μ| > ϵ) → 0 as n → ∞ for every ϵ > 0 . n
2
n n
n
Proof Part (c) of (17) means that M → μ as n → ∞ in mean square. Part (d) means that M → μ as n → ∞ in probability. These are both versions of the weak law of large numbers, one of the fundamental theorems of probability. n
n
The standard score of the sum Y and the standard score of the sample mean M are the same: n
n
Zn =
Yn − n μ − √n σ
=
Mn − μ
(4.5.16)
− σ/ √n
1. E(Z ) = 0 2. var(Z ) = 1 n
n
Proof The central limit theorem, the other fundamental theorem of probability, states that the distribution of Z converges to the standard normal distribution as n → ∞ . n
Events If A and B are events in our random experiment then the covariance and correlation of A and B are defined to be the covariance and correlation, respectively, of their indicator random variables. If A and B are events, define cov(A, B) = cov(1
A,
1B )
and cor(A, B) = cor(1
A,
1B )
. Equivalently,
1. cov(A, B) = P(A ∩ B) − P(A)P(B) − −−−−−−−−−−−−−−−−−−−−−−− − 2. cor(A, B) = [P(A ∩ B) − P(A)P(B)] /√P(A) [1 − P(A)] P(B) [1 − P(B)] Proof In particular, note that A and B are positively correlated, negatively correlated, or independent, respectively (as defined in the section on conditional probability) if and only if the indicator variables of A and B are positively correlated, negatively correlated, or uncorrelated, as defined in this section. If A and B are events then 1. cov(A, B ) = −cov(A, B) 2. cov(A , B ) = cov(A, B) c
c
c
Proof If A and B are events with A ⊆ B then 1. cov(A, B) = P(A)[1 − P(B)] − −−−−−−−−−−−−−−−−−−−−−−−− −
2. cor(A, B) = √P(A) [1 − P(B)] /P(B) [1 − P(A)]
4.5.4
https://stats.libretexts.org/@go/page/10160
Proof In the language of the experiment, surprising.
A ⊆B
means that
A
implies
B
. In such a case, the events are positively correlated, not
The Best Linear Predictor What linear function of X (that is, a function of the form a + bX where a, b ∈ R ) is closest to Y in the sense of minimizing mean square error? The question is fundamentally important in the case where random variable X (the predictor variable) is observable and random variable Y (the response variable) is not. The linear function can be used to estimate Y from an observed value of X. Moreover, the solution will have the added benefit of showing that covariance and correlation measure the linear relationship between X and Y . To avoid trivial cases, let us assume that var(X) > 0 and var(Y ) > 0 , so that the random variables really are random. The solution to our problem turns out to be the linear function of X with the same expected value as Y , and whose covariance with X is the same as that of Y . The random variable L(Y
∣ X)
defined as follows is the only linear function of X satisfying properties (a) and (b). cov(X, Y ) L(Y ∣ X) = E(Y ) +
[X − E(X)]
(4.5.17)
var(X)
1. E [L(Y ∣ X)] = E(Y ) 2. cov [X, L(Y ∣ X)] = cov(X, Y ) Proof Note that in the presence of part (a), part (b) is equivalent to E [XL(Y ∣ X)] = E(XY ) . Here is another minor variation, but one that will be very useful: L(Y ∣ X) is the only linear function of X with the same mean as Y and with the property that Y − L(Y ∣ X) is uncorrelated with every linear function of X. L(Y ∣ X)
is the only linear function of X that satisfies
1. E [L(Y ∣ X)] = E(Y ) 2. cov [Y − L(Y ∣ X), U ] = 0 for every linear function U of X. Proof The variance of L(Y
∣ X)
and its covariance with Y turn out to be the same.
Additional properties of L(Y 1. var [L(Y 2. cov [L(Y
∣ X)
:
2
∣ X)] = cov (X, Y )/var(X) 2
∣ X), Y ] = cov (X, Y )/var(X)
Proof We can now prove the fundamental result that L(Y ∣ X) is the linear function of X that is closest to Y in the mean square sense. We give two proofs; the first is more straightforward, but the second is more interesting and elegant. Suppose that U is a linear function of X. Then 1. E ([Y
− L(Y ∣ X)]
2
2
) ≤ E [(Y − U ) ]
2. Equality occurs in (a) if and only if U
= L(Y ∣ X)
with probability 1.
Proof from calculus Proof using properties The mean square error when L(Y
∣ X)
is used as a predictor of Y is
E ( [Y − L(Y ∣ X)]
2
2
) = var(Y ) [1 − cor (X, Y )]
(4.5.30)
Proof Our solution to the best linear perdictor problems yields important properties of covariance and correlation.
4.5.5
https://stats.libretexts.org/@go/page/10160
Additional properties of covariance and correlation: 1. −1 ≤ cor(X, Y ) ≤ 1 2. −sd(X)sd(Y ) ≤ cov(X, Y ) ≤ sd(X)sd(Y ) 3. cor(X, Y ) = 1 if and only if, with probability 1, Y is a linear function of X with positive slope. 4. cor(X, Y ) = −1 if and only if, with probability 1, Y is a linear function of X with negative slope. Proof The last two results clearly show that cov(X, Y ) and cor(X, Y ) measure the linear association between X and Y . The equivalent inequalities (a) and (b) above are referred to as the correlation inequality. They are also versions of the Cauchy-Schwarz inequality, named for Augustin Cauchy and Karl Schwarz Recall from our previous discussion of variance that the best constant predictor of Y , in the sense of minimizing mean square error, is E(Y ) and the minimum value of the mean square error for this predictor is var(Y ). Thus, the difference between the variance of Y and the mean square error above for L(Y ∣ X) is the reduction in the variance of Y when the linear term in X is added to the predictor: var(Y ) − E ( [Y − L(Y ∣ X)]
Thus cor (X, Y ) is the proportion of reduction in (distribution) coefficient of determination. Now let 2
var(Y )
2
2
) = var(Y ) cor (X, Y )
(4.5.33)
when X is included as a predictor variable. This quantity is called the
cov(X, Y ) L(Y ∣ X = x) = E(Y ) +
[x − E(X)] ,
x ∈ R
(4.5.34)
var(X)
The function x ↦ L(Y ∣ X = x) is known as the distribution regression function for Y given X, and its graph is known as the distribution regression line. Note that the regression line passes through (E(X), E(Y )), the center of the joint distribution.
Figure 4.5.2 : The distribution regression line
However, the choice of predictor variable and response variable is crucial. The regression line for Y given X and the regression line for X given Y are not the same line, except in the trivial case where the variables are perfectly correlated. However, the coefficient of determination is the same, regardless of which variable is the predictor and which is the response. Proof Suppose that A and B are events with 0 < P(A) < 1 and 0 < P(B) < 1 . Then 1. cor(A, B) = 1 if and only P(A ∖ B) + P(B ∖ A) = 0 . (That is, A and B are equivalent events.) 2. cor(A, B) = −1 if and only P(A ∖ B ) + P(B ∖ A) = 0 . (That is, A and B are equivalent events.) c
c
c
Proof The concept of best linear predictor is more powerful than might first appear, because it can be applied to transformations of the variables. Specifically, suppose that X and Y are random variables for our experiment, taking values in general spaces S and T , respectively. Suppose also that g and h are real-valued functions defined on S and T , respectively. We can find L [h(Y ) ∣ g(X)] , the linear function of g(X) that is closest to h(Y ) in the mean square sense. The results of this subsection apply, of course, with
4.5.6
https://stats.libretexts.org/@go/page/10160
replacing covariances. g(X)
X
and
h(Y )
replacing
Y
. Of course, we must be able to compute the appropriate means, variances, and
We close this subsection with two additional properties of the best linear predictor, the linearity properties. Suppose that X, Y , and Z are random variables and that c is a constant. Then 1. L(Y + Z ∣ X) = L(Y ∣ X) + L(Z ∣ X) 2. L(cY ∣ X) = cL(Y ∣ X) Proof from the definitions Proof by characterizing properties There are several extensions and generalizations of the ideas in the subsection: The corresponding statistical problem of estimating a and b , when these distribution parameters are unknown, is considered in the section on Sample Covariance and Correlation. The problem finding the function of X that is closest to Y in the mean square error sense (using all reasonable functions, not just linear functions) is considered in the section on Conditional Expected Value. The best linear prediction problem when the predictor and response variables are random vectors is considered in the section on Expected Value and Covariance Matrices. The use of characterizing properties will play a crucial role in these extensions.
Examples and Applications Uniform Distributions Suppose that X is uniformly distributed on the interval [−1, 1] and Y a function of X (the strongest form of dependence).
=X
2
. Then X and Y are uncorrelated even though Y is
Proof Suppose that (X, Y ) is uniformly distributed on the region the variables are independent in each of the following cases:
2
S ⊆R
. Find
cov(X, Y )
and
cor(X, Y )
and determine whether
1. S = [a, b] × [c, d] where a < b and c < d , so S is a rectangle. 2. S = {(x, y) ∈ R : −a ≤ y ≤ x ≤ a} where a > 0 , so S is a triangle 3. S = {(x, y) ∈ R : x + y ≤ r } where r > 0 , so S is a circle 2
2
2
2
2
Answer In the bivariate uniform experiment, select each of the regions below in turn. For each region, run the simulation 2000 times and note the value of the correlation and the shape of the cloud of points in the scatterplot. Compare with the results in the last exercise. 1. Square 2. Triangle 3. Circle Suppose that X is uniformly distributed on the interval (0, 1) and that given X = x ∈ (0, 1) , Y is uniformly distributed on the interval (0, x). Find each of the following: 1. cov(X, Y ) 2. cor(X, Y ) 3. L(Y ∣ X) 4. L(X ∣ Y ) Answer
4.5.7
https://stats.libretexts.org/@go/page/10160
Dice Recall that a standard die is a six-sided die. A fair die is one in which the faces are equally likely. An ace-six flat die is a standard die in which faces 1 and 6 have probability each, and faces 2, 3, 4, and 5 have probability each. 1
1
4
8
A pair of standard, fair dice are thrown and the scores (X , X ) recorded. Let Y = X + X denote the sum of the scores, U = min{ X , X } the minimum scores, and V = max{ X , X } the maximum score. Find the covariance and correlation of each of the following pairs of variables: 1
1
2
2
1
1
2
2
1. (X , X ) 2. (X , Y ) 3. (X , U ) 4. (U , V ) 5. (U , Y ) 1
2
1 1
Answer Suppose that n fair dice are thrown. Find the mean and variance of each of the following variables: 1. Y , the sum of the scores. 2. M , the average of the scores. n
n
Answer In the dice experiment, select fair dice, and select the following random variables. In each case, increase the number of dice and observe the size and location of the probability density function and the mean ± standard deviation bar. With n = 20 dice, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation. 1. The sum of the scores. 2. The average of the scores. Suppose that n ace-six flat dice are thrown. Find the mean and variance of each of the following variables: 1. Y , the sum of the scores. 2. M , the average of the scores. n
n
Answer In the dice experiment, select ace-six flat dice, and select the following random variables. In each case, increase the number of dice and observe the size and location of the probability density function and the mean ± standard deviation bar. With n = 20 dice, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation. 1. The sum of the scores. 2. The average of the scores. A pair of fair dice are thrown and the scores (X , X U = min{ X , X } the minimum score, and V = max{ X 1
1
1. L(Y 2. L(U 3. L(V
2)
recorded. Let Y = X + X denote the sum of the scores, the maximum score. Find each of the following: 1
1,
2
2
X2 }
∣ X1 ) ∣ X1 ) ∣ X1 )
Answer
Bernoulli Trials Recall that a Bernoulli trials process is a sequence X = (X , X , …) of independent, identically distributed indicator random variables. In the usual language of reliability, X denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The probability of success p = P(X = 1) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on the Bernoulli Trials explores this process in detail. 1
2
i
i
4.5.8
https://stats.libretexts.org/@go/page/10160
For n ∈ N , the number of successes in the first n trials is Y = ∑ X . Recall that this random variable has the binomial distribution with parameters n and p, which has probability density function f given by n
+
n
fn (y) = (
i=1
n y n−y ) p (1 − p ) , y
i
y ∈ {0, 1, … , n}
(4.5.45)
The mean and variance of Y are n
1. E(Y ) = np 2. var(Y ) = np(1 − p) n
n
Proof In the binomial coin experiment, select the number of heads. Vary n and p and note the shape of the probability density function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation. For n ∈ N , the proportion of successes in the first n trials is M estimator of the parameter p, when the parameter is unknown. +
n
= Yn /n
. This random variable is sometimes used as a statistical
The mean and variance of M are n
1. E(M ) = p 2. var(M ) = p(1 − p)/n n
n
Proof In the binomial coin experiment, select the proportion of heads. Vary n and p and note the shape of the probability density function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation. As a special case of (17) note that M
n
→ p
as n → ∞ in mean square and in probability.
The Hypergeometric Distribution Suppose that a population consists of m objects; r of the objects are type 1 and m − r are type 0. A sample of n objects is chosen at random, without replacement. The parameters m, n ∈ N and r ∈ N with n ≤ m and r ≤ m . For i ∈ {1, 2, … , n}, let X denote the type of the ith object selected. Recall that (X , X , … , X ) is a sequence of identically distributed (but not independent) indicator random variables. +
i
1
2
n
Let Y denote the number of type 1 objects in the sample, so that Y = ∑ hypergeometric distribution, which has probability density function f given by
Xi
. Recall that this random variable has the
y ∈ {0, 1, … , n}
(4.5.46)
n i=1
n
r
m−r
y
n−y
( )( f (y) =
m
(
For distinct i,
n
) ,
)
,
j ∈ {1, 2, … , n}
1. E(X ) = 2. var(X ) = (1 − 3. cov(X , X ) = − 4. cor(X , X ) = − r
i
m
r
i
r
m
m
r
i
i
j
j
m
)
(1 −
r m
)
1 m−1
1
m−1
Proof Note that the event of a type 1 object on draw i and the event of a type 1 object on draw j are negatively correlated, but the correlation depends only on the population size and not on the number of type 1 objects. Note also that the correlation is perfect if m = 2 . Think about these result intuitively. The mean and variance of Y are 1. E(Y ) = n
r m
4.5.9
https://stats.libretexts.org/@go/page/10160
2. var(Y ) = n
r m
(1 −
r m
)
m−n m−1
Proof Note that if the sampling were with replacement, Y would have a binomial distribution, and so in particular E(Y ) = n and var(Y ) = n (1 − ) . The additional factor that occurs in the variance of the hypergeometric distribution is sometimes called the finite population correction factor. Note that for fixed m, is decreasing in n , and is 0 when n = m . Of course, we know that we must have var(Y ) = 0 if n = m , since we would be sampling the entire population, and so deterministically, Y = r . On the other hand, for fixed n , → 1 as m → ∞ . More generally, the hypergeometric distribution is well approximated by the binomial when the population size m is large compared to the sample size n . These ideas are discussed more fully in the section on the hypergeometric distribution in the chapter on Finite Sampling Models. r
m
r
r
m−n
m
m
m−1
m−n m−1
m−n m−1
In the ball and urn experiment, select sampling without replacement. Vary m, r, and n and note the shape of the probability density function and the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.
Exercises on Basic Properties Suppose that X and Y are real-valued random variables with cov(X, Y ) = 3 . Find cov(2X − 5, 4Y
.
+ 2)
Answer Suppose X and Y are real-valued random variables with var(X) = 5 , var(Y ) = 9 , and cov(X, Y ) = −3 . Find 1. cor(X, Y ) 2. var(2X + 3Y − 7) 3. cov(5X + 2Y − 3, 3X − 4Y + 2) 4. cor(5X + 2Y − 3, 3X − 4Y + 2) Answer Suppose that
and .
X
Y
are independent, real-valued random variables with
var(X) = 6
and
var(Y ) = 8
.
Find
var(3X − 4Y + 5)
Answer Suppose that following:
A
and
B
are events in an experiment with
P(A) =
1 2
,
P(B) =
1 3
, and
P(A ∩ B) =
1 8
. Find each of the
1. cov(A, B) 2. cor(A, B) Answer Suppose that
X
,
Y
L(Z ∣ X) = 5 + 4X
, and Z are real-valued random variables for an experiment, and that . Find L(6Y − 2Z ∣ X) .
L(Y ∣ X) = 2 − 3X
and
Answer Suppose that
X
and Y are real-valued random variables for an experiment, and that . Find each of the following:
E(X) = 3
,
var(X) = 4
, and
L(Y ∣ X) = 5 − 2X
1. E(Y ) 2. cov(X, Y ) Answer
Simple Continuous Distributions Suppose that (X, Y ) has probability density function f given by f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 . Find each of the following 1. cov(X, Y )
4.5.10
https://stats.libretexts.org/@go/page/10160
2. cor(X, Y ) 3. L(Y ∣ X) 4. L(X ∣ Y ) Answer Suppose that following:
(X, Y )
has probability density function
f
given by
for
f (x, y) = 2(x + y)
0 ≤x ≤y ≤1
. Find each of the
1. cov(X, Y ) 2. cor(X, Y ) 3. L(Y ∣ X) 4. L(X ∣ Y ) Answer Suppose again that (X, Y ) has probability density function f given by f (x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1 . 1. Find cov (X , Y ). 2. Find cor (X , Y ). 3. Find L (Y ∣ X ) . 4. Which predictor of Y is better, the one based on X or the one based on X ? 2
2
2
2
Answer Suppose that (X, Y ) has probability density function f given by f (x, y) = 6x following:
2
y
for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 . Find each of the
1. cov(X, Y ) 2. cor(X, Y ) 3. L(Y ∣ X) 4. L(X ∣ Y ) Answer Suppose that following:
(X, Y )
has probability density function
f
given by
for
2
f (x, y) = 15 x y
0 ≤x ≤y ≤1
. Find each of the
1. cov(X, Y ) 2. cor(X, Y ) 3. L(Y ∣ X) 4. L(X ∣ Y ) Answer Suppose again that (X, Y ) has probability density function f given by f (x, y) = 15x
2
y
for 0 ≤ x ≤ y ≤ 1 .
− − cov (√X , Y )
1. Find . − − 2. Find cor (√X , Y ). − − 3. Find L (Y ∣ √X ) . − − 4. Which of the predictors of Y is better, the one based on X of the one based on √X ? Answer This page titled 4.5: Covariance and Correlation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.5.11
https://stats.libretexts.org/@go/page/10160
4.6: Generating Functions As usual, our starting point is a random experiment modeled by a probability sace (Ω, F , P). A generating function of a realvalued random variable is an expected value of a certain transformation of the random variable involving another (deterministic) variable. Most generating functions share four important properties: 1. Under mild conditions, the generating function completely determines the distribution of the random variable. 2. The generating function of a sum of independent variables is the product of the generating functions 3. The moments of the random variable can be obtained from the derivatives of the generating function. 4. Ordinary (pointwise) convergence of a sequence of generating functions corresponds to the special convergence of the corresponding distributions. Property 1 is perhaps the most important. Often a random variable is shown to have a certain distribution by showing that the generating function has a certain form. The process of recovering the distribution from the generating function is known as inversion. Property 2 is frequently used to determine the distribution of a sum of independent variables. By contrast, recall that the probability density function of a sum of independent variables is the convolution of the individual density functions, a much more complicated operation. Property 3 is useful because often computing moments from the generating function is easier than computing the moments directly from the probability density function. The last property is known as the continuity theorem. Often it is easer to show the convergence of the generating functions than to prove convergence of the distributions directly. The numerical value of the generating function at a particular value of the free variable is of no interest, and so generating functions can seem rather unintuitive at first. But the important point is that the generating function as a whole encodes all of the information in the probability distribution in a very useful way. Generating functions are important and valuable tools in probability, as they are in other areas of mathematics, from combinatorics to differential equations. We will study the three generating functions in the list below, which correspond to increasing levels of generality. The fist is the most restrictive, but also by far the simplest, since the theory reduces to basic facts about power series that you will remember from calculus. The third is the most general and the one for which the theory is most complete and elegant, but it also requires basic knowledge of complex analysis. The one in the middle is perhaps the one most commonly used, and suffices for most distributions in applied probability. 1. the probability generating function 2. the moment generating function 3. the characteristic function We will also study the characteristic function for multivariate distributions, although analogous results hold for the other two types. In the basic theory below, be sure to try the proofs yourself before reading the ones in the text.
Basic Theory The Probability Generating Function For our first generating function, assume that N is a random variable taking values in N. The probability generating function P of N is defined by N
P (t) = E (t
)
(4.6.1)
for all t ∈ R for which the expected value exists in R. N
That is, P (t) is defined when E (|t|
) 1 . Then P orders.
(k)
(1) = E [ N
(k)
]
for k ∈ N . In particular, N has finite moments of all
Proof Suppose again that r > 1 . Then 1. E(N ) = P (1) 2. var(N ) = P (1) + P ′
′′
′
′
(1) [1 − P (1)]
Proof Suppose that N and N are independent random variables taking values in N, with probability generating functions P and P having radii of convergence r and r , respectively. Then the probability generating function P of N + N is given by P (t) = P (t)P (t) for |t| < r ∧ r . 1
2
1
2
1
1
2
1
2
1
2
2
Proof
The Moment Generating Function Our next generating function is defined more generally, so in this discussion we assume that the random variables are real-valued. The moment generating function of X is the function M defined by M (t) = E (e
tX
),
t ∈ R
(4.6.7)
Note that since e ≥ 0 with probability 1, M (t) exists, as a real number or ∞, for any t ∈ R . But as we will see, our interest will be in the domain where M (t) < ∞ . tX
Suppose that X has a continuous distribution on R with probability density function f . Then
4.6.2
https://stats.libretexts.org/@go/page/10161
∞
M (t) = ∫
e
tx
f (x) dx
(4.6.8)
−∞
Proof Thus, the moment generating function of X is closely related to the Laplace transform of the probability density function f . The Laplace transform is named for Pierre Simon Laplace, and is widely used in many areas of applied mathematics, particularly differential equations. The basic inversion theorem for moment generating functions (similar to the inversion theorem for Laplace transforms) states that if M (t) < ∞ for t in an open interval about 0, then M completely determines the distribution of X. Thus, if two distributions on R have moment generating functions that are equal (and finite) in an open interval about 0, then the distributions are the same. Suppose that orders and
X
has moment generating function
M
that is finite in an open interval ∞
E (X
n
M (t) = ∑ n=0
)
n
t ,
I
about 0. Then
X
has moments of all
t ∈ I
(4.6.9)
n!
Proof So under the finite assumption in the last theorem, the moment generating function, like the probability generating function, is a power series in t . Suppose again that for n ∈ N
X
has moment generating function M that is finite in an open interval about 0. Then M
(n)
(0) = E (X
n
)
Proof Thus, the derivatives of the moment generating function at 0 determine the moments of the variable (hence the name). In the language of combinatorics, the moment generating function is the exponential generating function of the sequence of moments. Thus, a random variable that does not have finite moments of all orders cannot have a finite moment generating function. Even when a random variable does have moments of all orders, the moment generating function may not exist. A counterexample is constructed below. For nonnegative random variables (which are very common in applications), the domain where the moment generating function is finite is easy to understand. Suppose that X takes values in [0, ∞) and has moment generating function M . If M (t) < ∞ for t ∈ R then M (s) < ∞ for s ≤t. Proof So for a nonnegative random variable, either M (t) < ∞ for all t ∈ R or there exists r ∈ (0, ∞) such that M (t) < ∞ for t < r . Of course, there are complementary results for non-positive random variables, but such variables are much less common. Next we consider what happens to the moment generating function under some simple transformations of the random variables. Suppose that X has moment generating function M and that a, given by N (t) = e M (bt) for t ∈ R .
b ∈ R
. The moment generating function N of Y
= a + bX
is
at
Proof Recall that if a ∈ R and b ∈ (0, ∞) then the transformation a + bX is a location-scale transformation on the distribution of X, with location parameter a and scale parameter b . Location-scale transformations frequently arise when units are changed, such as length changed from inches to centimeters or temperature from degrees Fahrenheit to degrees Celsius. Suppose that X and X are independent random variables with moment generating functions moment generating function M of Y = X + X is given by M (t) = M (t)M (t) for t ∈ R . 1
2
1
2
1
M1
and
M2
respectively. The
2
Proof The probability generating function of a variable can easily be converted into the moment generating function of the variable.
4.6.3
https://stats.libretexts.org/@go/page/10161
Suppose that X is a random variable taking values in N with probability generating function G having radius of convergence r . The moment generating function M of X is given by M (t) = G (e ) for t < ln(r) . t
Proof The following theorem gives the Chernoff bounds, named for the mathematician Herman Chernoff. These are upper bounds on the tail events of a random variable. If X has moment generating function M then 1. P(X ≥ x) ≤ e 2. P(X ≤ x) ≤ e
−tx −tx
M (t) M (t)
for t > 0 for t < 0
Proof Naturally, the best Chernoff bound (in either (a) or (b)) is obtained by finding t that minimizes e
−tx
M (t)
.
The Characteristic Function Our last generating function is the nicest from a mathematical point of view. Once again, we assume that our random variables are real-valued. The characteristic function of X is the function χ defined by by χ(t) = E (e
itX
) = E [cos(tX)] + iE [sin(tX)] ,
t ∈ R
(4.6.12)
Note that χ is a complex valued function, and so this subsection requires some basic knowledge of complex analysis. The function ∣ χ is defined for all t ∈ R because the random variable in the expected value is bounded in magnitude. Indeed, ∣ ∣e ∣ = 1 for all t ∈ R . Many of the properties of the characteristic function are more elegant than the corresponding properties of the probability or moment generating functions, because the characteristic function always exists. itX
If X has a continuous distribution on R with probability density function f and characteristic function χ then ∞
χ(t) = ∫
e
itx
f (x)dx,
t ∈ R
(4.6.13)
−∞
Proof Thus, the characteristic function of X is closely related to the Fourier transform of the probability density function f . The Fourier transform is named for Joseph Fourier, and is widely used in many areas of applied mathematics. As with other generating functions, the characteristic function completely determines the distribution. That is, random variables X and Y have the same distribution if and only if they have the same characteristic function. Indeed, the general inversion formula given next is a formula for computing certain combinations of probabilities from the characteristic function. Suppose again that X has characteristic function χ. If a, n
e
−iat
−e
∫ −n
b ∈ R
and a < b then
−ibt
1 χ(t) dt → P(a < X < b) +
2πit
[P(X = b) − P(X = a)] as n → ∞
(4.6.14)
2
The probability combinations on the right side completely determine the distribution of continuous distributions:
X
. A special inversion formula holds for
Suppose that X has a continuous distribution with probability density function f and characteristic function χ. At every point x ∈ R where f is differentiable, ∞
1 f (x) =
∫ 2π
e
−itx
χ(t) dt
(4.6.15)
−∞
This formula is essentially the inverse Fourrier transform. As with the other generating functions, the characteristic function can be used to find the moments of X. Moreover, this can be done even when only some of the moments exist.
4.6.4
https://stats.libretexts.org/@go/page/10161
Suppose again that X has characteristic function χ. If n ∈ N
+
n
and E (|X
E (X
k
)
χ(t) = ∑ k=0
and therefore χ
(n)
n
(0) = i E (X
n
)
k
(it)
n
|) < ∞
. Then
n
+ o(t )
(4.6.16)
k!
.
Details Next we consider how the characteristic function is changed under some simple transformations of the variables. Suppose that ψ(t) = e
iat
X
has characteristic function for t ∈ R .
χ
and that
a, b ∈ R
. The characteristic function
ψ
of
Y = a + bX
is given by
χ(bt)
Proof Suppose that X and X are independent random variables with characteristic functions characteristic function χ of Y = X + X is given by χ(t) = χ (t)χ (t) for t ∈ R . 1
2
1
2
1
χ1
and
χ2
respectively. The
2
Proof The characteristic function of a random variable can be obtained from the moment generating function, under the basic existence condition that we saw earlier. Suppose that X has moment generating function M that satisfies characteristic function χ of X satisfies χ(t) = M (it) for t ∈ I .
M (t) < ∞
for
in an open interval
t
I
about 0. Then the
The final important property of characteristic functions that we will discuss relates to convergence in distribution. Suppose that (X , X , …) is a sequence of real-valued random with characteristic functions (χ , χ , …) respectively. Since we are only concerned with distributions, the random variables need not be defined on the same probability space. 1
2
1
2
The Continuity Theorem 1. If the distribution of X converges to the distribution of a random variable X as n → ∞ and X has characteristic function χ, then χ (t) → χ(t) as n → ∞ for all t ∈ R . 2. Conversely, if χ (t) converges to a function χ(t) as n → ∞ for t in an open interval about 0, and if χ is continuous at 0, then χ is the characteristic function of a random variable X, and the distribution of X converges to the distribution of X as n → ∞ . n
n
n
n
There are analogous versions of the continuity theorem for probability generating functions and moment generating functions. The continuity theorem can be used to prove the central limit theorem, one of the fundamental theorems of probability. Also, the continuity theorem has a straightforward generalization to distributions on R . n
The Joint Characteristic Function All of the generating functions that we have discussed have multivariate extensions. However, we will discuss the extension only for the characteristic function, the most important and versatile of the generating functions. There are analogous results for the other generating functions. So in this discussion, we assume that (X, Y ) is a random vector for our experiment, taking values in R . 2
The (joint) characteristic function χ of (X, Y ) is defined by χ(s, t) = E [exp(isX + itY )] ,
2
(s, t) ∈ R
(4.6.18)
Once again, the most important fact is that χ completely determines the distribution: two random vectors taking values in R have the same characteristic function if and only if they have the same distribution. 2
The joint moments can be obtained from the derivatives of the characteristic function. Suppose that (X, Y ) has characteristic function χ. If m,
n ∈ N
and E (|X
4.6.5
m
Y
n
|) < ∞
then
https://stats.libretexts.org/@go/page/10161
(m,n)
χ
(0, 0) = e
i (m+n)
E (X
m
Y
n
)
(4.6.19)
The marginal characteristic functions and the characteristic function of the sum can be easily obtained from the joint characteristic function: Suppose again that (X, Y ) has characteristic function χ, and let and X + Y , respectively. For t ∈ R
χ1
, χ , and 2
χ+
denote the characteristic functions of
X
,
Y
,
1. χ(t, 0) = χ 2. χ(0, t) = χ 3. χ(t, t) = χ
1 (t) 2 (t)
+ (t)
Proof Suppose again that χ , χ , and χ are the characteristic functions of independent if and only if χ(s, t) = χ (s)χ (t) for all (s, t) ∈ R . 1
2
X
,
Y
, and
(X, Y )
respectively. Then
X
and
Y
are
2
1
2
Naturally, the results for bivariate characteristic functions have analogies in the general multivariate case. Only the notation is more complicated.
Examples and Applications As always, be sure to try the computational problems yourself before expanding the solutions and answers in the text.
Dice Recall that an ace-six flat die is a six-sided die for which faces numbered 1 and 6 have probability each, while faces numbered 2, 3, 4, and 5 have probability each. Similarly, a 3-4 flat die is a six-sided die for which faces numbered 3 and 4 have probability each, while faces numbered 1, 2, 5, and 6 have probability each. 1 4
1
1
8
4
1 8
Suppose that an ace-six flat die and a 3-4 flat die are rolled. Use probability generating functions to find the probability density function of the sum of the scores. Solution Two fair, 6-sided dice are rolled. One has faces numbered (0, 1, 2, 3, 4, 5) and the other has faces numbered (0, 6, 12, 18, 24, 30) . Use probability generating functions to find the probability density function of the sum of the scores, and identify the distribution. Solution Suppose that random variable Y has probability generating function P given by 2 P (t) = (
3 t+
5
2
t 10
1 +
3
t 5
1 +
5 4
t )
,
t ∈ R
(4.6.22)
10
1. Interpret Y in terms of rolling dice. 2. Use the probability generating function to find the first two factorial moments of Y . 3. Use (b) to find the variance of Y . Answer
Bernoulli Trials Suppose X is an indicator random variable with generating function P (t) = 1 − p + pt for t ∈ R .
p = P(X = 1)
, where
p ∈ [0, 1]
is a parameter. Then
X
has probability
Proof Recall that a Bernoulli trials process is a sequence (X , X , …) of independent, identically distributed indicator random variables. In the usual language of reliability, X denotes the outcome of trial i, where 1 denotes success and 0 denotes failure. The 1
2
i
4.6.6
https://stats.libretexts.org/@go/page/10161
probability of success p = P(X = 1) is the basic parameter of the process. The process is named for Jacob Bernoulli. A separate chapter on the Bernoulli Trials explores this process in more detail. i
For n ∈ N , the number of successes in the first n trials is Y = ∑ X . Recall that this random variable has the binomial distribution with parameters n and p, which has probability density function f given by n
+
n
i=1
i
n
fn (y) = (
n y n−y ) p (1 − p ) ,
y ∈ {0, 1, … , n}
(4.6.23)
y
Random variable Y has probability generating function P given by P n
n (t)
n
n
= (1 − p + pt)
for t ∈ R .
Proof Rando variable Y has the following parameters: n
1. E [Y
(k)
n
(k)
] =n
k
p
2. E (Y ) = np 3. var (Y ) = np(1 − p) 4. P(Y is even) = [1 − (1 − 2p) n
n
1
n
n
2
]
Proof Suppose that U has the binomial distribution with parameters m ∈ N parameters n ∈ N and q ∈ [0, 1], and that U and V are independent.
+
and
,
p ∈ [0, 1]
V
has the binomial distribution with
+
1. If p = q then U + V has the binomial distribution with parameters m + n and p. 2. If p ≠ q then U + V does not have a binomial distribution. Proof Suppose now that p ∈ (0, 1]. The trial number N of the first success in the sequence of Bernoulli trials has the geometric distribution on N with success parameter p. The probability density function h is given by +
n−1
h(n) = p(1 − p )
,
n ∈ N+
(4.6.24)
The geometric distribution is studied in more detail in the chapter on Bernoulli trials. Let Q denote the probability generating function of N . Then 1. Q(t) = 2. E [ N
pt 1−(1−p)t
for − k−1
(k)
(1−p)
] = k!
3. E(N ) =
pk
1 1−p
0 . Then P(N ≥ n) ≤ e
n−a
a (
n
) ,
n >a
(4.6.28)
n
Proof The following theorem gives an important convergence result that is explored in more detail in the chapter on the Poisson process. Suppose that p ∈ (0, 1) for n ∈ N and that np → a ∈ (0, ∞) as n → ∞ . Then the binomial distribution with parameters n and p converges to the Poisson distribution with parameter a as n → ∞ . n
+
n
n
Proof
The Exponential Distribution Recall that the exponential distribution is a continuous distribution on [0, ∞) with probability density function f given by f (t) = re
−rt
,
t ∈ (0, ∞)
(4.6.31)
where r ∈ (0, ∞) is the rate parameter. This distribution is widely used to model failure times and other random times, and in particular governs the time between arrivals in the Poisson model. The exponential distribution is studied in more detail in the chapter on the Poisson Process. Suppose that T has the exponential distribution with rate parameter function of T . Then
r ∈ (0, ∞)
and let
M
denote the moment generating
1. M (s) = for s ∈ (−∞, r) . 2. E(T ) = n!/r for n ∈ N r
r−s
n
n
Proof Suppose that (T , T , …) is a sequence of independent random variables, each having the exponential distribution with rate parameter r ∈ (0, ∞). For n ∈ N , the moment generating function M of U = ∑ T is given by 1
2
n
+
n
i=1
i
n
r Mn (s) = (
n
) ,
s ∈ (−∞, r)
(4.6.32)
r−s
Proof Random variable U has the Erlang distribution with shape parameter n and rate parameter r, named for Agner Erlang. This distribution governs the n th arrival time in the Poisson model. The Erlang distribution is a special case of the gamma distribution and is studied in more detail in the chapter on the Poisson Process. n
4.6.8
https://stats.libretexts.org/@go/page/10161
Uniform Distributions Suppose that a, b ∈ R and function f given by
a 0 n
a
a−n
n
) =∞
M
denote the moment generating function of
X
.
if n ≥ a
Proof On the other hand, like all distributions on R, the Pareto distribution has a characteristic function. However, the characteristic function of the Pareto distribution does not have a simple, closed form.
The Cauchy Distribution Recall that the (standard) Cauchy distribution is a continuous distribution on R with probability density function f given by 1 f (x) =
2
,
x ∈ R
(4.6.42)
π (1 + x )
and is named for Augustin Cauchy. The Cauch distribution is studied in more generality in the chapter on Special Distributions. The graph of f is known as the Witch of Agnesi, named for Maria Agnesi. Suppose that X has the standard Cauchy distribution, and let M denote the moment generating function of X. Then 1. E(X) does not exist. 2. M (t) = ∞ for t ≠ 0 . Proof Once again, all distributions on R have characteristic functions, and the standard Cauchy distribution has a particularly simple one. Let χ denote the characteristic function of X. Then χ(t) = e
−|t|
for t ∈ R .
Proof
Counterexample For the Pareto distribution, only some of the moments are finite; so course, the moment generating function cannot be finite in an interval about 0. We will now give an example of a distribution for which all of the moments are finite, yet still the moment
4.6.10
https://stats.libretexts.org/@go/page/10161
generating function is not finite in any interval about 0. Furthermore, we will see two different distributions that have the same moments of all orders. Suppose that Z has the standard normal distribution and let X = e . The distribution of X is known as the (standard) lognormal distribution. The lognormal distribution is studied in more generality in the chapter on Special Distributions. This distribution has finite moments of all orders, but infinite moment generating function. Z
X
has probability density function f given by f (x) =
1. E (X 2. E (e
n
tX
1
) =e
2
1 1 2 ln (x)), − − exp(− 2 √2πx
x >0
(4.6.43)
2
for n ∈ N . for t > 0 .
n
) =∞
Proof Next we construct a different distribution with the same moments as X. Let h be the function defined by for x > 0 . Then
h(x) = sin(2π ln x)
for x > 0 and let
g
be the function defined by
g(x) = f (x) [1 + h(x)]
1. g is a probability density function. 2. If Y has probability density function g then E (Y
n
1
) =e
2
2
n
for n ∈ N
Proof
Figure 4.6.1 : The graphs of f and g , probability density functions for two distributions with the same moments of all orders. This page titled 4.6: Generating Functions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.6.11
https://stats.libretexts.org/@go/page/10161
4.7: Conditional Expected Value As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). Suppose next that X is a random variable taking values in a set S and that Y is a random variable taking values in T ⊆ R . We assume that either Y has a discrete distribution, so that T is countable, or that Y has a continuous distribution so that T is an interval (or perhaps a union of intervals). In this section, we will study the conditional expected value of Y given X, a concept of fundamental importance in probability. As we will see, the expected value of Y given X is the function of X that best approximates Y in the mean square sense. Note that X is a general random variable, not necessarily real-valued, but as usual, we will assume that either X has a discrete distribution, so that S is countable or that X has a continuous distribution on S ⊆ R for some n ∈ N . In the latter case, S is typically a region defined by inequalites involving elementary functions. We will also assume that all expected values that are mentioned exist (as real numbers). n
+
Basic Theory Definitions Note that we can think of (X, Y ) as a random variable that takes values in the Cartesian product set S × T . We need recall some basic facts from our work with joint distributions and conditional distributions. We assume that (X, Y ) has joint probability density function f and we let g denote the (marginal) probability density function X. Recall that if Y has a discrte distribution then g(x) = ∑ f (x, y),
x ∈ S
(4.7.1)
y∈T
and if Y has a continuous distribution then g(x) = ∫
f (x, y) dy,
x ∈ S
(4.7.2)
T
In either case, for x ∈ S , the conditional probability density function of Y given X = x is defined by f (x, y) h(y ∣ x) =
,
y ∈ T
(4.7.3)
g(x)
We are now ready for the basic definitions: For x ∈ S , the conditional expected value of Y given distribution. So if Y has a discrete distribution then
X =x ∈ S
is simply the mean computed relative to the conditional
E(Y ∣ X = x) = ∑ yh(y ∣ x),
x ∈ S
(4.7.4)
y∈T
and if Y has a continuous distribution then E(Y ∣ X = x) = ∫
yh(y ∣ x) dy,
x ∈ S
(4.7.5)
T
1. The function v : S → R defined by v(x) = E(Y ∣ X = x) for x ∈ S is the regression function of Y based on X. 2. The random variable v(X) is called the conditional expected value of Y given X and is denoted E(Y ∣ X) . Intuitively, we treat X as known, and therefore not random, and we then average Y with respect to the probability distribution that remains. The advanced section on conditional expected value gives a much more general definition that unifies the definitions given here for the various distribution types.
4.7.1
https://stats.libretexts.org/@go/page/10163
Properties The most important property of the random variable E(Y ∣ X) is given in the following theorem. In a sense, this result states that E(Y ∣ X) behaves just like Y in terms of other functions of X, and is essentially the only function of X with this property. The fundamental property 1. E [r(X)E(Y ∣ X)] = E [r(X)Y ] for every function r : S → R . 2. If u : S → R satisfies E[r(X)u(X)] = E[r(X)Y ] for every r : S → R then P [u(X) = E(Y
∣ X)] = 1
.
Proof Two random variables that are equal with probability 1 are said to be equivalent. We often think of equivalent random variables as being essentially the same object, so the fundamental property above essentially characterizes E(Y ∣ X) . That is, we can think of E(Y ∣ X) as any random variable that is a function of X and satisfies this property. Moreover the fundamental property can be used as a definition of conditional expected value, regardless of the type of the distribution of (X, Y ). If you are interested, read the more advanced treatment of conditional expected value. Suppose that X is also real-valued. Recall that the best linear predictor of Y based on X was characterized by property (a), but with just two functions: r(x) = 1 and r(x) = x . Thus the characterization in the fundamental property is certainly reasonable, since (as we show below) E(Y ∣ X) is the best predictor of Y among all functions of X, not just linear functions. The basic property is also very useful for establishing other properties of conditional expected value. Our first consequence is the fact that Y and E(Y ∣ X) have the same mean. E [E(Y ∣ X)] = E(Y )
.
Proof Aside from the theoretical interest, this theorem is often a good way to compute E(Y ) when we know the conditional distribution of Y given X. We say that we are computing the expected value of Y by conditioning on X. For many basic properties of ordinary expected value, there are analogous results for conditional expected value. We start with two of the most important: every type of expected value must satisfy two critical properties: linearity and monotonicity. In the following two theorems, the random variables Y and Z are real-valued, and as before, X is a general random variable. Linear Properties 1. E(Y + Z ∣ X) = E(Y ∣ X) + E(Z ∣ X) . 2. E(c Y ∣ X) = c E(Y ∣ X) Proof Part (a) is the additive property and part (b) is the scaling property. The scaling property will be significantly generalized below in (8). Positive and Increasing Properties 1. If Y ≥ 0 then E(Y ∣ X) ≥ 0 . 2. If Y ≤ Z then E(Y ∣ X) ≤ E(Z ∣ X) . 3. |E(Y ∣ X)| ≤ E (|Y | ∣ X) Proof Our next few properties relate to the idea that restatement of the fundamental property. If r : S → R , then Y
− E(Y ∣ X)
E(Y ∣ X)
is the expected value of
Y
given
X
. The first property is essentially a
and r(X) are uncorrelated.
Proof The next result states that any (deterministic) function of respect to X.
X
acts like a constant in terms of the conditional expected value with
If s : S → R then
4.7.2
https://stats.libretexts.org/@go/page/10163
E [s(X) Y ∣ X] = s(X) E(Y ∣ X)
(4.7.12)
Proof The following rule generalizes theorem (8) and is sometimes referred to as the substitution rule for conditional expected value. If s : S × T
→ R
then E [s(X, Y ) ∣ X = x] = E [s(x, Y ) ∣ X = x]
(4.7.14)
In particular, it follows from (8) that E[s(X) ∣ X] = s(X) . At the opposite extreme, we have the next result: If X and Y are independent, then knowledge of X gives no information about Y and so the conditional expected value with respect to X reduces to the ordinary (unconditional) expected value of Y . If X and Y are independent then E(Y ∣ X) = E(Y )
(4.7.15)
Proof Suppose now that Z is real-valued and that X and Y are random variables (all defined on the same probability space, of course). The following theorem gives a consistency condition of sorts. Iterated conditional expected values reduce to a single conditional expected value with respect to the minimum amount of information. For simplicity, we write E(Z ∣ X, Y ) rather than E [Z ∣ (X, Y )] . Consistency 1. E [E(Z ∣ X, Y ) ∣ X] = E(Z ∣ X) 2. E [E(Z ∣ X) ∣ X, Y ] = E(Z ∣ X) Proof Finally we show that E(Y Y in its relations with X.
∣ X)
has the same covariance with X as does Y , not surprising since again, E(Y
cov [X, E(Y ∣ X)] = cov(X, Y )
∣ X)
behaves just like
.
Proof
Conditional Probability The conditional probability of an event A , given random variable X (as above), can be defined as a special case of the conditional expected value. As usual, let 1 denote the indicator random variable of A . A
If A is an event, defined P(A ∣ X) = E (1A ∣ X)
(4.7.17)
Here is the fundamental property for conditional probability: The fundamental property 1. E [r(X)P(A ∣ X)] = E [r(X)1 ] for every function r : S → R . 2. If u : S → R and u(X) satisfies E[r(X)u(X)] = E [r(X)1 ] for every function r : S → R , then P [u(X) = P(A ∣ X)] = 1 . A
A
For example, suppose that X has a discrete distribution on a countable set S with probability density function g . Then (a) becomes ∑ r(x)P(A ∣ X = x)g(x) = ∑ r(x)P(A, X = x) x∈S
(4.7.18)
x∈S
But this is obvious since P(A ∣ X = x) = P(A, X = x)/P(X = x) and distribution on S ⊆ R then (a) states that
g(x) = P(X = x)
. Similarly, if
X
has a continuous
n
4.7.3
https://stats.libretexts.org/@go/page/10163
E [r(X)1A ] = ∫
r(x)P(A ∣ X = x)g(x) dx
(4.7.19)
S
The properties above for conditional expected value, of course, have special cases for conditional probability. P(A) = E [P(A ∣ X)]
.
Proof Again, the result in the previous exercise is often a good way to compute P(A) when we know the conditional probability of A given X. We say that we are computing the probability of A by conditioning on X. This is a very compact and elegant version of the conditioning result given first in the section on Conditional Probability in the chapter on Probability Spaces and later in the section on Discrete Distributions in the Chapter on Distributions. The following result gives the conditional version of the axioms of probability. Axioms of probability 1. P(A ∣ X) ≥ 0 for every event A . 2. P(Ω ∣ X) = 1 3. If {A : i ∈ I } is a countable collection of disjoint events then P (⋃ i
i∈I
Ai ∣ ∣ X) = ∑i∈I P(Ai ∣ X)
.
Details From the last result, it follows that other standard probability rules hold for conditional probability given X. These results include the complement rule the increasing property Boole's inequality Bonferroni's inequality the inclusion-exclusion laws
The Best Predictor The next result shows that, of all functions of X, E(Y ∣ X) is closest to Y , in the sense of mean square error. This is fundamentally important in statistical problems where the predictor vector X can be observed but not the response variable Y . In this subsection and the next, we assume that the real-valued random variables have finite variance. If u : S → R , then 1. E ([E(Y
∣ X) − Y ]
2
) ≤ E ([u(X) − Y ]
2
)
2. Equality holds in (a) if and only if u(X) = E(Y
∣ X)
with probability 1.
Proof Suppose now that X is real-valued. In the section on covariance and correlation, we found that the best linear predictor of Y given X is cov(X, Y ) L(Y ∣ X) = E(Y ) +
[X − E(X)]
(4.7.23)
var(X)
On the other hand, E(Y ∣ X) is the best predictor of Y among all functions of X. It follows that if E(Y ∣ X) happens to be a linear function of X then it must be the case that E(Y ∣ X) = L(Y ∣ X) . However, we will give a direct proof also: If E(Y
∣ X) = a + bX
for constants a and b then E(Y
∣ X) = L(Y ∣ X)
; that is,
1. b = cov(X, Y )/var(X) 2. a = E(Y ) − E(X)cov(X, Y )/var(X) Proof
4.7.4
https://stats.libretexts.org/@go/page/10163
Conditional Variance The conditional variance of Y given X is defined like the ordinary variance, but with all expected values conditioned on X. The conditional variance of Y given X is defined as var(Y ∣ X) = E ( [Y − E(Y ∣ X)]
2
∣ ∣ X) ∣
(4.7.24)
Thus, var(Y ∣ X) is a function of X, and in particular, is a random variable. Our first result is a computational formula that is analogous to the one for standard variance—the variance is the mean of the square minus the square of the mean, but now with all expected values conditioned on X: var(Y ∣ X) = E (Y
2
∣ X) − [E(Y ∣ X)]
2
.
Proof Our next result shows how to compute the ordinary variance of Y by conditioning on X. var(Y ) = E [var(Y ∣ X)] + var [E(Y ∣ X)]
.
Proof Thus, the variance of Y is the expected conditional variance plus the variance of the conditional expected value. This result is often a good way to compute var(Y ) when we know the conditional distribution of Y given X. With the help of (21) we can give a formula for the mean square error when E(Y ∣ X) is used a predictor of Y . Mean square error E ( [Y − E(Y ∣ X)]
2
) = var(Y ) − var [E(Y ∣ X)]
(4.7.27)
Proof Let us return to the study of predictors of the real-valued random variable Y , and compare the three predictors we have studied in terms of mean square error. Suppose that Y is a real-valued random variable. 1. The best constant predictor of Y is E(Y ) with mean square error var(Y ). 2. If X is another real-valued random variable, then the best linear predictor of Y given X is cov(X, Y ) L(Y ∣ X) = E(Y ) +
[X − E(X)]
(4.7.29)
var(X)
with mean square error var(Y ) [1 − cor (X, Y )] . 3. If X is a general random variable, then the best overall predictor of Y given X is E(Y var(Y ) − var [E(Y ∣ X)] . 2
∣ X)
with mean square error
Conditional Covariance Suppose that Y and Z are real-valued random variables, and that X is a general random variable, all defined on our underlying probability space. Analogous to variance, the conditional covariance of Y and Z given X is defined like the ordinary covariance, but with all expected values conditioned on X. The conditional covariance of Y and Z given X is defined as ∣ cov(Y , Z ∣ X) = E ([Y − E(Y ∣ X)][Z − E(Z ∣ X) ∣ X) ∣
(4.7.30)
Thus, cov(Y , Z ∣ X) is a function of X, and in particular, is a random variable. Our first result is a computational formula that is analogous to the one for standard covariance—the covariance is the mean of the product minus the product of the means, but now
4.7.5
https://stats.libretexts.org/@go/page/10163
with all expected values conditioned on X: cov(Y , Z ∣ X) = E (Y Z ∣ X) − E(Y ∣ X)E(Z ∣ X)
.
Proof Our next result shows how to compute the ordinary covariance of Y and Z by conditioning on X. cov(Y , Z) = E [cov(Y , Z ∣ X)] + cov [E(Y ∣ X), E(Z ∣ X)]
.
Proof Thus, the covariance of Y and Z is the expected conditional covariance plus the covariance of the conditional expected values. This result is often a good way to compute cov(Y , Z) when we know the conditional distribution of (Y , Z) given X.
Examples and Applications As always, be sure to try the proofs and computations yourself before reading the ones in the text.
Simple Continuous Distributions Suppose that (X, Y ) has probability density function f defined by f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 . 1. Find L(Y ∣ X) . 2. Find E(Y ∣ X) . 3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes. 4. Find var(Y ). 5. Find var(Y ) [1 − cor (X, Y )] . 6. Find var(Y ) − var [E(Y ∣ X)] . 2
Answer Suppose that (X, Y ) has probability density function f defined by f (x, y) = 2(x + y) for 0 ≤ x ≤ y ≤ 1 . 1. Find L(Y ∣ X) . 2. Find E(Y ∣ X) . 3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes. 4. Find var(Y ). 5. Find var(Y ) [1 − cor (X, Y )] . 6. Find var(Y ) − var [E(Y ∣ X)] . 2
Answer Suppose that (X, Y ) has probability density function f defined by f (x, y) = 6x
2
y
for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 .
1. Find L(Y ∣ X) . 2. Find E(Y ∣ X) . 3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes. 4. Find var(Y ). 5. Find var(Y ) [1 − cor (X, Y )] . 6. Find var(Y ) − var [E(Y ∣ X)] . 2
Answer Suppose that (X, Y ) has probability density function f defined by f (x, y) = 15x
2
y
for 0 ≤ x ≤ y ≤ 1 .
1. Find L(Y ∣ X) . 2. Find E(Y ∣ X) . 3. Graph L(Y ∣ X = x) and E(Y ∣ X = x) as functions of x, on the same axes. 4. Find var(Y ). 5. Find var(Y ) [1 − cor (X, Y )] . 6. Find var(Y ) − var [E(Y ∣ X)] . 2
4.7.6
https://stats.libretexts.org/@go/page/10163
Answer
Exercises on Basic Properties Suppose that E (Y e
X
X
,
Y
, and
− Z sin X ∣ X)
Z
are real-valued random variables with
E(Y ∣ X) = X
3
and
E(Z ∣ X) =
1 1+X
.
2
.
Find
Answer
Uniform Distributions As usual, continuous uniform distributions can give us some geometric insight. Recall first that for n ∈ N , the standard measure on R is n
+
λn (A) = ∫
1dx,
n
A ⊆R
(4.7.36)
A
In particular, λ
1 (A)
is the length of A ⊆ R , λ
2 (A)
is the area of A ⊆ R , and λ 2
3 (A)
is the volume of A ⊆ R . 3
Details With our usual setup, suppose that X takes values in S ⊆ R , Y takes values in T ⊆ R , and that (X, Y ) is uniformly distributed on R ⊆ S × T ⊆ R . So 0 < λ (R) < ∞ , and the joint probability density function f of (X, Y ) is given by f (x, y) = 1/ λ (R) for (x, y) ∈ R. Recall that uniform distributions, whether discrete or continuous, always have constant densities. Finally, recall that the cross section of R at x ∈ S is T = {y ∈ T : (x, y) ∈ R} . n
n+1
n+1
n+1
x
In the setting above, suppose that T is a bounded interval with midpoint m(x) and length l(x) for each x ∈ S . Then x
1. E(Y ∣ X) = m(X) 2. var(Y ∣ X) = l (X) 1
2
12
Proof So in particular, the regression curve x ↦ E(Y
∣ X = x)
follows the midpoints of the cross-sectional intervals.
In each case below, suppose that (X, Y ) is uniformly distributed on the give region. Find E(Y
∣ X)
and var(Y
∣ X)
1. The rectangular region R = [a, b] × [c, d] where a < b and c < d . 2. The triangular region T = {(x, y) ∈ R : −a ≤ x ≤ y ≤ a} where a > 0 . 3. The circular region C = {(x, y) ∈ R : x + y ≤ r} where r > 0 . 2
2
2
2
Answer In the bivariate uniform experiment, select each of the following regions. In each case, run the simulation 2000 times and note the relationship between the cloud of points and the graph of the regression function. 1. square 2. triangle 3. circle Suppose that X is uniformly distributed on the interval (0, 1), and that given X, random variable Y is uniformly distributed on (0, X). Find each of the following: 1. E(Y ∣ X) 2. E(Y ) 3. var(Y ∣ X) 4. var(Y ) Answer
4.7.7
https://stats.libretexts.org/@go/page/10163
The Hypergeometric Distribution Suppose that a population consists of m objects, and that each object is one of three types. There are a objects of type 1, b objects of type 2, and m − a − b objects of type 0. The parameters a and b are positive integers with a + b < m . We sample n objects from the population at random, and without replacement, where n ∈ {0, 1, … , m}. Denote the number of type 1 and 2 objects in the sample by X and Y , so that the number of type 0 objects in the sample is n − X − Y . In the in the chapter on Distributions, we showed that the joint, marginal, and conditional distributions of X and Y are all hypergeometric—only the parameters change. Here is the relevant result for this section: In the setting above, 1. E(Y
∣ X) =
2. var(Y
b m−a
(n − X) b(m−a−b)
∣ X) =
(n − X)(m − a − n + X)
2
(m−a) (m−a−1)
3. E ([Y
2
− E(Y ∣ X)] ) =
n(m−n)b(m−a−b) m(m−1)(m−a)
Proof Note that E(Y
∣ X)
is a linear function of X and hence E(Y
∣ X) = L(Y ∣ X)
.
In a collection of 120 objects, 50 are classified as good, 40 as fair and 30 as poor. A sample of 20 objects is selected at random and without replacement. Let X denote the number of good objects in the sample and Y the number of poor objects in the sample. Find each of the following: 1. E(Y ∣ X) 2. var(Y ∣ X) 3. The predicted value of Y given X = 8 Answer
The Multinomial Trials Model Suppose that we have a sequence of n independent trials, and that each trial results in one of three outcomes, denoted 0, 1, and 2. On each trial, the probability of outcome 1 is p, the probability of outcome 2 is q, so that the probability of outcome 0 is 1 − p − q . The parameters p, q ∈ (0, 1) with p + q < 1 , and of course n ∈ N . Let X denote the number of trials that resulted in outcome 1, Y the number of trials that resulted in outcome 2, so that n − X − Y is the number of trials that resulted in outcome 0. In the in the chapter on Distributions, we showed that the joint, marginal, and conditional distributions of X and Y are all multinomial— only the parameters change. Here is the relevant result for this section: +
In the setting above, 1. E(Y
∣ X) =
2. var(Y
q 1−p
(n − X)
q(1−p−q)
∣ X) =
2
(n − X)
(1−p)
3. E ([Y
q(1−p−q)
2
− E(Y ∣ X)] ) =
1−p
n
Proof Note again that E(Y
∣ X)
is a linear function of X and hence E(Y
∣ X) = L(Y ∣ X)
.
Suppose that a fair, 12-sided die is thrown 50 times. Let X denote the number of throws that resulted in a number from 1 to 5, and Y the number of throws that resulted in a number from 6 to 9. Find each of the following: 1. E(Y ∣ X) 2. var(Y ∣ X) 3. The predicted value of Y given X = 20 Answer
The Poisson Distribution Recall that the Poisson distribution, named for Simeon Poisson, is widely used to model the number of “random points” in a region of time or space, under certain ideal conditions. The Poisson distribution is studied in more detail in the chapter on the Poisson
4.7.8
https://stats.libretexts.org/@go/page/10163
Process. The Poisson distribution with parameter r ∈ (0, ∞) has probability density function f defined by x
f (x) = e
−r
r
,
x ∈ N
(4.7.39)
x!
The parameter r is the mean and variance of the distribution. Suppose that X and Y are independent random variables, and that X has the Poisson distribution with parameter and Y has the Poisson distribution with parameter b ∈ (0, ∞). Let N = X + Y . Then 1. E(X ∣ N ) = 2. var(X ∣ N ) =
a
a+b
a ∈ (0, ∞)
N ab 2
N
(a+b)
3. E ([X − E(X ∣ N )]
2
) =
ab a+b
Proof Once again, E(X ∣ N ) is a linear function of N and so E(X ∣ N ) = L(X ∣ N ) . If we reverse the roles of the variables, the conditional expected value is trivial from our basic properties: E(N ∣ X) = E(X + Y ∣ X) = X + b
(4.7.41)
Coins and Dice A pair of fair dice are thrown, and the scores (X , X ) recorded. Let U = min { X , X } the minimum score. Find each of the following: 1
1
2
Y = X1 + X2
denote the sum of the scores and
2
1. E (Y ∣ X ) 2. E (U ∣ X ) 3. E (Y ∣ U ) 4. E (X ∣ X ) 1
1
2
1
Answer A box contains 10 coins, labeled 0 to 9. The probability of heads for coin i is tossed. Find the probability of heads.
i 9
. A coin is chosen at random from the box and
Answer This problem is an example of Laplace's rule of succession, named for Pierre Simon Laplace.
Random Sums of Random Variables Suppose that X = (X , X , …) is a sequence of independent and identically distributed real-valued random variables. We will denote the common mean, variance, and moment generating function, respectively, by μ = E(X ) , σ = var(X ) , and G(t) = E (e ) . Let 1
2
2
i
i
t Xi
n
Yn = ∑ Xi ,
n ∈ N
(4.7.42)
i=1
so that (Y , Y , …) is the partial sum process associated with independent of X. Then 0
1
X
. Suppose now that
N
is a random variable taking values in
N
,
N
YN = ∑ Xi
(4.7.43)
i=1
is a random sum of random variables; the terms in the sum are random, and the number of terms is random. This type of variable occurs in many different contexts. For example, N might represent the number of customers who enter a store in a given period of time, and X the amount spent by the customer i, so that Y is the total revenue of the store during the period. i
N
The conditional and ordinary expected value of Y
N
1. E (Y 2. E (Y
N
are
∣ N) = Nμ
N ) = E(N )μ
4.7.9
https://stats.libretexts.org/@go/page/10163
Proof Wald's equation, named for Abraham Wald, is a generalization of the previous result to the case where N is not necessarily independent of X, but rather is a stopping time for X. Roughly, this means that the event N = n depends only (X , X , … , X ). Wald's equation is discussed in the chapter on Random Samples. An elegant proof of and Wald's equation is given in the chapter on Martingales. The advanced section on stopping times is in the chapter on Probability Measures. 1
The conditional and ordinary variance of Y
N
1. var (Y 2. var (Y
N
∣ N) = Nσ
N ) = E(N )σ
2
n
are
2
2
2
+ var(N )μ
Proof Let H denote the probability generating function of N . The conditional and ordinary moment generating function of Y
N
1. E (e 2. E (e
tYN tN
∣ N ) = [G(t)]
are
N
) = H (G(t))
Proof Thus the moment generating function of Y is H ∘ G , the composition of the probability generating function of common moment generating function of X, a simple and elegant result. N
N
with the
In the die-coin experiment, a fair die is rolled and then a fair coin is tossed the number of times showing on the die. Let denote the die score and Y the number of heads. Find each of the following:
N
1. The conditional distribution of Y given N . 2. E (Y ∣ N ) 3. var (Y ∣ N ) 4. E (Y ) 5. var(Y ) i
Answer Run the die-coin experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. The number of customers entering a store in a given hour is a random variable with mean 20 and standard deviation 3. Each customer, independently of the others, spends a random amount of money with mean $50 and standard deviation $5. Find the mean and standard deviation of the amount of money spent during the hour. Answer A coin has a random probability of heads V and is tossed a random number of times N . Suppose that V is uniformly distributed on [0, 1]; N has the Poisson distribution with parameter a > 0 ; and V and N are independent. Let Y denote the number of heads. Compute the following: 1. E(Y ∣ N , V ) 2. E(Y ∣ N ) 3. E(Y ∣ V ) 4. E(Y ) 5. var(Y ∣ N , V ) 6. var(Y ) Answer
Mixtures of Distributions Suppose that X = (X , X , …) is a sequence of real-valued random variables. Denote the mean, variance, and moment generating function of X by μ = E(X ) , σ = var(X ) , and M (t) = E (e ) , for i ∈ N . Suppose also that N is a random variable taking values in N , independent of X. Denote the probability density function of N by p = P(N = n) for n ∈ N . 1
2
i
i
i
2
i
i
i
+
t Xi
+
n
4.7.10
+
https://stats.libretexts.org/@go/page/10163
The distribution of the random variable the mixing distribution.
XN
is a mixture of the distributions of
X = (X1 , X2 , …)
, with the distribution of
N
as
The conditional and ordinary expected value of X are N
1. E(X 2. E(X
∣ N ) = μN
N
∞
N ) = ∑n=1 pn μn
Proof The conditional and ordinary variance of X are N
1. var (X 2. var(X
N
∣ N) = σ
2
N
∞
2
2
∞
N ) = ∑n=1 pn (σn + μn ) − (∑n=1 pn μn )
2
.
Proof The conditional and ordinary moment generating function of X are N
1. E (e 2. E (e
tXN tXN
∣ N ) = MN (t) ∞
) =∑
i=1
pi Mi (t)
.
Proof In the coin-die experiment, a biased coin is tossed with probability of heads . If the coin lands tails, a fair die is rolled; if the coin lands heads, an ace-six flat die is rolled (faces 1 and 6 have probability each, and faces 2, 3, 4, 5 have probability each). Find the mean and standard deviation of the die score. 1 3
1
1
4
8
Answer Run the coin-die experiment 1000 times and note the apparent convergence of the empirical mean and standard deviation to the distribution mean and standard deviation. This page titled 4.7: Conditional Expected Value is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.7.11
https://stats.libretexts.org/@go/page/10163
4.8: Expected Value and Covariance Matrices The main purpose of this section is a discussion of expected value and covariance for random matrices and vectors. These topics are somewhat specialized, but are particularly important in multivariate statistical models and for the multivariate normal distribution. This section requires some prerequisite knowledge of linear algebra. We assume that the various indices m, n, p, k that occur in this section are positive integers. Also we assume that expected values of real-valued random variables that we reference exist as real numbers, although extensions to cases where expected values are ∞ or −∞ are straightforward, as long as we avoid the dreaded indeterminate form ∞ − ∞ .
Basic Theory Linear Algebra We will follow our usual convention of denoting random variables by upper case letters and nonrandom variables and constants by lower case letters. In this section, that convention leads to notation that is a bit nonstandard, since the objects that we will be dealing with are vectors and matrices. On the other hand, the notation we will use works well for illustrating the similarities between results for random matrices and the corresponding results in the one-dimensional case. Also, we will try to be careful to explicitly point out the underlying spaces where various objects live. Let
denote the space of all m × n matrices of real numbers. The (i, j) entry of a ∈ R is denoted a for i ∈ {1, 2, … , m} and j ∈ {1, 2, … , n}. We will identify R with R , so that an ordered n -tuple can also be thought of as an n × 1 column vector. The transpose of a matrix a ∈ R is denoted a —the n × m matrix whose (i, j) entry is the (j, i) entry of a . Recall the definitions of matrix addition, scalar multiplication, and matrix multiplication. Recall also the standard inner product (or dot product) of x, y ∈ R : m×n
m×n
R
ij
n
n×1
m×n
T
n
n T
⟨x, y⟩ = x ⋅ y = x
y = ∑ xi yi
(4.8.1)
i=1
The outer product of x and y is xy , the n × n matrix whose (i, j) entry is x y . Note that the inner product is the trace (sum of the diagonal entries) of the outer product. Finally recall the standard norm on R , given by T
i
j
n
− − − − −
− − − − − − − − − − − − − − − 2
2
2
∥x∥ = √⟨x, x⟩ = √ x + x + ⋯ + xn 1 2
(4.8.2)
Recall that inner product is bilinear, that is, linear (preserving addition and scalar multiplication) in each argument separately. As a consequence, for x, y ∈ R , n
2
∥x + y∥
2
= ∥x ∥
2
+ ∥y∥
+ 2⟨x, y⟩
(4.8.3)
Expected Value of a Random Matrix As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So to review, Ω is the set of outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F ). It's natural to define the expected value of a random matrix in a component-wise manner. Suppose that X is an m × n matrix of real-valued random variables, whose (i, j) entry is denoted X . Equivalently, X is as a random m × n matrix, that is, a random variable with values in R . The expected value E(X) is defined to be the m × n matrix whose (i, j) entry is E (X ), the expected value of X . ij
m×n
ij
ij
Many of the basic properties of expected value of random variables have analogous results for expected value of random matrices, with matrix operation replacing the ordinary ones. Our first two properties are the critically important linearity properties. The first part is the additive property—the expected value of a sum is the sum of the expected values. E(X + Y ) = E(X) + E(Y )
if X and Y are random m × n matrices.
Proof The next part of the linearity properties is the scaling property—a nonrandom matrix factor can be pulled out of the expected value.
4.8.1
https://stats.libretexts.org/@go/page/10164
Suppose that X is a random n × p matrix. 1. E(aX) = aE(X) if a ∈ R 2. E(Xa) = E(X)a if a ∈ R
m×n p×n
.
.
Proof Recall that for independent, real-valued variables, the expected value of the product is the product of the expected values. Here is the analogous result for random matrices. E(XY ) = E(X)E(Y )
if X is a random m × n matrix, Y is a random n × p matrix, and X and Y are independent.
Proof Actually the previous result holds if X and Y are simply uncorrelated in the sense that X and Y are uncorrelated for each i ∈ {1, … , m}, j ∈ {1, 2, … , n} and k ∈ {1, 2, … p}. We will study covariance of random vectors in the next subsection. ij
jk
Covariance Matrices Our next goal is to define and study the covariance of two random vectors. Suppose that X is a random vector in R and Y is a random vector in R . m
n
1. The covariance matrix of X and Y is the m × n matrix cov(X, Y ) whose (i, j) entry is cov (X , Y ) the ordinary covariance of X and Y . 2. Assuming that the coordinates of X and Y have positive variance, the correlation matrix of X and Y is the m × n matrix cor(X, Y ) whose (i, j) entry is cor (X , Y ), the ordinary correlation of X and Y i
i
j
j
i
j
i
j
Many of the standard properties of covariance and correlation for real-valued random variables have extensions to random vectors. For the following three results, X is a random vector in R and Y is a random vector in R . m
cov(X, Y ) = E ([X − E(X)] [Y − E(Y )]
T
n
)
Proof Thus, the covariance of X and Y is the expected value of the outer product of X − E(X) and Y − E(Y ) . Our next result is the computational formula for covariance: the expected value of the outer product of X and Y minus the outer product of the expected values. cov(X, Y ) = E (X Y
T
T
) − E(X)[E(Y )]
.
Proof The next result is the matrix version of the symmetry property. T
cov(Y , X) = [cov(X, Y )]
.
Proof In the following result, 0 denotes the m × n zero matrix. cov(X, Y ) = 0
if and only if
cov (Xi , Yj ) = 0
for each
i
and j , so that each coordinate of
X
is uncorrelated with each
coordinate of Y . Proof Naturally, when cov(X, Y ) = 0 , we say that the random vectors X and Y are uncorrelated. In particular, if the random vectors are independent, then they are uncorrelated. The following results establish the bi-linear properties of covariance. The additive properties. 1. cov(X + Y , Z) = cov(X, Z) + cov(Y , Z) if X and Y are random vectors in R and Z is a random vector in R . 2. cov(X, Y + Z) = cov(X, Y ) + cov(X, Z) if X is a random vector in R , and Y and Z are random vectors in R . m
m
n
n
Proof
4.8.2
https://stats.libretexts.org/@go/page/10164
The scaling properties 1. cov(aX, Y ) = acov(X, Y ) if X is a random vector in R , Y is a random vector in R , and a ∈ R 2. cov(X, aY ) = cov(X, Y )a if X is a random vector in R , Y is a random vector in R , and a ∈ R n
T
p
m
m×n
n
.
k×n
.
Proof
Variance-Covariance Matrices Suppose that X is a random vector in R . The covariance matrix of X with itself is called the variance-covariance matrix of X: n
vc(X) = cov(X, X) = E ([X − E(X)] [X − E(X)]
T
)
(4.8.6)
Recall that for an ordinary real-valued random variable X, var(X) = cov(X, X) . Thus the variance-covariance matrix of a random vector in some sense plays the same role that variance does for a random variable. vc(X)
is a symmetric n × n matrix with (var(X
1 ),
var(X2 ), … , var(Xn ))
on the diagonal.
Proof The following result is the formula for the variance-covariance matrix of a sum, analogous to the formula for the variance of a sum of real-valued variables. vc(X + Y ) = vc(X) + cov(X, Y ) + cov(Y , X) + vc(Y )
if X and Y are random vectors in R . n
Proof Recall that var(aX) = a var(X) if X is a real-valued random variable and a ∈ R . Here is the analogous result for the variancecovariance matrix of a random vector. 2
T
vc(aX) = avc(X)a
if X is a random vector in R and a ∈ R n
m×n
.
Proof Recall that if X is a random variable, then var(X) ≥ 0 , and var(X) = 0 if and only if X is a constant (with probability 1). Here is the analogous result for a random vector: Suppose that X is a random vector in R . n
1. vc(X) is either positive semi-definite or positive definite. 2. vc(X) is positive semi-definite but not positive definite if and only if there exists a ∈ R and c ∈ R such that, with probability 1, a X = ∑ a X = c n
T
n
i=1
i
i
Proof Recall that since vc(X) is either positive semi-definite or positive definite, the eigenvalues and the determinant of vc(X) are nonnegative. Moreover, if vc(X) is positive semi-definite but not positive definite, then one of the coordinates of X can be written as a linear transformation of the other coordinates (and hence can usually be eliminated in the underlying model). By contrast, if vc(X) is positive definite, then this cannot happen; vc(X) has positive eigenvalues and determinant and is invertible.
Best Linear Predictor Suppose that X is a random vector in R and that Y is a random vector in R . We are interested in finding the function of X of the form a + bX , where a ∈ R and b ∈ R , that is closest to Y in the mean square sense. Functions of this form are analogous to linear functions in the single variable case. However, unless a = 0 , such functions are not linear transformations in the sense of linear algebra, so the correct term is affine function of X. This problem is of fundamental importance in statistics when random vector X, the predictor vector is observable, but not random vector Y , the response vector. Our discussion here generalizes the one-dimensional case, when X and Y are random variables. That problem was solved in the section on Covariance and Correlation. We will assume that vc(X) is positive definite, so that vc(X) is invertible, and none of the coordinates of X can be written as an affine function of the other coordinates. We write vc (X) for the inverse instead of the clunkier [vc(X)] . m
n
n
n×m
−1
4.8.3
−1
https://stats.libretexts.org/@go/page/10164
As with the single variable case, the solution turns out to be the affine function that has the same expected value as Y , and whose covariance with X is the same as that of Y . Define L(Y satisfying
∣ X) = E(Y ) + cov(Y , X)vc
−1
(X) [X − E(X)]
. Then
L(Y ∣ X)
is the only affine function of
X
in
n
R
1. E [L(Y ∣ X)] = E(Y ) 2. cov [L(Y ∣ X), X] = cov(Y , X) Proof A simple corollary is the Y
− L(Y ∣ X)
is uncorrelated with any affine function of X:
If U is an affine function of X then 1. cov [Y − L(Y ∣ X), U ] = 0 2. E (⟨Y − L(Y ∣ X), U ⟩) = 0 Proof The variance-covariance matrix of single variable case. Additional properties of L(Y
, and its covariance matrix with
L(Y ∣ X)
Y
turn out to be the same, again analogous to the
:
∣ X)
1. cov [Y , L(Y ∣ X)] = cov(Y , X)vc (X)cov(X, Y ) 2. vc [L(Y ∣ X)] = cov(Y , X)vc (X)cov(X, Y ) −1
−1
Proof Next is the fundamental result that L(Y Suppose that U
n
∈ R
∣ X)
is the affine function of X that is closest to Y in the mean square sense.
is an affine function of X. Then
1. E (∥Y − L(Y ∣ X)∥ ) ≤ E (∥Y − U ∥ ) 2. Equality holds in (a) if and only if U = L(Y 2
2
∣ X)
with probability 1.
Proof The variance-covariance matrix of the difference between Y and the best affine approximation is given in the next theorem. vc [Y − L(Y ∣ X)] = vc(Y ) − cov(Y , X)vc
−1
(X)cov(X, Y )
Proof The actual mean square error when we use L(Y
∣ X)
to approximate Y , namely E (∥Y
− L(Y ∣ X)∥
2
, is the trace (sum of the
)
diagonal entries) of the variance-covariance matrix above. The function of x given by L(Y ∣ X = x) = E(Y ) + cov(Y , X)vc
−1
(X) [x − E(X)]
is known as the (distribution) linear regression function. If we observe x then L(Y
∣ X = x)
(4.8.16)
is our best affine prediction of Y .
Multiple linear regression is more powerful than it may at first appear, because it can be applied to non-linear transformations of the random vectors. That is, if g : R → R and h : R → R then L [h(Y ) ∣ g(X)] is the affine function of g(X) that is closest to h(Y ) in the mean square sense. Of course, we must be able to compute the appropriate means, variances, and covariances. m
j
n
k
Moreover, Non-linear regression with a single, real-valued predictor variable can be thought of as a special case of multiple linear regression. Thus, suppose that X is the predictor variable, Y is the response variable, and that (g , g , … , g ) is a sequence of real-valued functions. We can apply the results of this section to find the linear function of (g (X), g (X), … , g (X)) that is closest to Y in the mean square sense. We just replace X with g (X) for each i. Again, we must be able to compute the appropriate means, variances, and covariances to do this. 1
1
i
2
n
2
n
i
4.8.4
https://stats.libretexts.org/@go/page/10164
Examples and Applications Suppose that (X, Y ) has probability density function the following:
f
defined by
f (x, y) = x + y
for
f
defined by
f (x, y) = 2(x + y)
0 ≤x ≤1
,
0 ≤y ≤1
. Find each of
1. E(X, Y ) 2. vc(X, Y ) Answer Suppose that following:
(X, Y )
has probability density function
for
0 ≤x ≤y ≤1
. Find each of the
1. E(X, Y ) 2. vc(X, Y ) Answer Suppose that (X, Y ) has probability density function f defined by f (x, y) = 6x following:
2
y
for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 . Find each of the
1. E(X, Y ) 2. vc(X, Y ) Answer Suppose that following:
(X, Y )
has probability density function
f
defined by
2
f (x, y) = 15 x y
for
0 ≤x ≤y ≤1
. Find each of the
1. E(X, Y ) 2. vc(X, Y ) 3. L(Y ∣ X) 4. L [Y ∣ (X, X )] 5. Sketch the regression curves on the same set of axes. 2
Answer Suppose that following:
(X, Y , Z)
is uniformly distributed on the region
3
{(x, y, z) ∈ R
: 0 ≤ x ≤ y ≤ z ≤ 1}
. Find each of the
1. E(X, Y , Z) 2. vc(X, Y , Z) 3. L [Z ∣ (X, Y )] 4. L [Y ∣ (X, Z)] 5. L [X ∣ (Y , Z)] 6. L [(Y , Z) ∣ X] Answer Suppose that X is uniformly distributed on Find each of the following:
, and that given
(0, 1)
X
, random variable
Y
is uniformly distributed on
.
(0, X)
1. E(X, Y ) 2. vc(X, Y ) Answer This page titled 4.8: Expected Value and Covariance Matrices is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.8.5
https://stats.libretexts.org/@go/page/10164
4.9: Expected Value as an Integral In the introductory section, we defined expected value separately for discrete, continuous, and mixed distributions, using density functions. In the section on additional properties, we showed how these definitions can be unified, by first defining expected value for nonnegative random variables in terms of the right-tail distribution function. However, by far the best and most elegant definition of expected value is as an integral with respect to the underlying probability measure. This definition and a review of the properties of expected value are the goals of this section. No proofs are necessary (you will be happy to know), since all of the results follow from the general theory of integration. However, to understand the exposition, you will need to review the advanced sections on the integral with respect to a positive measure and the properties of the integral. If you are a new student of probability, or are not interested in the measure-theoretic detail of the subject, you can safely skip this section.
Definitions As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So Ω is the set of outcomes, the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ).
F
is
Recall that a random variable X for the experiment is simply a measurable function from (Ω, F ) into another measurable space (S, S ). When S ⊆ R , we assume that S is Lebesgue measurable, and we take S to the σ-algebra of Lebesgue measurable subsets of S . As noted above, here is the measure-theoretic definition: n
If X is a real-valued random variable on the probability space, the expected value of respect to P, assuming that the integral exists: E(X) = ∫
X
is defined as the integral of
X dP
X
with
(4.9.1)
Ω
Let's review how the integral is defined in stages, but now using the notation of probability theory. Let S denote the support set of X, so that S is a measurable subset of R. 1. If S is finite, then E(X) = ∑ x P(X = x) . 2. If S ⊆ [0, ∞) , then E(X) = sup {E(Y ) : Y has finite range and 0 ≤ Y ≤ X} 3. For general S ⊆ R , E(X) = E (X ) − E (X ) as long as the right side is not of the form ∞ − ∞ , and where X and X denote the positive and negative parts of X. 4. If A ∈ F , then E(X; A) = E (X1 ) , assuming that the expected value on the right exists. x∈S
+
−
+
−
A
Thus, as with integrals generally, an expected value can exist as a number in R (in which case X is integrable), can exist as ∞ or −∞ , or can fail to exist. In reference to part (a), a random variable with a finite set of values in R is a simple function in the terminology of general integration. In reference to part (b), note that the expected value of a nonnegative random variable always exists in [0, ∞]. In reference to part (c), E(X) exists if and only if either E (X ) < ∞ or E (X ) < ∞ . +
−
Our next goal is to restate the basic theorems and properties of integrals, but in the notation of probability. Unless otherwise noted, all random variables are assumed to be real-valued.
Basic Properties The Linear Properties Perhaps the most important and basic properties are the linear properties. Part (a) is the additive property and part (b) is the scaling property. Suppose that X and Y are random variables whose expected values exist, and that c ∈ R . Then 1. E(X + Y ) = E(X) + E(Y ) as long as the right side is not of the form ∞ − ∞ . 2. E(cX) = cE(X)
4.9.1
https://stats.libretexts.org/@go/page/10165
Thus, part (a) holds if at least one of the expected values on the right is finite, or if both are ∞, or if both are −∞ . What is ruled out are the two cases where one expected value is ∞ and the other is −∞ , and this is what is meant by the indeterminate form ∞−∞ .
Equality and Order Our next set of properties deal with equality and order. First, the expected value of a random variable over a null set is 0. If X is a random variable and A is an event with P(A) = 0 . Then E(X; A) = 0 . Random variables that are equivalent have the same expected value If X is a random variable whose expected value exists, and Y is a random variable with P(X = Y ) = 1 , then E(X) = E(Y ) . Our next result is the positive property of expected value. Suppose that X is a random variable and P(X ≥ 0) = 1 . Then 1. E(X) ≥ 0 2. E(X) = 0 if and only if P(X = 0) = 1 . So, if X is a nonnegative random variable then E(X) > 0 if and only if P(X > 0) > 0 . The next result is the increasing property of expected value, perhaps the most important property after linearity. Suppose that X, Y are random variables whose expected values exist, and that P(X ≤ Y ) = 1 . Then 1. E(X) ≤ E(Y ) 2. Except in the case that both expected values are ∞ or both −∞ , E(X) = E(Y ) if and only if P(X = Y ) = 1 . So if X ≤ Y with probability 1 then, except in the two cases mentioned, result is the absolute value inequality.
E(X) < E(Y )
if and only if
P(X < Y ) > 0
. The next
Suppose that X is a random variable whose expected value exists. Then 1. |E(X)| ≤ E (|X|) 2. If E(X) is finite, then equality holds in (a) if and only if P(X ≥ 0) = 1 or P(X ≤ 0) = 1 .
Change of Variables and Density Functions The Change of Variables Theorem Suppose now that X is a general random variable on the probability space (Ω, F , P), taking values in a measurable space (S, S ). Recall that the probability distribution of X is the probability measure P on (S, S ) given by P (A) = P(X ∈ A) for A ∈ S . This is a special case of a new positive measure induced by a given positive measure and a measurable function. If g : S → R is measurable, then g(X) is a real-valued random variable. The following result shows how to computed the expected value of g(X) as an integral with respect to the distribution of X, and is known as the change of variables theorem. If g : S → R is measurable then, assuming that the expected value exists, E [g(X)] = ∫
g(x) dP (x)
(4.9.2)
S
So, using the original definition and the change of variables theorem, and giving the variables explicitly for emphasis, we have E [g(X)] = ∫
g [X(ω)] dP(ω) = ∫
Ω
g(x) dP (x)
(4.9.3)
S
The Radon-Nikodym Theorem Suppose now μ is a positive measure on (S, S ), and that the distribution of X is absolutely continuous with respect to μ . Recall that this means that μ(A) = 0 implies P (A) = P(X ∈ A) = 0 for A ∈ S . By the Radon-Nikodym theorem, named for Johann
4.9.2
https://stats.libretexts.org/@go/page/10165
Radon and Otto Nikodym, X has a probability density function f with respect to μ . That is, P (A) = P(X ∈ A) = ∫
f dμ,
A ∈ S
(4.9.4)
A
In this case, we can write the expected value of g(X) as an integral with respect to the probability density function. If g : S → R is measurable then, assuming that the expected value exists, E [g(X)] = ∫
gf dμ
(4.9.5)
S
Again, giving the variables explicitly for emphasis, we have the following chain of integrals: E [g(X)] = ∫
g [X(ω)] dP(ω) = ∫
Ω
g(x) dP (x) = ∫
S
g(x)f (x) dμ(x)
(4.9.6)
S
There are two critically important special cases.
Discrete Distributions Suppose first that (S, S , #) is a discrete measure space, so that S is countable, S = P(S) is the collection of all subsets of S , and # is counting measure on (S, S ). Thus, X has a discrete distribution on S , and this distribution is always absolutely continuous with respect to #. Specifically, #(A) = 0 if and only if A = ∅ and of course P(X ∈ ∅) = 0 . The probability density function f of X with respect to #, as we know, is simply f (x) = P(X = x) for x ∈ S . Moreover, integrals with respect to # are sums, so E [g(X)] = ∑ g(x)f (x)
(4.9.7)
x∈S
assuming that the expected value exists. Existence in this case means that either the sum of the positive terms is finite or the sum of the negative terms is finite, so that the sum makes sense (and in particular does not depend on the order in which the terms are added). Specializing further, if X itself is real-valued and g = 1 we have E(X) = ∑ xf (x)
(4.9.8)
x∈S
which was our original definition of expected value in the discrete case.
Continuous Distributions For the second special case, suppose that (S, S , λ ) is a Euclidean measure space, so that S is a Lebesgue measurable subset of R for some n ∈ N , S is the σ-algebra of Lebesgue measurable subsets of S , and λ is Lebesgue measure on (S, S ). The distribution of X is absolutely continuous with respect to λ if λ (A) = 0 implies P(X ∈ A) = 0 for A ∈ S . If this is the case, then a probability density function f of X has its usual meaning. Thus, n
n
+
n
n
n
E [g(X)] = ∫
g(x)f (x) dλn (x)
(4.9.9)
S
assuming that the expected value exists. When g is a typically nice function, this integral reduces to an ordinary Riemann integral of calculus. Specializing further, if X is itself real-valued and g = 1 then E(X) = ∫
xf (x) dx
n
-dimensional
(4.9.10)
S
which was our original definition of expected value in the continuous case.
Interchange Properties In this subsection, we review properties that allow the interchange of expected value and other operations: limits of sequences, infinite sums, and integrals. We assume again that the random variables are real-valued unless otherwise specified.
4.9.3
https://stats.libretexts.org/@go/page/10165
Limits Our first set of convergence results deals with the interchange of expected value and limits. We start with the expected value version of Fatou's lemma, named in honor of Pierre Fatou. Its usefulness stems from the fact that no assumptions are placed on the random variables, except that they be nonnegative. Suppose that X is a nonnegative random variable for n ∈ N . Then n
+
E (lim inf Xn ) ≤ lim inf E(Xn ) n→∞
(4.9.11)
n→∞
Our next set of results gives conditions for the interchange of expected value and limits. Suppose that X is a random variable for each n ∈ N . then n
+
E ( lim Xn ) = lim E (Xn ) n→∞
(4.9.12)
n→∞
in each of the following cases: 1. X is nonnegative for each n ∈ N and X is increasing in n . 2. E(X ) exists for each n ∈ N , E(X ) > −∞ , and X is increasing in n . 3. E(X ) exists for each n ∈ N , E(X ) < ∞ , and X is decreasing in n . 4. lim X exists, and |X | ≤ Y for n ∈ N where Y is a nonnegative random variable with E(Y ) < ∞ . 5. lim X exists, and |X | ≤ c for n ∈ N where c is a positive constant. n
+
n
n
+
1
n
+
1
n→∞
n
n
n→∞
n
n
n
n
Statements about the random variables in the theorem above (nonnegative, increasing, existence of limit, etc.) need only hold with probability 1. Part (a) is the monotone convergence theorem, one of the most important convergence results and in a sense, essential to the definition of the integral in the first place. Parts (b) and (c) are slight generalizations of the monotone convergence theorem. In parts (a), (b), and (c), note that lim X exists (with probability 1), although the limit may be ∞ in parts (a) and (b) and −∞ in part (c) (with positive probability). Part (d) is the dominated convergence theorem, another of the most important convergence results. It's sometimes also known as Lebesgue's dominated convergence theorem in honor of Henri Lebesgue. Part (e) is a corollary of the dominated convergence theorem, and is known as the bounded convergence theorem. n→∞
n
Infinite Series Our next results involve the interchange of expected value and an infinite sum, so these results generalize the basic additivity property of expected value. Suppose that X is a random variable for n ∈ N . Then n
+
∞
∞
E ( ∑ Xn ) = ∑ E (Xn ) n=1
(4.9.13)
n=1
in each of the following cases: 1. X is nonnegative for each n ∈ N . 2. E (∑ | X |) < ∞ n
+
∞
n
n=1
Part (a) is a consequence of the monotone convergence theorem, and part (b) is a consequence of the dominated convergence theorem. In (b), note that ∑ | X | < ∞ and hence ∑ X is absolutely convergent with probability 1. Our next result is the additivity of the expected value over a countably infinite collection of disjoint events. ∞
n=1
∞
n
n=1
n
Suppose that X is a random variable whose expected value exists, and that {A A =⋃ A . Then
n
: n ∈ N+ }
is a disjoint collection events. Let
∞
n=1
n
∞
E(X; A) = ∑ E(X; An )
(4.9.14)
n=1
Of course, the previous theorem applies in particular if X is nonnegative.
4.9.4
https://stats.libretexts.org/@go/page/10165
Integrals Suppose that (T , T , μ) is a σ-finite measure space, and that X is a real-valued random variable for each t ∈ T . Thus we can think of {X : t ∈ T } is a stochastic process indexed by T . We assume that (ω, t) ↦ X (ω) is measurable, as a function from the product space (Ω × T , F ⊗ T ) into R. Our next result involves the interchange of expected value and integral, and is a consequence of Fubini's theorem, named for Guido Fubini. t
t
t
Under the assumptions above, E [∫
Xt dμ(t)] = ∫
T
E (Xt ) dμ(t)
(4.9.15)
T
in each of the following cases: 1. X is nonnegative for each t ∈ T . 2. ∫ E (|X |) dμ(t) < ∞ t
T
t
Fubini's theorem actually states that the two iterated integrals above equal the joint integral ∫
Xt (ω) d(P ⊗ μ)(ω, t)
(4.9.16)
Ω×T
where of course, P ⊗ μ is the product measure on (Ω × T , F ⊗ T ) . However, our interest is usually in evaluating the iterated integral above on the left in terms of the iterated integral on the right. Part (a) is the expected value version of Tonelli's theorem, named for Leonida Tonelli.
Examples and Exercises You may have worked some of the computational exercises before, but try to see them in a new light, in terms of the general theory of integration.
The Cauchy Distribution Recall that the Cauchy distribution, named for Augustin Cauchy, is a continuous distribution with probability density function given by
f
1 f (x) =
2
,
x ∈ R
(4.9.17)
π (1 + x )
The Cauchy distribution is studied in more generality in the chapter on Special Distributions. Suppose that X has the Cauchy distribution. 1. Show that E(X) does not exist. 2. Find E (X ) 2
Answer Open the Cauchy Experiment and keep the default parameters. Run the experiment 1000 times and note the behaior of the sample mean.
The Pareto Distribution Recall that the Pareto distribution, named for Vilfredo Pareto, is a continuous distribution with probability density function f given by a f (x) =
a+1
,
x ∈ [1, ∞)
(4.9.18)
x
where a > 0 is the shape parameter. The Pareto distribution is studied in more generality in the chapter on Special Distributions. Suppose that X has the Pareto distribution with shape parameter a . Find E(X) is the following cases: 1. 0 < a ≤ 1
4.9.5
https://stats.libretexts.org/@go/page/10165
2. a > 1 Answer Open the special distribution simulator and select the Pareto distribution. Vary the shape parameter and note the shape of the probability density function and the location of the mean. For various values of the parameter, run the experiment 1000 times and compare the sample mean with the distribution mean. Suppose that X has the Pareto distribution with shape parameter a . Find E (1/X
n
)
for n ∈ N . +
Answer
Special Results for Nonnegative Variables For a nonnegative variable, the moments can be obtained from integrals of the right-tail distribution function. If X is a nonnegative random variable then ∞
E (X
n
n−1
) =∫
nx
P(X > x) dx
(4.9.19)
0
Proof ∞
When n = 1 we have E(X) = ∫ P(X > x) dx . We saw this result before in the section on additional properties of expected value, but now we can understand the proof in terms of Fubini's theorem. 0
For a random variable taking nonnegative integer values, the moments can be computed from sums involving the right-tail distribution function. Suppose that X has a discrete distribution, taking values in N. Then ∞
E (X
n
n
) = ∑ [k
n
− (k − 1 ) ] P(X ≥ k)
(4.9.21)
k=1
Proof When n = 1 we have E(X) = ∑ P(X ≥ k) . We saw this result before in the section on additional properties of expected value, but now we can understand the proof in terms of the interchange of sum and expected value. ∞
k=0
This page titled 4.9: Expected Value as an Integral is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.9.6
https://stats.libretexts.org/@go/page/10165
4.10: Conditional Expected Value Revisited Conditional expected value is much more important than one might at first think. In fact, conditional expected value is at the core of modern probability theory because it provides the basic way of incorporating known information into a probability measure.
Basic Theory Definition As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P), so that Ω is the set of outcomes,, F is the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ). In our first elementary discussion, we studied the conditional expected value of a real-value random variable X given a general random variable Y . The more general approach is to condition on a sub σ-algebra G of F . The sections on σ-algebras and measure theory are essential prerequisites for this section. Before we get to the definition, we need some preliminaries. First, all random variables mentioned are assumed to be real valued. next the notion of equivalence plays a fundamental role in this section. Next recall that random variables X and X are equivalent if P(X = X ) = 1 . Equivalence really does define an equivalence relation on the collection of random variables defined on the sample space. Moreover, we often regard equivalent random variables as being essentially the same object. More precisely from this point of view, the objects of our study are not individual random variables but rather equivalence classes of random variables under this equivalence relation. Finally, for A ∈ F , recall the notation for the expected value of X on the event A 1
1
2
2
E(X; A) = E(X 1A )
(4.10.1)
assuming of course that the expected value exists. For the remainder of this subsection, suppose that G is a sub σ-algebra of F . Suppose that X is a random variable with E(|X|) < ∞ . The conditional expected value of X given G is the random variable E(X ∣ G ) defined by the following properties: 1. E(X ∣ G ) is measurable with repsect to G . 2. If A ∈ G then E[E(X ∣ G ); A] = E(X; A) The basic idea is that E(X ∣ G ) is the expected value of X given the information in the σ-algebra G . Hopefully this idea will become clearer during our study. The conditions above uniquely define E(X ∣ G ) up to equivalence. The proof of this fact is a simple application of the Radon-Nikodym theorem, named for Johann Radon and Otto Nikodym Suppose again that X is a random variable with E(|X|) < ∞ . 1. There exists a random variable V satisfying the definition. 2. If V and V satisfy the definition, then P(V = V ) = 1 so that V and V are equivalent. 1
2
1
2
1
2
Proof The following characterization might seem stronger but in fact in equivalent to the definition. Suppose again that X is a random variable with E(|X|) < ∞ . Then E(X ∣ G ) is characterized by the following properties: 1. E(X ∣ G ) is measurable with respect to G 2. If U is measurable with respect to G and E(|U X|) < ∞ then E[U E(X ∣ G )] = E(U X) . Proof
Properties Our next discussion concerns some fundamental properties of conditional expected value. All equalities and inequalities are understood to hold modulo equivalence, that is, with probability 1. Note also that many of the proofs work by showing that the right hand side satisfies the properties in the definition for the conditional expected value on the left side. Once again we assume that G is a sub sigma-algebra of F . Our first property is a simple consequence of the definition: X and E(X ∣ G ) have the same mean.
4.10.1
https://stats.libretexts.org/@go/page/10338
Suppose that X is a random variable with E(|X|) < ∞ . Then E[E(X ∣ G )] = E(X) . Proof The result above can often be used to compute E(X), by choosing the σ-algebra G in a clever way. We say that we are computing E(X) by conditioning on G . Our next properties are fundamental: every version of expected value must satisfy the linearity properties. The first part is the additive property and the second part is the scaling property. Suppose that X and Y are random variables with E(|X|) < ∞ and E(|Y |) < ∞ , and that c ∈ R . Then 1. E(X + Y ∣ G ) = E(X ∣ G ) + E(Y 2. E(cX ∣ G ) = cE(X ∣ G )
∣ G)
Proof The next set of properties are also fundamental to every notion of expected value. The first part is the positive property and the second part is the increasing property. Suppose again that X and Y are random variables with E(|X|) < ∞ and E(|Y |) < ∞ . 1. If X ≥ 0 then E(X ∣ G ) ≥ 0 2. If X ≤ Y then E(X ∣ G ) ≤ E(Y
∣ G)
Proof The next few properties relate to the central idea that E(X ∣ G ) is the expected value of X given the information in the σ-algebra G. Suppose that X and V are random variables with E(|X|) < ∞ and E(|XV |) < ∞ and that G . Then E(V X ∣ G ) = V E(X ∣ G ) .
V
is measurable with respect to
Proof Compare this result with the scaling property. If V is measurable with respect to G then V is like a constant in terms of the conditional expected value given G . On the other hand, note that this result implies the scaling property, since a constant can be viewed as a random variable, and as such, is measurable with respect to any σ-algebra. As a corollary to this result, note that if X itself is measurable with respect to G then E(X ∣ G ) = X . The following result gives the other extreme. Suppose that X is a random variable with E(|X|) < ∞ . If X and G are independent then E(X ∣ G ) = E(X) . Proof Every random variable X is independent of the trivial σ-algebra {∅, Ω} so it follows that E(X ∣ {∅, Ω}) = E(X). The next properties are consistency conditions, also known as the tower properties. When conditioning twice, with respect to nested σ-algebras, the smaller one (representing the least amount of information) always prevails. Suppose that X is a random variable with E(|X|) < ∞ and that H is a sub σ-algebra of G . Then 1. E[E(X ∣ H ) ∣ G ] = E(X ∣ H ) 2. E[E(X ∣ G ) ∣ H ] = E(X ∣ H ) Proof The next result gives Jensen's inequality for conditional expected value, named for Johan Jensen. Suppose that X takes values in an interval S ⊆ R and that g : S → R is convex. If E(|X|) < ∞ and E(|g(X)| < ∞ then E[g(X) ∣ G ] ≥ g[E(X ∣ G )]
(4.10.9)
Proof
Conditional Probability For our next discussion, suppose as usual that G is a sub σ-algebra of F . The conditional probability of an event A given G can be defined as a special case of conditional expected value. As usual, let 1 denote the indicator random variable of A . A
4.10.2
https://stats.libretexts.org/@go/page/10338
For A ∈ F we define P(A ∣ G ) = E(1A ∣ G )
(4.10.11)
Thus, we have the following characterizations of conditional probability, which are special cases of the definition and the alternate version: If A ∈ F then P(A ∣ G ) is characterized (up to equivalence) by the following properties 1. P(A ∣ G ) is measurable with respect to G . 2. If B ∈ G then E[P(A ∣ G ); B] = P(A ∩ B) Proof If A ∈ F then P(A ∣ G ) is characterized (up to equivalence) by the following properties 1. P(A ∣ G ) is measurable with respect to G . 2. If U is measurable with respect to G and E(|U |) < ∞ then E[U P(A ∣ G )] = E(U ; A) The properties above for conditional expected value, of course, have special cases for conditional probability. In particular, we can compute the probability of an event by conditioning on a σ-algebra: If A ∈ F then P(A) = E[P(A ∣ G )] . Proof Again, the last theorem is often a good way to compute P(A) when we know the conditional probability of A given G . This is a very compact and elegant version of the law of total probability given first in the section on Conditional Probability in the chapter on Probability Spaces and later in the section on Discrete Distributions in the Chapter on Distributions. The following theorem gives the conditional version of the axioms of probability. The following properties hold (as usual, modulo equivalence): 1. P(A ∣ G ) ≥ 0 for every A ∈ F 2. P(Ω ∣ G ) = 1 3. If {A : i ∈ I } is a countable disjoint subset of F then P(⋃ i
i∈I
Ai ∣ ∣ G ) = ∑i∈I P(Ai ∣ G )
Proof From the last result, it follows that other standard probability rules hold for conditional probability given equivalence). These results include
G
(as always, modulo
the complement rule the increasing property Boole's inequality Bonferroni's inequality the inclusion-exclusion laws However, it is not correct to state that A ↦ P(A ∣ G ) is a probability measure, because the conditional probabilities are only defined up to equivalence, and so the mapping does not make sense. We would have to specify a particular version of E(A ∣ G ) for each A ∈ F for the mapping to make sense. Even if we do this, the mapping may not define a probability measure. In part (c), the left and right sides are random variables and the equation is an event that has probability 1. However this event depends on the collection {A : i ∈ I } . In general, there will be uncountably many such collections in F , and the intersection of all of the corresponding events may well have probability less than 1 (if it's measurable at all). It turns out that if the underlying probability space (Ω, F , P) is sufficiently “nice” (and most probability spaces that arise in applications are nice), then there does in fact exist a regular conditional probability. That is, for each A ∈ F , there exists a random variable P(A ∣ G ) satisfying the conditions in (12) and such that with probability 1, A ↦ P(A ∣ G ) is a probability measure. i
The following theorem gives a version of Bayes' theorem, named for the inimitable Thomas Bayes. Suppose that A ∈ G and B ∈ F . then
4.10.3
https://stats.libretexts.org/@go/page/10338
E[P(B ∣ G ); A] P(A ∣ B) =
(4.10.14) E[P(B ∣ G )]
Proof
Basic Examples The purpose of this discussion is to tie the general notions of conditional expected value that we are studying here to the more elementary concepts that you have seen before. Suppose that A is an event (that is, a member of F ) with P(A) > 0 . If B is another event, then of course, the conditional probability of B given A is P(A ∩ B) P(B ∣ A) =
(4.10.15) P(A)
If X is a random variable then the conditional distribution of X given A is the probability measure on R given by P({X ∈ R} ∩ A) R ↦ P(X ∈ R ∣ A) =
for measurable R ⊆ R
(4.10.16)
P(A)
If E(|X|) < ∞ then the conditional expected value of distribution.
given
X
A
, denoted
E(X ∣ A)
, is simply the mean of this conditional
Suppose now that A = {A : i ∈ I } is a countable partition of the sample space Ω into events with positive probability. To review the jargon, A ⊆ F ; the index set I is countable; A ∩ A = ∅ for distinct i, j ∈ I ; ⋃ A = Ω ; and P(A ) > 0 for i ∈ I . Let G = σ(A ) , the σ-algebra generated by A . The elements of G are of the form ⋃ A for J ⊆ I . Moreover, the random variables that are measurable with respect to G are precisely the variables that are constant on A for each i ∈ I . The σ-algebra G is said to be countably generated. i
i
j
i∈I
j∈J
i
i
j
i
If B ∈ F then P(B ∣ G ) is the random variable whose value on A is P(B ∣ A
i)
i
for each i ∈ I .
Proof In this setting, the version of Bayes' theorem in (15) reduces to the usual elementary formulation: For E[P(B ∣ G ); A ] = P(A )P(B ∣ A ) and E[P(B ∣ G )] = ∑ P(A )P(B ∣ A ) . Hence i
i
i
j
j∈I
i ∈ I
,
j
P(Ai )P(B ∣ Ai ) P(Ai ∣ B) =
(4.10.18) ∑
j∈I
P(Aj )P(B ∣ Aj )
If X is a random variable with E(|X|) < ∞ , then E(X ∣ G ) is the random variable whose value on A is E(X ∣ A i ∈ I.
i)
i
for each
Proof The previous examples would apply to G = σ(Y ) if Y is a discrete random variable taking values in a countable set T . In this case, the partition is simply A = {{Y = y} : y ∈ T } . On the other hand, suppose that Y is a random variable taking values in a general set T with σ-algebra T . The real-valued random variables that are measurable with respect to G = σ(Y ) are (up to equivalence) the measurable, real-valued functions of Y . Specializing further, Suppose that X takes values in S ⊆ R , Y takes values in T ⊆ R (where S and T are Lebesgue measurable) and that (X, Y ) has a joint continuous distribution with probability density function f . Then Y has probability density function h given by n
h(y) = ∫
f (x, y) dx,
y ∈ T
(4.10.20)
S
Assume that h(y) > 0 for y ∈ T . Then for y ∈ T , a conditional probability density function of X given Y
=y
is defined by
f (x, y) g(x ∣ y) =
,
x ∈ S
(4.10.21)
h(y)
This is precisely the setting of our elementary discussion of conditional expected value. If E(X ∣ Y ) instead of the clunkier E[X ∣ σ(Y )].
4.10.4
E(|X|) < ∞
then we usually write
https://stats.libretexts.org/@go/page/10338
In this setting above suppose that E(|X|) < ∞ . Then E(X ∣ Y ) = ∫
xg(x ∣ Y ) dx
(4.10.22)
S
Proof
Best Predictor In our elementary treatment of conditional expected value, we showed that the conditional expected value of a real-valued random variable X given a general random variable Y is the best predictor of X, in the least squares sense, among all real-valued functions of Y . A more careful statement is that E(X ∣ Y ) is the best predictor of X among all real-valued random variables that are measurable with respect to σ(Y ). Thus, it should come as not surprise that if G is a sub σ-algebra of F , then E(X ∣ G ) is the best predictor of X, in the least squares sense, among all real-valued random variables that are measurable with respect to G ). We will show that this is indeed the case in this subsection. The proofs are very similar to the ones given in the elementary section. For the rest of this discussion, we assume that G is a sub σ-algebra of F and that all random variables mentioned are real valued. Suppose that X and U are random variables with E(|X|) < ∞ and E(|XU |) < ∞ and that G . Then X − E(X ∣ G ) and U are uncorrelated.
U
is measurable with respect to
Proof The next result is the main one: E(X ∣ G ) is closer to X in the mean square sense than any other random variable that is measurable with respect to G . Thus, if G represents the information that we have, then E(X ∣ G ) is the best we can do in estimating X. Suppose that X and U are random variables with E(X Then
2
)
1 2 1 2 1 2
Answer Suppose that X is uniformly distributed on the interval [0, 1]. Find d (X, t) = E (|X − t|) as a function of graph. Find the minimum value of the function and the value of t where the minimum occurs. 1
t
and sketch the
Suppose that X is uniformly distributed on the set [0, 1] ∪ [2, 3]. Find d (X, t) = E (|X − t|) as a function of t and sketch the graph. Find the minimum value of the function and the values of t where the minimum occurs. 1
Suppose that (X, Y ) has probability density function f (x, y) = x + y for 0 ≤ x ≤ 1 , 0 ≤ y ≤ 1 . Verify Hölder's inequality in the following cases: 1. j = k = 2 2. j = 3 , k =
3 2
Answer
Counterexamples The following exercise shows that convergence with probability 1 does not imply convergence in mean. Suppose that (X
1,
X2 , …)
is a sequence of independent random variables with 1
3
P (X = n ) =
2
1 , P(Xn = 0) = 1 −
n
2
;
n ∈ N+
(4.11.38)
n
1. X → 0 as n → ∞ with probability 1. 2. X → 0 as n → ∞ in probability. 3. E(X ) → ∞ as n → ∞ . n n
n
Proof The following exercise shows that convergence in mean does not imply convergence with probability 1. Suppose that (X
1,
X2 , …)
is a sequence of independent indicator random variables with 1 P(Xn = 1) =
1. P(X 2. P(X 3. P(X
n
= 0 for infinitely many n) = 1
n
= 1 for infinitely many n) = 1
n does
n
1 , P(Xn = 0) = 1 −
; n
n ∈ N+
(4.11.39)
. .
not converge as n → ∞) = 1
.
4.11.7
https://stats.libretexts.org/@go/page/10339
4. X
n
→ 0
as n → ∞ in k th mean for every k ≥ 1 .
Proof The following exercise show that convergence of the k th means does not imply convergence in k th mean. Suppose that U has the Bernoulli distribution with parmaeter n ∈ N and let X = 1 − U . Let k ∈ [1, ∞). Then
1 2
, so that
P(U = 1) = P(U = 0) =
1 2
. Let
Xn = U
for
+
1. E(X ) = E(X ) = for n ∈ N , so E(X ) → E(X ) as n → ∞ 2. E(|X − X| ) = 1 for n ∈ N so X does not converge to X as n → ∞ in L . k n
k
1 2
k n
+
k
k
n
n
k
Proof This page titled 4.11: Vector Spaces of Random Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.11.8
https://stats.libretexts.org/@go/page/10339
4.12: Uniformly Integrable Variables Two of the most important modes of convergence in probability theory are convergence with probability 1 and convergence in mean. As we have noted several times, neither mode of convergence implies the other. However, if we impose an additional condition on the sequence of variables, convergence with probability 1 will imply convergence in mean. The purpose of this brief, but advanced section, is to explore the additional condition that is needed. This section is particularly important for the theory of martingales.
Basic Theory As usual, our starting point is a random experiment modeled by a probability space (Ω, F , P). So Ω is the set of outcomes, F is the σ-algebra of events, and P is the probability measure on the sample space (Ω, F ). In this section, all random variables that are mentioned are assumed to be real valued, unless otherwise noted. Next, recall from the section on vector spaces that for ,
1/k
is the vector space of random variables X with E(|X| ) < ∞ , endowed with the norm ∥X∥ = [E(X )] . In particular, X ∈ L simply means that E(|X|) < ∞ so that E(X) exists as a real number. From the section on expected value as an integral, recall the following notation, assuming of course that the expected value makes sense: k
k ∈ [1, ∞) Lk
k
k
1
E(X; A) = E(X 1A ) = ∫
X dP
(4.12.1)
A
Definition The following result is motivation for the main definition in this section. If X is a random variable then E(|X|) < ∞ if and only if E(|X|; |X| ≥ x) → 0 as x → ∞ . Proof Suppose now that X is a random variable for each i in a nonempty index set I (not necessarily countable). The critical definition for this section is to require the convergence in the previous theorem to hold uniformly for the collection of random variables X = {X : i ∈ I } . i
i
The collection X = {X
i
: i ∈ I}
is uniformly integrable if for each ϵ > 0 there exists x > 0 such that for all i ∈ I , E(| Xi |; | Xi | > x) < ϵ
Equivalently E(|X
i |;
| Xi | > x) → 0
(4.12.3)
as x → ∞ uniformly in i ∈ I .
Properties Our next discussion centers on conditions that ensure that the collection of random variables integrable. Here is an equivalent characterization: The collection X = {X
i
: i ∈ I}
X = { Xi : i ∈ I }
is uniformly
is uniformly integrable if and only if the following conditions hold:
1. {E(|X |) : i ∈ I } is bounded. 2. For each ϵ > 0 there exists δ > 0 such that if A ∈ F and P(A) < δ then E(|X i
i |;
A) < ϵ
for all i ∈ I .
Proof Condition (a) means that X is bounded (in norm) as a subset of the vector space random variables is uniformly integrable. Suppose that I is finite and that E(|X
i |)
0 such that |X
i|
≤c
for all i ∈ I then X = {X
i
: i ∈ I}
is uniformly integrable.
Just having E(|X |) bounded in i ∈ I (condition (a) in the characterization above) is not sufficient for X = {X : i ∈ I } to be uniformly integrable; a counterexample is given below. However, if E(|X | ) is bounded in i ∈ I for some k > 1 , then X is uniformly integrable. This condition means that X is bounded (in norm) as a subset of the vector space L . i
i
k
i
k
If {E(|X
k
i|
: i ∈ I}
is bounded for some k > 1 , then {X
i
: i ∈ I}
is uniformly integrable.
Proof Uniformly integrability is closed under the operations of addition and scalar multiplication. Suppose that X = {X : i ∈ I } and Y collections is also uniformly integrable. i
= { Yi : i ∈ I }
are uniformly integrable and that
c ∈ R
. Then each of the following
1. X + Y = {X + Y : i ∈ I } 2. cX = {cX : i ∈ I } i
i
i
Proof The following corollary is trivial, but will be needed in our discussion of convergence below. Suppose that {X : i ∈ I } is uniformly integrable and that X is a random variable with E(|X|) < ∞ . Then {X is uniformly integrable. i
i
− X : i ∈ I}
Proof
Convergence We now come to the main results, and the reason for the definition of uniform integrability in the first place. To set up the notation, suppose that X is a random variable for n ∈ N and that X is a random variable. We know that if X → X as n → ∞ in mean then X → X as n → ∞ in probability. The converse is also true if and only if the sequence is uniformly integrable. Here is the first half: n
+
n
n
If X
→ X
n
as n → ∞ in mean, then {X
n
: n ∈ N}
is uniformly integrable.
Proof Here is the more important half, known as the uniform integrability theorem: If {X
n
: n ∈ N+ }
is uniformly integrable and X
n
→ X
as n → ∞ in probability, then X
n
→ X
as n → ∞ in mean.
Proof As a corollary, recall that if X → X as n → ∞ with probability 1, then X X = { X : n ∈ N } is uniformly integrable then X → X as n → ∞ in mean. n
n
+
n
→ X
as
n → ∞
in probability. Hence if
n
4.12.2
https://stats.libretexts.org/@go/page/10340
Examples Our first example shows that bounded L norm is not sufficient for uniform integrability. 1
Suppose that U is uniformly distributed on the interval X = n1(U ≤ 1/n) . Then
(0, 1)
(so U has the standard uniform distribution). For
n ∈ N+
, let
n
1. E(|X 2. E(|X
n |) n |;
=1 | Xn
for all n ∈ N | > x) = 1 for x > 0 , n ∈ N +
+
with n > x
Proof By part (b), E(|X integrable.
n |;
| Xn | > x)
does not converge to 0 as x → ∞ uniformly in n ∈ N , so X = {X +
n
: n ∈ N+ }
is not uniformly
The next example gives an important application to conditional expected value. Recall that if X is a random variable with E(|X|) < ∞ and G is a sub σ-algebra of F then E(X ∣ G ) is the expected value of X given the information in G , and is the G measurable random variable closest to X in a sense. Indeed if X ∈ L (F ) then E(X ∣ G ) is the projection of X onto L (G ). The collection of all conditional expected values of X is uniformly integrable: 2
Suppose that X is a real-valued random variable with uniformly integrable.
E(|X|) < ∞
2
. Then
{E(X ∣ G ) : G is a sub σ-algebra of F }
is
Proof Note that the collection of sub σ-algebras of F , and so also the collection of conditional expected values above, might well be uncountable. The conditional expected values range from E(X), when G = {Ω, ∅} to X itself, when G = F . This page titled 4.12: Uniformly Integrable Variables is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
4.12.3
https://stats.libretexts.org/@go/page/10340
4.13: Kernels and Operators The goal of this section is to study a type of mathematical object that arises naturally in the context of conditional expected value and parametric distributions, and is of fundamental importance in the study of stochastic processes, particularly Markov processes. In a sense, the main object of study in this section is a generalization of a matrix, and the operations generalizations of matrix operations. If you keep this in mind, this section may seem less abstract.
Basic Theory Definitions Recall that a measurable space (S, S ) consists of a set S and a σ-algebra S of subsets of S . If μ is a positive measure on (S, S ), then (S, S , μ) is a measure space. The two most important special cases that we have studied frequently are 1. Discrete: S is countable, S = P(S) is the collection of all subsets of S , and μ = # is counting measure on (S, S ). 2. Euclidean: S is a measurable subset of R for some n ∈ N , S is the collection of subsets of S that are also measurable, and μ = λ is n -dimensional Lebesgue measure on (S, S ). n
+
n
More generally, S usually comes with a topology that is locally compact, Hausdorff, with a countable base (LCCB), and S is the Borel σ-algebra, the σ-algebra generated by the topology (the collection of open subsets of S ). The measure μ is usually a Borel measure, and so satisfies μ(C ) < ∞ if C ⊆ S is compact. A discrete measure space is of this type, corresponding to the discrete topology. A Euclidean measure space is also of this type, corresponding to the Euclidean topology, if S is open or closed (which is usually the case). In the discrete case, every function from S to another measurable space is measurable, and every from function from S to another topological space is continuous, so the measure theory is not really necessary. Recall also that the measure space (S, S , μ) is σ-finite if there exists a countable collection {A : i ∈ I } ⊆ S such that μ(A ) < ∞ for i ∈ I and S = ⋃ A . If (S, S , μ) is a Borel measure space corresponding to an LCCB topology, then it is σfinite. i
i
i∈I
i
If f : S → R is measurable, define ∥f ∥ = sup{|f (x)| : x ∈ S}. Of course we may well have ∥f ∥ = ∞. Let B(S) denote the collection of bounded measurable functions f : S → R . Under the usual operations of pointwise addition and scalar multiplication, B(S) is a vector space, and ∥ ⋅ ∥ is the natural norm on this space, known as the supremum norm. This vector space plays an important role. In this section, it is sometimes more natural to write integrals with respect to the positive measure μ with the differential before the integrand, rather than after. However, rest assured that this is mere notation, the meaning of the integral is the same. So if f : S → R is measurable then we may write the integral of f with respect to μ in operator notation as μf = ∫
μ(dx)f (x)
(4.13.1)
S
assuming, as usual, that the integral exists. This will be the case if f is nonnegative, although ∞ is a possible value. More generally, the integral exists in R ∪ {−∞, ∞} if μf < ∞ or μf < ∞ where f and f are the positive and negative parts of f . If both are finite, the integral exists in R (and f is integrable with respect to μ ). If If μ is a probability measure and we think of (S, S ) as the sample space of a random experiment, then we can think of f as a real-valued random variable, in which case our new notation is not too far from our traditional expected value E(f ). Our main definition comes next. +
−
+
−
Suppose that (S, S ) and (T , T ) are measurable spaces. A kernel from (S, S ) to (T , T ) is a function K : S × T such that
→ [0, ∞]
1. x ↦ K(x, A) is a measurable function from S into [0, ∞] for each A ∈ T . 2. A ↦ K(x, A) is a positive measure on T for each x ∈ S . If (T , T ) = (S, S ) , then K is said to be a kernel on (S, S ). There are several classes of kernels that deserve special names.
4.13.1
https://stats.libretexts.org/@go/page/10499
Suppose that K is a kernel from (S, S ) to (T , T ). Then 1. K 2. K 3. K 4. K
is σ-finite if the measure K(x, ⋅) is σ-finite for every x ∈ S . is finite if K(x, T ) < ∞ for every x ∈ S . is bounded if K(x, T ) is bounded in x ∈ S . is a probability kernel if K(x, T ) = 1 for every x ∈ S .
Define ∥K∥ = sup{K(x, T ) : x ∈ S} , so that ∥K∥ < ∞ if K is a bounded kernel and ∥K∥ = 1 if K is a probability kernel. So a probability kernel is bounded, a bounded kernel is finite, and a finite kernel is σ-finite. The terms stochastic kernel and Markov kernel are also used for probability kernels, and for a probability kernel ∥K∥ = 1 of course. The terms are consistent with terms used for measures: K is a finite kernel if and only if K(x, ⋅) is a finite measure for each x ∈ S , and K is a probability kernel if and only if K(x, ⋅) is a probability measure for each x ∈ S . Note that ∥K∥ is simply the supremum norm of the function x ↦ K(x, T ) . A kernel defines two natural integral operators, by operating on the left with measures, and by operating on the right with functions. As usual, we are often a bit casual witht the question of existence. Basically in this section, we assume that any integrals mentioned exist. Suppose that K is a kernel from (S, S ) to (T , T ). 1. If μ is a positive measure on (S, S ), then μK defined as follows is a positive measure on (T , T ): μK(A) = ∫
μ(dx)K(x, A),
A ∈ T
(4.13.2)
S
2. If f
: T → R
is measurable, then Kf
: S → R
defined as follows is measurable (assuming that the integrals exist in R):
Kf (x) = ∫
K(x, dy)f (y),
x ∈ S
(4.13.3)
T
Proof
Thus, a kernel transforms measures on (S, S) into measures on (T, T), and transforms certain measurable functions from T to R into measurable functions from S to R. Again, part (b) assumes that f is integrable with respect to the measure K(x, ⋅) for every x ∈ S. In particular, the last statement will hold in the following important special case:
Suppose that K is a kernel from (S, S) to (T, T) and that f ∈ B(T).
1. If K is finite then Kf is defined and ∥Kf∥ ≤ ∥K∥∥f∥.
2. If K is bounded then Kf ∈ B(S).
Proof
The identity kernel I on the measurable space (S, S) is defined by I(x, A) = 1(x ∈ A) for x ∈ S and A ∈ S. Thus, I(x, A) = 1 if x ∈ A and I(x, A) = 0 if x ∉ A. So x ↦ I(x, A) is the indicator function of A ∈ S, while A ↦ I(x, A) is point mass at x ∈ S. Clearly the identity kernel is a probability kernel. If we need to indicate the dependence on the particular space, we will add a subscript. The following result justifies the name.
Let I denote the identity kernel on (S, S).
1. If μ is a positive measure on (S, S) then μI = μ.
2. If f : S → R is measurable, then If = f.
Constructions
We can create a new kernel from two given kernels, by the usual operations of addition and scalar multiplication.
Suppose that K and L are kernels from (S, S) to (T, T), and that c ∈ [0, ∞). Then cK and K + L defined below are also kernels from (S, S) to (T, T).
1. (cK)(x, A) = cK(x, A) for x ∈ S and A ∈ T .
2. (K + L)(x, A) = K(x, A) + L(x, A) for x ∈ S and A ∈ T.
If K and L are σ-finite (finite) (bounded) then cK and K + L are σ-finite (finite) (bounded), respectively.
Proof
A simple corollary of the last result is that if a, b ∈ [0, ∞) then aK + bL is a kernel from (S, S) to (T, T). In particular, if K, L are probability kernels and p ∈ (0, 1) then pK + (1 − p)L is a probability kernel. A more interesting and important way to form a new kernel from two given kernels is via a “multiplication” operation.
Suppose that K is a kernel from (R, R) to (S, S) and that L is a kernel from (S, S) to (T, T). Then KL defined as follows is a kernel from (R, R) to (T, T):
KL(x, A) = ∫_S K(x, dy) L(y, A), x ∈ R, A ∈ T (4.13.5)
1. If K is finite and L is bounded then KL is finite.
2. If K and L are bounded then KL is bounded.
3. If K and L are stochastic then KL is stochastic.
Proof
Once again, the identity kernel lives up to its name: Suppose that K is a kernel from (S, S) to (T, T). Then
1. I_S K = K
2. K I_T = K
The next several results show that the operations are associative whenever they make sense.
Suppose that K is a kernel from (S, S) to (T, T), μ is a positive measure on S, c ∈ [0, ∞), and f : T → R is measurable. Then, assuming that the appropriate integrals exist,
1. c(μK) = (cμ)K
2. c(Kf) = (cK)f
3. (μK)f = μ(Kf)
Proof
Suppose that K is a kernel from (R, R) to (S, S) and L is a kernel from (S, S) to (T, T). Suppose also that μ is a positive measure on (R, R), f : T → R is measurable, and c ∈ [0, ∞). Then, assuming that the appropriate integrals exist,
1. (μK)L = μ(KL)
2. K(Lf) = (KL)f
3. c(KL) = (cK)L
Proof
Suppose that K is a kernel from (R, R) to (S, S), L is a kernel from (S, S) to (T, T), and M is a kernel from (T, T) to (U, U). Then (KL)M = K(LM).
Proof
The next several results show that the distributive property holds whenever the operations make sense.
Suppose that K and L are kernels from (R, R) to (S, S) and that M and N are kernels from (S, S) to (T, T). Suppose also that μ is a positive measure on (R, R) and that f : S → R is measurable. Then, assuming that the appropriate integrals exist,
1. (K + L)M = KM + LM
2. K(M + N) = KM + KN
3. μ(K + L) = μK + μL
4. (K + L)f = Kf + Lf
Suppose that K is a kernel from (S, S) to (T, T), and that μ and ν are positive measures on (S, S), and that f and g are measurable functions from T to R. Then, assuming that the appropriate integrals exist,
1. (μ + ν)K = μK + νK
2. K(f + g) = Kf + Kg
3. μ(f + g) = μf + μg
4. (μ + ν)f = μf + νf
In particular, note that if K is a kernel from (S, S) to (T, T), then the transformation μ ↦ μK defined for positive measures on (S, S), and the transformation f ↦ Kf defined for measurable functions f : T → R (for which Kf exists), are both linear operators. If μ is a positive measure on (S, S), then the integral operator f ↦ μf defined for measurable f : S → R (for which μf exists) is also linear, but of course, we already knew that. Finally, note that the operator f ↦ Kf is positive: if f ≥ 0 then Kf ≥ 0. Here is the important summary of our results when the kernel is bounded.
If K is a bounded kernel from (S, S) to (T, T), then f ↦ Kf is a bounded, linear transformation from B(T) to B(S) and ∥K∥ is the norm of the transformation.
The commutative property for the product of kernels fails with a passion. If K and L are kernels, then depending on the measurable spaces, KL may be well defined, but not LK. Even if both products are defined, they may be kernels from or to different measurable spaces. Even if both are defined from and to the same measurable spaces, it may well happen that KL ≠ LK. Some examples are given below.
If K is a kernel on (S, S) and n ∈ N, we let K^n = KK ⋯ K, the n-fold power of K. By convention, K^0 = I, the identity kernel on S.
Fixed points of the operators associated with a kernel turn out to be very important.
Suppose that K is a kernel on (S, S).
1. A positive measure μ on (S, S) such that μK = μ is said to be invariant for K.
2. A measurable function f : S → R such that Kf = f is said to be invariant for K.
So in the language of linear algebra (or functional analysis), an invariant measure is a left eigenvector of the kernel, while an invariant function is a right eigenvector of the kernel, both corresponding to the eigenvalue 1. By our results above, if μ and ν are invariant measures and c ∈ [0, ∞), then μ + ν and cμ are also invariant. Similarly, if f and g are invariant functions and c ∈ R, then f + g and cf are also invariant. Of course we are particularly interested in probability kernels.
Suppose that P is a probability kernel from (R, R) to (S, S) and that Q is a probability kernel from (S, S) to (T, T). Suppose also that μ is a probability measure on (R, R). Then
1. PQ is a probability kernel from (R, R) to (T, T).
2. μP is a probability measure on (S, S).
Proof
As a corollary, it follows that if P is a probability kernel on (S, S), then so is P^n for n ∈ N₊.
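On a finite space a probability kernel is a stochastic matrix, and an invariant probability measure is a left eigenvector for the eigenvalue 1, so it can be found numerically. Here is a minimal sketch in Python with NumPy (not part of the original text; the matrix is an arbitrary illustration):

import numpy as np

# An arbitrary probability kernel (stochastic matrix) on a two-point space
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# An invariant measure mu satisfies mu P = mu, that is, P^T mu^T = mu^T,
# so take the eigenvector of P^T for eigenvalue 1 and normalize it.
w, v = np.linalg.eig(P.T)
mu = np.real(v[:, np.argmax(np.isclose(w, 1.0))])
mu = mu / mu.sum()
print(mu, mu @ P)       # mu and mu P agree

# The constant function is invariant on the right: P f = f
f = np.ones(2)
print(P @ f)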
The operators associated with a kernel are of fundamental importance, and we can easily recover the kernel from the operators. Suppose that K is a kernel from (S, S) to (T, T), and let x ∈ S and A ∈ T. Then trivially, K1_A(x) = K(x, A) where as usual, 1_A is the indicator function of A. Trivially also δ_x K(A) = K(x, A) where δ_x is point mass at x.
Kernel Functions
Usually our measurable spaces are in fact measure spaces, with natural measures associated with the spaces, as in the special cases described in (1). When we start with measure spaces, kernels are usually constructed from density functions in much the same way that positive measures are defined from density functions.
Suppose that (S, S, λ) and (T, T, μ) are measure spaces. As usual, S × T is given the product σ-algebra S ⊗ T. If k : S × T → [0, ∞) is measurable, then the function K defined as follows is a kernel from (S, S) to (T, T):
K(x, A) = ∫_A k(x, y) μ(dy), x ∈ S, A ∈ T (4.13.9)
Proof
Clearly the kernel K depends on the positive measure μ on (T, T) as well as the function k, while the measure λ on (S, S) plays no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Appropriately enough, the function k is called a kernel density function (with respect to μ), or simply a kernel function.
Suppose again that (S, S, λ) and (T, T, μ) are measure spaces. Suppose also K is a kernel from (S, S) to (T, T) with kernel function k. If f : T → R is measurable, then, assuming that the integrals exist,
Kf(x) = ∫_T k(x, y) f(y) μ(dy), x ∈ S (4.13.10)
Proof
A kernel function defines an operator on the left with functions on S in a completely analogous way to the operator on the right above with functions on T.
Suppose again that (S, S, λ) and (T, T, μ) are measure spaces, and that K is a kernel from (S, S) to (T, T) with kernel function k. If f : S → R is measurable, then the function fK : T → R defined as follows is also measurable, assuming that the integrals exist:
fK(y) = ∫_S λ(dx) f(x) k(x, y), y ∈ T (4.13.12)
The operator defined above depends on the measure λ on (S, S) as well as the kernel function k, while the measure μ on (T, T) plays no role (and so is not even necessary). But again, our point of view is that the spaces have fixed, natural measures. Here is how our new operation on the left with functions relates to our old operation on the left with measures.
Suppose again that (S, S, λ) and (T, T, μ) are measure spaces, and that K is a kernel from (S, S) to (T, T) with kernel function k. Suppose also that f : S → [0, ∞) is measurable, and let ρ denote the measure on (S, S) that has density f with respect to λ. Then fK is the density of the measure ρK with respect to μ.
Proof
As always, we are particularly interested in stochastic kernels. With a kernel function, we can have doubly stochastic kernels.
Suppose again that (S, S, λ) and (T, T, μ) are measure spaces and that k : S × T → [0, ∞) is measurable. Then k is a doubly stochastic kernel function if
1. ∫_T k(x, y) μ(dy) = 1 for x ∈ S
2. ∫_S λ(dx) k(x, y) = 1 for y ∈ T
Of course, condition (a) simply means that the kernel associated with k is a stochastic kernel according to our original definition. The most common and important special case is when the two spaces are the same. Thus, if (S, S, λ) is a measure space and k : S × S → [0, ∞) is measurable, then we have an operator K that operates on the left and on the right with measurable functions f : S → R:
fK(y) = ∫_S λ(dx) f(x) k(x, y), y ∈ S
Kf(x) = ∫_S k(x, y) f(y) λ(dy), x ∈ S
If f is nonnegative and μ is the measure on (S, S) with density function f, then fK is the density function of the measure μK (both with respect to λ).
Suppose again that (S, S, λ) is a measure space and k : S × S → [0, ∞) is measurable. Then k is symmetric if k(x, y) = k(y, x) for all (x, y) ∈ S².
Of course, if k is a symmetric, stochastic kernel function on (S, S, λ) then k is doubly stochastic, but the converse is not true.
Suppose that (R, R, λ), (S, S, μ), and (T, T, ρ) are measure spaces. Suppose also that K is a kernel from (R, R) to (S, S) with kernel function k, and that L is a kernel from (S, S) to (T, T) with kernel function l. Then the kernel KL from (R, R) to (T, T) has density kl given by
kl(x, z) = ∫_S k(x, y) l(y, z) μ(dy), (x, z) ∈ R × T (4.13.13)
Proof
Examples and Special Cases
The Discrete Case
In this subsection, we assume that the measure spaces are discrete, as described in (1). Since the σ-algebra (all subsets) and the measure (counting measure) are understood, we don't need to reference them. Recall that integrals with respect to counting measure are sums. Suppose now that K is a kernel from the discrete space S to the discrete space T. For x ∈ S and y ∈ T, let K(x, y) = K(x, {y}). Then more generally,
K(x, A) = ∑_{y∈A} K(x, y), x ∈ S, A ⊆ T (4.13.14)
The function (x, y) ↦ K(x, y) is simply the kernel function of the kernel K, as defined above, but in this case we usually don't bother with using a different symbol for the function as opposed to the kernel. The function K can be thought of as a matrix, with rows indexed by S and columns indexed by T (and so an infinite matrix if S or T is countably infinite). With this interpretation, all of the operations defined above can be thought of as matrix operations. If f : T → R and f is thought of as a column vector indexed by T, then Kf is simply the ordinary product of the matrix K and the vector f; the product is a column vector indexed by S:
Kf(x) = ∑_{y∈T} K(x, y) f(y), x ∈ S (4.13.15)
Similarly, if f : S → R and f is thought of as a row vector indexed by S, then fK is simply the ordinary product of the vector f and the matrix K; the product is a row vector indexed by T:
fK(y) = ∑_{x∈S} f(x) K(x, y), y ∈ T (4.13.16)
If L is another kernel from T to another discrete space U, then as functions, KL is simply the matrix product of K and L:
KL(x, z) = ∑_{y∈T} K(x, y) L(y, z), (x, z) ∈ S × U (4.13.17)
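Since the discrete operators are ordinary matrix products, they are easy to compute numerically. The following sketch in Python with NumPy (not part of the original text) sets up the kernel K(x, y) = x + y and the functions from the first exercise below, with the array layout (rows indexed by S, columns by T) following the matrix interpretation just described:

import numpy as np

# Kernel K(x, y) = x + y from S = {1, 2, 3} to T = {1, 2, 3, 4}
S = np.array([1, 2, 3])
T = np.array([1, 2, 3, 4])
K = S[:, None] + T[None, :]   # K[i, j] = S[i] + T[j]

f = np.array([1, 2, 6])       # f(x) = x! on S, viewed as a row vector
g = T ** 2                    # g(y) = y^2 on T, viewed as a column vector

fK = f @ K                    # row vector indexed by T: fK(y) = sum over x of f(x) K(x, y)
Kg = K @ g                    # column vector indexed by S: Kg(x) = sum over y of K(x, y) g(y)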
Let S = {1, 2, 3} and T = {1, 2, 3, 4}. Define the kernel K from S to T by K(x, y) = x + y for (x, y) ∈ S × T. Define the function f on S by f(x) = x! for x ∈ S, and define the function g on T by g(y) = y² for y ∈ T. Compute each of the following using matrix algebra:
1. fK
2. Kg
Answer
Let R = {0, 1}, S = {a, b}, and T = {1, 2, 3}. Define the kernel K from R to S, the kernel L from S to S, and the kernel M from S to T in matrix form as follows:
K = [ 1 4 ]   L = [ 2 2 ]   M = [ 1 0 2 ]
    [ 2 3 ]       [ 1 5 ]       [ 0 3 1 ]   (4.13.20)
Compute each of the following kernels, or explain why the operation does not make sense:
1. KL
2. LK
3. K²
4. L²
5. KM
6. LM
Proof
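One way to check which of these products is defined is to compare shapes: a kernel between finite spaces is a matrix, and a product makes sense only when the target space of the first factor is the source space of the second. A minimal Python sketch (not part of the original text), using the matrix entries as displayed above:

import numpy as np

K = np.array([[1, 4], [2, 3]])        # kernel from R = {0, 1} to S = {a, b}
L = np.array([[2, 2], [1, 5]])        # kernel on S
M = np.array([[1, 0, 2], [0, 3, 1]])  # kernel from S to T = {1, 2, 3}

KL = K @ L    # defined: R to S, then S to S
L2 = L @ L    # defined: L is a kernel on S, so L^2 makes sense
KM = K @ M    # defined: R to S, then S to T
LM = L @ M    # defined: S to S, then S to T
# LK and K^2 are not meaningful as kernels: L maps into S while K starts
# at R, and K is not a kernel on a single space, so it cannot be squared.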
Conditional Probability
An important class of probability kernels arises from the distribution of one random variable, conditioned on the value of another random variable. In this subsection, suppose that (Ω, F, P) is a probability space, and that (S, S) and (T, T) are measurable spaces. Further, suppose that X and Y are random variables defined on the probability space, with X taking values in S and Y taking values in T. Informally, X and Y are random variables defined on the same underlying random experiment.
The function P defined as follows is a probability kernel from (S, S) to (T, T), known as the conditional probability kernel of Y given X.
P(x, A) = P(Y ∈ A ∣ X = x), x ∈ S, A ∈ T (4.13.25)
Proof
The operators associated with this kernel have natural interpretations. Let P be the conditional probability kernel of Y given X.
1. If f : T → R is measurable, then Pf(x) = E[f(Y) ∣ X = x] for x ∈ S (assuming as usual that the expected value exists).
2. If μ is the probability distribution of X then μP is the probability distribution of Y.
Proof
As in the general discussion above, the measurable spaces (S, S) and (T, T) are usually measure spaces with natural measures attached. So the conditional probability distributions are often given via conditional probability density functions, which then play the role of kernel functions. The next two exercises give examples.
Suppose that X and Y are random variables for an experiment, taking values in R. For x ∈ R, the conditional distribution of Y given X = x is normal with mean x and standard deviation 1. Use the notation and operations of this section for the following computations:
1. Give the kernel function for the conditional distribution of Y given X.
2. Find E(Y² ∣ X = x).
3. Suppose that X has the standard normal distribution. Find the probability density function of Y.
Answer
Suppose that X and Y are random variables for an experiment, with X taking values in {a, b, c} and Y taking values in {1, 2, 3, 4}. The kernel function of Y given X is as follows: P(a, y) = 1/4, P(b, y) = y/10, and P(c, y) = y²/30, each for y ∈ {1, 2, 3, 4}.
1. Give the kernel P in matrix form and verify that it is a probability kernel.
2. Find fP where f(a) = f(b) = f(c) = 1/3. The result is the density function of Y given that X is uniformly distributed.
3. Find Pg where g(y) = y for y ∈ {1, 2, 3, 4}. The resulting function is E(Y ∣ X = x) for x ∈ {a, b, c}.
Answer
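These computations are again just matrix algebra; here is a brief numerical sketch in Python with NumPy (not part of the original text):

import numpy as np

y = np.arange(1, 5)                  # Y takes values 1, 2, 3, 4
P = np.array([np.full(4, 1 / 4),     # P(a, y) = 1/4
              y / 10,                # P(b, y) = y/10
              y ** 2 / 30])          # P(c, y) = y^2/30

print(P.sum(axis=1))                 # each row sums to 1, so P is a probability kernel

f = np.full(3, 1 / 3)                # uniform distribution for X
print(f @ P)                         # fP: the density function of Y

g = y.astype(float)                  # g(y) = y
print(P @ g)                         # Pg: E(Y | X = x) for x = a, b, c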
Parametric Distributions
A parametric probability distribution also defines a probability kernel in a natural way, with the parameter playing the role of the kernel variable, and the distribution playing the role of the measure. Such distributions are usually defined in terms of a parametric density function which then defines a kernel function, again with the parameter playing the role of the first argument and the
variable the role of the second argument. If the parameter is thought of as a given value of another random variable, as in Bayesian analysis, then there is considerable overlap with the previous subsection. In most cases (and in particular in the examples below), the spaces involved are either discrete or Euclidean, as described in (1).
Consider the parametric family of exponential distributions. Let f denote the identity function on (0, ∞).
1. Give the probability density function as a probability kernel function p on (0, ∞).
2. Find Pf.
3. Find fP.
4. Find p₂, the kernel function corresponding to the product kernel P².
Answer
Consider the parametric family of Poisson distributions. Let f be the identity function on N and let g be the identity function on (0, ∞).
1. Give the probability density function p as a probability kernel function from (0, ∞) to N.
2. Show that Pf = g.
3. Show that gP = f.
Answer
Clearly the Poisson distribution has some very special and elegant properties. The next family of distributions also has some very special properties. Compare this exercise with the exercise (30).
Consider the family of normal distributions, parameterized by the mean and with variance 1.
1. Give the probability density function as a probability kernel function p on R.
2. Show that p is symmetric.
3. Let f be the identity function on R. Show that Pf = f and fP = f.
4. For n ∈ N₊, find pₙ, the kernel function for the operator Pⁿ.
Answer
For each of the following special distributions, express the probability density function as a probability kernel function. Be sure to specify the parameter spaces.
1. The general normal distribution on R.
2. The beta distribution on (0, 1).
3. The negative binomial distribution on N.
Answer
This page titled 4.13: Kernels and Operators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
CHAPTER OVERVIEW
5: Special Distributions
In this chapter, we study several general families of probability distributions and a number of special parametric families of distributions. Unlike the other expository chapters in this text, the sections are not linearly ordered and so this chapter serves primarily as a reference. You may want to study these topics as the need arises. First, we need to discuss what makes a probability distribution special in the first place. In some cases, a distribution may be important because it is connected with other special distributions in interesting ways (via transformations, limits, conditioning, etc.). In some cases, a parametric family may be important because it can be used to model a wide variety of random phenomena. This may be the case because of fundamental underlying principles, or simply because the family has a rich collection of probability density functions with a small number of parameters (usually 3 or less). As a general philosophical principle, we try to model a random process with as few parameters as possible; this is sometimes referred to as the principle of parsimony of parameters. In turn, this is a special case of Ockham's razor, named in honor of William of Ockham, the principle that states that one should use the simplest model that adequately describes a given phenomenon. Parsimony is important because often the parameters are not known and must be estimated. In many cases, a special parametric family of distributions will have one or more distinguished standard members, corresponding to specified values of some of the parameters. Usually the standard distributions will be mathematically simplest, and often other members of the family can be constructed from the standard distributions by simple transformations on the underlying standard random variable. An incredible variety of special distributions have been studied over the years, and new ones are constantly being added to the literature. To truly deserve the adjective special, a distribution should have a certain level of mathematical elegance and economy, and should arise in interesting and diverse applications.
5.1: Location-Scale Families
5.2: General Exponential Families
5.3: Stable Distributions
5.4: Infinitely Divisible Distributions
5.5: Power Series Distributions
5.6: The Normal Distribution
5.7: The Multivariate Normal Distribution
5.8: The Gamma Distribution
5.9: Chi-Square and Related Distribution
5.10: The Student t Distribution
5.11: The F Distribution
5.12: The Lognormal Distribution
5.13: The Folded Normal Distribution
5.14: The Rayleigh Distribution
5.15: The Maxwell Distribution
5.16: The Lévy Distribution
5.17: The Beta Distribution
5.18: The Beta Prime Distribution
5.19: The Arcsine Distribution
5.20: General Uniform Distributions
5.21: The Uniform Distribution on an Interval
5.22: Discrete Uniform Distributions
5.23: The Semicircle Distribution
5.24: The Triangle Distribution
5.25: The Irwin-Hall Distribution
5.26: The U-Power Distribution
5.27: The Sine Distribution
5.28: The Laplace Distribution
5.29: The Logistic Distribution
5.30: The Extreme Value Distribution
5.31: The Hyperbolic Secant Distribution
5.32: The Cauchy Distribution
5.33: The Exponential-Logarithmic Distribution
5.34: The Gompertz Distribution
5.35: The Log-Logistic Distribution
5.36: The Pareto Distribution
5.37: The Wald Distribution
5.38: The Weibull Distribution
5.39: Benford's Law
5.40: The Zeta Distribution
5.41: The Logarithmic Series Distribution
This page titled 5: Special Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.1: Location-Scale Families
General Theory
As usual, our starting point is a random experiment modeled by a probability space (Ω, F, P), so that Ω is the set of outcomes, F the collection of events, and P the probability measure on the sample space (Ω, F). In this section, we assume that we have a fixed random variable Z defined on the probability space, taking values in R.
Definition For a ∈ R and b ∈ (0, ∞), let X = a + b Z . The two-parameter family of distributions associated with X is called the location-scale family associated with the given distribution of Z . Specifically, a is the location parameter and b the scale parameter. Thus a linear transformation, with positive slope, of the underlying random variable Z creates a location-scale family for the underlying distribution. In the special case that b = 1 , the one-parameter family is called the location family associated with the given distribution, and in the special case that a = 0 , the one-parameter family is called the scale family associated with the given distribution. Scale transformations, as the name suggests, occur naturally when physical units are changed. For example, if a random variable represents the length of an object, then a change of units from meters to inches corresponds to a scale transformation. Location transformations often occur when the zero reference point is changed, in measuring distance or time, for example. Location-scale transformations can also occur with a change of physical units. For example, if a random variable represents the temperature of an object, then a change of units from Fahrenheit to Celsius corresponds to a location-scale transformation.
Distribution Functions
Our goal is to relate various functions that determine the distribution of X = a + bZ to the corresponding functions for Z. First we consider the (cumulative) distribution function.
If Z has distribution function G then X has distribution function F given by
F(x) = G((x − a)/b), x ∈ R (5.1.1)
Proof
Next we consider the probability density function. The results are a bit different for discrete distributions and continuous distributions, not surprising since the density function has different meanings in these two cases.
If Z has a discrete distribution with probability density function g then X also has a discrete distribution, with probability density function f given by
f(x) = g((x − a)/b), x ∈ R (5.1.3)
Proof
If Z has a continuous distribution with probability density function g, then X also has a continuous distribution, with probability density function f given by
f(x) = (1/b) g((x − a)/b), x ∈ R (5.1.5)
1. For the location family associated with g , the graph of f is obtained by shifting the graph of g , a units to the right if a > 0 and −a units to the left if a < 0 . 2. For the scale family associated with g , if b > 1 , the graph of f is obtained from the graph of g by stretching horizontally and compressing vertically, by a factor of b . If 0 < b < 1 , the graph of f is obtained from the graph of g by compressing horizontally and stretching vertically, by a factor of b .
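The distribution-function and density relations above are easy to verify numerically. Here is a minimal sketch in Python (not part of the original text), taking the standard normal distribution for Z and using SciPy for the base functions; the values a = 2, b = 3 are arbitrary illustrations:

import numpy as np
from scipy import stats

a, b = 2.0, 3.0              # location and scale parameters (arbitrary choices)
Z = stats.norm()             # base variable Z: standard normal
x = np.linspace(-5.0, 15.0, 9)

F = Z.cdf((x - a) / b)       # F(x) = G((x - a)/b)
f = Z.pdf((x - a) / b) / b   # f(x) = (1/b) g((x - a)/b)

X = stats.norm(loc=a, scale=b)   # SciPy encodes the same family via loc and scale
assert np.allclose(F, X.cdf(x))
assert np.allclose(f, X.pdf(x))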
Proof
If Z has a mode at z, then X has a mode at x = a + bz.
Proof
Next we relate the quantile functions of Z and X.
If G and F are the distribution functions of Z and X, respectively, then
1. F⁻¹(p) = a + b G⁻¹(p) for p ∈ (0, 1)
2. If z is a quantile of order p for Z then x = a + bz is a quantile of order p for X.
Proof
Suppose now that Z has a continuous distribution on [0, ∞), and that we think of Z as the failure time of a device (or the time of death of an organism). Let X = bZ where b ∈ (0, ∞), so that the distribution of X is the scale family associated with the distribution of Z. Then X also has a continuous distribution on [0, ∞) and can also be thought of as the failure time of a device (perhaps in different units). Let G^c and F^c denote the reliability functions of Z and X respectively, and let r and R denote the failure rate functions of Z and X, respectively. Then
1. F^c(x) = G^c(x/b) for x ∈ [0, ∞)
2. R(x) = (1/b) r(x/b) for x ∈ [0, ∞)
Proof
Moments
The following theorem relates the mean, variance, and standard deviation of Z and X.
As before, suppose that X = a + bZ. Then
1. E(X) = a + b E(Z)
2. var(X) = b² var(Z)
3. sd(X) = b sd(Z)
Proof
Recall that the standard score of a random variable is obtained by subtracting the mean and dividing by the standard deviation. The standard score is dimensionless (that is, has no physical units) and measures the distance from the mean to the random variable in standard deviations. Since location-scale families essentially correspond to a change of units, it's not surprising that the standard score is unchanged by a location-scale transformation.
The standard scores of X and Z are the same:
(X − E(X))/sd(X) = (Z − E(Z))/sd(Z) (5.1.6)
Proof
Recall that the skewness and kurtosis of a random variable are the third and fourth moments, respectively, of the standard score. Thus it follows from the previous result that skewness and kurtosis are unchanged by location-scale transformations: skew(X) = skew(Z), kurt(X) = kurt(Z). We can relate the moments of X (about 0) to those of Z by means of the binomial theorem:
E(X^n) = ∑_{k=0}^{n} (n choose k) b^k a^{n−k} E(Z^k), n ∈ N (5.1.8)
Of course, the moments of X about the location parameter a have a simple representation in terms of the moments of Z about 0:
E[(X − a)^n] = b^n E(Z^n), n ∈ N (5.1.9)
The following exercise relates the moment generating functions of Z and X.
If Z has moment generating function m then X has moment generating function M given by
M(t) = e^{at} m(bt) (5.1.10)
Proof
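For reference, the computation behind the proof is a single line, assuming that m(bt) exists:
M(t) = E(e^{tX}) = E(e^{t(a + bZ)}) = e^{at} E(e^{(bt)Z}) = e^{at} m(bt)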
Type
As we noted earlier, two probability distributions that are related by a location-scale transformation can be thought of as governing the same underlying random quantity, but in different physical units. This relationship is important enough to deserve a name.
Suppose that P and Q are probability distributions on R with distribution functions F and G, respectively. Then P and Q are of the same type if there exist constants a ∈ R and b ∈ (0, ∞) such that
F(x) = G((x − a)/b), x ∈ R (5.1.12)
Being of the same type is an equivalence relation on the collection of probability distributions on R. That is, if P, Q, and R are probability distributions on R then
1. P is the same type as P (the reflexive property).
2. If P is the same type as Q then Q is the same type as P (the symmetric property).
3. If P is the same type as Q, and Q is the same type as R, then P is the same type as R (the transitive property).
Proof
R
is partitioned into mutually exclusive equivalence classes, where the
Examples and Applications Special Distributions Many of the special parametric families of distributions studied in this chapter and elsewhere in this text are location and/or scale families. The arcsine distribution is a location-scale family. The Cauchy distribution is a location-scale family. The exponential distribution is a scale family. The exponential-logarithmic distribution is a scale family for each value of the shape parameter. The extreme value distribution is a location-scale family. The gamma distribution is a scale family for each value of the shape parameter. The Gompertz distribution is a scale family for each value of the shape parameter. The half-normal distribution is a scale family. The hyperbolic secant distribution is a location-scale family. The Lévy distribution is a location scale family. The logistic distribution is a location-scale family.
5.1.3
https://stats.libretexts.org/@go/page/10167
The log-logistic distribution is a scale family for each value of the shape parameter. The Maxwell distribution is a scale family. The normal distribution is a location-scale family. The Pareto distribution is a scale family for each value of the shape parameter. The Rayleigh distribution is a scale family. The semicircle distribution is a location-scale family. The triangle distribution is a location-scale family for each value of the shape parameter. The uniform distribution on an interval is a location-scale family. The U-power distribution is a location-scale family for each value of the shape parameter. The Weibull distribution is a scale family for each value of the shape parameter. The Wald distribution is a scale family, although in the usual formulation, neither of the parameters is a scale parameter. This page titled 5.1: Location-Scale Families is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.1.4
https://stats.libretexts.org/@go/page/10167
5.2: General Exponential Families Basic Theory Definition We start with a probability space (Ω, F , P) as a model for a random experiment. So as usual, Ω is the set of outcomes, F the σalgebra of events, and P the probability measure on the sample space (Ω, F ). For the general formulation that we want in this section, we need two additional spaces, a measure space (S, S , μ) (where the probability distributions will live) and a measurable space (T , T ) (serving the role of a parameter space). Typically, these spaces fall into our two standard categories. Specifically, the measure space is usually one of the following: Discrete. S is countable, S is the collection of all subsets of S , and μ = # is counting measure. Euclidean. S is a sufficiently nice Borel measurable subset of R for some n ∈ N , S is the σ-algebra of Borel measurable subsets of S , and μ = λ is n -dimensional Lebesgue measure. n
+
n
Similarly, the parameter space (T , T ) is usually either discrete, so that T is countable and T the collection of all subsets of T , or Euclidean so that T is a sufficiently nice Borel measurable subset of R for some m ∈ N and T is the σ-algebra of Borel measurable subsets of T . m
+
Suppose now that X is random variable defined on the probability space, taking values in S , and that the distribution of X depends on a parameter θ ∈ T . For θ ∈ T we assume that the distribution of X has probability density function f with respect to μ . θ
for k ∈ N , the family of distributions of X is a k -parameter exponential family if +
k
fθ (x) = α(θ) g(x) exp( ∑ βi (θ) hi (x));
x ∈ S, θ ∈ T
(5.2.1)
i=1
where α and (β , β , … , β ) are measurable functions from T into R, and where functions from S into R. Moreover, k is assumed to be the smallest such integer. 1
2
k
g
and
(h1 , h2 , … , hk )
are measurable
1. The parameters (β (θ), β (θ), … , β (θ)) are called the natural parameters of the distribution. 2. the random variables (h (X), h (X), … , h (X)) are called the natural statistics of the distribution. 1
2
k
1
2
k
Although the definition may look intimidating, exponential families are useful because many important theoretical results in statistics hold for exponential families, and because many special parametric families of distributions turn out to be exponential families. It's important to emphasize that the representation of f (x) given in the definition must hold for all x ∈ S and θ ∈ T . If the representation only holds for a set of x ∈ S that depends on the particular θ ∈ T , then the family of distributions is not a general exponential family. θ
The next result shows that if we sample from the distribution of an exponential family, then the distribution of the random sample is itself an exponential family with the same natural parameters. Suppose that the distribution of random variable X is a k -parameter exponential family with natural parameters (β (θ), β (θ), … , β (θ)) , and natural statistics (h (X), h (X), … , h (X)). Let X = (X , X , … , X ) be a sequence of n independent random variables, each with the same distribution as X. Then X is a k -parameter exponential family with natural parameters (β (θ), β (θ), … , β (θ)) , and natural statistics 1
2
k
1
2
1
2
k
1
2
n
k
n
uj (X) = ∑ hj (Xi ),
j ∈ {1, 2, … , k}
(5.2.2)
i=1
Proof
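The key step in the proof is short and worth recording (a sketch): by independence the joint density is the product of the marginal densities, and the exponents add, so
fθ(x₁) fθ(x₂) ⋯ fθ(xₙ) = [α(θ)]^n [∏_{i=1}^{n} g(xᵢ)] exp(∑_{j=1}^{k} βⱼ(θ) ∑_{i=1}^{n} hⱼ(xᵢ))
which has the form required by the definition, with the natural statistics uⱼ given above.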
Examples and Special Cases
Special Distributions
Many of the special distributions studied in this chapter are general exponential families, at least with respect to some of their parameters. On the other hand, most commonly, a parametric family fails to be a general exponential family because the support set depends on the parameter. The following theorems give a number of examples. Proofs will be provided in the individual sections.
The Bernoulli distribution is a one-parameter exponential family in the success parameter p ∈ [0, 1].
The beta distribution is a two-parameter exponential family in the shape parameters a ∈ (0, ∞), b ∈ (0, ∞).
The beta prime distribution is a two-parameter exponential family in the shape parameters a ∈ (0, ∞), b ∈ (0, ∞).
The binomial distribution is a one-parameter exponential family in the success parameter p ∈ [0, 1] for a fixed value of the trial parameter n ∈ N₊.
k ∈ (0, ∞)
and the scale parameter
The geometric distribution is a one-parameter exponential family in the success probability p ∈ (0, 1). The half normal distribution is a one-parameter exponential family in the scale parameter σ ∈ (0, ∞) The Laplace distribution is a one-parameter exponential family in the scale parameter location parameter a ∈ R .
b ∈ (0, ∞)
for a fixed value of the
The Lévy distribution is a one-parameter exponential family in the scale parameter b ∈ (0, ∞) for a fixed value of the location parameter a ∈ R . The logarithmic distribution is a one-parameter exponential family in the shape parameter p ∈ (0, 1) The lognormal distribution is a two parameter exponential family in the shape parameters μ ∈ R , σ ∈ (0, ∞). The Maxwell distribution is a one-parameter exponential family in the scale parameter b ∈ (0, ∞). The k -dimensional multinomial distribution is a k -parameter exponential family in the probability parameters (p for a fixed value of the trial parameter n ∈ N .
1,
p2 , … , pk )
+
The k -dimensional multivariate normal distribution is a vector μ and the variance-covariance matrix V .
1 2
2
(k
+ 3k)
-parameter exponential family with respect to the mean
The negative binomial distribution is a one-parameter exponential family in the success parameter p ∈ (0, 1) for a fixed value of the stopping parameter k ∈ N . +
The normal distribution is a two-parameter exponential family in the mean μ ∈ R and the standard deviation σ ∈ (0, ∞). The Pareto distribution is a one-parameter exponential family in the shape parameter for a fixed value of the scale parameter. The Poisson distribution is a one-parameter exponential family.
5.2.2
https://stats.libretexts.org/@go/page/10168
The Rayleigh distribution is a one-parameter exponential family. The U-power distribution is a one-parameter exponential family in the shape parameter, for fixed values of the location and scale parameters. The Weibull distribution is a one-parameter exponential family in the scale parameter for a fixed value of the shape parameter. The zeta distribution is a one-parameter exponential family. The Wald distribution is a two-parameter exponential family. This page titled 5.2: General Exponential Families is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.2.3
https://stats.libretexts.org/@go/page/10168
5.3: Stable Distributions This section discusses a theoretical topic that you may want to skip if you are a new student of probability.
Basic Theory Stable distributions are an important general class of probability distributions on R that are defined in terms of location-scale transformations. Stable distributions occur as limits (in distribution) of scaled and centered sums of independent, identically distributed variables. Such limits generalize the central limit theorem, and so stable distributions generalize the normal distribution in a sense. The pioneering work on stable distributions was done by Paul Lévy.
Definition In this section, we consider real-valued random variables whose distributions are not degenerate (that is, not concentrated at a single value). After all, a random variable with a degenerate distribution is not really random, and so is not of much interest. Random variable X has a stable distribution if the following condition holds: If n ∈ N and (X , X , … , X ) is a sequence of independent variables, each with the same distribution as X, then X + X + ⋯ + X has the same distribution as a + b X for some a ∈ R and b ∈ (0, ∞) . If a = 0 for n ∈ N then the distribution of X is strictly stable. +
1
n
n
n
n
n
2
1
2
n
n
+
1. The parameters a for n ∈ N are the centering parameters. 2. The parameters b for n ∈ N are the norming parameters. n
+
n
+
Details Recall that two distributions on R that are related by a location-scale transformation are said to be of the same type, and that being of the same type defines an equivalence relation on the class of distributions on R. With this terminology, the definition of stability has a more elegant expression: X has a stable distribution if the sum of a finite number of independent copies of X is of the same type as X. As we will see, the norming parameters are more important than the centering parameters, and in fact, only certain norming parameters can occur.
Basic Properties We start with some very simple results that follow easily from the definition, before moving on to the deeper results. Suppose that X has a stable distribution with mean centering parameters are (n − √− n ) μ for n ∈ N .
μ
and finite variance. Then the norming parameters are
− √n
and the
+
Proof It will turn out that the only stable distribution with finite variance is the normal distribution, but the result above is useful as an intermediate step. Next, it seems fairly clear from the definition that the family of stable distributions is itself a location-scale family. Suppose that the distribution of X is stable, with centering parameters a ∈ R and norming parameters b ∈ (0, ∞) for n ∈ N . If c ∈ R and d ∈ (0, ∞) , then the distribution of Y = c + dX is also stable, with centering parameters da + (n − b )c and norming parameters b for n ∈ N . n
n
+
n
n
n
+
Proof An important point is the the norming parameters are unchanged under a location-scale transformation. Suppose that the distribution of X is stable, with centering parameters a ∈ R and norming parameters b ∈ (0, ∞) for n ∈ N . Then the distribution of −X is stable, with centering parameters −a and norming parameters b for n ∈ N . n
+
n
n
n
+
Proof From the last two results, if X has a stable distribution, then so does c + dX , with the same norming parameters, for every c, d ∈ R with d ≠ 0 . Stable distributions are also closed under convolution (corresponding to sums of independent variables) if the norming parameters are the same.
5.3.1
https://stats.libretexts.org/@go/page/10169
Suppose that X and Y are independent variables. Assume also that a ∈ R and norming parameters b ∈ (0, ∞) for n ∈ N , and that c ∈ R and the same norming parameters b for n ∈ N . Then paraemters a + c and norming parameters b for n ∈ N . n
n
+
n
n
n
n
n
+
has a stable distribution with centering parameters has a stable distribution with centering parameters Z = X +Y has a stable distribution with centering X
Y
+
Proof We can now give another characterization of stability that just involves two independent copies of X. Random variable X has a stable distribution if and only if the following condition holds: If X , X are independent variables, each with the same distribution as X and d , d ∈ (0, ∞) then d X + d X has the same distribution as a + bX for some a ∈ R and b ∈ (0, ∞). 1
1
2
1
1
2
2
2
Proof As a corollary of a couple of the results above, we have the following: Suppose that X and Y are independent with the same stable distribution. Then the distribution of X − Y is strictly stable, with the same norming parameters. Note that the distribution of X − Y is symmetric (about 0). The last result is useful because it allows us to get rid of the centering parameters when proving facts about the norming parameters. Here is the most important of those facts: Suppose that X has a stable distribution. Then the norming parameters have the form b = n α ∈ (0, 2]. The parameter α is known as the index or characteristic exponent of the distribution.
1/α
n
for
n ∈ N+
, for some
Proof Every stable distribution is continuous. Proof The next result is a precise statement of the limit theorem alluded to in the introductory paragraph. n
Suppose that (X , X , …) is a sequence of independent, identically distributed random variables, and let Y = ∑ X for n ∈ N . If there exist constants a ∈ R and b ∈ (0, ∞) for n ∈ N such that (Y − a )/b has a (non-degenerate) limiting distribution as n → ∞ , then the limiting distribution is stable. 1
+
2
n
n
n
+
n
n
i=1
i
n
The following theorem completely characterizes stable distributions in terms of the characteristic function. Suppose that X has a stable distribution. The characteristic function of β ∈ [−1, 1], c ∈ R , and d ∈ (0, ∞) χ(t) = E (e
where
sgn
itX
) = exp(itc − d
α
α
|t|
X
has the following form, for some
[1 + iβ sgn(t)uα (t)]),
t ∈ R
,
α ∈ (0, 2]
(5.3.15)
is the usual sign function, and where tan( uα (t) = {
2
πα 2
),
ln(|t|),
π
α ≠1 (5.3.16) α =1
1. The parameter α is the index, as before. 2. The parameter β is the skewness parameter. 3. The parameter c is the location parameter. 4. The parameter d is the scale parameter. Thus, the family of stable distributions is a 4 parameter family. The index parameter α and and the skewness parameter β can be considered shape parameters. When the location parameter c = 0 and the scale parameter d = 1 , we get the standard form of the stable distributions, with characteristic function χ(t) = E (e
itX
α
) = exp(−|t|
[1 + iβ sgn(t)uα (t)]),
5.3.2
t ∈ R
(5.3.17)
https://stats.libretexts.org/@go/page/10169
The characteristic function gives another proof that stable distributions are closed under convolution (corresponding to sums of independent variables), if the index is fixed. Suppose that X and X are independent random variables, and that X and X have the stable distribution with common index α ∈ (0, 2], skewness parameter β ∈ [−1, 1], location parameter c ∈ R , and scale parameter d ∈ (0, ∞) . Then X +X has the stable distribution with index α , location parameter c = c + c , scale parameter d = (d + d ) , and skewness parameter 1
2
1
k
1
2
k
2
k
1
β1 d β =
α 1
d
α 1
+ β2 d +d
2
α
α
1
2
1/α
α 2
α
(5.3.18)
2
Proof
Special Cases Three special parametric families of distributions studied in this chapter are stable. In the proofs in this subsection, we use the definition of stability and various important properties of the distributions. These properties, in turn, are verified in the sections devoted to the distributions. We also give proofs based on the characteristic function, which allows us to identify the skewness parameter. The normal distribution is stable with index α = 2 . There is no skewness parameter. Proof Of course, the normal distribution has finite variance, so once we know that it is stable, it follows from the finite variance property above that the index must be 2. Moreover, the characteristic function shows that the normal distribution is the only stable distribution with index 2, and hence the only stable distribution with finite variance. Open the special distribution simulator and select the normal distribution. Vary the parameters and note the shape and location of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. The Cauchy distribution is stable with index α = 1 and skewness parameter β = 0 . Proof Open the special distribution simulator and select the Cauchy distribution. Vary the parameters and note the shape and location of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. The Lévy distribution is stable with index α =
1 2
and skewness parameter β = 1 .
Proof Open the special distribution simulator and select the Lévy distribution. Vary the parameters and note the shape and location of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. The normal, Cauchy, and Lévy distributions are the only stable distributions for which the probability density function is known in closed form. This page titled 5.3: Stable Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.3.3
https://stats.libretexts.org/@go/page/10169
5.4: Infinitely Divisible Distributions This section discusses a theoretical topic that you may want to skip if you are a new student of probability.
Basic Theory Infinitely divisible distributions form an important class of distributions on R that includes the stable distributions, the compound Poisson distributions, as well as several of the most important special parametric families of distribtions. Basically, the distribution of a real-valued random variable is infinitely divisible if for each n ∈ N , the variable can be decomposed into the sum of n independent copies of another variable. Here is the precise definition. +
The distribution of a real-valued random variable X is infinitely divisible if for every n ∈ N , there exists a sequence of independent, identically distributed variables (X , X , … , X ) such that X + X + ⋯ + X has the same distribution as X. +
1
2
n
1
2
n
If the distribution of X is stable then the distribution is infinitely divisible. Proof Suppose now that X = (X , X , …) is a sequence of independent, identically distributed random variables, and that N has a Poisson distribution and is independent of X. Recall that the distribution of ∑ X is said to be compound Poisson. Like the stable distributions, the compound Poisson distributions form another important class of infinitely divisible distributions. 1
2
N
i
i=1
Suppose that Y is a random variable. 1. If Y is compound Poisson then Y is infinitely divisible. 2. If Y is infinitely divisible and takes values in N then Y is compound Poisson. Proof
Special Cases A number of special distributions are infinitely divisible. Proofs of the results stated below are given in the individual sections.
Stable Distributions First, the normal distribution, the Cauchy distribution, and the Lévy distribution are stable, so they are infinitely divisible. However, direct arguments give more information, because we can identify the distribution of the component variables. The normal distribution is infinitely divisible. If X has the normal distribution with mean μ ∈ R and standard deviation σ ∈ (0, ∞) , then for n ∈ N , X has the same distribution as X + X + ⋯ + X where (X , X , … , X ) are independent, and X has the normal distribution with mean μ/n and standard deviation σ/√− n for each i ∈ {1, 2, … , n}. +
1
2
n
1
2
n
i
The Cauchy distribution is infinitely divisible. If X has the Cauchy distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), then for n ∈ N , X has the same distribution as X + X + ⋯ + X where (X , X , … , X ) are independent, and X has the Cauchy distribution with location parameter a/n and scale parameter b/n for each i ∈ {1, 2, … , n}. +
1
2
n
1
2
n
i
Other Special Distributions On the other hand, there are distributions that are infinitely divisible but not stable. The gamma distribution is infinitely divisible. If X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞), then for n ∈ N , X has the same distribution as X + X + ⋯ + X where (X , X , … , X ) are independent, and X has the gamma distribution with shape parameter k/n and scale parameter b for each i ∈ {1, 2, … , n} +
1
2
n
1
2
n
i
The chi-square distribution is infinitely divisible. If X has the chi-square distribution with k ∈ (0, ∞) degrees of freedom, then for n ∈ N , X has the same distribution as X + X + ⋯ + X where (X , X , … , X ) are independent, and X has +
1
2
n
5.4.1
1
2
n
i
https://stats.libretexts.org/@go/page/10170
the chi-square distribution with k/n degrees of freedom for each i ∈ {1, 2, … , n}. The Poisson distribution distribution is infinitely divisible. If X has the Poisson distribution with rate parameter λ ∈ (0, ∞) , then for n ∈ N , X has the same distribution as X + X + ⋯ + X where (X , X , … , X ) are independent, and X has the Poisson distribution with rate parameter λ/n for each i ∈ {1, 2, … , n}. +
1
2
n
1
2
n
i
The general negative binomial distribution on N is infinitely divisible. If X has the negative binomial distribution on N with parameters k ∈ (0, ∞) and p ∈ (0, 1), then for n ∈ N , X has the same distribution as X + X + ⋯ + X were (X , X , … , X ) are independent, and X has the negative binomial distribution on N with parameters k/n and p for each i ∈ {1, 2, … , n}. +
1
2
n
1
2
n
i
Since the Poisson distribution and the negative binomial distributions are distributions on N, it follows from the characterization above that these distributions must be compound Poisson. Of course it is completely trivial that the Poisson distribution is compound Poisson, but it's far from obvious that the negative binomial distribution has this property. It turns out that the negative binomial distribution can be obtained by compounding the logarithmic series distribution with the Poisson distribution. The Wald distribution is infinitely divisible. If X has the Wald distribution with shape parameter λ ∈ (0, ∞) and mean μ ∈ (0, ∞) , then for n ∈ N , X has the same distribution as X + X + ⋯ + X where (X , X , … , X ) are independent, and X has the Wald distribution with shape parameter λ/n and mean μ/n for each i ∈ {1, 2, … , n}. +
1
2
n
1
2
n
2
i
This page titled 5.4: Infinitely Divisible Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.4.2
https://stats.libretexts.org/@go/page/10170
5.5: Power Series Distributions Power Series Distributions are discrete distributions on (a subset of) N constructed from power series. This class of distributions is important because most of the special, discrete distributions are power series distributions.
Basic Theory Power Series Suppose that a = (a , a , a , …) is a sequence of nonnegative real numbers. We are interested in the power series with a as the sequence of coefficients. Recall first that the partial sum of order n ∈ N is 0
1
2
n k
gn (θ) = ∑ ak θ ,
θ ∈ R
(5.5.1)
k=0
The power series g is then defined by g(θ) = lim
n→∞
for θ ∈ R for which the limit exists, and is denoted
gn (θ)
∞ n
g(θ) = ∑ an θ
(5.5.2)
n=0
Note that the series converges when θ = 0 , and g(0) = a . Beyond this trivial case, recall that there exists r ∈ [0, ∞] such that the series converges absolutely for |θ| < r and diverges for |θ| > r . The number r is the radius of convergence. From now on, we assume that r > 0 . If r < ∞ , the series may converge (absolutely) or may diverge to ∞ at the endpoint r. At −r, the series may converge absolutely, may converge conditionally, or may diverge. 0
Distributions From now on, we restrict θ to the interval [0, r); this interval is our parameter space. Some of the results below may hold when r < ∞ and θ = r , but dealing with this case explicitly makes the exposition unnecessarily cumbersome. Suppose that N is a random variable with values in N. Then N has the power series distribution associated with the function g (or equivalently with the sequence a ) and with parameter θ ∈ [0, r) if N has probability density function f given by θ
n
fθ (n) =
an θ
,
n ∈ N
(5.5.3)
g(θ)
Proof Note that when θ = 0 , the distribution is simply the point mass distribution at 0; that is, f
0 (0)
=1
.
The distribution function F is given by θ
Fθ (n) =
gn (θ)
,
n ∈ N
(5.5.4)
g(θ)
Proof Of course, the probability density function f is most useful when the power series g(θ) can be given in closed form, and similarly the distribution function F is most useful when the power series and the partial sums can be given in closed form θ
θ
Moments The moments of N can be expressed in terms of the underlying power series function g , and the nicest expression is for the factorial moments. Recall that the permutation formula is t = t(t − 1) ⋯ (t − k + 1) for t ∈ R and k ∈ N , and the factorial moment of N of order k ∈ N is E (N ). (k)
(k)
For θ ∈ [0, r) , the factorial moments of N are as follows, where g k
E (N
(k)
θ g
(k)
(k)
is the k th derivative of g .
(θ)
) =
,
k ∈ N
(5.5.5)
g(θ)
5.5.1
https://stats.libretexts.org/@go/page/10171
Proof The mean and variance of N are 1. E(N ) = θg
′
(θ)/g(θ)
2. var(N ) = θ
2
(g
′′
2
′
(θ)/g(θ) − [ g (θ)/g(θ)] )
Proof The probability generating function of N also has a simple expression in terms of g . For θ ∈ (0, r) , the probability generating function P of N is given by N
P (t) = E (t
g(θt) ) =
r ,
t
0 , then Z = (X − μ)/σ is the standard score of X. A corollary of the last result is that if X has a normal distribution then the standard score Z has a standard normal distribution. Conversely, any normally distributed variable can be constructed from a standard normal variable. Standard score. X−μ
1. If X has the normal distribution with mean μ and standard deviation σ then Z = has the standard normal distribution. 2. If Z has the standard normal distribution and if μ ∈ R and σ ∈ (0, ∞), then X = μ + σZ has the normal distribution with mean μ and standard deviation σ. σ
Suppose that X and X are independent random variables, and that X is normally distributed with mean μ and variance σ for i ∈ {1, 2}. Then X + X is normally distributed with 1
2
i
1
2
1
1
2
2
i
2
1. E(X + X ) = μ + μ 2. var(X + X ) = σ + σ 1
i
2
2
2
1
2
Proof This theorem generalizes to a sum of n independent, normal variables. The important part is that the sum is still normal; the expressions for the mean and variance are standard results that hold for the sum of independent variables generally. As a consequence of this result and the one for linear transformations, it follows that the normal distribution is stable. The normal distribution is stable. Specifically, suppose that X has the normal distribution with mean μ ∈ R and variance σ ∈ (0, ∞) . If (X , X , … , X ) are independent copies of X, then X + X + ⋯ + X has the same distribution as − − (n − √n ) μ + √n X , namely normal with mean nμ and variance nσ . 2
1
2
n
1
2
n
2
5.6.4
https://stats.libretexts.org/@go/page/10172
Proof All stable distributions are infinitely divisible, so the normal distribution belongs to this family as well. For completeness, here is the explicit statement: The normal distribution is infinitely divisible. Specifically, if X has the normal distribution with mean μ ∈ R and variance σ ∈ (0, ∞) , then for n ∈ N , X has the same distribution as X + X + ⋯ + X where (X , X , … , X ) are independent, and each has the normal distribution with mean μ/n and variance σ /n. 2
+
1
2
n
1
2
n
2
Finally, the normal distribution belongs to the family of general exponential distributions. Suppose that
X
has the normal distribution with mean
family with natural parameters (
μ σ2
,−
1 2 σ2
μ
and variance
σ
2
. The distribution is a two-parameter exponential
, and natural statistics (X, X ). 2
)
Proof A number of other special distributions studied in this chapter are constructed from normally distributed variables. These include The lognormal distribution The folded normal distribution, which includes the half normal distribution as a special case The Rayleigh distribution The Maxwell distribution The Lévy distribution Also, as mentioned at the beginning of this section, the importance of the normal distribution stems in large part from the central limit theorem, one of the fundamental theorems of probability. By virtue of this theorem, the normal distribution is connected to many other distributions, by means of limits and approximations, including the special distributions in the following list. Details are given in the individual sections. The binomial distribution The negative binomial distribution The Poisson distribution The gamma distribution The chi-square distribution The student t distribution The Irwin-Hall distribution
Computational Exercises Suppose that the volume of beer in a bottle of a certain brand is normally distributed with mean 0.5 liter and standard deviation 0.01 liter. 1. Find the probability that a bottle will contain at least 0.48 liter. 2. Find the volume that corresponds to the 95th percentile Answer A metal rod is designed to fit into a circular hole on a certain assembly. The radius of the rod is normally distributed with mean 1 cm and standard deviation 0.002 cm. The radius of the hole is normally distributed with mean 1.01 cm and standard deviation 0.003 cm. The machining processes that produce the rod and the hole are independent. Find the probability that the rod is to big for the hole. Answer The weight of a peach from a certain orchard is normally distributed with mean 8 ounces and standard deviation 1 ounce. Find the probability that the combined weight of 5 peaches exceeds 45 ounces. Answer
5.6.5
https://stats.libretexts.org/@go/page/10172
A Further Generlization In some settings, it's convenient to consider a constant as having a normal distribution (with mean being the constant and variance 0, of course). This convention simplifies the statements of theorems and definitions in these settings. Of course, the formulas for the probability density function and the distribution function do not hold for a constant, but the other results involving the moment generating function, linear transformations, and sums are still valid. Moreover, the result for linear transformations would hold for all a and b . This page titled 5.6: The Normal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.6.6
https://stats.libretexts.org/@go/page/10172
5.7: The Multivariate Normal Distribution The multivariate normal distribution is among the most important of multivariate distributions, particularly in statistical inference and the study of Gaussian processes such as Brownian motion. The distribution arises naturally from linear transformations of independent normal variables. In this section, we consider the bivariate normal distribution first, because explicit results can be given and because graphical interpretations are possible. Then, with the aid of matrix notation, we discuss the general multivariate distribution.
The Bivariate Normal Distribution The Standard Distribution Recall that the probability density function ϕ of the standard normal distribution is given by 1 ϕ(z) =
− −e √2π
−z
2
/2
,
z ∈ R
(5.7.1)
The corresponding distribution function is denoted Φ and is considered a special function in mathematics: z
Φ(z) = ∫
z
1
ϕ(x) dx = ∫
−∞
−∞
2
− −e √2π
−x /2
dx,
z ∈ R
(5.7.2)
Finally, the moment generating function m is given by m(t) = E (e
tZ
1 ) = exp[
2
var(tZ)] = e
t /2
,
t ∈ R
(5.7.3)
2
Suppose that Z and W are independent random variables, each with the standard normal distribution. The distribution of (Z, W ) is known as the standard bivariate normal distribution. The basic properties of the standard bivariate normal distribution follow easily from independence and properties of the (univariate) normal distribution. Recall first that the graph of a function f : R → R is a surface. For c ∈ R , the set of points {(x, y) ∈ R : f (x, y) = c} is the level curve of f at level c . The graph of f can be understood by means of the level curves. 2
2
The probability density function ϕ of the standard bivariate normal distribution is given by 2
1 ϕ2 (z, w) =
e
−
1
(z
2
2
+w )
2
,
2
(z, w) ∈ R
(5.7.4)
2π
1. The level curves of ϕ are circles centered at the origin. 2. The mode of the distribution is (0, 0). 3. ϕ is concave downward on {(z, w) ∈ R : z + w < 1} 2
2
2
2
2
Proof Clearly ϕ has a number of symmetry properties as well: ϕ (z, w) is symmetric in z about 0 so that ϕ (−z, w) = ϕ (z, w) ; ϕ (z, w) is symmetric in w about 0 so that ϕ (z, −w) = ϕ (z, w) ; ϕ (z, w) is symmetric in (z, w) so that ϕ (z, w) = ϕ (w, z) . In short, ϕ has the classical “bell shape” that we associate with normal distributions. 2
2
2
2
2
2
2
2
2
2
Open the bivariate normal experiment, keep the default settings to get the standard bivariate normal distribution. Run the experiment 1000 times. Observe the cloud of points in the scatterplot, and compare the empirical density functions to the probability density functions. Suppose that (Z, W ) has the standard bivariate normal distribution. The moment generating function m of (Z, W ) is given by 2
1 m2 (s, t) = E [exp(sZ + tW )] = exp[
1 var(sZ + tW )] = exp[
2
2
(s
2
+ t )],
2
(s, t) ∈ R
(5.7.6)
2
Proof
The General Distribution The general bivariate normal distribution can be constructed by means of an affine transformation on a standard bivariate normal vector. The distribution has 5 parameters. As we will see, two are location parameters, two are scale parameters, and one is a correlation
5.7.1
https://stats.libretexts.org/@go/page/10173
parameter. Suppose that (Z, W ) has the standard bivariate normal distribution. Let μ, be new random variables defined by
ν ∈ R
; σ,
; and ρ ∈ (−1, 1), and let X and Y
τ ∈ (0, ∞)
X = μ+σ Z
(5.7.7) − − − − −
Y
2
= ν + τ ρZ + τ √ 1 − ρ
W
(5.7.8)
The joint distribution of (X, Y ) is called the bivariate normal distribution with parameters (μ, ν , σ, τ , ρ). We can use the change of variables formula to find the joint probability density function. Suppose that (X, Y ) has the bivariate normal distribution with the parameters (μ, ν , σ, τ , ρ) as specified above. The joint probability density function f of (X, Y ) is given by 1 f (x, y) =
2
1
− − − − − 2 2πστ √ 1 − ρ
exp{−
(x − μ) 2
2(1 − ρ )
[ σ
2
2
(x − μ)(y − ν ) − 2ρ
(y − ν ) +
στ
τ
2
2
]},
(x, y) ∈ R
(5.7.9)
1. The level curves of f are ellipses centered at (μ, ν ). 2. The mode of the distribution is (μ, ν ). Proof The following theorem gives fundamental properties of the bivariate normal distribution. Suppose that (X, Y ) has the bivariate normal distribution with parameters (μ, ν , σ, τ , ρ) as specified above. Then 1. X is normally distributed with mean μ and standard deviation σ. 2. Y is normally distributed with mean ν and standard deviation τ . 3. cor(X, Y ) = ρ . 4. X and Y are independent if and only if ρ = 0 . Proof Thus, two random variables with a joint normal distribution are independent if and only if they are uncorrelated. In the bivariate normal experiment, change the standard deviations of X and Y with the scroll bars. Watch the change in the shape of the probability density functions. Now change the correlation with the scroll bar and note that the probability density functions do not change. For various values of the parameters, run the experiment 1000 times. Observe the cloud of points in the scatterplot, and compare the empirical density functions to the probability density functions. In the case of perfect correlation (ρ = 1 or ρ = −1 ), the distribution of (X, Y ) is also said to be bivariate normal, but degenerate. In this case, we know from our study of covariance and correlation that (X, Y ) takes values on the regression line {(x, y) ∈ R : y = ν + ρ (x − μ)} , and hence does not have a probability density function (with respect to Lebesgue measure on R ). Degenerate normal distributions will be discussed in more detail below. 2
τ
2
σ
In the bivariate normal experiment, run the experiment 1000 times with the values of ρ given below and selected values of σ and τ . Observe the cloud of points in the scatterplot and compare the empirical density functions to the probability density functions. 1. ρ ∈ {0, 0.3, 0.5, 0.7, 1} 2. ρ ∈ {−0.3, −0.5, −0.7, −1} The conditional distributions are also normal. Suppose that (X, Y ) has the bivariate normal distribution with parameters (μ, ν , σ, τ , ρ) as specified above. 1. For x ∈ R, the conditional distribution of Y given X = x is normal with mean E(Y ∣ X = x) = ν + ρ var(Y ∣ X = x) = τ (1 − ρ ) . 2. For y ∈ R , the conditional distribution of X given Y = y is normal with mean E(X ∣ Y = y) = μ + ρ var(X ∣ Y = y) = σ (1 − ρ ) . 2
2
2
2
τ σ
σ τ
(x − μ)
(y − ν )
and variance and variance
Proof from density functions Proof from random variables
5.7.2
https://stats.libretexts.org/@go/page/10173
Note that the conditional variances do not depend on the value of the given variable. In the bivariate normal experiment, set the standard deviation of X to 1.5, the standard deviation of Y to 0.5, and the correlation to 0.7. 1. Run the experiment 100 times. 2. For each run, compute E(Y ∣ X = x) the predicted value of Y for the given the value of X. 3. Over all 100 runs, compute the square root of the average of the squared errors between the predicted value of Y and the true value of Y . You may be perplexed by the lack of symmetry in how (X, Y ) is defined in terms of (Z, W ) in the original definition. Note however − − − − − that the distribution is completely determined by the 5 parameters. If we define X = μ + σρZ + σ √1 − ρ W and Y = ν + τ Z then (X , Y ) has the same distribution as (X, Y ), namely the bivariate normal distribution with parameters (μ, ν , σ, τ , ρ) (although, of course (X , Y ) and (X, Y ) are different random vectors). There are other ways to define the same distribution as an affine transformation of (Z, W )—the situation will be clarified in the next subsection. ′
′
′
2
′
′
′
Suppose that (X, Y ) has the bivariate normal distribution with parameters function M given by
. Then
(μ, ν , σ, τ , ρ)
(X, Y )
has moment generating
1 M (s, t) = E [exp(sX + tY )] = exp[E(sX + tY ) +
var(sX + tY )] = exp
(5.7.15)
2 1 [μs + ν t +
2
2
(σ s
+ 2ρστ st + τ
2
2
2
t )],
(s, t) ∈ R
2
Proof We showed above that if converse is not true.
(X, Y )
has a bivariate normal distribution then the marginal distributions of
X
and
Y
are also normal. The
Suppose that (X, Y ) has probability density function f given by 1 f (x, y) =
2
e
−( x +y
2
)/2
2
[1 + xy e
−( x +y
2
−2)
2
/2] ,
(x, y) ∈ R
(5.7.18)
2π
1. X and Y each have standard normal distributions. 2. (X, Y ) does not have a bivariate normal distribution. Proof
Transformations Like its univariate counterpart, the family of bivariate normal distributions is preserved under two types of transformations on the underlying random vector: affine transformations and sums of independent vectors. We start with a preliminary result on affine transformations that should help clarify the original definition. Throughout this discussion, we assume that the parameter vector (μ, ν , σ, τ , ρ) satisfies the usual conditions: μ, ν ∈ R , and σ, τ ∈ (0, ∞), and ρ ∈ (−1, 1). Suppose that (Z, W ) has the standard bivariate normal distribution. Let X = a + b Z + c W and Y = a + b Z + c the coefficients are in R and b c − c b ≠ 0 . Then (X, Y ) has a bivariate normal distribution with parameters given by 1
1
1. E(X) = a 2. E(Y ) = a 3. var(X) = b + c 4. var(Y ) = b + c 5. cov(X, Y ) = b b
2
1
1
1
2
2W
2
where
2
1
2
2
2
1
1
2
2
2
2
1
2
+ c1 c2
Proof Now it is easy to show more generally that the bivariate normal distribution is closed with respect to affine transformations. Suppose that
(X, Y )
V = a2 + b2 X + c2 Y
has the bivariate normal distribution with parameters (μ, ν , σ, τ , ρ). Define U = a + b X + c Y and , where the coefficients are in R and b c − c b ≠ 0 . Then (U , V ) has a bivariate normal distribution with 1
1
2
1
1
1
2
parameters as follows: 1. E(U ) = a
1
+ b1 μ + c1 ν
5.7.3
https://stats.libretexts.org/@go/page/10173
2. E(V ) = a + b 3. var(U ) = b σ 4. var(V ) = b σ 5. cov(U , V ) = b
2μ
2
2
2
1
2
2
2
+ c2 ν 2
2
+c τ 1
2
2
+c τ 2
1 b2 σ
2
+ 2 b1 c1 ρστ + 2 b2 c2 ρστ
+ c1 c2 τ
2
+ (b1 c2 + b2 c1 )ρστ
Proof The bivariate normal distribution is preserved with respect to sums of independent variables. Suppose that (X , Y ) has the bivariate normal distribution with parameters (μ , ν , σ , τ , ρ ) for i ∈ {1, 2}, and that (X , Y ) are independent. Then (X + X , Y + Y ) has the bivariate normal distribution with parameters given by i
2
i
i
2
1
1. E(X + X ) = μ + μ 2. E(Y + Y ) = ν + ν 3. var(X + X ) = σ + σ 4. var(Y + Y ) = τ + τ 5. cov(X + X , Y + Y ) = ρ 1
2
1
1
2
1
1
2
i
(X1 , Y1 )
and
2
2
1
2
2
2
1
2
i
2
2
2
1
i
2
1
1
2
i
2
1
1 σ1 τ1
2
+ ρ2 σ2 τ2
Proof The following result is important in the simulation of normal variables. Suppose that (Z, W ) has the standard bivariate normal distribution. Define the polar coordinates (R, Θ) of (Z, W ) by the equations Z = R cos Θ , W = R sin Θ where R ≥ 0 and 0 ≤ Θ < 2 π . Then 1. R has probability density function g given by g(r) = r e 2. Θ is uniformly distributed on [0, 2π). 3. R and Θ are independent.
−
1 2
2
r
for r ∈ [0, ∞).
Proof The distribution of R is known as the standard Rayleigh distribution, named for William Strutt, Lord Rayleigh. The Rayleigh distribution studied in more detail in a separate section. Since the quantile function Φ of the normal distribution cannot be given in a simple, closed form, we cannot use the usual random quantile method of simulating a normal random variable. However, the quantile method works quite well to simulate a Rayleigh variable, and of course simulating uniform variables is trivial. Hence we have a way of simulating a standard bivariate normal vector with a pair of random numbers (which, you will recall are independent random variables, each with the standard uniform distribution, that is, the uniform distribution on [0, 1)). −1
− −−−− −
Suppose that U and V are independent random variables, each with the standard uniform distribution. Let R = √−2 ln U and Θ = 2πV . Define Z = R cos Θ and W = R sin Θ . Then (Z, W ) has the standard bivariate normal distribution. Proof Of course, if we can simulate (Z, W ) with a standard bivariate normal distribution, then we can simulate (X, Y ) with the general − − − − − bivariate normal distribution, with parameter (μ, ν , σ, τ , ρ) by definition (5), namely X = μ + σZ , Y = ν + τ ρZ + τ √1 − ρ W . 2
The General Multivariate Normal Distribution The general multivariate normal distribution is a natural generalization of the bivariate normal distribution studied above. The exposition is very compact and elegant using expected value and covariance matrices, and would be horribly complex without these tools. Thus, this section requires some prerequisite knowledge of linear algebra. In particular, recall that A denotes the transpose of a matrix A and that we identify a vector in R with the corresponding n × 1 column vector. T
n
The Standard Distribution Suppose that Z = (Z , Z , … , Z ) is a vector of independent random variables, each with the standard normal distribution. Then Z is said to have the n -dimensional standard normal distribution. 1
2
n
1. E(Z) = 0 (the zero vector in R ). 2. vc(Z) = I (the n × n identity matrix). n
Z
has probability density function ϕ given by n
5.7.4
https://stats.libretexts.org/@go/page/10173
1 ϕn (z) =
1 exp(− n/2
1 exp(− n/2
2
(2π)
1
z ⋅ z) =
2
(2π)
n 2
n
∑ z ),
z = (z1 , z2 , … , zn ) ∈ R
i
(5.7.32)
i=1
where as usual, ϕ is the standard normal PDF. Proof Z
has moment generating function m given by n
1 mn (t) = E [exp(t ⋅ Z)] = exp[
1 var(t ⋅ Z)] = exp(
2
n
1 t ⋅ t) = exp(
2
2
∑ t ), 2
i
n
t = (t1 , t2 , … , tn ) ∈ R
(5.7.33)
i=1
Proof
The General Distribution Suppose that Z has the n -dimensional standard normal distribution. Suppose also that μ ∈ R and that A ∈ R random vector X = μ + AZ is said to have an n -dimensional normal distribution. n
n×n
is invertible. The
1. E(X) = μ . 2. vc(X) = A A . T
Proof In the context of this result, recall that the variance-covariance matrix vc(X) = AA is symmetric and positive definite (and hence also invertible). We will now see that the multivariate normal distribution is completely determined by the expected value vector μ and the variance-covariance matrix V , and hence these give the basic parameters of the distribution. T
Suppose that X has an n -dimensional normal distribution with expected value vector probability density function f of X is given by 1
f (x) = n/2
(2π)
1 (x − μ) ⋅ V − − − − − − exp[− 2 √det(V )
−1
μ
and variance-covariance matrix
(x − μ)],
n
x ∈ R
V
. The
(5.7.34)
Proof Suppose again that X has an n -dimensional normal distribution with expected value vector The moment generating function M of X is given by 1 M (t) = E [exp(t ⋅ X)] = exp[E(t ⋅ X) +
μ
and variance-covariance matrix
1 var(t ⋅ X)] = exp(t ⋅ μ +
2
t ⋅ V t),
n
t ∈ R
V
.
(5.7.39)
2
Proof Of course, the moment generating function completely determines the distribution. Thus, if a random vector generating function of the form given above, for some μ ∈ R and symmetric, positive definite V ∈ R dimensional normal distribution with mean μ and variance-covariance matrix V . n
in R has a moment , then X has the n -
X
n×n
n
Note again that in the representation X = μ + AZ , the distribution of X is uniquely determined by the expected value vector μ and the variance-covariance matrix V = AA , but not by μ and A. In general, for a given positive definite matrix V , there are many invertible matrices A such that V = AA (the matrix A is a bit like a square root of V ). A theorem in matrix theory states that there is a unique lower triangular matrix L with this property. The representation X = μ + LZ is known as the canonical representation of X. T
T
If
X = (X, Y )
LL
T
= vc(X)
has bivariate normal distribution with parameters is σ
, then the lower triangular matrix
(μ, ν , σ, τ , ρ)
L
such that
0
L =[ τρ
− − − − −] 2 τ √1 − ρ
(5.7.41)
Proof Note that the matrix L above gives the canonical representation of (X, Y ) in terms of the standard normal vector (Z, W ) in the original − − − − − definition, namely X = μ + σZ , Y = ν + τ ρZ + τ √1 − ρ W . 2
5.7.5
https://stats.libretexts.org/@go/page/10173
If the matrix A ∈ R in the definition is not invertible, then the variance-covariance matrix V = AA is symmetric, but only positive semi-definite. The random vector X = μ + AZ takes values in a lower dimensional affine subspace of R that has measure 0 relative to n -dimensional Lebesgue measure λ . Thus, X does not have a probability density function relative to λ , and so the distribution is degenerate. However, the formula for the moment generating function still holds. Degenerate normal distributions are discussed in more detail below. n×n
T
n
n
n
Transformations The multivariate normal distribution is invariant under two basic types of transformations on the underlying random vectors: affine transformations (with linearly independent rows), and concatenation of independent vectors. As simple corollaries of these two results, the normal distribution is also invariant with respect to subsequences of the random vector, re-orderings of the terms in the random vector, and sums of independent random vectors. The main tool that we will use is the moment generating function. We start with the first main result on affine transformations. Suppose that X has the n -dimensional normal distribution with mean vector μ and variance-covariance matrix V . Suppose also that a∈ R and that A ∈ R has linearly independent rows (thus, m ≤ n ). Then Y = a + AX has an m-dimensional normal distribution, with m
m×n
1. E(Y ) = a + Aμ 2. vc(Y ) = AV A
T
Proof A clearly important special case is m = n , which generalizes the definition. Thus, if Y = a + AX has an n -dimensional normal distribution. Here are some other corollaries: Suppose that X = (X
1,
has an n -dimensional normal distribution. If {i has an m-dimensional normal distribution.
X2 , … , Xn )
Y = (Xi1 , Xi2 , … , Xim )
1,
n
a∈ R
and
i2 , … , im }
n×n
A ∈ R
is invertible, then
is a set of distinct indices, then
Proof In the context of the previous result, if X has mean vector μ and variance-covariance matrix V , then Y has mean vector Aμ and variance-covariance matrix AV A , where A is the 0-1 matrix defined in the proof. As simple corollaries, note that if X = (X , X , … , X ) has an n -dimensional normal distribution, then any permutation of the coordinates of X also has an n dimensional normal distribution, and (X , X , … , X ) has an m-dimensional normal distribution for any m ≤ n . Here is a slight extension of the last statement. T
1
2
n
1
Suppose that X is a random vector in distribution. Then
2
m
R
,
m
Y
is a random vector in
n
R
, and that
(X, Y )
has an
(m + n)
-dimensional normal
1. X has an m-dimensional normal distribution. 2. Y has an n -dimensional normal distribution. 3. X and Y are independent if and only if cov(X, Y ) = 0 (the m × n zero matrix). Proof Next is the converse to part (c) of the previous result: concatenating independent normally distributed vectors produces another normally distributed vector. Suppose that X has the m-dimensional normal distribution with mean vector μ and variance-covariance matrix U , Y has the n dimensional normal distribution with mean vector ν and variance-covariance matrix V , and that X and Y are independent. Then Z = (X, Y ) has the m + n -dimensional normal distribution with 1. E(X, Y ) = (μ, ν ) 2. vc(X, Y ) = [
vc(X) T
0
0 ]
where 0 is the m × n zero matrix.
vc(Y )
Proof Just as in the univariate case, the normal family of distributions is closed with respect to sums of independent variables. The proof follows easily from the previous result. Suppose that X has the n -dimensional normal distribution with mean vector μ and variance-covariance matrix U , Y has the n dimensional normal distribution with mean vector ν and variance-covariance matrix V , and that X and Y are independent. Then
5.7.6
https://stats.libretexts.org/@go/page/10173
X +Y
has the n -dimensional normal distribution with
1. E(X + Y ) = μ + ν 2. vc(X + Y ) = U + V Proof We close with a trivial corollary to the general result on affine transformation, but this corollary points the way to a further generalization of the multivariate normal distribution that includes the degenerate distributions. Suppose that X has an n -dimensional normal distribution with mean vector μ and variance-covariance matrix V , and that with a ≠ 0 . Then Y = a ⋅ X has a (univariate) normal distribution with
n
a∈ R
1. E(Y ) = a ⋅ μ 2. var(Y ) = a ⋅ V a Proof
A Further Generalization The last result can be used to give a simple, elegant definition of the multivariate normal distribution that includes the degenerate distributions as well as the ones we have considered so far. First we will adopt our general definition of the univariate normal distribution that includes constant random variables. A random variable X that takes values in R has an n -dimensional normal distribution if and only if a ⋅ X has a univariate normal distribution for every a ∈ R . n
n
Although an n -dimensional normal distribution may not have a probability density function with respect to measure λ , the form of the moment generating function is unchanged.
n
-dimensional Lebesgue
n
Suppose that X has mean vector μ and variance-covariance matrix moment generating function of X is given by
V
, and that
X
has an n -dimensional normal distribution. The
1 E [exp(t ⋅ X)] = exp[E(t ⋅ X) +
1 var(t ⋅ X)] = exp(t ⋅ μ +
2
t ⋅ V t),
n
t ∈ R
(5.7.47)
2
Proof Our new general definition really is a generalization. Suppose that X has an n -dimensional normal distribution in the sense of the general definition, and that the distribution of X has a probability density function on R with respect to Lebesgue measure λ . Then X has an n -dimensional normal distribution in the sense of our original definition. n
n
Proof This page titled 5.7: The Multivariate Normal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.7.7
https://stats.libretexts.org/@go/page/10173
5.8: The Gamma Distribution In this section we will study a family of distributions that has special importance in probability and statistics. In particular, the arrival times in the Poisson process have gamma distributions, and the chi-square distribution in statistics is a special case of the gamma distribution. Also, the gamma distribution is widely used to model physical quantities that take positive values.
The Gamma Function Before we can study the gamma distribution, we need to introduce the gamma function, a special function whose values will play the role of the normalizing constants.
Definition The gamma function Γ is defined as follows ∞ k−1
Γ(k) = ∫
x
e
−x
dx,
k ∈ (0, ∞)
(5.8.1)
0
The function is well defined, that is, the integral converges for any k ≤ 0.
k >0
. On the other hand, the integral diverges to
∞
for
Proof The gamma function was first introduced by Leonhard Euler.
Figure 5.8.1 : The graph of the gamma function on the interval (0, 5)
The (lower) incomplete gamma function is defined by x
Γ(k, x) = ∫
k−1
t
e
−t
dt,
k, x ∈ (0, ∞)
(5.8.6)
0
Properties Here are a few of the essential properties of the gamma function. The first is the fundamental identity. Γ(k + 1) = k Γ(k)
for k ∈ (0, ∞) .
Proof Applying this result repeatedly gives Γ(k + n) = k(k + 1) ⋯ (k + n − 1)Γ(k),
n ∈ N+
(5.8.8)
It's clear that the gamma function is a continuous extension of the factorial function. Γ(k + 1) = k!
for k ∈ N .
Proof
5.8.1
https://stats.libretexts.org/@go/page/10174
The values of the gamma function for non-integer arguments generally cannot be expressed in simple, closed forms. However, there are exceptions. Γ(
1 2
− ) = √π
.
Proof We can generalize the last result to odd multiples of
1 2
.
For n ∈ N , 1 ⋅ 3 ⋯ (2n − 1)
2n + 1 Γ(
) =
n
2
2
− √π =
(2n)! n
4 n!
− √π
(5.8.11)
Proof
Stirling's Approximation One of the most famous asymptotic formulas for the gamma function is Stirling's formula, named for James Stirling. First we need to recall a definition. Suppose that f ,
g : D → (0, ∞)
where D = (0, ∞) or D = N . Then f (x) ≈ g(x) as x → ∞ means that +
f (x) → 1 as x → ∞
(5.8.12)
g(x)
Stirling's formula x Γ(x + 1) ≈ (
x
− − − ) √2πx as x → ∞
(5.8.13)
e
As a special case, Stirling's result gives an asymptotic formula for the factorial function: n n! ≈ (
n
−− − ) √2πn as n → ∞
(5.8.14)
e
The Standard Gamma Distribution Distribution Functions The standard gamma distribution with shape parameter density function f given by
k ∈ (0, ∞)
1 f (x) =
k−1
x
e
−x
,
is a continuous distribution on
(0, ∞)
with probability
x ∈ (0, ∞)
(5.8.15)
Γ(k)
Clearly f is a valid probability density function, since f (x) > 0 for x > 0 , and by definition, Γ(k) is the normalizing constant for the function x ↦ x e on (0, ∞). The following theorem shows that the gamma density has a rich variety of shapes, and shows why k is called the shape parameter. k−1
−x
The gamma probability density function f with shape parameter k ∈ (0, ∞) satisfies the following properties: 1. If 0 < k < 1 , f is decreasing with f (x) → ∞ as x ↓ 0. 2. If k = 1 , f is decreasing with f (0) = 1 . 3. If k > 1 , f increases and then decreases, with mode at k − 1 . 4. If 0 < k ≤ 1 , f is concave upward. −−− − 5. If 1 < k ≤ 2 , f is concave downward and then upward, with inflection point at k − 1 + √k − 1 . −−− − 6. If k > 2 , f is concave upward, then downward, then upward again, with inflection points at k − 1 ± √k − 1 . Proof The special case k = 1 gives the standard exponential distribuiton. When k ≥ 1 , the distribution is unimodal.
5.8.2
https://stats.libretexts.org/@go/page/10174
In the simulation of the special distribution simulator, select the gamma distribution. Vary the shape parameter and note the shape of the density function. For various values of k , run the simulation 1000 times and compare the empirical density function to the true probability density function. The distribution function and the quantile function do not have simple, closed representations for most values of the shape parameter. However, the distribution function has a trivial representation in terms of the incomplete and complete gamma functions. The distribution function F of the standard gamma distribution with shape parameter k ∈ (0, ∞) is given by Γ(k, x) F (x) =
,
x ∈ (0, ∞)
(5.8.16)
Γ(k)
Approximate values of the distribution and quantile functions can be obtained from special distribution calculator, and from most mathematical and statistical software packages. Using the special distribution calculator, find the median, the first and third quartiles, and the interquartile range in each of the following cases: 1. k = 1 2. k = 2 3. k = 3
Moments Suppose that X has the standard gamma distribution with shape parameter k ∈ (0, ∞) . The mean and variance are both simply the shape parameter. The mean and variance of X are 1. E(X) = k 2. var(X) = k Proof In the simulation of the special distribution simulator, select the gamma distribution. Vary the shape parameter and note the size and location of the mean ± standard deviation bar. For selected values of k , run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. More generally, the moments can be expressed easily in terms of the gamma function: The moments of X are 1. E(X 2. E(X
a
n
) = Γ(a + k)/Γ(k) [n]
) =k
if a > −k
= k(k + 1) ⋯ (k + n − 1)
if n ∈ N
Proof Note also that E(X
a
) =∞
if a ≤ −k . We can now also compute the skewness and the kurtosis.
The skewness and kurtosis of X are 1. skew(X) =
2 √k
2. kurt(X) = 3 +
6 k
Proof In particular, note that k → ∞.
skew(X) → 0
and
kurt(X) → 3
as
k → ∞
5.8.3
. Note also that the excess kurtosis
kurt(X) − 3 → 0
as
https://stats.libretexts.org/@go/page/10174
In the simulation of the special distribution simulator, select the gamma distribution. Increase the shape parameter and note the shape of the density function in light of the previous results on skewness and kurtosis. For various values of k , run the simulation 1000 times and compare the empirical density function to the true probability density function. The following theorem gives the moment generating function. The moment generating function of X is given by E (e
tX
1 ) =
,
t 1 , f increases and then decreases, with mode at (k − 1)b . 4. If 0 < k ≤ 1 , f is concave upward. −−− − 5. If 1 < k ≤ 2 , f is concave downward and then upward, with inflection point at b (k − 1 + √k − 1 ) . −−− − 6. If k > 2 , f is concave upward, then downward, then upward again, with inflection points at b (k − 1 ± √k − 1 ) . In the simulation of the special distribution simulator, select the gamma distribution. Vary the shape and scale parameters and note the shape and location of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the true probability density function. Once again, the distribution function and the quantile function do not have simple, closed representations for most values of the shape parameter. However, the distribution function has a simple representation in terms of the incomplete and complete gamma functions.
5.8.4
https://stats.libretexts.org/@go/page/10174
The distribution function F of X is given by Γ(k, x/b) F (x) =
,
x ∈ (0, ∞)
(5.8.24)
Γ(k)
Proof Approximate values of the distribution and quanitle functions can be obtained from special distribution calculator, and from most mathematical and statistical software packages. Open the special distribution calculator. Vary the shape and scale parameters and note the shape and location of the distribution and quantile functions. For selected values of the parameters, find the median and the first and third quartiles.
Moments Suppose again that X has the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). The mean and variance of X are 1. E(X) = bk 2. var(X) = b
2
k
Proof In the special distribution simulator, select the gamma distribution. Vary the parameters and note the shape and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. The moments of X are 1. E(X 2. E(X
a
n
a
) = b Γ(a + k)/Γ(k) n
[n]
) =b k
for a > −k
n
= b k(k + 1) ⋯ (k + n − 1)
if n ∈ N
Proof Note also that E(X ) = ∞ if a ≤ −k . Recall that skewness and kurtosis are defined in terms of the standard score, and hence are unchanged by the addition of a scale parameter. a
The skewness and kurtosis of X are 1. skew(X) =
2 √k
2. kurt(X) = 3 +
6 k
The moment generating function of X is given by E (e
tX
1 ) =
1 k
,
t
2 , f increases and then decreases with mode at n − 2 . 4. If 0 < n ≤ 2 , f is concave downward. −−−− − 5. If 2 < n ≤ 4 , f is concave downward and then upward, with inflection point at n − 2 + √2n − 4 −−−− − 6. If n > 4 then f is concave upward then downward and then upward again, with inflection points at n − 2 ± √2n − 4 1 2
In the special distribution simulator, select the chi-square distribution. Vary n with the scroll bar and note the shape of the probability density function. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density function. The distribution function and the quantile function do not have simple, closed-form representations for most values of the parameter. However, the distribution function can be given in terms of the complete and incomplete gamma functions. Suppose that X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom. The distribution function F of X is given by Γ(n/2, x/2) F (x) =
,
x ∈ (0, ∞)
(5.9.2)
Γ(n/2)
Approximate values of the distribution and quantile functions can be obtained from the special distribution calculator, and from most mathematical and statistical software packages. In the special distribution calculator, select the chi-square distribution. Vary the parameter and note the shape of the probability density, distribution, and quantile functions. In each of the following cases, find the median, the first and third quartiles, and the interquartile range.
5.9.1
https://stats.libretexts.org/@go/page/10175
1. n = 1 2. n = 2 3. n = 5 4. n = 10
Moments The mean, variance, moments, and moment generating function of the chi-square distribution can be obtained easily from general results for the gamma distribution. If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom then 1. E(X) = n 2. var(X) = 2n In the simulation of the special distribution simulator, select the chi-square distribution. Vary n with the scroll bar and note the size and location of the mean ± standard deviation bar. For selected values of n , run the simulation 1000 times and compare the empirical moments to the distribution moments. The skewness and kurtosis of the chi-square distribution are given next. If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then − − −
1. skew(X) = 2√2/n 2. kurt(X) = 3 + 12/n Note that skew(X) → 0 and kurt(X) → 3 as n → ∞ . In particular, the excess kurtosis kurt(X) − 3 → 0 as n → ∞ . In the simulation of the special distribution simulator, select the chi-square distribution. Increase n with the scroll bar and note the shape of the probability density function in light of the previous results on skewness and kurtosis. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density function. The next result gives the general moments of the chi-square distribution. If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then for k > −n/2 , E (X
k
k
Γ(n/2 + k)
) =2
(5.9.3) Γ(n/2)
In particular, if k ∈ N
+
then E (X
k
k
) =2
n (
Note also E (X
k
) =∞
n )(
2
n + 1) ⋯ (
2
+ k − 1)
(5.9.4)
2
if k ≤ −n/2 .
If X has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, then X has moment generating function E (e
tX
1 ) =
1 n/2
(1 − 2t)
,
t
1 , g increases and then decreases with mode u = √n − 1 4. If 0 < n < 1 , g is concave upward.
−−−−−−−−−−−−−−− − −−−− − 1 [2n − 1 + √8n − 7 ]
5. If 1 ≤ n ≤ 2 , g is concave downward and then upward with inflection point at u = √
2
−−−−−−−−−−−−−−− − −−−− − 1 [2n − 1 ± √8n − 7 ]
6. If n > 2 , g is concave upward then downward then upward again with inflection points at u = √
2
Moments The raw moments of the chi distribution are easy to comput in terms of the gamma function. Suppose that U has the chi distribution with n ∈ (0, ∞) degrees of freedom. Then E(U
k
k/2
Γ[(n + k)/2]
) =2
,
k ∈ (0, ∞)
(5.9.15)
Γ(n/2)
Proof Curiously, the second moment is simply the degrees of freedom parameter. Suppose again that U has the chi distribution with n ∈ (0, ∞) degrees of freedom. Then 1/2 Γ[(n+1)/2]
1. E(U ) = 2 2. E(U
2
Γ(n/2)
) =n
3. var(U ) = n − 2
2
Γ [(n+1)/2] 2
Γ (n/2)
Proof
5.9.4
https://stats.libretexts.org/@go/page/10175
Relations The fundamental relationship of course is the one between the chi distribution and the chi-square distribution given in the definition. In turn, this leads to a fundamental relationship between the chi distribution and the normal distribution. Suppose that n ∈ N distribution. Then
+
and that
is a sequence of independent variables, each with the standard normal
(Z1 , Z2 , … , Zn )
− −−−−− −− −−−−−−− − U =√ Z
2
1
+Z
2
2
2
+ ⋯ + Zn
(5.9.19)
has the chi distribution with n degrees of freedom. Note that the random variable U in the last result is the standard Euclidean norm of (Z , Z , … , Z ), thought of as a vector in R . Note also that the chi distribution with 1 degree of freedom is the distribution of |Z|, the absolute value of a standard normal variable, which is known as the standard half-normal distribution. n
1
2
n
The Non-Central Chi-Square Distribution Much of the importance of the chi-square distribution stems from the fact that it is the distribution that governs the sum of squares of independent, standard normal variables. A natural generalization, and one that is important in statistical applications, is to consider the distribution of a sum of squares of independent normal variables, each with variance 1 but with different means. Suppose that n ∈ N and that (X , X , … , X ) is a sequence of independent variables, where X has the normal distribution with mean μ ∈ R and variance 1 for k ∈ {1, 2, … , n}. The distribution of Y = ∑ X is the non-central chisquare distribution with n degrees of freedom and non-centrality parameter λ = ∑ μ . +
1
2
n
k
k
n
2
k=1
k
Note that the degrees of freedom is a positive integer while the non-centrality parameter the degrees of freedom.
n
2
k=1
k
, but we will soon generalize
λ ∈ [0, ∞)
Distribution Functions Like the chi-square and chi distributions, the non-central chi-square distribution is a continuous distribution on (0, ∞). The probability density function and distribution function do not have simple, closed expressions, but there is a fascinating connection to the Poisson distribution. To set up the notation, let f and F denote the probability density and distribution functions of the chisquare distribution with k ∈ (0, ∞) degrees of freedom. Suppose that Y has the non-central chi-square distribution with n ∈ N degrees of freedom and non-centrality parameter λ ∈ [0, ∞). The following fundamental theorem gives the probability density function of Y as an infinite series, and shows that the distribution does in fact depend only on n and λ . k
k
+
The probability density function g of Y is given by ∞
g(y) = ∑ e
k
−λ/2
(λ/2)
fn+2k (y),
y ∈ (0, ∞)
(5.9.20)
k!
k=0
Proof k
(λ/2)
The function k ↦ e on N is the probability density function of the Poisson distribution with parameter λ/2. So it follows that if N has the Poisson distribution with parameter λ/2 and the conditional distribution of Y given N is chi-square with parameter n + 2N , then Y has the distribution discussed here—non-central chi-square with n degrees of freedom and noncentrality parameter λ . Moreover, it's clear that g is a valid probability density function for any n ∈ (0, ∞) , so we can generalize our definition a bit. −λ/2
k!
For n ∈ (0, ∞) and λ ∈ [0, ∞), the distribution with probability density function distribution with n degrees of freedom and non-centrality parameter λ .
g
above is the non-central chi-square
The distribution function G is given by ∞
G(y) = ∑ e k=0
k
−λ/2
(λ/2)
Fn+2k (y),
y ∈ (0, ∞)
(5.9.28)
k!
5.9.5
https://stats.libretexts.org/@go/page/10175
Proof
Moments In this discussion, we assume again that non-centrality parameter λ ∈ [0, ∞).
Y
has the non-central chi-square distribution with
n ∈ (0, ∞)
degrees of freedom and
The moment generating function M of Y is given by M (t) = E (e
tY
1
λt
) =
exp( n/2
),
t ∈ (−∞, 1/2)
(5.9.29)
1 − 2t
(1 − 2t)
Proof The mean and variance of Y are 1. E(Y ) = n + λ 2. var(Y ) = 2(n + 2λ) Proof The skewness and kurtosis of Y are 1. skew(Y ) = 2
n+3λ
3/2
3/2
(n+2λ)
2. kurt(Y ) = 3 + 12
n+4λ 2
(n+2λ)
Note that
skew(Y ) → 0
as
n → ∞
or as
λ → ∞
. Note also that the excess kurtosis is
kurt(Y ) − 3 = 12
n+4λ 2
. So
(n+2λ)
kurt(Y ) → 3
(the kurtosis of the normal distribution) as n → ∞ or as λ → ∞ .
Relations Trivially of course, the ordinary chi-square distribution is a special case of the non-central chi-square distribution, with noncentrality parameter 0. The most important relation is the orignal definition above. The non-central chi-square distribution with n ∈ N degrees of freedom and non-centrality parameter λ ∈ [0, ∞) is the distribution of the sum of the squares of n independent normal variables with variance 1 and whose means satisfy ∑ μ = λ . The next most important relation is the one that arose in the probability density function and was so useful for computing moments. We state this one again for emphasis. +
n
2
k=1
k
Suppose that N has the Poisson distribution with parameter λ/2, where λ ∈ (0, ∞) , and that the conditional distribution of Y given N is chi-square with n + 2N degrees of freedom, where n ∈ (0, ∞) . Then the (unconditional) distribution of Y is noncentral chi-square with n degree of freedom and non-centrality parameter λ . Proof As the asymptotic results for the skewness and kurtosis suggest, there is also a central limit theorem. Suppose that Y has the non-central chi-square distribution with n ∈ (0, ∞) degrees of freedom and non-centrality parameter λ ∈ (0, ∞) . Then the distribution of the standard score Y − (n + λ) − − − − − − − − √ 2(n + 2λ)
(5.9.33)
converges to the standard normal distribution as n → ∞ or as λ → ∞ .
Computational Exercises Suppose that a missile is fired at a target at the origin of a plane coordinate system, with units in meters. The missile lands at (X, Y ) where X and Y are independent and each has the normal distribution with mean 0 and variance 100. The missile will destroy the target if it lands within 20 meters of the target. Find the probability of this event. Answer
5.9.6
https://stats.libretexts.org/@go/page/10175
Suppose that X has the chi-square distribution with n = 18 degrees of freedom. For each of the following, compute the true value using the special distribution calculator and then compute the normal approximation. Compare the results. 1. P(15 < X < 20) 2. The 75th percentile of X. Answer This page titled 5.9: Chi-Square and Related Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.9.7
https://stats.libretexts.org/@go/page/10175
5.10: The Student t Distribution In this section we will study a distribution that has special importance in statistics. In particular, this distribution will arise in the study of a standardized version of the sample mean when the underlying distribution is normal.
Basic Theory Definition Suppose that Z has the standard normal distribution, V has the chi-squared distribution with and that Z and V are independent. Random variable
n ∈ (0, ∞)
degrees of freedom,
Z T =
(5.10.1)
− − − − √V /n
has the student t distribution with n degrees of freedom. The student t distribution is well defined for any n > 0 , but in practice, only positive integer values of distribution was first studied by William Gosset, who published under the pseudonym Student.
n
are of interest. This
Distribution Functions Suppose that T has the t distribution with probability density function f given by
n ∈ (0, ∞)
degrees of freedom. Then
− − − √nπ Γ(n/2)
has a continuous distribution on
R
with
−(n+1)/2
2
Γ[(n + 1)/2] f (t) =
T
t (1 +
)
,
t ∈ R
(5.10.2)
n
Proof The proof of this theorem provides a good way of thinking of the t distribution: the distribution arises when the variance of a mean 0 normal distribution is randomized in a certain way. In the special distribution simulator, select the student t distribution. Vary n and note the shape of the probability density function. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density function. The Student probability density function f with n ∈ (0, ∞) degrees of freedom has the following properties: 1. f is symmetric about t = 0 . 2. f is increasing and then decreasing with mode t = 0 . − − − − − − − − 3. f is concave upward, then downward, then upward again with inflection points at ±√n/(n + 1) . 4. f (t) → 0 as t → ∞ and as t → −∞ . In particular, the distribution is unimodal with mode and median at n → ∞.
t =0
. Note also that the inflection points converge to
±1
as
The distribution function and the quantile function of the general t distribution do not have simple, closed-form representations. Approximate values of these functions can be obtained from the special distribution calculator, and from most mathematical and statistical software packages. In the special distribution calculator, select the student distribution. Vary the parameter and note the shape of the probability density, distribution, and quantile functions. In each of the following cases, find the first and third quartiles: 1. n = 2 2. n = 5 3. n = 10
5.10.1
https://stats.libretexts.org/@go/page/10176
4. n = 20
Moments Suppose that T has a t distribution. The representation in the definition can be used to find the mean, variance and other moments of T . The main point to remember in the proofs that follow is that since V has the chi-square distribution with n degrees of freedom, E (V ) = ∞ if k ≤ − , while if k > − , k
n
n
2
2
E (V
k
k
Γ(k + n/2)
) =2
(5.10.6) Γ(n/2)
Suppose that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then 1. E(T ) is undefined if 0 < n ≤ 1 2. E(T ) = 0 if 1 < n < ∞ Proof Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom then 1. var(T ) is undefined if 0 < n ≤ 1 2. var(T ) = ∞ if 1 < n ≤ 2 3. var(T ) = if 2 < n < ∞ n
n−2
Proof Note that var(T ) → 1 as n → ∞ . In the simulation of the special distribution simulator, select the student t distribution. Vary n and note the location and shape of the mean ± standard deviation bar. For selected values of n , run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. Next we give the general moments of the t distribution. Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom and k ∈ N . Then 1. E (T ) is undefined if k is odd and k ≥ n 2. E (T ) = ∞ if k is even and k ≥ n 3. E (T ) = 0 if k is odd and k < n 4. If k is even and k < n then k k
k
k/2
E (T
k
n ) =
k/2
1 ⋅ 3 ⋯ (k − 1)Γ ((n − k)/2) k/2
2
n =
Γ(n/2)
k!Γ ((n − k)/2)
k
(5.10.7)
2 (k/2)!Γ(n/2)
Proof From the general moments, we can compute the skewness and kurtosis of T . Suppose again that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then 1. skew(T ) = 0 if n > 3 2. kurt(T ) = 3 + if n > 4 6
n−4
Proof Note that kurt(T ) → 3 as n → ∞ and hence the excess kurtosis kurt(T ) − 3 → 0 as n → ∞ . In the special distribution simulator, select the student t distribution. Vary n and note the shape of the probability density function in light of the previous results on skewness and kurtosis. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density function.
5.10.2
https://stats.libretexts.org/@go/page/10176
Since T does not have moments of all orders, there is no interval about 0 on which the moment generating function of T is finite. The characteristic function exists, of course, but has no simple representation, except in terms of special functions.
Relations The t distribution with 1 degree of freedom is known as the Cauchy distribution. The probability density function is 1 f (t) =
,
2
t ∈ R
(5.10.11)
π(1 + t )
The Cauchy distribution is named after Augustin Cauchy and is studied in more detail in a separate section. You probably noticed that, qualitatively at least, the t probability density function is very similar to the standard normal probability density function. The similarity is quantitative as well: Let f denote the t probability density function with n ∈ (0, ∞) degrees of freedom. Then for fixed t ∈ R , n
1 fn (t) →
− − √2π
e
−
1 2
2
t
as n → ∞
(5.10.12)
Proof Note that the function on the right is the probability density function of the standard normal distribution. We can also get convergence of the t distribution to the standard normal distribution from the basic random variable representation in the definition. Suppose that T has the t distribution with n ∈ N n
+
degrees of freedom, so that we can represent T as n
Z Tn =
(5.10.15)
− −− − √Vn /n
where Z has the standard normal distribution, V has the chi-square distribution with n degrees of freedom, and Z and V are independent. Then T → Z as n → ∞ with probability 1. n
n
n
Proof The t distribution has more probability in the tails, and consequently less probability near 0, compared to the standard normal distribution.
The Non-Central t Distribution One natural way to generalize the student t distribution is to replace the standard normal variable Z in the definition above with a normal variable having an arbitrary mean (but still unit variance). The reason this particular generalization is important is because it arises in hypothesis tests about the mean based on a random sample from the normal distribution, when the null hypothesis is false. For details see the sections on tests in the normal model and tests in the bivariate normal model in the chapter on Hypothesis Testing. Suppose that Z has the standard normal distribution, μ ∈ R , freedom, and that Z and V are independent. Random variable
V
has the chi-squared distribution with
n ∈ (0, ∞)
degrees of
Z +μ T =
(5.10.16)
− − − − √V /n
has the non-central student t distribution with n degrees of freedom and non-centrality parameter μ . The standard functions that characterize a distribution—the probability density function, distribution function, and quantile function—do not have simple representations for the non-central t distribution, but can only be expressed in terms of other special functions. Similarly, the moments do not have simple, closed form expressions either. For the beginning student of statistics, the most important fact is that the probability density function of the non-central t distribution is similar (but not exactly the same) as that of the standard t distribution (with the same degrees of freedom), but shifted and scaled. The density function is shifted to the right or left, depending on whether μ > 0 or μ < 0 .
5.10.3
https://stats.libretexts.org/@go/page/10176
Computational Exercises Suppose that T has the t distribution with n = 10 degrees of freedom. For each of the following, compute the true value using the special distribution calculator and then compute the normal approximation. Compare the results. 1. P(−0.8 < T < 1.2) 2. The 90th percentile of T . Answer This page titled 5.10: The Student t Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.10.4
https://stats.libretexts.org/@go/page/10176
5.11: The F Distribution In this section we will study a distribution that has special importance in statistics. In particular, this distribution arises from ratios of sums of squares when sampling from a normal distribution, and so is important in estimation and in the two-sample normal model and in hypothesis testing in the two-sample normal model.
Basic Theory Definition Suppose that U has the chi-square distribution with n ∈ (0, ∞) degrees of freedom, V has the chi-square distribution with d ∈ (0, ∞) degrees of freedom, and that U and V are independent. The distribution of U /n X =
(5.11.1) V /d
is the F distribution with n degrees of freedom in the numerator and d degrees of freedom in the denominator. The F distribution was first derived by George Snedecor, and is named in honor of Sir Ronald Fisher. In practice, the parameters n and d are usually positive integers, but this is not a mathematical requirement.
Distribution Functions Suppose that X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator. Then X has a continuous distribution on (0, ∞) with probability density function f given by Γ(n/2 + d/2) n
n/2−1
[(n/d)x]
f (x) =
,
x ∈ (0, ∞)
(5.11.2)
Γ(n/2)Γ(d/2) d [1 + (n/d)x] n/2+d/2
where Γ is the gamma function. Proof Recall that the beta function B can be written in terms of the gamma function by Γ(a)Γ(b) B(a, b) =
,
a, b ∈ (0, ∞)
(5.11.8)
Γ(a + b)
Hence the probability density function of the F distribution above can also be written as 1
n
n/2−1
[(n/d)x]
f (x) =
,
x ∈ (0, ∞)
(5.11.9)
B(n/2, d/2) d [1 + (n/d)x] n/2+d/2
When n ≥ 2 , the probability density function is defined at x = 0 , so the support interval is [0, ∞) is this case. In the special distribution simulator, select the F distribution. Vary the parameters with the scroll bars and note the shape of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. Both parameters influence the shape of the F probability density function, but some of the basic qualitative features depend only on the numerator degrees of freedom. For the remainder of this discussion, let f denote the F probability density function with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator. Probability density function f satisfies the following properties: 1. If 0 < n < 2 , f is decreasing with f (x) → ∞ as x ↓ 0. 2. If n = 2 , f is decreasing with mode at x = 0 . 3. If n > 2 , f increases and then decreases, with mode at x =
(n−2)d n(d+2)
5.11.1
.
https://stats.libretexts.org/@go/page/10351
Proof
Qualitatively, the second order properties of f also depend only on n, with transitions at n = 2 and n = 4. For n > 2, define
\[ x_1 = \frac{d}{n}\,\frac{(n-2)(d+4) - \sqrt{2(n-2)(d+4)(n+d)}}{(d+2)(d+4)} \tag{5.11.11} \]
\[ x_2 = \frac{d}{n}\,\frac{(n-2)(d+4) + \sqrt{2(n-2)(d+4)(n+d)}}{(d+2)(d+4)} \tag{5.11.12} \]
The probability density function f satisfies the following properties:
1. If 0 < n ≤ 2, f is concave upward.
2. If 2 < n ≤ 4, f is concave downward and then upward, with inflection point at x₂.
3. If n > 4, f is concave upward, then downward, then upward again, with inflection points at x₁ and x₂.
Proof
The distribution function and the quantile function do not have simple, closed-form representations. Approximate values of these functions can be obtained from the special distribution calculator and from most mathematical and statistical software packages.
In the special distribution calculator, select the F distribution. Vary the parameters and note the shape of the probability density function and the distribution function. In each of the following cases, find the median, the first and third quartiles, and the interquartile range (a code sketch follows the special cases below):
1. n = 5, d = 5
2. n = 5, d = 10
3. n = 10, d = 5
4. n = 10, d = 10
The general probability density function of the F distribution is a bit complicated, but it simplifies in a couple of special cases.
Special cases.
1. If n = 2,
\[ f(x) = \frac{1}{(1 + 2x/d)^{1+d/2}}, \quad x \in (0, \infty) \tag{5.11.14} \]
2. If n = d ∈ (0, ∞),
\[ f(x) = \frac{\Gamma(n)}{\Gamma^2(n/2)}\,\frac{x^{n/2-1}}{(1+x)^n}, \quad x \in (0, \infty) \tag{5.11.15} \]
3. If n = d = 2,
\[ f(x) = \frac{1}{(1+x)^2}, \quad x \in (0, \infty) \tag{5.11.16} \]
4. If n = d = 1,
\[ f(x) = \frac{1}{\pi \sqrt{x}\,(1+x)}, \quad x \in (0, \infty) \tag{5.11.17} \]
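For the quartile exercise above, here is a minimal scipy sketch (not part of the original text); scipy's F quantile function plays the role of the special distribution calculator.

```python
# Sketch of the quartile exercise above using scipy (not part of the original text).
from scipy.stats import f

for n, d in [(5, 5), (5, 10), (10, 5), (10, 10)]:
    q1, q2, q3 = f.ppf([0.25, 0.5, 0.75], n, d)
    print(f"n={n:2d}, d={d:2d}: Q1={q1:.4f}, median={q2:.4f}, "
          f"Q3={q3:.4f}, IQR={q3 - q1:.4f}")
```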
Moments
The random variable representation in the definition, along with the moments of the chi-square distribution, can be used to find the mean, variance, and other moments of the F distribution. For the remainder of this discussion, suppose that X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator.
Mean
1. E(X) = ∞ if 0 < d ≤ 2
2. E(X) = d/(d − 2) if d > 2
Proof
Thus, the mean depends only on the degrees of freedom in the denominator.
Variance
1. var(X) is undefined if 0 < d ≤ 2
2. var(X) = ∞ if 2 < d ≤ 4
3. If d > 4 then
\[ \operatorname{var}(X) = 2 \left(\frac{d}{d-2}\right)^2 \frac{n+d-2}{n(d-4)} \tag{5.11.19} \]
Proof
In the simulation of the special distribution simulator, select the F distribution. Vary the parameters with the scroll bars and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
General moments. For k > 0,
1. E(Xᵏ) = ∞ if 0 < d ≤ 2k
2. If d > 2k then
\[ E(X^k) = \left(\frac{d}{n}\right)^k \frac{\Gamma(n/2 + k)\,\Gamma(d/2 - k)}{\Gamma(n/2)\,\Gamma(d/2)} \tag{5.11.23} \]
Proof
If k ∈ ℕ, then using the fundamental identity of the gamma distribution and some algebra,
\[ E(X^k) = \left(\frac{d}{n}\right)^k \frac{n(n+2)\cdots[n+2(k-1)]}{(d-2)(d-4)\cdots(d-2k)} \tag{5.11.26} \]
From the general moment formula, we can compute the skewness and kurtosis of the F distribution.
Skewness and kurtosis
1. If d > 6,
\[ \operatorname{skew}(X) = \frac{(2n+d-2)\sqrt{8(d-4)}}{(d-6)\sqrt{n(n+d-2)}} \tag{5.11.27} \]
2. If d > 8,
\[ \operatorname{kurt}(X) = 3 + 12\,\frac{n(5d-22)(n+d-2) + (d-4)(d-2)^2}{n(d-6)(d-8)(n+d-2)} \tag{5.11.28} \]
Proof
Not surprisingly, the F distribution is positively skewed. Recall that the excess kurtosis is
\[ \operatorname{kurt}(X) - 3 = 12\,\frac{n(5d-22)(n+d-2) + (d-4)(d-2)^2}{n(d-6)(d-8)(n+d-2)} \tag{5.11.29} \]
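The moment formulas above are easy to check numerically. The following sketch (not part of the original text) compares them with scipy's built-in moment computations; note that scipy reports excess kurtosis, i.e. kurt(X) − 3.

```python
# Numerical check of the moment formulas (not part of the original text).
import math
from scipy.stats import f

n, d = 10, 20  # d > 8, so mean, variance, skewness, kurtosis all exist
mean, var, skew, kurt_excess = f.stats(n, d, moments='mvsk')

print(mean, d / (d - 2))                                        # E(X)
print(var, 2 * (d / (d - 2))**2 * (n + d - 2) / (n * (d - 4)))  # var(X)
skew_formula = ((2*n + d - 2) * math.sqrt(8 * (d - 4))
                / ((d - 6) * math.sqrt(n * (n + d - 2))))
print(skew, skew_formula)                                       # skew(X)
kurt_formula = 12 * (n*(5*d - 22)*(n + d - 2) + (d - 4)*(d - 2)**2) \
               / (n*(d - 6)*(d - 8)*(n + d - 2))
print(kurt_excess, kurt_formula)                                # kurt(X) - 3
```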
In the simulation of the special distribution simulator, select the F distribution. Vary the parameters with the scroll bar and note the shape of the probability density function in light of the previous results on skewness and kurtosis. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function.
Relations
The most important relationship is the one in the definition, between the F distribution and the chi-square distribution. In addition, the F distribution is related to several other special distributions.
Suppose that X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator. Then 1/X has the F distribution with d degrees of freedom in the numerator and n degrees of freedom in the denominator.
Proof
Suppose that T has the t distribution with n ∈ (0, ∞) degrees of freedom. Then X = T² has the F distribution with 1 degree of freedom in the numerator and n degrees of freedom in the denominator.
Proof
Our next relationship is between the F distribution and the exponential distribution.
Suppose that X and Y are independent random variables, each with the exponential distribution with rate parameter r ∈ (0, ∞). Then Z = X/Y has the F distribution with 2 degrees of freedom in both the numerator and denominator.
Proof
A simple transformation can change a variable with the F distribution into a variable with the beta distribution, and conversely.
Connections between the F distribution and the beta distribution.
1. If X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator, then
\[ Y = \frac{(n/d)X}{1 + (n/d)X} \tag{5.11.37} \]
has the beta distribution with left parameter n/2 and right parameter d/2.
2. If Y has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) then
\[ X = \frac{bY}{a(1 - Y)} \tag{5.11.38} \]
has the F distribution with 2a degrees of freedom in the numerator and 2b degrees of freedom in the denominator.
Proof
The F distribution is closely related to the beta prime distribution by a simple scale transformation.
Connections with the beta prime distribution.
1. If X has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator, then Y = (n/d)X has the beta prime distribution with parameters n/2 and d/2.
2. If Y has the beta prime distribution with parameters a ∈ (0, ∞) and b ∈ (0, ∞) then X = (b/a)Y has the F distribution with 2a degrees of freedom in the numerator and 2b degrees of freedom in the denominator.
Proof
The Non-Central F Distribution
The F distribution can be generalized in a natural way by replacing the ordinary chi-square variable in the numerator in the definition above with a variable having a non-central chi-square distribution. This generalization is important in the analysis of variance.
Suppose that U has the non-central chi-square distribution with n ∈ (0, ∞) degrees of freedom and non-centrality parameter λ ∈ [0, ∞), that V has the chi-square distribution with d ∈ (0, ∞) degrees of freedom, and that U and V are independent. The distribution of
\[ X = \frac{U/n}{V/d} \tag{5.11.43} \]
is the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and non-centrality parameter λ.
One of the most interesting and important results for the non-central chi-square distribution is that it is a Poisson mixture of ordinary chi-square distributions. This leads to a similar result for the non-central F distribution.
Suppose that N has the Poisson distribution with parameter λ/2, and that the conditional distribution of X given N is the F distribution with n + 2N degrees of freedom in the numerator and d degrees of freedom in the denominator, where λ ∈ [0, ∞) and n, d ∈ (0, ∞). Then X has the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and non-centrality parameter λ.
Proof
From the last result, we can express the probability density function and distribution function of the non-central F distribution as a series in terms of ordinary F density and distribution functions. To set up the notation, for j, k ∈ (0, ∞) let f_{j,k} be the probability density function and F_{j,k} the distribution function of the F distribution with j degrees of freedom in the numerator and k degrees of freedom in the denominator. For the rest of this discussion, λ ∈ [0, ∞) and n, d ∈ (0, ∞) as usual.
The probability density function g of the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and non-centrality parameter λ is given by
\[ g(x) = \sum_{k=0}^{\infty} e^{-\lambda/2}\,\frac{(\lambda/2)^k}{k!}\, f_{n+2k,\,d}(x), \quad x \in (0, \infty) \tag{5.11.44} \]
The distribution function G of the non-central F distribution with n degrees of freedom in the numerator, d degrees of freedom in the denominator, and non-centrality parameter λ is given by
\[ G(x) = \sum_{k=0}^{\infty} e^{-\lambda/2}\,\frac{(\lambda/2)^k}{k!}\, F_{n+2k,\,d}(x), \quad x \in (0, \infty) \tag{5.11.45} \]
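The Poisson-mixture series converges quickly and is easy to evaluate. The following sketch (not part of the original text) truncates the series and compares it with scipy's non-central F density; all functions used are standard scipy.stats APIs.

```python
# Sketch of the Poisson-mixture representation (not part of the original text).
import numpy as np
from scipy.stats import f, ncf, poisson

n, d, lam = 5.0, 10.0, 3.0
x = np.linspace(0.1, 5.0, 5)

# g(x) = sum_k e^{-lam/2} (lam/2)^k / k! * f_{n+2k,d}(x), truncated at k = 50
k = np.arange(51)
weights = poisson.pmf(k, lam / 2)
mixture = sum(w * f.pdf(x, n + 2 * ki, d) for ki, w in zip(k, weights))

print(mixture)
print(ncf.pdf(x, n, d, lam))  # scipy's non-central F density, for comparison
```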
5.12: The Lognormal Distribution
Basic Theory
Definition
Suppose that Y has the normal distribution with mean μ ∈ ℝ and standard deviation σ ∈ (0, ∞). Then X = e^Y has the lognormal distribution with parameters μ and σ.
1. The parameter σ is the shape parameter of the distribution.
2. The parameter e^μ is the scale parameter of the distribution.
If Z has the standard normal distribution then W = e^Z has the standard lognormal distribution.
So equivalently, if X has a lognormal distribution then ln X has a normal distribution, hence the name. The lognormal distribution is a continuous distribution on (0, ∞) and is used to model random quantities when the distribution is believed to be skewed, such as certain income and lifetime variables.
It's easy to write a general lognormal variable in terms of a standard lognormal variable. Suppose that Z has the standard normal distribution and let W = e^Z, so that W has the standard lognormal distribution. If μ ∈ ℝ and σ ∈ (0, ∞) then Y = μ + σZ has the normal distribution with mean μ and standard deviation σ, and hence X = e^Y has the lognormal distribution with parameters μ and σ. But
\[ X = e^Y = e^{\mu + \sigma Z} = e^{\mu} \left(e^Z\right)^{\sigma} = e^{\mu} W^{\sigma} \tag{5.12.1} \]
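The definition translates directly into simulation. The following sketch (not part of the original text) exponentiates normal variables and compares the result with scipy's lognormal distribution; scipy.stats.lognorm takes shape s = σ and scale = e^μ.

```python
# Illustrative check of the definition (not part of the original text).
import numpy as np
from scipy.stats import norm, lognorm

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0

Y = norm.rvs(mu, sigma, size=100_000, random_state=rng)
X = np.exp(Y)                        # X = e^Y is lognormal(mu, sigma)

dist = lognorm(s=sigma, scale=np.exp(mu))
print(np.median(X), dist.median())   # both should be close to exp(mu)
print(np.mean(X), dist.mean())       # both close to exp(mu + sigma^2/2)
```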
Distribution Functions
Suppose that X has the lognormal distribution with parameters μ ∈ ℝ and σ ∈ (0, ∞).
The probability density function f of X is given by
\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x} \exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right], \quad x \in (0, \infty) \tag{5.12.2} \]
1. f increases and then decreases with mode at x = exp(μ − σ²).
2. f is concave upward, then downward, then upward again, with inflection points at x = exp(μ − (3/2)σ² ± (1/2)σ√(σ² + 4)).
3. f(x) → 0 as x ↓ 0 and as x → ∞.
Proof
In the special distribution simulator, select the lognormal distribution. Vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the true probability density function.
Let Φ denote the standard normal distribution function, so that Φ⁻¹ is the standard normal quantile function. Recall that values of Φ and Φ⁻¹ can be obtained from the special distribution calculator, as well as standard mathematical and statistical software packages; in fact these functions are considered to be special functions in mathematics. The following two results show how to compute the lognormal distribution function and quantiles in terms of the standard normal distribution function and quantiles.
The distribution function F of X is given by
\[ F(x) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right), \quad x \in (0, \infty) \tag{5.12.5} \]
Proof
The quantile function of X is given by
\[ F^{-1}(p) = \exp\left[\mu + \sigma\,\Phi^{-1}(p)\right], \quad p \in (0, 1) \tag{5.12.7} \]
Proof
In the special distribution calculator, select the lognormal distribution. Vary the parameters and note the shape and location of the probability density function and the distribution function. With μ = 0 and σ = 1, find the median and the first and third quartiles.
Moments
The moments of the lognormal distribution can be computed from the moment generating function of the normal distribution. Once again, we assume that X has the lognormal distribution with parameters μ ∈ ℝ and σ ∈ (0, ∞).
For t ∈ ℝ,
\[ E(X^t) = \exp\left(\mu t + \frac{1}{2}\sigma^2 t^2\right) \tag{5.12.8} \]
Proof
In particular, the mean and variance of X are
1. E(X) = exp(μ + σ²/2)
2. var(X) = exp[2(μ + σ²)] − exp(2μ + σ²)
In the simulation of the special distribution simulator, select the lognormal distribution. Vary the parameters and note the shape and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical moments to the true moments.
From the general formula for the moments, we can also compute the skewness and kurtosis of the lognormal distribution.
The skewness and kurtosis of X are
1. skew(X) = (e^{σ²} + 2)√(e^{σ²} − 1)
2. kurt(X) = e^{4σ²} + 2e^{3σ²} + 3e^{2σ²} − 3
Proof
The fact that the skewness and kurtosis do not depend on μ is due to the fact that e^μ is a scale parameter. Recall that skewness and kurtosis are defined in terms of the standard score, and so are independent of location and scale parameters. Naturally, the lognormal distribution is positively skewed. Finally, note that the excess kurtosis is
\[ \operatorname{kurt}(X) - 3 = e^{4\sigma^2} + 2e^{3\sigma^2} + 3e^{2\sigma^2} - 6 \tag{5.12.10} \]
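These closed forms are easy to verify numerically. A minimal sketch (not part of the original text), again using scipy's lognorm with shape s = σ and scale = e^μ:

```python
# Numerical check of the lognormal moment results (not part of the original text).
import numpy as np
from scipy.stats import lognorm

mu, sigma = 0.0, 0.5
dist = lognorm(s=sigma, scale=np.exp(mu))
mean, var, skew, kurt_excess = dist.stats(moments='mvsk')

s2 = sigma**2
print(mean, np.exp(mu + s2 / 2))
print(var, np.exp(2 * (mu + s2)) - np.exp(2 * mu + s2))
print(skew, (np.exp(s2) + 2) * np.sqrt(np.exp(s2) - 1))
# scipy reports excess kurtosis, matching (5.12.10)
print(kurt_excess, np.exp(4*s2) + 2*np.exp(3*s2) + 3*np.exp(2*s2) - 6)
```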
Even though the lognormal distribution has finite moments of all orders, the moment generating function is infinite at any positive number. This property is one of the reasons for the fame of the lognormal distribution.
E(e^{tX}) = ∞ for every t > 0.
Proof
Related Distributions
The most important relations are the ones between the lognormal and normal distributions in the definition: if X has a lognormal distribution then ln X has a normal distribution; conversely, if Y has a normal distribution then e^Y has a lognormal distribution. The lognormal distribution is also a scale family.
Suppose that X has the lognormal distribution with parameters μ ∈ ℝ and σ ∈ (0, ∞) and that c ∈ (0, ∞). Then cX has the lognormal distribution with parameters μ + ln c and σ.
Proof
The reciprocal of a lognormal variable is also lognormal.
If X has the lognormal distribution with parameters μ ∈ ℝ and σ ∈ (0, ∞) then 1/X has the lognormal distribution with parameters −μ and σ.
Proof
The lognormal distribution is closed under non-zero powers of the underlying variable. In particular, this generalizes the previous result.
Suppose that X has the lognormal distribution with parameters μ ∈ ℝ and σ ∈ (0, ∞) and that a ∈ ℝ ∖ {0}. Then Xᵃ has the lognormal distribution with parameters aμ and |a|σ.
Proof
Since the normal distribution is closed under sums of independent variables, it's not surprising that the lognormal distribution is closed under products of independent variables.
Suppose that n ∈ ℕ₊ and that (X₁, X₂, …, Xₙ) is a sequence of independent variables, where Xᵢ has the lognormal distribution with parameters μᵢ ∈ ℝ and σᵢ ∈ (0, ∞) for i ∈ {1, 2, …, n}. Then ∏_{i=1}^{n} Xᵢ has the lognormal distribution with parameters μ and σ, where μ = ∑_{i=1}^{n} μᵢ and σ² = ∑_{i=1}^{n} σᵢ².
Proof
Finally, the lognormal distribution belongs to the family of general exponential distributions.
Suppose that X has the lognormal distribution with parameters μ ∈ ℝ and σ ∈ (0, ∞). The distribution of X is a 2-parameter exponential family with natural parameters and natural statistics, respectively, given by
1. (−1/2σ², μ/σ²)
2. (ln²(X), ln X)
Proof
Computational Exercises
Suppose that the income X of a randomly chosen person in a certain population (in $1000 units) has the lognormal distribution with parameters μ = 2 and σ = 1. Find P(X > 20) (a code sketch follows below).
Answer
Suppose again that the income X of a randomly chosen person in a certain population (in $1000 units) has the lognormal distribution with parameters μ = 2 and σ = 1. Find each of the following:
1. E(X)
2. var(X)
Answer
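A minimal sketch of the income exercises (not part of the original text), using only the formulas established in this section:

```python
# Sketch of the income exercises above (not part of the original text).
import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 1.0

# P(X > 20) = 1 - Phi((ln 20 - mu) / sigma)
print(1 - norm.cdf((np.log(20) - mu) / sigma))                 # about 0.1597

# E(X) = exp(mu + sigma^2/2)
print(np.exp(mu + sigma**2 / 2))                               # about 12.18

# var(X) = exp[2(mu + sigma^2)] - exp(2 mu + sigma^2)
print(np.exp(2 * (mu + sigma**2)) - np.exp(2*mu + sigma**2))   # about 255.02
```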
5.13: The Folded Normal Distribution
The General Folded Normal Distribution
Introduction
The folded normal distribution is the distribution of the absolute value of a random variable with a normal distribution. As has been emphasized before, the normal distribution is perhaps the most important in probability and is used to model an incredible variety of random phenomena. Since one may only be interested in the magnitude of a normally distributed variable, the folded normal distribution arises in a very natural way. The name stems from the fact that the probability measure of the normal distribution on (−∞, 0] is “folded over” to [0, ∞). Here is the formal definition:
Suppose that Y has a normal distribution with mean μ ∈ ℝ and standard deviation σ ∈ (0, ∞). Then X = |Y| has the folded normal distribution with parameters μ and σ.
So in particular, the folded normal distribution is a continuous distribution on [0, ∞).
Distribution Functions
Suppose that Z has the standard normal distribution. Recall that Z has probability density function φ given by
\[ \phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad z \in \mathbb{R} \tag{5.13.1} \]
and distribution function Φ given by
\[ \Phi(z) = \int_{-\infty}^{z} \phi(x)\, dx = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx, \quad z \in \mathbb{R} \tag{5.13.2} \]
The standard normal distribution is so important that Φ is considered a special function and can be computed using most mathematical and statistical software. If μ ∈ ℝ and σ ∈ (0, ∞), then Y = μ + σZ has the normal distribution with mean μ and standard deviation σ, and therefore X = |Y| = |μ + σZ| has the folded normal distribution with parameters μ and σ. For the remainder of this discussion we assume that X has this folded normal distribution.
X has distribution function F given by
\[ F(x) = \Phi\left(\frac{x - \mu}{\sigma}\right) - \Phi\left(\frac{-x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right) + \Phi\left(\frac{x + \mu}{\sigma}\right) - 1 \tag{5.13.3} \]
\[ F(x) = \int_0^x \frac{1}{\sigma\sqrt{2\pi}} \left\{ \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right] + \exp\left[-\frac{1}{2}\left(\frac{y + \mu}{\sigma}\right)^2\right] \right\} dy, \quad x \in [0, \infty) \tag{5.13.4} \]
Proof
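The two-Φ form of the distribution function is easy to check against scipy's folded normal. The following sketch (not part of the original text) assumes scipy's parameterization for foldnorm, namely shape c = μ/σ and scale = σ.

```python
# Check of the distribution function formula (not part of the original text).
import numpy as np
from scipy.stats import norm, foldnorm

mu, sigma = 1.0, 2.0
x = np.linspace(0.5, 6.0, 4)

# F(x) = Phi((x - mu)/sigma) + Phi((x + mu)/sigma) - 1
F = norm.cdf((x - mu) / sigma) + norm.cdf((x + mu) / sigma) - 1
print(F)
print(foldnorm.cdf(x, c=mu / sigma, scale=sigma))  # should agree
```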
We cannot compute the quantile function F⁻¹ in closed form, but values of this function can be approximated.
Open the special distribution calculator and select the folded normal distribution, and set the view to CDF. Vary the parameters and note the shape of the distribution function. For selected values of the parameters, compute the median and the first and third quartiles.
X has probability density function f given by
\[ f(x) = \frac{1}{\sigma}\left[\phi\left(\frac{x - \mu}{\sigma}\right) + \phi\left(\frac{x + \mu}{\sigma}\right)\right] \tag{5.13.7} \]
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\left\{\exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] + \exp\left[-\frac{1}{2}\left(\frac{x + \mu}{\sigma}\right)^2\right]\right\}, \quad x \in [0, \infty) \tag{5.13.8} \]
Proof
Open the special distribution simulator and select the folded normal distribution. Vary the parameters μ and σ and note the shape of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the true probability density function.
Note that the folded normal distribution is unimodal for some values of the parameters and decreasing for others. Note also that μ is not a location parameter, nor is σ a scale parameter; both influence the shape of the probability density function.
Moments
We cannot compute the mean of the folded normal distribution in closed form, but the mean can at least be given in terms of Φ. Once again, we assume that X has the folded normal distribution with parameters μ ∈ ℝ and σ ∈ (0, ∞).
The first two moments of X are
1. E(X) = μ[1 − 2Φ(−μ/σ)] + σ√(2/π) exp(−μ²/2σ²)
2. E(X²) = μ² + σ²
Proof
In particular, the variance of X is
\[ \operatorname{var}(X) = \mu^2 + \sigma^2 - \left\{\mu\left[1 - 2\Phi\left(-\frac{\mu}{\sigma}\right)\right] + \sigma\sqrt{\frac{2}{\pi}}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)\right\}^2 \tag{5.13.12} \]
Open the special distribution simulator and select the folded normal distribution. Vary the parameters and note the size and location of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
Related Distributions
The most important relation is the one between the folded normal distribution and the normal distribution in the definition: if Y has a normal distribution then X = |Y| has a folded normal distribution. The folded normal distribution is also related to itself through a symmetry property that is perhaps not completely obvious from the initial definition:
For μ ∈ ℝ and σ ∈ (0, ∞), the folded normal distribution with parameters −μ and σ is the same as the folded normal distribution with parameters μ and σ.
Proof 1
Proof 2
The folded normal distribution is also closed under scale transformations.
Suppose that X has the folded normal distribution with parameters μ ∈ ℝ and σ ∈ (0, ∞) and that b ∈ (0, ∞). Then bX has the folded normal distribution with parameters bμ and bσ.
Proof
Distribution Functions For our next discussion, suppose that X has the half-normal distribution with parameter σ ∈ (0, ∞). Once again, denote the distribution function and quantile function, respectively, of the standard normal distribution. The distribution function F and quantile function F
−1
Φ
and
−1
Φ
of X are
5.13.2
https://stats.libretexts.org/@go/page/10353
x
x F (x) = 2Φ ( σ
F
−1
1
)−1 = ∫
− − 2
√
σ
0
y
2
exp(− π
2σ
2
) dy,
x ∈ [0, ∞)
(5.13.13)
1 +p
−1
(p) = σ Φ
(
),
p ∈ [0, 1)
(5.13.14)
2
Proof Open the special distribution calculator and select the folded normal distribution. Select CDF view and keep μ = 0 . Vary σ and note the shape of the CDF. For various values of σ, compute the median and the first and third quartiles. The probability density function f of X is given by 2
x
f (x) =
ϕ( σ
− − 2
1
√
) = σ
σ
2
x exp(−
π
2σ 2
),
x ∈ [0, ∞)
(5.13.15)
1. f is decreasing with mode at x = 0 . 2. f is concave downward and then upward, with inflection point at x = σ . Proof Open the special distribution simulator and select the folded normal distribution. Keep μ = 0 and vary σ, and note the shape of the probability density function. For selected values of σ, run the simulation 1000 times and compare the empricial density function to the true probability density function.
Moments The moments of the half-normal distribution can be computed explicitly. Once again we assume that distribution with parameter σ ∈ (0, ∞).
X
has the half-normal
For n ∈ N E(X
2n
) =σ
2n
(2n)! (5.13.16)
n
n!2
− − 2 2n+1 2n+1 n E(X ) =σ 2 √ n! π
(5.13.17)
Proof − − −
In particular, we have E(X) = σ √2/π and var(X) = σ
2
(1 − 2/π)
Open the special distribution simulator and select the folded normal distribution. Keep μ = 0 and vary σ, and note the size and location of the mean±standard deviation bar. For selected values of σ, run the simulation 1000 times and compare the mean and standard deviation to the true mean and standard deviation. Next are the skewness and kurtosis of the half-normal distribution. Skewness and kurtosis 1. The skewness of X is − − − √2/π(4/π − 1) skew(X) =
≈ 0.99527
(5.13.19)
≈ 3.8692
(5.13.20)
3/2
(1 − 2/π)
2. The kurtosis of X is 3 − 4/π − 12/π kurt(X) =
(1 − 2/π)2
2
Proof
5.13.3
https://stats.libretexts.org/@go/page/10353
Related Distributions Once again, the most important relation is the one in the definition: If Y has a normal distribution with mean 0 then X = |Y | has a half-normal distribution. Since the half normal distribution is a scale family, it is trivially closed under scale transformations. Suppose that X has the half-normal distribution with parameter distribution with parameter bσ.
σ
and that
b ∈ (0, ∞)
. Then
bX
has the half-normal
Proof The standard half-normal distribution is also a special case of the chi distribution. The standard half-normal distribution is the chi distribution with 1 degree of freedom. Proof This page titled 5.13: The Folded Normal Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.14: The Rayleigh Distribution
The Rayleigh distribution, named for John William Strutt, Lord Rayleigh, is the distribution of the magnitude of a two-dimensional random vector whose coordinates are independent, identically distributed, mean 0 normal variables. The distribution has a number of applications in settings where magnitudes of normal variables are important.
The Standard Rayleigh Distribution
Definition
Suppose that Z₁ and Z₂ are independent random variables with standard normal distributions. The magnitude R = √(Z₁² + Z₂²) of the vector (Z₁, Z₂) has the standard Rayleigh distribution.
So in this definition, (Z₁, Z₂) has the standard bivariate normal distribution.
Distribution Functions
We give five functions that completely characterize the standard Rayleigh distribution: the distribution function, the probability density function, the quantile function, the reliability function, and the failure rate function. For the remainder of this discussion, we assume that R has the standard Rayleigh distribution.
R has distribution function G given by G(x) = 1 − e^{−x²/2} for x ∈ [0, ∞).
Proof
R has probability density function g given by g(x) = x e^{−x²/2} for x ∈ [0, ∞).
1. g increases and then decreases with mode at x = 1.
2. g is concave downward and then upward with inflection point at x = √3.
Proof
Open the Special Distribution Simulator and select the Rayleigh distribution. Keep the default parameter value and note the shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability density function.
R has quantile function G⁻¹ given by G⁻¹(p) = √(−2 ln(1 − p)) for p ∈ [0, 1). In particular, the quartiles of R are
1. q₁ = √(4 ln 2 − 2 ln 3) ≈ 0.7585, the first quartile
2. q₂ = √(2 ln 2) ≈ 1.1774, the median
3. q₃ = √(4 ln 2) ≈ 1.6651, the third quartile
Proof
Open the Special Distribution Calculator and select the Rayleigh distribution. Keep the default parameter value. Note the shape and location of the distribution function. Compute selected values of the distribution function and the quantile function.
R has reliability function G^c given by G^c(x) = e^{−x²/2} for x ∈ [0, ∞).
Proof
R has failure rate function h given by h(x) = x for x ∈ [0, ∞). In particular, R has increasing failure rate.
Proof
Moments
Once again we assume that R has the standard Rayleigh distribution. We can express the moment generating function of R in terms of the standard normal distribution function Φ. Recall that Φ is so commonly used that it is a special function of mathematics.
R has moment generating function m given by
\[ m(t) = E\left(e^{tR}\right) = 1 + \sqrt{2\pi}\, t\, e^{t^2/2}\, \Phi(t), \quad t \in \mathbb{R} \tag{5.14.3} \]
Proof
The mean and variance of R are
1. E(R) = √(π/2) ≈ 1.2533
2. var(R) = 2 − π/2
Proof
Numerically, E(R) ≈ 1.2533 and sd(R) ≈ 0.6551.
Open the Special Distribution Simulator and select the Rayleigh distribution. Keep the default parameter value. Note the size and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
The general moments of R can be expressed in terms of the gamma function Γ.
E(Rⁿ) = 2^{n/2} Γ(1 + n/2) for n ∈ ℕ.
Proof
Of course, the formula for the general moments gives an alternate derivation of the mean and variance above, since Γ(3/2) = √π/2 and Γ(2) = 1. On the other hand, the moment generating function can also be used to derive the formula for the general moments.
The skewness and kurtosis of R are
1. skew(R) = 2√π (π − 3)/(4 − π)^{3/2} ≈ 0.6311
2. kurt(R) = (32 − 3π²)/(4 − π)² ≈ 3.2451
Proof
Related Distributions
The fundamental connection between the standard Rayleigh distribution and the standard normal distribution is given in the very definition of the standard Rayleigh, as the distribution of the magnitude of a point with independent, standard normal coordinates.
Connections to the chi-square distribution.
1. If R has the standard Rayleigh distribution then R² has the chi-square distribution with 2 degrees of freedom.
2. If V has the chi-square distribution with 2 degrees of freedom then √V has the standard Rayleigh distribution.
Proof
Recall also that the chi-square distribution with 2 degrees of freedom is the same as the exponential distribution with scale parameter 2. Since the quantile function is in closed form, the standard Rayleigh distribution can be simulated by the random quantile method.
Connections between the standard Rayleigh distribution and the standard uniform distribution.
1. If U has the standard uniform distribution (a random number) then R = G⁻¹(U) = √(−2 ln(1 − U)) has the standard Rayleigh distribution.
2. If R has the standard Rayleigh distribution then U = G(R) = 1 − exp(−R²/2) has the standard uniform distribution.
In part (a), note that 1 − U has the same distribution as U (the standard uniform). Hence R = √(−2 ln U) also has the standard Rayleigh distribution.
Open the random quantile simulator and select the Rayleigh distribution with the default parameter value (standard). Run the simulation 1000 times and compare the empirical density function to the true density function.
There is another connection with the uniform distribution that leads to the most common method of simulating a pair of independent standard normal variables. We have seen this before, but it's worth repeating. The result is closely related to the definition of the standard Rayleigh variable as the magnitude of a standard bivariate normal pair, but with the addition of the polar coordinate angle.
Suppose that R has the standard Rayleigh distribution, Θ is uniformly distributed on [0, 2π), and that R and Θ are independent. Let Z = R cos Θ and W = R sin Θ. Then (Z, W) has the standard bivariate normal distribution.
Proof
The General Rayleigh Distribution
Definition
The standard Rayleigh distribution is generalized by adding a scale parameter.
If R has the standard Rayleigh distribution and b ∈ (0, ∞) then X = bR has the Rayleigh distribution with scale parameter b.
Equivalently, the Rayleigh distribution is the distribution of the magnitude of a two-dimensional vector whose components are independent, identically distributed, mean 0 normal variables.
If U₁ and U₂ are independent normal variables with mean 0 and standard deviation σ ∈ (0, ∞) then X = √(U₁² + U₂²) has the Rayleigh distribution with scale parameter σ.
Proof
Distribution Functions
In this section, we assume that X has the Rayleigh distribution with scale parameter b ∈ (0, ∞).
X has distribution function F given by F(x) = 1 − exp(−x²/2b²) for x ∈ [0, ∞).
Proof
X has probability density function f given by f(x) = (x/b²) exp(−x²/2b²) for x ∈ [0, ∞).
1. f increases and then decreases with mode at x = b.
2. f is concave downward and then upward with inflection point at x = √3 b.
Proof
Open the Special Distribution Simulator and select the Rayleigh distribution. Vary the scale parameter and note the shape and location of the probability density function. For various values of the scale parameter, run the simulation 1000 times and compare the empirical density function to the probability density function.
X has quantile function F⁻¹ given by F⁻¹(p) = b√(−2 ln(1 − p)) for p ∈ [0, 1). In particular, the quartiles of X are
1. q₁ = b√(4 ln 2 − 2 ln 3), the first quartile
2. q₂ = b√(2 ln 2), the median
3. q₃ = b√(4 ln 2), the third quartile
Proof
Open the Special Distribution Calculator and select the Rayleigh distribution. Vary the scale parameter and note the location and shape of the distribution function. For various values of the scale parameter, compute selected values of the distribution function and the quantile function.
X has reliability function F^c given by F^c(x) = exp(−x²/2b²) for x ∈ [0, ∞).
Proof
X has failure rate function h given by h(x) = x/b² for x ∈ [0, ∞). In particular, X has increasing failure rate.
Proof
Moments
Again, we assume that X has the Rayleigh distribution with scale parameter b, and recall that Φ denotes the standard normal distribution function.
X has moment generating function M given by
\[ M(t) = E\left(e^{tX}\right) = 1 + \sqrt{2\pi}\, bt \exp\left(\frac{b^2 t^2}{2}\right) \Phi(bt), \quad t \in \mathbb{R} \tag{5.14.10} \]
Proof
The mean and variance of X are
1. E(X) = b√(π/2)
2. var(X) = b²(2 − π/2)
Proof
Open the Special Distribution Simulator and select the Rayleigh distribution. Vary the scale parameter and note the size and location of the mean ± standard deviation bar. For various values of the scale parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
Again, the general moments can be expressed in terms of the gamma function Γ.
E(Xⁿ) = bⁿ 2^{n/2} Γ(1 + n/2) for n ∈ ℕ.
Proof
The skewness and kurtosis of X are
1. skew(X) = 2√π (π − 3)/(4 − π)^{3/2} ≈ 0.6311
2. kurt(X) = (32 − 3π²)/(4 − π)² ≈ 3.2451
Proof
Related Distributions
The fundamental connection between the Rayleigh distribution and the normal distribution is the definition, and of course, is the primary reason that the Rayleigh distribution is special in the first place. By construction, the Rayleigh distribution is a scale family, and so is closed under scale transformations.
If X has the Rayleigh distribution with scale parameter b ∈ (0, ∞) and if c ∈ (0, ∞) then cX has the Rayleigh distribution with scale parameter bc.
The Rayleigh distribution is a special case of the Weibull distribution.
The Rayleigh distribution with scale parameter b ∈ (0, ∞) is the Weibull distribution with shape parameter 2 and scale parameter √2 b.
The following result generalizes the connection between the standard Rayleigh and chi-square distributions.
If X has the Rayleigh distribution with scale parameter b ∈ (0, ∞) then X² has the exponential distribution with scale parameter 2b².
Proof
Since the quantile function is in closed form, the Rayleigh distribution can be simulated by the random quantile method. Suppose that b ∈ (0, ∞).
1. If U has the standard uniform distribution (a random number) then X = F⁻¹(U) = b√(−2 ln(1 − U)) has the Rayleigh distribution with scale parameter b.
2. If X has the Rayleigh distribution with scale parameter b then U = F(X) = 1 − exp(−X²/2b²) has the standard uniform distribution.
In part (a), note that 1 − U has the same distribution as U (the standard uniform). Hence X = b√(−2 ln U) also has the Rayleigh distribution with scale parameter b.
Open the random quantile simulator and select the Rayleigh distribution. For selected values of the scale parameter, run the simulation 1000 times and compare the empirical density function to the true density function.
Finally, the Rayleigh distribution is a member of the general exponential family.
If X has the Rayleigh distribution with scale parameter b ∈ (0, ∞) then X has a one-parameter exponential distribution with natural parameter −1/b² and natural statistic X²/2.
Proof
5.15: The Maxwell Distribution
The Maxwell distribution, named for James Clerk Maxwell, is the distribution of the magnitude of a three-dimensional random vector whose coordinates are independent, identically distributed, mean 0 normal variables. The distribution has a number of applications in settings where magnitudes of normal variables are important, particularly in physics. It is also called the Maxwell-Boltzmann distribution in honor also of Ludwig Boltzmann. The Maxwell distribution is closely related to the Rayleigh distribution, which governs the magnitude of a two-dimensional random vector whose coordinates are independent, identically distributed, mean 0 normal variables.
The Standard Maxwell Distribution
Definition
Suppose that Z₁, Z₂, and Z₃ are independent random variables with standard normal distributions. The magnitude R = √(Z₁² + Z₂² + Z₃²) of the vector (Z₁, Z₂, Z₃) has the standard Maxwell distribution.
So in the context of the definition, (Z₁, Z₂, Z₃) has the standard trivariate normal distribution. The Maxwell distribution is a continuous distribution on [0, ∞).
Distribution Functions
In this discussion, we assume that R has the standard Maxwell distribution. The distribution function of R can be expressed in terms of the standard normal distribution function Φ. Recall that Φ occurs so frequently that it is considered a special function in mathematics.
R has distribution function G given by
\[ G(x) = 2\Phi(x) - \sqrt{\frac{2}{\pi}}\, x\, e^{-x^2/2} - 1, \quad x \in [0, \infty) \tag{5.15.1} \]
Proof
R has probability density function g given by
\[ g(x) = \sqrt{\frac{2}{\pi}}\, x^2\, e^{-x^2/2}, \quad x \in [0, \infty) \tag{5.15.4} \]
1. g increases and then decreases with mode at x = √2.
2. g is concave upward, then downward, then upward again, with inflection points at x₁ = √((5 − √17)/2) ≈ 0.6622 and x₂ = √((5 + √17)/2) ≈ 2.1358.
Proof
Open the Special Distribution Simulator and select the Maxwell distribution. Keep the default parameter value and note the shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability density function.
The quantile function has no simple closed-form expression.
Open the Special Distribution Calculator and select the Maxwell distribution. Keep the default parameter value. Find approximate values of the median and the first and third quartiles.
Moments
Suppose again that R has the standard Maxwell distribution. The moment generating function of R, like the distribution function, can be expressed in terms of the standard normal distribution function Φ.
R has moment generating function m given by
\[ m(t) = E\left(e^{tR}\right) = \sqrt{\frac{2}{\pi}}\, t + 2(1 + t^2)\, e^{t^2/2}\, \Phi(t), \quad t \in \mathbb{R} \tag{5.15.5} \]
Proof
The mean and variance of R can be found from the moment generating function, but direct computations are also easy.
The mean and variance of R are
1. E(R) = 2√(2/π)
2. var(R) = 3 − 8/π
Proof
Numerically, E(R) ≈ 1.5958 and sd(R) ≈ 0.6734.
Open the Special Distribution Simulator and select the Maxwell distribution. Keep the default parameter value. Note the size and location of the mean ± standard deviation bar. Run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
The general moments of R can be expressed in terms of the gamma function Γ.
For n ∈ ℕ₊,
\[ E(R^n) = \frac{2^{n/2+1}}{\sqrt{\pi}}\, \Gamma\left(\frac{n+3}{2}\right) \tag{5.15.13} \]
Proof
Of course, the formula for the general moments gives an alternate derivation for the mean and variance above, since Γ(2) = 1 and Γ(5/2) = 3√π/4. On the other hand, the moment generating function can also be used to derive the formula for the general moments. Finally, we give the skewness and kurtosis of R.
The skewness and kurtosis of R are
1. skew(R) = 2√2 (16 − 5π)/(3π − 8)^{3/2} ≈ 0.4857
2. kurt(R) = (15π² + 16π − 192)/(3π − 8)² ≈ 3.1082
Proof
Related Distributions
The fundamental connection between the standard Maxwell distribution and the standard normal distribution is given in the very definition of the standard Maxwell, as the distribution of the magnitude of a vector in ℝ³ with independent, standard normal coordinates.
Connections to the chi-square distribution.
1. If R has the standard Maxwell distribution then R² has the chi-square distribution with 3 degrees of freedom.
2. If V has the chi-square distribution with 3 degrees of freedom then √V has the standard Maxwell distribution.
Proof
Equivalently, the Maxwell distribution is simply the chi distribution with 3 degrees of freedom.
The General Maxwell Distribution
Definition
The standard Maxwell distribution is generalized by adding a scale parameter.
If R has the standard Maxwell distribution and b ∈ (0, ∞) then X = bR has the Maxwell distribution with scale parameter b.
Equivalently, the Maxwell distribution is the distribution of the magnitude of a three-dimensional vector whose components are independent, identically distributed, mean 0 normal variables.
If U₁, U₂, and U₃ are independent normal variables with mean 0 and standard deviation σ ∈ (0, ∞) then X = √(U₁² + U₂² + U₃²) has the Maxwell distribution with scale parameter σ.
Proof
Distribution Functions
In this section, we assume that X has the Maxwell distribution with scale parameter b ∈ (0, ∞). We can give the distribution function of X in terms of the standard normal distribution function Φ.
X has distribution function F given by
\[ F(x) = 2\Phi\left(\frac{x}{b}\right) - \sqrt{\frac{2}{\pi}}\, \frac{x}{b} \exp\left(-\frac{x^2}{2b^2}\right) - 1, \quad x \in [0, \infty) \tag{5.15.15} \]
Proof
X has probability density function f given by
\[ f(x) = \sqrt{\frac{2}{\pi}}\, \frac{1}{b^3}\, x^2 \exp\left(-\frac{x^2}{2b^2}\right), \quad x \in [0, \infty) \tag{5.15.16} \]
1. f increases and then decreases with mode at x = b√2.
2. f is concave upward, then downward, then upward again, with inflection points at x = b√((5 ± √17)/2).
Proof
Open the Special Distribution Simulator and select the Maxwell distribution. Vary the scale parameter and note the shape and location of the probability density function. For various values of the scale parameter, run the simulation 1000 times and compare the empirical density function to the probability density function.
Again, the quantile function does not have a simple, closed-form expression.
Open the Special Distribution Calculator and select the Maxwell distribution. For various values of the scale parameter, compute the median and the first and third quartiles.
Moments
Again, we assume that X has the Maxwell distribution with scale parameter b ∈ (0, ∞). As before, the moment generating function of X can be written in terms of the standard normal distribution function Φ.
X has moment generating function M given by
\[ M(t) = E\left(e^{tX}\right) = \sqrt{\frac{2}{\pi}}\, bt + 2(1 + b^2 t^2) \exp\left(\frac{b^2 t^2}{2}\right) \Phi(bt), \quad t \in \mathbb{R} \tag{5.15.17} \]
Proof
The mean and variance of X are
1. E(X) = 2b√(2/π)
2. var(X) = b²(3 − 8/π)
Proof
Open the Special Distribution Simulator and select the Maxwell distribution. Vary the scale parameter and note the size and location of the mean ± standard deviation bar. For various values of the scale parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
As before, the general moments can be expressed in terms of the gamma function Γ. For n ∈ ℕ,
\[ E(X^n) = b^n\, \frac{2^{n/2+1}}{\sqrt{\pi}}\, \Gamma\left(\frac{n+3}{2}\right) \tag{5.15.18} \]
Proof
Finally, the skewness and kurtosis are unchanged.
The skewness and kurtosis of X are
1. skew(X) = 2√2 (16 − 5π)/(3π − 8)^{3/2} ≈ 0.4857
2. kurt(X) = (15π² + 16π − 192)/(3π − 8)² ≈ 3.1082
Proof
Related Distributions
The fundamental connection between the Maxwell distribution and the normal distribution is given in the definition, and of course, is the primary reason that the Maxwell distribution is special in the first place. By construction, the Maxwell distribution is a scale family, and so is closed under scale transformations.
If X has the Maxwell distribution with scale parameter b ∈ (0, ∞) and if c ∈ (0, ∞) then cX has the Maxwell distribution with scale parameter bc.
Proof
The Maxwell distribution is a generalized exponential distribution.
If X has the Maxwell distribution with scale parameter b ∈ (0, ∞) then X is a one-parameter exponential family with natural parameter −1/b² and natural statistic X²/2.
Proof
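As a numerical companion to this section, the following sketch (not part of the original text) checks the mean, variance, and general moment formulas against scipy.stats.maxwell; identifying scipy's scale argument with the parameter b here is an assumption of the sketch, consistent with the density above.

```python
# Numerical check of the Maxwell results (not part of the original text).
import numpy as np
from scipy.special import gamma
from scipy.stats import maxwell

b = 1.5
dist = maxwell(scale=b)
print(dist.mean(), 2 * b * np.sqrt(2 / np.pi))  # E(X) = 2b sqrt(2/pi)
print(dist.var(), b**2 * (3 - 8 / np.pi))       # var(X) = b^2 (3 - 8/pi)

# General moment: E(X^n) = b^n 2^{n/2+1} Gamma((n+3)/2) / sqrt(pi)
n = 4
print(dist.moment(n),
      b**n * 2**(n / 2 + 1) * gamma((n + 3) / 2) / np.sqrt(np.pi))
```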
5.16: The Lévy Distribution
The Lévy distribution, named for the French mathematician Paul Lévy, is important in the study of Brownian motion, and is one of only three stable distributions whose probability density function can be expressed in a simple, closed form.
The Standard Lévy Distribution
Definition
If Z has the standard normal distribution then U = 1/Z² has the standard Lévy distribution.
So the standard Lévy distribution is a continuous distribution on (0, ∞).
Distribution Functions
We assume that U has the standard Lévy distribution. The distribution function of U has a simple expression in terms of the standard normal distribution function Φ, not surprising given the definition.
U has distribution function G given by
\[ G(u) = 2\left[1 - \Phi\left(\frac{1}{\sqrt{u}}\right)\right], \quad u \in (0, \infty) \tag{5.16.1} \]
Proof
Similarly, the quantile function of U has a simple expression in terms of the standard normal quantile function Φ⁻¹.
U has quantile function G⁻¹ given by
\[ G^{-1}(p) = \frac{1}{\left[\Phi^{-1}(1 - p/2)\right]^2}, \quad p \in [0, 1) \tag{5.16.3} \]
The quartiles of U are
1. q₁ = [Φ⁻¹(7/8)]⁻² ≈ 0.7557, the first quartile.
2. q₂ = [Φ⁻¹(3/4)]⁻² ≈ 2.1980, the median.
3. q₃ = [Φ⁻¹(5/8)]⁻² ≈ 9.8516, the third quartile.
Proof
Open the Special Distribution Calculator and select the Lévy distribution. Keep the default parameter values. Note the shape and location of the distribution function. Compute a few values of the distribution function and the quantile function.
Finally, the probability density function of U has a simple closed expression.
U has probability density function g given by
\[ g(u) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{u^{3/2}} \exp\left(-\frac{1}{2u}\right), \quad u \in (0, \infty) \tag{5.16.4} \]
1. g increases and then decreases with mode at u = 1/3.
2. g is concave upward, then downward, then upward again, with inflection points at u = 1/3 − √10/15 ≈ 0.1225 and u = 1/3 + √10/15 ≈ 0.5442.
Proof
Open the Special Distribution Simulator and select the Lévy distribution. Keep the default parameter values. Note the shape of the probability density function. Run the simulation 1000 times and compare the empirical density function to the probability density function.
Moments
We assume again that U has the standard Lévy distribution. After exploring the graphs of the probability density function and distribution function above, you probably noticed that the Lévy distribution has a very heavy tail. The 99th percentile is about 6400, for example. The following result is not surprising.
E(U) = ∞
Proof
Of course, the higher-order moments are infinite as well, and the variance, skewness, and kurtosis do not exist. The moment generating function is infinite at every positive value, and so is of no use. On the other hand, the characteristic function of the standard Lévy distribution is very useful. For the following result, recall that the sign function sgn is given by sgn(t) = 1 for t > 0, sgn(t) = −1 for t < 0, and sgn(0) = 0.
U has characteristic function χ₀ given by
\[ \chi_0(t) = E\left(e^{itU}\right) = \exp\left(-|t|^{1/2}\left[1 + i\,\operatorname{sgn}(t)\right]\right), \quad t \in \mathbb{R} \tag{5.16.9} \]
Related Distributions
The most important relationship is the one in the definition: if Z has the standard normal distribution then U = 1/Z² has the standard Lévy distribution. The following result is basically the converse.
If U has the standard Lévy distribution, then V = 1/√U has the standard half-normal distribution.
Proof
The General Lévy Distribution
Like so many other “standard distributions”, the standard Lévy distribution is generalized by adding location and scale parameters.
Definition
Suppose that U has the standard Lévy distribution, and that a ∈ ℝ and b ∈ (0, ∞). Then X = a + bU has the Lévy distribution with location parameter a and scale parameter b.
Note that X has a continuous distribution on the interval (a, ∞).
Distribution Functions
Suppose that X has the Lévy distribution with location parameter a ∈ ℝ and scale parameter b ∈ (0, ∞). As before, the distribution function of X has a simple expression in terms of the standard normal distribution function Φ.
X has distribution function F given by
\[ F(x) = 2\left[1 - \Phi\left(\sqrt{\frac{b}{x - a}}\right)\right], \quad x \in (a, \infty) \tag{5.16.10} \]
Proof
Similarly, the quantile function of X has a simple expression in terms of the standard normal quantile function Φ⁻¹.
X has quantile function F⁻¹ given by
\[ F^{-1}(p) = a + \frac{b}{\left[\Phi^{-1}(1 - p/2)\right]^2}, \quad p \in [0, 1) \tag{5.16.11} \]
The quartiles of X are
1. q₁ = a + b[Φ⁻¹(7/8)]⁻², the first quartile.
2. q₂ = a + b[Φ⁻¹(3/4)]⁻², the median.
3. q₃ = a + b[Φ⁻¹(5/8)]⁻², the third quartile.
Proof
Open the Special Distribution Calculator and select the Lévy distribution. Vary the parameter values and note the shape of the graph of the distribution function. For various values of the parameters, compute a few values of the distribution function and the quantile function.
Finally, the probability density function of X has a simple closed expression.
X has probability density function f given by
\[ f(x) = \sqrt{\frac{b}{2\pi}}\, \frac{1}{(x - a)^{3/2}} \exp\left[-\frac{b}{2(x - a)}\right], \quad x \in (a, \infty) \tag{5.16.12} \]
1. f increases and then decreases with mode at x = a + b/3.
2. f is concave upward, then downward, then upward again, with inflection points at x = a + (1/3 ± √10/15)b.
Proof
Open the Special Distribution Simulator and select the Lévy distribution. Vary the parameters and note the shape and location of the probability density function. For various parameter values, run the simulation 1000 times and compare the empirical density function to the probability density function.
Moments
Assume again that X has the Lévy distribution with location parameter a ∈ ℝ and scale parameter b ∈ (0, ∞). Of course, since the standard Lévy distribution has infinite mean, so does the general Lévy distribution.
E(X) = ∞
Also as before, the variance, skewness, and kurtosis of X are undefined. On the other hand, the characteristic function of X is very important.
X has characteristic function χ given by
\[ \chi(t) = E\left(e^{itX}\right) = \exp\left(ita - b^{1/2}|t|^{1/2}\left[1 + i\,\operatorname{sgn}(t)\right]\right), \quad t \in \mathbb{R} \tag{5.16.13} \]
Proof
Related Distributions
Since the Lévy distribution is a location-scale family, it is trivially closed under location-scale transformations.
Suppose that X has the Lévy distribution with location parameter a ∈ ℝ and scale parameter b ∈ (0, ∞), and that c ∈ ℝ and d ∈ (0, ∞). Then Y = c + dX has the Lévy distribution with location parameter c + ad and scale parameter bd.
Proof
Of more interest is the fact that the Lévy distribution is closed under convolution (corresponding to sums of independent variables).
Suppose that X₁ and X₂ are independent, and that Xₖ has the Lévy distribution with location parameter aₖ ∈ ℝ and scale parameter bₖ ∈ (0, ∞) for k ∈ {1, 2}. Then X₁ + X₂ has the Lévy distribution with location parameter a₁ + a₂ and scale parameter (b₁^{1/2} + b₂^{1/2})².
Proof
As a corollary, the Lévy distribution is a stable distribution with index α = 1/2:
Suppose that n ∈ ℕ₊ and that (X₁, X₂, …, Xₙ) is a sequence of independent random variables, each having the Lévy distribution with location parameter a ∈ ℝ and scale parameter b ∈ (0, ∞). Then X₁ + X₂ + ⋯ + Xₙ has the Lévy distribution with location parameter na and scale parameter n²b.
Stability is one of the reasons for the importance of the Lévy distribution. From the characteristic function, it follows that the skewness parameter is β = 1.
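The stability property is easy to see empirically. The following sketch (not part of the original text) compares quantiles, since the moments are infinite; it uses scipy.stats.levy, with its loc and scale arguments playing the roles of a and b here.

```python
# Illustrative check of stability (not part of the original text).
import numpy as np
from scipy.stats import levy

rng = np.random.default_rng(0)
a, b, n, size = 0.0, 1.0, 4, 100_000

# Sum of n independent Levy(a, b) variables...
S = levy.rvs(loc=a, scale=b, size=(n, size), random_state=rng).sum(axis=0)

# ...should match Levy(n a, n^2 b). Compare sample and theoretical quantiles.
probs = [0.25, 0.5, 0.75]
print(np.quantile(S, probs))
print(levy.ppf(probs, loc=n * a, scale=n**2 * b))
```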
5.17: The Beta Distribution
In this section, we will study the beta distribution, the most important distribution that has bounded support. But before we can study the beta distribution we must study the beta function.
The Beta Function
Definition
The beta function B is defined as follows:
\[ B(a, b) = \int_0^1 u^{a-1} (1 - u)^{b-1}\, du; \quad a, b \in (0, \infty) \tag{5.17.1} \]
Proof that B is well defined
The beta function was first introduced by Leonhard Euler.
Properties
The beta function satisfies the following properties:
1. B(a, b) = B(b, a) for a, b ∈ (0, ∞), so B is symmetric.
2. B(a, 1) = 1/a for a ∈ (0, ∞)
3. B(1, b) = 1/b for b ∈ (0, ∞)
Proof
The beta function has a simple expression in terms of the gamma function:
If a, b ∈ (0, ∞) then
\[ B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)} \tag{5.17.4} \]
Proof
Recall that the gamma function is a generalization of the factorial function. Here is the corresponding result for the beta function:
If j, k ∈ ℕ₊ then
\[ B(j, k) = \frac{(j-1)!\,(k-1)!}{(j+k-1)!} \tag{5.17.8} \]
Proof
Let's generalize this result. First, recall from our study of combinatorial structures that for a ∈ ℝ and j ∈ ℕ, the ascending power of base a and order j is
\[ a^{[j]} = a(a+1)\cdots[a + (j-1)] \tag{5.17.9} \]
If a, b ∈ (0, ∞) and j, k ∈ ℕ, then
\[ \frac{B(a+j,\, b+k)}{B(a, b)} = \frac{a^{[j]}\, b^{[k]}}{(a+b)^{[j+k]}} \tag{5.17.10} \]
Proof
B(1/2, 1/2) = π.
Proof
Figure 5.17.1 : The graph of B(a, b) on the square 0 < a < 5 , 0 < b < 5
The Incomplete Beta Function
The integral that defines the beta function can be generalized by changing the interval of integration from (0, 1) to (0, x), where x ∈ (0, 1).
The incomplete beta function is defined as follows:
\[ B(x; a, b) = \int_0^x u^{a-1} (1 - u)^{b-1}\, du, \quad x \in (0, 1);\ a, b \in (0, \infty) \tag{5.17.11} \]
Of course, the ordinary (complete) beta function is B(a, b) = B(1; a, b) for a, b ∈ (0, ∞).
The Standard Beta Distribution
Distribution Functions
The beta distributions are a family of continuous distributions on the interval (0, 1).
The (standard) beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) has probability density function f given by
\[ f(x) = \frac{1}{B(a, b)}\, x^{a-1} (1 - x)^{b-1}, \quad x \in (0, 1) \tag{5.17.12} \]
Of course, the beta function is simply the normalizing constant, so it's clear that f is a valid probability density function. If a ≥ 1, f is defined at 0, and if b ≥ 1, f is defined at 1. In these cases, it's customary to extend the domain of f to these endpoints. The beta distribution is useful for modeling random probabilities and proportions, particularly in the context of Bayesian analysis. The distribution has just two parameters and yet a rich variety of shapes (so in particular, both parameters are shape parameters). Qualitatively, the first order properties of f depend on whether each parameter is less than, equal to, or greater than 1.
For a, b ∈ (0, ∞) with a + b ≠ 2, define
\[ x_0 = \frac{a - 1}{a + b - 2} \tag{5.17.13} \]
1. If 0 < a < 1 and 0 < b < 1, f decreases and then increases with minimum value at x₀ and with f(x) → ∞ as x ↓ 0 and as x ↑ 1.
2. If a = 1 and b = 1, f is constant.
3. If 0 < a < 1 and b ≥ 1, f is decreasing with f(x) → ∞ as x ↓ 0.
4. If a ≥ 1 and 0 < b < 1, f is increasing with f(x) → ∞ as x ↑ 1.
5. If a = 1 and b > 1, f is decreasing with mode at x = 0.
6. If a > 1 and b = 1, f is increasing with mode at x = 1.
7. If a > 1 and b > 1, f increases and then decreases with mode at x₀.
Proof
From part (b), note that the special case a = 1 and b = 1 gives the continuous uniform distribution on the interval (0, 1) (the standard uniform distribution). Note also that when a < 1 or b < 1, the probability density function is unbounded, and hence the distribution has no mode. On the other hand, if a ≥ 1, b ≥ 1, and one of the inequalities is strict, the distribution has a unique mode at x₀. The second order properties are more complicated.
For a,
b ∈ (0, ∞)
with a + b ∉ {2, 3} and (a − 1)(b − 1)(a + b − 3) ≥ 0 , define −−−−−−−−−−−−−−−−−− − (a − 1)(a + b − 3) − √ (a − 1)(b − 1)(a + b − 3) x1 =
(5.17.15) (a + b − 3)(a + b − 2) −−−−−−−−−−−−−−−−−− − (a − 1)(a + b − 3) + √ (a − 1)(b − 1)(a + b − 3)
x2 =
(5.17.16) (a + b − 3)(a + b − 2)
For a < 1 and a + b = 2 or for b < 1 and a + b = 2 , define x
1
= x2 = 1 − a/2
.
1. If a ≤ 1 and b ≤ 1 , or if a ≤ 1 and b ≥ 2 , or if a ≥ 2 and b ≤ 1 , f is concave upward. 2. If a ≤ 1 and 1 < b < 2 , f is concave upward and then downward with inflection point at x . 3. If 1 < a < 2 and b ≤ 1 , f is concave downward and then upward with inflection point at x . 4. If 1 < a ≤ 2 and 1 < b ≤ 2 , f is concave downward. 5. If 1 < a ≤ 2 and b > 2 , f is concave downward and then upward with inflection point at x . 6. If a > 2 and 1 < b ≤ 2 , f is concave upward and then downward with inflection point at x . 7. If a > 2 and b > 2 , f is concave upward, then downward, then upward again, with inflection points at x and x . 1 2
2 1
1
2
Proof In the special distribution simulator, select the beta distribution. Vary the parameters and note the shape of the beta density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the true density function. The special case a =
1 2
,b=
1 2
is the arcsine distribution, with probability density function given by 1 f (x) =
− − − − − − −, π √ x(1 − x)
x ∈ (0, 1)
(5.17.18)
This distribution is important in a number of applications, and so the arcsine distribution is studied in a separate section. The beta distribution function F can be easily expressed in terms of the incomplete beta function. As usual parameter and b the right parameter. The beta distribution function F with parameters a,
b ∈ (0, ∞)
a
denotes the left
is given by
B(x; a, b) F (x) =
,
x ∈ (0, 1)
(5.17.19)
B(a, b)
The distribution function F is sometimes known as the regularized incomplete beta function. In some special cases, the distribution function F and its inverse, the quantile function F , can be computed in closed form, without resorting to special functions. −1
If a ∈ (0, ∞) and b = 1 then 1. F (x) = x for x ∈ (0, 1) 2. F (p) = p for p ∈ (0, 1) a
−1
1/a
If a = 1 and b ∈ (0, ∞) then 1. F (x) = 1 − (1 − x) for x ∈ (0, 1) 2. F (p) = 1 − (1 − p) for p ∈ (0, 1) b
−1
If a = b =
1/b
(the arcsine distribution) then
1 2
−
1. F (x) = arcsin(√x) for x ∈ (0, 1) 2. F (p) = sin ( p) for p ∈ (0, 1) 2
π
−1
2
π 2
5.17.3
https://stats.libretexts.org/@go/page/10357
There is an interesting relationship between the distribution functions of the beta distribution and the binomial distribution, when the beta parameters are positive integers. To state the relationship we need to embellish our notation to indicate the dependence on the parameters. Thus, let F denote the beta distribution function with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞), and let G denote the binomial distribution function with trial parameter n ∈ N and success parameter p ∈ (0, 1). a,b
n,p
If j,
k ∈ N+
+
and x ∈ (0, 1) then Fj,k (x) = Gj+k−1,1−x (k − 1)
(5.17.20)
Proof In the special distribution calculator, select the beta distribution. Vary the parameters and note the shape of the density function and the distribution function. In each of the following cases, find the median, the first and third quartiles, and the interquartile range. Sketch the boxplot. 1. a = 1 , b = 1 2. a = 1 , b = 3 3. a = 3 , b = 1 4. a = 2 , b = 4 5. a = 4 , b = 2 6. a = 4 , b = 4
Moments The moments of the beta distribution are easy to express in terms of the beta function. As before, suppose that distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞).
X
has the beta
If k ∈ [0, ∞) then E (X
k
B(a + k, b) ) =
(5.17.24) B(a, b)
In particular, if k ∈ N then [k]
E (X
k
a ) =
(5.17.25) [k]
(a + b)
Proof From the general formula for the moments, it's straightforward to compute the mean, variance, skewness, and kurtosis. The mean and variance of X are a E(X) =
(5.17.27) a+b ab
var(X) =
(5.17.28)
2
(a + b ) (a + b + 1)
Proof Note that the variance depends on the parameters a and b only through the product ab and the sum a + b . Open the special distribution simulator and select the beta distribution. Vary the parameters and note the size and location of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation. The skewness and kurtosis of X are
5.17.4
https://stats.libretexts.org/@go/page/10357
−−−−−− − 2(b − a)√ a + b + 1 skew(X) =
(5.17.29)
− − (a + b + 2)√ab 3
3
3 a b + 3ab
2
2
+ 6a b
3
+a
3
+b
2
2
+ 13 a b + 13ab
2
+a
2
+b
+ 14ab
kurt(X) =
(5.17.30) ab(a + b + 2)(a + b + 3)
Proof In particular, note that the distribution is positively skewed if a < b , unskewed if a = b (the distribution is symmetric about x = in this case) and negatively skewed if a > b .
1 2
Open the special distribution simulator and select the beta distribution. Vary the parameters and note the shape of the probability density function in light of the previous result on skewness. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the true probability density function.
Related Distributions The beta distribution is related to a number of other special distributions. If X has the beta distribution with left parameter a ∈ (0, ∞) and right parameter distribution with left parameter b and right parameter a .
b ∈ (0, ∞)
then
Y = 1 −X
has the beta
Proof The beta distribution with right parameter 1 has a reciprocal relationship with the Pareto distribution. Suppose that a ∈ (0, ∞) . 1. If X has the beta distribution with left parameter a and right parameter 1 then Y = 1/X has the Pareto distribution with shape parameter a . 2. If Y has the Pareto distribution with shape parameter a then X = 1/Y has the beta distribution with left parameter a and right parameter 1. Proof The following result gives a connection between the beta distribution and the gamma distribution. Suppose that X has the gamma distribution with shape parameter a ∈ (0, ∞) and rate parameter r ∈ (0, ∞), Y has the gamma distribution with shape parameter b ∈ (0, ∞) and rate parameter r, and that X and Y are independent. Then V = X/(X + Y ) has the beta distribution with left parameter a and right parameter b . Proof The following result gives a connection between the beta distribution and the of the previous result. If X has the F distribution with denominator then
n ∈ (0, ∞)
F
distribution. This connection is a minor variation
degrees of freedom in the numerator and
d ∈ (0, ∞)
degrees of freedom in the
(n/d)X Y =
(5.17.38) 1 + (n/d)X
has the beta distribution with left parameter a = n/2 and right parameter b = d/2 . Proof Our next result is that the beta distribution is a member of the general exponential family of distributions. Suppose that X has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Then the distribution is a two-parameter exponential family with natural parameters a − 1 and b − 1 , and natural statistics ln(X) and ln(1 − X) . Proof The beta distribution is also the distribution of the order statistics of a random sample from the standard uniform distribution.
5.17.5
https://stats.libretexts.org/@go/page/10357
Suppose n ∈ N and that (X , X , … , X ) is a sequence of independent variables, each with the standard uniform distribution. For k ∈ {1, 2, … , n}, the k th order statistics X has the beta distribution with left parameter a = k and right parameter b = n − k + 1 . +
1
2
n
(k)
Proof One of the most important properties of the beta distribution, and one of the main reasons for its wide use in statistics, is that it forms a conjugate family for the success probability in the binomial and negative binomial distributions. Suppose that P is a random probability having the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Suppose also that X is a random variable such that the conditional distribution of X given P = p ∈ (0, 1) is binomial with trial parameter n ∈ N and success parameter p. Then the conditional distribution of P given X = k is beta with left parameter a + k and right parameter b + n − k . +
Proof Suppose again that P is a random probability having the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Suppose also that N is a random variable such that the conditional distribution of N given P = p ∈ (0, 1) is negative binomial with stopping parameter k ∈ N and success parameter p. Then the conditional distribution of P given N = n is beta with left parameter a + k and right parameter b + n − k . +
Proof in both cases, note that in the posterior distribution of P , the left parameter is increased by the number of successes and the right parameter by the number of failures. For more on this, see the section on Bayesian estimation in the chapter on point estimation.
The General Beta Distribution The beta distribution can be easily generalized from the support interval (0, 1) to an arbitrary bounded interval using a linear transformation. Thus, this generalization is simply the location-scale family associated with the standard beta distribution. Suppose that Z has the standard beta distibution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). For c ∈ R and d ∈ (0, ∞) random variable X = c + dZ has the beta distribution with left parameter a , right parameter b , location parameter c and scale parameter d . For the remainder of this discussion, suppose that X has the distribution in the definition above. X
has probability density function 1
a−1
f (x) =
(x − c ) B(a, b)d
b−1
(c + d − x )
,
x ∈ (c, c + d)
(5.17.44)
a+b−1
Proof Most of the results in the previous sections have simple extensions to the general beta distribution. The mean and variance of X are 1. E(X) = c + d 2. var(X) = d
a a+b ab
2 2
(a+b) (a+b+1)
Proof Recall that skewness and variance are defined in terms of standard scores, and hence are unchanged under location-scale transformations. Hence the skewness and kurtosis of X are just as for the standard beta distribution. This page titled 5.17: The Beta Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.17.6
https://stats.libretexts.org/@go/page/10357
5.18: The Beta Prime Distribution Basic Theory The beta prime distribution is the distribution of the odds ratio associated with a random variable with the beta distribution. Since variables with beta distributions are often used to model random probabilities and proportions, the corresponding odds ratios occur naturally as well.
Definition Suppose that U has the beta distribution with shape parameters distribution with shape parameters a and b .
a, b ∈ (0, ∞)
. Random variable
The special case a = b = 1 is known as the standard beta prime distribution. Since random variable X has a continuous distribution on the interval (0, ∞).
U
X = U /(1 − U )
has the beta prime
has a continuous distribution on the interval
,
(0, 1)
Distribution Functions Suppose that X has the beta prime distribution with shape parameters a, X
b ∈ (0, ∞)
, and as usual, let B denote the beta function.
has probability density function f given by a−1
1
x
f (x) =
,
x ∈ (0, ∞)
(5.18.1)
B(a, b) (1 + x)a+b
Proof If a ≥ 1 , the probability density function is defined at x = 0 , so in this case, it's customary add this endpoint to the domain. In particular, for the standard beta prime distribution, 1 f (x) =
2
,
x ∈ [0, ∞)
(5.18.4)
(1 + x)
Qualitatively, the first order properties of the probability density function f depend only on a , and in particular on whether a is less than, equal to, or greater than 1. The probability density function f satisfies the following properties: 1. If 0 < a < 1 , f is decreasing with f (x) → ∞ as x ↓ 0. 2. If a = 1 , f is decreasing with mode at x = 0 . 3. If a > 1 , f increases and then decreases with mode at x = (a − 1)/(b + 1) . Proof Qualitatively, the second order properties of f also depend only on a , with transitions at a = 1 and a = 2 . For a > 1 , define −−−−−−−−−−−−−−− − (a − 1)(b + 2) − √ (a − 1)(b + 2)(a + b) x1 =
(5.18.6) (b + 1)(b + 2) −−−−−−−−−−−−−−− − (a − 1)(b + 2) + √ (a − 1)(b + 2)(a + b)
x2 =
(5.18.7) (b + 1)(b + 2)
The probability density function f satisfies the following properties: 1. If 0 < a ≤ 1 , f is concave upward. 2. If 1 < a ≤ 2 , f is concave downward and then upward, with inflection point at x . 3. If a > 2 , f is concave upward, then downward, then upward again, with inflection points at x and x . 2
1
2
Proof Open the Special Distribution Simulator and select the beta prime distribution. Vary the parameters and note the shape of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function.
5.18.1
https://stats.libretexts.org/@go/page/10358
Because of the definition of the beta prime variable, the distribution function of X has a simple expression in terms of the beta distribution function with the same parameters, which in turn is the regularized incomplete beta function. So let G denote the distribution function of the beta distribution with parameters a, b ∈ (0, ∞), and recall that B(x; a, b) G(x) =
,
x ∈ (0, 1)
(5.18.9)
B(a, b)
X
has distribution function F given by x F (x) = G (
),
x ∈ [0, ∞)
(5.18.10)
x +1
Proof Similarly, the quantile function of X has a simple expression in terms of the beta quantile function G
−1
X
has quantile function F
with the same parameters.
given by
−1
−1
F
−1
G (p) =
(p)
−1
1 −G
,
p ∈ [0, 1)
(5.18.12)
(p)
Proof Open the Special Distribution Calculator and choose the beta prime distribution. Vary the parameters and note the shape of the distribution function. For selected values of the parameters, find the median and the first and third quartiles. For certain values of the parameters, the distribution and quantile functions have simple, closed form expressions. If a ∈ (0, ∞) and b = 1 then 1. F (x) = ( 2. F
−1
a
x x+1 p
(p) =
for x ∈ [0, ∞)
)
1/a
1−p
for p ∈ [0, 1)
1/a
Proof If a = 1 and b ∈ (0, ∞) then 1. F (x) = 1 − (
b
1 x+1
for x ∈ [0, ∞)
) 1/b
2. F
−1
1−(1−p)
(p) =
for p ∈ [0, 1)
1/b
(1−p)
Proof If a = b =
then
1 2
1. F (x) =
2 π
−− − arcsin(√ 2
2. F
−1
sin (
(p) = 2
π 2
1−sin (
p) π 2
x
x+1
)
for x ∈ [0, ∞)
for p ∈ [0, 1)
p)
Proof When a = b =
1 2
, X is the odds ratio for a variable with the standard arcsine distribution.
Moments As before, X denotes a random variable with the beta prime distribution, with parameters expression in terms of the beta function.
a, b ∈ (0, ∞)
. The moments of
X
have a simple
If t ∈ (−a, b) then t
B(a + t, b − t)
E (X ) =
(5.18.13) B(a, b)
If t ∈ (−∞, −a] ∪ [b, ∞) then E(X
t
) =∞
.
Proof
5.18.2
https://stats.libretexts.org/@go/page/10358
Of course, we are usually most interested in the integer moments of x = x(x + 1) ⋯ (x + n − 1) .
X
. Recall that for
x ∈ R
and n ∈ N , the rising power of
x
of order
n
is
[n]
Suppose that n ∈ N . If n < b Then n
E (X
n
a+k−1
) =∏
(5.18.15) b −k
k=1
If n ≥ b then E (X
n
) =∞
.
Proof As a corollary, we have the mean and variance. If b > 1 then a E(X) =
(5.18.17) b −1
If b > 2 then a(a + b − 1) var(X) =
(5.18.18)
2
(b − 1 ) (b − 2)
Proof Open the Special Distribution Simulator and select the beta prime distribution. Vary the parameters and note the size and location of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. Finally, the general moment result leads to the skewness and kurtosis of X. If b > 3 then −−−−−−−−− − 2(2a + b − 1)
b −2
skew(X) =
(5.18.19)
√
b −3
a(a + b − 1)
Proof In particular, the distibution is positively skewed for all a > 0 and b > 3 . If b > 4 then kurt(X) 3
2
3a b
2
=
+ 54 b
3
3
+ 69 a b − 30 a
2
3
+ 6a b
2
2
+ 12 a b
2
2
− 78 a b + 60 a
(5.18.20) 4
+ 3ab
3
+ 9ab
2
− 69ab
4
+ 99ab − 42a + 6 b
3
− 30 b
− 42b + 12 (a + b − 1)(b − 3)(b − 4)
Proof
Related Distributions The most important connection is the one between the beta prime distribution and the beta distribution given in the definition. We repeat this for emphasis. Suppose that a,
b ∈ (0, ∞)
.
1. If U has the beta distribution with parameters a and b , then X = U /(1 − U ) has the beta prime distribution with parameters a and b . 2. If X has the beta prime distribution with parameters a and b , then U = X/(X + 1) has the beta distribution with parameters a and b . The beta prime family is closed under the reciprocal transformation. If X has the beta prime distribution with parameters a,
b ∈ (0, ∞)
then 1/X has the beta prime distribution with parameters b and a .
Proof The beta prime distribution is closely related to the F distribution by a simple scale transformation.
5.18.3
https://stats.libretexts.org/@go/page/10358
Connections with the F distributions. 1. If X has the beta prime distribution with parameters a, b ∈ (0, ∞) then Y = X has the F distribution with 2a degrees of the freedom in the numerator and 2b degrees of freedom in the denominator. 2. If Y has the F distribution with n ∈ (0, ∞) degrees of freedom in the numerator and d ∈ (0, ∞) degrees of freedom in the denominator, then X = Y has the beta prime distribution with parameters n/2 and d/2. b
a
n d
Proof The beta prime is the distribution of the ratio of independent variables with standard gamma distributions. (Recall that standard here means that the scale parameter is 1.) Suppose that Y and Z are independent and have standard gamma distributions with shape parameters respectively. Then X = Y /Z has the beta prime distribution with parameters a and b .
a ∈ (0, ∞)
and
b ∈ (0, ∞)
,
Proof The standard beta prime distribution is the same as the standard log-logistic distribution. Proof Finally, the beta prime distribution is a member of the general exponential family of distributions. Suppose that X has the beta prime distribution with parameters a, b ∈ (0, ∞). Then X has a two-parameter general exponential distribution with natural parameters a − 1 and −(a + b) and natural statistics ln(X) and ln(1 + X) . Proof This page titled 5.18: The Beta Prime Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.18.4
https://stats.libretexts.org/@go/page/10358
5.19: The Arcsine Distribution The arcsine distribution is important in the study of Brownian motion and prime numbers, among other applications.
The Standard Arcsine Distribution Distribution Functions The standard arcsine distribution is a continuous distribution on the interval (0, 1) with probability density function g given by 1 g(x) =
− − − − − − −, π √ x(1 − x)
x ∈ (0, 1)
(5.19.1)
Proof The occurrence of the arcsine function in the proof that g is a probability density function explains the name. The standard arcsine probability density function g satisfies the following properties: 1. g is symmetric about x = . 2. g decreases and then increases with minimum value at x = 3. g is concave upward 4. g(x) → ∞ as x ↓ 0 and as x ↑ 1. 1 2
1 2
.
Proof In particular, the standard arcsine distribution is U-shaped and has no mode. Open the Special Distribution Simulator and select the arcsine distribution. Keep the default parameter values and note the shape of the probability density function. Run the simulation 1000 times and compare the emprical density function to the probability density function. The distribution function has a simple expression in terms of the arcsine function, again justifying the name of the distribution. The standard arcsine distribution function G is given by G(x) =
2 π
− arcsin(√x )
for x ∈ [0, 1].
Proof Not surprisingly, the quantile function has a simple expression in terms of the sine function. The standard arcinse quantile function G
is given by G
−1
1. q 2. q 3. q
1
2
= sin ( 1
8
) =
1 4
– (2 − √2) ≈ 0.1464
2
(p) = sin (
π 2
p)
for p ∈ [0, 1]. In particular, the quartiles are
, the first quartile
, the median
2
=
3
= sin (
2
π
−1
2
3π 8
) =
1 4
– (2 + √2) ≈ 0.8536
, the third quartile
Proof Open the Special Distribution Calculator and select the arcsine distribution. Keep the default parameter values and note the shape of the distribution function. Compute selected values of the distribution function and the quantile function.
Moments Suppose that random variable Z has the standard arcsine distribution. First we give the mean and variance. The mean and variance of Z are 1. E(Z) = 2. var(Z) = 1 2
1 8
Proof
5.19.1
https://stats.libretexts.org/@go/page/10359
Open the Special Distribution Simulator and select the arcsine distribution. Keep the default parameter values. Run the simulation 1000 times and compare the empirical mean and stadard deviation to the true mean and standard deviation. The general moments about 0 can be expressed as products. For n ∈ N , n−1
E (Z
n
2j + 1
) = ∏ j=0
(5.19.8) 2j + 2
Proof Of course, the moments can be used to give a formula for the moment generating function, but this formula is not particularly helpful since it is not in closed form. Z
has moment generating function m given by ∞
m(t) = E (e
tZ
n−1
n=0
j=0
n
2j + 1
) = ∑ (∏
t )
2j + 2
,
t ∈ R
(5.19.10)
n!
Finally we give the skewness and kurtosis. The skewness and kurtosis of Z are 1. skew(Z) = 0 2. kurt(Z) = 3 2
Proof
Related Distributions As noted earlier, the standard arcsine distribution is a special case of the beta distribution. The standard arcsine distribution is the beta distribution with left parameter
1 2
and right parameter
1 2
.
Proof Since the quantile function is in closed form, the standard arcsine distribution can be simulated by the random quantile method. Connections with the standard uniform distribution. 1. If U has the standard uniform distribution (a random number) then X = sin ( U ) has the standard arcsine distribution. − − 2. If X has the standard arcsine distribution then U = arcsin(√X ) has the standard uniform distribution. 2
π 2
2
π
Open the random quantile simulator and select the arcsine distribution. Keep the default parameters. Run the experiment 1000 times and compare the empirical probability density function, mean, and standard deviation to their distributional counterparts. Note how the random quantiles simulate the distribution. The following exercise illustrates the connection between the Brownian motion process and the standard arcsine distribution. Open the Brownian motion simulator. Keep the default time parameter and select the last zero random variable. Note that this random variable has the standard arcsine distribution. Run the experiment 1000 time and compare the empirical probability density function, mean, and standard deviation to their distributional counterparts. Note how the last zero simulates the distribution.
The General Arcsine Distribution The standard arcsine distribution is generalized by adding location and scale parameters.
5.19.2
https://stats.libretexts.org/@go/page/10359
Definition If Z has the standard arcsine distribution, and if location parameter a and scale parameter b .
a ∈ R
and
b ∈ (0, ∞)
, then
X = a + bZ
has the arcsine distribution with
So X has a continuous distribution on the interval (a, a + b) .
Distribution Functions Suppose that X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) . X
has probability density function f given by 1 f (x) =
− −−−−−−−−−−−−− −, π √ (x − a)(a + w − x)
1. f is symmetric about a + w . 2. f decreases and then increases with minimum value at x = a + 3. f is concave upward. 4. f (x) → ∞ as x ↓ a and as x ↑ a + w .
x ∈ (a, a + w)
(5.19.12)
1 2
1 2
w
.
Proof An alternate parameterization of the general arcsine distribution is by the endpoints of the support interval: the left endpoint (location parameter) a and the right endpoint b = a + w . Open the Special Distribution Simulator and select the arcsine distribution. Vary the location and scale parameters and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the emprical density function to the probability density function. Once again, the distribution function has a simple representation in terms of the arcsine function. X
has distribution function F given by − −−− − x −a ), w
2 F (x) =
arcsin( √ π
x ∈ [a, a + w]
(5.19.13)
Proof As before, the quantile function has a simple representation in terms of the sine functioon X
has quantile function F
1. q 2. q 3. q
1
2
= a + w sin ( 1
w
8
given by F
) = a+
1 4
−1
2
(p) = a + w sin (
– (2 − √2) w
π 2
p)
for p ∈ [0, 1] In particular, the quartiles of X are
, the first quartile
, the median
2
= a+
3
= a + w sin (
2
π
−1
2
3π 8
) = a+
1 4
– (2 + √2) w
, the third quartile
Proof Open the Special Distribution Calculator and select the arcsine distribution. Vary the parameters and note the shape and location of the distribution function. For various values of the parameters, compute selected values of the distribution function and the quantile function.
Moments Again, we assume that X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) . First we give the mean and variance. The mean and variance of X are 1. E(X) = a + w 2. var(X) = w 1 2
1
2
8
5.19.3
https://stats.libretexts.org/@go/page/10359
Proof Open the Special Distribution Simulator and select the arcsine distribution. Vary the parameters and note the size and location of the mean±standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical mean and stadard deviation to the true mean and standar deviation. The moments of X can be obtained from the moments of Z , but the results are messy, except when the location parameter is 0. Suppose the location parameter a = 0 . For n ∈ N , n−1
E(X
n
) =w
n
2j + 1
∏ j=0
(5.19.14) 2j + 2
Proof The moment generating function can be expressed as a series with product coefficients, and so is not particularly helpful. X
has moment generating function M given by ∞
M (t) = E (e
tX
) =e
at
n−1
n=0
j=0
n
2j + 1
∑ (∏ 2j + 2
n
w t )
,
t ∈ R
(5.19.15)
n!
Proof Finally, the skewness and kurtosis are unchanged. The skewness and kurtosis of X are 1. skew(X) = 0 2. kurt(X) = 3 2
Proof
Related Distributions By construction, the general arcsine distribution is a location-scale family, and so is closed under location-scale transformations. If X has the arcsine distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) and if then c + dX has the arcsine distribution with location parameter c + ad scale parameter dw.
c ∈ R
and d ∈ (0, ∞)
Proof Since the quantile function is in closed form, the arcsine distribution can be simulated by the random quantile method. Suppose that a ∈ R and w ∈ (0, ∞) . 1. If U has the standard uniform distribution (a random number) then X = a + w sin location parameter a and scale parameter b .
2
(
π 2
U)
2. If X has the arcsine distribution with location parameter a and scale parameter b then U
has the arcsine distribution with − − − −
=
2 π
arcsin(√
X−a w
)
has the
standard uniform distribution. Open the random quantile simulator and select the arcsine distribution. Vary the parameters and note the location and shape of the probability density function. For selected parameter values, run the experiment 1000 times and compare the empirical probability density function, mean, and standard deviation to their distributional counterparts. Note how the random quantiles simulate the distribution. The following exercise illustrates the connection between the Brownian motion process and the arcsine distribution. Open the Brownian motion simulator and select the last zero random variable. Vary the time parameter t and note that the last zero has the arcsine distribution on the interval (0, t). Run the experiment 1000 time and compare the empirical probability
5.19.4
https://stats.libretexts.org/@go/page/10359
density function, mean, and standard deviation to their distributional counterparts. Note how the last zero simulates the distribution. This page titled 5.19: The Arcsine Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.19.5
https://stats.libretexts.org/@go/page/10359
5.20: General Uniform Distributions This section explores uniform distributions in an abstract setting. If you are a new student of probability, or are not familiar with measure theory, you may want to skip this section and read the sections on the uniform distribution on an interval and the discrete uniform distributions.
Basic Theory Definition Suppose that (S, S , λ) is a measure space. That is, S is a set, S a σ-algebra of subsets of Suppose also that 0 < λ(S) < ∞ , so that λ is a finite, positive measure.
S
, and
λ
a positive measure on
S
.
Random variable X with values in S has the uniform distribution on S (with respect to λ ) if λ(A) P(X ∈ A) =
,
A ∈ S
(5.20.1)
λ(S)
Thus, the probability assigned to a set A ∈ S depends only on the size of A (as measured by λ ). The most common special cases are as follows: 1. Discrete: The set S is finite and non-empty, S is the σ-algebra of all subsets of S , and λ = # (counting measure). 2. Euclidean: For n ∈ N , let R denote the σ-algebra of Borel measureable subsets of R and let λ denote Lebesgue measure on (R , R ). In this setting, S ∈ R with 0 < λ (S) < ∞ , S = {A ∈ R : A ⊆ S} , and the measure is λ restricted to (S, S ). n
+
n
n
n
n
n
n
n
n
In the Euclidean case, recall that λ is length measure on R, λ is area measure on R , λ is volume measure on general λ is sometimes referred to as n -dimensional volume. Thus, S ∈ R is a set with positive, finite volume. 2
1
2
3
n
3
R
, and in
n
Properties Suppose (S, S , λ) is a finite, positive measure space, as above, and that X is uniformly distributed on S . The probability density function f of X (with respect to λ ) is 1 f (x) =
,
x ∈ S
(5.20.2)
λ(S)
Proof Thus, the defining property of the uniform distribution on a set is constant density on that set. Another basic property is that uniform distributions are preserved under conditioning. Suppose that R ∈ S with λ(R) > 0 . The conditional distribution of X given X ∈ R is uniform on R . Proof In the setting of previous result, suppose that X = (X , X , …) is a sequence of independent variables, each uniformly distributed on S . Let N = min{n ∈ N : X ∈ R} . Then N has the geometric distribution on N with success parameter p = P(X ∈ R) . More importantly, the distribution of X is the same as the conditional distribution of X given X ∈ R , and hence is uniform on R . This is the basis of the rejection method of simulation. If we can simulate a uniform distribution on S , then we can simulate a uniform distribution on R . 1
+
2
n
+
N
If h is a real-valued function on S , then E[h(X)] is the average value of h on S , as measured by λ : If h : S → R is integrable with respect to λ Then 1 E[h(X)] =
∫ λ(S)
5.20.1
h(x) dλ(x)
(5.20.5)
S
https://stats.libretexts.org/@go/page/10360
Proof The entropy of the uniform distribution on S depends only on the size of S , as measured by λ : The entropy of X is H (X) = ln[λ(S)] . Proof
Product Spaces Suppose now that (S, S , λ) and (T , T , μ) are finite, positive measure spaces, so that 0 < λ(S) < ∞ and 0 < μ(T ) < ∞ . Recall the product space (S × T , S ⊗ T , λ ⊗ μ) . The product σ-algebra S ⊗ T is the σ-algebra of subsets of S × T generated by product sets A × B where A ∈ S and B ∈ T . The product measure λ ⊗ μ is the unique positive measure on (S × T , S ⊗ T ) that satisfies (λ ⊗ μ)(A × B) = λ(A)μ(B) for A ∈ S and B ∈ T . is uniformly distributed on S × T if and only if X is uniformly distributed on S , Y is uniformly distributed on T , and and Y are independent.
(X, Y ) X
Proof This page titled 5.20: General Uniform Distributions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.20.2
https://stats.libretexts.org/@go/page/10360
5.21: The Uniform Distribution on an Interval The continuous uniform distribution on an interval of R is one of the simplest of all probability distributions, but nonetheless very important. In particular, continuous uniform distributions are the basic tools for simulating other probability distributions. The uniform distribution corresponds to picking a point at random from the interval. The uniform distribution on an interval is a special case of the general uniform distribution with respect to a measure, in this case Lebesgue measure (length measure) on R.
The Standard Uniform Distribution Definition The continuous uniform distribution on the interval standard uniform distribution then
[0, 1]
is known as the standard uniform distribution. Thus if
P(U ∈ A) = λ(A)
U
has the
(5.21.1)
for every (Borel measurable) subset A of [0, 1], where λ is Lebesgue (length) measure. A simulation of a random variable with the standard uniform distribution is known in computer science as a random number. All programming languages have functions for computing random numbers, as do calculators, spreadsheets, and mathematical and statistical software packages.
Distribution Functions Suppose that U has the standard uniform distribution. By definition, the probability density function is constant on [0, 1]. U
has probability density function g given by g(u) = 1 for u ∈ [0, 1].
Since the density function is constant, the mode is not meaningful. Open the Special Distribution Simulator and select the continuous uniform distribution. Keep the default parameter values. Run the simulation 1000 times and compare the empirical density function and to the probability density function. The distribution function is simply the identity function on [0, 1]. U
has distribution function G given by G(u) = u for u ∈ [0, 1].
Proof The quantile function is the same as the distribution function. U
has quantile function G
−1
1. q 2. q 3. q
1
=
2
=
3
=
1 4 1 2 3 4
given by G
−1
(p) = p
for p ∈ [0, 1]. The quartiles are
, the first quartile , the median , the third quartile
Proof Open the Special Distribution Calculator and select the continuous uniform distribution. Keep the default parameter values. Compute a few values of the distribution function and the quantile function.
Moments Suppose again that U has the standard uniform distribution. The moments (about 0) are simple. For n ∈ N , E (U
n
1 ) =
(5.21.2) n+1
5.21.1
https://stats.libretexts.org/@go/page/10361
Proof The mean and variance follow easily from the general moment formula. The mean and variance of U are 1. E(U ) = 2. var(U ) = 1 2
1 12
Open the Special Distribution Simulator and select the continuous uniform distribution. Keep the default parameter values. Run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation. Next are the skewness and kurtosis. The skewness and kurtosis of U are 1. skew(U ) = 0 2. kurt(U ) = 9 5
Proof Thus, the excess kurtosis is kurt(U ) − 3 = −
6 5
Finally, we give the moment generating function. The moment generating function m of U is given by m(0) = 1 and t
e −1 m(t) =
,
t ∈ R ∖ {0}
(5.21.5)
t
Proof
Related Distributions The standard uniform distribution is connected to every other probability distribution on R by means of the quantile function of the other distribution. When the quantile function has a simple closed form expression, this result forms the primary method of simulating the other distribution with a random number. Suppose that F is the distribution function for a probability distribution on R, and that F is the corresponding quantile function. If U has the standard uniform distribution, then X = F (U ) has distribution function F . −1
−1
Proof Open the Random Quantile Experiment. For each distribution, run the simulation 1000 times and compare the empirical density function to the probability density function of the selected distribution. Note how the random quantiles simulate the distribution. For a continuous distribution on an interval of R, the connection goes the other way. Suppose that X has a continuous distribution on an interval standard uniform distribution.
I ⊆R
, with distribution function
F
. Then
U = F (X)
has the
Proof The standard uniform distribution is a special case of the beta distribution. The beta distribution with left parameter a = 1 and right parameter b = 1 is the standard uniform distribution. Proof The standard uniform distribution is also the building block of the Irwin-Hall distributions.
5.21.2
https://stats.libretexts.org/@go/page/10361
The Uniform Distribution on a General Interval Definition The standard uniform distribution is generalized by adding location-scale parameters. Suppose that U has the standard uniform distribution. For a ∈ R and uniform distribution with location parameter a and scale parameter w.
w ∈ (0, ∞)
random variable
X = a + wU
has the
Distribution Functions Suppose that X has the uniform distribution with location parameter a ∈ R and scale parameter w ∈ (0, ∞) . X
has probability density function f given by f (x) = 1/w for x ∈ [a, a + w] .
Proof The last result shows that X really does have a uniform distribution, since the probability density function is constant on the support interval. Moreover, we can clearly parameterize the distribution by the endpoints of this interval, namely a and b = a + w , rather than by the location, scale parameters a and w. In fact, the distribution is more commonly known as the uniform distribution on the interval [a, b]. Nonetheless, it is useful to know that the distribution is the location-scale family associated with the standard uniform distribution. In terms of the endpoint parameterization, 1 f (x) =
,
x ∈ [a, b]
(5.21.10)
b −a
Open the Special Distribution Simulator and select the uniform distribution. Vary the location and scale parameters and note the graph of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. X
has distribution function F given by x −a F (x) =
,
x ∈ [a, a + w]
(5.21.11)
w
Proof In terms of the endpoint parameterization, x −a F (x) =
,
x ∈ [a, b]
(5.21.12)
b −a X
has quantile function F
1. q 2. q 3. q
1
= a+
2
= a+
3
= a+
1 4 1 2 3 4
w = w = w =
3 4 1 2 1 4
a+ a+ a+
−1
1 4 1 2 3 4
b b b
given by F
−1
(p) = a + pw = (1 − p)a + pb
for p ∈ [0, 1]. The quartiles are
, the first quartile , the median , the third quartile
Proof Open the Special Distribution Calculator and select the uniform distribution. Vary the parameters and note the graph of the distribution function. For selected values of the parameters, compute a few values of the distribution function and the quantile function.
Moments Again we assume that X has the uniform distribution on the interval [a, b] where a, is a and the scale parameter w = b − a .
b ∈ R
and a < b . Thus the location parameter
The moments of X are n+1
E(X
n
b
n+1
−a
) =
,
n ∈ N
(5.21.13)
(n + 1)(b − a)
5.21.3
https://stats.libretexts.org/@go/page/10361
Proof The mean and variance of X are 1. E(X) = (a + b) 2. var(X) = (b − a) 1 2
1
2
12
Open the Special Distribution Simulator and select the uniform distribution. Vary the parameters and note the location and size of the mean±standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. The skewness and kurtosis of X are 1. skew(X) = 0 2. kurt(X) = 9 5
Proof Once again, the excess kurtosis is kurt(X) − 3 = −
6 5
.
The moment generating function M of X is given by M (0) = 1 and e
bt
−e
at
M (t) =
,
t ∈ R ∖ {0}
(5.21.15)
t(b − a)
Proof If h is a real-valued function on [a, b], then E[h(X)] is the average value of h on [a, b], as defined in calculus: If h : [a, b] → R is integrable, then b
1 E[h(X)] =
∫ b −a
h(x) dx
(5.21.16)
a
Proof The entropy of the uniform distribution on an interval depends only on the length of the interval. The entropy of X is H (X) = ln(b − a) . Proof
Related Distributions Since the uniform distribution is a location-scale family, it is trivially closed under location-scale transformations. If
X
has the uniform distribution with location parameter a and scale parameter w, and if c ∈ R and has the uniform distribution with location parameter c + da and scale parameter dw.
d ∈ (0, ∞)
, then
Y = c + dX
Proof As we saw above, the standard uniform distribution is a basic tool in the random quantile method of simulation. Uniform distributions on intervals are also basic in the rejection method of simulation. We sketch the method in the next paragraph; see the section on general uniform distributions for more theory. Suppose that h is a probability density function for a continuous distribution with values in a bounded interval (a, b) ⊆ R . Suppose also that h is bounded, so that there exits c > 0 such that h(x) ≤ c for all x ∈ (a, b). Let X = (X , X , …) be a sequence of independent variables, each uniformly distributed on (a, b), and let Y = (Y , Y , …) be a sequence of independent variables, each uniformly distributed on (0, c). Finally, assume that X and Y are independent. Then ((X , Y ), (X , Y ), …)) is a sequence of independent variables, each uniformly distributed on (a, b) × (0, c) . Let N = min{n ∈ N : 0 < Y < h(X )} . Then (X , Y ) is uniformly distributed on R = {(x, y) ∈ (a, b) × (0, c) : y < h(x)} (the region under the graph of h ), and therefore X has probability density function h . In words, we generate uniform points in the rectangular region (a, b) × (0, c) until we get a point 1
1
2
2
1
+
1
2
n
2
n
N
N
N
5.21.4
https://stats.libretexts.org/@go/page/10361
under the graph of h . The x-coordinate of that point is our simulated value. The rejection method can be used to approximately simulate random variables when the region under the density function is unbounded. Open the rejection method simulator. For each distribution, select a set of parameter values. Run the experiment 2000 times and observe how the rejection method works. Compare the empirical density function, mean, and standard deviation to their distributional counterparts. This page titled 5.21: The Uniform Distribution on an Interval is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.21.5
https://stats.libretexts.org/@go/page/10361
5.22: Discrete Uniform Distributions Uniform Distributions on a Finite Set Suppose that S is a nonempty, finite set. A random variable X taking values in S has the uniform distribution on S if #(A) P(X ∈ A) =
,
A ⊆S
(5.22.1)
#(S)
The discrete uniform distribution is a special case of the general uniform distribution with respect to a measure, in this case counting measure. The distribution corresponds to picking an element of S at random. Most classical, combinatorial probability models are based on underlying discrete uniform distributions. The chapter on Finite Sampling Models explores a number of such models. The probability density function f of X is given by 1 f (x) =
,
x ∈ S
(5.22.2)
#(S)
Proof Like all uniform distributions, the discrete uniform distribution on a finite set is characterized by the property of constant density on the set. Another property that all uniform distributions share is invariance under conditioning on a subset. Suppose that R is a nonempty subset of S . Then the conditional distribution of X given X ∈ R is uniform on R . Proof If h : S → R then the expected value of h(X) is simply the arithmetic average of the values of h : 1 E[h(X)] =
∑ h(x) #(S)
(5.22.4)
x∈S
Proof The entropy of X depends only on the number of points in S . The entropy of X is H (X) = ln[#(S)] . Proof
Uniform Distributions on Finite Subsets of R Without some additional structure, not much more can be said about discrete uniform distributions. Thus, suppose that n ∈ N and that S = {x , x , … , x } is a subset of R with n points. We will assume that the points are indexed in order, so that x 0 ) are particularly common, since X often represents a random angle. The scale transformation with b = π gives the angle in radians. In this case the probability density function is f (x) = sin(x) for x ∈ [0, π]. 1 2
5.27.2
https://stats.libretexts.org/@go/page/10367
Since the radian is the standard angle unit, this distribution could also be considered the “standard one”. The scale transformation with b = 90 gives the angle in degrees. In this case, the probability density function is f (x) = sin( x) for x ∈ [0, 90]. This was Gilbert's original formulation. π
π
180
90
In the special distribution simulator, select the sine distribution. Vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. The distribution function F of X is given by 1 F (x) =
x −a [1 − cos(π
)] ,
2
x ∈ [a, a + b]
(5.27.12)
b
Proof The quantile function F
−1
of X is given by F
−1
b (p) = a +
arccos(1 − 2p),
p ∈ (0, 1)
(5.27.14)
π
1. The first quartile is a + b/3 . 2. The median is a + b/2 . 3. The third quartile is a + 2b/3 Proof In the special distribution calculator, select the sine distribution. Vary the parameters and note the shape and location of the probability density function and the distribution function. For selected values of the parameters, find the quantiles of order 0.1 and 0.9.
Moments Suppose again that X has the sine distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞). The moment generating function M of X is given by π M (t) =
2
(e
at
2
+e 2
2 (b t
(a+b)t
2
) ,
t ∈ R
(5.27.15)
+π )
Proof The mean and variance of X are 1. E(X) = a + b/2 2. var(X) = b (1/4 − 2/π 2
2
)
Proof In the special distribution simulator, select the sine distribution. Vary the parameters and note the shape and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. The skewness and kurtosis of X are 1. skew(X) = 0 2. kurt(X) = (384 − 48π
2
4
+ π )/(π
2
2
− 8)
Proof
Related Distributions The general sine distribution is a location-scale family, so it is trivially closed under location-scale transformations.
5.27.3
https://stats.libretexts.org/@go/page/10367
Suppose that X has the sine distribution with location parameter a ∈ R and scale parameter b ∈ (0, ∞), and that d ∈ (0, ∞) . Then Y = c + dX has the sine distribution with location parameter c + ad and scale parameter bd .
c ∈ R
and
Proof This page titled 5.27: The Sine Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.27.4
https://stats.libretexts.org/@go/page/10367
5.28: The Laplace Distribution The Laplace distribution, named for Pierre Simon Laplace arises naturally as the distribution of the difference of two independent, identically distributed exponential variables. For this reason, it is also called the double exponential distribution.
The Standard Laplace Distribution Distribution Functions The standard Laplace distribution is a continuous distribution on R with probability density function g given by 1 g(u) =
e
−|u|
,
u ∈ R
(5.28.1)
2
Proof The probability density function g satisfies the following properties: 1. g is symmetric about 0. 2. g increases on (−∞, 0] and decreases on [0, ∞), with mode u = 0 . 3. g is concave upward on (−∞, 0] and on [0, ∞) with a cusp at u = 0 Proof Open the Special Distribution Simulator and select the Laplace distribution. Keep the default parameter value and note the shape of the probability density function. Run the simulation 1000 times and compare the emprical density function and the probability density function. The standard Laplace distribution function G is given by 1
G(u) = {
2
u
e ,
1−
1 2
u ∈ (−∞, 0] e
−u
(5.28.3) ,
u ∈ [0, ∞)
Proof The quantile function G
−1
given by −1
G
ln(2p),
p ∈ [0,
(p) = { − ln[2(1 − p)],
p ∈ [
1 2
1 2
] (5.28.4)
, 1]
1. G (1 − p) = −G (p) for p ∈ (0, 1) 2. The first quartile is q = − ln 2 ≈ −0.6931 . 3. The median is q = 0 4. The third quartile is q = ln 2 ≈ 0.6931 . −1
−1
1
2
3
Proof Open the Special Distribution Calculator and select the Laplace distribution. Keep the default parameter value. Compute selected values of the distribution function and the quantile function.
Moments Suppose that U has the standard Laplace distribution. U
has moment generating function m given by m(t) = E (e
tU
1 ) =
2
,
t ∈ (−1, 1)
(5.28.5)
1 −t
Proof
5.28.1
https://stats.libretexts.org/@go/page/10368
The moments of U are 1. E(U 2. E(U
n
) =0
n
) = n!
if n ∈ N is odd. if n ∈ N is even.
Proof The mean and variance of U are 1. E(U ) = 0 2. var(U ) = 2 Open the Special Distribution Simulator and select the Laplace distribution. Keep the default parameter value. Run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. The skewness and kurtosis of U are 1. skew(U ) = 0 2. kurt(U ) = 6 Proof It follows that the excess kurtosis is kurt(U ) − 3 = 3 .
Related Distributions Of course, the standard Laplace distribution has simple connections to the standard exponential distribution. If U has the standard Laplace distribution then V
= |U |
has the standard exponential distribution.
Proof If V and W are independent and each has the standard exponential distribution, then distribution.
has the standard Laplace
U = V −W
Proof using PDFs Proof using MGFs If
V
has the standard exponential distribution, I has the standard Bernoulli distribution, and has the standard Laplace distribution.
V
and
I
are independent, then
U = (2I − 1)V
Proof The standard Laplace distribution has a curious connection to the standard normal distribution. Suppose that (Z , Z , Z , Z ) is a random sample of size 4 from the standard normal distribution. Then has the standard Laplace distribution. 1
2
3
4
U = Z1 Z2 + Z3 Z4
Proof The standard Laplace distribution has the usual connections to the standard uniform distribution by means of the distribution function and the quantile function computed above. Connections to the standard uniform distribution. 1. If V has the standard uniform distribution then U Laplace distribution. 2. If U has the standard Laplace distribution then V distribution.
= ln(2V )1 (V
2 , g is concave upward, then downward, then upward again, with inflection points at t = [
3(k−1)±√(5k−1)(k−1) 2k
]
Proof So the Weibull density function has a rich variety of shapes, depending on the shape parameter, and has the classic unimodal shape when k > 1 . If k ≥ 1 , g is defined at 0 also. In the special distribution simulator, select the Weibull distribution. Vary the shape parameter and note the shape of the probability density function. For selected values of the shape parameter, run the simulation 1000 times and compare the empirical density function to the probability density function. The quantile function G
−1
is given by −1
G
1. The first quartile is q = (ln 4 − ln 3) 2. The median is q = (ln 2) . 3. The third quartile is q = (ln 4) .
1/k
1
1/k
(p) = [− ln(1 − p)]
,
p ∈ [0, 1)
(5.38.5)
.
1/k
2
1/k
3
Proof Open the special distribution calculator and select the Weibull distribution. Vary the shape parameter and note the shape of the distribution and probability density functions. For selected values of the parameter, compute the median and the first and third quartiles. The reliability function G is given by c
5.38.1
https://stats.libretexts.org/@go/page/10471
c
k
G (t) = exp(−t ),
t ∈ [0, ∞)
(5.38.6)
Proof The failure rate function r is given by k−1
r(t) = kt
,
t ∈ (0, ∞)
(5.38.7)
1. If 0 < k < 1 , r is decreasing with r(t) → ∞ as t ↓ 0 and r(t) → 0 as t → ∞ . 2. If k = 1 , r is constant 1. 3. If k > 1 , r is increasing with r(0) = 0 and r(t) → ∞ as t → ∞ . Proof Thus, the Weibull distribution can be used to model devices with decreasing failure rate, constant failure rate, or increasing failure rate. This versatility is one reason for the wide use of the Weibull distribution in reliability. If k ≥ 1 , r is defined at 0 also.
Moments Suppose that Z has the basic Weibull distribution with shape parameter k ∈ (0, ∞) . The moments of variance of Z can be expressed in terms of the gamma function Γ E(Z
n
) = Γ (1 +
n k
)
Z
, and hence the mean and
for n ≥ 0 .
Proof So the Weibull distribution has moments of all orders. The moment generating function, however, does not have a simple, closed expression in terms of the usual elementary functions. In particular, the mean and variance of Z are 1. E(Z) = Γ (1 + ) 2. var(Z) = Γ (1 + 1
k
2
k
2
)−Γ
(1 +
1 k
)
Note that E(Z) → 1 and var(Z) → 0 as k → ∞ . We will learn more about the limiting distribution below. In the special distribution simulator, select the Weibull distribution. Vary the shape parameter and note the size and location of the mean ± standard deviation bar. For selected values of the shape parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. The skewness and kurtosis also follow easily from the general moment result above, although the formulas are not particularly helpful. Skewness and kurtosis 1. The skewness of Z is 3
Γ(1 + 3/k) − 3Γ(1 + 1/k)Γ(1 + 2/k) + 2 Γ (1 + 1/k) skew(Z) =
(5.38.10) 2
[Γ(1 + 2/k) − Γ (1 + 1/k)]
3/2
2. The kurtosis of Z is 2
4
Γ(1 + 4/k) − 4Γ(1 + 1/k)Γ(1 + 3/k) + 6 Γ (1 + 1/k)Γ(1 + 2/k) − 3 Γ (1 + 1/k) kurt(Z) = 2
[Γ(1 + 2/k) − Γ (1 + 1/k)]
2
(5.38.11)
Proof
Related Distributions As noted above, the standard Weibull distribution (shape parameter 1) is the same as the standard exponential distribution. More generally, any basic Weibull variable can be constructed from a standard exponential variable. Suppose that k ∈ (0, ∞) .
5.38.2
https://stats.libretexts.org/@go/page/10471
1. If U has the standard exponential distribution then Z = U has the basic Weibull distribution with shape parameter k . 2. If Z has the basic Weibull distribution with shape parameter k then U = Z has the standard exponential distribution. 1/k
k
Proof The basic Weibull distribution has the usual connections with the standard uniform distribution by means of the distribution function and the quantile function given above. Suppose that k ∈ (0, ∞) . 1. If U has the standard uniform distribution then Z = (− ln U ) has the basic Weibull distribution with shape parameter k . 2. If Z has the basic Weibull distribution with shape parameter k then U = exp(−Z ) has the standard uniform distribution. 1/k
k
Proof Since the quantile function has a simple, closed form, the basic Weibull distribution can be simulated using the random quantile method. Open the random quantile experiment and select the Weibull distribution. Vary the shape parameter and note again the shape of the distribution and density functions. For selected values of the parameter, run the simulation 1000 times and compare the empirical density, mean, and standard deviation to their distributional counterparts. The limiting distribution with respect to the shape parameter is concentrated at a single point. The basic Weibull distribution with shape parameter k ∈ (0, ∞) converges to point mass at 1 as k → ∞ . Proof
The General Weibull Distribution Like most special continuous distributions on [0, ∞), the basic Weibull distribution is generalized by the inclusion of a scale parameter. A scale transformation often corresponds in applications to a change of units, and for the Weibull distribution this usually means a change in time units. Suppose that Z has the basic Weibull distribution with shape parameter k ∈ (0, ∞) . For b ∈ (0, ∞), random variable X = bZ has the Weibull distribution with shape parameter k and scale parameter b . Generalizations of the results given above follow easily from basic properties of the scale transformation.
Distribution Functions Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). X
distribution function F given by k
t F (t) = 1 − exp[−(
) ],
t ∈ [0, ∞)
(5.38.12)
b
Proof X
has probability density function f given by k f (t) =
k
k−1
t
k
t exp[−(
) ],
t ∈ (0, ∞)
(5.38.13)
b
b
1. If 0 < k < 1 , f is decreasing and concave upward with f (t) → ∞ as t ↓ 0 . 2. If k = 1 , f is decreasing and concave upward with mode t = 0 . 3. If k > 1 , f increases and then decreases, with mode t = b(
k−1 k
1/k
)
.
4. If 1 < k ≤ 2 , f is concave downward and then upward, with inflection point at t = b[
5.38.3
1/k
3(k−1)+√(5k−1)(k−1) 2k
]
https://stats.libretexts.org/@go/page/10471
5. If k > 2 , f is concave upward, then downward, then upward again, with inflection points at 1/k
t = b[
3(k−1)±√(5k−1)(k−1) 2k
]
Proof Open the special distribution simulator and select the Weibull distribution. Vary the parameters and note the shape of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. X
has quantile function F
−1
given by F
−1
1. The first quartile is q = b(ln 4 − ln 3) 2. The median is q = b(ln 2) . 3. The third quartile is q = b(ln 4) .
1/k
1
1/k
(p) = b[− ln(1 − p)]
,
p ∈ [0, 1)
(5.38.14)
.
1/k
2
1/k
3
Proof Open the special distribution calculator and select the Weibull distribution. Vary the parameters and note the shape of the distribution and probability density functions. For selected values of the parameters, compute the median and the first and third quartiles. X
has reliability function F given by c
F
c
t (t) = exp[−(
k
) ],
t ∈ [0, ∞)
(5.38.15)
b
Proof As before, the Weibull distribution has decreasing, constant, or increasing failure rates, depending only on the shape parameter. X
has failure rate function R given by k−1
kt R(t) =
k
,
t ∈ (0, ∞)
(5.38.16)
b
1. If 0 < k < 1 , R is decreasing with R(t) → ∞ as t ↓ 0 and R(t) → 0 as t → ∞ . 2. If k = 1 , R is constant . 3. If k > 1 , R is increasing with R(0) = 0 and R(t) → ∞ as t → ∞ . 1 b
Moments Suppose again that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Recall that by definition, we can take X = bZ where Z has the basic Weibull distribution with shape parameter k . E(X
n
n
) = b Γ (1 +
n k
)
for n ≥ 0 .
Proof In particular, the mean and variance of X are 1. E(X) = bΓ (1 + ) 2. var(X) = b [Γ (1 + 1
k
2
2 k
2
)−Γ
(1 +
1 k
)]
Note that E(X) → b and var(X) → 0 as k → ∞ . Open the special distribution simulator and select the Weibull distribution. Vary the parameters and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the
5.38.4
https://stats.libretexts.org/@go/page/10471
empirical mean and standard deviation to the distribution mean and standard deviation. Skewness and kurtosis 1. The skewness of X is 3
Γ(1 + 3/k) − 3Γ(1 + 1/k)Γ(1 + 2/k) + 2 Γ (1 + 1/k) skew(X) =
(5.38.17) 2
[Γ(1 + 2/k) − Γ (1 + 1/k)]
3/2
2. The kurtosis of X is 2
4
Γ(1 + 4/k) − 4Γ(1 + 1/k)Γ(1 + 3/k) + 6 Γ (1 + 1/k)Γ(1 + 2/k) − 3 Γ (1 + 1/k) kurt(X) = 2
[Γ(1 + 2/k) − Γ (1 + 1/k)]
(5.38.18)
2
Proof
Related Distributions Since the Weibull distribution is a scale family for each value of the shape parameter, it is trivially closed under scale transformations. Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter then Y = cX has the Weibull distribution with shape parameter k and scale parameter bc.
b ∈ (0, ∞)
. If
c ∈ (0, ∞)
Proof The exponential distribution is a special case of the Weibull distribution, the case corresponding to constant failure rate. The Weibull distribution with shape parameter 1 and scale parameter parameter b .
b ∈ (0, ∞)
is the exponential distribution with scale
Proof More generally, any Weibull distributed variable can be constructed from the standard variable. The following result is a simple generalization of the connection between the basic Weibull distribution and the exponential distribution. Suppose that k,
b ∈ (0, ∞)
.
1. If X has the standard exponential distribution (parameter 1), then Y = b X has the Weibull distribution with shape parameter k and scale parameter b . 2. If Y has the Weibull distribution with shape parameter k and scale parameter b , then X = (Y /b) has the standard exponential distribution. 1/k
k
Proof The Rayleigh distribution, named for William Strutt, Lord Rayleigh, is also a special case of the Weibull distribution. The Rayleigh distribution with scale parameter – parameter √2b.
b ∈ (0, ∞)
is the Weibull distribution with shape parameter
2
and scale
Proof Recall that the minimum of independent, exponentially distributed variables also has an exponential distribution (and the rate parameter of the minimum is the sum of the rate parameters of the variables). The Weibull distribution has a similar, but more restricted property. Suppose that (X , X , … , X ) is an independent sequence of variables, each having the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Then U = min{X , X , … , X } has the Weibull distribution with shape parameter k and scale parameter b/n . 1
2
n
1
2
n
1/k
Proof
5.38.5
https://stats.libretexts.org/@go/page/10471
As before, Weibull distribution has the usual connections with the standard uniform distribution by means of the distribution function and the quantile function given above.. Suppose that k,
b ∈ (0, ∞)
.
1. If U has the standard uniform distribution then X = b(− ln U ) has the Weibull distribution with shape parameter k and scale parameter b . 2. If X has the basic Weibull distribution with shape parameter k then U = exp[−(X/b) ] has the standard uniform distribution. 1/k
k
Proof Again, since the quantile function has a simple, closed form, the Weibull distribution can be simulated using the random quantile method. Open the random quantile experiment and select the Weibull distribution. Vary the parameters and note again the shape of the distribution and density functions. For selected values of the parameters, run the simulation 1000 times and compare the empirical density, mean, and standard deviation to their distributional counterparts. The limiting distribution with respect to the shape parameter is concentrated at a single point. The Weibull distribution with shape parameter k → ∞.
k ∈ (0, ∞)
and scale parameter
b ∈ (0, ∞)
converges to point mass at
b
as
Proof Finally, the Weibull distribution is a member of the family of general exponential distributions if the shape parameter is fixed. Suppose that X has the Weibull distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). For fixed k , has a general exponential distribution with respect to b , with natural parameter k − 1 and natural statistics ln X .
X
Proof
Computational Exercises The lifetime T of a device (in hours) has the Weibull distribution with shape parameter k = 1.2 and scale parameter b = 1000. 1. Find the probability that the device will last at least 1500 hours. 2. Approximate the mean and standard deviation of T . 3. Compute the failure rate function. Answer This page titled 5.38: The Weibull Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.38.6
https://stats.libretexts.org/@go/page/10471
5.39: Benford's Law Benford's law refers to probability distributions that seem to govern the significant digits in real data sets. The law is named for the American physicist and engineer Frank Benford, although the “law” was actually discovered earlier by the astronomer and mathematician Simon Newcomb. To understand Benford's law, we need some preliminaries. Recall that a positive real number x can be written uniquely in the form x = y ⋅ 10 (sometimes called scientific notation) where y ∈ [ , 1) is the mantissa and n ∈ Z is the exponent (both of these terms are base 10, of course). Note that 1
n
10
log x = log y + n
(5.39.1)
where the logarithm function is the base 10 common logarithm instead of the usual base e natural logarithm. In the old days BC (before calculators), one would compute the logarithm of a number by looking up the logarithm of the mantissa in a table of logarithms, and then adding the exponent. Of course, these remarks apply to any base b > 1 , not just base 10. Just replace 10 with b and the common logarithm with the base b logarithm.
Distribution of the Mantissa Distribution Functions Suppose now that X is a number selected at random from a certain data set of positive numbers. Based on empirical evidence from a number of different types of data, Newcomb, and later Benford, noticed that the mantissa Y of X seemed to have distribution function F (y) = 1 + log y for y ∈ [1/10, 1). We will generalize this to an arbitrary base b > 1 . The Benford mantissa distribution with base b ∈ (1, ∞), is a continuous distribution on [1/b, 1) with distribution function F given by F (y) = 1 + logb y,
y ∈ [1/b, 1)
(5.39.2)
The special case b = 10 gives the standard Benford mantissa distribution. Proof The probability density function f is given by 1 f (y) =
,
y ∈ [1/b, 1)
(5.39.3)
y ln b
1. f is decreasing with mode y = 2. f is concave upward.
.
1 b
Proof Open the Special Distribution Simulator and select the Benford mantissa distribution. Vary the base b and note the shape of the probability density function. For various values of b , run the simulation 1000 times and compare the empirical density function to the probability density function. The quantile function F
−1
is given by F
−1
1 (p) =
1−p
,
p ∈ [0, 1]
(5.39.4)
b
1. The first quartile is F 2. The median is F
−1
(
−1
1 2
3. The third quartile is F
(
1 4
) =
1 3/4
b 1
) =
√b −1
(
3 4
) =
1 1/4
b
Proof Numerical values of the quartiles for the standard (base 10) distribution are given in an exercise below. Open the special distribution calculator and select the Benford mantissa distribution. Vary the base and note the shape and location of the distribution and probability density functions. For selected values of the base, compute the median and the first and third quartiles.
5.39.1
https://stats.libretexts.org/@go/page/10472
Moments Assume that Y has the Benford mantissa distribution with base b ∈ (1, ∞). The moments of Y are n
E (Y
n
b ) =
−1
n
nb
,
n ∈ (0, ∞)
(5.39.5)
ln b
Proof Note that for fixed n > 0 , E(Y ) → 1 as b ↓ 1 and E(Y ) → 0 as b → ∞ . We will learn more about the limiting distribution below. The mean and variance follow easily from the general moment result. n
n
Mean and variance 1. The mean of Y is b −1 E(Y ) =
(5.39.7) b ln b
2. the variance of Y is b −1 var(Y ) =
b2 ln b
b +1 [
b −1 −
2
]
(5.39.8)
ln b
Numerical values of the mean and variance for the standard (base 10) distribution are given in an exercise below. In the Special Distribution Simulator, select the Benford mantissa distribution. Vary the base b and note the size and location of the mean ± standard deviation bar. For selected values of b , run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.
Related Distributions The Benford mantissa distribution has the usual connections to the standard uniform distribution by means of the distribution function and quantile function given above. Suppose that b ∈ (1, ∞). 1. If U has the standard uniform distribution then Y = b has the Benford mantissa distribution with base b . 2. If Y has the Benford mantissa distribution with base b then U = − log Y has the standard uniform distribution. −U
b
Proof Since the quantile function has a simple closed form, the Benford mantissa distribution can be simulated using the random quantile method. Open the random quantile experiment and select the Benford mantissa distribution. Vary the base b and note again the shape and location of the distribution and probability density functions. For selected values of b , run the simulation 1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts. Also of interest, of course, are the limiting distributions of Y with respect to the base b . The Benford mantissa distribution with base b ∈ (1, ∞) converges to 1. Point mass at 1 as b ↓ 1 . 2. Point mass at 0 as b ↑ ∞ . Proof Since the probability density function is bounded on a bounded support interval, the Benford mantissa distribution can also be simulated via the rejection method. Open the rejection method experiment and select the Benford mantissa distribution. Vary the base b and note again the shape and location of the probability density functions. For selected values of b , run the simulation 1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
5.39.2
https://stats.libretexts.org/@go/page/10472
Distributions of the Digits Assume now that the base is a positive integer b ∈ {2, 3, …}, which of course is the case in standard number systems. Suppose that the sequence of digits of our mantissa Y (in base b ) is (N , N , …), so that 1
2
∞
Y =∑ k=1
Nk
(5.39.9)
k
b
Thus, our leading digit N takes values in {1, 2, … , b − 1}, while each of the other significant digits takes values in {0, 1, … , b − 1}. Note that (N , N , …) is a stochastic process so at least we would like to know the finite dimensional distributions. That is, we would like to know the joint probability density function of the first k digits for every k ∈ N . But let's start, appropriately enough, with the first digit law. The leading digit is the most important one, and fortunately also the easiest to analyze mathematically. 1
1
2
+
First Digit Law has probability density function g given by g (n) = log density function g is decreasing and hence the mode is n = 1 . N1
1
1
b
(1 +
1 n
) = logb (n + 1) − logb (n)
for
n ∈ {1, 2, … , b − 1}
. The
1
Proof Note that when b = 2 , N = 1 deterministically, which of course has to be the case. The first significant digit of a number in base 2 must be 1. Numerical values of g for the standard (base 10) distribution are given in an exercise below. 1
1
In the Special Distribution Simulator, select the Benford first digit distribution. Vary the base b with the input control and note the shape of the probability density function. For various values of b , run the simulation 1000 times and compare the empirical density function to the probability density function. N1
has distribution function G given by G
1 (x)
1
= logb (⌊x⌋ + 1)
for x ∈ [1, b − 1] .
Proof N1
has quantile function G
−1 1
given by G
−1 1
p
(p) = ⌈b
− 1⌉
for p ∈ (0, 1].
1. The first quartile is ⌈b − 1⌉ . 2. The median is ⌈b − 1⌉ . 3. The third quartile is ⌈b − 1⌉ . 1/4
1/2
3/4
Proof Numerical values of the quantiles for the standard (base 10) distribution are given in an exercise below. Open the special distribution calculator and choose the Benford first digit distribution. Vary the base and note the shape and location of the distribution and probability density functions. For selected values of the base, compute the median and the first and third quartiles. For the most part the moments of N do not have simple expressions. However, we do have the following result for the mean. 1
E(N1 ) = (b − 1) − logb [(b − 1)!]
.
Proof Numerical values of the mean and variance for the standard (base 10) distribution are given in an exercise below. Opne the Special Distribution Simulator and select the Benford first digit distribution. Vary the base b with the input control and note the size and location of the mean ± standard deviation bar. For various values of b , run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation.. Since the quantile function has a simple, closed form, the Benford first digit distribution can be simulated via the random quantile method. Open the random quantile experiment and select the Benford first digit distribution. Vary the base b and note again the shape and location of the probability density function. For selected values of the base, run the experiment 1000 times and compare the empirical density function, mean, and standard deviation to their distributional counterparts.
5.39.3
https://stats.libretexts.org/@go/page/10472
Higher Digits Now, to compute the joint probability density function of the first k significant digits, some additional notation will help. If n
1
∈ {1, 2, … , b − 1}
and n
j
∈ {0, 1, … , b − 1}
for j ∈ {2, 3, … , k}, let k k−j
[ n1 n2 ⋯ nk ]b = ∑ nj b
(5.39.13)
j=1
Of course, this is just the base b version of what we do in our standard base 10 system: we represent integers as strings of digits between 0 and 9 (except that the first digit cannot be 0). Here is a base 5 example: 2
[324 ]5 = 3 ⋅ 5
The joint probability density function h of (N
1,
k
N2 , … , Nk )
1
+2 ⋅ 5
0
+4 ⋅ 5
= 89
(5.39.14)
is given by
1 hk (n1 , n2 , … , nk ) = log (1 + b
),
k−1
n1 ∈ {1, 2, … , b − 1}, (n2 , … , nk ) ∈ {2, … , b − 1 }
(5.39.15)
[ n1 n2 ⋯ nk ]b
Proof The probability density function of (N , N ) in the standard (base 10) case is given in an exercise below. Of course, the probability density function of a given digit can be obtained by summing the joint probability density over the unwanted digits in the usual way. However, except for the first digit, these functions do not reduce to simple expressions. 1
2
The probability density function g of N is given by 2
2
b−1
g2 (n) = ∑ logb (1 + k=1
1 [k n]b
b−1
1
) = ∑ logb (1 + k=1
),
n ∈ {0, 1, … , b − 1}
(5.39.18)
k b +n
The probability density function of N in the standard (base 10) case is given in an exercise below. 2
Theoretical Explanation Aside from the empirical evidence noted by Newcomb and Benford (and many others since), why does Benford's law work? For a theoretical explanation, see the article A Statistical Derivation of the Significant Digit Law by Ted Hill.
Computational Exercises In the following exercises, suppose that Y has the standard Benford mantissa distribution (the base 10 decimal case), and that (N are the digits of Y .
1,
N2 , …)
Find each of the following for the mantissa Y 1. The density function f . 2. The mean and variance 3. The quartiles Answer For N , find each of the following numerically 1
1. The probability density function 2. The mean and variance 3. The quartiles Answer Explicitly compute the values of the joint probability density function of (N
1,
N2 )
.
Answer For N , find each of the following numerically 2
1. The probability density function 2. E(N ) 3. var(N ) 2
2
5.39.4
https://stats.libretexts.org/@go/page/10472
Answer Comparing the result for N and the result result for N , note that the distribution of N is flatter than the distribution of N . In general, it turns out that distribution of N converges to the uniform distribution on {0, 1, … , b − 1} as k → ∞ . Interestingly, the digits are dependent. 1
2
2
1
k
N1
and N are dependent. 2
Proof Find each of the following. 1. P(N 2. P(N 3. P(N
1
= 5, N2 = 3, N3 = 1)
1
= 3, N2 = 1, N3 = 5)
1
= 1, N2 = 3, N3 = 5)
This page titled 5.39: Benford's Law is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.39.5
https://stats.libretexts.org/@go/page/10472
5.40: The Zeta Distribution The zeta distribution is used to model the size or ranks of certain types of objects randomly chosen from certain types of populations. Typical examples include the frequency of occurrence of a word randomly chosen from a text, or the population rank of a city randomly chosen from a country. The zeta distribution is also known as the Zipf distribution, in honor of the American linguist George Zipf.
Basic Theory The Zeta Function The Riemann zeta function ζ , named after Bernhard Riemann, is defined as follows: ∞
ζ(a) = ∑ n=1
1 a
,
a ∈ (1, ∞)
(5.40.1)
n
You might recall from calculus that the series in the zeta function converges for a > 1 and diverges for a ≤ 1 .
Figure 5.40.1 : Graph of ζ on the interval (1, 10]
The zeta function satifies the following properties: 1. ζ is decreasing. 2. ζ is concave upward. 3. ζ(a) ↓ 1 as a ↑ ∞ 4. ζ(a) ↑ ∞ as a ↓ 1 The zeta function is transcendental, and most of its values must be approximated. However, ζ(a) can be given explicitly for even integer values of a ; in particular, ζ(2) = and ζ(4) = . π
2
6
π
4
90
The Probability Density Function The zeta distribution with shape parameter given by.
a ∈ (1, ∞)
is a discrete distribution on
N+
with probability density function
f
1 f (n) =
ζ(a)na
,
n ∈ N+
(5.40.2)
1. f is decreasing with mode n = 1 . 2. When smoothed, f is concave upward. Proof Open the special distribution simulator and select the zeta distribution. Vary the shape parameter and note the shape of the probability density function. For selected values of the parameter, run the simulation 1000 times and compare the empirical density function to the probability density function.
5.40.1
https://stats.libretexts.org/@go/page/10473
The distribution function and quantile function do not have simple closed forms, except in terms of other special functions. Open the special distribution calculator and select the zeta distribution. Vary the parameter and note the shape of the distribution and probability density functions. For selected values of the parameter, compute the median and the first and third quartiles.
Moments Suppose that N has the zeta distribution with shape parameter a ∈ (1, ∞) . The moments of X can be expressed easily in terms of the zeta function. If k ≥ a − 1 , E(X) = ∞ . If k < a − 1 , E (N
k
ζ(a − k) ) =
(5.40.3) ζ(a)
Proof The mean and variance of N are as follows: 1. If a > 2 , ζ(a − 1) E(N ) =
(5.40.5) ζ(a)
2. If a > 3 , ζ(a − 2)
ζ(a − 1)
var(N ) =
−( ζ(a)
2
)
(5.40.6)
ζ(a)
Open the special distribution simulator and select the zeta distribution. Vary the parameter and note the shape and location of the mean ± standard deviation bar. For selected values of the parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. The skewness and kurtosis of N are as follows: 1. If a > 4 , ζ(a − 3)ζ
2
(a) − 3ζ(a − 1)ζ(a − 2)ζ(a) + 2 ζ
3
(a − 1)
skew(N ) =
(5.40.7) [ζ(a − 2)ζ(a) − ζ
2
3/2
(a − 1)]
2. If a > 5 , ζ(a − 4)ζ
3
(a) − 4ζ(a − 1)ζ(a − 3)ζ
2
(a) + 6 ζ
2
(a − 1)ζ(a − 2)ζ(a) − 3 ζ
kurt(N ) = [ζ(a − 2)ζ(a) − ζ
2
(a − 1)]
2
4
(a − 1) (5.40.8)
Proof The probability generating function of N can be expressed in terms of the polylogarithm function Li that was introduced in the section on the exponential-logarithmic distribution. Recall that the polylogarithm of order s ∈ R is defined by ∞
k
x
Li s (x) = ∑ k=1
N
s
,
x ∈ (−1, 1)
(5.40.9)
k
has probability generating function P given by N
P (t) = E (t
Li a (t) ) =
,
t ∈ (−1, 1)
(5.40.10)
ζ(a)
Proof
5.40.2
https://stats.libretexts.org/@go/page/10473
Related Distributions In an algebraic sense, the zeta distribution is a discrete version of the Pareto distribution. Recall that if distribution with shape parameter a − 1 is a continuous distribution on [1, ∞) with probability density function
a >1
, the Pareto
a−1 f (x) =
xa
,
x ∈ [1, ∞)
(5.40.12)
Naturally, the limits of the zeta distribution with respect to the shape parameter a are of interest. The zeta distribution with shape parameter a ∈ (1, ∞) converges to point mass at 1 as a → ∞ . Proof Finally, the zeta distribution is a member of the family of general exponential distributions. Suppose that N has the zeta distribution with parameter a . Then the distribution is a one-parameter exponential family with natural parameter a and natural statistic − ln N . Proof
Computational Exercises Let N denote the frequency of occurrence of a word chosen at random from a certain text, and suppose that distribution with parameter a = 2 . Find P(N > 4) .
X
has the zeta
Answer Suppose that N has the zeta distribution with parameter a = 6 . Approximate each of the following: 1. E(N ) 2. var(N ) 3. skew(N ) 4. kurt(N ) Answer This page titled 5.40: The Zeta Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
5.40.3
https://stats.libretexts.org/@go/page/10473
5.41: The Logarithmic Series Distribution The logarithmic series distribution, as the name suggests, is based on the standard power series expansion of the natural logarithm function. It is also sometimes known more simply as the logarithmic distribution.
Basic Theory Distribution Functions The logarithmic series distribution with shape parameter function f given by
p ∈ (0, 1)
is a discrete distribution on
N+
with probability density
n
1
p
f (n) =
,
n ∈ N+
n
− ln(1 − p)
(5.41.1)
1. f is decreasing with mode n = 1 . 2. When smoothed, f is concave upward. Proof Open the Special Distribution Simulator and select the logarithmic series distribution. Vary the parameter and note the shape of the probability density function. For selected values of the parameter, run the simulation 1000 times and compare the empirical density function to the probability density function. The distribution function and the quantile function do not have simple, closed forms in terms of the standard elementary functions. Open the special distribution calculator and select the logarithmic series distribution. Vary the parameter and note the shape of the distribution and probability density functions. For selected values of the parameters, compute the median and the first and third quartiles.
Moments Suppose again that random variable N has the logarithmic series distribution with shape parameter p ∈ (0, 1). Recall that the permutation formula is n = n(n − 1) ⋯ (n − k + 1) for n ∈ R and k ∈ N . The factorial moments of N are E (N ) for k ∈ N. (k)
(k)
The factorial moments of N are given by E (N
(k)
(k − 1)!
k
p
) =
(
) ,
k ∈ N+
1 −p
− ln(1 − p)
(5.41.5)
Proof The mean and variance of N are 1. 2.
1
p
− ln(1 − p)
1 −p
E(N ) =
(5.41.9)
1
p
var(N ) = − ln(1 − p)
p 2
(1 − p)
[1 −
]
(5.41.10)
− ln(1 − p)
Proof Open the special distribution simulator and select the logarithmic series distribution. Vary the parameter and note the shape of the mean ± standard deviation bar. For selected values of the parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation. The probability generating function P of N is given by
5.41.1
https://stats.libretexts.org/@go/page/10474
N
P (t) = E (t
ln(1 − pt) ) =
1 ,
ln(1 − p)
|t|
0 for some j them m(x) > 0 for each i then m(x) ≤ m(y) for each i and x < y for some j then m(x) < m(y) j
j
j
Proof Trivially, the mean of a constant sample is simply the constant. . If c = (c, c, … , c) is a constant sample then m(c) = c .
6.2.1
https://stats.libretexts.org/@go/page/10179
Proof As a special case of these results, suppose that x = (x , x , … , x ) is a sample of size n corresponding to a real variable x, and that a and b are constants. Then the sample corresponding to the variable y = a + bx , in our vector notation, is a + bx . The sample means are related in precisely the same way, that is, m(a + bx) = a + bm(x) . Linear transformations of this type, when b > 0 , arise frequently when physical units are changed. In this case, the transformation is often called a location-scale transformation; a is the location parameter and b is the scale parameter. For example, if x is the length of an object in inches, then y = 2.54x is the length of the object in centimeters. If x is the temperature of an object in degrees Fahrenheit, then y = (x − 32) is the temperature of the object in degree Celsius. 1
2
n
5 9
Sample means are ubiquitous in statistics. In the next few paragraphs we will consider a number of special statistics that are based on sample means. The Empirical Distribution
Suppose now that x = (x , x , … , x ) is a sample of size n from a general variable taking values in a set frequency of A corresponding to x is the number of data values that are in A : 1
2
n
S
. For
A ⊆S
, the
n
n(A) = #{i ∈ {1, 2, … , n} : xi ∈ A} = ∑ 1(xi ∈ A)
(6.2.5)
i=1
The relative frequency of A corresponding to x is the proportion of data values that are in A : n(A) p(A) =
1 =
n
n
n
∑ 1(xi ∈ A)
(6.2.6)
i=1
Note that for fixed A , p(A) is itself a sample mean, corresponding to the data {1(x ∈ A) : i ∈ {1, 2, … , n}}. This fact bears repeating: every sample proportion is a sample mean, corresponding to an indicator variable. In the picture below, the red dots represent the data, so p(A) = 4/15. i
Figure 6.2.2 : The p
empirical probability of A
is a probability measure on S .
1. p(A) ≥ 0 for every A ⊆ S 2. p(S) = 1 3. If {A
j
: j ∈ J}
is a countable collection of pairwise disjont subsets of S then p (⋃
j∈J
Aj ) = ∑
j∈J
p(Aj )
Proof This probability measure is known as the empirical probability distribution associated with the data set x. It is a discrete distribution that places probability at each point x . In fact this observation supplies a simpler proof of previous theorem. Thus, if the data values are distinct, the empirical distribution is the discrete uniform distribution on {x , x , … , x }. More generally, if x ∈ S occurs k times in the data then the empirical distribution assigns probability k/n to x. 1
i
n
1
2
n
If the underlying variable is real-valued, then clearly the sample mean is simply the mean of the empirical distribution. It follows that the sample mean satisfies all properties of expected value, not just the linear properties and increasing properties given above. These properties are just the most important ones, and so were repeated for emphasis. Empirical Density
Suppose now that the population variable R is given by
x
takes values in a set
d
S ⊆R
for some
d ∈ N+
. Recall that the standard measure on
d
6.2.2
https://stats.libretexts.org/@go/page/10179
λd (A) = ∫
d
1 dx,
A ⊆R
(6.2.9)
A
In particular λ (A) is the length of A , for A ⊆ R ; λ (A) is the area of A , for A ⊆ R ; and λ (A) is the volume of A , for A ⊆ R . Suppose that x is a continuous variable in the sense that λ (S) > 0 . Typically, S is an interval if d = 1 and a Cartesian product of intervals if d > 1 . Now for A ⊆ S with λ (A) > 0 , the empirical density of A corresponding to x is 2
1
2
3
3
d
d
p(A) D(A) =
n
1 =
λd (A)
∑ 1(xi ∈ A) n λd (A)
(6.2.10)
i=1
Thus, the empirical density of A is the proportion of data values in (corresponding to d = 2 ), if A has area 5, say, then D(A) = 4/75.
A
, divided by the size of
A
. In the picture below
Figure 6.2.3 : The empirical density of A The Empirical Distribution Function
Suppose again that x = (x , x , … , x ) is a sample of size n from a real-valued variable. For x ∈ R, let F (x) denote the relative frequency (empirical probability) of (−∞, x] corresponding to the data set x. Thus, for each x ∈ R, F (x) is the sample mean of the data {1(x ≤ x) : i ∈ {1, 2, … , n}} : 1
2
n
i
n
1 F (x) = p ((−∞, x]) =
∑ 1(xi ≤ x)
n
F
(6.2.11)
i=1
is a distribution function.
1. F increases from 0 to 1. 2. F is a step function with jumps at the distinct sample values {x
1,
.
x2 , … , xn }
Proof Appropriately enough, F is called the empirical distribution function associated with x and is simply the distribution function of the empirical distribution corresponding to x. If we know the sample size n and the empirical distribution function F , we can recover the data, except for the order of the observations. The distinct values of the data are the places where F jumps, and the number of data values at such a point is the size of the jump, times the sample size n . The Empirical Discrete Density Function
Suppose now that x = (x , x , … , x ) is a sample of size n from a discrete variable that takes values in a countable set S . For x ∈ S , let f (x) be the relative frequency (empirical probability) of x corresponding to the data set x . Thus, for each x ∈ S , f (x) is the sample mean of the data {1(x = x) : i ∈ {1, 2, … , n}}: 1
2
n
i
1 f (x) = p({x}) = n
n
∑ 1(xi = x)
(6.2.12)
i=1
In the picture below, the dots are the possible values of the underlying variable. The red dots represent the data, and the numbers indicate repeated values. The blue dots are possible values of the the variable that did not happen to occur in the data. So, the sample size is 12, and for the value x that occurs 3 times, we have f (x) = 3/12.
Figure 6.2.4 : The discrete probability density function
6.2.3
https://stats.libretexts.org/@go/page/10179
f
is a discrete probabiltiy density function:
1. f (x) ≥ 0 for x ∈ S 2. ∑ f (x) = 1 x∈S
Proof Appropriately enough, f is called the empirical probability density function or the relative frequency function associated with x, and is simply the probabiltiy density function of the empirical distribution corresponding to x. If we know the empirical PDF f and the sample size n , then we can recover the data set, except for the order of the observations. If the underlying population variable is real-valued, then the sample mean is the expected value computed relative to the empirical density function. That is, n
1 n
∑ xi = ∑ x f (x) i=1
(6.2.14)
x∈S
Proof As we noted earlier, if the population variable is real-valued then the sample mean is the mean of the empirical distribution. The Empirical Continuous Density Function
Suppose now that
is a sample of size n from a continuous variable that takes values in a set S ⊆ R . Let A = { A : j ∈ J} be a partition of S into a countable number of subsets, each of positive, finite measure. Recall that the word partition means that the subsets are pairwise disjoint and their union is S . Let f be the function on S defined by the rule that f (x) is the empricial density of A , corresponding to the data set x, for each x ∈ A . Thus, f is constant on each of the partition sets: d
x = (x1 , x2 , … , xn )
j
j
j
p(Aj ) = λd (Aj )
f
n
1
f (x) = D(Aj ) =
∑ 1(xi ∈ Aj ), nλd (Aj )
x ∈ Aj
(6.2.16)
i=1
is a continuous probabiltiy density function.
1. f (x) ≥ 0 for x ∈ S 2. ∫ f (x) dx = 1 S
Proof The function f is called the empirical probability density function associated with the data x and the partition A . For the probability distribution defined by f , the empirical probability p(A ) is uniformly distributed over A for each j ∈ J . In the picture below, the red dots represent the data and the black lines define a partition of S into 9 rectangles. For the partition set A in the upper right, the empirical distribution would distribute probability 3/15 = 1/5 uniformly over A . If the area of A is, say, 4, then f (x) = 1/20 for x ∈ A . j
j
Figure 6.2.5 : Empirical probability density function
Unlike the discrete case, we cannot recover the data from the empirical PDF. If we know the sample size, then of course we can determine the number of data points in A for each j , but not the precise location of these points in A . For this reason, the mean of the empirical PDF is not in general the same as the sample mean when the underlying variable is real-valued. j
j
Histograms
Our next discussion is closely related to the previous one. Suppose again that x = (x , x , … , x ) is a sample of size n from a variable that takes values in a set S and that A = (A , A , … , A ) is a partition of S into k subsets. The sets in the partition are sometimes known as classes. The underlying variable may be discrete or continuous. 1
1
2
2
n
k
The mapping that assigns frequencies to classes is known as a frequency distribution for the data set and the given partition.
6.2.4
https://stats.libretexts.org/@go/page/10179
The mapping that assigns relative frequencies to classes is known as a relative frequency distribution for the data set and the given partition. In the case of a continuous variable, the mapping that assigns densities to classes is known as a density distribution for the data set and the given partition. In dimensions 1 or 2, the bar graph any of these distributions, is known as a histogram. The histogram of a frequency distribution and the histogram of the corresponding relative frequency distribution look the same, except for a change of scale on the vertical axis. If the classes all have the same size, the histogram of the corresponding density histogram also looks the same, again except for a change of scale on the vertical axis. If the underlying variable is real-valued, the classes are usually intervals (discrete or continuous) and the midpoints of these intervals are sometimes referred to as class marks.
Figure 6.2.6 : A density histogram
The whole purpose of constructing a partition and graphing one of these empirical distributions corresponding to the partition is to summarize and display the data in a meaningful way. Thus, there are some general guidelines in choosing the classes: 1. The number of classes should be moderate. 2. If possible, the classes should have the same size. For highly skewed distributions, classes of different sizes are appropriate, to avoid numerous classes with very small frequencies. For a continuous variable with classes of different sizes, it is essential to use a density histogram, rather than a frequency or relative frequency histogram, otherwise the graphic is visually misleading, and in fact mathematically wrong. It is important to realize that frequency data is inevitable for a continuous variable. For example, suppose that our variable represents the weight of a bag of M&Ms (in grams) and that our measuring device (a scale) is accurate to 0.01 grams. If we measure the weight of a bag as 50.32, then we are really saying that the weight is in the interval [50.315, 50.324)(or perhaps some other interval, depending on how the measuring device works). Similarly, when two bags have the same measured weight, the apparent equality of the weights is really just an artifact of the imprecision of the measuring device; actually the two bags almost certainly do not have the exact same weight. Thus, two bags with the same measured weight really give us a frequency count of 2 for a certain interval. Again, there is a trade-off between the number of classes and the size of the classes; these determine the resolution of the empirical distribution corresponding to the partition. At one extreme, when the class size is smaller than the accuracy of the recorded data, each class contains a single datum or no datum. In this case, there is no loss of information and we can recover the original data set from the frequency distribution (except for the order in which the data values were obtained). On the other hand, it can be hard to discern the shape of the data when we have many classes with small frequency. At the other extreme is a frequency distribution with one class that contains all of the possible values of the data set. In this case, all information is lost, except the number of the values in the data set. Between these two extreme cases, an empirical distribution gives us partial information, but not complete information. These intermediate cases can organize the data in a useful way. Ogives
Suppose now the underlying variable is real-valued and that the set of possible values is partitioned into intervals (A , A , … , A ), with the endpoints of the intervals ordered from smallest to largest. Let n denote the frequency of class A , so that p = n /n is the relative frequency of class A . Let t denote the class mark (midpoint) of class A . The cumulative frequency of class A is N = ∑ n and the cumulative relative frequency of class A is P = ∑ p = N /n . Note that the cumulative frequencies increase from n to n and the cumulative relative frequencies increase from p to 1. 1
2
k
j
j
j
j
j
j
j
j
i=1
j
j
j
i
j
1
j
i=1
i
j
1
6.2.5
https://stats.libretexts.org/@go/page/10179
The mapping that assigns cumulative frequencies to classes is known as a cumulative frequency distribution for the data set and the given partition. The polygonal graph that connects the points (t , N ) for j ∈ {1, 2, … , k} is the cumulative frequency ogive. The mapping that assigns cumulative relative frequencies to classes is known as a cumulative relative frequency distribution for the data set and the given partition. The polygonal graph that connects the points (t , P ) for j ∈ {1, 2, … , k} is the cumulative relative frequency ogive. j
j
j
j
Note that the relative frquency ogive is simply the graph of the distribution function corresponding to the probability distibution that places probability p at t for each j . j
j
Approximating the Mean
In the setting of the last subsection, suppose that we do not have the actual data approximate value of the sample mean is 1 n
k
x
, but just the frequency distribution. An
k
∑ nj tj = ∑ pj tj j=1
(6.2.18)
j=1
This approximation is based on the hope that the mean of the data values in each class is close to the midpoint of that class. In fact, the expression on the right is the expected value of the distribution that places probability p on class mark t for each j . j
j
Exercises Basic Properties
Suppose that operation.
x
is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of
1. Classify x by type and level of measurement. 2. A sample of 30 components has mean 113°. Find the sample mean if the temperature is converted to degrees Celsius. The transformation is y = (x − 32) . 5 9
Answer Suppose that x is the length (in inches) of a machined part in a manufacturing process. 1. Classify x by type and level of measurement. 2. A sample of 50 parts has mean 10.0. Find the sample mean if length is measured in centimeters. The transformation is y = 2.54x. Answer Suppose that x is the number of brothers and the number of siblings.
y
the number of sisters for a person in a certain population. Thus,
z = x +y
is
1. Classify the variables by type and level of measurement. 2. For a sample of 100 persons, m(x) = 0.8 and m(y) = 1.2 . Find m(z). Answer Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade on the first midterm exam was 64 (out of a possible 100 points). Professor Moriarity thinks the grades are a bit low and is considering various transformations for increasing the grades. In each case below give the mean of the transformed grades, or state that there is not enough information. 1. Add 10 points to each grade, so the transformation is y = x + 10 . 2. Multiply each grade by 1.2, so the transformation is z = 1.2x 3. Use the transformation w = 10√− x . Note that this is a non-linear transformation that curves the grades greatly at the low end and very little at the high end. For example, a grade of 100 is still 100, but a grade of 36 is transformed to 60. One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an outlier.
6.2.6
https://stats.libretexts.org/@go/page/10179
4. What would the mean be if this score is omitted? Answer Computational Exercises
All statistical software packages will compute means and proportions, draw dotplots and histograms, and in general perform the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets, the use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the graphs with minimal technological aids. Suppose that
x
is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data .
x = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2)
1. Classify x by type and level of measurement. 2. Sketch the dotplot. 3. Compute the sample mean m from the definition and indicate its location on the dotplot. 4. Find the empirical density function f and sketch the graph. 5. Compute the sample mean m using f . 6. Find the empirical distribution function F and sketch the graph. Answer Suppose that a sample of size 12 from a discrete variable f (−1) = 1/4, f (0) = 1/3, f (1) = 1/6, f (2) = 1/6.
x
has empirical density function given by
,
f (−2) = 1/12
1. Sketch the graph of f . 2. Compute the sample mean m using f . 3. Find the empirical distribution function F 4. Give the sample values, ordered from smallest to largest. Answer The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample of ESU students. Class
Freq
(0, 2]
6
(2, 6]
16
(6, 10]
18
(10, 20]
10
Total
Rel Freq
Density
Density
Cum Freq
Cum Freq
Cum Rel Freq
Cum Rel Freq
Midpoint
Midpoint
1. Complete the table 2. Sketch the density histogram 3. Sketch the cumulative relative frquency ogive. 4. Compute an approximation to the mean Answer App Exercises
In the interactive histogram, click on the x-axis at various points to generate a data set with at least 20 values. Vary the number of classes and switch between the frequency histogram and the relative frequency histogram. Note how the shape of the histogram changes as you perform these operations. Note in particular how the histogram loses resolution as you decrease the number of classes.
6.2.7
https://stats.libretexts.org/@go/page/10179
In the interactive histogram, click on the axis to generate a distribution of the given type with at least 30 points. Now vary the number of classes and note how the shape of the distribution changes. 1. A uniform distribution 2. A symmetric unimodal distribution 3. A unimodal distribution that is skewed right. 4. A unimodal distribution that is skewed left. 5. A symmetric bimodal distribution 6. A u-shaped distribution. Data Analysis Exercises
Statistical software should be used for the problems in this subsection. Consider the petal length and species variables in Fisher's iris data. 1. Classify the variables by type and level of measurement. 2. Compute the sample mean and plot a density histogram for petal length. 3. Compute the sample mean and plot a density histogram for petal length by species. Answers Consider the erosion variable in the Challenger data set. 1. Classify the variable by type and level of measurement. 2. Compute the mean 3. Plot a density histogram with the classes [0, 5), [5, 40), [40, 50), [50, 60). Answer Consider Michelson's velocity of light data. 1. Classify the variable by type and level of measurement. 2. Plot a density histogram. 3. Compute the sample mean. 4. Find the sample mean if the variable is converted to km/hr. The transformation is y = x + 299 000 Answer Consider Short's paralax of the sun data. 1. Classify the variable by type and level of measurement. 2. Plot a density histogram. 3. Compute the sample mean. 4. Find the sample mean if the variable is converted to degrees. There are 3600 seconds in a degree. 5. Find the sample mean if the variable is converted to radians. There are π/180 radians in a degree. Answer Consider Cavendish's density of the earth data. 1. Classify the variable by type and level of measurement. 2. Compute the sample mean. 3. Plot a density histogram. Answer Consider the M&M data. 1. Classify the variables by type and level of measurement. 2. Compute the sample mean for each color count variable. 3. Compute the sample mean for the total number of candies, using the results from (b). 4. Plot a relative frequency histogram for the total number of candies.
6.2.8
https://stats.libretexts.org/@go/page/10179
5. Compute the sample mean and plot a density histogram for the net weight. Answer Consider the body weight, species, and gender variables in the Cicada data. 1. Classify the variables by type and level of measurement. 2. Compute the relative frequency function for species and plot the graph. 3. Compute the relative frequeny function for gender and plot the graph. 4. Compute the sample mean and plot a density histogram for body weight. 5. Compute the sample mean and plot a density histogrm for body weight by species. 6. Compute the sample mean and plot a density histogram for body weight by gender. Answer Consider Pearson's height data. 1. Classify the variables by type and level of measurement. 2. Compute the sample mean and plot a density histogram for the height of the father. 3. Compute the sample mean and plot a density histogram for the height of the son. Answer This page titled 6.2: The Sample Mean is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
6.2.9
https://stats.libretexts.org/@go/page/10179
6.3: The Law of Large Numbers Basic Theory This section continues the discussion of the sample mean from the last section, but we now consider the more interesting setting where the variables are random. Specifically, suppose that we have a basic random experiment with an underlying probability measure P, and that X is random variable for the experiment. Suppose now that we perform n independent replications of the basic experiment. This defines a new, compound experiment with a sequence of independent random variables X = (X , X , … , X ), each with the same distribution as X. Recall that in statistical terms, X is a random sample of size n from the distribution of X. All of the relevant statistics discussed in the the previous section, are defined for X, but of course now these statistics are random variables with distributions of their own. For the most part, we use the notation established previously, except that for the usual convention of denoting random variables with capital letters. Of course, the deterministic properties and relations established previously apply as well. When we acutally run the experiment and observe the values x = (x , x , … , x ) of the random variables, then we are precisely in the setting of the previous section. 1
1
Suppose now that the basic variable X is real valued, and let variance of X (assumed finite). The sample mean is
μ = E(X)
1 M = n
2
denote the expected value of
2
n
n
X
and
σ
2
= var(X)
the
n
∑ Xi
(6.3.1)
i=1
Ofen the distribution mean μ is unknown and the sample mean M is used as an estimator of this unknown parameter. Moments
The mean and variance of M are 1. E(M ) = μ 2. var(M ) = σ
2
/n
Proof Part (a) means that the sample mean M is an unbiased estimator of the distribution mean μ . Therefore, the variance of M is the mean square error, when M is used as an estimator of μ . Note that the variance of M is an increasing function of the distribution variance and a decreasing function of the sample size. Both of these make intuitive sense if we think of the sample mean M as an estimator of the distribution mean μ . The fact that the mean square error (variance in this case) decreases to 0 as the sample size n increases to ∞ means that the sample mean M is a consistent estimator of the distribution mean μ . Recall that X − M is the deviation of X from M , that is, the directed distance from M to X . The following theorem states that the sample mean is uncorrelated with each deviation, a result that will be crucial for showing the independence of the sample mean and the sample variance when the sampling distribution is normal. i
M
and X
i
−M
i
i
are uncorrelated.
Proof The Weak and Strong Laws of Large Numbers
The law of large numbers states that the sample mean converges to the distribution mean as the sample size increases, and is one of the fundamental theorems of probability. There are different versions of the law, depending on the mode of convergence. Suppose again that X is a real-valued random variable for our basic experiment, with mean μ and standard deviation σ (assumed finite). We repeat the basic experiment indefinitely to create a new, compound experiment with an infinite sequence of independent random variables (X , X , …), each with the same distribution as X. In statistical terms, we are sampling from the distribution of X. In probabilistic terms, we have an independent, identically distributed (IID) sequence. For each n , let M denote the sample mean of the first n sample variables: 1
2
n
1 Mn =
n
n
∑ Xi
(6.3.5)
i=1
6.3.1
https://stats.libretexts.org/@go/page/10180
From the result above on variance, note that
var(Mn ) = E [(Mn − μ)
in mean square. As stated in the next theorem, M
n
P (| Mn − μ| > ϵ) → 0
→ μ
2
as n → ∞ . This means that
] → 0
Mn → μ
as
n → ∞
as n → ∞ in probability as well.
as n → ∞ for every ϵ > 0 .
Proof Recall that in general, convergence in mean square implies convergence in probability. The convergence of the sample mean to the distribution mean in mean square and in probability are known as weak laws of large numbers. Finally, the strong law of large numbers states that the sample mean M converges to the distribution mean μ with probability 1 . As the name suggests, this is a much stronger result than the weak laws. We will need some additional notation for the proof. First let Y = ∑ X so that M = Y /n . Next, recall the definitions of the positive and negative parts a real number x: x = max{x, 0}, x = max{−x, 0}. Note that x ≥ 0, x ≥ 0, x = x −x , and |x| = x + x . n
n
n
i=1
i
+
n
n
−
Mn → μ
+
−
+
−
+
−
as n → ∞ with probability 1.
Proof The proof of the strong law of large numbers given above requires that the variance of the sampling distribution be finite (note that this is critical in the first step). However, there are better proofs that only require that E (|X|) < ∞ . An elegant proof showing that M → μ as n → ∞ with probability 1 and in mean, using backwards martingales, is given in the chapter on martingales. In the next few paragraphs, we apply the law of large numbers to some of the special statistics studied in the previous section. n
Emprical Probability
Suppose that X is the outcome random variable for a basic experiment, with sample space S and probability measure P. Now suppose that we repeat the basic experiment indefinitley to form a sequence of independent random variables (X , X , …) each with the same distribution as X. That is, we sample from the distribution of X. For A ⊆ S , let P (A) denote the empricial probability of A corresponding to the sample (X , X , … , X ): 1
2
n
1
2
n
1 Pn (A) =
n
n
∑ 1(Xi ∈ A)
(6.3.13)
i=1
Now of course, P (A) is a random variable for each event A . In fact, the sum ∑ parameters n and P(A) .
n
n
i=1
1(Xi ∈ A)
has the binomial distribution with
For each event A , 1. E [P (A)] = P(A) 2. var [P (A)] = P(A) [1 − P(A)] 3. P (A) → P(A) as n → ∞ with probability 1. n
1
n
n
n
Proof This special case of the law of large numbers is central to the very concept of probability: the relative frequency of an event converges to the probability of the event as the experiment is repeated. The Empirical Distribution Function
Suppose now that X is a real-valued random variable for a basic experiment. Recall that the distribution function of function F given by F (x) = P(X ≤ x),
x ∈ R
X
is the
(6.3.14)
Now suppose that we repeat the basic experiment indefintely to form a sequence of independent random variables (X , X , …), each with the same distribution as X. That is, we sample from the distribution of X. Let F denote the empirical distribution function corresponding to the sample (X , X , … , X ): 1
2
n
1
2
n
1 Fn (x) =
n
n
∑ 1(Xi ≤ x),
x ∈ R
(6.3.15)
i=1
6.3.2
https://stats.libretexts.org/@go/page/10180
Now, of course, F (x) is a random variable for each x ∈ R. In fact, the sum parameters n and F (x). n
n
∑
i=1
1(Xi ≤ x)
has the binomial distribution with
For each x ∈ R, 1. E [F (x)] = F (x) 2. var [F (x)] = F (x) [1 − F (x)] 3. F (x) → F (x) as n → ∞ with probability 1. n
1
n
n
n
Proof Empirical Density for a Discrete Variable
Empirical Density for a Discrete Variable

Suppose now that $X$ is a random variable for a basic experiment with a discrete distribution on a countable set $S$. Recall that the probability density function of $X$ is the function $f$ given by
$$ f(x) = P(X = x), \quad x \in S \tag{6.3.16} $$
Now suppose that we repeat the basic experiment to form a sequence of independent random variables $(X_1, X_2, \ldots)$, each with the same distribution as $X$. That is, we sample from the distribution of $X$. Let $f_n$ denote the empirical probability density function corresponding to the sample $(X_1, X_2, \ldots, X_n)$:
$$ f_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(X_i = x), \quad x \in S \tag{6.3.17} $$
Now, of course, $f_n(x)$ is a random variable for each $x \in S$. In fact, the sum $\sum_{i=1}^n \mathbf{1}(X_i = x)$ has the binomial distribution with parameters $n$ and $f(x)$.

For each $x \in S$,
1. $E[f_n(x)] = f(x)$
2. $\operatorname{var}[f_n(x)] = \frac{1}{n} f(x)[1 - f(x)]$
3. $f_n(x) \to f(x)$ as $n \to \infty$ with probability 1.

Proof

Recall that a countable intersection of events with probability 1 still has probability 1. Thus, in the context of the previous theorem, we actually have
$$ P[f_n(x) \to f(x) \text{ as } n \to \infty \text{ for every } x \in S] = 1 \tag{6.3.18} $$
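As an illustrative sketch (an addition, assuming Python with numpy), the empirical density of a hypothetical ace-six flat die (faces 1 and 6 with probability 1/4, the other faces with probability 1/8, as in the exercises below) can be compared with the true density:

```python
import numpy as np

faces = np.array([1, 2, 3, 4, 5, 6])
f = np.array([1/4, 1/8, 1/8, 1/8, 1/8, 1/4])   # true pdf of an ace-six flat die

rng = np.random.default_rng(1)
n = 10_000
sample = rng.choice(faces, size=n, p=f)

f_n = np.array([(sample == x).mean() for x in faces])  # empirical pdf
print(np.column_stack([faces, f, f_n]))                # f_n(x) is close to f(x)
```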
Empirical Density for a Continuous Variable
Suppose now that $X$ is a random variable for a basic experiment, with a continuous distribution on $S \subseteq \mathbb{R}^d$, and that $X$ has probability density function $f$. Technically, $f$ is the probability density function with respect to the standard (Lebesgue) measure $\lambda_d$. Thus, by definition,
$$ P(X \in A) = \int_A f(x) \, dx, \quad A \subseteq S \tag{6.3.19} $$
Again we repeat the basic experiment to generate a sequence of independent random variables $(X_1, X_2, \ldots)$, each with the same distribution as $X$. That is, we sample from the distribution of $X$. Suppose now that $\mathscr{A} = \{A_j : j \in J\}$ is a partition of $S$ into a countable number of subsets, each with positive, finite size. Let $f_n$ denote the empirical probability density function corresponding to the sample $(X_1, X_2, \ldots, X_n)$ and the partition $\mathscr{A}$:
$$ f_n(x) = \frac{P_n(A_j)}{\lambda_d(A_j)} = \frac{1}{n \, \lambda_d(A_j)} \sum_{i=1}^n \mathbf{1}(X_i \in A_j); \quad j \in J, \; x \in A_j \tag{6.3.20} $$
Of course now, $f_n(x)$ is a random variable for each $x \in S$. If the partition is sufficiently fine (so that $\lambda_d(A_j)$ is small for each $j$), and if the sample size $n$ is sufficiently large, then by the law of large numbers,
$$ f_n(x) \approx f(x), \quad x \in S \tag{6.3.21} $$
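Here is a brief sketch of the partition-based empirical density (an addition, assuming Python with numpy); the standard exponential distribution and the mesh size are illustrative choices, not from the text.

```python
import numpy as np

# Empirical density from a partition A_j = [j*h, (j+1)*h) with mesh h = 0.25,
# for a sample from the standard exponential distribution (true pdf e^{-x}).
rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=50_000)

h = 0.25
edges = np.arange(0, 8 + h, h)
counts, _ = np.histogram(x, bins=edges)
f_n = counts / (len(x) * h)                  # P_n(A_j) / lambda(A_j)

mids = edges[:-1] + h / 2
print(np.column_stack([mids[:8], f_n[:8], np.exp(-mids[:8])]))  # f_n vs e^{-x}
```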
Exercises

Simulation Exercises

In the dice experiment, recall that the dice scores form a random sample from the specified die distribution. Select the average random variable, which is the sample mean of the sample of dice scores. For each die distribution, start with 1 die and increase the sample size $n$. Note how the distribution of the sample mean begins to resemble a point mass distribution. Note also that the mean of the sample mean stays the same, but the standard deviation of the sample mean decreases. For selected values of $n$ and selected die distributions, run the simulation 1000 times and compare the relative frequency function of the sample mean to the true probability density function, and compare the empirical moments of the sample mean to the true moments.

Several apps in this project are simulations of random experiments with events of interest. When you run the experiment, you are performing independent replications of the experiment. In most cases, the app displays the probability of the event and its complement, both graphically in blue and numerically in a table. When you run the experiment, the relative frequencies are shown graphically in red and also numerically.

In the simulation of Buffon's coin experiment, the event of interest is that the coin crosses a crack. For various values of the parameter (the radius of the coin), run the experiment 1000 times and compare the relative frequency of the event to the true probability.

In the simulation of Bertrand's experiment, the event of interest is that a "random chord" on a circle will be longer than the length of a side of the inscribed equilateral triangle. For each of the various models, run the experiment 1000 times and compare the relative frequency of the event to the true probability.

Many of the apps in this project are simulations of experiments which result in discrete variables. When you run the simulation, you are performing independent replications of the experiment. In most cases, the app displays the true probability density function numerically in a table and visually as a blue bar graph. When you run the simulation, the relative frequency function is also shown numerically in the table and visually as a red bar graph.

In the simulation of the binomial coin experiment, select the number of heads. For selected values of the parameters, run the simulation 1000 times and compare the sample mean to the distribution mean, and compare the empirical density function to the probability density function.

In the simulation of the matching experiment, the random variable is the number of matches. For selected values of the parameter, run the simulation 1000 times and compare the sample mean to the distribution mean, and compare the empirical density function to the probability density function.

In the poker experiment, the random variable is the type of hand. Run the simulation 1000 times and compare the empirical density function to the true probability density function.

Many of the apps in this project are simulations of experiments which result in variables with continuous distributions. When you run the simulation, you are performing independent replications of the experiment. In most cases, the app displays the true probability density function visually as a blue graph. When you run the simulation, an empirical density function, based on a partition, is also shown visually as a red bar graph.

In the simulation of the gamma experiment, the random variable represents a random arrival time. For selected values of the parameters, run the experiment 1000 times and compare the sample mean to the distribution mean, and compare the empirical density function to the probability density function.

In the special distribution simulator, select the normal distribution. For various values of the parameters (the mean and standard deviation), run the experiment 1000 times and compare the sample mean to the distribution mean, and compare the empirical density function to the probability density function.
Probability Exercises

Suppose that $X$ has probability density function $f(x) = 12 x^2 (1 - x)$ for $0 \le x \le 1$. The distribution of $X$ is a member of the beta family. Compute each of the following:
1. $E(X)$
2. $\operatorname{var}(X)$
3. $P(X \le \frac{1}{2})$

Answer

Suppose now that $(X_1, X_2, \ldots, X_9)$ is a random sample of size 9 from the distribution in the previous problem. Find the expected value and variance of each of the following random variables:
1. The sample mean $M$
2. The empirical probability $P_9([0, \frac{1}{2}])$

Answer
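The moments in the first exercise above can also be checked symbolically. The following sketch is an addition, assuming Python with sympy:

```python
from sympy import symbols, integrate, Rational

# Symbolic check of the moments for f(x) = 12 x^2 (1 - x) on [0, 1]
x = symbols('x')
f = 12 * x**2 * (1 - x)

EX = integrate(x * f, (x, 0, 1))             # E(X)
EX2 = integrate(x**2 * f, (x, 0, 1))         # E(X^2)
print(EX, EX2 - EX**2)                       # mean and variance
print(integrate(f, (x, 0, Rational(1, 2))))  # P(X <= 1/2)
```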
Suppose that $X$ has probability density function $f(x) = 3 / x^4$ for $1 \le x < \infty$. The distribution of $X$ is a member of the Pareto family. Compute each of the following:
1. $E(X)$
2. $\operatorname{var}(X)$
3. $P(2 \le X \le 3)$

Answer

Suppose now that $(X_1, X_2, \ldots, X_{16})$ is a random sample of size 16 from the distribution in the previous problem. Find the expected value and variance of each of the following random variables:
1. The sample mean $M$
2. The empirical probability $P_{16}([2, 3])$

Answer

Recall that for an ace-six flat die, faces 1 and 6 have probability $\frac{1}{4}$ each, while faces 2, 3, 4, and 5 have probability $\frac{1}{8}$ each. Let $X$ denote the score when an ace-six flat die is thrown. Compute each of the following:
1. The probability density function $f(x)$ for $x \in \{1, 2, 3, 4, 5, 6\}$
2. The distribution function $F(x)$ for $x \in \{1, 2, 3, 4, 5, 6\}$
3. $E(X)$
4. $\operatorname{var}(X)$

Answer

Suppose now that an ace-six flat die is thrown $n$ times. Find the expected value and variance of each of the following random variables:
1. The empirical probability density function $f_n(x)$ for $x \in \{1, 2, 3, 4, 5, 6\}$
2. The empirical distribution function $F_n(x)$ for $x \in \{1, 2, 3, 4, 5, 6\}$
3. The average score $M_n$

Answer

This page titled 6.3: The Law of Large Numbers is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
6.4: The Central Limit Theorem

The central limit theorem and the law of large numbers are the two fundamental theorems of probability. Roughly, the central limit theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. The importance of the central limit theorem is hard to overstate; indeed it is the reason that many statistical procedures work.
Partial Sum Processes

Definitions

Suppose that $X = (X_1, X_2, \ldots)$ is a sequence of independent, identically distributed, real-valued random variables with common probability density function $f$, mean $\mu$, and variance $\sigma^2$. We assume that $0 < \sigma < \infty$, so that in particular, the random variables really are random and not constants. Let
$$ Y_n = \sum_{i=1}^n X_i, \quad n \in \mathbb{N} \tag{6.4.1} $$
Note that by convention, $Y_0 = 0$, since the sum is over an empty index set. The random process $Y = (Y_0, Y_1, Y_2, \ldots)$ is called the partial sum process associated with $X$. Special types of partial sum processes have been studied in many places in this text; in particular see
- the binomial distribution in the setting of Bernoulli trials
- the negative binomial distribution in the setting of Bernoulli trials
- the gamma distribution in the Poisson process
- the arrival times in a general renewal process

Recall that in statistical terms, the sequence $X$ corresponds to sampling from the underlying distribution. In particular, $(X_1, X_2, \ldots, X_n)$ is a random sample of size $n$ from the distribution, and the corresponding sample mean is
$$ M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i \tag{6.4.2} $$
By the law of large numbers, $M_n \to \mu$ as $n \to \infty$ with probability 1.
Stationary, Independent Increments

The partial sum process corresponding to a sequence of independent, identically distributed variables has two important properties, and these properties essentially characterize such processes.

If $m \le n$ then $Y_n - Y_m$ has the same distribution as $Y_{n-m}$. Thus the process $Y$ has stationary increments.

Proof

Note however that $Y_n - Y_m$ and $Y_{n-m}$ are very different random variables; the theorem simply states that they have the same distribution.

If $n_1 \le n_2 \le n_3 \le \cdots$ then $(Y_{n_1}, Y_{n_2} - Y_{n_1}, Y_{n_3} - Y_{n_2}, \ldots)$ is a sequence of independent random variables. Thus the process $Y$ has independent increments.

Proof

Conversely, suppose that $V = (V_0, V_1, V_2, \ldots)$ is a random process with stationary, independent increments. Define $U_i = V_i - V_{i-1}$ for $i \in \mathbb{N}_+$. Then $U = (U_1, U_2, \ldots)$ is a sequence of independent, identically distributed variables and $V$ is the partial sum process associated with $U$.

Thus, partial sum processes are the only discrete-time random processes that have stationary, independent increments. An interesting, and much harder, problem is to characterize the continuous-time processes that have stationary, independent increments. The Poisson counting process has stationary, independent increments, as does the Brownian motion process.
Moments

If $n \in \mathbb{N}$ then
1. $E(Y_n) = n \mu$
2. $\operatorname{var}(Y_n) = n \sigma^2$

Proof

If $m, n \in \mathbb{N}$ with $m \le n$ then
1. $\operatorname{cov}(Y_m, Y_n) = m \sigma^2$
2. $\operatorname{cor}(Y_m, Y_n) = \sqrt{m / n}$
3. $E(Y_m Y_n) = m \sigma^2 + m n \mu^2$

Proof

If $X$ has moment generating function $G$ then $Y_n$ has moment generating function $G^n$.

Proof

Distributions

Suppose that $X$ has either a discrete distribution or a continuous distribution with probability density function $f$. Then the probability density function of $Y_n$ is $f^{*n} = f * f * \cdots * f$, the convolution power of $f$ of order $n$.

Proof

More generally, we can use the stationary and independence properties to find the joint distributions of the partial sum process:

If $n_1 < n_2 < \cdots < n_k$ then $(Y_{n_1}, Y_{n_2}, \ldots, Y_{n_k})$ has joint probability density function
$$ f_{n_1, n_2, \ldots, n_k}(y_1, y_2, \ldots, y_k) = f^{*n_1}(y_1) \, f^{*(n_2 - n_1)}(y_2 - y_1) \cdots f^{*(n_k - n_{k-1})}(y_k - y_{k-1}), \quad (y_1, y_2, \ldots, y_k) \in \mathbb{R}^k \tag{6.4.5} $$

Proof
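The convolution-power result lends itself to direct computation for discrete distributions. The sketch below is an addition (assuming Python with numpy), using a fair die as an illustrative choice:

```python
import numpy as np

# pdf of one fair-die score on {1, ..., 6}
f = np.full(6, 1 / 6)

# Y_3 = X_1 + X_2 + X_3 has pdf f*f*f; index i of `pdf` is P(Y_3 = i + 3)
pdf = np.convolve(np.convolve(f, f), f)

support = np.arange(3, 19)          # Y_3 takes values 3, ..., 18
print(dict(zip(support, pdf.round(4))))
```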
The Central Limit Theorem

First, let's make the central limit theorem more precise. From Theorem 4, we cannot expect $Y_n$ itself to have a limiting distribution. Note that $\operatorname{var}(Y_n) \to \infty$ as $n \to \infty$ since $\sigma > 0$, and $E(Y_n) \to \infty$ as $n \to \infty$ if $\mu > 0$ while $E(Y_n) \to -\infty$ as $n \to \infty$ if $\mu < 0$. Similarly, we know that $M_n \to \mu$ as $n \to \infty$ with probability 1, so the limiting distribution of the sample mean is degenerate. Thus, to obtain a limiting distribution of $Y_n$ or $M_n$ that is not degenerate, we need to consider, not these variables themselves, but rather the common standard score. Thus, let
$$ Z_n = \frac{Y_n - n \mu}{\sqrt{n} \, \sigma} = \frac{M_n - \mu}{\sigma / \sqrt{n}} \tag{6.4.6} $$

$Z_n$ has mean 0 and variance 1.
1. $E(Z_n) = 0$
2. $\operatorname{var}(Z_n) = 1$

Proof

The precise statement of the central limit theorem is that the distribution of the standard score $Z_n$ converges to the standard normal distribution as $n \to \infty$. Recall that the standard normal distribution has probability density function
$$ \phi(z) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} z^2}, \quad z \in \mathbb{R} \tag{6.4.7} $$
and is studied in more detail in the chapter on special distributions. A special case of the central limit theorem (for Bernoulli trials) dates to Abraham De Moivre. The term central limit theorem was coined by George Pólya in 1920.

By definition of convergence in distribution, the central limit theorem states that $F_n(z) \to \Phi(z)$ as $n \to \infty$ for each $z \in \mathbb{R}$, where $F_n$ is the distribution function of $Z_n$ and $\Phi$ is the standard normal distribution function:
$$ \Phi(z) = \int_{-\infty}^z \phi(x) \, dx = \int_{-\infty}^z \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2} x^2} \, dx, \quad z \in \mathbb{R} \tag{6.4.8} $$
An equivalent statement of the central limit theorem involves convergence of the corresponding characteristic functions. This is the version that we will give and prove, but first we need a generalization of a famous limit from calculus.

Suppose that $(a_1, a_2, \ldots)$ is a sequence of real numbers and that $a_n \to a \in \mathbb{R}$ as $n \to \infty$. Then
$$ \left(1 + \frac{a_n}{n}\right)^n \to e^a \text{ as } n \to \infty \tag{6.4.9} $$

Now let $\chi$ denote the characteristic function of the standard score of the sample variable $X$, and let $\chi_n$ denote the characteristic function of the standard score $Z_n$:
$$ \chi(t) = E\left[\exp\left(i t \frac{X - \mu}{\sigma}\right)\right], \quad \chi_n(t) = E[\exp(i t Z_n)]; \quad t \in \mathbb{R} \tag{6.4.10} $$
Recall that $t \mapsto e^{-\frac{1}{2} t^2}$ is the characteristic function of the standard normal distribution. We can now give a proof.

The central limit theorem. The distribution of $Z_n$ converges to the standard normal distribution as $n \to \infty$. That is, $\chi_n(t) \to e^{-\frac{1}{2} t^2}$ as $n \to \infty$ for each $t \in \mathbb{R}$.

Proof
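The following simulation sketch (an addition, assuming Python with numpy and scipy) compares $F_n(z)$ with $\Phi(z)$; the exponential distribution and the sample size are illustrative choices, not from the text.

```python
import numpy as np
from scipy.stats import norm

# Standard score of a sum of n exponential(1) variables (mu = sigma = 1)
rng = np.random.default_rng(3)
n, reps = 100, 100_000
Y = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
Z = (Y - n) / np.sqrt(n)                     # (Y_n - n*mu) / (sqrt(n)*sigma)

for z in [-2, -1, 0, 1, 2]:
    print(z, (Z <= z).mean(), norm.cdf(z))   # F_n(z) vs Phi(z)
```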
Normal Approximations

The central limit theorem implies that if the sample size $n$ is "large" then the distribution of the partial sum $Y_n$ is approximately normal with mean $n \mu$ and variance $n \sigma^2$. Equivalently, the sample mean $M_n$ is approximately normal with mean $\mu$ and variance $\sigma^2 / n$. The central limit theorem is of fundamental importance, because it means that we can approximate the distribution of certain statistics, even if we know very little about the underlying sampling distribution.

Of course, the term "large" is relative. Roughly, the more "abnormal" the basic distribution, the larger $n$ must be for normal approximations to work well. The rule of thumb is that a sample size $n$ of at least 30 will usually suffice if the basic distribution is not too weird; although for many distributions smaller $n$ will do.

Let $Y$ denote the sum of the variables in a random sample of size 30 from the uniform distribution on $[0, 1]$. Find normal approximations to each of the following:
1. $P(13 < Y < 18)$
2. The 90th percentile of $Y$

Answer

Random variable $Y$ in the previous exercise has the Irwin-Hall distribution of order 30. The Irwin-Hall distributions are studied in more detail in the chapter on Special Distributions and are named for Joseph Irwin and Philip Hall.

In the special distribution simulator, select the Irwin-Hall distribution. Vary $n$ from 1 to 10 and note the shape of the probability density function. With $n = 10$, run the experiment 1000 times and compare the empirical density function to the true probability density function.

Let $M$ denote the sample mean of a random sample of size 50 from the distribution with probability density function $f(x) = 3 / x^4$ for $1 \le x < \infty$. This is a Pareto distribution, named for Vilfredo Pareto. Find normal approximations to each of the following:
1. $P(M > 1.6)$
2. The 60th percentile of $M$

Answer
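A sketch of both computations (an addition, assuming Python with scipy; the moments are computed from the stated densities):

```python
from math import sqrt
from scipy.stats import norm

# First exercise: Y is the sum of 30 uniform[0,1] scores, so Y is roughly
# normal with mean 30*(1/2) = 15 and variance 30*(1/12) = 2.5.
mu, sd = 15.0, sqrt(30 / 12)
print(norm.cdf(18, mu, sd) - norm.cdf(13, mu, sd))   # P(13 < Y < 18)
print(norm.ppf(0.90, mu, sd))                        # 90th percentile

# Second exercise: the Pareto density 3/x^4 has mean 3/2 and variance 3/4,
# so M is roughly normal with mean 3/2 and variance (3/4)/50.
mu, sd = 1.5, sqrt((3 / 4) / 50)
print(1 - norm.cdf(1.6, mu, sd))                     # P(M > 1.6)
print(norm.ppf(0.60, mu, sd))                        # 60th percentile
```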
The Continuity Correction

A slight technical problem arises when the sampling distribution is discrete. In this case, the partial sum also has a discrete distribution, and hence we are approximating a discrete distribution with a continuous one. Suppose that $X$ takes integer values (the most common case) and hence so does the partial sum $Y_n$. For any $k \in \mathbb{Z}$ and $h \in [0, 1)$, note that the event $\{k - h \le Y_n \le k + h\}$ is equivalent to the event $\{Y_n = k\}$. Different values of $h$ lead to different normal approximations, even though the events are equivalent. The smallest approximation would be 0 when $h = 0$, and the approximations increase as $h$ increases. It is customary to split the difference by using $h = \frac{1}{2}$ for the normal approximation. This is sometimes called the half-unit continuity correction or the histogram correction. The continuity correction is extended to other events in the natural way, using the additivity of probability.

Suppose that $j, k \in \mathbb{Z}$ with $j \le k$.
1. For the event $\{j \le Y_n \le k\} = \{j - 1 < Y_n < k + 1\}$, use $\{j - \frac{1}{2} \le Y_n \le k + \frac{1}{2}\}$ in the normal approximation.
2. For the event $\{j \le Y_n\} = \{j - 1 < Y_n\}$, use $\{j - \frac{1}{2} \le Y_n\}$ in the normal approximation.
3. For the event $\{Y_n \le k\} = \{Y_n < k + 1\}$, use $\{Y_n \le k + \frac{1}{2}\}$ in the normal approximation.

Let $Y$ denote the sum of the scores of 20 fair dice. Compute the normal approximation to $P(60 \le Y \le 75)$.

Answer

In the dice experiment, set the die distribution to fair, select the sum random variable $Y$, and set $n = 20$. Run the simulation 1000 times and find each of the following. Compare with the result in the previous exercise:
1. $P(60 \le Y \le 75)$
2. The relative frequency of the event $\{60 \le Y \le 75\}$ (from the simulation)
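A sketch of the dice computation with the half-unit correction (an addition, assuming Python with scipy; the moments of one fair-die score are $\mu = 7/2$ and $\sigma^2 = 35/12$):

```python
from math import sqrt
from scipy.stats import norm

# Y is the sum of 20 fair dice: mean 20*3.5, variance 20*35/12
mu, sd = 20 * 3.5, sqrt(20 * 35 / 12)

# P(60 <= Y <= 75) approximated via the corrected event {59.5 <= Y <= 75.5}
print(norm.cdf(75.5, mu, sd) - norm.cdf(59.5, mu, sd))
```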
Normal Approximation to the Gamma Distribution

Recall that the gamma distribution with shape parameter $k \in (0, \infty)$ and scale parameter $b \in (0, \infty)$ is a continuous distribution on $(0, \infty)$ with probability density function $f$ given by
$$ f(x) = \frac{1}{\Gamma(k) \, b^k} x^{k-1} e^{-x/b}, \quad x \in (0, \infty) \tag{6.4.14} $$
The mean is $k b$ and the variance is $k b^2$. The gamma distribution is widely used to model random times (particularly in the context of the Poisson model) and other positive random variables. The general gamma distribution is studied in more detail in the chapter on Special Distributions. In the context of the Poisson model (where $k \in \mathbb{N}_+$), the gamma distribution is also known as the Erlang distribution, named for Agner Erlang; it is studied in more detail in the chapter on the Poisson Process.

Suppose now that $Y_k$ has the gamma (Erlang) distribution with shape parameter $k \in \mathbb{N}_+$ and scale parameter $b > 0$. Then
$$ Y_k = \sum_{i=1}^k X_i \tag{6.4.15} $$
where $(X_1, X_2, \ldots)$ is a sequence of independent variables, each having the exponential distribution with scale parameter $b$. (The exponential distribution is a special case of the gamma distribution, with shape parameter 1.) It follows that if $k$ is large, the gamma distribution can be approximated by the normal distribution with mean $k b$ and variance $k b^2$. The same statement actually holds when $k$ is not an integer. Here is the precise statement:

Suppose that $Y_k$ has the gamma distribution with scale parameter $b \in (0, \infty)$ and shape parameter $k \in (0, \infty)$. Then the distribution of the standardized variable $Z_k$ below converges to the standard normal distribution as $k \to \infty$:
$$ Z_k = \frac{Y_k - k b}{\sqrt{k} \, b} \tag{6.4.16} $$

In the special distribution simulator, select the gamma distribution. Vary $k$ and $b$ and note the shape of the probability density function. With $k = 10$ and various values of $b$, run the experiment 1000 times and compare the empirical density function to the true probability density function.

Suppose that $Y$ has the gamma distribution with shape parameter $k = 10$ and scale parameter $b = 2$. Find normal approximations to each of the following:
1. $P(18 \le Y \le 23)$
2. The 80th percentile of $Y$

Answer
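A sketch comparing the exact gamma probabilities with the normal approximation for this exercise (an addition, assuming Python with scipy):

```python
from math import sqrt
from scipy.stats import gamma, norm

# Gamma with shape k = 10 and scale b = 2, approximated by the normal
# distribution with mean k*b and variance k*b^2
k, b = 10, 2
mu, sd = k * b, sqrt(k) * b

print(gamma.cdf(23, k, scale=b) - gamma.cdf(18, k, scale=b))   # exact
print(norm.cdf(23, mu, sd) - norm.cdf(18, mu, sd))             # approximation
print(gamma.ppf(0.80, k, scale=b), norm.ppf(0.80, mu, sd))     # 80th percentiles
```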
Normal Approximation to the Chi-Square Distribution

Recall that the chi-square distribution with $n \in (0, \infty)$ degrees of freedom is a special case of the gamma distribution, with shape parameter $k = n/2$ and scale parameter $b = 2$. Thus, the chi-square distribution with $n$ degrees of freedom has probability density function
$$ f(x) = \frac{1}{\Gamma(n/2) \, 2^{n/2}} x^{n/2 - 1} e^{-x/2}, \quad 0 < x < \infty $$

6.5: The Sample Variance

Linear transformations of the form $y = a + b x$ with $b > 0$ arise frequently when physical units are changed. In this case, the transformation is often called a location-scale transformation; $a$ is the location parameter and $b$ is the scale parameter. For example, if $x$ is the length of an object in inches, then $y = 2.54 x$ is the length of the object in centimeters. If $x$ is the temperature of an object in degrees Fahrenheit, then $y = \frac{5}{9}(x - 32)$ is the temperature of the object in degrees Celsius.
Now, for $i \in \{1, 2, \ldots, n\}$, let $z_i = (x_i - m) / s$. The number $z_i$ is the standard score associated with $x_i$. Note that since $x_i$, $m$, and $s$ have the same physical units, the standard score $z_i$ is dimensionless (that is, has no physical units); it measures the directed distance from the mean $m$ to the data value $x_i$ in standard deviations.

The sample of standard scores $z = (z_1, z_2, \ldots, z_n)$ has mean 0 and variance 1. That is,
1. $m(z) = 0$
2. $s^2(z) = 1$

Proof

Approximating the Variance
Suppose that instead of the actual data $x$, we have a frequency distribution corresponding to a partition with classes (intervals) $(A_1, A_2, \ldots, A_k)$, class marks (midpoints of the intervals) $(t_1, t_2, \ldots, t_k)$, and frequencies $(n_1, n_2, \ldots, n_k)$. Recall that the relative frequency of class $A_j$ is $p_j = n_j / n$. In this case, approximate values of the sample mean and variance are, respectively,
$$ m = \frac{1}{n} \sum_{j=1}^k n_j t_j = \sum_{j=1}^k p_j t_j \tag{6.5.18} $$
$$ s^2 = \frac{1}{n-1} \sum_{j=1}^k n_j (t_j - m)^2 = \frac{n}{n-1} \sum_{j=1}^k p_j (t_j - m)^2 \tag{6.5.19} $$
These approximations are based on the hope that the data values in each class are well represented by the class mark. In fact, these are the standard definitions of sample mean and variance for the data set in which $t_j$ occurs $n_j$ times for each $j$.
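As a sketch (an addition, assuming Python with numpy), here is the grouped-data computation, using the class marks and frequencies from the commuting-distance exercise later in this section:

```python
import numpy as np

t = np.array([1.0, 4.0, 8.0, 15.0])     # class marks (midpoints)
freq = np.array([6, 16, 18, 10])        # frequencies n_j
n = freq.sum()

m = (freq * t).sum() / n                        # approximate sample mean
s2 = (freq * (t - m) ** 2).sum() / (n - 1)      # approximate sample variance
print(m, s2, np.sqrt(s2))
```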
Inferential Statistics

We continue our discussion of the sample variance, but now we assume that the variables are random. Thus, suppose that we have a basic random experiment, and that $X$ is a real-valued random variable for the experiment with mean $\mu$ and standard deviation $\sigma$. We will need some higher order moments as well. Let $\sigma_3 = E[(X - \mu)^3]$ and $\sigma_4 = E[(X - \mu)^4]$ denote the 3rd and 4th moments about the mean. Recall that $\sigma_3 / \sigma^3 = \operatorname{skew}(X)$, the skewness of $X$, and $\sigma_4 / \sigma^4 = \operatorname{kurt}(X)$, the kurtosis of $X$. We assume that $\sigma_4 < \infty$.

We repeat the basic experiment $n$ times to form a new, compound experiment, with a sequence of independent random variables $X = (X_1, X_2, \ldots, X_n)$, each with the same distribution as $X$. In statistical terms, $X$ is a random sample of size $n$ from the distribution of $X$. All of the statistics above make sense for $X$, of course, but now these statistics are random variables. We will use the same notation, except for the usual convention of denoting random variables by capital letters. Finally, note that the deterministic properties and relations established above still hold.

In addition to being a measure of the center of the data $X$, the sample mean
$$ M = \frac{1}{n} \sum_{i=1}^n X_i \tag{6.5.20} $$
is a natural estimator of the distribution mean $\mu$. In this section, we will derive statistics that are natural estimators of the distribution variance $\sigma^2$. The statistics that we will derive are different, depending on whether $\mu$ is known or unknown; for this reason, $\mu$ is referred to as a nuisance parameter for the problem of estimating $\sigma^2$.
A Special Sample Variance
First we will assume that $\mu$ is known. Although this is almost always an artificial assumption, it is a nice place to start because the analysis is relatively easy and will give us insight for the standard case. A natural estimator of $\sigma^2$ is the following statistic, which we will refer to as the special sample variance:
$$ W^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mu)^2 \tag{6.5.21} $$

$W^2$ is the sample mean for a random sample of size $n$ from the distribution of $(X - \mu)^2$, and satisfies the following properties:
1. $E(W^2) = \sigma^2$
2. $\operatorname{var}(W^2) = \frac{1}{n}(\sigma_4 - \sigma^4)$
3. $W^2 \to \sigma^2$ as $n \to \infty$ with probability 1
4. The distribution of $\sqrt{n}\,(W^2 - \sigma^2) / \sqrt{\sigma_4 - \sigma^4}$ converges to the standard normal distribution as $n \to \infty$.

Proof

In particular, part (a) means that $W^2$ is an unbiased estimator of $\sigma^2$. From part (b), note that $\operatorname{var}(W^2) \to 0$ as $n \to \infty$; this means that $W^2$ is a consistent estimator of $\sigma^2$. The square root of the special sample variance is a special version of the sample standard deviation, denoted $W$.

$E(W) \le \sigma$. Thus, $W$ is a negatively biased estimator that tends to underestimate $\sigma$.

Proof

Next we compute the covariance and correlation between the sample mean and the special sample variance.

The covariance and correlation of $M$ and $W^2$ are
1. $\operatorname{cov}(M, W^2) = \sigma_3 / n$
2. $\operatorname{cor}(M, W^2) = \sigma_3 \big/ \sqrt{\sigma^2 (\sigma_4 - \sigma^4)}$

Proof

Note that the correlation does not depend on the sample size, and that the sample mean and the special sample variance are uncorrelated if $\sigma_3 = 0$ (equivalently $\operatorname{skew}(X) = 0$).
The Standard Sample Variance

Consider now the more realistic case in which $\mu$ is unknown. In this case, a natural approach is to average, in some sense, the squared deviations $(X_i - M)^2$ over $i \in \{1, 2, \ldots, n\}$. It might seem that we should average by dividing by $n$. However, another approach is to divide by whatever constant would give us an unbiased estimator of $\sigma^2$. This constant turns out to be $n - 1$, leading to the standard sample variance:
$$ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - M)^2 \tag{6.5.24} $$

$E(S^2) = \sigma^2$.

Proof

Of course, the square root of the sample variance is the sample standard deviation, denoted $S$.

$E(S) \le \sigma$. Thus, $S$ is a negatively biased estimator that tends to underestimate $\sigma$.

Proof

$S^2 \to \sigma^2$ as $n \to \infty$ with probability 1.

Proof

Since $S^2$ is an unbiased estimator of $\sigma^2$, the variance of $S^2$ is the mean square error, a measure of the quality of the estimator.

$\operatorname{var}(S^2) = \frac{1}{n}\left(\sigma_4 - \frac{n-3}{n-1} \sigma^4\right)$.

Proof

Note that $\operatorname{var}(S^2) \to 0$ as $n \to \infty$, and hence $S^2$ is a consistent estimator of $\sigma^2$. On the other hand, it's not surprising that the variance of the standard sample variance (where we assume that $\mu$ is unknown) is greater than the variance of the special sample variance (in which we assume $\mu$ is known).

$\operatorname{var}(S^2) > \operatorname{var}(W^2)$.

Proof

Next we compute the covariance between the sample mean and the sample variance.

The covariance and correlation between the sample mean and sample variance are
1. $\operatorname{cov}(M, S^2) = \sigma_3 / n$
2. $\operatorname{cor}(M, S^2) = \sigma_3 \big/ \left(\sigma \sqrt{\sigma_4 - \sigma^4 (n-3)/(n-1)}\right)$

Proof

In particular, note that $\operatorname{cov}(M, S^2) = \operatorname{cov}(M, W^2)$. Again, the sample mean and variance are uncorrelated if $\sigma_3 = 0$, so that $\operatorname{skew}(X) = 0$. Our last result gives the covariance and correlation between the special sample variance and the standard one. Curiously, the covariance is the same as the variance of the special sample variance.

The covariance and correlation between $W^2$ and $S^2$ are
1. $\operatorname{cov}(W^2, S^2) = (\sigma_4 - \sigma^4)/n$
2. $\operatorname{cor}(W^2, S^2) = \sqrt{\dfrac{\sigma_4 - \sigma^4}{\sigma_4 - \sigma^4 (n-3)/(n-1)}}$

Proof

Note that $\operatorname{cor}(W^2, S^2) \to 1$ as $n \to \infty$, which is not surprising since with probability 1, $S^2 \to \sigma^2$ and $W^2 \to \sigma^2$ as $n \to \infty$.

A particularly important special case occurs when the sampling distribution is normal. This case is explored in the section on Special Properties of Normal Samples.
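A simulation sketch of these results (an addition, assuming Python with numpy); the exponential distribution and the sample size are illustrative choices:

```python
import numpy as np

# Exponential(1) samples: mu = 1, sigma^2 = 1
rng = np.random.default_rng(4)
n, reps = 5, 200_000
X = rng.exponential(scale=1.0, size=(reps, n))

W2 = ((X - 1.0) ** 2).mean(axis=1)   # special sample variance (mu known)
S2 = X.var(axis=1, ddof=1)           # standard sample variance (divide by n-1)

print(W2.mean(), S2.mean())          # both are approximately sigma^2 = 1
print(W2.var(), S2.var())            # var(S^2) > var(W^2), as the theory says
```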
Exercises

Basic Properties

Suppose that $x$ is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of operation. A sample of 30 components has mean 113° and standard deviation 18°.
1. Classify $x$ by type and level of measurement.
2. Find the sample mean and standard deviation if the temperature is converted to degrees Celsius. The transformation is $y = \frac{5}{9}(x - 32)$.

Answer

Suppose that $x$ is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has mean 10.0 and standard deviation 2.0.
1. Classify $x$ by type and level of measurement.
2. Find the sample mean if length is measured in centimeters. The transformation is $y = 2.54 x$.

Answer

Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). The mean grade on the first midterm exam was 64 (out of a possible 100 points) and the standard deviation was 16. Professor Moriarity thinks the grades are a bit low and is considering various transformations for increasing the grades. In each case below give the mean and standard deviation of the transformed grades, or state that there is not enough information.
1. Add 10 points to each grade, so the transformation is $y = x + 10$.
2. Multiply each grade by 1.2, so the transformation is $z = 1.2 x$.
3. Use the transformation $w = 10 \sqrt{x}$. Note that this is a non-linear transformation that curves the grades greatly at the low end and very little at the high end. For example, a grade of 100 is still 100, but a grade of 36 is transformed to 60.

One of the students did not study at all, and received a 10 on the midterm. Professor Moriarity considers this score to be an outlier.
4. Find the mean and standard deviation if this score is omitted.

Answer

Computational Exercises
All statistical software packages will compute means, variances and standard deviations, draw dotplots and histograms, and in general perform the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets, the use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the graphs with minimal technological aids.

Suppose that $x$ is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data $x = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2)$.
1. Classify $x$ by type and level of measurement.
2. Sketch the dotplot.
3. Construct a table with rows corresponding to cases and columns corresponding to $i$, $x_i$, $x_i - m$, and $(x_i - m)^2$. Add rows at the bottom for totals and means.

Answer

Suppose that a sample of size 12 from a discrete variable $x$ has empirical density function given by $f(-2) = 1/12$, $f(-1) = 1/4$, $f(0) = 1/3$, $f(1) = 1/6$, $f(2) = 1/6$.
1. Sketch the graph of $f$.
2. Compute the sample mean and variance.
3. Give the sample values, ordered from smallest to largest.

Answer
The following table gives a frequency distribution for the commuting distance to the math/stat building (in miles) for a sample of ESU students.

Class     | Freq | Rel Freq | Density | Cum Freq | Cum Rel Freq | Midpoint
(0, 2]    | 6    |          |         |          |              |
(2, 6]    | 16   |          |         |          |              |
(6, 10]   | 18   |          |         |          |              |
(10, 20]  | 10   |          |         |          |              |
Total     |      |          |         |          |              |

1. Complete the table.
2. Sketch the density histogram.
3. Sketch the cumulative relative frequency ogive.
4. Compute an approximation to the mean and standard deviation.

Answer

Error Function Exercises
In the error function app, select root mean square error. As you add points, note the shape of the graph of the error function, the value that minimizes the function, and the minimum value of the function.

In the error function app, select mean absolute error. As you add points, note the shape of the graph of the error function, the values that minimize the function, and the minimum value of the function.

Suppose that our data vector is $(2, 1, 5, 7)$. Explicitly give mae as a piecewise function and sketch its graph. Note that
1. All values of $a \in [2, 5]$ minimize mae.
2. mae is not differentiable at $a \in \{1, 2, 5, 7\}$.

Suppose that our data vector is $(3, 5, 1)$. Explicitly give mae as a piecewise function and sketch its graph. Note that
1. mae is minimized at $a = 3$.
2. mae is not differentiable at $a \in \{1, 3, 5\}$.

Simulation Exercises
Many of the apps in this project are simulations of experiments with a basic random variable of interest. When you run the simulation, you are performing independent replications of the experiment. In most cases, the app displays the standard deviation of the distribution, both numerically in a table and graphically as the radius of the blue, horizontal bar in the graph box. When you run the simulation, the sample standard deviation is also displayed numerically in the table and graphically as the radius of the red horizontal bar in the graph box.

In the binomial coin experiment, the random variable is the number of heads. For various values of the parameters $n$ (the number of coins) and $p$ (the probability of heads), run the simulation 1000 times and compare the sample standard deviation to the distribution standard deviation.

In the simulation of the matching experiment, the random variable is the number of matches. For selected values of $n$ (the number of balls), run the simulation 1000 times and compare the sample standard deviation to the distribution standard deviation.

Run the simulation of the gamma experiment 1000 times for various values of the rate parameter $r$ and the shape parameter $k$. Compare the sample standard deviation to the distribution standard deviation.
Probability Exercises
Suppose that $X$ has probability density function $f(x) = 12 x^2 (1 - x)$ for $0 \le x \le 1$. The distribution of $X$ is a member of the beta family. Compute each of the following:
1. $\mu = E(X)$
2. $\sigma^2 = \operatorname{var}(X)$
3. $d_3 = E[(X - \mu)^3]$
4. $d_4 = E[(X - \mu)^4]$

Answer

Suppose now that $(X_1, X_2, \ldots, X_{10})$ is a random sample of size 10 from the beta distribution in the previous problem. Find each of the following:
1. $E(M)$
2. $\operatorname{var}(M)$
3. $E(W^2)$
4. $\operatorname{var}(W^2)$
5. $E(S^2)$
6. $\operatorname{var}(S^2)$
7. $\operatorname{cov}(M, W^2)$
8. $\operatorname{cov}(M, S^2)$
9. $\operatorname{cov}(W^2, S^2)$

Answer

Suppose that $X$ has probability density function $f(x) = \lambda e^{-\lambda x}$ for $0 \le x < \infty$, where $\lambda > 0$ is a parameter. Thus $X$ has the exponential distribution with rate parameter $\lambda$. Compute each of the following:
1. $\mu = E(X)$
2. $\sigma^2 = \operatorname{var}(X)$
3. $d_3 = E[(X - \mu)^3]$
4. $d_4 = E[(X - \mu)^4]$

Answer

Suppose now that $(X_1, X_2, \ldots, X_5)$ is a random sample of size 5 from the exponential distribution in the previous problem. Find each of the following:
1. $E(M)$
2. $\operatorname{var}(M)$
3. $E(W^2)$
4. $\operatorname{var}(W^2)$
5. $E(S^2)$
6. $\operatorname{var}(S^2)$
7. $\operatorname{cov}(M, W^2)$
8. $\operatorname{cov}(M, S^2)$
9. $\operatorname{cov}(W^2, S^2)$

Answer

Recall that for an ace-six flat die, faces 1 and 6 have probability $\frac{1}{4}$ each, while faces 2, 3, 4, and 5 have probability $\frac{1}{8}$ each. Let $X$ denote the score when an ace-six flat die is thrown. Compute each of the following:
1. $\mu = E(X)$
2. $\sigma^2 = \operatorname{var}(X)$
3. $d_3 = E[(X - \mu)^3]$
4. $d_4 = E[(X - \mu)^4]$

Answer

Suppose now that an ace-six flat die is tossed 8 times. Find each of the following:
1. $E(M)$
2. $\operatorname{var}(M)$
3. $E(W^2)$
4. $\operatorname{var}(W^2)$
5. $E(S^2)$
6. $\operatorname{var}(S^2)$
7. $\operatorname{cov}(M, W^2)$
8. $\operatorname{cov}(M, S^2)$
9. $\operatorname{cov}(W^2, S^2)$

Answer

Data Analysis Exercises
Statistical software should be used for the problems in this subsection.

Consider the petal length and species variables in Fisher's iris data.
1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation, and plot a density histogram for petal length.
3. Compute the sample mean and standard deviation, and plot a density histogram for petal length by species.

Answers

Consider the erosion variable in the Challenger data set.
1. Classify the variable by type and level of measurement.
2. Compute the mean and standard deviation.
3. Plot a density histogram with the classes $[0, 5)$, $[5, 40)$, $[40, 50)$, $[50, 60)$.

Answer

Consider Michelson's velocity of light data.
1. Classify the variable by type and level of measurement.
2. Plot a density histogram.
3. Compute the sample mean and standard deviation.
4. Find the sample mean and standard deviation if the variable is converted to km/hr. The transformation is $y = x + 299\,000$.

Answer

Consider Short's parallax of the sun data.
1. Classify the variable by type and level of measurement.
2. Plot a density histogram.
3. Compute the sample mean and standard deviation.
4. Find the sample mean and standard deviation if the variable is converted to degrees. There are 3600 seconds in a degree.
5. Find the sample mean and standard deviation if the variable is converted to radians. There are $\pi / 180$ radians in a degree.

Answer

Consider Cavendish's density of the earth data.
1. Classify the variable by type and level of measurement.
2. Compute the sample mean and standard deviation.
3. Plot a density histogram.

Answer

Consider the M&M data.
1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation for each color count variable.
3. Compute the sample mean and standard deviation for the total number of candies.
4. Plot a relative frequency histogram for the total number of candies.
5. Compute the sample mean and standard deviation, and plot a density histogram for the net weight.

Answer

Consider the body weight, species, and gender variables in the Cicada data.
1. Classify the variables by type and level of measurement.
2. Compute the relative frequency function for species and plot the graph.
3. Compute the relative frequency function for gender and plot the graph.
4. Compute the sample mean and standard deviation, and plot a density histogram for body weight.
5. Compute the sample mean and standard deviation, and plot a density histogram for body weight by species.
6. Compute the sample mean and standard deviation, and plot a density histogram for body weight by gender.

Answer

Consider Pearson's height data.
1. Classify the variables by type and level of measurement.
2. Compute the sample mean and standard deviation, and plot a density histogram for the height of the father.
3. Compute the sample mean and standard deviation, and plot a density histogram for the height of the son.

Answer

This page titled 6.5: The Sample Variance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
6.6: Order Statistics

Descriptive Theory

Recall again the basic model of statistics: we have a population of objects of interest, and we have various measurements (variables) that we make on these objects. We select objects from the population and record the variables for the objects in the sample; these become our data. Our first discussion is from a purely descriptive point of view. That is, we do not assume that the data are generated by an underlying probability distribution. But as always, remember that the data themselves define a probability distribution, namely the empirical distribution.

Order Statistics

Suppose that $x$ is a real-valued variable for a population and that $x = (x_1, x_2, \ldots, x_n)$ are the observed values of a sample of size $n$ corresponding to this variable. The order statistic of rank $k$ is the $k$th smallest value in the data set, and is usually denoted $x_{(k)}$. To emphasize the dependence on the sample size, another common notation is $x_{n:k}$. Thus,
$$ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n-1)} \le x_{(n)} \tag{6.6.1} $$
Naturally, the underlying variable $x$ should be at least at the ordinal level of measurement. The order statistics have the same physical units as $x$. One of the first steps in exploratory data analysis is to order the data, so order statistics occur naturally. In particular, note that the extreme order statistics are
$$ x_{(1)} = \min\{x_1, x_2, \ldots, x_n\}, \quad x_{(n)} = \max\{x_1, x_2, \ldots, x_n\} \tag{6.6.2} $$
The sample range is $r = x_{(n)} - x_{(1)}$ and the sample midrange is $\frac{r}{2} = \frac{1}{2}\left[x_{(n)} - x_{(1)}\right]$. These statistics have the same physical units as $x$ and are measures of the dispersion of the data set.

The Sample Median

If $n$ is odd, the sample median is the middle of the ordered observations, namely $x_{(k)}$ where $k = \frac{n+1}{2}$. If $n$ is even, there is not a single middle observation, but rather two middle observations. Thus, the median interval is $[x_{(k)}, x_{(k+1)}]$ where $k = \frac{n}{2}$. In this case, the sample median is defined to be the midpoint of the median interval, namely $\frac{1}{2}\left[x_{(k)} + x_{(k+1)}\right]$ where $k = \frac{n}{2}$. In a sense, this definition is a bit arbitrary because there is no compelling reason to prefer one point in the median interval over another. For more on this issue, see the discussion of error functions in the section on Sample Variance. In any event, the sample median is a natural statistic that gives a measure of the center of the data set.

Sample Quantiles

We can generalize the sample median discussed above to other sample quantiles. Thus, suppose that $p \in [0, 1]$. Our goal is to find the value that is the fraction $p$ of the way through the (ordered) data set. We define the rank of the value that we are looking for as $(n - 1) p + 1$. Note that the rank is a linear function of $p$, and that the rank is 1 when $p = 0$ and $n$ when $p = 1$. But of course, the rank will not be an integer in general, so we let $k = \lfloor (n - 1) p + 1 \rfloor$, the integer part of the desired rank, and we let $t = [(n - 1) p + 1] - k$, the fractional part of the desired rank. Thus, $(n - 1) p + 1 = k + t$ where $k \in \{1, 2, \ldots, n\}$ and $t \in [0, 1)$. So, using linear interpolation, we define the sample quantile of order $p$ to be
$$ x[p] = x_{(k)} + t \left[x_{(k+1)} - x_{(k)}\right] = (1 - t) x_{(k)} + t \, x_{(k+1)} \tag{6.6.3} $$
Sample quantiles have the same physical units as the underlying variable $x$. The algorithm really does generalize the results for sample medians.

The sample quantile of order $p = \frac{1}{2}$ is the median as defined earlier, in both cases where $n$ is odd and where $n$ is even.

The sample quantile of order $\frac{1}{4}$ is known as the first quartile and is frequently denoted $q_1$. The sample quantile of order $\frac{3}{4}$ is known as the third quartile and is frequently denoted $q_3$. The sample median, which is the quantile of order $\frac{1}{2}$, is sometimes denoted $q_2$. The interquartile range is defined to be $\operatorname{iqr} = q_3 - q_1$. Note that $\operatorname{iqr}$ is a statistic that measures the spread of the distribution about the median, but of course this number gives less information than the interval $[q_1, q_3]$.

The statistic $q_1 - \frac{3}{2} \operatorname{iqr}$ is called the lower fence and the statistic $q_3 + \frac{3}{2} \operatorname{iqr}$ is called the upper fence. Sometimes lower limit and upper limit are used instead of lower fence and upper fence. Values in the data set that are below the lower fence or above the upper fence are potential outliers, that is, values that don't seem to fit the overall pattern of the data. An outlier can be due to a measurement error, or may be a valid but rather extreme value. In any event, outliers usually deserve additional study.

The five statistics $(x_{(1)}, q_1, q_2, q_3, x_{(n)})$ are often referred to as the five-number summary. Together, these statistics give a great deal of information about the data set in terms of the center, spread, and skewness. The five numbers roughly separate the data set into four intervals, each of which contains approximately 25% of the data. Graphically, the five numbers, and the outliers, are often displayed as a boxplot, sometimes called a box and whisker plot. A boxplot consists of an axis that extends across the range of the data. A line is drawn from the smallest value that is not an outlier (of course, this may be the minimum $x_{(1)}$) to the largest value that is not an outlier (of course, this may be the maximum $x_{(n)}$). Vertical marks ("whiskers") are drawn at the ends of this line. A rectangular box extends from the first quartile $q_1$ to the third quartile $q_3$, with an additional whisker at the median $q_2$. Finally, the outliers are denoted as points (beyond the extreme whiskers). All statistical packages will compute the quartiles and most will draw boxplots. The picture below shows a boxplot with 3 outliers.

Figure 6.6.1: Boxplot

Alternate Definitions
The algorithm given above is not the only reasonable way to define sample quantiles, and indeed there are lots of alternatives. One natural method would be to first compute the empirical distribution function
$$ F(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(x_i \le x), \quad x \in \mathbb{R} \tag{6.6.4} $$
Recall that $F$ has the mathematical properties of a distribution function, and in fact $F$ is the distribution function of the empirical distribution of the data. Recall that this is the distribution that places probability $\frac{1}{n}$ at each data value $x_i$ (so this is the discrete uniform distribution on $\{x_1, x_2, \ldots, x_n\}$ if the data values are distinct). Thus, $F(x) = \frac{k}{n}$ for $x \in [x_{(k)}, x_{(k+1)})$. Then, we could define the quantile function to be the inverse of the distribution function, as we usually do for probability distributions:
$$ F^{-1}(p) = \min\{x \in \mathbb{R} : F(x) \ge p\}, \quad p \in (0, 1) \tag{6.6.5} $$
It's easy to see that with this definition, the quantile of order $p \in (0, 1)$ is simply $x_{(k)}$ where $k = \lceil n p \rceil$.

Another method is to compute the rank of the quantile of order $p \in (0, 1)$ as $(n + 1) p$, rather than $(n - 1) p + 1$, and then use linear interpolation just as we have done. To understand the reasoning behind this method, suppose that the underlying variable $x$ takes values in an interval $(a, b)$. Then the $n$ points in the data set $x$ separate this interval into $n + 1$ subintervals, so it's reasonable to think of $x_{(k)}$ as the quantile of order $\frac{k}{n+1}$. This method also reduces to the standard calculation for the median when $p = \frac{1}{2}$. However, the method will fail if $p$ is so small that $(n + 1) p < 1$ or so large that $(n + 1) p > n$.

The primary definition that we give above is the one that is most commonly used in statistical software and spreadsheets. Moreover, when the sample size $n$ is large, it doesn't matter very much which of these competing quantile definitions is used. All will give similar results.
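The competing definitions are easy to compare numerically. The sketch below is an addition, assuming Python with a recent numpy (the `method` argument of `numpy.quantile`); the data vector is an arbitrary illustrative choice.

```python
import numpy as np

# The text's primary definition (rank (n-1)p + 1 with linear interpolation)
# is numpy's default 'linear' method; 'inverted_cdf' inverts the empirical
# distribution function; 'weibull' uses the rank (n+1)p.
x = np.array([2, 1, 5, 7, 3])

for p in [0.25, 0.5, 0.75]:
    print(p,
          np.quantile(x, p, method='linear'),
          np.quantile(x, p, method='inverted_cdf'),
          np.quantile(x, p, method='weibull'))
```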
Transformations

Suppose again that $x = (x_1, x_2, \ldots, x_n)$ is a sample of size $n$ from a population variable $x$, but now suppose also that $y = a + b x$ is a new variable, where $a \in \mathbb{R}$ and $b \in (0, \infty)$. Recall that transformations of this type are location-scale transformations and often correspond to changes in units. For example, if $x$ is the length of an object in inches, then $y = 2.54 x$ is the length of the object in centimeters. If $x$ is the temperature of an object in degrees Fahrenheit, then $y = \frac{5}{9}(x - 32)$ is the temperature of the object in degrees Celsius. Let $y = a + b x$ denote the sample from the variable $y$.

Order statistics and quantiles are preserved under location-scale transformations:
1. $y_{(i)} = a + b \, x_{(i)}$ for $i \in \{1, 2, \ldots, n\}$
2. $y[p] = a + b \, x[p]$ for $p \in [0, 1]$

Proof

Like standard deviation (our most important measure of spread), range and interquartile range are not affected by the location parameter, but are scaled by the scale parameter.

The range and interquartile range of $y$ are
1. $r(y) = b \, r(x)$
2. $\operatorname{iqr}(y) = b \, \operatorname{iqr}(x)$

Proof

More generally, suppose $y = g(x)$ where $g$ is a strictly increasing real-valued function on the set of possible values of $x$. Let $y = (g(x_1), g(x_2), \ldots, g(x_n))$ denote the sample corresponding to the variable $y$. Then (as in the proof of Theorem 2), the order statistics are preserved, so $y_{(i)} = g(x_{(i)})$. However, if $g$ is nonlinear, the quantiles are not preserved (because the quantiles involve linear interpolation). That is, $y[p]$ and $g(x[p])$ are not usually the same. When $g$ is convex or concave we can at least give an inequality for the sample quantiles.

Suppose that $y = g(x)$ where $g$ is strictly increasing. Then
1. $y_{(i)} = g(x_{(i)})$ for $i \in \{1, 2, \ldots, n\}$
2. If $g$ is convex then $y[p] \ge g(x[p])$ for $p \in [0, 1]$
3. If $g$ is concave then $y[p] \le g(x[p])$ for $p \in [0, 1]$

Proof

Stem and Leaf Plots
A stem and leaf plot is a graphical display of the order statistics $(x_{(1)}, x_{(2)}, \ldots, x_{(n)})$. It has the benefit of showing the data in a graphical way, like a histogram, and at the same time, preserving the ordered data. First we assume that the data have a fixed number format: a fixed number of digits, then perhaps a decimal point and another fixed number of digits. A stem and leaf plot is constructed by using an initial part of this string as the stem, and the remaining parts as the leaves. There are lots of variations in how to do this, so rather than give an exhaustive, complicated definition, we will just look at a couple of examples in the exercises below.

Probability Theory

We continue our discussion of order statistics, except that now we assume that the variables are random variables. Specifically, suppose that we have a basic random experiment, and that $X$ is a real-valued random variable for the experiment with distribution function $F$. We perform $n$ independent replications of the basic experiment to generate a random sample $X = (X_1, X_2, \ldots, X_n)$ of size $n$ from the distribution of $X$. Recall that this is a sequence of independent random variables, each with the distribution of $X$. All of the statistics defined in the previous section make sense, but now of course, they are random variables. We use the notation established previously, except that we follow our usual convention of denoting random variables with capital letters. Thus, for $k \in \{1, 2, \ldots, n\}$, $X_{(k)}$ is the $k$th order statistic, that is, the $k$th smallest of $(X_1, X_2, \ldots, X_n)$. Our interest now is on the distribution of the order statistics and statistics derived from them.
Distribution of the kth order statistic
Finding the distribution function of an order statistic is a nice application of Bernoulli trials and the binomial distribution.

The distribution function $F_k$ of $X_{(k)}$ is given by
$$ F_k(x) = \sum_{j=k}^n \binom{n}{j} [F(x)]^j [1 - F(x)]^{n-j}, \quad x \in \mathbb{R} \tag{6.6.8} $$

Proof

As always, the extreme order statistics are particularly interesting.

The distribution functions $F_1$ of $X_{(1)}$ and $F_n$ of $X_{(n)}$ are given by
1. $F_1(x) = 1 - [1 - F(x)]^n$ for $x \in \mathbb{R}$
2. $F_n(x) = [F(x)]^n$ for $x \in \mathbb{R}$

The quantile functions $F_1^{-1}$ and $F_n^{-1}$ of $X_{(1)}$ and $X_{(n)}$ are given by
1. $F_1^{-1}(p) = F^{-1}\left[1 - (1 - p)^{1/n}\right]$ for $p \in (0, 1)$
2. $F_n^{-1}(p) = F^{-1}\left(p^{1/n}\right)$ for $p \in (0, 1)$

Proof

When the underlying distribution is continuous, we can give a simple formula for the probability density function of an order statistic.

Suppose now that $X$ has a continuous distribution with probability density function $f$. Then $X_{(k)}$ has a continuous distribution with probability density function $f_k$ given by
$$ f_k(x) = \frac{n!}{(k-1)!(n-k)!} [F(x)]^{k-1} [1 - F(x)]^{n-k} f(x), \quad x \in \mathbb{R} \tag{6.6.11} $$

Proof
Heuristic Proof

Here are the special cases for the extreme order statistics.

The probability density functions $f_1$ of $X_{(1)}$ and $f_n$ of $X_{(n)}$ are given by
1. $f_1(x) = n [1 - F(x)]^{n-1} f(x)$ for $x \in \mathbb{R}$
2. $f_n(x) = n [F(x)]^{n-1} f(x)$ for $x \in \mathbb{R}$
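A simulation sketch of formula (6.6.11) (an addition, assuming Python with numpy and scipy); the standard exponential distribution, $n = 5$, and $k = 2$ are illustrative choices:

```python
from math import factorial
import numpy as np
from scipy.stats import expon

# Simulated density of the order statistic X_(2) in samples of size n = 5
# from the standard exponential distribution, against formula (6.6.11)
rng = np.random.default_rng(5)
n, k, reps = 5, 2, 200_000
Xk = np.sort(rng.exponential(size=(reps, n)), axis=1)[:, k - 1]

x = 0.5
F, f = expon.cdf(x), expon.pdf(x)
fk = factorial(n) / (factorial(k - 1) * factorial(n - k)) \
     * F**(k - 1) * (1 - F)**(n - k) * f

h = 0.05                                        # small window around x
print((np.abs(Xk - x) < h / 2).mean() / h, fk)  # simulated vs exact density
```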
Joint Distributions
We assume again that $X$ has a continuous distribution with distribution function $F$ and probability density function $f$.

Suppose that $j, k \in \{1, 2, \ldots, n\}$ with $j < k$. The joint probability density function $f_{j,k}$ of $(X_{(j)}, X_{(k)})$ is given by
$$ f_{j,k}(x, y) = \frac{n!}{(j-1)!(k-j-1)!(n-k)!} [F(x)]^{j-1} [F(y) - F(x)]^{k-j-1} [1 - F(y)]^{n-k} f(x) f(y); \quad x, y \in \mathbb{R}, \; x < y \tag{6.6.17} $$

Heuristic Proof

From the joint distribution of two order statistics we can, in principle, find the distribution of various other statistics: the sample range $R$; sample quantiles $X[p]$ for $p \in [0, 1]$, and in particular the sample quartiles $Q_1$, $Q_2$, $Q_3$; and the interquartile range $\operatorname{IQR}$. The joint distribution of the extreme order statistics $(X_{(1)}, X_{(n)})$ is a particularly important case.

The joint probability density function $f_{1,n}$ of $(X_{(1)}, X_{(n)})$ is given by
$$ f_{1,n}(x, y) = n(n-1) [F(y) - F(x)]^{n-2} f(x) f(y); \quad x, y \in \mathbb{R}, \; x < y \tag{6.6.20} $$

Proof

Arguments similar to the one above can be used to obtain the joint probability density function of any number of the order statistics. Of course, we are particularly interested in the joint probability density function of all of the order statistics. It turns out that this density function has a remarkably simple form.

$(X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ has joint probability density function $g$ given by
$$ g(x_1, x_2, \ldots, x_n) = n! \, f(x_1) f(x_2) \cdots f(x_n), \quad x_1 < x_2 < \cdots < x_n \tag{6.6.21} $$

Proof
Heuristic Proof

Probability Plots
A probability plot, also called a quantile-quantile plot or a Q-Q plot for short, is an informal, graphical test to determine if observed data come from a specified distribution. Thus, suppose that we observe real-valued data $(x_1, x_2, \ldots, x_n)$ from a random sample of size $n$. We are interested in the question of whether the data could reasonably have come from a continuous distribution with distribution function $F$. First, we order the data from smallest to largest; this gives us the sequence of observed values of the order statistics: $(x_{(1)}, x_{(2)}, \ldots, x_{(n)})$.

Note that we can view $x_{(i)}$ as the sample quantile of order $\frac{i}{n+1}$. Of course, by definition, the distribution quantile of order $\frac{i}{n+1}$ is $y_i = F^{-1}\left(\frac{i}{n+1}\right)$. If the data really do come from the distribution, then we would expect the points $((x_{(1)}, y_1), (x_{(2)}, y_2), \ldots, (x_{(n)}, y_n))$ to be close to the diagonal line $y = x$; conversely, strong deviation from this line is evidence that the distribution did not produce the data. The plot of these points is referred to as a probability plot.

Usually however, we are not trying to see if the data come from a particular distribution, but rather from a parametric family of distributions (such as the normal, uniform, or exponential families). We are usually forced into this situation because we don't know the parameters; indeed the next step, after the probability plot, may be to estimate the parameters. Fortunately, the probability plot method has a simple extension for any location-scale family of distributions. Thus, suppose that $G$ is a given distribution function. Recall that the location-scale family associated with $G$ has distribution function $F(x) = G\left(\frac{x - a}{b}\right)$ for $x \in \mathbb{R}$, where $a \in \mathbb{R}$ is the location parameter and $b \in (0, \infty)$ is the scale parameter. Recall also that for $p \in (0, 1)$, if $z_p = G^{-1}(p)$ denotes the quantile of order $p$ for $G$ and $y_p = F^{-1}(p)$ the quantile of order $p$ for $F$, then $y_p = a + b z_p$. It follows that if the probability plot constructed with distribution function $F$ is nearly linear (and in particular, if it is close to the diagonal line), then the probability plot constructed with distribution function $G$ will be nearly linear. Thus, we can use the distribution function $G$ without having to know the location and scale parameters.

In the exercises below, you will explore probability plots for the normal, exponential, and uniform distributions. We will study a formal, quantitative procedure, known as the chi-square goodness of fit test, in the chapter on Hypothesis Testing.
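A sketch of a probability plot computed by hand (an addition, assuming Python with numpy and scipy); the simulated normal data and its parameters are illustrative choices:

```python
import numpy as np
from scipy.stats import norm

# Normal probability plot: the points (y_i, x_(i)), with y_i the standard
# normal quantile of order i/(n+1), should be nearly linear for normal data,
# with slope and intercept roughly equal to the scale and location.
rng = np.random.default_rng(6)
data = rng.normal(loc=10, scale=3, size=200)

x_ord = np.sort(data)
p = np.arange(1, len(data) + 1) / (len(data) + 1)
y = norm.ppf(p)

slope, intercept = np.polyfit(y, x_ord, 1)
print(slope, intercept)        # roughly the scale (3) and location (10)
```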
Exercises and Applications

Basic Properties

Suppose that $x$ is the temperature (in degrees Fahrenheit) for a certain type of electronic component after 10 hours of operation. A sample of 30 components has five-number summary $(84, 102, 113, 120, 135)$.
1. Classify $x$ by type and level of measurement.
2. Find the range and interquartile range.
3. Find the five-number summary, range, and interquartile range if the temperature is converted to degrees Celsius. The transformation is $y = \frac{5}{9}(x - 32)$.

Answer

Suppose that $x$ is the length (in inches) of a machined part in a manufacturing process. A sample of 50 parts has five-number summary $(9.6, 9.8, 10.0, 10.1, 10.3)$.
1. Classify $x$ by type and level of measurement.
2. Find the range and interquartile range.
3. Find the five-number summary, range, and interquartile range if length is measured in centimeters. The transformation is $y = 2.54 x$.

Answer

Professor Moriarity has a class of 25 students in her section of Stat 101 at Enormous State University (ESU). For the first midterm exam, the five-number summary was $(16, 52, 64, 72, 81)$ (out of a possible 100 points). Professor Moriarity thinks the grades are a bit low and is considering various transformations for increasing the grades.
1. Find the range and interquartile range.
2. Suppose she adds 10 points to each grade. Find the five-number summary, range, and interquartile range for the transformed grades.
3. Suppose she multiplies each grade by 1.2. Find the five-number summary, range, and interquartile range for the transformed grades.
4. Suppose she uses the transformation $w = 10 \sqrt{x}$, which curves the grades greatly at the low end and very little at the high end. Give whatever information you can about the five-number summary of the transformed grades.
5. Determine whether the low score of 16 is an outlier.

Answer

Computational Exercises
All statistical software packages will compute order statistics and quantiles, draw stem-and-leaf plots and boxplots, and in general perform the numerical and graphical procedures discussed in this section. For real statistical experiments, particularly those with large data sets, the use of statistical software is essential. On the other hand, there is some value in performing the computations by hand, with small, artificial data sets, in order to master the concepts and definitions. In this subsection, do the computations and draw the graphs with minimal technological aids.

Suppose that x is the number of math courses completed by an ESU student. A sample of 10 ESU students gives the data x = (3, 1, 2, 0, 2, 4, 3, 2, 1, 2).
1. Classify x by type and level of measurement.
2. Give the order statistics.
3. Compute the five number summary and draw the boxplot.
4. Compute the range and the interquartile range.

Answer

Suppose that a sample of size 12 from a discrete variable x has empirical density function given by f(−2) = 1/12, f(−1) = 1/4, f(0) = 1/3, f(1) = 1/6, f(2) = 1/6.
1. Give the order statistics.
2. Compute the five number summary and draw the boxplot.
3. Compute the range and the interquartile range.

Answer

The stem and leaf plot below gives the grades for a 100-point test in a probability course with 38 students. The first digit is the stem and the second digit is the leaf. Thus, the low score was 47 and the high score was 98. The scores in the 6 row are 60, 60, 62, 63, 65, 65, 67, 68.

4 | 7
5 | 0346
6 | 00235578
7 | 0112346678899
8 | 0367889
9 | 1368
Compute the five number summary and draw the boxplot.
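For instance, the scores encoded in the stem-and-leaf plot can be expanded and summarized directly. The sketch below is illustrative only; NumPy's quantile interpolation conventions may differ slightly from the ones used in this section, and note that only 37 leaves survive in the plot as printed, although the text says 38 students.

import numpy as np

stems = {4: "7", 5: "0346", 6: "00235578", 7: "0112346678899", 8: "0367889", 9: "1368"}
scores = np.array([10 * s + int(leaf) for s, leaves in stems.items() for leaf in leaves])
q1, q2, q3 = np.quantile(scores, [0.25, 0.5, 0.75])
print(scores.min(), q1, q2, q3, scores.max())   # five number summary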
Answer

App Exercises
In the histogram app, construct a distribution with at least 30 values of each of the types indicated below. Note the five number summary.
1. A uniform distribution.
2. A symmetric, unimodal distribution.
3. A unimodal distribution that is skewed right.
4. A unimodal distribution that is skewed left.
5. A symmetric bimodal distribution.
6. A u-shaped distribution.

In the error function app, start with a distribution and add additional points as follows. Note the effect on the five number summary:
1. Add a point below x_(1).
2. Add a point between x_(1) and q_1.
3. Add a point between q_1 and q_2.
4. Add a point between q_2 and q_3.
5. Add a point between q_3 and x_(n).
6. Add a point above x_(n).
In the last problem, you may have noticed that when you add an additional point to the distribution, one or more of the five statistics does not change. In general, quantiles can be relatively insensitive to changes in the data.

The Uniform Distribution
Recall that the standard uniform distribution is the uniform distribution on the interval [0, 1]. Suppose that X is a random sample of size n from the standard uniform distribution. For k ∈ {1, 2, …, n}, X_(k) has the beta distribution, with left parameter k and right parameter n − k + 1. The probability density function f_k is given by

f_k(x) = [n! / ((k − 1)!(n − k)!)] x^(k−1) (1 − x)^(n−k),  0 ≤ x ≤ 1  (6.6.22)
Proof

In the order statistic experiment, select the standard uniform distribution and n = 5. Vary k from 1 to 5 and note the shape of the probability density function of X_(k). For each value of k, run the simulation 1000 times and compare the empirical density function to the true probability density function.
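The same simulation can be sketched outside the app; the code below (illustrative, with an arbitrary seed) compares the simulated mean of X_(k) with the beta mean k/(n + 1).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, runs = 5, 3, 1000
samples = rng.uniform(size=(runs, n))
x_k = np.sort(samples, axis=1)[:, k - 1]     # k-th order statistic in each run
print(x_k.mean(), k / (n + 1))               # simulated vs theoretical mean
# The empirical density can also be compared to stats.beta(k, n - k + 1).pdf.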
It's easy to extend the results for the standard uniform distribution to the general uniform distribution on an interval. Suppose that X is a random sample of size n from the uniform distribution on the interval [a, a + h] where a ∈ R and h ∈ (0, ∞). For k ∈ {1, 2, …, n}, X_(k) has the beta distribution with left parameter k, right parameter n − k + 1, location parameter a, and scale parameter h. In particular,
1. E(X_(k)) = a + h k/(n + 1)
2. var(X_(k)) = h² k(n − k + 1)/[(n + 1)²(n + 2)]
Proof

We return to the standard uniform distribution and consider the range of the random sample. Suppose that X is a random sample of size n from the standard uniform distribution. The sample range R has the beta distribution with left parameter n − 1 and right parameter 2. The probability density function g is given by

g(r) = n(n − 1) r^(n−2) (1 − r),  0 ≤ r ≤ 1  (6.6.23)
Proof

Once again, it's easy to extend this result to a general uniform distribution. Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the uniform distribution on [a, a + h] where a ∈ R and h ∈ (0, ∞). The sample range R = X_(n) − X_(1) has the beta distribution with left parameter n − 1, right parameter 2, and scale parameter h. In particular,
1. E(R) = h(n − 1)/(n + 1)
2. var(R) = h² 2(n − 1)/[(n + 1)²(n + 2)]
Proof

The joint distribution of the order statistics for a sample from the uniform distribution is easy to get. Suppose that (X_1, X_2, …, X_n) is a random sample of size n from the uniform distribution on the interval [a, a + h], where a ∈ R and h ∈ (0, ∞). Then (X_(1), X_(2), …, X_(n)) is uniformly distributed on {x ∈ [a, a + h]ⁿ : a ≤ x_1 ≤ x_2 ≤ ⋯ ≤ x_n < a + h}.
Proof

The Exponential Distribution
Recall that the exponential distribution with rate parameter λ > 0 has probability density function

f(x) = λ e^(−λx),  0 ≤ x < ∞

…

5. P(⋯ > 49, S² < 20)
6. P(−1 < T < 1)
Answer

Suppose that the SAT math scores from 16 Alabama students form a random sample X from the normal distribution with mean 550 and standard deviation 20, while the SAT math scores from 25 Georgia students form a random sample Y from the normal distribution with mean 540 and standard deviation 15. The two samples are independent. Find each of the following:
1. The mean and standard deviation of M(X).
2. The mean and standard deviation of M(Y).
3. The mean and standard deviation of M(X) − M(Y).
4. P[M(X) > M(Y)].
5. The mean and standard deviation of S²(X).
6. The mean and standard deviation of S²(Y).
7. The mean and standard deviation of S²(X)/S²(Y).
8. P[S(X) > S(Y)].
Answer

This page titled 6.8: Special Properties of Normal Samples is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
CHAPTER OVERVIEW

7: Point Estimation

Point estimation refers to the process of estimating a parameter from a probability distribution, based on observed data from the distribution. It is one of the core topics in mathematical statistics. In this chapter, we will explore the most common methods of point estimation: the method of moments, the method of maximum likelihood, and Bayes' estimators. We also study important properties of estimators, including sufficiency and completeness, and the basic question of whether an estimator is the best possible one.

7.1: Estimators
7.2: The Method of Moments
7.3: Maximum Likelihood
7.4: Bayesian Estimation
7.5: Best Unbiased Estimators
7.6: Sufficient, Complete and Ancillary Statistics
This page titled 7: Point Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
7.1: Estimators

The Basic Statistical Model

As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic statistical model, we have an observable random variable X taking values in a set S. Recall that in general, this variable can have quite a complicated structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest, then the data vector has the form

X = (X_1, X_2, …, X_n)  (7.1.1)

where X_i is the vector of measurements for the ith object. The most important special case is when (X_1, X_2, …, X_n) are independent and identically distributed (IID). In this case X is a random sample of size n from the distribution of an underlying measurement variable X.
Statistics

Recall also that a statistic is an observable function of the outcome variable of the random experiment: U = u(X) where u is a known function from S into another set T. Thus, a statistic is simply a random variable derived from the observation variable X, with the assumption that U is also observable. As the notation indicates, U is typically also vector-valued. Note that the original data vector X is itself a statistic, but usually we are interested in statistics derived from X. A statistic U may be computed to answer an inferential question. In this context, if the dimension of U (as a vector) is smaller than the dimension of X (as is usually the case), then we have achieved data reduction. Ideally, we would like to achieve significant data reduction with no loss of information about the inferential question at hand.
Parameters

In the technical sense, a parameter θ is a function of the distribution of X, taking values in a parameter space T. Typically, the distribution of X will have k ∈ N₊ real parameters of interest, so that θ has the form θ = (θ_1, θ_2, …, θ_k) and thus T ⊆ R^k. In many cases, one or more of the parameters are unknown, and must be estimated from the data variable X. This is one of the most important and basic of all statistical problems, and is the subject of this chapter. If U is a statistic, then the distribution of U will depend on the parameters of X, and thus so will distributional constructs such as means, variances, covariances, probability density functions and so forth. We usually suppress this dependence notationally to keep our mathematical expressions from becoming too unwieldy, but it's very important to realize that the underlying dependence is present. Remember that the critical idea is that by observing a value u of a statistic U we (hopefully) gain information about the unknown parameters.
Estimators

Suppose now that we have an unknown real parameter θ taking values in a parameter space T ⊆ R. A real-valued statistic U = u(X) that is used to estimate θ is called, appropriately enough, an estimator of θ. Thus, the estimator is a random variable and hence has a distribution, a mean, a variance, and so on (all of which, as noted above, will generally depend on θ). When we actually run the experiment and observe the data x, the observed value u = u(x) (a single number) is the estimate of the parameter θ. The following definitions are basic.

Suppose that U is a statistic used as an estimator of a parameter θ with values in T ⊆ R. For θ ∈ T,
1. U − θ is the error.
2. bias(U) = E(U − θ) = E(U) − θ is the bias of U.
3. mse(U) = E[(U − θ)²] is the mean square error of U.

Thus the error is the difference between the estimator and the parameter being estimated, so of course the error is a random variable. The bias of U is simply the expected error, and the mean square error (the name says it all) is the expected square of the error. Note that bias and mean square error are functions of θ ∈ T. The following definitions are a natural complement to the definition of bias. Suppose again that U is a statistic used as an estimator of a parameter θ with values in T ⊆ R.
1. U is unbiased if bias(U) = 0, or equivalently E(U) = θ, for all θ ∈ T.
2. U is negatively biased if bias(U) ≤ 0, or equivalently E(U) ≤ θ, for all θ ∈ T.
3. U is positively biased if bias(U) ≥ 0, or equivalently E(U) ≥ θ, for all θ ∈ T.

Thus, for an unbiased estimator, the expected value of the estimator is the parameter being estimated, clearly a desirable property. On the other hand, a positively biased estimator overestimates the parameter, on average, while a negatively biased estimator underestimates the parameter on average. Our definitions of negative and positive bias are weak in the sense that the weak inequalities ≤ and ≥ are used. There are corresponding strong definitions, of course, using the strong inequalities < and >. Note, however, that none of these definitions may apply. For example, it might be the case that bias(U) < 0 for some θ ∈ T, bias(U) = 0 for other θ ∈ T, and bias(U) > 0 for yet other θ ∈ T.

mse(U) = var(U) + bias²(U)
Proof

In particular, if the estimator is unbiased, then the mean square error of U is simply the variance of U. Ideally, we would like to have unbiased estimators with small mean square error. However, this is not always possible, and the result in (3) shows the delicate relationship between bias and mean square error. In the next section we will see an example with two estimators of a parameter that are multiples of each other; one is unbiased, but the other has smaller mean square error. However, if we have two unbiased estimators of θ, we naturally prefer the one with the smaller variance (mean square error).

Suppose that U and V are unbiased estimators of a parameter θ with values in T ⊆ R.
1. U is more efficient than V if var(U) ≤ var(V).
2. The relative efficiency of U with respect to V is

eff(U, V) = var(V)/var(U)  (7.1.3)
Asymptotic Properties

Suppose again that we have a real parameter θ with possible values in a parameter space T. Often in a statistical experiment, we observe an infinite sequence of random variables over time, X = (X_1, X_2, …), so that at time n we have observed X_n = (X_1, X_2, …, X_n). In this setting we often have a general formula that defines an estimator of θ for each sample size n. Technically, this gives a sequence of real-valued estimators of θ: U = (U_1, U_2, …) where U_n is a real-valued function of X_n for each n ∈ N₊. In this case, we can discuss the asymptotic properties of the estimators as n → ∞. Most of the definitions are natural generalizations of the ones above.
The sequence of estimators U = (U_1, U_2, …) is asymptotically unbiased if bias(U_n) → 0 as n → ∞ for every θ ∈ T, or equivalently, E(U_n) → θ as n → ∞ for every θ ∈ T.
Suppose that U = (U_1, U_2, …) and V = (V_1, V_2, …) are two sequences of estimators that are asymptotically unbiased. The asymptotic relative efficiency of U to V is

lim_{n→∞} eff(U_n, V_n) = lim_{n→∞} var(V_n)/var(U_n)  (7.1.4)
assuming that the limit exists.

Naturally, we expect our estimators to improve as the sample size n increases, and in some sense to converge to the parameter as n → ∞. This general idea is known as consistency. Once again, for the remainder of this discussion, we assume that U = (U_1, U_2, …) is a sequence of estimators for a real-valued parameter θ, with values in the parameter space T.
Consistency
1. U is consistent if U_n → θ as n → ∞ in probability for each θ ∈ T. That is, P(|U_n − θ| > ε) → 0 as n → ∞ for every ε > 0 and θ ∈ T.
2. U is mean-square consistent if mse(U_n) = E[(U_n − θ)²] → 0 as n → ∞ for θ ∈ T.
Here is the connection between the two definitions:

If U is mean-square consistent then U is consistent.

Proof

That mean-square consistency implies simple consistency is simply a statistical version of the theorem that states that mean-square convergence implies convergence in probability. Here is another nice consequence of mean-square consistency.

If U is mean-square consistent then U is asymptotically unbiased.

Proof

In the next several subsections, we will review several basic estimation problems that were studied in the chapter on Random Samples.
Estimation in the Single Variable Model

Suppose that X is a basic real-valued random variable for an experiment, with mean μ ∈ R and variance σ² ∈ (0, ∞). We sample from the distribution of X to produce a sequence X = (X_1, X_2, …) of independent variables, each with the distribution of X. For each n ∈ N₊, X_n = (X_1, X_2, …, X_n) is a random sample of size n from the distribution of X.
Estimating the Mean

This subsection is a review of some results obtained in the section on the Law of Large Numbers in the chapter on Random Samples. Recall that a natural estimator of the distribution mean μ is the sample mean, defined by

M_n = (1/n) ∑_{i=1}^n X_i,  n ∈ N₊  (7.1.7)

Properties of M = (M_1, M_2, …) as a sequence of estimators of μ:
1. E(M_n) = μ, so M_n is unbiased for n ∈ N₊.
2. var(M_n) = σ²/n for n ∈ N₊, so M is consistent.
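Both properties are easy to see in simulation. The following sketch (illustrative only; the gamma sampling distribution and seed are arbitrary) estimates E(M_n) and var(M_n) empirically for increasing n:

import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 2.0 * 3.0, 2.0 * 3.0**2     # mean kb and variance kb^2 of a gamma(k=2, b=3)
for n in (10, 100, 1000):
    means = rng.gamma(shape=2.0, scale=3.0, size=(5000, n)).mean(axis=1)
    # E(M_n) should be near mu for every n; var(M_n) should be near sigma2 / n
    print(n, means.mean(), means.var(), sigma2 / n)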
The consistency of M is simply the weak law of large numbers. Moreover, there are a number of important special cases of the results in (10). See the section on Sample Mean for the details.

Special cases of the sample mean
1. Suppose that X = 1_A, the indicator variable for an event A that has probability P(A). Then the sample mean for a random sample of size n ∈ N₊ from the distribution of X is the relative frequency or empirical probability of A, denoted P_n(A). Hence P_n(A) is an unbiased estimator of P(A) for n ∈ N₊ and (P_n(A) : n ∈ N₊) is consistent.
2. Suppose that F denotes the distribution function of a real-valued random variable Y. Then for fixed y ∈ R, the empirical distribution function F_n(y) is simply the sample mean for a random sample of size n ∈ N₊ from the distribution of the indicator variable X = 1(Y ≤ y). Hence F_n(y) is an unbiased estimator of F(y) for n ∈ N₊ and (F_n(y) : n ∈ N₊) is consistent.
3. Suppose that U is a random variable with a discrete distribution on a countable set S and f denotes the probability density function of U. Then for fixed u ∈ S, the empirical probability density function f_n(u) is simply the sample mean for a random sample of size n ∈ N₊ from the distribution of the indicator variable X = 1(U = u). Hence f_n(u) is an unbiased estimator of f(u) for n ∈ N₊ and (f_n(u) : n ∈ N₊) is consistent.
Estimating the Variance

This subsection is a review of some results obtained in the section on the Sample Variance in the chapter on Random Samples. We also assume that the fourth central moment σ₄ = E[(X − μ)⁴] is finite. Recall that σ₄/σ⁴ is the kurtosis of X. Recall first that if μ is known (almost always an artificial assumption), then a natural estimator of σ² is a special version of the sample variance, defined by

W_n² = (1/n) ∑_{i=1}^n (X_i − μ)²,  n ∈ N₊  (7.1.8)

Properties of W² = (W_1², W_2², …) as a sequence of estimators of σ²:
1. E(W_n²) = σ², so W_n² is unbiased for n ∈ N₊.
2. var(W_n²) = (1/n)(σ₄ − σ⁴) for n ∈ N₊, so W² is consistent.
Proof

If μ is unknown (the more reasonable assumption), then a natural estimator of the distribution variance is the standard version of the sample variance, defined by
S_n² = (1/(n−1)) ∑_{i=1}^n (X_i − M_n)²,  n ∈ {2, 3, …}  (7.1.9)

Properties of S² = (S_2², S_3², …) as a sequence of estimators of σ²:
1. E(S_n²) = σ², so S_n² is unbiased for n ∈ {2, 3, …}.
2. var(S_n²) = (1/n)(σ₄ − ((n−3)/(n−1))σ⁴) for n ∈ {2, 3, …}, so S² is a consistent sequence.
W
2
and
S
2
as estimators of
σ
2
. But again remember that
W
2
only makes
2
1. var (W ) < var(S ) for n ∈ {2, 3, …}. 2. The asymptotic relative efficiency of W to S is 1. 2 n
2 n
2
2
So by (a) W is better than S for n ∈ {2, 3, …}, assuming that μ is known so that we can actually use W . This is perhaps not surprising, but by (b) S works just about as well as W for a large sample size n . Of course, the sample standard deviation S is a natural estimator of the distribution standard deviation σ. Unfortunately, this estimator is biased. Here is a more general result: 2 n
2 n
2 n
2 n
2 n
n
Suppose that θ is a parameter with possible values in T ⊆ (0, ∞) (with at least two points) and that U is a statistic with values in T. If U² is an unbiased estimator of θ² then U is a negatively biased estimator of θ.

Proof

Thus, we should not be too obsessed with the unbiased property. For most sampling distributions, there will be no statistic U with the property that U is an unbiased estimator of σ and U² is an unbiased estimator of σ².
Estimation in the Bivariate Model

In this subsection we review some of the results obtained in the section on Correlation and Regression in the chapter on Random Samples. Suppose that X and Y are real-valued random variables for an experiment, so that (X, Y) has a bivariate distribution in R². Let μ = E(X) and σ² = var(X) denote the mean and variance of X, and let ν = E(Y) and τ² = var(Y) denote the mean and variance of Y. For the bivariate parameters, let δ = cov(X, Y) denote the distribution covariance and ρ = cor(X, Y) the distribution correlation. We need one higher-order moment as well: let δ₂ = E[(X − μ)²(Y − ν)²], and as usual, we assume that all of the parameters exist. So the general parameter spaces are μ, ν ∈ R, σ², τ² ∈ (0, ∞), δ ∈ R, and ρ ∈ [−1, 1].

Suppose now that we sample from the distribution of (X, Y) to generate a sequence of independent variables ((X_1, Y_1), (X_2, Y_2), …), each with the distribution of (X, Y). As usual, we will let X_n = (X_1, X_2, …, X_n) and Y_n = (Y_1, Y_2, …, Y_n); these are random samples of size n from the distributions of X and Y, respectively.
Since we now have two underlying variables, we need to enhance our notation somewhat. It will help to define the deterministic versions of our statistics. So if x = (x_1, x_2, …) and y = (y_1, y_2, …) are sequences of real numbers and n ∈ N₊, we define the mean and special covariance functions by

m_n(x) = (1/n) ∑_{i=1}^n x_i
w_n(x, y) = (1/n) ∑_{i=1}^n (x_i − μ)(y_i − ν)

If n ∈ {2, 3, …} we define the variance and standard covariance functions by

s_n²(x) = (1/(n−1)) ∑_{i=1}^n [x_i − m_n(x)]²
s_n(x, y) = (1/(n−1)) ∑_{i=1}^n [x_i − m_n(x)][y_i − m_n(y)]

It should be clear from context whether we are using the one-argument or two-argument version of s_n. On this point, note that s_n(x, x) = s_n²(x).
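A direct transcription of these four functions into Python might look like the sketch below (illustrative only; μ and ν must be supplied for the special covariance, since it assumes known means):

import numpy as np

def m(x):                        # mean function m_n(x)
    return np.mean(x)

def w(x, y, mu, nu):             # special covariance w_n(x, y), with known means
    x, y = np.asarray(x), np.asarray(y)
    return np.mean((x - mu) * (y - nu))

def s2(x):                       # variance function s_n^2(x)
    x = np.asarray(x)
    return np.sum((x - m(x)) ** 2) / (len(x) - 1)

def s(x, y):                     # standard covariance s_n(x, y)
    x, y = np.asarray(x), np.asarray(y)
    return np.sum((x - m(x)) * (y - m(y))) / (len(x) - 1)

# Consistency check of the identity noted above: s(x, x) == s2(x).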
Estimating the Covariance

If μ and ν are known (almost always an artificial assumption), then a natural estimator of the distribution covariance δ is a special version of the sample covariance, defined by

W_n = w_n(X, Y) = (1/n) ∑_{i=1}^n (X_i − μ)(Y_i − ν),  n ∈ N₊  (7.1.11)

Properties of W = (W_1, W_2, …) as a sequence of estimators of δ:
1. E(W_n) = δ, so W_n is unbiased for n ∈ N₊.
2. var(W_n) = (1/n)(δ₂ − δ²) for n ∈ N₊, so W is consistent.
Proof

If μ and ν are unknown (usually the more reasonable assumption), then a natural estimator of the distribution covariance δ is the standard version of the sample covariance, defined by

S_n = s_n(X, Y) = (1/(n−1)) ∑_{i=1}^n [X_i − m_n(X)][Y_i − m_n(Y)],  n ∈ {2, 3, …}  (7.1.12)

Properties of S = (S_2, S_3, …) as a sequence of estimators of δ:
1. E(S_n) = δ, so S_n is unbiased for n ∈ {2, 3, …}.
2. var(S_n) = (1/n)(δ₂ + (1/(n−1))σ²τ² − ((n−2)/(n−1))δ²) for n ∈ {2, 3, …}, so S is consistent.
Once again, since we have two competing sequences of estimators of δ, we would like to compare them.

Comparison of W and S as estimators of δ:
1. var(W_n) < var(S_n) for n ∈ {2, 3, …}.
2. The asymptotic relative efficiency of W to S is 1.

Thus, W_n is better than S_n for n ∈ {2, 3, …}, assuming that μ and ν are known so that we can actually use W_n. But for large n, S_n works just about as well as W_n.
Estimating the Correlation

A natural estimator of the distribution correlation ρ is the sample correlation

R_n = s_n(X, Y) / [s_n(X) s_n(Y)],  n ∈ {2, 3, …}  (7.1.13)

Note that this statistic is a nonlinear function of the sample covariance and the two sample standard deviations. For most distributions of (X, Y), we have no hope of computing the bias or mean square error of this estimator. If we could compute the expected value, we would probably find that the estimator is biased. On the other hand, even though we cannot compute the mean square error, a simple application of the law of large numbers shows that R_n → ρ as n → ∞ with probability 1. Thus, R = (R_2, R_3, …) is at least consistent.
Estimating the Regression Coefficients

Recall that the distribution regression line, with X as the predictor variable and Y as the response variable, is y = a + bx where

a = E(Y) − [cov(X, Y)/var(X)] E(X),  b = cov(X, Y)/var(X)  (7.1.14)

On the other hand, the sample regression line, based on the sample of size n ∈ {2, 3, …}, is y = A_n + B_n x where

A_n = m_n(Y) − [s_n(X, Y)/s_n²(X)] m_n(X),  B_n = s_n(X, Y)/s_n²(X)  (7.1.15)

Of course, the statistics A_n and B_n are natural estimators of the parameters a and b, respectively, and in a sense are derived from our previous estimators of the distribution mean, variance, and covariance. Once again, for most distributions of (X, Y), it would be difficult to compute the bias and mean square errors of these estimators. But applications of the law of large numbers show that with probability 1, A_n → a and B_n → b as n → ∞, so at least A = (A_2, A_3, …) and B = (B_2, B_3, …) are consistent.
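The sample regression coefficients are simple to compute from the definitions; the sketch below (with hypothetical simulated data) recovers a slope and intercept close to the true values used to generate the data.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=100)
y = 1.5 + 2.0 * x + rng.normal(0.0, 0.5, size=100)   # hypothetical linear relationship

sxy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)   # s_n(x, y)
sxx = np.sum((x - x.mean()) ** 2) / (len(x) - 1)               # s_n^2(x)
B = sxy / sxx                    # slope estimator B_n
A = y.mean() - B * x.mean()      # intercept estimator A_n
print(A, B)                      # should be near 1.5 and 2.0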
Exercises and Special Cases

The Poisson Distribution

Let's consider a simple example that illustrates some of the ideas above. Recall that the Poisson distribution with parameter λ ∈ (0, ∞) has probability density function g given by

g(x) = e^(−λ) λ^x / x!,  x ∈ N  (7.1.16)

The Poisson distribution is often used to model the number of random "points" in a region of time or space, and is studied in more detail in the chapter on the Poisson process. The parameter λ is proportional to the size of the region of time or space; the proportionality constant is the average rate of the random points. The distribution is named for Simeon Poisson.

Suppose that X has the Poisson distribution with parameter λ. Then
1. μ = E(X) = λ
2. σ² = var(X) = λ
3. σ₄ = E[(X − λ)⁴] = 3λ² + λ
Proof

Suppose now that we sample from the distribution of X to produce a sequence of independent random variables X = (X_1, X_2, …), each having the Poisson distribution with unknown parameter λ ∈ (0, ∞). Again, X_n = (X_1, X_2, …, X_n) is a random sample of size n from the distribution for each n ∈ N₊. From the previous exercise, λ is both the mean and the variance of the distribution, so we could use either the sample mean M_n or the sample variance S_n² as an estimator of λ. Both are unbiased, so which is better? Naturally, we use mean square error as our criterion.
Comparison of M to S² as estimators of λ:
1. var(M_n) = λ/n for n ∈ N₊.
2. var(S_n²) = (λ/n)(1 + 2λ n/(n−1)) for n ∈ {2, 3, …}.
3. var(M_n) < var(S_n²) for n ∈ {2, 3, …}, so M_n is better.
4. The asymptotic relative efficiency of M to S² is 1 + 2λ.
So our conclusion is that the sample mean M_n is a better estimator of the parameter λ than the sample variance S_n² for n ∈ {2, 3, …}, and the difference in quality increases with λ.
Run the Poisson experiment 100 times for several values of the parameter. In each case, compute the estimators M and S². Which estimator seems to work better?
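The comparison can also be carried out directly in Python; this sketch (arbitrary parameter value and seed) checks the unbiasedness of both estimators, the smaller variance of M, and the asymptotic relative efficiency 1 + 2λ.

import numpy as np

rng = np.random.default_rng(3)
lam, n, runs = 4.0, 30, 10000
data = rng.poisson(lam, size=(runs, n))
m = data.mean(axis=1)                    # sample mean M_n in each run
s2 = data.var(axis=1, ddof=1)            # sample variance S_n^2 in each run
print(m.mean(), s2.mean())               # both should be near lambda
print(m.var(), s2.var())                 # var(M) should be clearly smaller
print(s2.var() / m.var(), 1 + 2 * lam)   # near the asymptotic relative efficiency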
The emission of elementary particles from a sample of radioactive material in a time interval is often assumed to follow the Poisson distribution. Thus, suppose that the alpha emissions data set is a sample from a Poisson distribution. Estimate the rate parameter λ
1. using the sample mean
2. using the sample variance

Answer
Simulation Exercises

In the sample mean experiment, set the sampling distribution to gamma. Increase the sample size with the scroll bar and note graphically and numerically the unbiased and consistent properties. Run the experiment 1000 times and compare the sample mean to the distribution mean.

Run the normal estimation experiment 1000 times for several values of the parameters.
1. Compare the empirical bias and mean square error of M with the theoretical values.
2. Compare the empirical bias and mean square error of S² and of W² to their theoretical values. Which estimator seems to work better?

In the matching experiment, the random variable is the number of matches. Run the simulation 1000 times and compare
1. the sample mean to the distribution mean.
2. the empirical density function to the probability density function.

Run the exponential experiment 1000 times and compare the sample standard deviation to the distribution standard deviation.
Data Analysis Exercises

For Michelson's velocity of light data, compute the sample mean and sample variance.
Answer

For Cavendish's density of the earth data, compute the sample mean and sample variance.
Answer

For Short's parallax of the sun data, compute the sample mean and sample variance.
Answer

Consider the Cicada data.
1. Compute the sample mean and sample variance of the body length variable.
2. Compute the sample mean and sample variance of the body weight variable.
3. Compute the sample covariance and sample correlation between the body length and body weight variables.
Answer

Consider the M&M data.
1. Compute the sample mean and sample variance of the net weight variable.
2. Compute the sample mean and sample variance of the total number of candies.
3. Compute the sample covariance and sample correlation between the number of candies and the net weight.
Answer

Consider the Pearson data.
1. Compute the sample mean and sample variance of the height of the father.
2. Compute the sample mean and sample variance of the height of the son.
3. Compute the sample covariance and sample correlation between the height of the father and height of the son.
Answer

The estimators of the mean, variance, and covariance that we have considered in this section have been natural in a sense. However, for other parameters, it is not clear how to even find a reasonable estimator in the first place. In the next several sections, we will consider the problem of constructing estimators. Then we return to the study of the mathematical properties of estimators, and consider the question of when we can know that an estimator is the best possible, given the data.

This page titled 7.1: Estimators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
7.2: The Method of Moments

Basic Theory

The Method

Suppose that we have a basic random experiment with an observable, real-valued random variable X. The distribution of X has k unknown real-valued parameters, or equivalently, a parameter vector θ = (θ_1, θ_2, …, θ_k) taking values in a parameter space, a subset of R^k. As usual, we repeat the experiment n times to generate a random sample of size n from the distribution of X:

X = (X_1, X_2, …, X_n)  (7.2.1)
μ
j
(θ) = E (X ) ,
j ∈ N+
(7.2.2)
so that μ (θ) is the j th moment of X about 0. Note that we are emphasizing the dependence of these moments on the vector of parameters θ. Note also that μ (θ) is just the mean of X, which we usually denote simply by μ . Next, let (j)
(1)
M
(j)
n
1
j
(X) =
∑X , n
so that
M
j
j
1
2
(j)
(X)
(X , X , … , Xn )
from the distribution of
X
j
(7.2.3)
i=1
is the j th sample moment about 0. Equivalently,
j
j ∈ N+
i
M
(j)
(X)
is the sample mean for the random sample
. Note that we are emphasizing the dependence of the sample moments on the
sample X. Note also that M (X) is just the ordinary sample mean, which we usually just denote by M (or by M if we wish to emphasize the dependence on the sample size). From our previous work, we know that M (X) is an unbiased and consistent estimator of μ (θ) for each j . Here's how the method works: (1)
n
(j)
(j)
To construct the method of moments estimators consider the equations (j)
μ
consecutively for j ∈ N
+
(W1 , W2 , … , Wk )
(W1 , W2 , … , Wk ) = M
until we are able to solve for (W
1,
(j)
for the parameters
(θ1 , θ2 , … , θk )
respectively, we
(X1 , X2 , … , Xn )
W2 , … , Wk )
(7.2.4)
in terms of (M
(1)
,M
(2)
.
, …)
The equations for j ∈ {1, 2, …, k} give k equations in k unknowns, so there is hope (but no guarantee) that the equations can be solved for (W_1, W_2, …, W_k) in terms of (M^(1), M^(2), …, M^(k)). In fact, sometimes we need equations with j > k. Exercise 28 below gives a simple example. The method of moments can be extended to parameters associated with bivariate or more general multivariate distributions, by matching sample product moments with the corresponding distribution product moments. The method of moments also sometimes makes sense when the sample variables (X_1, X_2, …, X_n) are not independent, but at least are identically distributed. The hypergeometric model below is an example of this.
Of course, the method of moments estimators depend on the sample size n ∈ N₊. We have suppressed this so far, to keep the notation simple. But in the applications below, we put the notation back in because we want to discuss asymptotic behavior.
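As a concrete sketch of the recipe (this particular example is not worked in the text above), consider the gamma family with shape k and scale b, for which μ = kb and σ² = kb². Matching these to the sample mean and the biased sample variance gives closed-form estimators:

import numpy as np

def gamma_method_of_moments(x):
    # Method of moments for a gamma(k, b) sample: match mean and variance.
    x = np.asarray(x, dtype=float)
    m = x.mean()        # sample mean M
    t2 = x.var()        # biased sample variance T^2 (ddof = 0)
    b = t2 / m          # from k b = M and k b^2 = T^2
    return m / b, b     # (k estimate, b estimate)

rng = np.random.default_rng(4)
print(gamma_method_of_moments(rng.gamma(shape=2.0, scale=3.0, size=2000)))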
Estimates for the Mean and Variance

Estimating the mean and variance of a distribution are the simplest applications of the method of moments. Throughout this subsection, we assume that we have a basic real-valued random variable X with μ = E(X) ∈ R and σ² = var(X) ∈ (0, ∞). Occasionally we will also need σ₄ = E[(X − μ)⁴], the fourth central moment. We sample from the distribution of X to produce a sequence X = (X_1, X_2, …) of independent variables, each with the distribution of X. For each n ∈ N₊, X_n = (X_1, X_2, …, X_n) is a random sample of size n from the distribution of X. We start by estimating the mean, which is essentially trivial by this method.

Suppose that the mean μ is unknown. The method of moments estimator of μ based on X_n is the sample mean
M_n = (1/n) ∑_{i=1}^n X_i  (7.2.5)

1. E(M_n) = μ, so M_n is unbiased for n ∈ N₊.
2. var(M_n) = σ²/n for n ∈ N₊, so M = (M_1, M_2, …) is consistent.
Proof

Estimating the variance of the distribution, on the other hand, depends on whether the distribution mean μ is known or unknown. First we will consider the more realistic case when the mean is also unknown. Recall that for n ∈ {2, 3, …}, the sample variance based on X_n is

S_n² = (1/(n−1)) ∑_{i=1}^n (X_i − M_n)²  (7.2.6)

Recall also that E(S_n²) = σ², so S_n² is unbiased for n ∈ {2, 3, …}, and that var(S_n²) = (1/n)(σ₄ − ((n−3)/(n−1))σ⁴), so S² = (S_2², S_3², …) is consistent.

Suppose that the mean μ and the variance σ² are both unknown. For n ∈ N₊, the method of moments estimator of σ² based on X_n is
T_n² = (1/n) ∑_{i=1}^n (X_i − M_n)²  (7.2.7)

1. bias(T_n²) = −σ²/n for n ∈ N₊, so T² = (T_1², T_2², …) is asymptotically unbiased.
2. mse(T_n²) = (1/n³)[(n−1)² σ₄ − (n² − 5n + 3)σ⁴] for n ∈ N₊, so T² is consistent.

Proof
2
2
Tn
is referred to as the biased sample
2 n
Next let's consider the usually unrealistic (but mathematically interesting) case where the mean is known, but not the variance. Suppose that the mean μ is known and the variance σ unknown. For n ∈ N , the method of moments estimator of σ based on X is 2
2
+
n
1. E(W ) = σ so W is unbiased for n ∈ N 2. var(W ) = (σ − σ ) for n ∈ N so W 2 n
2
2 n
1
2 n
2
∑(Xi − μ)
n
(7.2.8)
i=1
+
4
4
n
n
1
2
Wn =
+
2
= (W
2
1
,W
2
2
, …)
is consistent.
Proof

We compared the sequence of estimators S² with the sequence of estimators W² in the introductory section on Estimators. Recall that var(W_n²) < var(S_n²) for n ∈ {2, 3, …} but var(S_n²)/var(W_n²) → 1 as n → ∞. There is no simple, general relationship between mse(T_n²) and mse(S_n²) or between mse(T_n²) and mse(W_n²), but the asymptotic relationship is simple.

mse(T_n²)/mse(W_n²) → 1 and mse(T_n²)/mse(S_n²) → 1 as n → ∞
2
− − − 2 = √T
.
2
Estimating Two Parameters There are several important special distributions with two paraemters; some of these are included in the computational exercises below. With two parameters, we can derive the method of moments estimators by matching the distribution mean and variance with the sample mean and variance, rather than matching the distribution mean and second moment with the sample mean and second
7.2.2
https://stats.libretexts.org/@go/page/10190
moment. This alternative approach sometimes leads to easier equations. To setup the notation, suppose that a distribution on R has parameters a and b . We sample from the distribution to produce a sequence of independent variables X = (X , X , …), each with the common distribution. For n ∈ N , X = (X , X , … , X ) is a random sample of size n from the distribution. Let M , 1
+
(2)
, and μ(a, b), μ Mn
n
1
2
2
n
n
denote the sample mean, second-order sample mean, and biased sample variance corresponding to (a, b), and σ (a, b) denote the mean, second-order mean, and variance of the distribution. 2
Tn
(2)
Xn
, and let
2
If the method of moments estimators U and V of a and b , respectively, can be found by solving the first two equations n
n
(2)
μ(Un , Vn ) = Mn ,
μ
(2)
(Un , Vn ) = Mn
(7.2.9)
then U and V can also be found by solving the equations n
n
μ(Un , Vn ) = Mn ,
2
2
σ (Un , Vn ) = Tn
(7.2.10)
Proof Because of this result, the biased sample variance T will appear in many of the estimation problems for special distributions that we consider below. 2 n
Special Distributions The Normal Distribution The normal distribution with mean function g given by
μ ∈ R
and variance
σ
2
∈ (0, ∞)
is a continuous distribution on
R
with probability density
2
g(x) =
1 1 x −μ ( ) ], − − exp[− 2 σ √2πσ
x ∈ R
(7.2.11)
This is one of the most important distributions in probability and statistics, primarily because of the central limit theorem. The normal distribution is studied in more detail in the chapter on Special Distributions. Suppose now that X = (X , X , … , X ) is a random sample of size n from the normal distribution with mean μ and variance σ . Form our general work above, we know that if μ is unknown then the sample mean M is the method of moments estimator of μ , and if in addition, σ is unknown then the method of moments estimator of σ is T . On the other hand, in the unlikely event that μ is known then W is the method of moments estimator of σ . Our goal is to see how the comparisons above simplify for the normal distribution. 1
2
n
2
2
2
2
2
2
Mean square errors of S and T . 2 n
1. mse(T 2. mse(S 3. mse(T
2
2
) =
) =
2
2n−1 2
n 2
n−1
σ
σ
2 n
4
4
) < mse(S
2
for n ∈ {2, 3, … , }
)
Proof Thus, S and T are multiplies of one another; S is unbiased, but when the sampling distribution is normal, T has smaller mean square error. Surprisingly, T has smaller mean square error even than W . 2
2
2
2
2
2
Mean square errors of T and W . 2
1. mse(W ) = σ 2. mse(T ) < mse(W 2
2
4
n
2
2
2
)
for n ∈ {2, 3, …}
Proof

Run the normal estimation experiment 1000 times for several values of the sample size n and the parameters μ and σ. Compare the empirical bias and mean square error of S² and of T² to their theoretical values. Which estimator is better in terms of bias? Which estimator is better in terms of mean square error?
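A quick simulation (a sketch of the experiment just described, with arbitrary parameter values) makes the bias and mean square error comparison visible:

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n, runs = 10.0, 2.0, 10, 20000
data = rng.normal(mu, sigma, size=(runs, n))
s2 = data.var(axis=1, ddof=1)      # unbiased sample variance S^2
t2 = data.var(axis=1, ddof=0)      # biased sample variance T^2
print("bias:", s2.mean() - sigma**2, t2.mean() - sigma**2)
print("mse :", np.mean((s2 - sigma**2) ** 2), np.mean((t2 - sigma**2) ** 2))
# Theory for normal samples: mse(S^2) = 2 sigma^4 / (n - 1), mse(T^2) = (2n - 1) sigma^4 / n^2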
Next we consider estimators of the standard deviation σ. As noted in the general discussion above, T = √(T²) is the method of moments estimator when μ is unknown, while W = √(W²) is the method of moments estimator in the unlikely event that μ is known. Another natural estimator, of course, is S = √(S²), the usual sample standard deviation. The following sequence, defined in terms of the gamma function, turns out to be important in the analysis of all three estimators.

Consider the sequence

a_n = √(2/n) Γ[(n + 1)/2] / Γ(n/2),  n ∈ N₊

Then 0 < a_n < 1 for n ∈ N₊ and a_n → 1 as n → ∞.
Proof

In the voter example (3) above, typically N and r are both unknown, but we would only be interested in estimating the ratio p = r/N. In the reliability example (1), we might typically know N and would be interested in estimating r. In the wildlife example (4), we would typically know r and would be interested in estimating N. This example is known as the capture-recapture model. Clearly there is a close relationship between the hypergeometric model and the Bernoulli trials model above. In fact, if the sampling is with replacement, the Bernoulli trials model would apply rather than the hypergeometric model. In addition, if the population size N is large compared to the sample size n, the hypergeometric model is well approximated by the Bernoulli trials model.

This page titled 7.2: The Method of Moments is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
7.3: Maximum Likelihood

Basic Theory

The Method

Suppose again that we have an observable random variable X for an experiment that takes values in a set S. Suppose also that the distribution of X depends on an unknown parameter θ, taking values in a parameter space Θ. Of course, our data variable X will almost always be vector-valued. The parameter θ may also be vector-valued. We will denote the probability density function of X on S by f_θ for θ ∈ Θ. The distribution of X could be discrete or continuous.
The likelihood function is the function obtained by reversing the roles of x and θ in the probability density function; that is, we view θ as the variable and x as the given information (which is precisely the point of view in estimation).

The likelihood function at x ∈ S is the function L_x : Θ → [0, ∞) given by

L_x(θ) = f_θ(x),  θ ∈ Θ  (7.3.1)
In the method of maximum likelihood, we try to find the value of the parameter that maximizes the likelihood function for each value of the data vector. Suppose that the maximum value of L_x occurs at u(x) ∈ Θ for each x ∈ S. Then the statistic u(X) is a maximum likelihood estimator of θ.
The method of maximum likelihood is intuitively appealing: we try to find the value of the parameter that would have most likely produced the data we in fact observed. Since the natural logarithm function is strictly increasing on (0, ∞), the maximum value of the likelihood function, if it exists, will occur at the same points as the maximum value of the logarithm of the likelihood function.

The log-likelihood function at x ∈ S is the function ln L_x:

ln L_x(θ) = ln f_θ(x),  θ ∈ Θ  (7.3.2)

If the maximum value of ln L_x occurs at u(x) ∈ Θ for each x ∈ S, then the statistic u(X) is a maximum likelihood estimator of θ.
Vector of Parameters An important special case is when θ = (θ , θ , … , θ ) is a vector of k real parameters, so that Θ ⊆ R . In this case, the maximum likelihood problem is to maximize a function of several variables. If Θ is a continuous set, the methods of calculus can be used. If the maximum value of L occurs at a point θ in the interior of Θ, then L has a local maximum at θ. Therefore, assuming that the likelihood function is differentiable, we can find this point by solving k
1
2
k
x
x
∂ ∂θi
Lx (θ) = 0,
i ∈ {1, 2, … , k}
(7.3.3)
or equivalently ∂ ∂θi
ln Lx (θ) = 0,
i ∈ {1, 2, … , k}
(7.3.4)
On the other hand, the maximum value may occur at a boundary point of Θ, or may not exist at all.
7.3.1
https://stats.libretexts.org/@go/page/10191
Random Samples The most important special case is when the data variables form a random sample from a distribution. Suppose that X = (X , X , … , X ) is a random sample of size n from the distribution of a random variable X taking values in R , with probability density function g for θ ∈ Θ . Then X takes values in S = R , and the likelihood and log-likelihood functions for x = (x , x , … , x ) ∈ S are 1
2
n
n
θ
1
2
n
n
Lx (θ) = ∏ gθ (xi ),
θ ∈ Θ
i=1 n
ln Lx (θ) = ∑ ln gθ (xi ),
θ ∈ Θ
i=1
Extending the Method and the Invariance Property Returning to the general setting, suppose now that h is a one-to-one function from the parameter space Θ onto a set Λ . We can view λ = h(θ) as a new parameter taking values in the space Λ , and it is easy to re-parameterize the probability density function with the new parameter. Thus, let f^ (x) = f (x) for x ∈ S and λ ∈ Λ . The corresponding likelihood function for x ∈ S is −1
λ
h
(λ)
−1 ^ Lx (λ) = Lx [ h (λ)] ,
λ ∈ Λ
(7.3.5)
^ Clearly if u(x) ∈ Θ maximizes L for x ∈ S . Then h [u(x)] ∈ Λ maximizes L for x ∈ S . It follows that if likelihood estimator for θ , then V = h(U ) is a maximum likelihood estimator for λ = h(θ) . x
x
U
is a maximum
If the function h is not one-to-one, the maximum likelihood function for the new parameter λ = h(θ) is not well defined, because we cannot parameterize the probability density function in terms of λ . However, there is a natural generalization of the method. Suppose that h : Θ → Λ , and let λ = h(θ) denote the new parameter. Define the likelihood function for λ at x ∈ S by −1 ^ Lx (λ) = max { Lx (θ) : θ ∈ h {λ}} ;
^ If v(x) ∈ Λ maximizes L for each x ∈ S , then V x
λ ∈ Λ
(7.3.6)
is a maximum likelihood estimator of λ .
= v(X)
This definition extends the maximum likelihood method to cases where the probability density function is not completely parameterized by the parameter of interest. The following theorem is known as the invariance property: if we can solve the maximum likelihood problem for θ then we can solve the maximum likelihood problem for λ = h(θ) . In the setting of the previous theorem, if U is a maximum likelihood estimator of θ , then V estimator of λ .
= h(U )
is a maximum likelihood
Proof
Examples and Special Cases In the following subsections, we will study maximum likelihood estimation for a number of special parametric families of distributions. Recall that if X = (X , X , … , X ) is a random sample from a distribution with mean μ and variance σ , then the method of moments estimators of μ and σ are, respectively, 2
1
2
n
2
n
1 M = n
T
2
∑ Xi
n
1 = n
(7.3.7)
i=1
2
∑(Xi − M )
(7.3.8)
i=1
Of course, M is the sample mean, and T is the biased version of the sample variance. These statistics will also sometimes occur as maximum likelihood estimators. Another statistic that will occur in some of the examples below is 2
1 M2 =
n
∑X n
2
i
(7.3.9)
i=1
7.3.2
https://stats.libretexts.org/@go/page/10191
the second-order sample mean. As always, be sure to try the derivations yourself before looking at the solutions.
The Bernoulli Distribution Suppose that X = (X , X , … , X ) is a random sample of size p ∈ [0, 1]. Recall that the Bernoulli probability density function is 1
2
n
n
x
1−x
g(x) = p (1 − p )
,
from the Bernoulli distribution with success parameter
x ∈ {0, 1}
(7.3.10)
Thus, X is a sequence of independent indicator variables with P(X = 1) = p for each i. In the usual language of reliability, X is the outcome of trial i, where 1 means success and 0 means failure. Let Y = ∑ X denote the number of successes, so that the proportion of successes (the sample mean) is M = Y /n . Recall that Y has the binomial distribution with parameters n and p. i
i
n
i=1
i
The sample mean M is the maximum likelihood estimator of p on the parameter space (0, 1).

Proof

Recall that M is also the method of moments estimator of p. It's always nice when two different estimation procedures yield the same result. Next let's look at the same problem, but with a much restricted parameter space. Suppose now that p takes values in {1/2, 1}. Then the maximum likelihood estimator of p is the statistic

U = 1 if Y = n,  U = 1/2 if Y < n  (7.3.14)

1. E(U) = {
The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in more detail in the chapter on Special Distributions. The probability density function is

g_b(x) = [1/(Γ(k) b^k)] x^(k−1) e^(−x/b),  x ∈ (0, ∞)  (7.5.26)

The basic assumption is satisfied with respect to b. Moreover, the mean and variance of the gamma distribution are kb and kb², respectively.

b²/(nk) is the Cramér-Rao lower bound for the variance of unbiased estimators of b.

M/k attains the lower bound in the previous exercise and hence is an UMVUE of b.
The Beta Distribution

Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the beta distribution with left parameter a > 0 and right parameter b = 1. Beta distributions are widely used to model random proportions and other random variables that take values in bounded intervals, and are studied in more detail in the chapter on Special Distributions. In our specialized case, the probability density function of the sampling distribution is

g_a(x) = a x^(a−1),  x ∈ (0, 1)  (7.5.27)

The basic assumption is satisfied with respect to a. The mean and variance of the distribution are
1. μ = a/(a + 1)
2. σ² = a/[(a + 1)²(a + 2)]

The Cramér-Rao lower bound for the variance of unbiased estimators of μ is a²/[n(a + 1)⁴].
The sample mean M does not achieve the Cramér-Rao lower bound in the previous exercise, and hence is not an UMVUE of μ.
The Uniform Distribution

Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the uniform distribution on [0, a] where a > 0 is the unknown parameter. Thus, the probability density function of the sampling distribution is

g_a(x) = 1/a,  x ∈ [0, a]  (7.5.28)

The basic assumption is not satisfied. The Cramér-Rao lower bound for the variance of unbiased estimators of a is a²/n. Of course, the Cramér-Rao Theorem does not apply, by the previous exercise.
Recall that V = ((n + 1)/n) max{X_1, X_2, …, X_n} is unbiased and has variance a²/[n(n + 2)]. This variance is smaller than the Cramér-Rao bound in the previous exercise.

The reason that the basic assumption is not satisfied is that the support set {x ∈ R : g_a(x) > 0} depends on the parameter a.
Best Linear Unbiased Estimators

We now consider a somewhat specialized problem, but one that fits the general theme of this section. Suppose that X = (X_1, X_2, …, X_n) is a sequence of observable real-valued random variables that are uncorrelated and have the same unknown mean μ ∈ R, but possibly different standard deviations. Let σ = (σ_1, σ_2, …, σ_n) where σ_i = sd(X_i) for i ∈ {1, 2, …, n}.
2
n
n
Y = ∑ ci Xi
(7.5.29)
i=1
Y
is unbiased if and only if ∑
n i=1
ci = 1
.
The variance of Y is n 2
2
i
i
var(Y ) = ∑ c σ
(7.5.30)
i=1
The variance is minimized, subject to the unbiased constraint, when 1/σ
2
j
cj =
n
∑
i=1
1/ σ
2
,
j ∈ {1, 2, … , n}
(7.5.31)
i
Proof This exercise shows how to construct the Best Linear Unbiased Estimator (BLUE) of deviations σ is known.
μ
, assuming that the vector of standard
Suppose now that σ = σ for i ∈ {1, 2, … , n} so that the outcome variables have the same standard deviation. In particular, this would be the case if the outcome variables form a random sample of size n from a distribution with mean μ and standard deviation σ. i
In this case the variance is minimized when c
i
= 1/n
for each i and hence Y
=M
, the sample mean.
This exercise shows that the sample mean M is the best linear unbiased estimator of μ when the standard deviations are the same, and that moreover, we do not need to know the value of the standard deviation. This page titled 7.5: Best Unbiased Estimators is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
7.5.6
https://stats.libretexts.org/@go/page/10193
7.6: Sufficient, Complete and Ancillary Statistics Basic Theory The Basic Statistical Model Consider again the basic statistical model, in which we have a random experiment with an observable random variable X taking values in a set S . Once again, the experiment is typically to sample n objects from a population and record one or more measurements for each item. In this case, the outcome variable has the form X = (X1 , X2 , … , Xn )
(7.6.1)
where X is the vector of measurements for the ith item. In general, we suppose that the distribution of X depends on a parameter θ taking values in a parameter space T . The parameter θ may also be vector-valued. We will sometimes use subscripts in probability density functions, expected values, etc. to denote the dependence on θ . i
As usual, the most important special case is when X is a sequence of independent, identically distributed random variables. In this case X is a random sample from the common distribution.
Sufficient Statistics Let U = u(X) be a statistic taking values in a set R . Intuitively, U is sufficient for θ if U contains all of the information about θ that is available in the entire data variable X. Here is the formal definition: A statistic U is sufficient for θ if the conditional distribution of X given U does not depend on θ ∈ T . Sufficiency is related to the concept of data reduction. Suppose that X takes values in R . If we can find a sufficient statistic U that takes values in R , then we can reduce the original data vector X (whose dimension n is usually large) to the vector of statistics U (whose dimension j is usually much smaller) with no loss of information about the parameter θ . n
j
The following result gives a condition for sufficiency that is equivalent to this definition. Let U = u(X) be a statistic taking values in R , and let f and h denote the probability density functions of respectively. Then U is suffcient for θ if and only if the function on S given below does not depend on θ ∈ T : θ
θ
X
and
U
fθ (x) x ↦
(7.6.2) hθ [u(x)]
Proof The definition precisely captures the intuitive notion of sufficiency given above, but can be difficult to apply. We must know in advance a candidate statistic U , and then we must be able to compute the conditional distribution of X given U . The FisherNeyman factorization theorem given next often allows the identification of a sufficient statistic from the form of the probability density function of X. It is named for Ronald Fisher and Jerzy Neyman. Fisher-Neyman Factorization Theorem. Let f denote the probability density function of X and suppose that U = u(X) is a statistic taking values in R . Then U is sufficient for θ if and only if there exists G : R × T → [0, ∞) and r : S → [0, ∞) such that θ
fθ (x) = G[u(x), θ]r(x);
x ∈ S, θ ∈ T
(7.6.3)
Proof Note that r depends only on the data x but not on the parameter θ . Less technically, u(X) is sufficient for density function f (x) depends on the data vector x and the parameter θ only through u(x).
θ
if the probability
θ
If U and V are equivalent statistics and U is sufficient for θ then V is sufficient for θ .
7.6.1
https://stats.libretexts.org/@go/page/10194
Minimal Sufficient Statistics The entire data variable X is trivially sufficient for θ . However, as noted above, there usually exists a statistic U that is sufficient for θ and has smaller dimension, so that we can achieve real data reduction. Naturally, we would like to find the statistic U that has the smallest dimension possible. In many cases, this smallest dimension j will be the same as the dimension k of the parameter vector θ . However, as we will see, this is not necessarily the case; j can be smaller or larger than k . An example based on the uniform distribution is given in (38). Suppose that a statistic sufficient for θ .
U
is sufficient for θ . Then U is minimally sufficient if
U
is a function of any other statistic
V
that is
Once again, the definition precisely captures the notion of minimal sufficiency, but is hard to apply. The following result gives an equivalent condition.

Let f_θ denote the probability density function of X corresponding to the parameter value θ ∈ T and suppose that U = u(X) is a statistic taking values in R. Then U is minimally sufficient for θ if the following condition holds: for x ∈ S and y ∈ S,

\[ \frac{f_\theta(x)}{f_\theta(y)} \text{ is independent of } \theta \text{ if and only if } u(x) = u(y) \tag{7.6.4} \]
Proof If U and V are equivalent statistics and U is minimally sufficient for θ then V is minimally sufficient for θ .
Properties of Sufficient Statistics
Sufficiency is related to several of the methods of constructing estimators that we have studied.

Suppose that U is sufficient for θ and that there exists a maximum likelihood estimator of θ. Then there exists a maximum likelihood estimator V that is a function of U.

Proof
In particular, suppose that V is the unique maximum likelihood estimator of θ and that V is sufficient for θ. If U is sufficient for θ then V is a function of U by the previous theorem. Hence it follows that V is minimally sufficient for θ.

Our next result applies to Bayesian analysis.

Suppose that the statistic U = u(X) is sufficient for the parameter θ and that θ is modeled by a random variable Θ with values in T. Then the posterior distribution of Θ given X = x ∈ S is a function of u(x).

Proof
Continuing with the setting of Bayesian analysis, suppose that θ is a real-valued parameter. If we use the usual mean-square loss function, then the Bayesian estimator is V = E(Θ ∣ X). By the previous result, V is a function of the sufficient statistic U. That is, E(Θ ∣ X) = E(Θ ∣ U).

The next result is the Rao-Blackwell theorem, named for C. R. Rao and David Blackwell. The theorem shows how a sufficient statistic can be used to improve an unbiased estimator.

Rao-Blackwell Theorem. Suppose that U is sufficient for θ and that V is an unbiased estimator of a real parameter λ = λ(θ). Then E_θ(V ∣ U) is also an unbiased estimator of λ and is uniformly better than V.
Proof
Complete Statistics
Suppose that U = u(X) is a statistic taking values in a set R. Then U is a complete statistic for θ if for any function r: R → ℝ,

\[ E_\theta[r(U)] = 0 \text{ for all } \theta \in T \implies P_\theta[r(U) = 0] = 1 \text{ for all } \theta \in T \tag{7.6.9} \]
To understand this rather strange looking condition, suppose that r(U ) is a statistic constructed from U that is being used as an estimator of 0 (thought of as a function of θ ). The completeness condition means that the only such unbiased estimator is the statistic that is 0 with probability 1. If U and V are equivalent statistics and U is complete for θ then V is complete for θ . The next result shows the importance of statistics that are both complete and sufficient; it is known as the Lehmann-Scheffé theorem, named for Erich Lehmann and Henry Scheffé. Lehmann-Scheffé Theorem. Suppose that U is sufficient and complete for θ and that V = r(U ) is an unbiased estimator of a real parameter λ = λ(θ) . Then V is a uniformly minimum variance unbiased estimator (UMVUE) of λ . Proof
Ancillary Statistics Suppose that V = v(X) is a statistic taking values in a set R . If the distribution of V does not depend on θ , then V is called an ancillary statistic for θ . Thus, the notion of an ancillary statistic is complementary to the notion of a sufficient statistic. A sufficient statistic contains all available information about the parameter; an ancillary statistic contains no information about the parameter. The following result, known as Basu's Theorem and named for Debabrata Basu, makes this point more precisely. Basu's Theorem. Suppose that U is complete and sufficient for a parameter θ and that V is an ancillary statistic for θ . Then U and V are independent. Proof If U and V are equivalent statistics and U is ancillary for θ then V is ancillary for θ .
Applications and Special Distributions In this subsection, we will explore sufficient, complete, and ancillary statistics for a number of special distributions. As always, be sure to try the problems yourself before looking at the solutions.
The Bernoulli Distribution
Recall that the Bernoulli distribution with parameter p ∈ (0, 1) is a discrete distribution on {0, 1} with probability density function g defined by

\[ g(x) = p^x (1 - p)^{1-x}, \quad x \in \{0, 1\} \tag{7.6.10} \]
Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the Bernoulli distribution with parameter p. Equivalently, X is a sequence of Bernoulli trials, so that in the usual language of reliability, X_i = 1 if trial i is a success, and X_i = 0 if trial i is a failure. The Bernoulli distribution is named for Jacob Bernoulli and is studied in more detail in the chapter on Bernoulli Trials.

Let Y = ∑_{i=1}^n X_i denote the number of successes. Recall that Y has the binomial distribution with parameters n and p, and has probability density function h defined by

\[ h(y) = \binom{n}{y} p^y (1 - p)^{n-y}, \quad y \in \{0, 1, \ldots, n\} \tag{7.6.11} \]
Y is sufficient for p. Specifically, for y ∈ {0, 1, …, n}, the conditional distribution of X given Y = y is uniform on the set of points

\[ D_y = \{(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n : x_1 + x_2 + \cdots + x_n = y\} \tag{7.6.12} \]
Proof
This result is intuitively appealing: in a sequence of Bernoulli trials, all of the information about the probability of success p is contained in the number of successes Y. The particular order of the successes and failures provides no additional information. Of course, the sufficiency of Y follows more easily from the factorization theorem (3), but the conditional distribution provides additional insight.

Y is complete for p on the parameter space (0, 1).
Proof
The proof of the last result actually shows that if the parameter space is any subset of (0, 1) containing an interval of positive length, then Y is complete for p. But the notion of completeness depends very much on the parameter space. The following result considers the case where p has a finite set of values.

Suppose that the parameter space T ⊂ (0, 1) is a finite set with k ∈ N_+ elements. If the sample size n is at least k, then Y is not complete for p.
Proof
The sample mean M = Y/n (the sample proportion of successes) is clearly equivalent to Y (the number of successes), and hence is also sufficient for p and is complete for p ∈ (0, 1). Recall that the sample mean M is the method of moments estimator of p, and is the maximum likelihood estimator of p on the parameter space (0, 1).

In Bayesian analysis, the usual approach is to model p with a random variable P that has a prior beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Then the posterior distribution of P given X is beta with left parameter a + Y and right parameter b + (n − Y). The posterior distribution depends on the data only through the sufficient statistic Y, as guaranteed by theorem (9).

The sample variance S² is an UMVUE of the distribution variance p(1 − p) for p ∈ (0, 1), and can be written as

\[ S^2 = \frac{Y}{n-1}\left(1 - \frac{Y}{n}\right) \tag{7.6.17} \]
Proof
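The identity (7.6.17) is easy to check numerically. The following is a minimal sketch, assuming NumPy is available; the values of n and p are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.3
x = rng.binomial(1, p, size=n)            # one Bernoulli sample
y = x.sum()                               # number of successes

s2_direct = x.var(ddof=1)                 # sample variance S^2 computed directly
s2_formula = (y / (n - 1)) * (1 - y / n)  # formula (7.6.17)
print(s2_direct, s2_formula)              # the two values agree (up to floating point)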
The Poisson Distribution
Recall that the Poisson distribution with parameter θ ∈ (0, ∞) is a discrete distribution on N with probability density function g defined by

\[ g(x) = e^{-\theta} \frac{\theta^x}{x!}, \quad x \in \mathbb{N} \tag{7.6.19} \]
The Poisson distribution is named for Simeon Poisson and is used to model the number of “random points” in a region of time or space, under certain ideal conditions. The parameter θ is proportional to the size of the region, and is both the mean and the variance of the distribution. The Poisson distribution is studied in more detail in the chapter on the Poisson process.

Suppose now that X = (X_1, X_2, …, X_n) is a random sample of size n from the Poisson distribution with parameter θ. Recall that the sum of the scores Y = ∑_{i=1}^n X_i also has the Poisson distribution, but with parameter nθ.

The statistic Y is sufficient for θ. Specifically, for y ∈ N, the conditional distribution of X given Y = y is the multinomial distribution with y trials, n trial values, and uniform trial probabilities.
Proof
As before, it's easier to use the factorization theorem to prove the sufficiency of Y, but the conditional distribution gives some additional insight.

Y is complete for θ ∈ (0, ∞).

Proof
As with our discussion of Bernoulli trials, the sample mean M = Y/n is clearly equivalent to Y and hence is also sufficient for θ and complete for θ ∈ (0, ∞). Recall that M is the method of moments estimator of θ and is the maximum likelihood estimator on the parameter space (0, ∞).

An UMVUE of the parameter P(X = 0) = e^{−θ} for θ ∈ (0, ∞) is

\[ U = \left(\frac{n-1}{n}\right)^Y \tag{7.6.23} \]
Proof
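A quick Monte Carlo check of the unbiasedness of U may be helpful. This is a sketch assuming NumPy; θ, n, and the number of replications are arbitrary choices. Since Y has the Poisson distribution with parameter nθ, we can simulate Y directly.

import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 10, 200_000
y = rng.poisson(n * theta, size=reps)   # Y ~ Poisson(n * theta)
u = ((n - 1) / n) ** y                  # the estimator U = ((n-1)/n)^Y
print(u.mean(), np.exp(-theta))         # both are approximately e^{-theta}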
The Normal Distribution
Recall that the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞) is a continuous distribution on R with probability density function g defined by

\[ g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \mathbb{R} \tag{7.6.26} \]
The normal distribution is often used to model physical quantities subject to small, random errors, and is studied in more detail in the chapter on Special Distributions. Because of the central limit theorem, the normal distribution is perhaps the most important distribution in statistics.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the normal distribution with mean μ and variance σ². Then each of the following pairs of statistics is minimally sufficient for (μ, σ²):

1. (Y, V) where Y = ∑_{i=1}^n X_i and V = ∑_{i=1}^n X_i².
2. (M, S²) where M = (1/n) ∑_{i=1}^n X_i is the sample mean and S² = (1/(n−1)) ∑_{i=1}^n (X_i − M)² is the sample variance.
3. (M, T²) where T² = (1/n) ∑_{i=1}^n (X_i − M)² is the biased sample variance.
Proof
Recall that M and T² are the method of moments estimators of μ and σ², respectively, and are also the maximum likelihood estimators on the parameter space R × (0, ∞).

Run the normal estimation experiment 1000 times with various values of the parameters. Compare the estimates of the parameters in terms of bias and mean square error.

Sometimes the variance σ² of the normal distribution is known, but not the mean μ. It's rarely the case that μ is known but not σ². Nonetheless we can give sufficient statistics in both cases.
Suppose again that X = (X_1, X_2, …, X_n) is a random sample from the normal distribution with mean μ ∈ R and variance σ² ∈ (0, ∞).

1. If σ² is known then Y = ∑_{i=1}^n X_i is minimally sufficient for μ.
2. If μ is known then U = ∑_{i=1}^n (X_i − μ)² is sufficient for σ².

Proof
Of course by equivalence, in part (a) the sample mean M = Y/n is minimally sufficient for μ, and in part (b) the special sample variance W = U/n is minimally sufficient for σ². Moreover, in part (a), M is complete for μ on the parameter space R and the sample variance S² is ancillary for μ. (Recall that (n − 1)S²/σ² has the chi-square distribution with n − 1 degrees of freedom.) It follows from Basu's theorem (15) that the sample mean M and the sample variance S² are independent. We proved this by more direct means in the section on special properties of normal samples, but the formulation in terms of sufficient and ancillary statistics gives additional insight.
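The independence of M and S² is easy to illustrate by simulation. The following is a minimal sketch, assuming NumPy; the mean, standard deviation, and sample size are arbitrary choices, and a sample correlation near 0 is consistent with (though of course weaker than) independence.

import numpy as np

rng = np.random.default_rng(2)
reps, n = 100_000, 10
x = rng.normal(5.0, 2.0, size=(reps, n))   # many normal samples of size n
m = x.mean(axis=1)                         # sample means
s2 = x.var(axis=1, ddof=1)                 # sample variances
print(np.corrcoef(m, s2)[0, 1])            # approximately 0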
The Gamma Distribution
Recall that the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a continuous distribution on (0, ∞) with probability density function g given by

\[ g(x) = \frac{1}{\Gamma(k)\, b^k} x^{k-1} e^{-x/b}, \quad x \in (0, \infty) \tag{7.6.29} \]

The gamma distribution is often used to model random times and certain other types of positive random variables, and is studied in more detail in the chapter on Special Distributions.
Suppose that X = (X_1, X_2, …, X_n) is a random sample from the gamma distribution with shape parameter k and scale parameter b. Each of the following pairs of statistics is minimally sufficient for (k, b):

1. (Y, V) where Y = ∑_{i=1}^n X_i is the sum of the scores and V = ∏_{i=1}^n X_i is the product of the scores.
2. (M, U) where M = Y/n is the sample (arithmetic) mean of X and U = V^{1/n} is the sample geometric mean of X.
Proof
Recall that the method of moments estimators of k and b are M²/T² and T²/M, respectively, where M = (1/n) ∑_{i=1}^n X_i is the sample mean and T² = (1/n) ∑_{i=1}^n (X_i − M)² is the biased sample variance. If the shape parameter k is known, M/k is both the method of moments estimator of b and the maximum likelihood estimator on the parameter space (0, ∞). Note that T² is not a function of the sufficient statistics (Y, V), and hence estimators based on T² suffer from a loss of information.
Run the gamma estimation experiment 1000 times with various values of the parameters and the sample size n. Compare the estimates of the parameters in terms of bias and mean square error.

The proof of the last theorem actually shows that Y is sufficient for b if k is known, and that V is sufficient for k if b is known.

Suppose again that X = (X_1, X_2, …, X_n) is a random sample of size n from the gamma distribution with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Then Y = ∑_{i=1}^n X_i is complete for b.
Proof
Suppose again that X = (X_1, X_2, …, X_n) is a random sample from the gamma distribution on (0, ∞) with shape parameter k ∈ (0, ∞) and scale parameter b ∈ (0, ∞). Let M = (1/n) ∑_{i=1}^n X_i denote the sample mean and U = (X_1 X_2 ⋯ X_n)^{1/n} the sample geometric mean, as before. Then

1. M/U is ancillary for b.
2. M and M/U are independent.

Proof
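A simulation can suggest the ancillarity of M/U: its distribution should look the same for every value of the scale parameter b. This is a minimal sketch assuming NumPy; the values of k, n, b, and the replication count are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
k, n, reps = 3.0, 10, 50_000
for b in (0.5, 2.0, 10.0):
    x = rng.gamma(k, b, size=(reps, n))
    # arithmetic mean over geometric mean, computed via logs
    ratio = x.mean(axis=1) / np.exp(np.log(x).mean(axis=1))
    print(b, ratio.mean(), ratio.std())   # summary statistics do not depend on b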
The Beta Distribution
Recall that the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) is a continuous distribution on (0, 1) with probability density function g given by

\[ g(x) = \frac{1}{B(a, b)} x^{a-1} (1 - x)^{b-1}, \quad x \in (0, 1) \tag{7.6.33} \]
where B is the beta function. The beta distribution is often used to model random proportions and other random variables that take values in bounded intervals. It is studied in more detail in the chapter on Special Distributions.

Suppose that X = (X_1, X_2, …, X_n) is a random sample from the beta distribution with left parameter a and right parameter b. Then (P, Q) is minimally sufficient for (a, b) where P = ∏_{i=1}^n X_i and Q = ∏_{i=1}^n (1 − X_i).
Proof
The proof also shows that P is sufficient for a if b is known, and that Q is sufficient for b if a is known.

Recall that the method of moments estimators of a and b are

\[ U = \frac{M(M - M^{(2)})}{M^{(2)} - M^2}, \qquad V = \frac{(1 - M)(M - M^{(2)})}{M^{(2)} - M^2} \tag{7.6.35} \]

respectively, where M = (1/n) ∑_{i=1}^n X_i is the sample mean and M^{(2)} = (1/n) ∑_{i=1}^n X_i² is the second order sample mean. If b is known, the method of moments estimator of a is U_b = bM/(1 − M), while if a is known, the method of moments estimator of b is V_a = a(1 − M)/M. None of these estimators is a function of the sufficient statistics (P, Q) and so all suffer from a loss of information. On the other hand, if b = 1, the maximum likelihood estimator of a on the interval (0, ∞) is W = −n / ∑_{i=1}^n ln X_i, which is a function of P (as it must be).
Run the beta estimation experiment 1000 times with various values of the parameters. Compare the estimates of the parameters.
The Pareto Distribution
Recall that the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b ∈ (0, ∞) is a continuous distribution on [b, ∞) with probability density function g given by

\[ g(x) = \frac{a b^a}{x^{a+1}}, \quad b \le x < \infty \]

The Bernoulli Distribution
In our previous discussion of Bayesian estimation, we showed that the beta distribution is conjugate for p. Specifically, if the prior distribution of p is beta with left parameter a > 0 and right parameter b > 0, then the posterior distribution of p given X is beta with left parameter a + Y and right parameter b + (n − Y); the left parameter is increased by the number of successes and the right parameter by the number of failures. It follows that a 1 − α level Bayesian confidence interval for p is [U_{α/2}(y), U_{1−α/2}(y)] where U_r(y) is the quantile of order r for the posterior beta distribution. In the special case a = b = 1 the prior distribution is uniform on (0, 1) and reflects a lack of previous knowledge about p.
Suppose that we have a coin with an unknown probability p of heads, and that we give p the uniform prior, reflecting our lack of knowledge about p. We then toss the coin 50 times, observing 30 heads.
1. Find the posterior distribution of p given the data. 2. Construct the 95% Bayesian confidence interval. 3. Construct the classical Wald confidence interval at the 95% level. Answer
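A minimal sketch of the computation, assuming SciPy is available: with the uniform prior (a = b = 1), n = 50 tosses and y = 30 heads, the posterior is beta(31, 21), and the interval endpoints are posterior quantiles.

from scipy.stats import beta

a, b, n, y = 1, 1, 50, 30
post = beta(a + y, b + (n - y))           # posterior: beta(31, 21), part (a)
print(post.ppf(0.025), post.ppf(0.975))   # 95% Bayesian confidence interval, part (b)

phat = y / n
half = 1.96 * (phat * (1 - phat) / n) ** 0.5
print(phat - half, phat + half)           # classical Wald interval, part (c)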
The Poisson Distribution
Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the Poisson distribution with parameter λ ∈ (0, ∞). Recall that the Poisson distribution is often used to model the number of “random points” in a region of time or space and is studied in more detail in the chapter on the Poisson Process. The distribution is named for the inimitable Simeon Poisson and, given λ, has probability density function

\[ g(x \mid \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}, \quad x \in \mathbb{N} \tag{8.5.10} \]

As usual, we will denote the sum of the sample values by Y = ∑_{i=1}^n X_i. Given λ, the random variable Y also has a Poisson distribution, but with parameter nλ.
In our previous discussion of Bayesian estimation, we showed that the gamma distribution is conjugate for λ. Specifically, if the prior distribution of λ is gamma with shape parameter k > 0 and rate parameter r > 0 (so that the scale parameter is 1/r), then the posterior distribution of λ given X is gamma with shape parameter k + Y and rate parameter r + n. It follows that a 1 − α level Bayesian confidence interval for λ is [U_{α/2}(y), U_{1−α/2}(y)] where U_p(y) is the quantile of order p for the posterior gamma distribution.
Consider the alpha emissions data, which we believe come from a Poisson distribution with unknown parameter λ. Suppose that a priori, we believe that λ is about 5, so we give λ a prior gamma distribution with shape parameter 5 and rate parameter 1. (Thus the mean is 5 and the standard deviation is √5 ≈ 2.236.)

1. Find the posterior distribution of λ given the data.
2. Construct the 95% Bayesian confidence interval.
3. Construct the classical t confidence interval at the 95% level.

Answer
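The posterior computation follows the conjugate update described above. The sketch below assumes SciPy; the array `data` is a hypothetical stand-in for the alpha emissions counts referenced in the exercise, included only to make the code runnable.

import numpy as np
from scipy.stats import gamma

data = np.array([4, 7, 5, 6, 3])          # hypothetical counts, for illustration only
k, r = 5, 1                               # prior gamma shape and rate
k_post, r_post = k + data.sum(), r + len(data)
post = gamma(a=k_post, scale=1 / r_post)  # posterior gamma distribution
print(post.ppf(0.025), post.ppf(0.975))   # 95% Bayesian confidence interval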
The Normal Distribution
Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the normal distribution with unknown mean μ ∈ R and known variance σ² ∈ (0, ∞). Of course, the normal distribution plays an especially important role in statistics, in part because of the central limit theorem. The normal distribution is widely used to model physical quantities subject to numerous small, random errors. Recall that the normal probability density function (given the parameters) is

\[ g(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right], \quad x \in \mathbb{R} \tag{8.5.11} \]

We denote the sum of the sample values by Y = ∑_{i=1}^n X_i. Recall that Y also has a normal distribution (given μ and σ), but with mean nμ and variance nσ².
In our previous discussion of Bayesian estimation, we showed that the normal distribution is conjugate for μ (with σ known). Specifically, if the prior distribution of μ is normal with mean a ∈ R and standard deviation b ∈ (0, ∞), then the posterior distribution of μ given X is also normal, with

\[ E(\mu \mid X) = \frac{Y b^2 + a \sigma^2}{\sigma^2 + n b^2}, \qquad \operatorname{var}(\mu \mid X) = \frac{\sigma^2 b^2}{\sigma^2 + n b^2} \tag{8.5.12} \]

It follows that a 1 − α level Bayesian confidence interval for μ is [U_{α/2}(y), U_{1−α/2}(y)] where U_p(y) is the quantile of order p for the posterior normal distribution. An interesting special case is when b = σ, so that the standard deviation of the prior distribution of μ is the same as the standard deviation of the sampling distribution. In this case, the posterior mean is (Y + a)/(n + 1) and the posterior variance is σ²/(n + 1).

The length of a certain machined part is supposed to be 10 centimeters, but due to imperfections in the manufacturing process, the actual length is normally distributed with mean μ and variance σ². The variance is due to inherent factors in the process, which remain fairly stable over time. From historical data, it is known that σ = 0.3. On the other hand, μ may be set by adjusting various parameters in the process and hence may change to an unknown value fairly frequently. Thus, suppose that we give μ a prior normal distribution with mean 10 and standard deviation 0.03. A sample of 100 parts has mean 10.2.
1. Find the posterior distribution of μ given the data. 2. Construct the 95% Bayesian confidence interval. 3. Construct the classical z confidence interval at the 95% level. Answer
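A minimal sketch of parts (a) and (b) via formula (8.5.12), assuming SciPy. Here a = 10 and b = 0.03 are the prior parameters, σ = 0.3, n = 100, and the sample mean is 10.2, so Y = 1020.

from math import sqrt
from scipy.stats import norm

a, b, sigma, n, xbar = 10.0, 0.03, 0.3, 100, 10.2
y = n * xbar                                      # sum of the sample values
mean = (y * b**2 + a * sigma**2) / (sigma**2 + n * b**2)
var = (sigma**2 * b**2) / (sigma**2 + n * b**2)
post = norm(mean, sqrt(var))                      # posterior normal distribution
print(mean, post.ppf(0.025), post.ppf(0.975))     # posterior mean is 10.1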
The Beta Distribution
Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the beta distribution with unknown left shape parameter a ∈ (0, ∞) and right shape parameter b = 1. The beta distribution is widely used to model random proportions and probabilities and other variables that take values in bounded intervals. Recall that the probability density function (given a) is

\[ g(x \mid a) = a x^{a-1}, \quad x \in (0, 1) \tag{8.5.13} \]

We denote the product of the sample values by W = X_1 X_2 ⋯ X_n.
In our previous discussion of Bayesian estimation, we showed that the gamma distribution is conjugate for a. Specifically, if the prior distribution of a is gamma with shape parameter k > 0 and rate parameter r > 0, then the posterior distribution of a given X is also gamma, with shape parameter k + n and rate parameter r − ln(W). It follows that a 1 − α level Bayesian confidence interval for a is [U_{α/2}(w), U_{1−α/2}(w)] where U_p(w) is the quantile of order p for the posterior gamma distribution. In the special case that k = 1, the prior distribution of a is exponential with rate parameter r.
Suppose that the resistance of an electrical component (in Ohms) has the beta distribution with unknown left parameter a and right parameter b = 1 . We believe that a may be about 10, so we give a the prior gamma distribution with shape parameter 10 and rate parameter 1. We sample 20 components and observe the data 0.98, 0.93, 0.99, 0.89, 0.79, 0.99, 0.92, 0.97, 0.88, 0.97, 0.86, 0.84, 0.96, 0.97, 0.92, 0.90, 0.98, 0.96, 0.96, 1.00 (8.5.14)
1. Find the posterior distribution of a . 2. Construct the 95% Bayesian confidence interval for a . Answer
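A minimal sketch of the computation, assuming SciPy: with prior gamma(10, 1), the posterior of a is gamma with shape 10 + 20 = 30 and rate 1 − ln(W), where ln(W) is the sum of the logs of the data values.

import numpy as np
from scipy.stats import gamma

data = np.array([0.98, 0.93, 0.99, 0.89, 0.79, 0.99, 0.92, 0.97, 0.88, 0.97,
                 0.86, 0.84, 0.96, 0.97, 0.92, 0.90, 0.98, 0.96, 0.96, 1.00])
k, r = 10, 1                              # prior gamma shape and rate
k_post = k + len(data)
r_post = r - np.log(data).sum()           # ln(W) < 0 here, so the rate increases
post = gamma(a=k_post, scale=1 / r_post)  # posterior gamma distribution, part (a)
print(post.ppf(0.025), post.ppf(0.975))   # 95% Bayesian confidence interval, part (b)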
The Pareto Distribution
Suppose that X = (X_1, X_2, …, X_n) is a random sample of size n from the Pareto distribution with shape parameter a ∈ (0, ∞) and scale parameter b = 1. The Pareto distribution is used to model certain financial variables and other variables with heavy-tailed distributions, and is named for Vilfredo Pareto. Recall that the probability density function (given a) is

\[ g(x \mid a) = \frac{a}{x^{a+1}}, \quad x \in [1, \infty) \tag{8.5.15} \]

We denote the product of the sample values by W = X_1 X_2 ⋯ X_n.
In our previous discussion of Bayesian estimation, we showed that the gamma distribution is conjugate for a. Specifically, if the prior distribution of a is gamma with shape parameter k > 0 and rate parameter r > 0, then the posterior distribution of a given X is also gamma, with shape parameter k + n and rate parameter r + ln(W). It follows that a 1 − α level Bayesian confidence interval for a is [U_{α/2}(w), U_{1−α/2}(w)] where U_p(w) is the quantile of order p for the posterior gamma distribution. In the special case that k = 1, the prior distribution of a is exponential with rate parameter r.
Suppose that a financial variable has the Pareto distribution with unknown shape parameter a and scale parameter b = 1 . We believe that a may be about 4, so we give a the prior gamma distribution with shape parameter 4 and rate parameter 1. A random sample of size 20 from the variable gives the data 1.09, 1.13, 2.00, 1.43, 1.26, 1.00, 1.36, 1.03, 1.46, 1.18, 2.16, 1.16, 1.22, 1.06, 1.28, 1.23, 1.11, 1.03, 1.04, 1.05 (8.5.16)
1. Find the posterior distribution of a.
2. Construct the 95% Bayesian confidence interval for a.

Answer

This page titled 8.5: Bayesian Set Estimation is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
CHAPTER OVERVIEW

9: Hypothesis Testing
Hypothesis testing refers to the process of choosing between competing hypotheses about a probability distribution, based on observed data from the distribution. It is a core topic in mathematical statistics, and indeed is a fundamental part of the language of statistics. In this chapter, we study the basics of hypothesis testing, and explore hypothesis tests in some of the most important parametric models: the normal model and the Bernoulli model.

9.1: Introduction to Hypothesis Testing
9.2: Tests in the Normal Model
9.3: Tests in the Bernoulli Model
9.4: Tests in the Two-Sample Normal Model
9.5: Likelihood Ratio Tests
9.6: Chi-Square Tests
This page titled 9: Hypothesis Testing is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
9.1: Introduction to Hypothesis Testing

Basic Theory

Preliminaries
As usual, our starting point is a random experiment with an underlying sample space and a probability measure P. In the basic statistical model, we have an observable random variable X taking values in a set S. In general, X can have quite a complicated structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest, then

\[ X = (X_1, X_2, \ldots, X_n) \tag{9.1.1} \]

where X_i is the vector of measurements for the ith object. The most important special case occurs when (X_1, X_2, …, X_n) are independent and identically distributed. In this case, we have a random sample of size n from the common distribution.
The purpose of this section is to define and discuss the basic concepts of statistical hypothesis testing. Collectively, these concepts are sometimes referred to as the Neyman-Pearson framework, in honor of Jerzy Neyman and Egon Pearson, who first formalized them.
Hypotheses
A statistical hypothesis is a statement about the distribution of X. Equivalently, a statistical hypothesis specifies a set of possible distributions of X: the set of distributions for which the statement is true. A hypothesis that specifies a single distribution for X is called simple; a hypothesis that specifies more than one distribution for X is called composite. In hypothesis testing, the goal is to see if there is sufficient statistical evidence to reject a presumed null hypothesis in favor of a conjectured alternative hypothesis. The null hypothesis is usually denoted H_0 while the alternative hypothesis is usually denoted H_1.
An hypothesis test is a statistical decision; the conclusion will either be to reject the null hypothesis in favor of the alternative, or to fail to reject the null hypothesis. The decision that we make must, of course, be based on the observed value x of the data vector X. Thus, we will find an appropriate subset R of the sample space S and reject H_0 if and only if x ∈ R. The set R is known as the rejection region or the critical region. Note the asymmetry between the null and alternative hypotheses. This asymmetry is due to the fact that we assume the null hypothesis, in a sense, and then see if there is sufficient evidence in x to overturn this assumption in favor of the alternative.

An hypothesis test is a statistical analogy to proof by contradiction, in a sense. Suppose for a moment that H_1 is a statement in a mathematical theory and that H_0 is its negation. One way that we can prove H_1 is to assume H_0 and work our way logically to a contradiction. In an hypothesis test, we don't “prove” anything of course, but there are similarities. We assume H_0 and then see if the data x are sufficiently at odds with that assumption that we feel justified in rejecting H_0 in favor of H_1.
Often, the critical region is defined in terms of a statistic w(X), known as a test statistic, where w is a function from S into another set T. We find an appropriate rejection region R_T ⊆ T and reject H_0 when the observed value w(x) ∈ R_T. Thus, the rejection region in S is then R = w^{−1}(R_T) = {x ∈ S : w(x) ∈ R_T}. As usual, the use of a statistic often allows significant data reduction when the dimension of the test statistic is much smaller than the dimension of the data vector.
Errors
The ultimate decision may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is actually true.

Types of errors:
1. A type 1 error is rejecting the null hypothesis H_0 when H_0 is true.
2. A type 2 error is failing to reject the null hypothesis H_0 when the alternative hypothesis H_1 is true.
Similarly, there are two ways to make a correct decision: we could reject H_0 when H_1 is true or we could fail to reject H_0 when H_0 is true. The possibilities are summarized in the following table:

Hypothesis Test
State \ Decision | Fail to reject H_0 | Reject H_0
H_0 True         | Correct            | Type 1 error
H_1 True         | Type 2 error       | Correct
Of course, when we observe X = x and make our decision, either we will have made the correct decision or we will have committed an error, and usually we will never know which of these events has occurred. Prior to gathering the data, however, we can consider the probabilities of the various errors. If H_0 is true (that is, the distribution of X is specified by H_0), then P(X ∈ R) is the probability of a type 1 error for this distribution. If H_0 is composite, then H_0 specifies a variety of different distributions for X and thus there is a set of type 1 error probabilities.
The maximum probability of a type 1 error, over the set of distributions specified by H_0, is the significance level of the test or the size of the critical region.

The significance level is often denoted by α. Usually, the rejection region is constructed so that the significance level is a prescribed, small value (typically 0.1, 0.05, 0.01). If H_1 is true (that is, the distribution of X is specified by H_1), then P(X ∉ R) is the probability of a type 2 error for this distribution. Again, if H_1 is composite then H_1 specifies a variety of different distributions for X, and thus there will be a set of type 2 error probabilities. Generally, there is a tradeoff between the type 1 and type 2 error probabilities. If we reduce the probability of a type 1 error, by making the rejection region R smaller, we necessarily increase the probability of a type 2 error because the complementary region S ∖ R is larger.

The extreme cases can give us some insight. First consider the decision rule in which we never reject H_0, regardless of the evidence x. This corresponds to the rejection region R = ∅. A type 1 error is impossible, so the significance level is 0. On the other hand, the probability of a type 2 error is 1 for any distribution defined by H_1. At the other extreme, consider the decision rule in which we always reject H_0 regardless of the evidence x. This corresponds to the rejection region R = S. A type 2 error is impossible, but now the probability of a type 1 error is 1 for any distribution defined by H_0. In between these two worthless tests are meaningful tests that take the evidence x into account.
Power
If H_1 is true, so that the distribution of X is specified by H_1, then P(X ∈ R), the probability of rejecting H_0, is the power of the test for that distribution.

Thus the power of the test for a distribution specified by H_1 is the probability of making the correct decision.
Suppose that we have two tests, corresponding to rejection regions R_1 and R_2, respectively, each having significance level α. The test with region R_1 is uniformly more powerful than the test with region R_2 if

\[ P(X \in R_1) \ge P(X \in R_2) \text{ for every distribution of } X \text{ specified by } H_1 \tag{9.1.2} \]

Naturally, in this case, we would prefer the first test. Often, however, two tests will not be uniformly ordered; one test will be more powerful for some distributions specified by H_1 while the other test will be more powerful for other distributions specified by H_1.
If a test has significance level α and is uniformly more powerful than any other test with significance level α , then the test is said to be a uniformly most powerful test at level α . Clearly a uniformly most powerful test is the best we can do.
P-value
In most cases, we have a general procedure that allows us to construct a test (that is, a rejection region R_α) for any given significance level α ∈ (0, 1). Typically, R_α decreases (in the subset sense) as α decreases.

The P-value of the observed value x of X, denoted P(x), is defined to be the smallest α for which x ∈ R_α; that is, the smallest significance level for which H_0 is rejected, given X = x.

Knowing P(x) allows us to test H_0 at any significance level for the given data x: If P(x) ≤ α then we would reject H_0 at significance level α; if P(x) > α then we fail to reject H_0 at significance level α. Note that P(X) is a statistic. Informally, P(x) can often be thought of as the probability of an outcome “as or more extreme” than the observed value x, where extreme is interpreted relative to the null hypothesis H_0.
Tests of an Unknown Parameter Hypothesis testing is a very general concept, but an important special class occurs when the distribution of the data variable X depends on a parameter θ taking values in a parameter space Θ. The parameter may be vector-valued, so that θ = (θ , θ , … , θ ) and Θ ⊆ R for some k ∈ N . The hypotheses generally take the form 1
2
n
k
+
H0 : θ ∈ Θ0 versus H1 : θ ∉ Θ0
(9.1.3)
where Θ is a prescribed subset of the parameter space Θ. In this setting, the probabilities of making an error or a correct decision depend on the true value of θ . If R is the rejection region, then the power function Q is given by 0
Q(θ) = Pθ (X ∈ R),
θ ∈ Θ
(9.1.4)
The power function gives a lot of information about the test. The power function satisfies the following properties: 1. Q(θ) is the probability of a type 1 error when θ ∈ Θ . 2. max {Q(θ) : θ ∈ Θ } is the significance level of the test. 3. 1 − Q(θ) is the probability of a type 2 error when θ ∉ Θ . 4. Q(θ) is the power of the test when θ ∉ Θ . 0
0
0
0
If we have two tests, we can compare them by means of their power functions. Suppose that we have two tests, corresponding to rejection regions R and R , respectively, each having significance level α . The test with rejection region R is uniformly more powerful than the test with rejection region R if Q (θ) ≥ Q (θ) for all θ ∉ Θ . 1
2
1
2
1
2
0
Most hypothesis tests of an unknown real parameter θ fall into three special cases: Suppose that θ is a real parameter and tailed test, and the right-tailed test. 1. H
0
: θ = θ0
versus H
1
θ0 ∈ Θ
a specified value. The tests below are respectively the two-sided test, the left-
: θ ≠ θ0
9.1.3
https://stats.libretexts.org/@go/page/10211
2. H 3. H
0
: θ ≥ θ0
0
: θ ≤ θ0
versus H versus H
1
: θ < θ0
1
: θ > θ0
Thus the tests are named after the conjectured alternative. Of course, there may be other unknown parameters besides θ (known as nuisance parameters).
Equivalence Between Hypothesis Test and Confidence Sets There is an equivalence between hypothesis tests and confidence sets for a parameter θ . Suppose that C (x) is a 1 − α level confidence set for θ . The following test has significance level H : θ =θ versus H : θ ≠ θ : Reject H if and only if θ ∉ C (x) 0
0
1
0
0
α
for the hypothesis
0
Proof Equivalently, we fail to reject H at significance level α if and only if θ is in the corresponding 1 − α level confidence set. In particular, this equivalence applies to interval estimates of a real parameter θ and the common tests for θ given above. 0
0
In each case below, the confidence interval has confidence level 1 − α and the test has significance level α . 1. Suppose that [L(X, U (X)] is a two-sided confidence interval for θ . Reject H : θ = θ versus H : θ ≠ θ if and only if θ < L(X) or θ > U (X) . 2. Suppose that L(X) is a confidence lower bound for θ . Reject H : θ ≤ θ versus H : θ > θ if and only if θ < L(X) . 3. Suppose that U (X) is a confidence upper bound for θ . Reject H : θ ≥ θ versus H : θ < θ if and only if θ > U (X) . 0
0
0
1
0
0
0
0
0
0
1
1
0
0
0
0
Pivot Variables and Test Statistics
Recall that confidence sets of an unknown parameter θ are often constructed through a pivot variable, that is, a random variable W(X, θ) that depends on the data vector X and the parameter θ, but whose distribution does not depend on θ and is known. In this case, a natural test statistic for the basic tests given above is W(X, θ_0).
This page titled 9.1: Introduction to Hypothesis Testing is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
9.2: Tests in the Normal Model

Basic Theory

The Normal Model
The normal distribution is perhaps the most important distribution in the study of mathematical statistics, in part because of the central limit theorem. As a consequence of this theorem, a measured quantity that is subject to numerous small, random errors will have, at least approximately, a normal distribution. Such variables are ubiquitous in statistical experiments, in subjects varying from the physical and biological sciences to the social sciences. So in this section, we assume that X = (X_1, X_2, …, X_n) is a random sample from the normal distribution with mean μ and standard deviation σ. Our goal in this section is to construct hypothesis tests for μ and σ; these are among the most important special cases of hypothesis testing. This section parallels the section on Estimation in the Normal Model in the chapter on Set Estimation, and in particular, the duality between interval estimation and hypothesis testing will play an important role. But first we need to review some basic facts that will be critical for our analysis.
Recall that the sample mean M and sample variance S² are

\[ M = \frac{1}{n} \sum_{i=1}^n X_i, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - M)^2 \tag{9.2.1} \]
From our study of point estimation, recall that M is an unbiased and consistent estimator of μ while S² is an unbiased and consistent estimator of σ². From these basic statistics we can construct the test statistics that will be used to construct our hypothesis tests. The following results were established in the section on Special Properties of the Normal Distribution.

Define

\[ Z = \frac{M - \mu}{\sigma / \sqrt{n}}, \qquad T = \frac{M - \mu}{S / \sqrt{n}}, \qquad V = \frac{n-1}{\sigma^2} S^2 \tag{9.2.2} \]

1. Z has the standard normal distribution.
2. T has the student t distribution with n − 1 degrees of freedom.
3. V has the chi-square distribution with n − 1 degrees of freedom.
4. Z and V are independent.
It follows that each of these random variables is a pivot variable for (μ, σ) since the distributions do not depend on the parameters, but the variables themselves functionally depend on one or both parameters. The pivot variables will lead to natural test statistics that can then be used to perform the hypothesis tests of the parameters. To construct our tests, we will need quantiles of these standard distributions. The quantiles can be computed using the special distribution calculator or from most mathematical and statistical software packages. Here is the notation we will use:

Let p ∈ (0, 1) and k ∈ N_+.

1. z(p) denotes the quantile of order p for the standard normal distribution.
2. t_k(p) denotes the quantile of order p for the student t distribution with k degrees of freedom.
3. χ²_k(p) denotes the quantile of order p for the chi-square distribution with k degrees of freedom.

Since the standard normal and student t distributions are symmetric about 0, it follows that z(1 − p) = −z(p) and t_k(1 − p) = −t_k(p) for p ∈ (0, 1) and k ∈ N_+. On the other hand, the chi-square distribution is not symmetric.
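As a concrete illustration of the quantile notation, here is a minimal sketch assuming SciPy is available; the order p = 0.95 and the degrees of freedom k = 9 are arbitrary choices.

from scipy.stats import norm, t, chi2

p, k = 0.95, 9
print(norm.ppf(p))      # z(0.95), approximately 1.645
print(t.ppf(p, k))      # t_9(0.95), approximately 1.833
print(chi2.ppf(p, k))   # chi-square_9(0.95), approximately 16.92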
Tests for the Mean with Known Standard Deviation
For our first discussion, we assume that the distribution mean μ is unknown but the standard deviation σ is known. This is not always an artificial assumption. There are often situations where σ is stable over time, and hence is at least approximately known, while μ changes because of different “treatments”. Examples are given in the computational exercises below.
For a conjectured μ_0 ∈ R, define the test statistic

\[ Z = \frac{M - \mu_0}{\sigma / \sqrt{n}} \tag{9.2.3} \]

1. If μ = μ_0 then Z has the standard normal distribution.
2. If μ ≠ μ_0 then Z has the normal distribution with mean (μ − μ_0)/(σ/√n) and variance 1.

So in case (b), (μ − μ_0)/(σ/√n) can be viewed as a non-centrality parameter. The graph of the probability density function of Z is like that of the standard normal probability density function, but shifted to the right or left by the non-centrality parameter, depending on whether μ > μ_0 or μ < μ_0.
For α ∈ (0, 1), each of the following tests has significance level α:

1. Reject H_0: μ = μ_0 versus H_1: μ ≠ μ_0 if and only if Z < −z(1 − α/2) or Z > z(1 − α/2), if and only if M < μ_0 − z(1 − α/2) σ/√n or M > μ_0 + z(1 − α/2) σ/√n.
2. Reject H_0: μ ≤ μ_0 versus H_1: μ > μ_0 if and only if Z > z(1 − α), if and only if M > μ_0 + z(1 − α) σ/√n.
3. Reject H_0: μ ≥ μ_0 versus H_1: μ < μ_0 if and only if Z < −z(1 − α), if and only if M < μ_0 − z(1 − α) σ/√n.
σ √n
2. μ
≤ M + z(1 − α)
3. μ
≥ M − z(1 − α)
0
0
H0
≤ μ0 ≤ M + z(1 − α/2)
at significance level
α
if and only if
μ0
is in the corresponding
1 −α
σ √n
σ √n σ √n
Proof The two-sided test in (a) corresponds to α/2 in each tail of the distribution of the test statistic Z , under H . This set is said to be unbiased. But of course we can construct other biased tests by partitioning the confidence level α between the left and right tails in a non-symmetric way. 0
For every
, the following test has significance level or Z ≥ z(1 − pα) .
α, p ∈ (0, 1)
Z < z(α − pα)
α
: Reject
H0 : μ = μ0
versus
H1 : μ ≠ μ0
if and only if
1. p = gives the symmetric, unbiased test. 2. p ↓ 0 gives the left-tailed test. 3. p ↑ 1 gives the right-tailed test. 1 2
Proof The P -value of these test can be computed in terms of the standard normal distribution function Φ. The P -values of the standard tests above are respectively 1. 2 [1 − Φ (|Z|)] 2. 1 − Φ(Z) 3. Φ(Z) Recall that the power function of a test of a parameter is the probability of rejecting the null hypothesis, as a function of the true value of the parameter. Our next series of results will explore the power functions of the tests above. The power function of the general two-sided test above is given by
9.2.2
https://stats.libretexts.org/@go/page/10212
− √n Q(μ) = Φ (z(α − pα) −
− √n (μ − μ0 )) + Φ (
σ
(μ − μ0 ) − z(1 − pα)) ,
μ ∈ R
(9.2.4)
σ
1. Q is decreasing on (−∞, m ) and increasing on (m , ∞) where m 2. Q(μ ) = α . 3. Q(μ) → 1 as μ ↑ ∞ and Q(μ) → 1 as μ ↓ −∞ . 4. If p = then Q is symmetric about μ (and m = μ ). 5. As p increases, Q(μ) increases if μ > μ and decreases if μ < μ . 0
0
0
= μ0 + [z(α − pα) + z(1 − pα)]
√n 2σ
.
0
1
0
2
0
0
0
0
So by varying p, we can make the test more powerful for some values of μ , but only at the expense of making the test less powerful for other values of μ . The power function of the left-tailed test above is given by − √n Q(μ) = Φ (z(α) + σ
(μ − μ0 )) ,
μ ∈ R
(9.2.5)
(μ − μ0 )) ,
μ ∈ R
(9.2.6)
1. Q is increasing on R. 2. Q(μ ) = α . 3. Q(μ) → 1 as μ ↑ ∞ and Q(μ) → 0 as μ ↓ −∞ . 0
The power function of the right-tailed test above, is given by − √n Q(μ) = Φ (z(α) − σ
1. Q is decreasing on R. 2. Q(μ ) = α . 3. Q(μ) → 0 as μ ↑ ∞ and Q(μ) → 1 as μ ↓ −∞ . 0
For any of the three tests in above , increasing the sample size n or decreasing the standard deviation σ results in a uniformly more powerful test. In the mean test experiment, select the normal test statistic and select the normal sampling distribution with standard deviation σ = 2 , significance level α = 0.1 , sample size n = 20 , and μ = 0 . Run the experiment 1000 times for several values of the true distribution mean μ . For each value of μ , note the relative frequency of the event that the null hypothesis is rejected. Sketch the empirical power function. 0
In the mean estimate experiment, select the normal pivot variable and select the normal distribution with μ = 0 and standard deviation σ = 2 , confidence level 1 − α = 0.90 , and sample size n = 10 . For each of the three types of confidence intervals, run the experiment 20 times. State the corresponding hypotheses and significance level, and for each run, give the set of μ for which the null hypothesis would be rejected. 0
In many cases, the first step is to design the experiment so that the significance level is α and so that the test has a given power β for a given alternative μ . 1
For either of the one-sided tests in above, the sample size alternative μ is
n
needed for a test with significance level
α
and power
β
for the
1
σ [z(β) − z(α)] n =(
2
)
(9.2.7)
μ1 − μ0
Proof For the unbiased, two-sided test, the sample size n needed for a test with significance level α and power β for the alternative μ is approximately 1
9.2.3
https://stats.libretexts.org/@go/page/10212
2
σ [z(β) − z(α/2)] n =(
)
(9.2.8)
μ1 − μ0
Proof
Tests of the Mean with Unknown Standard Deviation For our next discussion, we construct tests of μ without requiring the assumption that σ is known. And in applications of course, σ is usually unknown. For a conjectured μ
0
∈ R
, define the test statistic T =
M − μ0
(9.2.9)
− S/ √n
1. If μ = μ , the statistic T has the student t distribution with n − 1 degrees of freedom. 2. If μ ≠ μ then T has a non-central t distribution with n − 1 degrees of freedom and non-centrality parameter 0
0
μ−μ0 σ/ √n
.
In case (b), the graph of the probability density function of T is much (but not exactly) the same as that of the ordinary t distribution with n − 1 degrees of freedom, but shifted to the right or left by the non-centrality parameter, depending on whether μ >μ or μ < μ . 0
0
For α ∈ (0, 1), each of the following tests has significance level α : 1. Reject H
0
: μ = μ0
versus H
1
M < μ0 − tn−1 (1 − α/2)
2. Reject H
0
3. Reject H
0
: μ ≠ μ0
or T
S √n
: μ ≤ μ0
versus H
: μ ≥ μ0
versus H
1
1
if and only if T
< −tn−1 (1 − α/2)
> μ0 + tn−1 (1 − α/2)
: μ > μ0
if and only if T
: μ < μ0
if and only if T
S √n
or T
> tn−1 (1 − α/2)
if and only if
.
> tn−1 (1 − α)
if and only if M
< −tn−1 (1 − α)
> μ0 + tn−1 (1 − α)
if and only if M
.
S √n
< μ0 − tn−1 (1 − α)
S √n
.
Proof Part (a) is the standard two-sided test, while (b) is the right-tailed test and (c) is the left-tailed test. Note that in each case, the hypothesis test is the dual of the corresponding interval estimate constructed in the section on Estimation in the Normal Model. For each of the tests above, we fail to reject confidence interval. 1. M − t
n−1 (1
− α/2)
S √n
2. μ
≤ M + tn−1 (1 − α)
3. μ
≥ M − tn−1 (1 − α)
0
0
H0
at significance level
≤ μ0 ≤ M + tn−1 (1 − α/2)
α
if and only if
μ0
is in the corresponding
1 −α
S √n
S √n S √n
Proof
The two-sided test in (a) corresponds to α/2 in each tail of the distribution of the test statistic T, under H_0. This test is said to be unbiased. But of course we can construct other biased tests by partitioning the significance level α between the left and right tails in a non-symmetric way.
For every
, the following test has significance level α : Reject H : μ = μ versus or T ≥ t (1 − pα) if and only if M < μ + t (α − pα) or M > μ
α, p ∈ (0, 1)
T < tn−1 (α − pα)
0
0
S
n−1
0
n−1
√n
0
H1 : μ ≠ μ0
if and only if .
+ tn−1 (1 − pα)
S
√n
1. p = gives the symmetric, unbiased test. 2. p ↓ 0 gives the left-tailed test. 3. p ↑ 1 gives the right-tailed test. 1 2
Proof The P -value of these test can be computed in terms of the distribution function freedom.
9.2.4
Φn−1
of the t -distribution with
n−1
degrees of
https://stats.libretexts.org/@go/page/10212
The P -values of the standard tests above are respectively 1. 2 [1 − Φ (|T |)] 2. 1 − Φ (T ) 3. Φ (T ) n−1
n−1
n−1
In the mean test experiment, select the student test statistic and select the normal sampling distribution with standard deviation σ = 2 , significance level α = 0.1 , sample size n = 20 , and μ = 1 . Run the experiment 1000 times for several values of the true distribution mean μ . For each value of μ , note the relative frequency of the event that the null hypothesis is rejected. Sketch the empirical power function. 0
In the mean estimate experiment, select the student pivot variable and select the normal sampling distribution with mean 0 and standard deviation 2. Select confidence level 0.90 and sample size 10. For each of the three types of intervals, run the experiment 20 times. State the corresponding hypotheses and significance level, and for each run, give the set of μ for which the null hypothesis would be rejected. 0
The power function for the t tests above can be computed explicitly in terms of the non-central t distribution function. Qualitatively, the graphs of the power functions are similar to the case when σ is known, given above two-sided, left-tailed, and right-tailed cases. If an upper bound σ on the standard deviation σ is known, then conservative estimates on the sample size needed for a given confidence level and a given margin of error can be obtained using the methods for the normal pivot variable, in the two-sided and one-sided cases. 0
Tests of the Standard Deviation For our next discussion, we will construct hypothesis tests for the distribution standard deviation σ. So our assumption is that σ is unknown, and of course almost always, μ would be unknown as well. For a conjectured value σ
0
, define the test statistic
∈ (0, ∞)
n−1 V = σ
2
S
2
(9.2.10)
0
1. If σ = σ , then V has the chi-square distribution with n − 1 degrees of freedom. 2. If σ ≠ σ then V has the gamma distribution with shape parameter (n − 1)/2 and scale parameter 2σ 0
2
0
/σ
2
0
.
Recall that the ordinary chi-square distribution with n − 1 degrees of freedom is the gamma distribution with shape parameter (n − 1)/2 and scale parameter . So in case (b), the ordinary chi-square distribution is scaled by σ /σ . In particular, the scale factor is greater than 1 if σ > σ and less than 1 if σ < σ . 1
2
2
0
2
0
0
For every α ∈ (0, 1), the following test has significance level α : 1. Reject H
0
S
2
: σ = σ0 σ
2
χ
n−1
if and only if V σ
(1 − α/2)
2
χ
0
1
2
>χ
n−1
(1 − α/2)
if and only if
n−1
: σ ≥ σ0
1
or V
0
2. Reject H
0
(α/2)
2
2 n−1 2 n−1
(α)
if and only if S
(1 − α)
2
σ
2
χ
2 0
n−1
σ
(1 − α)
2 0
n−1
Proof Part (a) is the unbiased, two-sided test that corresponds to α/2 in each tail of the chi-square distribution of the test statistic V , under H . Part (b) is the left-tailed test and part (c) is the right-tailed test. Once again, we have a duality between the hypothesis tests and the interval estimates constructed in the section on Estimation in the Normal Model. 0
For each of the tests in above, we fail to reject confidence interval. That is
H0
at significance level
9.2.5
α
if and only if
σ
2
0
is in the corresponding
1 −α
https://stats.libretexts.org/@go/page/10212
1.
n−1 2
χ
n−1
2. σ
2
n−1
≤
0
S
2
χ
n−1
3. σ
2
≤σ
2
χ
n−1
2
≤
0
n−1 2
χ
n−1
S
S
2
(α/2)
2
(α)
n−1
≥
0
2
(1−α/2)
S
2
(1−α)
Proof As before, we can construct more general two-sided tests by partitioning the significance level α between the left and right tails of the chi-square distribution in an arbitrary way. For every 2
V ≤χ
n−1
, the following test has significance level
α, p ∈ (0, 1) (α − pα)
or V
2
≥χ
n−1
(1 − pα)
if and only if S
2
α
: Reject
H0 : σ = σ0 σ
2
χ
n−1
(1 − pα)
2 0
n−1
if and only if .
1. p = gives the equal-tail test. 2. p ↓ 0 gives the left-tail test. 3. p ↑ 1 gives the right-tail test. 1 2
Proof Recall again that the power function of a test of a parameter is the probability of rejecting the null hypothesis, as a function of the true value of the parameter. The power functions of the tests for σ can be expressed in terms of the distribution function G of the chi-square distribution with n − 1 degrees of freedom. n−1
The power function of the general two-sided test above is given by the following formula, and satisfies the given properties: σ Q(σ) = 1 − Gn−1 (
2
0
σ2
σ
2
χ
n−1
1. Q is decreasing on (−∞, σ ) and increasing on (σ 2. Q(σ ) = α . 3. Q(σ) → 1 as σ ↑ ∞ and Q(σ) → 1 as σ ↓ 0.
0,
0
(1 − p α)) + Gn−1 (
∞)
2
0
σ2
2
χ
n−1
(α − p α))
(9.2.11)
.
0
The power function of the left-tailed test in above is given by the following formula, and satisfies the given properties: 2
σ
0
Q(σ) = 1 − Gn−1 (
2
σ
2
χ
n−1
(1 − α))
(9.2.12)
1. Q is increasing on (0, ∞). 2. Q(σ ) = α . 3. Q(σ) → 1 as σ ↑ ∞ and Q(σ) → 0 as σ ↓ 0. 0
The power function for the right-tailed test above is given by the following formula, and satisfies the given properties: σ
2
σ
2
0
Q(σ) = Gn−1 (
2
χ
n−1
(α))
(9.2.13)
1. Q is decreasing on (0, ∞). 2. Q(σ ) = α . 3. Q(σ) → 0 as σ ↑ ∞) and Q(σ) → 0 as σ ↑ ∞ and as σ ↓ 0. 0
In the variance test experiment, select the normal distribution with mean 0, and select significance level 0.1, sample size 10, and test standard deviation 1.0. For various values of the true standard deviation, run the simulation 1000 times. Record the relative frequency of rejecting the null hypothesis and plot the empirical power curve. 1. Two-sided test 2. Left-tailed test 3. Right-tailed test
9.2.6
https://stats.libretexts.org/@go/page/10212
In the variance estimate experiment, select the normal distribution with mean 0 and standard deviation 2, and select confidence level 0.90 and sample size 10. Run the experiment 20 times. State the corresponding hypotheses and significance level, and for each run, give the set of test standard deviations for which the null hypothesis would be rejected. 1. Two-sided confidence interval 2. Confidence lower bound 3. Confidence upper bound
Exercises Robustness The primary assumption that we made is that the underlying sampling distribution is normal. Of course, in real statistical problems, we are unlikely to know much about the sampling distribution, let alone whether or not it is normal. Suppose in fact that the underlying distribution is not normal. When the sample size n is relatively large, the distribution of the sample mean will still be approximately normal by the central limit theorem, and thus our tests of the mean μ should still be approximately valid. On the other hand, tests of the variance σ are less robust to deviations form the assumption of normality. The following exercises explore these ideas. 2
In the mean test experiment, select the gamma distribution with shape parameter 1 and scale parameter 1. For the three different tests and for various significance levels, sample sizes, and values of μ , run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H . When H is true, compare the relative frequency with the significance level. 0
0
0
In the mean test experiment, select the uniform distribution on [0, 4]. For the three different tests and for various significance levels, sample sizes, and values of μ , run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H . When H is true, compare the relative frequency with the significance level. 0
0
0
How large n needs to be for the testing procedure to work well depends, of course, on the underlying distribution; the more this distribution deviates from normality, the larger n must be. Fortunately, convergence to normality in the central limit theorem is rapid and hence, as you observed in the exercises, we can get away with relatively small sample sizes (30 or more) in most cases. In the variance test experiment, select the gamma distribution with shape parameter 1 and scale parameter 1. For the three different tests and for various significance levels, sample sizes, and values of σ , run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H . When H is true, compare the relative frequency with the significance level. 0
0
0
In the variance test experiment, select the uniform distribution on [0, 4]. For the three different tests and for various significance levels, sample sizes, and values of μ , run the experiment 1000 times. For each configuration, note the relative frequency of rejecting H . When H is true, compare the relative frequency with the significance level. 0
0
0
Computational Exercises The length of a certain machined part is supposed to be 10 centimeters. In fact, due to imperfections in the manufacturing process, the actual length is a random variable. The standard deviation is due to inherent factors in the process, which remain fairly stable over time. From historical data, the standard deviation is known with a high degree of accuracy to be 0.3. The mean, on the other hand, may be set by adjusting various parameters in the process and hence may change to an unknown value fairly frequently. We are interested in testing H : μ = 10 versus H : μ ≠ 10 . 0
1
1. Suppose that a sample of 100 parts has mean 10.1. Perform the test at the 0.1 level of significance. 2. Compute the P -value for the data in (a). 3. Compute the power of the test in (a) at μ = 10.05. 4. Compute the approximate sample size needed for significance level 0.1 and power 0.8 when μ = 10.05. Answer
9.2.7
https://stats.libretexts.org/@go/page/10212
A bag of potato chips of a certain brand has an advertised weight of 250 grams. Actually, the weight (in grams) is a random variable. Suppose that a sample of 75 bags has mean 248 and standard deviation 5. At the 0.05 significance level, perform the following tests: 1. H 2. H
versus H : μ < 250 versus H : σ < 7
0
: μ ≥ 250
0
: σ ≥7
1
1
Answer At a telemarketing firm, the length of a telephone solicitation (in seconds) is a random variable. A sample of 50 calls has mean 310 and standard deviation 25. At the 0.1 level of significance, can we conclude that 1. μ > 300 ? 2. σ > 20? Answer At a certain farm the weight of a peach (in ounces) at harvest time is a random variable. A sample of 100 peaches has mean 8.2 and standard deviation 1.0. At the 0.01 level of significance, can we conclude that 1. μ > 8 ? 2. σ < 1.5? Answer The hourly wage for a certain type of construction work is a random variable with standard deviation 1.25. For sample of 25 workers, the mean wage was $6.75. At the 0.01 level of significance, can we conclude that μ < 7.00? Answer
Data Analysis Exercises Using Michelson's data, test to see if the velocity of light is greater than 730 (+299000) km/sec, at the 0.005 significance level. Answer Using Cavendish's data, test to see if the density of the earth is less than 5.5 times the density of water, at the 0.05 significance level . Answer Using Short's data, test to see if the parallax of the sun differs from 9 seconds of a degree, at the 0.1 significance level. Answer Using Fisher's iris data, perform the following tests, at the 0.1 level: 1. The mean petal length of Setosa irises differs from 15 mm. 2. The mean petal length of Verginica irises is greater than 52 mm. 3. The mean petal length of Versicolor irises is less than 44 mm. Answer This page titled 9.2: Tests in the Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
9.2.8
https://stats.libretexts.org/@go/page/10212
9.3: Tests in the Bernoulli Model Basic Tests Preliminaries Suppose that X = (X , X , … , X ) is a random sample from the Bernoulli distribution with unknown parameter p ∈ (0, 1). Thus, these are independent random variables taking the values 1 and 0 with probabilities p and 1 − p respectively. In the usual language of reliability, 1 denotes success and 0 denotes failure, but of course these are generic terms. Often this model arises in one of the following contexts: 1
2
n
1. There is an event of interest in a basic experiment, with unknown probability p. We replicate the experiment n times and define X = 1 if and only if the event occurred on run i . 2. We have a population of objects of several different types; p is the unknown proportion of objects of a particular type of interest. We select n objects at random from the population and let X = 1 if and only if object i is of the type of interest. When the sampling is with replacement, these variables really do form a random sample from the Bernoulli distribution. When the sampling is without replacement, the variables are dependent, but the Bernoulli model may still be approximately valid if the population size is very large compared to the sample size n . For more on these points, see the discussion of sampling with and without replacement in the chapter on Finite Sampling Models. i
i
In this section, we will construct hypothesis tests for the parameter p. The parameter space for p is the interval (0, 1), and all hypotheses define subsets of this space. This section parallels the section on Estimation in the Bernoulli Model in the Chapter on Interval Estimation.
The Binomial Test Recall that the number of successes density function given by
n
Y =∑
i=1
Xi
has the binomial distribution with parameters
P(Y = y) = (
n y n−y ) p (1 − p ) , y
n
and p, and has probability
y ∈ {0, 1, … , n}
(9.3.1)
Recall also that the mean is E(Y ) = np and variance is var(Y ) = np(1 − p) . Moreover Y is sufficient for p and hence is a natural candidate to be a test statistic for hypothesis tests about p. For α ∈ (0, 1), let b (α) denote the quantile of order α for the binomial distribution with parameters n and p. Since the binomial distribution is discrete, only certain (exact) quantiles are possible. For the remainder of this discussion, p ∈ (0, 1) is a conjectured value of p. n,p
0
For every α ∈ (0, 1), the following tests have approximate significance level α : 1. Reject H 2. Reject H 3. Reject H
0
: p = p0
0
: p ≥ p0
0
: p ≤ p0
versus H versus H versus H
1
: p ≠ p0
1
: p < p0
1
: p > p0
if and only if Y if and only if Y if and only if Y
≤ bn,p (α/2) 0
≤ bn,p (α) 0
or Y
.
≥ bn,p (1 − α) 0
≥ bn,p (1 − α/2) 0
.
.
Proof The test in (a) is the standard, symmetric, two-sided test, corresponding to probability α/2 (approximately) in both tails of the binomial distribution under H . The test in (b) is the left-tailed and test and the test in (c) is the right-tailed test. As usual, we can generalize the two-sided test by partitioning α between the left and right tails of the binomial distribution in an arbitrary manner. 0
For any α, r ∈ (0, 1), the following test has (approximate) significance level α : Reject H only if Y ≤ b (α − rα) or Y ≥ b (1 − rα) .
0
n,p0
: p = p0
versus H
1
: p ≠ p0
if and
n,p0
1. r = gives the standard symmetric two-sided test. 2. r ↓ 0 gives the left-tailed test. 3. r ↑ 1 gives the right-tailed test. 1 2
Proof
9.3.1
https://stats.libretexts.org/@go/page/10213
An Approximate Normal Test When n is large, the distribution of normal test.
Y
is approximately normal, by the central limit theorem, so we can construct an approximate
Suppose that the sample size n is large. For a conjectured p
0
Z =
, define the test statistic
∈ (0, 1)
Y − np0
(9.3.2)
− −−−−−−− − √ np0 (1 − p0 )
1. If p = p , then Z has approximately a standard normal distribution. 0
−
2. If p ≠ p , then Z has approximately a normal distribution with mean √n 0
p−p0 √p (1−p ) 0
and variance
p(1−p) p (1−p ) 0
0
0
Proof As usual, for α ∈ (0, 1), let z(α) denote the quantile of order α for the standard normal distribution. For selected values of α , z(α) can be obtained from the special distribution calculator, or from most statistical software packages. Recall also by symmetry that z(1 − α) = −z(α) . For every α ∈ (0, 1), the following tests have approximate significance level α : 1. Reject H 2. Reject H 3. Reject H
0
: p = p0
0
: p ≥ p0
0
: p ≤ p0
versus H versus H versus H
1
: p ≠ p0
1
: p < p0
1
: p ≥ p0
if and only if Z < −z(1 − α/2) or Z > z(1 − α/2) . if and only if Z < −z(1 − α) . if and only if Z > z(1 − α) .
Proof The test in (a) is the symmetric, two-sided test that corresponds to α/2 in both tails of the distribution of Z , under H . The test in (b) is the left-tailed test and the test in (c) is the right-tailed test. As usual, we can construct a more general two-sided test by partitioning α between the left and right tails of the standard normal distribution in an arbitrary manner. 0
For every α, r ∈ (0, 1), the following test has approximate significance level α : Reject H only if Z < z(α − rα) or Z > z(1 − rα) .
0
: p = p0
versus H
1
: p ≠ p0
if and
1. r = gives the standard, symmetric two-sided test. 2. r ↓ 0 gives the left-tailed test. 3. r ↑ 1 gives the right-tailed test. 1 2
Proof
Simulation Exercises In the proportion test experiment, set H : p = p , and select sample size 10, significance level 0.1, and p = 0.5 . For each p ∈ {0.1, 0.2, … , 0.9} , run the experiment 1000 times and then note the relative frequency of rejecting the null hypothesis. Graph the empirical power function. 0
0
0
In the proportion test experiment, repeat the previous exercise with sample size 20. In the proportion test experiment, set H : p ≤ p , and select sample size 15, significance level 0.05, and p = 0.3 . For each p ∈ {0.1, 0.2, … , 0.9} , run the experiment 1000 times and note the relative frequency of rejecting the null hypothesis. Graph the empirical power function. 0
0
0
In the proportion test experiment, repeat the previous exercise with sample size 30. In the proportion test experiment, set H : p ≥ p , and select sample size 20, significance level 0.01, and p = 0.6 . For each p ∈ {0.1, 0.2, … , 0.9} , run the experiment 1000 times and then note the relative frequency of rejecting the null hypothesis. Graph the empirical power function. 0
0
0
In the proportion test experiment, repeat the previous exercise with sample size 50.
9.3.2
https://stats.libretexts.org/@go/page/10213
Computational Exercises In a pole of 1000 registered voters in a certain district, 427 prefer candidate X. At the 0.1 level, is the evidence sufficient to conclude that more that 40% of the registered voters prefer X? Answer A coin is tossed 500 times and results in 302 heads. At the 0.05 level, test to see if the coin is unfair. Answer A sample of 400 memory chips from a production line are tested, and 32 are defective. At the 0.05 level, test to see if the proportion of defective chips is less than 0.1. Answer A new drug is administered to 50 patients and the drug is effective in 42 cases. At the 0.1 level, test to see if the success rate for the new drug is greater that 0.8. Answer Using the M&M data, test the following alternative hypotheses at the 0.1 significance level: 1. The proportion of red M&Ms differs from . 2. The proportion of green M&Ms is less than . 3. The proportion of yellow M&M is greater than 1 6
1 6
1 6
.
Answer
The Sign Test Derivation Suppose now that we have a basic random experiment with a real-valued random variable U of interest. We assume that U has a continuous distribution with support on an interval of S ⊆ R . Let m denote the quantile of a specified order p ∈ (0, 1) for the distribution of U . Thus, by definition, 0
p0 = P(U ≤ m)
(9.3.4)
In general of course, m is unknown, even though p is specified, because we don't know the distribution of want to construct hypothesis tests for m. For a given test value m , let 0
U
. Suppose that we
0
p = P(U ≤ m0 )
(9.3.5)
Note that p is unknown even though m is specified, because again, we don't know the distribution of U . 0
Relations 1. m = m if and only if p = p . 2. m < m if and only if p > p . 3. m > m if and only if p < p . 0
0
0
0
0
0
Proof As usual, we repeat the basic experiment n times to generate a random sample U = (U , U , … , U ) of size distribution of U . Let X = 1(U ≤ m ) be the indicator variable of the event {U ≤ m } for i ∈ {1, 2, … , n}. 1
i
i
0
i
2
n
from the
n
0
Note that X = (X , X , … , X ) is a statistic (an observable function of the data vector U ) and is a random sample of size n from the Bernoulli distribution with parameter p. 1
2
n
From the last two results it follows that tests of the unknown quantile m can be converted to tests of the Bernoulli parameter p, and thus the tests developed above apply. This procedure is known as the sign test, because essentially, only the sign of U − m is recorded for each i. This procedure is also an example of a nonparametric test, because no assumptions about the distribution of U i
9.3.3
0
https://stats.libretexts.org/@go/page/10213
are made (except for continuity). In particular, we do not need to assume that the distribution of parametric family.
U
belongs to a particular
The most important special case of the sign test is the case where p = ; this is the sign test of the median. If the distribution of U is known to be symmetric, the median and the mean agree. In this case, sign tests of the median are also tests of the mean. 1
0
2
Simulation Exercises In the sign test experiment, set the sampling distribution to normal with mean 0 and standard deviation 2. Set the sample size to 10 and the significance level to 0.1. For each of the 9 values of m , run the simulation 1000 times. 0
1. When m = m , give the empirical estimate of the significance level of the test and compare with 0.1. 2. In the other cases, give the empirical estimate of the power of the test. 0
In the sign test experiment, set the sampling distribution to uniform on the interval [0, 5]. Set the sample size to 20 and the significance level to 0.05. For each of the 9 values of m , run the simulation 1000 times. 0
1. When m = m , give the empirical estimate of the significance level of the test and compare with 0.05. 2. In the other cases, give the empirical estimate of the power of the test. 0
In the sign test experiment, set the sampling distribution to gamma with shape parameter 2 and scale parameter 1. Set the sample size to 30 and the significance level to 0.025. For each of the 9 values of m , run the simulation 1000 times. 0
1. When m = m , give the empirical estimate of the significance level of the test and compare with 0.025. 2. In the other cases, give the empirical estimate of the power of the test. 0
Computational Exercises Using the M&M data, test to see if the median weight exceeds 47.9 grams, at the 0.1 level. Answer Using Fisher's iris data, perform the following tests, at the 0.1 level: 1. The median petal length of Setosa irises differs from 15 mm. 2. The median petal length of Verginica irises is less than 52 mm. 3. The median petal length of Versicolor irises is less than 42 mm. Answer This page titled 9.3: Tests in the Bernoulli Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
9.3.4
https://stats.libretexts.org/@go/page/10213
9.4: Tests in the Two-Sample Normal Model In this section, we will study hypothesis tests in the two-sample normal model and in the bivariate normal model. This section parallels the section on Estimation in the Two Sample Normal Model in the chapter on Interval Estimation.
The Two-Sample Normal Model Suppose that X = (X , X , … , X ) is a random sample of size m from the normal distribution with mean μ and standard deviation σ, and that Y = (Y , Y , … , Y ) is a random sample of size n from the normal distribution with mean ν and standard deviation τ . Moreover, suppose that the samples X and Y are independent. 1
2
n
1
2
n
This type of situation arises frequently when the random variables represent a measurement of interest for the objects of the population, and the samples correspond to two different treatments. For example, we might be interested in the blood pressure of a certain population of patients. The X vector records the blood pressures of a control sample, while the Y vector records the blood pressures of the sample receiving a new drug. Similarly, we might be interested in the yield of an acre of corn. The X vector records the yields of a sample receiving one type of fertilizer, while the Y vector records the yields of a sample receiving a different type of fertilizer. Usually our interest is in a comparison of the parameters (either the mean or variance) for the two sampling distributions. In this section we will construct tests for the for the difference of the means and the ratio of the variances. As with previous estimation problems we have studied, the procedures vary depending on what parameters are known or unknown. Also as before, key elements in the construction of the tests are the sample means and sample variances and the special properties of these statistics when the sampling distribution is normal. We will use the following notation for the sample mean and sample variance of a generic sample U 1 M (U ) = k
k
∑ Ui ,
S
2
k−1
i=1
:
k
1 (U ) =
= (U1 , U2 , … , Uk )
2
∑[ Ui − M (U )]
(9.4.1)
i=1
Tests of the Difference in the Means with Known Standard Deviations Our first discussion concerns tests for the difference in the means ν − μ under the assumption that the standard deviations σ and τ are known. This is often, but not always, an unrealistic assumption. In some statistical problems, the variances are stable, and are at least approximately known, while the means may be different because of different treatments. Also this is a good place to start because the analysis is fairly easy. For a conjectured difference of the means δ ∈ R , define the test statistic [M (Y ) − M (X)] − δ Z =
(9.4.2)
− − − − − − − − − − − 2 2 √ σ /m + τ /n
1. If ν − μ = δ then Z has the standard normal distribution. − − − − − − − − − − − 2. If ν − μ ≠ δ then Z has the normal distribution with mean [(ν − μ) − δ]/√σ /m + τ /n and variance 1. 2
2
Proof Of course (b) actually subsumes (a), but we separate them because the two cases play an impotrant role in the hypothesis tests. In part (b), the non-zero mean can be viewed as a non-centrality parameter. As usual, for p ∈ (0, 1), let z(p) denote the quantile of order p for the standard normal distribution. For selected values of p, z(p) can be obtained from the special distribution calculator or from most statistical software packages. Recall also by symmetry that z(1 − p) = −z(p) . For every α ∈ (0, 1), the following tests have significance level α : 1. Reject H
if and only if Z < −z(1 − α/2) or Z > z(1 − α/2) if and only if − − − − − − − − − − − or M (Y ) − M (X) < δ − z(1 − α/2)√σ /m + τ /n . 2. Reject H : ν − μ ≥ δ versus H : ν − μ < δ if and only if Z < −z(1 − α) if and only if − − − − − − − − − − − M (Y ) − M (X) < δ − z(1 − α)√σ /m + τ /n . 0
: ν −μ = δ
versus H
1
: ν −μ ≠ δ
− − − − − − − − − − − 2 2 M (Y ) − M (X) > δ + z(1 − α/2)√σ /m + τ /n 0
2
2
1
2
2
9.4.1
https://stats.libretexts.org/@go/page/10214
3. Reject H
versus
if and only if Z > z(1 − α) if and only if .
H1 : ν − μ > δ 0 : ν −μ ≤ δ − − − − − − − − − − − M (Y ) − M (X) > δ + z(1 − α)√σ 2 /m + τ 2 /n
Proof For each of the tests above, we fail to reject confidence interval.
at significance level
H0
α
if and only if
δ
is in the corresponding
− − − − − − − − − − −
level
1 −α
− − − − − − − − − − − 2 2 /m + τ /n
1. [M (Y ) − M (X)] − z(1 − α/2)√σ /m + τ /n ≤ δ ≤ [M (Y ) − M (X)] + z(1 − α/2)√σ − − − − − − − − − − − 2. δ ≤ [M (Y ) − M (X)] + z(1 − α)√σ /m + τ /n − − − − − − − − − − − 3. δ ≥ [M (Y ) − M (X)] − z(1 − α)√σ /m + τ /n 2
2
2
2
2
2
Proof
Tests of the Difference of the Means with Unknown Standard Deviations Next we will construct tests for the difference in the means ν − μ under the more realistic assumption that the standard deviations σ and τ are unknown. In this case, it is more difficult to find a suitable test statistic, but we can do the analysis in the special case that the standard deviations are the same. Thus, we will assume that σ = τ , and the common value σ is unknown. This assumption is reasonable if there is an inherent variability in the measurement variables that does not change even when different treatments are applied to the objects in the population. Recall that the pooled estimate of the common variance σ is the weighted average of the sample variances, with the degrees of freedom as the weight factors: 2
S
2
(m − 1)S
2
(X) + (n − 1)S
2
(Y )
(X, Y ) =
(9.4.3) m +n−2
The statistic S
2
(X, Y )
is an unbiased and consistent estimator of the common variance σ . 2
For a conjectured δ ∈ R define the test statistc [M (Y ) − M (X)] − δ T =
(9.4.4)
− − − − − − − − − S(X, Y )√ 1/m + 1/n
1. If ν − μ = δ then T has the t distribution with m + n − 2 degrees of freedom, 2. If ν − μ ≠ δ then T has a non-central t distribution with m + n − 2 degrees of freedom and non-centrality parameter (ν − μ) − δ (9.4.5)
− − − − − − − − − σ √ 1/m + 1/n
Proof As usual, for k > 0 and p ∈ (0, 1), let t (p) denote the quantile of order p for the t distribution with k degrees of freedom. For selected values of k and p, values of t (p) can be computed from the special distribution calculator, or from most statistical software packages. Recall also that, by symmetry, t (1 − p) = −t (p) . k
k
k
k
The following tests have significance level α : 1. Reject H : ν − μ = δ versus H only if M (Y ) − M (X) > δ + t
1
2. Reject H
: ν −μ ≠ δ
if and only if
T < −tm+n−2 (1 − α/2) − − − − − − − − − − − 2 2 (1 − α/2) √ σ /m + τ /n m+n−2 − − − − − − − − − − − M (Y ) − M (X) < δ − tm+n−2 (1 − α/2)√σ 2 /m + τ 2 /n 0
versus
if and only if T
≤ −tm−n+2 (1 − α)
if and only if T
≥ tm−n+2 (1 − α)
H1 : ν − μ < δ 0 : ν −μ ≥ δ − − − − − − − − − − − 2 2 M (Y ) − M (X) < δ − tm+n−2 (1 − α)√σ /m + τ /n
3. Reject H
0
: ν −μ ≤ δ
versus H
or T
> tm+n−2 (1 − α/2)
if and
or
1
: ν −μ > δ
if and only if
if and only if
− − − − − − − − − − − 2 2 M (Y ) − M (X) > δ + tm+n−2 (1 − α)√σ /m + τ /n
Proof For each of the tests above, we fail to reject confidence interval. 1. [M (Y ) − M (X)] − t 2. δ ≤ [M (Y ) − M (X)] + t
m+n−2
H0
at significance level
α
if and only if
δ
is in the corresponding
1 −α
level
− − − − − − − − − − − − − − − − − − − − − − 2 2 2 2 (1 − α/2)√σ /m + τ /n ≤ δ ≤ [M (Y ) − M (X)] + tm+n−2 (1 − α/2)√σ /m + τ /n
m+n−2
− − − − − − − − − − − 2 2 (1 − α)√σ /m + τ /n
9.4.2
https://stats.libretexts.org/@go/page/10214
3. δ ≥ [M (Y ) − M (X)] − t
m+n−2
− − − − − − − − − − − (1 − α)√σ 2 /m + τ 2 /n
Proof
Tests of the Ratio of the Variances Next we will construct tests for the ratio of the distribution variances τ course the means μ and ν are unknown.
2
/σ
2
. So the basic assumption is that the variances, and of
For a conjectured ρ ∈ (0, ∞), define the test statistics S F = S
2
2
(X) ρ
(9.4.7)
(Y )
1. If τ /σ = ρ then F has the F distribution with m − 1 degrees of freedom in the numerator and n − 1 degrees of freedom in the denominator. 2. If τ /σ ≠ ρ then F has a scaled F distribution with m − 1 degrees of freedom in the numerator, n − 1 degrees of freedom in the denominator, and scale factor ρ . 2
2
2
2
σ τ
2
2
Proof The following tests have significance level α : 1. Reject H 2. Reject H 3. Reject H
0
: τ
0
: τ
0
: τ
2 2 2
/σ /σ /σ
2 2 2
=ρ ≤ρ ≥ρ
versus H versus H versus H
1
: τ
1
: τ
1
: τ
2 2 2
/σ /σ /σ
2 2 2
≠ρ >ρ fm−1,n−1 (1 − α/2) < fm−1,n−1 (α)
or F
< fm−1,n−1 (α/2)
.
.
> fm−1,n−1 (1 − α)
.
Proof For each of the tests above, we fail to reject confidence interval. 1.
S S
2
2
(Y ) (X)
2. ρ ≤ 3. ρ ≥
S S S S
S
Fm−1,n−1 (α/2) ≤ ρ ≤ 2
2 2
2
S
2
2
H0
at significance level
α
if and only if
ρ0
is in the corresponding
1 −α
level
(Y ) (X)
Fm−1,n−1 (1 − α/2)
(Y ) (X)
Fm−1,n−1 (α)
(Y ) (X)
Fm−1,n−1 (1 − α)
Proof
Tests in the Bivariate Normal Model In this subsection, we consider a model that is superficially similar to the two-sample normal model, but is actually much simpler. Suppose that ((X1 , Y1 ), (X2 , Y2 ), … , (Xn , Yn ))
is a random sample of size n from the bivariate normal distribution of var(Y ) = τ , and cov(X, Y ) = δ .
(X, Y )
(9.4.9)
with
E(X) = μ
,
E(Y ) = ν
,
var(X) = σ
2
,
2
Thus, instead of a pair of samples, we have a sample of pairs. The fundamental difference is that in this model, variables X and Y are measured on the same objects in a sample drawn from the population, while in the previous model, variables X and Y are measured on two distinct samples drawn from the population. The bivariate model arises, for example, in before and after experiments, in which a measurement of interest is recorded for a sample of n objects from the population, both before and after a treatment. For example, we could record the blood pressure of a sample of n patients, before and after the administration of a certain drug. We will use our usual notation for the sample means and variances of X = (X also that the sample covariance of (X, Y ) is
1,
1 S(X, Y ) = n−1
X2 , … , Xn )
and Y
= (Y1 , Y2 , … , Yn )
. Recall
n
∑[ Xi − M (X)][ Yi − M (Y )]
(9.4.10)
i=1
9.4.3
https://stats.libretexts.org/@go/page/10214
(not to be confused with the pooled estimate of the standard deviation in the two-sample model above). The sequence of differences Y − X = (Y − X , Y of Y − X . The sampling distribution is normal with 1
1. E(Y − X) = ν − μ 2. var(Y − X) = σ + τ 2
2
1
2
− X2 , … , Yn − Xn )
is a random sample of size n from the distribution
−2 δ
The sample mean and variance of the sample of differences are 1. M (Y − X) = M (Y ) − M (X) 2. S (Y − X) = S (X) + S (Y ) − 2 S(X, Y ) 2
2
2
The sample of differences Y − X fits the normal model for a single variable. The section on Tests in the Normal Model could be used to perform tests for the distribution mean ν − μ and the distribution variance σ + τ − 2δ . 2
2
Computational Exercises A new drug is being developed to reduce a certain blood chemical. A sample of 36 patients are given a placebo while a sample of 49 patients are given the drug. The statistics (in mg) are m = 87 , s = 4 , m = 63 , s = 6 . Test the following at the 10% significance level: 1
1. H : σ = σ versus H : σ ≠ σ . 2. H : μ ≤ μ versus H : μ > μ (assuming that σ 3. Based on (b), is the drug effective? 0
1
2
1
1
0
1
2
1
1
1
2
2
2
2
1
= σ2
).
Answer A company claims that an herbal supplement improves intelligence. A sample of 25 persons are given a standard IQ test before and after taking the supplement. The before and after statistics are m = 105 , s = 13 , m = 110 , s = 17 , s = 190 . At the 10% significance level, do you believe the company's claim? 1
1
2
2
1, 2
Answer In Fisher's iris data, consider the petal length variable for the samples of Versicolor and Virginica irises. Test the following at the 10% significance level: 1. H 2. H
0
: σ1 = σ2
0
: μ1 ≤ μ2
versus H versus μ
1
1
. (assuming that σ
: σ1 ≠ σ2 > μ2
1
= σ2
).
Answer A plant has two machines that produce a circular rod whose diameter (in cm) is critical. A sample of 100 rods from the first machine as mean 10.3 and standard deviation 1.2. A sample of 100 rods from the second machine has mean 9.8 and standard deviation 1.6. Test the following hypotheses at the 10% level. 1. H 2. H
0
: σ1 = σ2
0
: μ1 = μ2
versus H versus H
1
: σ1 ≠ σ2
1
: μ1 ≠ μ2
. (assuming that σ
1
= σ2
).
Answer This page titled 9.4: Tests in the Two-Sample Normal Model is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
9.4.4
https://stats.libretexts.org/@go/page/10214
9.5: Likelihood Ratio Tests Basic Theory As usual, our starting point is a random experiment with an underlying sample space, and a probability measure P. In the basic statistical model, we have an observable random variable X taking values in a set S . In general, X can have quite a complicated structure. For example, if the experiment is to sample n objects from a population and record various measurements of interest, then X = (X1 , X2 , … , Xn )
(9.5.1)
where X is the vector of measurements for the ith object. The most important special case occurs when (X , X , … , X independent and identically distributed. In this case, we have a random sample of size n from the common distribution. i
1
n)
2
are
In the previous sections, we developed tests for parameters based on natural test statistics. However, in other cases, the tests may not be parametric, or there may not be an obvious statistic to start with. Thus, we need a more general method for constructing test statistics. Moreover, we do not yet know if the tests constructed so far are the best, in the sense of maximizing the power for the set of alternatives. In this and the next section, we investigate both of these ideas. Likelihood functions, similar to those used in maximum likelihood estimation, will play a key role.
Tests of Simple Hypotheses Suppose that X has one of two possible distributions. Our simple hypotheses are H0 : X H1 : X
has probability density function f . has probability density function f . 0 1
We will use subscripts on the probability measure P to indicate the two hypotheses, and we assume that f and f are postive on S . The test that we will construct is based on the following simple idea: if we observe X = x , then the condition f (x) > f (x) is evidence in favor of the alternative; the opposite inequality is evidence against the alternative. 0
1
1
0
The likelihood ratio function L : S → (0, ∞) is defined by f0 (x) L(x) =
,
x ∈ S
(9.5.2)
f1 (x)
The statistic L(X) is the likelihood ratio statistic. Restating our earlier observation, note that small values of L are evidence in favor of H . Thus it seems reasonable that the likelihood ratio statistic may be a good test statistic, and that we should consider tests in which we teject H if and only if L ≤ l , where l is a constant to be determined: 1
0
The significance level of the test is α = P
0 (L
≤ l)
.
As usual, we can try to construct a test by choosing l so that α is a prescribed value. If X has a discrete distribution, this will only be possible when α is a value of the distribution function of L(X). An important special case of this model occurs when the distribution of X depends on a parameter θ that has two possible values. Thus, the parameter space is {θ , θ }, and f denotes the probability density function of X when θ = θ and f denotes the probability density function of X when θ = θ . In this case, the hypotheses are equivalent to H : θ = θ versus H : θ = θ . 0
1
0
0
1
0
0
1
1
1
As noted earlier, another important special case is when X = (X , X , … , X ) is a random sample of size n from a distribution an underlying random variable X taking values in a set R . In this case, S = R and the probability density function f of X has the form 1
2
n
n
f (x1 , x2 , … , xn ) = g(x1 )g(x2 ) ⋯ g(xn ),
(x1 , x2 , … , xn ) ∈ S
(9.5.3)
where g is the probability density function of X. So the hypotheses simplify to H0 : X
has probability density function g . 0
9.5.1
https://stats.libretexts.org/@go/page/10215
H1 : X
has probability density function g . 1
and the likelihood ratio statistic is n
g0 (Xi )
L(X1 , X2 , … , Xn ) = ∏
(9.5.4) g1 (Xi )
i=1
In this special case, it turns out that under H , the likelihood ratio statistic, as a function of the sample size n , is a martingale. 1
The Neyman-Pearson Lemma The following theorem is the Neyman-Pearson Lemma, named for Jerzy Neyman and Egon Pearson. It shows that the test given above is most powerful. Let R = {x ∈ S : L(x) ≤ l}
(9.5.5)
and recall that the size of a rejection region is the significance of the test with that rejection region. Consider the tests with rejection regions R given above and arbitrary A ⊆ S . If the size of R is at least as large as the size of A then the test with rejection region R is more powerful than the test with rejection region A . That is, if P (X ∈ R) ≥ P (X ∈ A) then P (X ∈ R) ≥ P (X ∈ A) . 0
0
1
1
Proof The Neyman-Pearson lemma is more useful than might be first apparent. In many important cases, the same most powerful test works for a range of alternatives, and thus is a uniformly most powerful test for this range. Several special cases are discussed below.
Generalized Likelihood Ratio The likelihood ratio statistic can be generalized to composite hypotheses. Suppose again that the probability density function f of the data variable X depends on a parameter θ , taking values in a parameter space Θ. Consider the hypotheses θ ∈ Θ versus θ ∉ Θ , where Θ ⊆ Θ . θ
0
0
0
Define sup { fθ (x) : θ ∈ Θ0 } L(x) =
(9.5.9) sup { fθ (x) : θ ∈ Θ}
The function L is the likelihood ratio function and L(X) is the likelihood ratio statistic. By the same reasoning as before, small values of L(x) are evidence in favor of the alternative hypothesis.
Examples and Special Cases Tests for the Exponential Model Suppose that X = (X , X , … , X ) is a random sample of size n ∈ N from the exponential distribution with scale parameter b ∈ (0, ∞). The sample variables might represent the lifetimes from a sample of devices of a certain type. We are interested in testing the simple hypotheses H : b = b versus H : b = b , where b , b ∈ (0, ∞) are distinct specified values. 1
2
n
0
+
0
1
1
0
1
Recall that the sum of the variables is a sufficient statistic for b : n
Y = ∑ Xi
(9.5.10)
i=1
Recall also that Y has the gamma distribution with shape parameter quantile of order α for the this distribution by γ (α).
n
and scale parameter b . For
α >0
, we will denote the
n,b
The likelihood ratio statistic is L =(
b1
n
)
1 exp[(
b0
9.5.2
1 −
b1
)Y ]
(9.5.11)
b0
https://stats.libretexts.org/@go/page/10215
Proof The following tests are most powerful test at the α level 1. Suppose that b 2. Suppose that b
1
> b0
1
< b0
. Reject H . Reject H
0
: b = b0
0
: b = b0
versus H versus H
1
: b = b1
1
: b = b1
if and only if Y if and only if Y
≥ γn,b0 (1 − α) ≤ γn,b0 (α)
.
.
Proof Note that the these tests do not depend on the value of b . This fact, together with the monotonicity of the power function can be used to shows that the tests are uniformly most powerful for the usual one-sided tests. 1
Suppose that b
0
∈ (0, ∞)
.
1. The decision rule in part (a) above is uniformly most powerful for the test H 2. The decision rule in part (b) above is uniformly most powerful for the test H
0
: b ≤ b0
0
: b ≥ b0
versus H versus H
1
: b > b0
1
: b < b0
. .
Tests for the Bernoulli Model Suppose that X = (X , X , … , X ) is a random sample of size n ∈ N from the Bernoulli distribution with success parameter p . The sample could represent the results of tossing a coin n times, where p is the probability of heads. We wish to test the simple hypotheses H : p = p versus H : p = p , where p , p ∈ (0, 1) are distinct specified values. In the coin tossing model, we know that the probability of heads is either p or p , but we don't know which. 1
0
2
0
n
+
1
1
0
0
1
1
Recall that the number of successes is a sufficient statistic for p: n
Y = ∑ Xi
(9.5.14)
i=1
Recall also that Y has the binomial distribution with parameters n and p. For α ∈ (0, 1), we will denote the quantile of order α for the this distribution by b (α); although since the distribution is discrete, only certain values of α are possible. n,p
The likelihood ratio statistic is L =(
1 − p0
n
Y
p0 (1 − p1 )
) [
1 − p1
]
(9.5.15)
p1 (1 − p0 )
Proof The following tests are most powerful test at the α level 1. Suppose that p 2. Suppose that p
1
> p0
1
< p0
. Reject H : p = p versus H : p = p if and only if Y ≥ b . Reject p = p versus p = p if and only if Y ≤ b (α) . 0
0
1
0
1
n,p
1
n,p
0
(1 − α)
.
0
Proof Note that these tests do not depend on the value of p . This fact, together with the monotonicity of the power function can be used to shows that the tests are uniformly most powerful for the usual one-sided tests. 1
Suppose that p
0
.
∈ (0, 1)
1. The decision rule in part (a) above is uniformly most powerful for the test H 2. The decision rule in part (b) above is uniformly most powerful for the test H
0
: p ≤ p0
0
: p ≥ p0
versus H versus H
1
: p > p0
1
: p < p0
. .
Tests in the Normal Model The one-sided tests that we derived in the normal model, for μ with σ known, for μ with σ unknown, and for are all uniformly most powerful. On the other hand, none of the two-sided tests are uniformly most powerful.
σ
with μ unknown
A Nonparametric Example Suppose that X = (X , X , … , X ) is a random sample of size n ∈ N , either from the Poisson distribution with parameter 1 or from the geometric distribution on N with parameter p = . Note that both distributions have mean 1 (although the Poisson distribution has variance 1 while the geometric distribution has variance 2). So, we wish to test the hypotheses 1
2
n
+
1 2
9.5.3
https://stats.libretexts.org/@go/page/10215
H0 : X
has probability density function g
H1 : X
has probability density function g
0 (x) 1 (x)
=e =(
−1
1 x!
1 2
for x ∈ N.
x+1
)
for x ∈ N.
The likelihood ratio statistic is n
Y
n
L =2 e
−n
2
U
n
where Y = ∑ Xi and U = ∏ Xi ! i=1
(9.5.18)
i=1
Proof The most powerful tests have the following form, where d is a constant: reject H if and only if ln(2)Y 0
− ln(U ) ≤ d
.
Proof This page titled 9.5: Likelihood Ratio Tests is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
9.5.4
https://stats.libretexts.org/@go/page/10215
9.6: Chi-Square Tests In this section, we will study a number of important hypothesis tests that fall under the general term chi-square tests. These are named, as you might guess, because in each case the test statistics has (in the limit) a chi-square distribution. Although there are several different tests in this general category, they all share some common themes: In each test, there are one or more underlying multinomial samples, Of course, the multinomial model includes the Bernoulli model as a special case. Each test works by comparing the observed frequencies of the various outcomes with expected frequencies under the null hypothesis. If the model is incompletely specified, some of the expected frequencies must be estimated; this reduces the degrees of freedom in the limiting chi-square distribution. We will start with the simplest case, where the derivation is the most straightforward; in fact this test is equivalent to a test we have already studied. We then move to successively more complicated models.
The One-Sample Bernoulli Model Suppose that X = (X , X , … , X ) is a random sample from the Bernoulli distribution with unknown success parameter p ∈ (0, 1). Thus, these are independent random variables taking the values 1 and 0 with probabilities p and 1 − p respectively. We want to test H : p = p versus H : p ≠ p , where p ∈ (0, 1) is specified. Of course, we have already studied such tests in the Bernoulli model. But keep in mind that our methods in this section will generalize to a variety of new models that we have not yet studied. 1
0
2
0
n
1
0
n
0
n
Let O = ∑ X and O = n − O = ∑ (1 − X ) . These statistics give the number of times (frequency) that outcomes 1 and 0 occur, respectively. Moreover, we know that each has a binomial distribution; O has parameters n and p, while O has parameters n and 1 − p . In particular, E(O ) = np , E(O ) = n(1 − p) , and var(O ) = var(O ) = np(1 − p) . Moreover, recall that O is sufficient for p. Thus, any good test statistic should be a function of O . Next, recall that when n is large, the distribution of O is approximately normal, by the central limit theorem. Let 1
j=1
j
0
1
j
j=1
1
1
0
1
1
0
0
1
1
Z =
O1 − np0
(9.6.1)
− −−−−−−− − √ np0 (1 − p0 )
Note that Z is the standard score of O under H . Hence if n is large, Z has approximately the standard normal distribution under H , and therefore V = Z has approximately the chi-square distribution with 1 degree of freedom under H . As usual, let χ denote the quantile function of the chi-square distribution with k degrees of freedom. 1
0
2
0
An approximate test of H versus H at the α level of significance is to reject H if and only if V 0
1
0
The test above is equivalent to the unbiased test with test statistic Tests in the Bernoulli model.
Z
2
> χ (1 − α) 1
0
V
k
.
(the approximate normal test) derived in the section on
For purposes of generalization, the critical result in the next exercise is a special representation of V . Let e = np . Note that these are the expected frequencies of the outcomes 0 and 1, respectively, under H . 1
2
0
e0 = n(1 − p0 )
and
0
can be written in terms of the observed and expected frequencies as follows: 2
V =
(O0 − e0 )
2
+
e0
(O1 − e1 )
(9.6.2)
e1
This representation shows that our test statistic V measures the discrepancy between the expected frequencies, under H , and the observed frequencies. Of course, large values of V are evidence in favor of H . Finally, note that although there are two terms in the expansion of V in Exercise 3, there is only one degree of freedom since O + O = n . The observed and expected frequencies could be stored in a 1 × 2 table. 0
1
0
9.6.1
1
https://stats.libretexts.org/@go/page/10216
The Multi-Sample Bernoulli Model Suppose now that we have samples from several (possibly) different, independent Bernoulli trials processes. Specifically, suppose that X = (X , X , … , X ) is a random sample of size n from the Bernoulli distribution with unknown success parameter p ∈ (0, 1) for each i ∈ {1, 2, … , m}. Moreover, the samples (X , X , … , X ) are independent. We want to test hypotheses about the unknown parameter vector p = (p , p , … , p ). There are two common cases that we consider below, but first let's set up the essential notation that we will need for both cases. For i ∈ {1, 2, … , m} and j ∈ {0, 1}, let O denote the number of times that outcome j occurs in sample X . The observed frequency O has a binomial distribution; O has parameters n and p while O has parameters n and 1 − p . i
i,1
i,2
i,ni
i
i
1
1
2
2
m
m
i,j
i
i,0
i
i,j
i,1
i
i
i
The Completely Specified Case Consider a specified parameter vector p = (p , p , … , p ) ∈ (0, 1) . We want to test the null hypothesis H : p = p , versus H : p ≠ p . Since the null hypothesis specifies the value of p for each i, this is called the completely specified case. Now let e = n (1 − p ) and let e = n p . These are the expected frequencies of the outcomes 0 and 1, respectively, from sample X under H . m
0
1
i,0
0,2
0,m
0
0
i
i
0,1
i,0
0
i
i,1
i
i,0
0
If n is large for each i, then under H the following test statistic has approximately the chi-square distribution with m degrees of freedom: i
0
m
1
2
(Oi,j − ei,j )
V = ∑∑ i=1
(9.6.3) ei,j
j=0
Proof As a rule of thumb, “large” means that we need e expected frequencies the better.
i,j
≥5
for each i ∈ {1, 2, … , m} and j ∈ {0, 1}. But of course, the larger these
Under the large sample assumption, an approximate test of only if V > χ (1 − α) .
H0
versus H at the 1
α
level of significance is to reject
H0
if and
2 m
Once again, note that the test statistic V measures the discrepancy between the expected and observed frequencies, over all outcomes and all samples. There are 2 m terms in the expansion of V in Exercise 4, but only m degrees of freedom, since O +O =n for each i ∈ {1, 2, … , m}. The observed and expected frequencies could be stored in an m × 2 table. i,0
i,1
i
The Equal Probability Case Suppose now that we want to test the null hypothesis H : p = p = ⋯ = p that all of the success probabilities are the same, versus the complementary alternative hypothesis H that the probabilities are not all the same. Note, in contrast to the previous model, that the null hypothesis does not specify the value of the common success probability p. But note also that under the null hypothesis, the m samples can be combined to form one large sample of Bernoulli trials with success probability p. Thus, a natural approach is to estimate p and then define the test statistic that measures the discrepancy between the expected and observed frequencies, just as before. The challenge will be to find the distribution of the test statistic. 0
1
2
m
1
m
Let n = ∑ n denote the total sample size when the samples are combined. Then the overall sample mean, which in this context is the overall sample proportion of successes, is i=1
i
1 P = n
m
ni
1
∑ ∑ Xi,j = i=1
j=1
n
m
∑ Oi,1
(9.6.4)
i=1
The sample proportion P is the best estimate of p, in just about any sense of the word. Next, let E = n (1 − P ) and E = n P . These are the estimated expected frequencies of 0 and 1, respectively, from sample X under H . Of course these estimated frequencies are now statistics (and hence random) rather than parameters. Just as before, we define our test statistic i,0
i,1
i
i
m
1
2
(Oi,j − Ei,j )
V = ∑∑ i=1
i
0
j=0
9.6.2
(9.6.5) Ei,j
https://stats.libretexts.org/@go/page/10216
It turns out that under n → ∞.
H0
, the distribution of
converges to the chi-square distribution with
V
An approximate test of H versus H at the α level of significance is to reject H if and only if V 0
1
degrees of freedom as
m −1
0
2
>χ
m−1
(1 − α)
.
Intuitively, we lost a degree of freedom over the completely specified case because we had to estimate the unknown common success probability p. Again, the observed and expected frequencies could be stored in an m × 2 table.
The One-Sample Multinomial Model Our next model generalizes the one-sample Bernoulli model in a different direction. Suppose that X = (X , X , … , X ) is a sequence of multinomial trials. Thus, these are independent, identically distributed random variables, each taking values in a set S with k elements. If we want, we can assume that S = {0, 1, … , k − 1} ; the one-sample Bernoulli model then corresponds to k = 2 . Let f denote the common probability density function of the sample variables on S , so that f (j) = P(X = j) for i ∈ {1, 2, … , n} and j ∈ S . The values of f are assumed unknown, but of course we must have ∑ f (j) = 1 , so there are really only k − 1 unknown parameters. For a given probability density function f on S we want to test H : f = f versus H : f ≠f . 1
2
n
i
j∈S
0
1
0
0
0
By this time, our general approach should be clear. We let O denote the number of times that outcome j ∈ S occurs in sample X: j
n
Oj = ∑ 1(Xi = j)
(9.6.6)
i=1
Note that O has the binomial distribution with parameters outcome j occurs, under H . Out test statistic, of course, is j
n
and
f (j)
. Thus,
is the expected number of times that
ej = n f0 (j)
0
2
(Oj − ej ) V =∑ e
j∈S
(9.6.7)
j
It turns out that under H , the distribution of V converges to the chi-square distribution with k − 1 degrees of freedom as n → ∞ . Note that there are k terms in the expansion of V , but only k − 1 degrees of freedom since ∑ O = n. 0
j∈S
j
An approximate test of H versus H at the α level of significance is to reject H if and only if V 0
1
Again, as a rule of thumb, we need e
j
0
2
>χ
k−1
(1 − α)
.
for each j ∈ S , but the larger the expected frequencies the better.
≥5
The Multi-Sample Multinomial Model As you might guess, our final generalization is to the multi-sample multinomial model. Specifically, suppose that X = (X ,X ,…,X ) is a random sample of size n from a distribution on a set S with k elements, for each i ∈ {1, 2, … , m}. Moreover, we assume that the samples (X , X , … , X ) are independent. Again there is no loss in generality if we take S = {0, 1, … , k − 1} . Then k = 2 reduces to the multi-sample Bernoulli model, and m = 1 corresponds to the onesample multinomial model. i
i,1
i,2
i,ni
i
1
Let
2
m
denote the common probability density function of the variables in sample X , so that f (j) = P(X = j) for , , and j ∈ S . These are generally unknown, so that our vector of parameters is the vector of probability density functions: f = (f , f , … , f ). Of course, ∑ f (j) = 1 for i ∈ {1, 2, … , m}, so there are actually m (k − 1) unknown parameters. We are interested in testing hypotheses about f . As in the multi-sample Bernoulli model, there are two common cases that we consider below, but first let's set up the essential notation that we will need for both cases. For i ∈ {1, 2, … , m} and j ∈ S , let O denote the number of times that outcome j occurs in sample X . The observed frequency O has a binomial distribution with parameters n and f (j) . fi
i
i
i,l
i ∈ {1, 2, … , m} l ∈ {1, 2, … , ni }
1
2
m
j∈S
i
i,j
i
i,j
i
i
The Completely Specified Case Consider a given vector of probability density functions on S , denoted f = (f , f , … , f ) . We want to test the null hypothesis H : f = f , versus H : f ≠ f . Since the null hypothesis specifies the value of f (j) for each i and j , this is called the completely specified case. Let e = n f (j) . This is the expected frequency of outcome j in sample X under H . 0
0
0
1
i,j
0
i
0,1
0,2
0,m
i
0,i
i
9.6.3
0
https://stats.libretexts.org/@go/page/10216
If n is large for each i, then under H , the test statistic V below has approximately the chi-square distribution with m (k − 1) degrees of freedom: i
0
m
2
(Oi,j − ei,j )
V = ∑∑ i=1
(9.6.8) ei,j
j∈S
Proof As usual, our rule of thumb is that we need e frequencies the better.
i,j
≥5
for each i ∈ {1, 2, … , m} and j ∈ S . But of course, the larger these expected
Under the large sample assumption, an approximate test of only if V > χ (1 − α) .
H0
versus H at the 1
α
level of significance is to reject
H0
if and
2
m (k−1)
As always, the test statistic V measures the discrepancy between the expected and observed frequencies, over all outcomes and all samples. There are mk terms in the expansion of V in Exercise 8, but we lose m degrees of freedom, since ∑ O =n for each i ∈ {1, 2, … , m}. j∈S
i,j
i
The Equal PDF Case Suppose now that we want to test the null hypothesis H : f = f = ⋯ = f that all of the probability density functions are the same, versus the complementary alternative hypothesis H that the probability density functions are not all the same. Note, in contrast to the previous model, that the null hypothesis does not specify the value of the common success probability density function f . But note also that under the null hypothesis, the m samples can be combined to form one large sample of multinomial trials with probability density function f . Thus, a natural approach is to estimate the values of f and then define the test statistic that measures the discrepancy between the expected and observed frequencies, just as before. 0
1
2
m
1
Let n = ∑
m i=1
ni
denote the total sample size when the samples are combined. Under H , our best estimate of f (j) is 0
1 Pj =
n
m
∑ Oi,j
(9.6.9)
i=1
Hence our estimate of the expected frequency of outcome j in sample X under H is E = n P . Again, this estimated frequency is now a statistic (and hence random) rather than a parameter. Just as before, we define our test statistic i
m
0
i,j
i
j
2
(Oi,j − Ei,j )
V = ∑∑ i=1
j∈S
(9.6.10) Ei,j
As you no doubt expect by now, it turns out that under H , the distribution of V converges to a chi-square distribution as n → ∞ . But let's see if we can determine the degrees of freedom heuristically. 0
The limiting distribution of V has (k − 1)(m − 1) degrees of freedom. Proof An approximate test of H versus H at the α level of significance is to reject H if and only if V 0
1
0
2
>χ
(k−1) (m−1)
(1 − α)
.
A Goodness of Fit Test A goodness of fit test is an hypothesis test that an unknown sampling distribution is a particular, specified distribution or belongs to a parametric family of distributions. Such tests are clearly fundamental and important. The one-sample multinomial model leads to a quite general goodness of fit test. To set the stage, suppose that we have an observable random variable X for an experiment, taking values in a general set S . Random variable X might have a continuous or discrete distribution, and might be single-variable or multi-variable. We want to test the null hypothesis that X has a given, completely specified distribution, or that the distribution of X belongs to a particular parametric family.
9.6.4
https://stats.libretexts.org/@go/page/10216
Our first step, in either case, is to sample from the distribution of X to obtain a sequence of independent, identically distributed variables X = (X , X , … , X ). Next, we select k ∈ N and partition S into k (disjoint) subsets. We will denote the partition by {A : j ∈ J} where #(J) = k . Next, we define the sequence of random variables Y = (Y , Y , … , Y ) by Y = j if and only if X ∈ A for i ∈ {1, 2, … , n} and j ∈ J . 1
2
n
+
j
i
Y
1
2
n
i
j
is a multinomial trials sequence with parameters n and f , where f (j) = P(X ∈ A
j)
for j ∈ J .
The Completely Specified Case Let H denote the statement that X has a given, completely specified distribution. Let f denote the probability density function on J defined by f (j) = P(X ∈ A ∣ H ) for j ∈ J . To test hypothesis H , we can formally test H : f = f versus H : f ≠ f , which of course, is precisely the problem we solved in the one-sample multinomial model. 0
0
j
0
0
1
0
Generally, we would partition the space S into as many subsets as possible, subject to the restriction that the expected frequencies all be at least 5.
The Partially Specified Case Often we don't really want to test whether X has a completely specified distribution (such as the normal distribution with mean 5 and variance 9), but rather whether the distribution of X belongs to a specified parametric family (such as the normal). A natural course of action in this case would be to estimate the unknown parameters and then proceed just as above. As we have seen before, the expected frequencies would be statistics E because they would be based on the estimated parameters. As a rule of thumb, we lose a degree of freedom in the chi-square statistic V for each parameter that we estimate, although the precise mathematics can be complicated. j
A Test of Independence Suppose that we have observable random variables X and Y for an experiment, where X takes values in a set S with k elements, and Y takes values in a set T with m elements. Let f denote the joint probability density function of (X, Y ), so that f (i, j) = P(X = i, Y = j) for i ∈ S and j ∈ T . Recall that the marginal probability density functions of X and Y are the functions g and h respectively, where g(i) = ∑ f (i, j),
i ∈ S
(9.6.11)
j∈ T
(9.6.12)
j∈T
h(j) = ∑ f (i, j), i∈S
Usually, of course, f , g , and h are unknown. In this section, we are interested in testing whether X and Y are independent, a basic and important test. Formally then we want to test the null hypothesis H0 : f (i, j) = g(i) h(j),
(i, j) ∈ S × T
(9.6.13)
versus the complementary alternative H . 1
Our first step, of course, is to draw a random sample (X, Y ) = ((X , Y ), (X , Y ), … , (X , Y )) from the distribution of (X, Y ). Since the state spaces are finite, this sample forms a sequence of multinomial trials. Thus, with our usual notation, let O denote the number of times that (i, j) occurs in the sample, for each (i, j) ∈ S × T . This statistic has the binomial distribution with trial parameter n and success parameter f (i, j). Under H , the success parameter is g(i) h(j) . However, since we don't know the success parameters, we must estimate them in order to compute the expected frequencies. Our best estimate of f (i, j) is the sample proportion O . Thus, our best estimates of g(i) and h(j) are N and M , respectively, where N is the number of times that i occurs in sample X and M is the number of times that j occurs in sample Y : 1
1
2
2
n
n
i,j
0
1
n
1
i,j
n
1
i
n
j
i
j
Ni = ∑ Oi,j
(9.6.14)
j∈T
Mj = ∑ Oi,j
(9.6.15)
i∈S
Thus, our estimate of the expected frequency of (i, j) under H is 0
1 Ei,j = n
1 Ni
n
1 Mj =
n
9.6.5
Ni Mj
(9.6.16)
n
https://stats.libretexts.org/@go/page/10216
Of course, we define our test statistic by 2
(Oi,j − Ei,j ) V = ∑∑ i∈J
(9.6.17) Ei,j
j∈T
As you now expect, the distribution of V converges to a chi-square distribution as appropriate degrees of freedom on heuristic grounds.
n → ∞
. But let's see if we can determine the
The limiting distribution of V has (k − 1) (m − 1) degrees of freedom. Proof An approximate test of H versus H at the α level of significance is to reject H if and only if V 0
1
0
2
>χ
(k−1)(m−1)
(1 − α)
.
The observed frequencies are often recorded in a k × m table, known as a contingency table, so that O is the number in row i and column j . In this setting, note that N is the sum of the frequencies in the ith row and M is the sum of the frequencies in the j th column. Also, for historical reasons, the random variables X and Y are sometimes called factors and the possible values of the variables categories. i,j
i
j
Computational and Simulation Exercises

Computational Exercises

In each of the following exercises, specify the number of degrees of freedom of the chi-square statistic, give the value of the statistic, and compute the P-value of the test.

A coin is tossed 100 times, resulting in 55 heads. Test the null hypothesis that the coin is fair.

Answer

Suppose that we have 3 coins. The coins are tossed, yielding the data in the following table:

         Heads  Tails
Coin 1    29     21
Coin 2    23     17
Coin 3    42     18

1. Test the null hypothesis that all 3 coins are fair.
2. Test the null hypothesis that coin 1 has probability of heads 3/5; coin 2 is fair; and coin 3 has probability of heads 2/3.
3. Test the null hypothesis that the 3 coins have the same probability of heads.
Answer

A die is thrown 240 times, yielding the data in the following table:

Score        1   2   3   4   5   6
Frequency   57  39  28  28  36  52

1. Test the null hypothesis that the die is fair.
2. Test the null hypothesis that the die is an ace-six flat die (faces 1 and 6 have probability 1/4 each while faces 2, 3, 4, and 5 have probability 1/8 each).
Answer

Two dice are thrown, yielding the data in the following table:

Score    1   2   3   4   5   6
Die 1   22  17  22  13  22  24
Die 2   44  24  19  19  18  36
1. Test the null hypothesis that die 1 is fair and die 2 is an ace-six flat.
2. Test the null hypothesis that the two dice have the same probability distribution.

Answer

A university classifies faculty by rank as instructors, assistant professors, associate professors, and full professors. The data, by faculty rank and gender, are given in the following contingency table. Test to see if faculty rank and gender are independent.

          Instructor  Assistant Professor  Associate Professor  Full Professor
Male          62            238                  185                 115
Female       118            122                  123                  37

Answer
Data Analysis Exercises

The Buffon trial data set gives the results of 104 repetitions of Buffon's needle experiment. The number of crack crossings is 56. In theory, this data set should correspond to 104 Bernoulli trials with success probability p = 2/π. Test to see if this is reasonable.

Answer

Test to see if the alpha emissions data come from a Poisson distribution.

Answer

Test to see if Michelson's velocity of light data come from a normal distribution.

Answer
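For the Buffon data, the goodness of fit test reduces to two cells (crossing, no crossing) with one degree of freedom. A quick sketch of the computation, assuming scipy is available:

```python
# Test the Buffon needle data against success probability p = 2/pi.
import math
from scipy.stats import chi2

n, crossings = 104, 56
p0 = 2 / math.pi
observed = [crossings, n - crossings]
expected = [n * p0, n * (1 - p0)]
V = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(V, chi2.sf(V, df=1))   # statistic and P-value, 2 cells - 1 = 1 df
```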
Simulation Exercises

In the simulation exercises below, you will be able to explore the goodness of fit test empirically.

In the dice goodness of fit experiment, set the sampling distribution to fair, the sample size to 50, and the significance level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case (a), give the empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical estimate of the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem reasonable?
1. fair
2. ace-six flats
3. the symmetric, unimodal distribution
4. the distribution skewed right

In the dice goodness of fit experiment, set the sampling distribution to ace-six flats, the sample size to 50, and the significance level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case (a), give the empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical estimate of the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem reasonable?
1. fair
2. ace-six flats
3. the symmetric, unimodal distribution
4. the distribution skewed right

In the dice goodness of fit experiment, set the sampling distribution to the symmetric, unimodal distribution, the sample size to 50, and the significance level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case (a), give the empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical estimate of the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem reasonable?
1. the symmetric, unimodal distribution
2. fair
3. ace-six flats
4. the distribution skewed right

In the dice goodness of fit experiment, set the sampling distribution to the distribution skewed right, the sample size to 50, and the significance level to 0.1. Set the test distribution as indicated below and in each case, run the simulation 1000 times. In case (a), give the empirical estimate of the significance level of the test and compare with 0.1. In the other cases, give the empirical estimate of the power of the test. Rank the distributions in (b)-(d) in increasing order of apparent power. Do your results seem reasonable?
1. the distribution skewed right
2. fair
3. ace-six flats
4. the symmetric, unimodal distribution

Suppose that D_1 and D_2 are different distributions. Is the power of the test with sampling distribution D_1 and test distribution D_2 the same as the power of the test with sampling distribution D_2 and test distribution D_1? Make a conjecture based on your results in the previous three exercises.

In the dice goodness of fit experiment, set the sampling and test distributions to fair and the significance level to 0.05. Run the experiment 1000 times for each of the following sample sizes. In each case, give the empirical estimate of the significance level and compare with 0.05.
1. n = 10
2. n = 20
3. n = 40
4. n = 100

In the dice goodness of fit experiment, set the sampling distribution to fair, the test distribution to ace-six flats, and the significance level to 0.05. Run the experiment 1000 times for each of the following sample sizes. In each case, give the empirical estimate of the power of the test. Do the powers seem to be converging?
1. n = 10
2. n = 20
3. n = 40
4. n = 100

This page titled 9.6: Chi-Square Tests is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
CHAPTER OVERVIEW

10: Geometric Models

In this chapter, we explore several problems in geometric probability. These problems are interesting, conceptually clear, and the analysis is relatively simple. Thus, they are good problems for the student of probability. In addition, Buffon's problems and Bertrand's problem are historically famous, and contributed significantly to the early development of probability theory.

10.1: Buffon's Problems
10.2: Bertrand's Paradox
10.3: Random Triangles
This page titled 10: Geometric Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
10.1: Buffon's Problems

Buffon's experiments are very old and famous random experiments, named after the Comte de Buffon. These experiments are considered to be among the first problems in geometric probability.
Buffon's Coin Experiment

Buffon's coin experiment consists of dropping a coin randomly on a floor covered with identically shaped tiles. The event of interest is that the coin crosses a crack between tiles. We will model Buffon's coin problem with square tiles of side length 1; assuming the side length is 1 is equivalent to taking the side length as the unit of measurement.
Assumptions

First, let us define the experiment mathematically. As usual, we will idealize the physical objects by assuming that the coin is a perfect circle with radius r and that the cracks between tiles are line segments. A natural way to describe the outcome of the experiment is to record the center of the coin relative to the center of the tile where the coin happens to fall. More precisely, we will construct coordinate axes so that the tile where the coin falls occupies the square S = [−1/2, 1/2]².

Now when the coin is tossed, we will denote the center of the coin by (X, Y) ∈ S so that S is our sample space and X and Y are our basic random variables. Finally, we will assume that r < 1/2 so that it is at least possible for the coin to fall inside the square without touching a crack.
Figure 10.1.1 : Buffon's floor
Next, we need to define an appropriate probability measure that describes our basic random vector (X, Y). If the coin falls "randomly" on the floor, then it is natural to assume that (X, Y) is uniformly distributed on S. By definition, this means that
\[ P[(X, Y) \in A] = \frac{\text{area}(A)}{\text{area}(S)}, \quad A \subseteq S \tag{10.1.1} \]
Run Buffon's coin experiment with the default settings. Watch how the points seem to fill the sample space S in a uniform manner.
The Probability of a Crack Crossing

Our interest is in the probability of the event C that the coin crosses a crack.

The probability of a crack crossing is P(C) = 1 − (1 − 2r)².

Proof
Figure 10.1.2 : P(C ) as a function of r
In Buffon's coin experiment, change the radius with the scroll bar and watch how the events C and C^c change. Run the experiment with various values of r and compare the physical experiment with the points in the scatterplot. Compare the relative frequency of C to the probability of C.

The convergence of the relative frequency of an event (as the experiment is repeated) to the probability of the event is a special case of the law of large numbers.

Solve Buffon's coin problem with rectangular tiles that have height h and width w.

Answer

Solve Buffon's coin problem with equilateral triangular tiles that have side length 1.

Recall that random numbers are simulations of independent random variables, each with the standard uniform distribution, that is, the continuous uniform distribution on the interval (0, 1). Show how to simulate the center of the coin (X, Y) in Buffon's coin experiment using random numbers.

Answer
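One way to carry out the simulation asked for above is sketched below in Python, assuming numpy is available: the center (X, Y) is uniform on S = [−1/2, 1/2]², so two random numbers suffice, and the coin crosses a crack exactly when its center is within r of a side of the tile.

```python
# Simulate the coin center from random numbers and estimate P(C).
import numpy as np

rng = np.random.default_rng()
r = 0.2                                   # hypothetical coin radius, r < 1/2
u1, u2 = rng.random(100000), rng.random(100000)
x, y = u1 - 0.5, u2 - 0.5                 # X = U1 - 1/2, Y = U2 - 1/2
crossing = (np.abs(x) > 0.5 - r) | (np.abs(y) > 0.5 - r)
print(crossing.mean(), 1 - (1 - 2 * r) ** 2)   # relative frequency vs P(C)
```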
Buffon's Needle Problem Buffon's needle experiment consists of dropping a needle on a hardwood floor. The main event of interest is that the needle crosses a crack between floorboards. Strangely enough, the probability of this event leads to a statistical estimate of the number π!
Assumptions

Our first step is to define the experiment mathematically. Again we idealize the physical objects by assuming that the floorboards are uniform and that each has width 1. We will also assume that the needle has length L < 1 so that the needle cannot cross more than one crack. Finally, we assume that the cracks between the floorboards and the needle are line segments. When the needle is dropped, we want to record its orientation relative to the floorboard cracks. One way to do this is to record the angle X that the top half of the needle makes with the line through the center of the needle, parallel to the floorboards, and the distance Y from the center of the needle to the bottom crack. These will be the basic random variables of our experiment, and thus the sample space of the experiment is
\[ S = [0, \pi) \times [0, 1) = \{(x, y) : 0 \le x < \pi, \; 0 \le y < 1\} \tag{10.1.3} \]
Figure 10.1.3 : Buffon's needle problem
Again, our main modeling assumption is that the needle is tossed "randomly" on the floor. Thus, a reasonable mathematical assumption might be that the basic random vector (X, Y) is uniformly distributed over the sample space. By definition, this means that
\[ P[(X, Y) \in A] = \frac{\text{area}(A)}{\text{area}(S)}, \quad A \subseteq S \tag{10.1.4} \]
Run Buffon's needle experiment with the default settings and watch the outcomes being plotted in the sample space. Note how the points in the scatterplot seem to fill the sample space S in a uniform way.
The Probability of a Crack Crossing

Our main interest is in the event C that the needle crosses a crack between the floorboards. The event C can be written in terms of the basic angle and distance variables as follows:
\[ C = \left\{Y < \frac{L}{2}\sin(X)\right\} \cup \left\{Y > 1 - \frac{L}{2}\sin(X)\right\} \tag{10.1.5} \]
The curves y = (L/2) sin(x) and y = 1 − (L/2) sin(x) on the interval 0 ≤ x < π are shown in blue in the scatterplot of Buffon's needle experiment, and hence event C is the union of the regions below the lower curve and above the upper curve. Thus, the needle crosses a crack precisely when a point falls in this region.
The probability of a crack crossing is P(C) = 2L/π.

Proof
Figure 10.1.4 : P(C ) as a function of L
In the Buffon's needle experiment, vary the needle length L with the scroll bar and watch how the event C changes. Run the experiment with various values of L and compare the physical experiment with the points in the scatterplot. Compare the relative frequency of C to the probability of C. The convergence of the relative frequency of an event (as the experiment is repeated) to the probability of the event is a special case of the law of large numbers.

Find the probabilities of the following events in Buffon's needle experiment. In each case, sketch the event as a subset of the sample space.
1. {0 < X < π/2, 0 < Y < 1/3}
2. {1/4 < Y < 2/3}
3. {X < Y}
4. {X + Y < 2}

Answer
The Estimate of π

Suppose that we run Buffon's needle experiment a large number of times. By the law of large numbers, the proportion of crack crossings should be about the same as the probability of a crack crossing. More precisely, we will denote the number of crack crossings in the first n runs by N_n. Note that N_n is a random variable for the compound experiment that consists of n replications of the basic needle experiment. Thus, if n is large, we should have N_n/n ≈ 2L/π and hence
\[ \pi \approx \frac{2 L n}{N_n} \tag{10.1.6} \]
This is Buffon's famous estimate of π. In the simulation of Buffon's needle experiment, this estimate is computed on each run and shown numerically in the second table and visually in a graph.

Run the Buffon's needle experiment with needle lengths L ∈ {0.3, 0.5, 0.7, 1}. In each case, watch the estimate of π as the simulation runs.
Let us analyze the estimation problem more carefully. On each run j we have an indicator variable I_j, where I_j = 1 if the needle crosses a crack on run j and I_j = 0 if the needle does not cross a crack on run j. These indicator variables are independent and identically distributed, since we are assuming independent replications of the experiment. Thus, the sequence of indicator variables forms a Bernoulli trials process.
The number of crack crossings in the first n runs of the experiment is
\[ N_n = \sum_{j=1}^{n} I_j \tag{10.1.7} \]
which has the binomial distribution with parameters n and 2L/π. The mean and variance of N_n are
1. E(N_n) = n(2L/π)
2. var(N_n) = n(2L/π)(1 − 2L/π)

With probability 1, N_n/(2Ln) → 1/π as n → ∞ and 2Ln/N_n → π as n → ∞.
Proof

Thus, we have two basic estimators: N_n/(2Ln) as an estimator of 1/π and 2Ln/N_n as an estimator of π. The estimator of 1/π has several important statistical properties. First, it is unbiased since the expected value of the estimator is the parameter being estimated:

The estimator of 1/π is unbiased:
\[ E\left(\frac{N_n}{2Ln}\right) = \frac{1}{\pi} \tag{10.1.8} \]
Proof

Since this estimator is unbiased, the variance gives the mean square error:
\[ \operatorname{var}\left(\frac{N_n}{2Ln}\right) = E\left[\left(\frac{N_n}{2Ln} - \frac{1}{\pi}\right)^2\right] \tag{10.1.9} \]

The mean square error of the estimator of 1/π is
\[ \operatorname{var}\left(\frac{N_n}{2Ln}\right) = \frac{\pi - 2L}{2Ln\pi^2} \tag{10.1.10} \]

The variance is a decreasing function of the needle length L. Thus, the estimator of 1/π improves as the needle length increases. On the other hand, the estimator of π is biased; it tends to overestimate π:

The estimator of π is positively biased:
\[ E\left(\frac{2Ln}{N_n}\right) \ge \pi \tag{10.1.11} \]
Proof

The estimator of π also tends to improve as the needle length increases. This is not easy to see mathematically. However, you can see it empirically.

In the Buffon's needle experiment, run the simulation 5000 times each with L = 0.3, L = 0.5, L = 0.7, and L = 0.9. Note how well the estimator seems to work in each case.
Finally, we should note that as a practical matter, Buffon's needle experiment is not a very efficient method of approximating π. According to Richard Durrett, to estimate π to four decimal places with L = 1/2 would require about 100 million tosses!
Run the Buffon's needle experiment until the estimates of π seem to be consistently correct to two decimal places. Note the number of runs required. Try this for needle lengths L = 0.3 , L = 0.5 , L = 0.7 , and L = 0.9 and compare the results.
Show how to simulate the angle X and distance Y in Buffon's needle experiment using random numbers. Answer
Notes

Buffon's needle problem is essentially solved by Monte-Carlo integration. In general, Monte-Carlo methods use statistical sampling to approximate the solutions of problems that are difficult to solve analytically. The modern theory of Monte-Carlo methods began with Stanislaw Ulam, who used the methods on problems associated with the development of the hydrogen bomb. The original needle problem has been extended in many ways, starting with Pierre-Simon Laplace, who considered a floor with rectangular tiles. Indeed, variations on the problem are active research problems even today.

Neil Weiss has pointed out that our computer simulation of Buffon's needle experiment is circular, in the sense that the program assumes knowledge of π (you can see this from the simulation result above). Try to write a computer algorithm for Buffon's needle problem, without assuming the value of π or any other transcendental numbers.
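One possible approach to this exercise is sketched below in Python, assuming numpy is available. The trick is to generate the needle angle without using π: if (U, V) is uniform on [−1, 1] × [0, 1] conditioned on U² + V² ≤ 1, then the angle of the point (U, V) is uniform on [0, π), and sin(X) = V/√(U² + V²). The crossing condition is the event C in (10.1.5).

```python
# A pi-free simulation of Buffon's needle, via rejection sampling of the angle.
import numpy as np

rng = np.random.default_rng()
L, n = 0.5, 100000
crossings = count = 0
while count < n:
    u, v = rng.uniform(-1, 1), rng.random()
    r2 = u * u + v * v
    if r2 > 1 or r2 == 0:
        continue                      # reject points outside the upper half-disk
    sin_x = v / np.sqrt(r2)           # sin of an angle uniform on [0, pi)
    y = rng.random()                  # distance from needle center to bottom crack
    if y < (L / 2) * sin_x or y > 1 - (L / 2) * sin_x:
        crossings += 1
    count += 1
print(2 * L * n / crossings)          # estimate of pi; no pi used in the simulation
```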
This page titled 10.1: Buffon's Problems is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
10.2: Bertrand's Paradox

Preliminaries

Statement of the Problem

Bertrand's problem is to find the probability that a "random chord" on a circle will be longer than the length of a side of the inscribed equilateral triangle. The problem is named after the French mathematician Joseph Louis Bertrand, who studied the problem in 1889. It turns out, as we will see, that there are (at least) three answers to Bertrand's problem, depending on how one interprets the phrase "random chord". The lack of a unique answer was considered a paradox at the time, because it was assumed (naively, in hindsight) that there should be a single natural answer.

Run Bertrand's experiment 100 times for each of the following models. Do not be concerned with the exact meaning of the models, but see if you can detect a difference in the behavior of the outcomes.
1. Uniform distance
2. Uniform angle
3. Uniform endpoint
Mathematical Formulation

To formulate the problem mathematically, let us take (0, 0) as the center of the circle and take the radius of the circle to be 1. These assumptions entail no loss of generality because they amount to measuring distances relative to the center of the circle, and taking the radius of the circle as the unit of length. Now consider a chord on the circle. By rotating the circle, we can assume that one point of the chord is (1, 0) and the other point is (X, Y) where Y > 0 and X² + Y² = 1.
With these assumptions, the chord is completely specified by giving any one of the following variables:
1. The (perpendicular) distance D from the center of the circle to the midpoint of the chord. Note that 0 ≤ D ≤ 1.
2. The angle A between the x-axis and the line from the center of the circle to the midpoint of the chord. Note that 0 ≤ A ≤ π/2.
3. The horizontal coordinate X. Note that −1 ≤ X ≤ 1.
Figure 10.2.1 : A chord in the circle
The variables are related as follows:
1. D = cos(A)
2. X = 2D² − 1
3. Y = 2D√(1 − D²)

The inverse relations are given below. Note again that there are one-to-one correspondences between X, A, and D.
1. A = arccos(D)
2. D = √[(X + 1)/2]
3. D = √[1/2 ± (1/2)√(1 − Y²)]
If the chord is generated in a probabilistic way, D, A, X, and Y become random variables. In light of the previous results, specifying the distribution of any of the variables D, A, or X completely determines the distribution of all four variables. The angle A is also the angle between the chord and the tangent line to the circle at (1, 0). Now consider the equilateral triangle inscribed in the circle so that one of the vertices is (1, 0). Consider the chord defined by the upper side of the triangle. For this chord, the angle, distance, and coordinate variables are given as follows:
1. a = π/3
2. d = 1/2
3. x = −1/2
4. y = √3/2
Figure 10.2.2 : The inscribed equilateral triangle
Now suppose that a chord is chosen in a probabilistic way. The length of the chord is greater than the length of a side of the inscribed equilateral triangle if and only if the following equivalent conditions occur:
1. 0 < D < 1/2
2. π/3 < A < π/2
3. −1 < X < −1/2
Models

When an object is generated "at random", a sequence of "natural" variables that determines the object should be given an appropriate uniform distribution. The coordinates of the coin center are such a sequence in Buffon's coin experiment; the angle and distance variables are such a sequence in Buffon's needle experiment. The crux of Bertrand's paradox is the fact that the distance D, the angle A, and the coordinate X each seems to be a natural variable that determines the chord, but different models are obtained, depending on which is given the uniform distribution.
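The paradox is easy to see numerically. Here is a minimal sketch in Python, assuming numpy: each model makes a different variable uniform, and then we apply the equivalent conditions stated above.

```python
# Estimate the Bertrand probability under the three models.
import numpy as np

rng = np.random.default_rng()
n = 1000000
D = rng.random(n)                              # uniform distance on [0, 1]
A = rng.random(n) * (np.pi / 2)                # uniform angle on [0, pi/2]
X = rng.uniform(-1, 1, n)                      # uniform coordinate on [-1, 1]
print((D < 1/2).mean())                        # approximately 1/2
print(((np.pi/3 < A) & (A < np.pi/2)).mean())  # approximately 1/3
print(((-1 < X) & (X < -1/2)).mean())          # approximately 1/4
```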
The Model with Uniform Distance

Suppose that D is uniformly distributed on the interval [0, 1]. In this case the solution of Bertrand's problem is P(D < 1/2) = 1/2.
1. P(V_k = n) > P(V_k = n − 1) if and only if n < t.
2. The probability density function at first increases and then decreases, reaching its maximum value at ⌊t⌋.
3. There is a single mode at ⌊t⌋ if t is not an integer, and two consecutive modes at t − 1 and t if t is an integer.
Times Between Successes

Next we will define the random variables that give the number of trials between successive successes. Let U_1 = V_1 and U_k = V_k − V_{k−1} for k ∈ {2, 3, …}.

U = (U_1, U_2, …) is a sequence of independent random variables, each having the geometric distribution on N_+ with parameter p. Moreover,
\[ V_k = \sum_{i=1}^{k} U_i \tag{11.4.5} \]
In statistical terms, U corresponds to sampling from the geometric distribution with parameter p, so that for each k, (U_1, U_2, …, U_k) is a random sample of size k from this distribution. The sample mean corresponding to this sample is V_k/k; this random variable gives the average number of trials between the first k successes. In probability terms, the sequence of negative binomial variables V is the partial sum process corresponding to the sequence U. Partial sum processes are studied in more generality in the chapter on Random Samples.
The random process V = (V_1, V_2, …) has stationary, independent increments:
1. If j < k then V_k − V_j has the same distribution as V_{k−j}.
2. If k_1 < k_2 < k_3 < ⋯ then (V_{k_1}, V_{k_2} − V_{k_1}, V_{k_3} − V_{k_2}, …) is a sequence of independent random variables.
If n > m then
1. A_n(p) < A_m(p) if 0 < p < 1/2
2. A_n(p) > A_m(p) if 1/2 < p < 1
Proof

Let N_n denote the number of trials in the series. Then N_n has probability density function
\[ P(N_n = k) = \binom{k-1}{n-1}\left[p^n (1-p)^{k-n} + (1-p)^n p^{k-n}\right], \quad k \in \{n, n+1, \ldots, 2n-1\} \tag{11.4.21} \]
Proof

Explicitly compute the probability density function, expected value, and standard deviation for the number of games in a best of 7 series with the following values of p:
1. 0.5
2. 0.7
3. 0.9

Answer
Division of Stakes

The problem of points originated from a question posed by Chevalier de Méré, who was interested in the fair division of stakes when a game is interrupted. Specifically, suppose that players A and B each put up c monetary units, and then play Bernoulli trials until one of them wins a specified number of trials. The winner then takes the entire 2c fortune. If the game is interrupted when A needs to win n more trials and B needs to win m more trials, then the fortune should be divided between A and B, respectively, as follows:
1. 2c A_{n,m}(p) for A
2. 2c[1 − A_{n,m}(p)] = 2c A_{m,n}(1 − p) for B
Suppose that players A and B bet $50 each. The players toss a fair coin until one of them has 10 wins; the winner takes the entire fortune. Suppose that the game is interrupted by the gambling police when A has 5 wins and B has 3 wins. How should the stakes be divided? Answer
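A sketch of the computation for this exercise follows, assuming scipy is available and assuming the standard identity A_{n,m}(p) = P(at least n successes in n + m − 1 trials), since the series must be decided within that many further trials.

```python
# Fair division of stakes via a binomial tail probability.
from scipy.stats import binom

def share_of_A(c, n, m, p):
    # A needs n more wins, B needs m more; total stake is 2c
    A_nm = binom.sf(n - 1, n + m - 1, p)   # P(at least n successes)
    return 2 * c * A_nm

# the interrupted $50 coin-tossing game: A needs 5 more wins, B needs 7 more
print(share_of_A(50, 5, 7, 0.5))
```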
Alternate and General Versions

Let's return to the formulation at the beginning of this section. Thus, suppose that we have a sequence of Bernoulli trials X with success parameter p ∈ (0, 1], and for k ∈ N_+, we let V_k denote the trial number of the kth success. Thus, V_k has the negative binomial distribution with parameters k and p as we studied above. The random variable W_k = V_k − k is the number of failures before the kth success. Let N_1 = W_1, the number of failures before the first success, and let N_k = W_k − W_{k−1}, the number of failures between the (k − 1)st success and the kth success, for k ∈ {2, 3, …}.
N = (N_1, N_2, …) is a sequence of independent random variables, each having the geometric distribution on N with parameter p. Moreover,
\[ W_k = \sum_{i=1}^{k} N_i \tag{11.4.22} \]
Thus, W = (W_1, W_2, …) is the partial sum process associated with N. In particular, W has stationary, independent increments.
Probability Density Functions

The probability density function of W_k is given by
\[ P(W_k = n) = \binom{n+k-1}{n} p^k (1-p)^n = \binom{n+k-1}{k-1} p^k (1-p)^n, \quad n \in \mathbb{N} \tag{11.4.23} \]
Proof

The distribution of W_k is also referred to as the negative binomial distribution with parameters k and p. Thus, the term negative binomial distribution can refer either to the distribution of the trial number of the kth success or the distribution of the number of failures before the kth success, depending on the author and the context. The two random variables differ by a constant, so it's not a particularly important issue as long as we know which version is intended. In this text, we will refer to the alternate version as the negative binomial distribution on N, to distinguish it from the original version, which has support set {k, k + 1, …}.
More interestingly, however, the probability density function in the last result makes sense for any k ∈ (0, ∞), not just integers. To see this, first recall the definition of the general binomial coefficient: if a ∈ R and n ∈ N, we define
\[ \binom{a}{n} = \frac{a^{(n)}}{n!} = \frac{a(a-1)\cdots(a-n+1)}{n!} \tag{11.4.24} \]
The function f given below defines a probability density function for every p ∈ (0, 1) and k ∈ (0, ∞):
\[ f(n) = \binom{n+k-1}{n} p^k (1-p)^n, \quad n \in \mathbb{N} \tag{11.4.25} \]
Proof
Once again, the distribution defined by the probability density function in the last theorem is the negative binomial distribution on N, with parameters k and p. The special case when k is a positive integer is sometimes referred to as the Pascal distribution, in honor of Blaise Pascal.

The distribution is unimodal. Let t = (k − 1)(1 − p)/p.
1. f(n − 1) < f(n) if and only if n < t.
2. The distribution has a single mode at ⌊t⌋ if t is not an integer.
3. The distribution has two consecutive modes at t − 1 and t if t is a positive integer.
Basic Properties

Suppose that W has the negative binomial distribution on N with parameters k ∈ (0, ∞) and p ∈ (0, 1). To establish basic properties, we can no longer use the decomposition of W as a sum of independent geometric variables. Instead, the best approach is to derive the probability generating function and then use the generating function to obtain other basic properties.

W has probability generating function P given by
\[ P(t) = E(t^W) = \left(\frac{p}{1 - (1-p)t}\right)^k, \quad |t| < \frac{1}{1-p} \]
The probability density function of Z_{2n} is symmetric about n and is u-shaped: P(Z_{2n} = 2j) > P(Z_{2n} = 2k) if and only if j < k and 2k ≤ n.

In particular, 0 and 2n are the most likely values and hence are the modes of the distribution. The discrete arcsine distribution is quite surprising. Since we are tossing a fair coin to determine the steps of the walker, you might easily think that the random walk should be positive half of the time and negative half of the time, and that it should return to 0 frequently. But in fact, the arcsine law implies that with probability 1/2, there will be no return to 0 during the second half of the walk, from time n + 1 to 2n, regardless of n, and it is not uncommon for the walk to stay positive (or negative) during the entire time from 1 to 2n.
Explicitly compute the probability density function, mean, and variance of Z_{10}.

Answer
The Ballot Problem and the First Return to Zero

The Ballot Problem

Suppose that in an election, candidate A receives a votes and candidate B receives b votes where a > b. Assuming a random ordering of the votes, what is the probability that A is always ahead of B in the vote count? This is a historically famous problem known as the Ballot Problem; it was solved by Joseph Louis Bertrand in 1887. The ballot problem is intimately related to simple random walks.

Comment on the validity of the assumption that the voters are randomly ordered for a real election.

The ballot problem can be solved by using a simple conditional probability argument to obtain a recurrence relation. Let f(a, b) denote the probability that A is always ahead of B in the vote count.

f satisfies the initial condition f(1, 0) = 1 and the following recurrence relation:
\[ f(a, b) = \frac{a}{a+b} f(a-1, b) + \frac{b}{a+b} f(a, b-1) \tag{11.6.17} \]
Proof

The probability that A is always ahead in the vote count is
\[ f(a, b) = \frac{a - b}{a + b} \tag{11.6.18} \]
Proof

In the ballot experiment, vary the parameters a and b and note the change in the ballot probability. For selected values of the parameters, run the experiment 1000 times and compare the relative frequency to the true probability.

In an election for mayor of a small town, Mr. Smith received 4352 votes while Ms. Jones received 7543 votes. Compute the probability that Jones was always ahead of Smith in the vote count.

Answer
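A quick way to check the recurrence against the closed form is sketched below, assuming only the Python standard library. The recurrence is evaluated bottom-up (row by row in b) to avoid deep recursion.

```python
# Evaluate the ballot recurrence (11.6.17) and compare with (a - b)/(a + b).
def f(a, b):
    row = [0.0] + [1.0] * a              # f(i, 0) = 1 for i >= 1
    for j in range(1, b + 1):
        new = [0.0] * (a + 1)
        for i in range(j + 1, a + 1):    # f(i, j) = 0 unless i > j
            new[i] = (i * new[i - 1] + j * row[i]) / (i + j)
        row = new
    return row[a]

print(f(4, 3), (4 - 3) / (4 + 3))        # recurrence vs closed form
# For the election exercise, the closed form gives
print((7543 - 4352) / (7543 + 4352))
```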
Relation to Random Walks

Consider again the simple random walk X with parameter p. Given X_n = k,
1. There are (n + k)/2 steps to the right and (n − k)/2 steps to the left.
2. All possible orderings of the steps to the right and the steps to the left are equally likely.

For k > 0,
\[ P(X_1 > 0, X_2 > 0, \ldots, X_{n-1} > 0 \mid X_n = k) = \frac{k}{n} \tag{11.6.19} \]
Proof

In the ballot experiment, vary the parameters a and b and note the change in the ballot probability. For selected values of the parameters, run the experiment 1000 times and compare the relative frequency to the true probability.

An American roulette wheel has 38 slots; 18 are red, 18 are black, and 2 are green. Fred bet $1 on red, at even stakes, 50 times, winning 22 times and losing 28 times. Find the probability that Fred's net fortune was always negative.

Answer

Roulette is studied in more detail in the chapter on Games of Chance.
The Distribution of the First Zero

Consider again the simple random walk with parameter p, as in the last subsection. Let T denote the time of the first return to 0:
\[ T = \min\{n \in \mathbb{N}_+ : X_n = 0\} \tag{11.6.20} \]
Note that returns to 0 can only occur at even times; it may also be possible that the random walk never returns to 0. Thus, T takes values in the set {2, 4, …} ∪ {∞}.

The probability density function of T is given by
\[ P(T = 2n) = \frac{1}{2n-1}\binom{2n}{n} p^n (1-p)^n, \quad n \in \mathbb{N}_+ \tag{11.6.21} \]
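The next exercise is a direct application of this formula with p = 1/2. A short sketch of the computation, assuming only the Python standard library:

```python
# P(T = 2n) for the fair-coin random walk: first tie of the two scores at toss 2n.
from math import comb

p = 0.5
for n in range(1, 6):                   # tosses 2, 4, 6, 8, 10
    prob = comb(2 * n, n) * p**n * (1 - p)**n / (2 * n - 1)
    print(2 * n, prob)
```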
Proof

Fred and Wilma are tossing a fair coin; Fred gets a point for each head and Wilma gets a point for each tail. Find the probability that their scores are equal for the first time after n tosses, for each n ∈ {2, 4, 6, 8, 10}.

Answer

This page titled 11.6: The Simple Random Walk is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
11.7: The Beta-Bernoulli Process

An interesting thing to do in almost any parametric probability model is to "randomize" one or more of the parameters. Done in a clever way, this often leads to interesting new models and unexpected connections between models. In this section we will randomize the success parameter in the Bernoulli trials process. This leads to interesting and surprising connections with Pólya's urn process.
Basic Theory

Definitions

First, recall that the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞) is a continuous distribution on the interval (0, 1) with probability density function g given by
\[ g(p) = \frac{1}{B(a, b)} p^{a-1} (1-p)^{b-1}, \quad p \in (0, 1) \tag{11.7.1} \]
where B is the beta function. So B(a, b) is simply the normalizing constant for the function p ↦ p^{a−1}(1 − p)^{b−1} on the interval (0, 1). Here is our main definition:
Suppose that P has the beta distribution with left parameter a ∈ (0, ∞) and right parameter b ∈ (0, ∞). Next suppose that X = (X_1, X_2, …) is a sequence of indicator random variables with the property that given P = p ∈ (0, 1), X is a conditionally independent sequence with
\[ P(X_i = 1 \mid P = p) = p, \quad i \in \mathbb{N}_+ \tag{11.7.2} \]
Then X is the beta-Bernoulli process with parameters a and b.

In short, given P = p, the sequence X is a Bernoulli trials sequence with success parameter p. In the usual language of reliability, X_i is the outcome of trial i, where 1 denotes success and 0 denotes failure. For a specific application, suppose that we select a random probability of heads according to the beta distribution with parameters a and b, and then toss a coin with this probability of heads repeatedly.
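The two-stage structure of the definition translates directly into a simulation. Here is a minimal sketch, assuming numpy is available: draw P once from the beta distribution, then generate the conditionally independent indicator variables.

```python
# Simulate one run of the beta-Bernoulli process.
import numpy as np

rng = np.random.default_rng()

def beta_bernoulli(a, b, n):
    p = rng.beta(a, b)                        # the randomized success probability P
    return (rng.random(n) < p).astype(int)    # X_1, ..., X_n given P = p

x = beta_bernoulli(2.0, 3.0, 20)
print(x, x.mean())
```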
Outcome Variables

What's our first step? Well, of course we need to compute the finite dimensional distributions of X. Recall that for r ∈ R and j ∈ N, r^{[j]} denotes the ascending power r(r + 1) ⋯ [r + (j − 1)]. By convention, a product over an empty index set is 1, so r^{[0]} = 1.

Suppose that n ∈ N_+ and (x_1, x_2, …, x_n) ∈ {0, 1}^n. Let k = x_1 + x_2 + ⋯ + x_n. Then
\[ P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a^{[k]} b^{[n-k]}}{(a+b)^{[n]}} \tag{11.7.3} \]
Proof

From this result, it follows that Pólya's urn process with parameters a, b, c ∈ N_+ is equivalent to the beta-Bernoulli process with parameters a/c and b/c, quite an interesting result. Note that since the joint distribution above depends only on x_1 + x_2 + ⋯ + x_n, the sequence X is exchangeable. Finally, it's interesting to note that the beta-Bernoulli process with parameters a and b could simply be defined as the sequence with the finite-dimensional distributions above, without reference to the beta distribution! It turns out that every exchangeable sequence of indicator random variables can be obtained by randomizing the success parameter in a sequence of Bernoulli trials. This is de Finetti's theorem, named for Bruno de Finetti, which is studied in the section on backwards martingales.
For each i ∈ N_+,
1. E(X_i) = a/(a + b)
2. var(X_i) = [a/(a + b)][b/(a + b)]
Proof

Thus X is a sequence of identically distributed variables, quite surprising at first but of course inevitable for any exchangeable sequence. Compare the joint distribution with the marginal distributions. Clearly the variables are dependent, so let's compute the covariance and correlation of a pair of outcome variables.

Suppose that i, j ∈ N_+ are distinct. Then
1. cov(X_i, X_j) = ab/[(a + b)²(a + b + 1)]
2. cor(X_i, X_j) = 1/(a + b + 1)
1,
n
x2 , … , xn ) ∈ {0, 1 }
n
. Let k = ∑
i=1
xi
. Then a+k
P(Xn+1 = 1 ∣ X1 = x1 , X2 = x2 , … Xn = xn ) =
(11.7.7) a+b +n
Proof The beta-Bernoulli model starts with the conditional distribution of X given P . Let's find the conditional distribution in the other direction. Suppose that
n ∈ N+
and
. Let k = ∑ x . Then the conditional distribution of is beta with left parameter a + k and right parameter b + (n − k) . Hence n
n
(x1 , x2 , … , xn ) ∈ {0, 1 }
(X1 = x1 , X2 , = x2 , … , Xn = xn )
i=1
i
P
given
a+k E(P ∣ X1 = x1 , X2 = x2 , … , Xn = xn ) =
(11.7.8) a+b +k
Proof Thus, the left parameter increases by the number of successes while the right parameter increases by the number of failures. In the language of Bayesian statistics, the original distribution of P is the prior distribution, and the conditional distribution of P given the data (x , x , … , x ) is the posterior distribution. The fact that the posterior distribution is beta whenever the prior distribution is beta means that the beta distributions is conjugate to the Bernoulli distribution. The conditional expected value in the last theorem is the Bayesian estimate of p when p is modeled by the random variable P . These concepts are studied in more generality in the section on Bayes Estimators in the chapter on Point Estimation. It's also interesting to note that the expected values in the last two theorems are the same: If n ∈ N , (x , x , … , x ) ∈ {0, 1} and k = ∑ x then 1
2
n
n
n
1
2
n
i=1
i
a+k E(Xn+1 ∣ X1 = x1 , … , Xn = xn ) = E(P ∣ X1 = x1 , … , Xn = xn ) =
(11.7.12) a+b +n
Run the simulation of the beta coin experiment for various values of the parameter. Note how the posterior probability density function changes from the prior probability density function, given the number of heads.
The Number of Successes It's already clear that the number of successes in a given number of trials plays an important role, so let's study these variables. For n ∈ N , let +
n
Yn = ∑ Xi
(11.7.13)
i=1
denote the number of successes in the first X = (X , X , …). 1
Yn
n
trials. Of course,
Y = (Y0 , Y1 , …)
is the partial sum process associated with
2
has probability density function given by
11.7.2
https://stats.libretexts.org/@go/page/10239
[k]
P(Yn = k) = (
[n−k]
n a b ) , [n] k (a + b)
k ∈ {0, 1, … , n}
(11.7.14)
Proof The distribution of Y is known as the beta-binomial distribution with parameters n , a , and b . n
In the simulation of the beta-binomial experiment, vary the parameters and note how the shape of the probability density function of Y (discrete) parallels the shape of the probability density function of P (continuous). For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. n
The case where the parameters are both 1 is interesting. If a = b = 1 , so that P is uniformly distributed on (0, 1), then Y is uniformly distributed on {0, 1, … , n}. n
Proof Next, let's compute the mean and variance of Y . n
The mean and variance of Y are n
1. E(Y
n)
=n
2. var(Y
n)
a a+b
=n
ab 2
[1 + (n − 1)
(a+b)
1 a+b+1
]
Proof In the simulation of the beta-binomial experiment, vary the parameters and note the location and size of the mean-standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical moments to the true moments. We can restate the conditional distributions in the last subsection more elegantly in terms of Y . n
Let n ∈ N . 1. The conditional distribution of X
n+1
given Y is n
a + Yn
P(Xn+1 = 1 ∣ Yn ) = E(Xn+1 ∣ Yn ) =
(11.7.18)
a+b +n
2. The conditional distribution of P given Y is beta with left parameter a + Y and right parameter b + (n − Y particular n
n)
n
E(P ∣ Yn ) =
a + Yn
. In
(11.7.19)
a+b +n
Proof Once again, the conditional expected value E(P ∣ Y ) is the Bayesian estimator of p. In particular, if a = b = 1 , so that P has the uniform distribution on (0, 1), then P(X = 1 ∣ Y = n) = . This is Laplace's rule of succession, another interesting connection. The rule is named for Pierre Simon Laplace, and is studied from a different point of view in the section on Independence. n
n+1
n+1
n
n+2
The Proportion of Successes For n ∈ N , let +
Mn =
Yn n
1 = n
n
∑ Xi
(11.7.24)
i=1
so that M is the sample mean of (X , X , … , X ), or equivalently the proportion of successes in the first n trials. Properties of M follow easily from the corresponding properties of Y . In particular, P(M = k/n) = P(Y = k) for k ∈ {0, 1, … , n} as given above, so let's move on to the mean and variance. n
n
1
2
n
n
n
11.7.3
n
https://stats.libretexts.org/@go/page/10239
For n ∈ N , the mean and variance of M are +
1. E(M
n)
2. var(M
n
=
n)
a a+b
=
1 n
ab 2
+
n−1 n
(a+b)
ab 2
(a+b) (a+b+1)
Proof So E(M ) is constant in n ∈ N while var(M ) → ab/(a + b) (a + b + 1) as n → ∞ . These results suggest that perhaps M has a limit, in some sense, as n → ∞ . For an ordinary sequence of Bernoulli trials with success parameter p ∈ (0, 1), we know from the law of large numbers that M → p as n → ∞ with probability 1 and in mean (and hence also in distribution). What happens here when the success probability P has been randomized with the beta distribution? The answer is what we might hope. 2
n
+
n
n
n
as n → ∞ with probability 1 and in mean square, and hence also in in distribution.
Mn → P
Proof Proof of convergence in distribution Recall again that the Bayesian estimator of p based on (X
1,
E(P ∣ Yn ) =
It follows from the last theorem that E(P
X2 , … , Xn )
is a/n + Mn
a + Yn
=
a+b +n
(11.7.33) a/n + b/n + 1
with probability 1, in mean square, and in distribution. The stochastic process that we have seen several times now is of fundamental importance, and turns out to be a martingale. The theory of martingales provides powerful tools for studying convergence in the beta-Bernoulli process. ∣ Yn ) → P
Z = { Zn = (a + Yn )/(a + b + n) : n ∈ N}
The Trial Number of a Success For
k ∈ N+
, let
denote the trial number of the k th success. As we have seen before in similar circumstances, the process can be defined in terms of the process Y :
Vk
V = (V1 , V2 , …)
Vk = min{n ∈ N+ : Yn = k},
Note that V takes values in other in a sense.
{k, k + 1, …}
k
For k ∈ N and n ∈ N
+
1. V 2. V
k
≤n
k
=n
. The random processes
k ∈ N+
V = (V1 , V2 , …)
(11.7.34)
and Y
= (Y0 , Y1 , …)
are inverses of each
with k ≤ n ,
if and only if Y if and only if Y
n
≥k
n−1
= k−1
and X
n
=1
The probability denisty function of V is given by k
[k]
n−1 P(Vk = n) = (
a )
k−1
[n−k]
b
[n]
,
n ∈ {k, k + 1, …}
(11.7.35)
(a + b)
Proof 1 Proof 2 The distribution of V is known as the beta-negative binomial distribution with parameters k , a , and b . k
If a = b = 1 so that P is uniformly distributed on (0, 1), then k P(Vk = n) =
,
n ∈ {k, k + 1, k + 2, …}
(11.7.36)
n(n + 1)
Proof In the simulation of the beta-negative binomial experiment, vary the parameters and note the shape of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function.
11.7.4
https://stats.libretexts.org/@go/page/10239
The mean and variance of V are k
1. E(V
k)
2. var(V
=k
k)
a+b−1 a−1
=k
if a > 1 .
a+b−1 (a−1)(a−2)
2
[b + k(a + b − 2)] − k (
a+b−1 a−1
2
)
Proof

In the simulation of the beta-negative binomial experiment, vary the parameters and note the location and size of the mean±standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical moments to the true moments.

This page titled 11.7: The Beta-Bernoulli Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
CHAPTER OVERVIEW

12: Finite Sampling Models

This chapter explores a number of models and problems based on sampling from a finite population. Sampling without replacement from a population of objects of various types leads to the hypergeometric and multivariate hypergeometric models. Sampling with replacement from a finite population leads naturally to the birthday and coupon-collector problems. Sampling without replacement from an ordered population leads naturally to the matching problem and to the study of order statistics.

12.1: Introduction to Finite Sampling Models
12.2: The Hypergeometric Distribution
12.3: The Multivariate Hypergeometric Distribution
12.4: Order Statistics
12.5: The Matching Problem
12.6: The Birthday Problem
12.7: The Coupon Collector Problem
12.8: Pólya's Urn Process
12.9: The Secretary Problem
This page titled 12: Finite Sampling Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.1: Introduction to Finite Sampling Models

Basic Theory

Sampling Models

Suppose that we have a population D of m objects. The population could be a deck of cards, a set of people, an urn full of balls, or any number of other collections. In many cases, we simply label the objects from 1 to m, so that D = {1, 2, …, m}. In other cases (such as the card experiment), it may be more natural to label the objects with vectors. In any case, D is usually a finite subset of R^k for some k ∈ N_+.
Our basic experiment consists of selecting n objects from the population D at random and recording the sequence of objects chosen. Thus, the outcome is X = (X_1, X_2, …, X_n) where X_i ∈ D is the ith object chosen. If the sampling is with replacement, the sample size n can be any positive integer. In this case, the sample space S is
\[ S = D^n = \{(x_1, x_2, \ldots, x_n) : x_i \in D \text{ for each } i\} \tag{12.1.1} \]
If the sampling is without replacement, the sample size n can be no larger than the population size m. In this case, the sample space S consists of all permutations of size n chosen from D:
\[ S = D_n = \{(x_1, x_2, \ldots, x_n) : x_i \in D \text{ for each } i \text{ and } x_i \ne x_j \text{ for all } i \ne j\} \tag{12.1.2} \]
From the multiplication principle of combinatorics,
1. #(D^n) = m^n
2. #(D_n) = m^{(n)} = m(m − 1) ⋯ (m − n + 1)
With either type of sampling, we assume that the samples are equally likely and thus that the outcome variable X is uniformly distributed on the appropriate sample space S; this is the meaning of the phrase random sample:
\[ P(X \in A) = \frac{\#(A)}{\#(S)}, \quad A \subseteq S \tag{12.1.3} \]
The Exchangeable Property

Suppose again that we select n objects at random from the population D, either with or without replacement, and record the ordered sample X = (X_1, X_2, …, X_n).
Any permutation of X has the same distribution as X itself, namely the uniform distribution on the appropriate sample space S:
1. D^n if the sampling is with replacement.
2. D_n if the sampling is without replacement.
A sequence of random variables with this property is said to be exchangeable. Although this property is very simple to understand, both intuitively and mathematically, it is nonetheless very important. We will use the exchangeable property often in this chapter. More generally, any sequence of k of the n outcome variables is uniformly distributed on the appropriate sample space:
1. D^k if the sampling is with replacement.
2. D_k if the sampling is without replacement.
In particular, for either sampling method, X_i is uniformly distributed on D for each i ∈ {1, 2, …, n}.
If the sampling is with replacement then X = (X_1, X_2, …, X_n) is a sequence of independent random variables.
Thus, when the sampling is with replacement, the sample variables form a random sample from the uniform distribution, in statistical terminology.
If the sampling is without replacement, then the conditional distribution of a sequence of k of the outcome variables, given the values of a sequence of j other outcome variables, is the uniform distribution on the set of permutations of size k chosen from the population when the j known values are removed (of course, j + k ≤ n). In particular, X_i and X_j are dependent for any distinct i and j when the sampling is without replacement.
The Unordered Sample

In many cases when the sampling is without replacement, the order in which the objects are chosen is not important; all that matters is the (unordered) set of objects:
\[ W = \{X_1, X_2, \ldots, X_n\} \tag{12.1.4} \]
The random set W takes values in the set of combinations of size n chosen from D:
\[ T = \{\{x_1, x_2, \ldots, x_n\} : x_i \in D \text{ for each } i \text{ and } x_i \ne x_j \text{ for all } i \ne j\} \tag{12.1.5} \]
Recall that #(T) = C(m, n).
W is uniformly distributed over T:
\[ P(W \in B) = \frac{\#(B)}{\#(T)} = \frac{\#(B)}{\binom{m}{n}}, \quad B \subseteq T \tag{12.1.6} \]
Proof

Suppose now that the sampling is with replacement, and we again denote the unordered outcome by W. In this case, W takes values in the collection of multisets of size n from D. (A multiset is like an ordinary set, except that repeated elements are allowed).
\[ T = \{\{x_1, x_2, \ldots, x_n\} : x_i \in D \text{ for each } i\} \tag{12.1.7} \]
Recall that #(T) = C(m + n − 1, n).
W is not uniformly distributed on T.
Summary of Sampling Formulas

The following table summarizes the formulas for the number of samples of size n chosen from a population of m elements, based on the criteria of order and replacement.

Number of samples      With order       Without order
With replacement       m^n              C(m + n − 1, n)
Without replacement    m^(n)            C(m, n)
Examples and Applications

Suppose that a sample of size 2 is chosen from the population {1, 2, 3, 4}. Explicitly list all samples in the following cases:
1. Ordered samples, with replacement.
2. Ordered samples, without replacement.
3. Unordered samples, with replacement.
4. Unordered samples, without replacement.

Answer
Multi-type Populations A dichotomous population consists of two types of objects.
Suppose that a batch of 100 components includes 10 that are defective. A random sample of 5 components is selected without replacement. Compute the probability that the sample contains at least one defective component. Answer An urn contains 50 balls, 30 red and 20 green. A sample of 15 balls is chosen at random. Find the probability that the sample contains 10 red balls in each of the following cases: 1. The sampling is without replacement 2. The sampling is with replacement Answer In the ball and urn experiment select 50 balls with 30 red balls, and sample size 15. Run the experiment 100 times. Compute the relative frequency of the event that the sample has 10 red balls in each of the following cases, and compare with the respective probability in the previous exercise: 1. The sampling is without replacement 2. The sampling is with replacement Suppose that a club has 100 members, 40 men and 60 women. A committee of 10 members is selected at random (and without replacement, of course). 1. Find the probability that both genders are represented on the committee. 2. If you observed the experiment and in fact the committee members are all of the same gender, would you believe that the sampling was random? Answer Suppose that a small pond contains 500 fish, 50 of them tagged. A fisherman catches 10 fish. Find the probability that the catch contains at least 2 tagged fish. Answer The basic distribution that arises from sampling without replacement from a dichotomous population is studied in the section on the hypergeometric distribution. More generally, a multi-type population consists of objects of k different types. Suppose that a legislative body consists of 60 republicans, 40 democrats, and 20 independents. A committee of 10 members is chosen at random. Find the probability that at least one party is not represented on the committee. Answer The basic distribution that arises from sampling without replacement from a multi-type population is studied in the section on the multivariate hypergeometric distribution.
Cards Recall that a standard card deck can be modeled by the product set D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠}
(12.1.8)
where the first coordinate encodes the denomination or kind (ace, 2-10, jack, queen, king) and where the second coordinate encodes the suit (clubs, diamonds, hearts, spades). The general card experiment consists of drawing n cards at random and without replacement from the deck D. Thus, the ith card is X = (Y , Z ) where Y is the denomination and Z is the suit. The special case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment. Note that with respect to the denominations or with respect to the suits, a deck of cards is a multi-type population as discussed above. i
i
i
i
i
In the card experiment with n = 5 cards (poker), there are 1. 311,875,200 ordered hands 2. 2,598,960 unordered hands In the card experiment with n = 13 cards (bridge), there are
12.1.3
https://stats.libretexts.org/@go/page/10244
1. 3,954,242,643,911,239,680,000 ordered hands 2. 635,013,559,600 unordered hands In the card experiment, set n = 5 . Run the simulation 5 times and on each run, list all of the (ordered) sequences of cards that would give the same unordered hand as the one you observed. In the card experiment, 1. Y is uniformly distributed on {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k}for each i. 2. Z is uniformly distributed on {♣, ♢, ♡, ♠} for each i. i
j
In the card experiment, Y and Z are independent for any i and j . i
j
In the card experiment, (Y
1,
Y2 )
and (Z
1,
Z2 )
are dependent.
Suppose that a sequence of 5 cards is dealt. Find each of the following: 1. The probability that the third card is a spade. 2. The probability that the second and fourth cards are queens. 3. The conditional probability that the second card is a heart given that the fifth card is a heart. 4. The probability that the third card is a queen and the fourth card is a heart. Answer Run the card experiment 500 time. Compute the relative frequency corresponding to each probability in the previous exercise. Find the probability that a bridge hand will contain no honor cards that is, no cards of denomination 10, jack, queen, king, or ace. Such a hand is called a Yarborough, in honor of the second Earl of Yarborough. Answer
Dice Rolling
fair, six-sided dice is equivalent to choosing a random sample of size n with replacement from the population . Generally, selecting a random sample of size n with replacement from D = {1, 2, … , m} is equivalent to rolling n fair, m -sided dice. n
{1, 2, 3, 4, 5, 6}
In the game of poker dice, 5 standard, fair dice are thrown. Find each of the following: 1. The probability that all dice show the same score. 2. The probability that the scores are distinct. 3. The probability that 1 occurs twice and 6 occurs 3 times. Answer Run the poker dice experiment 500 times. Compute the relative frequency of each event in the previous exercise and compare with the corresponding probability. The game of poker dice is treated in more detail in the chapter on Games of Chance.
Birthdays

Suppose that we select n persons at random and record their birthdays. If we assume that birthdays are uniformly distributed throughout the year, and if we ignore leap years, then this experiment is equivalent to selecting a sample of size n with replacement from D = {1, 2, …, 365}. Similarly, we could record birth months or birth weeks.

Suppose that a probability class has 30 students. Find each of the following:
1. The probability that the birthdays are distinct.
2. The probability that there is at least one duplicate birthday.
12.1.4
https://stats.libretexts.org/@go/page/10244
Answer In the birthday experiment, set m = 365 and n = 30 . Run the experiment 1000 times and compare the relative frequency of each event in the previous exercise to the corresponding probability. The birthday problem is treated in more detail later in this chapter.
Balls into Cells Suppose that we distribute n distinct balls into m distinct cells at random. This experiment also fits the basic model, where D is the population of cells and X is the cell containing the ith ball. Sampling with replacement means that a cell may contain more than one ball; sampling without replacement means that a cell may contain at most one ball. i
Suppose that 5 balls are distributed into 10 cells (with no restrictions). Find each of the following: 1. The probability that the balls are all in different cells. 2. The probability that the balls are all in the same cell. Answer
Coupons Suppose that when we purchase a certain product (bubble gum, or cereal for example), we receive a coupon (a baseball card or small toy, for example), which is equally likely to be any one of m types. We can think of this experiment as sampling with replacement from the population of coupon types; X is the coupon that we receive on the ith purchase. i
Suppose that a kid's meal at a fast food restaurant comes with a toy. The toy is equally likely to be any of 5 types. Suppose that a mom buys a kid's meal for each of her 3 kids. Find each of the following: 1. The probability that the toys are all the same. 2. The probability that the toys are all different. Answer The coupon collector problem is studied in more detail later in this chapter.
The Key Problem
Suppose that a person has n keys, only one of which opens a certain door. The person tries the keys at random. We will let N denote the trial number when the person finds the correct key.
Suppose that unsuccessful keys are discarded (the rational thing to do, of course). Then N has the uniform distribution on {1, 2, …, n}:
1. $P(N = i) = \frac{1}{n}$, $i \in \{1, 2, \ldots, n\}$.
2. $E(N) = \frac{n+1}{2}$.
3. $\operatorname{var}(N) = \frac{n^2 - 1}{12}$.
Suppose that unsuccessful keys are not discarded (perhaps the person has had a bit too much to drink). Then N has a geometric distribution on $\mathbb{N}_+$:
1. $P(N = i) = \frac{1}{n} \left(\frac{n-1}{n}\right)^{i-1}$, $i \in \mathbb{N}_+$.
2. $E(N) = n$.
3. $\operatorname{var}(N) = n(n-1)$.
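Both results are easy to check by simulation. The following Python sketch (ours, not from the text; key_trials is a hypothetical helper name) estimates E(N) under each scheme.

import random

def key_trials(n, discard, runs=100_000):
    """Estimate E(N) for the key problem with n keys."""
    total = 0
    for _ in range(runs):
        keys = list(range(n))
        correct = random.randrange(n)
        trial = 0
        while True:
            trial += 1
            if discard:
                # without replacement: remove each unsuccessful key
                guess = keys.pop(random.randrange(len(keys)))
            else:
                # with replacement: keys are never removed
                guess = keys[random.randrange(len(keys))]
            if guess == correct:
                break
        total += trial
    return total / runs

print(key_trials(10, discard=True))   # approximately (10 + 1)/2 = 5.5
print(key_trials(10, discard=False))  # approximately n = 10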
Simulating a Random Sample
It's very easy to simulate a random sample of size n, with replacement, from D = {1, 2, …, m}. Recall that the ceiling function ⌈x⌉ gives the smallest integer that is at least as large as x.
Let $U = (U_1, U_2, \ldots, U_n)$ be a sequence of random numbers. Recall that these are independent random variables, each uniformly distributed on the interval [0, 1] (the standard uniform distribution). Then $X_i = \lceil m U_i \rceil$ for $i \in \{1, 2, \ldots, n\}$
simulates a random sample, with replacement, from D. It's a bit harder to simulate a random sample of size n, without replacement, since we need to remove each sample value before the next draw. The following algorithm generates a random sample of size n, without replacement, from D:
1. For i = 1 to m, let $b_i = i$.
2. For i = 1 to n,
   a. let $j = m - i + 1$
   b. let $U_i$ be a random number
   c. let $J = \lceil j U_i \rceil$ (so that J is uniformly distributed on {1, 2, …, j})
   d. let $X_i = b_J$
   e. let $k = b_j$
   f. let $b_j = b_J$
   g. let $b_J = k$
3. Return $X = (X_1, X_2, \ldots, X_n)$
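Here is a short Python rendering of both procedures, assuming only the standard library; sample_with_replacement and sample_without_replacement are our hypothetical names for the two methods described above. (The "or 1" guards the measure-zero case where the random number is exactly 0.)

import math, random

def sample_with_replacement(m, n):
    # X_i = ceil(m * U_i), with U_i standard uniform on [0, 1]
    return [math.ceil(m * random.random()) or 1 for _ in range(n)]

def sample_without_replacement(m, n):
    b = list(range(1, m + 1))          # step 1: b_i = i
    x = []
    for i in range(1, n + 1):
        j = m - i + 1                  # positions 1..j still hold unused values
        J = math.ceil(j * random.random()) or 1   # uniform on {1, ..., j}
        x.append(b[J - 1])             # X_i = b_J
        b[J - 1], b[j - 1] = b[j - 1], b[J - 1]   # swap b_J and b_j
    return x

print(sample_with_replacement(6, 5))      # e.g. five rolls of a fair die
print(sample_without_replacement(52, 5))  # e.g. five distinct card numbers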
This page titled 12.1: Introduction to Finite Sampling Models is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.2: The Hypergeometric Distribution
Basic Theory
Dichotomous Populations
Suppose that we have a dichotomous population D, that is, a population that consists of two types of objects, which we will refer to as type 1 and type 0. For example, we could have
balls in an urn that are either red or green
a batch of components that are either good or defective
a population of people who are either male or female
a population of animals that are either tagged or untagged
voters who are either democrats or republicans
Let R denote the subset of D consisting of the type 1 objects, and suppose that #(D) = m and #(R) = r. As in the basic sampling model, we sample n objects at random from D. In this section, our only concern is in the types of the objects, so let $X_i$ denote the type of the ith object chosen (1 or 0). The random vector of types is
$$X = (X_1, X_2, \ldots, X_n) \tag{12.2.1}$$
Our main interest is the random variable Y that gives the number of type 1 objects in the sample. Note that Y is a counting variable, and thus like all counting variables, can be written as a sum of indicator variables, in this case the type variables:
$$Y = \sum_{i=1}^n X_i \tag{12.2.2}$$
We will assume initially that the sampling is without replacement, which is usually the realistic setting with dichotomous populations.
The Probability Density Function
Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the set of all combinations of size n chosen from D. This observation leads to a simple combinatorial derivation of the probability density function of Y.
The probability density function of Y is given by
$$P(Y = y) = \frac{\binom{r}{y} \binom{m-r}{n-y}}{\binom{m}{n}}, \quad y \in \{\max\{0, n - (m - r)\}, \ldots, \min\{n, r\}\} \tag{12.2.3}$$
Proof
The distribution defined by this probability density function is known as the hypergeometric distribution with parameters m, r, and n.
Another form of the probability density function of Y is
$$P(Y = y) = \binom{n}{y} \frac{r^{(y)} (m-r)^{(n-y)}}{m^{(n)}}, \quad y \in \{\max\{0, n - (m - r)\}, \ldots, \min\{n, r\}\} \tag{12.2.4}$$
Combinatorial Proof
Algebraic Proof
Recall our convention that $j^{(i)} = \binom{j}{i} = 0$ for $i > j$. With this convention, the two formulas for the probability density function are correct for y ∈ {0, 1, …, n}. We usually use this simpler set as the set of values for the hypergeometric distribution.
The hypergeometric distribution is unimodal. Let $v = \frac{(r+1)(n+1)}{m+2}$. Then
1. $P(Y = y) > P(Y = y - 1)$ if and only if $y < v$.
2. The mode occurs at ⌊v⌋ if v is not an integer, and at v and v − 1 if v is an integer greater than 0.
In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the shape of the probability density function. For selected values of the parameters, run the experiment 1000 times and compare the relative frequency function to the probability density function.
You may wonder about the rather exotic name hypergeometric distribution, which seems to have nothing to do with sampling from a dichotomous population. The name comes from a power series, which was studied by Leonhard Euler, Carl Friedrich Gauss, Bernhard Riemann, and others. A (generalized) hypergeometric series is a power series
$$\sum_{k=0}^\infty a_k x^k \tag{12.2.5}$$
where $k \mapsto a_{k+1}/a_k$ is a rational function (that is, a ratio of polynomials).
Many of the basic power series studied in calculus are hypergeometric series, including the ordinary geometric series and the exponential series. The probability generating function of the hypergeometric distribution is a hypergeometric series.
Proof
In addition, the hypergeometric distribution function can be expressed in terms of a hypergeometric series. These representations are not particularly helpful, so basically we're stuck with the non-descriptive term for historical reasons.
Moments
Next we will derive the mean and variance of Y. The exchangeable property of the indicator variables, and properties of covariance and correlation, will play a key role.
$E(X_i) = \frac{r}{m}$ for each i.
Proof
From the representation of Y as the sum of indicator variables, the expected value of Y is trivial to compute. But just for fun, we give the derivation from the probability density function as well.
$E(Y) = n \frac{r}{m}$.
Proof
Proof from the definition
Next we turn to the variance of the hypergeometric distribution. For that, we will need not only the variances of the indicator variables, but their covariances as well.
$\operatorname{var}(X_i) = \frac{r}{m}\left(1 - \frac{r}{m}\right)$ for each i.
Proof
For distinct i, j,
1. $\operatorname{cov}(X_i, X_j) = -\frac{r}{m}\left(1 - \frac{r}{m}\right) \frac{1}{m-1}$
2. $\operatorname{cor}(X_i, X_j) = -\frac{1}{m-1}$
Proof
Note that the event of a type 1 object on draw i and the event of a type 1 object on draw j are negatively correlated, but the correlation depends only on the population size and not on the number of type 1 objects. Note also that the correlation is perfect (that is, −1) if m = 2, which must be the case.
$\operatorname{var}(Y) = n \frac{r}{m}\left(1 - \frac{r}{m}\right) \frac{m-n}{m-1}$.
Proof Note that var(Y ) = 0 if r = 0 or r = m or n = m , which must be true since Y is deterministic in each of these cases. In the ball and urn experiment, select sampling without replacement. Vary the parameters and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
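These formulas are easy to verify numerically. A minimal sketch using scipy.stats.hypergeom (assuming SciPy is available; its convention is hypergeom(M, n, N) with population size M, number of type 1 objects n, and sample size N):

from scipy.stats import hypergeom

m, r, n = 50, 20, 10     # population, type 1 objects, sample size
Y = hypergeom(m, r, n)   # scipy order: (M, n, N) = (m, r, n)

print(Y.pmf(4))                                           # P(Y = 4)
print(Y.mean(), n * r / m)                                # both give 4.0
print(Y.var(), n * (r/m) * (1 - r/m) * (m - n) / (m - 1)) # both about 1.96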
Sampling with Replacement
Suppose now that the sampling is with replacement, even though this is usually not realistic in applications.
$(X_1, X_2, \ldots, X_n)$ is a sequence of n Bernoulli trials with success parameter $\frac{r}{m}$.
The following results now follow immediately from the general theory of Bernoulli trials, although modifications of the arguments above could also be used.
Y has the binomial distribution with parameters n and $\frac{r}{m}$:
$$P(Y = y) = \binom{n}{y} \left(\frac{r}{m}\right)^y \left(1 - \frac{r}{m}\right)^{n-y}, \quad y \in \{0, 1, \ldots, n\} \tag{12.2.10}$$
The mean and variance of Y are
1. $E(Y) = n \frac{r}{m}$
2. $\operatorname{var}(Y) = n \frac{r}{m}\left(1 - \frac{r}{m}\right)$
Note that for any values of the parameters, the mean of Y is the same, whether the sampling is with or without replacement. On the other hand, the variance of Y is smaller, by a factor of $\frac{m-n}{m-1}$, when the sampling is without replacement than with replacement. It certainly makes sense that the variance of Y should be smaller when sampling without replacement, since each selection reduces the variability in the population that remains. The factor $\frac{m-n}{m-1}$ is sometimes called the finite population correction factor.
In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial probability density function. Note also the difference between the mean ± standard deviation bars. For selected values of the parameters and for the two different sampling modes, run the simulation 1000 times.
Convergence of the Hypergeometric Distribution to the Binomial
Suppose that the population size m is very large compared to the sample size n. In this case, it seems reasonable that sampling without replacement is not too much different than sampling with replacement, and hence the hypergeometric distribution should be well approximated by the binomial. The following exercise makes this observation precise. Practically, it is a valuable result, since the binomial distribution has fewer parameters. More specifically, we do not need to know the population size m and the number of type 1 objects r individually, but only in the ratio r/m.
Suppose that $r_m \in \{0, 1, \ldots, m\}$ for each $m \in \mathbb{N}_+$ and that $r_m/m \to p \in [0, 1]$ as $m \to \infty$. Then for fixed n, the hypergeometric probability density function with parameters m, $r_m$, and n converges to the binomial probability density function with parameters n and p as $m \to \infty$.
Proof The type of convergence in the previous exercise is known as convergence in distribution. In the ball and urn experiment, vary the parameters and switch between sampling without replacement and sampling with replacement. Note the difference between the graphs of the hypergeometric probability density function and the binomial probability density function. In particular, note the similarity when m is large and n small. For selected values of the parameters, and for both sampling modes, run the experiment 1000 times.
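A quick numerical illustration of this convergence, again under the assumption that SciPy is available (the snippet is ours, not the text's):

from scipy.stats import hypergeom, binom

n, p = 5, 0.3
for m in (20, 100, 1000, 10000):
    r = round(p * m)
    # compare P(Y = 2) under the two models
    print(m, hypergeom(m, r, n).pmf(2), binom(n, p).pmf(2))
# the hypergeometric values approach the binomial value 0.3087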
In the setting of the convergence result above, note that the mean and variance of the hypergeometric distribution converge to the mean and variance of the binomial distribution as m → ∞ .
Inferences in the Hypergeometric Model In many real problems, the parameters r or m (or both) may be unknown. In this case we are interested in drawing inferences about the unknown parameters based on our observation of Y , the number of type 1 objects in the sample. We will assume initially that the sampling is without replacement, the realistic setting in most applications.
Estimation of r with m Known
Suppose that the size of the population m is known but that the number of type 1 objects r is unknown. This type of problem could arise, for example, if we had a batch of m manufactured items containing an unknown number r of defective items. It would be too costly to test all m items (perhaps even destructive), so we might instead select n items at random and test those.
A simple estimator of r can be derived by hoping that the sample proportion of type 1 objects is close to the population proportion of type 1 objects. That is,
$$\frac{Y}{n} \approx \frac{r}{m} \implies r \approx \frac{m}{n} Y \tag{12.2.11}$$
Thus, our estimator of r is $\frac{m}{n} Y$. This method of deriving an estimator is known as the method of moments.
$E\left(\frac{m}{n} Y\right) = r$
Proof
The result in the previous exercise means that the estimator $\frac{m}{n} Y$ is an unbiased estimator of r. Hence the variance is a measure of the quality of the estimator, in the mean square sense.
$\operatorname{var}\left(\frac{m}{n} Y\right) = (m - r) \frac{r}{n} \frac{m-n}{m-1}$.
Proof
For fixed m and r, $\operatorname{var}\left(\frac{m}{n} Y\right) \downarrow 0$ as $n \uparrow m$.
Thus, the estimator improves as the sample size increases; this property is known as consistency.
In the ball and urn experiment, select sampling without replacement. For selected values of the parameters, run the experiment 100 times and note the estimate of r on each run.
1. Compute the average error and the average squared error over the 100 runs.
2. Compare the average squared error with the variance in mean square error given above.
Often we just want to estimate the ratio r/m (particularly if we don't know m either). In this case, the natural estimator is the sample proportion Y/n.
The estimator $\frac{Y}{n}$ of $\frac{r}{m}$ has the following properties:
1. $E\left(\frac{Y}{n}\right) = \frac{r}{m}$, so the estimator is unbiased.
2. $\operatorname{var}\left(\frac{Y}{n}\right) = \frac{1}{n} \frac{r}{m}\left(1 - \frac{r}{m}\right) \frac{m-n}{m-1}$
3. $\operatorname{var}\left(\frac{Y}{n}\right) \downarrow 0$ as $n \uparrow m$, so the estimator is consistent.
Estimation of m with r Known Suppose now that the number of type 1 objects r is known, but the population size m is unknown. As an example of this type of problem, suppose that we have a lake containing m fish where m is unknown. We capture r of the fish, tag them, and return them to the lake. Next we capture n of the fish and observe Y , the number of tagged fish in the sample. We wish to estimate m from this data. In this context, the estimation problem is sometimes called the capture-recapture problem.
Do you think that the main assumption of the sampling model, namely equally likely samples, would be satisfied for a real capture-recapture problem? Explain.
Once again, we can use the method of moments to derive a simple estimate of m, by hoping that the sample proportion of type 1 objects is close to the population proportion of type 1 objects. That is,
$$\frac{Y}{n} \approx \frac{r}{m} \implies m \approx \frac{n r}{Y} \tag{12.2.12}$$
Thus, our estimator of m is $\frac{n r}{Y}$ if $Y > 0$ and is $\infty$ if $Y = 0$.
In the ball and urn experiment, select sampling without replacement. For selected values of the parameters, run the experiment 100 times. 1. On each run, compare the true value of m with the estimated value. 2. Compute the average error and the average squared error over the 100 runs. If y > 0 then estimator of m.
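The behavior of this estimator is easy to explore by simulation. A small sketch (ours, with hypothetical names; standard library only):

import random

def capture_recapture(m=1000, r=100, n=50, runs=1000):
    """Simulate the estimator n*r/Y for the population size m."""
    population = [1] * r + [0] * (m - r)  # r tagged fish among m
    estimates = []
    for _ in range(runs):
        y = sum(random.sample(population, n))  # tagged fish in the sample
        if y > 0:
            estimates.append(n * r / y)
    return sum(estimates) / len(estimates)

print(capture_recapture())  # typically somewhat above m = 1000 (positive bias)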
maximizes
nr y
E(
nr Y
) ≥m
P(Y = y)
as a function of
m
for fixed
r
and n . This means that
nr Y
is a maximum likelihood
.
Proof Thus, the estimator is positivley biased and tends to over-estimate m. Indeed, if n ≤ m − r , so that E( ) = ∞ . For another approach to estimating the population size m , see the section on Order Statistics.
P(Y = 0) > 0
then
nr Y
Sampling with Replacement
Suppose now that the sampling is with replacement, even though this is unrealistic in most applications. In this case, Y has the binomial distribution with parameters n and $\frac{r}{m}$. The estimators of r with m known, of $\frac{r}{m}$, and of m with r known make sense, just as before, but have slightly different properties.
The estimator $\frac{m}{n} Y$ of r with m known satisfies
1. $E\left(\frac{m}{n} Y\right) = r$
2. $\operatorname{var}\left(\frac{m}{n} Y\right) = \frac{1}{n} r (m - r)$
The estimator $\frac{1}{n} Y$ of $\frac{r}{m}$ satisfies
1. $E\left(\frac{1}{n} Y\right) = \frac{r}{m}$
2. $\operatorname{var}\left(\frac{1}{n} Y\right) = \frac{1}{n} \frac{r}{m}\left(1 - \frac{r}{m}\right)$
Thus, the estimators are still unbiased and consistent, but have larger mean square error than before. So sampling without replacement works better, for any values of the parameters, than sampling with replacement.
In the ball and urn experiment, select sampling with replacement. For selected values of the parameters, run the experiment 100 times.
1. On each run, compare the true value of r with the estimated value.
2. Compute the average error and the average squared error over the 100 runs.
Examples and Applications A batch of 100 computer chips contains 10 defective chips. Five chips are chosen at random, without replacement. Find each of the following: 1. The probability density function of the number of defective chips in the sample. 2. The mean and variance of the number of defective chips in the sample
3. The probability that the sample contains at least one defective chip.
Answer
A club contains 50 members; 20 are men and 30 are women. A committee of 10 members is chosen at random. Find each of the following:
1. The probability density function of the number of women on the committee.
2. The mean and variance of the number of women on the committee.
3. The mean and variance of the number of men on the committee.
4. The probability that the committee members are all the same gender.
Answer
A small pond contains 1000 fish; 100 are tagged. Suppose that 20 fish are caught. Find each of the following:
1. The probability density function of the number of tagged fish in the sample.
2. The mean and variance of the number of tagged fish in the sample.
3. The probability that the sample contains at least 2 tagged fish.
4. The binomial approximation to the probability in (c).
Answer
Forty percent of the registered voters in a certain district prefer candidate A. Suppose that 10 voters are chosen at random. Find each of the following:
1. The probability density function of the number of voters in the sample who prefer A.
2. The mean and variance of the number of voters in the sample who prefer A.
3. The probability that at least 5 voters in the sample prefer A.
Answer
Suppose that 10 memory chips are sampled at random and without replacement from a batch of 100 chips. The chips are tested and 2 are defective. Estimate the number of defective chips in the entire batch.
Answer
A voting district has 5000 registered voters. Suppose that 100 voters are selected at random and polled, and that 40 prefer candidate A. Estimate the number of voters in the district who prefer candidate A.
Answer
From a certain lake, 200 fish are caught, tagged and returned to the lake. Then 100 fish are caught and it turns out that 10 are tagged. Estimate the population of fish in the lake.
Answer
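For instance, the pond exercise above can be checked numerically with SciPy (a sketch under the assumption that SciPy is installed):

from scipy.stats import hypergeom

Y = hypergeom(1000, 100, 20)    # population 1000, 100 tagged, sample of 20
print(Y.mean(), Y.var())        # 2.0 and about 1.77
print(1 - Y.pmf(0) - Y.pmf(1))  # P(at least 2 tagged fish), about 0.61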
Cards
Recall that the general card experiment is to select n cards at random and without replacement from a standard deck of 52 cards. The special case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment.
In a poker hand, find the probability density function, mean, and variance of the following random variables:
1. The number of spades
2. The number of aces
Answer
In a bridge hand, find each of the following:
1. The probability density function, mean, and variance of the number of hearts.
2. The probability density function, mean, and variance of the number of honor cards (ace, king, queen, jack, or 10).
3. The probability that the hand has no honor cards. A hand of this kind is known as a Yarborough, in honor of the second Earl of Yarborough.
Answer
The Randomized Urn
An interesting thing to do in almost any parametric probability model is to randomize one or more of the parameters. Done in the right way, this often leads to an interesting new parametric model, since the distribution of the randomized parameter will often itself belong to a parametric family. This is also the natural setting to apply Bayes' theorem.
In this section, we will randomize the number of type 1 objects in the basic hypergeometric model. Specifically, we assume that we have m objects in the population, as before. However, instead of a fixed number r of type 1 objects, we assume that each of the m objects in the population, independently of the others, is type 1 with probability p and type 0 with probability 1 − p. We have eliminated one parameter, r, in favor of a new parameter p with values in the interval [0, 1]. Let $U_i$ denote the type of the ith object in the population, so that $U = (U_1, U_2, \ldots, U_m)$ is a sequence of Bernoulli trials with success parameter p. Let $V = \sum_{i=1}^m U_i$ denote the number of type 1 objects in the population, so that V has the binomial distribution with parameters m and p.
As before, we sample n objects from the population. Again we let $X_i$ denote the type of the ith object sampled, and we let $Y = \sum_{i=1}^n X_i$ denote the number of type 1 objects in the sample. We will consider sampling with and without replacement. In the first case, the sample size can be any positive integer, but in the second case, the sample size cannot exceed the population size. The key technique in the analysis of the randomized urn is to condition on V. If we know that V = r, then the model reduces to the model studied above: a population of size m with r type 1 objects, and a sample of size n.
With either type of sampling, $P(X_i = 1) = p$.
Proof
Thus, in either model, X is a sequence of identically distributed indicator variables. Ah, but what about dependence?
Suppose that the sampling is without replacement. Let $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$ and let $y = \sum_{i=1}^n x_i$. Then
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = p^y (1 - p)^{n-y} \tag{12.2.13}$$
Proof
From the joint distribution in the previous exercise, we see that X is a sequence of Bernoulli trials with success parameter p, and hence Y has the binomial distribution with parameters n and p. We could also argue that X is a Bernoulli trials sequence directly, by noting that $\{X_1, X_2, \ldots, X_n\}$ is a randomly chosen subset of $\{U_1, U_2, \ldots, U_m\}$.
Suppose now that the sampling is with replacement. Again, let $(x_1, x_2, \ldots, x_n) \in \{0, 1\}^n$ and let $y = \sum_{i=1}^n x_i$. Then
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = E\left[\frac{V^y (m - V)^{n-y}}{m^n}\right] \tag{12.2.15}$$
Proof
A closed form expression for the joint distribution of X, in terms of the parameters m, n, and p, is not easy, but it is at least clear that the joint distribution will not be the same as the one when the sampling is without replacement. In particular, X is a dependent sequence. Note however that X is an exchangeable sequence, since the joint distribution is invariant under a permutation of the coordinates (this is a simple consequence of the fact that the joint distribution depends only on the sum y).
The probability density function of Y is given by
$$P(Y = y) = \binom{n}{y} E\left[\frac{V^y (m - V)^{n-y}}{m^n}\right], \quad y \in \{0, 1, \ldots, n\} \tag{12.2.17}$$
Suppose that i and j are distinct indices. The covariance and correlation of $(X_i, X_j)$ are
1. $\operatorname{cov}(X_i, X_j) = \frac{p(1-p)}{m}$
2. $\operatorname{cor}(X_i, X_j) = \frac{1}{m}$
Proof
The mean and variance of Y are
1. $E(Y) = np$
2. $\operatorname{var}(Y) = np(1-p) \frac{m+n-1}{m}$
Proof Let's conclude with an interesting observation: For the randomized urn, X is a sequence of independent variables when the sampling is without replacement but a sequence of dependent variables when the sampling is with replacement—just the opposite of the situation for the deterministic urn with a fixed number of type 1 objects. This page titled 12.2: The Hypergeometric Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.3: The Multivariate Hypergeometric Distribution
Basic Theory
The Multitype Model
As in the basic sampling model, we start with a finite population D consisting of m objects. In this section, we suppose in addition that each object is one of k types; that is, we have a multitype population. For example, we could have an urn with balls of several different colors, or a population of voters who are either democrat, republican, or independent. Let $D_i$ denote the subset of all type i objects and let $m_i = \#(D_i)$ for $i \in \{1, 2, \ldots, k\}$. Thus $D = \bigcup_{i=1}^k D_i$ and $m = \sum_{i=1}^k m_i$. The dichotomous model considered earlier is clearly a special case, with k = 2.
As in the basic sampling model, we sample n objects at random from D. Thus the outcome of the experiment is $X = (X_1, X_2, \ldots, X_n)$ where $X_i \in D$ is the ith object chosen. Now let $Y_i$ denote the number of type i objects in the sample, for $i \in \{1, 2, \ldots, k\}$. Note that $\sum_{i=1}^k Y_i = n$, so if we know the values of k − 1 of the counting variables, we can find the value of the remaining counting variable. As with any counting variable, we can express $Y_i$ as a sum of indicator variables:
For $i \in \{1, 2, \ldots, k\}$,
$$Y_i = \sum_{j=1}^n \mathbf{1}(X_j \in D_i) \tag{12.3.1}$$
We assume initially that the sampling is without replacement, since this is the realistic case in most applications.
The Joint Distribution
Basic combinatorial arguments can be used to derive the probability density function of the random vector of counting variables. Recall that since the sampling is without replacement, the unordered sample is uniformly distributed over the combinations of size n chosen from D.
The probability density function of $(Y_1, Y_2, \ldots, Y_k)$ is given by
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \frac{\binom{m_1}{y_1} \binom{m_2}{y_2} \cdots \binom{m_k}{y_k}}{\binom{m}{n}}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n \tag{12.3.2}$$
Proof
The distribution of $(Y_1, Y_2, \ldots, Y_k)$ is called the multivariate hypergeometric distribution with parameters m, $(m_1, m_2, \ldots, m_k)$, and n. We also say that $(Y_1, Y_2, \ldots, Y_{k-1})$ has this distribution (recall again that the values of any k − 1 of the variables determine the value of the remaining variable). Usually it is clear from context which meaning is intended. The ordinary hypergeometric distribution corresponds to k = 2.
An alternate form of the probability density function of $(Y_1, Y_2, \ldots, Y_k)$ is
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{n}{y_1, y_2, \ldots, y_k} \frac{m_1^{(y_1)} m_2^{(y_2)} \cdots m_k^{(y_k)}}{m^{(n)}}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n \tag{12.3.3}$$
Combinatorial Proof Algebraic Proof
The Marginal Distributions
For $i \in \{1, 2, \ldots, k\}$, $Y_i$ has the hypergeometric distribution with parameters m, $m_i$, and n:
$$P(Y_i = y) = \frac{\binom{m_i}{y} \binom{m - m_i}{n - y}}{\binom{m}{n}}, \quad y \in \{0, 1, \ldots, n\} \tag{12.3.4}$$
Proof
Grouping
The multivariate hypergeometric distribution is preserved when the counting variables are combined. Specifically, suppose that $(A_1, A_2, \ldots, A_l)$ is a partition of the index set $\{1, 2, \ldots, k\}$ into nonempty, disjoint subsets. Let $W_j = \sum_{i \in A_j} Y_i$ and $r_j = \sum_{i \in A_j} m_i$ for $j \in \{1, 2, \ldots, l\}$.
$(W_1, W_2, \ldots, W_l)$ has the multivariate hypergeometric distribution with parameters m, $(r_1, r_2, \ldots, r_l)$, and n.
Proof
Note that the marginal distribution of $Y_i$ given above is a special case of grouping. We have two types: type i and not type i. More generally, the marginal distribution of any subsequence of $(Y_1, Y_2, \ldots, Y_k)$ is hypergeometric, with the appropriate parameters.
Conditioning
The multivariate hypergeometric distribution is also preserved when some of the counting variables are observed. Specifically, suppose that (A, B) is a partition of the index set $\{1, 2, \ldots, k\}$ into nonempty, disjoint subsets. Suppose that we observe $Y_j = y_j$ for $j \in B$. Let $z = n - \sum_{j \in B} y_j$ and $r = \sum_{i \in A} m_i$.
The conditional distribution of $(Y_i : i \in A)$ given $(Y_j = y_j : j \in B)$ is multivariate hypergeometric with parameters r, $(m_i : i \in A)$, and z.
Proof Combinations of the grouping result and the conditioning result can be used to compute any marginal or conditional distributions of the counting variables.
Moments
We will compute the mean, variance, covariance, and correlation of the counting variables. Results from the hypergeometric distribution and the representation in terms of indicator variables are the main tools.
For $i \in \{1, 2, \ldots, k\}$,
1. $E(Y_i) = n \frac{m_i}{m}$
2. $\operatorname{var}(Y_i) = n \frac{m_i}{m} \frac{m - m_i}{m} \frac{m - n}{m - 1}$
Proof
Now let $I_{ti} = \mathbf{1}(X_t \in D_i)$, the indicator variable of the event that the tth object selected is type i, for $t \in \{1, 2, \ldots, n\}$ and $i \in \{1, 2, \ldots, k\}$.
Suppose that r and s are distinct elements of $\{1, 2, \ldots, n\}$, and i and j are distinct elements of $\{1, 2, \ldots, k\}$. Then
$$\operatorname{cov}(I_{ri}, I_{rj}) = -\frac{m_i}{m} \frac{m_j}{m} \tag{12.3.5}$$
$$\operatorname{cov}(I_{ri}, I_{sj}) = \frac{1}{m-1} \frac{m_i}{m} \frac{m_j}{m} \tag{12.3.6}$$
Proof
Suppose again that r and s are distinct elements of $\{1, 2, \ldots, n\}$, and i and j are distinct elements of $\{1, 2, \ldots, k\}$. Then
$$\operatorname{cor}(I_{ri}, I_{rj}) = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \tag{12.3.7}$$
$$\operatorname{cor}(I_{ri}, I_{sj}) = \frac{1}{m-1} \sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \tag{12.3.8}$$
Proof
In particular, $I_{ri}$ and $I_{rj}$ are negatively correlated while $I_{ri}$ and $I_{sj}$ are positively correlated.
For distinct $i, j \in \{1, 2, \ldots, k\}$,
$$\operatorname{cov}(Y_i, Y_j) = -n \frac{m_i}{m} \frac{m_j}{m} \frac{m - n}{m - 1} \tag{12.3.9}$$
$$\operatorname{cor}(Y_i, Y_j) = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}} \tag{12.3.10}$$
Sampling with Replacement
Suppose now that the sampling is with replacement, even though this is usually not realistic in applications. The types of the objects in the sample form a sequence of n multinomial trials with parameters $(m_1/m, m_2/m, \ldots, m_k/m)$.
The following results now follow immediately from the general theory of multinomial trials, although modifications of the arguments above could also be used.
$(Y_1, Y_2, \ldots, Y_k)$ has the multinomial distribution with parameters n and $(m_1/m, m_2/m, \ldots, m_k/m)$:
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_k = y_k) = \binom{n}{y_1, y_2, \ldots, y_k} \frac{m_1^{y_1} m_2^{y_2} \cdots m_k^{y_k}}{m^n}, \quad (y_1, y_2, \ldots, y_k) \in \mathbb{N}^k \text{ with } \sum_{i=1}^k y_i = n \tag{12.3.11}$$
For distinct $i, j \in \{1, 2, \ldots, k\}$,
1. $E(Y_i) = n \frac{m_i}{m}$
2. $\operatorname{var}(Y_i) = n \frac{m_i}{m} \frac{m - m_i}{m}$
3. $\operatorname{cov}(Y_i, Y_j) = -n \frac{m_i}{m} \frac{m_j}{m}$
4. $\operatorname{cor}(Y_i, Y_j) = -\sqrt{\frac{m_i}{m - m_i} \frac{m_j}{m - m_j}}$
Comparing with our previous results, note that the means and correlations are the same, whether sampling with or without replacement. The variances and covariances are smaller when sampling without replacement, by a factor of the finite population correction factor $\frac{m-n}{m-1}$.
Convergence to the Multinomial Distribution
Suppose that the population size m is very large compared to the sample size n. In this case, it seems reasonable that sampling without replacement is not too much different than sampling with replacement, and hence the multivariate hypergeometric distribution should be well approximated by the multinomial. The following exercise makes this observation precise. Practically, it is a valuable result, since in many cases we do not know the population size exactly. For the approximate multinomial distribution, we do not need to know $m_i$ and m individually, but only in the ratio $m_i/m$.
Suppose that $m_i$ depends on m and that $m_i/m \to p_i$ as $m \to \infty$ for $i \in \{1, 2, \ldots, k\}$. For fixed n, the multivariate hypergeometric probability density function with parameters m, $(m_1, m_2, \ldots, m_k)$, and n converges to the multinomial probability density function with parameters n and $(p_1, p_2, \ldots, p_k)$.
Proof
Examples and Applications
A population of 100 voters consists of 40 republicans, 35 democrats and 25 independents. A random sample of 10 voters is chosen. Find each of the following:
1. The joint density function of the number of republicans, number of democrats, and number of independents in the sample.
2. The mean of each variable in (a).
3. The variance of each variable in (a).
4. The covariance of each pair of variables in (a).
5. The probability that the sample contains at least 4 republicans, at least 3 democrats, and at least 2 independents.
Answer
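The joint probabilities in this exercise can be evaluated directly from (12.3.2). A small Python sketch (our helper name; standard library only):

from math import comb

def mv_hypergeom_pmf(counts, draws):
    """P(Y_1 = y_1, ..., Y_k = y_k) for sampling without replacement.
    counts: (m_1, ..., m_k) type counts; draws: (y_1, ..., y_k)."""
    m, n = sum(counts), sum(draws)
    num = 1
    for m_i, y_i in zip(counts, draws):
        num *= comb(m_i, y_i)
    return num / comb(m, n)

# probability of exactly 4 republicans, 3 democrats, 3 independents
print(mv_hypergeom_pmf((40, 35, 25), (4, 3, 3)))  # about 0.08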
Cards Recall that the general card experiment is to select n cards at random and without replacement from a standard deck of 52 cards. The special case n = 5 is the poker experiment and the special case n = 13 is the bridge experiment. In a bridge hand, find the probability density function of 1. The number of spades, number of hearts, and number of diamonds. 2. The number of spades and number of hearts. 3. The number of spades. 4. The number of red cards and the number of black cards. Answer In a bridge hand, find each of the following: 1. The mean and variance of the number of spades. 2. The covariance and correlation between the number of spades and the number of hearts. 3. The mean and variance of the number of red cards. Answer In a bridge hand, find each of the following: 1. The conditional probability density function of the number of spades and the number of hearts, given that the hand has 4 diamonds. 2. The conditional probability density function of the number of spades given that the hand has 3 hearts and 2 diamonds.
Answer
In the card experiment, a hand that does not contain any cards of a particular suit is said to be void in that suit. Use the inclusion-exclusion rule to show that the probability that a poker hand is void in at least one suit is
$$\frac{1913496}{2598960} \approx 0.736 \tag{12.3.12}$$
In the card experiment, set n = 5. Run the simulation 1000 times and compute the relative frequency of the event that the hand is void in at least one suit. Compare the relative frequency with the true probability given in the previous exercise.
Use the inclusion-exclusion rule to show that the probability that a bridge hand is void in at least one suit is
$$\frac{32427298180}{635013559600} \approx 0.051 \tag{12.3.13}$$
This page titled 12.3: The Multivariate Hypergeometric Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.4: Order Statistics
Basic Theory
Definitions
Suppose that the objects in our population are numbered from 1 to m, so that D = {1, 2, …, m}. For example, the population might consist of manufactured items, and the labels might correspond to serial numbers. As in the basic sampling model we select n objects at random, without replacement, from D. Thus the outcome is $X = (X_1, X_2, \ldots, X_n)$ where $X_i \in D$ is the ith object chosen. Recall that X is uniformly distributed over the set of permutations of size n chosen from D. Recall also that $W = \{X_1, X_2, \ldots, X_n\}$ is the unordered sample, which is uniformly distributed on the set of combinations of size n chosen from D.
For $i \in \{1, 2, \ldots, n\}$ let $X_{(i)}$ denote the ith smallest element of $\{X_1, X_2, \ldots, X_n\}$. The random variable $X_{(i)}$ is known as the order statistic of order i for the sample X. In particular, the extreme order statistics are
$$X_{(1)} = \min\{X_1, X_2, \ldots, X_n\} \tag{12.4.1}$$
$$X_{(n)} = \max\{X_1, X_2, \ldots, X_n\} \tag{12.4.2}$$
Random variable $X_{(i)}$ takes values in $\{i, i+1, \ldots, m-n+i\}$ for $i \in \{1, 2, \ldots, n\}$.
We will denote the vector of order statistics by $Y = (X_{(1)}, X_{(2)}, \ldots, X_{(n)})$. Note that Y takes values in
$$L = \{(y_1, y_2, \ldots, y_n) \in D^n : y_1 < y_2 < \cdots < y_n\} \tag{12.4.3}$$
Run the order statistic experiment. Note that you can vary the population size m and the sample size n . The order statistics are recorded on each update.
Distributions
L has $\binom{m}{n}$ elements and Y is uniformly distributed on L.
Proof
The probability density function of $X_{(i)}$ is
$$P[X_{(i)} = x] = \frac{\binom{x-1}{i-1} \binom{m-x}{n-i}}{\binom{m}{n}}, \quad x \in \{i, i+1, \ldots, m-n+i\} \tag{12.4.4}$$
Proof In the order statistic experiment, vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the experiment 1000 times and compare the relative frequency function to the probability density function.
Moments
The probability density function of $X_{(i)}$ above can be used to obtain an interesting identity involving the binomial coefficients. This identity, in turn, can be used to find the mean and variance of $X_{(i)}$.
For $i, n, m \in \mathbb{N}_+$ with $i \le n \le m$,
$$\sum_{k=i}^{m-n+i} \binom{k-1}{i-1} \binom{m-k}{n-i} = \binom{m}{n} \tag{12.4.5}$$
Proof
The expected value of $X_{(i)}$ is
$$E[X_{(i)}] = i \frac{m+1}{n+1} \tag{12.4.6}$$
Proof
The variance of $X_{(i)}$ is
$$\operatorname{var}[X_{(i)}] = i (n - i + 1) \frac{(m+1)(m-n)}{(n+1)^2 (n+2)} \tag{12.4.7}$$
Proof In the order statistic experiment, vary the parameters and note the size and location of the mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation.
Estimators of m Based on Order Statistics
Suppose that the population size m is unknown. In this subsection we consider estimators of m constructed from the various order statistics.
For $i \in \{1, 2, \ldots, n\}$, the following statistic is an unbiased estimator of m:
$$U_i = \frac{n+1}{i} X_{(i)} - 1 \tag{12.4.8}$$
Proof
Since $U_i$ is unbiased, its variance is the mean square error, a measure of the quality of the estimator.
The variance of $U_i$ is
$$\operatorname{var}(U_i) = \frac{(m+1)(m-n)(n-i+1)}{i(n+2)} \tag{12.4.9}$$
Proof
For fixed m and n, $\operatorname{var}(U_i)$ decreases as i increases. Thus, the estimators improve as i increases; in particular, $U_n$ is the best and $U_1$ the worst.
The relative efficiency of $U_j$ with respect to $U_i$ is
$$\frac{\operatorname{var}(U_i)}{\operatorname{var}(U_j)} = \frac{j(n-i+1)}{i(n-j+1)} \tag{12.4.10}$$
Note that the relative efficiency depends only on the orders i and j and the sample size n, but not on the population size m (the unknown parameter). In particular, the relative efficiency of $U_n$ with respect to $U_1$ is $n^2$. For fixed i and j, the asymptotic relative efficiency of $U_j$ to $U_i$ is j/i. Usually, we hope that an estimator improves (in the sense of mean square error) as the sample size n increases (the more information we have, the better our estimate should be). This general idea is known as consistency.
$\operatorname{var}(U_n)$ decreases to 0 as n increases from 1 to m, and so $U_n$ is consistent:
$$\operatorname{var}(U_n) = \frac{(m+1)(m-n)}{n(n+2)} \tag{12.4.11}$$
For fixed i, $\operatorname{var}(U_i)$ at first increases and then decreases to 0 as n increases from i to m. Thus, $U_i$ is inconsistent.
12.4.2
https://stats.libretexts.org/@go/page/10247
Figure 12.4.1 : var(U
1)
as a function of n for m
= 100
An Estimator of m Based on the Sample Mean In this subsection, we will derive another estimator of the parameter m based on the average of the sample variables M = ∑ x , (the sample mean) and compare this estimator with the estimator based on the maximum of the variables (the largest order statistic). 1
n
n
i=1
E(M ) =
i
m+1 2
.
Proof It follows that V = 2M − 1 is an unbiased estimator of m. Moreover, it seems that superficially at least, V uses more information from the sample (since it involves all of the sample variables) than U . Could it be better? To find out, we need to compute the variance of the estimator (which, since it is unbiased, is the mean square error). This computation is a bit complicated since the sample variables are dependent. We will compute the variance of the sum as the sum of all of the pairwise covariances. n
For distinct i,
,
j ∈ {1, 2, … , n} cov (Xi , Xj ) = −
m+1 12
.
Proof For i ∈ {1, 2, … , n}, var(X
i)
2
=
m −1 12
.
Proof (m+1)(m−n)
var(M ) =
.
12 n
Proof (m+1)(m−n)
var(V ) =
3 n
.
Proof The variance of V is decreasing with n , so V is also consistent. Let's compute the relative efficiency of the estimator based on the maximum to the estimator based on the mean. var(V )/var(Un ) = (n + 2)/3
.
Thus, once again, the estimator based on the maximum is better. In addition to the mathematical analysis, all of the estimators except U can sometimes be manifestly worthless by giving estimates that are smaller than some of the smaple values. n
Sampling with Replacement
If the sampling is with replacement, then the sample $X = (X_1, X_2, \ldots, X_n)$ is a sequence of independent and identically distributed random variables. The order statistics from such samples are studied in the chapter on Random Samples.
Examples and Applications
Suppose that in a lottery, tickets numbered from 1 to 25 are placed in a bowl. Five tickets are chosen at random and without replacement.
1. Find the probability density function of $X_{(3)}$.
2. Find $E[X_{(3)}]$.
3. Find $\operatorname{var}[X_{(3)}]$.
Answer
The German Tank Problem
The estimator $U_n$ was used by the Allies during World War II to estimate the number of German tanks m that had been produced. German tanks had serial numbers, and captured German tanks and records formed the sample data. The statistical estimates turned out to be much more accurate than intelligence estimates. Some of the data are given in the table below.
German Tank Data. Source: Wikipedia
Date         | Statistical Estimate | Intelligence Estimate | German Records
June 1940    | 169                  | 1000                  | 122
June 1941    | 244                  | 1550                  | 271
August 1942  | 327                  | 1550                  | 342
One of the morals, evidently, is not to put serial numbers on your weapons!
Suppose that in a certain war, 5 enemy tanks have been captured. The serial numbers are 51, 3, 27, 82, 65. Compute the estimate of m, the total number of tanks, using all of the estimators discussed above.
Answer
In the order statistic experiment, set m = 100 and n = 10. Run the experiment 50 times. For each run, compute the estimate of m based on each order statistic. For each estimator, compute the square root of the average of the squares of the errors over the 50 runs. Based on these empirical error estimates, rank the estimators of m in terms of quality.
Suppose that in a certain war, 10 enemy tanks have been captured. The serial numbers are 304, 125, 417, 226, 192, 340, 468, 499, 87, 352. Compute the estimate of m, the total number of tanks, using the estimator based on the maximum and the estimator based on the mean.
Answer
This page titled 12.4: Order Statistics is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
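For the five captured tanks, the two main estimators are simple to evaluate; here is a sketch (our code, standard library only):

serials = [51, 3, 27, 82, 65]
n = len(serials)
x_max = max(serials)            # X_(n), the largest order statistic
M = sum(serials) / n            # the sample mean

U_n = (n + 1) / n * x_max - 1   # estimator based on the maximum
V = 2 * M - 1                   # estimator based on the mean
print(U_n, V)                   # 97.4 and 90.2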
12.5: The Matching Problem
Definitions and Notation
The Matching Experiment
The matching experiment is a random experiment that can be formulated in a number of colorful ways:
Suppose that n male-female couples are at a party and that the males and females are randomly paired for a dance. A match occurs if a couple happens to be paired together.
An absent-minded secretary prepares n letters and envelopes to send to n different people, but then randomly stuffs the letters into the envelopes. A match occurs if a letter is inserted in the proper envelope.
n people with hats have had a bit too much to drink at a party. As they leave the party, each person randomly grabs a hat. A match occurs if a person gets his or her own hat.
These experiments are clearly equivalent from a mathematical point of view, and correspond to selecting a random permutation $X = (X_1, X_2, \ldots, X_n)$ of the population $D_n = \{1, 2, \ldots, n\}$. Here are the interpretations for the examples above:
Number the couples from 1 to n. Then $X_i$ is the number of the woman paired with the ith man.
Number the letters and corresponding envelopes from 1 to n. Then $X_i$ is the number of the envelope containing the ith letter.
Number the people and their corresponding hats from 1 to n. Then $X_i$ is the number of the hat chosen by the ith person.
Our modeling assumption, of course, is that X is uniformly distributed on the sample space of permutations of $D_n$. The number of objects n is the basic parameter of the experiment. We will also consider the case of sampling with replacement from the population $D_n$, because the analysis is much easier but still provides insight. In this case, X is a sequence of independent random variables, each uniformly distributed over $D_n$.
Matches
We will say that a match occurs at position j if $X_j = j$. Thus, the number of matches is the random variable $N_n$ defined mathematically by
$$N_n = \sum_{j=1}^n I_j \tag{12.5.1}$$
where $I_j = \mathbf{1}(X_j = j)$ is the indicator variable for the event of a match at position j. Our problem is to compute the probability distribution of the number of matches. This is an old and famous problem in probability that was first considered by Pierre-Remond Montmort; it is sometimes referred to as Montmort's matching problem in his honor.
is a sequence of n Bernoulli trials, with success probability
1 n
.
Proof The number of matches N has the binomial distribution with trial parameter n and success parameter n
k
P(Nn = k) = (
n 1 )( ) k n
.
n−k
1 (1 −
1 n
)
,
k ∈ {0, 1, … , n}
(12.5.2)
n
Proof The mean and variance of the number of matches are 1. E(N ) = 1 2. var(N ) = n
n
n−1 n
Proof
12.5.1
https://stats.libretexts.org/@go/page/10248
The distribution of the number of matches converges to the Poisson distribution with parameter 1 as n → ∞ : e
−1
P(Nn = k) →
as n → ∞ for k ∈ N
(12.5.3)
k!
Proof
Sampling Without Replacement Now let's consider the case of real interest, when the sampling is without replacement, so that elements of D = {1, 2, … , n}.
X
is a random permutation of the
n
Counting Permutations with Matches To find the probability density function of N , we need to count the number of permutations of D with a specified number of matches. This will turn out to be easy once we have counted the number of permutations with no matches; these are called derangements of D . We will denote the number of permutations of D with exactly k matches by b (k) = #{N = k} for k ∈ {0, 1, … , n}. In particular, b (0) is the number of derrangements of D . n
n
n
n
n
n
n
n
The number of derrangements is n
j
(−1)
bn (0) = n! ∑ j=0
(12.5.5) j!
Proof The number of permutations with exactly k matches is n! bn (k) =
n−k
j
(−1)
∑ k!
,
k ∈ {0, 1, … , n}
(12.5.7)
j!
j=0
Proof
The Probability Density Function
The probability density function of the number of matches is
$$P(N_n = k) = \frac{1}{k!} \sum_{j=0}^{n-k} \frac{(-1)^j}{j!}, \quad k \in \{0, 1, \ldots, n\} \tag{12.5.8}$$
Proof In the matching experiment, vary the parameter n and note the shape and location of the probability density function. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density function. P(Nn = n − 1) = 0
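Both the exact pdf and a simulation are a few lines of Python (a sketch with our own helper names):

import math, random

def match_pmf(n, k):
    """P(N_n = k) from the derangement formula (12.5.8)."""
    return sum((-1) ** j / math.factorial(j) for j in range(n - k + 1)) / math.factorial(k)

def simulate_matches(n, runs=100_000):
    hits = [0] * (n + 1)
    for _ in range(runs):
        perm = list(range(n))
        random.shuffle(perm)                               # random permutation
        hits[sum(i == x for i, x in enumerate(perm))] += 1 # count fixed points
    return [h / runs for h in hits]

print(match_pmf(10, 0))          # about e^{-1} = 0.3679
print(simulate_matches(10)[:3])  # relative frequencies for 0, 1, 2 matches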
.
Proof The distribution of the number of matches converges to the Poisson distribution with parameter 1 as n → ∞ : e P(Nn = k) →
−1
as n → ∞,
k ∈ N
(12.5.9)
k!
Proof The convergence is remarkably rapid. In the matching experiment, increase n and note how the probability density function stabilizes rapidly. For selected values of n , run the simulation 1000 times and compare the relative frequency function to the probability density function.
12.5.2
https://stats.libretexts.org/@go/page/10248
Moments
The mean and variance of the number of matches could be computed directly from the distribution. However, it is much better to use the representation in terms of indicator variables. The exchangeable property is an important tool in this section.
$E(I_j) = \frac{1}{n}$ for $j \in \{1, 2, \ldots, n\}$.
Proof
$E(N_n) = 1$ for each n.
Proof
Thus, the expected number of matches is 1, regardless of n, just as when the sampling is with replacement.
$\operatorname{var}(I_j) = \frac{n-1}{n^2}$ for $j \in \{1, 2, \ldots, n\}$.
Proof
A match in one position would seem to make it more likely that there would be a match in another position. Thus, we might guess that the indicator variables are positively correlated.
For distinct $j, k \in \{1, 2, \ldots, n\}$,
1. $\operatorname{cov}(I_j, I_k) = \frac{1}{n^2 (n-1)}$
2. $\operatorname{cor}(I_j, I_k) = \frac{1}{(n-1)^2}$
,
k ∈ {1, 2, … , n} cov(Ij , Ik ) → 0
as n → ∞ .
Thus, the event that a match occurs in position j is nearly independent of the event that a match occurs in position k if n is large. For large n , the indicator variables behave nearly like n Bernoulli trials with success probability , which of course, is what happens when the sampling is with replacement. 1
n
A Recursion Relation In this subsection, we will give an alternate derivation of the distribution of the number of matches, in a sense by embedding the experiment with parameter n into the experiment with parameter n + 1 . The probability density function of the number of matches satisfies the following recursion relation and initial condition: 1. P(N 2. P(N
n
= k) = (k + 1)P(Nn+1 = k + 1)
1
= 1) = 1
for k ∈ {0, 1, … , n}.
.
Proof This result can be used to obtain the probability density function of N recursively for any n . n
The Probability Generating Function Next recall that the probability generating function of N is given by n
12.5.3
https://stats.libretexts.org/@go/page/10248
n Nn
Gn (t) = E (t
j
) = ∑ P(Nn = j)t ,
t ∈ R
(12.5.13)
j=0
The family of probability generating functions satisfies the following differential equations and ancillary conditions: 1. G 2. G
′ n+1
for t ∈ R and n ∈ N for n ∈ N
(t) = Gn (t)
n (1) = 1
+
+
Note also that G
1 (t)
=t
for t ∈ R . Thus, the system of differential equations can be used to compute G for any n ∈ N . n
+
In particular, for t ∈ R , 1. G 2. G 3. G
2 (t)
=
3 (t)
=
4 (t)
For k,
=
1 2 1 3 3 8
+ + +
n ∈ N+
1 2 1 2 1 3
2
t
t+ t+
1 6 1 4
3
t
2
t
+
1 24
4
t
with k < n , (k)
Gn
(t) = Gn−k (t),
t ∈ R
(12.5.14)
Proof For n ∈ N , +
1 P(Nn = k) =
k!
P(Nn−k = 0),
k ∈ {0, 1, … , n − 1}
(12.5.15)
Proof
Examples and Applications
A secretary randomly stuffs 5 letters into 5 envelopes. Find each of the following:
1. The number of outcomes with exactly k matches, for each k ∈ {0, 1, 2, 3, 4, 5}.
2. The probability density function of the number of matches.
3. The covariance and correlation of a match in one envelope and a match in another envelope.
Answer
Ten married couples are randomly paired for a dance. Find each of the following:
1. The probability density function of the number of matches.
2. The mean and variance of the number of matches.
3. The probability of at least 3 matches.
Answer
In the matching experiment, set n = 10. Run the experiment 1000 times and compare the following for the number of matches:
1. The true probabilities
2. The relative frequencies from the simulation
3. The limiting Poisson probabilities
Answer
This page titled 12.5: The Matching Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.6: The Birthday Problem
Introduction
The Sampling Model
As in the basic sampling model, suppose that we select n numbers at random, with replacement, from the population D = {1, 2, …, m}. Thus, our outcome vector is $X = (X_1, X_2, \ldots, X_n)$ where $X_i$ is the ith number chosen. Recall that our basic modeling assumption is that X is uniformly distributed on the sample space $S = D^n = \{1, 2, \ldots, m\}^n$.
In this section, we are interested in the number of population values missing from the sample, and the number of (distinct) population values in the sample. The computation of probabilities related to these random variables is generally referred to as birthday problems. Often, we will interpret the sampling experiment as a distribution of n balls into m cells; $X_i$ is the cell number of ball i. In this interpretation, our interest is in the number of empty cells and the number of occupied cells.
For $i \in D$, let $Y_i$ denote the number of times that i occurs in the sample:
$$Y_i = \#\{j \in \{1, 2, \ldots, n\} : X_j = i\} = \sum_{j=1}^n \mathbf{1}(X_j = i) \tag{12.6.1}$$
$Y = (Y_1, Y_2, \ldots, Y_m)$ has the multinomial distribution with parameters n and $(1/m, 1/m, \ldots, 1/m)$:
$$P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_m = y_m) = \binom{n}{y_1, y_2, \ldots, y_m} \frac{1}{m^n}, \quad (y_1, y_2, \ldots, y_m) \in \mathbb{N}^m \text{ with } \sum_{i=1}^m y_i = n \tag{12.6.2}$$
Proof
We will now define the main random variables of interest. The number of population values missing in the sample is
$$U = \#\{j \in \{1, 2, \ldots, m\} : Y_j = 0\} = \sum_{j=1}^m \mathbf{1}(Y_j = 0) \tag{12.6.3}$$
and the number of (distinct) population values that occur in the sample is
$$V = \#\{j \in \{1, 2, \ldots, m\} : Y_j > 0\} = \sum_{j=1}^m \mathbf{1}(Y_j > 0) \tag{12.6.4}$$
Also, U takes values in {max{m − n, 0}, …, m − 1} and V takes values in {1, 2, …, min{m, n}}. Clearly we must have U + V = m, so once we have the probability distribution and moments of one variable, we can easily find them for the other variable. However, we will first solve the simplest version of the birthday problem.
The Simple Birthday Problem
The event that there is at least one duplication when a sample of size n is chosen from a population of size m is
$$B_{m,n} = \{V < n\} = \{U > m - n\} \tag{12.6.5}$$
The (simple) birthday problem is to compute the probability of this event. For example, suppose that we choose n people at random and note their birthdays. If we ignore leap years and assume that birthdays are uniformly distributed throughout the year, then our sampling model applies with m = 365. In this setting, the birthday problem is to compute the probability that at least two people have the same birthday (this special case is the origin of the name).
The solution of the birthday problem is an easy exercise in combinatorial probability. The probability of the birthday event is
$$P(B_{m,n}) = 1 - \frac{m^{(n)}}{m^n}, \quad n \le m \tag{12.6.6}$$
and $P(B_{m,n}) = 1$ for $n > m$.
Proof
The fact that the probability is 1 for n > m is sometimes referred to as the pigeonhole principle: if more than m pigeons are placed into m holes then at least one hole has 2 or more pigeons. The following result gives a recurrence relation for the probability of distinct sample values and thus gives another way to compute the birthday probability.
Let $p_{m,n}$ denote the probability of the complementary birthday event $B_{m,n}^c$, that the sample variables are distinct, with population size m and sample size n. Then $p_{m,n}$ satisfies the following recursion relation and initial condition:
1. $p_{m,n+1} = \frac{m-n}{m} p_{m,n}$
2. $p_{m,1} = 1$
Examples
Let m = 365 (the standard birthday problem).
1. $P(B_{365,10}) = 0.117$
2. $P(B_{365,20}) = 0.411$
3. $P(B_{365,30}) = 0.706$
4. $P(B_{365,40}) = 0.891$
5. $P(B_{365,50}) = 0.970$
6. $P(B_{365,60}) = 0.994$
Figure 12.6.1: $P(B_{365,n})$ as a function of n, smoothed for the sake of appearance
In the birthday experiment, set m = 365 and select the indicator variable I. For n ∈ {10, 20, 30, 40, 50, 60} run the experiment 1000 times each and compare the relative frequencies with the true probabilities.
In spite of its easy solution, the birthday problem is famous because, numerically, the probabilities can be a bit surprising. Note that with just 60 people, the event is almost certain! With just 23 people, the birthday event has probability about $\frac{1}{2}$; specifically $P(B_{365,23}) = 0.507$. Mathematically, the rapid increase in the birthday probability, as n increases, is due to the fact that $m^n$ grows much faster than $m^{(n)}$.
Four fair, standard dice are rolled. Find the probability that the scores are distinct. Answer In the birthday experiment, set m = 6 and select the indicator variable I . Vary n with the scrollbar and note graphically how the probabilities change. Now with n = 4 , run the experiment 1000 times and compare the relative frequency of the event to the corresponding probability. Five persons are chosen at random.
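Computations like these reduce to the formula above; here is a short Python sketch (our helper name, standard library only):

import math

def birthday_prob(m, n):
    """P(at least one duplicate) for n draws with replacement from m values."""
    if n > m:
        return 1.0  # pigeonhole principle
    return 1 - math.perm(m, n) / m ** n

print(birthday_prob(6, 4))     # four dice: 1 - 360/1296, about 0.722
print(birthday_prob(365, 23))  # the classic case: about 0.507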
12.6.2
https://stats.libretexts.org/@go/page/10249
1. Find the probability that at least 2 have the same birth month. 2. Criticize the sampling model in this setting Answer In the birthday experiment, set m = 12 and select the indicator variable I . Vary n with the scrollbar and note graphically how the probabilities change. Now with n = 5 , run the experiment 1000 times and compare the relative frequency of the event to the corresponding probability. A fast-food restaurant gives away one of 10 different toys with the purchase of a kid's meal. A family with 5 children buys 5 kid's meals. Find the probability that the 5 toys are different. Answer In the birthday experiment, set m = 10 and select the indicator variable I . Vary n with the scrollbar and note graphically how the probabilities change. Now with n = 5 , run the experiment 1000 times and comparethe relative frequency of the event to the corresponding probability. Let m = 52 . Find the smallest value of n such that the probability of a duplication is at least
1 2
.
Answer
The General Birthday Problem We now return to the more general problem of finding the distribution of the number of distinct sample values and the distribution of the number of excluded sample values.
The Probability Density Function The number of samples with exactly j values excluded is m #{U = j} = (
m−j
m −j
k
) ∑(−1 ) ( j
n
)(m − j − k) ,
j ∈ {max{m − n, 0}, … , m − 1}
(12.6.7)
k
k=0
Proof The distributions of the number of excluded values and the number of distinct values are now easy. The probability density function of U is given by m P(U = j) = (
m−j k
m −j
) ∑(−1 ) ( j
k
k=0
n
j+k ) (1 −
)
,
j ∈ {max{m − n, 0}, … , m − 1}
(12.6.11)
m
Proof The probability density function of the number of distinct values V is given by m P(V = j) = (
j k
j
) ∑(−1 ) ( j
k=0
j−k )(
k
n
) ,
j ∈ {1, 2, … , min{m, n}}
(12.6.12)
m
Proof In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 and compare the relative frequency function to the probability density function. The distribution of the number of excluded values can also be obtained by a recursion argument. Let f denote the probability density function of the number of excluded values sample size is n . Then m,n
12.6.3
U
, when the population size is
m
and the
https://stats.libretexts.org/@go/page/10249
1. f 2. f
m,1 (m
− 1) = 1
m,n+1 (j)
=
m−j m
fm,n (j) +
j+1 m
fm,n (j + 1)
Moments Now we will find the means and variances. The number of excluded values and the number of distinct values are counting variables and hence can be written as sums of indicator variables. As we have seen in many other models, this representation is frequently the best for computing moments. For j ∈ {0, 1, … , m}, let I = 1(Y = 0) , the indicator variable of the event that j is not in the sample. Note that the number of population values missing in the sample can be written as the sum of the indicator variables: j
j
m
U = ∑ Ij
(12.6.13)
j=1
For distinct i, 1. E (I
j)
= (1 −
2. var (I
j)
3. cov (I
i,
,
j ∈ {1, 2, … , m} n
1 m
= (1 −
) 1 m
n
)
Ij ) = (1 −
− (1 − 2 m
1 m
2 n
)
n
)
− (1 −
1 m
2 n
)
Proof The expected number of excluded values and the expected number of distinct values are 1. E(U ) = m(1 −
1 m
n
)
2. E(V ) = m [1 − (1 −
1 m
n
)
]
Proof The variance of the number of exluded values and the variance of the number of distinct values are n
2 var(U ) = var(V ) = m(m − 1) (1 −
)
n
1 + m (1 −
m
) m
−m
2
2n
1 (1 −
)
(12.6.14)
m
Proof In the birthday experiment, select the number of distinct sample values. Vary the parameters and note the size and location of the mean ± standard-deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the sample mean and variance to the distribution mean and variance.
Examples and Applications Suppose that 30 persons are chosen at random. Find each of the following: 1. The probability density function of the number of distinct birthdays. 2. The mean of the number of distinct birthdays. 3. The variance of the number of distinct birthdays. 4. The probability that there are at least 28 different birthdays represented. Answer In the birthday experiment, set m = 365 and n = 30 . Run the experiment 1000 times with an update frequency of 10 and compute the relative frequency of the event in part (d) of the last exercise. Suppose that 10 fair dice are rolled. Find each of the following: 1. The probability density function of the number of distinct scores. 2. The mean of the number of distinct scores. 3. The variance of the number of distinct scores.
12.6.4
https://stats.libretexts.org/@go/page/10249
4. The probability that there will 4 or fewer distinct scores. Answer In the birthday experiment, set m = 6 and n = 10 . Run the experiment 1000 times and compute the relative frequency of the event in part (d) of the last exercise. A fast food restaurant gives away one of 10 different toys with the purchase of each kid's meal. A family buys 15 kid's meals. Find each of the following: 1. The probability density function of the number of toys that are missing. 2. The mean of the number of toys that are missing. 3. The variance of the number of toys that are missing. 4. The probability that at least 3 toys are missing. Answwer In the birthday experiment, set m = 10 and n = 15 . Run the experiment 1000 times and compute the relative frequency of the event in part (d). The lying students problem. Suppose that 3 students, who ride together, miss a mathematics exam. They decide to lie to the instructor by saying that the car had a flat tire. The instructor separates the students and asks each of them which tire was flat. The students, who did not anticipate this, select their answers independently and at random. Find each of the following: 1. The probability density function of the number of distinct answers. 2. The probability that the students get away with their deception. 3. The mean of the number of distinct answers. 4. The standard deviation of the number of distinct answers. Answer The duck hunter problem. Suppose that there are 5 duck hunters, each a perfect shot. A flock of 10 ducks fly over, and each hunter selects one duck at random and shoots. Find each of the following: 1. The probability density function of the number of ducks that are killed. 2. The mean of the number of ducks that are killed. 3. The standard deviation of the number of ducks that are killed. Answer This page titled 12.6: The Birthday Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.6.5
https://stats.libretexts.org/@go/page/10249
12.7: The Coupon Collector Problem Basic Theory Definitions In this section, our random experiment is to sample repeatedly, with replacement, from the population D = {1, 2, … , m}. This generates a sequence of independent random variables X = (X , X , …), each uniformly distributed on D 1
2
We will often interpret the sampling in terms of a coupon collector: each time the collector buys a certain product (bubble gum or Cracker Jack, for example) she receives a coupon (a baseball card or a toy, for example) which is equally likely to be any one of m types. Thus, in this setting, X ∈ D is the coupon type received on the ith purchase. i
Let V denote the number of distinct values in the first n selections, for n ∈ N . This is the random variable studied in the last section on the Birthday Problem. Our interest is in this section is the sample size needed to get a specified number of distinct sample values n
+
For k ∈ {1, 2, … , m}, let Wk = min{n ∈ N+ : Vn = k}
(12.7.1)
the sample size needed to get k distinct sample values. In terms of the coupon collector, this random variable gives the number of products required to get k distinct coupon types. Note that the set of possible values of W is {k, k + 1, …} . We will be particularly interested in W , the sample size needed to get the entire population. In terms of the coupon collector, this is the number of products required to get the entire set of coupons. k
m
In the coupon collector experiment, run the experiment in single-step mode a few times for selected values of the parameters.
The Probability Density Function Now let's find the distribution of W . The results of the previous section will be very helpful k
For k ∈ {1, 2, … , m}, the probability density function of W is given by k
k−1
m −1 P(Wk = n) = (
j
k−1
) ∑(−1 ) ( k−1
j=0
k −j−1 )(
j
n−1
)
,
n ∈ {k, k + 1, …}
(12.7.2)
m
Proof In the coupon collector experiment, vary the parameters and note the shape of and position of the probability density function. For selected values of the parameters, run the experiment 1000 times and compare the relative frequency function to the probability density function. An alternate approach to the probability density function of W is via a recursion formula. k
For fixed m, let g denote the probability density function of W . Then k
1. g 2. g
k (n
+ 1) =
1 (1)
k−1 m
k
gk (n) +
m−k+1 m
gk−1 (n)
=1
Decomposition as a Sum We will now show that W can be decomposed as a sum of k independent, geometrically distributed random variables. This will provide some additional insight into the nature of the distribution and will make the computation of the mean and variance easy. k
For i ∈ {1, 2, … , m}, let Z denote the number of additional samples needed to go from i − 1 distinct values to i distinct values. Then Z = (Z , Z , … , Z ) is a sequence of independent random variables, and Z has the geometric distribution on i
1
2
m
i
12.7.1
https://stats.libretexts.org/@go/page/10250
N+
with parameter p
i
=
m−i+1 m
. Moreover, k
Wk = ∑ Zi ,
k ∈ {1, 2, … , m}
(12.7.4)
i=1
This result shows clearly that each time a new coupon is obtained, it becomes harder to get the next new coupon. In the coupon collector experiment, run the experiment in single-step mode a few times for selected values of the parameters. In particular, try this with m large and k near m.
Moments The decomposition as a sum of independent variables provides an easy way to compute the mean and other moments of W . k
The mean and variance of the sample size needed to get k distinct values are 1. E(W
k)
2. var(W
k
m
=∑
k)
i=1
m−i+1
k
=∑
i=1
(i−1)m 2
(m−i+1)
Proof In the coupon collector experiment, vary the parameters and note the shape and location of the mean ± standard deviation bar. For selected values of the parameters, run the experiment 1000 times and compare the sample mean and standard deviation to the distribution mean and standard deviation. The probability generating function of W is given by k
k Wk
E (t
m −i +1
) =∏ i=1
m ,
m − (i − 1)t
|t|
0 . n
n
n
Suppose that a, b, c ∈ N . There exists a random variable P having the beta distribution with parameters a/c and that M → P and Z → P as n → ∞ with probability 1 and in mean square, and hence also in distribution. +
n
b/c
such
n
Proof In turns out that the random process Z = {Z = (a + cY )/(a + b + cn) : n ∈ N} is a martingale. The theory of martingales provides powerful tools for studying convergence in Pólya's urn process. As an interesting special case, note that if a = b = c then the limiting distribution is the uniform distribution on (0, 1). n
n
The Trial Number of the kth Red Ball Suppose again that c ∈ N , so that the process continues indefinitely. For k ∈ N selected. Thus
+
let V denote the trial number of the k th red ball k
Vk = min{n ∈ N+ : Yn = k}
Note that V takes values in other in a sense.
{k, k + 1, …}
k
For k, 1. V 2. V
n ∈ N+
k
≤n
k
=n
. The random processes
V = (V1 , V2 , …)
(12.8.18)
and Y
= (Y1 , Y2 , …)
are inverses of each
with k ≤ n ,
if and only if Y if and only if Y
n
≥k
n−1
= k−1
and X
n
=1
The probability denisty function of V is given by k
(c,k)
n−1 P(Vk = n) = (
a
(c,n−k)
b
)
,
n ∈ {k, k + 1, …}
(12.8.19)
(c,n)
k−1
(a + b)
Proof Of course this probability density function reduces to the negative binomial density function with trial parameter k and success parameter p = when c = 0 (sampling with replacement). When c > 0 , the distribution is a special case of the beta-negative binomial distribution. a
a+b
If a,
b, c ∈ N+
then V has the beta-negative binomial distribution with parameters k , a/c, and b/c. That is, k
[k]
n−1 P(Vk = n) = (
(a/c )
[n−k]
(b/c )
)
,
n ∈ {k, k + 1, …}
(12.8.20)
[n]
k−1
(a/c + b/c)
Proof If a = b = c then k P(Vk = n) =
,
n ∈ {k, k + 1, k + 2, …}
(12.8.21)
n(n + 1)
Proof Fix a , b , and k , and let c → ∞ . Then 1. P(V
k
= k) →
a a+b
12.8.5
https://stats.libretexts.org/@go/page/10251
2. P(V
k
∈ {k + 1, k + 2, …}) → 0
Thus, the limiting distribution of V is concentrated on k and ∞. The limiting probabilities at these two points are just the initial proportion of red and green balls, respectively. Interpret this result in terms of the dynamics of Pólya's urn scheme. k
This page titled 12.8: Pólya's Urn Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.8.6
https://stats.libretexts.org/@go/page/10251
12.9: The Secretary Problem In this section we will study a nice problem known variously as the secretary problem or the marriage problem. It is simple to state and not difficult to solve, but the solution is interesting and a bit surprising. Also, the problem serves as a nice introduction to the general area of statistical decision making.
Statement of the Problem As always, we must start with a clear statement of the problem. We have n candidates (perhaps applicants for a job or possible marriage partners). The assumptions are 1. The candidates are totally ordered from best to worst with no ties. 2. The candidates arrive sequentially in random order. 3. We can only determine the relative ranks of the candidates as they arrive. We cannot observe the absolute ranks. 4. Our goal is choose the very best candidate; no one less will do. 5. Once a candidate is rejected, she is gone forever and cannot be recalled. 6. The number of candidates n is known. The assumptions, of course, are not entirely reasonable in real applications. The last assumption, for example, that more appropriate for the secretary interpretation than for the marriage interpretation.
n
is known, is
What is an optimal strategy? What is the probability of success with this strategy? What happens to the strategy and the probability of success as n increases? In particular, when n is large, is there any reasonable hope of finding the best candidate?
Strategies Play the secretary game several times with n = 10 candidates. See if you can find a good strategy just by trial and error. After playing the secretary game a few times, it should be clear that the only reasonable type of strategy is to let a certain number k − 1 of the candidates go by, and then select the first candidate we see who is better than all of the previous candidates (if she exists). If she does not exist (that is, if no candidate better than all previous candidates appears), we will agree to accept the last candidate, even though this means failure. The parameter k must be between 1 and n ; if k = 1 , we select the first candidate; if k = n , we select the last candidate; for any other value of k , the selected candidate is random, distributed on {k, k + 1, … , n} . We will refer to this “let k − 1 go by” strategy as strategy k . Thus, we need to compute the probability of success p (k) using strategy k with n candidates. Then we can maximize the probability over k to find the optimal strategy, and then take the limit over n to study the asymptotic behavior. n
Analysis First, let's do some basic computations. For the case optimal. k
p3 (k)
n =3
, list the 6 permutations of
{1, 2, 3}
and verify the probabilities in the table below. Note that
1
2
3
2
3
2
6
6
6
k =2
is
Answer In the secretary experiment, set the number of candidates to
n =3
. Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3}
For the case n = 4 , list the 24 permutations of {1, 2, 3, 4} and verify the probabilities in the table below. Note that optimal. The last row gives the total number of successes for each strategy.
12.9.1
k =2
is
https://stats.libretexts.org/@go/page/10252
1
k
p4 (k)
2
3
4
6
11
10
6
24
24
24
24
Answer In the secretary experiment, set the number of candidates to
n =4
. Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3, 4}
For the case n = 5 , list the 120 permutations of {1, 2, 3, 4, 5}and verify the probabilities in the table below. Note that k = 3 is optimal. 1
k
p5 (k)
2
3
4
5
24
50
52
42
24
120
120
120
120
120
In the secretary experiment, set the number of candidates to
n =5
. Run the experiment 1000 times with each strategy
k ∈ {1, 2, 3, 4, 5}
Well, clearly we don't want to keep doing this. Let's see if we can find a general analysis. With n candidates, let X denote the number (arrival order) of the best candidate, and let S denote the event of success for strategy k (we select the best candidate). n
n,k
Xn
is uniformly distributed on {1, 2, … , n}.
Proof Next we will compute the conditional probability of success given the arrival order of the best candidate. For n ∈ N
+
and k ∈ {2, 3, … , n}, 0, P(Sn,k ∣ Xn = j) = {
j ∈ {1, 2, … , k − 1}
k−1
,
j−1
j ∈ {k, k + 1, … , n}
(12.9.1)
Proof The two cases are illustrated below. The large dot indicates the best candidate. Red dots indicate candidates that are rejected out of hand, while blue dots indicate candidates that are considered.
Figure 12.9.1: The case when X
=j1
(12.9.4)
j−1
, the function p at first increases and then decreases. The maximum value of > 1 . This is the optimal strategy with n candidates, which we have denoted by k .
n ∈ N+
1
1
n
pn
occurs at the
n
As n increases, k increases and the optimal probability p
n (kn )
n
decreases.
Asymptotic Analysis We are naturally interested in the asymptotic behavior of the function p , and the optimal strategy as n → ∞ . The key is recognizing p as a Riemann sum for a simple integral. (Riemann sums, of course, are named for Georg Riemann.) n
n
If k(n) depends on n and k(n)/n → x ∈ (0, 1) as n → ∞ then p
n [k(n)]
→ −x ln x
as n → ∞ .
Proof The optimal strategy k that maximizes k ↦ p (k) , the ratio k /n, and the optimal probability candidate, as functions of n ∈ {10, 20, … , 100}are given in the following table: n
n
n
pn (kn )
of finding the best
Candidates n
Optimal strategy k
Ratio k
Optimal probability p
10
4
0.4000
0.3987
20
8
0.4000
0.3842
30
12
0.4000
0.3786
n
n /n
12.9.4
n (kn )
https://stats.libretexts.org/@go/page/10252
Candidates n
Optimal strategy k
Ratio k
Optimal probability p
40
16
0.4000
0.3757
50
19
0.3800
0.3743
60
23
0.3833
0.3732
70
27
0.3857
0.3724
80
30
0.3750
0.3719
90
34
0.3778
0.3714
100
38
0.3800
0.3710
n /n
n
The graph below shows the true probabilities p
n (k)
and the limiting values −
n (kn )
k n
ln(
k n
)
as a function of k with n = 100 .
Figure 12.9.6 : True and approximate probabilities of success as a function of k with n = 100
For the optimal strategy k , there exists x ∈ (0, 1) such that k /n → x as n → ∞ . Thus, x of the candidates that we reject out of hand. Moreover, x maximizes x ↦ −x ln x on (0, 1). n
0
n
0
0
∈ (0, 1)
is the limiting proportion
0
The maximum value of −x ln x occurs at x
0
= 1/e
and the maximum value is also 1/e.
Proof
Figure 12.9.7 : The graph of x ln x on the interval (0, 1)
Thus, the magic number 1/e ≈ 0.3679 occurs twice in the problem. For large n : Our approximate optimal strategy is to reject out of hand the first 37% of the candidates and then select the first candidate (if she appears) that is better than all of the previous candidates. Our probability of finding the best candidate is about 0.37. The article “Who Solved the Secretary Problem?” by Tom Ferguson (1989) has an interesting historical discussion of the problem, including speculation that Johannes Kepler may have used the optimal strategy to choose his second wife. The article also discusses many interesting generalizations of the problem. A different version of the secretary problem, in which the candidates are assigned a score in [0, 1], rather than a relative rank, is discussed in the section on Stopping Times in the chapter on Martingales
12.9.5
https://stats.libretexts.org/@go/page/10252
This page titled 12.9: The Secretary Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
12.9.6
https://stats.libretexts.org/@go/page/10252
CHAPTER OVERVIEW 13: Games of Chance Games of chance hold an honored place in probability theory, because of their conceptual clarity and because of their fundamental influence on the early development of the subject. In this chapter, we explore some of the most common and basic games of chance. Roulette, craps, and Keno are casino games. The Monty Hall problem is based on a TV game show, and has become famous because of the controversy that it generated. Lotteries are now basic ways that governments and other institutions raise money. In the last four sections on the game of red and black, we study various types of gambling strategies, a study which leads to some deep and fascinating mathematics. 13.1: Introduction to Games of Chance 13.2: Poker 13.3: Simple Dice Games 13.4: Craps 13.5: Roulette 13.6: The Monty Hall Problem 13.7: Lotteries 13.8: The Red and Black Game 13.9: Timid Play 13.10: Bold Play 13.11: Optimal Strategies
This page titled 13: Games of Chance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1
13.1: Introduction to Games of Chance Gambling and Probability Games of chance are among the oldest of human inventions. The use of a certain type of animal heel bone (called the astragalus or colloquially the knucklebone) as a crude die dates to about 3600 BCE. The modern six-sided die dates to 2000 BCE, and the term bones is used as a slang expression for dice to this day (as in roll the bones). It is because of these ancient origins, by the way, that we use the die as the fundamental symbol in this project.
Figure 13.1.1 : An artificial knucklebone made of steatite, from the Arjan Verweij Dice Website
Gambling is intimately interwoven with the development of probability as a mathematical theory. Most of the early development of probability, in particular, was stimulated by special gambling problems, such as DeMere's problem Pepy's problem the problem of points the Petersburg problem Some of the very first books on probability theory were written to analyze games of chance, for example Liber de Ludo Aleae (The Book on Games of Chance), by Girolamo Cardano, and Essay d' Analyse sur les Jeux de Hazard (Analytical Essay on Games of Chance), by Pierre-Remond Montmort. Gambling problems continue to be a source of interesting and deep problems in probability to this day (see the discussion of Red and Black for an example).
Figure 13.1.2 : Allegory of Fortune by Dosso Dossi (c. 1591), Getty Museum. For more depictions of gambling in paintings, see the ancillary material on art.
Of course, it is important to keep in mind that breakthroughs in probability, even when they are originally motivated by gambling problems, are often profoundly important in the natural sciences, the social sciences, law, and medicine. Also, games of chance provide some of the conceptually clearest and cleanest examples of random experiments, and thus their analysis can be very helpful to students of probability. However, nothing in this chapter should be construed as encouraging you, gentle reader, to gamble. On the contrary, our analysis will show that, in the long run, only the gambling houses prosper. The gambler, inevitably, is a sad victim of the law of large numbers.
13.1.1
https://stats.libretexts.org/@go/page/10255
In this chapter we will study some interesting games of chance. Poker, poker dice, craps, and roulette are popular parlor and casino games. The Monty Hall problem, on the other hand, is interesting because of the controversy that it generated. The lottery is a basic way that many states and nations use to raise money (a voluntary tax, of sorts).
Terminology Let us discuss some of the basic terminology that will be used in several sections of this chapter. Suppose that random experiment. The mathematical odds concerning A refer to the probability of A .
A
is an event in a
If a and b are positive numbers, then by definition, the following are equivalent: 1. the odds in favor of A are a : b . 2. P(A) = . 3. the odds against A are b : a . 4. P(A ) = . a
a+b
b
c
a+b
In many cases, a and b can be given as positive integers with no common factors. Similarly, suppose that p ∈ [0, 1]. The following are equivalent: 1. P(A) = p . 2. The odds in favor of A are p : 1 − p . 3. P(A ) = 1 − p . 4. The odds against A are 1 − p : p . c
On the other hand, the house odds of an event refer to the payout when a bet is made on the event. A bet on event A pays n : m means that if a gambler bets m units on A then 1. If A occurs, the gambler receives the m units back and an additional n units (for a net profit of n ) 2. If A does not occur, the gambler loses the bet of m units (for a net profit of −m). Equivalently, the gambler puts up m units (betting on A ), the house puts up n units, (betting on A ) and the winner takes the pot. Of course, it is usually not necessary for the gambler to bet exactly m; a smaller or larger is bet is scaled appropriately. Thus, if the gambler bets k units and wins, his payout is k . c
n
m
Naturally, our main interest is in the net winnings if we make a bet on an event. The following result gives the probability density function, mean, and variance for a unit bet. The expected value is particularly interesting, because by the law of large numbers, it gives the long term gain or loss, per unit bet. Suppose that the odds in favor of event A are a : b and that a bet on event A pays n : m . Let W denote the winnings from a unit bet on A . Then 1. P(W
= −1) =
2. E(W ) =
b a+b
, P (W
=
n m
) =
a a+b
a n−bm m(a+b) 2
3. var(W ) =
ab(n+m)
2
m2 (a+b)
In particular, the expected value of the bet is zero if and only if an = bm , positive if and only if an > bm , and negative if and only if an < bm . The first case means that the bet is fair, and occurs when the payoff is the same as the odds against the event. The second means that the bet is favorable to the gambler, and occurs when the payoff is greater that the odds against the event. The third case means that the bet is unfair to the gambler, and occurs when the payoff is less than the odds against the event. Unfortunately, all casino games fall into the third category.
More About Dice
13.1.2
https://stats.libretexts.org/@go/page/10255
Shapes of Dice The standard die, of course, is a cube with six sides. A bit more generally, most real dice are in the shape of Platonic solids, named for Plato naturally. The faces of a Platonic solid are congruent regular polygons. Moreover, the same number of faces meet at each vertex so all of the edges and angles are congruent as well. The five Platonic solids are 1. The tetrahedron, with 4 sides. 2. The hexahedron (cube), with 6 sides 3. The octahedron, with 8 sides 4. The dodecahedron, with 12 sides 5. The icosahedron, with 20 sides
Figure 13.1.3 : Blue Platonic Dice from Wikipedia
Note that the 4-sided die is the only Platonic die in which the outcome is the face that is down rather than up (or perhaps it's better to think of the vertex that is up as the outcome).
Fair and Crooked Dice Recall that a fair die is one in which the faces are equally likely. In addition to fair dice, there are various types of crooked dice. For the standard six-sided die, there are three crooked types that we use frequently in this project. To understand the geometry, recall that with the standard six-sided die, opposite faces sum to 7. Flat Dice 1. An ace-six flat die is a six-sided die in which faces 1 and 6 have probability each while faces 2, 3, 4, and 5 have probability each. 2. A two-five flat die is a six-sided die in which faces 2 and 5 have probability each while faces 1, 3, 4, and 6 have probability each. 3. A three-four flat die is a six-sided die in which faces 3 and 4 have probability each while faces 1, 2, 5, and 6 have probability each. 1 4
1 8
1 4
1 8
1 4
1 8
A flat die, as the name suggests, is a die that is not a cube, but rather is shorter in one of the three directions. The particular probabilities that we use (1/4 and 1/8) are fictitious, but the essential property of a flat die is that the opposite faces on the shorter axis have slightly larger probabilities (because they have slightly larger areas) than the other four faces. Flat dice are sometimes used by gamblers to cheat. In the Dice Experiment, select one die. Run the experiment 1000 times in each of the following cases and observe the outcomes. 1. fair die 2. ace-six flat die 3. two-five flat die 4. three-four flat die
Simulation It's very easy to simulate a fair die with a random number. Recall that the ceiling function ⌈x⌉ gives the smallest integer that is at least as large as x. Suppose that U is uniformly distributed on the interval (0, 1], so that U has the standard uniform distribution (a random number). Then X = ⌈6 U ⌉ is uniformly distributed on the set {1, 2, 3, 4, 5, 6} and so simulates a fair six-sided die. More generally, X = ⌈n U ⌉ is uniformly distributed on {1, 2, … , n} and so simlates a fair n -sided die.
13.1.3
https://stats.libretexts.org/@go/page/10255
We can also use a real fair die to simulate other types of fair dice. Recall that if X is uniformly distributed on {1, 2, … , n} and k ∈ {1, 2, … , n − 1} , then the conditional distribution of X given that X ∈ {1, 2, … , k} is uniformly distributed on {1, 2, … , k}. Thus, suppose that we have a real, fair, n -sided die. If we ignore outcomes greater than k then we simulate a fair k sided die. For example, suppose that we have a carefully constructed icosahedron that is a fair 20-sided die. We can simulate a fair 13-sided die by simply rolling the die and stopping as soon as we have a score between 1 and 13. To see how to simulate a card hand, see the Introduction to Finite Sampling Models. A general method of simulating random variables is based on the quantile function. This page titled 13.1: Introduction to Games of Chance is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.1.4
https://stats.libretexts.org/@go/page/10255
13.2: Poker Basic Theory The Poker Hand A deck of cards naturally has the structure of a product set and thus can be modeled mathematically by D = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, j, q, k} × {♣, ♢, ♡, ♠}
(13.2.1)
where the first coordinate represents the denomination or kind (ace, two through 10, jack, queen, king) and where the second coordinate represents the suit (clubs, diamond, hearts, spades). Sometimes we represent a card as a string rather than an ordered pair (for example q ♡). There are many different poker games, but we will be interested in standard draw poker, which consists of dealing 5 cards at random from the deck D. The order of the cards does not matter in draw poker, so we will record the outcome of our random experiment as the random set (hand) X = {X , X , X , X , X } where X = (Y , Z ) ∈ D for each i and X ≠ X for i ≠ j . Thus, the sample space consists of all possible poker hands: 1
2
3
4
5
i
i
i
i
S = {{ x1 , x2 , x3 , x4 , x5 } : xi ∈ D for each i and xi ≠ xj for all i ≠ j}
j
(13.2.2)
Our basic modeling assumption (and the meaning of the term at random) is that all poker hands are equally likely. Thus, the random variable X is uniformly distributed over the set of possible poker hands S . #(A) P(X ∈ A) =
(13.2.3) #(S)
In statistical terms, a poker hand is a random sample of size 5 drawn without replacement and without regard to order from the population D. For more on this topic, see the chapter on Finite Sampling Models.
The Value of the Hand There are nine different types of poker hands in terms of value. We will use the numbers 0 to 8 to denote the value of the hand, where 0 is the type of least value (actually no value) and 8 the type of most value. The hand value V of a poker hand is a random variable taking values 0 through 8, and is defined as follows: 0. No Value. The hand is of none of the other types. 1. One Pair. The hand has 2 cards of one kind, and one card each of three other kinds. 2. Two Pair. The hand has 2 cards of one kind, 2 cards of another kind, and one card of a third kind. 3. Three of a Kind. The hand has 3 cards of one kind and one card of each of two other kinds. 4. Straight. The kinds of cards in the hand form a consecutive sequence but the cards are not all in the same suit. An ace can be considered the smallest denomination or the largest denomination. 5. Flush. The cards are all in the same suit, but the kinds of the cards do not form a consecutive sequence. 6. Full House. The hand has 3 cards of one kind and 2 cards of another kind. 7. Four of a Kind. The hand has 4 cards of one kind, and 1 card of another kind. 8. Straight Flush. The cards are all in the same suit and the kinds form a consecutive sequence. Run the poker experiment 10 times in single-step mode. For each outcome, note that the value of the random variable corresponds to the type of hand, as given above. For some comic relief before we get to the analysis, look at two of the paintings of Dogs Playing Poker by CM Coolidge. 1. His Station and Four Aces 2. Waterloo
13.2.1
https://stats.libretexts.org/@go/page/10256
The Probability Density Function Computing the probability density function of V is a good exercise in combinatorial probability. In the following exercises, we need the two fundamental rules of combinatorics to count the number of poker hands of a given type: the multiplication rule and the addition rule. We also need some basic combinatorial structures, particularly combinations. The number of different poker hands is #(S) = (
52 ) = 2 598 960 5
(13.2.4)
.
P(V = 1) = 1 098 240/2 598 960 ≈ 0.422569
Proof .
P(V = 2) = 123 552/2 598 960 ≈ 0.047539
Proof .
P(V = 3) = 54 912/2 598 860 ≈ 0.021129
Proof .
P(V = 8) = 40/2 598 960 ≈ 0.000015
Proof .
P(V = 4) = 10 200/2 598 960 ≈ 0.003925
Proof .
P(V = 5) = 5108/2 598 960 ≈ 0.001965
Proof .
P(V = 6) = 3744/2 598 960 ≈ 0.001441
Proof .
P(V = 7) = 624/2 598 960 ≈ 0.000240
Proof .
P(V = 0) = 1 302 540/2 598 960 ≈ 0.501177
Proof Note that the probability density function of V is decreasing; the more valuable the type of hand, the less likely the type of hand is to occur. Note also that no value and one pair account for more than 92% of all poker hands. In the poker experiment, note the shape of the density graph. Note that some of the probabilities are so small that they are essentially invisible in the graph. Now run the poker hand 1000 times and compare the relative frequency function to the density function. In the poker experiment, set the stop criterion to the value of V given below. Note the number of poker hands required. 1. V 2. V 3. V 4. V 5. V 6. V
=3 =4 =5 =6 =7 =8
Find the probability of getting a hand that is three of a kind or better.
13.2.2
https://stats.libretexts.org/@go/page/10256
Answer In the movie The Parent Trap (1998), both twins get straight flushes on the same poker deal. Find the probability of this event. Answer Classify V in terms of level of measurement: nominal, ordinal, interval, or ratio. Is the expected value of V meaningful? Answer A hand with a pair of aces and a pair of eights (and a fifth card of a different type) is called a dead man's hand. The name is in honor of Wild Bill Hickok, who held such a hand at the time of his murder in 1876. Find the probability of getting a dead man's hand. Answer
Drawing Cards In draw poker, each player is dealt a poker hand and there is an initial round of betting. Typically, each player then gets to discard up to 3 cards and is dealt that number of cards from the remaining deck. This leads to myriad problems in conditional probability, as partial information becomes available. A complete analysis is far beyond the scope of this section, but we will consider a comple of simple examples. Suppose that Fred's hand is {4 ♡, 5 ♡, 7 ♠, q ♣, 1 ♢}. Fred discards the q ♣ and 1 ♢ and draws two new cards, hoping to complete the straight. Note that Fred must get a 6 and either a 3 or an 8. Since he is missing a middle denomination (6), Fred is drawing to an inside straight. Find the probability that Fred is successful. Answer Suppose that Wilma's hand is {4 ♡, 5 ♡, 6 ♠, q ♣, 1 ♢}. Wilma discards q ♣ and 1 ♢ and draws two new cards, hoping to complete the straight. Note that Wilma must get a 2 and a 3, or a 7 and an 8, or a 3 and a 7. Find the probability that Wilma is successful. Clearly, Wilma has a better chance than Fred. Answer This page titled 13.2: Poker is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.2.3
https://stats.libretexts.org/@go/page/10256
13.3: Simple Dice Games In this section, we will analyze several simple games played with dice—poker dice, chuck-a-luck, and high-low. The casino game craps is more complicated and is studied in the next section.
Figure 13.3.1 : The Dice Players by Georges de La Tour (c. 1651). For more on the influence of probability in painting, see the ancillary material on art.
Poker Dice Definition The game of poker dice is a bit like standard poker, but played with dice instead of cards. In poker dice, 5 fair dice are rolled. We will record the outcome of our random experiment as the (ordered) sequence of scores: X = (X1 , X2 , X3 , X4 , X5 )
(13.3.1)
Thus, the sample space is S = {1, 2, 3, 4, 5, 6} . Since the dice are fair, our basic modeling assumption is that X is a sequence of independent random variables and each is uniformly distributed on {1, 2, 3, 4, 5, 6}. 5
Equivalently, X is uniformly distributed on S : #(A) P(X ∈ A) =
,
A ⊆S
(13.3.2)
#(S)
In statistical terms, a poker dice hand is a random sample of size 5 drawn with replacement and with regard to order from the population D = {1, 2, 3, 4, 5, 6}. For more on this topic, see the chapter on Finite Sampling Models. In particular, in this chapter you will learn that the result of Exercise 1 would not be true if we recorded the outcome of the poker dice experiment as an unordered set instead of an ordered sequence.
The Value of the Hand The value V of the poker dice hand is a random variable with support set {0, 1, 2, 3, 4, 5, 6}. The values are defined as follows: 0. None alike. Five distinct scores occur. 1. One Pair. Four distinct scores occur; one score occurs twice and the other three scores occur once each. 2. Two Pair. Three distinct scores occur; one score occurs twice and the other three scores occur once each. 3. Three of a Kind. Three distinct scores occur; one score occurs three times and the other two scores occur once each. 4. Full House. Two distinct scores occur; one score occurs three times and the other score occurs twice. 5. Four of a king. Two distinct scores occur; one score occurs four times and the other score occurs once. 6. Five of a kind. Once score occurs five times.
13.3.1
https://stats.libretexts.org/@go/page/10257
Run the poker dice experiment 10 times in single-step mode. For each outcome, note that the value of the random variable corresponds to the type of hand, as given above.
The Probability Density Function Computing the probability density function of V is a good exercise in combinatorial probability. In the following exercises, we will need the two fundamental rules of combinatorics to count the number of dice sequences of a given type: the multiplication rule and the addition rule. We will also need some basic combinatorial structures, particularly combinations and permutations (with types of objects that are identical). The number of different poker dice hands is #(S) = 6
5
P(V = 0) =
720 7776
= 0.09259
.
≈ 0.46296
.
≈ 0.23148
.
≈ 0.15432
.
≈ 0.03858
.
= 0.01929
.
≈ 0.00077
.
= 7776
.
Proof P(V = 1) =
3600 7776
Proof P(V = 2) =
1800 7776
Proof P(V = 3) =
1200 7776
Proof P(V = 4) =
300 7776
Proof P(V = 5) =
150 7776
Proof P(V = 6) =
6 7776
Proof Run the poker dice experiment 1000 times and compare the relative frequency function to the density function. Find the probability of rolling a hand that has 3 of a kind or better. Answer In the poker dice experiment, set the stop criterion to the value of V given below. Note the number of hands required. 1. V 2. V 3. V 4. V
=3 =4 =5 =6
Chuck-a-Luck Chuck-a-luck is a popular carnival game, played with three dice. According to Richard Epstein, the original name was Sweat Cloth, and in British pubs, the game is known as Crown and Anchor (because the six sides of the dice are inscribed clubs, diamonds, hearts, spades, crown and anchor). The dice are over-sized and are kept in an hourglass-shaped cage known as the bird cage. The dice are rolled by spinning the bird cage. Chuck-a-luck is very simple. The gambler selects an integer from 1 to 6, and then the three dice are rolled. If exactly k dice show the gambler's number, the payoff is k : 1 . As with poker dice, our basic mathematical assumption is that the dice are fair, and therefore the outcome vector X = (X , X , X ) is uniformly distributed on the sample space S = {1, 2, 3, 4, 5, 6} . 3
1
2
3
13.3.2
https://stats.libretexts.org/@go/page/10257
Let Y denote the number of dice that show the gambler's number. Then Y has the binomial distribution with parameters n = 3 and p = : 1 6
3 P(Y = k) = (
1 )(
k
k
6
3−k
5
) (
)
,
k ∈ {0, 1, 2, 3}
(13.3.3)
6
Let W denote the net winnings for a unit bet. Then 1. W 2. W
if Y = 0 if Y > 0
= −1 =Y
The probability density function of W is given by 1. P(W 2. P(W 3. P(W 4. P(W
125
= −1) = = 1) = = 2) = = 3) =
216
75 216 15 216 1 216
Run the chuck-a-luck experiment 1000 times and compare the empirical density function of W to the true probability density function. The expected value and variance of W are 1. E(W ) = − 2. var(W ) =
17 216
≈ 0.0787
75815 46656
≈ 1.239
Run the chuck-a-luck experiment 1000 times and compare the empirical mean and standard deviation of W to the true mean and standard deviation. Suppose you had bet $1 on each of the 1000 games. What would your net winnings be?
High-Low In the game of high-low, a pair of fair dice are rolled. The outcome is high if the sum is 8, 9, 10, 11, or 12. low if the sum is 2, 3, 4, 5, or 6 seven if the sum is 7 A player can bet on any of the three outcomes. The payoff for a bet of high or for a bet of low is 1 : 1 . The payoff for a bet of seven is 4 : 1 . Let Z denote the outcome of a game of high-low. Find the probability density function of Z . Answer Let W denote the net winnings for a unit bet. Find the expected value and variance of W for each of the three bets: 1. high 2. low 3. seven Answer This page titled 13.3: Simple Dice Games is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.3.3
https://stats.libretexts.org/@go/page/10257
13.4: Craps The Basic Game Craps is a popular casino game, because of its complexity and because of the rich variety of bets that can be made.
Figure 13.4.1 : A typical craps table
According to Richard Epstein, craps is descended from an earlier game known as Hazard, that dates to the Middle Ages. The formal rules for Hazard were established by Montmort early in the 1700s. The origin of the name craps is shrouded in doubt, but it may have come from the English crabs or from the French Crapeaud (for toad). From a mathematical point of view, craps is interesting because it is an example of a random experiment that takes place in stages; the evolution of the game depends critically on the outcome of the first roll. In particular, the number of rolls is a random variable.
Definitions The rules for craps are as follows: The player (known as the shooter) rolls a pair of fair dice 1. If the sum is 7 or 11 on the first throw, the shooter wins; this event is called a natural. 2. If the sum is 2, 3, or 12 on the first throw, the shooter loses; this event is called craps. 3. If the sum is 4, 5, 6, 8, 9, or 10 on the first throw, this number becomes the shooter's point. The shooter continues rolling the dice until either she rolls the point again (in which case she wins) or rolls a 7 (in which case she loses). As long as the shooter wins, or loses by rolling craps, she retrains the dice and continues. Once she loses by failing to make her point, the dice are passed to the next shooter. Let us consider the game of craps mathematically. Our basic assumption, of course, is that the dice are fair and that the outcomes of the various rolls are independent. Let N denote the (random) number of rolls in the game and let (X , Y ) denote the outcome of the ith roll for i ∈ {1, 2, … , N }. Finally, let Z = X + Y , the sum of the scores on the ith roll, and let V denote the event that the shooter wins. i
i
i
i
i
In the craps experiment, press single step a few times and observe the outcomes. Make sure that you understand the rules of the game.
The Probability of Winning We will compute the probability that the shooter wins in stages, based on the outcome of the first roll. The sum of the scores Z on a given roll has the probability density function in the following table: z
P(Z = z)
2
3
4
5
6
7
8
9
10
11
12
1
2
3
4
5
6
5
4
3
2
1
36
36
36
36
36
36
36
36
36
36
36
13.4.1
https://stats.libretexts.org/@go/page/10258
The probability that the player makes her point can be computed using a simple conditioning argument. For example, suppose that the player throws 4 initially, so that 4 is the point. The player continues until she either throws 4 again or throws 7. Thus, the final roll will be an element of the following set: S4 = {(1, 3), (2, 2), (3, 1), (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}
(13.4.1)
Since the dice are fair, these outcomes are equally likely, so the probability that the player makes her 4 point is argument can be used for the other points. Here are the results:
3 9
. A similar
The probabilities of making the point z are given in the following table: 4
z
P(V ∣ Z1 = z)
5
6
8
9
10
3
4
5
5
4
3
9
10
11
11
10
9
The probability that the shooter wins is P(V ) =
244 495
≈ 0.49293
Proof Note that craps is nearly a fair game. For the sake of completeness, the following result gives the probability of winning, given a “point” on the first roll. P(V ∣ Z1 ∈ {4, 5, 6, 8, 9, 10}) =
67 165
≈ 0.406
Proof
Bets There is a bewildering variety of bets that can be made in craps. In the exercises in this subsection, we will discuss some typical bets and compute the probability density function, mean, and standard deviation of each. (Most of these bets are illustrated in the picture of the craps table above). Note however, that some of the details of the bets and, in particular the payout odds, vary from one casino to another. Of course the expected value of any bet is inevitably negative (for the gambler), and thus the gambler is doomed to lose money in the long run. Nonetheless, as we will see, some bets are better than others.
Pass and Don't Pass A pass bet is a bet that the shooter will win and pays 1 : 1 . Let W denote the winnings from a unit pass bet. Then 1. P(W = −1) = , P(W = 1) = 2. E(W ) = − ≈ −0.0141 3. sd(W ) ≈ 0.9999 251
244
495
495
7
495
In the craps experiment, select the pass bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? A don't pass bet is a bet that the shooter will lose, except that 12 on the first throw is excluded (that is, the shooter loses, of course, but the don't pass better neither wins nor loses). This is the meaning of the phrase don't pass bar double 6 on the craps table. The don't pass bet also pays 1 : 1 . Let W denote the winnings for a unit don't pass bet. Then 1. P(W = −1) = , P(W = 0) = 2. E(W ) = − ≈ −0.01363 3. sd(W ) ≈ 0.9859 244
1
495
36
, P(W
= 1) =
949 1980
27
1980
13.4.2
https://stats.libretexts.org/@go/page/10258
Thus, the don't pass bet is slightly better for the gambler than the pass bet. In the craps experiment, select the don't pass bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? The come bet and the don't come bet are analogous to the pass and don't pass bets, respectively, except that they are made after the point has been established.
Field A field bet is a bet on the outcome of the next throw. It pays 1 : 1 if 3, 4, 9, 10, or 11 is thrown, 2 : 1 if 2 or 12 is thrown, and loses otherwise. Let W denote the winnings for a unit field bet. Then 1. P(W = −1) = , P(W = 1) = 2. E(W ) = − ≈ −0.0556 3. sd(W ) ≈ 1.0787 5
7
9
18
, P(W
= 2) =
1 18
1
18
In the craps experiment, select the field bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
Seven and Eleven A 7 bet is a bet on the outcome of the next throw. It pays 4 : 1 if a 7 is thrown. Similarly, an 11 bet is a bet on the outcome of the next throw, and pays 15 : 1 if an 11 is thrown. In spite of the romance of the number 7, the next exercise shows that the 7 bet is one of the worst bets you can make. Let W denote the winnings for a unit 7 bet. Then 1. P(W = −1) = , P(W = 4) = 2. E(W ) = − ≈ −0.1667 3. sd(W ) ≈ 1.8634 5
1
6
6
1 6
In the craps experiment, select the 7 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? Let W denote the winnings for a unit 11 bet. Then 1. P(W = −1) = , P(W = 15) = 2. E(W ) = − ≈ −0.1111 3. sd(W ) ≈ 3.6650 17
1
18
18
1 9
In the craps experiment, select the 11 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
Craps All craps bets are bets on the next throw. The basic craps bet pays 7 : 1 if 2, 3, or 12 is thrown. The craps 2 bet pays 30 : 1 if a 2 is thrown. Similarly, the craps 12 bet pays 30 : 1 if a 12 is thrown. Finally, the craps 3 bet pays 15 : 1 if a 3 is thrown. Let W denote the winnings for a unit craps bet. Then 1. P(W
= −1) =
8 9
, P(W
= 7) =
1 9
13.4.3
https://stats.libretexts.org/@go/page/10258
2. E(W ) = − ≈ −0.1111 3. sd(W ) ≈ 5.0944 1 9
In the craps experiment, select the craps bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? Let W denote the winnings for a unit craps 2 bet or a unit craps 12 bet. Then 1. P(W = −1) = , P(W = 30) = 2. E(W ) = − ≈ −0.1389 3. sd(W ) = 5.0944 35
1
36
36
5
36
In the craps experiment, select the craps 2 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? In the craps experiment, select the craps 12 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? Let W denote the winnings for a unit craps 3 bet. Then 1. P(W = −1) = , P(W = 15) = 2. E(W ) = − ≈ −0.1111 3. sd(W ) ≈ 3.6650 17
1
18
18
1 9
In the craps experiment, select the craps 3 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? Thus, of the craps bets, the basic craps bet and the craps 3 bet are best for the gambler, and the craps 2 and craps 12 are the worst.
Big Six and Big Eight The big 6 bet is a bet that 6 is thrown before 7. Similarly, the big 8 bet is a bet that 8 is thrown before 7. Both pay even money 1 : 1. Let W denote the winnings for a unit big 6 bet or a unit big 8 bet. Then 1. P(W = −1) = , P(W = 1) = 2. E(W ) = − ≈ −0.0909 3. sd(W ) ≈ 0.9959 6
5
11
11
1
11
In the craps experiment, select the big 6 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? In the craps experiment, select the big 8 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
13.4.4
https://stats.libretexts.org/@go/page/10258
Hardway Bets A hardway bet can be made on any of the numbers 4, 6, 8, or 10. It is a bet that the chosen number n will be thrown “the hardway” as (n/2, n/2), before 7 is thrown and before the chosen number is thrown in any other combination. Hardway bets on 4 and 10 pay 7 : 1 , while hardway bets on 6 and 8 pay 9 : 1 . Let W denote the winnings for a unit hardway 4 or hardway 10 bet. Then 1. P(W = −1) = , P(W = 7) = 2. E(W ) = − ≈ −0.1111 3. sd(W ) = 2.5142 8
1
9
9
1 9
In the craps experiment, select the hardway 4 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? In the craps experiment, select the hardway 10 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? Let W denote the winnings for a unit hardway 6 or hardway 8 bet. Then 1. P(W = −1) = , P(W = 9) = 2. E(W ) = − ≈ −0.0909 3. sd(W ) ≈ 2.8748 10
1
11
11
1
11
In the craps experiment, select the hardway 6 bet. Run the simulation 1000 times and compare the empirical density and moments of W to the true density and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? In the craps experiment, select the hardway 8 bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be? Thus, the hardway 6 and 8 bets are better than the hardway 4 and 10 bets for the gambler, in terms of expected value.
The Distribution of the Number of Rolls Next let us compute the distribution and moments of the number of rolls N in a game of craps. This random variable is of no special interest to the casino or the players, but provides a good mathematically exercise. By definition, if the shooter wins or loses on the first roll, N = 1 . Otherwise, the shooter continues until she either makes her point or rolls 7. In this latter case, we can use the geometric distribution on N which governs the trial number of the first success in a sequence of Bernoulli trials. The distribution of N is a mixture of distributions. +
The probability density function of N is ⎧ P(N = n) = ⎨ ⎩
12 36 1 24
,
(
n =1 3 4
n−2
)
+
5 81
(
13 18
n−2
)
+
55 648
(
25 36
(13.4.5)
n−2
)
,
n ∈ {2, 3, …}
Proof The first few values of the probability density function of N are given in the following table: n
1
2
3
4
5
P(N = n)
0.33333
0.18827
0.13477
0.09657
0.06926
13.4.5
https://stats.libretexts.org/@go/page/10258
Find the probability that a game of craps will last at least 8 rolls. Answer The mean and variance of the number of rolls are 1. E(N ) = 2. var(N ) =
557 165
≈ 3.3758
245 672 27 225
≈ 9.02376
Proof This page titled 13.4: Craps is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.4.6
https://stats.libretexts.org/@go/page/10258
13.5: Roulette The Roulette Wheel According to Richard Epstein, roulette is the oldest casino game still in operation. It's invention has been variously attributed to Blaise Pascal, the Italian mathematician Don Pasquale, and several others. In any event, the roulette wheel was first introduced into Paris in 1765. Here are the characteristics of the wheel: The (American) roulette wheel has 38 slots numbered 00, 0, and 1–36. 1. Slots 0, 00 are green; 2. Slots 1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36 are red; 3. Slots 2, 4, 6, 8, 10, 11, 13, 15, 17, 20, 22, 24, 26, 28, 29, 31, 33, 35 are black. Except for 0 and 00, the slots on the wheel alternate between red and black. The strange order of the numbers on the wheel is intended so that high and low numbers, as well as odd and even numbers, tend to alternate.
Figure 13.5.1 : A typical roulette wheel and table
The roulette experiment is very simple. The wheel is spun and then a small ball is rolled in a groove, in the opposite direction as the motion of the wheel. Eventually the ball falls into one of the slots. Naturally, we assume mathematically that the wheel is fair, so that the random variable X that gives the slot number of the ball is uniformly distributed over the sample space S = {00, 0, 1, … , 36}. Thus, P(X = x) = for each x ∈ S . 1
38
Bets As with craps, roulette is a popular casino game because of the rich variety of bets that can be made. The picture above shows the roulette table and indicates some of the bets we will study. All bets turn out to have the same expected value (negative, of course). However, the variances differ depending on the bet. Although all bets in roulette have the same expected value, the standard deviations vary inversely with the number of numbers selected. What are the implications of this for the gambler?
Straight Bets

A straight bet is a bet on a single number, and pays 35 : 1. Let W denote the winnings on a unit straight bet. Then
1. P(W = −1) = 37/38, P(W = 35) = 1/38
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 5.7626
In the roulette experiment, select the single number bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
Two Number Bets

A 2-number bet (or a split bet) is a bet on two adjacent numbers in the roulette table. The bet pays 17 : 1. Let W denote the winnings on a unit split bet. Then
1. P(W = −1) = 18/19, P(W = 17) = 1/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 4.0193
In the roulette experiment, select the 2 number bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
Three Number Bets

A 3-number bet (or row bet) is a bet on the three numbers in a vertical row on the roulette table. The bet pays 11 : 1. Let W denote the winnings on a unit row bet. Then
1. P(W = −1) = 35/38, P(W = 11) = 3/38
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 3.2359
In the roulette experiment, select the 3-number bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
Four Number Bets

A 4-number bet or a square bet is a bet on the four numbers that form a square on the roulette table. The bet pays 8 : 1. Let W denote the winnings on a unit 4-number bet. Then
1. P(W = −1) = 17/19, P(W = 8) = 2/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 2.7620
In the roulette experiment, select the 4-number bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
Six Number Bets

A 6-number bet or 2-row bet is a bet on the 6 numbers in two adjacent rows of the roulette table. The bet pays 5 : 1. Let W denote the winnings on a unit 6-number bet. Then
1. P(W = −1) = 16/19, P(W = 5) = 3/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 2.1879

In the roulette experiment, select the 6-number bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
Twelve Number Bets

A 12-number bet is a bet on 12 numbers. In particular, a column bet is a bet on any one of the three columns of 12 numbers running horizontally along the table. Other 12-number bets are the first 12 (1-12), the middle 12 (13-24), and the last 12 (25-36). A 12-number bet pays 2 : 1. Let W denote the winnings on a unit 12-number bet. Then
1. P(W = −1) = 13/19, P(W = 2) = 6/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 1.3945
In the roulette experiment, select the 12-number bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?
Eighteen Number Bets

An 18-number bet is a bet on 18 numbers. In particular, a color bet is a bet either on red or on black. A parity bet is a bet on the odd numbers from 1 to 36 or the even numbers from 1 to 36. The low bet is a bet on the numbers 1-18, and the high bet is the bet on the numbers 19-36. An 18-number bet pays 1 : 1. Let W denote the winnings on a unit 18-number bet. Then
1. P(W = −1) = 10/19, P(W = 1) = 9/19
2. E(W) = −1/19 ≈ −0.0526
3. sd(W) ≈ 0.9986
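All of the bets above fit one pattern: an m-number bet wins with probability m/38 and pays (36 − m)/m : 1. The following Python sketch, our own illustration under that assumption, reproduces the expected values and standard deviations listed in this section.

```python
from math import sqrt

def roulette_moments(m):
    """Mean and standard deviation of the winnings W on a unit m-number bet."""
    p_win = m / 38
    payoff = 36 / m - 1          # an m-number bet pays (36 - m)/m : 1
    mean = payoff * p_win - (1 - p_win)
    second = payoff ** 2 * p_win + (1 - p_win)   # E(W^2)
    return mean, sqrt(second - mean ** 2)

for m in (1, 2, 3, 4, 6, 12, 18):
    mean, sd = roulette_moments(m)
    print(f"{m:2d}-number bet: E(W) = {mean:.4f}, sd(W) = {sd:.4f}")
```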
In the roulette experiment, select the 18-number bet. Run the simulation 1000 times and compare the empirical density function and moments of W to the true probability density function and moments. Suppose that you bet $1 on each of the 1000 games. What would your net winnings be?

This page titled 13.5: Roulette is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.6: The Monty Hall Problem

Preliminaries

Statement of the Problem

The Monty Hall problem involves a classical game show situation and is named after Monty Hall, the long-time host of the TV game show Let's Make a Deal. There are three doors labeled 1, 2, and 3. A car is behind one of the doors, while goats are behind the other two:
Figure 13.6.1 : The car and the two goats
The rules are as follows:
1. The player selects a door.
2. The host selects a different door and opens it.
3. The host gives the player the option of switching from her original choice to the remaining closed door.
4. The door finally selected by the player is opened and she either wins or loses.

The Monty Hall problem became the subject of intense controversy because of several articles by Marilyn vos Savant in the Ask Marilyn column of Parade magazine, a popular Sunday newspaper supplement. The controversy began when a reader posed the problem in the following way:
Suppose you're on a game show, and you're given a choice of three doors. Behind one door is a car; behind the others, goats. You pick a door—say No. 1—and the host, who knows what's behind the doors, opens another door—say No. 3—which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

Marilyn's response was that the contestant should switch doors, claiming that there is a 1/3 chance that the car is behind door 1, while there is a 2/3 chance that the car is behind door 2. In two follow-up columns, Marilyn printed a number of responses, some from academics, most of whom claimed in angry or sarcastic tones that she was wrong and that there are equal chances that the car is behind doors 1 or 2. Marilyn stood by her original answer and offered additional, but non-mathematical, arguments.
Think about the problem. Do you agree with Marilyn or with her critics, or do you think that neither solution is correct?

In the Monty Hall game, set the host strategy to standard (the meaning of this strategy will be explained below). Play the Monty Hall game 50 times with each of the following strategies. Do you want to reconsider your answer to the question above?
1. Always switch
2. Never switch

In the Monty Hall game, set the host strategy to blind (the meaning of this strategy will be explained below). Play the Monty Hall game 50 times with each of the following strategies. Do you want to reconsider your answer to the question above?
1. Always switch
2. Never switch
Modeling Assumptions When we begin to think carefully about the Monty Hall problem, we realize that the statement of the problem by Marilyn's reader is so vague that a meaningful discussion is not possible without clarifying assumptions about the strategies of the host and player. Indeed, we will see that misunderstandings about these strategies are the cause of the controversy. Let us try to formulate the problem mathematically. In general, the actions of the host and player can vary from game to game, but if we are to have a random experiment in the classical sense, we must assume that the same probability distributions govern the host and player on each game and that the games are independent. There are four basic random variables for a game: 1. U : the number of the door containing the car. 2. X: the number of the first door selected by the player. 3. V : the number of the door opened by the host. 4. Y : the number of the second door selected by the player.
Each of these random variables has the possible values 1, 2, and 3. However, because of the rules of the game, the door opened by the host cannot be either of the doors selected by the player, so V ≠ X and V ≠ Y . In general, we will allow the possibility V = U , that the host opens the door with the car behind it. Whether this is a reasonable action of the host is a big part of the controversy about this problem. The Monty Hall experiment will be completely defined mathematically once the joint distribution of the basic variables is specified. This joint distribution in turn depends on the strategies of the host and player, which we will consider next.
Strategies

Host Strategies

In the Monty Hall experiment, note that the host determines the probability density function of the door containing the car, namely P(U = i) for i ∈ {1, 2, 3}. The obvious choice for the host is to randomly assign the car to one of the three doors. This leads to the uniform distribution, and unless otherwise noted, we will always assume that U has this distribution. Thus, P(U = i) = 1/3 for i ∈ {1, 2, 3}.
The host also determines the conditional density function of the door he opens, given knowledge of the door containing the car and the first door selected by the player, namely P(V = k ∣ U = i, X = j) for i, j, k ∈ {1, 2, 3}. Recall that since the host cannot open the door chosen by the player, this probability must be 0 for k = j. Thus, the distribution of U and the conditional distribution of V given U and X constitute the host strategy.
The Standard Strategy

In most real game shows, the host would always open a door with a goat behind it. If the player's first choice is incorrect, then the host has no choice; he cannot open the door with the car or the player's choice and must therefore open the only remaining door. On the other hand, if the player's first choice is correct, then the host can open either of the remaining doors, since goats are behind both. Thus, he might naturally pick one of these doors at random. This strategy leads to the following conditional distribution for V given U and X:

$$P(V = k \mid U = i, X = j) = \begin{cases} 1, & i \ne j, \; k \ne i, \; k \ne j \\[2pt] \frac{1}{2}, & i = j, \; k \ne i \\[2pt] 0, & k = i \text{ or } k = j \end{cases} \tag{13.6.1}$$
This distribution, along with the uniform distribution for U , will be referred to as the standard strategy for the host. In the Monty Hall game, set the host strategy to standard. Play the game 50 times with each of the following player strategies. Which works better? 1. Always switch 2. Never switch
The Blind Strategy

Another possible second-stage strategy is for the host to always open a door chosen at random from the two possibilities. Thus, the host might well open the door containing the car. This strategy leads to the following conditional distribution for V given U and X:

$$P(V = k \mid U = i, X = j) = \begin{cases} \frac{1}{2}, & k \ne j \\[2pt] 0, & k = j \end{cases} \tag{13.6.2}$$
This distribution, together with the uniform distribution for U , will be referred to as the blind strategy for the host. The blind strategy seems a bit odd. However, the confusion between the two strategies is the source of the controversy concerning this problem. In the Monty Hall game, set the host strategy to blind. Play the game 50 times with each of the following player strategies. Which works better? 1. Always switch 2. Never switch
Player Strategies

The player, on the other hand, determines the probability density function of her first choice, namely P(X = j) for j ∈ {1, 2, 3}. The obvious first choice for the player is to randomly choose a door, since the player has no knowledge at this point. This leads to the uniform distribution, so P(X = j) = 1/3 for j ∈ {1, 2, 3}.
The player also determines the conditional density function of her second choice, given knowledge of her first choice and the door opened by the host, namely P(Y = l ∣ X = j, V = k) for j, k ∈ {1, 2, 3} with j ≠ k. Recall that since the player cannot choose the door opened by the host, this probability must be 0 for l = k. The distribution of X and the conditional distribution of Y given X and V constitute the player strategy. Suppose that the player switches with probability p ∈ [0, 1]. This leads to the following conditional distribution:

$$P(Y = l \mid X = j, V = k) = \begin{cases} p, & j \ne k, \; j \ne l, \; k \ne l \\[2pt] 1 - p, & j \ne k, \; l = j \\[2pt] 0, & l = k \end{cases} \tag{13.6.3}$$
In particular, if p = 1 , the player always switches, while if p = 0 , the player never switches.
Mathematical Analysis We are almost ready to analyze the Monty Hall problem mathematically. But first we must make some independence assumptions to incorporate the lack of knowledge that the host and player have about each other's actions. First, the player has no knowledge of the door containing the car, so we assume that U and X are independent. Also, the only information about the car door that the player has when she makes her second choice is the information (if any) revealed by her first choice and the host's subsequent selection. Mathematically, this means that Y is conditionally independent of U given X and V .
Distributions

The host and player strategies form the basic data for the Monty Hall problem. Because of the independence assumptions, the joint distribution of the basic random variables is completely determined by these strategies. The joint probability density function of (U, X, V, Y) is given by

$$P(U = i, X = j, V = k, Y = l) = P(U = i) P(X = j) P(V = k \mid U = i, X = j) P(Y = l \mid X = j, V = k), \quad i, j, k, l \in \{1, 2, 3\} \tag{13.6.4}$$
Proof

The probability of any event defined in terms of the Monty Hall problem can be computed by summing the joint density over the appropriate values of (i, j, k, l).

With either of the basic host strategies, V is uniformly distributed on {1, 2, 3}.

Suppose that the player switches with probability p. With either of the basic host strategies, Y is uniformly distributed on {1, 2, 3}.

In the Monty Hall experiment, set the host strategy to standard. For each of the following values of p, run the simulation 1000 times. Based on relative frequency, which strategy works best?
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)

In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 1000 times. Based on relative frequency, which strategy works best?
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)
The Probability of Winning

The event that the player wins a game is {Y = U}. We will compute the probability of this event with the basic host and player strategies.

Suppose that the host follows the standard strategy and that the player switches with probability p. Then the probability that the player wins is

$$P(Y = U) = \frac{1 + p}{3} \tag{13.6.5}$$

In particular, if the player always switches (p = 1), the probability that she wins is 2/3, and if the player never switches (p = 0), the probability that she wins is 1/3.
In the Monty Hall experiment, set the host strategy to standard. For each of the following values of p, run the simulation 1000 times. In each case, compare the relative frequency of winning to the probability of winning.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)

Suppose that the host follows the blind strategy. Then for any player strategy, the probability that the player wins is

$$P(Y = U) = \frac{1}{3} \tag{13.6.6}$$
In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 1000 times. In each case, compare the relative frequency of winning to the probability of winning.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)

For a complete solution of the Monty Hall problem, we want to compute the conditional probability that the player wins, given that the host opens a door with a goat behind it:

$$P(Y = U \mid V \ne U) = \frac{P(Y = U)}{P(V \ne U)} \tag{13.6.7}$$

With the basic host and player strategies, the numerator, the probability of winning, has been computed. Thus we need to consider the denominator, the probability that the host opens a door with a goat. If the host uses the standard strategy, then the conditional probability of winning is the same as the unconditional probability of winning, regardless of the player strategy. In particular, we have the following result:

If the host follows the standard strategy and the player switches with probability p, then

$$P(Y = U \mid V \ne U) = \frac{1 + p}{3} \tag{13.6.8}$$
Proof

Once again, the probability increases from 1/3 when p = 0, so that the player never switches, to 2/3 when p = 1, so that the player always switches.

If the host follows the blind strategy, then for any player strategy, P(V ≠ U) = 2/3 and therefore P(Y = U ∣ V ≠ U) = 1/2.
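The difference between the standard and blind conditional probabilities can also be seen by simulation. The following Python sketch is our own illustration (the function names are assumptions): it plays the game with the always-switch player under either host strategy and estimates the conditional probability of winning given that the host shows a goat.

```python
import random

def play(p_switch, host_blind):
    """Play one game; return (player wins, host showed a goat)."""
    car = random.randrange(3)
    first = random.randrange(3)
    # The host never opens the player's door; under the standard
    # strategy he also never opens the car door.
    options = [d for d in range(3)
               if d != first and (host_blind or d != car)]
    opened = random.choice(options)
    remaining = next(d for d in range(3) if d not in (first, opened))
    final = remaining if random.random() < p_switch else first
    return final == car, opened != car

def cond_win_rate(p_switch, host_blind, n=100_000):
    wins = goats = 0
    for _ in range(n):
        win, goat = play(p_switch, host_blind)
        if goat:
            goats += 1
            wins += win
    return wins / goats

print(cond_win_rate(1.0, host_blind=False))  # standard host: about 2/3
print(cond_win_rate(1.0, host_blind=True))   # blind host: about 1/2
```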
In the Monty Hall experiment, set the host strategy to blind. For each of the following values of p, run the experiment 500 times. In each case, compute the conditional relative frequency of winning, given that the host shows a goat, and compare with the theoretical answer above.
1. p = 0 (never switch)
2. p = 0.3
3. p = 0.5
4. p = 0.7
5. p = 1 (always switch)

The confusion between the conditional probability of winning for these two strategies has been the source of much controversy in the Monty Hall problem. Marilyn was probably thinking of the standard host strategy, while some of her critics were thinking of the blind strategy. This problem points out the importance of careful modeling, of the careful statement of assumptions. Marilyn is correct if the host follows the standard strategy; the critics are correct if the host follows the blind strategy; any number of other answers could be correct if the host follows other strategies.

The mathematical formulation we have used is fairly complete. However, if we just want to solve Marilyn's problem, there is a much simpler analysis (which you may have discovered yourself). Suppose that the host follows the standard strategy, and thus always opens a door with a goat. If the player's first door is incorrect (contains a goat), then the host has no choice and must open the other door with a goat. Then, if the player switches, she wins. On the other hand, if the player's first door is correct and she switches, then of course she loses. Thus, we see that if the player always switches, then she wins if and only if her first choice is incorrect, an event that obviously has probability 2/3. If the player never switches, then she wins if and only if her first choice is correct, an event with probability 1/3.
This page titled 13.6: The Monty Hall Problem is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.7: Lotteries

You realize the odds of winning [the lottery] are the same as being mauled by a polar bear and a regular bear in the same day. —E*TRADE baby, January 2010.

Lotteries are among the simplest and most widely played of all games of chance, and unfortunately for the gambler, among the worst in terms of expected value. Lotteries come in such an incredible number of variations that it is impractical to analyze all of them. So, in this section, we will study some of the more common lottery formats.
Figure 13.7.1 : A lottery ticket issued by the Continental Congress in 1776 to raise money for the American Revolutionary War. Source: Wikipedia
The Basic Lottery

Basic Format

The basic lottery is a random experiment in which the gambling house (in many cases a government agency) selects n numbers at random, without replacement, from the integers from 1 to N. The integer parameters N and n vary from one lottery to another, and of course, n cannot be larger than N. The order in which the numbers are chosen usually does not matter, and thus in this case, the sample space S of the experiment consists of all subsets (combinations) of size n chosen from the population {1, 2, …, N}.

$$S = \{x \subseteq \{1, 2, \ldots, N\} : \#(x) = n\} \tag{13.7.1}$$

Recall that

$$\#(S) = \binom{N}{n} = \frac{N!}{n!(N - n)!} \tag{13.7.2}$$

Naturally, we assume that all such combinations are equally likely, and thus, the chosen combination X, the basic random variable of the experiment, is uniformly distributed on S.

$$P(X = x) = \frac{1}{\binom{N}{n}}, \quad x \in S \tag{13.7.3}$$
The player of the lottery pays a fee and gets to select m numbers, without replacement, from the integers from 1 to N. Again, order does not matter, so the player essentially chooses a combination y of size m from the population {1, 2, …, N}. In many cases m = n, so that the player gets to choose the same number of numbers as the house. In general then, there are three parameters in the basic (N, n, m) lottery. The player's goal, of course, is to maximize the number of matches (often called catches by gamblers) between her combination y and the random combination X chosen by the house. Essentially, the player is trying to guess the outcome of the random experiment before it is run. Thus, let U = #(X ∩ y) denote the number of catches.

The number of catches U in the (N, n, m) lottery has probability density function given by

$$P(U = k) = \frac{\binom{m}{k}\binom{N - m}{n - k}}{\binom{N}{n}}, \quad k \in \{0, 1, \ldots, m\} \tag{13.7.4}$$
The distribution of U is the hypergeometric distribution with parameters N, n, and m, and is studied in detail in the chapter on Finite Sampling Models. In particular, from this section, it follows that the mean and variance of the number of catches U are

$$E(U) = n \frac{m}{N} \tag{13.7.5}$$

$$\operatorname{var}(U) = n \frac{m}{N}\left(1 - \frac{m}{N}\right)\frac{N - n}{N - 1} \tag{13.7.6}$$

Note that P(U = k) = 0 if k > n or k < n + m − N. However, in most lotteries, m ≤ n and N is much larger than n + m. In these common cases, the density function is positive for the values of k given above.
We will refer to the special case where m = n as the (N, n) lottery; this is the case in most state lotteries. In this case, the probability density function of the number of catches U is

$$P(U = k) = \frac{\binom{n}{k}\binom{N - n}{n - k}}{\binom{N}{n}}, \quad k \in \{0, 1, \ldots, n\} \tag{13.7.7}$$

The mean and variance of the number of catches U in this special case are

$$E(U) = \frac{n^2}{N} \tag{13.7.8}$$

$$\operatorname{var}(U) = \frac{n^2 (N - n)^2}{N^2 (N - 1)} \tag{13.7.9}$$
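Formulas (13.7.7)–(13.7.9) are easy to evaluate for the exercises that follow. This Python sketch is our own illustration using math.comb; the function names are assumptions, not from the text.

```python
from math import comb, sqrt

def catch_pdf(N, n):
    """PDF of the number of catches U in the (N, n) lottery, from (13.7.7)."""
    return [comb(n, k) * comb(N - n, n - k) / comb(N, n) for k in range(n + 1)]

def catch_moments(N, n):
    """Mean and standard deviation of U, from (13.7.8) and (13.7.9)."""
    mean = n ** 2 / N
    var = n ** 2 * (N - n) ** 2 / (N ** 2 * (N - 1))
    return mean, sqrt(var)

print([round(p, 5) for p in catch_pdf(47, 5)])  # the (47, 5) lottery
print(catch_moments(47, 5))
print(catch_moments(49, 5))
```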
Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (47, 5) lottery.

Answer

Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (49, 5) lottery.

Answer

Explicitly give the probability density function, mean, and standard deviation of the number of catches in the (47, 7) lottery.

Answer

The analysis above was based on the assumption that the player's combination y is selected deterministically. Would it matter if the player chose the combination in a random way? Thus, suppose that the player's selected combination Y is a random variable taking values in S. (For example, in many lotteries, players can buy tickets with combinations randomly selected by a computer; this is typically known as Quick Pick.) Clearly, X and Y must be independent, since the player (and her randomizing device) can have no knowledge of the winning combination X. As you might guess, such randomization makes no difference.

Let U denote the number of catches in the (N, n, m) lottery when the player's combination Y is a random variable, independent of the winning combination X. Then U has the same distribution as in the deterministic case above.

Proof

There are many websites that publish data on the frequency of occurrence of numbers in various state lotteries. Some gamblers evidently feel that some numbers are luckier than others. Given the assumptions and analysis above, do you believe that some numbers are luckier than others? Does it make any mathematical sense to study historical data for a lottery?

The prize money in most state lotteries depends on the sales of the lottery tickets. Typically, about 50% of the sales money is returned as prize money; the rest goes for administrative costs and profit for the state. The total prize money is divided among the winning tickets, and the prize for a given ticket depends on the number of catches U. For all of these reasons, it is impossible to give a simple mathematical analysis of the expected value of playing a given state lottery. Note however, that since the state keeps a fixed percentage of the sales, there is essentially no risk for the state.

From a pure gambling point of view, state lotteries are bad games. In most casino games, by comparison, 90% or more of the money that comes in is returned to the players as prize money. Of course, state lotteries should be viewed as a form of voluntary
taxation, not simply as games. The profits from lotteries are typically used for education, health care, and other essential services. A discussion of the value and costs of lotteries from a political and social point of view (as opposed to a mathematical one) is beyond the scope of this project.
Bonus Numbers

Many state lotteries now augment the basic (N, n) format with a bonus number. The bonus number T is selected from a specified set of integers, in addition to the combination X, selected as before. The player likewise picks a bonus number s, in addition to a combination y. The player's prize then depends on the number of catches U between X and y, as before, and in addition on whether the player's bonus number s matches the random bonus number T chosen by the house. We will let I denote the indicator variable of this latter event. Thus, our interest now is in the joint distribution of (I, U).

In one common format, the bonus number T is selected at random from the set of integers {1, 2, …, M}, independently of the combination X of size n chosen from {1, 2, …, N}. Usually M < N. Note that with this format, the game is essentially two independent lotteries, one in the (N, n) format and the other in the (M, 1) format.

Explicitly compute the joint probability density function of (I, U) for the (47, 5) lottery with independent bonus number from 1 to 27. This format is used in the California lottery, among others.

Answer

Explicitly compute the joint probability density function of (I, U) for the (49, 5) lottery with independent bonus number from 1 to 42. This format is used in the Powerball lottery, among others.

Answer

In another format, the bonus number T is chosen from 1 to N, and is distinct from the numbers in the combination X. To model this game, we assume that T is uniformly distributed on {1, 2, …, N}, and given T = t, X is uniformly distributed on the set of combinations of size n chosen from {1, 2, …, N} ∖ {t}. For this format, the joint probability density function is harder to compute.

The probability density function of (I, U) is given by
$$P(I = 1, U = k) = \frac{\binom{n}{k}\binom{N - 1 - n}{n - k}}{N \binom{N - 1}{n}}, \quad k \in \{0, 1, \ldots, n\} \tag{13.7.11}$$

$$P(I = 0, U = k) = (N - n - 1)\frac{\binom{n}{k}\binom{N - 1 - n}{n - k}}{N \binom{N - 1}{n}} + n \frac{\binom{n - 1}{k}\binom{N - n}{n - k}}{N \binom{N - 1}{n}}, \quad k \in \{0, 1, \ldots, n\} \tag{13.7.12}$$
Proof

Explicitly compute the joint probability density function of (I, U) for the (47, 7) lottery with bonus number chosen as described above. This format is used in the Super 7 Canada lottery, among others.
Keno

Keno is a lottery game played in casinos. For a fixed N (usually 80) and n (usually 20), the player can play a range of basic (N, n, m) games, as described in the first subsection. Typically, m ranges from 1 to 15, and the payoff depends on m and the number of catches U. In this section, you will compute the density function, mean, and standard deviation of the random payoff, based on a unit bet, for a typical keno game with N = 80, n = 20, and m ∈ {1, 2, …, 15}. The payoff tables are based on the keno game at the Tropicana casino in Atlantic City, New Jersey.

Recall that the probability density function of the number of catches U above is given by

$$P(U = k) = \frac{\binom{m}{k}\binom{80 - m}{20 - k}}{\binom{80}{20}}, \quad k \in \{0, 1, \ldots, m\} \tag{13.7.13}$$
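All of the exercises below have the same structure, so a single helper makes the computations routine. The following Python sketch is our own illustration (the function name is an assumption); for the Pick 1 table below it gives mean 0.75.

```python
from math import comb, sqrt

def keno_moments(payoffs):
    """Mean and sd of the payoff for a pick-m keno bet.

    payoffs[k] is the payoff for k catches, k = 0, 1, ..., m.
    """
    m = len(payoffs) - 1
    probs = [comb(m, k) * comb(80 - m, 20 - k) / comb(80, 20)
             for k in range(m + 1)]
    mean = sum(w * p for w, p in zip(payoffs, probs))
    second = sum(w ** 2 * p for w, p in zip(payoffs, probs))
    return mean, sqrt(second - mean ** 2)

print(keno_moments([0, 3]))         # Pick 1
print(keno_moments([0, 0, 12]))     # Pick 2
print(keno_moments([0, 0, 1, 43]))  # Pick 3
```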
The payoff table for Pick m = 1 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1
Payoff   0  3
Answer

The payoff table for Pick m = 2 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2
Payoff   0  0  12
Answer

The payoff table for Pick m = 3 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3
Payoff   0  0  1  43
Answer

The payoff table for Pick m = 4 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4
Payoff   0  0  1  3  130
Answer

The payoff table for Pick m = 5 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4   5
Payoff   0  0  0  1  10  800
Answer

The payoff table for Pick m = 6 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5   6
Payoff   0  0  0  1  4  95  1500
Answer

The payoff table for Pick m = 7 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5   6    7
Payoff   0  0  0  0  1  25  350  8000
Answer

The payoff table for Pick m = 8 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5  6   7     8
Payoff   0  0  0  0  0  9  90  1500  25,000
Answer

The payoff table for Pick m = 9 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5  6   7    8     9
Payoff   0  0  0  0  0  4  50  280  4000  50,000
Answer

The payoff table for Pick m = 10 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5  6   7    8     9     10
Payoff   0  0  0  0  0  1  22  150  1000  5000  100,000
Answer

The payoff table for Pick m = 11 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5  6  7   8    9     10      11
Payoff   0  0  0  0  0  0  8  80  400  2500  25,000  100,000
Answer

The payoff table for Pick m = 12 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5  6  7   8    9     10    11      12
Payoff   0  0  0  0  0  0  5  32  200  1000  5000  25,000  100,000
Answer

The payoff table for Pick m = 13 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5  6  7   8   9    10    11      12      13
Payoff   1  0  0  0  0  0  1  20  80  600  3500  10,000  50,000  100,000
Proof
The payoff table for Pick m = 14 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5  6  7  8   9    10    11    12      13      14
Payoff   1  0  0  0  0  0  1  9  42  310  1100  8000  25,000  50,000  100,000
Answer

The payoff table for Pick m = 15 is given below. Compute the probability density function, mean, and standard deviation of the payoff.

Catches  0  1  2  3  4  5  6  7   8   9    10   11    12      13      14       15
Payoff   1  0  0  0  0  0  0  10  25  100  300  2800  25,000  50,000  100,000  100,000
Answer

In the exercises above, you should have noticed that the expected payoff on a unit bet varies from about 0.71 to 0.75, so the expected profit (for the gambler) varies from about −0.25 to −0.29. This is quite bad for the gambler playing a casino game, but as always, the lure of a very high payoff on a small bet for an extremely rare event overrides the expected value analysis for most players.

With m = 15, show that the top 4 prizes (25,000, 50,000, 100,000, 100,000) contribute only about 0.017 (less than 2 cents) to the total expected value of about 0.714.

On the other hand, the standard deviation of the payoff varies quite a bit, from about 1 to about 55. Although the game is highly unfavorable for each m, with expected value that is nearly constant, which do you think is better for the gambler: a format with high standard deviation or one with low standard deviation?

This page titled 13.7: Lotteries is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.8: The Red and Black Game

In this section and the following three sections, we will study gambling strategies for one of the simplest gambling models. Yet in spite of the simplicity of the model, the mathematical analysis leads to some beautiful and sometimes surprising results that have importance and application well beyond gambling. Our exposition is based primarily on the classic book Inequalities for Stochastic Processes (How to Gamble if You Must) by Lester E. Dubins and Leonard J. Savage (1965).
Basic Theory

Assumptions

Here is the basic situation: The gambler starts with an initial sum of money. She bets on independent, probabilistically identical games, each with two outcomes: win or lose. If she wins a game, she receives the amount of the bet on that game; if she loses a game, she must pay the amount of the bet. Thus, the gambler plays at even stakes. This particular situation (IID games and even stakes) is known as red and black, and is named for the color bets in the casino game roulette. Other examples are the pass and don't pass bets in craps.

Let us try to formulate the gambling experiment mathematically. First, let I_n denote the outcome of the nth game for n ∈ ℕ₊, where 1 denotes a win and 0 denotes a loss. These are independent indicator random variables with the same distribution:

$$P(I_n = 1) = p, \quad P(I_n = 0) = q = 1 - p, \quad n \in \mathbb{N}_+ \tag{13.8.1}$$

where p ∈ [0, 1] is the probability of winning an individual game. Thus, I = (I_1, I_2, …) is a sequence of Bernoulli trials.
If p = 0, then the gambler always loses and if p = 1 then the gambler always wins. These trivial cases are not interesting, so we will usually assume that 0 < p < 1. In real gambling houses, of course, p < 1/2 (that is, the games are unfair to the player), so we will be particularly interested in this case.
Random Processes

The gambler's fortune over time is the basic random process of interest: Let X_0 denote the gambler's initial fortune and X_i the gambler's fortune after i games. The gambler's strategy consists of the decisions of how much to bet on the various games and when to quit. Let Y_i denote the amount of the ith bet, and let N denote the number of games played by the gambler. If we want to, we can always assume that the games go on forever, but with the assumption that the gambler bets 0 on all games after N. With this understanding, the game outcome, fortune, and bet processes are defined for all times i ∈ ℕ₊.

The fortune process is related to the wager process as follows:

$$X_j = X_{j-1} + (2 I_j - 1) Y_j, \quad j \in \mathbb{N}_+ \tag{13.8.2}$$
Strategies

The gambler's strategy can be very complicated. For example, the random variable Y_n, the gambler's bet on game n, or the event N = n − 1, her decision to stop after n − 1 games, could be based on the entire past history of the game, up to time n. Technically, this history forms a σ-algebra:

$$\mathscr{H}_n = \sigma\{X_0, Y_1, I_1, Y_2, I_2, \ldots, Y_{n-1}, I_{n-1}\} \tag{13.8.3}$$

Moreover, the bets could have additional sources of randomness. For example, a gambler playing roulette could partly base her bets on the roll of a lucky die that she keeps in her pocket. However, the gambler cannot see into the future (unfortunately from her point of view), so we can at least assume that Y_n and {N = n − 1} are independent of (I_n, I_{n+1}, …).
At least in terms of expected value, any gambling strategy is futile if the games are unfair.

$$E(X_i) = E(X_{i-1}) + (2p - 1)E(Y_i), \quad i \in \mathbb{N}_+$$
Proof

Suppose that the gambler has a positive probability of making a real bet on game i, so that E(Y_i) > 0. Then
1. E(X_i) < E(X_{i−1}) if p < 1/2
2. E(X_i) > E(X_{i−1}) if p > 1/2
3. E(X_i) = E(X_{i−1}) if p = 1/2
Proof

Thus on any game in which the gambler makes a positive bet, her expected fortune strictly decreases if the games are unfair, remains the same if the games are fair, and strictly increases if the games are favorable.

As we noted earlier, a general strategy can depend on the past history and can be randomized. However, since the underlying Bernoulli games are independent, one might guess that these complicated strategies are no better than simple strategies in which the amount of the bet and the decision to stop are based only on the gambler's current fortune. These simple strategies do indeed play a fundamental role and are referred to as stationary, deterministic strategies. Such a strategy can be described by a betting function S from the space of fortunes to the space of allowable bets, so that S(x) is the amount that the gambler bets when her current fortune is x.
The Stopping Rule

From now on, we will assume that the gambler's stopping rule is a very simple and standard one: she will bet on the games until she either loses her entire fortune and is ruined or reaches a fixed target fortune a:

$$N = \min\{n \in \mathbb{N} : X_n = 0 \text{ or } X_n = a\} \tag{13.8.4}$$

Thus, any strategy (betting function) S must satisfy S(x) ≤ min{x, a − x} for 0 ≤ x ≤ a: the gambler cannot bet what she does not have, and will not bet more than is necessary to reach the target a. If we want to, we can think of the difference between the target fortune and the initial fortune as the entire fortune of the house. With this interpretation, the player and the house play symmetric roles, but with complementary win probabilities: play continues until either the player is ruined or the house is ruined. Our main interest is in the final fortune X_N of the gambler. Note that this random variable takes just two values: 0 and a.

The mean and variance of the final fortune are given by
1. E(X_N) = a P(X_N = a)
2. var(X_N) = a² P(X_N = a)[1 − P(X_N = a)]
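Before turning to specific strategies, it may help to see the model in code. The following Python sketch is our own illustration (the names are assumptions): it simulates play under an arbitrary betting function until the stopping time N, and estimates P(X_N = a) and E(N).

```python
import random

def simulate(x, a, p, bet, runs=10_000):
    """Estimate P(reach target a) and E(N) for a betting function bet(x, a)."""
    wins = games = 0
    for _ in range(runs):
        fortune = x
        while 0 < fortune < a:
            y = bet(fortune, a)
            fortune += y if random.random() < p else -y
            games += 1
        wins += (fortune == a)
    return wins / runs, games / runs

timid = lambda x, a: 1                # bet 1 on each game
bold = lambda x, a: min(x, a - x)     # bet as much as is useful

print(simulate(8, 16, 0.45, timid))   # (win frequency, mean games)
print(simulate(8, 16, 0.45, bold))
```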
Presumably, the gambler would like to maximize the probability of reaching the target fortune. Is it better to bet small amounts or large amounts, or does it not matter? How does the optimal strategy, if there is one, depend on the initial fortune, the target fortune, and the game win probability?

We are also interested in E(N), the expected number of games played. Perhaps a secondary goal of the gambler is to maximize the expected number of games that she gets to play. Are the two goals compatible or incompatible? That is, can the gambler maximize both her probability of reaching the target and the expected number of games played, or does maximizing one quantity necessarily mean minimizing the other?

In the next two sections, we will analyze and compare two strategies that are in a sense opposites:

Timid Play: On each game, until she stops, the gambler makes a small constant bet, say $1.
Bold Play: On each game, until she stops, the gambler bets either her entire fortune or the amount needed to reach the target fortune, whichever is smaller.

In the final section of the chapter, we will return to the question of optimal strategies.

This page titled 13.8: The Red and Black Game is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.9: Timid Play

Basic Theory

Recall that with the strategy of timid play in red and black, the gambler makes a small constant bet, say $1, on each game until she stops. Thus, on each game, the gambler's fortune either increases by 1 or decreases by 1, until the fortune reaches either 0 or the target a (which we assume is a positive integer). Thus, the fortune process (X_0, X_1, …) is a random walk on the fortune space {0, 1, …, a} with 0 and a as absorbing barriers.
As usual, we are interested in the probability of winning and the expected number of games. The key idea in the analysis is that after each game, the fortune process simply starts over again, but with a different initial value. This is an example of the Markov property, named for Andrei Markov. A separate chapter on Markov Chains explores these random processes in more detail. In particular, this chapter has sections on Birth-Death Chains and Random Walks on Graphs, particular classes of Markov chains that generalize the random processes that we are studying here.
Figure 13.9.1 : The transition graph for timid play
The Probability of Winning

Our analysis based on the Markov property suggests that we treat the initial fortune as a variable. Thus, we will denote the probability that the gambler reaches the target a, starting with an initial fortune x, by

$$f(x) = P(X_N = a \mid X_0 = x), \quad x \in \{0, 1, \ldots, a\} \tag{13.9.1}$$
The function f satisfies the following difference equation and boundary conditions:
1. f(x) = q f(x − 1) + p f(x + 1) for x ∈ {1, 2, …, a − 1}
2. f(0) = 0, f(a) = 1

Proof

The difference equation is linear (in the unknown function f), homogeneous (because each term involves the unknown function f), and second order (because 2 is the difference between the largest and smallest fortunes in the equation). Recall that linear homogeneous difference equations can be solved by finding the roots of the characteristic equation. The characteristic equation of the difference equation is p r² − r + q = 0, and the roots are r = 1 and r = q/p.

If p ≠ 1/2, then the roots are distinct. In this case, the probability that the gambler reaches her target is

$$f(x) = \frac{(q/p)^x - 1}{(q/p)^a - 1}, \quad x \in \{0, 1, \ldots, a\} \tag{13.9.2}$$

If p = 1/2, the characteristic equation has a single root 1 that has multiplicity 2. In this case, the probability that the gambler reaches her target is simply the ratio of the initial fortune to the target fortune:

$$f(x) = \frac{x}{a}, \quad x \in \{0, 1, \ldots, a\} \tag{13.9.3}$$

Thus, we have the distribution of the final fortune X_N in either case:

$$P(X_N = 0 \mid X_0 = x) = 1 - f(x), \quad P(X_N = a \mid X_0 = x) = f(x); \quad x \in \{0, 1, \ldots, a\} \tag{13.9.4}$$
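A quick way to explore these formulas is to code them directly. This Python sketch is our own illustration of (13.9.2) and (13.9.3); the function name is an assumption.

```python
def win_prob(x, a, p):
    """P(X_N = a | X_0 = x) for timid play, from (13.9.2) and (13.9.3)."""
    if p == 0.5:
        return x / a
    r = (1 - p) / p                   # r = q/p
    return (r ** x - 1) / (r ** a - 1)

print(win_prob(8, 16, 0.45))   # unfair game: noticeably less than 1/2
print(win_prob(8, 16, 0.5))    # fair game: exactly 1/2
```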
In the red and black experiment, choose Timid Play. Vary the initial fortune, target fortune, and game win probability and note how the probability of winning the game changes. For various values of the parameters, run the experiment 1000 times and compare the relative frequency of winning a game to the probability of winning a game.
As a function of x, for fixed p and a,
1. f is increasing from 0 to a.
2. f is concave upward if p < 1/2 and concave downward if p > 1/2. Of course, f is linear if p = 1/2.

f is continuous as a function of p, for fixed x and a.
Proof For fixed x and a , f (x) increases from 0 to 1 as p increases from 0 to 1.
Figure 13.9.2 : The graph of f for p = 0.4 , p = 0.5 , and p = 0.6
The Expected Number of Trials Now let us consider the expected number of games needed with timid play, when the initial fortune is x: g(x) = E(N ∣ X0 = x),
x ∈ {0, 1, … , a}
(13.9.5)
The function g satisfies the following difference equation and boundary conditions:
1. g(x) = q g(x − 1) + p g(x + 1) + 1 for x ∈ {1, 2, …, a − 1}
2. g(0) = 0, g(a) = 0

Proof

The difference equation in the last exercise is linear, second order, but non-homogeneous (because of the constant term 1 on the right side). The corresponding homogeneous equation is the equation satisfied by the win probability function f. Thus, only a little additional work is needed to solve the non-homogeneous equation.

If p ≠ 1/2, then

$$g(x) = \frac{x}{q - p} - \frac{a}{q - p} f(x), \quad x \in \{0, 1, \ldots, a\} \tag{13.9.6}$$

where f is the win probability function above. If p = 1/2, then

$$g(x) = x(a - x), \quad x \in \{0, 1, \ldots, a\} \tag{13.9.7}$$
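The examples mentioned below (99 expected games starting at 1, and 2500 starting at 50, with p = 1/2 and target 100) can be reproduced with a direct Python transcription of (13.9.6) and (13.9.7); this sketch is our own illustration.

```python
def expected_games(x, a, p):
    """E(N | X_0 = x) for timid play, from (13.9.6) and (13.9.7)."""
    if p == 0.5:
        return x * (a - x)
    q = 1 - p
    r = q / p
    f = (r ** x - 1) / (r ** a - 1)   # win probability (13.9.2)
    return x / (q - p) - (a / (q - p)) * f

print(expected_games(1, 100, 0.5))    # 99
print(expected_games(50, 100, 0.5))   # 2500
print(expected_games(8, 16, 0.45))
```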
Consider g as a function of the initial fortune x, for fixed values of the game win probability p and the target fortune a . 1. g at first increases and then decreases. 2. g is concave downward.
When p = 1/2, the maximum value of g is a²/4 and occurs when x = a/2. When p ≠ 1/2, the value of x where the maximum occurs is rather complicated.

Figure 13.9.3: The graph of g for p = 0.4, p = 0.5, and p = 0.6

g is continuous as a function of p, for fixed x and a.
Proof

For many parameter settings, the expected number of games is surprisingly large. For example, suppose that p = 1/2 and the target fortune is 100. If the gambler's initial fortune is 1, then the expected number of games is 99, even though half of the time, the gambler will be ruined on the first game. If the initial fortune is 50, the expected number of games is 2500.
In the red and black experiment, select Timid Play. Vary the initial fortune, the target fortune and the game win probability and notice how the expected number of games changes. For various values of the parameters, run the experiment 1000 times and compare the sample mean number of games to the expected value.
Increasing the Bet

What happens if the gambler makes constant bets, but with an amount higher than 1? The answer to this question may give insight into what will happen with bold play.

In the red and black game, set the target fortune to 16, the initial fortune to 8, and the win probability to 0.45. Play 10 games with each of the following strategies. Which seems to work best?
1. Bet 1 on each game (timid play).
2. Bet 2 on each game.
3. Bet 4 on each game.
4. Bet 8 on each game (bold play).

We will need to embellish our notation to indicate the dependence on the target fortune. Let

$$f(x, a) = P(X_N = a \mid X_0 = x), \quad x \in \{0, 1, \ldots, a\}, \; a \in \mathbb{N}_+ \tag{13.9.8}$$
Now fix p and suppose that the target fortune is 2a and the initial fortune is 2x. If the gambler plays timidly (betting $1 each time), then of course, her probability of reaching the target is f(2x, 2a). On the other hand:

Suppose that the gambler bets $2 on each game. The fortune process (X_i/2 : i ∈ ℕ) corresponds to timid play with initial fortune x and target fortune a, and therefore the probability that the gambler reaches the target is f(x, a).
Thus, we need to compare the probabilities f(2x, 2a) and f(x, a).

The win probability functions are related as follows:

$$f(2x, 2a) = f(x, a) \frac{(q/p)^x + 1}{(q/p)^a + 1}, \quad x \in \{0, 1, \ldots, a\} \tag{13.9.9}$$
In particular,
1. f(2x, 2a) < f(x, a) if p < 1/2
2. f(2x, 2a) = f(x, a) if p = 1/2
3. f(2x, 2a) > f(x, a) if p > 1/2
Thus, it appears that increasing the bets is a good idea if the games are unfair, a bad idea if the games are favorable, and makes no difference if the games are fair.

What about the expected number of games played? It seems almost obvious that if the bets are increased, the expected number of games played should decrease, but a direct analysis using the expected value function above is harder than one might hope (try it!). We will use a different method, one that actually gives better results. Specifically, we will have the $1 and $2 gamblers bet on the same underlying sequence of games, so that the two fortune processes are defined on the same sample space. Then we can compare the actual random variables (the number of games played), which in turn leads to a comparison of their expected values. Recall that this general method is referred to as coupling.

Let X_n denote the fortune after n games for the gambler making $1 bets (simple timid play). Then 2X_n − X_0 is the fortune after n games for the gambler making $2 bets (with the same initial fortune, betting on the same sequence of games). Assume again that the initial fortune is 2x and the target fortune is 2a where 0 < x < a. Let N_1 denote the number of games played by the $1 gambler, and N_2 the number of games played by the $2 gambler. Then
1. If the $1 gambler falls to fortune x, the $2 gambler is ruined (fortune 0).
2. If the $1 gambler hits fortune x + a, the $2 gambler reaches the target 2a.
3. The $1 gambler must hit x before hitting 0, and must hit x + a before hitting 2a.
4. N_2 < N_1 given X_0 = 2x.
5. E(N_2 ∣ X_0 = 2x) < E(N_1 ∣ X_0 = 2x)
1
Generalize the analysis in this subsection to compare timid play with the strategy of betting $k on each game (let the initial fortune be kx and the target fortune ka . It appears that with unfair games, the larger the bets the better, at least in terms of the probability of reaching the target. Thus, we are naturally led to consider bold play. This page titled 13.9: Timid Play is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
13.9.4
https://stats.libretexts.org/@go/page/10263
13.10: Bold Play

Basic Theory

Preliminaries

Recall that with the strategy of bold play in red and black, the gambler on each game bets either her entire fortune or the amount needed to reach the target fortune, whichever is smaller. As usual, we are interested in the probability that the player reaches the target and the expected number of trials. The first interesting fact is that only the ratio of the initial fortune to the target fortune matters, quite in contrast to timid play.

Suppose that the gambler plays boldly with initial fortune x and target fortune a. As usual, let X = (X_0, X_1, …) denote the fortune process for the gambler. For any c > 0, the random process cX = (cX_0, cX_1, …) is the fortune process for bold play with initial fortune cx and target fortune ca.
Because of this result, it is convenient to use the target fortune as the monetary unit and to allow irrational, as well as rational, initial fortunes. Thus, the fortune space is [0, 1]. Sometimes in our analysis we will ignore the states 0 or 1; clearly there is no harm in this because in these states, the game is over. Recall that the betting function S is the function that gives the amount bet as a function of the current fortune. For bold play, the betting function is

$$S(x) = \min\{x, 1 - x\} = \begin{cases} x, & 0 \le x \le \frac{1}{2} \\[2pt] 1 - x, & \frac{1}{2} \le x \le 1 \end{cases} \tag{13.10.1}$$
Figure 13.10.1: The betting function for bold play
The Probability of Winning

We will denote the probability that the bold gambler reaches the target a = 1 starting from the initial fortune x ∈ [0, 1] by F(x). By the scaling property, the probability that the bold gambler reaches some other target value a > 0, starting from x ∈ [0, a], is F(x/a).

The function F satisfies the following functional equation and boundary conditions:
1. F(x) = p F(2x) for 0 ≤ x ≤ 1/2, and F(x) = p + q F(2x − 1) for 1/2 ≤ x ≤ 1
2. F(0) = 0, F(1) = 1

From the previous result, and a little thought, it should be clear that an important role is played by the following function: Let d be the function defined on [0, 1) by

$$d(x) = 2x - \lfloor 2x \rfloor = \begin{cases} 2x, & 0 \le x < \frac{1}{2} \\[2pt] 2x - 1, & \frac{1}{2} \le x < 1 \end{cases} \tag{13.10.2}$$
Recall that x ∈ (0, 1) is a binary rational if x = k/2^n for some n ∈ ℕ₊ and some odd integer k; the smallest such n is the rank of x. Rank can be extended to all numbers in [0, 1) by defining the rank of 0 to be 0 (0 is also considered a binary rational) and by defining the rank of a binary irrational to be ∞. We will denote the rank of x by r(x).
Applied to binary sequences, the doubling function d is the shift operator: For x ∈ [0, 1), [d(x)]_i = x_{i+1}.
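The functional equation for F can be turned directly into a recursive computation; for a binary rational initial fortune the recursion terminates after finitely many doublings. This Python sketch is our own illustration, not part of the text.

```python
def bold_win_prob(x, p):
    """F(x) for bold play, from the functional equation above.

    For binary rational x the doubling map reaches 0 or 1 after
    finitely many steps, so the recursion terminates.
    """
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    if x <= 0.5:
        return p * bold_win_prob(2 * x, p)
    return p + (1 - p) * bold_win_prob(2 * x - 1, p)

print(bold_win_prob(0.5, 0.45))    # one bold bet: p = 0.45
print(bold_win_prob(0.25, 0.45))   # p^2 = 0.2025
print(bold_win_prob(0.75, 0.45))   # p + qp = 0.6975
```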
Bold play in red and black can be elegantly described by comparing the bits of the initial fortune with the game bits. Suppose that the gambler starts with initial fortune x ∈ (0, 1). The gambler eventually reaches the target 1 if and only if there exists a positive integer k such that I_j = 1 − x_j for j ∈ {1, 2, …, k − 1} and I_k = x_k. That is, the gambler wins if and only if, when the game bit agrees with the corresponding fortune bit for the first time, that bit is 1.
1 − Ij
W =∑ j=1
(13.10.4)
j
2
Note that W is a well defined random variable taking values in [0, 1]. Suppose that the gambler starts with initial fortune x ∈ (0, 1). Then the gambler reaches the target 1 if and only if W
13.10.2
0
, the
satisfies the following functional equation and boundary conditions:
1. G(x) = {
1 + pG(2x), 1 + qG(2x − 1),
0 s is the same as the distribution of X for every s ∈ [0, ∞) . Equivalently, P(X > t + s ∣ X > s) = P(X > t),
s, t ∈ [0, ∞)
X −s
given
(14.2.1)
The memoryless property determines the distribution of X up to a positive parameter, as we will see now.
Distribution functions Suppose that X takes values in [0, ∞) and satisfies the memoryless property. X
has a continuous distribution and there exists r ∈ (0, ∞) such that the distribution function F of X is F (t) = 1 − e
−r t
,
t ∈ [0, ∞)
(14.2.2)
Proof

The probability density function of X is

$$f(t) = r e^{-r t}, \quad t \in [0, \infty) \tag{14.2.7}$$

1. f is decreasing on [0, ∞).
2. f is concave upward on [0, ∞).
3. f(t) → 0 as t → ∞.

Proof

A random variable with the distribution function above, or equivalently the probability density function in the last theorem, is said to have the exponential distribution with rate parameter r. The reciprocal 1/r is known as the scale parameter (as will be justified below). Note that the mode of the distribution is 0, regardless of the parameter r, not very helpful as a measure of center.
In the gamma experiment, set n = 1 so that the simulated random variable has an exponential distribution. Vary r with the scroll bar and watch how the shape of the probability density function changes. For selected values of r, run the experiment 1000 times and compare the empirical density function to the probability density function.

The quantile function of X is

$$F^{-1}(p) = \frac{-\ln(1 - p)}{r}, \quad p \in [0, 1) \tag{14.2.8}$$
1. The median of X is (1/r) ln(2) ≈ 0.6931/r
2. The first quartile of X is (1/r)[ln(4) − ln(3)] ≈ 0.2877/r
3. The third quartile of X is (1/r) ln(4) ≈ 1.3863/r
4. The interquartile range is (1/r) ln(3) ≈ 1.0986/r
Proof
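The quantile function (14.2.8) also gives a standard way to simulate the distribution (inverse-transform sampling). This Python sketch is our own illustration with an arbitrarily chosen rate.

```python
import math
import random

r = 0.5   # rate parameter, chosen arbitrarily for illustration

def quantile(p):
    """F^{-1}(p) from (14.2.8)."""
    return -math.log(1 - p) / r

print([quantile(p) for p in (0.25, 0.5, 0.75)])   # quartiles

# Inverse-transform sampling: quantile(U) is exponential with rate r
sample = sorted(quantile(random.random()) for _ in range(100_001))
print(sample[len(sample) // 2], math.log(2) / r)  # empirical vs true median
```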
In the special distribution calculator, select the exponential distribution. Vary the scale parameter (which is 1/r) and note the shape of the distribution/quantile function. For selected values of the parameter, compute a few values of the distribution function and the quantile function.

Returning to the Poisson model, we have our first formal definition: A process of random points in time is a Poisson process with rate r ∈ (0, ∞) if and only if the interarrival times are independent, and each has the exponential distribution with rate r.
Constant Failure Rate

Suppose now that X has a continuous distribution on [0, ∞) and is interpreted as the lifetime of a device. If F denotes the distribution function of X, then F^c = 1 − F is the reliability function of X. If f denotes the probability density function of X then the failure rate function h is given by

$$h(t) = \frac{f(t)}{F^c(t)}, \quad t \in [0, \infty) \tag{14.2.9}$$

If X has the exponential distribution with rate r > 0, then from the results above, the reliability function is F^c(t) = e^{−rt} and the probability density function is f(t) = r e^{−rt}, so trivially X has constant failure rate r. The converse is also true.
If X has constant failure rate r > 0 then X has the exponential distribution with parameter r. Proof The memoryless and constant failure rate properties are the most famous characterizations of the exponential distribution, but are by no means the only ones. Indeed, entire books have been written on characterizations of this distribution.
Moments

Suppose again that X has the exponential distribution with rate parameter r > 0. Naturally, we want to know the mean, variance, and various other moments of X.

If n ∈ ℕ then E(X^n) = n!/r^n.

Proof

More generally, E(X^a) = Γ(a + 1)/r^a for every a ∈ [0, ∞), where Γ is the gamma function.

In particular,
1. E(X) = 1/r
2. var(X) = 1/r²
3. skew(X) = 2
4. kurt(X) = 9
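These moment formulas are easy to confirm by simulation. The following Python sketch is our own illustration; the rate value is arbitrary.

```python
import math
import random
import statistics

r = 2.0
xs = [random.expovariate(r) for _ in range(200_000)]

print(statistics.mean(xs), 1 / r)              # mean is 1/r
print(statistics.stdev(xs), 1 / r)             # sd is also 1/r
print(statistics.mean(x ** 3 for x in xs),     # E(X^3) = 3!/r^3
      math.factorial(3) / r ** 3)
```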
In the context of the Poisson process, the parameter r is known as the rate of the process. On average, there are 1/r time units between arrivals, so the arrivals come at an average rate of r per unit time. The Poisson process is completely determined by the sequence of inter-arrival times, and hence is completely determined by the rate r. Note also that the mean and standard deviation are equal for an exponential distribution, and that the median is always smaller than the mean. Recall also that skewness and kurtosis are standardized measures, and so do not depend on the parameter r (which is the reciprocal of the scale parameter).

The moment generating function of X is

$$M(s) = E(e^{sX}) = \frac{r}{r - s}, \quad s \in (-\infty, r) \tag{14.2.12}$$
Proof
14.2.2
https://stats.libretexts.org/@go/page/10267
In the gamma experiment, set n = 1 so that the simulated random variable has an exponential distribution. Vary r with the scroll bar and watch how the mean±standard deviation bar changes. For various values of r, run the experiment 1000 times and compare the empirical mean and standard deviation to the distribution mean and standard deviation, respectively.
Additional Properties The exponential distribution has a number of interesting and important mathematical properties. First, and not surprisingly, it's a member of the general exponential family. Suppose that X has the exponential distribution with rate parameter r ∈ (0, ∞). Then exponential distribution, with natural parameter −r and natural statistic X.
X
has a one parameter general
Proof
The Scaling Property As suggested earlier, the exponential distribution is a scale family, and 1/r is the scale parameter. Suppose that X has the exponential distribution with rate parameter distribution with rate parameter r/c.
r >0
and that
c >0
. Then
cX
has the exponential
Proof Recall that multiplying a random variable by a positive constant frequently corresponds to a change of units (minutes into hours for a lifetime variable, for example). Thus, the exponential distribution is preserved under such changes of units. In the context of the Poisson process, this has to be the case, since the memoryless property, which led to the exponential distribution in the first place, clearly does not depend on the time units. In fact, the exponential distribution with rate parameter 1 is referred to as the standard exponential distribution. From the previous result, if Z has the standard exponential distribution and r > 0 , then X = Z has the exponential distribution with rate parameter r . Conversely, if X has the exponential distribution with rate r > 0 then Z = rX has the standard exponential distribution. 1 r
Similarly, the Poisson process with rate parameter 1 is referred to as the standard Poisson process. If Z is the ith inter-arrival time for the standard Poisson process for i ∈ N , then letting X = Z for i ∈ N gives the inter-arrival times for the Poisson process with rate r. Conversely if X is the ith inter-arrival time of the Poisson process with rate r > 0 for i ∈ N , then Z = rX for i ∈ N gives the inter-arrival times for the standard Poisson process. i
1
+
i
r
i
+
i
+
i
i
+
Relation to the Geometric Distribution In many respects, the geometric distribution is a discrete version of the exponential distribution. In particular, recall that the geometric distribution on N is the only distribution on N with the memoryless and constant rate properties. So it is not surprising that the two distributions are also connected through various transformations and limits. +
+
Suppose that X has the exponential distribution with rate parameter r > 0 . Then 1. ⌊X⌋ has the geometric distributions on N with success parameter 1 − e . 2. ⌈X⌉ has the geometric distributions on N with success parameter 1 − e . −r
−r
+
Proof The following connection between the two distributions is interesting by itself, but will also be very important in the section on splitting Poisson processes. In words, a random, geometrically distributed sum of independent, identically distributed exponential variables is itself exponential. Suppose that X = (X , X , …) is a sequence of independent variables, each with the exponential distribution with rate r. Suppose that U has the geometric distribution on N with success parameter p and is independent of X. Then Y = ∑ X has the exponential distribution with rate rp. 1
2
U
+
i=1
i
Proof The next result explores the connection between the Bernoulli trials process and the Poisson process that was begun in the Introduction.
14.2.3
https://stats.libretexts.org/@go/page/10267
For
, suppose that U has the geometric distribution on N with success parameter p , where np . Then the distribution of U /n converges to the exponential distribution with parameter r as n → ∞ .
n ∈ N+
n → ∞
n
+
n
n
→ r >0
as
n
Proof To understand this result more clearly, suppose that we have a sequence of Bernoulli trials processes. In process n , we run the trials at a rate of n per unit time, with probability of success p . Thus, the actual time of the first success in process n is U /n. The last result shows that if np → r > 0 as n → ∞ , then the sequence of Bernoulli trials processes converges to the Poisson process with rate parameter r as n → ∞ . We will return to this point in subsequent sections. n
n
n
Orderings and Order Statistics Suppose that X and Y have exponential distributions with parameters a and b , respectively, and are independent. Then a P(X < Y ) =
(14.2.16) a+b
Proof The following theorem gives an important random version of the memoryless property. Suppose that X and Y are independent variables taking values in [0, ∞) and that Y has the exponential distribution with rate parameter r > 0 . Then X and Y − X are conditionally independent given X < Y , and the conditional distribution of Y − X is also exponential with parameter r. Proof For our next discussion, suppose that X = (X , X , … , X ) is a sequence of independent random variables, and that X has the exponential distribution with rate parameter r > 0 for each i ∈ {1, 2, … , n}. 1
2
n
i
i
Let U
= min{ X1 , X2 , … , Xn }
n
. Then U has the exponential distribution with parameter ∑
i=1
ri
.
Proof In the context of reliability, if a series system has independent components, each with an exponentially distributed lifetime, then the lifetime of the system is also exponentially distributed, and the failure rate of the system is the sum of the component failure rates. In the context of random processes, if we have n independent Poisson process, then the new process obtained by combining the random points in time is also Poisson, and the rate of the new process is the sum of the rates of the individual processes (we will return to this point latter). Let V
= max{ X1 , X2 , … , Xn }
. Then V has distribution function F given by n
F (t) = ∏ (1 − e
−ri t
),
t ∈ [0, ∞)
(14.2.22)
i=1
Proof Consider the special case where r = r ∈ (0, ∞) for each i ∈ N . In statistical terms, X is a random sample of size n from the exponential distribution with parameter r. From the last couple of theorems, the minimum U has the exponential distribution with rate nr while the maximum V has distribution function F (t) = (1 − e ) for t ∈ [0, ∞). Recall that U and V are the first and last order statistics, respectively. i
+
−rt
n
In the order statistic experiment, select the exponential distribution. 1. Set k = 1 (this gives the minimum U ). Vary n with the scroll bar and note the shape of the probability density function. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density function. 2. Vary n with the scroll bar, set k = n each time (this gives the maximum V ), and note the shape of the probability density function. For selected values of n , run the simulation 1000 times and compare the empirical density function to the true probability density function.
14.2.4
https://stats.libretexts.org/@go/page/10267
Curiously, the distribution of the maximum of independent, identically distributed exponential variables is also the distribution of the sum of independent exponential variables, with rates that grow linearly with the index. Suppose that r
i
= ir
for each i ∈ {1, 2, … , n} where r ∈ (0, ∞). Then Y F (t) = (1 − e
−rt
n
) ,
n
=∑
i=1
Xi
has distribution function F given by
t ∈ [0, ∞)
(14.2.23)
Proof This result has an application to the Yule process, named for George Yule. The Yule process, which has some parallels with the Poisson process, is studied in the chapter on Markov processes. We can now generalize the order probability above: For i ∈ {1, 2, … , n}, ri
P (Xi < Xj for all j ≠ i) =
(14.2.26)
n
∑
j=1
rj
Proof Suppose that for each i, X is the time until an event of interest occurs (the arrival of a customer, the failure of a device, etc.) and that these times are independent and exponentially distributed. Then the first time U that one of the events occurs is also exponentially distributed, and the probability that the first event to occur is event i is proportional to the rate r . i
i
The probability of a total ordering is n
ri
P(X1 < X2 < ⋯ < Xn ) = ∏ i=1
(14.2.27)
n
∑j=i rj
Proof Of course, the probabilities of other orderings can be computed by permuting the parameters appropriately in the formula on the right. The result on minimums and the order probability result above are very important in the theory of continuous-time Markov chains. But for that application and others, it's convenient to extend the exponential distribution to two degenerate cases: point mass at 0 and point mass at ∞ (so the first is the distribution of a random variable that takes the value 0 with probability 1, and the second the distribution of a random variable that takes the value ∞ with probability 1). In terms of the rate parameter r and the distribution function F , point mass at 0 corresponds to r = ∞ so that F (t) = 1 for 0 < t < ∞ . Point mass at ∞ corresponds to r = 0 so that F (t) = 0 for 0 < t < ∞ . The memoryless property, as expressed in terms of the reliability function F , still holds for these degenerate cases on (0, ∞): c
F
c
(s)F
c
(t) = F
c
(s + t),
s, t ∈ (0, ∞)
(14.2.30)
We also need to extend some of results above for a finite number of variables to a countably infinite number of variables. So for the remainder of this discussion, suppose that {X : i ∈ I } is a countable collection of independent random variables, and that X has the exponential distribution with parameter r ∈ (0, ∞) for each i ∈ I . i
i
i
Let U
= inf{ Xi : i ∈ I }
. Then U has the exponential distribution with parameter ∑
i∈I
ri
Proof For i ∈ N , +
ri
P (Xi < Xj for all j ∈ I − {i}) =
(14.2.32)
∑j∈I rj
Proof We need one last result in this setting: a condition that ensures that the sum of an infinite collection of exponential variables is finite with probability one. Let Y
= ∑i∈I Xi
and μ = ∑
i∈I
1/ ri
. Then μ = E(Y ) and P(Y
< ∞) = 1
14.2.5
if and only if μ < ∞ .
https://stats.libretexts.org/@go/page/10267
Proof
Computational Exercises Show directly that the exponential probability density function is a valid probability density function. Solution Suppose that the length of a telephone call (in minutes) is exponentially distributed with rate parameter r = 0.2. Find each of the following: 1. The probability that the call lasts between 2 and 7 minutes. 2. The median, the first and third quartiles, and the interquartile range of the call length. Answer Suppose that the lifetime of a certain electronic component (in hours) is exponentially distributed with rate parameter r = 0.001. Find each of the following: 1. The probability that the component lasts at least 2000 hours. 2. The median, the first and third quartiles, and the interquartile range of the lifetime. Answer Suppose that the time between requests to a web server (in seconds) is exponentially distributed with rate parameter Find each of the following:
r =2
.
1. The mean and standard deviation of the time between requests. 2. The probability that the time between requests is less that 0.5 seconds. 3. The median, the first and third quartiles, and the interquartile range of the time between requests. Answer Suppose that the lifetime X of a fuse (in 100 hour units) is exponentially distributed with P(X > 10) = 0.8 . Find each of the following: 1. The rate parameter. 2. The mean and standard deviation. 3. The median, the first and third quartiles, and the interquartile range of the lifetime. Answer The position following:
X
of the first defect on a digital tape (in cm) has the exponential distribution with mean 100. Find each of the
1. The rate parameter. 2. The probability that X < 200 given X > 150 . 3. The standard deviation. 4. The median, the first and third quartiles, and the interquartile range of the position. Answer Suppose that
are independent, exponentially distributed random variables with respective parameters . Find the probability of each of the 6 orderings of the variables.
X, Y , Z
a, b, c ∈ (0, ∞)
Proof This page titled 14.2: The Exponential Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
14.2.6
https://stats.libretexts.org/@go/page/10267
14.3: The Gamma Distribution Basic Theory We now know that the sequence of inter-arrival times X = (X , X , …) in the Poisson process is a sequence of independent random variables, each having the exponential distribution with rate parameter r, for some r > 0 . No other distribution gives the strong renewal assumption that we want: the property that the process probabilistically restarts, independently of the past, at each arrival time and at each fixed time. 1
2
The n th arrival time is simply the sum of the first n inter-arrival times: n
Tn = ∑ Xi ,
n ∈ N
(14.3.1)
i=0
Thus, the sequence of arrival times T X = (X , X , …). 1
= (T0 , T1 , …)
is the partial sum process associated with the sequence of inter-arrival times
2
Distribution Functions Recall that the common probability density function of the inter-arrival times is f (t) = re
−rt
,
0 ≤t 0 for n ∈ N , and if λ(A ) → 0 as n → ∞ then i
i
n
n
+
n
P [N (An ) = 1]
→ r as n → ∞
(14.4.13)
→ 0 as n → ∞
(14.4.14)
λ(An ) P [N (An ) > 1] λ(An )
Proof Of course, part (a) is the stationary assumption and part (b) the independence assumption. The first limit in (c) is sometimes called the rate property and the second limit the sparseness property. In a “small” time interval of length dt , the probability of a single random point is approximately r dt, and the probability of two or more random points is negligible.
Sums Suppose that N and M are independent random variables, and that N has the Poisson distribution with parameter a ∈ (0, ∞) and M has the Poisson distribution with parameter b ∈ (0, ∞). Then N + M has the Poisson distribution with parameter a+b . Proof from the Poisson process Proof from probability generating functions Proof from convolution From the last theorem, it follows that the Poisson distribution is infinitely divisible. That is, a Poisson distributed variable can be written as the sum of an arbitrary number of independent, identically distributed (in fact also Poisson) variables. Suppose that N has the Poisson distribution with parameter a ∈ (0, ∞) . Then for n ∈ N , N has the same distribution as ∑ N where (N , N , … , N ) are independent, and each has the Poisson distribution with parameter a/n. +
n
i=1
i
1
2
n
Normal Approximation Because of the representation as a sum of independent, identically distributed variables, it's not surprising that the Poisson distribution can be approximated by the normal. Suppose that N has the Poisson distribution with parameter the standard normal distribution as t → ∞ . t
t >0
. Then the distribution of the variable below converges to
Nt − t Zt =
(14.4.18) √t
Proof Thus, if N has the Poisson distribution with parameter a , and a is “large”, then the distribution of N is approximately normal with − mean a and standard deviation √− a. When using the normal approximation, we should remember to use the continuity correction, since the Poisson is a discrete distribution. In the Poisson experiment, set r = t = 1 . Increase t and note how the graph of the probability density function becomes more bell-shaped.
General Exponential The Poisson distribution is a member of the general exponential family of distributions. This fact is important in various statistical procedures. Suppose that N has the Poisson distribution with parameter family with natural parameter ln(a) and natural statistic N .
a ∈ (0, ∞)
. This distribution is a one-parameter exponential
Proof
14.4.4
https://stats.libretexts.org/@go/page/10269
The Uniform Distribution The Poisson process has some basic connections to the uniform distribution. Consider again the Poisson process with rate As usual, T = (T , T , …) denotes the arrival time sequence and N = (N : t ≥ 0) the counting process. 0
1
r >0
.
t
For t > 0 , the conditional distribution of T given N 1
t
=1
is uniform on the interval (0, t].
Proof More generally, for t > 0 and n ∈ N , the conditional distribution of (T , T , … , T ) given N = n is the same as the distribution of the order statistics of a random sample of size n from the uniform distribution on the interval (0, t]. +
1
2
n
t
Heuristic proof Note that the conditional distribution in the last result is independent of the rate r. This means that, in a sense, the Poisson model gives the most “random” distribution of points in time.
The Binomial Distribution The Poisson distribution has important connections to the binomial distribution. First we consider a conditional distribution based on the number of arrivals of a Poisson process in a given interval, as we did in the last subsection. Suppose that (N : t ∈ [0, ∞)) is a Poisson counting process with rate r ∈ (0, ∞). If s, t ∈ (0, ∞) with s < t , and n ∈ N , then the conditional distribution of N given N = n is binomial with trial parameter n and success parameter p = s/t . t
+
s
t
Proof Note again that the conditional distribution in the last result does not depend on the rate r. Given N = n , each of the n arrivals, independently of the others, falls into the interval (0, s] with probability s/t and into the interval (s, t] with probability 1 − s/t = (t − s)/t . Here is essentially the same result, outside of the context of the Poisson process. t
Suppose that M has the Poisson distribution with parameter a ∈ (0, ∞) , N has the Poisson distribution with parameter b ∈ (0, ∞), and that M and N are independent. Then the conditional distribution of M given M + N = n is binomial with parameters n and p = a/(a + b) . Proof More importantly, the Poisson distribution is the limit of the binomial distribution in a certain sense. As we will see, this convergence result is related to the analogy between the Bernoulli trials process and the Poisson process that we discussed in the Introduction, the section on the inter-arrival times, and the section on the arrival times. Suppose that p ∈ (0, 1) for n ∈ N and that np → a ∈ (0, ∞) as n → ∞ . Then the binomial distribution with parameters n and p converges to the Poisson distribution with parameter a as n → ∞ . That is, for fixed k ∈ N , n
+
n
n
k
(
n a n−k −a k ) pn (1 − pn ) → e as n → ∞ k k!
(14.4.27)
Direct proof Proof from generating functions The mean and variance of the binomial distribution converge to the mean and variance of the limiting Poisson distribution, respectively. 1. np 2. np
n
→ a
n (1
as n → ∞ as n → ∞
− pn ) → a
Of course the convergence of the means is precisely our basic assumption, and is further evidence that this is the essential assumption. But for a deeper look, let's return to the analogy between the Bernoulli trials process and the Poisson process. Recall that both have the strong renewal property that at each fixed time, and at each arrival time, the process stochastically starts over, independently of the past. The difference, of course, is that time is discrete in the Bernoulli trials process and continuous in the Poisson process. The convergence result is a special case of the more general fact that if we run Bernoulli trials at a faster and faster rate but with a smaller and smaller success probability, in just the right way, the Bernoulli trials process converges to the
14.4.5
https://stats.libretexts.org/@go/page/10269
Poisson process. Specifically, suppose that we have a sequence of Bernoulli trials processes. In process n we perform the trials at a rate of n per unit time, with success probability p . Our basic assumption is that np → r as n → ∞ where r > 0 . Now let Y denote the number of successes in the time interval (0, t] for Bernoulli trials process n , and let N denote the number of arrivals in this interval for the Poisson process with rate r. Then Y has the binomial distribution with parameters ⌊nt⌋ and p , and of course N has the Poisson distribution with parameter rt . n
n
n,t
t
n,t
n
t
For t > 0 , the distribution of Y
n,t
converges to the distribution of N as n → ∞ . t
Proof Compare the Poisson experiment and the binomial timeline experiment. 1. Open the Poisson experiment and set r = 1 and t = 5 . Run the experiment a few times and note the general behavior of the random points in time. Note also the shape and location of the probability density function and the mean±standard deviation bar. 2. Now open the binomial timeline experiment and set n = 100 and p = 0.05. Run the experiment a few times and note the general behavior of the random points in time. Note also the shape and location of the probability density function and the mean±standard deviation bar. From a practical point of view, the convergence of the binomial distribution to the Poisson means that if the number of trials n is “large” and the probability of success p “small”, so that np is small, then the binomial distribution with parameters n and p is well approximated by the Poisson distribution with parameter r = np . This is often a useful result, because the Poisson distribution has fewer parameters than the binomial distribution (and often in real problems, the parameters may only be known approximately). Specifically, in the approximating Poisson distribution, we do not need to know the number of trials n and the probability of success p individually, but only in the product np. The condition that np be small means that the variance of the binomial distribution, namely np(1 − p) = np − np is approximately r = np , the variance of the approximating Poisson distribution. 2
2
2
Recall that the binomial distribution can also be approximated by the normal distribution, by virtue of the central limit theorem. The normal approximation works well when np and n(1 − p) are large; the rule of thumb is that both should be at least 5. The Poisson approximation works well, as we have already noted, when n is large and np small. 2
Computational Exercises Suppose that requests to a web server follow the Poisson model with rate r = 5 per minute. Find the probability that there will be at least 8 requests in a 2 minute period. Answer Defects in a certain type of wire follow the Poisson model with rate 1.5 per meter. Find the probability that there will be no more than 4 defects in a 2 meter piece of the wire. Answer Suppose that customers arrive at a service station according to the Poisson model, at a rate of standard deviation of the number of customers in an 8 hour period.
r =4
. Find the mean and
Answer In the Poisson experiment, set r = 5 and t = 4 . Run the experiment 1000 times and compute the following: 1. P(15 ≤ N ≤ 22) 2. The relative frequency of the event {15 ≤ N ≤ 22} . 3. The normal approximation to P(15 ≤ N ≤ 22) . 4
4
4
Answer Suppose that requests to a web server follow the Poisson model with rate r = 5 per minute. Compute the normal approximation to the probability that there will be at least 280 requests in a 1 hour period. Answer
14.4.6
https://stats.libretexts.org/@go/page/10269
Suppose that requests to a web server follow the Poisson model, and that 1 request comes in a five minute period. Find the probability that the request came during the first 3 minutes of the period. Answer Suppose that requests to a web server follow the Poisson model, and that 10 requests come during a 5 minute period. Find the probability that at least 4 requests came during the first 3 minutes of the period. Answer In the Poisson experiment, set r = 3 and t = 5 . Run the experiment 100 times. 1. For each run, compute the estimate of r based on N . 2. Over the 100 runs, compute the average of the squares of the errors. 3. Compare the result in (b) with var(N ) . t
t
Suppose that requests to a web server follow the Poisson model with unknown rate server receives 342 requests. Estimate r.
r
per minute. In a one hour period, the
Answer In the binomial experiment, set following:
n = 30
and
p = 0.1
, and run the simulation 1000 times. Compute and compare each of the
1. P(Y ≤ 4) 2. The relative frequency of the event {Y ≤ 4} 3. The Poisson approximation to P(Y ≤ 4) 30
30
30
Answer Suppose that we have 100 memory chips, each of which is defective with probability 0.05, independently of the others. Approximate the probability that there are at least 3 defectives in the batch. Answer In the binomial timeline experiment, set n = 40 and p = 0.1 and run the simulation 1000 times. Compute and compare each of the following: 1. P(Y > 5) 2. The relative frequency of the event {Y > 5} 3. The Poisson approximation to P(Y > 5) 4. The normal approximation to P(Y > 5) 40
40
40
40
Answer In the binomial timeline experiment, set n = 100 and p = 0.1 and run the simulation 1000 times. Compute and compare each of the following: 1. P(8 < Y < 15) 2. The relative frequency of the event {8 < Y < 15} 3. The Poisson approximation to P(8 < Y < 15) 4. The normal approximation to P(8 < Y < 15) 100
100
100
100
Answer A text file contains 1000 words. Assume that each word, independently of the others, is misspelled with probability p. 1. If p = 0.015, approximate the probability that the file contains at least 20 misspelled words. 2. If p = 0.001, approximate the probability that the file contains at least 3 misspelled words. Answer
14.4.7
https://stats.libretexts.org/@go/page/10269
This page titled 14.4: The Poisson Distribution is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
14.4.8
https://stats.libretexts.org/@go/page/10269
14.5: Thinning and Superpositon Thinning Thinning or splitting a Poisson process refers to classifying each random point, independently, into one of a finite number of different types. The random points of a given type also form Poisson processes, and these processes are independent. Our exposition will concentrate on the case of just two types, but this case has all of the essential ideas.
The Two-Type Process We start with a Poisson process with rate r > 0 . Recall that this statement really means three interrelated stochastic processes: the sequence of inter-arrival times X = (X , X , …), the sequence of arrival times T = (T , T , …) , and the counting process N = (N : t ≥ 0) . Suppose now that each arrival, independently of the others, is one of two types: type 1 with probability p and type 0 with probability 1 − p , where p ∈ (0, 1) is a parameter. Here are some common examples: 1
2
0
1
t
The arrivals are radioactive emissions and each emitted particle is either detected (type 1) or missed (type 0) by a counter. The arrivals are customers at a service station and each customer is classified as either male (type 1) or female (type 0). We want to consider the type 1 and type 0 random points separately. For this reason, the new random process is usually referred to as thinning or splitting the original Poisson process. In some applications, the type 1 points are accepted while the type 0 points are rejected. The main result of this section is that the type 1 and type 0 points form separate Poisson processes, with rates rp and r(1 − p) respectively, and are independent. We will explore this important result from several points of view.
Bernoulli Trials In the previous sections, we have explored the analogy between the Bernoulli trials process and the Poisson process. Both have the strong renewal property that at each fixed time and at each arrival time, the process stochastically “restarts”, independently of the past. The difference, of course, is that time is discrete in the Bernoulli trials process and continuous in the Poisson process. In this section, we have both processes simultaneously, and given our previous explorations, it's perhaps not surprising that this leads to some interesting mathematics. Thus, in addition to the processes X, T , and N , we have a sequence of Bernoulli trials I = (I , I , …) with success parameter p. Indicator variable I specifies the type of the j th arrival. Moreover, because of our assumptions, I is independent of X, T , and N . Recall that V , the trial number of the k th success has the negative binomial distribution with parameters k and p for k ∈ N . We take V = 0 by convention. Also, U , the number of trials needed to go from the (k − 1) st success to the k th success has the geometric distribution with success parameter p for k ∈ N . Moreover, U = (U , U , …) is independent and V = (V , V , …) is the partial sum process associated with U : 1
2
j
k
+
0
k
+
1
2
0
1
k
Vk = ∑ Ui ,
k ∈ N
(14.5.1)
i=1
Uk = Vk − Vk−1 ,
k ∈ N+
(14.5.2)
As noted above, the Bernoulli trials process can be thought of as random points in discrete time, namely the trial numbers of the successes. With this understanding, U is the sequence of inter-arrival times and V is the sequence of arrival times.
The Inter-arrival Times Now consider just the type 1 points in our Poisson process. The time between the arrivals of (k − 1) st and k th type 1 point is Vk
Yk =
∑
Xi ,
k ∈ N+
(14.5.3)
i=Vk−1 +1
Note that Y has U terms. The next result shows that the type 1 points form a Poisson process with rate pr. k
k
Y = (Y1 , Y2 , …)
is a sequence of independent variables and each has the exponential distribution with rate parameter pr
Proof
14.5.1
https://stats.libretexts.org/@go/page/10270
Similarly, if Z = (Z , Z , …) is the sequence of interarrvial times for the type 0 points, then Z is a sequence of independent variables, and each has the exponential distribution with rate (1 − p)r . Moreover, Y and Z are independent. 1
2
Counting Processes For
denote the number of type 1 arrivals in (0, t] and W the number of type 0 arrivals in (0, t]. Thus, and W = (W : t ≥ 0) are the counting processes for the type 1 arrivals and for the type 0 arrivals. The next result follows from the previous subsection, but a direct proof is interesting. t ≥0
, let
Mt
t
M = (Mt : t ≥ 0)
t
For t ≥ 0 , M has the Poisson distribution with parameter M and W are independent. t
t
,
pr Wt
has the Poisson distribution with parameter
(1 − p)r
, and
t
Proof In the two-type Poisson experiment vary r, p, and t with the scroll bars and note the shape of the probability density functions. For various values of the parameters, run the experiment 1000 times and compare the relative frequency functions to the probability density functions.
Estimating the Number of Arrivals Suppose that the type 1 arrivals are observable, but not the type 0 arrivals. This setting is natural, for example, if the arrivals are radioactive emissions, and the type 1 arrivals are emissions that are detected by a counter, while the type 0 arrivals are emissions that are missed. Suppose that for a given t > 0 , we would like to estimate the total number arrivals N after observing the number of type 1 arrivals M . t
t
The conditional distribution of N given M t
t
is the same as the distribution of k + W .
=k
t
P(Nt = n ∣ Mt = k) = e
−(1−p)rt
[(1 − p)rt]
n−k
,
n ∈ {k, k + 1, …}
(14.5.7)
(n − k)!
Proof E(Nt ∣ Mt = k) = k + (1 − p) r
.
Proof Thus, if the overall rate r of the process and the probability p that an arrival is type 1 are known, then it follows form the general theory of conditional expectation that the best estimator of N based on M , in the least squares sense, is t
t
E(Nt ∣ Mt ) = Mt + (1 − p)r
The mean square error is E ([N
t
− E(Nt ∣ Mt )]
2
) = (1 − p)rt
(14.5.10)
.
Proof
The Multi-Type Process As you might guess, the results above generalize from 2 types to k types for general k ∈ N . Once again, we start with a Poisson process with rate r > 0 . Suppose that each arrival, independently of the others, is type i with probability p for i ∈ {0, 1, … , k − 1}. Of course we must have p ≥ 0 for each i and ∑ p = 1 . Then for each i , the type i points form a Poisson process with rate p r, and these processes are independent. +
i
k−1
i
i=0
i
i
Superposition Complementary to splitting or thinning a Poisson process is superposition: if we combine the random points in time from independent Poisson processes, then we have a new Poisson processes. The rate of the new process is the sum of the rates of the processes that were combined. Once again, our exposition will concentrate on the superposition of two processes. This case contains all of the essential ideas.
14.5.2
https://stats.libretexts.org/@go/page/10270
Two Processes Suppose that we have two independent Poisson processes. We will denote the sequence of inter-arrival times, the sequence of arrival times, and the counting variables for the process i ∈ {1, 2} by X = (X , X …), T = (T , T , …), and N = (N : t ∈ [0, ∞)) , and we assume that process i has rate r ∈ (0, ∞) . The new process that we want to consider is obtained by simply combining the random points. That is, the new random points are {T : n ∈ N } ∪ {T : n ∈ N } , but of course then ordered in time. We will denote the sequence of inter-arrival times, the sequence of arrival times, and the counting variables for the new process by X = (X , X …) , T = (T , T , …) , and N = (N : t ∈ [0, ∞)) . Clearly if A is an interval in [0, ∞) then i
i
i
i
1
2
i
i
i
1
2
i
i
t
1 n
1
2
1
2
N (A) = N
1
+
2 n
+
t
2
(A) + N
(A)
(14.5.11)
the number of combined points in A is simply the sum of the number of point in A for processes 1 and 2. It's also worth noting that X1 = min{ X
1
1
,X
2
1
}
(14.5.12)
the first arrival for the combined process is the smaller of the first arrival times for processes 1 and 2. The other inter-arrival times, and hence also the arrival times, for the combined process are harder to state. The combined process is a Poisson process with rate r
1
+ r2
. Moreover,
Proof
Computational Exercises In the two-type Poisson experiment, set r = 2 , t = 3 , and p = 0.7 . Run the experiment 1000 times, Compute the appropriate relative frequency functions and investigate empirically the independence of the number of type 1 points and the number of type 0 points. Suppose that customers arrive at a service station according to the Poisson model, with rate r = 20 per hour. Moreover, each customer, independently, is female with probability 0.6 and male with probability 0.4. Find the probability that in a 2 hour period, there will be at least 20 women and at least 15 men. Answer In the two-type Poisson experiment, set r = 3 , t = 4 , and p = 0.8 . Run the experiment 100 times. 1. Compute the estimate of N based on M for each run. 2. Over the 100 runs, compute average of the sum of the squares of the errors. 3. Compare the result in (b) with the result in Exercise 8. t
t
Suppose that a piece of radioactive material emits particles according to the Poisson model at a rate of r = 100 per second. Moreover, assume that a counter detects each emitted particle, independently, with probability 0.9. Suppose that the number of detected particles in a 5 second period is 465. 1. Estimate the number of particles emitted. 2. Compute the mean square error of the estimate. Answer This page titled 14.5: Thinning and Superpositon is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
14.5.3
https://stats.libretexts.org/@go/page/10270
14.6: Non-homogeneous Poisson Processes Basic Theory A non-homogeneous Poisson process is similar to an ordinary Poisson process, except that the average rate of arrivals is allowed to vary with time. Many applications that generate random points in time are modeled more faithfully with such non-homogeneous processes. The mathematical cost of this generalization, however, is that we lose the property of stationary increments. Non-homogeneous Poisson processes are best described in measure-theoretic terms. Thus, you may need to review the sections on measure theory in the chapters on Foundations, Probability Measures, and Distributions. Our basic measure space in this section is [0, ∞) with the σ-algebra of Borel measurable subsets (named for Émile Borel). As usual, λ denotes Lebesgue measure on this space, named for Henri Lebesgue. Recall that the Borel σ-algebra is the one generated by the intervals, and λ is the generalization of length on intervals.
Definition and Basic Properties Of all of our various characterizations of the ordinary Poisson process, in terms of the inter-arrival times, the arrival times, and the counting process, the characterizations involving the counting process leads to the most natural generalization to non-homogeneous processes. Thus, consider a process that generates random points in time, and as usual, let N denote the number of random points in the interval (0, t] for t ≥ 0 , so that N = {N : t ≥ 0} is the counting process. More generally, N (A) denotes the number of random points in a measurable A ⊆ [0, ∞) , so N is our random counting measure. As before, t ↦ N is a (random) distribution function and A ↦ N (A) is the (random) measure associated with this distribution function. t
t
t
Suppose now that r : [0, ∞) → [0, ∞) is measurable, and define m : [0, ∞) → [0, ∞) by m(t) = ∫
r(s) dλ(s)
(14.6.1)
(0,t]
From properties of the integral, m is increasing and right-continuous on [0, ∞) and hence is distribution function. The positive measure on [0, ∞) associated with m (which we will also denote by m) is defined on a measurable A ⊆ [0, ∞) by m(A) = ∫
r(s) dλ(s)
(14.6.2)
A
Thus, m(t) = m(0, t] , and for s, t ∈ [0, ∞) with s < t , m(s, t] = m(t) − m(s) . Finally, note that the measure m is absolutely continuous with respect to λ , and r is the density function. Note the parallels between the random distribution function and measure N and the deterministic distribution function and measure m. With the setup involving r and m complete, we are ready for our first definition. A process that produces random points in time is a non-homogeneous Poisson process with rate function process N satisfies the following properties: 1. If {A : i ∈ I } is a countable, disjoint collection of measurable subsets of [0, ∞) then {N (A independent random variables. 2. If A ⊆ [0, ∞) is measurable then N (A) has the Poisson distribution with parameter m(A).
i)
i
: i ∈ I}
r
if the counting
is a collection of
Property (a) is our usual property of independent increments, while property (b) is a natural generalization of the property of Poisson distributed increments. Clearly, if r is a positive constant, then m(t) = rt for t ∈ [0, ∞) and as a measure, m is proportional to Lebesgue measure λ . In this case, the non-homogeneous process reduces to an ordinary, homogeneous Poisson process with rate r. However, if r is not constant, then m is not linear, and as a measure, is not proportional to Lebesgue measure. In this case, the process does not have stationary increments with respect to λ , but does of course, have stationary increments with respect to m. That is, if A, B are measurable subsets of [0, ∞) and λ(A) = λ(B) then N (A) and N (B) will not in general have the same distribution, but of course they will have the same distribution if m(A) = m(B) . In particular, recall that the parameter of the Poisson distribution is both the mean and the variance, so E [N (A)] = var [N (A)] = m(A) for measurable A ⊆ [0, ∞) , and in particular, E(N ) = var(N ) = m(t) for t ∈ [0, ∞). The t
14.6.1
t
https://stats.libretexts.org/@go/page/10271
function m is usually called the mean function. Since m (t) = r(t) (if r is continuous at t ), it makes sense to refer to r as the rate function. Locally, at t , the arrivals are occurring at an average rate of r(t) per unit time. ′
As before, from a modeling point of view, the property of independent increments can reasonably be evaluated. But we need something more primitive to replace the property of Poisson increments. Here is the main theorem. A process that produces random points in time is a non-homogeneous Poisson process with rate function counting process N satisfies the following properties: 1. If {A : i ∈ I } is a countable, disjoint collection of measurable subsets of [0, ∞) then {N (A independent variables. 2. For t ∈ [0, ∞),
i)
i
: i ∈ I}
r
if and only if the
is a set of
P [N (t, t + h] = 1] → r(t) as h ↓ 0
(14.6.3)
→ 0 as h ↓ 0
(14.6.4)
h P [N (t, t + h] > 1] h
So if h is “small” the probability of a single arrival in arrival in this interval is negligible.
is approximately
[t, t + h)
r(t)h
, while the probability of more than 1
Arrival Times and Time Change Suppose that we have a non-homogeneous Poisson process with rate function r, as defined above. As usual, let T denote the time of the n th arrival for n ∈ N . As with the ordinary Poisson process, we have an inverse relation between the counting process N = { N : t ∈ [0, ∞)} and the arrival time sequence T = {T : n ∈ N} , namely T = min{t ∈ [0, ∞) : N = n} , N = #{n ∈ N : T ≤ t} , and { T ≤ t} = { N ≥ n} , since both events mean at least n random points in (0, t]. The last relationship allows us to get the distribution of T . n
t
n
t
n
n
n
t
t
n
For n ∈ N , T has probability density function f given by +
n
n
m
n−1
(t)
fn (t) =
r(t)e
−m(t)
,
t ∈ [0, ∞)
(14.6.5)
(n − 1)!
Proof In particular, T has probability density function f given by 1
1
f1 (t) = r(t)e
−m(t)
,
t ∈ [0, ∞)
(14.6.8)
Recall that in reliability terms, r is the failure rate function, and that the reliability function is the right distribution function: F
c
1
(t) = P(T1 > t) = e
−m(t)
,
t ∈ [0, ∞)
(14.6.9)
In general, the functional form of f is clearly similar to the probability density function of the gamma distribution, and indeed, T can be transformed into a random variable with a gamma distribution. This amounts to a time change which will give us additional insight into the non-homogeneous Poisson process. n
Let U
n
= m(Tn )
n
for n ∈ N . Then U has the gamma distribution with shape parameter n and rate parameter 1 +
n
Proof Thus, the time change u = m(t) transforms the non-homogeneous Poisson process into a standard (rate 1) Poisson process. Here is an equivalent way to look at the time change result. For u ∈ [0, ∞), let Poisson process.
Mu = Nt
where
t =m
−1
(u)
. Then
{ Mu : u ∈ [0, ∞)}
is the counting process for a standard, rate 1
Proof Equivalently, we can transform a standard (rate 1) Poisson process into a a non-homogeneous Poisson process with a time change.
14.6.2
https://stats.libretexts.org/@go/page/10271
Suppose that M = {M : u ∈ [0, ∞)} is the counting process for a standard Poisson process, and let N = M for t ∈ [0, ∞). Then { N : t ∈ [0, ∞)} is the counting process for a non-homogeneous Poisson process with mean function m (and rate function r). u
t
m(t)
t
Proof This page titled 14.6: Non-homogeneous Poisson Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
14.6.3
https://stats.libretexts.org/@go/page/10271
14.7: Compound Poisson Processes In a compound Poisson process, each arrival in an ordinary Poisson process comes with an associated real-valued random variable that represents the value of the arrival in a sense. These variables are independent and identically distributed, and are independent of the underlying Poisson process. Our interest centers on the sum of the random variables for all the arrivals up to a fixed time t , which thus is a Poisson-distributed random sum of random variables. Distributions of this type are said to be compound Poisson distributions, and are important in their own right, particularly since some surprising parametric distributions turn out to be compound Poisson.
Basic Theory Definition Suppose we have a Poisson process with rate r ∈ (0, ∞). As usual, we wil denote the sequence of inter-arrival times by X = (X , X , …), the sequence of arrival times by T = (T , T , T , …) , and the counting process by N = { N : t ∈ [0, ∞)} . To review some of the most important facts briefly, recall that X is a sequence of independent random variables, each having the exponential distribution on [0, ∞) with rate r. The sequence T is the partial sum sequence associated with X, and has stationary independent increments. For n ∈ N , the n th arrival time T has the gamma distribution with parameters n and r. The process N is the inverse of T , in a certain sense, and also has stationary independent increments. For t ∈ (0, ∞) , the number of arrivals N in (0, t] has the Poisson distribution with parameter rt . 1
2
0
+
1
2
t
n
t
Suppose now that each arrival has an associated real-valued random variable that represents the value of the arrival in a certain sense. Here are some typical examples: The arrivals are customers at a store. Each customer spends a random amount of money. The arrivals are visits to a website. Each visitor spends a random amount of time at the site. The arrivals are failure times of a complex system. Each failure requires a random repair time. The arrivals are earthquakes at a particular location. Each earthquake has a random severity, a measure of the energy released. For n ∈ N , let U denote the value of the n th arrival. We assume that U = (U , U , …) is a sequence of independent, identically distributed, real-valued random variables, and that U is independent of the underlying Poisson process. The common distribution may be discrete or continuous, but in either case, we let f denote the common probability density function. We will let μ = E(U ) denote the common mean, σ = var(U ) the common variance, and G the common moment generating function, so that G(s) = E [exp(sU )] for s in some interval I about 0. Here is our main definition: +
n
1
2
2
n
n
n
The compound Poisson process associated with the given Poisson process V = { V : t ∈ [0, ∞)} where
N
and the sequence
U
is the stochastic process
t
Nt
Vt = ∑ Un
(14.7.1)
n=1
Thus, V is the total value for all of the arrivals in (0, t]. For the examples above t
Vt Vt Vt Vt
is the total income to the store up to time t . is the total time spent at the site by the customers who arrived up to time t . is the total repair time for the failures up to time t . is the total energy released up to time t .
Recall that a sum over an empty index set is 0, so V
0
=0
.
Properties Note that for fixed t , V is a random sum of independent, identically distributed random variables, a topic that we have studied before. In this sense, we have a special case, since the number of terms N has the Poisson distribution with parameter rt . But we also have a new wrinkle, since the process is indexed by the continuous time parameter t , and so we can study its properties as a stochastic process. Our first result is a pair of properties shared by the underlying Poisson process. t
t
14.7.1
https://stats.libretexts.org/@go/page/10272
V
has stationary, independent increments:
1. If s, t ∈ [0, ∞) with s < t , then V − V has the same distribution as V . 2. If (t , t , … , t ) is a sequence of points in [0, ∞) with t < t < ⋯ < t then (V sequence of independent variables. t
1
2
s
t−s
n
1
2
n
t1
, Vt
2
− Vt , … , Vt 1
n
− Vt
n−1
)
is a
Proof Next we consider various moments of the compound process. For t ∈ [0, ∞), the mean and variance of V are t
1. E(V ) = μrt 2. var(V ) = (μ t
2
t
2
+ σ )rt
Proof For t ∈ [0, ∞), the moment generating function of V is given by t
E [exp(sVt )] = exp(rt [G(s) − 1]),
s ∈ I
(14.7.4)
Proof By exactly the same argument, the same relationship holds for characteristic functions and, in the case that the variables in U take values in N, for probability generating functions.. That is, if the variables in U have generating function G, then the generating function H of V is given by t
H (s) = exp(rt[G(s) − 1])
for s in the domain of characteristic.
G
(14.7.6)
, where generating function can be any of the three types we have discussed: probability, moment, or
Examples and Special Cases The Discrete Case First we note that Thinning a Poisson process can be thought of as a special case of a compound Poisson process. Thus, suppose that U = (U , U , …) is a Bernoulli trials sequence with success parameter p ∈ (0, 1), and as above, that U is independent of the Poisson process N . In the usual language of thinning, the arrivals are of two types (1 and 0), and U is the type of the ith arrival. Thus the compound process V constructed above is the thinned process, so that V is the number of type 1 points up to time t . We know that V is also a Poisson process, with rate rp. 1
2
i
t
The results above for thinning generalize to the case where the values of the arrivals have a discrete distribution. Thus, suppose U takes values in a countable set S ⊆ R , and as before, let f denote the common probability density function so that f (u) = P(U = u) for u ∈ S and i ∈ N . For u ∈ S , let N denote the number of arrivals up to time t that have the value u, and let N = {N : t ∈ [0, ∞)} denote the corresponding stochastic process. Armed with this setup, here is the result: i
u
i
u
+
t
u
t
The compound Poisson process V associated with N and U can be written in the form u
Vt = ∑ u Nt ,
t ∈ [0, ∞)
(14.7.7)
u∈S
The processes {N
u
: u ∈ S}
are independent Poisson processes, and N has rate rf (u) for u ∈ S . u
Proof
Compound Poisson Distributions A compound Poisson random variable can be defined outside of the context of a Poisson process. Here is the formal definition: Suppose that U = (U , U , …) is a sequence of independent, identically distributed random variables, and that N is independent of U and has the Poisson distribution with parameter λ ∈ (0, ∞) . Then V = ∑ U has a compound Poisson distribution. 1
2
N
i=1
14.7.2
i
https://stats.libretexts.org/@go/page/10272
But in fact, compound Poisson variables usually do arise in the context of an underlying Poisson process. In any event, the results on the mean and variance above and the generating function above hold with rt replaced by λ . Compound Poisson distributions are infinitely divisible. A famous theorem of William Feller gives a partial converse: an infinitely divisible distribution on N must be compound Poisson. The negative binomial distribution on N is infinitely divisible, and hence must be compound Poisson. Here is the construction: Let p, k ∈ (0, ∞). Suppose that U = (U , U , …) is a sequence of independent variables, each having the logarithmic series distribution with shape parameter 1 − p . Suppose also that N is independent of U and has the Poisson distribution with parameter −k ln(p) . Then V = ∑ U has the negative binomial distribution on N with parameters k and p. 1
2
N
i=1
i
Proof As a special case (k = 1 ), it follows that the geometric distribution on N is also compound Poisson. This page titled 14.7: Compound Poisson Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
14.7.3
https://stats.libretexts.org/@go/page/10272
14.8: Poisson Processes on General Spaces Basic Theory The Process So far, we have studied the Poisson process as a model for random points in time. However there is also a Poisson model for random points in space. Some specific examples of such “random points” are Defects in a sheet of material. Raisins in a cake. Stars in the sky. The Poisson process for random points in space can be defined in a very general setting. All that is really needed is a measure space (S, S , μ). Thus, S is a set (the underlying space for our random points), S is a σ-algebra of subsets of S (as always, the allowable sets), and μ is a positive measure on (S, S ) (a measure of the size of sets). The most important special case is when S is a (Lebesgue) measurable subset of R for some d ∈ N , S is the σ-algebra of measurable subsets of S , and μ = λ is d dimensional Lebesgue measure. Specializing further, recall the lower dimensional spaces: d
+
d
1. When d = 1 , S ⊆ R and λ is length measure. 2. When d = 2 , S ⊆ R and λ is area measure. 3. When d = 3 , S ⊆ R and λ is volume measure. 1
2
2
3
3
Of course, the characterizations of the Poisson process on [0, ∞), in term of the inter-arrival times and the characterization in terms of the arrival times do not generalize because they depend critically on the order relation on [0, ∞). However the characterization in terms of the counting process generalizes perfectly to our new setting. Thus, consider a process that produces random points in S , and as usual, let N (A) denote the number of random points in A ∈ S . Thus N is a random, counting measure on (S, S ) The random measure N is a Poisson process or a Poisson random measure on S with density parameter r > 0 if the following axioms are satisfied: 1. If A ∈ S then N (A) has the Poisson distribution with parameter rμ(A). 2. If {A : i ∈ I } is a countable, disjoint collection of sets in S then {N (A variables.
i)
i
: i ∈ I}
is a set of independent random
To draw parallels with the Poison process on [0, ∞), note that axiom (a) is the generalization of stationary, Poisson-distributed increments, and axiom (b) is the generalization of independent increments. By convention, if μ(A) = 0 then N (A) = 0 with probability 1, and if μ(A) = ∞ then N (A) = ∞ with probability 1. (These distributions are considered degenerate members of the Poisson family.) On the other hand, note that if 0 < μ(A) < ∞ then N (A) has support N. In the two-dimensional Poisson process, vary the width w and the rate r. Note the location and shape of the probability density function of N . For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the true probability density function. For A ⊆ D 1. E [N (A)] = rμ(A) 2. var [N (A)] = rμ(A) Proof In particular, r can be interpreted as the expected density of the random points (that is, the expected number of points in a region of unit size), justifying the name of the parameter. In the two-dimensional Poisson process, vary the width w and the density parameter r. Note the size and location of the mean ±standard deviation bar of N . For various values of the parameters, run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation.
14.8.1
https://stats.libretexts.org/@go/page/10273
The Distribution of the Random Points As before, the Poisson model defines the most random way to distribute points in space, in a certain sense. Assume that we have a Poisson process N on (S, S , μ) with density parameter r ∈ (0, ∞). Given that A ∈ S contains exactly one random point, the position X of the point is uniformly distributed on A . Proof More generally, if A contains n points, then the positions of the points are independent and each is uniformly distributed in A . Suppose that A, B ∈ S and B ⊆ A . For n ∈ N , the conditional distribution of distribution with trial parameter n and success parameter p = μ(B)/μ(A) . +
N (B)
given
N (A) = n
is the binomial
Proof Thus, given N (A) = n , each of the n random points falls into B , independently, with probability p = μ(B)/μ(A) , regardless of the density parameter r. More generally, suppose that A ∈ S and that A is partitioned into k subsets (B , B , … , B ) in S . Then the conditional distribution of (N (B ), N (B ), … , N (B )) given N (A) = n is the multinomial distribution with parameters n and (p , p , … p ), where p = μ(B )/μ(A) for i ∈ {1, 2, … , k}. 1
1
1
2
k
2
i
2
k
k
i
Thinning and Combining Suppose that N is a Poisson random process on (S, S , μ) with density parameter r ∈ [0, ∞). Thinning (or splitting) this process works just like thinning the Poisson process on [0, ∞). Specifically, suppose that the each random point, independently of the others is either type 1 with probability p or type 0 with probability 1 − p , where p ∈ (0, 1) is a new parameter. Let N and N denote the random counting measures associated with the type 1 and type 0 points, respectively. That is, N (A) is the number of type i random points in A , for A ∈ S and i ∈ {0, 1}. 1
0
i
N0
and N are independent Poisson processes on (S, S , μ) with density parameters pr and (1 − p)r , respectively. 1
Proof This result extends naturally to k ∈ N types. As in the standard case, combining independent Poisson processes produces a new Poisson process, and the density parameters add. +
Suppose that N and N are independent Poisson processes on (S, S , μ), with density parameters r and r , respectively. Then the process obtained by combining the random points is also a Poisson process on (S, S , μ) with density parameter r +r . 0
0
1
0
1
1
Proof
Applications and Special Cases Non-homogeneous Poisson Processes A non-homogeneous Poisson process on [0, ∞) can be thought of simply as a Poisson process on [0, ∞) with respect to a measure that is not the standard Lebesgue measure λ on [0, ∞). Thus suppose that r : [0, ∞) → (0, ∞) is piece-wise continuous with ∫ r(t) dt = ∞ , and let 1
∞
0
t
m(t) = ∫
r(s) ds,
t ∈ [0, ∞)
(14.8.8)
0
Consider the non-homogeneous Poisson process with rate function r (and hence mean function m). Recall that the LebesgueStieltjes measure on [0, ∞) associated with m (which we also denote by m) is defined by the condition m(a, b] = m(b) − m(a),
a, b ∈ [0, ∞), a < b
Equivalently, m is the measure that is absolutely continuous with respect to measurable subset of [0, ∞) then
14.8.2
λ1
(14.8.9)
, with density function r. That is, if
A
is a
https://stats.libretexts.org/@go/page/10273
m(A) = ∫
r(t) dt
(14.8.10)
A
The non-homogeneous Poisson process on measure m.
[0, ∞)
with rate function
is the Poisson process on
r
[0, ∞)
with respect to the
Proof
Nearest Points in R
d
In this subsection, we consider a rather specialized topic, but one that is fun and interesting. Consider the Poisson process on (R , R , λ ) with density parameter r > 0 , where as usual, R is the σ-algebra of Lebesgue measurable subsets of R , and λ is d -dimensional Lebesgue measure. We use the usual Euclidean norm on R : d
d
d
d
d
d
d
d
∥x ∥d = (x
1
For t > 0 , let B
t
d
= {x ∈ R
: ∥x ∥d ≤ t}
d
+x
2
d
1/d
⋯ +x )
d
,
d
x = (x1 , x2 , … , xd ) ∈ R
(14.8.11)
denote the ball of radius t centered at the origin. Recall that λ
d
d (Bt )
π
= cd t
where
d/2
cd =
(14.8.12) Γ(d/2 + 1)
is the measure of the unit ball in R , and where Γ is the gamma function. Of course, c d
1
=2
,c
2
=π
,c
3
.
= 4π/3
For t ≥ 0 , let M = N (B ) , the number of random points in the ball B , or equivalently, the number of random points within distance t of the origin. From our formula for the measure of B above, it follows that M has the Poisson distribution with parameter rc t . t
t
t
t
t
d
d
Now let Z = 0 and for n ∈ N let Z denote the distance of the n th closest random point to the origin. Note that Z is analogous to the n th arrival time for the Poisson process on [0, ∞). Clearly the processes M = (M : t ≥ 0) and Z = (Z , Z , …) are inverses of each other in the sense that Z ≤ t if and only if M ≥ n . Both of these events mean that there are at least n random points within distance t of the origin. 0
+
n
n
t
n
0
1
t
Distributions 1. c Z has the gamma distribution with shape parameter n and rate parameter r. 2. Z has probability density function g given by d
d n
n
n
gn (z) =
d(cd r)
n
z
nd−1 d
exp(−rcd z ),
0 ≤z 0) > 0 , so that the interarrival times are nonnegative, but not identically 0. Let μ = E(X) denote the common mean of the interarrival times. We allow that possibility that μ = ∞ . On the other hand, 1
i
1
μ >0
2
.
Proof If μ < ∞ , we will let σ = var(X) denote the common variance of the interarrival times. Let F denote the common distribution function of the interarrival times, so that 2
F (x) = P(X ≤ x),
x ∈ [0, ∞)
(15.1.1)
The distribution function F turns out to be of fundamental importance in the study of renewal processes. We will let f denote the probability density function of the interarrival times if the distribution is discrete or if the distribution is continuous and has a probability density function (that is, if the distribution is absolutely continuous with respect to Lebesgue measure on [0, ∞)). In the discrete case, the following definition turns out to be important: If X takes values in the set {nd : n ∈ N} for some d ∈ (0, ∞) , then X (or its distribution) is said to be arithmetic (the terms lattice and periodic are also used). The largest such d is the span of X. The reason the definition is important is because the limiting behavior of renewal processes turns out to be more complicated when the interarrival distribution is arithmetic.
The Arrival Times Let n
Tn = ∑ Xi ,
n ∈ N
(15.1.2)
i=1
We follow our usual convention that the sum over an empty index set is 0; thus T = 0 . On the other hand, T is the time of the n th arrival for n ∈ N . The sequence T = (T , T , …) is called the arrival time process, although note that T is not considered an arrival. A renewal process is so named because the process starts over, independently of the past, at each arrival time. 0
+
0
1
n
0
15.1.1
https://stats.libretexts.org/@go/page/10277
Figure 15.1.1 : The interarrival times and arrival times
The sequence T is the partial sum process associated with the independent, identically distributed sequence of interarrival times X. Partial sum processes associated with independent, identically distributed sequences have been studied in several places in this project. In the remainder of this subsection, we will collect some of the more important facts about such processes. First, we can recover the interarrival times from the arrival times: Xi = Ti − Ti−1 ,
i ∈ N+
(15.1.3)
t ∈ [0, ∞)
(15.1.4)
Next, let F denote the distribution function of T , so that n
n
Fn (t) = P(Tn ≤ t),
Recall that if X has probability density function f (in either the discrete or continuous case), then function f = f = f ∗ f ∗ ⋯ ∗ f , the n -fold convolution power of f .
Tn
has probability density
∗n
n
The sequence of arrival times T has stationary, independent increments: 1. If m ≤ n then T − T has the same distribution as T 2. If n ≤ n ≤ n ≤ ⋯ then (T , T − T , T − T n
1
2
m
n−m
3
n1
n2
n1
n3
n2
and thus has distribution function F is a sequence of independent random variables. n−m
, …)
Proof If n,
m ∈ N
then
1. E (T ) = nμ 2. var (T ) = nσ 3. cov (T , T ) = min{m, n}σ n
2
n
m
2
n
Proof Recall the law of large numbers: T
n /n
→ μ
as n → ∞
1. With probability 1 (the strong law). 2. In probability (the weak law). Note that T ≤ T for n ∈ N since the interarrival times are nonnegative. Also P(T = T ) = P(X = 0) = F (0) . This can be positive, so with positive probability, more than one arrival can occur at the same time. On the other hand, the arrival times are unbounded: n
Tn → ∞
n+1
n
n−1
n
as n → ∞ with probability 1.
Proof
The Counting Process For t ≥ 0 , let N denote the number of arrivals in the interval [0, t]: t
∞
Nt = ∑ 1(Tn ≤ t),
t ∈ [0, ∞)
(15.1.6)
n=1
We will refer to the random process N = (N : t ≥ 0) as the counting process. Recall again that arrival, but it's possible to have T = 0 for n ∈ N , so there may be one or more arrivals at time 0.
T0 = 0
t
n
Nt = max{n ∈ N : Tn ≤ t}
If s,
t ∈ [0, ∞)
is not considered an
+
for t ≥ 0 .
and s ≤ t then N
t
− Ns
is the number of arrivals in (s, t] .
Note that as a function of t , N is a (random) step function with jumps at the distinct values of (T , T an arrival time is the number of arrivals at that time. In particular, N is an increasing function of t . t
1
15.1.2
2,
; the size of the jump at
…)
https://stats.libretexts.org/@go/page/10277
Figure 15.1.2 : The counting process
More generally, we can define the (random) counting measure corresponding to the sequence of random points (T , T [0, ∞). Thus, if A is a (measurable) subset of [0, ∞), we will let N (A) denote the number of the random points in A : 1
2,
…)
in
∞
N (A) = ∑ 1(Tn ∈ A)
(15.1.7)
n=1
In particular, note that with our new notation, N = N [0, t] for t ≥ 0 and N (s, t] = N − N for s ≤ t . Thus, the random counting measure is completely determined by the counting process. The counting process is the “cumulative measure function” for the counting measure, analogous the cumulative distribution function of a probability measure. t
t
s
For t ≥ 0 and n ∈ N , 1. T 2. N
n t
≤t =n
if and only if N if and only if T
t n
≥n ≤ t < Tn+1
Proof Of course, the complements of the events in (a) are also equivalent, so T > t if and only if N < n . On the other hand, neither of the events N ≤ n and T ≥ t implies the other. For example, we couse easily have N = n and T < t < T . Taking complements, neither of the events N > n and T < t implies the other. The last result also shows that the arrival time process T and the counting process N are inverses of each other in a sense. n
t
n
t
t
t
n
n+1
n
The following events have probability 1: 1. N 2. N
t
x} = { Nt − Nt−x = 0}
t
≥ x, Rt > y} = { Rt−x > x + y} = { Nt+y − Nt−x = 0}
Proof
Figure 15.1.4 : The events of interest for the current and remaining life
Of course, the various equivalent events in the last result must have the same probability. In particular, it follows that if we know the distribution of R for all t then we also know the distribution of C for all t , and in fact we know the joint distribution of (R , C ) for all t and hence also the distribution of L for all t . t
t
t
t
t
For fixed t ∈ (0, ∞) the total life at t (the lifetime of the device in service at time t ) is stochastically larger than a generic lifetime. This result, a bit surprising at first, is known as the inspection paradox. Let X denote fixed interarrival time. for x ≥ 0 .
P(Lt > x) ≥ P(X > x)
Proof
Basic Comparison The basic comparison in the following result is often useful, particularly for obtaining various bounds. The idea is very simple: if the interarrival times are shortened, the arrivals occur more frequently. Suppose now that we have two interarrival sequences, X = (X , X , …) and Y = (Y , Y probability space, with Y ≤ X (with probability 1) for each i. Then for n ∈ N and t ∈ [0, ∞), 1
i
1. T 2. N 3. m
Y ,n
≤ TX,n
Y ,t
≥ NX,t
Y
2
1
2,
…)
defined on the same
i
(t) ≥ mX (t)
Examples and Special Cases Bernoulli Trials Suppose that X = (X , X , …) is a sequence of Bernoulli trials with success parameter p ∈ (0, 1). Recall that X is a sequence of independent, identically distributed indicator variables with P(X = 1) = p . 1
2
Recall the random processes derived from X: 1. Y = (Y , Y , …) where Y the number of success in the first n trials. The sequence Y is the partial sum process associated with X. The variable Y has the binomial distribution with parameters n and p. 2. U = (U , U , …) where U the number of trials needed to go from success number n − 1 to success number n . These are independent variables, each having the geometric distribution on N with parameter p. 0
1
n
n
1
2
n
+
15.1.5
https://stats.libretexts.org/@go/page/10277
3. V = (V , V , …) where V is the trial number of success n . The sequence V is the partial sum process associated with U . The variable V has the negative binomial distribution with parameters n and p. 0
1
n
n
It is natural to view the successes as arrivals in a discrete-time renewal process. Consider the renewal process with interarrival sequence U . Then 1. The basic assumptions are satisfied and that the mean interarrival time is μ = 1/p . 2. V is the sequence of arrival times. 3. Y is the counting process (restricted to N). 4. The renewal function is m(n) = np for n ∈ N . It follows that the renewal measure is proportional to counting measure on N . +
Run the binomial timeline experiment 1000 times for various values of the parameters distribution of the counting variable to the true distribution. Run the negative binomial experiment 1000 times for various values of the parameters distribution of the arrival time to the true distribution.
n
and p. Compare the empirical
and p. Comare the empirical
k
Consider again the renewal process with interarrival sequence U . For n ∈ N , 1. The current life and remaining life at time n are independent. 2. The remaining life at time n has the same distribution as an interarrival time U , namely the geometric distribution on N with parameter p. 3. The current life at time n has a truncated geometric distribution with parameters n and p:
+
k
P(Cn = k) = {
p(1 − p ) , n
(1 − p ) ,
k ∈ {0, 1, … , n − 1}
(15.1.20)
k =n
Proof This renewal process starts over, independently of the past, not only at the arrival times, but at fixed times n ∈ N as well. The Bernoulli trials process (with the successes as arrivals) is the only discrete-time renewal process with this property, which is a consequence of the memoryless property of the geometric interarrival distribution. We can also use the indicator variables as the interarrival times. This may seem strange at first, but actually turns out to be useful. Consider the renewal process with interarrival sequence X. 1. The basic assumptions are satisfied and that the mean interarrival time is μ = p . 2. Y is the sequence of arrival times. 3. The number of arrivals at time 0 is U − 1 and the number of arrivals at time i ∈ N is U . 4. The number of arrivals in the interval [0, n] is V − 1 for n ∈ N . This gives the counting process. 5. The renewal function is m(n) = − 1 for n ∈ N . 1
+
i+1
n+1
n+1 p
The age processes are not very interesting for this renewal process. For n ∈ N (with probability 1), 1. C 2. R
n
=0
n
=1
The Moment Generating Function of the Counting Variables As an application of the last renewal process, we can show that the moment generating function of the counting variable N in an arbitrary renewal process is finite in an interval about 0 for every t ∈ [0, ∞). This implies that N has finite moments of all orders and in particular that m(t) < ∞ for every t ∈ [0, ∞). t
t
15.1.6
https://stats.libretexts.org/@go/page/10277
Suppose that X = (X , X , …) is the interarrival sequence for a renewal process. By the basic assumptions, there exists a > 0 such that p = P(X ≥ a) > 0 . We now consider the renewal process with interarrival sequence X = (X , X , …) , where X = a 1(X ≥ a) for i ∈ N . The renewal process with interarrival sequence X is just like the renewal process with Bernoulli interarrivals, except that the arrival times occur at the points in the sequence (0, a, 2a, …), instead of (0, 1, 2, …). 1
2
a
a,i
i
+
a,1
a,2
a
For each t ∈ [0, ∞), N has finite moment generating function in an interval about 0, and hence N has moments of all orders at 0. t
t
Proof
The Poisson Process The Poisson process, named after Simeon Poisson, is the most important of all renewal processes. The Poisson process is so important that it is treated in a separate chapter in this project. Please review the essential properties of this process: Properties of the Poisson process with rate r ∈ (0, ∞). 1. The interarrival times have an exponential distribution with rate parameter r. Thus, the basic assumptions above are satisfied and the mean interarrival time is μ = 1/r . 2. The exponential distribution is the only distribution with the memoryless property on [0, ∞). 3. The time of the n th arrival T has the gamma distribution with shape parameter n and rate parameter r. 4. The counting process N = (N : t ≥ 0) has stationary, independent increments and N has the Poisson distribution with parameter rt for t ∈ [0, ∞). 5. In particular, the renewal function is m(t) = rt for t ∈ [0, ∞). Hence, the renewal measure is a multiple of the standard length measure (Lebesgue measure) on [0, ∞). n
t
t
Consider again the Poisson process with rate parameter r. For t ∈ [0, ∞), 1. The current life and remaining life at time t are independent. 2. The remaining life at time t has the same distribution as an interarrival time X, namely the exponential distribution with rate parameter r. 3. The current life at time t has a truncated exponential distribution with parameters t and r: P(Ct ≥ s) = {
e
−rs
0,
,
0 ≤s ≤t
(15.1.22)
s >t
Proof The Poisson process starts over, independently of the past, not only at the arrival times, but at fixed times t ∈ [0, ∞) as well. The Poisson process is the only renewal process with this property, which is a consequence of the memoryless property of the exponential interarrival distribution. Run the Poisson experiment 1000 times for various values of the parameters t and r. Compare the empirical distribution of the counting variable to the true distribution. Run the gamma experiment 1000 times for various values of the parameters n and r. Compare the empirical distribution of the arrival time to the true distribution.
Simulation Exercises Open the renewal experiment and set t = 10 . For each of the following interarrival distributions, run the simulation 1000 times and note the shape and location of the empirical distribution of the counting variable. Note also the mean of the interarrival distribution in each case. 1. The continuous uniform distribution on the interval [0, 1] (the standard uniform distribution). 2. the discrete uniform distribution starting at a = 0 , with step size h = 0.1 , and with n = 10 points. 3. The gamma distribution with shape parameter k = 2 and scale parameter b = 1 . 4. The beta distribution with left shape parameter a = 3 and right shape parameter b = 2 . 5. The exponential-logarithmic distribution with shape parameter p = 0.1 and scale parameter b = 1 .
15.1.7
https://stats.libretexts.org/@go/page/10277
6. The Gompertz distribution with shape paraemter a = 1 and scale parameter b = 1 . 7. The Wald distribution with mean μ = 1 and shape parameter λ = 1 . 8. The Weibull distribution with shape parameter k = 2 and scale parameter b = 1 . This page titled 15.1: Introduction is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
15.1.8
https://stats.libretexts.org/@go/page/10277
15.2: Renewal Equations Many quantities of interest in the study of renewal processes can be described by a special type of integral equation known as a renewal equation. Renewal equations almost always arise by conditioning on the time of the first arrival and by using the defining property of a renewal process—the fact that the process restarts at each arrival time, independently of the past. However, before we can study renewal equations, we need to develop some additional concepts and tools involving measures, convolutions, and transforms. Some of the results in the advanced sections on measure theory, general distribution functions, the integral with respect to a measure, properties of the integral, and density functions are needed for this section. You may need to review some of these topics as necessary. As usual, we assume that all functions and sets that are mentioned are measurable with respect to the appropriate σ-algebras. In particular, [0, ∞) which is our basic temporal space, is given the usual Borel σ-algebra generated by the intervals.
Measures, Integrals, and Transforms Distribution Functions and Positive Measures Recall that a distribution function on [0, ∞) is a function G : [0, ∞) → [0, ∞) that is increasing and continuous from the right. The distribution function G defines a positive measure on [0, ∞), which we will also denote by G, by means of the formula G[0, t] = G(t) for t ∈ [0, ∞).
Figure 15.2.1 : G(t) is the cumulative measure at t
Hopefully, our notation will not cause confusion and it will be clear from context whether G refers to the positive measure (a set function) or the distribution function (a point function). More generally, if a, b ∈ [0, ∞) and a ≤ b then G(a, b] = G(b) − G(a) . Note that the positive measure associated with a distribution function is locally finite in the sense that G(A) < ∞ is A ⊂ [0, ∞) is bounded. Of course, if A is unbounded, G(A) may well be infinite. The basic structure of a distribution function and its associated positive measure occurred several times in our preliminary discussion of renewal processes: Distributions associated with a renewal process. 1. The distribution function F of the interarrival times defines a probability measure on [0, ∞) 2. The counting process N defines a (random) counting measure on [0, ∞) 3. the renewal function M defines a (deterministic) positive measure on [0, ∞) Suppose again that G is a distribution function on [0, ∞). Recall that the integral associated with the positive measure G is also called the Lebesgue-Stieltjes integral associated with the distribution function G (named for Henri Lebesgue and Thomas Stieltjes). If f : [0, ∞) → R and A ⊆ [0, ∞) (measurable of course), the integral of f over A (if it exists) is denoted ∫
f (t) dG(t)
(15.2.1)
A t
∞
We use the more conventional ∫ f (x) dG(x) for the integral over [0, t] and ∫ f (x) dG(x) for the integral over [0, ∞). On the other hand, ∫ f (x) dG(x) means the integral over (s, t] for s < t , and ∫ f (x) dG(x) means the integral over (s, ∞) . Thus, the additivity of the integral over disjoint domains holds, as it must. For example, for t ∈ [0, ∞), 0
0
t
∞
s
s
∞
∫ 0
t
f (x) dG(x) = ∫
∞
f (x) dG(x) + ∫
0
f (x) dG(x)
(15.2.2)
t
This notation would be ambiguous without the clarification, but is consistent with how the measure works: G[0, t] = G(t) for t ≥ 0 , G(s, t] = G(t) − G(s) for 0 ≤ s < t , etc. Of course, if G is continuous as a function, so that G is also continuous as a measure, then none of this matters—the integral over an interval is the same whether or not endpoints are included. . The following definition is a natural complement to the locally finite property of the positive measures that we are considering. A function f
: [0, ∞) → R
is locally bounded if it is measurable and is bounded on [0, t] for each t ∈ [0, ∞).
15.2.1
https://stats.libretexts.org/@go/page/10278
The locally bounded functions form a natural class for which our integrals of interest exist. Suppose that G is a distribution function on [0, ∞) and f g(t) = ∫ f (s) dG(s) is also locally bounded.
: [0, ∞) → R
is locally bounded. Then g : [0, ∞) → R defined by
t
0
Proof Note that if f and g are locally bounded, then so are f + g and f g. If f is increasing on [0, ∞) then f is locally bounded, so in particular, a distribution function on [0, ∞) is locally bounded. If f is continuous on [0, ∞) then f is locally bounded. Similarly, if G and H are distribution functions on [0, ∞) and if c ∈ (0, ∞), then G + H and cG are also distribution functions on [0, ∞). Convolution, which we consider next, is another way to construct new distributions on [0, ∞) from ones that we already have.
Convolution The term convolution means different things in different settings. Let's start with the definition we know, the convolution of probability density functions, on our space of interest [0, ∞). Suppose that X and Y are independent random variables with values in [0, ∞) and with probability density functions f and g , respectively. Then X + Y has probability density function f ∗ g given as follows, in the discrete and continuous cases, respectively (f ∗ g)(t) = ∑ f (t − s)g(s)
(15.2.4)
s∈[0,t] t
(f ∗ g)(t) = ∫
f (t − s)g(s) ds
(15.2.5)
0
In the discrete case, it's understood that t is a possible value of X + Y , and the sum is over the countable collection of s ∈ [0, t] with s a value of X and t − s a value of Y . Often in this case, the random variables take values in N, in which case the sum is simply over the set {0, 1, … , t} for t ∈ N . The discrete and continuous cases could be unified by defining convolution with respect to a general positive measure on [0, ∞). Moreover, the definition clearly makes sense for functions that are not necessarily probability density functions. Suppose that f , g : [0, ∞) → R ae locally bounded and that H is a distribution function on [0, ∞). The convolution of f and g with respect to H is the function on [0, ∞) defined by t
t ↦ ∫
f (t − s)g(s) dH (s)
(15.2.6)
0
If f and g are probability density functions for discrete distributions on a countable set C ⊆ [0, ∞) and if H is counting measure on C , we get discrete convolution, as above. If f and g are probability density functions for continuous distributions on [0, ∞) and if H is Lebesgue measure, we get continuous convolution, as above. Note however, that if g is nonnegative then G(t) = ∫ g(s) dH (s) for t ∈ [0, ∞) defines another distribution function on [0, ∞), and the convolution integral above is simply ∫ f (t − s) dG(s) . This motivates our next version of convolution, the one that we will use in the remainder of this section. t
0
t
0
Suppose that f : [0, ∞) → R is locally bounded and that G is a distribution function on function f with the distribution G is the function f ∗ G defined by
. The convolution of the
[0, ∞)
t
(f ∗ G)(t) = ∫
f (t − s) dG(s),
t ∈ [0, ∞)
(15.2.7)
0
Note that if F and G are distribution functions on [0, ∞), the convolution F ∗ G makes sense, with F simply as a function and G as a distribution function. The result is another distribution function. Moreover in this case, the operation is commutative. If F and G are distribution functions on [0, ∞) then F
∗G
is also a distribution function on [0, ∞), and F
∗ G = G∗ F
Proof
15.2.2
https://stats.libretexts.org/@go/page/10278
If F and G are probability distribution functions corresponding to independent random variables X and Y with values in [0, ∞), then F ∗ G is the probabiltiy distribution function of X + Y . Suppose now that f : [0, ∞) → R is locally bounded and that G and H are distribution functions on [0, ∞). From the previous result, both (f ∗ G) ∗ H and f ∗ (G ∗ H ) make sense. Fortunately, they are the same so that convolution is associative. Suppose that f
: [0, ∞) → R
is locally bounded and that G and H are distribution functions on [0, ∞). Then (f ∗ G) ∗ H = f ∗ (G ∗ H )
(15.2.9)
Proof Finally, convolution is a linear operation. That is, convolution preserves sums and scalar multiples, whenever these make sense. Suppose that f ,
g : [0, ∞) → R
are locally bounded, H is a distribution function on [0, ∞), and c ∈ R . Then
1. (f + g) ∗ H = (f ∗ H ) + (g ∗ H ) 2. (cf ) ∗ H = c(f ∗ H ) Proof Suppose that f
: [0, ∞) → R
is locally bounded, G and H are distribution functions on [0, ∞), and that c ∈ (0, ∞). Then
1. f ∗ (G + H ) = (f ∗ G) + (f ∗ H ) 2. f ∗ (cG) = c(f ∗ G) Proof
Laplace Transforms Like convolution, the term Laplace transform (named for Pierre Simon Laplace of course) can mean slightly different things in different settings. We start with the usual definition that you may have seen in your study of differential equations or other subjects: The Laplace transform of a function integral exists in R:
is the function
f : [0, ∞) → R
ϕ
defined as follows, for all
s ∈ (0, ∞)
for which the
∞
ϕ(s) = ∫
e
−st
f (t) dt
(15.2.11)
0
Suppose that f is nonnegative, so that the integral defining the transform exists in [0, ∞] for every s ∈ (0, ∞) . If ϕ(s ) < ∞ for some s ∈ (0, ∞) then ϕ(s) < ∞ for s ≥ s . The transform of a general function f exists (in R) if and only if the transform of |f | is finite at s . It follows that if f has a Laplace transform, then the transform ϕ is defined on an interval of the form (a, ∞) for some a ∈ (0, ∞) . The actual domain is of very little importance; the main point is that the Laplace transform, if it exists, will be defined for all sufficiently large s . Basically, a nonnegative function will fail to have a Laplace transform if it grows at a “hyperexponential rate” as t → ∞ . 0
0
0
We could generalize the Laplace transform by replacing the Riemann or Lebesgue integral with the integral over a positive measure on [0, ∞). Suppose that that G is a distribution on [0, ∞). The Laplace transform of given below, defined for all s ∈ (0, ∞) for which the integral exists in R:
f : [0, ∞) → R
with respect to
G
is the function
∞
s ↦ ∫
e
−st
f (t) dG(t)
(15.2.12)
0
t
However, as before, if f is nonnegative, then H (t) = ∫ f (x) dG(x) for t ∈ [0, ∞) defines another distribution function, and the previous integral is simply ∫ e dH (t) . This motivates the definiton for the Laplace transform of a distribution. ∞
0
−st
0
The Laplace transform of a distribution integral is finite:
F
on
[0, ∞)
is the function
Φ
defined as follows, for all
s ∈ (0, ∞)
for which the
∞
Φ(s) = ∫
e
−st
dF (t)
(15.2.13)
0
15.2.3
https://stats.libretexts.org/@go/page/10278
Once again if F has a Laplace transform, then the transform will be defined for all sufficiently large s ∈ (0, ∞) . We will try to be explicit in explaining which of the Laplace transform definitions is being used. For a generic function, the first definition applies, and we will use a lower case Greek letter. If the function is a distribution function, either definition makes sense, but it is usually the the latter that is appropriate, in which case we use an upper case Greek letter. Fortunately, there is a simple relationship between the two. Suppose that F is a distribution function on [0, ∞). Let Laplace transform of the function F . Then Φ(s) = sϕ(s) .
Φ
denote the Laplace transform of the distribution
F
and
ϕ
the
Proof For a probability distribution, there is also a simple relationship between the Laplace transform and the moment generating function. Suppose that X is a random variable with values in [0, ∞) and with probability distribution function F . The Laplace transform Φ and the moment generating function Γ of the distribution F are given as follows, and so Φ(s) = Γ(−s) for all s ∈ (0, ∞) . ∞
Φ(s) = E (e
−sX
) =∫
e
−st
dF (t)
(15.2.16)
0 ∞
Γ(s) = E (e
sX
) =∫
e
st
dF (t)
(15.2.17)
0
In particular, a probability distribution F on [0, ∞) always has a Laplace transform F (0) < 1 (so that X is not deterministically 0), then Φ(s) < 1 for s ∈ (0, ∞) .
Φ
, defined on
. Note also that if
(0, ∞)
Laplace transforms are important for general distributions on [0, ∞) for the same reasons that moment generating functions are important for probability distributions: the transform of a distribution uniquely determines the distribution, and the transform of a convolution is the product of the corresponding transforms (and products are much nicer mathematically than convolutions). The following theorems give the essential properties of Laplace transforms. We assume that the transforms exist, of course, and it should be understood that equations involving transforms hold for sufficiently large s ∈ (0, ∞) . Suppose that F and G are distributions on sufficiently large, then G = H
[0, ∞)
with Laplace transforms
In the case of general functions on [0, ∞), the conclusion is that Laplace transform is a linear operation. Suppose that f ,
g : [0, ∞) → R
f =g
Φ
and
Γ
, respectively. If
except perhaps on a subset of
Φ(s) = Γ(s)
[0, ∞)
for
s
of measure 0. The
have Laplace transforms ϕ and γ, respectively, and c ∈ R then
1. f + g has Laplace transform ϕ + γ 2. cf has Laplace transform cϕ Proof The same properties holds for distributions on [0, ∞) with c ∈ (0, ∞). Integral transforms have a smoothing effect. Laplace transforms are differentiable, and we can interchange the derivative and integral operators. Suppose that f
: [0, ∞) → R
has Lapalce transform ϕ . Then ϕ has derivatives of all orders and ∞ (n)
ϕ
(s) = ∫
n
n
(−1 ) t e
−st
f (t) dt
(15.2.18)
0
Restated, (−1) ϕ is the Laplace transform of the function Laplace transform turns convolution into products. n
(n)
n
t ↦ t f (t)
. Again, one of the most important properties is that the
Suppose that f : [0, ∞) → R is locally bounded with Laplace transform ϕ , and that G is a distribution function on [0, ∞) with Laplace transform Γ. Then f ∗ G has Laplace transform ϕ ⋅ Γ . Proof
15.2.4
https://stats.libretexts.org/@go/page/10278
If F and G are distributions on [0, ∞), then so is F ∗ G . The result above applies, of course, with F and F ∗ G thought of as functions and G as a distribution, but multiplying through by s and using the theorem above, it's clear that the result is also true with all three as distributions.
Renewal Equations and Their Solutions Armed with our new analytic machinery, we can return to the study of renewal processes. Thus, suppose that we have a renewal process with interarrival sequence X = (X , X , …), arrival time sequence T = (T , T , …) , and counting process N = { N : t ∈ [0, ∞)} . As usual, let F denote the common distribution function of the interarrival times, and let M denote the renewal function, so that M (t) = E(N ) for t ∈ [0, ∞). Of course, the probability distribution function F defines a probability measure on [0, ∞), but as noted earlier, M is also a distribution functions and so defines a positive measure on [0, ∞). Recall that F = 1 − F is the right distribution function (or reliability function) of an interarrival time. 1
2
0
1
t
t
c
The distributions of the arrival times are the convolution powers of F . That is, F
n
=F
∗n
=F ∗F ∗⋯∗F
.
Proof The next definition is the central one for this section. Suppose that a : [0, ∞) → R is locally bounded. An integral equation of the form u = a+u ∗ F
(15.2.22)
for an unknown function u : [0, ∞) → R is called a renewal equation for u. Often u(t) = E(U ) where {U : t ∈ [0, ∞)} is a random process of interest associated with the renewal process. The renewal equation comes from conditioning on the first arrival time T = X , and then using the defining property of the renewal process— the fact that the process starts over, interdependently of the past, at the arrival time. Our next important result illustrates this. t
t
1
1
Renewal equations for M and F : 1. M = F + M ∗ F 2. F = M − F ∗ M Proof Thus, the renewal function itself satisfies a renewal equation. Of course, we already have a “formula” for M , namely M =∑ F . However, sometimes M can be computed more easily from the renewal equation directly. The next result is the transform version of the previous result: ∞
n=1
n
The distributions F and M have Laplace transfroms Φ and Γ, respectively, that are related as follows: Φ
Γ
Γ =
,
Φ =
1 −Φ
(15.2.25) Γ+1
Proof from the renewal equation Proof from convolution In particular, the renewal distribution M always has a Laplace transform. The following theorem gives the fundamental results on the solution of the renewal equation. Suppose that u = a+u ∗ F
is locally bounded. Then the unique locally bounded solution to the renewal equation is u = a + a ∗ M .
a : [0, ∞) → R
Direct proof Proof from Laplace transforms Returning to the renewal equations for M and F above, we now see that the renewal function M completely determines the renewal process: from M we can obtain F , and everything is ultimately constructed from the interarrival times. Of course, this is also clear from the Laplace transform result above which gives simple algebraic equations for each transform in terms of the other.
15.2.5
https://stats.libretexts.org/@go/page/10278
The Distribution of the Age Variables Let's recall the definition of the age variables. A deterministic time t ∈ [0, ∞) falls in the random renewal interval [T , T ). The current life (or age) at time t is C = t − T , the remaining life at time t is R = T − t , and the total life at time t is L =T −T . In the usual reliability setting, C is the age of the device that is in service at time t , while R is the time until that device fails, and L is the total lifetime of the device. Nt
t
t
Nt +1
Nt
t
Nt
Nt +1
Nt +1
t
t
t
For t,
, let
y ∈ [0, ∞)
ry (t) = P(Rt > y) = P (N (t, t + y] = 0)
(15.2.28)
and let F (t) = F (t + y) . Note that y ↦ r (t) is the right distribution function of R . We will derive and then solve a renewal equation for r by conditioning on the time of the first arrival. We can then find integral equations that describe the distribution of the current age and the joint distribution of the current and remaining ages. c y
c
y
t
y
For y ∈ [0, ∞), r satisfies the renewal equation r y
y
c
= Fy + ry ∗ F
and hence for t ∈ [0, ∞),
t
P(Rt > y) = F
c
(t + y) + ∫
F
c
(t + y − s) dM (s),
y ≥0
(15.2.29)
0
Proof We can now describe the distribution of the current age. For t ∈ [0, ∞), t−x
P(Ct ≥ x) = F
c
(t) + ∫
F
c
(t − s) dM (s),
x ∈ [0, t]
(15.2.32)
0
Proof Finally we get the joint distribution of the current and remaining ages. For t ∈ [0, ∞), t−x
P(Ct ≥ x, Rt > y) = F
c
(t + y) + ∫
F
c
(t + y − s) dM (s),
x ∈ [0, t], y ∈ [0, ∞)
(15.2.33)
0
Proof
Examples and Special Cases Uniformly Distributed Interarrivals Consider the renewal process with interarrival times uniformly distributed on [0, 1]. Thus the distribution function of an interarrival time is F (x) = x for 0 ≤ x ≤ 1 . The renewal function M can be computed from the general renewal equation for M by successively solving differential equations. The following exercise give the first two cases. On the interval [0, 2], show that M is given as follows: 1. M (t) = e − 1 for 0 ≤ t ≤ 1 2. M (t) = (e − 1) − (t − 1)e t
t
t−1
for 1 ≤ t ≤ 2
Solution
15.2.6
https://stats.libretexts.org/@go/page/10278
Figure 15.2.2 : The graph of M on the interval [0, 2]
Show that the Laplace transform Φ of the interarrival distribution F and the Laplace transform Γ of the renewal distribution M are given by 1 −e Φ(s) =
−s
1 −e
−s
, Γ(s) = s
s−1 +e
−s
;
s ∈ (0, ∞)
(15.2.34)
Solution Open the renewal experiment and select the uniform interarrival distribution on the interval [0, 1]. For each of the following values of the time parameter, run the experiment 1000 times and note the shape and location of the empirical distribution of the counting variable. 1. t = 5 2. t = 10 3. t = 15 4. t = 20 5. t = 25 6. t = 30
The Poisson Process Recall that the Poisson process has interarrival times that are exponentially distributed with rate parameter r > 0 . Thus, the interarrival distribution function F is given by F (x) = 1 − e for x ∈ [0, ∞). The following exercises give alternate proofs of fundamental results obtained in the Introduction. −rx
Show that the renewal function M is given by M (t) = rt for t ∈ [0, ∞) 1. Using the renewal equation 2. Using Laplace transforms Solution Show that the current and remaining life at time t ≥ 0 satisfy the following properties: 1. C and R are independent. 2. R has the same distribution as an interarrival time, namely the exponential distribution with rate parameter r. 3. C has a truncated exponential distribution with parameters t and r: t
t
t
t
P(Ct ≥ x) = {
e
−rx
0,
,
0 ≤x ≤t
(15.2.40)
x >t
Solution
Bernoulli Trials Consider the renewal process for which the interarrival times have the geometric distribution with parameter p. Recall that the probability density function is n−1
f (n) = (1 − p )
p,
15.2.7
n ∈ N+
(15.2.42)
https://stats.libretexts.org/@go/page/10278
The arrivals are the successes in a sequence of Bernoulli trials. The number of successes Y in the first n trials is the counting variable for n ∈ N . The renewal equations in this section can be used to give alternate proofs of some of the fundamental results in the Introduction. n
Show that the renewal function is M (n) = np for n ∈ N 1. Using the renewal equation 2. Using Laplace transforms Proof Show that the current and remaining life at time n ∈ N satisfy the following properties:. 1. C and R are independent. 2. R has the same distribution as an interarrival time, namely the geometric distribution with parameter p. 3. C has a truncated geometric distribution with parameters n and p: n
n
n
n
j
P(Cn = j) = {
p(1 − p ) ,
j ∈ {0, 1, … , n − 1}
n
(1 − p ) ,
(15.2.50)
j=n
Solution
A Gamma Interarrival Distribution Consider the renewal process whose interarrival distribution Thus
F
is gamma with shape parameter
F (t) = 1 − (1 + rt)e
−rt
,
t ∈ [0, ∞)
2
and rate parameter
r ∈ (0, ∞)
.
(15.2.52)
Recall also that F is the distribution of the sum of two independent random variables, each having the exponential distribution with rate parameter r. Show that the renewal distribution function M is given by 1 M (t) = −
1 +
4
1 rt +
2
e
−2rt
,
t ∈ [0, ∞)
(15.2.53)
4
Solution Note that M (t) ≈ −
1 4
+
1 2
rt
as t → ∞ .
Figure 15.2.3 : The graph of M on the interval [0, 5] when r = 1
Open the renewal experiment and select the gamma interarrival distribution with shape parameter k = 2 and scale parameter b = 1 (so the rate parameter r = is also 1). For each of the following values of the time parameter, run the experiment 1000 times and note the shape and location of the empirical distribution of the counting variable. 1 b
1. t = 5 2. t = 10 3. t = 15 4. t = 20 5. t = 25 6. t = 30
15.2.8
https://stats.libretexts.org/@go/page/10278
This page titled 15.2: Renewal Equations is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
15.2.9
https://stats.libretexts.org/@go/page/10278
15.3: Renewal Limit Theorems We start with a renewal process as constructed in the introduction. Thus, X = (X , X , …) is the sequence of interarrival times. These are independent, identically distributed, nonnegative variables with common distribution function F (satisfying F (0) < 1 ) and common mean μ . When μ = ∞ , we let 1/μ = 0 . When μ < ∞ , we let σ denote the common standard deviation. Recall also that F = 1 − F is the right distribution function (or reliability function). Then, T = (T , T , …) is the arrival time sequence, where T = 0 and 1
2
c
0
1
0
n
Tn = ∑ Xi
(15.3.1)
i=1
is the time of the n th arrival for n ∈ N . Finally, N +
= { Nt : t ∈ [0, ∞)}
is the counting process, where for t ∈ [0, ∞),
∞
Nt = ∑ 1(Tn ≤ t)
(15.3.2)
n=1
is the number of arrivals in [0, t]. The renewal function M is defined by M (t) = E (N
t)
for t ∈ [0, ∞).
We noted earlier that the arrival time process and the counting process are inverses, in a sense. The arrival time process is the partial sum process for a sequence of independent, identically distributed variables. Thus, it seems reasonable that the fundamental limit theorems for partial sum processes (the law of large numbers and the central limit theorem theorem), should have analogs for the counting process. That is indeed the case, and the purpose of this section is to explore the limiting behavior of renewal processes. The main results that we will study, known appropriately enough as renewal theorems, are important for other stochastic processes, particularly Markov chains.
Basic Theory The Law of Large Numbers Our first result is a strong law of large numbers for the renewal counting process, which comes as you might guess, from the law of large numbers for the sequence of arrival times. If μ < ∞ then N
t /t
→ 1/μ
as t → ∞ with probability 1.
Proof Thus, 1/μ is the limiting average rate of arrivals per unit time. Open the renewal experiment and set t = 50 . For a variety of interarrival distributions, run the simulation 1000 times and note how the empirical distribution is concentrated near t/μ.
The Central Limit Theorem Our next goal is to show that the counting variable N is asymptotically normal. t
Suppose that μ and σ are finite, and let Nt − t/μ Zt =
− − − − , 3 σ √t/μ
t >0
(15.3.5)
The distribution of Z converges to the standard normal distribution as t → ∞ . t
Proof Open the renewal experiment and set t = 50 . For a variety of interarrival distributions, run the simulation 1000 times and note − − − − the “normal” shape of the empirical distribution. Compare the empirical mean and standard deviation to t/μ and σ √t/μ , respectively 3
15.3.1
https://stats.libretexts.org/@go/page/10279
The Elementary Renewal Theorem The elementary renewal theorem states that the basic limit in the law of large numbers above holds in mean, as well as with probability 1. That is, the limiting mean average rate of arrivals is 1/μ. The elementary renewal theorem is of fundamental importance in the study of the limiting behavior of Markov chains, but the proof is not as easy as one might hope. In particular, recall that convergence with probability 1 does not imply convergence in mean, so the elementary renewal theorem does not follow from the law of large numbers. M (t)/t → 1/μ
as t → ∞ .
Proof Open the renewal experiment and set t = 50 . For a variety of interarrival distributions, run the experiment 1000 times and − − − − once again compare the empirical mean and standard deviation to t/μ and σ √t/μ , respectively. 3
The Renewal Theorem The renewal theorem states that the expected number of renewals in an interval is asymptotically proportional to the length of the interval; the proportionality constant is 1/μ. The precise statement is different, depending on whether the renewal process is arithmetic or not. Recall that for an arithmetic renewal process, the interarrival times take values in a set of the form {nd : n ∈ N} for some d ∈ (0, ∞) , and the largest such d is the span of the distribution. For h > 0 , M (t, t + h] →
h μ
as t → ∞ in each of the following cases:
1. The renewal process is non-arithmetic 2. The renewal process is arithmetic with span d , and h is a multiple of d The renewal theorem is also known as Blackwell's theorem in honor of David Blackwell. The final limit theorem we will study is the most useful, but before we can state the theorem, we need to define and study the class of functions to which it applies.
Direct Riemann Integration Recall that in the ordinary theory of Riemann integration, the integral of a function on the interval [0, t] exists if the upper and lower Riemann sums converge to a common number as the partition is refined. Then, the integral of the function on [0, ∞) is defined to be the limit of the integral on [0, t], as t → ∞ . For our new definition, a function is said to be directly Riemann integrable if the lower and upper Riemann sums on the entire unbounded interval [0, ∞) converge to a common number as the partition is refined, a more restrictive definition than the usual one. Suppose that
g : [0, ∞) → [0, ∞)
. For
Mk (g, h) = sup{g(t) : t ∈ [kh, (k + 1)h)
h ∈ [0, ∞) and k ∈ N , let m (g, h) = inf{g(t) : t ∈ [kh, (k + 1)h)} . The lower and upper Riemann sums of g on [0, ∞) corresponding to h are k
∞
and
∞
Lg (h) = h ∑ mk (g, h),
Ug (h) = h ∑ Mk (g, h)
k=0
(15.3.12)
k=0
The sums exist in [0, ∞] and satisfy the following properties: 1. L 2. L 3. U
for h > 0 (h) increases as h decreases (h) decreases as h decreases
g (h) g g
≤ Ug (h)
It follows that lim L (h) and lim U (h) exist in limits are finite and agree is what we're after. h↓0
g
h↓0
g
[0, ∞]
and
limh↓0 Lg (h) ≤ limh↓0 Ug (h)
A function g : [0, ∞) → [0, ∞) is directly Riemann integrable if U
g (h)
0 and (15.3.13)
h↓0
.
15.3.2
https://stats.libretexts.org/@go/page/10279
Ordinary Riemann integrability on [0, ∞) allows functions that are unbounded and oscillate wildly as t → ∞ , and these are the types of functions that we want to exclude for the renewal theorems. The following result connects ordinary Riemann integrability with direct Riemann integrability. If g : [0, ∞) → [0, ∞) is integrable (in the ordinary Riemann sense) on [0, t] for every t ∈ [0, ∞) and if U h ∈ (0, ∞) then g is directly Riemann integrable.
g (h)
x) →
∫ μ
F
c
(y) dy as t → ∞,
x ∈ [0, ∞)
(15.3.17)
(y) dy as t → ∞,
x ∈ [0, ∞)
(15.3.19)
x
Proof If the renewal process is aperiodic, then ∞
1 P(Rt > x) →
∫ μ
F
c
x
Proof The current and remaining life have the same limiting distribution. In particular, x
1 lim P(Ct ≤ x) = lim P(Rt ≤ x) = t→∞
t→∞
∫ μ
F
c
(y) dy,
x ∈ [0, ∞)
(15.3.21)
0
Proof
15.3.3
https://stats.libretexts.org/@go/page/10279
The fact that the current and remaining age processes have the same limiting distribution may seem surprising at first, but there is a simple intuitive explanation. After a long period of time, the renewal process looks just about the same backward in time as forward in time. But reversing the direction of time reverses the rolls of current and remaining age.
Examples and Special Cases The Poisson Process Recall that the Poisson process, the most important of all renewal processes, has interarrival times that are exponentially distributed with rate parameter r > 0 . Thus, the interarrival distribution function is F (x) = 1 − e for x ≥ 0 and the mean interarrival time is μ = 1/r . −rx
Verify each of the following directly: 1. The law of large numbers for the counting process. 2. The central limit theorem for the counting process. 3. The elementary renewal theorem. 4. The renewal theorem.
Bernoulli Trials Suppose that X = (X , X , …) is a sequence of Bernoulli trials with success parameter p ∈ (0, 1). Recall that X is a sequence of independent, identically distributed indicator variables with p = P(X = 1) . We have studied a number of random processes derived from X: 1
2
Random processes associated with Bernoulli trials. 1. Y = (Y , Y , …) where Y the number of successes in the first n trials. The sequence Y is the partial sum process associated with X. The variable Y has the binomial distribution with parameters n and p. 2. U = (U , U , …) where U the number of trials needed to go from success number n − 1 to success number n . These are independent variables, each having the geometric distribution on N with parameter p. 3. V = (V , V , …) where V is the trial number of success n . The sequence V is the partial sum process associated with U . The variable V has the negative binomial distribution with parameters n and p. 0
1
n
n
1
2
n
+
0
1
n
n
Consider the renewal process with interarrival sequence U . Thus, μ = 1/p is the mean interarrival time, and Y is the counting process. Verify each of the following directly: 1. The law of large numbers for the counting process. 2. The central limit theorem for the counting process. 3. The elementary renewal theorem. Consider the renewal process with interarrival sequence X. Thus, the mean interarrival time is arrivals in the interval [0, n] is V − 1 for n ∈ N . Verify each of the following directly:
μ =p
and the number of
n+1
1. The law of large numbers for the counting process. 2. The central limit theorem for the counting process. 3. The elementary renewal theorem. This page titled 15.3: Renewal Limit Theorems is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
15.3.4
https://stats.libretexts.org/@go/page/10279
15.4: Delayed Renewal Processes Basic Theory Preliminaries A delayed renewal process is just like an ordinary renewal process, except that the first arrival time is allowed to have a different distribution than the other interarrival times. Delayed renewal processes arise naturally in applications and are also found embedded in other random processes. For example, in a Markov chain (which we study in the next chapter), visits to a fixed state, starting in that state form the random times of an ordinary renewal process. But visits to a fixed state, starting in another state form a delayed renewal process. Suppose that X = (X , X , …) is a sequence of independent variables taking values in [0, ∞), with (X , X , …) identically distributed. Suppose also that P(X > 0) > 0 for i ∈ N . The stochastic process with X as the sequence of interarrival times is a delayed renewal process. 1
2
2
i
3
+
As before, the actual arrival times are the partial sums of X. Thus let n
Tn = ∑ Xi
(15.4.1)
i=1
so that T = 0 and T is the time of the n th arrival for counting T ): 0
n
. Also as before,
n ∈ {1, 2, …}
Nt
is the number of arrivals in
[0, t]
(not
0
∞
Nt = ∑ 1(Tn ≤ t) = max{n ∈ N : Tn ≤ t}
(15.4.2)
n=1
If we restart the clock at time T = X , we have an ordinary renewal process with interarrival sequence (X , X , …). We use some of the standard notation developed in the Introduction for this renewal process. In particular, F denotes the common distribution function and μ the common mean of X for i ∈ {2, 3, …}. Similarly F = F denotes the distribution function of the sum of n independent variables with distribution function F , and M denotes the renewal function: 1
1
2
3
∗n
i
n
∞
M (t) = ∑ Fn (t),
t ∈ [0, ∞)
(15.4.3)
n=1
On the other hand, we will let G denote the distribution function of X (the special interarrival time, different from the rest), and we will let G denote the distribution function of T for n ∈ N . As usual, F = 1 − F and G = 1 − G are the corresponding right-tail distribution functions. 1
c
n
Gn = G ∗ Fn−1 = Fn−1 ∗ G
n
c
+
for n ∈ N . +
Proof Finally, we will let U denote the renewal function for the delayed renewal process. Thus, U (t) = E(N arrivals in [0, t] for t ∈ [0, ∞).
t)
is the expected number of
The delayed renewal function satisfies ∞
U (t) = ∑ Gn (t),
t ∈ [0, ∞)
(15.4.4)
n=1
Proof The delayed renewal function U satisfies the equation U
= G+M ∗ G
; that is,
t
U (t) = G(t) + ∫
M (t − s) dG(s),
t ∈ [0, ∞)
(15.4.6)
0
15.4.1
https://stats.libretexts.org/@go/page/10280
Proof The delayed renewal function U satisfies the renewal equation U
= G+U ∗ F
; that is,
t
U (t) = G(t) + ∫
U (t − s) dF (s),
t ∈ [0, ∞)
(15.4.8)
0
Proof
Asymptotic Behavior In a delayed renewal process only the first arrival time is changed. Thus, it's not surprising that the asymptotic behavior of a delayed renewal process is the same as the asymptotic behavior of the corresponding regular renewal process. Our first result is the strong law of large numbers for the delayed renewal process. Nt /t → 1/μ
as t → ∞ with probability 1.
Proof Our next result is the elementary renewal theorem for the delayed renewal process. U (t)/t → 1/μ
as t → ∞ .
Next we have the renewal theorem for the delayed renwal process, also known as Blackwell's theorem, named for David Blackwell. For h > 0 , U (t, t + h] = U (t + h) − U (t) → h/μ
as t → ∞ in each of the following cases:
1. F is non-arithmetic 2. F is arithmetic with span d ∈ (0, ∞) , and h is a multiple of d . Finally we have the key renewal theorem for the delayed renewal process. Suppose that the renewal process is non-arithmetic and that g : [0, ∞) → [0, ∞) is directly Riemann integrable. Then t
(g ∗ U )(t) = ∫
∞
1 g(t − s) dU (s) →
∫ μ
0
g(x) dx as t → ∞
(15.4.11)
0
Stationary Point Processes Recall that a point process is a stochastic process that models a discrete set of random points in a measure space (S, S , λ). Often, of course, S ⊆ R for some n ∈ N and λ is the corresponding n -dimensional Lebesgue measure. The special cases S = N with counting measure and S = [0, ∞) with length measure are of particular interest, in part because renewal and delayed renewal processes give rise to point processes in these spaces. n
+
For a general point process on S , we use our standard notation and denote the number of random points A ∈ S by N (A) . There are a couple of natural properties that a point process may have. In particular, the process is said to be stationary if λ(A) = λ(B) implies that N (A) and N (B) have the same distribution for A, B ∈ S . In [0, ∞) the term stationary increments is often used, because the stationarity property means that for s, t ∈ [0, ∞) , the distribution of N (s, s + t] = N − N depends only on t . s+t
s
Consider now a regular renewal process. We showed earlier that the asymptotic distributions of the current life and remaining life are the same. Intuitively, after a very long period of time, the renewal process looks pretty much the same forward in time or backward in time. This suggests that if we make the renewal process into a delayed renewal process by giving the first arrival time this asymptotic distribution, then the resulting point process will be stationary. This is indeed the case. Consider the setting and notation of the preliminary subsection above. For the delayed renewal process, the point process N is stationary if and only if the initial arrival time has distribution function t
1 G(t) =
∫ μ
F
c
(s) ds,
t ∈ [0, ∞)
(15.4.12)
0
in which case the renewal function is U (t) = t/μ for t ∈ [0, ∞).
15.4.2
https://stats.libretexts.org/@go/page/10280
Proof
Examples and Applications Patterns in Multinomial Trials Suppose that L = (L , L , …) is a sequence of independent, identically distributed random variables taking values in a finite set S , so that L is a sequence of multinomial trials. Let f denote the common probability density function so that for a generic trial variable L, we have f (a) = P(L = a) for a ∈ S . We assume that all outcomes in S are actually possible, so f (a) > 0 for a ∈ S . 1
2
In this section, we interpret S as an alphabet, and we write the sequence of variables in concatenation form, L = L L ⋯ rather than standard sequence form. Thus the sequence is an infinite string of letters from our alphabet S . We are interested in the repeated occurrence of a particular finite substring of letters (that is, a “word” or “pattern”) in the infinite sequence. 1
2
So, fix a word a (again, a finite string of elements of S ), and consider the successive random trial numbers (T , T , …) where the word a is completed in L . Since the sequence L is independent and identically distributed, it seems reasonable that these variables are the arrival times of a renewal process. However there is a slight complication. An example may help. 1
2
Suppose that L is a sequence of Bernoulli trials (so S = {0, 1}). Suppose that the outcome of L is 101100101010001101000110 ⋯
1. For the word a = 001 note that T 2. For the word b = 010 , note that T
1 1
=7
,T ,T
2
=8
2
= 15
,T ,T
= 10
3 3
(15.4.19)
= 22 = 12
,T
4
= 19
In this example, you probably noted an important difference between the two words. For b , a suffix of the word (a proper substring at the end) is also a prefix of the word (a proper substring at the beginning. Word a does not have this property. So, once we “arrive” at b , there are ways to get to b again (taking advantage of the suffix-prefix) that do not exist starting from the beginning of the trials. On the other hand, once we arrive at a , arriving at a again is just like with a new sequence of trials. Thus we are lead to the following definition. Suppose that a is a finite word from the alphabet S . If no proper suffix of a is also a prefix, then a is simple. Otherwise, a is compound. ∞
Returning to the general setting, let T = 0 and then let X = T − T for n ∈ N . For k ∈ N , let N = ∑ 1(T ≤ k) . For occurrences of the word a , X = (X , X , …) is the sequence of interarrival times, T = (T , T , …) is the sequence of arrival times, and N = {N : k ∈ N} is the counting process. If a is simple, these form an ordinary renewal process. If a is compound, they form a delayed renewal process, since X will have a different distribution than (X , X , …). Since the structure of a delayed renewal process subsumes that of an ordinary renewal process, we will work with the notation above for the delayed process. In particular, let U denote the renewal function. Everything in this paragraph depends on the word a of course, but we have suppressed this in the notation. 0
n
1
n
n−1
+
k
2
0
n=1
n
1
k
1
Suppose a = a
2
3
, where a ∈ S for each i ∈ {1, 2, … , k}, so that a is a word of length k . Note that X takes values in is simple, this applies to the other interarrival times as well. If a is compound, the situation is more complicated X , X , … will have some minimum value j < k , but the possible values are positive integers, of course, and include {k + 1, k + 2, …} . In any case, the renewal process is arithmetic with span 1. Expanding the definition of the probability density function f , let 1 a2
{k, k + 1, …}
. If 2
⋯ ak
i
1
a 3
k
f (a) = ∏ f (ai )
(15.4.20)
i=1
so that
is the probability of forming a with k consecutive trials. Let μ(a) denote the common mean of X for , so μ(a) is the mean number of trials between occurrences of a . Let ν (a) = E(X ) , so that ν (a) is the mean time number of trials until a occurs for the first time. Our first result is an elegant connection between μ(a) and f (a) , which has a wonderfully simple proof from renewal theory. f (a)
n
n ∈ {2, 3, …}
1
If a is a word in S then
15.4.3
https://stats.libretexts.org/@go/page/10280
1 μ(a) =
(15.4.21) f (a)
Proof Our next goal is to compute ν (a) in the case that a is a compound word. Suppose that a is a compound word, and that b is the largest word that is a proper suffix and prefix of a . Then 1 ν (a) = ν (b) + μ(a) = ν (b) +
(15.4.22) f (a)
Proof By repeated use of the last result, we can compute the expected number of trials needed to form any compound word. Consider Bernoulli trials with success probability p ∈ (0, 1), and let q = 1 − p . For each of the following strings, find the expected number of trials between occurrences and the expected number of trials to the first occurrence. 1. a = 001 2. b = 010 3. c = 1011011 4. d = 11 ⋯ 1 (k times) Answer Recall that an ace-six flat die is a six-sided die for which faces 1 and 6 have probability probability each. Ace-six flat dice are sometimes used by gamblers to cheat.
1 4
each while faces 2, 3, 4, and 5 have
1 8
Suppose that an ace-six flat die is thrown repeatedly. Find the expected number of throws until the pattern occurs.
6165616
first
Solution Suppose that a monkey types randomly on a keyboard that has the 26 lower-case letter keys and the space key (so 27 keys). Find the expected number of keystrokes until the monkey produces each of the following phrases: 1. it was the best of times 2. to be or not to be Proof This page titled 15.4: Delayed Renewal Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
15.4.4
https://stats.libretexts.org/@go/page/10280
15.5: Alternating Renewal Processes Basic Theory Preliminaries An alternating renewal process models a system that, over time, alternates between two states, which we denote by 1 and 0 (so the system starts in state 1). Generically, we can imagine a device that, over time, alternates between on and off states. Specializing further, suppose that a device operates until it fails, and then is replaced with an identical device, which in turn operates until failure and is replaced, and so forth. In this setting, the times that the device is functioning correspond to the on state, while the replacement times correspond to the off state. (The device might actually be repaired rather than replaced, as long as the repair returns the device to pristine, new condition.) The basic assumption is that the pairs of random times successively spent in the two states form an independent, identically distributed sequence. Clearly the model of a system alternating between two states is basic and important, but moreover, such alternating processes are often found embedded in other stochastic processes. Let's set up the mathematical notation. Let U = (U , U , …) denote the successive lengths of time that the system is in state 1, and let V = (V , V , …) the successive lengths of time that the system is in state 0. So to be clear, the system starts in state 1 and remains in that state for a period of time U , then goes to state 0 and stays in this state for a period of time V , then back to state 1 for a period of time U , and so forth. Our basic assumption is that W = ((U , V ), (U , V ), …) is an independent, identically distributed sequence. It follows that U and V each are independent, identically distributed sequences, but U and V might well be dependent. In fact, V might be a function of U for n ∈ N . Let μ = E(U ) denote the mean of a generic time period U in state 1 and let ν = E(V ) denote the mean of a generic time period V in state 0. Let G denote the distribution function of a time period U in state 1, and as usual, let G = 1 − G denote the right distribution function (or reliability function) of U . 1
1
2
2
1
1
2
1
n
n
1
2
2
+
c
Clearly it's natural to consider returns to state 1 as the arrivals in a renewal process. Thus, let X = U + V for n ∈ N and consider the renewal process with interarrival times X = (X , X , …). Clearly this makes sense, since X is an independent, identically distributed sequence of nonnegative variables. For the most part, we will use our usual notation for a renewal process, so the common distribution function of X = U + V is denoted by F , the arrival time process is T = (T , T , …) , the counting process is {N : t ∈ [0, ∞)} , and the renewal function is M . But note that the mean interarrival time is now μ + ν . n
1
n
n
n
n
+
2
n
0
1
t
The renwal process associated with process.
W = ((U1 , V1 ), (U2 , V2 ), …)
as constructed above is known as an alternating renewal
The State Process Our interest is the state I of the system at time t ∈ [0, ∞), so I = {I : t ∈ [0, ∞)} is a stochastic process with state space {0, 1}. Clearly the stochastic processes W and I are equivalent in the sense that we can recover one from the other. Let p(t) = P(I = 1) , the probability that the device is on at time t ∈ [0, ∞). Our first main result is a renewal equation for the function p. t
t
t
The function p satisfies the renewal equation p = G
c
+p ∗ F
and hence p = G
c
c
+G
∗M
.
Proof We can now apply the key renewal theorem to get the asymptotic behavior of p. If the renewal process is non-arithmetic, then μ p(t) →
as t → ∞
(15.5.3)
μ+ν
Proof Thus, the limiting probability that the system is on is simply the ratio of the mean of an on period to the mean of an on-off period. It follows, of course, that ν P(It = 0) = 1 − p(t) →
15.5.1
as t → ∞
(15.5.5)
μ+ν
https://stats.libretexts.org/@go/page/10281
so in particular, the fact that the system starts in the on state makes no difference in the limit. We will return to the asymptotic behavior of the alternating renewal process in the next section on renewal reward processes.
Applications and Special Cases With a clever definition of on and off, many stochastic processes can be turned into alternating renewal processes, leading in turn to interesting limits, via the basic limit theorem above.
Age Processes The last remark applies in particular to the age processes of a standard renewal process. So, suppose that we have a renewal process with interarrival sequence X, arrival sequence T , and counting process N . As usual, let μ denote the mean and F the probability distribution function of an interarrival time, and let F = 1 − F denote the right distribution function (or reliability function). c
For t ∈ [0, ∞), recall that the current life, remaining life and total life at time t are Ct = t − TN , t
Rt = TN
t
+1
− t,
Lt = Ct + Rt = TN
t
+1
− TN
t
= XN
t
+1
(15.5.6)
respectively. In the usual terminology of reliability, C is the age of the device in service at time t , R is the time remaining until this device fails, and L is total life of the device. We will use limit theorem above to derive the limiting distributions these age processes. The limiting distributions were obtained earlier, in the section on renewal limit theorems, by a direct application of the key renewal theorem. So the results are not new, but the method of proof is interesting. t
t
t
If the renewal process is non-arithmetic then x
1 lim P(Ct ≤ x) = lim P(Rt ≤ x) =
t→∞
t→∞
∫ μ
F
c
(y) dy,
x ∈ [0, ∞)
(15.5.7)
0
Proof As we have noted before, the fact that the limiting distributions are the same is not surprising after a little thought. After a long time, the renewal process looks the same forward and backward in time, and reversing the arrow of time reverses the roles of current and remaining time. If the renewal process is non-arithmetic then x
1 lim P(Lt ≤ x) =
t→∞
∫ μ
y dF (y),
x ∈ [0, ∞)
(15.5.12)
0
Proof This page titled 15.5: Alternating Renewal Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
15.5.2
https://stats.libretexts.org/@go/page/10281
15.6: Renewal Reward Processes Basic Theory Preliminaries In a renewal reward process, each interarrival time is associated with a random variable that is generically thought of as the reward associated with that interarrival time. Our interest is in the process that gives the total reward up to time t . So let's set up the usual notation. Suppose that X = (X , X , …) are the interarrival times of a renewal process, so that X is a sequence of independent, identically distributed, nonnegative variables with common distribution function F and mean μ . As usual, we assume that F (0) < 1 so that the interarrival times are not deterministically 0, and in this section we also assume that μ < ∞ . Let 1
2
n
Tn = ∑ Xi ,
n ∈ N
(15.6.1)
i=1
so that T is the time of the n th arrival for n ∈ N n
+
and T
= (T0 , T1 , …)
is the arrival time sequence. Finally, Let
∞
Nt = ∑ 1(Tn ≤ t),
t ∈ [0, ∞)
(15.6.2)
n=1
so that N is the number of arrivals in [0, t] and N t ∈ [0, ∞) so that M is the renewal function. t
is the counting process. As usual, let M (t) = E (N
= { Nt : t ∈ [0, ∞)}
t)
for
Suppose now that Y = (Y , Y , …) is a sequence of real-valued random variables, where Y is thought of as the reward associated with the interarrival time X . However, the term reward should be interpreted generically since Y might actually be a cost or some other value associated with the interarrival time, and in any event, may take negative as well as positive values. Our basic assumption is that the interarrival time and reward pairs Z = ((X , Y ), (X , Y ), …) form an independent and identically distributed sequence. Recall that this implies that X is an IID sequence, as required by the definition of the renewal process, and that Y is also an IID sequence. But X and Y might well be dependent, and in fact Y might be a function of X for n ∈ N . Let ν = E(Y ) denote the mean of a generic reward Y , which we assume exists in R . 1
2
n
n
n
1
1
2
2
n
The stochastic process R = {R
t
: t ∈ [0, ∞)}
n
+
defined by Nt
Rt = ∑ Yi ,
t ∈ [0, ∞)
(15.6.3)
i=1
is the reward renewal process associated with Z . The function r given by r(t) = E(R
t)
for t ∈ [0, ∞) is the reward function.
As promised, R is the total reward up to time t ∈ [0, ∞). Here are some typical examples: t
The arrivals are customers at a store. Each customer spends a random amount of money. The arrivals are visits to a website. Each visitor spends a random amount of time at the site. The arrivals are failure times of a complex system. Each failure requires a random repair time. The arrivals are earthquakes at a particular location. Each earthquake has a random severity, a measure of the energy released. So R is a random sum of random variables for each t ∈ [0, ∞). In the special case that Y and X independent, the distribution of R is known as a compound distribution, based on the distribution of N and the distribution of a generic reward Y . Specializing further, if the renewal process is Poisson and is independent of Y , the process R is a compound Poisson process. t
t
t
Note that a renewal reward process generalizes an ordinary renewal process. Specifically, if Y = 1 for each n ∈ N , then R =N for t ∈ [0, ∞), so that the reward process simply reduces to the counting process, and then r reduces to the renewal function M . n
t
+
t
The Renewal Reward Theorem For t ∈ (0, ∞) , the average reward on the interval [0, t] is R /t, and the expected average reward on that interval is fundamental theorem on renewal reward processes gives the asymptotic behavior of these averages. t
15.6.1
r(t)/t
. The
https://stats.libretexts.org/@go/page/10282
The renewal reward theorem 1. R /t → ν /μ as t → ∞ with probability 1. 2. r(t)/t → ν /μ as t → ∞ t
Proof Part (a) generalizes the law of large numbers and part (b) generalizes elementary renewal theorem, for a basic renewal process. Once again, if Y = 1 for each n , then (a) becomes N /t → 1/μ as t → ∞ and (b) becomes M (t)/t → 1/μ as t → ∞ . It's not surprising then that these two theorems play a fundamental role in the proof of the renewal reward theorem. n
t
General Reward Processes The renewal reward process R = {R : t ∈ [0, ∞)} above is constant, taking the value ∑ Y , on the renewal interval [T , T ) for each n ∈ N . Effectively, the rewards are received discretely: Y at time T , an additional Y at time T , and so forth. It's possible to modify the construction so the rewards accrue continuously in time or in a mixed discrete/continuous manner. Here is a simple set of conditions for a general reward process. n
t
n
i=1
n+1
1
Suppose again that
i
1
2
2
is the sequence of interarrival times and rewards. A stochastic process (on our underlying probability space) is a reward process associated with Z if the following conditions
Z = ((X1 , Y1 ), (X2 , Y2 ), …)
V = { Vt : t ∈ [0, ∞)}
hold: n
1. V = ∑ Y for n ∈ N 2. V is between V and V Tn
i=1
i
t
Tn
Tn+1
for t ∈ (T
n,
Tn+1 )
and n ∈ N
In the continuous case, with nonnegative rewards (the most important case), the reward process will typically have the following form: Suppose that the rewards are nonnegative and that underlying probability space) with
U = { Ut : t ∈ [0, ∞)}
is a nonnegative stochastic process (on our
1. t ↦ U piecewise continous 2. ∫ U dt = Y for n ∈ N t
Tn+1
t
Tn
Let V
t
=∫
t
0
n+1
Us ds
for t ∈ [0, ∞). Then V
= { Vt : t ∈ [0, ∞)}
is a reward process associated with Z .
Proof Thus in this special case, the rewards are being accrued continuously and U is the rate at which the reward is being accrued at time t . So U plays the role of a reward density process. For a general reward process, the basic renewal reward theorem still holds. t
Suppose that V = {V : t ∈ [0, ∞)} is a reward process associated with Z = ((X for t ∈ [0, ∞) be the corresponding reward function.
1,
t
Y1 ), (X2 , Y2 ), …)
, and let v(t) = E (V
t)
1. V /t → ν /μ as t → ∞ with probability 1. 2. v(t)/t → ν /μ as t → ∞ . t
Proof Here is the corollary for a continuous reward process. Suppose that the rewards are positive, and consider the continuous reward process with density process U = { U : t ∈ [0, ∞)} as above. Let u(t) = E(U ) for t ∈ [0, ∞). Then t
1.
1
2.
1
t
t
∫
t
0
∫
t
0
Us ds →
t
ν μ
u(s) ds →
as t → ∞ with probability 1 ν μ
as t → ∞
Special Cases and Applications With a clever choice of the “rewards”, many interesting renewal processes can be turned into renewal reward processes, leading in turn to interesting limits via the renewal reward theorem.
15.6.2
https://stats.libretexts.org/@go/page/10282
Alternating Renewal Processes Recall that in an alternating renewal process, a system alternates between on and off states (starting in the on state). If we let U = (U , U , …) be the lengths of the successive time periods in which the system is on, and V = (V , V , …) the lengths of the successive time periods in which the system is off, then the basic assumptions are that ((U , V ), (U , V ), …) is an independent, identically distributed sequence, and that the variables X = U + V for n ∈ N form the interarrival times of a standard renewal process. Let μ = E(U ) denote the mean of a time period that the device is on, and ν = E(V ) the mean of a time period that the device is off. Recall that I denotes the state (1 or 0) of the system at time t ∈ [0, ∞), so that I = {I : t ∈ [0, ∞)} is the state process. The state probability function p is given by p(t) = P(I = 1) for t ∈ [0, ∞). 1
2
1
1
n
n
n
1
2
2
2
+
t
t
t
Limits for the alternating renewal process. 1.
1
2.
1
t
t
∫
t
0
∫
t
0
Is ds →
μ
as t → ∞ with probability 1
μ+ν μ
p(s) ds →
as t → ∞
μ+ν
Proof Thus, the asymptotic average time that the device is on, and the asymptotic mean average time that the device is on, are both simply the ratio of the mean of an on period to the mean of an on-off period. In our previous study of alternating renewal processes, the fundamental result was that in the non-arithmetic case, p(t) → μ/(μ + ν ) as t → ∞ . This result implies part (b) in the theorem above.
Age Processes Renewal reward processes can be used to derive some asymptotic results for the age processes of a standard renewal process So, suppose that we have a renewal process with interarrival sequence X, arrival sequence T , and counting process N . As usual, let μ = E(X) denote the mean of an interarrival time, but now we will also need ν = E(X ) , the second moment. We assume that both moments are finite. 2
For t ∈ [0, ∞), recall that the current life, remaining life and total life at time t are At = t − TNt ,
Bt = TNt +1 − t,
Lt = At + Bt = TNt +1 − TNt = XNt +1
(15.6.15)
respectively. In the usual terminology of reliability, A is the age of the device in service at time t , B is the time remaining until this device fails, and L is total life of the device. (To avoid notational clashes, we are using different notation than in past sections.) Let a(t) = E(A ) , b(t) = E(B ) , and l(t) = E(L ) for t ∈ [0, ∞), the corresponding mean functions. To derive our asymptotic results, we simply use the current life and the remaining life as reward densities (or rates) in a renewal reward process. t
t
t
t
t
t
Limits for the current life process. 1.
1
2.
1
t
t
∫
t
0
∫
t
0
As ds →
as t → ∞ with probability 1
ν 2μ ν
a(s) ds →
2μ
as t → ∞
Proof Limits for the remaining life process. 1.
1
2.
1
t
t
∫
t
0
Bs ds →
t
∫
0
as t → ∞ with probability 1
ν 2μ ν
b(s) ds →
2μ
as t → ∞
Proof With a little thought, it's not surprising that the limits for the current life and remaining life processes are the same. After a long period of time, a renewal process looks stochastically the same forward or backward in time. Changing the “arrow of time” reverses the role of the current and remaining life. Asymptotic results for the total life process now follow trivially from the results for the current and remaining life processes. Limits for the total life process 1.
1
2.
1
t
t
∫
t
0
∫
t
0
Ls ds → l(s) ds =
ν μ ν μ
as t → ∞ with probability 1 as t → ∞
15.6.3
https://stats.libretexts.org/@go/page/10282
Replacement Models Consider again a standard renewal process as defined in the Introduction, with interarrival sequence X = (X , X , …), arrival sequence T = (T , T , …) , and counting process N = {N : t ∈ [0, ∞)} . One of the most basic applications is to reliability, where a device operates for a random lifetime, fails, and then is replaced by a new device, and the process continues. In this model, X is the lifetime and T the failure time of the n th device in service, for n ∈ N , while N is the number of failures in [0, t] for t ∈ [0, ∞). As usual, F denotes the distribution function of a generic lifetime X, and F = 1 −F the corresponding right distribution function (reliability function). Sometimes, the device is actually a system with a number of critical components—the failure of any of the critical components causes the system to fail. 1
0
1
2
t
n
n
+
t
c
Replacement models are variations on the basic model in which the device is replaced (or the critical components replaced) at times other than failure. Often the cost a of a planned replacement is less than the cost b of an emergency replacement (at failure), so replacement models can make economic sense. We will consider the the most common model. In the age replacement model, the device is replaced either when it fails or when it reaches a specified age s ∈ (0, ∞) . This model gives rise to a new renewal process with interarrival sequence U = (U , U , …) where U = min{X , s} for n ∈ N . If a, b ∈ (0, ∞) are the costs of planned and unplanned replacements, respectively, then the cost associated with the renewal period U is 1
2
n
n
+
n
Yn = a1(Un = s) + b1(Un < s) = a1(Xn ≥ s) + b1(Xn < s)
(15.6.18)
Clearly ((U , Y ), (U , Y ), …) satisfies the assumptions of a renewal reward process given above. The model makes mathematical sense for any a, b ∈ (0, ∞) but if a ≥ b , so that the planned cost of replacement is at least as large as the unplanned cost of replacement, then Y ≥ b for n ∈ N , so the model makes no financial sense. Thus we assume that a < b . 1
1
2
2
n
+
In the age replacement model, with planned replacement at age s ∈ (0, ∞) , 1. The expected cost of a renewal period is E(Y ) = aF 2. The expected length of a renewal period is E(U ) = ∫
c
(s) + bF (s)
s
0
F
c
.
(x) dx
The limiting expected cost per unit time is c
aF C (s) = ∫
s
0
(s) + bF (s) (15.6.19) F
c
(x) dx
Proof So naturally, given the costs a and b , and the lifetime distribution function F , the goal is be to find the value of s that minimizes C (s) ; this value of s is the optimal replacement time. Of course, the optimal time may not exist. Properties of C 1. C (s) → ∞ as s ↓ 0 2. C (s) → μ/b as s ↑ ∞ Proof As s → ∞ , the age replacement model becomes the standard (unplanned) model with limiting expected average cost b/μ. Suppose that the lifetime of the device (in appropriate units) has the standard exponential distribution. Find C (s) and solve the optimal age replacement problem. Answer The last result is hardly surprising. A device with an exponentially distributed lifetime does not age—if it has not failed, it's just as good as new. More generally, age replacement does not make sense for any device with decreasing failure rate. Such devices improve with age. Suppose that the lifetime of the device (in appropriate units) has the gamma distribution with shape parameter parameter 1. Suppose that the costs (in appropriate units) are a = 1 and b = 5 .
2
and scale
1. Find C (s) .
15.6.4
https://stats.libretexts.org/@go/page/10282
2. Sketch the graph of C (s) . 3. Solve numerically the optimal age replacement problem. Answer Suppose again that the lifetime of the device (in appropriate units) has the gamma distribution with shape parameter scale parameter 1. But suppose now that the costs (in appropriate units) are a = 1 and b = 2 .
2
and
1. Find C (s) . 2. Sketch the graph of C (s) . 3. Solve the optimal age replacement problem. Answer In the last case, the difference between the cost of an emergency replacement and a planned replacement is not great enough for age replacement to make sense. Suppose that the lifetime of the device (in appropriately scaled units) is uniformly distributed on the interval [0, 1]. Find C (s) and solve the optimal replacement problem. Give the results explicitly for the following costs: 1. a = 4 , b = 6 2. a = 2 , b = 5 3. a = 1 , b = 10 Proof
Thinning We start with a standard renewal process with interarrival sequence X = (X , X , …), arrival sequence T = (T , T , …) and counting process N = {N : t ∈ [0, ∞)} . As usual, let μ = E(X) denote the mean of an interarrival time. For n ∈ N , suppose now that arrival n is either accepted or rejected, and define random variable Y to be 1 in the first case and 0 in the second. Let Z = (X , Y ) denote the interarrival time and rejection variable pair for n ∈ N , and assume that Z = (Z , Z , …) is an independent, identically distributed sequence. 1
2
0
1
t
+
n
n
n
n
+
1
2
Note that we have the structure of a renewal reward process, and so in particular, Y = (Y , Y , …) is a sequence of Bernoulli trials. Let p denote the parameter of this sequence, so that p is the probability of accepting an arrival. The procedure of accepting or rejecting points in a point process is known as thinning the point process. We studied thinning of the Poisson process. In the notation of this section, note that the reward process R = {R : t ∈ [0, ∞)} is the thinned counting process. That is, 1
2
t
Nt
Rt = ∑ Yi
(15.6.25)
i=1
is the number of accepted points in [0, t] for t ∈ [0, ∞). So then r(t) = E(R The renewal reward theorem gives the asymptotic behavior.
t)
is the expected number of accepted points in [0, t].
Limits for the thinned process. 1. R /t → p/μ as t → ∞ 2. r(t)/t → p/μ as t → ∞ t
Proof This page titled 15.6: Renewal Reward Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
15.6.5
https://stats.libretexts.org/@go/page/10282
CHAPTER OVERVIEW 16: Markov Processes A Markov process is a random process in which the future is independent of the past, given the present. Thus, Markov processes are the natural stochastic analogs of the deterministic processes described by differential and difference equations. They form one of the most important classes of random processes. 16.1: Introduction to Markov Processes 16.2: Potentials and Generators for General Markov Processes 16.3: Introduction to Discrete-Time Chains 16.4: Transience and Recurrence for Discrete-Time Chains 16.5: Periodicity of Discrete-Time Chains 16.6: Stationary and Limiting Distributions of Discrete-Time Chains 16.7: Time Reversal in Discrete-Time Chains 16.8: The Ehrenfest Chains 16.9: The Bernoulli-Laplace Chain 16.10: Discrete-Time Reliability Chains 16.11: Discrete-Time Branching Chain 16.12: Discrete-Time Queuing Chains 16.13: Discrete-Time Birth-Death Chains 16.14: Random Walks on Graphs 16.15: Introduction to Continuous-Time Markov Chains 16.16: Transition Matrices and Generators of Continuous-Time Chains 16.17: Potential Matrices 16.18: Stationary and Limting Distributions of Continuous-Time Chains 16.19: Time Reversal in Continuous-Time Chains 16.20: Chains Subordinate to the Poisson Process 16.21: Continuous-Time Birth-Death Chains 16.22: Continuous-Time Queuing Chains 16.23: Continuous-Time Branching Chains
This page titled 16: Markov Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1
16.1: Introduction to Markov Processes A Markov process is a random process indexed by time, and with the property that the future is independent of the past, given the present. Markov processes, named for Andrei Markov, are among the most important of all random processes. In a sense, they are the stochastic analogs of differential equations and recurrence relations, which are of course, among the most important deterministic processes. The complexity of the theory of Markov processes depends greatly on whether the time space T is N (discrete time) or [0, ∞) (continuous time) and whether the state space is discrete (countable, with all subsets measurable) or a more general topological space. When T = [0, ∞) or when the state space is a general space, continuity assumptions usually need to be imposed in order to rule out various types of weird behavior that would otherwise complicate the theory. When the state space is discrete, Markov processes are known as Markov chains. The general theory of Markov chains is mathematically rich and relatively simple. When T = N and the state space is discrete, Markov processes are known as discrete-time Markov chains. The theory of such processes is mathematically elegant and complete, and is understandable with minimal reliance on measure theory. Indeed, the main tools are basic probability and linear algebra. Discrete-time Markov chains are studied in this chapter, along with a number of special models. When T = [0, ∞) and the state space is discrete, Markov processes are known as continuous-time Markov chains. If we avoid a few technical difficulties (created, as always, by the continuous time space), the theory of these processes is also reasonably simple and mathematically very nice. The Markov property implies that the process, sampled at the random times when the state changes, forms an embedded discrete-time Markov chain, so we can apply the theory that we will have already learned. The Markov property also implies that the holding time in a state has the memoryless property and thus must have an exponential distribution, a distribution that we know well. In terms of what you may have already studied, the Poisson process is a simple example of a continuous-time Markov chain. For a general state space, the theory is more complicated and technical, as noted above. However, we can distinguish a couple of classes of Markov processes, depending again on whether the time space is discrete or continuous. When T = N and S = R , a simple example of a Markov process is the partial sum process associated with a sequence of independent, identically distributed real-valued random variables. Such sequences are studied in the chapter on random samples (but not as Markov processes), and revisited below. In the case that T = [0, ∞) and S = R or more generally S = R , the most important Markov processes are the diffusion processes. Generally, such processes can be constructed via stochastic differential equations from Brownian motion, which thus serves as the quintessential example of a Markov process in continuous time and space. k
The goal of this section is to give a broad sketch of the general theory of Markov processes. Some of the statements are not completely rigorous and some of the proofs are omitted or are sketches, because we want to emphasize the main ideas without getting bogged down in technicalities. If you are a new student of probability you may want to just browse this section, to get the basic ideas and notation, but skipping over the proofs and technical details. Then jump ahead to the study of discrete-time Markov chains. On the other hand, to understand this section in more depth, you will need to review topcis in the chapter on foundations and in the chapter on stochastic processes.
Basic Theory Preliminaries As usual, our starting point is a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the probability measure on (Ω, F ). The time set T is either N (discrete time) or [0, ∞) (continuous time). In the first case, T is given the discrete topology and in the second case T is given the usual Euclidean topology. In both cases, T is given the Borel σ-algebra T , the σ-algebra generated by the open sets. In the discrete case when T = N , this is simply the power set of T so that every subset of T is measurable; every function from T to another measurable space is measurable; and every function from T to another topological space is continuous. The time space (T , T ) has a natural measure; counting measure # in the discrete case, and Lebesgue in the continuous case.
16.1.1
https://stats.libretexts.org/@go/page/10288
The set of states S also has a σ-algebra S of admissible subsets, so that (S, S ) is the state space. Usually S has a topology and S is the Borel σ-algebra generated by the open sets. A typical set of assumptions is that the topology on S is LCCB: locally compact, Hausdorff, and with a countable base. These particular assumptions are general enough to capture all of the most important processes that occur in applications and yet are restrictive enough for a nice mathematical theory. Usually, there is a natural positive measure λ on the state space (S, S ). When S has an LCCB topology and S is the Borel σ-algebra, the measure λ wil usually be a Borel measure satisfying λ(C ) < ∞ if C ⊆ S is compact. The term discrete state space means that S is countable with S = P(S) , the collection of all subsets of S . Thus every subset of S is measurable, as is every function from S to another measurable space. This is the Borel σ-algebra for the discrete topology on S , so that every function from S to another topological space is continuous. The compact sets are simply the finite sets, and the reference measure is #, counting measure. If S = R for some k ∈ S (another common case), then we usually give S the Euclidean topology (which is LCCB) so that S is the usual Borel σ-algebra. The compact sets are the closed, bounded sets, and the reference measure λ is k -dimensional Lebesgue measure. k
Clearly, the topological and measure structures on T are not really necessary when T = N , and similarly these structures on S are not necessary when S is countable. But the main point is that the assumptions unify the discrete and the common continuous cases. Also, it should be noted that much more general state spaces (and more general time spaces) are possible, but most of the important Markov processes that occur in applications fit the setting we have described here. Various spaces of real-valued functions on S play an important role. Let B denote the collection of bounded, measurable functions f : S → R . With the usual (pointwise) addition and scalar multiplication, B is a vector space. We give B the supremum norm, defined by ∥f ∥ = sup{|f (x)| : x ∈ S}. Suppose now that X = {X : t ∈ T } is a stochastic process on (Ω, F , P) with state space S and time space T . Thus, X is a random variable taking values in S for each t ∈ T , and we think of X ∈ S as the state of a system at time t ∈ T . We also assume that we have a collection F = {F : t ∈ T } of σ-algebras with the properties that X is measurable with respect to F for t ∈ T , and the F ⊆ F ⊆ F for s, t ∈ T with s ≤ t . Intuitively, F is the collection of event up to time t ∈ T . Technically, the assumptions mean that F is a filtration and that the process X is adapted to F. The most basic (and coarsest) filtration is the t
t
t
t
s
t
t
natural filtration
t
t
F
0
0
= {Ft
where
: t ∈ T}
0
Ft
= σ{ Xs : s ∈ T , s ≤ t}
, the σ-algebra generated by the process up to time
. In continuous time, however, it is often necessary to use slightly finer σ-algebras in order to have a nice mathematical theory. In particular, we often need to assume that the filtration F is right continuous in the sense that F = F for t ∈ T where F = ⋂{ F : s ∈ T , s > t} . We can accomplish this by taking F = F so that F = F for t ∈ T , and in this case, F is referred to as the right continuous refinement of the natural filtration. We also sometimes need to assume that F is complete with respect to P in the sense that if A ∈ S with P(A) = 0 and B ⊆ A then B ∈ F . That is, F contains all of the null events (and hence also all of the almost certain events), and therefore so does F for all t ∈ T . t ∈ T
t+
s
t
0
0
t+
t
+
0
t+
0
t
Definitions The random process X is a Markov process if P(Xs+t ∈ A ∣ Fs ) = P(Xs+t ∈ A ∣ Xs )
for all s,
t ∈ T
(16.1.1)
and A ∈ S .
The defining condition, known appropriately enough as the the Markov property, states that the conditional distribution of X given F is the same as the conditional distribution of X just given X . Think of s as the present time, so that s + t is a time in the future. If we know the present state X , then any additional knowledge of events in the past is irrelevant in terms of predicting the future state X . Technically, the conditional probabilities in the definition are random variables, and the equality must be interpreted as holding with probability 1. As you may recall, conditional expected value is a more general and useful concept than conditional probability, so the following theorem may come as no surprise. s+t
s
s+t
s
s
s+t
The random process X is a Markov process if and only if E[f (Xs+t ) ∣ Fs ] = E[f (Xs+t ) ∣ Xs ]
for every s,
t ∈ T
and every f
∈ B
(16.1.2)
.
Proof sketch
16.1.2
https://stats.libretexts.org/@go/page/10288
Technically, we should say that X is a Markov process relative to the filtration F. If X satisfies the Markov property relative to a filtration, then it satisfies the Markov property relative to any coarser filtration. Suppose that the stochastic process X = {X G = { G : t ∈ T } is a filtration that is finer than relative to F.
: t ∈ T}
t
t
F
. If
X
is adapted to the filtration is a Markov process relative to
and that is a Markov process
F = { Ft : t ∈ T } G
then
X
Proof In particular, if X is a Markov process, then X satisfies the Markov property relative to the natural filtration Markov processes is simplified considerably if we add an additional assumption.
F
0
. The theory of
A Markov process X is time homogeneous if P(Xs+t ∈ A ∣ Xs = x) = P(Xt ∈ A ∣ X0 = x)
for every s,
t ∈ T
(16.1.4)
, x ∈ S and A ∈ S .
So if X is homogeneous (we usually don't bother with the time adjective), then the process {X : t ∈ T } given X = x is equivalent (in distribution) to the process {X : t ∈ T } given X = x . For this reason, the initial distribution is often unspecified in the study of Markov processes—if the process is in state x ∈ S at a particular time s ∈ T , then it doesn't really matter how the process got to state x; the process essentially “starts over”, independently of the past. The term stationary is sometimes used instead of homogeneous. s+t
t
s
0
From now on, we will usually assume that our Markov processes are homogeneous. This is not as big of a loss of generality as you might think. A non-homogenous process can be turned into a homogeneous process by enlarging the state space, as shown below. For a homogeneous Markov process, if s, t ∈ T , x ∈ S , and f ∈ B , then E[f (Xs+t ) ∣ Xs = x] = E[f (Xt ) ∣ X0 = x]
(16.1.5)
Feller Processes In continuous time, or with general state spaces, Markov processes can be very strange without additional continuity assumptions. Suppose (as is usually the case) that S has an LCCB topology and that S is the Borel σ-algebra. Let C denote the collection of bounded, continuous functions f : S → R . Let C denote the collection of continuous functions f : S → R that vanish at ∞. The last phrase means that for every ϵ > 0 , there exists a compact set C ⊆ S such that |f (x)| < ϵ if x ∉ C . With the usual (pointwise) operations of addition and scalar multiplication, C is a vector subspace of C , which in turn is a vector subspace of B . Just as with B , the supremum norm is used for C and C . 0
0
0
A Markov process X = {X
t
: t ∈ T}
is a Feller process if the following conditions are satisfied.
1. Continuity in space: For t ∈ T and y ∈ S , the distribution of X given X = x converges to the distribution of X given X = y as x → y . 2. Continuity in time: Given X = x for x ∈ S , X converges in probability to x as t ↓ 0 . t
0
t
0
0
t
Additional details Feller processes are named for William Feller. Note that if S is discrete, (a) is automatically satisfied and if T is discrete, (b) is automatically satisfied. In particular, every discrete-time Markov chain is a Feller Markov process. There are certainly more general Markov processes, but most of the important processes that occur in applications are Feller processes, and a number of nice properties flow from the assumptions. Here is the first: If X = {X : t ∈ T } is a Feller process, then there is a version of has left limits for every ω ∈ Ω . t
X
such that
t ↦ Xt (ω)
is continuous from the right and
Again, this result is only interesting in continuous time T = [0, ∞) . Recall that for ω ∈ Ω , the function t ↦ X (ω) is a sample path of the process. So we will often assume that a Feller Markov process has sample paths that are right continuous have left limits, since we know there is a version with these properties. t
16.1.3
https://stats.libretexts.org/@go/page/10288
Stopping Times and the Strong Markov Property For our next discussion, you may need to review again the section on filtrations and stopping times.To give a quick review, suppose again that we start with our probability space (Ω, F , P) and the filtration F = {F : t ∈ T } (so that we have a filtered probability space). t
Since time (past, present, future) plays such a fundamental role in Markov processes, it should come as no surprise that random times are important. We often need to allow random times to take the value ∞, so we need to enlarge the set of times to T = T ∪ {∞} . The topology on T is extended to T by the rule that for s ∈ T , the set {t ∈ T : t > s} is an open neighborhood of ∞. This is the one-point compactification of T and is used so that the notion of time converging to infinity is preserved. The Borel σ-algebra T is used on T , which again is just the power set in the discrete case. ∞
∞
∞
∞
∞
If X = {X : t ∈ T } is a stochastic process on the sample space (Ω, F ), and if τ is a random time, then naturally we want to consider the state X at the random time. There are two problems. First if τ takes the value ∞, X is not defined. The usual solution is to add a new “death state” δ to the set of states S , and then to give S = S ∪ {δ} the σ algebra S = S ∪ {A ∪ {δ} : A ∈ S } . A function f ∈ B is extended to S by the rule f (δ) = 0 . The second problem is that X may not be a valid random variable (that is, measurable) unless we assume that the stochastic process X is measurable. Recall that this means that X : Ω × T → S is measurable relative to F ⊗ T and S . (This is always true in discrete time.) t
τ
τ
δ
δ
δ
τ
Recall next that a random time τ is a stopping time (also called a Markov time or an optional time) relative to F if {τ ≤ t} ∈ F for each t ∈ T . Intuitively, we can tell whether or not τ ≤ t from the information available to us at time t . In a sense, a stopping time is a random time that does not require that we see into the future. Of course, the concept depends critically on the filtration. Recall that if a random time τ is a stopping time for a filtration F = {F : t ∈ T } then it is also a stopping time for a finer filtration G = {G : t ∈ T } , so that F ⊆ G for t ∈ T . Thus, the finer the filtration, the larger the collection of stopping times. In fact if the filtration is the trivial one where F = F for all t ∈ T (so that all information is available to us from the beginning of time), then any random time is a stopping time. But of course, this trivial filtration is usually not sensible. t
t
t
t
t
t
Next, recall that if τ is a stopping time for the filtration F, then the σ-algebra F associated with τ is given by τ
Fτ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft for all t ∈ T }
(16.1.6)
Intuitively, F is the collection of events up to the random time τ , analogous to the F which is the collection of events up to the deterministic time t ∈ T . If X = {X : t ∈ T } is a stochastic process adapted to F and if τ is a stopping time relative to F, then we would hope that X is measurable with respect to F just as X is measurable with respect to F for deterministic t ∈ T . However, this will generally not be the case unless X is progressively measurable relative to F, which means that X : Ω × T → S is measurable with respect to F ⊗ T and S where T = {s ∈ T : s ≤ t} and T the corresponding Borel σalgebra. This is always true in discrete time, of course, and more generally if S has an LCCB topology with S the Borel σ-algebra, and X is right continuous. If X is progressively measurable with respect to F then X is measurable and X is adapted to F. τ
t
t
τ
τ
t
t
The strong Markov property for our stochastic process the present, when the present time is a stopping time.
t
t
t
t
X = { Xt : t ∈ T }
t
states that the future is independent of the past, given
The random process X is a strong Markov process if E[f (Xτ+t ) ∣ Fτ ] = E[f (Xτ+t ) ∣ Xτ ]
for every t ∈ T , stopping time τ , and f
∈ B
(16.1.7)
.
As with the regular Markov property, the strong Markov property depends on the underlying filtration F. If the property holds with respect to a given filtration, then it holds with respect to a coarser filtration. Suppose that the stochastic process X = {X : t ∈ T } is progressively measurable relative to the filtration F = {F : t ∈ T } and that the filtration G = {G : t ∈ T } is finer than F. If X is a strong Markov process relative to G then X is a strong Markov process relative to F. t
t
t
Proof So if X is a strong Markov process, then X satisfies the strong Markov property relative to its natural filtration. Again there is a tradeoff: finer filtrations allow more stopping times (generally a good thing), but make the strong Markov property harder to satisfy and may not be reasonable (not so good). So we usually don't want filtrations that are too much finer than the natural one.
16.1.4
https://stats.libretexts.org/@go/page/10288
With the strong Markov and homogeneous properties, the process {X : t ∈ T } given X = x is equivalent in distribution to the process {X : t ∈ T } given X = x . Clearly, the strong Markov property implies the ordinary Markov property, since a fixed time t ∈ T is trivially also a stopping time. The converse is true in discrete time. τ+t
t
τ
0
Suppose that X = {X
n
: n ∈ N}
is a (homogeneous) Markov process in discrete time. Then X is a strong Markov process.
As always in continuous time, the situation is more complicated and depends on the continuity of the process X and the filtration F . Here is the standard result for Feller processes. If X = {X : t ∈ [0, ∞) is a Feller Markov process, then continuous refinement of the natural filtration.. t
X
is a strong Markov process relative to filtration
0
F+
, the right-
Transition Kernels of Markov Processes For our next discussion, you may need to review the section on kernels and operators in the chapter on expected value. Suppose again that X = {X : t ∈ T } is a (homogeneous) Markov process with state space S and time space T , as described above. The kernels in the following definition are of fundamental importance in the study of X t
For t ∈ T , let Pt (x, A) = P(Xt ∈ A ∣ X0 = x),
x ∈ S, A ∈ S
(16.1.9)
Then P is a probability kernel on (S, S ), known as the transition kernel of X for time t . t
Proof That is, Pt (x, ⋅)
P (x, ⋅) is the conditional distribution of X given X = x for t ∈ T and is also the conditional distribution of X given X = x for s ∈ T : t
t
0
s+t
x ∈ S
. By the time homogenous property,
s
Pt (x, A) = P(Xs+t ∈ A ∣ Xs = x),
s, t ∈ T , x ∈ S, A ∈ S
(16.1.10)
Note that P = I , the identity kernel on (S, S ) defined by I (x, A) = 1(x ∈ A) for x ∈ S and A ∈ S , so that I (x, A) = 1 if x ∈ A and I (x, A) = 0 if x ∉ A . Recall also that usually there is a natural reference measure λ on (S, S ). In this case, the transition kernel P will often have a transition density p with respect to λ for t ∈ T . That is, 0
t
t
Pt (x, A) = P(Xt ∈ A ∣ X0 = x) = ∫
pt (x, y)λ(dy),
x ∈ S, A ∈ S
(16.1.11)
A
The next theorem gives the Chapman-Kolmogorov equation, named for Sydney Chapman and Andrei Kolmogorov, the fundamental relationship between the probability kernels, and the reason for the name transition kernel. Suppose again that X = {X P P =P . That is,
t
s
t
: t ∈ T}
is a Markov process on S with transition kernels P
= { Pt : t ∈ T }
. If s,
s ∈ T
, then
s+t
Ps+t (x, A) = ∫
Ps (x, dy)Pt (y, A),
x ∈ S, A ∈ S
(16.1.12)
S
Proof In the language of functional analysis, P is a semigroup. Recall that the commutative property generally does not hold for the product operation on kernels. However the property does hold for the transition kernels of a homogeneous Markov process. That is, P P =P P =P for s, t ∈ T . As a simple corollary, if S has a reference measure, the same basic relationship holds for the transition densities. s
t
t
s
s+t
Suppose that λ is the reference measure on (S, S ) and that X = {X densities {p : t ∈ T } . If s, t ∈ T then p p = p . That is,
t
t
s
t
: t ∈ T}
is a Markov process on S and with transition
s+t
pt (x, z) = ∫
ps (x, y)pt (y, z)λ(dy),
x, z ∈ S
(16.1.16)
S
Proof
16.1.5
https://stats.libretexts.org/@go/page/10288
If T
(discrete time), then the transition kernels of then P = P for n ∈ N .
=N
P = P1
are just the powers of the one-step transition kernel. That is, if we let
X
n
n
Recall that a kernel defines two operations: operating on the left with positive measures on (S, S ) and operating on the right with measurable, real-valued functions. For the transition kernels of a Markov process, both of the these operators have natural interpretations. Suppose that s,
t ∈ T
. If μ is the distribution of X then X s
s
s+t
μs+t (A) = ∫
has distribution μ
s+t
μs (dx)Pt (x, A),
= μs Pt
. That is,
A ∈ S
(16.1.17)
S
Proof So if P denotes the collection of probability measures on (S, S ), then the left operator P maps P back into P . In particular, if X has distribution μ (the initial distribution) then X has distribution μ = μ P for every t ∈ T . t
0
0
t
t
A positive measure μ on (S, S ) is invariant for X if μP
t
=μ
0
t
for every t ∈ T .
Hence if μ is a probability measure that is invariant for X, and X has distribution μ , then X has distribution μ for every t ∈ T so that the process X is identically distributed. In discrete time, note that if μ is a positive measure and μP = μ then μP = μ for every n ∈ N , so μ is invariant for X. The operator on the right is given next. 0
t
n
Suppose that f
: S → R
. If t ∈ T then (assuming that the expected value exists), Pt f (x) = ∫
Pt (x, dy)f (y) = E [f (Xt ) ∣ X0 = x] ,
x ∈ S
(16.1.19)
S
Proof In particular, the right operator P is defined on B , the vector space of bounded, linear functions f : S → R , and in fact is a linear operator on B . That is, if f , g ∈ B and c ∈ R , then P (f + g) = P f + P g and P (cf ) = cP f . Moreover, P is a contraction operator on B , since ∥P f ∥ ≤ ∥f ∥ for f ∈ B . It then follows that P is a continuous operator on B for t ∈ T . t
t
t
t
t
t
t
t
t
For the right operator, there is a concept that is complementary to the invariance of of a positive measure for the left operator. A measurable function f
: S → R
Again, in discrete time, if P f
=f
is harmonic for X if P
then P
tf
n
f =f
0
t
for all t ∈ T .
for all n ∈ N , so f is harmonic for X.
Combining two results above, if X has distribution μ and f exists), μ P f = E[f (X )] for t ∈ T . That is, 0
=f
0
: S → R
is measurable, then (again assuming that the expected value
t
E[f (Xt )] = ∫
μ0 (dx) ∫
S
Pt (x, dy)f (y)
(16.1.21)
S
The result above shows how to obtain the distribution of X from the distribution of X and the transition kernel P for t ∈ T . But we can do more. Recall that one basic way to describe a stochastic process is to give its finite dimensional distributions, that is, the distribution of (X , X , … , X ) for every n ∈ N and every (t , t , … , t ) ∈ T . For a Markov process, the initial distribution and the transition kernels determine the finite dimensional distributions. It's easiest to state the distributions in differential form. t
0
t
n
t1
Suppose X = {X
t
0 < t1 < ⋯ < tn
t2
tn
+
1
2
n
is a Markov process with transition operators P = {P : t ∈ T } , and that (t , … , t has distribution μ , then in differential form, the distribution of (X , X , … , X ) is
: t ∈ T}
. If X
0
t
0
0
μ0 (dx0 )Pt (x0 , dx1 )Pt 1
2
−t1
(x1 , dx2 ) ⋯ Pt
n
−tn−1
n)
1
(xn−1 , dxn )
t1
∈ T
n
with
tn
(16.1.22)
Proof This result is very important for constructing Markov processes. If we know how to define the transition kernels P for t ∈ T (based on modeling considerations, for example), and if we know the initial distribution μ , then the last result gives a consistent set of finite dimensional distributions. From the Kolmogorov construction theorem, we know that there exists a stochastic process t
0
16.1.6
https://stats.libretexts.org/@go/page/10288
that has these finite dimensional distributions. In continuous time, however, two serious problems remain. First, it's not clear how we would construct the transition kernels so that the crucial Chapman-Kolmogorov equations above are satisfied. Second, we usually want our Markov process to have certain properties (such as continuity properties of the sample paths) that go beyond the finite dimensional distributions. The first problem will be addressed in the next section, and fortunately, the second problem can be resolved for a Feller process. Suppose that
is a Markov process on an LCCB state space (S, S ) with transition operators . Then X is a Feller process if and only if the following conditions hold:
X = { Xt : t ∈ T }
P = { Pt : t ∈ [0, ∞)}
1. Continuity in space: If f ∈ C and t ∈ [0, ∞) then P f ∈ C 2. Continuity in time: If f ∈ C and x ∈ S then P f (x) → f (x) as t ↓ 0 . 0
t
0
0
t
A semigroup of probability kernels P = {P : t ∈ T } that satisfies the properties in this theorem is called a Feller semigroup. So the theorem states that the Markov process X is Feller if and only if the transition semigroup of transition P is Feller. As before, (a) is automatically satisfied if S is discrete, and (b) is automatically satisfied if T is discrete. Condition (a) means that P is an operator on the vector space C , in addition to being an operator on the larger space B . Condition (b) actually implies a stronger form of continuity in time. t
t
0
Suppose that P = {P : t ∈ T } is a Feller semigroup of transition operators. Then t ↦ P supremum norm) for f ∈ C .
is continuous (with respect to the
tf
t
0
Additional details So combining this with the remark above, note that if P is a Feller semigroup of transition operators, then f ↦ P f is continuous on C for fixed t ∈ T , and t ↦ P f is continuous on T for fixed f ∈ C . Again, the importance of this is that we often start with the collection of probability kernels P and want to know that there exists a nice Markov process X that has these transition operators. t
0
t
0
Sampling in Time If we sample a Markov process at an increasing sequence of points in time, we get another Markov process in discrete time. But the discrete time process may not be homogeneous even if the original process is homogeneous. Suppose that X = {X
: t ∈ T}
0 = t0 < t1 < t2 < ⋯
. Let Y
t
n
is a Markov process with state space (S, S ) and that (t , t , t , …) is a sequence in T with for n ∈ N . Then Y = {Y : n ∈ N} is a Markov process in discrete time. 0
= Xt
1
2
n
n
Proof If we sample a homogeneous Markov process at multiples of a fixed, positive time, we get a homogenous Markov process in discrete time. Suppose that
is a homogeneous Markov process with state space (S, S ) and transition kernels P = { P : t ∈ T } . Fix r ∈ T with r > 0 and define Y = X for n ∈ N . Then Y = {Y : n ∈ N} is a homogeneous Markov process in discrete time, with one-step transition kernel Q given by X = { Xt : t ∈ T }
t
n
nr
Q(x, A) = Pr (x, A);
n
x ∈ S, A ∈ S
(16.1.28)
In some cases, sampling a strong Markov process at an increasing sequence of stopping times yields another Markov process in discrete time. The point of this is that discrete-time Markov processes are often found naturally embedded in continuous-time Markov processes.
Enlarging the State Space Our first result in this discussion is that a non-homogeneous Markov process can be turned into a homogenous Markov process, but only at the expense of enlarging the state space. Suppose that X = {X : t ∈ T } is a non-homogeneous Markov process with state space (S, S ). Suppose also that τ is a random variable taking values in T , independent of X. Let τ = τ + t and let Y = (X , τ ) for t ∈ T . Then Y = { Y : t ∈ T } is a homogeneous Markov process with state space (S × T , S ⊗ T ) . For t ∈ T , the transition kernel P is given by t
t
t
t
τt
t
t
16.1.7
https://stats.libretexts.org/@go/page/10288
Pt [(x, r), A × B] = P(Xr+t ∈ A ∣ Xr = x)1(r + t ∈ B),
(x, r) ∈ S × T , A × B ∈ S ⊗ T
(16.1.29)
Proof The trick of enlarging the state space is a common one in the study of stochastic processes. Sometimes a process that has a weaker form of “forgetting the past” can be made into a Markov process by enlarging the state space appropriately. Here is an example in discrete time. Suppose that X = {X : n ∈ N} is a random process with state space the last two states. That is, for n ∈ N n
(S, S )
in which the future depends stochastically on
P(Xn+2 ∈ A ∣ Fn+1 ) = P(Xn+2 ∈ A ∣ Xn , Xn+1 ),
where {F : n ∈ N} is the natural filtration associated with the process homogeneous in the sense that n
X
A ∈ S
(16.1.31)
. Suppose also that the process is time
P(Xn+2 ∈ A ∣ Xn = x, Xn+1 = y) = Q(x, y, A)
independently of n ∈ N . Let Y = (X , X ) for n ∈ N . Then Y = {Y state space (S × S, S ⊗ S . The one step transition kernel P is given by n
n
n+1
n
P [(x, y), A × B] = I (y, A)Q(x, y, B);
: n ∈ N}
(16.1.32)
is a homogeneous Markov process with
x, y ∈ S, A, B ∈ S
(16.1.33)
Proof The last result generalizes in a completely straightforward way to the case where the future of a random process in discrete time depends stochastically on the last k states, for some fixed k ∈ N .
Examples and Applications Recurrence Relations and Differential Equations As noted in the introduction, Markov processes can be viewed as stochastic counterparts of deterministic recurrence relations (discrete time) and differential equations (continuous time). Our goal in this discussion is to explore these connections. Suppose that X = {X
n
: n ∈ N}
is a stochastic process with state space (S, S ) and that X satisfies the recurrence relation Xn+1 = g(Xn ),
where
n ∈ N
(16.1.34)
is measurable. Then X is a homogeneous Markov process with one-step transition operator for a measurable function f : S → R .
g : S → S
Pf = f ∘g
P
given by
Proof In the deterministic world, as in the stochastic world, the situation is more complicated in continuous time. Nonetheless, the same basic analogy applies. Suppose that X = {X
t
: t ∈ [0, ∞)}
with state space (R, R)satisfies the first-order differential equation d dt
Xt = g(Xt )
(16.1.35)
where g : R → R is Lipschitz continuous. Then X is a Feller Markov process Proof In differential form, the process can be described by dX = g(X ) dt . This essentially deterministic process can be extended to a very important class of Markov processes by the addition of a stochastic term related to Brownian motion. Such stochastic differential equations are the main tools for constructing Markov processes known as diffusion processes. t
t
Processes with Stationary, Independent Increments For our next discussion, we consider a general class of stochastic processes that are Markov processes. Suppose that X = { X : t ∈ T } is a random process with S ⊆ R as the set of states. The state space can be discrete (countable) or “continuous”. Typically, S is either N or Z in the discrete case, and is either [0, ∞) or R in the continuous case. In any case, S is t
16.1.8
https://stats.libretexts.org/@go/page/10288
given the usual σ-algebra S of Borel subsets of S (which is the power set in the discrete case). Also, the state space (S, S ) has a natural reference measure measure λ , namely counting measure in the discrete case and Lebesgue measure in the continuous case. Let F = {F : t ∈ T } denote the natural filtration, so that F = σ{X : s ∈ T , s ≤ t} for t ∈ T . t
t
s
The process X has 1. Independent increments if X − X is independent of F for all s, t ∈ T . 2. Stationary increments if the distribution of X − X is the same as the distribution of X s+t
s
s
s+t
s
t
− X0
for all s,
t ∈ T
.
A difference of the form X − X for s, t ∈ T is an increment of the process, hence the names. Sometimes the definition of stationary increments is that X − X have the same distribution as X . But this forces X = 0 with probability 1, and as usual with Markov processes, it's best to keep the initial distribution unspecified. If X has stationary increments in the sense of our definition, then the process Y = {Y = X − X : t ∈ T } has stationary increments in the more restricted sense. For the remainder of this discussion, assume that X = {X : t ∈ T } has stationary, independent increments, and let Q denote the distribution of X − X for t ∈ T . s+t
s
s+t
s
t
t
t
0
0
t
t
t
0
for s,
Qs ∗ Qt = Qs+t
t ∈ T
.
Proof So the collection of distributions Q = {Q point mass at 0.
: t ∈ T}
t
forms a semigroup, with convolution as the operator. Note that Q is simply 0
The process X is a homogeneous Markov process. For t ∈ T , the transition operator P is given by t
Pt f (x) = ∫
f (x + y)Qt (dy),
f ∈ B
(16.1.36)
S
Proof Clearly the semigroup property of P = {P : t ∈ T } (with the usual operator product) is equivalent to the semigroup property of Q = { Q : t ∈ T } (with convolution as the product). t
t
Suppose that for positive t ∈ T , the distribution Q has probability density function g with respect to the reference measure λ . Then the transition density is t
t
pt (x, y) = gt (y − x),
Of course, from the result above, it follows that probability density functions. If Q
t
→ Q0
gs ∗ gt = gs+t
for s,
x, y ∈ S
t ∈ T
(16.1.39)
, where here
∗
refers to the convolution operation on
as t ↓ 0 then X is a Feller Markov process.
Thus, by the general theory sketched above, X is a strong Markov process, and there exists a version of X that is right continuous and has left limits. Such a process is known as a Lévy process, in honor of Paul Lévy. For a real-valued stochastic process X = {X
t
: t ∈ T}
, let m and v denote the mean and variance functions, so that
m(t) = E(Xt ), v(t) = var(Xt );
t ∈ T
(16.1.40)
assuming of course that the these exist. The mean and variance functions for a Lévy process are particularly simple. Suppose again that X has stationary, independent increments. 1. If μ = E(X ) ∈ R and μ 2. If in addition, σ = var(X 0
0
1
2
0
then m(t) = μ + (μ − μ )t for t ∈ T . and σ = var(X ) ∈ (0, ∞) then v(t) = σ + (σ
= E(X1 ) ∈ R
0 ) ∈ (0, ∞)
0
1
0
2
1
1
2
2
0
1
2
− σ )t 0
for t ∈ T .
Proof It's easy to describe processes with stationary independent increments in discrete time.
16.1.9
https://stats.libretexts.org/@go/page/10288
A process X = {X random variables (U
: n ∈ N}
n
0 , U1 , …)
has independent increments if and only if there exists a sequence of independent, real-valued such that n
Xn = ∑ Ui
(16.1.43)
i=0
In addition, X has stationary increments if and only if (U
1,
U2 , …)
are identically distributed.
Proof Thus suppose that U = (U , U , …) is a sequence of independent, real-valued random variables, with (U distributed with common distribution Q. Then from our main result above, the partial sum process X = {X with U is a homogeneous Markov process with one step transition kernel P given by 0
1,
1
n
P (x, A) = Q(A − x),
U2 , …)
: n ∈ N}
x ∈ S, A ∈ S
identically associated
(16.1.44)
More generally, for n ∈ N , the n -step transition kernel is P (x, A) = Q (A − x) for x ∈ S and A ∈ S . This Markov process is known as a random walk (although unfortunately, the term random walk is used in a number of other contexts as well). The idea is that at time n , the walker moves a (directed) distance U on the real line, and these steps are independent and identically distributed. If Q has probability density function g with respect to the reference measure λ , then the one-step transition density is n
∗n
n
p(x, y) = g(y − x),
x, y ∈ S
(16.1.45)
Consider the random walk on R with steps that have the standard normal distribution. Give each of the following explicitly: 1. The one-step transition density. 2. The n -step transition density for n ∈ N . +
Proof In continuous time, there are two processes that are particularly important, one with the discrete state space continuous state space R. For
t ∈ [0, ∞)
, let
N
and one with the
denote the probability density function of the Poisson distribution with parameter t , and let for x, y ∈ N. Then {p : t ∈ [0, ∞)} is the collection of transition densities for a Feller semigroup on N
gt
pt (x, y) = gt (y − x)
t
Proof So a Lévy process N = {N : t ∈ [0, ∞)} with these transition densities would be a Markov process with stationary, independent increments and with sample paths are right continuous and have left limits. We do know of such a process, namely the Poisson process with rate 1. t
Open the Poisson experiment and set the rate parameter to 1 and the time parameter to 10. Run the experiment several times in single-step mode and note the behavior of the process. For t ∈ (0, ∞) , let g denote the probability density function of the normal distribution with mean 0 and variance t , and let p (x, y) = g (y − x) for x, y ∈ R. Then { p : t ∈ [0, ∞)} is the collection of transition densities of a Feller semigroup on R . t
t
t
t
Proof So a Lévy process X = {X : t ∈ [0, ∞)} on R with these transition densities would be a Markov process with stationary, independent increments, and whose sample paths are continuous from the right and have left limits. In fact, there exists such a process with continuous sample paths. This process is Brownian motion, a process important enough to have its own chapter. t
Run the simulation of standard Brownian motion and note the behavior of the process. This page titled 16.1: Introduction to Markov Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.1.10
https://stats.libretexts.org/@go/page/10288
16.2: Potentials and Generators for General Markov Processes Our goal in this section is to continue the broad sketch of the general theory of Markov processes. As with the last section, some of the statements are not completely precise and rigorous, because we want to focus on the main ideas without being overly burdened by technicalities. If you are a new student of probability, or are primarily interested in applications, you may want to skip ahead to the study of discrete-time Markov chains.
Preliminaries Basic Definitions As usual, our starting point is a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the probability measure on the sample space (Ω, F ). The set of times T is either N, discrete time with the discrete topology, or [0, ∞), continuous time with the usual Euclidean topology. The time set T is given the Borel σ-algebra T , which is just the power set if T = N , and then the time space (T , T ) is given the usual measure, counting measure in the discrete case and Lebesgue measure in the continuous case. The set of states S has an LCCB topology (locally compact, Hausdorff, with a countable base), and is also given the Borel σ-algebra S . Recall that to say that the state space is discrete means that S is countable with the discrete topology, so that S is the power set of S . The topological assumptions mean that the state space (S, S ) is nice enough for a rich mathematical theory and general enough to encompass the most important applications. There is often a natural Borel measure λ on (S, S ), counting measure # if S is discrete, and for example, Lebesgue measure if S = R for some k ∈ N . k
+
Recall also that there are several spaces of functions on S that are important. Let B denote the set of bounded, measurable functions f : S → R . Let C denote the set of bounded, continuous functions f : S → R , and let C denote the set of continuous functions f : S → R that vanish at ∞ in the sense that for every ϵ > 0 , there exists a compact set K ⊆ S such |f (x)| < ϵ for x ∈ K . These are all vector spaces under the usual (pointwise) addition and scalar multiplication, and C ⊆ C ⊆ B . The supremum norm, defined by ∥f ∥ = sup{|f (x)| : x ∈ S} for f ∈ B is the norm that is used on these spaces. 0
c
0
Suppose now that X = {X : t ∈ T } is a time-homogeneous Markov process with state space (S, S ) defined on the probability space (Ω, F , P). As before, we also assume that we have a filtration F = {F : t ∈ T } , that is, an increasing family of sub σalgebras of F , indexed by the time space, with the properties that X is measurable with repsect to F for t ∈ T . Intuitively, F is the collection of events up to time t ∈ T . t
t
t
t
t
As usual, we let P denote the transition probability kernel for an increase in time of size t ∈ T . Thus t
Pt (x, A) = P(Xt ∈ A ∣ X0 = x),
x ∈ S, A ∈ S
(16.2.1)
Recall that for t ∈ T , the transition kernel P defines two operators, on the left with measures and on the right with functions. So, if μ is a measure on (S, S ) then μP is the measure on (S, S ) given by t
t
μPt (A) = ∫
μ(dx)Pt (x, A),
A ∈ S
(16.2.2)
S
If μ is the distribution of X then μP is the distribution of X for t ∈ T . If f 0
t
t
Pt f (x) = ∫
∈ B
then P
tf
∈ B
is defined by
Pt (x, dy)f (y) = E [f (Xt ) ∣ X0 = x]
(16.2.3)
S
Recall that the collection of transition operators P = {P : t ∈ T } is a semigroup because P P = P for s, t ∈ T . Just about everything in this section is defined in terms of the semigroup P , which is one of the main analytic tools in the study of Markov processes. t
s
t
s+t
Feller Markov Processes We make the same assumptions as in the Introduction. Here is a brief review: We assume that the Markov process process):
X = { Xt : t ∈ T }
1. For t ∈ T and y ∈ S , the distribution of X given X t
0
satisfies the following properties (and hence is a Feller Markov
=x
converges to the distribution of X given X
16.2.1
t
0
=y
as x → y .
https://stats.libretexts.org/@go/page/10289
2. Given X
0
=x ∈ S
, X converges in probability to x as t ↓ 0 . t
Part (a) is an assumption on continuity in space, while part (b) is an assumption on continuity in time. If S is discrete then (a) automatically holds, and if T is discrete then (b) automatically holds. As we will see, the Feller assumptions are sufficient for a very nice mathematical theory, and yet are general enough to encompass the most important continuous-time Markov processes. The process X = {X
t
has the following properties:
: t ∈ T}
1. There is a version of X such that t ↦ X is continuous from the right and has left limits. 2. X is a strong Markov process relative to the F , the right-continuous refinement of the natural filtration. t
0
+
The Feller assumptions on the Markov process have equivalent formulations in terms of the transition semigroup. The transition semigroup P 1. If f 2. If f
∈ C0 ∈ C0
= { Pt : t ∈ T }
has the following properties:
and t ∈ T then P f ∈ C and x ∈ S then P f (x) → f (x) as t ↓ 0 . t
0
t
As before, part (a) is a condition on continuity in space, while part (b) is a condition on continuity in time. Once again, (a) is trivial if S is discrete, and (b) trivial if T is discrete. The first condition means that P is a linear operator on C (as well as being a linear operator on B ). The second condition leads to a stronger continuity result. t
For f
∈ C0
, the mapping t ↦ P
tf
0
is continuous on T . That is, for t ∈ T ,
∥ Ps f − Pt f ∥ = sup{| Ps f (x) − Pt f (x)| : x ∈ S} → 0 as s → t
(16.2.4)
Our interest in this section is primarily the continuous time case. However, we start with the discrete time case since the concepts are clearer and simpler, and we can avoid some of the technicalities that inevitably occur in continuous time.
Discrete Time Suppose that T = N , so that time is discrete. Recall that the transition kernels are just powers of the one-step kernel. That is, we let P =P and then P = P for n ∈ N . n
1
n
Potential Operators For α ∈ (0, 1], the α -potential kernel R of X is defined as follows: α
∞ n
Rα (x, A) = ∑ α P
n
(x, A),
x ∈ S, A ∈ S
(16.2.5)
n=0
1. The special case R = R is simply the potential kernel of X. 2. For x ∈ S and A ∈ S , R(x, A) is the expected number of visits of X to A , starting at x. 1
Proof Note that it's quite possible that R(x, A) = ∞ for some x ∈ S and A ∈ S . In fact, knowing when this is the case is of considerable importance in the study of Markov processes. As with all kernels, the potential kernel R defines two operators, operating on the right on functions, and operating on the left on positive measures. For the right potential operator, if f : S → R is measurable then α
∞
∞ n
Rα f (x) = ∑ α P n=0
n
∞ n
f (x) = ∑ α n=0
∫
P
n
n
(x, dy)f (y) = ∑ α E[f (Xn ) ∣ X0 = x],
S
x ∈ S
(16.2.7)
n=0
assuming as usual that the expected values and the infinite series make sense. This will be the case, in particular, if nonnegative or if p ∈ (0, 1) and f ∈ B . If α ∈ (0, 1), then R
α (x,
S) =
1 1−α
f
is
for all x ∈ S .
Proof
16.2.2
https://stats.libretexts.org/@go/page/10289
It follows that for α ∈ (0, 1), the right operator R is a bounded, linear operator on (1 − α)R is a probability kernel. There is a nice interpretation of this kernel. α
B
with
∥ Rα ∥ =
1 1−α
. It also follows that
α
If α ∈ (0, 1) then (1 − α)R (x, ⋅) is the conditional distribution of and has the geometric distribution on N with parameter 1 − α . α
XN
given
X0 = x ∈ S
, where
N
is independent of
X
Proof So (1 − α)R is a transition probability kernel, just as P is a transition probability kernel, but corresponding to the random time N (with α ∈ (0, 1) as a parameter), rather than the deterministic time n ∈ N . An interpretation of the potential kernel R for α ∈ (0, 1) can be also given in economic terms. Suppose that A ∈ S and that we receive one monetary unit each time the process X visits A . Then as above, R(x, A) is the expected total amount of money we receive, starting at x ∈ S . However, typically money that we will receive at times distant in the future has less value to us now than money that we will receive soon. Specifically suppose that a monetary unit received at time n ∈ N has a present value of α , where α ∈ (0, 1) is an inflation factor (sometimes also called a discount factor). Then R (x, A) gives the expected, total, discounted amount we will receive, starting at x ∈ S . A bit more generally, if f ∈ B is a reward function, so that f (x) is the reward (or cost, depending on the sign) that we receive when we visit state x ∈ S , then for α ∈ (0, 1), R f (x) is the expected, total, discounted reward, starting at x ∈ S . α
n
α
n
α
α
For the left potential operator, if μ is a positive measure on S then ∞
∞ n
μRα (A) = ∑ α μP
n
n
(A) = ∑ α
n=0
∫
μ(dx)P
n
(x, A),
A ∈ S
(16.2.12)
S
n=0
In particular, if μ is a probability measure and X has distribution μ then μP is the distribution of X for n ∈ N , so from the last result, (1 − α)μR is the distribution of X where again, N is independent of X and has the geometric distribution on N with parameter 1 − α . The family of potential kernels gives the same information as the family of transition kernels. n
0
α
n
N
The potential kernels R = {R
α
: α ∈ (0, 1)}
completely determine the transition kernels P
= { Pn : n ∈ N}
.
Proof Of course, it's really only necessary to determine P , the one step transition kernel, since the other transition kernels are powers of P . In any event, it follows that the kernels R = { R : α ∈ (0, 1)} , along with the initial distribution, completely determine the finite dimensional distributions of the Markov process X. The potential kernels commute with each other and with the transition kernels. α
Suppose that α, 1. P 2. R
k
β ∈ (0, 1]
Rα = Rα P
α Rβ
k
and k ∈ N . Then (as kernels)
∞
=∑
n=0 ∞
= Rβ Rα = ∑
m=0
n
α P
n+k
∞
∑
n=0
m
α
β
n
P
m+n
Proof The same identities hold for the right operators on the entire space B , with the additional restrictions that fundamental equation that relates the potential kernels is given next. If α,
β ∈ (0, 1]
α 0 , the operators U and G have an inverse relationship. α
Suppose again that α ∈ (0, ∞) . 1. U = (αI − G) : C → D 2. G = αI − U : D → C −1
α
0
−1 α
0
16.2.6
https://stats.libretexts.org/@go/page/10289
Proof So, from the generator G we can determine the potential operators U = {U : α ∈ (0, ∞)} , which in turn determine the transition operators P = {P : t ∈ (0, ∞)} . In continuous time, transition operators P = {P : t ∈ [0, ∞)} can be obtained from the single, infinitesimal operator G in a way that is reminiscent of the fact that in discrete time, the transition operators P = {P : n ∈ N} can be obtained from the single, one-step operator P . α
t
t
n
Examples and Applications Our first example is essentially deterministic. Consider the Markov process X = {X
t
: t ∈ [0, ∞)}
on R satisfying the ordinary differential equation
d dt
Xt = g(Xt ),
t ∈ [0, ∞)
where g : R → R is Lipschitz continuous. The infinitesimal operator domain D of functions f : R → R where f ∈ C and f ∈ C .
G
is given by
(16.2.52)
′
Gf (x) = f (x)g(x)
for
x ∈ R
on the
′
0
0
Proof Our next example considers the Poisson process as a Markov process. Compare this with the binomial process above. Let
N = { Nt : t ∈ [0, ∞)}
X = { Xt : t ∈ [0, ∞)}
by X
t
denote the Poisson process on N with rate β ∈ (0, ∞). Define the Markov process = X +N where X takes values in N and is independent of N . 0
t
0
1. For t ∈ [0, ∞), show that the probability transition matrix P of X is given by t
y−x
Pt (x, y) = e
−βt
(βt)
,
x, y ∈ N, y ≥ x
(16.2.54)
(y − x)!
2. For α ∈ [0, ∞), show that the potential matrix U of X is given by α
1 Uα (x, y) =
β (
α +β
y−x
)
,
x, y ∈ N, y ≥ x
(16.2.55)
α +β
3. For α > 0 and x ∈ N, identify the probability distribution defined by α U (x, ⋅). 4. Show that the infinitesimal matrix G of X is given by G(x, x) = −β, G(x, x + 1) = β for x ∈ N. α
Solutions This page titled 16.2: Potentials and Generators for General Markov Processes is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.2.7
https://stats.libretexts.org/@go/page/10289
16.3: Introduction to Discrete-Time Chains In this and the next several sections, we consider a Markov process with the discrete time space N and with a discrete (countable) state space. Recall that a Markov process with a discrete state space is called a Markov chain, so we are studying discrete-time Markov chains.
Review We will review the basic definitions and concepts in the general introduction. With both time and space discrete, many of these definitions and concepts simplify considerably. As usual, our starting point is a probability space (Ω, F , P), so Ω is the sample space, F the σ-algebra of events, and P the probability measure on (Ω, F ). Let X = (X , X , X , …) be a stochastic process defined on the probability space, with time space N and with countable state space S . In the context of the general introduction, S is given the power set P(S) as the σ-algebra, so all subsets of S are measurable, as are all functions from S into another measurable space. Counting measure # is the natural measure on S , so integrals over S are simply sums. The same comments apply to the time space N: all subsets of N are measurable and counting measure # is the natural measure on N. 0
The vector space B consisting of bounded functions f norm defined by
: S → R
1
2
will play an important role. The norm that we use is the supremum
∥f ∥ = sup{|f (x)| : x ∈ S},
f ∈ B
(16.3.1)
For n ∈ N , let F = σ{X , X , … , X }, the σ-algebra generated by the process up to time n . Thus F = {F , F , F , …} is the natural filtration associated with X. We also let G = σ{X , X , …}, the σ-algebra generated by the process from time n on. So if n ∈ N represents the present time, then F contains the events in the past and G the events in the future. n
0
1
n
0
n
n
1
2
n+1
n
n
Definitions We start with the basic definition of the Markov property: the past and future are conditionally independent, given the present. X = (X0 , X1 , X2 , …) B ∈ Gn
is a Markov chain if
P(A ∩ B ∣ Xn ) = P(A ∣ Xn )P(B ∣ Xn )
for every
n ∈ N
,
A ∈ Fn
and
.
There are a number of equivalent formulations of the Markov property for a discrete-time Markov chain. We give a few of these. X = (X0 , X1 , X2 , …)
1. P(X 2. E[f (X
n+1
is a Markov chain if either of the following equivalent conditions is satisfied: for every n ∈ N and x ∈ S . for every n ∈ N and f ∈ B .
= x ∣ Fn ) = P(Xn+1 = x ∣ Xn )
n+1 )
∣ Fn ] = E[f (Xn+1 ) ∣ Xn ]
Part (a) states that for n ∈ N , the conditional probability density function of X given F is the same as the conditional probability density function of X given X . Part (b) also states, in terms of expected value, that the conditional distribution of X given F is the same as the conditional distribution of X given X . Both parts are the Markov property looking just one time step in the future. But with discrete time, this is equivalent to the Markov property at general future times. n+1
n+1
n+1
n
n+1
X = (X0 , X1 , X2 , …)
1. P(X 2. E[f (X
n+k
n
n
n
is a Markov chain if either of the following equivalent conditions is satisfied: for every n, k ∈ N and x ∈ S . for every n, k ∈ N and f ∈ B .
= x ∣ Fn ) = P(Xn+k = x ∣ Xn )
n+k )
∣ Fn ] = E[f (Xn+k ) ∣ Xn ]
Part (a) states that for n, k ∈ N , the conditional probability density function of X given F is the same as the conditional probability density function of X given X . Part (b) also states, in terms of expected value, that the conditional distribution of X given F is the same as the conditional distribution of X given X . In discrete time and space, the Markov property can also be stated without explicit reference to σ-algebras. If you are not familiar with measure theory, you can take this as the starting definition. n+k
n+k
n+k
n
n
n+k
X = (X0 , X1 , X2 , …)
n
n
is a Markov chain if for every n ∈ N and every sequence of states (x
0,
,
x1 , … , xn−1 , x, y)
P(Xn+1 = y ∣ X0 = x0 , X1 = x1 , … , Xn−1 = xn−1 , Xn = x) = P(Xn+1 = y ∣ Xn = x)
16.3.1
(16.3.2)
https://stats.libretexts.org/@go/page/10290
The theory of discrete-time Markov chains is simplified considerably if we add an additional assumption. A Markov chain X = (X
0,
X1 , X2 , …)
is time homogeneous if
P(Xn+k = y ∣ Xk = x) = P(Xn = y ∣ X0 = x)
for every k,
n ∈ N
and every x,
y ∈ S
(16.3.3)
.
That is, the conditional distribution of X given X = x depends only on n . So if X is homogeneous (we usually don't bother with the time adjective), then the chain {X : n ∈ N} given X = x is equivalent (in distribution) to the chain { X : n ∈ N} given X = x . For this reason, the initial distribution is often unspecified in the study of Markov chains—if the chain is in state x ∈ S at a particular time k ∈ N , then it doesn't really matter how the chain got to state x; the process essentially “starts over”, independently of the past. The term stationary is sometimes used instead of homogeneous. n+k
k
k+n
k
n
0
From now on, we will usually assume that our Markov chains are homogeneous. This is not as big of a loss of generality as you might think. A non-homogenous Markov chain can be turned into a homogeneous Markov process by enlarging the state space, as shown in the introduction to general Markov processes, but at the cost of creating an uncountable state space. For a homogeneous Markov chain, if k, n ∈ N , x ∈ S , and f ∈ B , then E[f (Xk+n ) ∣ Xk = x] = E[f (Xn ) ∣ X0 = x]
(16.3.4)
Stopping Times and the Strong Markov Property Consider again a stochastic process X = (X , X , X , …) with countable state space S , and with the natural filtration F = (F , F , F , …) as given above. Recall that a random variable τ taking values in N ∪ {∞} is a stopping time or a Markov time for X if {τ = n} ∈ F for each n ∈ N . Intuitively, we can tell whether or not τ = n by observing the chain up to time n . In a sense, a stopping time is a random time that does not require that we see into the future. The following result gives the quintessential examples of stopping times. 0
0
1
1
2
2
n
Suppose again X = {X : n ∈ N} is a discrete-time Markov chain with state space following random times are stopping times: n
1. ρ 2. τ
S
as defined above. For
A ⊆S
, the
, the entrance time to A . , the hitting time to A .
A
= inf{n ∈ N : Xn ∈ A}
A
= inf{n ∈ N+ : Xn ∈ A}
Proof An example of a random time that is generally not a stopping time is the last time that the process is in A : ζA = max{n ∈ N+ : Xn ∈ A}
We cannot tell if ζ
A
=n
without looking into the future: {ζ
A
(16.3.5)
= n} = { Xn ∈ A, Xn+1 ∉ A, Xn+2 ∉ A, …}
for n ∈ N .
If τ is a stopping time for X, the σ-algebra associated with τ is Fτ = {A ∈ F : A ∩ {τ = n} ∈ Fn for all n ∈ N}
(16.3.6)
Intuitively, F contains the events that can be described by the process up to the random time τ , in the same way that F contains the events that can be described by the process up to the deterministic time n ∈ N . For more information see the section on filtrations and stopping times. τ
n
The strong Markov property states that the future is independent of the past, given the present, when the present time is a stopping time. For a discrete-time Markov chain, the ordinary Markov property implies the strong Markov property. If X = (X , X , X , …) is a discrete-time Markov chain then stopping time for X then 0
1. P(X 2. E[f (X
τ+k
1
2
X
has the strong Markov property. That is, if
τ
is a finite
for every k ∈ N and x ∈ S . for every k ∈ N and f ∈ B .
= x ∣ Fτ ) = P(Xτ+k = x ∣ Xτ )
τ+k
) ∣ Fτ ] = E[f (Xτ+k ) ∣ Xτ ]
Part (a) states that the conditional probability density function of X given F is the same as the conditional probability density function of X given just X . Part (b) also states, in terms of expected value, that the conditional distribution of X given F τ+k
τ+k
τ
τ
τ+k
16.3.2
τ
https://stats.libretexts.org/@go/page/10290
is the same as the conditional distribution of X given just X . Assuming homogeneity as usual, the Markov chain {X : n ∈ N} given X = x is equivalent in distribution to the chain { X : n ∈ N} given X = x . τ+k
τ+n
τ
τ
n
0
Transition Matrices Suppose again that X = (X , X , X , …) is a homogeneous, discrete-time Markov chain with state space S . With a discrete state space, the transition kernels studied in the general introduction become transition matrices, with rows and columns indexed by S (and so perhaps of infinite size). The kernel operations become familiar matrix operations. The results in this section are special cases of the general results, but we sometimes give independent proofs for completeness, and because the proofs are simpler. You may want to review the section on kernels in the chapter on expected value. 0
1
2
For n ∈ N let Pn (x, y) = P(Xn = y ∣ X0 = x),
(x, y) ∈ S × S
(16.3.7)
The matrix P is the n -step transition probability matrix for X. n
Thus, y ↦ P (x, y) is the probability density function of X given X = x . In particular, P is a probability matrix (or stochastic matrix) since P (x, y) ≥ 0 for (x, y) ∈ S and ∑ P (x, y) = 1 for x ∈ S . As with any nonnegative matrix on S , P defines a kernel on S for n ∈ N : n
n
0
n
2
n
n
y∈S
Pn (x, A) = ∑ Pn (x, y) = P(Xn ∈ A ∣ X0 = x),
x ∈ S, A ⊆ S
(16.3.8)
y∈A
So A ↦ P (x, A) is the probability distribution of X given X = x . The next result is the Chapman-Kolmogorov equation, named for Sydney Chapman and Andrei Kolmogorov. It gives the basic relationship between the transition matrices. n
If m,
n ∈ N
n
then P
m Pn
0
= Pm+n
Proof It follows immediately that the transition matrices are just the matrix powers of the one-step transition matrix. That is, letting P =P we have P = P for all n ∈ N . Note that P = I , the identity matrix on S given by I (x, y) = 1 if x = y and 0 otherwise. The right operator corresponding to P yields an expected value. n
1
0
n
n
Suppose that n ∈ N and that f
: S → R
P
n
. Then, assuming that the expected value exists,
f (x) = ∑ P
n
(x, y)f (y) = E[f (Xn ) ∣ X0 = x],
x ∈ S
(16.3.12)
y∈S
Proof The existence of the expected value is only an issue if S is infinte. In particular, the result holds if f is nonnegative or if f ∈ B (which in turn would always be the case if S is finite). In fact, P is a linear contraction operator on the space B for n ∈ N . That is, if f ∈ B then P f ∈ B and ∥P f ∥ ≤ ∥f ∥. The left operator corresponding to P is defined similarly. For f : S → R n
n
n
n
fP
n
(y) = ∑ f (x)P
n
(x, y),
y ∈ S
(16.3.14)
x∈S
assuming again that the sum makes sense (as before, only an issue when S is infinite). The left operator is often restricted to nonnegative functions, and we often think of such a function as the density function (with respect to #) of a positive measure on S . In this sense, the left operator maps a density function to another density function. A function f
: S → R
is invariant for P (or for the chain X) if f P
Clearly if f is invariant, so that f P
=f
then f P
n
=f
=f
.
for all n ∈ N . If f is a probability density function, then so is f P .
If X has probability density function f , then X has probability density function f P for n ∈ N . 0
n
n
Proof
16.3.3
https://stats.libretexts.org/@go/page/10290
In particular, if X has probability density function f , and f is invariant for X, then X has probability density function f for all n ∈ N , so the sequence of variables X = (X , X , X , …) is identically distributed. Combining two results above, suppose that X has probability density function f and that g : S → R . Assuming the expected value exists, E[g(X )] = f P g . Explicitly, 0
n
0
1
2
n
0
n
E[g(Xn )] = ∑ ∑ f (x)P x∈S
n
(x, y)g(y)
(16.3.16)
y∈S
It also follows from the last theorem that the distribution of X (the initial distribution) and the one-step transition matrix determine the distribution of X for each n ∈ N . Actually, these basic quantities determine the finite dimensional distributions of the process, a stronger result. 0
n
Suppose that X has probability density function f . For any sequence of states (x 0
0,
0
x1 , … , xn ) ∈ S
n
,
,
P(X0 = x0 , X1 = x1 , … , Xn = xn ) = f0 (x0 )P (x0 , x1 )P (x1 , x2 ) ⋯ P (xn−1 , xn )
(16.3.17)
Proof Computations of this sort are the reason for the term chain in the name Markov chain. From this result, it follows that given a probability matrix P on S and a probability density function f on S , we can construct a Markov chain X = (X , X , X , …) such that X has probability density function f and the chain has one-step transition matrix P . In applied problems, we often know the one-step transition matrix P from modeling considerations, and again, the initial distribution is often unspecified. 0
1
2
0
There is a natural graph (in the combinatorial sense) associated with a homogeneous, discrete-time Markov chain. Suppose again that X = (X , X , X , …) is a Markov chain with state space S and transition probability matrix P . The state graph of X is the directed graph with vertex set S and edge set E = {(x, y) ∈ S : P (x, y) > 0} . 0
1
2
2
That is, there is a directed edge from x to y if and only if state x leads to state y in one step. Note that the graph may well have loops, since a state can certainly lead back to itself in one step. More generally, we have the following result: Suppose again that X = (X , X , X , …) is a Markov chain with state space S and transition probability matrix x, y ∈ S and n ∈ N , there is a directed path of length n in the state graph from x to y if and only if P (x, y) > 0 . 0
1
2
P
. For
n
+
Proof
Potential Matrices For α ∈ (0, 1], the α -potential matrix R of X is α
∞ n
Rα = ∑ α P
n
,
(x, y) ∈ S
2
(16.3.19)
n=0
1. R = R is simply the potential matrix of X. 2. R(x, y) is the expected number of visits by X to y ∈ S , starting at x ∈ S . 1
Proof Note that it's quite possible that R(x, y) = ∞ for some (x, y) ∈ S . In fact, knowing when this is the case is of considerable importance in recurrence and transience, which we study in the next section. As with any nonnegative matrix, the α -potential matrix defines a kernel and defines left and right operators. For the kernel, 2
∞ n
Rα (x, A) = ∑ Rα (x, y) = ∑ α P y∈A
n
(x, A),
x ∈ S, A ⊆ S
(16.3.21)
n=0
In particular, R(x, A) is the expected number of visits by the chain to A starting in x: ∞
R(x, A) = ∑ R(x, y) = ∑ P y∈A
If α ∈ (0, 1), then R
α (x,
S) =
1 1−α
∞ n
(x, A) = E [ ∑ 1(Xn ∈ A)] ,
n=0
x ∈ S, A ⊆ S
(16.3.22)
n=0
for all x ∈ S .
Proof
16.3.4
https://stats.libretexts.org/@go/page/10290
Hence R is a bounded matrix for matrix. α
α ∈ (0, 1)
and
(1 − α)Rα
If α ∈ (0, 1) then (1 − α)R (x, y) = P(X = y ∣ X geometric distribution on N with parameter 1 − α . α
N
0
= x)
is a probability matrix. There is a simple interpretation of this for
(x, y) ∈ S
2
, where
N
is independent of
X
and has the
Proof So (1 − α)R can be thought of as a transition matrix just as P is a transition matrix, but corresponding to the random time N (with α as a paraamter) rather than the deterministic time n . An interpretation of the potential matrix R for α ∈ (0, 1) can also be given in economic terms. Suppose that we receive one monetary unit each time the chain visits a fixed state y ∈ S . Then R(x, y) is the expected total reward, starting in state x ∈ S . However, typically money that we will receive at times distant in the future have less value to us now than money that we will receive soon. Specifically suppose that a monetary unit at time n ∈ N has a present value of α , so that α is an inflation factor (sometimes also called a discount factor). Then R (x, y) gives the expected total discounted reward, starting at x ∈ S . n
α
α
n
α
The potential kernels R = {R
α
: α ∈ (0, 1)}
completely determine the transition kernels P
= { Pn : n ∈ N}
.
Proof Of course, it's really only necessary to determine P , the one step transition kernel, since the other transition kernels are powers of P . In any event, it follows that the matrices R = { R : α ∈ (0, 1)} , along with the initial distribution, completely determine the finite dimensional distributions of the Markov chain X. The potential matrices commute with each other and with the transition matrices. α
If α,
β ∈ (0, 1]
1. P 2. R
k
and k ∈ N , then
Rα = Rα P
α Rβ
∞
k
=∑
n=0 ∞
= Rβ Rα = ∑
m=0
n
α P
n+k
∞
∑
n=0
m
α
β
n
P
m+n
Proof The fundamental equation that relates the potential matrices is given next. If α,
β ∈ (0, 1]
with α ≥ β then α Rα = β Rβ + (α − β)Rα Rβ
(16.3.31)
Proof If α ∈ (0, 1] then I + α R
αP
= I + αP Rα = Rα
.
Proof This leads to an important result: when α ∈ (0, 1), there is an inverse relationship between P and R . α
If α ∈ (0, 1), then 1. R = (I − αP ) 2. P = (I − R )
−1
α
1
−1 α
α
Proof This result shows again that the potential matrix R determines the transition operator P . α
Sampling in Time If we sample a Markov chain at multiples of a fixed time k , we get another (homogeneous) chain. Suppose that X = (X , X , X , …) is an Markov chain with state space S and transition probability matrix k ∈ N , the sequence X = (X , X , X , …) is a Markov chain on S with transition probability matrix P . 0
1
2
P
. For fixed
k
+
k
0
k
2k
If we sample a Markov chain at a general increasing sequence of time points 0 < n < n < ⋯ in N, then the resulting stochastic process Y = (Y , Y , Y , …), where Y = X for k ∈ N , is still a Markov chain, but is not time homogeneous in general. 1
0
1
2
k
2
nk
16.3.5
https://stats.libretexts.org/@go/page/10290
Recall that if A is a nonempty subset of S , then P is the matrix P restricted to A × A . So P is a sub-stochastic matrix, since the row sums may be less than 1. Recall also that P means (P ) , not (P ) ; in general these matrices are different. A
A
n
n
n
A
A
A
If A is a nonempty subset of S then for n ∈ N , P
n
A
(x, y) = P(X1 ∈ A, X2 ∈ A, … , Xn−1 ∈ A, Xn = y ∣ X0 = x),
(x, y) ∈ A × A
(16.3.36)
That is, P (x, y) is the probability of going from state x to y in n steps, remaining in A all the while. In terms of the state graph of X, it is the sum of products of probabilities along paths of length n from x to y that stay inside A . n
A
Examples and Applications Computational Exercises Let X = (X
0,
X1 , …)
be the Markov chain on S = {a, b, c} with transition matrix ⎡ P =⎢ ⎢ ⎣
1
1
2
2
1 4
1
0 ⎤ 3
0
4
0
0
⎥ ⎥
(16.3.37)
⎦
For the Markov chain X, 1. Draw the state graph. 2. Find P(X = a, X = b, X = c ∣ X = a) 3. Find P 4. Suppose that g : S → R is given by g(a) = 1 , g(b) = 2 , g(c) = 3 . Find E[g(X ) ∣ X = x] for x ∈ S . 5. Suppose that X has the uniform distribution on S . Find the probability density function of X . 1
2
3
0
2
2
0
0
2
Answer Let A = {a, b} . Find each of the following: 1. P 2. P 3. (P
A 2
A 2
)A
Proof Find the invariant probability density function of X Answer Compute the α -potential matrix R for α ∈ (0, 1). α
Answer
The Two-State Chain Perhaps the simplest, non-trivial Markov chain has two states, say where p ∈ (0, 1) and q ∈ (0, 1) are parameters.
S = {0, 1}
1 −p
p
q
1 −q
P =[
and the transition probability matrix given below,
]
(16.3.41)
For n ∈ N , P
n
n
q + p(1 − p − q)
1 =
[ p +q
n
q − q(1 − p − q)
n
p − p(1 − p − q)
n
]
(16.3.42)
p + q(1 − p − q)
Proof As n → ∞ ,
16.3.6
https://stats.libretexts.org/@go/page/10290
P
n
1 →
q
p
q
p
[ p +q
]
(16.3.44)
Proof Open the simulation of the two-state, discrete-time Markov chain. For various values of p and q, and different initial states, run the simulation 1000 times. Compare the relative frequency distribution to the limiting distribution, and in particular, note the rate of convergence. Be sure to try the case p = q = 0.01 The only invariant probability density function for the chain is f =[
q
p
p+q
p+q
]
(16.3.45)
Proof For α ∈ (0, 1), the α -potential matrix is 1 Rα =
q
p
q
p
[ (p + q)(1 − α)
1 ]+
2
p
−p
−q
q
[
(p + q ) (1 − α)
]
(16.3.46)
Proof In spite of its simplicity, the two state chain illustrates some of the basic limiting behavior and the connection with invariant distributions that we will study in general in a later section.
Independent Variables and Random Walks Suppose that X = (X , X , X , …) is a sequence of independent random variables taking values in a countable set (X , X , …) are identically distributed with (discrete) probability density function f . 0
1
1
2
S
, and that
2
is a Markov chain on invariant for P . X
S
with transition probability matrix
P
given by
P (x, y) = f (y)
for
(x, y) ∈ S × S
. Also,
f
is
Proof As a Markov chain, the process X is not very interesting, although of course it is very interesting in other ways. Suppose now that S = Z , the set of integers, and consider the partial sum process (or random walk) Y associated with X: n
Yn = ∑ Xi ,
n ∈ N
(16.3.49)
i=0
Y
is a Markov chain on Z with transition probability matrix Q given by Q(x, y) = f (y − x) for (x, y) ∈ Z × Z .
Proof Thus the probability density function f governs the distribution of a step size of the random walker on Z. Consider the special case of the random walk on Z with f (1) = p and f (−1) = 1 − p , where p ∈ (0, 1). 1. Give the transition matrix Q explicitly. 2. Give Q explicitly for n ∈ N . n
Answer This special case is the simple random walk on Z. When p = we have the simple, symmetric random walk. The simple random walk on Z is studied in more detail in the section on random walks on graphs. The simple symmetric random walk is studied in more detail in the chapter on Bernoulli Trials. 1 2
Doubly Stochastic Matrices A matrix P on S is doubly stochastic if it is nonnegative and if the row and columns sums are 1:
16.3.7
https://stats.libretexts.org/@go/page/10290
∑ P (x, u) = 1, ∑ P (u, y) = 1,
(x, y) ∈ S × S
(16.3.53)
u∈s
u∈S
Suppose that X is a Markov chain on a finite state space distribution on S is invariant.
S
with doubly stochastic transition matrix
P
. Then the uniform
Proof If P and Q are doubly stochastic matrices on S , then so is P Q. Proof It follows that if P is doubly stochastic then so is P for n ∈ N . n
Suppose that X = (X
0,
X1 , …)
is the Markov chain with state space S = {−1, 0, 1} and with transition matrix 1
1
2
2
0 ⎤
P =⎢ 0 ⎢
1
1
2
2
⎡
⎣
1 2
0
1
⎥ ⎥
(16.3.56)
⎦
2
1. Draw the state graph. 2. Show that P is doubly stochastic 3. Find P . 4. Show that the uniform distribution on S is the only invariant distribution for X. 5. Suppose that X has the uniform distribution on S . For n ∈ N , find E(X ) and var(X ). 6. Find the α -potential matrix R for α ∈ (0, 1). 2
0
n
n
α
Proof Recall that a matrix M indexed by a countable set S is symmetric if M (x, y) = M (y, x) for all x,
y ∈ S
.
If P is a symmetric, stochastic matrix then P is doubly stochastic. Proof The converse is not true. The doubly stochastic matrix in the exercise above is not symmetric. But since a symmetric, stochastic matrix on a finite state space is doubly stochastic, the uniform distribution is invariant. Suppose that X = (X
0,
X1 , …)
is the Markov chain with state space S = {−1, 0, 1} and with transition matrix ⎡
1
P =⎢0 ⎢ ⎣
0
0
0
1
3
4
4
3
1
4
4
⎤ ⎥ ⎥
(16.3.60)
⎦
1. Draw the state graph. 2. Show that P is symmetric 3. Find P . 4. Find all invariant probability density functions for X. 5. Find the α -potential matrix R for α ∈ (0, 1). 2
α
Proof
Special Models The Markov chains in the following exercises model interesting processes that are studied in separate sections. Read the introduction to the Ehrenfest chains. Read the introduction to the Bernoulli-Laplace chain. Read the introduction to the reliability chains.
16.3.8
https://stats.libretexts.org/@go/page/10290
Read the introduction to the branching chain. Read the introduction to the queuing chains. Read the introduction to random walks on graphs. Read the introduction to birth-death chains. This page titled 16.3: Introduction to Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.3.9
https://stats.libretexts.org/@go/page/10290
16.4: Transience and Recurrence for Discrete-Time Chains The study of discrete-time Markov chains, particularly the limiting behavior, depends critically on the random times between visits to a given state. The nature of these random times leads to a fundamental dichotomy of the states.
Basic Theory As usual, our starting point is a probability space (Ω, F , P), so that Ω is the sample space, F the σ-algebra of events, and P the probability measure on (Ω, F ). Suppose now that X = (X , X , X , …) is a (homogeneous) discrete-time Markov chain with (countable) state space S and transition probability matrix P . So by definition, 0
1
2
P (x, y) = P(Xn+1 = y ∣ Xn = x)
for x,
y ∈ S
(16.4.1)
and n ∈ N . Let F = σ{X , X , … , X }, the σ-algebra of events defined by the chain up to time , …) is the natural filtration associated with X. n
0
1
n
n ∈ N
, so that
F = (F0 , F1
Hitting Times and Probabilities Let A be a nonempty subset of S . Recall that the hitting time to A is the random variable that gives the first positive time that the chain is in A : τA = min{n ∈ N+ : Xn ∈ A}
(16.4.2)
Since the chain may never enter A , the random variable τ takes values in N ∪ {∞} (recall our convention that the minimum of the empty set is ∞). Recall also that τ is a stopping time for X. That is, {τ = n} ∈ F for n ∈ N . Intuitively, this means that we can tell if τ = n by observing the chain up to time n . This is clearly the case, since explicitly A
+
A
A
n
+
A
{ τA = n} = { X1 ∉ A, … , Xn−1 ∉ A, Xn ∈ A}
(16.4.3)
When A = {x} for x ∈ S , we will simplify the notation to τ . This random variable gives the first positive time that the chain is in state x. When the chain enters a set of states A for the first time, the chain must visit some state in A for the first time, so it's clear that x
τA = min{ τx : x ∈ A},
A ⊆S
(16.4.4)
Next we define two functions on S that are related to the hitting times. For x ∈ S , A ⊆ S (nonempty), and n ∈ N
+
1. H (x, A) = P(τ 2. H (x, A) = P(τ n
A
A
So H (x, A) = ∑
define
= n ∣ X0 = x)
< ∞ ∣ X0 = x)
∞ n=1
Hn (x, A)
.
Note that n ↦ H (x, A) is the probability density function of τ , given X = x , except that the density function may be defective in the sense that the sum H (x, A) may be less than 1, in which case of course, 1 − H (x, A) = P(τ = ∞ ∣ X = x) . Again, when A = {y} , we will simplify the notation to H (x, y) and H (x, y), respectively. In particular, H (x, x) is the probability, starting at x, that the chain eventually returns to x. If x ≠ y , H (x, y) is the probability, starting at x, that the chain eventually reaches y . Just knowing when H (x, y) is 0, positive, and 1 will turn out to be of considerable importance in the overall structure and limiting behavior of the chain. As a function on S , we will refer to H as the hitting matrix of X. Note however, that unlike the transition matrix P , we do not have the structure of a kernel. That is, A ↦ H (x, A) is not a measure, so in particular, it is generally not true that H (x, A) = ∑ H (x, y). The same remarks apply to H for n ∈ N . However, there are interesting relationships between the transition matrix and the hitting matrix. n
A
0
A
0
n
2
n
y∈A
H (x, y) > 0
if and only if P
n
(x, y) > 0
+
for some n ∈ N . +
Proof The following result gives a basic relationship between the sequence of hitting probabilities and the sequence of transition probabilities.
16.4.1
https://stats.libretexts.org/@go/page/10291
Suppose that (x, y) ∈ S . Then 2
n
P
n
(x, y) = ∑ Hk (x, y)P
n−k
(y, y),
n ∈ N+
(16.4.6)
k=1
Proof Suppose that x ∈ S and A ⊆ S . Then 1. H (x, A) = ∑ P (x, y)H (y, A) for n ∈ N 2. H (x, A) = P (x, A) + ∑ P (x, y)H (y, A) n+1
n
y∉A
+
y∉A
Proof The following definition is fundamental for the study of Markov chains. Let x ∈ S . 1. State x is recurrent if H (x, x) = 1. 2. State x is transient if H (x, x) < 1. Thus, starting in a recurrent state, the chain will, with probability 1, eventually return to the state. As we will see, the chain will return to the state infinitely often with probability 1, and the times of the visits will form the arrival times of a renewal process. This will turn out to be the critical observation in the study of the limiting behavior of the chain. By contrast, if the chain starts in a transient state, then there is a positive probability that the chain will never return to the state.
Counting Variables and Potentials Again, suppose that A is a nonempty set of states. A natural complement to the hitting time to A is the counting variable that gives the number of visits to A (at positive times). Thus, let ∞
NA = ∑ 1(Xn ∈ A)
(16.4.12)
n=1
Note that N takes value in N ∪ {∞} . We will mostly be interested in the special case will simplify the notation to N . A
A = {x}
for
x ∈ S
, and in this case, we
x
Let G(x, A) = E(N
A
∣ X0 = x)
for x ∈ S and A ⊆ S . Then G is a kernel on S and ∞
G(x, A) = ∑ P
n
(x, A)
(16.4.13)
n=1
Proof Thus G(x, A) is the expected number of visits to A at positive times. As usual, when A = {y} for y ∈ S we simplify the notation to G(x, y), and then more generally we have G(x, A) = ∑ G(x, y) for A ⊆ S . So, as a matrix on S , G = ∑ P . The matrix G is closely related to the potential matrix R of X, given by R = ∑ P . So R = I + G , and R(x, y) gives the expected number of visits to y ∈ S at all times (not just positive times), starting at x ∈ S . The matrix G is more useful for our purposes in this section. ∞
y∈A
n
n=1
∞
n
n=0
The distribution of N has a simple representation in terms of the hitting probabilities. Note that because of the Markov property and time homogeneous property, whenever the chain reaches state y , the future behavior is independent of the past and is stochastically the same as the chain starting in state y at time 0. This is the critical observation in the proof of the following theorem. y
If x,
y ∈ S
1. P(N 2. P(N
then
y
= 0 ∣ X0 = x) = 1 − H (x, y)
y
= n ∣ X0 = x) = H (x, y)[H (y, y)]
n−1
[1 − H (y, y)]
for n ∈ N
+
Proof
16.4.2
https://stats.libretexts.org/@go/page/10291
Figure 16.4.1 : Visits to state y starting in state x
The essence of the proof is illustrated in the graphic above. The thick lines are intended as reminders that these are not one step transitions, but rather represent all paths between the given vertices. Note that in the special case that x = y we have n
P(Nx = n ∣ X0 = x) = [H (x, x)] [1 − H (x, x)],
n ∈ N
(16.4.15)
In all cases, the counting variable N has essentially a geometric distribution, but the distribution may well be defective, with some of the probability mass at ∞. The behavior is quite different depending on whether y is transient or recurrent. y
If x,
y ∈ S
and y is transient then
1. P(N < ∞ ∣ X = x) = 1 2. G(x, y) = H (x, y)/[1 − H (y, y)] 3. H (x, y) = G(x, y)/[1 + G(y, y)] y
0
Proof if x,
y ∈ S
and y is recurrent then
1. P(N = 0 ∣ X = x) = 1 − H (x, y) and P(N = ∞ ∣ X = x) = H (x, y) 2. G(x, y) = 0 if H (x, y) = 0 and G(x, y) = ∞ if H (x, y) > 0 3. P(N = ∞ ∣ X = y) = 1 and G(y, y) = ∞ y
0
y
y
0
0
Proof Note that there is an invertible relationship between the matrix H and the matrix G; if we know one we can compute the other. In particular, we can characterize the transience or recurrence of a state in terms of G. Here is our summary so far: Let x ∈ S . 1. State x is transient if and only if H (x, x) < 1 if and only if G(x, x) < ∞. 2. State x is recurrent if and only if H (x, x) = 1 if and only if G(x, x) = ∞. Of course, the classification also holds for the potential matrix R(x, x) < ∞ and state x is recurrent if and only if R(x, x) = ∞.
R = I +G
. That is, state
x ∈ S
is transient if and only if
Relations The hitting probabilities suggest an important relation on the state space S . For (x, y) ∈ S , we say that x leads to y and we write x → y if either x = y or H (x, y) > 0. 2
It follows immediately from the result above that x → y if and only if P (x, y) > 0 for some n ∈ N . In terms of the state graph of the chain, x → y if and only if x = y or there is a directed path from x to y . Note that the leads to relation is reflexive by definition: x → x for every x ∈ S . The relation has another important property as well. n
The leads to relation is transitive: For x,
y, z ∈ S
, if x → y and y → z then x → z .
Proof The leads to relation naturally suggests a couple of other definitions that are important. Suppose that A ⊆ S is nonempty. 1. A is closed if x ∈ A and x → y implies y ∈ A . 2. A is irreducible if A is closed and has no proper closed subsets.
16.4.3
https://stats.libretexts.org/@go/page/10291
Suppose that A ⊆ S is closed. Then 1. P , the restriction of P to A × A , is a transition probability matrix on A . 2. X restricted to A is a Markov chain with transition probability matrix P . 3. (P ) = (P ) for n ∈ N . A
A
n
n
A
A
Proof Of course, the entire state space S is closed by definition. If it is also irreducible, we say the Markov chain X itself is irreducible. Recall that for a nonempty subset A of S and for n ∈ N , the notation P refers to (P ) and not (P ) . In general, these are not the same, and in fact for x, y ∈ A, n
n
P
n
A
n
A
A
A
(x, y) = P(X1 ∈ A, … , Xn−1 ∈ A, Xn = y ∣ X0 = x)
(16.4.17)
the probability of going from x to y in n steps, remaining in A all the while. But if A is closed, then as noted in part (c), this is just P (x, y). n
Suppose that A is a nonempty subset of S . Then containing A , and is called the closure of A . That is,
cl(A) = {y ∈ S : x → y for some x ∈ A}
is the smallest closed set
1. cl(A) is closed. 2. A ⊆ cl(A) . 3. If B is closed and A ⊆ B then cl(A) ⊆ B Proof Recall that for a fixed positive integer k , P is also a transition probability matrix, and in fact governs the k -step Markov chain (X , X , X , …). It follows that we could consider the leads to relation for this chain, and all of the results above would still hold (relative, of course, to the k -step chain). Occasionally we will need to consider this relation, which we will denote by →, k
0
k
2k
k
particularly in our study of periodicity. Suppose that j,
k ∈ N+
. If x
→ y
and j ∣ k then x
k
→ y
.
j
Proof By combining the leads to relation → with its inverse, the comes from relation ←, we can obtain another very useful relation. For (x, y) ∈ S , we say that x to and from y and we write x ↔ y if x → y and y → x . 2
By definition, this relation is symmetric: if x ↔ y then y ↔ x . From our work above, it is also reflexive and transitive. Thus, the to and from relation is an equivalence relation. Like all equivalence relations, it partitions the space into mutually disjoint equivalence classes. We will denote the equivalence class of a state x ∈ S by [x] = {y ∈ S : x ↔ y}
Thus, for any two states x,
y ∈ S
(16.4.18)
, either [x] = [y] or [x] ∩ [y] = ∅ , and moreover, ⋃
x∈S
[x] = S
.
Figure 16.4.2 : The equivalence relation partitions S into mutually disjoint equivalence classes
Two negative results: 1. A closed set is not necessarily an equivalence class. 2. An equivalence class is not necessarily closed. Example On the other hand, we have the following result:
16.4.4
https://stats.libretexts.org/@go/page/10291
If A ⊆ S is irreducible, then A is an equivalence class. Proof The to and from equivalence relation is very important because many interesting state properties turn out in fact to be class properties, shared by all states in a given equivalence class. In particular, the recurrence and transience properties are class properties.
Transient and Recurrent Classes Our next result is of fundamental importance: a recurrent state can only lead to other recurrent states. If x is a recurrent state and x → y then y is recurrent and H (x, y) = H (y, x) = 1 . Proof From the last theorem, note that if x is recurrent, then all states in [x] are also recurrent. Thus, for each equivalence class, either all states are transient or all states are recurrent. We can therefore refer to transient or recurrent classes as well as states. If A is a recurrent equivalence class then A is irreducible. Proof If A is finite and closed then A has a recurrent state. Proof If A is finite and irreducible then A is a recurrent equivalence class. Proof Thus, the Markov chain X will have a collection (possibly empty) of recurrent equivalence classes {A : j ∈ J} where J is a countable index set. Each A is irreducible. Let B denote the set of all transient states. The set B may be empty or may consist of a number of equivalence classes, but the class structure of B is usually not important to us. If the chain starts in A for some j ∈ J then the chain remains in A forever, visiting each state infinitely often with probability 1. If the chain starts in B , then the chain may stay in B forever (but only if B is infinite) or may enter one of the recurrent classes A , never to escape. However, in either case, the chain will visit a given transient state only finitely many time with probability 1. This basic structure is known as the canonical decomposition of the chain, and is shown in graphical form below. The edges from B are in gray to indicate that these transitions may not exist. j
j
j
j
j
Figure 16.4.3 : The canonical decomposition of the state space
Staying Probabilities and a Classification Test Suppose that A is a proper subset of S . Then 1. P (x, A) = P(X ∈ A, X 2. lim P (x, A) = P(X n
1
A
2
n
n→∞
A
1
∈ A, … , Xn ∈ A ∣ X0 = x) ∈ A, X2 ∈ A … ∣ X0 = x)
for x ∈ A for x ∈ A
Proof Let g denote the function defined by part (b), so that A
gA (x) = P(X1 ∈ A, X2 ∈ A, … ∣ X0 = x),
x ∈ A
(16.4.20)
The staying probability function g is an interesting complement to the hitting matrix studied above. The following result characterizes this function and provides a method that can be used to compute it, at least in some cases. A
16.4.5
https://stats.libretexts.org/@go/page/10291
For
A ⊂S
,
gA
is the largest function on .
A
that takes values in
[0, 1]
and satisfies
g = PA g
. Moreover, either
or
gA = 0A
sup{ gA (x) : x ∈ A} = 1
Proof Note that the characterization in the last result includes a zero-one law of sorts: either the probability that the chain stays in A forever is 0 for every initial state x ∈ A , or we can find states in A for which the probability is arbitrarily close to 1. The next two results explore the relationship between the staying function and recurrence. Suppose that X is an irreducible, recurrent chain with state space S . Then g
A
= 0A
for every proper subset A of S .
Proof Suppose that X is an irreducible Markov chain with state space S and transition probability matrix P . If there exists a state x such that g = 0 where A = S ∖ {x} , then X is recurrent. A
A
Proof More generally, suppose that X is a Markov chain with state space S and transition probability matrix P . The last two theorems can be used to test whether an irreducible equivalence class C is recurrent or transient. We fix a state x ∈ C and set A = C ∖ {x} . We then try to solve the equation g = P g on A . If the only solution taking values in [0, 1] is 0 , then the class C is recurrent by the previous result. If there are nontrivial solutions, then C is transient. Often we try to choose x to make the computations easy. A
A
Computing Hitting Probabilities and Potentials We now know quite a bit about Markov chains, and we can often classify the states and compute quantities of interest. However, we do not yet know how to compute: when x and y are transient H (x, y) when x is transient and y is transient or recurrent. G(x, y)
These problems are related, because of the general inverse relationship between the matrix H and the matrix G noted in our discussion above. As usual, suppose that X is a Markov chain with state space S , and let B denote the set of transient states. The next result shows how to compute G , the matrix G restricted to the transient states. Recall that the values of this matrix are finite. B
GB GB
satisfies the equation = (I −P ) P .
GB = PB + PB GB
and is the smallest nonnegative solution. If
B
is finite then
−1
B
B
B
Proof Now that we can compute G , we can also compute H using the result above. All that remains is for us to compute the hitting probability H (x, y) when x is transient and y is recurrent. The first thing to notice is that the hitting probability is a class property. B
B
Suppose that x is transient and that A is a recurrent class. Then H (x, y) = H (x, A) for y ∈ A . That is, starting in the transient state x ∈ S , the hitting probability to y is constant for y ∈ A , and is just the hitting probability to the class A . As before, let B denote the set of transient states and suppose that A is a recurrent equivalence class. Let h denote the function on B that gives the hitting probability to class A , and let p denote the function on B that gives the probability of entering A on the first step: A
A
hA (x) = H (x, A), pA (x) = P (x, A), hA = pA + GB pA
x ∈ B
(16.4.21)
.
Proof This result is adequate if we have already computed G (using the result in above, for example). However, we might just want to compute h directly. B
A
hA
satisfies the equation h
A
= pA + PB hA
and is the smallest nonnegative solution. If B is finite, h
A
−1
= (IB − PB )
pA
.
Proof
16.4.6
https://stats.libretexts.org/@go/page/10291
Examples and Applications Finite Chains Consider a Markov chain with state space S = {a, b, c, d} and transition matrix P given below: ⎡
1
2
2
3
⎢ ⎢ 1 P =⎢ ⎢ 0 ⎢ ⎣
0
0 ⎤ ⎥ 0 ⎥ ⎥ 0 ⎥ ⎥
0
0
0
1
1
1
1
1
4
4
4
4
(16.4.22)
⎦
1. Draw the state graph. 2. Find the equivalent classes and classify each as transient or recurrent. 3. Compute the matrix G. 4. Compute the matrix H . Answer Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below: 1
⎡ 0
0
⎢ 0 ⎢ ⎢ 1 ⎢ ⎢ 4 P =⎢ ⎢ 0 ⎢ ⎢ ⎢ 0 ⎢
0
⎣ 0
1
1
1
4
4
4
0 0 0
2
0 1 2
0 1 3
0 0 0 1 0
1 2
0 1 4
0 2 3
0
0 ⎤ 1 ⎥ ⎥ ⎥ 0 ⎥ ⎥ ⎥ 0 ⎥ ⎥ ⎥ 0 ⎥ ⎥ 1
(16.4.23)
⎦
4
1. Sketch the state graph. 2. Find the equivalence classes and classify each as recurrent or transient. 3. Compute the matrix G. 4. Compute the matrix H . Answer Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below: 1
1
2
2
⎢ 1 ⎢ 4 ⎢ ⎢ 1 ⎢ 4 P =⎢ ⎢ 1 ⎢ ⎢ 4 ⎢ ⎢ 0 ⎢
3
⎡
⎣
0
4
0 0
0
0
0
0
0
0
1
1
2
4
1
1
4
4
0
0
0
0
0
0
0 ⎤
2
⎥ 0 ⎥ ⎥ ⎥ 0 ⎥ ⎥ 1 ⎥ ⎥ 4 ⎥ ⎥ 1 ⎥ 2 ⎥
1
1
2
2
0 0 1
(16.4.24)
⎦
1. Sketch the state graph. 2. Find the equivalence classes and classify each as recurrent or transient. 3. Compute the matrix G. 4. Compute the matrix H . Answer
Special Models Read again the definitions of the Ehrenfest chains and the Bernoulli-Laplace chains. Note that since these chains are irreducible and have finite state spaces, they are recurrent. Read the discussion on recurrence in the section on the reliability chains.
16.4.7
https://stats.libretexts.org/@go/page/10291
Read the discussion on random walks on Z in the section on the random walks on graphs. k
Read the discussion on extinction and explosion in the section on the branching chain. Read the discussion on recurrence and transience in the section on queuing chains. Read the discussion on recurrence and transience in the section on birth-death chains. This page titled 16.4: Transience and Recurrence for Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.4.8
https://stats.libretexts.org/@go/page/10291
16.5: Periodicity of Discrete-Time Chains A state in a discrete-time Markov chain is periodic if the chain can return to the state only at multiples of some integer larger than 1. Periodic behavior complicates the study of the limiting behavior of the chain. As we will see in this section, we can eliminate the periodic behavior by considering the d -step chain, where d ∈ N is the period, but only at the expense of introducing additional equivalence classes. Thus, in a sense, we can trade one form of complexity for another. +
Basic Theory Definitions and Basic Results As usual, our starting point is a (time homogeneous) discrete-time Markov chain space S and transition probability matrix P .
X = (X0 , X1 , X2 , …)
with (countable) state
The period of state x ∈ S is d(x) = gcd{n ∈ N+ : P
n
(x, x) > 0}
(16.5.1)
State x is aperiodic if d(x) = 1 and periodic if d(x) > 1 . Thus, starting in x, the chain can return to x only at multiples of the period d , and d is the largest such integer. Perhaps the most important result is that period, like recurrence and transience, is a class property, shared by all states in an equivalence class under the to and from relation. If x ↔ y then d(x) = d(y) . Proof Thus, the definitions of period, periodic, and aperiodic apply to equivalence classes as well as individual states. When the chain is irreducible, we can apply these terms to the entire chain. Suppose that x ∈ S . If P (x, x) > 0 then x (and hence the equivalence class of x) is aperiodic. Proof The converse is not true, of course. A simple counterexample is given below.
The Cyclic Classes Suppose now that X = (X , X , X , …) is irreducible and is periodic with period d . There is no real loss in generality in assuming that the chain is irreducible, for if this were not the case, we could simply restrict our attention to one of the irreducible equivalence classes. Our exposition will be easier and cleaner if we recall the congruence equivalence relation modulo d on Z, which in turn is based on the divison partial order. For n, m ∈ Z, n ≡ m if and only if d ∣ (n − m) , equivalently n − m is an integer multiple of d , equivalently m and n have the same remainder after division by d . The basic fact that we will need is that ≡ is preserved under sums and differences. That is, if m, n, p, q ∈ Z and if m ≡ n and p ≡ q , then m + p ≡ n + q and m −p ≡ n−q . 0
1
2
d
d
d
d
d
d
Now, we fix a reference state u ∈ S , and for k ∈ {0, 1, … , d − 1} , define Ak = {x ∈ S : P
nd+k
That is, x ∈ A if and only if there exists m ∈ N with m ≡ k
d
Suppose that x,
y ∈ S
(u, x) > 0 for some n ∈ N}
k
and P
m
(16.5.2)
.
(u, x) > 0
.
1. If x ∈ A and y ∈ A for j, k ∈ {0, 1, … , d − 1} then P (x, y) > 0 for some n ≡ k − j 2. Conversely, if P (x, y) > 0 for some n ∈ N , then there exists j, k ∈ {0, 1, … d − 1} such that x ∈ A , y ∈ A and n ≡ k−j . 3. The sets (A , A , … , A ) partition S . n
j
k
d
n
j
k
d
0
1
k−1
Proof
16.5.1
https://stats.libretexts.org/@go/page/10292
(A0 , A1 , … , Ad−1 )
are the equivalence classes for the
d
-step to and from relation
↔
that governs the
d
-step chain
d
(X0 , Xd , X2d , …)
The sets (A
0,
that has transition matrix P .
A1 , … , Ad−1 )
d
are known as the cyclic classes. The basic structure of the chain is shown in the state diagram below:
Figure 16.5.1 : The cyclic classes of a chain with period d
Examples and Special Cases Finite Chains Consider the Markov chain with state space S = {a, b, c} and transition matrix P given below: ⎡0 P =⎢0 ⎣
1
1
2
3
3
0
1 ⎥
0
0
⎤ (16.5.3)
⎦
1. Sketch the state graph and show that the chain is irreducible. 2. Show that the chain is aperiodic. 3. Note that P (x, x) = 0 for all x ∈ S . Answer Consider the Markov chain with state space S = {1, 2, 3, 4, 5, 6, 7} and transition matrix P given below: ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ P =⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
0
0
1
1
1
2
4
4
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
2
2
1
3
4
4
3
0
2 3
0
0
0
0
1 3 1 2 3
0
0
0
0
0
0
0
0
0
0
0
4
⎤
⎥ ⎥ ⎥ 2 ⎥ ⎥ 3 ⎥ 1 ⎥ ⎥ 2 ⎥ 1 ⎥ ⎥ 4 ⎥ ⎥ 0 ⎥ ⎥
(16.5.4)
0 ⎦
1. Sketch the state graph and show that the chain is irreducible. 2. Find the period d . 3. Find P . 4. Identify the cyclic classes. d
Answer
Special Models Review the definition of the basic Ehrenfest chain. Show that this chain has period 2, and find the cyclic classes. Review the definition of the modified Ehrenfest chain. Show that this chain is aperiodic.
16.5.2
https://stats.libretexts.org/@go/page/10292
Review the definition of the simple random walk on classes.
k
Z
. Show that the chain is periodic with period 2, and find the cyclic
This page titled 16.5: Periodicity of Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.5.3
https://stats.libretexts.org/@go/page/10292
16.6: Stationary and Limiting Distributions of Discrete-Time Chains In this section, we study some of the deepest and most interesting parts of the theory of discrete-time Markov chains, involving two different but complementary ideas: stationary distributions and limiting distributions. The theory of renewal processes plays a critical role.
Basic Theory As usual, our starting point is a (time homogeneous) discrete-time Markov chain X = (X , X , X , …) with (countable) state space S and transition probability matrix P . In the background, of course, is a probability space (Ω, F , P) so that Ω is the sample space, F the σ-algebra of events, and P the probability measure on (Ω, F ). For n ∈ N , let F = σ{X , X , … , X }, the σalgebra of events determined by the chain up to time n , so that F = {F , F , …} is the natural filtration associated with X. 0
1
2
n
0
0
1
n
1
The Embedded Renewal Process Let y ∈ S and n ∈ N . We will denote the number of visits to y during the first n positive time units by +
n
Ny,n = ∑ 1(Xi = y)
(16.6.1)
i=1
Note that N
y,n
→ Ny
as n → ∞ , where ∞
Ny = ∑ 1(Xi = y)
(16.6.2)
i=1
is the total number of visits to y at positive times, one of the important random variables that we studied in the section on transience and recurrence. For n ∈ N , we denote the time of the n th visit to y by +
τy,n = min{k ∈ N+ : Ny,k = n}
(16.6.3)
where as usual, we define min(∅) = ∞ . Note that τ is the time of the first visit to y , which we denoted simply by τ in the section on transience and recurrence. The times of the visits to y are stopping times for X. That is, {τ = k} ∈ F for n ∈ N and k ∈ N . Recall also the definition of the hitting probability to state y starting in state x: y,1
y
y,n
H (x, y) = P (τy < ∞ ∣ X0 = x) ,
Suppose that x,
y ∈ S
, and that y is recurrent and X
0
=x
(x, y) ∈ S
2
k
+
(16.6.4)
.
1. If x = y , then the successive visits to y form a renewal process. 2. If x ≠ y but x → y , then the successive visits to y form a delayed renewal process. Proof As noted in the proof, (τ , τ , …) is the sequence of arrival times and (N , N , …) is the associated sequence of counting variables for the embedded renewal process associated with the recurrent state y . The corresponding renewal function, given X = x , is the function n ↦ G (x, y) where y,1
0
y,2
y,1
y,2
n
n
Gn (x, y) = E (Ny,n ∣ X0 = x) = ∑ P
k
(x, y),
n ∈ N
(16.6.5)
k=1
Thus
is the expected number of visits to y in the first n positive time units, starting in state x. Note that as n → ∞ where G is the potential matrix that we studied previously. This matrix gives the expected total number visits to state y ∈ S , at positive times, starting in state x ∈ S : Gn (x, y)
Gn (x, y) → G(x, y)
∞
G(x, y) = E (Ny ∣ X0 = x) = ∑ P
k
(x, y)
(16.6.6)
k=1
16.6.1
https://stats.libretexts.org/@go/page/10293
Limiting Behavior The limit theorems of renewal theory can now be used to explore the limiting behavior of the Markov chain. Let μ(y) = E(τ ∣ X = y) denote the mean return time to state y , starting in y . In the following results, it may be the case that μ(y) = ∞ , in which case we interpret 1/μ(y) as 0. y
If x,
0
y ∈ S
and y is recurrent then 1 P( n
1 Nn,y → μ(y)
∣ as n → ∞ ∣ X0 = x) = H (x, y) ∣
(16.6.7)
Proof Note that If x,
1 n
Ny,n =
y ∈ S
1 n
n
∑k=1 1(Xk = y)
is the average number of visits to y in the first n positive time units.
and y is recurrent then 1 n
n
1 Gn (x, y) =
∑P n
k
H (x, y) (x, y) →
as n → ∞
(16.6.8)
μ(y)
k=1
Proof Note that G starting at x. 1
n (x,
n
If x,
y ∈ S
y) =
1 n
n
∑
k=1
P
k
(x, y)
is the expected average number of visits to
y
during the first
n
positive time units,
and y is recurrent and aperiodic then P
n
H (x, y) (x, y) →
as n → ∞
(16.6.9)
μ(y)
Proof Note that H (y, y) = 1 by the very definition of a recurrent state. Thus, when x = y , the law of large numbers above gives convergence with probability 1, and the first and second renewal theory limits above are simply 1/μ(y). By contrast, we already know the corresponding limiting behavior when y is transient. If x,
y ∈ S
and y is transient then
1. P ( N → 0 as n → ∞ ∣ X = x) = 1 2. G (x, y) = ∑ P (x, y) → 0 as n → ∞ 3. P (x, y) → 0 as n → ∞ 1
n
y,n
1
n
n
0
1
n
n
k=1
k
n
Proof On the other hand, if y is transient then P(τ = ∞ ∣ X = y) > 0 by the very definition of a transience. Thus μ(y) = ∞ , and so the results in parts (b) and (c) agree with the corresponding results above for a recurrent state. Here is a summary. y
For x,
y ∈ S
0
, 1
1 Gn (x, y) =
n
n
∑P n
k
H (x, y) (x, y) →
as n → ∞
(16.6.11)
μ(y)
k=1
If y is transient or if y is recurrent and aperiodic, P
n
H (x, y) (x, y) →
as n → ∞
(16.6.12)
μ(y)
Positive and Null Recurrence Clearly there is a fundamental dichotomy in terms of the limiting behavior of the chain, depending on whether the mean return time to a given state is finite or infinite. Thus the following definition is natural. Let x ∈ S .
16.6.2
https://stats.libretexts.org/@go/page/10293
1. State x is positive recurrent if μ(x) < ∞ . 2. If x is recurrent but μ(x) = ∞ then state x is null recurrent. Implicit in the definition is the following simple result: If x ∈ S is positive recurrent, then x is recurrent. Proof On the other hand, it is possible to have P(τ < ∞ ∣ X = x) = 1 , so that x is recurrent, and also E(τ ∣ X = x) = ∞ , so that x is null recurrent. Simply put, a random variable can be finite with probability 1, but can have infinite expected value. A classic example is the Pareto distribution with shape parameter a ∈ (0, 1). x
0
x
0
Like recurrence/transience, and period, the null/positive recurrence property is a class property. If x is positive recurrent and x → y then y is positive recurrent. Proof Thus, the terms positive recurrent and null recurrent can be applied to equivalence classes (under the to and from equivalence relation), as well as individual states. When the chain is irreducible, the terms can be applied to the chain as a whole. Recall that a nonempty set of states closed set of states.
A
is closed if
x ∈ A
and
x → y
implies
y ∈ A
. Here are some simple results for a finite,
If A ⊆ S is finite and closed, then A contains a positive recurrent state. Proof If A ⊆ S is finite and closed, then A contains no null recurrent states. Proof If A ⊆ S is finite and irreducible, then A is a positive recurrent equivalence class. Proof In particular, a Markov chain with a finite state space cannot have null recurrent states; every state must be transient or positive recurrent.
Limiting Behavior, Revisited Returning to the limiting behavior, suppose that the chain X is irreducible, so that either all states are transient, all states are null recurrent, or all states are positive recurrent. From the basic limit theorem above, if the chain is transient or if the chain is recurrent and aperiodic, then P
n
1 (x, y) →
as n → ∞ for every x ∈ S
(16.6.16)
μ(y)
Note in particular that the limit is independent of the initial state x. Of course in the transient case and in the null recurrent and aperiodic case, the limit is 0. Only in the positive recurrent, aperiodic case is the limit positive, which motivates our next definition. A Markov chain X that is irreducible, positive recurrent, and aperiodic, is said to be ergodic. In the ergodic case, as we will see, X has a limiting distribution as n → ∞ that is independent of the initial distribution. n
The behavior when the chain is periodic with period d ∈ {2, 3, …} is a bit more complicated, but we can understand this behavior by considering the d -step chain X = (X , X , X , …) that has transition matrix P . Essentially, this allows us to trade periodicity (one form of complexity) for reducibility (another form of complexity). Specifically, recall that the d -step chain is aperiodic but has d equivalence classes (A , A , … , A ); and these are the cyclic classes of original chain X. d
d
0
0
d
1
2d
d−1
16.6.3
https://stats.libretexts.org/@go/page/10293
Figure 16.6.1 : The cyclic classes of a chain with period d
The mean return time to state x for the d -step chain X is μ d
d (x)
= μ(x)/d
.
Proof Let i, 1. P 2. P
j, k ∈ {0, 1, … , d − 1} nd+k nd+k
,
as n → ∞ if x ∈ A and y ∈ A and j = (i + k) as n → ∞ in all other cases.
(x, y) → d/μ(y) (x, y) → 0
i
j
mod d
.
Proof If y ∈ S is null recurrent or transient then regardless of the period of y , P
n
(x, y) → 0
as n → ∞ for every x ∈ S .
Invariant Distributions Our next goal is to see how the limiting behavior is related to invariant distributions. Suppose that f is a probability density function on the state space S . Recall that f is invariant for P (and for the chain X) if f P = f . It follows immediately that fP = f for every n ∈ N . Thus, if X has probability density function f then so does X for each n ∈ N , and hence X is a sequence of identically distributed random variables. A bit more generally, suppose that g : S → [0, ∞) is invariant for P , and let C =∑ g(x) . If 0 < C < ∞ then f defined by f (x) = g(x)/C for x ∈ S is an invariant probability density function. n
0
n
x∈S
Suppose that g : S → [0, ∞) is invariant for P and satisfies ∑
x∈S
g(x) < ∞
. Then
1 g(y) =
∑ g(x)H (x, y), μ(y)
y ∈ S
(16.6.17)
x∈S
Proof Note that if y is transient or null recurrent, then g(y) = 0 . Thus, a invariant function with finite sum, and in particular an invariant probability density function must be concentrated on the positive recurrent states. Suppose now that the chain X is irreducible. If X is transient or null recurrent, then from the previous result, the only nonnegative functions that are invariant for P are functions that satisfy ∑ g(x) = ∞ and the function that is identically 0: g = 0 . In particular, the chain does not have an invariant distribution. On the other hand, if the chain is positive recurrent, then H (x, y) = 1 for all x, y ∈ S . Thus, from the previous result, the only possible invariant probability density function is the function f given by f (x) = 1/μ(x) for x ∈ S . Any other nonnegative function g that is invariant for P and has finite sum, is a multiple of f (and indeed the multiple is sum of the values). Our next goal is to show that f really is an invariant probability density function. x∈S
If X is an irreducible, positive recurrent chain then the function probability density function for X.
f
given by
f (x) = 1/μ(x)
for
x ∈ S
is an invariant
Proof In summary, an irreducible, positive recurrent Markov chain X has a unique invariant probability density function f given by f (x) = 1/μ(x) for x ∈ S . We also now have a test for positive recurrence. An irreducible Markov chain X is positive recurrent if and only if there exists a positive function g on S that is invariant for P and satisfies ∑ g(x) < ∞ (and then, of course, normalizing g would give f ). x∈S
16.6.4
https://stats.libretexts.org/@go/page/10293
Consider now a general Markov chain X on S . If X has no positive recurrent states, then as noted earlier, there are no invariant distributions. Thus, suppose that X has a collection of positive recurrent equivalence classes (A : i ∈ I ) where I is a nonempty, countable index set. The chain restricted to A is irreducible and positive recurrent for each i ∈ I , and hence has a unique invariant probability density function f on A given by i
i
i
i
1 fi (x) =
, μ(x)
We extend f to S by defining f (x) = 0 for x ∉ A , so that density functions for X are mixtures of these functions: i
f
i
i
fi
x ∈ Ai
(16.6.21)
is a probability density function on
S
. All invariant probability
is an invariant probability density function for X if and only if f has the form f (x) = ∑ pi fi (x),
x ∈ S
(16.6.22)
i∈I
where
is a probability density function on the index set I . That is, otherwise.
(pi : i ∈ I )
f (x) = 0
f (x) = pi fi (x)
for
i ∈ I
and
x ∈ Ai
, and
Proof
Invariant Measures Suppose that X is irreducible. In this section we are interested in general functions g : S → [0, ∞) that are invariant for X, so that gP = g . A function g : S → [0, ∞) defines a positive measure ν on S by the simple rule ν (A) = ∑ g(x),
A ⊆S
(16.6.27)
x∈A
so in this sense, we are interested in invariant positive measures for X that may not be probability measures. Technically, g is the density function of ν with respect to counting measure # on S . From our work above, We know the situation if X is positive recurrent. In this case, there exists a unique invariant probability density function f that is positive on S , and every other nonnegative invariant function g is a nonnegative multiples of f . In particular, either g = 0 , the zero function on S , or g is positive on S and satisfies ∑ g(x) < ∞ . x∈S
We can generalize to chains that are simply recurrent, either null or positive. We will show that there exists a positive invariant function that is unique, up to multiplication by positive constants. To set up the notation, recall that τ = min{k ∈ N : X = x} is the first positive time that the chain is in state x ∈ S . In particular, if the chain starts in x then τ is the time of the first return to x. For x ∈ S we define the function γ by x
+
k
x
x
τx −1
∣ γx (y) = E ( ∑ 1(Xn = y) ∣ X0 = x) , ∣
y ∈ S
(16.6.28)
n=0
so that γ
x (y)
is the expected number of visits to y before the first return to x, starting in x. Here is the existence result.
Suppose that X is recurrent. For x ∈ S , 1. γ (x) = 1 2. γ is invariant for X 3. γ (y) ∈ (0, ∞) for y ∈ S . x x x
Proof Next is the uniqueness result. Suppose again that X is recurrent and that g : S → [0, ∞) is invariant for X. For fixed x ∈ S , g(y) = g(x)γx (y),
y ∈ S
(16.6.33)
Proof Thus, suppose that X is null recurrent. Then there exists an invariant function g that is positive on S and satisfies ∑ g(x) = ∞ . Every other nonnegative invariant function is a nonnegative multiple of g . In particular, either g = 0 , the zero x∈S
16.6.5
https://stats.libretexts.org/@go/page/10293
function on S , or g is positive on S and satisfies invariant function for a null recurrent chain.
∑
x∈S
g(x) = ∞
. The section on reliability chains gives an example of the
The situation is complicated when X is transient. In this case, there may or may not exist nonnegative invariant functions that are not identically 0. When they do exist, they may not be unique (up to multiplication by nonnegative constants). But we still know that there are no invariant probability density functions, so if g is a nonnegative function that is invariant for X then either g = 0 or ∑ g(x) = ∞ . The section on random walks on graphs provides lots of examples of transient chains with nontrivial invariant functions. In particular, the non-symmetric random walk on Z has a two-dimensional space of invariant functions. x∈S
Examples and Applications Finite Chains Consider again the general two-state chain on S = {0, 1} with transition probability matrix given below, where p ∈ (0, 1) and q ∈ (0, 1) are parameters. 1 −p
p
q
1 −q
P =[
]
(16.6.39)
1. Find the invariant distribution. 2. Find the mean return time to each state. 3. Find lim P without having to go to the trouble of diagonalizing P , as we did in the introduction to discrete-time chains. n
n→∞
Answer Consider a Markov chain with state space S = {a, b, c, d} and transition matrix P given below: 1
2
3
3
⎢ 1 P =⎢ ⎢ ⎢ 0
⎡
⎣
0
0 ⎤
0
0
0
1
0 ⎥ ⎥ ⎥ 0 ⎥
1
1
1
1
4
4
4
4
(16.6.40)
⎦
1. Draw the state diagram. 2. Determine the equivalent classes and classify each as transient or positive recurrent. 3. Find all invariant probability density functions. 4. Find the mean return time to each state. 5. Find lim P . n
n→∞
Answer Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below: 1
0
0
⎢ 0 ⎢ ⎢ 1 ⎢ 4 P =⎢ ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢
0
⎣ 0
1
1
1
4
4
4
⎡
0 0 0
2
0 1 2
0 1 3
0 0 0 1 0
1 2
0 1 4
0 2 3
0
0
⎤
1 ⎥ ⎥ ⎥ 0 ⎥ ⎥ ⎥ 0 ⎥ ⎥ 0 ⎥ ⎥ 1
(16.6.41)
⎦
4
1. Sketch the state graph. 2. Find the equivalence classes and classify each as transient or positive recurrent. 3. Find all invariant probability density functions. 4. Find the mean return time to each state. 5. Find lim P . n
n→∞
Answer
16.6.6
https://stats.libretexts.org/@go/page/10293
Consider a Markov chain with state space S = {1, 2, 3, 4, 5, 6} and transition matrix P given below: ⎡
1
1
2
2
1
3
⎢ ⎢ 4 ⎢ ⎢ 1 ⎢ 4 P =⎢ ⎢ 1 ⎢ 4 ⎢ ⎢ 0 ⎢ ⎣
0
0
0
0
0
0
0
1
1
2
4
1
1
4
4
0
0
0
2
0 ⎥ ⎥ ⎥ 0 ⎥ ⎥ ⎥ 1 ⎥ 4 ⎥ ⎥ 1 ⎥ 2 ⎥
0
0
0
1
1
2
2
4
0 0
0
0 0 1
⎤
(16.6.42)
⎦
1. Sketch the state graph. 2. Find the equivalence classes and classify each as transient or positive recurrent. 3. Find all invariant probability density functions. 4. Find the mean return time to each state. 5. Find lim P . n
n→∞
Answer Consider the Markov chain with state space S = {1, 2, 3, 4, 5, 6, 7} and transition matrix P given below: ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ P =⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣
1
1
1
2
4
4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
2
2
1
3
4
4
1 3
0
2 3
0
0
0
0 ⎥ ⎥ ⎥ 2 ⎥ ⎥ 3 ⎥ 1 ⎥ ⎥ 2 ⎥ 1 ⎥ ⎥ 4 ⎥ ⎥ 0 ⎥ ⎥
1 3 1 2 3
0
0
0
0
0
0
0
0
0
0
0
4
⎤
(16.6.43)
0 ⎦
1. Sketch the state digraph, and show that the chain is irreducible with period 3. 2. Identify the cyclic classes. 3. Find the invariant probability density function. 4. Find the mean return time to each state. 5. Find lim P . 6. Find lim P . 7. Find lim P . n→∞ n→∞ n→∞
3n
3n+1 3n+2
Answer
Special Models Read the discussion of invariant distributions and limiting distributions in the Ehrenfest chains. Read the discussion of invariant distributions and limiting distributions in the Bernoulli-Laplace chain. Read the discussion of positive recurrence and invariant distributions for the reliability chains. Read the discussion of positive recurrence and limiting distributions for the birth-death chain. Read the discussion of positive recurrence and for the queuing chains. Read the discussion of positive recurrence and limiting distributions for the random walks on graphs.
16.6.7
https://stats.libretexts.org/@go/page/10293
This page titled 16.6: Stationary and Limiting Distributions of Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.6.8
https://stats.libretexts.org/@go/page/10293
16.7: Time Reversal in Discrete-Time Chains The Markov property, stated in the form that the past and future are independent given the present, essentially treats the past and future symmetrically. However, there is a lack of symmetry in the fact that in the usual formulation, we have an initial time 0, but not a terminal time. If we introduce a terminal time, then we can run the process backwards in time. In this section, we are interested in the following questions: Is the new process still Markov? If so, how does the new transition probability matrix relate to the original one? Under what conditions are the forward and backward processes stochastically the same? Consideration of these questions leads to reversed chains, an important and interesting part of the theory of Markov chains.
Basic Theory Reversed Chains Our starting point is a (homogeneous) discrete-time Markov chain X = (X , X , X , …) with (countable) state space S and transition probability matrix P . Let m be a positive integer, which we will think of as the terminal time or finite time horizon. We ^ won't bother to indicate the dependence on m notationally, since ultimately the terminal time will not matter. Define X =X for n ∈ {0, 1, … , m}. Thus, the process forward in time is X = (X , X , … , X ) while the process backwards in time is 0
1
2
n
0
1
m−n
m
^ ^ ^ ^ X = (X 0 , X 1 , … , X m ) = (Xm , Xm−1 , … , X0 )
(16.7.1)
^ ^ ^ ^ F n = σ{ X 0 , X 1 , … , X n } = σ{ Xm−n , Xm−n+1 , … , Xm }
(16.7.2)
For n ∈ {0, 1, … , m}, let
^ ^ denote the σ algebra of the events of the process X up to time n . So of course, an event for X up to time n is an event for X from time m − n forward. Our first result is that the reversed process is still a Markov chain, but not time homogeneous in general. ^ ^ ^ ^ The process X = (X , X , … , X ) is a Markov chain, but is not time homogenous in general. The one-step transition matrix at time n ∈ {0, 1, … , m − 1} is given by 0
1
m
^ ^ P(X n+1 = y ∣ X n = x) =
P(Xm−n−1 = y) P (y, x),
(x, y) ∈ S
2
(16.7.3)
P(Xm−n = x)
Proof However, the backwards chain will be time homogeneous if X has an invariant distribution. 0
Suppose that X is irreducible and positive recurrent, with (unique) invariant probability density function f . If ^ invariant probability distribution, then X is a time-homogeneous Markov chain with transition matrix P^ given by ^ P (x, y) =
f (y) P (y, x),
(x, y) ∈ S
X0
2
has the
(16.7.6)
f (x)
Proof Recall that a discrete-time Markov chain is ergodic if it is irreducible, positive recurrent, and aperiodic. For an ergodic chain, the previous result holds in the limit of the terminal time. Suppose that X is ergodic, with (unique) invariant probability density function f . Regardless of the distribution of X , 0
^ ^ P(X n+1 = y ∣ X n = x) →
f (y) P (y, x) as m → ∞
(16.7.7)
f (x)
Proof
16.7.1
https://stats.libretexts.org/@go/page/10294
These three results are motivation for the definition that follows. We can generalize by defining the reversal of an irreducible Markov chain, as long as there is a positive, invariant function. Recall that a positive invariant function defines a positive measure on S , but of course not in general a probability distribution. Suppose that X is an irreducible Markov chain with transition matrix P , and that g : S → (0, ∞) is invariant for X. The ^ ^ ^ ^ reversal of X with respect to g is the Markov chain X = (X , X , …) with transition probability matrix P defined by 0
^(x, y) = P
1
g(y) P (y, x),
(x, y) ∈ S
2
(16.7.8)
g(x)
Proof Recall that if g is a positive invariant function for same reversed chain. So let's consider the cases:
X
then so is
cg
for every positive constant c . Note that
g
and
cg
generate the
Suppose that X is an irreducible Markov chain on S . 1. If X is recurrent, then X always has a positive invariant function that is unique up to multiplication by positive constants. Hence the reversal of a recurrent chain X always exists and is unique, and so we can refer to the reversal of X without reference to the invariant function. 2. Even better, if X is positive recurrent, then there exists a unique invariant probability density function, and the reversal of X can be interpreted as the time reversal (with respect to a terminal time) when X has the invariant distribution, as in the motivating exercises above. 3. If X is transient, then there may or may not exist a positive invariant function, and if one does exist, it may not be unique (up to multiplication by positive constants). So a transient chain may have no reversals or more than one. Nonetheless, the general definition is natural, because most of the important properties of the reversed chain follow from the balance equation between the transition matrices P and P^ , and the invariant function g : ^ g(x)P (x, y) = g(y)P (y, x),
(x, y) ∈ S
2
(16.7.10)
We will see this balance equation repeated with other objects related to the Markov chains. Suppose that X is an irreducible Markov chain with invariant function g : S → (0, ∞) , and that respect to g . For x, y ∈ S ,
^ X
is the reversal of X with
1. P^(x, x) = P (x, x) 2. P^(x, y) > 0 if and only if P (y, x) > 0 Proof ^ From part (b) it follows that the state graphs of X and X are reverses of each other. That is, to go from the state graph of one chain to the state graph of the other, simply reverse the direction of each edge. Here is a more complicated (but equivalent) version of the balance equation for chains of states: ^ Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞) , and that X is the reversal of X with respect to g . For every n ∈ N and every sequence of states (x , x , … , x , x ) ∈ S , n+1
+
1
2
n
n+1
^ ^ ^ g(x1 )P (x1 , x2 )P (x2 , x3 ) ⋯ P (xn , xn+1 ) = g(xn+1 )P (xn+1 , xn ) ⋯ P (x3 , x2 )P (x2 , x1 )
(16.7.11)
Proof The balance equation holds for the powers of the transition matrix: ^ Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞) , and that X is the reversal of X with respect to g . For every (x, y) ∈ S and n ∈ N , 2
n
^ (x, y) = g(y)P g(x)P
n
(y, x)
(16.7.14)
Proof
16.7.2
https://stats.libretexts.org/@go/page/10294
We can now generalize the simple result above. ^ Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞) , and that X is the reversal of X with respect to g . For n ∈ N and (x, y) ∈ S , 2
1. P 2. P^
n
n
^ (x, x) = P (x, x)
n
(x, y) > 0
if and only if P
n
(y, x) > 0
In terms of the state graphs, part (b) has an obvious meaning: If there exists a path of length n from y to x in the original state graph, then there exists a path of length n from x to y in the reversed state graph. The time reversal definition is symmetric with respect to the two Markov chains. ^ Suppose again that X is an irreducible Markov chain with invariant function g : S → (0, ∞) , and that X is the reversal of X with respect to g . Then ^ 1. g is also invariant for X . ^ 2. X is also irreducible. ^ 3. X is the reversal of X with respect to g .
Proof The balance equation also holds for the potential matrices. ^ Suppose that X and X are time reversals with respect to the invariant function g : S → (0, ∞) . For α ∈ (0, 1], the α potential matrices are related by ^ (x, y) = g(y)R (y, x), g(x)R α α
(x, y) ∈ S
2
(16.7.16)
Proof Markov chains that are time reversals share many important properties: ^ Suppose that X and X are time reversals. Then ^ 1. X and X are of the same type (transient, null recurrent, or positive recurrent). ^ 2. X and X have the same period. ^ 3. X and X have the same mean return time μ(x) for every x ∈ S .
Proof The main point of the next result is that we don't need to know a-priori that g is invariant for X, if we can guess g and P^ . Suppose again that X is irreducible with transition probability matrix P . If there exists a a function transition probability matrix P^ such that g(x)P^(x, y) = g(y)P (y, x) for all (x, y) ∈ S , then
g : S → (0, ∞)
and a
2
1. g is invariant for X. 2. P^ is the transition matrix of the reversal of X with respect to g . Proof As a corollary, if there exists a probability density function f on S and a transition probability matrix P^ such that ^ ^ f (x)P (x, y) = f (y)P (y, x) for all (x, y) ∈ S then in addition to the conclusions above, we know that the chains X and X are positive recurrent. 2
Reversible Chains Clearly, an interesting special case occurs when the transition matrix of the reversed chain turns out to be the same as the original transition matrix. A chain of this type could be used to model a physical process that is stochastically the same, forward or backward in time.
16.7.3
https://stats.libretexts.org/@go/page/10294
Suppose again that X = (X , X , X , …) is an irreducible Markov chain with transition matrix P and invariant function g : S → (0, ∞) . If the reversal of X with respect to g also has transition matrix P , then X is said to be reversible with respect to g . That is, X is reversible with respect to g if and only if 0
1
2
g(x)P (x, y) = g(y)P (y, x),
(x, y) ∈ S
2
(16.7.19)
Clearly if X is reversible with respect to the invariant function g : S → (0, ∞) then X is reversible with respect to the invariant function cg for every c ∈ (0, ∞). So again, let's review the cases. Suppose that X is an irreducible Markov chain on S . 1. If X is recurrent, there exists a positive invariant function that is unique up to multiplication by positive constants. So X is either reversible or not, and we don't have to reference the invariant function g . 2. If X is positive recurrent then there exists a unique invariant probability density function f : S → (0, 1) , and again, either X is reversible or not. If X is reversible, then P is the transition matrix of X forward or backward in time, when the chain has the invariant distribution. 3. If X is transient, there may or may not exist positive invariant functions. If there are two or more positive invariant functions that are not multiplies of one another, X might be reversible with respect to one function but not the others. The non-symmetric simple random walk on Z falls into the last case. Using the last result in the previous subsection, we can tell whether X is reversible with respect to g without knowing a-priori that g is invariant. Suppose again that
X
is irreducible with transition matrix for all (x, y) ∈ S , then
P
. If there exists a function
g : S → (0, ∞)
such that
2
g(x)P (x, y) = g(y)P (y, x)
1. g is invariant for X. 2. X is reversible with respect to g If we have reason to believe that a Markov chain is reversible (based on modeling considerations, for example), then the condition in the previous theorem can be used to find the invariant functions. This procedure is often easier than using the definition of invariance directly. The next two results are minor generalizations: Suppose again that X is irreducible and that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to only if for every n ∈ N and every sequence of states (x , x , … x , x ) ∈ S ,
g
if and
n+1
+
1
2
n
n+1
g(x1 )P (x1 , x2 )P (x2 , x3 ) ⋯ P (xn , xn+1 ) = g(xn+1 )P (xn+1 , xn ), ⋯ P (x3 , x2 )P (x2 , x1 )
(16.7.20)
Suppose again that X is irreducible and that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to only if for every (x, y) ∈ S and n ∈ N ,
g
if and
2
+
g(x)P
n
(x, y) = g(y)P
n
(y, x)
(16.7.21)
Here is the condition for reversibility in terms of the potential matrices. Suppose again that X is irreducible and that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to only if g(x)Rα (x, y) = g(y)Rα (y, x),
α ∈ (0, 1], (x, y) ∈ S
2
g
if and
(16.7.22)
In the positive recurrent case (the most important case), the following theorem gives a condition for reversibility that does not directly reference the invariant distribution. The condition is known as the Kolmogorov cycle condition, and is named for Andrei Kolmogorov Suppose that
X
is irreducible and positive recurrent. Then
X
is reversible if and only if for every sequence of states
,
(x1 , x2 , … , xn )
P (x1 , x2 )P (x2 , x3 ) ⋯ P (xn−1 , xn )P (xn , x1 ) = P (x1 , xn )P (xn , xn−1 ) ⋯ P (x3 , x2 )P (x2 , x1 )
16.7.4
(16.7.23)
https://stats.libretexts.org/@go/page/10294
Proof Note that the Kolmogorov cycle condition states that the probability of visiting states (x , x , … , x , x ) in sequence, starting in state x is the same as the probability of visiting states (x , x , … , x , x ) in sequence, starting in state x . The cycle condition is also known as the balance equation for cycles. 2
1
n
n−1
2
3
n
1
1
1
Figure 16.7.1 : The Kolmogorov cycle condition
Examples and Applications Finite Chains Recall the general two-state chain X on S = {0, 1} with the transition probability matrix 1 −p
p
q
1 −q
P =[
where
p, q ∈ (0, 1) q
f =(
p+q
p
,
p+q
)
are parameters. The chain
X
]
(16.7.25)
is reversible and the invariant probability density function is
.
Proof Suppose that
X
is a Markov chain on a finite state space S with symmetric transition probability matrix P . Thus for all (x, y) ∈ S . The chain X is reversible and that the uniform distribution on S is invariant.
P (x, y) = P (y, x)
2
Proof Consider the Markov chain X on S = {a, b, c} with transition probability matrix P given below: 1
1
1
4
4
2
⎢ P =⎢ ⎢
1
1
1
3
3
3
⎣
1
1
2
2
0 ⎦
⎡
⎤ ⎥ ⎥ ⎥
(16.7.27)
1. Draw the state graph of X and note that the chain is irreducible. 2. Find the invariant probability density function f . 3. Find the mean return time to each state. ^ 4. Find the transition probability matrix P^ of the time-reversed chain X . ^ 5. Draw the state graph of X. Answer
Special Models Read the discussion of reversibility for the Ehrenfest chains. Read the discussion of reversibility for the Bernoulli-Laplace chain. Read the discussion of reversibility for the random walks on graphs. Read the discussion of time reversal for the reliability chains. Read the discussion of reversibility for the birth-death chains.
16.7.5
https://stats.libretexts.org/@go/page/10294
This page titled 16.7: Time Reversal in Discrete-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.7.6
https://stats.libretexts.org/@go/page/10294
16.8: The Ehrenfest Chains Basic Theory The Ehrenfest chains, named for Paul Ehrenfest, are simple, discrete models for the exchange of gas molecules between two containers. However, they can be formulated as simple ball and urn models; the balls correspond to the molecules and the urns to the two containers. Thus, suppose that we have two urns, labeled 0 and 1, that contain a total of m balls. The state of the system at time n ∈ N is the number of balls in urn 1, which we will denote by X . Our stochastic process is X = (X , X , X , …) with state space S = {0, 1, … , m}. Of course, the number of balls in urn 0 at time n is m − X . n
0
1
2
n
The Models In the basic Ehrenfest model, at each discrete time unit, independently of the past, a ball is selected at random and moved to the other urn.
Figure 16.8.1 : The Ehrenfest model X
is a discrete-time Markov chain on S with transition probability matrix P given by x P (x, x − 1) =
m −x , P (x, x + 1) =
m
,
x ∈ S
(16.8.1)
m
Proof In the Ehrenfest experiment, select the basic model. For selected values of m and selected values of the initial state, run the chain for 1000 time steps and note the limiting behavior of the proportion of time spent in each state. Suppose now that we modify the basic Ehrenfest model as follows: at each discrete time, independently of the past, we select a ball at random and a urn at random. We then put the chosen ball in the chosen urn. X
is a discrete-time Markov chain on S with the transition probability matrix Q given by x
1
Q(x, x − 1) =
, Q(x, x) = 2m
m −x , Q(x, x + 1) =
,
2
x ∈ S
(16.8.3)
2m
Proof Note that Q(x, y) =
1 2
P (x, y)
for y ∈ {x − 1, x + 1} .
In the Ehrenfest experiment, select the modified model. For selected values of m and selected values of the initial state, run the chain for 1000 time steps and note the limiting behavior of the proportion of time spent in each state.
Classification The basic and modified Ehrenfest chains are irreducible and positive recurrent. Proof The basic Ehrenfest chain is periodic with period 2. The cyclic classes are the set of even states and the set of odd states. The two-step transition matrix is P
2
x(x − 1) (x, x − 2) = m
2
, P
2
x(m − x + 1) + (m − x)(x + 1) (x, x) = m
, P
2
2
(m − x)(m − x − 1) (x, x + 2) = m
2
,
x ∈ S
(16.8.5)
Proof The modified Ehrenfest chain is aperiodic. Proof
Invariant and Limiting Distributions For the basic and modified Ehrenfest chains, the invariant distribution is the binomial distribution with trial parameter m and success parameter So the invariant probability density function f is given by m f (x) = ( x
2
.
m
1 )(
1
)
,
x ∈ S
(16.8.6)
2
16.8.1
https://stats.libretexts.org/@go/page/10295
Proof Thus, the invariant distribution corresponds to placing each ball randomly and independently either in urn 0 or in urn 1. The mean return time to state x ∈ S for the basic or modified Ehrenfest chain is μ(x) = 2
m
m
/(
x
)
.
Proof For the basic Ehrenfest chain, the limiting behavior of the chain is as follows: 1. P
2n
2. P
2n+1
m
(x, y) → (
y
)( m
(x, y) → (
y
1 2
m−1
)
)(
1 2
as n → ∞ if x,
m−1
)
y ∈ S
as n → ∞ if x,
have the same parity (both even or both odd). The limit is 0 otherwise.
y ∈ S
have oppositie parity (one even and one odd). The limit is 0 otherwise.
Proof For the modified Ehrenfest chain, Q
n
m
(x, y) → (
y
)(
1 2
m
)
as n → ∞ for x,
y ∈ S
.
Proof In the Ehrenfest experiment, the limiting binomial distribution is shown graphically and numerically. For each model and for selected values of m and selected values of the initial state, run the chain for 1000 time steps and note the limiting behavior of the proportion of time spent in each state. How do the choices of m, the initial state, and the model seem to affect the rate of convergence to the limiting distribution?
Reversibility The basic and modified Ehrenfest chains are reversible. Proof Run the simulation of the Ehrenfest experiment 10,000 time steps for each model, for selected values of m, and with initial state 0. Note that at first, you can see the “arrow of time”. After a long period, however, the direction of time is no longer evident.
Computational Exercises Consider the basic Ehrenfest chain with m = 5 balls, and suppose that X has the uniform distribution on S . 0
1. Compute the probability density function, mean and variance of X . 2. Compute the probability density function, mean and variance of X . 3. Compute the probability density function, mean and variance of X . 4. Sketch the initial probability density function and the probability density functions in parts (a), (b), and (c) on a common set of axes. 1 2 3
Answer Consider the modified Ehrenfest chain with m = 5 balls, and suppose that the chain starts in state 2 (with probability 1). 1. Compute the probability density function, mean and standard deviation of X . 2. Compute the probability density function, mean and standard deviation of X . 3. Compute the probability density function, mean and standard deviation of X . 4. Sketch the initial probability density function and the probability density functions in parts (a), (b), and (c) on a common set of axes. 1 2 3
Answer This page titled 16.8: The Ehrenfest Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.8.2
https://stats.libretexts.org/@go/page/10295
16.9: The Bernoulli-Laplace Chain Basic Theory Introduction The Bernoulli-Laplace chain, named for Jacob Bernoulli and Pierre Simon Laplace, is a simple discrete model for the diffusion of two incompressible gases between two containers. Like the Ehrenfest chain, it can also be formulated as a simple ball and urn model. Thus, suppose that we have two urns, labeled 0 and 1. Urn 0 contains j balls and urn 1 contains k balls, where j, k ∈ N . Of the j + k balls, r are red and the remaining j + k − r are green. Thus r ∈ N and 0 < r < j + k . At each discrete time, independently of the past, a ball is selected at random from each urn and then the two balls are switched. The balls of different colors correspond to molecules of different types, and the urns are the containers. The incompressible property is reflected in the fact that the number of balls in each urn remains constant over time. +
+
Figure 16.9.1 : The Bernoulli-Laplace model
Let X denote the number of red balls in urn 1 at time n ∈ N . Then n
1. k − X is the number of green balls in urn 1 at time n . 2. r − X is the number of red balls in urn 0 at time n . 3. j − r + X is the number of green balls in urn 0 at time n . n
n
n
X = (X0 , X1 , X2 , …) P
is a discrete-time Markov chain with state space S = {max{0, r − j}, … , min{k, r}} and with transition matrix
given by (j − r + x)x P (x, x − 1) =
(r − x)x + (j − r + x)(k − x) , P (x, x) =
(r − x)(k − x) , P (x, x + 1) =
jk
;
jk
x ∈ S
(16.9.1)
jk
Proof This is a fairly complicated model, simply because of the number of parameters. Interesting special cases occur when some of the parameters are the same. Consider the special case j = k , so that each urn has the same number of balls. The state space is and the transition probability matrix is (k − r + x)x P (x, x − 1) =
S = {max{0, r − k}, … , min{k, r}}
(r − x)x + (k − r + x)(k − x) , P (x, x) =
2
(r − x)(k − x) , P (x, x + 1) =
2
k
;
2
k
x ∈ S
(16.9.2)
k
Consider the special case r = j , so that the number of red balls is the same as the number of balls in urn 0. The state space is S = {0, … , min{j, k}} and the transition probability matrix is 2
x P (x, x − 1) =
x(j + k − 2x) , P (x, x) =
jk
Consider the special case
(j − x)(k − x) , P (x, x + 1) =
;
jk
x ∈ S
(16.9.3)
jk
, so that the number of red balls is the same as the number of balls in urn 1. The state space is and the transition probability matrix is
r =k
S = {max{0, k − j}, … , k}
(j − k + x)x P (x, x − 1) =
2
(k − x)(j − k + 2x) , P (x, x) =
(k − x) , P (x, x + 1) =
jk
jk
;
x ∈ S
(16.9.4)
jk
Consider the special case j = k = r , so that each urn has the same number of balls, and this is also the number of red balls. The state space is S = {0, 1, … , k} and the transition probability matrix is 2
x P (x, x − 1) =
2
k
2
2x(k − x) , P (x, x) =
2
k
16.9.1
(k − x) , P (x, x + 1) =
2
;
x ∈ S
(16.9.5)
k
https://stats.libretexts.org/@go/page/10296
Run the simulation of the Bernoulli-Laplace experiment for 10000 steps and for various values of the parameters. Note the limiting behavior of the proportion of time spent in each state.
Invariant and Limiting Distributions The Bernoulli-Laplace chain is irreducible. Proof Except in the trivial case j = k = r = 1 , the Bernoulli-Laplace chain aperiodic. Proof The invariant distribution is the hypergeometric distribution with population parameter The probability density function is r
j+k−r
( )( x
f (x) =
k−x
j+k
(
k
j+k
, sample parameter k , and type parameter r.
) ,
x ∈ S
(16.9.6)
)
Proof Thus, the invariant distribution corresponds to selecting a sample of k balls at random and without replacement from the j + k balls and placing them in urn 1. The mean and variance of the invariant distribution are r μ =k
, σ
2
r
j+k −r
j
j+k
j+k
j+k −1
=k
j+k
(16.9.7)
The mean return time to each state x ∈ S is j+k
(
k
μ(x) =
)
r
j+k−r
x
k−x
( )(
(16.9.8) )
Proof P
n
r
r+k−r
y
k−y
(x, y) → f (y) = ( )(
j+k
)/(
k
)
as n → ∞ for (x, y) ∈ S . 2
Proof In the simulation of the Bernoulli-Laplace experiment, vary the parameters and note the shape and location of the limiting hypergeometric distribution. For selected values of the parameters, run the simulation for 10000 steps and and note the limiting behavior of the proportion of time spent in each state.
Reversibility The Bernoulli-Laplace chain is reversible. Proof Run the simulation of the Bernoulli-Laplace experiment 10,000 time steps for selected values of the parameters, and with initial state 0. Note that at first, you can see the “arrow of time”. After a long period, however, the direction of time is no longer evident.
Computational Exercises Consider the Bernoulli-Laplace chain with each of the following:
j = 10
,
k =5
, and
r =4
. Suppose that
X0
has the uniform distribution on
S
. Explicitly give
1. The state space S 2. The transition matrix P . 3. The probability density function, mean and variance of X . 4. The probability density function, mean and variance of X . 5. The probability density function, mean and variance of X . 1 2 3
Answer Consider the Bernoulli-Laplace chain with j = k = 10 and r = 6 . Give each of the following explicitly: 1. The state space S 2. The transition matrix P
16.9.2
https://stats.libretexts.org/@go/page/10296
3. The invariant probability density function. Answer This page titled 16.9: The Bernoulli-Laplace Chain is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.9.3
https://stats.libretexts.org/@go/page/10296
16.10: Discrete-Time Reliability Chains The Success-Runs Chain Suppose that we have a sequence of trials, each of which results in either success or failure. Our basic assumption is that if there have been x ∈ N consecutive successes, then the probability of success on the next trial is p(x), independently of the past, where p : N → (0, 1). Whenever there is a failure, we start over, independently, with a new sequence of trials. Appropriately enough, p is called the success function. Let X denote the length of the run of successes after n trials. n
X = (X0 , X1 , X2 , …)
is a discrete-time Markov chain with state space N and transition probability matrix P given by P (x, x + 1) = p(x), P (x, 0) = 1 − p(x);
x ∈ N
(16.10.1)
The Markov chain X is called the success-runs chain.
Figure 16.10.1: State graph of the success-runs chain
Now let T denote the trial number of the first failure, starting with a fresh sequence of trials. Note that in the context of the success-runs chain X, T = τ , the first return time to state 0, starting in 0. Note that T takes values in N ∪ {∞} , since presumably, it is possible that no failure occurs. Let r(n) = P(T > n) for n ∈ N , the probability of at least n consecutive successes, starting with a fresh set of trials. Let f (n) = P(T = n + 1) for n ∈ N , the probability of exactly n consecutive successes, starting with a fresh set of trails. 0
+
The functions p, r, and f are related as follows: 1. p(x) = r(x + 1)/r(x) for x ∈ N 2. r(n) = ∏ p(x) for n ∈ N 3. f (n) = [1 − p(n)] ∏ p(x) for n ∈ N 4. r(n) = 1 − ∑ f (x) for n ∈ N 5. f (n) = r(n) − r(n + 1) for n ∈ N n−1 x=0
n−1 x=0
n−1 x=0
Thus, the functions p, r, and f give equivalent information. If we know one of the functions, we can construct the other two, and hence any of the functions can be used to define the success-runs chain. The function r is the reliability function associated with T . The function r is characterized by the following properties: 1. r is positive. 2. r(0) = 1 3. r is strictly decreasing. The function f is characterized by the following properties: 1. f is positive. 2. ∑ f (x) ≤ 1 ∞
x=0
Essentially, f is the probability density function of T − 1 , except that it may be defective in the sense that the sum of its values may be less than 1. The leftover probability, of course, is the probability that T = ∞ . This is the critical consideration in the classification of the success-runs chain, which we will consider shortly. Verify that each of the following functions has the appropriate properties, and then find the other two functions: 1. p is a constant in (0, 1). 2. r(n) = 1/(n + 1) for n ∈ N .
16.10.1
https://stats.libretexts.org/@go/page/10297
3. r(n) = (n + 1)/(2 n + 1) for n ∈ N . 4. p(x) = 1/(x + 2) for x ∈ N. Answer In part (a), note that the trials are Bernoulli trials. We have an app for this case. The success-runs app is a simulation of the success-runs chain based on Bernoulli trials. Run the simulation 1000 times for various values of p and various initial states, and note the general behavior of the chain. The success-runs chain is irreducible and aperiodic. Proof Recall that T has the same distribution as τ , the first return time to 0 starting at state 0. Thus, the classification of the chain as recurrent or transient depends on α = P(T = ∞) . Specifically, the success-runs chain is transient if α > 0 and recurrent if α = 0 . Thus, we see that the chain is recurrent if and only if a failure is sure to occur. We can compute the parameter α in terms of each of the three functions that define the chain. 0
In terms of p, r, and f , ∞
∞
α = ∏ p(x) = lim r(n) = 1 − ∑ f (x) x=0
n→∞
(16.10.2)
x=0
Compute α and determine whether the success-runs chain X is transient or recurrent for each of the examples above. Answer Run the simulation of the success-runs chain 1000 times for various values of p, starting in state 0. Note the return times to state 0. Let μ = E(T ) , the expected trial number of the first failure, starting with a fresh sequence of trials. μ
is related to α , f , and r as follows:
1. If α > 0 then μ = ∞ 2. If α = 0 then μ = 1 + ∑ 3. μ = ∑ r(n)
∞ n=0
nf (n)
∞
n=0
Proof The success-runs chain X is positive recurrent if and only if μ < ∞ . Proof If X is recurrent, then r is invariant for probability density function g given by
X
. In the positive recurrent case, when
μ 0 for n ∈ N . When the device fails, it is immediately (and independently) replaced by an identical device. For n ∈ N , let Y denote the time to failure of the device that is in service at time n . n
Y = (Y0 , Y1 , Y2 , …)
is a discrete-time Markov chain with state space N and transition probability matrix Q given by Q(0, x) = f (x), Q(x + 1, x) = 1;
x ∈ N
(16.10.6)
The Markov chain Y is called the remaining life chain with lifetime probability density function f , and has the state graph below.
Figure 16.10.2: State graph of the remaining life chain
We have an app for the remaining life chain whose lifetime distribution is the geometric distribution on 1 − p ∈ (0, 1) . Run the simulation of the remaining-life chain 1000 times for various values of behavior of the chain. If U denotes the lifetime of a device, as before, note that T Y
= 1 +U
p
N
, with parameter
and various initial states. Note the general
is the return time to 0 for the chain Y , starting at 0.
is irreducible, aperiodic, and recurrent.
Proof Now let
r(n) = P(U ≥ n) = P(T > n) ∞
μ = 1 +∑
x=0
∞
f (x) = ∑
n=0
r(n)
for
n ∈ N
and let
μ = E(T ) = 1 + E(U )
. Note that
∞
r(n) = ∑
x=n
.
The success-runs chain X is positive recurrent if and only if density function g given by
μ 0 for n ∈ N . Let X be the success-runs chain associated with f and Y the remaining life chain associated with f . Then X and Y are time reversals of each other.
16.10.3
https://stats.libretexts.org/@go/page/10297
Proof In the context of reliability, it is also easy to see that the chains are time reversals of each other. Consider again a device whose random lifetime takes values in N, with the device immediately replaced by an identical device upon failure. For n ∈ N , we can think of X as the age of the device in service at time n and Y as the time remaining until failure for that device. n
n
Run the simulation of the success-runs chain 1000 times for various values of p, starting in state 0. This is the time reversal of the simulation in the next exercise Run the simulation of the remaining-life chain 1000 times for various values of p, starting in state 0. This is the time reversal of the simulation in the previous exercise. This page titled 16.10: Discrete-Time Reliability Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.10.4
https://stats.libretexts.org/@go/page/10297
16.11: Discrete-Time Branching Chain Basic Theory Introduction Generically, suppose that we have a system of particles that can generate or split into other particles of the same type. Here are some typical examples: The particles are biological organisms that reproduce. The particles are neutrons in a chain reaction. The particles are electrons in an electron multiplier. We assume that each particle, at the end of its life, is replaced by a random number of new particles that we will refer to as children of the original particle. Our basic assumption is that the particles act independently, each with the same offspring distribution on N. Let f denote the common probability density function of the number of offspring of a particle. We will also let f = f ∗ f ∗ ⋯ ∗ f denote the convolution power of degree n of f ; this is the probability density function of the total number of children of n particles. ∗n
We will consider the evolution of the system in real time in our study of continuous-time branching chains. In this section, we will study the evolution of the system in generational time. Specifically, the particles that we start with are in generation 0, and recursively, the children of a particle in generation n are in generation n + 1 .
Figure 16.11.1: Generations 0, 1, 2, and 3 of a branching chain.
Let X denote the number of particles in generation n ∈ N . One way to construct the process mathematically is to start with an array of independent random variables (U : n ∈ N, i ∈ N ) , each with probability density function f . We interpret U as the number of children of the ith particle in generation n (if this particle exists). Note that we have more random variables than we need, but this causes no harm, and we know that we can construct a probability space that supports such an array of random variables. We can now define our state variables recursively by n
n,i
+
n,i
Xn
Xn+1 = ∑ Un,i
(16.11.1)
i=1
X = (X0 , X1 , X2 , …)
is a discrete-time Markov chain on N with transition probability matrix P given by P (x, y) = f
∗x
(y),
2
(x, y) ∈ N
(16.11.2)
The chain X is the branching chain with offspring distribution defined by f . Proof The branching chain is also known as the Galton-Watson process in honor of Francis Galton and Henry William Watson who studied such processes in the context of the survival of (aristocratic) family names. Note that the descendants of each initial particle form a branching chain, and these chains are independent. Thus, the branching chain starting with x particles is equivalent to x independent copies of the branching chain starting with 1 particle. This features turns out to be very important in the analysis of the chain. Note also that 0 is an absorbing state that corresponds to extinction. On the other hand, the population may grow to infinity, sometimes called explosion. Computing the probability of extinction is one of the fundamental problems in branching chains; we will essentially solve this problem in the next subsection.
16.11.1
https://stats.libretexts.org/@go/page/10384
Extinction and Explosion The behavior of the branching chain in expected value is easy to analyze. Let m denote the mean of the offspring distribution, so that ∞
m = ∑ xf (x)
(16.11.4)
x=0
Note that m ∈ [0, ∞]. The parameter m will turn out to be of fundamental importance. Expected value properties 1. E(X 2. E(X 3. E(X 4. E(X 5. E(X
for n ∈ N for n ∈ N ) → 0 as n → ∞ if m < 1 . ) = E(X ) for each n ∈ N if m = 1 . ) → ∞ as n → ∞ if m > 1 and E(X
n+1 ) n) n n n
= mE(Xn ) n
= m E(X0 )
0
0)
>0
.
Proof Part (c) is extinction in the mean; part (d) is stability in the mean; and part (e) is explosion in the mean. Recall that state 0 is absorbing (there are no particles), and hence {X = 0 for some n ∈ N} = {τ < ∞} is the extinction event (where as usual, τ is the time of the first return to 0). We are primarily concerned with the probability of extinction, as a function of the initial state. First, however, we will make some simple observations and eliminate some trivial cases. n
0
0
Suppose that f (1) = 1 , so that each particle is replaced by a single new particle. Then 1. Every state is absorbing. 2. The equivalence classes are the singleton sets. 3. With probability 1, X = X for every n ∈ N . n
0
Proof Suppose that f (0) > 0 so that with positive probability, a particle will die without offspring. Then 1. Every state leads to 0. 2. Every positive state is transient. 3. With probability 1 either X = 0 for some n ∈ N (extinction) or X n
n
→ ∞
as n → ∞ (explosion).
Proof Suppose that f (0) = 0 and f (1) < 1 , so that every particle is replaced by at least one particle, and with positive probability, more than one. Then 1. Every positive state is transient. 2. P(X → ∞ as n → ∞ ∣ X = x) = 1 for every x ∈ N , so that explosion is certain, starting with at least one particle. n
0
+
Proof Suppose that f (0) > 0 and f (0) + f (1) = 1 , so that with positive probability, a particle will die without offspring, and with probability 1, a particle is not replaced by more than one particle. Then 1. Every state leads to 0. 2. Every positive state is transient. 3. With probability 1, X = 0 for some n ∈ N , so extinction is certain. n
Proof Thus, the interesting case is when f (0) > 0 and f (0) + f (1) < 1 , so that with positive probability, a particle will die without offspring, and also with positive probability, the particle will be replaced by more than one new particles. We will assume these conditions for the remainder of our discussion. By the state classification above all states lead to 0 (extinction). We will denote the probability of extinction, starting with one particle, by
16.11.2
https://stats.libretexts.org/@go/page/10384
q = P(τ0 < ∞ ∣ X0 = 1) = P(Xn = 0 for some n ∈ N ∣ X0 = 1)
(16.11.6)
The set of positive states N is a transient equivalence class, and the probability of extinction starting with x ∈ N particles is +
q
x
= P(τ0 < ∞ ∣ X0 = x) = P(Xn = 0 for some n ∈ N ∣ X0 = x)
(16.11.7)
Proof The parameter q satisfies the equation ∞
q = ∑ f (x)q
x
(16.11.8)
x=0
Proof Thus the extinction probability distribution:
q
starting with 1 particle is a fixed point of the probability generating function
Φ
of the offspring
∞ x
Φ(t) = ∑ f (x)t ,
t ∈ [0, 1]
(16.11.11)
x=0
Moreover, from the general discussion of hitting probabilities in the section on recurrence and transience, q is the smallest such number in the interval (0, 1]. If the probability generating function Φ can be computed in closed form, then q can sometimes be computed by solving the equation Φ(t) = t . Φ
satisfies the following properties:
1. Φ(0) = f (0). 2. Φ(1) = 1 . 3. Φ (t) > 0 for t ∈ (0, 1) so Φ in increasing on (0, 1). 4. Φ (t) > 0 for t ∈ (0, 1) so Φ in concave upward on (0, 1). 5. m = lim Φ (t) . ′
′′
′
t↑1
Proof Our main result is next, and relates the extinction probability q and the mean of the offspring distribution m. The extinction probability q and the mean of the offspring distribution m are related as follows: 1. If m ≤ 1 then q = 1 , so extinction is certain. 2. If m > 1 then 0 < q < 1 , so there is a positive probability of extinction and a positive probability of explosion. Proof
16.11.3
https://stats.libretexts.org/@go/page/10384
Figure 16.11.2: The case of certain extinction.
Figure 16.11.3: The case of possible extinction and possible explosion.
Computational Exercises Consider the branching chain with offspring probability density function f given by f (0) = 1 − p , f (2) = p , where p ∈ (0, 1) is a parameter. Thus, each particle either dies or splits into two new particles. Find each of the following. 1. The transition matrix P . 2. The mean m of the offspring distribution. 3. The generating function Φ of the offspring distribution. 4. The extinction probability q. Answer Consider the branching chain whose offspring distribution is the geometric distribution on p ∈ (0, 1). Thus f (n) = (1 − p)p for n ∈ N . Find each of the following:
N
with parameter
1 −p
, where
n
1. The transition matrix P . 2. The mean m of the offspring distribution. 3. The generating function Φ of the offspring distribution. 4. The extinction probability q. Answer Curiously, the extinction probability is the same as for the previous problem. Consider the branching chain whose offspring distribution is the Poisson distribution with parameter f (n) = e m /n! for n ∈ N . Find each of the following: −m
m ∈ (0, ∞)
. Thus
n
1. The transition matrix P . 2. The mean m of the offspring distribution. 3. The generating function Φ of the offspring distribution. 4. The approximate extinction probability q when m = 2 and when m = 3 . Answer
16.11.4
https://stats.libretexts.org/@go/page/10384
This page titled 16.11: Discrete-Time Branching Chain is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.11.5
https://stats.libretexts.org/@go/page/10384
16.12: Discrete-Time Queuing Chains Basic Theory Introduction In a queuing model, customers arrive at a station for service. As always, the terms are generic; here are some typical examples: The customers are persons and the service station is a store. The customers are file requests and the service station is a web server. The customers are packages and the service station is a processing facility.
Figure 16.12.1: Ten customers and a server Queuing models can be quite complex, depending on such factors as the probability distribution that governs the arrival of customers, the probability distribution that governs the service of customers, the number of servers, and the behavior of the customers when all servers are busy. Indeed, queuing theory has its own lexicon to indicate some of these factors. In this section, we will study one of the simplest, discrete-time queuing models. However, as we will see, this discrete-time chain is embedded in a much more realistic continuous-time queuing process knows as the M/G/1 queue. In a general sense, the main interest in any queuing model is the number of customers in the system as a function of time, and in particular, whether the servers can adequately handle the flow of customers. Our main assumptions are as follows: 1. If the queue is empty at a given time, then a random number of new customers arrive at the next time. 2. If the queue is nonempty at a given time, then one customer is served and a random number of new customers arrive at the next time. 3. The number of customers who arrive at each time period form an independent, identically distributed sequence. Thus, let X denote the number of customers in the system at time n ∈ N , and let U denote the number of new customers who arrive at time n ∈ N . Then U = (U , U , …) is a sequence of independent random variables, with common probability density function f on N, and n
n
+
1
2
Xn+1 = {
X = (X0 , X1 , X2 , …)
Un+1 ,
Xn = 0
(Xn − 1) + Un+1 ,
Xn > 0
,
n ∈ N
(16.12.1)
is a discrete-time Markov chain with state space N and transition probability matrix P given by P (0, y) = f (y), P (x, y)
y ∈ N
= f (y − x + 1),
(16.12.2) x ∈ N+ , y ∈ {x − 1, x, x + 1, …}
(16.12.3)
The chain X is the queuing chain with arrival distribution defined by f . Proof
Recurrence and Transience From now on we will assume that f (0) > 0 and f (0) + f (1) < 1 . Thus, at each time unit, it's possible that no new customers arrive or that at least 2 new customers arrive. Also, we let m denote the mean of the arrival distribution, so that ∞
m = ∑ xf (x)
(16.12.4)
x=0
Thus m is the average number of new customers who arrive during a time period. The chain X is irreducible and aperiodic.
16.12.1
https://stats.libretexts.org/@go/page/10385
Proof Our goal in this section is to compute the probability that the chain reaches 0, as a function of the initial state (so that the server is able to serve all of the customers). As we will see, there are some curious and unexpected parallels between this problem and the problem of computing the extinction probability in the branching chain. As a corollary, we will also be able to classify the queuing chain as transient or recurrent. Our basic parameter of interest is q = H (1, 0) = P(τ < ∞ ∣ X = 1) , where as usual, H is the hitting probability matrix and τ = min{n ∈ N : X = 0} is the first positive time that the chain is in state 0 (possibly infinite). Thus, q is the probability that the queue eventually empties, starting with a single customer. 0
0
+
0
n
The parameter q satisifes the following properties: 1. q = H (x, x − 1) for every x ∈ N . 2. q = H (x, 0) for every x ∈ N . +
x
+
Proof The parameter q satisfies the equation: ∞
q = ∑ f (x)q
x
(16.12.5)
x=0
Proof Note that this is exactly the same equation that we considered for the branching chain, namely Φ(q) = q , where Φ is the probability generating function of the distribution that governs the number of new customers that arrive during each period.
Figure 16.12.2: The graph of ϕ in the recurrent case
Figure 16.12.3: The graph of ϕ in the transient case q
is the smallest solution in (0, 1] of the equation Φ(t) = t . Moreover
1. If m ≤ 1 then q = 1 and the chain is recurrent. 2. If m > 1 then 0 < q < 1 and the chain is transient.. Proof
16.12.2
https://stats.libretexts.org/@go/page/10385
Positive Recurrence Our next goal is to find conditions for the queuing chain to be positive recurrent. Recall that m is the mean of the probability density function f ; that is, the expected number of new customers who arrive during a time period. As before, let τ denote the first positive time that the chain is in state 0. We assume that the chain is recurrent, so m ≤ 1 and P(τ < ∞) = 1 . 0
0
Let Ψ denote the probability generating function of τ , starting in state 1. Then 0
1. Ψ is also the probability generating function of τ starting in state 0. 2. Ψ is the probability generating function of τ starting in state x ∈ N . 0
x
0
+
Proof Ψ(t) = tΦ[Ψ(t)]
for t ∈ [−1, 1].
Proof The deriviative of Ψ is ′
Ψ (t) =
Φ[Ψ(t)] ′
,
t ∈ (−1, 1)
(16.12.11)
1 − tΦ [Ψ(t)]
Proof As usual, let μ
0
1. μ 2. μ
1
0
=
0
=∞
= E(τ0 ∣ X0 = 0)
, the mean return time to state 0 starting in state 0. Then
if m < 1 and therefore the chain is positive recurrent. if m = 1 and therefore the chain is null recurrent.
1−m
Proof So to summarize, the queuing chain is positive recurrent if m < 1 , null recurrent if m = 1 , and transient if m > 1 . Since m is the expected number of new customers who arrive during a service period, the results are certainly reasonable.
Computational Exercises Consider the queuing chain with arrival probability density function f given by f (0) = 1 − p , f (2) = p , where p ∈ (0, 1) is a parameter. Thus, at each time period, either no new customers arrive or two arrive. 1. Find the transition matrix P . 2. Find the mean m of the arrival distribution. 3. Find the generating function Φ of the arrival distribution. 4. Find the probability q that the queue eventually empties, starting with one customer. 5. Classify the chain as transient, null recurrent, or positive recurrent. 6. In the positive recurrent case, find μ , the mean return time to 0. 0
Answer Consider the queuing chain whose arrival distribution is the geometric distribution on p ∈ (0, 1). Thus f (n) = (1 − p)p for n ∈ N .
N
with parameter
1 −p
, where
n
1. Find the transition matrix P . 2. Find the mean m of the arrival distribution. 3. Find the generating function Φ of the arrival distribution. 4. Find the probability q that the queue eventually empties, starting with one customer. 5. Classify the chain as transient, null recurrent, or positive recurrent. 6. In the positive recurrent case, find μ , the mean return time to 0. 0
Answer Curiously, the parameter q and the classification of the chain are the same in the last two models. Consider the queuing chain whose arrival distribution is the Poisson distribution with parameter f (n) = e m /n! for n ∈ N . Find each of the following: −m
m ∈ (0, ∞)
. Thus
n
16.12.3
https://stats.libretexts.org/@go/page/10385
1. The transition matrix P 2. The mean m of the arrival distribution. 3. The generating function Φ of the arrival distribution. 4. The approximate value of q when m = 2 and when m = 3 . 5. Classify the chain as transient, null recurrent, or positive recurrent. 6. In the positive recurrent case, find μ , the mean return time to 0. 0
Answer This page titled 16.12: Discrete-Time Queuing Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.12.4
https://stats.libretexts.org/@go/page/10385
16.13: Discrete-Time Birth-Death Chains Basic Theory Introduction Suppose that S is an interval of integers (that is, a set of consecutive integers), either finite or infinite. A (discrete-time) birthdeath chain on S is a discrete-time Markov chain X = (X , X , X , …) on S with transition probability matrix P of the form 0
1
2
P (x, x − 1) = q(x), P (x, x) = r(x), P (x, x + 1) = p(x);
x ∈ S
(16.13.1)
where p, q, and r are nonnegative functions on S with p(x) + q(x) + r(x) = 1 for x ∈ S . If the interval S has a minimum value a ∈ Z then of course we must have q(a) = 0 . If r(a) = 1 , the boundary point a is absorbing and if p(a) = 1 , then a is reflecting. Similarly, if the interval S has a maximum value b ∈ Z then of course we must have p(b) = 0 . If r(b) = 1 , the boundary point b is absorbing and if p(b) = 1 , then b is reflecting. Several other special models that we have studied are birth-death chains; these are explored in below. In this section, as you will see, we often have sums of products. Recall that a sum over an empty index set is 0, while a product over an empty index set is 1.
Recurrence and Transience If S is finite, classification of the states of a birth-death chain as recurrent or transient is simple, and depends only on the state graph. In particular, if the chain is irreducible, then the chain is positive recurrent. So we will study the classification of birth-death chains when S = N . We assume that p(x) > 0 for all x ∈ N and that q(x) > 0 for all x ∈ N (but of course we must have q(0) = 0 ). Thus, the chain is irreducible. +
Under these assumptions, the birth-death chain on N is 1. Aperiodic if r(x) > 0 for some x ∈ N. 2. Periodic with period 2 if r(x) = 0 for all x ∈ N. Proof We will use the test for recurrence derived earlier with A = N , the set of positive states. That is, we will compute the probability that the chain never hits 0, starting in a positive state. +
The chain X is recurrent if and only if ∞
q(1) ⋯ q(x)
∑ x=0
=∞
(16.13.2)
p(1) ⋯ p(x)
Proof Note that r, the function that assigns to each state x ∈ N the probability of an immediate return to x, plays no direct role in whether the chain is transient or recurrent. Indeed all that matters are the ratios q(x)/p(x) for x ∈ N . +
Positive Recurrence and Invariant Distributions Suppose again that we have a birth-death chain X on N, with p(x) > 0 for all x ∈ N and q(x) > 0 for all x ∈ N . Thus the chain is irreducible. +
The function g : N → (0, ∞) defined by p(0) ⋯ p(x − 1) g(x) =
,
x ∈ N
(16.13.9)
q(1) ⋯ q(x)
16.13.1
https://stats.libretexts.org/@go/page/10386
is invariant for X, and is the only invariant function, up to multiplication by constants. Hence X is positive recurrent if and only if B = ∑ g(x) < ∞ , in which case the (unique) invariant probability density function f is given by f (x) = g(x) for x ∈ N. ∞
1
x=0
B
Proof Here is a summary of the classification: For the birth-death chain X, define ∞
x=0
∞
q(1) ⋯ q(x)
A =∑
,
p(0) ⋯ p(x − 1)
B =∑
p(1) ⋯ p(x)
x=0
(16.13.13) q(1) ⋯ q(x)
1. X is transient if A < ∞ 2. X is null recurrent if A = ∞ and B = ∞ . 3. X is positive recurrent if B < ∞ . Note again that r, the function that assigns to each state x ∈ N the probability of an immediate return to x, plays no direct role in whether the chain is transient, null recurrent, or positive recurrent. Also, we know that an irreducible, recurrent chain has a positive invariant function that is unique up to multiplication by positive constants, but the birth-death chain gives an example where this is also true in the transient case. Suppose now that n ∈ N and that X = (X , X , X , …) is a birth-death chain on the integer interval N = {0, 1, … , n}. We assume that p(x) > 0 for x ∈ {0, 1, … , n − 1} while q(x) > 0 for x ∈ {1, 2, … n}. Of course, we must have q(0) = p(n) = 0 . With these assumptions, X is irreducible, and since the state space is finite, positive recurrent. So all that remains is to find the invariant distribution. The result is essentially the same as when the state space is N. +
0
1
2
n
The invariant probability density function f is given by n
n
p(0) ⋯ p(x − 1)
1 fn (x) =
p(0) ⋯ p(x − 1)
for x ∈ Nn where Bn = ∑
Bn
q(1) ⋯ q(x)
x=0
(16.13.14) q(1) ⋯ q(x)
Proof Note that B → B as n → ∞ , and if B < ∞ , f (x) → f (x) as n → ∞ for x ∈ N. We will see this type of behavior again. Results for the birth-death chain on N often converge to the corresponding results for the birth-death chain on N as n → ∞ . n
n
n
Absorption Often when the state space S = N , the state of a birth-death chain represents a population of individuals of some sort (and so the terms birth and death have their usual meanings). In this case state 0 is absorbing and means that the population is extinct. Specifically, suppose that X = (X , X , X , …) is a birth-death chain on N with r(0) = 1 and with p(x), q(x) > 0 for x ∈ N . Thus, state 0 is absorbing and all positive states lead to each other and to 0. Let N = min{n ∈ N : X = 0} denote the time until absorption, where as usual, min ∅ = ∞ . 0
1
2
+
n
One of the following events will occur: 1. Population extinction: N < ∞ or equivalently, X 2. Population explosion: N = ∞ or equivalently X
m
n
=0
→ ∞
for some m ∈ N and hence X as n → ∞ .
n
=0
for all n ≥ m .
Proof Naturally we would like to find the probability of these complementary events, and happily we have already done so in our study of recurrence above. Let u(x) = P(N = ∞) = P(Xn → ∞ as n → ∞ ∣ X0 = x),
x ∈ N
(16.13.16)
so the absorption probability is v(x) = 1 − u(x) = P(N < ∞) = P(Xn = 0 for some n ∈ N ∣ X0 = x),
x ∈ N
(16.13.17)
For the birth-death chain X,
16.13.2
https://stats.libretexts.org/@go/page/10386
1 u(x) =
x−1
A
∞
q(1) ⋯ q(i)
∑ p(1) ⋯ p(i)
i=0
for x ∈ N+ where A = ∑ i=0
q(1) ⋯ q(i) (16.13.18) p(1) ⋯ p(i)
Proof So if A = ∞ then u(x) = 0 for all x ∈ S . If A < ∞ then u(x) > 0 for all x ∈ N and u(x) → 1 as x → ∞ . For the absorption probability, v(x) = 1 for all x ∈ N if A = ∞ and so absorption is certain. If A < ∞ then +
∞
1 v(x) =
q(1) ⋯ q(i)
∑ A
i=x
,
x ∈ N
(16.13.19)
p(1) ⋯ p(i)
Next we consider the mean time to absorption, so let m(x) = E(N
∣ X0 = x)
for x ∈ N . +
The mean absorption function is given by x
∞
p(j) ⋯ p(k)
m(x) = ∑ ∑
,
j=1 k=j−1
x ∈ N
(16.13.20)
q(j) ⋯ q(k + 1)
Probabilisitic Proof Analytic Proof Next we will consider a birth-death chain on a finite integer interval with both endpoints absorbing. Our interest is in the probability of absorption in one endpoint rather than the other, and in the mean time to absorption. Thus suppose that n ∈ N and that X = (X , X , X , …) is a birth-death chain on N = {0, 1, … , n} with r(0) = r(n) = 1 and with p(x) > 0 and q(x) > 0 for x ∈ {1, 2, … , n − 1}. So the endpoints 0 and n are absorbing, and all other states lead to each other and to the endpoints. Let N = min{n ∈ N : X ∈ {0, n}} , the time until absorption, and for x ∈ S let v (x) = P(X = 0 ∣ X = x) and m (x) = E(N ∣ X = x) . The definitions make sense since N is finite with probability 1. +
0
1
2
n
n
n
n
N
0
0
The absorption probability function for state 0 is given by 1 vn (x) =
n−1
An
n−1
q(1) ⋯ q(i)
∑ p(1) ⋯ p(i)
i=x
q(1) ⋯ q(i)
for x ∈ Nn where An = ∑ i=0
(16.13.28) p(1) ⋯ p(i)
Proof Note that A → A as n → ∞ where A is the constant above for the absorption probability at 0 with the infinite state space N. If A < ∞ then v (x) → v(x) as n → ∞ for x ∈ N . n
n
The mean absorption time is given by x−1
mn (x) = mn (1) ∑ y=0
q(1) ⋯ q(y)
x−1
y
q(z + 1) ⋯ q(y)
−∑∑ p(1) ⋯ p(y)
y=0 z=1
, p(z) ⋯ p(y)
x ∈ Nn
(16.13.31)
where, with A as in the previous theorem, n
1 mn (1) =
n−1
y
q(z + 1) ⋯ q(y)
∑∑ An
y=1 z=1
(16.13.32) p(z) ⋯ p(y)
Proof
Time Reversal Our next discussion is on the time reversal of a birth-death chain. Essentially, every recurrent birth-death chain is reversible. Suppose that reversible.
X = (X0 , X1 , X2 , …)
is an irreducible, recurrent birth-death chain on an integer interval
S
. Then
X
is
Proof If S is finite and the chain X is irreducible, then of course X is recurrent (in fact positive recurrent), so by the previous result, X is reversible. In the case S = N , we can use the invariant function above to show directly that the chain is reversible.
16.13.3
https://stats.libretexts.org/@go/page/10386
Suppose that X = (X is reversible.
0,
X1 , X2 , …)
is a birth-death chain on N with p(x) > 0 for x ∈ N and q(x) > 0 for x ∈ N . Then X +
Proof Thus, in the positive recurrent case, when the variables are given the invariant distribution, the transition matrix chain forward in time and backwards in time.
describes the
P
Examples and Special Cases As always, be sure to try the problems yourself before looking at the solutions.
Constant Birth and Death Probabilities Our first examples consider birth-death chains on N with constant birth and death probabilities, except at the boundary points. Such chains are often referred to as random walks, although that term is used in a variety of different settings. The results are special cases of the general results above, but sometimes direct proofs are illuminating. Suppose that X = (X , X , X , …) is the birth-death chain on N with constant birth probability constant death probability q ∈ (0, ∞) on N , with p + q ≤ 1 . Then 0
1
2
p ∈ (0, ∞)
on
N
and
+
1. X is transient if q < p 2. X is null recurrent if q = p 3. X is positive recurrent if q > p , and the invariant distribution is the geometric distribution on N with parameter p/q p f (x) = (1 −
x
p )(
q
) ,
x ∈ N
(16.13.37)
q
Next we consider the random walk on N with 0 absorbing. As in the discussion of absorption above, v(x) denotes the absorption probability and m(x) the mean time to absorption, starting in state x ∈ N. Suppose that X = (X , X , …) is the birth-death chain on N with constant birth probability p ∈ (0, ∞) on N and constant death probability q ∈ (0, ∞) on N , with p + q ≤ 1 . Assume also that r(0) = 1 , so that 0 is absorbing. 0
1
+
+
1. If q ≥ p then v(x) = 1 for all x ∈ N. If q < p then v(x) = (q/p) for x ∈ N. 2. If q ≤ p then m(x) = ∞ for all x ∈ N . If q > p then m(x) = x/(q − p) for x ∈ N. x
+
Proof This chain is essentially the gambler's ruin chain. Consider a gambler who bets on a sequence of independent games, where p and q are the probabilities of winning and losing, respectively. The gambler receives one monetary unit when she wins a game and must pay one unit when she loses a game. So X is the gambler's fortune after playing n games. n
Next we consider random walks on a finite interval. Suppose that X = (X , X , …) is the birth-death chain on N = {0, 1, … , n} with constant birth probability p ∈ (0, ∞) on {0, 1, … , n − 1} and constant death probability q ∈ (0, ∞) on {1, 2, … , n}, with p + q ≤ 1 . Then X is positive recurrent and the invariant probability density function f is given as follows: 0
1
n
n
1. If p ≠ q then x
(p/q ) (1 − p/q) fn (x) =
n+1
,
x ∈ Nn
(16.13.39)
1 − (p/q)
2. If p = q then f
n (x)
= 1/(n + 1)
for x ∈ N . n
Note that if p < q then the invariant distribution is a truncated geometric distribution, and f (x) → f (x) for x ∈ N where f is the invariant probability density function of the birth-death chain on N considered above. If p = q , the invariant distribution is uniform on N , certainly a reasonable result. Next we consider the chain with both endpoints absorbing. As before, v is the function that gives the probability of absorption in state 0, while m is the function that gives the mean time to absorption. n
n
n
n
16.13.4
https://stats.libretexts.org/@go/page/10386
Suppose that X = (X , X , …) is the birth-death chain on N = {0, 1, … , n} with constant birth probability p ∈ (0, 1) and death probability q ∈ (0, ∞) on {1, 2, … , n − 1}, where p + q ≤ 1 . Assume also that r(0) = r(n) = 1 , so that 0 and n are absorbing. 0
1
n
1. If p ≠ q then x
(q/p ) vn (x) =
2. If p = q then v
n (x)
Note that if q < p then v
= 1 − x/n
n (x)
→ v(x)
n
− (q/p )
,
1 − (q/p)n
(16.13.40)
for x ∈ N
n
as n → ∞ for x ∈ N.
Suppose again that X = (X , X , …) is the birth-death chain on N p ∈ (0, 1) and death probability q ∈ (0, ∞) on {1, 2, … , n − 1}, where that 0 and n are absorbing. 0
x ∈ Nn
1
n
with constant birth probability p + q ≤ 1 . Assume also that r(0) = r(n) = 1 , so = {0, 1, … , n}
1. If p ≠ q then x
1 − (q/p)
n mn (x) =
p −q
n
x +
1 − (q/p)
, q −p
x ∈ Nn
(16.13.41)
2. If p = q then 1 mn (x) =
x(n − x), 2p
x ∈ Nn
(16.13.42)
Special Birth-Death Chains Some of the random processes that we have studied previously are birth-death Markov chains. Describe each of the following as a birth-death chain. 1. The Ehrenfest chain. 2. The modified Ehrenfest chain. 3. The Bernoulli-Laplace chain 4. The simple random walk on Z. Answer
Other Examples Consider the birth-death process on N with p(x) =
1 x+1
, q(x) = 1 − p(x) , and r(x) = 0 for x ∈ S .
1. Find the invariant function g . 2. Classify the chain. Answer This page titled 16.13: Discrete-Time Birth-Death Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.13.5
https://stats.libretexts.org/@go/page/10386
16.14: Random Walks on Graphs Basic Theory Introduction Suppose that G = (S, E) is a graph with vertex set S and edge set E ⊆ S . We assume that the graph is undirected (perhaps a better term would be bi-directed) in the sense that (x, y) ∈ E if and only if (y, x) ∈ E . The vertex set S is countable, but may be infinite. Let N (x) = {y ∈ S : (x, y) ∈ E} denote the set of neighbors of a vertex x ∈ S , and let d(x) = #[N (x)] denote the degree of x. We assume that N (x) ≠ ∅ for x ∈ S , so G has no isolated points. 2
Suppose now that there is a conductance c(x, y) > 0 associated with each edge (x, y) ∈ E . The conductance is symmetric in the sense that c(x, y) = c(y, x) for (x, y) ∈ E . We extend c to a function on all of S × S by defining c(x, y) = 0 for (x, y) ∉ E . Let C (x) = ∑ c(x, y),
x ∈ S
(16.14.1)
y∈S
so that C (x) is the total conductance of the edges coming from x. Our main assumption is that C (x) < ∞ for x ∈ S . As the terminology suggests, we imagine a fluid of some sort flowing through the edges of the graph, so that the conductance of an edge measures the capacity of the edge in some sense. One of the best interpretation is that the graph is an electrical network and the edges are resistors. In this interpretation, the conductance of a resistor is the reciprocal of the resistance. In some applications, specifically the resistor network just mentioned, it's appropriate to impose the additional assumption that G has no loops, so that (x, x) ∉ E for each x ∈ S . However, that assumption is not mathematically necessary for the Markov chains that we will consider in this section. The discrete-time Markov chain X = (X
0,
X1 , X2 , …)
with state space S and transition probability matrix P given by c(x, y)
P (x, y) =
,
(x, y) ∈ S
2
(16.14.2)
C (x)
is called a random walk on the graph G. Justification This chain governs a particle moving along the vertices of G. If the particle is at vertex x ∈ S at a given time, then the particle will be at a neighbor of x at the next time; the neighbor is chosen randomly, in proportion to the conductance. In the setting of an electrical network, it is natural to interpret the particle as an electron. Note that multiplying the conductance function c by a positive constant has no effect on the associated random walk. Suppose that d(x) < ∞ for each x ∈ S and that c is constant on the edges. Then 1. C (x) = cd(x) for every x ∈ S . 2. The transition matrix P is given by P (x, y) =
1 d(x)
for x ∈ S and y ∈ N (x), and P (x, y) = 0 otherwise.
The discrete-time Markov chain X is the symmetric random walk on G. Proof Thus, for the symmetric random walk, if the state is x ∈ S at a given time, then the next state is equally likely to be any of the neighbors of x. The assumption that each vertex has finite degree means that the graph G is locally finite. Let X be a random walk on a graph G. 1. If G is connected then X is irreducible. 2. If G is not connected then the equivalence classes of X are the components of G (the maximal connected subsets of S ). Proof So as usual, we will usually assume that G is connected, for otherwise we could simply restrict our attention to a component of G. In the case that G has no loops (again, an important special case because of applications), it's easy to characterize the periodicity of
16.14.1
https://stats.libretexts.org/@go/page/10387
the chain. For the theorem that follows, recall that G is bipartite if the vertex set S can be partitioned into nonempty, disjoint sets A and B (the parts) such that every edge in E has one endpoint in A and one endpoint in B . Suppose that X is a random walk on a connected graph G with no loops. Then X is either aperiodic or has period 2. Moreover, X has period 2 if and only if G is bipartite, in which case the parts are the cyclic classes of X. Proof
Positive Recurrence and Invariant Distributions Suppose again that X is a random walk on a graph G, and assume that G is connected so that X is irreducible. The function C is invariant for P . The random walk X is positive recurrent if and only if K = ∑ C (x) = x∈S
∑ (x,y)∈S
c(x, y) < ∞
(16.14.4)
2
in which case the invariant probability density function f is given by f (x) = C (x)/K for x ∈ S . Proof Note that K is the total conductance over all edges in G. In particular, of course, if S is finite then X is positive recurrent, with f as the invariant probability density function. For the symmetric random walk, this is the only way that positive recurrence can occur: The symmetric random walk on G is positive recurrent if and only if the set of vertices S is finite, in which case the invariant probability density function f is given by d(x) f (x) =
,
x ∈ S
(16.14.6)
2m
where d is the degree function and where m is the number of undirected edges. Proof On the other hand, when S is infinite, the classification of X as recurrent or transient is complicated. We will consider an interesting special case below, the symmetric random walk on Z . k
Reversibility Essentially, all reversible Markov chains can be interpreted as random walks on graphs. This fact is one of the reasons for studying such walks. If X is a random walk on a connected graph G, then X is reversible with respect to C . Proof Of course, if X is recurrent, then C is the only positive invariant function, up to multiplication by positive constants, and so X is simply reversible. Conversely, suppose that X is an irreducible Markov chain on S with transition matrix P and positive invariant function g . If X is reversible with respect to g then X is the random walk on the state graph with conductance function c given by c(x, y) = g(x)P (x, y) for (x, y) ∈ S . 2
Proof Again, in the important special case that X is recurrent, there exists a positive invariant function g that is unique up to multiplication by positive constants. In this case the theorem states that an irreducible, recurrent, reversible chain is a random walk on the state graph.
Examples and Applications
16.14.2
https://stats.libretexts.org/@go/page/10387
The Wheatstone Bridge Graph The graph below is called the Wheatstone bridge in honor of Charles Wheatstone.
Figure 16.14.1: The Wheatstone bridge network, with conductance values in red
In this subsection, let X be the random walk on the Wheatstone bridge above, with the given conductance values. For the random walk X, 1. Explicitly give the transition probability matrix P . 2. Given X = a , find the probability density function of X . 0
2
Answer For the random walk X, 1. Show that X is aperiodic. 2. Find the invariant probability density function. 3. Find the mean return time to each state. 4. Find lim P . n
n→∞
Answer
The Cube Graph The graph below is the 3-dimensional cube graph. The vertices are bit strings of length 3, and two vertices are connected by an edge if and only if the bit strings differ by a single bit.
Figure 16.14.2: The cube graph with conductance values in red
In this subsection, let X denote the random walk on the cube graph above, with the given conductance values. For the random walk X, 1. Explicitly give the transition probability matrix P . 2. Suppose that the initial distribution is the uniform distribution on {000, 001, 101, 100} . Find the probability density function of X . 2
Answer For the random walk X, 1. Show that the chain has period 2 and find the cyclic classes. 2. Find the invariant probability density function. 3. Find the mean return time to each state. 4. Find lim P . 5. Find lim P . 2n
n→∞
2n+1
n→∞
Answer
16.14.3
https://stats.libretexts.org/@go/page/10387
Special Models Recall that the basic Ehrenfest chain with m ∈ N sketch the graph and find a conductance function.
balls is reversible. Interpreting the chain as a random walk on a graph,
+
Answer Recall that the modified Ehrenfest chain with m ∈ N sketch the graph and find a conductance function.
+
balls is reversible. Interpreting the chain as a random walk on a graph,
Answer Recall that the Bernoulli-Laplace chain with j ∈ N balls in urn 0, k ∈ N balls in urn 1, and with r ∈ {0, … , j + k} of the balls red, is reversible. Interpreting the chain as a random walk on a graph, sketch the graph and find a conductance function. Simplify the conductance function in the special case that j = k = r . +
+
Answer
Random Walks on Z Random walks on integer lattices are particularly interesting because of their classification as transient or recurrent. We consider the one-dimensional case in this subsection, and the higher dimensional case in the next subsection. Let X = (X
0,
X1 , X2 , …)
be the discrete-time Markov chain with state space Z and transition probability matrix P given by P (x, x + 1) = p, P (x, x − 1) = 1 − p,
x ∈ Z
(16.14.9)
where p ∈ (0, 1). The chain X is called the simple random walk on Z with parameter p. The term simple is used because the transition probabilities starting in state x ∈ Z do not depend on x. Thus the chain is spatially as well as temporally homogeneous. In the special case p = , the chain X is the simple symmetric random walk on Z. Basic properties of the simple random walk on Z, and in particular, the simple symmetric random walk were studied in the chapter on Bernoulli Trials. Of course, the state graph G of X has vertex set Z, and the neighbors of x ∈ Z are x + 1 and x − 1 . It's not immediately clear that X is a random walk on G associated with a conductance function, which after all, is the topic of this section. But that fact and more follow from the next result. 1 2
Let g be the function on Z defined by p g(x) = (
x
) ,
x ∈ Z
(16.14.10)
1 −p
Then 1. g(x)P (x, y) = g(y)P (y, x) for all (x, y) ∈ Z 2. g is invariant for X 3. X is reversible with respect to g 4. X is the random walk on Z with conductance function c given by c(x, x + 1) = p 2
x+1
x
/(1 − p )
for x ∈ Z.
Proof In particular, the simple symmetric random walk is the symmetric random walk on G. The chain X is irreducible and periodic with period 2. Moreover P
2n
(0, 0) = (
2n n n ) p (1 − p ) , n
n ∈ N
(16.14.11)
Proof Classification of the simple random walk on Z. 1. If p ≠ 2. If p =
1 2 1 2
then X is transient. then X is null recurrent.
16.14.4
https://stats.libretexts.org/@go/page/10387
Proof So for the one-dimensional lattice Z, the random walk X is transient in the non-symmetric case, and null recurrent in the symmetric case. Let's return to the invariant functions of X Consider again the random walk X on Z with parameter p ∈ (0, 1). The constant function 1 on Z and the function g given by x
p g(x) = (
) ,
x ∈ Z
(16.14.13)
1 −p
are invariant for X. All other invariant functions are linear combinations of these two functions. Proof Note that when p = , the constant function 1 is the only positive invariant function, up to multiplication by positive constants. But we know this has to be the case since the chain is recurrent when p = . Moreover, the chain is reversible. In the nonsymmetric case, when p ≠ , we have an example of a transient chain which nonetheless has non-trivial invariant functions—in fact a two dimensional space of such functions. Also, X is reversible with respect to g , as shown above, but the reversal of X with respect to 1 is the chain with transition matrix Q given by Q(x, y) = P (y, x) for (x, y) ∈ Z . This chain is just the simple random walk on Z with parameter 1 − p . So the non-symmetric simple random walk is an example of a transient chain that is reversible with respect to one invariant measure but not with respect to another invariant measure. 1 2
1 2
1 2
2
Random walks on Z
k
More generally, we now consider Z , where k ∈ N . For i ∈ {1, 2, … , k}, let u ∈ Z denote the unit vector with 1 in position i and 0 elsewhere. The k -dimensional integer lattice G has vertex set Z , and the neighbors of x ∈ Z are x ± u for i ∈ {1, 2, … , k}. So in particular, each vertex has 2k neighbors. k
k
+
i
k
k
Let X
be the Markov chain on Z with transition probability matrix P given by k
= (X1 , X2 , …)
P (x, x + ui ) = pi , P (x, x − ui ) = qi ;
where p > 0 , q > 0 for i ∈ {1, 2, … , k} and ∑ (p + q parameters p = (p , p , … , p ) and q = (q , q , … , q ) . k
i
i
2
k
1
2
i)
i
i=1
1
i
=1
k
x ∈ Z , i ∈ {1, 2, … , k}
. The chain
X
(16.14.15)
is the simple random walk on
k
Z
with
k
Again, the term simple means that the transition probabilities starting at x ∈ Z do not depend on x, so that the chain is spatially homogeneous as well as temporally homogeneous. In the special case that p = q = for i ∈ {1, 2, … , k}, X is the simple symmetric random walk on Z . The following theorem is the natural generalization of the result abpve for the one-dimensional case. k
1
i
i
2k
k
Define the function g : Z
k
→ (0, ∞)
by k
g(x1 , x2 , … , xk ) = ∏ ( i=1
pi
xi
)
,
qi
k
(x1 , x2 , … , xk ) ∈ Z
(16.14.16)
Then 1. g(x)P (x, y) = g(y)P (y, x) for all x, y ∈ Z 2. g is invariant for X . 3. X is reversible with respect to g . 4. X is the random walk on G with conductance function c given by c(x, y) = g(x)P (x, y) for x, k
k
y ∈ Z
.
Proof It terms of recurrence and transience, it would certainly seem that the larger the dimension k , the less likely the chain is to be recurrent. That's generally true: Classification of the simple random walk on Z . k
1. For k ∈ {1, 2}, X is null recurrent in the symmetric case and transient for all other values of the parameters.
16.14.5
https://stats.libretexts.org/@go/page/10387
2. For k ∈ {3, 4, …}, X is transient for all values of the parameters. Proof sketch So for the simple, symmetric random walk on the integer lattice Z , we have the following interesting dimensional phase shift: the chain is null recurrent in dimensions 1 and 2 and transient in dimensions 3 or more. k
Let's return to the positive invariant functions for X . Again, the results generalize those for the one-dimensional case. For J ⊆ {1, 2 … , k}, define g on Z by k
J
pj gJ (x1 , x2 , … , xk ) = ∏ ( j∈J
Let X denote the simple random walk on wherre p = p , q = q for j ∈ J , and p J
J j
J
j
J
j
j
j
J
J
qj
,
k
(x1 , x2 , … , xk ) ∈ Z
(16.14.18)
with transition matrix P , corresponding to the parameter vectors =q , q =p for j ∉ J . Then k
Z
1. g (x)P (x, y) = g (y)P (y, x) for all x, 2. g is invariant for X . 3. X is reversal of X with respect to g . J
xj
)
J
p
J
and q , J
J
j
j
j
k
y ∈ Z
J
J
J
Proof Note that when J = ∅ , g = 1 and when J = {1, 2, … , k}, g = g , the invariant function introduced above. So in the completely non-symmetric case where p ≠ q for every i ∈ {1, 2, … , k}, the random walk X has 2 positive invariant functions that are linearly independent, and X is reversible with respect to one of them. J
J
k
i
i
This page titled 16.14: Random Walks on Graphs is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.14.6
https://stats.libretexts.org/@go/page/10387
16.15: Introduction to Continuous-Time Markov Chains This section begins our study of Markov processes in continuous time and with discrete state spaces. Recall that a Markov process with a discrete state space is called a Markov chain, so we are studying continuous-time Markov chains. It will be helpful if you review the section on general Markov processes, at least briefly, to become familiar with the basic notation and concepts. Also, discrete-time chains plays a fundamental role, so you will need review this topic also. We will study continuous-time Markov chains from different points of view. Our point of view in this section, involving holding times and the embedded discrete-time chain, is the most intuitive from a probabilistic point of view, and so is the best place to start. In the next section, we study the transition probability matrices in continuous time. This point of view is somewhat less intuitive, but is closest to how other types of Markov processes are treated. Finally, in the third introductory section we study the Markov chain from the view point of potential matrices. This is the least intuitive approach, but analytically one of the best. Naturally, the interconnections between the various approaches are particularly important.
Preliminaries As usual, we start with a probability space (Ω, F , P), so that Ω is the set of outcomes, F the σ-algebra of events, and P the probability measure on the sample space (Ω, F ). The time space is ([0, ∞), T ) where as usual, T is the Borel σ-algebra on [0, ∞) corresponding to the standard Euclidean topology. The state space is (S, S ) where S is countable and S is the power set of S . So every subset of S is measurable, as is every function from S to another measurable space. Recall that S is also the Borel σ algebra corresponding to the discrete topology on S . With this topology, every function from S to another topological space is continuous. Counting measure # is the natural measure on (S, S ), so in the context of the general introduction, integrals over S are simply sums. Also, kernels on S can be thought of as matrices, with rows and sums indexed by S . The left and right kernel operations are generalizations of matrix multiplication. Suppose now that
is stochastic process with state space (S, S ). For t ∈ [0, ∞), let , so that F is the σ-algebra of events defined by the process up to time t . The collection of σ-algebras F = {F : t ∈ [0, ∞)} is the natural filtration associated with X. For technical reasons, it's often necessary to have a filtration F = { F : t ∈ [0, ∞)} that is slightly finer than the natural one, so that F ⊆ F for t ∈ [0, ∞) (or in equivlaent jargon, X is adapted to F). See the general introduction for more details on the common ways that the natural filtration is refined. We will also let G = σ{X : s ∈ [t, ∞)} , the σ-algebra of events defined by the process from time t onward. If t is thought of as the present time, then F is the collection of events in the past and G is the collection of events in the future. 0
Ft
X = { Xt : t ∈ [0, ∞)}
= σ{ Xs : s ∈ [0, t]}
0
0
t
0
t
0
t
t
t
t
s
t
t
It's often necessary to impose assumptions on the continuity of the process X in time. Recall that X is right continuous if t ↦ X (ω) is right continuous on [0, ∞) for every ω ∈ Ω , and similarly X has left limits if t ↦ X (ω) has left limits on (0, ∞) for every ω ∈ Ω . Since S has the discrete topology, note that if X is right continuous, then for every t ∈ [0, ∞) and ω ∈ Ω , there exists ϵ (depending on t and ω) such that X (ω) = X (ω) for s ∈ [0, ϵ) . Similarly, if X has left limits, then for every t ∈ (0, ∞) and ω ∈ Ω there exists δ (depending on t and ω) such that X (ω) is constant for s ∈ (0, δ) . t
t
t+s
t
t−s
The Markov Property There are a number of equivalent ways to state the Markov property. At the most basic level, the property states that the past and future are conditionally independent, given the present. The process X = {X
t
: t ∈ [0, ∞)}
is a Markov chain on S if for every t ∈ [0, ∞), A ∈ F , and B ∈ G , t
t
P(A ∩ B ∣ Xt ) = P(A ∣ Xt )P(B ∣ Xt )
(16.15.1)
Another version is that the conditional distribution of a state in the future, given the past, is the same as the conditional distribution just given the present state. The process X = {X
t
: t ∈ [0, ∞)}
is a Markov chain on S if for every s,
t ∈ [0, ∞)
P(Xs+t = x ∣ Fs ) = P(Xs+t = x ∣ Xs )
16.15.1
, and x ∈ S , (16.15.2)
https://stats.libretexts.org/@go/page/10388
Technically, in the last two definitions, we should say that X is a Markov process relative to the filtration F. But recall that if X satisfies the Markov property relative to a filtration, then it satisfies the Markov property relative to any coarser filtration, and in particular, relative to the natural filtration. For the natural filtration, the Markov property can also be stated without explicit reference to σ-algebras, although at the cost of additional clutter: The process
X = { Xt : t ∈ [0, ∞)} n
(t1 , t2 , … , tn ) ∈ [0, ∞)
with t
1
is a Markov chain on S if and only if for every , and state sequence (x , x , … , x ) ∈ S ,
n ∈ N+
, time sequence
n
< t2 < ⋯ < tn
1
2
n
P (Xtn = xn ∣ Xt1 = x1 , Xt2 = x2 , … Xtn−1 = xn−1 ) = P (Xtn = xn ∣ Xtn−1 = xn−1 )
(16.15.3)
As usual, we also assume that our Markov chain X is time homogeneous, so that P(X = y ∣ X = x) = P(X = y ∣ X = x) for s, t ∈ [0, ∞) and x, y ∈ S . So, for a homogeneous Markov chain on S , the process {X : t ∈ [0, ∞)} given X = x , is independent of F and equivalent to the process {X : t ∈ [0, ∞)} given X = x , for every s ∈ [0, ∞) and x ∈ S . That is, if the chain is in state x ∈ S at a particular time s ∈ [0, ∞) , it does not matter how the chain got to x; the chain essentially starts over in state x. s+t
s
s+t
s
t
t
0
s
0
The Strong Markov Property Random times play an important role in the study of continuous-time Markov chains. It's often necessary to allow random times to take the value ∞, so formally, a random time τ is a random variable on the underlying sample space (Ω, F ) taking values in [0, ∞]. Recall also that a random time τ is a stopping time (also called a Markov time or an optional time) if {τ ≤ t} ∈ F for every t ∈ [0, ∞). If τ is a stopping time, the σ-algebra associated with τ is t
Fτ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft for all t ∈ [0, ∞)}
(16.15.4)
So F is the collection of events up to the random time τ in the same way that F is the collection of events up to the deterministic time t ∈ [0, ∞). We usually want the Markov property to extend from deterministic times to stopping times. τ
t
The process X = {X
t
: t ∈ [0, ∞)}
is a strong Markov chain on S if for every stopping time τ , t ∈ [0, ∞), and x ∈ S , P(Xτ+t = x ∣ Fτ ) = P(Xτ+t = x ∣ Xτ )
(16.15.5)
So, for a homogeneous strong Markov chain on S , the process {X : t ∈ [0, ∞)} given X = x , is independent of F and equivalent to the process {X : t ∈ [0, ∞)} given X = x , for every stopping time τ and x ∈ S . That is, if the chain is in state x ∈ S at a stopping time τ , then the chain essentially starts over at x, independently of the past. τ+t
t
τ
τ
0
Holding Times and the Jump Chain For our first point of view, we sill study when and how our Markov chain properties of the exponential distribution, so we need a quick review.
X
changes state. The discussion depends heavily on
The Exponential Distribution A random variable τ has the exponential distribution with rate parameter r ∈ (0, ∞) if τ has a continuous distribution on [0, ∞) with probability density function f given by f (t) = re for t ∈ [0, ∞). Equivalently, the right distribution function F is given by −rt
F
c
c
(t) = P(τ > t) = e
−rt
,
t ∈ [0, ∞)
(16.15.6)
The mean of the distribution is 1/r and the variance is 1/r . The exponential distribution has an amazing number of characterizations. One of the most important is the memoryless property which states that a random variable τ with values in [0, ∞) has an exponential distribution if and only if the conditional distribution of τ − s given τ > s is the same as the distribution of τ itself, for every s ∈ [0, ∞) . It's easy to see that the memoryless property is equivalent to the law of exponents for right distribution function F , namely F (s + t) = F (s)F (t) for s, t ∈ [0, ∞) . Since F is right continuous, the only solutions are exponential functions. 2
c
c
c
c
c
For our study of continuous-time Markov chains, it's helpful to extend the exponential distribution to two degenerate cases, τ = 0 with probability 1, and τ = ∞ with probability 1. In terms of the parameter, the first case corresponds to r = ∞ so that F (t) = P(τ > t) = 0 for every t ∈ [0, ∞), and the second case corresponds to r = 0 so that F (t) = P(τ > t) = 1 for every t ∈ [0, ∞). Note that in both cases, the function F satisfies the law of exponents, and so corresponds to a memoryless distribution
16.15.2
https://stats.libretexts.org/@go/page/10388
in a general sense. In all cases, the mean of the exponential distribution with parameter 1/0 = ∞ and 1/∞ = 0 .
is
r ∈ [0, ∞]
, where we interpret
1/r
Holding Times The Markov property implies the memoryless property for the random time when a Markov process first leaves its initial state. It follows that this random time must have an exponential distribution. Suppose that X = {X : t ∈ [0, ∞)} is a Markov chain on S , and let τ = inf{t ∈ [0, ∞) : X conditional distribution of τ given X = x is exponential with parameter λ(x) ∈ [0, ∞]. t
t
≠ X0 }
. For
x ∈ S
, the
0
Proof So, associated with the Markov chain X on S is a function λ : S → [0, ∞] that gives the exponential parameters for the holding times in the states. Considering the ordinary exponential distribution, and the two degenerate versions, we are led to the following classification of states: Suppose again that X = {X
t
: t ∈ [0, ∞)}
is a Markov chain on S with exponential parameter function λ . Let x ∈ S .
1. If λ(x) = 0 then P(τ = ∞ ∣ X = x) = 1 , and x is said to be an absorbing state. 2. If λ(x) ∈ (0, ∞) then P(0 < τ < ∞ ∣ X = x) = 1 and x is said to be an stable state. 3. If λ(x) = ∞ then P(τ = 0 ∣ X = x) = 1 , and x is said to be an instantaneous state. 0
0
0
As you can imagine, an instantaneous state corresponds to weird behavior, since the chain starting in the state leaves the state at times arbitrarily close to 0. While mathematically possible, instantaneous states make no sense in most applications, and so are to be avoided. Also, the proof of the last result has some technical holes. We did not really show that τ is a valid random time, let alone a stopping time. Fortunately, one of our standard assumptions resolves these problems. Suppose again that X = {X then
t
: t ∈ [0, ∞)}
is a Markov chain on S . If the process X and the filtration F are right continuous,
1. τ is a stopping time. 2. X has no instantaneous states. 3. P(X ≠ x ∣ X = x) = 1 if x ∈ S is stable. 4. X is a strong Markov process. τ
0
Proof There is actually a converse to part (b) that states that if X has no instantaneous states, then there is a version of X that is right continuous. From now on, we will assume that our Markov chains are right continuous with probability 1, and hence have no instantaneous states. On the other hand, absorbing states are perfectly reasonable and often do occur in applications. Finally, if the chain enters a stable state, it will stay there for a (proper) exponentially distributed time, and then leave.
The Jump Chain Without instantaneous states, we can now construct a sequence of stopping times. Basically, we let τ denote the n th time that the chain changes state for n ∈ N , unless the chain has previously been caught in an absorbing state. Here is the formal construction: n
+
Suppose again that X = {X : t ∈ [0, ∞)} is a Markov chain on Recursively, suppose that τ is defined for n ∈ N . If τ = ∞ let τ t
n
+
n
. Let τ = 0 and τ = ∞ . Otherwise, let
S
n+1
0
1
= inf{t ∈ [0, ∞) : Xt ≠ X0 }
τn+1 = inf {t ∈ [ τn , ∞) : Xt ≠ Xτ }
(16.15.9)
n
Let M
= sup{n ∈ N : τn < ∞}
.
.
In the definition of M , of course, sup(N) = ∞ , so M is the number of changes of state. If M < ∞ , the chain was sucked into an absorbing state at time τ . Since we have ruled out instantaneous states, the sequence of random times in strictly increasing up until the (random) term M . That is, with probability 1, if n ∈ N and τ < ∞ then τ < τ . Of course by construction, if τ = ∞ then τ = ∞ . The increments τ −τ for n ∈ N with n < M are the times spent in the states visited by X. The process at the random times when the state changes forms an embedded discrete-time Markov chain. M
n
n
n+1
n+1
n
n+1
n
16.15.3
https://stats.libretexts.org/@go/page/10388
Suppose again that X = {X : t ∈ [0, ∞)} is a Markov chain on S . Let {τ : n ∈ N} denote the stopping times and M the random index, as defined above. For n ∈ N , let Y = X if n ≤ M and Y = X if n > M . Then Y = {Y : n ∈ N} is a (homogenous) discrete-time Markov chain on S , known as the jump chain of X. t
n
n
τn
n
τM
n
Proof As noted in the proof, the one-step transition probability matrix Q for the jump chain Y is given for (x, y) ∈ S by 2
Q(x, y) = {
P(Xτ = y ∣ X0 = x),
x stable
I (x, y),
x absorbing
(16.15.13)
where I is the identity matrix on S . Of course Q satisfies the usual properties of a probability matrix on S , namely Q(x, y) ≥ 0 for (x, y) ∈ S and ∑ Q(x, y) = 1 for x ∈ S . But Q satisfies another interesting property as well. Since the the state actually changes at time τ starting in a stable state, we must have Q(x, x) = 0 if x is stable and Q(x, x) = 1 if x is absorbing. 2
y∈S
Given the initial state, the holding time and the next state are independent. If x,
y ∈ S
and t ∈ [0, ∞) then P(Y
1
= y, τ1 > t ∣ Y0 = x) = Q(x, y)e
−λ(x)t
Proof The following theorem is a generalization. The changes in state and the holding times are independent, given the initial state. Suppose that n ∈ N
+
and that (x
0,
x1 , … , xn )
is a sequence of stable states and (t
1,
t2 , … , tn )
is a sequence in [0, ∞). Then
P(Y1 = x1 , τ1 > t1 , Y2 = x2 , τ2 − τ1 > t2 , … , Yn = xn , τn − τn−1 > tn ∣ Y0 = x0 ) = Q(x0 , x1 )e
−λ( x0 ) t1
Q(x1 , x2 )e
−λ( x1 ) t2
⋯ Q(xn−1 , xn )e
−λ( xn−1 ) tn
Proof
Regularity We now know quite a bit about the structure of a continuous-time Markov chain X = {X : t ∈ [0, ∞)} (without instantaneous states). Once the chain enters a given state x ∈ S , the holding time in state x has an exponential distribution with parameter λ(x) ∈ [0, ∞), after which the next state y ∈ S is chosen, independently of the holding time, with probability Q(x, y). However, we don't know everything about the chain. For the sequence {τ : n ∈ N} defined above, let τ = lim τ , which exists in (0, ∞] of course, since the sequence is increasing. Even though the holding time in a state is positive with probability 1, it's possible that τ < ∞ with positive probability, in which case we know nothing about X for t ≥ τ . The event {τ < ∞} is known as explosion, since it means that the X makes infinitely many transitions before the finite time τ . While not as pathological as the existence of instantaneous states, explosion is still to be avoided in most applications. t
n
∞
∞
t
n→∞
n
∞
∞
∞
A Markov chain X = {X
t
: t ∈ [0, ∞)}
on S is regular if each of the following events has probability 1:
1. X is right continuous. 2. τ → ∞ as n → ∞ . n
There is a simple condition on the exponential parameters and the embedded chain that is equivalent to condition (b). Suppose that X = {X : t ∈ [0, ∞)} is a right-continuous Markov chain on S with exponential parameter function λ and embedded chain Y = (Y , Y , …). Then τ → ∞ as n → ∞ with probability 1 if and only if ∑ 1/λ(Y ) = ∞ with probability 1. t
∞
0
1
n
n=0
n
Proof If λ is bounded, then X is regular. Suppose that X = {X is regular.
t
: t ∈ [0, ∞)}
is a Markov chain on S with exponential parameter function λ . If λ is bounded, then X
Proof Here is another sufficient condition that is useful when the state space is infinite.
16.15.4
https://stats.libretexts.org/@go/page/10388
Suppose that
is a Markov chain on . Then X is regular if
X = { Xt : t ∈ [0, ∞)}
S+ = {x ∈ S : λ(x) > 0}
with exponential parameter function
S
λ : S → [0, ∞)
. Let
1 ∑ x∈S+
=∞
(16.15.19)
λ(x)
Proof As a corollary, note that if S is finite then λ is bounded, so a continuous-time Markov chain on a finite state space is regular. So to review, if the exponential parameter function λ is finite, the chain X has no instantaneous states. Even better, if λ is bounded or if the conditions in the last theorem are satisfied, then X is regular. A continuous-time Markov chain with bounded exponential parameter function λ is called uniform, for reasons that will become clear in the next section on transition matrices. As we will see in later section, a uniform continuous-time Markov chain can be constructed from a discrete-time chain and an independent Poisson process. For the next result, recall that to say that X has left limits with probability 1 means that the random function t ↦ X has limits from the left on (0, ∞) with probability 1. t
If X = {X
t
: t ∈ [0, ∞)}
is regular then X has left limits with probability 1.
Proof Thus, our standard assumption will be that X = {X : t ∈ [0, ∞)} is a regular Markov chain on S . For such a chain, the behavior of X is completely determined by the exponential parameter function λ that governs the holding times, and the transition probability matrix Q of the jump chain Y . Conversely, when modeling real stochastic systems, we often start with λ and Q. It's then relatively straightforward to construct the continuous-time Markov chain that has these parameters. For simplicity, we will assume that there are no absorbing states. The inclusion of absorbing states is not difficult, but mucks up the otherwise elegant exposition. t
Suppose that λ : S → (0, ∞) is bounded and that Q is a probability matrix on S with the property that Q(x, x) = 0 for every x ∈ S . The regular, continuous-time Markov chain X = { X : t ∈ [0, ∞)} with exponential parameter function λ and jump transition matrix Q can be constructed as follows: t
1. First construct the jump chain Y = (Y , Y , …) having transition matrix Q. 2. Next, given Y = (x , x , …) , the transition times (τ , τ , …) are constructed so that the holding times (τ , τ − τ are independent and exponentially distributed with parameters (λ(x ), λ(x ), …) 3. Again given Y = (x , x , …) , define X = x for 0 ≤ t < τ and for n ∈ N , define X = x for τ ≤ t < τ 0
0
1
1
1
2
1
0
0
1
t
0
1
2
1,
…)
1
+
t
n
n
n+1 )
.
Additional details Often, particularly when summarized with a graph.
S
is finite, the essential structure of a standard, continuous-time Markov chain can be succinctly
Suppose again that X = {X embedded transition matrix
is a regular Markov chain on S , with exponential parameter function λ and . The state graph of X is the graph with vertex set S and directed edge set : Q(x, y) > 0} . The graph is labeled as follows: t
E = {(x, y) ∈ S
2
: t ∈ [0, ∞)}
Q
1. Each vertex x ∈ S is labeled with the exponential parameter λ(x). 2. Each edge (x, y) ∈ E is labeled with the transition probability Q(x, y). So except for the labels on the vertices, the state graph of X is the same as the state graph of the discrete-time jump chain Y . That is, there is a directed edge from state x to state y if and only if the chain, when in x, can move to y after the random holding time in x. Note that the only loops in the state graph correspond to absorbing states, and for such a state there are no outward edges. Let's return again to the construction above of a continuous-time Markov chain from the jump transition matrix Q and the exponential parameter function λ . Again for simplicity, assume there are no absorbing states. We assume that Q(x, x) = 0 for all x ∈ S , so that the state really does change at the transition times. However, if we drop this assumption, the construction still produces a continuous-time Markov chain, but with an altered jump transition matrix and exponential parameter function. Suppose that Q is a transition matrix on S × S with Q(x, x) < 1 for x ∈ S , and that λ : S → (0, ∞) is bounded. The stochastic process X = {X : t ∈ [0, ∞)} constructed above from Q and λ is a regular, continuous-time Markov chain with t
16.15.5
https://stats.libretexts.org/@go/page/10388
~
~
exponential parameter function λ and jump transition matrix Q given by ~ λ(x) = λ(x)[1 − Q(x, x)],
x ∈ S
Q(x, y)
~ Q(x, y) =
,
(x, y) ∈ S
2
, x ≠y
1 − Q(x, x)
Proof 1 Proof 2 This construction will be important in our study of chains subordinate to the Poisson process.
Transition Times The structure of a regular Markov chain on S , as described above, can be explained purely in terms of a family of independent, exponentially distributed random variables. The main tools are some additional special properties of the exponential distribution, that we need to restate in the setting of our Markov chain. Our interest is in how the process evolves among the stable states until it enters an absorbing state (if it does). Once in an absorbing state, the chain stays there forever, so the behavior from that point on is trivial. Suppose that X = {X : t ∈ [0, ∞)} is a regular Markov chain on S , with exponential parameter function probability matrix Q. Define μ(x, y) = λ(x)Q(x, y) for (x, y) ∈ S . Then t
λ
and transition
2
1. λ(x) = ∑ μ(x, y) for x ∈ S . 2. Q(x, y) = μ(x, y)/λ(x) if (x, y) ∈ S and x is stable. y∈S
2
The main point is that the new parameters μ(x, y) for (x, y) ∈ S determine the exponential parameters λ(x) for x ∈ S , and the transition probabilities Q(x, y) when x ∈ S is stable and y ∈ S . Of course we know that if λ(x) = 0 , so that x is absorbing, then Q(x, x) = 1. So in fact, the new parameters, as specified by the function μ , completely determine the old parmeters, as specified by the functions λ and Q. But so what? 2
Consider the functions μ , λ , and Q as given in the previous result. Suppose that T has the exponential distribution with parameter μ(x, y) for each (x, y) ∈ S and that {T : (x, y) ∈ S } is a set of independent random variables. Then x,y
2
2
x,y
1. T = inf {T 2. P (T = T x
x,y
x
: y ∈ S}
has the exponential distribution with parameter λ(x) for x ∈ S . for (x, y) ∈ S . 2
x,y ) = Q(x, y)
Proof So here's how we can think of a regular, continuous-time Markov chain on S : There is a timer associated with each (x, y) ∈ S , set to the random time T . All of the timers function independently. When the chain enters state x ∈ S , the timers on (x, y) for y ∈ S are started simultaneously. As soon as the first alarm goes off for a particular (x, y), the chain immediately moves to state y , and the process repeats. Of course, if μ(x, y) = 0 then T = ∞ with probability 1, so only the timers with λ(x) > 0 and Q(x, y) > 0 matter (these correspond to the non-loop edges in the state graph). In particular, if x is absorbing, then the timers on (x, y) are set to infinity for each y , and no alarm ever sounds. 2
x,y
x,y
The new collection of exponential parameters can be used to give an alternate version of the state graph. Again, the vertex set is S and the edge set is E = {(x, y) ∈ S : Q(x, y) > 0} . But now each edge (x, y) is labeled with the exponential rate parameter μ(x, y). The exponential rate parameters are closely related to the generator matrix, a matrix of fundamental importance that we will study in the next section. 2
Examples and Exercises The Two-State Chain The two-state chain is the simplest non-trivial, continuous-time Markov chain, but yet this chain illustrates many of the important properties of general continuous-time chains. So consider the Markov chain X = {X : t ∈ [0, ∞)} on the set of states S = {0, 1}, with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0. t
The transition matrix Q for the embedded chain is given below. Draw the state graph in each case.
16.15.6
https://stats.libretexts.org/@go/page/10388
1. Q = [ 2. Q = [ 3. Q = [ 4. Q = [
0
1
1
0
1
0
1
0
0
1
0
1
1
0
0
1
]
if a > 0 and b > 0 , so that both states are stable.
]
if a = 0 and b > 0 , so that a is absorbing and b is stable.
]
if a > 0 and b = 0 , so that a is stable and b is absorbing.
]
if a = 0 and b = 0 , so that both states are absorbing.
We will return to the two-state chain in subsequent sections.
Computational Exercises Consider the Markov chain embedded transition matrix
X = { Xt : t ∈ [0, ∞)}
on
S = {0, 1, 2}
with exponential parameter function
1
1
2
2
Q =⎢ 1 ⎢
0
0 ⎥ ⎥
1
2
3
3
⎡ 0
⎣
λ = (4, 1, 3)
and
⎤ (16.15.27)
0 ⎦
1. Draw the state graph and classify the states. 2. Find the matrix of transition rates. 3. Classify the jump chain in terms of recurrence and period. 4. Find the invariant distribution of the jump chain. Answer
Special Models Read the introduction to chains subordinate to the Poisson process. Read the introduction to birth-death chains. Read the introduction to continuous-time queuing chains. Read the introduction to continuous-time branching chains. This page titled 16.15: Introduction to Continuous-Time Markov Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.15.7
https://stats.libretexts.org/@go/page/10388
16.16: Transition Matrices and Generators of Continuous-Time Chains 16. Transition Matrices and Generators of Continuous-Time Chains Preliminaries This is the second of the three introductory sections on continuous-time Markov chains. Thus, suppose that X = { X : t ∈ [0, ∞)} is a continuous-time Markov chain defined on an underlying probability space (Ω, F , P) and with state space (S, S ). By the very meaning of Markov chain, the set of states S is countable and the σ-algebra S is the collection of all subsets of S . So every subset of S is measurable, as is every function from S to another measurable space. Recall that S is also the Borel σ algebra corresponding to the discrete topology on S . With this topology, every function from S to another topological space is continuous. Counting measure # is the natural measure on (S, S ), so in the context of the general introduction, integrals over S are simply sums. Also, kernels on S can be thought of as matrices, with rows and sums indexed by S . The left and right kernel operations are generalizations of matrix multiplication. t
A space of functions on S plays an important role. Let B denote the collection of bounded functions f : S → R . With the usual pointwise definitions of addition and scalar multiplication, B is a vector space. The supremum norm on B is given by ∥f ∥ = sup{|f (x)| : x ∈ S},
f ∈ B
(16.16.1)
Of course, if S is finite, B is the set of all real-valued functions on S , and ∥f ∥ = max{|f (x)| : x ∈ S} for f
∈ B
.
In the last section, we studied X in terms of when and how the state changes. To review briefly, let τ = inf{t ∈ (0, ∞) : X ≠ X } . Assuming that X is right continuous, the Markov property of X implies the memoryless property of τ , and hence the distribution of τ given X = x is exponential with parameter λ(x) ∈ [0, ∞) for each x ∈ S . The assumption of right continuity rules out the pathological possibility that λ(x) = ∞ , which would mean that x is an instantaneous state so that P(τ = 0 ∣ X = x) = 1 . On the other hand, if λ(x) ∈ (0, ∞) then x is a stable state, so that τ has a proper exponential distribution given X = x with P(0 < τ < ∞ ∣ X = x) = 1 . Finally, if λ(x) = 0 then x is an absorbing state, so that P(τ = ∞ ∣ X = x) = 1 . Next we define a sequence of stopping times: First τ = 0 and τ = τ . Recursively, if τ < ∞ then τ = inf {t > τ : X ≠ X } , while if τ = ∞ then τ = ∞ . With M = sup{n ∈ N : τ < ∞} we define Y = X if n ∈ N with n ≤ M and Y = Y if n ∈ N with n > M . The sequence Y = (Y , Y , …) is a discrete-time Markov chain on S with one-step transition matrix Q given by Q(x, y) = P(X = y ∣ X = x) if x, y ∈ S with x stable, and Q(x, x) = 1 if x ∈ S is absorbing. Assuming that X is regular, which means that τ → ∞ as n → ∞ with probability 1 (ruling out the explosion event of infinitely many transitions in finite time), the structure of X is completely determined by the sequence of stopping times τ = (τ , τ , …) and the discrete-time jump chain Y = (Y , Y , …) . Analytically, the distribution X is determined by the exponential parameter function λ and the one-step transition matrix Q of the jump chain. t
0
0
0
0
0
0
n
n
0
t
τn
n
n
1
n+1
n
M
0
τ
n
n
τn
1
0
n
0
1
0
1
In this section, we sill study the Markov chain X in terms of the transition matrices in continuous time and a fundamentally important matrix known as the generator. Naturally, the connections between the two points of view are particularly interesting.
The Transition Semigroup Definition and basic Properties
The first part of our discussion is very similar to the treatment for a general Markov processes, except for simplifications caused by the discrete state space. We assume that X = {X : t ∈ [0, ∞)} is a Markov chain on S . t
The transition probability matrix P of X corresponding to t ∈ [0, ∞) is t
Pt (x, y) = P(Xt = y ∣ X0 = x),
In particular, P
0
=I
(x, y) ∈ S
2
(16.16.2)
, the identity matrix on S
Proof Note that since we are assuming that the Markov chain is homogeneous, Pt (x, y) = P(Xs+t = y ∣ Xs = x),
16.16.1
(x, y) ∈ S
2
(16.16.3)
https://stats.libretexts.org/@go/page/10389
for every s, t ∈ [0, ∞) . The Chapman-Kolmogorov equation given next is essentially yet another restatement of the Markov property. The equation is named for Andrei Kolmogorov and Sydney Chapman, Suppose that s, t ∈ [0, ∞)
P = { Pt : t ∈ [0, ∞)}
is the collection of transition matrices for the chain
X
. Then
Ps Pt = Ps+t
for
. Explicitly, Ps+t (x, z) = ∑ Ps (x, y)Pt (y, z),
x, z ∈ S
(16.16.4)
y∈S
Proof Restated in another form of jargon, the collection P = {P : t ∈ [0, ∞)} is a semigroup of probability matrices. The semigroup of transition matrices P , along with the initial distribution, determine the finite-dimensional distributions of X. t
Suppose that X has probability density function f . If and (x , x , … , x ) ∈ S is a state sequence, then 0
n
(t1 , t2 , … , tn ) ∈ [0, ∞)
is a time sequence with
0 < t1 < ⋯ < tn
n+1
0
1
n
P (X0 = x0 , Xt1 = x1 , … Xtn = xn ) = f (x0 )Pt1 (x0 , x1 )Pt2 −t1 (x1 , x2 ) ⋯ Ptn −tn−1 (xn−1 , xn )
(16.16.8)
Proof As with any matrix on S , the transition matrices define left and right operations on functions which are generalizations of matrix multiplication. For a transition matrix, both have natural interpretations. Suppose that f
: S → R
, and that either f is nonnegative or f
∈ B
. Then for t ∈ [0, ∞),
Pt f (x) = ∑ Pt (x, y)f (y) = E[f (Xt ) ∣ X0 = x],
x ∈ S
(16.16.12)
y∈S
The mapping f
↦ Pt f
is a bounded, linear operator on B and ∥P
t∥
=1
.
Proof If f is nonnegative and S is infinte, then it's possible that P f (x) = ∞ . In general, the left operation of a positive kernel acts on positive measures on the state space. In the setting here, if μ is a positive (Borel) measure on (S, S ), then the function f : S → [0, ∞) given by f (x) = μ{x} for x ∈ S is the density function of μ with respect to counting measure # on (S, S ). This simply means that μ(A) = ∑ f (x) for A ⊆ S . Conversely, given f : S → [0, ∞) , the set function μ(A) = ∑ f (x) for A ⊆ S defines a positive measure on (S, S ) with f as its density function. So for the left operation of P , it's natural to consider only nonnegative functions. t
x∈A
x∈A
t
If f
: S → [0, ∞)
then f Pt (y) = ∑ f (x)Pt (x, y),
y ∈ S
(16.16.13)
x∈S
If X has probability density function f then X has probability density function f P . 0
t
t
Proof More generally, if f is the density function of a positive measure μ on (S, S ) then f P is the density function of the measure μP , defined by t
μPt (A) = ∑ μ{x} Pt (x, A) = ∑ f (x)Pt (x, A), x∈S
A function f t ∈ [0, ∞).
: S → [0, ∞)
t
A ⊆S
(16.16.15)
x∈S
is invariant for the Markov chain
X
(or for the transition semigroup
P
) if
f Pt = f
for every
It follows that if X has an invariant probability density function f , then X has probability density function f for every t ∈ [0, ∞), so X is identically distributed. Invariant and limiting distributions are fundamentally important for continuous-time Markov chains. 0
t
16.16.2
https://stats.libretexts.org/@go/page/10389
Standard Semigroups
Suppose again that X = {X : t ∈ [0, ∞)} is a Markov chain on S with transition semigroup P = {P : t ∈ [0, ∞)} . Once again, continuity assumptions need to be imposed on X in order to rule out strange behavior that would otherwise greatly complicate the theory. In terms of the transition semigroup P , here is the basic assumption: t
t
The transition semigroup P is standard if P
t (x,
x) → 1
as t ↓ 0 for each x ∈ S .
Since P (x, x) = 1 for x ∈ S , the standard assumption is clearly a continuity assumption. It actually implies much stronger smoothness properties that we will build up by stages. 0
If the transition semigroup (x, y) ∈ S .
P = { Pt : t ∈ [0, ∞)}
is standard, then the function
t ↦ Pt (x, y)
is right continuous for each
2
Proof Our next result connects one of the basic assumptions in the section on transition times and the embedded chain with the standard assumption here. If the Markov chain X has no instantaneous states then the transition semigroup P is standard. Proof Recall that the non-existence of instantaneous states is essentially equivalent to the right continuity of X. So we have the nice result that if X is right continuous, then so is P . For the remainder of our discussion, we assume that X = {X : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = {P : t ∈ [0, ∞)} , exponential function λ and one-step transition matrix Q for the jump chain. Our next result is the fundamental integral equations relating P , λ , and Q. t
t
For t ∈ [0, ∞), t
Pt (x, y) = I (x, y)e
−λ(x)t
+∫
λ(x)e
−λ(x)s
Q Pt−s (x, y) ds,
(x, y) ∈ S
2
(16.16.18)
0
Proof We can now improve on the continuity result that we got earlier. First recall the leads to relation for the jump chain Y : For (x, y) ∈ S , x leads to y if Q (x, y) > 0 for some n ∈ N . So by definition, x leads to x for each x ∈ S , and for (x, y) ∈ S with x ≠ y , x leads to y if and only if the discrete-time chain starting in x eventually reaches y with positive probability. 2
n
2
For (x, y) ∈ S , 2
1. t ↦ P (x, y) is continuous. 2. If x leads to y then P (x, y) > 0 for every t ∈ (0, ∞) . 3. If x does not lead to y then P (x, y) = 0 for every t ∈ (0, ∞) . t
t
t
Proof Parts (b) and (c) are known as the Lévy dichotomy, named for Paul Lévy. It's possible to prove the Lévy dichotomy just from the semigroup property of P , but this proof is considerably more complicated. In light of the dichotomy, the leads to relation clearly makes sense for the continuous-time chain X as well as the discrete-time embedded chain Y .
The Generator Matrix Definition and Basic Properties
In this discussion, we assume again that X = {X : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = { P : t ∈ [0, ∞)} , exponential parameter function λ and one-step transition matrix Q for the embedded jump chain. The fundamental integral equation above now implies that the transition probability matrix P is differentiable in t . The derivative at 0 is particularly important. t
t
t
The matrix function t ↦ P has a (right) derivative at 0: t
16.16.3
https://stats.libretexts.org/@go/page/10389
Pt − I
→ G as t ↓ 0
(16.16.25)
t
where the infinitesimal generator matrix G is given by G(x, y) = −λ(x)I (x, y) + λ(x)Q(x, y) for (x, y) ∈ S . 2
Proof Note that
for every x ∈ S , since λ(x) = 0 is x is absorbing, while Q(x, x) = 0 if x is stable. So for x ∈ S , and G(x, y) = λ(x)Q(x, y) for (x, y) ∈ S with y ≠ x . Thus, the generator matrix G determines the exponential parameter function λ and the jump transition matrix Q, and thus determines the distribution of the Markov chain X. λ(x)Q(x, x) = 0
2
G(x, x) = −λ(x)
Given the generator matrix G of X, 1. λ(x) = −G(x, x) for x ∈ S 2. Q(x, y) = −G(x, y)/G(x, x) if x ∈ S is stable and y ∈ S − {x} The infinitesimal generator has a nice interpretation in terms of our discussion in the last section. Recall that when the chain first enters a stable state x, we set independent, exponentially distributed “timers” on (x, y), for each y ∈ S − {x} . Note that G(x, y) is the exponential parameter for the timer on (x, y). As soon as an alarm sounds for a particular (x, y), the chain moves to state y and the process continues. The generator matrix G satisfies the following properties for every x ∈ S : 1. G(x, x) ≤ 0 2. ∑ G(x, y) = 0 y∈S
The matrix function Explicitly,
is differentiable on
t ↦ Pt
, and satisfies the Kolmogorov backward equation:
[0, ∞)
′
P (x, y) = −λ(x)Pt (x, y) + ∑ λ(x)Q(x, z)Pt (z, y), t
(x, y) ∈ S
2
′
Pt = GPt
.
(16.16.27)
z∈S
Proof The backward equation is named for Andrei Kolmogorov. In continuous time, the transition semigroup P = {P : t ∈ [0, ∞)} can be obtained from the single, generator matrix G in a way that is reminiscent of the fact that in discrete time, the transition semigroup P = {P : n ∈ N} can be obtained from the single, one-step matrix P . From a modeling point of view, we often start with the generator matrix G and then solve the the backward equation, subject to the initial condition P = I , to obtain the semigroup of transition matrices P . t
n
0
As with any matrix on S , the generator matrix G defines left and right operations on functions that are analogous to ordinary matrix multiplication. The right operation is defined for functions in B . If f
∈ B
then Gf is given by Gf (x) = −λ(x)f (x) + ∑ λ(x)Q(x, y)f (y),
x ∈ S
(16.16.29)
y∈S
Proof But note that Gf is not in B unless λ ∈ B . Without this additional assumption, G is a linear operator from the vector space B of bounded functions from S to R into the vector space of all functions from S to R. We will return to this point in our next discussion. Uniform Transition Semigroups
We can obtain stronger results for the generator matrix if we impose stronger continuity assumptions on P . The transition semigroup P
= { Pt : t ∈ [0, ∞)}
is uniform if P
t (x,
x) → 1
as t ↓ 0 uniformly in x ∈ S .
If P is uniform, then the operator function t ↦ P is continuous on the vector space B . t
16.16.4
https://stats.libretexts.org/@go/page/10389
Proof As usual, we want to look at this new assumption from different points of view. The following are equivalent: 1. The transition semigroup P is uniform. 2. The exponential parameter function λ is bounded. 3. The generator matrix G defiens a bounded linear operator on B . Proof So when the equivalent conditions are satisfied, the Markov chain X = {X : t ∈ [0, ∞)} is also said to be uniform. As we will see in a later section, a uniform, continuous-time Markov chain can be constructed from a discrete-time Markov chain and an independent Poisson process. For a uniform transition semigroup, we have a companion to the backward equation. t
Suppose that Explicitly,
P
is a uniform transition semigroup. Then
satisfies the Kolmogorov forward equation
t ↦ Pt
′
P (x, y) = −λ(y)Pt (x, y) + ∑ Pt (x, z)λ(z)Q(z, y), t
(x, y) ∈ S
2
′
Pt = Pt G
.
(16.16.33)
z∈S
The backward equation holds with more generality than the forward equation, since we only need the transition semigroup P to be standard rather than uniform. It would seem that we need stronger conditions on λ for the forward equation to hold, for otherwise it's not even obvious that ∑ P (x, z)λ(z)Q(z, y) is finite for (x, y) ∈ S . On the other hand, the forward equation is sometimes easier to solve than the backward equation, and the assumption that λ is bounded is met in many applications (and of course holds automatically if S is finite). z∈S
t
As a simple corollary, the transition matrices and the generator matrix commute for a uniform semigroup: P G = GP for t ∈ [0, ∞). The forward and backward equations formally look like the differential equations for the exponential function. This actually holds with the operator exponential. t
Suppose again that P
= { Pt : t ∈ [0, ∞)}
t
is a uniform transition semigroup with generator G. Then ∞
Pt = e
tG
n
t
n
=∑ n=0
G ,
t ∈ [0, ∞)
(16.16.34)
n!
Proof We can characterize the generators of uniform transition semigroups. We just need the minimal conditions that the diagonal entries are nonpositive and the row sums are 0. Suppose that
G
a matrix on S with ∥G∥ < ∞ . Then if and only if for every x ∈ S ,
is the generator of a uniform transition semigroup
G
P = { Pt : t ∈ [0, ∞)}
1. G(x, x) ≤ 0 2. ∑ G(x, y) = 0 y∈S
Proof
Examples and Exercises The Two-State Chain
Let X = {X : t ∈ [0, ∞)} be the Markov chain on the set of states S = {0, 1}, with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0. This two-state Markov chain was studied in the previous section. To avoid the trivial case with both states absorbing, we will assume that a + b > 0 . t
The generator matrix is −a
a
b
−b
G=[
]
16.16.5
(16.16.38)
https://stats.libretexts.org/@go/page/10389
Show that for t ∈ [0, ∞), 1 Pt =
b
a
[ a+b
1 ]−
b
e
−(a+b)t
a+b
a
−a
a
b
−b
[
]
(16.16.39)
1. By solving the Kolmogorov backward equation. 2. By solving the Kolmogorov forward equation. 3. By computing P = e . tG
t
You probably noticed that the forward equation is easier to solve because there is less coupling of terms than in the backward equation. Define the probability density function f on S by f (0) = 1. P
t
→
1 a+b
b
a
b
a
[
]
b a+b
, f (1) =
a a+b
. Show that
as t → ∞ , the matrix with f in both rows.
2. f P = f for all t ∈ [0, ∞), so that f is invariant for P . 3. f G = 0 . t
Computational Exercises
Consider the Markov chain embedded transition matrix
X = { Xt : t ∈ [0, ∞)}
on
S = {0, 1, 2}
with exponential parameter function
1
1
2
2
Q =⎢ ⎢ 1
0
0 ⎥ ⎥
1
2
3
3
⎡
⎣
0
λ = (4, 1, 3)
and
⎤ (16.16.40)
0 ⎦
1. Draw the state graph and classify the states. 2. Find the generator matrix G. 3. Find the transition matrix P for t ∈ [0, ∞). 4. Find lim P . t
t→∞
t
Answer Special Models
Read the discussion of generator and transition matrices for chains subordinate to the Poisson process. Read the discussion of the infinitesimal generator for continuous-time birth-death chains. Read the discussion of the infinitesimal generator for continuous-time queuing chains. Read the discussion of the infinitesimal generator for continuous-time branching chains. This page titled 16.16: Transition Matrices and Generators of Continuous-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.16.6
https://stats.libretexts.org/@go/page/10389
16.17: Potential Matrices Prelimnaries This is the third of the introductory sections on continuous-time Markov chains. So our starting point is a time-homogeneous Markov chain X = {X : t ∈ [0, ∞)} defined on an underlying probability space (Ω, F , P) and with discrete state space (S, S ). Thus S is countable and S is the power set of S , so every subset of S is measurable, as is every function from S into another measurable space. In addition, S is given the discret topology so that S can also be thought of as the Borel σ-algebra. Every function from S to another topological space is continuous. Counting measure # is the natural measure on (S, S ), so in the context of the general introduction, integrals over S are simply sums. Also, kernels on S can be thought of as matrices, with rows and sums indexed by S , so the left and right kernel operations are generalizations of matrix multiplication. As before, let B denote the collection of bounded functions f : S → R . With the usual pointwise definitions of addition and scalar multiplication, B is a vector space. The supremum norm on B is given by t
∥f ∥ = sup{|f (x)| : x ∈ S},
f ∈ B
(16.17.1)
Of course, if S is finite, B is the set of all real-valued functions on S , and ∥f ∥ = max{|f (x)| : x ∈ S} for f ∈ B . The time space is ([0, ∞), T ) where as usual, T is the Borel σ-algebra on [0, ∞) corresponding to the standard Euclidean topology. Lebesgue measure is the natural measure on ([0, ∞), T ). In our first point of view, we studied X in terms of when and how the state changes. To review briefly, let τ = inf{t ∈ (0, ∞) : X ≠ X } . Assuming that X is right continuous, the Markov property of X implies the memoryless property of τ , and hence the distribution of τ given X = x is exponential with parameter λ(x) ∈ [0, ∞) for each x ∈ S . The assumption of right continuity rules out the pathological possibility that λ(x) = ∞ , which would mean that x is an instantaneous state so that P(τ = 0 ∣ X = x) = 1 . On the other hand, if λ(x) ∈ (0, ∞) then x is a stable state, so that τ has a proper exponential distribution given X = x with P(0 < τ < ∞ ∣ X = x) = 1 . Finally, if λ(x) = 0 then x is an absorbing state, so that P(τ = ∞ ∣ X = x) = 1 . Next we define a sequence of stopping times: First τ = 0 and τ = τ . Recursively, if τ < ∞ then τ = inf {t > τ : X ≠ X } , while if τ = ∞ then τ = ∞ . With M = sup{n ∈ N : τ < ∞} we define Y = X if n ∈ N with n ≤ M and Y = Y if n ∈ N with n > M . The sequence Y = (Y , Y , …) is a discrete-time Markov chain on S with one-step transition matrix Q given by Q(x, y) = P(X = y ∣ X = x) if x, y ∈ S with x stable, and Q(x, x) = 1 if x ∈ S is absorbing. Assuming that X is regular, which means that τ → ∞ as n → ∞ with probability 1 (ruling out the explosion event of infinitely many transitions in finite time), the structure of X is completely determined by the sequence of stopping times τ = (τ , τ , …) and the embedded discrete-time jump chain Y = (Y , Y , …) . Analytically, the distribution X is determined by the exponential parameter function λ and the one-step transition matrix Q of the jump chain. t
0
0
0
0
0
0
n
n
0
t
τn
n
n
1
n+1
n
n
M
0
τ
n
τn
1
0
n
0
1
0
In our second point of view, we studied t ∈ [0, ∞),
X
1
in terms of the collection of transition matrices
Pt (x, y) = P(Xt = y ∣ X0 = x),
(x, y) ∈ S
P = { Pt : t ∈ [0, ∞)}
, where for
2
(16.17.2)
The Markov and time-homogeneous properties imply the Chapman-Kolmogorov equations P P = P for s, t ∈ [0, ∞) , so that P is a semigroup of transition matrices. The semigroup P , along with the initial distribution of X , completely determines the distribution of X. For a regular Markov chain X, the fundamental integral equation connecting the two points of view is s
t
s+t 0
t
Pt (x, y) = I (x, y)e
−λ(x)t
+∫
λ(x)e
−λ(x)s
Q Pt−s (x, y) ds,
(x, y) ∈ S
2
(16.17.3)
0
which is obtained by conditioning on τ and X . It then follows that the matrix function t ↦ P is differentiable, with the derivative satisfying the Kolmogorov backward equation P = GP where the generator matrix G is given by τ
t
′
t
t
G(x, y) = −λ(x)I (x, y) + λ(x)Q(x, y),
(x, y) ∈ S
2
(16.17.4)
If the exponential parameter function λ is bounded, then the transition semigroup P is uniform, which leads to stronger results. The generator G is a bounded operator on B , the backward equation holds as well as a companion forward equation P = P G , as operators on B (so with respect to the supremum norm rather than just pointwise). Finally, we can represent the transition matrix as an exponential: P = e for t ∈ [0, ∞). ′
t
t
tG
t
16.17.1
https://stats.libretexts.org/@go/page/10390
In this section, we study the Markov chain X in terms of a family of matrices known as potential matrices. This is the least intuitive of the three points of view, but analytically one of the best approaches. Essentially, the potential matrices are transforms of the transition matrices.
Basic Theory We assume again that X = {X : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = {P Our first discussion closely parallels the general theory, except for simplifications caused by the discrete state space. t
t
: t ∈ [0, ∞)}
.
Definitions and Properties For α ∈ [0, ∞), the α -potential matrix U of X is defined as follows: α
∞
Uα (x, y) = ∫
e
−αt
Pt (x, y) dt,
(x, y) ∈ S
2
(16.17.5)
0
1. The special case U = U is simply the potential matrix of X. 2. For (x. y) ∈ S , U (x, y) is the expected amount of time that X spends in y , starting at x. 3. The family of matrices U = {U : α ∈ (0, ∞)} is known as the reolvent of X. 0
2
α
Proof It's quite possible that U (x, y) = ∞ for some (x, y) ∈ S , and knowing when this is the case is of considerable interest. If f : S → R and α ≥ 0 , then giving the right operation in its many forms, 2
∞
Uα f (x) = ∑ Uα (x, y)f (y) = ∫
e
−αt
Pt f (x) dt
0
y∈S ∞
=∫
∞
e
0
−αt
∑ Pt (x, y)f (y) = ∫
e
−αt
E[f (Xt ) ∣ X0 = x] dt,
x ∈ S
0
y∈S
assuming, as always, that the sums and integrals make sense. This will be the case in particular if f is nonnegative (although ∞ is a possible value), or as we will now see, if f ∈ B and α > 0 . If α > 0 , then U
α (x,
S) =
1 α
for all x ∈ S .
Proof It follows that for α ∈ (0, ∞) , the right potential operator U is a bounded, linear operator on B with ∥U that αU is a probability matrix. This matrix has a nice interpretation.
α∥
α
=
1 α
. It also follows
α
If α > 0 then α U (x, ⋅) is the conditional probability density function of and has the exponential distribution on [0, ∞) with parameter α . α
XT
given
X0 = x
, where
T
is independent of
X
Proof So αU is a transition probability matrix, just as P is a transition probability matrix, but corresponding to the random time T (with α ∈ (0, ∞) as a parameter), rather than the deterministic time t ∈ [0, ∞). The potential matrix can also be interpreted in economic terms. Suppose that we receive money at a rate of one unit per unit time whenever the process X is in a particular state y ∈ S . Then U (x, y) is the expected total amount of money that we receive, starting in state x ∈ S . But money that we receive later is of less value to us now than money that we will receive sooner. Specifically, suppose that one monetary unit at time t ∈ [0, ∞) has a present value of e where α ∈ (0, ∞) is the inflation factor or discount factor. Then U (x, y) is the total, expected, discounted amount that we receive, starting in x ∈ S . A bit more generally, suppose that f ∈ B and that f (y) is the reward (or cost, depending on the sign) per unit time that we receive when the process is in state y ∈ S . Then U f (x) is the expected, total, discounted reward, starting in state x ∈ S . α
t
−αt
α
α
α Uα → I
as α → ∞ .
Proof If f
: S → [0, ∞)
, then giving the left potential operation in its various forms,
16.17.2
https://stats.libretexts.org/@go/page/10390
∞
f Uα (y) = ∑ f (x)Uα (x, y) = ∫
e
−αt
f Pt (y) dt
0
x∈S ∞
=∫
∞
e
−αt
[ ∑ f (x)Pt (x, y)] dt = ∫
0
e
−αt
0
x∈S
[ ∑ f (x)P(Xt = y)] dt,
y ∈ S
x∈S
In particular, suppose that α > 0 and that f is the probability density function of X . Then f P is the probability density function of X for t ∈ [0, ∞), and hence from the last result, αf U is the probability density function of X , where again, T is independent of X and has the exponential distribution on [0, ∞) with parameter α . The family of potential kernels gives the same information as the family of transition kernels. 0
t
t
α
The resolvent U
= { Uα : α ∈ (0, ∞)}
T
completely determines the family of transition kernels P
= { Pt : t ∈ (0, ∞)}
.
Proof Although not as intuitive from a probability view point, the potential matrices are in some ways nicer than the transition matrices because of additional smoothness. In particular, the resolvent {U : α ∈ [0, ∞)} , along with the initial distribution, completely determine the finite dimensional distributions of the Markov chain X. The potential matrices commute with the transition matrices and with each other. α
Suppose that α, 1. P 2. U
t Uα
. Then
β, t ∈ [0, ∞)
= Uα Pt = ∫
∞
0
α Uβ = Uβ Uα = ∫
e
∞
0
−αs
∫
∞
0
Ps+t ds e
−αs
e
−βt
Ps+t ds dt
Proof The equations above are matrix equations, and so hold pointwise. The same identities hold for the right operators on the space B under the additional restriction that α > 0 and β > 0 . The fundamental equation that relates the potential kernels, known as the resolvent equation, is given in the next theorem: If α,
β ∈ [0, ∞)
with α ≤ β then U
α
= Uβ + (β − α)Uα Uβ
.
Proof The equation above is a matrix equation, and so holds pointwise. The same identity holds for the right potential operators on the space B , under the additional restriction that α > 0 .
Connections with the Generator Once again, assume that X = {X : t ∈ [0, ∞)} is a regular Markov chain on S with transition semigroup P = { P : t ∈ [0, ∞)} , infinitesimal generator G, resolvent U = { U : α ∈ (0, ∞)} , exponential parameter function λ , and onestep transition matrix Q for the jump chain. There are fundamental connections between the potential U and the generator matrix G, and hence between U and the function λ and the matrix Q. t
t
α
α
α
If α ∈ (0, ∞) then I + GU
α
= α Uα
. In terms of λ and Q, λ(x)
1 Uα (x, y) =
I (x, y) + α + λ(x)
Q Uα (x, y),
(x, y) ∈ S
2
(16.17.16)
α + λ(x)
Proof 1 Proof 2 Proof 3 As before, we can get stronger results if we assume that λ is bounded, or equivalently, the transition semigroup P is uniform. Suppose that λ is bounded and α ∈ (0, ∞) . Then as operators on B (and hence also as matrices), 1. I + GU = α U 2. I + U G = α U α
α
α α
Proof
16.17.3
https://stats.libretexts.org/@go/page/10390
As matrices, the equation in (a) holds with more generality than the equation in (b), much as the Kolmogorov backward equation holds with more generality than the forward equation. Note that Uα G(x, y) = ∑ Uα (x, z)G(z, y) = −λ(y)Uα (x, y) + ∑ Uα (x, z)λ(z)Q(z, y), z∈S
(x, y) ∈ S
2
(16.17.26)
z∈S
If λ is unbounded, it's not clear that the second sum is finite. Suppose that λ is bounded and α ∈ (0, ∞) . Then as operators on B (and hence also as matrices), 1. U = (αI − G) 2. G = αI − U
−1
α
−1 α
Proof So the potential operator U and the generator G have a simple, elegant inverse relationship. Of course, these results hold in particular if S is finite, so that all of the various matrices really are matrices in the elementary sense. α
Examples and Exercises The Two-State Chain Let X = {X : t ∈ [0, ∞)} be the Markov chain on the set of states S = {0, 1}, with transition rate a ∈ [0, ∞) from 0 to 1 and transition rate b ∈ [0, ∞) from 1 to 0. To avoid the trivial case with both states absorbing, we will assume that a + b > 0 . The first two results below are a review from the previous two sections. t
The generator matrix G is −a
a
b
−b
G=[
]
(16.17.27)
The transition matrix at time t ∈ [0, ∞) is 1 Pt =
b
a
b
a
[ a+b
1 ]−
e
−(a+b)t
−a
a
b
−b
[
],
a+b
t ∈ [0, ∞)
(16.17.28)
Now we can find the potential matrix in two ways. For α ∈ (0, ∞) , show that the potential matrix U is α
1 Uα =
1. From the definition. 2. From the relation U
α
−1
= (αI − G)
b
a
b
a
[ α(a + b)
1 ]−
−a
a
b
−b
[ (α + a + b)(a + b)
]
(16.17.29)
.
Computational Exercises Consider the Markov chain jump transition matrix
X = { Xt : t ∈ [0, ∞)}
on
S = {0, 1, 2}
⎡
0
1
with exponential parameter function
and
1
⎤
2
2
Q =⎢ 1 ⎢
0
0 ⎥ ⎥
1
2
3
3
0 ⎦
⎣
λ = (4, 1, 3)
(16.17.30)
1. Draw the state graph and classify the states. 2. Find the generator matrix G. 3. Find the potential matrix U for α ∈ (0, ∞) . α
Answer
16.17.4
https://stats.libretexts.org/@go/page/10390
Special Models Read the discussion of potential matrices for chains subordinate to the Poisson process. This page titled 16.17: Potential Matrices is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.17.5
https://stats.libretexts.org/@go/page/10390
16.18: Stationary and Limting Distributions of Continuous-Time Chains In this section, we study the limiting behavior of continuous-time Markov chains by focusing on two interrelated ideas: invariant (or stationary) distributions and limiting distributions. In some ways, the limiting behavior of continuous-time chains is simpler than the limiting behavior of discrete-time chains, in part because the complications caused by periodicity in the discrete-time case do not occur in the continuous-time case. Nonetheless as we will see, the limiting behavior of a continuous-time chain is closely related to the limiting behavior of the embedded, discrete-time jump chain.
Review Once again, our starting point is a time-homogeneous, continuous-time Markov chain X = {X : t ∈ [0, ∞)} defined on an underlying probability space (Ω, F , P) and with discrete state space (S, S ). By definition, this means that S is countable with the discrete topology, so that S is the σ-algebra of all subsets of S . t
Let's review what we have so far. We assume that the Markov chain X is regular. Among other things, this means that the basic structure of X is determined by the transition times τ = (τ , τ , τ , …) and the jump chain Y = (Y , Y , Y , …). First, τ = 0 and τ = τ = inf{t > 0 : X ≠ X } . The time-homogeneous and Markov properties imply that the distribution of τ given X = x is exponential with parameter λ(x) ∈ [0, ∞). Part of regularity is that X is right continuous so that there are no instantaneous states where λ(x) = ∞ , which would mean P(τ = 0 ∣ X = x) = 1 . On the other hand, λ(x) ∈ (0, ∞) means that x is a stable state so that τ has a proper exponential distribution given X = x , with P(0 < τ < ∞ ∣ X = x) = 1 . Finally, λ(x) = 0 means that x is an absorbing state so that P(τ = ∞ ∣ X = x) = 1 . The remaining transition times are defined recursively: τ = inf {t > τ : X ≠ X } if τ < ∞ and τ = ∞ if τ = ∞ . Another component of regularity is that with probability 1, τ → ∞ as n → ∞ , ruling out the explosion event of infinitely many jumps in finite time. The jump chain Y is formed by sampling X at the transition times (until the chain is sucked into an absorbing state, if that happens). That is, with M = sup{n : τ < ∞} and for n ∈ N , we define Y = X if n ≤ M and Y = X if n > M . Then Y is a discrete-time Markov chain with one-step transition matrix Q given Q(x, y) = P(X = y ∣ X = x) if (x, y) ∈ S with x stable and Q(x, x) = 1 if x ∈ S is absorbing. 0
1
t
1
2
0
1
2
0
0
0
0
0
0
0
n+1
n
t
τn
n
n+1
n
n
n
n
τn
n
τM
2
τ
0
The transition matrix P at time t ∈ [0, ∞) is given by P (x, y) = P(X = y ∣ X = x) for (x, y) ∈ S . The time-homogenous and Markov properties imply that the collection of transition matrices P = {P : t ∈ [0, ∞)} satisfies the Chapman-Kolmogorov equations P P = P for s, t ∈ [0, ∞) , and hence is a semigroup. of transition matrices The transition semigroup P and the initial distribution of X determine all of the finite-dimensional distributions of X. Since there are no instantaneous states, P is standard which means that P → I as t ↓ 0 (as matrices, and so pointwise). The fundamental relationship between P on the one hand, and λ and Q on the other, is t
t
t
2
0
t
s
t
s+t
0
t
t
Pt (x, y) = I (x, y)e
−λ(x)t
+∫
λ(x)e
−λ(x)s
Q Pt−s (x, y) ds,
(x, y) ∈ S
2
(16.18.1)
0
From this, it follows that the matrix function t ↦ P is differentiable (again, pointwise) and satisfies the Kolmogorov backward equation P = GP , where the infinitesimal generator matrix G is given by G(x, y) = −λ(x)I (x, y) + λ(x)Q(x, y) for (x, y) ∈ S . If we impose the stronger assumption that P is uniform, which means that P → I as t ↓ 0 as operators on B (so with respect to the supremum norm), then the backward equation as well as the companion Kolmogorov forward equation P == P G hold as operators on B . In addition, we have the matrix exponential representation P = e for t ∈ [0, ∞). The uniform assumption is equivalent to the exponential parameter function being bounded. t
d
dt
t
t
2
t
d
dt
tG
t
t
t
∞
Finally, for α ∈ [0, ∞), the α potential matrix U of X is U = ∫ e P dt . The resolvent U = {U : α ∈ (0, ∞)} is the Laplace transform of P and hence gives the same information as P . From this point of view, the time-homogeneous and Markov properties lead to the resolvent equation U = U + (β − α)U U for α, β ∈ (0, ∞) with α ≤ β . For α ∈ (0, ∞) , the α potential matrix is related to the generator by the fundamental equation α U = I + GU . If P is uniform, then this equation, as well as the companion α U = I + U G hold as operators on B , which leads to U = (αI − G) . α
α
α
β
−αt
t
0
α
α
β
α
α
−1
α
α
α
Basic Theory
16.18.1
https://stats.libretexts.org/@go/page/10391
Relations and Classification We start our discussion with relations among states and classifications of states. These are the same ones that we studied for discrete-time chains in our study of recurrence and transience, applied here to the jump chain Y . But as we will see, the relations and classifications make sense for the continuous-time chain X as well. The discussion is complicated slightly when there are absorbing states. Only when X is in an absorbing state can we not interpret the values of Y as the values of X at the transition times (because of course, there are no transitions when X is in an absorbing state). But x ∈ S is absorbing for the continuous-time chain X if and only if x is absorbing for the jump chain Y , so this trivial exception is easily handled. For y ∈ S let ρ = inf{n ∈ N : Y = y} , the (discrete) hitting time to y for the jump chain Y , where as usual, inf(∅) = ∞ . That is, ρ is the first positive (discrete) time that Y in in state y . The analogous random time for the continuous-time chain X is τ , where naturally we take τ = ∞ . This is the first time that X is in state y , not counting the possible initial period in y . Specifically, suppose X = x . If x ≠ y then τ = inf{t > 0 : X = y} . If x = y then τ = inf{t > τ : X = y} . y
+
n
y
ρ
y
∞
0
ρy
t
ρy
1
t
Define the hitting matrix H by H (x, y) = P(ρy < ∞ ∣ Y0 = x),
Then H (x, y) = P (τ
ρ
y
< ∞ ∣ X0 = x)
(x, y) ∈ S
2
(16.18.2)
except when x is absorbing and y = x .
So for the continuous-time chain, if x ∈ S is stable then H (x, x) is the probability that, starting in x, the chain X returns to x after its initial period in x. If x, y ∈ S are distinct, then H (x, y) is simply the probability that X, starting in x, eventually reaches y . It follows that the basic relation among states makes sense for either the continuous-time chain X as well as its jump chain Y . Define the relation → on S by x → y if x = y or H (x, y) > 0. 2
The leads to relation → is reflexive by definition: x → x for every x ∈ S . From our previous study of discrete-time chains, we know it's also transitive: if x → y and y → z then x → z for x, y, z ∈ S . We also know that x → y if and only if there is a directed path in the state graph from x to y , if and only if Q (x, y) > 0 for some n ∈ N . For the continuous-time transition matrices, we have a stronger result that in turn makes a stronger case that the leads to relation is fundamental for X. n
Suppose (x, y) ∈ S . 2
1. If x → y then P 2. If x ↛ y then P
t (x,
y) > 0
t (x,
y) = 0
for all t ∈ (0, ∞) . for all t ∈ (0, ∞) .
Proof This result is known as the Lévy dichotomy, and is named for Paul Lévy. Let's recall a couple of other definitions: Suppose that A is a nonempty subset of S . 1. A is closed if x ∈ A and x → y imply y ∈ A . 2. A is irreducible if A is closed and has no proper closed subset. If S is irreducible, we also say that the chain X itself is irreducible. If A is a nonempty subset of is called the closure of A .
S
, then cl(A) = {y ∈ S : x → y for some x ∈ A} is the smallest closed set containing
A
, and
Suppose that A ⊆ S is closed. Then 1. P , the restriction of P to A × A , is a transition probability matrix on A for every t ∈ [0, ∞). 2. X restricted to A is a continuous-time Markov chain with transition semigroup P = {P : t ∈ [0, ∞)} . A
t
t
A
A
t
Proof Define the relation ↔ on S by x ↔ y if x → y and y → x for (x, y) ∈ S . 2
16.18.2
https://stats.libretexts.org/@go/page/10391
The to and from relation ↔ defines an equivalence relation on S and hence partitions S into mutually disjoint equivalence classes. Recall from our study of discrete-time chains that a closed set is not necessarily an equivalence class, nor is an equivalence class necessarily closed. However, an irreducible set is an equivalence class, but an equivalence class may not be irreducible. The importance of the relation ↔ stems from the fact that many important properties of Markov chains (in discrete or continuous time) turn out to be class properties, shared by all states in an equivalence class. The following definition is fundamental, and once again, makes sense for either the continuous-time chain X or its jump chain Y . Let x ∈ S . 1. State x is transient if H (x, x) < 1 2. State x is recurrent if H (x, x) = 1. Recall from our study of discrete-time chains that if x is recurrent and x → y then y is recurrent and y → x . Thus, recurrence and transience are class properties, shared by all states in an equivalence class.
Time Spent in a State For x ∈ S , let N denote the number of visits to state x by the jump chain Y , and let the continuous-time chain X. Thus x
∞
Tx
denote the total time spent in state x by
∞
Nx = ∑ 1(Yn = x),
Tx = ∫
1(Xt = x) dt
(16.18.3)
0
n=0
The expected values R(x, y) = E(N ∣ Y = x) and U (x, y) = E(T ∣ X = x) for (x, y) ∈ S define the potential matrices of Y and X, respectively. From our previous study of discrete-time chains, we know the distribution and mean of N given Y = x in terms of the hitting matrix H . The next two results give a review: 2
y
0
y
0
y
Suppose that x, 1. P(N 2. P(N
y ∈ S
0
are distinct. Then n−1
y
= n ∣ Y0 = y) = H
(y, y)[1 − H (y, y)]
y
= 0 ∣ Y0 = x) = 1 − H (x, y)
and P(N
y
for n ∈ N
+
= n ∣ Y0 = x) = H (x, y)H
n−1
(y, y)[1 − H (y, y)]
for n ∈ N
+
Let's take cases. First suppose that y is recurrent. In part (a), P(N = n ∣ Y = y) = 0 for all n ∈ N , and consequently P(N = ∞ ∣ Y = y) = 1 . In part (b), P(N = n ∣ Y = x) = 0 for n ∈ N , and consequently P(N = 0 ∣ Y = x) = 1 − H (x, y) while P(N = ∞ ∣ Y = x) = H (x, y) . Suppose next that y is transient. Part (a) specifies a proper geometric distribution on N while in part (b), probability 1 − H (x, y) is assigned to 0 and the remaining probability H (x, y) is geometrically distributed over N as in (a). In both cases, N is finite with probability 1. Next we consider the expected value, that is, the (discrete) potential. To state the results succinctly we will use the convention that a/0 = ∞ if a > 0 and 0/0 = 0 . y
y
0
y
y
0
y
0
0
+
+
0
+
+
Suppose again that x,
y ∈ S
y
are distinct. Then
1. R(y, y) = 1/[1 − H (y, y)] 2. R(x, y) = H (x, y)/[1 − H (y, y)] Let's take cases again. If y ∈ S is recurrent then R(y, y) = ∞ , and for x ∈ S with x ≠ y , either R(x, y) = ∞ if x → y or R(x, y) = 0 if x ↛ y . If y ∈ S is transient, R(y, y) is finite, as is R(x, y) for every x ∈ S with x ≠ y . Moreover, there is an inverse relationship of sorts between the potential and the hitting probabilities. Naturally, our next goal is to find analogous results for the continuous-time chain X. For the distribution of T it's best to use the right distribution function. y
Suppose that x, 1. P(T 2. P(T
y ∈ S
are distinct. Then for t ∈ [0, ∞)
y
> t ∣ X0 = y) = exp{−λ(y)[1 − H (y, y)]t}
y
> t ∣ X0 = x) = H (x, y) exp{−λ(y)[1 − H (y, y)]t}
Proof
16.18.3
https://stats.libretexts.org/@go/page/10391
Let's take cases as before. Suppose first that y is recurrent. In part (a), P(T > t ∣ X = y) = 1 for every t ∈ [0, ∞) and hence P(T = ∞ ∣ X = y) = 1 . In part (b), P(T > t ∣ X = x) = H (x, y) for every t ∈ [0, ∞) and consequently P(T = 0 ∣ X = x) = 1 − H (x, y) while P(T = ∞ ∣ X = x) = H (x, y) . Suppose next that y is transient. From part (a), the distribution of T given X = y is exponential with parameter λ(y)[1 − H (y, y)] . In part (b), the distribution assigns probability 1 − H (x, y) to 0 while the remaining probability H (x, y) is exponentially distributed over (0, ∞) as in (a). Taking expected value, we get a very nice relationship between the potential matrix U of the continuous-time chain X and the potential matrix R of the discrete-time jump chain Y : y
y y
0
y
0
y
y
0
0
0
0
For every (x, y) ∈ S , 2
R(x, y) U (x, y) =
(16.18.6) λ(y)
Proof In particular, y ∈ S is transient if and only if R(x, y) < ∞ for every x ∈ S , if and only if U (x, y) < ∞ for every other hand, y is recurrent if and only if R(x, y) = U (x, y) = ∞ if x → y and R(x, y) = U (x, y) = 0 if x ↛ y .
x ∈ S
. On the
Null and Positive Recurrence Unlike transience and recurrence, the definitions of null and positive recurrence of a state x ∈ S are different for the continuoustime chain X and its jump chain Y . This is because these definitions depend on the expected hitting time to x, starting in x, and not just the finiteness of this hitting time. For x ∈ S , let ν (x) = E(ρ ∣ Y = x) , the expected (discrete) return time to x starting in x. Recall that x is positive recurrent for Y if ν (x) < ∞ and x is null recurrent if x is recurrent but not positive recurrent, so that H (x, x) = 1 but ν (x) = ∞ . The definitions are similar for X, but using the continuous hitting time τ . x
0
ρx
For x ∈ S , let μ(x) = 0 if x is absorbing and μ(x) = E (τ return time to x starting in x (after the initial period in x).
ρx
∣ X0 = x)
if x is stable. So if
x
is stable,
μ(x)
is the expected
1. State x is positive recurrent for X if μ(x) < ∞ . 2. State x is null recurrent for X if x recurrent but not positive recurrent, so that H (x, x) = 1 but μ(x) = ∞ . A state x ∈ S can be positive recurrent for X but null recurrent for its jump chain Y or can be null recurrent for X but positive recurrent for Y . But like transience and recurrence, positive and null recurrence are class properties, shared by all states in an equivalence class under the to and from equivalence relation ↔.
Invariant Functions Our next discussion concerns functions that are invariant for the transition matrix Q of the jump chain Y and functions that are invariant for the transition semigroup P = {P : t ∈ [0, ∞)} of the continuous-time chain X. For both discrete-time and continuous-time chains, there is a close relationship between invariant functions and the limiting behavior in time. t
First let's recall the definitions. A function f : S → [0, ∞) is invariant for Q (or for the chain Y ) if f Q = f . It then follows that f Q = f for every n ∈ N . In continuous time we must assume invariance at each time. That is, a function f : S → [0, ∞) is invariant for P (or for the chain X) if f P = f for all t ∈ [0, ∞). Our interest is in nonnegative functions, because we can think of such a function as the density function, with respect to counting measure, of a positive measure on S . We are particularly interested in the special case that f is a probability density function, so that ∑ f (x) = 1 . If Y has a probability density function f that is invariant for Q, then Y has probability density function f for all n ∈ N and hence Y is stationary. Similarly, if X has a probability density function f that is invariant for P then X has probability density function f for every t ∈ [0, ∞) and once again, the chain X is stationary. n
t
x∈S
0
n
0
t
Our first result shows that there is a one-to-one correspondence between invariant functions for generator G. Suppose f
: S → [0, ∞)
Q
and zero functions for the
. Then f G = 0 if and only if (λf )Q = λf , so that λf is invariant for Q.
Proof If our chain X has no absorbing states, then f
: S → [0, ∞)
is invariant for Q if and only if (f /λ)G = 0 .
16.18.4
https://stats.libretexts.org/@go/page/10391
Suppose that f
: S → [0, ∞)
. Then f is invariant for P if and only if f G = 0 .
Proof 1 Proof 2 So putting the two main results together we see that f is invariant for the continuous-time chain X if and only if λf is invariant for the jump chain Y . Our next result shows how functions that are invariant for X are related to the resolvent U = {U : α ∈ (0, ∞)} . To appreciate the result, recall that for α ∈ (0, ∞) the matrix αU is a probability matrix, and in fact α U (x, ⋅) is the conditional probability density function of X , given X = x , where T is independent of X and has the exponential distribution with parameter α . So αU is a transition matrix just as P is a transition matrix, but corresponding to the exponentially distributed random time T with parameter α ∈ (0, ∞) rather than the deterministic time t ∈ [0, ∞). α
α
α
T
α
Suppose that f fG = 0.
: S → [0, ∞)
0
t
. If f G = 0 then f (α U
α)
=f
for α ∈ (0, ∞) . Conversely if f (α U
α)
=f
for α ∈ (0, ∞) then
Proof So extending our summary, f : S → [0, ∞) is invariant for the transition semigroup P = {P : t ∈ [0, ∞)} if and only if λf is invariant for jump transition matrices {Q : n ∈ N} if and only if f G = 0 if and only if f is invariant for the collection of probability matrices {α U : α ∈ (0, ∞)} . From our knowledge of the theory for discrete-time chains, we now have the following fundamental result: t
n
α
Suppose that X is irreducible and recurrent. 1. There exists g : S → (0, ∞) that is invariant for X. 2. If f is invariant for X, then f = cg for some constant c ∈ [0, ∞). Proof Invariant functions have a nice interpretation in terms of occupation times, an interpretation that parallels the discrete case. The potential gives the expected total time in a state, starting in another state, but here we need to consider the expected time in a state during a cycle that starts and ends in another state. For x ∈ S , define the function γ by x
τρ
x
γx (y) = E ( ∫ 0
so that γ
x (y)
∣ 1(Xs = y) ds ∣ X0 = x) , ∣
y ∈ S
(16.18.13)
is the expected occupation time in state y before the first return to x, starting in x.
Suppose again that X is irreducible and recurrent. For x ∈ S , 1. γ : S → (0, ∞) 2. γ is invariant for X 3. γ (x) = 1/λ(x) 4. μ(x) = ∑ γ (y) x x x
y∈S
x
Proof So now we have some additional insight into positive and null recurrence for the continuous-time chain X and the associated jump chain Y . Suppose again that the chains are irreducible and recurrent. There exist g : S → (0, ∞) that is invariant for Y , and then g/λ is invariant for X. The invariant functions are unique up to multiplication by positive constants. The jump chain Y is positive recurrent if and only if ∑ g(x) < ∞ while the continuous-time chain X is positive recurrent if and only if ∑ g(x)/λ(x) < ∞ . Note that if λ is bounded (which is equivalent to the transition semigroup P being uniform), then X is positive recurrent if and only if Y is positive recurrent. x∈S
x∈S
Suppose again that X is irreducible and recurrent. 1. If X is null recurrent then X does not have an invariant probability density function. 2. If X is positive recurrent then X has a unique, positive invariant probability density function. Proof
16.18.5
https://stats.libretexts.org/@go/page/10391
Limiting Behavior Our next discussion focuses on the limiting behavior of the transition semigroup simple corollary of the result above for potentials. If y ∈ S is transient, then P
t (x,
y) → 0
P = { Pt : t ∈ [0, ∞)}
. Our first result is a
as t → ∞ for every x ∈ S .
Proof So we should turn our attention to the recurrent states. The set of recurrent states partitions into equivalent classes under ↔, and each of these classes is irreducible. Hence we can assume without loss of generality that our continuous-time chain X = { X : t ∈ [0, ∞)} is irreducible and recurrent. To avoid trivialities, we will also assume that S has at least two states. Thus, there are no absorbing states and so λ(x) > 0 for x ∈ S . Here is the main result. t
Suppose that X = {X : t ∈ [0, ∞)} is irreducible and recurrent. Then independently of x ∈ S . The function f is invariant for X and t
f (y) = limt→∞ Pt (x, y)
exists for each
y ∈ S
,
γx (y) f (y) =
,
y ∈ S
(16.18.17)
μ(x)
1. If X is null recurrent then f (y) = 0 for all y ∈ S . 2. If X is positive recurrent then f (y) > 0 for all y ∈ S and ∑
y∈S
f (y) = 1
.
Proof sketch The limiting function f can be computed in a number of ways. First we find a function g : S → (0, ∞) that is invariant for X. We can do this by solving gPt = g
for t ∈ (0, ∞)
gG = 0
for α ∈ (0, ∞) and then g = h/λ
g(α Uα ) = g hQ = h
The function g is unique up to multiplication by positive constants. If ∑ and so f is simply g normalized:
x∈S
g(x) < ∞
, then we are in the positive recurrent case
g(y) f (y) =
,
y ∈ S
(16.18.19)
∑x∈S g(x)
The following result is known as the ergodic theorem for continuous-time Markov chains. It can also be thought of as a strong law of large numbers for continuous-time Markov chains. Suppose that X = {X f . If h : S → R then
t
: t ∈ [0, ∞)}
is irreducible and positive recurrent, with (unique) invariant probability density function t
1 ∫ t
h(Xs )ds → ∑ f (x)h(x) as t → ∞
0
(16.18.20)
x∈S
with probability 1, assuming that the sum on the right converges absolutely. Notes Note that no assumptions are made about X , so the limit is independent of the initial state. By now, this should come as no surprise. After a long period of time, the Markov chain X “forgets” about the initial state. Note also that ∑ f (x)h(x) is the expected value of h , thought of as a random variable on S with probability measure defined by f . On the other hand, ∫ h(X )ds is the average of the time function s ↦ h(X ) on the interval [0, t]. So the ergodic theorem states that the limiting time average on the left is the same as the spatial average on the right. 0
x∈S
1 t
t
0
s
s
Applications and Exercises
16.18.6
https://stats.libretexts.org/@go/page/10391
The Two-State Chain The continuous-time, two-state chain has been studied in the last several sections. The following result puts the pieces together and completes the picture. Consider the continuous-time Markov chain X = {X : t ∈ [0, ∞)} on S = {0, 1} with transition rate a ∈ (0, ∞) from 0 to 1 and transition rate b ∈ (0, ∞) from 1 to 0. Give each of the following t
1. The transition matrix Q for Y at n ∈ N . 2. The infinitesimal generator G. 3. The transition matrix P for X at t ∈ [0, ∞). 4. The invariant probability density function for Y . 5. The invariant probability density function for X. 6. The limiting behavior of Q as n → ∞ . 7. The limiting behavior of P as t → ∞ . n
t
n
t
Answer
Computational Exercises The following continuous-time chain has also been studied in the previous three sections. Consider the Markov chain jump transition matrix
X = { Xt : t ∈ [0, ∞)}
on
S = {0, 1, 2}
⎡ 0
1
with exponential parameter function
and
1
⎤
2
2
Q =⎢ 1 ⎢
0
0 ⎥ ⎥
1
2
3
3
⎣
λ = (4, 1, 3)
(16.18.21)
0 ⎦
1. Recall the generator matrix G. 2. Find the invariant probability density function f for Y by solving f Q = f . 3. Find the invariant probability density function f for X by solving f G = 0 . 4. Verify that λf is a multiple of f . 5. Describe the limiting behavior of Q as n → ∞ . 6. Describe the limiting behavior of P as t → ∞ . 7. Verify the result in (f) by recalling the transition matrix P for X at t ∈ [0, ∞). c
d
d
c
c
d
d
n
t
t
Answer
Special Models Read the discussion of stationary and limiting distributions for chains subordinate to the Poisson process. Read the discussion of stationary and limiting distributions for continuous-time birth-death chains. Read the discussion of classification and limiting distributions for continuous-time queuing chains. This page titled 16.18: Stationary and Limting Distributions of Continuous-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.18.7
https://stats.libretexts.org/@go/page/10391
16.19: Time Reversal in Continuous-Time Chains Earlier, we studied time reversal of discrete-time Markov chains. In continous time, the issues are basically the same. First, the Markov property stated in the form that the past and future are independent given the present, essentially treats the past and future symmetrically. However, there is a lack of symmetry in the fact that in the usual formulation, we have an initial time 0, but not a terminal time. If we introduce a terminal time, then we can run the process backwards in time. In this section, we are interested in the following questions: Is the new process still Markov? If so, how are the various parameters of the reversed Markov chain related to those of the original chain? Under what conditions are the forward and backward Markov chains stochastically the same? Consideration of these questions leads to reversed chains, an important and interesting part of the theory of continuous-time Markov chains. As always, we are also interested in the relationship between properties of a continuous-time chain and the corresponding properties of its discrete-time jump chain. In this section we will see that there are simple and elegant connections between the time reversal of a continuous-time chain and the time-reversal of the jump chain.
Basic Theory Reversed Chains Our starting point is a (homogeneous) continuous-time Markov chain X = {X : t ∈ [0, ∞)} with (countable ) state space S . We will assume that X is irreducible, so that every state in S leads to every other state, and to avoid trivialities, we will assume that there are at least two states. The irreducibility assumption involves no serious loss of generality since otherwise we could simply restrict our attention to an irreducible equivalence class of states. With our usual notation, we will let P = {P : t ∈ [0, ∞)} denote the semigroup of transition matrices of X and G the infinitesimal generator. Let λ(x) denote the exponential parameter for the holding time in state x ∈ S and Q the transition matrix for the discrete-time jump chain Y = (Y , Y , …). Finally, let U = {U : α ∈ [0, ∞)} denote the collection of potential matrices of X. We will assume that the chain X is regular, which gives us the following properties: t
t
0
1
α
as t ↓ 0 for x ∈ S . There are no instantaneous states, so λ(x) < ∞ for x ∈ S . The transition times (τ , τ , …) satisfy τ → ∞ as n → ∞ . We may assume that the chain X is right continuous and has left limits. Pt (x, x) → 1
1
2
n
The assumption of regularity rules out various types of weird behavior that, while mathematically possible, are usually not appropriate in applications. If X is uniform, a stronger assumption than regularity, we have the following additional properties: as t ↓ 0 uniformly in x ∈ S . is bounded. P =e for t ∈ [0, ∞). U = (αI − G) for α ∈ (0, ∞) . Pt (x, x) → 1 λ
tG
t
−1
α
Now let h ∈ (0, ∞) . We will think of h as the terminal time or time horizon so the chains in our first discussion will be defined on the time interval [0, h]. Notationally, we won't bother to indicate the dependence on h , since ultimately the time horizon won't matter. ^ Define X =X for t ∈ [0, h]. Thus, the process forward in time is X = {X : t ∈ [0, h]} while the process backwards in time is t
h−t
t
^ ^ X = { X t : t ∈ [0, h]} = { Xh−t : t ∈ [0, h]}
(16.19.1)
Similarly let ^ ^ F t = σ{ X s : s ∈ [0, t]} = σ{ Xh−s : s ∈ [0, t]} = σ{ Xr : r ∈ [h − t, h]},
t ∈ [0, h]
(16.19.2)
^ So F^ is the σ-algebra of events of the process X up to time t , which of course, is also the σ-algebra of events of X from time h − t forward. Our first result is that the chain reversed in time is still Markov t
^ ^ The process X = { X : t ∈ [0, h]} is a Markov chain, but is not time homogeneous in general. For s, transition matrix from s to t is t
16.19.1
t ∈ [0, h]
with s < t , the
https://stats.libretexts.org/@go/page/10392
^ P s,t (x, y) =
P(Xh−t = y)
Pt−s (y, x),
(x, y) ∈ S
2
(16.19.3)
P(Xh−s = x)
Proof However, the backwards chain will be time homogeneous if X has an invariant distribution. 0
Suppose that X is positive recurrent, with (unique) invariant probability density function f . If X has the invariant distribution, ^ then X is a time-homogeneous Markov chain. The transition matrix at time t ∈ [0, ∞) (for every terminal time h ≥ t ), is given by 0
^ P t (x, y) =
f (y) f (x)
Pt (y, x),
(x, y) ∈ S
2
(16.19.6)
Proof The previous result holds in the limit of the terminal time, regardless of the initial distribution. Suppose again that X is positive recurrent, with (unique) invariant probability density function f . Regardless of the distribution of X , 0
f (y)
^ ^ P(X s+t = y ∣ X s = x) →
Pt (y, x) as h → ∞
(16.19.7)
f (x)
Proof These three results are motivation for the definition that follows. We can generalize by defining the reversal of an irreducible Markov chain, as long as there is a positive, invariant function. Recall that a positive invariant function defines a positive measure on S , but of course not in general a probability measure. Suppose that
is invariant for X. The reversal of with transition semigroup P^ defined by
g : S → (0, ∞)
^ ^ X = { X t : t ∈ [0, ∞)}
^ P t (x, y) =
g(y) Pt (y, x),
with respect to
X
(x, y) ∈ S
2
g
is the Markov chain
, t ∈ [0, ∞)
(16.19.8)
g(x)
Justification Recall that if g is a positive invariant function for same reversed chain. So let's consider the cases:
X
then so is
cg
for every constant
c ∈ (0, ∞)
. Note that
g
and
cg
generate the
Suppose again that X is a Markov chain satisfying the assumptions above. 1. If X is recurrent, then X always has a positive invariant function g , unique up to multiplication by positive constants. Hence the reversal of a recurrent chain X always exists and is unique, and so we can refer to the reversal of X without reference to the invariant function. 2. Even better, if X is positive recurrent, then there exists a unique invariant probability density function, and the reversal of X can be interpreted as the time reversal (relative to a time horizon) when X has the invariant distribution, as in the motivating result above. 3. If X is transient, then there may or may not exist a positive invariant function, and if one does exist, it may not be unique (up to multiplication by positive constants). So a transient chain may have no reversals or more than one. Nonetheless, the general definition is natural, because most of the important properties of the reversed chain follow from the basic balance equation relating the transition semigroups P and P^, and the invariant function g : ^ g(x)P t (x, y) = g(y)Pt (y, x),
(x, y) ∈ S
2
, t ∈ [0, ∞)
(16.19.10)
We will see the balance equation repeated for other objects associated with the Markov chains. ^ Suppose again that g : S → (0, ∞) is invariant for X, and that X is the time reversal of X with respect to g . Then ^. 1. g is also invariant for X
16.19.2
https://stats.libretexts.org/@go/page/10392
^ 2. X is the time reversal of X with respect to g .
Proof In the balance equation for the transition semigroups, it's not really necessary to know a-priori that the function know the two transition semigroups. Suppose that only if
g : S → (0, ∞)
. Then
g
is invariant and the Markov chains
^ (x, y) = g(y)P (y, x), g(x)P t t
X
and
(x, y) ∈ S
2
^ X
g
is invariant, if we
are time reversals with respect to
, t ∈ [0, ∞)
g
if and
(16.19.12)
Proof Here is a slightly more complicated (but equivalent) version of the balance equation for the transition probabilities. ^ Suppose again that g : S → (0, ∞) . Then g is invariant and the chains X and X are time reversals with respect to g if and only if ^ ^ ^ g(x1 )P t1 (x1 , x2 )P t2 (x2 , x3 ) ⋯ P tn (xn , xn+1 ) = g(xn+1 )Ptn (xn+1 , xn )Ptn−1 (xn , xn−1 ) ⋯ Pt1 (x2 , x1 )
for all n ∈ N , (t +
1,
n
t2 , … , tn ) ∈ [0, ∞)
, and (x
1,
x2 , … , xn+1 ) ∈ S
n+1
(16.19.14)
.
Proof The balance equation holds for the potenetial matrices. ^ are time reversals with respect to g if and only Suppose again that g : S → (0, ∞) . Then g is invariant and the chains X and X if the potential matrices satisfy ^ g(x)U α (x, y) = g(y)Uα (y, x),
(x, y) ∈ S
2
, α ∈ [0, ∞)
(16.19.17)
Proof As a corollary, continuous-time chains that are time reversals are of the same type. ^ ^ If X and X are time reversals, then X and X are of the same type: transient, null recurrent, or positive recurrent.
Proof The balance equation extends to the infinitesimal generator matrices. Suppose again that g : S → (0, ∞) . Then infinitesimal generators satisfy
g
is invariant and the Markov chains
^ g(x)G(x, y) = g(y)G(y, x),
X
and
(x, y) ∈ S
2
^ X
are time reversals if and only if the
(16.19.19)
Proof This leads to further results and connections: ^ Suppose again that g : S → (0, ∞) . Then g is invariant and X and X are time reversals with respect to g if and only if ^ 1. X and X have the same exponential parmeter function λ . 2. The jump chains Y and Y^ are (discrete) time reversals with respect to λg.
Proof In our original discussion of time reversal in the positive recurrent case, we could have argued that the previous results must be true. If we run the positive recurrent chain X = {X : t ∈ [0, h]} backwards in time to obtain the time reversed chain ^ ^ ^ ^ ^ X = { X : t ∈ [0, h]} , then the exponential parameters for X must the be same as those for X, and the jump chain Y for X must be the time reversal of the jump chain Y for X. t
t
16.19.3
https://stats.libretexts.org/@go/page/10392
Reversible Chains Clearly an interesting special case is when the time reversal of a continuous-time Markov chain is stochastically the same as the original chain. Once again, we assume that we have a regular Markov chain X = {X : t ∈ [0, ∞)} that is irreducible on the state space S , with transition semigroup P = {P : t ∈ [0, ∞)} . As before, U = {U : α ∈ [0, ∞)} denotes the collection of potential matrices, and G the infinitesimal generator. Finally, λ denotes the exponential parameter function, Y = {Y : n ∈ N} the jump chain, and Q the transition matrix of Y . Here is the definition of reversibility: t
t
α
n
Suppose that
is invariant for X. Then X is reversible with respect to also has transition semigroup P . That is,
g : S → (0, ∞)
^ ^ X = { X t : t ∈ [0, ∞)}
g(x)Pt (x, y) = g(y)Pt (y, x),
(x, y) ∈ S
2
g
if the time reversed chain
, t ∈ [0, ∞)
(16.19.21)
Clearly if X is reversible with respect to g then X is reversible with respect to cg for every c ∈ (0, ∞). So here is another review of the cases: Suppose that X is a Markov chain satisfying the assumptions above. 1. If X is recurrent, then there exists an invariant function g : S → (0, ∞) that is unique up to multiplication by positive constants. So X is either reversible or not, and we do not have to reference the invariant function. 2. Even better, if X is positive recurrent, then there exists a unique invariant probability density function f . Again, X is either reversible or not, but if it is, then with the invariant distribution, the chain X is stochastically the same, forward in time or backward in time. 3. If X is transient, then a positive invariant function may or may not exist. If such a function does exist, it may not be unique, up to multiplication by positive constants. So in the transient case, X may be reversible with respect to one invariant function but not with respect to others. The following results are corollaries of the results above for time reversals. First, we don't need to know a priori that the function g is invariant. Suppose that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to g if and only if g(x)Pt (x, y) = g(y)Pt (y, x),
(x, y) ∈ S
2
, t ∈ [0, ∞)
(16.19.22)
Suppose again that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to g if and only if g(x1 )Pt1 (x1 , x2 )Pt2 (x2 , x3 ) ⋯ Ptn (xn , xn+1 ) = g(xn+1 )Ptn (xn+1 , xn )Ptn−1 (xn , xn−1 ) ⋯ Pt1 (x2 , x1 )
for all n ∈ N , (t +
1,
n
t2 , … , tn ) ∈ [0, ∞)
, and (x
1,
x2 , … , xn+1 ) ∈ S
n+1
(16.19.23)
.
Suppose again that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to g if and only if g(x)Uα (x, y) = g(y)Uα (y, x),
(x, y) ∈ S
2
, α ∈ [0, ∞)
(16.19.24)
Suppose again that g : S → (0, ∞) . Then g is invariant and X is reversible with respect to g if and only if g(x)G(x, y) = g(y)G(y, x),
Suppose again that respect to λg.
g : S → (0, ∞)
. Then
g
is invariant and
X
(x, y) ∈ S
2
(16.19.25)
is reversible if and only if the jump chain
is reversible with
Y
Recall that X is recurrent if and only if the jump chain Y is recurrent. In this case, the invariant functions for X and Y exist and are unique up to positive constants. So in this case, the previous theorem states that X is reversible if and only if Y is reversible. In the positive recurrent case (the most important case), the following theorem gives a condition for reversibility that does not directly reference the invariant distribution. The condition is known as the Kolmogorov cycle condition, and is named for Andrei Kolmogorov Suppose that X is positive recurrent. Then X is reversible if and only if for every sequence of distinct states (x
1,
G(x1 , x2 )G(x2 , x3 ) ⋯ G(xn−1 , xn )G(xn , x1 ) = G(x1 , xn )G(xn , xn−1 ) ⋯ G(x3 , x2 )G(x2 , x1 )
16.19.4
,
x2 , … , xn )
(16.19.26)
https://stats.libretexts.org/@go/page/10392
Proof Note that the Kolmogorov cycle condition states that the transition rate of visiting states (x , x , … , x , x ) in sequence, starting in state x is the same as the transition rate of visiting states (x , x , … , x , x ) in sequence, starting in state x . The cycle condition is also known as the balance equation for cycles. 2
1
n
n−1
2
3
n
1
1
1
Figure 16.19.1: The Kolmogorov cycle condition
Applications and Exercises The Two-State Chain The continuous-time, two-state chain has been studied in our previous sections on continuous-time chains, so naturally we are interested in time reversal. Consider the continuous-time Markov chain X = {X : t ∈ [0, ∞)} on S = {0, 1} with transition rate and transition rate b ∈ (0, ∞) from 1 to 0. Show that X is reversible t
a ∈ (0, ∞)
from 0 to 1
1. Using the transition semigroup P = {P : t ∈ [0, ∞)} . 2. Using the resolvent U = {U : α ∈ (0, ∞)} . 3. Using the generator matrix G. t
α
Solutions
Computational Exercises The Markov chain in the following exercise has also been studied in previous sections. Consider the Markov chain jump transition matrix
X = { Xt : t ∈ [0, ∞)}
on
S = {0, 1, 2}
⎡ 0
1
with exponential parameter function
and
1
⎤
2
2
Q =⎢ 1 ⎢
0
0 ⎥ ⎥
1
2
3
3
⎣
λ = (4, 1, 3)
(16.19.29)
0 ⎦
^ Give each of the following for the time reversed chain X :
1. The state graph. 2. The semigroup of transition matrices P^ = {P^ : t ∈ [0, ∞)} . ^ ^ 3. The resolvent of potential matrices U = {U : α ∈ (0, ∞)} . ^ 4. The generator matrixG . 5. The transition matrix of the jump chain Y^ . t
α
Solutions
Special Models Read the discussion of time reversal for chains subordinate to the Poisson process. Read the discussion of time reversal for continuous-time birth-death chains. This page titled 16.19: Time Reversal in Continuous-Time Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.19.5
https://stats.libretexts.org/@go/page/10392
16.20: Chains Subordinate to the Poisson Process Basic Theory Introduction Recall that the standard Poisson process with rate parameter r ∈ (0, ∞) involves three interrelated stochastic processes. First the sequence of interarrival times T = (T , T , …) is independent, and each variable has the exponential distribution with parameter r . Next, the sequence of arrival times τ = (τ , τ , …) is the partial sum sequence associated with the interrival sequence T : 1
2
0
1
n
τn = ∑ Ti ,
n ∈ N
(16.20.1)
i=1
For
n ∈ N+
, the arrival time τ has the gamma distribution with parameters is defined by n
n
and r. Finally, the Poisson counting process
N = { Nt : t ∈ [0, ∞)}
Nt = max{n ∈ N : τn ≤ t},
t ∈ [0, ∞)
(16.20.2)
so that N is the number of arrivals in (0, t] for t ∈ [0, ∞). The counting variable N has the Poisson distribution with parameter rt for t ∈ [0, ∞). The counting process N and the arrival time process τ are inverses in the sense that τ ≤ t if and only if N ≥ n for t ∈ [0, ∞) and n ∈ N . The Poisson counting process can be viewed as a continuous-time Markov chain. t
t
n
t
Suppose that
takes values in N and is independent of N . Define X = X + N for t ∈ [0, ∞). Then is a continuous-time Markov chain on N with exponential parameter function given by λ(x) = r for x ∈ N and jump transition matrix Q given by Q(x, x + 1) = 1 for x ∈ S . X0
t
0
t
X = { Xt : t ∈ [0, ∞)}
Proof Note that the Poisson process, viewed as a Markov chain is a pure birth chain. Clearly we can generalize this continuous-time Markov chain in a simple way by allowing a general embedded jump chain. Suppose that X = {X : t ∈ [0, ∞)} is a Markov chain with (countable) state space S , and with constant exponential parameter λ(x) = r ∈ (0, ∞) for x ∈ S , and jump transition matrix Q. Then X is said to be subordinate to the Poisson process with rate parameter r. t
1. The transition times (τ , τ , …) are the arrival times of the Poisson process with rate r. 2. The inter-transition times (τ , τ − τ , …) are the inter-arrival times of the Poisson process with rate r (independent, and each with the exponential distribution with rate r). 3. N = {N : t ∈ [0, ∞)} is the Poisson counting process, where N is the number of transitions in (0, t] for t ∈ [0, ∞). 4. The Poisson process and the jump chain Y = (Y , Y , …) are independent, and X = Y for t ∈ [0, ∞). 1
2
1
2
1
t
t
0
1
t
Nt
Proof Since all states are stable, note that we must have Q(x, x) = 0 for x ∈ S . Note also that for x, y ∈ S with x ≠ y , the exponential rate parameter for the transition from x to y is μ(x, y) = rQ(x, y). Conversely suppose that μ : S → (0, ∞) satisfies μ(x, x) = 0 and ∑ μ(x, y) = r for every x ∈ S . Then the Markov chain with transition rates given by μ is subordinate to the Poisson process with rate r. It's easy to construct a Markov chain subordinate to the Poisson process. 2
y∈S
Suppose that N = {N : t ∈ [0, ∞)} is a Poisson counting process with rate r ∈ (0, ∞) and that Y = {Y : n ∈ N} is a discrete-time Markov chain on S , independent of N , whose transition matrix satisfies Q(x, x) = 0 for every x ∈ S . Let X =Y for t ∈ [0, ∞). Then X = {X : t ∈ [0, ∞)} is a continuous-time Markov chain subordinate to the Poisson process. t
t
Nt
n
t
Generator and Transition Matrices Next let's find the generator matrix and the transition semigroup. Suppose again that X = {X : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. As usual, let P = { P : t ∈ [0, ∞)} denote the transition semigroup and G the infinitesimal generator. t
t
16.20.1
https://stats.libretexts.org/@go/page/10393
The generator matrix G of X is G = r(Q − I ) . Hence for t ∈ [0, ∞) 1. The Kolmogorov backward equation is P = r(Q − I )P 2. The Kolmogorov forward equation is P = rP (Q − I ) ′
t
t
′
t
t
Proof There are several ways to find the transition semigroup P underlying Poisson process.
= { Pt : t ∈ [0, ∞)}
. The best way is a probabilistic argument using the
For t ∈ [0, ∞), the transition matrix P is given by t
∞
n
Pt = ∑ e
(rt)
−rt
n
Q
(16.20.3)
n!
n=0
Proof from the underlying Poisson process Proof using the generator matrix
Potential Matrices Next let's find the potential matrices. As with the transition matrices, we can do this in (at least) two different ways. Suppose again that X = {X : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. For α ∈ (0, ∞) , the potential matrix U of X is t
α
∞
1 Uα =
r
∑( α +r
n n
) Q
(16.20.5)
α +r
n=0
Proof from the definition Proof using the generator ∞
Recall that for p ∈ (0, 1), the p-potential matrix of the jump chain Y is R relationship between the potential matrix of X and the potential matrix of Y :
p
n
n
= ∑n=0 p Q
. Hence we have the following nice
1 Uα =
Rr/(α+r)
α +r
(16.20.9)
Next recall that α U (x, ⋅) is the probability density function of X given X = x , where T has the exponential distribution with parameter α and is independent of X. On the other hand, α U (x, ⋅) = (1 − p)R (x, ⋅) where p = r/(α + r) . We know from our study of discrete potentials that (1 − p)R (x, ⋅) is the probability density function of Y where M has the geometric distribution on N with parameter 1 − p and is independent of Y . But also X = Y . So it follows that if T has the exponential distribution with parameter α , N = {N : t ∈ [0, ∞)} is a Poisson process with rate r, and is independent of T , then N has the geometric distribution on N with parameter α/(α + r) . Of course, we could easily verify this directly, but it's still fun to see such connections. α
T
0
α
p
p
M
T
NT
t
T
Limiting Behavior and Stationary Distributions Once again, suppose that X = {X : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. Let Y = {Y : n ∈ N} denote the jump process. The limiting behavior and stationary distributions of X are closely related to those of Y . t
n
Suppose that X (and hence Y ) are irreducible and positive recurrent 1. g : S → (0, ∞) is invariant for X if and only if g is invariant for Y . 2. f is an invariant probability density function for X if and only if f is an invariant probability density function for Y . 3. X is null recurrent if and only if Y is null recurrent, and in this case, lim Q (x, y) = lim P (x, y) = 0 for (x, y) ∈ S . 4. X is positive recurrent if and only if Y is positive recurrent. If Y is aperiodic, then lim Q (x, y) = lim P (x, y) = f (y) for (x, y) ∈ S , where f is the invariant probability density function. n
n→∞
t→∞
t
2
n
n→∞
2
t→∞
t
Proof
16.20.2
https://stats.libretexts.org/@go/page/10393
Time Reversal Once again, suppose that X = {X : t ∈ [0, ∞)} is a continuous-time Markov chain on S subordinate to the Poisson process with rate r ∈ (0, ∞) and with jump transition matrix Q. Let Y = {Y : n ∈ N} denote the jump process. We assume that X (and hence Y ) are irreducible. The time reversal of X is closely related to that of Y . t
n
^ Suppose that g : S → (0, ∞) is invariant for X. The time reversal X with respect to g is also subordinate to the Poisson ^ ^ process with rate r. The jump chain Y of X is the (discrete) time reversal of Y with respect to g .
Proof In particular, X is reversible with respect to g if and only if Y is reversible with respect to g . As noted earlier, X and Y are of the same type: both transient or both null recurrent or both positive recurrent. In the recurrent case, there exists a positive invariant function that is unique up to multiplication by constants. In this case, the reversal of X is unique, and is the chain subordinate to the Poisson process with rate r whose jump chain is the reversal of Y .
Uniform Chains In the construction above for a Markov chain X = {X : t ∈ [0, ∞)} that is subordinate to the Poisson process with rate r and jump transition kernel Q, we assumed of course that Q(x, x) = 0 for every x ∈ S . So there are no absorbing states and the sequence (τ , τ , …) of arrival times of the Poisson process are the jump times of the chain X. However in our introduction to continuous-time chains, we saw that the general construction of a chain starting with the function λ and the transition matrix Q works without this assumption on Q, although the exponential parameters and transition probabilities change. The same idea works here. t
1
2
Suppose that N = {N : t ∈ [0, ∞)} is a counting Poisson process with rate r ∈ (0, ∞) and that Y = {Y : n ∈ N} is a discrete-time Markov chain with transition matrix Q on S × S satisfying Q(x, x) < 1 for x ∈ S . Assume also that N and Y are independent. Define X = Y for t ∈ [0, ∞). Then X = {X : t ∈ [0, ∞)} is a continuous-Markov chain with ~ exponential parameter function λ(x) = r[1 − Q(x, x)] for x ∈ S and jump transition matrix Q given by t
n
t
Nt
t
~ Q(x, y) =
Q(x, y) ,
(x, y) ∈ S
2
, x ≠y
(16.20.10)
1 − Q(x, x)
Proof The Markov chain constructed above is no longer a chain subordinate to the Poisson process by our definition above, since the exponential parameter function is not constant, and the transition times of X are no longer the arrival times of the Poisson process. Nonetheless, many of the basic results above still apply. Let X = {X
t
: t ∈ [0, ∞)}
be the Markov chain constructed in the previous theorem. Then
1. For t ∈ [0, ∞), the transition matrix P is given by t
∞
n
Pt = ∑ e
−rt
(rt)
n=0
n
Q
(16.20.11)
n!
2. For α ∈ (0, ∞) , the α potential matrix is given by ∞
1 Uα =
r
∑( α +r
n=0
n n
) Q
(16.20.12)
α +r
3. The generator matrix is G = r(Q − I ) 4. g : S → (0, ∞) is invariant for X if and only if g is invariant for Y . Proof It's a remarkable fact that every continuous-time Markov chain with bounded exponential parameters can be constructed as in the last theorem, a process known as uniformization. The name comes from the fact that in the construction, the exponential parameters become constant, but at the expense of allowing the embedded discrete-time chain to jump from a state back to that state. To review the definition, suppose that X = {X : t ∈ [0, ∞)} is a continuous-time Markov chain on S with transition semigroup t
16.20.3
https://stats.libretexts.org/@go/page/10393
, exponential parameter function uniformly in x, or equivalently if λ is bounded.
P = { Pt : t ∈ [0, ∞)} t ↓ 0
λ
and jump transition matrix
. Then
Q
P
is uniform if
Pt (x, x) → 1
as
Suppose that λ : S → (0, ∞) is bounded and that Q is a transition matrix on S with Q(x, x) = 0 for every x ∈ S . Let r ∈ (0, ∞) be an upper bound on λ and N = { N : t ∈ [0, ∞)} a Poisson counting process with rate r . Define the transition ^ matrix Q on S by t
^ Q(x, x) = 1 −
λ(x) x ∈ S r
^ Q(x, y) =
λ(x) Q(x, y)
(x, y) ∈ S
2
, x ≠y
r
and let Y
= { Yn : n ∈ N}
t
. Then X = {X transition matrix Q. t ∈ [0, ∞)
^ be a discrete-time Markov chain with transition matrix Q , independent of N . Define X = Y for : t ∈ [0, ∞)} is a continuous-time Markov chain with exponential parameter function λ and jump
t
Nt
Proof Note in particular that if the state space S is finite then of course λ is bounded so the previous theorem applies. The theorem is useful for simulating a continuous-time Markov chain, since the Poisson process and discrete-time chains are simple to simulate. In addition, we have nice representations for the transition matrices, potential matrices, and the generator matrix. Suppose that X = {X
t
: t ∈ [0, ∞}
is a continuous-time Markov chain on
and jump transition matrix Q. Define r and
λ : S → (0, ∞)
^ Q
S
with bounded exponential parameter function
as in the last theorem. Then
1. For t ∈ [0, ∞), the transition matrix P is given by t
∞
n
Pt = ∑ e
−rt
(rt)
n
^ Q
(16.20.14)
n!
n=0
2. For α ∈ (0, ∞) , the α potential matrix is given by ∞
1 Uα =
r
∑( α +r
n
n
^ ) Q
(16.20.15)
α +r
n=0
^ 3. The generator matrix is G = r(Q − I) ^ 4. g : S → (0, ∞) is invariant for X if and only if g is invariant for Q .
Proof
Examples The Two-State Chain The following exercise applies the uniformization method to the two-state chain. Consider the continuous-time Markov chain X = {X : t ∈ [0, ∞)} on S = {0, 1} with exponential parameter function λ = (a, b) , where a, b ∈ (0, ∞) . Thus, states 0 and 1 are stable and the jump chain has transition matrix t
0
1
1
0
Q =[
]
(16.20.16)
Let r = a + b , an upper bound on λ . Show that ^ 1. Q =
1 a+b
2. G = [ 3. P
t
4. U
α
[
a
b
a
−a
a
b
−b
]
]
^ = Q− =
b
1 α
1 a+b
^ Q−
e
−(a+b)t
G
1 (α+a+b)(a+b)
for t ∈ [0, ∞) G
for α ∈ (0, ∞)
16.20.4
https://stats.libretexts.org/@go/page/10393
Proof Although we have obtained all of these results for the two-state chain before, the derivation based on uniformization is the easiest. This page titled 16.20: Chains Subordinate to the Poisson Process is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.20.5
https://stats.libretexts.org/@go/page/10393
16.21: Continuous-Time Birth-Death Chains Basic Theory Introduction A continuous-time birth-death chain is a simple class of Markov chains on a subset of Z with the property that the only possible transitions are to increase the state by 1 (birth) or decrease the state by 1 (death). It's easiest to define the birth-death process in terms of the exponential transition rates, part of the basic structure of continuous-time Markov chains. Suppose that S is an integer interval (that is, a set of consecutive integers), either finite or infinite. The birth-death chain with birth rate function α : S → [0, ∞) and death rate function β : S → [0, ∞) is the Markov chain X = {X : t ∈ [0, ∞)} on S with transition rate α(x) from x to x + 1 and transition rate β(x) from x to x − 1 , for x ∈ S . t
If S has a minimum element m, then of course we must have β(m) = 0 . If α(m) = 0 also, then the boundary point m is absorbing. Similarly, if S has a maximum element n then we must have α(n) = 0 . If β(n) = 0 also then the boundary point n is absorbing. If x ∈ S is not a boundary point, then typically we have α(x) + β(x) > 0 , so that x is stable. If β(x) = 0 for all x ∈ S , then X is a pure birth process, and similarly if α(x) = 0 for all x ∈ S then X is a pure death process. From the transition rates, it's easy to compute the parameters of the exponential holding times in a state and the transition matrix of the embedded, discretetime jump chain. Consider again the birth-death chain X on S with birth rate function α and death rate function β. As usual, let exponential parameter function and Q the transition matrix for the jump chain.
λ
denote the
1. λ(x) = α(x) + β(x) for x ∈ S 2. If x ∈ S is stable, so that α(x) + β(x) > 0 , then α(x)
β(x)
Q(x, x + 1) =
,
Q(x, x − 1) =
α(x) + β(x)
Note that jump chain Y = (Y , Y follows: If x ∈ S is stable then 0
1,
…)
(16.21.1) α(x) + β(x)
is a discrete-time birth death chain. The probability functions p, q, and r of Y are given as α(x) p(x) = Q(x, x + 1) = α(x) + β(x) β(x) q(x) = Q(x, x − 1) = α(x) + β(x) r(x) = Q(x, x) = 0
If x is absorbing then of course p(x) = q(x) = 0 and r(x) = 1 . Except for the initial state, the jump chain Y is deterministic for a pure birth process, with Q(x, x) = 1 if x is absorbing and Q(x, x + 1) = 1 if x is stable. Similarly, except for the initial state, Y is deterministic for a pure death process, with Q(x, x) = 1 if x is absorbing and Q(x, x − 1) = 1 if x is stable. Note that the Poisson process with rate parameter r ∈ (0, ∞), viewed as a continuous-time Markov chain, is a pure birth process on N with birth function α(x) = r for each x ∈ N. More generally, a birth death process with λ(x) = α(x) + β(x) = r for all x ∈ S is also subordinate to the Poisson process with rate r. Note that λ is bounded if and only if α and β are bounded (always the case if S is finite), and in this case the birth-death chain X = { X : t ∈ [0, ∞)} is uniform. If λ is unbounded, then X may not even be regular, as an example below shows. Recall that a sufficient condition for X to be regular when S is infinite is t
1 ∑ x∈S+
1 = ∑
λ(x)
x∈S+
=∞
(16.21.2)
α(x) + β(x)
where S = {x ∈ S : λ(x) = α(x) + β(x) > 0} is the set of stable states. Except for the aforementioned example, we will restrict our study to regular birth-death chains. +
16.21.1
https://stats.libretexts.org/@go/page/10394
Infinitesimal Generator and Transition Matrices Suppose again that X = {X : t ∈ [0, ∞)} is a continuous-time birth-death chain on an interval S ⊆ Z with birth rate function α and death rate function β. As usual, we will let P denote the transition matrix at time t ∈ [0, ∞) and G the infinitesimal generator. As always, the infinitesimal generator gives the same information as the exponential parameter function and the jump transition matrix, but in a more compact and useful form. t
t
The generator matrix G is given by G(x, x) = −[α(x) + β(x)], G(x, x + 1) = α(x), G(x, x − 1) = β(x),
x ∈ S
(16.21.3)
Proof The Kolmogorov backward and forward equations are 1. 2.
d dt d dt
for (x, y) ∈ S . (x, y + 1) for (x, y) ∈ S 2
Pt (x, y) = −[α(x) + β(x)] Pt (x, y) + α(x)Pt (x + 1, y) + β(x)Pt (x − 1, y) Pt (x, y) = −[α(y) + β(y)] Pt (x, y) + α(y − 1)Pt (x, y − 1) + β(y + 1)Pt
2
Proof
Limiting Behavior and Stationary Distributions For our discussion of limiting behavior, we will consider first the important special case of a continuous-time birth-death chain X = { X : t ∈ [0, ∞)} on S = N and with α(x) > 0 for all x ∈ N and β(x) > 0 for all x ∈ N . For the jump chain Y = { Y : n ∈ N} , recall that t
+
n
α(x)
β(x)
p(x) = Q(x, x + 1) =
, q(x) = Q(x, x − 1) = α(x) + β(x)
,
x ∈ N
(16.21.4)
α(x) + β(x)
The jump chain Y is a discrete-time birth-death chain, and our notation here is consistent with the notation that we used in that section. Note that X and Y are irreducible. We first consider transience and recurrence. The chains X and Y are recurrent if and only if ∞
β(1) ⋯ β(x)
∑ x=0
=∞
(16.21.5)
α(1) ⋯ α(x)
Proof Next we consider positive recurrence and invariant distributions. It's nice to look at this from different points of view. The function g : N → (0, ∞) defined by α(0) ⋯ α(x − 1) g(x) =
,
x ∈ N
(16.21.8)
β(1) ⋯ β(x)
is invariant for X, and is the only invariant function, up to multiplication by constants. Hence X is positive recurrent if and only if B = ∑ g(x) < ∞ , in which case the (unique) invariant probability density function f is given by f (x) = g(x) for x ∈ N. Moreover, P (x, y) → f (y) as t → ∞ for every x, y ∈ N ∞
1
x=0
B
t
Proof using the jump chain Proof from the balance equation Here is a summary of the classification: For the continuous-time birth-death chain X, let ∞
x=0
∞
β(1) ⋯ β(x)
A =∑
α(0) ⋯ α(x − 1)
, B =∑ α(1) ⋯ α(x)
x=0
(16.21.13) β(1) ⋯ β(x)
1. X is transient if A < ∞ . 2. X is null recurrent if A = ∞ and B = ∞ . 3. X is positive recurrent if B < ∞ .
16.21.2
https://stats.libretexts.org/@go/page/10394
Suppose now that
and that X = {X : t ∈ [0, ∞)} is a continuous-time birth-death chain on the integer interval . We assume that α(x) > 0 for x ∈ {0, 1, … , n − 1} while β(x) > 0 for x ∈ {1, 2, … n}. Of course, we must have β(0) = α(n) = 0 . With these assumptions, X is irreducible, and since the state space is finite, positive recurrent. So all that remains is to find the invariant distribution. The result is essentially the same as when the state space is N. n ∈ N+
t
Nn = {0, 1, … , n}
The invariant probability density function f is given by n
n
α(0) ⋯ α(x − 1)
1 fn (x) =
Bn
α(0) ⋯ α(x − 1)
for x ∈ Nn where Bn = ∑
β(1) ⋯ β(x)
x=0
(16.21.14) β(1) ⋯ β(x)
Proof Note that B → B as n → ∞ , and if B < ∞ , f (x) → f (x) as n → ∞ for x ∈ N. We will see this type of behavior again. Results for the birth-death chain on N often converge to the corresponding results for the birth-death chain on N as n → ∞ . n
n
n
Absorption Often when the state space S = N , the state of a birth-death chain represents a population of individuals of some sort (and so the terms birth and death have their usual meanings). In this case state 0 is absorbing and means that the population is extinct. Specifically, suppose that X = {X : t ∈ [0, ∞)} is a regular birth-death chain on N with α(0) = β(0) = 0 and with α(x), β(x) > 0 for x ∈ N . Thus, state 0 is absorbing and all positive states lead to each other and to 0. Let T = min{t ∈ [0, ∞) : X = 0} denote the time until absorption, where as usual, min ∅ = ∞ . Many of the results concerning extinction of the continuous-time birth-death chain follow easily from corresponding results for the discrete-time birth-death jump chain. t
+
t
One of the following events will occur: 1. Population extinction: T < ∞ or equivalently, X = 0 for some s ∈ [0, ∞) and hence X 2. Population explosion: T = ∞ or equivalently X → ∞ as t → ∞ . s
t
=0
for all t ∈ [s, ∞) .
t
Proof Naturally we would like to find the probability of these complementary events, and happily we have already done so in our study of discrete-time birth-death chains. The absorption probability function v is defined by v(x) = P(T < ∞) = P(Xt = 0 for some t ∈ [0, ∞) ∣ X0 = x),
x ∈ N
(16.21.16)
As before, let ∞
β(1) ⋯ β(i)
A =∑ i=0
(16.21.17) α(1) ⋯ α(i)
1. If A = ∞ then v(x) = 1 for all x ∈ N. 2. If A < ∞ then ∞
1 v(x) =
β(1) ⋯ β(i)
∑ A
i=x
,
x ∈ N
(16.21.18)
α(1) ⋯ α(i)
Proof The mean time to extinction is considered next, so let m(x) = E(T ∣ X = x) for x ∈ N. Unlike the probability of extinction, computing the mean time to extinction cannot be easily reduced to the corresponding discrete-time computation. However, the method of computation does extend. 0
The mean absorption function is given by x
∞
α(j) ⋯ α(k)
m(x) = ∑ ∑ j=1 k=j−1
,
x ∈ N
(16.21.19)
β(j) ⋯ β(k + 1)
Probabilisitic Proof
16.21.3
https://stats.libretexts.org/@go/page/10394
In particular, note that ∞
α(1) ⋯ α(k)
m(1) = ∑ k=0
(16.21.23) β(1) ⋯ β(k + 1)
If m(1) = ∞ then m(x) = ∞ for all x ∈ S . If m(1) < ∞ then m(x) < ∞ for all x ∈ S Next we will consider a birth-death chain on a finite integer interval with both endpoints absorbing. Our interest is in the probability of absorption in one endpoint rather than the other, and in the mean time to absorption. Thus suppose that n ∈ N and that X = {X : t ∈ [0, ∞)} is a continuous-time birth-death chain on N = {0, 1, … , n} with α(0) = β(0) = 0 , α(n) = β(n) = 0 , and α(x) > 0 , β(x) > 0 for x ∈ {1, 2, … , n − 1}. So the endpoints 0 and n are absorbing, and all other states lead to each other and to the endpoints. Let T = inf{t ∈ [0, ∞) : X ∈ {0, n}} , the time until absorption, and for x ∈ S let v (x) = P(X = 0 ∣ X = x) and m (x) = E(T ∣ X = x) . The definitions make sense since T is finite with probability 1. +
t
n
t
n
T
0
n
0
The absorption probability function for state 0 is given by 1 vn (x) =
n−1
An
i=x
n−1
β(1) ⋯ β(i)
∑
β(1) ⋯ β(i)
for x ∈ Nn where An = ∑ α(1) ⋯ α(i)
i=0
(16.21.24) α(1) ⋯ α(i)
Proof Note that A → A as n → ∞ where A is the constant above for the absorption probability at 0 with the infinite state space N. If A < ∞ then v (x) → v(x) as n → ∞ for x ∈ N . n
n
Time Reversal Essentially, every irreducible continuous-time birth-death chain is reversible. Suppose that X = {X : t ∈ [0, ∞)} is a positive recurrent birth-death chain on an integer interval S ⊆ Z with birth rate function α : S → [0, ∞) and death rate function β : S → ∞ . Assume that α(x) > 0 , except at the maximum value of S , if there is one, and similarly that β(x) > 0, except at the minimum value of X, if there is one. Then X is reversible. t
Proof In the important special case of a birth-death chain on N, we can verify the balance equations directly. Suppose that X = {X : t ∈ [0, ∞)} is a continuous-time birth-death chain on x ∈ N and death rate β(x) > 0 for all x ∈ N . Then X is reversible. t
S =N
and with birth rate
α(x) > 0
for all
+
Proof In the positive recurrent case, it follows that the birth-death chain is stochastically the same, forward or backward in time, if the chain has the invariant distribution.
Examples and Special Cases Regular and Irregular Chains Our first exercise gives two pure birth chains, each with an unbounded exponential parameter function. One is regular and one is irregular. Consider the pure birth process X = {X
t
: t ∈ [0, ∞)}
on N with birth rate function α . +
1. If α(x) = x for x ∈ N , then X is not regular. 2. If α(x) = x for x ∈ N , then X is regular. 2
+
+
Proof
Constant Birth and Death Rates Our next examples consider birth-death chains with constant birth and death rates, except perhaps at the endpoints. Note that such chains will be regular since the exponential parameter function λ is bounded.
16.21.4
https://stats.libretexts.org/@go/page/10394
Suppose that X = {X : t ∈ [0, ∞)} is the birth-death chain on death rate β ∈ (0, ∞) on N .
N
t
, with constant birth rate
α ∈ (0, ∞)
on
N
and constant
+
1. X is transient if β < α . 2. X is null recurrent if β = α . 3. X is positive recurrent if β > α . The invariant distribution is the geometric distribution on N with parameter α/β α f (x) = (1 −
x
α )(
β
) ,
x ∈ N
(16.21.28)
β
Proof Next we consider the chain with 0 absorbing. As in the general discussion above, let v denote the function that gives the probability of absorption and m the function that gives the mean time to absorption. Suppose that X = {X : t ∈ [0, ∞)} is the birth-death chain in N with constant birth rate α ∈ (0, ∞) on N , constant death reate β ∈ (0, ∞) on N , and with 0 absorbing. Then t
+
+
1. If β ≥ α then v(x) = 1 for x ∈ N. If β < α then v(x) = (β/α) for x ∈ N. 2. If α ≥ β then m(x) = ∞ . If α < β then m(x) = x/(β − α) for x ∈ N. x
Next let's look at chains on a finite state space. Let n ∈ N
and define N
+
Suppose that
.
= {0, 1, … , n}
n
is a continuous-time birth-death chain on N with constant birth rate α ∈ (0, ∞) on and constant death rate β ∈ (0, ∞) on {1, 2, … n}. The invariant probability density function f is given as
X = { Xt : t ∈ [0, ∞)}
{0, 1, … , n − 1}
n
n
follows: 1. If α ≠ β then x
(α/β ) (1 − α/β) fn (x) =
,
n+1
x ∈ Nn
(16.21.30)
1 − (α/β)
2. If α = β then f
n (x)
= 1/(n + 1)
for x ∈ N
n
Note that when α = β , the invariant distribution is uniform on N . Our final exercise considers the absorption probability at 0 when both endpoints are absorbing. Let v denote the function that gives the probability of absorption into 0, rather than n . n
n
Suppose that X = {X : t ∈ [0, ∞)} is the birth-death chain on {1, 2, … , n − 1}, and with 0 and n absorbing. t
Nn
with constant birth rate
α
and constant death rate
β
on
1. If α ≠ β then x
(β/α )
n
− (β/α )
vn (x) =
n
,
x ∈ Nn
(16.21.31)
1 − (β/α)
2. If α = β then v
n (x)
= (n − x)/n
for x ∈ N . n
Linear Birth and Death Rates For our next discussion, consider individuals that act identically and independently. Each individual splits into two at exponential rate a ∈ (0, ∞) and dies at exponential rate b ∈ (0, ∞). Let X denote the population at time t ∈ [0, ∞). Then X = {X : t ∈ [0, ∞)} is a regular, continuous-time birth-death chain with birth and death rate functions given by α(x) = ax and β(x) = bx for x ∈ N. t
t
Proof Note that 0 is absorbing since the population is extinct, so as usual, our interest is in the probability of absorption and the mean time to absorption as functions of the initial state. The probability of absorption is the same as for the chain with constant birth and death rates discussed above. The absorption probability function v is given as follows:
16.21.5
https://stats.libretexts.org/@go/page/10394
1. v(x) = 1 for all x ∈ N if b ≥ a . 2. v(x) = (b/a) for x ∈ N if b < a . x
Proof The mean time to absorption is more interesting. The mean time to absorption function m is given as follows: 1. If a ≥ b then m(x) = ∞ for x ∈ N . 2. If a < b then +
x
a/b
j−1
b
m(x) = ∑
a
j=1
j−1
u
∫
j
du,
x ∈ N
(16.21.34)
1 −u
0
Proof For small values of x ∈ N, the integrals in the case a < b can be done by elementary methods. For example, 1 m(1) = −
a ln(1 −
a
) b
1
b
m(2) = m(1) −
− a 1
m(3)
a ln(1 −
2
)
a
= m(2) −
b 2
b −
2a
2
b −
a
3
a ln(1 −
a
) b
However, a general formula requires the introduction of a special function that is not much more helpful than the integrals themselves. The Markov chain X is actually an example of a branching chain. We will revisit this chain in that section.
Linear Birth and Death with Immigration We continue our previous discussion but generalizing a bit. Suppose again that we have individuals that act identically and independently. An individual splits into two at exponential rate a ∈ [0, ∞) and dies at exponential rate b ∈ [0, ∞). Additionally, new individuals enter the population at exponential rate c ∈ [0, ∞). This is the immigration effect, and when c = 0 we have the birth-death chain in the previous discussion. Let X denote the population at time t ∈ [0, ∞). Then X = {X : t ∈ [0, ∞)} is a regular, continuous-time birth-death chain with birth and death rate functions given by α(x) = ax + c and β(x) = bx for x ∈ N. t
t
Proof The infinitesimal matrix G is given as follows, for x ∈ N: 1. G(x, x) = −[(a + b)x + c] 2. G(x, x + 1) = ax + c 3. G(x, x − 1) = bx The backward and forward equations are given as follows, for (x, y) ∈ N and t ∈ (0, ∞) 2
1. 2.
d dt d dt
Pt (x, y) = −[(a + b)x + c] Pt (x, y) + (ax + c)Pt (x + 1, y) + bx Pt (x − 1, y) Pt (x, y) = −[(a + b)y + c] Pt (x, y) + [a(y − 1) + c] Pt (x, y − 1) + b(y + 1)Pt (x, y + 1
We can use the forward equation to find the expected population size. Let M
t (x)
For t ∈ [0, ∞) and x ∈ N, the mean population size M
t (x)
1. If a = b then M 2. If a ≠ b then
t (x)
= ct + x
)
= E(Xt , ∣ X0 = x)
for t ∈ [0, ∞) and x ∈ N.
is given as follows:
. c Mt (x) =
[e
(a−b)t
− 1] + x e
(a−b)t
(16.21.40)
a−b
Proof
16.21.6
https://stats.libretexts.org/@go/page/10394
Note that b > a , so that the individual death rate exceeds the birth rate, then M (x) → c/(b − a) as t → ∞ for x ∈ N. If a ≥ b so that the birth rate equals or exceeds the death rate, then M (x) → ∞ as t → ∞ for x ∈ N . t
t
+
Next we will consider the special case with no births, but only death and immigration. In this case, the invariant distribution is easy to compute, and is one of our favorites. Suppose that a = 0 and that b,
c >0
. Then X is positive recurrent. The invariant distribution is Poisson with parameter c/b: x
f (x) = e
−c/b
(c/b)
,
x ∈ N
(16.21.42)
x!
Proof
The Logistics Chain Consider a population that fluctuates between a minimum value m ∈ N and a maximum value n ∈ N , where of course, m < n . Given the population size, the individuals act independently and identically. Specifically, if the population is x ∈ {m, m + 1, … , n} then an individual splits in two at exponential rate a(n − x) and dies at exponential rate b(x − m) , where a, b ∈ (0, ∞) . Thus, an individual's birth rate decreases linearly with the population size from a(n − m) to 0 while the death rate increases linearly with the population size from 0 to b(n − m) . These assumptions lead to the following definition. +
The continuous-time birth-death chain death rate function β given by
X = { Xt : t ∈ [0, ∞)}
on
+
S = {m, m + 1, … , n}
α(x) = ax(n − x), β(x) = bx(x − m),
with birth rate function
x ∈ S
α
and
(16.21.45)
is the logistic chain on S with parameters a and b . Justification Note that the logistics chain is a stochastic counterpart of the logistics differential equation, which typically has the form dx = c(x − m)(n − x)
(16.21.46)
dt
where m, n, c ∈ (0, ∞) and m < n . Starting in x(0) ∈ (m, n), the solution remains in (m, n) for all t ∈ [0, ∞). Of course, the logistics differential equation models a system that is continuous in time and space, whereas the logistics Markov chain models a system that is continuous in time and discrete is space. For the logistics chain 1. The exponential parameter function λ is given by λ(x) = ax(n − x) + bx(x − m),
x ∈ S
(16.21.47)
2. The transition matrix Q of the jump chain is given by b(x − m)
a(n − x)
Q(x, x − 1) =
, Q(x, x + 1) = a(n − x) + b(x − m)
,
x ∈ S
(16.21.48)
a(n − x) + b(x − m)
In particular, m and n are reflecting boundary points, and so the chain is irreducible. The generator matrix G for the logistics chain is given as follows, for x ∈ S : 1. G(x, x) = −x[a(n − x) + b(x − m)] 2. G(x, x − 1) = bx(x − m) 3. G(x, x + 1) = ax(n − x) Since S is finite, X is positive recurrent. The invariant distribution is given next. Define g : S → (0, ∞) by 1 g(x) =
n−m (
x
a )(
x −m
x−m
)
,
x ∈ S
(16.21.49)
b
16.21.7
https://stats.libretexts.org/@go/page/10394
Then g is invariant for X. Proof Of course it now follows that the invariant probability density function the normalizing constant n
1
c = ∑ x=m
n−m (
x
f
for X is given by
a )(
x −m
f (x) = g(x)/c
for
x ∈ S
where
c
is
x−m
)
(16.21.51)
b
The limiting distribution of X has probability density function f .
Other Special Birth-Death Chains There are a number of special birth-death chains that are studied in other sections, because the models are important and lead to special insights and analytic tools. These include Queuing chains The pure death branching chain The Yule process, a pure birth branching chain The general birth-death branching chain This page titled 16.21: Continuous-Time Birth-Death Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.21.8
https://stats.libretexts.org/@go/page/10394
16.22: Continuous-Time Queuing Chains Basic Theory Introduction In a queuing model, customers arrive at a station for service. As always, the terms are generic; here are some typical examples: The customers are persons and the service station is a store. The customers are file requests and the service station is a web server.
Figure 16.22.1: Ten customers and a server Queuing models can be quite complex, depending on such factors as the probability distribution that governs the arrival of customers, the probability distribution that governs the service of customers, the number of servers, and the behavior of the customers when all servers are busy. Indeed, queuing theory has its own lexicon to indicate some of these factors. In this section, we will discuss a few of the basic, continuous-time queuing chains. In a general sense, the main interest in any queuing model is the number of customers in the system as a function of time, and in particular, whether the servers can adequately handle the flow of customers. This section parallels the section on discrete-time queuing chains. Our main assumptions are as follows: 1. There are k ∈ N ∪ {∞} servers. 2. The customers arrive according to a Poisson process with rate μ ∈ (0, ∞) . 3. If all of the servers are busy, a new customer goes to the end of a single line of customers waiting service. 4. The time required to service a customer has an exponential distribution with parameter ν ∈ (0, ∞) . 5. The service times are independent from customer to customer, and are independent of the arrival process. +
Assumption (b) means that the times between arrivals of customers are independent and exponentially distributed, with parameter μ . Assumption (c) means that we have a first-in, first-out model, often abbreviated FIFO. Note that there are three parameters in the model: the number of servers k , the exponential parameter μ that governs the arrivals, and the exponential parameter ν that governs the service times. The special cases k = 1 (a single server) and k = ∞ (infinitely many servers) deserve special attention. As you might guess, the assumptions lead to a continuous-time Markov chain. Let
Xt
denote the number of customers in the system (waiting in line or being served) at time is a continuous-time Markov chain on N, known as the M/M/k queuing chain.
t ∈ [0, ∞)
. Then
X = { Xt : t ∈ [0, ∞)}
In terms of the basic structure of the chain, the important quantities are the exponential parameters for the states and the transition matrix for the embedded jump chain. For the M/M/k chain X, 1. The exponential parameter function λ is given by λ(x) = μ + ν x if x ∈ N and x < k and λ(x) = μ + ν k if x ∈ N and x ≥ k. 2. The transition matrix Q for the jump chain is given by μ
νx Q(x, x − 1)
=
, Q(x, x + 1) = μ + νx νk
Q(x, x − 1)
=
,
x ∈ N, x < k
,
x ∈ N, x ≥ k
μ + νx μ , Q(x, x + 1) =
μ + νk
μ + νk
So the M/M/k chain is a birth-death chain with 0 as a reflecting boundary point. That is, in state x ∈ N , the next state is either x − 1 or x + 1 , while in state 0, the next state is 1. When k = 1 , the single-server queue, the exponential parameter in state x ∈ N is μ + ν and the transition probabilities for the jump chain are +
+
16.22.1
https://stats.libretexts.org/@go/page/10395
ν
μ
Q(x, x − 1) =
, Q(x, x + 1) = μ+ν
When
, the infinite server queue, the cases above for and the transition probabilities are
k =∞
μ + xν
(16.22.1) μ+ν
x ≥k
are vacuous, so the exponential parameter in state
νx
x ∈ N
is
μ
Q(x, x − 1) =
, Q(x, x + 1) = μ + νx
(16.22.2) μ + νx
Infinitesimal Generator The infinitesimal generator of the chain gives the same information as the exponential parameter function and the jump transition matrix, but in a more compact form. For the M/M/k queuing chain X, the infinitesimal generator G is given by
So for
G(x, x)
= −(μ + ν x), G(x, x − 1) = ν x, G(x, x + 1) = μ;
G(x, x)
= −(μ + ν k), G(x, x − 1) = ν k, G(x, x + 1) = μ;
x ∈ N, x < k x ∈ N, x ≥ k
, the single server queue, the generator G is given by G(0, 0) = −μ , G(0, 1) = μ , while for x ∈ N , , G(x, x − 1) = ν , G(x, x + 1) = μ . For k = ∞ , the infinite server case, the generator G is given by G(x, x) = −(μ + ν x) , G(x, x − 1) = ν x , and G(x, x + 1) = μ for all x ∈ N . k =1
+
G(x, x) = −(μ + ν )
Classification and Limiting Behavior Again, let X = {X : t ∈ [0, ∞)} denote the M/M/k queuing chain with arrival rate μ , service rate ν and with k ∈ N ∪ {∞} servers. As noted in the introduction, of fundamental importance is the question of whether the servers can handle the flow of customers, so that the queue eventually empties, or whether the length of the queue grows without bound. To understand the limiting behavior, we need to classify the chain as transient, null recurrent, or positive recurrent, and find the invariant functions. This will be easy to do using our results for more general continuous-time birth-death chains. Note first that X is irreducible. It's best to consider the single server and infinite server cases individually. t
+
The single server queuing chain X is 1. Transient if ν < μ . 2. Null recurrent if ν = μ . 3. Positive recurrent if ν > μ . The invariant distribution is the geometric distribution on N with parameter μ/ν. The invariant probability density function f is given by μ f (x) = (1 −
x
μ )(
ν
) ,
x ∈ N
(16.22.3)
ν
Proof The result makes intuitive sense. If the service rate is less than the arrival rate, the chain is transient and the length of the queue grows to infinity. If the service rate is greater than the arrival rate, the chain is positive recurrent. At the boundary between these two cases, when the arrival and service rates are the same, the chain is null recurrent. The infinite server queuing chain X is positive recurrent. The invariant distribution is the Poisson distribution with parameter μ/ν. The invariant probability density function f is given by x
f (x) = e
−μ/ν
(μ/ν )
,
x ∈ N
(16.22.4)
x!
Proof This result also makes intuitive sense. This page titled 16.22: Continuous-Time Queuing Chains is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
16.22.2
https://stats.libretexts.org/@go/page/10395
16.23: Continuous-Time Branching Chains Basic Theory Introduction Generically, suppose that we have a system of particles that can generate or split into other particles of the same type. Here are some typical examples: The particles are biological organisms that reproduce. The particles are neutrons in a chain reaction. The particles are electrons in an electron multiplier. We assume that the lifetime of each particle is exponentially distributed with parameter α ∈ (0, ∞) , and at the end of its life, is replaced by a random number of new particles that we will refer to as children of the original particle. The number of children N of a particle has probability density function f on N. The particles act independently, so in addition to being identically distributed, the lifetimes and the number of children are independent from particle to particle. Finally, we assume that f (1) = 0 , so that a particle cannot simply die and be replaced by a single new particle. Let μ and σ denote the mean and variance of the number of offspring of a single particle. So 2
∞
∞
μ = E(N ) = ∑ nf (n),
σ
2
2
= var(N ) = ∑(n − μ) f (n)
n=0
(16.23.1)
n=0
We assume that μ is finite and so σ makes sense. In our study of discrete-time Markov chains, we studied branching chains in terms of generational time. Here we want to study the model in real time. 2
Let X denote the number of particles at time t ∈ [0, ∞). Then X = {X : t ∈ [0, ∞)} is a continuous-time Markov chain on N , known as a branching chain. The exponential parameter function λ and jump transition matrix Q are given by t
t
1. λ(x) = αx for x ∈ N 2. Q(x, x + k − 1) = f (k) for x ∈ N and k ∈ N . +
Proof Of course 0 is an absorbing state, since this state means extinction with no particles. (Note that λ(0) = 0 and so by default, Q(0, 0) = 1.) So with a branching chain, there are essentially two types of behavior: population extinction or population explosion. For the branching chain X = {X
t
1. Extinction: X 2. Explosion: X
t
t
=0
: t ∈ [0, ∞)}
one of the following events occurs with probability 1:
for some t ∈ [0, ∞) and hence X as t → ∞ .
s
=0
for all s ≥ t .
→ ∞
Proof Without the assumption that μ < ∞ , explosion can actually occur in finite time. On the other hand, the assumption that f (1) = 0 is for convenience. Without this assumption, X would still be a continuous-time Markov chain, but as discussed in the Introduction, the exponential parameter function would be λ(x) = αf (1)x for x ∈ N and the jump transition matrix would be f (k) Q(x, x + k − 1) =
,
x ∈ N+ , k ∈ {0, 2, 3, …}
(16.23.2)
1 − f (1)
Because all particles act identically and independently, the branching chain starting with x ∈ N particles is essentially x independent copies of the branching chain starting with 1 particle. In many ways, this is the fundamental insight into branching chains, and in particular, means that we can often condition on X(0) = 1 . +
Generator and Transition Matrices As usual, we will let
denote the semigroup of transition matrices of for (x, y) ∈ N . Similarly, G denotes the infinitesimal generator matrix of X.
P = { Pt : t ∈ [0, ∞)}
Pt (x, y) = P(Xt = y ∣ X = x)
X
,
so
that
2
16.23.1
https://stats.libretexts.org/@go/page/10396
The infinitesimal generator G is given by G(x, x) = −αx, G(x, x + k − 1)
x ∈ N
= αxf (k),
x ∈ N+ , k ∈ N
Proof The Kolmogorov backward equation is ∞
d dt
2
Pt (x, y) = −αx Pt (x, x) + αx ∑ f (k)Pt (x + k − 1, y),
(x, y) ∈ N
(16.23.3)
k=0
Proof Unlike some of our other continuous-time models, the jump chain Y governed by Q is not the discrete-time version of the model. That is, Y is not a discrete-time branching chain, since in discrete time, the index n represents the n th generation, whereas here it represent the n th time that a particle reproduces. However, there are lots of discrete-time branching chain embedded in the continuous-time chain. Fix t ∈ (0, ∞) and define Z = {X : n ∈ N} . Then Z is a discrete-time branching chain with offspring probability density function f given by f (x) = P (1, x) for x ∈ N. t
t
t
nt
t
t
Proof
Probability Generating Functions As in the discrete case, probability generating functions are an important analytic tool for continuous-time branching chains. For t ∈ [0, ∞) let Φ denote the probability generating function of X given X t
t
0
=1
∞ Xt
Φt (r) = E (r
x
∣ X0 = 1) = ∑ r Pt (1, x)
(16.23.5)
x=0
Let Ψ denote the probability generating function of N ∞ N
Ψ(r) = E (r
n
) = ∑ r f (n)
(16.23.6)
n=0
The generating functions are defined (the series are absolutely convergent) at least for r ∈ (−1, 1] . The collection of generating functions Φ = {Φ : t ∈ [0, ∞)} gives the same information as the collection of probability density functions {P (1, ⋅) : t ∈ [0, ∞)}. With the fundamental insight that the branching process starting with one particle determines the branching process in general, Φ actually determines the transition semigroup P = {P : t ∈ [0, ∞)} . t
t
t
For t ∈ [0, ∞) and x ∈ N, the probability generating function of X given X t
0
=x
is Φ : x t
∞ y
x
∑ r Pt (x, y) = [ Φt (r)]
(16.23.7)
y=0
Proof Note that
is the generating function of the offspring distribution for the embedded discrete-time branching chain for t ∈ (0, ∞) . On the other hand, Ψ is the generating function of the offspring distribution for the continuous-time chain. So our main goal in this discussion is to see how Φ is built from Ψ. Because P is a semigroup under matrix multiplication, and because the particles act identically and independently, Φ is a semigroup under composition. Φt
Zt = { Xnt : n ∈ N}
Φs+t = Φs ∘ Φt
for s,
t ∈ [0, ∞)
.
Proof Note also that Φ (r) = E(r ∣ X = 1) = r for all r ∈ R . This also follows from the semigroup property: Φ = Φ ∘ Φ . The fundamental relationship between the collection of generating functions Φ and the generating function Ψ is given in the following 0
X0
0
0
16.23.2
0
0
https://stats.libretexts.org/@go/page/10396
theorem: The mapping t ↦ Φ satisfies the differential equation t
d dt
Φt = α(Ψ ∘ Φt − Φt )
(16.23.8)
Proof This differential equation, along with the initial condition Φ (r) = r for all r ∈ R determines the collection of generating functions Φ. In fact, an implicit solution for Φ (r) is given by the integral equation 0
t
Φt (r)
1
∫
du = αt
(16.23.11)
Ψ(u) − u
r
Another relationship is given in the following theorem. Here, Φ refers to the derivative of the generating function Φ with respect to its argument, of course (so r, not t ). ′
t
t
For t ∈ [0, ∞), ′
Φ
t
Ψ ∘ Φt − Φt
=
(16.23.12)
Ψ − Φ0
Proof
Moments In this discussion, we wil study the mean and variance of the number of particles at time t ∈ [0, ∞). Let mt = E(Xt ∣ X0 = 1), vt = var(Xt ∣ X0 = 1),
t ∈ [0, ∞)
(16.23.16)
so that m and v are the mean and variance, starting with a single particle. As always with a branching process, it suffices to consider a single particle: t
t
For t ∈ [0, ∞) and x ∈ N, 1. E(X ∣ X = x) = x m 2. var(X ∣ X = x) = x v t
0
t
t
0
t
Proof Recall also that μ and σ are the the mean and variance of the number of offspring of a particle. Here is the connection between the means: 2
mt = e
α(μ−1)t
for t ∈ [0, ∞).
1. If μ < 1 then m 2. If μ > 1 then m 3. If μ = 1 then m
as t → ∞ . This is extinction in the mean. as t → ∞ . This is explosion in the mean. for all t ∈ [0, ∞). This is stability in the mean.
t
→ 0
t
→ ∞
t
=1
Proof This result is intuitively very appealing. As a function of time, the expected number of particles either grows or decays exponentially, depending on whether the expected number of offspring of a particle is greater or less than one. The connection between the variances is more complicated. We assume that σ < ∞ . 2
If μ ≠ 1 then σ vt = [
If μ = 1 then v
t
2
= ασ t
1. If μ < 1 then v 2. If μ ≥ 1 then v
2
+ (μ − 1)] [e
2α(μ−1)t
−e
α(μ−1)t
],
t ∈ [0, ∞)
(16.23.21)
μ−1
. as t → ∞ as t → ∞
t
→ 0
t
→ ∞
16.23.3
https://stats.libretexts.org/@go/page/10396
Proof If μ < 1 so that m → 0 as t → ∞ and we have extinction in the mean, then v → 0 as t → ∞ also. If μ > 1 so that m → ∞ as t → ∞ and we have explosion in the mean, then v → ∞ as t → ∞ also. We would expect these results. On the other hand, if μ = 1 so that m = 1 for all t ∈ [0, ∞) and we have stability in the mean, then v grows linearly in t . This gives some insight into what to expect next when we consider the probability of extinction. t
t
t
t
t
t
The Probability of Extinction As shown above, there are two types of behavior for a branching process, either population extinction or population explosion. In this discussion, we study the extinction probability, starting as usual with a single particle: q = P(Xt = 0 for some t ∈ (0, ∞) ∣ X0 = 1) = lim P(Xt = 0 ∣ X0 = 1)
(16.23.27)
t→∞
Need we say it? The extinction probability starting with an arbitrary number of particles is easy. For x ∈ N, P(Xt = 0 for some t ∈ (0, ∞) ∣ X0 = x) = lim P(Xt = 0 ∣ X0 = x) = q
x
(16.23.28)
t→∞
Proof We can easily relate extinction for the continuous-time branching chain branching chains.
X
to extinction for any of the embedded discrete-time
If extinction occurs for X then extinction occurs for Z for every t ∈ (0, ∞) . Conversely if extinction occurs for Z for some t ∈ (0, ∞) then extinction occurs for Z for every t ∈ (0, ∞) and extinction occurs for X. Hence q is the minimum solution in (0, 1] of the equation Φ (r) = r for every t ∈ (0, ∞) . t
t
t
t
Proof So whether or not extinction is certain depends on the critical parameter μ . The extinction probability q and the mean of the offspring distribution μ are related as follows: 1. If μ ≤ 1 then q = 1 , so extinction is certain. 2. If μ > 1 then 0 < q < 1 , so there is a positive probability of extinction and a positive probability of explosion. Proof It would be nice to have an equation for q in terms of the offspring probability generating function Ψ. This is also easy The probability of extinction q is the minimum solution in (0, 1] of the equation Ψ(r) = r . Proof
Special Models We now turn our attention to a number of special branching chains that are important in applications or lead to interesting insights. We will use the notation established above, so that α is the parameter of the exponential lifetime of a particle, Q is the transition matrix of the jump chain, G is the infinitesimal generator matrix, and P is the transition matrix at time t ∈ [0, ∞). Similarly, m = E(X ∣ X = x) , v = var(X ∣ X = x) , and Φ are the mean, variance, and generating function of the number of particles at time t ∈ [0, ∞), starting with a single particle. As always, be sure to try these exercises yourself before looking at the proofs and solutions. t
t
t
0
t
t
0
t
The Pure Death Branching Chain First we consider the branching chain in which each particle simply dies without offspring. Sadly for these particles, extinction is inevitable, but this case is still a good place to start because the analysis is simple and lead to explicit formulas. Thus, suppose that X = { X : t ∈ [0, ∞)} is a branching process with lifetime parameter α ∈ (0, ∞) and offspring probability density function f with f (0) = 1 . t
The transition matrix of the jump chain and the generator matrix are given by
16.23.4
https://stats.libretexts.org/@go/page/10396
1. Q(0, 0) = 1 and Q(x, x − 1) = 1 for x ∈ N 2. G(x, x) = −αx for x ∈ N and G(x, x − 1) = αx for x ∈ N +
+
The time-varying functions are more interesting. Let t ∈ [0, ∞). Then 1. m = e 2. v = e −e 3. Φ (r) = 1 − (1 − r)e for r ∈ R 4. Given X = x the distribution of X is binomial with trial parameter x and success parameter e −αt
t
−αt
−2αt
t
−αt
t
0
−αt
t
Pt (x, y) = (
x −αty −αt x−y )e (1 − e ) , y
x ∈ N, y ∈ {0, 1, … , x}
. (16.23.29)
Direct Proof In particular, note that P (x, 0) = (1 − e ) → 1 as t → ∞ . that is, the probability of extinction by time t increases to 1 exponentially fast. Since we have an explicit formula for the transition matrices, we can find an explicit formula for the potential matrices as well. The result uses the beta function B . −αt
t
x
For β ∈ (0, ∞) the potential matrix U is given by β
1 Uβ (x, y) =
( α
x )B(y + β/α, x − y + 1), y
x ∈ N, y ∈ {0, 1, … , x}
(16.23.31)
For β = 0 , the potential matrix U is given by 1. U (x, 0) = ∞ for x ∈ N 2. U (x, y) = 1/αy for x, y ∈ N and x ≤ y . +
Proof We could argue the results for the potential U directly. Recall that U (x, y) is the expected time spent in state y starting in state x. Since 0 is absorbing and all states lead to 0, U (x, 0) = ∞ for x ∈ N. If x, y ∈ N and x ≤ y , then x leads to y with probability 1. Once in state y the time spent in y has an exponential distribution with parameter λ(y) = αy , and so the mean is 1/αy. Of course, when the chain leaves y , it never returns. +
Recall that βU is a transition probability matrix for β > 0 , and in fact β U (x, ⋅) is the probability density function of X given X = x where T is independent of X has the exponential distribution with parameter β. For the next result, recall the ascending power notation β
β
T
0
[k]
a
For β > 0 and x ∈ N , the function 1. +
= a(a + 1) ⋯ (a + k − 1),
β Uβ (x, ⋅)
(16.23.36)
is the beta-binomial probability density function with parameters x, [y]
β Uβ (x, y) = (
a ∈ R, k ∈ N
, and
β/α
[x−y]
(β/α) 1 x ) , y (1 + β/α)[x]
x ∈ N, y ∈ {0, 1, … x}
(16.23.37)
Proof
The Yule Process Next we consider the pure birth branching chain in which each particle, at the end of its life, is replaced by 2 new particles. Equivalently, we can think of particles that never die, but each particle gives birth to a new particle at a constant rate. This chain could serve as the model for an unconstrained nuclear reaction, and is known as the Yule process, named for George Yule. So specifically, let X = {X : t ∈ [0, ∞)} be the branching chain with exponential parameter α ∈ (0, ∞) and offspring probability density function given by f (2) = 1 . Explosion is inevitable, starting with at least one particle, but other properties of the Yule process are interesting. in particular, there are fascinating parallels with the pure death branching chain. Since 0 is an isolated, absorbing state, we will sometimes restrict our attention to positive states. t
16.23.5
https://stats.libretexts.org/@go/page/10396
The transition matrix of the jump chain and the generator matrix are given by 1. Q(0, 0) = 1 and Q(x, x + 1) = 1 for x ∈ N 2. G(x, x) = −αx for x ∈ N and G(x, x + 1) = αx for x ∈ N +
+
Since the Yule process is a pure birth process and the birth rate in state x ∈ N is αx, the process is also called the linear birth chain. As with the pure death process, we can give the distribution of X specifically. t
Let t ∈ [0, ∞). Then 1. m = e 2. v = e 3. Φ (r) = 4. Given X
αt
t
2αt
−e
t
t
αt
re
−αt
0
=x
for |r| < has the negative binomial distribution on N with stopping parameter x and success parameter e 1
1−r+re−αt
1−e−αt
,X
t
−αt
+
y −1 Pt (x, y) = (
)e
−xαt
(1 − e
−αt
y−x
)
,
x ∈ N+ , y ∈ {x, x + 1, …}
x −1
.
(16.23.40)
Proof from the general results Direct proof Recall that the negative binomial distribution with parameters k ∈ N and p ∈ (0, 1) governs the trial number of the k th success in a sequence of Bernoulli trials with success parameter p. So the occurrence of this distribution in the Yule process suggests such an interpretation. However this interpretation is not nearly as obvious as with the binomial distribution in the pure death branching chain. Next we give the potential matrices. +
For β ∈ [0, ∞) the potential matrix U is given by β
1 Uβ (x, y) =
If β > 0 , the function β U
β (x,
⋅)
y −1 (
α
)B(x + β/α, y − x + 1), x −1
x ∈ N+ , y ∈ {x, x + 1, …}
(16.23.45)
is the beta-negative binomial probability density function with parameters x, β/α, and 1: [x]
(β/α)
y −1 β Uβ (x, y) = (
[y−x]
1
) x −1
[y]
,
x ∈ N, y ∈ {x, x + 1, …}
(16.23.46)
(1 + β/α)
Proof If we think of the Yule process in terms of particles that never die, but each particle gives birth to a new particle at rate α , then we can study the age of the particles at a given time. As usual, we can start with a single, new particle at time 0. So to set up the notation, let X = {X : t ∈ [0, ∞)} be the Yule branching chain with birth rate α ∈ (0, ∞) , and assume that X = 1 . Let τ = 0 and for n ∈ N , let τ denote the time of the n th transition (birth). t
+
0
0
n
For t ∈ [0, ∞), let A denote the total age of the particles at time t . Then t
Xt −1
At = ∑ (t − τn ),
t ∈ [0, ∞)
(16.23.49)
n=0
The random process A = {A
t
: t ∈ [0, ∞)}
is the age process.
Proof Here is another expression for the age process. Again, let A = {A
t
: t ∈ [0, ∞)}
be the age process for the Yule chain starting with a single particle. Then t
At = ∫
Xs ds,
t ∈ [0, ∞)
(16.23.50)
0
Proof With the last representation, we can easily find the expected total age at time t .
16.23.6
https://stats.libretexts.org/@go/page/10396
Again, let A = {A
t
be the age process for the Yule chain starting with a single particle. Then
: t ∈ [0, ∞)}
e
αt
−1
E(At ) =
,
t ∈ [0, ∞)
(16.23.53)
α
Proof
The General Birth-Death Branching Chain Next we consider the continuous-time branching chain in which each particle, at the end of its life, leaves either no children or two children. At each transition, the number of particles either increases by 1 or decreases by 1, and so such a branching chain is also a continuous-time birth-death chain. Specifically, let X = {X : t ∈ [0, ∞)} be a continuous-time branching chain with lifetime parameter α ∈ (0, ∞) and offspring probability density function f given by f (0) = 1 − p , f (2) = p , where p ∈ [0, 1]. When p = 0 we have the pure death chain, and when p = 1 we have the Yule process. We have already studied these, so the interesting case is when p ∈ (0, 1) so that both extinction and explosion are possible. t
The transition matrix of the jump chain and the generator matrix are given by 1. Q(0, 0) = 1, and Q(x, x − 1) = 1 − p , Q(x, x + 1) = p for x ∈ N 2. G(x, x) = −αx for x ∈ N, and G(x, x − 1) = α(1 − p)x , G(x, x + 1) = αpx for x ∈ N +
+
As mentioned earlier, X is also a continuous-time birth-death chain on N, with 0 absorbing. In state x ∈ N , the birth rate is αpx and the death rate is α(1 − p)x . The moment functions are given next. +
For t ∈ [0, ∞), 1. m = e 2. If p ≠ ,
α(2p−1)t
t
1 2
4p(1 − p) vt = [
If p =
1 2
,v
t
= 4αp(1 − p)t
+ (2p − 1)] [e
2α(2p−1)t
−e
α(2p−1)t
]
(16.23.55)
2p − 1
.
Proof The next result gives the generating function of the offspring distribution and the extinction probability. For the birth-death branching chain, 1. Ψ(r) = pr
2
+ (1 − p)
2. q = 1 if 0 < p ≤
1 2
for r ∈ R .
and q =
1−p p
if
1 2
. But suppose now that instead of making constant unit bets, the gambler makes bets that depend on the outcomes of previous games. This leads to a martingale transform as studied above. 0
1
n
2
1
1
2
2
Suppose that the gambler bets Y on game n ∈ N (at even stakes), where Y ∈ [0, ∞) depends on (V , V , V , … , V ) and satisfies E(Y ) < ∞ . So the process Y = {Y : n ∈ N } is predictable with respect to X, and the gambler's net winnings after n games is n
+
n
n
n
0
1
2
n
n
(Y ⋅ X )n = V0 + ∑ Yk Vk = X0 + ∑ Yk (Xk − Xk−1 ) k=1
1. Y 2. Y 3. Y
⋅X ⋅X ⋅X
n−1
+
(17.2.15)
k=1
is a sub-martingale if p > . is a super-martingale if p < . is a martingale if p = . 1 2
1 2
1 2
Proof The simple random walk X is also a discrete-time Markov chain on P (x, x + 1) = p , P (x, x − 1) = 1 − p . The function h given by h(x) = (
1−p p
x
)
Z
with one-step transition matrix
P
given by
for x ∈ Z is harmonic for X.
Proof It now follows from our theorem above that the process
Z = { Zn : n ∈ N}
given by
Zn = (
1−p p
Xn
)
for
n ∈ N
is a martingale.
We showed this directly from the definition in the Introduction. As you may recall, this is De Moivre's martingale and named for Abraham De Moivre.
17.2.5
https://stats.libretexts.org/@go/page/10300
Branching Processes Recall the discussion of the simple branching process from the Introduction. The fundamental assumption is that the particles act independently, each with the same offspring distribution on N. As before, we will let f denote the (discrete) probability density function of the number of offspring of a particle, m the mean of the distribution, and ϕ the probability generating function of the distribution. We assume that f (0) > 0 and f (0) + f (1) < 1 so that a particle has a positive probability of dying without children and a positive probability of producing more than 1 child. Recall that q denotes the probability of extinction, starting with a single particle. The stochastic process of interest is X = {X : n ∈ N} where X is the number of particles in the n th generation for n ∈ N . Recall that X is a discrete-time Markov chain on N with one-step transition matrix P given by P (x, y) = f (y) for x, y ∈ N where f denotes the convolution power of order x of f . n
n
∗x
∗x
The function h given by h(x) = q for x ∈ N is harmonic for X. x
Proof It now follows from our theorem above that the process Z = {Z : n ∈ N} is a martingale where Z = q for n ∈ N . We showed this directly from the definition in the Introduction. We also showed that the process Y = {Y : n ∈ N} is a martingale where Y = X /m for n ∈ N . But we can't write Y = h(X ) for a function h defined on the state space, so we can't interpret this martingale in terms of a harmonic function. n
Xn
n
n
n
n
n
n
n
General Random Walks Suppose that
is a stochastic process satisfying the basic assumptions above relative to the filtration . Recall from the Introduction that the term increment refers to a difference of the form X − X for s, t ∈ T . The process X has independent increments if this increment is always independent of F , and has stationary increments this increment always has the same distribution as X − X . In discrete time, a process with stationary, independent increments is simply a random walk as discussed above. In continuous time, a process with stationary, independent increments (and with the continuity assumptions we have imposed) is called a continuous-time random walk, and also a Lévy process, named for Paul Lévy. X = { Xt : t ∈ T }
F = { Ft : t ∈ T }
s+t
s
s
t
0
So suppose that X has stationary, independent increments. For t ∈ T let Q denote the probability distribution of X − X on (R, R), so that Q is also the probability distribution aof X −X for every s, t ∈ T . From our previous study, we know that X is a Markov processes with transition kernel P at time t ∈ T given by t
t
s+t
t
0
s
t
Pt (x, A) = Qt (A − x);
We also know that E(X R ).
t
− X0 ) = at
for t ∈ T where a = E(X
1
x ∈ R, A ∈ R
− X0 )
(17.2.18)
(assuming of course that the last expected value exists in
The identity function I is . 1. Harmonic for X if a = 0 . 2. Sub-harmonic for X if a ≥ 0 . 3. Super-harmonic for X if a ≤ 0 . Proof It now follows that X is a martingale if a = 0 , a sub-martingale if a ≥ 0 , and a super-martingale if a ≤ 0 . We showed this directly in the Introduction. Recall that in continuous time, the Poisson counting process has stationary, independent increments, as does standard Brownian motion This page titled 17.2: Properties and Constructions is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.2.6
https://stats.libretexts.org/@go/page/10300
17.3: Stopping Times Basic Theory As in the Introduction, we start with a stochastic process X = {X : t ∈ T } on an underlying probability space (Ω, F , P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {F : t ∈ T } , and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F and X is measurable with respect to F for t ∈ T . We think of F as the collection of events up to time t ∈ T . We assume that E (| X |) < ∞ , so that the mean of X exists as a real number, for each t ∈ T . Finally, in continuous time where T = [0, ∞) , we make the standard assumption that X is right continuous and has left limits, and that the filtration F is right continuous and complete. t
t
t
t
t
t
t
Our general goal in this section is to see if some of the important martingale properties are preserved if the deterministic time t ∈ T is replaced by a (random) stopping time. Recall that a random time τ with values in T ∪ {∞} is a stopping time relative to F if {τ ≤ t} ∈ F for t ∈ T . So a stopping time is a random time that does not require that we see into the future. That is, we can tell if τ ≤ t from the information available at time t . Next recall that the σ-algebra associated with the stopping time τ is t
Fτ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft for all t ∈ T }
(17.3.1)
So F is the collection of events up to the random time τ just as F is the collection of events up to the deterministic time t ∈ T . In terms of a gambler playing a sequence of games, the time that the gambler decides to stop playing must be a stopping time, and in fact this interpretation is the origin of the name. That is, the time when the gambler decides to stop playing can only depend on the information that the gambler has up to that point in time. τ
t
Optional Stopping The basic martingale equation E(X ∣ F ) = X for s, t ∈ T with s ≤ t can be generalized by replacing both s and t by bounded stopping times. The result is known as the Doob's optional stopping theorem and is named again for Joseph Doob. Suppose that X = {X : t ∈ T } satisfies the basic assumptions above with respect to the filtration F = {F : t ∈ T } t
s
s
t
t
Suppose that are bounded stopping times relative to F with ρ ≤ τ . 1. If X is a martingale relative to F then E(X ∣ F ) = X . 2. If X is a sub-martingale relative to F then E(X ∣ F ) ≥ X . 3. If X is a super-martingale relative to F then E(X ∣ F ) ≤ X . τ
ρ
τ
ρ
ρ
τ
ρ
ρ
ρ
Proof in discrete time Proof in continuous time The assumption that the stopping times are bounded is critical. A counterexample when this assumption does not hold is given below. Here are a couple of simple corollaries: Suppose again that ρ and τ are bounded stopping times relative to F with ρ ≤ τ . 1. If X is a martingale relative to F then E(X ) = E(X ) . 2. If X is a sub-martingale relative to F then E(X ) ≥ E(X ) . 3. If X is a super-martingale relative to F then E(X ) ≤ E(X ) . τ
ρ
τ
ρ
τ
ρ
Proof Suppose that τ is a bounded stopping time relative to F. 1. If X is a martingale relative to F then E(X ) = E(X ) . 2. If X is a sub-martingale relative to F then E(X ) ≥ E(X ) . 3. If X is a super-martingale relative to F then E(X ) ≤ E(X ) . τ
0
τ
0
τ
0
The Stopped Martingale For our next discussion, we first need to recall how to stop a stochastic process at a stopping time.
17.3.1
https://stats.libretexts.org/@go/page/10301
Suppose that X satisfies the assumptions above and that τ is a stopping time relative to the filtration F. The stopped proccess X = { X : t ∈ [0, ∞)} is defined by τ
τ
t
X
τ
t
= Xt∧τ ,
t ∈ [0, ∞)
(17.3.7)
Details So X = X if t < τ and X = X if t ≥ τ . In particular, note that X = X . If X is the fortune of a gambler at time t ∈ T , then X is the revised fortune at time t when τ is the stopping time of the gamber. Our next result, known as the elementary stopping theorem, is that a martingale stopped at a stopping time is still a martingale. τ
τ
t
t
τ
τ
t
0
0
t
τ
t
Suppose again that X satisfies the assumptions above, and that τ is a stopping time relative to F. 1. If X is a martingale relative to F then so is X . 2. If X is a sub-martingale relative to F then so is X . 3. If X is a super-martingale relative to F then so is X . τ
τ
τ
General proof Special proof in discrete time The elementary stopping theorem is bad news for the gambler playing a sequence of games. If the games are fair or unfavorable, then no stopping time, regardless of how cleverly designed, can help the gambler. Since a stopped martingale is still a martingale, the the mean property holds. Suppose again that X satisfies the assumptions above, and that τ is a stopping time relative to F. Let t ∈ T . 1. If X is a martingale relative to F then E(X ) = E(X ) 2. If X is a sub-martingale relative to F then E(X ) ≥ E(X ) 3. If X is a super-martingale relative to F then E(X ) ≤ E(X t∧τ
0
t∧τ
0
0)
t∧τ
Optional Stopping in Discrete Time A simple corollary of the optional stopping theorem is that if X is a martingale and τ a bounded stopping time, then E(X ) = E(X ) (with the appropriate inequalities if X is a sub-martingale or a super-martingale). Our next discussion centers on other conditions which give these results in discrete time. Suppose that X = {X : n ∈ N} satisfies the basic assumptions above with respect to the filtration F = {F : n ∈ N} , and that τ is a stopping time relative to F. τ
0
n
n
Suppose that |X
n|
is bounded uniformly in n ∈ N and that τ is finite.
4. If X is a martingale then E(X ) = E(X ) . 5. If X is a sub-martingale then E(X ) ≥ E(X ) . 6. If X is a super-martingale then E(X ) ≤ E(X ) . τ
0
τ
0
τ
0
Proof Suppose that |X
n+1
− Xn |
is bounded uniformly in n ∈ N and that E(τ ) < ∞ .
4. If X is a martingale then E(X ) = E(X ) . 5. If X is a sub-martingale then E(X ) ≥ E(X ) . 6. If X is a super-martingale then E(X ) ≤ E(X ) . τ
0
τ
0
τ
0
Proof Let's return to our original interpretation of a martingale X representing the fortune of a gambler playing fair games. The gambler could choose to quit at a random time τ , but τ would have to be a stopping time, based on the gambler's information encoded in the filtration F. Under the conditions of the theorem, no such scheme can help the gambler in terms of expected value.
Examples and Applications The Simple Random Walk Suppose that
is a sequence if independent, identically distributed random variables with P(V = 1) = p and for i ∈ N , where p ∈ (0, 1). Let X = (X , X , X , …) be the partial sum process associated with V so
V = (V1 , V2 , …)
P(Vi = −1) = 1 − p
+
i
0
17.3.2
1
2
https://stats.libretexts.org/@go/page/10301
that n
Xn = ∑ Vi ,
n ∈ N
(17.3.14)
i=1
Then X is the simple random walk with parameter p. In terms of gambling, our gambler plays a sequence of independent and identical games, and on each game, wins €1 with probability p and loses €1 with probability 1 − p . So X is the the gambler's total net winnings after n games. We showed in the Introduction that X is a martingale if p = (the fair case), a sub-martingale if p > (the favorable case), and a super-martingale if p < (the unfair case). Now, for c ∈ Z , let n
1 2
1
1
2
2
τc = inf{n ∈ N : Xn = c}
(17.3.15)
where as usual, inf(∅) = ∞ . So τ is the first time that the gambler's fortune reaches c . What if the gambler simply continues playing until her net winnings is some specified positive number (say €1 000 000 )? Is that a workable strategy? c
Suppose that p =
1 2
and that c ∈ N . +
1. P(τ < ∞) = 1 2. E (X ) = c ≠ 0 = E(X 3. E(τ ) = ∞ c
0)
τc
c
Proof Note that part (b) does not contradict the optional stopping theorem because of part (c). The strategy of waiting until the net winnings reaches a specified goal c is unsustainable. Suppose now that the gambler plays until the net winnings either falls to a specified negative number (a loss that she can tolerate) or reaches a specified positive number (a goal she hopes to reach). Suppose again that p =
1 2
. For a,
b ∈ N+
, let τ
= τ−a ∧ τb
. Then
1. E(τ ) < ∞ 2. E(X ) = 0 3. P(τ < τ ) = b/(a + b) τ
−a
b
Proof So gambling until the net winnings either falls to −a or reaches b is a workable strategy, but alas has expected value 0. Here's another example that shows that the first version of the optional sampling theorem can fail if the stopping times are not bounded. Suppose again that p =
1 2
. Let a,
b ∈ N+
with a < b . Then τ
a
b = E (Xτ
b
< τb < ∞
∣ Fτ ) ≠ Xτ a
a
but =a
(17.3.17)
Proof This result does not contradict the optional stopping theorem since the stopping times are not bounded.
Wald's Equation Wald's equation, named for Abraham Wald is a formula for the expected value of the sum of a random number of independent, identically distributed random variables. We have considered this before, in our discussion of conditional expected value and our discussion of random samples, but martingale theory leads to a particularly simple and elegant proof. Suppose that X = (X : n ∈ N ) is a sequence of independent, identically distributed variables with common mean If N is a stopping time for X with E(N ) < ∞ then n
+
μ ∈ R
.
N
E ( ∑ Xk ) = E(N )μ
(17.3.18)
k=1
Proof
17.3.3
https://stats.libretexts.org/@go/page/10301
Patterns in Multinomial Trials Patterns in multinomial trials were studied in the chapter on Renewal Processes. As is often the case, martingales provide a more elegant solution. Suppose that L = (L , L , …) is a sequence of independent, identically distributed random variables taking values in a finite set S , so that L is a sequence of multinomial trials. Let f denote the common probability density function so that for a generic trial variable L, we have f (a) = P(L = a) for a ∈ S . We assume that all outcomes in S are actually possible, so f (a) > 0 for a ∈ S . 1
2
In this discussion, we interpret S as an alphabet, and we write the sequence of variables in concatenation form, L = L L ⋯ rather than standard sequence form. Thus the sequence is an infinite string of letters from our alphabet S . We are interested in the first occurrence of a particular finite substring of letters (that is, a “word” or “pattern”) in the infinite sequence. The following definition will simplify the notation. 1
If a = a
1 a2
⋯ ak
is a word of length k ∈ N
+
2
from the alphabet S , define k
f (a) = ∏ f (ai )
(17.3.22)
i=1
so f (a) is the probability of k consecutive trials producing word a . So, fix a word a = a a ⋯ a of length k ∈ N from the alphabet S , and consider the number of trials N until a is completed. Our goal is compute ν (a) = E (N ) . We do this by casting the problem in terms of a sequence of gamblers playing fair games and then using the optional stopping theorem above. So suppose that if a gambler bets c ∈ (0, ∞) on a letter a ∈ S on a trial, then the gambler wins c/f (a) if a occurs on that trial and wins 0 otherwise. The expected value of this bet is 1
2
k
+
a
a
c f (a)
−c = 0
(17.3.23)
f (a)
and so the bet is fair. Consider now a gambler with an initial fortune 1. When she starts playing, she bets 1 on a . If she wins, she bet her entire fortune 1/f (a ) on the next trial on a . She continues in this way: as long as she wins, she bets her entire fortune on the next trial on the next letter of the word, until either she loses or completes the word a . Finally, we consider a sequence of independent gamblers playing this strategy, with gambler i starting on trial i for each i ∈ N . 1
1
2
+
For a finite word a from the alphabet S , ν (a) is the total winnings by all of the players at time N . a
Proof Given a , we can compute the total winnings precisely. By definition, trials N − k + 1, … , N form the word a for the first time. Hence for i ≤ N − k , gambler i loses at some point. Also by definition, gambler N − k + 1 wins all of her bets, completes word a and so collects 1/f (a). The complicating factor is that gamblers N − k + 2, … , N may or may not have won all of their bets at the point when the game ends. The following exercise illustrates this. Suppose that L is a sequence of Bernoulli trials (so S = {0, 1}) with success probability p ∈ (0, 1). For each of the following strings, find the expected number of trials needed to complete the string. 1. 001 2. 010 Solution The difference between the two words is that the word in (b) has a prefix (a proper string at the beginning of the word) that is also a suffix (a proper string at the end of the word). Word a has no such prefix. Thus we are led naturally to the following dichotomy: Suppose that a is a finite word from the alphabet S . If no proper prefix of a is also a suffix, then a is simple. Otherwise, a is compound. Here is the main result, which of course is the same as when the problem was solved using renewal theory. Suppose that a is a finite word in the alphabet S . 1. If a is simple then ν (a) = 1/f (a) .
17.3.4
https://stats.libretexts.org/@go/page/10301
2. If a is compound, then ν (a) = 1/f (a) + ν (b) where b is the longest word that is both a prefix and a suffix of a . Proof For a compound word, we can use (b) to reduce the computation to simple words. Consider Bernoulli trials with success probability strings is completed.
. Find the expected number of trials until each of the following
p ∈ (0, 1)
1. 1011011 2. 11 ⋯ 1 (k times) Solutions Recall that an ace-six flat die is a six-sided die for which faces 1 and 6 have probability probability each. Ace-six flat dice are sometimes used by gamblers to cheat.
1 4
each while faces 2, 3, 4, and 5 have
1 8
Suppose that an ace-six flat die is thrown repeatedly. Find the expected number of throws until the pattern 6165616 occurs. Solution Suppose that a monkey types randomly on a keyboard that has the 26 lower-case letter keys and the space key (so 27 keys). Find the expected number of keystrokes until the monkey produces each of the following phrases: 1. it was the best of times 2. to be or not to be Solution
The Secretary Problem The secretary problem was considered in the chapter on Finite Sampling Models. In this discussion we will solve a variation of the problem using martingales. Suppose that there are n ∈ N candidates for a job, or perhaps potential marriage partners. The candidates arrive sequentially in random order and are interviewed. We measure the quality of each candidate by a number in the interval [0, 1]. Our goal is to select the very best candidate, but once a candidate is rejected, she cannot be recalled. Mathematically, our assumptions are that the sequence of candidate variables X = (X , X , … , X ) is independent and that each is uniformly distributed on the interval [0, 1] (and so has the standard uniform distribution). Our goal is to select a stopping time τ with respect to X that maximizes E(X ), the expected value of the chosen candidate. The following sequence will play a critical role as a sequence of thresholds. +
1
2
n
τ
Define the sequence a = (a
k
: k ∈ N)
by a
0
=0
and a
k+1
1. a < 1 for k ∈ N . 2. a < a for k ∈ N . 3. a → 1 as k → ∞ . 4. If X is uniformly distributed on [0, 1] then E(X ∨ a
=
1 2
2
(1 + a ) k
for k ∈ N . Then
k k
k+1
k
k)
= ak+1
for k ∈ N .
Proof Since a
0
=0
, all of the terms of the sequence are in [0, 1) by (a). Approximations of the first 10 terms are (0, 0.5, 0.625, 0.695, 0.742, 0.775, 0.800, 0.820, 0.836, 0.850, 0.861, …)
(17.3.27)
Property (d) gives some indication of why the sequence is important for the secretary probelm. At any rate, the next theorem gives the solution. To simplify the notation, let N = {0, 1, … , n} and N = {1, 2, … , n}. +
n
The stopping time τ is E(X ) = a . τ
+
= inf {k ∈ N
n
: Xk > an−k }
n
is optimal for the secretary problem with n candidates. The optimal value
n
Proof Here is a specific example: For n = 5 , the decision rule is as follows:
17.3.5
https://stats.libretexts.org/@go/page/10301
1. Select candidate 1 if X 2. select candidate 2 if X 3. select candidate 3 if X 4. select candidate 4 if X 5. select candidate 5.
1
2 3 4
; otherwise, ; otherwise, > 0.625; otherwise, > 0.5 ; otherwise, > 0.742
> 0.695
The expected value of our chosen candidate is 0.775. In our original version of the secretary problem, we could only observe the relative ranks of the candidates, and our goal was to maximize the probability of picking the best candidate. With n = 5 , the optimal strategy is to let the first two candidates go by and then pick the first candidate after that is better than all previous candidates, if she exists. If she does not exist, of course, we must select candidate 5. The probability of picking the best candidate is 0.433 This page titled 17.3: Stopping Times is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.3.6
https://stats.libretexts.org/@go/page/10301
17.4: Inequalities Basic Theory In this section, we will study a number of interesting inequalities associated with martingales and their sub-martingale and supermartingale cousins. These turn out to be very important for both theoretical reasons and for applications. You many need to review infimums and supremums.
Basic Assumptions As in the Introduction, we start with a stochastic process X = {X : t ∈ T } on an underlying probability space (Ω, F , P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). Next, we have a filtration F = {F : t ∈ T } , and we assume that X is adapted to F. So F is an increasing family of sub σ-algebras of F and X is measurable with respect to F for t ∈ T . We think of F as the collection of events up to time t ∈ T . We assume that E (| X |) < ∞ , so that the mean of X exists as a real number, for each t ∈ T . Finally, in continuous time where T = [0, ∞) , we make the standard assumptions that t ↦ X is right continuous and has left limits, and that the filtration F is right continuous and complete. t
t
t
t
t
t
t
t
Maximal Inequalites For motivation, let's review a modified version of Markov's inequality, named for Andrei Markov. If X is a real-valued random variable then 1 P(X ≥ x) ≤
E(X; X ≥ x),
x ∈ (0, ∞)
(17.4.1)
x
Proof So Markov's inequality gives an upper bound on the probability that X exceeds a given positive value x, in terms of a monent of X. Now let's return to our stochastic process X = { X : t ∈ T } . To simplify the notation, let T = {s ∈ T : s ≤ t} for t ∈ T . Here is the main definition: t
t
For the process X, define the corresponding maximal process U
= { Ut : t ∈ T }
Ut = sup{ Xs : s ∈ Tt },
Clearly, the maximal process is increasing, so that if inequality above would give
s, t ∈ T
with
by
t ∈ T
s ≤t
then
(17.4.3)
Us ≤ Ut
. A trivial application of Markov's
1 P(Ut ≥ x) ≤
x
E(Ut ; Ut ≥ x),
x >0
(17.4.4)
But when X is a sub-martingale, the following theorem gives a much stronger result by replacing the first occurrent of U on the right with X . The theorem is known as Doob's sub-martingale maximal inequality (or more simply as Doob's inequaltiy), named once again for Joseph Doob who did much of the pioneering work on martingales. A sub-martingale has an increasing property of sorts in the sense that if s, t ∈ T with s ≤ t then E(X ∣ F ) ≥ X , so it's perhaps not entirely surprising that such a bound is possible. t
t
t
Suppose that X is a sub-martingale. For t ∈ T , let U
t
s
s
= sup{ Xs : s ∈ Tt }
. Then
1 P(Ut ≥ x) ≤
x
E(Xt ; Ut ≥ x),
x ∈ (0, ∞)
(17.4.5)
Proof in the discrete time Proof in continuous time There are a number of simple corollaries of the maximal inequality. For the first, recall that the positive part of x = x ∨ 0 , so that x = x if x > 0 and x = 0 if x ≤ 0 . +
+
x ∈ R
is
+
17.4.1
https://stats.libretexts.org/@go/page/10302
Suppose that X is a sub-martingale. For t ∈ T , let V
t
+
= sup{ Xs 1
P(Vt ≥ x) ≤
: s ∈ Tt }
+
E(Xt ; Vt ≥ x),
x
. Then x ∈ (0, ∞)
(17.4.14)
Proof As a further simple corollary, note that 1 P(Vt ≥ x) ≤
x
+
E(Xt ),
x ∈ (0, ∞)
(17.4.15)
This is sometimes how the maximal inequality is given in the literature. Suppose that X is a martingale. For t ∈ T , let W
t
= sup{| Xs | : s ∈ Tt }
. Then
1 P(Wt ≥ x) ≤
x
E(| Xt |; Wt ≥ x),
x ∈ (0, ∞)
(17.4.16)
Proof Once again, a further simple corollary is 1 P(Wt ≥ x) ≤
x
E(| Xt |),
x ∈ (0, ∞)
(17.4.17)
Next recall that for k ∈ (1, ∞) , the k -norm of a real-valued random variable X is ∥X∥
k
k
= [E(|X | )]
1/k
, and the vector space L
k
consists of all real-valued random variables for which this norm is finite. The following theorem is the norm version of the Doob's maximal inequality. Suppose again that X is a martingale. For t ∈ T , let W
t
. Then for k > 1 ,
= sup{| Xs | : s ∈ Tt } k
∥ Wt ∥k ≤
k−1
∥ Xt ∥k
(17.4.18)
Proof Once again,
is the maximal process associated with |X| = {|X | : t ∈ T } . As noted in the proof, is the exponent conjugate to k , satisfying 1/j + 1/k = 1 . So this version of the maximal inequality states that the k norm of the maximum of the martingale X on T is bounded by j times the k norm of X , where j and k are conjugate exponents. Stated just in terms of expected value, rather than norms, the L maximal inequality is W = { Wt : t ∈ T }
t
j = k/(k − 1)
t
t
k
k
k
k
E (| Wt | ) ≤ (
k
) E (| Xt | )
(17.4.27)
k−1
Our final result in this discussion is a variation of the maximal inequality for super-martingales. Suppose that X = {X
t
: t ∈ T}
is a nonnegative super-martingale, and let U
∞
= sup{ Xt : t ∈ T }
. Then
1 P(U∞ ≥ x) ≤
x
E(X0 ),
x ∈ (0, ∞)
(17.4.28)
Proof
The Up-Crossing Inequality The up-crossing inequality gives a bound on how much a sub-martingale (or super-martingale) can oscillate, and is the main tool in the martingale convergence theorems that will be studied in the next section. It should come as no surprise by now that the inequality is due to Joseph Doob. We start with the discrete-time case. Suppose that x = (x recursively define
n
: n ∈ N)
is a sequence of real numbers, and that
17.4.2
a, b ∈ R
with
a 0 and f (0) + f (1) < 1 so that a particle has a positive probability of dying without children and a positive probability of producing more than 1 child. The stochastic process of interest is X = {X : n ∈ N} where X is the number of particles in the n th generation for n ∈ N . Recall that X is a discrete-time Markov chain on N. Since 0 is an absorbing state, and all positive states lead to 0, we know that the positive states are transient and so are visited only finitely often with probability 1. It follows that either X → 0 as n → ∞ (extinction) or X → ∞ as n → ∞ (explosion). We have quite a bit of information about which of these events will occur from our study of Markov chains, but the martingale convergence theorems give more information. n
n
n
n
Extinction and explosion 1. If m ≤ 1 then q = 1 and extinction is certain. 2. If m > 1 then q ∈ (0, 1). Either X → 0 as n → ∞ or X n
n
→ ∞
as n → ∞ at an exponential rate.
Proof
The Beta-Bernoulli Process Recall that the beta-Bernoulli process is constructed by randomizing the success parameter in a Bernoulli trials process with a beta distribution. Specifically, we start with a random variable P having the beta distribution with parameters a, b ∈ (0, ∞). Next we have a sequence X = (X , X , …) of indicator variables with the property that X is conditionally independent given P = p ∈ (0, 1) with P(X = 1 ∣ P = p) = p for i ∈ N . Let Y = { Y : n ∈ N} denote the partial sum process associated with X, so that once again, Y = ∑ X for n ∈ N . Next let M = Y /n for n ∈ N so that M is the sample mean of (X , X , … , X ). Finally let 1
2
i
+
n
n
n
1
2
i=1
i
n
n
+
n
n
a + Yn
Zn =
We showed in the Introduction that Z = {Z
n
Mn → P
and Z
n
→ P
: n ∈ N}
,
n ∈ N
(17.5.10)
a+b +n
is a martingale with respect to X.
as n → ∞ with probability 1 and in mean.
Proof This is a very nice result and is reminiscent of the fact that for the ordinary Bernoulli trials sequence with success parameter p ∈ (0, 1) we have the law of large numbers that M → p as n → ∞ with probability 1 and in mean. n
Pólya's Urn Process Recall that in the simplest version of Pólya's urn process, we start with an urn containing a red and b green balls. At each discrete time step, we select a ball at random from the urn and then replace the ball and add c new balls of the same color to the urn. For the parameters, we need a, b ∈ N and c ∈ N . For i ∈ N , let X denote the color of the ball selected on the ith draw, where 1 means red and 0 means green. For n ∈ N , let Y = ∑ X , so that Y = {Y : n ∈ N} is the partial sum process associated with X = { X : i ∈ N } . Since Y is the number of red balls in the urn at time n ∈ N , the average number of balls at time n is M = Y /n . On the other hand, the total number of balls in the urn at time n ∈ N is a + b + cn so the proportion of red balls in the urn at time n is +
+
i
n
n
i
n
+
i=1
i
n
n
+
n
Zn =
a + cYn
(17.5.11)
a + b + cn
We showed in the Introduction, that Z = {Z : n ∈ N} is a martingale. Now we are interested in the limiting behavior of M and Z as n → ∞ . When c = 0 , the answer is easy. In this case, Y has the binomial distribution with trial parameter n and success parameter a/(a + b) , so by the law of large numbers, M → a/(a + b) as n → ∞ with probability 1 and in mean. On the other hand, Z = a/(a + b) when c = 0 . So the interesting case is when c > 0 . n
n
n
n
n
n
Suppose that c ∈ N . Then there exists a random variable P such that M → P and Z → P as n → ∞ with probability 1 and in mean. Moreover, P has the beta distribution with left parameter a/c and right parameter b/c. +
n
n
Proof
17.5.3
https://stats.libretexts.org/@go/page/10303
Likelihood Ratio Tests Recall the discussion of likelihood ratio tests in the Introduction. To review, suppose that (S, S , μ) is a general measure space, and that X = {X : n ∈ N} is a sequence of independent, identically distributed random variables, taking values in S , and having a common probability density function with respect to μ . The likelihood ratio test is a hypothesis test, where the null and alternative hypotheses are n
H0 H1
: the probability density function is g . : the probability density function is g . 0 1
We assume that g and g are positive on S . Also, it makes no sense for g and g to be the same, so we assume that g set of positive measure. The test is based on the likelihood ratio test statistic 0
1
0
n
i=1
Under H , L 1
n
H1
,
0
≠ g1
on a
g0 (Xi )
Ln = ∏
We showed that under the alternative hypothesis likelihood ratio martingale.
1
,
n ∈ N
(17.5.12)
g1 (Xi )
L = { Ln : n ∈ N}
is a martingale with respect to
X
, known as the
as n → ∞ with probability 1.
→ 0
Proof This result is good news, statistically speaking. Small values of L are evidence in favor of H , so the decision rule is to reject H in favor of H if L ≤ l for a chosen critical value l ∈ (0, ∞). If H is true and the sample size n is sufficiently large, we will reject H . In the proof, note that ln(L ) must diverge to −∞ at least as fast as n diverges to ∞. Hence L → 0 as n → ∞ exponentially fast, at least. It also worth noting that L is a mean 1 martingale (under H ) so trivially E(L ) → 1 as n → ∞ even though L → 0 as n → ∞ with probability 1. So the likelihood ratio martingale is a good example of a sequence where the interchange of limit and expected value is not valid. n
1
1
n
0
1
0
n
n
1
n
n
Partial Products Suppose that Let
X = { Xn : n ∈ N+ }
is an independent sequence of nonnegative random variables with
E(Xn ) = 1
for
n ∈ N+
.
n
Yn = ∏ Xi ,
n ∈ N
(17.5.15)
i=1
so that Y = {Y : n ∈ N} is the partial product process associated with X. From our discussion of this process in the Introduction, we know that Y is a martingale with respect to X. Since Y is nonnegative, the second martingale convergence theorem applies, so there exists a random variable Y such that Y → Y as n → ∞ with probability 1. What more can we say? The following result, known as the Kakutani product martingale theorem, is due to Shizuo Kakutani. n
∞
Let a
n
− − − = E (√Xn )
1. If A > 0 then Y 2. If A = 0 then Y
for n ∈ N
+
∞
and let A = ∏
i=1
ai
n
.
as n → ∞ in mean and E(Y with probability 1.
→ Y∞
n ∞
=0
∞
∞)
=1
.
Proof
Density Functions This discussion continues the one on density functions in the Introduction. To review, we start with our probability space (Ω, F , P) and a filtration F = {F : n ∈ N} in discrete time. Recall again that F = σ (⋃ F ) . Suppose now that μ is a finite measure on the sample space (Ω, F ). For each n ∈ N ∪ {∞} , the restriction of μ to F is a measure on (Ω, F ) and similarly the restriction of P to F is a probability measure on (Ω, F ). To save notation and terminology, we will refer to these as μ and P on F , respectively. Suppose now that μ is absolutely continuous with respect to P on F for each n ∈ N . By the Radon-Nikodym theorem, μ has a density function (or Radon-Nikodym derivative) X : Ω → R with respect to P on F for each n ∈ N . The theorem and the derivative are named for Johann Radon and Otto Nikodym. In the Introduction we showed that X = { X : n ∈ N} is a martingale with respect to F . Here is the convergence result: ∞
n
∞
n=0
n
n
n
n
n
n
n
n
n
n
17.5.4
https://stats.libretexts.org/@go/page/10303
There exists a random variable X
∞
such that X
→ X∞
n
as n → ∞ with probability 1.
1. If μ is absolutely continuous with respect to P on F then X is a density function of μ with respect to P on F . 2. If μ and P are mutually singular on F then X = 0 with probability 1. ∞
∞
∞
∞
∞
Proof The martingale approach can be used to give a probabilistic proof of the Radon-Nikodym theorem, at least in certain cases. We start with a sample set Ω. Suppose that A = {A : i ∈ I } is a countable partition of Ω for each n ∈ N . Thus I is countable, A ∩ A = ∅ for distinct i, j ∈ I , and ⋃ A = Ω . Suppose also that A refines A for each n ∈ N in the sense that A is a union of sets in A for each i ∈ I . Let F = σ(A ) . Thus F is generated by a countable partition, and so the sets in F are of the form ⋃ A where J ⊆ I . Moreover, by the refinement property F ⊆ F for n ∈ N , so that F = {F : n ∈ N} is a filtration. Let F = F = σ (⋃ F ) = σ (⋃ A ) , so that our sample space is (Ω, F ). Finally, suppose that P is a probability measure on (Ω, F ) with the property that P(A ) > 0 for n ∈ N and i ∈ I . We now have a probability space (Ω, F , P). Interesting probability spaces that occur in applications are of this form, so the setting is not as specialized as you might think. n
n
n
n
i
j
n
i
n
n
n
i∈In
n+1
n
n
n+1
i
n
n
n
i
n
n
n
j∈J
n
j
n
∞
∞
n+1
n
∞
n
n=0
n
n=0
n
n
i
Suppose now that μ a finte measure on (Ω, F ). From our assumptions, the only null set for P on F is ∅, so μ is automatically absolutely continuous with respect to P on F . Moreover, for n ∈ N , we can give the density function of μ with respect to P on F explicitly: n
n
n
The density function of μ with respect to P on F is the random variable X whose value on A is μ(A i ∈ I . Equivalently, n
n
n
n
i
i
n
)/P(A ) i
for each
n
n
μ(A ) Xn = ∑ i∈In
i
n
n
1(A )
(17.5.22)
i
P(A ) i
Proof By our theorem above, there exists a random variable X such that X → X as n → ∞ with probability 1. If μ is absolutely continuous with respect to P on F , then X is a density function of μ with respect to P on F . The point is that we have given a more or less explicit construction of the density. n
For a concrete example, consider Ω = [0, 1). For n ∈ N , let j An = {[
n
2
j+1 ,
n
) : j ∈ {0, 1, … , 2
n
− 1}}
(17.5.24)
2
This is the partition of [0, 1) into 2 subintervals of equal length 1/2 , based on the dyadic rationals (or binary rationals) of rank n or less. Note that every interval in A is the union of two adjacent intervals in A , so the refinement property holds. Let P be ordinary Lebesgue measure on [0, 1) so that P(A ) = 1/2 for n ∈ N and i ∈ {0, 1, … , 2 − 1}. As above, let F = σ(A ) and F = σ (⋃ F ) = σ (⋃ A ) . The dyadic rationals are dense in [0, 1), so F is the ordinary Borel σ-algebra on [0, 1). Thus our probability space (Ω, F , P) is simply [0, 1) with the usual Euclidean structures. If μ is a finite measure on ([0, 1), F ) then the density function of μ on F is the random variable X whose value on the interval [j/2 , (j + 1)/2 ) is 2 μ[j/ 2 , (j + 1)/ 2 ) . If μ is absolutely continuous with respect to P on F (so absolutely continuous in the usual sense), then a density function of μ is X = lim X . n
n
n
n+1
n
n
n
n
i
∞
n=0
n
n=0
n
n
n
n
n
n
∞
n
n
n
n→∞
n
This page titled 17.5: Convergence is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.5.5
https://stats.libretexts.org/@go/page/10303
17.6: Backwards Martingales Basic Theory A backwards martingale is a stochastic process that satisfies the martingale property reversed in time, in a certain sense. In some ways, backward martingales are simpler than their forward counterparts, and in particular, satisfy a convergence theorem similar to the convergence theorem for ordinary martingales. The importance of backward martingales stems from their numerous applications. In particular, some of the fundamental theorems of classical probability can be formulated in terms of backward martingales.
Definitions As usual, we start with a stochastic process Y = {Y : t ∈ T } on an underlying probability space (Ω, F , P), having state space R, and where the index set T (representing time) is either N (discrete time) or [0, ∞) (continuous time). So to review what all this means, Ω is the sample space, F the σ-algebra of events, P the probability measure on (Ω, F ), and Y is a random variable with values in R for each t ∈ T . But at this point our formulation diverges. Suppose that G is a sub σ-algebra of F for each t ∈ T , and that G = {G : t ∈ T } is decreasing so that if s, t ∈ T with s ≤ t then G ⊆ G . Let G = ⋂ G . We assume that Y is measurable with respect to G and that E(|Y |) < ∞ for each t ∈ T . t
t
t
t
t
t
The process
Y = { Yt : t ∈ T }
E(Ys ∣ Gt ) = Yt
for all s,
s
∞
t
t∈T
t
t
t ∈ T
is a backwards martingale (or reversed martingale) with respect to with s ≤ t .
G = { Gt : t ∈ T }
if
A backwards martingale can be formulated as an ordinary martingale by using negative times as the indices. Let T = {−t : t ∈ T } , so that if T = N (the discrete case) then T is the set of non-positive integers, and if T = [0, ∞) (the continuous case) then T = (−∞, 0] . Recall also that the standard martingale definitions make sense for any totally ordered index set. −
−
−
Suppose again that Y = {Y : t ∈ T } is a backwards martingale with respect to G = {G : t ∈ T } . Let F =G for t ∈ T . Then X = {X : t ∈ T } is a martingale with respect to F = {F : t ∈ T } . t
t
t
−
−t
−
t
Xt = Y−t
and
−
t
Proof Most authors define backwards martingales with negative indices, as above, in the first place. There are good reasons for doing so, since some of the fundamental theorems of martingales apply immediately to backwards martingales. However, for the applications of backwards martingales, this notation is artificial and clunky, so for the most part, we will use our original definition. The next result is another way to view a backwards martingale as an ordinary martingale. This one preserves nonnegative time, but introduces a finite time horizon. For t ∈ T , let T = {s ∈ T : s ≤ t} , a notation we have used often before. t
Suppose again that X =Y and F t s
t
t−s
s
is a backwards martingale with respect to G = {G : t ∈ T } . Fix t ∈ T and define for s ∈ T . Then X = {X : s ∈ T } is a martingale relative to F = {F : s ∈ T } .
Y = { Yt : t ∈ T } = Gt−s
t
t
t
t s
t
t
t
s
t
Proof
Properties Backwards martingales satisfy a simple and important property. Suppose that Y = {Y : t ∈ T } is a backwards martingale with repsect to t ∈ T and hence Y is uniformly integrable. t
G = { Gt : t ∈ T }
. Then
Yt = E(Y0 ∣ Gt )
for
Proof Here is the Doob backwards martingale, analogous to the ordinary Doob martingale, and of course named for Joseph Doob. In a sense, this is the converse to the previous result. Suppose that Y is a random variable on our probability space (Ω, F , P) with E(|Y |) < ∞ , and that G = {G : t ∈ T } is a decreasing family of sub σ-algebras of F , as above. Let Y = E(Y ∣ G ) for t ∈ T . Then Y = {Y : t ∈ T } is a backwards martingale with respect to G. t
t
17.6.1
t
t
https://stats.libretexts.org/@go/page/10304
Proof The convergence theorems are the most important results for the applications of backwards martingales. Recall once again that for k ∈ [1, ∞), the k-norm of a real-valued random variable X is k
1/k
∥X ∥k = [E (|X | )]
(17.6.5)
and the normed vector space L consists of all X with ∥X∥ < ∞ . Convergence in the space L is also referred to as convergence in mean, and convergence in the space L is referred to as convergence in mean square. Here is the primary backwards martingale convergence theorem: k
k
1
2
Suppose again that Y = {Y : t ∈ T } is a backwards martingale with respect to random variable Y such that t
G = { Gt : t ∈ T }
. Then there exists a
∞
1. Y 2. Y 3. Y
as t → ∞ with probability 1. → Y as t → ∞ in mean. = E(Y ∣ G ).
t
→ Y∞
t
∞
∞
0
∞
Proof As a simple extension of the last result, if Y
0
∈ Lk
for some k ∈ [1, ∞) then the convergence is in L also. k
Suppose again that Y = {Y : t ∈ T } is a backwards martingale relative to k ∈ [1, ∞) then Y → Y as t → ∞ in L . t
t
∞
G = { Gt : t ∈ T }
. If
Y0 ∈ Lk
for some
k
Proof
Applications The Strong Law of Large Numbers The strong law of large numbers is one of the fundamental theorems of classical probability. Our previous proof required that the underlying distribution have finite variance. Here we present an elegant proof using backwards martingales that does not require this extra assumption. So, suppose that X = {X : n ∈ N } is a sequence of independent, identically distributed random variables with common mean μ ∈ R . In statistical terms, X corresponds to sampling from the underlying distribution. Next let n
+
n
Yn = ∑ Xi ,
n ∈ N
(17.6.12)
i=1
so that Y = {Y : n ∈ N} is the partial sum process associated with X. Recall that the sequence Y is also a discrete-time random walk. Finally, let M = Y /n for n ∈ N so that M = {M : n ∈ N } is the sequence of sample means. n
n
n
+
n
+
The law of large numbers 1. M 2. M
n
→ μ
n
→ μ
as n → ∞ with probability 1. as n → ∞ in mean.
Proof
Exchangeable Variables We start with a probability space (Ω, F , P) and another measurable space (S, S ). Suppose that X = (X , X , …) is a sequence of random variables each taking values in S . Recall that X is exchangeable if for every n ∈ N , every permutation of (X , X , … , X ) has the same distribution on (S , S ) (where S is the n -fold product σ-algebra). Clearly if X is a sequence of independent, identically distributed variables, then X is exchangeable. Conversely, if X is exchangeable then the variables are identically distributed (by definition), but are not necessarily independent. The most famous example of a sequence that is exchangeable but not independent is Pólya's urn process, named for George Pólya. On the other hand, conditionally independent and identically distributed sequences are exchangeable. Thus suppose that (T , T ) is another measurable space and that Θ is a random variable taking values in T . 1
1
2
n
n
n
2
n
If X is conditionally independent and identically distributed given Θ, then X is exchangeable.
17.6.2
https://stats.libretexts.org/@go/page/10304
Proof Often the setting of this theorem arises when we start with a sequence of independent, identically distributed random variables that are governed by a parametric distribution, and then randomize one of the parameters. In a sense, we can always think of the setting in this way: Imagine that θ ∈ T is a parameter for a distribution on S . A special case is the beta-Bernoulli process, in which the success parameter p in sequence of Bernoulli trials is randomized with the beta distribution. On the other hand, Pólya's urn process is an example of an exchangeable sequence that does not at first seem to have anything to do with randomizing parameters. But in fact, we know that Pólya's urn process is a special case of the beta-Bernoulli process. This connection gives a hint of de Finetti's theorem, named for Bruno de Finetti, which we consider next. This theorem states any exchangeable sequence of indicator random variables corresponds to randomizing the success parameter in a sequence of Bernoulli trials. de Finetti's Theorem. Suppose that X = (X , X , …) is an exchangeable sequence of random variables, each taking values in {0, 1}. Then there exists a random variable P with values in [0, 1], such that given P = p ∈ [0, 1], X is a sequence of Bernoulli trials with success parameter p. 1
2
Proof De Finetti's theorem has been extended to much more general sequences of exchangeable variables. Basically, if X = (X , X , …) is an exchangeable sequence of random variables, each taking values in a significantly nice measurable space (S, S ) then there exists a random variable Θ such that X is independent and identically distributed given Θ. In the proof, the result that M → P as n → ∞ with probability 1, where M = ∑ X , is known as de Finetti's strong law of large numbers. De Finetti's theorem, and it's generalizations are important in Bayesian statistical inference. For an exchangeable sequence of random variables (our observations in a statistical experiment), there is a hidden, random parameter Θ. Given Θ = θ , the variables are independent and identically distributed. We gain information about Θ by imposing a prior distribution on Θ and then updating this, based on our observations and using Baye's theorem, to a posterior distribution. 1
2
n
n
1
n
n
i=1
i
Stated more in terms of distributions, de Finetti's theorem states that the distribution of n distinct variables in the exchangeable sequence is a mixture of product measures. That is, if μ is the distribution of a generic X on (S, S ) given Θ = θ , and ν is the distribution of Θ on (T , T ), then the distribution of n of the variables on (S S ) is θ
n
n
B ↦ ∫
μ (B) dν (θ) θ
n
(17.6.28)
T
This page titled 17.6: Backwards Martingales is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
17.6.3
https://stats.libretexts.org/@go/page/10304
CHAPTER OVERVIEW 18: Brownian Motion Brownian motion is a stochastic process of great theoretical importance, and as the basic building block of a variety of other processes, of great practical importance as well. In this chapter we study Brownian motion and a number of random processes that can be constructed from Brownian motion. We also study the Ito stochastic integral and the resulting calculus, as well as two remarkable representation theorems involving stochastic integrals. 18.1: Standard Brownian Motion 18.2: Brownian Motion with Drift and Scaling 18.3: The Brownian Bridge 18.4: Geometric Brownian Motion
This page titled 18: Brownian Motion is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
1
18.1: Standard Brownian Motion Basic Theory History In 1827, the botanist Robert Brown noticed that tiny particles from pollen, when suspended in water, exhibited continuous but very jittery and erratic motion. In his “miracle year” in 1905, Albert Einstein explained the behavior physically, showing that the particles were constantly being bombarded by the molecules of the water, and thus helping to firmly establish the atomic theory of matter. Brownian motion as a mathematical random process was first constructed in rigorous way by Norbert Wiener in a series of papers starting in 1918. For this reason, the Brownian motion process is also known as the Wiener process. Run the two-dimensional Brownian motion simulation several times in single-step mode to get an idea of what Mr. Brown may have observed under his microscope. Along with the Bernoulli trials process and the Poisson process, the Brownian motion process is of central importance in probability. Each of these processes is based on a set of idealized assumptions that lead to a rich mathematial theory. In each case also, the process is used as a building block for a number of related random processes that are of great importance in a variety of applications. In particular, Brownian motion and related processes are used in applications ranging from physics to statistics to economics.
Definition A standard Brownian motion is a random process properties:
X = { Xt : t ∈ [0, ∞)}
with state space
R
that satisfies the following
1. X = 0 (with probability 1). 2. X has stationary increments. That is, for s, t ∈ [0, ∞) with s < t , the distribution of X − X is the same as the distribution of X . 3. X has independent increments. That is, for t , t , … , t ∈ [0, ∞) with t < t < ⋯ < t , the random variables X ,X −X , … , X −X are independent. 4. X is normally distributed with mean 0 and variance t for each t ∈ (0, ∞) . 5. With probability 1, t ↦ X is continuous on [0, ∞). 0
t
s
t−s
1
t1
t2
t1
tn
2
n
1
2
n
tn−1
t
t
To understand the assumptions physically, let's take them one at a time. 1. Suppose that we measure the position of a Brownian particle in one dimension, starting at an arbitrary time which we designate as t = 0 , with the initial position designated as x = 0 . Then this assumption is satisfied by convention. Indeed, occasionally, it's convenient to relax this assumption and allow X to have other values. 2. This is a statement of time homogeneity: the underlying dynamics (namely the jostling of the particle by the molecules of water) do not change over time, so the distribution of the displacement of the particle in a time interval [s, t] depends only on the length of the time interval. 3. This is an idealized assumption that would hold approximately if the time intervals are large compared to the tiny times between collisions of the particle with the molecules. 4. This is another idealized assumption based on the central limit theorem: the position of the particle at time t is the result of a very large number of collisions, each making a very small contribution. The fact that the mean is 0 is a statement of spatial homogeneity: the particle is no more or less likely to be jostled to the right than to the left. Next, recall that the assumptions of stationary, independent increments means that var(X ) = σ t for some positive constant σ . By a change in time scale, we can assume σ = 1 , although we will consider more general Brownian motions in the next section. 5. Finally, the continuity of the sample paths is an essential assumption, since we are modeling the position of a physical particle as a function of time. 0
2
2
t
2
Of course, the first question we should ask is whether there exists a stochastic process satisfying the definition. Fortunately, the answer is yes, although the proof is complicated.
18.1.1
https://stats.libretexts.org/@go/page/10403
There exists a probability space (Ω, F , P) and a stochastic process satisfying the assumptions in the definition.
X = { Xt : t ∈ [0, ∞)}
on this probability space
Proof sketch Run the simulation of the standard Brownian motion process a few times in single-step mode. Note the qualitative behavior of the sample paths. Run the simulation 1000 times and compare the empirical density function and moments of X to the true probabiltiy density function and moments. t
Brownian Motion as a Limit of Random Walks Clearly the underlying dynamics of the Brownian particle being knocked about by molecules suggests a random walk as a possible model, but with tiny time steps and tiny spatial jumps. Let X = (X , X , X , …) be the symmetric simple random walk. Thus, X =∑ U where U = (U , U , …) is a sequence of independent variables with P(U = 1) = P(U = −1) = for each i ∈ N . Recall that E(X ) = 0 and var(X ) = n for n ∈ N . Also, since X is the partial sum process associated with an IID sequence, X has stationary, independent increments (but of course in discrete time). Finally, recall that by the central limit theorem, X /√− n converges to the standard normal distribution as n → ∞ . Now, for h, d ∈ (0, ∞) the continuous time process 0
1
2
n
n
1
i
i=1
1
+
2
i
n
i
2
n
n
Xh,d = {dX⌊t/h⌋ : t ∈ [0, ∞)}
(18.1.1)
is a jump process with jumps at {0, h, 2h, …} and with jumps of size ±d . Basically we would like to let h ↓ 0 and d ↓ 0 , but this cannot be done arbitrarily. Note that E [X (t)] = 0 but var [X (t)] = d ⌊t/h⌋ . Thus, by the central limit theorem, if we take − d = √h then the distribution of X (t) will converge to the normal distribution with mean 0 and variance t as h ↓ 0 . More generally, we might hope that all of requirements in the definition are satisfied by the limiting process, and if so, we have a standard Brownian motion. 2
h,d
h,d
h,d
Run the simulation of the random walk process for increasing values of n . In particular, run the simulation several times with n = 100 . Compare the qualitative behavior with the standard Brownian motion process. Note that the scaling of the random walk in time and space is effecitvely accomplished by scaling the horizontal and vertical axes in the graph window.
Finite Dimensional Distributions Let X = {X : t ∈ [0, ∞)} be a standard Brownian motion. It follows from part (d) of the definition that density function f given by t
Xt
has probability
t
2
1 ft (x) =
x
−− − √2πt
exp(−
),
x ∈ R
(18.1.2)
2t
This family of density functions determines the finite dimensional distributions of X. If t by
1,
t2 , … , tn ∈ (0, ∞)
ft
1
, t2 ,…, tn (x1 ,
with 0 < t
1
< t2 ⋯ < tn
x2 , … , xn ) = ft (x1 )ft 1
2
then (X
t1
−t1
, Xt2 , … , Xtn )
(x2 − x1 ) ⋯ ft
n
−tn− 1
has probability density function f
t1 , t2 ,…, tn
(xn − xn−1 ),
n
(x1 , x2 , … , xn ) ∈ R
given
(18.1.3)
Proof X
is a Gaussian process with mean function mean function for s, t ∈ [0, ∞) .
m(t) = 0
for
t ∈ [0, ∞)
and covariance function
c(s, t) = min{s, t}
Proof Recall that for a Gaussian process, the finite dimensional (multivariate normal) distributions are completely determined by the mean function m and the covariance function c . Thus, it follows that a standard Brownian motion is characterized as a continuous Gaussian process with the mean and covariance functions in the last theorem. Note also that −−−−−−− − min{s, t} cor(Xs , Xt ) =
− − √st
min{s, t} =√
2
,
(s, t) ∈ [0, ∞)
(18.1.5)
max{s, t}
We can also give the higher moments and the moment generating function for X . t
18.1.2
https://stats.libretexts.org/@go/page/10403
For n ∈ N and t ∈ [0, ∞), 1. E (X 2. E (X
2n
t
n
) = 1 ⋅ 3 ⋯ (2n − 1)t
2n+1
t
n
n
= (2n)! t /(n! 2 )
) =0
Proof For t ∈ [0, ∞), X has moment generating function given by t
E (e
uXt
) =e
tu/2
,
u ∈ R
(18.1.6)
Proof
Simple Transformations There are several simple transformations that preserve standard Brownian motion and will give us insight into some of its properties. As usual, our starting place is a standard Brownian motion X = {X : t ∈ [0, ∞)} . Our first result is that reflecting the paths of X in the line x = 0 gives another standard Brownian motion t
Let Y
t
= −Xt
for t ≥ 0 . Then Y
= { Yt : t ≥ 0}
is also a standard Brownian motion.
Proof Our next result is related to the Markov property, which we explore in more detail below. If we “restart” Brownian motion at a fixed time s , and shift the origin to X , then we have another standard Brownian motion. This means that Brownian motion is both temporally and spatially homogeneous. s
Fix s ∈ [0, ∞) and define Y
t
= Xs+t − Xs
for t ≥ 0 . Then Y
= { Yt : t ∈ [0, ∞)}
is also a standard Brownian motion.
Proof Our next result is a simple time reversal, but to state this result, we need to restrict the time parameter to a bounded interval of the form [0, T ] where T > 0 . The upper endpoint T is sometimes referred to as a finite time horizon. Note that {X : t ∈ [0, T ]} still satisfies the definition, but with the time parameters restricted to [0, T ]. t
Define Y
t
= XT −t − XT
for 0 ≤ t ≤ T . Then Y
= { Yt : t ∈ [0, T ]}
is also a standard Brownian motion on [0, T ].
Proof Our next transformation involves scaling X both temporally and spatially, and is known as self-similarity. Let a > 0 and define Y
t
=
1 a
Xa2 t
for t ≥ 0 . Then Y
= { Yt : t ∈ [0, ∞)}
is also a standard Brownian motion.
Proof Note that the graph of Y can be obtained from the graph of X by scaling the time axis t by a factor of a and scaling the spatial axis x by a factor of a . The fact that the temporal scale factor must be the square of the spatial scale factor is clearly related to Brownian motion as the limit of random walks. Note also that this transformation amounts to “zooming in or out” of the graph of X and hence Brownian motion has a self-similar, fractal quality, since the graph is unchanged by this transformation. This also suggests that, although continuous, t ↦ X is highly irregular. We consider this in the next subsection. 2
t
Our final transformation is referred to as time inversion. Let Y
0
=0
and Y
t
= tX1/t
for t > 0 . Then Y
= { Yt : t ∈ [0, ∞)}
is also a standard Brownian motion.
Proof
Irregularity The defining properties suggest that standard Brownian motion function. Consider the usual difference quotient at t ,
X = { Xt : t ∈ [0, ∞)}
Xt+h − Xt
cannot be a smooth, differentiable
(18.1.11)
h
18.1.3
https://stats.libretexts.org/@go/page/10403
By the stationary increments property, if h > 0 , the numerator has the same distribution as X , while if h < 0 , the numerator has the same distribution as −X , which in turn has the same distribution as X . So, in both cases, the difference quotient has the same distribution as X /h, and this variable has the normal distribution with mean 0 and variance |h| /h = 1/ |h| . So the variance of the difference quotient diverges to ∞ as h → 0 , and hence the difference quotient does not even converge in distribution, the weakest form of convergence. h
−h
−h
2
|h|
The temporal-spatial transformation above also suggests that Brownian motion cannot be differentiable. The intuitive meaning of differentiable at t is that the function is locally linear at t —as we zoon in, the graph near t begins to look like a line (whose slope, of course, is the derivative). But as we zoon in on Brownian motion, (in the sense of the transformation), it always looks the same, and in particular, just as jagged. More formally, if X is differentiable at t , then so is the transformed process Y , and the chain rule gives Y (t) = aX (a t) . But Y is also a standard Brownian motion for every a > 0 , so something is clearly wrong. While not rigorous, these examples are motivation for the following theorem: ′
′
2
With probability 1, X is nowhere differentiable on [0, ∞). Run the simulation of the standard Brownian motion process. Note the continuity but very jagged quality of the sample paths. Of course, the simulation cannot really capture Brownian motion with complete fidelity. The following theorems gives a more precise measure of the irregularity of standard Brownian motion. Standard Brownian motion X has Hölder exponent . That is, X is Hölder continuous with exponent α for every α < is not Hölder continuous with exponent α for any α > . 1 2
1 2
, but
1 2
In particular, X is not Lipschitz continuous, and this shows again that it is not differentiable. The following result states that in terms of Hausdorff dimension, the graph of standard Brownian motion lies midway between a simple curve (dimension 1) and the plane (dimension 2). The graph of standard Brownian motion has Hausdorff dimension
3 2
.
Yet another indication of the irregularity of Brownian motion is that it has infinite total variation on any interval of positive length. Suppose that a,
b ∈ R
with a < b . Then the total variation of X on [a, b] is ∞.
The Markov Property and Stopping Times As usual, we start with a standard Brownian motion X = {X : t ∈ [0, ∞)} . Recall that a Markov process has the property that the future is independent of the past, given the present state. Because of the stationary, independent increments property, Brownian motion has the property. As a minor note, to view X as a Markov process, we sometimes need to relax Assumption 1 and let X have an arbitrary value in R. Let F = σ{X : 0 ≤ s ≤ t} , the sigma-algebra generated by the process up to time t ∈ [0, ∞). The family of σ-algebras F = {F : t ∈ [0, ∞)} is known as a filtration. t
0
t
s
t
Standard Brownian motion is a time-homogeneous Markov process with transition probability density p given by 2
(y − x)
1 pt (x, y) = ft (y − x) =
−− − exp[− √2πt
],
t ∈ (0, ∞); x, y ∈ R
(18.1.12)
2t
Proof The transtion density p satisfies the following diffusion equations. The first is known as the forward equation and the second as the backward equation. ∂ ∂t
1 pt (x, y) =
∂ ∂t
2
2 ∂y 2 1
pt (x, y) =
∂
∂
pt (x, y)
(18.1.13)
pt (x, y)
(18.1.14)
2
2 ∂x2
Proof
18.1.4
https://stats.libretexts.org/@go/page/10403
The diffusion equations are so named, because the spatial derivative in the first equation is with respect to y , the state forward at time t , while the spatial derivative in the second equation is with respect to x, the state backward at time 0. Recall that a random time τ taking values in [0, ∞] is a stopping time with respect to the process X = {X : t ∈ [0, ∞)} if {τ ≤ t} ∈ F for every t ∈ [0, ∞). Informally, we can determine whether or not τ ≤ t by observing the process up to time t . An important special case is the first time that our Brownian motion hits a specified state. Thus, for x ∈ R let τ = inf{t ∈ [0, ∞) : X = x} . The random time τ is a stopping time. t
t
x
t
x
For a stopping time τ , we need the σ-algebra of events that can be defined in terms of the process up to the random time τ , analogous to F , the σ-algebra of events that can be defined in terms of the process up to a fixed time t . The appropriate definition is t
Fτ = {B ∈ F : B ∩ {τ ≤ t} ∈ Ft for all t ≥ 0}
(18.1.15)
See the section on Filtrations and Stopping Times for more information on filtrations, stopping times, and the σ-algebra associated with a stopping time. The strong Markov property is the Markov property generalized to stopping times. Standard Brownian motion X is also a strong Markov process. The best way to say this is by a generalization of the temporal and spatial homogeneity result above. Suppose that τ is a stopping time and define Brownian motion and is independent of F .
Yt = Xτ+t − Xτ
for
t ∈ [0, ∞)
. Then
Y = { Yt : t ∈ [0, ∞)}
is a standard
τ
The Reflection Principle Many interesting properties of Brownian motion can be obtained from a clever idea known as the reflection principle. As usual, we start with a standard Brownian motion X = {X : t ∈ [0, ∞)} . Let τ be a stopping time for X. Define t
Wt = {
Xt ,
0 ≤t 0 , τ has the same distribution as y is given by y
y
2
/Z
2
, where Z is a standard normal variable. The probability density function g
y
2
gy (t) =
y y ), − − − − exp(− 3 2t √2πt
t ∈ (0, ∞)
(18.1.20)
Proof The distribution of τ is the Lévy distribution with scale parameter y , and is named for the French mathematician Paul Lévy. The Lévy distribution is studied in more detail in the chapter on special distributions. 2
y
Open the hitting time experiment. Vary y and note the shape and location of the probability density function of τ . For selected values of the parameter, run the simulation in single step mode a few times. Then run the experiment 1000 times and compare the empirical density function to the probability density function. y
Open the special distribution simulator and select the Lévy distribution. Vary the parameters and note the shape and location of the probability density function. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function. Standard Brownian motion is recurrent. That is, P(τ
y
< ∞) = 1
for every y ∈ R .
Proof Thus, for each y ∈ R , X eventually hits y with probability 1. Actually we can say more: With probability 1, X visits every point in R. Proof On the other hand, Standard Brownian motion is null recurrent. That is, E(τ
y)
=∞
for every y ∈ R ∖ {0} .
Proof The process {τ
x
: x ∈ [0, ∞)}
has stationary, independent increments.
Proof The family of probability density functions x, y ∈ (0, ∞).
{ gx : x ∈ (0, ∞)}
is closed under convolution. That is,
for
gx ∗ gy = gx+y
Proof Now we turn our attention to the maximum process {Y
t
: t ∈ [0, ∞)}
, the “inverse” of the hitting process {τ
y
: y ∈ [0, ∞)}
.
For t > 0 , Y has the same distribution as |X |, known as the half-normal distribution with scale parameter t . The probability density function is t
t
18.1.6
https://stats.libretexts.org/@go/page/10403
− − − 2 ht (y) = √
y
2
exp(− πt
),
y ∈ [0, ∞)
(18.1.27)
2t
Proof The half-normal distribution is a special case of the folded normal distribution, which is studied in more detail in the chapter on special distributions. For t ≥ 0 , the mean and variance of Y are t
− −
1. E(Y
t)
=√
2. var(Y
t)
2t π
= t (1 −
2 π
)
Proof In the standard Brownian motion simulation, select the maximum value. Vary the parameter t and note the shape of the probability density function and the location and size of the mean-standard deviation bar. Run the simulation 1000 times and compare the empirical density and moments to the true probability density function and moments. Open the special distribution simulator and select the folded-normal distribution. Vary the parameters and note the shape and location of the probability density function and the size and location of the mean-standard deviation bar. For selected values of the parameters, run the simulation 1000 times and compare the empirical density function and moments to the true density function and moments.
Zeros and Arcsine Laws As usual, we start with a standard Brownian motion X = {X : t ∈ [0, ∞)} . Study of the zeros of X lead to a number of probability laws referred to as arcsine laws, because as we might guess, the probabilities and distributions involve the arcsine function. t
For
s, t ∈ [0, ∞)
with
s 0 , so with probability 1, X has a zero in (0, t). Actually, we can say a bit more: For t > 0 , X has infinitely many zeros in (0, t) with probability 1. Proof The last result is further evidence of the very strange and irregular behavior of Brownian motion. Note also that P [E(s, t)] depends only on the ratio s/t . Thus, P [E(s, t)] = P [E(1/t, 1/s)] and P [E(s, t)] = P [E(cs, ct)] for every c > 0 . So, for example the probability of at least one zero in the interval (2, 5) is the same as the probability of at least one zero in (1/5, 1/2), the same as the probability of at least one zero in (6, 15), and the same as the probability of at least one zero in (200, 500). For t > 0 , let Z denote the time of the last zero of X before time t . That is, Z = max {s ∈ [0, t] : X = 0} . Then Z has the arcsine distribution with parameter t . The distribution function H and the probability density function h are given by t
t
t
− − s ), t
2 Ht (s) =
π
arcsin( √
s
t
t
0 ≤s ≤t
(18.1.34)
1 ht (s) =
− −−−− − π √ s(t − s)
,
0 1 − 2μ/σ nμ + (n − n) > 0 so E(X ) → ∞ as t → ∞ . The mean and variance follow easily from the general moment result. 2
σ
2
2
2
2
then
n
t
2
For t ∈ [0, ∞), 1. E(X
t)
=e
2. var(X
t)
μt
=e
2μt
(e
σ
2
t
− 1)
In particular, note that the mean function m(t) = E(X ) = e for t ∈ [0, ∞) satisfies the deterministic part of the stochastic differential equation above. If μ > 0 then m(t) → ∞ as t → ∞ . If μ = 0 then m(t) = 1 for all t ∈ [0, ∞). If μ < 0 then m(t) → 0 as t → ∞ . μt
t
Open the simulation of geometric Brownian motion. The graph of the mean function m is shown as a blue curve in the main graph box. For various values of the parameters, run the simulation 1000 times and note the behavior of the random process in relation to the mean function. Open the simulation of geometric Brownian motion. Vary the parameters and note the size and location of the mean±standard deviation bar for X . For various values of the parameter, run the simulation 1000 times and compare the empirical mean and standard deviation to the true mean and standard deviation. t
Properties The parameter μ − σ
2
/2
determines the asymptotic behavior of geometric Brownian motion.
Asymptotic behavior: 1. If μ > σ 2. If μ < σ 3. If μ = σ
2 2 2
/2 /2 /2
then X → ∞ as t → ∞ with probability 1. then X → 0 as t → ∞ with probability 1. then X has no limit as t → ∞ with probability 1. t t t
Proof It's interesting to compare this result with the asymptotic behavior of the mean function, given above, which depends only on the parameter μ . When the drift parameter is 0, geometric Brownian motion is a martingale.
18.4.2
https://stats.libretexts.org/@go/page/10406
If μ = 0 , geometric Brownian motion X is a martingale with respect to the underlying Brownian motion Z . Proof from stochastic integrals Direct proof This page titled 18.4: Geometric Brownian Motion is shared under a CC BY 2.0 license and was authored, remixed, and/or curated by Kyle Siegrist (Random Services) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
18.4.3
https://stats.libretexts.org/@go/page/10406
Index A
F
arcsine distribution 5.19: The Arcsine Distribution
B Bayesian estimation 7.4: Bayesian Estimation
Feller Markov processes 16.2: Potentials and Generators for General Markov Processes
G
Bernoulli Trials 11.1: Introduction to Bernoulli Trials
Bertrand's paradox
5.8: The Gamma Distribution 14.3: The Gamma Distribution
gamma function
10.2: Bertrand's Paradox
beta distribution
geometric distribution
5.17: The Beta Distribution
beta prime distribution 5.18: The Beta Prime Distribution
binomial distribution 2.5: Independence
bold play 13.10: Bold Play
bonus number 18: Brownian Motion 18.1: Standard Brownian Motion
chi distribution 5.9: Chi-Square and Related Distribution
conditional expected value 4.7: Conditional Expected Value
conditional probability 2.4: Conditional Probability
continuous distributions
4.13: Kernels and Operators
P
I incomplete beta function 5.17: The Beta Distribution
independence
2.4: Conditional Probability
craps 13.4: Craps
infinitely divisible distributions 5.4: Infinitely Divisible Distributions
isomorphism
disjointness 2.5: Independence
distribution function 3.6: Distribution and Quantile Functions 3.9: General Distribution Functions
Ehrenfest chains 16.8: The Ehrenfest Chains
equivalence class 1.5: Equivalence Relations
extremal elements
Pareto distribution 3.2: Continuous Distributions 5.36: The Pareto Distribution
partial orders 1.4: Partial Orders 13.3: Simple Dice Games
power series distributions
J
5.5: Power Series Distributions
joint distributions
Q queuing theory
K
16.22: Continuous-Time Queuing Chains
keno 13.7: Lotteries
R
kernals 4.13: Kernels and Operators
Kolmogorov axioms 2.3: Probability Measures
kurtosis 4.4: Skewness and Kurtosis
random experiment 2.1: Random Experiments
Rayleigh distribution 5.14: The Rayleigh Distribution
Red and Black game 13.8: The Red and Black Game
roulette
L
13.5: Roulette
Lévy distribution 5.16: The Lévy Distribution
law of the iterated logarithm likelihood ratio 9.5: Likelihood Ratio Tests
lottery
S second central moment 4.3: Variance
second moment 4.3: Variance
skewness
13.7: Lotteries
4.4: Skewness and Kurtosis
stable distributions
M
5.3: Stable Distributions
Martingalges
E
12.8: Pólya's Urn Process 17.1: Introduction to Martingalges
poker dice
1.4: Partial Orders
18.1: Standard Brownian Motion
D
Pólya's urn process
2.5: Independence
3.2: Continuous Distributions
Correlation
9.2: Tests in the Normal Model
1.4: Partial Orders
3.4: Joint Distributions
4.3: Variance
normal model
operators
Hasse graph
C Chebyshev's inequality
normal distribution
O
H
13.7: Lotteries
Brownian motion
N
11.3: The Geometric Distribution
5.17: The Beta Distribution
beta function
13.6: The Monty Hall Problem
multivariate normal distribution
5.6: The Normal Distribution
5.8: The Gamma Distribution
10.2: Bertrand's Paradox
Bertrand's problem
7.2: The Method of Moments
Monty Hall problem 5.7: The Multivariate Normal Distribution
gamma distribution
Benford's law 5.39: Benford's Law
moments
standard triangle distribution
17: Martingales
5.24: The Triangle Distribution
matching problem 12.5: The Matching Problem
metric spaces
Stirling's approximation 5.8: The Gamma Distribution
strong law of large numbers
1.10: Metric Spaces
2.6: Convergence
1.4: Partial Orders
1
https://stats.libretexts.org/@go/page/10524
Student t distribution 5.10: The Student t Distribution
V
Weibull distribution 5.38: The Weibull Distribution
variance
Wiener process
4.3: Variance
T The F distribution 5.11: The F Distribution
timid play 13.9: Timid Play
18.1: Standard Brownian Motion
W
Z
Wald distribution 5.37: The Wald Distribution
weak law of large numbers 2.6: Convergence
zeta distribution 5.40: The Zeta Distribution
Zipf distribution 5.40: The Zeta Distribution
2
https://stats.libretexts.org/@go/page/10524
Glossary Sample Word 1 | Sample Definition 1
1
https://stats.libretexts.org/@go/page/25717
Detailed Licensing Overview Title: Probability, Mathematical Statistics, and Stochastic Processes (Siegrist) Webpages: 226 All licenses found: CC BY 2.0: 98.7% (223 pages) Undeclared: 1.3% (3 pages)
By Page Probability, Mathematical Statistics, and Stochastic Processes (Siegrist) - CC BY 2.0
3.1: Discrete Distributions - CC BY 2.0 3.2: Continuous Distributions - CC BY 2.0 3.3: Mixed Distributions - CC BY 2.0 3.4: Joint Distributions - CC BY 2.0 3.5: Conditional Distributions - CC BY 2.0 3.6: Distribution and Quantile Functions - CC BY 2.0 3.7: Transformations of Random Variables - CC BY 2.0 3.8: Convergence in Distribution - CC BY 2.0 3.9: General Distribution Functions - CC BY 2.0 3.10: The Integral With Respect to a Measure - CC BY 2.0 3.11: Properties of the Integral - CC BY 2.0 3.12: General Measures - CC BY 2.0 3.13: Absolute Continuity and Density Functions CC BY 2.0 3.14: Function Spaces - CC BY 2.0
Front Matter - CC BY 2.0 TitlePage - CC BY 2.0 InfoPage - CC BY 2.0 Introduction - CC BY 2.0 Table of Contents - Undeclared Licensing - Undeclared Table of Contents - CC BY 2.0 Object Library - CC BY 2.0 Credits - CC BY 2.0 Sources and Resources - CC BY 2.0 1: Foundations - CC BY 2.0 1.1: Sets - CC BY 2.0 1.2: Functions - CC BY 2.0 1.3: Relations - CC BY 2.0 1.4: Partial Orders - CC BY 2.0 1.5: Equivalence Relations - CC BY 2.0 1.6: Cardinality - CC BY 2.0 1.7: Counting Measure - CC BY 2.0 1.8: Combinatorial Structures - CC BY 2.0 1.9: Topological Spaces - CC BY 2.0 1.10: Metric Spaces - CC BY 2.0 1.11: Measurable Spaces - CC BY 2.0 1.12: Special Set Structures - CC BY 2.0
4: Expected Value - CC BY 2.0 4.1: Definitions and Basic Properties - CC BY 2.0 4.2: Additional Properties - CC BY 2.0 4.3: Variance - CC BY 2.0 4.4: Skewness and Kurtosis - CC BY 2.0 4.5: Covariance and Correlation - CC BY 2.0 4.6: Generating Functions - CC BY 2.0 4.7: Conditional Expected Value - CC BY 2.0 4.8: Expected Value and Covariance Matrices - CC BY 2.0 4.9: Expected Value as an Integral - CC BY 2.0 4.10: Conditional Expected Value Revisited - CC BY 2.0 4.11: Vector Spaces of Random Variables - CC BY 2.0 4.12: Uniformly Integrable Variables - CC BY 2.0 4.13: Kernels and Operators - CC BY 2.0
2: Probability Spaces - CC BY 2.0 2.1: Random Experiments - CC BY 2.0 2.2: Events and Random Variables - CC BY 2.0 2.3: Probability Measures - CC BY 2.0 2.4: Conditional Probability - CC BY 2.0 2.5: Independence - CC BY 2.0 2.6: Convergence - CC BY 2.0 2.7: Measure Spaces - CC BY 2.0 2.8: Existence and Uniqueness - CC BY 2.0 2.9: Probability Spaces Revisited - CC BY 2.0 2.10: Stochastic Processes - CC BY 2.0 2.11: Filtrations and Stopping Times - CC BY 2.0
5: Special Distributions - CC BY 2.0 5.1: Location-Scale Families - CC BY 2.0 5.2: General Exponential Families - CC BY 2.0 5.3: Stable Distributions - CC BY 2.0 5.4: Infinitely Divisible Distributions - CC BY 2.0 5.5: Power Series Distributions - CC BY 2.0
3: Distributions - CC BY 2.0
1
https://stats.libretexts.org/@go/page/32586
5.6: The Normal Distribution - CC BY 2.0 5.7: The Multivariate Normal Distribution - CC BY 2.0 5.8: The Gamma Distribution - CC BY 2.0 5.9: Chi-Square and Related Distribution - CC BY 2.0 5.10: The Student t Distribution - CC BY 2.0 5.11: The F Distribution - CC BY 2.0 5.12: The Lognormal Distribution - CC BY 2.0 5.13: The Folded Normal Distribution - CC BY 2.0 5.14: The Rayleigh Distribution - CC BY 2.0 5.15: The Maxwell Distribution - CC BY 2.0 5.16: The Lévy Distribution - CC BY 2.0 5.17: The Beta Distribution - CC BY 2.0 5.18: The Beta Prime Distribution - CC BY 2.0 5.19: The Arcsine Distribution - CC BY 2.0 5.20: General Uniform Distributions - CC BY 2.0 5.21: The Uniform Distribution on an Interval - CC BY 2.0 5.22: Discrete Uniform Distributions - CC BY 2.0 5.23: The Semicircle Distribution - CC BY 2.0 5.24: The Triangle Distribution - CC BY 2.0 5.25: The Irwin-Hall Distribution - CC BY 2.0 5.26: The U-Power Distribution - CC BY 2.0 5.27: The Sine Distribution - CC BY 2.0 5.28: The Laplace Distribution - CC BY 2.0 5.29: The Logistic Distribution - CC BY 2.0 5.30: The Extreme Value Distribution - CC BY 2.0 5.31: The Hyperbolic Secant Distribution - CC BY 2.0 5.32: The Cauchy Distribution - CC BY 2.0 5.33: The Exponential-Logarithmic Distribution - CC BY 2.0 5.34: The Gompertz Distribution - CC BY 2.0 5.35: The Log-Logistic Distribution - CC BY 2.0 5.36: The Pareto Distribution - CC BY 2.0 5.37: The Wald Distribution - CC BY 2.0 5.38: The Weibull Distribution - CC BY 2.0 5.39: Benford's Law - CC BY 2.0 5.40: The Zeta Distribution - CC BY 2.0 5.41: The Logarithmic Series Distribution - CC BY 2.0
7.2: The Method of Moments - CC BY 2.0 7.3: Maximum Likelihood - CC BY 2.0 7.4: Bayesian Estimation - CC BY 2.0 7.5: Best Unbiased Estimators - CC BY 2.0 7.6: Sufficient, Complete and Ancillary Statistics CC BY 2.0 8: Set Estimation - CC BY 2.0 8.1: Introduction to Set Estimation - CC BY 2.0 8.2: Estimation the Normal Model - CC BY 2.0 8.3: Estimation in the Bernoulli Model - CC BY 2.0 8.4: Estimation in the Two-Sample Normal Model CC BY 2.0 8.5: Bayesian Set Estimation - CC BY 2.0 9: Hypothesis Testing - CC BY 2.0 9.1: Introduction to Hypothesis Testing - CC BY 2.0 9.2: Tests in the Normal Model - CC BY 2.0 9.3: Tests in the Bernoulli Model - CC BY 2.0 9.4: Tests in the Two-Sample Normal Model - CC BY 2.0 9.5: Likelihood Ratio Tests - CC BY 2.0 9.6: Chi-Square Tests - CC BY 2.0 10: Geometric Models - CC BY 2.0 10.1: Buffon's Problems - CC BY 2.0 10.2: Bertrand's Paradox - CC BY 2.0 10.3: Random Triangles - CC BY 2.0 11: Bernoulli Trials - CC BY 2.0 11.1: Introduction to Bernoulli Trials - CC BY 2.0 11.2: The Binomial Distribution - CC BY 2.0 11.3: The Geometric Distribution - CC BY 2.0 11.4: The Negative Binomial Distribution - CC BY 2.0 11.5: The Multinomial Distribution - CC BY 2.0 11.6: The Simple Random Walk - CC BY 2.0 11.7: The Beta-Bernoulli Process - CC BY 2.0 12: Finite Sampling Models - CC BY 2.0 12.1: Introduction to Finite Sampling Models - CC BY 2.0 12.2: The Hypergeometric Distribution - CC BY 2.0 12.3: The Multivariate Hypergeometric Distribution CC BY 2.0 12.4: Order Statistics - CC BY 2.0 12.5: The Matching Problem - CC BY 2.0 12.6: The Birthday Problem - CC BY 2.0 12.7: The Coupon Collector Problem - CC BY 2.0 12.8: Pólya's Urn Process - CC BY 2.0 12.9: The Secretary Problem - CC BY 2.0
6: Random Samples - CC BY 2.0 6.1: Introduction - CC BY 2.0 6.2: The Sample Mean - CC BY 2.0 6.3: The Law of Large Numbers - CC BY 2.0 6.4: The Central Limit Theorem - CC BY 2.0 6.5: The Sample Variance - CC BY 2.0 6.6: Order Statistics - CC BY 2.0 6.7: Sample Correlation and Regression - CC BY 2.0 6.8: Special Properties of Normal Samples - CC BY 2.0
13: Games of Chance - CC BY 2.0 13.1: Introduction to Games of Chance - CC BY 2.0 13.2: Poker - CC BY 2.0 13.3: Simple Dice Games - CC BY 2.0
7: Point Estimation - CC BY 2.0 7.1: Estimators - CC BY 2.0
2
https://stats.libretexts.org/@go/page/32586
13.4: Craps - CC BY 2.0 13.5: Roulette - CC BY 2.0 13.6: The Monty Hall Problem - CC BY 2.0 13.7: Lotteries - CC BY 2.0 13.8: The Red and Black Game - CC BY 2.0 13.9: Timid Play - CC BY 2.0 13.10: Bold Play - CC BY 2.0 13.11: Optimal Strategies - CC BY 2.0
16.9: The Bernoulli-Laplace Chain - CC BY 2.0 16.10: Discrete-Time Reliability Chains - CC BY 2.0 16.11: Discrete-Time Branching Chain - CC BY 2.0 16.12: Discrete-Time Queuing Chains - CC BY 2.0 16.13: Discrete-Time Birth-Death Chains - CC BY 2.0 16.14: Random Walks on Graphs - CC BY 2.0 16.15: Introduction to Continuous-Time Markov Chains - CC BY 2.0 16.16: Transition Matrices and Generators of Continuous-Time Chains - CC BY 2.0 16.17: Potential Matrices - CC BY 2.0 16.18: Stationary and Limting Distributions of Continuous-Time Chains - CC BY 2.0 16.19: Time Reversal in Continuous-Time Chains CC BY 2.0 16.20: Chains Subordinate to the Poisson Process CC BY 2.0 16.21: Continuous-Time Birth-Death Chains - CC BY 2.0 16.22: Continuous-Time Queuing Chains - CC BY 2.0 16.23: Continuous-Time Branching Chains - CC BY 2.0
14: The Poisson Process - CC BY 2.0 14.1: Introduction to the Poisson Process - CC BY 2.0 14.2: The Exponential Distribution - CC BY 2.0 14.3: The Gamma Distribution - CC BY 2.0 14.4: The Poisson Distribution - CC BY 2.0 14.5: Thinning and Superpositon - CC BY 2.0 14.6: Non-homogeneous Poisson Processes - CC BY 2.0 14.7: Compound Poisson Processes - CC BY 2.0 14.8: Poisson Processes on General Spaces - CC BY 2.0 15: Renewal Processes - CC BY 2.0 15.1: Introduction - CC BY 2.0 15.2: Renewal Equations - CC BY 2.0 15.3: Renewal Limit Theorems - CC BY 2.0 15.4: Delayed Renewal Processes - CC BY 2.0 15.5: Alternating Renewal Processes - CC BY 2.0 15.6: Renewal Reward Processes - CC BY 2.0
17: Martingales - CC BY 2.0 17.1: Introduction to Martingalges - CC BY 2.0 17.2: Properties and Constructions - CC BY 2.0 17.3: Stopping Times - CC BY 2.0 17.4: Inequalities - CC BY 2.0 17.5: Convergence - CC BY 2.0 17.6: Backwards Martingales - CC BY 2.0
16: Markov Processes - CC BY 2.0 16.1: Introduction to Markov Processes - CC BY 2.0 16.2: Potentials and Generators for General Markov Processes - CC BY 2.0 16.3: Introduction to Discrete-Time Chains - CC BY 2.0 16.4: Transience and Recurrence for Discrete-Time Chains - CC BY 2.0 16.5: Periodicity of Discrete-Time Chains - CC BY 2.0 16.6: Stationary and Limiting Distributions of Discrete-Time Chains - CC BY 2.0 16.7: Time Reversal in Discrete-Time Chains - CC BY 2.0 16.8: The Ehrenfest Chains - CC BY 2.0
18: Brownian Motion - CC BY 2.0 18.1: Standard Brownian Motion - CC BY 2.0 18.2: Brownian Motion with Drift and Scaling - CC BY 2.0 18.3: The Brownian Bridge - CC BY 2.0 18.4: Geometric Brownian Motion - CC BY 2.0 Back Matter - CC BY 2.0 Index - CC BY 2.0 Glossary - CC BY 2.0 Detailed Licensing - Undeclared
3
https://stats.libretexts.org/@go/page/32586