Solution Manual for Selected Problems
Monte Carlo Statistical Methods, 2nd Edition
Christian P. Robert and George Casella
© 2007 Springer Science+Business Media

This manual has been compiled by Roberto Casarin, Université Dauphine and Università di Brescia, partly from his notes and partly from contributions from Cyrille Joutard, CREST, and Arafat Tayeb, Université Dauphine, under the supervision of the authors. Later additions were made by Christian Robert. Second Version, June 27, 2007.
Chapter 1

Problem 1.2

Let $X \sim \mathcal{N}(\theta,\sigma^2)$ and $Y \sim \mathcal{N}(\mu,\rho^2)$, with $Z = X \wedge Y$. The event $\{Z > z\}$ is a.s. equivalent to $\{X > z\}$ and $\{Y > z\}$. From the independence between $X$ and $Y$, it follows that
$$P(Z > z) = P(X > z)\,P(Y > z).$$
Let $G$ be the c.d.f. of $Z$; then
$$1 - G(z) = [1 - P(X < z)]\,[1 - P(Y < z)] = \left[1 - \Phi\!\left(\frac{z-\theta}{\sigma}\right)\right]\left[1 - \Phi\!\left(\frac{z-\mu}{\rho}\right)\right].$$
By taking the derivative and rearranging we obtain
$$g(z) = \left[1 - \Phi\!\left(\frac{z-\mu}{\rho}\right)\right]\sigma^{-1}\varphi\!\left(\frac{z-\theta}{\sigma}\right) + \left[1 - \Phi\!\left(\frac{z-\theta}{\sigma}\right)\right]\rho^{-1}\varphi\!\left(\frac{z-\mu}{\rho}\right).$$
Let $X \sim \mathcal{W}(\alpha,\beta)$ and $Z = X \wedge \omega$; then
$$P(X > \omega) = \int_\omega^\infty \alpha\beta x^{\alpha-1} e^{-\beta x^\alpha}\,dx$$
and
$$P(Z = \omega) = P(X > \omega) = \int_\omega^\infty \alpha\beta x^{\alpha-1} e^{-\beta x^\alpha}\,dx.$$
We conclude that the p.d.f. of $Z$ is
$$f(z) = \alpha\beta z^{\alpha-1} e^{-\beta z^\alpha}\,\mathbb{I}_{z\le\omega} + \left(\int_\omega^\infty \alpha\beta x^{\alpha-1} e^{-\beta x^\alpha}\,dx\right)\delta_\omega(z).$$
Problem 1.4

In order to find an explicit form of the integral
$$\int_\omega^\infty \alpha\beta x^{\alpha-1} e^{-\beta x^\alpha}\,dx,$$
we use the change of variable $y = x^\alpha$. We have $dy = \alpha x^{\alpha-1}\,dx$ and the integral becomes
$$\int_\omega^\infty \alpha\beta x^{\alpha-1} e^{-\beta x^\alpha}\,dx = \int_{\omega^\alpha}^\infty \beta e^{-\beta y}\,dy = e^{-\beta\omega^\alpha}.$$
Problem 1.6

Let $X_1,\ldots,X_n$ be an iid sample from the mixture distribution $f(x) = p_1 f_1(x) + \cdots + p_k f_k(x)$. Suppose that the moments up to order $k$ of every $f_j$, $j = 1,\ldots,k$, are finite and let
$$m_{i,j} = \mathbb{E}(X^i) = \int x^i f_j(x)\,dx, \qquad X \sim f_j.$$
The usual approximation of the moments of $f$ is
$$\mu_i = \frac{1}{n}\sum_{j=1}^n X_j^i.$$
Thus, we have the approximation
$$\mu_i = \sum_{j=1}^k p_j m_{i,j}, \qquad i = 1,\ldots,k.$$
This is a linear system that gives $(p_1,\ldots,p_k)$ if the matrix $M = [m_{i,j}]$ is invertible.
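As an illustration of this moment-matching idea, here is a minimal R sketch for a two-component normal mixture; the component densities (and hence the moments $m_{i,j}$) are assumed known, and all numerical values are arbitrary choices for the example.

    # Moment estimation of mixture weights (sketch; components assumed known)
    set.seed(1)
    p <- c(0.3, 0.7)                          # true weights (illustration only)
    n <- 10^4
    z <- sample(1:2, n, replace = TRUE, prob = p)
    x <- rnorm(n, mean = c(0, 3)[z], sd = 1)  # f_1 = N(0,1), f_2 = N(3,1)

    # theoretical moments m_{i,j} = E(X^i) under f_j, for i = 1, 2
    M  <- rbind(c(0, 3),                      # first moments of N(0,1), N(3,1)
                c(1, 10))                     # second moments: var + mean^2
    mu <- c(mean(x), mean(x^2))               # empirical moments of the mixture
    solve(M, mu)                              # estimated (p_1, p_2)

With the components known, solving the linear system recovers the weights up to Monte Carlo error.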
Problem 1.7

The density $f$ of the vector $y_n$ is
$$f(y_n \mid \mu,\sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n}\exp\!\left(-\frac{1}{2}\sum_{i=1}^n\left(\frac{y_i-\mu}{\sigma}\right)^{\!2}\right), \qquad \forall y_n\in\mathbb{R}^n,\ \forall(\mu,\sigma^2)\in\mathbb{R}\times\mathbb{R}_+^*.$$
This function is strictly positive and its first and second order partial derivatives with respect to $\mu$ and $\sigma$ exist. The same holds for the log-likelihood function
$$\log L(\mu,\sigma\mid y_n) = -n\log\sqrt{2\pi} - n\log\sigma - \frac{1}{2}\sum_{i=1}^n\left(\frac{y_i-\mu}{\sigma}\right)^{\!2},$$
thus we can find the ML estimators of $\mu$ and $\sigma^2$. The gradient of the log-likelihood is
$$\nabla\log L = \begin{pmatrix} \partial\log L/\partial\mu \\ \partial\log L/\partial\sigma \end{pmatrix}
= \begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^n(y_i-\mu) \\ -\frac{n}{\sigma} + \frac{\sum_{i=1}^n(y_i-\mu)^2}{\sigma^3} \end{pmatrix}.$$
If we equate the gradient to the null vector, $\nabla\log L = 0$, and solve the resulting system in $\mu$ and $\sigma$, we find
$$\hat\mu = \frac{1}{n}\sum_{i=1}^n y_i = \bar y, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n(y_i-\bar y)^2 = s^2.$$
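As a quick numerical check of these closed-form solutions (purely illustrative), one can compare them with a direct numerical maximisation of the log-likelihood in R:

    # Check that (mean(y), mean((y - mean(y))^2)) maximises the normal log-likelihood
    set.seed(6)
    y <- rnorm(100, mean = 1, sd = 2)
    negloglik <- function(par) {
      if (par[2] <= 0) return(Inf)
      -sum(dnorm(y, mean = par[1], sd = par[2], log = TRUE))
    }
    optim(c(0, 1), negloglik)$par              # numerical MLE of (mu, sigma)
    c(mean(y), sqrt(mean((y - mean(y))^2)))    # closed-form MLE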
Problem 1.8 Let X be a r.v. following a mixture of the two exponential distributions Exp(1) and Exp(2). The density is f (x) = πe−x + 2(1 − π)e−2x . The s-th non-central moment of the mixture is
$$\begin{aligned}
\mathbb{E}(X^s) &= \int_0^{+\infty}\pi x^s e^{-x}\,dx + \int_0^{+\infty}2(1-\pi)x^s e^{-2x}\,dx \\
&= -\pi\big[x^s e^{-x}\big]_0^{+\infty} + \pi\int_0^{+\infty}sx^{s-1}e^{-x}\,dx - (1-\pi)\big[x^s e^{-2x}\big]_0^{+\infty} + (1-\pi)\int_0^{+\infty}sx^{s-1}e^{-2x}\,dx \\
&= \pi\int_0^{+\infty}s(s-1)x^{s-2}e^{-x}\,dx + \frac{1-\pi}{2}\int_0^{+\infty}s(s-1)x^{s-2}e^{-2x}\,dx \\
&= \cdots = \pi s! + \frac{1-\pi}{2^s}\,s! = \pi\Gamma(s+1) + \frac{1-\pi}{2^s}\Gamma(s+1).
\end{aligned}$$
For $s = 1$, $\mathbb{E}(X) = \frac{\pi+1}{2}$ gives the unbiased estimator
$$\pi_1^* = 2\bar X - 1$$
for $\pi$. Replacing the $s$-th moment by its empirical counterpart, for all $s\ge0$, we obtain
$$\frac{1}{n\Gamma(s+1)}\sum_{i=1}^n X_i^s = \pi + \frac{1-\pi}{2^s}.$$
Let us define
$$\alpha_s = \frac{1}{n\Gamma(s+1)}\sum_{i=1}^n X_i^s.$$
We find the following family of unbiased estimators for $\pi$:
$$\pi_s^* = \frac{2^s\alpha_s - 1}{2^s - 1}.$$
Let us find the value $s^*$ that minimises the variance of $\pi_s^*$,
$$\mathbb{V}(\pi_s^*) = \frac{2^{2s}\,\mathbb{V}(\alpha_s)}{(2^s-1)^2},$$
with
$$\begin{aligned}
\mathbb{V}(\alpha_s) &= \frac{1}{n\Gamma^2(s+1)}\mathbb{V}(X^s) = \frac{1}{n\Gamma^2(s+1)}\left[\mathbb{E}(X^{2s}) - \mathbb{E}^2(X^s)\right] \\
&= \frac{1}{n}\left[\frac{\pi\Gamma(1+2s) + (1-\pi)\Gamma(1+2s)2^{-2s}}{\Gamma^2(1+s)} - \pi^2 - (1-\pi)^2 2^{-2s} - 2\pi(1-\pi)2^{-s}\right].
\end{aligned}$$
The optimal value $s^*$ depends on $\pi$. In the following table, the value of $s^*$ is given for different values of $\pi$:

$\pi$:  0.1   0.3   0.5   0.7   0.9
$s^*$:  1.45  0.9   0.66  0.5   0.36

Problem 1.11

(a) The random variable $x$ has the Weibull distribution with density
$$\frac{c}{\alpha^c}\,x^{c-1}e^{-x^c/\alpha^c}.$$
If we use the change of variables $x = y^{1/c}$ and insert the Jacobian $\frac{1}{c}y^{1/c-1}$, we get the density
$$\frac{c}{\alpha^c}\,y^{(c-1)/c}\,e^{-y/\alpha^c}\,\frac{1}{c}\,y^{1/c-1} = \frac{1}{\alpha^c}\,e^{-y/\alpha^c},$$
which is the density of an $\mathcal{E}xp(1/\alpha^c)$ distribution.

(b) The log-likelihood is based on
$$\log f(x_i\mid\alpha,c) = \log(c) - c\log(\alpha) + (c-1)\log(x_i) - \frac{x_i^c}{\alpha^c}.$$
Summing over the $n$ observations and differentiating with respect to $\alpha$ yields
$$\sum_{i=1}^n\frac{\partial}{\partial\alpha}\log f(x_i\mid\alpha,c) = \frac{-nc}{\alpha} + \frac{c}{\alpha^{c+1}}\sum_{i=1}^n x_i^c = 0
\;\Leftrightarrow\; nc = \frac{c}{\alpha^c}\sum_{i=1}^n x_i^c
\;\Leftrightarrow\; \alpha^c = \frac{1}{n}\sum_{i=1}^n x_i^c
\;\Leftrightarrow\; \alpha = \left(\frac{1}{n}\sum_{i=1}^n x_i^c\right)^{1/c},$$
while summing over the $n$ observations and differentiating with respect to $c$ yields
$$\sum_{i=1}^n\frac{\partial}{\partial c}\log f(x_i\mid\alpha,c) = \frac{n}{c} - n\log(\alpha) + \sum_{i=1}^n\log(x_i) - \frac{1}{\alpha^c}\sum_{i=1}^n x_i^c\log\!\left(\frac{x_i}{\alpha}\right) = 0$$
(using the formula $\frac{\partial(x^\alpha)}{\partial\alpha} = x^\alpha\log(x)$), which gives
$$\frac{n}{c} + \sum_{i=1}^n\log(x_i) = n\log(\alpha) + \frac{1}{\alpha^c}\sum_{i=1}^n x_i^c\log\!\left(\frac{x_i}{\alpha}\right).$$
Therefore, there is no explicit solution for $\alpha$ and $c$.

(c) Now, let the sample contain censored observations at level $y_0$. If $x_i\le y_0$, we observe $x_i$, but if $x_i > y_0$, we get no observation, simply the information that it is above $y_0$. This occurs with probability $P(x_i > y_0) = 1 - F(y_0)$. Therefore the new log-likelihood function is
$$\log\!\left(\prod_{i=1}^n f(x_i\mid\alpha,c)^{\delta_i}\,(1-F(y_0))^{1-\delta_i}\right)
= \sum_{i=1}^n\left[\delta_i\log f(x_i\mid\alpha,c) + (1-\delta_i)\log(1-F(y_0))\right]$$
$$= \sum_{i=1}^n\left[\delta_i\left(\log(c) - c\log(\alpha) + (c-1)\log(x_i) - \frac{x_i^c}{\alpha^c}\right) + (1-\delta_i)\left(-\frac{y_0^c}{\alpha^c}\right)\right].$$
As in part (b), there is no explicit maximum likelihood estimator for $c$. For $\alpha$,
$$\sum_{i=1}^n\left[\delta_i\left(\frac{-c}{\alpha} + \frac{c}{\alpha^{c+1}}x_i^c\right) + (1-\delta_i)\frac{c}{\alpha^{c+1}}y_0^c\right] = 0
\;\Leftrightarrow\; -\sum_{i=1}^n\delta_i + \frac{1}{\alpha^c}\sum_{i=1}^n\delta_i x_i^c + \frac{y_0^c}{\alpha^c}\sum_{i=1}^n(1-\delta_i) = 0$$
$$\;\Leftrightarrow\; \alpha = \left(\frac{\sum_{i=1}^n\delta_i x_i^c + y_0^c\sum_{i=1}^n(1-\delta_i)}{\sum_{i=1}^n\delta_i}\right)^{1/c},$$
while, for $c$,
$$\sum_{i=1}^n\left[\delta_i\left(\frac{1}{c} - \log(\alpha) + \log(x_i) - \frac{x_i^c}{\alpha^c}\log\!\left(\frac{x_i}{\alpha}\right)\right) + (1-\delta_i)\left(-\frac{y_0^c}{\alpha^c}\log\!\left(\frac{y_0}{\alpha}\right)\right)\right] = 0,$$
that is,
$$\frac{1}{c}\sum_{i=1}^n\delta_i + \sum_{i=1}^n\delta_i\log(x_i) - \log(\alpha)\sum_{i=1}^n\delta_i - \frac{1}{\alpha^c}\sum_{i=1}^n\delta_i x_i^c\log\!\left(\frac{x_i}{\alpha}\right) - \frac{y_0^c}{\alpha^c}\log\!\left(\frac{y_0}{\alpha}\right)\sum_{i=1}^n(1-\delta_i) = 0.$$
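Since the system in $c$ has no closed-form solution, the Weibull MLE is typically computed numerically. A minimal R sketch for the uncensored case of part (b) follows; the simulated data and starting values are arbitrary, and the parameterisation matches $f(x) = \frac{c}{\alpha^c}x^{c-1}e^{-(x/\alpha)^c}$.

    # Numerical Weibull MLE for part (b) (sketch)
    set.seed(2)
    x <- rweibull(200, shape = 1.5, scale = 2)   # simulated data: shape = c, scale = alpha

    negloglik <- function(par) {
      cc <- par[1]; alpha <- par[2]
      if (cc <= 0 || alpha <= 0) return(Inf)
      -sum(log(cc) - cc * log(alpha) + (cc - 1) * log(x) - (x / alpha)^cc)
    }
    optim(c(1, 1), negloglik)$par                # numerical MLE of (c, alpha)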
Problem 1.13

The function to be maximized in the most general case is
$$\max_{c,\gamma,\alpha}\ \left(1 - e^{-\frac{(216-\gamma)^c}{\alpha^c}}\right)\left(1 - e^{-\frac{(244-\gamma)^c}{\alpha^c}}\right)\prod_{i=1}^{17}\frac{c}{\alpha^c}(x_i-\gamma)^{c-1}e^{-\frac{(x_i-\gamma)^c}{\alpha^c}},$$
where the first 17 $x_i$'s are the uncensored observations. Estimates are about $0.0078$ for $c$ and $3.5363$ for $\alpha$ when $\gamma$ is equal to $100$.

Problem 1.14

(a) We have
$$\frac{d f(x_i\mid\theta,\sigma)}{d\theta} = \frac{-2}{\sigma^3\pi}\,\frac{\theta-x_i}{\left(1+\frac{(x_i-\theta)^2}{\sigma^2}\right)^2}.$$
Therefore
$$\frac{dL}{d\theta}(\theta,\sigma\mid x) = \sum_{i=1}^n\frac{-2}{\sigma^3\pi}\,\frac{\theta-x_i}{\left(1+\frac{(x_i-\theta)^2}{\sigma^2}\right)^2}\prod_{j=1,\,j\ne i}^n f(x_j\mid\theta,\sigma).$$
Reducing to the same denominator implies that a solution of the likelihood equation is a root of a polynomial of degree $2n-1$.

Problem 1.16

(a) When $Y\sim\mathcal{B}([1+e^\theta]^{-1})$, the likelihood is
$$L(\theta\mid y) = \left(\frac{1}{1+e^\theta}\right)^{\!y}\left(\frac{e^\theta}{1+e^\theta}\right)^{\!1-y}.$$
If $y = 0$ then $L(\theta\mid 0) = \frac{e^\theta}{1+e^\theta}$. Since $L(\theta\mid 0)\le 1$ for all $\theta$ and $\lim_{\theta\to\infty}L(\theta\mid 0) = 1$, the maximum likelihood estimator is $\infty$.

(b) When $Y_1, Y_2\sim\mathcal{B}([1+e^\theta]^{-1})$, the likelihood is
$$L(\theta\mid y_1,y_2) = \left(\frac{1}{1+e^\theta}\right)^{\!y_1+y_2}\left(\frac{e^\theta}{1+e^\theta}\right)^{\!2-y_1-y_2}.$$
As in the previous question, the maximum likelihood estimator is $\infty$ when $y_1 = y_2 = 0$ and $-\infty$ when $y_1 = y_2 = 1$. If $y_1 = 1$ and $y_2 = 0$ (or conversely) then
$$L(\theta\mid 1,0) = \frac{e^\theta}{(1+e^\theta)^2}.$$
The log-likelihood is $\log L(\theta\mid 1,0) = \theta - 2\log(1+e^\theta)$. It is a strictly concave function whose derivative is equal to $0$ at $\theta = 0$. Therefore, the maximum likelihood estimator is $0$.

Problem 1.20

(a) If $X\sim\mathcal{N}_p(\theta, I_p)$, the likelihood is
$$L(\theta\mid x) = \frac{1}{(2\pi)^{p/2}}\exp\!\left(-\frac{\|x-\theta\|^2}{2}\right).$$
The maximum likelihood estimator of $\theta$ solves the equations
$$\frac{dL}{d\theta_i}(\theta\mid x) = \frac{x_i-\theta_i}{(2\pi)^{p/2}}\exp\!\left(-\frac{\|x-\theta\|^2}{2}\right) = 0, \qquad i = 1,\ldots,p.$$
Therefore, the maximum likelihood estimator of $\lambda = \|\theta\|^2$ is $\hat\lambda(x) = \|x\|^2$. This estimator is biased, as
$$\mathbb{E}\{\hat\lambda(X)\} = \mathbb{E}\{\|X\|^2\} = \mathbb{E}\{\chi^2_p(\lambda)\} = \lambda + p$$
implies a constant bias equal to $p$.

(b) The variable $Y = \|X\|^2$ is distributed as a noncentral chi-squared random variable $\chi^2_p(\lambda)$. Here, the likelihood is
$$L(\lambda\mid y) = \frac{1}{2}\left(\frac{y}{\lambda}\right)^{\!\frac{p-2}{4}} I_{\frac{p-2}{2}}(\sqrt{\lambda y})\,e^{-\frac{\lambda+y}{2}}.$$
The derivative in $\lambda$ is
$$\frac{dL}{d\lambda}(\lambda\mid y) = -\frac{1}{4}\left(\frac{y}{\lambda}\right)^{\!\frac{p-2}{4}} e^{-\frac{\lambda+y}{2}}\left[\left(1+\frac{p-2}{2\lambda}\right)I_{\frac{p-2}{2}}(\sqrt{\lambda y}) - \sqrt{\frac{y}{\lambda}}\,I'_{\frac{p-2}{2}}(\sqrt{\lambda y})\right].$$
The MLE of $\lambda$ is the solution of the equation $dL(\lambda\mid y)/d\lambda = 0$. Now, we try to simplify this equation using the facts that
$$I'_\nu(z) = I_{\nu-1}(z) - \frac{\nu}{z}I_\nu(z) \qquad\text{and}\qquad I_{\nu-1}(z) = I_{\nu+1}(z) + \frac{2\nu}{z}I_\nu(z).$$
Let $\nu = \frac{p-2}{2}$ and $z = \sqrt{\lambda y}$. The MLE of $\lambda$ is the solution of $\left(1+\frac{p-2}{2\lambda}\right)I_\nu(z) - \sqrt{\frac{y}{\lambda}}\,I'_\nu(z) = 0$, that is, of $I_\nu(z) - \sqrt{\frac{y}{\lambda}}\,I_{\nu+1}(z) = 0$. Therefore the MLE of $\lambda$ is the solution of the implicit equation
$$\sqrt{\lambda}\,I_{\frac{p-2}{2}}(\sqrt{\lambda y}) = \sqrt{y}\,I_{\frac{p}{2}}(\sqrt{\lambda y}).$$
When $y < p$, the MLE of $\lambda$ is $0$.

(c) We use the prior $\pi(\theta) = \|\theta\|^{-(p-1)}$ on $\theta$. According to Bayes' formula, the posterior distribution $\pi(\theta\mid x)$ is given by
$$\pi(\theta\mid x) = \frac{f(x\mid\theta)\pi(\theta)}{\int f(x\mid\theta)\pi(\theta)\,d\theta} \propto \frac{e^{-\frac{\|x-\theta\|^2}{2}}}{\|\theta\|^{p-1}}.$$
The Bayes estimator of $\lambda$ is therefore
$$\delta^\pi(x) = \mathbb{E}^\pi\{\|\theta\|^2\mid x\} = \frac{\int_{\mathbb{R}^p}\|\theta\|^{3-p}\,e^{-\frac{\|x-\theta\|^2}{2}}\,d\theta}{\int_{\mathbb{R}^p}\|\theta\|^{1-p}\,e^{-\frac{\|x-\theta\|^2}{2}}\,d\theta}.$$
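Both integrals in this ratio are expectations under a $\mathcal{N}_p(x, I_p)$ distribution, so $\delta^\pi(x)$ lends itself to a direct Monte Carlo approximation. A minimal R sketch (the values of $p$ and $x$ below are arbitrary illustrations):

    # Monte Carlo evaluation of the Bayes estimator of lambda in Problem 1.20(c):
    # both integrals are normal expectations, so average over theta ~ N_p(x, I_p)
    set.seed(3)
    p <- 5
    x <- rnorm(p, mean = 2)                           # an arbitrary observation
    theta <- matrix(rnorm(10^5 * p, mean = x), ncol = p, byrow = TRUE)
    nrm <- sqrt(rowSums(theta^2))
    mean(nrm^(3 - p)) / mean(nrm^(1 - p))             # estimate of delta^pi(x)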
Problem 1.22

(a) Since $L(\delta, h(\theta))\ge 0$, by using Fubini's theorem we get
$$\begin{aligned}
r(\pi,\delta) &= \int_\Theta\int_{\mathcal{X}} L(\delta,h(\theta))\,f(x\mid\theta)\,\pi(\theta)\,dx\,d\theta
= \int_{\mathcal{X}}\int_\Theta L(\delta,h(\theta))\,f(x\mid\theta)\,\pi(\theta)\,d\theta\,dx \\
&= \int_{\mathcal{X}}\int_\Theta L(\delta,h(\theta))\,m(x)\,\pi(\theta\mid x)\,d\theta\,dx
= \int_{\mathcal{X}}\varphi(\pi,\delta\mid x)\,m(x)\,dx,
\end{aligned}$$
where $m$ is the marginal distribution of $X$ and $\varphi(\pi,\delta\mid x)$ is the posterior average cost. The estimator that minimizes the integrated risk $r$ is therefore, for each $x$, the one that minimizes the posterior average cost, and it is given by
$$\delta^\pi(x) = \arg\min_\delta\,\varphi(\pi,\delta\mid x).$$

(b) The average posterior loss is given by
$$\varphi(\pi,\delta\mid x) = \mathbb{E}^\pi[L(\delta,\theta)\mid x] = \mathbb{E}^\pi\!\left[\|h(\theta)-\delta\|^2\mid x\right] = \mathbb{E}^\pi\!\left[\|h(\theta)\|^2\mid x\right] + \|\delta\|^2 - 2\langle\delta,\mathbb{E}^\pi[h(\theta)\mid x]\rangle.$$
A simple derivation shows that the minimum is attained for
$$\delta^\pi(x) = \mathbb{E}^\pi[h(\theta)\mid x].$$

(c) Take $m$ to be the posterior median and consider the auxiliary function of $\theta$, $s(\theta)$, defined as
$$s(\theta) = \begin{cases} -1 & \text{if } h(\theta) < m \\ +1 & \text{if } h(\theta) > m. \end{cases}$$
Then $s$ satisfies the property
$$\mathbb{E}^\pi[s(\theta)\mid x] = -\int_{-\infty}^m\pi(\theta\mid x)\,d\theta + \int_m^\infty\pi(\theta\mid x)\,d\theta = -P(h(\theta)<m\mid x) + P(h(\theta)>m\mid x) = 0.$$
For $\delta < m$, we have $L(\delta,\theta) - L(m,\theta) = |h(\theta)-\delta| - |h(\theta)-m|$, from which it follows that
$$L(\delta,\theta) - L(m,\theta) = \begin{cases} \delta - m = (m-\delta)s(\theta) & \text{if } h(\theta)\le\delta \\ m - \delta = (m-\delta)s(\theta) & \text{if } h(\theta)\ge m \\ 2h(\theta) - (\delta+m) > (m-\delta)s(\theta) & \text{if } \delta < h(\theta) < m. \end{cases}$$
It follows that $L(\delta,\theta) - L(m,\theta) \ge (m-\delta)s(\theta)$, which implies that
$$\mathbb{E}^\pi[L(\delta,\theta) - L(m,\theta)\mid x] \ge (m-\delta)\,\mathbb{E}^\pi[s(\theta)\mid x] = 0.$$
This still holds, using a similar argument, when $\delta > m$, so the minimum of $\mathbb{E}^\pi[L(\delta,\theta)\mid x]$ is reached at $\delta = m$.

Problem 1.23

(a) When $X\mid\sigma\sim\mathcal{N}(0,\sigma^2)$ and
$\frac{1}{\sigma^2}\sim\mathcal{G}a(1,2)$, the posterior distribution is
$$\pi(\sigma^{-2}\mid x) \propto f(x\mid\sigma)\,\pi(\sigma^{-2}) \propto \frac{1}{\sigma}\,e^{-\frac{x^2/2+2}{\sigma^2}} = (\sigma^{-2})^{\frac{3}{2}-1}\,e^{-\frac{x^2/2+2}{\sigma^2}},$$
which means that $1/\sigma^2\sim\mathcal{G}a\!\left(\frac{3}{2},\,2+\frac{x^2}{2}\right)$. The marginal distribution is
$$m(x) = \int f(x\mid\sigma)\,\pi(\sigma^{-2})\,d(\sigma^{-2}) \propto \left(\frac{x^2}{2}+2\right)^{\!-\frac{3}{2}},$$
that is, $X\sim\mathcal{T}(2,0,2)$.

(b) When $X\mid\lambda\sim\mathcal{P}(\lambda)$ and $\lambda\sim\mathcal{G}a(2,1)$, the posterior distribution is
$$\pi(\lambda\mid x) \propto f(x\mid\lambda)\,\pi(\lambda) \propto \lambda^{x+1}e^{-3\lambda},$$
which means that $\lambda\mid x\sim\mathcal{G}a(x+2,3)$. The marginal distribution is
$$m(x) = \int f(x\mid\lambda)\,\pi(\lambda)\,d\lambda \propto \frac{\Gamma(x+2)}{3^{x+2}\,x!} = \frac{x+1}{3^{x+2}}.$$

(c) When $X\mid p\sim\mathcal{N}eg(10,p)$ and $p\sim\mathcal{B}e(\frac{1}{2},\frac{1}{2})$, the posterior distribution is
$$\pi(p\mid x) \propto f(x\mid p)\,\pi(p) \propto p^{10-\frac{1}{2}}(1-p)^{x-\frac{1}{2}},$$
which means that $p\mid x\sim\mathcal{B}e(10+\frac{1}{2},\,x+\frac{1}{2})$. The marginal distribution is
$$m(x) = \int f(x\mid p)\,\pi(p)\,dp \propto \binom{11+x}{x}\,B\!\left(10+\frac{1}{2},\,x+\frac{1}{2}\right),$$
the so-called beta-binomial distribution.
Problem 1.27

Starting from $\pi(\theta) = 1$, the density of $\sigma$ such that $\theta = \log(\sigma)$ is the Jacobian of the transform,
$$\left|\frac{d\theta}{d\sigma}\right| = \frac{1}{\sigma},$$
and the density of $\varrho$ such that $\theta = \log(\varrho/(1-\varrho))$ also is the Jacobian of the transform,
$$\left|\frac{d\theta}{d\varrho}\right| = \frac{1}{\varrho(1-\varrho)}.$$

Problem 1.31

(a) Since
$$y_i \sim \mathcal{N}(b_1 x_{i1} + b_2 x_{i2},\,1),$$
the posterior means of $b_1$ and $b_2$ are obtained as expectations under the posterior distribution of the pair $(b_1,b_2)$, equal to
$$\pi(b_1,b_2\mid y_1,\ldots,y_n) \propto \prod_{j=1}^n\varphi(y_j - b_1X_{1j} - b_2X_{2j})\,\mathbb{I}_{[0,1]^2}(b_1,b_2),$$
and the missing normalising constant is equal to the integral of the above. Therefore ($i = 1,2$),
$$\mathbb{E}^\pi[b_i\mid y_1,\ldots,y_n] = \frac{\int_0^1\int_0^1 b_i\prod_{j=1}^n\varphi(y_j-b_1X_{1j}-b_2X_{2j})\,db_1\,db_2}{\int_0^1\int_0^1\prod_{j=1}^n\varphi(y_j-b_1X_{1j}-b_2X_{2j})\,db_1\,db_2},$$
where $\varphi$ is the density of the standard normal distribution.

(b) If we replace the bounds in the above integrals with the indicator function $\mathbb{I}_{[0,1]^2}(b_1,b_2)$, we get
$$\mathbb{E}^\pi[b_i\mid y_1,\ldots,y_n] = \frac{\int\!\!\int b_i\,\mathbb{I}_{[0,1]^2}(b_1,b_2)\prod_{j=1}^n\varphi(y_j-b_1X_{1j}-b_2X_{2j})\,db_1\,db_2}{\int\!\!\int\mathbb{I}_{[0,1]^2}(b_1,b_2)\prod_{j=1}^n\varphi(y_j-b_1X_{1j}-b_2X_{2j})\,db_1\,db_2}
= \frac{\mathbb{E}^\pi\!\left[b_i\,\mathbb{I}_{[0,1]^2}(b_1,b_2)\mid y_1,\ldots,y_n\right]}{P^\pi\!\left[(b_1,b_2)\in[0,1]^2\mid y_1,\ldots,y_n\right]},$$
where the expectation is under the unconstrained posterior, i.e. the posterior that does not take into account the support constraint. This posterior is the usual conjugate posterior for the linear model: if $\beta = (b_1,b_2)^T$ and $\hat\beta = (\hat b_1,\hat b_2)^T$, then
$$\pi(b_1,b_2\mid y_1,\ldots,y_n) \propto \exp\!\left(-\frac{(\beta-\hat\beta)^T X^TX(\beta-\hat\beta)}{2}\right)$$
is indeed a $\mathcal{N}_2\big(\hat\beta,\,(X^TX)^{-1}\big)$ distribution.
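A Monte Carlo rendering of part (b) consists in simulating from the unconstrained posterior $\mathcal{N}_2(\hat\beta,(X^TX)^{-1})$ and reweighting by the indicator of $[0,1]^2$. The R sketch below does this on simulated data (the design, the true coefficients, and the sample sizes are arbitrary choices):

    # Monte Carlo estimate of E[b_i | y] under the constrained posterior
    set.seed(4)
    n <- 50
    X <- cbind(runif(n), runif(n))
    beta <- c(0.4, 0.8)
    y <- X %*% beta + rnorm(n)

    XtX.inv  <- solve(t(X) %*% X)
    beta.hat <- XtX.inv %*% t(X) %*% y
    library(MASS)                                          # for mvrnorm
    b <- mvrnorm(10^5, mu = drop(beta.hat), Sigma = XtX.inv)
    keep <- (b[, 1] > 0 & b[, 1] < 1 & b[, 2] > 0 & b[, 2] < 1)
    colSums(b * keep) / sum(keep)                          # estimates of E[b_i | y]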
(c) The denominator $P^\pi\!\left[(b_1,b_2)\in[0,1]^2\mid y_1,\ldots,y_n\right]$ does not have a known expression: in the double integral, the first integration involves the cdf $\Phi$ of the normal distribution with different moments than the normal density, and there is thus no closed form expression for the outer integral. The exception is indeed when $(X^tX)$ is diagonal because, then, both integrals separate.

Problem 1.25

We consider an iid sample $X_1,\ldots,X_n$ from $\mathcal{N}(\theta,\sigma^2)$. The prior on $(\theta,\sigma^2)$ is $\pi(\theta,\sigma^2) = \sigma^{-2(\alpha+1)}\exp\!\left(-\frac{s_0^2}{2\sigma^2}\right)$.

(a) The posterior distribution is
$$\pi(\theta,\sigma^2\mid x_1,\ldots,x_n) \propto \prod_{i=1}^n f(x_i\mid\theta,\sigma^2)\,\pi(\theta,\sigma^2) \propto \sigma^{-(n+2(\alpha+1))}\exp\!\left(-\frac{s_0^2+s^2+n(\bar x-\theta)^2}{2\sigma^2}\right),$$
where $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$ and $s^2 = \sum_{i=1}^n(x_i-\bar x)^2$. This posterior only depends on $\bar x$ and $s^2$.

(b) The expectation of $\sigma^2$ is
$$\begin{aligned}
\mathbb{E}^\pi[\sigma^2\mid x_1,\ldots,x_n] &= \frac{1}{(2\pi)^{n/2}}\int\!\!\int\sigma^{-(n+2(\alpha+1))}\exp\!\left(-\frac{s_0^2+s^2+n(\bar x-\theta)^2}{2\sigma^2}\right)d\theta\,d(\sigma^2) \\
&= \frac{1}{(2\pi)^{n/2}}\int\sigma^{-(n+2(\alpha+1))}\exp\!\left(-\frac{s_0^2+s^2}{2\sigma^2}\right)\left[\int\exp\!\left(-\frac{n(\theta-\bar x)^2}{2\sigma^2}\right)d\theta\right]d(\sigma^2) \\
&= \frac{1}{(2\pi)^{\frac{n-1}{2}}\sqrt n}\int\sigma^{-(n+2\alpha+1)}\exp\!\left(-\frac{s_0^2+s^2}{2\sigma^2}\right)d(\sigma^2) \\
&= \frac{1}{(2\pi)^{\frac{n-1}{2}}\sqrt n}\int\left(\frac{1}{\sigma^2}\right)^{\!\frac{n}{2}+\alpha-1}\exp\!\left(-\frac{s_0^2+s^2}{2}\,\frac{1}{\sigma^2}\right)d\!\left(\frac{1}{\sigma^2}\right)
= \frac{1}{(2\pi)^{\frac{n-1}{2}}\sqrt n}\,\frac{\Gamma(\frac{n}{2}+\alpha)}{\left(\frac{s_0^2+s^2}{2}\right)^{\frac{n}{2}+\alpha}}.
\end{aligned}$$

Problem 1.28

(a) If $X\sim\mathcal{G}(\theta,\beta)$, then
$$\pi(\theta\mid x,\beta) \propto \pi(\theta)\times(\beta x)^\theta/\Gamma(\theta)$$
and a family of functions $\pi(\theta)$ that are similar to the likelihood is given by $\pi(\theta)\propto\xi^\theta/\Gamma(\theta)^\alpha$, where $\xi>0$ and $\alpha>0$ (in fact, $\alpha$ could even be restricted to be an integer). This distribution is integrable when $\alpha>0$, thanks to the Stirling approximation $\Gamma(\theta)\approx\theta^{\theta-1/2}e^{-\theta}$.

(b) When $X\sim\mathcal{B}e(1,\theta)$, $\theta\in\mathbb{N}$, we have
$$f(x\mid\theta) = \frac{(1-x)^{\theta-1}}{B(1,\theta)} = \frac{\Gamma(1+\theta)\,(1-x)^{\theta-1}}{\Gamma(\theta)} = \theta\,(1-x)^{\theta-1},$$
and this suggests using a gamma-like distribution on $\theta$, $\pi(\theta)\propto\theta^m e^{-\alpha\theta}$, where $m\in\mathbb{N}$ and $\alpha>0$. This function is clearly summable, due to the integrability of the gamma density, and conjugate.

Problem 1.32

We consider the hierarchical model
$$X\mid\theta\sim\mathcal{N}_p(\theta,\sigma_1^2I_p), \qquad \theta\mid\xi\sim\mathcal{N}_p(\xi\mathbf{1},\sigma_\pi^2I_p), \qquad \xi\sim\mathcal{N}(\xi_0,\tau^2).$$

(a) The Bayesian estimator of $\theta$ is $\delta(x\mid\xi,\sigma_\pi) = \mathbb{E}^{\pi(\theta\mid x,\xi,\sigma_\pi)}[\theta]$. Using Bayes' formula,
$$\pi(\theta\mid x,\xi,\sigma_\pi) \propto f(x\mid\theta)\,\pi(\theta\mid\xi) \propto \exp\!\left(-\frac{\|x-\theta\|^2}{2\sigma_1^2}\right)\exp\!\left(-\frac{\|\theta-\xi\mathbf{1}\|^2}{2\sigma_\pi^2}\right)
\propto \exp\!\left(-\frac{1}{2}\,\frac{\sigma_\pi^2+\sigma_1^2}{\sigma_\pi^2\sigma_1^2}\left\|\theta - \frac{\sigma_\pi^2x+\sigma_1^2\xi\mathbf{1}}{\sigma_\pi^2+\sigma_1^2}\right\|^2\right),$$
which means that
$$(\theta\mid x,\xi,\sigma_1) \sim \mathcal{N}_p\!\left(\frac{\sigma_\pi^2x+\sigma_1^2\xi\mathbf{1}}{\sigma_\pi^2+\sigma_1^2},\,\frac{\sigma_\pi^2\sigma_1^2}{\sigma_\pi^2+\sigma_1^2}I_p\right).$$
Therefore,
$$\delta(x\mid\xi,\sigma_\pi) = \frac{\sigma_\pi^2x+\sigma_1^2\xi\mathbf{1}}{\sigma_\pi^2+\sigma_1^2} = x - \frac{\sigma_1^2}{\sigma_\pi^2+\sigma_1^2}(x-\xi\mathbf{1}).$$
Similarly,
$$\begin{aligned}
\pi_2(\xi,\sigma_\pi^2\mid x) &\propto \int f(x\mid\theta)\,\pi(\theta\mid\xi,\sigma_\pi)\,\pi(\xi)\,\pi_2(\sigma_\pi^2)\,d\theta \\
&\propto \frac{\pi_2(\sigma_\pi^2)}{(\sigma_\pi^2+\sigma_1^2)^{p/2}}\exp\!\left(-\frac{p(\bar x-\xi)^2+s^2}{2(\sigma_\pi^2+\sigma_1^2)} - \frac{(\xi-\xi_0)^2}{2\tau^2}\right),
\end{aligned}$$
where $s^2 = \sum_{i=1}^p(x_i-\bar x)^2$. Now, we can obtain $\pi_2(\xi\mid\sigma_\pi^2,x)$ as follows:
$$\pi_2(\xi\mid\sigma_\pi^2,x) \propto \frac{\pi_2(\xi,\sigma_\pi^2\mid x)}{\int\pi_2(\xi,\sigma_\pi^2\mid x)\,d\xi}
\propto \exp\!\left(-\frac{p\tau^2+\sigma_\pi^2+\sigma_1^2}{2\tau^2(\sigma_\pi^2+\sigma_1^2)}\left(\xi - \frac{p\tau^2\bar x+(\sigma_\pi^2+\sigma_1^2)\xi_0}{p\tau^2+\sigma_\pi^2+\sigma_1^2}\right)^{\!2}\right).$$
Therefore, $\pi_2(\xi\mid\sigma_\pi^2,x)$ is a normal distribution with mean
$$\mu(x,\sigma_\pi^2) = \frac{p\tau^2\bar x+(\sigma_\pi^2+\sigma_1^2)\xi_0}{p\tau^2+\sigma_\pi^2+\sigma_1^2}$$
and standard deviation
$$V_\pi(\sigma_\pi^2) = \tau\sqrt{\frac{\sigma_\pi^2+\sigma_1^2}{p\tau^2+\sigma_\pi^2+\sigma_1^2}} = \frac{\tau}{\sqrt{1+\frac{p\tau^2}{\sigma_\pi^2+\sigma_1^2}}}.$$

(b) We first compute $\pi_2(\sigma_\pi^2\mid x)$. This density is given by
$$\begin{aligned}
\pi_2(\sigma_\pi^2\mid x) &= \int\pi_2(\xi,\sigma_\pi^2\mid x)\,d\xi \\
&\propto \frac{\pi_2(\sigma_\pi^2)}{(\sigma_\pi^2+\sigma_1^2)^{p/2}}\exp\!\left(-\frac{s^2}{2(\sigma_\pi^2+\sigma_1^2)} - \frac{p(\bar x-\xi_0)^2}{2(p\tau^2+\sigma_\pi^2+\sigma_1^2)}\right)\int\exp\!\left(-\frac{p\tau^2+\sigma_\pi^2+\sigma_1^2}{2\tau^2(\sigma_\pi^2+\sigma_1^2)}\big(\xi-\mu(x,\sigma_\pi^2)\big)^2\right)d\xi \\
&\propto \frac{\tau\,\exp\!\left\{-\frac{s^2}{2(\sigma_\pi^2+\sigma_1^2)} - \frac{p(\bar x-\xi_0)^2}{2(p\tau^2+\sigma_\pi^2+\sigma_1^2)}\right\}}{(\sigma_\pi^2+\sigma_1^2)^{(p-1)/2}(p\tau^2+\sigma_\pi^2+\sigma_1^2)^{1/2}}\,\pi_2(\sigma_\pi^2).
\end{aligned}$$

(c) We deduce from the previous question that
$$\delta^\pi(x) = x - \mathbb{E}^{\pi_2(\sigma_\pi^2\mid x)}\!\left[\frac{\sigma_1^2}{\sigma_1^2+\sigma_\pi^2}\right](x-\bar x\mathbf{1}) - \mathbb{E}^{\pi_2(\sigma_\pi^2\mid x)}\!\left[\frac{\sigma_1^2}{\sigma_1^2+\sigma_\pi^2+p\tau^2}\right](\bar x-\xi_0)\mathbf{1}.$$
Problem 1.40

(a) The coherence condition is that, for every integer $k$, the prior $\pi_k$ is invariant under index permutations. This means that for every permutation $\sigma\in\mathfrak{S}_k$ of $\{1,\ldots,k\}$,
$$\pi_k(p_1,\ldots,p_k) = \pi_k(p_{\sigma(1)},\ldots,p_{\sigma(k)}).$$

(b) When $\pi_k$ is a Dirichlet distribution $\mathcal{D}_k(\alpha_1,\ldots,\alpha_k)$, that is,
$$\pi_k(p_1,\ldots,p_k) = \frac{\Gamma(\alpha_1+\cdots+\alpha_k)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)}\,p_1^{\alpha_1-1}\cdots p_k^{\alpha_k-1},$$
the coherence condition becomes $\alpha_1 = \cdots = \alpha_k$.

(d) The prior $\pi_k(p_k) = \prod_i p_i^{-1/k}$ is invariant under permutation of the indices of the $p_i$'s. Therefore, this distribution satisfies the coherence property.
Chapter 2

Problem 2.1

(a) This is rather straightforward when using R:

    hist(runif(10^3),proba=T,nclass=80,xlab="x",main="Uniform histogram")

as shown by the output
[Figure] Fig. B.1. Histogram of a sample of $10^3$ points produced by runif (x-axis: x; y-axis: Density).
(b) Similarly, in R:
    N=10^4
    X=runif(N)
    plot(X[1:(N-1)],X[2:N],xlab=expression(X[i]),
         ylab=expression(X[i+1]),cex=.3,pch=19)
shows a fairly uniform repartition over the unit square:

[Figure] Fig. B.2. Pairwise plot of $10^4$ points produced by runif (axes: $X_i$ and $X_{i+1}$).
Problem 2.4

The arcsine distribution was discussed in Example 2.2.

(a) We have that $f(x) = \frac{1}{\pi\sqrt{x(1-x)}}$. Consider $y = h(x) = 1-x$. This is a strictly decreasing function of $x$ and, therefore,
$$f(y) = f\big(h^{-1}(y)\big)\left|\frac{dx}{dy}\right| = \frac{1}{\pi\sqrt{(1-y)(1-(1-y))}}\,|-1| = \frac{1}{\pi\sqrt{y(1-y)}}.$$
Thus, the arcsine distribution is invariant under the transformation $h(x) = 1-x$.

(b) We need to show that $D$ is distributed as $\mathcal{U}_{[0,1]}$. Now, for $0\le d\le 1$,
$$\begin{aligned}
P(D\le d) &= P\!\left(2x\le d,\ x\le\tfrac12\right) + P\!\left(2(1-x)\le d,\ x>\tfrac12\right) \\
&= P\!\left(2x\le d\mid x\le\tfrac12\right)P\!\left(x\le\tfrac12\right) + P\!\left(2(1-x)\le d\mid x>\tfrac12\right)P\!\left(x>\tfrac12\right) \\
&= \frac{1}{2}(d+d) = d.
\end{aligned}$$
From this, we conclude that $D$ is indeed distributed as $\mathcal{U}_{[0,1]}$.

Problem 2.6

(a) Let $U\sim\mathcal{U}_{[0,1]}$ and $X = h(U) = -\log(U)/\lambda$. The density of $X$ is $|J_{h^{-1}}(x)| = \lambda e^{-\lambda x}$, where $J_{h^{-1}}$ is the Jacobian of the inverse of the transformation $h$. Then $X\sim\mathcal{E}xp(\lambda)$.
(b) Let $(U_i)$ be a sequence of iid $\mathcal{U}_{[0,1]}$ random variables and
$$Y = -2\sum_{j=1}^\nu\log(U_j).$$
Then $Y$ is a sum of $\nu$ $\mathcal{E}xp(1/2)$ rv's, that is, of $\nu$ $\chi^2_2$ rv's. By definition, $Y$ is thus a chi-squared variable: $Y\sim\chi^2_{2\nu}$.

Let $Y = -\frac{1}{\beta}\sum_{j=1}^a\log(U_j)$. Using the previous result, $2\beta Y\sim\chi^2_{2a}$, that is, $2\beta Y\sim\mathcal{G}a(a,1/2)$. Therefore, $Y\sim\mathcal{G}a(a,\beta)$.

Let
$$Y = \frac{\sum_{j=1}^a\log(U_j)}{\sum_{j=1}^{a+b}\log(U_j)} = \frac{1}{1+Y_2/Y_1},$$
where $Y_1 = -\sum_{j=1}^a\log(U_j)$ and $Y_2 = -\sum_{j=a+1}^{a+b}\log(U_j)$. Using the previous result with $\beta = 1$, we have $Y_1\sim\mathcal{G}a(a,1)$ and $Y_2\sim\mathcal{G}a(b,1)$. Denote by $f$, $f_1$ and $f_2$ the density functions of $Y$, $Y_1$ and $Y_2$, respectively. Using the transformation $h:(Y_1,Y_2)\to(Y,Y_2)$, we obtain
$$\begin{aligned}
f(y) &= \int_0^\infty|J_{h^{-1}}(y,y_2)|\,f_1\!\left(\frac{y\,y_2}{1-y}\right)f_2(y_2)\,dy_2 \\
&= \int_0^\infty\frac{y^{a-1}}{\Gamma(a)\Gamma(b)(1-y)^{a+1}}\,y_2^{a+b-1}e^{-\frac{y_2}{1-y}}\,dy_2 \\
&= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,y^{a-1}(1-y)^{b-1} = \frac{y^{a-1}(1-y)^{b-1}}{B(a,b)}.
\end{aligned}$$
Therefore, $Y$ is a $\mathcal{B}e(a,b)$ random variable.
(c) We use the fact that if $X\sim\mathcal{F}_{m,n}$, then $\frac{nX}{m+nX}\sim\mathcal{B}e(m,n)$. Thus, we simulate $Y\sim\mathcal{B}e(m,n)$ using the previous result and take
$$X = \frac{mY}{n(1-Y)}.$$

(d) Let $U\sim\mathcal{U}_{[0,1]}$ and $X = h(U) = \log\frac{U}{1-U}$. The density of $X$ is then
$$f(x) = |(h^{-1})'(x)| = \frac{e^{-x}}{(1+e^{-x})^2}$$
with $h^{-1}(x) = \frac{e^{-x}}{1+e^{-x}}$. Then $X$ is a Logistic$(0,1)$ random variable. Using the fact that $\beta X+\mu$ is a Logistic$(\mu,\beta)$ random variable, generating a $\mathcal{U}_{[0,1]}$ variable and taking $X = \beta\log\frac{U}{1-U}+\mu$ gives a Logistic$(\mu,\beta)$ variable.
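A compact R sketch of the transformations in this problem, producing exponential, chi-squared, gamma (integer shape), and beta variates from uniforms; all parameter values are arbitrary illustrations:

    set.seed(5)
    lambda <- 2; nu <- 3; a <- 4; b <- 2; n <- 10^4
    U <- matrix(runif(n * (a + b)), nrow = n)

    x.exp   <- -log(U[, 1]) / lambda                      # Exp(lambda)
    x.chisq <- -2 * rowSums(log(U[, 1:nu]))               # chi^2 with 2*nu df
    x.gamma <- -rowSums(log(U[, 1:a])) / lambda           # Ga(a, lambda)
    x.beta  <- rowSums(log(U[, 1:a])) / rowSums(log(U[, 1:(a + b)]))  # Be(a, b)
    c(mean(x.gamma), a / lambda)                          # quick sanity check of the mean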
Problem 2.7

(a) Consider the transformation $h:(U_1,U_2)\to(X_1,X_2)$. The joint density function of $(X_1,X_2)$ is $f_{X_1,X_2}(x_1,x_2) = |J_{h^{-1}}(x_1,x_2)|$, where $J_{h^{-1}}$ is the Jacobian of the inverse transform $h^{-1}$. The transform $h^{-1}$ is given by
$$h^{-1}(x_1,x_2) = \left(e^{-\frac{x_1^2+x_2^2}{2}},\ \frac{\arctan(x_1/x_2)}{2\pi}\right)$$
and $|J_{h^{-1}}|$ is
$$\left|\det\begin{pmatrix} -x_1e^{-(x_1^2+x_2^2)/2} & -x_2e^{-(x_1^2+x_2^2)/2} \\ \frac{x_2}{2\pi(x_1^2+x_2^2)} & \frac{-x_1}{2\pi(x_1^2+x_2^2)} \end{pmatrix}\right| = \frac{e^{-(x_1^2+x_2^2)/2}}{2\pi}.$$
Therefore,
$$f_{X_1,X_2}(x_1,x_2) = \frac{1}{2\pi}e^{-\frac{x_1^2+x_2^2}{2}} = \frac{1}{\sqrt{2\pi}}e^{-\frac{x_1^2}{2}}\times\frac{1}{\sqrt{2\pi}}e^{-\frac{x_2^2}{2}}.$$
Thus, $X_1$ and $X_2$ are iid $\mathcal{N}(0,1)$.

(b) The random variable $r^2$ is the sum of two squared normal variables, so it is a central chi-squared variable with two degrees of freedom, $r^2\sim\chi^2_2$. The random variable $\theta$ is equal to $2\pi U_2$ and is uniform on $[0,2\pi]$.

(c) The variable $e^{-r^2/2}$ is equal to $U_1$, thus is uniform $\mathcal{U}_{[0,1]}$. Both $r^2$ and $\theta$ can be directly simulated from the uniform $\mathcal{U}_{[0,1]}$ distribution: simulate $U_1$ and $U_2$ from $\mathcal{U}_{[0,1]}$ and take $-2\log(U_1)$ and $2\pi U_2$ for $r^2$ and $\theta$, respectively.
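A minimal R implementation of the Box-Muller transform derived above (the function name is ours):

    # Box-Muller: 2n independent N(0,1) variates from 2n independent uniforms
    boxmuller <- function(n) {
      u1 <- runif(n); u2 <- runif(n)
      r <- sqrt(-2 * log(u1))                  # r^2 = -2 log(U1) ~ chi^2_2
      theta <- 2 * pi * u2                     # theta ~ U[0, 2*pi]
      c(r * cos(theta), r * sin(theta))
    }
    z <- boxmuller(10^4)
    c(mean(z), var(z))                          # should be close to 0 and 1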
Problem 2.8 (a) For the faster version of the Box-Muller algorithm, given a square [a, b] × [c, d] included in the unit sphere, the probability of (U1 , U2 ) being in [a, b]× [c, d] conditional on S ≤ 1 is
$$P\big((U_1,U_2)\in[a,b]\times[c,d]\mid S\le1\big) = \frac{\int_a^b\int_c^d \frac{du_1}{2}\frac{du_2}{2}}{\int_{-1}^1\int_{-\sqrt{1-u_2^2}}^{\sqrt{1-u_2^2}}\frac{du_1}{2}\frac{du_2}{2}} = \frac{(b-a)(d-c)/4}{\pi/4} = \frac{(b-a)(d-c)}{\pi}.$$
Therefore $(U_1,U_2)\mid S\le1$ is uniform in the unit sphere. We write $X_1, X_2$ as
$$X_1 = \sqrt{-2\log(S)}\,\frac{U_1}{\sqrt S}, \qquad X_2 = \sqrt{-2\log(S)}\,\frac{U_2}{\sqrt S}.$$
This is the same as in the standard Box-Muller algorithm with $U_1' = S$ and $U_2' = \tan^{-1}(U_2/U_1)$, which implies that $X_1$ and $X_2$ are iid $\mathcal{N}(0,1)$.

(b) The average number of generations in this algorithm is
$$P(S\le1) = \int_{-1}^1\int_{-\sqrt{1-u_2^2}}^{\sqrt{1-u_2^2}}\frac{du_1\,du_2}{2\cdot2} = \frac{\pi}{4}\simeq0.785.$$
It is equal to the ratio of the surface ($\pi$) of the unit sphere to the surface ($4$) of the square $[-1,1]\times[-1,1]$. The original Box-Muller accepts all the $(U_1,U_2)$'s while the algorithm [A.8] accepts only about 78.5% of them. In this restricted sense, [A.3] is better than [A.8]. But [A.3] uses the sine and cosine functions and it thus needs more computation time. An experiment generating a sample of 500,000 normal rv's takes about one second and ten seconds for [A.3] and [A.8], respectively.

(c) If we do not constrain $(U_1,U_2)$ to the unit circle, the distribution is no longer normal. In fact, if $s>1$, $\sqrt{-2\log(s)}$ is not even defined!
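For comparison, a sketch of the polar version [A.8] in R, which also makes the approximately 78.5% acceptance rate easy to check empirically (the function name and sample sizes are ours):

    # Polar (Marsaglia) method: Box-Muller without sine/cosine calls
    polar.normal <- function(n) {
      out <- numeric(0)
      while (length(out) < n) {
        u1 <- runif(n, -1, 1); u2 <- runif(n, -1, 1)
        s <- u1^2 + u2^2
        ok <- s < 1 & s > 0                    # acceptance rate is about pi/4
        fac <- sqrt(-2 * log(s[ok]) / s[ok])
        out <- c(out, u1[ok] * fac, u2[ok] * fac)
      }
      out[1:n]
    }
    z <- polar.normal(10^5)
    c(mean(z), var(z))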
Problem 2.9

The cdf of $X$ generated in [A.9] is
$$F(x) = P(X\le x) = P\big(Y_1(2\,\mathbb{I}_{U\le1/2}-1)\le x\mid Y_2>(1-Y_1)^2/2\big)
= \frac{1}{2}\,\frac{P(Y_1\le x,\ Y_2>(1-Y_1)^2/2) + P(Y_1\ge-x,\ Y_2>(1-Y_1)^2/2)}{P(Y_2>(1-Y_1)^2/2)}.$$
The denominator is
$$P(Y_2>(1-Y_1)^2/2) = \int_0^\infty e^{-y_1}\left(\int_{(1-y_1)^2/2}^\infty e^{-y_2}\,dy_2\right)dy_1 = \frac{1}{2}\sqrt{\frac{2\pi}{e}}$$
and the numerator is
$$\begin{aligned}
P\big(Y_1(2\,\mathbb{I}_{U\le1/2}-1)\le x,\ Y_2>(1-Y_1)^2/2\big)
&= \frac{1}{2}\left[\int_0^x e^{-y_1}\!\left(\int_{(1-y_1)^2/2}^\infty e^{-y_2}dy_2\right)dy_1\,\mathbb{I}_{x\ge0}
+ \int_{\max(-x,0)}^\infty e^{-y_1}\!\left(\int_{(1-y_1)^2/2}^\infty e^{-y_2}dy_2\right)dy_1\right] \\
&= \frac{1}{2\sqrt e}\int_{-\infty}^x e^{-\frac{y_1^2}{2}}\,dy_1.
\end{aligned}$$
Therefore,
$$F(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-\frac{y_1^2}{2}}\,dy_1$$
and $X$ is a normal variable.

Problem 2.10

The algorithm based on the Central Limit Theorem is stated as follows: if $U_1,\ldots,U_n$ are independent and identically distributed $\mathcal{U}_{[0,1]}$,
$$Y_n = \sqrt n\,\frac{\bar U_n-\mu}{\sigma} = \sqrt{12n}\left(\bar U_n-\frac{1}{2}\right) \approx \mathcal{N}(0,1).$$

(a) As the $U_i$'s are $\mathcal{U}_{[0,1]}$, the biggest value $U_i$ can take is 1 and the smallest value is 0. Hence, $Y_n\in\big[-\sqrt{3n},\sqrt{3n}\big]$, while the $\mathcal{N}(0,1)$ distribution has the range $(-\infty,\infty)$.

(b) Now consider the moments of the random variable $Y_n$ generated by the above algorithm. Since the moment generating function of a $\mathcal{U}_{[0,1]}$ rv is
$$M_U(t) = \mathbb{E}\big[e^{Ut}\big] = \frac{e^t-1}{t},$$
the mgf of $Y_n$ is
$$M_n(t) = \mathbb{E}\big[e^{tY_n}\big] = \mathbb{E}\big[e^{\sqrt{12n}(\bar U_n-\frac12)t}\big] = e^{-\sqrt{3n}\,t}\,\frac{n^{n/2}}{(2\sqrt3\,t)^n}\left(e^{\frac{2\sqrt3\,t}{\sqrt n}}-1\right)^{\!n}.$$

(c) If we use a formal language like Mathematica to calculate $m_i = M_n^{(i)}(t)\big|_{t=0}$ ($i = 1,\ldots,4$), it appears that $m_1 = m_3 = 0$ and $m_2 = 1$, as in the $\mathcal{N}(0,1)$ case. The fourth central moment (kurtosis) satisfies $m_4 = 3 - \frac{6}{5n}$, which depends on $n$. Therefore, $Y_n$ has a lighter tail than a standard normal.
Problem 2.11

Note that there is a typo in the cdf of question (b), in that a $1/2$ is missing after $\tan^{-1}(x)/\pi$.

(a) Cauchy from quotient of normals: let $X_1, X_2\sim\mathcal{N}(0,1)$, $Y_1 = X_1/X_2$ and $Y_2 = X_2$. This transformation is one-to-one with $X_1 = Y_1Y_2$, $X_2 = Y_2$ and $|J| = |Y_2|$. We have
$$f_{Y_1,Y_2}(y_1,y_2) = f_{X_1,X_2}(y_1y_2,y_2)\,|J| = \frac{1}{2\pi}\exp\!\left(-\frac{y_2^2}{2}(y_1^2+1)\right)|y_2|.$$
Integrating $Y_2$ out of the joint pdf yields
$$f_{Y_1}(y_1) = \int_{-\infty}^\infty f_{Y_1,Y_2}(y_1,y_2)\,dy_2 = 2\int_0^\infty\frac{y_2}{2\pi}\exp\!\left(-\frac{y_2^2}{2}(y_1^2+1)\right)dy_2 = \frac{2}{2\pi}\,\frac{1}{1+y_1^2} = \frac{1}{\pi}\,\frac{1}{1+y_1^2}.$$

(b) Cauchy from uniform: let $\theta\sim\mathcal{U}_{[-\pi/2,\pi/2]}$ and let $y = g(\theta) = \tan(\theta)$. This transformation is one-to-one, $\theta = g^{-1}(y) = \tan^{-1}(y)$, and
$$f_Y(y) = f_U(y)\,|J(y)| = \frac{1}{\pi}\,\frac{1}{1+y^2},$$
where $-\infty<y<\infty$. So $Y\sim\mathcal{C}(0,1)$.

(c) Comparison: we assume that the same uniform random generator is used for all purposes discussed here. Normal variates are generated with the Box-Muller method. Consequently, the generation of each normal variate requires the generation of more than a single uniform deviate, given the restriction $S = U_1^2+U_2^2\le1$. Therefore the generation of a single Cauchy deviate as a ratio of normals will require the computation of more than two uniforms on average. On the other hand, generating Cauchy deviates via inversion requires the computation of exactly one uniform on average. This would seem to favor inversion over the ratio of normals. Formally, the comparison of overall efficiency comes down to the question of how much slower the computation of the arctangent is on average compared to the natural logarithm, compared to the computation of the uniform deviates. A simulation confirmed that the ratio of normals is about half the speed of inversion.

Problem 2.13

(a) Suppose that we have a sequence of $\mathcal{E}xp(\lambda)$ rv's $X_i$ with cdf $F$. Then the cdf of the sum $X_1+\cdots+X_n$ is the convolution
$$F_{X_1+\cdots+X_n}(x) = F^{n*}(x) = 1 - e^{-x\lambda}\sum_{k=0}^{n-1}\frac{(x\lambda)^k}{k!}$$
and the probability that $X_1+\cdots+X_n\le1\le X_1+\cdots+X_{n+1}$ is
$$F^{n*}(1) - F^{(n+1)*}(1) = \left(1-e^{-\lambda}\sum_{k=0}^{n-1}\frac{\lambda^k}{k!}\right) - \left(1-e^{-\lambda}\sum_{k=0}^{n}\frac{\lambda^k}{k!}\right) = e^{-\lambda}\frac{\lambda^n}{n!} = P_\lambda(N=n).$$
(b) Given the previous question, the following algorithm generates from a Poisson $\mathcal{P}(\lambda)$ distribution:

0. Take $s = 0$, $c = -1$.
1. Repeat
     Generate $U\sim\mathcal{U}([0,1])$
     $c = c+1$
     $s = s - \log(U)/\lambda$
   until $s > 1$.
2. Take $X = c$.

Now we can improve the efficiency of this algorithm: we are taking sums of logarithms ($\log(u_1)+\log(u_2)$, $\log(u_1)+\log(u_2)+\log(u_3)$, and so on) to compute $s$. It would be just as easy to multiply the $u_i$ together first and then take the log. Note that we get out of the loop when
$$-\sum\frac{\log(u_i)}{\lambda} > 1,$$
that is, when $-\sum\log(u_i) > \lambda$, or $-\lambda > \sum\log(u_i)$, or yet $e^{-\lambda} > \prod u_i$. This leads to the algorithm suggested in the problem.
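A direct R transcription of this product-of-uniforms version (the function name is ours):

    # Poisson(lambda) via products of uniforms: count how many uniforms can be
    # multiplied before the product drops below exp(-lambda)
    rpois.unif <- function(lambda) {
      x <- 0
      prod.u <- runif(1)
      while (prod.u >= exp(-lambda)) {
        prod.u <- prod.u * runif(1)
        x <- x + 1
      }
      x
    }
    mean(replicate(10^4, rpois.unif(3)))   # should be close to lambda = 3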
Problem 2.14

In this problem, we give two ways of establishing (2.4) of Example 2.12.

(a) We consider the transform $h_1:(u,y)\to(x = yu^{1/\alpha},\,w = u)$. Then, the cdf of $X$ is
$$F(t) = P(X\le t) = \int_0^t\int_0^1 f_Y(xw^{-1/\alpha})\,|J_{h_1^{-1}}(x,w)|\,dw\,dx = \int_0^t\left(\int_0^1\frac{x^\alpha}{\alpha\Gamma(\alpha)}\,w^{-(1+1/\alpha)}e^{-xw^{-1/\alpha}}\,dw\right)dx.$$
In the computation of the integral $\int_0^1 w^{-(1+1/\alpha)}e^{-xw^{-1/\alpha}}\,dw$, we make the change of variable $w = s^{-\alpha}$, with $dw = -\alpha s^{-(\alpha+1)}\,ds$. It becomes
Z
1
w−(1+1/α) e−xw
−1/α
dw = α
Z
∞
e−xs ds = α
1
0
23
e−x . x
Substituting this result in the expression of F (t), we obtain F (t) =
Z
t
0
xα−1 −x e dx, Γ (α)
which means that X ∼ Ga(α, 1). (b) Here, we use the transform h2 : (y, z) −→ (x = yz, w = z). Then the cdf of X is F (t) = P (X ≤ t) Z t Z ∞ x fY ( )fZ (w)|Jh−1 (x, w)|dw dx = 2 w x 0 Z t Z ∞ α−1 (x/w) (1 − x/w)−α e−w = dw dx, B(α, 1 − α) x 0 that is F (t) =
Z
0
t α−1
x
Γ (α, 1 − α)
Z
∞
x
−α −w
(w − x)
e
dw dx.
R∞ To compute the integral x (w − x)−α e−w dw, we use the change of variables w = s + x. We obtain Z ∞ Z ∞ −α −w −x s−α e−s ds = e−x Γ (1 − α). (w − x) e dw = e 0
x
Substituting this quantity in the expression of F (t), we get F (t) =
Z
0
t
xα−1 −x e dx. Γ (α)
Therefore, X ∼ Ga(α, 1). Problem 2.17 (a) To find the distribution of U(i) over all possibilities of ordering U1 , ..., Un into U(i) ≤ U(2) ≤, ..., ≤ U(n) , let Sn be the set of the permutations σ of {1, ..., n}. (It contains n! elements.) The cdf of U(i) is
FU(i) (u) = P (U(i) ≤ u) Z X Z u Z uσ(2) duσ(1) ( ( [ = Z ×(
1
uσ(i)
= =
X Z
σ∈Sn Z u 0
=
Z
0
u
0
0
σ∈Sn
Z duσ(+1) (
0
Z duσ(2) (...(
1
duσ(i+2) (...
uσ(i+1)
u
0
uσ(3)
ui−1 (1 − u(i) )n−i (i) duσ(i) (i − 1)! (n − i)!
Z
uσ(i)
duσ(i−1) ) . . .))) 0
1
duσ(n) )) . . .)duσ(i) ] uσ(n−1)
n! x(i−1) (1 − x)n−i dx (i − 1)!(n − i)!
x(i−1) (1 − x)n−i dx. B(i, n − i + 1)
Therefore, U(i) is distributed as a Be(i, n − i + 1) variable. (b) The same method used for (U(i1 ) , U(i2 ) − U(i1 ) , ..., U(ik ) − U(ik−1 ) , 1 − U(ik ) ) gives F (u1 , u2 , ..., uk ) =
k Z X Y ( [
σ∈Sn j=1
(...
=
Z
uσ(ij )
uj 0
Z (
uσ(ij−1 +2)
0
duσ(ij −1) )...)duσ(j) )]I{Pk
j=1
0
X Z
σ∈Sn
duσ(ij−1 +1)
[0,u1 ]×...×[0,uk ]
I{Pk
j=1
uij =1}
uσ(ij ) =1}
i −1 k Y uijj
j=1
(ij − 1)!
n! (i1 − i0 − 1)!...(ik+1 − ik − 1)! Z i −ik −1 P × [uii11 −i0 −1 ...uik+1 I{ k k
dui1 ...duik
=
j=1
[0,u1 ]×...×[0,uk ]
Γ (n + 1) Γ (i1 )Γ (i2 − i1 )...Γ (n − ik ) Z i −ik −1 P × [uii11 −i0 −1 ...uik+1 I{ k k
uij =1} ]dui1 ...duik
=
j=1
[0,u1 ]×...×[0,uk ]
uij =1} ]dui1 ...duik
where i0 = 0 and ik+1 = n for the sake of simplicity. Therefore, (U(i1 ) , U(i2 ) −U(i1 ) , ..., U(ik ) −U(ik−1 ) , 1−U(ik ) ) is distributed as a Dirichlet Dk+1 (i1 , i2 − i1 , ..., n − ik + 1). (c) Let U and V be iid random variables from U[0,1] and let X=
U 1/α . + V 1/β
U 1/α
Consider the transform h : (u, v) −→
x=
u1/α ,y = v 1/α u + v 1/β
The distribution of X conditional on U 1/α + V 1/β ≤ 1 is P (X ≤ x|U
1/α
+V
1/β
R x R (1−z)β ( |Jh−1 (z, y)|dy)dz ≤ 1) = 0 0 1/α P (U + V 1/β ≤ 1)
R x R (1−z)β α/β α−1 ( αy z (1 − z)−α−1 dy)dz = 0 0 R 1 R (1−v1/β )α ( 0 du)dv 0 Rx (β(1 − z)α+β (α + β)−1 αz α−1 (1 − z)−α−1 dz) = 0 R1 (1 − v 1/β )α dv 0
R1 For computing 0 (1 − v 1/β )α dv, we use the change of variable v = tβ . Then dv = βtβ−1 dt and Z 1 Z 1 αβΓ (α)Γ (β) tβ−1 (1 − t)α dt = βB(α+1, β) = (1 − v 1/β )α dv = β . (α + β)Γ (α + β) 0 0 Therefore, P (X ≤ x|U 1/α + V 1/β ≤ 1) =
Γ (α)Γ (β) Γ (α + β)
And X is a Be(α, β) random variable. (d) The Renyi representation of Pi
j=1 u(i) = Pn j=1
νj νj
Z
0
x
z α−1 (1 − z)β−1 dz.
,
where the νj ’s are iid ∼ Exp(1) is the same as the representation of Pa j=1 log (Uj ) Y = Pa+b j=1 log (Uj )
with a = i, b = n − i and νj = − log(Uj ). Problem 2.6 implies that Y is a Be(a, b) rv and U(i) is Be(i, n − i) distributed. Note that the U(i) ’s are ordered, U(1) ≤ U(2) ≤ ... ≤ U(n) . Thus, the Renyi representation gives the order statistics for a uniform sample. Problem 2.23 (a) Let Y1 ∼ Ga(α, 1) and Y2 ∼ Ga(β, 1). To compute the density function of X = Y1 /(Y1 + Y2 ), consider the transform
h : (y1 , y2 ) −→
x=
y1 , y = y2 . y1 + y2
which is one-by-one from R × R into [0, 1] × R. The inverse transform is xy h−1 : (x, y) −→ y1 = , y2 = y 1−x and its Jacobian is Jh−1 (x, y) = Z
∞
fY1
xy 1−x
y (1−x)2 .
Thus, the density of X is
fY2 (y)|Jh−1 (x, y)|dy Z ∞ y xα−1 y α+β−1 e− 1−x dy = Γ (α)Γ (β)(1 − xα+1 ) 0 xα−1 (1 − x)β−1 Γ (α + β) α−1 x (1 − x)β−1 = . = Γ (α)Γ (β) B(α, β)
f (x) =
0
Hence, X ∼ Be(α, β). (b) The following algorithm generates a beta random variable: 1. Generate Y1 , Y2 from Ga(α, 1) and Ga(β, 1) respectively, using Ahrens and Dieter Gamma algorithm. 2. Take Y1 X= . Y1 + Y2 Problem 2.27 To establish the validity of Lemma 2.27, consider
P (X ≤ x0 ) = P
=
Rx0
−∞ +∞ R
−∞
1−
εf2 (x) f1 (x)
1−
εf2 (x) f1 (x)
1 = 1−ε =
Zx0
−∞
=
=
1 1−ε Zx0
−∞
Zx0
−∞
Rx0 R1 duf1 (x) dx −∞ εf2 /f1 f 2 X ≤ x0 U ≥ ε = +∞ 1 R R f1 duf1 (x) dx
Zx0
−∞
Rx0
f1 (x) dx =
−∞
f1 (x) dx
εf2 (x) 1− f1 (x)
εf2 (x) 1− f1 (x)
27
−∞ εf2 /f1
εf2 (x) f1 (x)
1−
1−ε
+∞ R
f1 (x) dx
f2 (x) dx
−∞
f1 (x) dx
f1 (x) dx
∞ X
εi
i=0
f1 (x) − εf2 (x) f1 (x)
f1 (x) − εf2 (x) 1−ε
dx =
f1 (x) dx Zx0
f (x) dx .
−∞
This shows that X is indeed distributed from f . Problem 2.30 (a) The probability of accepting a random variable X is
P
U≤
=
+∞ Z
−∞
f (X) M g (X)
f (x)
=
+∞ M Z Zg(x)
−∞
du g (x) dx
0
1 f (x) g (x) dx = M g (x) M
+∞ Z 1 f (x) dx = M
−∞
(Note that this also is the ratio of the surface of the subgraph of f over the surface of the subgraph of M g, following from the Fundamental Theorem 2.15.) (b) If f and g are properly normalized densities then f (x) ≤ M g (x) , ∀x ∈ {x|f (x) 6= 0} and
M
Z
g (x) dx ≥
Z
f (x) dx ⇒ M ≥ 1 .
(c) Suppose that the bound M is not tight and that there exists a bound M ′ such that M ′ < M and f (x) ≤ M ′ g (x). Then the Accept-Reject algorithm based on M remains valid. The acceptance probability can be rewritten as follows f (X) P (Y ≤ y) = P Y ≤ y|U ≤ M g (X) −1 f (x) f (x) +∞ M Z Zy MZg(x) Zg(x) = du g (x) dx du g (x) dx −∞
−∞
0
0
+∞ −1 Zy Z f (x) f (x) = g (x) dx g (x) dx M g (x) M g (x) −∞
=
Zy
−∞
f (x) dx ,
−∞
that is, the accepted random variables have density f . However, note that the acceptance probability is in this case smaller than the acceptance probability of the algorithm based on the smallest bound 1 1 < ′ M M It may be convenient to use M instead of M ′ when the evaluation of the bound M ′ is too costly. For example if we want to simulate from a Ga (α, β) we can use the instrumental distribution Ga(a, b) where a = [α] and b = aβ α . The ratio between the two densities is a a −bx −1 b x e β α xα e−βx Γ (α) Γ (a) Γ (a) β α α−a −x(β−b) x e = Γ (α) ba
p (x) =
and by considering the first order conditions: (α − a) xα−a−1 e−x(β−b) − (β − b) xα−a e−x(β−b) = 0 that is, x=
α−a , β−b
we conclude that the ratio is bounded by the following constant:
Γ (a) β α p (x) ≤ Γ (α) ba
α−a β−b
α−a
29
e−(α−a) = M ′
(a) Note that ΓΓ (α) ≤ 1 implies that the previous quantity can be approximated by a computationally simpler bound α−a βα α − a e−(α−a) = M p (x) ≤ a b β−b
(d) When the bound is too tight the algorithm produces random variables with the following distribution. Suppose there exists a set A = {x ⊆ suppf (x) |f (x) > 0, f (x) > M g (x)} . Then it follows that F˜ (y) = P
=
Ry R1
−∞ 0 +∞ R R1 −∞ 0
=
=
Ry
f (x) Y ≤ y|U ≤ M g(x)
IA (x) + IAc (x) If (x)/M g(x) (u) du g(x)dx IA (x) + IAc (x)If (x)/M g(x) (u) du g(x)dx
IA (x)g(x)dx +
−∞ Ry
IA (x) g (x) dx + IA (x) g (x) dx +
−∞ +∞ R
Ry
IAc (x)
−∞
−∞
+∞ R
IA (x) g (x) dx +
dug (x) dx
0
+∞ R
−∞ Ry
−∞ +∞ R −∞
−∞
f (x)/M R g(x)
IAc (x)
f (x)/M R g(x)
dug (x) dx
0
(x) dx IAc (x) fM (x) dx IAc (x) fM
where A ∪ Ac = suppf (x). Therefore the probability distribution is f˜ (y) =
+∞ R
−∞
=
(
IA (y) M g (y) + IAc (y) f (y) = +∞ R IAc (x) f (x) dx IA (x) M g (x) dx + −∞
kM g (y)
if
kf (y)
if
y ∈ {y|f (y) >
y∈ / {y|f (y) >
= k min {f (x), M g(y)} where k =
+∞ R
−∞
IA (x) M g (x) dx +
+∞ R
−∞
f (y) M g(y) } f (y) M g(y) }
IAc (x) f (x) dx
!−1
.
As a special case, note that if Ac = ∅, then f˜ (y) =
IA (y) g (y) +∞ R
IA (x) g (x) dx
−∞
while, if A = suppf (x) = suppg(x), then f˜(x) = g(x). (e) If the bound M is not tight and if we know a lower bound M ′ then it is possible to recycle the rejected random variables in applying twice the Accept-Reject algorithm. The first step is given by the usual acceptreject algorithm. The second step is a consequence of Lemma 2.27. In fact, the rejected random variables are generated as in Lemma 2.27 and the distribution of the rejected variables satisfy the hypothesis of Lemma 2.27. The distribution of the rejected variables is given by f (z) = P (Z ≤ z) = P Z ≤ z|U > M g (z) Rz R1 Rz (x) du g (x) dx g (x) − fM dx −∞ f (x)/M g(x) −∞ = +∞ = +∞ R R R1 (x) dx g (x) − fM du g (x) dx −∞
−∞ f (x)/M g(x)
with probability density
P (Z ≤ z) =
1 g (x) − M f (x) g (x) − ρf (x) = 1 1−ρ 1− M
and it can be used as instrumental distribution for introducing a second accept-reject procedure. In fact the hypothesis for the accept-reject algorithm are satisfied (the ratio is bounded): f (z) g(z)−ρf (z) 1−ρ
=
(1 − ρ) f (z) (1 − ρ) f (z) ≤ g (z) − ρf (z) f (z) M1 ′ − ρf (z)
that is, f (z) g(z)−ρf (z) 1−ρ
≤
1 1− M M −1 1 = M 1 − M′ M M′ − 1
M Furthermore, if it is impossible to evaluate the ratio M ′ then an importance sampling argument can be applied in order to recycle the rejected random variables. If Z1 , . . . , Zt are the rejected random variables and Y1 , . . . , Yn the accepted random variables, then it is possible to built the following unbiased estimator of Ef (h): t
δ0 =
1 X (1 − ρ) f (Zj ) h (Zj ) t j=1 g (Zj ) − ρf (Zj )
Monte Carlo Statistical Methods
31
and then to consider a convex combination of the usual Accept-Reject estimator and of the previous quantity ! n n 1X n t−n t−n δ2 = δ0 = δ1 + δ0 h (yi ) + t n i=1 t t t Note that the estimators δ0 and δ1 are both unbiased: Z
n
1X h (yi ) f (yi ) dyi n i=1 Z 1 = n h (y) f (y) dy = Ef (h) n
Ef (δ1 |t) =
and Z
t
1 X (1 − ρ) f (zj ) g (zj ) − ρf (zj ) Ef (δ0 |t) = h (zj ) dzj t j=1 g (zj ) − ρf (zj ) 1−ρ Z 1 = t f (z) h (z) dz = Ef (h) . t Thus δ2 , which is a convex combination of δ0 and δ1 , is also an unbiased estimator of Ef (h). Problem 2.31 First, we determine the ratio of the target density to the instrumental density: 1 f = xn−1 λ−1 e(λ−1)x , gλ Γ (n) where 1/Γ (n) is a normalising constant, f is the density of the Ga(n, 1) distribution, and gλ is the density of the Exp(λ) distribution. Then we find the bound for the ratio, by differentiating w.r.t. x : ∂ f = (n − 1)x0n−2 λ−1 e(λ−1)x0 + x0n−1 λ−1 e(λ−1)x0 (λ − 1) = 0 , ∂x gλ that is,
n−1 1−λ for n > 1 and λ < 1. Also, the second order derivative reduces to n n−1 (λ − 1)3 1−λ ≤ 0. (n − 1)2 x0 =
We thus insert the solution for x in the ratio to determine the bound M :
32
Solution Manual
1 M= Γ (n)
n−1 (1 − λ)e
n−1
λ−1
Finally, to minimize the bound, we maximize λ(1−λ)n−1 yielding the solution λ = 1/n. Inserting the solution for λ in the bound yields M=
n (n−1) e
n
Γ (n)
and so we can plot the bound as a function of n. Problem 2.33 First we calculate the expression of the density function of a truncated gamma. Given a random variable x ∼ fX (x) the truncation of the p.d.f. on the set [a, b] is given by fX (x) I[a,b] (x) FX (b) − FX (a)
and therefore for a Gamma distribution truncated on the interval [0, t], the p.d.f. is ba a−1 −bx e I[0,t] (x) ba xa−1 e−bx I[0,t] (x) Γ (a) x = Rt ba Rt a−1 −bx a−1 e−bx dx x (bx) e b dx Γ (a) 0 0 ba xa−1 e−bx I[0,t] (x) ba xa−1 e−bx I[0,t] (x) = Rbt γ (a, bt) ua−1 e−u du
T G (a, b, t) =
=
=
0
(a) Without loss of generality we can consider t = 1 since if X ∼ T G (a, b, t) and if Y = X/t, then Y is distributed from a T G (a, bt, 1) distribution. (b) The density f of T G (a, bt, 1) is T G (a, b, 1) =
=
ba a−1 −bx e I[0,1] (x) ba xa−1 e−bx I[0,1] (x) Γ (a) x = R1 ba R1 a−1 −bx a−1 e−bx dx (bx) e b dx Γ (a) x 0 0 ba xa−1 e−bx I[0,1] (x) ba xa−1 e−bx I[0,1] (x) = Rb γ (a, b) a−1 −u (u) e du
=
0
where γ (a, b) =
Rb 0
(u)
a−1 −u
e
du. It can can be expressed as a mixture of
beta Be(a, k+1) distributions. Using a Taylor expansion of the exponential function we obtain
Monte Carlo Statistical Methods
f (x) = T G (a, b, 1) =
33
ba e−bx xa−1 I[0,1] (x) γ (a, b)
ba e−b b(1−x) a−1 e x I[0,1] (x) γ (a, b) ! +∞ ba e−b X bk k = (1 − x) xa−1 I[0,1] (x) γ (a, b) k! k=0 ! +∞ k a −b X b e b k a−1 = (1 − x) x I[0,1] (x) γ (a, b) k! =
ba e−b = γ (a, b) a −b
b e = γ (a, b)
k=0 +∞ X k=0 +∞ X k=0 +∞ X
! bk Γ (a) Γ (a + k + 1) k a−1 (1 − x) x I[0,1] (x) k! Γ (a + k + 1) Γ (a) bk Γ (a) Γ (a + k + 1) Γ (a + k + 1) Γ (a)
k+1−1
(1 − x) Γ (k + 1) !
a−1
x
!
I[0,1] (x)
ba e−b Γ (a) bk Be (a, k + 1) γ (a, b) Γ (a + k + 1) k=0 +∞ X e−b Γ (a) ba+k = Be (a, k + 1) γ (a, b) Γ (a + k + 1)
=
k=0
or equivalently f (x) =
+∞ X
ba+k−1
k=1
Γ (a) e−b γ (a, b) Γ (a + k)
Be (a, k)
which is indeed a mixture of Beta distributions. (c) If we replace the function f with gn which is the series truncated at k = n, then the acceptance probability of the Accept-Reject algorithm based on (f, gn ) is given by the inverse of Mn = sup
f (x) gn (x)
We use the following truncated series in order to approximate the infinite mixture n −b X Γ (a) 1 a+k−1 e b gn (x) = Be (a, k) γ (a, b) Γ (a + k) A∗ k=1
where A∗ =
n X
k=1
that is,
ba+k−1
Γ (a) e−b γ (a, b) Γ (a + k)
,
34
Solution Manual
gn (x) = A=
n X
k=1 n X
b
k−1
1 Γ (a + k)
b
k−1
1 Γ (a + k)
k=1
1 Be (a, k) A
and thus f (x) = P n gn (x)
k=1
=
ba xa−1 e−bx I[0,1] (x) γ(a,b) 1 Γ (a+k) a−1 1 (1 − x) bk−1 Γ (a+k) A Γ (k)Γ (a) x
k−1
ba xa−1 e−bx I[0,1] (x) Γ (a) A n γ (a, b) P k−1 1 bk−1 Γ (k) xa−1 (1 − x) k=1
ba e−bx I[0,1] (x) Γ (a) A = n γ (a, b) P k−1 1 bk−1 Γ (k) (1 − x) =
k=1 a −bx
Γ (a) A b e I[0,1] (x) . γ (a, b) Sn (x)
To check that this function is decreasing, we calculate the first derivative: Γ (a) A ba e−bx I[0,1] (x) ∂ f (x) = ∂x gn (x) γ (a, b) Sn2 =
Γ (a) A ba e−bx I[0,1] (x) γ (a, b) Sn2
=
Γ (a) A ba e−bx I[0,1] (x) γ (a, b) Sn2
=
Γ (a) A ba e−bx I[0,1] (x) γ (a, b) Sn2
n X
bk−1 k−2 (k − 1) −b Sn + (1 − x) Γ (k) k=2 ! n X bk−1 k−2 −b Sn + (1 − x) Γ (k − 1) k=2 ! n−1 X bk−1 k−1 −b Sn + b (1 − x) Γ (k) k=1 !! n−1 bn−1 (1 − x) ≤ 0. −b Γ (N )
Therefore, the supremum is attained at x = 0 and the inverse of the acceptance probability is given by
!
Monte Carlo Statistical Methods
ba e−bx I[0,1] (x) f (x = 0) Γ (a) A Mn = = n P gn (x = 0) γ (a, b) k−1 1 k−1 b Γ (k) (1 − x) k=1
x=0
a −bx b e I[0,1] (x) 1 Γ (a) bk−1 = n Γ (a + k) γ (a, b) P k−1 1 k−1 k=1 b (1 − x) Γ (k) k=1 x=0 n a X b Γ (a) 1 bk−1 = n Γ (a + k) γ (a, b) P 1 k=1 bk−1 Γ (k) n X
n P
a+k−1
b Γ (a) k=1 = n γ (a, b) P k=1
1 Γ (a+k)
1 bk−1 Γ (k)
k=1
.
Observing that
n X
1 Γ (k)
1 Γ (a + k)
b
k−1
k=1
b
=e
γ (n, b) 1− Γ (n)
and that n X
k=1 ∞ X
b
a+k−1
X ∞ 1 1 ba+k − Γ (a + k + 1) Γ (a + k + 1) k=0 k=n ∞ 1 γ (a, b) eb X n+a+k b − = Γ (a) Γ (a + k + n + 1) =
ba+k
k=0
γ (a, b) eb γ (a + n, b) eb = − Γ (a) Γ (a + n) we can express the bound as Mn =
Γ (a) γ (a, b)
γ(a,b) b Γ (a) e
1−
−
γ(a+n+1,b) b Γ (a+n+1) e
γ(n+1,b) Γ (n+1)
eb
=
1−
γ(a+n+1,b)Γ (a) γ(a,b)Γ (a+n+1) 1 − γ(n,b) Γ (n)
Thus we conclude that the acceptance probability is P (n) =
1 − γ(n+1,b) 1 − γ(n+1,b) 1 Γ (n+1) n! = = γ(a+n+1,b)Γ (a) γ(a+n+1,b)Γ (a) Mn 1 − γ(a,b)Γ (a+n+1) 1 − γ(a,b)Γ (a+n+1)
35
36
Solution Manual
(d) The acceptance probability P (n) has been evaluated for different values of the parameters a and b. Furthermore we evaluate P (n) for increasing values of n. The probability converges to 1 when n tends to infinity. If we fix the level of precision in evaluating the acceptance probability by constraining the absolute incremental error |P (n) − P (n − 1)| to be less than 1.00e-8 then the optimal number of mixture components increases with the parameters a and b. It seems possible to express the optimal number of mixture components as a function of a given level of acceptance probability. But it is too difficult to find an explicit relation between the desired level of acceptance probability p and the optimal number n of components, therefore we use instead numerical methods. We compute the optimal number n of components for a given level p of probability, solving the following problem inf {n|P (n) ≥ p} with p ∈ [0, 1] The integration problem (relative to the functions: Γ (c) and γ(c, d)) can be solved by Simpson’s algorithm. In Tables B.2–B.5, optimal numbers of mixture components are given for different parameters values and for different p. (e) In order to use the density function gn in an accept/reject algorithm, it is necessary to simulate from the mixture gn . Therefore we propose a na¨ıve simulation algorithm for mixtures of beta distribution 1. 2. 3. 4.
Generate u ∼ U[0,1] Let k be such that u ∈ [w ˜k−1 , w ˜k ] Generate x ∼ Be(a, k) Return x
The algorithm uses the inverse c.d.f. transform for simulating a discrete random variable with values in k ∈ {1, . . . , N } for choosing the k-th component of the mixture. The cumulative density function associated to the singular part of the mixture is given by the following equations ω ¯0 = ω ˜ 0 = 0,
ω ˜1 = ω ¯ 1 = 1,
and
ω ¯ k+1 = ω ¯k +
b a+k
k+1 N X X ω ¯ k+1 ω ¯i ω ¯i = ω ˜N i=0 i=0 Once the k∗ such that k ∗ = inf k|Fk (k) ≥ u with u ∼ U[0,1] is found, the algorithm simulates X from the k ∗ -th beta distribution of the mixture. The accept/reject algorithm for the right truncated gamma is then
ω ˜ k+1 = ω ˜k +
Monte Carlo Statistical Methods
37
6 r
w ˜N = 1
r
w ˜k
b
-
u ∼ U[0,1]
r
w ˜k−1
b
b
-
k−1
0
k
? Be(b, k − 1)
6
N
? Be(b, k)
6
-
-
Fig. B.3. Simulation from a Beta Mixture. The random uniform draw u allows to select which component of the mixture will be simulated.
1. Generate u ∼ U[0,1] and x ∼ gN N P bk−1 2. Compute Mn−1 = Γ (k) and k=1 N −1 P bk−1 k−1 ρ (x) = e−bx (1 − x) Γ (k) k=1
3. If uMn ≤ ρ (x) then return x else go to step 1.
where Mn in the second step represents the computable bound of the accept/reject algorithm. This bound is derived numerically in Table B.1.
38
Solution Manual Table B.1. Numerical evaluation of the bound Mn n 1 2 3 4 5 6 7 8 9 10
b=1 1.00000 0.50000 0.40000 0.37500 0.36923 0.36809 0.36791 0.36788 0.36787 0.36787
b=3 1.00000 0.25000 0.11764 0.07692 0.06106 0.05434 0.05151 0.05038 0.04997 0.04984
b=5 1.00000 0.16666 0.05405 0.02542 0.01529 0.01093 0.00884 0.00777 0.00723 0.00695
b=7 1.00000 0.12500 0.03076 0.01115 0.00527 0.00303 0.00202 0.00152 0.00125 0.00109
b=9 1.00000 0.10000 0.01980 0.00581 0.00224 0.00106 0.00059 0.00038 0.00027 0.00021
b = 11 1.00000 0.08333 0.01379 0.00339 0.00110 0.00044 0.00021 0.00011 0.00007 0.00004
Table B.2. Optimal number of mixture components - I p a 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
b 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
0.8 n 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5
P (n) .8428 .9326 .9266 .9237 .9223 .9214 .9209 .9206 .9204 .9202 .8106 .8817 .8693 .8638 .861 .8595 .8587 .8582 .8579 .8577 .8034 .8509 .8324 .8241 .8202 .8181 .817 .8164 .816 .8158
0.85 n 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 6 6 6 6 6 6 6 6
P (n) .9482 .9326 .9266 .9237 .9223 .9214 .9209 .9206 .9204 .9202 .9127 .8817 .8693 .8638 .861 .8595 .8587 .8582 .8579 .8577 .8942 .8509 .9222 .919 .9175 .9169 .9165 .9163 .9162 .9162
0.9 n 3 3 3 3 3 3 3 3 3 3 4 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
P (n) .9482 .9326 .9266 .9237 .9223 .9214 .9209 .9206 .9204 .9202 .9127 .9546 .9506 .9489 .9482 .9478 .9476 .9475 .9475 .9474 .9496 .9299 .9222 .919 .9175 .9169 .9165 .9163 .9162 .9162
0.95 N 4 4 4 4 4 4 4 4 4 4 5 5 5 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7
P (n) 0.9867 0.9832 0.982 0.9815 0.9813 0.9812 0.9811 0.9811 0.9811 0.9811 0.9658 0.9546 0.9506 0.9838 0.9836 0.9835 0.9835 0.9835 0.9835 0.9834 0.9788 0.9711 0.9683 0.9673 0.9669 0.9667 0.9666 0.9665 0.9665 0.9665
Monte Carlo Statistical Methods
39
Table B.3. Optimal number of mixture components - II p a 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
b 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6
0.8 n 5 6 6 7 7 7 7 7 7 7 6 7 8 8 8 8 8 8 8 8 7 8 9 9 9 9 9 9 9 9
P (n) .805 .832 .8078 .8938 .8915 .8905 .8899 .8897 .8895 .8894 .8099 .8204 .8804 .8731 .8698 .8682 .8675 .8671 .8669 .8668 .8156 .8135 .8658 .8561 .8516 .8494 .8483 .8478 .8476 .8474
0.85 n 6 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 10 10 10 10 10
P (n) .8849 .9107 .8989 .8938 .8915 .8905 .8899 .8897 .8895 .8894 .8804 .8964 .8804 .8731 .8698 .8682 .8675 .8671 .8669 .8668 .8785 .8857 .8658 .8561 .8516 .9169 .9165 .9163 .9162 .9161
0.9 n 7 7 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10
P (n) .9382 .9107 .9524 .9504 .9496 .9492 .949 .949 .9489 .9489 .9304 .9454 .9377 .9345 .9331 .9325 .9322 .9320 .9320 .9319 .9251 .9352 .9248 .9200 .9179 .9169 .9165 .9163 .9162 .9161
0.95 N 8 8 8 8 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11
P (n) 0.97 0.9574 0.9524 0.9504 0.9788 0.9787 0.9787 0.9787 0.9786 0.9786 0.9628 0.9737 0.9704 0.9691 0.9686 0.9683 0.9683 0.9682 0.9682 0.9682 0.9570 0.9661 0.9611 0.9590 0.9581 0.9577 0.9575 0.9574 0.9574 0.9574
Problem 2.34 (a) The ratio between the two densities is 2
f (x) π(1 + x2 )e−x √ = g(x) 2π
/2
=
r
2 π (1 + x2 )e−x /2 2
with first and second derivatives ′ r ′′ r 2 2 π π f (x) f (x) = = x(1 − x)(1 + x)e−x , (1 − 4x2 + x4 )e−x g(x) 2 g(x) 2 . A simple study of the derivatives tells us that the ratio is bounded from above
40
Solution Manual Table B.4. Optimal number of mixture components - III p a 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
b 6 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9
0.8 n 9 8 9 10 10 10 10 10 10 10 10 9 10 11 11 11 11 11 11 11 11 10 11 12 12 12 12 12 12 12 12
P (n) 0.8474 0.8215 0.8094 0.8543 0.8422 0.8363 0.8334 0.8319 0.8312 0.8309 0.8307 0.827 0.8072 0.8452 0.8309 0.8234 0.8197 0.8178 0.8169 0.8164 0.8162 0.8321 0.8063 0.838 0.8215 0.8126 0.8079 0.8055 0.8043 0.8037 0.8033
0.85 n 10 9 10 10 11 11 11 11 11 11 11 10 11 12 12 12 12 12 12 12 12 11 12 13 13 13 13 13 13 13 13
P (n) 0.9161 0.8781 0.8777 0.8543 0.9071 0.9041 0.9027 0.9021 0.9018 0.9016 0.9016 0.8784 0.8717 0.9039 0.8958 0.8918 0.8898 0.8889 0.8885 0.8883 0.8882 0.8792 0.8671 0.8956 0.8858 0.8807 0.8782 0.8769 0.8763 0.8761 0.8759
f (x) 6 g(x)
r
0.9 n 10 10 11 11 11 11 11 11 11 11 11 11 12 12 13 13 13 13 13 13 13 12 13 14 14 14 14 14 14 14 14
P (n) 0.9161 0.9214 0.9267 0.9135 0.9071 0.9041 0.9027 0.9021 0.9018 0.9016 0.9016 0.9188 0.9196 0.9039 0.9399 0.9379 0.937 0.9365 0.9364 0.9363 0.9362 0.9169 0.9137 0.9366 0.9312 0.9285 0.9273 0.9267 0.9264 0.9263 0.9262
π 2 √ = 2 e
r
0.95 N 11 11 12 12 13 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15
P (n) 0.9574 0.9523 0.959 0.9523 0.974 0.9734 0.9732 0.9731 0.973 0.973 0.973 0.9693 0.9527 0.9695 0.9674 0.9665 0.9661 0.966 0.9659 0.9658 0.9658 0.9662 0.9693 0.9637 0.9609 0.9596 0.959 0.9587 0.9586 0.9586 0.9586
2π , e
and that the upper bound ispattained when x = 1 or x = −1. (b) Let M = max(f(x)/g(x)) = 2π/e, X ∼ g(x) and U ∼ U(0, M g(x)) then the acceptance probability is P (U < f (Y )) =
Z
+∞
−∞
Z
M g(x)
0
(c) The ratio between densities is
Iu 1.5, var(δ2 (h2 )) is finite if and only if ν > 5.5 and var(δ2 (h3 )) is finite if and only if ν > 3.5. (c) When the importance sampling is used with the normal N (0, ν/(ν − 2)) distribution, the estimator is the same as previous with (ν > 2) g(x) ∝ e−(ν−2)x Because of the term e(ν−2)x 1, 2, 3.
2
/2ν
2
/2ν
.
, the variance of δ3 (hi ) is infinite for i =
Problem 3.13 The ratio f (x)/g(x) is given by ω(x) =
γ(a) β α α−a (b−β)x x e , γ(α) ba
where a = [α] and b = β/α. Since ω(x) ≥ 0, maximizing ω(x) is equivalent to maximize its logarithm, log ω(x) ∝ log xα−a e(b−β)x = (α − a) log x + (b − β)x
and taking its the derivative leads to (α − a) + (b − β) = 0 x that is, ⇒x=
α−a , β−b
and using this value of x, we get the optimum α−a γ(a) β α α − a M= e(α−a) , γ(α) ba β − b i.e. the equality (3.18). Problem 3.17 (a) We continue to use the same estimator n
1X f (Yi ) h (Yi ) n i=1 g (Yi ) It has the expectation
Monte Carlo Statistical Methods
57
# " n X f (Y1 ) 1 f (Yi ) = E h (Y1 ) E h (Yi ) n g (yi ) g (Y1 ) i=1 =
Z
as Y1 , . . . , Yn have the same marginal density g Z f (y) h(y) g(y)dy = h(y)f (y)dy g(y)
so the estimator remains unbiased. (b) In this case, we are sampling Yi conditionally on Yi−1 using the density q(y|Yi−1 ). So we change the denominators accordingly and have the estimator ( ) n f (Y1 ) X f (Yi ) 1 h (Y1 ) , h (Yi ) + n g (Y1 ) i=2 q (Yi |Yi−1 ) where g is the marginal density of Y1 . Now Z Z f (Y1 ) f (y) E h (Y1 ) = h (y) g (y) dy = h (y) f (y) dy g (Y1 ) g (y)
and for i > 1, E h (Yi )
f (Yi ) |Yi−1 q (Yi |Yi−1 )
= =
Z
Z
h (y)
f (y) q (y|Yi−1 ) dy q (y|Yi−1 )
h (y) f (y) dy
So, Z f (Yi ) = h (y) f (y) dy E h (Yi ) q (Yi |Yi−1 ) ( " #) Z n 1 f (Y1 ) X f (Yi ) E h (Y1 ) = h (y) f (y) dy h (Yi ) + n g (Y1 ) i=2 q (Yi |Yi−1 ) (c) We have f (Y2i−1 ) f (Y2i ) cov h(Y2i−1 ) , h(Y2i ) g(Y2i−1 ) g (Y2i ) f (Y2i−1 ) f (Y2i ) = E cov h(Y2i−1 ) , h(Y2i ) |Y2i−1 + g(Y2i−1 ) g (Y2i ) f (Y2i−1 ) f (Y2i ) + cov E h(Y2i−1 ) |Y2i , E h(Y2i ) |Y2i−1 g(Y2i−1 ) g (Y2i ) Z f (Y2i−1 ) = 0 + cov h(Y2i−1 ) , h (y) f (y) dy g(Y2i−1 ) =0
58
Solution Manual
Note that we do not need the information that cov (Y2i−1 , Y2i ) < 0. In general, suppose we have an iid sample (Y1 , Y2 , . . . , Ym )with density g and for each i, we have a secondary sample Z1i , . . . , Ynii , which are iid conditioned on Yi with conditional density q (·, Yi ). Then we use the estimator ni m i m X X f Z P 1 f (Yi ) j ni where N = h Zji + h (Yi ) i N +m g (Yi ) q Zj |Yi i=1 j=1
i=1
Here, the estimator is also unbiased.
! f (Zji ) f (Yi ) i cov h(Yi ) , h(Zj ) g(Yi ) q Zji |Yi ( " #) i f (Z ) f (Yi ) j |Y2i−1 = E cov h(Yi ) , h(Zji ) + g(Yi ) q Zji |Yi #) ( " f (Zji ) f (Yi ) i |Yi + cov E h(Yi ) |Yi , E h(Zj ) g(Yi ) q Zji |Yi Z f (Yi ) , h (y) f (y) dy = 0 + cov h(Yi ) g(Yi ) =0
and cov
h(Zji )
f (Zji ) q Zji |Yi
, h(Zki )
f (Zki ) q Zki |Yi
!
# ) i f (Z ) k , h(Zki ) |Yi + = E cov h(Zji ) q Zji |Yi q Zki |Yi # " #) ( " f (Zji ) f (Zki ) i i |Yi , E h(Zk ) |Yi +cov E h(Zj ) q Zji |Yi q Zki |Yi Z Z = 0 + cov h (y) f (y) dy, h (y) f (y) dy (
"
f (Zji )
as Zji ,Zki are independent given Yi
= 0 So the terms in the stimator are uncorrelated even in this general case. Problem 3.19 Let h(θ) be a given function of θ, such that Z varπ(θ) (h(θ)) = h2 (θ)π(θ)dθ
Monte Carlo Statistical Methods
59
R is finite. We want to evaluate the integral Eπ(θ|x) (h(θ)) = h(θ)π(θ|x)dθ. We use the representation Z Eπ(θ|x) (h(θ)) ∝ [h(θ)ℓ(θ|x)]π(θ)dθ = Eπ(θ) [h(θ)ℓ(θ|x)]. Therefore, the prior distribution π(θ) can be used as an instrumental distribution. (a) When the instrumental distribution is the prior, the importance sampling estimator is n 1X h(θi )ℓ(θi |x). δ1 ∝ n i=1 Its variance is
1 varπ(θ) (h(θ1 )ℓ(θ1 |x)) n Z Z 1 2 2 ∝ h (θ)ℓ (θ|x)π(θ)dθ− h(θ)ℓ(θ|x)π(θ)dθ 2 . n
varπ(θ) (δ1 ) =
Then, if the likelihood is bounded, this variance is finite.
(b) Writing Eπ(θ|x) (h(θ)) as the alternative representation Z Eπ(θ|x) (h(θ)) ∝ [h(θ)π(θ)]ℓ(θ|x)dθ ∝ Eℓ(θ|x) [h(θ)π(θ)], yields to use ℓ(θ|x) as an instrumental distribution. When the instrumental distribution is (proportional to) the likelihood ℓ(θ|x), the estimator becomes n 1X δ2 ∝ h(θi )π(θi ). n i=1 Its variance is
1 varℓ(θ|x) (h(θ1 )π(θ1 )) n" Z 2 # Z 1 2 2 h (θ)π (θ)ℓ(θ|x)dθ − h(θ)π(θ)ℓ(θ|x)dθ , ∝ n
varℓ(θ|x) (δ2 ) =
which is not usually finite even when the likelihood is bounded. We give here an example from exponential family. Suppose that x ∼ Ga(ν, θ) that is,
ℓ(θ|x) ∝ θν e−θx
and θ ∼ Ga(α, β), and
π(θ) ∝ θα−1 e−βθ .
60
Solution Manual
Here, the posterior is θ|x ∼ Ga(ν + α, β + x) ,
that is, π(θ|x) ∝ θν+α−1 e−(β+x)θ .
Let h(θ) = θµ . h(θ) has a finite variance respect to π(θ) if and only if µ > −α/2. And the variance of δ2 is infinite if µ < 1/2 − α − ν/2. So, if we take ν = 1/4, α = 1/2, and µ = −0.2, the estimator δ2 has an infinite variance while the variance of δ1 is finite. Problem 3.21 (a) By the Law of Large Numbers, n
lim n
1 X fXY (x∗ , yi )w(xi ) n i=1 fXY (xi , yi ) fXY (x∗ , Y ) w (X) =Ef fXY (X, Y ) Z Z fXY (x∗ , y)w(x) = fXY (x, y)dxdy fXY (x, y) Z Z = fXY (x∗ , y)w(x)dxdy Z Z = w(x)dx fXY (x∗ , y)dy = fX (x∗ )
y−1
(b) The conditional density of X given Y is fX|Y (x|y) = Ga (y, 1) = xΓ (y) e−x . The marginal density of Y is fY (y) = e−y . The joint density is therefore fXY (x, y) = fX|Y (x|y) ∗ fY (y) =
xy−1 −(x+y) e . Γ (y)
This will yield the marginal density of X as Z Z y−1 Z y−1 x x −(x+y) −x fX (x) = fXY (x, y)dy = e dy = e e−y dy Γ (y) Γ (y) we cannot find out the explicity form for fX (x), we can, however, plot the marginal density by using Mathematica. (c) Based upon the spirit of Theorem 3.12, we may choose w(x) which minimize the variance fXY (x∗ , Y )w(X) . var fXY (X, Y ) The approach is the same as the proof of that theorem: 2 fXY (x∗ , Y )w(X) fXY (x∗ , Y ) w (X) var = Ef fXY (X, Y ) fXY (X, Y ) 2 fXY (x∗ , Y ) w (X) − Ef fXY (X, Y )
Monte Carlo Statistical Methods
61
The second term above is 2 Z 2 Z fXY (x∗ , Y ) w (X) ∗ Ef = w(x)dx fXY (x , y)dy fXY (X, Y ) Z 2 2 = fXY (x∗ , y) dy = (fX (x∗ )) which is independent of w(x). We only minimize the first term. Using Jensen’s inequality, Ef
fXY (x∗ , Y ) w (X) fXY (X, Y )
2
≥ = =
Ef
Z Z Z
fXY (x∗ , Y ) w (X) fXY (X, Y )
2
fXY (x∗ , y)w(x)dxdy
w(x)dx
Z
∗
2
fXY (x , y)dy
2
2
= (fX (x∗ )) .
2
The lower bound (fX (x∗ )) is independent of the choice of density w(x). If we could find a density function w∗ (x) such that the lower bound is attained, w∗ (x) is the optimal choice. This is a typical optimization problem: Z Z 2 fXY (x∗ , y)w(x)2 min dxxy u fXY (x, y) R subject to w(x)dx = 1 and w(x) ≥ 0. For a special case, we may be able to work out a solution. For instance, if X and Y are independent, w∗ = fX (x) is the optimal choice. In fact, Z Z 2 ∗ 2 Z Z 2 2 fX (x )fY (y)fX (x) fXY (x∗ , y)w(x)2 dxdy = dxdy 2 fXY (x, y) fX (x)fY (y) Z Z ∗ 2 = (fX (x∗ )) fX (x)dx fY (y)dy = (fX (x∗ )) . This is a theoretical optimization, because we are looking for fX (x). Problem 3.25 Using the first order approximation (3.21), we get Z Z (x−ˆ xθ )2 ′′ nh(x|θ) nh(ˆ xθ |θ) I= e dx = e en 2 h (ˆxθ |θ) dx A
A b
Z
(x−ˆ xθ )2
e−n 2σ2 dx a √ b−x ˆθ b−x ˆθ nh(ˆ xθ |θ) 2πσ Φ =e −Φ , σ σ
= enh(ˆxθ |θ)
where $\sigma^2 = -1/\{n h''(\hat x_\theta|\theta)\}$. Therefore, (3.22) holds:
$$ I = e^{n h(\hat x_\theta|\theta)} \sqrt{\frac{2\pi}{-n h''(\hat x_\theta|\theta)}}
 \left\{ \Phi\!\left[\sqrt{-n h''(\hat x_\theta|\theta)}\,(b-\hat x_\theta)\right] - \Phi\!\left[\sqrt{-n h''(\hat x_\theta|\theta)}\,(a-\hat x_\theta)\right] \right\} . $$
Problem 3.29 For a density $f$ and a given function $h$, we consider the following two estimators of $I = \int h(x)f(x)\,dx$:
$$ \delta_1 = \frac{1}{2n}\sum_{i=1}^{2n} h(X_i), $$
where (X1 , ..., Xn ) is an iid sample generated from f based on inverse method and an iid uniform sample (U1 , ..., Un ), and n
δ2 =
1 X [h(Xi ) + h(Yi )], 2n i=1
where (Y1 , ..., Yn ) is generated using (1 − U1 , ..., 1 − Un ). The variances are var(δ1 ) =
1 var(h(X1 )), 2n
and 1 {E[h(X1 )h ◦ F − (1 − F (X1 ))] − I2 } 2n 1 cov[h(X1 ), h ◦ F − (1 − F (X1 ))] = var(δ1 ) + 2n 1 = var(δ1 ) + cov[h ◦ F − (U1 ), h ◦ F − (1 − U1 )], 2n
var(δ2 ) = var(δ1 ) +
where F is the cdf of f and F − is the generalised inverse of F . Then, δ2 is more efficient than δ1 if the variables h(X1 ) and h ◦ F − (1 − F (X1 )) or equally h ◦ F − (U1 ) and h ◦ F − (1 − U1 ) are negatively correlated. 1 1 1 −1 (u) = tan[π(u − 12 )]. For If f (x) = π(1+x 2 ) , F (x) = π arctan x + 2 and F h(x) = F (x), we have var(δ2 ) − var(δ1 ) =
1 1 cov[U1 , 1 − U1 ] = − < 0, 2n 24n
and for h(x) = cos(2πF (x)), we obtain var(δ2 ) − var(δ1 ) =
1 1 cov[cos(2πU12 − 1), cos(2π(1 − U1 ))] = > 0. 2n 4n
The performances of δ2 , as opposed to δ1 , depend on h. (A general result is that δ2 is more efficient than δ1 when h is a monotone function.)
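As a quick numerical illustration of this dependence on h (a sketch that is not part of the original solution), the following Python snippet compares the variances of δ1 and δ2 for the Cauchy example above, using h(x) = F(x) and h(x) = cos(2πF(x)); the sample size and seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
U = rng.uniform(size=n)

# Inverse-cdf draws from the Cauchy distribution: X = F^-(U), Y = F^-(1-U)
F_inv = lambda u: np.tan(np.pi * (u - 0.5))
F = lambda x: np.arctan(x) / np.pi + 0.5

for h in (lambda x: F(x), lambda x: np.cos(2 * np.pi * F(x))):
    hx, hy = h(F_inv(U)), h(F_inv(1 - U))
    var_d1 = hx.var() / (2 * n)          # variance of the iid estimator delta_1
    var_d2 = (hx + hy).var() / (4 * n)   # variance of the antithetic estimator delta_2
    print(var_d1, var_d2)

For the monotone choice h = F the antithetic estimator wins, while for h(x) = cos(2πF(x)) it loses, in agreement with the computations above.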
Problem 3.30 (a) We start by finding out the distribution of the Zi ’s from the Accept-Reject algorithm. P (Z ≤ z) = P
Rz
P (Y ≤ z, U ≥ Mff(Y(Y) ) ) f (Y ) Y ≤ z U ≥ = M g(Y ) P (U ≥ Mff(Y(Y) ) )
R1 Rz ( f (y)/M g(y) du)g(y)dy [(1 − f )/(M g)g](y)dy = = R−∞ R1 ∞ [(1 − f )/(M g)g](y)dy ( f (y)/M g(y) du)g(y)dy −∞ Rz Rz (g(y) − f (y)/M )dy (g(y) − f (y)/M )dy = R−∞ = −∞ ∞ 1 − 1/M (g(y) − f (y)/M )dy −∞ Z z M g(y) − f (y) = dy. M −1 −∞ −∞ R∞ −∞
Thus the density of Zi is (M g − f )/(M − 1). Since
N −t Z 1 X h(z)f (z)dz E(δ2 ) = N − t j=1
= Ef [h(X)].
δ1 is the standard Accept-Reject estimator (X ∼ f), and then E(δ1) = Ef[h(X)]. (b) As the sample (Y1, ..., YN) generated from g is i.i.d. and the uniform U[0,1] variables Ui used to choose between acceptance and rejection of the Yi's are independent and independent from every Yj, the variables Zi and Xj are independent. Moreover, Zi and Zj are independent for i ≠ j, and the same holds for Xi and Xj. Therefore, δ1 and δ2 are independent. (c) Let δ3 = βδ1 + (1 − β)δ2. The estimator δ3 is an unbiased estimator of I = Ef[h(X)]. Its variance is var(δ3) = β²var(δ1) + (1 − β)²var(δ2). The optimal weight β∗ solves ∂var(δ3)/∂β (β∗) = 0, i.e.,
$$ \beta^* = \frac{\mathrm{var}(\delta_2)}{\mathrm{var}(\delta_1) + \mathrm{var}(\delta_2)} . $$
As Zi and Zj for i 6= j, are independent, (M − 1)f (Z1 ) 1 var h(Z1 ) var(δ2 ) = N −t M g(Z1 ) − f (Z1 ) "Z Z 2 # 1 (M − 1)f 2 (x) 2 = h (x) dx − h(x)f (x)dx . N −t M g(x) − f (x)
The variance of δ1 is 1 1 var(δ1 ) = var(h(X1 )) = t t
"Z
h2 (x)f (x)dx −
Z
h(x)f (x)dx
2 #
.
Therefore, var(δ1), var(δ2), and thus β∗ depend only on t and N, and not on (Y1, ..., YN).
Problem 3.31 (a) The cdf of a rejected variable is
$$ P(Z \le z) = \int_{-\infty}^{z} \left( \int_{f(t)/Mg(t)}^{1} du \right) g(t)\,dt
 = \int_{-\infty}^{z} \left( 1 - \frac{f(t)}{M g(t)} \right) g(t)\,dt . $$
Then, the distribution of a rejected variable is f (z) g(z) − ρf (z) 1− g(z) = , M g(z) 1−ρ where ρ = 1/M . The probability P (Zi ≤ z) is the sum of the corresponding quantities over all possible permutations of rejection and acceptance. For rejection part, Z z g(t) − ρf (t) t dt , P (Zi ≤ z) = n + t − 1 −∞ 1−ρ and for acceptance part, n−1 P (Zi ≤ z) = n+t−1
Z
z
f (t)dt .
−∞
Therefore, the marginal distribution of $Z_i$ is
$$ f_m(z) = \frac{n-1}{n+t-1}\, f(z) + \frac{t}{n+t-1}\,\frac{g(z)-\rho f(z)}{1-\rho} . $$
The same analysis applies for the joint distribution of (Zi , Zj ), i 6= j and gives that this density is the sum of three terms, corresponding to the cases when Zi and Zj are both accepted, only one of them is accepted, or both are rejected, respectively. We obtain (n − 1)(n − 2) f (zi )f (zj ) (n + t − 1)(n + t − 2) (n − 1)t + (n + t − 1)(n + t − 2) g(zj ) − ρf (zj ) g(zi ) − ρf (zi ) f (zj ) × f (zi ) 1−ρ 1−ρ g(zj ) − ρf (zj ) g(zi ) − ρf (zi ) n(n − 1) . + (n + t − 1)(n + t − 2) 1−ρ 1−ρ
fm (zi , zj ) =
Problem 3.33 (a) Both estimators t
δ0 =
1X b(Yj )h(Yj ) t j=1
and
n
δ AR =
1X h(Xi ) n i=1
are studied in Problem 3.30, where it is shown that δ0 and δ AR are unbiased estimators of I = Ef [h(X)] and independent. Moreover, Z 1 (1 − ρ)f 2 (x) 2 2 var(δ0 ) = h (x) dx − I t g(x) − ρf (x) and var(δ
AR
1 )= n
Z
2
h (x)f (x)dx − I
2
.
Therefore, if h = h0 is a constant function, var(δ AR ) = 0 and Z h20 (1 − ρ)f 2 (x) var(δ0 ) = dx − 1 t g(x) − ρf (x) which is always larger than 0 and then δ0 does not uniformly dominate δ AR . For example, when f (x) = 2e−2x and g(x) = e−x , we take M = 2 and we obtain Z ∞ 6e−4x h20 dx − 1 . var(δ0 ) = t 2e−x − e−2x 0 With the change of variables x = − log u, Z 1 u2 h20 6 du − 1 var(δ0 ) = t 0 2−u Z 1 4 h20 6 = − (u + 2) du − 1 t 2−u 0 h2 h2 = 0 [6(4 log 2 − 5/2) − 1] ≃ 0.63 0 > 0. t t Problem 3.34 (a) Conditionally on t, the variables Yj ; j = 1, ..., t are independent, the density of Yj ; j = 1, ..., t − 1 is M g(y) − f (y) , M −1
and the density of the last one Yt is f (y), because the stopping rule accepts the last generated value. Thus, the joint density of (Y1 , ..., Yt ) is t−1 Y
j=1
M g(yj ) − f (yj ) f (yt ) . M −1
The estimator δ2 is
t
δ2 =
1X f (Yj ) . h(Yj ) t j=1 g(Yj )
Its expectation is
t−1 X 1 1 f (Y ) f (Y ) t j E(δ2 ) = E + h(Yt ) h(Yj ) t j=1 g(Yj ) t g(Yt ) f (Y1 ) 1 f (Yt ) t−1 h(Y1 ) + h(Yt ) =E t g(Y1 ) t g(Yt ) t−1 f (Y1 ) 1 f (Yt ) =E E h(Y1 ) |t + E (h(Yt ) |t , t g(Y1 ) t g(Yt )
using the double expectation rule, $E[X] = E\{E[X|Y]\}$. Then,
$$ E(\delta_2) = E\left[ \frac{t-1}{t}\int h(y)\,\frac{f(y)}{g(y)}\,\frac{Mg(y)-f(y)}{M-1}\,dy + \frac{1}{t}\int h(y)\,\frac{f(y)}{g(y)}\,f(y)\,dy \right] $$
$$ \quad = E\left[ \frac{t-1}{t}\left( \frac{M}{M-1}\int h(y)f(y)\,dy - \frac{1}{M-1}\int h(y)\,\frac{f(y)}{g(y)}\,f(y)\,dy \right) + \frac{1}{t}\int h(y)\,\frac{f(y)}{g(y)}\,f(y)\,dy \right] $$
$$ \quad = E\left[ \frac{t-1}{t}\left( \frac{M}{M-1}\,E_f[h(X)] - \frac{1}{M-1}\,E_f\!\left[h(X)\frac{f(X)}{g(X)}\right] \right) + \frac{1}{t}\,E_f\!\left[h(X)\frac{f(X)}{g(X)}\right] \right] . $$
(b) Let $\rho = 1/M$ be the acceptance probability of the Accept-Reject algorithm and suppose that $E_f[h(X)] = 0$. The bias of $\delta_2$ is then equal to its expectation and is given by
$$ E\left[ \frac{1}{t} - \frac{1}{M-1}\left(1-\frac{1}{t}\right) \right] E_f\!\left[h(X)\frac{f(X)}{g(X)}\right]
 = \left( \frac{E[t^{-1}]}{1-\rho} - \frac{\rho}{1-\rho} \right) E_f\!\left[h(X)\frac{f(X)}{g(X)}\right] . $$
(c) If t ∼ Geo(ρ),
E[t
−1
∞ X (1 − ρ)n−1 ρ ]= n n=1
=
∞ ρ log ρ ρ X (1 − ρ)n =− . 1 − ρ n=1 n 1−ρ
And the bias of δ2 becomes ρ log ρ f (X) ρ f (X) ρ log ρ − h(X) E = − h(X) . − [1+ ]E f f (1 − ρ)2 1−ρ g(X) 1−ρ 1−ρ g(X) (d) The independence of Yi and Yj for i 6= j leads to 2 2 1 h (Yj )f 2 Yj h (Yt )f 2 Yt t−1 E E |t + |t E[δ22 ] = E t2 g 2 (Yj ) t2 g 2 (Yt ) Z 2 Z 2 t−1 1 h (y)f 2 (y) g(y) − ρf (y) h (y)f 2 (y) =E dy + f (y)dy t2 g 2 (y) 1−ρ t2 g 2 (y) Z Z 2 ρ h (y)f 2 (y) 1 t−1 2 2 h (y)f (y)dy − f (y)dy =E t2 1−ρ 1−ρ g 2 (y) Z 2 h (y)f 2 (y) 1 f (y)dy + 2 t g 2 (y) f (X) f 2 (X) ρ 1 t−1 2 2 − Ef h (X) Ef h (X) 2 =E t2 1−ρ g(X) 1−ρ g (X) 2 f (X) 1 + 2 Ef h2 (X) 2 t g (X) t−1 f (X) 1 2 =E Ef h (X) t2 1−ρ g(X) f 2 (X) ρ(t − 1) 1 2 Ef h (X) 2 . +E 2 1 − t 1−ρ g (X) For t ∼ Geo(ρ), we have E[t−1 ] = − and E[t−2 ] = Therefore,
ρ log ρ , 1−ρ
ρLi(1 − ρ) . 1−ρ
1 f (X) 1 1 − 2 Ef h2 (X) t t 1−ρ g(X) f 2 (X) ρ 1 1 − 2ρ 1 2 h (X) E − +E f 1 − ρ t2 1−ρ t g 2 (X) f (X) ρ 2 h (X) =− {log ρ − Li(1 − ρ)} E f (1 − ρ)2 g(X) f 2 (X) ρ 2 h (X) . {ρ log ρ − (1 − 2ρ)Li(1 − ρ)} E + f (1 − ρ)2 g 2 (X)
E[δ22 ] = E
Chapter 4 Problem 4.1 The computation of ratios of normalizing constants arise in many statistical problems including likelihood ratios for hypothesis testing and Bayes factors. In such cases the quantity of interest is ρ = c1 /c2 where pi (θ) = p˜1 (θ)/ci (i = 1, 2). (a) To prove that ρ can be approximated by n
1 X p˜1 (θ) , n i=1 p˜2 (θ) where θi ∼ p2 , we use the fact that n
R
p˜i (θ)dθ = ci and by the LLN
p˜1 (θ) p˜2 (θ) Z p˜1 (θ) = p2 (θ)dθ p˜2 (θ) Z 1 c1 = p˜1 (θ)dθ = . c2 c2
1 X p˜1 (θ) −→ E n i=1 p˜2 (θ)
(b) Similarly, to establish that R p˜ (θ)p2 (θ)α(θ)dθ c1 R 1 = c2 p˜2 (θ)p1 (θ)α(θ)dθ
holds, we use the identity Z Z 1 p˜i (θ)α(θ)dθ p˜i (θ)pj (θ)α(θ)dθ = cj
to simplify both numerator and denominator, R R p˜ (θ)p2 (θ)α(θ)dθ p2 (θ)α(θ)dθ c1 p˜1 (θ)˜ c1 R 1 R = = . c2 p˜2 (θ)˜ c2 p˜2 (θ)p1 (θ)α(θ)dθ p1 (θ)α(θ)dθ
(c) Again we invoke the LLN Pn2 1 ˜1 (θ2i )α(θ2i ) Ep p˜1 (θ)α(θ) i=1 p n2 Pn1 −→ 1 1 Ep2 p˜2 (θ)α(θ) p ˜ (θ )α(θ ) 1i i=1 2 1i n1 R c1 p˜2 (θ)˜ p1 (θ)α(θ)dθ c1 = R = c2 c2 p˜1 (θ)˜ p2 (θ)α(θ)dθ
(d) Here
o n ˜ −1 Ep2 p2 (θ) n o ρ= ˜ −1 Ep1 p2 (θ) R p2 (θ)/˜ p2 (θ)dθ c1 =R = , c2 p1 (θ)/˜ p1 (θ)dθ −1
where α(θ) = (˜ p1 (θ)˜ p2 (θ)) . (e) Following Meng and Wong (1996), we look at the Relative MSE, RM SE(˜ ρα ) =
E (˜ ρα − ρ) ρ2
2
where the expectation is taken over all random draws. The exact calculation of the RMSE depends on α and is difficult, if not intractable because ρ˜α is a ratio estimator. However, with a large number of draws from each density (n1 and n2 from p1 and p2 respectively), we can approximate the RMSE with its first order term. This essentially ignores the bias term (which should be small because of CLT). Using the notation Z A1 = p˜1 (θ)p2 (θ)α(θ)dθ n2 1 X p˜1 (θ2i )α(θ2i ) A˜1 = n2 i=1 Z A2 = p˜2 (θ)p1 (θ)α(θ)dθ n1 1 X ˜ A2 = p˜2 (θ1i )α(θ1i ) n1 i=1
and following the derivation in Casella and Berger (1990 section 7.4.2 pag. 328), the first order approximation yields var A˜2 var A˜1 1 . + + O RM SE(˜ ρα ) ≈ 2 2 A1 A2 n2
Substituting the ”A’s” back into the equation above we find that R q1 q2 (q1 + q2 )α2 dθ 1 1 1 RM SE(˜ ρα ) ≈ n − +O − 2 R n1 n2 n2 q1 q2 αdθ
where $q_i = n_i p_i/n$. We want to find the function $\alpha(\theta)$ that minimizes the first-order approximation of the RMSE. Note that
$$ \left(\int q_1 q_2\, \alpha\, d\theta\right)^{\!2} \le \left( \int \sqrt{q_1 q_2 (q_1+q_2)}\,|\alpha|\, \sqrt{\frac{q_1 q_2}{q_1+q_2}}\, d\theta \right)^{\!2} . $$
The Cauchy–Schwarz inequality then yields
$$ \left(\int q_1 q_2\, \alpha\, d\theta\right)^{\!2} \le \int q_1 q_2 (q_1+q_2)\, \alpha^2\, d\theta \;\int \frac{q_1 q_2}{q_1+q_2}\, d\theta . $$
Thus,
$$ \frac{\int q_1 q_2 (q_1+q_2)\,\alpha^2\, d\theta}{\left(\int q_1 q_2\, \alpha\, d\theta\right)^{2}} \int \frac{q_1 q_2}{q_1+q_2}\, d\theta \;\ge\; 1 , $$
where equality holds iff
$$ \alpha \propto \frac{1}{q_1+q_2} \propto \frac{n_1+n_2}{n_1 p_1 + n_2 p_2} . $$
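The following Python sketch (not part of the original solution) illustrates the estimators of parts (a) and (c) on a toy pair of unnormalized Gaussian densities whose normalizing constants are known, so that the ratio c1/c2 can be checked; the densities, sample sizes, and the plug-in form of the optimal α are assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(1)

# Unnormalized densities p~_i with known constants (used only to check results):
# p~_1(x) = exp(-x^2/2)      -> c_1 = sqrt(2*pi)
# p~_2(x) = exp(-(x-1)^2/4)  -> c_2 = sqrt(4*pi)
p1_t = lambda x: np.exp(-x**2 / 2)
p2_t = lambda x: np.exp(-(x - 1)**2 / 4)
c1, c2 = np.sqrt(2 * np.pi), np.sqrt(4 * np.pi)

n1 = n2 = 50_000
th1 = rng.normal(0.0, 1.0, n1)            # draws from p_1
th2 = rng.normal(1.0, np.sqrt(2.0), n2)   # draws from p_2

# (a) simple importance-sampling estimator E_{p_2}[p~_1/p~_2]
rho_a = np.mean(p1_t(th2) / p2_t(th2))

# (c) bridge-type estimator with alpha = 1/(n_1 p_1 + n_2 p_2); here we plug in
# the true constants, which is an assumption of this sketch (in practice this
# plug-in is iterated).
alpha = lambda x: 1.0 / (n1 * p1_t(x) / c1 + n2 * p2_t(x) / c2)
rho_c = np.mean(p1_t(th2) * alpha(th2)) / np.mean(p2_t(th1) * alpha(th1))

print(c1 / c2, rho_a, rho_c)   # true ratio vs the two estimators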
Problem 4.2 Suppose the priors π1 and π2 belong to the parametric family (πi(θ) = π(θ|λi)), and denote by c(λi) the corresponding normalizing constant, so that πi(θ) = π(θ|λi) =
π ˜ (θ|λi ) c (λi )
(a) We derive the fundamental identity for path sampling Z d 1 d π ˜ (θ|λ) µ(dθ) log c (λ) = dλ c (λ) dλ Z π (θ|λ) d π ˜ (θ|λ) µ(dθ) = π ˜ (θ|λ) dλ Z d log π ˜ (θ|λ)µ(dθ) = π (θ|λ) dλ d = Eλ log π ˜ (θ|λ) . dλ Now set
d log π ˜ (θ|λ) . dλ Then, by integrating out λ ∈ [λ1 , λ2 ], we obtain U (θ, λ) =
− log
c (λ1 ) c (λ2 )
=
Zλ2
Eλ [U (θ, λ)] dλ ,
λ1
but if we consider λ as a random variable distributed from a uniform distribution π (λ) on [λ1 , λ2 ], we can rewrite previous equation as follows Zλ2 Z
U (θ, λ) π(θ|λ)
λ1
π (λ) U (θ, λ) dλ = E π (λ) π (λ)
(b) The expectation in the last equation can be approximated through simulation n c (λ2 ) c (λ1 ) U (θ, λ) ∼ 1 X U (θi , λi ) ξ = log = − log =E = c (λ1 ) c (λ2 ) π (λ) n i=1 π (λi ) where (θi , λi ) ∼ π (θ|λ) π (λ) (c) A way of deriving the optimal prior is to study the variance of the estimator of ξ. If (θi , λi ) are independent, the variance of the Monte Carlo estimator is λ Z 2 Z U 2 (θ, λ) 1 2 π (θ, λ) π (λ) µ (dθ) dλ − ξ var ξ˜ = n π 2 (λ) λ1
λ Z2 1 Eλ U 2 (θ, λ) = dλ − ξ 2 n π (λ) λ1
by the Cauchy-Schwartz inequality Zλ2
λ1
!2 Zλ2 p Zλ2 p 2 Eλ U 2 (θ, λ) Eλ (U 2 (θ, λ)) p dλ π (λ) dλ dλ = π (λ) π (λ) λ1
λ1
2 λ Z 2p Eλ (U 2 (θ, λ)) dλ ≥ λ1
Therefore the optimal π (λ) is obtained when the equality holds in the previous inequality, that is
p
⋆
π (λ) =
λ R2 p
Eλ (U 2 (θ, λ))
Eλ (U 2 (θ, λ)) dλ
λ1
If the normalizing constant c(λ) = c does not depend on λ, then ξ = 0 and the optimal prior is the Jeffreys prior density based on π(θ|λ) with λ ∈ [λ1, λ2]. In general, the optimal prior is a generalised Jeffreys prior based on π̃(θ|λ).
Problem 4.6 (a) For the accept-reject algorithm,
(X1, . . . , Xm) ∼ f(x) i.i.d., (U1, . . . , UN) ∼ U[0,1] i.i.d., (Y1, . . . , YN) ∼ g(y),
and the acceptance weights are $w_j = f(Y_j)/\{M g(Y_j)\}$. $N$ is the stopping time associated with these variables, that is, $Y_N = X_m$. We have
$$ \rho_i = P(U_i \le w_i \mid N = n, Y_1, \dots, Y_n)
 = \frac{P(U_i \le w_i, N = n, Y_1, \dots, Y_n)}{P(N = n, Y_1, \dots, Y_n)} , $$
where the numerator is the probability that $Y_N$ is accepted as $X_m$, $Y_i$ is accepted as one of the $X_j$'s, and $(m-2)$ further $X_j$'s are chosen from the remaining $(n-2)$ $Y_\ell$'s. Since $P(Y_j \text{ is accepted}) = P(U_j \le w_j) = w_j$, the numerator is
$$ w_i \sum_{(i_1,\dots,i_{m-2})} \prod_{j=1}^{m-2} w_{i_j} \prod_{j=m-1}^{n-2} (1 - w_{i_j}) , $$
where Qm−2 i) j=1 wij is the probability that among the N Yj ’s, in addition to both YN and Yi being accepted, there are (m − 2) other Yj ’s accepted as Xℓ ’s; Qn−2 ii) j=m−1 (1 − wij ) is the probability that there are (n − m) rejected Yj ’s, given that Yi and YN are accepted; iii) the sum is over all subsets of (1, . . . , i − 1, i + 1, . . . , n) since, except for Yi and YN , other (m − 2) Yj ’s are chosen uniformly from (n − 2) Yj ’s.
Similarly the denominator P (N = n, Y1 , . . . , Yn ) = wi
X
m−1 Y
(i1 ,...,im−1 ) j=1
wij
n−1 Y
(1 − wij )
j=m
is the probability that YN is accepted as Xm and (m − 1) other Xj ’s are chosen from (n − 1) Yℓ ’s. Thus ρi = P (Ui ≤ wi |N = n, Y1 , . . . , Yn ) Qn−2 P Qm−2 j=m−1 (1 − wij ) (i1 ,...,im−2 ) j=1 wij = wi P Qm−1 Qn−1 (i1 ,...,im−1 ) j=1 wij j=m−1 (1 − wij )
(b) To check that
Sk (m) = wm Sk−1 (m − 1) + (1 − wm )Sk (m − 1) i Ski (m) = wm Sk−1 (m − 1) + (1 − wm )Ski (m − 1) note that Sk (m) =
k Y
X
(i1 ,...,ik ) j=1
wij
m Y
(1 − wij ) ,
j=k+1
since Sk (m) is the probability that k Yj ’s are accepted as Xi ’s while m Yj ’s are rejected, each Yj being accepted with probability wij and rejected with probability (1 − wij ). If we write Sk (m) = P (A), with the event A decomposed as A = B ∪C, B corresponding to the acceptance of Ym , with (k − 1) Yj ’s accepted for j < m, and C to the rejection of Ym , with (k) Yj ’s accepted for j < m. Then P (B) = wm Sk−1 (m − 1)
P (C) = (1 − wm )Sk (m − 1)
and Sk (m) = wm Sk−1 (m − 1) + (1 − wm )Sk (m − 1) (c) m
N
δ1 =
1 X 1 X h (Xi ) = h (Yj ) IUj ≤wj m i=1 m j=1
δ2 =
1 X 1 X E IUj ≤wj |N, Y1 , . . . , YN h (Yj ) = ρi h (Yi ) m j=1 m i=1
N
Since E (E (X|Y )) = E (X),
N
N X 1 E (δ2 ) = E E IUj ≤wj |N, Y1 , . . . , YN m j=1 N 1 X E IUj ≤wj h (Yj ) m j=1 N X 1 = E h (Yj ) IUj ≤wj = E (δ1 ) m j=1
=
Under quadratic loss, the risk of δ1 and δ2 are: 2
R (δ1 ) = E (δ1 − Eh (X)) 2 = E δ12 + E (E(h(X))) − 2E (δ1 E(h(X))) 2
2
= var (δ1 ) − (E(δ1 )) + E (E (h(X))) − 2E (δ1 E (h(X)))
and 2
R (δ2 ) = E (δ2 − Eh (X)) 2 = E δ22 + E (E(h(X))) − 2E (δ2 E(h(X))) 2
2
= var (δ2 ) − (E(δ2 )) + E (E (h(X))) − 2E (δ2 E (h(X)))
Since E(δ1) = E(δ2), we only need to compare var(δ1) and var(δ2). From the definition of δ1 and δ2, we have δ2(X) = E(δ1(X)|Y), so var(δ2) = var(E(δ1|Y)) ≤ var(δ1).
Problem 4.12 (a) Let δ2 be the estimator of I,
$$ \delta_2 = \sum_{i=1}^{n-1} (X_{(i+1)} - X_{(i)})\, \frac{h(X_{(i)})f(X_{(i)}) + h(X_{(i+1)})f(X_{(i+1)})}{2} . $$
The difference I − δ2 can be written as
$$ \sum_{i=1}^{n-1} \int_{X_{(i)}}^{X_{(i+1)}} \left[ h(x)f(x) - \frac{h(X_{(i)})f(X_{(i)}) + h(X_{(i+1)})f(X_{(i+1)})}{2} \right] dx
 + \int_{-\infty}^{X_{(1)}} h(x)f(x)\,dx + \int_{X_{(n)}}^{\infty} h(x)f(x)\,dx . $$
First, we study the remainders. The distribution of X(1) is given by
P (X(1) ≥ x) = P (X1 ≥ x, ..., Xn ≥ x) n Z ∞ f (t)dt . = x
Rx The same decomposition for X(n) gives P (X(n) ≤ x) = ( −∞ f (t)dt)n . Then an integration by parts gives "Z # n Z ∞ Z ∞ X(1) p f (t)dt dx1 , |h(x1 )|f (x1 ) E |h(x)f (x)| dx ≤ pM0p−1 −∞
−∞
where M0 =
R
x1
|h(x)|f (x)dx and p is a positive integer. The term n Z ∞ f (t)dt |h(x1 )| f (x1 ) x1
in the above integral is dominated by |h(x1 )f (x1 )| and goes to 0 when n goes to infinity. The Lebesgue convergence theorem then implies that ! Z X(1)
E
−∞
|h(x)f (x)|p dx
goes to 0 when n goes to infinity. The same result takes place for the other remainder. We now study the interesting part of the estimator. Assuming that f and h have bounded second derivatives, the first-order expansion gives h(x)f (x)−h(X(i) )f (X(i) ) = (x−X(i) )(hf )′ (X(i) )+(x−X(i) )2 /2(hf )′′ (ξ1 ), and h(X(i+1) )f (X(i+1) ) − h(x)f (x) = (x − X(i+1) )(hf )′ (X(i+1) )
+ (x − X(i+1) )2 /2(hf )′′ (ξ2 ),
where ξ1 ∈ [X(i) , x] and ξ2 ∈ [x, X(i+1) ]. Thus, Z
X(i+1)
X(i)
h(x)f (x) −
h(X(i) )f (X(i) ) + h(X(i+1) )f (X(i+1) ) dx = 2
c(X(i+1) − X(i) )3 [(hf )′′ (ξ1 ) + (hf )′′ (ξ3 ) + (hf )′′ (ξ3 )],
where ξ3 ∈ [X(i) , X(i+1) ] and c is a constant. Let M = 3|c| sup |(hf )′′ | and δ2′ be the estimator δ without its remainders. Then
var(δ2′ ) ≤ M 2 E ≤ M2 +2
n X i=1
( n X i=1
Problem 4.15 The quantity of interest is the tail probability $P(X > a) = \int_a^{\infty} f(x)\,dx$. The natural estimator to consider is $\delta_1 = \frac{1}{n}\sum_{i=1}^{n} I(X_i > a)$, where the $X_i$'s are i.i.d. from $f$. Suppose now that $f$ is symmetric or, more generally, that for some parameter $\mu$ we know the value of $P(X < \mu)$. We can then introduce $\delta_3 = \frac{1}{n}\sum_{i=1}^{n} I(X_i > \mu)$ and build the control variate estimator
$$ \delta_2 = \delta_1 + \beta\,(\delta_3 - P(X > \mu)) . $$
Since var(δ2) = var(δ1) + β²var(δ3) + 2β cov(δ1, δ3), and
$$ \mathrm{cov}(\delta_1, \delta_3) = \frac{1}{n}\,P(X > a)\,(1 - P(X > \mu)), \qquad
 \mathrm{var}(\delta_3) = \frac{1}{n}\,P(X > \mu)\,(1 - P(X > \mu)), $$
it follows that δ2 will be an improvement over δ1 if β < 0 and
$$ |\beta| < \frac{2\,\mathrm{cov}(\delta_1, \delta_3)}{\mathrm{var}(\delta_3)} . $$
(a) We want to calculate var(δ3), where $\delta_3 = \frac{1}{n}\sum_{i=1}^{n} I(X_i > \mu)$. But var(δ3) = E(δ3²) − (E[δ3])², and
$$ \delta_3^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} I(X_i > \mu)\, I(X_j > \mu) , $$
so we have that n
E (δ3 ) =
n
1X 1X E (I (Xi > µ)) = P (X > µ) = P (X > µ) n i=1 n i=1
and n n 1 XX E (I (Xi > µ) I (Xj > µ)) E δ32 = 2 n i=1 j=1
=
n n n 1 X 1 XX E (I (Xi > µ)) E (I (Xj > µ)) E (I (X > µ)) + i n2 i=1 n2 i=1 j=1 i6=j
$$ = \frac{1}{n^2}\left[ n\,P(X > \mu) + (n^2 - n)\,P(X > \mu)^2 \right] $$
so we finally have var (δ3 ) = =
1 2 2 2 − P (X > µ) nP (X > µ) + (n − n)P (X > µ) n2
1 P (X > µ) (1 − P (X > µ)) n
(b) Since var(δ3) ≥ 0 and cov(δ1, δ3) ≥ 0, β must be negative in order for δ2 to improve on δ1. And, given that β < 0, β²var(δ3) + 2β cov(δ1, δ3) < 0 if and only if |β| < 2 cov(δ1, δ3)/var(δ3).
(c) In order to find P(X > a) for a = 3, 5, 7, we first simulated 10^6 N(0,1) random variables. Table B.7 gives the results obtained by applying the acceleration method described before.

Table B.7. Comparison between estimators in Problem 4.15: normal case.
  a   delta_1       mu           delta_3       beta     P(X > a)
  3   1.3346x10^-3  0.5          0.4999876     -0.004   1.33x10^-3
  5   1.2x10^-6     1.33x10^-3   1.3346x10^-3  -0.004   1.18x10^-6
  7   0.2x10^-6     1.18x10^-6   1.2x10^-6     -0.004   2x10^-7
(d) In order to find P(X > a) for a = 3, 5, 7, we first simulated 50,000 T5 random variables. Table B.8 gives the results obtained by applying the acceleration method described before.

Table B.8. Comparison between estimators in Problem 4.15: Student's t case.
  a   delta_1      mu           delta_3      beta    P(X > a)
  3   2.73x10^-2   0.5          4.99x10^-1   -0.03   2.733x10^-2
  5   8.56x10^-3   2.733x10^-2  2.73x10^-2   -0.03   8.56x10^-3
  7   3.82x10^-3   8.56x10^-3   8.56x10^-3   -0.03   3.82x10^-3
(e) In order to obtain a, the same simulation as in question (d) can be used, and we obtain the results presented in Table B.9.

Table B.9. Comparison between estimators in Problem 4.15: choice of a.
  P(X > a)   mu    beta      delta_3      delta_1      a
  0.01       0.5   -0.03     4.99x10^-1   9.97x10^-3   4.677
  0.001      0.5   -0.003    4.99x10^-1   9.97x10^-4   13.22
  0.0001     0.5   -0.0003   4.99x10^-1   9.97x10^-5   43.25
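A minimal Python sketch of the control-variate construction behind Table B.7 follows (it is not the code used to produce the table); the sample size, seed, and the value β = −0.004 are taken from the table purely for illustration.

import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.standard_normal(n)
mu, beta = 0.0, -0.004          # mu with known P(X > mu) = 0.5; beta as in Table B.7

for a in (3.0, 5.0, 7.0):
    d1 = np.mean(x > a)                 # crude estimator of P(X > a)
    d3 = np.mean(x > mu)                # control variate with known mean 0.5
    d2 = d1 + beta * (d3 - 0.5)         # control-variate estimator
    print(a, d1, d3, d2)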
Problem 4.18 R (y) (a) Let us transform I into I = h(y)f m(y) m(y)dy, where m is the marginal density of Y1 . We have Z X h(y)f (y) I= P (N = n) m(y|N = n) n∈N h(y)f (y) |N . = EN E m(y) (b) As β is constant, for every function c, h(Y )f (Y ) I = βE[c(Y )] + E − βc(Y ) . m(Y ) (c) The variance associated with an empirical mean of the h(Yi )f (Yi ) − βc(Yi ) m(Yi ) is var(b I) = β 2 var(c(Y )) + var
h(Y )f (Y ) m(Y )
− 2βcov
h(Y )f (Y ) , c(Y ) m(Y )
= β 2 var(c(Y )) − 2βcov[d(Y ), c(Y )] + var(d(Y )).
Thus, the optimal choice of β is such that ∂var(b I) =0 ∂β
and is given by
cov[d(Y ), c(Y )] . var(c(Y )) (d) The first choice of c is c(y) = I{y>y0 } , which is interesting when p = P (Y > y0 ) is known. In this case, R R R R hf hf − y>y0 hf y>y0 m y>y0 y>y0 ∗ R R = . β = 2 p m − ( y>y0 m) y>y0 β∗ =
Thus, β ∗ can be estimated using the Accept-reject sample. A second choice of c is c(y) = y, which leads to the two first moments of Y . When those two moments m1 and m2 are known or can be well approximated, the optimal choice of β is R yh(y)f (y)dy − Im1 . β∗ = m2 and can be estimated using R the same sample or another instrumental density namely when I′ = yh(y)f (y)dy is simple to compute, compared to I.
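As a complement, here is a hedged Python sketch of estimating the optimal coefficient β* = cov(d(Y), c(Y))/var(c(Y)) from the same sample, for the choice c(y) = I{y > y0}; the target f, instrumental density m, function h, and threshold y0 below are toy assumptions, not taken from the text.

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)
# Assumed toy setup: f = N(0,1), m = N(0, 2^2), h(x) = x^2, c(y) = I(y > y0)
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
m_pdf = lambda x: np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)
h = lambda x: x**2
y0 = 1.0
p = 0.5 * (1 - erf(y0 / (2 * sqrt(2))))   # known p = P_m(Y > y0)

y = rng.normal(0.0, 2.0, 100_000)
d = h(y) * f(y) / m_pdf(y)                # importance-sampling terms d(Y)
c = (y > y0).astype(float)

beta = np.cov(d, c)[0, 1] / np.var(c)     # empirical optimal beta*
plain = d.mean()
controlled = (d - beta * (c - p)).mean()
print(plain, controlled)                  # both estimate E_f[h(X)] = 1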
Problem 4.19 If t ∼ Geo(ρ), that is, $f_t(n) = \rho(1-\rho)^{n-1}\, I_{\{1,2,\dots\}}(n)$, which is in fact a translated standard geometric variable,
$$ E[t^{-1}] = \sum_{n=1}^{\infty} \frac{(1-\rho)^{n-1}\rho}{n}
 = \frac{\rho}{1-\rho}\sum_{n=1}^{\infty}\frac{(1-\rho)^{n}}{n}
 = -\,\frac{\rho\log\rho}{1-\rho}, $$
and
$$ E[t^{-2}] = \sum_{n=1}^{\infty} \frac{(1-\rho)^{n-1}\rho}{n^{2}}
 = \frac{\rho}{1-\rho}\sum_{n=1}^{\infty}\frac{(1-\rho)^{n}}{n^{2}}
 = \frac{\rho\,\mathrm{Li}(1-\rho)}{1-\rho}, $$
where
$$ \mathrm{Li}(x) = \sum_{k=1}^{\infty}\frac{x^{k}}{k^{2}} $$
is the dilogarithm function.
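These closed forms are easy to check by simulation. The following Python sketch (seed, sample size, and ρ chosen arbitrarily) compares Monte Carlo estimates of E[1/t] and E[1/t²] with the formulas above, using a truncated series for the dilogarithm.

import numpy as np

rng = np.random.default_rng(3)
rho = 0.3
t = rng.geometric(rho, size=1_000_000)               # support {1, 2, ...}

li = sum((1 - rho)**k / k**2 for k in range(1, 200))  # truncated Li(1 - rho)
print(np.mean(1.0 / t),    -rho * np.log(rho) / (1.0 - rho))
print(np.mean(1.0 / t**2),  rho * li / (1.0 - rho))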
Problem 4.21 We have
$$ \mathrm{Li}(1-\rho) = \sum_{k=1}^{\infty}\frac{(1-\rho)^{k}}{k^{2}} . $$
Thus,
$$ \frac{\rho\,\mathrm{Li}(1-\rho)}{1-\rho}
 = \rho\sum_{k=1}^{\infty}\frac{(1-\rho)^{k-1}}{k^{2}}
 = \rho + \rho(1-\rho)\sum_{k=0}^{\infty}\frac{(1-\rho)^{k}}{(k+2)^{2}} . $$
Every term in the last sum is dominated by $1/(k+2)^{2}$, and this sum is bounded from above (by $\pi^{2}/6$). Therefore,
$$ \lim_{\rho\to 1}\frac{\rho\,\mathrm{Li}(1-\rho)}{1-\rho} = 1 . $$
Moreover, $\mathrm{Li}(1-\rho) \sim (1-\rho)$ as $\rho\to 1$, so
$$ \lim_{\rho\to 1}\log(\rho)\,\mathrm{Li}(1-\rho) = 0 . $$
Problem 4.22 The advantage of the stratified estimator is that it allows for observations to be assigned to segments of the integral which have the most variability. Let us imagine the integral partitioned into p segments with unequal variability. Rather than allocate the same number of draws to each partition, we could sample more frequently from partitions with greater variance to improve the variance of the overall estimation. Hence the following relation connecting the number of observations with the variability of the partition Z 2 nj ∝ p2j (h(x) − Ij ) fj (x)dx , χj
with Ij defined as
Z
h(x)fj (x)dx .
χj
Here fj (x) denotes the distribution f restricted to the interval χj . In the stratified estimator given in the question, nj is set equal to n. Thus no improvement is possible: it is as if the variances of all partitions were all the same. The ∝ can be replaced with some constant and an equality. Using the fact that the χj ’s form a partition, we get Z p X 1 2 2 pj (h(x) − Ij ) fj (x)dx n χj j=1 j !2 Z Z p 1X 2 f (x)dx (h(x) − Ij ) fj (x)dx = n j=1 χj χj Z 2 Z 1 2 f (x)dx (h(x) − I) f (x)dx = np χ χ Z 1 2 (h(x) − I) f (x)dx = np which is exactly the variance for the standard Monte Carlo estimator. This would not be possible if nj 6= n. If the pj ’s are unknown and must be estimated the formula for conditional variance var(x) = var (E(x|y)) + E (var(x|y)) can be used to show there is no improvement (where the y’s are the pj ’s and index the various partitions. These can be thought of as increasing what partition is under consideration with a multinomial distribution. The variable x represents the integral of χj . The quantity var(x|y) is the variance of the former across the partitions.) Since E (var(x|y)) is nonnegative, having the pj ’s unknown cannot improve variance estimation.
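To make the allocation argument concrete, here is a small Python sketch (an illustration only, with an assumed integrand h(x) = x², f = U(0,1), four equal-probability strata, and arbitrary sample sizes) that compares the equal allocation n_j = n with an allocation that follows the proportionality relation stated above.

import numpy as np

rng = np.random.default_rng(4)
h = lambda x: x**2                     # toy integrand, f = U(0,1), I = 1/3
edges = np.linspace(0.0, 1.0, 5)       # 4 equal-probability strata, p_j = 1/4
p = np.diff(edges)
n_total, reps = 4_000, 500

def stratified_var(alloc):
    est = []
    for _ in range(reps):
        parts = [p[j] * h(rng.uniform(edges[j], edges[j + 1], alloc[j])).mean()
                 for j in range(len(p))]
        est.append(sum(parts))
    return np.var(est)

equal = np.full(len(p), n_total // len(p))            # n_j = n for every stratum
sd = np.array([h(rng.uniform(edges[j], edges[j + 1], 10_000)).std()
               for j in range(len(p))])
w = p**2 * sd**2                                      # n_j proportional to p_j^2 * var_j
alloc = np.maximum((w / w.sum() * n_total).astype(int), 1)

print(stratified_var(equal), stratified_var(alloc))   # unequal allocation does better here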
Chapter 5 Problem 5.7 (a) If X|y ∼ N (0, ν/y) and Y ∼ χ2ν , then the likelihood is L(y|x) = f (x|y)π(y) 1 1 x2 = p y (ν−2)/2 e−y/2 exp − 2ν/y 2ν/2 Γ (ν/2) 2πν/y 1 x2 1 (ν−1)/2 y , y exp − + = √ 2 2ν 2πν2ν/2 Γ (ν/2) and the marginal distribution of X is Z m(x) = L(y|x)dy Γ ((ν + 1)/2) = (πν)1/2 Γ (ν/2)
x2 1+ ν
−(ν+1)/2
,
that is X ∼ Tν . (b) For X ∼ Tν , Z ∼ Ga((α − 1)/2, α/2) and a given function h(x), let H(x, z) =
Γ ((α − 1)/2) −α/2 (α/2−1/h(x))z z [e I{h(x)>0} −e(α/2+1/h(x))z I{h(x) 2k pk (1 − p)k and so m11 = ∞. Thus, the random walk is transient if p > 12 . (c) If p = 21 , the random walk is not positive recurrent, neither transient so it is null recurrent. Problem 6.35 (a) Since P (Xi+1 = 1|Xi = −1) = 1 and P (Xi+1 = −1|Xi = 1) = 1, the transition matrix is 01 Q= 10 and (1/2 1/2)Q = (1/2 1/2) implies that P0 is indeed stationary. (b) Since X0 = Xk if k is even and X0 = −Xk if k is odd, we get that cov(X0 , Xk ) = (−1)k for all k’s. (c) As stated, the question ‘The Markov chain is not strictly positive. Verify this by exhibiting a set that has positive unconditional probability but zero conditional probability.’ is not correct. The fundamental property of the chain is that it is periodic and hence cannot be ergodic since the limiting distributions of (X2k ) and (X2k+1 ) are Dirac masses at 1 and −1, respectively. From this point of view, the two limiting distributions are mutually exclusive. Problem 6.38 Since (an ) is a sequence of real numbers converging to a, for all ǫ > 0, there exists n0 such that, when n > n0 , |an − a| < ǫ. Therefore, 1 |(a1 − a) + · · · + (an0 − a) + · · · + (an − a)| n 1 1 ≤ |(a1 − a) + · · · + (an0 − a)| + |(an0 +1 − a) + · · · + (an − a)| n n n − n0 1 ǫ ≤ |(a1 − a) + · · · + (an0 − a)| + n n
|bn − a| =
and the first term in the bound goes to 0 as n goes to infinity. For n > n1 , we thus have n − n0 |bn − a| ≤ ǫ + ǫ ≤ 2ǫ . n This concludes the proof.
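A one-line numerical check of this Cesàro argument (the sequence below is an arbitrary choice converging to a = 2) can be written as follows.

import numpy as np

n = np.arange(1, 100_001)
a = 2.0 + np.sin(n) / n        # a_n -> 2
b = np.cumsum(a) / n           # Cesàro averages b_n = (a_1 + ... + a_n)/n
print(a[-1], b[-1])            # both approach 2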
Problem 6.39 Define Sn−1 (aj , bn−j ) =
n−1 X
′ aj bn−j and Sn−1 (a∗ , bj ) = a∗
j=1
n−1 X
bj .
j=1
We simply need to show that both sequences Sn and Sn′ converge to the same limit. The difference is Sn (aj , bn−j ) − Sn′ (a∗ , bj ) = =
n−1 X j=1
n−1 X j=1
=
n−1 X j=1
so ±
n−1 X j=1
aj bn−j − aj bn−j −
n−1 X j=1
n−1 X
∃N : ∀n > N
bn−j (aj − a∗ )
|an −a∗ | < ǫ and
P
n−1 X
j≥1 bj
±ǫM ≤ lim (Sn (aj , bn−j ) − Sn′ (a∗ , bj )) ≤ ±ǫM n→∞
and
a∗ bn−j
j=1
bn−j |aj − a∗ | ≤ Sn (aj , bn−j ) − Sn′ (a∗ , bj ) ≤ ±
Since ∀ǫ > 0, we get
a∗ bj
j=1
bn−j |aj − a∗ | .
converges as n → ∞,
where
M=
j=1
(Sn (aj , bn−j ) − Sn′ (a∗ , bj )) −→ 0, n → ∞
i.e.
lim Sn (aj , bn−j ) = lim Sn′ (a∗ , bj ) = a∗
n→∞
n→∞
∞ X
bj .
j=1
Problem 6.42 (a) The probability of {Z(n) = 1} is Pq (Z(n) = 1) = Pq (∃j; Sj = n) ∞ X Pq (Sj = n) = j=0
=
∞ X j=0
∞ X
q ⋆ pj⋆ (n) = q ⋆ u(n).
bn−j
(b) We have Tpq = min{j; Zq (j) = Zp (j) = 1}, then P (Tpq > n) = P (Zq (j) = Zp (j) = 0; 1 ≤ j ≤ n) n Y [P (Zq (j) = 0)P (Zp (j) = 0)] = = =
j=0 n Y
j=0 n Y
j=0
[{1 − Pq (Z(j) = 1)}{1 − Pp (Z(j) = 1)}] [{1 − q ⋆ u(j)}{1 − p ⋆ u(j)}]
(c) Suppose the mean renewal time is finite, that is, mp =
∞ X
n=0
and let
np(n) < ∞
∞ 1 X p(j). e(n) = mp j=n+1
For every n, we have Pe (Z(n) = 1) = e ⋆ u(n) n X e(k)u(n − k) = k=0
=
∞ n 1 X X p(j)u(n − k). mp k=0 j=k+1
Interchanging the two sums in the last quantity leads to
$$ P_e(Z(n) = 1) = \frac{1}{m_p}\sum_{j=1}^{n+1} p(j)\left[ \sum_{k=0}^{j-1} u(n-k) \right] . $$
(d) Since e is the invariant distribution of the renewal process, Pp (Z(n) = 1) tends to m1p when n goes to infinity and mp is finite. On the other hand, Lemma 6.49 says that Tpq is almost surely finite, that is, that, lim P (Tpq > n) = 0 .
n→∞
Therefore, the result (b) becomes
1 = 0, lim q ⋆ u(n) − n→∞ mp
provided that the mean renewal time is finite. Problem 6.47
(a) Let B = P−A. Since P and A commute, for every non negative n, Pn A = A and A2 = A, we have Bn =
n X
Cnk (−1)n−k Pk An−k
k=0
= Pn +
n−1 X
Cnk (−1)n−k A
k=0
= Pn − A
Then B n tends to 0 when n goes to infinity. An algebraic result implies that the spectral radius of B is ρ(B) < 1 and I − B is invertible. So Z exists. (b) The algebraic result applied above implies further that the inverse of I −B is ∞ X Bn. (I − B)−1 = n=0
Then,
Z=
∞ X
n=0
(Pn − A),
since B n = Pn − A. (c) As shown in Problem 6.10, πP = π and πA = π, we have πZ = π +
∞ X
n=1
(πPn − πA) = π.
Moreover, as P and A commute, P and Z commute. Problem 6.49 (a) For every initial distribution, Eµ [Nj (n)] =
n−1 X
(k)
pij
k=0
thus Eµ [Nj (n)] − nπj =
n−1 X k=0
P k − A −→ Z − A ;
therefore, µ (Eµ [Nj (n)] − nπj ) −→ µ (Z − A) = µ (Z) − A . (b) Proceeding similarly as in (a), we get that Eµ [Nj (n)] − Eν [Nj (n)] = (Eµ [Nj (n)] − nπj ) − (Eν [Nj (n)] − nπj ) −→ (µ(Z) − α) − (ν(Z) − α) = (µ − ν)Z ,
where α is the limiting vector of the fundamental matrix. (c) Using the question above and choosing particular states u, v, one can deduce that Eu [Nj (n)] − nπj −→ zuj − πj so that
Eu [Nj (n)] − Ev [Nj (n)] −→ (zuj − πj ) − (zvj − πj ) = zuj − zvj Problem 6.51 The stationary distribution satisfies πP = π, which leads, in this case to (1 − α)π1 + βπ2 = π1 απ1 + (1 − β)π2 = π2 . Considering the distribution condition π1 + π2 = 1, the system solves into π1 =
β α , π2 = . α+β α+β
The probability distribution of the number of steps before entering state 1 is given by P1 (f1 = 1) = 1 − α and P1 (f1 = n) = αβ(1 − β)n−2 , when starting from state 1, and by P2 (f1 = n) = β(1−β)n−1 , for every n ≥ 1, when starting from state 2. Therefore, m1,1 = E1 [f1 ] ∞ X nP1 (f1 = n) = n=0
= 1 − α + αβ
∞ X
n=2
n(1 − β)n−2
# " ∞ X ∂ n n(1 − β) − = 1 − α + αβ(1 − β) ∂β n=2 ∂ 1 = 1 − α − αβ(1 − β)−1 + 1 + (1 − β) ∂β β α+β = β −1
and, by a similar computation m21 = E2 [f1 ] = m22 =
α+β , α
m12 =
Finally, α+β 1 β α α+β 1 β α
M=
1 β.
!
By symmetry, 1 . α
.
Problem 6.54 Let P=
1−α α β 1−β
be the transition matrix of a two-state ergodic chain. The ergodicity implies that α and β are not null together, that is that, α + β > 0. The stationary distribution is α β , , π= α+β α+β ˜ = P and P is reversible. and then we obtain P For an ergodic chain with symmetric transition matrix and state space equal to {1, ..., n}, the invariant distribution is 1 1 π= , , ..., n n and so
πj pji = pji = pij . πi For the given matrix, let π = (0.1, 0.2, 0.4, 0.2, 0.1). An easy computation gives πP = π and hence, π is the invariant distribution of P. Since, p˜12 = π2 0.2 π1 p21 = 0.1 0.5 = 1 6= p12 = 0, P is not reversible. p˜ij =
Problem 6.67 ¯ (a) For X, ¯ = var(X)
X 1 1 var(X1 ) + cov(Xi , Xj ) 2 n+1 (n + 1) i,j;i6=j
2
=
σ 1 + n + 1 (n + 1)2
X
ρj−i .
i,j;i6=j
If ρj−i = ρ > 0, ¯ = var(X)
n 1 var(X1 ) + ρ ∼n→∞ ρ. n+1 (n + 1)
¯ is not consistent. This implies that X
(b) If |ρj−i | ≤ M γ |j−i| , with 0 < γ < 1 2 X ¯ − σ ≤ 2M var(X) γ j−i 2 n+1 (n + 1) iM > M and g(yt ) g(yt )f (x(t) ) g(x(t) ) M g(x(t) ) ( ) f (yt ) f (x(t) ) f (yt )f˜(x(t) ) if >M < M and (t) min 1, = g(yt ) f (x ) g(x(t) ) (t) ˜ f (x )f (yt ) f (yt ) f (x(t) ) f (yt ) M and M g(yt ) g(yt ) g(x(t) ) 1 otherwise. Problem 7.11
(a) In order to verify the identity (7.10), we write Z Z (K(u, y) − f (y))dy (K(x, u) − f (u))du E A Z Z = [K(x, u)K(u, y) − f (y)K(x, u) − K(u, y)f (u) + f (y)f (u)]dydx. E
A
On the other hand, we have the following simplifications R RE K(x, u)K(u, y)du = K 2 (x, y) K(x, u)du = 1 RE f (u)du = 1. E Moreover, the detailed balance condition
K(u, y)f (u) = K(y, u)f (y) implies
Z
K(u, y)f (u)du =
E
Z
f (y)K(y, u)du = f (y).
E
Finally, we obtain the identity (7.10), Z Z Z 2 K (x, y)f (y)dy = (K(u, y) − f (y))dy (K(x, u) − f (u))du. A
Using (7.9), it follows
E
A
Z (K(u, y) − f (y))dy ≤ 1 − 1 , M A
for every u in E. Then, we have the majoration Z Z (K 2 (x, y) − f (y))dy ≤ 1 − 1 (K(x, u) − f (u))du M E A 2 1 ≤ 2 1− . M and finally the majoration (7.11) kK 2 (x, .) − f kT V ≤ 2(1 −
1 2 ) . M
(b) Using the fact that K n+1 (x, y) = K n ⋆ K(x, y) =
Z
K n (x, u)K(u, y)du,
E
the same arguments as the previous proof lead to the relation (7.12). Now, we prove (7.8) by recurrence. It holds for n = 2. Suppose that it holds for a certain integer n, we get Z n Z (K(x, u) − f (u))du . (K n+1 (x, y) − f (y))dy ≤ 1 − 1 M E A R 1 ) as before, gives the maMajorating | E (K(x, u) − f (u))du| by 2(1 − M joration (7.8) for n + 1. Finally, the majoration holds for every n. Problem 7.17 (a) As d log f (x) < 0, dx there exists a certain A > 0 such that, for every x larger than A, d log f (x) − ϕ¯ ≤ − ϕ¯ , dx 2 ϕ¯ = lim
x→∞
and in particular,
d ϕ¯ log f (x) ≤ . dx 2 On the other hand, we have, for x < y large enough (greater than A) Z y d ϕ¯ log f (y) − log f (x) = log f (t)dt ≤ (y − x). dt 2 x As f is symmetric, (7.16) holds for f with x1 = A and α = − ϕ2¯ > 0 so, f is log-concave and hence, we can apply Theorem 7.25 in this case.
(b) Suppose ϕ¯ = 0. For every δ > 0, there exists a certain B > 0 such that, for every x > B, we get d log f (x) ≤ δ, dx and in particular
d log f (x) ≥ −δ. dx Therefore, for every x large enough and every positive z, Z x+z d log f (t)dt ≥ −δz. log f (x + z) − log f (x) = dt x
Evaluating the exponential of the inequality gives f (x + z) exp (δz) ≥ f (x). Now, integrating out the z’s and using the change of variables y = x + z, the last inequality becomes Z ∞ Z ∞ Z ∞ f (x)dz = ∞. f (x)eδy dy ≥ f (x + z)eδz dz = e−δx x
0
0
Therefore, (7.18) fails and the chain is not geometrically ergodic. Problem 7.18 Example 7.16 is a direct application of Theorem 7.15. 2
(a) For ϕ(x) ∝ exp (− x2 ), limx→∞ log ϕ(x) = −∞. Then, the inequality of Problem 7.17 (a) holds for every ϕ¯ < 0 and x < y large enough. Therefore, the chain associated with ϕ is geometrically ergodic. (b) For ψ(x) ∝ (1 + |x|)−3 , limx→∞ log ψ(x) = 0. And the result (b) of Problem 7.17 implies that the chain associated with ψ is not geometrically ergodic. Problem 7.24 When f is the normal N (0, 1) density, and q(.|x) is the uniform U[−x−δ, −x+ δ], the Metropolis-Hastings algorithm writes
Algorithm A.64 x(t)
Given
1. Generate yt ∼ U[−x(t) − δ, −x(t) + δ] 2. Take (t+1)
x
=
(
yt
with probability
x(t) otherwise
o n t) min 1, ff(x(y(t) )
Problem 7.30 We consider the Metropolis-Hastings algorithm [A.24], where the Yt ’s are generated from q(y|x(t) ) and we assume that Y1 = X (1) ∼ f . (a) Let δ0 be the estimator δ0 =
T 1 X f (yt ) h(yt ). T t=1 q(yt |x(t) )
The expectation of δ0 is ( T ) X f (yt ) 1 (t) , h(yt )|x E[δ0 ] = E E T q(yt |x(t) ) t=1 using the double expectation rule E[U ] = E[E[U |V ]] for every random variables U and V . But, the conditional distribution of yt given x(t) is q(y|x(t) ), then Z f (yt ) f (y) (t) E = h(y )|x h(y)q(y|x(t) )dy t (t) q(yt |x ) q(y|x(t) ) = Ef [h(X)]. Therefore, E[δ0 ] = Ef [h(X)] and δ0 is an unbiased estimator of Ef [h(X)]. (b) We write ( ) T X f (y2 ) 1 f (yi ) h(x1 ) + h(y2) + h(yi ) δ1 = T q(y2 |x1 ) q(yi |xi ) i=3 and we transform the last term, say S, of the right hand of the last identity into its Rao-Blackwellised version. We adopt the same notation for ρij , ρ¯ij , ρij , ζij , τj and ωij . The quantity S becomes
Monte Carlo Statistical Methods
S RB
T X I(xj =yi ) f (yi )h(yi )E = |y1 , ..., yT q(y |x ) i i j=i i=3 i T X X I(xj =yi ) f (yi )h(yi )E = |y1 , ..., yT q(yi |xi ) j=1 i=3 T X
= = where
109
T X
f (yi )h(yi ) PT −1
ϕ¯i
′ i′ =1 τi ζi′ (T −1) i=3 PT ¯i i=3 f (yi )h(yi )ϕ , PT −1 τ ζ i=1 i i(T −1)
ϕ¯i =
i−1 X j=1
=
i−1 X j=1
1 τj ζj(i−1) ωji q(yi |xi ) τj ζj(i−2) (1 − ρj(i−1) )ωji .
Finally, we obtain the Rao-Blackwellized version of δ0 , ( ) PT f (y )h(y ) ϕ ¯ f (y2 ) 1 i i i i=3 h(x1 ) + . h(y2 ) + P δ1 = T −1 T q(y2 |x1 ) i=1 τi ζi(T −1)
(c) The Rao-Blackwellized estimator of Theorem 7.21 is PT ϕi h(xi ) δ RB = PTi=0 . −1 t=0 τi ζi(T −1)
In order to compare the two versions δ1 and δ RB , we study the example of the distribution T3 , with the instrumental q(y|x) = C(x, 1) and the function of interest h(x) = I(x>2) . The theoretical value of Ef [h(X)] is 0.0697.
Problem 7.31 (a) The conditional probability is τi = P (Xi = yi |y0 , ..., yT ). This probability can be written as τi =
i−1 X j=0
P (Xi = yi |Xi−1 = yj , y0 , ..., yT )P (Xi−1 = yj |y0 , ..., yT ).
110
Solution Manual
But, (Xt )t is a Markov chain, then P (Xi = yi |Xi−1 = yj , y0 , ..., yT ) = P (Xi = yi |Xi−1 yj ) = ρji P (Xi−1 = yj |y0 , ..., yT ) = P (Xi−1 = yj ) Finally, it follows that τi =
i−1 X
ρji P (Xi−1 = yj ).
j=0
Problem 7.35 We consider the target density f of the Exp(1) distribution and the conditional densities g1 and g2 of the Exp(0.1) and Exp(5) distributions, respectively. The independent Metropolis–Hastings algorithms based on (f, gi ), (i = 1, 2) take yt = x(t) + zt , where zt ∼ gi . (b) For the pair (f, g2 ), we have that, for every M > 0, the set log(5M ) {x, f (x) > M g2 (x)} = , +∞ 4 is of positive measure. Therefore, Theorem 7.8 guaranties that the generated chain is not geometrically ergodic. For the pair (f, g1 ), we have f (x) ≤ 10g1 (x). Therefore, the same theorem implies that the generated chain is uniformly ergodic and in particular geometrically ergodic. Problem 7.37 When the acceptance probability is ̺(x, y) =
s(x, y) 1+
f (x)q(y|x) f (y)q(x|y)
,
for s a positive symmetric function such that ̺(x, y) ≤ 1, the transition kernel of the chain becomes
R
K(x, y) = ̺(x, y)q(y|x) + (1 − r(x))δx (y),
where r(x) = ̺(x, y)q(y|x)dy and δx is the Dirac mass in x. This kernel satisfies the detailed balance condition K(x, y)f (x) = K(y, x)f (y). Indeed,
Monte Carlo Statistical Methods
̺(x, y)q(y|x)f (x) = =
111
q(y|x)f (x)s(x, y) 1+
f (x)q(y|x) f (y)q(x|y)
q(x|y)f (y)s(y, x) f (y)q(x|y) f (x)q(y|x)
+1
= ̺(x, y)q(x|y)f (y), because of the symmetry of s, and evidently (1 − r(x))δx (y)f (x) = (1 − r(y))δy (x)f (y). The invariance of the distribution f follows from Theorem 7.2. Problem 7.47 The posterior distribution of (7.27) is 1 5 2 5 2 θ1 θ2 π(θ1 , θ2 |y) ∝ exp − θ + θ + − (θ1 + θ2 )y . 2 4 1 4 2 2 To simulate from π(θ1 , θ2 |y), we simulate first θ2 from the marginal distribution Z π(θ2 |y) ∝ π(θ1 , θ2 |y)dθ1 3 y ∝ exp − (θ2 − )2 , 5 3 that is, θ2 ∼ N ( y3 , 65 ) and then simulate θ1 from the conditional distribution π(θ1 |θ2 , y) ∝ π(θ1 , θ2 |y) 2y − θ2 2 5 ) , ∝ exp − (θ1 − 8 5 that is, θ1 |θ2 , y ∼ N (
2y − θ2 4 , ). 5 5
Problem 7.48 The discretisation of the Langevin diffusion leads to the random walk of (7.23) x(t+1) = x(t) +
σ2 ∇ log f (x(t) ) + σεt , 2
where εt ∼ N (0, 1). The transition kernel of the chain (X (t) ) is
112
Solution Manual
K(x, y) = √
1 1 σ2 exp − [y − x − ∇ log f (x)]2 . 2 2 2πσ
Assume that
σ2 ∇ log f (x)|x|−1 x→∞ 2 exists and is smaller than −1. Denote δ < −1 this limit. Consider the potential function V (x) = (1 − e−δx )I(x≥0) . The function V is bounded and positive. The drift of V is given by Z ∆V (x) = K(x, y)V (y)dx − V (x) lim
Z
2
∞
e−z /2 √ = dz − 1 2π m1 (x) "Z 2 2 # ∞ 2 −z 2 /2 σ δ σ e √ dz − exp − δ ∇ log f (x) , + e−δx 2 2 2π m2 (x) 2
where m1 (x) = x + σ2 ∇ log f (x) and m2 (x) = x + have the following equivalences
σ2 2 ∇ log f (x)
− σ 2 δ. We
x→∞
m1 (x) ≈ (1 + δ)x x→∞
m2 (x) ≈ (1 + δ)x . As δ < −1, we obtain lim m1 (x) = −∞
x→∞
lim m2 (x) = −∞.
x→∞
Therefore, for x large enough, we have 2 2 σ δ −δx 2 1 − exp ∆V (x) ≈ e −δ x , 2 which is positive for x large enough. On the other hand, V (x) > r, for every 0 < r < 1, if, and only if, x > log (1−r) . Note that log (1−r) goes to infinity when r goes to 1 from below. δ δ Therefore, we can find r such that 0 < r < 1 and for every x for which V (x) > r, we have ∆V (x) > 0. Theorem 6.71 ensures the transience of the chain. The same argument establishes transience when the limit (7.24) exists and is larger than 1 at −∞.
Chapter 8 Problem 8.2
Monte Carlo Statistical Methods
Zx 0
113
√
1 − e 2
√
Zx √ √ u e−v vdv = 1 − (1 + x)e− x du = 0
Problem 8.7 Let x1 , . . . , xn be a sample of observations from the mixture model π(x) =
k X
pi fi (x) .
i=1
It is possible to obtain an analytical posterior distribution for the parameter p1 , . . . , pk only after a demarginalization step. Let z ∼ M(1; p1 , . . . , pk ) be an auxiliary variable and assume a Dirichlet prior p1 , . . . , pk ∼ Dk (γ1 , . . . , γk ) on the weights. Then the posterior distribution is k k n Y Y Y γ −1 z z j ji pj (fj (xi )) ji pj π (p1 , . . . , pk |x, z) ∝ i=1 j=1
∝
k n Y Y
(fj (xi ))
i=1 j=1
∝ Dk
γ1 +
X
zji
j=1
k Y
Pn
pj
i=1 zji +γj −1
j=1
z1i , . . . , γk +
i
X
zki
i
!
Note that, by a general slice sampler reasoning, the introduction of the auxiliary variable z allows for a multiplicative representation of the posterior distribution. Although it is possible to apply the Gibbs sampler to the groups of auxiliary variables and to the unknown parameters (and this is the common approach to this problem), here we focus on a general slice sampler framework. We introduce two other sets of auxiliary variables ω1 , . . . , ωnk and ω1′ , . . . , ωk′ and write the posterior as the marginal of k n Y Y
i=1 j=1
I0≤ωij ≤(fj (xi )pj )zji
k Y
j=1
I0≤ω′ ≤pγj −1 j
j
The resulting (conditional) distribution on the weights p1 , . . . , pk is then uniform over a subset of the simplex of Rk , while the distribution on the zij is uniform over the indices j such that ωij < pj fj (xi ). Problem 8.17 Let g(u) = exp −u1/d . Then
114
Solution Manual
$$ \frac{d^{2}}{du^{2}}\log g(u) = \frac{d-1}{d^{2}}\, u^{1/d-2} , $$
which is larger than zero for d > 1. Thus g(u) is not log-concave. Let g(x) = exp(−||x||). Note that for all d ≥ 1, x, x′ ∈ R^d and a ∈ [0, 1], by Minkowski's inequality, ||ax + (1 − a)x′|| ≤ a||x|| + (1 − a)||x′||. Thus log g(x) = −||x|| is a concave function.
Chapter 9 Problem 9.1 For the Gibbs sampler [A.33]: (a) Set X0 = x0 , and for t = 1, 2, . . . generate Yt ∼ fY |X (·|xt−1 )
Xt+1 ∼ fX|Y (·|yt )
where fX|Y and fY |X are the conditional distributions. By construction, we know (Xt , Yt ) only depends on (xt−1 , yt−1 ) so the pair (Xt , Yt ) is a Markov chain. We know (Xt ) is independent from Xt−2 , Xt−3 , . . ., and is simulated conditionally only on xt−1 . Thus, (Xt ) is a Markov chain. And the same argument applied to (Yt ). At last, (Xt ) has transition density Z (B.1) K (x, x∗ ) = fY |X (y|x)fX|Y (x∗ |y)dy Therefore, (Xt ) is independent from Xt−2 , Xt−3 , . . . and is simulated conditionally only on xt−1 . Thus, (Xt ) is a Markov chain. And the same argument applies to (Yt ). (b) Since (Xt ) has transition kernel (B.1), we want to show that Z π (B) = k (x, B) π (dx) But
Z
k(x, x∗ )fX (x)dx =
Z Z
fY |X (y|x)fX|Y (x∗ |y)dyfX (x)dx Z Z = fX|Y (x∗ |y) fY |X (y|x)dxdy Z Z = fX|Y (x∗ |y) fXY (x, y)dxdy Z = fXY (x∗ , y)dy = fX (x∗ )
Hence, fX is an invariant density of (Xt ). The same argument applies to (Yt ) Problem 9.5 (a) When the chains (X (t) ) and (Y (t) ) are interleaved with stationary distributions g1 and g2 respectively, we consider, for every function h ∈ L2 (gi ), (i = 1, 2), the standard estimator δ0 and its Rao-Blackwellized version δrb , where T 1X h(X (t) ), T t=1
δ0 =
δrb =
T 1X E[h(X (t) )|Y (t) ]. T t=1
Now, we compute, the variance of this two estimators. We assume, for simplicity, that Eg2 [h(X)] = 0. The variance of δ0 is ′ 1 X var(δ0 ) = 2 cov(h(X (t) ), h(X (t ) )). T ′ t,t
′
But, cov(h(X (t) ), h(X (t ) )) depends only on the difference |t−t′ |. It follows var(δ0 ) =
T 1 2X var(h(X (0) )) + cov(h(X (0) ), h(X (t) )). T T t=1
The CLT for limiting variance gives σδ20 = lim T var(δ0 ) = var(h(X (0) )) + 2 T →∞
∞ X
cov(h(X (0) ), h(X (k) )).
k=1
For δrb , we write var(δrb ) =
′ ′ 1 X cov(E[h(X (t) )|Y (t) ], E[h(X (t ) )|Y (t ) ]) T2 ′
t,t
′ ′ 1 X = 2 cov(E[h(X (0) )|Y (0) ], E[h(X (|t−t |) )|Y (|t−t |) ]) T ′
t,t
1 = var(E[h(X (0) )|y (0) ]) T T 2X + cov(E[h(X (0) )|Y (0) ], E[h(X (t) )|Y (t) ]). T t=1
The limiting variance follows σδ2rb = lim T var(δrb ) = var(E[h(X (0) )|Y (0) ]) T →∞
+2
∞ X
k=1
cov(E[h(X (0) )|Y (0) , E[h(X (k) )|Y (0) ]).
Using the fact that var(U ) ≥ var(E[U |V ]) for every random variables U and V , we obtain namely σδ20 ≥ σδ2rb , which is “la raison d ′ eˆtre” of the Rao–Blackwellization. (b) Consider the bivariate normal sampler 1ρ 0 X , , ∼N ρ1 0 Y where ρ < 1. The Gibbs sampler is based on the conditional distributions X|y ∼ N (ρy, 1 − ρ2 ) Y |x ∼ N (ρx, 1 − ρ2 ). The generated sequences (X (t) , Y (t) ), (X (t) ) and (X (t) ) are Markov chains with invariant distributions the joint distribution fXY , the marginal distribution fX of X and the marginal distribution fY of Y , respectively. The marginals are N (0, 1) distributions. Thus, cov(X (0) , X (k) ) = E[X (0) X (k) ] − E[X (0) ]E[X (k) ]
= E[E[X (0) |Y (1) ]E[X (k) |Y (k−1) ]] = ... = var(E[...E[E[X|Y ]|X]...]),
where the last term involves k conditional expectations, alternatively in Y and in X. On the other hand, we have E[X|Y ] = ρY and E[Y |X] = ρX. It follows
cov(X
(0)
,X
(k)
)=
var(ρk X) = ρ2k var(X) = ρ2k var(ρk Y ) = ρ2k var(Y ) = ρ2k
if if
k is even k is odd.
Therefore, σδ20
=1+2
∞ X
k=1
ρ2k = 1 + 2
ρ2 1 + ρ2 = . 2 1−ρ 1 − ρ2
Moreover, cov(E[X (0) |Y (0) ], E[X (k) |Y (k) ]) = cov(E[E[X (0) |Y (0) |X (1) ]], E[E[X (0) |y (k) ]||Y (k−1) ]) = ... = var(E[...E[E[X|Y ]|X]...]),
where here, the last term involves k + 1 conditional expectations, alternatively in Y and in X. Thus, cov(E[X (0) |Y (0) ], E[X (k) |Y (k) ]) = ρ2(k+1) . Finally, we obtain
Monte Carlo Statistical Methods
σδ2rb
2
=ρ +2
∞ X
ρ2(k+1) = ρ2 + 2
k=1
and
117
2 ρ4 21+ρ = ρ , 1 − ρ2 1 − ρ2
σδ20 1 = 2 > 1, σδ2rb ρ
as −1 < ρ < 1, which justifies the performance of δrb compared to δ0 in term of variance. Problem 9.6 The monotone decrease of the correlation holds for functions h of only one of the variables. If the function h depends on more than one variable the property may not hold as shown in the next example. Consider the bivariate Gibbs sampler of Example 9.20 and take h(x, y) = x − y. We have cov(h(X1 , Y1 ), h(X2 , Y2 )) = E[X1 X2 ] − E[X1 Y2 ] − E[Y1 X2 ] + E[Y1 Y2 ] = E[X1 X2 |Y1 ] − E[X1 Y2 |X1 ] − E[Y1 X2 |Y2 ] + E[Y1 Y2 |X2 ]
= ρ2 E[Y12 ] − ρE[X12 ] − ρE[Y1 Y2 ]ρ2 E[X22 ] = ρ2 − ρ − ρ3 + ρ2 = −ρ(1 + ρ2 − 2ρ) = −ρ(1 − ρ)2 , which is negative when ρ > 0. Problem 9.9 (a) The likelihood of the observations of Table 9.1 is −(x1 +x2 +x3 +x4 )λ x2 +2x3 +3x4
ℓ(λ|x1 , ..., x5 ) ∝ e
−λ
1−e
λ
∝ e−347λ λ313
1 − e−λ
3 X λi
i!
i=0
!
3 X λi i=0
i!
!
.
In order to use the EM algorithm, we augment the data with y = (y1 , ..., y13 ), the vector of the 13 units larger than 4. The augmented data is simulated from yi ∼ P(λ)I(yi ≥4) . The complete-data likelihood becomes ℓc (λ|x, y) ∝ e−347λ λ313
13 Y
i=1 P13
−360λ 313+
∝e
λ
λyi e−λyi
The expected complete-data log-likelihood is
i=1
yi
.
118
Solution Manual
Q(λ|λ0 , x) = Eλ0 [log ℓc (λ|x, Y )] ∝ (313 + 13E[Y1 |λ0 ])λ − 360, which is maximal for λ=
313 + 13E[Y1 |λ0 ] , 360
with E[Y1 |λ] =
P∞
k −λ0 kλ0 k=4 e k! P∞ −λ λk0 0 k=4 e k!
= λ0 1 −
eλ0
λ30 /6 . − (1 + λ0 + λ20 /2 + λ30 /6)
(c) The Rao-Blackwellized estimator is given by δrb =
T 1X E[λ|x1 , ..., x5 , y1 , ..., y13 ]. T t=1
δrb =
1 360T
P13
y (t) , 360), then ! 13 T X X (t) . yi 313 +
But, λ|x1 , ..., x5 , y1 , ..., y13 ∼ Ga(313 +
t=1
i=1
i=1
PT This estimator dominates its original version δ0 = T1 t=1 λ(t) in term of variance as shown in the general case (Theorem 9.19). Problem 9.10 The observations come from the multinomial model X ∼ M5 (n; a1 µ + b1 , a2 µ + b2 , a3 η + b3 , a4 η + b4 , c(1 − µ − η)), and the prior on (µ, η) is the Dirichlet distribution, π(µ, η) ∝ µα1 −1 η α2 −1 (1 − µ − η)α3 −1 . The posterior distribution of (µ, η) follows π(µ, η|x) ∝ (a1 µ + b1 )x1 (a2 µ + b2 )x2 (a3 η + b3 )x3 (a4 η + b4 )x4 × (1 − µ − η)x5 +α3 −1 µα1 −1 η α2 −1 . (a) To obtain the marginal density π(µ|x) of µ, we have only to integrate out η in the joint posterior distribution, π(µ|x) ∝ (a1 µ + b1 )x1 (a2 µ + b2 )x2 µα1 −1 Z × (a3 η + b3 )x3 (a4 η + b4 )x4 (1 − µ − η)x5 +α3 −1 η α2 −1 dη. If the αi ’s are integers, the quantity in the last term of the integrand is also polynomial in η. When η is integrated out, we obtain the marginal distribution of µ as a polynomial in µ. The same result holds for η.
Monte Carlo Statistical Methods
119
(b) Consider the transformation of (µ, η) into µ ξ= , ζ = η = H(µ, η) . 1−µ−η The jacobian of the inverse transform is given by |JH −1 (ξ, ζ)| =
1−ζ , (1 + ξ)2
and the marginal posterior distribution of ξ by (1 − ζ) ξ, ζ|x |JH −1 (ξ, ζ)|dζ π(ξ|x) = π 1+ξ Z ∝ ξ α1 −1 (1 + ξ)−(x1 +x2 +x5 +α1 +α3 ) [ζ α2 −1 (1 − ζ)x3 +α3 (b′1 (ξ) − a1 ξζ)x1 Z
× (b′2 (ξ) − a2 ξζ)x2 (a3 ζ + b3 )x3 (a4 ζ + b4 )x4 ]dζ,
where b′i (ξ) = ai + bi (1 + ξ), (i = 1, 2). When the augmented data is considered, the posterior distribution of (µ, η) becomes π(µ, ζ) ∝ µz1 +z2 +α1 −1 η z3 +z4 +α2 −1 (1 − µ − η)x5 α3 −1 , that is, (µ, η, 1 − µ − η) ∼ D3 (z1 + z2 + α1 , z3 + z4 + α2 , x5 + α3 ). In this case, the marginal posterior of ξ becomes π(ξ|x) ∝ ξ z1 +z2 +α1 −1 (1 + ξ)−(z1 +z2 +x5 +α1 +α3 ) . (c) The Gibbs sampler writes Algorithm A.65 Given
θt = (µ(t) , η (t) ),
1. Simulate ( 2. Simulate from :
1µ Zit+1 ∼ B(xi , a1aµ+b ) i 1η Zit+1 ∼ B(xi , a1aη+b ) i
(i = 1, 2) (i = 3, 4)
θ(t+1) = (µ(t+1) , η (t+1) )
(µ, η, 1 − µ − η) ∼ D3 (z1 + z2 + α1 , z3 + z4 + α2 , x5 + α3 ).
120
Solution Manual
Problem 9.11 (a) Substituting k(z|θ′ , y) and L(θ′ |y) we have Z Z L(θ|y, z) L(θ′ |y, z) L(θ′ |y) R R L∗ (θ|y) = dθ . dz ′ L(θ|y, z)dθ L(θ |y) L(θ′ |y)dθ′
Since all integrands are nonnegative finite we invoke Fubini’s theorem and switch the order of integration, Z Z L(θ|y, z) L(θ′ |y, z) L(θ′ |y) R R L∗ (θ|y) = dθ′ dz L(θ|y, z)dθ L(θ′ |y) L(θ′ |y)dθ′ Cancelling terms we are left with Z 1 R L(θ|y, z)dz = L∗ (θ|y) L(θ′ |y)dθ′
(b) Let
g1 (θ|z) = L∗ (θ|y, z) g2 (z|θ) = k(z|θ, y) where y is fixed and known. Then the two stage Hammersley-Clifford Theorem can be used to give the joint distribution defined by these two conditional distributions, if such a joint distribution exists: g1 (θ|z) g1 (θ′ |z)/g2 (z|θ′ )dθ′ L∗ (θ|y, z) = R ∗ ′ L (θ |y, z)/k(z|θ′ , y)dθ′
f (θ, z|y) = R
We are looking for the distribution of θ|y so we integrate over z, Z L∗ (θ|y, z) R f (θ|y) = dz ∗ ′ L (θ |y, z)/k(z|θ′ , y)dθ′ Z L(θ|y, z) 1 R = dz R L(θ|y, z)dθ R L(θ′ |y,z) L(θ′ ′ |y) dθ′ L(θ |y,z) L(θ|y,z)dθ Z 1 L(θ|y, z)dz = L∗ (θ|y). = R L(θ′ |y)dθ′
The output from the Gibbs sampler can be used to make inferences about the likelihood function (unnormalized), by simply collecting the sequence θ(j) into bins or by other methods.
Problem 9.18 (a) Suppose that we have a two-dimensional random variable X 0 1ρ ∼N , Y 0 ρ1 Then the conditional densities for the Gibbs sampler are X | y ∼ N (ρy, 1 − ρ2 ) , Y | x ∼ N (ρx, 1 − ρ2 ) , R +∞ We know that f (X|X ∗ ) = −∞ f (X|y)f (y|X ∗ )dy. Plugging in the conditional densities 1 1 − (x−ρy)2 f (X|Y ) = p e 2(1−ρ2 ) 2π(1 − ρ2 ) 1 1 − (y−ρx∗ )2 e 2(1−ρ2 ) f (Y |X ∗ ) = p 2π(1 − ρ2 )
gives the desired result. (b) Suppose X ∗ has a standard normal distribution Then Z +∞ fX|X ∗ (x|x∗ )fX ∗ (x∗ )dx∗ f (x) = −∞ +∞
=
Z
−∞
p
1
2π(1 −
ρ4 )
− 12
e
(x−ρ2 x∗ )2 (1−ρ4 )
∗2 1 1 √ e− 2π x dx∗ 2π
Adding the exponents and ignoring the −1/2 gives
Then
x∗2 − ρ4 x∗2 x2 − 2ρ2 x∗ x + ρ4 x∗2 (x − ρ2 x∗ )2 + + x∗2 = 4 4 (1 − ρ ) 1−ρ 1 − ρ4 x2 − 2ρ2 x∗ x + ρ4 x∗2 + x∗2 − ρ4 x∗2 = 1 − ρ4 (x∗ − ρ2 x)2 x2 − ρ4 x2 = + 4 (1 − ρ ) 1 − ρ4 (x∗ − ρ2 x)2 + x2 = (1 − ρ4 ) f (x) = =
Z
+∞
−∞ Z +∞ −∞
fX|X ∗ (x|x∗ )fX ∗ (x∗ )dx∗ (x∗ −ρ2 x)2 1 2 1 1 −1 p e 2 (1−ρ4 ) dx∗ √ e− 2 x 4 2π 2π(1 − ρ )
1 2 1 = √ e− 2 x 2π
We conclude that X ∗ ∼ N (0, 1) is the invariant distribution.
122
Solution Manual
1 (c) If we ignore the term − 2(1−ρ 2 ) , we get
(x − ρy)2 + (y − ρx∗ )2 =
= (1 + ρ2 )y 2 − 2y(ρx + ρx∗ ) + x2 + ρ2 x∗2 2 ! ρx + ρx∗ ρx + ρx∗ 2 2 + + = (1 + ρ ) y − 2y 1 + ρ2 1 + ρ2 (ρx + ρx∗ )2 x2 + ρ2 x2 + ρ2 x∗2 + ρ4 x∗2 + 1 + ρ2 1 + ρ2 2 ρx + ρx∗ 1 = (1 + ρ2 ) y − (x − ρ2 x∗ )2 + 1 + ρ2 1 + ρ2 −
Plugging this into the formula in (a) gives Z +∞ “ ” (x−ρ2 x∗ )2 ρx+ρx∗ 2 1+ρ2 1 −1 − 12 1−ρ 2 y− 1+ρ2 e e 2 (1−ρ2 )(1+ρ2 ) dy K(x∗ , x) = 2 2π(1 − ρ ) −∞ Z +∞ s “ ” (x−ρ2 x∗ )2 1+ρ2 ρx+ρx∗ 2 1 1 + ρ2 − 21 1−ρ − 21 2 y− 1+ρ2 1−ρ4 = p e dy e 2π(1 − ρ2 ) 2π(1 − ρ2 ) −∞ (x−ρ2 x∗ )2 1 −1 e 2 1−ρ4 = p 2π(1 − ρ2 )
which is what we want. (d) That Xk = ρ2 Xk−1 + Uk follows straightforwardly from (b). From here we can say that Xk = ρ2k X0 plus a random variable independent from X0 . Thus, cov (Xk , X0 ) = cov ρ2k X0 , X0 = ρ2k var (X0 )
If we assume that the variance of X0 is 1, then we conclude that cov (Xk , X0 ) = ρ2k . Since ρ must be less than 1, the sequence cov (Xk , X0 )k must converge to 0.
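This geometric decay of the covariance is simple to verify by simulation. The Python sketch below (seed, ρ, and number of chains are arbitrary choices) runs the two-step Gibbs sampler X|y ∼ N(ρy, 1 − ρ²), Y|x ∼ N(ρx, 1 − ρ²) from the stationary marginal and compares the empirical cov(X_k, X_0) with ρ^{2k}.

import numpy as np

rng = np.random.default_rng(7)
rho, k, n_chains = 0.7, 3, 200_000
s = np.sqrt(1 - rho**2)

x0 = rng.standard_normal(n_chains)       # X_0 ~ N(0,1), the stationary marginal
x = x0.copy()
for _ in range(k):                       # k Gibbs sweeps: draw Y|x, then X|y
    y = rho * x + s * rng.standard_normal(n_chains)
    x = rho * y + s * rng.standard_normal(n_chains)

print(np.cov(x0, x)[0, 1], rho**(2 * k))   # empirical cov(X_k, X_0) vs rho^(2k)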
Chapter 10 Problem 10.5 (a) The density f satisfies the positivity condition for x0 , ..., xm . The Hammersley– Clifford Theorem (Theorem 10.5) applies. For the permutation ℓ(i) = i+1, we obtain m−1 Y f (xi+1 ) f (y) = f (x) . f (xi ) i=0
(b) Each pair of states in the Gibbs sampler associated with this setup can be jointed in a finite number of steps. So, the associated chain is irreducible. As the state space is finite, the results of Chapter 6 and namely those of Problem 6.8 and 6.9 imply that the chain is ergodic.
Monte Carlo Statistical Methods
123
(c) In Example 10.7, the state space is X = E ∪ E′ , where E and E′ are the disks of R2 with radius 1 and respective centers (1, 1) and (−1, −1). The distribution f is f (x1 , x2 ) =
1 (IE (x1 , x2 ) + IE′ (x1 , x2 )), 2π
which is the uniform distribution on X. If x and y are in the same disk, then the sequence z0 = x = (x1 , x2 ), z1 = (x1 , y2 ), z2 = y = (y1 , y2 ) joint x and y in two steps such that zi and zi+1 differ in only a single component, the zi ’s are in the same disk and f (zi ) > 0. If x and y are not in the same disk, we can assume that x ∈ E and y ∈ E′ . Note that the points u = (0, 1) ∈ E and v = (0, −1) ∈ E′ differ in only one component. Then, we joint successively x to u, u to v and v to y. We obtain the sequence z0 = x, z1 = (x1 , 1), z2 = u, z3 = v, z4 = (y1 , −1), z5 = y which connect x to y in 5 steps such that zi and zi+1 differ only in a single component and f (zi ) > 0. Therefore, the condition holds for this example, but, as seen in Example 10.7, the irreducibility fails to hold for the associated chain and so does the ergodicity. Problem 10.7 The given kernel is associated to a Gibbs sampler where the density f is the posterior distribution resulting from the model X|θ ∼ N (θ, 1), θ ∼ C(θ0 , 1) In this problem, we want to bound the kernel uniformly from below in θ. First, θ2 θ0 η ′ − [θ′2 + η(θ − θ0 )2 ] = −η 0 ≤ 0, (1 + η) θ − 1+η 1+η then θ0 η 1 η 1+η θ′ − ≥ exp − θ′2 − (θ − θ0 )2 . exp − 2 1+η 2 2 √ We majorate 1 + η, appearing in the integral giving K(θ, θ′ ), by 1. It follows that ′2
ν e−θ /2 1 √ K(θ, θ′ ) ≥ 1 + (θ − θ0 )2 ν 2π 2 Γ (ν) Z ∞ n ηo dη. η ν−1 exp − 1 + (θ − θ0 )2 + (θ′ − θ0 )2 × 2 0
The last integral equals
2ν Γ (ν) , [1 + (θ − θ0 )2 + (θ′ − θ0 )2 ]ν
124
Solution Manual
which is larger than 2ν Γ (ν) . [1 + (θ − θ0 )2 ]ν [1 + (θ′ − θ0 )2 ]ν Finally,
′2
e−θ /2 √ , K(θ, θ ) ≥ [1 + (θ − θ0 ) ] 2π and thus uniform ergodicity of the associated chain follows. ′
′
2 −ν
Problem 10.10 The complete hierarchical model is Xi ∼ P(λi ) λi ∼ Ga(α, βi ) βi ∼ Ga(a, b), where α, a and b are known hyperparameters and P (λ) is the Poisson distribution of parameter λ. The Gibbs sampler is based on the posterior densities λi |x, α, βi ∼ Ga(α + xi , βi + 1) βi |x, α, a, b, λ ∼ Ga(a + α, b + λi ) Problem 10.18 The Tobit model is based on the observation of a transform of a normal variable yi∗ ∼ N (xti β, σ 2 ) by truncation yi = max (yi∗ , 0). If the observables are the yi∗ ’s, the model reduces to ∗ yi ∼ N (xti β, σ 2 ) (β, σ 2 ) ∼ π(β, σ 2 ) . Then, the posterior of (β, σ 2 ) is π(β, σ 2 ) ∝ L(β, σ 2 |y ∗ , x)π(β, σ 2 ) ) ( n X 1 (y ∗ − xti β)2 π(β, σ 2 ). ∝ σ −n exp − 2 2σ i=1 i But, in this model, we observe only yi (i = 1, ..., n) and step 1. of Algorithm [A.39] takes yi∗ = yi if yi > 0 and simulates yi∗ from N− (xti β, σ 2 , 0) if yi = 0. The sample (yi∗ ) produced by this step is independent and approximately distributed from N (xti β, σ 2 ). Therefore, the resulting sample (β, σ 2 ) is approximately distributed from the posterior distribution.
Monte Carlo Statistical Methods
125
Problem 10.22 In the setup of Example 10.25, the original parametrisation of the model is (i = 1, ..., I), (j = 1, ..., J) yij ∼ N (µ + αi , σy2 ) αi ∼ N (0, σα2 ) (i = 1, ..., I) µ ∼ π(µ) ∝ 1,
where σy2 and σα2 are known hyperparameters. The posterior of (µ, α1 , ..., αI ) is ( ) 1 X X 1 π(µ, α1 , ..., αI |y) ∝ exp − 2 (yij − µ − αi )2 exp − 2 αi2 . 2σy 2σ α i,j i
Now, we compute the moments of (µ, α1 , ..., αI ) with respect to this distribution, we obtain the correlations 2 −1/2 ρµ,α = 1 + Iσy2 (i = 1, ..., I) i Jσα 2 −1 Iσ ρα ,α = 1 + y2 (i, j = 1, ..., I), i 6= j. i j Jσ α
Problem 10.23
The hierarchical parametrisation of the model of Example 10.25 is given by (i = 1, ..., I), (j = 1, ..., J) yij ∼ N (ηi , σy2 ) ηi ∼ N (µ, σα2 ) (i = 1, ..., I) µ ∼ π(µ) ∝ 1,
where σy2 and σα2 are once again known hyperparameters. The posterior of (µ, η1 , ..., ηI ) is ( ) 1 X 1 X 2 2 (yij − ηi ) exp − 2 (ηi − µ) . π(µ, η1 , ..., ηI |y) ∝ exp − 2 2σy 2σα i i,j
Now, we compute the moments of (µ, η1 , ..., ηI ) with respect to this distribution, we obtain the correlations −1/2 2 ρµ,η = 1 + IJσ2 α (i = 1, ..., I) i σy −1 2 ρηi ,ηj = 1 + IJσ2 α (i, j = 1, ..., I), i 6= j. σ y
These correlations are smaller than those of the original parametrisation (see Problem 10.22). This is why a reparameterisation is preferred to the original one in many cases.
126
Solution Manual
Problem 10.24

(a) A set of conditional densities $f_1,\ldots,f_m$ is compatible if there exists a joint density $f$ which generates them:
$$f_i(x_i|x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_m) = \frac{f(x_1,\ldots,x_m)}{\int f(x_1,\ldots,x_m)\,dx_i}$$
for $i=1,\ldots,m$. A set of conditional densities $f_1,\ldots,f_m$ is functionally compatible if, for each fixed $x_1',\ldots,x_m'$, all of the $\binom{m!}{2}$ possible ratios of the form
$$\frac{\prod_{j=1}^m f_{l_j}\bigl(x_{l_j}|x_{l_1},\ldots,x_{l_{j-1}},x'_{l_{j+1}},\ldots,x'_{l_m}\bigr)}{\prod_{j=2}^m f_{l_j}\bigl(x'_{l_j}|x_{l_1},\ldots,x_{l_{j-1}},x'_{l_{j+1}},\ldots,x'_{l_m}\bigr)}\cdot
\left[\frac{\prod_{j=1}^m f_{k_j}\bigl(x_{k_j}|x_{k_1},\ldots,x_{k_{j-1}},x'_{k_{j+1}},\ldots,x'_{k_m}\bigr)}{\prod_{j=2}^m f_{k_j}\bigl(x'_{k_j}|x_{k_1},\ldots,x_{k_{j-1}},x'_{k_{j+1}},\ldots,x'_{k_m}\bigr)}\right]^{-1}$$
is equal to some constant that only depends on $x_1',\ldots,x_m'$ and is independent of $(x_1,\ldots,x_m)$, where $(x_{l_1},\ldots,x_{l_m})$ and $(x_{k_1},\ldots,x_{k_m})$ are different permutations of $(x_1,\ldots,x_m)$.
(b) In the case of a hierarchical model, each $f_i$ is (proportional to) the joint distribution of the data and the parameters, considered as a function of the single parameter $x_i$. Ratios of the above form are then constant in $(x_1,\ldots,x_m)$, since every conditional density has the same functional form.
(c) If the densities are compatible, there is a joint density $f$ that generates them. We now check that $f$ is the invariant density of the Gibbs sampler, so that the chain is positive recurrent. The kernel of the Gibbs sampler is given by the conditional density
$$k\bigl(x_1^{(t+1)},\ldots,x_m^{(t+1)}\,\big|\,x_1^{(t)},\ldots,x_m^{(t)}\bigr) = f_1\bigl(x_1^{(t+1)}|x_2^{(t)},\ldots,x_m^{(t)}\bigr)\,f_2\bigl(x_2^{(t+1)}|x_1^{(t+1)},x_3^{(t)},\ldots,x_m^{(t)}\bigr)\cdots f_m\bigl(x_m^{(t+1)}|x_1^{(t+1)},\ldots,x_{m-1}^{(t+1)}\bigr).$$
So it is enough to prove that
$$f\bigl(x_1^{(t+1)},\ldots,x_m^{(t+1)}\bigr) = \int k\bigl(x_1^{(t+1)},\ldots,x_m^{(t+1)}|x_1^{(t)},\ldots,x_m^{(t)}\bigr)\,f\bigl(x_1^{(t)},\ldots,x_m^{(t)}\bigr)\,dx_1^{(t)}\cdots dx_m^{(t)}.$$
Let
$$f^{(i)}(x_1,\ldots,x_{i-1},x_{i+1},\ldots,x_m) = \int f(x_1,\ldots,x_m)\,dx_i.$$
Since $f_m\bigl(x_m^{(t+1)}|x_1^{(t+1)},\ldots,x_{m-1}^{(t+1)}\bigr)$ does not depend on $x^{(t)}$, integrating first in $x_1^{(t)}$ gives
$$\int k\,f\,dx_1^{(t)}\cdots dx_m^{(t)} = f_m\bigl(x_m^{(t+1)}|x_1^{(t+1)},\ldots,x_{m-1}^{(t+1)}\bigr)\int f_1\bigl(x_1^{(t+1)}|x_2^{(t)},\ldots,x_m^{(t)}\bigr)\cdots f_{m-1}\bigl(x_{m-1}^{(t+1)}|x_1^{(t+1)},\ldots,x_{m-2}^{(t+1)},x_m^{(t)}\bigr)\,f^{(1)}\bigl(x_2^{(t)},\ldots,x_m^{(t)}\bigr)\,dx_2^{(t)}\cdots dx_m^{(t)}.$$
Now $f_1\bigl(x_1^{(t+1)}|x_2^{(t)},\ldots,x_m^{(t)}\bigr)\,f^{(1)}\bigl(x_2^{(t)},\ldots,x_m^{(t)}\bigr) = f\bigl(x_1^{(t+1)},x_2^{(t)},\ldots,x_m^{(t)}\bigr)$, and integrating this in $x_2^{(t)}$ gives $f^{(2)}\bigl(x_1^{(t+1)},x_3^{(t)},\ldots,x_m^{(t)}\bigr)$, so that
$$\int k\,f\,dx^{(t)} = f_m\bigl(x_m^{(t+1)}|x_1^{(t+1)},\ldots,x_{m-1}^{(t+1)}\bigr)\int f_2\bigl(x_2^{(t+1)}|x_1^{(t+1)},x_3^{(t)},\ldots,x_m^{(t)}\bigr)\cdots f_{m-1}\bigl(x_{m-1}^{(t+1)}|x_1^{(t+1)},\ldots,x_{m-2}^{(t+1)},x_m^{(t)}\bigr)\,f^{(2)}\bigl(x_1^{(t+1)},x_3^{(t)},\ldots,x_m^{(t)}\bigr)\,dx_3^{(t)}\cdots dx_m^{(t)}.$$
Repeating the same argument, at each step
$$f_j\bigl(x_j^{(t+1)}|x_1^{(t+1)},\ldots,x_{j-1}^{(t+1)},x_{j+1}^{(t)},\ldots,x_m^{(t)}\bigr)\,f^{(j)}\bigl(x_1^{(t+1)},\ldots,x_{j-1}^{(t+1)},x_{j+1}^{(t)},\ldots,x_m^{(t)}\bigr) = f\bigl(x_1^{(t+1)},\ldots,x_j^{(t+1)},x_{j+1}^{(t)},\ldots,x_m^{(t)}\bigr)$$
and the integration in $x_{j+1}^{(t)}$ produces $f^{(j+1)}$. After the last integration, in $x_m^{(t)}$, we are left with
$$\int k\,f\,dx^{(t)} = f_m\bigl(x_m^{(t+1)}|x_1^{(t+1)},\ldots,x_{m-1}^{(t+1)}\bigr)\,f^{(m)}\bigl(x_1^{(t+1)},\ldots,x_{m-1}^{(t+1)}\bigr) = f\bigl(x_1^{(t+1)},\ldots,x_m^{(t+1)}\bigr).$$
Problem 10.25

The random effects model of Example 10.31 is given by
$$Y_{ij} = \beta + U_i + \varepsilon_{ij} \quad (i=1,\ldots,I,\ j=1,\ldots,J),$$
where $U_i \sim \mathcal{N}(0,\sigma^2)$ and $\varepsilon_{ij} \sim \mathcal{N}(0,\tau^2)$. The prior on $(\beta,\sigma^2,\tau^2)$ is the Jeffreys prior $\pi(\beta,\sigma^2,\tau^2) = \sigma^{-2}\tau^{-2}$. The joint posterior distribution of $(\beta,\sigma^2,\tau^2,u)$ is
$$\pi(\beta,\sigma^2,\tau^2,u|y) \propto \tau^{-IJ-2}\exp\left\{-\frac{1}{2\tau^2}\sum_{i,j}(y_{ij}-\beta-u_i)^2\right\}\,\sigma^{-I-2}\exp\left\{-\frac{1}{2\sigma^2}\sum_i u_i^2\right\}.$$
In order to obtain the "posterior distribution" of $(\beta,\sigma^2,\tau^2)$, we only have to integrate out the random effects (i.e. the $u_i$'s) in the previous expression. It follows that
$$\pi(\beta,\sigma^2,\tau^2|y) \propto \sigma^{-I-2}\tau^{-IJ-2}\exp\left\{-\frac{1}{2\tau^2}\sum_{i,j}(y_{ij}-\bar y_i)^2\right\}\exp\left\{-\frac{J}{2(\tau^2+J\sigma^2)}\sum_i(\bar y_i-\beta)^2\right\}\bigl(J\tau^{-2}+\sigma^{-2}\bigr)^{-I/2}.$$
Now, we integrate out $\beta$ to get the marginal posterior density of $(\sigma^2,\tau^2)$:
$$\pi(\sigma^2,\tau^2|y) \propto \sigma^{-I-2}\tau^{-IJ-2}\bigl(J\tau^{-2}+\sigma^{-2}\bigr)^{-I/2}\bigl(\tau^2+J\sigma^2\bigr)^{1/2}\exp\left\{-\frac{1}{2\tau^2}\sum_{i,j}(y_{ij}-\bar y_i)^2\right\}\exp\left\{-\frac{J}{2(\tau^2+J\sigma^2)}\sum_i(\bar y_i-\bar y)^2\right\}.$$
If $\tau\neq 0$, we have, for $\sigma$ near 0,
$$\bigl(J\tau^{-2}+\sigma^{-2}\bigr)^{-I/2} \simeq \sigma^I.$$
Thus, for $\sigma$ near 0, $\pi(\sigma^2,\tau^2|y)$ behaves like $\sigma^{-2}$ and is not integrable. This is a consequence of the impropriety of the Jeffreys prior used here.

Problem 10.26

The marginal density $m(x)$ can be written as
$$m(x) = \frac{f(x|\theta)\,\pi(\theta)}{\pi(\theta|x)}. \tag{B.2}$$
Suppose that $\theta=(\theta_1,\ldots,\theta_B)$ and that the full conditional distributions
$$\bigl\{\pi(\theta_r|x,\theta_s,\ s\neq r)\bigr\}_{r=1}^B$$
are available.
(a) In the case $B=2$ it is possible to decompose the posterior distribution as
$$\pi(\theta|x) = \pi(\theta_1|x,\theta_2)\,\pi(\theta_2|x),$$
and, assuming that $f(x|\theta)=f(x|\theta_1)$, equation (B.2) can be rewritten as
$$m(x)\int \pi(\theta_1|x,\theta_2)\,\pi(\theta_2|x)\,d\theta_2 = \int f(x|\theta)\,\pi(\theta_1)\,\pi(\theta_2)\,d\theta_2 = \int f(x|\theta_1)\,\pi(\theta_1)\,\pi(\theta_2)\,d\theta_2,$$
that is (since $\pi(\theta_2)$ integrates to one),
$$m(x) = \frac{f(x|\theta_1)\,\pi(\theta_1)}{\int \pi(\theta_1|x,\theta_2)\,\pi(\theta_2|x)\,d\theta_2}. \tag{B.3}$$
This identity can be exploited by evaluating the densities at an arbitrary point $\theta_1^*$ and approximating the integral in the denominator by simulation. Taking the logarithm in (B.3) and replacing the denominator with its approximation, we obtain
$$\tilde\ell(x) = \log f(x|\theta_1^*) + \log \pi_1(\theta_1^*) - \log \tilde\pi(\theta_1^*|x),$$
where the integral in (B.3) is approximated using draws $\theta_2^{(t)}$ produced by the Gibbs sampler,
$$\tilde\pi(\theta_1^*|x) = \frac{1}{T}\sum_{t=1}^T \pi\bigl(\theta_1^*|x,\theta_2^{(t)}\bigr).$$
The situation in which the parameter vector can be decomposed into two parts and the posterior density must be integrated with respect to a subvector of parameters is frequent in latent variable models.
(b) In the general case, we can rewrite the posterior density as
$$\pi(\theta|x) = \pi_1(\theta_1|x)\,\pi_2(\theta_2|x,\theta_1)\,\pi_3(\theta_3|x,\theta_1,\theta_2)\cdots\pi_B(\theta_B|x,\theta_1,\ldots,\theta_{B-1}).$$
Each component $\pi_r(\theta_r|x,\theta_s,\ s<r)$, evaluated at an arbitrary point $\theta^*$, can be obtained by marginalising the joint density,
$$\pi_r\bigl(\theta_r^*|x,\theta_s^*,\ s<r\bigr) = \int \pi\bigl(\theta_r^*|x,\theta_1^*,\ldots,\theta_{r-1}^*,\theta_{r+1},\ldots,\theta_B\bigr)\,\pi\bigl(\theta_{r+1},\ldots,\theta_B|x,\theta_1^*,\ldots,\theta_{r-1}^*\bigr)\,d\theta_{r+1}\cdots d\theta_B,$$
and can be approximated by Gibbs sampling,
$$\hat\pi_r\bigl(\theta_r^*|x,\theta_s^*,\ s<r\bigr) = \frac{1}{T}\sum_{t=1}^T \pi_r\bigl(\theta_r^*|x,\theta_1^*,\ldots,\theta_{r-1}^*,\theta_{r+1}^{(t)},\ldots,\theta_B^{(t)}\bigr),$$
where the $\theta_l^{(t)}$ $(l>r)$ are simulated from the full conditional densities $\pi\bigl(\theta_l|\theta_1^*,\ldots,\theta_{r-1}^*,\theta_r,\ldots,\theta_B\bigr)$.
(c) Using these approximations, it is possible to evaluate numerically the joint posterior density at an arbitrary point $\theta^*$,
$$\hat\pi(\theta^*|x) = \prod_{r=1}^B \hat\pi\bigl(\theta_r^*|x,\theta_s^*,\ s<r\bigr), \tag{B.4}$$
and finally to approximate the log of the marginal density as
$$\log \hat m(x) = \log f(x|\theta^*) + \log \pi(\theta^*) - \sum_{r=1}^B \log\hat\pi\bigl(\theta_r^*|x,\theta_s^*,\ s<r\bigr).$$
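As an illustration of the two-block identity (B.3), here is a minimal sketch of the corresponding estimator of $\log m(x)$. It assumes the user supplies the log-likelihood and log-prior evaluated at $\theta_1^*$, together with the values $\pi(\theta_1^*|x,\theta_2^{(t)})$ computed along a Gibbs run; the function name and argument names are ours, not notation from the book.

```python
import numpy as np

def chib_log_marginal(loglik_at_star, logprior_at_star, cond_dens_at_star):
    """Two-block (B = 2) estimate of log m(x) based on Eq. (B.3).

    loglik_at_star    : log f(x | theta1*)
    logprior_at_star  : log pi(theta1*)
    cond_dens_at_star : array of pi(theta1* | x, theta2^(t)) over Gibbs draws theta2^(t)
    """
    # pi-tilde(theta1* | x) = (1/T) sum_t pi(theta1* | x, theta2^(t))
    log_post_at_star = np.log(np.mean(cond_dens_at_star))
    return loglik_at_star + logprior_at_star - log_post_at_star
```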
(d) In order to evaluate the density $\hat\pi_r(\theta_r^*|x,\theta_s^*,\ s<r)$, it is necessary to run an additional (reduced) Gibbs sampler on the components $r+1,\ldots,B$, based on the corresponding conditional densities. The number of additional simulation steps per iteration is therefore
$$\sum_{r=1}^B (B-r) = \frac{B(B-1)}{2},$$
that is, of order $O(B^2)$.
(e) The Bayesian predictive density is
$$f(y|x) = \int f(y|\theta,x)\,\pi(\theta|x)\,d\theta = \frac{f(y|\theta,x)\,\pi(\theta|x)}{\pi(\theta|x,y)},$$
where the second equality follows from the definition of the posterior distribution,
$$\pi(\theta|x,y) = \frac{f(y|\theta,x)\,\pi(\theta|x)}{f(y|x)} \quad\Longleftrightarrow\quad f(y|x) = \frac{f(y|\theta,x)\,\pi(\theta|x)}{\pi(\theta|x,y)},$$
and which can be approximated by Gibbs sampling as
$$\log \hat f(y|x) = \log f(y|\theta^*,x) + \log\hat\pi(\theta^*|x) - \log\hat\pi(\theta^*|x,y),$$
where $\hat\pi(\theta^*|x)$ and $\hat\pi(\theta^*|x,y)$ are the posterior densities approximated as in (B.4).

Problem 10.29

(a) In this first part, the parametrisation of the model is
$$Y_{ij} \sim \mathcal{N}\bigl(\alpha_i + \beta_i(x_j-\bar x),\sigma_c^2\bigr) \quad (i=1,\ldots,30),\ (j=1,\ldots,J),$$
with priors
$$\alpha_i \sim \mathcal{N}(\alpha_c,\sigma_\alpha^2),$$
$$\beta_i \sim \mathcal{N}(\beta_c,\sigma_\beta^2),$$
and vague hyperpriors
$$\alpha_c,\ \beta_c \sim \mathcal{N}(0,10^4), \qquad \sigma_c^{-2},\ \sigma_\alpha^{-2},\ \sigma_\beta^{-2} \sim \mathcal{G}a(10^{-3},10^{-3}).$$
The posterior distribution is then given by
$$\pi(V|y) \propto \sigma_c^{-30J}\exp\left\{-\frac{1}{2\sigma_c^2}\sum_{i,j}\bigl(y_{ij}-\alpha_i-\beta_i(x_j-\bar x)\bigr)^2\right\}\pi(V),$$
where $V$ denotes the vector of parameters
$$V = (\alpha_i,\beta_i,\alpha_c,\beta_c,\sigma_c^{-2},\sigma_\alpha^{-2},\sigma_\beta^{-2})$$
and $\pi(V)$ its prior. The full conditional posteriors are
$$\begin{aligned}
\alpha_i|y,V_- &\sim \mathcal{N}\left(\frac{\sigma_\alpha^2 s_i+\sigma_c^2\alpha_c}{J\sigma_\alpha^2+\sigma_c^2},\ \frac{\sigma_\alpha^2\sigma_c^2}{J\sigma_\alpha^2+\sigma_c^2}\right)\\
\beta_i|y,V_- &\sim \mathcal{N}\left(\frac{\sigma_\beta^2 s_i'+\sigma_c^2\beta_c}{\sigma_\beta^2 v(x)+\sigma_c^2},\ \frac{\sigma_\beta^2\sigma_c^2}{\sigma_\beta^2 v(x)+\sigma_c^2}\right)\\
\alpha_c|y,V_- &\sim \mathcal{N}\left(\frac{3\cdot 10^5\,\bar\alpha}{3\cdot 10^5+\sigma_\alpha^2},\ \frac{10^4\,\sigma_\alpha^2}{3\cdot 10^5+\sigma_\alpha^2}\right)\\
\beta_c|y,V_- &\sim \mathcal{N}\left(\frac{3\cdot 10^5\,\bar\beta}{3\cdot 10^5+\sigma_\beta^2},\ \frac{10^4\,\sigma_\beta^2}{3\cdot 10^5+\sigma_\beta^2}\right)\\
\sigma_c^{-2}|y,V_- &\sim \mathcal{G}a\left(10^{-3}+15J,\ 10^{-3}+\tfrac12\sum_{i,j}\bigl(y_{ij}-\alpha_i-\beta_i(x_j-\bar x)\bigr)^2\right)\\
\sigma_\alpha^{-2}|y,V_- &\sim \mathcal{G}a\left(10^{-3}+15,\ 10^{-3}+\tfrac12\sum_i(\alpha_i-\alpha_c)^2\right)\\
\sigma_\beta^{-2}|y,V_- &\sim \mathcal{G}a\left(10^{-3}+15,\ 10^{-3}+\tfrac12\sum_i(\beta_i-\beta_c)^2\right),
\end{aligned}$$
where $V_-$ denotes the vector of all parameters except the current one, and where $s_i=\sum_j y_{ij}$, $s_i'=\sum_j y_{ij}(x_j-\bar x)$, $v(x)=\sum_j(x_j-\bar x)^2$, $\bar\alpha=\frac{1}{30}\sum_i\alpha_i$ and $\bar\beta=\frac{1}{30}\sum_i\beta_i$. The Gibbs sampler follows easily.
(b) Here, the parametrisation becomes $Y_{ij}\sim\mathcal{N}(\beta_{1i}+\beta_{2i}x_j,\sigma_c^2)$, with the prior
$$\beta_i = (\beta_{1i},\beta_{2i}) \sim \mathcal{N}_2(\mu_\beta,\Omega_\beta)$$
and a vague hyperprior $\mathcal{W}(2,R)$ on $\Omega_\beta^{-1}$, where
$$R = \begin{pmatrix} 200 & 0\\ 0 & 0.2\end{pmatrix}.$$
As in (a), the full conditional posterior distributions are given by
$$\beta_i|y,V_- \sim \mathcal{N}_2(m,\Sigma), \qquad \Omega_\beta^{-1}|y,V_- \sim \mathcal{W}(62,R_1), \qquad \sigma_c^{-2}|y,V_- \sim \mathcal{G}a\left(10^{-3}+15J,\ 10^{-3}+\tfrac12\sum_{i,j}(y_{ij}-\beta_{1i}-\beta_{2i}x_j)^2\right),$$
where $\Sigma = (\Omega_\beta^{-1}+\Omega_1^{-1})^{-1}$ with
$$\Omega_1^{-1} = \frac{J}{\sigma_c^2}\begin{pmatrix}1 & \bar x\\ \bar x & \overline{x^2}\end{pmatrix}, \qquad \overline{x^2}=\frac1J\sum_j x_j^2,$$
and
$$m = \Sigma\left(\Omega_\beta^{-1}\mu_\beta + \sigma_c^{-2}\begin{pmatrix} s_i\\ \bar x s_i + s_i'\end{pmatrix}\right),$$
the second term being $\Omega_1^{-1}\hat\beta_i$ written in terms of $s_i$ and $s_i'$, with $\hat\beta_i$ the least squares estimate of $(\beta_{1i},\beta_{2i})$,
and $R_1 = (R^{-1}+R')^{-1}$ with
$$R' = \begin{pmatrix}\sum_i(\beta_{1i}-\mu_{1\beta})^2 & \sum_i(\beta_{1i}-\mu_{1\beta})(\beta_{2i}-\mu_{2\beta})\\ \sum_i(\beta_{1i}-\mu_{1\beta})(\beta_{2i}-\mu_{2\beta}) & \sum_i(\beta_{2i}-\mu_{2\beta})^2\end{pmatrix}.$$
(c) The independence between $\beta_{1i}$ and $\beta_{2i}$ fails in the parametrisation of (b); indeed, the variance-covariance matrix of the joint posterior distribution of $\beta_i$ is not diagonal. This is explained by comparing the parametrisation of (b) with that of (a): $\beta_{1i}$ corresponds to $\alpha_i-\beta_i\bar x$ of (a) and $\beta_{2i}$ to $\beta_i$ of (a). Since $\alpha_i-\beta_i\bar x$ and $\beta_i$ are not independent, a correlation between $\beta_{1i}$ and $\beta_{2i}$ appears.
Chapter 11

Problem 11.6

(a) In Example 11.6, the jump from $H_{k+1}$ to $H_k$ consists of choosing a component $i$ in $\{1,\ldots,k\}$ at random, merging the two intervals $[a_i,a_{i+1}]$ and $[a_{i+1},a_{i+2}]$ into one interval, and assigning to this aggregated interval the sum of the weights $p_i^{(k+1)}$ and $p_{i+1}^{(k+1)}$ of the two former components, that is, $p_i^{(k)} = p_i^{(k+1)}+p_{i+1}^{(k+1)}$. The other parameters are kept unchanged. The Jacobian involved in this jump is the inverse of the Jacobian of the reverse (split) jump, so the weight of the move from $H_{k+1}$ to $H_k$ involves the Jacobian
$$\left|\frac{\partial(p^{(k)},a^{(k)},u_1,u_2)}{\partial(p^{(k+1)},a^{(k+1)})}\right|.$$
For $k=3$ and $i=2$, the transformation is given by
$$p_1^{(3)}=p_1^{(4)},\quad p_2^{(3)}=p_2^{(4)}+p_3^{(4)},\quad p_3^{(3)}=p_4^{(4)},\quad a_2^{(3)}=a_2^{(4)},\quad a_3^{(3)}=a_4^{(4)},\quad u_1=\frac{a_4^{(4)}-a_3^{(4)}}{a_4^{(4)}-a_2^{(4)}},\quad u_2=\frac{p_2^{(4)}}{p_2^{(4)}+p_3^{(4)}},$$
and the Jacobian matrix follows, with rows indexed by $(p_1^{(3)},p_2^{(3)},p_3^{(3)},a_2^{(3)},a_3^{(3)},u_1,u_2)$ and columns by $(p_1^{(4)},p_2^{(4)},p_3^{(4)},p_4^{(4)},a_2^{(4)},a_3^{(4)},a_4^{(4)})$,
$$\frac{\partial(p^{(k)},a^{(k)},u_1,u_2)}{\partial(p^{(k+1)},a^{(k+1)})} = \begin{pmatrix}
1&0&0&0&0&0&0\\
0&1&1&0&0&0&0\\
0&0&0&1&0&0&0\\
0&0&0&0&1&0&0\\
0&0&0&0&0&0&1\\
0&0&0&0&\alpha_1&\alpha_2&\alpha_3\\
0&\beta_1&\beta_2&0&0&0&0
\end{pmatrix},$$
where
$$\alpha_1=\frac{\partial u_1}{\partial a_2^{(4)}}=\frac{a_4^{(4)}-a_3^{(4)}}{\bigl(a_4^{(4)}-a_2^{(4)}\bigr)^2},\qquad
\alpha_2=\frac{\partial u_1}{\partial a_3^{(4)}}=\frac{-1}{a_4^{(4)}-a_2^{(4)}},\qquad
\alpha_3=\frac{\partial u_1}{\partial a_4^{(4)}}=\frac{a_3^{(4)}-a_2^{(4)}}{\bigl(a_4^{(4)}-a_2^{(4)}\bigr)^2},$$
$$\beta_1=\frac{\partial u_2}{\partial p_2^{(4)}}=\frac{p_3^{(4)}}{\bigl(p_2^{(4)}+p_3^{(4)}\bigr)^2},\qquad
\beta_2=\frac{\partial u_2}{\partial p_3^{(4)}}=\frac{-p_2^{(4)}}{\bigl(p_2^{(4)}+p_3^{(4)}\bigr)^2}.$$
Computing this determinant by expanding along the first row and then repeatedly along the rows containing a single unit entry leads to
$$\left|\frac{\partial(p^{(k)},a^{(k)},u_1,u_2)}{\partial(p^{(k+1)},a^{(k+1)})}\right| = \bigl|\alpha_2(\beta_2-\beta_1)\bigr| = \frac{1}{\bigl(p_2^{(4)}+p_3^{(4)}\bigr)\bigl(a_4^{(4)}-a_2^{(4)}\bigr)}.$$
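The determinant can also be double-checked symbolically. The short sketch below (using sympy, with variable names of our choosing) recomputes the 7x7 Jacobian of the merge move and verifies its absolute value.

```python
import sympy as sp

p1, p2, p3, p4, a2, a3, a4 = sp.symbols('p1 p2 p3 p4 a2 a3 a4', positive=True)

# Map (p^(4), a^(4)) -> (p^(3), a^(3), u1, u2) for the merge move of Example 11.6.
targets = [
    p1,                      # p1^(3)
    p2 + p3,                 # p2^(3)
    p4,                      # p3^(3)
    a2,                      # a2^(3)
    a4,                      # a3^(3)
    (a4 - a3) / (a4 - a2),   # u1
    p2 / (p2 + p3),          # u2
]
sources = [p1, p2, p3, p4, a2, a3, a4]

J = sp.Matrix([[sp.diff(t, s) for s in sources] for t in targets])
det = sp.simplify(J.det())
# Prints 1 or -1, i.e. |det| = 1 / ((p2 + p3) * (a4 - a2)).
print(sp.simplify(det * (p2 + p3) * (a4 - a2)))
```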
Problem 11.20
Denoting $Z=\min(X_1,\ldots,X_n)$, we have for all $t>0$,
$$P(Z>t) = \bigl(P(X_1>t)\bigr)^n = \exp(-n\lambda t).$$
Therefore, $Z\sim\mathcal{E}xp(n\lambda)$. If $X_i\sim\mathcal{E}xp(\lambda_i)$, then
$$P(Z>t) = \prod_{i=1}^n \exp(-\lambda_i t) = \exp\left(-\sum_{i=1}^n\lambda_i\,t\right)$$
and thus $Z\sim\mathcal{E}xp\bigl(\sum_{i=1}^n\lambda_i\bigr)$.
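A quick Monte Carlo sanity check of the second statement (the rates below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
rates = np.array([0.5, 1.0, 2.5])            # arbitrary rates lambda_i
x = rng.exponential(1.0 / rates, size=(100_000, 3))
z = x.min(axis=1)
# Both numbers should be close to 1 / sum(lambda_i) = 0.25.
print(z.mean(), 1.0 / rates.sum())
```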
Chapter 12

Problem 12.4

(a) The density of $X\sim\mathcal{B}e(\alpha,1)$ is $f(x)=\alpha x^{\alpha-1}\mathbb{I}_{[0,1]}(x)$. Therefore,
$$\mathbb{E}\bigl(X^{1-\alpha}\bigr) = \int_0^1 x^{1-\alpha}\,\alpha x^{\alpha-1}\,dx = \alpha.$$
(b) For $0<\epsilon<1$,
$$P\bigl(X^{1-\alpha}<\epsilon\bigr) = P\bigl(X<\epsilon^{1/(1-\alpha)}\bigr) = \int_0^{\epsilon^{1/(1-\alpha)}}\alpha x^{\alpha-1}\,dx = \epsilon^{\alpha/(1-\alpha)}.$$
(c) In this case, we have (see Example 12.13)
$$S^{TR} = \frac{y_{(T)}-y_{(1)}}{\sum_{t=1}^{T-1}\bigl(y_{(t+1)}-y_{(t)}\bigr)\,y_{(t)}^{\alpha-1}}\,.$$

Problem 12.17
Let us compute the determinant of $P-\lambda I_2$ (where $I_2$ is the $2\times2$ identity matrix):
$$\det(P-\lambda I_2) = \lambda^2+\lambda(\beta+\alpha-2)+1-\alpha-\beta = (\lambda-1)\bigl(\lambda-(1-\alpha-\beta)\bigr).$$
The second eigenvalue is $\lambda=1-\alpha-\beta$, and one can check that $v=(-\alpha,\beta)'$ is an associated eigenvector. Now, let us prove the inequality for $i=j=0$ and $i=1$, $j=0$ (the other cases are treated in the same way). First, we have $Pe_0'=(1-\alpha,\beta)'=e_0'+v$ and, similarly, $Pe_1'=e_1'-v$. Next, since $Pv=\lambda v$, we obtain
$$P^t e_0' = e_0' + v\sum_{j=0}^{t-1}\lambda^j = e_0' + \frac{v(1-\lambda^t)}{\alpha+\beta}
\qquad\text{and}\qquad
P^t e_1' = e_1' - \frac{v(1-\lambda^t)}{\alpha+\beta}.$$
Moreover, we know that $P(x^{(\infty)}=0)=\beta/(\alpha+\beta)$ and $P(x^{(\infty)}=1)=\alpha/(\alpha+\beta)$ (see Problem 6.51). Therefore, for $i=1$, $j=0$,
$$\bigl|P(x^{(t)}=1|x^{(0)}=0)-P(x^{(\infty)}=1)\bigr| = \bigl|e_0P^te_1'-P(x^{(\infty)}=1)\bigr| = \left|\frac{\alpha}{\alpha+\beta}(1-\lambda^t)-\frac{\alpha}{\alpha+\beta}\right| = \frac{\alpha|\lambda|^t}{\alpha+\beta} \le \epsilon,$$
since $|\lambda|^t=|1-\alpha-\beta|^t\le\epsilon\,(\alpha+\beta)/\max(\alpha,\beta)$. Likewise, for $i=j=0$,
$$\bigl|P(x^{(t)}=0|x^{(0)}=0)-P(x^{(\infty)}=0)\bigr| = \left|e_0P^te_0'-\frac{\beta}{\alpha+\beta}\right| = \left|1-\frac{\alpha}{\alpha+\beta}(1-\lambda^t)-\frac{\beta}{\alpha+\beta}\right| = \frac{\alpha|\lambda|^t}{\alpha+\beta} \le \epsilon.$$
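These deviations can be checked numerically; the sketch below uses arbitrary values of $\alpha$ and $\beta$ and compares the largest deviation of the first row of $P^t$ from the stationary distribution with the quantity $\alpha|\lambda|^t/(\alpha+\beta)$ (the two printed numbers should coincide).

```python
import numpy as np

alpha, beta = 0.3, 0.5                            # arbitrary transition probabilities
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])
pi = np.array([beta, alpha]) / (alpha + beta)     # stationary distribution
lam = 1 - alpha - beta

for t in (1, 5, 20):
    Pt = np.linalg.matrix_power(P, t)
    # deviation of row 0 from stationarity vs. alpha * |lambda|^t / (alpha + beta)
    print(t, np.abs(Pt[0] - pi).max(), alpha * abs(lam) ** t / (alpha + beta))
```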
Chapter 14

Problem 14.4

(a) Applying Bayes' theorem and the Markov property of the chain, the actualisation equations are
$$p(x_{1:t}|y_{1:t}) = \frac{p\bigl(x_{1:t},y_t|y_{1:(t-1)}\bigr)}{p\bigl(y_t|y_{1:(t-1)}\bigr)} = \frac{p\bigl(y_t|x_{1:t},y_{1:(t-1)}\bigr)\,p\bigl(x_{1:t}|y_{1:(t-1)}\bigr)}{p\bigl(y_t|y_{1:(t-1)}\bigr)} = \frac{f(y_t|x_t)\,p\bigl(x_{1:t}|y_{1:(t-1)}\bigr)}{p\bigl(y_t|y_{1:(t-1)}\bigr)}$$
and
$$p\bigl(x_{1:t}|y_{1:(t-1)}\bigr) = p\bigl(x_t|x_{1:(t-1)},y_{1:(t-1)}\bigr)\,p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr) = p(x_t|x_{t-1})\,p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr) = P_{x_{t-1}x_t}\,p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr).$$
(b) Combining the previous results, the filtering density can be rewritten as
$$p(x_{1:t}|y_{1:t}) = \frac{f(y_t|x_t)\,p\bigl(x_{1:t}|y_{1:(t-1)}\bigr)}{p\bigl(y_t|y_{1:(t-1)}\bigr)} = \frac{f(y_t|x_t)\,P_{x_{t-1}x_t}\,p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr)}{p\bigl(y_t|y_{1:(t-1)}\bigr)}.$$
Note that the computational complexity of the filtering density is partly due to the evaluation of $p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr)$. The complexity of $p\bigl(y_t|y_{1:(t-1)}\bigr)$ is of the same order because
$$p\bigl(y_t|y_{1:(t-1)}\bigr) = \sum_{x_{1:t}} p(y_t|x_{1:t})\,p\bigl(x_{1:t}|y_{1:(t-1)}\bigr) = \sum_{x_{1:t}} f(y_t|x_t)\,p\bigl(x_t|x_{1:(t-1)},y_{1:(t-1)}\bigr)\,p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr) = \sum_{x_{1:t}} f(y_t|x_t)\,p(x_t|x_{t-1})\,p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr),$$
the sums running over all paths $x_{1:t}$ in the state space.
(c) The propagation equation is
$$p\bigl(x_t|y_{1:(t-1)}\bigr) = \sum_{x_{1:(t-1)}} p\bigl(x_{1:(t-1)},x_t|y_{1:(t-1)}\bigr) = \sum_{x_{1:(t-1)}} p\bigl(x_t|x_{1:(t-1)},y_{1:(t-1)}\bigr)\,p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr) = \sum_{x_{1:(t-1)}} P_{x_{t-1}x_t}\,p\bigl(x_{1:(t-1)}|y_{1:(t-1)}\bigr).$$
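Putting the actualisation and propagation equations together yields the usual forward filter on a finite state space. The sketch below is only an illustration of these recursions; the argument names (initial distribution, transition matrix, matrix of emission likelihood values) are ours.

```python
import numpy as np

def forward_filter(pi0, P, lik):
    """Filtering probabilities p(x_t = i | y_{1:t}) for a finite-state HMM.

    pi0 : (k,) initial distribution of x_1
    P   : (k, k) transition matrix, P[i, j] = P(x_{t+1} = j | x_t = i)
    lik : (T, k) emission likelihoods, lik[t, i] = f(y_t | x_t = i)
    """
    T, k = lik.shape
    filt = np.zeros((T, k))
    pred = pi0                      # p(x_1 = i), i.e. the prediction for t = 1
    for t in range(T):
        unnorm = lik[t] * pred      # f(y_t | x_t) * p(x_t | y_{1:t-1})
        filt[t] = unnorm / unnorm.sum()
        pred = filt[t] @ P          # propagation: p(x_{t+1} | y_{1:t})
    return filt
```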
Problem 14.5

The Baum–Welch formulas can be established as follows.
(a) The forward and backward relations are
$$\alpha_1(i) = p(y_1,x_1=i) = f(y_1|x_1=i)\,\pi_i,$$
$$\begin{aligned}
\alpha_{t+1}(j) &= p(y_{1:t+1},x_{t+1}=j) = \sum_i p(y_{1:t},y_{t+1},x_{t+1}=j,x_t=i)\\
&= \sum_i p(x_{t+1}=j,y_{t+1}|y_{1:t},x_t=i)\,p(y_{1:t},x_t=i)\\
&= \sum_i p(y_{t+1}|x_{t+1}=j,y_{1:t},x_t=i)\,p(x_{t+1}=j|y_{1:t},x_t=i)\,p(y_{1:t},x_t=i)\\
&= \sum_i f(y_{t+1}|x_{t+1}=j)\,p(x_{t+1}=j|x_t=i)\,p(y_{1:t},x_t=i) = f(y_{t+1}|x_{t+1}=j)\sum_i P_{i,j}\,\alpha_t(i),
\end{aligned}$$
and
$$\beta_T(i)=1, \qquad
\begin{aligned}
\beta_t(i) &= p(y_{t+1:T}|x_t=i) = p(y_{t+1},y_{t+2:T}|x_t=i) = \sum_j p(y_{t+1},y_{t+2:T},x_{t+1}=j|x_t=i)\\
&= \sum_j p(y_{t+1}|x_{t+1}=j)\,p(y_{t+2:T}|x_{t+1}=j)\,p(x_{t+1}=j|x_t=i) = \sum_j f(y_{t+1}|x_{t+1}=j)\,\beta_{t+1}(j)\,P_{i,j}.
\end{aligned}$$
By means of the forward and backward relations, the conditional density $\gamma_t(i)$ can be rewritten as
$$\gamma_t(i) = p(x_t=i|y_{1:T}) = \frac{p(y_{1:T},x_t=i)}{p(y_{1:T})} = \frac{p(y_{t+1:T}|x_t=i)\,p(y_{1:t},x_t=i)}{p(y_{1:T})} = \frac{\alpha_t(i)\beta_t(i)}{\sum_j p(y_{1:T},x_t=j)} = \frac{\alpha_t(i)\beta_t(i)}{\sum_j\alpha_t(j)\beta_t(j)}.$$
(b) The computation of $\gamma_t(i)$ requires the summation of $k$ terms for each $t=1,\ldots,T$ and each $i=1,\ldots,k$, thus the computational complexity is $O(Tk^2)$.
(c)
$$\begin{aligned}
\xi_t(i,j) &= \frac{p(x_t=i,x_{t+1}=j,y_{1:T})}{p(y_{1:T})} = \frac{p(x_t=i,x_{t+1}=j,y_{1:t},y_{t+1},y_{t+2:T})}{p(y_{1:T})} = \frac{p(y_{t+1:T},x_{t+1}=j|x_t=i,y_{1:t})\,p(y_{1:t},x_t=i)}{p(y_{1:T})}\\
&= \frac{p(y_{t+2:T}|x_{t+1}=j)\,p(y_{t+1}|x_{t+1}=j)\,p(x_{t+1}=j|x_t=i)\,p(y_{1:t},x_t=i)}{p(y_{1:T})}\\
&= \frac{\beta_{t+1}(j)\,f(y_{t+1}|x_{t+1}=j)\,p(x_{t+1}=j|x_t=i)\,\alpha_t(i)}{\sum_i\sum_j \beta_{t+1}(j)\,f(y_{t+1}|x_{t+1}=j)\,p(x_{t+1}=j|x_t=i)\,\alpha_t(i)}.
\end{aligned}$$
The evaluation of this quantity requires the summation of $k$ terms, $k$ times, for each $t=1,\ldots,T$, thus the computational complexity is again $O(Tk^2)$.
(d) Denote by $\alpha_t^*(i)=\alpha_t(i)/c_t$ the normalised forward variables. Then
$$\frac{\alpha_t^*(i)\beta_t(i)}{\sum_j\alpha_t^*(j)\beta_t(j)} = \frac{\alpha_t(i)\beta_t(i)/c_t}{\sum_j\alpha_t(j)\beta_t(j)/c_t} = \frac{\alpha_t(i)\beta_t(i)}{\sum_j\alpha_t(j)\beta_t(j)} = \gamma_t(i),$$
so $\gamma_t(i)$ is unchanged when computed from the normalised forward variables.
138
Solution Manual
p(xs |xs−1 , y1:t ) = p(xs |xs−1 , y1:s−1 , ys:t ) p(xs , xs−1 , y1:s−1 , ys:t ) = p(xs−1 , y1:s−1 , ys:t ) p(y1:s−1 |xs , xs−1 , ys:t )p(ys:t , xs , xs−1 ) = p(y1:s−1 |xs−1 )p(xs−1 , ys:t ) p(y1:s−1 |xs−1 )p(ys:t , xs , xs−1 ) = p(y1:s−1 |xs−1 )p(xs−1 , ys:t ) p(ys:t , xs , xs−1 ) = = p(xs |xs−1 , ys:t ) p(xs−1 , ys:t ) (b) By representing the density in integral form and by applying Bayes’ theorem on y1:s , we obtain X p(xs , xs+1 |xs−1 , y1:t ) p(xs |xs−1 , y1:t ) = xs+1
X p(xs , xs+1 , y1:s |ys+1:t , xs−1 ) = p(y1:s |ys+1:t , xs−1 ) x s+1
X p(y1:s−1 |xs , xs+1 , xs−1 , ys:t )p(ys , xs |xs+1 , xs−1 , ys+1:t )p(xs+1 |xs−1 , ys+1:t ) = p(y1:s−1 |ys:t , xs−1 )p(ys |ys+1:t , xs−1 ) xs+1 X p(ys , xs |xs+1 , xs−1 , ys+1:t )p(xs+1 |xs−1 , ys+1:t ) ∝ xs+1
=
X
xs+1
p(ys |xs )p(xs |xs−1 )p(xs+1 |xs−1 , ys+1:t )
= f (ys |xs )Pxs−1 ,xs = f (ys |xs )Pxs−1 ,xs
X
xs+1
X
xs+1
p(xs+1 |xs−1 , ys+1:t ) p(xs+1 |xs−1 , y1:t )
For s = t, the backward equation becomes X p(xt+1 |xt−1 ) = f (yt |xt )Pxt−1 ,xt p(xt |xt−1 , y1:t ) = f (yt |xt )Pxt−1 ,xt xt+1
while for s = 1 the backward equation is X X p(x2 |x1 , y1:t ) . p(x2 |x1 , y1:t ) = f (y1 |x1 )π(x1 ) p(x1 |x0 , y1:t ) = f (y1 |x1 )Px0 ,x1 x2
x2
We thus need to evaluate for each s = 1, . . . , t and each xs−1 ∈ {1, . . . , k} a summation in xs+1 of k terms, thus a complexity of O(tk 2 ).
(c) Note that the joint distribution $p(x_{1:t}|y_{1:t})$ can be written as
$$p(x_{1:t}|y_{1:t}) = \prod_{r=1}^t p(x_r|x_{r-1},y_{1:t}),$$
with the convention that $p(x_1|x_0,y_{1:t})=p(x_1|y_{1:t})$. It is thus possible to simulate from the joint distribution by generating first $x_1$ from $p(x_1|y_{1:t})$ and then $x_r$ from $p(x_r|x_{r-1},y_{1:t})$ for $r=2,\ldots,t$. Simulation and evaluation of the joint distribution can be achieved in $O(tk^2)$ operations.
(d) Define the value function as
$$v_t(i) = \max_{x_{0:t-1}}\ p(x_{0:t},x_t=i|y_{1:t}).$$
Then the following recursive relation holds:
$$v_{t+1}(j) = \max_i\ v_t(i)\,P_{x_t=i,x_{t+1}=j}\,f(y_{t+1}|x_{t+1}=j).$$
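The recursion of part (d) is the Viterbi algorithm. A minimal sketch is given below, carried out in log scale to avoid underflow; the argument names are ours and the initial distribution plays the role of the term in $x_0$.

```python
import numpy as np

def viterbi(pi0, P, lik):
    """Most probable state path, argmax of p(x_{1:T} | y_{1:T}), via the value function v_t.

    pi0 : (k,) initial distribution; P : (k, k) transition matrix;
    lik : (T, k) emission likelihoods lik[t, i] = f(y_t | x_t = i).
    """
    T, k = lik.shape
    v = np.zeros((T, k))                           # value function v_t(j), log scale
    back = np.zeros((T, k), dtype=int)
    v[0] = np.log(pi0) + np.log(lik[0])
    for t in range(1, T):
        cand = v[t - 1][:, None] + np.log(P)       # cand[i, j] = v_{t-1}(i) + log P_{ij}
        back[t] = cand.argmax(axis=0)
        v[t] = cand.max(axis=0) + np.log(lik[t])   # v_t(j)
    path = np.zeros(T, dtype=int)
    path[-1] = v[-1].argmax()
    for t in range(T - 2, -1, -1):                 # backtrack the maximising path
        path[t] = back[t + 1, path[t + 1]]
    return path
```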
(e) Denote by
$$p_r^*(x_r|x_{r-1},y_{1:t}) = f(y_r|x_r)\,P_{x_{r-1},x_r}\sum_{x_{r+1}} p_{r+1}^*(x_{r+1}|x_r,y_{1:t})$$
the unnormalised densities of part (b), with the convention $\sum_{x_{t+1}} p^*_{t+1}(x_{t+1}|x_t,y_{1:t})\equiv 1$ and with $p_1^*(x_1|y_{1:t}) = \pi(x_1)f(y_1|x_1)\sum_{x_2}p_2^*(x_2|x_1,y_{1:t})$. Then the joint density $p(x_{1:t}|y_{1:t})$ can be rewritten as
$$p(x_{1:t}|y_{1:t}) = \frac{\pi(x_1)f(y_1|x_1)\sum_{i=1}^k p_2^*(i|x_1,y_{1:t})}{\sum_{i=1}^k p_1^*(i|y_{1:t})}\ \prod_{r=2}^t \frac{P_{x_{r-1}x_r}f(y_r|x_r)\sum_{i=1}^k p_{r+1}^*(i|x_r,y_{1:t})}{\sum_{i=1}^k p_r^*(i|x_{r-1},y_{1:t})},$$
each factor being exactly $p(x_r|x_{r-1},y_{1:t}) = p^*_r(x_r|x_{r-1},y_{1:t})\big/\sum_i p^*_r(i|x_{r-1},y_{1:t})$. Applying Bayes' rule,
$$p(y_{1:t}) = \frac{p(y_{1:t}|x_{1:t})\,p(x_{1:t})}{p(x_{1:t}|y_{1:t})} = \frac{\pi(x_1)f(y_1|x_1)\prod_{r=2}^t P_{x_{r-1}x_r}f(y_r|x_r)}{p(x_{1:t}|y_{1:t})},$$
and, substituting the previous expression of $p(x_{1:t}|y_{1:t})$, all the sums $\sum_i p_r^*(i|x_{r-1},y_{1:t})$ with $r\ge 2$ cancel in the telescoping product, leaving
$$p(y_{1:t}) = \sum_{i=1}^k p_1^*(i|y_{1:t}),$$
which is the normalising constant of the initial distribution.
(f) We conclude that the likelihood of a hidden Markov model with $k$ states and $T$ observations can be computed in $O(Tk^2)$ operations, as stated in (c).

Problem 14.7

(a) The first term is $\varphi_1(j)=p(x_1=j)$, while the $(t+1)$-st term is
$$\begin{aligned}
\varphi_{t+1}(j) &= p(x_{t+1}=j|y_{1:t}) = \frac{p(x_{t+1}=j,y_{1:t})}{p(y_{1:t})} = \frac{\sum_{i=1}^k p(x_{t+1}=j,x_t=i,y_{1:t})}{p(y_{1:t})}\\
&= \frac{\sum_{i=1}^k p(x_{t+1}=j|x_t=i,y_{1:t})\,p(x_t=i,y_{1:t})}{p(y_{1:t})} = \frac{\sum_{i=1}^k p(x_{t+1}=j|x_t=i)\,p(x_t=i,y_{1:t})}{p(y_t|y_{1:t-1})\,p(y_{1:t-1})}\\
&= \frac{\sum_{i=1}^k P_{ij}\,p(y_t|x_t=i,y_{1:t-1})\,p(x_t=i|y_{1:t-1})\,p(y_{1:t-1})}{p(y_t|y_{1:t-1})\,p(y_{1:t-1})} = \frac{\sum_{i=1}^k P_{ij}\,f(y_t|x_t=i)\,\varphi_t(i)}{\sum_{i=1}^k f(y_t|x_t=i)\,\varphi_t(i)}.
\end{aligned}$$
(b) Using the Markov property of the hidden Markov model, the likelihood factorises as follows:
$$p(y_{1:t}) = \prod_{r=1}^t p(y_r|y_{1:r-1}) = \prod_{r=1}^t \sum_{i=1}^k p(y_r|x_r=i)\,p(x_r=i|y_{1:r-1}) = \prod_{r=1}^t \sum_{i=1}^k f(y_r|x_r=i)\,\varphi_r(i),$$
and taking the logarithm, the log-likelihood is
$$\log p(y_{1:t}) = \sum_{r=1}^t \log\left\{\sum_{i=1}^k f(y_r|x_r=i)\,\varphi_r(i)\right\}.$$
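A direct implementation of this factorisation, using the prediction filter $\varphi$ of part (a), might look as follows (argument names are ours):

```python
import numpy as np

def hmm_loglik(pi0, P, lik):
    """log p(y_{1:t}) = sum_r log { sum_i f(y_r | x_r = i) phi_r(i) }  (part (b)).

    phi_r(i) = p(x_r = i | y_{1:r-1}) is the prediction filter of part (a).
    """
    T, k = lik.shape
    phi = pi0                          # phi_1(i) = p(x_1 = i)
    logp = 0.0
    for r in range(T):
        c = np.dot(lik[r], phi)        # sum_i f(y_r | x_r = i) phi_r(i)
        logp += np.log(c)
        phi = (lik[r] * phi) @ P / c   # phi_{r+1}(j) = sum_i P_ij f(y_r|i) phi_r(i) / c
    return logp
```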
(c) The complexity of this algorithm is $O(k^2(t-1))$.
(d) The gradient of the log-likelihood with respect to $\theta$ is
$$\nabla_\theta\log p(y_{1:t}) = \sum_{r=1}^t \nabla_\theta\log\left\{\sum_{i=1}^k f(y_r|x_r=i)\,\varphi_r(i)\right\} = \sum_{r=1}^t \frac{\sum_{i=1}^k\bigl[f(y_r|x_r=i)\,\nabla_\theta\varphi_r(i)+\varphi_r(i)\,\nabla_\theta f(y_r|x_r=i)\bigr]}{\sum_{i=1}^k f(y_r|x_r=i)\,\varphi_r(i)},$$
where $\nabla_\theta\varphi_t(j)$ can be computed recursively. Writing $c_t=\sum_i f(y_t|x_t=i)\,\varphi_t(i)$, so that $\varphi_{t+1}(j)=c_t^{-1}\sum_i P_{ij}f(y_t|x_t=i)\varphi_t(i)$, we have
$$\begin{aligned}
\nabla_\theta\varphi_{t+1}(j) &= \nabla_\theta\left(\frac{1}{c_t}\sum_{i=1}^k P_{ij}\,f(y_t|x_t=i)\,\varphi_t(i)\right)\\
&= \frac{1}{c_t}\sum_{i=1}^k P_{ij}\bigl[\varphi_t(i)\,\nabla_\theta f(y_t|x_t=i)+f(y_t|x_t=i)\,\nabla_\theta\varphi_t(i)\bigr] - \frac{\nabla_\theta c_t}{c_t^2}\sum_{i=1}^k P_{ij}\,f(y_t|x_t=i)\,\varphi_t(i)\\
&= \frac{1}{c_t}\sum_{i=1}^k P_{ij}\bigl[\varphi_t(i)\,\nabla_\theta f(y_t|x_t=i)+f(y_t|x_t=i)\,\nabla_\theta\varphi_t(i)\bigr] - \frac{\nabla_\theta c_t}{c_t}\,\varphi_{t+1}(j),
\end{aligned}$$
and, since $\nabla_\theta c_t=\sum_i\bigl[\varphi_t(i)\nabla_\theta f(y_t|x_t=i)+f(y_t|x_t=i)\nabla_\theta\varphi_t(i)\bigr]$,
$$\nabla_\theta\varphi_{t+1}(j) = \frac{1}{c_t}\sum_{i=1}^k\bigl(P_{ij}-\varphi_{t+1}(j)\bigr)\bigl[\varphi_t(i)\,\nabla_\theta f(y_t|x_t=i)+f(y_t|x_t=i)\,\nabla_\theta\varphi_t(i)\bigr].$$
(e) Since the emission densities do not depend on $\eta$, the gradient of the log-likelihood with respect to $\eta$ is
$$\nabla_\eta\log p(y_{1:t}) = \sum_{r=1}^t \frac{\sum_{i=1}^k f(y_r|x_r=i)\,\nabla_\eta\varphi_r(i)}{\sum_{i=1}^k f(y_r|x_r=i)\,\varphi_r(i)},$$
where the gradient of $\varphi_{t+1}(j)$ can be evaluated recursively:
$$\begin{aligned}
\nabla_\eta\varphi_{t+1}(j) &= \nabla_\eta\left(\frac{1}{c_t}\sum_{i=1}^k P_{ij}\,f(y_t|x_t=i)\,\varphi_t(i)\right)\\
&= \frac{1}{c_t}\sum_{i=1}^k f(y_t|x_t=i)\bigl[P_{ij}\,\nabla_\eta\varphi_t(i)+\varphi_t(i)\,\nabla_\eta P_{ij}\bigr] - \frac{\nabla_\eta c_t}{c_t}\,\varphi_{t+1}(j)\\
&= \frac{1}{c_t}\sum_{i=1}^k\Bigl\{\bigl[P_{ij}-\varphi_{t+1}(j)\bigr]\,f(y_t|x_t=i)\,\nabla_\eta\varphi_t(i)+f(y_t|x_t=i)\,\varphi_t(i)\,\nabla_\eta P_{ij}\Bigr\},
\end{aligned}$$
using $\nabla_\eta c_t=\sum_i f(y_t|x_t=i)\,\nabla_\eta\varphi_t(i)$.
(f) Note that $r=1,\ldots,t$ and, for each $r$, a summation of $k$ terms is needed; moreover, at each step, the evaluation of the gradient of $\varphi_{t+1}(j)$ requires a summation of $k$ terms for each $j$. Thus the complexity of these computations is $O(tk^2)$.
(g) Let $t$ be fixed and $\eta\in E$, $\theta\in\Theta$ be the parameter vectors. Denote by $p=\dim(E)+\dim(\Theta)$ the sum of the dimensions of the parameter spaces. Then, at each iteration, the gradient vectors $\nabla_\theta\varphi_r(i)\in\Theta$ and $\nabla_\eta\varphi_r(i)\in E$ have to be evaluated for $i=1,\ldots,k$. Thus the complexity of the algorithm is $O(pk)$.

Problem 14.8

Let
$$f(y|x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\left\{-\frac{(y-\mu_x)^2}{2\sigma_x^2}\right\}.$$
Then, in the formulas of part (d) of Problem 14.7, it suffices to substitute the gradient of $f(y|x)$ with respect to $(\mu_x,\sigma_x)$,
$$\nabla_{\mu_x,\sigma_x} f(y|x) = f(y|x)\begin{pmatrix}\dfrac{y-\mu_x}{\sigma_x^2}\\[2mm] \dfrac{(y-\mu_x)^2}{\sigma_x^3}-\dfrac{1}{\sigma_x}\end{pmatrix},$$
and to note that, in the recursive equation, $\nabla_{\mu_x,\sigma_x}\varphi_1(j)=0$.