Rajan Srinivasan Importance Sampling
Springer-Verlag Berlin Heidelberg GmbH
Engineering ONLINE LIBRARY: http://www.springer.de/engine/
Rajan Srinivasan
Importance Sampling: Applications in Communications and Detection
With 114 Figures
Springer
Dr. Rajan Srinivasan, University of Twente, Room EL/TN 9160, P.O. Box 217, 7500 AE Enschede, The Netherlands. e-mail: [email protected]
Library of Congress Cataloging-in-Publication Data Srinivasan, Rajan: Importance Sampling: Applications in Communications and Detection / Rajan Srinivasan. (Engineering online library) ISBN 978-3-642-07781-4 ISBN 978-3-662-05052-1 (eBook) DOI 10.1007/978-3-662-05052-1
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in other ways, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag Berlin Heidelberg GmbH. Violations are liable for prosecution under the German Copyright Law. http://www.springer.de © Springer-Verlag Berlin Heidelberg 2002. Originally published by Springer-Verlag Berlin Heidelberg New York in 2002. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera ready by author. Cover design: de'blik, Berlin. Printed on acid-free paper. SPIN: 10874744
In memory of my father
Mysore Srinivasan (1916-1989)
Preface
This research monograph deals with fast stochastic simulation based on importance sampling (IS) principles and some of its applications. It is in large part devoted to an adaptive form of IS that has proved to be effective in applications that involve the estimation of probabilities of rare events. Rare events are often encountered in scientific and engineering processes. Their characterization is especially important, as their occurrence can have catastrophic consequences of varying proportions. Examples range from fracture due to material fatigue in engineering structures, to exceedance of dangerous water levels during river floods, to false target declarations in radar systems. Fast simulation using IS is essentially a forced Monte Carlo (MC) procedure designed to hasten the occurrence of rare events. Development of this simulation method for the analysis of scientific phenomena is usually attributed to the mathematician von Neumann and others. Since its inception, MC simulation has found a wide range of employment, from statistical thermodynamics in disordered systems to the analysis and design of engineering structures characterized by high complexity. Indeed, whenever an engineering problem is analytically intractable (which is often the case) and a solution by numerical techniques prohibitively expensive computationally, a last resort to determine the input-output characteristics of, or states within, a system is to carry out a simulation. Simulation is concerned with replicating or mimicking a system and its operation by mechanizing the exact mathematical equations that describe it and all its inputs using a computer. The reliability of a simulation is governed primarily by the authenticity of the analytical model, that is, by how closely the mathematical descriptions used fit the actual physical system and its environs. The accuracy is determined by the precision of the computations.
In several applications, systems are driven or perturbed by stochastic inputs that may arise from natural sources or are derived from the outputs of other systems. It is often of interest to determine the average behaviour of a system in terms of its response. The MC method then uses a (discretized) model of these stochastic processes to generate random numbers, and runs them through the simulated system to give rise to responses of interest. If this is done a sufficiently large number of times, the law of large numbers guarantees that the averaged results will approach the mean or expected
behaviour of the system. Hence, analysis by simulation can play a very useful role in the design process of complex systems. The MC method, however, is not limited to studying systems with stochastic inputs. An early and classical use has been in the evaluation of integrals of functions over complicated multidimensional regions. Random points are generated over a simpler or more convenient region which contains the desired region of integration. The points which fall in the latter region are used to evaluate the integrands, and the results are weighted and summed to provide an estimate of the integral. There are important application areas wherein system performance is closely linked with the occurrence of certain rare phenomena or events. In digital communications, for example, bit error probabilities over satellite links using error correction coding can be required to be as low as 10^{-10}. In packet switching over telecommunication networks, an important parameter is the probability of packet loss at a switch. These probabilities are required to be of the order of 10^{-9}. False alarm probabilities of radar and sonar receivers are usually constrained to not exceed values close to 10^{-6}. Development of these sophisticated systems is primarily due to complex signal processing operations that underlie them. In such situations, analysis by mathematical or numerical techniques becomes very difficult owing to memories, nonlinearities, and couplings present in the system, and the high dimensionality of the processes involved. Conventional MC simulation also becomes ineffective because of the excessively long run times required to generate rare events in sufficiently large numbers for obtaining statistically significant results. It is in situations such as these that IS has a powerful role to play. It was first researched by physicists and statisticians. Its use subsequently spread to the area of reliability in the domains of civil and mechanical engineering.
In more recent times, IS has found several applications in queuing theory, performance estimation of highly reliable computing systems, and digital communications. Since the mid-1990s, it has made appreciable inroads into the analysis and design of detection algorithms that have applications in radar (and sonar) systems. Research in IS methods and new applications still goes on, especially as engineering systems become more complex and increasingly reliable. In simulations based on IS, probability distributions of underlying processes that give rise to rare events in a system are changed, or biased, causing these events to occur more frequently. This renders them quickly countable. Each event is then weighted appropriately and summed, to provide unbiased estimates of the rare event probabilities. It turns out that if the biasing distribution is carefully chosen, the resulting estimate has markedly lower (error) variance than the conventional MC estimate. Apart from the use of IS in specific applications, an important aspect of its research has therefore been concerned with the search for good biasing distributions. Several theoretical results on the latter subject are available in the literature, especially those concerning the use of tilted (or twisted) densities. These densities have been known for a long
time and have played a major role in development of the theory of large deviations. In fast simulation they have been shown to have attractive optimality properties in asymptotic situations, when dealing with large numbers of random variables. The approach taken in this monograph is somewhat different insofar as applications are concerned. Choice of good biasing distributions in a specific situation is largely left to the ingenuity of the analyst. This need cause no alarm to an intending user of IS. In many applications, choice of a family of (parameterized) biasing distributions can usually be made with a little thought. Once this is done, the rest of the procedure is concerned with determining parameters of the family that provide estimates that have low variances. It is in fact the most direct approach to obtaining accurate estimators based on IS, whenever these can be mechanized without too much difficulty. The chief aim of this monograph, therefore, is to introduce interested researchers, analysts, designers, and advanced students to the elements of fast simulation based on adaptive IS methods, with several expository numerical examples of elementary and applied nature being provided herein that hopefully render the techniques readily usable. The concept of IS is introduced and described in Chapter 1 with emphasis on estimation of rare event probabilities. Different biasing methods are described in Chapter 2 and evaluated in terms of the variances of the estimates that they provide. The concept of adaptive biasing is introduced in this chapter and optimization algorithms are developed. The IS simulation problem is posed as one of variance minimization using stochastic algorithms. The third chapter is devoted to sums of random variables. Tail probability estimation is discussed and a method based on conditional probability is developed. 
A simulation methodology for determining a number that is exceeded by sums of random variables with a given small probability is formulated. It is referred to as the inverse IS problem and forms the basis for parameter optimization in systems to achieve specified performance levels. This has several practical applications, as demonstrated by numerical examples. In this same chapter, a new approximation for tail probability is derived. The derivation of the Srinivasan density, an approximation for densities of sums of independent and identically distributed random variables, is given. Several simulation examples are given in these chapters to illustrate the nature of results that can be expected from well designed IS algorithms. The next chapter, Chapter 4, is a short one containing derivations of approximations for detection and false alarm probabilities of likelihood ratio tests. They complement some well known classical results in this topic. The remaining chapters, 5 to 8, are on applications of the IS techniques discussed previously. Chapter 5 presents an effective solution for applying IS to constant false alarm rate detection algorithms that are used in radar and sonar receivers. It is shown how adaptive techniques can be applied to their analysis and design. Several detection situations are described and numerical results provided. Results are provided in Chapter 6 on ensemble detection,
a technique that combines the outputs of several processors to achieve robustness properties. In Chapter 7 is described blind simulation, a procedure for handling situations in which the statistical distributions of underlying processes may be unknown or partially known. It is applied to a detection algorithm to demonstrate its capabilities. The second area of application studied in this monograph is in Chapter 8. It deals with performance evaluation of digital communication systems that cannot be handled analytically or even using standard numerical techniques. Parameter optimization is also addressed. Several examples are given that serve to illustrate how adaptive IS can be used for such systems. Some of the application examples in these last chapters are treated briefly in regard to their setting and mathematical background. In particular, the topic of space-time adaptive processing (STAP) detection algorithms in Chapter 5 mainly consists of indications of how IS could be used in their simulation. This was necessary since the material is still nascent and can be the subject of further research. Much of the material in this book consists of my research results developed since 1996, when I first became interested in the subject of fast simulation. Being strictly in the nature of a monograph, I have not dealt with several topics and results of IS that are important in their own right. From those scientists and authors whose works have not been included or mentioned, I beg indulgence. Undeniably, several of the results reported here were obtained in discussions and collaboration with students and colleagues. They have been given adequate due by way of referenced published literature. Nevertheless, it is my pleasure to recall and recognize here the encouragement, suggestions, and help that I received from friends and colleagues while penning this work. In particular I wish to thank Mansij Chaudhuri, S Ravi Babu, Edison Thomas, and A. 
Vengadarajan, who were my colleagues in India, and David Remondo Bueno, Wim van Etten, and Hans Roelofs, from Europe.

University of Twente, The Netherlands
January 2002
Rajan Srinivasan
Table of Contents
Preface
1. Elements of Importance Sampling
   1.1 Rare events and simulation
   1.2 Fast simulation
       1.2.1 Random functions
   1.3 Optimal biasing
       Example: An optimal biasing density
       1.3.1 Conditions on f*
   1.4 The simulation gain
2. Methods of Importance Sampling
   2.1 Conventional biasing methods
       2.1.1 Scaling
             Example: Scaling the Weibull density
             Example: Estimating a probability close to unity
             Transformations
       2.1.2 Translation
             Example: Optimal translated density
             Example: Translating the Gaussian density
             Example: Translating the Weibull density
             An equivalence
       2.1.3 Exponential twisting
             Example: Twisting the Gaussian density
             Example: Twisting the Gamma density
   2.2 Adaptive IS - optimized biasing
       2.2.1 Variance estimation and minimization
             Adaptive optimization
             Example: Optimum translation for Rayleigh density
       2.2.2 Estimation by IS
             Example: Estimating variance using IS
   2.3 Combined scaling and translation
       Example: Two-dimensional biasing for a Weibull density
   2.4 Other biasing methods
       2.4.1 Extreme order density functions
             Biasing with f_{n,1}
             Biasing with f_{n,n}
   Appendix A
   Appendix B
3. Sums of Random Variables
   3.1 Tail probability of an i.i.d. sum
   3.2 The g-method
       3.2.1 Advantage over usual IS
             Example: Scaling for Weibull i.i.d. sums
             Example: Translation for i.i.d. sums
   3.3 The inverse IS problem
       Example: Adaptive determination of t
       3.3.1 System parameter optimization
   3.4 Approximations for tail probability
       3.4.1 The g-representation
   3.5 Asymptotic IS
       3.5.1 The variance rate
             Minimizing the rate
       3.5.2 Asymptotic behaviour of the g-method
             Example: Laplace density
             Example: Laplace density
       3.5.3 Constant probability estimation
             The weighting function
   3.6 Density estimation for sums
       Example: Density estimation by translation
       3.6.1 An approximation: The Srinivasan density
             Convergence of the Srinivasan density
             Density and distribution pairs
             Example: Density approximation of Rayleigh sums
       3.6.2 Exponential twisting
   Appendix C
4. Detection Theory
   4.1 The Neyman-Pearson lemma
       4.1.1 Properties of likelihood ratio tests
   4.2 Approximations for the error probabilities
       4.2.1 False alarm probability
       4.2.2 Miss probability
   4.3 Asymptotically constant error probabilities
       4.3.1 Detection probability
       4.3.2 False alarm probability
   4.4 Densities for the log-likelihood ratio
   Appendix D
5. CFAR detection
   5.1 Constant false alarm rate detection
   5.2 IS for CFAR algorithms
       5.2.1 Input biasing
   5.3 Multiplier determination - adaptive optimization
   5.4 Exponential twisting for CA-CFAR
       5.4.1 Conventional IS estimator
       5.4.2 g-method estimator
             A classic example
   5.5 Approximations for CA-CFAR
       5.5.1 Using density approximations
             The asymptotic form
       5.5.2 Using exponential twisting
   5.6 The GM-CFAR detector
       5.6.1 Approximations for FAP
   5.7 Point of application of biasing
       Example: SO-CA-CFAR detector
   5.8 FAP decomposition for SO detectors: CA and GM
       5.8.1 Fast estimation
       5.8.2 Variance and IS gain
       5.8.3 SO- and GO-GM detectors
             Approximations for FAP
   5.9 Examples in CFAR detection
       Example: CA-CFAR detector in Rayleigh clutter
       Example: Censored OS CA-CFAR in Weibull clutter
       Example: ML-CFAR detector in Weibull clutter
   5.10 STAP detection
6. Ensemble CFAR detection
   6.1 Ensemble processing
   6.2 The E-CFAR detector
       6.2.1 Normalization
       6.2.2 FAP estimation and bias optimization
       6.2.3 Determining ensemble thresholds
   6.3 Performance in nonhomogeneous clutter
   6.4 Results for some ensembles
       Geometric mean detectors
       6.4.1 Comments
   6.5 Randomized ensembles
       6.5.1 FAP decompositions
       6.5.2 Choice of functions: further decompositions
             RE1
             REs 2, 3, 5
             RE4
             RE6
   6.6 Tuning the multipliers: homogeneous operating points
7. Blind Simulation
   7.1 Blind biasing
       7.1.1 The weighting function
   7.2 Tail probability estimation
       7.2.1 The h-function
       7.2.2 The asymptotic rate
       7.2.3 Blind simulation gain
             Partially blind case
             Completely blind case
   7.3 CFAR detection
       7.3.1 Estimator implementation
             Optimum twist
       7.3.2 The blind simulation gain
       7.3.3 An application
             Threshold adaptation and at control
       7.3.4 Comments
8. Digital Communications
   8.1 Adaptive simulation
   8.2 DPSK in AWGN
   8.3 Parameter optimization
       8.3.1 Noncoherent OOK in AWGN: threshold optimization
   8.4 Sum density of randomly phased sinusoids
   8.5 M-ary PSK in co-channel interference
       8.5.1 Interference dominated environment
             Coherent BPSK
             M-ary PSK
       8.5.2 Interference and AWGN
             Coherent BPSK
             Gaussian approximation
             M-ary PSK
   8.6 Crosstalk in WDM networks
       8.6.1 Gaussian approximation
       8.6.2 Chernoff bound
       8.6.3 IS estimation
             Threshold optimization
   8.7 Multiuser detection
   8.8 Capacity of multi-antenna systems
References
Index
1. Elements of Importance Sampling
The accurate estimation of probabilities of rare events through fast simulation is a primary concern of importance sampling. Rare events are almost always defined on the tails of probability density functions. They have small probabilities and occur infrequently in real applications or in a simulation. This makes it difficult to generate them in sufficiently large numbers that statistically significant conclusions may be drawn. However, these events can be made to occur more often by deliberately introducing changes in the probability distributions that govern their behavior. Results obtained from such simulations are then altered to compensate for or undo the effects of these changes. In this chapter the concept of IS is motivated by examining the estimation of tail probabilities. It is a problem frequently encountered in applications and forms a good starting point for the study of IS theory.
1.1 Rare events and simulation

Consider estimating by simulation the probability p_t of an event {X ≥ t}, where X is a random variable with distribution F and probability density function f(x) = F'(x), where the prime denotes the derivative. Throughout this book we shall assume that the density f exists. The value of t is such that the event is rare. The usual Monte Carlo (MC) simulation procedure for estimating p_t is to conduct Bernoulli trials. A K-length independent and identically distributed (i.i.d.) sequence {X_i}_1^K is generated from the distribution F, and the number k_t of random variables that lie above the threshold t is counted. The random variable k_t is characterized by the binomial distribution
\[
P(k_t = k) = \binom{K}{k}\, p_t^k \,(1 - p_t)^{K-k}, \qquad k = 0, 1, \ldots, K. \tag{1.1}
\]
The maximum likelihood estimator \(\hat{p}_t\) of p_t based on the observed k_t is obtained by setting
\[
\frac{\partial P(k_t = k)}{\partial p_t} = 0
\]
in (1.1). This yields
\[
\hat{p}_t = \frac{k_t}{K} = \frac{1}{K} \sum_{i=1}^{K} 1(X_i \ge t), \tag{1.2}
\]
where
\[
1(x \ge t) =
\begin{cases}
1, & x \ge t, \\
0, & \text{elsewhere}
\end{cases}
\]
is the indicator function for the event of interest. This estimate is usually referred to as the MC estimate. It is unbiased and has variance given by
\[
\operatorname{var} \hat{p}_t = \frac{p_t (1 - p_t)}{K} \approx \frac{p_t}{K} \quad \text{for small } p_t. \tag{1.3}
\]
Some observations follow. From (1.2) and the unbiasedness property it is clear that E{k_t} = K p_t. Therefore, in order to obtain, on the average, a non-zero number of threshold crossings and hence a non-zero value for the estimate \(\hat{p}_t\), we need to perform the simulation experiment with sequence lengths K definitely greater than 1/p_t. Estimates of the sequence lengths needed to measure p_t to a desired accuracy can be made. Suppose it is required to estimate p_t with 95% confidence of having a relative error no greater than 20%. That is, we wish to have
\[
P\!\left( \frac{|\hat{p}_t - p_t|}{p_t} \le 0.2 \right) = 0.95.
\]
Assuming K reasonably large and using a central limit theorem (CLT) argument, it turns out that K > 100/p_t. This calculation can easily be verified using standard error function tables for the unit normal distribution. Thus, if p_t = 10^{-6}, a sequence length of at least 10^8 would be required to achieve the specified accuracy. Such requirements place severe demands on the period lengths of most random number generators and, more importantly, necessitate large computation times for the simulation. Attempting to obtain a low variance estimate of p_t by increasing K is therefore clearly impractical. We note in passing that the complementary event {X < t} with probability q_t = 1 - p_t is not a rare event; however, essentially the same considerations as above would apply in estimating q_t accurately through simulation if p_t is small.
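The CLT sizing argument above can be sketched numerically. The helper below is our own illustrative reconstruction (the function name and interface are not from the text): it uses var p̂_t ≈ p_t(1 − p_t)/K and the two-sided normal quantile in place of error function tables.

```python
import math
from statistics import NormalDist

def mc_runs_needed(p_t, rel_err=0.2, confidence=0.95):
    """Smallest K such that P(|p_hat - p_t|/p_t <= rel_err) is roughly
    `confidence`, using the CLT approximation var(p_hat) = p_t(1 - p_t)/K."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile, ~1.96
    return math.ceil((z / rel_err) ** 2 * (1.0 - p_t) / p_t)

# For p_t = 1e-6 this gives roughly 9.6e7 trials, consistent with the
# K > 100/p_t = 1e8 order-of-magnitude figure quoted in the text.
print(mc_runs_needed(1e-6))
```

Note that the required K scales as 1/p_t, which is what makes plain MC impractical for rare events.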
1.2 Fast simulation

Importance sampling is concerned with the determination and use of an alternate density function f* (for X), usually referred to as a biasing density,
for the simulation experiment (Hammersley & Handscomb [27]). This density allows the event {X ≥ t} to occur more frequently, thereby necessitating smaller sequence lengths K for a given estimator variance. Alternatively, for a given K, use of the biasing density results in a variance smaller than that of the conventional MC estimate in (1.2). Introducing f* into the definition of p_t gives
\[
p_t = E\{1(X \ge t)\} = \int 1(x \ge t)\, \frac{f(x)}{f^*(x)}\, f^*(x)\, dx = E_*\{1(X \ge t)\, W(X)\}, \tag{1.4}
\]
where
\[
W(x) = \frac{f(x)}{f^*(x)}
\]
is a likelihood ratio and is referred to as the weighting function. The notation E_*{·} denotes expectation with respect to the density function f*. For (1.4) to hold it is required that f*(x) > 0 for all x ≥ t such that f(x) > 0. The last equality in (1.4) motivates the estimator
\[
\hat{p}_t = \frac{1}{K} \sum_{i=1}^{K} 1(X_i \ge t)\, W(X_i), \qquad X_i \sim f^*. \tag{1.5}
\]
The notation "X ∼ f" indicates that the random variable X is drawn from the distribution corresponding to the density function f. This is the IS estimator of p_t and is unbiased, by (1.4). That is, the estimation procedure is to generate i.i.d. samples from f* and, for each sample which exceeds t, increment the estimate by the weight W evaluated at the sample value. The results are averaged over the K trials. The variance of the IS estimator is easily shown to be
\[
\operatorname{var}_* \hat{p}_t = \frac{1}{K} \operatorname{var}_*\{1(X \ge t)\, W(X)\}
= \frac{1}{K}\left[ E_*\{1^2(X \ge t)\, W^2(X)\} - p_t^2 \right]
= \frac{1}{K}\left[ E\{1(X \ge t)\, W(X)\} - p_t^2 \right], \tag{1.6}
\]
the last equality following from the definition of W. The IS problem then centres around finding a biasing density f* such that the variance in (1.6) is less than the variance in (1.3) of the Monte Carlo estimate. Both the MC and IS estimators are consistent, for as K → ∞, \(\hat{p}_t \to p_t\) by the law of large numbers. However, the IS estimator tends to p_t faster than the MC estimator, as shall be seen later.
1.2.1 Random functions

Before studying properties that a good biasing density should possess, we introduce a generalization which will be of use in applications. Importance sampling is not restricted to the estimation of rare event or tail probabilities. It can also be used to evaluate expectations of functions of random variables. The estimation of expectations may be considered as a special case of the evaluation of multidimensional integrals by MC simulation. Define G ≡ E{g(X)} < ∞, where g(·) is a real-valued scalar function whose expectation is to be determined. The MC estimator of G is
\[
\hat{G} = \frac{1}{K} \sum_{i=1}^{K} g(X_i), \qquad X_i \sim f_n. \tag{1.7}
\]
In the above we have used vector notation to denote X = (X_1, …, X_n), an n-length random vector with n-variate density function f_n. The simulation sequence {X_i}_1^K is an i.i.d. vector sequence. As in the case of tail probabilities, \(\hat{G}\) is unbiased, with variance
\[
\operatorname{var} \hat{G} = \frac{1}{K}\left[ E\{g^2(X)\} - G^2 \right]. \tag{1.8}
\]
The corresponding unbiased IS estimator in this case is
\[
\hat{G} = \frac{1}{K} \sum_{i=1}^{K} g(X_i)\, W(X_i), \qquad X_i \sim f_n^*, \tag{1.9}
\]
with the n-variate biasing density denoted by f_n^*. The weighting function W is again the likelihood ratio between f_n and f_n^*. The variance of this estimator is easily shown to be
\[
\operatorname{var}_* \hat{G} = \frac{1}{K}\left[ E_*\{g^2(X)\, W^2(X)\} - G^2 \right] = \frac{1}{K}\left[ E\{g^2(X)\, W(X)\} - G^2 \right]. \tag{1.10}
\]
The relationship of estimating expectations of functions to rare events is as follows. Whereas G itself need not be small, the largest contribution to E{g(X)} is from values of g(X) for X belonging to some set 𝒢 for which P(𝒢) is small.
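The estimator (1.9) can be sketched for a scalar case in which the contribution to G is concentrated on a rare set. Here g(x) = max(x − t, 0) for standard Gaussian X, so G = E{(X − t)⁺} has the closed form φ(t) − t Q(t) for comparison; the translated biasing density N(t, 1) is again our illustrative choice, not one prescribed by the text.

```python
import math, random

random.seed(2)
t, K = 4.0, 200_000

# Closed form for comparison: E{(X - t)^+} = phi(t) - t*Q(t).
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))
G_exact = phi(t) - t * Q(t)

G_hat = 0.0
for _ in range(K):
    x = random.gauss(t, 1)                 # X ~ f* = N(t, 1)
    w = math.exp(0.5 * t * t - t * x)      # W(x) = f(x)/f*(x)
    G_hat += max(x - t, 0.0) * w           # g(x) * W(x), per (1.9)
G_hat /= K

print(G_exact, G_hat)
```

Plain MC with the same K would rely on the handful of samples exceeding t; the biased sampler places most of its samples where g is non-zero.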
1.3 Optimal biasing

There exists an optimal biasing density function, denoted f_{n,opt}^*, which minimizes the variance in (1.10) and, under certain conditions, reduces it to zero [27]. Indeed, applying Jensen's inequality to the expectation in (1.10) gives
\[
E_*\{g^2(X)\, W^2(X)\} \ge \left( E_*\{|g(X)|\, W(X)\} \right)^2 = \left( E\{|g(X)|\} \right)^2, \tag{1.11}
\]
with equality if and only if
\[
|g(X)|\, W(X) = c
\]
for some constant c and X ∼ f_n^*. Under this condition, c = E{|g(X)|}, and it follows from the definition of W that
\[
f_{n,opt}^*(x) = \frac{|g(x)|\, f_n(x)}{c} \tag{1.12}
\]
is the biasing density function which achieves the lower bound in (1.11) and provides a minimum variance estimate of G. Substituting the equality condition in (1.10) yields
\[
\operatorname{var}_* \hat{G} = \frac{1}{K}\left[ \left( E\{|g(X)|\} \right)^2 - G^2 \right]. \tag{1.13}
\]
In general the optimal biasing density function in (1.12) is unrealizable, as it is known only to within the unknown normalizing constant c^{-1}. Its form however can be a useful guide in selecting good biasing densities in specific applications. It should be noted that we resort to simulation for estimating G because of the analytical or numerical difficulties involved in evaluating it; the same difficulties would be encountered in determining the constant c. In the special case of non-negative functions g, we have c = G and (1.13) yields var_* \(\hat{G}\) = 0. For tail probability estimation, replacing g(x) by 1(x ≥ t) in the above, we get the zero variance condition
l(X 2': t) W(X) = Pt for X
rv
f*
with the corresponding (unique) optimal biasing density given by
f~Pt(x) = ~ l(x 2': t) f(x) Pt
(1.14)
This density places all its probability mass in the event region $\{X \ge t\}$, where it is proportional to the original density function, but again with the unknown proportionality constant $p_t^{-1}$. It provides a perfect estimate of $p_t$. Indeed, if $f_*^{opt}$ were known then simulation would not be required. In the i.i.d. sequence $\{X_i\}$ generated using the density function $f$, it is interesting to consider the distribution of those random variables that exceed $t$. The density function of these variables is nothing but the conditional density $f(x \mid X \ge t)$. It can easily be shown that

$$f(x \mid X \ge t) = \frac{dP(X \le x \mid X \ge t)}{dx} = \frac{1(x \ge t)\, f(x)}{p_t} = f_*^{opt}(x)$$

This suggests that random variables possessing density $f_*^{opt}$ can be collected by picking up those that exceed $t$ in the sequence $\{X_i\}$ generated from the density $f$. However this is not of direct help, because knowledge of $f_*^{opt}$ is still required to mechanize the weighting function $W$ in an IS implementation of the estimator in (1.5). Nevertheless these random variables can be used to make an estimate of the optimal density by employing a kernel estimator or any other density estimation technique. Development of this idea leads to a form of adaptive importance sampling (Ang, Ang & Tang [3], Karamchandani, Bjerager & Cornell [35]). In this monograph we will not study this, but will develop a form of IS which deals with adaptive optimization of parameterized families of biasing densities.

Example 1.1. An optimal biasing density
Shown in Figure 1.1 is a zero mean and unit variance Gaussian density with $t = 3.0902$ chosen to provide $p_t = 0.001$, together with the corresponding optimal density $f_*^{opt}$. •
Fig. 1.1. The original and optimal biasing density functions.
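The zero-variance property of (1.14) can be checked numerically. The following sketch (our own construction, with hypothetical helper names) samples from $f_*^{opt}$ for the $N(0,1)$ tail event of Example 1.1, i.e. from the standard normal truncated to $[t, \infty)$, by inverse-CDF bisection; every weighted sample $1(X \ge t)\, W(X)$ then equals $p_t$ exactly.

```python
import math, random

Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))   # Gaussian tail probability

def q_inv(p, lo=0.0, hi=40.0):
    """Numerical inverse of Q by bisection (illustrative helper)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Q(mid) > p else (lo, mid)
    return (lo + hi) / 2

t = 3.0902
pt = Q(t)                        # ~1e-3
random.seed(0)
# X = Q^{-1}(U * p_t) lands in [t, inf) with the truncated-normal law f*_opt
samples = [q_inv(random.random() * pt) for _ in range(100)]
# weight of each sample: W(x) = f(x)/f*_opt(x) = p_t on {x >= t}
weights = [pt if x >= t else 0.0 for x in samples]
```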
A good biasing density therefore aims to push probability mass into the event space and tries to maintain $1(X \ge t)\, W(X)$ constant in this region. It samples more frequently from the "important" event region. Although one can think of any number of density functions that place unity mass in the desired event region, they will provide varying amounts of variance reduction in the IS estimates. Let $I$ denote the expectation on the right hand side of (1.6) and $f_* \ne f_*^{opt}$ be any biasing density that is known to provide a reduced variance estimate of $p_t$. Then

$$p_t^2 \le I < p_t \qquad (1.15)$$

the second inequality following from the fact that if there is no biasing, i.e. $W(x) = 1$, then the variance in (1.6) is equal to the MC variance in (1.3). The shape of the biasing density in the event region is therefore of importance and will determine to a large extent the gains obtained by its use, as we shall see in later examples.

1.3.1 Conditions on $f_*$
Some conditions on the biasing density $f_*$ to provide reduced variance estimates can be derived. Comparing the integral forms of the expectations in (1.8) and (1.10), and using the definition of $W$, it follows that a sufficient condition for $\mathrm{var}_*\hat{G} \le \mathrm{var}\,\hat{G}$ is $W(x) \le 1$, all $x$. If $f_n$ and $f_{n*}$ have identical support then clearly this condition cannot hold everywhere. In the case of tail probability estimation, a comparison of (1.3) and (1.6) yields the sufficient condition

$$W(x) \le 1, \qquad x \ge t$$

That is, a good biasing density satisfies $f_*(x) \ge f(x)$ in the event region of interest and is guaranteed to give lower variance estimates than conventional MC simulation. The optimal density in (1.14) has this property. Necessary conditions can also be derived, but these are not very informative. Applying the Cauchy-Schwarz inequality for random variables to (1.10) yields

$$E_*\{g^2(X)\, W^2(X)\} \ge \frac{E_*^2\{g^2(X)\, W(X)\}}{E_*\{g^2(X)\}} = \frac{E^2\{g^2(X)\}}{E_*\{g^2(X)\}} \qquad (1.16)$$

the second step following from the definition of $W$. Imposing the requirement $\mathrm{var}_*\hat{G} \le \mathrm{var}\,\hat{G}$ in (1.8) and (1.10), and using (1.16), yields

$$E\{g^2(X)\} \ge E_*\{g^2(X)\, W^2(X)\} \ge \frac{E^2\{g^2(X)\}}{E_*\{g^2(X)\}} \qquad (1.17)$$

so that

$$E_*\{g^2(X)\} \ge E\{g^2(X)\} \qquad (1.18)$$

is the necessary condition. Properties required of the biasing density $f_*$ that satisfy (1.18) can only be deduced by specifying the form of the function $g$. For tail probabilities, with $g$ replaced by $1(X \ge t)$ in (1.18), the necessary condition reduces to

$$P_*(X \ge t) \ge P(X \ge t) \qquad (1.19)$$

which merely states that under the biasing distribution the rare event probability is increased.
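Both conditions are easy to verify numerically for a concrete biasing scheme. The sketch below (our own illustration, not from the text) checks the sufficient condition $W(x) \le 1$ on $\{x \ge t\}$ and the necessary condition (1.19) for a $N(0,1)$ density biased by translation to $N(c,1)$.

```python
import math

phi = lambda x, m=0.0: math.exp(-(x - m) ** 2 / 2) / math.sqrt(2 * math.pi)
Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))

t, c = 3.0, 3.0
# sufficient condition: W(x) = f(x)/f*(x) <= 1 for all x >= t
grid = [t + 0.01 * k for k in range(2000)]
sufficient = all(phi(x) / phi(x, c) <= 1.0 for x in grid)
# necessary condition (1.19): the event probability increases under biasing;
# for the shifted density, P*(X >= t) = Q(t - c)
necessary = Q(t - c) >= Q(t)
```

For this example $W(x) = e^{c^2/2 - cx} \le 1$ exactly when $x \ge c/2$, so the sufficient condition holds throughout the event region.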
1.4 The simulation gain

In any implementation of IS it is useful to quantify the computational benefits provided by the biasing scheme that is employed. A measure that is usually adopted is the gain in sample size over conventional MC simulation afforded by the IS scheme for equal estimator variances. Denoting by $K_C$ and $K_I$ the respective sample sizes in (1.8) and (1.10) and equating variances, the simulation gain $\Gamma \equiv K_C / K_I$ can be expressed as

$$\Gamma = \frac{E\{g^2(X)\} - G^2}{E_*\{g^2(X)\, W^2(X)\} - G^2} \qquad (1.20)$$

and for tail probabilities as

$$\Gamma = \frac{p_t(1 - p_t)}{E_*\{1^2(X \ge t)\, W^2(X)\} - p_t^2} \qquad (1.21)$$

As $G$ and $p_t$ are unknown, $\Gamma$ in (1.20) and (1.21) cannot be evaluated. This is true also for estimator variances. In practice, however, gains and variances can be estimated by setting up sample estimates of the expectations required in these quantities. For example, a sample variance estimate of $\mathrm{var}_*\hat{G}$ is

$$\widehat{\mathrm{var}}_*\hat{G} = \frac{1}{K}\left( \frac{1}{K}\sum_{i=1}^{K} g^2(X_i)\, W^2(X_i) - \hat{G}^2 \right), \qquad X_i \sim f_{n*} \qquad (1.22)$$

with $\hat{G}$ given by (1.9). This can be used in an implementation to measure the precision of the estimator.
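The sample estimates of (1.21) and (1.22) can be set up directly from the simulation output. The following sketch (our own variable names; the translated-Gaussian biasing is again only an illustration) estimates $p_t$, its variance, and the gain from a single IS run.

```python
import math, random

random.seed(2)
t, c, K = 3.0, 3.0, 50000
phi = lambda x, m=0.0: math.exp(-(x - m) ** 2 / 2) / math.sqrt(2 * math.pi)

xs = [random.gauss(c, 1) for _ in range(K)]
w = [phi(x) / phi(x, c) for x in xs]
g = [1.0 if x >= t else 0.0 for x in xs]
pt_hat = sum(gi * wi for gi, wi in zip(g, w)) / K
second = sum((gi * wi) ** 2 for gi, wi in zip(g, w)) / K   # estimate of E*{1 W^2}
var_hat = (second - pt_hat ** 2) / K                        # sample form of (1.22)
gain_hat = pt_hat * (1 - pt_hat) / (second - pt_hat ** 2)   # sample form of (1.21)
```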
2. Methods of Importance Sampling
A large part of IS research has been concerned with the search for good simulation densities, or those that approach the optimal. Most of the suggested biasing schemes in use are motivated by the requirement that, for tail probability estimation, the biasing density should effect an increase in the event probability as compared to the original density. In the previous chapter we introduced the problem of estimating the tail probability of a random variable with given density function. In most cases this probability can be either found analytically or evaluated accurately using numerical integration. The real power of IS lies in its ability to precisely estimate rare event probabilities involving a random variable that is a function of several other random variables. Such situations frequently arise in applications, and examples of functions include i.i.d. and non-i.i.d. sums, and other transformations encountered in communications and nonlinear signal processing. The usual approach to finding good biasing densities involves the selection of a family (or class) of density functions indexed by one or more parameters. The form of the representative density is chosen based on its ability to effect an increase in the event probability for an appropriate choice of the indexing parameters. Thus, once this choice is made, the rest of the IS problem is concerned with determining optimal parameter values. Biasing density families can be obtained directly as a result of transformations imposed on the original random variables or on their density functions. This is the method most often used in practice. Alternatively, densities can be chosen that are not apparently related to the original but which have the desired properties. The latter approach has not received much attention and we shall not deal with it here. Another approach to IS is concerned directly with the search for the optimal biasing density.
This search is carried out adaptively and it has been studied in some detail mainly in the application area of reliability. In this chapter we describe some of the available biasing methods and also those that are commonly used in applications. The single random variable case is treated here. The development is carried out by means of several illustrative examples.
2.1 Conventional biasing methods

There are three biasing methods that are most widely used in the applications of IS. These are scaling (Shanmugam & Balaban [66]), translation (Lu & Yao [40]), and exponential twisting (Siegmund [68]).

2.1.1 Scaling
Shifting probability mass into the event region $\{X \ge t\}$ by positive scaling of the random variable $X$ with a number greater than unity has the effect of increasing the variance (as also the mean) of the density function. This results in a heavier tail of the density, leading to an increase in the event probability. Scaling is probably one of the earliest biasing methods known and has been extensively used in practice. It is simple to implement and, except in a few special cases, usually provides conservative simulation gains as compared to other methods. In IS by scaling, the simulation density is chosen as the density function of the scaled random variable $aX$, where usually $a > 1$ for tail probability estimation. By transformation therefore

$$f_*(x) = \frac{1}{a}\, f\!\left(\frac{x}{a}\right)$$

and the weighting function is

$$W(x) = \frac{a\, f(x)}{f(x/a)}$$

The tail probability estimate is then given by (1.5), which can be written as

$$\hat{p}_t = \frac{1}{K}\sum_{i=1}^{K} 1(X_i \ge t)\, W(X_i), \qquad X_i \sim f_* \qquad (2.1)$$

The scaling factor $a$ is chosen to minimize the estimator variance in (1.6). This is equivalent to minimizing the expectation

$$I(a) = E_*\{1^2(X \ge t)\, W^2(X)\} = \int_t^{\infty} \frac{a\, f^2(x)}{f(x/a)}\, dx \qquad (2.2)$$

Under some mild conditions on $f$, we can differentiate inside the integral and set $a = 1$ to obtain $I'(1) = -t f(t) < 0$. Further, it can be shown that $I(a) \to +\infty$ as $a \to \infty$. These imply that $I(a)$ has a minimum for $1 < a < \infty$.
Two observations can be made that are helpful while implementing IS estimators. If $x f(x) \to 0$ as $x \to \infty$ (a property of most density functions), then it follows from (2.1) that for any $K < \infty$, $\hat{p}_t \to 0$ with probability 1 as $a \to \infty$. That is, the IS estimator will consistently underestimate with excessive scaling. This is despite the fact that $\hat{p}_t$ is unbiased. Secondly, most IS implementations involve adaptive algorithms which estimate the IS variance or the $I$-function and perform minimization. An estimate of $I(a)$ can be written as

$$\hat{I}(a) = \frac{1}{K}\sum_{i=1}^{K} 1(X_i \ge t)\, W^2(X_i), \qquad X_i \sim f_* \qquad (2.3)$$

and, using the same argument as above, $\hat{I}(a) \to 0$ as $a \to \infty$. That is, the estimated variance approaches zero with excessive scaling despite the actual variance becoming unbounded. Therefore excessive biasing is undesirable. We shall see this in some later examples.
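The underestimation pathology is easy to reproduce. The sketch below (our own construction) scales a Rayleigh density by a near-optimal factor and then by a grossly excessive one; with excessive scaling both $\hat{p}_t$ and the estimated $I$-function of (2.3) collapse to zero even though the true variance grows without bound.

```python
import math, random

random.seed(3)
t, K = 3.716922, 5000
f = lambda x: 2 * x * math.exp(-x * x)                   # Rayleigh, mu = 1, b = 2

def run(a):
    # sample from f*(x) = f(x/a)/a via the inverse CDF of the scaled Rayleigh
    xs = [a * math.sqrt(-math.log(1 - random.random())) for _ in range(K)]
    w = [a * f(x) / f(x / a) for x in xs]
    pt_hat = sum(wi for x, wi in zip(xs, w) if x >= t) / K
    i_hat = sum(wi * wi for x, wi in zip(xs, w) if x >= t) / K
    return pt_hat, i_hat

pt_good, i_good = run(3.786)     # near-optimal scaling, p_t ~ 1e-6
pt_bad, i_bad = run(1e6)         # grossly excessive scaling: both estimates vanish
```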
Example 2.1. Scaling the Weibull density

Consider a random variable $X$ with the one-sided density function

$$f(x) = \frac{b}{\mu}\left(\frac{x}{\mu}\right)^{b-1} e^{-(x/\mu)^b}, \qquad x \ge 0,\ \mu > 0,\ b > 0$$

This is the Weibull family of density functions with shape parameter $b$ and scale parameter $\mu$. The Rayleigh density is a member of this family for $b = 2$, and the exponential for $b = 1$. An estimate of

$$p_t = e^{-(t/\mu)^b}$$

is required through IS. The optimal biasing density is given by (1.14). Scaling the density $f$ by $a$ and setting up $W$ yields

$$W(x) = a^b\, e^{-(1 - a^{-b})(x/\mu)^b}$$

In implementing this estimator an important and obvious issue is the value of the scaling factor $a$ to be used. To determine an optimum value $a_{opt}$, we have to minimize the IS variance. This can be done by minimizing $I(a)$ with respect to $a$. However, it is not necessary to determine an $a_{opt}$ for every $(\mu, b)$ pair of the Weibull density. This can be seen as follows. First note that any random variable $X$ can be transformed to an exponential variable by the transformation

$$Y = -\log G(X)$$
where $G(x) = 1 - F(x)$. Therefore $G(x) = e^{-(x/\mu)^b}$ and

$$Y = \left(\frac{X}{\mu}\right)^b$$

with $Y$ having the unit exponential density. Suppose now that we estimate

$$q_\delta = P(Y \ge \delta)$$

using IS with a scaling parameter $a_e$. The corresponding weighting function for $Y$ is

$$W_{a_e}(y) = a_e\, e^{-(1 - 1/a_e)\, y}$$

and we need to minimize

$$I(a_e) = E_*\{1(Y \ge \delta)\, W_{a_e}^2(Y)\} = \int 1(y \ge \delta)\, a_e\, e^{-(2 - 1/a_e)\, y}\, dy$$

Now substituting for $W(x)$ in

$$I(a) = \int 1(x \ge t)\, W(x)\, f(x)\, dx$$

and making the change of variable $y = (x/\mu)^b$ yields the integral

$$I(a) = \int 1(y \ge \delta)\, a^b\, e^{-(2 - 1/a^b)\, y}\, dy$$

where $\delta = (t/\mu)^b$. Setting $a = a_e^{1/b}$ yields $I(a) = I(a_e)$, with $q_\delta = p_t$. Hence estimating $q_\delta$ for the exponential variable $Y$ results in the same simulation gain as estimating $p_t$ using $X$. It is easy to show that $I(a_e)$ is convex, with minimum occurring at the solution of

$$2 a_e^2 + 2(\log p_t - 1)\, a_e - \log p_t = 0$$

In terms of $t$ this is

$$2 a_e^2 - 2\left[\left(\frac{t}{\mu}\right)^b + 1\right] a_e + \left(\frac{t}{\mu}\right)^b = 0$$

For example, with $p_t = 10^{-6}$ this yields $a_e = 14.33$, and $a_{opt} = 3.7859$ for the Rayleigh density ($\mu = 1$, $b = 2$) at $t = 3.716922$. The resulting maximum gain is obtained from (1.21) as

$$\Gamma_{opt} = \frac{p_t - p_t^2}{I(a_{opt}) - p_t^2}$$
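The quadratic above has a closed-form root, so the optimum scaling factor for any Weibull density follows in two lines. This sketch (our own function name) reproduces the numbers quoted in the example.

```python
import math

def weibull_opt_scaling(pt, b):
    """Solve 2 a_e^2 + 2(log p_t - 1) a_e - log p_t = 0 (larger root),
    then map back through a_opt = a_e^(1/b)."""
    lp = math.log(pt)
    disc = math.sqrt((lp - 1) ** 2 + 2 * lp)   # equals sqrt(lp^2 + 1)
    ae = ((1 - lp) + disc) / 2
    return ae, ae ** (1.0 / b)

ae, a_opt = weibull_opt_scaling(1e-6, 2)       # Rayleigh case of the text
```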
Fig. 2.1. Simulation gain using scaling for Weibull density functions as a function of tail probability $p_t$.

For $p_t = 10^{-6}$ the maximum gain evaluates to approximately 54145, implying that approximately 1850 samples from $f_*$ would yield the same estimator variance as $10^8$ samples from $f$. Therefore the simulation gain depends only on $p_t$, with the value of $t$ determined by the density $f$. This is shown in Figure 2.1. In Figure 2.2 is shown the variation of the IS gain obtained as a function of the scaling parameter $a$ of the Rayleigh density. Note that for $a > 1414$ the simulation gain actually falls below unity, implying that for such biasing IS performs worse than conventional MC simulation. Simulation gains are available for scaling in the range $1 < a < 1414$, and for $a > 1414$ the IS estimator variance increases beyond that of the MC estimator ($a = 1$). The reason for this performance degradation is that as the scaling factor increases, the (sufficient) condition $f_*(x) > f(x)$, i.e. $W(x) < 1$ for $x \ge t$, is violated with increasing probability. In fact, $f_*(t) = f(t)$ at $a \approx 1000$ for $t = 3.71692$. The density functions $f$, $f_*^{opt}$, and $f_*$ using $a_{opt}$ are shown in Figure 2.3. When compared with $f_*^{opt}$, the scaled density $f_*$ leaves an appreciable amount of mass in the complementary region $\{X < t\}$. This suggests that better choices for a biasing density can be made. •

Example 2.2. Estimating a probability close to unity

From the results of Example 2.1 the $I$-function can be put in the form
$$I(a_e) = \frac{a_e^2}{2 a_e - 1}\; p_t^{\,2 - 1/a_e} \qquad (2.4)$$

Fig. 2.2. The simulation gain as a function of scaling factor for estimating the tail probability of the Rayleigh density; $a_{opt} = 3.7859$.
For a given $p_t$, minimization of this yields the optimum scaling factor for the exponential density. The scaling factor for any other Weibull density is obtained from $a = a_e^{1/b}$. Suppose we wish to estimate

$$P(X < t) = 1 - p_t$$

for the unit exponential density. This would result in an $I$-function given by

$$\bar{I}(a_e) = \frac{a_e^2}{2 a_e - 1}\left(1 - p_t^{\,2 - 1/a_e}\right) \qquad (2.5)$$

which is minimized for a value of $a_e$ different from that for minimizing $I(a_e)$. •
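A quick grid search (our own sketch, not code from the text) makes the contrast concrete: the minimizer of (2.4) is the large scaling factor of Example 2.1, while in this numerical example the minimizer of (2.5) sits essentially at $a_e = 1$, i.e. scaling offers almost nothing for a probability close to unity.

```python
import math

pt = 1e-6
I_tail = lambda a: a * a / (2 * a - 1) * pt ** (2 - 1 / a)        # (2.4)
I_body = lambda a: a * a / (2 * a - 1) * (1 - pt ** (2 - 1 / a))  # (2.5)

grid = [0.51 + 0.001 * k for k in range(30000)]   # a_e in (0.5, 30.5)
a_tail = min(grid, key=I_tail)
a_body = min(grid, key=I_body)
```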
Fig. 2.3. The original, optimal, and optimally scaled biasing density functions.

Transformations. The invariance of simulation gain to certain transformations of the underlying random variable encountered in the above example can be expressed in a more general form. Consider the $I$-function

$$I_X(a) = \int_t^{\infty} \frac{a\, f^2(x)}{f(x/a)}\, dx \qquad (2.6)$$

of an IS scaling problem of estimating $p_t$. Let $Y = g(X)$ be a monotone transformation to a random variable $Y$ with density function $v$, and let $h$ denote the inverse of $g$. The IS problem becomes one of estimating

$$p_t = P(Y \ge g(t)) \quad \text{or} \quad P(Y \le g(t))$$

depending on whether $g$ is increasing or decreasing, and remains invariant, with identical variance and simulation gain, provided $g$ has one of the forms given below. The $I$-function associated with the transformed problem for increasing $g$ is given by

$$I_Y(a_v) = \int_{g(t)}^{g(\infty)} \frac{a_v\, v^2(y)}{v(y/a_v)}\, dy = I_X(a) \qquad (2.7)$$

where $a_v(a)$ denotes the scaling factor for the transformed problem, with $a > 0$ and $a_v > 0$. The function $g$ is given by

$$g(x) = \begin{cases} k x^b & \text{if } x \ge 0 \\ -k(-x)^b & \text{if } x < 0 \end{cases} \qquad (2.8)$$

with $a_v(a) = a^b$, or

$$g(x) = -k x^{-b}, \qquad x \ge 0 \qquad (2.9)$$

or

$$g(x) = k(-x)^{-b}, \qquad x \le 0 \qquad (2.10)$$

with $a_v(a) = a^{-b}$, and $b > 0$, $k > 0$ in all the above. These are the increasing forms; the decreasing forms are just $-g(x)$ in all cases. A constructive proof is given in Appendix A. As $I_Y(a_v) = I_X(a)$, it follows that optimum scaling factors for biasing can be determined for either IS problem, resulting in the same variance and simulation gain. There are no other transformations with this invariance property for scaling problems. The problem considered in Example 2.1 is an instance of this property, where $-\log G(x) = g(x)$ of (2.8) with $k = \mu^{-b}$. Further results on transformations for biasing are discussed in Section 2.4.1. •

In most problems that require the use of IS, the variance or equivalently the $I$-function cannot be evaluated analytically, as was done in the above example. If in some problem $I$ can be easily obtained analytically or even numerically, then usually so will be the case with $p_t$, because, from (1.15), $I = p_t$ with no biasing. Therefore in general, minimization of $I$ with respect to biasing parameters will require estimation of $I$, also by simulation. This is studied in a later section.

Scaling for tail probability estimation suffers from an important drawback that becomes especially pronounced when dealing with sums of random variables that have two-sided infinite support. While scaling shifts probability mass into the desired event region, it also pushes mass into the complementary region $\{X < t\}$, which is undesirable. If $X$ is a sum of $n$ random variables, the spreading of mass takes place in an $n$-dimensional space. The consequence of this is a decreasing IS gain for increasing $n$, and is called the dimensionality effect. A nonuniform scaling technique which mitigates this degradation has been proposed by Davis [18] for Gaussian processes in digital communication systems.
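The invariance (2.7) can be checked numerically for the transformation of Example 2.1. The sketch below (our own crude trapezoidal integrator, with assumed helper names) compares the scaling $I$-function of a Rayleigh problem with that of the transformed unit-exponential problem using $a_e = a^2$.

```python
import math

def trapz(fn, lo, hi, n=200000):
    """Simple trapezoidal rule (illustrative helper)."""
    h = (hi - lo) / n
    return h * (sum(fn(lo + k * h) for k in range(1, n)) + (fn(lo) + fn(hi)) / 2)

f = lambda x: 2 * x * math.exp(-x * x)      # Rayleigh, mu = 1 (b = 2)
v = lambda y: math.exp(-y)                  # unit exponential after y = x^2
t, a = 3.716922, 2.5
I_X = trapz(lambda x: a * f(x) ** 2 / f(x / a), t, 60.0)
I_Y = trapz(lambda y: (a * a) * v(y) ** 2 / v(y / (a * a)), t * t, 3600.0)
```

Both integrals evaluate to the same quantity up to discretization error, as (2.7) asserts.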
2.1.2 Translation

Another simple and, in many cases, effective biasing technique employs translation of the density function (and hence the random variable) so as to place much of its probability mass in the rare event region. Translation does not suffer from a dimensionality effect and has been successfully used in several applications relating to simulation of digital communication systems. It often provides better simulation gains than scaling. In biasing by translation, the simulation density is given by

$$f_*(x) = f(x - c), \qquad c > 0$$

where $c$ is the amount of shift and is to be chosen to minimize $\mathrm{var}_*\hat{p}_t$, or $I(c)$. The IS estimate is given by
$$\hat{p}_t = \frac{1}{K}\sum_{i=1}^{K} 1(X_i \ge t)\, W(X_i), \qquad X_i \sim f_*, \quad W(x) = \frac{f(x)}{f(x - c)} \qquad (2.11)$$

Using similar arguments as in biasing by scaling, it can be shown that $I(c)$ has a minimum occurring for $0 < c < \infty$. If $f$ is such that $f(x + c)/f(x) \to 0$ as $c \to \infty$ then, as in scaling, $\hat{p}_t \to 0$ with probability 1. All density functions with decaying tails satisfy this property. Random variables with one-sided densities need to be treated with some care. When $X$ has density $f(x)$, $x \ge 0$, the biasing density is

$$f_*(x) = f(x - c), \qquad x \ge c > 0$$

The mean value of $\hat{p}_t$ is therefore

$$E_*\{\hat{p}_t\} = P(X \ge c) \le p_t \qquad \text{for } c > t \qquad (2.12)$$
i.e. the estimate is biased if the original density is translated beyond $t$. Hence translations for one-sided densities are restricted to $c \le t$. The $I$-function of this biased estimate for $c > t$ is given by

$$I(c) = E_*\left\{1(X \ge t)\, \frac{f^2(X)}{f^2(X - c)}\right\} = \int_c^{\infty} \frac{f^2(x)}{f(x - c)}\, dx = \int_c^{\infty} W(x, c)\, f(x)\, dx$$

where $W(x, c) = f(x)/f(x - c)$. Therefore $I(c) \to 0$ as $c \to \infty$ provided $W(x, c) \to 0$, so that excessive translation produces the same estimation pathologies as excessive scaling.

Example 2.3. A density for which translation is optimal

Requiring that the translated density $f(x - t)$ coincide with the optimal biasing density in (1.14) for every $t$ and every shift $b$ ($> 0$) yields

$$P(X \ge t + b) = P(X \ge t)\, P(X \ge b)$$

Denoting $G(x) = 1 - F(x)$, the above gives the relation $G(t + b) = G(t)\, G(b)$, with $G$ a nonnegative monotone decreasing function. Hence $G$ has a unique solution of the form $G(x) = e^{-kx}$ for some $k > 0$ and $x > 0$. It follows that $f$ is a one-sided exponential density. •

Example 2.4. Translating the Gaussian density

Let $X$ be $N(0, 1)$. Then, with

$$W(x) = e^{-cx + c^2/2}$$

we have

$$I(c) = Q(c + t)\, e^{c^2}$$
where

$$Q(x) \equiv \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-y^2/2}\, dy$$

Using the approximation

$$Q(x) \approx \frac{1}{x\sqrt{2\pi}}\, e^{-x^2/2}$$

for large $x$ in $I(c)$ and minimizing on $c$ yields

$$c_{opt} \approx \sqrt{1 + t^2} \approx t$$
for $t \gg 1$. The effects of overbiasing by excessive translation are easily seen in this example by noting that

$$I(c) \approx \frac{1}{(c + t)\sqrt{2\pi}}\, e^{-c(t - c/2) - t^2/2} \to \infty \quad \text{as } c \to \infty$$

Using the value $c = t$ yields

$$I(c = t) \approx \frac{1}{2\sqrt{2\pi}\, t}\, e^{-t^2}$$

which holds for large $t$. Compare this with the value of $p_t^2$ for large $t$, where

$$p_t \approx \frac{1}{t\sqrt{2\pi}}\, e^{-t^2/2}$$

As an example, for $p_t = 10^{-6}$, $t = 4.7534$ with $c_{opt} \approx 4.8574$. Using $c_{opt}$ in $I(c)$, the gain evaluates to $\Gamma_{opt} \approx 184310$, so that approximately 540 samples from $f_*$ are required as compared to $10^8$ from $f$. If this same example is worked for biasing by scaling with a parameter $a$, then it can be shown in a similar manner as above that the corresponding $I$-function is minimized for $a \approx t$. In this case approximately 5100 samples from (scaled) $f_*$ are required, nearly a ten-fold increase over translation biasing. Shown in Figure 2.4 are the numbers of IS samples $K_{IS}$ required to estimate $p_t$ for translation and scaling biasing, using

$$K_{IS} = \frac{100}{p_t\, \Gamma_{opt}}$$
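The numbers quoted for the Gaussian example follow directly from the exact $I$-function $I(c) = e^{c^2} Q(t + c)$. This sketch (our own helper names) recomputes $t$, $c_{opt}$, and the gain (1.21).

```python
import math

Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))

def q_inv(p, lo=0.0, hi=40.0):
    """Numerical inverse of Q by bisection (illustrative helper)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Q(mid) > p else (lo, mid)
    return (lo + hi) / 2

pt = 1e-6
t = q_inv(pt)                         # ~4.7534
c = math.sqrt(1 + t * t)              # ~4.8574
I = math.exp(c * c) * Q(t + c)
gain = pt * (1 - pt) / (I - pt * pt)  # close to the quoted 1.8e5
```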
Whereas this number increases for scaling, for translation it is relatively constant with decreasing $p_t$. •

2.1.3 Exponential twisting

In exponential twisting the biasing density is constructed from the moment generating function $M(s) = E\{e^{sX}\}$ of $X$. With $\mu(s) \equiv \log M(s)$, the twisted density is

$$f_*(x) = e^{sx - \mu(s)}\, f(x) \qquad (2.19)$$

where the twisting parameter $s > 0$ is to be chosen. The weighting function is $W(x) = e^{\mu(s) - sx}$, so that

$$I(s) = \int_t^{\infty} e^{\mu(s) - sx}\, f(x)\, dx \qquad (2.20)$$

Since $e^{-sx} \le e^{-st}$ for $x \ge t$ and $s > 0$, $I(s)$ is bounded above by

$$\bar{I}(s) = e^{\mu(s) - st} = e^{-J_t(s)} \qquad (2.21)$$

where $J_t(s) \equiv st - \mu(s)$. Denote the mean of $X$ by $m$. For any $K > 0$ and $s > 0$,

$$\mu(s) = \log \int_{-\infty}^{\infty} e^{sx} f(x)\, dx \ge \log \int_K^{\infty} e^{sx} f(x)\, dx \ge \log\left[e^{Ks}\, P(X \ge K)\right] = Ks + \log P(X \ge K)$$

Then $J_t(s) \le s(t - K) - \log P(X \ge K)$. Choosing $K$ arbitrarily large, we have therefore that as $s \to \infty$, $J_t(s) \to -\infty$ for all $t$. Similarly $J_t(s) \to -\infty$ as $s \to -\infty$. As $J_t(0) = 0$ and $J_t'(0) = t - m$, it follows that $J_t(s)$ achieves a maximum at $s = s_t \le 0$ for $t \le m$ and at $s_t > 0$ for $t > m$, where $s_t$ satisfies $J_t'(s_t) = t - \mu'(s_t) = 0$. This holds for two-sided densities $f$. In the case of one-sided $f$ the property $J_t(s) \to -\infty$ holds for $s \to \infty$; the proof given above is valid with an obvious change of integration limit. To prove this for $s \to -\infty$ requires certain assumptions on the behaviour of $M(s)$. Note that for $t = 0$, $J_t(s) \to \infty$ as $s \to -\infty$. However, by Jensen's inequality, $M(s) \ge e^{sm}$ and hence $J_t(s) \le s(t - m)$. Therefore as $s \to -\infty$, $J_t(s) \to -\infty$ when $t > m$. This condition is always met in tail probability estimation. We will encounter the situation when $t < m$ with one-sided densities later. Now define the function

$$I(t) \equiv \max_s J_t(s) = J_t(s_t)$$

This is known as the large deviations rate function and plays a central role in large deviations theory. It will be encountered in Section 3.4 when dealing with sums of random variables. It is easily shown that $I(t)$ is convex in $t$. Note from the above that $J_m(s) \le 0 = J_m(0)$ for all $s$. It follows that $I(m) = 0$ and $I(t) \ge J_t(0) = 0 = I(m)$, with $s_t = s_m = 0$ at $t = m$. That is, $I(t)$ is a positive function and has a minimum at $t = m$. Shown in Figure 2.5 are the functions $J_t(s)$ for an example density, together with the loci of their maxima $I(t)$ as a function of $s_t$.

Fig. 2.5. The functions $J_t(s)$ and the large deviations function $I(t)$.

Therefore $\bar{I}(s)$ in (2.21) is minimized, and the upper bound is tightest, at $s = s_t$,
and the corresponding optimized biasing density is

$$f_*(x) = e^{s_t x - \mu(s_t)}\, f(x) \qquad (2.22)$$

To minimize $I(s)$ exactly, we have from (2.20) that

$$I''(s) = \int_t^{\infty} \left[\mu''(s) + (\mu'(s) - x)^2\right] e^{\mu(s) - sx}\, f(x)\, dx > 0$$

so that $I(s)$ is also convex. Setting $I'(s) = 0$ for $s = s_{opt}$, say, yields

$$\int_t^{\infty} \left(\mu'(s_{opt}) - x\right) e^{\mu(s_{opt}) - s_{opt} x}\, f(x)\, dx = 0 \qquad (2.23)$$

It is easy to see that $s_t$ cannot be a solution to (2.23), for

$$\mu'(s_{opt}) \int_t^{\infty} e^{\mu(s_{opt}) - s_{opt} x}\, f(x)\, dx = \int_t^{\infty} x\, e^{\mu(s_{opt}) - s_{opt} x}\, f(x)\, dx > t \int_t^{\infty} e^{\mu(s_{opt}) - s_{opt} x}\, f(x)\, dx$$

Hence $\mu'(s_{opt}) > t$ when $I'(s_{opt}) = 0$, whereas $\mu'(s_t) = t$. Excepting a few special cases, $s_t$ and $s_{opt}$ usually have to be determined numerically, or through simulation. As $\bar{I}$ is only an upper bound to $I$, $s_{opt}$ is a better choice than $s_t$ for the biasing density $f_*$. However, the difference in estimator performance may be marginal. Shifting of probability mass into the event region by exponential twisting can be deduced from the fact that under $f_*$ the mean of the random variable becomes $E_*\{X\} = \mu'(s) \ge t > m$ for tail probability estimation, the equality holding for $s = s_t$.

Example 2.6. Twisting the Gaussian density

Let $X$ be $N(0, 1)$. Then $\mu(s) = s^2/2$, and using this in (2.19) reveals that $f_*$ is just the $N(s, 1)$ density. Hence, for Gaussian random variables, exponential twisting results in density translation. From (2.22) and the results of Example 2.4, respectively, we see that

$$s_t = t \qquad \text{and} \qquad s_{opt} \approx \sqrt{1 + t^2}$$ •
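The distinction between $s_t$ and $s_{opt}$ can be verified numerically for the Gaussian case, where $I(s) = e^{s^2} Q(t + s)$ in closed form. The golden-section search below is our own sketch; the minimizer it finds is close to $\sqrt{1 + t^2}$, while $s_t = t$ only minimizes the bound (2.21).

```python
import math

Q = lambda x: 0.5 * math.erfc(x / math.sqrt(2))
I = lambda s, t: math.exp(s * s) * Q(t + s)

def golden_min(fn, lo, hi, iters=200):
    """Golden-section search for the minimum of a unimodal function."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if fn(c) < fn(d):
            b = d
        else:
            a = c
    return (a + b) / 2

t = 4.7534
s_opt = golden_min(lambda s: I(s, t), 0.0, 3 * t)
```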
Example 2.7. Twisting the Gamma density

Let $X$ have the Gamma density

$$f(x) = \frac{a}{\Gamma(r)}\, (a x)^{r-1} e^{-a x}, \qquad x \ge 0,\ r > 0,\ a > 0$$

with parameters $r$ and $a$. Then

$$M(s) = \left(\frac{a}{a - s}\right)^{r}, \qquad s < a$$
For the two-parameter biasing density $f_*(x) = a^{-1} f((x - c)/a)$ obtained by combined scaling and translation, the second partial derivatives of the weighting function with respect to the biasing parameters satisfy $W_{aa} > 0$ and $W_{cc} > 0$. This implies that

$$I_{aa} = E\{1(X \ge t)\, W_{aa}\} > 0 \qquad \text{and} \qquad I_{cc} = E\{1(X \ge t)\, W_{cc}\} > 0$$

From the above second derivatives we also have $W_{aa} W_{cc} = W_{ac}^2$. This yields

$$\det J = E\{1(X \ge t)\, W_{aa}\}\, E\{1(X \ge t)\, W_{cc}\} - E^2\{1(X \ge t)\, W_{ac}\} = E\{1(X \ge t)\, W_{aa}\}\, E\{1(X \ge t)\, W_{cc}\} - E^2\{1(X \ge t)\, W_{aa}^{1/2} W_{cc}^{1/2}\} > 0$$

the last following from the Cauchy-Schwarz inequality. These together imply that $J$ is positive definite and hence $I(a, c)$ is convex. Minimization of $I$ can then be performed using the two-dimensional (stochastic) Newton recursion

$$\theta_{m+1} = \theta_m - \delta\, J_m^{-1} \nabla I(\theta_m)$$

where $\delta$ is used to control convergence. Implementation results are shown in Figures 2.11 and 2.12. The estimated simulation gain over the two-dimensional parameter space is shown in Figure 2.11 for the Rayleigh density ($\mu = 1$, $b = 2$) with $p_t = 10^{-6}$ at $t = 3.716922$. Figure 2.12 shows convergence of the algorithm to the optimum biasing parameter vector $\theta_{opt} \approx (0.26, 3.62)$. The maximum gain at $\theta_{opt}$ is nearly $2.5 \times 10^6$. This is eight times more than that achieved by optimum translation ($c_{opt} = 3.075$) alone, considered in Example 2.8 on page 27. In fact, at a shift of 3.075 the best scaling (down) factor is 0.722, providing a gain of $3.7 \times 10^5$, whereas for a shift $c = t$ the best scaling factor is 0.16, with a maximum gain of $3 \times 10^5$. The biased density using $\theta_{opt}$ is shown in Figure 2.13 together with $f_*^{opt}$ and the density translated by 3.075. It can be observed that the two-parameter density more closely resembles $f_*^{opt}$. •

It is evident from this example that two-parameter IS problems require appreciably more effort for an optimal implementation than problems involving a single biasing parameter. Nevertheless the extra effort is justified if significant savings in sample sizes can be obtained in an application.
Fig. 2.11. Simulation gains achieved by combined scaling and translation.

Fig. 2.12. Convergence of the combined scaling and translation parameter algorithm.

Fig. 2.13. Optimal, optimally translated, and optimally scaled and translated biasing densities.

2.4 Other biasing methods

It is clear that, motivated by the shape of $f_*^{opt}$, several density functions can be proposed as candidates for biasing. Suggestions in the literature include the use of apparently unrelated densities with heavier tails to replace the original density, and nonlinear transformations that have the desired effects. For example, in Schlebusch [62] and Beaulieu [7] the density function of the transformation $|X| + c$ is used as a biasing density. This provides a reflection and translation of the original density function. In Beaulieu [8], densities with heavier tails are used for biasing. Search can be carried out for good biasing densities using more general classes of transformations. However, such formulations may not result in much advantage, since the distorted tails produced by the transformations could have little resemblance to the original density. Another approach described in the literature suggests selection of biasing densities based on certain distance criteria (Orsak & Aazhang [46]). Here a constrained class of densities that excludes $f_*^{opt}$ is defined. Then from this class a density function is chosen that minimizes the $f$-divergence or Ali-Silvey distance with respect to $f_*^{opt}$, using some results from robust hypothesis testing. A form of biasing is now proposed and discussed that makes use of the density functions of order statistics.

2.4.1 Extreme order density functions
In a finite set of i.i.d. random variables the rare event $\{X \ge t\}$ may or may not take place. Irrespective of this occurrence, however, it is naturally interesting to study, in the ordered sample, the density functions of the random variables which are known to be larger in magnitude. It turns out that certain order statistic density functions are well suited for use as biasing densities. Consider the i.i.d. sequence $\{X_j\}_1^n$ with density $f$ and its order statistic

$$Y = (Y_1, \ldots, Y_n), \qquad Y_1 \le Y_2 \le \cdots \le Y_n$$

where $Y_j = X_{(j)}$. The vector $Y$ is no longer i.i.d. As there are $n!$ ways of rearranging an $n$-length sequence, the joint density function of $Y$ can be written as

$$f_n(y) = n!\, f(y_1) \cdots f(y_n), \qquad -\infty < y_1 \le \cdots \le y_n < \infty \qquad (2.45)$$

Successive integration yields the marginal density of the $r$-th extreme as

$$f_{n,r}(y) = r \binom{n}{r} F^{r-1}(y)\, [1 - F(y)]^{n-r} f(y), \qquad -\infty < y < \infty \qquad (2.46)$$

The lower and upper extremes $Y_1$ and $Y_n$ have the respective densities

$$f_{n,1}(y) = n\, [1 - F(y)]^{n-1} f(y) \qquad (2.47)$$

and

$$f_{n,n}(y) = n\, F^{n-1}(y)\, f(y) \qquad (2.48)$$

with corresponding distribution functions

$$F_{n,1}(y) = 1 - [1 - F(y)]^n \qquad \text{and} \qquad F_{n,n}(y) = F^n(y)$$

Note that if we are not interested in actually ordering an $n$-length sample, $n$ need not be an integer. That is, the densities in (2.47) and (2.48) and the distribution functions remain valid ones if $n$ ($> 0$) is a real number. In the following we assume that $n$ is positive and real. The two extreme order densities have interesting behavior as $n$ takes values less than and greater than unity. Their effects on $f$ are shown in Figures 2.14 and 2.15 and summarized in Table 2.1. In these depictions the original density $f$ is one-sided. This suggests that the upper extreme order density $f_{n,n}$ can be used for biasing with $n > 1$, and the lower density $f_{n,1}$ can be used as it is for $n < 1$ and with translation for $n > 1$. Using translation with $f_{n,1}$ leads to a two-parameter IS problem, in $n$ and the translation $c$. The procedure is as discussed in Example 2.10. It must be mentioned that intermediate extreme order density functions $f_{n,r}$ cannot be used for biasing because they result in infinite IS variance.
Fig. 2.14. Extreme order densities for $n > 1$.

Table 2.1. Properties of $f_{n,1}$ and $f_{n,n}$

value of $n$   $f_{n,1}(x)$   $f_{n,n}(x)$
$n > 1$        compresses     compresses and pushes forward
$n < 1$        stretches      compresses
Biasing with $f_{n,1}$. With $f_* = f_{n,1}$, the $I$-function for estimating $p_t$ becomes

$$I_1(n) = \frac{1}{n} \int_t^{\infty} [1 - F(x)]^{1-n} f(x)\, dx = \frac{1}{n} \int_{F(t)}^{1} (1 - F)^{1-n}\, dF = \frac{p_t^{\,2-n}}{n(2 - n)} \qquad (2.49)$$

We make two observations. This method of biasing has the canonical property that the maximum simulation gain is independent of the actual form of the density $f$ and depends only on $p_t$; of course, for a given $p_t$, $f$ determines the value of $t$. Secondly, comparing (2.49) with (2.4), it is seen that for $n = 1/a_e$, $I_1(n) = I(a_e)$. The optimum value of $n$ is given by
Methods of Importance Sampling 41

$$ n_{\mathrm{opt}} = 1 + \frac{1}{\beta} \left[ 1 - \left( 1 + \beta^2 \right)^{1/2} \right] \qquad (2.50) $$

where $\beta = -\log p_t$; this is shown in Figure 2.16, and in Figure 2.18 is shown the maximum simulation gain as a function of $p_t$.

Fig. 2.15. Extreme order densities for n < 1.

That is, using the lower extreme density $f_{n,1}$ for biasing provides simulation results equivalent to those obtained from scaling the exponential density. The reason is not hard to see. Recall that for every density $f$ we can turn to the exponential through the transformation $-\log G(x)$. As can easily be verified, the same transformation converts any $f_{n,1}$ to an exponential with parameter $1/n$. Making this change of variable in (2.49) results in the I-function
$$ I_l(n) = \int_{-\log G(t)}^{\infty} \frac{e^{-2y}}{n\, e^{-ny}}\, dy = I(a_e), \qquad a_e = 1/n \qquad (2.51) $$
This equivalence holds for all densities $f$. In view of the results on transformations on pages 14 to 16, it is interesting to examine whether lower extreme order density functions are scaled versions of $f$. The class of densities that have this property is given by the solution of

$$ f_{n,1}(x) = \frac{1}{a}\, f(x/a) \qquad (2.52) $$
with $n = n(a)$. As shown in Appendix B, the Weibull density is the only one with this property. Also, there are no two-sided densities that share the property. •
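The closed form (2.50) can be checked numerically against a direct minimization of the I-function (2.49). This check is our own sketch, not part of the text:

```python
import math

def I_lower(n, pt):
    # I-function (2.49) for biasing with f_{n,1}; finite only for 0 < n < 2.
    return pt ** (2.0 - n) / (n * (2.0 - n))

def n_opt(pt):
    # Closed form (2.50): n_opt = 1 + (1 - sqrt(1 + beta^2))/beta,
    # with beta = -log(pt).
    beta = -math.log(pt)
    return 1.0 + (1.0 - math.sqrt(1.0 + beta * beta)) / beta

pt = 1e-6
grid = [k / 10000.0 for k in range(1, 20000)]   # n values in (0, 2)
n_star = min(grid, key=lambda n: I_lower(n, pt))
print(n_opt(pt), n_star)   # the closed form matches the grid minimizer
```

Since (2.49) depends on $f$ only through $p_t$, the resulting maximum gain is the same for every underlying density, as the text notes.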
Fig. 2.16. Optimum n for biasing with f_{n,1}.
Biasing with f_{n,n}. With $f^* = f_{n,n}$ the I-function for estimating $p_t$ becomes

$$ I_u(n) = \frac{1}{n} \int_{F(t)}^{1} F^{1-n}\, dF = \frac{1 - (1-p_t)^{2-n}}{n(2-n)}, \qquad n \ne 2 \qquad (2.53) $$

with $I_u(2) = -\tfrac{1}{2}\log(1-p_t)$. It can easily be verified that $I_u$ has a minimum for some $n > 1$; the optimum $n$ is shown in Figure 2.17. Hence, biasing with $f_{n,n}$ also has the property that the maximum simulation gain does not depend on the density $f$ for a given $p_t$. The maximum simulation gain is compared in Figure 2.18 with that for biasing with $f_{n,1}$; the comparison shows that biasing with $f_{n,n}$ provides superior estimates. As in the case of $f_{n,1}$, we can turn to the exponential density through the transformation $Y = -\log F(X)$. Using this in (2.53) yields
Fig. 2.17. Optimum n for biasing with f_{n,n}.
Fig. 2.18. Maximum simulation gains for biasing with extreme order densities.
$$ I_u(n) = \int_{0}^{-\log F(t)} \frac{e^{-2y}}{n\, e^{-ny}}\, dy \qquad (2.54) $$

which can be recognized, from (2.5) on page 14, as $I(a_e)$ for estimating $P(Y \le -\log(1-p_t))$ when $Y$ is unit exponential and $a_e = 1/n$. As before, the density function whose scaled version is $f_{n,n}$ can be found using arguments parallel to those in Appendix B. Alternatively, it may be noted that the lower extreme density $f_{n,1}$, with underlying i.i.d. density $f$, becomes, through the transformation $Y = 1/X$, an upper extreme density $v_{n,n}$ with underlying density $v$. The form of $v$ is obtained from $f$ through this transformation. Therefore it follows that if $f$ is the Weibull density given by (2.63), then $v$ is the density we are looking for with the scaling property. This is given by

$$ v(x) = \frac{b}{\mu^b}\, x^{-b-1}\, e^{-(\mu x)^{-b}}, \qquad x \ge 0,\; b > 0,\; \mu > 0 \qquad (2.55) $$

•
Appendix A

We outline a derivation for the transformations in (2.8)-(2.10) on pages 15 and 16. Let $h$ denote the inverse of $g$. Making a change of variable in (2.6) and introducing the density $v$ yields

$$ J_x(a) = \int_{g(t)}^{g(\infty)} \frac{a\, v^2(y)\; h'\big(g(a^{-1}h(y))\big)}{v\big(g(a^{-1}h(y))\big)\; h'(y)}\, dy \qquad (2.56) $$

where we assume that $g$ is increasing, so that $g(\infty) > g(t)$. This integral reduces to the form (2.57) only if $g$ is chosen to satisfy $g(a^{-1}h(y)) = y/a_v$, which can be written as the pair

$$ g(x/a) = a_v^{-1}\, g(x) \qquad \text{and} \qquad h(y/a_v) = a^{-1}\, h(y) \qquad (2.58) $$

noting that (2.58) implies $h'(y/a_v)/h'(y) = a_v/a$. Note that for $a = 1$, $a_v(1) = 1$. As $g$ is increasing, it follows from (2.58) that $g(-\infty) = 0$ or $-\infty$. Let $g(-\infty) = 0$. Then from (2.58), $g(0) = 0$ or $\infty$. If $g(0) = 0$, $g$ cannot be increasing; hence $g(0) = \infty$, so that $g$ is defined for $x \le 0$. This implies from (2.58) that $a_v(a)$ is a decreasing function of $a$. Now let $g(-\infty) = -\infty$. Again, from (2.58) it follows that $g(0) = 0$ or $\infty$. If $g(0) = \infty$, then there exists an $x_0 < 0$ such that $g(x_0) = 0$; then, for any $a$ such that $a_v \ne 1$, this violates (2.58) for $x = x_0$. Therefore $g(0) = 0$, and this implies from (2.58) that $a_v(a)$ is an increasing function of $a$.

Construction of $g$ is as follows. Consider the case $g(-\infty) = 0$. Then $g > 0$ for $x \le 0$. Setting $x = -a$ in (2.58) yields $g(-a) = g(-1)\, a_v(a)$, which can be written as $g(x) = k\, a_v(-x)$ for $x \le 0$, where $k = g(-1) > 0$. Then $g(x/a) = k\, a_v(-x/a) = k\, a_v(-x)/a_v(a)$ using (2.58). This gives $a_v(-x) = a_v(a)\, a_v(-x/a)$ for $x \le 0$, or $a_v(xy) = a_v(x)\, a_v(y)$ for $x > 0$ and $y > 0$. It is easily shown that, for $a_v$ to be a decreasing function, the unique solution is $a_v(a) = a^{-b}$ with $b > 0$. Substituting this in $g(x)$ yields (2.10). The other monotone forms can be derived using essentially parallel steps, and we omit the details.
Appendix B

We can write (2.52) as

$$ n(a)\, [1-F(x)]^{n(a)-1} f(x) = \frac{1}{a}\, f(x/a) \qquad (2.59) $$

for $a > 0$, $n(a) > 0$. Note that $n(1) = 1$. Integrating (2.59) yields the functional equation

$$ G(x/a) = G^{n(a)}(x) \qquad (2.60) $$

where $G = 1 - F$. It is easily verified from (2.60) that $n(a)$ would have to be monotone increasing for $x < 0$ and decreasing for $x > 0$; this implies that two-sided densities do not satisfy (2.52). As usual we will assume that $X$ is nonnegative. It is easy to show that (2.60) has the solution

$$ G(x/a) = e^{-k\, n(a)/n(x)}, \qquad k = -\log G(1) \qquad (2.61) $$

Setting $a = 1$ in the above yields

$$ G(x) = e^{-k/n(x)} $$

from which

$$ G(x/a) = e^{-k/n(x/a)} $$

Equating this with (2.61) yields $n(x) = n(x/a)\, n(a)$, or $n(xy) = n(x)\, n(y)$ for $x > 0$ and $y > 0$. This has the solution

$$ n(x) = x^{-b}, \qquad b > 0 \qquad (2.62) $$

Substituting (2.62) in $G(x)$ above leads to the density

$$ f(x) = \frac{b}{\mu} \left( \frac{x}{\mu} \right)^{b-1} e^{-(x/\mu)^b}, \qquad x \ge 0,\; b > 0,\; \mu > 0 \qquad (2.63) $$

which is the Weibull density with shape parameter $b$ and scale parameter $\mu = k^{-1/b}$. It can easily be verified that it satisfies (2.59). By the manner of construction it follows that the solution is unique.
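As a quick numerical confirmation of Appendix B (our sketch, not from the book), the Weibull density can be checked directly against the defining property (2.59) with $n(a) = a^{-b}$ from (2.62):

```python
import math

def weibull_pdf(x, b, mu=1.0):
    # Weibull density (2.63) with shape b and scale mu.
    return (b / mu) * (x / mu) ** (b - 1) * math.exp(-((x / mu) ** b))

def weibull_sf(x, b, mu=1.0):
    # Survival function G = 1 - F for the Weibull density.
    return math.exp(-((x / mu) ** b))

b, mu, a = 1.7, 1.0, 2.5
n = a ** (-b)   # n(a) from (2.62)
for x in (0.3, 1.0, 2.2):
    lhs = n * weibull_sf(x, b, mu) ** (n - 1) * weibull_pdf(x, b, mu)  # f_{n,1}(x)
    rhs = weibull_pdf(x / a, b, mu) / a                                # f(x/a)/a
    print(x, lhs, rhs)   # the two sides agree, verifying (2.59)
```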
3. Sums of Random Variables
Several practical applications involve functions of many random variables. In systems driven or perturbed by stochastic inputs, sums of random variables frequently appear as quantities of importance. For example, they play a central role in most estimation operations in signal processing applications. In this chapter we apply IS concepts to sums of i.i.d. random variables. Apart from the usual biasing techniques, a method referred to as the g-method, Srinivasan [70], will be described. While not a form of biasing, it exploits knowledge of the common distribution function of the single variable to enhance the performance of any biasing technique. The g-method has a powerful feature, namely that of differentiability of the IS estimate, which permits solution of the inverse IS problem. This problem is one of finding, through simulation, a threshold that is exceeded by a sum of random variables with a specified (tail) probability. It is of great importance in applications, for example in the determination of thresholds for radar and sonar detectors, and in parameter optimization in communication systems. All these systems are designed to operate with specific performance probabilities. A solution to the inverse IS problem is obtained by minimizing a suitable objective function. The asymptotic situation, when the number of terms in the sum becomes large, is of both theoretical and practical interest. This is also studied in the chapter from the viewpoint of IS. Approximate expansions are developed for the tail probability and IS variance using exponential twisting. Using optimization by simulation for the inverse IS problem, it is possible, within experimental limits, to implement a decreasing threshold as the number of terms in the sum is allowed to become large while maintaining a specified tail probability. Of natural interest in such a situation is the asymptotic behaviour of the biasing scheme that is used.
It turns out that the asymptotic IS variance and the resulting simulation gain approach limiting non-zero values. It is shown that this result is closely related to the convergence of the i.i.d. sum to normality. The estimation of density functions of sums is then studied. An approximate expression for sum densities and its asymptotic form are derived. Through example and simulation it is established that the approximate expression captures the true form of the sum density even for small n.
3.1 Tail probability of an i.i.d. sum

Consider estimating by simulation the probability $p_t(n) = P(S_n \ge t)$, where

$$ S_n = \frac{1}{n} \sum_{j=1}^{n} X_j $$

and the $\{X_j\}_1^n$ are i.i.d. with common density function $f(x)$ having mean $m$ and variance $v^2$. As $n \to \infty$, by the law of large numbers, $S_n \to m$ and $p_t \to 0$ if $m < t$. A well known upper bound on $p_t(n)$ is the Chernoff bound, [90], which gets tighter with increasing $n$. In many practical applications the statistical behavior of $S_n$ for finite $n$ is of importance. As is well known, except in a few special cases, the density or distribution of $S_n$ can be found only through an $n$-fold convolution. This is always a computationally tedious exercise. The tail probability $p_t(n)$ can be expressed as
$$ p_t(n) = \int_{x \ge t} f_{S_n}(x)\, dx = E\{ 1(S_n \ge t) \} \qquad (3.1) $$

Here $f_{S_n}$ denotes the density of the sum $S_n$ and

$$ f_n(\mathbf{x}) = \prod_{1}^{n} f(x_j) $$

denotes the $n$-variate density function of the vector $\mathbf{X} = (X_1, \ldots, X_n)$. While all results obtained in previous sections are applicable to $f_{S_n}$, we are usually faced with situations where this density is unavailable. That is, we are constrained to work with the component density $f$. The unbiased IS estimator $\hat p_t(n)$ of $p_t$ can be written as

$$ \hat p_t(n) = \frac{1}{K} \sum_{i=1}^{K} 1(S_{n,i} \ge t)\, W_n(\mathbf{X}_i), \qquad \mathbf{X}_i \sim f_{n*} \qquad (3.2) $$

where $W_n = f_n/f_{n*}$ is the weighting function and $f_{n*}$ denotes an $n$-variate biasing density. The estimator variance is

$$ \mathrm{var}_*\, \hat p_t(n) = \frac{1}{K}\, \mathrm{var}_*\, 1(S_n \ge t)\, W_n(\mathbf{X}) = \frac{1}{K} \big[ E_*\{ 1^2(S_n \ge t)\, W_n^2(\mathbf{X}) \} - p_t^2(n) \big] \qquad (3.3) $$
Following (1.14) on page 5, the optimal biasing density is given by

$$ f_{n*}^{\mathrm{opt}}(\mathbf{x}) = \frac{1}{p_t(n)}\, 1(s_n \ge t)\, f_n(\mathbf{x}) \qquad (3.4) $$

This is an $n$-variate joint density function with identical marginals. Selection of a good joint density function is very difficult, and to render the biasing problem amenable it is the practice to restrict $f_{n*}$ to be $n$-variate i.i.d. The restriction to i.i.d. biasing densities will entail loss in estimator performance, and perfect estimation cannot be achieved. Any of the biasing methods described earlier can be applied to the individual random variables comprising $S_n$, of course with varying performances. Unfortunately, determining optimal i.i.d. IS densities for tail probability estimation of sums appears to be a mathematically intractable problem. While all biasing schemes are applicable to sums, there is a restriction on the use of translation. This form of i.i.d. biasing cannot be used for sums consisting of random variables that are one-sided. The reason is simple. The domain of support of the $n$-variate product biasing density does not cover the event region $\{S_n \ge t\}$ when $n > 1$. That is, if each shift is $c$, then the part not covered by the biasing density is

$$ \{ S_n \ge t \} \cap \Big( \bigcup_{1}^{n} \{ 0 \le X_j < c \} \Big) $$
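To make the estimator (3.2) concrete, the following sketch (ours, not from the book; the book's Weibull example appears in Section 3.2) applies i.i.d. scaling biasing to a sum of unit exponentials, where the exact tail is available through the gamma distribution for comparison:

```python
import math
import random

def is_tail_estimate(n, t, a, K=100000, seed=7):
    """Estimator (3.2) for P(S_n >= t), S_n the mean of n unit exponentials,
    with i.i.d. scaling biasing: each component is drawn as a*X_j, a > 1."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(K):
        x = [a * rng.expovariate(1.0) for _ in range(n)]   # draws from f*
        if sum(x) >= n * t:
            # W_n = prod f(x_j)/f*(x_j) = a^n exp(-(1 - 1/a) sum x_j)
            acc += a ** n * math.exp(-(1.0 - 1.0 / a) * sum(x))
    return acc / K

n, t = 5, 4.0
# Exact value: P(Gamma(n,1) >= nt), via the Poisson-sum identity.
exact = sum(math.exp(-n * t) * (n * t) ** k / math.factorial(k) for k in range(n))
est = is_tail_estimate(n, t, a=t)   # scaling the component mean up to t
print(exact, est)
```

A plain MC run of the same size would see the event only a couple of times on average here (the probability is near $1.7 \times 10^{-5}$), so the weighted estimator is dramatically more accurate.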
3.2 The g-method

This method improves the performance of any biasing technique when applied to i.i.d. sums. It makes more complete use of the component density $f$ (and distribution $F$). Note that in any IS implementation, knowledge of the underlying densities is required to mechanize the weighting function $W$. Also required is the distribution $F$, which, in general, is used to generate appropriate random variables for the simulation experiment. In this method, instead of estimating the expectation of the indicator function of an $n$-component sum, the expected value of a conditional probability over an $(n-1)$-dimensional vector $\mathbf{X}$ is estimated using any form of biasing. The tail probability can be expressed as

$$ p_t(n) = P\big( X_n \ge nt - (n-1) S_{n-1} \big) = E\{ g_t(S_{n-1}) \}, \qquad n = 2, 3, \ldots \qquad (3.5) $$

with

$$ g_t(x) \equiv 1 - F\big( nt - (n-1)x \big) $$

The MC estimator (without any IS), denoted by $\hat p_{t,g}$, of $p_t(n)$ is given by

$$ \hat p_{t,g}(n) = \frac{1}{K} \sum_{i=1}^{K} g_t(S_{n-1,i}), \qquad X_j \sim f;\; j = 1, \ldots, n-1;\; n = 2, \ldots \qquad (3.6) $$
where the subscript $i$ has been dropped for convenience. The estimator uses knowledge of $F$ explicitly. For $n = 1$, $\hat p_{t,g}(1) = 1 - F(t) = p_t(1)$. The variance of this estimator is given by

$$ \mathrm{var}\, \hat p_{t,g}(n) = \frac{1}{K} \big[ E\{ g_t^2(S_{n-1}) \} - p_t^2(n) \big] \qquad (3.7) $$

On the other hand, the conventional MC estimator, denoted by $\hat p_{t,c}$, is given by (3.2) with $W_n = 1$ and has variance

$$ \mathrm{var}\, \hat p_{t,c}(n) = \frac{1}{K} \big[ p_t(n) - p_t^2(n) \big] $$

Now, as $g_t$ is a probability, we have $g_t^2(x) \le g_t(x)$. Using this and (3.5) in (3.7) yields

$$ \mathrm{var}\, \hat p_{t,g}(n) \le \mathrm{var}\, \hat p_{t,c}(n) $$

Therefore the estimator $\hat p_{t,g}$ is superior to the usual MC estimator. Intuitively speaking, the reason for this improvement is that the indicator $1(\cdot)$ in (3.2) is replaced by the continuous function $g_t$ in (3.6), resulting in "smoother" estimates. This improvement carries over to the case where IS is applied to both estimators. The IS estimator of $p_t(n)$ using this formulation can be set up as

$$ \hat p_t(n) = \frac{1}{K} \sum_{i=1}^{K} g_t(S_{n-1,i})\, W_{n-1}(\mathbf{X}_i), \qquad X_j \sim f_* \qquad (3.8) $$

for some component biasing density $f_*$, with variance given by

$$ \mathrm{var}_*\, \hat p_t(n) = \frac{1}{K} \big[ E_*\{ g_t^2(S_{n-1})\, W_{n-1}^2(\mathbf{X}) \} - p_t^2(n) \big] \qquad (3.9) $$

In this estimator the biasing is carried out for $n - 1$ random variables. When the component density $f$ of the i.i.d. sum is one-sided, then in place of (3.5) we use

$$ p_t(n) = E\{ h_t(S_{n-1}) \} \qquad (3.10) $$

where

$$ h_t(x) = 1(x \ge T) + \bar 1(x \ge T)\, g_t(x) \qquad (3.11) $$

with $T = nt/(n-1)$ and $\bar 1(\cdot) = 1 - 1(\cdot)$. In (3.8) and (3.9), $g_t$ is then replaced by $h_t$. Following (1.12), the optimal biasing density that provides perfect estimation in (3.8) is given by

$$ f_{n-1*}^{\mathrm{opt}}(\mathbf{x}) = \frac{1}{p_t(n)}\, g_t(s_{n-1})\, f_{n-1}(\mathbf{x}) \qquad (3.12) $$

which again requires knowledge of $p_t(n)$.
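The variance ordering $\mathrm{var}\, \hat p_{t,g} \le \mathrm{var}\, \hat p_{t,c}$ established above is easy to observe empirically. The sketch below is ours, not from the book; it compares the two plain-MC estimators, (3.2) with $W_n = 1$ and (3.6), for a sum of unit exponentials:

```python
import math
import random

def estimators(n, t, K=100000, seed=3):
    """Return (mean, variance) pairs for the indicator estimator and for the
    g-method estimator g_t(S_{n-1}) = 1 - F(nt - (n-1) S_{n-1})."""
    rng = random.Random(seed)
    F = lambda y: 1.0 - math.exp(-y) if y > 0 else 0.0   # unit-exponential CDF
    ind, g = [], []
    for _ in range(K):
        x = [rng.expovariate(1.0) for _ in range(n)]
        ind.append(1.0 if sum(x) >= n * t else 0.0)
        s = sum(x[: n - 1]) / (n - 1)                    # S_{n-1}
        g.append(1.0 - F(n * t - (n - 1) * s))           # g_t(S_{n-1})
    def stats(v):
        m = sum(v) / len(v)
        return m, sum((z - m) ** 2 for z in v) / len(v)
    return stats(ind), stats(g)

(m_ind, v_ind), (m_g, v_g) = estimators(n=4, t=2.0)
print(m_ind, m_g)    # both are unbiased for P(S_4 >= 2)
print(v_ind, v_g)    # the g-method's per-sample variance is smaller
```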
3.2.1 Advantage over usual IS

For the same biasing scheme applied to the estimators in (3.2) and (3.8) and optimized separately, the latter will result in lower variance and therefore in higher simulation gain. This can be established as follows. Suppose that the usual estimator has been optimized by appropriate choice of an i.i.d. biasing density; that is, this choice minimizes the variance in (3.3). Then apply the same biasing density to the proposed estimator. Clearly, this choice will not minimize the variance in (3.9). Let $I_{\min}$ and $I_g$ denote respectively the expectations on the right hand sides of (3.3) and (3.9), and let $W(x)$ denote the individual weighting function that comprises $W_n(\mathbf{x})$. We can express

$$ I_{\min}(n) = E\{ 1(S_n \ge t)\, W_n(\mathbf{X}) \} = E\{ 1(S_n \ge t)\, W(X_n)\, W_{n-1}(\mathbf{X}) \} = E\big\{ E\{ 1(X_n \ge Y)\, W(X_n) \mid Y \}\, W_{n-1}(\mathbf{X}) \big\} \qquad (3.13) $$

where $Y = nt - \sum_1^{n-1} X_j$ and the outer expectation is over $f_{n-1}$. Then

$$ I_{\min}(n) - I_g(n) = E\big\{ \big[ E\{ 1(X_n \ge Y)\, W(X_n) \mid Y \} - g_t^2(S_{n-1}) \big]\, W_{n-1}(\mathbf{X}) \big\} \qquad (3.14) $$

It then suffices to show that the quantity in brackets is greater than zero for all $Y$. By definition, the second term inside the brackets is

$$ P^2(X_n \ge Y \mid Y) = E_*^2\{ 1(X_n \ge Y)\, W(X_n) \mid Y \} $$

Applying the Cauchy-Schwarz inequality to this yields

$$ E_*^2\{ 1(X_n \ge Y)\, W(X_n) \mid Y \} \le E_*\{ 1(X_n \ge Y) \mid Y \}\; E_*\{ 1(X_n \ge Y)\, W^2(X_n) \mid Y \} = E_*\{ 1(X_n \ge Y) \mid Y \}\; E\{ 1(X_n \ge Y)\, W(X_n) \mid Y \} < E\{ 1(X_n \ge Y)\, W(X_n) \mid Y \} \qquad (3.15) $$

since $E_*\{ 1(X_n \ge Y) \mid Y \} = P_*(X_n \ge Y \mid Y) < 1$. In (3.15), equality holds in the first inequality if and only if $W(X_n) = 1$ (which is zero IS), but the second inequality continues to hold. It follows from (3.14) that $I_{\min}(n) - I_g(n) > 0$. Therefore the variance of the g-method is smaller than that of the usual optimized IS procedure for any biasing scheme. As we shall see later, the method is suitable when $n$ is small.

The estimator (3.8) is optimal in another sense. Suppose we consider the class of estimators

$$ \hat p_h(n) = \frac{1}{K} \sum_{i=1}^{K} h(\mathbf{X}_i)\, W_{n-1}(\mathbf{X}_i) $$

and seek the member of minimum variance, subject to the unbiasedness constraint $E\{ h(\mathbf{X}) \} = p_t(n)$, for a given biasing scheme. A simple exercise in the calculus of variations shows that the optimal $h$ is given by $h(\mathbf{x}) = p_t(n)/W_{n-1}(\mathbf{x})$, and this requires knowledge of $p_t(n)$. In fact, the only function that we can lay hands on that has the unbiasedness property and does not need $p_t(n)$ for its specification is $g_t(S_{n-1})$. Of course, there are $n-2$ other such functions obtained through convolutions, but these take us away from simulation to numerical solutions. From an implementation point of view there is another advantage of this method. It may be noted that in the conventional estimator of (3.2), a suitable starting value of an IS parameter, such as a scaling factor, has to be chosen for the IS algorithm. This is to avoid the original simulation difficulty that $1(S_n \ge t)$, $X_j \sim f$, is zero most of the time for rare events. This difficulty is obviously not present in the g-method, because any initial choice of parameter will result in a nonzero IS estimate. The gain improvement provided by the g-method is illustrated in the following examples.

Example 3.1. Scaling for Weibull i.i.d. sums. We consider tail probability estimation for an i.i.d. Weibull sum using scaling. The density function of the sum is not known in closed form. Exponential twisting can be used with the g-method, and it can be shown that the variance is convex in the biasing parameter. Generation of exponentially twisted Weibull random variables, however, requires the solution of a differential equation
for each variate, Ulrich & Watson [87], or the use of an acceptance-rejection procedure, Nakayama [43]. For simplicity we use scaling. With $f(x)$ as in Example 2.1 on page 11 and choosing the component biasing density $f_*$ as that of $aX_j$, $a > 1$, we obtain

$$ W_{n-1}(\mathbf{x}) = a^{(n-1)b} \exp\Big\{ (1 - a^b) \sum_{j=1}^{n-1} \big( x_j/(a\mu) \big)^b \Big\} $$

Also

$$ g_t(x) = e^{-((nt-(n-1)x)/\mu)^b} $$

Using these in (3.10) and (3.11) yields the estimator, which follows by substituting into (3.8) with $g_t$ replaced by $h_t$.
Optimizing this estimator requires selection of a good value of the scaling parameter $a$ that minimizes an estimate of its variance. This can be achieved as follows. Denote by $I_n(a)$ the expectation on the right hand side of (3.9), with $g$ replaced by $h$:

$$ I_n(a) = E\{ h_t^2(S_{n-1})\, W_{n-1}(\mathbf{X}) \} = E_*\{ h_t^2(S_{n-1})\, W_{n-1}^2(\mathbf{X}) \} $$

Then

$$ I_n'(a) = E\{ h_t^2(S_{n-1})\, W_{n-1}'(\mathbf{X}) \} = E_*\{ h_t^2(S_{n-1})\, W_{n-1}(\mathbf{X})\, W_{n-1}'(\mathbf{X}) \} $$

and similarly for the second derivative $I_n''(a)$. Simulation estimators $\hat I_n(a)$, $\hat I_n'(a)$, and $\hat I_n''(a)$ of these quantities can then easily be set up. It can be shown that $I_n$ is convex in the scaling parameter $a$. An estimate of the optimum $a$ that minimizes $I_n(a)$ can be obtained using the recursion

$$ a_{m+1} = a_m - \delta\, \frac{\hat I_n'(a_m)}{\hat I_n''(a_m)}, \qquad m = 1, 2, \ldots \qquad (3.16) $$
with index $m$. The parameter $\delta$ controls convergence. Implementation results are shown in Figures 3.1 and 3.2. Here $n = 10$ and the Weibull parameters are $\mu = 1$ and $b = 1.7$. The threshold is $t = 2.0615$, which provides a tail probability close to $10^{-8}$; the $t$ value was found using the inverse IS technique described later in Section 3.3. To implement the a-algorithm for the conventional estimator in (3.2), a reasonable initial choice for $a_1$ is needed. This has been obtained by overbounding $I_n$ and selecting $a_1$ to minimize this bound. As shown in Appendix C, the result is $a_1 = t/\mu$ for $b \ge 1$ and $a_1 = n^{(b-1)/b}\, t/\mu$ for $b \le 1$. Convergence of the scaling factor is shown in Figure 3.1 and is slightly different from that of the conventional estimator. The IS gain using the g-method is shown in Figure 3.2. Gains have been estimated using the estimator

$$ \hat\Gamma_n = \frac{\hat p_t(n) - \hat p_t^2(n)}{\hat I_n(a) - \hat p_t^2(n)} $$

Results show a gain of about $10^7$, indicating a sample size requirement approximately five times smaller than with the usual IS technique. These results are based on an IS sample size of $K = 10{,}000$. •
Fig. 3.1. Scaling factor and threshold convergences (inverse IS) for Weibull sums.
Fig. 3.2. Comparison of IS gains for Weibull sum (n = 10) using scaling.
Example 3.2. Translation for i.i.d. sums. The second example considered uses translation biasing for the component density
$$ f(x) = e^{-x - e^{-x}}, \qquad -\infty < x < \infty $$

which is an extremal distribution, Billingsley [5], with $m = 0.5772$ and $v^2 = 1.6449$. We omit the details, which are straightforward. The simulation gains for the translation case are shown in Figure 3.3. Translation is usually
Fig. 3.3. Comparison of IS gains for an extremal density sum.
considered to be an effective biasing technique. However, the present choice of density function reveals quite poor performance, especially as $n$ increases. A value of $n = 2$ has been chosen, with $t = 8.08$ for a tail probability close to $10^{-6}$. Nevertheless, the g-method does provide improvement over usual IS. •
3.3 The inverse IS problem

The inverse IS problem (Srinivasan [70]) is one of estimating $t$ for a given $p_t$ using fast simulation. The IS method proposed and discussed in Section 3.2 is
well suited to solve this important problem. Consider the stochastic objective function

$$ J(t) = \big( \hat p_t - \alpha_0 \big)^2 $$

where $\alpha_0$ denotes the value of a desired tail probability $p_t$. We wish to find $\hat t = t_0$ such that $p_{t_0} = \alpha_0$. By minimizing $J(t)$ via simulation, what can be determined is $\hat t = t_0$ such that $J(t_0) = 0$. Differentiating $J(t)$ twice yields

$$ J''(t) = 2 (p_t - \alpha_0)\, \frac{d^2 p_t}{dt^2} + 2 \left( \frac{dp_t}{dt} \right)^2 \qquad (3.17) $$

from which it follows that $J(t)$ achieves a minimum at $t = t_0$ and is convex in the neighbourhood.
from which it follows that J(t) achieves a minimum at t = to and is convex in the neighbourhood. The minimization of J(t) can be performed using a descent method. Gradient descent requires adaptive changes of the step size during the course of the algorithm. From (3.8), the gradient of Pt is given by
p~(n) =
K
- ;
Lf(nt - (n -l)Sn-l) Wn-1(X),
Xj
row
f*'
(3.18)
i=l
assuming differentiability of the distribution F. For one-sided densities an additional factor of L·(Sn-l) is included in the summand. An estimate of to can then be obtained using the recursion
tm+l=tm+Ot
0: 0
-
fit", (n)
Ptm(n)
,
m=1,2, ...
(3.19)
The learning rate parameter Ot can be used to adaptively control the initial learning speed and the noisiness of the converged estimate. Substituting a known f in (3.18) and using a chosen biasing scheme, the t-algorithm of (3.19) 'and the a-algorithm of (3.16) can be implemented simultaneously to obtain an accurate solution of the inverse IS problem. Example 3.3. Adaptive determination oft The results of an implementation of the t-algorithm for the Weibull sum of Example 3.1 is displayed in Figure 3.1. This shows convergence to the t value • required for Pt = 10-8 with n = 10.
3.3.1 System parameter optimization

This procedure of constructing an objective function for extremization can be extended to situations where certain operating parameters of a system need to be optimized to achieve desired performance probabilities. Suppose $G = G(\eta)$ represents the performance of a system with an operating parameter $\eta$, considered here as a scalar for convenience. It is desired to estimate $G$ as well as to optimize the parameter $\eta$ to obtain a specified performance level $G_0 = G(\eta_0)$. This is a frequently occurring requirement in the design of several engineering systems, which can be accomplished using adaptive IS. Assume that the performance can be represented as the expectation $G(\eta) = E\{ g(\mathbf{X}, \eta) \}$ and estimated, as in (2.26), by

$$ \hat G(\eta; \theta) = \frac{1}{K} \sum_{i=1}^{K} g(\mathbf{X}_i, \eta)\, W(\mathbf{X}_i, \theta), \qquad \mathbf{X}_i \sim f_{n*} $$

Now consider the stochastic objective function

$$ J(\eta) = \big( \hat G(\eta) - G_0 \big)^2 $$

If there is only one value of the system parameter $\eta_0$ which provides a performance equal to $G_0$, then $J$ will have a unique minimum at $\eta_0$. Then $\eta_0$ can be estimated by the stochastic Newton recursion

$$ \eta_{m+1} = \eta_m + \delta_\eta\, \frac{G_0 - \hat G(\eta_m)}{\hat G'(\eta_m)}, \qquad m = 1, \ldots \qquad (3.20) $$

where

$$ \hat G'(\eta) = \frac{1}{K} \sum_{i=1}^{K} \frac{\partial g(\mathbf{X}_i, \eta)}{\partial \eta}\, W(\mathbf{X}_i, \theta) $$

assuming that $g(\mathbf{x}, \eta)$ is differentiable in $\eta$. The parameter optimization algorithm of (3.20) is implemented simultaneously with the adaptive biasing algorithm of (2.32). If $g(\mathbf{x}, \eta)$ is not differentiable and is, for example, an indicator function, then it can be approximated by a smooth function which can be used to estimate the gradient required in the recursion. This method of parameter optimization has been used successfully in several applications and can be a powerful design technique, as described in later chapters.
3.4 Approximations for tail probability

In this section we develop well known expressions for the tail probability $p_t(n)$ when $n$ is large. Such approximations are often useful for estimating $p_t$ in applications involving sums of a large number of random variables. For a fixed $t$, $p_t(n)$ approaches zero as $n \to \infty$. In cases where $p_t(n)$ decreases exponentially fast, the rate of this convergence is governed primarily by the large deviations rate function $I(t)$ defined in Section 2.1.3, and is a subject of study of large deviations theory. The large deviations approach makes essential use of the concept of exponentially twisted densities. Combined with asymptotic normality as dictated by the central limit theorem (CLT), this leads to 'large n' approximations for the tail probability $p_t(n)$, Van Trees [88], Kennedy [36], Bucklew [11]. Approximations are also given for the g-representation for $p_t$ of (3.5). We follow closely the development of [11]. The exponentially twisted density can be written as

$$ f_s(x) = \frac{e^{sx} f(x)}{M(s)} \qquad (3.21) $$

from (2.19) on page 21. This has moment generating function

$$ E_s\{ \exp(\theta X) \} = \frac{M(\theta + s)}{M(s)} $$

with mean

$$ \mu'(s) = \frac{M'(s)}{M(s)} $$

and variance

$$ \mu''(s) = \frac{M''(s)}{M(s)} - \left( \frac{M'(s)}{M(s)} \right)^2 $$

where $\mu(s) = \log M(s)$ denotes the log moment generating function. Then the moment generating function of a sum of $n$ random variables, each with density $f_s$, is just $(M(\theta+s)/M(s))^n$. The inverse Laplace transform of this with respect to $\theta$ yields the density of the sum as

$$ f_{V_n}(x) = M^n(s)\, e^{-sx}\, f_{V_n}^{(s)}(x) \qquad (3.22) $$

where $f_{V_n}$ denotes the density of the sum $V_n = n S_n = \sum_1^n X_j$ of $n$ variables, each with density $f$ having mean $m$, and $f_{V_n}^{(s)}$ denotes the corresponding density when each variable has the twisted density $f_s$. Using (3.22) in (3.1) yields, for the tail probability,
$$ p_t(n) = P(V_n \ge nt) = M^n(s) \int_{nt}^{\infty} e^{-sx}\, f_{V_n}^{(s)}(x)\, dx \qquad (3.23) $$

Setting $s = s_t$ as in Section 2.1.3, and making the change of variable $y = x - nt$ in the above, gives

$$ p_t(n) = e^{-nI(t)} \int_{0}^{\infty} e^{-s_t y}\, v_{V_n}^{(s_t)}(y)\, dy \qquad (3.24) $$

using the definition of $I(t)$ on page 23. The density $v_{V_n}^{(s_t)}(y) = f_{V_n}^{(s_t)}(y + nt)$
using the definition of I(t) on page 23. The density v~~)(y) = I;~)(y+nt)
has mean zero and variance nf-L"(st). Note that I;~)(x) is the density of a sum of n random variables each with mean t and variance f-L" (St). The density
Sums of Random Variables
59
vi~)(Y) can then be normalized by z = y/';np,"(st) to represent a unity variance and zero mean sum of n random variables. Denoting the density of this sum by d~), (3.24) becomes pt(n) = e-nI(t)
1
00
e- 8 ,vn/L"(s,)z ~}~)(z) dz
(3.25)
As n becomes large the distribution of the normalized sum approaches N (0, 1) by the CLT. Therefore (3.25) can be approximated by
e-nI(t)
,j'j1r
1
00
e- 8 ,..Jn/L"(s,) z e- z2 /2 dz
0
';27rnp," (St)
(3.26) St
for increasing n and the last approximation is valid for t > m. Note that 1 n
-I(t) - -log( J27rnp,"(sd St) -I(t)
as n --+ 00. That is, the tail probability has an exponential rate of decrease to zero. A rigorous proof of this using large deviations theory is given in [11].
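For unit-exponential summands every quantity in (3.26) is explicit ($M(s) = 1/(1-s)$, $s_t = 1 - 1/t$, $\mu''(s_t) = t^2$, $I(t) = t - 1 - \log t$), so the approximation can be compared with the exact gamma tail. The check below is ours, not from the book:

```python
import math

def approx_tail(n, t):
    # Large-n approximation (3.26) for unit-exponential summands:
    # sqrt(2*pi*n*mu''(s_t)) * s_t = sqrt(2*pi*n) * t * (1 - 1/t).
    I = t - 1.0 - math.log(t)
    return math.exp(-n * I) / (math.sqrt(2 * math.pi * n) * (t - 1.0))

def exact_tail(n, t):
    # P(S_n >= t) = P(Gamma(n,1) >= nt), via the Poisson-sum identity.
    lam = n * t
    term, total = math.exp(-lam), 0.0
    for k in range(n):
        total += term
        term *= lam / (k + 1)
    return total

for n in (10, 50, 200):
    e, a = exact_tail(n, 2.0), approx_tail(n, 2.0)
    print(n, e, a, a / e)   # the ratio a/e approaches 1 as n grows
```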
3.4.1 The g-representation

The tail probability $p_t(n)$ from (3.5) can be expressed as

$$ p_t(n) = E\{ g_t(S_{n-1}) \} = \int_{-\infty}^{\infty} g_t\Big( \frac{x}{n-1} \Big)\, f_{V_{n-1}}(x)\, dx = M^{n-1}(s) \int_{-\infty}^{\infty} g_t\Big( \frac{x}{n-1} \Big)\, e^{-sx}\, f_{V_{n-1}}^{(s)}(x)\, dx \qquad (3.27) $$

the last line by an application of (3.22). Then, following steps completely parallel to those leading to (3.26) yields the representation

$$ p_t(n) = \frac{e^{-(n-1)I(t)}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} g_t\Big( t + \big( \mu''(s_t)/(n-1) \big)^{1/2} z \Big)\, e^{-s_t \sqrt{(n-1)\mu''(s_t)}\, z - z^2/2}\, dz = \frac{e^{-nI(t)}}{\sqrt{2\pi (n-1) \mu''(s_t)}\; M(s_t)} \int_{-\infty}^{\infty} [1 - F(x)]\, e^{s_t x - (t-x)^2/2(n-1)\mu''(s_t)}\, dx \qquad (3.28) $$

with the change of variable $x = t - \sqrt{(n-1)\mu''(s_t)}\, z$ and using the definition of $g_t$. This holds for two-sided densities and is valid for all $t$ in the support of $f$. From the first line of (3.28) we have, as $t \to m$ (so that $s_t \to 0$ and $\mu''(s_t) \to v^2$), that

$$ p_m(n) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \big[ 1 - F\big( m - v\sqrt{n-1}\, z \big) \big]\, e^{-z^2/2}\, dz \to \frac{1}{2} $$

as $n \to \infty$. The limit in the second step follows from dominated (or bounded) convergence, noting that the sequence of functions $1 - F(m - v\sqrt{n-1}\, z)$ converges to $1(z > 0)$, the indicator for $z > 0$. In a similar manner, using (3.10), it can be shown that for one-sided densities the representation is given by
$$ p_t(n) = \frac{e^{-nI(t)}}{\sqrt{2\pi n \mu''(s_t)}\; M(s_t)} \left[ e^{-t^2/2\mu''(s_t)n} + \int_{0}^{nt} [1 - F(x)]\, e^{s_t x}\, e^{-(t-x)^2/2\mu''(s_t)n}\, dx \right] \qquad (3.29) $$
These expressions can be evaluated only if the form of $f$ is specified. Nevertheless, one would expect that for sufficiently large $n$ the approximations will coincide asymptotically with (3.26). This is indeed true and can be verified in some simple cases. As $n \to \infty$, the asymptotic form of the tail probability in (3.28) can be approximated as

$$ p_t(n) \approx \frac{e^{-nI(t)}}{\sqrt{2\pi n \mu''(s_t)}\; M(s_t)} \int_{-\infty}^{\infty} [1 - F(x)]\, e^{s_t x}\, dx \qquad (3.30) $$

for $t > m$. It is easily shown that the integral above exists. Rewriting it as

$$ \int_{0}^{\infty} P(X \ge -x)\, e^{-s_t x}\, dx + \int_{0}^{\infty} P(X \ge x)\, e^{s_t x}\, dx \qquad (3.31) $$
the first integral is bounded above by $1/s_t$. For the second, a Chernoff bound argument gives $P(X \ge x) \le M(s_1) \exp(-s_1 x)$ for some $s_1 > 0$; this integral is then bounded by $M(s_1)/(s_1 - s_t) < \infty$ if we choose $s_1 > s_t$. Integrating (3.31) by parts yields

$$ \frac{M(s_t)}{s_t} + \lim_{x\to\infty} \frac{1}{s_t}\, [1 - F(x)]\, e^{s_t x} \qquad (3.32) $$

To evaluate this limit we note that, since $M(s_t) = \int f(x)\, e^{s_t x}\, dx$ is assumed finite, the integrand $f(x)\, e^{s_t x} \to 0$ as $x \to \infty$. From L'Hopital's rule the limit in (3.32) is then zero. Therefore the integral in (3.30) is $M(s_t)/s_t$ and the g-representation reduces, in the asymptotic case, to the standard form in (3.26) for $t > m$.
3.5 Asymptotic IS

In Section 3.2.1 we showed that the g-method provides enhanced simulation performance compared to the usual IS procedure for any biasing scheme when $n \ge 2$. Formally, when $n = 1$ the gain advantage is infinite. We shall demonstrate that as $n$ becomes large this advantage decreases and approaches a limiting value. That is, the g-method is ideally suited for sums involving a small number of terms. Specifically, we consider exponential twisting as the biasing method. There are two reasons for this. As mentioned earlier, twisting is known to be asymptotically optimal in the sense of minimizing the rate of decrease of the IS variance as $n \to \infty$; a proof of this is given in [11]. Secondly, because of the forms of the twisted density and the corresponding weighting function, several interesting results can be obtained analytically. These results throw some light on the asymptotic performance of other biasing techniques. Consider estimating $p_t(n)$ using the conventional IS estimator in (3.2) with the biasing density $f_{s_t}$. The weighting function is given by

$$ W_n(\mathbf{x}) = M^n(s_t)\, e^{-s_t \sum_1^n x_j} \qquad (3.33) $$

Using this in the expectation, denoted $I_c$, in the variance expression (3.3) yields

$$ I_c(s_t) \approx \frac{e^{-2nI(t)}}{2 \sqrt{2\pi n \mu''(s_t)}\; s_t} \qquad (3.34) $$
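For unit-exponential summands the twisted density $f_{s_t}$ is again exponential (with rate $1 - s_t$), which makes (3.33) and (3.34) easy to exercise numerically. The sketch below is ours, not from the book:

```python
import math
import random

def twisted_is(n, t, K=50000, seed=11):
    """Estimate p_t(n) and I_c(s_t) for unit-exponential summands under
    exponential twisting: s_t = 1 - 1/t tilts the component mean to t."""
    st = 1.0 - 1.0 / t
    logM = -math.log(1.0 - st)      # mu(s_t) = log M(s_t) = log t
    rng = random.Random(seed)
    w1 = w2 = 0.0
    for _ in range(K):
        v = sum(rng.expovariate(1.0 - st) for _ in range(n))  # V_n under f_{s_t}
        if v >= n * t:
            w = math.exp(n * logM - st * v)   # weighting function (3.33)
            w1 += w
            w2 += w * w
    return w1 / K, w2 / K    # estimates of p_t(n) and of I_c(s_t)

n, t = 20, 2.0
p_hat, Ic_hat = twisted_is(n, t)
I = t - 1.0 - math.log(t)    # rate function I(t) for the exponential
Ic_approx = math.exp(-2 * n * I) / (2 * math.sqrt(2 * math.pi * n) * t * (1 - 1 / t))
print(p_hat, Ic_hat, Ic_approx)   # Ic_hat is close to the asymptotic form (3.34)
```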
for large $n$. The derivation of this is almost identical to that leading to (3.26) and is omitted. As $n \to \infty$ the IS variance approaches zero exponentially fast. From (3.3), (3.26), and (3.34) we have, for large $n$, that

$$ K\, \mathrm{var}_*\, \hat p_t(n) = I_c(s_t) - p_t^2(n) \approx e^{-2nI(t)} \left[ \frac{1}{2\sqrt{2\pi n \mu''(s_t)}\; s_t} - \frac{1}{2\pi n \mu''(s_t)\; s_t^2} \right] $$

Then

$$ \frac{1}{n} \log \big( K\, \mathrm{var}_*\, \hat p_t(n) \big) \approx -2I(t) - \frac{\log n}{2n} \to -2I(t) $$

as $n \to \infty$. The corresponding asymptotic gain is $\Gamma = \mathrm{var}\, \hat p_{t,c} / \mathrm{var}_*\, \hat p_t \approx p_t(n)/I_c(s_t) \approx 2\, e^{nI(t)}$. Therefore, for biasing by exponential twisting, the IS variance decreases asymptotically at the same rate as $p_t^2(n)$, which is $2I(t)$. From (3.3) note that

$$ I_c(s_t) = p_t^2(n) + K\, \mathrm{var}_*\, \hat p_t(n) $$

and this implies that $I_c(s_t)$ cannot decrease to zero at a rate faster than $p_t^2(n)$. However, biasing densities may exist for which the IS variance, being the difference of two terms that decrease at the same rate, approaches zero faster than $2I(t)$.

3.5.1 The variance rate
From the above we conclude that all biasing densities will satisfy

lim_{n→∞} (1/n) log I_c ≥ −2I(t)    (3.35)
Any biasing distribution which satisfies (3.35) with equality is known as an efficient simulation distribution. We derive now a result for the rate of IS estimators for an arbitrary biasing density f*. This is a somewhat simplified version of a theorem obtained in the more general setting of Markov chains in Bucklew, Ney & Sadowsky [13]. From (3.3) we have

I_c = E*{1(S_n ≥ t) W_n²(X)}
    ≤ E*{ e^{θ(Σ_{i=1}^n X_i − nt)} ∏_{i=1}^n f²(X_i)/f*²(X_i) },  θ > 0
    = e^{−nθt} ( ∫ e^{θx} f²(x)/f*(x) dx )^n

Define the density

f_θ(x) = e^{θx} f²(x) / (Ξ(θ) f*(x))    (3.36)

where

Ξ(θ) = ∫ e^{θx} f²(x)/f*(x) dx    (3.37)

Then we have I_c ≤ e^{−nθt} Ξ^n(θ), and (1/n) log I_c ≤ −θt + log Ξ(θ), whence

lim_{n→∞} (1/n) log I_c ≤ min_{θ>0} [ −θt + log Ξ(θ) ] = −θ_t t + log Ξ(θ_t)

where θ_t is the solution of Ξ'(θ)/Ξ(θ) = t.
We now derive a lower bound. For ε > 0 we have

I_c ≥ ∫⋯∫ 1(t ≤ S_n < t + ε) ∏_{i=1}^n [f²(x_i)/f*(x_i)] dx
    = Ξ^n(θ) ∫⋯∫ 1(t ≤ S_n < t + ε) e^{−θ Σ x_i} ∏_{i=1}^n f_θ(x_i) dx
    ≥ e^{−nθ(t+ε)} Ξ^n(θ) E_θ{1(t ≤ S_n < t + ε)}

With θ = θ_t the density f_θ has mean t, so that the probability E_θ{1(t ≤ S_n < t + ε)} stays bounded away from zero, and

lim_{n→∞} (1/n) log I_c ≥ −θ_t (t + ε) + log Ξ(θ_t)

Since ε > 0 is arbitrary, this coincides with the upper bound.

Consider now estimation at a fixed probability level, which requires the threshold to approach the mean as n grows. Let b > 0. Then, with the threshold

t_n = m + bv/√n

we have p_{t_n}(n) ≈ Q(b)
for sufficiently large b. Now consider the application of exponential twisting, with s_n ≡ s_{t_n} = (t_n − m)/v² denoting the twisting parameter in the Gaussian case. Then the IS variance is a constant, independent of n.
That is, with t decreasing as n^{−1/2} toward the mean m, the IS variance for exponential twisting is constant, and this holds for all n in the Gaussian case. It turns out that this is also the correct asymptotic rate of decrease of the threshold for any other density f to achieve constant IS variance. To demonstrate the result we need power series expansions for the quantities in (3.34) or (3.39) that depend on t. We begin with s_n. As the log moment generating function μ(s) (defined on page 21) is convex, μ'(s_n) = t_n is a nondecreasing function. Since μ'(0) = m, it follows that s_n → 0 as t_n → m. Then we can define a nondecreasing function φ such that s_n = φ(t_n), which has the Taylor expansion
s_n = φ'(m)(t_n − m) + φ''(m)(t_n − m)²/2! + ⋯

around t_n = m. We have φ'(m) = 1/μ''(0) = 1/v² and

φ''(m) = −μ'''(0)/μ''(0)³ = −(M'''(0) − 3mv² − m³)/v⁶

Therefore

s_n = b/(v√n) + c₃ b²v²/(2n) + O(1/n^{3/2})    (3.44)

where c₃ ≡ φ''(m). Further, we can expand μ''(s_n) around s_n = 0 as

μ''(s_n) = v² + μ'''(0) b/(v√n) + O(1/n)    (3.45)
by the use of (3.44). Lastly, we can expand μ(s_n) as

μ(s_n) = m s_n + (v²/2) s_n² + O(s_n³) = bm/(v√n) + b²(1 + c₃v²m)/(2n) + O(1/n^{3/2})    (3.46)

Using the expansions (3.44)–(3.46) in (3.34) we obtain

log I_c(s_n) = −2n(s_n t_n − μ(s_n)) − log(2 √(2πn μ''(s_n)) s_n) = −b² − log b + O(1/n^{1/2})

up to an additive constant.
Therefore I_c(s_n) converges to a constant. A similar exercise using (3.26) shows that p_{t_n}(n) converges to the appropriate constant. For the g-method we note, using the Taylor expansions above, that the first exponent in the integrand of (3.39) decreases as n^{−1/2} while the second exponent decreases as n^{−1} for each x. Hence (3.42) is valid and we obtain

log I_g(s_n) = −2n(s_n t_n − μ(s_n)) + log [ (1/√n) ∫_{−∞}^{∞} [1 − F(x)]² e^{2 s_n x} dx ]
To deal with the integral above, we express it as i₁ + i₂ and perform integration by parts to get

i₁ ≡ ∫_0^∞ [1 − F(−x)]² e^{−2 s_n x} dx = [1 − F(0)]²/(2 s_n) + (1/s_n) ∫_0^∞ [1 − F(−x)] f(−x) e^{−2 s_n x} dx

Then

s_n i₁ → (1/2)[1 − F(0)]² + ∫_0^{F(0)} (1 − F) dF = 1/2
as n → ∞. It follows from (3.44) that i₁ → v√n/(2b). Employing the bounding argument following (3.30) for the integral i₂, we have

i₂ ≤ M²(s₁)/(2ε)

where s₁ − s_n = ε > 0. Hence i₂/√n → 0 and (i₁ + i₂)/√n → v/(2b). Using this in the expression involving I_g(s_n) above, we obtain that log I_g(s_n) = −b² − log b + O(1/n^{1/2})
also. From this and (3.43) we conclude that the gain advantage of the g-method becomes unity in the asymptotic situation when p_t(n) is held fixed. The asymptotic simulation gain, denoted Γ_∞, is then given by

Fig. 3.6. Asymptotic gain (n → ∞) at fixed probability (plotted as log₁₀ Γ_∞ versus −log₁₀ p_t).
(3.47) and shown in Figure 3.6. The value of b determines the probability p_t(n). It may be noted that the gain expression is valid for small probabilities (or large b), as it uses the approximation for the Q-function.
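The fixed-probability regime above is easy to exercise numerically. The sketch below (Python; a minimal illustration of our own for unit-variance Gaussian inputs, not an algorithm from the text) estimates p_{t_n}(n) by exponential twisting, i.e. sampling from the shifted density N(t_n, v²) and weighting by (3.33); with t_n = m + bv/√n the estimate stays near Q(b) for every n:

```python
import math, random

def q(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def is_tail_prob(n, b, m=0.0, v=1.0, K=20000, seed=1):
    """Estimate P(mean of n i.i.d. N(m, v^2) samples >= t_n) with
    t_n = m + b*v/sqrt(n), by exponential twisting: sample from the
    twisted density N(t_n, v^2) and weight by W_n as in (3.33)."""
    rng = random.Random(seed)
    t = m + b * v / math.sqrt(n)
    s = (t - m) / (v * v)               # twisting parameter s_n
    mu_s = m * s + 0.5 * (v * s) ** 2   # log MGF of N(m, v^2) at s
    total = 0.0
    for _ in range(K):
        xs = [rng.gauss(t, v) for _ in range(n)]
        if sum(xs) / n >= t:
            total += math.exp(n * mu_s - s * sum(xs))   # weight W_n(X)
    return total / K

# with t_n -> m like 1/sqrt(n), the probability stays near Q(b) for all n
for n in (1, 5, 50):
    print(n, is_tail_prob(n, b=3.0), q(3.0))
```

Since the twisted density is centred at t_n, roughly half the samples cross the threshold, which is what keeps the relative error essentially independent of n.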
The weighting function. In constant probability estimation, the weighting function used in the IS algorithm converges to lognormality. From (3.33) we have, with an appropriate change of notation,

W_n(X) = M^n(s_n) e^{−s_n Σ_{j=1}^n X_j}    (3.48)

where X_j ∼ f_{s_n}. Then E*{log W_n} → −b²/2 and var*{log W_n} → b², by the use of (3.44)–(3.46). Expanding (3.48) for large n yields

log W_n = −b²/2 − (b/(v√n)) (Σ_j X_j − n t_n) + (c₃ b²v²/2)(m − S_n) + O(1/√n)

so that log W_n is asymptotically N(−b²/2, b²) under the twisted density: the weighting is lognormal. •
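The lognormal limit can be checked directly: sampling log W_n under the twisted density should give a mean near −b²/2 and a variance near b². A small Python sketch (our own illustration, for N(0,1) inputs):

```python
import math, random

def log_weight_samples(n, b, K=40000, seed=2):
    """Draw log W_n = n*mu(s_n) - s_n*sum(X_j) with X_j sampled from the
    twisted density, for N(0,1) inputs and threshold t_n = b/sqrt(n)."""
    rng = random.Random(seed)
    s = b / math.sqrt(n)        # s_n = (t_n - m)/v^2 with m = 0, v = 1
    mu_s = 0.5 * s * s          # log MGF of N(0,1) at s
    out = []
    for _ in range(K):
        tot = sum(rng.gauss(s, 1.0) for _ in range(n))  # twisted mean is s_n
        out.append(n * mu_s - s * tot)
    return out

ws = log_weight_samples(n=30, b=2.0)
mean = sum(ws) / len(ws)
var = sum((w - mean) ** 2 for w in ws) / len(ws)
print(mean, var)    # close to -b^2/2 = -2 and b^2 = 4
```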
3.6 Density estimation for sums

An interesting consequence of the inverse formulation with the proposed IS method is that −p̂'_t(n) in (3.18) is an unbiased estimator of the density function of the i.i.d. sum S_n. This is however not the best IS estimate of the density, since it uses a biasing scheme f* optimized to minimize the variance in (3.9) of the estimator p̂_t(n). Clearly, the same optimization may not minimize var* p̂'_t(n), as discussed in Section 2.2.2. Estimating the density f_{S_n}(t) in this setting is a means of performing point-by-point n-fold convolution using simulation by IS, Srinivasan [71]. From (3.18) we can mechanize an IS convolver as
f̂_{S_n}(t) = (1/K) Σ_{k=1}^{K} n f(nt − (n−1)S_{n−1}) W_{n−1}(X);   X_j ∼ f*    (3.49)
For one-sided random variables a factor of 1_T(S_{n−1}) is included, with T defined after (3.11). It is easily verified that the estimate is unbiased, with variance (3.50). The optimal biasing density providing perfect estimation is given by (3.51), where f_{n−1}(x) = Π_1^{n−1} f(x_i). Note the obvious connection between this and the optimal density (3.12) for probability estimation. The optimal density here requires knowledge of f_{S_n}, and it is a joint density function. If f is Gaussian, it can easily be seen that f_{opt}^{(n−1)*} is a joint density with peak at {x_i = t}_1^{n−1}. The problem of finding an optimal density (obviously with non-zero variance) under the constraint of i.i.d. biasing seems to be beset with mathematical difficulties similar to those in the case of tail probability estimation. It is difficult to infer from (3.51) what a good i.i.d. biasing density should look like. However, the arguments in Section 1.3.1 on page 7 are applicable here. This suggests that biasing densities satisfying f*(x) ≈ f(x) should be effective in providing improvements over the zero-IS density estimator (3.52). This estimator is similar in form to the well known kernel estimator for density functions when f is unknown. The estimator (3.49) can then be considered as an IS extension to kernel estimation, using a kernel with the unbiasedness property. Remarks about optimality similar to those in Section 3.2.1 also apply here. The density estimator does not have a "usual IS" counterpart against which performance comparisons can be made. Of course we can compare with a kernel estimator that does not exploit the shape of f. However it seems more appropriate to evaluate the relative performances of (3.49) and the zero-IS estimator (3.52). The simulation gain Γ_f achieved by f̂_{S_n}(t) can be obtained from
Γ_f = [ n² E{f²(nt − (n−1)S_{n−1})} − f²_{S_n}(t) ] / [ n² E*{f²(nt − (n−1)S_{n−1}) W²_{n−1}(X)} − f²_{S_n}(t) ]    (3.53)
with the expectation quantities above provided by corresponding estimators. As we have seen in probability estimation, minimizability of IS variance is intimately related to the behaviour of the weighting W as a function of the biasing. Such a dependence will be present in density estimation also. Moreover, it seems intuitively reasonable to conjecture that a good i.i.d. biasing density should have more mass near t, so that the sum S_n has high probability of being in the vicinity of t. For unimodal densities in particular, with means close to the density peak, placing the mean at or near t will achieve this mass concentration. In view of these qualitative arguments, it appears that biasing techniques such as exponential twisting, which transform the individual density to have mean at t, can be effective for density estimation. IS for density estimation is most effective in the tails of f, where a simulation would produce few samples. The amount of biasing required to minimize the variance is less in regions where f is relatively larger. For example, as t decreases from the right, the biasing reaches a minimum in the vicinity of the peak (which approaches the mean as n increases) and then again increases. This is in contrast to probability estimation, where the biasing decreases as the probability to the right increases. For one-sided densities with, say, a single peak, one would expect biasing by scaling to result in a down-scaling or compression left of the peak, where f_{S_n} is small. The following example illustrates the effects of density shape on biasing.

Example 3.6. Density estimation by translation. Let X ∼ N(0,1). We use translation biasing by an amount c, as in Example 2.4 on page 18. From (3.50) the I-function is
I(t, c), which after some algebra takes a form involving constants a_n > 0, b_n > 0, c_n > 0 depending on n, and d_n given below. It is evident that the variance is minimized by choosing the optimum shift as c = d_n t. For t < 0 the optimum shift becomes negative, in order to place density mass appropriately. The variance of the estimate using the optimum shift is a maximum for t = 0. Therefore performance improves in regions away from the mean, where f_{S_n} is small. In this example we have used a unity variance Gaussian density for biasing. It may however well be true that lower IS variance can be obtained by simultaneously optimally scaling the spread of the biasing density. Note that in this case exponential twisting prescribes a shift of t for all n. •
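Translation biasing for density estimation is easy to try out. The following sketch (Python; our own minimal illustration for N(0,1) inputs) implements an estimator of the form (3.49) with a shift c and compares it against the exact density of the mean, using the shift c = t that exponential twisting prescribes:

```python
import math, random

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def density_estimate(n, t, c, K=20000, seed=3):
    """IS estimate of the density of the mean of n i.i.d. N(0,1) variables
    at t, using translation biasing by c (an estimator of the form (3.49))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(K):
        xs = [rng.gauss(c, 1.0) for _ in range(n - 1)]
        # product of f(x)/f*(x) = exp(-c*x + c^2/2) over the n-1 samples
        w = math.exp(sum(-c * x + 0.5 * c * c for x in xs))
        total += n * phi(n * t - sum(xs)) * w
    return total / K

n, t = 10, 1.2
est = density_estimate(n, t, c=t)          # twisting prescribes shift t
exact = math.sqrt(n / (2 * math.pi)) * math.exp(-0.5 * n * t * t)
print(est, exact)
```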
3.6.1 An approximation: The Srinivasan density

Based on the representation for tail probability developed in Section 3.4.1, an analytical approximation for f_{S_n} can be derived, Srinivasan [71]. The steps are essentially similar to those leading to (3.28) and are indicated below. Differentiating (3.27) we have

f_{S_n}(t) = n E{f(nt − (n−1)S_{n−1})}
           = n E_s{f(nt − (n−1)S_{n−1}) W_{n−1}(X)}
           = n M^{n−1}(s) ∫_{−∞}^{∞} f(nt − x) e^{−sx} f_s^{(n−1)}(x) dx    (3.54)
where W_{n−1}(X) = M^{n−1}(s) e^{−s Σ_1^{n−1} X_i}, and the expectation in the second line above is with respect to the twisted density f_s. The density f_s^{(n−1)} can be normalized by the transformation

y = (x − (n−1)μ'(s)) / √((n−1)μ''(s))

to a density e_s^{(n−1)} representing a zero-mean and unity variance sum of n−1 i.i.d. random variables. Carrying out this change of variable yields

f_{S_n}(t) = n k_{n−1}(s) ∫_{−∞}^{∞} f(nt − (n−1)μ'(s) − a_n y) e^{−s a_n y} e_s^{(n−1)}(y) dy    (3.55)

where

k_n(s) = exp(−n(s μ'(s) − μ(s)))  and  a_n = √((n−1) μ''(s))

The second line in (3.55) is obtained by noting that e_s^{(n−1)} approaches a unit normal density as n becomes large, and making the change of variable x = (n−1)μ'(s) + a_n y. Before completing the derivation we make some observations. The sum density in the first step in (3.55) above is properly normalized with respect to the variable t and is an exact expression for all s. That is, the choice of the twisting parameter s would not matter provided the exact form of e_s^{(n−1)} were known. However, the approximation of using the Gaussian density
in the second step destroys the normalization property and also introduces dependence on s. Integrating the second expression with respect to t and equating to unity leads to the requirement

g(s) ≡ s μ'(s) − (s²/2) μ''(s) − μ(s) = 0

The solution of this differential equation is μ(s) = ms + v²s²/2, which is the log moment generating function of the distribution N(m, v²). Therefore the second expression will be normalized only if the individual density f is Gaussian. A choice for s has thus to be made to complete the approximation in (3.55). The mean of the approximate density can be obtained by integration as

(1/n) e^{−(n−1)g(s)} [ μ'(0) + (n−1)(μ'(s) − s μ''(s)) ]

Note that for s = 0 this mean is μ'(0), which is the mean of S_n, and the approximation is a normalized density for any f. In this case (3.55) reduces to

f_{S_n}(t) ≈ n E{f(X + n(t − m))}

where X ∼ N(m, (n−1)v²), which can be obtained directly from (3.54) by assuming that n is large enough for the CLT to apply to S_{n−1}. For s ≠ 0 and if f is Gaussian, the mean expression above reduces to μ'(0) if we choose s = s_t such that μ'(s_t) = t, as can easily be verified. This choice places the mean of the twisted density f_s at t. Using the same choice for any f of course guarantees neither a normalized density nor one with mean at μ'(0). However, some intuitive justification is provided in the next section, based on the fact that (3.54) represents the expectation of an IS point estimate of f_{S_n}(t). Moreover, it is evident from (3.55) that f(x + n(t − μ'(s))) → 0 as n → ∞ for every x if μ'(s) ≠ t. Hence this condition on s is necessary for the approximation not to vanish asymptotically. Therefore, with s = s_t, the approximation in (3.55) becomes
f_{S_n}(t) ≈ n e^{−nI(t)} (f_{s_t} ⊗ ψ^{(n)})(t)    (3.56)

where ⊗ indicates convolution, f_{s_t} is the exponentially twisted density with parameter s_t, ψ^{(n)} denotes the Gaussian density corresponding to N(0, μ''(s_t)(n−1)), and I(t) = s_t t − μ(s_t) is the well known large deviations rate function. Therefore, for large n, the above expression can be used as an approximation for the density of S_n. It involves evaluation of I(t) and one convolution. We refer to this approximation as the Srinivasan density. For one-sided densities (nonnegative random variables), it follows by carrying out the steps from (3.54) that the integral in the first line of (3.56) extends from 0 to nt, and the convolution form does not hold. If f has support in [a, b], some care has to be taken in dealing with the endpoints. The lower and upper limits of the integral are given as

∫_a^{nt−(n−1)a} (·)   if a ≤ t ≤ (b + (n−1)a)/n,
∫_a^b (·)             if (b + (n−1)a)/n ≤ t ≤ (a + (n−1)b)/n,    (3.57)
∫_{nt−(n−1)b}^b (·)   if (a + (n−1)b)/n ≤ t ≤ b
Convergence of the Srinivasan density. As n → ∞, the asymptotic form of f_{S_n} is obtained as

f_{S_n}(x) ≈ √(n/(2π μ''(s_x))) e^{−nI(x)},   −∞ < x < ∞    (3.58)

and this expression is valid for one-sided densities also. Note the similarity between this and the corresponding tail probability approximation in (3.26) on page 59. To tie up this result with the CLT, it is convenient to normalize f_{S_n} to have zero mean and unity variance. Replacing t by x in the above and making the change of variable t = √n(x − m)/v, the normalized density, denoted f̃_n, becomes

f̃_n(t) = v e^{−nI(m + tv/√n)} / √(2π μ''(s_{m+tv/√n}))    (3.59)

For a fixed t it is to be shown that f̃_n(t) converges to the unit normal density at t as n → ∞. Applying the expansions (3.44)–(3.46), with b replaced by t, we immediately obtain f̃_n(t) → e^{−t²/2}/√(2π). While this lends some support to the formula (3.56), the practical value of the approximation lies in its ability to accurately capture the sum density, especially in the tails, for intermediate ranges of n. •
Density and distribution pairs. To summarize these results, the approximate and asymptotic forms of the sum densities and (complementary) distributions, (3.56), (3.28), (3.58), and (3.26), are collected below. They are

f_{S_n}(t) ≈ n e^{−nI(t)} (f_{s_t} ⊗ ψ^{(n)})(t)

p_t(n) ≈ [ e^{−nI(t)} / (M(s_t) √(2π(n−1)μ''(s_t))) ] ∫_{−∞}^{∞} [1 − F(x)] e^{s_t x − (t−x)²/(2(n−1)μ''(s_t))} dx

for the approximations, and
f_{S_n}(t) ≈ √(n/(2π μ''(s_t))) e^{−nI(t)},   p_t(n) ≈ e^{−nI(t)} / (s_t √(2πn μ''(s_t)))

for the asymptotic case. It must be noted that the asymptotic tail probability expression is valid only for t > m. •

Example 3.7. Density approximation of Rayleigh sums. We approximate, and estimate using IS, the density function of a sum of Rayleigh distributed random variables with common density f(x) = 2x exp(−x²) for x ≥ 0. For the approximation, the manipulations required in (3.56) are straightforward. The result is (3.60), where erf denotes the standard error function, related to the Q-function by erf(x) = 1 − 2Q(√2 x) for x > 0 and erf(−x) = −erf(x). The moment generating function of the Rayleigh density is given by

M(s) = 1 + (√π s/2) e^{s²/4} (1 + erf(s/2))

This can be used to evaluate s_t. However, it is simpler to choose s_t in an appropriate range and use M'(s_t)/M(s_t) for t. The exact (convolution) density for n = 2 is shown along with the approximation in Figure 3.7. The match is surprisingly good, but the lack of normalization in the approximation is clearly apparent from the behaviour around the peaks. The match becomes numerically indistinguishable as t becomes large, indicating that the approximation has captured the tail accurately. In Figure 3.8 the approximation
Fig. 3.7. Comparing convolution and approximation (density of a sum of 2 Rayleigh variables).
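The Rayleigh moment generating function quoted above is easy to validate numerically (Python sketch, our own check; its derivative at 0 should reproduce the Rayleigh mean √π/2 = 0.886227 mentioned later in this example):

```python
import math

def rayleigh_mgf(s):
    """Closed-form MGF of f(x) = 2x exp(-x^2), x >= 0:
       M(s) = 1 + (sqrt(pi)*s/2) * exp(s^2/4) * (1 + erf(s/2))."""
    return 1.0 + 0.5 * math.sqrt(math.pi) * s * math.exp(s * s / 4) * (1 + math.erf(s / 2))

def rayleigh_mgf_numeric(s, h=1e-4, top=12.0):
    # brute-force check of M(s) by the trapezoidal rule
    n = int(top / h)
    tot = 0.0
    for i in range(n + 1):
        x = i * h
        v = 2 * x * math.exp(s * x - x * x)
        tot += v if 0 < i < n else 0.5 * v
    return tot * h

print(rayleigh_mgf(1.5), rayleigh_mgf_numeric(1.5))
```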
Fig. 3.8. Comparing asymptotic and limiting forms for Rayleigh sums (n up to 100).

and its asymptotic form (3.58) are compared for various n. As expected, the two increasingly coincide with growing n. Simulation results are shown in the next four figures. An IS convolver was implemented for n = 20 using biasing by scaling, optimized with an adaptive IS algorithm. In the t-range (0.5, 1.5), f̂_{S_n} was estimated at 200 points, with each point using an optimized scaling factor. Figures 3.9 and 3.10 show the body and tail of density estimation and approximation respectively. The optimized scaling factors are shown in Figure 3.11. As discussed earlier, scaling becomes unity near the mean and causes a compression of the biasing density as t decreases further. The peak is close to the Rayleigh mean of 0.886227. This is reflected in the resulting IS gain shown in Figure 3.12. •
3.6.2 Exponential twisting

The IS estimate for exponential twisting, using (3.49) and the weighting W_{n−1}(x) defined after (3.54), can be generated by (3.61).
Fig. 3.9. Estimating the density function of an i.i.d. sum (sum of 20 Rayleigh variates).
Fig. 3.10. Density tails of Rayleigh sums (simulation versus approximation).
Fig. 3.11. Optimized values of biasing for density estimation.
Fig. 3.12. Estimated gains for density estimation.
It is unbiased, with mean approximated by (3.56). The variance of this estimate can be obtained from the first term in (3.50), denoted by I_n, and expressed as

I_n(t) = n² M^{2(n−1)}(s_t) ∫_{−∞}^{∞} f²(nt − x) e^{−2 s_t x} f_{2s_t}^{(n−1)}(x) dx    (3.62)

as in the derivation leading to (3.56). The corresponding expression I_{n,MC} without IS (W = 1) is obtained as
(3.63) For large n the asymptotic forms of these can be written as (3.64) and (3.65). Using (3.64) and (3.58) in (3.50) yields the asymptotic variance of the IS convolver as

(1/n) log var* f̂_{S_n}(t) ≈ −2I(t) + (3 log n)/(2n) → −2I(t)
as n → ∞ for t ≠ m. Hence the variance decreases exponentially for large n. Similarly, with (3.64), (3.65), and (3.58) in (3.53), the asymptotic simulation gain becomes

Γ_f ≈ e^{nI(t)} M(s_t) / ∫_{−∞}^{∞} f²(x) e^{2 s_t x} dx → ∞

as n → ∞ for t ≠ m. Therefore exponential twisting is asymptotically optimal for density estimation, yielding vanishing IS variance as n → ∞, as in the case of tail probability estimation.
Appendix C

Upper bounds on I_n(a) and p_t for Weibull i.i.d. sums. The tail probability p_t can be written as

p_t(n) = E{1(S_n ≥ t)} = E{1((S_n/μ)^b ≥ r)}

where r = (t/μ)^b. Therefore, using this in the expectation in (3.3) yields

I_n(a) = E{1((S_n/μ)^b ≥ r) W_n(X)} ≤ E{e^{s((S_n/μ)^b − r)} W_n(X)},  s ≥ 0    (3.66)

This upper bound can be loosened by means of the inequalities
(4.2) is most powerful of size α for testing H₀ against H₁. (b) For every α, 0 ≤ α ≤ 1, there exists a test of the form above with γ(x) = γ, a constant, for which E₀{φ(X)} = α. (c) If φ' is a most powerful test of size α for testing H₀ against H₁, then it has the form (4.1) or (4.2), except for a set of x with probability zero under H₀ and H₁.

Comment. The reader is referred to Ferguson [23] for a proof of the lemma. We note that the Neyman-Pearson lemma prescribes a likelihood ratio test, for testing H₀ against H₁ when both hypotheses are simple, of the form

Λ(X) = f₁(X)/f₀(X) ≷_{H₀}^{H₁} η    (4.3)

The threshold η of the test is calculated from the false alarm probability (or size) specification. If the distributions we are dealing with are continuous, we can take γ = 0 and randomization is not necessary. However, for discrete or mixed distributions, it may be necessary to consider randomization to achieve a specified false alarm probability. A plot of the detection probability β against the false alarm probability α, with the threshold η as parameter, is known as the receiver operating characteristic (ROC). Let f_{Λ,i} denote the density function of Λ on the hypothesis H_i. Then we have

α = ∫_η^∞ f_{Λ,0}(λ) dλ    (4.4)

and

β = ∫_η^∞ f_{Λ,1}(λ) dλ    (4.5)

In many instances the task of determining density functions is made easier by considering the logarithm of the likelihood ratio above. This is particularly true for i.i.d. observations belonging to the exponential family of distributions. See [23] and Lehmann [39] for detailed treatments of hypothesis testing.
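For two Gaussian hypotheses the ROC of the likelihood ratio test has a closed form, which makes a compact numerical illustration (Python; our own sketch, using a bisection inverse of the Q-function):

```python
import math

def q(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def qinv(a, lo=-10.0, hi=10.0):
    # bisection inverse of the (monotone decreasing) Q-function
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if q(mid) > a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def roc_point(alpha, d):
    """For H0: N(0,1) vs H1: N(d,1) the LRT compares X to a threshold;
    the ROC is beta = Q(Qinv(alpha) - d)."""
    return q(qinv(alpha) - d)

for a in (1e-4, 1e-2, 0.1):
    print(a, roc_point(a, d=2.0))
```

As Property 1 of the next subsection asserts, every point printed lies above the β = α diagonal.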
4.1.1 Properties of likelihood ratio tests

The following properties are stated and proved here for continuous distributions.

1. All ROCs lie above the β = α line. That is, β ≥ α.

Proof. Let β denote the detection probability of the most powerful test of size α, 0 < α < 1, for testing H₀ against H₁. Let φ'(x) ≡ α be a test. Then φ' has size α. Also, φ' has detection probability α. Therefore β ≥ α. However, if α = β, the test φ' is most powerful; it then satisfies part (c) of the lemma and must have the form (4.1). This implies f₁(x) = f₀(x) almost everywhere, and we have a degenerate testing problem.

2. The slope of the tangent to the ROC at any point is the value of the likelihood ratio threshold at that point.

Proof. From (4.4) and (4.5) we have

dβ/dα = f_{Λ,1}(η) / f_{Λ,0}(η)    (4.6)
as the slope of the ROC, which is to be shown equal to η. Define the set

X_η ≡ {x : η ≤ Λ(x) ≤ η + dη}

Then, for small dη, we have

P(η ≤ Λ(X) ≤ η + dη | H₁) = f_{Λ,1}(η) dη = ∫_{X_η} f₁(x) dx    (4.7)

and

P(η ≤ Λ(X) ≤ η + dη | H₀) = f_{Λ,0}(η) dη = ∫_{X_η} f₀(x) dx    (4.8)

From (4.3) and the definition of X_η we have, for x ∈ X_η, that

η f₀(x) ≤ f₁(x) ≤ (η + dη) f₀(x)

This can be used to bound (4.7) as

η ∫_{X_η} f₀(x) dx ≤ f_{Λ,1}(η) dη ≤ (η + dη) ∫_{X_η} f₀(x) dx    (4.9)

Using (4.8) in (4.9) yields

η f_{Λ,0}(η) dη ≤ f_{Λ,1}(η) dη ≤ (η + dη) f_{Λ,0}(η) dη

from which

η ≤ f_{Λ,1}(η)/f_{Λ,0}(η) ≤ η + dη

As dη → 0 we get the desired result. A useful interpretation of this result, for Λ(x) = η, is

f_{Λ,1}(η)/f_{Λ,0}(η) = f₁(x)/f₀(x) = η    (4.10)
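Property 2 can be checked by finite differences on the Gaussian pair of the previous sketch: for H₀: N(0,1) versus H₁: N(d,1) the LRT reduces to comparing X with a threshold x₀, and the likelihood ratio there is η = e^{d x₀ − d²/2} (a small check of our own):

```python
import math

def q(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# H0: N(0,1) vs H1: N(d,1); comparing X to x0 is the LRT with
# threshold eta = exp(d*x0 - d*d/2)
d, x0 = 1.5, 2.0
alpha = lambda u: q(u)          # false alarm probability at threshold u
beta = lambda u: q(u - d)       # detection probability at threshold u
h = 1e-6
slope = (beta(x0 + h) - beta(x0 - h)) / (alpha(x0 + h) - alpha(x0 - h))
eta = math.exp(d * x0 - d * d * 0.5)
print(slope, eta)   # the ROC slope equals the LR threshold
```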
3. Likelihood ratio tests on i.i.d. observations are consistent.

Consider the simple hypothesis testing problem

X_n = (X₁, …, X_n) ∼ ∏_{j=1}^n f₀(x_j)

versus

X_n = (X₁, …, X_n) ∼ ∏_{j=1}^n f₁(x_j)    (4.11)

where X_n represents a sequence of i.i.d. observations. Let φ_n denote the sequence of likelihood ratio tests (4.12), where the threshold η_n satisfies 0 < θ₁ ≤ η_n ≤ θ₂ < ∞ for some constants θ₁ and θ₂. Consistency means that the false alarm probability α_n and detection probability β_n satisfy

α_n → 0    (4.13)

and

β_n → 1    (4.14)

respectively. See Appendix D for a proof.
4.2 Approximations for the error probabilities

For the simple hypothesis testing problem with i.i.d. observations, as in Property 3 above, the likelihood ratio test can be written as

Λ_n(X_n) = ∏_{j=1}^n f₁(X_j)/f₀(X_j) ≷_{H₀}^{H₁} η    (4.15)

Defining

Y_j = log [ f₁(X_j)/f₀(X_j) ]

we have the equivalent log-likelihood ratio test

L_n = (1/n) Σ_{j=1}^n Y_j ≷_{H₀}^{H₁} τ    (4.16)
with τ = (log η)/n representing the threshold. Let ν_i denote the density function of Y_j on the hypothesis H_i and F_{ν_i} the corresponding distribution function. Denote V_n = Σ_1^n Y_j, its density on H_i by f_{V_n,i}, and the density of L_n by f_{L_n,i}. From (4.15) and (4.16) we have V_n = log Λ_n = n L_n. Using these with the property (4.10) yields

ν₁(x) = e^x ν₀(x),   f_{V_n,1}(x) = e^x f_{V_n,0}(x),   f_{L_n,1}(x) = e^{nx} f_{L_n,0}(x)    (4.17)

Further, let M_i(s) and μ_i(s) denote respectively the corresponding moment generating and log moment generating functions of Y_j. Then

M₀(s) = E₀{e^{sY_j}} = ∫ e^{sY(x)} f₀(x) dx = ∫ f₁^s(x) f₀^{1−s}(x) dx

and M₁(s) = M₀(1+s). From this it follows that μ₁'(0) = μ₀'(1). Also, define

m₀ = E₀{Y_j} = μ₀'(0) = ∫ log[f₁(x)/f₀(x)] · f₀(x) dx    (4.18)

and

m₁ = E₁{Y_j} = μ₁'(0) = μ₀'(1) = ∫ log[f₁(x)/f₀(x)] · f₁(x) dx    (4.19)

the latter being well known as the Kullback-Leibler number, [37]. Then E₀{L_n} = m₀, var₀ L_n = μ₀''(0)/n, E₁{L_n} = m₁, and var₁ L_n = μ₀''(1)/n. It is easily shown, using the elementary inequalities log x ≤ x − 1 and log x ≥ 1 − 1/x respectively in (4.18) and (4.19), that m₀ ≤ 0 and m₁ ≥ 0. The false alarm and miss probabilities of the test can be expressed as

α_n = P₀(L_n ≥ τ)    (4.20)

and

1 − β_n = P₁(L_n < τ)    (4.21)

respectively.
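The signs m₀ ≤ 0 ≤ m₁ are easily confirmed numerically; for f₀ = N(0,1) and f₁ = N(d,1) the two expectations equal ∓d²/2 (Python sketch, our own illustration):

```python
import math

def kl_terms(d, h=1e-3, lim=12.0):
    """Numerically evaluate m0 = E0{log(f1/f0)} and m1 = E1{log(f1/f0)}
    for f0 = N(0,1), f1 = N(d,1); in closed form m0 = -d^2/2, m1 = d^2/2."""
    def phi(x, m):
        return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)
    m0 = m1 = 0.0
    n = int(2 * lim / h)
    for i in range(n + 1):
        x = -lim + i * h
        y = d * x - 0.5 * d * d          # log-likelihood ratio Y(x)
        w = 1.0 if 0 < i < n else 0.5    # trapezoidal-rule weight
        m0 += w * y * phi(x, 0.0)
        m1 += w * y * phi(x, d)
    return m0 * h, m1 * h

m0, m1 = kl_terms(1.2)
print(m0, m1)    # m0 <= 0 <= m1, as shown via log x <= x - 1
```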
4.2.1 False alarm probability

Applying the results on tail probability approximation in Section 3.4.1 to (4.20), we have immediately from (3.28) on page 60 that

α_n ≈ [ e^{−nI₀(τ)} / (√(2π(n−1)μ₀''(s_τ)) M₀(s_τ)) ] ∫_{−∞}^{∞} [1 − F_{ν₀}(x)] e^{s_τ x − (τ−x)²/(2(n−1)μ₀''(s_τ))} dx    (4.22)

where μ₀'(s_τ) = τ and I₀(τ) = s_τ τ − μ₀(s_τ). For large n this can be further approximated, as in (3.26) on page 59, by

α_n ≈ e^{−nI₀(τ)} / (s_τ √(2πn μ₀''(s_τ)))    (4.23)

which is valid for τ > m₀. The form (4.23) is a well known asymptotic result in detection theory, Van Trees [88], while (4.22) can be used as an approximation for intermediate values of n. Of course, it requires determination of the density ν₀. Note that

(1/n) log α_n → −I₀(τ)    (4.24)

as n → ∞, for the exponential rate of decrease of false alarm probability.
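For the Gaussian pair N(0,1) versus N(d,1), α_n is available in closed form, so (4.23) can be compared directly (Python sketch, our own illustration; with τ = 0 and d = 1, the rate is I₀(0) = 1/8):

```python
import math

def q(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def fap_exact(n, tau, d):
    """Exact alpha_n for H0: N(0,1) vs H1: N(d,1); under H0 the log-LR
    mean L_n is N(-d^2/2, d^2/n)."""
    return q((tau + d * d / 2) * math.sqrt(n) / d)

def fap_asymptotic(n, tau, d):
    # formula (4.23): alpha_n ~ exp(-n*I0(tau)) / (s*sqrt(2*pi*n*mu0''(s)))
    s = (tau + d * d / 2) / (d * d)          # solves mu0'(s) = tau
    mu0 = -0.5 * d * d * s + 0.5 * d * d * s * s
    rate = s * tau - mu0                     # I0(tau)
    return math.exp(-n * rate) / (s * math.sqrt(2 * math.pi * n) * d)

for n in (20, 100, 400):
    print(n, fap_exact(n, 0.0, 1.0), fap_asymptotic(n, 0.0, 1.0))
```

The ratio of the two values approaches 1 as n grows, with the familiar Mills-ratio correction of order 1/n.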
4.2.2 Miss probability

From (4.21) the miss probability is given by

1 − β_n = P₁(V_n < nτ) = P₁(Y_n < nτ − V_{n−1}) = E₁{F_{ν₁}(nτ − V_{n−1})} = ∫_{−∞}^{∞} F_{ν₁}(nτ − x) e^x f_{V_{n−1},0}(x) dx    (4.25)

From (3.22) on page 58 we note that

f_{V_{n−1},0}(x) = M₀^{n−1}(s) e^{−sx} f_{s,0}^{(n−1)}(x)

where f_{s,0}^{(n−1)} denotes the density of the sum of n−1 random variables, each with a density obtained by exponentially twisting ν₀. Then, using parallel steps following (3.23) leading to the expression (3.28), yields
which provides an approximation for the detection probability. Note that we have used the same value s_τ for s here as in the case of the false alarm probability. Heuristic justification for this follows from a Chernoff bound argument for 1 − β_n:

1 − β_n = P₁(V_n < nτ) ≤ e^{−snτ} M₁^n(s) = e^{−snτ} M₀^n(1 + s),  s < 0

which is minimized at 1 + s = s_τ.

M(s) = Γ(s + 1), s > −1 (SLD);   M(s) = Γ(s/2 + 1), s > −2 (ED)

where Γ(·) is the standard Gamma function, with means M'(0) = −γ (SLD) and −γ/2 (ED), where γ = 0.57721566… is the Euler constant, and variances

var U = M''(0) − [M'(0)]² = π²/6 (SLD);   π²/24 (ED)
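These moments follow from M(s) by differentiating its logarithm at 0; a quick numerical confirmation using math.lgamma (a sketch of our own):

```python
import math

def log_moment(s, mode="SLD"):
    """log M(s) for U = log X: M(s) = Gamma(1+s) under square-law detection
    (X exponential), Gamma(1+s/2) under envelope detection (X Rayleigh)."""
    return math.lgamma(1 + s) if mode == "SLD" else math.lgamma(1 + s / 2)

def mean_var(mode, h=1e-5):
    # first and second central differences of log M(s) at s = 0
    m = (log_moment(h, mode) - log_moment(-h, mode)) / (2 * h)
    v = (log_moment(h, mode) - 2 * log_moment(0, mode) + log_moment(-h, mode)) / h ** 2
    return m, v

gamma_e = 0.57721566
for mode in ("SLD", "ED"):
    print(mode, mean_var(mode))
```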
Reproducing (3.58),

f_{S_n}(x) ≈ √(n/(2π μ''(s_x))) e^{−nI(x)},   −∞ < x < ∞

where S_n = Σ_{i=1}^n U_i and I is the large deviations rate function for the density f_U. Then the FAP can be written as
α_t ≈ ∫_{−∞}^{∞} [1 − F_U(log t + x)] f_{S_n}(x) dx
   = ∫_0^∞ e^{−x} x^{−1} f_{S_n}(log x − log t) dx    (SLD)
   = ∫_0^∞ e^{−x} x^{−1} f_{S_n}(0.5 log x − log t) dx    (ED)
These expressions can be used to develop a Gaussian approximation for the FAP, as we did in the case of the CA detector. This is a straightforward exercise and is left to the interested reader. The integrals above are in a form suitable for Gauss-Laguerre numerical integration, and we demonstrate the results in Figures 5.3 and 5.4.

Fig. 5.3. False alarm probabilities for geometric mean CFAR detection (square law detection).

To determine the twist s_x and the rate function I(x), we have used the Lanczos approximation for the Gamma function

Γ(x) = [ (x + 5.5)^{x+0.5} e^{−(x+5.5)} √(2π) / x ] [ c₀ + c₁/(x+1) + ⋯ + c₆/(x+6) ],   x > 0

given in [48], reproduced here for the reader's benefit. The series terms are used to model the poles of the Gamma function in the left-half plane. The coefficients are
Fig. 5.4. False alarm probabilities for geometric mean CFAR detection (envelope detection).
Fig. 5.4. False alarm probabilities for geometric mean CFAR detection. Co = 1.000000000190015 Cl
= 6.18009172947146
C2
= -86.50532032941677
Cg
= 24.01409824083091
C4
= -1.231739572450155
C5
= 0.1208650973866179
X
C(;
= -0.5395239384953
10- 5.0
X
10- 2 .0
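The approximation with these coefficients is accurate to roughly one part in 10¹⁰ over the positive axis; a direct implementation (Python) makes that easy to confirm against a library Gamma function:

```python
import math

COEFFS = [1.000000000190015, 76.18009172947146, -86.50532032941677,
          24.01409824083091, -1.231739572450155, 1.208650973866179e-3,
          -5.395239384953e-6]

def lanczos_gamma(x):
    """Lanczos approximation to Gamma(x) for x > 0, using the coefficient
    list quoted above (g = 5, six series terms)."""
    series = COEFFS[0] + sum(c / (x + k) for k, c in enumerate(COEFFS[1:], start=1))
    return (x + 5.5) ** (x + 0.5) * math.exp(-(x + 5.5)) * math.sqrt(2 * math.pi) / x * series

for x in (0.5, 1.0, 3.7, 10.0):
    print(x, lanczos_gamma(x), math.gamma(x))
```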
The inverse problem, namely that of finding the threshold t for a given FAP, can be solved by noting that the derivatives are given by

∂α_t/∂t = −t⁻¹ ∫_0^∞ e^{−x} f_{S_n}(log x − log t) dx    (SLD)
        = −t⁻¹ ∫_0^∞ e^{−x} f_{S_n}(0.5 log x − log t) dx    (ED)

These can be used in the algorithm (5.7). Numerical results are given in Table 5.1 on page 120. Corresponding simulation results for thresholds can be found in Chapter 6, in Table 6.3 on page 147. The latter have been normalized in a certain manner described in that chapter, and this should be taken into account while making comparisons.
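The threshold inversion can be organized as a Newton iteration on log α_t, using derivatives of the kind above. The sketch below (Python; our own illustration on a toy FAP with a closed-form inverse, the classical cell-averaging expression α = (1+t)^{−n} for exponential clutter) shows the structure of such an algorithm:

```python
import math

def solve_threshold(alpha_target, fap, dfap, t0=1.0, iters=50):
    """Newton iteration on g(t) = log fap(t) - log alpha_target, using
    g'(t) = fap'(t)/fap(t); working in the log domain keeps the update
    well scaled at very small probabilities."""
    t = t0
    for _ in range(iters):
        g = math.log(fap(t)) - math.log(alpha_target)
        t -= g / (dfap(t) / fap(t))
    return t

# toy case: classical CA-CFAR FAP in exponential clutter, alpha = (1+t)^-n
n = 16
fap = lambda t: (1.0 + t) ** (-n)
dfap = lambda t: -n * (1.0 + t) ** (-n - 1)
t = solve_threshold(1e-6, fap, dfap)
print(t, fap(t))
```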
5.7 Point of application of biasing

The simulation procedures described thus far are applicable to any CFAR detector. The IS biasing is performed on the input random variables. Accuracy of estimates and the resulting simulation gains will of course depend on the particular CFAR detector under study. It is generally true that better estimator performance can be obtained if IS biasing can be carried out closer to the point in the processing chain of the detector where the actual (rare event) decision-making is done. This necessitates knowledge of the density functions of the processes at the point where biasing is to be implemented. Often, input stochastic variables may have undergone transformations whose results are difficult to characterize statistically in analytical form, and we have to rely on the general method above. However, when such transformations can be characterized, then IS should be carried out using the modified processes. For some detectors, though, such characterizations are possible. The following simple example serves to illustrate the importance of the choice of point of application of biasing.

Example 5.1. SO-CA-CFAR detector. We discuss briefly the SO-CFAR detector, for which fast simulation using input biasing is particularly difficult. By this we mean that the simulation gain obtained relative to the CA or GO detectors is poor, and large numbers of samples are required to achieve comparable accuracies. The SO detector statistic Y_SO is obtained by first summing the left and right half windows, and then selecting the smaller. The FAP is given by

α_t = P(X ≥ t Y_SO)
Consider the case of n = 2 using envelope detection in Gaussian clutter. Then Y_SO is just the smaller of two identically Rayleigh distributed independent random variables Y₁ and Y₂. The g-function is exp(−t²Y²_SO). The FAP of this detector can be obtained in closed form; we set it at α_t = 10⁻⁶. We look at three different applications of IS using biasing by scaling. In input biasing, the variables Y₁ and Y₂ are scaled by θ. The resulting I-function, denoted I_I, is given by

I_I = θ⁸ / ((2θ² − 1)[2(1 + t²)θ² − 1])

where the expectation defining I_I is evaluated over the distributions of Y₁ and Y₂. The FAP is given by

α_t = E{exp(−t²Y²_SO)} = 1/(1 + t²)

Therefore, t = 999.9995. The quantity I_I above achieves a minimum for an optimum scaling factor of θ₀ = 0.866025, with I_I(θ₀) = 4.21875 × 10⁻⁷. The resulting simulation gain over conventional MC estimation is (α_t − α_t²)/(I_I − α_t²) = 2.37037. This is a very low gain, and it is quite worthless to implement IS in this manner. In fact, if we use θ = 1.0 (no IS), then a gain of 2 is obtained, which is of course due to the use of the g-function. Let us now go up the transformation chain and consider implementing IS by directly biasing Y_SO, which has the density 4y exp(−2y²) for y ≥ 0. Again using scaling with the g-method, the corresponding expectation, denoted I_Y, becomes

I_Y = θ⁴ / (2(1 + t²)θ² − 1)

This has a minimum at θ₀ = 1/√(1 + t²), yielding

I_Y(θ₀) = 1/(1 + t²)²

which is exactly equal to α_t². Hence the variance is zero and we obtain a perfect estimate, implying infinite gain; simulation is not required. Finally, we apply IS at the decision-making point, where α_t = P(Z ≥ t) with Z ≡ X/Y_SO. The g-method cannot be used here, as we have gathered all variables into one decision variable Z. The random variable Z has the density 2z/(1 + z²)² and distribution z²/(1 + z²) for z ≥ 0. Its expectation is π/2 and no other moments exist. Upon use of scaling we obtain
-1
lz -
t
00
2z(z2 (;12(1
+ (}2)2
+ z2)4
dz
which has a minimum at θ_0 = 1316.07, with I_z(θ_0) = 2.1547 × 10⁻¹². Therefore we obtain a simulation gain of 8.66 × 10⁵, which is appreciable. The variable Z possesses a heavy-tailed distribution. Its square, Z², has a Pareto distribution, also heavy-tailed. Methods for dealing with i.i.d. sums of such variables are discussed in [4]. •

This rather dismal situation that obtains with input biasing of course improves somewhat as n becomes larger. Unfortunately, for envelope detection, it is not possible to obtain an analytical form for the density of the SO statistic to implement IS closer to the decisioning point. However, the Srinivasan density can be used to benefit here. Armed with this approximation, there are two options available to us to estimate the FAP of the SO detector. The distribution function corresponding to the Srinivasan density can be used to set up the density of the SO statistic, and the required probability then evaluated directly with a numerical computation. Alternatively, we can proceed with fast simulation. In the latter there are two types of estimators that can be mechanized. They differ in the points at which biasing is performed. The SO statistic is obtained by first summing the left and right half windows, and then selecting the smaller. Performing the biasing just short of this selection will bypass calculation of the distribution function during simulation.
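Since the three I-functions above are available in closed form (the I_z integral evaluates to (1/θ²)[1/W - b/W² + b²/(3W³)] with W = 1 + t² and b = 1 - θ²), the quoted optima and gains can be checked directly. The following sketch is an illustration, not from the text; Python and the variable names are assumptions:

```python
import math

# Example 5.1 closed forms; t is set so that alpha_t = 1/(1 + t^2) = 1e-6.
t = math.sqrt(1e6 - 1.0)                     # 999.9995
alpha = 1.0 / (1.0 + t * t)

def I_h(th):
    # input biasing: Y1 and Y2 scaled by theta
    return th**8 / ((2*th*th - 1.0) * (2*(1.0 + t*t)*th*th - 1.0))

def I_y(th):
    # biasing the SO statistic Y_SO directly
    return th**4 / (2*(1.0 + t*t)*th*th - 1.0)

def I_z(th):
    # biasing the decision variable Z = X/Y_SO (integral evaluated in closed form)
    W, b = 1.0 + t*t, 1.0 - th*th
    return (1.0/th**2) * (1.0/W - b/W**2 + b*b/(3.0*W**3))

def gain(I):
    # simulation gain over conventional MC estimation
    return (alpha - alpha**2) / (I - alpha**2)

print(gain(I_h(0.866025)))                          # approx 2.37037
print(I_y(1.0 / math.sqrt(1.0 + t*t)) - alpha**2)   # essentially zero: perfect estimate
print(gain(I_z(1316.07)))                           # approx 8.66e5
```

The numbers reproduce the gains quoted in the example: input biasing is nearly worthless, biasing the SO statistic is exact, and biasing the decision variable recovers a large but finite gain.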
CFAR detection 113
5.8 FAP decomposition for SO detectors: CA and GM

With input biasing it turns out that simulations for the SO-CFAR detector require large IS sample sizes for achieving accuracies comparable to other CFAR types. The reason for this presumably lies in the shape of the density function of the SO detector's random threshold. Input biasing is a general technique that compresses the density functions of the clutter samples in the window with no regard to the transformations that lead to the adaptive (random) detector threshold. It works very well for the CA- and GO-CFAR detectors, and is less effective for SO-, censored order statistic (COS), geometric mean (GM), and maximum likelihood Weibull clutter CFAR detectors. An effective method for the COS detector is developed on page 126. These comments apply to CFAR processors preceded by both square law as well as envelope detection operations. The importance of the SO (and GO) processors for robust detection using ensemble processing is described in Chapter 6. We describe now a procedure that effectively (and virtually) brings the biasing point closer to the decisionmaking point, but still uses input biasing. This is done by decomposing the FAP of the SO detector using conditional probability, such that the (rare) event is expressible in terms of random variables closer to the input of the detector. Define
Y_SO = min(Z_L, Z_R)

as the SO statistic constituting the random threshold. Here Z_L and Z_R can be the arithmetic sums or geometric means of {Y_1, ..., Y_{n/2}} and {Y_{n/2+1}, ..., Y_n}, respectively. The detection algorithm is X ≷ tY_SO, with

α_t = E{P(X ≥ tY_SO | H_0, Y)} = E{g(tY_SO)}   (5.23)

where g = 1 - F. This can be decomposed as
α_t = P(X ≥ t min(Z_L, Z_R), Z_L < Z_R | H_0) + P(X ≥ t min(Z_L, Z_R), Z_L ≥ Z_R | H_0) ≜ α_L + α_R,

respectively. Then

α_L = P(X ≥ t min(Z_L, Z_R), Z_L < Z_R)
    = E{P(X ≥ t min(Z_L, Z_R), Z_L < Z_R | Y)}
    = E{1(Z_L < Z_R) g(tZ_L)}   (5.24)
dropping the condition on H_0 for notational ease. Similarly,

α_R = E{1(Z_L ≥ Z_R) g(tZ_R)}   (5.25)

which is equal to α_L since Z_L and Z_R are i.i.d. We note in passing that, in a similar manner, the FAP of a GO detector can be expressed as

α_t = 2E{1(Z_R < Z_L) g(tZ_L)}   (5.26)

5.8.1 Fast estimation
Other than input biasing used with (5.23), there are (at least) three methods by which α_t can be estimated using IS. An estimate of α_L or α_R can be doubled, or independently generated estimates of α_L and α_R from (5.24) and (5.25) added. In the first case the resulting variance is four times that of estimating either α_L or α_R. In the second case the variances will add, leading to twice the individual variance for the summed estimate. But twice the number of samples are required to generate these estimates, with the result that the simulation gains of both methods will be the same. One can do better with a third method, wherein estimates of α_L and α_R are obtained using the same sets of random variables generated during a simulation run. These estimates will be dependent and positively correlated. The variance of their sum will be less than four times the individual variance, while using the same number of samples as needed for a single estimate. The resulting simulation gain will therefore be enhanced, as demonstrated below. Consider IS estimation of α_L. The estimate is

α̂_L = (1/K) Σ_{j=1}^K [1(Z_L < Z_R) g(tZ_L) W_{1/2}(Y_L)]^(j);   Y_i^(j) ~ f*, i = 1, ..., n/2;   Y_i^(j) ~ f, i = n/2 + 1, ..., n   (5.27)
for a K-length i.i.d. simulation. A single superscript has been used above to avoid indexing each variable in the summand. The input biasing here is
carried out only on the samples in the left half window using the density f*. Recall that the density f* compresses each Yi. This has the effect of increasing the probability term in (5.27) as well as causing the indicator to clock unity more frequently. The samples in the right half window are left unbiased. Hence the IS weighting function
W_{1/2}(Y_L) = Π_{i=1}^{n/2} f(Y_i)/f*(Y_i)

is computed over {Y_i}_1^{n/2}. Therefore this IS method takes into account the smallest-of operation between the two half windows but pays no heed to whether these result from arithmetic or geometric means. During this simulation an estimate of α_R is generated simultaneously,
α̂_R = (1/K) Σ_{k=1}^K [1(Z_L ≥ Z_R) g(tZ_R) W_{1/2}(Y_R)]^(k);   Y_i^(k) ~ f, i = 1, ..., n/2;   Y_i^(k) ~ f*, i = n/2 + 1, ..., n   (5.28)
using the same variables for biasing. The superscripts (j and k) are kept distinct for convenience in calculating correlations below. The estimate of α_t is

α̂_t = α̂_L + α̂_R
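As a concrete sketch of (5.27) (an illustration, not from the text): assume exponential (square-law) cell variables with f(y) = e^{-y}, arithmetic half-window sums, and biasing by scaling, f*(y) = f(y/θ)/θ. In this special case α_L has the closed form Σ_{i=1}^m C(i+m-2, m-1)/(2+t)^(i+m-1) (a standard Gamma computation), which the estimator can be checked against; m, t, θ and K are illustrative choices:

```python
import math, random

random.seed(1)

m, t, theta, K = 4, 2.0, 0.5, 200_000   # illustrative half-window size, threshold, scaling

def g(x):                                # 1 - F for an Exp(1) test cell
    return math.exp(-x)

acc = 0.0
for _ in range(K):
    # left half biased: Y ~ f*(y) = f(y/theta)/theta, i.e. a scaled exponential
    yl = [theta * random.expovariate(1.0) for _ in range(m)]
    yr = [random.expovariate(1.0) for _ in range(m)]       # right half unbiased
    zl, zr = sum(yl), sum(yr)
    if zl < zr:                          # indicator 1(ZL < ZR)
        w = math.prod(theta * math.exp((1.0/theta - 1.0) * y) for y in yl)
        acc += g(t * zl) * w             # g(t ZL) W_{1/2}(YL)
alpha_L_hat = acc / K

exact = sum(math.comb(i + m - 2, m - 1) / (2.0 + t)**(i + m - 1) for i in range(1, m + 1))
print(alpha_L_hat, exact)                # the two should agree closely
```

The companion estimate α̂_R of (5.28) follows by exchanging the roles of the half windows while reusing the same draws.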
5.8.2 Variance and IS gain

The variances of both unbiased estimates are equal:

var α̂_L = var α̂_R = (1/K)[E*{1(Z_L < Z_R) g²(tZ_L) W²_{1/2}(Y_L)} - α_L²]   (5.29)

The expectation E* of course proceeds over the biased left half and the unbiased right half windows. The variance of the summed estimate is

var α̂_t = 2 var α̂_L + 2E*{α̂_L α̂_R} - α_t²/2 ≤ 4 var α̂_L   (5.30)
the inequality following in an elementary manner from the fact that α̂_L and α̂_R are identically distributed. Noting that the jth and kth terms in (5.27) and (5.28) are i.i.d. for j ≠ k, it is easily shown that the correlation above is

E*{α̂_L α̂_R} = (1/K) E*{L_j R_j} + ((K - 1)/K)(α_t²/4)   (5.31)
where

L_j ≜ [1(Z_L < Z_R) g(tZ_L) W_{1/2}(Y_L)]^(j)   (5.32)

and

R_j ≜ [1(Z_L ≥ Z_R) g(tZ_R) W_{1/2}(Y_R)]^(j)   (5.33)

To evaluate the single-term correlation function in (5.31), it is necessary to explicitly relate the quantities L_j and R_j to the unbiased variables {Y_i}_1^n for the jth IS trial. Biasing is performed by scaling each Y_i by θ, with 0 ≤ θ ≤ 1, and we assume that the same θ is used for both half windows. This is reasonable from the symmetry of the problem. Then

f*(y) = f(y/θ)/θ

Substituting this in (5.32) and noting that Z_L is a biased quantity,

L_j = 1(θZ_L < Z_R) g(tθZ_L) Π_{i=1}^{n/2} θ f(θY_i)/f(Y_i)

with a corresponding form for R_j. Note that in this form, L_j is composed only of unbiased variables, each with density f. Dependence on θ has been made explicit. Therefore the single-term correlation becomes
E*{L_j R_j} = E{1(θZ_L < Z_R) 1(Z_L ≥ θZ_R) g(tθZ_L) g(tθZ_R) Π_{i=1}^n θ f(θY_i)/f(Y_i)}
            = ∫⋯∫ 1(θz_L < z_R) 1(z_L ≥ θz_R) g(tθz_L) g(tθz_R) Π_{i=1}^n θ f(θy_i) dy
            = ∫⋯∫ 1(θz_L < z_R) 1(z_L ≥ θz_R) g(tz_L) g(tz_R) Π_{i=1}^n f(y_i) dy
            = E{1(θZ_L < Z_R) 1(Z_L ≥ θZ_R) g(tZ_L) g(tZ_R)}   (5.34a)
            = E*{1(θZ_L < Z_R) 1(Z_L ≥ θZ_R) g(tZ_L) g(tZ_R) W_{1/2}(Y_L) W_{1/2}(Y_R)}   (5.34b)

the second integral above being obtained through a simple substitution and a trivial change of variable. It follows from (5.34a) that the correlation is a monotone function decreasing with θ. For θ = 1, that is, no biasing, the variables L_j and R_j are orthogonal. As θ → 0,
E*{L_j R_j} → E{g(tZ_L) g(tZ_R)} = P²(X ≥ tZ_L)   (5.35)
from symmetry. In actual simulation, a small (optimum) value of θ would be used if the threshold t is large, implying that the probability term, and hence the correlation in (5.35), are small and approaching zero. Nevertheless, these limits provide bounds on the variance for any value of θ used. Using them in (5.31) and (5.30) gives
2 var α̂_L - α_t²/(2K) ≤ var α̂_t ≤ 2 var α̂_L - α_t²/(2K) + (2/K) P²(X ≥ tZ_L)
To estimate simulation gains afforded by the method, it is necessary to estimate var α̂_t. Denoting by I the expectation in the right hand side of (5.29), and substituting (5.29) and (5.31) in (5.30), yields

var α̂_t = (1/K)[2I + 2E*{L_j R_j} - α_t²]   (5.36)

Both I and E*{L_j R_j} from (5.34b) can be easily estimated during the IS simulation. The gain Γ obtained by the proposed estimation procedure over the use of input biasing can then be estimated from

Γ = [E*{g²(tY_SO) W²(Y)} - α_t²] / [2I + 2E*{L_j R_j} - α_t²]   (5.37)

where the expectation in the numerator is the I-function obtained using input biasing on (5.23) with a full weighting function W.
5.8.3 SO- and GO-GM detectors

This decomposition method is now applied to the FAP estimation of SO-GM detectors. It is easily shown that setting θ = 1 in (5.37) leads to Γ = 1. That is, the proposed decomposition provides no improvement when biasing is not used. Showing analytically that gains are obtained with biasing appears difficult. Hence this is demonstrated by implementing both methods and carrying out separate optimizations, as indicated in (5.37). It must be mentioned that an optimum θ has been determined by minimizing only the first term in the denominator, using stochastic Newton recursions. This has been done for convenience. Though derivatives of the correlation term can be estimated, the expressions are cumbersome. Hence the gains reported will be somewhat lower than can be obtained with a complete optimization. Results are shown respectively in Figures 5.5 and 5.6 for SO-CA and SO-GM processors preceded by envelope detection operations. Shown are gains
Fig. 5.5. Estimated simulation gains (decomposition method vs. simple input biasing); SO-CA detectors, Rayleigh clutter, n = 16. Target FAP is 10⁻⁶.

Fig. 5.6. Estimated simulation gains (decomposition method vs. simple input biasing); SO-GM detectors, Rayleigh clutter, n = 16. Target FAP is 10⁻⁶.
achieved by both methods over straight MC simulation. Their ratio is Γ. For the CA case a gain of more than 10⁶ is obtained for the decomposition method, whereas simple input biasing provides only about 10³. The gains obtained are less in the GM case. This can be attributed to the use of scaling as the biasing technique. Hence GM detectors require a better biasing method. The actual FAP estimates using just 100 biased samples are shown in Figures 5.7 and 5.8 to provide visual comparisons. In all figures, recursions on
Fig. 5.7. Comparing FAP estimates (simple input biasing; Rayleigh clutter, SO-CA, n = 16). Inner graph is the decomposition method estimate.
the x-axis refer to evolutions of adaptive biasing and adaptive threshold optimization algorithms running simultaneously. Numerical results for threshold multipliers of SO-CA detectors presented in Chapter 6 have been obtained using this decomposition method. Results for the GO variant of GM detectors have also been obtained using simple input biasing. Numerical values from adaptive threshold optimizations for a target FAP of 10⁻⁶ are placed in Table 5.1 for SO-GM and GO-GM detectors. Asymptotes are of course obtained from simple theoretical analyses.
Approximations for FAP. Numerical approximations for the FAPs of SO- and GO-GM detectors are developed here using the results of Section 5.6.1. From (5.25) we have
Fig. 5.8. Comparing FAP estimates (simple input biasing; SO-GM, n = 16, Rayleigh clutter). Inner graph is the decomposition method estimate.
Table 5.1. Thresholds t for SO-GM, GO-GM, and GM detectors in Gaussian clutter. Rows marked with an asterisk contain values obtained through numerical approximation. All others are from simulation.
[Table 5.1 entries: SO-GM and GO-GM detectors, window lengths n up to 48 and the asymptote n → ∞.]
α_t = 2E{1(Z_L < Z_R) g(tZ_L)} = 2E{1(log Z_L < log Z_R) g(tZ_L)}

Define V_L ≡ log Z_L and V_R ≡ log Z_R. Then

α_t = 2E{1(V_L < V_R) g(te^{V_L})} = 2 ∫∫ 1(z < y) g(te^z) f_{V_L}(z) f_{V_R}(y) dz dy

Let m = n/2. The densities f_{V_L} and f_{V_R} are then both equal to the density f_{S_m} of the normalized sum of m i.i.d. variables with common density f_U defined on page 108. Denoting the distribution of S_m as F_{S_m} yields

α_t ≈ 2 ∫_{-∞}^{∞} g(te^z)[1 - F_{S_m}(z)] f_{S_m}(z) dz
    = 2 ∫_0^∞ e^{-x} x^{-1} [1 - F_{S_m}(log x - log t)] f_{S_m}(log x - log t) dx,   SLD
    = ∫_0^∞ e^{-x} x^{-1} [1 - F_{S_m}(0.5 log x - log t)] f_{S_m}(0.5 log x - log t) dx,   ED

The term in brackets can of course be written as

1 - F_{S_m}(z) = ∫_z^∞ f_{S_m}(y) dy

and the expressions above are for the SO-GM detector. For GO-GM detectors, the term 1 - F_{S_m} is replaced by F_{S_m} in the above integral approximations. Numerical results for FAPs are shown in Figures 5.9-5.12 for square law and envelope detection versions. Values of threshold multipliers evaluated using these expressions are given in Table 5.1 for comparison with the ones obtained through simulation.
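The log-domain decomposition can be sanity-checked by direct Monte Carlo before any numerical integration is attempted. The sketch below (an illustration, not from the text) assumes exponential (square-law) cell variables, an Exp(1) test cell so that g(x) = e^{-x}, and geometric-mean half windows:

```python
import math, random

random.seed(2)

m, t, K = 4, 1.5, 400_000                # illustrative half-window size and threshold

def g(x):
    return math.exp(-x)                  # 1 - F for an Exp(1) test cell

direct = dec = 0.0
for _ in range(K):
    y = [random.expovariate(1.0) for _ in range(2 * m)]
    zl = math.prod(y[:m]) ** (1.0 / m)   # geometric means of the half windows
    zr = math.prod(y[m:]) ** (1.0 / m)
    direct += g(t * min(zl, zr))         # E{g(t Y_SO)} directly
    vl, vr = math.log(zl), math.log(zr)
    if vl < vr:
        dec += 2.0 * g(t * math.exp(vl)) # 2 E{1(VL < VR) g(t e^VL)}
d_direct, d_dec = direct / K, dec / K
print(d_direct, d_dec)                   # the two estimates agree
```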
5.9 Examples in CFAR detection A few CFAR detection examples are now described that give rise to analytically intractable problems. These cases serve to illustrate the power of adaptive IS methods in performance analysis and design.
Example 5.2. CA-CFAR detector in Rayleigh clutter

Consider the CA-CFAR detector preceded by an envelope detector operating in complex Gaussian clutter. The cell variables are Rayleigh distributed, and determining the false alarm probability is analytically intractable for n > 1. With g(tnx) = exp(-n²t²x²) and using the moment generating function of the Rayleigh density (given on page 78) to obtain numerically the values of s_X, the approximation (5.17) can be computed with a Gauss-Laguerre formula.
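For a small window and a moderate threshold, the g-method estimator developed below in this example can be compared against a crude MC estimate of α_t = E{exp(-t²(Σ Y_i)²)}; n, t, the scaling factor a, and K below are illustrative assumptions, not values from the text:

```python
import math, random

random.seed(3)

n, t, a, K = 4, 0.5, 0.8, 300_000        # illustrative window, threshold, scaling, samples

def rayleigh():                          # density 2 y exp(-y^2), so P(Y > y) = exp(-y^2)
    return math.sqrt(-math.log(1.0 - random.random()))

crude = 0.0
for _ in range(K):
    s = sum(rayleigh() for _ in range(n))
    crude += math.exp(-t * t * s * s)    # g applied to the unbiased window sum
crude /= K

isest = 0.0
for _ in range(K):
    y = [a * rayleigh() for _ in range(n)]       # samples drawn from the scaled density f*
    s, q = sum(y), sum(v * v for v in y)
    # g-method term: g(t sum Y) times the IS weight a^(2n) exp(-(1 - 1/a^2) sum Y^2)
    isest += math.exp(-t * t * s * s) * a**(2 * n) * math.exp(-(1.0 - 1.0/a**2) * q)
isest /= K
print(crude, isest)                      # both estimate the same FAP
```

At genuinely rare-event thresholds the crude estimate collapses while the IS estimate, with an optimized a, does not; the moderate t here is chosen only so the comparison is meaningful.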
Fig. 5.9. False alarm probabilities with square law detection in Gaussian clutter (SO-GM CFAR, window lengths n = 32, 40, 48).

Fig. 5.10. False alarm probabilities with envelope detection in Gaussian clutter (SO-GM CFAR, window length n).
Fig. 5.11. False alarm probabilities with square law detection in Gaussian clutter (GO-GM CFAR, window length n).

Fig. 5.12. False alarm probabilities with envelope detection in Gaussian clutter (GO-GM CFAR, window length n).
On the other hand, the Gaussian approximation of (5.19) is easily shown to be

α_t ≈ (1 + 2nt²v²)^{-1/2} exp(-n²t²m²/(1 + 2nt²v²))

where m = √π/2 and v = √(1 - π/4). Further, using biasing by scaling, the g-method estimator is easily shown to be given by
α̂_t = (1/K) Σ_{k=1}^K [ exp(-t²(Σ_{i=1}^n Y_i)²) a^{2n} exp(-(1 - 1/a²) Σ_{i=1}^n Y_i²) ]^(k)
where a is the scaling factor. This estimator can be adaptively optimized with respect to a. Results are shown in Figures 5.13-5.15.

Fig. 5.13. Optimized IS scaling factors for FAP estimation of CA-CFAR detectors (CA CFAR, n = 16, Rayleigh clutter, α = 10⁻³).

Figure 5.13 shows the behaviour of the biasing parameter algorithm of (2.32) on page 27 applied to a, and Figure 5.14 that of the threshold multiplier algorithm of (5.7). A comparison of IS simulation estimates for false alarm probability with values computed using the approximation (5.17) and the Gaussian approximation is shown in
Fig. 5.14. Convergence of threshold multipliers for Rayleigh clutter (CA CFAR, n = 16, α = 10⁻³).

Fig. 5.15. Comparing approximations and simulation (analytical approximation, simulation, and Gaussian approximation; n = 8, 16).
Figure 5.15. This indicates that the Gaussian approximation is inaccurate for moderate n and low values of false alarm probabilities. However, the approximation (5.17) is graphically indistinguishable from the optimized simulation results under all conditions, and can therefore be used effectively to compute false alarm probabilities. •

Example 5.3. Censored OS CA-CFAR in Weibull clutter

Consider the censored ordered statistic cell averaging CFAR detector operating in a Weibull clutter background. This detector structure was proposed in Rickard & Dillard [56] and further analyzed in Ritcey [57] and Gandhi & Kassam [26] for its ability to tolerate interfering targets. In this detector, the n reference window samples are ordered and the lowest n - r samples summed to comprise Y. Analytical solution of the false alarm probability performance is possible only in the case of Gaussian clutter when the detector is preceded by a square law device, Shor & Levanon [67], rendering the cell variables exponentially distributed (i.e., Weibull (see Example 2.1 on page 11) with μ = 1 and b = 1). If an envelope detector is used in this case, the variables have a Rayleigh distribution (Weibull with μ = 1 and b = 2), and the density function of the non-i.i.d. sum Y can be found numerically only through an (n - r)-fold convolution, which is computationally intensive. This is also true of other Weibull distributed clutter, which is encountered, for example, in high resolution radars. It is in such situations that the techniques developed in the preceding sections find a useful and powerful role. An application of IS to this detector can be found in [70]. We give here a slight generalization, called the trimmed mean detector, wherein the τ smallest samples are censored together with r of the largest. Assuming that the cell variables are i.i.d., we denote their ordered statistic by Y, such that 0 ≤ Y_1 < Y_2 < ⋯ < Y_n < ∞. From (2.45) on page 39 we have

f_n(Y) = n! f(Y_1) ⋯ f(Y_n),   0 ≤ Y_1 < ⋯ < Y_n < ∞

where f denotes the single sample clutter (in this case Weibull) density. By successive integration, the joint density function of the (n - τ - r) ordered variables is easily shown to be

(n!/(τ! r!)) F(Y_{τ+1})^τ [1 - F(Y_{n-r})]^r Π_{i=τ+1}^{n-r} f(Y_i),   0 ≤ Y_{τ+1} ≤ ⋯ ≤ Y_{n-r}

The weighting function W is the ratio of these densities. We will not go further with the analysis, as the procedure for implementing the estimator is straightforward and is as laid out in Example 2.10 on page 35. To estimate the threshold multiplier η, the g-method can be used. However, this will entail using the distribution of the decision statistic |s†Σ̂⁻¹x|. An easier method that avoids this is to approximate the indicator function. The procedure is described in Section 7.3.3 on page 178. The IS estimator in (5.42), using two-dimensional scaling for biasing, can be used for simulating the AMF for data with any probability distributions. The Gaussian assumption was used merely to make a convenient choice for the covariance structure of the samples. When the underlying distributions are non-Gaussian, the CFAR property may not hold. Nevertheless, most detection schemes are designed using the Gaussian assumption as a nominal one for setting the value of the threshold multiplier. This will also be true for advanced algorithms conceived to handle inhomogeneities and nonstationarities. Geometric mean CFAR detectors, for instance, are known to be robust to interferers that appear in the CFAR window, as demonstrated in the next chapter. In a STAP system, an interfering target at a particular range would appear in all elements of a secondary vector x(k) for the corresponding (kth) range gate. A geometric mean version of the AMF would look like
Whether this algorithm has the CFAR property remains to be shown. Nevertheless, the biasing techniques discussed above would still be applicable. Analyses of this and other such variant algorithms are a matter for further research.
6. Ensemble CFAR detection
We describe here a generic approach to CFAR processing, introduced in [74], which results in detection algorithms that tend to be robust to inhomogeneities in radar clutter. Termed ensemble- or E-CFAR detection, it combines members from the family of known CFAR processors. It is simple and easy to implement. While finding the most robust algorithms is still an open problem, the concept allows the synthesis of a large number of candidate algorithms that can be tested for their properties.
6.1 Ensemble processing

It has been felt by radar scientists and engineers that robust detection can be achieved by using combinations of different CFAR processors known individually to possess desirable properties in various clutter conditions. This is a reasonable premise, and some qualitative arguments in support are given below. However, specific structures or algorithms had not been suggested till the publication of [74]. An important reason for this apparent lack is the formidable difficulty in carrying out mathematical analyses of even the simplest schemes that involve more than one CFAR processor. Based on fast simulation theory, it has been possible to make some inroads into the problem of robust detection using combinations from an ensemble of processors. Not surprisingly, there exists a virtual myriad of possibilities in terms of such structures. A combination procedure that immediately comes to mind for enhancing CFAR processor performance is that of distributed detection, [76]-[84], [1], [28], and [30]. It has been argued in several papers, sometimes at length, that distributed detection can be used to combine detection reports, or declarations, from the signal processors of different sensors observing the same volume. There are practical issues that arise in such a scheme for radars, some of which are mentioned here. Assuming overlapping surveillance volumes, combining different (unambiguous) reports from the same target would require coordinate transformations between radars and accurate range and bearing estimates. This is especially true in multiple target situations. Moreover, compensation would have to be carried out for returns arriving at different times
at each radar. Another important practical issue is the need for synchronization of system clocks between radars. Further, even small differences in target aspect angles could result in widely varying signal to noise ratios at the detectors. These and other related issues are quite complex, and can possibly be addressed only in ground or fixed radars. For fixed radars there are other methods, such as clutter map procedures, which work well. Therefore it is decidedly more practical (and simpler) to implement networking (or fusion) of radars at the track level. The real need for combatting clutter and enhancing detection performance is in sensors on moving platforms, such as airborne radars. Indeed, decision fusion can still be carried out to advantage within a single radar processor using several CFAR detectors. This is seen as follows. For simplicity, consider first a system consisting of N = 2 radars operating on the returns R_1 and R_2. The two detectors produce decisions which are combined by a (boolean) fusion rule to give a final decision regarding the presence or absence of a target. For a specified false alarm probability at the system output, detector processing and fusion rule are optimized to provide maximum detection probability. This is the Neyman-Pearson criterion in the distributed context. It turns out from the theory of distributed detection that the optimal detectors perform likelihood ratio processing, with individual detection thresholds depending on the fusion rule used. The fusion can be either an AND or an OR rule. There is no general theory possible which enables determination of the optimal rule without specifying input statistical characterizations. Therefore a search needs to be carried out to locate the better fusion rule in terms of numerically evaluated performance. For small or moderate N, an exhaustive search is feasible. As N increases, the problem becomes computationally intense.
This is, in brief, the situation when R_1 and R_2 are statistically independent. This means that R_1 and R_2 could be returns from geographically distanced sensors assumed to observe a common target in the same volume. If R_1 and R_2 happen to be dependent or correlated, the optimal distributed detection problem is analytically intractable. It is not known what processing even the individual detectors should perform to obtain best detection performance. In the extreme case when R_1 and R_2 are completely correlated, for example if R_1 = R_2 (i.e. the same radar signal processor), then it is clear that nothing can be achieved by using more than one likelihood ratio processor, because the information content about the target is the same in both observations. Therefore a single detector is obviously sufficient and optimal, provided the probability distribution of R_1 is known. In practice, however, there arises an intermediate case when R_1 = R_2. This occurs when the detector is known to be suboptimal for its clutter environment. The suboptimality arises for two reasons. First, the CA-CFAR detector is the most powerful invariant processor for homogeneous Gaussian clutter, but it is not a likelihood ratio processor. Secondly, most radar signal processors use envelope detectors prior to CFAR detection. Further and
Ensemble detection 139
more importantly, radar clutter is often nonhomogeneous and non-Gaussian. It is therefore clear that no single detector can be robust to all clutter and interference conditions. Hence, given the inherent suboptimality of the CA-CFAR detector (and in fact of every CFAR detector) in any realistic target environment, an appropriate combination of two or more different CFAR detectors operating on the same set of returns could result in a processing structure wherein the individual detectors complement each other to produce overall detection performance better than any single detector. This observation forms the essential premise of ensemble processing. With some thought this reasoning can be carried further to claim that if the CFAR detectors being used are different from the CA-CFAR detector, then their appropriate combination should produce, in homogeneous Gaussian clutter, a detection performance that approaches that of the CA-CFAR detector; the hope further being that, in the process of combination, any inhomogeneity-resistant properties possessed by the individual detectors (and not possessed by the CA-CFAR detector) would cohere to provide enhanced robustness. These conjectures are indeed borne out by numerical results for the E-CFAR processor. The processor is a departure from the standard distributed detection structure. Distributed detection problems are notoriously difficult to solve, even numerically, [76, 30]. This is because the optimal solution involves a system of N coupled nonlinear equations. In E-CFAR processing, we dispense with the quantization at the individual detector level and instead combine the different CFAR statistics directly for threshold decisioning. Whereas something is gained in terms of target information in the first step, we lose somewhat in the second, because our combination procedure is purely intuitive and not based on the mathematics of statistically optimal processing. This last appears to be of formidable analytical difficulty.
Moreover, it will involve the statistics of the return when a target is present, and will result in particularization of the detector structure for a target model. The approach is to choose an ensemble scheme and determine its threshold for a desired FAP in homogeneous Gaussian clutter. This choice is dictated by certain attractive properties that individual detectors may have. In the process, new results are obtained for the geometric-mean detector. Then the performance of the ensemble in inhomogeneities is estimated and compared with other stand-alone CFAR detectors. A robust ensemble is identified and recommended for implementation. Parallel results are developed for square law and envelope detection. Several possibilities in detection are shown to exist.
6.2 The E-CFAR detector

Consider the statistic ℰ defined as

ℰ = Σ_{j=1}^J a_j Y_j / v_j   (6.1)

where Y_j is some function of the sequence {X_1, ..., X_N} of variables in the detector window. Each function represents a known CFAR processing operation that can be used in the ensemble. The coefficients {a_j}_1^J are binary and only used to select which types of statistics make up the ensemble. Each CFAR statistic is weighted by a weight v_j. With X_0 the cell under test, the E-CFAR detection rule is given by

X_0 ≷ tℰ   (6.2)

where t is referred to as the ensemble threshold. The ensemble processor is shown in Figure 6.1.
6.2.1 Normalization

The weights (v_1, ..., v_J) used in the ensemble will reflect the relative importance assigned to each CFAR statistic. To combine the different statistics in a consistent manner, some form of normalization needs to be performed. Two schemes come to mind based on intuitive considerations. For convenience we use the CA-CFAR statistic, denoted Y_1, as a base with unity weight (v_1 = 1.0). In one approach, normalization is carried out over the means of the statistics. That is, we set

v_j = E{Y_j}/E{Y_1},   j = 2, ..., J   (6.3)

These weights can be determined analytically or numerically or through simulation. A second method of normalization is based on equalizing, for a common threshold, the tail probability of each CFAR decision rule used individually to some specified value a_0. This can be done in the following manner. Denote the weights as {φ_1, ..., φ_J}. Let t_j denote the value of the threshold of the jth decision statistic used alone with the weight v_j. That is,

P(X_0 ≥ t_j Y_j / v_j | H_0) = a_0,   j = 1, ..., J

Hence, to satisfy

P(X_0 ≥ t_1 Y_j / φ_j | H_0) = a_0,   j = 1, ..., J
Fig. 6.1. E-CFAR detector.

it is required that

φ_j = (t_1 / t_j) v_j,   j = 1, ..., J
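A sketch of the ensemble construction under the reading of (6.1) and (6.3) given above; the three member statistics, the exponential window samples, and all numerical values are illustrative assumptions rather than specifics from the text:

```python
import math, random

random.seed(5)

n = 8                                            # CFAR window length (illustrative)

# Three candidate CFAR statistics over the window (hypothetical ensemble members).
def ca(y):                                       # cell averaging
    return sum(y) / len(y)
def go(y):                                       # greatest-of half-window means
    m = len(y) // 2
    return max(sum(y[:m]), sum(y[m:])) / m
def gm(y):                                       # geometric mean
    return math.prod(y) ** (1.0 / len(y))

stats = [ca, go, gm]

# Mean normalization (6.3): v_1 = 1 and v_j = E{Y_j}/E{Y_1}, estimated by simulation.
trials = [[random.expovariate(1.0) for _ in range(n)] for _ in range(50_000)]
means = [sum(f(y) for y in trials) / len(trials) for f in stats]
v = [mu / means[0] for mu in means]

def ensemble(y, a=(1, 1, 1)):                    # (6.1): E = sum_j a_j * Y_j / v_j
    return sum(aj * f(y) / vj for aj, f, vj in zip(a, stats, v))

# (6.2): declare a target when X0 >= t * E, for an illustrative ensemble threshold t.
t, x0 = 2.0, 9.0
y = [random.expovariate(1.0) for _ in range(n)]
print(v, x0 >= t * ensemble(y))
```

The tail-equalized weights φ_j would be built the same way, replacing the empirical means by individually calibrated thresholds t_j.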
The ensemble statistic therefore retains the same form as in (6.1), but with v replaced by φ.

where k = t_c/(t_g - t_c). Substituting this in (6.17) and recalling (5.26) on page 114 gives

P_g1 = 2E{g(t_g Z_L) F_{Z_R}(Z_L)} (= α_g),   t_c ≤ t_g/2
     = 2E{g(t_g Z_L) F_{Z_R}(Z_L/k)},         t_g/2 < t_c < t_g
     = 0,                                     t_c ≥ t_g        (6.18)
where F_{Z_R} denotes the distribution function of the sum Z_R of m = n/2 i.i.d. variables with common density f. In a completely analogous manner it can be shown that the remaining terms in (6.10) and (6.11) are given by (6.19)-(6.21), where d = t_c/(t_s - t_c). For the single sample density

f(x) = (1/λ) e^{-x/λ},   x ≥ 0

where λ is the noise power, the sum Z_R has the Gamma density given on page 24 with appropriate parameters. This is given by

f_{Z_R}(x) = x^{m-1} e^{-x/λ} / (λ^m (m-1)!),   x ≥ 0
The corresponding Gamma distribution function is

F_{Z_R}(x) = 1 - e^{-x/λ} Σ_{i=1}^m (x/λ)^{i-1}/(i - 1)!,   x ≥ 0
The left half-sum Z_L has the same distribution in the homogeneous case. Using these in (6.18)-(6.21) and carrying out some algebra leads to the formulae
P_g1 = 2/(1+t_g)^m - 2 Σ_{i=1}^m C(i+m-2, m-1)/(2+t_g)^(i+m-1)  (= α_g),   t_c ≤ t_g/2
     = 2/(1+t_g)^m - 2 Σ_{i=1}^m C(i+m-2, m-1)(1 - t_c/t_g)^(i-1)(t_c/t_g)^m/(1+t_c)^(i+m-1),   t_g/2 < t_c < t_g
     = 0,   t_c ≥ t_g        (6.22)

P_g2 = 0,   t_c ≤ t_g/2
     = α_c [2 Σ_{i=1}^m C(i+m-2, m-1)(1 - t_c/t_g)^(i-1)(t_c/t_g)^m - 1],   t_c > t_g/2        (6.23)

P_s1 = 2 Σ_{i=1}^m C(i+m-2, m-1)/(2+t_s)^(i+m-1)  (= α_s)        (6.24)

and

P_s2 = 2α_c [1 - Σ_{i=1}^m C(i+m-2, m-1)(1 - t_c/t_s)^m (t_c/t_s)^(i-1)],   t_c ≤ t_s/2
     = α_c,   t_c > t_s/2        (6.25)

where C(·,·) denotes the binomial coefficient.
The FAP α is the sum of the quantities in (6.22)-(6.25). Numerical evaluation of α is discussed in Section 6.6. • REs 2, 3, 5. For these ensembles, excepting the three individual FAPs, all terms have to be estimated by simulation or, in some cases, with a combination of simulation and numerical approximations. • RE4. For the ensemble RE4 (all statistics from GMs), the intersection of the events in the indicators of the last line of (6.15) is
{t_c Y_C < t_s Y_S}{Y_L < Y_R} = {Y_L > a Y_R}{Y_L < Y_R} = { {Y_L < Y_R < Y_L/a},    a < 1
                                                            { the null set,           a ≥ 1

with s > 0, which means, from (7.2), that h* is an exponentially twisted density. One way to meet the restriction h(x) ≤ 1 is to truncate the function. We can set
Blind simulation 171
h(x) = { e^{s(x−c)},    x ≤ c
       { 1,             x > c        (7.11)
for some c, with

a_h = E{ 1(X ≤ c) e^{s(X−c)} } + P(X > c)        (7.12)
Therefore h* is the partially twisted density

h*(x) = h(x) f(x) / a_h = { e^{s(x−c)} f(x) / a_h,    x ≤ c
                          { f(x) / a_h,               x > c

The truncation parameter c should satisfy c > t. The reason for this is as follows. Consider, for c ≤ t, the event that all of the biased variables comprising S_n in one term of (7.7) fall below c. The indicator 1(S_n ≥ t) will then be zero and these biased random variables would be wasted, contributing to an increase in the estimator variance. To ensure that all biased variables have a nonzero probability of contributing to the estimate, we must choose c greater than t. For a chosen c, an optimum value of the parameter s that minimizes I_b in (7.10) has to be determined (in an adaptive manner). As in exponential twisting, the optimum s will not depend on n. The choice of c is dealt with shortly.

7.2.2 The asymptotic rate
From (3.38) on page 64 we have

lim_{n→∞} (1/n) log I = min_θ [ −θt + log M(θ) ] = min_s [ −2st + log M(2s) ]

replacing θ by 2s. For arbitrary h we have from (3.37) on page 63 that I ≤ I_b. Comparing this with (7.10) yields for the asymptotic rate above

lim_{n→∞} (1/n) log I ≤ min_s [ −2st + log( a_h ∫ (e^{2sx} / h(x)) f(x) dx ) ]        (7.13)
For h as in (7.11), the integral above is

E{ a_h e^{2sX} / h(X) } = a_h [ ∫_{−∞}^{c} e^{s(x+c)} f(x) dx + ∫_{c}^{∞} e^{2sx} f(x) dx ]

with a_h as in (7.12). A calculation shows that the right-hand side tends to M²(s) as c → ∞. Hence the asymptotic rate in (7.13) becomes

2 min_s [ −st + log M(s) ] = −2 I(t)
Therefore (partially) blind simulation asymptotically achieves optimal performance as c becomes large. The completely blind case is treated in [12].

7.2.3 Blind simulation gain

From the above we see that we can get arbitrarily close to the minimum of I_b, represented by the use of exponential twisting, by choosing c sufficiently large. However, if a large value of c is chosen then from (7.12) we have

a_h ≤ e^{−sc} M(s) + P(X ≥ c) → 0

as c → +∞. But a_h is the probability of selecting an original sample X as a biased one. Therefore large values of c will result in a small simulation length K_r. On the other hand, a small c means that h(X) is close to unity with high probability, which implies that the degree of biasing is small and the estimator (7.7) will require very large values of K_r, and hence K, to build up a count. Hence a moderate value of c > t is needed. The choice of c will actually depend on the density f, which we do not know. In actual application some experimentation is required to achieve a compromise between the number of biased samples obtained and an estimated gain over conventional MC simulation.
Partially blind case. In many practical situations we are interested in the case of finite n. Let K_mc denote the total number of original samples needed in a conventional MC simulation to achieve a variance equal to the variance (7.9) obtained by the estimator (7.7). Then the blind simulation gain for the same variance is

Γ ≡ K_mc / K        (7.14)

which can be estimated in an implementation. We study the effect of the choice of c on the gain Γ. For simplicity let n = 1. Substituting h in (7.11) into I from (7.9) yields separate expressions for the cases c ≤ t and c > t. Using (7.12) in these and evaluating the limits c → −∞ and c → +∞ gives the limiting values of the gain.
Therefore it follows that the gain Γ in (7.14) changes from unity to zero as c increases from −∞ to +∞. To show that gains greater than unity can be obtained, it is sufficient to establish that the slope Γ' = ∂Γ/∂c is greater than zero as c increases from −∞. Using I above and differentiating Γ in (7.14) with respect to c for c ≤ t yields

Γ' = (p_t(1) − p_t²(1)) / [I − p_t²(1)]² · [ a_h'(I − p_t²(1)) − a_h I' ]

But from (7.12)

a_h' = −s e^{−sc} ∫_{−∞}^{c} e^{sx} f(x) dx → 0,    as c → −∞

It follows that Γ' ↓ 0 as c → −∞, and that Γ' > 0 as c increases from −∞. This holds for c ≤ t. Again, differentiating I above for c > t and substituting into Γ' leads, upon simplification, to Γ' > 0 for c ↓ t. The gain Γ increases as c increases from t and therefore must have a maximum for some c > t. The optimum value of c will have to be found through simulation for n > 1. •
Completely blind case. Suppose we use the estimator of (7.8) in a completely blind simulation. For ease of analysis we continue with n = 1. Then

p̂_t(1) = (1/K) Σ_{l=1}^{K_r} 1(X_l ≥ t) / h(X_l),    X_l ~ h*

Let A denote one summand. Then

E*{p̂_t(1)} = E{K_r / K} E*{A} = a_h · (p_t(1) / a_h) = p_t(1)

and the estimator is unbiased. Also

E*{ [p̂_t(1)]² | K_r } = (1/K²) [ K_r E*{A²} + (K_r² − K_r) E*²{A} ]

from which

E*{ [p̂_t(1)]² } = (1/K²) [ E{K_r} E*{A²} + E{K_r²} E*²{A} − E{K_r} E*²{A} ]

Therefore the estimator variance follows, and the gain Γ over conventional MC simulation, also a completely blind technique, can be obtained by equating variances. Now

E*{A²} = (1/a_h) E{ 1(X ≥ t) / h(X) }

Since h(X) ≤ 1, it follows that E*{A²} ≥ p_t(1)/a_h and therefore Γ ≤ 1 for all h. It is clear that completely blind simulation provides no improvement over conventional MC because of the use of the simple estimate â_h = K_r/K. Hence better estimates of a_h are required to obtain simulation gains. This is a matter for research and we do not discuss the issue further. •
7.3 CFAR detection

In the radar context a situation that typically arises is when clutter samples entering the detection circuit are recorded and made available for performance estimation subsequent to a radar trial. The samples could represent returns from a geographic region over which it is of interest to learn the radar performance. For example, Blacknell & White [9] describe various clutter classification procedures for synthetic aperture radar imagery. In the realm of radar CFAR detection, however, conventional MC simulation can be used to estimate false alarm probabilities without estimating clutter densities. This is however undesirable, not because of the paucity of sufficient numbers of samples, but due to the fact that long simulations with real data tend to result in unreliable estimates, given the nonstationarities and inhomogeneities that are almost always present in radar clutter. Therefore there is a need to carry out performance estimation with as few samples as possible and yet produce accurate estimates. It is in this context that we attempt to study and develop blind IS algorithms for CFAR detectors, [73]. For simplicity we restrict attention to cell averaging CFAR algorithms. At the outset it is clear that in this blind situation the g-method cannot be used. From Section 5.2 on page 99 we know that the densities of variables in the CFAR window have to be compressed for biasing whereas that of the cell under test pushed forward. Let k* and h* denote respectively the multiplicatively shifted densities of window variables and the cell under test, obtained using the acceptance-rejection procedure with some functions k and h. From Section 5.4 we know that these should be ET densities. For the window variables we have immediately, from (5.11) on page 103, that
k(y) = e^{−sty},    s > 0        (7.15)
with a_k ≡ E{k(Y)} = M(−st). In the case of the cell under test we use h(x) as in (7.11). From (5.3) on page 100 and using the weighting functions corresponding to h* and k*, the blind false alarm probability estimator is given by

α̂_t = (1/K_r) Σ_{j=1}^{K_r} 1( X^{(j)} − t Σ_{i=1}^{n} Y_i^{(j)} > 0 ) (a_h / h(X^{(j)})) Π_{i=1}^{n} (a_k / k(Y_i^{(j)})),    X^{(j)} ~ h*,  Y_i^{(j)} ~ k*        (7.16)

Here K_r denotes the length of the IS simulation obtained from K original clutter samples. For the moment we assume that K_r, a_h, and a_k are given. Then the variance of this estimator is given by (7.17), in terms of an expectation I.
The exponential upper bound I_b on the expectation I in (7.17) becomes

I_b = E{ a_h e^{2sX} / h(X) } a_k^{2n}        (7.18)

It is noted that as c → ∞, the upper bound I_b → M²(s) M^{2n}(−st), as in (5.14) on page 103.
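A minimal sketch of the estimator (7.16) follows. Exponential clutter is assumed only so that the output can be compared with the known CA-CFAR result α_t = (1 + t)^{−n}; the sketch treats the sample pool blindly, and — as a simplifying assumption not made in the text — estimates a_h and a_k from the observed acceptance rates.

```python
import math, random

# Sketch of the blind false alarm probability estimator (7.16).
def blind_fap(samples, n, t, s, c, k_r, seed=4):
    rng = random.Random(seed)
    h = lambda x: math.exp(s * (x - c)) if x <= c else 1.0   # cf. (7.11)
    k = lambda y: math.exp(-s * t * y)                       # cf. (7.15)
    def draw(accept_prob):
        trials = 0
        while True:                        # acceptance-rejection biasing
            trials += 1
            x = samples[rng.randrange(len(samples))]
            if rng.random() <= accept_prob(x):
                return x, trials
    total, h_trials, k_trials = 0.0, 0, 0
    for _ in range(k_r):
        x, tr = draw(h)                    # cell under test ~ h*
        h_trials += tr
        ys = []
        for _ in range(n):
            y, tr = draw(k)                # window samples ~ k*
            k_trials += tr
            ys.append(y)
        if x - t * sum(ys) > 0:            # false alarm indicator
            total += 1.0 / (h(x) * math.prod(k(y) for y in ys))
    a_h = k_r / h_trials                   # acceptance-rate estimates
    a_k = n * k_r / k_trials
    return (total / k_r) * a_h * a_k ** n
```

With t = 1 and n = 4 the known value is 2^{−4} = 0.0625, which the blind estimate should approach as K_r grows.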
7.3.1 Estimator implementation

We describe a procedure for implementing the estimator (7.16). Assume that a sufficiently large set of K clutter samples from f is available. The estimator can be implemented adaptively in two ways. In either case a value of s is required to generate the (1 + n)K_r biased samples needed for a single evaluation of (7.16). In one adaptive implementation, there is an s-algorithm that optimizes the value of s by minimizing an estimated variance, and the false alarm probability is estimated at each step of this algorithm until the variance minimum is obtained. This is the procedure we developed in [70] using stochastic Newton recursions. In the other implementation, an optimum value of s is determined using original samples prior to a single α̂_t estimation. This can be done because biased samples are not required to minimize the variance upper bound I_b. We have used the latter approach here. Furthermore there are two ways of using the original samples for mechanizing (7.16). Biased samples can be generated, using an optimized s value, by the acceptance-rejection method until K original samples are utilized. In this case K_r will be a random variable. Alternately, we can continue testing samples until a fixed or chosen value K_r of IS simulation length is obtained. Then K will be a random variable. In either case the average number of original samples used will be the same, and for large K the two procedures will yield equivalent results. For reasons of convenience in simulation we use the latter method.

Optimum twist. We refer to the twist s that minimizes I_b as optimum. Minimization of the actual variance, or I, requires biased samples to generate (variance) estimates, as this cannot be done using original samples due to the fact that the false alarm event is rare. However, I_b is easily minimizable as all the quantities and their derivatives can be estimated from the original samples. The algebra is straightforward but tedious and is omitted.
The stochastic s-algorithm will be

s_{m+1} = s_m − δ Î_b'(s_m) / Î_b''(s_m),    m = 1, 2, ...        (7.19)

where primes indicate derivatives with respect to s, hats indicate estimates, m is a recursion index, and δ is a control parameter. This algorithm can be run with a fixed set of original samples. The value of s obtained from the algorithm (7.19) will depend on the choice of the value of the truncation parameter c in (7.11). As we know, a moderate value of c is required. •
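The recursion (7.19) can be sketched with a simplified bound. The single-sample choice I_b(s) = E{e^{2s(X−t)}} below is an illustrative stand-in for the actual variance bound; like the real one, it and its derivatives are plain sample averages over the original (unbiased) samples.

```python
import math, random

# Stochastic Newton recursion in the style of (7.19) on an estimated bound.
def optimize_twist(xs, t, s0=0.5, delta=1.0, iters=15):
    s = s0
    for _ in range(iters):
        w = [math.exp(2.0 * s * (x - t)) for x in xs]
        ib1 = sum(2.0 * (x - t) * wi for x, wi in zip(xs, w)) / len(xs)
        ib2 = sum(4.0 * (x - t) ** 2 * wi for x, wi in zip(xs, w)) / len(xs)
        s -= delta * ib1 / ib2             # Newton step, cf. (7.19)
    return s

rng = random.Random(5)
xs = [rng.gauss(0.0, 1.0) for _ in range(50000)]
s_opt = optimize_twist(xs, t=2.0)          # analytic minimizer here is t/2
```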
where primes indicate derivatives with respect to s, hats indicate estimates, m is a recursion index, and 8 is a control parameter. This algorithm can be run with a fixed set of original samples. The value of s obtained from the algorithm (7.19) will depend on the choice of the value of the truncation • parameter c in (7.11). As we know, a moderate value of c is required. 7.3.2 The blind simulation gain The test of the blind estimation scheme can be provided by calculating the gain in sample size obtained over conventional Me simulation for achieving is generated equal variances. For a fixed Kr, a set of biased samples {Xj}
f;1
Blind simulation 177
using K original samples. These are Bernoulli trials (using h(x)) with success probability ah and are terminated when the first Kr successes occur. Then K - Kr has the negative binomial distribution
with E{K} = K_r/a_h and var(K) = K_r(1 − a_h)/a_h². For each j we again conduct Q_j Bernoulli trials (using k(y)) with success probability a_k until n successes occur. Then

P(Q_j − n = q) = C(n + q − 1, q) a_k^{n} (1 − a_k)^q

with E{Q_j} = n/a_k and var(Q_j) = n(1 − a_k)/a_k² for all j. Hence the total number N_T of original samples used for a single α̂_t estimation is

N_T = K + Σ_{j=1}^{K_r} Q_j

with E{N_T} = K_r (1/a_h + n/a_k). Now let K_mc be the simulation length used for a conventional (zero IS) MC estimator α̂_t^{mc} of α_t. This will have variance

var α̂_t^{mc} = (1/K_mc) [ α_t − α_t² ]        (7.20)

and the total number of samples used will be N_T^{mc} = (1 + n) K_mc. Equating the variances in (7.17) and (7.20) yields the gain Γ in sample size obtained by blind simulation, which can then be defined as

Γ = N_T^{mc} / N_T = [ (1 + n) K_r / (K + Σ_{j=1}^{K_r} Q_j) ] · (α_t − α_t²) / (I − α_t²)        (7.21)
For large K_r we can also define an average gain as

Γ̄ = N_T^{mc} / E{N_T} = [ (1 + n) a_h a_k / (a_k + n a_h) ] · (α_t − α_t²) / (I − α_t²)        (7.22)
From this we conclude that as c varies from 0 to ∞, the average gain changes from the value corresponding to biasing only the window variables, to zero, although the I function, and hence the estimator variance, reduce to their minimum values. Hence a value of the truncation parameter c could exist in 0 < c < ∞ for which maximum gain over conventional MC estimation can be achieved. Therefore a property of the blind simulation scheme described here is that IS-related parameters have to be optimized to achieve maximum gains, as compared to the case of known densities, wherein minimization of the I function directly leads to variance minimization or gain maximization.
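The sample accounting that drives (7.22) is easy to check numerically: the expected number of original samples consumed for one blind estimate is E{N_T} = K_r(1/a_h + n/a_k). The acceptance probabilities below are illustrative values.

```python
import random

# Simulates the Bernoulli-trial sample consumption behind (7.21)-(7.22).
def total_samples(k_r, n, a_h, a_k, seed=6):
    rng = random.Random(seed)
    def trials_until(successes, p):        # Bernoulli trials until done
        count = got = 0
        while got < successes:
            count += 1
            if rng.random() < p:
                got += 1
        return count
    n_t = 0
    for _ in range(k_r):
        n_t += trials_until(1, a_h)        # one cell-under-test sample
        n_t += trials_until(n, a_k)        # n window samples
    return n_t

k_r, n, a_h, a_k = 2000, 4, 0.4, 0.25
expected = k_r * (1.0 / a_h + n / a_k)     # 2000 * (2.5 + 16) = 37000
observed = total_samples(k_r, n, a_h, a_k)
```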
7.3.3 An application

To establish the effectiveness of these blind algorithms, Rayleigh distributed i.i.d. clutter samples are used in an implementation. Results for this case are already known in terms of the threshold multiplier t and CFAR window length n. Of course, the algorithms are blind to this knowledge and the clutter density. Various results of simulation are shown in Figures 7.1-7.7. First we obtain a rough estimate of the optimum truncation parameter c for a given t. This is done by maximizing the quantity

γ = [ a_h a_k / (a_k + n a_h) ] · (1 / I_b)

obtained from (7.22) by ignoring terms that do not depend on c, replacing I by its upper bound I_b, and assuming that I_b ≫ α_t², since we are far from using the optimal form of IS for this problem. A plot of γ is shown in Figure 7.1 as a function of c and has been obtained by estimating a_h, a_k, and I_b with a set of original samples using an optimized value of s from (7.19) for each c. It is clear that an optimum choice of c exists. The corresponding optimum twist s is shown in Figure 7.2. The values of t shown (0.3312 and 0.375) are known to provide (see Figure 5.14 on page 125) false alarm probabilities of 10⁻⁶ and 10⁻⁷ respectively for a CFAR window length of 16. The IS estimator algorithm in (7.16), as it evolves with simulation length K_r, is shown in Figure 7.3, with estimated gains in Figure 7.4. As described earlier, the optimal choices of c and s were kept fixed for all K_r. Gains of approximately 1000 and 5000 over conventional MC simulation were obtained for false alarm probabilities of 10⁻⁶ and 10⁻⁷ respectively. This is appreciable, given that the algorithms use no knowledge of clutter densities.
Threshold adaptation and α_t control. We describe multiplier adaptation for a specified target false alarm probability. The blind algorithms were used with the t-algorithm of (5.7) on page 101 to estimate t. To compute the gradient of α̂_t, we used

S(x) = 1 / (1 + e^{−ax}),    a > 0
Fig. 7.1. Determining optimal truncation parameter c for blind IS in Rayleigh clutter (CA-CFAR, n = 16).
to approximate the indicator function in (7.16), with x above replaced by

X^{(j)} − t Σ_{i=1}^{n} Y_i^{(j)}

This differentiable function may be recognized as the well known sigmoid nonlinearity used in artificial neural network training algorithms. As a becomes larger it approximates the indicator function more closely. The approximate estimator was used for the (sole) purpose of computing a gradient. This technique was first used in [55] for demodulator threshold optimization in communication systems. Results are shown in Figure 7.5, where α_0 is the target probability. The sigmoid parameter was chosen as 5 and c was kept fixed at 2.7. The only input to the algorithms was the value of α_0. At each recursion of the t-algorithm, with simulation length of K_r = 2000, an optimal twist s was obtained. The set of original unbiased clutter samples was not fixed but rather generated as required for the algorithms, in order to model a more practical situation. No smoothing was carried out and the t-algorithm did not use any adaptation of the control parameter δ_t. It is clear that the algorithms have captured threshold multiplier values with some accuracy. To test the behaviour of the algorithms in a nonstationary clutter environment, the following modest experiment was performed. A clutter distribution
Fig. 7.2. Optimal twisting parameter s as a function of truncation parameter c (CA-CFAR, n = 16).
transition from Rayleigh to Weibull was modelled in the form of a step change introduced at the 11th recursion of the t-algorithm. The Weibull density used had a shape parameter of 2.22. No attempt was made to adapt the truncation parameter c, which was maintained at the suboptimal value of 2.7 throughout the experiment. This is the main reason for not introducing a very drastic step change in the clutter distribution. A simulation length of K_r = 1000 was used for each recursion. The results are shown in Figure 7.6 and Figure 7.7. It is evident that the algorithms, though operating with suboptimal parameters, are sensitive to the clutter nonstationarity and able to adapt the threshold to maintain false alarm probability performance in the blind situation. •
Of particular interest in real applications is the situation of unknown clutter distributions. Using algorithms which do not rely on any a-priori statistical knowledge of clutter, we demonstrated that performance estimation
Blind simulation 181 10-5 , - - - , - - - - , - - - - , - - - - , - - - - , - - - - , - - - - - , - - - - , - - - - r - - - ,
CA-CFAR
n =16 1= 0.331
1=0.375
10-8L-__~__~____~__~~__~__~~__~____L -_ _~_ _~ 500 1000 1500 2000 2500 3000 3500 4000 4500 5000
Kr
Fig. 7.3. Evolution offalse alarm probability blind IS estimate in Rayleigh clutter. and threshold determination can be carried out a few thousand times more quickly than by conventional MC methods in terms of sample sizes. Implicit in these developments is the assumption that the samples, though unknown in distribution, are stationary and free of inhomogeneities. As is well known, clutter nonstationarity has been the bane of researchers attempting to develop robust CFAR algorithms. An admittedly modest experiment shows that blind algorithms can track step changes in the clutter distribution by means of threshold adaptation. Clearly this opens up the possibility of adaptive control of false alarm probabilities. The results are certainly significant in controlled simulations. In real terms though, there is a long way to go before one can even think of using such algorithms in practice. There are a few issues that are worth addressing here. The principal question seems to be "can we develop methods of estimation and adaptation based on very few samples"? To appreciate the problem quantitatively, consider the simulation gain shown in Figure 5.22 on page 132 for the CA-CFAR detector false alarm probability with window length of 32. In this simulation the clutter distribution is known exactly and the IS estimator operates optimally for the chosen biasing method which is scaling. A gain over conventional MC simulation of more than 106 is obtained at a false alarm probability of 10- 6 . Suppose we have decided, based on relative precision, that it is sufficiently accurate to simulate the false alarm event indicator 108 times in a MC simulation. It
follows that an IS estimator with knowledge of the clutter distribution would require no more than a 100-length simulation, implying thereby that a total of 3300 clutter samples are needed. This is certainly not a restrictive demand. In fact, better biasing schemes can be developed which can provide further gains. The question then becomes one of asking whether we can design blind IS schemes to perform nearly as well. Of course, algorithms can be tuned and complete adaptivity introduced in terms of optimizing all parameters to squeeze out more gain. The real challenge however lies in devising other more successful biasing schemes. A point to be noted in this regard is that the choice of the k function is not optimal given the truncated h function that was used. That is, can other h (and therefore, k) functions be found which will perform better? Another equally important issue to be addressed is the one of clutter nonstationarity. We have shown the potential of these algorithms for tracking transitions in distributions. But this assumed that the cell under test also contained samples from the new distribution. Real radar returns may well consist of targets in the cell under test and interferers in the window. We do not know how the algorithms will behave under such conditions.

Fig. 7.4. Blind simulation gains in Rayleigh clutter (CA-CFAR, n = 16; t = 0.331 and t = 0.375).
Fig. 7.5. Multiplier determination using blind IS (CA-CFAR, n = 16). Target FAP is α_0.
Fig. 7.6. Threshold tracking in a Rayleigh-to-Weibull distribution transition using blind algorithms (CA-CFAR, n = 16).
Fig. 7.7. FAP control in clutter distribution transition using blind algorithms (CA-CFAR, n = 16).
8. Digital Communications
Performance estimation of digital communication systems using simulation is an area where IS has been extensively applied for over two decades. It has found applications in several situations where error probabilities are to be estimated for systems that do not yield easily to analytical or numerical analyses. A review is contained in Smith, Shaft & Gao [69]. Some of the more challenging applications that occur in the communications context are to systems that are characterized by nonlinearities, memory, and signal fading. Nonlinearities in communication systems are encountered, for example, in front-end amplifiers of transponders of satellite channels, blind beamformers in adaptive arrays used in mobile cellular systems, and more fundamentally, in receiver processing elements such as decision feedback equalizers (used to combat intersymbol interference effects) and envelope detectors (for noncoherent detection). Signal fading, shadowing, and multipath transmission are always present in most mobile communication channels. These phenomena result in mathematical characterizations of the communication system that are difficult to deal with analytically and numerically, necessitating the use of fast simulation methods for performance analysis and system design. In this chapter we demonstrate the use of IS in digital communication systems analysis and design by means of a few representative examples. The number of applications that have been studied by researchers and the number of possible applications of fast simulation is very large, and all cannot be recorded here. These examples, however, are sufficient to illustrate the methods and procedures of adaptive IS algorithms and their role in dealing with complex communication systems.
8.1 Adaptive simulation The principle of adaptive IS is illustrated in the conceptual schematic shown in Figure 8.1. It is the general methodology for the application of IS, of which we have already seen applications in the last few chapters. For the purpose of exposition, it is shown here in relation to a fading multipath environment. Models of all noises, interferences, channel fluctuations, and random data streams are used to generate the appropriate environment for the communication system. Some or all of these random processes are first modified or
Fig. 8.1. Schematic of adaptive IS for communication systems.
transformed into biased processes before being input to the system. Biased processes are obtained by modifying the model density functions or selecting new ones. The response of the biased system is then subjected to an 'inverse' transformation which exactly compensates for the effects of biasing. The result is used to monitor and estimate system performance as well as IS related quantities. These estimates in turn drive adaptive biasing and system parameter optimization algorithms running simultaneously. The entire process is recursive and is continued until certain desired accuracy levels and performance specifications are met. The final output consists of a set of optimum system parameter specifications and associated performances. For other communication systems, only models of external inputs need to be changed. In particular research applications it may be desired only to evaluate the performance of a system described by a given or design set of parameters. In such cases the full generality of the adaptive simulation as described above need
Digital Communications 187
not be implemented. Only the adaptive biasing and performance estimation algorithms can be computationally mechanized and the system parameter algorithms dispensed with.
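The loop of Figure 8.1 can be reduced to a code skeleton in which every model is a placeholder function. All names below are illustrative; the toy instantiation at the end (translated-mean Gaussian biasing of the rare event {X > 4}) is an assumption for demonstration, not an example from the text.

```python
import math, random

# Skeleton of the adaptive IS loop: bias the input models, run the system,
# undo the bias through the weighting function, and feed estimates back
# into the biasing recursion.
def adaptive_is(run_system, bias_draw, weight, update_bias, theta0,
                rounds, k):
    theta, history = theta0, []
    for _ in range(rounds):
        est, batch = 0.0, []
        for _ in range(k):
            x = bias_draw(theta)                 # biased input process
            err = run_system(x)                  # rare-event indicator
            est += err * weight(x, theta) / k    # 'inverse' transformation
            batch.append((x, err))
        theta = update_bias(theta, batch)        # adaptive biasing recursion
        history.append(est)
    return theta, history

# Toy instantiation: P(X > 4) for X ~ N(0,1), mean-translation biasing.
rng = random.Random(8)
theta_hat, history = adaptive_is(
    run_system=lambda x: 1.0 if x > 4.0 else 0.0,
    bias_draw=lambda th: rng.gauss(th, 1.0),
    weight=lambda x, th: math.exp(-th * x + th * th / 2.0),  # f/f*
    update_bias=lambda th, batch: th,    # biasing held fixed in this toy
    theta0=4.0, rounds=3, k=4000)
# history[-1] is an IS estimate of P(X > 4), about 3.2e-5
```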
8.2 DPSK in AWGN

Consider binary differential phase shift keying (DPSK) detection in additive white Gaussian noise (AWGN), [14], as shown in Figure 8.2.

Fig. 8.2. Delay and multiply detection of DPSK.

The error probability of this receiver is exactly known and we describe its estimation using adaptive IS. The information bearing signal at the receiver input in the k-th bit interval is given by

x_c(t) = A_c p(t − kT) cos(ω_c t + φ + a_k π)
where A_c denotes the carrier amplitude, p(·) the transmitter pulse shape, T the bit duration, ω_c the carrier angular frequency, which is an integer multiple of 2π/T, and φ the unknown carrier phase. The variable a_k takes the values 0 or 1, depending on the k-th transmitted information bit b_k. That is, if b_k = 1 then a_k = a_{k−1}, otherwise a_k ≠ a_{k−1}. Using a bandpass representation for the additive noise n(t), the signals at the inputs of the multiplier are
x_c(t) + n(t) = [A_c p(t − kT) + n_i(t)] cos(ω_c t + φ + a_k π) − n_q(t) sin(ω_c t + φ + a_k π)

and

2[x_c(t − T) + n(t − T)] = 2[A_c p(t − kT) + n_i(t − T)] cos(ω_c t + φ + a_{k−1} π) − 2 n_q(t − T) sin(ω_c t + φ + a_{k−1} π)

where n_i(·) and n_q(·) are independent in-phase and quadrature Gaussian random variables with zero mean and variance σ². Therefore the signal Y at the input of the comparator at sampling instant t_k is
Y = { [A_c + n_i(t_k)][A_c + n_i(t_k − T)] + n_q(t_k) n_q(t_k − T) } cos(a_k − a_{k−1})π

where the random variables n_i(t_k), n_i(t_k − T), n_q(t_k), and n_q(t_k − T) are all mutually independent. It is assumed that the pulse p(·) is unity at each sampling instant. Owing to the problem symmetry, the error probability P_e is given by

P_e = P(Y < 0 | a_k = a_{k−1})

for which

Y = [A_c + n_i(t_k)][A_c + n_i(t_k − T)] + n_q(t_k) n_q(t_k − T)
Defining

α_i ≜ (1/2)[n_i(t_k) + n_i(t_k − T)],    α_q ≜ (1/2)[n_q(t_k) + n_q(t_k − T)]
β_i ≜ (1/2)[n_i(t_k) − n_i(t_k − T)],    β_q ≜ (1/2)[n_q(t_k) − n_q(t_k − T)]
α² ≜ (A_c + α_i)² + α_q²,                β² ≜ β_i² + β_q²

yields

Y = α² − β²
Note that ai, a q, {3i, and {3q are all zero-mean independent Gaussian variables with common variance 0- 2 /2. Further, a and {3 are independent, possessing Rice and Rayleigh distributions respectively. The error probability is easily shown to be
To estimate this using IS we rewrite Pe
P(a5,{3) E{l(a 5, (3)} E*{l(a 5, (3) W(a,{3)}
and refer to the vector depiction of the detection error event in Figure 8.3. The IS estimate of Pe is
P̂_e = (1/K) Σ 1(α ≤ β) W(α, β),    α, β ~ f*

with the weighting function W obtained by suitably biasing α and β. The g-method can also be used here to obtain a one-dimensional biasing problem.
Fig. 8.3. Random variables involved in DPSK detection.
Alternatively, we can consider biasing the four underlying Gaussian variables with the estimator

P̂_e = (1/K) Σ 1(α ≤ β) W(α_i, α_q, β_i, β_q),    α_i, α_q, β_i, β_q ~ f*
In general this will lead to a four-dimensional biasing vector. However, consideration of Figure 8.3 shows that it will be sufficiently effective to translate the mean of α_i negatively towards the origin of coordinates and scale up the variances of β_i and β_q equally to obtain an increase in the frequency of occurrence of the indicator. For this two-dimensional biasing problem, denoting by c and a respectively the translation and scaling parameters, the weighting function takes the form

W(α_i, β_i, β_q; c, a) = e^{−2α_i c + c²} a² e^{−(β_i² + β_q²)(1 − 1/a²)}

where we have set σ² = 1 for convenience. Using the notation of Example 2.10 on page 35, the I-function is

I(c, a) = E{ 1(α ≤ β) W(α_i, β_i, β_q; c, a) }
and the various derivatives of W are We = 2 (-ai Wa = 2
+ c) W(.)
[~_ {3; +3 (3;] a
a
W(.)
Wee
= 2 (1 + 2a~ - 4aic + 2c2 ) W(·)
Waa
= 2 [~_ {3; + {3; a2
a4
+ 2({3; + (3;)2] a6
W(.)
The adaptive recursion for optimizing the biasing vector is

(c_{m+1}, a_{m+1})ᵀ = (c_m, a_m)ᵀ − δ Ĵ_m⁻¹ ∇Î(c_m, a_m),    m = 1, 2, ...

where Ĵ_m⁻¹ is the inverse of the matrix Ĵ_m given by

Ĵ_m = [ Î_{c_m c_m}    Î_{c_m a_m} ]
      [ Î_{c_m a_m}    Î_{a_m a_m} ]

and the estimated second derivatives above are obtained from

Î_{c_m a_m} = (1/K) Σ 1(α ≤ β) W_{c_m a_m}(α_i, β_i, β_q; c_m, a_m) W(α_i, β_i, β_q; c_m, a_m)

and similarly for the other entries. The gradient vector ∇I(·, ·) is as defined on page 35 and is estimated in the usual manner. Results of implementation are shown in Figures 8.4-8.6. The translation and scaling algorithms are shown in Figures 8.4 and 8.5 respectively. A comparison of the exact P_e and its estimate is shown in Figure 8.6.
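As a numerical check of the scheme above, the following sketch estimates P_e with the translation-and-scaling weighting function W and can be compared with the exact value (1/2)e^{−A_c²/2} for σ² = 1. The values c = −1.4 and a = 2.0 are illustrative choices for A_c = 3, not the optimized values of Figures 8.4 and 8.5.

```python
import math, random

# IS estimate of the DPSK error probability P(alpha <= beta): translate the
# mean of alpha_i to c and scale the variances of beta_i, beta_q by a^2
# (sigma^2 = 1, so each underlying variable has variance 1/2).
def dpsk_is_estimate(a_c, c, a, k, seed=7):
    rng = random.Random(seed)
    sd = math.sqrt(0.5)
    total = 0.0
    for _ in range(k):
        ai = rng.gauss(c, sd)            # translated mean
        aq = rng.gauss(0.0, sd)          # left unbiased
        bi = rng.gauss(0.0, a * sd)      # scaled variance
        bq = rng.gauss(0.0, a * sd)
        beta2 = bi ** 2 + bq ** 2
        if (a_c + ai) ** 2 + aq ** 2 <= beta2:   # detection error event
            total += (math.exp(-2.0 * ai * c + c * c) * a * a
                      * math.exp(-beta2 * (1.0 - 1.0 / (a * a))))
    return total / k

p_exact = 0.5 * math.exp(-3.0 ** 2 / 2.0)        # (1/2) exp(-Ac^2/2), Ac = 3
p_hat = dpsk_is_estimate(a_c=3.0, c=-1.4, a=2.0, k=20000)
```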
8.3 Parameter optimization

We specialize the formulation of Section 3.3.1 on page 56 to the present context. Suppose p(τ) represents the error probability of a digital communication system, where τ denotes a demodulator detection threshold or some other system parameter, considered here for convenience as a scalar quantity. In a more general setting, it could represent the weight vector of an adaptive equalizer. If p̂(τ) is an IS estimate of p(τ), then an estimate of the optimum τ that minimizes p(τ) can be found using the recursion

τ_{m+1} = τ_m − δ p̂'(τ_m) / p̂''(τ_m),    m = 1, 2, ...        (8.1)

In some situations the interplay of random processes in the communication system is such that it may not be possible to express p̂(τ) as a differentiable function of τ. This happens, for example, when the g-method cannot be applied directly and the rare event (i.e. reception error) occurrence is expressible only as an indicator function 1(X > τ). In such cases we can endow
Fig. 8.4. Convergence of translation parameter for P_e = 10⁻⁶ and A_c = 5.12296.
Fig. 8.5. Convergence of scaling parameter for P_e = 10⁻⁶ and A_c = 5.12296.
192 Importance Sampling

Fig. 8.6. Probability of error P_e and its IS estimate (shown dotted) for DPSK, versus SNR (dB).
differentiability into an estimator by approximating the indicator 1(x > \tau) by

S_\tau(x) = \frac{1}{1 + e^{c(\tau - x)}}   (8.2)

for c > 0. Recall that this device was used in Chapter 7 for blind simulation of CFAR detectors. As c becomes larger, S_\tau approximates the indicator function more closely. The approximate estimator can then be used for the sole purpose of computing the gradients for (8.1).
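A small sketch of the smoothed indicator (8.2) applied to a standard Gaussian tail probability; the sharpness c = 50 and the N(0,1) test distribution are arbitrary illustration choices:

```python
import math, random

random.seed(1)

def S(x, tau, c):
    # smooth surrogate for the indicator 1(x > tau); -> indicator as c grows
    t = max(min(c * (tau - x), 50.0), -50.0)   # clip to avoid overflow
    return 1.0 / (1.0 + math.exp(t))

def smoothed_p(tau, c=50.0, K=200_000):
    # Monte Carlo estimate of E[S_tau(X)] for X ~ N(0,1)
    return sum(S(random.gauss(0.0, 1.0), tau, c) for _ in range(K)) / K

tau = 1.0
exact = 0.5 * math.erfc(tau / math.sqrt(2.0))  # P(X > tau)
approx = smoothed_p(tau)
```

Unlike the raw indicator, S_\tau(x) is differentiable in \tau, so the gradients needed in (8.1) can be computed from it while the unsmoothed estimator is retained for the probability itself.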
8.3.1 Noncoherent OOK in AWGN: threshold optimization

We study noncoherent on-off keying (OOK) in AWGN, [14], to illustrate parameter optimization in a communication system; see Figure 8.7. The signal at the input of the envelope detector is
following the same notations as in the previous section. The input to the comparator at the k-th sampling instant is

X_k = \sqrt{(a_k A_c + n_i)^2 + n_q^2}   (8.3)
Fig. 8.7. Noncoherent receiver for OOK: bandpass filter, envelope detector, sampler with symbol synchronization, threshold comparator, and decision device.
assuming that p(kT) = 1 for all k. Since the carrier frequency is much larger than the bit rate, the average signal power can be approximated as A_c^2/4 and the signal to noise ratio \gamma is

\gamma = \frac{A_c^2}{4\sigma^2}.

The sampled signal is Rayleigh distributed when a_k = 0 and Rice distributed when a_k = 1. With equally likely bits, the probability of error is given by

P_e = \frac{1}{2} e^{-\tau^2/2\sigma^2} + \frac{1}{2} \int_0^{\tau} \frac{y}{\sigma^2}\, e^{-(y^2 + A_c^2)/2\sigma^2}\, I_0\!\left(\frac{A_c y}{\sigma^2}\right) dy   (8.4)
where \tau is the decision threshold and I_0 is the modified Bessel function of the first kind and zeroth order. For large SNRs, \gamma \gg 1, the Ricean density can be approximated by a Gaussian density and the integral above evaluated analytically, [14]. The optimal value of the threshold \tau_0 that minimizes P_e can then be determined in closed form.
We now describe the estimation of P_e and determination of \tau_0 using IS. The two terms in (8.4) are estimated separately. When a_k = 0, we have from (8.3) that

X_k = \sqrt{n_i^2 + n_q^2}.   (8.5)
The variables n_i and n_q are zero mean independent Gaussian with variances \sigma^2. This signal sample can be viewed as the length of a two-dimensional vector, and the decision boundary is the circumference of a circle of radius \tau centered at the origin of coordinates. From symmetry it is clear that an appropriate means of biasing to make the conditional error event in (8.4) occur more frequently is to scale up the variances of n_i and n_q equally. Let a denote the scaling factor. Then, with \sigma^2 = 1, the weighting function becomes

W_0(n_i, n_q; a) = a^2 \exp\left(-\frac{n_i^2 + n_q^2}{2}\left(1 - \frac{1}{a^2}\right)\right)

with

\frac{\partial W_0(n_i, n_q; a)}{\partial a} = \left(\frac{2}{a} - \frac{n_i^2 + n_q^2}{a^3}\right) W_0(n_i, n_q; a)

and \partial^2 W_0(n_i, n_q; a)/\partial a^2 obtained by further differentiation.
When a_k = 1, the decision sample is

X_k = \sqrt{(A_c + n_i)^2 + n_q^2}.

In this case the mean of the variable n_i needs to be translated negatively. Although the variances of n_i and n_q can also be modified, it is simpler in implementation to use only translation for biasing. Denoting the biased mean of n_i as m, the weighting function takes the form

W_1(n_i, n_q; m) = e^{-(2 n_i m - m^2)/2}

with

\frac{\partial W_1(n_i, n_q; m)}{\partial m} = (m - n_i)\, W_1(n_i, n_q; m)

and

\frac{\partial^2 W_1(n_i, n_q; m)}{\partial m^2} = \left[1 + (m - n_i)^2\right] W_1(n_i, n_q; m).

These can be used in an adaptive IS implementation in the usual manner to estimate P_e. To optimize the threshold \tau we use the procedure described above. An implementation of this is then compared with one using numerical integration and minimization of P_e in (8.4). The details are straightforward and can be found in [51]. Results are shown in Figures 8.8 and 8.9. The difference between using IS and the Gaussian approximation is clearly seen from these figures.
8.4 Sum density of randomly phased sinusoids

Many digital communication systems are perturbed by interference that can be modelled as a sum of sinusoids with random phases. Mobile wireless systems often operate in interference dominated environments which can have a
Fig. 8.8. Optimizing OOK thresholds. Upper P_e \approx 2 \times 10^{-4} (SNR = 12 dB), lower \approx 1 \times 10^{-9} (SNR = 16 dB); IS, numerical, and Gaussian approximation results versus number of recursions.
Fig. 8.9. Optimizing OOK thresholds. Numerical (solid) and IS results, together with the Gaussian approximation, versus SNR (dB).
limiting effect on performance in terms of bit error rates and cellular capacities. In particular, co-channel interference (CCI) in such systems arises from frequency reuse in certain fixed patterns of geographic cells. The information bearing signal in a particular frequency cell is interfered with by signals arriving from surrounding cells that use the same frequency. These interfering signals appear with random phases in an additive manner, giving rise to CCI. Analysis reveals that crosstalk appears as the sum
V_n = a \sum_{i=1}^{n} \cos\phi_i

Convergence of the variance biasing parameter is shown in Figure 8.15 for an SIR of 7.615 dB, which corresponds to a P_e of 10^{-6}. The P_e estimates are shown in Figure 8.16.

Fig. 8.15. Convergence of variance parameter for SIR = 7.615 dB and P_e = 10^{-6}.

It is observed that for SIR values higher than 7.81 dB, the error probability becomes zero and no bit errors occur due to interference. This corresponds to a = 1/6. This simply means that for any a < 1/L, the maximum possible total vector length of L interferers cannot exceed the signal vector length. Thus there exists, for the interference dominated situation, a zero-error threshold of signal to interference ratio above which no bit errors are possible.
M-ary PSK. For transmitting M-ary symbols using PSK, \log_2 M information bits are encoded into each symbol waveform in terms of the phases of the carrier. The optimum receiver is equivalent to a phase detector that computes
Fig. 8.16. Error probability for BPSK with co-channel interference, versus SIR (dB).
the phase of the received signal vector and selects that symbol whose phase is closest. Assume that zero phase has been transmitted. From the signal space diagram of Figure 8.13, the phase \phi_r of the received vector is described by

\sin\phi_r = \frac{a \sum_{i=1}^{L} \sin\phi_i}{\sqrt{\left(a \sum_{i=1}^{L} \sin\phi_i\right)^2 + \left(1 + a \sum_{i=1}^{L} \cos\phi_i\right)^2}}   (8.10)

and

\cos\phi_r = \frac{1 + a \sum_{i=1}^{L} \cos\phi_i}{\sqrt{\left(a \sum_{i=1}^{L} \sin\phi_i\right)^2 + \left(1 + a \sum_{i=1}^{L} \cos\phi_i\right)^2}}.   (8.11)
A correct detection is made when \phi_r satisfies -\pi/M \le \phi_r \le \pi/M. Defining \bar{1}(\cdot) = 1 - 1(\cdot), the probability of a symbol error can be written as

P_e = 1 - P(-\pi/M \le \phi_r \le \pi/M) = E\{\bar{1}(-\pi/M \le \phi_r \le \pi/M)\}

and its IS estimate as

\hat{P}_e = \frac{1}{K} \sum_{k=1}^{K} \bar{1}(-\pi/M \le \phi_r \le \pi/M)\, W(\phi_1, \ldots, \phi_L).
The indicator for the complement of the event \{-\pi/M \le \phi_r \le \pi/M\} can easily be simulated by referring to the signal space diagram of Figure 8.17 and noting that the error region comprises the region

\left\{1 + a \sum_{i=1}^{L} \cos\phi_i \le 0\right\}

together with (in union) the intersection

\left\{1 + a \sum_{i=1}^{L} \cos\phi_i > 0\right\} \cap \Big\{\{\tan\phi_r \ge \tan(\pi/M)\} \cup \{\tan\phi_r \le -\tan(\pi/M)\}\Big\}.

Fig. 8.17. Decision region in M-PSK signal space and possible biasing for interference phase.
Further, we note that the distribution of the received phase \phi_r is symmetric around \phi_r = 0 by virtue of the fact that all the interference phases are independent and uniformly distributed. Hence it is sufficient to simulate events from only (the upper) half of the error region described above. This in turn can be described by considering the intersection of the complete error region and the set \{\sin\phi_r \ge 0\}. Consequently, an effective method of biasing is to generate all interference phases \phi_i such that their vectors are most likely to be aligned along the line marked O in Figure 8.17. This is evident from orthogonality and will produce an increase in the frequency of occurrence of errors. The biased interfering phases are once again Gaussian with their mean at \mu \equiv (\pi/2) + \pi/M and common variance \sigma_\phi^2, optimized through adaptive simulation.
The weighting function and its derivatives are almost identical to those for BPSK simulation. They are

W(\phi_1, \ldots, \phi_L; \sigma_\phi) =
\begin{cases}
\left(\dfrac{\sigma_\phi}{\sqrt{2\pi}}\right)^{L} \exp\left(\displaystyle\sum_{i=1}^{L} \frac{(\phi_i - \mu)^2}{2\sigma_\phi^2}\right), & 0 \le \phi_i \le 2\pi \\
0, & \text{elsewhere.}
\end{cases}
As in the case of BPSK, a threshold value a_T for a exists below which no errors occur. This can be found by combining (8.10) and (8.11) as

\tan\phi_r = \frac{a \sum_{i=1}^{L} \sin\phi_i}{1 + a \sum_{i=1}^{L} \cos\phi_i}

and substituting \phi_i = \mu. This yields

a_T = \frac{1}{L} \sin\frac{\pi}{M}.
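The zero-error threshold a_T can be checked numerically using the phase expression from (8.10)-(8.11); the values M = 4 and L = 6 below are assumed for the check:

```python
import numpy as np

def zero_error_threshold(L, M):
    # a_T = sin(pi/M) / L: below this, L interferers of relative amplitude a
    # can never rotate the received phase outside [-pi/M, pi/M]
    return np.sin(np.pi / M) / L

def received_phase(phi, a):
    # phase of the received vector, cf. (8.10)-(8.11)
    s = a * np.sin(phi).sum(axis=-1)
    c = 1 + a * np.cos(phi).sum(axis=-1)
    return np.arctan2(s, c)

M, L = 4, 6
aT = zero_error_threshold(L, M)

# random search just below threshold: the phase stays in the decision region
rng = np.random.default_rng(1)
phi = rng.uniform(0, 2*np.pi, size=(100_000, L))
worst_below = np.abs(received_phase(phi, 0.95 * aT)).max()

# just above threshold, phases set to mu = pi/2 + pi/M do produce an error
phi_mu = np.full(L, np.pi/2 + np.pi/M)
err_above = np.abs(received_phase(phi_mu, 1.05 * aT))
```

Below a_T the total interference vector has length at most aL < \sin(\pi/M), so the received phase cannot reach the decision boundary; just above a_T the aligned-phase configuration \phi_i = \mu crosses it.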
The signal energy per bit is S \equiv A^2/\log_2 M and the total interference energy is I \equiv L A^2 a^2. The corresponding SIR for L = 6 is given by

\mathrm{SIR} = \frac{1}{6 a^2 \log_2 M}.

Simulation results are shown in Figure 8.18.
8.5.2 Interference and AWGN

We consider here a more realistic model of the communication environment which consists of co-channel interference as well as additive Gaussian noise. Again, the cases of BPSK and M-ary PSK are treated separately.
Coherent BPSK. The signal space diagram in Figure 8.19 illustrates the situation with 2 interferers. For a transmitted +1, the demodulator output in the presence of additive noise is simply

A + n + \sum_{i=1}^{L} a A \cos\phi_i
Fig. 8.18. Symbol error probabilities for M-PSK with co-channel interference, versus SIR (dB).
Fig. 8.19. BPSK signal space with two co-channel interferers and AWGN.
where n is Gaussian and has zero mean with variance \sigma^2. The error probability can be written as

P_e = P\left(A + n + \sum_{i=1}^{L} a A \cos\phi_i < 0\right).

The interference phases are biased as before, and the mean of the noise n is translated to a value c. The second derivative \partial^2 W/\partial\sigma_\phi^2 is the same as in (8.9), while

\frac{\partial W}{\partial c} = \frac{(c - n)}{\sigma^2}\, W, \qquad \frac{\partial^2 W}{\partial c^2} = \frac{\sigma^2 + (c - n)^2}{\sigma^4}\, W.
The optimization of biasing is carried out through a two-dimensional parameter recursion. The signal to noise ratio is defined as SNR = A^2/2\sigma^2. Results are shown in Figures 8.20 and 8.21. The bit error probability is shown in Figure 8.20 as a function of SIR with SNR as parameter. For finite SNR an error floor exists as the SIR becomes large. The same result is shown in Figure 8.21 but with SIR as the parameter. For all SIR < 7.81 dB, which is the zero error threshold for CCI, an error floor exists as SNR \to \infty. Above this threshold value of SIR, there is no floor.
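The translation-biasing weighting function and its first two derivatives with respect to the biased mean c can be verified by finite differences; W here is the likelihood ratio for translating the mean of a N(0, \sigma^2) variable to c, and the numerical values of \sigma, n, and c are arbitrary check points:

```python
import math

sigma, n, c, h = 1.3, 0.7, 2.0, 1e-5   # arbitrary check point

def W(c):
    # weighting function for translating the mean of N(0, sigma^2) to c
    return math.exp((c*c - 2.0*n*c) / (2.0*sigma**2))

dW = (c - n) / sigma**2 * W(c)                      # claimed dW/dc
d2W = (sigma**2 + (c - n)**2) / sigma**4 * W(c)     # claimed d2W/dc2
dW_fd = (W(c + h) - W(c - h)) / (2.0*h)             # central differences
d2W_fd = (W(c + h) - 2.0*W(c) + W(c - h)) / h**2
```

Checks of this kind are cheap insurance when hand-derived weighting derivatives are fed into an adaptive recursion.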
Fig. 8.20. Error probabilities for BPSK with CCI and AWGN, versus SIR (dB). Parameter SNR.
Fig. 8.21. Error probabilities for BPSK with CCI and AWGN, versus SNR (dB). Parameter SIR.
Gaussian approximation. In much of the literature, CCI is modelled as being Gaussian. We examine the effect of this by calculating the bit error rate assuming that the interference can be replaced by a Gaussian noise source having the same total power, in addition of course to thermal noise. Defining a signal to interference and noise ratio SNIR as the ratio of signal power to total interference plus noise power, the error rate is approximated as

P_e \approx Q\left(\sqrt{2\,\mathrm{SNIR}}\right)

where Q(\cdot) is defined on page 18. This is shown in Figure 8.22 together with optimized IS estimates of P_e for comparison. For low SNRs, that is in noise dominated situations, the Gaussian approximation is close to the IS estimates. As the SNR increases, in the interference dominated situation, the approximation becomes increasingly worse. This illustrates the importance of making accurate simulation estimates of performance in interference limited environments in preference to using Gaussian approximations.
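The Gaussian tail function Q(\cdot) and the resulting approximation are computed directly from the complementary error function; the Q(\sqrt{2\,\mathrm{SNIR}}) form is the BPSK approximation used here:

```python
import math

def Q(x):
    # Gaussian tail probability Q(x) = P(N(0,1) > x) = erfc(x / sqrt(2)) / 2
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pe_gaussian_approx(snir):
    # BPSK error rate with CCI replaced by Gaussian noise of equal power
    return Q(math.sqrt(2.0 * snir))
```

Since the CCI sum is bounded while a Gaussian is not, this approximation cannot reproduce the zero-error threshold behaviour seen in the IS results.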
Fig. 8.22. Comparison of Gaussian approximations (GA) and IS simulation for BPSK, versus SIR (dB).
M-ary PSK. The signal space diagram for this case is in Figure 8.23 for a single interferer. The noise vector n has zero-mean Gaussian quadrature components n_i and n_q of equal variance. As in (8.10) and (8.11), the received phase is given by (8.12) and (8.13), which are the same expressions with the quadrature noise components added to the interference sums.

Fig. 8.23. M-PSK signal space with a single co-channel interferer and AWGN.

Biasing for the interference phases is as before to implement an IS estimator of P_e. As far as the noise n is concerned, increasing its variance will of course produce more errors. However this will lead to a more than two-dimensional optimization problem. A simple solution is to translate the means of the quadrature noise variables along the shortest line to the decision boundary, as is done for the interferers. Denoting the biased means of the noise components as c_i and c_q, it follows from Figure 8.23 that they should be related through the geometry of the decision boundary.
This results in a two-dimensional biasing problem involving the phase and noise biasing parameters.

P_{e,ON} = E\left\{\frac{1}{2}\operatorname{erfc}\left(\frac{-\tau_{ON} + \sum_{m=2}^{M}\Xi_m + \sum_{\substack{m,n=2\\ m>n}}^{M}\Xi_{m,n}}{\sqrt{2}\,\sigma_G}\right)\right\}

for which the IS estimator will be

\hat{P}_{e,ON} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{2}\operatorname{erfc}\left(\frac{-\tau_{ON} + \sum_{m=2}^{M}\Xi_m^{(k)} + \sum_{\substack{m,n=2\\ m>n}}^{M}\Xi_{m,n}^{(k)}}{\sqrt{2}\,\sigma_G}\right) W(\phi_2^{(k)}, \ldots, \phi_M^{(k)})
where f^* denotes the Gaussian phase biasing density with mean at \pi and variance \sigma_\phi^2. In a similar way, using (8.21), the OFF component of the error probability can be written as

P_{e,OFF} = P\left(n_c > \tau_{OFF} - \sum_{\substack{m,n=2\\ m>n}}^{M} \Xi_{m,n}\right) = E\left\{\frac{1}{2}\operatorname{erfc}\left(\frac{\tau_{OFF} - \sum_{\substack{m,n=2\\ m>n}}^{M} \Xi_{m,n}}{\sqrt{2}\,\sigma_G}\right)\right\}

with estimator

\hat{P}_{e,OFF} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{2}\operatorname{erfc}\left(\frac{\tau_{OFF} - \sum_{\substack{m,n=2\\ m>n}}^{M} \Xi_{m,n}^{(k)}}{\sqrt{2}\,\sigma_G}\right) W(\phi_2^{(k)}, \ldots, \phi_M^{(k)}).
The weighting function for both the ON and OFF cases has the same form

W(\phi_2, \ldots, \phi_M; \sigma_\phi) =
\begin{cases}
\left(\dfrac{\sigma_\phi}{\sqrt{2\pi}}\right)^{M-1} \exp\left(\displaystyle\sum_{m=2}^{M} \frac{(\phi_m - \pi)^2}{2\sigma_\phi^2}\right), & 0 \le \phi_m \le 2\pi \\
0, & \text{elsewhere}
\end{cases}
as was derived for the case of co-channel interference. Results of implementation are shown in Figure 8.29. They correspond to a 4-channel WDM router with receiver noise variance such that the SNR is 21.1 dB, infinite extinction ratio, and a symmetric threshold setting of \tau = E^2/4. At very low crosstalk levels the Gaussian approximation coincides with the IS results, which is the error probability floor due to the presence of AWGN. At more practical crosstalk levels of -30 to -25 dB, the looseness of the Chernoff bound is evident. The implication of this is that the network designer can employ optical cross-connects with almost twice the crosstalk levels predicted by the approximation methods for bit error rate performance.
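The conditional-erfc (g-method) structure used in the ON/OFF estimators above can be sketched on a toy crosstalk model; the cosine beat terms, amplitudes, noise level, and threshold below are invented stand-ins, and no phase biasing is applied since the event is deliberately made non-rare for a Monte Carlo cross-check:

```python
import math, random

random.seed(3)
sigma, tau, beta, Mx = 0.05, 0.1, 0.03, 4   # invented toy parameters

def crosstalk():
    # stand-in for the random-phase crosstalk beat terms
    return sum(beta * math.cos(random.uniform(0.0, 2.0*math.pi))
               for _ in range(Mx))

K = 100_000
# g-method: condition on the crosstalk, average the exact Gaussian tail
pe_g = sum(0.5 * math.erfc((tau - crosstalk()) / (math.sqrt(2.0)*sigma))
           for _ in range(K)) / K
# crude Monte Carlo on the same error event, for comparison
pe_mc = sum(1.0 for _ in range(K)
            if crosstalk() + random.gauss(0.0, sigma) > tau) / K
```

Replacing the noise indicator by its exact conditional expectation removes the Gaussian sampling noise entirely, which is why the g-method estimators in this section average erfc terms rather than bit errors.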
Threshold optimization. The symmetric threshold setting of \tau = E^2/4 is not optimum for detection performance since the probability distributions of the two decision statistics for the ON and OFF cases are very different. Error rate performance can be improved if an optimized threshold is used. This can be done using the procedure outlined in Section 8.3. The derivatives of \hat{P}_e are obtained through the derivatives of the g-functions in (8.22) and (8.23), for the ON and OFF cases respectively.

Fig. 8.29. Comparing the GA, Chernoff bound, and adaptive IS, versus XSR (dB).

The dependence of the error probability performance on the threshold setting is shown in Figure 8.30. The performance using optimized thresholds is shown along with that for a symmetric threshold. Optimum thresholds are estimated at three XSR levels of -20, -25, and -30 dB. Sharp bends in the performance graphs are observed around the XSR value for which the threshold is optimized. At an XSR of -30 dB, the optimum threshold is 0.87 E^2/4. In comparison with the approximations, the error probability is lower in the vicinity of the XSR value for which the threshold is optimized. This is of course logical and a direct consequence of the optimization. However, for lower XSR values the error probability floor is higher because the optimized value of the threshold in the presence of crosstalk is no longer optimal when AWGN is dominant. The influence of threshold optimization
Fig. 8.30. Effects of threshold optimization.

is quite significant. The tolerable crosstalk level increases by a further 3 dB for a wide range of XSR values, and by about 7 dB when compared with the Gaussian approximation. Hence, assumption of a symmetric threshold yields pessimistic results for error probability. Simulations show that the threshold which is optimum for a particular level of crosstalk and receiver noise is close to optimum for a range of receiver noise powers. This is shown in Figure 8.31. The threshold is optimized for XSR = -30 dB and SNR = 21.1 dB. For this threshold, error probabilities are shown for SNR values from 19.5 dB to 22.2 dB. The bends in the curve around the same XSR level indicate that the threshold is nearly optimal over the SNR range considered. These fast simulation results can be used to estimate the impact on system performance of the number of wavelength channels used. This is an important parameter in the design of WDM networks. Shown in Figure 8.32 is the power penalty due to introduction of an additional channel. The results are for XSR = -25 dB with the threshold optimized for SNR = 22 dB for both 4 and 5 wavelength channels. At an error rate of 10^{-9}, an SNR increase of 1 dB is required. Additional details of simulation results can be found in [51].
Fig. 8.31. Dependence of optimum threshold on SNR. Error probabilities versus XSR (dB) for SNR = 19.5, 20.2, and 21.1 dB.
Fig. 8.32. Impact on performance of the number of WDM channels.
8.7 Multiuser detection

In this section we describe briefly a multiuser detector (MUD) situation in code division multiple access (CDMA) channels, and how its error probability performance can be estimated using IS. Although the case considered can be analysed numerically with comparable ease, the procedure described can be extended to more general situations that involve more realistic channels and interference models. The blind adaptive MUD in AWGN is considered here. Following the development in [31] and [89], the L-user CDMA white Gaussian noise channel output, for an M-length bit sequence, has the baseband representation

y(t) = \sum_{i=-M}^{M} \sum_{l=1}^{L} A_l b_l(i)\, s_l(t - iT - \tau_l) + n(t)

where n(t) is zero-mean white Gaussian noise with power spectral density \sigma_{bit}^2, the data bits b_l(i) are independent, equally likely, and take the values -1 and +1 with duration T; A_l denotes the received amplitude of the l-th user with relative delay \tau_l, and s_l(t) is the unit energy direct sequence spread spectrum signature waveform of the l-th user, of duration T = N T_c, consisting of the N chips with chip pulse

p_{T_c}(t) = \begin{cases} 1, & 0 \le t \le T_c \\ 0, & \text{elsewhere} \end{cases}
where (a_1, \ldots, a_N) denotes the binary chip sequence. For synchronous CDMA we can set \tau_1 = \cdots = \tau_L. Assuming chip-matched filtering in the front end, the i-th bit waveform can be represented, using the vector forms

y(i) = [y(i,1) \ldots y(i,N)], \quad s_l = [s_l(1) \ldots s_l(N)], \quad n(i) = [n(i,1) \ldots n(i,N)],

as

y(i) = \sum_{l=1}^{L} A_l b_l(i)\, s_l + n(i).
Let the user of interest be user 1, l = 1. The single user matched filter correlates the above channel output with the signature sequence s_1 of user 1. A linear MUD correlates the output with a sequence c_1 that combats the multiple access interference and Gaussian noise. The sign of the output of the linear detector is then used to make the decision \hat{b}_1 about user 1's transmitted bit:

\hat{b}_1 = \operatorname{sgn}(\langle y, c_1 \rangle)   (8.24)

where \langle \cdot, \cdot \rangle denotes inner product. The error probability of such a detector is given by
P_e = \frac{1}{2^{L-1}} \sum_{b} Q\left(\frac{A_1 \langle s_1, c_1 \rangle + \sum_{l=2}^{L} A_l b_l \langle s_l, c_1 \rangle}{\sigma_{bit}\, \|c_1\|}\right)   (8.25)

with \|\cdot\| denoting Euclidean norm. It is shown in [31] that for any linear MUD, the sequence c_1 can be decomposed into two orthogonal components

c_1 = s_1 + x_1

and the sequence x_1 satisfies

\langle x_1, s_1 \rangle = 0.
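A minimal numerical sketch of the decision rule (8.24) with the canonical decomposition c_1 = s_1 + x_1 in a toy synchronous two-user channel; the signatures, amplitudes, and step size are invented values, and the blind stochastic-gradient update used to adapt x_1 is an assumption modelled on the adaptation in [31], not a quotation of it:

```python
import numpy as np

rng = np.random.default_rng(4)

# toy synchronous 2-user channel with N-chip signatures (all values invented)
N = 8
s1 = np.ones(N) / np.sqrt(N)
s2 = np.array([1.0, -1.0] * (N // 2)) / np.sqrt(N)
s2 = s2 + 0.3 * s1                   # correlate user 2 with user 1
s2 /= np.linalg.norm(s2)
A1, A2, sigma = 1.0, 2.0, 0.3
mu = 0.001                           # small step size for stability

# blind stochastic-gradient adaptation of x1 (update form assumed from [31]):
# x1 <- x1 - mu * Z * (y - Zmf * s1); the correction is orthogonal to s1
x1 = np.zeros(N)
for _ in range(50_000):
    b = rng.choice([-1.0, 1.0], size=2)
    y = A1*b[0]*s1 + A2*b[1]*s2 + sigma*rng.standard_normal(N)
    zmf = y @ s1                     # matched filter output
    z = y @ (s1 + x1)                # linear MUD statistic, as in (8.24)
    x1 -= mu * z * (y - zmf * s1)

c1 = s1 + x1
b1_hat = np.sign(y @ c1)             # decision rule (8.24) on the last bit
# interference leakage into the detector, before and after adaptation
leak_mf = abs(s2 @ s1)
leak_mud = abs(s2 @ c1)
```

The adapted c_1 suppresses the correlated interferer far below the matched filter's leakage while keeping a unit response to s_1, which is the behaviour exploited later when the converged value is used for IS estimation.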
Furthermore, if x_1 is chosen to minimize the mean output energy

\mathrm{MOE}(x_1) = E\{\langle y, s_1 + x_1 \rangle^2\},

then the resulting detector also minimizes the mean square error between the transmitted bit and its estimate, as the two measures differ by only a constant. The mean output energy is convex over the set of vectors that are orthogonal to s_1. Therefore a gradient descent algorithm can be used to determine the x_1 that minimizes it for any initial condition. A stochastic gradient adaptation that can be used to find the optimum x_1 is given by

x_1(i) = x_1(i-1) - \mu\, Z(i) \left[ y(i) - Z_{mf}(i)\, s_1 \right]   (8.26)

where
Z(i) = \langle y(i), s_1 + x_1(i-1) \rangle   (8.27)
is the output of the linear MUD used in (8.24) for decisioning, and
Z_{mf}(i) = \langle y(i), s_1 \rangle

is the output of a conventional single user matched filter detector. A condition on the step size \mu to ensure stability is stated in [31] as
\mu < \frac{2}{A_{\max}^2 + \sigma_{bit}^2}
where A_{\max} is the largest amplitude among all users. To estimate P_e in (8.25) using IS, we can write from symmetry
P_e = P(\hat{b}_1(i) = -1 \mid b_1(i) = 1)
    = \sum_{b_2 \in \{-1,+1\}} \cdots \sum_{b_L \in \{-1,+1\}} P(\hat{b}_1(i) = -1,\, b \mid b_1(i) = 1)
    = \frac{1}{2^{L-1}} \sum_{b_2 \in \{-1,+1\}} \cdots \sum_{b_L \in \{-1,+1\}} P(\hat{b}_1(i) = -1 \mid b_1(i) = 1, b)   (8.28)
where b = (b_2 \ldots b_L) represents the sequence of data bits of all other users. Using (8.24) and (8.27), this conditional probability can be expressed in terms of the MUD output as

P(\hat{b}_1(i) = -1 \mid b_1(i) = 1, b) = P(Z(i) \le 0 \mid b_1(i) = 1, b) = E\{1(Z(i) \le 0) \mid b_1(i) = 1, b\}.

The IS estimate of this probability is given by

\hat{P}(\hat{b}_1(i) = -1 \mid b_1(i) = 1, b) = \frac{1}{K} \sum_{k=1}^{K} 1(Z(i) \le 0 \mid b_1(i) = 1, b)\, W(Z(i)), \qquad Z(i) \sim f^*   (8.29)
with a slight unavoidable abuse of notation for denoting the indicator function. With Z(i) given by (8.27), this estimate will depend on the recursion (or bit) index i of the algorithm (8.26). It is of course much simpler computationally, and more meaningful practically, to estimate P_e for a fixed value of the sequence x_1(i). For this we use the converged value of (8.26), denoted as x_{1,opt}. Defining

c_{1,opt} \equiv s_1 + x_{1,opt}

and

T(1, b) \equiv \left.\sum_{l=1}^{L} A_l b_l(i) \langle s_l, c_{1,opt} \rangle\right|_{b_1(i)=1} = \left.\sum_{l=1}^{L} A_l b_l(i) \sum_{j=1}^{N} s_l(j)\, c_{1,opt}(j)\right|_{b_1(i)=1},
the MUD output in (8.27) can be written as

Z(i) = T(1, b) + \langle n(i), c_{1,opt} \rangle = T(1, b) + \sum_{j=1}^{N} c_{1,opt}(j)\, n(i, j) \equiv T(1, b) + \xi(i)
which renders the decision statistic into separate noise and channel output terms. The channel output T consists of user 1's signal component and the multiple access interference from other users. For a properly converged value of x_{1,opt}, the sequence c_{1,opt} will suppress the interference so that the dominant contribution to T will be from user 1. The Gaussian noise term \xi(i) has zero mean and variance

\operatorname{var}\xi(i) = \sigma_{bit}^2 \sum_{j=1}^{N} c_{1,opt}^2(j).

It is clear from the results of Example 2.4 on page 18 that an effective method of biasing to implement the estimator of (8.29) is a simple translation of the density of \xi(i) by an amount \sqrt{T^2(1, b) + \operatorname{var}