Uniformly Efficient Importance Sampling for the Tail Distribution of Sums of Random Variables

Uniformly Efficient Importance Sampling for the Tail Distribution of Sums of Random Variables

P. Glasserman (Columbia University) and S.K. Juneja (Tata Institute of Fundamental Research)

February 2006

Abstract

Successful efficient rare event simulation typically involves using importance sampling tailored to a specific rare event. However, in applications one may be interested in simultaneous estimation of many probabilities or even an entire distribution. In this paper, we address this issue in a simple but fundamental setting. Specifically, we consider the problem of efficient estimation of the probabilities $P(S_n \ge na)$ for large $n$, for all $a$ lying in an interval $A$, where $S_n$ denotes the sum of $n$ independent, identically distributed light-tailed random variables. Importance sampling based on exponential twisting is known to produce asymptotically efficient estimates when $A$ contains a single point. We show, however, that this procedure fails to be asymptotically efficient throughout $A$ when $A$ contains more than one point. We analyze the best performance that can be achieved using a discrete mixture of exponentially twisted distributions and then present a method using a continuous mixture. We show that a continuous mixture of exponentially twisted probabilities and a discrete mixture with a sufficiently large number of components produce asymptotically efficient estimates for all $a \in A$ simultaneously.

1 Introduction

The use of importance sampling for efficient rare event simulation has been studied extensively (see, e.g., Bucklew [3], Heidelberger [7], and Juneja and Shahabuddin [8] for surveys). Application areas for rare event simulation include communications networks where information loss and large delays are important rare events of interest (as in Chang et al.
[4]), insurance risk where the probability of ruin is a critical performance measure (e.g., Asmussen [1]), credit risk models in finance where large losses are a primary concern (e.g., Glasserman and Li [6]), and reliability systems where system failure is a rare event of utmost importance (e.g., Shahabuddin [11]). Successful applications of importance sampling for rare event simulation typically focus on the probability of a single rare event. As a way of demonstrating the effectiveness of an importance sampling technique, the probability of interest is often embedded in a sequence of probabilities decreasing to zero. The importance sampling technique is said to be asymptotically efficient or asymptotically optimal if the second moment of the associated estimator decreases at the fastest possible rate as the sequence of probabilities approaches zero.

Importance sampling based on exponential twisting produces asymptotically efficient estimates of rare event probabilities in a wide range of problems. As in the setting we consider here, asymptotic efficiency typically requires that the twisting parameter be tailored to a specific event. But in applications, one is often interested in estimating many probabilities or an entire distribution; and estimating a distribution is often a step toward estimating functions of the distribution, such as quantiles or tail conditional expectations. These problems motivate our investigation. We consider the simple but fundamental setting of tail probabilities associated with a random walk. Exponential twisting produces asymptotically efficient estimates for a single point in the tail; we show, however, that the standard approach fails to produce asymptotically efficient estimates for multiple points simultaneously. We develop and analyze modifications that rectify this. The relatively simple context we consider here provides a convenient setting in which to identify problems and solutions that may apply more generally.

Specifically, we consider the random walk $S_n = \sum_{i=1}^n X_i$, where $(X_i : i \ge 1)$ are independent, identically distributed (iid) random variables with mean $\mu$. We focus on the problem of simultaneous efficient estimation by simulation of the probabilities $P(S_n/n \ge a)$ for large $n$, for all $a > \mu$ lying in an interval $A$. Certain regularity conditions are imposed on this interval. We refer to this problem as that of efficient estimation of the tail distribution curve. A standard technique for estimating $P(S_n/n \ge a)$, $a > \mu$, uses importance sampling by applying an exponential twist to the $X_i$. It is well-known that if the twisting parameter is correctly tailored to $a$, this method produces asymptotically efficient estimates. (See, e.g., Sadowsky and Bucklew [10].)
However, we show in Section 2.3 that the exponentially twisted distribution that achieves asymptotic efficiency in estimating $P(S_n/n \ge a_1)$ fails to be asymptotically efficient in estimating $P(S_n/n \ge a_2)$ for $a_2 \ne a_1$. In particular, it incurs an exponentially increasing computational penalty as $n \to \infty$. This motivates the need for the alternative procedures that we develop in this paper. Within the class of exponentially twisted distributions, we first identify the one that estimates the tail distribution curve with asymptotically minimal computational overhead. We then extend this analysis to find the best mixture of $k < \infty$ exponentially twisted distributions. However, even with this distribution asymptotic efficiency is not achieved. We note that this shortcoming may be remedied if $k$ is selected to be an increasing function of $n$. We further propose an importance sampling distribution that is a continuous mixture of exponentially twisted distributions. We show that such a mixture also estimates the tail probability distribution curve asymptotically efficiently for all $a \in A$ simultaneously. Furthermore, we identify the mixing probability density function that ensures that all points along the curve are estimated with roughly equal precision.

The rest of this paper is organized as follows: In Section 2 we review the basics of importance sampling and introduce some notions of efficiency relevant to our analysis. In Section 3, we discuss the performance of importance sampling in simultaneously estimating the tail probability distribution curve using discrete mixtures of appropriate exponentially twisted distributions. The analysis is then extended to continuous mixtures in Section 4. Concluding remarks are given in Section 5.

2 Naive Estimation and Importance Sampling

2.1 Naive Estimation of the Tail Distribution Curve

Under naive simulation, $m$ iid samples $((X_{i1}, X_{i2}, \ldots, X_{in}) : 1 \le i \le m)$ of $(X_1, X_2, \ldots, X_n)$ are generated using the original probability measure $P$. Let $S_n^i = \sum_{j=1}^n X_{ij}$. Then, for each $a \in A$,
\[ \frac{1}{m} \sum_{i=1}^{m} I(S_n^i \ge na) \]
provides an unbiased estimator of $P(S_n/n \ge a)$, where $I(\cdot)$ denotes the indicator function of the event in parentheses.

2.2 Importance Sampling

We restrict attention to probability measures $P^*$ for which $P$ is absolutely continuous with respect to $P^*$ when both measures are restricted to the $\sigma$-algebra generated by $(X_1, X_2, \ldots, X_n)$. Let $L$ denote the likelihood ratio of the restriction of $P$ to the restriction of $P^*$. Then,
\[ P(S_n/n \ge a) = E_{P^*}[L\, I(S_n/n \ge a)] \quad (1) \]
for each $a \in A$, where the subscript affixed to $E$ denotes the probability measure used in determining the expectation. If, under $P$ and $P^*$, $(X_1, X_2, \ldots, X_n)$ have joint densities $f$ and $f^*$, respectively, then
\[ L = \frac{f(X_1, X_2, \ldots, X_n)}{f^*(X_1, X_2, \ldots, X_n)}. \]
Clearly, $L$ depends on $n$, though we do not make this dependence explicit in our notation. Under importance sampling with probability $P^*$, $m$ iid samples $((X_{i1}, X_{i2}, \ldots, X_{in}) : 1 \le i \le m)$ of $(X_1, X_2, \ldots, X_n)$ are generated using $P^*$. Let $(L_i : 1 \le i \le m)$ denote the associated samples of $L$. For each $a \in A$, we compute the estimate
\[ \frac{1}{m} \sum_{i=1}^{m} L_i\, I(S_n^i \ge na). \quad (2) \]
In light of (1), (2) provides an unbiased estimator of $P(S_n/n \ge a)$.

The probabilities $P(S_n/n \ge a)$ decrease exponentially in $n$ if the $X_i$ are light-tailed. More precisely, if the moment generating function of the $X_i$ is finite in a neighborhood of the origin, then (as in, e.g., [5], p. 34)
\[ \lim_n \frac{1}{n} \log P(S_n/n \ge a) = -\Lambda^*(a), \quad (3) \]
where $\Lambda^*$ denotes the large deviations rate function associated with the sequence $(S_n/n : n \ge 1)$, to be defined in Section 2.3, and $a$ belongs to the interior of $\{x : \Lambda^*(x) < \infty\}$. Also note that for any $P^*$,
\[ E_{P^*}[L^2 I(S_n/n \ge a)] \ge (E_{P^*}[L\, I(S_n/n \ge a)])^2 = P(S_n/n \ge a)^2 \]
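To make the setup of Section 2.1 concrete, here is a minimal sketch of the naive estimator (not from the paper). It assumes standard Gaussian $X_i$ purely for illustration; the function name and parameters are our own.

```python
import numpy as np

def naive_tail_estimates(n, m, a_grid, rng=None):
    """Naive Monte Carlo estimate of P(S_n/n >= a) for each a in a_grid.

    Illustrative sketch: X_i are taken standard Gaussian; any sampler
    for the X_i could be substituted.
    """
    rng = np.random.default_rng(rng)
    # m iid copies of the sample mean S_n / n
    means = rng.standard_normal((m, n)).mean(axis=1)
    # one indicator average per threshold a (the tail distribution curve)
    return np.array([(means >= a).mean() for a in a_grid])

estimates = naive_tail_estimates(n=50, m=10000, a_grid=[0.1, 0.2, 0.3])
```

Because the events $\{S_n/n \ge a\}$ are nested in $a$, the estimated curve is automatically non-increasing; the difficulty the paper addresses is that for large $n$ the naive indicator average requires a sample size growing exponentially in $n$ to resolve the tiny probabilities.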

and hence
\[ \liminf_n \frac{1}{n} \log E_{P^*}[L^2 I(S_n/n \ge a)] \ge -2\Lambda^*(a). \quad (4) \]
An importance sampling measure $P^*$ is said to be asymptotically efficient for $P(S_n/n \ge a)$ if equality holds in (4) with the liminf replaced by an ordinary limit. Furthermore, we say that the relative variance of the estimate (under a probability measure $P^*$) grows polynomially at rate $p > 0$ if
\[ 0 < \liminf_n \frac{1}{n^p} \frac{E_{P^*}[L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \le \limsup_n \frac{1}{n^p} \frac{E_{P^*}[L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} < \infty. \]
It is easily seen that if this holds for some $p > 0$, then $P^*$ is asymptotically efficient for $P(S_n/n \ge a)$. The relative variance of the estimate grows exponentially if the ratio
\[ \frac{E_{P^*}[L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \]
grows at least at an exponential rate with $n$. We say that the relative variance of the estimate of the tail distribution curve over $A$ grows polynomially at rate $p > 0$ if
\[ 0 < \sup_{a \in A} \liminf_n \frac{1}{n^p} \frac{E_{P^*}[L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \le \sup_{a \in A} \limsup_n \frac{1}{n^p} \frac{E_{P^*}[L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} < \infty. \]
It is said to grow at most at the polynomial rate $p > 0$ if the last inequality holds.

2.3 Asymptotically Efficient Exponentially Twisted Distribution

We first review the asymptotically efficient exponentially twisted importance sampling distribution for estimating a single probability $P(S_n/n \ge a)$, $a > \mu$. We also show that under this measure the relative variance of the estimate grows polynomially at rate $1/2$. Some notation and assumptions are needed for this. Let $X_i$ have distribution function $F$ and log-moment generating function
\[ \Lambda(\theta) = \log \int \exp(\theta x)\, dF(x) \]
under the original probability measure $P$. We assume that the domain $\Theta = \{\theta : \Lambda(\theta) < \infty\}$ includes zero in its interior $\Theta^o$. Further, we assume that $A \subset H^o$, where $H = \{\Lambda'(\theta) : \theta \in \Theta\}$. Let $F_\theta$ denote the distribution function obtained by exponentially twisting $F$ by $\theta \in \Theta$, i.e.,
\[ dF_\theta(x) = \exp(\theta x - \Lambda(\theta))\, dF(x). \]

Let $P_\theta$ denote the probability measure under which $(X_i : i \ge 1)$ are iid with distribution $F_\theta$. The restrictions of $P$ and $P_\theta$ to the $\sigma$-algebra generated by $X_1, \ldots, X_n$ are mutually absolutely continuous. Let $L_\theta$ denote the Radon-Nikodym derivative of the restriction of $P$ to the restriction of $P_\theta$. Then,
\[ L_\theta = \exp(-\theta S_n + n\Lambda(\theta)) \quad \text{a.s.} \quad (5) \]
We recall some definitions associated with the large deviations rate of $S_n$; see Dembo and Zeitouni [5] for background. The rate function for $(S_n/n : n \ge 1)$ is defined by
\[ \Lambda^*(a) = \sup_\theta (\theta a - \Lambda(\theta)). \]
For each $a \in H^o$, there is, by definition, a $\theta(a) \in \Theta$ for which $\Lambda'(\theta(a)) = a$. It is easy to see that
\[ \theta(a)a - \Lambda(\theta(a)) = \sup_\theta (\theta a - \Lambda(\theta)) = \Lambda^*(a). \quad (6) \]
Moreover, through differentiation, it can be seen that $\Lambda^{*\prime}(a) = \theta(a)$. By differentiating the equation $\Lambda'(\theta(a)) = a$, it follows that $\theta'(a) = 1/\Lambda''(\theta(a)) = \Lambda^{*\prime\prime}(a)$; as in Section 2.2 of Dembo and Zeitouni [5], we have $\Lambda''(\theta) > 0$ when $\Lambda'(\theta) \in H^o$. (Furthermore, they note that $\Lambda(\theta)$ is $C^\infty$ in $\Theta^o$ and $\Lambda^*(x)$ is $C^\infty$ in $H^o$.)

It is well known (as in Sadowsky and Bucklew [10]) that $P_{\theta(a)}$ asymptotically efficiently estimates $P(S_n/n \ge a)$, i.e.,
\[ \lim_n \frac{\log E_{P_{\theta(a)}}[L_{\theta(a)}^2 I(S_n/n \ge a)]}{\log P(S_n/n \ge a)} = 2. \quad (7) \]
Theorem 1 sharpens this result and also shows what happens when we use a twisting parameter $\theta$ different from $\theta(a)$ in estimating $P(S_n/n \ge a)$. The theorem applies the Bahadur-Rao [2] approximation
\[ P(S_n/n \ge a) \sim \frac{\exp[-n\Lambda^*(a)]}{\sqrt{2\pi\Lambda''(\theta(a))n}\,\theta(a)}, \quad (8) \]
which is valid when the $X_i$ have a non-lattice distribution. In this paper we restrict our discussion to $X_i$ having a non-lattice distribution.

Theorem 1 For $a > \mu$ and $t$ in $H^o$,
\[ E_{P_{\theta(t)}}[L_{\theta(t)}^2 I(S_n/n \ge a)] \sim \frac{\exp[n(\theta(t)(t-a) - \Lambda^*(t) - \Lambda^*(a))]}{\sqrt{2\pi\Lambda''(\theta(a))n}\,(\theta(t) + \theta(a))}. \quad (9) \]
Thus, with $t = a$ the relative variance grows polynomially at rate $1/2$, but with $t \ne a$ the relative variance grows exponentially and $P_{\theta(t)}$ fails to be asymptotically efficient for $P(S_n/n \ge a)$.
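As an illustration of the single-point twist reviewed above (a sketch of ours, not the authors' code), consider standard Gaussian $X_i$, for which $\Lambda(\theta) = \theta^2/2$, $\theta(a) = a$, and the twisted distribution $F_{\theta(a)}$ is simply $N(a, 1)$:

```python
import numpy as np

def twisted_tail_estimate(n, m, a, rng=None):
    """Importance sampling estimate of P(S_n/n >= a) via the twist theta(a).

    Sketch for standard Gaussian X_i: Lambda(theta) = theta^2/2, so
    theta(a) = a and sampling under P_theta(a) means X_i ~ N(a, 1).
    """
    rng = np.random.default_rng(rng)
    theta = a                                  # theta(a) = Lambda*'(a)
    x = rng.normal(loc=a, scale=1.0, size=(m, n))   # draw under P_theta(a)
    s = x.sum(axis=1)
    # likelihood ratio (5): L_theta = exp(-theta*S_n + n*Lambda(theta))
    lr = np.exp(-theta * s + n * theta**2 / 2)
    return (lr * (s >= n * a)).mean()

p_hat = twisted_tail_estimate(n=100, m=5000, a=0.5)
```

For $n = 100$ and $a = 0.5$ the target probability is $P(Z \ge 5) \approx 2.9 \times 10^{-7}$, far below what naive simulation with $m = 5000$ could resolve, while the twisted estimator recovers it with modest relative error, consistent with the rate-$1/2$ growth in Theorem 1.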

Proof of Theorem 1: Once we establish (9), the polynomial rate of growth of the relative variance when $t = a$ follows from dividing (9) by the square of (8). When $t \ne a$, the strict convexity of $\Lambda^*$ (see Dembo and Zeitouni [5], Section 2.2) implies that
\[ \Lambda^*(a) > \Lambda^*(t) + (a - t)\Lambda^{*\prime}(t). \]
Since $\theta(t) = \Lambda^{*\prime}(t)$, dividing (9) by the square of (8) shows that the relative variance grows exponentially. To establish (9), we use (5) and (6) to write
\[ L_{\theta(t)} = \exp[-\theta(t)(S_n - na) + n\theta(t)(t - a) - n\Lambda^*(t)]. \quad (10) \]
Also note that
\[ E_{P_{\theta(t)}}[L_{\theta(t)}^2 I(S_n/n \ge a)] = E_P[L_{\theta(t)} I(S_n/n \ge a)] = E_{P_{\theta(a)}}[L_{\theta(t)} L_{\theta(a)} I(S_n/n \ge a)]. \]
Using (5) to replace $L_{\theta(a)}$ and (10) to replace $L_{\theta(t)}$, we get
\[ E_{P_{\theta(t)}}[L_{\theta(t)}^2 I(S_n/n \ge a)] = \exp[n(\theta(t)(t-a) - \Lambda^*(t) - \Lambda^*(a))]\, E_{P_{\theta(a)}}\big[\exp[-(\theta(t)+\theta(a))(S_n - na)]\, I(S_n/n \ge a)\big]. \]
Using exactly the same arguments as in Bahadur and Rao [2] or Dembo and Zeitouni [5], Section 3.7.4, it follows that
\[ E_{P_{\theta(a)}}\big[\exp[-(\theta(t)+\theta(a))(S_n - na)]\, I(S_n/n \ge a)\big] \sim \frac{1}{\sqrt{2\pi\Lambda''(\theta(a))n}\,(\theta(t)+\theta(a))}. \]

3 Finite Mixtures of Exponentially Twisted Distributions

We have seen in the previous section that a single exponential change of measure cannot estimate multiple points along the tail distribution curve with asymptotic efficiency. In this section, we identify the twisting parameter that is minimax optimal in the sense that it minimizes the worst case relative variance over an interval $A = [a_1, a_2] \subset H^o$, $a_1 > \mu$. We then consider an optimal mixture of $k < \infty$ exponentially twisted distributions. We show that if $k \to \infty$ as $n \to \infty$, then the tail distribution curve is asymptotically efficiently estimated over $A$. Furthermore, if $k$ is $\Theta(\sqrt{n/\log n})$, the relative variance of the tail probability distribution curve grows polynomially over $A$.

Some definitions are useful to our analysis. From Theorem 1, it follows that for $a \in H^o$ and $a > \mu$,
\[ \lim_n -\frac{1}{n} \log E_{P_{\theta(t)}}[L_{\theta(t)}^2 I(S_n/n \ge a)] = H(t, a), \]
where
\[ H(t, a) = \Lambda^*(a) + \Lambda^*(t) + \Lambda^{*\prime}(t)(a - t). \]
A non-negative function $f(x)$ is said to be $\Theta(g(x))$, where $g$ is another non-negative function, if there exist positive constants $K_1$ and $K_2$ such that $K_1 g(x) \le f(x) \le K_2 g(x)$ for all $x$ sufficiently large.

The asymptotic rate of the relative variance of the estimator under $P_{\theta(t)}$ may be defined as
\[ \lim_n \frac{1}{n} \log \left( \frac{E_{P_{\theta(t)}}[L_{\theta(t)}^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \right). \]
From Theorem 1, this equals
\[ J(t, a) = \Lambda^*(a) - (a - t)\Lambda^{*\prime}(t) - \Lambda^*(t) = 2\Lambda^*(a) - H(t, a). \]
This may be interpreted as the non-negative exponential penalty incurred for twisting with parameter $\theta(t)$ in estimating $P(S_n/n \ge a)$. It equals zero at $t = a$ and is positive otherwise. The penalty $J(t, a)$ has the following geometric interpretation: it equals the vertical distance between the rate function $\Lambda^*$ and the tangent drawn to this function at $t$, both evaluated at $a$.

3.1 The Minimax Optimal Twist

A formulation of the problem of minimizing the worst case asymptotic relative variance is to find $t \in H^o$ minimizing
\[ \sup_{a \in A} \lim_n \frac{1}{n} \log \left( \frac{E_{P_{\theta(t)}}[L_{\theta(t)}^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \right) = \sup_{a \in A} J(t, a). \quad (11) \]

Proposition 1 For $A = [a_1, a_2] \subset H^o$, $a_1 > \mu$, the asymptotic worst case relative variance (11) is minimized over $t$ by $t^*$:
\[ \Lambda^{*\prime}(t^*) = \frac{\Lambda^*(a_2) - \Lambda^*(a_1)}{a_2 - a_1}. \quad (12) \]

Proof: The optimization problem reduces to
\[ \inf_{t \in H^o} \sup_{a \in [a_1, a_2]} (\Lambda^*(a) - \Lambda^{*\prime}(t)(a - t) - \Lambda^*(t)). \]
Due to the convexity of $\Lambda^*$, for any $t$, $\sup_{a \in [a_1, a_2]}(\Lambda^*(a) - \Lambda^{*\prime}(t)(a - t) - \Lambda^*(t))$ is achieved at $a_1$ or $a_2$. Thus, the problem reduces to
\[ \inf_{t \in H^o} \max\{\Lambda^*(a_1) - \Lambda^{*\prime}(t)(a_1 - t) - \Lambda^*(t),\ \Lambda^*(a_2) - \Lambda^{*\prime}(t)(a_2 - t) - \Lambda^*(t)\}. \]
If $t < a_1$ then both functions inside the max are decreasing in $t$, so increasing $t$ reduces the maximum. If $t > a_2$ then both functions inside the max are increasing in $t$, so decreasing $t$ reduces the maximum. It therefore suffices to consider $t \in [a_1, a_2]$. At all such $t$, the first function is increasing and the second is decreasing, so the maximum is minimized where they are equal, which is the point $t^*$ in (12).
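Since $\Lambda^{*\prime}$ is strictly increasing on $H^o$, the characterization (12) can be solved by simple bisection. The following sketch (our own helper, not from the paper) does this; in the standard Gaussian case, $\Lambda^*(a) = a^2/2$ and $\Lambda^{*\prime}(t) = t$, so $t^*$ is just the midpoint of $[a_1, a_2]$.

```python
def minimax_twist(lstar_prime, a1, a2, lstar, tol=1e-10):
    """Solve Lambda*'(t*) = (Lambda*(a2) - Lambda*(a1)) / (a2 - a1).

    lstar / lstar_prime: the rate function Lambda* and its derivative
    (assumed strictly convex / strictly increasing on [a1, a2]).
    t* lies in [a1, a2] by the mean value theorem, so bisection suffices.
    """
    target = (lstar(a2) - lstar(a1)) / (a2 - a1)
    lo, hi = a1, a2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lstar_prime(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Standard Gaussian case: Lambda*(a) = a^2/2, so t* = (a1 + a2) / 2
t_star = minimax_twist(lambda t: t, 1.0, 2.0, lambda a: a * a / 2)
```

The same routine applies to any light-tailed model once $\Lambda^*$ and $\Lambda^{*\prime}$ are available in closed form or numerically.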

3.2 Finite Mixtures: Minimax Objective

We now consider a probability measure $\tilde P$ that is a non-negative mixture of $k$ exponentially twisted distributions, so that
\[ \tilde P(\cdot) = \sum_{i=1}^{k} p_i P_{\theta(t_i)}(\cdot), \]
where $p_i > 0$, $\sum_{i=1}^{k} p_i = 1$, $t_1 < t_2 < \cdots < t_k$, and each $t_i \in H^o$. Our objective is to find the optimal twisting parameters for a fixed $k$. We then examine the asymptotic relative variance as $k$ increases. Mixtures of exponentially twisted distributions have been used in many applications, including Sadowsky and Bucklew [10]. For a fixed $k$, we formulate the problem of selecting a mixture of $k$ exponentially twisted distributions to minimize the worst-case asymptotic relative variance over $A = [a_1, a_2] \subset H^o$ as follows:
\[ \inf_{(t_1, \ldots, t_k) \in H^o} \sup_{a \in A} \lim_n \frac{1}{n} \log \left( \frac{E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \right), \quad (13) \]
where $\tilde L$ is the likelihood ratio of $P$ with respect to $\tilde P$, given by
\[ \tilde L = \frac{1}{\sum_{i=1}^{k} p_i \exp[\theta(t_i) S_n - n\Lambda(\theta(t_i))]}, \]
and the existence of the limit in (13) is guaranteed by Lemma 1. Recall that $H(t, a) = \Lambda^*(a) + \Lambda^*(t) + \Lambda^{*\prime}(t)(a - t)$.

Lemma 1 For $(a, t_1, \ldots, t_k) \in H^o$, $a > \mu$, there exists a constant $c$ such that
\[ E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] \le c \exp\Big(-n \max_{i=1,\ldots,k} H(t_i, a)\Big), \quad (14) \]
for all sufficiently large $n$. Furthermore,
\[ \lim_n -\frac{1}{n} \log E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] = \max_{i=1,\ldots,k} H(t_i, a). \quad (15) \]

Recall that $J(t, a) = 2\Lambda^*(a) - H(t, a)$. In view of (15), our optimization problem reduces to
\[ \inf_{(t_1, \ldots, t_k) \in H^o} \sup_{a \in A} \min_{i=1,\ldots,k} J(t_i, a). \quad (16) \]

Proof of Lemma 1: We first show (14). We have
\[ E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] = E_P[\tilde L\, I(S_n/n \ge a)] \quad (17) \]
and
\[ \tilde L \le \min_{i=1,\ldots,k} (1/p_i) \exp[-\theta(t_i) S_n + n\Lambda(\theta(t_i))], \]
so
\[ E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] \le \min_{i=1,\ldots,k} (1/p_i) \exp[-\theta(t_i) na + n\Lambda(\theta(t_i))]\, P(S_n/n \ge a). \]

From (8) we get
\[ P(S_n/n \ge a) \le \text{constant} \cdot \exp[-n\Lambda^*(a)], \]
for all sufficiently large $n$. Recalling that $\theta(t_i) = \Lambda^{*\prime}(t_i)$ and $\Lambda^*(t) = \theta(t)t - \Lambda(\theta(t))$, we therefore get
\[ E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] \le \text{constant} \cdot \min_{i=1,\ldots,k} (1/p_i) \exp[-nH(t_i, a)], \quad (18) \]
for all sufficiently large $n$. This proves the upper bound in the lemma because the minimum is achieved by the largest exponent $H(t_i, a)$, $i = 1, \ldots, k$, for all sufficiently large $n$. To see (15), from (18) it follows that
\[ \limsup_n \frac{1}{n} \log E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] \le -\max_{i=1,\ldots,k} H(t_i, a). \]
To get a lower bound, choose $\epsilon > 0$ and write
\[ E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] = E_{P_{\theta(a+\epsilon)}}[\tilde L\, L_{\theta(a+\epsilon)} I(S_n/n \ge a)] \ge E_{P_{\theta(a+\epsilon)}}[\tilde L\, L_{\theta(a+\epsilon)} I(a + 2\epsilon \ge S_n/n \ge a)]. \]
This last expression is, in turn, bounded below by
\[ M_{n,\epsilon} = \exp[-\theta(a+\epsilon)(a+2\epsilon)n + n\Lambda(\theta(a+\epsilon))] \Big( \sum_{i=1}^{k} p_i \exp(\theta_i (a+2\epsilon) n - n\Lambda(\theta_i)) \Big)^{-1} P_{\theta(a+\epsilon)}(a \le S_n/n \le a + 2\epsilon), \]
where we have written $\theta_i$ for $\theta(t_i)$. Because the $X_i$ have mean $a + \epsilon$ under $P_{\theta(a+\epsilon)}$,
\[ P_{\theta(a+\epsilon)}(a \le S_n/n \le a + 2\epsilon) \to 1. \]
Thus,
\[ \liminf_n \frac{1}{n} \log M_{n,\epsilon} \ge -\theta(a+\epsilon)(a+2\epsilon) + \Lambda(\theta(a+\epsilon)) - \max_{i=1,\ldots,k} \{\theta_i(a+2\epsilon) - \Lambda(\theta_i)\}, \]
from which follows
\[ \liminf_n \frac{1}{n} \log E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] \ge -\theta(a+\epsilon)(a+2\epsilon) + \Lambda(\theta(a+\epsilon)) - \max_{i=1,\ldots,k} \{\theta_i(a+2\epsilon) - \Lambda(\theta_i)\}. \]
Because $\epsilon > 0$ is arbitrary, we also have
\[ \liminf_n \frac{1}{n} \log E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] \ge -\theta(a)a + \Lambda(\theta(a)) - \max_{i=1,\ldots,k} \{\theta_i a - \Lambda(\theta_i)\} = -\max_{i=1,\ldots,k} H(t_i, a). \]

3.3 Finite Mixtures: Optimal Parameters

We now turn to the solution of (16), the problem of finding the optimal twisting parameters. We first develop some simple results related to convex functions that are useful for this. Consider a function $f$ that is strictly convex on an interval that includes $[y, z]$ in its interior. Consider the problem of finding the maximum vertical distance between the graph of $f$ and the line joining the points $(y, f(y))$ and $(z, f(z))$; i.e.,
\[ g(y, z) = \max_{t \in [y, z]} \Big( f(z) - \frac{f(z) - f(y)}{z - y}(z - t) - f(t) \Big). \]
At the optimal point $t^*$, we have
\[ f'(t^*) = \frac{f(z) - f(y)}{z - y} \quad (19) \]
and
\[ g(y, z) = f(z) - (z - t^*) f'(t^*) - f(t^*). \quad (20) \]
By substituting for $f'(t^*)$, it is also easily seen that
\[ g(y, z) = f(y) + (t^* - y) f'(t^*) - f(t^*). \quad (21) \]
Also note that $g(y, z)$ is a continuous function of $(y, z)$; it increases in $z$ and decreases in $y$. The following lemma is useful in arriving at optimal parameters of finite mixtures. Its content is intuitively obvious: given a strictly convex function on an interval, the interval can be partitioned into $k$ subintervals such that the maximum distance between the function and its piecewise linear interpolation between the endpoints of the subintervals is the same over all subintervals.

Lemma 2 Suppose that $f$ is strictly convex on $[y, z]$. Then there exist points $y = b_1 < b_2 < \cdots < b_{k+1} = z$ satisfying
\[ g(b_i, b_{i+1}) = g(b_{i+1}, b_{i+2}), \quad i = 1, 2, \ldots, k - 1. \quad (22) \]

Note that whenever the existence of $(b_1, b_2, \ldots, b_{k+1})$ as in Lemma 2 is established, the common value of $g(b_i, b_{i+1})$ for $i = 1, 2, \ldots, k$ is determined by $b_{k+1}$ if $b_1$ is held fixed. In the proof, $J_k(b_{k+1})$ denotes this value.

Proof of Lemma 2: We prove this by induction. First consider $k = 2$. Here $g(b_1, b) - g(b, b_3)$ is a continuous increasing function of $b$. It is negative at $b = b_1$ and positive at $b = b_3$. Thus, there exists $b_2 \in (b_1, b_3)$ such that $g(b_1, b_2) = g(b_2, b_3) = J_2(b_3)$. Furthermore, as $b_3$ increases, $g(b_1, b) - g(b, b_3)$ decreases, so $b_2$ and hence $J_2(b_3)$ increase continuously with $b_3$.
To proceed by induction, assume that for each $b_k > y$, there exist points $y = b_1 < b_2 < \cdots < b_{k-1} < b_k$ such that $g(b_i, b_{i+1}) = g(b_{i+1}, b_{i+2}) = J_{k-1}(b_k)$ for $i = 1, 2, \ldots, k - 2$. Furthermore, each $b_i$ and $J_{k-1}(b_k)$ are increasing functions of $b_k$. Now consider the function $J_{k-1}(b) - g(b, b_{k+1})$ as a function of $b$. This function is negative at $b = b_1$, positive at $b = b_{k+1}$, and it increases continuously with $b$. So again there exists a value $b_k$ so that $J_{k-1}(b_k) = g(b_k, b_{k+1})$. Set this

value equal to $J_k(b_{k+1})$. Again, it may be similarly seen that $b_k$, and hence all $b_i$, $i = 2, 3, \ldots, k$, and $J_k(b_{k+1})$ increase with $b_{k+1}$.

Given the points $(b_1, b_2, \ldots, b_{k+1})$ as in Lemma 2, we may define $t_i$ by
\[ f'(t_i) = \frac{f(b_{i+1}) - f(b_i)}{b_{i+1} - b_i}, \quad i = 1, 2, \ldots, k. \quad (23) \]
Then, from (20) and (21), it is easy to see that (22) amounts to
\[ f(t_{i+1}) - f(t_i) = f'(t_i)[b_{i+1} - t_i] + f'(t_{i+1})[t_{i+1} - b_{i+1}], \quad i = 1, 2, \ldots, k - 1. \]
In Proposition 2, we use these observations and Lemma 2 to identify the solution of (16), i.e., the optimal twisting parameters.

Proposition 2 Suppose that $A = [a_1, a_2] \subset H^o$, $a_1 > \mu$. Let the points $a_1 = b_1 < b_2 < \cdots < b_{k+1} = a_2$ and $t_1, \ldots, t_k$ satisfy
\[ \Lambda^{*\prime}(t_i) = \frac{\Lambda^*(b_{i+1}) - \Lambda^*(b_i)}{b_{i+1} - b_i}, \quad i = 1, \ldots, k, \quad (24) \]
and
\[ \Lambda^*(t_{i+1}) - \Lambda^*(t_i) = \Lambda^{*\prime}(t_i)[b_{i+1} - t_i] + \Lambda^{*\prime}(t_{i+1})[t_{i+1} - b_{i+1}], \quad i = 1, \ldots, k - 1, \quad (25) \]
as in Lemma 2 and (22). These $t_1, \ldots, t_k$ solve (16).

Proof: From the strict convexity of $\Lambda^*$ in $H^o$, it follows that the tangent lines at $t_1, \ldots, t_k$ partition the interval $[a_1, a_2]$ into $k$ subintervals such that throughout the $i$th subinterval, $\Lambda^*$ is closer to the tangent line at $t_i$ than it is to the tangent line at any other $t_j$. The endpoints of the $i$th subinterval, $i = 2, \ldots, k - 1$, are the points at which the tangent line at $t_i$ crosses the tangent lines at $t_{i-1}$ and $t_{i+1}$. These are the points $b_2, \ldots, b_k$ in (25). The left endpoint of the first subinterval is simply $a_1 = b_1$ and the right endpoint of the $k$th subinterval is simply $a_2 = b_{k+1}$. Thus, for all $a \in [b_i, b_{i+1}]$,
\[ \min_{j=1,\ldots,k} J(t_j, a) = J(t_i, a). \]
Furthermore, each $J(t_i, \cdot)$ is a convex function, so
\[ \sup_{b_i \le a \le b_{i+1}} J(t_i, a) = \max\{J(t_i, b_i), J(t_i, b_{i+1})\}, \quad i = 1, \ldots, k. \]
Viewing $b_2, \ldots, b_k$ as functions of $t_1, \ldots, t_k$ (through (25)), the problem in (16) thus becomes one of minimizing
\[ \max\{J(t_1, b_1), J(t_1, b_2), J(t_2, b_2), J(t_2, b_3), \ldots, J(t_k, b_k), J(t_k, b_{k+1})\} \]
over $t_1, \ldots, t_k$.
In fact, (25) says that $J(t_i, b_{i+1}) = J(t_{i+1}, b_{i+1})$, $i = 1, \ldots, k - 1$, so we may simplify this to minimizing
\[ \max\{J(t_1, b_1), J(t_2, b_2), \ldots, J(t_k, b_k), J(t_k, b_{k+1})\}. \quad (26) \]

[Figure 1: Illustration of the points $t_i$ and $b_i$.]

Each $b_i$, $i = 2, \ldots, k$, is an increasing function of $t_{i-1}$ and $t_i$. Each $J(t_i, b_i) \equiv J(t_i, b_i(t_{i-1}, t_i))$ is easily seen to be an increasing function of $t_i$ and a decreasing function of $t_{i-1}$, $i = 2, \ldots, k$. Similarly, $J(t_1, b_1)$ is an increasing function of $t_1$ and $J(t_k, b_{k+1})$ is a decreasing function of $t_k$. It follows that if all values inside the max in (26) are equal, then $t_1, \ldots, t_k$ is optimal: no change in $t_1, \ldots, t_k$ can simultaneously reduce all values inside the max. All values in (26) are equal if $J(t_i, b_i) = J(t_{i+1}, b_{i+1})$, $i = 1, \ldots, k - 1$, and $J(t_k, b_k) = J(t_k, b_{k+1})$. Simple algebra now shows that this is equivalent to (24).

The proof of Proposition 2 leads to a simple algorithm for finding the optimal points $t_1, \ldots, t_k$ to use in a mixture of $k$ exponentially twisted distributions. The key observation is that it suffices to find the optimal value of (26), which (as noted in the proof) is attained when all values inside the maximum in (26) are equal. Start with a guess $\epsilon > 0$ for this value and set $b_1 = a_1$. Now proceed recursively, for $i = 1, \ldots, k$: given $b_i$, find $t_i$ by solving the equation
\[ \Lambda^*(b_i) - \Lambda^*(t_i) - \Lambda^{*\prime}(t_i)[b_i - t_i] = \epsilon, \]
and then find $b_{i+1}$ by solving
\[ \Lambda^*(b_{i+1}) - \Lambda^*(t_i) - \Lambda^{*\prime}(t_i)[b_{i+1} - t_i] = \epsilon. \]
This is straightforward, because the left side of the first equation is increasing in the unknown $t_i$ and the left side of the second equation is increasing in the unknown $b_{i+1}$. If at some step there is no solution in $[a_1, a_2]$, the procedure stops, reduces $\epsilon$ and starts over at $b_1$. If a solution is found at every step but $b_{k+1} < a_2$, then $\epsilon$ is increased and the procedure starts over at $b_1$. Thus, one may apply a bisection search over $\epsilon$ to find the optimal $\epsilon$ and, simultaneously, the optimal $t_i$ and $b_i$. The exact optimum has $b_{k+1} = a_2$; in practice, one must specify some error tolerance for $|b_{k+1} - a_2|$.
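The bisection procedure just described can be sketched as follows (illustrative code of ours, not from the paper; function and variable names are assumptions). In the standard Gaussian case, $\Lambda^*(a) = a^2/2$ and the penalty is $J(t, a) = (a - t)^2/2$, so the optimal $t_i$ come out equally spaced.

```python
def optimal_mixture_points(lstar, lstar_prime, a1, a2, k):
    """Bisection over the common penalty eps to find t_1 < ... < t_k.

    lstar, lstar_prime: the rate function Lambda* and its derivative.
    J(t, a) = lstar(a) - lstar(t) - lstar_prime(t)*(a - t) is the
    exponential penalty; at the optimum all max terms in (26) equal eps.
    """
    def J(t, a):
        return lstar(a) - lstar(t) - lstar_prime(t) * (a - t)

    def root(f, lo, hi):
        # f is increasing with f(lo) <= 0; plain bisection for the root
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return 0.5 * (lo + hi)

    cap = a2 + (a2 - a1)              # generous upper bracket for the roots

    def sweep(eps):
        # forward recursion: b_1 = a1, then t_i, b_{i+1}, ...
        b, ts = a1, []
        for _ in range(k):
            t = root(lambda u: J(u, b) - eps, b, cap)
            b = root(lambda u: J(t, u) - eps, t, cap)
            ts.append(t)
        return ts, b                  # b is the final endpoint b_{k+1}

    lo, hi = 0.0, J(a1, a2)
    while sweep(hi)[1] < a2:          # ensure the initial hi overshoots a2
        hi *= 2
    for _ in range(60):               # bisect eps until b_{k+1} ~ a2
        eps = 0.5 * (lo + hi)
        lo, hi = (eps, hi) if sweep(eps)[1] < a2 else (lo, eps)
    return sweep(0.5 * (lo + hi))[0]
```

The inner root-finding relies exactly on the monotonicity noted above; the outer loop is the bisection search over $\epsilon$.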

3.4 Increasing Number of Components

Again consider $A = [a_1, a_2] \subset H^o$ and let $J_k$ denote the minimal value in (16), the minimax penalty incurred in estimating $P(S_n/n \ge a)$ for all $a \in A$ using a mixture of $k$ exponentially twisted distributions. We examine how $J_k$ decreases with $k$. Recall that for $a \in H^o$, $0 < \Lambda''(\theta(a)) < \infty$. Since $\Lambda^{*\prime\prime}(a) = 1/\Lambda''(\theta(a))$, it follows that $0 < \Lambda^{*\prime\prime}(a) < \infty$. Since $A \subset H^o$ is a compact set, we have $\inf_{a \in A} \Lambda^{*\prime\prime}(a) > 0$ and $\sup_{a \in A} \Lambda^{*\prime\prime}(a) < \infty$.

Proposition 3 For $A = [a_1, a_2] \subset H^o$, $a_1 > \mu$, there exist constants $d_1, d_2$ such that, for all $k = 1, 2, \ldots$,
\[ \frac{d_1}{k^2} \le J_k \le \frac{d_2}{k^2}. \]

Proof: Let $c_- = \inf_{a \in A} \Lambda^{*\prime\prime}(a)$ and $c_+ = \sup_{a \in A} \Lambda^{*\prime\prime}(a)$. For any interval $[b_i, b_{i+1}] \subset A$ and point $t_i$ satisfying (24), twice integrating the bounds on the second derivative of $\Lambda^*$ yields
\[ \frac{c_-}{2}(t_i - b_i)^2 \le \Lambda^*(b_i) - \Lambda^*(t_i) - \Lambda^{*\prime}(t_i)[b_i - t_i] \le \frac{c_+}{2}(t_i - b_i)^2 \]
and
\[ \frac{c_-}{2}(t_i - b_{i+1})^2 \le \Lambda^*(b_{i+1}) - \Lambda^*(t_i) - \Lambda^{*\prime}(t_i)[b_{i+1} - t_i] \le \frac{c_+}{2}(t_i - b_{i+1})^2. \]
To construct an upper bound on $J_k$, we may take $b_1, \ldots, b_{k+1}$ to be equally spaced, so that $b_{i+1} - b_i = (a_2 - a_1)/k$, and choose $t_i$ to satisfy (24). Then
\[ J_k \le \max_{i=1,\ldots,k} \sup_{b_i \le a \le b_{i+1}} \{\Lambda^*(a) - \Lambda^*(t_i) - \Lambda^{*\prime}(t_i)[a - t_i]\}. \]
As in the proof of Proposition 1, the supremum over the $i$th subinterval is attained at the endpoints, so
\[ J_k \le \max_{i=1,\ldots,k} \frac{c_+}{2} \max\{(t_i - b_i)^2, (t_i - b_{i+1})^2\} \le \frac{c_+ (a_2 - a_1)^2}{2k^2}. \]
To get a lower bound on $J_k$, observe that for any $b_1 < \cdots < b_{k+1}$ with $a_1 = b_1$ and $b_{k+1} = a_2$ (including the optimal values), at least one subinterval $[b_i, b_{i+1}]$ must have length greater than or equal to $(a_2 - a_1)/k$. Fix such a subinterval and let $t_i$ satisfy (24). Then
\[ J_k \ge \max\{\Lambda^*(b_i) - \Lambda^*(t_i) - \Lambda^{*\prime}(t_i)[b_i - t_i],\ \Lambda^*(b_{i+1}) - \Lambda^*(t_i) - \Lambda^{*\prime}(t_i)[b_{i+1} - t_i]\}, \]
so
\[ J_k \ge \max\Big\{\frac{c_-}{2}(t_i - b_{i+1})^2,\ \frac{c_-}{2}(t_i - b_i)^2\Big\} \ge \frac{c_-}{8}(b_{i+1} - b_i)^2 \ge \frac{c_-}{8k^2}(a_2 - a_1)^2. \]

The following result shows that by increasing the number of components $k$ in the mixture together with $n$, we can recover asymptotic efficiency. Here $\tilde P$ is defined from $k$ components, as before, using the optimal twisting parameters $\theta(t_i)$, $i = 1, \ldots, k$.

Proposition 4 For $A = [a_1, a_2] \subset H^o$, $a_1 > \mu$: if $k \to \infty$ as $n \to \infty$, then $\tilde P$ is asymptotically efficient for $P(S_n/n \ge a)$, for every $a \in A$. If $k = \Theta(\sqrt{n/\log n})$, then the relative variance of the estimated tail distribution curve over $A$ is polynomially bounded, i.e.,
\[ \sup_{a \in A} \limsup_n \frac{1}{n^p} \frac{E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} < \infty \quad (27) \]
holds for some $p > 1$. If $k = \Theta(\sqrt{n})$, then (27) holds with $p = 1$.

Proof: From Lemma 1 and (8), we know that
\[ E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)] \le \text{constant} \cdot \exp\Big(-n \max_{i=1,\ldots,k} H(t_i, a)\Big) \]
and
\[ P(S_n/n \ge a) \ge \frac{\text{constant}}{\sqrt{n}} \exp(-n\Lambda^*(a)), \]
for all sufficiently large $n$. Thus,
\[ \frac{E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \le n \cdot \text{constant} \cdot \exp\Big(n\Big[2\Lambda^*(a) - \max_{i=1,\ldots,k} H(t_i, a)\Big]\Big) \le n \cdot \text{constant} \cdot \exp(nJ_k) \le n \cdot \text{constant} \cdot \exp(n d_2 / k^2), \quad (28) \]
for all sufficiently large $n$. If $k \to \infty$, then
\[ \lim_n \frac{1}{n} \log \left( \frac{E_{\tilde P}[\tilde L^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \right) = 0. \]
From this, asymptotic efficiency as an estimator of $P(S_n/n \ge a)$ follows. If $k = \Theta(\sqrt{n/\log n})$, (28) implies (27) for some $p > 1$. If $k = \Theta(\sqrt{n})$, (28) implies (27) with $p = 1$.

Bounds of the same order of magnitude as those in Proposition 3 (but with different constants) hold if we use equally spaced points $t_i$ rather than the optimal points. Thus, the number of components $k$ needed to achieve asymptotic efficiency in that case is of the same order of magnitude as with optimally chosen points. This suggests that choosing the $t_i$ optimally is most relevant when $k$ is relatively small.

4 Continuous Mixture of Exponentially Twisted Distributions

We now show that a continuous mixture of $P_{\theta(t)}$ for $t \in A$ simultaneously asymptotically efficiently estimates each $P(S_n/n \ge a)$ for $a \in A$. In Theorem 2, we allow $A$ to be any interval in $H^o$ of values greater than $\mu$. Thus, it is allowed to have the form $[a_1, \infty)$ as long as it is a subset of $H^o$ and $a_1 > \mu$.

4.1 Polynomially Bounded Relative Variance

Let $g$ be a probability density function with support $A$. Consider the probability measure $P_g$, where for any set $\mathcal{A}$,
\[ P_g(\mathcal{A}) = \int_{a_1}^{a_2} P_{\theta(t)}(\mathcal{A})\, g(t)\, dt. \]
(Here $a_2$ may equal $\infty$.) This measure may be used to estimate $P(S_n/n \ge a)$ as follows: First a sample $T$ is generated using the density $g$. Then, the exponentially twisted distribution $P_{\theta(T)}$ is used to generate the sample $(X_1, X_2, \ldots, X_n)$, and hence the output $L_g(S_n) I(S_n/n \ge a)$, where
\[ L_g(S_n) = \Big( \int_{a_1}^{a_2} \exp[\theta(t) S_n - n\Lambda(\theta(t))]\, g(t)\, dt \Big)^{-1}. \]

Theorem 2 For each $a \in A^o$, when $g$ is a positive continuous function on $A$,
\[ \limsup_n \frac{1}{n} \frac{E_{P_g}[L_g(S_n)^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} = \frac{\theta(a)\Lambda''(\theta(a))}{g(a)} = \frac{\Lambda^{*\prime}(a)}{\Lambda^{*\prime\prime}(a)\, g(a)}, \]
so that the relative variance of the associated importance sampling estimator grows polynomially at most at rate $p = 1$.

Proof: On the event $\{S_n \ge na\}$,
\[ L_g(S_n) \le \Big( \int_{a_1}^{a_2} \exp[\theta(t) na - n\Lambda(\theta(t))]\, g(t)\, dt \Big)^{-1}. \]
Since $\Lambda^*(t) = \theta(t)t - \Lambda(\theta(t))$ and $\theta(t) = \Lambda^{*\prime}(t)$, this upper bound equals
\[ \exp(-n\Lambda^*(a)) \Big( \int_{a_1}^{a_2} \exp(-n(\Lambda^*(a) - \Lambda^{*\prime}(t)(a - t) - \Lambda^*(t)))\, g(t)\, dt \Big)^{-1}. \]
Note that $\Lambda^*(a) - \Lambda^{*\prime}(t)(a - t) - \Lambda^*(t)$ is minimized at $t = a$. Using Laplace's approximation (see, e.g., Olver [9], pp. 80-81),
\[ \int_{a_1}^{a_2} \exp(-n(\Lambda^*(a) - \Lambda^{*\prime}(t)(a - t) - \Lambda^*(t)))\, g(t)\, dt \sim \sqrt{\frac{2\pi}{n\Lambda''(\theta_a)}}\, g(a) \quad (29) \]
if $a \in (a_1, a_2)$. Thus, for $\epsilon > 0$ and $n$ large enough, on $\{S_n \ge na\}$,
\[ L_g(S_n) \le \exp(-n\Lambda^*(a)) \sqrt{\frac{n\Lambda''(\theta(a))}{2\pi}}\, \frac{1 + \epsilon}{g(a)}. \quad (30) \]
Now
\[ E_{P_g}[L_g(S_n)^2 I(S_n/n \ge a)] = E_P[L_g(S_n) I(S_n/n \ge a)]. \]
Hence, for $n$ large enough,
\[ E_{P_g}[L_g(S_n)^2 I(S_n/n \ge a)] \le \exp(-n\Lambda^*(a)) \sqrt{\frac{n\Lambda''(\theta_a)}{2\pi}}\, \frac{1 + \epsilon}{g(a)}\, P(S_n/n \ge a). \]
Using the sharp asymptotic for $P(S_n/n \ge a)$ given in (8), and noting that $\epsilon$ is arbitrary, it follows that
\[ \limsup_n \frac{1}{n} \frac{E_{P_g}[L_g(S_n)^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \le \frac{\theta_a \Lambda''(\theta_a)}{g(a)}. \]
Since $\theta_a = \Lambda^{*\prime}(a)$ and $\Lambda''(\theta_a) = 1/\Lambda^{*\prime\prime}(a)$, the result follows.
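A sketch of the continuous-mixture estimator (ours, not from the paper), again for standard Gaussian $X_i$ ($\theta(t) = t$, $\Lambda(\theta) = \theta^2/2$) and with $g$ taken uniform on $[a_1, a_2]$ for simplicity; the mixing integral in $L_g$ is evaluated by simple quadrature with the largest exponent factored out for numerical stability.

```python
import numpy as np

def continuous_mixture_estimates(n, m, a1, a2, a_grid, rng=None):
    """Continuous mixture of exponential twists for standard Gaussian X_i.

    Illustrative only: g is uniform on [a1, a2] (Theorem 2 only requires
    g positive and continuous); theta(t) = t and Lambda(theta) = theta^2/2.
    One simulation run estimates the whole curve a -> P(S_n/n >= a).
    """
    rng = np.random.default_rng(rng)
    ts = rng.uniform(a1, a2, size=m)                       # T ~ g
    s = rng.standard_normal((m, n)).sum(axis=1) + n * ts   # S_n under P_theta(T)

    # L_g(S_n) = 1 / int_{a1}^{a2} exp(t*S_n - n*t^2/2) g(t) dt,
    # approximated on a fine grid; subtract the max exponent per sample
    grid = np.linspace(a1, a2, 801)
    weight = (grid[1] - grid[0]) / (a2 - a1)               # g(t) dt, g uniform
    expo = np.outer(s, grid) - n * grid**2 / 2
    mx = expo.max(axis=1)
    log_int = mx + np.log(np.exp(expo - mx[:, None]).sum(axis=1) * weight)
    lg = np.exp(-log_int)
    return np.array([(lg * (s >= n * a)).mean() for a in a_grid])
```

The same $m$ samples and the same weights $L_g(S_n^i)$ are reused for every threshold $a$, which is exactly what makes the continuous mixture attractive for estimating the whole tail distribution curve.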

Remark 1 In Theorem 2, if $a$ is on the boundary of $[a_1, a_2]$, the analysis remains unchanged except that in (29) the asymptotic has to be divided by 2 to get the correct value (see, e.g., Olver [9], p. 81). Hence in this setting
\[ \limsup_n \frac{1}{n} \frac{E_{P_g}[L_g(S_n)^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} \le \frac{2\Lambda^{*\prime}(a)}{\Lambda^{*\prime\prime}(a)\, g(a)}. \]
In particular, it follows that if $A = [a_1, a_2] \subset H^o$, $a_1 > \mu$,
\[ \sup_{a \in A} \limsup_n \frac{1}{n} \frac{E_{P_g}[L_g(S_n)^2 I(S_n/n \ge a)]}{P(S_n/n \ge a)^2} < \infty. \]

4.2 Choice of Density Function

From an implementation viewpoint, it may be important to select $g$ so as to estimate all $P(S_n/n \ge a)$, $a \in A$, with the same asymptotic relative variance. If this holds, we may continue to generate samples until the sample relative variance at one point becomes small, and this ensures that the relative variance at other points is not much different. Theorem 2 suggests that this may be achieved by selecting $g(a)$ proportional to $\Lambda^{*\prime}(a)/\Lambda^{*\prime\prime}(a)$, or equivalently $\theta(a)/\theta'(a)$. Suppose that $A = [a_1, a_2] \subset H^o$, $a_1 > \mu$. In this case
\[ \int_{a_1}^{a_2} \frac{\theta(a)}{\theta'(a)}\, da < \infty, \quad (31) \]
so such a $g$ is feasible. We now evaluate such a $g$ for some simple cases:

Example 1 If $X_i$ has a Bernoulli distribution with mean $p$, then its log-moment generating function is $\Lambda(\theta) = \log(\exp(\theta)p + (1 - p))$ and
\[ \Lambda'(\theta) = \frac{\exp(\theta)p}{\exp(\theta)p + (1 - p)}. \]
It is easily seen that
\[ \theta(a) = \log\Big[\frac{a}{1 - a} \cdot \frac{1 - p}{p}\Big] \]
and $\theta'(a) = 1/a + 1/(1 - a)$. Thus, $g(a)$ is proportional to
\[ \Big(\frac{1}{a} + \frac{1}{1 - a}\Big)^{-1} \log\Big[\frac{a}{1 - a} \cdot \frac{1 - p}{p}\Big]. \]
It can be seen that this function is maximized at the solution of
\[ \frac{a(1 - p)}{(1 - a)p} = \exp\Big[\frac{1}{2a - 1}\Big] \]
that exceeds $p$.
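As a small numerical illustration of this choice (a hypothetical helper of ours, not from the paper), the following normalizes $g(a) \propto \theta(a)/\theta'(a)$ on a grid; in the Gaussian case treated in Example 2 below it reduces to $g(a) \propto a$.

```python
import numpy as np

def equal_precision_density(theta, theta_prime, a1, a2, npts=1001):
    """Normalized mixing density g(a) proportional to theta(a)/theta'(a).

    theta, theta_prime: vectorized maps a -> theta(a) and a -> theta'(a).
    Returns (grid, g) with g normalized to integrate to 1 over [a1, a2]
    by the trapezoidal rule.
    """
    grid = np.linspace(a1, a2, npts)
    w = theta(grid) / theta_prime(grid)
    norm = ((w[:-1] + w[1:]) / 2 * np.diff(grid)).sum()
    return grid, w / norm

# Gaussian case: theta(a) = a, theta'(a) = 1, so g(a) = a / int_{a1}^{a2} a da
grid, g = equal_precision_density(lambda a: a, lambda a: np.ones_like(a), 1.0, 2.0)
```

For the Bernoulli or Gamma examples, one would instead pass the $\theta(a)$ and $\theta'(a)$ given in the text; only positivity of $\theta/\theta'$ on $[a_1, a_2]$ is needed for the normalization to make sense.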

Example 2 If $X_i$ has a Gaussian distribution with mean zero and variance 1, then its log-moment generating function is $\Lambda(\theta) = \theta^2/2$ and $\Lambda'(\theta) = \theta$. Also, $\theta(a) = a$ and $\theta'(a) = 1$, so $g(a)$ is proportional to $a$. This suggests that more mass should be given to larger values of $a$ in the interval $[a_1, a_2]$.

Example 3 If $X_i$ has a Gamma distribution with log-moment generating function
$$\Lambda(\theta) = -\alpha \log(1 - \theta\beta),$$
then its mean equals $\alpha\beta$. Now
$$\Lambda'(\theta) = \frac{\alpha\beta}{1 - \theta\beta},$$
so that
$$\theta(a) = \frac{1}{\beta}\left(1 - \frac{\alpha\beta}{a}\right) \quad \text{and} \quad \theta'(a) = \frac{\alpha}{a^2}.$$
Then $g(a)$ is proportional to
$$a\left(\frac{a}{\alpha\beta} - 1\right).$$
Recall that $a_1 > \alpha\beta$.

5 Concluding Remarks

In this paper, we have considered the problem of simultaneously estimating the probabilities of multiple rare events. In the setting of tail probabilities associated with a random walk, we have shown that the standard importance sampling estimator that yields asymptotically efficient estimates for one point of the distribution fails to do so at all other points. To address this problem, we have examined mixtures of exponentially twisted distributions. We have identified the optimal finite mixture and shown that asymptotic efficiency is achieved uniformly with either a continuous mixture or a finite mixture with an increasing number of components. Although our analysis is restricted to the random walk setting, we expect that similar techniques could prove useful in other rare event simulation problems. Similar ideas should also prove useful in going beyond the estimation of multiple probabilities to the estimation of functions of tail distributions, such as quantiles or tail conditional expectations.

Acknowledgements: This work is supported, in part, by NSF grants DMI and DMS.

References

[1] Asmussen, S. Ruin Probabilities. World Scientific Publishing Co. Ltd., London.

[2] Bahadur, R. R., and R. R. Rao. On deviations of the sample mean. Ann. Math. Statist. 31.

[3] Bucklew, J. A. Introduction to Rare Event Simulation. Springer-Verlag, New York.

[4] Chang, C. S., P. Heidelberger, S. Juneja, and P. Shahabuddin. Effective bandwidth and fast simulation of ATM intree networks. Performance Evaluation 20.

[5] Dembo, A., and O. Zeitouni. Large Deviations Techniques and Applications, Second Edition. Springer, New York, NY.

[6] Glasserman, P., and J. Li. Importance sampling for portfolio credit risk. Management Science 51.

[7] Heidelberger, P. Fast simulation of rare events in queueing and reliability models. ACM Transactions on Modeling and Computer Simulation 5, 1.

[8] Juneja, S., and P. Shahabuddin. Rare event simulation techniques. To appear in Handbook on Simulation, S. Henderson and B. Nelson, eds.

[9] Olver, F. W. J. Introduction to Asymptotics and Special Functions. Academic Press, New York.

[10] Sadowsky, J. S., and J. A. Bucklew. On large deviation theory and asymptotically efficient Monte Carlo estimation. IEEE Transactions on Information Theory 36, 3.

[11] Shahabuddin, P. Importance sampling for the simulation of highly reliable Markovian systems. Management Science 40.


Verifying Regularity Conditions for Logit-Normal GLMM Verifying Regularity Conditions for Logit-Normal GLMM Yun Ju Sung Charles J. Geyer January 10, 2006 In this note we verify the conditions of the theorems in Sung and Geyer (submitted) for the Logit-Normal

More information

Reduced-load equivalence for queues with Gaussian input

Reduced-load equivalence for queues with Gaussian input Reduced-load equivalence for queues with Gaussian input A. B. Dieker CWI P.O. Box 94079 1090 GB Amsterdam, the Netherlands and University of Twente Faculty of Mathematical Sciences P.O. Box 17 7500 AE

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

REAL VARIABLES: PROBLEM SET 1. = x limsup E k

REAL VARIABLES: PROBLEM SET 1. = x limsup E k REAL VARIABLES: PROBLEM SET 1 BEN ELDER 1. Problem 1.1a First let s prove that limsup E k consists of those points which belong to infinitely many E k. From equation 1.1: limsup E k = E k For limsup E

More information

Robust Network Codes for Unicast Connections: A Case Study

Robust Network Codes for Unicast Connections: A Case Study Robust Network Codes for Unicast Connections: A Case Study Salim Y. El Rouayheb, Alex Sprintson, and Costas Georghiades Department of Electrical and Computer Engineering Texas A&M University College Station,

More information

Krzysztof Burdzy University of Washington. = X(Y (t)), t 0}

Krzysztof Burdzy University of Washington. = X(Y (t)), t 0} VARIATION OF ITERATED BROWNIAN MOTION Krzysztof Burdzy University of Washington 1. Introduction and main results. Suppose that X 1, X 2 and Y are independent standard Brownian motions starting from 0 and

More information

1: PROBABILITY REVIEW

1: PROBABILITY REVIEW 1: PROBABILITY REVIEW Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2016 M. Rutkowski (USydney) Slides 1: Probability Review 1 / 56 Outline We will review the following

More information

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers.

2. Dual space is essential for the concept of gradient which, in turn, leads to the variational analysis of Lagrange multipliers. Chapter 3 Duality in Banach Space Modern optimization theory largely centers around the interplay of a normed vector space and its corresponding dual. The notion of duality is important for the following

More information

h(x) lim H(x) = lim Since h is nondecreasing then h(x) 0 for all x, and if h is discontinuous at a point x then H(x) > 0. Denote

h(x) lim H(x) = lim Since h is nondecreasing then h(x) 0 for all x, and if h is discontinuous at a point x then H(x) > 0. Denote Real Variables, Fall 4 Problem set 4 Solution suggestions Exercise. Let f be of bounded variation on [a, b]. Show that for each c (a, b), lim x c f(x) and lim x c f(x) exist. Prove that a monotone function

More information

Notes on uniform convergence

Notes on uniform convergence Notes on uniform convergence Erik Wahlén erik.wahlen@math.lu.se January 17, 2012 1 Numerical sequences We begin by recalling some properties of numerical sequences. By a numerical sequence we simply mean

More information