More than one way to evaluate a statistic

A statistic for $X$ with pdf $u(x)$ is

$A = \mathbb{E}_u[F(X)] = \int F(x)\, u(x)\, dx$

Suppose $v(x)$ is another probability density such that $L(x) = \frac{u(x)}{v(x)}$ is well-defined. Then

$A = \mathbb{E}_v[F(X)L(X)] = \int F(x)\, \frac{u(x)}{v(x)}\, v(x)\, dx$
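A quick Monte Carlo check of this identity, a minimal sketch with $F(x) = x^2$, $u$ the $N(0,1)$ density, and $v$ the $N(2,1)$ density (all three choices are illustrative, not from the slides):

    # Both estimators should approximate E_u[X^2] = 1 for X ~ N(0,1).
    set.seed(1)
    Fx <- function(x) x^2
    L  <- function(x) dnorm(x, 0, 1) / dnorm(x, 2, 1)  # L(x) = u(x)/v(x)
    xu <- rnorm(10^5, 0, 1)   # samples from u
    xv <- rnorm(10^5, 2, 1)   # samples from v
    mean(Fx(xu))              # u-method
    mean(Fx(xv) * L(xv))      # v-method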
Likelihood ratio

The likelihood ratio is

$L(x) = \frac{u(x)}{v(x)}$

so

$A = \mathbb{E}_u[F(X)] = \mathbb{E}_v[F(X)L(X)]$

To get $A$ we can either: take samples with density $u(x)$ and calculate with $F$; or take samples with density $v(x)$ and calculate with $FL$.
Good choice of likelihood ratio

How should we seek $v(x)$ and the corresponding $L(x)$? Reduction of variance is a good criterion: we want

$\mathrm{Var}_v[F(X)L(X)] = \mathbb{E}_v\left[(F(X)L(X))^2\right] - A^2$

to be less than

$\mathrm{Var}_u[F(X)] = \mathbb{E}_u\left[(F(X))^2\right] - A^2$
Example: Large values for a normal r.v.

Suppose $X \sim N(0, 1)$ and we wish to evaluate $P[X > 4]$.

R code for the probability:

    1 - pnorm(4)

The output:

    [1] 3.167124e-05
Sampling

Code for sampling:

    sample <- rnorm(10^5)
    sum( sample > 4 )
    sample[ which(sample > 4) ]

The output:

    [1] 6
    [1] 4.110055 4.970314 4.026803 4.133050 4.014239

Most of the sample doesn't count: "wasted effort". The samples with $X > 4$ are only a little larger than 4.
A better choice of sampling PDF

Draw samples from $N(4, 1)$ instead, putting most of the sampling weight near 4. The likelihood ratio is

$L(x) = \frac{u(x)}{v(x)} = \frac{k e^{-x^2/2}}{k e^{-(x-4)^2/2}} = e^{4^2/2}\, e^{-4x}$
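As a sanity check, the closed form can be compared against the density ratio directly in R (a sketch; the grid of points is arbitrary):

    # The ratio of normal densities should match e^{4^2/2} e^{-4x}.
    x <- seq(3, 6, by = 0.5)
    dnorm(x, 0, 1) / dnorm(x, 4, 1)   # u(x)/v(x)
    exp(4^2/2) * exp(-4*x)            # closed-form L(x)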
New sampling

The v-method, with samples drawn according to the new pdf:

$P_{N(0,1)}[X > 4] = \int_4^\infty L(x)\, v(x)\, dx \approx \frac{1}{N} \sum_{v_k > 4} L(v_k) = \frac{e^{4^2/2}}{N} \sum_{v_k > 4} e^{-4 v_k}$
New sampling example

Code for sampling:

    vsample <- rnorm(10^5,4,1)
    usevsample <- vsample[which(vsample > 4)]
    Pv <- exp(4^2/2)*sum(exp(-4*usevsample))/10^5
    Pv

The output:

    [1] 3.17203e-05
Intuition behind Sampling

About half the samples are counted, but they are counted with a small weight. Many small-weight hits give a lower-variance estimator than a few large-weight hits, as the check below illustrates.
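The weights can be inspected directly (a sketch reusing usevsample from the slide above): since $L$ is decreasing, every counted sample has weight at most $L(4) = e^{-8} \approx 3.4 \times 10^{-4}$.

    # Every counted sample has a uniformly small weight.
    weights <- exp(4^2/2) * exp(-4*usevsample)
    max(weights)   # bounded by L(4)
    exp(-8)        # L(4) = exp(8 - 16)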
Digression: Reduction of Variance

Variance for the naive sampling:

$P = P_{N(0,1)}[X > 4] = \int \mathbf{1}_{[x>4]}(x)\, \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx$

$\mathrm{Var}_u[F(X)] = \mathbb{E}_u\left[(F(X))^2\right] - P^2 = \int \left(\mathbf{1}_{[x>4]}(x)\right)^2 \frac{e^{-x^2/2}}{\sqrt{2\pi}}\, dx - P^2 = P - P^2$

The variance is the same order as $P$. (The standard deviation will be relatively large compared to $P$.)
Digression: Reduction of Variance

Naive sample variance:

    P <- sum((sample > 4)^2)/10^5
    P-P^2

Results:

    [1] 5.99964e-05
Digression: Reduction of Variance

Importance sampling with $L(x) = e^{4^2/2}\, e^{-4x}$. Variance for importance sampling:

$\mathrm{Var}_v[F(X)L(X)] = \mathbb{E}_v\left[(F(X)L(X))^2\right] - P^2 = \int \left(\mathbf{1}_{[x>4]}(x)\, e^{4^2/2}\, e^{-4x}\right)^2 \frac{e^{-(x-4)^2/2}}{\sqrt{2\pi}}\, dx - P^2$
Digression: Reduction of Variance

Sample variance:

    secmomv <- sum((exp(4^2/2)*exp(-4*usevsample))^2)/10^5
    secmomv - Pv^2

Results:

    [1] 4.550243e-09
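Putting the two digressions together, a sketch of the standard errors of the two estimators with $N = 10^5$ (reusing P, Pv, and secmomv from the slides above):

    # Standard error = sqrt(variance / N); importance sampling wins
    # by roughly two orders of magnitude here.
    sqrt((P - P^2) / 10^5)          # naive, roughly 2.4e-05
    sqrt((secmomv - Pv^2) / 10^5)   # importance sampling, roughly 2.1e-07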
Digression: Reduction of Variance

What is the best normal distribution $N(b, 1)$ for importance sampling of $P = P_{N(0,1)}[X > 4]$? After some calculation with $L(x)$:

$\mathrm{Var}_v\left[\mathbf{1}_{[X>4]} L(X)\right] = e^{b^2}\left(1 - \Phi(b + 4)\right) - P^2$

($\Phi(x)$ is the cdf of the $N(0, 1)$ distribution.)
Digression: Reduction of Variance

Unfortunately, $1 - \Phi(b + 4)$ evaluates to 0 at precision levels for $3 \le b \le 5$, leading to severe numerical error in $e^{b^2}(1 - \Phi(b + 4)) - P^2$. Use standard asymptotic estimates for the normal cdf tail:

$\frac{1}{\sqrt{2\pi}(b+5)}\left(e^{-(b+4)^2/2} - e^{-(b+5)^2/2}\right) \le 1 - \Phi(b+4) \le \frac{1}{\sqrt{2\pi}(b+4)}\, e^{-(b+4)^2/2}$
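The bounds can be checked at a point where the naive subtraction is still accurate, e.g. $b = 0$; note that R's pnorm(..., lower.tail = FALSE) also computes the tail stably and could replace the subtraction (a sketch):

    # Sandwich check of the tail bounds at b = 0 (tail at 4):
    b <- 0
    lower <- (exp(-(b+4)^2/2) - exp(-(b+5)^2/2)) / (sqrt(2*pi)*(b+5))
    upper <- exp(-(b+4)^2/2) / (sqrt(2*pi)*(b+4))
    c(lower, pnorm(b + 4, lower.tail = FALSE), upper)  # lower <= tail <= upper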
Digression: Reduction of Variance

Using the estimates with $L(x)$ provides algebraic simplification, which reduces numerical error. Using an $N(b, 1)$ with $b \approx 4.1$ gives the minimum variance.
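A sketch of that minimization, using the upper tail estimate so the objective $e^{b^2}(1 - \Phi(b+4))$ becomes $e^{b^2 - (b+4)^2/2} / (\sqrt{2\pi}(b+4))$; minimizing its logarithm avoids overflow:

    # log of the (upper-estimate) variance term; the -P^2 constant
    # does not affect the location of the minimum.
    logvar <- function(b) b^2 - (b+4)^2/2 - log(sqrt(2*pi)*(b+4))
    optimize(logvar, interval = c(3, 5))$minimum   # about 4.12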
New Measures from Old

If $Q$ is a probability measure, we can define a new probability measure $dP = L(x)\, dQ$ provided $L(x) \ge 0$ (almost surely w.r.t. $Q$) and $\int_\Omega L(x)\, dQ = 1$. Then

$\mathbb{E}_P[F(X)] = \mathbb{E}_Q[F(X)L(X)]$
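For the running example, $Q = N(4, 1)$ and $L(x) = e^{4^2/2} e^{-4x}$, the normalization can be checked numerically (a sketch):

    # The integral of L dQ should be 1: L(x) v(x) is exactly the N(0,1) pdf.
    integrate(function(x) exp(4^2/2) * exp(-4*x) * dnorm(x, 4, 1),
              lower = -Inf, upper = Inf)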
Relation between measures

Ask the reverse question: given measures $P$ and $Q$, is there an $L(x)$ relating them? A necessary condition is

$Q[B] = 0 \implies P[B] = 0$

because

$P[B] = \mathbb{E}_P[\mathbf{1}_B(X)] = \mathbb{E}_Q[\mathbf{1}_B(X)L(X)] = 0$

Impossible under $Q$ is also impossible under $P$.
If $Q[B] = 0 \implies P[B] = 0$, we say $P$ is absolutely continuous w.r.t. $Q$ and write $P \ll Q$.
Theorem

The Radon-Nikodym Theorem says the necessary condition of absolute continuity is also sufficient:

Theorem. If $P$ and $Q$ are probability measures on a common $\sigma$-algebra $\mathcal{F}$, and $P$ is absolutely continuous w.r.t. $Q$, then there is a function $L(x) = \frac{dP}{dQ}$ so that $\mathbb{E}_P[F(X)] = \mathbb{E}_Q[F(X)L(X)]$.
Completely Singular

If there is a $B$ with $P[B] = 0$ and $Q[B] = 1$, we say $Q$ is completely singular w.r.t. $P$. Note that $P[B^C] = 1$ and $Q[B^C] = 0$, so then $P$ is completely singular w.r.t. $Q$, and we write $P \perp Q$.
Simple Example 1

$X \sim U([0, 1])$ with probability measure $P$; $Y \sim N(0, 1)$ with probability measure $Q$.

$P$ is absolutely continuous w.r.t. $Q$ ($Q[B] = 0 \implies P[B] = 0$). $Q$ is not absolutely continuous w.r.t. $P$ ($P[[1, 2]] = 0$ but $Q[[1, 2]] \approx 0.136$).
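The witnessing set is easy to compute (a sketch):

    # P gives [1,2] zero mass; Q gives it about 0.1359.
    punif(2, 0, 1) - punif(1, 0, 1)   # P[[1,2]] = 0
    pnorm(2) - pnorm(1)               # Q[[1,2]] ~ 0.136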
Simple Example 2

$X \in \mathbb{R}^2$, $X \sim N((0, 0), I_{2 \times 2})$ with probability measure $P$; $Y = X/|X|$, $Y \sim U(S^1)$ with probability measure $Q$. Then $P[S^1] = 0$ and $Q[S^1] = 1$, so $P \perp Q$.
Sophisticated Example

$X \sim U([0, 1])$ with probability measure $P$; $Y$ has the Cantor distribution on $[0, 1]$, with probability measure $Q$.

$Q$ has a continuous c.d.f. (the Devil's staircase), but the p.d.f. does not exist in any easy sense. $P \perp Q$. This example shows that absolute continuity and complete singularity are extensions of the calculus definitions for functions from real analysis.
Example of an R-N derivative

The previous sampling example: $X \sim N(0, 1)$, with probability measure $P$ and p.d.f. $\propto e^{-x^2/2}$; $Y \sim N(4, 1)$, with probability measure $Q$ and p.d.f. $\propto e^{-(x-4)^2/2}$. Then

$\frac{dP}{dQ} = e^{4^2/2}\, e^{-4x}$
Bigger example of an R-N derivative

$X \in \mathbb{R}^n$, $X \sim N(0, C_{n \times n})$, $H = C^{-1}$. $Y \in \mathbb{R}^n$, $Y \sim N(\mu, C_{n \times n})$.

$u(x) = k_n\, e^{-x^T H x/2}$

$v(x) = k_n\, e^{-(x - \mu)^T H (x - \mu)/2} = k_n\, e^{-\mu^T H \mu/2}\, e^{\mu^T H x}\, e^{-x^T H x/2}$
Continuation, Bigger Example

Let $\nu = H\mu$, so $\mu^T H = \nu^T$ and $\mu = C\nu$. Then

$\frac{dP}{dQ} = e^{\nu^T \mu/2}\, e^{-\nu^T x}$
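A numerical check of this formula, a sketch that assumes the mvtnorm package and illustrative choices of C, mu, and the test point x:

    library(mvtnorm)   # for dmvnorm (assumed available)
    C  <- matrix(c(2, 0.5, 0.5, 1), 2, 2)  # illustrative covariance
    mu <- c(1, -1)                         # illustrative mean
    nu <- solve(C, mu)                     # nu = H mu = C^{-1} mu
    x  <- c(0.3, 0.7)                      # arbitrary test point
    dmvnorm(x, sigma = C) / dmvnorm(x, mean = mu, sigma = C)  # u(x)/v(x)
    exp(sum(nu * mu)/2) * exp(-sum(nu * x))                   # closed form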
Deciding the Distribution

We have a single sample point $X$. Is $X \sim P$ (the null hypothesis), or $X \sim Q$ (the alternate hypothesis)?

Test: the criterion is a set $B$. If $x \in B$, say $Q$; if $x \in B^C$, say $P$.
Types of errors

A Type I error is saying $Q$ when the truth is $P$ (rejecting the null and accepting the alternative when the null is true). A Type II error is saying $P$ when the truth is $Q$ (accepting the null and rejecting the alternative when the null is false). Usually we believe Type I errors are worse than Type II errors.
Confidence in the test

The confidence in the procedure is $P[B^C] = 1 - P[B]$, the probability of accepting the null when the null is true. The power of the procedure is $Q[B]$, the probability of rejecting the null when the null is false.

If $Q \perp P$, there is a test with 100% confidence and 100% power. Conversely, if there is a test with 100% confidence and 100% power, then $Q \perp P$.
Neyman-Pearson Lemma

A test is efficient if there is no way to increase the confidence without decreasing the power.

Neyman-Pearson Lemma. If $B$ is efficient, then there is an $L_0$ such that

$B = \left\{ x : \frac{dQ}{dP}(x) > L_0 \right\}$
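For the running $N(0,1)$-versus-$N(4,1)$ example, $dQ/dP = e^{4x - 4^2/2}$ is increasing in $x$, so the efficient regions are half-lines $B = \{x > x_0\}$. A minimal sketch (the cutoff $x_0 = 2$, i.e. $L_0 = 1$, is an illustrative choice):

    # B = {dQ/dP > L0} = {x > x0} with x0 = (log(L0) + 8)/4.
    x0 <- 2                                   # corresponds to L0 = 1
    pnorm(x0)                                 # confidence 1 - P[B], ~0.977
    pnorm(x0, mean = 4, lower.tail = FALSE)   # power Q[B], ~0.977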