Efficient Importance Sampling
1 Efficient Importance Sampling. David N. DeJong, University of Pittsburgh, Spring 2008.
2 Our goal is to calculate integrals of the form
G(Y) = ∫_Θ ϕ(θ; Y) dθ.
Special case (e.g., posterior moment):
G(Y) = ∫_Θ φ(θ; Y) p(θ|Y) dθ.
Scenario: analytical solutions to these integrals are unavailable. We will remedy this problem using numerical approximation methods.
3 (cont.) In the context of achieving likelihood evaluation and filtering in state-space representations, recall that we face the challenge of constructing the filtering density
f(s_t|Y_t) = f(y_t, s_t|Y_{t−1}) / f(y_t|Y_{t−1}) = f(y_t|s_t, Y_{t−1}) f(s_t|Y_{t−1}) / f(y_t|Y_{t−1}),
where
f(s_t|Y_{t−1}) = ∫ f(s_t|s_{t−1}, Y_{t−1}) f(s_{t−1}|Y_{t−1}) ds_{t−1},
and
f(y_t|Y_{t−1}) = ∫ f(y_t|s_t, Y_{t−1}) f(s_t|Y_{t−1}) ds_t.
4 (cont.) To gain intuition, suppose the integral we face is of the form
G(Y) = ∫_Θ φ(θ; Y) p(θ|Y) dθ,
and it is possible to obtain pseudo-random drawings θ_i from p(θ|Y). Then by the law of large numbers,
G(Y)_N = (1/N) ∑_{i=1}^N φ(θ_i; Y)
converges in probability to G(Y). We refer to G(Y)_N as the Monte Carlo estimate of G(Y). As we shall see, from the standpoint of numerical efficiency, this represents a best-case scenario.
5 (cont.) The MC estimate of the standard deviation of G(Y) is given by
σ_N(G(Y)) = [(1/N) ∑_i φ(θ_i; Y)² − G(Y)_N²]^{1/2}.
The numerical standard error associated with G(Y)_N is given by
s.e.(G(Y)_N) = σ_N(G(Y)) / √N.
Thus for N = 10,000, s.e.(G(Y)_N) is 1% of the size of σ_N(G(Y)).
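As a concrete sketch of this best-case calculation (in Python rather than the GAUSS used in the course; the target p = N(0,1) and function of interest φ(θ) = θ² are illustrative choices, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: p(theta|Y) = N(0,1) and phi(theta) = theta^2,
# so the true value of G(Y) = E[phi(theta)] is 1.
N = 10_000
theta = rng.standard_normal(N)                 # direct draws from p(theta|Y)
phi = theta**2

G_N = phi.mean()                               # Monte Carlo estimate of G(Y)
sigma_N = np.sqrt((phi**2).mean() - G_N**2)    # MC estimate of the std. dev.
nse = sigma_N / np.sqrt(N)                     # numerical standard error
```

With N = 10,000, `nse` is exactly 1% of `sigma_N`, as stated on the slide.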
6 Importance Sampling (Geweke, 1989 Econometrica) If p(θ|Y) is unavailable as a sampler, one remedy is to augment the targeted integrand with an importance sampling distribution g(θ|a):
G(Y) = ∫_Θ [ϕ(θ; Y)/g(θ|a)] g(θ|a) dθ = ∫_Θ [φ(θ; Y) p(θ|Y)/g(θ|a)] g(θ|a) dθ.
Key requirements:
- The support of g(θ|a) must span that of ϕ(θ|Y).
- E[G(Y)] must exist and be finite.
- g(θ|a) must be implementable as a sampler.
7 (cont.)
- MC estimate of G(Y):
G(Y)_N = (1/N) ∑_{i=1}^N ω_i; ω_i = ϕ(θ_i|Y)/g(θ_i|a).
- MC estimate of the standard deviation of ω w.r.t. g(θ|a):
σ_N(ω(θ, Y)) = [(1/N) ∑_i ω_i² − G(Y)_N²]^{1/2}.
- Numerical standard error associated with G(Y)_N:
s.e.(G(Y)_N) = σ_N(ω(θ, Y)) / √N.
- Note: variability in ω translates into increased n.s.e. (numerical inefficiency).
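A minimal Python sketch of the weighted estimator (illustrative choices, not from the slides: the targeted integrand ϕ is the N(0,1) density, so G(Y) = 1, and the sampler g is N(0, 2²), whose support spans that of ϕ):

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(theta):
    # Targeted integrand: the N(0,1) density, so G(Y) = 1 (illustrative).
    return np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)

def g_pdf(theta, a=2.0):
    # Importance sampler g(theta|a) = N(0, a^2).
    return np.exp(-0.5 * (theta / a)**2) / (a * np.sqrt(2 * np.pi))

N = 10_000
theta = 2.0 * rng.standard_normal(N)       # draws from g
omega = phi(theta) / g_pdf(theta)          # importance weights omega_i

G_N = omega.mean()                         # IS estimate of G(Y)
sigma_omega = np.sqrt((omega**2).mean() - G_N**2)
nse = sigma_omega / np.sqrt(N)             # variability in omega inflates this
```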
8 (cont.) For the special case in which the integrand factorizes as ϕ(θ|Y) = φ(θ; Y) p(θ|Y),
G(Y)_N = (1/N) ∑_{i=1}^N φ(θ_i; Y) w_i, w_i = p(θ_i|Y)/g(θ_i|a),
σ_N(G(Y)) = [(1/N) ∑_i φ(θ_i; Y)² w_i − G(Y)_N²]^{1/2}.
9 (cont.) Continuing with the special case, if we lack an integrating constant for either p(θ|Y) or g(θ|a), we can work instead with
G(Y)_N = (1/∑_i w_i) ∑_{i=1}^N φ(θ_i; Y) w_i,
σ_N(G(Y)) = [(1/∑_i w_i) ∑_{i=1}^N φ(θ_i; Y)² w_i − G(Y)_N²]^{1/2}.
The impact of ignoring integrating constants is eliminated by the inclusion of the accumulated weights in the denominators of these expressions.
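A sketch of the self-normalized version in Python (illustrative setup, not from the slides: target kernel of N(0,1) and sampler kernel of N(0, 2²), both deliberately left without their integrating constants; φ(θ) = θ², so the true posterior moment is 1):

```python
import numpy as np

rng = np.random.default_rng(2)

def p_kernel(theta):
    # p(theta|Y) known only up to an integrating constant: kernel of N(0,1).
    return np.exp(-0.5 * theta**2)

def g_kernel(theta, a=2.0):
    # Sampler kernel, also without its integrating constant.
    return np.exp(-0.5 * (theta / a)**2)

N = 10_000
theta = 2.0 * rng.standard_normal(N)        # draws from g = N(0, 4)
w = p_kernel(theta) / g_kernel(theta)        # w_i, up to an unknown constant
phi = theta**2                               # function of interest

# Accumulated weights in the denominators eliminate the unknown constants.
G_N = np.sum(phi * w) / np.sum(w)
sigma_N = np.sqrt(np.sum(phi**2 * w) / np.sum(w) - G_N**2)
```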
10 (cont.) Regardless of whether p(θ|Y) and g(θ|a) are proper p.d.f.s, notice that when p(θ|Y) itself can be used as a sampling density (i.e., p(θ|Y) = g(θ|a)), then w_i = 1 ∀ i, and we revert to the best-case scenario. In general, the accuracy of our approximation will fall short of the best-case scenario. To judge the degree of the shortfall, various metrics are available.
11 (cont.) Two metrics for judging numerical accuracy:
- max_i w_i² / ∑_i w_i² (good practical benchmark: 1%)
- relative numerical efficiency (RNE)
12 (cont.) Motivation for RNE. Issue: how close are we to the best-case scenario?
- Under the best-case scenario, recall that the n.s.e. is given by
s.e.(G(Y)_N) = σ_N(G(Y)) / √N.
- The actual n.s.e. is given by
s.e.(G(Y)_N) = σ_N(ω(θ, Y)) / √N,
where recall
σ_N(ω(θ, Y)) = [(1/N) ∑_i ω_i² − G(Y)_N²]^{1/2}, ω_i = ϕ(θ_i|Y)/g(θ_i|a).
13 (cont.) The idea behind RNE is to compare the actual n.s.e. to an estimate of the optimal (best-case) n.s.e.:
RNE = (ideal n.s.e.)² / (actual n.s.e.)² = [σ_N(G(Y))²/N] / [s.e.(G(Y)_N)]².
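Both accuracy metrics can be sketched in Python for a deliberately mediocre sampler (illustrative setup, not from the slides: target p = N(0,1), φ(θ) = θ², sampler g = N(0.5, 1), all densities fully normalized, so p/g = exp(0.125 − θ/2)):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative setup: target p = N(0,1), phi(theta) = theta^2, sampler g = N(0.5, 1).
N = 10_000
theta = 0.5 + rng.standard_normal(N)             # draws from g
p_over_g = np.exp(0.125 - 0.5 * theta)           # p(theta)/g(theta)
phi = theta**2
omega = phi * p_over_g                           # omega_i for estimating E_p[theta^2] = 1

G_N = omega.mean()

# Metric 1: largest squared weight relative to the total (benchmark: ~1%).
maxsq_totsq = omega.max()**2 / np.sum(omega**2)

# Metric 2: RNE = (ideal n.s.e.)^2 / (actual n.s.e.)^2; the common factor 1/N cancels.
sigma_ideal_sq = np.mean(phi**2 * p_over_g) - G_N**2   # est. variance of phi under p
sigma_omega_sq = np.mean(omega**2) - G_N**2            # variance of omega under g
rne = sigma_ideal_sq / sigma_omega_sq
```

Here the mediocre sampler yields an RNE well below 1: more draws are needed to match the accuracy of direct sampling from p.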
14 (cont.) Rearranging yields
s.e.(G(Y)_N) = σ_N(G(Y)) / √(N · RNE).
Note: relative to the best-case scenario, √(N · RNE) replaces √N in the denominator. Thus the further RNE is from 1, the more draws are required to achieve a given level of accuracy relative to the best-case scenario.
15 Example. Suppose p(θ|Y) ∼ N(µ, Σ); the slide tabulates µ, sqrt(diag(Σ)), and corr(Σ). Statistics of interest: E(sumc(µ)), E(prodc(µ)).
16 (cont.) Using rndseed and N = 10,000, MC estimates (µ̂, sqrt(diag(Σ̂)), n.s.e.(µ̂)): (table displayed on slide.)
17 (cont.) MC estimate of corr(Σ): (table displayed on slide.)
18 (cont.) MC estimates of statistics (mean, std. dev., n.s.e.(mean)) for E(sumc(µ)) and E(prodc(µ)): (table displayed on slide.)
19 (cont.) Exercise: replicate. Hint: to obtain draws from the N(µ, Σ) distribution:
- swish = chol(sig)'; swish is the lower-triangular Cholesky factor of sig (i.e., sig = swish*swish').
- draw = mu + swish*rndn(n,1);
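The same hint translated to Python (the slide's GAUSS `chol` returns the upper factor, hence the transpose there; NumPy's `cholesky` already returns the lower factor; `mu` and `sig` below are illustrative, not the slide's values):

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative mu and sig (not the values displayed on the slide).
mu = np.array([1.0, 2.0])
sig = np.array([[1.0, 0.5],
                [0.5, 2.0]])

# numpy's cholesky returns the lower-triangular factor, so no transpose needed.
swish = np.linalg.cholesky(sig)          # sig = swish @ swish.T

N = 100_000
draws = mu + (swish @ rng.standard_normal((2, N))).T   # each row ~ N(mu, sig)
```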
20 (cont.) Suppose instead we seek to obtain estimates using an importance sampling density g(θ|a) ∼ N(µ_I, Σ_I), with
µ_I = µ − sqrt(diag(Σ)), Σ_I = diagrv(Σ, sqrt(diag(Σ))).
21 (cont.) Using rndseed and N = 10,000, MC versus IS estimates (µ̂, sqrt(diag(Σ̂)), n.s.e.(µ̂)): (table displayed on slide.)
22 (cont.) MC versus IS estimates of corr(Σ): (table displayed on slide.)
23 (cont.) MC versus IS estimates of statistics (mean, std. dev., n.s.e.(mean)) for E(sumc(µ)) and E(prodc(µ)): (table displayed on slide.)
24 (cont.) Accuracy diagnostics. Summary statistics on weights (avg, stdev, min, max, maxsq/totsq), RNEs, and 1/RNEs: (table displayed on slide.)
25 (cont.) Plot of weights (all, top 1,000, top 100): (plots displayed on slide.)
26 (cont.) Exercise: replicate.
27 From a programming perspective, two simple approaches to improving efficiency are:
- Increase N (RNEs indicate good rules of thumb for the necessary increases). This brute-force method is often computationally prohibitive.
- Sequential updating. (Can still be expensive, but less brutish.)
28 (cont.) Sequential updating:
- Begin with an initial parameterization a_0 for g(θ|a) (e.g., (µ_0, Σ_0)).
- Calculate θ̂_0; map into a_1.
- Repeat until a_i yields an acceptable level of numerical accuracy.
29 (cont.) Returning to the example, RNEs and 1/RNEs evolve as follows across iterations 0 through 3: (table displayed on slide.)
30 (cont.) Weight plots, iteration 3: (plots displayed on slide.)
31 (cont.) Caveats regarding sequential updating:
- Convergence to the target is not guaranteed.
- Performance can be sensitive to starting values.
- The initial sampler should be sufficiently diffuse to ensure coverage of the appropriate range of the targeted integrand.
32 (cont.) Aside regarding coverage: the multivariate-t density is an attractive sampler relative to the normal density: it has similar location and shape parameters, but thicker tails. Tail thickness is controlled by the degrees-of-freedom parameter υ (smaller υ, thicker tails).
- Parameters of the multivariate-t: (γ, V, υ).
- Mean and second-order moments: γ and V · υ/(υ − 2).
33 (cont.) Algorithm for obtaining drawings µ_i from the multivariate-t(γ, V, υ):
- Obtain s_i from a χ²(υ) distribution: s_i = ∑_{j=1}^υ x_j², x_j ∼ N(0, 1).
- Use s_i to construct the scaling factor σ_i = (s_i/υ)^{−1/2}.
- Obtain µ_i as
µ_i = γ + σ_i V^{1/2} w_i, w_{i,j} ∼ N(0, 1), j = 1, ..., k, V^{1/2} = chol(V^{−1})′.
- Note: use of −w_i yields antithetic acceleration (Geweke, 1988, JoE).
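A Python sketch of these steps with antithetic pairs (using the standard construction with L = chol(V), V = LL′; the slide's chol(V^{−1})′ parameterization differs only in how the matrix square root is formed; γ, V, υ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_mvt(gamma, V, nu, n, rng):
    """Draw n antithetic pairs from a multivariate-t(gamma, V, nu)."""
    k = len(gamma)
    L = np.linalg.cholesky(V)                  # V = L @ L.T
    w = rng.standard_normal((n, k))
    s = rng.chisquare(nu, size=(n, 1))         # s_i: sum of nu squared N(0,1) draws
    sigma = (s / nu) ** -0.5                   # scaling factor sigma_i = (s_i/nu)^(-1/2)
    plus = gamma + sigma * (w @ L.T)
    minus = gamma - sigma * (w @ L.T)          # antithetic partner: use of -w_i
    return np.vstack([plus, minus])

gamma = np.array([0.0, 1.0])
V = np.array([[1.0, 0.3], [0.3, 1.0]])
nu = 10
draws = draw_mvt(gamma, V, nu, 50_000, rng)
```

By construction the antithetic pairs average to γ exactly, and the second moments approach V · υ/(υ − 2).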
34 (cont.) Exercise: replace the normal densities used as samplers in the exercise above with multivariate-t densities, with γ = µ and V^{−1} = Σ. Experiment with alternative υ's to assess the impact on numerical efficiency.
35 Efficient Importance Sampling (Richard and Zhang, 2007 JoE) Goal: tailor g(θ|a) (via the specification of a) to minimize the n.s.e. associated with the approximation of
G(Y) = ∫_Θ ϕ(θ|Y) dθ.
Write g(θ|a) as
g(θ|a) = k(θ; a)/χ(a), χ(a) = ∫_Θ k(θ; a) dθ.
36 (cont.) Details regarding the tailoring of g(θ|a) are distinct for two special cases:
- g(θ|a) is parametric (e.g., a normal distribution)
- g(θ|a) is piecewise-linear
37 (cont.) When g(θ|a) is fully parametric, the n.s.e. is (approximately) minimized via iterations on
(â_{l+1}, ĉ_{l+1}) = argmin_{a,c} Q_N(a, c; Y | â_l),
Q_N(a, c; Y | â_l) = (1/N) ∑_{i=1}^N d(θ_i^l, a, c, Y)² ω(θ_i^l; Y, â_l),
d(θ_i^l, a, c, Y) = ln ϕ(θ_i^l|Y) − c − ln k(θ_i^l; a),
ω(θ_i^l; Y, â_l) = ϕ(θ_i^l|Y)/g(θ_i^l|â_l).
The term c is a normalizing constant that controls for factors in ϕ and g that do not depend upon θ. Typically, it suffices to set ω(θ_i^l; Y, â_l) = 1 ∀ i.
38 (cont.) When g(θ|a) is piecewise-linear, the parameters a are grid points:
a′ = (a_0, ..., a_R), a_0 < a_1 < ... < a_R.
In this case, the kernel k(θ; a) is given by
ln k_j(θ; a) = α_j + β_j θ, ∀ θ ∈ [a_{j−1}, a_j],
β_j = [ln ϕ(a_j) − ln ϕ(a_{j−1})] / (a_j − a_{j−1}), α_j = ln ϕ(a_j) − β_j a_j.
Optimization is achieved by selecting â as an equal-probability division of the support of ϕ(θ|Y).
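The kernel coefficients above are simple chord slopes and intercepts of ln ϕ between grid points; a Python sketch (illustrative target: the unnormalized kernel of N(0,1)):

```python
import numpy as np

# Piecewise log-linear kernel coefficients for an illustrative target:
# ln phi of an (unnormalized) N(0,1) kernel.
def ln_phi(theta):
    return -0.5 * theta**2

a = np.linspace(-3.0, 3.0, 11)                                # a_0, ..., a_R with R = 10
beta = (ln_phi(a[1:]) - ln_phi(a[:-1])) / (a[1:] - a[:-1])    # beta_j
alpha = ln_phi(a[1:]) - beta * a[1:]                          # alpha_j
```

By construction the chord α_j + β_j θ interpolates ln ϕ exactly at both endpoints of each interval.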
39 (cont.) Given final estimates (â, ĉ), the estimate of G(Y) is given by
G(Y)_N = (1/N) ∑_{i=1}^N ω(θ_i; Y, â).
The n.s.e. is computed as indicated above.
40 EIS, Gaussian Sampler. To simplify notation, denote the targeted integrand as ϕ(θ|Y) ≡ ϕ(θ). We'll take θ as k-dimensional, with elements (x_1, x_2, ..., x_k). With g(θ|a) Gaussian, a consists of the k×1 vector of means µ and the k×k covariance matrix Σ. Since the covariance matrix is symmetric, the number of auxiliary parameters reduces to k + k(k+1)/2. The precision matrix is H = Σ^{−1}.
41 EIS, Gaussian Sampler, cont. Our goal is to choose (µ, H) to optimally approximate ln ϕ(θ) by a Gaussian kernel:
ln ϕ(θ) ≈ −(1/2)(θ − µ)′H(θ − µ) = −(1/2)(θ′Hθ − 2θ′Hµ) + const.
Recall that by optimally, we refer to the weighted-squared-error minimization introduced above.
42 EIS, Gaussian Sampler, cont. The term θ′Hθ can be written as
(x_1 x_2 ... x_k) [h_11 h_12 ... h_1k; h_21 h_22 ... h_2k; ...; h_k1 h_k2 ... h_kk] (x_1, x_2, ..., x_k)′.
43 EIS, Gaussian Sampler, cont. Expanding θ′Hθ, we obtain
θ′Hθ = h_11 x_1² + h_22 x_2² + ... + h_kk x_k²
+ 2h_21 (x_2 x_1) + 2h_31 (x_3 x_1) + ... + 2h_k1 (x_k x_1)
+ 2h_32 (x_3 x_2) + 2h_42 (x_4 x_2) + ... + 2h_k2 (x_k x_2)
+ ... + 2h_{k,k−1} (x_k x_{k−1}).
This expression indicates that the coefficients on the squares, pairwise products, and individual components of θ are in one-to-one correspondence with the elements of µ and H.
44 EIS, Gaussian Sampler, cont. Given this correspondence, the optimization problem amounts to a weighted-least-squares problem involving the regression of ln ϕ(θ) on
1, x_1, ..., x_k, x_1², x_2², ..., x_k², x_1 x_2, x_1 x_3, ..., x_{k−1} x_k.
The number of regressors is K = 1 + k + k(k+1)/2.
45 EIS, Gaussian Sampler, cont. Algorithm
- Specify a_0 and generate
y = (ln ϕ(θ_1), ..., ln ϕ(θ_M))′, w = (ω_1, ..., ω_M)′, ω_i = ϕ(θ_i|Y)/g(θ_i|a_0),
X = (κ_1, ..., κ_M)′, κ_i = (1 ~ θ_i′ ~ vech(θ_i θ_i′)′).
(Note: M << N.)
46 EIS, Gaussian Sampler, cont. Algorithm, cont.
- Construct ỹ = y .* (w.^2), X̃ = X .* (w.^2). (Caution: set w = 1 when using a poor initial sampling density.)
- Estimate β̂ = (X̃′X̃)^{−1} X̃′ỹ.
- Map β̂ into (µ̂, Σ̂). Jointly, these constitute a_1.
- Replacing a_0 above with a_1, repeat until convergence.
47 EIS, Gaussian Sampler, cont. Algorithm, cont. To map β̂ into (µ̂, Σ̂):
- Map elements k+2 through K of β̂ into a symmetric matrix H̃, with j-th diagonal element corresponding to the coefficient associated with the squared value of the j-th element of θ, and (j, l)-th element corresponding to the product of the j-th and l-th elements of θ. I.e., eh = xpnd(b[k+2:rows(b)]);
- Construct Ĥ by multiplying all elements of H̃ by −1, then multiplying the diagonal elements by 2.
- Σ̂ = Ĥ^{−1}.
- µ̂ = Σ̂ β̂[2:k+1].
48 EIS, Gaussian Sampler, cont. Exercise: return to the example outlined above. Using an initial sampler specified with µ_0 = µ + 3·sqrt(diag(Σ)), Σ_0 = 10·diagrv(Σ, sqrt(diag(Σ))), show that the algorithm yields µ̂ = µ, Σ̂ = Σ in one step using M = 50, w = 1. Experiment with alternative initial samplers.
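The regression step and the β̂ → (µ̂, Σ̂) mapping above can be sketched in Python for k = 2 (µ, Σ, and the poor initial sampler are illustrative, not the slide's values). Because ln ϕ is exactly quadratic when the target is Gaussian, one step with w = 1 recovers (µ, Σ) exactly, which is the point of the slide-48 exercise:

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative Gaussian target: ln phi is an exact quadratic kernel of N(mu, Sigma).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
H = np.linalg.inv(Sigma)

def ln_phi(th):
    d = th - mu
    return -0.5 * np.einsum('ij,jk,ik->i', d, H, d)   # up to an additive constant

# Draws from a deliberately poor, diffuse initial sampler a_0; w = 1 throughout.
M = 50
th = mu + 3.0 + 10.0 * rng.standard_normal((M, 2))

# Regressors kappa_i = (1, x1, x2, x1^2, x1*x2, x2^2).
x1, x2 = th[:, 0], th[:, 1]
X = np.column_stack([np.ones(M), x1, x2, x1**2, x1 * x2, x2**2])
beta = np.linalg.lstsq(X, ln_phi(th), rcond=None)[0]

# xpnd-style reshaping of the quadratic coefficients into a symmetric matrix,
# then: multiply all elements by -1, multiply the diagonal by 2.
Htilde = np.array([[beta[3], beta[4]],
                   [beta[4], beta[5]]])
H_hat = -Htilde.copy()
H_hat[np.diag_indices_from(H_hat)] *= 2.0

Sigma_hat = np.linalg.inv(H_hat)
mu_hat = Sigma_hat @ beta[1:3]        # linear coefficients equal H mu
```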
49 EIS, Piecewise-Linear Approximation. Context: effective for use in univariate cases featuring obvious deviations from normality. (The curse of dimensionality renders implementation problematic in high-dimensional cases.)
50 EIS, PW-L Approx., cont. Motivation 1. Mixture of normals, (µ, σ) = (−0.5, 0.25), (0.5, 0.5), equal weights. (Plot displayed on slide.)
51 EIS, PW-L Approx., cont. Approximation via Gaussian samplers (handmade and EIS): RNEs for calculating E(x): 0.61, 0.78. (Plots displayed on slide.)
52 EIS, PW-L Approx., cont. Motivation 2. Mixture of normals, (µ, σ) = (−1.5, 0.25), (1.5, 0.5), equal weights. RNE: 0.37. (Plot displayed on slide.)
53 EIS, PW-L Approx., cont. Motivation 3. Mixture of normals, (µ, σ) = (−1.5, 0.05), (1.5, 0.5), equal weights. RNE: 0.11. (Plot displayed on slide.)
54 EIS, PW-L Approx., cont. Reboot: recall that when g(θ|a) is piecewise-linear, the parameters a are grid points:
a′ = (a_0, ..., a_R), a_0 < a_1 < ... < a_R.
In this case, the kernel k(θ; a) is given by
ln k_j(θ; a) = α_j + β_j θ, ∀ θ ∈ [a_{j−1}, a_j],
β_j = [ln ϕ(a_j) − ln ϕ(a_{j−1})] / (a_j − a_{j−1}), α_j = ln ϕ(a_j) − β_j a_j.
Optimization is achieved by selecting â as an equal-probability division of the support of ϕ(θ|Y).
55 EIS, PW-L Approx., cont. In order to achieve an equal-probability division, and to implement the distribution as a sampler, we must
- construct its associated CDF
- invert the CDF.
56 EIS, PW-L Approx., cont. To gain intuition behind implementation and inversion, consider the CDF associated with Case 3 (displayed on slide). Inversion involves inducing a mapping from the y axis to the x axis; implementation involves mapping drawings obtained from a U[0, 1] distribution onto the x axis.
57 EIS, PW-L Approx., cont. Letting s denote the x-axis variable, the CDF of k can be written as
K_j(s; a) = χ_j(s; a)/χ_R(a), ∀ s ∈ [a_{j−1}, a_j],
χ_j(s; a) = χ_{j−1}(a) + (1/β_j)[k_j(s; a) − k_j(a_{j−1}; a)],
χ_0(a) = 0, χ_j(a) ≡ χ_j(a_j; a).
Note: χ_R(a) is the integrating constant associated with the p.d.f.
58 EIS, PW-L Approx., cont. Inversion/implementation:
s = (1/β_j){ln[k_j(a_{j−1}; a) + β_j (u χ_R(a) − χ_{j−1}(a))] − α_j},
u ∈ ]0, 1[ and χ_{j−1}(a) < u χ_R(a) < χ_j(a).
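The CDF recursion of slide 57 and the inversion formula of slide 58 can be sketched together in Python (illustrative target: the kernel of N(0,1); the grid and sample size are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative target: the (unnormalized) kernel of N(0,1).
def ln_phi(x):
    return -0.5 * x**2

a = np.linspace(-4.0, 4.0, 81)                 # grid a_0 < ... < a_R
lnp = ln_phi(a)
beta = (lnp[1:] - lnp[:-1]) / (a[1:] - a[:-1])            # beta_j
alpha = lnp[1:] - beta * a[1:]                            # alpha_j
k_right = np.exp(lnp[1:])                                 # k_j(a_j) = phi(a_j)
k_left = np.exp(lnp[:-1])                                 # k_j(a_{j-1}) = phi(a_{j-1})

# chi_j recursion: chi_j = chi_{j-1} + (1/beta_j)[k_j(a_j) - k_j(a_{j-1})].
chi = np.concatenate([[0.0], np.cumsum((k_right - k_left) / beta)])
chi_R = chi[-1]                                           # integrating constant

def sample(u):
    # Find j with chi_{j-1} < u*chi_R < chi_j, then apply the slide-58 inversion.
    t = u * chi_R
    j = np.searchsorted(chi, t) - 1                       # interval index
    return (np.log(k_left[j] + beta[j] * (t - chi[j])) - alpha[j]) / beta[j]

draws = sample(rng.uniform(1e-4, 1 - 1e-4, size=100_000))
```

Since the target here is nearly Gaussian, the draws should reproduce a standard normal closely.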
59 EIS, PW-L Approx., cont. Unrefined (equal spacing over the x range) piecewise approximation for Case 1 (plot displayed on slide). RNE: 0.99.
60 EIS, PW-L Approx., cont. Unrefined (equal spacing over the x range) piecewise approximation for Case 2 (plot displayed on slide). RNE: 0.99.
61 EIS, PW-L Approx., cont. Unrefined (equal spacing over the x range) piecewise approximation for Case 3 (plot displayed on slide). RNE: 0.24.
62 EIS, PW-L Approx., cont. To understand the source of the problem in Case 3, consider again the CDF of the unrefined sampler (displayed on slide). Note: relatively few grid points fall along the steep portions of the CDF.
63 EIS, PW-L Approx., cont. To address the problem: equal-probability division of the range of s. I.e., divide the vertical axis of the CDF into equal portions, then map into s:
u_i = ε + (1 − 2ε)(i/R), i = 1, ..., R − 1,
with ε sufficiently small (typically ε = 10^{−4}) to avoid tail intervals of excessive length.
64 EIS, PW-L Approx., cont. Iterative construction: given the step-l grid â^l, construct the density kernel k and its CDF K as described above. The step-(l+1) grid is then computed as
â_i^{l+1} = K^{−1}(u_i), i = 1, ..., R − 1.
Iterate until (approximate) convergence.
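The refinement loop can be sketched in Python (illustrative target: the kernel of N(0,1); the starting grid, ε, and the reconstructed form of u_i are assumptions of this sketch). Each step rebuilds the piecewise log-linear CDF on the current grid and places the new grid points at equal CDF increments:

```python
import numpy as np

# Illustrative target: the (unnormalized) kernel of N(0,1).
def ln_phi(x):
    return -0.5 * x**2

def refine(a, eps=1e-4):
    # Rebuild the piecewise log-linear kernel and its CDF on grid a,
    # then return the equal-probability grid a^{l+1} = K^{-1}(u_i).
    lnp = ln_phi(a)
    beta = (lnp[1:] - lnp[:-1]) / (a[1:] - a[:-1])
    alpha = lnp[1:] - beta * a[1:]
    k_r, k_l = np.exp(lnp[1:]), np.exp(lnp[:-1])        # kernel at interval endpoints
    chi = np.concatenate([[0.0], np.cumsum((k_r - k_l) / beta)])
    R = len(a) - 1
    u = eps + (1 - 2 * eps) * np.arange(1, R) / R        # u_i, i = 1, ..., R-1
    t = u * chi[-1]
    j = np.searchsorted(chi, t) - 1
    inner = (np.log(k_l[j] + beta[j] * (t - chi[j])) - alpha[j]) / beta[j]
    return np.concatenate([[a[0]], inner, [a[-1]]])      # keep the support endpoints

a = np.linspace(-4.0, 4.0, 41)                           # step-0 grid (equal x spacing)
for _ in range(3):
    a = refine(a)
```

After a few iterations the CDF increments between adjacent grid points are (nearly) equal, concentrating grid points on the steep portions of the CDF.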
65 EIS, PW-L Approx., cont. Refined approximation for Case 3 (plot displayed on slide). RNE: 0.95.
66 EIS, PW-L Approx., cont. Refined CDF (plot displayed on slide).
More informationBayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference
Bayesian Inference for Discretely Sampled Diffusion Processes: A New MCMC Based Approach to Inference Osnat Stramer 1 and Matthew Bognar 1 Department of Statistics and Actuarial Science, University of
More informationBayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007
Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.
More informationOnline Appendix to: Marijuana on Main Street? Estimating Demand in Markets with Limited Access
Online Appendix to: Marijuana on Main Street? Estating Demand in Markets with Lited Access By Liana Jacobi and Michelle Sovinsky This appendix provides details on the estation methodology for various speci
More informationLecture 2: From Linear Regression to Kalman Filter and Beyond
Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing
More informationBAYESIAN ANALYSIS OF DSGE MODELS - SOME COMMENTS
BAYESIAN ANALYSIS OF DSGE MODELS - SOME COMMENTS MALIN ADOLFSON, JESPER LINDÉ, AND MATTIAS VILLANI Sungbae An and Frank Schorfheide have provided an excellent review of the main elements of Bayesian inference
More informationA Bayesian perspective on GMM and IV
A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all
More informationComputer Intensive Methods in Mathematical Statistics
Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of
More informationGaussian Processes for Machine Learning
Gaussian Processes for Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics Tübingen, Germany carl@tuebingen.mpg.de Carlos III, Madrid, May 2006 The actual science of
More informationExercises Chapter 4 Statistical Hypothesis Testing
Exercises Chapter 4 Statistical Hypothesis Testing Advanced Econometrics - HEC Lausanne Christophe Hurlin University of Orléans December 5, 013 Christophe Hurlin (University of Orléans) Advanced Econometrics
More informationWeek 1 Quantitative Analysis of Financial Markets Distributions A
Week 1 Quantitative Analysis of Financial Markets Distributions A Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036 October
More informationANOVA: Analysis of Variance - Part I
ANOVA: Analysis of Variance - Part I The purpose of these notes is to discuss the theory behind the analysis of variance. It is a summary of the definitions and results presented in class with a few exercises.
More informationSTAT 425: Introduction to Bayesian Analysis
STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte
More informationMultivariate Time Series: VAR(p) Processes and Models
Multivariate Time Series: VAR(p) Processes and Models A VAR(p) model, for p > 0 is X t = φ 0 + Φ 1 X t 1 + + Φ p X t p + A t, where X t, φ 0, and X t i are k-vectors, Φ 1,..., Φ p are k k matrices, with
More informationECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering
ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationSimulation. Where real stuff starts
1 Simulation Where real stuff starts ToC 1. What is a simulation? 2. Accuracy of output 3. Random Number Generators 4. How to sample 5. Monte Carlo 6. Bootstrap 2 1. What is a simulation? 3 What is a simulation?
More informationIntroduction to Maximum Likelihood Estimation
Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationPOL 681 Lecture Notes: Statistical Interactions
POL 681 Lecture Notes: Statistical Interactions 1 Preliminaries To this point, the linear models we have considered have all been interpreted in terms of additive relationships. That is, the relationship
More informationPractical Bayesian Optimization of Machine Learning. Learning Algorithms
Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016 Motivation Machine Learning Algorithms (MLA s) have hyperparameters that
More informationStatistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart
Statistical and Learning Techniques in Computer Vision Lecture 2: Maximum Likelihood and Bayesian Estimation Jens Rittscher and Chuck Stewart 1 Motivation and Problem In Lecture 1 we briefly saw how histograms
More informationSome Approximations of the Logistic Distribution with Application to the Covariance Matrix of Logistic Regression
Working Paper 2013:9 Department of Statistics Some Approximations of the Logistic Distribution with Application to the Covariance Matrix of Logistic Regression Ronnie Pingel Working Paper 2013:9 June
More informationIn manycomputationaleconomicapplications, one must compute thede nite n
Chapter 6 Numerical Integration In manycomputationaleconomicapplications, one must compute thede nite n integral of a real-valued function f de ned on some interval I of
More informationFinal Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58
Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple
More informationStudent-t Process as Alternative to Gaussian Processes Discussion
Student-t Process as Alternative to Gaussian Processes Discussion A. Shah, A. G. Wilson and Z. Gharamani Discussion by: R. Henao Duke University June 20, 2014 Contributions The paper is concerned about
More information1 Data Arrays and Decompositions
1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is
More informationcomponent risk analysis
273: Urban Systems Modeling Lec. 3 component risk analysis instructor: Matteo Pozzi 273: Urban Systems Modeling Lec. 3 component reliability outline risk analysis for components uncertain demand and uncertain
More informationGeneralized Linear Models
Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.
More informationThe Kalman filter, Nonlinear filtering, and Markov Chain Monte Carlo
NBER Summer Institute Minicourse What s New in Econometrics: Time Series Lecture 5 July 5, 2008 The Kalman filter, Nonlinear filtering, and Markov Chain Monte Carlo Lecture 5, July 2, 2008 Outline. Models
More information(17) (18)
Module 4 : Solving Linear Algebraic Equations Section 3 : Direct Solution Techniques 3 Direct Solution Techniques Methods for solving linear algebraic equations can be categorized as direct and iterative
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More information