Stochastic Approximation Monte Carlo and Its Applications


Faming Liang
Department of Statistics, Texas A&M University

1. Liang, F., Liu, C. and Carroll, R.J. (2007) Stochastic approximation in Monte Carlo computation. JASA, 102.
2. Liang, F. (2007) Annealing stochastic approximation Monte Carlo for neural network training. Machine Learning, 68.
3. Liang, F. (2007) Continuous contour Monte Carlo for marginal density estimation with an application to a spatial statistical model. JCGS, 16(3).
4. Liang, F. (2007) Improving SAMC using smoothing methods: theory and applications. Annals of Statistics, to appear.
5. Cheon, S. and Liang, F. (2008) Phylogenetic tree reconstruction using stochastic approximation Monte Carlo. BioSystems, 91.
6. Liang, F. (2007) Improving stochastic approximation Markov chain Monte Carlo by trajectory averaging. Submitted to Bernoulli.
7. Liang, F., Chen, M-H. and Ibrahim, J.G. (2007) SAMC for Monte Carlo integration with applications to high dimensional regression problems. Submitted to Statistica Sinica.

Stochastic Approximation Monte Carlo: Problems

Two motivating examples

Example 1: Suppose we are interested in sampling from the following Gaussian mixture distribution,

$$f(x) = \tfrac{1}{3} N_2(\mu_1, \Sigma_1) + \tfrac{1}{3} N_2(\mu_2, \Sigma_2) + \tfrac{1}{3} N_2(\mu_3, \Sigma_3),$$

where $\mu_1 = ([…]8, […]8)$, $\Sigma_1 = (1\ […])$, $\mu_2 = (6, 6)$, $\Sigma_2 = (1\ […])$, $\mu_3 = (0, 0)$, $\Sigma_3 = ([…])$.

Figure 1 (panels (a) to (d)): Plot (a) shows the contour plot of the density.

Example 2: Consider minimizing the following function on $[-1.1, 1.1]^2$:

$$U(x, y) = -\big(x \sin(20y) + y \sin(20x)\big)^2 \cosh\big(\sin(10x)\,x\big) - \big(x \cos(10y) - y \sin(10x)\big)^2 \cosh\big(\cos(20y)\,y\big),$$

whose global minimum is […], attained at $(x, y) = ([…], […])$ and $(1.0445, […])$.
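To get a feel for how rugged this energy surface is, the function can be evaluated on a grid. The sketch below assumes the sign convention in which both squared terms enter with a minus sign (so that U is non-positive and the problem is a minimization); the function name `U` and the grid resolution are illustrative choices, not part of the slides.

```python
import numpy as np

def U(x, y):
    """Energy function of Example 2 (signs assumed: both terms negative)."""
    term1 = (x * np.sin(20 * y) + y * np.sin(20 * x)) ** 2 * np.cosh(np.sin(10 * x) * x)
    term2 = (x * np.cos(10 * y) - y * np.sin(10 * x)) ** 2 * np.cosh(np.cos(20 * y) * y)
    return -term1 - term2

# Evaluate on a regular grid over [-1.1, 1.1]^2 (resolution is an arbitrary choice).
grid = np.linspace(-1.1, 1.1, 221)
X, Y = np.meshgrid(grid, grid)
Z = U(X, Y)

# Location of the lowest grid value; a grid search only approximates the true minimum.
i = np.unravel_index(np.argmin(Z), Z.shape)
print("grid minimum", Z[i], "at", (X[i], Y[i]))
```

A grid search like this is only practical in two dimensions; it is shown here to visualize the landscape, not as a competitor to the samplers discussed later.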

Figure 2: (a) Grid and (b) contour representation of a function defined on $[-1.1, 1.1]^2$.

The above examples can be formulated as simulating from a Boltzmann distribution,

$$f(x) = c\,\psi(x), \quad x \in \mathcal{X}, \tag{1}$$

where $c$ is a constant, $\psi(x) = \exp(-U(x)/\tau)$, $\tau$ is called the temperature, and $U(x)$ is called the energy function.

Two basic MCMC algorithms:
(1) The Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970).
(2) The Gibbs sampler (Geman and Geman, 1984).

Metropolis-Hastings Algorithm

(a) Propose a new state $y$ from a proposal distribution $T(x_t \to y)$, where $x_t$ denotes the state of the Markov chain at time $t$.
(b) Accept $y$ with probability
$$\min\left\{ \frac{\psi(y)\, T(y \to x_t)}{\psi(x_t)\, T(x_t \to y)},\ 1 \right\}.$$
If it is accepted, set $x_{t+1} = y$; otherwise, set $x_{t+1} = x_t$.
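Steps (a) and (b) can be sketched in a few lines. This is a minimal illustration, assuming a symmetric Gaussian random-walk proposal (so the T terms cancel in the acceptance ratio) and working with log ψ for numerical stability; the names `metropolis_hastings`, `log_psi`, and `step` are hypothetical.

```python
import numpy as np

def metropolis_hastings(log_psi, x0, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis sampler for an unnormalized density psi.

    Assumes a symmetric Gaussian proposal, so T(y -> x) / T(x -> y) = 1.
    """
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    lp = log_psi(x)
    samples = np.empty((n_iter, x.size))
    for t in range(n_iter):
        # (a) propose y from the random-walk proposal centered at x_t
        y = x + step * rng.standard_normal(x.size)
        lp_y = log_psi(y)
        # (b) accept with probability min{psi(y) / psi(x_t), 1}
        if np.log(rng.random()) < lp_y - lp:
            x, lp = y, lp_y
        samples[t] = x
    return samples

# Usage: sample a standard normal, whose log-density is -x^2/2 up to a constant.
draws = metropolis_hastings(lambda x: -0.5 * float(np.sum(x ** 2)), x0=0.0, n_iter=20000)
```

On a unimodal target such as this one the sampler mixes well; the local-trap difficulty appears when the same code is pointed at a multimodal energy landscape.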

Difficulty

On the energy landscape of these systems, there are a multitude of local minima separated by high energy barriers. The sampler tends to get trapped in one of the local energy minima indefinitely, rendering the simulation ineffective.

Typical Problems in Scientific Computation
1. Protein folding.
2. Phylogenetic tree reconstruction.
3. Neural networks.
4. Some spatial statistical problems, e.g., the Ising model and disease mapping.

Literature review

Strategies for improving MCMC

1. The use of auxiliary variables:
- Swendsen-Wang algorithm (Swendsen and Wang, 1987)
- Parallel tempering (Geyer, 1991)
- Simulated tempering (Marinari and Parisi, 1992)
- Evolutionary Monte Carlo (Liang and Wong, 2001)

Strengths and weaknesses: The temperature is typically treated as an auxiliary variable. Simulations at high temperatures broaden the sampling space and thus help the system escape from local energy minima.

2. The use of past samples:
- Multicanonical algorithm (Berg and Neuhaus, 1991)
- 1/k-ensemble algorithm (Hesselbo and Stinchcombe, 1995; Liang, 2004)
- Wang-Landau (WL) algorithm (Wang and Landau, 2001; Liang, 2005)
- Dynamic weighting (Wong and Liang, 1997)
- Dynamically weighted importance sampling (Liang, 2002)

Strengths and weaknesses:
- Dynamic weighting: The variability of the weights is too high.
- Multicanonical and related algorithms: They are usually used for discrete systems. In the multicanonical algorithm, the trial distribution is defined as
$$f(x) \propto \frac{1}{\#\{y : U(y) = U(x)\}},$$
where $x$ and $y$ take values on a discrete set. There is no rigorous theory to support their convergence.

Algorithm

Basic Idea

Partition the sample space into disjoint subregions $E_1, \ldots, E_m$, with $\bigcup_{i=1}^{m} E_i = \mathcal{X}$ and $E_i \cap E_j = \emptyset$ for $i \neq j$.

Let $g_i = \int_{E_i} \psi(x)\,dx$, and choose $\pi = (\pi_1, \ldots, \pi_m)$ with $\pi_i > 0$ and $\sum_i \pi_i = 1$.

Sample from the distribution
$$p_\theta(x) \propto \sum_{i=1}^{m} \frac{\psi(x)}{e^{\theta^{(i)}}}\, I(x \in E_i).$$

If $\theta^{(i)} = \log(g_i/\pi_i)$ for all $i$, sampling from $p_\theta(x)$ results in a random walk in the space of subregions, with each subregion sampled with probability $\pi_i$ (viewing each subregion as a single "point"). Therefore, sampling from $p_\theta(x)$ can avoid the local-trap problem encountered in sampling from $f(x)$.
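The claim that each subregion is visited with probability $\pi_i$ follows directly from the definitions. The short check below assumes every $g_i > 0$ and writes $Z_\theta$ for the normalizing constant of $p_\theta$ (a symbol introduced here only for convenience).

```latex
% With \theta^{(i)} = \log(g_i / \pi_i), the unnormalized mass of E_i is
\int_{E_i} \frac{\psi(x)}{e^{\theta^{(i)}}}\,dx
  = \frac{g_i}{g_i/\pi_i}
  = \pi_i ,
% so the normalizing constant and the subregion probabilities are
\qquad
Z_\theta = \sum_{i=1}^{m} \pi_i = 1 ,
\qquad
P_\theta(x \in E_i) = \frac{\pi_i}{Z_\theta} = \pi_i .
```

In other words, dividing $\psi$ by $g_i/\pi_i$ on each subregion flattens the mass across subregions to the chosen profile $\pi$, regardless of how unevenly $\psi$ distributes it.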

Algorithm Setting

Condition (A$_1$): The sequence $\{a_k\}_{k=0}^{\infty}$ is positive, non-increasing, and satisfies the conditions

$$\sum_{k=1}^{\infty} a_k = \infty, \qquad \lim_{k \to \infty} (k a_k) = \infty, \qquad \lim_{k \to \infty} \left(a_{k+1}^{-1} - a_k^{-1}\right) = 0, \tag{2}$$

and, for some $\tau \in (0, 1)$,

$$\sum_{k=1}^{\infty} \frac{a_k^{(1+\tau)/2}}{\sqrt{k}} < \infty. \tag{3}$$

It is clear that $a_k = 1/k^{\eta}$, $\eta \in (1/2, 1)$, satisfies (2). Then (3) holds for any $\tau > 1/\eta - 1$.
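The threshold $\tau > 1/\eta - 1$ can be verified with a p-series argument. The derivation below reads condition (3) as $\sum_k a_k^{(1+\tau)/2}/\sqrt{k} < \infty$, which is an interpretation of the slide's formula, and substitutes $a_k = k^{-\eta}$.

```latex
\sum_{k=1}^{\infty} \frac{a_k^{(1+\tau)/2}}{\sqrt{k}}
  = \sum_{k=1}^{\infty} k^{-\left(\frac{\eta(1+\tau)}{2} + \frac{1}{2}\right)}
  < \infty
\iff \frac{\eta(1+\tau)}{2} + \frac{1}{2} > 1
\iff \tau > \frac{1}{\eta} - 1 .
% A valid \tau must also lie in (0,1), which requires 1/\eta - 1 < 1, i.e. \eta > 1/2.
```

This also explains the lower bound on $\eta$: for $\eta \le 1/2$ no $\tau \in (0, 1)$ can satisfy the inequality.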

Algorithm

1. (Sampling) Draw a sample $x_{k+1}$ with a single MH iteration for which the invariant distribution is
$$p_{\theta_k}(x) = \frac{1}{Z_k} \sum_{i=1}^{m} \frac{\psi(x)}{e^{\theta_k^{(i)}}}\, I(x \in E_i).$$
2. (Weight updating) Set
$$\theta^{*} = \theta_k + a_{k+1} H(\theta_k, x_{k+1}),$$
where $H(\theta_k, x_{k+1}) = e_{x_{k+1}} - \pi$ and $e_{x_{k+1}} = (I(x_{k+1} \in E_1), \ldots, I(x_{k+1} \in E_m))$.
3. (Varying truncation) If $\theta^{*} \in \Theta$, set $\theta_{k+1} = \theta^{*}$. Otherwise, set $\theta_{k+1} = \theta^{*} + c$, where $c$ is chosen such that $\theta^{*} + c \in \Theta$.
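The sampling and weight-updating steps can be sketched for a small discrete state space. This is an illustrative sketch under several assumptions that are not in the slides: the unnormalized masses `psi` are made-up values, the MH proposal is uniform over all states, the gain sequence is $a_k = t_0/\max(t_0, k)$, and the truncation step is omitted by taking $\Theta$ to be all of $\mathbb{R}^m$.

```python
import numpy as np

def samc(psi, region, pi, n_iter=100_000, t0=1000, seed=0):
    """Minimal SAMC on a finite state space {0, ..., n-1}.

    psi    : unnormalized masses (hypothetical values in the usage below)
    region : region[x] is the 0-based index of the subregion containing x
    pi     : desired sampling distribution over subregions
    Assumes gain a_k = t0 / max(t0, k) and no truncation (Theta = R^m).
    """
    rng = np.random.default_rng(seed)
    log_psi = np.log(np.asarray(psi, dtype=float))
    region = np.asarray(region)
    pi = np.asarray(pi, dtype=float)
    theta = np.zeros(len(pi))
    counts = np.zeros(len(pi))
    proposals = rng.integers(len(log_psi), size=n_iter)  # uniform proposal
    log_u = np.log(rng.random(n_iter))
    x = 0
    for k in range(1, n_iter + 1):
        # 1. Sampling: one MH step targeting p_theta(x) ∝ psi(x) / exp(theta[region[x]])
        y = proposals[k - 1]
        log_ratio = (log_psi[y] - theta[region[y]]) - (log_psi[x] - theta[region[x]])
        if log_u[k - 1] < log_ratio:
            x = y
        # 2. Weight updating: theta <- theta + a_k * (e_x - pi)
        a = t0 / max(t0, k)
        theta -= a * pi
        theta[region[x]] += a
        counts[region[x]] += 1
    return theta, counts / n_iter

# Usage with hypothetical masses on 10 states and 5 subregions.
psi = [1, 100, 2, 1, 3, 3, 1, 200, 2, 1]
region = [4, 1, 3, 4, 2, 2, 4, 0, 3, 4]
theta, freq = samc(psi, region, pi=np.full(5, 0.2))
```

With a uniform `pi`, the visiting frequencies `freq` should come out close to 0.2 each, and differences such as `theta[0] - theta[4]` should approximate `log(g_0 / g_4)`, where `g_i` is the total mass of subregion `i`.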

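Lyapunov condition on $h(\theta)$

Let $\langle x, y \rangle$ denote the Euclidean inner product.

(A$_2$) The function $h : \Theta \to \mathbb{R}^d$ is continuous, and there exists a continuously differentiable function $v : \Theta \to [0, \infty)$ such that
(i) For any integer $M > 0$, the level set $V_M = \{\theta \in \Theta,\ v(\theta) \le M\} \subset \Theta$ is compact.
(ii) There exists $M_0 > 0$ such that $\tilde{\Theta} = \{\theta \in \Theta,\ \langle \nabla v(\theta), h(\theta) \rangle = 0\} \subset \mathrm{int}(V_{M_0})$, and $\langle \nabla v(\theta), h(\theta) \rangle < 0$ for any $\theta \in \Theta \setminus V_{M_0}$, where $\mathrm{int}(A)$ denotes the interior of the set $A$.
(iii) For all $\theta \in \Theta$, $\langle \nabla v(\theta), h(\theta) \rangle \le 0$, and $\mathrm{int}(v(\tilde{\Theta})) = \emptyset$.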

Stability condition on $h(\theta)$

(A$_3$) The mean field function $h(\theta)$ is measurable and locally bounded. There exist a stable matrix $F$ (i.e., all eigenvalues of $F$ have negative real parts), $\gamma > 0$, and $\rho \in (\tau, 1]$ such that for any point $\theta_0 \in \Theta$,
$$\| h(\theta) - F(\theta - \theta_0) \| \le c_1 \| \theta - \theta_0 \|^{1+\rho}, \quad \forall \theta \in \{\theta : \| \theta - \theta_0 \| \le \gamma\},$$
where $c_1$ is a constant.

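Drift condition

(A$_4$) For any $\theta \in \Theta$, the transition kernel $P_\theta$ is irreducible and aperiodic. In addition, there exist a function $V : \mathcal{X} \to [1, \infty)$ and constants $\alpha \ge 2$ and $\beta \in (0, 1]$ such that for any compact subset $K \subset \Theta$:

(i) There exist a set $C \subset \mathcal{X}$, an integer $l$, constants $0 < \lambda < 1$, $b$, $\varsigma$, $\delta > 0$, and a probability measure $\nu$ such that
$$\sup_{\theta \in K} P_\theta^{l} V^{\alpha}(x) \le \lambda V^{\alpha}(x) + b\, I(x \in C), \quad \forall x \in \mathcal{X}, \tag{4}$$
$$\sup_{\theta \in K} P_\theta V^{\alpha}(x) \le \varsigma V^{\alpha}(x), \quad \forall x \in \mathcal{X}, \tag{5}$$
$$\sup_{\theta \in K} P_\theta^{l}(x, A) \ge \delta \nu(A), \quad \forall x \in C,\ \forall A \in \mathcal{B}. \tag{6}$$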

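(ii) There exists a constant $c$ such that for all $x \in \mathcal{X}$,
$$\sup_{\theta \in K} \| H(\theta, x) \| \le c V(x), \tag{7}$$
$$\| H(\theta, x) - H(\theta', x) \| \le c V(x) \| \theta - \theta' \|^{\beta}, \quad \forall (\theta, \theta') \in K \times K. \tag{8}$$

(iii) There exists a constant $c_2$ such that for all $(\theta, \theta') \in K \times K$,
$$\| P_\theta g - P_{\theta'} g \|_{V} \le c_2 \| g \|_{V} \| \theta - \theta' \|^{\beta}, \quad \forall g \in \mathcal{L}_{V}, \tag{9}$$
$$\| P_\theta g - P_{\theta'} g \|_{V^{\alpha}} \le c_2 \| g \|_{V^{\alpha}} \| \theta - \theta' \|^{\beta}, \quad \forall g \in \mathcal{L}_{V^{\alpha}}. \tag{10}$$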

Theoretical Results

Lemma 1. Assume the drift condition (A$_4$) and $\sup_{x \in \mathcal{X}} V(x) < \infty$. Let $\epsilon_k = H(\theta_k, x_{k+1}) - h(\theta_k)$. There exist $\mathbb{R}^d$-valued random processes $\{e_k\}_{k \ge 1}$, $\{\nu_k\}_{k \ge 1}$, and $\{\varsigma_k\}_{k \ge 1}$ defined on a probability space $(\Omega, \mathcal{F}, P)$ such that
(i) $\epsilon_k = e_k + \nu_k + \varsigma_k$.
(ii) $\{e_k\}$ is a martingale difference sequence, and $\frac{1}{\sqrt{n}} \sum_{k=1}^{n} e_k \to N(0, Q)$ in distribution, where $Q = \lim_{k \to \infty} E(e_k e_k^{T})$.
(iii) $E \| \nu_k \| = O(a_k^{(1+\tau)/2})$, where $\tau$ is given in condition (A$_1$).
(iv) $\| \sum_{k=0}^{n} a_k \varsigma_k \| = O(a_n)$.

THEOREM 1 (Convergence) (Liang et al., 2007) Let $\alpha_\sigma$ denote the iteration number at which the $\sigma$-th truncation occurs in the SAMC simulation. Assume that conditions (A$_1$) and (A$_2$) hold, and that there exists a drift function $V(x)$ such that $\sup_{x \in \mathcal{X}} V(x) < \infty$ and the drift condition (A$_4$) holds. Then there exists a number $\sigma$ such that $\alpha_\sigma < \infty$ a.s., $\alpha_{\sigma+1} = \infty$ a.s., and $\{\theta_k\}$ given by the SAMC algorithm has no truncation for $k \ge \alpha_\sigma$, i.e.,
$$\theta_{k+1} = \theta_k + a_{k+1} H(\theta_k, x_{k+1}), \quad \forall k \ge \alpha_\sigma,$$
and
$$\theta_k^{(i)} \to \begin{cases} c + \log\left(\int_{E_i} \psi(x)\,dx\right) - \log(\pi_i + \nu), & \text{if } E_i \neq \emptyset, \\ -\infty, & \text{if } E_i = \emptyset, \end{cases} \tag{11}$$
where $\nu = \sum_{j \in \{i : E_i = \emptyset\}} \pi_j / (m - m_0)$ and $m_0$ is the number of empty subregions. The constant $c$ can be determined by imposing an extra constraint on $\theta_k$, e.g., $\theta_k^{(m)} = 0$ for all $k \ge 0$.

THEOREM 2 (Averaging Normality) (Liang 2007, submitted) Under the conditions of Theorem 1, the averaged estimator $\bar{\theta}_k = \frac{1}{k} \sum_{i=1}^{k} \theta_i$ is asymptotically efficient; that is,
$$\sqrt{k}\,(\bar{\theta}_k - \theta_0) \to N(0, S) \quad \text{as } k \to \infty,$$
where $S = F^{-1} Q (F^{-1})^{T}$, and $Q$ is as defined in Lemma 1.

THEOREM 3 (IWIW property) (Liang et al., 2007, submitted) If the desired sampling distribution is uniform over all the subregions, i.e., $\pi_1 = \cdots = \pi_m = 1/m$, then SAMC is invariant with respect to the importance weights (IWIW).

This theorem implies that the integral $E_f h(x)$ can be estimated online by
$$\hat{E}_f h(x) = \frac{\sum_{k=1}^{n} w_k h(x_k)}{\sum_{k=1}^{n} w_k},$$
where $w_k = \sum_{i=1}^{m} e^{\theta_k^{(i)}} I(x_k \in E_i)$. As $n \to \infty$, $\hat{E}_f h(x) \to E_f h(x)$, for the same reason that the usual importance sampling estimate converges.
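The weighted estimator is a self-normalized average and is straightforward to compute from a stored run. The helper below is a sketch under the assumption that the run has recorded, for each iteration k, the value $h(x_k)$, the subregion index of $x_k$, and the weight vector $\theta_k$; the function name and argument layout are hypothetical.

```python
import numpy as np

def weighted_estimate(h_values, region_index, theta_path):
    """Self-normalized estimate sum_k w_k h(x_k) / sum_k w_k.

    h_values     : h(x_k) for k = 1..n
    region_index : 0-based index i with x_k in E_i, for k = 1..n
    theta_path   : array of shape (n, m); row k holds theta_k
    """
    h_values = np.asarray(h_values, dtype=float)
    region_index = np.asarray(region_index)
    theta_path = np.asarray(theta_path, dtype=float)
    # log w_k = theta_k^{(i)} for the subregion i that contains x_k
    log_w = theta_path[np.arange(len(region_index)), region_index]
    # Subtracting the maximum avoids overflow; the constant cancels in the ratio.
    w = np.exp(log_w - log_w.max())
    return float(np.sum(w * h_values) / np.sum(w))

# Usage: two samples, the second carrying three times the weight of the first.
theta_path = np.array([[0.0, np.log(3.0)], [0.0, np.log(3.0)]])
estimate = weighted_estimate([1.0, 3.0], [0, 1], theta_path)
```

Because only weight ratios matter, the unknown additive constant c in the limit of $\theta_k$ has no effect on the estimate.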

Implementation Issues

1. Sample space partition. The partition can be chosen according to the goal and the complexity of the problem. Here are some examples:
(a) Importance sampling: partition by the energy function, with maximum energy difference 2.
(b) Model selection: partition by the model index.

2. Desired sampling distribution.
(a) Set $\pi$ to bias the sampling toward low energy regions if the aim is to minimize the energy function.
(b) Set $\pi$ to be uniform if the aim is estimation.

3. Choices of $\eta$, $t_0$ and the number of iterations. The diagnostic statistic is
$$\epsilon_f(E_i) = \begin{cases} \dfrac{\hat{\pi}_i - (\pi_i + \nu)}{\pi_i + \nu} \times 100\%, & \text{if } E_i \neq \emptyset, \\ 0, & \text{if } E_i = \emptyset, \end{cases} \tag{12}$$
for $i = 1, \ldots, m$, where $\hat{\pi}_i$ denotes the realized sampling frequency of subregion $E_i$. If $\max_{1 \le i \le m} |\epsilon_f(E_i)|$ is large, say, greater than 10%, the convergence of the run should be questioned. In this case, SAMC should be re-run with more iterations, a larger value of $t_0$, or a smaller value of $\eta$.
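The diagnostic in (12) can be computed from the visit counts of a finished run. The sketch below assumes that a subregion with zero visits is treated as empty, which is a practical stand-in for $E_i = \emptyset$ rather than something the slides state; the function name is hypothetical.

```python
import numpy as np

def sampling_error(counts, pi):
    """Relative sampling error (in percent) of each subregion, as in (12).

    counts : number of visits to each subregion during the run
    pi     : desired sampling distribution
    Assumption: subregions with zero visits are treated as empty.
    """
    counts = np.asarray(counts, dtype=float)
    pi = np.asarray(pi, dtype=float)
    empty = counts == 0
    m, m0 = len(pi), int(empty.sum())
    # nu redistributes the mass of the empty subregions evenly over the others
    nu = pi[empty].sum() / (m - m0)
    freq = counts / counts.sum()
    eps = np.zeros(m)
    target = pi[~empty] + nu
    eps[~empty] = (freq[~empty] - target) / target * 100.0
    return eps

# Usage: four subregions, two of which were never visited.
eps = sampling_error([60, 40, 0, 0], [0.25, 0.25, 0.25, 0.25])
needs_rerun = np.abs(eps).max() > 10.0
```

In this toy input the two visited subregions deviate by 20% from their adjusted target of 0.5, so the 10% rule of thumb would flag the run.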

SAMC Applications: Demonstration

Table 1: The unnormalized mass function of the 10-state distribution.

  x       […]
  f(x)    […]

Table 2: Comparison of SAMC and MH for the 10-state example, where the Bias and Standard Error (of the Bias) were calculated based on 100 independent runs.

  Algorithm   Bias (x 10^-3)   Standard Error (x 10^-3)   CPU time (seconds)
  SAMC        […]              […]                        […]
  MH          […]              […]                        […]

The sample space was partitioned according to the mass function into five subregions: $E_1 = \{8\}$, $E_2 = \{2\}$, $E_3 = \{5, 6\}$, $E_4 = \{3, 9\}$ and $E_5 = \{1, 4, 7, 10\}$. The desired sampling distribution is uniform over the 5 subregions.

Figure 3: Computational results for the 10-state example. (a) Autocorrelation plot of the MH samples. (b) Autocorrelation plot of the SAMC samples. (c) Log-weight of the SAMC samples (horizontal axis: iterations, in thousands).

Importance Sampling: Spatial Autologistic Models

Let $s = \{s_i : i \in D\}$ denote the observed binary data, where $s_i$ is called a spin and $D$ is the set of indices of the spins. Let $|D|$ denote the total number of spins in $D$, and $N(i)$ denote the set of neighbors of spin $i$. The likelihood function of the model is
$$f(s \mid \alpha, \beta) = \frac{1}{\varphi(\alpha, \beta)} \exp\left\{ \alpha \sum_{i \in D} s_i + \frac{\beta}{2} \sum_{i \in D} s_i \Big( \sum_{j \in N(i)} s_j \Big) \right\}, \quad (\alpha, \beta) \in \Omega,$$
where
$$\varphi(\alpha, \beta) = \sum_{\text{for all possible } s} \exp\left\{ \alpha \sum_{j \in D} s_j + \frac{\beta}{2} \sum_{i \in D} s_i \Big( \sum_{j \in N(i)} s_j \Big) \right\}. \tag{13}$$
When $\beta$ is large, say, 0.5, the configuration $s$ tends to have large clusters of the same orientation, which fluctuate very slowly.

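Methods to resolve the difficulty in evaluating the normalizing constant:

(a) Working on a pseudo-likelihood function (Besag, 1975):
$$PL(\alpha, \beta \mid s) = \prod_{i \in D} \frac{e^{s_i (\alpha + \beta \sum_{j \in N(i)} s_j)}}{e^{\alpha + \beta \sum_{j \in N(i)} s_j} + e^{-\alpha - \beta \sum_{j \in N(i)} s_j}}. \tag{14}$$
The resulting estimate is called the MPLE.

(b) Working on a Monte Carlo log-likelihood (up to a constant) (Geyer and Thompson, 1992):
$$L_n(\alpha, \beta \mid s) = \alpha \sum_{i \in D} s_i + \frac{\beta}{2} \sum_{i \in D} s_i \Big( \sum_{j \in N(i)} s_j \Big) - \log\left[ \frac{1}{n} \sum_{k=1}^{n} \frac{\psi(\alpha, \beta, s^{(k)})}{\psi(\alpha^{*}, \beta^{*}, s^{(k)})} \right], \tag{15}$$
where $s^{(1)}, \ldots, s^{(n)}$ are samples simulated at a reference parameter value $(\alpha^{*}, \beta^{*})$. The resulting estimate is called the MCMLE.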
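The pseudo-likelihood in (14) avoids the normalizing constant entirely, which makes it easy to evaluate. The sketch below computes its logarithm under assumptions that are not stated in the slides: spins take values in {-1, +1}, the data lie on a rectangular lattice, and N(i) is the four nearest neighbors with free boundaries. The function name is hypothetical.

```python
import numpy as np

def log_pseudo_likelihood(s, alpha, beta):
    """Log of the pseudo-likelihood (14) on a rectangular lattice.

    Assumes spins in {-1, +1} and a 4-nearest-neighbor system with free
    (non-periodic) boundaries.
    """
    s = np.asarray(s, dtype=float)
    # Sum of neighboring spins for every site
    nb = np.zeros_like(s)
    nb[1:, :] += s[:-1, :]   # neighbor above
    nb[:-1, :] += s[1:, :]   # neighbor below
    nb[:, 1:] += s[:, :-1]   # neighbor to the left
    nb[:, :-1] += s[:, 1:]   # neighbor to the right
    eta = alpha + beta * nb
    # log of e^{s*eta} / (e^{eta} + e^{-eta}), summed over sites
    return float(np.sum(s * eta - np.logaddexp(eta, -eta)))

# Usage: an aligned configuration versus a checkerboard on a 3 x 3 lattice.
aligned = np.ones((3, 3))
rows, cols = np.indices((3, 3))
checker = 2.0 * ((rows + cols) % 2) - 1.0
```

With a positive interaction parameter, `log_pseudo_likelihood(aligned, 0.0, 0.5)` exceeds `log_pseudo_likelihood(checker, 0.0, 0.5)`, reflecting the model's preference for clusters of like spins. Maximizing this function over (alpha, beta) with a generic optimizer would give the MPLE.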

A natural choice for the trial distribution is a mixture distribution of the form
$$p_{mix}(s) = \frac{1}{m} \sum_{j=1}^{m} p(s \mid \alpha_j, \beta_j), \tag{16}$$
where the values of the parameters $(\alpha_1, \beta_1), \ldots, (\alpha_m, \beta_m)$ are prespecified. To complete this idea, the key is to estimate $\varphi(\alpha_1, \beta_1), \ldots, \varphi(\alpha_m, \beta_m)$ (up to a common multiplicative constant).

  Estimate            single-MCMLE   SAMC
  RMSE(T_1^{sim})     […]            […]
  RMSE(T_2^{sim})     […]            […]

Table 3: Comparison of the accuracy of the SAMC and single-MCMLE estimates for the U.S. cancer data. Here $T_1 = \sum_i s_i$, $T_2 = \sum_i s_i (\sum_j s_j)/2$, and $\mathrm{RMSE}(T_i^{sim})$ is calculated as $\sqrt{\sum_{k=1}^{5} (T_i^{sim,k} - T_i^{obs})^2 / 5}$, where $i = 1, 2$, and $T_i^{sim,k}$ denotes the value of $T_i^{sim}$ calculated based on the $k$-th estimate of $(\alpha, \beta)$.

Figure 4: The U.S. cancer mortality rate data. (a) True observations: the mortality map of liver and gallbladder cancer (including bile ducts) for white males during the decade […]. The black squares denote the counties of high cancer mortality rate, and the white squares denote the counties of low cancer mortality rate. (b) Fitted cancer mortality rates. The cancer mortality rate of each county is represented by the gray level of the corresponding square.

Figure 5: Computational results of SAMC. (a) Estimate of $\log \varphi(\alpha, \beta)$ on a lattice with $\alpha \in \{-0.5, -0.45, \ldots, 0.5\}$ and $\beta \in \{0, 0.05, \ldots, 0.5\}$. (b) Estimate of $P(s_i = +1 \mid \alpha, \beta)$ on a lattice with $\alpha \in \{-0.49, -0.47, \ldots, 0.49\}$ and $\beta \in \{0.01, 0.03, \ldots, 0.49\}$.

Figure 6: Comparison of SAMC and RJMCMC. Plots (a) and (b) show, respectively, the sample paths of $\alpha$ and $\beta$ in a run of SAMC. Plots (c) and (d) show, respectively, the sample paths of $\alpha$ and $\beta$ in a run of RJMCMC. (Horizontal axes: iteration, from 0 to $10^8$.)

Kernel Smoothing: SSAMC

Motivation for Smoothing SAMC

Intuitively, $x_t$ may contain some information on its neighboring subregions, so visits to its neighboring subregions should also be penalized to some extent in the next iteration. The efficiency of SAMC can be improved by including at each iteration a smoothing step, which distributes the information contained in each sample to its neighboring subregions. The new algorithm is thus called smoothing SAMC, or SSAMC for short.

Motivation Examples

We note that for many problems, $E_1, \ldots, E_m$ can be regarded as a sequence of naturally ordered categories. Here are some examples.
- Model selection: The model space $\mathcal{X}$ can be partitioned according to the index of models, and the subregions can be naturally ordered according to the number of parameters contained in each model.
- Function optimization: The solution space $\mathcal{X}$ can be partitioned according to the objective function, and the subregions can also be naturally ordered according to the objective function.

Suppose that $x_k^{(1)}, \ldots, x_k^{(\kappa)}$ are samples generated using an MH kernel with the invariant distribution $p_{\theta_k}(x)$. Since $\kappa$ is usually a small number, say, 10 to 20, the samples form a sparse frequency vector $e_{x_k} = (e_k^{(1)}, \ldots, e_k^{(m)})$ with $e_k^{(i)} = \sum_{l=1}^{\kappa} I(x_k^{(l)} \in E_i)$.

The frequency estimate can be improved by a smoothing method. The Nadaraya-Watson (NW) kernel estimator works as follows:
$$\hat{p}_k^{(i)} = \frac{\sum_{j=1}^{m} W\left( \frac{\Lambda (i - j)}{m h_k} \right) \frac{e_k^{(j)}}{\kappa}}{\sum_{j=1}^{m} W\left( \frac{\Lambda (i - j)}{m h_k} \right)}, \tag{17}$$
where $W(z)$ is a kernel function with bandwidth $h_k$, and $\Lambda$ is a rough estimate of the range of $\lambda(x)$, $x \in \mathcal{X}$. By assuming that $W(z)$ has a bounded support, we can show that $\hat{p}_k^{(i)} - e_k^{(i)}/\kappa = O(h_k)$.
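The smoothing step in (17) amounts to a weighted moving average over neighboring subregion indices. The sketch below assumes an Epanechnikov kernel, $W(z) = 0.75(1 - z^2)$ on $|z| \le 1$, as one example of a kernel with bounded support; the slides do not specify the kernel, and the function name and the default `Lam=1.0` are hypothetical.

```python
import numpy as np

def nw_smooth(e, kappa, h, Lam=1.0):
    """Nadaraya-Watson smoothing of a sparse frequency vector, as in (17).

    e     : counts e^{(1)}, ..., e^{(m)} of the kappa samples per subregion
    kappa : number of samples drawn in the iteration
    h     : bandwidth h_k
    Lam   : rough estimate of the range Lambda (hypothetical default)
    Assumes an Epanechnikov kernel, which has bounded support.
    """
    e = np.asarray(e, dtype=float)
    m = len(e)
    idx = np.arange(m)
    # Kernel argument Lambda * (i - j) / (m * h) for every pair (i, j)
    z = Lam * (idx[:, None] - idx[None, :]) / (m * h)
    W = np.where(np.abs(z) <= 1.0, 0.75 * (1.0 - z ** 2), 0.0)
    return (W @ (e / kappa)) / W.sum(axis=1)

# Usage: all 10 samples fell into the middle subregion of 5.
raw = [0, 0, 10, 0, 0]
smoothed = nw_smooth(raw, kappa=10, h=0.5)
unsmoothed = nw_smooth(raw, kappa=10, h=0.1)
```

With the larger bandwidth, part of the frequency mass is shared with adjacent subregions; with the small bandwidth the kernel only reaches the diagonal, so the raw frequencies `e / kappa` are returned unchanged, consistent with the $O(h_k)$ bound.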

Algorithm SSAMC

(a) (Sampling) Simulate samples $x_k^{(1)}, \ldots, x_k^{(\kappa)}$ using the MH algorithm with the proposal distribution $q(x_k^{(i)}, \cdot)$ and the invariant distribution $p_{\theta_k}(x)$, where $x_k^{(0)} = x_{k-1}^{(\kappa)}$.
(b) (Smoothing) Calculate $\hat{p}_k = (\hat{p}_k^{(1)}, \ldots, \hat{p}_k^{(m)})$ using a kernel smoothing method.
(c) (Weight updating) Set
$$\theta^{*} = \theta_k + a_{k+1} (\hat{p}_k - \pi). \tag{18}$$
If $\theta^{*} \in \Theta$, set $\theta_{k+1} = \theta^{*}$; otherwise, set $\theta_{k+1} = \theta^{*} + c$, where $c$ can be any number that satisfies the condition $\theta^{*} + c \in \Theta$.

Change-Point Identification: SSAMC

Notations
- Let $Z = (z_1, z_2, \ldots, z_n)$ denote a sequence of independent observations.
- Let $\vartheta^{(k)}$ denote a configuration of $\vartheta$ with $k$ ones, which represents a model of $k$ change points.
- Let $\eta^{(k)} = (\vartheta^{(k)}, \mu_1, \sigma_1^2, \ldots, \mu_{k+1}, \sigma_{k+1}^2)$.
- Let $\mathcal{X}_k$ denote the space of models with $k$ change points, $\vartheta^{(k)} \in \mathcal{X}_k$, and $\mathcal{X} = \bigcup_{k=0}^{n} \mathcal{X}_k$.

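Assuming appropriate prior distributions, integrating out the parameters $\mu_1, \sigma_1^2, \ldots, \mu_{k+1}, \sigma_{k+1}^2$ from the full posterior distribution, and taking a logarithm, we have
$$\log P(\vartheta^{(k)} \mid Z) = a_k + \frac{k}{2} \log 2\pi - \sum_{i=1}^{k+1} \left\{ \frac{1}{2} \log(c_i - c_{i-1}) - \log \Gamma\left( \frac{c_i - c_{i-1} - 1}{2} + \alpha \right) + \left( \frac{c_i - c_{i-1} - 1}{2} + \alpha \right) \log\left[ \beta + \frac{1}{2} \sum_{j=c_{i-1}+1}^{c_i} z_j^2 - \frac{\left( \sum_{j=c_{i-1}+1}^{c_i} z_j \right)^2}{2 (c_i - c_{i-1})} \right] \right\}. \tag{19}$$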

Figure 7: Comparison of the true change-point pattern (horizontal lines) and its MAP estimate (vertical lines). (Axes: observation versus time.)

        SSAMC              SAMC               RJMCMC
  k     prob(%)   SD       prob(%)   SD       prob(%)   SD
  […]   […]       […]      […]       […]      […]       […]

Table 4: The estimated posterior distribution $P(\mathcal{X}_k \mid Z)$ for the change-point identification example. SD: standard deviation of the estimates.

SAMC Applications: Stochastic Optimization

  Algorithm     Mean   Standard Error   Minimum   Maximum   Proportion
  SAMC          […]    […]              […]       […]       […]
  Annealing-1   […]    […]              […]       […]       […]
  Annealing-2   […]    […]              […]       […]       […]
  Annealing-3   […]    […]              […]       […]       […]

Table 5: Comparison of SAMC and simulated annealing. Annealing-1, Annealing-2, and Annealing-3 correspond to the runs with $t_{high} = 5$, $t_{high} = 2$, and $t_{high} = 1$, respectively.

Figure 8: Sample paths of SAMC and the Metropolis-Hastings algorithm. The circles show the global minimum locations. (a) The sample path of a SAMC run (panel title: GWL). (b) The sample path of a Metropolis-Hastings run at $t = 5$. (c) The sample path of a Metropolis-Hastings run at $t = 0.1$.

Annealing SAMC

The algorithm initiates the search in the entire sample space $\mathcal{X}_0 = \bigcup_{i=1}^{m} E_i$, and then iteratively searches in the set
$$\mathcal{X}_t = \bigcup_{i=1}^{\varpi(U_{\min}^{(t)} + \aleph)} E_i, \quad t = 1, 2, \ldots, \tag{20}$$
where $\varpi(u)$ denotes the index of the subregion that a sample $x$ with energy $u$ belongs to, $U_{\min}^{(t)}$ is the best function value obtained until iteration $t$, and $\aleph > 0$ is a user-specified parameter that determines the broadness of the sample space at each iteration. Since the sample space shrinks iteration by iteration, the algorithm is called annealing SAMC.
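The shrinking search set in (20) reduces to an index lookup once the partition is fixed. The sketch below assumes an energy-based partition with increasing upper bounds $u_1 < \cdots < u_m$, so that $E_i = \{x : u_{i-1} < U(x) \le u_i\}$; this partition, the clamping of energies above $u_m$ into the last subregion, and the function name are all illustrative assumptions.

```python
import numpy as np

def active_subregions(levels, u_min, aleph):
    """0-based indices of the subregions that make up the current search set.

    levels : increasing upper energy bounds u_1 < ... < u_m (assumed partition)
    u_min  : best (lowest) energy found so far
    aleph  : user-specified slack that controls how broad the search set is
    """
    levels = np.asarray(levels, dtype=float)
    # Index of the subregion containing energy u_min + aleph:
    # the first i with levels[i] >= u_min + aleph.
    top = int(np.searchsorted(levels, u_min + aleph, side="left"))
    top = min(top, len(levels) - 1)  # energies above u_m fall in the last subregion
    return np.arange(top + 1)

# Usage: four energy levels; the search set shrinks as the best energy improves.
early = active_subregions([1.0, 2.0, 3.0, 4.0], u_min=3.5, aleph=1.0)
late = active_subregions([1.0, 2.0, 3.0, 4.0], u_min=0.5, aleph=1.0)
```

A proposal whose energy falls outside the active subregions would simply be rejected, which is how the restriction to the current search set could be enforced inside the MH step.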

  Algorithm   Mean   S.D.   Minimum   Maximum   Succ   Iter (x 10^6)   Time
  ASAMC       […]    […]    […]       […]       […]    […]             […] m
  SAMC        […]    […]    […]       […]       […]    […]             […] m
  SA          […]    […]    […]       […]       […]    […]             […] m
  SA          […]    […]    […]       […]       […]    […]             […] m
  BFGS        […]    […]    […]       […]       […]    […]             […] s

Table 6: Comparison of the ASAMC and SA algorithms for the two-spiral example. Succ denotes the number of runs (out of 20) that found a solution with energy less than 0.2.

Figure 9: Two-spiral problem: classification maps learned by an MLP with 30 hidden units. The black and white points show the training data for the two different spirals. (a) The classification map learned in one run. (b) The classification map learned in 20 runs.

Advantages of SAMC over simulated annealing

1. Simulated annealing: It requires the temperature to decrease very slowly, at the rate $\frac{1}{\log(t)}$, so it cannot be implemented exactly in practice.
2. SAMC: The modification factor $\gamma$ can decrease much faster, at the rate $\frac{1}{t}$.

In an annealing version of SAMC, $X_t$ will converge in distribution to $f(x) I(x \in E_\epsilon)$ as $t \to \infty$, where $E_\epsilon = \{x : H(x) < H_{\min} + \epsilon\}$.

Further work: convergence rate of annealing SAMC.

SAMC Discussion: Other Applications

- Importance sampling (Liang et al., 2007, JASA)
- Marginal density estimation (Liang, 2007, JCGS)
- Normalizing constant estimation (Liang, 2007, Encyclopedia of Artificial Intelligence)
- Protein folding simulation (Liang, 2004, J. Chem. Phys.)
- Phylogenetic tree reconstruction (Cheon and Liang, 2008, BioSystems)
- Variable selection for high dimensional regression (Liang, Chen and Ibrahim, 2007)

SAMC: High Dimensional Regression

Figure 10: Progression of the best energy values (vertical axis) over iterations (horizontal axis) in (a) SAMC and (b) MH runs for a high dimensional regression problem with $n = 150$ and $p = 600$ (Liang et al., 2007).

SAMC: Phylogeny Estimation

Figure 11: Comparison of the phylogenetic trees produced by SSAMC, BAMBE, and MrBayes for the simulated example. The respective log-likelihood values of the trees are (a) […], (b) […], (c) […], (d) […].

Figure 12: Comparison of the MAP trees produced by SSAMC, MrBayes, and BAMBE for the African cichlid fish example. The respective log-likelihood values are […], […] and […].

Figure 13: CPU time of a single run ([…] iterations) of SSAMC, BAMBE and MrBayes, plotted against the number of taxa.
