
Simulated Stochastic Approximation Annealing for Global Optimization with a Square-Root Cooling Schedule

Faming Liang

Abstract

Simulated annealing has been widely used for solving optimization problems. It is well known that simulated annealing cannot be guaranteed to locate the global optima unless a logarithmic cooling schedule is used; however, the logarithmic schedule is so slow that the required CPU time is prohibitive. We propose a new stochastic optimization algorithm, the simulated stochastic approximation annealing (SAA) algorithm. Under the framework of stochastic approximation Markov chain Monte Carlo, we show that the new algorithm can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic schedule, e.g., a square-root cooling schedule, while still guaranteeing that the global optima are reached as the temperature tends to zero. The new algorithm has been tested on a few benchmark optimization problems, including feed-forward neural network training and protein folding. The numerical results indicate that the new algorithm can significantly outperform simulated annealing and other competitors.

The problem

The optimization problem can be stated as a minimization problem,
$$\min_{x \in \mathcal{X}} U(x),$$
where $\mathcal{X}$ is the domain of $U(x)$. Minimizing $U(x)$ is equivalent to sampling from the Boltzmann distribution
$$f_\tau(x) \propto \exp\{-U(x)/\tau\}$$
at a very small value of $\tau$ (close to 0).
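As a quick illustration of this link (a minimal sketch, not from the slides; the toy energy, the grid, and the temperature values are my own choices), the following Python snippet shows how the normalized Boltzmann weights over a discretized domain concentrate around the global minimizer as $\tau$ decreases:

```python
import numpy as np

def boltzmann(U_vals, tau):
    """Normalized Boltzmann weights proportional to exp(-U/tau) on a finite grid."""
    w = np.exp(-(U_vals - U_vals.min()) / tau)  # shift by min(U) for numerical stability
    return w / w.sum()

# Toy 1-D energy with two local minima (illustrative only).
x = np.linspace(-2.0, 2.0, 401)
U_vals = (x**2 - 1.0)**2 + 0.3 * x

for tau in (2.0, 0.5, 0.1, 0.01):
    p = boltzmann(U_vals, tau)
    mode = x[np.argmax(p)]
    mass_near_mode = p[np.abs(x - mode) < 0.1].sum()
    print(f"tau={tau:5.2f}  mode of f_tau ~ {mode:+.3f}  mass within 0.1 of mode = {mass_near_mode:.3f}")
```

As $\tau$ shrinks, essentially all of the probability mass piles up near the global minimizer, which is why sampling from $f_\tau$ at a small $\tau$ solves the minimization problem.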

Simulated Annealing (Kirkpatrick et al., 1983)

It simulates from a sequence of Boltzmann distributions, $f_{\tau_1}(x), f_{\tau_2}(x), \ldots, f_{\tau_m}(x)$, in a sequential manner, where the temperatures form a decreasing ladder $\tau_1 > \tau_2 > \cdots > \tau_m = \tau^* > 0$, with $\tau^*$ close to 0 and $\tau_1$ reasonably large such that most uphill Metropolis-Hastings (MH) moves at that level can be accepted.

Simulated Annealing: Algorithm

1. Initialize the simulation at temperature $\tau_1$ with an arbitrary sample $x_0 \in \mathcal{X}$.
2. At each temperature $\tau_i$, simulate from the distribution $f_{\tau_i}(x)$ for $n_i$ iterations using the MH sampler. Pass the final sample to the next lower temperature level as the initial sample.
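The following is a minimal, self-contained Python sketch of this procedure. The Gaussian random-walk proposal, the step size, and the geometric temperature ladder in the example are my own illustrative choices; the slide does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_annealing(U, x0, temperatures, n_iter_per_temp=100, step=0.1):
    """Simulated annealing with a random-walk Metropolis-Hastings sampler."""
    x = np.array(x0, dtype=float)
    u_x = U(x)
    best_x, best_u = x.copy(), u_x
    for tau in temperatures:                      # decreasing temperature ladder
        for _ in range(n_iter_per_temp):          # MH moves targeting f_tau
            y = x + step * rng.standard_normal(x.shape)
            u_y = U(y)
            if np.log(rng.uniform()) < -(u_y - u_x) / tau:   # MH acceptance for f_tau
                x, u_x = y, u_y
            if u_x < best_u:
                best_x, best_u = x.copy(), u_x
        # The final sample at this level becomes the initial sample at the next level.
    return best_x, best_u

# Example usage with a simple quadratic energy and a geometric ladder (illustrative only).
U = lambda x: float(np.sum((x - 1.0) ** 2))
taus = 5.0 * 0.95 ** np.arange(200)
print(simulated_annealing(U, np.zeros(2), taus))
```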

Simulated Annealing: Difficulty

The major difficulty with simulated annealing is choosing the cooling schedule:

Logarithmic cooling schedule, $O(1/\log(t))$: it ensures that the simulation converges to the global minima of $U(x)$ with probability 1, but it is so slow that the required running time is impractical.

Linear or geometric cooling schedule: commonly used in practice, but, as shown in Holley et al. (1989), these schedules can no longer guarantee that the global minima are reached.

Stochastic Approximation Monte Carlo (SAMC)

SAMC is a general-purpose MCMC algorithm. To be precise, it is an adaptive MCMC algorithm and also a dynamic importance sampling algorithm. Its self-adjusting mechanism enables it to be immune to local traps.

Let $E_1, \ldots, E_m$ denote a partition of the sample space $\mathcal{X}$ made according to the energy function as follows:
$$E_1 = \{x: U(x) \le u_1\}, \; E_2 = \{x: u_1 < U(x) \le u_2\}, \; \ldots, \; E_{m-1} = \{x: u_{m-2} < U(x) \le u_{m-1}\}, \; E_m = \{x: U(x) > u_{m-1}\}, \qquad (1)$$
where $u_1 < u_2 < \cdots < u_{m-1}$ are prespecified numbers.

Let $\{\gamma_t\}$ be a positive, non-increasing sequence satisfying the conditions
$$\sum_{t=1}^{\infty} \gamma_t = \infty, \qquad \sum_{t=1}^{\infty} \gamma_t^2 < \infty.$$

Stochastic Approximation Monte Carlo: Algorithm

1. (Sampling) Simulate a sample $X_{t+1}$ with a single MH update, which starts with $X_t$ and leaves the following distribution invariant:
$$f_{\theta_t,\tau}(x) \propto \sum_{i=1}^{m} \exp\left\{-\frac{U(x)}{\tau} - \theta_t^{(i)}\right\} I(x \in E_i), \qquad (2)$$
where $I(\cdot)$ is the indicator function.

2. ($\theta$-updating) Set
$$\theta_{t+\frac{1}{2}} = \theta_t + \gamma_{t+1} H_{\tau}(\theta_t, x_{t+1}), \qquad (3)$$
where $H_{\tau}(\theta_t, x_{t+1}) = e_{t+1} - \pi$, $e_{t+1} = (I(x_{t+1} \in E_1), \ldots, I(x_{t+1} \in E_m))$, and $\pi = (\pi_1, \ldots, \pi_m)$.

Obviously, it is difficult to mix over the domain $\mathcal{X}$ if the temperature $\tau$ is very low! In that case, only a few points will be sampled from each subregion.
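As a minimal sketch of one SAMC iteration (assuming a continuous domain with a Gaussian random-walk proposal; the helper names `subregion`, `log_working_density`, and `samc_iteration` are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def subregion(u, cutoffs):
    """Index i such that a point with energy u falls in E_i, for cutoffs u_1 < ... < u_{m-1}."""
    return int(np.searchsorted(cutoffs, u))

def log_working_density(u, idx, theta, tau):
    """log f_{theta,tau}(x) up to a constant: -U(x)/tau - theta^(i) on subregion E_i."""
    return -u / tau - theta[idx]

def samc_iteration(U, x, theta, tau, gamma, pi, cutoffs, step=0.2):
    # 1. Sampling: one MH update targeting the working density f_{theta,tau}.
    y = x + step * rng.standard_normal(np.shape(x))
    u_x, u_y = U(x), U(y)
    i_x, i_y = subregion(u_x, cutoffs), subregion(u_y, cutoffs)
    log_ratio = log_working_density(u_y, i_y, theta, tau) - log_working_density(u_x, i_x, theta, tau)
    if np.log(rng.uniform()) < log_ratio:
        x, i_x = y, i_y
    # 2. theta-updating: theta <- theta + gamma * (e - pi).
    e = np.zeros(len(theta))
    e[i_x] = 1.0
    theta = theta + gamma * (e - pi)
    return x, theta
```

The key point is that the working density in (2) reweights each subregion by $\exp(-\theta^{(i)})$, so a subregion that has been over-visited has its $\theta^{(i)}$ raised and is automatically penalized on subsequent iterations.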

Space Annealing SAMC (Liang, 2007)

Suppose that the sample space has been partitioned as in (1) with $u_1, \ldots, u_{m-1}$ arranged in ascending order. Let $\kappa(u)$ denote the index of the subregion that a sample $x$ with energy $u$ belongs to; for example, if $x \in E_j$, then $\kappa(U(x)) = j$. Let $\mathcal{X}^{(t)}$ denote the sample space at iteration $t$. Space annealing SAMC starts with $\mathcal{X}^{(1)} = \bigcup_{i=1}^{m} E_i$, and then iteratively shrinks the sample space by setting
$$\mathcal{X}^{(t)} = \bigcup_{i=1}^{\kappa(u_{\min}^{(t)} + \aleph)} E_i, \qquad (4)$$
where $u_{\min}^{(t)}$ is the minimum energy value obtained by iteration $t$, and $\aleph$ is a user-specified parameter.

A major shortcoming of this algorithm is that it tends to get trapped in local energy minima when $\aleph$ is small and the proposal is relatively local.
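A sketch of the shrinkage rule (4); the function names are illustrative, and `subregion` maps an energy value to its subregion index exactly as $\kappa$ does above:

```python
import numpy as np

def subregion(u, cutoffs):
    """Index of the subregion E_i containing energy u, for cutoffs u_1 < ... < u_{m-1}."""
    return int(np.searchsorted(cutoffs, u))

def in_current_space(u_candidate, u_min_so_far, aleph, cutoffs):
    """Shrinkage rule (4): keep only subregions with index up to kappa(u_min + aleph)."""
    return subregion(u_candidate, cutoffs) <= subregion(u_min_so_far + aleph, cutoffs)
```

One way to use this rule is to reject any proposed move whose energy falls outside the current space; this hard truncation is what makes the algorithm prone to trapping when $\aleph$ is small.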

SAA Algorithm

Simulated Stochastic Approximation Annealing, or SAA for short, is a combination of simulated annealing and stochastic approximation.

Let $\{M_k, k = 0, 1, \ldots\}$ be a sequence of positive numbers increasingly diverging to infinity, which serve as truncation bounds for $\{\theta_t\}$. Let $\sigma_t$ be a counter for the number of truncations up to iteration $t$, with $\sigma_0 = 0$. Let $\theta_0$ be a fixed point in $\Theta$. $E_1, \ldots, E_m$ is the partition of the sample space. $\pi = (\pi_1, \ldots, \pi_m)$ is the desired sampling distribution over the $m$ subregions. $\{\gamma_t\}$ is a gain factor sequence. $\{\tau_t\}$ is a temperature sequence.

SAA Algorithm

1. (Sampling) Simulate a sample $X_{t+1}$ with a single MH update, which starts with $X_t$ and leaves the following distribution invariant:
$$f_{\theta_t,\tau_{t+1}}(x) \propto \sum_{i=1}^{m} \exp\left\{-\frac{U(x)}{\tau_{t+1}} - \theta_t^{(i)}\right\} I(x \in E_i), \qquad (5)$$
where $I(\cdot)$ is the indicator function.

2. ($\theta$-updating) Set
$$\theta_{t+\frac{1}{2}} = \theta_t + \gamma_{t+1} H_{\tau_{t+1}}(\theta_t, x_{t+1}), \qquad (6)$$
where $H_{\tau_{t+1}}(\theta_t, x_{t+1}) = e_{t+1} - \pi$, $e_{t+1} = (I(x_{t+1} \in E_1), \ldots, I(x_{t+1} \in E_m))$, and $\pi = (\pi_1, \ldots, \pi_m)$.

3. (Truncation) If $\|\theta_{t+\frac{1}{2}}\| \le M_{\sigma_t}$, set $\theta_{t+1} = \theta_{t+\frac{1}{2}}$; otherwise, set $\theta_{t+1} = \theta_0$ and $\sigma_{t+1} = \sigma_t + 1$.
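Putting the three steps together, here is a minimal, self-contained Python sketch of the SAA loop. The Gaussian random-walk proposal, the max-norm used for truncation, the tenfold enlargement of the truncation bound, and all constants (including the square-root cooling constants) are my own illustrative choices rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def saa(U, x0, cutoffs, pi, n_iter=100_000,
        C1=10.0, zeta=0.6,       # gain factor gamma_t = C1 / t**zeta
        C2=5.0, tau_star=0.01,   # square-root cooling tau_t = C2 / sqrt(t) + tau_star
        step=0.2, M0=1e3):
    m = len(pi)
    theta = np.zeros(m)          # theta_0: also the fixed re-initialization point
    theta_init = theta.copy()
    M, sigma = M0, 0             # current truncation bound M_sigma and truncation counter
    region = lambda u: int(np.searchsorted(cutoffs, u))
    x = np.array(x0, dtype=float)
    u_x = U(x)
    best_x, best_u = x.copy(), u_x
    for t in range(1, n_iter + 1):
        gamma = C1 / t**zeta
        tau = C2 / np.sqrt(t) + tau_star
        # 1. Sampling: one MH update leaving the working density in (5) invariant.
        y = x + step * rng.standard_normal(x.shape)
        u_y = U(y)
        i_x, i_y = region(u_x), region(u_y)
        log_ratio = (-u_y / tau - theta[i_y]) - (-u_x / tau - theta[i_x])
        if np.log(rng.uniform()) < log_ratio:
            x, u_x, i_x = y, u_y, i_y
        if u_x < best_u:
            best_x, best_u = x.copy(), u_x
        # 2. theta-updating: theta_{t+1/2} = theta_t + gamma * (e - pi).
        e = np.zeros(m)
        e[i_x] = 1.0
        theta_half = theta + gamma * (e - pi)
        # 3. Truncation: re-initialize theta and enlarge the bound if it escapes.
        if np.max(np.abs(theta_half)) <= M:
            theta = theta_half
        else:
            theta = theta_init.copy()
            sigma += 1
            M *= 10.0
    return best_x, best_u, theta

# Example usage on a toy 2-D energy (the cutoffs and pi are illustrative).
U = lambda z: float(np.sum((np.asarray(z) - 1.0) ** 2))
cutoffs = np.array([0.5, 1.0, 2.0, 4.0])   # defines m = 5 subregions
pi = np.full(5, 0.2)
print(saa(U, np.zeros(2), cutoffs, pi, n_iter=20_000))
```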

Features of SAA

Self-adjusting mechanism: this distinguishes SAA from simulated annealing. For simulated annealing, the change of the invariant distribution is determined solely by the temperature ladder; for SAA, it is determined by both the temperature ladder and the past samples. As a result, SAA can converge with a much faster cooling schedule.

Sample space shrinkage: compared to space annealing SAMC, SAA also shrinks its sample space with iterations, but in a soft way: it gradually biases sampling toward the local energy minima of each subregion by lowering the temperature with iterations. This strategy of sample space shrinkage reduces the risk of getting trapped in local energy minima.

Convergence: from the perspective of practical applications, SAA achieves essentially the same convergence toward the global energy minima as simulated annealing.

Formulation of SAA

The SAA algorithm can be formulated as a SAMCMC algorithm aimed at solving the integral equation
$$h_{\tau^*}(\theta) = \int_{\mathcal{X}} H_{\tau^*}(\theta, x)\, f_{\theta,\tau^*}(x)\,dx = 0, \qquad (7)$$
where $f_{\theta,\tau^*}(x)$ denotes a density function depending on $\theta$ and the limiting temperature $\tau^*$, and $h_{\tau^*}$ is called the mean field function. SAA works by solving a system of equations defined along the temperature sequence $\{\tau_t\}$:
$$h_{\tau_t}(\theta) = \int_{\mathcal{X}} H_{\tau_t}(\theta, x)\, f_{\theta,\tau_t}(x)\,dx = 0, \quad t = 1, 2, \ldots, \qquad (8)$$
where $f_{\theta,\tau_t}(x)$ is a density function depending on $\theta$ and the temperature $\tau_t$.

Conditions on the mean field function

For SAA, the mean field function is given by
$$h_\tau(\theta) = \int_{\mathcal{X}} H_\tau(\theta, x)\, f_{\theta,\tau}(x)\,dx = \left( \frac{S_\tau^{(1)}(\theta)}{S_\tau(\theta)} - \pi_1, \ldots, \frac{S_\tau^{(m)}(\theta)}{S_\tau(\theta)} - \pi_m \right), \qquad (9)$$
for any fixed value of $\theta \in \Theta$ and $\tau \in \mathcal{T}$, where $S_\tau^{(i)}(\theta) = \int_{E_i} e^{-U(x)/\tau}\,dx \big/ e^{\theta^{(i)}}$ and $S_\tau(\theta) = \sum_{i=1}^{m} S_\tau^{(i)}(\theta)$. Further, we define
$$v_\tau(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \frac{S_\tau^{(i)}(\theta)}{S_\tau(\theta)} - \pi_i \right)^2, \qquad (10)$$
which is the so-called Lyapunov function in the stochastic approximation literature. Then it is easy to verify that SAA satisfies the stability condition.

Stability Condition: (A1)

The function $h_\tau(\theta)$ is bounded and continuously differentiable with respect to both $\theta$ and $\tau$, and there exists a non-negative, upper-bounded, and continuously differentiable function $v_\tau(\theta)$ such that for any $\Delta > \delta > 0$,
$$\sup_{\delta \le d((\theta,\tau),\, \mathcal{L}) \le \Delta} \nabla_\theta^T v_\tau(\theta)\, h_\tau(\theta) < 0, \qquad (11)$$
where $\mathcal{L} = \{(\theta,\tau): h_\tau(\theta) = 0, \theta \in \Theta, \tau \in \mathcal{T}\}$ is the zero set of $h_\tau(\theta)$, and $d(z, S) = \inf_y \{\|z - y\|: y \in S\}$. Further, the set $v(\mathcal{L}) = \{v_\tau(\theta): (\theta,\tau) \in \mathcal{L}\}$ is nowhere dense.

Conditions on the observation noise

Observation noise: $\xi_{t+1} = H_{\tau_{t+1}}(\theta_t, x_{t+1}) - h_{\tau_{t+1}}(\theta_t)$.

One can directly impose conditions on the observation noise, see e.g. Kushner and Clark (1978), Kulkarni and Horn (1995), and Chen (2002). Such conditions are usually very weak but difficult to verify. Alternatively, one can impose conditions on the Markov transition kernel, which lead to the required conditions on the observation noise.

Doeblin condition: (A2)

(A2) (Doeblin condition) For any given $\theta \in \Theta$ and $\tau \in \mathcal{T}$, the Markov transition kernel $P_{\theta,\tau}$ is irreducible and aperiodic. In addition, there exist an integer $l$, $0 < \delta < 1$, and a probability measure $\nu$ such that for any compact subset $K \subset \Theta$,
$$\inf_{\theta \in K,\, \tau \in \mathcal{T}} P_{\theta,\tau}^{l}(x, A) \ge \delta\, \nu(A), \quad \forall x \in \mathcal{X}, \; \forall A \in \mathcal{B}_{\mathcal{X}},$$
where $\mathcal{B}_{\mathcal{X}}$ denotes the Borel $\sigma$-field of $\mathcal{X}$; that is, the whole support $\mathcal{X}$ is a small set for each kernel $P_{\theta,\tau}$, $\theta \in K$ and $\tau \in \mathcal{T}$.

Uniform ergodicity is slightly stronger than $V$-uniform ergodicity, but it suffices for SAA, for which the function $H_\tau(\theta, X)$ is bounded, and thus the mean field function and the observation noise are bounded. If the drift function $V(x) \equiv 1$, then $V$-uniform ergodicity reduces to uniform ergodicity.

Doeblin condition

To verify (A2), one may assume that $\mathcal{X}$ is compact, $U(x)$ is bounded on $\mathcal{X}$, and the proposal distribution $q(x, y)$ satisfies the local positive condition:

(Q) There exist $\delta_q > 0$ and $\epsilon_q > 0$ such that, for every $x \in \mathcal{X}$, $\|x - y\| \le \delta_q$ implies $q(x, y) \ge \epsilon_q$.

Conditions on $\{\gamma_t\}$ and $\{\tau_t\}$: (A3)

(i) The sequence $\{\gamma_t\}$ is positive, non-increasing, and satisfies the conditions
$$\sum_{t=1}^{\infty} \gamma_t = \infty, \qquad \frac{\gamma_{t+1} - \gamma_t}{\gamma_t} = O(\gamma_{t+1}^{\iota}), \qquad \sum_{t=1}^{\infty} \gamma_t^{(1+\iota')/2}\, t^{-1/2} < \infty, \qquad (12)$$
for some $\iota \in [1, 2)$ and $\iota' \in (0, 1)$.

(ii) The sequence $\{\tau_t\}$ is positive and non-increasing and satisfies the conditions
$$\lim_{t \to \infty} \tau_t = \tau^*, \qquad \tau_t - \tau_{t+1} = o(\gamma_t), \qquad \sum_{t=1}^{\infty} \gamma_t\, |\tau_t - \tau_{t+1}|^{\iota''} < \infty \text{ for some } \iota'' \in (0, 1), \qquad (13)$$
$$\sum_{t=1}^{\infty} \gamma_t\, |\tau_t - \tau^*| < \infty. \qquad (14)$$

Conditions on $\{\gamma_t\}$ and $\{\tau_t\}$

For the sequences $\{\gamma_t\}$ and $\{\tau_t\}$, one can typically set
$$\gamma_t = \frac{C_1}{t^{\varsigma}}, \qquad \tau_t = \frac{C_2}{\sqrt{t}} + \tau^*, \qquad (15)$$
for some constants $C_1 > 0$, $C_2 > 0$, and $\varsigma \in (0.5, 1]$; the temperature sequence is the square-root cooling schedule. Then it is easy to verify that (15) satisfies (A3).
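For concreteness, a few values of these sequences with illustrative constants ($C_1 = 1$, $C_2 = 1$, $\varsigma = 0.6$, $\tau^* = 0.01$ are my own choices, not the paper's):

```python
import numpy as np

C1, C2, zeta, tau_star = 1.0, 1.0, 0.6, 0.01
for t in (1, 10, 100, 1_000, 10_000, 100_000):
    gamma_t = C1 / t**zeta              # gain factor, eq. (15)
    tau_t = C2 / np.sqrt(t) + tau_star  # square-root cooling schedule, eq. (15)
    print(f"t={t:>7d}  gamma_t={gamma_t:.5f}  tau_t={tau_t:.5f}")
```

Note how $\tau_t$ approaches $\tau^*$ far faster than a logarithmic schedule would, while $\gamma_t$ still decays slowly enough that $\sum_t \gamma_t$ diverges.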

Convergence Theorem

Theorem 1. Assume that $\mathcal{T}$ is compact and conditions (A1)-(A3) hold. If the point $\theta_0$ used in the SAA algorithm is such that $\sup_{\tau \in \mathcal{T}} v_\tau(\theta_0) < \inf_{\|\theta\| = c_0,\, \tau \in \mathcal{T}} v_\tau(\theta)$ for some $c_0 > 0$ with $\|\theta_0\| < c_0$, then the number of truncations in SAA is almost surely finite; that is, $\{\theta_t\}$ remains in a compact subset of $\Theta$ almost surely.

Convergence Theorem

Theorem 2. Assume the conditions of Theorem 1 hold. Then, as $t \to \infty$,
$$d(\theta_t, \mathcal{L}_{\tau^*}) \to 0, \quad a.s.,$$
where $\mathcal{L}_{\tau^*} = \{\theta \in \Theta: h_{\tau^*}(\theta) = 0\}$ and $d(z, S) = \inf_y \{\|z - y\|: y \in S\}$. That is,
$$\theta_t^{(i)} \to \begin{cases} C + \log\left( \int_{E_i} f_{\tau^*}(x)\,dx \right) - \log(\pi_i + \pi_e), & \text{if } E_i \neq \emptyset, \\ -\infty, & \text{if } E_i = \emptyset, \end{cases}$$
where $C$ is a constant, $\pi_e = \sum_{j: E_j = \emptyset} \pi_j / (m - m_0)$, and $m_0$ is the number of empty subregions.

Strong Law of Large Numbers (SLLN)

Theorem 3. Assume the conditions of Theorem 1 hold. Let $x_1, \ldots, x_n$ denote a set of samples simulated by SAA in $n$ iterations. Let $g: \mathcal{X} \to \mathbb{R}$ be a measurable function that is bounded and integrable with respect to $f_{\theta^*,\tau^*}(x)$, where $\theta^*$ denotes a limiting point of $\{\theta_t\}$. Then
$$\frac{1}{n} \sum_{k=1}^{n} g(x_k) \to \int_{\mathcal{X}} g(x)\, f_{\theta^*,\tau^*}(x)\,dx, \quad a.s.$$

Convergence to Global Minima

Corollary. Assume the conditions of Theorem 1 hold. Let $x_1, \ldots, x_t$ denote a set of samples simulated by SAA in $t$ iterations, and let $J(x_k)$ denote the index of the subregion that $x_k$ belongs to. Then, for any $\epsilon > 0$, as $t \to \infty$,
$$\frac{\sum_{k=1}^{t} I\left(U(x_k) \le u_i + \epsilon \;\&\; J(x_k) = i\right)}{\sum_{k=1}^{t} I\left(J(x_k) = i\right)} \to \frac{\int_{\{x: U(x) \le u_i + \epsilon\} \cap E_i} e^{-U(x)/\tau^*}\,dx}{\int_{E_i} e^{-U(x)/\tau^*}\,dx}, \quad a.s., \qquad (16)$$
for $i = 1, \ldots, m$, where $I(\cdot)$ denotes the indicator function. Moreover, if $\tau^*$ goes to 0, then
$$P\left( U(X_t) \le u_i + \epsilon \mid J(X_t) = i \right) \to 1, \quad i = 1, \ldots, m. \qquad (17)$$

For simulated annealing, as shown in Haario and Saksman (1991), the following convergence can be achieved with a logarithmic cooling schedule: for any $\epsilon > 0$,
$$P\left( U(X_t) \le u_1 + \epsilon \right) \to 1, \qquad (18)$$
as $t \to \infty$.

Comparison with Simulated Annealing

Simulated annealing can achieve a stronger convergence mode than SAA. As a trade-off, SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, such as the square-root cooling schedule. From the perspective of practical applications, (17) and (18) are almost equivalent: both allow one to identify a sequence of samples converging to the global energy minima of $U(x)$.

In practice, SAA often works better than simulated annealing. This is because SAA possesses the self-adjusting mechanism, which makes it immune to local traps.

A 10-state Distribution

The unnormalized mass function of the 10-state distribution:

x       1     2     3    4     5     6    7     8     9    10
P(x)    5   100    40    1   125    75    1   150    50    20

The sample space $\mathcal{X} = \{1, 2, \ldots, 10\}$ was partitioned according to the mass function into five subregions: $E_1 = \{8\}$, $E_2 = \{2, 5\}$, $E_3 = \{6, 9\}$, $E_4 = \{3\}$, and $E_5 = \{1, 4, 7, 10\}$.
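To make the setup concrete, here is one way to encode the distribution and the partition in Python (a sketch; taking the energy as $U(x) = -\log P(x)$ is my reading of "partitioned according to the mass function"):

```python
import numpy as np

P = np.array([5, 100, 40, 1, 125, 75, 1, 150, 50, 20], dtype=float)  # unnormalized mass, states 1..10
U = -np.log(P)                                                        # energy of each state

# The five subregions, listed from the lowest-energy (highest-mass) states upward.
E = {1: [8], 2: [2, 5], 3: [6, 9], 4: [3], 5: [1, 4, 7, 10]}
pi = np.full(5, 1.0 / 5)          # desired (uniform) sampling distribution over subregions

for i, states in E.items():
    energies = np.round(U[np.array(states) - 1], 3)
    print(f"E_{i}: states {states}, energies {energies}")
```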

A 10-state Distribution

Convergence of $\theta_t$ for the 10-state distribution: the true value $\theta_n$ is calculated at the end temperature 0.0104472, $\hat{\theta}_n$ is the average of the estimated $\theta_n$ over 5 independent runs, s.d. is the standard deviation of $\hat{\theta}_n$, and freq is the averaged relative sampling frequency of each subregion. The standard deviation of freq is nearly 0.

Subregion    E_1        E_2          E_3          E_4           E_5
θ_n          6.3404    -11.1113     -60.0072     -120.1772     -186.5248
θ̂_n          6.3404    -11.1116     -60.0009     -120.1687     -186.5044
s.d.         0          6.26×10^-3   2.28×10^-3   6.01×10^-3    8.16×10^-3
freq         20.29%     20.23%       20.05%       19.84%        19.6%

A 10-state Distribution

[Figure: a thinned sample path of SAA for the 10-state distribution (sampled state vs. iteration, for iterations 49,000,000 to 50,000,000).]

A function with multiple local minima

Consider minimizing the function
$$U(x) = -\{x_1 \sin(20 x_2) + x_2 \sin(20 x_1)\}^2 \cosh\{\sin(10 x_1)\, x_1\} - \{x_1 \cos(10 x_2) - x_2 \sin(10 x_1)\}^2 \cosh\{\cos(20 x_2)\, x_2\},$$
where $x = (x_1, x_2) \in [-1.1, 1.1]^2$.
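A direct Python transcription of this test function (it inherits the sign pattern reconstructed above, so treat it as a sketch), together with a crude grid scan to show the scale of its minima:

```python
import numpy as np

def U(x):
    """Multimodal test energy on [-1.1, 1.1]^2, as written above."""
    x1, x2 = x
    term1 = (x1 * np.sin(20 * x2) + x2 * np.sin(20 * x1)) ** 2 * np.cosh(np.sin(10 * x1) * x1)
    term2 = (x1 * np.cos(10 * x2) - x2 * np.sin(10 * x1)) ** 2 * np.cosh(np.cos(20 * x2) * x2)
    return -term1 - term2

# Crude grid scan (illustrative check only).
grid = np.linspace(-1.1, 1.1, 221)
values = np.array([[U((a, b)) for b in grid] for a in grid])
i, j = np.unravel_index(values.argmin(), values.shape)
print(f"grid minimum {values.min():.4f} near x = ({grid[i]:+.2f}, {grid[j]:+.2f})")
```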

A function with multiple local minima

Comparison of SAA and simulated annealing for the multi-modal example: average of minimum energy values by iteration count (standard errors in parentheses).

Iterations          20000          40000          60000          80000          100000
SAA                -8.1145        -8.1198        -8.1214        -8.1223        -8.1229
                   (3.0×10^-4)    (1.5×10^-4)    (1.0×10^-4)    (7.5×10^-5)    (5.9×10^-5)
SA (square-root)   -5.9227        -5.9255        -5.9265        -5.9269        -5.9271
                   (1.3×10^-)     (1.3×10^-)     (1.3×10^-)     (1.3×10^-)     (1.3×10^-)
SA (geometric)     -6.5534        -6.5598        -6.5611        -6.5617        -6.5620
                   (3.3×10^-)     (3.3×10^-)     (3.3×10^-)     (3.3×10^-)     (3.3×10^-)

A function with multiple local minima

[Figure: (a) contour of U(x); (b) sample path of SAA; (c) sample path of simulated annealing with a square-root cooling schedule; (d) sample path of simulated annealing with a geometric cooling schedule. The white circles show the global minima of U(x).]

Feed-forward Neural Networks

[Figure: a fully connected one-hidden-layer MLP network with four input units (I1, I2, I3, I4), one bias unit (B), three hidden units (H1, H2, H3), and one output unit (O). The arrows show the direction of data feeding.]

Two-spiral Problem

The two-spiral problem is to train a feed-forward neural network that distinguishes between points on two intertwined spirals. It is a benchmark problem for feed-forward neural network training: the objective function is high-dimensional, highly nonlinear, and has a multitude of local energy minima separated by high energy barriers.
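To illustrate the form of this objective (not the exact data set, architecture, or parameterization used in the paper), the sketch below generates a standard two-spiral data set and evaluates the sum-of-squared-errors energy of a one-hidden-layer sigmoid MLP at a given weight vector; the 30-hidden-unit figure matches the network mentioned with the classification maps below, and everything else is an assumption.

```python
import numpy as np

def two_spirals(n_per_class=97, noise=0.0, seed=3):
    """Generate two intertwined spirals (labels 0 and 1), as in the classic benchmark."""
    rng = np.random.default_rng(seed)
    i = np.arange(n_per_class)
    r = 6.5 * (104 - i) / 104
    phi = i * np.pi / 16
    s1 = np.column_stack([r * np.cos(phi), r * np.sin(phi)])
    s2 = -s1                                        # the second spiral is the point reflection
    X = np.vstack([s1, s2]) + noise * rng.standard_normal((2 * n_per_class, 2))
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

def mlp_energy(w, X, y, n_hidden=30):
    """Sum-of-squared-errors energy of a one-hidden-layer MLP with sigmoid units."""
    d = X.shape[1]
    k = 0
    W1 = w[k:k + n_hidden * d].reshape(n_hidden, d); k += n_hidden * d
    b1 = w[k:k + n_hidden]; k += n_hidden
    W2 = w[k:k + n_hidden]; k += n_hidden
    b2 = w[k]
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(X @ W1.T + b1)            # hidden-layer activations
    out = sigmoid(h @ W2 + b2)            # single output unit
    return float(np.sum((out - y) ** 2))

X, y = two_spirals()
n_w = 30 * 2 + 30 + 30 + 1                # total number of weights for 30 hidden units
w0 = np.random.default_rng(4).standard_normal(n_w)
print("energy at a random weight vector:", mlp_energy(w0, X, y))
```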

Two-spiral Problem

[Figure: classification maps learned by SAA with an MLP of 30 hidden units. The black and white points show the training data for the two intertwined spirals. (a) Classification map learned in one run of SAA. (b) Classification map averaged over 20 runs. This figure demonstrates the success of SAA in the optimization of complex functions.]

Two-spiral Problem

Comparison of SAA, space annealing SAMC, simulated annealing, and BFGS for the two-spiral example. Notation: $v_i$ denotes the minimum energy value obtained in the $i$th run, $i = 1, \ldots, 20$; Mean is the average of the $v_i$; SD is the standard deviation of the mean; Min $= \min_{i=1}^{20} v_i$; Max $= \max_{i=1}^{20} v_i$; Prop $= \#\{i: v_i \le 0.21\}$; Iter is the average number of iterations performed in each run. SA-1 employs the linear cooling schedule, and SA-2 employs the geometric cooling schedule with a decreasing rate of 0.9863.

Algorithm               Mean     SD      Min      Max     Prop   Iter (×10^6)
SAA                      0.341   0.099    0.201    2.04   18      5.82
Space annealing SAMC     0.620   0.191    0.187    3.23   15      7.07
Simulated annealing-1   17.485   0.706    9.02    22.06    0     10.0
Simulated annealing-2    6.433   0.450    3.03    11.02    0     10.0
BFGS                    15.50    0.899   10.00    24.00    0      -

Protein Folding

The AB model consists of only two types of monomers, A and B, which behave as hydrophobic ($\sigma_i = +1$) and hydrophilic ($\sigma_i = -1$) monomers, respectively. The monomers are linked by rigid bonds of unit length to form linear chains living in two- or three-dimensional space. For the 2D case, the energy function consists of two types of contributions, bond angle and Lennard-Jones, and is given by
$$U(x) = \frac{1}{4} \sum_{i=1}^{N-2} \left(1 - \cos x_{i,i+1}\right) + 4 \sum_{i=1}^{N-2} \sum_{j=i+2}^{N} \left[ r_{ij}^{-12} - C_2(\sigma_i, \sigma_j)\, r_{ij}^{-6} \right], \qquad (19)$$
where $x = (x_{1,2}, \ldots, x_{N-2,N-1})$, $x_{i,i+1} \in [-\pi, \pi]$ is the bend angle between the $i$th and $(i+1)$th bond vectors, and $r_{ij}$ is the distance between monomers $i$ and $j$. The constant $C_2(\sigma_i, \sigma_j)$ is $+1$, $+\frac{1}{2}$, and $-\frac{1}{2}$ for AA, BB, and AB pairs, respectively.
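A minimal Python implementation of energy (19), parameterizing the chain by its bend angles (so it inherits the sign and exponent reconstruction above); the example sequence and angles are arbitrary:

```python
import numpy as np

def ab_energy(angles, sigma):
    """2D AB-model energy (19): bond-angle term plus Lennard-Jones-type pair term.

    angles: array of N-2 bend angles in [-pi, pi]; sigma: array of N entries, +1 for A, -1 for B.
    """
    N = len(sigma)
    # Build monomer coordinates from unit-length bonds and the bend angles.
    directions = np.cumsum(np.concatenate([[0.0], angles]))     # absolute bond directions
    bonds = np.column_stack([np.cos(directions), np.sin(directions)])
    coords = np.vstack([[0.0, 0.0], np.cumsum(bonds, axis=0)])  # N monomer positions in the plane
    # Bond-angle contribution.
    U = 0.25 * np.sum(1.0 - np.cos(angles))
    # Pairwise Lennard-Jones-type contribution over non-adjacent pairs.
    for i in range(N - 2):
        for j in range(i + 2, N):
            r = np.linalg.norm(coords[i] - coords[j])
            if sigma[i] == 1 and sigma[j] == 1:
                C2 = 1.0        # AA pair
            elif sigma[i] == -1 and sigma[j] == -1:
                C2 = 0.5        # BB pair
            else:
                C2 = -0.5       # AB pair
            U += 4.0 * (r**-12 - C2 * r**-6)
    return U

# Example: a short 5-monomer chain AABAB at arbitrary bend angles (illustrative only).
sigma = np.array([1, 1, -1, 1, -1])
print(ab_energy(np.array([0.3, -0.5, 0.8]), sigma))
```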

Protein Folding

Comparison of SAA and simulated annealing for the 2D AB models. (a) The minimum energy value obtained by SAA, subject to a post conjugate gradient minimization procedure starting from the best configurations found in each run. (b) The averaged minimum energy value sampled by the algorithm, with the standard deviation of the average in parentheses. (c) The minimum energy value sampled by the algorithm in all runs.

                 SAA                                       Simulated Annealing
N      Post(a)      Average(b)           Best(c)       Average(b)          Best(c)
13      -3.2941      -3.2833 (0.0011)     -3.2881       -3.1775 (0.0018)    -3.2012
21      -6.1980      -6.1578 (0.0020)     -6.1712       -5.9809 (0.0463)    -6.1201
34     -10.8060     -10.3396 (0.0555)    -10.7689       -9.5845 (0.1260)   -10.5240

Protein Folding

[Figure: minimum energy configurations produced by SAA (subject to post conjugate gradient optimization) for (a) the 13-mer sequence with energy value -3.2941, (b) the 21-mer sequence with energy value -6.1980, and (c) the 34-mer sequence with energy -10.8060. The solid and open circles indicate the hydrophobic and hydrophilic monomers, respectively.]

Summary

We have developed the SAA algorithm for global optimization. Under the framework of stochastic approximation, we show that SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, e.g., a square-root cooling schedule, while guaranteeing that the global energy minima are reached as the temperature tends to 0.

Compared to simulated annealing, an added advantage of SAA is its self-adjusting mechanism, which makes it immune to local traps.

Compared to space annealing SAMC, SAA shrinks its sample space in a soft way, gradually biasing the sampling toward the local energy minima of each subregion by lowering the temperature with iterations. This strategy of sample space shrinkage reduces the risk of SAA getting trapped in local energy minima.

SAA provides a more general framework of stochastic approximation than the current stochastic approximation MCMC algorithms. By including an additional control parameter $\tau_t$, stochastic approximation may find new applications or improve its performance in existing applications.

Acknowledgments

Collaborators: Yichen Cheng and Guang Lin.
Support: NSF grants and a KAUST grant.