
Simulated Stochastic Approximation Annealing for Global Optimization with a Square-Root Cooling Schedule

Faming Liang

Abstract

Simulated annealing has been widely used for solving optimization problems. It is well known that simulated annealing cannot be guaranteed to locate the global optima unless a logarithmic cooling schedule is used; however, the logarithmic schedule is so slow that the required CPU time is prohibitive. We propose a new stochastic optimization algorithm, the simulated stochastic approximation annealing (SAA) algorithm. Under the framework of stochastic approximation Markov chain Monte Carlo, we show that the new algorithm can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic schedule, e.g., a square-root cooling schedule, while still guaranteeing that the global optima are reached as the temperature tends to zero. The new algorithm has been tested on a few benchmark optimization problems, including feed-forward neural network training and protein folding. The numerical results indicate that the new algorithm can significantly outperform simulated annealing and other competitors.

The problem

The optimization problem can be stated as a minimization problem,
$$\min_{x \in \mathcal{X}} U(x),$$
where $\mathcal{X}$ is the domain of $U(x)$. Minimizing $U(x)$ is equivalent to sampling from the Boltzmann distribution
$$f_\tau(x) \propto \exp\{-U(x)/\tau\}$$
at a very small value of $\tau$ (close to 0).
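As a quick illustration of this link (a minimal sketch, not from the slides; the toy energy, the grid, and the temperature values are my own choices), the following Python snippet shows how the normalized Boltzmann weights over a discretized domain concentrate around the global minimizer as $\tau$ decreases:

```python
import numpy as np

def boltzmann(U_vals, tau):
    """Normalized Boltzmann weights proportional to exp(-U/tau) on a finite grid."""
    w = np.exp(-(U_vals - U_vals.min()) / tau)  # shift by min(U) for numerical stability
    return w / w.sum()

# Toy 1-D energy with two local minima (illustrative only).
x = np.linspace(-2.0, 2.0, 401)
U_vals = (x**2 - 1.0)**2 + 0.3 * x

for tau in (2.0, 0.5, 0.1, 0.01):
    p = boltzmann(U_vals, tau)
    mode = x[np.argmax(p)]
    mass_near_mode = p[np.abs(x - mode) < 0.1].sum()
    print(f"tau={tau:5.2f}  mode of f_tau ~ {mode:+.3f}  mass within 0.1 of mode = {mass_near_mode:.3f}")
```

As $\tau$ shrinks, essentially all of the probability mass piles up near the global minimizer, which is why sampling from $f_\tau$ at a small $\tau$ solves the minimization problem.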

Simulated Annealing (Kirkpatrick et al., 1983)

It simulates from a sequence of Boltzmann distributions, $f_{\tau_1}(x), f_{\tau_2}(x), \ldots, f_{\tau_m}(x)$, in a sequential manner, where the temperatures form a decreasing ladder $\tau_1 > \tau_2 > \cdots > \tau_m = \tau^* > 0$, with $\tau^*$ close to 0 and $\tau_1$ reasonably large such that most uphill Metropolis-Hastings (MH) moves at that level can be accepted.

Simulated Annealing: Algorithm

1. Initialize the simulation at temperature $\tau_1$ with an arbitrary sample $x_0 \in \mathcal{X}$.
2. At each temperature $\tau_i$, simulate from the distribution $f_{\tau_i}(x)$ for $n_i$ iterations using the MH sampler. Pass the final sample to the next lower temperature level as the initial sample.
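The following is a minimal, self-contained Python sketch of this procedure. The Gaussian random-walk proposal, the step size, and the geometric temperature ladder in the example are my own illustrative choices; the slide does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_annealing(U, x0, temperatures, n_iter_per_temp=100, step=0.1):
    """Simulated annealing with a random-walk Metropolis-Hastings sampler."""
    x = np.array(x0, dtype=float)
    u_x = U(x)
    best_x, best_u = x.copy(), u_x
    for tau in temperatures:                      # decreasing temperature ladder
        for _ in range(n_iter_per_temp):          # MH moves targeting f_tau
            y = x + step * rng.standard_normal(x.shape)
            u_y = U(y)
            if np.log(rng.uniform()) < -(u_y - u_x) / tau:   # MH acceptance for f_tau
                x, u_x = y, u_y
            if u_x < best_u:
                best_x, best_u = x.copy(), u_x
        # The final sample at this level becomes the initial sample at the next level.
    return best_x, best_u

# Example usage with a simple quadratic energy and a geometric ladder (illustrative only).
U = lambda x: float(np.sum((x - 1.0) ** 2))
taus = 5.0 * 0.95 ** np.arange(200)
print(simulated_annealing(U, np.zeros(2), taus))
```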

Simulated Annealing: Difficulty

The major difficulty with simulated annealing is choosing the cooling schedule:

Logarithmic cooling schedule, $O(1/\log(t))$: it ensures that the simulation converges to the global minima of $U(x)$ with probability 1, but it is so slow that the required running time is impractical.

Linear or geometric cooling schedule: commonly used in practice, but, as shown in Holley et al. (1989), these schedules can no longer guarantee that the global minima are reached.

Stochastic Approximation Monte Carlo (SAMC)

SAMC is a general-purpose MCMC algorithm. To be precise, it is an adaptive MCMC algorithm and also a dynamic importance sampling algorithm. Its self-adjusting mechanism enables it to be immune to local traps.

Let $E_1, \ldots, E_m$ denote a partition of the sample space $\mathcal{X}$ made according to the energy function as follows:
$$E_1 = \{x: U(x) \le u_1\}, \; E_2 = \{x: u_1 < U(x) \le u_2\}, \; \ldots, \; E_{m-1} = \{x: u_{m-2} < U(x) \le u_{m-1}\}, \; E_m = \{x: U(x) > u_{m-1}\}, \qquad (1)$$
where $u_1 < u_2 < \cdots < u_{m-1}$ are prespecified numbers.

Let $\{\gamma_t\}$ be a positive, non-increasing sequence satisfying the conditions
$$\sum_{t=1}^{\infty} \gamma_t = \infty, \qquad \sum_{t=1}^{\infty} \gamma_t^2 < \infty.$$

Stochastic Approximation Monte Carlo: Algorithm

1. (Sampling) Simulate a sample $X_{t+1}$ with a single MH update, which starts with $X_t$ and leaves the following distribution invariant:
$$f_{\theta_t,\tau}(x) \propto \sum_{i=1}^{m} \exp\left\{-\frac{U(x)}{\tau} - \theta_t^{(i)}\right\} I(x \in E_i), \qquad (2)$$
where $I(\cdot)$ is the indicator function.

2. ($\theta$-updating) Set
$$\theta_{t+\frac{1}{2}} = \theta_t + \gamma_{t+1} H_{\tau}(\theta_t, x_{t+1}), \qquad (3)$$
where $H_{\tau}(\theta_t, x_{t+1}) = e_{t+1} - \pi$, $e_{t+1} = (I(x_{t+1} \in E_1), \ldots, I(x_{t+1} \in E_m))$, and $\pi = (\pi_1, \ldots, \pi_m)$.

Obviously, it is difficult to mix over the domain $\mathcal{X}$ if the temperature $\tau$ is very low! In that case, only a few points will be sampled from each subregion.
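As a minimal sketch of one SAMC iteration (assuming a continuous domain with a Gaussian random-walk proposal; the helper names `subregion`, `log_working_density`, and `samc_iteration` are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def subregion(u, cutoffs):
    """Index i such that a point with energy u falls in E_i, for cutoffs u_1 < ... < u_{m-1}."""
    return int(np.searchsorted(cutoffs, u))

def log_working_density(u, idx, theta, tau):
    """log f_{theta,tau}(x) up to a constant: -U(x)/tau - theta^(i) on subregion E_i."""
    return -u / tau - theta[idx]

def samc_iteration(U, x, theta, tau, gamma, pi, cutoffs, step=0.2):
    # 1. Sampling: one MH update targeting the working density f_{theta,tau}.
    y = x + step * rng.standard_normal(np.shape(x))
    u_x, u_y = U(x), U(y)
    i_x, i_y = subregion(u_x, cutoffs), subregion(u_y, cutoffs)
    log_ratio = log_working_density(u_y, i_y, theta, tau) - log_working_density(u_x, i_x, theta, tau)
    if np.log(rng.uniform()) < log_ratio:
        x, i_x = y, i_y
    # 2. theta-updating: theta <- theta + gamma * (e - pi).
    e = np.zeros(len(theta))
    e[i_x] = 1.0
    theta = theta + gamma * (e - pi)
    return x, theta
```

The key point is that the working density in (2) reweights each subregion by $\exp(-\theta^{(i)})$, so a subregion that has been over-visited has its $\theta^{(i)}$ raised and is automatically penalized on subsequent iterations.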

Space Annealing SAMC (Liang, 2007)

Suppose that the sample space has been partitioned as in (1) with $u_1, \ldots, u_{m-1}$ arranged in ascending order. Let $\kappa(u)$ denote the index of the subregion that a sample $x$ with energy $u$ belongs to; for example, if $x \in E_j$, then $\kappa(U(x)) = j$. Let $\mathcal{X}^{(t)}$ denote the sample space at iteration $t$. Space annealing SAMC starts with $\mathcal{X}^{(1)} = \bigcup_{i=1}^{m} E_i$, and then iteratively shrinks the sample space by setting
$$\mathcal{X}^{(t)} = \bigcup_{i=1}^{\kappa(u_{\min}^{(t)} + \aleph)} E_i, \qquad (4)$$
where $u_{\min}^{(t)}$ is the minimum energy value obtained by iteration $t$, and $\aleph$ is a user-specified parameter.

A major shortcoming of this algorithm is that it tends to get trapped in local energy minima when $\aleph$ is small and the proposal is relatively local.
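A sketch of the shrinkage rule (4); the function names are illustrative, and `subregion` maps an energy value to its subregion index exactly as $\kappa$ does above:

```python
import numpy as np

def subregion(u, cutoffs):
    """Index of the subregion E_i containing energy u, for cutoffs u_1 < ... < u_{m-1}."""
    return int(np.searchsorted(cutoffs, u))

def in_current_space(u_candidate, u_min_so_far, aleph, cutoffs):
    """Shrinkage rule (4): keep only subregions with index up to kappa(u_min + aleph)."""
    return subregion(u_candidate, cutoffs) <= subregion(u_min_so_far + aleph, cutoffs)
```

One way to use this rule is to reject any proposed move whose energy falls outside the current space; this hard truncation is what makes the algorithm prone to trapping when $\aleph$ is small.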

SAA Algorithm

Simulated Stochastic Approximation Annealing, or SAA for short, is a combination of simulated annealing and stochastic approximation.

Let $\{M_k, k = 0, 1, \ldots\}$ be a sequence of positive numbers increasingly diverging to infinity, which serve as truncation bounds for $\{\theta_t\}$. Let $\sigma_t$ be a counter for the number of truncations up to iteration $t$, with $\sigma_0 = 0$. Let $\theta_0$ be a fixed point in $\Theta$. $E_1, \ldots, E_m$ is the partition of the sample space. $\pi = (\pi_1, \ldots, \pi_m)$ is the desired sampling distribution over the $m$ subregions. $\{\gamma_t\}$ is a gain factor sequence. $\{\tau_t\}$ is a temperature sequence.

SAA Algorithm

1. (Sampling) Simulate a sample $X_{t+1}$ with a single MH update, which starts with $X_t$ and leaves the following distribution invariant:
$$f_{\theta_t,\tau_{t+1}}(x) \propto \sum_{i=1}^{m} \exp\left\{-\frac{U(x)}{\tau_{t+1}} - \theta_t^{(i)}\right\} I(x \in E_i), \qquad (5)$$
where $I(\cdot)$ is the indicator function.

2. ($\theta$-updating) Set
$$\theta_{t+\frac{1}{2}} = \theta_t + \gamma_{t+1} H_{\tau_{t+1}}(\theta_t, x_{t+1}), \qquad (6)$$
where $H_{\tau_{t+1}}(\theta_t, x_{t+1}) = e_{t+1} - \pi$, $e_{t+1} = (I(x_{t+1} \in E_1), \ldots, I(x_{t+1} \in E_m))$, and $\pi = (\pi_1, \ldots, \pi_m)$.

3. (Truncation) If $\|\theta_{t+\frac{1}{2}}\| \le M_{\sigma_t}$, set $\theta_{t+1} = \theta_{t+\frac{1}{2}}$; otherwise, set $\theta_{t+1} = \theta_0$ and $\sigma_{t+1} = \sigma_t + 1$.
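Putting the three steps together, here is a minimal, self-contained Python sketch of the SAA loop. The Gaussian random-walk proposal, the max-norm used for truncation, the tenfold enlargement of the truncation bound, and all constants (including the square-root cooling constants) are my own illustrative choices rather than values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def saa(U, x0, cutoffs, pi, n_iter=100_000,
        C1=10.0, zeta=0.6,       # gain factor gamma_t = C1 / t**zeta
        C2=5.0, tau_star=0.01,   # square-root cooling tau_t = C2 / sqrt(t) + tau_star
        step=0.2, M0=1e3):
    m = len(pi)
    theta = np.zeros(m)          # theta_0: also the fixed re-initialization point
    theta_init = theta.copy()
    M, sigma = M0, 0             # current truncation bound M_sigma and truncation counter
    region = lambda u: int(np.searchsorted(cutoffs, u))
    x = np.array(x0, dtype=float)
    u_x = U(x)
    best_x, best_u = x.copy(), u_x
    for t in range(1, n_iter + 1):
        gamma = C1 / t**zeta
        tau = C2 / np.sqrt(t) + tau_star
        # 1. Sampling: one MH update leaving the working density in (5) invariant.
        y = x + step * rng.standard_normal(x.shape)
        u_y = U(y)
        i_x, i_y = region(u_x), region(u_y)
        log_ratio = (-u_y / tau - theta[i_y]) - (-u_x / tau - theta[i_x])
        if np.log(rng.uniform()) < log_ratio:
            x, u_x, i_x = y, u_y, i_y
        if u_x < best_u:
            best_x, best_u = x.copy(), u_x
        # 2. theta-updating: theta_{t+1/2} = theta_t + gamma * (e - pi).
        e = np.zeros(m)
        e[i_x] = 1.0
        theta_half = theta + gamma * (e - pi)
        # 3. Truncation: re-initialize theta and enlarge the bound if it escapes.
        if np.max(np.abs(theta_half)) <= M:
            theta = theta_half
        else:
            theta = theta_init.copy()
            sigma += 1
            M *= 10.0
    return best_x, best_u, theta

# Example usage on a toy 2-D energy (the cutoffs and pi are illustrative).
U = lambda z: float(np.sum((np.asarray(z) - 1.0) ** 2))
cutoffs = np.array([0.5, 1.0, 2.0, 4.0])   # defines m = 5 subregions
pi = np.full(5, 0.2)
print(saa(U, np.zeros(2), cutoffs, pi, n_iter=20_000))
```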

Features of SAA

Self-adjusting mechanism: this distinguishes SAA from simulated annealing. For simulated annealing, the change of the invariant distribution is determined solely by the temperature ladder; for SAA, it is determined by both the temperature ladder and the past samples. As a result, SAA can converge with a much faster cooling schedule.

Sample space shrinkage: compared to space annealing SAMC, SAA also shrinks its sample space with iterations, but in a soft way: it gradually biases sampling toward the local energy minima of each subregion by lowering the temperature with iterations. This strategy of sample space shrinkage reduces the risk of getting trapped in local energy minima.

Convergence: from the perspective of practical applications, SAA achieves essentially the same convergence toward the global energy minima as simulated annealing.

Formulation of SAA

The SAA algorithm can be formulated as a SAMCMC algorithm aimed at solving the integral equation
$$h_{\tau^*}(\theta) = \int_{\mathcal{X}} H_{\tau^*}(\theta, x)\, f_{\theta,\tau^*}(x)\,dx = 0, \qquad (7)$$
where $f_{\theta,\tau^*}(x)$ denotes a density function depending on $\theta$ and the limiting temperature $\tau^*$, and $h_{\tau^*}$ is called the mean field function. SAA works by solving a system of equations defined along the temperature sequence $\{\tau_t\}$:
$$h_{\tau_t}(\theta) = \int_{\mathcal{X}} H_{\tau_t}(\theta, x)\, f_{\theta,\tau_t}(x)\,dx = 0, \quad t = 1, 2, \ldots, \qquad (8)$$
where $f_{\theta,\tau_t}(x)$ is a density function depending on $\theta$ and the temperature $\tau_t$.

Conditions on the mean field function

For SAA, the mean field function is given by
$$h_\tau(\theta) = \int_{\mathcal{X}} H_\tau(\theta, x)\, f_{\theta,\tau}(x)\,dx = \left( \frac{S_\tau^{(1)}(\theta)}{S_\tau(\theta)} - \pi_1, \ldots, \frac{S_\tau^{(m)}(\theta)}{S_\tau(\theta)} - \pi_m \right), \qquad (9)$$
for any fixed value of $\theta \in \Theta$ and $\tau \in \mathcal{T}$, where $S_\tau^{(i)}(\theta) = \int_{E_i} e^{-U(x)/\tau}\,dx \big/ e^{\theta^{(i)}}$ and $S_\tau(\theta) = \sum_{i=1}^{m} S_\tau^{(i)}(\theta)$. Further, we define
$$v_\tau(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( \frac{S_\tau^{(i)}(\theta)}{S_\tau(\theta)} - \pi_i \right)^2, \qquad (10)$$
which is the so-called Lyapunov function in the stochastic approximation literature. Then it is easy to verify that SAA satisfies the stability condition.

Stability Condition: (A1)

The function $h_\tau(\theta)$ is bounded and continuously differentiable with respect to both $\theta$ and $\tau$, and there exists a non-negative, upper-bounded, and continuously differentiable function $v_\tau(\theta)$ such that for any $\Delta > \delta > 0$,
$$\sup_{\delta \le d((\theta,\tau),\, \mathcal{L}) \le \Delta} \nabla_\theta^T v_\tau(\theta)\, h_\tau(\theta) < 0, \qquad (11)$$
where $\mathcal{L} = \{(\theta,\tau): h_\tau(\theta) = 0, \theta \in \Theta, \tau \in \mathcal{T}\}$ is the zero set of $h_\tau(\theta)$, and $d(z, S) = \inf_y \{\|z - y\|: y \in S\}$. Further, the set $v(\mathcal{L}) = \{v_\tau(\theta): (\theta,\tau) \in \mathcal{L}\}$ is nowhere dense.

Conditions on the observation noise

Observation noise: $\xi_{t+1} = H_{\tau_{t+1}}(\theta_t, x_{t+1}) - h_{\tau_{t+1}}(\theta_t)$.

One can directly impose conditions on the observation noise, see e.g. Kushner and Clark (1978), Kulkarni and Horn (1995), and Chen (2002). Such conditions are usually very weak but difficult to verify. Alternatively, one can impose conditions on the Markov transition kernel, which lead to the required conditions on the observation noise.

Doeblin condition: (A2)

(A2) (Doeblin condition) For any given $\theta \in \Theta$ and $\tau \in \mathcal{T}$, the Markov transition kernel $P_{\theta,\tau}$ is irreducible and aperiodic. In addition, there exist an integer $l$, $0 < \delta < 1$, and a probability measure $\nu$ such that for any compact subset $K \subset \Theta$,
$$\inf_{\theta \in K,\, \tau \in \mathcal{T}} P_{\theta,\tau}^{l}(x, A) \ge \delta\, \nu(A), \quad \forall x \in \mathcal{X}, \; \forall A \in \mathcal{B}_{\mathcal{X}},$$
where $\mathcal{B}_{\mathcal{X}}$ denotes the Borel $\sigma$-field of $\mathcal{X}$; that is, the whole support $\mathcal{X}$ is a small set for each kernel $P_{\theta,\tau}$, $\theta \in K$ and $\tau \in \mathcal{T}$.

Uniform ergodicity is slightly stronger than $V$-uniform ergodicity, but it suffices for SAA, for which the function $H_\tau(\theta, X)$ is bounded, and thus the mean field function and the observation noise are bounded. If the drift function $V(x) \equiv 1$, then $V$-uniform ergodicity reduces to uniform ergodicity.

Doeblin condition

To verify (A2), one may assume that $\mathcal{X}$ is compact, $U(x)$ is bounded on $\mathcal{X}$, and the proposal distribution $q(x, y)$ satisfies the local positive condition:

(Q) There exist $\delta_q > 0$ and $\epsilon_q > 0$ such that, for every $x \in \mathcal{X}$, $\|x - y\| \le \delta_q$ implies $q(x, y) \ge \epsilon_q$.

Conditions on $\{\gamma_t\}$ and $\{\tau_t\}$: (A3)

(i) The sequence $\{\gamma_t\}$ is positive, non-increasing, and satisfies the conditions
$$\sum_{t=1}^{\infty} \gamma_t = \infty, \qquad \frac{\gamma_{t+1} - \gamma_t}{\gamma_t} = O(\gamma_{t+1}^{\iota}), \qquad \sum_{t=1}^{\infty} \gamma_t^{(1+\iota')/2}\, t^{-1/2} < \infty, \qquad (12)$$
for some $\iota \in [1, 2)$ and $\iota' \in (0, 1)$.

(ii) The sequence $\{\tau_t\}$ is positive and non-increasing and satisfies the conditions
$$\lim_{t \to \infty} \tau_t = \tau^*, \qquad \tau_t - \tau_{t+1} = o(\gamma_t), \qquad \sum_{t=1}^{\infty} \gamma_t\, |\tau_t - \tau_{t+1}|^{\iota''} < \infty \text{ for some } \iota'' \in (0, 1), \qquad (13)$$
$$\sum_{t=1}^{\infty} \gamma_t\, |\tau_t - \tau^*| < \infty. \qquad (14)$$

Conditions on $\{\gamma_t\}$ and $\{\tau_t\}$

For the sequences $\{\gamma_t\}$ and $\{\tau_t\}$, one can typically set
$$\gamma_t = \frac{C_1}{t^{\varsigma}}, \qquad \tau_t = \frac{C_2}{\sqrt{t}} + \tau^*, \qquad (15)$$
for some constants $C_1 > 0$, $C_2 > 0$, and $\varsigma \in (0.5, 1]$; the temperature sequence is the square-root cooling schedule. Then it is easy to verify that (15) satisfies (A3).
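For concreteness, a few values of these sequences with illustrative constants ($C_1 = 1$, $C_2 = 1$, $\varsigma = 0.6$, $\tau^* = 0.01$ are my own choices, not the paper's):

```python
import numpy as np

C1, C2, zeta, tau_star = 1.0, 1.0, 0.6, 0.01
for t in (1, 10, 100, 1_000, 10_000, 100_000):
    gamma_t = C1 / t**zeta              # gain factor, eq. (15)
    tau_t = C2 / np.sqrt(t) + tau_star  # square-root cooling schedule, eq. (15)
    print(f"t={t:>7d}  gamma_t={gamma_t:.5f}  tau_t={tau_t:.5f}")
```

Note how $\tau_t$ approaches $\tau^*$ far faster than a logarithmic schedule would, while $\gamma_t$ still decays slowly enough that $\sum_t \gamma_t$ diverges.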

Convergence Theorem

Theorem 1. Assume that $\mathcal{T}$ is compact and conditions (A1)-(A3) hold. If the point $\theta_0$ used in the SAA algorithm is such that $\sup_{\tau \in \mathcal{T}} v_\tau(\theta_0) < \inf_{\|\theta\| = c_0,\, \tau \in \mathcal{T}} v_\tau(\theta)$ for some $c_0 > 0$ with $\|\theta_0\| < c_0$, then the number of truncations in SAA is almost surely finite; that is, $\{\theta_t\}$ remains in a compact subset of $\Theta$ almost surely.

Convergence Theorem

Theorem 2. Assume the conditions of Theorem 1 hold. Then, as $t \to \infty$,
$$d(\theta_t, \mathcal{L}_{\tau^*}) \to 0, \quad a.s.,$$
where $\mathcal{L}_{\tau^*} = \{\theta \in \Theta: h_{\tau^*}(\theta) = 0\}$ and $d(z, S) = \inf_y \{\|z - y\|: y \in S\}$. That is,
$$\theta_t^{(i)} \to \begin{cases} C + \log\left( \int_{E_i} f_{\tau^*}(x)\,dx \right) - \log(\pi_i + \pi_e), & \text{if } E_i \neq \emptyset, \\ -\infty, & \text{if } E_i = \emptyset, \end{cases}$$
where $C$ is a constant, $\pi_e = \sum_{j: E_j = \emptyset} \pi_j / (m - m_0)$, and $m_0$ is the number of empty subregions.

Strong Law of Large Numbers (SLLN)

Theorem 3. Assume the conditions of Theorem 1 hold. Let $x_1, \ldots, x_n$ denote a set of samples simulated by SAA in $n$ iterations. Let $g: \mathcal{X} \to \mathbb{R}$ be a measurable function that is bounded and integrable with respect to $f_{\theta^*,\tau^*}(x)$, where $\theta^*$ denotes a limiting point of $\{\theta_t\}$. Then
$$\frac{1}{n} \sum_{k=1}^{n} g(x_k) \to \int_{\mathcal{X}} g(x)\, f_{\theta^*,\tau^*}(x)\,dx, \quad a.s.$$

Convergence to Global Minima

Corollary. Assume the conditions of Theorem 1 hold. Let $x_1, \ldots, x_t$ denote a set of samples simulated by SAA in $t$ iterations, and let $J(x_k)$ denote the index of the subregion that $x_k$ belongs to. Then, for any $\epsilon > 0$, as $t \to \infty$,
$$\frac{\sum_{k=1}^{t} I\left(U(x_k) \le u_i + \epsilon \;\&\; J(x_k) = i\right)}{\sum_{k=1}^{t} I\left(J(x_k) = i\right)} \to \frac{\int_{\{x: U(x) \le u_i + \epsilon\} \cap E_i} e^{-U(x)/\tau^*}\,dx}{\int_{E_i} e^{-U(x)/\tau^*}\,dx}, \quad a.s., \qquad (16)$$
for $i = 1, \ldots, m$, where $I(\cdot)$ denotes the indicator function. Moreover, if $\tau^*$ goes to 0, then
$$P\left( U(X_t) \le u_i + \epsilon \mid J(X_t) = i \right) \to 1, \quad i = 1, \ldots, m. \qquad (17)$$

For simulated annealing, as shown in Haario and Saksman (1991), the following convergence can be achieved with a logarithmic cooling schedule: for any $\epsilon > 0$,
$$P\left( U(X_t) \le u_1 + \epsilon \right) \to 1, \qquad (18)$$
as $t \to \infty$.

Comparison with Simulated Annealing

Simulated annealing can achieve a stronger convergence mode than SAA. As a trade-off, SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, such as the square-root cooling schedule. From the perspective of practical applications, (17) and (18) are almost equivalent: both allow one to identify a sequence of samples converging to the global energy minima of $U(x)$.

In practice, SAA often works better than simulated annealing. This is because SAA possesses the self-adjusting mechanism, which makes it immune to local traps.

A 10-state Distribution

The unnormalized mass function of the 10-state distribution:

x       1     2     3    4     5     6    7     8     9    10
P(x)    5   100    40    1   125    75    1   150    50    20

The sample space $\mathcal{X} = \{1, 2, \ldots, 10\}$ was partitioned according to the mass function into five subregions: $E_1 = \{8\}$, $E_2 = \{2, 5\}$, $E_3 = \{6, 9\}$, $E_4 = \{3\}$, and $E_5 = \{1, 4, 7, 10\}$.
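To make the setup concrete, here is one way to encode the distribution and the partition in Python (a sketch; taking the energy as $U(x) = -\log P(x)$ is my reading of "partitioned according to the mass function"):

```python
import numpy as np

P = np.array([5, 100, 40, 1, 125, 75, 1, 150, 50, 20], dtype=float)  # unnormalized mass, states 1..10
U = -np.log(P)                                                        # energy of each state

# The five subregions, listed from the lowest-energy (highest-mass) states upward.
E = {1: [8], 2: [2, 5], 3: [6, 9], 4: [3], 5: [1, 4, 7, 10]}
pi = np.full(5, 1.0 / 5)          # desired (uniform) sampling distribution over subregions

for i, states in E.items():
    energies = np.round(U[np.array(states) - 1], 3)
    print(f"E_{i}: states {states}, energies {energies}")
```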

A 10-state Distribution

Convergence of $\theta_t$ for the 10-state distribution: the true value $\theta_n$ is calculated at the end temperature 0.0104472, $\hat{\theta}_n$ is the average of the estimated $\theta_n$ over 5 independent runs, s.d. is the standard deviation of $\hat{\theta}_n$, and freq is the averaged relative sampling frequency of each subregion. The standard deviation of freq is nearly 0.

Subregion    E_1        E_2          E_3          E_4           E_5
θ_n          6.3404    -11.1113     -60.0072     -120.1772     -186.5248
θ̂_n          6.3404    -11.1116     -60.0009     -120.1687     -186.5044
s.d.         0          6.26×10^-3   2.28×10^-3   6.01×10^-3    8.16×10^-3
freq         20.29%     20.23%       20.05%       19.84%        19.6%

A 10-state Distribution

[Figure: a thinned sample path of SAA for the 10-state distribution (sampled state vs. iteration, for iterations 49,000,000 to 50,000,000).]

A function with multiple local minima

Consider minimizing the function
$$U(x) = -\{x_1 \sin(20 x_2) + x_2 \sin(20 x_1)\}^2 \cosh\{\sin(10 x_1)\, x_1\} - \{x_1 \cos(10 x_2) - x_2 \sin(10 x_1)\}^2 \cosh\{\cos(20 x_2)\, x_2\},$$
where $x = (x_1, x_2) \in [-1.1, 1.1]^2$.
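A direct Python transcription of this test function (it inherits the sign pattern reconstructed above, so treat it as a sketch), together with a crude grid scan to show the scale of its minima:

```python
import numpy as np

def U(x):
    """Multimodal test energy on [-1.1, 1.1]^2, as written above."""
    x1, x2 = x
    term1 = (x1 * np.sin(20 * x2) + x2 * np.sin(20 * x1)) ** 2 * np.cosh(np.sin(10 * x1) * x1)
    term2 = (x1 * np.cos(10 * x2) - x2 * np.sin(10 * x1)) ** 2 * np.cosh(np.cos(20 * x2) * x2)
    return -term1 - term2

# Crude grid scan (illustrative check only).
grid = np.linspace(-1.1, 1.1, 221)
values = np.array([[U((a, b)) for b in grid] for a in grid])
i, j = np.unravel_index(values.argmin(), values.shape)
print(f"grid minimum {values.min():.4f} near x = ({grid[i]:+.2f}, {grid[j]:+.2f})")
```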

A function with multiple local minima

Comparison of SAA and simulated annealing for the multi-modal example: average of minimum energy values by iteration count (standard errors in parentheses).

Iterations          20000          40000          60000          80000          100000
SAA                -8.1145        -8.1198        -8.1214        -8.1223        -8.1229
                   (3.0×10^-4)    (1.5×10^-4)    (1.0×10^-4)    (7.5×10^-5)    (5.9×10^-5)
SA (square-root)   -5.9227        -5.9255        -5.9265        -5.9269        -5.9271
                   (1.3×10^-)     (1.3×10^-)     (1.3×10^-)     (1.3×10^-)     (1.3×10^-)
SA (geometric)     -6.5534        -6.5598        -6.5611        -6.5617        -6.5620
                   (3.3×10^-)     (3.3×10^-)     (3.3×10^-)     (3.3×10^-)     (3.3×10^-)

A function with multiple local minima

[Figure: (a) contour of U(x); (b) sample path of SAA; (c) sample path of simulated annealing with a square-root cooling schedule; (d) sample path of simulated annealing with a geometric cooling schedule. The white circles show the global minima of U(x).]

Feed-forward Neural Networks

[Figure: a fully connected one-hidden-layer MLP network with four input units (I1, I2, I3, I4), one bias unit (B), three hidden units (H1, H2, H3), and one output unit (O). The arrows show the direction of data feeding.]

Two-spiral Problem

The two-spiral problem is to train a feed-forward neural network that distinguishes between points on two intertwined spirals. It is a benchmark problem for feed-forward neural network training: the objective function is high-dimensional, highly nonlinear, and has a multitude of local energy minima separated by high energy barriers.
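To illustrate the form of this objective (not the exact data set, architecture, or parameterization used in the paper), the sketch below generates a standard two-spiral data set and evaluates the sum-of-squared-errors energy of a one-hidden-layer sigmoid MLP at a given weight vector; the 30-hidden-unit figure matches the network mentioned with the classification maps below, and everything else is an assumption.

```python
import numpy as np

def two_spirals(n_per_class=97, noise=0.0, seed=3):
    """Generate two intertwined spirals (labels 0 and 1), as in the classic benchmark."""
    rng = np.random.default_rng(seed)
    i = np.arange(n_per_class)
    r = 6.5 * (104 - i) / 104
    phi = i * np.pi / 16
    s1 = np.column_stack([r * np.cos(phi), r * np.sin(phi)])
    s2 = -s1                                        # the second spiral is the point reflection
    X = np.vstack([s1, s2]) + noise * rng.standard_normal((2 * n_per_class, 2))
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

def mlp_energy(w, X, y, n_hidden=30):
    """Sum-of-squared-errors energy of a one-hidden-layer MLP with sigmoid units."""
    d = X.shape[1]
    k = 0
    W1 = w[k:k + n_hidden * d].reshape(n_hidden, d); k += n_hidden * d
    b1 = w[k:k + n_hidden]; k += n_hidden
    W2 = w[k:k + n_hidden]; k += n_hidden
    b2 = w[k]
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(X @ W1.T + b1)            # hidden-layer activations
    out = sigmoid(h @ W2 + b2)            # single output unit
    return float(np.sum((out - y) ** 2))

X, y = two_spirals()
n_w = 30 * 2 + 30 + 30 + 1                # total number of weights for 30 hidden units
w0 = np.random.default_rng(4).standard_normal(n_w)
print("energy at a random weight vector:", mlp_energy(w0, X, y))
```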

Two-spiral Problem

[Figure: classification maps learned by SAA with an MLP of 30 hidden units. The black and white points show the training data for the two intertwined spirals. (a) Classification map learned in one run of SAA. (b) Classification map averaged over 20 runs. This figure demonstrates the success of SAA in the optimization of complex functions.]

Two-spiral Problem

Comparison of SAA, space annealing SAMC, simulated annealing, and BFGS for the two-spiral example. Notation: $v_i$ denotes the minimum energy value obtained in the $i$th run, $i = 1, \ldots, 20$; Mean is the average of the $v_i$; SD is the standard deviation of the mean; Min $= \min_{i=1}^{20} v_i$; Max $= \max_{i=1}^{20} v_i$; Prop $= \#\{i: v_i \le 0.21\}$; Iter is the average number of iterations performed in each run. SA-1 employs the linear cooling schedule, and SA-2 employs the geometric cooling schedule with a decreasing rate of 0.9863.

Algorithm               Mean     SD      Min      Max     Prop   Iter (×10^6)
SAA                      0.341   0.099    0.201    2.04   18      5.82
Space annealing SAMC     0.620   0.191    0.187    3.23   15      7.07
Simulated annealing-1   17.485   0.706    9.02    22.06    0     10.0
Simulated annealing-2    6.433   0.450    3.03    11.02    0     10.0
BFGS                    15.50    0.899   10.00    24.00    0      -

Protein Folding

The AB model consists of only two types of monomers, A and B, which behave as hydrophobic ($\sigma_i = +1$) and hydrophilic ($\sigma_i = -1$) monomers, respectively. The monomers are linked by rigid bonds of unit length to form linear chains living in two- or three-dimensional space. For the 2D case, the energy function consists of two types of contributions, bond angle and Lennard-Jones, and is given by
$$U(x) = \frac{1}{4} \sum_{i=1}^{N-2} \left(1 - \cos x_{i,i+1}\right) + 4 \sum_{i=1}^{N-2} \sum_{j=i+2}^{N} \left[ r_{ij}^{-12} - C_2(\sigma_i, \sigma_j)\, r_{ij}^{-6} \right], \qquad (19)$$
where $x = (x_{1,2}, \ldots, x_{N-2,N-1})$, $x_{i,i+1} \in [-\pi, \pi]$ is the bend angle between the $i$th and $(i+1)$th bond vectors, and $r_{ij}$ is the distance between monomers $i$ and $j$. The constant $C_2(\sigma_i, \sigma_j)$ is $+1$, $+\frac{1}{2}$, and $-\frac{1}{2}$ for AA, BB, and AB pairs, respectively.
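A minimal Python implementation of energy (19), parameterizing the chain by its bend angles (so it inherits the sign and exponent reconstruction above); the example sequence and angles are arbitrary:

```python
import numpy as np

def ab_energy(angles, sigma):
    """2D AB-model energy (19): bond-angle term plus Lennard-Jones-type pair term.

    angles: array of N-2 bend angles in [-pi, pi]; sigma: array of N entries, +1 for A, -1 for B.
    """
    N = len(sigma)
    # Build monomer coordinates from unit-length bonds and the bend angles.
    directions = np.cumsum(np.concatenate([[0.0], angles]))     # absolute bond directions
    bonds = np.column_stack([np.cos(directions), np.sin(directions)])
    coords = np.vstack([[0.0, 0.0], np.cumsum(bonds, axis=0)])  # N monomer positions in the plane
    # Bond-angle contribution.
    U = 0.25 * np.sum(1.0 - np.cos(angles))
    # Pairwise Lennard-Jones-type contribution over non-adjacent pairs.
    for i in range(N - 2):
        for j in range(i + 2, N):
            r = np.linalg.norm(coords[i] - coords[j])
            if sigma[i] == 1 and sigma[j] == 1:
                C2 = 1.0        # AA pair
            elif sigma[i] == -1 and sigma[j] == -1:
                C2 = 0.5        # BB pair
            else:
                C2 = -0.5       # AB pair
            U += 4.0 * (r**-12 - C2 * r**-6)
    return U

# Example: a short 5-monomer chain AABAB at arbitrary bend angles (illustrative only).
sigma = np.array([1, 1, -1, 1, -1])
print(ab_energy(np.array([0.3, -0.5, 0.8]), sigma))
```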

Protein Folding

Comparison of SAA and simulated annealing for the 2D AB models. (a) The minimum energy value obtained by SAA, subject to a post conjugate gradient minimization procedure starting from the best configurations found in each run. (b) The averaged minimum energy value sampled by the algorithm, with the standard deviation of the average in parentheses. (c) The minimum energy value sampled by the algorithm in all runs.

                 SAA                                       Simulated Annealing
N      Post(a)      Average(b)           Best(c)       Average(b)          Best(c)
13      -3.2941      -3.2833 (0.0011)     -3.2881       -3.1775 (0.0018)    -3.2012
21      -6.1980      -6.1578 (0.0020)     -6.1712       -5.9809 (0.0463)    -6.1201
34     -10.8060     -10.3396 (0.0555)    -10.7689       -9.5845 (0.1260)   -10.5240

Protein Folding

[Figure: minimum energy configurations produced by SAA (subject to post conjugate gradient optimization) for (a) the 13-mer sequence with energy value -3.2941, (b) the 21-mer sequence with energy value -6.1980, and (c) the 34-mer sequence with energy -10.8060. The solid and open circles indicate the hydrophobic and hydrophilic monomers, respectively.]

Summary

We have developed the SAA algorithm for global optimization. Under the framework of stochastic approximation, we show that SAA can work with a cooling schedule in which the temperature decreases much faster than in the logarithmic cooling schedule, e.g., a square-root cooling schedule, while guaranteeing that the global energy minima are reached as the temperature tends to 0.

Compared to simulated annealing, an added advantage of SAA is its self-adjusting mechanism, which makes it immune to local traps.

Compared to space annealing SAMC, SAA shrinks its sample space in a soft way, gradually biasing the sampling toward the local energy minima of each subregion by lowering the temperature with iterations. This strategy of sample space shrinkage reduces the risk of SAA getting trapped in local energy minima.

SAA provides a more general framework of stochastic approximation than the current stochastic approximation MCMC algorithms. By including an additional control parameter $\tau_t$, stochastic approximation may find new applications or improve its performance in existing applications.

Acknowledgments

Collaborators: Yichen Cheng and Guang Lin.
Support: NSF grants and a KAUST grant.