Gradient-based Adaptive Stochastic Search


Enlu Zhou, H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. November 5, 2014.

Outline
1. Introduction
2. GASS for non-differentiable optimization
3. GASS for simulation optimization
4. Conclusions

Introduction. We consider $x^* \in \arg\max_{x \in \mathcal{X}} H(x)$, where, given any $x \in \mathcal{X}$, $H(x)$ can be evaluated exactly. We are interested in objective functions that lack structural properties (such as convexity and differentiability), have multiple local optima, and can only be assessed by black-box evaluation.

Examples of Objective Functions.

Background. Stochastic search methods use a randomized mechanism to generate a sequence of iterates, e.g., simulated annealing (Kirkpatrick et al. 1983), genetic algorithms (Goldberg 1989), tabu search (Glover 1990), the nested partitions method (Shi and Ólafsson 2000), pure adaptive search (Zabinsky 2003), sequential Monte Carlo simulated annealing (Zhou and Chen 2011), and model-based algorithms (survey by Zlochin et al. 2004). Model-based algorithms generate candidate solutions from a sampling distribution (i.e., a probabilistic model), e.g., ant colony optimization (Dorigo and Gambardella 1997), annealing adaptive search (Romeijn and Smith 1994), estimation of distribution algorithms (Mühlenbein and Paaß 1996), the cross-entropy method (Rubinstein 1997), and model reference adaptive search (Hu et al. 2007).

Model-based optimization. [Figure 1: Optimization via model-based methods; the panels show the sampling distribution over the solution space $\mathcal{X}$ and the objective $H(x)$.]

Outline
1. Introduction
2. GASS for non-differentiable optimization
3. GASS for simulation optimization
4. Conclusions

Reformulation. Original problem:

$$x^* \in \arg\max_{x \in \mathcal{X}} H(x), \qquad \mathcal{X} \subseteq \mathbb{R}^n.$$

Let $\{f(x;\theta)\}$ be a parameterized family of probability density functions on $\mathcal{X}$. Then

$$\int H(x)\, f(x;\theta)\, dx \;\le\; H(x^*) \triangleq H^*, \qquad \forall\, \theta \in \mathbb{R}^d,$$

and equality is achieved if and only if there exists $\theta^*$ such that the probability mass of $f(x;\theta^*)$ is concentrated on a subset of the optimal solutions. New problem:

$$\theta^* \in \arg\max_{\theta \in \mathbb{R}^d} \int H(x)\, f(x;\theta)\, dx.$$

Why reformulation? Possible scenarios: the original problem $\arg\max_{x \in \mathcal{X}} H(x)$ may be discrete in $x$ and non-differentiable in $x$, whereas the new problem $\arg\max_{\theta} \int H(x) f(x;\theta)\, dx$ is continuous in $\theta$ and differentiable in $\theta$. This lets us incorporate model-based optimization into gradient-based optimization: 1) generate candidate solutions from $f(\cdot;\theta)$ on the solution space $\mathcal{X}$; 2) use a gradient-based method to update the parameter $\theta$. The approach combines the robustness of model-based optimization with the relatively fast convergence of gradient-based optimization.

More reformulation. For an arbitrary but fixed $\theta' \in \mathbb{R}^d$, define the function

$$l(\theta;\theta') \triangleq \ln\left(\int S_{\theta'}(H(x))\, f(x;\theta)\, dx\right).$$

The shape function $S_{\theta'}(\cdot): \mathbb{R} \to \mathbb{R}^+$ is chosen to ensure $0 < l(\theta;\theta') \le \ln\left(S_{\theta'}(H^*)\right)$ for all $\theta$, where equality is achieved if there exists a $\theta^*$ such that the probability mass of $f(x;\theta^*)$ is concentrated on a subset of the global optima. So consider $\max_{\theta}\, l(\theta;\theta')$.
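
For intuition, one simple illustrative choice of shape function (an assumption on my part, not necessarily the one used in the talk) is an exponential weighting $S_{\theta'}(y) = \exp(y/\lambda)$ with $\lambda > 0$, or a quantile-based weighting $S_{\theta'}(y) = \mathbb{1}\{y \ge \gamma_{\theta'}\}$ with $\gamma_{\theta'}$ the $(1-\rho)$-quantile of $H(X)$ under $X \sim f(\cdot;\theta')$. Both assign larger weight to better solutions, which is the essential property: maximizing $l(\theta;\theta')$ then pushes $f(\cdot;\theta)$ toward regions where $H$ is large.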

Parameter updating. Suppose $\{f(\cdot;\theta)\}$ is an exponential family of densities, i.e.,

$$f(x;\theta) = \exp\{\theta^T T(x) - \phi(\theta)\}, \qquad \phi(\theta) = \ln\left\{\int \exp\left(\theta^T T(x)\right) dx\right\}.$$

Then

$$\nabla_\theta\, l(\theta;\theta')\big|_{\theta=\theta'} = \mathbb{E}_{p(\cdot;\theta')}[T(X)] - \mathbb{E}_{\theta'}[T(X)], \qquad \nabla_\theta^2\, l(\theta;\theta')\big|_{\theta=\theta'} = \mathrm{Var}_{p(\cdot;\theta')}[T(X)] - \mathrm{Var}_{\theta'}[T(X)],$$

where

$$p(x;\theta') \triangleq \frac{S_{\theta'}(H(x))\, f(x;\theta')}{\int S_{\theta'}(H(x))\, f(x;\theta')\, dx}.$$

This suggests a Newton-like scheme for updating $\theta$:

$$\theta_{k+1} = \theta_k + \alpha_k \left(\mathrm{Var}_{\theta_k}[T(X)] + \epsilon I\right)^{-1}\left(\mathbb{E}_{p(\cdot;\theta_k)}[T(X)] - \mathbb{E}_{\theta_k}[T(X)]\right), \qquad \alpha_k > 0,\ \epsilon > 0.$$
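
As a concrete worked instance (my example, consistent with the formulas above): for a univariate Gaussian family $N(\mu, \sigma^2)$, the sufficient statistic is $T(x) = (x, x^2)^T$, and the exact quantities entering the update are

$$\mathbb{E}_\theta[T(X)] = \begin{pmatrix} \mu \\ \mu^2 + \sigma^2 \end{pmatrix}, \qquad \mathrm{Var}_\theta[T(X)] = \begin{pmatrix} \sigma^2 & 2\mu\sigma^2 \\ 2\mu\sigma^2 & 2\sigma^4 + 4\mu^2\sigma^2 \end{pmatrix},$$

while $\mathbb{E}_{p(\cdot;\theta')}[T(X)]$ is the mean of $(X, X^2)$ under the reweighted density $p(\cdot;\theta')$; the update therefore moves the Gaussian's first two moments toward those of the elite-weighted samples.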

In the Newton-like scheme above, $\mathrm{Var}_{\theta_k}[T(X)] = \mathbb{E}\!\left[\nabla_\theta \ln f(X;\theta_k)\, \nabla_\theta \ln f(X;\theta_k)^T\right]$ is the Fisher information matrix, which leads to the following facts: $\mathrm{Var}_{\theta_k}[T(X)]^{-1}$ is the minimum-variance step size in stochastic approximation; $\mathrm{Var}_{\theta_k}[T(X)]^{-1}$ adapts the gradient step to our belief about promising regions (think about $T(X) = X$...); and $\mathrm{Var}_{\theta_k}[T(X)]^{-1}\, \nabla_\theta l(\theta;\theta_k)\big|_{\theta=\theta_k}$ is the gradient of $l(\theta;\theta_k)$ on the statistical manifold equipped with the Fisher metric.

Main algorithm: Gradient-based Adaptive Stochastic Search (GASS).
1) Initialization: set $k = 0$.
2) Sampling: draw samples $x_k^i \overset{iid}{\sim} f(x;\theta_k)$, $i = 1, 2, \ldots, N_k$.
3) Updating: update the parameter $\theta$ according to
$$\theta_{k+1} = \theta_k + \alpha_k\left(\widehat{\mathrm{Var}}_{\theta_k}[T(X)] + \epsilon I\right)^{-1}\left(\widehat{\mathbb{E}}_{p_k}[T(X)] - \mathbb{E}_{\theta_k}[T(X)]\right),$$
where $\widehat{\mathrm{Var}}_{\theta_k}[T(X)]$ and $\widehat{\mathbb{E}}_{p_k}[T(X)]$ are estimates computed from the samples $\{x_k^i,\ i = 1, \ldots, N_k\}$.
4) Stopping: if a stopping criterion is satisfied, stop and return the current best sampled solution; else set $k := k + 1$ and go back to step 2).
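
To make the loop concrete, here is a minimal Python sketch of a GASS-style iteration with an independent (diagonal) Gaussian sampling family on $\mathbb{R}^n$, where $T(x) = (x, x^2)$ componentwise. The quantile-threshold shape function, the step-size schedule, and all names (gass_maximize, rho, etc.) are illustrative assumptions, not the reference implementation.

import numpy as np

def moment_to_natural(mu, sigma2):
    # Natural parameters of N(mu, sigma2): theta1 = mu/sigma2, theta2 = -1/(2 sigma2).
    return np.concatenate([mu / sigma2, -1.0 / (2.0 * sigma2)])

def natural_to_moment(theta, n):
    theta1, theta2 = theta[:n], theta[n:]
    sigma2 = -1.0 / (2.0 * theta2)
    return theta1 * sigma2, sigma2

def gass_maximize(H, n, num_iters=200, N=100, alpha0=1.0, eps=1e-3, rho=0.2, seed=0):
    """Maximize a black-box function H: R^n -> R with a GASS-style update."""
    rng = np.random.default_rng(seed)
    mu, sigma2 = np.zeros(n), 25.0 * np.ones(n)              # initial sampling distribution
    best_x, best_val = None, -np.inf
    for k in range(num_iters):
        alpha_k = alpha0 / (k + 10) ** 0.6                   # diminishing step size
        X = mu + np.sqrt(sigma2) * rng.standard_normal((N, n))   # x_k^i ~ f(.; theta_k)
        vals = np.array([H(x) for x in X])
        i_best = int(np.argmax(vals))
        if vals[i_best] > best_val:
            best_val, best_x = vals[i_best], X[i_best].copy()
        # Shape function S_theta: keep the top rho-quantile of the current samples.
        gamma = np.quantile(vals, 1.0 - rho)
        S = (vals >= gamma).astype(float)
        w = S / S.sum()                                      # weights of p_k (f cancels: samples come from f)
        T = np.concatenate([X, X ** 2], axis=1)              # sufficient statistics, shape (N, 2n)
        E_p_T = w @ T                                        # estimate of E_{p_k}[T(X)]
        E_theta_T = np.concatenate([mu, mu ** 2 + sigma2])   # exact E_{theta_k}[T(X)]
        Var_hat = np.cov(T, rowvar=False)                    # estimate of Var_{theta_k}[T(X)]
        # Newton-like update in the natural-parameter space.
        theta = moment_to_natural(mu, sigma2)
        step = np.linalg.solve(Var_hat + eps * np.eye(2 * n), E_p_T - E_theta_T)
        theta = theta + alpha_k * step
        theta[n:] = np.minimum(theta[n:], -1e-6)             # keep sigma2 > 0
        mu, sigma2 = natural_to_moment(theta, n)
    return best_x, best_val

Because the candidate solutions are drawn from $f(\cdot;\theta_k)$ itself, the density cancels in the self-normalized estimate of $\widehat{\mathbb{E}}_{p_k}[T(X)]$, so the weights reduce to the normalized shape-function values.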

Accelerated algorithm: GASS_avg. GASS can be viewed as a stochastic approximation algorithm for finding $\theta^*$. The accelerated version uses Polyak (iterate) averaging with online feedback:

$$\theta_{k+1} = \theta_k + \alpha_k\left(\widehat{\mathrm{Var}}_{\theta_k}[T(X)] + \epsilon I\right)^{-1}\left(\widehat{\mathbb{E}}_{p_k}[T(X)] - \mathbb{E}_{\theta_k}[T(X)]\right) + \alpha_k\, c\,(\bar{\theta}_k - \theta_k), \qquad \bar{\theta}_k = \frac{1}{k}\sum_{i=1}^{k} \theta_i.$$
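
A one-function sketch of how the averaging feedback changes the parameter update relative to the earlier GASS sketch; the feedback constant c and these names are illustrative assumptions.

import numpy as np

def gass_avg_update(theta, theta_bar, step, alpha_k, c, k):
    """theta_bar is the running average of the first k iterates theta_1, ..., theta_k;
    `step` is the preconditioned gradient estimate from the GASS sketch above."""
    theta_new = theta + alpha_k * step + alpha_k * c * (theta_bar - theta)   # GASS step + averaging feedback
    theta_bar_new = (k * theta_bar + theta_new) / (k + 1)                    # update the running average
    return theta_new, theta_bar_new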

Convergence analysis. The update of $\theta$ can be rewritten as a generalized Robbins-Monro iteration

$$\theta_{k+1} = \theta_k + \alpha_k\left[D(\theta_k) + b_k + \xi_k\right],$$

where $D(\theta_k) = (\mathrm{Var}_{\theta_k}[T(X)] + \epsilon I)^{-1}\, \nabla_\theta l(\theta;\theta_k)\big|_{\theta=\theta_k}$ is the gradient field, $b_k$ is the bias term, and $\xi_k$ is the noise term. It can be viewed as a noisy discretization of the ordinary differential equation (ODE)

$$\dot{\theta}_t = D(\theta_t), \qquad t \ge 0.$$

For the iteration $\theta_{k+1} = \theta_k + \alpha_k[D(\theta_k) + b_k + \xi_k]$ and the ODE $\dot{\theta}_t = D(\theta_t)$, $t \ge 0$:

Assumption: $\alpha_k \to 0$ as $k \to \infty$, and $\sum_{k=0}^{\infty} \alpha_k = \infty$.

Lemma 1: Under certain assumptions, $b_k \to 0$ w.p.1 as $k \to \infty$.

Lemma 2: Under certain assumptions, for any $T > 0$,

$$\lim_{k \to \infty}\ \sup_{\{n:\, 0 \le \sum_{i=k}^{n-1} \alpha_i \le T\}} \left\| \sum_{i=k}^{n} \alpha_i\, \xi_i \right\| = 0, \qquad \text{w.p.1.}$$

Convergence results. Theorem (Asymptotic Convergence): assume that $D(\theta_t)$ is continuous with a unique integral curve and that some regularity conditions hold. Then the sequence $\{\theta_k\}$ converges to a limit set of the ODE w.p.1. Furthermore, if the limit sets of the ODE are isolated equilibrium points, then w.p.1 $\{\theta_k\}$ converges to a unique equilibrium point. Implication: GASS converges to a stationary point of $l(\theta;\theta')$.

Theorem (Asymptotic Convergence Rate): let $\alpha_k = \alpha_0/k^{\alpha}$ for $0 < \alpha < 1$, and for a given constant $\tau > 2\alpha$ let $N_k = \Theta(k^{\tau - \alpha})$. Assume the sequence $\{\theta_k\}$ converges w.p.1 to a unique equilibrium point $\theta^*$. If Assumptions 1, 2, and 3 hold, then

$$k^{\tau/2}\,(\theta_k - \theta^*) \xrightarrow{\ dist\ } N\!\left(0,\, Q M Q^T\right),$$

where $Q$ is an orthogonal matrix such that $Q^T\!\left(-J_L(\theta^*)\right) Q = \Lambda$ with $\Lambda$ a diagonal matrix, and the $(i,j)$-th entry of $M$ is given by $M_{(i,j)} = (Q^T \Phi \Sigma \Phi^T Q)_{(i,j)}\,(\Lambda_{(i,i)} + \Lambda_{(j,j)})^{-1}$. Implication: the asymptotic convergence rate of GASS is $O(1/\sqrt{k^{\tau}})$.

Numerical results. [Figure: comparison of the average performance of GASS, GASS_avg, MRAS, and the modified CE on the 2-D Dejong5, 4-D Shekel, Powell, and Rosenbrock functions; each panel plots function value against total sample size, with the true optimum shown for reference.]

Numerical results. [Figure: comparison of the average performance of GASS, GASS_avg, MRAS, and the modified CE on the Griewank, 50-D Trigonometric, Pinter, and 50-D Sphere functions; function value versus total sample size.]

Numerical results. [Figure: average performance of GASS and GASS_avg on 200-dimensional benchmark problems (Griewank, Trigonometric, Levy, and Sphere); function value versus total sample size.]

Numerical results. GASS_avg and GASS find $\epsilon$-optimal solutions in all runs for 7 out of the 8 benchmark problems (all except the Shekel function). Accuracy: GASS_avg and GASS find better solutions than the modified Cross-Entropy method (Rubinstein 1998) on badly-scaled functions, are comparable to it on multi-modal functions, and outperform Model Reference Adaptive Search (Hu et al. 2007) on all problems. Convergence speed: GASS_avg always converges faster than GASS; both are faster than MRAS on all problems and faster than the modified CE on most problems.

Resource allocation in communication networks. Q users may transmit or receive signals over N carriers, subject to a power budget $B_q$ for the $q$-th user. The objective is to maximize the total transmission rate (sum-rate) by optimally allocating each user's power budget across the carriers.

Resource allocation in communication networks:

$$\max_{p_q(k),\ \forall q,k} \ \sum_{q=1}^{Q} \sum_{k=1}^{N} \log\left(1 + \frac{|H_{qq}(k)|^2\, p_q(k)}{N_0 + \sum_{r=1, r \neq q}^{Q} |H_{rq}(k)|^2\, p_r(k)}\right)$$

subject to

$$\sum_{k=1}^{N} p_q(k) \le B_q, \quad q = 1, \ldots, Q, \qquad p_q(k) \ge 0, \quad q = 1, \ldots, Q,\ k = 1, \ldots, N.$$

The sampling distribution $f(\cdot;\theta)$ is chosen to be the Dirichlet distribution, whose support is a multi-dimensional simplex.
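
A small sketch of what sampling candidate power allocations from a Dirichlet distribution can look like; appending a slack coordinate for unused power is one way (an assumption here, not necessarily the talk's construction) to handle the inequality budget constraint, and all names below are hypothetical.

import numpy as np

def sample_power_allocations(alpha, budgets, num_samples, rng=None):
    """alpha: (Q, N+1) Dirichlet parameters per user (last coordinate = unused power);
    budgets: (Q,) power budgets B_q. Returns allocations of shape (num_samples, Q, N)."""
    rng = np.random.default_rng() if rng is None else rng
    Q, Np1 = alpha.shape
    out = np.empty((num_samples, Q, Np1 - 1))
    for q in range(Q):
        simplex = rng.dirichlet(alpha[q], size=num_samples)   # each row sums to 1, entries >= 0
        out[:, q, :] = budgets[q] * simplex[:, :-1]           # scale by B_q, drop the slack coordinate
    return out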

Resource allocation in communication networks. [Figure: numerical results on resource allocation in communication networks; IWFA, DDPA, MADP, and GPA are distributed algorithms, and the other algorithms are multi-start runs of NEOS solvers.]

Discrete optimization: $\mathcal{X}$ is a discrete set.
Discrete-GASS uses a discrete distribution: sampling is easy, but the parameter is high-dimensional.
Annealing-GASS uses a Boltzmann distribution: the parameter is always one-dimensional, but sampling (by MCMC) is more expensive and inexact.
Annealing-GASS converges to the set of optimal solutions in probability.
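
A minimal Metropolis sketch of the sampling step in an Annealing-GASS-style method: drawing approximate samples from a Boltzmann distribution $p_T(x) \propto \exp(H(x)/T)$ over a discrete set, here $\{0, \ldots, m-1\}^n$ with single-coordinate proposals. The proposal scheme, burn-in length, and names are illustrative assumptions; only the sampling step is shown, not the update of the temperature parameter.

import numpy as np

def boltzmann_mcmc_samples(H, n, m, T, num_samples, burn_in=500, rng=None):
    """Approximate samples from p_T(x) proportional to exp(H(x)/T) on {0,...,m-1}^n."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(0, m, size=n)                     # arbitrary starting point
    samples = []
    for step in range(burn_in + num_samples):
        y = x.copy()
        j = rng.integers(n)
        y[j] = rng.integers(m)                         # symmetric single-coordinate proposal
        if np.log(rng.random()) < (H(y) - H(x)) / T:   # Metropolis acceptance: larger H is favored
            x = y
        if step >= burn_in:
            samples.append(x.copy())
    return np.array(samples)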

Numerical results. [Figure: average performance of Discrete-GASS, Annealing-GASS, MRAS, SA (geometric temperature), and SA (logarithmic temperature) on the 4-D Shekel, 50-D Powell, 50-D Rosenbrock, and 50-D Griewank functions; function value versus total sample size.]

Numerical results. [Figure: average performance of Discrete-GASS, Annealing-GASS, MRAS, SA (geometric temperature), and SA (logarithmic temperature) on the 50-D Rastrigin, 50-D Levy, 50-D Pinter, and 50-D Sphere functions; function value versus total sample size.]

Numerical results. Discrete-GASS outperforms MRAS in both accuracy and convergence rate. Annealing-GASS improves on multi-start simulated annealing with geometric and logarithmic temperature schedules. Discrete-GASS provides accurate solutions on most of the problems, whereas Annealing-GASS yields accurate solutions only on the low-dimensional problem and the badly-scaled problems. Discrete-GASS usually needs more computation time per iteration than Annealing-GASS, but needs fewer iterations to converge.

Implementation & Software. Most tuning parameters can be set to their defaults; the step size sequence $\{\alpha_k\}$ needs to be chosen carefully. Choice of sampling distribution: if $\mathcal{X}$ is a continuous set, use a (truncated) Gaussian; if $\mathcal{X}$ is a simplex (with or without interior), use a Dirichlet; if $\mathcal{X}$ is a discrete set, use a discrete or Boltzmann distribution. Software available at

Outline
1. Introduction
2. GASS for non-differentiable optimization
3. GASS for simulation optimization
4. Conclusions

Simulation optimization: introduction. Simulation optimization:

$$\max_{x \in \mathcal{X}} \ H(x) \triangleq \mathbb{E}_{\xi_x}[h(x, \xi_x)],$$

where a computer simulation of a complex system takes a design $x$ and returns a noisy output $y \sim h(x, \xi_x)$. Example: a queueing system ($x$: service rate; $H$: waiting time + staffing cost; $\xi_x$: arrival/service times). $\mathcal{X}$ is a continuous set.

Simulation optimization: introduction. Main solution methods:
- Ranking & selection (for problems with a finite solution space)
- Stochastic approximation
- Response surface methods
- Sample average approximation
- Stochastic search methods

GASS for simulation optimization.
1) Initialization.
2) Sampling: draw samples $x_k^i \overset{iid}{\sim} f(x;\theta_k)$, $i = 1, 2, \ldots, N_k$.
3) Estimation: simulate each $x_k^i$ for $M_k$ times; estimate $\hat{H}(x_k^i) = \frac{1}{M_k}\sum_{j=1}^{M_k} h(x_k^i, \xi_k^{i,j})$.
4) Updating: update the parameter $\theta$ according to
$$\theta_{k+1} = \theta_k + \alpha_k\left(\widehat{\mathrm{Var}}_{\theta_k}[T(X)] + \epsilon I\right)^{-1}\left(\widehat{\mathbb{E}}_{p_k}[T(X)] - \mathbb{E}_{\theta_k}[T(X)]\right),$$
where $\widehat{\mathrm{Var}}_{\theta_k}[T(X)]$ and $\widehat{\mathbb{E}}_{p_k}[T(X)]$ are estimates computed from $\{x_k^i\}$ and $\{\hat{H}(x_k^i)\}$.
5) Stopping.
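
The only change relative to the deterministic setting is the estimation step. A small sketch of that step, where `simulate` is a hypothetical single-replication simulator and the names are assumptions:

import numpy as np

def estimate_H(simulate, X, M_k, rng=None):
    """simulate(x, rng) -> one noisy observation h(x, xi); X: array of shape (N, n).
    Returns the Monte Carlo estimates of H(x_k^i) used in place of exact H values."""
    rng = np.random.default_rng() if rng is None else rng
    return np.array([np.mean([simulate(x, rng) for _ in range(M_k)]) for x in X])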

Two-timescale GASS (GASS_2T), motivated by two-timescale stochastic approximation (Borkar 1997). Assume $\alpha_k \to 0$, $\beta_k \to 0$, and $\beta_k/\alpha_k \to 0$. Draw samples $x_k^i \overset{iid}{\sim} f(x;\theta_k)$, $i = 1, \ldots, N$, and run the computer simulation once for each $x_k^i$. Update the gradient and Hessian estimates in GASS on the fast timescale with step size $\alpha_k$, and update $\theta_k$ on the slow timescale with step size $\beta_k$. Intuition: the sampling distribution can be viewed as fixed while the gradient and Hessian estimates are updated over many iterations, so only a small sample size $N$ is needed.
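
A schematic sketch of the two-timescale idea, simplified to the gradient estimate only; the recursion form, the names, and the way the smoothed estimate feeds the parameter update are assumptions on my part:

def two_timescale_step(theta, g_avg, g_new, alpha_k, beta_k):
    """g_new: the noisy preconditioned gradient estimate from the current small sample."""
    g_avg = g_avg + alpha_k * (g_new - g_avg)   # fast timescale: smooth the noisy estimate
    theta = theta + beta_k * g_avg              # slow timescale: move the sampling-distribution parameter
    return theta, g_avg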

Numerical results. [Figure: average performance of GASS, GASS_2T, and CEOCBA (He et al. 2010) on problems with independent noise (50-D Powell, Trigonometric, 10-D Pinter, and 10-D Rastrigin functions); function value versus number of function evaluations.]

Outline
1. Introduction
2. GASS for non-differentiable optimization
3. GASS for simulation optimization
4. Conclusions

Conclusions. By reformulating a hard optimization problem into a differentiable one, we can incorporate direct gradient search into stochastic search. This yields a class of gradient-based adaptive stochastic search (GASS) algorithms for non-differentiable optimization, black-box optimization, and simulation optimization problems. The convergence results and numerical results show that GASS is a promising and competitive method.

References.
E. Zhou and J. Hu, "Gradient-based Adaptive Stochastic Search for Non-differentiable Optimization," IEEE Transactions on Automatic Control, 59(7), 2014.
E. Zhou and S. Bhatnagar, "Gradient-based Adaptive Stochastic Search for Simulation Optimization over Continuous Space," submitted.

Thank you!
