Gradient-based Adaptive Stochastic Search


Enlu Zhou, H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. November 5, 2014.

Outline
1. Introduction
2. GASS for non-differentiable optimization
3. GASS for simulation optimization
4. Conclusions

Introduction. We consider $x^* \in \arg\max_{x \in \mathcal{X}} H(x)$, where, given any $x \in \mathcal{X}$, $H(x)$ can be evaluated exactly. We are interested in objective functions that lack structural properties (such as convexity and differentiability), have multiple local optima, and can only be assessed by black-box evaluation.

Examples of Objective Functions.

Background. Stochastic search methods use a randomized mechanism to generate a sequence of iterates, e.g., simulated annealing (Kirkpatrick et al. 1983), genetic algorithms (Goldberg 1989), tabu search (Glover 1990), the nested partitions method (Shi and Ólafsson 2000), pure adaptive search (Zabinsky 2003), sequential Monte Carlo simulated annealing (Zhou and Chen 2011), and model-based algorithms (survey by Zlochin et al. 2004). Model-based algorithms generate candidate solutions from a sampling distribution (i.e., a probabilistic model), e.g., ant colony optimization (Dorigo and Gambardella 1997), annealing adaptive search (Romeijn and Smith 1994), estimation of distribution algorithms (Mühlenbein and Paaß 1996), the cross-entropy method (Rubinstein 1997), and model reference adaptive search (Hu et al. 2007).

Model-based optimization. [Figure 1: Optimization via model-based methods; the panels show the sampling distribution over the solution space $\mathcal{X}$ and the objective $H(x)$.]

Outline
1. Introduction
2. GASS for non-differentiable optimization
3. GASS for simulation optimization
4. Conclusions

Reformulation. Original problem:

$$x^* \in \arg\max_{x \in \mathcal{X}} H(x), \qquad \mathcal{X} \subseteq \mathbb{R}^n.$$

Let $\{f(x;\theta)\}$ be a parameterized family of probability density functions on $\mathcal{X}$. Then

$$\int H(x)\, f(x;\theta)\, dx \;\le\; H(x^*) \triangleq H^*, \qquad \forall\, \theta \in \mathbb{R}^d,$$

and equality is achieved if and only if there exists $\theta^*$ such that the probability mass of $f(x;\theta^*)$ is concentrated on a subset of the optimal solutions. New problem:

$$\theta^* \in \arg\max_{\theta \in \mathbb{R}^d} \int H(x)\, f(x;\theta)\, dx.$$

Why reformulation? Possible scenarios: the original problem $\arg\max_{x \in \mathcal{X}} H(x)$ may be discrete in $x$ and non-differentiable in $x$, whereas the new problem $\arg\max_{\theta} \int H(x) f(x;\theta)\, dx$ is continuous in $\theta$ and differentiable in $\theta$. This lets us incorporate model-based optimization into gradient-based optimization: 1) generate candidate solutions from $f(\cdot;\theta)$ on the solution space $\mathcal{X}$; 2) use a gradient-based method to update the parameter $\theta$. The approach combines the robustness of model-based optimization with the relatively fast convergence of gradient-based optimization.

More reformulation. For an arbitrary but fixed $\theta' \in \mathbb{R}^d$, define the function

$$l(\theta;\theta') \triangleq \ln\left(\int S_{\theta'}(H(x))\, f(x;\theta)\, dx\right).$$

The shape function $S_{\theta'}(\cdot): \mathbb{R} \to \mathbb{R}^+$ is chosen to ensure $0 < l(\theta;\theta') \le \ln\left(S_{\theta'}(H^*)\right)$ for all $\theta$, where equality is achieved if there exists a $\theta^*$ such that the probability mass of $f(x;\theta^*)$ is concentrated on a subset of the global optima. So consider $\max_{\theta}\, l(\theta;\theta')$.
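
For intuition, one simple illustrative choice of shape function (an assumption on my part, not necessarily the one used in the talk) is an exponential weighting $S_{\theta'}(y) = \exp(y/\lambda)$ with $\lambda > 0$, or a quantile-based weighting $S_{\theta'}(y) = \mathbb{1}\{y \ge \gamma_{\theta'}\}$ with $\gamma_{\theta'}$ the $(1-\rho)$-quantile of $H(X)$ under $X \sim f(\cdot;\theta')$. Both assign larger weight to better solutions, which is the essential property: maximizing $l(\theta;\theta')$ then pushes $f(\cdot;\theta)$ toward regions where $H$ is large.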

Parameter updating. Suppose $\{f(\cdot;\theta)\}$ is an exponential family of densities, i.e.,

$$f(x;\theta) = \exp\{\theta^T T(x) - \phi(\theta)\}, \qquad \phi(\theta) = \ln\left\{\int \exp\left(\theta^T T(x)\right) dx\right\}.$$

Then

$$\nabla_\theta\, l(\theta;\theta')\big|_{\theta=\theta'} = \mathbb{E}_{p(\cdot;\theta')}[T(X)] - \mathbb{E}_{\theta'}[T(X)], \qquad \nabla_\theta^2\, l(\theta;\theta')\big|_{\theta=\theta'} = \mathrm{Var}_{p(\cdot;\theta')}[T(X)] - \mathrm{Var}_{\theta'}[T(X)],$$

where

$$p(x;\theta') \triangleq \frac{S_{\theta'}(H(x))\, f(x;\theta')}{\int S_{\theta'}(H(x))\, f(x;\theta')\, dx}.$$

This suggests a Newton-like scheme for updating $\theta$:

$$\theta_{k+1} = \theta_k + \alpha_k \left(\mathrm{Var}_{\theta_k}[T(X)] + \epsilon I\right)^{-1}\left(\mathbb{E}_{p(\cdot;\theta_k)}[T(X)] - \mathbb{E}_{\theta_k}[T(X)]\right), \qquad \alpha_k > 0,\ \epsilon > 0.$$
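
As a concrete worked instance (my example, consistent with the formulas above): for a univariate Gaussian family $N(\mu, \sigma^2)$, the sufficient statistic is $T(x) = (x, x^2)^T$, and the exact quantities entering the update are

$$\mathbb{E}_\theta[T(X)] = \begin{pmatrix} \mu \\ \mu^2 + \sigma^2 \end{pmatrix}, \qquad \mathrm{Var}_\theta[T(X)] = \begin{pmatrix} \sigma^2 & 2\mu\sigma^2 \\ 2\mu\sigma^2 & 2\sigma^4 + 4\mu^2\sigma^2 \end{pmatrix},$$

while $\mathbb{E}_{p(\cdot;\theta')}[T(X)]$ is the mean of $(X, X^2)$ under the reweighted density $p(\cdot;\theta')$; the update therefore moves the Gaussian's first two moments toward those of the elite-weighted samples.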

In the Newton-like scheme above, $\mathrm{Var}_{\theta_k}[T(X)] = \mathbb{E}\!\left[\nabla_\theta \ln f(X;\theta_k)\, \nabla_\theta \ln f(X;\theta_k)^T\right]$ is the Fisher information matrix, which leads to the following facts: $\mathrm{Var}_{\theta_k}[T(X)]^{-1}$ is the minimum-variance step size in stochastic approximation; $\mathrm{Var}_{\theta_k}[T(X)]^{-1}$ adapts the gradient step to our belief about promising regions (think about $T(X) = X$...); and $\mathrm{Var}_{\theta_k}[T(X)]^{-1}\, \nabla_\theta l(\theta;\theta_k)\big|_{\theta=\theta_k}$ is the gradient of $l(\theta;\theta_k)$ on the statistical manifold equipped with the Fisher metric.

Main algorithm: Gradient-based Adaptive Stochastic Search (GASS).
1) Initialization: set $k = 0$.
2) Sampling: draw samples $x_k^i \overset{iid}{\sim} f(x;\theta_k)$, $i = 1, 2, \ldots, N_k$.
3) Updating: update the parameter $\theta$ according to
$$\theta_{k+1} = \theta_k + \alpha_k\left(\widehat{\mathrm{Var}}_{\theta_k}[T(X)] + \epsilon I\right)^{-1}\left(\widehat{\mathbb{E}}_{p_k}[T(X)] - \mathbb{E}_{\theta_k}[T(X)]\right),$$
where $\widehat{\mathrm{Var}}_{\theta_k}[T(X)]$ and $\widehat{\mathbb{E}}_{p_k}[T(X)]$ are estimates computed from the samples $\{x_k^i,\ i = 1, \ldots, N_k\}$.
4) Stopping: if a stopping criterion is satisfied, stop and return the current best sampled solution; else set $k := k + 1$ and go back to step 2).
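
To make the loop concrete, here is a minimal Python sketch of a GASS-style iteration with an independent (diagonal) Gaussian sampling family on $\mathbb{R}^n$, where $T(x) = (x, x^2)$ componentwise. The quantile-threshold shape function, the step-size schedule, and all names (gass_maximize, rho, etc.) are illustrative assumptions, not the reference implementation.

import numpy as np

def moment_to_natural(mu, sigma2):
    # Natural parameters of N(mu, sigma2): theta1 = mu/sigma2, theta2 = -1/(2 sigma2).
    return np.concatenate([mu / sigma2, -1.0 / (2.0 * sigma2)])

def natural_to_moment(theta, n):
    theta1, theta2 = theta[:n], theta[n:]
    sigma2 = -1.0 / (2.0 * theta2)
    return theta1 * sigma2, sigma2

def gass_maximize(H, n, num_iters=200, N=100, alpha0=1.0, eps=1e-3, rho=0.2, seed=0):
    """Maximize a black-box function H: R^n -> R with a GASS-style update."""
    rng = np.random.default_rng(seed)
    mu, sigma2 = np.zeros(n), 25.0 * np.ones(n)              # initial sampling distribution
    best_x, best_val = None, -np.inf
    for k in range(num_iters):
        alpha_k = alpha0 / (k + 10) ** 0.6                   # diminishing step size
        X = mu + np.sqrt(sigma2) * rng.standard_normal((N, n))   # x_k^i ~ f(.; theta_k)
        vals = np.array([H(x) for x in X])
        i_best = int(np.argmax(vals))
        if vals[i_best] > best_val:
            best_val, best_x = vals[i_best], X[i_best].copy()
        # Shape function S_theta: keep the top rho-quantile of the current samples.
        gamma = np.quantile(vals, 1.0 - rho)
        S = (vals >= gamma).astype(float)
        w = S / S.sum()                                      # weights of p_k (f cancels: samples come from f)
        T = np.concatenate([X, X ** 2], axis=1)              # sufficient statistics, shape (N, 2n)
        E_p_T = w @ T                                        # estimate of E_{p_k}[T(X)]
        E_theta_T = np.concatenate([mu, mu ** 2 + sigma2])   # exact E_{theta_k}[T(X)]
        Var_hat = np.cov(T, rowvar=False)                    # estimate of Var_{theta_k}[T(X)]
        # Newton-like update in the natural-parameter space.
        theta = moment_to_natural(mu, sigma2)
        step = np.linalg.solve(Var_hat + eps * np.eye(2 * n), E_p_T - E_theta_T)
        theta = theta + alpha_k * step
        theta[n:] = np.minimum(theta[n:], -1e-6)             # keep sigma2 > 0
        mu, sigma2 = natural_to_moment(theta, n)
    return best_x, best_val

Because the candidate solutions are drawn from $f(\cdot;\theta_k)$ itself, the density cancels in the self-normalized estimate of $\widehat{\mathbb{E}}_{p_k}[T(X)]$, so the weights reduce to the normalized shape-function values.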

Accelerated algorithm: GASS_avg. GASS can be viewed as a stochastic approximation algorithm for finding $\theta^*$. The accelerated version uses Polyak (iterate) averaging with online feedback:

$$\theta_{k+1} = \theta_k + \alpha_k\left(\widehat{\mathrm{Var}}_{\theta_k}[T(X)] + \epsilon I\right)^{-1}\left(\widehat{\mathbb{E}}_{p_k}[T(X)] - \mathbb{E}_{\theta_k}[T(X)]\right) + \alpha_k\, c\,(\bar{\theta}_k - \theta_k), \qquad \bar{\theta}_k = \frac{1}{k}\sum_{i=1}^{k} \theta_i.$$
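
A one-function sketch of how the averaging feedback changes the parameter update relative to the earlier GASS sketch; the feedback constant c and these names are illustrative assumptions.

import numpy as np

def gass_avg_update(theta, theta_bar, step, alpha_k, c, k):
    """theta_bar is the running average of the first k iterates theta_1, ..., theta_k;
    `step` is the preconditioned gradient estimate from the GASS sketch above."""
    theta_new = theta + alpha_k * step + alpha_k * c * (theta_bar - theta)   # GASS step + averaging feedback
    theta_bar_new = (k * theta_bar + theta_new) / (k + 1)                    # update the running average
    return theta_new, theta_bar_new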

Convergence analysis. The update of $\theta$ can be rewritten as a generalized Robbins-Monro iteration

$$\theta_{k+1} = \theta_k + \alpha_k\left[D(\theta_k) + b_k + \xi_k\right],$$

where $D(\theta_k) = (\mathrm{Var}_{\theta_k}[T(X)] + \epsilon I)^{-1}\, \nabla_\theta l(\theta;\theta_k)\big|_{\theta=\theta_k}$ is the gradient field, $b_k$ is the bias term, and $\xi_k$ is the noise term. It can be viewed as a noisy discretization of the ordinary differential equation (ODE)

$$\dot{\theta}_t = D(\theta_t), \qquad t \ge 0.$$

For the iteration $\theta_{k+1} = \theta_k + \alpha_k[D(\theta_k) + b_k + \xi_k]$ and the ODE $\dot{\theta}_t = D(\theta_t)$, $t \ge 0$:

Assumption: $\alpha_k \to 0$ as $k \to \infty$, and $\sum_{k=0}^{\infty} \alpha_k = \infty$.

Lemma 1: Under certain assumptions, $b_k \to 0$ w.p.1 as $k \to \infty$.

Lemma 2: Under certain assumptions, for any $T > 0$,

$$\lim_{k \to \infty}\ \sup_{\{n:\, 0 \le \sum_{i=k}^{n-1} \alpha_i \le T\}} \left\| \sum_{i=k}^{n} \alpha_i\, \xi_i \right\| = 0, \qquad \text{w.p.1.}$$

Convergence results. Theorem (Asymptotic Convergence): assume that $D(\theta_t)$ is continuous with a unique integral curve and that some regularity conditions hold. Then the sequence $\{\theta_k\}$ converges to a limit set of the ODE w.p.1. Furthermore, if the limit sets of the ODE are isolated equilibrium points, then w.p.1 $\{\theta_k\}$ converges to a unique equilibrium point. Implication: GASS converges to a stationary point of $l(\theta;\theta')$.

Theorem (Asymptotic Convergence Rate): let $\alpha_k = \alpha_0/k^{\alpha}$ for $0 < \alpha < 1$, and for a given constant $\tau > 2\alpha$ let $N_k = \Theta(k^{\tau - \alpha})$. Assume the sequence $\{\theta_k\}$ converges w.p.1 to a unique equilibrium point $\theta^*$. If Assumptions 1, 2, and 3 hold, then

$$k^{\tau/2}\,(\theta_k - \theta^*) \xrightarrow{\ dist\ } N\!\left(0,\, Q M Q^T\right),$$

where $Q$ is an orthogonal matrix such that $Q^T\!\left(-J_L(\theta^*)\right) Q = \Lambda$ with $\Lambda$ a diagonal matrix, and the $(i,j)$-th entry of $M$ is given by $M_{(i,j)} = (Q^T \Phi \Sigma \Phi^T Q)_{(i,j)}\,(\Lambda_{(i,i)} + \Lambda_{(j,j)})^{-1}$. Implication: the asymptotic convergence rate of GASS is $O(1/\sqrt{k^{\tau}})$.

Numerical results. [Figure: comparison of the average performance of GASS, GASS_avg, MRAS, and the modified CE on the 2-D Dejong5, 4-D Shekel, Powell, and Rosenbrock functions; each panel plots function value against total sample size, with the true optimum shown for reference.]

Numerical results. [Figure: comparison of the average performance of GASS, GASS_avg, MRAS, and the modified CE on the Griewank, 50-D Trigonometric, Pinter, and 50-D Sphere functions; function value versus total sample size.]

Numerical results. [Figure: average performance of GASS and GASS_avg on 200-dimensional benchmark problems (Griewank, Trigonometric, Levy, and Sphere); function value versus total sample size.]

Numerical results. GASS_avg and GASS find $\epsilon$-optimal solutions in all runs for 7 out of the 8 benchmark problems (all except the Shekel function). Accuracy: GASS_avg and GASS find better solutions than the modified Cross-Entropy method (Rubinstein 1998) on badly-scaled functions, are comparable to it on multi-modal functions, and outperform Model Reference Adaptive Search (Hu et al. 2007) on all problems. Convergence speed: GASS_avg always converges faster than GASS; both are faster than MRAS on all problems and faster than the modified CE on most problems.

Resource allocation in communication networks. Q users may transmit or receive signals over N carriers, subject to a power budget $B_q$ for the $q$-th user. The objective is to maximize the total transmission rate (sum-rate) by optimally allocating each user's power budget across the carriers.

Resource allocation in communication networks:

$$\max_{p_q(k),\ \forall q,k} \ \sum_{q=1}^{Q} \sum_{k=1}^{N} \log\left(1 + \frac{|H_{qq}(k)|^2\, p_q(k)}{N_0 + \sum_{r=1, r \neq q}^{Q} |H_{rq}(k)|^2\, p_r(k)}\right)$$

subject to

$$\sum_{k=1}^{N} p_q(k) \le B_q, \quad q = 1, \ldots, Q, \qquad p_q(k) \ge 0, \quad q = 1, \ldots, Q,\ k = 1, \ldots, N.$$

The sampling distribution $f(\cdot;\theta)$ is chosen to be the Dirichlet distribution, whose support is a multi-dimensional simplex.
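
A small sketch of what sampling candidate power allocations from a Dirichlet distribution can look like; appending a slack coordinate for unused power is one way (an assumption here, not necessarily the talk's construction) to handle the inequality budget constraint, and all names below are hypothetical.

import numpy as np

def sample_power_allocations(alpha, budgets, num_samples, rng=None):
    """alpha: (Q, N+1) Dirichlet parameters per user (last coordinate = unused power);
    budgets: (Q,) power budgets B_q. Returns allocations of shape (num_samples, Q, N)."""
    rng = np.random.default_rng() if rng is None else rng
    Q, Np1 = alpha.shape
    out = np.empty((num_samples, Q, Np1 - 1))
    for q in range(Q):
        simplex = rng.dirichlet(alpha[q], size=num_samples)   # each row sums to 1, entries >= 0
        out[:, q, :] = budgets[q] * simplex[:, :-1]           # scale by B_q, drop the slack coordinate
    return out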

Resource allocation in communication networks. [Figure: numerical results on resource allocation in communication networks; IWFA, DDPA, MADP, and GPA are distributed algorithms, and the other algorithms are multi-start runs of NEOS solvers.]

Discrete optimization: $\mathcal{X}$ is a discrete set.
Discrete-GASS uses a discrete distribution: sampling is easy, but the parameter is high-dimensional.
Annealing-GASS uses a Boltzmann distribution: the parameter is always one-dimensional, but sampling (by MCMC) is more expensive and inexact.
Annealing-GASS converges to the set of optimal solutions in probability.
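
A minimal Metropolis sketch of the sampling step in an Annealing-GASS-style method: drawing approximate samples from a Boltzmann distribution $p_T(x) \propto \exp(H(x)/T)$ over a discrete set, here $\{0, \ldots, m-1\}^n$ with single-coordinate proposals. The proposal scheme, burn-in length, and names are illustrative assumptions; only the sampling step is shown, not the update of the temperature parameter.

import numpy as np

def boltzmann_mcmc_samples(H, n, m, T, num_samples, burn_in=500, rng=None):
    """Approximate samples from p_T(x) proportional to exp(H(x)/T) on {0,...,m-1}^n."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.integers(0, m, size=n)                     # arbitrary starting point
    samples = []
    for step in range(burn_in + num_samples):
        y = x.copy()
        j = rng.integers(n)
        y[j] = rng.integers(m)                         # symmetric single-coordinate proposal
        if np.log(rng.random()) < (H(y) - H(x)) / T:   # Metropolis acceptance: larger H is favored
            x = y
        if step >= burn_in:
            samples.append(x.copy())
    return np.array(samples)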

Numerical results. [Figure: average performance of Discrete-GASS, Annealing-GASS, MRAS, SA (geometric temperature), and SA (logarithmic temperature) on the 4-D Shekel, 50-D Powell, 50-D Rosenbrock, and 50-D Griewank functions; function value versus total sample size.]

Numerical results. [Figure: average performance of Discrete-GASS, Annealing-GASS, MRAS, SA (geometric temperature), and SA (logarithmic temperature) on the 50-D Rastrigin, 50-D Levy, 50-D Pinter, and 50-D Sphere functions; function value versus total sample size.]

Numerical results. Discrete-GASS outperforms MRAS in both accuracy and convergence rate. Annealing-GASS improves on multi-start simulated annealing with geometric and logarithmic temperature schedules. Discrete-GASS provides accurate solutions on most of the problems, whereas Annealing-GASS yields accurate solutions only on the low-dimensional problem and the badly-scaled problems. Discrete-GASS usually needs more computation time per iteration than Annealing-GASS, but needs fewer iterations to converge.

Implementation & Software. Most tuning parameters can be set to their defaults; the step size sequence $\{\alpha_k\}$ needs to be chosen carefully. Choice of sampling distribution: if $\mathcal{X}$ is a continuous set, use a (truncated) Gaussian; if $\mathcal{X}$ is a simplex (with or without interior), use a Dirichlet; if $\mathcal{X}$ is a discrete set, use a discrete or Boltzmann distribution. Software available at

Outline
1. Introduction
2. GASS for non-differentiable optimization
3. GASS for simulation optimization
4. Conclusions

Simulation optimization: introduction. Simulation optimization:

$$\max_{x \in \mathcal{X}} \ H(x) \triangleq \mathbb{E}_{\xi_x}[h(x, \xi_x)],$$

where a computer simulation of a complex system takes a design $x$ and returns a noisy output $y \sim h(x, \xi_x)$. Example: a queueing system ($x$: service rate; $H$: waiting time + staffing cost; $\xi_x$: arrival/service times). $\mathcal{X}$ is a continuous set.

Simulation optimization: introduction. Main solution methods:
- Ranking & selection (for problems with a finite solution space)
- Stochastic approximation
- Response surface methods
- Sample average approximation
- Stochastic search methods

GASS for simulation optimization.
1) Initialization.
2) Sampling: draw samples $x_k^i \overset{iid}{\sim} f(x;\theta_k)$, $i = 1, 2, \ldots, N_k$.
3) Estimation: simulate each $x_k^i$ for $M_k$ times; estimate $\hat{H}(x_k^i) = \frac{1}{M_k}\sum_{j=1}^{M_k} h(x_k^i, \xi_k^{i,j})$.
4) Updating: update the parameter $\theta$ according to
$$\theta_{k+1} = \theta_k + \alpha_k\left(\widehat{\mathrm{Var}}_{\theta_k}[T(X)] + \epsilon I\right)^{-1}\left(\widehat{\mathbb{E}}_{p_k}[T(X)] - \mathbb{E}_{\theta_k}[T(X)]\right),$$
where $\widehat{\mathrm{Var}}_{\theta_k}[T(X)]$ and $\widehat{\mathbb{E}}_{p_k}[T(X)]$ are estimates computed from $\{x_k^i\}$ and $\{\hat{H}(x_k^i)\}$.
5) Stopping.
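
The only change relative to the deterministic setting is the estimation step. A small sketch of that step, where `simulate` is a hypothetical single-replication simulator and the names are assumptions:

import numpy as np

def estimate_H(simulate, X, M_k, rng=None):
    """simulate(x, rng) -> one noisy observation h(x, xi); X: array of shape (N, n).
    Returns the Monte Carlo estimates of H(x_k^i) used in place of exact H values."""
    rng = np.random.default_rng() if rng is None else rng
    return np.array([np.mean([simulate(x, rng) for _ in range(M_k)]) for x in X])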

Two-timescale GASS (GASS_2T), motivated by two-timescale stochastic approximation (Borkar 1997). Assume $\alpha_k \to 0$, $\beta_k \to 0$, and $\beta_k/\alpha_k \to 0$. Draw samples $x_k^i \overset{iid}{\sim} f(x;\theta_k)$, $i = 1, \ldots, N$, and run the computer simulation once for each $x_k^i$. Update the gradient and Hessian estimates in GASS on the fast timescale with step size $\alpha_k$, and update $\theta_k$ on the slow timescale with step size $\beta_k$. Intuition: the sampling distribution can be viewed as fixed while the gradient and Hessian estimates are updated over many iterations, so only a small sample size $N$ is needed.
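
A schematic sketch of the two-timescale idea, simplified to the gradient estimate only; the recursion form, the names, and the way the smoothed estimate feeds the parameter update are assumptions on my part:

def two_timescale_step(theta, g_avg, g_new, alpha_k, beta_k):
    """g_new: the noisy preconditioned gradient estimate from the current small sample."""
    g_avg = g_avg + alpha_k * (g_new - g_avg)   # fast timescale: smooth the noisy estimate
    theta = theta + beta_k * g_avg              # slow timescale: move the sampling-distribution parameter
    return theta, g_avg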

Numerical results. [Figure: average performance of GASS, GASS_2T, and CEOCBA (He et al. 2010) on problems with independent noise (50-D Powell, Trigonometric, 10-D Pinter, and 10-D Rastrigin functions); function value versus number of function evaluations.]

Outline
1. Introduction
2. GASS for non-differentiable optimization
3. GASS for simulation optimization
4. Conclusions

Conclusions. By reformulating a hard optimization problem into a differentiable one, we can incorporate direct gradient search into stochastic search. This yields a class of gradient-based adaptive stochastic search (GASS) algorithms for non-differentiable optimization, black-box optimization, and simulation optimization problems. The convergence results and numerical results show that GASS is a promising and competitive method.

References.
E. Zhou and J. Hu, "Gradient-based Adaptive Stochastic Search for Non-differentiable Optimization," IEEE Transactions on Automatic Control, 59(7), 2014.
E. Zhou and S. Bhatnagar, "Gradient-based Adaptive Stochastic Search for Simulation Optimization over Continuous Space," submitted.

Thank you!
