Natural Evolution Strategies for Direct Search

Natural Evolution Strategies for Direct Search
PGMO-COPI 2014: Recent Advances on Continuous Randomized Black-Box Optimization
Thursday, October 30, 2014
Tobias Glasmachers, Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany

Introduction

Function-value-free black-box optimization:
1 Minimize f : R^d → R.
2 0-th order (direct search) setting.
3 Black-box model: x → f → f(x).
4 Don't ever rely on function values, only on comparisons f(x) < f(x'), e.g., for ranking.

Introduction

Algorithm operations:
1 generate a solution x ∈ R^d (most of the logic)
2 evaluate f(x) (black box, most of the computation time)
3 compare (rank) against other solutions: f(x) < f(x')?
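To make this interface concrete, here is a tiny Python sketch (all names are illustrative, not from the slides): the optimizer may request evaluations of the black box, which dominate the computation time, and compare the returned values, but it never sees gradients or exploits absolute function values.

```python
import numpy as np

# Minimal 0-th order black-box interface: evaluations are counted (they dominate
# the cost), and the optimizer should only use comparisons of f-values.

class BlackBox:
    def __init__(self, f):
        self._f = f
        self.evaluations = 0

    def evaluate(self, x):
        self.evaluations += 1
        return self._f(x)

    @staticmethod
    def better(fx, fy):
        # comparison f(x) < f(x'), the only way f-values should be used
        return fx < fy

box = BlackBox(lambda x: float(np.sum(x ** 2)))   # example objective
fa = box.evaluate(np.array([1.0, 2.0]))
fb = box.evaluate(np.array([0.5, 0.5]))
print(box.better(fb, fa), box.evaluations)        # True, 2
```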

Introduction: Challenges

Facing a black-box objective, anything can happen. We'd better prepare for the following challenges:
non-smooth or even discontinuous objective
multi-modality
observations of objective values perturbed by noise
high dimensionality, e.g., d ≥ 1000
black-box constraints (possibly non-smooth, ...)

Introduction: Application Examples

1 Tuning of parameters of expensive simulations, e.g., in engineering.
2 Tuning of parameters of machine learning algorithms.

Evolution Strategies

input m ∈ R^d, σ > 0, C (= I) ∈ R^{d×d}
loop
    sample offspring x_1, ..., x_λ ∼ N(m, σ²C)
    evaluate f(x_1), ..., f(x_λ)
    select new population of size µ (e.g., best offspring)
    update mean vector m
    update global step size σ
    update covariance matrix C
until stopping criterion met

Key ingredients: randomized; population-based; rank-based (function-value free); step size control; metric learning (covariance matrix adaptation, CMA).
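A deliberately minimal Python instance of this template, assuming a fixed covariance C = I and a crude success-based step-size rule standing in for proper step-size control; all constants and names are illustrative, not the mechanisms of any particular published ES.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_es(f, m, sigma=1.0, lam=10, mu=3, iters=200):
    """A (mu/mu, lambda)-ES skeleton: sample, evaluate, select, recombine."""
    fm = f(m)
    for _ in range(iters):
        offspring = m + sigma * rng.standard_normal((lam, len(m)))  # sample
        fvals = np.array([f(x) for x in offspring])                 # evaluate
        best = np.argsort(fvals)[:mu]                               # select mu best
        m_new = offspring[best].mean(axis=0)                        # update mean
        f_new = f(m_new)
        sigma *= 1.2 if f_new < fm else 0.8   # crude step-size control (illustrative)
        m, fm = m_new, f_new                  # covariance C stays fixed at I here
    return m, sigma

m, sigma = simple_es(lambda x: float(np.sum(x ** 2)), m=np.array([5.0, 5.0]))
print(m, sigma)   # m should end up near the optimum at the origin
```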

Evolution Strategies: Design Principles

Why rely on a population? Not necessary: a (1+1)-ES hillclimber also works, but populations are more robust on noisy and multi-modal problems.

Why online adaptation of the step size σ? It makes the algorithm scale-invariant, which yields linear convergence on scale-invariant problems and locally linear convergence on C² functions.

Why discard function values and keep only ranks? Invariance to strictly monotonic (rank-preserving) transformations of objective values.

Why CMA? Efficient optimization of ill-conditioned problems, similar to (quasi-)Newton methods.

This Talk

Given samples and their ranks, how to update the distribution parameters?

Information Geometric Perspective

Change of perspective: the algorithm state is no longer the population x_1, ..., x_µ but the distribution N(m, C).
Parameters θ = (m, C) ∈ Θ, distribution P_θ, pdf p_θ.
For multi-variate Gaussians: Θ = R^d × P_d, where P_d = { M ∈ R^{d×d} | M = M^T, M positive definite }.
Statistical manifold of distributions { P_θ | θ ∈ Θ } ≅ Θ.
It is equipped with an intrinsic (Riemannian) geometry; the metric tensor is given by the Fisher information matrix I(θ).

Information Geometric Perspective

Goal: optimization of θ ∈ Θ instead of x ∈ R^d.
Lift the objective function f : R^d → R to an objective W_f : Θ → R.
Simplest choice:
    W_f(θ) = E_{x∼P_θ}[f(x)]
More flexible choice:
    W_f(θ) = E_{x∼P_θ}[w(f(x))]
with a monotonic weight function w : R → R.

Information Geometric Perspective

The expectation operator adds one degree of smoothness. Hence, under weak assumptions, W_f(θ) can be optimized with gradient descent (GD):
    θ ← θ − η ∇_θ W_f(θ)
Log-likelihood trick:
    ∇_θ W_f(θ) = ∇_θ E_{x∼P_θ}[w(f(x))] = E_{x∼P_θ}[∇_θ log(p_θ(x)) · w(f(x))]
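A small numpy sketch of the log-likelihood trick, assuming the simplest choice W_f(θ) = E[f(x)] and a Gaussian P_θ = N(m, σ²I) with σ held fixed, so that ∇_m log p_θ(x) = (x − m)/σ²; the objective and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return float(np.sum(x ** 2))       # example objective (sphere)

def grad_Wf_wrt_mean(m, sigma, lam=5000):
    """MC estimate of grad_m E_{x~N(m, sigma^2 I)}[f(x)] via the log-likelihood
    trick: average of grad_m log p(x_i) * f(x_i), with grad_m log p = (x - m)/sigma^2."""
    xs = m + sigma * rng.standard_normal((lam, len(m)))
    scores = (xs - m) / sigma ** 2                       # score functions
    fvals = np.array([f(x) for x in xs])
    return (scores * fvals[:, None]).mean(axis=0)

m = np.array([2.0, -1.0])
print(grad_Wf_wrt_mean(m, sigma=0.3))
# analytically grad_m E[f] = 2*m = [4, -2]; gradient descent on W_f drives m to 0
```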

Information Geometric Perspective

Problem: this does not work well. Why?
We have to replace the plain gradient on the (Euclidean) parameter space Θ with the natural gradient, which respects the intrinsic geometry of the statistical manifold:
    ∇̃_θ W_f(θ) = I(θ)⁻¹ ∇_θ W_f(θ)
The natural gradient is invariant under changes of the parameterization θ ↦ P_θ.
The (natural) gradient vector field Θ → TΘ defines the (natural) gradient flow φ_t(θ), tangential to ∇̃_θ W_f(θ).
Optimization: follow the inverse flow curves t ↦ φ_t(θ) from θ into a (local) minimum of W_f.

Information Geometric Perspective

Black-box setting: the expectation
    E_{x∼P_θ}[∇_θ log(p_θ(x)) · w(f(x))]
is intractable.
Monte Carlo (MC) approximation:
    ∇_θ W_f(θ) ≈ G(θ) = (1/λ) Σ_{i=1}^{λ} ∇_θ log(p_θ(x_i)) · w(f(x_i))
for x_1, ..., x_λ ∼ P_θ.
The estimator is unbiased, E[G(θ)] = ∇_θ W_f(θ), hence following G(θ) amounts to stochastic gradient descent.
This yields the natural gradient MC approximation G̃(θ) = I(θ)⁻¹ G(θ).
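As a sketch of the difference between G(θ) and G̃(θ), consider a separable Gaussian with parameters θ = (m_i, σ_i) per coordinate; for this parameterization the Fisher matrix is diagonal, with entries 1/σ_i² for the mean and 2/σ_i² for the standard deviation, so the natural gradient is just a per-coordinate rescaling. The weighting w is taken to be the raw f-value here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_gradients(m, sigma, f, lam=200):
    """Plain MC gradient G(theta) and natural gradient I(theta)^{-1} G(theta)
    for a separable Gaussian; here w(f(x)) = f(x) for simplicity."""
    z = rng.standard_normal((lam, len(m)))
    x = m + sigma * z                                 # x_i ~ N(m, diag(sigma^2))
    w = np.array([f(xi) for xi in x])
    # score functions of the Gaussian w.r.t. m and sigma
    G_m = (z / sigma * w[:, None]).mean(axis=0)
    G_s = ((z ** 2 - 1.0) / sigma * w[:, None]).mean(axis=0)
    # multiply by the inverse Fisher: diag(sigma^2) for m, diag(sigma^2 / 2) for sigma
    return (G_m, G_s), (sigma ** 2 * G_m, 0.5 * sigma ** 2 * G_s)

m, sigma = np.array([1.0, -2.0]), np.array([0.5, 0.5])
plain, natural = mc_gradients(m, sigma, f=lambda x: float(np.sum(x ** 2)))
print("plain:", plain, "\nnatural:", natural)
```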

Information Geometric Perspective

Stochastic Natural Gradient Descent (SNGD) update rule:
    θ ← θ − η G̃(θ)
SNGD is a two-fold approximation of the gradient flow: discretized time (Euler steps) and a randomized gradient (MC sampling).
The natural gradient flow is invariant under the choice of distribution parameterization θ ↦ P_θ.
The SNGD algorithm is invariant only in first-order approximation, due to the Euler steps.
Ollivier et al. Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles. arXiv:1106.3708, 2011.

Natural (Gradient) Evolution Strategies

Natural (gradient) Evolution Strategies (NES) closely follow this scheme. They were the first ES derived this way.
NES approach: optimization of the expected objective value over Gaussian search distributions.
The offspring x_1, ..., x_λ serve as the MC sample for estimating G(θ).
Closed-form Fisher information matrix for N(m, C):
    I_{i,j}(θ) = (∂m/∂θ_i)^T C⁻¹ (∂m/∂θ_j) + ½ tr(C⁻¹ (∂C/∂θ_i) C⁻¹ (∂C/∂θ_j))
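As a quick sanity check of the closed form, the sketch below evaluates it for a one-dimensional Gaussian parameterized by θ = (m, σ), where C = σ², ∂m/∂θ = (1, 0) and ∂C/∂θ = (0, 2σ); the well-known result I(θ) = diag(1/σ², 2/σ²) drops out.

```python
import numpy as np

def gaussian_fisher_1d(sigma):
    """Fisher matrix for a 1-D Gaussian with theta = (m, sigma), via
    I_ij = dm_i C^{-1} dm_j + 0.5 * tr(C^{-1} dC_i C^{-1} dC_j)."""
    C = sigma ** 2
    dm = [1.0, 0.0]           # dm/dm, dm/dsigma
    dC = [0.0, 2.0 * sigma]   # dC/dm, dC/dsigma
    I = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            I[i, j] = dm[i] * (1.0 / C) * dm[j] + 0.5 * (dC[i] / C) * (dC[j] / C)
    return I

print(gaussian_fisher_1d(0.5))   # expected diag(1/sigma^2, 2/sigma^2) = diag(4, 8)
```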

Natural (Gradient) Evolution Strategies

NES algorithm
input θ ∈ Θ, λ ∈ N, η > 0
loop
    sample x_1, ..., x_λ ∼ P_θ
    evaluate f(x_1), ..., f(x_λ)
    G(θ) ← (1/λ) Σ_{i=1}^{λ} ∇_θ log(p_θ(x_i)) · f(x_i)
    G̃(θ) ← I(θ)⁻¹ G(θ)
    θ ← θ − η G̃(θ)
until stopping criterion met

Wierstra et al. Natural Evolution Strategies. CEC, 2008 and JMLR, 2014.

Natural (Gradient) Evolution Strategies

NES works better when replacing f(x_i) in
    G(θ) = (1/λ) Σ_{i=1}^{λ} ∇_θ log(p_θ(x_i)) · f(x_i)
with rank-based utility values w_1, ..., w_λ.
Sample ranks are f-quantile estimators, hence the utility values can be represented as w(f(x_i)) with a special weight function w = w_θ based on f-quantiles under the current distribution P_θ.
This turns NES into a function-value-free algorithm.
Benefits: invariance under monotonic transformations of objective values, and linear convergence on scale-invariant problems.
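One common way to realize such rank-based utilities is shown below; the specific truncated log-linear shaping is a popular choice in the NES/CMA-ES literature, not mandated by the slides, and fitness values enter only through their ranks.

```python
import numpy as np

def rank_based_utilities(fvals):
    """Utility values w_1, ..., w_lambda that depend only on the ranks of the
    offspring (rank 0 = best for minimization); they sum to zero."""
    lam = len(fvals)
    ranks = np.argsort(np.argsort(fvals))
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(ranks + 1))
    return raw / raw.sum() - 1.0 / lam

print(rank_based_utilities([3.2, 0.1, 7.5, 1.8]))
# the best sample gets the largest positive weight, the worst ones negative weights
```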

Exponential Natural Evolution Strategies

xNES (exponential NES) algorithm
input (m, A) ∈ Θ, λ ∈ N, η_m, η_A > 0
loop
    sample z_1, ..., z_λ ∼ N(0, I)
    transform x_i ← A z_i + m for i = 1, ..., λ
    evaluate f(x_1), ..., f(x_λ)
    G_m(θ) ← (1/λ) Σ_{i=1}^{λ} w(f(x_i)) · z_i
    G_C(θ) ← (1/λ) Σ_{i=1}^{λ} w(f(x_i)) · ½ (z_i z_i^T − I)
    m ← m − η_m · A G_m(θ)
    A ← A · exp(−½ η_A G_C(θ))
until stopping criterion met

Glasmachers et al. Exponential Natural Evolution Strategies. GECCO, 2010.
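Below is a compact, self-contained Python sketch following the structure of this pseudocode on the sphere function. It uses rank-based utilities with the common larger-is-better convention, so the mean and covariance-factor updates appear with a plus sign; learning rates, the utility scheme, and all other constants are illustrative defaults rather than values prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def xnes(f, m, A, lam=12, eta_m=1.0, eta_A=0.1, iters=300):
    """Minimal xNES-style loop: sample in natural coordinates z, map through A,
    rank, and apply natural-gradient updates to m and the factor A."""
    d = len(m)
    # fixed rank-based utilities, best sample first, summing to zero
    raw = np.maximum(0.0, np.log(lam / 2 + 1) - np.log(np.arange(1, lam + 1)))
    u = raw / raw.sum() - 1.0 / lam
    for _ in range(iters):
        z = rng.standard_normal((lam, d))
        x = z @ A.T + m                               # x_i = A z_i + m
        zs = z[np.argsort([f(xi) for xi in x])]       # sort: best (smallest f) first
        G_m = u @ zs                                  # sum_i u_i z_i
        G_C = sum(ui * (np.outer(zi, zi) - np.eye(d)) for ui, zi in zip(u, zs))
        m = m + eta_m * A @ G_m                       # natural-gradient mean update
        # multiplicative update of A via the matrix exponential of symmetric G_C
        evals, V = np.linalg.eigh(0.5 * eta_A * G_C)
        A = A @ (V * np.exp(evals)) @ V.T
    return m, A

m_opt, _ = xnes(lambda x: float(np.sum(x ** 2)), m=np.array([3.0, -2.0, 1.0]), A=np.eye(3))
print(m_opt)   # should be close to the optimum at the origin
```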

Natural (Gradient) Evolution Strategies

The resulting algorithm is indeed an ES (R^d perspective), and at the same time an SNGD algorithm (Θ perspective).
ES traditionally have three distinct and often very different mechanisms for optimization: adaptation of the mean m, step size control (adaptation of σ), and metric learning (adaptation of C).
SNGD has only a single mechanism.
Astonishing insight: all (or most) parameters can be updated with a single mechanism.

Exponential Natural Evolution Strategies

Although derived in a completely different way, this algorithm turns out to be closely related to CMA-ES.
Akimoto et al. Bidirectional relation between CMA evolution strategies and natural evolution strategies. PPSN 2010.
In particular, the update of m is identical, and the rank-µ update of CMA-ES coincides with a NES update.
This is astonishing; it is surely not a coincidence.
This is insightful: it means that CMA-ES is (essentially) an SNGD algorithm.

Summary

Evolution Strategies (ES) are randomized direct search methods suitable for continuous black-box optimization.
Here we have focused on a core question: how to update the distribution parameters in a principled way?
The parameter space is equipped with the non-Euclidean information geometry of the corresponding statistical manifold of search distributions.
SNGD on a stochastically relaxed problem results in a direct search algorithm: the Natural-gradient Evolution Strategy (NES) algorithm.
The SNGD parameter update is by no means restricted to Gaussian distributions; it is a general construction template for update equations of continuous distribution parameters.

Thank you! Questions?


More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

Support Vector Machines and Bayes Regression

Support Vector Machines and Bayes Regression Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering

More information

Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C.

Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Spall John Wiley and Sons, Inc., 2003 Preface... xiii 1. Stochastic Search

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization

Proximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

FALL 2018 MATH 4211/6211 Optimization Homework 4

FALL 2018 MATH 4211/6211 Optimization Homework 4 FALL 2018 MATH 4211/6211 Optimization Homework 4 This homework assignment is open to textbook, reference books, slides, and online resources, excluding any direct solution to the problem (such as solution

More information

Exercises * on Linear Algebra

Exercises * on Linear Algebra Exercises * on Linear Algebra Laurenz Wiskott Institut für Neuroinformatik Ruhr-Universität Bochum, Germany, EU 4 February 7 Contents Vector spaces 4. Definition...............................................

More information

Lecture 1: Supervised Learning

Lecture 1: Supervised Learning Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)

More information

Logistic Regression Logistic

Logistic Regression Logistic Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information