Parallelised Bayesian Optimisation via Thompson Sampling


1 Parallelised Bayesian Optimisation via Thompson Sampling. Kirthevasan Kandasamy, Carnegie Mellon University. Google Research, Mountain View, CA. Sep 27, 2017. Slides:

2 Slides are up on my website: kkandasa

3 Black-box Optimisation. Example: hyper-parameter tuning. [diagram: hyperparameters → Neural Network → cross-validation accuracy] - Train the NN using the given hyper-parameters. - Compute accuracy on a validation set.

4-5 Black-box Optimisation. [diagram: an expensive black-box function] Other examples: - ML estimation in astrophysics - Pre-clinical drug discovery - Optimal policy in autonomous driving

6-7 Black-box Optimisation. f : X → R is an expensive black-box function, accessible only via noisy evaluations. [figure: f(x) vs. x]

8 Black-box Optimisation. f : X → R is an expensive black-box function, accessible only via noisy evaluations. Let x* = argmax_x f(x). [figure: f(x) vs. x, marking x* and f(x*)]

9 Black-box Optimisation. Simple regret after n evaluations: SR(n) = f(x*) − max_{t=1,...,n} f(x_t).

10 Black-box Optimisation. Cumulative regret after n evaluations: CR(n) = Σ_{t=1}^{n} ( f(x*) − f(x_t) ).

11 Black-box Optimisation. Simple regret after n evaluations: SR(n) = f(x*) − max_{t=1,...,n} f(x_t).
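
A quick worked example of the two regret notions above; the optimum value and query values are made up for illustration:

```python
import numpy as np

def simple_regret(f_star, f_queries):
    """SR(n) = f(x*) - max_{t<=n} f(x_t): gap between the optimum and the best query so far."""
    return f_star - np.max(f_queries)

def cumulative_regret(f_star, f_queries):
    """CR(n) = sum_{t<=n} (f(x*) - f(x_t)): total shortfall accumulated over all queries."""
    return float(np.sum(f_star - np.asarray(f_queries)))

f_vals = [0.2, 0.7, 0.5, 0.9, 0.8]       # hypothetical f(x_t) values, optimum f(x*) = 1.0
print(simple_regret(1.0, f_vals))         # ~0.1
print(cumulative_regret(1.0, f_vals))     # ~1.9
```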

12 A walk-through: Bayesian Optimisation (BO) with Gaussian Processes. - A review of Gaussian Processes (GPs) - Thompson Sampling (TS): an algorithm for BO - Other methods and models for BO

13-17 Gaussian Processes (GP). GP(µ, κ): a distribution over functions from X to R. [figure sequence: functions with no observations, the prior GP, observations, and the posterior GP given observations; axes f(x) vs. x]

18 Gaussian Processes (GP). GP(µ, κ): a distribution over functions from X to R. Completely characterised by the mean function µ : X → R and the covariance kernel κ : X × X → R. After t observations, f(x) ~ N( µ_t(x), σ_t²(x) ).
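
A minimal sketch of the posterior computation behind slide 18, using the standard GP regression formulas; the squared-exponential kernel, its lengthscale, and the noise level are illustrative assumptions, not choices taken from the talk. The marginal variance σ_t²(x) is the diagonal of the returned covariance.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=0.2, sigma_f=1.0):
    """Squared-exponential (SE) kernel kappa(x, x') for 1-D inputs."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise_var=1e-3):
    """Posterior mean mu_t and covariance of f at x_query, given t observations (x_obs, y_obs)."""
    K = sq_exp_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    Ks = sq_exp_kernel(x_obs, x_query)
    Kss = sq_exp_kernel(x_query, x_query)
    mu = Ks.T @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, cov
```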

19-22 Gaussian Process (Bayesian) Optimisation. Model f ~ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933). [figure sequence: the posterior GP, a sample g drawn from it, and the chosen point x_t; axes f(x) vs. x]

23 Gaussian Process (Bayesian) Optimisation. Model f ~ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933): 1) Construct posterior GP. 2) Draw sample g from posterior. 3) Choose x_t = argmax_x g(x). 4) Evaluate f at x_t.
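
A minimal sketch of this sequential TS loop (seqts), reusing the gp_posterior helper sketched above and maximising the sampled function over a finite grid; the toy objective, grid size, and jitter term are assumptions for illustration only:

```python
import numpy as np

def seq_thompson_sampling(f, n_iters, grid, noise_var=1e-3, seed=0):
    """Sequential TS on a finite grid: 1) posterior, 2) sample g, 3) x_t = argmax g, 4) evaluate f."""
    rng = np.random.default_rng(seed)
    x_obs = np.array([rng.choice(grid)])                                  # start from one random point
    y_obs = np.array([f(x_obs[0]) + rng.normal(0, np.sqrt(noise_var))])
    for _ in range(n_iters - 1):
        mu, cov = gp_posterior(x_obs, y_obs, grid, noise_var)             # 1) construct posterior GP
        g = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(grid)))   # 2) draw sample g
        x_next = grid[np.argmax(g)]                                       # 3) choose x_t = argmax g
        y_next = f(x_next) + rng.normal(0, np.sqrt(noise_var))            # 4) noisy evaluation of f
        x_obs, y_obs = np.append(x_obs, x_next), np.append(y_obs, y_next)
    return x_obs, y_obs

grid = np.linspace(0.0, 1.0, 200)
xs, ys = seq_thompson_sampling(lambda x: np.sin(6 * x) * x, 20, grid)     # toy 1-D objective
print(xs[np.argmax(ys)])
```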

24-33 Thompson Sampling (TS) in GPs (Thompson, 1933). [figure sequence: TS iterations at t = 1, 2, 3, 4, 5, 6, 7, 14, 25; axes f(x) vs. x]

34-35 Some Theoretical Results for TS. Simple Regret: SR(n) = f(x*) − max_{t=1,...,n} f(x_t).
Theorem (Russo & van Roy 2014, Srinivas et al. 2010): For Thompson sampling, E[SR(n)] ≲ √( Ψ_n log(n) / n ), where ≲ ignores constants and Ψ_n is the Maximum Information Gain.

36-37 Some Theoretical Results for TS (continued). When X ⊂ R^d: SE (Gaussian) kernel: Ψ_n ≲ d^d log(n)^d. Matérn kernel: Ψ_n ≲ n^(1 − c/d²).
Several other results: (Agrawal et al 2012, Kaufmann et al 2012, Russo & van Roy 2016, Chowdhury & Gopalan 2017, and more...)
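
For reference, the maximum information gain Ψ_n appearing in these bounds is usually defined as in Srinivas et al. 2010; a sketch of that definition, with notation adapted to this talk (the Gaussian-noise expression for the mutual information is the standard GP case):

```latex
% Maximum information gain after n evaluations: the largest mutual information between f
% and noisy observations y_A over any set A of at most n points.
\Psi_n \;=\; \max_{A \subset \mathcal{X},\; |A| \le n} I\!\left(f;\, y_A\right),
\qquad
I\!\left(f;\, y_A\right) \;=\; \tfrac{1}{2}\,\log\det\!\left(\mathbf{I} + \sigma^{-2} K_A\right)
\quad \text{for a GP with Gaussian observation noise } \sigma^2,
```

where K_A is the kernel matrix of the points in A.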

38-43 Other methods for BO. Other criteria for selecting x_t: Upper Confidence Bounds (Srinivas et al. 2010), φ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}. [figure sequence: the UCB acquisition φ_t over the posterior GP and the chosen point x_t; axes f(x) vs. x]

44 Other methods for BO. Other criteria for selecting x_t: Upper Confidence Bounds (Srinivas et al. 2010), Expected Improvement (Jones et al. 1998), Probability of Improvement (Kushner et al. 1964), Entropy Search (Hernández-Lobato et al. 2014). All are deterministic methods: they choose the next point for evaluation by maximising a deterministic acquisition function, i.e. x_t = argmax_{x ∈ X} φ_t(x).

45 Other methods for BO. Other models for f: Neural Networks (Snoek et al. 2015), Random Forests (Hutter 2009).
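
A minimal sketch of the GP-UCB selection rule above, reusing the gp_posterior helper from the earlier sketch; the β_t schedule here is a common heuristic, not the theoretical choice from Srinivas et al.:

```python
import numpy as np

def ucb_choice(x_obs, y_obs, grid, t, noise_var=1e-3):
    """GP-UCB: maximise phi_t(x) = mu_{t-1}(x) + beta_t^{1/2} * sigma_{t-1}(x) over the grid."""
    mu, cov = gp_posterior(x_obs, y_obs, grid, noise_var)
    sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
    beta_t = 2.0 * np.log(t + 1.0)             # heuristic confidence multiplier (assumption)
    phi = mu + np.sqrt(beta_t) * sigma
    return grid[np.argmax(phi)]
```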

46-49 Big picture: scaling up black-box optimisation.
- Optimising in high-dimensional spaces, e.g. tuning models with several hyper-parameters. Additive models for f lead to statistically and computationally tractable algorithms. (Kandasamy et al. ICML 2015)
- Multi-fidelity optimisation: what if we have cheap approximations to f? E.g. we wish to train an ML model with N data and T iterations, but use N′ < N data and T′ < T iterations to approximate the cross-validation performance at (N, T). (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)
- All of this extends beyond GPs.

50-51 This work: Parallel Evaluations (Kandasamy et al. Arxiv 2017)
Parallelisation with M workers: we can evaluate f at M different points at the same time, e.g. train M models with different hyper-parameter values in parallel. Inability to parallelise is a real bottleneck in practice!
Some desiderata:
- Statistically, achieve an M-fold improvement.
- Methodologically, be scalable to a very large number of workers: the method remains computationally tractable as M increases, and is conceptually simple, for robustness in practice.

52 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results 6. Open questions/challenges 11/31

53-54 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results: synts and asyts perform essentially the same as seqts in terms of the number of evaluations; when we factor in time as a resource, asyts outperforms synts and seqts... with some caveats. 6. Open questions/challenges

55-58 Parallel Evaluations: set up.
- Sequential evaluations with one worker: the j-th job has feedback from all previous j−1 jobs.
- Parallel evaluations with M workers (Asynchronous): the j-th job is missing feedback from exactly M−1 jobs.
- Parallel evaluations with M workers (Synchronous): the j-th job is missing feedback from M−1 jobs.

59-60 Simple Regret in Parallel Settings (Kandasamy et al. Arxiv 2017)
Simple regret after n evaluations: SR(n) = f(x*) − max_{t=1,...,n} f(x_t), where n is the number of completed evaluations by all M workers.
Simple regret with time as a resource: SR′(T) = f(x*) − max_{t=1,...,N} f(x_t), where N is the (possibly random) number of completed evaluations by all M workers within time T. [figure: asynchronous vs. synchronous worker timelines]
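
Computing SR′(T) from a log of timestamped completions follows the definition above directly; a small sketch (returning an infinite value for an empty log is only a convention for this sketch):

```python
def simple_regret_with_time(f_star, completions, T):
    """SR'(T): simple regret over the evaluations that all M workers completed by wall-clock time T.
    completions: iterable of (finish_time, f_value) pairs."""
    values = [fv for ft, fv in completions if ft <= T]
    if not values:
        return float("inf")          # no evaluation has completed yet (convention for this sketch)
    return f_star - max(values)
```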

61 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results synts and asyts perform essentially the same as seqts in terms of the number of evaluations. When we factor time as a resource, asyts outperforms synts and seqts.... with some caveats 6. Open questions/challenges 13/31

62-65 Prior work in Parallel BO: (Ginsbourger et al. 2011), (Janusevkis et al. 2012), (Contal et al. 2013), (Desautels et al. 2014), (Gonzalez et al. 2015), (Shah & Ghahramani 2015), (Wang et al. 2016), (Kathuria et al. 2016), (Wu & Frazier 2017), (Wang et al. 2017), (Kandasamy et al. Arxiv 2017).
[table comparing these methods on: Asynchronicity, Theoretical guarantees, Conceptual simplicity*]
* straightforward extension of the sequential algorithm works.

66-70 Why are deterministic algorithms not simple? Need to encourage diversity in parallel evaluations.
Direct application of GP-UCB in the synchronous setting, with φ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}:
- First worker: maximise the acquisition, x_{t1} = argmax φ_t(x).
- Second worker: the acquisition is the same! x_{t2} = x_{t1}.
- Hence x_{t1} = x_{t2} = ... = x_{tM}.
Direct application of the sequential algorithm does not work. Need to encourage diversity.

71-76 Why are deterministic algorithms not simple? Need to encourage diversity in parallel evaluations. Workarounds in prior work:
- Add hallucinated observations: pretend each pending point has already been observed at some plausible value f̂, so the acquisition is pushed away from it. [figure: posterior with a hallucinated observation f̂; axes f(x) vs. x]
- Optimise an acquisition over X^M.
- Resort to heuristics; these typically require additional hyper-parameters and/or computational routines.

77 Take-home message: Straightforward application of the sequential algorithm works for TS. Its inherent randomness takes care of the exploration vs. exploitation trade-off when managing M workers.
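
For contrast with TS, a minimal sketch of the hallucination workaround above in the "kriging believer" style (pending points hallucinated at their posterior mean); it reuses the gp_posterior helper from the earlier sketch and is only one of several variants in the literature, not the method advocated in this talk:

```python
import numpy as np

def batch_ucb_with_hallucination(x_obs, y_obs, grid, t, batch_size, noise_var=1e-3):
    """Select a synchronous batch: after each pick, pretend it was observed at its posterior mean
    so that the next acquisition maximum moves elsewhere."""
    x_h, y_h = np.asarray(x_obs, float), np.asarray(y_obs, float)
    batch = []
    for _ in range(batch_size):
        mu, cov = gp_posterior(x_h, y_h, grid, noise_var)
        sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
        phi = mu + np.sqrt(2.0 * np.log(t + 1.0)) * sigma      # UCB acquisition (heuristic beta_t)
        idx = int(np.argmax(phi))
        batch.append(grid[idx])
        x_h = np.append(x_h, grid[idx])                        # hallucinate (x, mu(x)) as an observation
        y_h = np.append(y_h, mu[idx])
    return batch
```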

78-80 Parallel Thompson Sampling (Kandasamy et al. Arxiv 2017)
Asynchronous, asyts. At any given time:
1. (x, y) ← Wait for a worker to finish.
2. Compute the posterior GP.
3. Draw a sample g ~ GP.
4. Re-deploy the worker at argmax g.
Synchronous, synts. At any given time:
1. {(x_m, y_m)}_{m=1..M} ← Wait for all workers to finish.
2. Compute the posterior GP.
3. Draw M samples g_m ~ GP, for each m.
4. Re-deploy worker m at argmax g_m, for each m.
Variants in prior work: (Osband et al. 2016, Israelsen et al. 2016, Hernandez-Lobato et al. 2017)
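
A simplified sketch of the two loops above, using a thread pool as the M "workers", the grid-based posterior sampling from the earlier sketches (thompson_point calls gp_posterior), and a couple of initial observations. The actual implementation lives in the repository linked at the end of the talk; everything below, including the worker abstraction, is an illustrative assumption:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def thompson_point(x_obs, y_obs, grid, rng, noise_var=1e-3):
    """Draw g ~ posterior GP on the grid (gp_posterior from the earlier sketch) and return argmax g."""
    mu, cov = gp_posterior(x_obs, y_obs, grid, noise_var)
    g = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(grid)))
    return grid[np.argmax(g)]

def _evaluate(f, x):
    return x, f(x)                                   # tag the result with the point it belongs to

def asy_ts(f, grid, x_init, y_init, n_evals, n_workers, seed=0, noise_var=1e-3):
    """asyts: whenever a worker finishes, update the posterior, draw one sample, re-deploy at its argmax."""
    rng = np.random.default_rng(seed)
    x_obs, y_obs = np.asarray(x_init, float), np.asarray(y_init, float)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pending = {pool.submit(_evaluate, f, thompson_point(x_obs, y_obs, grid, rng, noise_var))
                   for _ in range(n_workers)}
        done_count = 0
        while done_count < n_evals:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:                                                  # 1) collect (x, y)
                x_fin, y_fin = fut.result()
                x_obs, y_obs = np.append(x_obs, x_fin), np.append(y_obs, y_fin)
                done_count += 1
                x_next = thompson_point(x_obs, y_obs, grid, rng, noise_var)   # 2)-3) posterior + sample
                pending.add(pool.submit(_evaluate, f, x_next))                # 4) re-deploy this worker
    return x_obs, y_obs

def syn_ts(f, grid, x_init, y_init, n_rounds, n_workers, seed=0, noise_var=1e-3):
    """synts: each round, draw M samples, dispatch all M workers, wait for all of them to finish."""
    rng = np.random.default_rng(seed)
    x_obs, y_obs = np.asarray(x_init, float), np.asarray(y_init, float)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_rounds):
            batch = [thompson_point(x_obs, y_obs, grid, rng, noise_var) for _ in range(n_workers)]
            for x_fin, y_fin in pool.map(lambda x: (x, f(x)), batch):         # wait for the whole batch
                x_obs, y_obs = np.append(x_obs, x_fin), np.append(y_obs, y_fin)
    return x_obs, y_obs
```

The only difference between the two is when the posterior is recomputed: after every single completion (asyts) versus once per round of M completions (synts).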

81 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results synts and asyts perform essentially the same as seqts in terms of the number of evaluations. When we factor time as a resource, asyts outperforms synts and seqts.... with some caveats 6. Open questions/challenges 18/31

82 Experiment: Park1-4D, M = 10. Comparison in terms of the number of evaluations. [plot; methods: asyts, seqts, synts]

83-85 Experiment: Branin-2D, M = 4. Evaluation time sampled from a uniform distribution. [plot; methods: synrand, synhucb, synucbpe, synts, asyrand, asyucb, asyhucb, asyei, asyhts, asyts]

86 Experiment: Hartmann-6D, M = 12. Evaluation time sampled from a half-normal distribution. [plot; same methods as above]

87 Experiment: Hartmann-18D, M = 25. Evaluation time sampled from an exponential distribution. [plot; same methods as above]

88 Experiment: Currin-Exponential-14D, M = 35. Evaluation time sampled from a Pareto-3 distribution. [plot; same methods as above]

89 Experiment: Model Selection on Cifar10, M = 4. Tune the # of filters in the range (32, 256) for each layer of a 6-layer CNN. Time taken for an evaluation: 4-16 minutes. [plot; methods: asyts, asyei, asyhucb, asyrand, synts, synhucb]

90 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results synts and asyts perform essentially the same as seqts in terms of the number of evaluations. When we factor time as a resource, asyts outperforms synts and seqts.... with some caveats. 6. Open questions/challenges 24/31

91-92 Bounds for SR(n), synts.
seqts (Russo & van Roy 2014): E[SR(n)] ≲ √( Ψ_n log(n) / n ), where Ψ_n is the maximum information gain.
Theorem, synts (Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ √( M log(M) / n ) + √( Ψ_{n+M} log(n+M) / n ). The leading constant is also the same.

93-97 Bounds for SR(n), asyts.
seqts (Russo & van Roy 2014): E[SR(n)] ≲ √( Ψ_n log(n) / n ).
Theorem, asyts (Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ √( ξ_M Ψ_n log(n) / n ), where ξ_M = sup_{n ≥ 1, D_n} max_{A ⊂ X, |A| ≤ M} exp( I(f; A | D_n) ).
Theorem (Krause et al. 2008, Desautels et al. 2012): There exists an asynchronously parallelisable initialisation scheme requiring O(M polylog(M)) evaluations of f such that ξ_M ≤ C.
Theorem, asyts, arbitrary X (Kandasamy et al. Arxiv 2017): with this initialisation, E[SR(n)] ≲ √( M polylog(M) / n ) + √( C Ψ_n log(n) / n ).
* We do not believe this initialisation is necessary.

98-99 Bounds for asyts without the initialisation scheme.
Theorem, synts, arbitrary X (Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ √( M log(M) / n ) + √( Ψ_{n+M} log(n+M) / n ).
Theorem, asyts, X ⊂ R^d (ongoing work): E[SR(n)] ≲ ... + M log(n) / n^(1/O(d)).

100-102 Theoretical Results for SR′(T). Model the evaluation time as an independent random variable:
- Uniform unif(a, b): bounded
- Half-normal HN(τ²): sub-Gaussian
- Exponential exp(λ): sub-exponential
Theorem (informal): If evaluation times are all the same, synts ≈ asyts. Otherwise, the bounds for asyts are better than those for synts; the more variable the evaluation times, the bigger the difference.
- Uniform: constant factor
- Half-normal: log(M) factor
- Exponential: log(M) factor
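
A small simulation sketch of the mechanism behind this result: synchronous execution waits for the slowest of M workers in every round, while asynchronous execution keeps each worker busy. The distributions match the three cases listed above; the parameters and time budget are illustrative assumptions:

```python
import numpy as np

def evals_by_time(sampler, M, T, mode, rng):
    """Number of evaluations M workers complete by wall-clock time T."""
    if mode == "asy":                                # each worker runs jobs back-to-back, independently
        total = 0
        for _ in range(M):
            t, n = 0.0, 0
            while True:
                t += sampler(rng)
                if t > T:
                    break
                n += 1
            total += n
        return total
    t, total = 0.0, 0                                # "syn": every round waits for the slowest of M workers
    while True:
        t += np.max(sampler(rng, size=M))
        if t > T:
            return total
        total += M

rng = np.random.default_rng(0)
samplers = {"uniform":     lambda r, size=None: r.uniform(0.5, 1.5, size),
            "half-normal": lambda r, size=None: np.abs(r.normal(0.0, 1.0, size)),
            "exponential": lambda r, size=None: r.exponential(1.0, size)}
for name, s in samplers.items():
    print(name, "asy:", evals_by_time(s, M=20, T=100.0, mode="asy", rng=rng),
          "syn:", evals_by_time(s, M=20, T=100.0, mode="syn", rng=rng))
```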

103 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results synts and asyts perform essentially the same as seqts in terms of the number of evaluations. When we factor time as a resource, asyts outperforms synts and seqts.... with some caveats 6. Open questions/challenges 28/31

104-105 Open Challenges for Parallelised TS
1. Bounds for asynchronous TS without initialisation.
2. Other models for evaluation times, e.g. the evaluation time depends on x ∈ X.
3. In the asynchronous setting: Should you wait for another job to finish without immediately re-deploying? Do you kill an on-going job depending on the result of a completed job?

106-107 Open Challenges for Parallelised TS
4. Optimising the sample when X = [0, 1]^d: x_t = argmax_{x ∈ X} g(x), where g ~ posterior GP. This is global optimisation of a non-convex function, a common challenge in most BO methods.
But additionally for TS: as g is not deterministic, either draw samples of g at a fixed set of points and pick the maximum, or, if using an adaptive method, the cost scales as O((N + S)³), where N is the # of evaluations of f and S the # of evaluations of g.
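
A minimal sketch of the fixed-set strategy just described: draw g jointly at a finite candidate set and return its maximiser. It mirrors the earlier 1-D gp_posterior helper but is written for d-dimensional inputs; the uniform random candidate design, kernel, and lengthscale are illustrative assumptions:

```python
import numpy as np

def sq_exp_kernel_nd(A, B, ell=0.2):
    """SE kernel for d-dimensional inputs; rows of A and B are points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def maximise_posterior_sample(x_obs, y_obs, n_candidates, d, rng, noise_var=1e-3):
    """Approximately solve x_t = argmax_x g(x), g ~ posterior GP, over X = [0, 1]^d:
    sample g jointly at a fixed random candidate set and pick the maximiser."""
    cand = rng.uniform(0.0, 1.0, size=(n_candidates, d))
    K = sq_exp_kernel_nd(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    Ks = sq_exp_kernel_nd(x_obs, cand)
    Kss = sq_exp_kernel_nd(cand, cand)
    mu = Ks.T @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    g = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(n_candidates))
    return cand[np.argmax(g)]
```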

108-110 Summary
synts, asyts: direct application of TS to the synchronous and asynchronous parallel settings.
Take-aways, theory: both perform essentially the same as seqts in terms of the number of evaluations; when we factor in time as a resource, asyts performs best.
Take-aways, practice: conceptually simple, and scales better with the number of workers than other methods.

111 Collaborators: Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos. Code: github.com/kirthevasank/gp-parallel-ts Slides: Thank you.

arXiv v1 [stat.ml], 25 May 2017: Asynchronous Parallel Bayesian Optimisation via Thompson Sampling. Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos. Carnegie Mellon University; University of Massachusetts, Amherst.
