Parallelised Bayesian Optimisation via Thompson Sampling


1 Parallelised Bayesian Optimisation via Thompson Sampling. Kirthevasan Kandasamy, Carnegie Mellon University. Google Research, Mountain View, CA. Sep 27, 2017. Slides:

2 Slides are up on my website: kkandasa

3 Black-box Optimisation. Example: hyper-parameter tuning. [diagram: hyperparameters → Neural Network → cross-validation accuracy] - Train the NN using the given hyper-parameters. - Compute accuracy on a validation set.

4-5 Black-box Optimisation. [diagram: an expensive black-box function] Other examples: - ML estimation in astrophysics - Pre-clinical drug discovery - Optimal policy in autonomous driving

6-7 Black-box Optimisation. f : X → R is an expensive black-box function, accessible only via noisy evaluations. [figure: f(x) vs. x]

8 Black-box Optimisation. f : X → R is an expensive black-box function, accessible only via noisy evaluations. Let x* = argmax_x f(x). [figure: f(x) vs. x, marking x* and f(x*)]

9 Black-box Optimisation. Simple regret after n evaluations: SR(n) = f(x*) − max_{t=1,...,n} f(x_t).

10 Black-box Optimisation. Cumulative regret after n evaluations: CR(n) = Σ_{t=1}^{n} ( f(x*) − f(x_t) ).

11 Black-box Optimisation. Simple regret after n evaluations: SR(n) = f(x*) − max_{t=1,...,n} f(x_t).
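
A quick worked example of the two regret notions above; the optimum value and query values are made up for illustration:

```python
import numpy as np

def simple_regret(f_star, f_queries):
    """SR(n) = f(x*) - max_{t<=n} f(x_t): gap between the optimum and the best query so far."""
    return f_star - np.max(f_queries)

def cumulative_regret(f_star, f_queries):
    """CR(n) = sum_{t<=n} (f(x*) - f(x_t)): total shortfall accumulated over all queries."""
    return float(np.sum(f_star - np.asarray(f_queries)))

f_vals = [0.2, 0.7, 0.5, 0.9, 0.8]       # hypothetical f(x_t) values, optimum f(x*) = 1.0
print(simple_regret(1.0, f_vals))         # ~0.1
print(cumulative_regret(1.0, f_vals))     # ~1.9
```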

12 A walk-through: Bayesian Optimisation (BO) with Gaussian Processes. - A review of Gaussian Processes (GPs) - Thompson Sampling (TS): an algorithm for BO - Other methods and models for BO

13-17 Gaussian Processes (GP). GP(µ, κ): a distribution over functions from X to R. [figure sequence: functions with no observations, the prior GP, observations, and the posterior GP given observations; axes f(x) vs. x]

18 Gaussian Processes (GP). GP(µ, κ): a distribution over functions from X to R. Completely characterised by the mean function µ : X → R and the covariance kernel κ : X × X → R. After t observations, f(x) ~ N( µ_t(x), σ_t²(x) ).
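
A minimal sketch of the posterior computation behind slide 18, using the standard GP regression formulas; the squared-exponential kernel, its lengthscale, and the noise level are illustrative assumptions, not choices taken from the talk. The marginal variance σ_t²(x) is the diagonal of the returned covariance.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=0.2, sigma_f=1.0):
    """Squared-exponential (SE) kernel kappa(x, x') for 1-D inputs."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise_var=1e-3):
    """Posterior mean mu_t and covariance of f at x_query, given t observations (x_obs, y_obs)."""
    K = sq_exp_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    Ks = sq_exp_kernel(x_obs, x_query)
    Kss = sq_exp_kernel(x_query, x_query)
    mu = Ks.T @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mu, cov
```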

19-22 Gaussian Process (Bayesian) Optimisation. Model f ~ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933). [figure sequence: the posterior GP, a sample g drawn from it, and the chosen point x_t; axes f(x) vs. x]

23 Gaussian Process (Bayesian) Optimisation. Model f ~ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933): 1) Construct posterior GP. 2) Draw sample g from posterior. 3) Choose x_t = argmax_x g(x). 4) Evaluate f at x_t.
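
A minimal sketch of this sequential TS loop (seqts), reusing the gp_posterior helper sketched above and maximising the sampled function over a finite grid; the toy objective, grid size, and jitter term are assumptions for illustration only:

```python
import numpy as np

def seq_thompson_sampling(f, n_iters, grid, noise_var=1e-3, seed=0):
    """Sequential TS on a finite grid: 1) posterior, 2) sample g, 3) x_t = argmax g, 4) evaluate f."""
    rng = np.random.default_rng(seed)
    x_obs = np.array([rng.choice(grid)])                                  # start from one random point
    y_obs = np.array([f(x_obs[0]) + rng.normal(0, np.sqrt(noise_var))])
    for _ in range(n_iters - 1):
        mu, cov = gp_posterior(x_obs, y_obs, grid, noise_var)             # 1) construct posterior GP
        g = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(grid)))   # 2) draw sample g
        x_next = grid[np.argmax(g)]                                       # 3) choose x_t = argmax g
        y_next = f(x_next) + rng.normal(0, np.sqrt(noise_var))            # 4) noisy evaluation of f
        x_obs, y_obs = np.append(x_obs, x_next), np.append(y_obs, y_next)
    return x_obs, y_obs

grid = np.linspace(0.0, 1.0, 200)
xs, ys = seq_thompson_sampling(lambda x: np.sin(6 * x) * x, 20, grid)     # toy 1-D objective
print(xs[np.argmax(ys)])
```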

24-33 Thompson Sampling (TS) in GPs (Thompson, 1933). [figure sequence: TS iterations at t = 1, 2, 3, 4, 5, 6, 7, 14, 25; axes f(x) vs. x]

34-35 Some Theoretical Results for TS. Simple Regret: SR(n) = f(x*) − max_{t=1,...,n} f(x_t).
Theorem (Russo & van Roy 2014, Srinivas et al. 2010): For Thompson sampling, E[SR(n)] ≲ √( Ψ_n log(n) / n ), where ≲ ignores constants and Ψ_n is the Maximum Information Gain.

36-37 Some Theoretical Results for TS (continued). When X ⊂ R^d: SE (Gaussian) kernel: Ψ_n ≲ d^d log(n)^d. Matérn kernel: Ψ_n ≲ n^(1 − c/d²).
Several other results: (Agrawal et al 2012, Kaufmann et al 2012, Russo & van Roy 2016, Chowdhury & Gopalan 2017, and more...)
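
For reference, the maximum information gain Ψ_n appearing in these bounds is usually defined as in Srinivas et al. 2010; a sketch of that definition, with notation adapted to this talk (the Gaussian-noise expression for the mutual information is the standard GP case):

```latex
% Maximum information gain after n evaluations: the largest mutual information between f
% and noisy observations y_A over any set A of at most n points.
\Psi_n \;=\; \max_{A \subset \mathcal{X},\; |A| \le n} I\!\left(f;\, y_A\right),
\qquad
I\!\left(f;\, y_A\right) \;=\; \tfrac{1}{2}\,\log\det\!\left(\mathbf{I} + \sigma^{-2} K_A\right)
\quad \text{for a GP with Gaussian observation noise } \sigma^2,
```

where K_A is the kernel matrix of the points in A.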

38-43 Other methods for BO. Other criteria for selecting x_t: Upper Confidence Bounds (Srinivas et al. 2010), φ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}. [figure sequence: the UCB acquisition φ_t over the posterior GP and the chosen point x_t; axes f(x) vs. x]

44 Other methods for BO. Other criteria for selecting x_t: Upper Confidence Bounds (Srinivas et al. 2010), Expected Improvement (Jones et al. 1998), Probability of Improvement (Kushner et al. 1964), Entropy Search (Hernández-Lobato et al. 2014). All are deterministic methods: they choose the next point for evaluation by maximising a deterministic acquisition function, i.e. x_t = argmax_{x ∈ X} φ_t(x).

45 Other methods for BO. Other models for f: Neural Networks (Snoek et al. 2015), Random Forests (Hutter 2009).
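
A minimal sketch of the GP-UCB selection rule above, reusing the gp_posterior helper from the earlier sketch; the β_t schedule here is a common heuristic, not the theoretical choice from Srinivas et al.:

```python
import numpy as np

def ucb_choice(x_obs, y_obs, grid, t, noise_var=1e-3):
    """GP-UCB: maximise phi_t(x) = mu_{t-1}(x) + beta_t^{1/2} * sigma_{t-1}(x) over the grid."""
    mu, cov = gp_posterior(x_obs, y_obs, grid, noise_var)
    sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
    beta_t = 2.0 * np.log(t + 1.0)             # heuristic confidence multiplier (assumption)
    phi = mu + np.sqrt(beta_t) * sigma
    return grid[np.argmax(phi)]
```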

46-49 Big picture: scaling up black-box optimisation.
- Optimising in high-dimensional spaces, e.g. tuning models with several hyper-parameters. Additive models for f lead to statistically and computationally tractable algorithms. (Kandasamy et al. ICML 2015)
- Multi-fidelity optimisation: what if we have cheap approximations to f? E.g. we wish to train an ML model with N data and T iterations, but use N′ < N data and T′ < T iterations to approximate the cross-validation performance at (N, T). (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)
- All of this extends beyond GPs.

50-51 This work: Parallel Evaluations (Kandasamy et al. Arxiv 2017)
Parallelisation with M workers: we can evaluate f at M different points at the same time, e.g. train M models with different hyper-parameter values in parallel. Inability to parallelise is a real bottleneck in practice!
Some desiderata:
- Statistically, achieve an M-fold improvement.
- Methodologically, be scalable to a very large number of workers: the method remains computationally tractable as M increases, and is conceptually simple, for robustness in practice.

52 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results 6. Open questions/challenges 11/31

53-54 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results: synts and asyts perform essentially the same as seqts in terms of the number of evaluations; when we factor in time as a resource, asyts outperforms synts and seqts... with some caveats. 6. Open questions/challenges

55-58 Parallel Evaluations: set up.
- Sequential evaluations with one worker: the j-th job has feedback from all previous j−1 jobs.
- Parallel evaluations with M workers (Asynchronous): the j-th job is missing feedback from exactly M−1 jobs.
- Parallel evaluations with M workers (Synchronous): the j-th job is missing feedback from M−1 jobs.

59-60 Simple Regret in Parallel Settings (Kandasamy et al. Arxiv 2017)
Simple regret after n evaluations: SR(n) = f(x*) − max_{t=1,...,n} f(x_t), where n is the number of completed evaluations by all M workers.
Simple regret with time as a resource: SR′(T) = f(x*) − max_{t=1,...,N} f(x_t), where N is the (possibly random) number of completed evaluations by all M workers within time T. [figure: asynchronous vs. synchronous worker timelines]
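
Computing SR′(T) from a log of timestamped completions follows the definition above directly; a small sketch (returning an infinite value for an empty log is only a convention for this sketch):

```python
def simple_regret_with_time(f_star, completions, T):
    """SR'(T): simple regret over the evaluations that all M workers completed by wall-clock time T.
    completions: iterable of (finish_time, f_value) pairs."""
    values = [fv for ft, fv in completions if ft <= T]
    if not values:
        return float("inf")          # no evaluation has completed yet (convention for this sketch)
    return f_star - max(values)
```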

61 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results synts and asyts perform essentially the same as seqts in terms of the number of evaluations. When we factor time as a resource, asyts outperforms synts and seqts.... with some caveats 6. Open questions/challenges 13/31

62-65 Prior work in Parallel BO: (Ginsbourger et al. 2011), (Janusevkis et al. 2012), (Contal et al. 2013), (Desautels et al. 2014), (Gonzalez et al. 2015), (Shah & Ghahramani 2015), (Wang et al. 2016), (Kathuria et al. 2016), (Wu & Frazier 2017), (Wang et al. 2017), (Kandasamy et al. Arxiv 2017).
[table comparing these methods on: Asynchronicity, Theoretical guarantees, Conceptual simplicity*]
* straightforward extension of the sequential algorithm works.

66-70 Why are deterministic algorithms not simple? Need to encourage diversity in parallel evaluations.
Direct application of GP-UCB in the synchronous setting, with φ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}:
- First worker: maximise the acquisition, x_{t1} = argmax φ_t(x).
- Second worker: the acquisition is the same! x_{t2} = x_{t1}.
- Hence x_{t1} = x_{t2} = ... = x_{tM}.
Direct application of the sequential algorithm does not work. Need to encourage diversity.

71-76 Why are deterministic algorithms not simple? Need to encourage diversity in parallel evaluations. Workarounds in prior work:
- Add hallucinated observations: pretend each pending point has already been observed at some plausible value f̂, so the acquisition is pushed away from it. [figure: posterior with a hallucinated observation f̂; axes f(x) vs. x]
- Optimise an acquisition over X^M.
- Resort to heuristics; these typically require additional hyper-parameters and/or computational routines.

77 Take-home message: Straightforward application of the sequential algorithm works for TS. Its inherent randomness takes care of the exploration vs. exploitation trade-off when managing M workers.
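
For contrast with TS, a minimal sketch of the hallucination workaround above in the "kriging believer" style (pending points hallucinated at their posterior mean); it reuses the gp_posterior helper from the earlier sketch and is only one of several variants in the literature, not the method advocated in this talk:

```python
import numpy as np

def batch_ucb_with_hallucination(x_obs, y_obs, grid, t, batch_size, noise_var=1e-3):
    """Select a synchronous batch: after each pick, pretend it was observed at its posterior mean
    so that the next acquisition maximum moves elsewhere."""
    x_h, y_h = np.asarray(x_obs, float), np.asarray(y_obs, float)
    batch = []
    for _ in range(batch_size):
        mu, cov = gp_posterior(x_h, y_h, grid, noise_var)
        sigma = np.sqrt(np.maximum(np.diag(cov), 0.0))
        phi = mu + np.sqrt(2.0 * np.log(t + 1.0)) * sigma      # UCB acquisition (heuristic beta_t)
        idx = int(np.argmax(phi))
        batch.append(grid[idx])
        x_h = np.append(x_h, grid[idx])                        # hallucinate (x, mu(x)) as an observation
        y_h = np.append(y_h, mu[idx])
    return batch
```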

78-80 Parallel Thompson Sampling (Kandasamy et al. Arxiv 2017)
Asynchronous, asyts. At any given time:
1. (x, y) ← Wait for a worker to finish.
2. Compute the posterior GP.
3. Draw a sample g ~ GP.
4. Re-deploy the worker at argmax g.
Synchronous, synts. At any given time:
1. {(x_m, y_m)}_{m=1..M} ← Wait for all workers to finish.
2. Compute the posterior GP.
3. Draw M samples g_m ~ GP, for each m.
4. Re-deploy worker m at argmax g_m, for each m.
Variants in prior work: (Osband et al. 2016, Israelsen et al. 2016, Hernandez-Lobato et al. 2017)
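
A simplified sketch of the two loops above, using a thread pool as the M "workers", the grid-based posterior sampling from the earlier sketches (thompson_point calls gp_posterior), and a couple of initial observations. The actual implementation lives in the repository linked at the end of the talk; everything below, including the worker abstraction, is an illustrative assumption:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def thompson_point(x_obs, y_obs, grid, rng, noise_var=1e-3):
    """Draw g ~ posterior GP on the grid (gp_posterior from the earlier sketch) and return argmax g."""
    mu, cov = gp_posterior(x_obs, y_obs, grid, noise_var)
    g = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(len(grid)))
    return grid[np.argmax(g)]

def _evaluate(f, x):
    return x, f(x)                                   # tag the result with the point it belongs to

def asy_ts(f, grid, x_init, y_init, n_evals, n_workers, seed=0, noise_var=1e-3):
    """asyts: whenever a worker finishes, update the posterior, draw one sample, re-deploy at its argmax."""
    rng = np.random.default_rng(seed)
    x_obs, y_obs = np.asarray(x_init, float), np.asarray(y_init, float)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pending = {pool.submit(_evaluate, f, thompson_point(x_obs, y_obs, grid, rng, noise_var))
                   for _ in range(n_workers)}
        done_count = 0
        while done_count < n_evals:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:                                                  # 1) collect (x, y)
                x_fin, y_fin = fut.result()
                x_obs, y_obs = np.append(x_obs, x_fin), np.append(y_obs, y_fin)
                done_count += 1
                x_next = thompson_point(x_obs, y_obs, grid, rng, noise_var)   # 2)-3) posterior + sample
                pending.add(pool.submit(_evaluate, f, x_next))                # 4) re-deploy this worker
    return x_obs, y_obs

def syn_ts(f, grid, x_init, y_init, n_rounds, n_workers, seed=0, noise_var=1e-3):
    """synts: each round, draw M samples, dispatch all M workers, wait for all of them to finish."""
    rng = np.random.default_rng(seed)
    x_obs, y_obs = np.asarray(x_init, float), np.asarray(y_init, float)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_rounds):
            batch = [thompson_point(x_obs, y_obs, grid, rng, noise_var) for _ in range(n_workers)]
            for x_fin, y_fin in pool.map(lambda x: (x, f(x)), batch):         # wait for the whole batch
                x_obs, y_obs = np.append(x_obs, x_fin), np.append(y_obs, y_fin)
    return x_obs, y_obs
```

The only difference between the two is when the posterior is recomputed: after every single completion (asyts) versus once per round of M completions (synts).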

81 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results synts and asyts perform essentially the same as seqts in terms of the number of evaluations. When we factor time as a resource, asyts outperforms synts and seqts.... with some caveats 6. Open questions/challenges 18/31

82 Experiment: Park1-4D, M = 10. Comparison in terms of the number of evaluations. [plot; methods: asyts, seqts, synts]

83-85 Experiment: Branin-2D, M = 4. Evaluation time sampled from a uniform distribution. [plot; methods: synrand, synhucb, synucbpe, synts, asyrand, asyucb, asyhucb, asyei, asyhts, asyts]

86 Experiment: Hartmann-6D, M = 12. Evaluation time sampled from a half-normal distribution. [plot; same methods as above]

87 Experiment: Hartmann-18D, M = 25. Evaluation time sampled from an exponential distribution. [plot; same methods as above]

88 Experiment: Currin-Exponential-14D, M = 35. Evaluation time sampled from a Pareto-3 distribution. [plot; same methods as above]

89 Experiment: Model Selection on Cifar10, M = 4. Tune the # of filters in the range (32, 256) for each layer of a 6-layer CNN. Time taken for an evaluation: 4-16 minutes. [plot; methods: asyts, asyei, asyhucb, asyrand, synts, synhucb]

90 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results synts and asyts perform essentially the same as seqts in terms of the number of evaluations. When we factor time as a resource, asyts outperforms synts and seqts.... with some caveats. 6. Open questions/challenges 24/31

91-92 Bounds for SR(n), synts.
seqts (Russo & van Roy 2014): E[SR(n)] ≲ √( Ψ_n log(n) / n ), where Ψ_n is the maximum information gain.
Theorem, synts (Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ √( M log(M) / n ) + √( Ψ_{n+M} log(n+M) / n ). The leading constant is also the same.

93-97 Bounds for SR(n), asyts.
seqts (Russo & van Roy 2014): E[SR(n)] ≲ √( Ψ_n log(n) / n ).
Theorem, asyts (Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ √( ξ_M Ψ_n log(n) / n ), where ξ_M = sup_{n ≥ 1, D_n} max_{A ⊂ X, |A| ≤ M} exp( I(f; A | D_n) ).
Theorem (Krause et al. 2008, Desautels et al. 2012): There exists an asynchronously parallelisable initialisation scheme requiring O(M polylog(M)) evaluations of f such that ξ_M ≤ C.
Theorem, asyts, arbitrary X (Kandasamy et al. Arxiv 2017): with this initialisation, E[SR(n)] ≲ √( M polylog(M) / n ) + √( C Ψ_n log(n) / n ).
* We do not believe this initialisation is necessary.

98-99 Bounds for asyts without the initialisation scheme.
Theorem, synts, arbitrary X (Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ √( M log(M) / n ) + √( Ψ_{n+M} log(n+M) / n ).
Theorem, asyts, X ⊂ R^d (ongoing work): E[SR(n)] ≲ ... + M log(n) / n^(1/O(d)).

100-102 Theoretical Results for SR′(T). Model the evaluation time as an independent random variable:
- Uniform unif(a, b): bounded
- Half-normal HN(τ²): sub-Gaussian
- Exponential exp(λ): sub-exponential
Theorem (informal): If evaluation times are all the same, synts ≈ asyts. Otherwise, the bounds for asyts are better than those for synts; the more variable the evaluation times, the bigger the difference.
- Uniform: constant factor
- Half-normal: log(M) factor
- Exponential: log(M) factor
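
A small simulation sketch of the mechanism behind this result: synchronous execution waits for the slowest of M workers in every round, while asynchronous execution keeps each worker busy. The distributions match the three cases listed above; the parameters and time budget are illustrative assumptions:

```python
import numpy as np

def evals_by_time(sampler, M, T, mode, rng):
    """Number of evaluations M workers complete by wall-clock time T."""
    if mode == "asy":                                # each worker runs jobs back-to-back, independently
        total = 0
        for _ in range(M):
            t, n = 0.0, 0
            while True:
                t += sampler(rng)
                if t > T:
                    break
                n += 1
            total += n
        return total
    t, total = 0.0, 0                                # "syn": every round waits for the slowest of M workers
    while True:
        t += np.max(sampler(rng, size=M))
        if t > T:
            return total
        total += M

rng = np.random.default_rng(0)
samplers = {"uniform":     lambda r, size=None: r.uniform(0.5, 1.5, size),
            "half-normal": lambda r, size=None: np.abs(r.normal(0.0, 1.0, size)),
            "exponential": lambda r, size=None: r.exponential(1.0, size)}
for name, s in samplers.items():
    print(name, "asy:", evals_by_time(s, M=20, T=100.0, mode="asy", rng=rng),
          "syn:", evals_by_time(s, M=20, T=100.0, mode="syn", rng=rng))
```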

103 Outline (Kandasamy et al. Arxiv 2017) 1. Set up & definitions 2. Prior work & challenges 3. Algorithms synts, asyts: direct application of TS to synchronous and asynchronous parallel settings 4. Experiments 5. Theoretical Results synts and asyts perform essentially the same as seqts in terms of the number of evaluations. When we factor time as a resource, asyts outperforms synts and seqts.... with some caveats 6. Open questions/challenges 28/31

104-105 Open Challenges for Parallelised TS
1. Bounds for asynchronous TS without initialisation.
2. Other models for evaluation times, e.g. the evaluation time depends on x ∈ X.
3. In the asynchronous setting: Should you wait for another job to finish without immediately re-deploying? Do you kill an on-going job depending on the result of a completed job?

106-107 Open Challenges for Parallelised TS
4. Optimising the sample when X = [0, 1]^d: x_t = argmax_{x ∈ X} g(x), where g ~ posterior GP. This is global optimisation of a non-convex function, a common challenge in most BO methods.
But additionally for TS: as g is not deterministic, either draw samples of g at a fixed set of points and pick the maximum, or, if using an adaptive method, the cost scales as O((N + S)³), where N is the # of evaluations of f and S the # of evaluations of g.
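
A minimal sketch of the fixed-set strategy just described: draw g jointly at a finite candidate set and return its maximiser. It mirrors the earlier 1-D gp_posterior helper but is written for d-dimensional inputs; the uniform random candidate design, kernel, and lengthscale are illustrative assumptions:

```python
import numpy as np

def sq_exp_kernel_nd(A, B, ell=0.2):
    """SE kernel for d-dimensional inputs; rows of A and B are points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def maximise_posterior_sample(x_obs, y_obs, n_candidates, d, rng, noise_var=1e-3):
    """Approximately solve x_t = argmax_x g(x), g ~ posterior GP, over X = [0, 1]^d:
    sample g jointly at a fixed random candidate set and pick the maximiser."""
    cand = rng.uniform(0.0, 1.0, size=(n_candidates, d))
    K = sq_exp_kernel_nd(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    Ks = sq_exp_kernel_nd(x_obs, cand)
    Kss = sq_exp_kernel_nd(cand, cand)
    mu = Ks.T @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    g = rng.multivariate_normal(mu, cov + 1e-6 * np.eye(n_candidates))
    return cand[np.argmax(g)]
```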

108-110 Summary
synts, asyts: direct application of TS to the synchronous and asynchronous parallel settings.
Take-aways, theory: both perform essentially the same as seqts in terms of the number of evaluations; when we factor in time as a resource, asyts performs best.
Take-aways, practice: conceptually simple, and scales better with the number of workers than other methods.

111 Collaborators: Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos. Code: github.com/kirthevasank/gp-parallel-ts Slides: Thank you.

arXiv v1 [stat.ml], 25 May 2017: Asynchronous Parallel Bayesian Optimisation via Thompson Sampling. Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos. Carnegie Mellon University; University of Massachusetts, Amherst.
