Parallelised Bayesian Optimisation via Thompson Sampling
1 Parallelised Bayesian Optimisation via Thompson Sampling
Kirthevasan Kandasamy, Carnegie Mellon University.
Google Research, Mountain View, CA. Sep 27, 2017. Slides:

2 Slides are up on my website: kkandasa
3 Black-box Optimisation
E.g. tuning the hyper-parameters of a neural network: train the NN using the given hyper-parameters, then compute accuracy on a validation set. The black-box output is the cross-validation accuracy.

4-5 Black-box Optimisation
An expensive black-box function. Other examples: ML estimation in astrophysics; pre-clinical drug discovery; finding an optimal policy in autonomous driving.
6-7 Black-box Optimisation
f : X → R is an expensive black-box function, accessible only via noisy evaluations. [Plot: f(x) against x.]

8-11 Black-box Optimisation
Let x* = argmax_x f(x).
Simple Regret after n evaluations: SR(n) = f(x*) − max_{t=1,...,n} f(x_t).
Cumulative Regret after n evaluations: CR(n) = Σ_{t=1}^{n} ( f(x*) − f(x_t) ).
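To make the two regret notions concrete, here is a minimal sketch with a toy one-dimensional objective standing in for the expensive black-box f; the function and the query points are made up for illustration.

```python
# Toy objective standing in for the expensive black-box f;
# it is maximised at x* = 0.3.
def f(x):
    return -(x - 0.3) ** 2

x_star = 0.3
queries = [0.0, 0.5, 0.25, 0.31]   # hypothetical evaluation points x_1..x_4

# Simple regret: gap between f(x*) and the best query made so far.
simple_regret = f(x_star) - max(f(x) for x in queries)

# Cumulative regret: sum of the per-step gaps.
cumulative_regret = sum(f(x_star) - f(x) for x in queries)

print(round(simple_regret, 6))      # 0.0001  (one query, x=0.31, is near x*)
print(round(cumulative_regret, 4))  # 0.1326
```

Note that simple regret only rewards the single best evaluation, while cumulative regret penalises every exploratory query.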
12 A walk-through
- Bayesian Optimisation (BO) with Gaussian Processes
- A review of Gaussian Processes (GPs)
- Thompson Sampling (TS): an algorithm for BO
- Other methods and models for BO
13-17 Gaussian Processes (GP)
GP(µ, κ): a distribution over functions from X to R. [Plots: a prior GP before any observations; observations; the posterior GP given the observations.]

18 Gaussian Processes (GP)
A GP is completely characterised by its mean function µ : X → R and covariance kernel κ : X × X → R. After t observations, f(x) ~ N(µ_t(x), σ_t²(x)).
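The posterior quantities µ_t and σ_t² can be sketched in a few lines. This is generic GP regression with a squared-exponential kernel, not code from the talk, and the data, bandwidth, and noise level are illustrative.

```python
import numpy as np

# Squared-exponential (SE) kernel; bandwidth is an illustrative value.
def se_kernel(A, B, bandwidth=0.2):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-3):
    """Posterior mean mu_t(x) and variance sigma_t^2(x) at X_query."""
    K = se_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    k_star = se_kernel(X_query, X_obs)
    mu = k_star @ np.linalg.solve(K, y_obs)
    # k(x, x) = 1 for the SE kernel, so the prior variance is 1.
    var = 1.0 - np.einsum('ij,ji->i', k_star, np.linalg.solve(K, k_star.T))
    return mu, var

X_obs = np.array([0.1, 0.4, 0.8])
y_obs = np.array([0.0, 0.5, -0.2])
mu, var = gp_posterior(X_obs, y_obs, X_obs)
# At the observed points the posterior mean nearly interpolates the data,
# and the posterior variance shrinks towards the noise level.
```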
19-23 Gaussian Process (Bayesian) Optimisation
Model f ~ GP(0, κ). Thompson Sampling (TS) (Thompson, 1933):
1) Construct the posterior GP.
2) Draw a sample g from the posterior.
3) Choose x_t = argmax_x g(x).
4) Evaluate f at x_t.
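Steps 1-3 can be sketched on a discrete grid, where a joint draw of g at the grid points stands in for a sample from the posterior GP. The kernel, data, and jitter constant are illustrative, not the talk's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_kernel(A, B, bandwidth=0.2):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def ts_step(X_obs, y_obs, grid, noise=1e-3):
    # 1) Construct the posterior GP at the grid points.
    K = se_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    k_star = se_kernel(grid, X_obs)
    mu = k_star @ np.linalg.solve(K, y_obs)
    cov = se_kernel(grid, grid) - k_star @ np.linalg.solve(K, k_star.T)
    # 2) Draw one joint sample g of the posterior at the grid points
    #    (small jitter keeps the covariance numerically PSD).
    g = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(grid)))
    # 3) Choose x_t = argmax_x g(x); the caller would then 4) evaluate f there.
    return grid[np.argmax(g)]

grid = np.linspace(0.0, 1.0, 101)
x_t = ts_step(np.array([0.2, 0.7]), np.array([0.1, 0.6]), grid)
```

Because g is a fresh random draw each round, repeated calls naturally spread queries between promising and uncertain regions.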
24-33 Thompson Sampling (TS) in GPs (Thompson, 1933)
[Animation: the posterior GP and the TS query point at t = 1, 2, ..., 7, 14, 25.]
34-37 Some Theoretical Results for TS
Simple Regret: SR(n) = f(x*) − max_{t=1,...,n} f(x_t).

Theorem (Russo & van Roy 2014, Srinivas et al. 2010): For Thompson sampling, ignoring constants,
E[SR(n)] ≲ sqrt( Ψ_n log(n) / n ),
where Ψ_n is the Maximum Information Gain.

When X ⊂ R^d:
- SE (Gaussian) kernel: Ψ_n ≲ d^d log(n)^d.
- Matérn kernel: Ψ_n ≲ n^{1−c_d}, for a constant c_d depending on the dimension and the kernel smoothness.

Several other results: (Agrawal et al. 2012, Kaufmann et al. 2012, Russo & van Roy 2016, Chowdhury & Gopalan 2017, and more...)
38-44 Other methods for BO
Other criteria for selecting x_t:
- Upper Confidence Bounds (Srinivas et al. 2010): maximise φ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}.
- Expected Improvement (Jones et al. 1998)
- Probability of Improvement (Kushner 1964)
- Entropy Search (Hernández-Lobato et al. 2014)
These are all deterministic methods: each chooses the next point for evaluation by maximising a deterministic acquisition function, i.e. x_t = argmax_{x ∈ X} φ_t(x).

45 Other methods for BO
Other models for f: neural networks (Snoek et al. 2015), random forests (Hutter 2009).
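As a toy illustration of the UCB rule φ_t = µ_{t−1} + β_t^{1/2} σ_{t−1} on a discretised domain: here β is fixed to an illustrative constant rather than the schedule of Srinivas et al., and the posterior means and standard deviations are made up.

```python
import numpy as np

def ucb_choice(mu, sigma, beta=2.0):
    """Pick the index maximising phi = mu + sqrt(beta) * sigma."""
    phi = mu + np.sqrt(beta) * sigma
    return int(np.argmax(phi))

mu = np.array([0.2, 0.5, 0.4])      # posterior means mu_{t-1}
sigma = np.array([0.1, 0.05, 0.6])  # posterior std devs sigma_{t-1}
# The high-uncertainty point wins despite not having the highest mean:
print(ucb_choice(mu, sigma))  # -> 2
```

The σ term is what drives exploration; with β = 0 the rule degenerates to greedily exploiting the posterior mean.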
46-49 Big picture: scaling up black-box optimisation
Optimising in high-dimensional spaces, e.g. tuning models with several hyper-parameters: additive models for f lead to statistically and computationally tractable algorithms. (Kandasamy et al. ICML 2015)
Multi-fidelity optimisation: what if we have cheap approximations to f? E.g. when training an ML model with N data points and T iterations, use N' < N data points and T' < T iterations to approximate the cross-validation performance at (N, T). (Kandasamy et al. NIPS 2016a&b, Kandasamy et al. ICML 2017)
These ideas extend beyond GPs.
50-51 This work: Parallel Evaluations (Kandasamy et al. Arxiv 2017)
Parallelisation with M workers: we can evaluate f at M different points at the same time, e.g. train M models with different hyper-parameter values in parallel. The inability to parallelise is a real bottleneck in practice!
Some desiderata:
- Statistically, achieve an M-fold improvement.
- Methodologically, be scalable to a very large number of workers: the method should remain computationally tractable as M increases, and be conceptually simple, for robustness in practice.
52-54 Outline (Kandasamy et al. Arxiv 2017)
1. Set-up & definitions
2. Prior work & challenges
3. Algorithms synTS, asyTS: direct application of TS to the synchronous and asynchronous parallel settings
4. Experiments
5. Theoretical results: synTS and asyTS perform essentially the same as seqTS in terms of the number of evaluations; when we factor in time as a resource, asyTS outperforms synTS and seqTS... with some caveats
6. Open questions/challenges
55-58 Parallel Evaluations: set-up
- Sequential evaluations with one worker: the j-th job has feedback from all previous j−1 jobs.
- Parallel evaluations with M workers (asynchronous): the j-th job is missing feedback from exactly M−1 jobs.
- Parallel evaluations with M workers (synchronous): the j-th job is missing feedback from up to M−1 jobs.
59-60 Simple Regret in Parallel Settings (Kandasamy et al. Arxiv 2017)
Simple regret after n evaluations: SR(n) = f(x*) − max_{t=1,...,n} f(x_t), where n is the number of evaluations completed by all M workers.
Simple regret with time as a resource: SR'(T) = f(x*) − max_{t=1,...,N} f(x_t), where N is the (possibly random) number of evaluations completed by all M workers within time T. [Diagrams: asynchronous and synchronous execution timelines.]
62-65 Prior work in Parallel BO
(Ginsbourger et al. 2011), (Janusevskis et al. 2012), (Contal et al. 2013), (Desautels et al. 2014), (González et al. 2015), (Shah & Ghahramani 2015), (Wang et al. 2016), (Kathuria et al. 2016), (Wu & Frazier 2017), (Wang et al. 2017), (Kandasamy et al. Arxiv 2017).
Desired properties: asynchronicity, theoretical guarantees, and conceptual simplicity* (* a straightforward extension of the sequential algorithm works).
66-70 Why are deterministic algorithms not simple? We need to encourage diversity in parallel evaluations.
Consider a direct application of GP-UCB in the synchronous setting, with acquisition φ_t = µ_{t−1} + β_t^{1/2} σ_{t−1}:
- First worker: maximise the acquisition, x_{t1} = argmax_x φ_t(x).
- Second worker: the acquisition is the same, so x_{t2} = x_{t1}.
- Hence x_{t1} = x_{t2} = ... = x_{tM}.
A direct application of the sequential algorithm does not work; we need to encourage diversity.
71-77 Why are deterministic algorithms not simple? We need to encourage diversity in parallel evaluations.
Existing work-arounds:
- Add hallucinated observations f̂ at points currently under evaluation.
- Optimise an acquisition over X^M.
- Resort to heuristics, which typically require additional hyper-parameters and/or computational routines.
Take-home message: a straightforward application of the sequential algorithm works for TS. Its inherent randomness takes care of the exploration vs. exploitation trade-off when managing M workers.
78-80 Parallel Thompson Sampling (Kandasamy et al. Arxiv 2017)
Asynchronous (asyTS). At any given time:
1. (x, y) ← wait for a worker to finish.
2. Compute the posterior GP.
3. Draw a sample g ~ GP.
4. Re-deploy the worker at argmax g.
Synchronous (synTS). At any given time:
1. {(x_m, y_m)}_{m=1,...,M} ← wait for all workers to finish.
2. Compute the posterior GP.
3. Draw M samples g_m ~ GP, for m = 1, ..., M.
4. Re-deploy worker m at argmax g_m, for each m.
Variants in prior work: (Osband et al. 2016, Israelsen et al. 2016, Hernández-Lobato et al. 2017)
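The control flow of asyTS can be sketched with a simulated worker pool. Here `draw_sample_argmax` is a hypothetical placeholder for steps 2-4 (fit the posterior, sample g, maximise it), and evaluation times are random stand-ins; only the scheduling logic is the point.

```python
import heapq
import random

random.seed(0)

def draw_sample_argmax(history):
    # Hypothetical placeholder for steps 2-4: fit the posterior GP on
    # `history`, draw a sample g, and return argmax g. Here it simply
    # returns a random point in [0, 1].
    return random.random()

def asy_ts(M=3, n_evals=10):
    history = []
    # Each in-flight job: (finish_time, worker_id, point being evaluated).
    jobs = [(random.random(), m, random.random()) for m in range(M)]
    heapq.heapify(jobs)
    clock = 0.0
    while len(history) < n_evals:
        t, m, x = heapq.heappop(jobs)   # 1) wait for ONE worker to finish
        clock = t
        history.append(x)               # record the finished point (y omitted here)
        x_next = draw_sample_argmax(history)
        # 4) re-deploy only that worker; the others keep running.
        heapq.heappush(jobs, (clock + random.random(), m, x_next))
    return clock, history

clock, history = asy_ts()
```

synTS would instead pop all M jobs, update once, and push M fresh jobs; the asynchronous variant never idles a worker while others finish.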
82 Experiment: Park1-4D, M = 10. Comparison in terms of the number of evaluations. [Plot: simple regret vs number of evaluations for asyTS, seqTS, synTS.]

83-85 Experiment: Branin-2D, M = 4. Evaluation time sampled from a uniform distribution. [Plot: simple regret vs time for synRAND, synHUCB, synUCBPE, synTS, asyRAND, asyUCB, asyHUCB, asyEI, asyHTS, asyTS.]

86 Experiment: Hartmann-6D, M = 12. Evaluation time sampled from a half-normal distribution. [Plot: same methods.]

87 Experiment: Hartmann-18D, M = 25. Evaluation time sampled from an exponential distribution. [Plot: same methods.]

88 Experiment: Currin-Exponential-14D, M = 35. Evaluation time sampled from a Pareto-3 distribution. [Plot: same methods.]

89 Experiment: Model selection on Cifar10, M = 4. Tune the number of filters, in the range (32, 256), for each layer of a 6-layer CNN. Time taken for one evaluation: 4-16 minutes. [Plot: validation accuracy vs time for asyTS, asyEI, asyHUCB, asyRAND, synTS, synHUCB.]
91-92 Bounds for SR(n): synTS vs seqTS
seqTS (Russo & van Roy 2014): E[SR(n)] ≲ sqrt( Ψ_n log(n) / n ), where Ψ_n is the maximum information gain.
Theorem (synTS; Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ M log(M) / n + sqrt( Ψ_{n+M} log(n+M) / n ).
The leading constant is also the same.
93-97 Bounds for SR(n): asyTS vs seqTS
seqTS (Russo & van Roy 2014): E[SR(n)] ≲ sqrt( Ψ_n log(n) / n ).
Theorem (asyTS; Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ sqrt( ξ_M Ψ_n log(n) / n ), where ξ_M = sup_{D_n, n ≥ 1} max_{A ⊂ X, |A| ≤ M} e^{I(f; A | D_n)}.
Theorem (Krause et al. 2008, Desautels et al. 2012): there exists an asynchronously parallelisable initialisation scheme, requiring O(M polylog(M)) evaluations of f, such that ξ_M ≤ C. Hence, for arbitrary X,
E[SR(n)] ≲ M polylog(M) / n + sqrt( C Ψ_n log(n) / n ).*
* We do not believe this initialisation is necessary.
98-99 Bounds for asyTS without the initialisation scheme
Theorem (synTS, arbitrary X; Kandasamy et al. Arxiv 2017): E[SR(n)] ≲ M log(M) / n + sqrt( Ψ_{n+M} log(n+M) / n ).
Theorem (asyTS, X ⊂ R^d; ongoing work): E[SR(n)] ≲ ... + M log(n) / n^{1/O(d)}.
100-102 Theoretical Results for SR'(T)
Model the evaluation time as an independent random variable:
- Uniform unif(a, b): bounded
- Half-normal HN(τ²): sub-Gaussian
- Exponential exp(λ): sub-exponential
Theorem (informal): if all evaluation times are equal, synTS ≈ asyTS. Otherwise, the bounds for asyTS are better than those for synTS; the more variability in the evaluation times, the bigger the difference:
- Uniform: constant factor
- Half-normal: √log(M) factor
- Exponential: log(M) factor
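A back-of-the-envelope simulation illustrates why variable evaluation times favour the asynchronous setting: a synchronous batch waits for its slowest worker, while asynchronous workers re-deploy immediately. The horizon, worker count, and exponential time model below are illustrative choices, not the paper's analysis.

```python
import random

random.seed(1)

def completed_by(T, M, draw, synchronous):
    """Count evaluations the M workers complete within time T."""
    if synchronous:
        done, t = 0, 0.0
        while True:
            batch = [draw() for _ in range(M)]
            t += max(batch)          # whole batch waits for the slowest worker
            if t > T:
                return done
            done += M
    done, times = 0, [0.0] * M       # asynchronous: one clock per worker
    while True:
        m = min(range(M), key=times.__getitem__)
        if times[m] > T:             # even the earliest-free worker is past T
            return done
        times[m] += draw()
        if times[m] <= T:
            done += 1

draw = lambda: random.expovariate(1.0)   # exponential evaluation times
sync_n = completed_by(100.0, 8, draw, synchronous=True)
asy_n = completed_by(100.0, 8, draw, synchronous=False)
# asy_n is markedly larger than sync_n: asynchronous workers waste no
# time waiting, so SR'(T) benefits from many more completed evaluations.
```

With heavier-tailed evaluation times the synchronous max-of-M penalty grows, matching the constant vs √log(M) vs log(M) ordering above.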
104-105 Open Challenges for Parallelised TS
1. Bounds for asynchronous TS without the initialisation scheme.
2. Other models for evaluation times, e.g. the evaluation time depends on x ∈ X.
3. In the asynchronous setting: should you wait for another job to finish without immediately re-deploying? Should you kill an ongoing job depending on the result of a completed job?
106-107 Open Challenges for Parallelised TS
4. Optimising the sample when X = [0, 1]^d: x_t = argmax_{x ∈ X} g(x), where g ~ posterior GP. This is global optimisation of a non-convex function, a common challenge in most BO methods. But additionally for TS, g is not a deterministic function we can write down: one either draws the sample at a fixed set of points and picks the maximum, or, with an adaptive optimiser, pays O((N + S)³) computation, where N is the number of evaluations of f and S the number of evaluations of g.
108-110 Summary
synTS, asyTS: direct application of TS to the synchronous and asynchronous parallel settings.
Take-aways (theory): both perform essentially the same as seqTS in terms of the number of evaluations; when we factor in time as a resource, asyTS performs best.
Take-aways (practice): conceptually simple, and scales better with the number of workers than other methods.
111 Acknowledgements: Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos.
Code: github.com/kirthevasank/gp-parallel-ts
Slides:
Thank you.
arxiv: v1 [stat.ml] 25 May 2017
Asynchronous Parallel Bayesian Optimisation via Thompson Sampling Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, Barnabás Póczos Carnegie Mellon University, University of Massachusetts, Amherst
.. Parallel Bayesian Global Optimization, with Application to Metrics Optimization at Yelp Jialei Wang 1 Peter Frazier 1 Scott Clark 2 Eric Liu 2 1 School of Operations Research & Information Engineering,
More informationDevelopment of Stochastic Artificial Neural Networks for Hydrological Prediction
Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental
More informationLA Support for Scalable Kernel Methods. David Bindel 29 Sep 2018
LA Support for Scalable Kernel Methods David Bindel 29 Sep 2018 Collaborators Kun Dong (Cornell CAM) David Eriksson (Cornell CAM) Jake Gardner (Cornell CS) Eric Lee (Cornell CS) Hannes Nickisch (Phillips
More informationMulti-Information Source Optimization
Multi-Information Source Optimization Matthias Poloczek, Jialei Wang, and Peter I. Frazier arxiv:1603.00389v2 [stat.ml] 15 Nov 2016 School of Operations Research and Information Engineering Cornell University
More informationBlack-box α-divergence Minimization
Black-box α-divergence Minimization José Miguel Hernández-Lobato, Yingzhen Li, Daniel Hernández-Lobato, Thang Bui, Richard Turner, Harvard University, University of Cambridge, Universidad Autónoma de Madrid.
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More informationarxiv: v2 [stat.ml] 16 Oct 2017
Correcting boundary over-exploration deficiencies in Bayesian optimization with virtual derivative sign observations arxiv:7.96v [stat.ml] 6 Oct 7 Eero Siivola, Aki Vehtari, Jarno Vanhatalo, Javier González,
More informationIntroduction to Gaussian Processes
Introduction to Gaussian Processes Iain Murray murray@cs.toronto.edu CSC255, Introduction to Machine Learning, Fall 28 Dept. Computer Science, University of Toronto The problem Learn scalar function of
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationarxiv: v4 [cs.lg] 22 Jul 2014
Learning to Optimize Via Information-Directed Sampling Daniel Russo and Benjamin Van Roy July 23, 2014 arxiv:1403.5556v4 cs.lg] 22 Jul 2014 Abstract We propose information-directed sampling a new algorithm
More informationAdaptive Bayesian Optimization for Dynamic Problems
Adaptive Bayesian Optimization for Dynamic Problems Favour Mandanji Nyikosa Linacre College University of Oxford A thesis submitted for the degree of Doctor of Philosophy Hilary 2018 To the memory of
More informationarxiv: v7 [cs.lg] 7 Jul 2017
Learning to Optimize Via Information-Directed Sampling Daniel Russo 1 and Benjamin Van Roy 2 1 Northwestern University, daniel.russo@kellogg.northwestern.edu 2 Stanford University, bvr@stanford.edu arxiv:1403.5556v7
More informationLearning Representations for Hyperparameter Transfer Learning
Learning Representations for Hyperparameter Transfer Learning Cédric Archambeau cedrica@amazon.com DALI 2018 Lanzarote, Spain My co-authors Rodolphe Jenatton Valerio Perrone Matthias Seeger Tuning deep
More informationRobustness in GANs and in Black-box Optimization
Robustness in GANs and in Black-box Optimization Stefanie Jegelka MIT CSAIL joint work with Zhi Xu, Chengtao Li, Ilija Bogunovic, Jonathan Scarlett and Volkan Cevher Robustness in ML noise Generator Critic
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationBandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University
Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University Outline Multi-Armed Bandits (MAB) Exploration-First Epsilon-Greedy Softmax UCB Thompson Sampling Adversarial Bandits Exp3
More informationarxiv: v1 [stat.ml] 24 Oct 2016
Truncated Variance Reduction: A Unified Approach to Bayesian Optimization and Level-Set Estimation arxiv:6.7379v [stat.ml] 4 Oct 6 Ilija Bogunovic, Jonathan Scarlett, Andreas Krause, Volkan Cevher Laboratory
More informationConstrained Bayesian Optimization and Applications
Constrained Bayesian Optimization and Applications The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Gelbart, Michael
More informationHypervolume-based Multi-objective Bayesian Optimization with Student-t Processes
Hypervolume-based Multi-objective Bayesian Optimization with Student-t Processes Joachim van der Herten Ivo Coucuyt Tom Dhaene IDLab Ghent University - imec igent Tower - Department of Electronics and
More informationDynamic Batch Bayesian Optimization
Dynamic Batch Bayesian Optimization Javad Azimi EECS, Oregon State University azimi@eecs.oregonstate.edu Ali Jalali ECE, University of Texas at Austin alij@mail.utexas.edu Xiaoli Fern EECS, Oregon State
More informationGaussian with mean ( µ ) and standard deviation ( σ)
Slide from Pieter Abbeel Gaussian with mean ( µ ) and standard deviation ( σ) 10/6/16 CSE-571: Robotics X ~ N( µ, σ ) Y ~ N( aµ + b, a σ ) Y = ax + b + + + + 1 1 1 1 1 1 1 1 1 1, ~ ) ( ) ( ), ( ~ ), (
More informationSystem identification and control with (deep) Gaussian processes. Andreas Damianou
System identification and control with (deep) Gaussian processes Andreas Damianou Department of Computer Science, University of Sheffield, UK MIT, 11 Feb. 2016 Outline Part 1: Introduction Part 2: Gaussian
More informationAnnealing-Pareto Multi-Objective Multi-Armed Bandit Algorithm
Annealing-Pareto Multi-Objective Multi-Armed Bandit Algorithm Saba Q. Yahyaa, Madalina M. Drugan and Bernard Manderick Vrije Universiteit Brussel, Department of Computer Science, Pleinlaan 2, 1050 Brussels,
More informationGaussian processes and bayesian optimization Stanisław Jastrzębski. kudkudak.github.io kudkudak
Gaussian processes and bayesian optimization Stanisław Jastrzębski kudkudak.github.io kudkudak Plan Goal: talk about modern hyperparameter optimization algorithms Bayes reminder: equivalent linear regression
More informationBandits : optimality in exponential families
Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities
More informationarxiv: v3 [stat.ml] 8 Jun 2015
Gaussian Process Optimization with Mutual Information arxiv:1311.485v3 [stat.ml] 8 Jun 15 Emile Contal 1, Vianney Perchet, and Nicolas Vayatis 1 1 CMLA, UMR CNRS 8536, ENS Cachan, France LPMA, Université
More informationParallel Bayesian Global Optimization of Expensive Functions
Parallel Bayesian Global Optimization of Expensive Functions Jialei Wang 1, Scott C. Clark 2, Eric Liu 3, and Peter I. Frazier 1 arxiv:1602.05149v3 [stat.ml] 1 Nov 2017 1 School of Operations Research
More informationAn Efficient Approach for Assessing Parameter Importance in Bayesian Optimization
An Efficient Approach for Assessing Parameter Importance in Bayesian Optimization Frank Hutter Freiburg University fh@informatik.uni-freiburg.de Holger H. Hoos and Kevin Leyton-Brown University of British
More informationCOMP 551 Applied Machine Learning Lecture 20: Gaussian processes
COMP 55 Applied Machine Learning Lecture 2: Gaussian processes Instructor: Ryan Lowe (ryan.lowe@cs.mcgill.ca) Slides mostly by: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~hvanho2/comp55
More informationActive and Semi-supervised Kernel Classification
Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),
More informationStochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos
1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and
More informationPILCO: A Model-Based and Data-Efficient Approach to Policy Search
PILCO: A Model-Based and Data-Efficient Approach to Policy Search (M.P. Deisenroth and C.E. Rasmussen) CSC2541 November 4, 2016 PILCO Graphical Model PILCO Probabilistic Inference for Learning COntrol
More informationMulti-Objective Stochastic Optimization of Magnetic Fields. David S Bindel, Cornell University
Multi-Objective Stochastic Optimization of Magnetic Fields David S Bindel, Cornell University Current State of the Art (STELLOPT) Optimizer Calculate! " (physics + engineering targets) Adjust plasma boundary
More informationLecture 5: Regret Bounds for Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/2/6 Lecture 5: Regret Bounds for Thompson Sampling Instructor: Alex Slivkins Scribed by: Yancy Liao Regret Bounds for Thompson Sampling For each round t, we defined
More informationActive Area Search via Bayesian Quadrature
Yifei Ma Roman Garnett Jeff Schneider Knowledge Discovery Department University of Bonn rgarnett@uni-bonn.de Machine Learning Department Carnegie Mellon University yifeim@cs.cmu.edu Robotics Institute
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory
More informationNeutron inverse kinetics via Gaussian Processes
Neutron inverse kinetics via Gaussian Processes P. Picca Politecnico di Torino, Torino, Italy R. Furfaro University of Arizona, Tucson, Arizona Outline Introduction Review of inverse kinetics techniques
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory
More informationMulti-Armed Bandits. Credit: David Silver. Google DeepMind. Presenter: Tianlu Wang
Multi-Armed Bandits Credit: David Silver Google DeepMind Presenter: Tianlu Wang Credit: David Silver (DeepMind) Multi-Armed Bandits Presenter: Tianlu Wang 1 / 27 Outline 1 Introduction Exploration vs.
More informationModel-Based Reinforcement Learning with Continuous States and Actions
Marc P. Deisenroth, Carl E. Rasmussen, and Jan Peters: Model-Based Reinforcement Learning with Continuous States and Actions in Proceedings of the 16th European Symposium on Artificial Neural Networks
More informationExploration. 2015/10/12 John Schulman
Exploration 2015/10/12 John Schulman What is the exploration problem? Given a long-lived agent (or long-running learning algorithm), how to balance exploration and exploitation to maximize long-term rewards
More informationGlobal Optimisation with Gaussian Processes. Michael A. Osborne Machine Learning Research Group Department o Engineering Science University o Oxford
Global Optimisation with Gaussian Processes Michael A. Osborne Machine Learning Research Group Department o Engineering Science University o Oxford Global optimisation considers objective functions that
More informationModelling and Control of Nonlinear Systems using Gaussian Processes with Partial Model Information
5st IEEE Conference on Decision and Control December 0-3, 202 Maui, Hawaii, USA Modelling and Control of Nonlinear Systems using Gaussian Processes with Partial Model Information Joseph Hall, Carl Rasmussen
More informationPredictive Entropy Search for Efficient Global Optimization of Black-box Functions
Predictive Entropy Search for Efficient Global Optimization of Black-bo Functions José Miguel Hernández-Lobato jmh233@cam.ac.uk University of Cambridge Matthew W. Hoffman mwh3@cam.ac.uk University of Cambridge
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Gaussian Processes Barnabás Póczos http://www.gaussianprocess.org/ 2 Some of these slides in the intro are taken from D. Lizotte, R. Parr, C. Guesterin
More informationBits of Machine Learning Part 2: Unsupervised Learning
Bits of Machine Learning Part 2: Unsupervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification
More informationReinforcement Learning
Reinforcement Learning Lecture 3: RL problems, sample complexity and regret Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce the
More informationBayesian Optimization in a Billion Dimensions via Random Embeddings
via Random Embeddings Ziyu Wang University of British Columbia Masrour Zoghi University of Amsterdam David Matheson University of British Columbia Frank Hutter University of British Columbia Nando de Freitas
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationarxiv: v1 [stat.ml] 26 Feb 2018
DEEP BAYESIAN BANDITS SHOWDOWN AN EMPIRICAL COMPARISON OF BAYESIAN DEEP NETWORKS FOR THOMPSON SAMPLING Carlos Riquelme Google Brain rikel@google.com George Tucker Google Brain gjt@google.com Jasper Snoek
More informationStratégies bayésiennes et fréquentistes dans un modèle de bandit
Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016
More informationICML Scalable Bayesian Inference on Point processes. with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts
ICML 2015 Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes Machine Learning Research Group and Oxford-Man Institute University of Oxford July 8, 2015 Point Processes
More informationImproved Bayesian Compression
Improved Bayesian Compression Marco Federici University of Amsterdam marco.federici@student.uva.nl Karen Ullrich University of Amsterdam karen.ullrich@uva.nl Max Welling University of Amsterdam Canadian
More informationGaussian Processes: We demand rigorously defined areas of uncertainty and doubt
Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt ACS Spring National Meeting. COMP, March 16 th 2016 Matthew Segall, Peter Hunt, Ed Champness matt.segall@optibrium.com Optibrium,
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationImprecise Filtering for Spacecraft Navigation
Imprecise Filtering for Spacecraft Navigation Tathagata Basu Cristian Greco Thomas Krak Durham University Strathclyde University Ghent University Filtering for Spacecraft Navigation The General Problem
More informationGaussian Process Regression Networks
Gaussian Process Regression Networks Andrew Gordon Wilson agw38@camacuk mlgengcamacuk/andrew University of Cambridge Joint work with David A Knowles and Zoubin Ghahramani June 27, 2012 ICML, Edinburgh
More informationarxiv: v1 [stat.ml] 10 Jun 2014
Predictive Entropy Search for Efficient Global Optimization of Black-bo Functions arxiv:46.254v [stat.ml] Jun 24 José Miguel Hernández-Lobato jmh233@cam.ac.uk University of Cambridge Matthew W. Hoffman
More informationAdministration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.
Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,
More informationLocal Affine Approximators for Improving Knowledge Transfer
Local Affine Approximators for Improving Knowledge Transfer Suraj Srinivas & François Fleuret Idiap Research Institute and EPFL {suraj.srinivas, francois.fleuret}@idiap.ch Abstract The Jacobian of a neural
More informationSpatial smoothing using Gaussian processes
Spatial smoothing using Gaussian processes Chris Paciorek paciorek@hsph.harvard.edu August 5, 2004 1 OUTLINE Spatial smoothing and Gaussian processes Covariance modelling Nonstationary covariance modelling
More informationStochastic Variational Inference for Gaussian Process Latent Variable Models using Back Constraints
Stochastic Variational Inference for Gaussian Process Latent Variable Models using Back Constraints Thang D. Bui Richard E. Turner tdb40@cam.ac.uk ret26@cam.ac.uk Computational and Biological Learning
More informationSubmodularity in Machine Learning
Saifuddin Syed MLRG Summer 2016 1 / 39 What are submodular functions Outline 1 What are submodular functions Motivation Submodularity and Concavity Examples 2 Properties of submodular functions Submodularity
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationMulti-armed bandit models: a tutorial
Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)
More informationCOS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 22 Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3 How to balance exploration and exploitation in reinforcement
More informationOn the Complexity of Best Arm Identification in Multi-Armed Bandit Models
On the Complexity of Best Arm Identification in Multi-Armed Bandit Models Aurélien Garivier Institut de Mathématiques de Toulouse Information Theory, Learning and Big Data Simons Institute, Berkeley, March
More informationIntroduction to Convolutional Neural Networks (CNNs)
Introduction to Convolutional Neural Networks (CNNs) nojunk@snu.ac.kr http://mipal.snu.ac.kr Department of Transdisciplinary Studies Seoul National University, Korea Jan. 2016 Many slides are from Fei-Fei
More informationRegret bounds for meta Bayesian optimization with an unknown Gaussian process prior
Regret bounds for meta Bayesian optimization with an unknown Gaussian process prior Zi Wang MIT CSAIL ziw@csail.mit.edu Beomjoon Kim MIT CSAIL beomjoon@mit.edu Leslie Pack Kaelbling MIT CSAIL lpk@csail.mit.edu
More informationAn Information-Theoretic Analysis of Thompson Sampling
Journal of Machine Learning Research (2015) Submitted ; Published An Information-Theoretic Analysis of Thompson Sampling Daniel Russo Department of Management Science and Engineering Stanford University
More information