Distributed Gaussian Processes


1 Distributed Gaussian Processes
Marc Deisenroth, Department of Computing, Imperial College London
Gaussian Process Summer School, University of Sheffield, 15th September 2015

2 Table of Contents
Gaussian Processes
Sparse Gaussian Processes
Distributed Gaussian Processes

3 Problem Setting
[Figure: noisy observations y of a latent function f(x).]
Objective: for a set of N observations $y_i = f(x_i) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, find a distribution over functions $p(f \mid X, y)$ that explains the data. A GP is a good solution to this probabilistic regression problem.

4-6 GP Training via Marginal Likelihood Maximization
Maximize the evidence/marginal likelihood $p(y \mid X, \theta)$ with respect to the hyper-parameters $\theta$:
$\theta^* \in \arg\max_\theta \log p(y \mid X, \theta)$,
$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top K^{-1} y - \tfrac{1}{2} \log |K| + \mathrm{const}$
Automatic trade-off between data fit and model complexity.
Gradient-based optimization is possible:
$\frac{\partial \log p(y \mid X, \theta)}{\partial \theta} = \tfrac{1}{2} y^\top K^{-1} \frac{\partial K}{\partial \theta} K^{-1} y - \tfrac{1}{2} \mathrm{tr}\!\left(K^{-1} \frac{\partial K}{\partial \theta}\right)$
Computational complexity: $O(N^3)$ for $|K|$ and $K^{-1}$.
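
For concreteness, a minimal NumPy sketch of this computation (not from the talk; the kernel choice and all names are illustrative), evaluating the log marginal likelihood via a Cholesky factorization and the gradient with respect to a single hyper-parameter:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(X1, X2, ell, sf2):
    # Squared-exponential kernel k(x, x') = sf2 * exp(-||x - x'||^2 / (2 ell^2))
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def log_marginal_likelihood(X, y, ell, sf2, sn2):
    # log p(y | X, theta) = -1/2 y^T K^-1 y - 1/2 log|K| - N/2 log(2 pi)
    N = X.shape[0]
    K = rbf_kernel(X, X, ell, sf2) + sn2 * np.eye(N)
    L = cho_factor(K, lower=True)          # O(N^3) Cholesky factorization
    alpha = cho_solve(L, y)                # alpha = K^-1 y
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L[0]))) - 0.5 * N * np.log(2 * np.pi)

def lml_gradient(K, alpha, dK_dtheta):
    # d/d theta log p(y | X, theta) = 1/2 alpha^T (dK/dtheta) alpha - 1/2 tr(K^-1 dK/dtheta)
    Kinv_dK = np.linalg.solve(K, dK_dtheta)
    return 0.5 * alpha @ dK_dtheta @ alpha - 0.5 * np.trace(Kinv_dK)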

7-8 GP Predictions
At a test point $x_*$, the predictive (posterior) distribution is Gaussian:
$p(f(x_*) \mid x_*, X, y, \theta) = \mathcal{N}(f_* \mid m_*, \sigma_*^2)$
$m_* = k(X, x_*)^\top K^{-1} y$
$\sigma_*^2 = k(x_*, x_*) - k(X, x_*)^\top K^{-1} k(X, x_*)$
When $K^{-1}$ and $K^{-1} y$ are cached after training:
The mean prediction can be computed in $O(N)$.
The variance prediction can be computed in $O(N^2)$.
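
A corresponding prediction sketch (again illustrative; it reuses rbf_kernel and the imports from the previous sketch) that caches the Cholesky factor and $K^{-1} y$, so the test-time costs match the complexities quoted above:

def gp_fit(X, y, ell, sf2, sn2):
    # One-off O(N^3) training step; everything needed at test time is cached
    K = rbf_kernel(X, X, ell, sf2) + sn2 * np.eye(X.shape[0])
    L = cho_factor(K, lower=True)
    alpha = cho_solve(L, y)                       # K^-1 y
    return L, alpha

def gp_predict(Xs, X, L, alpha, ell, sf2):
    ks = rbf_kernel(X, Xs, ell, sf2)              # k(X, x*), one column per test point
    mean = ks.T @ alpha                           # O(N) per test point
    v = cho_solve(L, ks)                          # K^-1 k(X, x*), O(N^2) per test point
    var = sf2 - np.sum(ks * v, axis=0)            # k(x*, x*) = sf2 for this kernel
    return mean, var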

9 Application Areas
Bayesian optimization (experimental design): model unknown utility functions with GPs.
Reinforcement learning and robotics: model value functions and/or dynamics with GPs.
Data visualization: nonlinear dimensionality reduction (GP-LVM).

10 Limitations of Gaussian Processes
Computational and memory complexity:
Training scales in $O(N^3)$.
Prediction (variances) scales in $O(N^2)$.
Memory requirement: $O(ND + N^2)$.
Practical limit $N \approx 10{,}000$.

11 Table of Contents
Gaussian Processes
Sparse Gaussian Processes
Distributed Gaussian Processes

12-13 GP Factor Graph
[Figure: factor graph of a GP linking training function values $f(x_1), \dots, f(x_N)$ and test function values $f_1^*, \dots, f_L^*$.]
Probabilistic graphical model (factor graph) of a GP. All function values are jointly Gaussian distributed (e.g., training and test function values).
GP prior:
$p(f, f_*) = \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{ff} & K_{f f_*} \\ K_{f_* f} & K_{f_* f_*} \end{bmatrix} \right)$

14-15 Inducing Variables
[Figure: the same factor graph, augmented with hypothetical inducing function values $f(u_1), \dots, f(u_M)$.]
Introduce inducing function values $f_u$ (hypothetical function values). All function values are still jointly Gaussian distributed (e.g., training, test and inducing function values).
Approach: compress the real function values into the inducing function values.

16-17 Central Approximation Scheme
Approximation: training and test set are conditionally independent given the inducing function values: $f \perp\!\!\!\perp f_* \mid f_u$.
Then the effective GP prior is
$q(f, f_*) = \int p(f \mid f_u)\, p(f_* \mid f_u)\, p(f_u)\, \mathrm{d}f_u = \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} K_{ff} & Q_{f f_*} \\ Q_{f_* f} & K_{f_* f_*} \end{bmatrix} \right)$
$Q_{f f_*} := K_{f f_u} K_{f_u f_u}^{-1} K_{f_u f_*}$ (Nyström approximation)
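
A small illustrative helper for the Nyström term $Q$ (assuming the rbf_kernel function from the earlier sketch; the jitter term is an implementation detail, not from the talk):

def nystroem_q(X1, X2, U, ell, sf2, jitter=1e-8):
    # Q_{12} := K_{1u} K_{uu}^{-1} K_{u2}, with inducing inputs U
    Kuu = rbf_kernel(U, U, ell, sf2) + jitter * np.eye(U.shape[0])
    K1u = rbf_kernel(X1, U, ell, sf2)
    Ku2 = rbf_kernel(U, X2, ell, sf2)
    return K1u @ np.linalg.solve(Kuu, Ku2)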

18-20 FI(T)C Sparse Approximation
Assume that the training (and test) points are fully independent given the inducing variables (Snelson & Ghahramani, 2006).
Effective GP prior with this approximation:
$q(f, f_*) = \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q_{ff} - \mathrm{diag}(Q_{ff} - K_{ff}) & Q_{f f_*} \\ Q_{f_* f} & K_{f_* f_*} \end{bmatrix} \right)$
FIC: $Q_{f_* f_*} - \mathrm{diag}(Q_{f_* f_*} - K_{f_* f_*})$ can be used instead of $K_{f_* f_*}$.
Training: $O(NM^2)$, prediction: $O(M^2)$.

21-25 Inducing Inputs
The FI(T)C sparse approximation exploits inducing function values $f(u_i)$, where the $u_i$ are the corresponding inputs. These inputs are unknown a priori, so we need to find optimal ones.
Find them by maximizing the FI(T)C marginal likelihood with respect to the inducing inputs (and the standard hyper-parameters):
$u_{1:M}^* \in \arg\max_{u_{1:M}} q_{\mathrm{FITC}}(y \mid X, u_{1:M}, \theta)$
Intuitively: the marginal likelihood is parameterized not only by the hyper-parameters $\theta$, but also by the inducing inputs $u_{1:M}$.
We end up with a high-dimensional, non-convex optimization problem with $MD$ additional parameters.

26 FITC Example
[Figure from Ed Snelson: pink: original data; red crosses: initialization of the inducing inputs; blue crosses: location of the inducing inputs after optimization.]
Efficient compression of the original data set.

27-32 Summary: Sparse Gaussian Processes
Sparse approximations typically approximate a GP with N data points by a model based on $M \ll N$ data points.
Selection of these M data points can be tricky and may involve non-trivial computations (e.g., optimizing inducing inputs).
Simple (random) subset selection is fast and generally robust (Chalupka et al., 2013).
Computational complexity: $O(M^3)$ or $O(NM^2)$ for training.
Practical limit on M; often $M \in O(10^2)$ in the case of inducing variables.
If we set $M = N/100$, i.e., each inducing function value summarizes 100 real function values, the practical limit is $N \in O(10^6)$.

33 Table of Contents
Gaussian Processes
Sparse Gaussian Processes
Distributed Gaussian Processes

34 Distributed Gaussian Processes
Joint work with Jun Wei Ng.

35-40 An Orthogonal Approximation: Distributed GPs
[Figure: the full data set with its $O(N^3)$ kernel matrix (standard GP) versus M independent GP experts on small chunks of the data, with overall cost $O(MP^3)$ (distributed GP).]
Randomly split the full data set into M chunks.
Place M independent GP models (experts) on these small chunks.
Independent computations can be distributed.
Block-diagonal approximation of the kernel matrix K.
Combine the independent computations into an overall result.

41-43 Training the Distributed GP
Split the data set of size N into M chunks of size P.
Independence of the experts leads to a factorization of the marginal likelihood:
$\log p(y \mid X, \theta) \approx \sum_{k=1}^{M} \log p_k(y^{(k)} \mid X^{(k)}, \theta)$
Distributed optimization and training are straightforward.
Computational complexity: $O(MP^3)$ [instead of $O(N^3)$], and the computations can be distributed over many machines.
Memory footprint: $O(MP^2 + ND)$ [instead of $O(N^2 + ND)$].
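
A sketch of this factorized training objective (illustrative only; it reuses log_marginal_likelihood from the first sketch and assumes all experts share the hyper-parameters, as in the talk). In a real system each term would be evaluated on a different worker:

def split_into_chunks(X, y, M, seed=0):
    # Randomly assign the N points to M chunks of (roughly) size P = N / M
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    return [(X[c], y[c]) for c in np.array_split(idx, M)]

def distributed_nlml(chunks, ell, sf2, sn2):
    # -log p(y | X, theta) ~= -sum_k log p_k(y^(k) | X^(k), theta); each term costs O(P^3)
    return -sum(log_marginal_likelihood(Xk, yk, ell, sf2, sn2) for Xk, yk in chunks)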

44-45 Empirical Training Time
[Figure: computation time of the NLML and its gradients (in seconds) as a function of the training set size, for the distributed GP (with varying numbers of GP experts), the full GP and FITC.]
NLML computation time is proportional to training time.
Full GP (16K training points) ≈ sparse GP (50K training points) ≈ distributed GP (16M training points).
This pushes the practical limit by order(s) of magnitude.

46-49 Practical Training Times
Training* with $N = 10^6$, $D = 1$ on a laptop: ≈ 10-30 min.
Training* with $N = 5 \times 10^6$, $D = 8$ on a workstation: ≈ 4 hours.
*: Maximize the marginal likelihood, stop when converged**.
**: Convergence typically after a limited number of line searches***.
***: A line search requires ≈ 2-3 evaluations of the marginal likelihood and its gradient (usually $O(N^3)$).

50-51 Predictions with the Distributed GP
The prediction of each GP expert is Gaussian, $\mathcal{N}(\mu_i, \sigma_i^2)$. How do we combine them into an overall prediction $\mathcal{N}(\mu, \sigma^2)$?
Product-of-GP-experts models:
PoE (product of experts; Ng & Deisenroth, 2014)
gPoE (generalized product of experts; Cao & Fleet, 2014)
BCM (Bayesian committee machine; Tresp, 2000)
rBCM (robust BCM; Deisenroth & Ng, 2015)

52-55 Objectives
[Figure: two computational graphs for combining the experts' predictions $(\mu_i, \sigma_i)$ into an overall prediction $(\mu, \sigma)$: a single-layer graph and a two-layer (hierarchical) graph.]
Scale to large data sets.
Good approximation of the full GP ("ground truth").
Predictions independent of the computational graph, so the model runs on heterogeneous computing infrastructures (laptop, cluster, ...).
Reasonable predictive variances.

56 Running Example
Investigate various product-of-experts models: same training procedure, but different mechanisms for predictions.

57-58 Product of GP Experts
Prediction model (independent predictors):
$p(f_* \mid x_*, \mathcal{D}) \propto \prod_{k=1}^{M} p_k(f_* \mid x_*, \mathcal{D}^{(k)})$, with $p_k(f_* \mid x_*, \mathcal{D}^{(k)}) = \mathcal{N}(f_* \mid \mu_k(x_*), \sigma_k^2(x_*))$
Predictive precision (inverse variance) and mean:
$(\sigma_*^{\mathrm{poe}})^{-2} = \sum_k \sigma_k^{-2}(x_*)$
$\mu_*^{\mathrm{poe}} = (\sigma_*^{\mathrm{poe}})^{2} \sum_k \sigma_k^{-2}(x_*)\, \mu_k(x_*)$
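
An illustrative NumPy sketch of the PoE combination rule (function and argument names are mine, not from the talk); mus and vars_ hold each expert's predictive mean and variance at a single test input:

def poe_combine(mus, vars_):
    # (sigma_poe)^-2 = sum_k sigma_k^-2 ; mu_poe = sigma_poe^2 * sum_k sigma_k^-2 mu_k
    prec = np.sum(1.0 / vars_)
    var = 1.0 / prec
    mu = var * np.sum(mus / vars_)
    return mu, var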

59-60 Computational Graph
Prediction: $p(f_* \mid x_*, \mathcal{D}) \propto \prod_{k=1}^{M} p_k(f_* \mid x_*, \mathcal{D}^{(k)})$
Multiplication is associative: $a\,b\,c\,d = (a\,b)(c\,d)$. Hence
$\prod_{k=1}^{M} p_k(f_* \mid \mathcal{D}^{(k)}) = \prod_{k=1}^{L} \prod_{i=1}^{L_k} p_{k_i}(f_* \mid \mathcal{D}^{(k_i)})$, with $\sum_k L_k = M$
The PoE is therefore independent of the computational graph.

61-62 Product of GP Experts
Unreasonable variances for $M > 1$:
$(\sigma_*^{\mathrm{poe}})^{-2} = \sum_k \sigma_k^{-2}(x_*)$
The more experts, the more certain the prediction, even if every expert itself is very uncertain. The model cannot fall back to the prior.

63-66 Generalized Product of GP Experts
Weight the responsibility of each expert in the PoE with $\beta_k$.
Prediction model (independent predictors):
$p(f_* \mid x_*, \mathcal{D}) \propto \prod_{k=1}^{M} p_k^{\beta_k}(f_* \mid x_*, \mathcal{D}^{(k)})$, with $p_k(f_* \mid x_*, \mathcal{D}^{(k)}) = \mathcal{N}(f_* \mid \mu_k(x_*), \sigma_k^2(x_*))$
Predictive precision and mean:
$(\sigma_*^{\mathrm{gpoe}})^{-2} = \sum_k \beta_k \sigma_k^{-2}(x_*)$
$\mu_*^{\mathrm{gpoe}} = (\sigma_*^{\mathrm{gpoe}})^{2} \sum_k \beta_k \sigma_k^{-2}(x_*)\, \mu_k(x_*)$
With $\sum_k \beta_k = 1$, the model can fall back to the prior. Log-opinion-pool model (Heskes, 1998).
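
The corresponding gPoE combination, again as an illustrative sketch; with betas set to 1/M (the a-priori choice mentioned on the next slide), the weights sum to one:

def gpoe_combine(mus, vars_, betas):
    # (sigma_gpoe)^-2 = sum_k beta_k sigma_k^-2 ; mean weighted accordingly
    prec = np.sum(betas / vars_)
    var = 1.0 / prec
    mu = var * np.sum(betas * mus / vars_)
    return mu, var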

67-69 Computational Graph
Prediction:
$p(f_* \mid x_*, \mathcal{D}) \propto \prod_{k=1}^{M} p_k^{\beta_k}(f_* \mid x_*, \mathcal{D}^{(k)}) = \prod_{k=1}^{L} \prod_{i=1}^{L_k} p_{k_i}^{\beta_{k_i}}(f_* \mid \mathcal{D}^{(k_i)})$, with $\sum_{k,i} \beta_{k_i} = 1$
Independent of the computational graph if $\sum_{k,i} \beta_{k_i} = 1$.
An a-priori setting of the $\beta_{k_i}$ is required; typically $\beta_{k_i} = 1/M$ a priori.

70 Generalized Product of GP Experts
Same mean as the PoE. The model is no longer overconfident and falls back to the prior, but the variances are very conservative.

71-75 Bayesian Committee Machine
Apply Bayes' theorem when combining predictions (and not only for computing predictions).
Prediction model (conditional independence $\mathcal{D}^{(j)} \perp\!\!\!\perp \mathcal{D}^{(k)} \mid f_*$):
$p(f_* \mid x_*, \mathcal{D}) \propto \dfrac{\prod_{k=1}^{M} p_k(f_* \mid x_*, \mathcal{D}^{(k)})}{p^{M-1}(f_*)}$
Predictive precision and mean:
$(\sigma_*^{\mathrm{bcm}})^{-2} = \sum_{k=1}^{M} \sigma_k^{-2}(x_*) + (1 - M)\, \sigma_{**}^{-2}$
$\mu_*^{\mathrm{bcm}} = (\sigma_*^{\mathrm{bcm}})^{2} \sum_{k=1}^{M} \sigma_k^{-2}(x_*)\, \mu_k(x_*)$
where $\sigma_{**}^2$ is the prior variance of $f_*$.
This is the product of GP experts, divided by $M-1$ times the prior. Guaranteed to fall back to the prior outside the data regime.
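
A BCM combination sketch (illustrative; prior_var stands for the GP prior variance $\sigma_{**}^2$ at the test input, e.g. $k(x_*, x_*)$):

def bcm_combine(mus, vars_, prior_var):
    # (sigma_bcm)^-2 = sum_k sigma_k^-2 + (1 - M) / prior_var
    M = len(mus)
    prec = np.sum(1.0 / vars_) + (1.0 - M) / prior_var
    var = 1.0 / prec
    mu = var * np.sum(mus / vars_)
    return mu, var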

76-78 Computational Graph
Prediction:
$p(f_* \mid x_*, \mathcal{D}) \propto \dfrac{\prod_{k=1}^{M} p_k(f_* \mid x_*, \mathcal{D}^{(k)})}{p^{M-1}(f_*)} = \dfrac{\prod_{k=1}^{L} \prod_{i=1}^{L_k} p_{k_i}(f_* \mid \mathcal{D}^{(k_i)})}{p^{M-1}(f_*)}$
Independent of the computational graph.

79 Bayesian Committee Machine
The variance estimates are about right, but when leaving the data regime the BCM can produce junk. This motivates a robustified version.

80-82 Robust Bayesian Committee Machine
Merge the gPoE (weighting of the experts) with the BCM (Bayes' theorem when combining predictions).
Prediction model (conditional independence $\mathcal{D}^{(j)} \perp\!\!\!\perp \mathcal{D}^{(k)} \mid f_*$):
$p(f_* \mid x_*, \mathcal{D}) \propto \dfrac{\prod_{k=1}^{M} p_k^{\beta_k}(f_* \mid x_*, \mathcal{D}^{(k)})}{p^{\sum_k \beta_k - 1}(f_*)}$
Predictive precision and mean:
$(\sigma_*^{\mathrm{rbcm}})^{-2} = \sum_{k=1}^{M} \beta_k \sigma_k^{-2}(x_*) + \big(1 - \sum_{k=1}^{M} \beta_k\big)\, \sigma_{**}^{-2}$
$\mu_*^{\mathrm{rbcm}} = (\sigma_*^{\mathrm{rbcm}})^{2} \sum_k \beta_k \sigma_k^{-2}(x_*)\, \mu_k(x_*)$
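
And the corresponding rBCM combination sketch. In Deisenroth & Ng (2015) the weights $\beta_k$ are derived from the experts' predictive uncertainty, but the formula works for any choice, so the betas argument is left generic here:

def rbcm_combine(mus, vars_, betas, prior_var):
    # (sigma_rbcm)^-2 = sum_k beta_k sigma_k^-2 + (1 - sum_k beta_k) / prior_var
    prec = np.sum(betas / vars_) + (1.0 - np.sum(betas)) / prior_var
    var = 1.0 / prec
    mu = var * np.sum(betas * mus / vars_)
    return mu, var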

83-84 Computational Graph
Prediction:
$p(f_* \mid x_*, \mathcal{D}) \propto \dfrac{\prod_{k=1}^{M} p_k^{\beta_k}(f_* \mid x_*, \mathcal{D}^{(k)})}{p^{\sum_k \beta_k - 1}(f_*)} = \dfrac{\prod_{k=1}^{L} \prod_{i=1}^{L_k} p_{k_i}^{\beta_{k_i}}(f_* \mid \mathcal{D}^{(k_i)})}{p^{\sum_{k,i} \beta_{k_i} - 1}(f_*)}$
Independent of the computational graph, even with arbitrary $\beta_{k_i}$.

85 Robust Bayesian Committee Machine
Does not break down in the case of weak experts: a robustified ("robust") version of the BCM. Reasonable predictions. Independent of the computational graph (for all choices of $\beta_k$).

86 Empirical Approximation Error
[Figure: RMSE as a function of the gradient computation time in seconds, with varying numbers of points per expert, for rBCM, BCM, gPoE, PoE, a subset-of-data (SOD) GP and the full GP.]
Simulated robot-arm data (10K training points, 10K test points); hyper-parameters fixed to those of the ground-truth full GP.
The sparse GP (SOR) performs worse than any distributed GP; the rBCM performs best with weak GP experts.

87 Empirical Approximation Error (2)
[Figure: NLPD as a function of the gradient computation time in seconds, with varying numbers of points per expert, for rBCM, BCM, gPoE, PoE, SOD GP and the full GP.]
The NLPD assesses both the mean and the variance. BCM and PoE are not robust for weak experts; the gPoE suffers from too conservative variances; the rBCM consistently outperforms the other methods.

88-89 Large Data Sets: Airline Data (700K)
[Figure: RMSE on the airline-delay data set (700K points) for the distributed GP with random and with kd-tree data assignment (rBCM, BCM, gPoE, PoE), a sparse GP, Dist-VGP and SVI-GP.]
The rBCM performs best.
(r)BCM and (g)PoE use 4096 GP experts; gradient time: 13 seconds (12 cores).
Inducing-input baselines: Dist-VGP (Gal et al., 2014), SVI-GP (Hensman et al., 2013).
(g)PoE and BCM perform no worse than the sparse GPs.
kd-tree data assignment clearly helps the (r)BCM.
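
For the kd-tree assignment mentioned above, a rough sketch of how experts could receive spatially coherent chunks instead of random ones (a simplified recursive median split; purely illustrative, not the talk's implementation):

def kd_tree_chunks(X, y, depth):
    # Recursively split along the dimension of largest spread at the median,
    # yielding 2**depth spatially coherent chunks (one per GP expert).
    if depth == 0:
        return [(X, y)]
    d = np.argmax(X.max(axis=0) - X.min(axis=0))
    order = np.argsort(X[:, d])
    mid = X.shape[0] // 2
    left, right = order[:mid], order[mid:]
    return (kd_tree_chunks(X[left], y[left], depth - 1)
            + kd_tree_chunks(X[right], y[right], depth - 1))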

90 Summary: Distributed Gaussian Processes
[Figure: hierarchical computational graph combining the experts' predictions $(\mu_i, \sigma_i)$ into an overall prediction $(\mu, \sigma)$.]
Scale Gaussian processes to large data sets (beyond $10^6$ points).
The model is conceptually straightforward and easy to train; the key is distributed computation.
Currently tested with $N \in O(10^7)$; scales to arbitrarily large data sets (given enough computing power).
Contact: m.deisenroth@imperial.ac.uk

91 References
[1] Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process Predictions. arXiv, October 2014.
[2] K. Chalupka, C. K. I. Williams, and I. Murray. A Framework for Evaluating Approximate Methods for Gaussian Process Regression. Journal of Machine Learning Research, 14, 2013.
[3] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine Learning, 2015.
[4] M. P. Deisenroth, C. E. Rasmussen, and D. Fox. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, 2011.
[5] Y. Gal, M. van der Wilk, and C. E. Rasmussen. Distributed Variational Inference in Sparse Gaussian Process Regression and Latent Variable Models. In Advances in Neural Information Processing Systems, 2014.
[6] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2013.
[7] T. Heskes. Selecting Weighting Factors in Logarithmic Opinion Pools. In Advances in Neural Information Processing Systems. Morgan Kaufmann, 1998.
[8] N. Lawrence. Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research, 6, 2005.
[9] J. Ng and M. P. Deisenroth. Hierarchical Mixture-of-Experts Model for Large-Scale Gaussian Process Regression. arXiv, December 2014.
[10] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6, 2005.
[11] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Advances in Neural Information Processing Systems 18. The MIT Press, Cambridge, MA, USA, 2006.
[12] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
[13] V. Tresp. A Bayesian Committee Machine. Neural Computation, 12(11), 2000.


More information

Introduction to Gaussian Processes

Introduction to Gaussian Processes Introduction to Gaussian Processes Neil D. Lawrence GPSS 10th June 2013 Book Rasmussen and Williams (2006) Outline The Gaussian Density Covariance from Basis Functions Basis Function Representations Constructing

More information

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF)

Review: Probabilistic Matrix Factorization. Probabilistic Matrix Factorization (PMF) Case Study 4: Collaborative Filtering Review: Probabilistic Matrix Factorization Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 2 th, 214 Emily Fox 214 1 Probabilistic

More information

an author's https://oatao.univ-toulouse.fr/17504 http://dx.doi.org/10.5220/0005700504370445 Chiplunkar, Ankit and Rachelson, Emmanuel and Colombo, Michele and Morlier, Joseph Sparse Physics-based Gaussian

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Lecture 4: Probabilistic Learning

Lecture 4: Probabilistic Learning DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods

More information

Joint Emotion Analysis via Multi-task Gaussian Processes

Joint Emotion Analysis via Multi-task Gaussian Processes Joint Emotion Analysis via Multi-task Gaussian Processes Daniel Beck, Trevor Cohn, Lucia Specia October 28, 2014 1 Introduction 2 Multi-task Gaussian Process Regression 3 Experiments and Discussion 4 Conclusions

More information

Parameter Estimation. Industrial AI Lab.

Parameter Estimation. Industrial AI Lab. Parameter Estimation Industrial AI Lab. Generative Model X Y w y = ω T x + ε ε~n(0, σ 2 ) σ 2 2 Maximum Likelihood Estimation (MLE) Estimate parameters θ ω, σ 2 given a generative model Given observed

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

A Process over all Stationary Covariance Kernels

A Process over all Stationary Covariance Kernels A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that

More information

Models. Carl Henrik Ek Philip H. S. Torr Neil D. Lawrence. Computer Science Departmental Seminar University of Bristol October 25 th, 2007

Models. Carl Henrik Ek Philip H. S. Torr Neil D. Lawrence. Computer Science Departmental Seminar University of Bristol October 25 th, 2007 Carl Henrik Ek Philip H. S. Torr Neil D. Lawrence Oxford Brookes University University of Manchester Computer Science Departmental Seminar University of Bristol October 25 th, 2007 Source code and slides

More information