Tree-structured Gaussian Process Approximations


Tree-structured Gaussian Process Approximations
Thang Bui, joint work with Richard Turner
MLG, Cambridge, July 1st

Outline
1. Introduction
2. Tree-structured GP approximation
3. Experiments
4. Summary

GPs for regression: a quick recap

Given {x_n, y_n}_{n=1}^N, with y_n = f(x_n) + ε_n, ε_n ~iid N(0, σ_n^2) and f ~ GP(0, k_θ(·, ·)).

The posterior is also a GP:
m_f(x) = K_{xf} (K_{ff} + σ_n^2 I)^{-1} y,
k_f(x, x') = k(x, x') − K_{xf} (K_{ff} + σ_n^2 I)^{-1} K_{fx}.

Log marginal likelihood for learning: L = log N(y; 0, K_{ff} + σ_n^2 I).

Cost: O(N^3)
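A minimal NumPy sketch of these exact-GP formulas (the squared-exponential kernel and all helper names are illustrative, not taken from the slides):

```python
import numpy as np

def se_kernel(X1, X2, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel k(x, x') = sigma_f^2 exp(-|x - x'|^2 / (2 ell^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

def gp_regression(X, y, Xs, sigma_n=0.1):
    """Exact GP posterior mean/variance at test inputs Xs, plus the log marginal likelihood."""
    N = X.shape[0]
    Kff = se_kernel(X, X) + sigma_n**2 * np.eye(N)      # K_ff + sigma_n^2 I
    Kxf = se_kernel(Xs, X)                               # K_xf
    L = np.linalg.cholesky(Kff)                          # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Kxf @ alpha                                   # m_f(x)
    v = np.linalg.solve(L, Kxf.T)
    var = np.diag(se_kernel(Xs, Xs)) - np.sum(v**2, 0)   # k_f(x, x)
    lml = (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
           - 0.5 * N * np.log(2 * np.pi))                # log N(y; 0, K_ff + sigma_n^2 I)
    return mean, var, lml
```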

Prior work

Indirect posterior approximation schemes:
- Introduce a pseudo-dataset {x_m, u_m}_{m=1}^M and remove some dependencies in the prior: FITC, PI(T)C (Snelson and Ghahramani 2006; Snelson and Ghahramani 2007).
- Approximate the prior using M cosine basis functions: SSGP (Lázaro-Gredilla et al. 2010).

Direct posterior approximation schemes:
- Variational free energy approach (Seeger 2003; Titsias 2009), with an SVI extension to handle big data (Hensman et al. 2013).
- Expectation propagation (Qi et al. 2010).

Local approximations (Tresp 2000; Urtasun and Darrell 2008).

Fully independent training conditionals (FITC or SPGP) (Snelson and Ghahramani 2006)

Introduce a pseudo-dataset {x_m, u_m}_{m=1}^M; prior: p(u, f) = p(u) p(f | u).

Assume f_i ⊥ f_j | u for all i ≠ j, giving the approximate prior q(u, f) = q(u) ∏_n q(f_n | u).

Calibrate the model using KL(p(u, f) || q(u, f)), which gives q(u) = p(u) and q(f_i | u) = p(f_i | u).

New generative model:
p(u) = N(u; 0, K_uu),
p(y | u) = N(y; K_fu K_uu^{-1} u, diag(K_ff − K_fu K_uu^{-1} K_uf) + σ_n^2 I).

Cost: O(NM^2)

If we instead assume f_{B_i} ⊥ f_{B_j} | u for all i ≠ j, we obtain PI(T)C:
p(y | u) = N(y; K_fu K_uu^{-1} u, blkdiag(K_ff − K_fu K_uu^{-1} K_uf) + σ_n^2 I).
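A sketch of the collapsed FITC log marginal likelihood log N(y; 0, Q_ff + diag(K_ff − Q_ff) + σ_n^2 I), where Q_ff = K_fu K_uu^{-1} K_uf, assuming a kernel(A, B) helper and inducing inputs Z (names hypothetical); the matrix inversion lemma keeps the cost at O(NM^2):

```python
import numpy as np

def fitc_lml(X, y, Z, kernel, sigma_n):
    """FITC log marginal likelihood: y ~ N(0, Q_ff + diag(K_ff - Q_ff) + sigma_n^2 I)."""
    N, M = X.shape[0], Z.shape[0]
    Kuu = kernel(Z, Z) + 1e-6 * np.eye(M)
    Kuf = kernel(Z, X)
    Kff_diag = np.diag(kernel(X, X))
    L = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(L, Kuf)                          # V^T V = Q_ff
    Lam = Kff_diag - np.sum(V**2, 0) + sigma_n**2        # diag(K_ff - Q_ff) + sigma_n^2
    B = np.eye(M) + (V / Lam) @ V.T
    LB = np.linalg.cholesky(B)
    beta = np.linalg.solve(LB, (V / Lam) @ y)
    # log N(y; 0, Q_ff + diag(Lam)) via the matrix inversion and determinant lemmas
    return (-0.5 * (y**2 / Lam).sum() + 0.5 * (beta**2).sum()
            - np.log(np.diag(LB)).sum() - 0.5 * np.log(Lam).sum()
            - 0.5 * N * np.log(2 * np.pi))
```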

Variational free energy (VFE) approach (Titsias 2009)

Augment the model with a pseudo-dataset {x_m, u_m}_{m=1}^M; the joint posterior of f and u is p(f, u | y) ∝ p(f, u) p(y | f).

Introducing the variational distribution q(f, u) = p(f | u) q(u) gives the ELBO:
F(q(u)) = ∫ du df p(f | u) q(u) log [ p(u) p(y | f) / q(u) ].

The optimal distribution is
q(u) = (1/Z) p(u) exp( ∫ df p(f | u) log p(y | f) ),
and the collapsed bound is
F(q(u)) = log N(y; 0, σ_n^2 I + K_fu K_uu^{-1} K_uf) − (1 / (2σ_n^2)) Tr(K_ff − K_fu K_uu^{-1} K_uf).

Cost: O(NM^2)
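A sketch of Titsias' collapsed bound above, again assuming a kernel(A, B) helper and inducing inputs Z (names hypothetical):

```python
import numpy as np

def vfe_bound(X, y, Z, kernel, sigma_n):
    """Collapsed bound F = log N(y; 0, sigma_n^2 I + Q_ff) - Tr(K_ff - Q_ff) / (2 sigma_n^2)."""
    N, M = X.shape[0], Z.shape[0]
    Kuu = kernel(Z, Z) + 1e-6 * np.eye(M)
    Kuf = kernel(Z, X)
    Kff_diag = np.diag(kernel(X, X))
    L = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(L, Kuf) / sigma_n                # V^T V = Q_ff / sigma_n^2
    B = np.eye(M) + V @ V.T
    LB = np.linalg.cholesky(B)
    beta = np.linalg.solve(LB, V @ y) / sigma_n
    log_gauss = (-0.5 * (y @ y) / sigma_n**2 + 0.5 * beta @ beta
                 - np.log(np.diag(LB)).sum() - N * np.log(sigma_n)
                 - 0.5 * N * np.log(2 * np.pi))          # log N(y; 0, sigma_n^2 I + Q_ff)
    trace_term = -0.5 * (Kff_diag.sum() - sigma_n**2 * np.sum(V**2)) / sigma_n**2
    return log_gauss + trace_term
```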

Example

[Figure: exact GP vs. VFE posterior on a simple 1D function (y against x).]

N = 100, M = 10. We only need a small M if the underlying function is simple.

Example 2

[Figure: exact GP vs. VFE posterior on a rapidly varying 1D function (y against x).]

N = 100, M = 10. M needs to be large if the underlying function is complicated.

Limitations of global approximations

[Figure: a 1D dataset spanning a range L with lengthscale l (y against x).]

We approximately need M ≈ ∏_{d=1}^D L_d / l_d, where L_d and l_d are the data range and lengthscale in dimension d, i.e. M is large when:
- datasets span a large input space, e.g. time-series or spatial datasets;
- the underlying functions have short lengthscales, i.e. lots of wiggles.

O(NM^2) is still expensive! Local approximations may give a better time/accuracy trade-off.
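A back-of-the-envelope helper for this rule of thumb (illustrative only; the numbers below are made up):

```python
import numpy as np

def pseudo_points_needed(ranges, lengthscales):
    """Rough rule of thumb from the slide: M ~ prod_d (L_d / l_d)."""
    return int(np.ceil(np.prod(np.asarray(ranges) / np.asarray(lengthscales))))

# e.g. a 10 km x 10 km spatial dataset with a 0.5 km lengthscale
# already needs roughly 20 * 20 = 400 pseudo-points.
print(pseudo_points_needed([10.0, 10.0], [0.5, 0.5]))  # -> 400
```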

Local GPs

Divide the training set into M disjoint partitions {B_i}_{i=1}^M, where B_i = {x_j, y_j}_{j=1}^{N_i}.

Obtain the posterior for each partition: p(f_i | y_{B_i}) ∝ p(f_i) p(y_{B_i} | f_i).

Predict using the posterior of only the partition closest to the test point (see Tresp 2000 for a way to combine predictors):
p(f* | y_{B_i}) = ∫ df_i p(f* | f_i) p(f_i | y_{B_i}).

Partitions can have shared or separate hyperparameters.

Cost: O(ND^2), D: average size of the partitions
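A minimal sketch of this local-GP recipe with a k-means partitioning (kernel(A, B) is an assumed helper; scikit-learn's KMeans is one possible partitioner):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_local_gps(X, y, M, kernel, sigma_n):
    """Partition the data with k-means and cache each block's exact GP posterior
    (shared hyperparameters across blocks)."""
    km = KMeans(n_clusters=M, n_init=10).fit(X)
    blocks = []
    for i in range(M):
        idx = km.labels_ == i
        Xi, yi = X[idx], y[idx]
        Ki = kernel(Xi, Xi) + sigma_n**2 * np.eye(Xi.shape[0])
        blocks.append((Xi, yi, np.linalg.cholesky(Ki)))
    return km, blocks

def predict_local_gp(xs, km, blocks, kernel):
    """Predict with the posterior of the single partition closest to the test point xs."""
    i = int(km.predict(xs.reshape(1, -1))[0])
    Xi, yi, Li = blocks[i]
    alpha = np.linalg.solve(Li.T, np.linalg.solve(Li, yi))
    ks = kernel(xs.reshape(1, -1), Xi)
    mean = float(ks @ alpha)
    v = np.linalg.solve(Li, ks.T)
    var = float(kernel(xs.reshape(1, -1), xs.reshape(1, -1))) - float(v.T @ v)
    return mean, var
```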

Outline
1. Introduction
2. Tree-structured GP approximation
3. Experiments
4. Summary

Tree-structured approximation (TSGP)

TSGP is in the same family as FITC and PITC, i.e. an indirect approximation via prior modification, but with additional structure: local inducing variables for each partition and sparse connections between the inducing blocks.

[Figure: graphical models of (a) the full GP, (b) FITC with a single global u, (c) PIC with local blocks u_{B_k}, and (d) the tree/chain structure connecting the u_{B_k}.]

Prior modification

Generative model:
q(u) = ∏_{k=1}^K q(u_{B_k} | u_{B_l}),
q(f | u) = ∏_{k=1}^K q(f_{B_k} | u_{B_k}),
p(y | f) = ∏_{n=1}^N p(y_n; f_n, σ_n^2),
where B_l denotes the parent block of B_k in the tree.

[Figure: one edge of the tree, u_{B_l} → u_{B_k} with parameters A_k, Q_k, and u_{B_k} → f_{B_k} with parameters C_k, R_k.]

The model is calibrated by minimising a forward KL divergence,
KL( p(f, u) || ∏_k q(f_{B_k} | u_{B_k}) q(u_{B_k} | u_{B_l}) ),
which gives
q(u_{B_k} | u_{B_l}) = p(u_{B_k} | u_{B_l}) = N(u_{B_k}; A_k u_{B_l}, Q_k),
q(f_{B_k} | u_{B_k}) = p(f_{B_k} | u_{B_k}) = N(f_{B_k}; C_k u_{B_k}, R_k),
with
A_k = K_{u_k u_l} K_{u_l u_l}^{-1},   Q_k = K_{u_k u_k} − K_{u_k u_l} K_{u_l u_l}^{-1} K_{u_l u_k},
C_k = K_{f_k u_k} K_{u_k u_k}^{-1},   R_k = K_{f_k f_k} − K_{f_k u_k} K_{u_k u_k}^{-1} K_{u_k f_k}.
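A sketch of how these edge parameters could be computed from kernel matrices (kernel(A, B) is an assumed helper; names are hypothetical):

```python
import numpy as np

def block_conditional_params(kernel, Z_k, Z_l, X_k, jitter=1e-6):
    """Tree-edge parameters for one block:
    A_k = K_{u_k u_l} K_{u_l u_l}^{-1},  Q_k = K_{u_k u_k} - A_k K_{u_l u_k},
    C_k = K_{f_k u_k} K_{u_k u_k}^{-1},  R_k = K_{f_k f_k} - C_k K_{u_k f_k}.
    Z_k, Z_l: pseudo-inputs of the block and its parent; X_k: the block's inputs."""
    Kukul = kernel(Z_k, Z_l)
    Kulul = kernel(Z_l, Z_l) + jitter * np.eye(Z_l.shape[0])
    Kukuk = kernel(Z_k, Z_k) + jitter * np.eye(Z_k.shape[0])
    Kfkuk = kernel(X_k, Z_k)
    Kfkfk = kernel(X_k, X_k)

    A_k = np.linalg.solve(Kulul, Kukul.T).T          # K_{u_k u_l} K_{u_l u_l}^{-1}
    Q_k = Kukuk - A_k @ Kukul.T
    C_k = np.linalg.solve(Kukuk, Kfkuk.T).T          # K_{f_k u_k} K_{u_k u_k}^{-1}
    R_k = Kfkfk - C_k @ Kfkuk.T
    return A_k, Q_k, C_k, R_k
```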

Inference

Marginalising out f, the model becomes a tree-structured Gaussian model with latent variables u and observations y. Special case: a linear Gaussian state space model for time series or 1D data.

Joint posterior:
p(u | y) ∝ ∏_{i ∈ V} exp( −(1/2) u_i^T J_i u_i + u_i^T h_i ) ∏_{(i,j) ∈ E} exp( u_i^T J_{ij} u_j ),
where
J_i = Q_i^{-1} + C_i^T (R_i + σ^2 I_i)^{-1} C_i + Σ_{j ∈ nei(i)} A_j^T Q_j^{-1} A_j,
h_i = C_i^T (R_i + σ^2 I_i)^{-1} y_i,
J_{ij} = Q_i^{-1} A_i.

Use the Gaussian belief propagation algorithm to find the marginal distributions p(u_{B_i} | y).

Prediction at test points: p(f* | y) = ∫ du_{B_i} p(f* | u_{B_i}) p(u_{B_i} | y).

Cost: O(ND^2), D: average number of observations per block
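In the chain (time-series) special case, the belief-propagation recursions reduce to Kalman-style filtering and smoothing. A minimal forward-pass sketch under that assumption (argument layout is hypothetical; a backward pass would complete the posterior marginals):

```python
import numpy as np

def chain_filter(A, Q, C, R, y, sigma_n, K1):
    """Forward filtering for the chain special case of the model:
    u_1 ~ N(0, K1), u_k | u_{k-1} ~ N(A[k] u_{k-1}, Q[k]),
    y_k | u_k ~ N(C[k] u_k, R[k] + sigma_n^2 I).
    A, Q, C, R, y are lists indexed by block (A[0], Q[0] are unused).
    Returns filtered marginals and the log marginal likelihood."""
    means, covs, lml = [], [], 0.0
    m, P = np.zeros(K1.shape[0]), K1
    for k in range(len(y)):
        if k > 0:                                    # predict step: p(u_k | y_{1:k-1})
            m, P = A[k] @ m, A[k] @ P @ A[k].T + Q[k]
        S = C[k] @ P @ C[k].T + R[k] + sigma_n**2 * np.eye(len(y[k]))
        v = y[k] - C[k] @ m                          # innovation
        lml += (-0.5 * v @ np.linalg.solve(S, v)
                - 0.5 * np.linalg.slogdet(S)[1]
                - 0.5 * len(v) * np.log(2 * np.pi))  # log p(y_k | y_{1:k-1})
        G = P @ C[k].T @ np.linalg.inv(S)            # gain
        m, P = m + G @ v, P - G @ C[k] @ P           # update step: p(u_k | y_{1:k})
        means.append(m)
        covs.append(P)
    return means, covs, lml
```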

Hyperparameter learning

The log marginal likelihood and its derivatives can be computed with the same message passing algorithm:
p(y_{1:K} | θ) = ∏_{k=1}^K p(y_k | y_{1:k-1}, θ),
d/dθ log p(y | θ) = Σ_{k=1}^K [ ⟨ d/dθ log p(u_k | u_l) ⟩_{p(u_k, u_l | y)} + ⟨ d/dθ log p(y_k | u_k) ⟩_{p(u_k | y)} ].

Tree construction:
- start with k-means clustering to find the observation blocks;
- use Kruskal's algorithm to greedily select a tree;
- choose a large random subset of the observations in each block to be pseudo-outputs; no optimisation needed.
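A rough sketch of this construction using k-means plus a minimum spanning tree over the block centres (the choice of edge weights and the subset fraction are assumptions, not specified on the slide):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def build_tree(X, K, subset_frac=0.5, seed=0):
    """k-means blocks, a greedily selected (Kruskal-style) spanning tree over block
    centres, and a random subset of each block's inputs as pseudo-input locations."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    centres = km.cluster_centers_
    mst = minimum_spanning_tree(cdist(centres, centres)).tocoo()
    edges = list(zip(mst.row, mst.col))              # undirected tree edges between blocks
    blocks, pseudo_inputs = [], []
    for k in range(K):
        idx = np.where(km.labels_ == k)[0]
        blocks.append(idx)
        n_pseudo = max(1, int(subset_frac * len(idx)))
        pseudo_inputs.append(X[rng.choice(idx, size=n_pseudo, replace=False)])
    return blocks, pseudo_inputs, edges
```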

Comparison (method: KL divergence minimised → result)

- FITC: minimise KL( p(f, u) || q(u) ∏_n q(f_n | u) ); result: q(u) = p(u), q(f_n | u) = p(f_n | u).
- PIC: minimise KL( p(f, u) || q(u) ∏_k q(f_{C_k} | u) ); result: q(u) = p(u), q(f_{C_k} | u) = p(f_{C_k} | u).
- PP: minimise KL( (1/Z) p(u) p(f | u) q(y | u) || p(f, u | y) ); result: q(y | u) = N(y; K_fu K_uu^{-1} u, σ_n^2 I).
- VFE: minimise KL( p(f | u) q(u) || p(f, u | y) ); result: q(u) ∝ p(u) exp( ⟨ log p(y | f) ⟩_{p(f | u)} ).
- EP: minimise KL( q(f; u) p(y_n | f_n) / q_n(f; u) || q(f; u) ); result: q(f; u) ∝ p(f) ∏_m p(u_m | f_m).
- Tree: minimise KL( p(f, u) || ∏_k q(f_{B_k} | u_{B_k}) q(u_{B_k} | u_{par(B_k)}) ); result: q(f_{B_k} | u_{B_k}) = p(f_{B_k} | u_{B_k}), q(u_{B_k} | u_{par(B_k)}) = p(u_{B_k} | u_{par(B_k)}).

Outline
1. Introduction
2. Tree-structured GP approximation
3. Experiments
4. Summary

Audio data

Task: filling in missing data.
Data:
- Subband of a speech signal: N = 50000, SE kernel k_θ(t, t') = σ^2 exp( −(t − t')^2 / (2 l^2) ).
- Filtered speech signal: N = 50000, spectral mixture kernel k_θ(t, t') = Σ_{k=1}^2 σ_k^2 cos(ω_k (t − t')) exp( −(t − t')^2 / (2 l_k^2) ).
Evaluation: prediction error vs. training/test time.

[Figure: reconstructions of the missing regions (true signal, chain, local). Left: subband data; right: filtered signal.]
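For concreteness, the two kernels used in these experiments could be coded as follows (a sketch; parameter names are illustrative):

```python
import numpy as np

def se_kernel_1d(t1, t2, sigma, ell):
    """Squared-exponential kernel on scalar time inputs."""
    d = t1[:, None] - t2[None, :]
    return sigma**2 * np.exp(-0.5 * d**2 / ell**2)

def spectral_mixture_kernel_1d(t1, t2, sigmas, omegas, ells):
    """Two-component spectral mixture kernel from the slide:
    k(t, t') = sum_k sigma_k^2 cos(omega_k (t - t')) exp(-(t - t')^2 / (2 l_k^2))."""
    d = t1[:, None] - t2[None, :]
    K = np.zeros_like(d)
    for s, w, l in zip(sigmas, omegas, ells):
        K += s**2 * np.cos(w * d) * np.exp(-0.5 * d**2 / l**2)
    return K
```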

Audio subband data

[Figure: SMSE vs. training time and SMSE vs. test time for Chain, Local, FITC, VFE and SSGP; point labels give the block size / number of pseudo-points per configuration.]

Audio filter data

[Figure: SMSE vs. training time and SMSE vs. test time for Chain, Local and FITC on the filtered speech signal.]

Terrain data

Task: filling in missing data.
Data: altitude over a 20 km x 30 km region with 80 missing blocks of 1 km x 1 km, i.e. 200k/40k training/test points; 2D SE kernel.
Evaluation: prediction error vs. training/test time.

[Figure: (a) block graph and complete data, (b) tree inference error, (c) local and FITC inference errors.]

Terrain data (results)

[Figure: SMSE vs. training time and SMSE vs. test time for Tree, Local, FITC, VFE and SSGP on the terrain data.]

Summary

Tree-structured Gaussian process approximation:
- the pseudo-dataset has a tree/chain structure;
- the model is calibrated using a KL divergence;
- inference and learning via Gaussian belief propagation;
- better time/accuracy trade-off compared to FITC, VFE;
- possible extensions: online learning, loopy BP.

References

Hensman, James, Nicolo Fusi, and Neil D. Lawrence (2013). "Gaussian processes for big data". arXiv preprint.
Lázaro-Gredilla, Miguel et al. (2010). "Sparse spectrum Gaussian process regression". Journal of Machine Learning Research 11, pp. 1865-1881.
Qi, Yuan (Alan), Ahmed H. Abdel-Gawad, and Thomas P. Minka (2010). "Sparse-posterior Gaussian processes for general likelihoods". In: UAI. Ed. by Peter Grünwald and Peter Spirtes. AUAI Press.
Seeger, Matthias (2003). "Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations". PhD thesis. School of Informatics, University of Edinburgh.
Snelson, Edward and Zoubin Ghahramani (2006). "Sparse Gaussian processes using pseudo-inputs". In: Advances in Neural Information Processing Systems. MIT Press.
Snelson, Edward and Zoubin Ghahramani (2007). "Local and global sparse Gaussian process approximations". In: International Conference on Artificial Intelligence and Statistics.
Titsias, Michalis K. (2009). "Variational learning of inducing variables in sparse Gaussian processes". In: International Conference on Artificial Intelligence and Statistics.
Tresp, Volker (2000). "A Bayesian committee machine". Neural Computation 12(11).
Urtasun, Raquel and Trevor Darrell (2008). "Sparse probabilistic regression for activity-independent human pose inference". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Thanks!

Bayesian Committee Machine (BCM) (Tresp 2000)

The BCM combines predictions from M estimators, each of which uses a subset of the training points.

Consider M partitions of the training set. By Bayes' rule,
p(f* | y_{B_{1:m}}) = p(f* | y_{B_{1:m-1}}, y_{B_m})
  ∝ p(f*) p(y_{B_m} | f*) p(y_{B_{1:m-1}} | y_{B_m}, f*)
  ≈ p(f*) p(y_{B_m} | f*) p(y_{B_{1:m-1}} | f*)
  ∝ p(f* | y_{B_m}) p(f* | y_{B_{1:m-1}}) / p(f*).    (1)

Applying (1) recursively gives
p(f* | y_{B_{1:M}}) ∝ ∏_{i=1}^M p(f* | y_{B_i}) / p(f*)^{M-1}.

BCM for GP regression (Tresp 2000)

Let p(f*) = N(0, K_{**}) and p(f* | y_{B_i}) = N(μ̂_i, Σ̂_i). Then p(f* | y_{B_{1:M}}) = N(μ̂, Σ̂), where
Σ̂^{-1} = −(M − 1) K_{**}^{-1} + Σ_{i=1}^M Σ̂_i^{-1},
μ̂ = Σ̂ Σ_{i=1}^M Σ̂_i^{-1} μ̂_i.

Cost: O(ND^2), D: partition size.
More test points give a better approximation!
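A small sketch of this combination rule (argument layout is hypothetical):

```python
import numpy as np

def bcm_combine(mus, Sigmas, K_star):
    """BCM combination of per-partition GP predictions at shared test inputs:
    Sigma^{-1} = sum_i Sigma_i^{-1} - (M - 1) K_**^{-1},
    mu = Sigma sum_i Sigma_i^{-1} mu_i."""
    M = len(mus)
    prec = -(M - 1) * np.linalg.inv(K_star)           # prior correction term
    weighted_mean = np.zeros_like(mus[0])
    for mu_i, Sigma_i in zip(mus, Sigmas):
        Pi = np.linalg.inv(Sigma_i)
        prec += Pi
        weighted_mean += Pi @ mu_i
    Sigma = np.linalg.inv(prec)
    return Sigma @ weighted_mean, Sigma
```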
