Tree-structured Gaussian Process Approximations

Tree-structured Gaussian Process Approximations
Thang Bui, joint work with Richard Turner
MLG, Cambridge, 1 July 2014

Outline
1. Introduction
2. Tree-structured GP approximation
3. Experiments
4. Summary

GPs for regression - a quick recap

Given {x_n, y_n}_{n=1}^N with

    y_n = f(x_n) + ε_n,   ε_n ~iid N(0, σ_n²),   f ~ GP(0, k_θ(·, ·)).

[Graphical model: latent function f with values f_1, ..., f_N at the training inputs.]

The posterior is also a GP, with

    m_f(x) = K_xf (K_ff + σ_n² I)^{-1} y,
    k_f(x, x') = k(x, x') - K_xf (K_ff + σ_n² I)^{-1} K_fx,

and the log marginal likelihood used for learning is

    L = log N(y; 0, K_ff + σ_n² I).

Cost: O(N³).
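As a concrete companion to the recap above, here is a minimal NumPy sketch (not part of the talk) of exact GP regression with a squared-exponential kernel: it computes the posterior mean, posterior covariance and log marginal likelihood, and the Cholesky factorisation of K_ff + σ_n² I is the O(N³) step. The kernel parameters and toy data are arbitrary illustrative choices.

import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    # k(x, x') = sigma^2 exp(-(x - x')^2 / (2 ell^2))
    d = x1[:, None] - x2[None, :]
    return sigma**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_regression(x, y, x_star, sigma_n=0.1):
    Kff = se_kernel(x, x)
    Ksf = se_kernel(x_star, x)
    Kss = se_kernel(x_star, x_star)
    A = Kff + sigma_n**2 * np.eye(len(x))
    L = np.linalg.cholesky(A)                              # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K_ff + s^2 I)^{-1} y
    mean = Ksf @ alpha                                     # m_f(x*)
    V = np.linalg.solve(L, Ksf.T)
    cov = Kss - V.T @ V                                    # k_f(x*, x*')
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * len(x) * np.log(2 * np.pi))          # log N(y; 0, K_ff + s^2 I)
    return mean, cov, log_ml

x = np.linspace(-5, 5, 100)
y = np.sin(x) + 0.1 * np.random.randn(100)
mean, cov, log_ml = gp_regression(x, y, np.linspace(-6, 6, 50))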

Prior work

Indirect posterior approximation schemes:
- Introduce a pseudo-dataset {x_m, u_m}_{m=1}^M and remove some dependencies in the prior: FITC, PI(T)C (Snelson and Ghahramani 2006; Snelson and Ghahramani 2007)
- Approximate the prior using M cosine basis functions: SSGP (Lázaro-Gredilla et al. 2010)

Direct posterior approximation schemes:
- Variational free energy approach (Seeger 2003; Titsias 2009), with an SVI extension to handle big data (Hensman et al. 2013)
- Expectation propagation (Qi et al. 2010)

Local approximations (Tresp 2000; Urtasun and Darrell 2008)

Fully independent training conditionals (FITC or SPGP) (Snelson and Ghahramani 2006)

- Introduce a pseudo-dataset {x_m, u_m}_{m=1}^M; the prior becomes p(u, f) = p(u) p(f|u).
- Assume f_i ⊥ f_j | u for all i, j, giving the approximate prior q(u, f) = q(u) ∏_i q(f_i|u).

[Graphical model: pseudo-points u at the top, with f_1, ..., f_N conditionally independent given u.]

- Calibrate the model by minimising KL(p(u, f) || q(u, f)), which gives q(u) = p(u) and q(f_i|u) = p(f_i|u).
- New generative model:

    p(u) = N(u; 0, K_uu),
    p(y|u) = N(y; K_fu K_uu^{-1} u, diag(K_ff - K_fu K_uu^{-1} K_uf) + σ_n² I).

- Cost: O(NM²)
- If we instead assume f_{B_i} ⊥ f_{B_j} | u for blocks i ≠ j, we obtain PI(T)C:

    p(y|u) = N(y; K_fu K_uu^{-1} u, blkdiag(K_ff - K_fu K_uu^{-1} K_uf) + σ_n² I).
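To make the modified generative model concrete, here is a hedged NumPy sketch (an assumed implementation, not the authors' code) of the FITC log marginal likelihood log N(y; 0, Q_ff + diag(K_ff - Q_ff) + σ_n² I), with Q_ff = K_fu K_uu^{-1} K_uf. For brevity it forms the N x N covariance directly; exploiting the low-rank-plus-diagonal structure via the Woodbury identity is what brings the cost down to O(NM²). The kernel(a, b) interface and the pseudo-input locations z are assumed inputs.

import numpy as np

def fitc_log_marginal(x, y, z, kernel, sigma_n):
    # x: (N,) inputs, y: (N,) targets, z: (M,) pseudo-inputs,
    # kernel(a, b): covariance matrix between input sets a and b (assumed signature)
    Kuu = kernel(z, z) + 1e-8 * np.eye(len(z))   # jitter for numerical stability
    Kfu = kernel(x, z)
    kff_diag = np.diag(kernel(x, x))
    L = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(L, Kfu.T)                # V^T V = Q_ff = K_fu K_uu^{-1} K_uf
    qff_diag = np.sum(V**2, axis=0)
    cov = V.T @ V + np.diag(kff_diag - qff_diag + sigma_n**2)
    sign, logdet = np.linalg.slogdet(cov)
    alpha = np.linalg.solve(cov, y)
    return -0.5 * (y @ alpha + logdet + len(y) * np.log(2 * np.pi))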

Variational free energy (VFE) approach (Titsias 2009)

- Augment the model with a pseudo-dataset {x_m, u_m}_{m=1}^M; the joint posterior of f and u is p(f, u|y) ∝ p(f, u) p(y|f).
- Introducing the variational distribution q(f, u) = p(f|u) q(u) gives the ELBO

    F(q(u)) = ∫ du df p(f|u) q(u) log [ p(u) p(y|f) / q(u) ].

- The optimal distribution is

    q(u) = (1/Z) p(u) exp( ∫ df p(f|u) log p(y|f) ),

  and the bound becomes

    F(q(u)) = log N(y; 0, σ_n² I + K_fu K_uu^{-1} K_uf) - (1 / 2σ_n²) Tr(K_ff - K_fu K_uu^{-1} K_uf).

- Cost: O(NM²)
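The collapsed bound above can be evaluated in O(NM²) using the matrix inversion and determinant lemmas; the sketch below (an assumed implementation, not from the talk) does exactly that, with the same assumed kernel(a, b) interface as before.

import numpy as np

def vfe_bound(x, y, z, kernel, sigma_n):
    N, M = len(x), len(z)
    Kuu = kernel(z, z) + 1e-8 * np.eye(M)
    Kfu = kernel(x, z)
    L = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(L, Kfu.T)                          # V^T V = Q_ff
    B = np.eye(M) + V @ V.T / sigma_n**2                   # I + V V^T / s^2
    LB = np.linalg.cholesky(B)
    beta = np.linalg.solve(LB, V @ y) / sigma_n**2
    quad = y @ y / sigma_n**2 - beta @ beta                # y^T (s^2 I + Q_ff)^{-1} y
    logdet = 2 * np.sum(np.log(np.diag(LB))) + N * np.log(sigma_n**2)
    log_gauss = -0.5 * (quad + logdet + N * np.log(2 * np.pi))
    trace_term = (np.sum(np.diag(kernel(x, x))) - np.sum(V**2)) / (2 * sigma_n**2)
    return log_gauss - trace_term                          # F(q(u))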

Example 1

[Figure: exact GP vs. VFE posterior on a smooth 1D dataset, N = 100, M = 10.]

We only need a small M if the underlying function is simple.

Example 2

[Figure: exact GP vs. VFE posterior on a more complicated 1D dataset, N = 100, M = 10.]

M needs to be large if the underlying function is complicated.

Limitations of global approximations

[Figure: 1D dataset spanning a range L, with a function whose lengthscale l is much shorter than L.]

Approximately, we need M ≳ ∏_{d=1}^D L_d / l_d, where L_d and l_d are the data range and lengthscale in dimension d, i.e. M must be large when
- datasets span a large input space, e.g. time-series or spatial datasets;
- the underlying functions have short lengthscales, i.e. lots of wiggles.

O(NM²) is then still expensive! Local approximations may give a better time/accuracy trade-off.
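A tiny illustration of the heuristic above, with made-up numbers: a 2D spatial dataset spanning 30 km by 20 km, modelled with a 1 km lengthscale in each dimension, already calls for on the order of 600 pseudo-points.

ranges = [30.0, 20.0]        # data range L_d in each dimension (illustrative values)
lengthscales = [1.0, 1.0]    # lengthscale l_d in each dimension
M_required = 1.0
for L_d, l_d in zip(ranges, lengthscales):
    M_required *= L_d / l_d
print(M_required)            # 600.0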

Local GPs

- Divide the training set into M disjoint partitions {B_i}_{i=1}^M, where B_i = {x_j, y_j}_{j=1}^{N_i}.

[Graphical model: independent blocks f_{B_1}, ..., f_{B_M}.]

- Obtain the posterior for each partition: p(f_i|y_{B_i}) ∝ p(f_i) p(y_{B_i}|f_i).
- Predict using the posterior of only the partition closest to the test point (see Tresp 2000 for a way to combine predictors):

    p(f*|y_{B_i}) = ∫ df_i p(f*|f_i) p(f_i|y_{B_i}).

- Partitions can have shared or separate hyperparameters.
- Cost: O(ND²), where D is the average size of the partitions.
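A minimal sketch of this scheme for 1D inputs (assumed code, not the authors' implementation): each partition is fitted independently, and a test point is predicted from the partition whose centre is nearest; kernel(a, b) is again an assumed covariance-function interface.

import numpy as np

def fit_local_gps(x, y, assignments, kernel, sigma_n):
    # assignments[i] in {0, ..., M-1}: the partition of training point i
    models = []
    for k in np.unique(assignments):
        xk, yk = x[assignments == k], y[assignments == k]
        L = np.linalg.cholesky(kernel(xk, xk) + sigma_n**2 * np.eye(len(xk)))
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, yk))
        models.append((xk, xk.mean(), alpha))              # inputs, centre, (K + s^2 I)^{-1} y
    return models

def predict_local_mean(x_star, models, kernel):
    means = np.empty(len(x_star))
    for i, xs in enumerate(x_star):
        # use only the closest partition; see Tresp (2000) for combining predictors
        xk, centre, alpha = min(models, key=lambda m: abs(xs - m[1]))
        means[i] = (kernel(np.array([xs]), xk) @ alpha)[0]
    return means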

2. Tree-structured GP approximation

Tree-structured approximation (TSGP)

TSGP is in the same family as FITC and PITC, i.e. indirect approximation via prior modification, but with additional structure: local inducing variables for each partition, and sparse connections between the inducing blocks.

[Graphical models: (a) full GP, (b) FITC, (c) PIC, (d) tree/chain structure with block-specific inducing variables u_{B_1}, ..., u_{B_K}.]

Prior modification

Generative model:

    q(u) = ∏_{k=1}^K q(u_{B_k}|u_{B_l}),   (B_l denotes the parent block of B_k)
    q(f|u) = ∏_{k=1}^K q(f_{B_k}|u_{B_k}),
    p(y|f) = ∏_{n=1}^N p(y_n; f_n, σ_n²).

[Graphical model: parent block u_{B_l} connected to u_{B_k} via (A_k, Q_k); u_{B_k} connected to f_{B_k} via (C_k, R_k).]

The model is calibrated by minimising a forward KL divergence,

    KL( p(f, u) || ∏_k q(f_{B_k}|u_{B_k}) q(u_{B_k}|u_{B_l}) ),

which gives

    q(u_{B_k}|u_{B_l}) = p(u_{B_k}|u_{B_l}) = N(u_{B_k}; A_k u_{B_l}, Q_k),
    q(f_{B_k}|u_{B_k}) = p(f_{B_k}|u_{B_k}) = N(f_{B_k}; C_k u_{B_k}, R_k),

where

    A_k = K_{u_k u_l} K_{u_l u_l}^{-1},   Q_k = K_{u_k u_k} - K_{u_k u_l} K_{u_l u_l}^{-1} K_{u_l u_k},
    C_k = K_{f_k u_k} K_{u_k u_k}^{-1},   R_k = K_{f_k f_k} - K_{f_k u_k} K_{u_k u_k}^{-1} K_{u_k f_k}.
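The conditional-Gaussian parameters above follow directly from standard Gaussian conditioning; a hedged helper (assumed code, for illustration only) that computes them for one block k with parent l might look as follows, with kernel(a, b), block inputs x_f and pseudo-inputs z_k, z_l as assumed arguments.

import numpy as np

def tree_block_params(x_f, z_k, z_l, kernel, jitter=1e-8):
    Kkk = kernel(z_k, z_k) + jitter * np.eye(len(z_k))
    Kll = kernel(z_l, z_l) + jitter * np.eye(len(z_l))
    Kkl = kernel(z_k, z_l)
    Kfk = kernel(x_f, z_k)
    A_k = Kkl @ np.linalg.inv(Kll)          # K_{u_k u_l} K_{u_l u_l}^{-1}
    Q_k = Kkk - A_k @ Kkl.T                 # K_{u_k u_k} - K_{u_k u_l} K_{u_l u_l}^{-1} K_{u_l u_k}
    C_k = Kfk @ np.linalg.inv(Kkk)          # K_{f_k u_k} K_{u_k u_k}^{-1}
    R_k = kernel(x_f, x_f) - C_k @ Kfk.T    # K_{f_k f_k} - K_{f_k u_k} K_{u_k u_k}^{-1} K_{u_k f_k}
    return A_k, Q_k, C_k, R_k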

Inference

- Marginalising out f, the model becomes a tree-structured Gaussian model with latent variables u and observations y.
- Special case: a linear Gaussian state space model for time series or 1D data.
- Joint posterior:

    p(u|y) ∝ ∏_{i ∈ V} exp( -½ u_i^T J_i u_i + u_i^T h_i ) ∏_{(i,j) ∈ E} exp( u_i^T J_ij u_j ),

  where

    J_i = Q_i^{-1} + C_i^T (R_i + σ² I_i)^{-1} C_i + Σ_{j ∈ nei(i)} A_j^T Q_j^{-1} A_j,
    h_i = C_i^T (R_i + σ² I_i)^{-1} y_i,
    J_ij = Q_i^{-1} A_i.

- Use the Gaussian belief propagation algorithm to find the marginal distributions p(u_{B_i}|y).
- Prediction at test points: p(f*|y) = ∫ du_{B_i} p(f*|u_{B_i}) p(u_{B_i}|y).
- Cost: O(ND²), where D is the average number of observations per block.
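As a small-scale illustration of these potentials (assumed code, not the authors' implementation), the sketch below assembles the sparse joint precision and information vector of p(u|y) from (A_k, Q_k, C_k, R_k) and the block observations, then solves the system densely. On a tree, Gaussian belief propagation recovers the same marginals with cost linear in the number of blocks; the dense solve here is only a correctness check. The root block is assumed to use its marginal prior covariance as Q and to have no A.

import numpy as np

def posterior_over_u(A, Q, C, R, y_blocks, sigma_n, edges):
    # A, Q, C, R, y_blocks: lists indexed by block; edges: (child i, parent j) pairs
    K = len(Q)
    sizes = [q.shape[0] for q in Q]
    offs = np.cumsum([0] + sizes)
    J = np.zeros((offs[-1], offs[-1]))
    h = np.zeros(offs[-1])
    parent = dict(edges)
    for i in range(K):
        Si = np.linalg.inv(R[i] + sigma_n**2 * np.eye(R[i].shape[0]))
        Ji = np.linalg.inv(Q[i]) + C[i].T @ Si @ C[i]            # J_i: prior + data terms
        for child, par in edges:                                 # children add A^T Q^{-1} A
            if par == i:
                Ji += A[child].T @ np.linalg.inv(Q[child]) @ A[child]
        J[offs[i]:offs[i+1], offs[i]:offs[i+1]] = Ji
        h[offs[i]:offs[i+1]] = C[i].T @ Si @ y_blocks[i]         # h_i
        if i in parent:
            j = parent[i]
            Jij = np.linalg.inv(Q[i]) @ A[i]                     # J_ij
            J[offs[i]:offs[i+1], offs[j]:offs[j+1]] = -Jij       # off-diagonal precision blocks
            J[offs[j]:offs[j+1], offs[i]:offs[i+1]] = -Jij.T
    mean = np.linalg.solve(J, h)        # posterior mean of u
    cov = np.linalg.inv(J)              # posterior covariance of u
    return mean, cov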

Hyperparameter learning

The log marginal likelihood and its derivatives can be computed using the same message passing algorithm:

    p(y_{1:K}|θ) = ∏_{k=1}^K p(y_k|y_{1:k-1}, θ),
    d/dθ log p(y|θ) = Σ_{k=1}^K [ ⟨ d/dθ log p(u_k|u_l) ⟩_{p(u_k, u_l|y)} + ⟨ d/dθ log p(y_k|u_k) ⟩_{p(u_k|y)} ].

Tree construction:
- start with k-means clustering to find the observation blocks,
- use Kruskal's algorithm to greedily select a tree,
- choose a large random subset of the observations in each block to be pseudo-outputs; no optimisation is needed.
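A possible sketch of this tree-construction recipe (assumed code, not the authors' implementation): k-means over the inputs to form blocks, a minimum spanning tree over the block centres (SciPy's routine plays the role of Kruskal's greedy edge selection), and a random subset of each block's inputs as pseudo-input locations.

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def build_tree(X, n_blocks, n_pseudo_per_block, seed=0):
    # X: (N, D) array of inputs
    rng = np.random.default_rng(seed)
    centres, labels = kmeans2(X, n_blocks, minit='++')
    mst = minimum_spanning_tree(cdist(centres, centres)).toarray()
    edges = [tuple(e) for e in np.argwhere(mst > 0)]     # pairs of connected blocks
    pseudo_inputs = []
    for k in range(n_blocks):
        idx = np.where(labels == k)[0]
        take = min(n_pseudo_per_block, len(idx))
        pseudo_inputs.append(X[rng.choice(idx, size=take, replace=False)])
    return labels, edges, pseudo_inputs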

Comparison of KL minimisations

FITC:  minimises KL( p(f, u) || q(u) ∏_n q(f_n|u) )
       result: q(u) = p(u), q(f_n|u) = p(f_n|u)
PIC:   minimises KL( p(f, u) || q(u) ∏_k q(f_{C_k}|u) )
       result: q(u) = p(u), q(f_{C_k}|u) = p(f_{C_k}|u)
PP:    minimises KL( (1/Z) p(u) p(f|u) q(y|u) || p(f, u|y) )
       result: q(y|u) = N(y; K_fu K_uu^{-1} u, σ_n² I)
VFE:   minimises KL( p(f|u) q(u) || p(f, u|y) )
       result: q(u) ∝ p(u) exp( ⟨log p(y|f)⟩_{p(f|u)} )
EP:    minimises KL( q(f; u) p(y_n|f_n) / q_n(f; u) || q(f; u) )
       result: q(f; u) ∝ p(f) ∏_m p(u_m|f_m)
Tree:  minimises KL( p(f, u) || ∏_k q(f_{B_k}|u_{B_k}) q(u_{B_k}|u_{par(B_k)}) )
       result: q(f_{B_k}|u_{B_k}) = p(f_{B_k}|u_{B_k}), q(u_{B_k}|u_{par(B_k)}) = p(u_{B_k}|u_{par(B_k)})

3. Experiments

Audio data

- Task: filling in missing data.
- Data:
  - Subband of a speech signal: N = 50000, SE kernel k_θ(t, t') = σ² exp( -(t - t')² / (2l²) ).
  - Filtered speech signal: N = 50000, spectral mixture kernel k_θ(t, t') = Σ_{k=1}^2 σ_k² cos(ω_k (t - t')) exp( -(t - t')² / (2 l_k²) ).
- Evaluation: prediction error vs. training/test time.

[Figure: imputation of missing segments. Left: subband data; right: filtered signal. Curves show the true signal and the chain and local predictions.]
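For reference, a small NumPy sketch (assumed code, not from the talk) of the two-component spectral mixture kernel quoted above, with the hyperparameters passed as sequences.

import numpy as np

def spectral_mixture_kernel(t1, t2, sigmas, omegas, ells):
    # k(t, t') = sum_k sigma_k^2 cos(omega_k (t - t')) exp(-(t - t')^2 / (2 l_k^2))
    d = t1[:, None] - t2[None, :]
    K = np.zeros_like(d, dtype=float)
    for s, w, l in zip(sigmas, omegas, ells):
        K += s**2 * np.cos(w * d) * np.exp(-0.5 * (d / l) ** 2)
    return K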

Audio subband data

[Figure: SMSE vs. training time (left) and test time (right) for the chain, local, FITC, VFE and SSGP approximations; markers are labelled by the number of pseudo-points or block sizes.]

Audio filter data

[Figure: SMSE vs. training time (left) and test time (right) for the chain, local and FITC approximations on the filtered signal.]

Terrain data

- Task: filling in missing data.
- Data: altitude over a 20 km x 30 km region, with 80 missing blocks of 1 km x 1 km, giving 200k/40k training/test points; 2D SE kernel.
- Evaluation: prediction error vs. training/test time.

[Figure: (a) the complete data and the tree over blocks, (b)-(c) inference error for the tree, local and FITC approximations.]

Terrain data

[Figure: SMSE vs. training time (left) and test time (right) for the tree, local, VFE, FITC and SSGP approximations on the terrain data.]

Summary

Tree-structured Gaussian process approximation:
- the pseudo-dataset has a tree/chain structure,
- the model is calibrated using a KL divergence,
- inference and learning are performed via Gaussian belief propagation,
- better time/accuracy trade-off compared to FITC and VFE,
- possible extensions: online learning, loopy BP.

References

- Hensman, James, Nicolò Fusi, and Neil D. Lawrence (2013). "Gaussian processes for big data". arXiv preprint arXiv:1309.6835.
- Lázaro-Gredilla, Miguel, et al. (2010). "Sparse spectrum Gaussian process regression". Journal of Machine Learning Research 11, pp. 1865-1881.
- Qi, Yuan (Alan), Ahmed H. Abdel-Gawad, and Thomas P. Minka (2010). "Sparse-posterior Gaussian processes for general likelihoods". In: UAI, ed. by Peter Grünwald and Peter Spirtes. AUAI Press, pp. 450-457.
- Seeger, Matthias (2003). "Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations". PhD thesis. School of Informatics, College of Science and Engineering, University of Edinburgh.
- Snelson, Edward and Zoubin Ghahramani (2006). "Sparse Gaussian processes using pseudo-inputs". In: Advances in Neural Information Processing Systems. MIT Press, pp. 1257-1264.
- Snelson, Edward and Zoubin Ghahramani (2007). "Local and global sparse Gaussian process approximations". In: International Conference on Artificial Intelligence and Statistics, pp. 524-531.
- Titsias, Michalis K. (2009). "Variational learning of inducing variables in sparse Gaussian processes". In: International Conference on Artificial Intelligence and Statistics, pp. 567-574.
- Tresp, Volker (2000). "A Bayesian committee machine". Neural Computation 12(11), pp. 2719-2741.
- Urtasun, Raquel and Trevor Darrell (2008). "Sparse probabilistic regression for activity-independent human pose inference". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8.

Thanks!

Bayesian committee machine (BCM) (Tresp 2000)

BCM combines predictions from M estimators, each of which uses a subset of the training points. Consider M partitions of the training set; by Bayes' rule,

    p(f*|y_{B_{1:m}}) = p(f*|y_{B_{1:m-1}}, y_{B_m})
                      ∝ p(f*) p(y_{B_m}|f*) p(y_{B_{1:m-1}}|y_{B_m}, f*)
                      ≈ p(f*) p(y_{B_m}|f*) p(y_{B_{1:m-1}}|f*)
                      ∝ p(f*|y_{B_m}) p(f*|y_{B_{1:m-1}}) / p(f*).        (1)

Applying (1) recursively gives

    p(f*|y_{B_{1:M}}) ∝ ∏_{i=1}^M p(f*|y_{B_i}) / p(f*)^{M-1}.

BCM for GP regression (Tresp 2000)

Let p(f*) = N(0, K_{**}) and p(f*|y_{B_i}) = N(μ̂_i, Σ̂_i). Then p(f*|y_{B_{1:M}}) = N(μ̂, Σ̂), where

    Σ̂^{-1} = -(M - 1) K_{**}^{-1} + Σ_{i=1}^M Σ̂_i^{-1},
    μ̂ = Σ̂ Σ_{i=1}^M Σ̂_i^{-1} μ̂_i.

Cost: O(ND²), where D is the partition size. More test points give a better approximation!
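A minimal sketch (assumed code, not from the talk) of this combination rule: given the per-partition predictive means and covariances at a common set of test points and the prior covariance there, it returns the BCM mean and covariance.

import numpy as np

def bcm_combine(mus, Sigmas, K_prior):
    # mus: list of (T,) means; Sigmas: list of (T, T) covariances; K_prior: (T, T) prior covariance
    M = len(mus)
    prec = -(M - 1) * np.linalg.inv(K_prior)     # -(M - 1) K_**^{-1}
    rhs = np.zeros(len(mus[0]))
    for mu_i, Sig_i in zip(mus, Sigmas):
        Si = np.linalg.inv(Sig_i)
        prec += Si                               # + sum_i Sigma_i^{-1}
        rhs += Si @ mu_i
    Sigma = np.linalg.inv(prec)
    mu = Sigma @ rhs                             # Sigma * sum_i Sigma_i^{-1} mu_i
    return mu, Sigma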