Sequential Inference for Deep Gaussian Process
1 Sequential Inference for Deep Gaussian Process
Yali Wang, Marcus Brubaker, Brahim Chaib-draa, Raquel Urtasun
Chinese Academy of Sciences / University of Toronto / Laval University
AISTATS 2016
Presented by Lei Yu, Sept 9, 2016
2 Highlight
1. An efficient sequential inference framework for the Deep Gaussian Process (DGP) is proposed.
2. Two DGP extensions:
   - heteroscedasticity (non-stationary noise)
   - multi-task learning
3 Motivation
GPs cannot handle non-stationary functions, e.g. functions with sudden jumps.
Variants of GP that address non-stationarity:
  - warped Gaussian processes [1]
  - product models [2]
  - Deep GPs
Inference for DGPs: variational approximations are commonly used, and the matrix inversion involved in inference leads to O(N^3) complexity.
4 Sequential Inference for Deep GP
5 Deep Gaussian Process
The output is assumed to be multi-dimensional, y ∈ R^{D_y}, and to be generated as follows:
  h_0 = x
  h_i = f_i(h_{i-1}) + v_i,  i = 1, ..., L
  y = f_y(h_L) + v_y
with f_i ~ GP(0, k_{θ_i}) and f_y ~ GP(0, k_{θ_y}).
Figure 1: The graphical model for DGP.
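The generative process above can be simulated directly. Below is a minimal numpy sketch that draws one sample from a DGP prior at a fixed set of inputs, assuming a squared-exponential kernel per layer; the function names, jitter and noise level are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel matrix between the rows of A (N x d) and B (M x d).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_dgp(x, layer_dims, noise_std=0.1, seed=0):
    # Draw one joint sample of h_1, ..., h_L and y from the DGP prior at inputs x.
    rng = np.random.default_rng(seed)
    h, samples = x, []
    for dim in layer_dims:                                  # hidden layers first, output last
        K = rbf_kernel(h, h) + 1e-8 * np.eye(len(h))        # jitter for numerical stability
        chol = np.linalg.cholesky(K)
        f = chol @ rng.standard_normal((len(h), dim))       # f_i ~ GP(0, k)
        h = f + noise_std * rng.standard_normal(f.shape)    # h_i = f_i(h_{i-1}) + v_i
        samples.append(h)
    return samples                                          # [h_1, ..., h_L, y]

x = np.linspace(-2.0, 2.0, 50)[:, None]
h1, h2, y = sample_dgp(x, layer_dims=[1, 1, 1])             # L = 2 hidden layers plus output
```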
6 Sequential Inference for Deep GP
Sequential inference means that, for each training pair (x^n, y^n), the inference of the latent variables consists of two steps:
  - State estimation: sequential Monte Carlo [3] for the latent states S^n = {h^n_{1:L}}.
  - Model updating: the sparse online GP (GP_so) [4] or the Kernel Recursive Least-Squares (KRLS) method [5] for the model parameters M = {M_1, ..., M_L, M_y}, with
      M^{n-1} ≜ {X^{n-1}, μ^{n-1}, Σ^{n-1}, Q^{n-1}}
    where X is the active training set, Q is the inverse covariance matrix of X, and N(μ, Σ) is the posterior over the function values at X.
7 State Estimation
The latent states S^n are estimated via the posterior expectation
  E[S^n] = ∫ S^n p(S^n | x^n, y^n, M^{n-1}, Θ_dgp) dS^n
which is intractable due to the deep architecture.
Sequential Monte Carlo: for k = 1, ..., N_p
  1. Sampling: draw the k-th sample (particle) of S^n from the current DGP model
       h^n_1(k) ~ p(h^n_1 | x^n, M^{n-1}_1, Θ_dgp)
       h^n_i(k) ~ p(h^n_i | h^n_{i-1}(k), M^{n-1}_i, Θ_dgp),  i = 2, ..., L
  2. Weighting:
       ĥ^n_i = Σ_{k=1}^{N_p} w^n(k) h^n_i(k),  with  w^n(k) = ŵ^n(k) / Σ_{k=1}^{N_p} ŵ^n(k)
     where the unnormalized weight ŵ^n(k) is determined by the likelihood in the observation layer:
       ŵ^n(k) = p(y^n | h^n_L(k), M^{n-1}_y, Θ_dgp)
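A compact sketch of one such SMC pass for a single training pair is given below; the two callables stand in for the layer-wise GP predictives of the current model M^{n-1} and are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def smc_state_estimation(x_n, sample_layer, obs_loglik, L, n_particles=100):
    # One SMC pass for a single training pair (x^n, y^n).
    #   sample_layer(i, h_prev): one sample of h_i given h_{i-1} under M_i^{n-1} (assumed callable)
    #   obs_loglik(h_L):         log p(y^n | h_L, M_y^{n-1})                     (assumed callable)
    particles, logw = [], np.empty(n_particles)
    for k in range(n_particles):
        h, states = x_n, []
        for i in range(1, L + 1):
            h = sample_layer(i, h)                   # forward-sample h_i^n(k)
            states.append(h)
        particles.append(states)
        logw[k] = obs_loglik(states[-1])             # weight by the observation-layer likelihood
    w = np.exp(logw - logw.max())
    w /= w.sum()                                     # normalized importance weights w^n(k)
    # Weighted estimate of each latent layer: hhat_i^n = sum_k w^n(k) h_i^n(k)
    estimates = [sum(w[k] * particles[k][i] for k in range(n_particles)) for i in range(L)]
    return estimates, w
```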
8 Model Updating
Once the states of the i-th layer are available, the input-output training pair (h^n_{i-1}, h^n_i) for a single GP is obtained, and an online update from M^{n-1}_i to M^n_i is required (the layer index is omitted below):
  M^{n-1} ≜ {X^{n-1}, μ^{n-1}, Σ^{n-1}, Q^{n-1}}  →  M^n ≜ {X^n, μ^n, Σ^n, Q^n}
The task is fulfilled in two steps:
  1. Update the posterior distribution p(f^n | X^n) based on p(f^{n-1} | X^{n-1}) = N(f^{n-1} | μ^{n-1}, Σ^{n-1}).
  2. Update the active set of training data X.
The Kernel Recursive Least-Squares (KRLS) method is exploited [5].
9 KRLS: Updating posterior (1)
When the n-th data point is available, update X^n_l = X^{n-1}_l ∪ (h^n_{l-1}, h^n_l). Then, according to Bayes' rule:
  p(f^n | X^n) = p(f^{n-1}, f̃^n | X^{n-1}, h^n) = p(h^n | f̃^n) p(f̃^n | f^{n-1}) p(f^{n-1} | X^{n-1}) / p(h^n | X^{n-1})
Observation model, GP prior and conditioned GP prior:
  p(h^n | f̃^n) = N(h^n | f̃^n, σ^2)
  p(f^{n-1} | X^{n-1}) = N(f^{n-1} | μ^{n-1}, Σ^{n-1})
  p(f̃^n | f^{n-1}) = N(f̃^n | q_n^T f^{n-1}, γ_n^2)
Evidence:
  p(h^n | X^{n-1}) = ∫∫ p(h^n | f̃^n) p(f̃^n | f^{n-1}) p(f^{n-1} | X^{n-1}) df^{n-1} df̃^n
                   = N(h^n | q_n^T μ^{n-1}, σ^2 + γ_n^2 + q_n^T Σ^{n-1} q_n),   with η_n^2 ≜ σ^2 + γ_n^2 + q_n^T Σ^{n-1} q_n
10 KRLS: Updating posterior (2)
  p(f^n | X^n) = N(f^n | μ^n, Σ^n)    (1)
where
  μ^n = [μ^{n-1}; q_n^T μ^{n-1}] + ((h^n - q_n^T μ^{n-1}) / η_n^2) ψ_n
  Σ^n = [Σ^{n-1}, Σ^{n-1} q_n; q_n^T Σ^{n-1}, γ_n^2 + q_n^T Σ^{n-1} q_n] - (1 / η_n^2) ψ_n ψ_n^T
  Q^n = [Q^{n-1}, 0; 0^T, 0] + (1 / γ_n^2) [q_n; -1] [q_n; -1]^T
with
  q_n = Q^{n-1} k_{n-1}(h^n)
  γ_n^2 = k(h^n, h^n) - k_{n-1}^T(h^n) q_n
  ψ_n = [Σ^{n-1} q_n; γ_n^2 + q_n^T Σ^{n-1} q_n]
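A numpy sketch of this growing step for a single scalar-output layer is shown below, assuming a kernel function and the stored quadruple (X, μ, Σ, Q); it follows the equations above but is an illustrative reconstruction, not the authors' code.

```python
import numpy as np

def krls_grow(X, mu, Sigma, Q, h_in, h_out, kernel, sigma2):
    # Grow the posterior N(mu, Sigma) and the inverse kernel matrix Q with one new
    # input h_in (vector) and noisy scalar target h_out, following the update above.
    k_vec = kernel(X, h_in[None, :]).ravel()               # k_{n-1}(h^n)
    k_new = kernel(h_in[None, :], h_in[None, :]).item()    # k(h^n, h^n)
    q = Q @ k_vec
    gamma2 = max(k_new - k_vec @ q, 1e-12)                 # prior variance of the new basis
    psi = np.append(Sigma @ q, gamma2 + q @ Sigma @ q)
    eta2 = sigma2 + psi[-1]                                # predictive variance of h_out
    pred = q @ mu                                          # predictive mean q_n^T mu^{n-1}
    mu_new = np.append(mu, pred) + (h_out - pred) / eta2 * psi
    Sigma_new = np.block([[Sigma, (Sigma @ q)[:, None]],
                          [(Sigma @ q)[None, :], np.array([[psi[-1]]])]]) - np.outer(psi, psi) / eta2
    v = np.append(q, -1.0)                                 # block-inverse rank-one update of Q
    Q_new = np.block([[Q, np.zeros((len(q), 1))],
                      [np.zeros((1, len(q))), np.zeros((1, 1))]]) + np.outer(v, v) / gamma2
    return np.vstack([X, h_in]), mu_new, Sigma_new, Q_new
```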
11 KRLS: Updating active dataset without info loss
Consider the posterior of f^n given the data set X^n:
  p(f^n | X^n) = p(f^{n-1}, f̃^n | X^n) = p(f^{n-1} | X^n) p(f̃^n | f^{n-1}, X^n)
Only the second factor involves f̃^n, so f̃^n can be removed if it is non-informative, i.e. if p(f̃^n | f^{n-1}, X^n) = p(f̃^n | f^{n-1}).
By Bayes' rule,
  p(f̃^n | f^{n-1}, X^n) = p(y^n | f̃^n) p(f̃^n | f^{n-1}) / p(y^n | f^{n-1})
with p(y^n | f̃^n) = N(f̃^n, σ^2 I) and p(y^n | f^{n-1}) = N(q_n^T f^{n-1}, (σ^2 + γ_n^2) I);
the required equality holds when γ_n^2 = 0 and f̃^n = q_n^T f^{n-1}.
Hence, if γ_n^2 = 0, the basis f̃^n can be removed without any loss of information, and the active set is updated as X^n_l = X^n_l \ (h^n_{l-1}, h^n_l).
12 KRLS: Updating active dataset with info loss
Once the number of bases in the dictionary grows beyond a predefined budget N_AC, one basis must be removed. Removing the i-th basis f^n_i affects the approximation, which is measured by the KL divergence
  KL( p(f^n | X^n) || p([f^{n-1}]_{¬i} | X^n) p(f̃^n_i | [f^{n-1}]_{¬i}) )
For the exact distribution, the mean of f^n is μ^n; for the approximate distribution, the mean of [f^{n-1}]_{¬i} remains unchanged, i.e. [μ^n]_{¬i}, while the mean of f^n_i becomes q_n^T [μ^n]_{¬i}.
The basis to remove is selected via
  E_i = μ^n_i - q_n^T [μ^n]_{¬i} = [Q^n μ^n]_i / Q^n_{i,i},   k = arg min_i (E_i^2)
13 KRLS: Re-Updating after Removing Basis
The model parameters are then updated simply via
  μ^n ← [μ^n]_{¬k}
  Σ^n ← [Σ^n]_{¬k,¬k}
  Q^n ← [Q^n]_{¬k,¬k} - [Q^n]_{¬k,k} [Q^n]^T_{¬k,k} / [Q^n]_{k,k}
and the active set is updated as X^n_l = X^n_l \ (h^k_{l-1}, h^k_l).
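The selection and removal steps of the two slides above can be sketched together in a few lines of numpy; the score E_i and the downdates mirror the formulas, but the function and variable names are illustrative assumptions.

```python
import numpy as np

def krls_prune(X, mu, Sigma, Q):
    # Remove the least informative basis once the dictionary exceeds the budget N_AC,
    # using the squared error measure E_i = [Q mu]_i / Q_ii, then downdate mu, Sigma, Q.
    scores = (Q @ mu) / np.diag(Q)
    k = int(np.argmin(scores ** 2))                        # basis causing the smallest loss
    keep = np.arange(len(mu)) != k
    q_col = Q[keep, k]
    Q_new = Q[np.ix_(keep, keep)] - np.outer(q_col, q_col) / Q[k, k]
    return X[keep], mu[keep], Sigma[np.ix_(keep, keep)], Q_new, k
```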
14 Overall algorithm
Algorithm 1 Sequential inference for DGP
Input:
  The n-th data pair: (x^n, y^n)
  The DGP model parameters at the (n-1)-th step: M^{n-1} = {M^{n-1}_1, ..., M^{n-1}_L, M^{n-1}_y}
Output:
  Estimates of the latent states at the n-th step: Ŝ^n = {ĥ^n_{1:L}}
  The DGP model parameters at the n-th step: M^n = {M^n_1, ..., M^n_L, M^n_y}
State Estimation:
  Draw N_p samples of S^n from (7) & (8)
  Weight the samples by (9) & (11)
  Estimate Ŝ^n by (10)
Model Update:
  Update M^{n-1} to M^n by using GP_so with (x^n, ĥ^n_1), ..., (ĥ^n_{L-1}, ĥ^n_L), (ĥ^n_L, y^n)
For the n-th training data point, the complexity (including sampling and model updating) is O(N_AC^2). The overall complexity is O(N_AC^2 N), which is much less than the O(N^3) of a full GP.
15 Prediction and Hyperparameter Learning
Given a trained DGP model M = {M_1, ..., M_L, M_y} and a new input x*, particles are propagated through the layers:
  h_1(k) ~ p(h_1 | x*, M_1, Θ_dgp)
  h_i(k) ~ p(h_i | h_{i-1}(k), M_i, Θ_dgp),  i = 2, ..., L
The predictive distribution of y is a Gaussian mixture:
  p(y) ≈ (1 / N_p) Σ_{k=1}^{N_p} p(y | h_L(k), M_y, Θ_dgp)
The hyperparameters Θ_dgp of each layer can be learned by minimizing the negative log likelihood.
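A sketch of the prediction step, again with the layer-wise predictives abstracted as callables (an assumption of the sketch, not the paper's API); it returns the mixture mean and the total variance of the equally weighted Gaussian mixture.

```python
import numpy as np

def dgp_predict(x_star, sample_layer, predict_output, L, n_particles=100):
    # Propagate particles through the trained layers and average the output-layer predictions.
    #   predict_output(h_L) -> (mean, var) of p(y | h_L, M_y)   (assumed callable)
    means, variances = [], []
    for _ in range(n_particles):
        h = x_star
        for i in range(1, L + 1):
            h = sample_layer(i, h)
        m, v = predict_output(h)
        means.append(m)
        variances.append(v)
    means, variances = np.array(means), np.array(variances)
    y_hat = means.mean(axis=0)                              # mixture mean
    y_var = variances.mean(axis=0) + means.var(axis=0)      # law of total variance for the mixture
    return y_hat, y_var
```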
16 Sequential DGP Extensions
17 Heteroscedastic Learning
Homoscedastic noise, i.e. σ_y is a constant:
  y = f_y(h_L) + v_y
Heteroscedastic noise, i.e. σ_y depends on the input:
  s_α = f_α(h_L) + v_α,   s_β = f_β(h_L) + v_β
  p(y | s_α, s_β) = N(y | s_α, exp(s_β))
Figure 2: Heteroscedastic extension of DGP.
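Within the sequential scheme, the main change this makes to the particle weighting step is the observation likelihood; a small sketch of the log weight under the assumed diagonal heteroscedastic Gaussian is given below (names are illustrative).

```python
import numpy as np

def hetero_log_weight(y_n, s_alpha, s_beta):
    # Log particle weight under the heteroscedastic likelihood N(y | s_alpha, exp(s_beta)),
    # summed over output dimensions (weighting step only; inputs are per-particle samples).
    var = np.exp(s_beta)
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y_n - s_alpha) ** 2 / var)
```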
18 Multi-Task Learning
The deep network structure of DGP allows correlations between outputs to be represented by sharing the latent layers.
Multi-task learning is straightforward in the sequential inference scheme:
  - The unnormalized weight is computed using only the observed dimensions y^n_obs of the output (see the sketch after this slide):
      ŵ^n(k) = p(y^n_obs | h^n_L(k), M^{n-1}_y, Θ_dgp)
  - Missing dimensions are estimated using the same procedure as the latent states.
  - The model parameters M^n_y are estimated using both the observed and the estimated outputs.
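A sketch of that masked weighting step, assuming a diagonal Gaussian observation layer; the boolean mask and the names are assumptions of the sketch.

```python
import numpy as np

def masked_log_weight(y_n, mean, var, observed):
    # Log particle weight using only the observed output dimensions (boolean mask `observed`);
    # missing dimensions simply do not contribute to the weight.
    m = observed
    return -0.5 * np.sum(np.log(2.0 * np.pi * var[m]) + (y_n[m] - mean[m]) ** 2 / var[m])
```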
19 Further possible extensions
A novel and efficient inference framework for DGP is proposed. Further possible extensions might be:
  1. Exploiting deep learning techniques, such as ReLU activation functions and dropout, inside the framework.
  2. Sequential DGP with other network structures, e.g. CNNs, RNNs, LSTMs.
20 Experiments
21 Experiments
ARD kernel (a numpy sketch follows after this slide):
  k(x, x') = σ_ker^2 exp(-0.5 Σ_{i=1}^d c_i (x_i - x'_i)^2)
Evaluation:
  - Root mean squared error (RMSE)
  - Mean negative log probability (MNLP)
  - Average training time (seconds)
Abbreviations:
  - Sequential DGP: DGP_seq
  - Heteroscedastic sequential DGP: HDGP_seq
  - Variational inference DGP: DGP_var
  - Sparse online GP: GP_so
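For completeness, a small numpy version of the ARD kernel above (an illustrative sketch; c holds the per-dimension relevance weights c_i).

```python
import numpy as np

def ard_kernel(X1, X2, sigma_ker2=1.0, c=None):
    # ARD squared-exponential kernel: k(x, x') = sigma_ker^2 exp(-0.5 * sum_i c_i (x_i - x'_i)^2).
    if c is None:
        c = np.ones(X1.shape[1])
    d2 = (((X1[:, None, :] - X2[None, :, :]) ** 2) * c).sum(-1)
    return sigma_ker2 * np.exp(-0.5 * d2)
```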
22 DGP as Deep Dynamical Prior: unsupervised learning
The input is the time step and the output is the multi-dimensional observation.
Datasets (time sequences): motion, flu and stock.
  - Training size: 1500 / 300 / 1000 for the motion / flu / stock datasets
  - Layers: L = 2
  - Latent dimension: D = 5
  - Number of samples: N_p = 100
  - Budget size: N_AC = 100 / 100 / 200 respectively for the three datasets
23 Performance w.r.t. Latent Layer Number L
24 Performance w.r.t. Layer Dimensionality D
25 Performance w.r.t. Number of Particles N p
26 Performance w.r.t. Size of Active Set N AC
27 Qualitative Results on the motion dataset
Figure 3: DGP as Deep Dynamical Prior (the motion dataset). Particle estimation of y at time steps 700, 1200, 1570 and 2070, where the number of particles N_p is 100.
28 Dimensionality Reduction
Low-dimensional latent representations on the Frey face dataset: input and output are the noisy face images (noise std = 0.1 and 0.3). Compared models: our DGP_seq and DGP_var with (L=1, D=2), (L=1, D=3) and (L=2, D=2).
Figure 8: Dimensionality Reduction for Image Reconstruction (Face Data): the reconstruction error (RMSE) as a function of the number of episodes. Note that there is no sequential inference & hyperparameter learning episode in DGP_var; the results of DGP_var are shown as lines. The error bars are mean ± standard deviation.
Figure 9: Dimensionality Reduction for Image Reconstruction (Face Data): the reconstruction image comparison (ground truth, noisy observation (std=0.1), DGP_var (L=1, D=3), our DGP_seq at the 1st and 5th episodes) at the time steps 50:50:1900.
29 Heteroscedastic Learning Result on Motorcycle
Dynamical prior estimation on the Motorcycle dataset: 94 time/acceleration pairs.
Figure 10: Heteroscedastic GP Modeling (Posterior of the Motorcycle Data). In plot (a), the observations are black crosses, the means of our HGP_seq / GP are pink / blue lines, and the 95% confidence intervals of our HGP_seq / GP are pink / blue dashed lines.
Table 4: Comparison with state-of-the-art heteroscedastic GPs: GP, NGP, WGP, MLHGP, VHGP, our HGP_seq.
30 Missing Data in Multi-task Learning (1)
Dynamical prior estimation on the motion, flu and stock datasets (a small portion of the observations is missing).
Table 1: Missing Data in Multi-Task Learning: RMSE(%) of our DGP_seq / our HDGP_seq is 1.0 / 1.1 on motion, 4.0 / 3.0 on flu, and 6.4 / 6.6 on stock.
31 Missing Data in Multi-task Learning (2)
Dynamical prior estimation on the Jura dataset, with input the 2D location and output the measurements of cadmium, nickel and zinc concentrations. The total number of data pairs is 359, where 100 measurements of cadmium are missing.
Table 5: Comparison with state-of-the-art multi-task GPs (the Jura dataset): MAE and training time for GP, CMOGP, SLFM, GPRN, GPRN-NPV1 and our DGP_seq. The results of GP, CMOGP, SLFM, GPRN are from [12].
Abbreviations: MAE - mean absolute error; CMOGP - convolved multiple outputs GP; SLFM - semiparametric latent factor model; GPRN - GP regression networks; GPRN-NPV1 - GPRN with nonparametric variational inference (mode 1).
32 DGP for Supervised Learning (1)
Regression on the Parkinsons Telemonitoring dataset, a 6-month biomedical voice recording from 42 people with early-stage Parkinson's disease, with a total of 5875 input/output pairs for training and 875 for testing.
Parameter setting: L = 2, D = 2, N_p = 500, N_AC = 100.
Table 2: DGP as Regressor (Parkinsons Data): prediction error and training efficiency for GP_so, DGP_var and our DGP_seq; RMSE 9.33, 9.26 and 8.86 respectively.
33 DGP for Supervised Learning (2)
Classification on the Tumor gene expression dataset and the MNIST digits dataset.
  - Tumor: L = 3, D = 12, N_p = 200, N_AC = 144
  - MNIST: L = 1, D = 400, N_p = 100, N_AC = 1000
Table 3: Classification accuracy and training time for GP_so, DGP_var, DVI-GPLVM and our DGP_seq on Tumor and MNIST. For MNIST, the results of DVI-GPLVM are from [7]. (*) Distributed implementation using an unreported number of cores.
DVI-GPLVM - GPLVM with distributed variational inference.
34 References
[1] M. Lázaro-Gredilla, "Bayesian warped Gaussian processes," in Advances in Neural Information Processing Systems, 2012.
[2] R. P. Adams and O. Stegle, "Gaussian process product models for nonparametric nonstationarity," in Proceedings of the 25th International Conference on Machine Learning, 2008.
[3] O. Cappé, S. J. Godsill, and E. Moulines, "An overview of existing methods and recent advances in sequential Monte Carlo," Proceedings of the IEEE, vol. 95, no. 5.
[4] L. Csató and M. Opper, "Sparse on-line Gaussian processes," Neural Computation, vol. 14, no. 3.
[5] S. Van Vaerenbergh, M. Lázaro-Gredilla, and I. Santamaría, "Kernel recursive least-squares tracker for time-varying regression," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 8, 2012.