Doubly Stochastic Inference for Deep Gaussian Processes. Hugh Salimbeni, Department of Computing, Imperial College London


1 Doubly Stochastic Inference for Deep Gaussian Processes. Hugh Salimbeni, Department of Computing, Imperial College London. 29/5/2017

2-6 Motivation
- DGPs promise much, but are difficult to train
- Fully factorized VI doesn't work well
- We seek a variational approach that works and scales
- Other recently proposed schemes [1, 2, 5] make additional approximations and require more machinery than VI

7 Talk outline
1. Summary: Model, Inference, Results
2. Details: Model, Inference, Results
3. Questions

8-10 Model
We use the standard DGP model, with one addition: we include a linear (identity) mean function for all the internal layers (1D example in [4]).

11-15 Inference
- We use the model conditioned on the inducing points as a conditional variational posterior
- We impose Gaussians on the inducing points (independent between layers but full rank within layers)
- We use sampling to deal with the intractable expectation
- We never compute N x N matrices (we make no additional simplifications to the variational posterior)

16-21 Results
- We show significant improvement over single layer models on large (~10^6) and massive (~10^9) data
- Big jump in improvement over a single layer GP with 5x the number of inducing points
- On small data we never do worse than the single layer model, and often better
- We can get 98.1% on MNIST with only 100 inducing points
- We surpass all permutation invariant methods on rectangles-images (designed to test deep vs shallow architectures)
- Identical model/inference hyperparameters for all our models

22-24 Details: The Model
We use the standard DGP model, with a linear mean function for all the internal layers:
- If dimensions agree we use the identity, otherwise PCA (a sketch of this choice follows below)
- Sensible alternative: initialize the latents to the identity (but the linear mean function works better)
- Not so sensible alternative: random initialization. Doesn't work well (the posterior is (very) multimodal)
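A minimal sketch of that mean-function choice, assuming a NumPy setting; the function name and PCA construction are illustrative assumptions, not the talk's code. The mean is the identity when the layer's input and output dimensions agree, and otherwise a fixed PCA projection of the layer inputs.

```python
import numpy as np

def make_linear_mean_function(dim_in, dim_out, X=None):
    """Linear mean function for an internal DGP layer (illustrative sketch).

    If the dimensions agree, use the identity; otherwise project with the top
    principal directions of the layer inputs X (an N x dim_in array).
    """
    if dim_in == dim_out:
        W = np.eye(dim_in)
    else:
        Xc = X - X.mean(axis=0)                       # centre the inputs
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        W = Vt[:dim_out].T                            # dim_in x dim_out PCA projection
    return lambda H: H @ W                            # m(H) = H W
```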

25 The DGP: Graphical Model
(Figure: graphical model of a three-layer DGP. Inputs X and inducing inputs Z^0 feed the first-layer function f^1 with inducing outputs u^1, producing hidden values h^1; h^1 and Z^1 feed f^2 and u^2, producing h^2; h^2 and Z^2 feed f^3 and u^3, which produce y.)

26 The DGP: Density
p(y, \{h^l, f^l, u^l\}_{l=1}^L) = \underbrace{\prod_{i=1}^N p(y_i \mid f_i^L)}_{\text{likelihood}} \; \underbrace{\prod_{l=1}^L p(h^l \mid f^l)\, p(f^l \mid u^l; h^{l-1}, Z^{l-1})\, p(u^l; Z^{l-1})}_{\text{DGP prior}}
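To make the density concrete, here is a hedged sketch of ancestral sampling from this prior: each layer is a GP draw evaluated at the noisy outputs of the layer below, with the identity mean on a layer whenever the dimensions allow it. The kernel choice, noise level, and function names are illustrative assumptions, not the talk's code.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def sample_dgp_prior(X, layer_dims, noise_std=0.01, jitter=1e-6):
    """Draw one ancestral sample of h^1, ..., h^L from the DGP prior at inputs X."""
    H = X
    for dim in layer_dims:
        K = rbf(H, H) + jitter * np.eye(len(H))       # prior covariance at current inputs
        F = np.linalg.cholesky(K) @ np.random.randn(len(H), dim)
        if H.shape[1] == dim:
            F = F + H                                  # identity mean function on this layer
        H = F + noise_std * np.random.randn(*F.shape)  # h^l = f^l + Gaussian noise
    return H
```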

27 Factorised Variational Posterior
(Figure: the graphical model with the fully factorised posterior overlaid. Each layer's inducing outputs get an independent Gaussian, q(u^l) = N(u^l | m^l, S^l), and each hidden value gets an independent Gaussian, q(h^l) = \prod_i N(h_i^l \mid \mu_i^l, (\sigma_i^l)^2).)

28 Our Variational Posterior
(Figure: the graphical model with our posterior overlaid. Gaussians q(u^l) = N(u^l | m^l, S^l) are placed on the inducing outputs only; the f^l and h^l keep their conditional dependence on the layer below, exactly as in the model.)

29-32 Recap: GPs for Big Data [3]
q(f, u) = p(f \mid u; X, Z)\, N(u \mid m, S)
Marginalise u from the variational posterior:
\int p(f \mid u; X, Z)\, N(u \mid m, S)\, du = N(f \mid \tilde{\mu}, \tilde{S}) =: q(f \mid m, S; X, Z)   (1)
Define the following mean and covariance functions:
\mu_{m,Z}(x_i) = m(x_i) + a(x_i)^T (m - m(Z)),
S_{S,Z}(x_i, x_j) = k(x_i, x_j) - a(x_i)^T (k(Z, Z) - S)\, a(x_j),
where a(x_i) = k(x_i, Z)\, k(Z, Z)^{-1}.
With these functions, [\tilde{\mu}]_i = \mu_{m,Z}(x_i) and [\tilde{S}]_{ij} = S_{S,Z}(x_i, x_j).

33 Recap: GPs for Big Data [3], cont.
Key idea: the f_i marginals of q(f, u) = p(f \mid u; X, Z)\, N(u \mid m, S) depend only on the inputs x_i (and the variational parameters).
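These marginals are cheap to compute. A hedged NumPy sketch for a single-output layer, following Eq. (1) and the definitions above; the helper names `kern` and `mean_fn` are assumptions, not a particular library's API.

```python
import numpy as np

def svgp_marginals(Xnew, Z, m, S, kern, mean_fn):
    """Per-point mean and variance of q(f | m, S; Xnew, Z).

    `kern(A, B)` returns the kernel matrix between the rows of A and B, and
    `mean_fn(A)` the prior mean at A; both are illustrative, assumed helpers.
    """
    Kzz = kern(Z, Z) + 1e-6 * np.eye(len(Z))           # jittered k(Z, Z)
    Kxz = kern(Xnew, Z)
    A = np.linalg.solve(Kzz, Kxz.T).T                  # rows are a(x_i)^T = k(x_i, Z) k(Z, Z)^{-1}
    mu = mean_fn(Xnew) + A @ (m - mean_fn(Z))          # mu_{m,Z}(x_i)
    kxx = np.diag(kern(Xnew, Xnew))
    var = kxx - np.einsum('nm,mk,nk->n', A, Kzz - S, A)  # S_{S,Z}(x_i, x_i)
    return mu, var
```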

34 Our Variational Posterior
(Figure repeated from slide 28.)

35-37 Our approach
- We can marginalise all the u^l from our posterior
- The result for the l-th layer is q(f^l \mid m^l, S^l; h^{l-1}, Z^{l-1})
- The fully coupled (both between and within layers) variational posterior is
\prod_{l=1}^L p(h^l \mid f^l)\, q(f^l \mid m^l, S^l; h^{l-1}, Z^{l-1})

38 But what about the i-th marginals?
Since at each layer the i-th marginal depends only on the i-th component of the layer below, we have
q(\{f_i^l, h_i^l\}_{l=1}^L) = \prod_{l=1}^L p(h_i^l \mid f_i^l)\, q(f_i^l \mid m^l, S^l; h_i^{l-1}, Z^{l-1})   (2)

39 The lower bound
Since our variational posterior matches the model everywhere except at the inducing points, the bound is
\mathcal{L} = \sum_i \mathbb{E}_{q(\{f_i^l, h_i^l\}_{l=1}^L)} \big[\log p(y_i \mid f_i^L)\big] - \sum_{l=1}^L \mathrm{KL}\big(q(u^l) \,\|\, p(u^l)\big)
The analytic marginalisation over all the inner layers, q(f_i^L), is intractable, but we can draw samples ancestrally.

40-43 Sampling from the variational posterior
- Each layer is Gaussian, given the layer below
- We draw samples using unit Gaussians \epsilon \sim N(0, 1) at each layer
- For \hat{f}_i^l, take the mean and variance from q(f_i^l \mid m^l, S^l; \hat{h}_i^{l-1}, Z^{l-1}):
\hat{f}_i^l = \mu_{m^l, Z^{l-1}}(\hat{h}_i^{l-1}) + \epsilon \sqrt{S_{S^l, Z^{l-1}}(\hat{h}_i^{l-1}, \hat{h}_i^{l-1})}
where for the first layer \hat{h}_i^0 = x_i
- Just add noise to get the \hat{h}_i^l
- The whole sampling process is differentiable wrt the variational parameters (sketched below)
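A hedged sketch of that ancestral sampling, assuming each layer exposes a `marginals(H)` method returning the per-point mean and variance of q(f^l | m^l, S^l; h^{l-1}, Z^{l-1}) (for example, built from the `svgp_marginals` sketch above). The interface is ours, not the authors' API.

```python
import numpy as np

def sample_through_layers(X, layers, rng=np.random):
    """Propagate each input through the layers with the reparameterisation trick.

    At every layer draw eps ~ N(0, 1) and set f_hat = mu + eps * sqrt(var), so
    the whole sample is differentiable wrt the variational parameters.
    """
    H = X
    for layer in layers:
        mu, var = layer.marginals(H)      # per-point means and variances, shape (N, D_l)
        eps = rng.standard_normal(mu.shape)
        H = mu + eps * np.sqrt(var)       # \hat f^l, fed to the layer above
    return H                              # sampled f^L at the inputs
```

The layer noise (the \hat{h}^l step) can be added in the same way, by inflating `var` with the layer's noise variance before drawing.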

44 The second source of stochasticity
- We sample the bound in minibatches (see the sketch below)
- Linear scaling in N
- Can be used when only streaming is possible (50 GB datasets)
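Putting the two sources of stochasticity together gives an unbiased estimate of the bound. A hedged sketch, reusing `sample_through_layers` from above and assuming hypothetical helpers `log_lik` for log p(y_i | f_i^L) and `kl_term` for the summed KL divergences.

```python
def elbo_estimate(X_batch, y_batch, N_total, layers, log_lik, kl_term, n_samples=1):
    """Doubly stochastic estimate of the lower bound on one minibatch.

    Source 1: Monte Carlo samples of the layer outputs (reparameterised).
    Source 2: the minibatch, with the likelihood term rescaled by N / |batch|.
    """
    batch_size = len(X_batch)
    lik = 0.0
    for _ in range(n_samples):
        f_top = sample_through_layers(X_batch, layers)
        lik += log_lik(y_batch, f_top).sum() / n_samples
    return (N_total / batch_size) * lik - kl_term(layers)
```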

45-48 Inference recap
- We use the full model as a variational posterior, conditioned on the inducing points
- We use Gaussians for the inducing points
- The lower bound requires only the posterior marginals
- We estimate the bound by Monte Carlo, using samples from the posterior marginals

49 Results (1): UCI
(Results figure on the UCI regression benchmarks; not reproduced in this transcription.)

50 Code Demo
master/demos/demo_regression_uci.ipynb

51 Results (2): Large and Massive Data
(Table of test RMSE comparing SGP, SGP 500, and DGP 2-5 on the year, airline (N = 700K), and taxi (N = 1B) datasets; the numerical values are not reproduced in this transcription.)

52 Thanks for listening. Questions?

53 References
[1] T. D. Bui, D. Hernández-Lobato, Y. Li, J. M. Hernández-Lobato, and R. E. Turner. Deep Gaussian Processes for Regression using Approximate Expectation Propagation. ICML, 2016.
[2] K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Practical Learning of Deep Gaussian Processes via Random Fourier Features. arXiv preprint, 2016.
[3] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence, 2013.
[4] M. Lázaro-Gredilla. Bayesian Warped Gaussian Processes. Advances in Neural Information Processing Systems, 2012.
[5] Y. Wang, M. Brubaker, B. Chaib-draa, and R. Urtasun. Sequential Inference for Deep Gaussian Process. Artificial Intelligence and Statistics, 2016.
