Probabilistic & Bayesian deep learning. Andreas Damianou


1 Probabilistic & Bayesian deep learning Andreas Damianou Amazon Research Cambridge, UK Talk at University of Sheffield, 19 March 2019

2 In this talk. Not in this talk: CRFs, Boltzmann machines, ...

4 A standard neural network. [Diagram: inputs x_1, ..., x_q pass through basis functions φ_1, ..., φ_q, weighted by w_1, ..., w_q, into the output Y.] Define φ_j = φ(x_j) and f_j = w_j φ_j (ignore bias for now). Once we've fixed all the w's with back-prop, f (and the whole network) becomes deterministic. What does that imply?

5 Trained neural network is deterministic. Implications? Generalization: overfitting occurs, and we need ad-hoc regularizers (dropout, early stopping, ...). Data generation: a model which generalizes well should also understand, or even be able to create ("imagine"), variations. No predictive uncertainty: uncertainty needs to be propagated across the model to be reliable.

6 Need for uncertainty: reinforcement learning, critical predictive systems, active learning, semi-automatic systems, scarce data scenarios, ...

7-8 BNN with priors on its weights. [Diagrams: the same network as before, with inputs x_1, ..., x_q and features φ_1, ..., φ_q mapping to Y; the point weights w_1, ..., w_q are replaced by distributions q(w_1), ..., q(w_q).]
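
To make the picture concrete, here is a minimal numpy sketch (mine, not from the slides) of what a distribution over weights buys us: each draw from q(w) gives one deterministic network, and the spread of the resulting outputs is the predictive uncertainty. The RBF features, the layer sizes and the Gaussian q(w) are illustrative assumptions.

    import numpy as np
    rng = np.random.default_rng(0)
    # Toy 1-D inputs and fixed radial-basis features phi(x) (illustrative choice).
    x = np.linspace(-3, 3, 50)
    centres = np.linspace(-3, 3, 10)
    phi = np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2)      # shape (50, 10)
    # Assumed approximate posterior q(w) = N(mu, diag(sigma^2)) over the output weights.
    mu = rng.normal(0.0, 1.0, size=10)
    sigma = 0.3 * np.ones(10)
    # Monte Carlo over weight samples: each sample is one deterministic network.
    samples = []
    for _ in range(200):
        w = mu + sigma * rng.normal(size=10)        # w ~ q(w)
        samples.append(phi @ w)                     # f(x) = phi(x)^T w
    samples = np.stack(samples)
    pred_mean = samples.mean(axis=0)                # predictive mean
    pred_std = samples.std(axis=0)                  # predictive uncertainty
    print(pred_mean[:3], pred_std[:3])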

9 Probabilistic re-formulation. DNN: y = g(W, x) = w_1 φ(w_2 φ(... x)). Training by minimizing a loss:
arg min_W (1/N) Σ_{i=1}^N (g(W, x_i) - y_i)^2  [fit]  +  λ Σ_i |w_i|  [regularizer]
Equivalent probabilistic view for regression, maximizing the posterior probability:
arg max_W log p(y | x, W)  [fit]  +  log p(W)  [regularizer]
where p(y | x, W) is Gaussian and p(W) is Laplace.

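As a sanity check of the equivalence claimed above, the following sketch (my own construction, not from the talk) compares the regularized loss with the negative log posterior for a plain linear model with Gaussian noise and a Laplace prior; the noise variance and prior scale are arbitrary. With λ chosen as 2σ²/b the two objectives differ only by scale and an additive constant, so they share the same minimizer.

    import numpy as np
    rng = np.random.default_rng(1)
    # Toy linear-regression data; g(w, x) = x^T w is an illustrative stand-in for the DNN.
    X = rng.normal(size=(30, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)
    sigma2 = 0.5            # assumed Gaussian likelihood variance
    b = 2.0                 # assumed Laplace prior scale
    lam = 2 * sigma2 / b    # matching regularization strength
    def reg_loss(w):
        # squared-error fit + L1 regularizer
        return np.sum((X @ w - y) ** 2) + lam * np.sum(np.abs(w))
    def neg_log_posterior(w):
        # -log p(y | x, w) - log p(w), dropping w-independent constants
        return np.sum((X @ w - y) ** 2) / (2 * sigma2) + np.sum(np.abs(w)) / b
    for _ in range(3):
        w = rng.normal(size=5)
        # Identical up to the dropped constants, for any w (~0 up to floating point):
        print(reg_loss(w) - 2 * sigma2 * neg_log_posterior(w))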

12 Integrating out weights. Define D = (x, y). Remember Bayes' rule:
p(w | D) = p(D | w) p(w) / p(D),   with p(D) = ∫ p(D | w) p(w) dw
For Bayesian inference the weights need to be integrated out as well. This gives us a properly defined posterior on the parameters.
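
The marginal likelihood p(D) = ∫ p(D | w) p(w) dw is exactly the quantity that makes this hard. For intuition, here is a small sketch (my own, using an artificially simple conjugate model so the exact answer is available) that estimates it by averaging the likelihood over prior samples.

    import numpy as np
    rng = np.random.default_rng(2)
    # Assumed toy model: y_i ~ N(w, sigma^2) with prior w ~ N(0, tau^2).
    sigma, tau = 0.5, 1.0
    y = rng.normal(0.7, sigma, size=20)
    def log_lik(w):
        # log p(D | w) for the whole data set
        return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * (y - w) ** 2 / sigma**2)
    # Monte Carlo estimate of log p(D) = log E_{p(w)}[p(D | w)] (log-sum-exp for stability).
    ws = rng.normal(0.0, tau, size=50_000)
    lls = np.array([log_lik(w) for w in ws])
    log_evidence_mc = np.log(np.mean(np.exp(lls - lls.max()))) + lls.max()
    # Exact log evidence for this conjugate model: y ~ N(0, sigma^2 I + tau^2 11^T).
    n = len(y)
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
    _, logdet = np.linalg.slogdet(cov)
    log_evidence_exact = -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))
    print(log_evidence_mc, log_evidence_exact)      # the two values should be close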

13 (Bayesian) Occam's Razor. "A plurality is not to be posited without necessity." (W. of Ockham) "Everything should be made as simple as possible, but not simpler." (A. Einstein) The evidence is higher for the model that is not unnecessarily complex but still explains the data D.

14 Bayesian Inference Remember: Separation of Model and Inference

15 Inference. p(D) (and hence p(w | D)) is difficult to compute because of the nonlinear way in which w appears through g. Attempt at variational inference:
KL(q(w; θ) || p(w | D)) = log p(D) - L(θ)   (minimizing the KL ⇔ maximizing L)
where
L(θ) = E_{q(w;θ)}[log p(D, w)] + H[q(w; θ)]
The expectation term, F = E_{q(w;θ)}[log p(D, w)], is still problematic. Solution: Monte Carlo. Such approaches can be formulated as black-box inference.


18-21 [Figure-only slides.] Credit: Tushar Tank

22 Inference: Score function method.
∇_θ F = ∇_θ E_{q(w;θ)}[log p(D, w)] = E_{q(w;θ)}[log p(D, w) ∇_θ log q(w; θ)]
      ≈ (1/K) Σ_{k=1}^K log p(D, w^(k)) ∇_θ log q(w^(k); θ),   w^(k) ~ q(w; θ) i.i.d.
(Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014; Ruiz et al., 2016)
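
A minimal numpy illustration of the estimator above (my own toy setup, not from the talk): q(w; θ) is a 1-D Gaussian with θ = (μ, σ), and log p(D, w) is replaced by a simple quadratic so the exact gradient is known.

    import numpy as np
    rng = np.random.default_rng(3)
    # Stand-in for log p(D, w) (assumed toy function with a known answer).
    def log_joint(w):
        return -0.5 * (w - 2.0) ** 2
    mu, s = 0.0, 1.0                    # q(w; theta) = N(mu, s^2)
    # Score-function estimate of d/dmu E_q[log p(D, w)]:
    #   E_q[ log p(D, w) * d/dmu log q(w; theta) ],  with d/dmu log q = (w - mu) / s^2
    w = mu + s * rng.normal(size=200_000)
    grad_mu_est = np.mean(log_joint(w) * (w - mu) / s**2)
    # Exact gradient for this toy case: d/dmu E_q[log p(D, w)] = -(mu - 2) = 2
    print(grad_mu_est)                  # roughly 2.0, but the per-sample variance is high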

23 Inference: Reparameterization gradient. Reparameterize w as a transformation T of a simpler variable ε: w = T(ε; θ), so that q(ε) is independent of θ.
∇_θ F = ∇_θ E_{q(w;θ)}[log p(D, w)] = E_{q(ε)}[ ∇_w log p(D, w)|_{w=T(ε;θ)} ∇_θ T(ε; θ) ]
For example, for w ~ N(μ, σ): T: w = μ + σ ε, ε ~ N(0, 1). Monte Carlo by sampling from q(ε) (thus obtaining samples of w through T).
(Salimans and Knowles, 2013; Kingma and Welling, 2014; Ruiz et al., 2016)

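The same toy gradient as in the score-function sketch, now with the reparameterization w = μ + σε (again my own illustrative numbers). Note how few samples are needed: the pathwise estimator typically has much lower variance because it uses ∇_w log p(D, w) rather than only the value of the integrand.

    import numpy as np
    rng = np.random.default_rng(4)
    # Same toy term as before: log p(D, w) = -0.5 * (w - 2)^2, so d/dw = -(w - 2).
    def dlog_joint_dw(w):
        return -(w - 2.0)
    mu, s = 0.0, 1.0
    # Reparameterize: w = T(eps; theta) = mu + s * eps, with eps ~ N(0, 1).
    eps = rng.normal(size=2_000)
    w = mu + s * eps
    grad_mu = np.mean(dlog_joint_dw(w) * 1.0)   # dT/dmu = 1
    grad_s = np.mean(dlog_joint_dw(w) * eps)    # dT/ds  = eps
    print(grad_mu, grad_s)                      # close to the exact values 2.0 and -1.0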

27 Black-box VI

28 Black-box VI (github.com/blei-lab/edward)

29 Priors on weights (what we saw before). [Diagram: X → feature layers φ → hidden layers H → Y, with priors on the weights between layers.]

30 From NN to GP. [Diagram: X → hidden layers H → Y, with the weighted feature layers φ replaced by GP mappings G.]
NN: H_2 = W_2 φ(H_1). GP: φ is infinite-dimensional, so H_2 = f_2(H_1; θ_2) + ε.
NN: prior p(W). GP: prior p(f(·)).
The VAE can be seen as a special case of this.


33 From NN to GP. The real world is perfectly described by unobserved latent variables Ĥ. But we only observe noisy, high-dimensional data Y. We try to interpret the world and infer the latents H (an estimate of Ĥ). Inference: p(H | Y) = p(Y | H) p(H) / p(Y).

34 Deep Gaussian processes. Uncertainty about parameters: check. Uncertainty about structure? A deep GP simultaneously brings in: a prior on the weights; a kernelized input/latent space; stochasticity in the warping.

35 Introducing Gaussian Processes: A Gaussian distribution depends on a mean and a covariance matrix. A Gaussian process depends on a mean and a covariance function.

36-55 [Figure-only slides; no text in the transcription.]

56-63 Samples from a 1-D Gaussian [figure-only slides]
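
The figure slides above presumably build from samples of a low-dimensional Gaussian up to samples of a Gaussian process. As a minimal sketch of the end point (my own code, with arbitrary kernel parameters): evaluating a covariance function on a grid gives an ordinary covariance matrix, from which whole functions can be sampled.

    import numpy as np
    rng = np.random.default_rng(5)
    def rbf(x1, x2, variance=1.0, lengthscale=0.5):
        # Covariance function: k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))
        return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)
    # On a finite grid the GP reduces to a multivariate Gaussian with covariance matrix K.
    x = np.linspace(-3, 3, 200)
    K = rbf(x, x) + 1e-6 * np.eye(len(x))       # small jitter for numerical stability
    samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
    print(samples.shape)                        # 5 sample functions evaluated on the grid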

64 How do GPs solve the overfitting problem (i.e. regularize)? Answer: integrate over the function itself! So we average out all possible function forms under a (GP) prior. Recap (θ are hyperparameters):
MAP:      arg max_w p(y | w, φ(x)) p(w),   e.g. y = φ(x)ᵀ w + ε
Bayesian: arg max_θ ∫ p(y | f) p(f | x, θ) df  [GP prior],   e.g. y = f(x; θ) + ε
The Bayesian approach (GP) automatically balances the data fit with the complexity penalty.

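A minimal sketch of what "integrating over the function" looks like in the standard GP regression case, where the integral is available in closed form (my own toy data and kernel settings):

    import numpy as np
    rng = np.random.default_rng(6)
    def rbf(x1, x2, variance=1.0, lengthscale=0.7):
        return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)
    # Toy training data
    X = rng.uniform(-3, 3, size=15)
    y = np.sin(X) + 0.1 * rng.normal(size=15)
    noise = 0.1 ** 2
    Xs = np.linspace(-4, 4, 100)                        # test inputs
    # GP regression: f is integrated out analytically (Gaussian likelihood + GP prior).
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kss = rbf(Xs, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)                 # predictive mean
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)           # predictive covariance
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None) + noise)
    print(mean[:3], std[:3])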

66 Deep Gaussian processes. Define a recursive stacked construction:
f(x), f ~ GP   (a single GP)
f_L(f_{L-1}(... f_1(x)))   (a deep GP)
Compare to:
φ(x)ᵀ w   (NN)
φ(φ(... φ(x)ᵀ w_1 ...)ᵀ w_{L-1})ᵀ w_L   (DNN)
[Diagram: x → h_1 → h_2 → y through mappings f_1, f_2, f_3.]
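
A minimal sketch (mine, with arbitrary settings) of the recursive construction above: draw a function from a GP over the inputs, then use its values as the inputs of the next GP layer, and so on.

    import numpy as np
    rng = np.random.default_rng(7)
    def rbf(x1, x2, variance=1.0, lengthscale=0.8):
        return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)
    def sample_gp_layer(inputs):
        # One GP layer: jointly sample a function value at every input location.
        K = rbf(inputs, inputs) + 1e-6 * np.eye(len(inputs))
        return rng.multivariate_normal(np.zeros(len(inputs)), K)
    # f_L(f_{L-1}(... f_1(x))): compose samples layer by layer.
    x = np.linspace(-3, 3, 200)
    h = x
    for layer in range(3):                  # a 3-layer deep GP prior draw
        h = sample_gp_layer(h)
    print(h.shape)                          # one deep-GP sample evaluated on the grid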

67 Two-layered DGP. [Diagram: input X → f_1 → unobserved H → f_2 → output Y.]

68 Inference in DGPs. It's tricky. An additional difficulty comes from the fact that the kernel couples all points, so p(y | x) is no longer factorized. Indeed, p(y | x, w) = ∫ p(y | f) p(f | x, w) df, but w appears inside the non-linear kernel function: p(f | x, w) = N(f | 0, K(X, X; w)).

69 The resulting objective function. By integrating over all unknowns, we obtain approximate posteriors q(h), q(f) over them, and we end up with a well regularized variational lower bound on the true marginal likelihood:
F = Data fit - KL(q(h_1) || p(h_1))  [regularisation]  +  Σ_{l=2}^L H(q(h_l))  [regularisation]
Regularization achieved through uncertainty propagation.

70 An example where uncertainty propagation matters: Recurrent learning.

71 Dynamics/memory: Deep Recurrent Gaussian Process

72 Avatar control. Figure: the generated motion with a step-function signal, starting with walking (blue), switching to running (red) and switching back to walking (blue). Videos: switching between learned speeds; interpolating (un)seen speed; constant unseen speed.

73-74 [Figure slides: comparison of RGP, GPNARX, MLP-NARX and LSTM.]

75 Stochastic warping. [Diagram: X → φ layers → hidden layers H → Y, with N(0, I) noise G injected at the hidden layers.] Inference: need to infer posteriors on H. Define q(H) = N(μ, Σ). Amortize: {μ, Σ} = MLP(Y; θ).


77 Amortized inference. Solution: reparameterization through a recognition model g:
μ_1^(n) = g_1(y^(n)),   μ_l^(n) = g_l(μ_{l-1}^(n)),   with each g_l = MLP(θ_l) deterministic,
so μ_l^(n) = g_l(... g_1(y^(n))).
(Dai, Damianou, Gonzalez, Lawrence, ICLR 2016)
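
A sketch of the amortization idea (my own tiny example; the layer sizes and the single hidden layer are arbitrary): one deterministic network g, shared across all data points, maps each observation y^(n) to the parameters of q(h^(n)), instead of keeping free variational parameters per point.

    import numpy as np
    rng = np.random.default_rng(8)
    # Tiny recognition MLP g(y; theta): observation -> (mu, sigma) of q(h | y).
    d_y, d_hidden, d_h = 10, 32, 2
    W1, b1 = rng.normal(0, 0.1, (d_y, d_hidden)), np.zeros(d_hidden)
    W2, b2 = rng.normal(0, 0.1, (d_hidden, 2 * d_h)), np.zeros(2 * d_h)
    def recognition(Y):
        hidden = np.tanh(Y @ W1 + b1)
        out = hidden @ W2 + b2
        return out[:, :d_h], np.exp(out[:, d_h:])       # mu, sigma
    # theta = (W1, b1, W2, b2) is shared by all points: the variational parameters
    # are "amortized" by the network rather than optimized separately per point.
    Y = rng.normal(size=(100, d_y))
    mu, sigma = recognition(Y)
    print(mu.shape, sigma.shape)                        # (100, 2) and (100, 2)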

78 Bayesian approach recap. Three ways of introducing uncertainty / noise in a NN: (i) treat the weights w as distributions (BNNs); (ii) stochasticity in the warping function φ (VAE); (iii) Bayesian non-parametrics applied to DNNs (DGP).

79 Connection between Bayesian and traditional approaches. There's another view of Bayesian NNs: they can be used as a means of explaining / motivating DNN techniques (e.g. dropout) in a more mathematically grounded way. Dropout as variational inference (Kingma et al., 2015; Gal and Ghahramani, 2016). Low-rank NN weight approximations and deep Gaussian processes (Hensman and Lawrence, 2014; Damianou, 2015; Louizos and Welling, 2016; Cutajar et al., 2016)...
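
To make the dropout connection concrete, here is a small sketch of MC dropout in the spirit of Gal and Ghahramani (2016): keep dropout switched on at test time and average several stochastic forward passes; the spread of the outputs acts as an approximate predictive uncertainty. The tiny untrained network and the dropout rate are placeholders of my own.

    import numpy as np
    rng = np.random.default_rng(9)
    # Placeholder weights; in practice these would come from training with dropout.
    W1 = rng.normal(0, 0.5, (1, 50))
    W2 = rng.normal(0, 0.5, (50, 1))
    p_drop = 0.5
    def forward(x):
        h = np.tanh(x @ W1)
        mask = rng.random(h.shape) > p_drop     # dropout kept ON at test time
        h = h * mask / (1 - p_drop)
        return h @ W2
    # Each stochastic pass corresponds (approximately) to one sample from an
    # approximate posterior over the weights.
    x = np.linspace(-3, 3, 20).reshape(-1, 1)
    preds = np.stack([forward(x) for _ in range(100)])
    print(preds.mean(axis=0).ravel()[:3], preds.std(axis=0).ravel()[:3])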

80 Summary. Motivation for probabilistic and Bayesian reasoning; three ways of incorporating uncertainty in DNNs; inference is more challenging when uncertainty has to be propagated; connection between Bayesian and traditional NN approaches.
