Neural Networks - II

1 Neural Networks - II
Henrik I Christensen
Robotics & Intelligent Machines @ GT
Georgia Institute of Technology, Atlanta, GA
hic@cc.gatech.edu
Henrik I Christensen (RIM@GT) Neural Networks 1 / 23

2 Outline
1. Introduction
2. Mixture Density Networks
3. Bayesian Neural Networks
4. Summary

3 Introduction
- Last lecture:
  - Neural networks as a layered regression problem
  - Feed-forward networks
  - Linear model with activation functions
  - Global optimization
- Coverage of multi-modal networks
- Bayesian models for neural networks

5 Motivation
- The models thus far have assumed a Gaussian distribution
- How about multi-modal distributions?
- How about inverse problems?
- Mixture models are one possible solution

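8 Simple Robot Example
[Figure: a two-link robot arm with link lengths $L_1$, $L_2$ and joint angles $\theta_1$, $\theta_2$ reaching the end-effector position $(x_1, x_2)$. The same position is reached by two configurations, "elbow up" and "elbow down".]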
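The two-configuration ambiguity in the robot example can be made concrete with a small sketch. It assumes a common planar-arm convention that the slide does not state ($\theta_1$ measured from the horizontal axis, $\theta_2$ measured relative to the first link); the link lengths and the function names `forward` and `inverse` are illustrative choices.

```python
import math

def forward(theta1, theta2, L1=1.0, L2=0.8):
    """Forward kinematics: joint angles -> end-effector position (x1, x2)."""
    x1 = L1 * math.cos(theta1) + L2 * math.cos(theta1 + theta2)
    x2 = L1 * math.sin(theta1) + L2 * math.sin(theta1 + theta2)
    return x1, x2

def inverse(x1, x2, L1=1.0, L2=0.8):
    """Inverse kinematics: returns both joint-angle solutions for a reachable point."""
    c2 = (x1 * x1 + x2 * x2 - L1 * L1 - L2 * L2) / (2.0 * L1 * L2)
    if abs(c2) > 1.0:
        raise ValueError("target is out of reach")
    solutions = []
    for sign in (1.0, -1.0):  # the two elbow configurations
        theta2 = sign * math.acos(c2)
        theta1 = math.atan2(x2, x1) - math.atan2(
            L2 * math.sin(theta2), L1 + L2 * math.cos(theta2)
        )
        solutions.append((theta1, theta2))
    return solutions

# One target position, two valid joint configurations.
for theta1, theta2 in inverse(1.2, 0.6):
    print(theta1, theta2, forward(theta1, theta2))
```

The forward map is single-valued, but the inverse map is not, so a least-squares regression from $(x_1, x_2)$ to the angles would tend to average the two branches. This is the kind of multi-modal target that motivates a mixture model.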

9 Simple Functional Approximation Example

10 Basic Formulation
- Objective: approximation of $p(\mathbf{t} \mid \mathbf{x})$
- A generic model:
  $$p(\mathbf{t} \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, \mathcal{N}\big(\mathbf{t} \mid \boldsymbol{\mu}_k(\mathbf{x}), \sigma_k^2(\mathbf{x})\big)$$
- Here a Gaussian mixture is used, but any distribution could be the basis
- Parameters to be estimated: $\pi_k(\mathbf{x})$, $\boldsymbol{\mu}_k(\mathbf{x})$ and $\sigma_k^2(\mathbf{x})$
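As a minimal sketch of the mixture formula, the following evaluates $p(t \mid x) = \sum_k \pi_k \mathcal{N}(t \mid \mu_k, \sigma_k^2)$ for a scalar target at one fixed input. The parameter values are made up for illustration; in a mixture density network they would be produced by the network for each $x$.

```python
import math

def gaussian_pdf(t, mu, sigma):
    """Univariate normal density N(t | mu, sigma^2)."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(t, pis, mus, sigmas):
    """Mixture density: sum over k of pi_k * N(t | mu_k, sigma_k^2)."""
    return sum(p * gaussian_pdf(t, m, s) for p, m, s in zip(pis, mus, sigmas))

# Hypothetical parameters for one fixed input x (a bimodal conditional density).
pis, mus, sigmas = [0.3, 0.7], [-2.0, 1.5], [0.5, 0.8]

# Riemann-sum check that the mixture integrates to (approximately) one.
step = 0.001
area = sum(
    mixture_pdf(-10.0 + i * step, pis, mus, sigmas) * step for i in range(20000)
)
print(area)
```

Because the mixing coefficients sum to one and each component is normalized, the mixture is itself a valid density, which the numerical integral confirms.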

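11 The mixture density network
[Figure: a network maps the inputs $x_1, \dots, x_D$ to the parameters $\theta_1, \dots, \theta_M$ of the mixture model, which in turn define the conditional density $p(\mathbf{t} \mid \mathbf{x})$ over the target $\mathbf{t}$.]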

12 The Model Parameters
- Mixing coefficients must satisfy
  $$\sum_{k=1}^{K} \pi_k(\mathbf{x}) = 1, \qquad 0 \le \pi_k(\mathbf{x}) \le 1$$
  which is achieved using a softmax:
  $$\pi_k(\mathbf{x}) = \frac{\exp(a_k^{\pi})}{\sum_{l=1}^{K} \exp(a_l^{\pi})}$$
- The variance must be positive, so a good choice is
  $$\sigma_k(\mathbf{x}) = \exp(a_k^{\sigma})$$
- The means can be represented by direct activations:
  $$\mu_{kj}(\mathbf{x}) = a_{kj}^{\mu}$$
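The three mappings from output activations to mixture parameters can be sketched as follows. The function name `mdn_parameters` and the example activation values are illustrative; the max-subtraction inside the softmax is a standard numerical-stability step that is not on the slide and does not change the result.

```python
import math

def softmax(a):
    """Softmax with max-subtraction for numerical stability."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def mdn_parameters(a_pi, a_sigma, a_mu):
    """Map raw output activations to valid mixture parameters."""
    pis = softmax(a_pi)                      # sum to one, each in [0, 1]
    sigmas = [math.exp(v) for v in a_sigma]  # strictly positive
    mus = [list(row) for row in a_mu]        # means are the activations themselves
    return pis, sigmas, mus

# Hypothetical activations for K = 3 components and a 1-D target.
pis, sigmas, mus = mdn_parameters(
    a_pi=[0.2, -1.0, 3.0],
    a_sigma=[-0.5, 0.0, 1.2],
    a_mu=[[0.1], [2.0], [-3.0]],
)
print(pis, sigmas, mus)
```

With these choices the network outputs are unconstrained real numbers, while the resulting parameters always describe a valid mixture.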

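13 The Energy Equation(s)
- The error function is then, as seen before,
  $$E(\mathbf{w}) = -\sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k(\mathbf{x}_n, \mathbf{w})\, \mathcal{N}\big(\mathbf{t}_n \mid \boldsymbol{\mu}_k(\mathbf{x}_n, \mathbf{w}), \sigma_k^2(\mathbf{x}_n, \mathbf{w})\big) \right\}$$
- By computing the derivatives we can minimize $E(\mathbf{w})$
- Let's use
  $$\gamma_{nk} = \gamma_k(\mathbf{t}_n \mid \mathbf{x}_n) = \frac{\pi_k \mathcal{N}_{nk}}{\sum_{l} \pi_l \mathcal{N}_{nl}}$$
- The derivatives are then
  $$\frac{\partial E_n}{\partial a_k^{\pi}} = \pi_k - \gamma_{nk}$$
  $$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \gamma_{nk} \left\{ \frac{\mu_{kl} - t_{nl}}{\sigma_k^2} \right\}$$
  $$\frac{\partial E_n}{\partial a_k^{\sigma}} = \gamma_{nk} \left\{ L - \frac{\lVert \mathbf{t}_n - \boldsymbol{\mu}_k \rVert^2}{\sigma_k^2} \right\}$$
  where $L$ is the dimensionality of the target $\mathbf{t}_n$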
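A sketch of the per-sample error $E_n = -\ln \sum_k \pi_k \mathcal{N}_{nk}$ and its derivatives with respect to the output activations, for a scalar target ($L = 1$). The analytic gradients $\pi_k - \gamma_{nk}$, $\gamma_{nk}(\mu_k - t_n)/\sigma_k^2$ and $\gamma_{nk}\{L - (t_n - \mu_k)^2/\sigma_k^2\}$ are compared with central finite differences. All names and numerical values are illustrative assumptions.

```python
import math

def softmax(a):
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def mdn_error_and_grads(t, a_pi, a_mu, a_sigma):
    """Per-sample error E_n and its gradients for a scalar target (L = 1)."""
    L = 1
    pis = softmax(a_pi)
    sigmas = [math.exp(v) for v in a_sigma]
    dens = [
        math.exp(-0.5 * ((t - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for m, s in zip(a_mu, sigmas)
    ]
    weighted = [p * d for p, d in zip(pis, dens)]
    total = sum(weighted)
    gammas = [w / total for w in weighted]  # responsibilities gamma_nk
    error = -math.log(total)
    g_pi = [p - g for p, g in zip(pis, gammas)]
    g_mu = [g * (m - t) / s ** 2 for g, m, s in zip(gammas, a_mu, sigmas)]
    g_sigma = [
        g * (L - (t - m) ** 2 / s ** 2) for g, m, s in zip(gammas, a_mu, sigmas)
    ]
    return error, g_pi, g_mu, g_sigma

def numeric_grad(t, a_pi, a_mu, a_sigma, which, k, h=1e-6):
    """Central difference of E_n w.r.t. activation k of group `which`
    (0: a_pi, 1: a_mu, 2: a_sigma)."""
    args = [list(a_pi), list(a_mu), list(a_sigma)]
    args[which][k] += h
    e_plus = mdn_error_and_grads(t, *args)[0]
    args[which][k] -= 2.0 * h
    e_minus = mdn_error_and_grads(t, *args)[0]
    return (e_plus - e_minus) / (2.0 * h)

# Hypothetical activations for K = 2 components and one target value.
t, a_pi, a_mu, a_sigma = 0.7, [0.3, -0.4], [-1.0, 1.2], [-0.2, 0.1]
result = mdn_error_and_grads(t, a_pi, a_mu, a_sigma)
print(result)
```

In a full network these activation gradients would be back-propagated through the hidden layers to obtain the gradient with respect to the weights.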

14 A Toy Example
[Figure: four panels, (a) to (d).]

15 Mixture density networks
- The net is optimizing a mixture of parameters
- Different parts correspond to different components
- Each part has its own set of energy terms and gradients
- This illustrates the flexibility, but also the complications

17 Introductory Remarks
- What if the output were a probability distribution?
- Could we optimize over the posterior distribution $p(t \mid \mathbf{x})$?
- Assume it is Gaussian to enable processing:
  $$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\big)$$
- Let's consider how we can analyze the problem

18 The Laplace Approximation - I
- Sometimes the posterior is no longer Gaussian
  - This makes integration challenging
  - Closed-form solutions might not be available
- How can we generate an approximation?
- A Gaussian approximation would be helpful: use the Laplace approximation
- Consider for now
  $$p(z) = \frac{f(z)}{\int f(a)\, da}$$
  where the denominator is merely for normalization and is considered unknown
- Assume the mode $z_0$ has been determined, so that
  $$\left. \frac{d f(z)}{dz} \right|_{z = z_0} = 0$$

19 The Laplace Approximation - II
- The Taylor expansion of $\ln f$ is then
  $$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$$
  where
  $$A = - \left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}$$
- Taking the exponential:
  $$f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$
  which can be normalized to
  $$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$
- The extension to multivariate distributions is straightforward (see book)
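The recipe (find the mode $z_0$, take $A$ as the negative second derivative of $\ln f$ at the mode, use $q(z) = \mathcal{N}(z \mid z_0, A^{-1})$) can be sketched numerically. The unnormalized density $f(z) = z^k e^{-z}$ is chosen here purely for illustration because its mode $z_0 = k$ and curvature $A = 1/k$ are known in closed form. The Newton search and finite-difference derivatives are implementation choices, not part of the slides.

```python
import math

def log_f(z, k=4.0):
    """Log of the unnormalized density f(z) = z^k * exp(-z), z > 0."""
    return k * math.log(z) - z

def d1(g, z, h=1e-4):
    """Central-difference first derivative."""
    return (g(z + h) - g(z - h)) / (2.0 * h)

def d2(g, z, h=1e-4):
    """Central-difference second derivative."""
    return (g(z + h) - 2.0 * g(z) + g(z - h)) / (h * h)

def laplace_1d(g, z_init, iters=50):
    """Newton search for the mode of g = ln f, then the curvature A at the mode."""
    z = z_init
    for _ in range(iters):
        z = z - d1(g, z) / d2(g, z)
    A = -d2(g, z)
    return z, A

def q(z, z0, A):
    """Laplace (Gaussian) approximation with mean z0 and precision A."""
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

z0, A = laplace_1d(log_f, z_init=3.0)
print(z0, A, q(z0, z0, A))
```

The approximation matches the true density well near the mode but, being symmetric, cannot capture the skew of $f$ in the tails.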

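20 Posterior Parameter Distribution
- Back to the Bayesian neural networks
- For an IID dataset with target values $\mathbf{t} = \{t_1, \dots, t_N\}$ we have
  $$p(\mathbf{t} \mid \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}\big)$$
- The posterior is then
  $$p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) \propto p(\mathbf{w} \mid \alpha)\, p(\mathbf{t} \mid \mathbf{w}, \beta)$$
- As usual we have
  $$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\alpha}{2} \mathbf{w}^T \mathbf{w} - \frac{\beta}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 + \text{const}$$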

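21 Posterior Parameter Distribution - II
- We can use the Laplace approximation to estimate the distribution:
  $$\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) = \alpha \mathbf{I} + \beta \mathbf{H}$$
- The approximation would be
  $$q(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{w}_{MAP}, \mathbf{A}^{-1})$$
- In turn we have
  $$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\big(t \mid y(\mathbf{x}, \mathbf{w}_{MAP}), \sigma^2\big)$$
  where
  $$\sigma^2 = \beta^{-1} + \mathbf{g}^T \mathbf{A}^{-1} \mathbf{g}, \qquad \mathbf{g} = \left. \nabla_{\mathbf{w}}\, y(\mathbf{x}, \mathbf{w}) \right|_{\mathbf{w} = \mathbf{w}_{MAP}}$$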
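The predictive variance $\sigma^2 = \beta^{-1} + \mathbf{g}^T \mathbf{A}^{-1} \mathbf{g}$ with $\mathbf{A} = \alpha \mathbf{I} + \beta \mathbf{H}$ can be sketched in plain Python as below. The Hessian `H` and gradient `g` are assumed to be given (for a real network they would come from back-propagation at $\mathbf{w}_{MAP}$), the numbers are made up, and the small linear solver is only there to keep the example dependency-free.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def predictive_variance(alpha, beta, H, g):
    """sigma^2 = 1/beta + g^T A^{-1} g with A = alpha*I + beta*H."""
    n = len(g)
    A = [
        [beta * H[i][j] + (alpha if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    A_inv_g = solve(A, g)
    return 1.0 / beta + sum(gi * vi for gi, vi in zip(g, A_inv_g))

# Hypothetical two-weight example.
sigma2 = predictive_variance(alpha=1.0, beta=1.0, H=[[2.0, 1.0], [1.0, 3.0]], g=[1.0, 2.0])
print(sigma2)
```

The first term is the intrinsic noise on the target; the second grows where the output is sensitive to weights that the data constrain only weakly.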

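22 Optimization of Hyper-parameters
- How do we estimate $\alpha$ and $\beta$?
- We can consider the problem
  $$p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)\, d\mathbf{w}$$
- From linear regression we have the eigenvalue decomposition
  $$\beta \mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i$$
  where $\mathbf{H}$ is the Hessian of the error $E$
- As with regression, we have
  $$\alpha = \frac{\gamma}{\mathbf{w}_{MAP}^T \mathbf{w}_{MAP}}$$
  where $\gamma$ is the effective rank of the Hessian
- Similarly, $\beta$ can be derived to be
  $$\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}_{MAP}) - t_n \}^2$$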
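One re-estimation step for $\alpha$ and $\beta$ might look like the sketch below. The slide does not define $\gamma$ explicitly; the code assumes the standard evidence-approximation form $\gamma = \sum_i \lambda_i / (\alpha + \lambda_i)$, where $\lambda_i$ are the eigenvalues of $\beta \mathbf{H}$. The eigenvalues, weights and residuals are hypothetical inputs that would in practice be computed at $\mathbf{w}_{MAP}$.

```python
def update_hyperparameters(eigvals, w_map, residuals, alpha):
    """One re-estimation step for alpha and beta.

    eigvals   : eigenvalues lambda_i of beta * H
    w_map     : MAP weight vector
    residuals : y(x_n, w_MAP) - t_n for each training point
    alpha     : current value of alpha (used to compute gamma)
    """
    gamma = sum(lam / (alpha + lam) for lam in eigvals)
    alpha_new = gamma / sum(w * w for w in w_map)
    n = len(residuals)
    beta_new = (n - gamma) / sum(r * r for r in residuals)
    return gamma, alpha_new, beta_new

# Hypothetical values for a two-weight model and four data points.
gamma, alpha_new, beta_new = update_hyperparameters(
    eigvals=[1.0, 3.0],
    w_map=[1.0, 2.0],
    residuals=[0.5, -0.5, 1.0, 0.0],
    alpha=1.0,
)
print(gamma, alpha_new, beta_new)
```

Since $\mathbf{w}_{MAP}$ and the eigenvalues themselves depend on $\alpha$ and $\beta$, such a step would typically be alternated with re-training of the weights until the values settle.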

23 Bayesian Neural Networks
- Modelling of the system as a probabilistic generator
- Use standard techniques to generate $\mathbf{w}_{MAP}$
- In addition, we can generate estimates for the precision/variance

25 Summary
- With neural nets we have a general functional estimator
- They can be applied to both regression and discrimination
- The basis functions can be a broad set of functions
- NNs can also be used for estimation of mixture systems
- Estimation of probability distributions is also possible for Gaussians (approximation with $\mathbf{w}_{MAP}$, $\beta$)
- Neural nets are a rich area with a long history
