1 Computing with Distributed Distributional Codes: Convergent Inference in Brains and Machines?
Maneesh Sahani, Professor of Theoretical Neuroscience and Machine Learning, Gatsby Computational Neuroscience Unit, University College London. October 26, 2018.

2 AI and Biology
Rosenblatt's Perceptron combined (rudimentary) psychology, neuroscience, ML and AI.

5 Deep Learning and Biology Yamins & DiCarlo, 2016

7 Is this the whole story?

8 Adversarial examples
[Figure: $x$, "panda" (57.7% confidence), plus $\epsilon$ times $\mathrm{sign}(\nabla_x J(\theta, x, y))$, "nematode" (8.2% confidence), gives $x + \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y))$, "gibbon" (99.3% confidence). Goodfellow et al., ICLR.]
This hints that recognition in deep convolutional nets depends on a conjunction of textural cues.
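
The perturbation $x + \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y))$ can be illustrated with a minimal sketch. It assumes a logistic-regression classifier as a stand-in for a deep network, so that the input gradient is available in closed form; the weights `w`, bias `b`, input `x` and step size `eps` are all hypothetical values chosen for illustration, not taken from the talk.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, b, x, y):
    """Cross-entropy J(theta, x, y) of a logistic model, with y in {0, 1}."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(w, b, x, y, eps):
    """Return x + eps * sign(grad_x J(theta, x, y))."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w              # closed-form gradient of J with respect to x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w = rng.normal(size=20)               # hypothetical model parameters theta
b = 0.0
x = rng.normal(size=20)               # hypothetical input
y = 1.0 if w @ x + b > 0 else 0.0     # the label the model currently predicts
x_adv = fgsm(w, b, x, y, eps=0.1)

print(loss(w, b, x, y), loss(w, b, x_adv, y))
```

Each input coordinate moves by at most `eps`, yet all coordinates move in the direction that increases the loss, which is why many small changes can add up to a large change in the output.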

9 Vision isn't just object recognition
Pixel decision contributions don't segment objects... (Zintgraf et al., ICLR)

12 Vision isn't just object recognition
Pixel decision contributions don't segment objects ... and we can parse scenes like this: [Image: Fantastic Planet (1973), René Laloux]
- Limited extrapolative generalisation.
- No sense of causal structure.

16 Inference and Bayes
- A and B have the same physical luminance.
- They appear different because we see by inference: here we infer the likely reflectances of the squares.
- Inferences depend on many cues, local and remote: optimal integration is usually probabilistic or Bayesian.

18 Can supervised (deep) learning be Bayesian?
Yes: if Bayes is optimal, then with enough supervision a network will learn to behave in a Bayesian way.
No:
- The specific Bayes-optimal function may look very different case-to-case.
- Bayesian reasoning involves beliefs about real but unobserved causal quantities.
  - Basis of extrapolative generalisation.
  - Usually derived by seeking conditional independence.

20 What about variational auto-encoders?
[Figure: network diagram with inputs $x_1, x_2, \dots, x_d$, hidden layers $y^{(1)}$ to $y^{(3)}$, external noise $\epsilon$, and outputs $\hat{x}_1, \hat{x}_2, \dots, \hat{x}_D$.]
- VAEs (and GANs) can reproduce generative densities quite well (though with biases, as we'll see).
- Generate samples using deterministic transformations of external random variates (reparametrisation trick).
- By themselves, do not learn sensible causal components (but see Johnson et al. 2016).
- Require a simplified posterior representation: need analytic expectations or samples.
  - mostly Gaussian (or other tractable exponential family).
  - generally neglect or severely simplify posterior correlations.
If the eventual goal is inference (for understanding, or for decisions) then this simplified posterior may miss the point.

21 Encoding uncertainty?
Part of the difficulty is that we do not have efficient and accurate ways to represent uncertainty.
- Parameters of simple distributions (e.g. VAEs), such as $\mu$ and $\sigma$:
  - For complex models the true posterior is very rarely simple.
  - Biases learning towards parameters where posteriors look simpler than they should be.
- Sample of "particles":
  - Difficult to differentiate through general (non-reparametrisation) samplers: importance resampling or MCMC.

28 An alternative
Borrow from neural theory (Zemel et al., 1998; Sahani & Dayan, 2003):
- Represent distribution $p(z)$ by expectations of encoding functions $\psi_i(z)$.
- For example: $\psi_i(z) = g(w_i z + b_i)$.
- Generalises the idea of moments.
- Each expectation $r_i = \langle \psi_i(z) \rangle$ places a constraint on the encoded distribution.
- Links to kernel-space mean embeddings, predictive state representations, and (as we'll see) information geometry.
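
A minimal sketch of such a code, under stated assumptions: scalar $z$, $g = \tanh$, and randomly drawn weights and biases (the names `w`, `b`, `psi` and `ddc` are hypothetical). The expectations $r_i$ are estimated by averaging over samples. Two distributions with equal mean and nearly equal variance still receive different codes, which is the sense in which the representation generalises moments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 50
w = rng.normal(size=n_units)          # hypothetical random weights w_i
b = rng.normal(size=n_units)          # hypothetical random biases b_i

def psi(z):
    """Encoding functions psi_i(z) = tanh(w_i z + b_i); returns shape (len(z), n_units)."""
    z = np.asarray(z, dtype=float).reshape(-1, 1)
    return np.tanh(z * w + b)

def ddc(samples):
    """Monte Carlo estimate of r_i = E_p[psi_i(z)] from samples of p(z)."""
    return psi(samples).mean(axis=0)

# Same mean (0) and almost the same variance (about 4), but different shapes.
z_gauss = rng.normal(0.0, 2.0, size=20000)
z_bimodal = rng.choice([-2.0, 2.0], size=20000) + rng.normal(0.0, 1e-2, size=20000)

r_gauss = ddc(z_gauss)
r_bimodal = ddc(z_bimodal)
print(np.max(np.abs(r_gauss - r_bimodal)))
```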

33 Decoding a DDC
If needed, we can transform a DDC representation into another form:
- Simple parametric form: method of moments or maximum likelihood.
- Particles: herding.
- Maximum entropy: $p(z) \propto e^{\sum_i \eta_i \psi_i(z)}$, although $\eta_i$ may be difficult to find. (Exponential-family distribution: in general described equally well by the natural parameters $\{\eta_i\}$ or by the mean parameters $\{r_i\}$, as in information geometry.)
[Figure: data distribution compared with the distribution decoded from the DDC; axes $z_1$, $z_2$.]
But perhaps we don't need to decode?
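
The maximum-entropy decoding can be sketched on a discretised one-dimensional support, where the normaliser is a finite sum. This is an illustrative assumption rather than the method used in the talk: the grid, the eight random tanh encoding functions, the step size and the iteration count are all hypothetical. Plain gradient ascent on $\eta \cdot r - \log Z(\eta)$ has gradient $r - \langle \psi \rangle_\eta$, so each step pulls the decoded expectations toward the target code; convergence can be slow, consistent with the remark that the $\eta_i$ may be difficult to find.

```python
import numpy as np

grid = np.linspace(-6.0, 6.0, 601)            # discretised support for z
rng = np.random.default_rng(1)
w = rng.normal(size=8)                        # hypothetical encoding weights
b = rng.normal(size=8)                        # hypothetical encoding biases
Psi = np.tanh(np.outer(grid, w) + b)          # psi_i(z) on the grid, shape (601, 8)

def maxent_density(eta):
    """Probability masses p(z) proportional to exp(sum_i eta_i psi_i(z)) on the grid."""
    logp = Psi @ eta
    logp = logp - logp.max()                  # stabilise the exponential
    p = np.exp(logp)
    return p / p.sum()

def decode(r, lr=0.1, n_steps=2000):
    """Gradient ascent on eta.r - log Z(eta); the gradient is r - E_eta[psi]."""
    eta = np.zeros(Psi.shape[1])
    for _ in range(n_steps):
        eta = eta + lr * (r - maxent_density(eta) @ Psi)
    return eta

# Target code: the DDC of a known distribution (a unit-variance bump centred at 1).
p_true = np.exp(-0.5 * (grid - 1.0) ** 2)
p_true = p_true / p_true.sum()
r = p_true @ Psi

eta = decode(r)
r_decoded = maxent_density(eta) @ Psi
print(np.linalg.norm(r_decoded - r))
```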

36 Computing with DDC mean parameters
For a general exponential family distribution, the mean parameters do not provide easy evaluation of the density. But many computations are essentially evaluations of expected values:
- belief propagation: expectation of conditional density
- decision making: expected reward for action
- variational learning: expected value of log-likelihood / sufficient statistics.
With a flexible set of basis functions, such expectations can be approximated by linear combinations of DDC activations:
$f(x) \approx \sum_i \alpha_i \psi_i(x) \;\Rightarrow\; \langle f(x) \rangle \approx \sum_i \alpha_i \langle \psi_i(x) \rangle = \sum_i \alpha_i r_i$
where the $r_i$ are the learnt expected values.
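
The linear readout can be checked numerically with a small sketch. The assumptions are illustrative only: scalar $z$, 100 random tanh encoding functions, the test function $f(z) = z^2$, and coefficients $\alpha$ fitted by least squares on a grid (`psi`, `alpha`, `z_fit` are hypothetical names). Once $\alpha$ is fitted, the expectation of $f$ under any encoded distribution is a dot product with its code $r$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 100
w = rng.normal(size=n_units)                  # hypothetical encoding weights
b = rng.normal(scale=2.0, size=n_units)       # hypothetical encoding biases

def psi(z):
    """Encoding functions psi_i(z) = tanh(w_i z + b_i); returns shape (len(z), n_units)."""
    z = np.asarray(z, dtype=float).reshape(-1, 1)
    return np.tanh(z * w + b)

def f(z):
    """Function whose expectation is wanted (illustrative choice)."""
    return z ** 2

# Fit alpha so that f(z) is approximately sum_i alpha_i psi_i(z) on a grid.
z_fit = np.linspace(-5.0, 5.0, 500)
alpha, *_ = np.linalg.lstsq(psi(z_fit), f(z_fit), rcond=1e-8)

# DDC of some distribution, estimated from samples.
z = rng.normal(0.5, 0.8, size=50000)
r = psi(z).mean(axis=0)

approx = alpha @ r                            # sum_i alpha_i r_i
exact = np.mean(f(z))                         # direct Monte Carlo estimate of E[f(z)]
print(approx, exact)
```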

39 Example: Bayesian filtering
Model: $z_t = f(z_{t-1}) + dw_z$, $x_t = g(z_t) + dw_x$.
[Figure: latent chain $z_1, z_2, z_3, \dots, z_T$ with observations $x_1, x_2, x_3, \dots, x_T$.]
Bayesian updating rule:
$p(z_t \mid x_{1:t}) \propto \int dz_{t-1}\, p(z_t \mid z_{t-1})\, p(x_t \mid z_t)\, p(z_{t-1} \mid x_{1:t-1})$
Can be implemented by mapping the DDC representations:
$\langle \psi_i(z_t) \rangle = \langle \sigma(w\,\psi(z_{t-1}), x_t) \rangle \;\Rightarrow\; r_i(t) = \sigma(w\,r(t-1), x_t)$
provided $\sigma$ is linear in the first argument. So DDC-filtering is implemented by a form of RNN.

40 Supervised learning
Expectations are easily learned from samples $\{x^{(s)}, z^{(s)}\} \sim p(x, z)$:
- input: $x^{(s)}$
- target: $\psi(z^{(s)})$
- $W = \mathrm{argmin} \sum_s \| f(x^{(s)}; W) - \psi(z^{(s)}) \|^2$
Then $f(x) \approx \langle \psi(z) \rangle_{p(z \mid x)}$.
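
A minimal sketch of this regression, assuming a toy linear-Gaussian model chosen so that the true posterior is known: $z \sim \mathcal{N}(0, 1)$ and $x = z + \text{noise}$ with standard deviation `sigma`. The recognition function here is linear in fixed random tanh features of $x$ (`phi`), which is an illustrative choice and not necessarily the network used in the talk; all names are hypothetical. Because squared-error regression recovers a conditional mean, the fitted output at a new $x$ approximates the posterior DDC.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                                   # observation noise std (assumed)
n_samples = 50000

# Samples from the joint p(x, z): z ~ N(0, 1), x = z + noise.
z = rng.normal(size=n_samples)
x = z + sigma * rng.normal(size=n_samples)

w_z, b_z = rng.normal(size=30), rng.normal(size=30)   # encoding functions of z
a_x, c_x = rng.normal(size=20), rng.normal(size=20)   # input features of x

def psi(z):
    """Targets: psi_i(z) = tanh(w_i z + b_i)."""
    z = np.asarray(z, dtype=float).reshape(-1, 1)
    return np.tanh(z * w_z + b_z)

def phi(x):
    """Input features: random tanh units of x plus a constant."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([np.tanh(x * a_x + c_x), np.ones((x.shape[0], 1))])

# W = argmin sum_s || phi(x_s) W - psi(z_s) ||^2
W, *_ = np.linalg.lstsq(phi(x), psi(z), rcond=1e-6)

def recognise(x_new):
    """Approximate posterior DDC <psi(z)>_{p(z|x)} for each entry of x_new."""
    return phi(x_new) @ W

# Compare with the exact Gaussian posterior at one test input.
x_star = 0.5
post_mean = x_star / (1.0 + sigma ** 2)
post_std = np.sqrt(sigma ** 2 / (1.0 + sigma ** 2))
z_post = rng.normal(post_mean, post_std, size=200000)
r_exact = psi(z_post).mean(axis=0)
r_learned = recognise([x_star])[0]
print(np.max(np.abs(r_learned - r_exact)))
```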

41 Unsupervised learning
The Helmholtz Machine (Dayan et al. 1995): approximate inference by recognition network.
- Generative or causal network: a model of the data.
- Recognition or inference network: reasons about causes of a datum.
Learning:
- Wake phase: estimate mean-field representation $\hat{z} = q(z) = R(x; \rho)$. Update generative parameters $\theta$.
- Sleep phase: sample from generative model. Update recognition parameters $\rho$.

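44 Distributed Distributional Recognition for a Helmholtz Machine
[Figure: generative model with latent layers $z_{L1}, \dots, z_{LK_L}$ down to $z_{11}, z_{12}, \dots, z_{1K_1}$ above observations $x_1, x_2, \dots, x_D$; each latent layer $l$ has a DDC recognition representation $\psi_1(z_l), \psi_2(z_l), \dots, \psi_N(z_l)$.]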

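45 Wake phase: learning the model
Learning requires expected gradients of joint likelihood.
[Figure: the layered model with a DDC representation $\psi_1(z), \psi_2(z), \dots, \psi_N(z)$ attached to each latent layer.]
$\nabla_\theta F(z_l, \theta) \approx \sum_i \gamma_i^l \psi_i(z_l) \;\Rightarrow\; \langle \nabla_\theta F(z_l, \theta) \rangle_q \approx \sum_i \gamma_i^l \langle \psi_i(z_l) \rangle_q$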

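48 Sleep phase: learning to recognise and to learn
Samples in the sleep phase are used to learn the recognition model and the gradients needed for learning.
[Figure: sleep-phase samples of each latent layer and of $x_1, x_2, \dots, x_D$, with the encodings $\psi_1(z), \psi_2(z), \dots, \psi_N(z)$ computed from the samples.]
- Train $\rho_1 : \phi(x^{(s)}) \to \psi(z_1^{(s)})$ and $\rho_l : \psi(z_{l-1}^{(s)}) \to \psi(z_l^{(s)})$; in the wake phase: $r_l = \langle \psi(z_l) \rangle_x = \rho_l\, \rho_{l-1} \cdots \rho_1\, \phi(x)$.
- Train $\alpha_l : \psi(z_l^{(s)}) \to T_l^{(s)} \nabla_\theta g(z_{l+1}^{(s)}, \theta)$ and $\beta_l : \psi(z_l^{(s)}) \to T_{l-1}^{(s)} \nabla_\theta g(z_l^{(s)}, \theta_{l-1})$; in the wake phase, $\langle \nabla_{\theta_l} F \rangle_x$ is read out from $\alpha_l r_l$ and $\beta_{l+1} r_{l+1}$.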

50 Results
Model: 2 latent-layer deep exponential network with Laplacian conditionals.
[Figure: log MMD for the VAE and the HM.]

51 Results
Model: 2 latent-layer model of olfaction with Gamma latents.
[Figure: distributions over $x_1, \dots, x_5$ for the true model, compared with the DDC HM (MMD = 0.002) and with the VAE (MMD = 0.02).]

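52 Results
2-layer model on patches from natural images (van Hateren, 1998).
[Figure: model architecture with latent layers $z^{(2)}_1, z^{(2)}_2, \dots, z^{(2)}_{K_2}$ and $z^{(1)}_1, \dots, z^{(1)}_{K_1}$ above observations $x_1, x_2, \dots, x_d$.]
[Table: MMD for the VAE and for the HM, with the p-value for $H_0$: VAE < HM, for five model architectures ($L_1$ = 10, 10, 50, 50 and 100, each paired with an $L_2$). Every p-value is $< 10^{-10}$.]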

59 Summary
- Machine learning systems have some way to go before they approach biological performance.
- Part of the problem has to do with limited representations of probabilistic beliefs.
- Distributed distributional representations (DDCs):
  - carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code)
  - are well suited to learning from examples
  - and ease computation of expectations given a sufficiently rich family of encoding functionals.
- The DDC Helmholtz machine provides a local approach for learning, with a rich posterior (inferential) representation, which outperforms standard machine learning approaches.

60 Thanks
Collaborators: Eszter Vertes, Kevin Li, Peter Dayan.
Current Group: Gergo Bohner, Angus Chadwick, Lea Duncker, Kevin Li, Kirsty McNaught, Arne Meyer, Virginia Rutten, Joana Soldado-Magraner, Eszter Vertes.
