Computing with Distributed Distributional Codes: Convergent Inference in Brains and Machines?


Computing with Distributed Distributional Codes: Convergent Inference in Brains and Machines? Maneesh Sahani, Professor of Theoretical Neuroscience and Machine Learning, Gatsby Computational Neuroscience Unit, University College London. October 26, 2018

AI and Biology Rosenblatt's Perceptron combined (rudimentary) psychology, neuroscience, ML and AI.


Modern ML [Figure: LeNet-5 architecture (LeCun et al. 1998): INPUT 32x32, C1: feature maps 6@28x28, S2: f. maps 6@14x14, C3: f. maps 16@10x10, S4: f. maps 16@5x5, C5: layer 120, F6: layer 84, OUTPUT 10; convolutions, subsampling, full connection, Gaussian connections.] [Figure: ImageNet error rate (top 5) by year, 2011-2016, with entries XRCE, AlexNet, ZF, VGG, GoogLeNet-v4, ResNet and human-level performance; axis marks at 20% and 10%.]

Deep Learning and Biology Yamins & DiCarlo, 2016


Is this the whole story?

Adversarial examples (Goodfellow et al. 2015, ICLR): x ("panda", 57.7% confidence) + .007 × sign(∇_x J(θ, x, y)) ("nematode", 8.2% confidence) = x + ε sign(∇_x J(θ, x, y)) ("gibbon", 99.3% confidence). Hints that recognition in deep convolutional nets depends on a conjunction of textural cues.

Vision isn't just object recognition Pixel decision contributions don't segment objects... Zintgraf et al. ICLR

Vision isn't just object recognition Pixel decision contributions don't segment objects... and we can parse scenes like this: Fantastic Planet (1973), René Laloux

Vision isn't just object recognition Pixel decision contributions don't segment objects... and we can parse scenes like this: Limited extrapolative generalisation. Fantastic Planet (1973), René Laloux

Vision isn't just object recognition Pixel decision contributions don't segment objects... and we can parse scenes like this: Limited extrapolative generalisation. No sense of causal structure. Fantastic Planet (1973), René Laloux

Inference and Bayes

Inference and Bayes A and B have the same physical luminance.

Inference and Bayes A and B have the same physical luminance. They appear different because we see by inference: here we infer the likely reflectances of the squares.

Inference and Bayes A and B have the same physical luminance. They appear different because we see by inference: here we infer the likely reflectances of the squares. Inferences depend on many cues, local and remote: optimal integration is usually probabilistic or Bayesian.

Can supervised (deep) learning be Bayesian? Yes: if Bayes is optimal, then with enough supervision a network will learn to behave in a Bayesian way.

Can supervised (deep) learning be Bayesian? Yes: if Bayes is optimal, then with enough supervision a network will learn to behave in a Bayesian way. No: the specific Bayes-optimal function may look very different from case to case. Bayesian reasoning involves beliefs about real but unobserved causal quantities. Basis of extrapolative generalisation. Usually derived by seeking conditional independence.

What about variational auto-encoders? [Figure: VAE architecture with noise input ε, latent layers y^(1), ..., y^(3), inputs x_1, ..., x_d and reconstructions x̂_1, ..., x̂_D.] VAEs (and GANs) can reproduce generative densities quite well (though with biases, as we'll see). They generate samples using deterministic transformations of external random variates (the reparametrisation trick). By themselves, they do not learn sensible causal components (but see Johnson et al. 2016). They require a simplified posterior representation, since analytic expectations or samples are needed: mostly Gaussian (or another tractable exponential family), and posterior correlations are generally neglected or severely simplified.

What about variational auto-encoders? [Figure: VAE architecture with noise input ε, latent layers y^(1), ..., y^(3), inputs x_1, ..., x_d and reconstructions x̂_1, ..., x̂_D.] VAEs (and GANs) can reproduce generative densities quite well (though with biases, as we'll see). They generate samples using deterministic transformations of external random variates (the reparametrisation trick). By themselves, they do not learn sensible causal components (but see Johnson et al. 2016). They require a simplified posterior representation, since analytic expectations or samples are needed: mostly Gaussian (or another tractable exponential family), and posterior correlations are generally neglected or severely simplified. If the eventual goal is inference (for understanding, or for decisions), then this simplified posterior may miss the point.

Encoding uncertainty? Part of the difficulty is that we do not have efficient and accurate ways to represent uncertainty. Parameters of simple distributions (e.g. VAEs: a mean µ and a width σ): for complex models the true posterior is very rarely simple, which biases learning towards parameters where posteriors look simpler than they should be. A sample of "particles": difficult to differentiate through general (non-reparametrisation) samplers such as importance resampling or MCMC.

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003):

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z).


An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z). For example: ψ_i(z) = g(w_i·z + b_i).

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z). For example: ψ_i(z) = g(w_i·z + b_i). Generalises the idea of moments.

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z). For example: ψ_i(z) = g(w_i·z + b_i). Generalises the idea of moments. Each expectation r_i = ⟨ψ_i(z)⟩ places a constraint on the encoded distribution.

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z). For example: ψ_i(z) = g(w_i·z + b_i). Generalises the idea of moments. Each expectation r_i = ⟨ψ_i(z)⟩ places a constraint on the encoded distribution. Links to kernel-space mean embeddings, predictive state representations, and (as we'll see) information geometry.
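To make the encoding concrete, here is a minimal Python sketch (not from the talk; the tanh features, random weights and toy distribution are illustrative assumptions): evaluate a fixed bank of nonlinear functions ψ_i(z) = g(w_i·z + b_i) and average them over samples of p(z) to obtain the DDC expectations r_i.

# Minimal sketch: encode a distribution p(z) by expectations r_i = E_p[psi_i(z)]
# of random nonlinear features psi_i(z) = tanh(w_i . z + b_i).
import numpy as np

rng = np.random.default_rng(0)
D, N = 2, 200                          # latent dimension, number of encoding functions

W = rng.normal(size=(N, D))            # random feature weights w_i
b = rng.normal(size=N)                 # random feature offsets b_i

def psi(z):
    """Evaluate all N encoding functions at points z, shape (S, D) -> (S, N)."""
    return np.tanh(z @ W.T + b)

# Represent an example p(z) by its DDC: average the features over samples of p.
z_samples = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], size=5000)
r = psi(z_samples).mean(axis=0)        # r_i = E_p[psi_i(z)], the DDC of p(z)
print(r[:5])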

Decoding a DDC If needed, we can transform a DDC representation into another form:

Decoding a DDC If needed, we can transform a DDC representation into another form: Simple parametric form: method of moments or maximum likelihood.

Decoding a DDC If needed, we can transform a DDC representation into another form: Simple parametric form: method of moments or maximum likelihood. Particles: herding.

Decoding a DDC If needed, we can transform a DDC representation into another form: Simple parametric form: method of moments or maximum likelihood. Particles: herding. Maximum entropy: p(z) ∝ exp(Σ_i η_i ψ_i(z)), although the η_i may be difficult to find. (Exponential-family distribution: in general described equally well by the natural parameters {η_i} or by the mean parameters {r_i}, as in information geometry.) [Figure: data distribution vs. distribution decoded from the DDC, in the (z_1, z_2) plane.]
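As an illustration of the maximum-entropy route, the sketch below (all names, the 1-D grid approximation, the Gaussian-bump features and the fixed step size are assumptions of this example, not the talk's implementation) fits the natural parameters η by gradient ascent on the max-ent dual, i.e. by matching the model expectations of ψ to the encoded means r.

# Sketch: decode a DDC r into a maximum-entropy density p(z) ∝ exp(sum_i eta_i psi_i(z))
# on a 1-D grid, by matching the model's expectations of psi to r.
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(-5.0, 5.0, 400)
centres = np.linspace(-4.0, 4.0, 25)
Psi = np.exp(-(grid[:, None] - centres) ** 2)        # psi_i evaluated on the grid

# Target DDC: expectations of psi under a bimodal "true" distribution.
z_samples = np.concatenate([rng.normal(-2.0, 0.5, 5000), rng.normal(1.5, 0.8, 5000)])
r = np.exp(-(z_samples[:, None] - centres) ** 2).mean(axis=0)

eta = np.zeros(len(centres))
for _ in range(2000):                                # fit natural parameters eta
    logp = Psi @ eta
    p = np.exp(logp - logp.max())
    p /= p.sum()                                     # discrete (grid) normalisation
    model_r = p @ Psi                                # current model expectations of psi
    eta += 0.5 * (r - model_r)                       # gradient of the max-ent dual

print("max constraint violation:", np.abs(r - model_r).max())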

Decoding a DDC If needed, we can transform a DDC representation into another form: Simple parametric form: method of moments or maximum likelihood. Particles: herding. Maximum entropy: p(z) ∝ exp(Σ_i η_i ψ_i(z)), although the η_i may be difficult to find. (Exponential-family distribution: in general described equally well by the natural parameters {η_i} or by the mean parameters {r_i}, as in information geometry.) [Figure: data distribution vs. distribution decoded from the DDC, in the (z_1, z_2) plane.] But perhaps we don't need to decode?

Computing with DDC mean parameters For a general exponential family distribution, the mean parameters do not provide easy evaluation of the density.

Computing with DDC mean parameters For a general exponential family distribution, the mean parameters do not provide easy evaluation of the density. But many computations are essentially evaluations of expected values. Belief propagation: expectation of a conditional density. Decision making: expected reward for an action. Variational learning: expected value of the log-likelihood / sufficient statistics.

Computing with DDC mean parameters For a general exponential family distribution, the mean parameters do not provide easy evaluation of the density. But many computations are essentially evaluations of expected values. Belief propagation: expectation of a conditional density. Decision making: expected reward for an action. Variational learning: expected value of the log-likelihood / sufficient statistics. With a flexible set of basis functions, such expectations can be approximated by linear combinations of DDC activations: f(x) ≈ Σ_i α_i ψ_i(x), so ⟨f(x)⟩ ≈ Σ_i α_i ⟨ψ_i(x)⟩ = Σ_i α_i r_i, where the r_i are the learnt expected values.
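A small numerical sketch of this readout (illustrative only; the test function, the mixture distribution, the bump features and the least-squares fit are assumptions of the example): fit weights α so that f(z) ≈ Σ_i α_i ψ_i(z), and the expectation under p is then recovered as the linear readout α·r from the DDC.

# Sketch: with flexible features, E_p[f(z)] can be read out linearly from the DDC r of p(z).
import numpy as np

rng = np.random.default_rng(1)
centres = np.linspace(-5.0, 5.0, 60)
psi = lambda z: np.exp(-(z[:, None] - centres) ** 2)     # Gaussian-bump encoding functions

# DDC of an example distribution p(z) (a two-component mixture), estimated from samples.
z_p = np.concatenate([rng.normal(-1.0, 0.6, 20000), rng.normal(2.0, 0.4, 20000)])
r = psi(z_p).mean(axis=0)                                # r_i = E_p[psi_i(z)]

# Fit alpha so that f(z) ~ sum_i alpha_i psi_i(z) over a broad grid of z values ...
f = lambda z: np.cos(z) + 0.1 * z ** 2                   # arbitrary test function
z_grid = np.linspace(-5.0, 5.0, 1000)
alpha, *_ = np.linalg.lstsq(psi(z_grid), f(z_grid), rcond=None)

# ... then E_p[f(z)] is approximated by the linear readout alpha . r.
print("linear readout:", alpha @ r)
print("Monte Carlo   :", f(z_p).mean())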

Example: Bayesian filtering z_t = f(z_{t-1}) + dw_z, x_t = g(z_t) + dw_x. [Figure: state-space graphical model z_1 → z_2 → z_3 → ... → z_T with observations x_1, x_2, x_3, ..., x_T.] Bayesian updating rule: p(z_t | x_{1:t}) ∝ ∫ dz_{t-1} p(z_t | z_{t-1}) p(x_t | z_t) p(z_{t-1} | x_{1:t-1})

Example: Bayesian filtering z_t = f(z_{t-1}) + dw_z, x_t = g(z_t) + dw_x. [Figure: state-space graphical model z_1 → z_2 → z_3 → ... → z_T with observations x_1, x_2, x_3, ..., x_T.] Bayesian updating rule: p(z_t | x_{1:t}) ∝ ∫ dz_{t-1} p(z_t | z_{t-1}) p(x_t | z_t) p(z_{t-1} | x_{1:t-1}) This can be implemented by a mapping between the DDC representations: ⟨ψ_i(z_t)⟩ = σ(w ψ(z_{t-1}), x_t), i.e. r_i(t) = σ(w r(t-1), x_t), provided σ is linear in its first argument.

Example: Bayesian filtering z_t = f(z_{t-1}) + dw_z, x_t = g(z_t) + dw_x. [Figure: state-space graphical model z_1 → z_2 → z_3 → ... → z_T with observations x_1, x_2, x_3, ..., x_T.] Bayesian updating rule: p(z_t | x_{1:t}) ∝ ∫ dz_{t-1} p(z_t | z_{t-1}) p(x_t | z_t) p(z_{t-1} | x_{1:t-1}) This can be implemented by a mapping between the DDC representations: ⟨ψ_i(z_t)⟩ = σ(w ψ(z_{t-1}), x_t), i.e. r_i(t) = σ(w r(t-1), x_t), provided σ is linear in its first argument. So DDC filtering is implemented by a form of RNN.
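A structural sketch of this recurrent update (the bilinear parameterisation, the feature map φ for x_t, and the random placeholder tensor are assumptions of this example; in practice the weights would be learned from simulated trajectories of the state-space model): one filtering step maps (r(t-1), x_t) to r(t), linearly in r(t-1), and unrolling it over the observation sequence gives a simple RNN.

# Structural sketch only: a DDC filtering step that is linear in the previous DDC r(t-1),
# with observation dependence entering through features phi(x_t).
import numpy as np

rng = np.random.default_rng(2)
N, K, Dx = 50, 20, 3                              # DDC size, observation features, obs. dimension
M = rng.normal(size=(N, N, K)) / np.sqrt(N * K)   # placeholder weights: would be learned
Wx = rng.normal(size=(K, Dx))

def phi(x):
    """Illustrative feature map for the observation x_t."""
    return np.tanh(Wx @ x)

def ddc_filter_step(r_prev, x_t):
    """r_t[i] = sum_{j,k} M[i, j, k] * r_prev[j] * phi(x_t)[k]  (linear in r_prev)."""
    return np.einsum("ijk,j,k->i", M, r_prev, phi(x_t))

# Unrolling the step over a sequence of observations gives a simple RNN.
r = np.ones(N) / N
for x_t in rng.normal(size=(10, Dx)):
    r = ddc_filter_step(r, x_t)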

Supervised learning Expectations are easily learned from samples {x^(s), z^(s)} ~ p(x, z): with input x^(s) and target ψ(z^(s)), W* = argmin_W Σ_s ‖f(x^(s); W) - ψ(z^(s))‖², so that f(x) ≈ ⟨ψ(z)⟩_{p(z|x)}.
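A minimal sketch of this supervised step (the toy joint p(x, z), the feature maps and the least-squares fit are illustrative assumptions): regressing ψ(z^(s)) on features of x^(s) for joint samples makes the fitted function approximate the posterior DDC ⟨ψ(z)⟩_{p(z|x)}.

# Sketch: learn r(x) = E[psi(z) | x] by regression on joint samples (x, z) ~ p(x, z).
import numpy as np

rng = np.random.default_rng(4)
S = 20000
z = rng.normal(size=(S, 1))                          # latent samples z^(s)
x = z + 0.5 * rng.normal(size=(S, 1))                # noisy observations x^(s) ~ p(x | z)

centres = np.linspace(-3.0, 3.0, 30)
psi = lambda z: np.exp(-(z - centres) ** 2)          # DDC functions of z, shape (S, 30)
phi = lambda x: np.concatenate([np.ones_like(x), x, x ** 2], axis=1)   # input features

# Least squares: f(x; W) = phi(x) W approximates E[psi(z) | x].
W, *_ = np.linalg.lstsq(phi(x), psi(z), rcond=None)

# For a new observation x*, f(x*) is (approximately) the posterior DDC of z.
x_star = np.array([[1.0]])
r_posterior = phi(x_star) @ W
print(r_posterior.shape)                             # (1, 30)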

Unsupervised learning The Helmholtz Machine (Dayan et al. 1995): approximate inference by a recognition network. Generative or causal network: a model of the data. Recognition or inference network: reasons about the causes of a datum. Learning. Wake phase: estimate the mean-field representation ẑ = ⟨z⟩_{q(z)} = R(x; ρ); update generative parameters θ. Sleep phase: sample from the generative model; update recognition parameters ρ.

Distributed Distributional Recognition for a Helmholtz Machine [Figure: layered Helmholtz machine with top latent layer z_L1, ..., z_LK_L, first latent layer z_11, z_12, ..., z_1K_1, and observations x_1, x_2, ..., x_D.]

Distributed Distributional Recognition for a Helmholtz Machine [Figure: as above, with the first latent layer also carrying a DDC ψ_1(z_1), ψ_2(z_1), ..., ψ_N(z_1).]

Distributed Distributional Recognition for a Helmholtz Machine [Figure: as above, with every latent layer carrying a DDC: ψ_1(z_1), ..., ψ_N(z_1) and ψ_1(z_L), ..., ψ_N(z_L).]

Wake phase: learning the model. Learning requires expected gradients of the joint likelihood. [Figure: the layered network with a DDC ψ_1(z), ..., ψ_N(z) at each latent layer.] ∇_θ F(z_l, θ) ≈ Σ_i γ_i^l ψ_i(z_l), so ⟨∇_θ F(z_l, θ)⟩_q ≈ Σ_i γ_i^l ⟨ψ_i(z_l)⟩.

Sleep phase: learning to recognise and to learn. Samples in the sleep phase are used to learn the recognition model and the gradients needed for learning. [Figure: samples drawn top-down through the layered network z_L → ... → z_1 → x, with a DDC ψ_1(z), ..., ψ_N(z) at each latent layer.]

Sleep phase: learning to recognise and to learn. Samples in the sleep phase are used to learn the recognition model and the gradients needed for learning. [Figure: samples drawn top-down through the layered network z_L → ... → z_1 → x, with a DDC ψ_1(z), ..., ψ_N(z) at each latent layer.] Train ρ_1: φ(x^(s)) → ψ(z_1^(s)) and ρ_l: ψ(z_{l-1}^(s)) → ψ(z_l^(s)); then in the wake phase: r_l = ⟨ψ(z_l)⟩_x = ρ_l ∘ ρ_{l-1} ∘ ... ∘ ρ_1(φ(x)).

Sleep phase: learning to recognise and to learn. Samples in the sleep phase are used to learn the recognition model and the gradients needed for learning. [Figure: samples drawn top-down through the layered network z_L → ... → z_1 → x, with a DDC ψ_1(z), ..., ψ_N(z) at each latent layer.] Train ρ_1: φ(x^(s)) → ψ(z_1^(s)) and ρ_l: ψ(z_{l-1}^(s)) → ψ(z_l^(s)); then in the wake phase: r_l = ⟨ψ(z_l)⟩_x = ρ_l ∘ ρ_{l-1} ∘ ... ∘ ρ_1(φ(x)). Train α_l: ψ(z_l^(s)) → T_l^(s) = ∇_{θ_l} g(z_{l+1}^(s), θ_l) and β_l: ψ(z_l^(s)) → T_{l-1}^(s) = ∇_θ g(z_l^(s), θ_{l-1}); then in the wake phase: ⟨∇_{θ_l} F⟩_x = α_l r_l + β_{l+1} r_{l+1}.

Results Model: 2 latent-layer deep exponential network with Laplacian conditionals.

Results Model: 2 latent-layer deep exponential network with Laplacian conditionals. [Figure: histogram of log MMD between learned and true models for the VAE and the DDC Helmholtz machine (HM).]

Results Model: 2 latent-layer model of olfaction with Gamma latents. [Figure: marginals over x_1, ..., x_5: DDC HM vs. true model, MMD = 0.002; VAE vs. true model, MMD = 0.02.]

Results 2-layer model on 16×16 patches from natural images (van Hateren, 1998). [Figure: two-latent-layer architecture z^(2)_1, ..., z^(2)_{K2} → z^(1)_1, ..., z^(1)_{K1} → x_1, ..., x_D.]

Model architecture      MMD VAE   MMD HM     p-val (H_0: VAE < HM)
L_1 = 10,  L_2 = 2      0.235     0.0497     < 10^-10
L_1 = 10,  L_2 = 10     0.274     0.0252     < 10^-10
L_1 = 50,  L_2 = 2      0.222     0.000392   < 10^-10
L_1 = 50,  L_2 = 10     0.251     0.000526   < 10^-10
L_1 = 100, L_2 = 2      0.286     0.000595   < 10^-10

Summary Machine learning systems have some way to go before they approach biological performance.

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs.

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs):

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs): carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code)

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs): carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code) are well suited to learning from examples

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs): carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code) are well suited to learning from examples and ease computation of expectations given a sufficiently rich family of encoding functionals

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs): carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code) are well suited to learning from examples and ease computation of expectations given a sufficiently rich family of encoding functionals The DDC Helmholtz machine provides a local approach for learning, with a rich posterior (inferential) representation, which outperforms standard machine learning approaches.

Thanks Collaborators: Eszter Vertes, Kevin Li, Peter Dayan. Current Group: Gergo Bohner, Angus Chadwick, Lea Duncker, Kevin Li, Kirsty McNaught, Arne Meyer, Virginia Rutten, Joana Soldado-Magraner, Eszter Vertes.