Computing with Distributed Distributional Codes: Convergent Inference in Brains and Machines?


Computing with Distributed Distributional Codes: Convergent Inference in Brains and Machines? Maneesh Sahani, Professor of Theoretical Neuroscience and Machine Learning, Gatsby Computational Neuroscience Unit, University College London. October 26, 2018

AI and Biology Rosenblatt's Perceptron combined (rudimentary) psychology, neuroscience, ML and AI.


Modern ML [Figure: LeNet-5 architecture (LeCun et al. 1998): INPUT 32x32, C1: feature maps 6@28x28, S2: f. maps 6@14x14, C3: f. maps 16@10x10, S4: f. maps 16@5x5, C5: layer 120, F6: layer 84, OUTPUT 10; convolutions, subsampling, full connection, Gaussian connections.] [Figure: ImageNet error rate (top 5) by year, 2011-2016, with entries XRCE, AlexNet, ZF, VGG, GoogLeNet-v4, ResNet and human-level performance; axis marks at 20% and 10%.]

Deep Learning and Biology Yamins & DiCarlo, 2016


Is this the whole story?

Adversarial examples (Goodfellow et al. 2015, ICLR): x ("panda", 57.7% confidence) + .007 × sign(∇_x J(θ, x, y)) ("nematode", 8.2% confidence) = x + ε sign(∇_x J(θ, x, y)) ("gibbon", 99.3% confidence). Hints that recognition in deep convolutional nets depends on a conjunction of textural cues.

Vision isn't just object recognition Pixel decision contributions don't segment objects... Zintgraf et al. ICLR

Vision isn't just object recognition Pixel decision contributions don't segment objects... and we can parse scenes like this: Fantastic Planet (1973), René Laloux

Vision isn't just object recognition Pixel decision contributions don't segment objects... and we can parse scenes like this: Limited extrapolative generalisation. Fantastic Planet (1973), René Laloux

Vision isn't just object recognition Pixel decision contributions don't segment objects... and we can parse scenes like this: Limited extrapolative generalisation. No sense of causal structure. Fantastic Planet (1973), René Laloux

Inference and Bayes

Inference and Bayes A and B have the same physical luminance.

Inference and Bayes A and B have the same physical luminance. They appear different because we see by inference: here we infer the likely reflectances of the squares.

Inference and Bayes A and B have the same physical luminance. They appear different because we see by inference: here we infer the likely reflectances of the squares. Inferences depend on many cues, local and remote: optimal integration is usually probabilistic or Bayesian.

Can supervised (deep) learning be Bayesian? Yes: if Bayes is optimal, then with enough supervision a network will learn to behave in a Bayesian way.

Can supervised (deep) learning be Bayesian? Yes: if Bayes is optimal, then with enough supervision a network will learn to behave in a Bayesian way. No: the specific Bayes-optimal function may look very different from case to case. Bayesian reasoning involves beliefs about real but unobserved causal quantities. Basis of extrapolative generalisation. Usually derived by seeking conditional independence.

What about variational auto-encoders? [Figure: VAE architecture with noise input ε, latent layers y^(1), ..., y^(3), inputs x_1, ..., x_d and reconstructions x̂_1, ..., x̂_D.] VAEs (and GANs) can reproduce generative densities quite well (though with biases, as we'll see). They generate samples using deterministic transformations of external random variates (the reparametrisation trick). By themselves, they do not learn sensible causal components (but see Johnson et al. 2016). They require a simplified posterior representation, since analytic expectations or samples are needed: mostly Gaussian (or another tractable exponential family), and posterior correlations are generally neglected or severely simplified.

What about variational auto-encoders? [Figure: VAE architecture with noise input ε, latent layers y^(1), ..., y^(3), inputs x_1, ..., x_d and reconstructions x̂_1, ..., x̂_D.] VAEs (and GANs) can reproduce generative densities quite well (though with biases, as we'll see). They generate samples using deterministic transformations of external random variates (the reparametrisation trick). By themselves, they do not learn sensible causal components (but see Johnson et al. 2016). They require a simplified posterior representation, since analytic expectations or samples are needed: mostly Gaussian (or another tractable exponential family), and posterior correlations are generally neglected or severely simplified. If the eventual goal is inference (for understanding, or for decisions), then this simplified posterior may miss the point.

Encoding uncertainty? Part of the difficulty is that we do not have efficient and accurate ways to represent uncertainty. Parameters of simple distributions (e.g. VAEs: a mean µ and a width σ): for complex models the true posterior is very rarely simple, which biases learning towards parameters where posteriors look simpler than they should be. A sample of "particles": difficult to differentiate through general (non-reparametrisation) samplers such as importance resampling or MCMC.

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003):

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z).


An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z). For example: ψ_i(z) = g(w_i·z + b_i).

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z). For example: ψ_i(z) = g(w_i·z + b_i). Generalises the idea of moments.

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z). For example: ψ_i(z) = g(w_i·z + b_i). Generalises the idea of moments. Each expectation r_i = ⟨ψ_i(z)⟩ places a constraint on the encoded distribution.

An alternative Borrow from neural theory: Zemel et al. (1998); Sahani & Dayan (2003): Represent the distribution p(z) by expectations of encoding functions ψ_i(z). For example: ψ_i(z) = g(w_i·z + b_i). Generalises the idea of moments. Each expectation r_i = ⟨ψ_i(z)⟩ places a constraint on the encoded distribution. Links to kernel-space mean embeddings, predictive state representations, and (as we'll see) information geometry.
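To make the encoding concrete, here is a minimal Python sketch (not from the talk; the tanh features, random weights and toy distribution are illustrative assumptions): evaluate a fixed bank of nonlinear functions ψ_i(z) = g(w_i·z + b_i) and average them over samples of p(z) to obtain the DDC expectations r_i.

# Minimal sketch: encode a distribution p(z) by expectations r_i = E_p[psi_i(z)]
# of random nonlinear features psi_i(z) = tanh(w_i . z + b_i).
import numpy as np

rng = np.random.default_rng(0)
D, N = 2, 200                          # latent dimension, number of encoding functions

W = rng.normal(size=(N, D))            # random feature weights w_i
b = rng.normal(size=N)                 # random feature offsets b_i

def psi(z):
    """Evaluate all N encoding functions at points z, shape (S, D) -> (S, N)."""
    return np.tanh(z @ W.T + b)

# Represent an example p(z) by its DDC: average the features over samples of p.
z_samples = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], size=5000)
r = psi(z_samples).mean(axis=0)        # r_i = E_p[psi_i(z)], the DDC of p(z)
print(r[:5])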

Decoding a DDC If needed, we can transform a DDC representation into another form:

Decoding a DDC If needed, we can transform a DDC representation into another form: Simple parametric form: method of moments or maximum likelihood.

Decoding a DDC If needed, we can transform a DDC representation into another form: Simple parametric form: method of moments or maximum likelihood. Particles: herding.

Decoding a DDC If needed, we can transform a DDC representation into another form: Simple parametric form: method of moments or maximum likelihood. Particles: herding. Maximum entropy: p(z) ∝ exp(Σ_i η_i ψ_i(z)), although the η_i may be difficult to find. (Exponential-family distribution: in general described equally well by the natural parameters {η_i} or by the mean parameters {r_i}, as in information geometry.) [Figure: data distribution vs. distribution decoded from the DDC, in the (z_1, z_2) plane.]
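As an illustration of the maximum-entropy route, the sketch below (all names, the 1-D grid approximation, the Gaussian-bump features and the fixed step size are assumptions of this example, not the talk's implementation) fits the natural parameters η by gradient ascent on the max-ent dual, i.e. by matching the model expectations of ψ to the encoded means r.

# Sketch: decode a DDC r into a maximum-entropy density p(z) ∝ exp(sum_i eta_i psi_i(z))
# on a 1-D grid, by matching the model's expectations of psi to r.
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(-5.0, 5.0, 400)
centres = np.linspace(-4.0, 4.0, 25)
Psi = np.exp(-(grid[:, None] - centres) ** 2)        # psi_i evaluated on the grid

# Target DDC: expectations of psi under a bimodal "true" distribution.
z_samples = np.concatenate([rng.normal(-2.0, 0.5, 5000), rng.normal(1.5, 0.8, 5000)])
r = np.exp(-(z_samples[:, None] - centres) ** 2).mean(axis=0)

eta = np.zeros(len(centres))
for _ in range(2000):                                # fit natural parameters eta
    logp = Psi @ eta
    p = np.exp(logp - logp.max())
    p /= p.sum()                                     # discrete (grid) normalisation
    model_r = p @ Psi                                # current model expectations of psi
    eta += 0.5 * (r - model_r)                       # gradient of the max-ent dual

print("max constraint violation:", np.abs(r - model_r).max())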

Decoding a DDC If needed, we can transform a DDC representation into another form: Simple parametric form: method of moments or maximum likelihood. Particles: herding. Maximum entropy: p(z) ∝ exp(Σ_i η_i ψ_i(z)), although the η_i may be difficult to find. (Exponential-family distribution: in general described equally well by the natural parameters {η_i} or by the mean parameters {r_i}, as in information geometry.) [Figure: data distribution vs. distribution decoded from the DDC, in the (z_1, z_2) plane.] But perhaps we don't need to decode?

Computing with DDC mean parameters For a general exponential family distribution, the mean parameters do not provide easy evaluation of the density.

Computing with DDC mean parameters For a general exponential family distribution, the mean parameters do not provide easy evaluation of the density. But many computations are essentially evaluations of expected values. Belief propagation: expectation of a conditional density. Decision making: expected reward for an action. Variational learning: expected value of the log-likelihood / sufficient statistics.

Computing with DDC mean parameters For a general exponential family distribution, the mean parameters do not provide easy evaluation of the density. But many computations are essentially evaluations of expected values. Belief propagation: expectation of a conditional density. Decision making: expected reward for an action. Variational learning: expected value of the log-likelihood / sufficient statistics. With a flexible set of basis functions, such expectations can be approximated by linear combinations of DDC activations: f(x) ≈ Σ_i α_i ψ_i(x), so ⟨f(x)⟩ ≈ Σ_i α_i ⟨ψ_i(x)⟩ = Σ_i α_i r_i, where the r_i are the learnt expected values.
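A small numerical sketch of this readout (illustrative only; the test function, the mixture distribution, the bump features and the least-squares fit are assumptions of the example): fit weights α so that f(z) ≈ Σ_i α_i ψ_i(z), and the expectation under p is then recovered as the linear readout α·r from the DDC.

# Sketch: with flexible features, E_p[f(z)] can be read out linearly from the DDC r of p(z).
import numpy as np

rng = np.random.default_rng(1)
centres = np.linspace(-5.0, 5.0, 60)
psi = lambda z: np.exp(-(z[:, None] - centres) ** 2)     # Gaussian-bump encoding functions

# DDC of an example distribution p(z) (a two-component mixture), estimated from samples.
z_p = np.concatenate([rng.normal(-1.0, 0.6, 20000), rng.normal(2.0, 0.4, 20000)])
r = psi(z_p).mean(axis=0)                                # r_i = E_p[psi_i(z)]

# Fit alpha so that f(z) ~ sum_i alpha_i psi_i(z) over a broad grid of z values ...
f = lambda z: np.cos(z) + 0.1 * z ** 2                   # arbitrary test function
z_grid = np.linspace(-5.0, 5.0, 1000)
alpha, *_ = np.linalg.lstsq(psi(z_grid), f(z_grid), rcond=None)

# ... then E_p[f(z)] is approximated by the linear readout alpha . r.
print("linear readout:", alpha @ r)
print("Monte Carlo   :", f(z_p).mean())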

Example: Bayesian filtering z_t = f(z_{t-1}) + dw_z, x_t = g(z_t) + dw_x. [Figure: state-space graphical model z_1 → z_2 → z_3 → ... → z_T with observations x_1, x_2, x_3, ..., x_T.] Bayesian updating rule: p(z_t | x_{1:t}) ∝ ∫ dz_{t-1} p(z_t | z_{t-1}) p(x_t | z_t) p(z_{t-1} | x_{1:t-1})

Example: Bayesian filtering z_t = f(z_{t-1}) + dw_z, x_t = g(z_t) + dw_x. [Figure: state-space graphical model z_1 → z_2 → z_3 → ... → z_T with observations x_1, x_2, x_3, ..., x_T.] Bayesian updating rule: p(z_t | x_{1:t}) ∝ ∫ dz_{t-1} p(z_t | z_{t-1}) p(x_t | z_t) p(z_{t-1} | x_{1:t-1}) This can be implemented by a mapping between the DDC representations: ⟨ψ_i(z_t)⟩ = σ(w ψ(z_{t-1}), x_t), i.e. r_i(t) = σ(w r(t-1), x_t), provided σ is linear in its first argument.

Example: Bayesian filtering z_t = f(z_{t-1}) + dw_z, x_t = g(z_t) + dw_x. [Figure: state-space graphical model z_1 → z_2 → z_3 → ... → z_T with observations x_1, x_2, x_3, ..., x_T.] Bayesian updating rule: p(z_t | x_{1:t}) ∝ ∫ dz_{t-1} p(z_t | z_{t-1}) p(x_t | z_t) p(z_{t-1} | x_{1:t-1}) This can be implemented by a mapping between the DDC representations: ⟨ψ_i(z_t)⟩ = σ(w ψ(z_{t-1}), x_t), i.e. r_i(t) = σ(w r(t-1), x_t), provided σ is linear in its first argument. So DDC filtering is implemented by a form of RNN.
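A structural sketch of this recurrent update (the bilinear parameterisation, the feature map φ for x_t, and the random placeholder tensor are assumptions of this example; in practice the weights would be learned from simulated trajectories of the state-space model): one filtering step maps (r(t-1), x_t) to r(t), linearly in r(t-1), and unrolling it over the observation sequence gives a simple RNN.

# Structural sketch only: a DDC filtering step that is linear in the previous DDC r(t-1),
# with observation dependence entering through features phi(x_t).
import numpy as np

rng = np.random.default_rng(2)
N, K, Dx = 50, 20, 3                              # DDC size, observation features, obs. dimension
M = rng.normal(size=(N, N, K)) / np.sqrt(N * K)   # placeholder weights: would be learned
Wx = rng.normal(size=(K, Dx))

def phi(x):
    """Illustrative feature map for the observation x_t."""
    return np.tanh(Wx @ x)

def ddc_filter_step(r_prev, x_t):
    """r_t[i] = sum_{j,k} M[i, j, k] * r_prev[j] * phi(x_t)[k]  (linear in r_prev)."""
    return np.einsum("ijk,j,k->i", M, r_prev, phi(x_t))

# Unrolling the step over a sequence of observations gives a simple RNN.
r = np.ones(N) / N
for x_t in rng.normal(size=(10, Dx)):
    r = ddc_filter_step(r, x_t)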

Supervised learning Expectations are easily learned from samples {x^(s), z^(s)} ~ p(x, z): with input x^(s) and target ψ(z^(s)), W* = argmin_W Σ_s ‖f(x^(s); W) - ψ(z^(s))‖², so that f(x) ≈ ⟨ψ(z)⟩_{p(z|x)}.
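A minimal sketch of this supervised step (the toy joint p(x, z), the feature maps and the least-squares fit are illustrative assumptions): regressing ψ(z^(s)) on features of x^(s) for joint samples makes the fitted function approximate the posterior DDC ⟨ψ(z)⟩_{p(z|x)}.

# Sketch: learn r(x) = E[psi(z) | x] by regression on joint samples (x, z) ~ p(x, z).
import numpy as np

rng = np.random.default_rng(4)
S = 20000
z = rng.normal(size=(S, 1))                          # latent samples z^(s)
x = z + 0.5 * rng.normal(size=(S, 1))                # noisy observations x^(s) ~ p(x | z)

centres = np.linspace(-3.0, 3.0, 30)
psi = lambda z: np.exp(-(z - centres) ** 2)          # DDC functions of z, shape (S, 30)
phi = lambda x: np.concatenate([np.ones_like(x), x, x ** 2], axis=1)   # input features

# Least squares: f(x; W) = phi(x) W approximates E[psi(z) | x].
W, *_ = np.linalg.lstsq(phi(x), psi(z), rcond=None)

# For a new observation x*, f(x*) is (approximately) the posterior DDC of z.
x_star = np.array([[1.0]])
r_posterior = phi(x_star) @ W
print(r_posterior.shape)                             # (1, 30)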

Unsupervised learning The Helmholtz Machine (Dayan et al. 1995): approximate inference by a recognition network. Generative or causal network: a model of the data. Recognition or inference network: reasons about the causes of a datum. Learning. Wake phase: estimate the mean-field representation ẑ = ⟨z⟩_{q(z)} = R(x; ρ); update generative parameters θ. Sleep phase: sample from the generative model; update recognition parameters ρ.

Distributed Distributional Recognition for a Helmholtz Machine [Figure: layered Helmholtz machine with top latent layer z_L1, ..., z_LK_L, first latent layer z_11, z_12, ..., z_1K_1, and observations x_1, x_2, ..., x_D.]

Distributed Distributional Recognition for a Helmholtz Machine [Figure: as above, with the first latent layer also carrying a DDC ψ_1(z_1), ψ_2(z_1), ..., ψ_N(z_1).]

Distributed Distributional Recognition for a Helmholtz Machine [Figure: as above, with every latent layer carrying a DDC: ψ_1(z_1), ..., ψ_N(z_1) and ψ_1(z_L), ..., ψ_N(z_L).]

Wake phase: learning the model. Learning requires expected gradients of the joint likelihood. [Figure: the layered network with a DDC ψ_1(z), ..., ψ_N(z) at each latent layer.] ∇_θ F(z_l, θ) ≈ Σ_i γ_i^l ψ_i(z_l), so ⟨∇_θ F(z_l, θ)⟩_q ≈ Σ_i γ_i^l ⟨ψ_i(z_l)⟩.

Sleep phase: learning to recognise and to learn. Samples in the sleep phase are used to learn the recognition model and the gradients needed for learning. [Figure: samples drawn top-down through the layered network z_L → ... → z_1 → x, with a DDC ψ_1(z), ..., ψ_N(z) at each latent layer.]

Sleep phase: learning to recognise and to learn. Samples in the sleep phase are used to learn the recognition model and the gradients needed for learning. [Figure: samples drawn top-down through the layered network z_L → ... → z_1 → x, with a DDC ψ_1(z), ..., ψ_N(z) at each latent layer.] Train ρ_1: φ(x^(s)) → ψ(z_1^(s)) and ρ_l: ψ(z_{l-1}^(s)) → ψ(z_l^(s)); then in the wake phase: r_l = ⟨ψ(z_l)⟩_x = ρ_l ∘ ρ_{l-1} ∘ ... ∘ ρ_1(φ(x)).

Sleep phase: learning to recognise and to learn. Samples in the sleep phase are used to learn the recognition model and the gradients needed for learning. [Figure: samples drawn top-down through the layered network z_L → ... → z_1 → x, with a DDC ψ_1(z), ..., ψ_N(z) at each latent layer.] Train ρ_1: φ(x^(s)) → ψ(z_1^(s)) and ρ_l: ψ(z_{l-1}^(s)) → ψ(z_l^(s)); then in the wake phase: r_l = ⟨ψ(z_l)⟩_x = ρ_l ∘ ρ_{l-1} ∘ ... ∘ ρ_1(φ(x)). Train α_l: ψ(z_l^(s)) → T_l^(s) = ∇_{θ_l} g(z_{l+1}^(s), θ_l) and β_l: ψ(z_l^(s)) → T_{l-1}^(s) = ∇_θ g(z_l^(s), θ_{l-1}); then in the wake phase: ⟨∇_{θ_l} F⟩_x = α_l r_l + β_{l+1} r_{l+1}.

Results Model: 2 latent-layer deep exponential network with Laplacian conditionals.

Results Model: 2 latent-layer deep exponential network with Laplacian conditionals. [Figure: histogram of log MMD between learned and true models for the VAE and the DDC Helmholtz machine (HM).]

Results Model: 2 latent-layer model of olfaction with Gamma latents. [Figure: marginals over x_1, ..., x_5: DDC HM vs. true model, MMD = 0.002; VAE vs. true model, MMD = 0.02.]

Results 2-layer model on 16×16 patches from natural images (van Hateren, 1998). [Figure: two-latent-layer architecture z^(2)_1, ..., z^(2)_{K2} → z^(1)_1, ..., z^(1)_{K1} → x_1, ..., x_D.]

Model architecture      MMD VAE   MMD HM     p-val (H_0: VAE < HM)
L_1 = 10,  L_2 = 2      0.235     0.0497     < 10^-10
L_1 = 10,  L_2 = 10     0.274     0.0252     < 10^-10
L_1 = 50,  L_2 = 2      0.222     0.000392   < 10^-10
L_1 = 50,  L_2 = 10     0.251     0.000526   < 10^-10
L_1 = 100, L_2 = 2      0.286     0.000595   < 10^-10

Summary Machine learning systems have some way to go before they approach biological performance.

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs.

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs):

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs): carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code)

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs): carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code) are well suited to learning from examples

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs): carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code) are well suited to learning from examples and ease computation of expectations given a sufficiently rich family of encoding functionals

Summary Machine learning systems have some way to go before they approach biological performance. Part of the problem has to do with limited representations of probabilistic beliefs. Distributed distributional representations (DDCs): carry the necessary uncertainty (in expectations, or the mean parameters of an exponential family code) are well suited to learning from examples and ease computation of expectations given a sufficiently rich family of encoding functionals The DDC Helmholtz machine provides a local approach for learning, with a rich posterior (inferential) representation, which outperforms standard machine learning approaches.

Thanks Collaborators: Eszter Vertes, Kevin Li, Peter Dayan. Current Group: Gergo Bohner, Angus Chadwick, Lea Duncker, Kevin Li, Kirsty McNaught, Arne Meyer, Virginia Rutten, Joana Soldado-Magraner, Eszter Vertes.