How to do backpropagation in a brain. Geoffrey Hinton Canadian Institute for Advanced Research & University of Toronto

What is wrong with back-propagation? It requires labeled training data (fixed): almost all real data is unlabeled, and the brain needs to fit about 10^14 connection weights in only about 10^9 seconds, so labels cannot possibly provide enough information. The learning time does not scale well for many hidden layers (fixed). The neurons would need to send two different types of signal: on the forward pass, signal = activity = y; on the backward pass, signal = dE/dy.

Training a deep network. First train a layer of features that receive input directly from the pixels. Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer. Each time we add another layer of features, we improve a bound on how well we are modeling the set of training images.
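
A minimal sketch of this greedy, layer-by-layer recipe in NumPy, under my own assumptions: each layer is a small RBM trained with the one-step contrastive-divergence rule detailed after the CD slide below, biases are omitted, and the names (train_rbm, sigmoid, layer_sizes) and sizes are illustrative rather than taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train one layer of features with CD-1 and return its weights."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(v0 @ W)                                   # hidden probabilities
        h0_sample = (rng.random(h0.shape) < h0).astype(float)  # stochastic states
        v1 = sigmoid(h0_sample @ W.T)                          # reconstruction
        h1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
    return W

# Fake binary "pixel" data just to make the sketch run end to end.
pixels = (rng.random((200, 784)) < 0.1).astype(float)

# Greedy stacking: features, then features of features, and so on.
layer_sizes = [500, 500, 2000]
weights, layer_input = [], pixels
for n_hidden in layer_sizes:
    W = train_rbm(layer_input, n_hidden)
    weights.append(W)
    # Treat the activations of the trained features as if they were pixels.
    layer_input = sigmoid(layer_input @ W)
```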

Discriminative fine-tuning. First train multiple hidden layers greedily. Then connect some output units to the top layer of features and do backpropagation through all of the layers to fine-tune all of the feature detectors. On MNIST, with no prior knowledge or extra data, this works much better than standard backpropagation and better than SVMs.
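
A sketch of that fine-tuning step, again with illustrative names and stand-in data rather than the MNIST setup from the talk: pretrained sigmoid layers get a freshly connected softmax output layer, and one step of ordinary backpropagation nudges every layer.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Stand-ins for greedily pretrained weights and a labeled minibatch.
sizes = [784, 500, 500]
weights = [0.01 * rng.standard_normal((a, b)) for a, b in zip(sizes, sizes[1:])]
x = (rng.random((64, 784)) < 0.1).astype(float)
labels = rng.integers(0, 10, size=64)

# Newly connected output layer: 10 softmax units on top of the features.
W_out = 0.01 * rng.standard_normal((sizes[-1], 10))
lr = 0.1

# Forward pass through the pretrained stack.
activations = [x]
for W in weights:
    activations.append(sigmoid(activations[-1] @ W))
logits = activations[-1] @ W_out
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

# Backward pass: cross-entropy gradient, then backpropagate through all layers.
delta = probs.copy()
delta[np.arange(len(labels)), labels] -= 1.0            # dE/dlogits
W_out_grad = activations[-1].T @ delta / len(x)
delta = (delta @ W_out.T) * activations[-1] * (1 - activations[-1])
W_out -= lr * W_out_grad
for i in reversed(range(len(weights))):
    grad = activations[i].T @ delta / len(x)
    if i > 0:
        delta = (delta @ weights[i].T) * activations[i] * (1 - activations[i])
    weights[i] -= lr * grad                              # fine-tune the features
```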

Using backpropagation for fine-tuning. Greedily learning one layer at a time scales well to really big networks, especially if we have locality in each layer. We do not start backpropagation until we already have sensible weights that do well at the task, so the initial gradients are sensible and backpropagation only needs to perform a local search. Most of the information in the final weights comes from modeling the distribution of input vectors. The precious information in the labels is only used for the final fine-tuning: it slightly modifies the features; it does not need to discover features. So we can do very well when most of the training data is unlabeled.

But how can the brain back-propagate through a stack of RBMs? Many neuroscientists think back-propagation is biologically implausible because they cannot see how neurons could possibly do it. Chomsky used the same logic to infer that syntax is innate! Some very good researchers have postulated inefficient algorithms that use random perturbations. Do you really believe that evolution could not find an efficient way to adapt a feature so that it is more useful to the higher-level features? (Have faith!)

The idea. Learning a stack of simple models, each of which is good at reconstructing its inputs, creates the conditions required to allow neurons to implement backpropagation in an elegant way. The trick is to use temporal derivatives to represent error derivatives. This allows the output of a neuron to represent an error derivative at the same time as it represents the presence or absence of a feature. This is a big assumption about cortical representations. Is there any evidence for it? (Velocity neurons?)

A prerequisite. Consider a binary stochastic output unit, j, with a cross-entropy error function. The derivative of the error with respect to the total input to j is ∂E/∂x_j = t_j − p_j, where t_j is the target value and p_j is the actual output probability. So if we continuously regress the output probability towards the target value, ∂E/∂x_j is represented by the temporal derivative of p_j.
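
For reference, a short derivation of that identity, under my assumption that E here is the log-probability of the target being maximized (if E is instead taken to be the cross-entropy error being minimized, the sign simply flips):

```latex
% Logistic output unit: total input x_j, output p_j = \sigma(x_j);
% E is taken to be the log-probability of the binary target t_j.
\begin{aligned}
p_j &= \sigma(x_j) = \frac{1}{1 + e^{-x_j}}, \qquad
E = t_j \log p_j + (1 - t_j)\log(1 - p_j), \\
\frac{\partial E}{\partial x_j}
  &= \frac{\partial E}{\partial p_j}\,\frac{\partial p_j}{\partial x_j}
   = \left(\frac{t_j}{p_j} - \frac{1 - t_j}{1 - p_j}\right) p_j (1 - p_j)
   = t_j - p_j .
\end{aligned}
% Regressing p_j towards t_j means \dot{p}_j \propto t_j - p_j, so the
% temporal derivative of p_j carries exactly this derivative.
```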

Backpropagation is easy. Perturb the top-level activities towards the correct output: h becomes h + Δh, and the downward pass through the symmetric weights changes the reconstruction from v to v + Δv. Let component j of Δh represent ∂E/∂x_j, where x_j is the total input to neuron j. Then if h would be reconstructed as v, h + Δh will be reconstructed as v + Δv, where Δv is a vector with component i representing ∂E/∂x_i.
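
A compact version of the first-order argument, under my simplifying assumption that the top-down reconstruction is linear in the hidden activities (so a visible unit's total input and output coincide, and the ∂E/∂x_i of the slide is the same as ∂E/∂v_i):

```latex
% Top-down reconstruction assumed linear: v_i = \sum_j w_{ij} h_j.
% Bottom-up total input to hidden unit j:  x_j = \sum_i w_{ij} v_i.
\begin{aligned}
\Delta h_j = \frac{\partial E}{\partial x_j}
\;\Longrightarrow\;
\Delta v_i = \sum_j w_{ij}\,\Delta h_j
           = \sum_j \frac{\partial x_j}{\partial v_i}\,
                    \frac{\partial E}{\partial x_j}
           = \frac{\partial E}{\partial v_i},
\end{aligned}
% i.e. the change in the reconstruction is exactly the derivative that
% backpropagation would deliver to the layer below.
```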

The synaptic update rule. First do a forward pass (as usual). Then perturb the top-level activities so that the change in activity of a unit represents the derivative of the error function w.r.t. the total input to that unit. Then do a downward pass: this will make the activity changes at every level represent error derivatives. Then update each synapse in proportion to: pre-synaptic activity × rate of change of post-synaptic activity.
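
A minimal NumPy sketch of this recipe for a two-layer stack, under my own simplifying assumptions: linear units (so the downward pass carries exact derivatives), a squared-error stand-in for the objective, and a single instantaneous activity change standing in for the rate of change. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((784, 500))   # input  -> hidden synapses
W2 = 0.1 * rng.standard_normal((500, 10))    # hidden -> output synapses
v = rng.random(784)                          # one input vector
target = np.eye(10)[3]                       # desired output

# First do a forward pass (as usual).
h = v @ W1
y = h @ W2

# Perturb the top-level activities: each output unit's activity change is the
# (negative) error derivative, target - y for this squared-error stand-in.
dy = target - y

# Downward pass through the transposed weights: the activity change at the
# hidden level now also represents an error derivative.
dh = W2 @ dy

# Update each synapse in proportion to
# pre-synaptic activity x rate of change of post-synaptic activity.
lr = 0.01
W2 += lr * np.outer(h, dy)
W1 += lr * np.outer(v, dh)
```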

If this is what is happening, what should neuroscientists see? [Figure: STDP curve, weight change as a function of the relative time of the post-synaptic spike.] Spike-time-dependent plasticity is just a derivative filter.

A problem. This way of performing backpropagation requires symmetric weights. But contrastive divergence learning still works if we treat each symmetric connection as two directed connections and randomly remove many of the directed connections.

Functional symmetry. The fluctuations in the hidden units are conditionally independent, but the state of a hidden unit can still be estimated very well from the states of other hidden units that have similar receptive fields. So top-down connections from these other correlated units can learn to mimic the effect of the missing top-down part of a symmetric connection. All we require is functional symmetry on and near the data manifold. Contrastive divergence learning seems to achieve functional symmetry well enough to make backpropagation work.

Contrastive divergence learning: a quick way to learn an RBM. [Figure: visible unit i and hidden unit j at t = 0 (data) and t = 1 (reconstruction), with the pairwise statistics ⟨v_i h_j⟩_0 and ⟨v_i h_j⟩_1.] Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a reconstruction. Update all the hidden units again. Δw_ij = ε ( ⟨v_i h_j⟩_0 − ⟨v_i h_j⟩_1 ). This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function.
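
A small NumPy sketch of this CD-1 recipe for a binary RBM; biases are omitted and the sizes, learning rate, and data are illustrative placeholders rather than values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 784, 500, 0.1
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
data = (rng.random((100, n_visible)) < 0.1).astype(float)  # training vectors

for epoch in range(10):
    # Start with a training vector on the visible units (t = 0).
    v0 = data
    # Update all the hidden units in parallel.
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Update all the visible units in parallel to get a reconstruction (t = 1).
    v1 = sigmoid(h0 @ W.T)
    # Update all the hidden units again.
    h1_prob = sigmoid(v1 @ W)
    # Delta w_ij = eps * ( <v_i h_j>_0 - <v_i h_j>_1 )
    W += lr * (v0.T @ h0_prob - v1.T @ h1_prob) / len(data)
```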

Backpropagation learning of an autoencoder using temporal derivatives. [Figure: an autoencoder with weights W and W^T, whose states v and h are perturbed to v + Δv and h + Δh.] Backprop: vΔh + hΔv. CD: vh − (v + Δv)(h + Δh) = −vΔh − hΔv − ΔvΔh, where the ΔvΔh term is negligible to first order. So, to first order, the contrastive divergence statistics compute the same outer products as backpropagation.

THE END. And if you reserved a place on a bus to Whistler, please make sure you get on the bus you have been assigned to, because the buses are all full.