Lecture 3a: The Origin of Variational Bayes

CSC2535 (2013) Advanced Machine Learning. Lecture 3a: The Origin of Variational Bayes. Geoffrey Hinton

The origin of variational Bayes. In variational Bayes, we approximate the true posterior over the parameters by a much simpler, factorial distribution. Since we are being Bayesian, we need a prior before we can form this posterior. When we use standard L2 weight decay we are implicitly assuming a Gaussian prior with zero mean. Could we have a more interesting prior?

Types of weight penalty. Sometimes it works better to use a weight penalty that has negligible effect on large weights. We can easily make up a heuristic cost function with this property, for example one of the form $C(w) = \frac{\lambda\, w^2}{1 + k\, w^2}$, which grows quadratically near zero but saturates for large weights. But we get more insight if we view it as the negative log probability of the weight under a mixture of two zero-mean Gaussians (one narrow, one broad).
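
As a rough illustration (not from the lecture), the sketch below computes the penalty directly as the negative log probability of a weight under a mixture of two zero-mean Gaussians; the mixing proportion and the two standard deviations are made-up values chosen only to show the shape.

```python
import numpy as np

def gauss0(w, sigma):
    # zero-mean Gaussian density
    return np.exp(-0.5 * (w / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def mixture_penalty(w, pi_narrow=0.7, sigma_narrow=0.1, sigma_broad=2.0):
    # negative log p(w) under pi * N(0, s1^2) + (1 - pi) * N(0, s2^2)
    p = pi_narrow * gauss0(w, sigma_narrow) + (1.0 - pi_narrow) * gauss0(w, sigma_broad)
    return -np.log(p)

w = np.linspace(-5.0, 5.0, 11)
print(mixture_penalty(w))  # roughly quadratic near zero, nearly flat for large |w|
```

Near zero the narrow Gaussian dominates and the penalty behaves like ordinary weight decay; far from zero the broad Gaussian takes over, so the penalty barely grows with the weight.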

Soft weight-sharing. Le Cun showed that networks generalize better if we constrain subsets of the weights to be equal. This removes degrees of freedom from the parameters, so it simplifies the model. But for most tasks we do not know in advance which weights should be the same. Maybe we can learn which weights should be the same.

Modeling the distribution of the weights. The values of the weights form a distribution in a one-dimensional space. If the weights are tightly clustered, they have high probability density under a mixture-of-Gaussians model. To raise the probability density, move each weight towards its nearest cluster center. [Figure: the density p(w) over the one-dimensional weight space W.]

Fitting the weights and the mixture prior together. We can alternate between two types of update: Adjust the weights to reduce the error in the output and to increase the probability density of the weights under the mixture prior. Adjust the means, variances and mixing proportions in the mixture prior to fit the posterior distribution of the weights better. This is called empirical Bayes. This automatically clusters the weights. We do not need to specify in advance which weights should belong to the same cluster.
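
A minimal sketch of this alternation, assuming plain gradient steps on the weights and EM-style closed-form refits of the prior; the learning rate, the number of mixture components, and the omission of the output-error gradient are all my simplifications, not details from the lecture.

```python
import numpy as np

def gauss(w, mu, sigma):
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def responsibilities(w, pi, mu, sigma):
    # r[i, j] = posterior probability of Gaussian j given weight i
    p = pi * gauss(w[:, None], mu[None, :], sigma[None, :])
    return p / p.sum(axis=1, keepdims=True)

def refit_prior(w, r):
    # empirical Bayes step: refit means, variances, mixing proportions to the weights
    nj = r.sum(axis=0) + 1e-12
    pi = nj / len(w)
    mu = (r * w[:, None]).sum(axis=0) / nj
    var = (r * (w[:, None] - mu[None, :]) ** 2).sum(axis=0) / nj
    return pi, mu, np.sqrt(var) + 1e-6

def prior_pull(w, pi, mu, sigma):
    # gradient of -log sum_j pi_j N(w | mu_j, sigma_j): pulls each weight toward the cluster centers
    r = responsibilities(w, pi, mu, sigma)
    return (r * (w[:, None] - mu[None, :]) / sigma[None, :] ** 2).sum(axis=1)

# toy run on the weights alone; in practice the error gradient from the network
# would be added to prior_pull in the weight update
w = np.random.randn(50)
pi, mu, sigma = np.ones(3) / 3.0, np.array([-1.0, 0.0, 1.0]), np.ones(3)
for _ in range(200):
    w = w - 0.01 * prior_pull(w, pi, mu, sigma)                          # adjust the weights
    pi, mu, sigma = refit_prior(w, responsibilities(w, pi, mu, sigma))   # adjust the prior
```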

A different optimization method. Alternatively, we can just apply conjugate gradient descent to all of the parameters in parallel (which is what we did). To keep the variances positive, use the log variances in the optimization (these are the natural parameters for a scale variable). To ensure that the mixing proportions of the Gaussians sum to 1, use the parameters of a softmax in the optimization: $\pi_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$.
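
For concreteness, a small sketch of that parameterization; the packing of the parameter vector and the variable names are my own, not the lecture's.

```python
import numpy as np

def unpack_mixture_params(params, k):
    # params holds [means, log-variances, softmax logits] for k Gaussians
    mu = params[:k]
    sigma = np.sqrt(np.exp(params[k:2 * k]))   # log-variance -> positive standard deviation
    logits = params[2 * k:3 * k]
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                             # softmax -> mixing proportions sum to 1
    return mu, sigma, pi

mu, sigma, pi = unpack_mixture_params(np.zeros(9), k=3)
print(sigma, pi)   # sigma is positive everywhere, pi sums to 1
```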

The cost function and its derivatives. The cost is the negative log probability of the desired outputs under a Gaussian whose mean is the output of the net, plus the negative log probability of the weights under the mixture prior:

$C = k \sum_c \frac{(t_c - y_c)^2}{2\sigma_{out}^2} \;-\; \sum_i \log \sum_j \pi_j\, p(w_i \mid \mu_j, \sigma_j)$

where $p(w_i \mid \mu_j, \sigma_j)$ is the probability of weight $i$ under Gaussian $j$. The derivative of the prior term with respect to a weight is a pull towards each cluster mean, weighted by the posterior probability of Gaussian $j$ given weight $i$:

$r_{ij} = \frac{\pi_j\, p(w_i \mid \mu_j, \sigma_j)}{\sum_{j'} \pi_{j'}\, p(w_i \mid \mu_{j'}, \sigma_{j'})}, \qquad \frac{\partial C}{\partial w_i} = \frac{k}{\sigma_{out}^2}\sum_c (y_c - t_c)\,\frac{\partial y_c}{\partial w_i} \;+\; \sum_j r_{ij}\,\frac{w_i - \mu_j}{\sigma_j^2}$
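
As a sanity check on the prior term's derivative (a sketch only; the mixture parameters are arbitrary and the data term is left out), the responsibility-weighted pull towards the cluster means can be compared against a finite-difference derivative of the negative log prior:

```python
import numpy as np

def gauss(w, mu, sigma):
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def neg_log_prior(w, pi, mu, sigma):
    return -np.log(np.sum(pi * gauss(w, mu, sigma)))

def analytic_grad(w, pi, mu, sigma):
    p = pi * gauss(w, mu, sigma)
    r = p / p.sum()                               # posterior of each Gaussian given w
    return np.sum(r * (w - mu) / sigma ** 2)      # responsibility-weighted pull

pi = np.array([0.6, 0.3, 0.1])
mu = np.array([0.0, 1.0, -2.0])
sigma = np.array([0.1, 0.5, 1.0])
w, eps = 0.7, 1e-6
numeric = (neg_log_prior(w + eps, pi, mu, sigma) - neg_log_prior(w - eps, pi, mu, sigma)) / (2 * eps)
print(analytic_grad(w, pi, mu, sigma), numeric)   # should agree closely
```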

The sunspot prediction problem. Predicting the number of sunspots next year is important because they affect weather and communications. The whole time series has less than 400 points and there is no obvious way to get any more data, so it is worth using computationally expensive methods to get good predictions. The best model produced by statisticians was a combination of two linear autoregressive models that switched at a particular threshold value. Heavy-tailed weight decay works better. Soft weight-sharing using a mixture of Gaussians prior works even better.

The weights learned by the eight hidden units for predicting the number of sunspots. [Figure: the incoming weight patterns of the eight hidden units.] Rule 1 (uses 2 units): high if high last year. Rule 2: high if high 6, 9, or 11 years ago. Rule 3: low if low 1 or 8 years ago and high 2 or 3 years ago. Rule 4: low if high 9 years ago and low 1 or 3 years ago.

The Toronto distribution. The mixture of five Gaussians learned for clustering the weights. [Figure: the learned mixture density over weight values, labelled "the SkyDome".] Weights near zero are very cheap because they have high density under the empirical prior.

Predicting sunspot numbers far into the future by iterating the single-year predictions. The net with soft weight-sharing gives the lowest errors.
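
A sketch of what iterating the single-year predictions means in code: each one-year-ahead prediction is fed back in as an input for predicting the following year. The window length and the dummy model below are placeholders, not the network from the lecture.

```python
import numpy as np

def iterate_forecast(model, history, n_steps, window=12):
    """model maps a length-`window` input vector to a one-step-ahead prediction."""
    series = list(history)
    for _ in range(n_steps):
        x = np.array(series[-window:])
        series.append(float(model(x)))   # feed the prediction back in as data
    return series[len(history):]

# usage with a dummy "model" that just averages the window
preds = iterate_forecast(lambda x: x.mean(), history=np.arange(30.0), n_steps=5)
print(preds)
```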

The problem with soft weight-sharing. It constructs a sensible empirical prior for the weights. But it ignores the fact that some weights need to be coded accurately and others can be very imprecise without having much effect on the squared error. A coding framework needs to model the number of bits required to code the value of a weight, and this depends on the precision as well as the value.

Using the variational approach to make Bayesian learning efficient. Consider a standard backpropagation network with one hidden layer and the squared error function. The full Bayesian approach to learning is: Start with a prior distribution across all possible weight vectors. Multiply the prior for each weight vector by the probability of the observed outputs given that weight vector, and then renormalize to get the posterior distribution. Use this posterior distribution over all possible weight vectors for making predictions. This is not feasible for large nets. Can we use a tractable approximation to the posterior?
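
To make the recipe concrete, here is a toy version for a model with a single weight, so the "distribution over all weight vectors" can be held on a grid; the data, prior width, and noise level are all invented for illustration.

```python
import numpy as np

x, t = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])   # toy data for a model y = w * x
w_grid = np.linspace(-3.0, 3.0, 601)

prior = np.exp(-0.5 * w_grid ** 2)                             # zero-mean Gaussian prior, sigma = 1
lik = np.exp(-0.5 * ((t[None, :] - w_grid[:, None] * x[None, :]) ** 2).sum(axis=1))
posterior = prior * lik
posterior /= posterior.sum()                                   # renormalize on the grid

# predictions average over the whole posterior, not just the single best w
x_new = 4.0
pred_mean = np.sum(posterior * w_grid * x_new)
print(pred_mean)
```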

An independence assumption. We can approximate the posterior distribution by assuming that it is an axis-aligned Gaussian in weight space, i.e. we give each weight its own posterior variance. Weights that are not very important for minimizing the squared error will have big variances. This can be interpreted nicely in terms of minimum description length: weights with high posterior variances can be communicated in very few bits. This is because we can use lots of entropy to pick a precise value from the posterior, so we get lots of bits back.

Communicating a noisy weight. First pick a precise value for the weight from its posterior. We will get back a number of bits equal to the entropy of the weight. (We could imagine quantizing with a very small quantization width to eliminate the infinities.) Then code the precise value under the Gaussian prior. This costs a number of bits equal to the cross-entropy between the posterior and the prior. The net cost is the expected number of bits to send the weight minus the expected number of bits we get back:

$\left(-\int Q(w)\log P(w)\,dw\right) \;-\; \left(-\int Q(w)\log Q(w)\,dw\right)$

The cost of communicating a noisy weight. If the sender and receiver agree on a prior distribution, P, for the weights, the cost of communicating a weight with posterior distribution Q is:

$KL(Q\,\|\,P) = \int Q(w)\,\log\frac{Q(w)}{P(w)}\,dw$

If the distributions are both Gaussian this cost becomes:

$KL(Q\,\|\,P) = \log\frac{\sigma_P}{\sigma_Q} + \frac{1}{2\sigma_P^2}\left[\sigma_Q^2 - \sigma_P^2 + (\mu_P - \mu_Q)^2\right]$
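
A small check of this closed-form cost (in nats rather than bits), compared with a Monte Carlo estimate of the expectation of log Q minus log P under Q; the particular means and standard deviations are arbitrary.

```python
import numpy as np

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 - sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_p ** 2))

mu_q, sigma_q, mu_p, sigma_p = 0.3, 0.5, 0.0, 1.0
w = np.random.normal(mu_q, sigma_q, size=1_000_000)
log_q = -0.5 * ((w - mu_q) / sigma_q) ** 2 - np.log(sigma_q * np.sqrt(2.0 * np.pi))
log_p = -0.5 * ((w - mu_p) / sigma_p) ** 2 - np.log(sigma_p * np.sqrt(2.0 * np.pi))
print(kl_gaussian(mu_q, sigma_q, mu_p, sigma_p), np.mean(log_q - log_p))  # should agree
```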

What do noisy weights do to the expected squared error? Consider a linear neuron with a single input x, and let the weight on this input be stochastic with mean $\mu_w$ and variance $\sigma_w^2$. The noise variance of the weight gets multiplied by the squared input value and added to the squared error:

$y = wx$ (stochastic output of the neuron), $\qquad \langle y \rangle = \mu_w x$

$\langle \text{Error} \rangle = \langle (t - wx)^2 \rangle = (t - \mu_w x)^2 + x^2 \sigma_w^2$

The second term is the extra squared error caused by the noisy weight.
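
A quick Monte Carlo confirmation of this identity for one linear neuron with a single input; all the numbers are arbitrary.

```python
import numpy as np

x, t = 2.0, 1.5
mu_w, sigma_w = 0.4, 0.3
w = np.random.normal(mu_w, sigma_w, size=1_000_000)

empirical = np.mean((t - w * x) ** 2)
predicted = (t - mu_w * x) ** 2 + (x ** 2) * (sigma_w ** 2)
print(empirical, predicted)   # the extra term x^2 * sigma_w^2 accounts for the gap
```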

How to deal with the non-linearity in the hidden units. The noise on the incoming connections to a hidden unit is independent, so its variances add (if the incoming connections contribute noise variances $\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2$, the Gaussian total-input noise variance is their sum). This Gaussian input noise to a hidden unit turns into non-Gaussian output noise, but we can use a big table to find the mean and variance of this non-Gaussian noise. The non-Gaussian noise coming out of each hidden unit is independent, so we can just add up the variances coming into an output unit.

The mean and variance of the output of a logistic hidden unit. [Figure: Gaussian noise in on the total input produces non-Gaussian noise out of the logistic unit.]

The forward table. The forward table is indexed by the mean $\mu_{in}$ and the variance $\sigma_{in}^2$ of the Gaussian total input to a hidden unit. It returns the mean $\mu_{out}$ and variance $\sigma_{out}^2$ of the non-Gaussian output. This non-Gaussian mean and variance is all we need to compute the expected squared error.
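
One way such a table could be built is by numerically estimating the output mean and variance of the logistic over a grid of input means and variances. The sketch below uses simple Monte Carlo and an arbitrary grid; the lecture does not say how the table was actually computed.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_entry(mu_in, var_in, n_samples=100_000):
    # estimate the mean and variance of logistic(z) for z ~ N(mu_in, var_in)
    z = np.random.normal(mu_in, np.sqrt(var_in), size=n_samples)
    y = logistic(z)
    return y.mean(), y.var()          # mu_out, sigma2_out

mu_grid = np.linspace(-6.0, 6.0, 25)
var_grid = np.linspace(0.01, 4.0, 20)
table = np.array([[forward_entry(m, v) for v in var_grid] for m in mu_grid])
# table[i, j] holds (mu_out, sigma2_out) for the i-th input mean and j-th input variance
```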

The backward table. The backward table is indexed by the mean and variance of the total input. It returns four partial derivatives, which are all we need for backpropagating the derivatives of the squared error to the input→hidden weights:

$\frac{\partial \mu_{out}}{\partial \mu_{in}},\quad \frac{\partial \sigma_{out}}{\partial \mu_{in}},\quad \frac{\partial \mu_{out}}{\partial \sigma_{in}},\quad \frac{\partial \sigma_{out}}{\partial \sigma_{in}}$
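
These four derivatives could likewise be tabulated numerically, for example by finite differences on the forward mapping. The sketch below computes the forward moments with Gauss-Hermite quadrature and differentiates with respect to the input variance rather than the standard deviation; both are my choices, not the lecture's.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_moments(mu_in, var_in, n=40):
    # Gauss-Hermite quadrature for E[logistic(z)] and Var[logistic(z)], z ~ N(mu_in, var_in)
    x, w = np.polynomial.hermite.hermgauss(n)
    z = mu_in + np.sqrt(2.0 * var_in) * x
    y = logistic(z)
    m1 = (w * y).sum() / np.sqrt(np.pi)
    m2 = (w * y ** 2).sum() / np.sqrt(np.pi)
    return m1, m2 - m1 ** 2

def backward_entry(mu_in, var_in, eps=1e-4):
    mu0, var0 = forward_moments(mu_in, var_in)
    mu_dm, var_dm = forward_moments(mu_in + eps, var_in)
    mu_dv, var_dv = forward_moments(mu_in, var_in + eps)
    return ((mu_dm - mu0) / eps,     # d mu_out  / d mu_in
            (var_dm - var0) / eps,   # d var_out / d mu_in
            (mu_dv - mu0) / eps,     # d mu_out  / d var_in
            (var_dv - var0) / eps)   # d var_out / d var_in

print(backward_entry(0.5, 1.0))
```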

Empirical Bayes: fitting the prior. We can now trade off the precision of the weights against the extra squared error caused by noisy weights. Even though the residuals are non-Gaussian, we can choose to code them using a Gaussian. We can also learn the width of the prior Gaussian used for coding the weights. We can even have a mixture of Gaussians prior. This allows the posterior weights to form clusters, which is very good for coding lots of zero weights precisely without using many bits. It also makes large weights cheap if they are the same as other large weights.

Some weights learned by variational Bayes. [Figure: the learned weights; labels include "output weight", "bias", and 18 input units.] It learns a few big positive weights, a few big negative weights, and lots of zeros. It has found four rules that work well. Only 105 training cases were used to train 51 weights.

The learned empirical prior for the weights. The posterior for the weights needs to be Gaussian to make it possible to figure out the extra squared error caused by noisy weights and the cost of coding the noisy weights. The learned prior, however, can be a mixture of Gaussians. This learned prior is a mixture of 5 Gaussians with 14 parameters (5 means + 5 variances + 4 independent mixing proportions, since the proportions must sum to 1).