Lecture 4 Towards Deep Learning

Lecture 4: Towards Deep Learning (January 30, 2015)
Mu Zhu, University of Waterloo

Deep Network

Boltzmann Distribution

probability distribution for a complex system:

    p(x) = \frac{1}{Z} e^{f(x;\theta)}, \quad \text{with } Z = \sum_x e^{f(x;\theta)} \ \left[\text{or } \int e^{f(x;\theta)}\,dx\right]

often f(x;\theta) = -\frac{u(x;\vartheta)}{kT}, where u(x;\vartheta) = energy function, k = Boltzmann constant, T = thermodynamic temperature

e.g., lattice of particles, protein molecule
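
(A minimal Python/NumPy sketch, not from the original slides, tabulating a Boltzmann distribution over a small discrete state space; the energy function u and the value of kT are illustrative choices.)

    import numpy as np

    # toy energy function on states x = 0, 1, ..., 9 (hypothetical choice)
    def u(x):
        return (x - 3.0) ** 2

    kT = 2.0                      # Boltzmann constant times temperature
    x = np.arange(10)
    f = -u(x) / kT                # f(x; theta) = -u(x) / (kT)
    Z = np.exp(f).sum()           # normalizing constant Z = sum_x e^{f(x)}
    p = np.exp(f) / Z             # Boltzmann distribution p(x) = e^{f(x)} / Z
    print(p, p.sum())             # probabilities sum to 1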

Boltzmann Distribution

    \log p(x;\theta) = f(x;\theta) - \log\left[\sum_x e^{f(x;\theta)}\right]

    \frac{d}{d\theta}\log p(x;\theta)
      = \frac{d}{d\theta}f(x;\theta) - \frac{1}{\sum_x e^{f(x;\theta)}} \sum_x e^{f(x;\theta)}\,\frac{d}{d\theta}f(x;\theta)
      = \frac{d}{d\theta}f(x;\theta) - \sum_x \frac{e^{f(x;\theta)}}{Z}\,\frac{d}{d\theta}f(x;\theta)
      = \frac{d}{d\theta}f(x;\theta) - \sum_x p(x)\,\frac{d}{d\theta}f(x;\theta)
      = \frac{d}{d\theta}f(x;\theta) - E\left[\frac{d}{d\theta}f(x;\theta)\right]

Boltzmann Distribution

given x_1, x_2, \ldots, x_n iid from p(x;\theta), the log-likelihood is

    l(\theta) = \frac{1}{n}\sum_{i=1}^n \log p(x_i;\theta)

its first derivative is

    \frac{d}{d\theta} l(\theta)
      = \frac{1}{n}\sum_{i=1}^n \left\{ \frac{d}{d\theta} f(x_i;\theta) - E\left[\frac{d}{d\theta} f(x;\theta)\right] \right\}
      = \hat{E}\left[\frac{d}{d\theta} f(x;\theta)\right] - E\left[\frac{d}{d\theta} f(x;\theta)\right]
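
(A hedged numerical check of this identity, not from the slides: for a toy one-parameter model f(x;\theta) = \theta x on states \{0,\ldots,4\}, the formula \hat{E}[df/d\theta] - E[df/d\theta] is compared with a finite-difference derivative of the log-likelihood; the data values and \theta are arbitrary.)

    import numpy as np

    states = np.arange(5.0)                 # discrete state space
    def f(x, theta): return theta * x       # toy choice: f(x; theta) = theta*x, so df/dtheta = x

    def log_lik(data, theta):
        logZ = np.log(np.exp(f(states, theta)).sum())
        return np.mean(f(data, theta) - logZ)

    theta = 0.3
    data = np.array([0.0, 1.0, 1.0, 3.0])   # hypothetical observed sample
    p = np.exp(f(states, theta)); p /= p.sum()

    grad_formula = data.mean() - (p * states).sum()     # E_hat[df/dtheta] - E[df/dtheta]
    eps = 1e-6
    grad_numeric = (log_lik(data, theta + eps) - log_lik(data, theta - eps)) / (2 * eps)
    print(grad_formula, grad_numeric)       # the two values agree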

Restricted Boltzmann Machine

[figure: bipartite graph with top nodes h_1, h_2, h_3, h_4 connected to bottom nodes v_1, \ldots, v_9]

bottom nodes v = (v_1, v_2, \ldots)^T; top nodes h = (h_1, h_2, \ldots)^T

Restricted Boltzmann Machine

bottom nodes v = (v_1, v_2, \ldots)^T, top nodes h = (h_1, h_2, \ldots)^T

Boltzmann distribution

    p(v,h;\theta) = \frac{1}{Z} e^{f(v,h;\theta)}, \quad \text{with } f(v,h;\theta) = h^T W v + \alpha^T h + \beta^T v = \sum_t h_t w_t^T v + \sum_t \alpha_t h_t + \beta^T v

i.e., \theta = \{W, \alpha, \beta\}
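
(A minimal NumPy sketch of this energy function, not from the slides; the array sizes and random parameter values are illustrative.)

    import numpy as np

    def rbm_energy_f(v, h, W, alpha, beta):
        # f(v, h; theta) = h^T W v + alpha^T h + beta^T v
        return h @ W @ v + alpha @ h + beta @ v

    # toy example: 9 visible and 4 hidden binary units, random parameters
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 9))
    alpha = rng.normal(size=4)
    beta = rng.normal(size=9)
    v = rng.integers(0, 2, size=9).astype(float)
    h = rng.integers(0, 2, size=4).astype(float)
    print(rbm_energy_f(v, h, W, alpha, beta))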

Restricted Boltzmann Machine

if there is just one binary top node h \in \{0,1\}, we get

    p(v, h=1) = \frac{1}{Z} e^{w^T v + \alpha + \beta^T v}, \qquad p(v, h=0) = \frac{1}{Z} e^{\beta^T v}

so

    \frac{p(v, h=1)}{p(v, h=0)} = \frac{P(h=1 \mid v)\, p(v)}{P(h=0 \mid v)\, p(v)}
    \quad\Rightarrow\quad
    \log \frac{P(h=1 \mid v)}{P(h=0 \mid v)} = w^T v + \alpha

hence, the model for h \mid v is the usual logistic regression
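
(The claim can be checked numerically; this hedged sketch, not from the slides, computes P(h=1 \mid v) directly from the unnormalized joint for a single hidden node and compares it with \sigma(w^T v + \alpha). All sizes and parameter values are arbitrary.)

    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.normal(size=6)                 # weights for the single hidden node
    alpha, beta = 0.5, rng.normal(size=6)  # biases
    v = rng.integers(0, 2, size=6).astype(float)

    # unnormalized joint p(v, h) proportional to exp(h * w^T v + alpha * h + beta^T v)
    e1 = np.exp(w @ v + alpha + beta @ v)  # h = 1
    e0 = np.exp(beta @ v)                  # h = 0
    p_h1_given_v = e1 / (e1 + e0)          # Z cancels in the ratio

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    print(p_h1_given_v, sigmoid(w @ v + alpha))   # the two numbers agree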

Restricted Boltzmann Machine

if there is more than one binary top node h_1, h_2, \ldots \in \{0,1\}, we get

    f(v,h;\theta) = h_t w_t^T v + \alpha_t h_t + \sum_{s \neq t} h_s w_s^T v + \sum_{s \neq t} \alpha_s h_s + \beta^T v

so

    \log \frac{P(h_t = 1 \mid v, h_{-t})}{P(h_t = 0 \mid v, h_{-t})} = w_t^T v + \alpha_t

in general, the model for h_t \mid v, h_{-t} is the usual logistic regression

notice the conditional independence between h_t and h_{-t} given v

Fitting RBM

given (v_i, h_i), i = 1, 2, \ldots, n, MLE by gradient ascent,

    W_{\text{new}} = W_{\text{old}} + \varepsilon \left[\frac{dl}{dW}\right]_{\theta_{\text{old}}}

and likewise for \alpha, \beta

gradients: with f(v,h;\theta) = h^T W v + \alpha^T h + \beta^T v,

    \frac{dl}{dW} = \hat{E}[h v^T] - E[h v^T], \qquad
    \frac{dl}{d\alpha} = \hat{E}[h] - E[h], \qquad
    \frac{dl}{d\beta} = \hat{E}[v] - E[v]

Fitting RBM

given \{(v_i, h_i)\}_{i=1}^n, \hat{E}[h v^T], \hat{E}[h], \hat{E}[v] are easy to compute: just take empirical averages [definition of \hat{E}(\cdot)]

but E[h v^T], E[h], E[v] are hard to compute: estimate them by drawing an MCMC sample from p(v,h); in particular, do Gibbs sampling for just a few iterations

Gibbs Sampling
  Repeat:
    given v, sample h from the conditional distribution of h \mid v;
    given h, sample v from the conditional distribution of v \mid h;
  until convergence ("burn-in").

Gibbs Sampling

recall: the model for h_t \mid v, h_{-t} is the usual logistic regression, so

    h_t \mid v, h_{-t} \sim \text{Bernoulli}\left[\sigma\left(\alpha_t + w_t^T v\right)\right]

if v_1, v_2, \ldots are also binary, then symmetry gives

    v_b \mid h, v_{-b} \sim \text{Bernoulli}\left[\sigma\left(\beta_b + h^T w_b\right)\right]

Gibbs sampling amounts to back-and-forth coin flips

note: \sigma(\cdot) above denotes the sigmoid function
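
(A hedged sketch of one such back-and-forth sweep for a binary RBM, not from the slides; the parameter shapes follow the earlier slides, while the sizes, initial values, and number of sweeps are arbitrary. The rows of W play the role of w_t and its columns the role of w_b.)

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    def gibbs_sweep(v, W, alpha, beta, rng):
        # h_t | v ~ Bernoulli(sigma(alpha_t + w_t^T v)), all t at once
        h = (rng.random(alpha.shape) < sigmoid(alpha + W @ v)).astype(float)
        # v_b | h ~ Bernoulli(sigma(beta_b + h^T w_b)), all b at once
        v_new = (rng.random(beta.shape) < sigmoid(beta + W.T @ h)).astype(float)
        return v_new, h

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(4, 9)); alpha = np.zeros(4); beta = np.zeros(9)
    v = rng.integers(0, 2, size=9).astype(float)
    for _ in range(5):                      # a few Gibbs iterations ("burn-in")
        v, h = gibbs_sweep(v, W, alpha, beta, rng)
    print(v, h)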

Towards Deep Learning

stack many RBMs on top of each other: the top nodes of layer l become the bottom nodes of layer l+1, i.e., v^{(l+1)} = h^{(l)}

fit the whole thing layer by layer

note the intermediate layers are latent (hidden, not visible), so for things like \hat{E}(h), we can use \hat{E}(\bar{h}) where \bar{h}_i \equiv E(h \mid v_i)

kind of an EM procedure

Towards Deep Learning

repeat with current parameters \theta = \{\alpha, \beta, W\}:
  (a) estimate \bar{h}_i \equiv E(h \mid v_i)  [for \hat{E}(\cdot)]
  (b) draw h_i \sim p(h \mid v_i) and v_i \sim p(v \mid h_i) [or repeat]  [for E(\cdot)]
  move along the gradient (a crude estimate of it):
      \theta \leftarrow \theta + \varepsilon \left[\hat{E}(\cdot) - E(\cdot)\right]
until some stopping criterion; then proceed to the next layer
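
(Putting the pieces together, a hedged NumPy sketch of the layer-by-layer recipe above, using a single Gibbs reconstruction per update, i.e. a CD-1-style shortcut in which the model-side hidden term uses E[h | v'] — a common variant, not necessarily the one intended here. Function names, layer sizes, iteration counts, and the learning rate are all illustrative, not from the slides.)

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    rng = np.random.default_rng(0)

    def train_rbm(V, n_hidden, n_iter=200, eps=0.1):
        """Fit one binary RBM to data matrix V (n samples x n visible) with CD-1-style updates."""
        n, d = V.shape
        W = rng.normal(scale=0.01, size=(n_hidden, d))
        alpha, beta = np.zeros(n_hidden), np.zeros(d)
        for _ in range(n_iter):
            # (a) "data" term: h_bar_i = E[h | v_i] = sigma(alpha + W v_i)
            H_hat = sigmoid(V @ W.T + alpha)                          # n x n_hidden
            # (b) "model" term: one Gibbs sweep v_i -> h_i -> v_i'
            H = (rng.random(H_hat.shape) < H_hat).astype(float)
            V_model = (rng.random(V.shape) < sigmoid(H @ W + beta)).astype(float)
            H_model = sigmoid(V_model @ W.T + alpha)
            # move along the (crude) gradient estimate E_hat[.] - E[.]
            W += eps * (H_hat.T @ V - H_model.T @ V_model) / n
            alpha += eps * (H_hat.mean(0) - H_model.mean(0))
            beta += eps * (V.mean(0) - V_model.mean(0))
        return W, alpha, beta

    def train_stack(V, layer_sizes):
        """Greedy layer-wise training: top nodes of layer l become bottom nodes of layer l+1."""
        params = []
        for n_hidden in layer_sizes:
            W, alpha, beta = train_rbm(V, n_hidden)
            params.append((W, alpha, beta))
            V = sigmoid(V @ W.T + alpha)      # v^(l+1) taken as E[h^(l) | v^(l)]
        return params

    V = rng.integers(0, 2, size=(100, 9)).astype(float)   # toy binary data
    params = train_stack(V, layer_sizes=[6, 4])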

On v_1, v_2, \ldots Being Binary

not as restrictive as it appears: can already handle image data and, to some extent, text data [examples on the next two slides]

can generalize to other inputs: make some changes to the model f(v,h); then p(v \mid h) is no longer a logistic model [obviously]; ideally, we want p(v \mid h) to be nice to sample from

Example: Image Data

each of v_1, v_2, \ldots is a binary pixel

Example: Text Data

[figure: visible units grouped by position (Word 1, Word 2, Word 3), one indicator unit per vocabulary word: "hate", "I", "love", "math", "you"]

each of v_1, v_2, \ldots is an indicator for a particular word
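
(A tiny illustration, not from the slides, of one way to turn words into indicator visible units; the vocabulary, the sentence, and the position-wise encoding are hypothetical.)

    import numpy as np

    vocab = ["I", "you", "love", "hate", "math"]      # hypothetical vocabulary
    def to_indicators(words):
        # one binary visible unit per (position, word) pair; v = 1 iff that word occurs there
        v = np.zeros((len(words), len(vocab)))
        for pos, w in enumerate(words):
            v[pos, vocab.index(w)] = 1.0
        return v.ravel()

    print(to_indicators(["I", "love", "math"]))       # 15 binary visible units, three of them 1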

The Gaussian-Bernoulli RBM

v_1, v_2, \ldots \in \{0,1\}:

    f(v,h;\theta) = h^T W v + \alpha^T h + \beta^T v = \sum_b \left[ h^T w_b \right] v_b + \alpha^T h + \sum_b \beta_b v_b

v_1, v_2, \ldots \in \mathbb{R}:

    f(v,h;\theta) = \sum_b \left[ h^T w_b \right] \frac{v_b}{\tau_b} + \alpha^T h - \sum_b \frac{(v_b - \beta_b)^2}{2\tau_b^2}

Exercise. Let \tilde{v} = (v_1/\tau_1, v_2/\tau_2, \ldots)^T. Show that
  (a) p(h_t \mid v, h_{-t}) is Bernoulli[\sigma(\alpha_t + w_t^T \tilde{v})]; (easy)
  (b) p(v_b \mid h, v_{-b}) is N(\beta_b + \tau_b (h^T w_b), \tau_b^2). (slightly harder)
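
(A hedged sketch of a Gibbs sweep that simply uses the two conditionals stated in the exercise, rather than deriving them; all shapes, parameter values, and \tau_b = 1 are illustrative choices, not from the slides.)

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    rng = np.random.default_rng(0)

    W = rng.normal(scale=0.1, size=(4, 9))      # w_t = t-th row, w_b = b-th column
    alpha, beta = np.zeros(4), np.zeros(9)
    tau = np.ones(9)                            # per-visible-unit scale tau_b

    v = rng.normal(size=9)                      # real-valued visible units
    for _ in range(5):
        v_tilde = v / tau                                                   # (v_1/tau_1, v_2/tau_2, ...)
        h = (rng.random(4) < sigmoid(alpha + W @ v_tilde)).astype(float)    # (a) Bernoulli draw
        v = beta + tau * (W.T @ h) + tau * rng.normal(size=9)               # (b) N(beta_b + tau_b h^T w_b, tau_b^2)
    print(v, h)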

Iterative Optimization

consider an iterative rule, x_{t+1} = m(x_t), for minimizing f(x)

Newton's method:   m(x_t) = x_t - \frac{f'(x_t)}{f''(x_t)}
gradient descent:  m(x_t) = x_t - f'(x_t)

multivariate case: same idea, with gradient and Hessian, i.e.,

    x_{t+1} = x_t - \varepsilon H_t^{-1} g_t, \qquad x_{t+1} = x_t - \varepsilon g_t

(will explain the extra \varepsilon later)
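
(A hedged 1-D comparison of the two rules, not from the slides; the test function f(x) = e^x - 2x, minimized at x^* = \ln 2, and the step size \varepsilon = 0.2 are arbitrary illustrative choices.)

    import math

    f1 = lambda x: math.exp(x) - 2.0        # f'(x) for f(x) = e^x - 2x
    f2 = lambda x: math.exp(x)              # f''(x)
    x_star = math.log(2.0)                  # exact minimizer

    x_gd, x_nt, eps = 0.0, 0.0, 0.2
    for t in range(6):
        x_gd = x_gd - eps * f1(x_gd)        # gradient descent: x - eps * f'(x)
        x_nt = x_nt - f1(x_nt) / f2(x_nt)   # Newton: x - f'(x)/f''(x)
        print(t, abs(x_gd - x_star), abs(x_nt - x_star))
    # Newton's error roughly squares each iteration; gradient descent's error shrinks
    # by a roughly constant factor (about 1 - eps*f''(x*) = 0.6 here)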

Iterative Optimization

suppose the iteration converges to a local minimum x^*: we must have m(x^*) = x^* [i.e., a fixed point]

then, in a neighborhood of x^*,

    e_{t+1} \equiv x_{t+1} - x^* = m(x_t) - m(x^*)
      = \left[ m(x^*) + m'(x^*)(x_t - x^*) + \frac{m''(\xi)}{2}(x_t - x^*)^2 \right] - m(x^*)
      = m'(x^*)\, e_t + \frac{m''(\xi)}{2}\, e_t^2

thus, a basic requirement is |m'(x^*)| < 1 (and nearby x^*)

Gradient Descent

    m(x_t) = x_t - f'(x_t) \quad\Rightarrow\quad m'(x^*) = 1 - f''(x^*)

x^* is a local minimum \Rightarrow f''(x^*) > 0

so we need |1 - f''(x^*)| < 1 (and nearby x^*)

ensure this by letting m(x_t) = x_t - \varepsilon f'(x_t), so that m'(x^*) = 1 - \varepsilon f''(x^*)

Newton's Method

    m(x_t) = x_t - \frac{f'(x_t)}{f''(x_t)}

    m'(x^*) = 1 - \frac{f''(x^*)f''(x^*) - f'(x^*)f'''(x^*)}{[f''(x^*)]^2} = \frac{f'(x^*)f'''(x^*)}{[f''(x^*)]^2} = 0

(since f'(x^*) = 0), so we get e_{t+1} = O(e_t^2): much faster local convergence

but the second derivative H \in \mathbb{R}^{d \times d} is not easy to compute for large d

Quasi-Newton

using local curvature information gives fast local convergence; want to do some of this without computing H_t

key idea: construct a sequence B_t to mimic H_t

example [symmetric rank-1 (SR1)]: construct B_{t+1} so that
  (a) g_{t+1} = g_t + B_{t+1}(x_{t+1} - x_t)   [B_{t+1} acts "like a Hessian"]
  (b) B_{t+1} = B_t + u v^T                    [update "just a little"]
  (c) u \propto v                              [ensures symmetry]

many variations...

Some Details of SR1

    (B_t + u v^T)\underbrace{(x_{t+1} - x_t)}_{\Delta x_t} = \underbrace{g_{t+1} - g_t}_{\Delta g_t}

    u \underbrace{\left[ v^T (\Delta x_t) \right]}_{\text{scalar}} = (\Delta g_t) - B_t (\Delta x_t)
    \quad\Rightarrow\quad
    u = \frac{(\Delta g_t) - B_t (\Delta x_t)}{v^T (\Delta x_t)}

taking u = v = \gamma\left[(\Delta g_t) - B_t(\Delta x_t)\right] leads to

    B_{t+1} = B_t + \frac{\left[(\Delta g_t) - B_t(\Delta x_t)\right]\left[(\Delta g_t) - B_t(\Delta x_t)\right]^T}{\left[(\Delta g_t) - B_t(\Delta x_t)\right]^T (\Delta x_t)}
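
(A hedged NumPy sketch of this SR1 recursion combined with the quasi-Newton step x_{t+1} = x_t - \varepsilon B_t^{-1} g_t from the earlier slide; the quadratic test problem, \varepsilon, the iteration count, and the small-denominator safeguard are illustrative choices, not from the slides.)

    import numpy as np

    # illustrative quadratic test problem: f(x) = 0.5 x^T A x - b^T x, minimized at A^{-1} b
    A = np.array([[3.0, 0.5], [0.5, 1.0]])
    b = np.array([1.0, 0.0])
    grad = lambda x: A @ x - b

    x = np.zeros(2)
    B = np.eye(2)                       # initial Hessian surrogate B_0
    g = grad(x)
    eps = 0.5
    for t in range(20):
        x_new = x - eps * np.linalg.solve(B, g)      # quasi-Newton step with B_t in place of H_t
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g
        r = dg - B @ dx                              # residual (Delta g - B_t Delta x)
        denom = r @ dx
        if abs(denom) > 1e-12:                       # safeguard: skip update if denominator vanishes
            B = B + np.outer(r, r) / denom           # SR1: B_{t+1} = B_t + r r^T / (r^T Delta x)
        x, g = x_new, g_new
    print(x, np.linalg.solve(A, b))                  # final iterate vs exact minimizer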

Summary

key ideas:
  RBMs (a building block for deep learning)
  gradient descent; Gibbs sampling
  local convergence behavior (gradient descent vs. Newton)

specific techniques:
  quasi-Newton (SR1)
  Gaussian-Bernoulli RBM

Next...

a short, 10-minute break

a lecture by Dr. R. Grosse on some current work about RBMs