Lecture 4: Towards Deep Learning (January 30, 2015)
Mu Zhu, University of Waterloo
Fields Institute, Toronto, Canada
Deep Network
Boltzmann Distribution

probability distribution for a complex system

p(x) = (1/Z) e^{f(x;θ)},   with   Z = Σ_x e^{f(x;θ)}   [or ∫ e^{f(x;θ)} dx]

often, f(x;θ) = −u(x;ϑ)/(kT), where
u(x;ϑ) = energy function;  k = Boltzmann constant;  T = thermodynamic temperature

e.g., lattice of particles, protein molecule
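For concreteness, a minimal Python sketch of this distribution on a small discrete state space; the four-state energy function u and the value of θ are made up purely for illustration:

    import numpy as np

    # hypothetical example: x ranges over {0, 1, 2, 3} with a made-up energy u(x)
    u = np.array([0.0, 1.0, 3.0, 0.5])   # energy of each state
    theta = 1.0                          # plays the role of 1/(kT)

    f = -theta * u                       # f(x; theta) = -u(x)/(kT)
    Z = np.exp(f).sum()                  # partition function: sum over all states
    p = np.exp(f) / Z                    # Boltzmann probabilities

    print(p, p.sum())                    # low-energy states get high probability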
Boltzmann Distribution

log p(x;θ) = f(x;θ) − log[ Σ_x e^{f(x;θ)} ]

d/dθ log p(x;θ) = d/dθ f(x;θ) − (1/Z) Σ_x e^{f(x;θ)} · d/dθ f(x;θ)
                = d/dθ f(x;θ) − Σ_x [ e^{f(x;θ)} / Z ] · d/dθ f(x;θ)
                = d/dθ f(x;θ) − Σ_x p(x;θ) · d/dθ f(x;θ)
                = d/dθ f(x;θ) − E[ d/dθ f(x;θ) ]
Boltzmann Distribution

given x_1, x_2, ..., x_n iid ~ p(x;θ), the log-likelihood is

ℓ(θ) = (1/n) Σ_{i=1}^n log p(x_i;θ)

its first derivative is

d/dθ ℓ(θ) = (1/n) Σ_{i=1}^n { d/dθ f(x_i;θ) − E[ d/dθ f(x;θ) ] }
          = Ê[ d/dθ f(x;θ) ] − E[ d/dθ f(x;θ) ]
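A toy numerical check of this formula, assuming a simple one-parameter model f(x;θ) = θx on four states (so df/dθ = x); the data, step size, and generating value are made up:

    import numpy as np

    # toy exponential-family model on states {0,1,2,3}: f(x; theta) = theta * x,
    # so d/dtheta f = x and the log-likelihood gradient is
    # (data average of x) - (model expectation of x)
    states = np.arange(4)

    def model_probs(theta):
        w = np.exp(theta * states)
        return w / w.sum()

    def loglik_grad(theta, data):
        return data.mean() - (model_probs(theta) * states).sum()

    rng = np.random.default_rng(0)
    data = rng.choice(states, size=1000, p=model_probs(0.7))   # pretend observations
    theta = 0.0
    for _ in range(200):                                        # gradient ascent
        theta += 0.1 * loglik_grad(theta, data)
    print(theta)    # should land near the generating value 0.7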
Restricted Boltzmann Machine

[figure: bipartite graph with top nodes h_1, ..., h_4 connected to bottom nodes v_1, ..., v_9]

bottom nodes v = (v_1, v_2, ...)^T; top nodes h = (h_1, h_2, ...)^T
Restricted Boltzmann Machine

bottom nodes v = (v_1, v_2, ...)^T
top nodes h = (h_1, h_2, ...)^T

Boltzmann distribution

p(v, h; θ) = (1/Z) e^{f(v,h;θ)},   with

f(v, h; θ) = h^T W v + α^T h + β^T v = Σ_t h_t w_t^T v + Σ_t α_t h_t + β^T v

i.e., θ = {W, α, β}
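A direct translation of f(v, h; θ) into code, with arbitrary small dimensions and random parameter values just to show the shapes (w_t is the t-th row of W):

    import numpy as np

    # f(v, h; theta) = h^T W v + alpha^T h + beta^T v,  theta = {W, alpha, beta}
    def rbm_f(v, h, W, alpha, beta):
        return h @ W @ v + alpha @ h + beta @ v

    # tiny example with 3 hidden and 4 visible units (numbers are arbitrary)
    rng = np.random.default_rng(0)
    W, alpha, beta = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)
    v = np.array([1, 0, 1, 1]); h = np.array([0, 1, 1])
    print(rbm_f(v, h, W, alpha, beta))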
Restricted Boltzmann Machine

if there is just one binary top node h ∈ {0, 1}, we get

p(v, h=1) = (1/Z) e^{w^T v + α + β^T v},    p(v, h=0) = (1/Z) e^{β^T v}

so

p(v, h=1) / p(v, h=0) = [ P(h=1|v) p(v) ] / [ P(h=0|v) p(v) ]
⇒  log [ P(h=1|v) / P(h=0|v) ] = w^T v + α

hence, the model for h | v is the usual logistic regression
Restricted Boltzmann Machine

if there is more than one binary top node h_1, h_2, ... ∈ {0, 1}, we get

f(v, h; θ) = h_t w_t^T v + α_t h_t + Σ_{s≠t} h_s w_s^T v + Σ_{s≠t} α_s h_s + β^T v

so

log [ P(h_t = 1 | v, h_{−t}) / P(h_t = 0 | v, h_{−t}) ] = w_t^T v + α_t

in general, the model for h_t | v, h_{−t} is the usual logistic regression

notice the conditional independence between h_t and h_{−t} given v
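Because of that conditional independence, all of the conditionals P(h_t = 1 | v) can be evaluated at once with a single matrix-vector product; a small sketch with made-up dimensions and parameter values:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # P(h_t = 1 | v) = sigma(alpha_t + w_t^T v); given v the h_t are independent,
    # so the whole vector of conditional probabilities is one matrix product
    def hidden_conditionals(v, W, alpha):
        return sigmoid(alpha + W @ v)

    rng = np.random.default_rng(0)
    W, alpha = rng.normal(size=(4, 9)), rng.normal(size=4)
    v = rng.integers(0, 2, size=9)
    print(hidden_conditionals(v, W, alpha))   # one probability per hidden node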
Fitting RBM

given (v_i, h_i), i = 1, 2, ..., n, MLE by gradient ascent,

W_new = W_old + ε [ dℓ/dW ]_{θ_old}

and likewise for α, β

gradients: since f(v, h; θ) = h^T W v + α^T h + β^T v,

dℓ/dW = Ê[ h v^T ] − E[ h v^T ]
dℓ/dα = Ê[h] − E[h],    dℓ/dβ = Ê[v] − E[v]
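A sketch of what one such gradient-ascent step could look like, assuming the data-side statistics (called E_hat below) and the model-side statistics (E_model) have already been estimated; the names and the dictionary layout are illustrative, not from the lecture:

    # one gradient-ascent step for the RBM; E_hat and E_model are dicts of
    # numpy arrays holding the two sets of expectations
    def update_params(W, alpha, beta, E_hat, E_model, eps=0.01):
        W     = W     + eps * (E_hat["hv"] - E_model["hv"])   # dl/dW = Ê[h v^T] - E[h v^T]
        alpha = alpha + eps * (E_hat["h"]  - E_model["h"])    # dl/dα = Ê[h] - E[h]
        beta  = beta  + eps * (E_hat["v"]  - E_model["v"])    # dl/dβ = Ê[v] - E[v]
        return W, alpha, beta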
Fitting RBM

given {(v_i, h_i)}_{i=1}^n, Ê[h v^T], Ê[h], Ê[v] are easy to compute:
just take empirical averages [the definition of Ê(·)]

but E[h v^T], E[h], E[v] are hard to compute:
estimate by drawing an MCMC sample from p(v, h);
in particular, do Gibbs sampling for just a few iterations

Gibbs Sampling
Repeat:
  given v, sample h from the conditional distribution of h | v;
  given h, sample v from the conditional distribution of v | h;
until convergence ("burn-in").
Gibbs Sampling

recall: the model for h_t | v, h_{−t} is the usual logistic regression, so

h_t | v, h_{−t} ~ Bernoulli[ σ(α_t + w_t^T v) ]

if v_1, v_2, ... are also binary, then symmetry gives

v_b | h, v_{−b} ~ Bernoulli[ σ(β_b + h^T w_b) ]

Gibbs sampling amounts to back-and-forth coin flips

note: σ(·) above denotes the sigmoid function
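A possible implementation of these back-and-forth coin flips, assuming W has one row per hidden node (w_t) and one column per visible node (w_b); the dimensions in the usage example are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # alternate h | v and v | h using the Bernoulli conditionals
    # h_t ~ Bern[sigma(alpha_t + w_t^T v)],  v_b ~ Bern[sigma(beta_b + h^T w_b)]
    def gibbs_sample(v, W, alpha, beta, n_sweeps=5, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        for _ in range(n_sweeps):
            h = (rng.random(W.shape[0]) < sigmoid(alpha + W @ v)).astype(float)
            v = (rng.random(W.shape[1]) < sigmoid(beta + h @ W)).astype(float)
        return v, h

    rng = np.random.default_rng(0)
    W, alpha, beta = rng.normal(size=(4, 9)), rng.normal(size=4), rng.normal(size=9)
    v0 = rng.integers(0, 2, size=9).astype(float)
    v, h = gibbs_sample(v0, W, alpha, beta, rng=rng)
    print(v, h)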
Towards Deep Learning

stack many RBMs on top of each other

top nodes for layer l become bottom nodes for layer l+1, i.e., v^{(l+1)} = h^{(l)}

fit the whole thing layer by layer

note: intermediate layers are latent (hidden, not visible),
so for things like Ê(h), can use Ê(ĥ), where ĥ_i ≡ E(h | v_i)

kind of an EM procedure
Towards Deep Learning

repeat with current parameters θ = {α, β, W}:
  (a) estimate ĥ_i ≡ E(h | v_i)                                 → gives Ê(·)
  (b) draw h_i ~ p(h | v_i) and v_i ~ p(v | h_i) [or repeat]    → gives E(·)
  move along the gradient (a crude estimate of it):
      θ ← θ + ε [ Ê(·) − E(·) ]
until some stopping criterion

proceed to the next layer
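One way the layer-by-layer scheme might be organized in code; fit_rbm_layer is a hypothetical routine standing in for the repeat-loop above (steps (a), (b) and the gradient move), and E(h | v_i) for binary hidden nodes is taken to be σ(α + W v_i):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # schematic greedy, layer-by-layer fitting: train one RBM, then feed its
    # expected hidden representation E(h | v_i) upward as the next layer's data
    def fit_deep_network(V, layer_sizes, fit_rbm_layer):
        params = []
        for n_hidden in layer_sizes:                       # V has one row per v_i
            W, alpha, beta = fit_rbm_layer(V, n_hidden)    # one RBM, fit until stopping
            params.append((W, alpha, beta))
            V = sigmoid(V @ W.T + alpha)                   # v^(l+1) = h^(l), via E(h | v_i)
        return params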
On v_1, v_2, ... Being Binary

not as restrictive as it appears:
  can already handle image data and text data to some extent [examples on the next two slides]

can generalize to other inputs:
  make some changes to the model f(v, h)
  p(v | h) no longer a logistic model [obviously]
  ideally, want p(v | h) to be nice to sample from
Example: Image Data

each v_1, v_2, ... is a binary pixel
Example: Text Data

[figure: three word slots (Word 1, Word 2, Word 3) with vocabulary items "I", "love", "hate", "you", "math" encoded as indicators]

each v_1, v_2, ... is an indicator for a particular word
The Gaussian-Bernoulli RBM

v_1, v_2, ... ∈ {0, 1}:
  f(v, h; θ) = h^T W v + α^T h + β^T v = Σ_b [ h^T w_b ] v_b + α^T h + Σ_b β_b v_b

v_1, v_2, ... ∈ R:
  f(v, h; θ) = Σ_b [ h^T w_b ] (v_b / τ_b) + α^T h − Σ_b (v_b − β_b)^2 / (2τ_b^2)

Exercise
Let ṽ = (v_1/τ_1, v_2/τ_2, ...)^T. Show that
(a) p(h_t | v, h_{−t}) is Bernoulli[ σ(α_t + w_t^T ṽ) ];   (easy)
(b) p(v_b | h, v_{−b}) is N( β_b + τ_b (h^T w_b), τ_b^2 ).  (slightly harder)
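A sketch of sampling from the two conditionals stated in the exercise (used here without proof); tau is the vector (τ_1, τ_2, ...), and W again has one row per hidden node:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Gaussian-Bernoulli RBM conditionals:
    #   h_t | v ~ Bernoulli[ sigma(alpha_t + w_t^T (v/tau)) ]
    #   v_b | h ~ N( beta_b + tau_b * (h^T w_b), tau_b^2 )
    def sample_h_given_v(v, W, alpha, tau, rng):
        p = sigmoid(alpha + W @ (v / tau))
        return (rng.random(p.shape) < p).astype(float)

    def sample_v_given_h(h, W, beta, tau, rng):
        mean = beta + tau * (h @ W)
        return rng.normal(mean, tau)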
Iterative Optimization

consider an iterative rule, x_{t+1} = m(x_t), for minimizing f(x)

Newton's method:   m(x_t) = x_t − f'(x_t) / f''(x_t)
gradient descent:  m(x_t) = x_t − f'(x_t)

multivariate case ... same idea, with gradient and Hessian, i.e.,

x_{t+1} = x_t − ε H_t^{-1} g_t,    x_{t+1} = x_t − ε g_t;    will explain the extra ε later
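A tiny one-dimensional comparison of the two rules on a made-up function f(x) = x² + eˣ:

    import numpy as np

    f1 = lambda x: 2 * x + np.exp(x)         # f'(x)
    f2 = lambda x: 2 + np.exp(x)             # f''(x)

    x_gd, x_nt, eps = 1.0, 1.0, 0.1
    for _ in range(50):
        x_gd = x_gd - eps * f1(x_gd)         # gradient descent: x - eps * f'(x)
        x_nt = x_nt - f1(x_nt) / f2(x_nt)    # Newton: x - f'(x)/f''(x)
    print(x_gd, x_nt)                        # both approach the same minimizer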
Iterative Optimization

suppose the iteration converges to a local minimum x*
must have m(x*) = x*  [i.e., a fixed point]

then, in a neighborhood near x*,

e_{t+1} ≡ x_{t+1} − x* = m(x_t) − m(x*)
        = [ m(x*) + m'(x*)(x_t − x*) + (m''(ξ)/2)(x_t − x*)^2 ] − m(x*)
        = m'(x*) e_t + (m''(ξ)/2) e_t^2

thus, a basic requirement is |m'(x*)| < 1 (and nearby x*)
Gradient Descent

m(x_t) = x_t − f'(x_t)   ⇒   m'(x*) = 1 − f''(x*)

x* a local minimum ⇒ f''(x*) > 0

so, need |1 − f''(x*)| < 1 (and nearby x*)

ensure this by letting m(x_t) = x_t − ε f'(x_t), so that m'(x*) = 1 − ε f''(x*) ∈ (−1, 1) for small enough ε > 0
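A quick numerical illustration of why the ε matters, on the made-up function f(x) = 5x² (so f''(x*) = 10, and the plain rule has m'(x*) = −9):

    # f(x) = 5*x**2, f'(x) = 10*x; without eps the iteration diverges,
    # with eps = 0.05 it contracts toward the minimizer x* = 0
    f1 = lambda x: 10 * x
    x_plain, x_eps = 0.1, 0.1
    for _ in range(5):
        x_plain = x_plain - f1(x_plain)          # |error| grows by a factor of 9
        x_eps   = x_eps   - 0.05 * f1(x_eps)     # |error| shrinks by a factor of 2
    print(x_plain, x_eps)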
Newton's Method

m(x_t) = x_t − f'(x_t) / f''(x_t)

m'(x*) = 1 − { f''(x*) f''(x*) − f'(x*) f'''(x*) } / [f''(x*)]^2
       = f'(x*) f'''(x*) / [f''(x*)]^2
       = 0        [since f'(x*) = 0]

get e_{t+1} ~ O(e_t^2): much faster local convergence

but the second derivative H (d × d) is not easy to compute for large d
Quasi-Newton

use local curvature information ⇒ fast local convergence
want to do some of this without computing H_t

key idea: construct a sequence B_t to mimic H_t

example [symmetric rank-1 (SR1)]: construct B_{t+1} so that
(a) g_{t+1} = g_t + B_{t+1}(x_{t+1} − x_t)   [B_{t+1} acts like a Hessian]
(b) B_{t+1} = B_t + u v^T                    [update just a little]
(c) u ∝ v                                    [ensures symmetry]

many variations ...
Some Details of SR1

(B_t + u v^T)(x_{t+1} − x_t) = g_{t+1} − g_t,   i.e.,   (B_t + u v^T)(Δx_t) = Δg_t

u [ v^T (Δx_t) ] = (Δg_t) − B_t (Δx_t)          [the bracketed factor is a scalar]

⇒  u = [ (Δg_t) − B_t (Δx_t) ] / [ v^T (Δx_t) ]

taking u = v = γ [ (Δg_t) − B_t (Δx_t) ] leads to

B_{t+1} = B_t + { [ (Δg_t) − B_t (Δx_t) ] [ (Δg_t) − B_t (Δx_t) ]^T } / { [ (Δg_t) − B_t (Δx_t) ]^T (Δx_t) }
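The SR1 update written out in code; the near-zero-denominator safeguard is a common practical addition, not something stated on the slide. The check at the end verifies the secant condition (a) on a made-up quadratic, where Δg = H Δx exactly:

    import numpy as np

    # one SR1 update: B_{t+1} = B_t + r r^T / (r^T dx), with r = dg - B_t dx
    def sr1_update(B, dx, dg):
        r = dg - B @ dx
        denom = r @ dx
        if abs(denom) < 1e-8:          # common safeguard: skip a nearly singular update
            return B
        return B + np.outer(r, r) / denom

    rng = np.random.default_rng(0)
    A = rng.normal(size=(3, 3)); H = A @ A.T + np.eye(3)   # a made-up Hessian
    x0, x1 = rng.normal(size=3), rng.normal(size=3)
    dx, dg = x1 - x0, H @ (x1 - x0)
    B1 = sr1_update(np.eye(3), dx, dg)
    print(np.allclose(B1 @ dx, dg))    # secant condition (a) holds: True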
Summary

key ideas:
  RBMs (building block for deep learning)
  gradient descent; Gibbs sampling
  local convergence behavior (gradient vs Newton)

specific techniques:
  quasi-Newton (SR1)
  Gaussian-Bernoulli RBM
Next ...

a short, 10-minute break
lecture by Dr. R. Grosse on some current work about RBMs