Learning to Disentangle Factors of Variation with Manifold Learning

Scott Reed, Kihyuk Sohn, Yuting Zhang, Honglak Lee
University of Michigan, Department of Electrical Engineering and Computer Science

Presented by: Kyle Ulrich
08 May 2015

Reed et al. (University of Michigan) Disentangling Boltzmann Machines 08 May 2015 1 / 19
Introduction

- It is challenging to separate the factors of variation that combine to generate observations, e.g., for face images: pose, expression, illumination, identity, etc.
- This work treats each factor of variation as forming a sub-manifold; observations arise from the interactions between these sub-manifolds.
- Additional strategies can help disentangle the factors of variation, e.g.:
  - taking label information into account
  - enforcing known similarities/differences on the sub-manifolds
Example

We might want to support queries such as: given a fixed identity, produce the same face image in a different pose.
Review of Restricted Boltzmann Machines

The RBM is a bipartite graphical model with visible units v ∈ {0, 1}^D and hidden units h ∈ {0, 1}^K. The joint distribution is

    P(v, h) = (1/Z) exp(-E(v, h)),                                              (1)
    E(v, h) = -Σ_{i=1..D} Σ_{k=1..K} v_i W_ik h_k - Σ_{k=1..K} b_k h_k - Σ_{i=1..D} c_i v_i

with conditional distributions

    P(v_i = 1 | h) = σ(Σ_k W_ik h_k + c_i)                                      (2)
    P(h_k = 1 | v) = σ(Σ_i W_ik v_i + b_k)

Contrastive divergence (CD) is often used to approximate the gradients for Θ = {W, b, c}.
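As a concrete sketch (not the authors' code), the RBM conditionals and a single CD-1 gradient step can be written in NumPy as below; the sizes D, K, the random weights, and the function names are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions (hypothetical): D visible units, K hidden units.
D, K = 6, 4
W = 0.01 * rng.standard_normal((D, K))  # pairwise weights W_ik
b = np.zeros(K)                          # hidden biases b_k
c = np.zeros(D)                          # visible biases c_i

def p_h_given_v(v):
    # P(h_k = 1 | v) = sigmoid(sum_i W_ik v_i + b_k)
    return sigmoid(v @ W + b)

def p_v_given_h(h):
    # P(v_i = 1 | h) = sigmoid(sum_k W_ik h_k + c_i)
    return sigmoid(h @ W.T + c)

def cd1_gradients(v0):
    """One contrastive-divergence (CD-1) step: positive phase on the
    data, one Gibbs reconstruction, then the negative phase."""
    ph0 = p_h_given_v(v0)                      # positive phase
    h0 = (rng.random(K) < ph0).astype(float)   # sample hidden state
    pv1 = p_v_given_h(h0)                      # reconstruction
    ph1 = p_h_given_v(pv1)                     # negative phase
    dW = np.outer(v0, ph0) - np.outer(pv1, ph1)
    return dW, ph0 - ph1, v0 - pv1             # gradients for W, b, c

v = rng.integers(0, 2, D).astype(float)
dW, db, dc = cd1_gradients(v)
```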
The Disentangling Boltzmann Machine (disBM)

- The disBM models higher-order interactions between observations and multiple groups of hidden units.
- For now, consider two groups of hidden units, h and m.
The Disentangling Boltzmann Machine (disBM)

Consider the disBM with visible units v ∈ {0, 1}^D and two groups of hidden units, h ∈ {0, 1}^K and m ∈ {0, 1}^L. The energy function is

    E(v, m, h) = -Σ_f (Σ_i W^v_if v_i)(Σ_j W^m_jf m_j)(Σ_k W^h_kf h_k) - Σ_ij P^m_ij v_i m_j - Σ_ik P^h_ik v_i h_k    (3)

such that the weight tensor W ∈ R^{D×L×K} has F factors:

    W_ijk = Σ_{f=1..F} W^v_if W^m_jf W^h_kf                                      (4)

Two-way interactions are allowed between the visible units and each hidden group through P^m and P^h.
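The factorization in (4) can be checked numerically. This NumPy sketch (toy sizes and random weights, not from the paper) builds the full tensor W_ijk with einsum and verifies that the factored form of the 3-way energy term matches the tensor form without ever materializing W:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, K, F = 5, 4, 3, 2  # toy sizes (hypothetical)
Wv = rng.standard_normal((D, F))
Wm = rng.standard_normal((L, F))
Wh = rng.standard_normal((K, F))

# Full 3-way tensor: W_ijk = sum_f Wv_if Wm_jf Wh_kf
W = np.einsum('if,jf,kf->ijk', Wv, Wm, Wh)

v = rng.integers(0, 2, D).astype(float)
m = rng.integers(0, 2, L).astype(float)
h = rng.integers(0, 2, K).astype(float)

# Factored 3-way energy: -sum_f (Wv^T v)_f (Wm^T m)_f (Wh^T h)_f
e_factored = -np.sum((Wv.T @ v) * (Wm.T @ m) * (Wh.T @ h))
# Same term computed from the full tensor
e_tensor = -np.einsum('ijk,i,j,k->', W, v, m, h)
assert np.allclose(e_factored, e_tensor)
```

The factored form costs O((D + L + K) F) per evaluation instead of O(D L K), which is the point of the factorization.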
Conditional Independence between Groups

- Hidden units are no longer conditionally independent given the visible units.
- However, each group is conditionally independent given all other groups:

    P(v_i = 1 | h, m) = σ(Σ_jk W_ijk m_j h_k + Σ_j P^m_ij m_j + Σ_k P^h_ik h_k)   (5)
    P(m_j = 1 | v, h) = σ(Σ_ik W_ijk v_i h_k + Σ_i P^m_ij v_i)                    (6)
    P(h_k = 1 | v, m) = σ(Σ_ij W_ijk v_i m_j + Σ_i P^h_ik v_i)                    (7)

- This allows for efficient 3-way block Gibbs sampling.
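A minimal sketch of the 3-way block Gibbs sweep, assuming toy sizes and random parameters (the full tensor W is materialized here only for clarity; in practice the factored form would be used):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, L, K = 5, 4, 3                           # toy sizes (hypothetical)
W = 0.1 * rng.standard_normal((D, L, K))    # 3-way weights W_ijk
Pm = 0.1 * rng.standard_normal((D, L))      # pairwise v-m weights
Ph = 0.1 * rng.standard_normal((D, K))      # pairwise v-h weights

def gibbs_step(v, m, h):
    """One sweep of 3-way block Gibbs: each group is sampled jointly
    from its conditional given the other two groups."""
    pv = sigmoid(np.einsum('ijk,j,k->i', W, m, h) + Pm @ m + Ph @ h)
    v = (rng.random(D) < pv).astype(float)
    pm = sigmoid(np.einsum('ijk,i,k->j', W, v, h) + Pm.T @ v)
    m = (rng.random(L) < pm).astype(float)
    ph = sigmoid(np.einsum('ijk,i,j->k', W, v, m) + Ph.T @ v)
    h = (rng.random(K) < ph).astype(float)
    return v, m, h

v = rng.integers(0, 2, D).astype(float)
m = rng.integers(0, 2, L).astype(float)
h = rng.integers(0, 2, K).astype(float)
for _ in range(10):
    v, m, h = gibbs_step(v, m, h)
```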
Inference

Variational inference is used to approximate the true posterior with the factorized distribution

    Q(m, h) = Π_j Q(m_j) Π_k Q(h_k)

Minimizing KL(Q(m, h) || P(m, h | v)) yields the fixed-point equations

    ĥ_k = σ(Σ_ij W_ijk v_i m̂_j + Σ_i P^h_ik v_i)                                (8)
    m̂_j = σ(Σ_ik W_ijk v_i ĥ_k + Σ_i P^m_ij v_i)                                (9)

where ĥ_k = Q(h_k = 1) and m̂_j = Q(m_j = 1). Alternate updates of ĥ and m̂ until convergence.
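The alternating fixed-point (mean-field) updates can be sketched as follows, again with hypothetical toy parameters; with small weights a handful of iterations suffices for convergence:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, L, K = 5, 4, 3                           # toy sizes (hypothetical)
W = 0.1 * rng.standard_normal((D, L, K))
Pm = 0.1 * rng.standard_normal((D, L))
Ph = 0.1 * rng.standard_normal((D, K))

def mean_field(v, n_iters=20):
    """Alternate the fixed-point updates for h_hat and m_hat, starting
    from uninformative 0.5 marginals, until (approximate) convergence."""
    m_hat = np.full(L, 0.5)
    h_hat = np.full(K, 0.5)
    for _ in range(n_iters):
        h_hat = sigmoid(np.einsum('ijk,i,j->k', W, v, m_hat) + Ph.T @ v)
        m_hat = sigmoid(np.einsum('ijk,i,k->j', W, v, h_hat) + Pm.T @ v)
    return m_hat, h_hat

v = rng.integers(0, 2, D).astype(float)
m_hat, h_hat = mean_field(v)
```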
Learning

The model is trained to maximize the data log-likelihood using stochastic gradient descent. The gradient of the log-likelihood with respect to the parameters Θ = {W^v, W^m, W^h, P^m, P^h} is

    -E_{P(m,h|v)}[∂E(v, m, h)/∂θ] + E_{P(v,m,h)}[∂E(v, m, h)/∂θ]

- The first term can be approximated using variational inference.
- The second term can be approximated with persistent CD using 3-way sampling.
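For the factored 3-way term, the per-configuration quantity -∂E/∂θ has a simple closed form. A hedged sketch (toy sizes; function names are hypothetical) with a finite-difference check, since the energy is linear in each factor matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
D, L, K, F = 5, 4, 3, 2  # toy sizes (hypothetical)
Wv = rng.standard_normal((D, F))
Wm = rng.standard_normal((L, F))
Wh = rng.standard_normal((K, F))

def energy_3way(Wv_, v, m, h):
    # 3-way energy term in factored form
    return -np.sum((Wv_.T @ v) * (Wm.T @ m) * (Wh.T @ h))

def neg_dE_dWv(v, m, h):
    # -dE/dWv_if = v_i * (Wm^T m)_f * (Wh^T h)_f
    # The likelihood gradient averages this under Q(m,h|v) (positive
    # phase) and subtracts its average under the model (negative phase).
    return np.outer(v, (Wm.T @ m) * (Wh.T @ h))

v = rng.integers(0, 2, D).astype(float)
m = rng.integers(0, 2, L).astype(float)
h = rng.integers(0, 2, K).astype(float)

g = neg_dE_dWv(v, m, h)

# Finite-difference sanity check on one entry of Wv
eps = 1e-6
Wv_p = Wv.copy(); Wv_p[0, 0] += eps
num = -(energy_3way(Wv_p, v, m, h) - energy_3way(Wv, v, m, h)) / eps
assert np.isclose(num, g[0, 0], atol=1e-4)
```

The gradients for W^m and W^h are symmetric, with the roles of the three projections swapped.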
Disentangling

- Goal: each group of hidden units is sensitive to changes in a single factor of variation and remains relatively invariant to changes in the others.
- Several disentangling methods are proposed for the disBM:
  - partial labels
  - clamping
  - manifold-based training
Disentangling: Partial Labels

Labels may be provided for any group of hidden units. Here, label units e are connected to hidden units m, and the energy function is augmented:

    E_label(v, m, h, e) = E(v, m, h) - Σ_jl m_j U_jl e_l                         (10)
Disentangling: Partial Labels (continued)

The label units are one-hot, i.e., Σ_l e_l = 1. The variational inference updates become:

    ĥ_k = σ(Σ_ij W_ijk v_i m̂_j + Σ_i P^h_ik v_i)                                (11)
    m̂_j = σ(Σ_ik W_ijk v_i ĥ_k + Σ_i P^m_ij v_i + Σ_l U_jl ê_l)                 (12)
    ê_l = exp(Σ_j U_jl m̂_j) / Σ_{l'} exp(Σ_j U_jl' m̂_j)                         (13)
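The label update (13) is a softmax over the labels. A small sketch (sizes and weights are toy placeholders), using the standard max-shift for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(4)
L_units, n_labels = 4, 3  # hypothetical sizes
U = rng.standard_normal((L_units, n_labels))
m_hat = rng.random(L_units)   # stand-in for current mean-field marginals

# e_hat_l proportional to exp(sum_j U_jl m_hat_j): a softmax over labels
scores = U.T @ m_hat
e_hat = np.exp(scores - scores.max())  # shift for numerical stability
e_hat /= e_hat.sum()
```

Because the e group is one-hot, its mean-field update is exact given m̂, which is why it takes this softmax form rather than independent sigmoids.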
Disentangling: Clamping

Suppose two data points are known to match in some factor of variation (e.g., images of the same person). The hidden units h may be clamped across the pair:

    E_clamp(v^(1), v^(2), m^(1), m^(2), h) = E(v^(1), m^(1), h) + E(v^(2), m^(2), h)   (14)

The fixed-point equation for h is adjusted such that

    ĥ_k = σ(Σ_ij W_ijk v^(1)_i m̂^(1)_j + Σ_i P^h_ik v^(1)_i + Σ_ij W_ijk v^(2)_i m̂^(2)_j + Σ_i P^h_ik v^(2)_i)   (15)

Note: labels may simultaneously be included on another group of hidden units.
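The clamped update simply sums the pre-sigmoid inputs from both observations, since the shared h appears in both energy terms. A sketch under the same toy-parameter assumptions as before:

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, L, K = 5, 4, 3                           # toy sizes (hypothetical)
W = 0.1 * rng.standard_normal((D, L, K))
Ph = 0.1 * rng.standard_normal((D, K))

def clamped_h_update(v1, m1, v2, m2):
    """Fixed-point update for h shared (clamped) across a pair: the
    pre-sigmoid inputs from both data points add before the sigmoid."""
    inp1 = np.einsum('ijk,i,j->k', W, v1, m1) + Ph.T @ v1
    inp2 = np.einsum('ijk,i,j->k', W, v2, m2) + Ph.T @ v2
    return sigmoid(inp1 + inp2)

v1 = rng.integers(0, 2, D).astype(float)
v2 = rng.integers(0, 2, D).astype(float)
m1 = rng.random(L)
m2 = rng.random(L)
h_hat = clamped_h_update(v1, m1, v2, m2)
```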
Disentangling: Manifold-Based Training

- Clamping may be too strong an assumption: it forces pairs to the same point on the manifold and does not exploit non-correspondence.
- An alternative is to learn a representation for h such that

    ||h^(1) - h^(2)||²_2 ≈ 0   if (v^(1), v^(2)) ∈ D_sim
    ||h^(1) - h^(3)||²_2 ≥ β   if (v^(1), v^(3)) ∈ D_dis

The training objective is augmented with the manifold objective

    ||h^(1) - h^(2)||²_2 + max(0, β - ||h^(1) - h^(3)||_2)²                      (16)

Gradients may be computed by backpropagating through the recurrent mean-field updates, as in an RNN.
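The manifold objective (16) is a contrastive/triplet-style loss: it pulls similar pairs together and pushes dissimilar pairs at least β apart. A minimal sketch with made-up representations:

```python
import numpy as np

def manifold_loss(h1, h2, h3, beta=1.0):
    """Manifold objective: squared distance for a similar pair (h1, h2)
    plus a squared hinge pushing a dissimilar pair (h1, h3) to be at
    least beta apart."""
    pull = np.sum((h1 - h2) ** 2)
    push = max(0.0, beta - np.linalg.norm(h1 - h3)) ** 2
    return pull + push

h1 = np.array([0.9, 0.1, 0.8])
h2 = np.array([0.8, 0.2, 0.7])   # same factor value: small distance
h3 = np.array([0.1, 0.9, 0.2])   # different factor value: beyond margin
loss = manifold_loss(h1, h2, h3)
```

Note that once the dissimilar pair is more than β apart, the hinge term contributes nothing, so the loss does not keep pushing already-separated points.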
Experiments: Flipped MNIST Digits

- A random 50% of the digits in the MNIST dataset had all of their pixel values flipped.
- A disBM was trained with two groups: a single flip unit and appearance units.
- The flip mode was successfully disentangled.
- A linear SVM was trained on the appearance units for classification.

[Figures: samples from the flipped MNIST dataset; test classification errors]
Experiments: Toronto Face Database (TFD)

- Contains 112,234 face images with 7 possible emotion labels and 3,874 identity labels.
- Given an input identity, the disBM can traverse the expression manifold:
  - fix the identity units h and label units e
  - perform Gibbs sampling between v and m
Experiments: CMU Multi-PIE

- Contains 754,200 face images with variations in pose, lighting, and expression.
- Given an input identity, the disBM can traverse the pose manifold:
  - fix the identity units h and label units e
  - perform Gibbs sampling between v and m
Experiments: TFD & Multi-PIE

Identities in the left column are transferred to the expressions and poses of the middle column.
Experiments: TFD

Performance on TFD for emotion recognition and face verification:
- Emotion recognition: train a linear SVM, report % accuracy.
- Face verification: use cosine similarity as a score, report AUC.
- Comparisons to other methods and among the different proposed disentangling methods.