Probabilistic & Unsupervised Learning. Latent Variable Models
|
|
- Anna Gilmore
- 5 years ago
- Views:
Transcription
1 Probabilistic & Unsupervised Learning Latent Variable Models Maneesh Sahani Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London Term 1, Autumn 2018
2 Exponential family models Simple, single-stage generative models. Easy, often closed-form expressions for learning and model comparison.... but limited in expressiveness.
3 Exponential family models Simple, single-stage generative models. Easy, often closed-form expressions for learning and model comparison.... but limited in expressiveness. What about distributions like these?
4 Exponential family models Simple, single-stage generative models. Easy, often closed-form expressions for learning and model comparison.... but limited in expressiveness. What about distributions like these? In each case, data may be generated by combining and transforming latent exponential family variates.
5 Latent variable models Explain correlations in x by assuming dependence on latent variables z (e.g. objects, illumination, pose) (e.g. object parts, surfaces) z P[θ z] x z P[θ x] (e.g. edges) p(x, z; θ x, θ z) = p(x z; θ x)p(z; θ z) p(x; θ x, θ z) = dz p(x z; θ x)p(z; θ z) (retinal image, i.e. pixels)
6 Latent variable models Describe structured distributions. Correlations in high-dimensional x may be captured by fewer parameters. Capture an underlying generative process. z may describe causes of x. help to separate signal from noise. Combine exponential family distributions into richer, more flexible forms. P(z), P(x z) and even P(x, z) may be in the exponential family P(x) rarely is. (Exception: Linear Gaussian models).
7 Latent variables and Gaussians Gaussian correlation can be composed from latent components and uncorrelated noise. x N ( [ ]) 3 2 0, 2 3
8 Latent variables and Gaussians Gaussian correlation can be composed from latent components and uncorrelated noise. x N ( [ ]) 3 2 0, 2 3 z N (0, 1) x N ( [ ] [ ]) z, 1 0 1
9 Probabilistic Principal Components Analysis (PPCA) If the uncorrelated noise is assumed to be isotropic, this model is called PPCA.
10 Probabilistic Principal Components Analysis (PPCA) If the uncorrelated noise is assumed to be isotropic, this model is called PPCA. Data: D = X = {x 1, x 2,..., x N}; x i R D Latents: Z = {z 1, z 2,..., z N}; z i R K K Linear generative model: x d = Λ dk z k + ɛ d k=1 z k are independent N (0, 1) Gaussian factors ɛ d are independent N (0, ψ) Gaussian noise K <D z 1 z 2 z K x 1 x 2 x D
11 Probabilistic Principal Components Analysis (PPCA) If the uncorrelated noise is assumed to be isotropic, this model is called PPCA. Data: D = X = {x 1, x 2,..., x N}; x i R D Latents: Z = {z 1, z 2,..., z N}; z i R K K Linear generative model: x d = Λ dk z k + ɛ d k=1 z k are independent N (0, 1) Gaussian factors ɛ d are independent N (0, ψ) Gaussian noise K <D z 1 z 2 z K x 1 x 2 x D Model for observations x is a correlated Gaussian: p(z) = N (0, I) p(x z) = N (Λz, ψi) where Λ is a D K matrix.
12 Probabilistic Principal Components Analysis (PPCA) If the uncorrelated noise is assumed to be isotropic, this model is called PPCA. Data: D = X = {x 1, x 2,..., x N}; x i R D Latents: Z = {z 1, z 2,..., z N}; z i R K K Linear generative model: x d = Λ dk z k + ɛ d k=1 z k are independent N (0, 1) Gaussian factors ɛ d are independent N (0, ψ) Gaussian noise K <D z 1 z 2 z K x 1 x 2 x D Model for observations x is a correlated Gaussian: p(z) = N (0, I) p(x z) = N (Λz, ψi) p(x) = p(z)p(x z)dz where Λ is a D K matrix.
13 Probabilistic Principal Components Analysis (PPCA) If the uncorrelated noise is assumed to be isotropic, this model is called PPCA. Data: D = X = {x 1, x 2,..., x N}; x i R D Latents: Z = {z 1, z 2,..., z N}; z i R K K Linear generative model: x d = Λ dk z k + ɛ d k=1 z k are independent N (0, 1) Gaussian factors ɛ d are independent N (0, ψ) Gaussian noise K <D z 1 z 2 z K x 1 x 2 x D Model for observations x is a correlated Gaussian: p(z) = N (0, I) Note: E x [f (x)] = E z [ Ex z [f (x)] ] V x [x] = E z [V [x z]] + V z [E [x z]] p(x z) = N (Λz, ψi) ] ) p(x) = p(z)p(x z)dz = N (E z [Λz], E z [Λzz T Λ T + ψi where Λ is a D K matrix.
14 Probabilistic Principal Components Analysis (PPCA) If the uncorrelated noise is assumed to be isotropic, this model is called PPCA. Data: D = X = {x 1, x 2,..., x N}; x i R D Latents: Z = {z 1, z 2,..., z N}; z i R K K Linear generative model: x d = Λ dk z k + ɛ d k=1 z k are independent N (0, 1) Gaussian factors ɛ d are independent N (0, ψ) Gaussian noise K <D z 1 z 2 z K x 1 x 2 x D Model for observations x is a correlated Gaussian: p(z) = N (0, I) Note: E x [f (x)] = E z [ Ex z [f (x)] ] V x [x] = E z [V [x z]] + V z [E [x z]] p(x z) = N (Λz, ψi) ] ) ( ) p(x) = p(z)p(x z)dz = N (E z [Λz], E z [Λzz T Λ T + ψi = N 0, ΛΛ T + ψi where Λ is a D K matrix.
15 Multivariate Gaussians and latent variables Two models: p(x) = N (0, Σ) p(z) = N (0, I) p(x z) = N (Λz, ψi) ( ) p(x) = N 0, ΛΛ T + ψi Descriptive density model: correlations are captured by off-diagonal elements of Σ. Σ has D(D+1) free parameters. 2 Only constrained to be positive definite. Simple ML estimate. Interpretable causal model: correlations captured by common influence of latent variable. ΛΛ T + ψi has DK + 1 free parameters. For K < D covariance structure is constrained ( blurry pancake ) ML estimation is more complex.
16 PPCA likelihood The marginal distribution on x gives us the PPCA likelihood: log p(x Λ, ψ) = N 2π(ΛΛ 2 log T + ψi) 1 [ 2 Tr (ΛΛ T + ψi) ] 1 xx T n }{{} NS To find the ML values of (Λ, ψ) we could optimise numerically (gradient ascent / Newton s method), or we could use a different iterative algorithm called EM which we ll introduce soon. In fact, however, ML for PPCA is more straightforward in principle, as we will see by first considering the limit ψ 0. [Note: We may also add a constant mean µ to the output, so as to model data that are not distributed around 0. In this case, the ML estimate µ = 1 N n xn and we can define S = 1 N n (x µ)(x µ)t in the likelihood above.]
17 The ψ 0 limit As ψ 0, the latent model can only capture K dimensions of variance.
18 The ψ 0 limit As ψ 0, the latent model can only capture K dimensions of variance.
19 The ψ 0 limit As ψ 0, the latent model can only capture K dimensions of variance.
20 The ψ 0 limit As ψ 0, the latent model can only capture K dimensions of variance.
21 The ψ 0 limit As ψ 0, the latent model can only capture K dimensions of variance. In a Gaussian model, the ML parameters will find the K -dimensional space of most variance.
22 Principal Components Analysis This leads us to an (old) algorithm called Principal Components Analysis (PCA). Assume data D = {x i} have zero mean (if not, subtract it). Find direction of greatest variance λ (1). 5 λ (1) = argmax v =1 (x T nv) 2 n x x x Find direction orthogonal to λ (1) with greatest variance λ (2). Find direction orthogonal to {λ (1), λ (2),..., λ (n 1) } with greatest variance λ (n). Terminate when remaining variance drops below a threshold.
23 Eigendecomposition of a covariance matrix The eigendecomposition of a covariance matrix makes finding the PCs easy.
24 Eigendecomposition of a covariance matrix The eigendecomposition of a covariance matrix makes finding the PCs easy. Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if Su = ωu u can have any norm, but we will define it to be unity (i.e., u T u = 1).
25 Eigendecomposition of a covariance matrix The eigendecomposition of a covariance matrix makes finding the PCs easy. Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if Su = ωu u can have any norm, but we will define it to be unity (i.e., u T u = 1). For a covariance matrix S = xx T (which is D D, symmetric, positive semi-definite): In general there are D eigenvector-eigenvalue pairs (u (i), ω (i) ), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate any linear combination is also an eigenvector).
26 Eigendecomposition of a covariance matrix The eigendecomposition of a covariance matrix makes finding the PCs easy. Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if Su = ωu u can have any norm, but we will define it to be unity (i.e., u T u = 1). For a covariance matrix S = xx T (which is D D, symmetric, positive semi-definite): In general there are D eigenvector-eigenvalue pairs (u (i), ω (i) ), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate any linear combination is also an eigenvector). The D eigenvectors are orthogonal (or orthogonalisable, if ω (i) = ω (j) ). Thus, they form an orthonormal basis. i u (i) u (i) T = I.
27 Eigendecomposition of a covariance matrix The eigendecomposition of a covariance matrix makes finding the PCs easy. Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if Su = ωu u can have any norm, but we will define it to be unity (i.e., u T u = 1). For a covariance matrix S = xx T (which is D D, symmetric, positive semi-definite): In general there are D eigenvector-eigenvalue pairs (u (i), ω (i) ), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate any linear combination is also an eigenvector). The D eigenvectors are orthogonal (or orthogonalisable, if ω (i) = ω (j) ). Thus, they form an orthonormal basis. i u (i) u (i) T = I. Any vector v can be written as ( ) T v = u (i) u (i) v = (u T (i) v)u (i) = v (i) u (i) i i i
28 Eigendecomposition of a covariance matrix The eigendecomposition of a covariance matrix makes finding the PCs easy. Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if Su = ωu u can have any norm, but we will define it to be unity (i.e., u T u = 1). For a covariance matrix S = xx T (which is D D, symmetric, positive semi-definite): In general there are D eigenvector-eigenvalue pairs (u (i), ω (i) ), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate any linear combination is also an eigenvector). The D eigenvectors are orthogonal (or orthogonalisable, if ω (i) = ω (j) ). Thus, they form an orthonormal basis. i u (i) u (i) T = I. Any vector v can be written as ( ) T v = u (i) u (i) v = (u T (i) v)u (i) = i i i v (i) u (i) The original matrix S can be written: S = i ω (i) u (i) u (i) T = UWU T where U = [u (1), u (2),..., u (D) ] collects the eigenvectors and W = diag [ (ω (1), ω (2),..., ω (D) ) ].
29 PCA and eigenvectors The variance in direction u (i) is (x T u (i) ) 2 = u T (i) xx T u (i) = u T (i) Su (i) = u T (i) ω (i) u (i) = ω (i)
30 PCA and eigenvectors The variance in direction u (i) is (x T u (i) ) 2 = u T (i) xx T u (i) = u T (i) Su (i) = u T (i) ω (i) u (i) = ω (i) The variance in an arbitrary direction v is ( )) 2 (x T v) 2 = (x T v (i) u (i) = i ij v (i) u (i) T Su (j) v (j) = ij v (i) ω (j) v (j) u (i) T u (j) = i v 2 (i)ω (i)
31 PCA and eigenvectors The variance in direction u (i) is (x T u (i) ) 2 = u T (i) xx T u (i) = u T (i) Su (i) = u T (i) ω (i) u (i) = ω (i) The variance in an arbitrary direction v is ( )) 2 (x T v) 2 = (x T v (i) u (i) = i ij v (i) u (i) T Su (j) v (j) = ij v (i) ω (j) v (j) u (i) T u (j) = i v 2 (i)ω (i) If v T v = 1, then i v 2 (i) = 1 and so argmax v =1 (x T v) 2 = u (max) The direction of greatest variance is the eigenvector the largest eigenvalue.
32 PCA and eigenvectors The variance in direction u (i) is (x T u (i) ) 2 = u T (i) xx T u (i) = u T (i) Su (i) = u T (i) ω (i) u (i) = ω (i) The variance in an arbitrary direction v is ( )) 2 (x T v) 2 = (x T v (i) u (i) = i ij v (i) u (i) T Su (j) v (j) = ij v (i) ω (j) v (j) u (i) T u (j) = i v 2 (i)ω (i) If v T v = 1, then i v 2 (i) = 1 and so argmax v =1 (x T v) 2 = u (max) The direction of greatest variance is the eigenvector the largest eigenvalue. In general, the PCs are exactly the eigenvectors of the empirical covariance matrix, ordered by decreasing eigenvalue.
33 PCA and eigenvectors The variance in direction u (i) is (x T u (i) ) 2 = u T (i) xx T u (i) = u T (i) Su (i) = u T (i) ω (i) u (i) = ω (i) The variance in an arbitrary direction v is ( )) 2 (x T v) 2 = (x T v (i) u (i) = i ij v (i) u (i) T Su (j) v (j) = ij v (i) ω (j) v (j) u (i) T u (j) = i v 2 (i)ω (i) If v T v = 1, then i v 2 (i) = 1 and so argmax v =1 (x T v) 2 = u (max) The direction of greatest variance is the eigenvector the largest eigenvalue. In general, the PCs are exactly the eigenvectors of the empirical covariance matrix, ordered by decreasing eigenvalue. 100 The eigenspectrum shows how the variance 80 is distributed across dimensions; can identify transitions that might separate signal from noise, or the number of PCs that capture a predetermined fraction of variance eigenvalue (variance) eigenvalue number fractional variance remaining eigenvalue number
34 PCA subspace The K principle components define the K -dimensional subspace of greatest variance x x 2 x 1 Each data point x n is associated with a projection ˆx n into the principle subspace. K ˆx n = x n(k) λ (k) k=1 This can be used for lossy compression, denoising, recognition,...
35 Example of PCA: Eigenfaces vismod.media.mit.edu/vismod/demos/facerec/basic.html
36 Example of PCA: Genetic variation within Europe Novembre et al. (2008) Nature 456:98-101
37 Example of PCA: Genetic variation within Europe Novembre et al. (2008) Nature 456:98-101
38 Example of PCA: Genetic variation within Europe Novembre et al. (2008) Nature 456:98-101
39 Equivalent definitions of PCA Find K directions of greatest variance in data. Find K -dimensional orthogonal projection that preserves greatest variance. Find K -dimensional vectors z i and matrix Λ so that ˆx i = Λz i is as close as possible (in squared distance) to x i.... (many others)
40 Another view of PCA: Mutual information Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian). I(z; x) = H(z) + H(x) H(z, x) = H(z)
41 Another view of PCA: Mutual information Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian). I(z; x) = H(z) + H(x) H(z, x) = H(z) So we want to maximise the entropy of z.
42 Another view of PCA: Mutual information Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian). I(z; x) = H(z) + H(x) H(z, x) = H(z) So we want to maximise the entropy of z. What is the entropy of a Gaussian? H(z) = dz p(z) ln p(z) = 1 2 ln Σ + D (1 + ln 2π) 2
43 Another view of PCA: Mutual information Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian). I(z; x) = H(z) + H(x) H(z, x) = H(z) So we want to maximise the entropy of z. What is the entropy of a Gaussian? H(z) = dz p(z) ln p(z) = 1 2 ln Σ + D (1 + ln 2π) 2 Therefore we want the distribution of z to have largest volume (i.e. det of covariance matrix). Σ z = AΣ xa T = AUW xu T A T
44 Another view of PCA: Mutual information Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian). I(z; x) = H(z) + H(x) H(z, x) = H(z) So we want to maximise the entropy of z. What is the entropy of a Gaussian? H(z) = dz p(z) ln p(z) = 1 2 ln Σ + D (1 + ln 2π) 2 Therefore we want the distribution of z to have largest volume (i.e. det of covariance matrix). Σ z = AΣ xa T = AUW xu T A T So, A should be aligned with the columns of U which are associated with the largest eigenvalues (variances).
45 Another view of PCA: Mutual information Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian). I(z; x) = H(z) + H(x) H(z, x) = H(z) So we want to maximise the entropy of z. What is the entropy of a Gaussian? H(z) = dz p(z) ln p(z) = 1 2 ln Σ + D (1 + ln 2π) 2 Therefore we want the distribution of z to have largest volume (i.e. det of covariance matrix). Σ z = AΣ xa T = AUW xu T A T So, A should be aligned with the columns of U which are associated with the largest eigenvalues (variances). Projection to the principal component subspace preserves the most information about the (Gaussian) data.
46 Linear autoencoders: From supervised learning to PCA output units hidden units ˆx 1 ˆx 2 ˆx D Q P z 1 z K decoder generation encoder recognition input units x 1 x 2 x D Learning: argmin ˆx x 2 ˆx = Qz z = Px P,Q At the optimum, P and Q perform the projection and reconstruction steps of PCA. (Baldi & Hornik 1989).
47 ML learning for PPCA l = N 2 log 2πC N 2 Tr [ C 1 S ] l Λ = N 2 where C = ΛΛ T + ψi ( log C Λ Λ Tr [ C 1 S ] ) = N ( C 1 Λ + C 1 SC 1 Λ ) So at the stationary points we have SC 1 Λ = Λ. This implies either: Λ = 0, which turns out to be a minimum. C = S ΛΛ T = S ψi. Now rank(λλ T ) K rank(s ψi) K S has D K eigenvalues = ψ and Λ aligns with space of remaining eigenvectors. or, taking the SVD: Λ = ULV T : S(ULV T VLU T + ψi) 1 ULV T = ULV T VL 1 S(UL 2 U T + ψi) 1 U = U U(L 2 + ψi) = (UL 2 U T + ψi)u (UL 2 U T + ψi) 1 U = U(L 2 + ψi) 1 SU(L 2 + ψi) 1 = U (L 2 + ψi) SU = U (L 2 + ψi) }{{} diagonal columns of U are eigenvectors of S with eigenvalues given by l 2 i + ψ. Thus, Λ = ULV T spans a space defined by K eigenvectors of S; and the lengths of the column vectors of L are given by the eigenvalues ψ (V selects an arbitrary basis in the latent space). Remains to show (we won t, but it s intuitively reasonable) that the global ML solution is attained when Λ aligns with the K leading eigenvectors.
48 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection?
49 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n.
50 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n)
51 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n) = (2π) K 2 exp{ 1 2 zt nz n} 2πΨ 1 2 exp{ 1 2 (xn Λzn)T Ψ 1 (x n Λz n)}
52 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n) = (2π) K 2 exp{ 1 2 zt nz n} 2πΨ 1 2 exp{ 1 2 (xn Λzn)T Ψ 1 (x n Λz n)} = c exp{ 1 2 [zt nz n + (x n Λz n) T Ψ 1 (x n Λz n)]}
53 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n) = (2π) K 2 exp{ 1 2 zt nz n} 2πΨ 1 2 exp{ 1 2 (xn Λzn)T Ψ 1 (x n Λz n)} = c exp{ 1 2 [zt nz n + (x n Λz n) T Ψ 1 (x n Λz n)]} = c exp{ 1 2 [zt n(i + Λ T Ψ 1 Λ)z n 2z T nλ T Ψ 1 x n]}
54 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n) = (2π) K 2 exp{ 1 2 zt nz n} 2πΨ 1 2 exp{ 1 2 (xn Λzn)T Ψ 1 (x n Λz n)} = c exp{ 1 2 [zt nz n + (x n Λz n) T Ψ 1 (x n Λz n)]} = c exp{ 1 2 [zt n(i + Λ T Ψ 1 Λ)z n 2z T nλ T Ψ 1 x n]} = c exp{ 1 2 [zt nσ 1 z n 2z T nσ 1 µ + µ T Σ 1 µ]}
55 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n) = (2π) K 2 exp{ 1 2 zt nz n} 2πΨ 1 2 exp{ 1 2 (xn Λzn)T Ψ 1 (x n Λz n)} = c exp{ 1 2 [zt nz n + (x n Λz n) T Ψ 1 (x n Λz n)]} = c exp{ 1 2 [zt n(i + Λ T Ψ 1 Λ)z n 2z T nλ T Ψ 1 x n]} = c exp{ 1 2 [zt nσ 1 z n 2z T nσ 1 µ + µ T Σ 1 µ]} So Σ = (I + Λ T Ψ 1 Λ) 1 = I βλ and µ = ΣΛ T Ψ 1 x n = βx n. Where β = ΣΛ T Ψ 1.
56 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n) = (2π) K 2 exp{ 1 2 zt nz n} 2πΨ 1 2 exp{ 1 2 (xn Λzn)T Ψ 1 (x n Λz n)} = c exp{ 1 2 [zt nz n + (x n Λz n) T Ψ 1 (x n Λz n)]} = c exp{ 1 2 [zt n(i + Λ T Ψ 1 Λ)z n 2z T nλ T Ψ 1 x n]} = c exp{ 1 2 [zt nσ 1 z n 2z T nσ 1 µ + µ T Σ 1 µ]} So Σ = (I + Λ T Ψ 1 Λ) 1 = I βλ and µ = ΣΛ T Ψ 1 x n = βx n. Where β = ΣΛ T Ψ 1. Thus, ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n
57 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n) = (2π) K 2 exp{ 1 2 zt nz n} 2πΨ 1 2 exp{ 1 2 (xn Λzn)T Ψ 1 (x n Λz n)} = c exp{ 1 2 [zt nz n + (x n Λz n) T Ψ 1 (x n Λz n)]} = c exp{ 1 2 [zt n(i + Λ T Ψ 1 Λ)z n 2z T nλ T Ψ 1 x n]} = c exp{ 1 2 [zt nσ 1 z n 2z T nσ 1 µ + µ T Σ 1 µ]} So Σ = (I + Λ T Ψ 1 Λ) 1 = I βλ and µ = ΣΛ T Ψ 1 x n = βx n. Where β = ΣΛ T Ψ 1. Thus, ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n This is not the same projection. PPCA takes into account noise in the principal subspace.
58 PPCA latents In PCA the noise is orthogonal to the subspace, and we can project x n ˆx n trivially. In PPCA, the noise is more sensible (equal in all directions). But what is the projection? Find the expected value z n = E [z n x n] and then take ˆx n = Λz n. Tactic: write p(z n, x n θ), consider x n to be fixed. What is this as a function of z n? p(z n, x n) = p(z n)p(x n z n) = (2π) K 2 exp{ 1 2 zt nz n} 2πΨ 1 2 exp{ 1 2 (xn Λzn)T Ψ 1 (x n Λz n)} = c exp{ 1 2 [zt nz n + (x n Λz n) T Ψ 1 (x n Λz n)]} = c exp{ 1 2 [zt n(i + Λ T Ψ 1 Λ)z n 2z T nλ T Ψ 1 x n]} = c exp{ 1 2 [zt nσ 1 z n 2z T nσ 1 µ + µ T Σ 1 µ]} So Σ = (I + Λ T Ψ 1 Λ) 1 = I βλ and µ = ΣΛ T Ψ 1 x n = βx n. Where β = ΣΛ T Ψ 1. Thus, ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n This is not the same projection. PPCA takes into account noise in the principal subspace. As ψ 0, the PPCA estimate the PCA value.
59 PPCA latents principal subspace
60 PPCA latents PCA projection principal subspace
61 PPCA latents PPCA noise PPCA latent prior principal subspace
62 PPCA latents PPCA posterior PPCA latent prior PPCA noise PPCA projection principal subspace
63 Factor Analysis If dimensions are not equivalent, equal variance assumption is inappropriate. Data: D = X = {x 1, x 2,..., x N}; x i R D Latents: Z = {z 1, z 2,..., z N}; z i R K K Linear generative model: x d = Λ dk z k + ɛ d k=1 z k are independent N (0, 1) Gaussian factors ɛ d are independent N (0, Ψ dd) Gaussian noise K <D Model for observations x is still a correlated Gaussian: p(z) = N (0, I) p(x z) = N (Λz, Ψ) p(x) = p(z)p(x z)dz = N ( ) 0, ΛΛ T + Ψ where Λ is a D K, and Ψ is D D and diagonal. z 1 z 2 z K x 1 x 2 x D Dimensionality Reduction: Finds a low-dimensional projection of high dimensional data that captures the correlation structure of the data.
64 Factor Analysis (cont.) z 1 z 2 z K x 1 x 2 x D ML learning finds Λ ( common factors ) and Ψ ( unique factors or uniquenesses ) given data parameters (corrected for symmetries): DK + D K (K 1) 2 If number of parameters > D(D+1) 2 model is not identifiable (even after accounting for rotational degeneracy discussed later) no closed form solution for ML params: N (0, ΛΛ T + Ψ)
65 Factor Analysis projections Our analysis for PPCA still applies: ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA.
66 Factor Analysis projections Our analysis for PPCA still applies: ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA. And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal transform U (U T U = UU T = I) without changing the likelihood.
67 Factor Analysis projections Our analysis for PPCA still applies: ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA. And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal transform U (U T U = UU T = I) without changing the likelihood. z = Uz and Λ = ΛU T Λ z = ΛU T Uz = Λz
68 Factor Analysis projections Our analysis for PPCA still applies: ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA. And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal transform U (U T U = UU T = I) without changing the likelihood. z = Uz and Λ = ΛU T Λ z = ΛU T Uz = Λz l = 1 2π(ΛΛ 2 log T + Ψ) xt (ΛΛ T + Ψ) 1 x
69 Factor Analysis projections Our analysis for PPCA still applies: ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA. And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal transform U (U T U = UU T = I) without changing the likelihood. z = Uz and Λ = ΛU T Λ z = ΛU T Uz = Λz l = 1 2π(ΛU 2 log T UΛ T + Ψ) xt (ΛU T UΛ T + Ψ) 1 x
70 Factor Analysis projections Our analysis for PPCA still applies: ˆx n = Λ(I + Λ T Ψ 1 Λ) 1 Λ T Ψ 1 x n = x n Ψ(ΛΛ T + Ψ) 1 x n but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA. And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal transform U (U T U = UU T = I) without changing the likelihood. z = Uz and Λ = ΛU T Λ z = ΛU T Uz = Λz l = 1 2π(ΛU 2 log T UΛ T + Ψ) xt (ΛU T UΛ T + Ψ) 1 x = 1 2 log 2π( Λ ΛT + Ψ) xt ( Λ Λ T + Ψ) 1 x
71 Gradient methods for learning FA Optimise negative log-likelihood: l = 1 2 log 2π(ΛΛT + Ψ) xt (ΛΛ T + Ψ) 1 x w.r.t. Λ and Ψ (need matrix calculus) subject to constraints. No spectral short-cut exists. Likelihood can have more than one (local) optimum, making it difficult to find the global value. For some data ( Heywood cases ) likelihood may grow unboundedly by taking one or more Ψ dd 0. Can eliminate by assuming a prior on Ψ with zero density at Ψ dd = 0, but results sensitive to precise choice of prior. Expectation maximisation (next week) provides an alternative approach to maximisation, but doesn t solve these issues.
72 FA vs PCA PCA and PPCA are rotationally invariant; FA is not If x Ux for unitary U, then λ PCA (i) Uλ PCA (i) FA is measurement scale invariant; PCA and PPCA are not If x Sx for diagonal S, then λ FA (i) Sλ FA (i) FA and PPCA define a probabilistic model; PCA does not [Note: it may be tempting to try to eliminate the scale-dependence of (P)PCA by pre-processing data to equalise total variance on each axis. But P(PCA) assume equal noise variance. Total variance has contributions from both ΛΛ T and noise, so this approach does not exactly solve the problem.]
73 Canonical Correlations Analysis Data vector pairs: D = {(u 1, v 1), (u 2, v 2)... } in spaces U and V. Classic CCA Find unit vectors υ 1 U, φ 1 V such that the correlation of u T i υ 1 and v T i φ 1 is maximised. As with PCA, repeat in orthogonal subspaces. Probabilistic CCA Generative model with latent z i R K : z N (0, I) u N (Υz, Ψ u) Ψ u 0 v N (Φz, Ψ v) Ψ v 0 Block diagonal noise.
74 Limitations of Gaussian, FA and PCA models Gaussian, FA and PCA models are easy to understand and use in practice. They are a convenient way to find interesting directions in very high dimensional data sets, eg as preprocessing However, they make strong assumptions about the distribution of the data: only the mean and variance of the data are taken into account. The class of densities which can be modelled is too restrictive. 1 x i x i1 By using mixtures of simple distributions, such as Gaussians, we can expand the class of densities greatly.
75 Mixture Distributions 1 x i x i1 A mixture distribution has a single discrete latent variable: s iid i Discrete[π] x i s i P si [θ si ] Mixtures arise naturally when observations from different sources have been collated. They can also be used to approximate arbitrary distributions.
76 The Mixture Likelihood The mixture model is s iid i Discrete[π] x i s i P si [θ si ]
77 The Mixture Likelihood The mixture model is s iid i Discrete[π] x i s i P si [θ si ] Under the discrete distribution P(s i = m) = π m; π m 0, k π m = 1 m=1
78 The Mixture Likelihood The mixture model is s iid i Discrete[π] x i s i P si [θ si ] Under the discrete distribution P(s i = m) = π m; π m 0, k π m = 1 m=1 Thus, the probability (density) at a single data point x i is k P(x i) = P(x i s i = m)p(s i = m) m=1
79 The Mixture Likelihood The mixture model is s iid i Discrete[π] x i s i P si [θ si ] Under the discrete distribution P(s i = m) = π m; π m 0, k π m = 1 m=1 Thus, the probability (density) at a single data point x i is P(x i) = = k P(x i s i = m)p(s i = m) m=1 k π mp m(x i; θ m) m=1
80 The Mixture Likelihood The mixture model is s iid i Discrete[π] x i s i P si [θ si ] Under the discrete distribution P(s i = m) = π m; π m 0, k π m = 1 m=1 Thus, the probability (density) at a single data point x i is P(x i) = = k P(x i s i = m)p(s i = m) m=1 k π mp m(x i; θ m) m=1 The mixture distribution (density) is a convex combination (or weighted average) of the component distributions (densities).
81 Approximation with a Mixture of Gaussians (MoG) The component densities may be viewed as elements of a basis which can be combined to approximate arbitrary distributions. Here are examples where non-gaussian densities are modelled (aproximated) as a mixture of Gaussians. The red curves show the (weighted) Gaussians, and the blue curve the resulting density. Uniform Triangle Heavy tails Given enough mixture components we can model (almost) any density (as accurately as desired), but still only need to work with the well-known Gaussian form.
82 Clustering with a MoG
83 Clustering with a MoG
84 Clustering with a MoG In clustering applications, the latent variable s i represents the (unknown) identity of the cluster to which the ith observation belongs. Thus, the latent distribution gives the prior probability of a data point coming from each cluster. P(s i = m π) = π m Data from the mth cluster are distributed according to the mth component: P(x i s i = m) = P m(x i) Once we observe a data point, the posterior probability distribution for the cluster it belongs to is P(s i = m x i) = Pm(xi)πm m Pm(xi)πm This is often called the responsibility of the mth cluster for the ith data point.
85 The MoG likelihood Each component of a MoG is a Gaussian, with mean µ m and covariance matrix Σ m. Thus, the probability density evaluated at a set of n iid observations, D = {x 1... x n} (i.e. the likelihood) is p(d {µ m}, {Σ m}, π) = The log of the likelihood is = n i=1 m=1 n i=1 m=1 k π m N (x i µ m, Σ m) k π m 1 2πΣm e 1 2 (x i µ m) T Σ 1 m (x i µ m) log p(d {µ m}, {Σ m}, π) = n log i=1 k m=1 π m 1 2πΣm e 1 2 (x i µ m) T Σ 1 m (x i µ m) Note that the logarithm fails to simplify the component density terms. A mixture distribution does not lie in the exponential family. Direct optimisation is not easy.
86 Maximum Likelihood for a Mixture Model The log likelihood is:
87 Maximum Likelihood for a Mixture Model The log likelihood is: L = n k log π mp m(x i; θ m) i=1 m=1
88 Maximum Likelihood for a Mixture Model The log likelihood is: L = n k log π mp m(x i; θ m) i=1 m=1 Its partial derivative wrt θ m is L θ m = n i=1 π m k m=1 πmpm(xi; θm) P m(x i; θ m) θ m
89 Maximum Likelihood for a Mixture Model The log likelihood is: L = n k log π mp m(x i; θ m) i=1 m=1 Its partial derivative wrt θ m is L θ m = n i=1 π m P m(x i; θ m) k πmpm(xi; θm) θ m m=1 or, using P/ θ = P log P/ θ,
90 Maximum Likelihood for a Mixture Model The log likelihood is: L = n k log π mp m(x i; θ m) i=1 m=1 Its partial derivative wrt θ m is L θ m = n i=1 π m P m(x i; θ m) k πmpm(xi; θm) θ m m=1 or, using P/ θ = P log P/ θ, = n π mp m(x i; θ m) k m=1 πmpm(xi; θm) }{{} i=1 log P m(x i; θ m) θ m
91 Maximum Likelihood for a Mixture Model The log likelihood is: L = n k log π mp m(x i; θ m) i=1 m=1 Its partial derivative wrt θ m is L θ m = n i=1 π m P m(x i; θ m) k πmpm(xi; θm) θ m m=1 or, using P/ θ = P log P/ θ, = = n π mp m(x i; θ m) k m=1 πmpm(xi; θm) }{{} n i=1 i=1 log P m(x i; θ m) θ m r im log P m(x i; θ m) θ m
92 Maximum Likelihood for a Mixture Model The log likelihood is: L = n k log π mp m(x i; θ m) i=1 m=1 Its partial derivative wrt θ m is L θ m = n i=1 π m P m(x i; θ m) k πmpm(xi; θm) θ m m=1 or, using P/ θ = P log P/ θ, = = n π mp m(x i; θ m) k m=1 πmpm(xi; θm) }{{} n i=1 i=1 log P m(x i; θ m) θ m r im log P m(x i; θ m) θ m And its partial derivative wrt π m is L π m = n P m(x i; θ m) k m=1 πmpm(xi; θm) i=1
93 Maximum Likelihood for a Mixture Model The log likelihood is: L = n k log π mp m(x i; θ m) i=1 m=1 Its partial derivative wrt θ m is L θ m = n i=1 π m P m(x i; θ m) k πmpm(xi; θm) θ m m=1 or, using P/ θ = P log P/ θ, = = n π mp m(x i; θ m) k m=1 πmpm(xi; θm) }{{} n i=1 i=1 log P m(x i; θ m) θ m r im log P m(x i; θ m) θ m And its partial derivative wrt π m is L π m = n P m(x i; θ m) n k = m=1 πmpm(xi; θm) i=1 i=1 r im π m
94 MoG Derivatives For a MoG, with θ m = {µ m, Σ m} we get L µ m = n i=1 L = 1 Σ 1 m 2 r imσ 1 m (x i µ m) n i=1 r im ( Σ m (x i µ m)(x i µ m) T ) These equations can be used (along with Lagrangian derivatives wrt π m that enforce normalisation) for gradient based learning; e.g., taking small steps in the direction of the gradient (or using conjugate gradients).
95 The K-means Algorithm The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis).
96 The K-means Algorithm The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis). Take π m = 1/k and Σ m = σ 2 I, with σ 2 0.
97 The K-means Algorithm The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis). Take π m = 1/k and Σ m = σ 2 I, with σ 2 0. Then the responsibilities become binary r im δ(m, argmin x i µ l 2 ) l with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0.
98 The K-means Algorithm The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis). Take π m = 1/k and Σ m = σ 2 I, with σ 2 0. Then the responsibilities become binary r im δ(m, argmin x i µ l 2 ) l with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0. The k-means algorithm iterates these two steps: ( ) assign each point to its closest mean set r im = δ(m, argmin x i µ l 2 ) l ( ) update the means to the average of their assigned points set µ m = i r imx i i r im
99 The K-means Algorithm The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis). Take π m = 1/k and Σ m = σ 2 I, with σ 2 0. Then the responsibilities become binary r im δ(m, argmin x i µ l 2 ) l with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0. The k-means algorithm iterates these two steps: ( assign each point to its closest mean set r im = δ(m, argmin update the means to the average of their assigned points ) x i µ l 2 ) l ( set µ m = i r imx i This usually converges within a few iterations, although the fixed point depends on the initial values chosen for µ m. The algorithm has no learning rate, but the assumptions are quite limiting. i r im )
100 A preview of the EM algorithm We wrote the k-means algorithm in terms of binary responsibilities. Suppose, instead, we used the fractional responsibilities from the full (non-limiting) MoG, but still neglected the dependence of the responsibilities on the parameters. We could then solve for both µ m and Σ m. The EM algorithm for MoGs iterates these two steps: Evaluate the responsibilities for each point given the current parameters. Optimise the parameters assuming the responsibilities stay fixed: µ m = i r imx i i r im(x i µ m)(x i µ m) T and Σ m = i r im i r im Although this appears ad hoc, we will see (later) that it is a special case of a general algorithm, and is actually guaranteed to increase the likelihood at each iteration.
101 Issues There are several problems with these algorithms: slow convergence for the gradient based method gradient based method may develop invalid covariance matrices local minima; the end configuration may depend on the starting state how do you adjust k? Using the likelihood alone is no good. singularities; components with a single data point will have their covariance going to zero and the likelihood will tend to infinity. We will attempt to address many of these as the course goes on.
Probabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College
More informationWeek 3: The EM algorithm
Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent
More informationLecture 7: Con3nuous Latent Variable Models
CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 8 Continuous Latent Variable
More informationLatent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
More informationPCA and admixture models
PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1
More informationLecture 3: Latent Variables Models and Learning with the EM Algorithm. Sam Roweis. Tuesday July25, 2006 Machine Learning Summer School, Taiwan
Lecture 3: Latent Variables Models and Learning with the EM Algorithm Sam Roweis Tuesday July25, 2006 Machine Learning Summer School, Taiwan Latent Variable Models What to do when a variable z is always
More informationCS281 Section 4: Factor Analysis and PCA
CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we
More informationCOMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017
COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationLatent Variable Models and EM algorithm
Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic
More informationFactor Analysis (10/2/13)
STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.
More informationFactor Analysis and Kalman Filtering (11/2/04)
CS281A/Stat241A: Statistical Learning Theory Factor Analysis and Kalman Filtering (11/2/04) Lecturer: Michael I. Jordan Scribes: Byung-Gon Chun and Sunghoon Kim 1 Factor Analysis Factor analysis is used
More informationPCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given
More informationCS181 Midterm 2 Practice Solutions
CS181 Midterm 2 Practice Solutions 1. Convergence of -Means Consider Lloyd s algorithm for finding a -Means clustering of N data, i.e., minimizing the distortion measure objective function J({r n } N n=1,
More informationMachine Learning - MT & 14. PCA and MDS
Machine Learning - MT 2016 13 & 14. PCA and MDS Varun Kanade University of Oxford November 21 & 23, 2016 Announcements Sheet 4 due this Friday by noon Practical 3 this week (continue next week if necessary)
More informationWhat is Principal Component Analysis?
What is Principal Component Analysis? Principal component analysis (PCA) Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables Retains most
More informationManifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA
Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inria.fr http://perception.inrialpes.fr/
More informationPCA, Kernel PCA, ICA
PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per
More informationClustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning
Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades
More informationMaximum variance formulation
12.1. Principal Component Analysis 561 Figure 12.2 Principal component analysis seeks a space of lower dimensionality, known as the principal subspace and denoted by the magenta line, such that the orthogonal
More informationClustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.
Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationVariational Autoencoders
Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Kernel Density Estimation, Factor Analysis Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 2: 2 late days to hand it in tonight. Assignment 3: Due Feburary
More informationPrincipal Component Analysis
Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationCPSC 340: Machine Learning and Data Mining. More PCA Fall 2017
CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).
More informationUnsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent
Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:
More informationDimensionality Reduction with Principal Component Analysis
10 Dimensionality Reduction with Principal Component Analysis Working directly with high-dimensional data, such as images, comes with some difficulties: it is hard to analyze, interpretation is difficult,
More informationCSC 411 Lecture 12: Principal Component Analysis
CSC 411 Lecture 12: Principal Component Analysis Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 12-PCA 1 / 23 Overview Today we ll cover the first unsupervised
More informationIntroduction PCA classic Generative models Beyond and summary. PCA, ICA and beyond
PCA, ICA and beyond Summer School on Manifold Learning in Image and Signal Analysis, August 17-21, 2009, Hven Technical University of Denmark (DTU) & University of Copenhagen (KU) August 18, 2009 Motivation
More informationData Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis
Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationLinear Algebra for Machine Learning. Sargur N. Srihari
Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it
More informationCovariance and Principal Components
COMP3204/COMP6223: Computer Vision Covariance and Principal Components Jonathon Hare jsh2@ecs.soton.ac.uk Variance and Covariance Random Variables and Expected Values Mathematicians talk variance (and
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationExpectation Maximization
Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same
More informationThe Expectation Maximization or EM algorithm
The Expectation Maximization or EM algorithm Carl Edward Rasmussen November 15th, 2017 Carl Edward Rasmussen The EM algorithm November 15th, 2017 1 / 11 Contents notation, objective the lower bound functional,
More informationProbabilistic & Unsupervised Learning. Latent Variable Models. Exponential family models. Latent variable models. Latent variable models
Exponental famly models Probablstc & Unsupervsed Learnng Smple, sngle-stage generatve models. Easy, often closed-form expressons for learnng and model comparson.... but lmted n expressveness. Latent Varable
More informationPrincipal Component Analysis (PCA)
Principal Component Analysis (PCA) Salvador Dalí, Galatea of the Spheres CSC411/2515: Machine Learning and Data Mining, Winter 2018 Michael Guerzhoy and Lisa Zhang Some slides from Derek Hoiem and Alysha
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationConstrained Projection Approximation Algorithms for Principal Component Analysis
Constrained Projection Approximation Algorithms for Principal Component Analysis Seungjin Choi, Jong-Hoon Ahn, Andrzej Cichocki Department of Computer Science, Pohang University of Science and Technology,
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1
More informationCS229 Lecture notes. Andrew Ng
CS229 Lecture notes Andrew Ng Part X Factor analysis When we have data x (i) R n that comes from a mixture of several Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting,
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationExpectation Maximization
Expectation Maximization Bishop PRML Ch. 9 Alireza Ghane c Ghane/Mori 4 6 8 4 6 8 4 6 8 4 6 8 5 5 5 5 5 5 4 6 8 4 4 6 8 4 5 5 5 5 5 5 µ, Σ) α f Learningscale is slightly Parameters is slightly larger larger
More informationBackground Mathematics (2/2) 1. David Barber
Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationPrinciple Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA
Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationCheng Soon Ong & Christian Walder. Canberra February June 2017
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2017 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 679 Part XIX
More informationMachine Learning. CUNY Graduate Center, Spring Lectures 11-12: Unsupervised Learning 1. Professor Liang Huang.
Machine Learning CUNY Graduate Center, Spring 2013 Lectures 11-12: Unsupervised Learning 1 (Clustering: k-means, EM, mixture models) Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning
More informationPrincipal Component Analysis (PCA)
Principal Component Analysis (PCA) Additional reading can be found from non-assessed exercises (week 8) in this course unit teaching page. Textbooks: Sect. 6.3 in [1] and Ch. 12 in [2] Outline Introduction
More informationPattern Recognition and Machine Learning. Bishop Chapter 9: Mixture Models and EM
Pattern Recognition and Machine Learning Chapter 9: Mixture Models and EM Thomas Mensink Jakob Verbeek October 11, 27 Le Menu 9.1 K-means clustering Getting the idea with a simple example 9.2 Mixtures
More information1 EM algorithm: updating the mixing proportions {π k } ik are the posterior probabilities at the qth iteration of EM.
Université du Sud Toulon - Var Master Informatique Probabilistic Learning and Data Analysis TD: Model-based clustering by Faicel CHAMROUKHI Solution The aim of this practical wor is to show how the Classification
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationPrincipal Component Analysis and Linear Discriminant Analysis
Principal Component Analysis and Linear Discriminant Analysis Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/29
More informationMathematical foundations - linear algebra
Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar
More informationCS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)
CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationCovariance and Correlation Matrix
Covariance and Correlation Matrix Given sample {x n } N 1, where x Rd, x n = x 1n x 2n. x dn sample mean x = 1 N N n=1 x n, and entries of sample mean are x i = 1 N N n=1 x in sample covariance matrix
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA
More informationCPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2018
CPSC 340: Machine Learning and Data Mining Sparse Matrix Factorization Fall 2018 Last Time: PCA with Orthogonal/Sequential Basis When k = 1, PCA has a scaling problem. When k > 1, have scaling, rotation,
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationCS 4495 Computer Vision Principle Component Analysis
CS 4495 Computer Vision Principle Component Analysis (and it s use in Computer Vision) Aaron Bobick School of Interactive Computing Administrivia PS6 is out. Due *** Sunday, Nov 24th at 11:55pm *** PS7
More informationDimensionality Reduction
Dimensionality Reduction Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, UCL Apr/May 2016 High dimensional data Example data: Gene Expression Example data: Web Pages Google
More informationAdvanced Introduction to Machine Learning
10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see
More informationJoint Factor Analysis for Speaker Verification
Joint Factor Analysis for Speaker Verification Mengke HU ASPITRG Group, ECE Department Drexel University mengke.hu@gmail.com October 12, 2012 1/37 Outline 1 Speaker Verification Baseline System Session
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationDeep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang
Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang Supervised v.s. Unsupervised Math formulation for supervised learning Given training data x i, y i
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization
More informationMACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA
1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More informationData Mining Lecture 4: Covariance, EVD, PCA & SVD
Data Mining Lecture 4: Covariance, EVD, PCA & SVD Jo Houghton ECS Southampton February 25, 2019 1 / 28 Variance and Covariance - Expectation A random variable takes on different values due to chance The
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationAdvanced Machine Learning & Perception
Advanced Machine Learning & Perception Instructor: Tony Jebara Topic 1 Introduction, researchy course, latest papers Going beyond simple machine learning Perception, strange spaces, images, time, behavior
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationDimensionality Reduction
Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball
More informationGaussian Mixture Models
Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some
More informationTHESIS COVARIANCE REGULARIZATION IN MIXTURE OF GAUSSIANS FOR HIGH-DIMENSIONAL IMAGE CLASSIFICATION. Submitted by. Daniel L Elliott
THESIS COVARIANCE REGULARIZATION IN MIXTURE OF GAUSSIANS FOR HIGH-DIMENSIONAL IMAGE CLASSIFICATION Submitted by Daniel L Elliott Department of Computer Science In partial fulfillment of the requirements
More informationClustering and Gaussian Mixture Models
Clustering and Gaussian Mixture Models Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 25, 2016 Probabilistic Machine Learning (CS772A) Clustering and Gaussian Mixture Models 1 Recap
More informationCS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA)
CS68: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) Tim Roughgarden & Gregory Valiant April 0, 05 Introduction. Lecture Goal Principal components analysis
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works
CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationLinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationExpectation Maximization Algorithm
Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters
More informationHST.582J/6.555J/16.456J
Blind Source Separation: PCA & ICA HST.582J/6.555J/16.456J Gari D. Clifford gari [at] mit. edu http://www.mit.edu/~gari G. D. Clifford 2005-2009 What is BSS? Assume an observation (signal) is a linear
More informationLecture 7 Spectral methods
CSE 291: Unsupervised learning Spring 2008 Lecture 7 Spectral methods 7.1 Linear algebra review 7.1.1 Eigenvalues and eigenvectors Definition 1. A d d matrix M has eigenvalue λ if there is a d-dimensional
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More information