Probabilistic & Unsupervised Learning. Latent Variable Models


Probabilistic & Unsupervised Learning: Latent Variable Models
Maneesh Sahani
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2018

Exponential family models
Simple, single-stage generative models. Easy, often closed-form expressions for learning and model comparison... but limited in expressiveness.
What about distributions like these? In each case, data may be generated by combining and transforming latent exponential-family variates.

Latent variable models
Explain correlations in x by assuming dependence on latent variables z. [Figure: a hierarchy from high-level causes (objects, illumination, pose) through object parts and surfaces, and edges, down to the retinal image, i.e. pixels.]
p(x, z; θ_x, θ_z) = p(x | z; θ_x) p(z; θ_z)
p(x; θ_x, θ_z) = ∫ dz p(x | z; θ_x) p(z; θ_z)

Latent variable models
Describe structured distributions: correlations in high-dimensional x may be captured by fewer parameters.
Capture an underlying generative process: z may describe causes of x, and may help to separate signal from noise.
Combine exponential family distributions into richer, more flexible forms: P(z), P(x|z) and even P(x, z) may be in the exponential family; P(x) rarely is. (Exception: linear Gaussian models.)

Latent variables and Gaussians
Gaussian correlation can be composed from latent components and uncorrelated noise.
x ~ N( 0, [3 2; 2 3] )
z ~ N(0, 1),   x | z ~ N( [√2; √2] z, [1 0; 0 1] )
(so that the correlated covariance above is recovered as ΛΛ^T + I with Λ = [√2; √2])

Probabilistic Principal Components Analysis (PPCA)
If the uncorrelated noise is assumed to be isotropic, this model is called PPCA.
Data: D = X = {x_1, x_2, ..., x_N}; x_i ∈ R^D.  Latents: Z = {z_1, z_2, ..., z_N}; z_i ∈ R^K, with K < D.
Linear generative model: x_d = ∑_{k=1}^K Λ_dk z_k + ε_d
z_k are independent N(0, 1) Gaussian factors; ε_d are independent N(0, ψ) Gaussian noise.
[Graphical model: latents z_1, ..., z_K with arrows to each observation x_1, ..., x_D.]
Model for observations x is a correlated Gaussian, where Λ is a D × K matrix:
p(z) = N(0, I)
p(x | z) = N(Λz, ψI)
p(x) = ∫ p(z) p(x | z) dz = N( E_z[Λz], E_z[Λ z z^T Λ^T] + ψI ) = N( 0, ΛΛ^T + ψI )
Note: E_x[f(x)] = E_z[ E_{x|z}[f(x)] ] and V_x[x] = E_z[V[x | z]] + V_z[E[x | z]].
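As an illustration (not part of the original slides), here is a minimal numpy sketch of sampling from this generative model; the dimensions D, K, N and the parameter values below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 100000            # illustrative sizes (assumed)
Lam = rng.normal(size=(D, K))     # loading matrix Lambda (D x K)
psi = 0.1                         # isotropic noise variance

Z = rng.normal(size=(N, K))                                   # z_n ~ N(0, I)
X = Z @ Lam.T + rng.normal(scale=np.sqrt(psi), size=(N, D))   # x_n | z_n ~ N(Lam z_n, psi I)

# the empirical covariance of x approaches Lam Lam^T + psi I
print(np.allclose(np.cov(X.T, bias=True), Lam @ Lam.T + psi * np.eye(D), atol=0.05))
```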

Multivariate Gaussians and latent variables
Two models:
Descriptive density model: p(x) = N(0, Σ). Correlations are captured by off-diagonal elements of Σ. Σ has D(D+1)/2 free parameters, constrained only to be positive definite. Simple ML estimate.
Interpretable causal model: p(z) = N(0, I), p(x | z) = N(Λz, ψI), so p(x) = N(0, ΛΛ^T + ψI). Correlations are captured by the common influence of the latent variable. ΛΛ^T + ψI has DK + 1 free parameters. For K < D the covariance structure is constrained (a "blurry pancake"). ML estimation is more complex.

PPCA likelihood
The marginal distribution on x gives us the PPCA likelihood:
log p(X | Λ, ψ) = −(N/2) log|2π(ΛΛ^T + ψI)| − (1/2) Tr[ (ΛΛ^T + ψI)^{-1} ∑_n x_n x_n^T ],   where ∑_n x_n x_n^T = N S.
To find the ML values of (Λ, ψ) we could optimise numerically (gradient ascent / Newton's method), or we could use a different iterative algorithm called EM, which we'll introduce soon.
In fact, however, ML for PPCA is more straightforward in principle, as we will see by first considering the limit ψ → 0.
[Note: we may also add a constant mean µ to the output, so as to model data that are not distributed around 0. In this case the ML estimate is µ̂ = (1/N) ∑_n x_n, and we define S = (1/N) ∑_n (x_n − µ̂)(x_n − µ̂)^T in the likelihood above.]
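A small numpy sketch (added here, not from the slides) that evaluates this log-likelihood for centred data; it assumes X is an N × D array and uses a linear solve rather than an explicit matrix inverse.

```python
import numpy as np

def ppca_loglik(X, Lam, psi):
    """Log-likelihood of centred data X under the PPCA marginal N(0, Lam Lam^T + psi I)."""
    N, D = X.shape
    C = Lam @ Lam.T + psi * np.eye(D)      # marginal covariance
    S = X.T @ X / N                        # empirical covariance
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * N * (D * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))
```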

The ψ → 0 limit
As ψ → 0, the latent model can only capture K dimensions of variance. In a Gaussian model, the ML parameters will find the K-dimensional space of most variance.

Principal Components Analysis
This leads us to an (old) algorithm called Principal Components Analysis (PCA).
Assume data D = {x_i} have zero mean (if not, subtract it).
Find the direction of greatest variance λ_(1):
λ_(1) = argmax_{‖v‖=1} ∑_n (x_n^T v)^2
Find the direction orthogonal to λ_(1) with greatest variance: λ_(2).
...
Find the direction orthogonal to {λ_(1), λ_(2), ..., λ_(n−1)} with greatest variance: λ_(n).
Terminate when the remaining variance drops below a threshold.

Eigendecomposition of a covariance matrix
The eigendecomposition of a covariance matrix makes finding the PCs easy.
Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if
S u = ω u
u can have any norm, but we will define it to be unity (i.e., u^T u = 1).
For a covariance matrix S = ⟨xx^T⟩ (which is D × D, symmetric, positive semi-definite):
In general there are D eigenvector–eigenvalue pairs (u_(i), ω_(i)), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate: any linear combination is also an eigenvector).
The D eigenvectors are orthogonal (or orthogonalisable, if ω_(i) = ω_(j)). Thus, they form an orthonormal basis: ∑_i u_(i) u_(i)^T = I.
Any vector v can be written as
v = ( ∑_i u_(i) u_(i)^T ) v = ∑_i (u_(i)^T v) u_(i) = ∑_i v_(i) u_(i)
The original matrix S can be written
S = ∑_i ω_(i) u_(i) u_(i)^T = U W U^T
where U = [u_(1), u_(2), ..., u_(D)] collects the eigenvectors and W = diag(ω_(1), ω_(2), ..., ω_(D)).

PCA and eigenvectors
The variance in direction u_(i) is
⟨(x^T u_(i))^2⟩ = u_(i)^T ⟨xx^T⟩ u_(i) = u_(i)^T S u_(i) = ω_(i) u_(i)^T u_(i) = ω_(i)
The variance in an arbitrary direction v is
⟨(x^T v)^2⟩ = ⟨(x^T ∑_i v_(i) u_(i))^2⟩ = ∑_{ij} v_(i) u_(i)^T S u_(j) v_(j) = ∑_{ij} v_(i) ω_(j) v_(j) u_(i)^T u_(j) = ∑_i v_(i)^2 ω_(i)
If v^T v = 1, then ∑_i v_(i)^2 = 1, and so argmax_{‖v‖=1} ⟨(x^T v)^2⟩ = u_(max): the direction of greatest variance is the eigenvector with the largest eigenvalue.
In general, the PCs are exactly the eigenvectors of the empirical covariance matrix, ordered by decreasing eigenvalue.
[Figure: eigenvalue (variance) and fractional variance remaining, plotted against eigenvalue number.] The eigenspectrum shows how the variance is distributed across dimensions; it can identify transitions that might separate signal from noise, or the number of PCs that capture a predetermined fraction of variance.

PCA subspace
The K principal components define the K-dimensional subspace of greatest variance.
Each data point x_n is associated with a projection x̂_n into the principal subspace:
x̂_n = ∑_{k=1}^K x_n(k) λ_(k)
This can be used for lossy compression, denoising, recognition, ...
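To make the eigendecomposition view concrete, here is a short numpy sketch (not from the slides) that finds the top-K principal directions of a data matrix and the projections x̂_n; the function name and interface are just illustrative choices.

```python
import numpy as np

def pca(X, K):
    """PCA via eigendecomposition of the empirical covariance of X (N x D)."""
    mean = X.mean(axis=0)
    Xc = X - mean                             # subtract the mean
    S = Xc.T @ Xc / len(Xc)                   # empirical covariance
    evals, evecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:K]       # keep the K largest
    U = evecs[:, order]                       # principal directions (D x K)
    Z = Xc @ U                                # coordinates in the principal subspace
    Xhat = Z @ U.T + mean                     # reconstructions x_hat_n
    return U, evals[order], Z, Xhat
```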

Example of PCA: Eigenfaces
vismod.media.mit.edu/vismod/demos/facerec/basic.html

Example of PCA: Genetic variation within Europe
Novembre et al. (2008) Nature 456:98-101

Equivalent definitions of PCA
Find the K directions of greatest variance in the data.
Find the K-dimensional orthogonal projection that preserves the greatest variance.
Find K-dimensional vectors z_i and a matrix Λ so that x̂_i = Λz_i is as close as possible (in squared distance) to x_i.
... (many others)

Another view of PCA: Mutual information
Problem: given x, find z = Ax, with the columns of A unit vectors, such that I(z; x) is maximised (assuming that P(x) is Gaussian).
I(z; x) = H(z) + H(x) − H(z, x) = H(z)
So we want to maximise the entropy of z. What is the entropy of a Gaussian?
H(z) = −∫ dz p(z) ln p(z) = (1/2) ln|Σ| + (D/2)(1 + ln 2π)
Therefore we want the distribution of z to have the largest volume (i.e. determinant of the covariance matrix):
Σ_z = A Σ_x A^T = A U W_x U^T A^T
So A should be aligned with the columns of U which are associated with the largest eigenvalues (variances).
Projection onto the principal component subspace preserves the most information about the (Gaussian) data.

Linear autoencoders: from supervised learning to PCA
[Network: input units x_1, ..., x_D → encoder ("recognition") weights P → hidden units z_1, ..., z_K → decoder ("generation") weights Q → output units x̂_1, ..., x̂_D.]
Learning: argmin_{P,Q} ‖x̂ − x‖^2, with x̂ = Qz and z = Px.
At the optimum, P and Q perform the projection and reconstruction steps of PCA (Baldi & Hornik 1989).
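A hedged numpy sketch of this linear autoencoder, trained by plain gradient descent on the squared reconstruction error; the learning rate, iteration count and initialisation are arbitrary assumptions, not choices made in the slides.

```python
import numpy as np

def linear_autoencoder(X, K, lr=1e-2, iters=5000, seed=0):
    """Minimise ||x_hat - x||^2 with z = P x (encoder) and x_hat = Q z (decoder)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    P = rng.normal(scale=0.1, size=(K, D))    # encoder / recognition weights
    Q = rng.normal(scale=0.1, size=(D, K))    # decoder / generation weights
    for _ in range(iters):
        Z = Xc @ P.T                          # z_n = P x_n
        R = Z @ Q.T - Xc                      # reconstruction residuals
        gQ = R.T @ Z / N                      # gradient wrt Q
        gP = Q.T @ R.T @ Xc / N               # gradient wrt P
        Q -= lr * gQ
        P -= lr * gP
    return P, Q
```

Consistent with the Baldi & Hornik result quoted above, the span of the learned Q should match the principal subspace found by PCA (up to an invertible linear map within the subspace).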

ML learning for PPCA
ℓ = −(N/2) log|2πC| − (N/2) Tr[C^{-1} S],   where C = ΛΛ^T + ψI
∂ℓ/∂Λ = −(N/2) ( ∂log|C|/∂Λ + ∂Tr[C^{-1}S]/∂Λ ) = N ( −C^{-1}Λ + C^{-1}SC^{-1}Λ )
So at the stationary points we have SC^{-1}Λ = Λ. This implies either:
Λ = 0, which turns out to be a minimum;
or C = S ⇒ ΛΛ^T = S − ψI. Now rank(ΛΛ^T) ≤ K ⇒ rank(S − ψI) ≤ K ⇒ S has D − K eigenvalues equal to ψ, and Λ aligns with the space of the remaining eigenvectors;
or, taking the SVD Λ = ULV^T:
S(ULV^T V L U^T + ψI)^{-1} ULV^T = ULV^T      (right-multiply by VL^{-1})
S(UL^2 U^T + ψI)^{-1} U = U
U(L^2 + ψI) = (UL^2 U^T + ψI) U  ⇒  (UL^2 U^T + ψI)^{-1} U = U(L^2 + ψI)^{-1}
S U (L^2 + ψI)^{-1} = U
S U = U (L^2 + ψI)      [diagonal]
so the columns of U are eigenvectors of S, with eigenvalues given by l_i^2 + ψ.
Thus Λ = ULV^T spans a space defined by K eigenvectors of S, the squared lengths l_i^2 of the columns of L are given by the corresponding eigenvalues minus ψ, and V selects an arbitrary basis in the latent space.
It remains to show (we won't, but it's intuitively reasonable) that the global ML solution is attained when Λ aligns with the K leading eigenvectors.
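The stationary-point analysis suggests a closed-form fit. The sketch below follows the standard Tipping & Bishop solution, under the additional (standard, but not derived on this slide) assumption that the ML ψ is the average of the discarded eigenvalues.

```python
import numpy as np

def fit_ppca_ml(X, K):
    """Closed-form ML PPCA: Lambda = U_K (W_K - psi I)^{1/2}, psi = mean of discarded eigenvalues."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(Xc)                    # empirical covariance
    w, U = np.linalg.eigh(S)
    w, U = w[::-1], U[:, ::-1]                 # eigenvalues in decreasing order
    psi = w[K:].mean()                         # ML noise variance (assumed standard result)
    Lam = U[:, :K] * np.sqrt(np.maximum(w[:K] - psi, 0.0))
    return Lam, psi
```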

PPCA latents
In PCA the noise is orthogonal to the subspace, and we can project x_n → x̂_n trivially. In PPCA the noise is more sensible (equal in all directions). But what is the projection?
Find the expected value z̄_n = E[z_n | x_n] and then take x̂_n = Λ z̄_n.
Tactic: write p(z_n, x_n | θ), consider x_n to be fixed. What is this as a function of z_n?
p(z_n, x_n) = p(z_n) p(x_n | z_n)
 = (2π)^{−K/2} exp{ −(1/2) z_n^T z_n } |2πΨ|^{−1/2} exp{ −(1/2) (x_n − Λz_n)^T Ψ^{-1} (x_n − Λz_n) }
 = c exp{ −(1/2) [ z_n^T z_n + (x_n − Λz_n)^T Ψ^{-1} (x_n − Λz_n) ] }
 = c exp{ −(1/2) [ z_n^T (I + Λ^T Ψ^{-1} Λ) z_n − 2 z_n^T Λ^T Ψ^{-1} x_n ] }
 = c exp{ −(1/2) [ z_n^T Σ^{-1} z_n − 2 z_n^T Σ^{-1} µ + µ^T Σ^{-1} µ ] }
So Σ = (I + Λ^T Ψ^{-1} Λ)^{-1} = I − βΛ and µ = ΣΛ^T Ψ^{-1} x_n = βx_n, where β = ΣΛ^T Ψ^{-1}.
Thus
x̂_n = Λ(I + Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} x_n = x_n − Ψ(ΛΛ^T + Ψ)^{-1} x_n
This is not the same projection: PPCA takes into account noise in the principal subspace. As ψ → 0, the PPCA estimate → the PCA value.
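A brief numpy sketch (added, not from the slides) of the posterior computation just derived: the posterior mean and covariance of z_n given x_n, and the resulting PPCA projection x̂_n = Λ E[z_n | x_n].

```python
import numpy as np

def ppca_posterior(x, Lam, psi):
    """Posterior N(mu, Sigma) over z given x, and the PPCA projection x_hat = Lam mu."""
    D, K = Lam.shape
    Psi_inv = np.eye(D) / psi                              # isotropic noise precision
    Sigma = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)
    mu = Sigma @ Lam.T @ Psi_inv @ x                       # E[z | x]
    x_hat = Lam @ mu                                       # equals x - Psi (Lam Lam^T + Psi)^{-1} x
    return mu, Sigma, x_hat
```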

PPCA latents
[Figure: the principal subspace, the PCA projection, the PPCA noise, the PPCA latent prior, and the resulting PPCA posterior and PPCA projection.]

Factor Analysis
If dimensions are not equivalent, the equal-variance assumption is inappropriate.
Data: D = X = {x_1, x_2, ..., x_N}; x_i ∈ R^D.  Latents: Z = {z_1, z_2, ..., z_N}; z_i ∈ R^K, with K < D.
Linear generative model: x_d = ∑_{k=1}^K Λ_dk z_k + ε_d
z_k are independent N(0, 1) Gaussian factors; ε_d are independent N(0, Ψ_dd) Gaussian noise.
Model for observations x is still a correlated Gaussian:
p(z) = N(0, I),  p(x | z) = N(Λz, Ψ),  p(x) = ∫ p(z) p(x | z) dz = N(0, ΛΛ^T + Ψ)
where Λ is D × K, and Ψ is D × D and diagonal.
Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.

Factor Analysis (cont.)
ML learning finds Λ ("common factors") and Ψ ("unique factors" or "uniquenesses") given data.
Number of parameters (corrected for symmetries): DK + D − K(K−1)/2.
If the number of parameters > D(D+1)/2, the model is not identifiable (even after accounting for the rotational degeneracy discussed later).
There is no closed-form solution for the ML params of N(0, ΛΛ^T + Ψ).

Factor Analysis projections
Our analysis for PPCA still applies:
x̂_n = Λ(I + Λ^T Ψ^{-1} Λ)^{-1} Λ^T Ψ^{-1} x_n = x_n − Ψ(ΛΛ^T + Ψ)^{-1} x_n
but now Ψ is diagonal but not spherical. Note, though, that Λ is generally different from that found by PPCA.
And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal transform U (U^T U = UU^T = I) without changing the likelihood:
z̃ = Uz and Λ̃ = ΛU^T  ⇒  Λ̃ z̃ = ΛU^T U z = Λz
ℓ = −(1/2) log|2π(ΛΛ^T + Ψ)| − (1/2) x^T (ΛΛ^T + Ψ)^{-1} x
 = −(1/2) log|2π(ΛU^T U Λ^T + Ψ)| − (1/2) x^T (ΛU^T U Λ^T + Ψ)^{-1} x
 = −(1/2) log|2π(Λ̃Λ̃^T + Ψ)| − (1/2) x^T (Λ̃Λ̃^T + Ψ)^{-1} x
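This rotational degeneracy is easy to check numerically; a tiny sketch (with arbitrary made-up parameters) confirming that Λ and ΛU^T give the same marginal covariance, and hence the same likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 4, 2                                     # illustrative sizes
Lam = rng.normal(size=(D, K))
Psi = np.diag(rng.uniform(0.1, 1.0, size=D))    # diagonal noise covariance
U, _ = np.linalg.qr(rng.normal(size=(K, K)))    # a random orthogonal K x K matrix
Lam_tilde = Lam @ U.T

# identical marginal covariances => identical Gaussian likelihoods for any data
print(np.allclose(Lam @ Lam.T + Psi, Lam_tilde @ Lam_tilde.T + Psi))
```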

Gradient methods for learning FA
Optimise the negative log-likelihood
−ℓ = (1/2) log|2π(ΛΛ^T + Ψ)| + (1/2) x^T (ΛΛ^T + Ψ)^{-1} x
w.r.t. Λ and Ψ (needs matrix calculus), subject to constraints.
No spectral short-cut exists.
The likelihood can have more than one (local) optimum, making it difficult to find the global value.
For some data ("Heywood cases") the likelihood may grow unboundedly by taking one or more Ψ_dd → 0. This can be eliminated by assuming a prior on Ψ with zero density at Ψ_dd = 0, but the results are sensitive to the precise choice of prior.
Expectation maximisation (next week) provides an alternative approach to maximisation, but doesn't solve these issues.

FA vs PCA
PCA and PPCA are rotationally invariant; FA is not: if x → Ux for unitary U, then λ_(i)^PCA → Uλ_(i)^PCA.
FA is measurement-scale invariant; PCA and PPCA are not: if x → Sx for diagonal S, then λ_(i)^FA → Sλ_(i)^FA.
FA and PPCA define a probabilistic model; PCA does not.
[Note: it may be tempting to try to eliminate the scale-dependence of (P)PCA by pre-processing data to equalise the total variance on each axis. But (P)PCA assumes equal noise variance. Total variance has contributions from both ΛΛ^T and the noise, so this approach does not exactly solve the problem.]

Canonical Correlations Analysis
Data vector pairs: D = {(u_1, v_1), (u_2, v_2), ...} in spaces U and V.
Classic CCA: find unit vectors υ_1 ∈ U, φ_1 ∈ V such that the correlation of u_i^T υ_1 and v_i^T φ_1 is maximised. As with PCA, repeat in orthogonal subspaces.
Probabilistic CCA: a generative model with latent z_i ∈ R^K:
z ~ N(0, I)
u ~ N(Υz, Ψ_u),  Ψ_u ⪰ 0
v ~ N(Φz, Ψ_v),  Ψ_v ⪰ 0
Block-diagonal noise.
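For reference, a sketch of one standard way to compute classic CCA (not taken from the slides): whiten each view and take the SVD of the whitened cross-covariance. The normalisation conventions and interface here are assumptions.

```python
import numpy as np

def cca(U, V, K):
    """Classic CCA via SVD of the whitened cross-covariance; returns K direction
    pairs and their canonical correlations."""
    Uc, Vc = U - U.mean(axis=0), V - V.mean(axis=0)
    N = len(Uc)
    Suu, Svv, Suv = Uc.T @ Uc / N, Vc.T @ Vc / N, Uc.T @ Vc / N

    def inv_sqrt(S):                        # symmetric inverse square root
        w, E = np.linalg.eigh(S)
        return E @ np.diag(1.0 / np.sqrt(w)) @ E.T

    Wu, Wv = inv_sqrt(Suu), inv_sqrt(Svv)
    A, rho, Bt = np.linalg.svd(Wu @ Suv @ Wv)
    return Wu @ A[:, :K], Wv @ Bt[:K].T, rho[:K]
```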

Limitations of Gaussian, FA and PCA models
Gaussian, FA and PCA models are easy to understand and to use in practice. They are a convenient way to find interesting directions in very high-dimensional data sets, e.g. as preprocessing.
However, they make strong assumptions about the distribution of the data: only the mean and variance of the data are taken into account. The class of densities which can be modelled is too restrictive.
By using mixtures of simple distributions, such as Gaussians, we can expand the class of densities greatly.

Mixture Distributions
A mixture distribution has a single discrete latent variable:
s_i ~ iid Discrete[π]
x_i | s_i ~ P_{s_i}[θ_{s_i}]
Mixtures arise naturally when observations from different sources have been collated. They can also be used to approximate arbitrary distributions.

The Mixture Likelihood
The mixture model is
s_i ~ iid Discrete[π],   x_i | s_i ~ P_{s_i}[θ_{s_i}]
Under the discrete distribution
P(s_i = m) = π_m;   π_m ≥ 0,  ∑_{m=1}^k π_m = 1
Thus, the probability (density) at a single data point x_i is
P(x_i) = ∑_{m=1}^k P(x_i | s_i = m) P(s_i = m) = ∑_{m=1}^k π_m P_m(x_i; θ_m)
The mixture distribution (density) is a convex combination (or weighted average) of the component distributions (densities).

Approximation with a Mixture of Gaussians (MoG)
The component densities may be viewed as elements of a basis which can be combined to approximate arbitrary distributions.
Here are examples where non-Gaussian densities are modelled (approximated) as a mixture of Gaussians. The red curves show the (weighted) Gaussians, and the blue curve the resulting density. [Panels: Uniform, Triangle, Heavy tails.]
Given enough mixture components we can model (almost) any density (as accurately as desired), but still only need to work with the well-known Gaussian form.

Clustering with a MoG
In clustering applications, the latent variable s_i represents the (unknown) identity of the cluster to which the i-th observation belongs. Thus, the latent distribution gives the prior probability of a data point coming from each cluster:
P(s_i = m | π) = π_m
Data from the m-th cluster are distributed according to the m-th component:
P(x_i | s_i = m) = P_m(x_i)
Once we observe a data point, the posterior probability distribution for the cluster it belongs to is
P(s_i = m | x_i) = P_m(x_i) π_m / ∑_{m'} P_{m'}(x_i) π_{m'}
This is often called the responsibility of the m-th cluster for the i-th data point.
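A short Python sketch (using scipy's multivariate normal; not part of the slides) computing these responsibilities for a mixture of Gaussians:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """r_im = P(s_i = m | x_i) for a mixture of Gaussians; X is N x D."""
    R = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                         for pi, mu, S in zip(pis, mus, Sigmas)])
    return R / R.sum(axis=1, keepdims=True)
```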

The MoG likelihood
Each component of a MoG is a Gaussian, with mean µ_m and covariance matrix Σ_m. Thus, the probability density evaluated at a set of n iid observations D = {x_1, ..., x_n} (i.e. the likelihood) is
p(D | {µ_m}, {Σ_m}, π) = ∏_{i=1}^n ∑_{m=1}^k π_m N(x_i; µ_m, Σ_m)
  = ∏_{i=1}^n ∑_{m=1}^k π_m |2πΣ_m|^{−1/2} exp{ −(1/2)(x_i − µ_m)^T Σ_m^{-1} (x_i − µ_m) }
The log of the likelihood is
log p(D | {µ_m}, {Σ_m}, π) = ∑_{i=1}^n log ∑_{m=1}^k π_m |2πΣ_m|^{−1/2} exp{ −(1/2)(x_i − µ_m)^T Σ_m^{-1} (x_i − µ_m) }
Note that the logarithm fails to simplify the component density terms. A mixture distribution does not lie in the exponential family. Direct optimisation is not easy.
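A small sketch (added here) evaluating this log-likelihood stably with a log-sum-exp over components:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mog_loglik(X, pis, mus, Sigmas):
    """sum_i log sum_m pi_m N(x_i; mu_m, Sigma_m), computed via log-sum-exp."""
    log_terms = np.column_stack([np.log(pi) + multivariate_normal.logpdf(X, mean=mu, cov=S)
                                 for pi, mu, S in zip(pis, mus, Sigmas)])
    return logsumexp(log_terms, axis=1).sum()
```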

Maximum Likelihood for a Mixture Model
The log likelihood is:
L = ∑_{i=1}^n log ∑_{m=1}^k π_m P_m(x_i; θ_m)
Its partial derivative wrt θ_m is
∂L/∂θ_m = ∑_{i=1}^n [ π_m / ∑_{m'=1}^k π_{m'} P_{m'}(x_i; θ_{m'}) ] ∂P_m(x_i; θ_m)/∂θ_m
or, using ∂P/∂θ = P ∂log P/∂θ,
 = ∑_{i=1}^n [ π_m P_m(x_i; θ_m) / ∑_{m'=1}^k π_{m'} P_{m'}(x_i; θ_{m'}) ] ∂log P_m(x_i; θ_m)/∂θ_m
 = ∑_{i=1}^n r_im ∂log P_m(x_i; θ_m)/∂θ_m
where the bracketed ratio is the responsibility r_im.
And its partial derivative wrt π_m is
∂L/∂π_m = ∑_{i=1}^n P_m(x_i; θ_m) / ∑_{m'=1}^k π_{m'} P_{m'}(x_i; θ_{m'}) = ∑_{i=1}^n r_im / π_m

MoG Derivatives
For a MoG, with θ_m = {µ_m, Σ_m}, we get
∂L/∂µ_m = ∑_{i=1}^n r_im Σ_m^{-1} (x_i − µ_m)
∂L/∂Σ_m^{-1} = (1/2) ∑_{i=1}^n r_im ( Σ_m − (x_i − µ_m)(x_i − µ_m)^T )
These equations can be used (along with Lagrangian derivatives wrt π_m that enforce normalisation) for gradient-based learning; e.g., taking small steps in the direction of the gradient (or using conjugate gradients).

The K-means Algorithm
The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis).
Take π_m = 1/k and Σ_m = σ^2 I, with σ^2 → 0. Then the responsibilities become binary
r_im → δ(m, argmin_l ‖x_i − µ_l‖^2)
with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0.
The k-means algorithm iterates these two steps (see the sketch below):
assign each point to its closest mean (set r_im = δ(m, argmin_l ‖x_i − µ_l‖^2));
update the means to the average of their assigned points (set µ_m = ∑_i r_im x_i / ∑_i r_im).
This usually converges within a few iterations, although the fixed point depends on the initial values chosen for µ_m. The algorithm has no learning rate, but the assumptions are quite limiting.
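As promised above, a minimal numpy sketch of these two alternating steps; initialising at randomly chosen data points is an assumption, not something specified on the slide.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate hard assignments and mean updates until the means stop moving."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)]            # initial means
    for _ in range(iters):
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)     # squared distances
        assign = d2.argmin(axis=1)                                # closest mean for each point
        new_mus = np.array([X[assign == m].mean(axis=0) if np.any(assign == m) else mus[m]
                            for m in range(k)])
        if np.allclose(new_mus, mus):
            break
        mus = new_mus
    return mus, assign
```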

A preview of the EM algorithm
We wrote the k-means algorithm in terms of binary responsibilities. Suppose, instead, we used the fractional responsibilities from the full (non-limiting) MoG, but still neglected the dependence of the responsibilities on the parameters. We could then solve for both µ_m and Σ_m.
The EM algorithm for MoGs iterates these two steps:
Evaluate the responsibilities for each point given the current parameters.
Optimise the parameters assuming the responsibilities stay fixed:
µ_m = ∑_i r_im x_i / ∑_i r_im   and   Σ_m = ∑_i r_im (x_i − µ_m)(x_i − µ_m)^T / ∑_i r_im
Although this appears ad hoc, we will see (later) that it is a special case of a general algorithm, and is actually guaranteed to increase the likelihood at each iteration.
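A compact Python sketch of this EM loop (added for illustration; it also re-estimates the mixing proportions π, which the slide leaves to the full treatment next week, and adds a small ridge to the covariances for numerical safety):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, k, iters=50, seed=0):
    """EM for a mixture of Gaussians: E step (responsibilities), M step (parameters)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(N, size=k, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(k)])
    for _ in range(iters):
        # E step: r_im proportional to pi_m N(x_i; mu_m, Sigma_m)
        R = np.column_stack([pi * multivariate_normal.pdf(X, mean=mu, cov=S)
                             for pi, mu, S in zip(pis, mus, Sigmas)])
        R /= R.sum(axis=1, keepdims=True)
        # M step: update parameters with responsibilities held fixed
        Nm = R.sum(axis=0)
        pis = Nm / N
        mus = (R.T @ X) / Nm[:, None]
        for m in range(k):
            Xc = X - mus[m]
            Sigmas[m] = (R[:, m, None] * Xc).T @ Xc / Nm[m] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas
```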

Issues
There are several problems with these algorithms:
slow convergence for the gradient-based method;
the gradient-based method may develop invalid covariance matrices;
local minima: the end configuration may depend on the starting state;
how do you adjust k? Using the likelihood alone is no good;
singularities: components with a single data point will have their covariance going to zero, and the likelihood will tend to infinity.
We will attempt to address many of these as the course goes on.


More information

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision

More information

Covariance and Correlation Matrix

Covariance and Correlation Matrix Covariance and Correlation Matrix Given sample {x n } N 1, where x Rd, x n = x 1n x 2n. x dn sample mean x = 1 N N n=1 x n, and entries of sample mean are x i = 1 N N n=1 x in sample covariance matrix

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 12 Jan-Willem van de Meent (credit: Yijun Zhao, Percy Liang) DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Linear Dimensionality

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

CPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2018

CPSC 340: Machine Learning and Data Mining. Sparse Matrix Factorization Fall 2018 CPSC 340: Machine Learning and Data Mining Sparse Matrix Factorization Fall 2018 Last Time: PCA with Orthogonal/Sequential Basis When k = 1, PCA has a scaling problem. When k > 1, have scaling, rotation,

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

CS 4495 Computer Vision Principle Component Analysis

CS 4495 Computer Vision Principle Component Analysis CS 4495 Computer Vision Principle Component Analysis (and it s use in Computer Vision) Aaron Bobick School of Interactive Computing Administrivia PS6 is out. Due *** Sunday, Nov 24th at 11:55pm *** PS7

More information

Dimensionality Reduction

Dimensionality Reduction Dimensionality Reduction Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, UCL Apr/May 2016 High dimensional data Example data: Gene Expression Example data: Web Pages Google

More information

Advanced Introduction to Machine Learning

Advanced Introduction to Machine Learning 10-715 Advanced Introduction to Machine Learning Homework 3 Due Nov 12, 10.30 am Rules 1. Homework is due on the due date at 10.30 am. Please hand over your homework at the beginning of class. Please see

More information

Joint Factor Analysis for Speaker Verification

Joint Factor Analysis for Speaker Verification Joint Factor Analysis for Speaker Verification Mengke HU ASPITRG Group, ECE Department Drexel University mengke.hu@gmail.com October 12, 2012 1/37 Outline 1 Speaker Verification Baseline System Session

More information

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions

Lecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K

More information

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang

Deep Learning Basics Lecture 7: Factor Analysis. Princeton University COS 495 Instructor: Yingyu Liang Deep Learning Basics Lecture 7: Factor Analysis Princeton University COS 495 Instructor: Yingyu Liang Supervised v.s. Unsupervised Math formulation for supervised learning Given training data x i, y i

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Expectation Maximization (EM) and Mixture Models Hamid R. Rabiee Jafar Muhammadi, Mohammad J. Hosseini Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2 Agenda Expectation-maximization

More information

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR

More information

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26

Clustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26 Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

Data Mining Lecture 4: Covariance, EVD, PCA & SVD

Data Mining Lecture 4: Covariance, EVD, PCA & SVD Data Mining Lecture 4: Covariance, EVD, PCA & SVD Jo Houghton ECS Southampton February 25, 2019 1 / 28 Variance and Covariance - Expectation A random variable takes on different values due to chance The

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Advanced Machine Learning & Perception

Advanced Machine Learning & Perception Advanced Machine Learning & Perception Instructor: Tony Jebara Topic 1 Introduction, researchy course, latest papers Going beyond simple machine learning Perception, strange spaces, images, time, behavior

More information

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate

Mixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means

More information

Dimensionality Reduction

Dimensionality Reduction Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball

More information

Gaussian Mixture Models

Gaussian Mixture Models Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some

More information

THESIS COVARIANCE REGULARIZATION IN MIXTURE OF GAUSSIANS FOR HIGH-DIMENSIONAL IMAGE CLASSIFICATION. Submitted by. Daniel L Elliott

THESIS COVARIANCE REGULARIZATION IN MIXTURE OF GAUSSIANS FOR HIGH-DIMENSIONAL IMAGE CLASSIFICATION. Submitted by. Daniel L Elliott THESIS COVARIANCE REGULARIZATION IN MIXTURE OF GAUSSIANS FOR HIGH-DIMENSIONAL IMAGE CLASSIFICATION Submitted by Daniel L Elliott Department of Computer Science In partial fulfillment of the requirements

More information

Clustering and Gaussian Mixture Models

Clustering and Gaussian Mixture Models Clustering and Gaussian Mixture Models Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 25, 2016 Probabilistic Machine Learning (CS772A) Clustering and Gaussian Mixture Models 1 Recap

More information

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA)

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) CS68: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) Tim Roughgarden & Gregory Valiant April 0, 05 Introduction. Lecture Goal Principal components analysis

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

Linear Dimensionality Reduction

Linear Dimensionality Reduction Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures

More information

Expectation Maximization Algorithm

Expectation Maximization Algorithm Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters

More information

HST.582J/6.555J/16.456J

HST.582J/6.555J/16.456J Blind Source Separation: PCA & ICA HST.582J/6.555J/16.456J Gari D. Clifford gari [at] mit. edu http://www.mit.edu/~gari G. D. Clifford 2005-2009 What is BSS? Assume an observation (signal) is a linear

More information

Lecture 7 Spectral methods

Lecture 7 Spectral methods CSE 291: Unsupervised learning Spring 2008 Lecture 7 Spectral methods 7.1 Linear algebra review 7.1.1 Eigenvalues and eigenvectors Definition 1. A d d matrix M has eigenvalue λ if there is a d-dimensional

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information