Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods

1 Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods. Anima Anandkumar, U.C. Irvine.

2-3 Application 1: Clustering. Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint: the groups represent different distributions (e.g., Gaussians), and each data point is drawn from one of these distributions (e.g., Gaussian mixtures).

4 Application 2: Topic Modeling. Document modeling. Observed: words in a document corpus. Hidden: topics. Goal: carry out document summarization.

5 Application 3: Understanding Human Communities. Social networks. Observed: network of social ties, e.g., friendships, co-authorships. Hidden: groups/communities of actors.

6 Application 4: Recommender Systems. Observed: ratings of users for various products, e.g., Yelp reviews. Goal: predict new recommendations. Modeling: find groups/communities of users and products.

7 Application 5: Feature Learning. Feature engineering: learn good features/representations for classification tasks, e.g., image and speech recognition. Sparse representations, low dimensional hidden structures.

8 Application 6: Computational Biology. Observed: gene expression levels. Goal: discover gene groups. Hidden variables: regulators controlling gene groups.

9 How to model hidden effects? Basic approach: mixtures/clusters, where the hidden variable h is categorical. Advanced: probabilistic models, where the hidden variable h has more general distributions and can model mixed memberships. (Figure: graphical model with hidden variables $h_1, h_2, h_3$ and observed variables $x_1, \ldots, x_5$.) This talk: basic mixture model. Tomorrow: advanced models.

10-13 Challenges in Learning. Basic goal in all the mentioned applications: discover hidden structure in data, i.e., unsupervised learning.
Challenge: conditions for identifiability. When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?
Challenge: efficient learning of latent variable models. Maximum likelihood is NP-hard. In practice, EM and variational Bayes have no consistency guarantees. Efficient computational and sample complexities?
In this series: guaranteed and efficient learning through spectral methods.

14-21 What this talk series will cover. This talk: start with the most basic latent variable model, Gaussian mixtures. Basic spectral approach: principal component analysis (PCA). Beyond correlations: higher order moments. Tensor notations, PCA extension to tensors. Guaranteed learning of Gaussian mixtures. Introduce topic models and present a unified viewpoint.
Tomorrow: using tensor methods for learning latent variable models and analysis of the tensor decomposition method. Wednesday: implementation of tensor methods.

22 Resources for Today's Talk. A Method of Moments for Mixture Models and Hidden Markov Models, by A., D. Hsu, and S.M. Kakade, Proc. of COLT, June 2012. Tensor Decompositions for Learning Latent Variable Models, by A., R. Ge, D. Hsu, S.M. Kakade, and M. Telgarsky, Oct. 2012. Resources available at

23 Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

24-26 Warm-up: PCA
Optimization problem: for (centered) points $x_i \in \mathbb{R}^d$, find a projection $P \in \mathbb{R}^{d \times d}$ with $\mathrm{Rank}(P) = k$ such that
$\min_P \; \frac{1}{n} \sum_{i \in [n]} \| x_i - P x_i \|^2$.
Result: if $S = \mathrm{Cov}(X)$ and $S = U \Lambda U^\top$ is its eigendecomposition, then $P = U_{(k)} U_{(k)}^\top$, where $U_{(k)}$ holds the top-$k$ eigenvectors.
Proof: by the Pythagorean theorem, $\sum_i \| x_i - P x_i \|^2 = \sum_i \| x_i \|^2 - \sum_i \| P x_i \|^2$, so it suffices to maximize $\frac{1}{n} \sum_i \| P x_i \|^2 = \frac{1}{n} \sum_i \mathrm{Tr}[ P x_i x_i^\top P^\top ] = \mathrm{Tr}[ P S P^\top ]$.
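A minimal MATLAB sketch of this result (illustrative, with made-up data; not part of the original slides): estimate the covariance, take its top-k eigenvectors, and form the rank-k projection.

PCA Snippet (illustrative sketch):
% PCA as eigen-decomposition of the sample covariance.
d = 10; k = 3; n = 5000;
X = randn(n, d) * randn(d, d);              % correlated data, one sample per row
X = bsxfun(@minus, X, mean(X, 1));          % center the points
S = (X' * X) / n;                           % sample covariance
[U, L] = eig((S + S') / 2);
[~, order] = sort(diag(L), 'descend');
Uk = U(:, order(1:k));                      % top-k eigenvectors
P = Uk * Uk';                               % rank-k projection
avg_error = mean(sum((X - X * P).^2, 2))    % average squared reconstruction error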

27-29 Review of linear algebra
For a matrix $S$, $u$ is an eigenvector if $Su = \lambda u$, and $\lambda$ is the eigenvalue. For a symmetric $S \in \mathbb{R}^{d \times d}$, there are $d$ eigenvalues and $S = \sum_{i \in [d]} \lambda_i u_i u_i^\top$, where $U$ is orthogonal.
Rayleigh quotient: for a matrix $S$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and corresponding eigenvectors $u_1, \ldots, u_d$, we have $\max_{\|z\|=1} z^\top S z = \lambda_1$ and $\min_{\|z\|=1} z^\top S z = \lambda_d$, and the optimizing vectors are $u_1$ and $u_d$.
Optimal projection: over projections $P$ with $\mathrm{Rank}(P) = k$, $\max_P \mathrm{Tr}(P S P^\top) = \lambda_1 + \lambda_2 + \cdots + \lambda_k$, and the optimal $P$ spans $\{u_1, \ldots, u_k\}$.

30-35 PCA on Gaussian Mixtures
Mixture of $k$ Gaussians: each sample is $x = Ah + z$, where $h \in \{e_1, \ldots, e_k\}$ (the basis vectors) with $E[h] = w$, the columns of $A \in \mathbb{R}^{d \times k}$ are the component means, $\mu := Aw$ is the overall mean, and $z \sim N(0, \sigma^2 I)$ is white Gaussian noise.
$E[(x-\mu)(x-\mu)^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$.
How the above equation is obtained: $E[(x-\mu)(x-\mu)^\top] = E[(Ah-\mu)(Ah-\mu)^\top] + E[zz^\top] = \sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$.
The vectors $\{a_i - \mu\}$ are linearly dependent: $\sum_i w_i (a_i - \mu) = 0$. Hence the PSD matrix $\sum_{i \in [k]} w_i (a_i - \mu)(a_i - \mu)^\top$ has rank $k-1$.
$(k-1)$-PCA on the covariance matrix, together with $\mu$, yields $\mathrm{span}(A)$. The lowest eigenvalue of the covariance matrix yields $\sigma^2$.
How to learn $A$?
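A hedged MATLAB sketch of these two facts (illustrative, with arbitrary parameters): build the population covariance from a random A, w, and noise level, then read off the noise variance from the smallest eigenvalue and the span of the centered means from the top k-1 eigenvectors.

Mixture Covariance Snippet (illustrative sketch):
% Population covariance of a spherical Gaussian mixture.
d = 8; k = 3; sigma2 = 0.5;
A = randn(d, k);                            % component means (columns)
w = rand(k, 1); w = w / sum(w);             % mixing weights
mu = A * w;                                 % overall mean
C = A * diag(w) * A' - mu * mu' + sigma2 * eye(d);   % = sum_i w_i (a_i-mu)(a_i-mu)' + sigma2*I
[U, L] = eig((C + C') / 2);
[lam, order] = sort(diag(L), 'descend');
sigma2_hat = lam(d)                         % smallest eigenvalue ~ sigma2
Ukm1 = U(:, order(1:k-1));                  % top (k-1) eigenvectors
B = bsxfun(@minus, A, mu);                  % centered means a_i - mu
norm(B - Ukm1 * (Ukm1' * B))                % ~ 0: span{a_i - mu} is recovered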

36-37 Learning through Spectral Clustering
Learning $A$ through spectral clustering: project the samples $x$ onto $\mathrm{span}(A)$, then apply distance-based clustering (e.g., k-means). A series of works, e.g., Vempala & Wang.
Shortcoming: failure to cluster under large variance. Can we learn Gaussian mixtures without separation constraints?
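A brief MATLAB sketch of this recipe (illustrative; the dimensions and separation are made up, and kmeans requires the Statistics and Machine Learning Toolbox):

Spectral Clustering Snippet (illustrative sketch):
% Project onto the top principal directions, then cluster with k-means.
d = 50; k = 3; n = 3000; sigma = 0.5;
A = 5 * randn(d, k);                         % well-separated component means
labels = randi(k, n, 1);                     % true component of each sample
X = A(:, labels)' + sigma * randn(n, d);     % one sample per row
Xc = bsxfun(@minus, X, mean(X, 1));
[U, L] = eig((Xc' * Xc) / n);
[~, order] = sort(diag(L), 'descend');
Uk = U(:, order(1:k-1));                     % estimated span of the centered means
Y = Xc * Uk;                                 % projected samples
est = kmeans(Y, k);                          % distance-based clustering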

38-43 Beyond PCA: Spectral Methods on Tensors
How to learn the component means $A$ (not just its span) without separation constraints? PCA is a spectral method on (covariance) matrices; are higher order moments helpful?
What if the number of components is greater than the observed dimensionality ($k > d$)? Do higher order moments help to learn such overcomplete models?
What if the data is not Gaussian? Moment-based estimation of probabilistic latent variable models?

44 Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

45 Tensor Notation for Higher Order Moments
Multivariate higher order moments form tensors. Are there spectral operations on tensors akin to PCA?
The matrix $E[x \otimes x] \in \mathbb{R}^{d \times d}$ is a second order tensor, with $E[x \otimes x]_{i_1, i_2} = E[x_{i_1} x_{i_2}]$; for matrices, $E[x \otimes x] = E[x x^\top]$.
The tensor $E[x \otimes x \otimes x] \in \mathbb{R}^{d \times d \times d}$ is a third order tensor, with $E[x \otimes x \otimes x]_{i_1, i_2, i_3} = E[x_{i_1} x_{i_2} x_{i_3}]$.
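To make the notation concrete, a small MATLAB sketch (illustrative, with arbitrary data) that estimates the second and third order moment tensors from samples stored as the rows of X:

Moment Tensor Snippet (illustrative sketch):
% Empirical estimates of E[x ⊗ x] and E[x ⊗ x ⊗ x].
n = 2000; d = 4;
X = randn(n, d);                       % rows are samples x^t
M2 = (X' * X) / n;                     % d x d, M2(i1,i2) = mean of x_{i1} x_{i2}
M3 = zeros(d, d, d);                   % d x d x d third order moment
for t = 1:n
    xt = X(t, :)';
    xx = xt * xt';                     % outer product of the sample with itself
    M3 = M3 + reshape(xx(:) * xt', [d, d, d]);   % adds x^t ⊗ x^t ⊗ x^t
end
M3 = M3 / n;                           % M3(i1,i2,i3) = mean of x_{i1} x_{i2} x_{i3}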

46 Third order moment for Gaussian mixtures
Consider a mixture of $k$ Gaussians: each sample is $x = Ah + z$, with $h \in \{e_1, \ldots, e_k\}$ (the basis vectors), $E[h] = w$, the columns of $A \in \mathbb{R}^{d \times k}$ the component means, $\mu := Aw$ the overall mean, and $z \sim N(0, \sigma^2 I)$ white Gaussian noise.
$E[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \cdots)$, where the ellipsis denotes the two remaining symmetric permutations.
Intuition behind the equation: $E[x \otimes x \otimes x] = E[(Ah) \otimes (Ah) \otimes (Ah)] + E[(Ah) \otimes z \otimes z] + \cdots = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i \mu \otimes e_i \otimes e_i + \cdots$
How to recover the parameters $A$ and $w$ from the third order moment?

47-48 Simplifications
$E[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \cdots)$
$\sigma^2$ is obtained from $\sigma_{\min}(\mathrm{Cov}(x))$, and $E[x \otimes x] = \sum_i w_i\, a_i \otimes a_i + \sigma^2 I$.
Can obtain $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$ and $M_2 = \sum_i w_i\, a_i \otimes a_i$.
How to obtain the parameters $A$ and $w$ from $M_2$ and $M_3$?
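Written out, the subtraction implied above is as follows (a brief reconstruction; the permutation terms spelled out here are the ones abbreviated by the ellipsis, with $\mu = E[x]$ and $\sigma^2 = \sigma_{\min}(\mathrm{Cov}(x))$):
\begin{align*}
M_2 &= E[x \otimes x] - \sigma^2 I = \sum_{i} w_i\, a_i \otimes a_i,\\
M_3 &= E[x \otimes x \otimes x] - \sigma^2 \sum_{i \in [d]} \big( \mu \otimes e_i \otimes e_i + e_i \otimes \mu \otimes e_i + e_i \otimes e_i \otimes \mu \big) = \sum_i w_i\, a_i \otimes a_i \otimes a_i.
\end{align*}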

49-50 Tensor Slices
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$.
Multilinear transformation of a tensor: $M_3(B, C, D) := \sum_i w_i\, (B^\top a_i) \otimes (C^\top a_i) \otimes (D^\top a_i)$.
Slice of a tensor: $M_3(I, I, r) = \sum_i w_i\, \langle a_i, r \rangle\, a_i a_i^\top = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, while $M_2 = A\, \mathrm{Diag}(w)\, A^\top$.
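A quick MATLAB sketch (illustrative, with made-up dimensions) that forms $M_3$ as a d x d x d array and checks the slice identity above:

Tensor Slice Snippet (illustrative sketch):
% Verify M3(I,I,r) = A Diag(w) Diag(A'r) A' on a synthetic example.
d = 6; k = 3;
A = randn(d, k);
w = rand(k, 1); w = w / sum(w);
r = randn(d, 1);
M3 = zeros(d, d, d);
for i = 1:k                                          % M3 = sum_i w_i a_i ⊗ a_i ⊗ a_i
    ai = A(:, i); aa = ai * ai';
    M3 = M3 + w(i) * reshape(aa(:) * ai', [d, d, d]);
end
slice = reshape(reshape(M3, d*d, d) * r, [d, d]);    % M3(I,I,r) = sum_j M3(:,:,j) r(j)
closed_form = A * diag(w) * diag(A' * r) * A';
max(abs(slice(:) - closed_form(:)))                  % ~ 0 up to round-off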

51-56 Eigen-decomposition
$M_3(I, I, r) = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, $M_2 = A\, \mathrm{Diag}(w)\, A^\top$.
Assumption: $A \in \mathbb{R}^{d \times k}$ has full column rank.
Let $M_2 = U \Lambda U^\top$ be the eigendecomposition with $U \in \mathbb{R}^{d \times k}$; then $U^\top M_2 U = \Lambda \in \mathbb{R}^{k \times k}$ is invertible.
$X = \big( U^\top M_3(I, I, r) U \big) \big( U^\top M_2 U \big)^{-1} = V\, \mathrm{Diag}(\tilde{\lambda})\, V^{-1}$.
Substitution: $X = (U^\top A)\, \mathrm{Diag}(A^\top r)\, (U^\top A)^{-1}$, so $v_i \propto U^\top a_i$.
Technical detail: take $r = U\theta$ with $\theta$ drawn uniformly from the sphere to ensure an eigen-gap.

57-58 Learning Gaussian Mixtures through Eigen-decomposition of Tensor Slices
$M_3(I, I, r) = A\, \mathrm{Diag}(w)\, \mathrm{Diag}(A^\top r)\, A^\top$, $M_2 = A\, \mathrm{Diag}(w)\, A^\top$, and $\big( U^\top M_3(I, I, r) U \big) \big( U^\top M_2 U \big)^{-1} = V\, \mathrm{Diag}(\tilde{\lambda})\, V^{-1}$.
Recovery of $A$ (method 1): since $v_i \propto U^\top a_i$, recover $A$ up to scale: $a_i \propto U v_i$.
Recovery of $A$ (method 2): let $\Theta \in \mathbb{R}^{k \times k}$ be a random rotation matrix. Consider the $k$ slices $M_3(I, I, U\theta_i)$, with $\theta_i$ the columns of $\Theta$, and find the eigen-decomposition above for each. Let $\tilde{\Lambda} = [\tilde{\lambda}_1 \ldots \tilde{\lambda}_k]$ be the matrix of eigenvalues of all the slices. Recover $A$ as $A = U \Theta^{-1} \tilde{\Lambda}$.
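A MATLAB sketch of recovery method 1 on population moments (illustrative; each $a_i$ is only identified up to scale here, so the check compares normalized columns):

Slice Eigen-decomposition Snippet (illustrative sketch):
% Recover the component directions from a random slice of M3.
d = 8; k = 3;
A = randn(d, k);                        % full column rank (generically)
w = rand(k, 1); w = w / sum(w);
M2 = A * diag(w) * A';
[U, L] = eig((M2 + M2') / 2);
[lam, order] = sort(diag(L), 'descend');
U = U(:, order(1:k));                   % top-k eigenvectors of M2
theta = randn(k, 1); theta = theta / norm(theta);
r = U * theta;                          % random direction r = U*theta
M3r = A * diag(w) * diag(A' * r) * A';  % the slice M3(I,I,r)
X = (U' * M3r * U) / (U' * M2 * U);     % = (U'A) Diag(A'r) (U'A)^{-1}
[V, D] = eig(X);
Ahat = real(U * V);                     % columns proportional to the a_i
An = bsxfun(@rdivide, A, sqrt(sum(A.^2, 1)));
Bn = bsxfun(@rdivide, Ahat, sqrt(sum(Ahat.^2, 1)));
abs(An' * Bn)                           % ~ a permutation matrix (up to sign)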

59-63 Implications
Putting it together: learn Gaussian mixtures through slices of the third-order moment, with guaranteed learning through eigen-decomposition. A Method of Moments for Mixture Models and Hidden Markov Models, by A., D. Hsu, and S.M. Kakade, Proc. of COLT, June 2012.
Shortcomings: The resulting product is not symmetric, so the eigen-decomposition $V\, \mathrm{Diag}(\tilde{\lambda})\, V^{-1}$ does not result in an orthonormal $V$, which is more involved in practice. Recovery requires a good eigen-gap in $\mathrm{Diag}(\tilde{\lambda})$; for $r = U\theta$ with $\theta$ drawn uniformly from the unit sphere, the gap is about $1/k^{2.5}$, leading to numerical instability in practice. Moreover, $M_3(I, I, r)$ is only a (random) slice of the tensor, so the full information is not utilized.
More efficient learning methods using higher order moments?

64 Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

65-67 Tensor Factorization
Recover $A$ and $w$ from $M_2$ and $M_3$: $M_2 = \sum_i w_i\, a_i \otimes a_i$, $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$.
(Figure: the tensor $M_3$ drawn as a sum of rank-1 terms $w_1\, a_1 \otimes a_1 \otimes a_1 + w_2\, a_2 \otimes a_2 \otimes a_2 + \cdots$.)
$a \otimes a \otimes a$ is a rank-1 tensor, since its $(i_1, i_2, i_3)$-th entry is $a_{i_1} a_{i_2} a_{i_3}$; $M_3$ is a sum of rank-1 terms.
When is this the most compact representation? (Identifiability.) Can we recover the decomposition? (Algorithm?)
The most compact representation is known as the CP decomposition (CANDECOMP/PARAFAC).

68-71 Initial Ideas
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. What if $A$ has orthogonal columns?
$M_3(I, a_1, a_1) = \sum_i w_i\, \langle a_i, a_1 \rangle^2\, a_i = w_1 a_1$.
The $a_i$ are eigenvectors of the tensor $M_3$, analogous to matrix eigenvectors: $Mv = M(I, v) = \lambda v$.
Two problems: how to find eigenvectors of a tensor? And $A$ is not orthogonal in general.
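A tiny MATLAB check of this identity (illustrative): with orthonormal columns in A, contracting $M_3$ with $a_1$ in two modes returns $w_1 a_1$.

Orthogonal Components Snippet (illustrative sketch):
% Check M3(I, a1, a1) = w1 * a1 when the columns of A are orthonormal.
d = 5; k = 3;
[A, ~] = qr(randn(d, k), 0);            % orthonormal columns
w = rand(k, 1);
M3 = zeros(d, d, d);
for i = 1:k
    ai = A(:, i); aa = ai * ai';
    M3 = M3 + w(i) * reshape(aa(:) * ai', [d, d, d]);
end
a1 = A(:, 1);
contracted = reshape(M3, d, d*d) * kron(a1, a1);   % M3(I, a1, a1)
max(abs(contracted - w(1) * a1))                   % ~ 0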

72 Whitening
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$.
Find a whitening matrix $W$ such that $W^\top A = V$ is an orthogonal matrix. When $A \in \mathbb{R}^{d \times k}$ has full column rank, this is an invertible transformation.
(Figure: the vectors $a_1, a_2, a_3$ are mapped by $W$ to orthogonal $v_1, v_2, v_3$.)

73 Using Whitening to Obtain an Orthogonal Tensor
(Figure: the tensor $M_3$ is mapped to the tensor $T$ by a multilinear transform.)
$M_3 \in \mathbb{R}^{d \times d \times d}$ and $T \in \mathbb{R}^{k \times k \times k}$, with $T = M_3(W, W, W) = \sum_i w_i\, (W^\top a_i)^{\otimes 3}$.
$T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$ is orthogonal. Dimensionality reduction when $k \ll d$.

74 How to Find the Whitening Matrix?
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$.
(Figure: $a_1, a_2, a_3$ mapped by $W$ to $v_1, v_2, v_3$.)
Use the pairwise moments $M_2$ to find $W$ such that $W^\top M_2 W = I$: from the eigen-decomposition $M_2 = U\, \mathrm{Diag}(\tilde{\lambda})\, U^\top$, take $W = U\, \mathrm{Diag}(\tilde{\lambda}^{-1/2})$.
Then $V := W^\top A\, \mathrm{Diag}(w)^{1/2}$ is an orthogonal matrix, and
$T = M_3(W, W, W) = \sum_i w_i\, (W^\top a_i)^{\otimes 3} = \sum_i \lambda_i\, v_i \otimes v_i \otimes v_i$, with $\lambda_i := w_i^{-1/2}$.
$T$ is an orthogonal tensor.
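A MATLAB sketch of this construction on population moments (illustrative): build W from $M_2$, check that $V = W^\top A\, \mathrm{Diag}(w)^{1/2}$ is orthogonal, and check one tensor eigenpair of $T = M_3(W, W, W)$.

Whitening Snippet (illustrative sketch):
% Whiten M2 and form the orthogonal tensor T = M3(W,W,W).
d = 7; k = 3;
A = randn(d, k);
w = rand(k, 1); w = w / sum(w);
M2 = A * diag(w) * A';
[U, L] = eig((M2 + M2') / 2);
[lam, order] = sort(diag(L), 'descend');
W = U(:, order(1:k)) * diag(1 ./ sqrt(lam(1:k)));    % W' * M2 * W = I
V = W' * A * diag(sqrt(w));
norm(V' * V - eye(k))                                % ~ 0: V is orthogonal
B = W' * A;                                          % columns are W' a_i
T = zeros(k, k, k);                                  % T = sum_i w_i (W' a_i)^{⊗3}
for i = 1:k
    bi = B(:, i); bb = bi * bi';
    T = T + w(i) * reshape(bb(:) * bi', [k, k, k]);
end
v1 = V(:, 1);
Tv1v1 = reshape(T, k, k*k) * kron(v1, v1);           % T(I, v1, v1)
norm(Tv1v1 - v1 / sqrt(w(1)))                        % ~ 0: eigenvalue is w_1^{-1/2}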

75-77 Orthogonal Tensor Eigen Decomposition
$T = \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, with $\langle v_i, v_j \rangle = \delta_{i,j}$ for all $i, j$.
$T(I, v_1, v_1) = \sum_i \lambda_i\, \langle v_i, v_1 \rangle^2\, v_i = \lambda_1 v_1$: the $v_i$ are eigenvectors of the tensor $T$.
Tensor power method: start from an initial vector $v$ and iterate $v \leftarrow \frac{T(I, v, v)}{\| T(I, v, v) \|}$.
Canonical recovery method: randomly initialize the power method; run to convergence to obtain $v$ with eigenvalue $\lambda$; deflate, $T \leftarrow T - \lambda\, v \otimes v \otimes v$, and repeat.
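A minimal MATLAB sketch of the power iteration with deflation on a synthetic orthogonal k x k x k tensor (illustrative; the sample-based version of this loop appears in the recap snippets later in the talk):

Power Method Snippet (illustrative sketch):
% Tensor power method with deflation on an orthogonally decomposable tensor.
k = 4;
[V_true, ~] = qr(randn(k));                 % orthonormal eigenvectors v_i
lam_true = 1 + rand(k, 1);                  % positive eigenvalues
T = zeros(k, k, k);
for i = 1:k
    vi = V_true(:, i); vv = vi * vi';
    T = T + lam_true(i) * reshape(vv(:) * vi', [k, k, k]);
end
V_est = zeros(k, k); lam_est = zeros(k, 1);
for i = 1:k
    Tmat = reshape(T, k, k*k);              % unfolding: T(I,u,u) = Tmat * kron(u,u)
    v = randn(k, 1); v = v / norm(v);
    for iter = 1:100
        v_new = Tmat * kron(v, v);          % power update T(I, v, v)
        v_new = v_new / norm(v_new);
        if norm(v_new - v) < 1e-10, break; end
        v = v_new;
    end
    lam = v' * (Tmat * kron(v, v));         % eigenvalue estimate
    V_est(:, i) = v; lam_est(i) = lam;
    vv = v * v';                            % deflate: T <- T - lam * v ⊗ v ⊗ v
    T = T - lam * reshape(vv(:) * v', [k, k, k]);
end
abs(V_true' * V_est)                        % ~ a permutation matrix (up to sign)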

78 Putting it together
Gaussian mixture: $x = Ah + z$, where $E[h] = w$ and $z \sim N(0, \sigma^2 I)$. $M_2 = \sum_i w_i\, a_i \otimes a_i$, $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$.
Obtain the whitening matrix $W$ from the SVD of $M_2$. Use $W$ for the multilinear transform $T = M_3(W, W, W)$. Find the eigenvectors of $T$ through the power method and deflation.
What about learning other latent variable models?

79 Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

80 Topic Modeling

81 Probabilistic Topic Models
Useful abstraction for automatic categorization of documents. Observed: words. Hidden: topics. Bag of words: the order of words does not matter.
Graphical model representation: $l$ words in a document, $x_1, \ldots, x_l$; $h$ is the vector of topic proportions in the document; word $x_i$ is generated from topic $y_i$. Exchangeability: $x_1 \perp x_2 \perp \cdots \mid h$. $A(i, j) := P[x_m = i \mid y_m = j]$ is the topic-word matrix.
(Figure: topic mixture $h$ at the top, topics $y_1, \ldots, y_5$ below it, and words $x_1, \ldots, x_5$ at the bottom, each generated through $A$.)

82-83 Distribution of the topic proportions vector h
If there are $k$ topics, $h$ has a distribution over the simplex $\Delta^{k-1} := \{ h \in \mathbb{R}^k : h_i \in [0, 1], \sum_i h_i = 1 \}$.
Simplification for today: there is only one topic in each document, i.e., $h \in \{e_1, \ldots, e_k\}$, the basis vectors.
Tomorrow: latent Dirichlet allocation.

84 Formulation as Linear Models
Distribution of the words $x_1, x_2, \ldots$: order the $d$ words in the vocabulary; if $x_1$ is the $j$-th word, assign it $e_j \in \mathbb{R}^d$. The distribution of each $x_i$ is supported on the vertices of $\Delta^{d-1}$.
Properties: $\Pr[x_1 \mid \text{topic } h = i] = E[x_1 \mid \text{topic } h = i] = a_i$.
Linear model: $E[x_i \mid h] = A h$. Multiview model: $h$ is fixed and multiple words $x_i$ are generated.

85-87 Geometric Picture for Topic Models
(Figures: the topic proportions vector $h$ of a document on the simplex; the single-topic case, where $h$ sits at a vertex; and word generation $x_1, x_2, \ldots$ from the topic via $A$.)

88 Gaussian Mixtures vs. Topic Models
(Spherical) mixture of Gaussians: $k$ means $a_1, \ldots, a_k$; component $h = i$ with probability $w_i$; observe $x$ with spherical noise, $x = a_i + z$, $z \sim N(0, \sigma_i^2 I)$.
(Single) topic model: $k$ topics $a_1, \ldots, a_k$; topic $h = i$ with probability $w_i$; observe $l$ (exchangeable) words $x_1, x_2, \ldots, x_l$ drawn i.i.d. from $a_i$.
Unified linear model: $E[x \mid h] = A h$. The Gaussian mixture is single-view with spherical noise; the topic model is multi-view with heteroskedastic noise. Three words per document suffice for learning:
$M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i \otimes a_i$.
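One way to see why three exchangeable words give exactly these moments (a short derivation added here, consistent with the single-topic model above): condition on the topic and use the conditional independence of the words.
\begin{align*}
M_2 = E[x_1 \otimes x_2] &= \sum_{i \in [k]} \Pr[h = e_i]\; E[x_1 \mid h = e_i] \otimes E[x_2 \mid h = e_i] = \sum_i w_i\, a_i \otimes a_i,\\
M_3 = E[x_1 \otimes x_2 \otimes x_3] &= \sum_{i \in [k]} w_i\; E[x_1 \mid h = e_i] \otimes E[x_2 \mid h = e_i] \otimes E[x_3 \mid h = e_i] = \sum_i w_i\, a_i \otimes a_i \otimes a_i.
\end{align*}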

89 Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion

90 Recap: Basic Tensor Decomposition Method. Toy example in MATLAB.
(Flow: simulated samples from an exchangeable model → whiten the samples, using the second order moments and a matrix decomposition → orthogonal tensor eigen decomposition, using the third order moments and power iteration.)

91-92 Simulated Samples: Exchangeable Model
Model parameters: hidden state $h \in \{e_1, \ldots, e_k\}$ with $k = 2$; observed states $x_i \in \{e_1, \ldots, e_d\}$ with $d = 3$; conditional independence $x_1 \perp x_2 \perp x_3 \mid h$; transition matrix $A$; exchangeability $E[x_i \mid h] = A h$ for $i = 1, 2, 3$.
(Figure: $h$ with children $x_1, x_2, x_3$.)
Generate Samples Snippet:
for t = 1 : n
    % generate h for this sample
    h_category = (rand() > 0.5) + 1;
    h(t, h_category) = 1;
    transition_cum = cumsum(A_true(:, h_category));
    % generate x1 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x1(t, x_category) = 1;
    % generate x2 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x2(t, x_category) = 1;
    % generate x3 for this sample h
    x_category = find(transition_cum > rand(), 1);
    x3(t, x_category) = 1;
end

93 Whiten the Samples
Second order moments: $M_2 = \frac{1}{n} \sum_t x_1^t \otimes x_2^t$. Whitening matrix: $W = U_w L_w^{-0.5}$, where $[U_w, L_w]$ is the rank-$k$ SVD of $M_2$.
Whitening Snippet:
fprintf('The second order moment M2:');
M2 = x1' * x2 / n
[Uw, Lw, Vw] = svd(M2);
fprintf('M2 singular values:');
Lw
W = Uw(:, 1:k) * sqrt(pinv(Lw(1:k, 1:k)));
y1 = x1 * W; y2 = x2 * W; y3 = x3 * W;
Whitened data: $y_1^t = W^\top x_1^t$. Orthogonal basis: $V = W^\top A$ with $V^\top V = I$.
(Figure: $a_1, a_2$ mapped by $W$ to orthogonal $v_1, v_2$.)

94 Orthogonal Tensor Eigen Decomposition
Third order moments: $T = \frac{1}{n} \sum_{t \in [n]} y_1^t \otimes y_2^t \otimes y_3^t \approx \sum_{i \in [k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, with $V^\top V = I$.
Power update (gradient ascent): $T(I, v_1, v_1) = \frac{1}{n} \sum_{t \in [n]} \langle v_1, y_2^t \rangle \langle v_1, y_3^t \rangle\, y_1^t \approx \sum_i \lambda_i\, \langle v_i, v_1 \rangle^2\, v_i = \lambda_1 v_1$.
The $v_i$ are eigenvectors of the tensor $T$.

95-96 Orthogonal Tensor Eigen Decomposition
$T \approx \sum_j \lambda_j\, v_j^{\otimes 3}$, with power update $v \leftarrow \frac{T(I, v, v)}{\| T(I, v, v) \|}$.
Power Iteration Snippet:
V = zeros(k, k); Lambda = zeros(k, 1);
for i = 1:k
    v_old = rand(k, 1); v_old = normc(v_old);
    for iter = 1 : Maxiter
        v_new = (y1' * ((y2 * v_old) .* (y3 * v_old))) / n;
        if i > 1
            % deflation
            for j = 1 : i-1
                v_new = v_new - (V(:,j) * (v_old' * V(:,j))^2) * Lambda(j);
            end
        end
        lambda = norm(v_new); v_new = normc(v_new);
        if norm(v_old - v_new) < TOL
            fprintf('Converged at iteration %d.', iter);
            V(:,i) = v_new; Lambda(i,1) = lambda;
            break;
        end
        v_old = v_new;
    end
end
(Figure: estimates at iterations 0 and 1; green: ground truth, red: estimate at each iteration.)

97 Conclusion
(Figure: $h$ with children $x_1, x_2, x_3$.)
Gaussian mixtures and topic models: a unified linear representation; learning through higher order moments; tensor decomposition via whitening and the power method.
Tomorrow's lecture: latent variable models and moments; analysis of the tensor power method. Wednesday's lecture: implementation of tensor methods. Code snippet available at
