Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods
1 Guaranteed Learning of Latent Variable Models through Spectral and Tensor Methods. Anima Anandkumar, U.C. Irvine.
3 Application 1: Clustering. Basic operation of grouping data points. Hypothesis: each data point belongs to an unknown group. Probabilistic/latent variable viewpoint: the groups represent different distributions (e.g., Gaussians), and each data point is drawn from one of these distributions (e.g., Gaussian mixtures).
4 Application 2: Topic Modeling Document modeling Observed: words in document corpus. Hidden: topics. Goal: carry out document summarization.
5 Application 3: Understanding Human Communities Social Networks Observed: network of social ties, e.g. friendships, co-authorships Hidden: groups/communities of actors.
6 Application 4: Recommender Systems. Observed: ratings of users for various products, e.g., Yelp reviews. Goal: predict new recommendations. Modeling: find groups/communities of users and products.
7 Application 5: Feature Learning. Feature engineering: learn good features/representations for classification tasks, e.g., image and speech recognition. Sparse representations, low-dimensional hidden structures.
8 Application 6: Computational Biology Observed: gene expression levels Goal: discover gene groups Hidden variables: regulators controlling gene groups
9 How to model hidden effects? Basic approach: mixtures/clusters, where the hidden variable h is categorical. Advanced: probabilistic models, where the hidden variable h has more general distributions and can model mixed memberships. [Graphical model: hidden nodes h1, h2, h3 over observed nodes x1, ..., x5.] This talk: basic mixture model. Tomorrow: advanced models.
13 Challenges in Learning. Basic goal in all the applications above: discover hidden structure in data, i.e., unsupervised learning. Challenge, conditions for identifiability: when can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms? Challenge, efficient learning of latent variable models: maximum likelihood is NP-hard, and in practice EM and variational Bayes have no consistency guarantees. Can we achieve efficient computational and sample complexities? In this series: guaranteed and efficient learning through spectral methods.
21 What this talk series will cover. This talk: start with the most basic latent variable model, Gaussian mixtures; the basic spectral approach, principal component analysis (PCA); beyond correlations, higher-order moments; tensor notation and the PCA extension to tensors; guaranteed learning of Gaussian mixtures; introduce topic models and present a unified viewpoint. Tomorrow: using tensor methods for learning latent variable models and analysis of the tensor decomposition method. Wednesday: implementation of tensor methods.
22 Resources for Today's Talk. A Method of Moments for Mixture Models and Hidden Markov Models, by A., D. Hsu, and S.M. Kakade, Proc. of COLT, June 2012. Tensor Decompositions for Learning Latent Variable Models, by A., R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky, Oct. Resources available at
23 Outline 1 Introduction 2 Warm-up: PCA and Gaussian Mixtures 3 Higher order moments for Gaussian Mixtures 4 Tensor Factorization Algorithm 5 Learning Topic Models 6 Conclusion
26 Warm-up: PCA. Optimization problem: for (centered) points $x_i \in \mathbb{R}^d$, find a projection $P \in \mathbb{R}^{d\times d}$ with $\mathrm{Rank}(P) = k$ minimizing $\frac{1}{n}\sum_{i\in[n]} \|x_i - P x_i\|^2$. Result: if $S = \mathrm{Cov}(X)$ has eigendecomposition $S = U \Lambda U^\top$, then $P = U^{(k)} U^{(k)\top}$, where $U^{(k)}$ holds the top-$k$ eigenvectors. Proof: by the Pythagorean theorem, $\sum_i \|x_i - P x_i\|^2 = \sum_i \|x_i\|^2 - \sum_i \|P x_i\|^2$, so it suffices to maximize $\frac{1}{n}\sum_i \|P x_i\|^2 = \frac{1}{n}\sum_i \mathrm{Tr}[P x_i x_i^\top P^\top] = \mathrm{Tr}[P S P^\top]$.
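The PCA recipe above can be checked numerically. Below is a minimal NumPy sketch (made-up dimensions and random data, not code from the talk) verifying that the projector onto the top-$k$ eigenvectors of the sample covariance attains $\mathrm{Tr}[PSP^\top] = \lambda_1 + \dots + \lambda_k$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 10_000
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)                      # center the points

S = X.T @ X / n                          # sample covariance
lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
Uk = U[:, -k:]                           # top-k eigenvectors
P = Uk @ Uk.T                            # rank-k orthogonal projector

# P is idempotent, and Tr[P S P^T] equals the sum of the top-k eigenvalues
assert np.allclose(P @ P, P)
assert np.isclose(np.trace(P @ S @ P.T), lam[-k:].sum())
```

Note `eigh` returns eigenvalues in ascending order, so the top-$k$ eigenvectors are the last $k$ columns.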
29 Review of linear algebra. For a matrix $S$, $u$ is an eigenvector with eigenvalue $\lambda$ if $S u = \lambda u$. A symmetric $S \in \mathbb{R}^{d\times d}$ has $d$ real eigenvalues and admits $S = \sum_{i\in[d]} \lambda_i u_i u_i^\top$ with $U$ orthogonal. Rayleigh quotient: for $S$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ and corresponding eigenvectors $u_1, \dots, u_d$, $\max_{\|z\|=1} z^\top S z = \lambda_1$ and $\min_{\|z\|=1} z^\top S z = \lambda_d$, attained by $u_1$ and $u_d$ respectively. Optimal projection: $\max_{P : P^2 = P,\ \mathrm{Rank}(P) = k} \mathrm{Tr}(P S P^\top) = \lambda_1 + \lambda_2 + \dots + \lambda_k$, attained when $P$ spans $\{u_1, \dots, u_k\}$.
32 PCA on Gaussian Mixtures. Mixture of $k$ Gaussians: each sample is $x = Ah + z$, where $h \in \{e_1, \dots, e_k\}$ (the basis vectors) with $\mathbb{E}[h] = w$; $A \in \mathbb{R}^{d\times k}$ has the component means as columns; $\mu := Aw$ is the overall mean; and $z \sim \mathcal{N}(0, \sigma^2 I)$ is white Gaussian noise. Then $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \sum_{i\in[k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$. How this equation is obtained: $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \mathbb{E}[(Ah-\mu)(Ah-\mu)^\top] + \mathbb{E}[z z^\top] = \sum_{i\in[k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$.
35 PCA on Gaussian Mixtures. $\mathbb{E}[(x-\mu)(x-\mu)^\top] = \sum_{i\in[k]} w_i (a_i - \mu)(a_i - \mu)^\top + \sigma^2 I$. The vectors $\{a_i - \mu\}$ are linearly dependent, since $\sum_i w_i (a_i - \mu) = 0$; hence the PSD matrix $\sum_{i\in[k]} w_i (a_i - \mu)(a_i - \mu)^\top$ has rank $k-1$. Therefore $(k-1)$-PCA on the covariance matrix yields $\mathrm{span}(\{a_i - \mu\})$, and the lowest eigenvalue of the covariance matrix yields $\sigma^2$. But how to learn $A$ itself?
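Both recovery facts on this slide can be verified directly on the population covariance. A small NumPy sketch (random parameters of my own choosing, using the exact population matrix rather than samples):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, sigma2 = 6, 3, 0.5
A = rng.normal(size=(d, k))              # component means as columns
w = np.array([0.2, 0.3, 0.5])
mu = A @ w

# population covariance of x = Ah + z with z ~ N(0, sigma^2 I)
C = sum(w[i] * np.outer(A[:, i] - mu, A[:, i] - mu) for i in range(k)) \
    + sigma2 * np.eye(d)

lam, U = np.linalg.eigh(C)
# the smallest eigenvalue recovers the noise variance
assert np.isclose(lam[0], sigma2)
# the top (k-1) eigenvectors span the centered means a_i - mu
span = U[:, -(k - 1):]
for i in range(k):
    v = A[:, i] - mu
    assert np.allclose(span @ (span.T @ v), v)
```

The second assertion works because $\sum_i w_i (a_i - \mu) = 0$ forces the $k$ centered means into a $(k-1)$-dimensional subspace.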
37 Learning through Spectral Clustering. Learning $A$ through spectral clustering: project the samples $x$ onto $\mathrm{span}(A)$, then apply distance-based clustering (e.g., k-means). A series of works, e.g., Vempala & Wang. However, clustering fails when the variance is large relative to the separation of the means. Can we learn Gaussian mixtures without separation constraints?
43 Beyond PCA: Spectral Methods on Tensors. How to learn the component means $A$ (not just their span) without separation constraints? PCA is a spectral method on (covariance) matrices; are higher-order moments helpful? What if the number of components exceeds the observed dimensionality, $k > d$? Do higher-order moments help to learn such overcomplete models? What if the data is not Gaussian? Can we do moment-based estimation of probabilistic latent variable models in general?
44 Outline. Next: 3 Higher order moments for Gaussian Mixtures.
45 Tensor Notation for Higher Order Moments. Multivariate higher-order moments form tensors; are there spectral operations on tensors akin to PCA? The matrix $\mathbb{E}[x \otimes x] \in \mathbb{R}^{d\times d}$ is a second-order tensor, with entries $\mathbb{E}[x \otimes x]_{i_1,i_2} = \mathbb{E}[x_{i_1} x_{i_2}]$; for matrices, $\mathbb{E}[x \otimes x] = \mathbb{E}[x x^\top]$. The tensor $\mathbb{E}[x \otimes x \otimes x] \in \mathbb{R}^{d\times d\times d}$ is a third-order tensor, with entries $\mathbb{E}[x \otimes x \otimes x]_{i_1,i_2,i_3} = \mathbb{E}[x_{i_1} x_{i_2} x_{i_3}]$.
46 Third order moment for Gaussian mixtures. Consider a mixture of $k$ Gaussians: each sample is $x = Ah + z$, with $h \in \{e_1, \dots, e_k\}$ (basis vectors), $\mathbb{E}[h] = w$, $A \in \mathbb{R}^{d\times k}$ whose columns are the component means, $\mu := Aw$, and $z \sim \mathcal{N}(0, \sigma^2 I)$ white Gaussian noise. Then $\mathbb{E}[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_{i\in[d]} (\mu \otimes e_i \otimes e_i + e_i \otimes \mu \otimes e_i + e_i \otimes e_i \otimes \mu)$. Intuition behind the equation: expanding $\mathbb{E}[(Ah+z)^{\otimes 3}]$, the pure-signal term gives $\mathbb{E}[(Ah)^{\otimes 3}] = \sum_i w_i a_i^{\otimes 3}$; terms with one copy of $z$ vanish since $\mathbb{E}[z] = 0$; terms with two copies, such as $\mathbb{E}[(Ah) \otimes z \otimes z]$, contribute the $\sigma^2$ terms; and $\mathbb{E}[z^{\otimes 3}] = 0$. How to recover the parameters $A$ and $w$ from the third-order moment?
48 Simplifications. $\mathbb{E}[x \otimes x \otimes x] = \sum_i w_i\, a_i \otimes a_i \otimes a_i + \sigma^2 \sum_i (\mu \otimes e_i \otimes e_i + \dots)$. Here $\sigma^2$ is obtained from $\sigma_{\min}(\mathrm{Cov}(x))$, and $\mathbb{E}[x \otimes x] = \sum_i w_i a_i a_i^\top + \sigma^2 I$. Subtracting the known noise terms, we can obtain $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$ and $M_2 = \sum_i w_i\, a_i a_i^\top$. How to obtain the parameters $A$ and $w$ from $M_2$ and $M_3$?
50 Tensor Slices. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Multilinear transformation of a tensor: $M_3(B, C, D) := \sum_i w_i\, (B^\top a_i) \otimes (C^\top a_i) \otimes (D^\top a_i)$. Slice of a tensor: $M_3(I, I, r) = \sum_i w_i \langle a_i, r\rangle\, a_i a_i^\top = A\,\mathrm{Diag}(w)\,\mathrm{Diag}(A^\top r)\,A^\top$, while $M_2 = A\,\mathrm{Diag}(w)\,A^\top$.
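The slice identity can be sanity-checked with a tensor contraction. A short NumPy sketch (arbitrary random parameters, not from the talk), building $M_3$ explicitly with `einsum` and comparing its third-mode slice against the matrix formula:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 5, 3
A = rng.normal(size=(d, k))
w = rng.dirichlet(np.ones(k))
r = rng.normal(size=d)

# M3 = sum_i w_i a_i (x) a_i (x) a_i, as an explicit d x d x d array
M3 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)

# slice along the third mode: M3(I, I, r)
slice_r = np.einsum('abc,c->ab', M3, r)
expected = A @ np.diag(w * (A.T @ r)) @ A.T
assert np.allclose(slice_r, expected)
```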
56 Eigen-decomposition. $M_3(I, I, r) = A\,\mathrm{Diag}(w)\,\mathrm{Diag}(A^\top r)\,A^\top$, $M_2 = A\,\mathrm{Diag}(w)\,A^\top$. Assumption: $A \in \mathbb{R}^{d\times k}$ has full column rank. Let $M_2 = U \Lambda U^\top$ be its eigendecomposition, with $U \in \mathbb{R}^{d\times k}$; then $U^\top M_2 U = \Lambda \in \mathbb{R}^{k\times k}$ is invertible. Form $X = (U^\top M_3(I, I, r)\, U)(U^\top M_2 U)^{-1} = V\,\mathrm{Diag}(\tilde\lambda)\,V^{-1}$. Substitution gives $X = (U^\top A)\,\mathrm{Diag}(A^\top r)\,(U^\top A)^{-1}$, so we have $v_i \propto U^\top a_i$. Technical detail: take $r = U\theta$ with $\theta$ drawn uniformly from the sphere, to ensure an eigen-gap.
58 Learning Gaussian Mixtures through Eigen-decomposition of Tensor Slices. $M_3(I, I, r) = A\,\mathrm{Diag}(w)\,\mathrm{Diag}(A^\top r)\,A^\top$, $M_2 = A\,\mathrm{Diag}(w)\,A^\top$, and $(U^\top M_3(I, I, r)\, U)(U^\top M_2 U)^{-1} = V\,\mathrm{Diag}(\tilde\lambda)\,V^{-1}$. Recovery of $A$, method 1: since $v_i \propto U^\top a_i$, recover $A$ up to scale via $a_i \propto U v_i$. Recovery of $A$, method 2: let $\Theta \in \mathbb{R}^{k\times k}$ be a random rotation matrix; consider the $k$ slices $M_3(I, I, \theta_i)$ and compute the eigendecomposition above for each. Let $\tilde\Lambda = [\tilde\lambda_1 \dots \tilde\lambda_k]$ collect the eigenvalues of all slices; recover $A$ as $A = U \Theta^{-1} \tilde\Lambda$.
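The whole slice-eigendecomposition procedure (method 1) fits in a few lines. Below is a hedged NumPy sketch with arbitrary random parameters: it forms $M_2$ and the slice $M_3(I,I,r)$ from their population formulas, computes $X$, and checks that each $U v_i$ is parallel to some column of $A$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 6, 3
A = rng.normal(size=(d, k))
w = rng.dirichlet(np.ones(k))
r = rng.normal(size=d)

M2 = A @ np.diag(w) @ A.T
M3_r = A @ np.diag(w * (A.T @ r)) @ A.T      # the slice M3(I, I, r)

lam, Ufull = np.linalg.eigh(M2)
U = Ufull[:, -k:]                            # top-k eigenvectors span range(A)

X = (U.T @ M3_r @ U) @ np.linalg.inv(U.T @ M2 @ U)
eigvals, V = np.linalg.eig(X)                # X = (U^T A) Diag(A^T r) (U^T A)^{-1}

# each recovered direction U v_i is parallel to some column of A
for i in range(k):
    cand = U @ V[:, i]
    cos = np.abs(A.T @ cand) / (np.linalg.norm(A, axis=0) * np.linalg.norm(cand))
    assert np.isclose(cos.max(), 1.0)
```

Note $X$ is not symmetric, which is exactly the shortcoming the next slides point out.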
63 Implications. Putting it together: learn Gaussian mixtures through slices of the third-order moment, with guaranteed learning through eigendecomposition. (A Method of Moments for Mixture Models and Hidden Markov Models, by A., D. Hsu, and S.M. Kakade, Proc. of COLT, June 2012.) Shortcomings: the resulting product is not symmetric, so the eigendecomposition $V\,\mathrm{Diag}(\tilde\lambda)\,V^{-1}$ does not yield an orthonormal $V$, which is more involved in practice. Recovery requires a good eigen-gap in $\mathrm{Diag}(\tilde\lambda)$; for $r = U\theta$ with $\theta$ drawn uniformly from the unit sphere, the gap is of order $1/k^{2.5}$, causing numerical instability in practice. Moreover, $M_3(I, I, r)$ is only a (random) slice of the tensor, so the full information is not utilized. Can we find more efficient learning methods using higher-order moments?
64 Outline. Next: 4 Tensor Factorization Algorithm.
67 Tensor Factorization. Recover $A$ and $w$ from $M_2 = \sum_i w_i\, a_i a_i^\top$ and $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. [Figure: $M_3 = w_1\, a_1 \otimes a_1 \otimes a_1 + w_2\, a_2 \otimes a_2 \otimes a_2 + \dots$] $a \otimes a \otimes a$ is a rank-1 tensor, since its $(i_1, i_2, i_3)$-th entry is $a_{i_1} a_{i_2} a_{i_3}$; $M_3$ is a sum of rank-1 terms. When is this the most compact representation? (Identifiability.) Can we recover the decomposition? (Algorithm?) The most compact such representation is known as the CP decomposition (CANDECOMP/PARAFAC).
71 Initial Ideas. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. What if $A$ has orthogonal columns? Then $M_3(I, a_1, a_1) = \sum_i w_i \langle a_i, a_1\rangle^2\, a_i = w_1 a_1$: the $a_i$ are eigenvectors of the tensor $M_3$, analogous to matrix eigenvectors, $Mv = M(I, v) = \lambda v$. Two problems: how to find eigenvectors of a tensor, and $A$ is not orthogonal in general.
72 Whitening. $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Find a whitening matrix $W$ such that $W^\top A = V$ is (up to component scaling) an orthogonal matrix. When $A \in \mathbb{R}^{d\times k}$ has full column rank, this is an invertible transformation. [Figure: $W$ maps $a_1, a_2, a_3$ to orthogonal $v_1, v_2, v_3$.]
73 Using Whitening to Obtain an Orthogonal Tensor. Multilinear transform: $M_3 \in \mathbb{R}^{d\times d\times d}$, $T \in \mathbb{R}^{k\times k\times k}$, and $T = M_3(W, W, W) = \sum_i w_i\, (W^\top a_i)^{\otimes 3}$. Then $T = \sum_{i\in[k]} \tilde\lambda_i\, v_i \otimes v_i \otimes v_i$ is an orthogonal tensor. This is also a dimensionality reduction when $k \ll d$.
74 How to Find the Whitening Matrix? $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$. Use the pairwise moments $M_2$ to find $W$ such that $W^\top M_2 W = I$: from the eigendecomposition $M_2 = U\,\mathrm{Diag}(\lambda)\,U^\top$, take $W = U\,\mathrm{Diag}(\lambda)^{-1/2}$. Then $V := W^\top A\,\mathrm{Diag}(w)^{1/2}$ is an orthogonal matrix, and $T = M_3(W, W, W) = \sum_i w_i\, (W^\top a_i)^{\otimes 3} = \sum_i \tilde\lambda_i\, v_i \otimes v_i \otimes v_i$ with $\tilde\lambda_i := w_i^{-1/2}$, so $T$ is an orthogonal tensor.
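The whitening construction is easy to verify numerically. A minimal NumPy sketch (random parameters of my own choosing) checking both $W^\top M_2 W = I$ and the orthogonality of $V = W^\top A\,\mathrm{Diag}(w)^{1/2}$:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 6, 3
A = rng.normal(size=(d, k))
w = rng.dirichlet(np.ones(k))

M2 = A @ np.diag(w) @ A.T
lam, Ufull = np.linalg.eigh(M2)
U, lam_k = Ufull[:, -k:], lam[-k:]       # top-k eigenpairs of M2
W = U @ np.diag(lam_k ** -0.5)           # whitening matrix

# W whitens the pairwise moment
assert np.allclose(W.T @ M2 @ W, np.eye(k))

# V = W^T A Diag(w)^{1/2} is orthogonal
V = W.T @ A @ np.diag(np.sqrt(w))
assert np.allclose(V.T @ V, np.eye(k))
```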
77 Orthogonal Tensor Eigen Decomposition. $T = \sum_{i\in[k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, with $\langle v_i, v_j\rangle = \delta_{i,j}$ for all $i, j$. Then $T(I, v_1, v_1) = \sum_i \lambda_i \langle v_i, v_1\rangle^2\, v_i = \lambda_1 v_1$: the $v_i$ are eigenvectors of the tensor $T$. Tensor power method: start from an initial vector $v$ and iterate $v \mapsto \frac{T(I, v, v)}{\|T(I, v, v)\|}$. Canonical recovery method: randomly initialize the power method; run to convergence to obtain $v$ with eigenvalue $\lambda$; deflate, $T \leftarrow T - \lambda\, v \otimes v \otimes v$; and repeat.
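The canonical recovery method above can be sketched in NumPy. This is a toy illustration with made-up dimensions, eigenvalues, and iteration counts (not the lecture's MATLAB code, which appears later): build an orthogonal tensor, then repeatedly run power iterations and deflate.

```python
import numpy as np

rng = np.random.default_rng(5)
k = 4
lam_true = rng.uniform(1.0, 2.0, size=k)
V_true, _ = np.linalg.qr(rng.normal(size=(k, k)))      # orthonormal columns

# orthogonal tensor T = sum_i lambda_i v_i (x) v_i (x) v_i
T = np.einsum('i,ai,bi,ci->abc', lam_true, V_true, V_true, V_true)

recovered = []
for _ in range(k):
    v = rng.normal(size=k)
    v /= np.linalg.norm(v)
    for _ in range(100):                               # power iterations
        v = np.einsum('abc,b,c->a', T, v, v)           # v <- T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('abc,a,b,c->', T, v, v, v)         # eigenvalue T(v, v, v)
    recovered.append(lam)
    T = T - lam * np.einsum('a,b,c->abc', v, v, v)     # deflate
assert np.allclose(sorted(recovered), sorted(lam_true), atol=1e-6)
```

With a random start, the iteration converges (quadratically) to one of the $v_i$, and deflation removes that component before the next sweep.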
78 Putting it together. Gaussian mixture: $x = Ah + z$, where $\mathbb{E}[h] = w$ and $z \sim \mathcal{N}(0, \sigma^2 I)$; $M_2 = \sum_i w_i\, a_i a_i^\top$, $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$. Obtain the whitening matrix $W$ from the SVD of $M_2$; use $W$ for the multilinear transform $T = M_3(W, W, W)$; find the eigenvectors of $T$ through the power method and deflation. What about learning other latent variable models?
79 Outline. Next: 5 Learning Topic Models.
80 Topic Modeling
81 Probabilistic Topic Models. A useful abstraction for automatic categorization of documents. Observed: words; hidden: topics. Bag of words: the order of words does not matter. Graphical model representation: $l$ words in a document, $x_1, \dots, x_l$; $h$ gives the proportions of topics in the document; word $x_i$ is generated from topic $y_i$; exchangeability: $x_1 \perp x_2 \perp \dots \mid h$. $A(i, j) := \mathbb{P}[x_m = i \mid y_m = j]$ is the topic-word matrix. [Figure: topic mixture $h$ over topics $y_1, \dots, y_5$, each emitting a word $x_i$ through $A$.]
83 Distribution of the topic proportions vector $h$. With $k$ topics, $h$ is distributed over the simplex $\Delta^{k-1} := \{h \in \mathbb{R}^k : h_i \in [0,1],\ \sum_i h_i = 1\}$. Simplification for today: there is only one topic in each document, so $h \in \{e_1, \dots, e_k\}$ (basis vectors). Tomorrow: latent Dirichlet allocation.
84 Formulation as Linear Models. Distribution of the words $x_1, x_2, \dots$: order the $d$ words in the vocabulary; if $x_1$ is the $j$-th word, assign $x_1 = e_j \in \mathbb{R}^d$. The distribution of each $x_i$ is supported on the vertices of $\Delta^{d-1}$. Properties: $\Pr[x_1 \mid \text{topic } h = i] = \mathbb{E}[x_1 \mid \text{topic } h = i] = a_i$. Linear model: $\mathbb{E}[x_i \mid h] = Ah$. Multiview model: $h$ is fixed and multiple words $x_i$ are generated.
85 Geometric Picture for Topic Models. [Figure: the topic proportions vector $h$ of a document lives on the topic simplex; a single-topic document sits at a vertex; the words $x_1, x_2, x_3$ are then generated through the corresponding column of $A$.]
88 Gaussian Mixtures vs. Topic Models. (Spherical) mixture of Gaussians: $k$ means $a_1, \dots, a_k$; component $h = i$ with probability $w_i$; observe $x$ with spherical noise, $x = a_i + z$, $z \sim \mathcal{N}(0, \sigma_i^2 I)$. (Single) topic model: $k$ topics $a_1, \dots, a_k$; topic $h = i$ with probability $w_i$; observe $l$ (exchangeable) words $x_1, x_2, \dots, x_l$ drawn i.i.d. from $a_i$. Unified linear model: $\mathbb{E}[x \mid h] = Ah$. The Gaussian mixture is single-view with spherical noise; the topic model is multi-view with heteroskedastic noise. Three words per document suffice for learning: $M_3 = \sum_i w_i\, a_i \otimes a_i \otimes a_i$, $M_2 = \sum_i w_i\, a_i a_i^\top$.
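The claim that three exchangeable words give the moments above can be checked by simulation. A NumPy sketch (toy vocabulary and topic-word matrix of my own choosing): sample one topic per document and three i.i.d. words from its column of $A$, then compare the empirical cross-moment with $\sum_i w_i\, a_i^{\otimes 3}$.

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, n = 4, 2, 100_000
A = np.array([[0.6, 0.1],
              [0.2, 0.1],
              [0.1, 0.2],
              [0.1, 0.6]])                 # topic-word matrix, columns sum to 1
w = np.array([0.4, 0.6])

h = rng.choice(k, size=n, p=w)             # one topic per document
cdf = np.cumsum(A, axis=0).T[h]            # per-document word CDF, shape (n, d)

def one_hot_words():
    # invert the CDF; clip guards against floating-point cumsum rounding
    idx = (rng.random((n, 1)) > cdf).sum(axis=1).clip(0, d - 1)
    return np.eye(d)[idx]

x1, x2, x3 = one_hot_words(), one_hot_words(), one_hot_words()
M3_hat = np.einsum('na,nb,nc->abc', x1, x2, x3) / n
M3_pop = np.einsum('i,ai,bi,ci->abc', w, A, A, A)
assert np.abs(M3_hat - M3_pop).max() < 0.02
```

Conditional independence of the three words given the topic is what makes the empirical cross-moment an unbiased estimate of $M_3$ with no noise-correction terms.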
89 Outline. Next: 6 Conclusion.
90 Recap: Basic Tensor Decomposition Method, Toy Example in MATLAB. Pipeline: simulated samples from an exchangeable model; second-order moments and matrix decomposition to whiten the samples; third-order moments and power iteration for the orthogonal tensor eigen-decomposition.
92 Simulated Samples: Exchangeable Model. Model parameters: hidden state $h$ in basis $\{e_1, \dots, e_k\}$ with $k = 2$; observed states $x_i$ in basis $\{e_1, \dots, e_d\}$ with $d = 3$; conditional independence $x_1 \perp x_2 \perp x_3 \mid h$; transition matrix $A$; exchangeability: $\mathbb{E}[x_i \mid h] = Ah$ for $i \in \{1, 2, 3\}$. Generate-samples snippet:

    for t = 1 : n
        % generate h for this sample
        h_category = (rand() > 0.5) + 1;
        h(t, h_category) = 1;
        transition_cum = cumsum(A_true(:, h_category));
        % generate x1 for this sample h
        x_category = find(transition_cum > rand(), 1);
        x1(t, x_category) = 1;
        % generate x2 for this sample h
        x_category = find(transition_cum > rand(), 1);
        x2(t, x_category) = 1;
        % generate x3 for this sample h
        x_category = find(transition_cum > rand(), 1);
        x3(t, x_category) = 1;
    end
93 Whiten The Samples. Second-order moment: $M_2 = \frac{1}{n}\sum_t x_1^t (x_2^t)^\top$. Whitening matrix: $W = U_w L_w^{-1/2}$, where $[U_w, L_w] = k\text{-SVD}(M_2)$. Whitened data: $y_1^t = W^\top x_1^t$; orthogonal basis: $V = W^\top A$ (up to component scaling), $V^\top V = I$. Whitening snippet:

    fprintf('The second order moment M2:\n');
    M2 = x1' * x2 / n
    [Uw, Lw, Vw] = svd(M2);
    fprintf('M2 singular values:\n');
    Lw
    W = Uw(:, 1:k) * sqrt(pinv(Lw(1:k, 1:k)));
    y1 = x1 * W; y2 = x2 * W; y3 = x3 * W;
94 Orthogonal Tensor Eigen Decomposition. Third-order moment: $T = \frac{1}{n}\sum_{t\in[n]} y_1^t \otimes y_2^t \otimes y_3^t \approx \sum_{i\in[k]} \lambda_i\, v_i \otimes v_i \otimes v_i$, with $V^\top V = I$. Power update: $T(I, v_1, v_1) = \frac{1}{n}\sum_{t\in[n]} \langle v_1, y_2^t\rangle \langle v_1, y_3^t\rangle\, y_1^t \approx \sum_i \lambda_i \langle v_i, v_1\rangle^2\, v_i = \lambda_1 v_1$; the $v_i$ are eigenvectors of the tensor $T$.
96 Orthogonal Tensor Eigen Decomposition. $T \approx \sum_j \lambda_j\, v_j^{\otimes 3}$, with update $v \mapsto \frac{T(I, v, v)}{\|T(I, v, v)\|}$. Power-iteration snippet (with deflation):

    V = zeros(k, k); Lambda = zeros(k, 1);
    for i = 1:k
        v_old = rand(k, 1); v_old = normc(v_old);
        for iter = 1 : Maxiter
            v_new = (y1' * ((y2 * v_old) .* (y3 * v_old))) / n;
            if i > 1
                % deflation: subtract previously found components
                for j = 1 : i-1
                    v_new = v_new - (V(:,j) * (v_old' * V(:,j))^2) * Lambda(j);
                end
            end
            lambda = norm(v_new); v_new = normc(v_new);
            if norm(v_old - v_new) < TOL
                fprintf('Converged at iteration %d.\n', iter);
                V(:,i) = v_new; Lambda(i,1) = lambda;
                break;
            end
            v_old = v_new;
        end
    end

[Figure: ground truth (green) vs. the estimate (red) converging over iterations 0, 1, ...]
97 Conclusion. [Graphical model: $h$ with children $x_1, x_2, x_3$.] Gaussian mixtures and topic models admit a unified linear representation, $\mathbb{E}[x \mid h] = Ah$, and can be learned through higher-order moments: tensor decomposition via whitening and the power method. Tomorrow's lecture: latent variable models and moments; analysis of the tensor power method. Wednesday's lecture: implementation of tensor methods. Code snippet available at
More informationMathematical foundations - linear algebra
Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar
More informationDimensionality Reduction
Dimensionality Reduction Le Song Machine Learning I CSE 674, Fall 23 Unsupervised learning Learning from raw (unlabeled, unannotated, etc) data, as opposed to supervised data where a classification of
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationPrincipal Component Analysis
Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand
More informationProbabilistic Time Series Classification
Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign
More informationEstimating Latent Variable Graphical Models with Moments and Likelihoods
Estimating Latent Variable Graphical Models with Moments and Likelihoods Arun Tejasvi Chaganty Percy Liang Stanford University June 18, 2014 Chaganty, Liang (Stanford University) Moments and Likelihoods
More informationClustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning
Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades
More informationSpectral Learning of Hidden Markov Models with Group Persistence
000 00 00 00 00 00 00 007 008 009 00 0 0 0 0 0 0 07 08 09 00 0 0 0 0 0 0 07 08 09 00 0 0 0 0 0 0 07 08 09 00 0 0 0 0 0 0 07 08 09 00 0 0 0 0 Abstract In this paper, we develop a general Method of Moments
More informationData Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis
Data Analysis and Manifold Learning Lecture 6: Probabilistic PCA and Factor Analysis Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inrialpes.fr http://perception.inrialpes.fr/ Outline of Lecture
More informationFactor Analysis (10/2/13)
STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.
More informationPrincipal Component Analysis
Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used
More informationMachine learning for pervasive systems Classification in high-dimensional spaces
Machine learning for pervasive systems Classification in high-dimensional spaces Department of Communications and Networking Aalto University, School of Electrical Engineering stephan.sigg@aalto.fi Version
More informationDATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD
DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary
More informationLecture 8: Linear Algebra Background
CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 8: Linear Algebra Background Lecturer: Shayan Oveis Gharan 2/1/2017 Scribe: Swati Padmanabhan Disclaimer: These notes have not been subjected
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationCSC411: Final Review. James Lucas & David Madras. December 3, 2018
CSC411: Final Review James Lucas & David Madras December 3, 2018 Agenda 1. A brief overview 2. Some sample questions Basic ML Terminology The final exam will be on the entire course; however, it will be
More informationECE 5984: Introduction to Machine Learning
ECE 5984: Introduction to Machine Learning Topics: (Finish) Expectation Maximization Principal Component Analysis (PCA) Readings: Barber 15.1-15.4 Dhruv Batra Virginia Tech Administrativia Poster Presentation:
More information26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.
10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works
CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The
More informationInformation Retrieval
Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices
More informationCS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery
CS168: The Modern Algorithmic Toolbox Lecture #10: Tensors, and Low-Rank Tensor Recovery Tim Roughgarden & Gregory Valiant May 3, 2017 Last lecture discussed singular value decomposition (SVD), and we
More informationCollaborative Filtering: A Machine Learning Perspective
Collaborative Filtering: A Machine Learning Perspective Chapter 6: Dimensionality Reduction Benjamin Marlin Presenter: Chaitanya Desai Collaborative Filtering: A Machine Learning Perspective p.1/18 Topics
More informationUnsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent
Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 8 Continuous Latent Variable
More informationUnsupervised Learning: Dimensionality Reduction
Unsupervised Learning: Dimensionality Reduction CMPSCI 689 Fall 2015 Sridhar Mahadevan Lecture 3 Outline In this lecture, we set about to solve the problem posed in the previous lecture Given a dataset,
More informationDimensionality Reduction
Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball
More informationLearning Topic Models and Latent Bayesian Networks Under Expansion Constraints
Learning Topic Models and Latent Bayesian Networks Under Expansion Constraints Animashree Anandkumar 1, Daniel Hsu 2, Adel Javanmard 3, and Sham M. Kakade 2 1 Department of EECS, University of California,
More informationCS 572: Information Retrieval
CS 572: Information Retrieval Lecture 11: Topic Models Acknowledgments: Some slides were adapted from Chris Manning, and from Thomas Hoffman 1 Plan for next few weeks Project 1: done (submit by Friday).
More informationIntroduction to Machine Learning. Introduction to ML - TAU 2016/7 1
Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger
More informationLecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora
princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora Scribe: Today we continue the
More informationClustering using Mixture Models
Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior
More information14 Singular Value Decomposition
14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing
More informationSTATS 306B: Unsupervised Learning Spring Lecture 13 May 12
STATS 306B: Unsupervised Learning Spring 2014 Lecture 13 May 12 Lecturer: Lester Mackey Scribe: Jessy Hwang, Minzhe Wang 13.1 Canonical correlation analysis 13.1.1 Recap CCA is a linear dimensionality
More informationDecember 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis
.. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make
More informationExpectation Maximization
Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationSVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)
Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular
More informationSum-of-Squares Method, Tensor Decomposition, Dictionary Learning
Sum-of-Squares Method, Tensor Decomposition, Dictionary Learning David Steurer Cornell Approximation Algorithms and Hardness, Banff, August 2014 for many problems (e.g., all UG-hard ones): better guarantees
More informationAppendix A. Proof to Theorem 1
Appendix A Proof to Theorem In this section, we prove the sample complexity bound given in Theorem The proof consists of three main parts In Appendix A, we prove perturbation lemmas that bound the estimation
More informationDS-GA 1002 Lecture notes 10 November 23, Linear models
DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.
More informationInformation retrieval LSI, plsi and LDA. Jian-Yun Nie
Information retrieval LSI, plsi and LDA Jian-Yun Nie Basics: Eigenvector, Eigenvalue Ref: http://en.wikipedia.org/wiki/eigenvector For a square matrix A: Ax = λx where x is a vector (eigenvector), and
More informationFourier PCA. Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech)
Fourier PCA Navin Goyal (MSR India), Santosh Vempala (Georgia Tech) and Ying Xiao (Georgia Tech) Introduction 1. Describe a learning problem. 2. Develop an efficient tensor decomposition. Independent component
More informationAn Introduction to Spectral Learning
An Introduction to Spectral Learning Hanxiao Liu November 8, 2013 Outline 1 Method of Moments 2 Learning topic models using spectral properties 3 Anchor words Preliminaries X 1,, X n p (x; θ), θ = (θ 1,
More informationIntroduction to Machine Learning
10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what
More informationLinear Algebra Background
CS76A Text Retrieval and Mining Lecture 5 Recap: Clustering Hierarchical clustering Agglomerative clustering techniques Evaluation Term vs. document space clustering Multi-lingual docs Feature selection
More informationCourse 495: Advanced Statistical Machine Learning/Pattern Recognition
Course 495: Advanced Statistical Machine Learning/Pattern Recognition Goal (Lecture): To present Probabilistic Principal Component Analysis (PPCA) using both Maximum Likelihood (ML) and Expectation Maximization
More informationStructure in Data. A major objective in data analysis is to identify interesting features or structure in the data.
Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two
More informationRobust Principal Component Analysis
ELE 538B: Mathematics of High-Dimensional Data Robust Principal Component Analysis Yuxin Chen Princeton University, Fall 2018 Disentangling sparse and low-rank matrices Suppose we are given a matrix M
More informationParametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a
Parametric Unsupervised Learning Expectation Maximization (EM) Lecture 20.a Some slides are due to Christopher Bishop Limitations of K-means Hard assignments of data points to clusters small shift of a
More informationLecture 8. Principal Component Analysis. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 13, 2016
Lecture 8 Principal Component Analysis Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 13, 2016 Luigi Freda ( La Sapienza University) Lecture 8 December 13, 2016 1 / 31 Outline 1 Eigen
More informationComputational functional genomics
Computational functional genomics (Spring 2005: Lecture 8) David K. Gifford (Adapted from a lecture by Tommi S. Jaakkola) MIT CSAIL Basic clustering methods hierarchical k means mixture models Multi variate
More informationGatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II
Gatsby Theoretical Neuroscience Lectures: Non-Gaussian statistics and natural images Parts I-II Gatsby Unit University College London 27 Feb 2017 Outline Part I: Theory of ICA Definition and difference
More informationIntelligent Data Analysis. Principal Component Analysis. School of Computer Science University of Birmingham
Intelligent Data Analysis Principal Component Analysis Peter Tiňo School of Computer Science University of Birmingham Discovering low-dimensional spatial layout in higher dimensional spaces - 1-D/3-D example
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 13: Learning in Gaussian Graphical Models, Non-Gaussian Inference, Monte Carlo Methods Some figures
More informationPrincipal components analysis COMS 4771
Principal components analysis COMS 4771 1. Representation learning Useful representations of data Representation learning: Given: raw feature vectors x 1, x 2,..., x n R d. Goal: learn a useful feature
More informationMachine Learning Techniques for Computer Vision
Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationDimension Reduction (PCA, ICA, CCA, FLD,
Dimension Reduction (PCA, ICA, CCA, FLD, Topic Models) Yi Zhang 10-701, Machine Learning, Spring 2011 April 6 th, 2011 Parts of the PCA slides are from previous 10-701 lectures 1 Outline Dimension reduction
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationMathematical Formulation of Our Example
Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot
More informationhttps://goo.gl/kfxweg KYOTO UNIVERSITY Statistical Machine Learning Theory Sparsity Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT OF INTELLIGENCE SCIENCE AND TECHNOLOGY 1 KYOTO UNIVERSITY Topics:
More informationSingular Value Decomposition and Principal Component Analysis (PCA) I
Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression
More informationComputer Vision Group Prof. Daniel Cremers. 14. Clustering
Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it
More informationPattern Classification
Pattern Classification Introduction Parametric classifiers Semi-parametric classifiers Dimensionality reduction Significance testing 6345 Automatic Speech Recognition Semi-Parametric Classifiers 1 Semi-Parametric
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: PCA and the Power Iteration Method
CS168: The Modern Algorithmic Toolbox Lecture #8: PCA and the Power Iteration Method Tim Roughgarden & Gregory Valiant April 15, 015 This lecture began with an extended recap of Lecture 7. Recall that
More informationLecture: Face Recognition and Feature Reduction
Lecture: Face Recognition and Feature Reduction Juan Carlos Niebles and Ranjay Krishna Stanford Vision and Learning Lab Lecture 11-1 Recap - Curse of dimensionality Assume 5000 points uniformly distributed
More informationWhat is Principal Component Analysis?
What is Principal Component Analysis? Principal component analysis (PCA) Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables Retains most
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network
More informationSTATS 306B: Unsupervised Learning Spring Lecture 2 April 2
STATS 306B: Unsupervised Learning Spring 2014 Lecture 2 April 2 Lecturer: Lester Mackey Scribe: Junyang Qian, Minzhe Wang 2.1 Recap In the last lecture, we formulated our working definition of unsupervised
More informationPattern Recognition. Parameter Estimation of Probability Density Functions
Pattern Recognition Parameter Estimation of Probability Density Functions Classification Problem (Review) The classification problem is to assign an arbitrary feature vector x F to one of c classes. The
More informationPrincipal Component Analysis
Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental
More informationLecture 2: Linear Algebra Review
EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1
More informationCSC 411 Lecture 12: Principal Component Analysis
CSC 411 Lecture 12: Principal Component Analysis Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 12-PCA 1 / 23 Overview Today we ll cover the first unsupervised
More informationEfficient Spectral Methods for Learning Mixture Models!
Efficient Spectral Methods for Learning Mixture Models Qingqing Huang 2016 February Laboratory for Information & Decision Systems Based on joint works with Munther Dahleh, Rong Ge, Sham Kakade, Greg Valiant.
More informationAnnouncements (repeat) Principal Components Analysis
4/7/7 Announcements repeat Principal Components Analysis CS 5 Lecture #9 April 4 th, 7 PA4 is due Monday, April 7 th Test # will be Wednesday, April 9 th Test #3 is Monday, May 8 th at 8AM Just hour long
More informationPCA & ICA. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani Dimensionality Reduction: Feature Selection vs. Feature Extraction Feature selection Select a subset of a given
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationFirst Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate
58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu
More informationMachine Learning for Data Science (CS4786) Lecture 12
Machine Learning for Data Science (CS4786) Lecture 12 Gaussian Mixture Models Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016fa/ Back to K-means Single link is sensitive to outliners We
More information