Appendix A. Proof to Theorem 1

In this section, we prove the sample complexity bound given in Theorem 1. The proof consists of three main parts. In Appendix A.1, we prove perturbation lemmas that bound the estimation error of the whitened tensors $M_2(W,W)$, $M_y(W,W)$ and $M_3(W,W,W)$ in terms of the estimation error of the tensors themselves. In Appendix A.2, we cite results on the accuracy of SVD and of the robust tensor power method when performed on estimated tensors, and prove the effectiveness of the power update method used to recover the linear regression model $\eta$. Finally, we give tail bounds for the estimation error of $M_2$, $M_y$ and $M_3$ in Appendix A.3 and complete the proof in Appendix A.4. We also make some remarks on the indirect quantities (e.g., $\sigma_k(\widetilde{O})$) used in Theorem 1 and give simplified bounds for some special cases in Appendix A.4. All norms in the following analysis, if not explicitly specified, are 2-norms in the vector and matrix cases and the operator norm in the high-order tensor case.

Appendix A.1 Perturbation lemmas

We first define the canonical topic distribution vectors $\widetilde{\mu}_i$ and the estimation errors of the observable tensors, which simplify the notation in the subsequent analysis.

Definition 1 (canonical topic distribution). Define the canonical version $\widetilde{\mu}_i$ of the topic distribution vector $\mu_i$ as follows:
$$\widetilde{\mu}_i \triangleq \sqrt{\frac{\alpha_i}{\alpha_0(\alpha_0+1)}}\,\mu_i. \qquad (1)$$
We also define $O, \widetilde{O} \in \mathbb{R}^{n\times k}$ by $O = [\mu_1,\ldots,\mu_k]$ and $\widetilde{O} = [\widetilde{\mu}_1,\ldots,\widetilde{\mu}_k]$.

Definition 2 (estimation error). Assume that for some real values $E_P$, $E_y$ and $E_T$, which we will set later,
$$\|M_2 - \widehat{M}_2\| \le E_P, \qquad (2)$$
$$\|M_y - \widehat{M}_y\| \le E_y, \qquad (3)$$
$$\|M_3 - \widehat{M}_3\| \le E_T. \qquad (4)$$

The following lemma analyzes the whitening matrix $W$ of $M_2$. Many of its conclusions are taken directly from [AFH+12].

Lemma 1 (Lemma C.1, [AFH+12]). Let $W, \widehat{W} \in \mathbb{R}^{n\times k}$ be whitening matrices such that $M_2(W,W) = \widehat{M}_2(\widehat{W},\widehat{W}) = I_k$. Let $A = W^\top\widetilde{O}$ and $\widehat{A} = \widehat{W}^\top\widetilde{O}$. Suppose $E_P \le \sigma_k(M_2)/2$. We have
$$\|W\| \le \frac{1}{\sigma_k(\widetilde{O})}, \qquad (5)$$
$$\|\widehat{W}\| \le \frac{2}{\sigma_k(\widetilde{O})}, \qquad (6)$$
$$\|W - \widehat{W}\| \le \frac{4E_P}{\sigma_k(\widetilde{O})^3}, \qquad (7)$$
$$\|W^+\| \le 3\sigma_1(\widetilde{O}), \qquad (8)$$
$$\|\widehat{W}^+\| \le 2\sigma_1(\widetilde{O}), \qquad (9)$$
$$\|W^+ - \widehat{W}^+\| \le \frac{6\sigma_1(\widetilde{O})}{\sigma_k(\widetilde{O})^2}\,E_P, \qquad (10)$$
$$\|A\| = 1, \qquad (11)$$
$$\|\widehat{A}\| \le 2, \qquad (12)$$
$$\|A - \widehat{A}\| \le \frac{4E_P}{\sigma_k(\widetilde{O})^2}, \qquad (13)$$
$$\|AA^\top - \widehat{A}\widehat{A}^\top\| \le \frac{12E_P}{\sigma_k(\widetilde{O})^2}. \qquad (14)$$

Proof. Proof of Eq. (7): let $\widehat{W}^\top\widehat{M}_2\widehat{W} = I$ and $\widehat{W}^\top M_2\widehat{W} = BDB^\top$, where $B$ is orthogonal and $D$ is a positive definite diagonal matrix. We then see that $W = \widehat{W}BD^{-1/2}B^\top$ satisfies the condition $W^\top M_2 W = I$. Subsequently, $\widehat{W} = WBD^{1/2}B^\top$. We can then bound $\|W - \widehat{W}\|$ as follows:
$$\|W - \widehat{W}\| \le \|W\|\,\|I - D^{1/2}\| \le \|W\|\,\|I - D\| \le \frac{4E_P}{\sigma_k(\widetilde{O})^3},$$
where the inequality $\|I - D\| \le 4E_P/\sigma_k(\widetilde{O})^2$ was proved in [AFH+12].

Proof of Eq. (14):
$$\|AA^\top - \widehat{A}\widehat{A}^\top\| \le \|AA^\top - A\widehat{A}^\top\| + \|A\widehat{A}^\top - \widehat{A}\widehat{A}^\top\| \le \|A - \widehat{A}\|\,(\|A\| + \|\widehat{A}\|) \le \frac{12E_P}{\sigma_k(\widetilde{O})^2}.$$
All the other inequalities come from Lemma C.1 of [AFH+12].
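For readers who want to reproduce the whitening step, the following is a minimal numpy sketch (ours, not the authors' code) of one standard way to obtain a matrix $\widehat{W}$ with $\widehat{M}_2(\widehat{W},\widehat{W}) = \widehat{W}^\top\widehat{M}_2\widehat{W} = I_k$ from the top-$k$ eigenpairs of the empirical second moment; the function name and signature are our own.

```python
import numpy as np

def whitening_matrix(M2_hat: np.ndarray, k: int) -> np.ndarray:
    """Return W_hat with W_hat.T @ M2_hat @ W_hat approximately equal to I_k.

    M2_hat is the (V x V) empirical second moment; k is the number of topics.
    Uses the top-k eigenpairs of the symmetric matrix M2_hat.
    """
    # Symmetrize to guard against numerical asymmetry in the empirical moment.
    M2_sym = 0.5 * (M2_hat + M2_hat.T)
    eigvals, eigvecs = np.linalg.eigh(M2_sym)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]            # indices of the k largest eigenvalues
    U, s = eigvecs[:, top], eigvals[top]
    return U / np.sqrt(s)                          # W_hat = U diag(s^{-1/2})

# Example with a synthetic canonical topic matrix O_tilde (V=500, k=5):
#   M2 = O_tilde @ O_tilde.T;  W = whitening_matrix(M2, 5)
# then W.T @ M2 @ W is (numerically) the 5 x 5 identity.
```

Any whitening matrix of this form is unique only up to a rotation, which is all that the analysis above requires.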

We are now able to provide perturbation bounds for the estimation error of the whitened moments.

Definition 3 (estimation error of whitened moments). Define
$$\varepsilon_{p,w} \triangleq \|M_2(W,W) - \widehat{M}_2(\widehat{W},\widehat{W})\|, \qquad (15)$$
$$\varepsilon_{y,w} \triangleq \|M_y(W,W) - \widehat{M}_y(\widehat{W},\widehat{W})\|, \qquad (16)$$
$$\varepsilon_{t,w} \triangleq \|M_3(W,W,W) - \widehat{M}_3(\widehat{W},\widehat{W},\widehat{W})\|. \qquad (17)$$

Lemma 2 (perturbation lemma for whitened moments). Suppose $E_P \le \sigma_k(M_2)/2$. We have
$$\varepsilon_{p,w} \le \frac{16E_P}{\sigma_k(\widetilde{O})^2}, \qquad (18)$$
$$\varepsilon_{y,w} \le \frac{24\|\eta\|E_P}{(\alpha_0+2)\,\sigma_k(\widetilde{O})^2} + \frac{4E_y}{\sigma_k(\widetilde{O})^2}, \qquad (19)$$
$$\varepsilon_{t,w} \le \frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\,\sigma_k(\widetilde{O})^5} + \frac{8E_T}{\sigma_k(\widetilde{O})^3}. \qquad (20)$$

Proof. Using the idea in the proof of Lemma C.1 in [AFH+12], we split $\varepsilon_{p,w}$ as
$$\varepsilon_{p,w} \le \|M_2(W,W) - M_2(\widehat{W},\widehat{W})\| + \|M_2(\widehat{W},\widehat{W}) - \widehat{M}_2(\widehat{W},\widehat{W})\|,$$
and bound the two terms separately. For the first term, we have
$$\|M_2(W,W) - M_2(\widehat{W},\widehat{W})\| = \|W^\top M_2 W - \widehat{W}^\top M_2\widehat{W}\| = \|AA^\top - \widehat{A}\widehat{A}^\top\| \le \frac{12E_P}{\sigma_k(\widetilde{O})^2},$$
where the last inequality comes from Eq. (14). For the second term, we have
$$\|M_2(\widehat{W},\widehat{W}) - \widehat{M}_2(\widehat{W},\widehat{W})\| \le \|\widehat{W}\|^2\,\|M_2 - \widehat{M}_2\| \le \frac{4E_P}{\sigma_k(\widetilde{O})^2},$$
where the last inequality comes from Eq. (6).

Similarly, $\varepsilon_{y,w}$ can be split into $\|M_y(W,W) - M_y(\widehat{W},\widehat{W})\|$ and $\|M_y(\widehat{W},\widehat{W}) - \widehat{M}_y(\widehat{W},\widehat{W})\|$, which can be bounded separately. For the first term, we have
$$\|M_y(W,W) - M_y(\widehat{W},\widehat{W})\| = \|W^\top M_y W - \widehat{W}^\top M_y\widehat{W}\| = \frac{1}{\alpha_0+2}\left\|A\,\mathrm{diag}(\eta)A^\top - \widehat{A}\,\mathrm{diag}(\eta)\widehat{A}^\top\right\| \le \frac{2\|\eta\|}{\alpha_0+2}\left\|AA^\top - \widehat{A}\widehat{A}^\top\right\| \le \frac{24\|\eta\|E_P}{(\alpha_0+2)\,\sigma_k(\widetilde{O})^2}.$$
For the second term, we have
$$\|M_y(\widehat{W},\widehat{W}) - \widehat{M}_y(\widehat{W},\widehat{W})\| \le \|\widehat{W}\|^2\,\|M_y - \widehat{M}_y\| \le \frac{4E_y}{\sigma_k(\widetilde{O})^2}.$$
Finally, we bound $\varepsilon_{t,w}$ as below, following [CL13]:
$$\varepsilon_{t,w} = \|M_3(W,W,W) - \widehat{M}_3(\widehat{W},\widehat{W},\widehat{W})\| \le \|M_3\|\,\|W - \widehat{W}\|\left(\|W\|^2 + \|W\|\|\widehat{W}\| + \|\widehat{W}\|^2\right) + \|\widehat{W}\|^3\,\|M_3 - \widehat{M}_3\| \le \frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\,\sigma_k(\widetilde{O})^5} + \frac{8E_T}{\sigma_k(\widetilde{O})^3},$$
where we have used the fact that
$$\|M_3\| \le \sum_{i=1}^k \frac{2\alpha_i}{\alpha_0(\alpha_0+1)(\alpha_0+2)} = \frac{2}{(\alpha_0+1)(\alpha_0+2)}.$$
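To make the scaling of Lemma 2 concrete, here is a small synthetic sanity check (our own, assuming numpy and a randomly generated $\widetilde{O}$) that compares the observed $\varepsilon_{y,w}$ against the bound of Eq. (19); it is an illustration only, not part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k, alpha0 = 300, 5, 1.0

O_tilde = np.abs(rng.normal(size=(V, k)))            # synthetic canonical topic matrix
eta = rng.normal(size=k)                             # synthetic regression coefficients
M2 = O_tilde @ O_tilde.T
My = O_tilde @ np.diag(eta) @ O_tilde.T / (alpha0 + 2)

def sym_noise(scale):
    """Symmetric perturbation with spectral norm equal to `scale`."""
    E = rng.normal(size=(V, V)); E = 0.5 * (E + E.T)
    return E * (scale / np.linalg.norm(E, 2))

sigma_k = np.linalg.svd(O_tilde, compute_uv=False)[-1]
M2_hat = M2 + sym_noise(0.05 * sigma_k**2)           # keeps E_P well below sigma_k(M2)/2
My_hat = My + sym_noise(0.05 * sigma_k**2)
E_P = np.linalg.norm(M2 - M2_hat, 2)
E_y = np.linalg.norm(My - My_hat, 2)

def whiten(M):
    """Exact whitening matrix of a symmetric PSD matrix (top-k eigenpairs)."""
    w, U = np.linalg.eigh(0.5 * (M + M.T))
    idx = np.argsort(w)[::-1][:k]
    return U[:, idx] / np.sqrt(w[idx])

W, W_hat = whiten(M2), whiten(M2_hat)
eps_yw = np.linalg.norm(W.T @ My @ W - W_hat.T @ My_hat @ W_hat, 2)
bound = 24 * np.linalg.norm(eta) * E_P / ((alpha0 + 2) * sigma_k**2) + 4 * E_y / sigma_k**2
print(f"observed eps_yw = {eps_yw:.4f}, Eq. (19) bound = {bound:.4f}")
```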

Appendix A.2 SVD accuracy

The key idea behind the spectral recovery of LDA topic models is the simultaneous diagonalization trick, which asserts that we can recover the LDA model parameters by performing orthogonal tensor decomposition on a pair of simultaneously whitened moments, for example $(M_2, M_3)$ or $(M_2, M_y)$. The following proposition details this insight, as we derive orthogonal tensor decompositions for the whitened tensors $M_y(W,W)$ and $M_3(W,W,W)$.

Proposition 2. Define $v_i \triangleq W^\top\widetilde{\mu}_i = \sqrt{\frac{\alpha_i}{\alpha_0(\alpha_0+1)}}\,W^\top\mu_i$. Then:
1. $\{v_i\}_{i=1}^k$ is an orthonormal basis.
2. $M_y(W,W)$ has singular value/singular vector pairs $(\sigma_i^y, v_i)$ with $\sigma_i^y = \frac{\eta_j}{\alpha_0+2}$ for some $j \in [k]$.
3. $M_3(W,W,W)$ has robust eigenvalue/eigenvector pairs [AGH+12] $(\lambda_i, v_i)$ with $\lambda_i = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_j}}$ for some $j \in [k]$.

Proof. The orthonormality of $\{v_i\}_{i=1}^k$ follows from the fact that $W^\top M_2 W = \sum_{i=1}^k v_iv_i^\top = I_k$. Subsequently, we have
$$M_y(W,W) = \frac{1}{\alpha_0+2}\sum_{i=1}^k \eta_i\,v_iv_i^\top,$$
$$M_3(W,W,W) = \frac{2}{\alpha_0+2}\sum_{i=1}^k \sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_i}}\;v_i\otimes v_i\otimes v_i.$$
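The multilinear notation $M(W,W)$ and $M(W,W,W)$ used above denotes contraction of every mode of a matrix or a third-order tensor with $W$; a short numpy sketch of these contractions (our own helper functions, not code from the paper) is:

```python
import numpy as np

def contract2(M2: np.ndarray, W: np.ndarray) -> np.ndarray:
    """M2(W, W) = W^T M2 W, a k x k matrix."""
    return W.T @ M2 @ W

def contract3(M3: np.ndarray, W: np.ndarray) -> np.ndarray:
    """M3(W, W, W): contract the V x V x V tensor M3 with W along all three modes,
    giving a k x k x k tensor whose (a, b, c) entry is
    sum_{i,j,l} M3[i, j, l] * W[i, a] * W[j, b] * W[l, c]."""
    return np.einsum('ijl,ia,jb,lc->abc', M3, W, W, W)

# Under Proposition 2, contract2(My, W) equals (1/(alpha0+2)) * sum_i eta_i v_i v_i^T, and
# contract3(M3, W) equals (2/(alpha0+2)) * sum_i sqrt(alpha0(alpha0+1)/alpha_i) v_i (x) v_i (x) v_i.
```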

The following lemmas (Lemma 3 and Lemma 4) give upper bounds on the estimation errors of $\eta$ and $\mu$ in terms of $|\hat\lambda_i - \lambda_i|$, $\|\hat v_i - v_i\|$ and the estimation errors of the whitened moments defined in Definition 3.

Lemma 3 ($\eta_i$ estimation error bound). Define $\hat\eta_i \triangleq (\alpha_0+2)\,\hat v_i^\top\widehat{M}_y(\widehat{W},\widehat{W})\,\hat v_i$, where $\hat v_i$ is some estimate of $v_i$. We then have
$$|\hat\eta_i - \eta_i| \le 2\|\eta\|\,\|\hat v_i - v_i\| + (\alpha_0+2)\bigl(1 + 2\|\hat v_i - v_i\|\bigr)\,\varepsilon_{y,w}. \qquad (21)$$

Proof. First, note that $v_i^\top M_y(W,W)v_i = \frac{\eta_i}{\alpha_0+2}$ because $\{v_i\}_{i=1}^k$ are orthonormal. Subsequently, we have
$$\frac{|\hat\eta_i - \eta_i|}{\alpha_0+2} = \left|\hat v_i^\top\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i - v_i^\top M_y(W,W)v_i\right| = \left|(\hat v_i - v_i)^\top\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i + v_i^\top\bigl(\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i - M_y(W,W)v_i\bigr)\right|$$
$$\le \|\hat v_i - v_i\|\,\|\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i\| + \|v_i\|\,\|\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i - M_y(W,W)v_i\|.$$
Note that both $v_i$ and $\hat v_i$ are unit vectors. Therefore,
$$\frac{|\hat\eta_i - \eta_i|}{\alpha_0+2} \le \|\widehat{M}_y(\widehat{W},\widehat{W})\|\,\|\hat v_i - v_i\| + \|\widehat{M}_y(\widehat{W},\widehat{W})\|\,\|\hat v_i - v_i\| + \|\widehat{M}_y(\widehat{W},\widehat{W}) - M_y(W,W)\|\,\|v_i\| \le \|\hat v_i - v_i\|\left(\frac{2\|\eta\|}{\alpha_0+2} + 2\varepsilon_{y,w}\right) + \varepsilon_{y,w}.$$
The last inequality is due to the fact that $\|M_y(W,W)\| \le \frac{\|\eta\|}{\alpha_0+2}$.

Lemma 4 ($\mu_i$ estimation error bound). Define $\hat\mu_i \triangleq \frac{\alpha_0+2}{2}\,\hat\lambda_i(\widehat{W}^+)^\top\hat v_i$, where $(\hat\lambda_i, \hat v_i)$ are some estimates of the eigenvalue/eigenvector pairs $(\lambda_i, v_i)$ of $M_3(W,W,W)$. We then have
$$\|\hat\mu_i - \mu_i\| \le \frac{3(\alpha_0+2)}{2}\,\sigma_1(\widetilde{O})\,|\hat\lambda_i - \lambda_i| + 3\alpha_{\max}\sigma_1(\widetilde{O})\,\|\hat v_i - v_i\| + \frac{6\alpha_{\max}\sigma_1(\widetilde{O})E_P}{\sigma_k(\widetilde{O})^2}. \qquad (22)$$

Proof. First note that $\mu_i = \frac{\alpha_0+2}{2}\,\lambda_i(W^+)^\top v_i$. Subsequently,
$$\frac{2}{\alpha_0+2}\,\|\hat\mu_i - \mu_i\| = \|\hat\lambda_i(\widehat{W}^+)^\top\hat v_i - \lambda_i(W^+)^\top v_i\| \le \|\hat\lambda_i\widehat{W}^+ - \lambda_iW^+\|\,\|\hat v_i\| + |\lambda_i|\,\|W^+\|\,\|\hat v_i - v_i\|$$
$$\le |\hat\lambda_i - \lambda_i|\,\|\widehat{W}^+\| + |\lambda_i|\,\|\widehat{W}^+ - W^+\| + |\lambda_i|\,\|W^+\|\,\|\hat v_i - v_i\| \le 3\sigma_1(\widetilde{O})\,|\hat\lambda_i - \lambda_i| + \frac{2\alpha_{\max}}{\alpha_0+2}\cdot\frac{6\sigma_1(\widetilde{O})E_P}{\sigma_k(\widetilde{O})^2} + \frac{2\alpha_{\max}}{\alpha_0+2}\cdot 3\sigma_1(\widetilde{O})\,\|\hat v_i - v_i\|.$$
Multiplying both sides by $\frac{\alpha_0+2}{2}$ yields Eq. (22).
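Lemmas 3 and 4 analyze the two recovery rules $\hat\eta_i = (\alpha_0+2)\hat v_i^\top\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i$ and $\hat\mu_i = \frac{\alpha_0+2}{2}\hat\lambda_i(\widehat{W}^+)^\top\hat v_i$. A small numpy sketch of these updates (our own illustration; the whitening matrix and the estimated eigenpairs are assumed to be given) is:

```python
import numpy as np

def recover_eta(My_hat_w: np.ndarray, v_hat: np.ndarray, alpha0: float) -> np.ndarray:
    """Power update of Lemma 3: eta_hat_i = (alpha0 + 2) * v_hat_i^T My_hat(W_hat, W_hat) v_hat_i.

    My_hat_w : k x k whitened empirical moment My_hat(W_hat, W_hat).
    v_hat    : k x k matrix whose columns are the estimated eigenvectors v_hat_i.
    """
    return (alpha0 + 2) * np.einsum('ki,kl,li->i', v_hat, My_hat_w, v_hat)

def recover_mu(W_hat: np.ndarray, lam_hat: np.ndarray, v_hat: np.ndarray, alpha0: float) -> np.ndarray:
    """Topic recovery of Lemma 4: mu_hat_i = ((alpha0 + 2)/2) * lam_hat_i * (W_hat^+)^T v_hat_i.

    Returns a V x k matrix whose i-th column is mu_hat_i (unnormalized).
    """
    W_pinv = np.linalg.pinv(W_hat)                 # k x V pseudo-inverse of the whitening matrix
    return ((alpha0 + 2) / 2.0) * (W_pinv.T @ v_hat) * lam_hat
```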

To bound the error of the orthogonal tensor decomposition performed on the estimated tensor $\widehat{M}_3(\widehat{W},\widehat{W},\widehat{W})$, we cite Theorem 5.1 of [AGH+12], a sample complexity analysis of the robust tensor power method that we use to recover $\lambda_i$ and $v_i$.

Lemma 5 (Theorem 5.1, [AGH+12]). Let $\lambda_{\max} = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\min}}}$ and $\lambda_{\min} = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\max}}}$, where $\alpha_{\min} = \min_i\alpha_i$ and $\alpha_{\max} = \max_i\alpha_i$. Then there exist universal constants $C_1, C_2 > 0$ such that the following holds: fix $\delta' \in (0,1)$, and suppose $\varepsilon_{t,w} \le \varepsilon$ and
$$\varepsilon_{t,w} \le C_1\cdot\frac{\lambda_{\min}}{k}. \qquad (23)$$
Suppose $\{(\hat\lambda_i, \hat v_i)\}_{i=1}^k$ are the eigenvalue/eigenvector pairs returned by running Algorithm 1 in [AGH+12] with input $\widehat{M}_3(\widehat{W},\widehat{W},\widehat{W})$, using $L = \mathrm{poly}(k)\log(1/\delta')$ restarts and $C_2\bigl(\log(k) + \log\log(\lambda_{\max}/\varepsilon)\bigr)$ iterations. With probability greater than $1-\delta'$, there exists a permutation $\pi: [k]\to[k]$ such that for all $i$,
$$\|\hat v_i - v_{\pi(i)}\| \le 8\varepsilon/\lambda_{\min}, \qquad |\hat\lambda_i - \lambda_{\pi(i)}| \le 5\varepsilon.$$
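Algorithm 1 of [AGH+12], referenced in Lemma 5, is the robust tensor power method. The following is only a rough Python sketch of its core loop (random restarts, power iterations and deflation) to convey the idea; it is not the exact procedure analyzed in the lemma, and the default values of the restart and iteration counts are placeholders.

```python
import numpy as np

def tensor_power_method(T: np.ndarray, k: int, L: int = 50, n_iter: int = 100, seed: int = 0):
    """Return (lams, vecs): estimated robust eigenpairs of a symmetric k x k x k tensor T.

    Minimal sketch: for each of the k factors, run n_iter power iterations from L random
    starts, keep the best start, then deflate. Not the exact Algorithm 1 of [AGH+12].
    """
    rng = np.random.default_rng(seed)
    lams, vecs = [], []
    for _ in range(k):
        best_v, best_lam = None, -np.inf
        for _ in range(L):
            v = rng.normal(size=T.shape[0]); v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v = np.einsum('ijl,j,l->i', T, v, v)     # power update T(I, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijl,i,j,l->', T, v, v, v)   # eigenvalue estimate T(v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam); vecs.append(best_v)
        T = T - best_lam * np.einsum('i,j,l->ijl', best_v, best_v, best_v)   # deflation
    return np.array(lams), np.column_stack(vecs)
```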

Appendix A.3 Tail inequalities

Lemma 6 (Lemma 5, [CL13]). Let $x_1,\ldots,x_N \in \mathbb{R}^d$ be i.i.d. samples from some distribution with bounded support (i.e., $\|x\| \le B$ with probability 1 for some constant $B$). Then with probability at least $1-\delta$,
$$\left\|\frac{1}{N}\sum_{i=1}^N x_i - \mathbb{E}[x]\right\| \le \frac{B}{\sqrt{N}}\left(1 + \sqrt{\log(1/\delta)}\right).$$

Corollary 1. Let $x_1,\ldots,x_N \in \mathbb{R}^d$ be i.i.d. samples from some distribution with $\Pr[\|x\| \le B] \ge 1-\delta'$. Then with probability at least $1 - \delta - N\delta'$,
$$\left\|\frac{1}{N}\sum_{i=1}^N x_i - \mathbb{E}[x]\right\| \le \frac{B}{\sqrt{N}}\left(1 + \sqrt{\log(1/\delta)}\right).$$
Proof. Use a union bound.

Lemma 7 (concentration of moment norms). Suppose we obtain $N$ i.i.d. samples (i.e., documents with at least three words each, together with their regression variables, in the sLDA model). Define $R(\delta) \triangleq \|\eta\| + \Phi^{-1}(1-\delta)$, where $\Phi^{-1}(\cdot)$ is the inverse of the CDF of the standard Gaussian distribution. Let $\mathbb{E}[\cdot]$ denote the mean under the true underlying distribution and $\widehat{\mathbb{E}}[\cdot]$ the empirical mean. Then
1. $\Pr\left[\|\mathbb{E}[x_1] - \widehat{\mathbb{E}}[x_1]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
2. $\Pr\left[\|\mathbb{E}[x_1\otimes x_2] - \widehat{\mathbb{E}}[x_1\otimes x_2]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
3. $\Pr\left[\|\mathbb{E}[x_1\otimes x_2\otimes x_3] - \widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
4. $\Pr\left[|\mathbb{E}[y] - \widehat{\mathbb{E}}[y]| < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
5. $\Pr\left[\|\mathbb{E}[y\,x_1] - \widehat{\mathbb{E}}[y\,x_1]\|_F < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
6. $\Pr\left[\|\mathbb{E}[y\,x_1\otimes x_2] - \widehat{\mathbb{E}}[y\,x_1\otimes x_2]\|_F < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$.

Proof. Use Lemma 6, and Corollary 1 for the concentration bounds involving the regression variable $y$.

Corollary 2. With probability $1-\delta$ the following holds:
1. $E_P = \|M_2 - \widehat{M}_2\| \le \frac{3\left(1+\sqrt{\log(6/\delta)}\right)}{\sqrt{N}}$;
2. $E_y = \|M_y - \widehat{M}_y\| \le \frac{10\,R(\delta/60\sigma)\left(1+\sqrt{\log(15/\delta)}\right)}{\sqrt{N}}$;
3. $E_T = \|M_3 - \widehat{M}_3\| \le \frac{12\left(1+\sqrt{\log(9/\delta)}\right)}{\sqrt{N}}$.

Proof. Corollary 2 can be proved by expanding each moment by definition and then applying the tail inequalities of Lemma 7 together with a union bound. Also note that $\|\cdot\| \le \|\cdot\|_F$ for all matrices.
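As a purely illustrative check of the $1/\sqrt{N}$ scaling in Lemma 6 (our own Monte Carlo experiment, assuming numpy), one can sample bounded random vectors and compare the deviation of their empirical mean with $B(1+\sqrt{\log(1/\delta)})/\sqrt{N}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, B, delta, trials = 50, 2000, 1.0, 0.05, 200

def sample(n):
    """Bounded i.i.d. vectors: uniform on the sphere of radius B, so E[x] = 0 and ||x|| = B."""
    x = rng.normal(size=(n, d))
    return B * x / np.linalg.norm(x, axis=1, keepdims=True)

bound = B * (1 + np.sqrt(np.log(1 / delta))) / np.sqrt(N)
deviations = [np.linalg.norm(sample(N).mean(axis=0)) for _ in range(trials)]
print(f"fraction of trials within the Lemma 6 bound: {np.mean(np.array(deviations) <= bound):.3f}")
```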

Appendix A.4 Completing the proof

We are now ready to give a complete proof of Theorem 1.

Theorem 1 (sample complexity bound). Let $\sigma_1(\widetilde{O})$ and $\sigma_k(\widetilde{O})$ be the largest and the smallest singular values of the canonical topic distribution matrix $\widetilde{O}$. Define $\lambda_{\min} \triangleq \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\max}}}$ and $\lambda_{\max} \triangleq \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\min}}}$, where $\alpha_{\max}$ and $\alpha_{\min}$ are the largest and the smallest entries of $\alpha$. Suppose $\hat\mu$, $\hat\alpha$ and $\hat\eta$ are the outputs of Algorithm 1, and $L$ is at least a linear function of $k$. Fix $\delta \in (0,1)$. For any small error-tolerance parameter $\varepsilon > 0$, if Algorithm 1 is run with parameter $T = \Omega(\log(k) + \log\log(\lambda_{\max}/\varepsilon))$ on $N$ i.i.d. sampled documents with $N \ge \max(n_1, n_2, n_3)$, where
$$n_1 = C_1\left(1 + \sqrt{\log(6/\delta)}\right)^2\cdot\frac{\alpha_0^2(\alpha_0+1)^2}{\alpha_{\min}^2}, \qquad (24)$$
$$n_2 = C_2\cdot\frac{\left(1 + \sqrt{\log(15/\delta)}\right)^2}{\varepsilon^2\,\sigma_k(\widetilde{O})^4}\cdot\max\left(\bigl(\|\eta\| + \Phi^{-1}(1-\delta/60\sigma)\bigr)^2,\ \alpha_{\max}^2\sigma_1(\widetilde{O})^2\right), \qquad (25)$$
$$n_3 = C_3\cdot\frac{\left(1 + \sqrt{\log(9/\delta)}\right)^2}{\sigma_k(\widetilde{O})^{10}}\cdot\max\left(\frac{1}{\varepsilon^2},\ \frac{k^2}{\lambda_{\min}^2}\right), \qquad (26)$$
and $C_1$, $C_2$ and $C_3$ are universal constants, then with probability at least $1-\delta$ there exists a permutation $\pi: [k]\to[k]$ such that for every topic $i$ the following holds:
1. $|\hat\alpha_i - \alpha_{\pi(i)}| \le \dfrac{4\alpha_0(\alpha_0+1)(2\lambda_{\max}+5\varepsilon)}{(\alpha_0+2)^2\lambda_{\min}^2(\lambda_{\min}-5\varepsilon)^2}\,5\varepsilon$, if $\lambda_{\min} > 5\varepsilon$;
2. $\|\hat\mu_i - \mu_{\pi(i)}\| \le \left(3\sigma_1(\widetilde{O})\left(\dfrac{8\alpha_{\max}}{\lambda_{\min}} + \dfrac{5(\alpha_0+2)}{2}\right) + 1\right)\varepsilon$;
3. $|\hat\eta_i - \eta_{\pi(i)}| \le \left(\dfrac{16\|\eta\|}{\lambda_{\min}} + 2(\alpha_0+2)\right)\varepsilon$.

Proof. First, the assumption $E_P \le \sigma_k(M_2)/2$ is required for the error bounds on $\varepsilon_{p,w}$, $\varepsilon_{y,w}$ and $\varepsilon_{t,w}$. Noting Corollary 2 and the fact that $\sigma_k(M_2) = \frac{\alpha_{\min}}{\alpha_0(\alpha_0+1)}$, we have
$$N = \Omega\left(\frac{\alpha_0^2(\alpha_0+1)^2}{\alpha_{\min}^2}\left(1 + \sqrt{\log(6/\delta)}\right)^2\right).$$
Note that this lower bound does not depend on $k$, $\varepsilon$ or $\sigma_k(\widetilde{O})$. For Lemma 5 to hold, we need the assumption that $\varepsilon_{t,w} \le \min(\varepsilon, O(\lambda_{\min}/k))$. This implies Eq. (26), as we expand $\varepsilon_{t,w}$ according to Definition 3 and note that the first term $\frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\sigma_k(\widetilde{O})^5}$ dominates the second one. The $\alpha_0$ is missing in Eq. (26) because $\alpha_0+1, \alpha_0+2 \ge 1$ and we discard them both. The $|\hat\alpha_i - \alpha_{\pi(i)}|$ bound follows immediately from Lemma 5 and the recovery rule $\hat\alpha_i = \frac{4\alpha_0(\alpha_0+1)}{(\alpha_0+2)^2\hat\lambda_i^2}$.

To bound the estimation error for the linear classifier $\eta$, we need to further bound $\varepsilon_{y,w}$; we assume $\varepsilon_{y,w} \le \varepsilon$. Expanding $\varepsilon_{y,w}$ according to Definition 3 in a similar manner, we obtain the $(\|\eta\| + \Phi^{-1}(1-\delta/60\sigma))^2$ term in Eq. (25). The bound on $|\hat\eta_i - \eta_{\pi(i)}|$ then follows immediately from Lemma 3. Finally, we bound $\|\hat\mu_i - \mu_{\pi(i)}\|$ using Lemma 4. We need to assume that $\frac{6\alpha_{\max}\sigma_1(\widetilde{O})E_P}{\sigma_k(\widetilde{O})^2} \le \varepsilon$, which gives the $\alpha_{\max}^2\sigma_1(\widetilde{O})^2$ term in Eq. (25). The $\|\hat\mu_i - \mu_{\pi(i)}\|$ bound then follows from Lemma 4 and Lemma 5.

We make some remarks on the main theorem. In Remark 1, we establish links between the indirect quantities appearing in Theorem 1 (e.g., $\sigma_k(\widetilde{O})$) and functions of the original model parameters (e.g., $\sigma_k(O)$). These connections follow directly from the definitions.

Remark 1. The indirect quantities $\sigma_1(\widetilde{O})$ and $\sigma_k(\widetilde{O})$ can be related to $\sigma_1(O)$, $\sigma_k(O)$ and $\alpha$ in the following way:
$$\sqrt{\frac{\alpha_{\min}}{\alpha_0(\alpha_0+1)}}\,\sigma_k(O) \le \sigma_k(\widetilde{O}) \le \sqrt{\frac{\alpha_{\max}}{\alpha_0(\alpha_0+1)}}\,\sigma_k(O); \qquad \sigma_1(\widetilde{O}) \le \sqrt{\frac{\alpha_{\max}}{\alpha_0(\alpha_0+1)}}\,\sigma_1(O) \le \frac{\sigma_1(O)}{\sqrt{\alpha_0+1}}.$$

We now take a closer look at the sample complexity bound in Theorem 1. It is evident that $n_2$ can be neglected when the number of topics $k$ gets large, because in practice the norm of the linear regression model $\eta$ is usually assumed to be small in order to avoid overfitting. Moreover, as mentioned before, the prior parameter $\alpha$ is often assumed to be homogeneous, with $\alpha_i = 1/k$ [SG07]. With these observations, the sample complexity bound in Theorem 1 can be greatly simplified.

Remark 2. Assume $\|\eta\|$ and $\sigma$ are small and $\alpha = (1/k, \ldots, 1/k)$. As the number of topics $k$ gets large, the sample complexity bound in Theorem 1 can be simplified as
$$N = \Omega\left(\frac{\log(1/\delta)}{\sigma_k(\widetilde{O})^{10}}\cdot\max(\varepsilon^{-2}, k^3)\right). \qquad (27)$$

The sample complexity bound in Remark 2 may look formidable, as it depends on $\sigma_k(\widetilde{O})^{10}$. However, such a dependence is somewhat necessary because we are using third-order tensors to recover the underlying model parameters. Furthermore, the dependence on $\sigma_k(\widetilde{O})^{10}$ is introduced by the robust tensor power method used to recover the LDA parameters; the reconstruction accuracy of $\eta$ only depends on $\sigma_k(\widetilde{O})^4$ and $(\|\eta\| + \Phi^{-1}(1-\delta/60\sigma))^2$. As a consequence, if we could combine our power update method for $\eta$ with an LDA inference algorithm that has a milder dependence on the singular value $\sigma_k(\widetilde{O})$, we might obtain an algorithm with a better sample complexity. Such an extension is discussed in Appendix C.

Appendix B. Moments of Observable Variables

Appendix B.1 Proof to Proposition 1

The equations for $M_2$ and $M_3$ have already been proved in [AFH+12] and [AGH+12]. Here we only give the proof of the equation for $M_y$; in fact, all three equations can be proved in a similar manner. In sLDA the topic mixing vector $h$ follows a Dirichlet prior distribution with parameter $\alpha$. Therefore, we have
$$\mathbb{E}[h_i] = \frac{\alpha_i}{\alpha_0}, \qquad \mathbb{E}[h_ih_j] = \begin{cases}\dfrac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)}, & i = j,\\[4pt] \dfrac{\alpha_i\alpha_j}{\alpha_0(\alpha_0+1)}, & i \ne j,\end{cases} \qquad \mathbb{E}[h_ih_jh_l] = \begin{cases}\dfrac{\alpha_i(\alpha_i+1)(\alpha_i+2)}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i = j = l,\\[4pt] \dfrac{\alpha_i(\alpha_i+1)\alpha_l}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i = j \ne l,\\[4pt] \dfrac{\alpha_i\alpha_j\alpha_l}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i \ne j,\ j \ne l,\ i \ne l.\end{cases} \qquad (28)$$
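The Dirichlet moments in Eq. (28) are easy to confirm by simulation; a quick numpy check of the second-order case (our own, for illustration only) is:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.4, 0.3, 0.2, 0.1])
alpha0 = alpha.sum()
H = rng.dirichlet(alpha, size=200000)          # rows are i.i.d. draws of the topic mixing vector h

# Second moment E[h_i h_j]: empirical average vs. the closed form of Eq. (28).
emp = (H[:, :, None] * H[:, None, :]).mean(axis=0)
exact = np.outer(alpha, alpha) / (alpha0 * (alpha0 + 1))
exact[np.diag_indices_from(exact)] = alpha * (alpha + 1) / (alpha0 * (alpha0 + 1))
print(np.abs(emp - exact).max())               # close to zero up to Monte Carlo error
```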

Next, note that
$$\mathbb{E}[y\,|\,h] = \eta^\top h, \qquad \mathbb{E}[x_1\,|\,h] = \sum_{i=1}^k h_i\mu_i, \qquad \mathbb{E}[x_1\otimes x_2\,|\,h] = \sum_{i,j=1}^k h_ih_j\,\mu_i\otimes\mu_j, \qquad (29)$$
$$\mathbb{E}[y\,x_1\otimes x_2\,|\,h] = \sum_{i,j,l=1}^k h_ih_jh_l\,\eta_l\,\mu_i\otimes\mu_j. \qquad (30)$$
Proposition 1 can then be proved easily by taking the expectation over the topic mixing vector $h$.

Appendix B.2 Details of the speed-up trick

In this section we provide details of the trick mentioned in the main paper to speed up the computation of the empirical moments. First, note that the computation of $M_1$, $M_2$ and $M_y$ only requires $O(NM^2)$ time and $O(V^2)$ space; they do not need to be accelerated in most practical applications. This time and space complexity also applies to all terms in $M_3$ except the $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]$ term, which requires $O(NM^3)$ time and $O(V^3)$ space in a naive implementation. Therefore, this section is devoted to speeding up the computation of $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]$. More precisely, as mentioned in the main paper, what we want to compute is the whitened empirical moment $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3](\widehat{W},\widehat{W},\widehat{W}) \in \mathbb{R}^{k\times k\times k}$.

Fix a document $D$ with $m$ words. Let $T \triangleq \widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3\,|\,D]$ be the required empirical tensor. By definition, we have
$$T_{i,j,l} = \frac{1}{m(m-1)(m-2)}\begin{cases} n_i(n_i-1)(n_i-2), & i = j = l;\\ n_i(n_i-1)n_l, & i = j,\ j \ne l;\\ n_in_j(n_j-1), & j = l,\ i \ne j;\\ n_in_j(n_i-1), & i = l,\ i \ne j;\\ n_in_jn_l, & \text{otherwise},\end{cases} \qquad (31)$$
where $n_i$ is the number of occurrences of the $i$-th word in document $D$. If we had $T_{i,j,l} = \frac{n_in_jn_l}{m(m-1)(m-2)}$ for all indices $i$, $j$ and $l$, then we would only need to compute
$$T(W,W,W) = \frac{1}{m(m-1)(m-2)}\,(W^\top n)^{\otimes 3}, \qquad (32)$$
where $n \triangleq (n_1, n_2, \ldots, n_V)^\top$. This takes $O(Mk + k^3)$ computational time per document, because $n$ contains at most $M$ non-zero entries, and the total time complexity is thus reduced from $O(NM^3)$ to $O(N(Mk + k^3))$.

We now consider the remaining entries, where at least two indices are identical. We first consider the entries with exactly two equal indices, for example $i = j$. For these indices we need to subtract an $n_in_l$ term, as shown in Eq. (31). That is, we need to compute the whitened tensor $\Delta(W,W,W)$, where $\Delta \in \mathbb{R}^{V\times V\times V}$ and
$$\Delta_{i,j,l} = \frac{1}{m(m-1)(m-2)}\begin{cases} n_in_l, & i = j;\\ 0, & \text{otherwise}.\end{cases}$$
Note that $\Delta$ can be written as $\frac{1}{m(m-1)(m-2)}\,A\otimes n$, where $A = \mathrm{diag}(n_1, n_2, \ldots, n_V)$ is a $V\times V$ diagonal matrix and $n = (n_1, n_2, \ldots, n_V)^\top$ is defined as above. As a result,
$$\Delta(W,W,W) = \frac{1}{m(m-1)(m-2)}\,(W^\top AW)\otimes(W^\top n).$$
So the computational complexity of $\Delta(W,W,W)$ depends on how we compute $W^\top AW$. Since $A$ is a diagonal matrix with at most $M$ non-zero entries, $W^\top AW$ can be computed in $O(Mk^2)$ operations. Therefore, the time complexity of computing $\Delta(W,W,W)$ is $O(Mk^2)$ per document; the cases $j = l$ and $i = l$ are handled in the same way.

Finally, we handle the entries with all three indices equal, that is, $i = j = l$. As indicated by Eq. (31), we need to add a $\frac{2n_i}{m(m-1)(m-2)}$ term for compensation. This can be done efficiently by first computing $\widehat{\mathbb{E}}\!\left[\frac{2n_i}{m(m-1)(m-2)}\right]$ over all documents (requiring $O(NV)$ time), and then adding the whitened compensation term, which takes $O(Vk^3)$ operations.
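Putting the pieces together, a per-document sketch of the whitened third-moment computation described above (our own implementation of the trick; the count vector $n$ and the whitening matrix $W$ are assumed given) is:

```python
import numpy as np

def whitened_third_moment(n: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Whitened per-document tensor T(W, W, W) for the empirical tensor T of Eq. (31).

    n : length-V word-count vector of one document (m = n.sum() >= 3 words)
    W : V x k whitening matrix
    With sparse counts this runs in O(M k^2 + k^3) time per document, plus the
    O(V k^3) compensation term (which the text amortizes over all documents).
    """
    m = n.sum()
    scale = 1.0 / (m * (m - 1) * (m - 2))
    Wn = W.T @ n                      # W^T n, a k-vector
    WAW = (W.T * n) @ W               # W^T diag(n) W, a k x k matrix

    # Rank-one part (W^T n)^{x3}, Eq. (32).
    T = np.einsum('a,b,c->abc', Wn, Wn, Wn)

    # Subtract the corrections for the index patterns i=j, j=l and i=l.
    T -= np.einsum('ab,c->abc', WAW, Wn)
    T -= np.einsum('a,bc->abc', Wn, WAW)
    T -= np.einsum('ac,b->abc', WAW, Wn)

    # Add the compensation 2 * sum_i n_i w_i x w_i x w_i for i=j=l
    # (done per document here for clarity; the text averages 2 n_i / (m(m-1)(m-2))
    # over documents first so that the O(V k^3) contraction is performed only once).
    T += 2 * np.einsum('i,ia,ib,ic->abc', n, W, W, W)
    return scale * T
```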

Appendix C. Discussions

Appendix C.1 Extension to other topic recovery algorithms

One important advantage of our proposed inference algorithm is its flexibility: the algorithm can be combined with many other LDA inference algorithms to infer supervised LDA model parameters. More specifically, given access to any algorithm that recovers the topic distribution matrix $O$ and the prior parameter $\alpha$ from i.i.d. sampled documents, an inference algorithm for the supervised LDA model can be immediately obtained, as shown in Algorithm 2.

Algorithm 2: sLDA parameter recovery based on an existing LDA inference algorithm A. Input parameter: $\alpha_0$.
1: Compute empirical moments and obtain $\widehat{M}_2$ and $\widehat{M}_y$. Set $\hat\eta = 0$.
2: Find $\widehat{W} \in \mathbb{R}^{n\times k}$ such that $\widehat{M}_2(\widehat{W},\widehat{W}) = I_k$.
3: Run algorithm A on the observed documents with parameter $\alpha_0$; obtain the estimated topic distribution matrix $\widehat{O} = (\hat\mu_1, \ldots, \hat\mu_k)$ and prior parameter $\hat\alpha$.
4: Compute $\hat v_i = \sqrt{\frac{\hat\alpha_i}{\alpha_0(\alpha_0+1)}}\,\widehat{W}^\top\hat\mu_i$ for each topic $i$.
5: Recover the linear classifier: $\hat\eta_i \leftarrow (\alpha_0+2)\,\hat v_i^\top\widehat{M}_y(\widehat{W},\widehat{W})\,\hat v_i$.
6: Output: $\hat\eta$, $\hat\alpha$ and $\{\hat\mu_i\}_{i=1}^k$.

The sample complexity of Algorithm 2 depends on the sample complexity of the LDA inference algorithm A. Although the LDA inference algorithm A is free to make any assumption about the topic distribution matrix $O$, we remark that linear independence of the topic distribution vectors $\mu_i$ is still required, because the power update trick (step 5 in Algorithm 2) is valid only when the vectors $W^\top\mu_i$ are orthogonal.
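A compact Python rendering of Algorithm 2 (our sketch; the base LDA algorithm A, the moment estimators and their signatures are assumptions, not code from the paper) is:

```python
import numpy as np

def slda_recover(docs, y, alpha0, k, lda_algorithm_A, estimate_M2, estimate_My):
    """Sketch of Algorithm 2: sLDA parameter recovery on top of an existing LDA algorithm A.

    docs, y          : observed documents and their regression responses
    lda_algorithm_A  : callable returning (O_hat, alpha_hat) with O_hat of shape V x k
    estimate_M2/My   : callables returning the empirical moments M2_hat and My_hat (both V x V)
    """
    # Step 1: empirical moments.
    M2_hat, My_hat = estimate_M2(docs), estimate_My(docs, y)

    # Step 2: whitening matrix W_hat with W_hat^T M2_hat W_hat = I_k (top-k eigenpairs).
    w, U = np.linalg.eigh(0.5 * (M2_hat + M2_hat.T))
    idx = np.argsort(w)[::-1][:k]
    W_hat = U[:, idx] / np.sqrt(w[idx])

    # Step 3: run the base LDA inference algorithm A.
    O_hat, alpha_hat = lda_algorithm_A(docs, alpha0)

    # Step 4: whitened (canonical) topic directions v_hat_i.
    scale = np.sqrt(alpha_hat / (alpha0 * (alpha0 + 1)))     # length-k vector
    V_hat = (W_hat.T @ O_hat) * scale                        # columns are v_hat_i

    # Step 5: power update for the linear classifier.
    eta_hat = (alpha0 + 2) * np.einsum('ki,kl,li->i', V_hat, W_hat.T @ My_hat @ W_hat, V_hat)

    # Step 6: return the recovered parameters.
    return eta_hat, alpha_hat, O_hat
```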

Appendix C.2 Going beyond SVD

The proposed methods are based on spectral decomposition of observable moments and are provably correct. However, a major drawback of these methods is the assumption that the topic distributions $\mu_i$ are linearly independent. Although standard LDA models assume such topic independence [BNJ03], in practice there is strong evidence that different topics might be related [BL07]. Therefore, it is important to consider possible extensions.

Recently, there has been much work on provable LDA inference algorithms that do not require the topic independence assumption [AGH+13, AGM12]. Instead of making the independence assumption, these methods assume a $p$-separability condition on the topic distribution matrix: there exists an anchor word for each topic such that the anchor word only appears in this specific topic, and its probability given that topic is at least $p$. Here we briefly describe an idea that might lead to an sLDA inference algorithm that does not assume topic independence. First, let $O \in \mathbb{R}^{V\times k}$ be the topic distribution matrix defined previously and let $H = (h_1, \ldots, h_N) \in \mathbb{R}^{k\times N}$ be the matrix of topic mixing vectors of the documents. Suppose $Q = \mathbb{E}[x_1\otimes x_2]$ is the word co-occurrence frequency matrix and $Q_y = \mathbb{E}[y\,x_1]$ is the co-occurrence frequency matrix between words and the regression variable. Both $Q$ and $Q_y$ can clearly be estimated from the training data, and we have
$$Q = ORO^\top, \qquad Q_y = OR\eta, \qquad (33)$$
where $R \triangleq \mathbb{E}[hh^\top]$. Assuming the $p$-separability condition holds and using the methods developed in [AGM12, AGH+13], we can consistently recover $O$ and $R$. Now note that $Q_y = (OR)\eta$. As a result, knowing both $OR$ and $Q_y$, we reduce the inference of $\eta$ to solving a system of linear equations. Future research could address (1) determining when the matrix $OR$ has full column rank so that the linear system can be solved, and (2) the sample complexity of such an inference algorithm.

Appendix C.3 From regression to classification

A natural extension of our result is to consider supervised LDA models for classification. The simplest such model is a logistic regression classification model, where we assume that the response variable $y_d$ for each document $d$ lies in $\{+1, -1\}$ and
$$\Pr[y_d = 1\,|\,h_d] = \frac{\exp(\eta^\top h_d)}{1 + \exp(\eta^\top h_d)},$$
where $\eta$ is a linear classifier. Although it appears simple, such an extension raises fundamental problems. A major obstacle is the fact that the conditional expectation $\mathbb{E}[y\,|\,h]$ is no longer linear in the topic mixing vector $h$. As a result, we cannot even evaluate $\mathbb{E}[y]$ (or higher-order tensors such as $\mathbb{E}[y\,x_1\otimes x_2]$) in closed form. Furthermore, even if we had accurate estimates of the above moments, it would still be difficult to infer $\eta$ from them because of the non-linearity. To our knowledge, the classification problem remains unsolved for spectral decomposition methods, and it is unclear whether a provably correct algorithm exists in this scenario.

References

[AFH+12] Animashree Anandkumar, Dean Foster, Daniel Hsu, Sham Kakade, and Yi-Kai Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. arXiv:1204.6703, 2012.

[AGH+12] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. arXiv:1210.7559, 2012.

[AGH+13] Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, and Ankur Moitra. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.

[AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models - going beyond SVD. In FOCS, 2012.

[BL07] David Blei and John Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17-35, 2007.

[BNJ03] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[CL13] Arun Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In ICML, 2013.

[SG07] M. Steyvers and T. Griffiths. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2007.
