Appendix A. Proof of Theorem 1
In this section, we prove the sample complexity bound given in Theorem 1. The proof consists of three main parts. In Appendix A.1 we prove perturbation lemmas that bound the estimation error of the whitened tensors $M_2(W,W)$, $M_y(W,W)$ and $M_3(W,W,W)$ in terms of the estimation error of the tensors themselves. In Appendix A.2 we cite results on the accuracy of SVD and the robust tensor power method when performed on estimated tensors, and prove the effectiveness of the power update method used to recover the linear regression model $\eta$. Finally, we give tail bounds for the estimation error of $M_2$, $M_y$ and $M_3$ in Appendix A.3 and complete the proof in Appendix A.4. We also make some remarks on the indirect quantities (e.g., $\sigma_k(\tilde O)$) used in Theorem 1 and give simplified bounds for some special cases in Appendix A.4. All norms in the following analysis, if not explicitly specified, are 2-norms in the vector and matrix cases and the operator norm in the higher-order tensor case.

Appendix A.1 Perturbation lemmas

We first define the canonical topic distribution vectors $\tilde\mu$ and the estimation errors of the observable tensors, which simplify the notation in the subsequent analysis.

Definition 1 (canonical topic distribution). Define the canonical version of the topic distribution vector $\mu_i$, denoted $\tilde\mu_i$, as
$$\tilde\mu_i \triangleq \sqrt{\frac{\alpha_i}{\alpha_0(\alpha_0+1)}}\,\mu_i. \quad (1)$$
We also define $O,\tilde O \in \mathbb{R}^{n\times k}$ by $O = [\mu_1,\dots,\mu_k]$ and $\tilde O = [\tilde\mu_1,\dots,\tilde\mu_k]$.

Definition 2 (estimation error). Assume that
$$\|\widehat M_2 - M_2\| \le E_P, \quad (2)$$
$$\|\widehat M_y - M_y\| \le E_y, \quad (3)$$
$$\|\widehat M_3 - M_3\| \le E_T, \quad (4)$$
for some real values $E_P$, $E_y$ and $E_T$, which we will set later.
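As a quick illustration of Definitions 1 and 2, the following NumPy sketch builds the canonical matrix $\tilde O$ and the error quantities $E_P$, $E_y$, $E_T$ from given moment estimates. The function names are ours, and the Frobenius norm is used only as a computable upper bound on the tensor operator norm.

```python
import numpy as np

def canonical_topics(O, alpha):
    """Definition 1: scale each topic column mu_i by sqrt(alpha_i / (alpha_0 (alpha_0 + 1)))."""
    alpha0 = alpha.sum()
    return O * np.sqrt(alpha / (alpha0 * (alpha0 + 1.0)))   # broadcasts over the k columns

def moment_errors(M2_hat, My_hat, M3_hat, M2, My, M3):
    """Definition 2: spectral-norm errors E_P, E_y and an operator-norm bound E_T.

    For the third-order tensor the Frobenius norm is used; it upper-bounds
    the operator norm, which has no simple closed form.
    """
    E_P = np.linalg.norm(M2_hat - M2, 2)
    E_y = np.linalg.norm(My_hat - My, 2)
    E_T = np.linalg.norm((M3_hat - M3).ravel())
    return E_P, E_y, E_T
```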
Lemma 1 (Lemma C.1 in [AFH+12]). Let $W, \hat W \in \mathbb{R}^{n\times k}$ be whitening matrices such that $M_2(W,W) = \widehat M_2(\hat W,\hat W) = I_k$. Let $A = W^\top\tilde O$ and $\hat A = \hat W^\top\tilde O$. Suppose $E_P \le \sigma_k(M_2)/2$. We have
$$\|W\| \le \frac{1}{\sigma_k(\tilde O)}, \quad (5)$$
$$\|\hat W\| \le \frac{2}{\sigma_k(\tilde O)}, \quad (6)$$
$$\|W - \hat W\| \le \frac{4E_P}{\sigma_k(\tilde O)^3}, \quad (7)$$
$$\|W^+\| \le 3\sigma_1(\tilde O), \quad (8)$$
$$\|\hat W^+\| \le 2\sigma_1(\tilde O), \quad (9)$$
$$\|W^+ - \hat W^+\| \le \frac{6\sigma_1(\tilde O)}{\sigma_k(\tilde O)^2}E_P, \quad (10)$$
$$\|A\| = 1, \quad (11)$$
$$\|\hat A\| \le 2, \quad (12)$$
$$\|A - \hat A\| \le \frac{4E_P}{\sigma_k(\tilde O)^2}, \quad (13)$$
$$\|AA^\top - \hat A\hat A^\top\| \le \frac{12E_P}{\sigma_k(\tilde O)^2}. \quad (14)$$

Proof. Proof of Eq. (7): let $\hat W^\top\widehat M_2\hat W = I_k$ and $\hat W^\top M_2\hat W = BDB^\top$, where $B$ is orthogonal and $D$ is a positive definite diagonal matrix. We then see that $W = \hat WBD^{-1/2}B^\top$ satisfies the condition $W^\top M_2W = I_k$. Subsequently, $\hat W = WBD^{1/2}B^\top$. We can then bound $\|W - \hat W\|$ as follows:
$$\|W - \hat W\| \le \|W\|\,\|I - D^{1/2}\| \le \|W\|\,\|I - D\| \le \frac{4E_P}{\sigma_k(\tilde O)^3},$$
where the inequality $\|I - D\| \le \frac{4E_P}{\sigma_k(\tilde O)^2}$ was proved in [AFH+12].

Proof of Eq. (14):
$$\|AA^\top - \hat A\hat A^\top\| \le \|AA^\top - A\hat A^\top\| + \|A\hat A^\top - \hat A\hat A^\top\| \le \|A - \hat A\|\left(\|A\| + \|\hat A\|\right) \le \frac{12E_P}{\sigma_k(\tilde O)^2}.$$
All the other inequalities come from Lemma C.1 in [AFH+12].

We are now able to provide perturbation bounds for the estimation error of the whitened moments.

Definition 3 (estimation error of whitened moments). Define
$$\varepsilon_{p,w} \triangleq \|M_2(W,W) - \widehat M_2(\hat W,\hat W)\|, \quad (15)$$
$$\varepsilon_{y,w} \triangleq \|M_y(W,W) - \widehat M_y(\hat W,\hat W)\|, \quad (16)$$
$$\varepsilon_{t,w} \triangleq \|M_3(W,W,W) - \widehat M_3(\hat W,\hat W,\hat W)\|. \quad (17)$$

Lemma 2 (perturbation lemma for whitened moments). Suppose $E_P \le \sigma_k(M_2)/2$. We have
$$\varepsilon_{p,w} \le \frac{16E_P}{\sigma_k(\tilde O)^2}, \quad (18)$$
$$\varepsilon_{y,w} \le \frac{24\|\eta\|E_P}{(\alpha_0+1)\sigma_k(\tilde O)^2} + \frac{4E_y}{\sigma_k(\tilde O)^2}, \quad (19)$$
$$\varepsilon_{t,w} \le \frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\sigma_k(\tilde O)^5} + \frac{8E_T}{\sigma_k(\tilde O)^3}. \quad (20)$$

Proof. Using the idea in the proof of Lemma C.1 in [AFH+12], we can split $\varepsilon_{p,w}$ as
$$\varepsilon_{p,w} \le \|M_2(W,W) - M_2(\hat W,\hat W)\| + \|M_2(\hat W,\hat W) - \widehat M_2(\hat W,\hat W)\|.$$
We can then bound the two terms separately, as follows. For the first term, we have
$$\|M_2(W,W) - M_2(\hat W,\hat W)\| = \|W^\top M_2W - \hat W^\top M_2\hat W\| = \|AA^\top - \hat A\hat A^\top\| \le \frac{12E_P}{\sigma_k(\tilde O)^2},$$
where the last inequality comes from Eq. (14). For the second term, we have
$$\|M_2(\hat W,\hat W) - \widehat M_2(\hat W,\hat W)\| \le \|\hat W\|^2\,\|M_2 - \widehat M_2\| \le \frac{4E_P}{\sigma_k(\tilde O)^2},$$
where the last inequality comes from Eq. (6).

Similarly, $\varepsilon_{y,w}$ can be split into $\|M_y(W,W) - M_y(\hat W,\hat W)\|$ and $\|M_y(\hat W,\hat W) - \widehat M_y(\hat W,\hat W)\|$, which can be bounded separately. For the first term, we have
$$\|M_y(W,W) - M_y(\hat W,\hat W)\| = \|W^\top M_yW - \hat W^\top M_y\hat W\| = \frac{1}{\alpha_0+1}\left\|A\,\mathrm{diag}(\eta)A^\top - \hat A\,\mathrm{diag}(\eta)\hat A^\top\right\| \le \frac{2\|\eta\|}{\alpha_0+1}\left\|AA^\top - \hat A\hat A^\top\right\| \le \frac{24\|\eta\|E_P}{(\alpha_0+1)\sigma_k(\tilde O)^2}.$$
For the second term, we have
$$\|M_y(\hat W,\hat W) - \widehat M_y(\hat W,\hat W)\| \le \|\hat W\|^2\,\|M_y - \widehat M_y\| \le \frac{4E_y}{\sigma_k(\tilde O)^2}.$$
Finally, we bound $\varepsilon_{t,w}$ as below, following [CL13]:
$$\varepsilon_{t,w} = \|M_3(W,W,W) - \widehat M_3(\hat W,\hat W,\hat W)\| \le \|M_3\|\,\|W - \hat W\|\left(\|W\|^2 + \|W\|\,\|\hat W\| + \|\hat W\|^2\right) + \|\hat W\|^3\,\|M_3 - \widehat M_3\| \le \frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\sigma_k(\tilde O)^5} + \frac{8E_T}{\sigma_k(\tilde O)^3},$$
where we have used the fact that
$$\|M_3\| \le \sum_{i=1}^k\frac{2\alpha_i}{\alpha_0(\alpha_0+1)(\alpha_0+2)} = \frac{2}{(\alpha_0+1)(\alpha_0+2)}.$$

Appendix A.2 SVD accuracy

The key idea behind the spectral recovery of LDA topic models is the simultaneous diagonalization trick, which asserts that we can recover the LDA model parameters by performing an orthogonal tensor decomposition on a pair of simultaneously whitened moments, for example $(M_2, M_3)$ and $(M_2, M_y)$. The following proposition details this insight, as we derive orthogonal tensor decompositions for the whitened tensors $M_y(W,W)$ and $M_3(W,W,W)$.

Proposition 2. Define $v_i \triangleq W^\top\tilde\mu_i = \sqrt{\frac{\alpha_i}{\alpha_0(\alpha_0+1)}}\,W^\top\mu_i$. Then:
1. $\{v_i\}_{i=1}^k$ is an orthonormal basis.
2. $M_y(W,W)$ has singular value/singular vector pairs $(\sigma^y_i, v_i)$ with $\sigma^y_i = \frac{\eta_j}{\alpha_0+1}$ for some $j \in [k]$.
3. $M_3(W,W,W)$ has robust eigenvalue/eigenvector pairs [AGH+12] $(\lambda_i, v_i)$ with $\lambda_i = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_j}}$ for some $j \in [k]$.

Proof. The orthonormality of $\{v_i\}_{i=1}^k$ follows from the fact that $W^\top M_2W = \sum_{i=1}^kv_iv_i^\top = I_k$. Subsequently, we have
$$M_y(W,W) = \frac{1}{\alpha_0+1}\sum_{i=1}^k\eta_i\,v_iv_i^\top, \qquad M_3(W,W,W) = \frac{2}{\alpha_0+2}\sum_{i=1}^k\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_i}}\,v_i\otimes v_i\otimes v_i.$$
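To make the whitening and simultaneous diagonalization step concrete, the following NumPy sketch computes one whitening matrix $\hat W$ of $\widehat M_2$ from its top-$k$ eigenpairs and forms the whitened moments used above. It is only an illustration under our own naming (any whitener works, since it is unique up to rotation); for large vocabularies $\widehat M_3$ would not be formed densely, see Appendix B.2.

```python
import numpy as np

def whiten(M2_hat, k):
    """Return W_hat (n x k) with W_hat.T @ M2_hat @ W_hat = I_k.

    Uses the top-k eigenpairs of the symmetric PSD estimate M2_hat; this is
    one standard choice of whitening matrix.
    """
    eigvals, eigvecs = np.linalg.eigh(M2_hat)          # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:k]                # keep the top-k components
    U, d = eigvecs[:, idx], eigvals[idx]
    return U / np.sqrt(d)                              # W_hat = U diag(d)^{-1/2}

def whiten_moments(M2_hat, My_hat, M3_hat, k):
    """Form the whitened moments My(W,W) and M3(W,W,W) analyzed in Appendix A."""
    W = whiten(M2_hat, k)                              # n x k
    My_w = W.T @ My_hat @ W                            # k x k
    M3_w = np.einsum('abc,ai,bj,ck->ijk', M3_hat, W, W, W)  # k x k x k
    return W, My_w, M3_w
```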
The following lemmas (Lemma 3 and Lemma 4) give upper bounds on the estimation error of $\eta$ and $\mu$ in terms of $|\hat\lambda_i - \lambda_i|$, $\|\hat v_i - v_i\|$ and the estimation errors of the whitened moments defined in Definition 3.

Lemma 3 ($\eta_i$ estimation error bound). Define $\hat\eta_i \triangleq (\alpha_0+1)\,\hat v_i^\top\widehat M_y(\hat W,\hat W)\hat v_i$, where $\hat v_i$ is some estimate of $v_i$. We then have
$$|\hat\eta_i - \eta_i| \le 2\|\eta\|\,\|\hat v_i - v_i\| + (\alpha_0+1)\left(1 + 2\|\hat v_i - v_i\|\right)\varepsilon_{y,w}. \quad (21)$$

Proof. First, note that $v_i^\top M_y(W,W)v_i = \frac{\eta_i}{\alpha_0+1}$ because $\{v_i\}_{i=1}^k$ are orthonormal. Subsequently, we have
$$\frac{|\hat\eta_i - \eta_i|}{\alpha_0+1} = \left|\hat v_i^\top\widehat M_y(\hat W,\hat W)\hat v_i - v_i^\top M_y(W,W)v_i\right| = \left|(\hat v_i - v_i)^\top\widehat M_y(\hat W,\hat W)\hat v_i + v_i^\top\left(\widehat M_y(\hat W,\hat W)\hat v_i - M_y(W,W)v_i\right)\right| \le \|\hat v_i - v_i\|\,\|\widehat M_y(\hat W,\hat W)\hat v_i\| + \|\widehat M_y(\hat W,\hat W)\hat v_i - M_y(W,W)v_i\|.$$
Note that both $v_i$ and $\hat v_i$ are unit vectors. Therefore,
$$\frac{|\hat\eta_i - \eta_i|}{\alpha_0+1} \le \|\widehat M_y(\hat W,\hat W)\|\,\|\hat v_i - v_i\| + \|\widehat M_y(\hat W,\hat W)\|\,\|\hat v_i - v_i\| + \|\widehat M_y(\hat W,\hat W) - M_y(W,W)\|\,\|v_i\| \le 2\left(\frac{\|\eta\|}{\alpha_0+1} + \varepsilon_{y,w}\right)\|\hat v_i - v_i\| + \varepsilon_{y,w}.$$
The last inequality is due to the fact that $\|M_y(W,W)\| \le \frac{\|\eta\|}{\alpha_0+1}$.

Lemma 4 ($\mu_i$ estimation error bound). Define $\hat\mu_i \triangleq \frac{\alpha_0+2}{2}\,\hat\lambda_i\,(\hat W^+)^\top\hat v_i$, where $(\hat\lambda_i, \hat v_i)$ are estimates of the eigenvalue/eigenvector pairs $(\lambda_i, v_i)$ of $M_3(W,W,W)$. We then have
$$\|\hat\mu_i - \mu_i\| \le \frac{3(\alpha_0+2)\sigma_1(\tilde O)}{2}\,|\hat\lambda_i - \lambda_i| + 3\alpha_{\max}\sigma_1(\tilde O)\,\|\hat v_i - v_i\| + \frac{6\alpha_{\max}\sigma_1(\tilde O)E_P}{\sigma_k(\tilde O)^2}. \quad (22)$$

Proof. First note that $\mu_i = \frac{\alpha_0+2}{2}\lambda_i(W^+)^\top v_i$. Subsequently,
$$\frac{2}{\alpha_0+2}\|\hat\mu_i - \mu_i\| = \left\|\hat\lambda_i(\hat W^+)^\top\hat v_i - \lambda_i(W^+)^\top v_i\right\| \le \left\|\hat\lambda_i\hat W^+ - \lambda_iW^+\right\|\|\hat v_i\| + |\lambda_i|\,\|W^+\|\,\|\hat v_i - v_i\| \le |\hat\lambda_i - \lambda_i|\,\|\hat W^+\| + |\lambda_i|\,\|\hat W^+ - W^+\| + |\lambda_i|\,\|W^+\|\,\|\hat v_i - v_i\| \le 3\sigma_1(\tilde O)\,|\hat\lambda_i - \lambda_i| + \frac{2\alpha_{\max}}{\alpha_0+2}\cdot\frac{6\sigma_1(\tilde O)E_P}{\sigma_k(\tilde O)^2} + \frac{2\alpha_{\max}}{\alpha_0+2}\cdot3\sigma_1(\tilde O)\,\|\hat v_i - v_i\|.$$
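The recovery rules analyzed in Lemmas 3 and 4, together with the $\alpha$ recovery rule used later in the proof of Theorem 1, are straightforward to express in code. Below is a small NumPy sketch under our own naming, assuming the eigenpairs $(\hat\lambda_i, \hat v_i)$ have already been extracted from $\widehat M_3(\hat W,\hat W,\hat W)$, for instance by the robust tensor power method cited next.

```python
import numpy as np

def recover_parameters(lam_hat, V_hat, W_hat, My_w, alpha0):
    """Power-update recovery of eta, mu and alpha from tensor eigenpairs.

    lam_hat : (k,)   estimated robust eigenvalues of M3(W,W,W)
    V_hat   : (k, k) estimated eigenvectors, one per column
    W_hat   : (n, k) whitening matrix of M2_hat
    My_w    : (k, k) whitened moment My_hat(W_hat, W_hat)
    """
    # Lemma 3: eta_i = (alpha0 + 1) * v_i' My(W,W) v_i
    eta_hat = (alpha0 + 1.0) * np.einsum('ki,kl,li->i', V_hat, My_w, V_hat)
    # Lemma 4: mu_i = (alpha0 + 2)/2 * lambda_i * (W^+)' v_i
    W_pinv = np.linalg.pinv(W_hat)                                   # k x n
    Mu_hat = 0.5 * (alpha0 + 2.0) * (W_pinv.T @ V_hat) * lam_hat     # n x k, one topic per column
    # Recovery rule used in the proof of Theorem 1:
    # alpha_i = 4 alpha0 (alpha0 + 1) / ((alpha0 + 2)^2 lambda_i^2)
    alpha_hat = 4.0 * alpha0 * (alpha0 + 1.0) / ((alpha0 + 2.0) ** 2 * lam_hat ** 2)
    return eta_hat, Mu_hat, alpha_hat
```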
To bound the error of the orthogonal tensor decomposition performed on the estimated tensor $\widehat M_3(\hat W,\hat W,\hat W)$, we cite Theorem 5.1 of [AGH+12], a sample complexity analysis of the robust tensor power method that we use to recover $\lambda_i$ and $v_i$.

Lemma 5 (Theorem 5.1, [AGH+12]). Let $\lambda_{\max} = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\min}}}$ and $\lambda_{\min} = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\max}}}$, where $\alpha_{\min} = \min_i\alpha_i$ and $\alpha_{\max} = \max_i\alpha_i$. Then there exist universal constants $C_1, C_2 > 0$ such that the following holds. Fix $\delta' \in (0,1)$. Suppose $\varepsilon_{t,w} \le \varepsilon$ and
$$\varepsilon_{t,w} \le C_1\cdot\frac{\lambda_{\min}}{k}. \quad (23)$$
Suppose $\{(\hat\lambda_i, \hat v_i)\}_{i=1}^k$ are the eigenvalue and eigenvector pairs returned by running Algorithm 1 in [AGH+12] with input $\widehat M_3(\hat W,\hat W,\hat W)$, with $L = \mathrm{poly}(k)\log(1/\delta')$ restarts and $C_2(\log(k) + \log\log(\lambda_{\max}/\varepsilon))$ iterations. With probability greater than $1-\delta'$, there exists a permutation $\pi: [k]\to[k]$ such that for all $i$,
$$\|v_i - \hat v_{\pi(i)}\| \le 8\varepsilon/\lambda_{\min}, \qquad |\lambda_i - \hat\lambda_{\pi(i)}| \le 5\varepsilon.$$

Appendix A.3 Tail inequalities

Lemma 6 (Lemma 5, [CL13]). Let $x_1,\dots,x_N \in \mathbb{R}^d$ be i.i.d. samples from some distribution with bounded support (i.e., $\|x\| \le B$ with probability 1 for some constant $B$). Then with probability at least $1-\delta$,
$$\left\|\frac{1}{N}\sum_{i=1}^Nx_i - \mathbb{E}[x]\right\| \le \frac{B\left(1 + \sqrt{\log(1/\delta)}\right)}{\sqrt N}.$$

Corollary 1. Let $x_1,\dots,x_N \in \mathbb{R}^d$ be i.i.d. samples from some distribution with $\Pr[\|x\| \ge B] \le \delta'$. Then with probability at least $1 - \delta - N\delta'$,
$$\left\|\frac{1}{N}\sum_{i=1}^Nx_i - \mathbb{E}[x]\right\| \le \frac{B\left(1 + \sqrt{\log(1/\delta)}\right)}{\sqrt N}.$$
Proof. Use the union bound.

Lemma 7 (concentration of moment norms). Suppose we obtain $N$ i.i.d. samples (i.e., documents with at least three words each and their regression variables in the sLDA model). Define $R(\delta) \triangleq \|\eta\| + \Phi^{-1}(\delta)$, where $\Phi^{-1}(\cdot)$ is the inverse function of the CDF of a standard Gaussian distribution. Let $\mathbb{E}[\cdot]$ denote the expectation under the true underlying distribution and $\widehat{\mathbb{E}}[\cdot]$ the empirical average. Then:
1. $\Pr\left[\|\mathbb{E}[x_1] - \widehat{\mathbb{E}}[x_1]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt N}\right] \ge 1-\delta$;
2. $\Pr\left[\|\mathbb{E}[x_1\otimes x_2] - \widehat{\mathbb{E}}[x_1\otimes x_2]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt N}\right] \ge 1-\delta$;
3. $\Pr\left[\|\mathbb{E}[x_1\otimes x_2\otimes x_3] - \widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt N}\right] \ge 1-\delta$;
4. $\Pr\left[|\mathbb{E}[y] - \widehat{\mathbb{E}}[y]| < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt N}\right] \ge 1-2\delta$;
5. $\Pr\left[\|\mathbb{E}[yx_1] - \widehat{\mathbb{E}}[yx_1]\|_F < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt N}\right] \ge 1-2\delta$;
6. $\Pr\left[\|\mathbb{E}[yx_1\otimes x_2] - \widehat{\mathbb{E}}[yx_1\otimes x_2]\|_F < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt N}\right] \ge 1-2\delta$.

Proof. Use Lemma 6, and Corollary 1 for the concentration bounds involving the regression variable $y$.
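For concreteness, the empirical averages appearing in Lemma 7 can be formed from bag-of-words counts as in the following NumPy sketch (our own illustration; $x_1$ and $x_2$ denote two distinct word positions, hence the diagonal correction in the pair term). The third-order term is handled separately by the trick in Appendix B.2.

```python
import numpy as np

def empirical_moments(docs, y, V):
    """Empirical averages of x1, x1 (x) x2, y and y x1 from word-count vectors.

    docs : list of length-V count vectors n_d (n_d[i] = occurrences of word i)
    y    : per-document response variables
    Each document contributes an unbiased estimate of E[x1] and of
    E[x1 (x) x2] over pairs of distinct word positions.
    """
    N = len(docs)
    Ex1, Ex1x2 = np.zeros(V), np.zeros((V, V))
    Ey, Eyx1 = 0.0, np.zeros(V)
    for n_d, y_d in zip(docs, y):
        m = n_d.sum()
        p1 = n_d / m                                                 # E[x1 | doc]
        p12 = (np.outer(n_d, n_d) - np.diag(n_d)) / (m * (m - 1))    # E[x1 (x) x2 | doc]
        Ex1 += p1
        Ex1x2 += p12
        Ey += y_d
        Eyx1 += y_d * p1
    return Ex1 / N, Ex1x2 / N, Ey / N, Eyx1 / N
```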
Corollary 2. With probability $1-\delta$ the following hold:
1. $E_P = \|\widehat M_2 - M_2\| \le 3\cdot\frac{1+\sqrt{\log(6/\delta)}}{\sqrt N}$;
2. $E_y = \|\widehat M_y - M_y\| \le 10R(\delta/60\sigma)\cdot\frac{1+\sqrt{\log(5/\delta)}}{\sqrt N}$;
3. $E_T = \|\widehat M_3 - M_3\| \le 12\cdot\frac{1+\sqrt{\log(9/\delta)}}{\sqrt N}$.

Proof. Corollary 2 can be proved by expanding the terms by definition and then applying the tail inequalities in Lemma 7 together with the union bound. Also note that $\|\cdot\| \le \|\cdot\|_F$ for all matrices.

Appendix A.4 Completing the proof

We are now ready to give a complete proof of Theorem 1.

Theorem 1 (sample complexity bound). Let $\sigma_1(\tilde O)$ and $\sigma_k(\tilde O)$ be the largest and smallest singular values of the canonical topic distribution matrix $\tilde O$. Define $\lambda_{\min} \triangleq \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\max}}}$ and $\lambda_{\max} \triangleq \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\min}}}$, where $\alpha_{\max}$ and $\alpha_{\min}$ are the largest and smallest entries of $\alpha$. Suppose $\hat\mu$, $\hat\alpha$ and $\hat\eta$ are the outputs of Algorithm 1, and $L$ is at least a linear function of $k$. Fix $\delta \in (0,1)$. For any small error-tolerance parameter $\varepsilon > 0$, if Algorithm 1 is run with parameter $T = \Omega(\log(k) + \log\log(\lambda_{\max}/\varepsilon))$ on $N$ i.i.d. sampled documents with $N \ge \max(n_1, n_2, n_3)$, where
$$n_1 = C_1\left(1 + \sqrt{\log(6/\delta)}\right)^2\cdot\frac{\alpha_0^2(\alpha_0+1)^2}{\alpha_{\min}^2}, \quad (24)$$
$$n_2 = \frac{C_2\left(1 + \sqrt{\log(5/\delta)}\right)^2}{\varepsilon^2\,\sigma_k(\tilde O)^4}\cdot\max\left(\left(\|\eta\| + \Phi^{-1}(\delta/60\sigma)\right)^2,\ \alpha_{\max}^2\,\sigma_1(\tilde O)^2\right), \quad (25)$$
$$n_3 = \frac{C_3\left(1 + \sqrt{\log(9/\delta)}\right)^2}{\sigma_k(\tilde O)^{10}}\cdot\max\left(\frac{1}{\varepsilon^2},\ \frac{k^2}{\lambda_{\min}^2}\right), \quad (26)$$
and $C_1$, $C_2$ and $C_3$ are universal constants, then with probability at least $1-\delta$ there exists a permutation $\pi: [k]\to[k]$ such that for every topic $i$ the following hold:
1. $|\hat\alpha_i - \alpha_{\pi(i)}| \le \frac{4\alpha_0(\alpha_0+1)(2\lambda_{\max}+5\varepsilon)}{(\alpha_0+2)^2\lambda_{\min}^2(\lambda_{\min}-5\varepsilon)^2}\,5\varepsilon$, if $\lambda_{\min} > 5\varepsilon$;
2. $\|\hat\mu_i - \mu_{\pi(i)}\| \le \left(3\sigma_1(\tilde O)\left(\frac{8\alpha_{\max}}{\lambda_{\min}} + \frac{5(\alpha_0+2)}{2}\right) + 1\right)\varepsilon$;
3. $|\hat\eta_i - \eta_{\pi(i)}| \le \left(\frac{2\|\eta\|}{\lambda_{\min}} + 2(\alpha_0+1)\right)\varepsilon$.

Proof. First, the assumption $E_P \le \sigma_k(M_2)/2$ is required for the error bounds on $\varepsilon_{p,w}$, $\varepsilon_{y,w}$ and $\varepsilon_{t,w}$. Noting Corollary 2 and the fact that $\sigma_k(M_2) = \frac{\alpha_{\min}}{\alpha_0(\alpha_0+1)}$, we have
$$N = \Omega\left(\frac{\alpha_0^2(\alpha_0+1)^2}{\alpha_{\min}^2}\left(1 + \sqrt{\log(6/\delta)}\right)^2\right).$$
Note that this lower bound does not depend on $k$, $\varepsilon$ and $\sigma_k(\tilde O)$. For Lemma 5 to hold, we need the assumption that $\varepsilon_{t,w} \le \min(\varepsilon, O(\lambda_{\min}/k))$. This implies Eq. (26), as we expand $\varepsilon_{t,w}$ according to Definition 3 and note the fact that the first term $\frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\sigma_k(\tilde O)^5}$
dominates the second one. The $\alpha_0$ is missing in Eq. (26) because $\alpha_0+1 \ge 1$ and $\alpha_0+2 \ge 2$, and we discard them both. The $|\hat\alpha_i - \alpha_{\pi(i)}|$ bound follows immediately from Lemma 5 and the recovery rule $\hat\alpha_i = \frac{4\alpha_0(\alpha_0+1)}{(\alpha_0+2)^2\hat\lambda_i^2}$. To bound the estimation error of the linear classifier $\eta$, we need to further bound $\varepsilon_{y,w}$. We assume $\varepsilon_{y,w} \le \varepsilon$. By expanding $\varepsilon_{y,w}$ according to Definition 3 in a similar manner, we obtain the $(\|\eta\| + \Phi^{-1}(\delta/60\sigma))^2$ term in Eq. (25). The bound on $|\hat\eta_i - \eta_{\pi(i)}|$ then follows immediately from Lemma 3. Finally, we bound $\|\hat\mu_i - \mu_{\pi(i)}\|$ using Lemma 4. We need to assume that $\frac{6\alpha_{\max}\sigma_1(\tilde O)E_P}{\sigma_k(\tilde O)^2} \le \varepsilon$, which gives the $\alpha_{\max}^2\sigma_1(\tilde O)^2$ term in Eq. (25). The $\|\hat\mu_i - \mu_{\pi(i)}\|$ bound then follows from Lemma 4 and Lemma 5.

We now make some remarks on the main theorem. In Remark 1, we establish links between the indirect quantities appearing in Theorem 1 (e.g., $\sigma_k(\tilde O)$) and functions of the original model parameters (e.g., $\sigma_k(O)$). These connections follow straightforwardly from the definitions.

Remark 1. The indirect quantities $\sigma_1(\tilde O)$ and $\sigma_k(\tilde O)$ can be related to $\sigma_1(O)$, $\sigma_k(O)$ and $\alpha$ in the following way:
$$\sqrt{\frac{\alpha_{\min}}{\alpha_0(\alpha_0+1)}}\,\sigma_k(O) \le \sigma_k(\tilde O) \le \sqrt{\frac{\alpha_{\max}}{\alpha_0(\alpha_0+1)}}\,\sigma_k(O); \qquad \sigma_1(\tilde O) \le \sqrt{\frac{\alpha_{\max}}{\alpha_0(\alpha_0+1)}}\,\sigma_1(O) \le \frac{\sigma_1(O)}{\sqrt{\alpha_0+1}}.$$

We now take a closer look at the sample complexity bound in Theorem 1. It is evident that $n_2$ can be neglected when the number of topics $k$ gets large, because in practice the norm of the linear regression model $\eta$ is usually assumed to be small in order to avoid overfitting. Moreover, as mentioned before, the prior parameter $\alpha$ is often assumed to be homogeneous, with $\alpha_i = 1/k$ [SG07]. With these observations, the sample complexity bound in Theorem 1 can be greatly simplified.

Remark 2. Assume $\|\eta\|$ and $\sigma$ are small and $\alpha = (1/k, \dots, 1/k)$. As the number of topics $k$ gets large, the sample complexity bound in Theorem 1 can be simplified to
$$N = \Omega\left(\frac{\log(1/\delta)}{\sigma_k(\tilde O)^{10}}\cdot\max\left(\frac{1}{\varepsilon^2},\ k^3\right)\right). \quad (27)$$

The sample complexity bound in Remark 2 may look formidable, as it depends on $\sigma_k(\tilde O)^{10}$. However, such a dependence is somewhat necessary because we are using third-order tensors to recover the underlying model parameters. Furthermore, the dependence on $\sigma_k(\tilde O)^{10}$ is introduced by the robust tensor power method used to recover the LDA parameters; the reconstruction accuracy of $\eta$ only depends on $\sigma_k(\tilde O)^4$ and $(\|\eta\| + \Phi^{-1}(\delta/60\sigma))^2$. As a consequence, if we could combine our power update method for $\eta$ with LDA inference algorithms that have a milder dependence on the singular value $\sigma_k(\tilde O)$, we might obtain an algorithm with a better sample complexity. Such an extension is discussed in Appendix C.

Appendix B Moments of Observable Variables

Appendix B.1 Proof of Proposition 1

The equations for $M_2$ and $M_3$ have already been proved in [AFH+12] and [AGH+12]. Here we only give the proof of the equation for $M_y$; in fact, all three equations can be proved in a similar manner. In sLDA the topic mixing vector $h$ follows a Dirichlet prior distribution with parameter $\alpha$. Therefore, we have
$$\mathbb{E}[h_i] = \frac{\alpha_i}{\alpha_0}, \qquad \mathbb{E}[h_ih_j] = \begin{cases}\dfrac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)}, & i = j,\\[4pt] \dfrac{\alpha_i\alpha_j}{\alpha_0(\alpha_0+1)}, & i \ne j,\end{cases} \qquad \mathbb{E}[h_ih_jh_k] = \begin{cases}\dfrac{\alpha_i(\alpha_i+1)(\alpha_i+2)}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i = j = k,\\[4pt] \dfrac{\alpha_i(\alpha_i+1)\alpha_k}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i = j \ne k,\\[4pt] \dfrac{\alpha_i\alpha_j\alpha_k}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i \ne j,\ j \ne k,\ i \ne k. \end{cases} \quad (28)$$
Next, note that
$$\mathbb{E}[y\mid h] = \eta^\top h, \qquad \mathbb{E}[x_1\mid h] = \sum_{i=1}^kh_i\mu_i, \qquad \mathbb{E}[x_1\otimes x_2\mid h] = \sum_{i,j=1}^kh_ih_j\,\mu_i\otimes\mu_j, \quad (29)$$
$$\mathbb{E}[y\,x_1\otimes x_2\mid h] = \sum_{i,j,k=1}^kh_ih_jh_k\,\eta_k\,\mu_i\otimes\mu_j. \quad (30)$$
Proposition 1 can then be proved by taking expectations over the topic mixing vector $h$.

Appendix B.2 Details of the speed-up trick

In this section we provide details of the trick mentioned in the main paper for speeding up the computation of the empirical moments. First, note that the computation of $M_1$, $M_2$ and $M_y$ only requires $O(NM^2)$ time and $O(V^2)$ space; these do not need to be accelerated in most practical applications. This time and space complexity also applies to all terms in $M_3$ except the $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]$ term, which requires $O(NM^3)$ time and $O(V^3)$ space in a naive implementation. Therefore, this section is devoted to speeding up the computation of $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]$. More precisely, as mentioned in the main paper, what we want to compute is the whitened empirical moment $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3](\hat W,\hat W,\hat W) \in \mathbb{R}^{k\times k\times k}$.

Fix a document $D$ with $m$ words. Let $T \triangleq \widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3\mid D]$ be the desired empirical tensor. By definition, we have
$$T_{i,j,k} = \frac{1}{m(m-1)(m-2)}\begin{cases}n_i(n_i-1)(n_i-2), & i = j = k;\\ n_i(n_i-1)n_k, & i = j,\ j \ne k;\\ n_in_j(n_j-1), & j = k,\ i \ne j;\\ n_in_j(n_i-1), & i = k,\ i \ne j;\\ n_in_jn_k, & \text{otherwise};\end{cases}$$
where $n_i$ is the number of occurrences of the $i$-th word in document $D$. If we had $T_{i,j,k} = \frac{n_in_jn_k}{m(m-1)(m-2)}$ for all indices $i$, $j$ and $k$, then we would only need to compute
$$T(W,W,W) = \frac{1}{m(m-1)(m-2)}\,(W^\top n)^{\otimes 3}, \quad (31)$$
where $n \triangleq (n_1, n_2, \dots, n_V)^\top$. This takes $O(Mk + k^3)$ computation time because $n$ contains at most $M$ non-zero entries, so the total time complexity would be reduced from $O(NM^3)$ to $O(N(Mk + k^3))$.

We now consider the remaining entries, in which at least two indices are identical. We first consider entries with exactly two identical indices, for example $i = j$. For these indices we need to subtract an $n_in_k$ term relative to Eq. (31). That is, we need to compute the whitened tensor $\Delta(W,W,W)$, where $\Delta \in \mathbb{R}^{V\times V\times V}$ and
$$\Delta_{i,j,k} = \frac{1}{m(m-1)(m-2)}\begin{cases}n_in_k, & i = j;\\ 0, & \text{otherwise}.\end{cases} \quad (32)$$
Note that $\Delta$ can be written as $\frac{1}{m(m-1)(m-2)}A\otimes n$, where $A = \mathrm{diag}(n_1, n_2, \dots, n_V)$ is a $V\times V$ matrix and $n = (n_1, n_2, \dots, n_V)^\top$ is defined as before. As a result,
$$\Delta(W,W,W) = \frac{1}{m(m-1)(m-2)}\,(W^\top AW)\otimes(W^\top n).$$
So the computational complexity of $\Delta(W,W,W)$ depends on how we compute $W^\top AW$. Since $A$ is a diagonal matrix with at most $M$ non-zero entries, $W^\top AW$ can be computed in $O(Mk^2)$ operations. Therefore, the time complexity of computing $\Delta(W,W,W)$ is $O(Mk^2)$ per document; the cases $j = k$ and $i = k$ are handled in the same way.

Finally, we handle the entries with all three indices identical, that is, $i = j = k$. As indicated by Eq. (31) and the corrections above, we need to add a $\frac{2n_i}{m(m-1)(m-2)}$ term for compensation. This can be done efficiently by first computing $\widehat{\mathbb{E}}\!\left[\frac{2n_i}{m(m-1)(m-2)}\right]$ over all documents (requiring $O(NV)$ time), and then adding the corresponding diagonal contribution to the whitened tensor, which takes $O(Vk^3)$ operations.
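The following NumPy sketch spells out the computation just described for a corpus of word-count vectors. It is only an illustration under our own naming, not the authors' implementation: per document it forms $(W^\top n)^{\otimes 3}$, subtracts the three pairwise corrections, and accumulates the diagonal compensation coefficients, which are whitened once at the end.

```python
import numpy as np

def whitened_third_moment(docs, W):
    """E_hat[x1 (x) x2 (x) x3](W,W,W) via the speed-up trick of Appendix B.2.

    docs : list of length-V word-count vectors n (documents with >= 3 words)
    W    : (V, k) whitening matrix W_hat
    Per document the cost is O(M k^2 + k^3), M being the number of distinct
    words; the diagonal compensation is added once at the end in O(V k^3).
    """
    V, k = W.shape
    N = len(docs)
    T = np.zeros((k, k, k))
    diag_coef = np.zeros(V)                        # accumulates 2 n_i / (m(m-1)(m-2))
    for n in docs:
        m = n.sum()
        scale = 1.0 / (m * (m - 1) * (m - 2))
        idx = np.flatnonzero(n)
        cn, Wn = n[idx], W[idx]
        u = Wn.T @ cn                              # W^T n
        B = Wn.T @ (cn[:, None] * Wn)              # W^T diag(n) W
        cube = np.einsum('a,b,c->abc', u, u, u)    # (W^T n)^{(x)3}
        corr = (np.einsum('ab,c->abc', B, u)       # i = j correction
                + np.einsum('a,bc->abc', u, B)     # j = k correction
                + np.einsum('ac,b->abc', B, u))    # i = k correction
        T += scale * (cube - corr)
        diag_coef[idx] += 2.0 * scale * cn
    # i = j = k compensation: sum_i coef_i * w_i^{(x)3}, with w_i the i-th row of W
    T += np.einsum('i,ia,ib,ic->abc', diag_coef, W, W, W)
    return T / N
```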
Appendix C Discussions

Appendix C.1 Extension to other topic recovery algorithms

One important advantage of our proposed inference algorithm is its flexibility: it can be combined with many other LDA inference algorithms to infer supervised LDA model parameters. More specifically, given access to any algorithm that recovers the topic distribution matrix $O$ and the prior parameter $\alpha$ from i.i.d. sampled documents, an inference algorithm for the supervised LDA model can be immediately obtained, as shown in Algorithm 2.

Algorithm 2 sLDA parameter recovery based on an existing LDA inference algorithm $\mathcal A$. Input parameter: $\alpha_0$.
1: Compute empirical moments and obtain $\widehat M_2$ and $\widehat M_y$. Set $\hat\eta = 0$.
2: Find $\hat W \in \mathbb{R}^{n\times k}$ such that $\widehat M_2(\hat W,\hat W) = I_k$.
3: Run algorithm $\mathcal A$ on the observed documents with parameter $\alpha_0$; obtain the estimated topic distribution matrix $\hat O = (\hat\mu_1, \dots, \hat\mu_k)$ and prior parameter $\hat\alpha$.
4: Compute $\hat v_i = \sqrt{\frac{\hat\alpha_i}{\alpha_0(\alpha_0+1)}}\,\hat W^\top\hat\mu_i$ for each topic $i$.
5: Recover the linear classifier: $\hat\eta_i \leftarrow (\alpha_0+1)\,\hat v_i^\top\widehat M_y(\hat W,\hat W)\hat v_i$.
6: Output: $\hat\eta$, $\hat\alpha$ and $\{\hat\mu_i\}_{i=1}^k$.

The sample complexity of Algorithm 2 depends on the sample complexity of the LDA inference algorithm $\mathcal A$. Although the LDA inference algorithm $\mathcal A$ is free to make any assumptions about the topic distribution matrix $O$, we note that linear independence of the topic distribution vectors $\mu$ is still required, because the power update trick (step 5 of Algorithm 2) is valid only when the vectors $W^\top\mu_i$ are orthogonal.

Appendix C.2 Going beyond SVD

The proposed methods are based on spectral decompositions of observable moments and are provably correct. However, a major drawback of these methods is the assumption that the topic distributions $\mu$ are linearly independent. Although standard LDA models assume this topic independence [BNJ03], in practice there is strong evidence that different topics may be related [BL11]. Therefore, it is important to consider possible extensions. Recently, there has been much work on provable LDA inference algorithms that do not require the topic independence assumption [AGH+13, AGM12]. Instead of making the independence assumption, these methods assume a $p$-separability condition on the topic distribution matrix, that is, there exists an anchor word for each topic such that the anchor word only appears in this specific topic, and its probability (given that topic) is at least $p$.

Here we briefly describe an idea that might lead to an sLDA inference algorithm that does not assume topic independence. First, let $O \in \mathbb{R}^{V\times k}$ be the topic distribution matrix defined previously and $H = (h_1, \dots, h_N) \in \mathbb{R}^{k\times N}$ be the matrix of topic mixing vectors of the documents. Suppose $Q \triangleq \mathbb{E}[x_1\otimes x_2]$ is the word co-occurrence frequency matrix and $Q_y \triangleq \mathbb{E}[y\,x_1]$ is the co-occurrence frequency matrix between words and the regression variable. It is clear that both $Q$ and $Q_y$ can be estimated from the training data, and we have
$$Q = ORO^\top, \qquad Q_y = OR\eta, \quad (33)$$
where $R \triangleq \frac{1}{N}\mathbb{E}[HH^\top]$. Assuming the $p$-separability condition holds and using the methods developed in [AGM12, AGH+13], we can consistently recover $O$ and $R$. Now note that $Q_y = (OR)\eta$. As a result, knowing both $OR$ and $Q_y$ reduces the inference of $\eta$ to solving a system of linear equations. Future research could address (1) determining when the matrix $OR$ has full column rank, so that the linear system can be solved, and (2) the sample complexity of such an inference algorithm.
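As a small sketch of the last step of this idea (our own illustration, assuming estimates of $OR$ and $Q_y$ are available from an anchor-word method), the linear system can be solved in the least-squares sense:

```python
import numpy as np

def recover_eta_from_anchor_estimates(OR_hat, Qy_hat):
    """Solve Q_y = (O R) eta for eta in the least-squares sense (Appendix C.2 idea).

    OR_hat : (V, k) estimate of the product O R
    Qy_hat : (V,)   estimate of E[y x1]
    Requires OR_hat to have full column rank for eta to be identifiable.
    """
    eta_hat, _, rank, _ = np.linalg.lstsq(OR_hat, Qy_hat, rcond=None)
    if rank < OR_hat.shape[1]:
        raise np.linalg.LinAlgError("O R is rank-deficient; eta is not identifiable")
    return eta_hat
```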
Appendix C.3 From regression to classification

A natural extension of our result is to consider supervised LDA models for classification purposes. The simplest such model is a logistic regression classification model, where we assume the response variable $y_d$ of each document $d$ lies in $\{+1, -1\}$ and
$$\Pr[y_d = 1\mid h_d] = \frac{\exp(\eta^\top h_d)}{1 + \exp(\eta^\top h_d)},$$
where $\eta$ is a linear classifier. Though it appears simple, such an extension raises fundamental problems. A major obstacle is the fact that the conditional expectation $\mathbb{E}[y\mid h]$ is no longer linear in the topic mixing vector $h$. As a result, we cannot even evaluate $\mathbb{E}[y]$ (or higher-order tensors like $\mathbb{E}[y\,x_1\otimes x_2]$) in closed form. Furthermore, even if we had accurate estimates of these moments, it would be rather difficult to recover $\eta$ from them due to the non-linearity. To our knowledge, the classification problem remains unsolved in terms of spectral decomposition methods, and there is doubt as to whether a provably correct algorithm exists in such scenarios.

References

[AFH+12] Animashree Anandkumar, Dean Foster, Daniel Hsu, Sham Kakade, and Yi-Kai Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. arXiv:1204.6703, 2012.

[AGH+12] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. arXiv:1210.7559, 2012.

[AGH+13] Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, and Ankur Moitra. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.

[AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models: going beyond SVD. In FOCS, 2012.

[BL11] David Blei and John Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17-35, 2007.

[BNJ03] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[CL13] Arun Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In ICML, 2013.

[SG07] M. Steyvers and T. Griffiths. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2007.