Appendix A. Proof to Theorem 1

In this section, we prove the sample complexity bound given in Theorem 1. The proof consists of three main parts. In Appendix A.1, we prove perturbation lemmas that bound the estimation error of the whitened tensors $M_2(W,W)$, $M_y(W,W)$ and $M_3(W,W,W)$ in terms of the estimation error of the tensors themselves. In Appendix A.2, we cite results on the accuracy of SVD and of the robust tensor power method when performed on estimated tensors, and prove the effectiveness of the power update method used to recover the linear regression model $\eta$. Finally, we give tail bounds for the estimation error of $M_2$, $M_y$ and $M_3$ in Appendix A.3 and complete the proof in Appendix A.4. We also make some remarks on the indirect quantities (e.g., $\sigma_k(\widetilde{O})$) used in Theorem 1 and give simplified bounds for some special cases in Appendix A.4. All norms in the following analysis, if not explicitly specified, are 2-norms in the vector and matrix cases and the operator norm in the high-order tensor case.

Appendix A.1 Perturbation lemmas

We first define the canonical topic distribution vectors $\widetilde{\mu}_i$ and the estimation errors of the observable tensors, which simplify the notation in the subsequent analysis.

Definition 1 (canonical topic distribution). Define the canonical version $\widetilde{\mu}_i$ of the topic distribution vector $\mu_i$ as follows:
$$\widetilde{\mu}_i \triangleq \sqrt{\frac{\alpha_i}{\alpha_0(\alpha_0+1)}}\,\mu_i. \qquad (1)$$
We also define $O, \widetilde{O} \in \mathbb{R}^{n\times k}$ by $O = [\mu_1,\ldots,\mu_k]$ and $\widetilde{O} = [\widetilde{\mu}_1,\ldots,\widetilde{\mu}_k]$.

Definition 2 (estimation error). Assume that for some real values $E_P$, $E_y$ and $E_T$, which we will set later,
$$\|M_2 - \widehat{M}_2\| \le E_P, \qquad (2)$$
$$\|M_y - \widehat{M}_y\| \le E_y, \qquad (3)$$
$$\|M_3 - \widehat{M}_3\| \le E_T. \qquad (4)$$

The following lemma analyzes the whitening matrix $W$ of $M_2$. Many of its conclusions are taken directly from [AFH+12].

Lemma 1 (Lemma C.1, [AFH+12]). Let $W, \widehat{W} \in \mathbb{R}^{n\times k}$ be whitening matrices such that $M_2(W,W) = \widehat{M}_2(\widehat{W},\widehat{W}) = I_k$. Let $A = W^\top\widetilde{O}$ and $\widehat{A} = \widehat{W}^\top\widetilde{O}$. Suppose $E_P \le \sigma_k(M_2)/2$. We have
$$\|W\| \le \frac{1}{\sigma_k(\widetilde{O})}, \qquad (5)$$
$$\|\widehat{W}\| \le \frac{2}{\sigma_k(\widetilde{O})}, \qquad (6)$$
$$\|W - \widehat{W}\| \le \frac{4E_P}{\sigma_k(\widetilde{O})^3}, \qquad (7)$$
$$\|W^+\| \le 3\sigma_1(\widetilde{O}), \qquad (8)$$
$$\|\widehat{W}^+\| \le 2\sigma_1(\widetilde{O}), \qquad (9)$$
$$\|W^+ - \widehat{W}^+\| \le \frac{6\sigma_1(\widetilde{O})}{\sigma_k(\widetilde{O})^2}\,E_P, \qquad (10)$$
$$\|A\| = 1, \qquad (11)$$
$$\|\widehat{A}\| \le 2, \qquad (12)$$
$$\|A - \widehat{A}\| \le \frac{4E_P}{\sigma_k(\widetilde{O})^2}, \qquad (13)$$
$$\|AA^\top - \widehat{A}\widehat{A}^\top\| \le \frac{12E_P}{\sigma_k(\widetilde{O})^2}. \qquad (14)$$

Proof. Proof of Eq. (7): let $\widehat{W}^\top\widehat{M}_2\widehat{W} = I$ and $\widehat{W}^\top M_2\widehat{W} = BDB^\top$, where $B$ is orthogonal and $D$ is a positive definite diagonal matrix. We then see that $W = \widehat{W}BD^{-1/2}B^\top$ satisfies the condition $W^\top M_2 W = I$. Subsequently, $\widehat{W} = WBD^{1/2}B^\top$. We can then bound $\|W - \widehat{W}\|$ as follows:
$$\|W - \widehat{W}\| \le \|W\|\,\|I - D^{1/2}\| \le \|W\|\,\|I - D\| \le \frac{4E_P}{\sigma_k(\widetilde{O})^3},$$
where the inequality $\|I - D\| \le 4E_P/\sigma_k(\widetilde{O})^2$ was proved in [AFH+12].

Proof of Eq. (14):
$$\|AA^\top - \widehat{A}\widehat{A}^\top\| \le \|AA^\top - A\widehat{A}^\top\| + \|A\widehat{A}^\top - \widehat{A}\widehat{A}^\top\| \le \|A - \widehat{A}\|\,(\|A\| + \|\widehat{A}\|) \le \frac{12E_P}{\sigma_k(\widetilde{O})^2}.$$
All the other inequalities come from Lemma C.1 of [AFH+12].
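For readers who want to reproduce the whitening step, the following is a minimal numpy sketch (ours, not the authors' code) of one standard way to obtain a matrix $\widehat{W}$ with $\widehat{M}_2(\widehat{W},\widehat{W}) = \widehat{W}^\top\widehat{M}_2\widehat{W} = I_k$ from the top-$k$ eigenpairs of the empirical second moment; the function name and signature are our own.

```python
import numpy as np

def whitening_matrix(M2_hat: np.ndarray, k: int) -> np.ndarray:
    """Return W_hat with W_hat.T @ M2_hat @ W_hat approximately equal to I_k.

    M2_hat is the (V x V) empirical second moment; k is the number of topics.
    Uses the top-k eigenpairs of the symmetric matrix M2_hat.
    """
    # Symmetrize to guard against numerical asymmetry in the empirical moment.
    M2_sym = 0.5 * (M2_hat + M2_hat.T)
    eigvals, eigvecs = np.linalg.eigh(M2_sym)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]            # indices of the k largest eigenvalues
    U, s = eigvecs[:, top], eigvals[top]
    return U / np.sqrt(s)                          # W_hat = U diag(s^{-1/2})

# Example with a synthetic canonical topic matrix O_tilde (V=500, k=5):
#   M2 = O_tilde @ O_tilde.T;  W = whitening_matrix(M2, 5)
# then W.T @ M2 @ W is (numerically) the 5 x 5 identity.
```

Any whitening matrix of this form is unique only up to a rotation, which is all that the analysis above requires.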

We are now able to provide perturbation bounds for the estimation error of the whitened moments.

Definition 3 (estimation error of whitened moments). Define
$$\varepsilon_{p,w} \triangleq \|M_2(W,W) - \widehat{M}_2(\widehat{W},\widehat{W})\|, \qquad (15)$$
$$\varepsilon_{y,w} \triangleq \|M_y(W,W) - \widehat{M}_y(\widehat{W},\widehat{W})\|, \qquad (16)$$
$$\varepsilon_{t,w} \triangleq \|M_3(W,W,W) - \widehat{M}_3(\widehat{W},\widehat{W},\widehat{W})\|. \qquad (17)$$

Lemma 2 (perturbation lemma for whitened moments). Suppose $E_P \le \sigma_k(M_2)/2$. We have
$$\varepsilon_{p,w} \le \frac{16E_P}{\sigma_k(\widetilde{O})^2}, \qquad (18)$$
$$\varepsilon_{y,w} \le \frac{24\|\eta\|E_P}{(\alpha_0+2)\,\sigma_k(\widetilde{O})^2} + \frac{4E_y}{\sigma_k(\widetilde{O})^2}, \qquad (19)$$
$$\varepsilon_{t,w} \le \frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\,\sigma_k(\widetilde{O})^5} + \frac{8E_T}{\sigma_k(\widetilde{O})^3}. \qquad (20)$$

Proof. Using the idea in the proof of Lemma C.1 in [AFH+12], we split $\varepsilon_{p,w}$ as
$$\varepsilon_{p,w} \le \|M_2(W,W) - M_2(\widehat{W},\widehat{W})\| + \|M_2(\widehat{W},\widehat{W}) - \widehat{M}_2(\widehat{W},\widehat{W})\|,$$
and bound the two terms separately. For the first term, we have
$$\|M_2(W,W) - M_2(\widehat{W},\widehat{W})\| = \|W^\top M_2 W - \widehat{W}^\top M_2\widehat{W}\| = \|AA^\top - \widehat{A}\widehat{A}^\top\| \le \frac{12E_P}{\sigma_k(\widetilde{O})^2},$$
where the last inequality comes from Eq. (14). For the second term, we have
$$\|M_2(\widehat{W},\widehat{W}) - \widehat{M}_2(\widehat{W},\widehat{W})\| \le \|\widehat{W}\|^2\,\|M_2 - \widehat{M}_2\| \le \frac{4E_P}{\sigma_k(\widetilde{O})^2},$$
where the last inequality comes from Eq. (6).

Similarly, $\varepsilon_{y,w}$ can be split into $\|M_y(W,W) - M_y(\widehat{W},\widehat{W})\|$ and $\|M_y(\widehat{W},\widehat{W}) - \widehat{M}_y(\widehat{W},\widehat{W})\|$, which can be bounded separately. For the first term, we have
$$\|M_y(W,W) - M_y(\widehat{W},\widehat{W})\| = \|W^\top M_y W - \widehat{W}^\top M_y\widehat{W}\| = \frac{1}{\alpha_0+2}\left\|A\,\mathrm{diag}(\eta)A^\top - \widehat{A}\,\mathrm{diag}(\eta)\widehat{A}^\top\right\| \le \frac{2\|\eta\|}{\alpha_0+2}\left\|AA^\top - \widehat{A}\widehat{A}^\top\right\| \le \frac{24\|\eta\|E_P}{(\alpha_0+2)\,\sigma_k(\widetilde{O})^2}.$$
For the second term, we have
$$\|M_y(\widehat{W},\widehat{W}) - \widehat{M}_y(\widehat{W},\widehat{W})\| \le \|\widehat{W}\|^2\,\|M_y - \widehat{M}_y\| \le \frac{4E_y}{\sigma_k(\widetilde{O})^2}.$$
Finally, we bound $\varepsilon_{t,w}$ as below, following [CL13]:
$$\varepsilon_{t,w} = \|M_3(W,W,W) - \widehat{M}_3(\widehat{W},\widehat{W},\widehat{W})\| \le \|M_3\|\,\|W - \widehat{W}\|\left(\|W\|^2 + \|W\|\|\widehat{W}\| + \|\widehat{W}\|^2\right) + \|\widehat{W}\|^3\,\|M_3 - \widehat{M}_3\| \le \frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\,\sigma_k(\widetilde{O})^5} + \frac{8E_T}{\sigma_k(\widetilde{O})^3},$$
where we have used the fact that
$$\|M_3\| \le \sum_{i=1}^k \frac{2\alpha_i}{\alpha_0(\alpha_0+1)(\alpha_0+2)} = \frac{2}{(\alpha_0+1)(\alpha_0+2)}.$$
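To make the scaling of Lemma 2 concrete, here is a small synthetic sanity check (our own, assuming numpy and a randomly generated $\widetilde{O}$) that compares the observed $\varepsilon_{y,w}$ against the bound of Eq. (19); it is an illustration only, not part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k, alpha0 = 300, 5, 1.0

O_tilde = np.abs(rng.normal(size=(V, k)))            # synthetic canonical topic matrix
eta = rng.normal(size=k)                             # synthetic regression coefficients
M2 = O_tilde @ O_tilde.T
My = O_tilde @ np.diag(eta) @ O_tilde.T / (alpha0 + 2)

def sym_noise(scale):
    """Symmetric perturbation with spectral norm equal to `scale`."""
    E = rng.normal(size=(V, V)); E = 0.5 * (E + E.T)
    return E * (scale / np.linalg.norm(E, 2))

sigma_k = np.linalg.svd(O_tilde, compute_uv=False)[-1]
M2_hat = M2 + sym_noise(0.05 * sigma_k**2)           # keeps E_P well below sigma_k(M2)/2
My_hat = My + sym_noise(0.05 * sigma_k**2)
E_P = np.linalg.norm(M2 - M2_hat, 2)
E_y = np.linalg.norm(My - My_hat, 2)

def whiten(M):
    """Exact whitening matrix of a symmetric PSD matrix (top-k eigenpairs)."""
    w, U = np.linalg.eigh(0.5 * (M + M.T))
    idx = np.argsort(w)[::-1][:k]
    return U[:, idx] / np.sqrt(w[idx])

W, W_hat = whiten(M2), whiten(M2_hat)
eps_yw = np.linalg.norm(W.T @ My @ W - W_hat.T @ My_hat @ W_hat, 2)
bound = 24 * np.linalg.norm(eta) * E_P / ((alpha0 + 2) * sigma_k**2) + 4 * E_y / sigma_k**2
print(f"observed eps_yw = {eps_yw:.4f}, Eq. (19) bound = {bound:.4f}")
```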

Appendix A.2 SVD accuracy

The key idea behind the spectral recovery of LDA topic models is the simultaneous diagonalization trick, which asserts that we can recover the LDA model parameters by performing orthogonal tensor decomposition on a pair of simultaneously whitened moments, for example $(M_2, M_3)$ or $(M_2, M_y)$. The following proposition details this insight, as we derive orthogonal tensor decompositions for the whitened tensors $M_y(W,W)$ and $M_3(W,W,W)$.

Proposition 2. Define $v_i \triangleq W^\top\widetilde{\mu}_i = \sqrt{\frac{\alpha_i}{\alpha_0(\alpha_0+1)}}\,W^\top\mu_i$. Then:
1. $\{v_i\}_{i=1}^k$ is an orthonormal basis.
2. $M_y(W,W)$ has singular value/singular vector pairs $(\sigma_i^y, v_i)$ with $\sigma_i^y = \frac{\eta_j}{\alpha_0+2}$ for some $j \in [k]$.
3. $M_3(W,W,W)$ has robust eigenvalue/eigenvector pairs [AGH+12] $(\lambda_i, v_i)$ with $\lambda_i = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_j}}$ for some $j \in [k]$.

Proof. The orthonormality of $\{v_i\}_{i=1}^k$ follows from the fact that $W^\top M_2 W = \sum_{i=1}^k v_iv_i^\top = I_k$. Subsequently, we have
$$M_y(W,W) = \frac{1}{\alpha_0+2}\sum_{i=1}^k \eta_i\,v_iv_i^\top,$$
$$M_3(W,W,W) = \frac{2}{\alpha_0+2}\sum_{i=1}^k \sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_i}}\;v_i\otimes v_i\otimes v_i.$$
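The multilinear notation $M(W,W)$ and $M(W,W,W)$ used above denotes contraction of every mode of a matrix or a third-order tensor with $W$; a short numpy sketch of these contractions (our own helper functions, not code from the paper) is:

```python
import numpy as np

def contract2(M2: np.ndarray, W: np.ndarray) -> np.ndarray:
    """M2(W, W) = W^T M2 W, a k x k matrix."""
    return W.T @ M2 @ W

def contract3(M3: np.ndarray, W: np.ndarray) -> np.ndarray:
    """M3(W, W, W): contract the V x V x V tensor M3 with W along all three modes,
    giving a k x k x k tensor whose (a, b, c) entry is
    sum_{i,j,l} M3[i, j, l] * W[i, a] * W[j, b] * W[l, c]."""
    return np.einsum('ijl,ia,jb,lc->abc', M3, W, W, W)

# Under Proposition 2, contract2(My, W) equals (1/(alpha0+2)) * sum_i eta_i v_i v_i^T, and
# contract3(M3, W) equals (2/(alpha0+2)) * sum_i sqrt(alpha0(alpha0+1)/alpha_i) v_i (x) v_i (x) v_i.
```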

The following lemmas (Lemma 3 and Lemma 4) give upper bounds on the estimation errors of $\eta$ and $\mu$ in terms of $|\hat\lambda_i - \lambda_i|$, $\|\hat v_i - v_i\|$ and the estimation errors of the whitened moments defined in Definition 3.

Lemma 3 ($\eta_i$ estimation error bound). Define $\hat\eta_i \triangleq (\alpha_0+2)\,\hat v_i^\top\widehat{M}_y(\widehat{W},\widehat{W})\,\hat v_i$, where $\hat v_i$ is some estimate of $v_i$. We then have
$$|\hat\eta_i - \eta_i| \le 2\|\eta\|\,\|\hat v_i - v_i\| + (\alpha_0+2)\bigl(1 + 2\|\hat v_i - v_i\|\bigr)\,\varepsilon_{y,w}. \qquad (21)$$

Proof. First, note that $v_i^\top M_y(W,W)v_i = \frac{\eta_i}{\alpha_0+2}$ because $\{v_i\}_{i=1}^k$ are orthonormal. Subsequently, we have
$$\frac{|\hat\eta_i - \eta_i|}{\alpha_0+2} = \left|\hat v_i^\top\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i - v_i^\top M_y(W,W)v_i\right| = \left|(\hat v_i - v_i)^\top\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i + v_i^\top\bigl(\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i - M_y(W,W)v_i\bigr)\right|$$
$$\le \|\hat v_i - v_i\|\,\|\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i\| + \|v_i\|\,\|\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i - M_y(W,W)v_i\|.$$
Note that both $v_i$ and $\hat v_i$ are unit vectors. Therefore,
$$\frac{|\hat\eta_i - \eta_i|}{\alpha_0+2} \le \|\widehat{M}_y(\widehat{W},\widehat{W})\|\,\|\hat v_i - v_i\| + \|\widehat{M}_y(\widehat{W},\widehat{W})\|\,\|\hat v_i - v_i\| + \|\widehat{M}_y(\widehat{W},\widehat{W}) - M_y(W,W)\|\,\|v_i\| \le \|\hat v_i - v_i\|\left(\frac{2\|\eta\|}{\alpha_0+2} + 2\varepsilon_{y,w}\right) + \varepsilon_{y,w}.$$
The last inequality is due to the fact that $\|M_y(W,W)\| \le \frac{\|\eta\|}{\alpha_0+2}$.

Lemma 4 ($\mu_i$ estimation error bound). Define $\hat\mu_i \triangleq \frac{\alpha_0+2}{2}\,\hat\lambda_i(\widehat{W}^+)^\top\hat v_i$, where $(\hat\lambda_i, \hat v_i)$ are some estimates of the eigenvalue/eigenvector pairs $(\lambda_i, v_i)$ of $M_3(W,W,W)$. We then have
$$\|\hat\mu_i - \mu_i\| \le \frac{3(\alpha_0+2)}{2}\,\sigma_1(\widetilde{O})\,|\hat\lambda_i - \lambda_i| + 3\alpha_{\max}\sigma_1(\widetilde{O})\,\|\hat v_i - v_i\| + \frac{6\alpha_{\max}\sigma_1(\widetilde{O})E_P}{\sigma_k(\widetilde{O})^2}. \qquad (22)$$

Proof. First note that $\mu_i = \frac{\alpha_0+2}{2}\,\lambda_i(W^+)^\top v_i$. Subsequently,
$$\frac{2}{\alpha_0+2}\,\|\hat\mu_i - \mu_i\| = \|\hat\lambda_i(\widehat{W}^+)^\top\hat v_i - \lambda_i(W^+)^\top v_i\| \le \|\hat\lambda_i\widehat{W}^+ - \lambda_iW^+\|\,\|\hat v_i\| + |\lambda_i|\,\|W^+\|\,\|\hat v_i - v_i\|$$
$$\le |\hat\lambda_i - \lambda_i|\,\|\widehat{W}^+\| + |\lambda_i|\,\|\widehat{W}^+ - W^+\| + |\lambda_i|\,\|W^+\|\,\|\hat v_i - v_i\| \le 3\sigma_1(\widetilde{O})\,|\hat\lambda_i - \lambda_i| + \frac{2\alpha_{\max}}{\alpha_0+2}\cdot\frac{6\sigma_1(\widetilde{O})E_P}{\sigma_k(\widetilde{O})^2} + \frac{2\alpha_{\max}}{\alpha_0+2}\cdot 3\sigma_1(\widetilde{O})\,\|\hat v_i - v_i\|.$$
Multiplying both sides by $\frac{\alpha_0+2}{2}$ yields Eq. (22).
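Lemmas 3 and 4 analyze the two recovery rules $\hat\eta_i = (\alpha_0+2)\hat v_i^\top\widehat{M}_y(\widehat{W},\widehat{W})\hat v_i$ and $\hat\mu_i = \frac{\alpha_0+2}{2}\hat\lambda_i(\widehat{W}^+)^\top\hat v_i$. A small numpy sketch of these updates (our own illustration; the whitening matrix and the estimated eigenpairs are assumed to be given) is:

```python
import numpy as np

def recover_eta(My_hat_w: np.ndarray, v_hat: np.ndarray, alpha0: float) -> np.ndarray:
    """Power update of Lemma 3: eta_hat_i = (alpha0 + 2) * v_hat_i^T My_hat(W_hat, W_hat) v_hat_i.

    My_hat_w : k x k whitened empirical moment My_hat(W_hat, W_hat).
    v_hat    : k x k matrix whose columns are the estimated eigenvectors v_hat_i.
    """
    return (alpha0 + 2) * np.einsum('ki,kl,li->i', v_hat, My_hat_w, v_hat)

def recover_mu(W_hat: np.ndarray, lam_hat: np.ndarray, v_hat: np.ndarray, alpha0: float) -> np.ndarray:
    """Topic recovery of Lemma 4: mu_hat_i = ((alpha0 + 2)/2) * lam_hat_i * (W_hat^+)^T v_hat_i.

    Returns a V x k matrix whose i-th column is mu_hat_i (unnormalized).
    """
    W_pinv = np.linalg.pinv(W_hat)                 # k x V pseudo-inverse of the whitening matrix
    return ((alpha0 + 2) / 2.0) * (W_pinv.T @ v_hat) * lam_hat
```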

To bound the error of the orthogonal tensor decomposition performed on the estimated tensor $\widehat{M}_3(\widehat{W},\widehat{W},\widehat{W})$, we cite Theorem 5.1 of [AGH+12], a sample complexity analysis of the robust tensor power method that we use to recover $\lambda_i$ and $v_i$.

Lemma 5 (Theorem 5.1, [AGH+12]). Let $\lambda_{\max} = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\min}}}$ and $\lambda_{\min} = \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\max}}}$, where $\alpha_{\min} = \min_i\alpha_i$ and $\alpha_{\max} = \max_i\alpha_i$. Then there exist universal constants $C_1, C_2 > 0$ such that the following holds: fix $\delta' \in (0,1)$, and suppose $\varepsilon_{t,w} \le \varepsilon$ and
$$\varepsilon_{t,w} \le C_1\cdot\frac{\lambda_{\min}}{k}. \qquad (23)$$
Suppose $\{(\hat\lambda_i, \hat v_i)\}_{i=1}^k$ are the eigenvalue/eigenvector pairs returned by running Algorithm 1 in [AGH+12] with input $\widehat{M}_3(\widehat{W},\widehat{W},\widehat{W})$, using $L = \mathrm{poly}(k)\log(1/\delta')$ restarts and $C_2\bigl(\log(k) + \log\log(\lambda_{\max}/\varepsilon)\bigr)$ iterations. With probability greater than $1-\delta'$, there exists a permutation $\pi: [k]\to[k]$ such that for all $i$,
$$\|\hat v_i - v_{\pi(i)}\| \le 8\varepsilon/\lambda_{\min}, \qquad |\hat\lambda_i - \lambda_{\pi(i)}| \le 5\varepsilon.$$
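Algorithm 1 of [AGH+12], referenced in Lemma 5, is the robust tensor power method. The following is only a rough Python sketch of its core loop (random restarts, power iterations and deflation) to convey the idea; it is not the exact procedure analyzed in the lemma, and the default values of the restart and iteration counts are placeholders.

```python
import numpy as np

def tensor_power_method(T: np.ndarray, k: int, L: int = 50, n_iter: int = 100, seed: int = 0):
    """Return (lams, vecs): estimated robust eigenpairs of a symmetric k x k x k tensor T.

    Minimal sketch: for each of the k factors, run n_iter power iterations from L random
    starts, keep the best start, then deflate. Not the exact Algorithm 1 of [AGH+12].
    """
    rng = np.random.default_rng(seed)
    lams, vecs = [], []
    for _ in range(k):
        best_v, best_lam = None, -np.inf
        for _ in range(L):
            v = rng.normal(size=T.shape[0]); v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v = np.einsum('ijl,j,l->i', T, v, v)     # power update T(I, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijl,i,j,l->', T, v, v, v)   # eigenvalue estimate T(v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        lams.append(best_lam); vecs.append(best_v)
        T = T - best_lam * np.einsum('i,j,l->ijl', best_v, best_v, best_v)   # deflation
    return np.array(lams), np.column_stack(vecs)
```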

Appendix A.3 Tail inequalities

Lemma 6 (Lemma 5, [CL13]). Let $x_1,\ldots,x_N \in \mathbb{R}^d$ be i.i.d. samples from some distribution with bounded support (i.e., $\|x\| \le B$ with probability 1 for some constant $B$). Then with probability at least $1-\delta$,
$$\left\|\frac{1}{N}\sum_{i=1}^N x_i - \mathbb{E}[x]\right\| \le \frac{B}{\sqrt{N}}\left(1 + \sqrt{\log(1/\delta)}\right).$$

Corollary 1. Let $x_1,\ldots,x_N \in \mathbb{R}^d$ be i.i.d. samples from some distribution with $\Pr[\|x\| \le B] \ge 1-\delta'$. Then with probability at least $1 - \delta - N\delta'$,
$$\left\|\frac{1}{N}\sum_{i=1}^N x_i - \mathbb{E}[x]\right\| \le \frac{B}{\sqrt{N}}\left(1 + \sqrt{\log(1/\delta)}\right).$$
Proof. Use a union bound.

Lemma 7 (concentration of moment norms). Suppose we obtain $N$ i.i.d. samples (i.e., documents with at least three words each, together with their regression variables, in the sLDA model). Define $R(\delta) \triangleq \|\eta\| + \Phi^{-1}(1-\delta)$, where $\Phi^{-1}(\cdot)$ is the inverse of the CDF of the standard Gaussian distribution. Let $\mathbb{E}[\cdot]$ denote the mean under the true underlying distribution and $\widehat{\mathbb{E}}[\cdot]$ the empirical mean. Then
1. $\Pr\left[\|\mathbb{E}[x_1] - \widehat{\mathbb{E}}[x_1]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
2. $\Pr\left[\|\mathbb{E}[x_1\otimes x_2] - \widehat{\mathbb{E}}[x_1\otimes x_2]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
3. $\Pr\left[\|\mathbb{E}[x_1\otimes x_2\otimes x_3] - \widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]\|_F < \frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
4. $\Pr\left[|\mathbb{E}[y] - \widehat{\mathbb{E}}[y]| < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
5. $\Pr\left[\|\mathbb{E}[y\,x_1] - \widehat{\mathbb{E}}[y\,x_1]\|_F < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$;
6. $\Pr\left[\|\mathbb{E}[y\,x_1\otimes x_2] - \widehat{\mathbb{E}}[y\,x_1\otimes x_2]\|_F < R(\delta/4\sigma)\cdot\frac{1+\sqrt{\log(1/\delta)}}{\sqrt{N}}\right] \ge 1-\delta$.

Proof. Use Lemma 6, and Corollary 1 for the concentration bounds involving the regression variable $y$.

Corollary 2. With probability $1-\delta$ the following holds:
1. $E_P = \|M_2 - \widehat{M}_2\| \le \frac{3\left(1+\sqrt{\log(6/\delta)}\right)}{\sqrt{N}}$;
2. $E_y = \|M_y - \widehat{M}_y\| \le \frac{10\,R(\delta/60\sigma)\left(1+\sqrt{\log(15/\delta)}\right)}{\sqrt{N}}$;
3. $E_T = \|M_3 - \widehat{M}_3\| \le \frac{12\left(1+\sqrt{\log(9/\delta)}\right)}{\sqrt{N}}$.

Proof. Corollary 2 can be proved by expanding each moment by definition and then applying the tail inequalities of Lemma 7 together with a union bound. Also note that $\|\cdot\| \le \|\cdot\|_F$ for all matrices.
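As a purely illustrative check of the $1/\sqrt{N}$ scaling in Lemma 6 (our own Monte Carlo experiment, assuming numpy), one can sample bounded random vectors and compare the deviation of their empirical mean with $B(1+\sqrt{\log(1/\delta)})/\sqrt{N}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, B, delta, trials = 50, 2000, 1.0, 0.05, 200

def sample(n):
    """Bounded i.i.d. vectors: uniform on the sphere of radius B, so E[x] = 0 and ||x|| = B."""
    x = rng.normal(size=(n, d))
    return B * x / np.linalg.norm(x, axis=1, keepdims=True)

bound = B * (1 + np.sqrt(np.log(1 / delta))) / np.sqrt(N)
deviations = [np.linalg.norm(sample(N).mean(axis=0)) for _ in range(trials)]
print(f"fraction of trials within the Lemma 6 bound: {np.mean(np.array(deviations) <= bound):.3f}")
```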

Appendix A.4 Completing the proof

We are now ready to give a complete proof of Theorem 1.

Theorem 1 (sample complexity bound). Let $\sigma_1(\widetilde{O})$ and $\sigma_k(\widetilde{O})$ be the largest and the smallest singular values of the canonical topic distribution matrix $\widetilde{O}$. Define $\lambda_{\min} \triangleq \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\max}}}$ and $\lambda_{\max} \triangleq \frac{2}{\alpha_0+2}\sqrt{\frac{\alpha_0(\alpha_0+1)}{\alpha_{\min}}}$, where $\alpha_{\max}$ and $\alpha_{\min}$ are the largest and the smallest entries of $\alpha$. Suppose $\hat\mu$, $\hat\alpha$ and $\hat\eta$ are the outputs of Algorithm 1, and $L$ is at least a linear function of $k$. Fix $\delta \in (0,1)$. For any small error-tolerance parameter $\varepsilon > 0$, if Algorithm 1 is run with parameter $T = \Omega(\log(k) + \log\log(\lambda_{\max}/\varepsilon))$ on $N$ i.i.d. sampled documents with $N \ge \max(n_1, n_2, n_3)$, where
$$n_1 = C_1\left(1 + \sqrt{\log(6/\delta)}\right)^2\cdot\frac{\alpha_0^2(\alpha_0+1)^2}{\alpha_{\min}^2}, \qquad (24)$$
$$n_2 = C_2\cdot\frac{\left(1 + \sqrt{\log(15/\delta)}\right)^2}{\varepsilon^2\,\sigma_k(\widetilde{O})^4}\cdot\max\left(\bigl(\|\eta\| + \Phi^{-1}(1-\delta/60\sigma)\bigr)^2,\ \alpha_{\max}^2\sigma_1(\widetilde{O})^2\right), \qquad (25)$$
$$n_3 = C_3\cdot\frac{\left(1 + \sqrt{\log(9/\delta)}\right)^2}{\sigma_k(\widetilde{O})^{10}}\cdot\max\left(\frac{1}{\varepsilon^2},\ \frac{k^2}{\lambda_{\min}^2}\right), \qquad (26)$$
and $C_1$, $C_2$ and $C_3$ are universal constants, then with probability at least $1-\delta$ there exists a permutation $\pi: [k]\to[k]$ such that for every topic $i$ the following holds:
1. $|\hat\alpha_i - \alpha_{\pi(i)}| \le \dfrac{4\alpha_0(\alpha_0+1)(2\lambda_{\max}+5\varepsilon)}{(\alpha_0+2)^2\lambda_{\min}^2(\lambda_{\min}-5\varepsilon)^2}\,5\varepsilon$, if $\lambda_{\min} > 5\varepsilon$;
2. $\|\hat\mu_i - \mu_{\pi(i)}\| \le \left(3\sigma_1(\widetilde{O})\left(\dfrac{8\alpha_{\max}}{\lambda_{\min}} + \dfrac{5(\alpha_0+2)}{2}\right) + 1\right)\varepsilon$;
3. $|\hat\eta_i - \eta_{\pi(i)}| \le \left(\dfrac{16\|\eta\|}{\lambda_{\min}} + 2(\alpha_0+2)\right)\varepsilon$.

Proof. First, the assumption $E_P \le \sigma_k(M_2)/2$ is required for the error bounds on $\varepsilon_{p,w}$, $\varepsilon_{y,w}$ and $\varepsilon_{t,w}$. Noting Corollary 2 and the fact that $\sigma_k(M_2) = \frac{\alpha_{\min}}{\alpha_0(\alpha_0+1)}$, we have
$$N = \Omega\left(\frac{\alpha_0^2(\alpha_0+1)^2}{\alpha_{\min}^2}\left(1 + \sqrt{\log(6/\delta)}\right)^2\right).$$
Note that this lower bound does not depend on $k$, $\varepsilon$ or $\sigma_k(\widetilde{O})$. For Lemma 5 to hold, we need the assumption that $\varepsilon_{t,w} \le \min(\varepsilon, O(\lambda_{\min}/k))$. This implies Eq. (26), as we expand $\varepsilon_{t,w}$ according to Definition 3 and note that the first term $\frac{54E_P}{(\alpha_0+1)(\alpha_0+2)\sigma_k(\widetilde{O})^5}$ dominates the second one. The $\alpha_0$ is missing in Eq. (26) because $\alpha_0+1, \alpha_0+2 \ge 1$ and we discard them both. The $|\hat\alpha_i - \alpha_{\pi(i)}|$ bound follows immediately from Lemma 5 and the recovery rule $\hat\alpha_i = \frac{4\alpha_0(\alpha_0+1)}{(\alpha_0+2)^2\hat\lambda_i^2}$.

To bound the estimation error for the linear classifier $\eta$, we need to further bound $\varepsilon_{y,w}$; we assume $\varepsilon_{y,w} \le \varepsilon$. Expanding $\varepsilon_{y,w}$ according to Definition 3 in a similar manner, we obtain the $(\|\eta\| + \Phi^{-1}(1-\delta/60\sigma))^2$ term in Eq. (25). The bound on $|\hat\eta_i - \eta_{\pi(i)}|$ then follows immediately from Lemma 3. Finally, we bound $\|\hat\mu_i - \mu_{\pi(i)}\|$ using Lemma 4. We need to assume that $\frac{6\alpha_{\max}\sigma_1(\widetilde{O})E_P}{\sigma_k(\widetilde{O})^2} \le \varepsilon$, which gives the $\alpha_{\max}^2\sigma_1(\widetilde{O})^2$ term in Eq. (25). The $\|\hat\mu_i - \mu_{\pi(i)}\|$ bound then follows from Lemma 4 and Lemma 5.

We make some remarks on the main theorem. In Remark 1, we establish links between the indirect quantities appearing in Theorem 1 (e.g., $\sigma_k(\widetilde{O})$) and functions of the original model parameters (e.g., $\sigma_k(O)$). These connections follow directly from the definitions.

Remark 1. The indirect quantities $\sigma_1(\widetilde{O})$ and $\sigma_k(\widetilde{O})$ can be related to $\sigma_1(O)$, $\sigma_k(O)$ and $\alpha$ in the following way:
$$\sqrt{\frac{\alpha_{\min}}{\alpha_0(\alpha_0+1)}}\,\sigma_k(O) \le \sigma_k(\widetilde{O}) \le \sqrt{\frac{\alpha_{\max}}{\alpha_0(\alpha_0+1)}}\,\sigma_k(O); \qquad \sigma_1(\widetilde{O}) \le \sqrt{\frac{\alpha_{\max}}{\alpha_0(\alpha_0+1)}}\,\sigma_1(O) \le \frac{\sigma_1(O)}{\sqrt{\alpha_0+1}}.$$

We now take a closer look at the sample complexity bound in Theorem 1. It is evident that $n_2$ can be neglected when the number of topics $k$ gets large, because in practice the norm of the linear regression model $\eta$ is usually assumed to be small in order to avoid overfitting. Moreover, as mentioned before, the prior parameter $\alpha$ is often assumed to be homogeneous, with $\alpha_i = 1/k$ [SG07]. With these observations, the sample complexity bound in Theorem 1 can be greatly simplified.

Remark 2. Assume $\|\eta\|$ and $\sigma$ are small and $\alpha = (1/k, \ldots, 1/k)$. As the number of topics $k$ gets large, the sample complexity bound in Theorem 1 can be simplified as
$$N = \Omega\left(\frac{\log(1/\delta)}{\sigma_k(\widetilde{O})^{10}}\cdot\max(\varepsilon^{-2}, k^3)\right). \qquad (27)$$

The sample complexity bound in Remark 2 may look formidable, as it depends on $\sigma_k(\widetilde{O})^{10}$. However, such a dependence is somewhat necessary because we are using third-order tensors to recover the underlying model parameters. Furthermore, the dependence on $\sigma_k(\widetilde{O})^{10}$ is introduced by the robust tensor power method used to recover the LDA parameters; the reconstruction accuracy of $\eta$ only depends on $\sigma_k(\widetilde{O})^4$ and $(\|\eta\| + \Phi^{-1}(1-\delta/60\sigma))^2$. As a consequence, if we could combine our power update method for $\eta$ with an LDA inference algorithm that has a milder dependence on the singular value $\sigma_k(\widetilde{O})$, we might obtain an algorithm with a better sample complexity. Such an extension is discussed in Appendix C.

Appendix B. Moments of Observable Variables

Appendix B.1 Proof to Proposition 1

The equations for $M_2$ and $M_3$ have already been proved in [AFH+12] and [AGH+12]. Here we only give the proof of the equation for $M_y$; in fact, all three equations can be proved in a similar manner. In sLDA the topic mixing vector $h$ follows a Dirichlet prior distribution with parameter $\alpha$. Therefore, we have
$$\mathbb{E}[h_i] = \frac{\alpha_i}{\alpha_0}, \qquad \mathbb{E}[h_ih_j] = \begin{cases}\dfrac{\alpha_i(\alpha_i+1)}{\alpha_0(\alpha_0+1)}, & i = j,\\[4pt] \dfrac{\alpha_i\alpha_j}{\alpha_0(\alpha_0+1)}, & i \ne j,\end{cases} \qquad \mathbb{E}[h_ih_jh_l] = \begin{cases}\dfrac{\alpha_i(\alpha_i+1)(\alpha_i+2)}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i = j = l,\\[4pt] \dfrac{\alpha_i(\alpha_i+1)\alpha_l}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i = j \ne l,\\[4pt] \dfrac{\alpha_i\alpha_j\alpha_l}{\alpha_0(\alpha_0+1)(\alpha_0+2)}, & i \ne j,\ j \ne l,\ i \ne l.\end{cases} \qquad (28)$$
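The Dirichlet moments in Eq. (28) are easy to confirm by simulation; a quick numpy check of the second-order case (our own, for illustration only) is:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.4, 0.3, 0.2, 0.1])
alpha0 = alpha.sum()
H = rng.dirichlet(alpha, size=200000)          # rows are i.i.d. draws of the topic mixing vector h

# Second moment E[h_i h_j]: empirical average vs. the closed form of Eq. (28).
emp = (H[:, :, None] * H[:, None, :]).mean(axis=0)
exact = np.outer(alpha, alpha) / (alpha0 * (alpha0 + 1))
exact[np.diag_indices_from(exact)] = alpha * (alpha + 1) / (alpha0 * (alpha0 + 1))
print(np.abs(emp - exact).max())               # close to zero up to Monte Carlo error
```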

Next, note that
$$\mathbb{E}[y\,|\,h] = \eta^\top h, \qquad \mathbb{E}[x_1\,|\,h] = \sum_{i=1}^k h_i\mu_i, \qquad \mathbb{E}[x_1\otimes x_2\,|\,h] = \sum_{i,j=1}^k h_ih_j\,\mu_i\otimes\mu_j, \qquad (29)$$
$$\mathbb{E}[y\,x_1\otimes x_2\,|\,h] = \sum_{i,j,l=1}^k h_ih_jh_l\,\eta_l\,\mu_i\otimes\mu_j. \qquad (30)$$
Proposition 1 can then be proved easily by taking the expectation over the topic mixing vector $h$.

Appendix B.2 Details of the speed-up trick

In this section we provide details of the trick mentioned in the main paper to speed up the computation of the empirical moments. First, note that the computation of $M_1$, $M_2$ and $M_y$ only requires $O(NM^2)$ time and $O(V^2)$ space; they do not need to be accelerated in most practical applications. This time and space complexity also applies to all terms in $M_3$ except the $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]$ term, which requires $O(NM^3)$ time and $O(V^3)$ space in a naive implementation. Therefore, this section is devoted to speeding up the computation of $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3]$. More precisely, as mentioned in the main paper, what we want to compute is the whitened empirical moment $\widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3](\widehat{W},\widehat{W},\widehat{W}) \in \mathbb{R}^{k\times k\times k}$.

Fix a document $D$ with $m$ words. Let $T \triangleq \widehat{\mathbb{E}}[x_1\otimes x_2\otimes x_3\,|\,D]$ be the required empirical tensor. By definition, we have
$$T_{i,j,l} = \frac{1}{m(m-1)(m-2)}\begin{cases} n_i(n_i-1)(n_i-2), & i = j = l;\\ n_i(n_i-1)n_l, & i = j,\ j \ne l;\\ n_in_j(n_j-1), & j = l,\ i \ne j;\\ n_in_j(n_i-1), & i = l,\ i \ne j;\\ n_in_jn_l, & \text{otherwise},\end{cases} \qquad (31)$$
where $n_i$ is the number of occurrences of the $i$-th word in document $D$. If we had $T_{i,j,l} = \frac{n_in_jn_l}{m(m-1)(m-2)}$ for all indices $i$, $j$ and $l$, then we would only need to compute
$$T(W,W,W) = \frac{1}{m(m-1)(m-2)}\,(W^\top n)^{\otimes 3}, \qquad (32)$$
where $n \triangleq (n_1, n_2, \ldots, n_V)^\top$. This takes $O(Mk + k^3)$ computational time per document, because $n$ contains at most $M$ non-zero entries, and the total time complexity is thus reduced from $O(NM^3)$ to $O(N(Mk + k^3))$.

We now consider the remaining entries, where at least two indices are identical. We first consider the entries with exactly two equal indices, for example $i = j$. For these indices we need to subtract an $n_in_l$ term, as shown in Eq. (31). That is, we need to compute the whitened tensor $\Delta(W,W,W)$, where $\Delta \in \mathbb{R}^{V\times V\times V}$ and
$$\Delta_{i,j,l} = \frac{1}{m(m-1)(m-2)}\begin{cases} n_in_l, & i = j;\\ 0, & \text{otherwise}.\end{cases}$$
Note that $\Delta$ can be written as $\frac{1}{m(m-1)(m-2)}\,A\otimes n$, where $A = \mathrm{diag}(n_1, n_2, \ldots, n_V)$ is a $V\times V$ diagonal matrix and $n = (n_1, n_2, \ldots, n_V)^\top$ is defined as above. As a result,
$$\Delta(W,W,W) = \frac{1}{m(m-1)(m-2)}\,(W^\top AW)\otimes(W^\top n).$$
So the computational complexity of $\Delta(W,W,W)$ depends on how we compute $W^\top AW$. Since $A$ is a diagonal matrix with at most $M$ non-zero entries, $W^\top AW$ can be computed in $O(Mk^2)$ operations. Therefore, the time complexity of computing $\Delta(W,W,W)$ is $O(Mk^2)$ per document; the cases $j = l$ and $i = l$ are handled in the same way.

Finally, we handle the entries with all three indices equal, that is, $i = j = l$. As indicated by Eq. (31), we need to add a $\frac{2n_i}{m(m-1)(m-2)}$ term for compensation. This can be done efficiently by first computing $\widehat{\mathbb{E}}\!\left[\frac{2n_i}{m(m-1)(m-2)}\right]$ over all documents (requiring $O(NV)$ time), and then adding the whitened compensation term, which takes $O(Vk^3)$ operations.
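Putting the pieces together, a per-document sketch of the whitened third-moment computation described above (our own implementation of the trick; the count vector $n$ and the whitening matrix $W$ are assumed given) is:

```python
import numpy as np

def whitened_third_moment(n: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Whitened per-document tensor T(W, W, W) for the empirical tensor T of Eq. (31).

    n : length-V word-count vector of one document (m = n.sum() >= 3 words)
    W : V x k whitening matrix
    With sparse counts this runs in O(M k^2 + k^3) time per document, plus the
    O(V k^3) compensation term (which the text amortizes over all documents).
    """
    m = n.sum()
    scale = 1.0 / (m * (m - 1) * (m - 2))
    Wn = W.T @ n                      # W^T n, a k-vector
    WAW = (W.T * n) @ W               # W^T diag(n) W, a k x k matrix

    # Rank-one part (W^T n)^{x3}, Eq. (32).
    T = np.einsum('a,b,c->abc', Wn, Wn, Wn)

    # Subtract the corrections for the index patterns i=j, j=l and i=l.
    T -= np.einsum('ab,c->abc', WAW, Wn)
    T -= np.einsum('a,bc->abc', Wn, WAW)
    T -= np.einsum('ac,b->abc', WAW, Wn)

    # Add the compensation 2 * sum_i n_i w_i x w_i x w_i for i=j=l
    # (done per document here for clarity; the text averages 2 n_i / (m(m-1)(m-2))
    # over documents first so that the O(V k^3) contraction is performed only once).
    T += 2 * np.einsum('i,ia,ib,ic->abc', n, W, W, W)
    return scale * T
```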

Appendix C. Discussions

Appendix C.1 Extension to other topic recovery algorithms

One important advantage of our proposed inference algorithm is its flexibility: the algorithm can be combined with many other LDA inference algorithms to infer supervised LDA model parameters. More specifically, given access to any algorithm that recovers the topic distribution matrix $O$ and the prior parameter $\alpha$ from i.i.d. sampled documents, an inference algorithm for the supervised LDA model can be immediately obtained, as shown in Algorithm 2.

Algorithm 2: sLDA parameter recovery based on an existing LDA inference algorithm A. Input parameter: $\alpha_0$.
1: Compute empirical moments and obtain $\widehat{M}_2$ and $\widehat{M}_y$. Set $\hat\eta = 0$.
2: Find $\widehat{W} \in \mathbb{R}^{n\times k}$ such that $\widehat{M}_2(\widehat{W},\widehat{W}) = I_k$.
3: Run algorithm A on the observed documents with parameter $\alpha_0$; obtain the estimated topic distribution matrix $\widehat{O} = (\hat\mu_1, \ldots, \hat\mu_k)$ and prior parameter $\hat\alpha$.
4: Compute $\hat v_i = \sqrt{\frac{\hat\alpha_i}{\alpha_0(\alpha_0+1)}}\,\widehat{W}^\top\hat\mu_i$ for each topic $i$.
5: Recover the linear classifier: $\hat\eta_i \leftarrow (\alpha_0+2)\,\hat v_i^\top\widehat{M}_y(\widehat{W},\widehat{W})\,\hat v_i$.
6: Output: $\hat\eta$, $\hat\alpha$ and $\{\hat\mu_i\}_{i=1}^k$.

The sample complexity of Algorithm 2 depends on the sample complexity of the LDA inference algorithm A. Although the LDA inference algorithm A is free to make any assumption about the topic distribution matrix $O$, we remark that linear independence of the topic distribution vectors $\mu_i$ is still required, because the power update trick (step 5 in Algorithm 2) is valid only when the vectors $W^\top\mu_i$ are orthogonal.
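A compact Python rendering of Algorithm 2 (our sketch; the base LDA algorithm A, the moment estimators and their signatures are assumptions, not code from the paper) is:

```python
import numpy as np

def slda_recover(docs, y, alpha0, k, lda_algorithm_A, estimate_M2, estimate_My):
    """Sketch of Algorithm 2: sLDA parameter recovery on top of an existing LDA algorithm A.

    docs, y          : observed documents and their regression responses
    lda_algorithm_A  : callable returning (O_hat, alpha_hat) with O_hat of shape V x k
    estimate_M2/My   : callables returning the empirical moments M2_hat and My_hat (both V x V)
    """
    # Step 1: empirical moments.
    M2_hat, My_hat = estimate_M2(docs), estimate_My(docs, y)

    # Step 2: whitening matrix W_hat with W_hat^T M2_hat W_hat = I_k (top-k eigenpairs).
    w, U = np.linalg.eigh(0.5 * (M2_hat + M2_hat.T))
    idx = np.argsort(w)[::-1][:k]
    W_hat = U[:, idx] / np.sqrt(w[idx])

    # Step 3: run the base LDA inference algorithm A.
    O_hat, alpha_hat = lda_algorithm_A(docs, alpha0)

    # Step 4: whitened (canonical) topic directions v_hat_i.
    scale = np.sqrt(alpha_hat / (alpha0 * (alpha0 + 1)))     # length-k vector
    V_hat = (W_hat.T @ O_hat) * scale                        # columns are v_hat_i

    # Step 5: power update for the linear classifier.
    eta_hat = (alpha0 + 2) * np.einsum('ki,kl,li->i', V_hat, W_hat.T @ My_hat @ W_hat, V_hat)

    # Step 6: return the recovered parameters.
    return eta_hat, alpha_hat, O_hat
```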

Appendix C.2 Going beyond SVD

The proposed methods are based on spectral decomposition of observable moments and are provably correct. However, a major drawback of these methods is the assumption that the topic distributions $\mu_i$ are linearly independent. Although standard LDA models assume such topic independence [BNJ03], in practice there is strong evidence that different topics might be related [BL07]. Therefore, it is important to consider possible extensions.

Recently, there has been much work on provable LDA inference algorithms that do not require the topic independence assumption [AGH+13, AGM12]. Instead of making the independence assumption, these methods assume a $p$-separability condition on the topic distribution matrix: there exists an anchor word for each topic such that the anchor word only appears in this specific topic, and its probability given that topic is at least $p$. Here we briefly describe an idea that might lead to an sLDA inference algorithm that does not assume topic independence. First, let $O \in \mathbb{R}^{V\times k}$ be the topic distribution matrix defined previously and let $H = (h_1, \ldots, h_N) \in \mathbb{R}^{k\times N}$ be the matrix of topic mixing vectors of the documents. Suppose $Q = \mathbb{E}[x_1\otimes x_2]$ is the word co-occurrence frequency matrix and $Q_y = \mathbb{E}[y\,x_1]$ is the co-occurrence frequency matrix between words and the regression variable. Both $Q$ and $Q_y$ can clearly be estimated from the training data, and we have
$$Q = ORO^\top, \qquad Q_y = OR\eta, \qquad (33)$$
where $R \triangleq \mathbb{E}[hh^\top]$. Assuming the $p$-separability condition holds and using the methods developed in [AGM12, AGH+13], we can consistently recover $O$ and $R$. Now note that $Q_y = (OR)\eta$. As a result, knowing both $OR$ and $Q_y$, we reduce the inference of $\eta$ to solving a system of linear equations. Future research could address (1) determining when the matrix $OR$ has full column rank so that the linear system can be solved, and (2) the sample complexity of such an inference algorithm.

Appendix C.3 From regression to classification

A natural extension of our result is to consider supervised LDA models for classification. The simplest such model is a logistic regression classification model, where we assume that the response variable $y_d$ for each document $d$ lies in $\{+1, -1\}$ and
$$\Pr[y_d = 1\,|\,h_d] = \frac{\exp(\eta^\top h_d)}{1 + \exp(\eta^\top h_d)},$$
where $\eta$ is a linear classifier. Although it appears simple, such an extension raises fundamental problems. A major obstacle is the fact that the conditional expectation $\mathbb{E}[y\,|\,h]$ is no longer linear in the topic mixing vector $h$. As a result, we cannot even evaluate $\mathbb{E}[y]$ (or higher-order tensors such as $\mathbb{E}[y\,x_1\otimes x_2]$) in closed form. Furthermore, even if we had accurate estimates of the above moments, it would still be difficult to infer $\eta$ from them because of the non-linearity. To our knowledge, the classification problem remains unsolved for spectral decomposition methods, and it is unclear whether a provably correct algorithm exists in this scenario.

References

[AFH+12] Animashree Anandkumar, Dean Foster, Daniel Hsu, Sham Kakade, and Yi-Kai Liu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. arXiv:1204.6703, 2012.

[AGH+12] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. arXiv:1210.7559, 2012.

[AGH+13] Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, and Ankur Moitra. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.

[AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models - going beyond SVD. In FOCS, 2012.

[BL07] David Blei and John Lafferty. A correlated topic model of science. Annals of Applied Statistics, 1(1):17-35, 2007.

[BNJ03] David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

[CL13] Arun Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In ICML, 2013.

[SG07] M. Steyvers and T. Griffiths. Probabilistic topic models. In Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2007.
