Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints


1 Identifiability and Learning of Topic Models: Tensor Decompositions under Structural Constraints Anima Anandkumar U.C. Irvine Joint work with Daniel Hsu, Majid Janzamin, Adel Javanmard, and Sham Kakade.

2 Latent Variable Modeling Goal: Discover hidden effects from observed measurements Topic Models Observations: words. Hidden: topics. [Figure: excerpts of New York Times articles (Syria refugee crisis, Hurricane Sandy aftermath, tax negotiations) illustrating documents as mixtures of hidden topics] Modeling communities in social networks, modeling gene regulation...

3 Challenges in Learning Topic Models Learning Topic Models Using Word Observations Challenges in Identifiability When can topics be identified? Conditions on the model parameters, e.g. on the topic-word matrix Φ and on the distribution of topic proportions h? Does identifiability also lead to tractable algorithms?

4 Challenges in Learning Topic Models Learning Topic Models Using Word Observations Challenges in Identifiability When can topics be identified? Conditions on the model parameters, e.g. on the topic-word matrix Φ and on the distribution of topic proportions h? Does identifiability also lead to tractable algorithms? Challenges in Design of Learning Algorithms Maximum likelihood learning of topic models is NP-hard (Arora et al.). In practice, heuristics such as Gibbs sampling, variational Bayes, etc. Guaranteed learning with minimal assumptions? Efficient methods? Low sample and computational complexities?

5 Challenges in Learning Topic Models Learning Topic Models Using Word Observations Challenges in Identifiability When can topics be identified? Conditions on the model parameters, e.g. on the topic-word matrix Φ and on the distribution of topic proportions h? Does identifiability also lead to tractable algorithms? Challenges in Design of Learning Algorithms Maximum likelihood learning of topic models is NP-hard (Arora et al.). In practice, heuristics such as Gibbs sampling, variational Bayes, etc. Guaranteed learning with minimal assumptions? Efficient methods? Low sample and computational complexities? Moment-based approach: learning using low-order observed moments

6 Probabilistic Topic Models Useful abstraction for automatic categorization of documents Observed: words. Hidden: topics. Bag of words: order of words does not matter Graphical model representation l words in a document x_1,...,x_l. h: proportions of topics in a document. Word x_i generated from topic y_i. Exchangeability: x_1, x_2, ... conditionally independent given h. Φ(i,j) := P[x_m = i | y_m = j]: topic-word matrix. [Graphical model: topic mixture h → topics y_1,...,y_5 → words x_1,...,x_5, each word drawn via Φ]

7 Formulation as Linear Models Distribution of the topic proportions vector h If there are k topics, distribution over the simplex ∆^{k-1} := {h ∈ R^k : h_i ∈ [0,1], Σ_i h_i = 1}. Distribution of the words x_1, x_2,... Order the n words in the vocabulary. If x_1 is the j-th word, assign x_1 = e_j ∈ R^n. Distribution of each x_i: supported on the vertices of ∆^{n-1}. Properties Linear Model: E[x_i | h] = Φh. Multiview model: h is fixed and multiple words (x_i) are generated.
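
As a quick illustration (not from the slides), the Python sketch below samples words from this exchangeable model with made-up sizes n and k, a random column-stochastic Φ, and a Dirichlet draw for h, and checks the linear-model property E[x_i | h] = Φh empirically.

```python
# Illustrative sketch of the exchangeable topic model as a linear multi-view model.
# Sizes n, k, the Dirichlet parameters, and the sample count are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 5                                      # vocabulary size, number of topics

# Column-stochastic topic-word matrix: Phi[i, j] = P[word = i | topic = j].
Phi = rng.dirichlet(np.ones(n), size=k).T         # shape (n, k)

# Topic proportions h on the simplex (a Dirichlet draw is one possible choice).
h = rng.dirichlet(0.5 * np.ones(k))

def sample_word():
    """Draw one word: topic y ~ h, then word index j ~ Phi[:, y]; return one-hot e_j."""
    y = rng.choice(k, p=h)
    j = rng.choice(n, p=Phi[:, y])
    e = np.zeros(n)
    e[j] = 1.0
    return e

# Multi-view property: with h fixed, words are conditionally i.i.d. and E[x_i | h] = Phi h.
X = np.stack([sample_word() for _ in range(20000)])
print(np.abs(X.mean(axis=0) - Phi @ h).max())     # close to 0 for a large sample
```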

8 Geometric Picture for Topic Models Topic proportions vector (h) Document

9 Geometric Picture for Topic Models Single topic (h)

10 Geometric Picture for Topic Models Topic proportions vector (h)

11 Geometric Picture for Topic Models Topic proportions vector (h) [Figure: Φ maps the topic simplex to the word simplex; sampled words x_1, x_2, x_3] Word generation (x_1, x_2, ...)

12 Geometric Picture for Topic Models Topic proportions vector (h) [Figure: Φ maps the topic simplex to the word simplex; sampled words x_1, x_2, x_3] Word generation (x_1, x_2, ...) Moment-based estimation: co-occurrences of words in documents

13 Outline 1 Introduction 2 Form of Moments 3 Matrix Case: Learning using Pairwise Moments Identifiability and Learning of Topic-Word Matrix Learning Latent Space Parameters of the Topic Model 4 Tensor Case: Learning From Higher Order Moments Overcomplete Representations 5 Conclusion

14 Recap of CP Decomposition Recall form of moments for single topic/Dirichlet model. E[x_i | h] = Φh. λ := E[h]. Learn topic-word matrix Φ, vector λ

15 Recap of CP Decomposition Recall form of moments for single topic/Dirichlet model. E[x_i | h] = Φh. λ := E[h]. Learn topic-word matrix Φ, vector λ Pairs Matrix M_2 M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Φ E[hh^T] Φ^T = Σ_{r=1}^k λ_r φ_r ⊗ φ_r

16 Recap of CP Decomposition Recall form of moments for single topic/Dirichlet model. E[x_i | h] = Φh. λ := E[h]. Learn topic-word matrix Φ, vector λ Pairs Matrix M_2 M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Φ E[hh^T] Φ^T = Σ_{r=1}^k λ_r φ_r ⊗ φ_r Similarly, Triples Tensor M_3 M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = Σ_r λ_r φ_r ⊗ φ_r ⊗ φ_r

17 Recap of CP Decomposition Recall form of moments for single topic/Dirichlet model. E[x_i | h] = Φh. λ := E[h]. Learn topic-word matrix Φ, vector λ Pairs Matrix M_2 M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Φ E[hh^T] Φ^T = Σ_{r=1}^k λ_r φ_r ⊗ φ_r Similarly, Triples Tensor M_3 M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = Σ_r λ_r φ_r ⊗ φ_r ⊗ φ_r Matrix and Tensor Forms: φ_r := r-th column of Φ. M_2 = Σ_{r=1}^k λ_r φ_r ⊗ φ_r. M_3 = Σ_{r=1}^k λ_r φ_r ⊗ φ_r ⊗ φ_r
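
These CP-form moments are easy to reproduce numerically; the sketch below (illustrative sizes, random Φ and λ) builds M_2 and M_3 directly from the factors.

```python
# Illustrative sketch: build M2 = sum_r lambda_r phi_r phi_r^T and
# M3 = sum_r lambda_r phi_r (x) phi_r (x) phi_r from random factors.
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 4
Phi = rng.dirichlet(np.ones(n), size=k).T        # columns phi_r (topic-word matrix)
lam = rng.dirichlet(np.ones(k))                  # lambda = E[h] for the single-topic model

M2 = np.einsum('r,ir,jr->ij', lam, Phi, Phi)            # pairs matrix, shape (n, n)
M3 = np.einsum('r,ir,jr,kr->ijk', lam, Phi, Phi, Phi)   # triples tensor, shape (n, n, n)

# Sanity check: M2 equals Phi diag(lambda) Phi^T.
print(np.allclose(M2, Phi @ np.diag(lam) @ Phi.T))
```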

18 Multi-linear Transformation For a tensor T, define (for matrices V_i of appropriate dimensions) [T(V_1, V_2, V_3)]_{i_1,i_2,i_3} := Σ_{j_1,j_2,j_3} T_{j_1,j_2,j_3} ∏_{m ∈ [3]} V_m(j_m, i_m). For a matrix M_2, M_2(V_1, V_2) := V_1^T M_2 V_2. For T = Σ_{r=1}^k λ_r φ_r ⊗ φ_r ⊗ φ_r: T(W, W, W) = Σ_{r ∈ [k]} λ_r (W^T φ_r)^{⊗3}, T(I, v, v) = Σ_{r ∈ [k]} λ_r ⟨v, φ_r⟩^2 φ_r, T(I, I, v) = Σ_{r ∈ [k]} λ_r ⟨v, φ_r⟩ φ_r ⊗ φ_r.
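
A short sketch of this multilinear map via einsum, with illustrative sizes, verifying the T(I, v, v) identity above:

```python
# Sketch of the multilinear map T(V1, V2, V3) with the convention
# [T(V1,V2,V3)]_{i1,i2,i3} = sum_{j1,j2,j3} T_{j1,j2,j3} V1(j1,i1) V2(j2,i2) V3(j3,i3).
import numpy as np

def multilinear(T, V1, V2, V3):
    return np.einsum('abc,ai,bj,ck->ijk', T, V1, V2, V3)

rng = np.random.default_rng(2)
n, k = 10, 3
Phi = rng.standard_normal((n, k))
lam = rng.random(k)
T = np.einsum('r,ir,jr,kr->ijk', lam, Phi, Phi, Phi)   # T = sum_r lam_r phi_r^{(x)3}

v = rng.standard_normal(n)
I = np.eye(n)

# Check T(I, v, v) = sum_r lam_r <v, phi_r>^2 phi_r (contracted down to a vector here).
lhs = multilinear(T, I, v[:, None], v[:, None])[:, 0, 0]
rhs = sum(lam[r] * (v @ Phi[:, r]) ** 2 * Phi[:, r] for r in range(k))
print(np.allclose(lhs, rhs))
```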

19 Form of Moments for a General Topic Model E[x_i | h] = Φh. Learn Φ and the distribution of h. Form of moments? Pairs Matrix M_2 M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Φ E[hh^T] Φ^T

20 Form of Moments for a General Topic Model E[x_i | h] = Φh. Learn Φ and the distribution of h. Form of moments? Pairs Matrix M_2 M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Φ E[hh^T] Φ^T Similarly, Triples Tensor M_3 M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = Σ_{i,j,k} E[h ⊗ h ⊗ h]_{i,j,k} φ_i ⊗ φ_j ⊗ φ_k = E[h^{⊗3}](Φ, Φ, Φ)

21 Form of Moments for a General Topic Model E[x_i | h] = Φh. Learn Φ and the distribution of h. Form of moments? Pairs Matrix M_2 M_2 := E[x_1 ⊗ x_2] = E[E[x_1 ⊗ x_2 | h]] = Φ E[hh^T] Φ^T Similarly, Triples Tensor M_3 M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = Σ_{i,j,k} E[h ⊗ h ⊗ h]_{i,j,k} φ_i ⊗ φ_j ⊗ φ_k = E[h^{⊗3}](Φ, Φ, Φ) Tucker Tensor Decomposition Find the decomposition M_3 = E[h^{⊗3}](Φ, Φ, Φ). Key difference from CP: E[h^{⊗3}] is NOT a diagonal tensor. Many more parameters to estimate.
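
The sketch below illustrates the Tucker form with an arbitrary correlated distribution for h (a logistic-normal-style draw chosen only for illustration, not a model from the slides): the estimated core E[h ⊗ h ⊗ h] has substantial off-diagonal mass, which is exactly what separates this case from CP.

```python
# Illustrative sketch of the Tucker-form triples moment for a general (correlated) h.
import numpy as np

rng = np.random.default_rng(3)
n, k, num_docs = 15, 4, 100000
Phi = rng.dirichlet(np.ones(n), size=k).T

# Correlated topic proportions via a logistic-normal-style draw (arbitrary choice).
A = rng.standard_normal((k, k))
Z = rng.standard_normal((num_docs, k)) @ A.T
H = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)       # rows lie on the simplex

core = np.einsum('di,dj,dk->ijk', H, H, H) / num_docs       # estimate of E[h (x) h (x) h]
M3 = np.einsum('ijk,ai,bj,ck->abc', core, Phi, Phi, Phi)    # = E[h^{(x)3}](Phi, Phi, Phi)

# Unlike CP, the core is not diagonal: compare off-diagonal and diagonal magnitudes.
off_diag = core.copy()
for r in range(k):
    off_diag[r, r, r] = 0.0
print(np.abs(off_diag).max(), max(core[r, r, r] for r in range(k)))
```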

22 Guaranteed Learning of Topic Models Two Learning approaches CP Tensor decomposition: Parametric topic distributions (constraints on h) but general topic-word matrix Φ Tucker Tensor decomposition: Constrain topic-word matrix Φ but general (non-degenerate) distributions on h Topic h Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ x 1 x 2 x 3 x 4 x 5 Words

23 Outline 1 Introduction 2 Form of Moments 3 Matrix Case: Learning using Pairwise Moments Identifiability and Learning of Topic-Word Matrix Learning Latent Space Parameters of the Topic Model 4 Tensor Case: Learning From Higher Order Moments Overcomplete Representations 5 Conclusion

24 Outline 1 Introduction 2 Form of Moments 3 Matrix Case: Learning using Pairwise Moments Identifiability and Learning of Topic-Word Matrix Learning Latent Space Parameters of the Topic Model 4 Tensor Case: Learning From Higher Order Moments Overcomplete Representations 5 Conclusion

25 Pairwise moments for learning Learning using pairwise moments: minimal information. Exchangeable Topic Model Topic Mixture Topics h y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ x 1 x 2 x 3 x 4 x 5 Words

26 Pairwise moments for learning Learning using pairwise moments: minimal information. Exchangeable Topic Model Topic Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ h x 1 x 2 x 3 x 4 x 5 Words So far.. Parametric h: Dirichlet, single topic, independent components,... No restrictions on Φ (other than non-degeneracy). Learning using third order moment through tensor decompositions

27 Pairwise moments for learning Learning using pairwise moments: minimal information. Exchangeable Topic Model Topic Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ h x 1 x 2 x 3 x 4 x 5 Words So far.. Parametric h: Dirichlet, single topic, independent components,... No restrictions on Φ (other than non-degeneracy). Learning using third order moment through tensor decompositions What if.. Allow for general h: model arbitrary topic correlations Constrain topic-word matrix Φ:

28 Pairwise moments for learning Learning using pairwise moments: minimal information. Exchangeable Topic Model Topic Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ h x 1 x 2 x 3 x 4 x 5 Words So far.. Parametric h: Dirichlet, single topic, independent components,... No restrictions on Φ (other than non-degeneracy). Learning using third order moment through tensor decompositions What if.. Allow for general h: model arbitrary topic correlations Constrain topic-word matrix Φ: Sparsity constraints

29 Pairwise moments for learning Learning using pairwise moments: minimal information. Exchangeable Topic Model Topic-word matrix Topic Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ h x 1 x 2 x 3 x 4 x 5 Words Φ h 1 h 2 x(1) x(2) h k x(n) So far.. Parametric h: Dirichlet, single topic, independent components,... No restrictions on Φ (other than non-degeneracy). Learning using third order moment through tensor decompositions What if.. Allow for general h: model arbitrary topic correlations Constrain topic-word matrix Φ: Sparsity constraints

30 Some Intuitions.. Learning using second-order moments Linear model: E[x_i | h] = Φh and E[x_1 ⊗ x_2] = Φ E[hh^T] Φ^T Learning: recover Φ from Φ E[hh^T] Φ^T. Ill-posed without further restrictions

31 Some Intuitions.. Learning using second-order moments Linear model: E[x_i | h] = Φh and E[x_1 ⊗ x_2] = Φ E[hh^T] Φ^T Learning: recover Φ from Φ E[hh^T] Φ^T. Ill-posed without further restrictions When h is not degenerate: recover Φ from Col(Φ) No other restrictions on h: arbitrary dependencies

32 Some Intuitions.. Learning using second-order moments Linear model: E[x_i | h] = Φh and E[x_1 ⊗ x_2] = Φ E[hh^T] Φ^T Learning: recover Φ from Φ E[hh^T] Φ^T. Ill-posed without further restrictions When h is not degenerate: recover Φ from Col(Φ) No other restrictions on h: arbitrary dependencies Sparsity constraints on topic-word matrix Φ Main constraint: columns of Φ are the sparsest vectors in Col(Φ)

33 Sufficient Conditions for Identifiability Columns of Φ are the sparsest vectors in Col(Φ). Sufficient conditions? [Figure: bipartite graph between topics h_1,...,h_k and words x(1),...,x(n), showing a topic set S and its word neighborhood N(S)]

34 Sufficient Conditions for Identifiability Columns of Φ are the sparsest vectors in Col(Φ). Sufficient conditions? Structural Condition: (Additive) Graph Expansion |N(S)| > |S| + d_max, for all S ⊆ [k] with |S| ≥ 2 Parametric Conditions: Generic Parameters ‖Φv‖_0 ≥ |N_Φ(supp(v))| − |supp(v)| A. Anandkumar, D. Hsu, A. Javanmard, and S. M. Kakade. Learning Bayesian Networks with Latent Variables. In Proc. of Intl. Conf. on Machine Learning, June 2013.
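
As a rough illustration (not part of the slides), the brute-force check below tests the additive expansion condition on the support of a small random Φ; it starts at |S| = 2, since a single column has at most d_max nonzeros and cannot satisfy the additive bound.

```python
# Brute-force illustration (exponential in k) of the additive expansion condition
# |N(S)| > |S| + d_max on the support graph of Phi; only practical for tiny k.
import numpy as np
from itertools import combinations

def satisfies_expansion(Phi, tol=0.0):
    """Check |N_Phi(S)| > |S| + d_max for every S with |S| >= 2, where the bipartite
    graph is the support of Phi and d_max is the maximum number of nonzeros per column."""
    support = np.abs(Phi) > tol
    n, k = support.shape
    d_max = support.sum(axis=0).max()
    for size in range(2, k + 1):
        for S in combinations(range(k), size):
            if support[:, list(S)].any(axis=1).sum() <= size + d_max:
                return False
    return True

# Example with a hypothetical sparse random support.
rng = np.random.default_rng(4)
Phi = (rng.random((60, 5)) < 0.25) * rng.random((60, 5))
print(satisfies_expansion(Phi))
```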

35 Brief Proof Sketch Structural Condition: (Additive) Graph Expansion |N(S)| > |S| + d_max, for all S ⊆ [k] with |S| ≥ 2 Parametric Conditions: Generic Parameters ‖Φv‖_0 ≥ |N_Φ(supp(v))| − |supp(v)| Structural and Parametric Conditions Imply: When |supp(v)| > 1, ‖Φv‖_0 ≥ |N_Φ(supp(v))| − |supp(v)| > d_max, while each column of Φ has at most d_max nonzeros. Thus |supp(v)| = 1 for Φv to be one of the k sparsest vectors in Col(Φ)

36 Brief Proof Sketch Structural Condition: (Additive) Graph Expansion |N(S)| > |S| + d_max, for all S ⊆ [k] with |S| ≥ 2 Parametric Conditions: Generic Parameters ‖Φv‖_0 ≥ |N_Φ(supp(v))| − |supp(v)| Structural and Parametric Conditions Imply: When |supp(v)| > 1, ‖Φv‖_0 ≥ |N_Φ(supp(v))| − |supp(v)| > d_max, while each column of Φ has at most d_max nonzeros. Thus |supp(v)| = 1 for Φv to be one of the k sparsest vectors in Col(Φ) Claim: Parametric conditions are satisfied for generic parameters

37 Tractable Learning Algorithm Learning Task Recover topic-word matrix Φ from M_2 = Φ E[hh^T] Φ^T. Exhaustive search: min_{z ≠ 0} ‖Φz‖_0 Convex relaxation: min_z ‖Φz‖_1 s.t. b^T z = 1, where b is a row of Φ. Change of variables: min_w ‖M_2^{1/2} w‖_1 s.t. e_i^T M_2^{1/2} w = 1. Under reasonable conditions, the above program exactly recovers Φ
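
A sketch of the change-of-variables ℓ_1 program, assuming cvxpy as a generic convex solver (the solver choice and the anchor-row scan are not specified in the slides); under the stated conditions its minimizers are proportional to columns of Φ.

```python
# Sketch of the l1 relaxation   min_w ||M2^{1/2} w||_1   s.t.  e_i^T M2^{1/2} w = 1.
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm

def recover_direction(M2, anchor_row):
    """Solve the l1 program for one anchor row; returns the candidate vector M2^{1/2} w,
    which under the expansion/genericity conditions aligns with a column of Phi."""
    B = np.real(sqrtm(M2))                     # any symmetric square root of M2
    n = B.shape[0]
    w = cp.Variable(n)
    problem = cp.Problem(cp.Minimize(cp.norm1(B @ w)),
                         [B[anchor_row, :] @ w == 1])
    problem.solve()
    return B @ w.value

# Usage sketch (M2_hat would be an empirical pairs matrix, not computed here):
# candidates = [recover_direction(M2_hat, i) for i in range(M2_hat.shape[0])]
```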

38 Outline 1 Introduction 2 Form of Moments 3 Matrix Case: Learning using Pairwise Moments Identifiability and Learning of Topic-Word Matrix Learning Latent Space Parameters of the Topic Model 4 Tensor Case: Learning From Higher Order Moments Overcomplete Representations 5 Conclusion

39 Latent General Topic Models So far: recover topic-word matrix Φ from Φ E[hh^T] Φ^T. Learning the topic proportion distribution E[hh^T] is not enough to recover general distributions Need higher-order moments to learn the distribution of h Any models where low-order moments suffice? e.g. Dirichlet/single topic require only third-order moments. What about other distributions? Are there other topic distributions which can be learned efficiently?

40 Learning Latent Bayesian Networks BN: Markov relationships on a DAG. Pa_i: parents of node i. P(h) = ∏_i P(h_i | h_{Pa_i}) Linear Bayesian Network: h_j = Σ_{i ∈ Pa_j} λ_{ji} h_i + η_j, i.e., h = Λh + η E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, and the η_i are uncorrelated

41 Learning Latent Bayesian Networks BN: Markov relationships on a DAG. Pa_i: parents of node i. P(h) = ∏_i P(h_i | h_{Pa_i}) Linear Bayesian Network: h_j = Σ_{i ∈ Pa_j} λ_{ji} h_i + η_j, i.e., h = Λh + η E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, and the η_i are uncorrelated

42 Learning Latent Bayesian Networks BN: Markov relationships on a DAG. Pa_i: parents of node i. P(h) = ∏_i P(h_i | h_{Pa_i}) Linear Bayesian Network: h_j = Σ_{i ∈ Pa_j} λ_{ji} h_i + η_j, i.e., h = Λh + η E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, and the η_i are uncorrelated [Figure: applying (I − Λ)^{-1} turns the correlated topics h_1,...,h_k into uncorrelated variables η_1,...,η_k]

43 Learning Latent Bayesian Networks BN: Markov relationships on a DAG. Pa_i: parents of node i. P(h) = ∏_i P(h_i | h_{Pa_i}) Linear Bayesian Network: h_j = Σ_{i ∈ Pa_j} λ_{ji} h_i + η_j, i.e., h = Λh + η E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, and the η_i are uncorrelated [Figure: equivalent network in which the uncorrelated η_1,...,η_k generate the words through Φ' = Φ(I − Λ)^{-1}]

44 Learning Latent Bayesian Networks BN: Markov relationships on a DAG. Pa_i: parents of node i. P(h) = ∏_i P(h_i | h_{Pa_i}) Linear Bayesian Network: h_j = Σ_{i ∈ Pa_j} λ_{ji} h_i + η_j, i.e., h = Λh + η E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, and the η_i are uncorrelated [Figure: sparse, structured Φ over the topics h vs. dense Φ' over the noise variables η]

45 Learning Latent Bayesian Networks BN: Markov relationships on a DAG. Pa_i: parents of node i. P(h) = ∏_i P(h_i | h_{Pa_i}) Linear Bayesian Network: h_j = Σ_{i ∈ Pa_j} λ_{ji} h_i + η_j, i.e., h = Λh + η E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, and the η_i are uncorrelated Φ: structured and sparse, while Φ' is dense h: correlated topics, while the η are uncorrelated
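
A small numerical sketch (sizes, sparsity levels, and the strictly lower-triangular DAG ordering are arbitrary choices) of the reparameterization E[x | η] = Φ' η with Φ' = Φ(I − Λ)^{-1}, showing that the sparse, structured Φ turns into a dense Φ':

```python
# Illustrative sketch: Phi sparse and structured, Phi' = Phi (I - Lambda)^{-1} dense.
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 6

Phi = rng.standard_normal((n, k)) * (rng.random((n, k)) < 0.2)        # sparse topic-word matrix
Lambda = np.tril(rng.standard_normal((k, k)) * (rng.random((k, k)) < 0.5), k=-1)  # DAG weights

Phi_prime = Phi @ np.linalg.inv(np.eye(k) - Lambda)                    # E[x | eta] = Phi' eta

# Fraction of nonzero entries: Phi is sparse, Phi' is (generally) dense.
print((Phi != 0).mean(), (np.abs(Phi_prime) > 1e-12).mean())
```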

46 Learning Latent Bayesian Networks E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, E[η] = λ E[x_1 ⊗ x_2 ⊗ x_3] = E[η^{⊗3}](Φ', Φ', Φ') = Σ_i λ_i (φ'_i)^{⊗3}

47 Learning Latent Bayesian Networks E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, E[η] = λ E[x_1 ⊗ x_2 ⊗ x_3] = E[η^{⊗3}](Φ', Φ', Φ') = Σ_i λ_i (φ'_i)^{⊗3} Solving CP Decomposition through the Tensor Power Method Recall the η_i are uncorrelated: E[ηη^T] is diagonal. Reduction to CP decomposition: can be efficiently solved via the tensor power method

48 Learning Latent Bayesian Networks E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, E[η] = λ E[x_1 ⊗ x_2 ⊗ x_3] = E[η^{⊗3}](Φ', Φ', Φ') = Σ_i λ_i (φ'_i)^{⊗3} Solving CP Decomposition through the Tensor Power Method Recall the η_i are uncorrelated: E[ηη^T] is diagonal. Reduction to CP decomposition: can be efficiently solved via the tensor power method Sparse Tucker Decomposition: Unmixing via Convex Optimization Unmix Φ from Φ' = Φ(I − Λ)^{-1} through ℓ_1 optimization.

49 Learning Latent Bayesian Networks E[x_i | η] = Φ(I − Λ)^{-1} η = Φ' η, E[η] = λ E[x_1 ⊗ x_2 ⊗ x_3] = E[η^{⊗3}](Φ', Φ', Φ') = Σ_i λ_i (φ'_i)^{⊗3} Solving CP Decomposition through the Tensor Power Method Recall the η_i are uncorrelated: E[ηη^T] is diagonal. Reduction to CP decomposition: can be efficiently solved via the tensor power method Sparse Tucker Decomposition: Unmixing via Convex Optimization Unmix Φ from Φ' = Φ(I − Λ)^{-1} through ℓ_1 optimization. Learning both structure and parameters of Φ and the distribution of h Combine non-convex and convex methods for learning!
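
A minimal sketch of the tensor power method for the CP step, assuming the tensor has already been reduced to an orthogonal decomposition (in practice this is reached by whitening with the pairs matrix, a step omitted here):

```python
# Tensor power method with deflation for T = sum_r lam_r a_r^{(x)3},
# assuming orthonormal components a_r (the whitened setting).
import numpy as np

def tensor_power_method(T, num_components, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = T.shape[0]
    T = T.copy()
    weights, vectors = [], []
    for _ in range(num_components):
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        for _ in range(num_iters):
            v = np.einsum('ijk,j,k->i', T, v, v)        # power update v <- T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)      # eigenvalue T(v, v, v)
        weights.append(lam)
        vectors.append(v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate the recovered component
    return np.array(weights), np.stack(vectors, axis=1)

# Quick check on a synthetic orthogonally decomposable tensor.
rng = np.random.default_rng(6)
A, _ = np.linalg.qr(rng.standard_normal((8, 3)))         # orthonormal columns a_r
lam_true = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,kr->ijk', lam_true, A, A, A)
lam_hat, A_hat = tensor_power_method(T, 3)
print(np.sort(lam_hat)[::-1])                            # approximately [3, 2, 1]
```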

50 Outline 1 Introduction 2 Form of Moments 3 Matrix Case: Learning using Pairwise Moments Identifiability and Learning of Topic-Word Matrix Learning Latent Space Parameters of the Topic Model 4 Tensor Case: Learning From Higher Order Moments Overcomplete Representations 5 Conclusion

51 Outline 1 Introduction 2 Form of Moments 3 Matrix Case: Learning using Pairwise Moments Identifiability and Learning of Topic-Word Matrix Learning Latent Space Parameters of the Topic Model 4 Tensor Case: Learning From Higher Order Moments Overcomplete Representations 5 Conclusion

52 Extension to learning overcomplete representations So far.. Pairwise moments for learning structured topic-word matrices Third order moments for learning latent Bayesian network models Number of topics k, n is vocabulary size and k < n.

53 Extension to learning overcomplete representations So far.. Pairwise moments for learning structured topic-word matrices Third-order moments for learning latent Bayesian network models Number of topics k, vocabulary size n, and k < n. [Figure: undercomplete (k < n) vs. overcomplete (k > n) bipartite topic-word graphs] What about overcomplete models: k > n? Do higher-order moments help?

54 Learning Overcomplete Representations Why Overcomplete Representations? Flexible modeling, robust to noise Huge gains in many applications, e.g. speech and computer vision.

55 Learning Overcomplete Representations Why Overcomplete Representations? Flexible modeling, robust to noise Huge gains in many applications, e.g. speech and computer vision. Recall Tucker Form of Moments for Topic Models M_2 := E[x_1 ⊗ x_2] = E[h^{⊗2}](Φ, Φ) = Φ E[hh^T] Φ^T M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = E[h^{⊗3}](Φ, Φ, Φ) k > n: Tucker decomposition not unique: model non-identifiable.

56 Learning Overcomplete Representations Why Overcomplete Representations? Flexible modeling, robust to noise Huge gains in many applications, e.g. speech and computer vision. Recall Tucker Form of Moments for Topic Models M_2 := E[x_1 ⊗ x_2] = E[h^{⊗2}](Φ, Φ) = Φ E[hh^T] Φ^T M_3 := E[x_1 ⊗ x_2 ⊗ x_3] = E[h^{⊗3}](Φ, Φ, Φ) k > n: Tucker decomposition not unique: model non-identifiable. Identifiability of Overcomplete Models Possible under the notion of topic persistence Includes the single topic model as a special case.

57 Persistent Topic Models Bag of Words Model Topic h Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ x 1 x 2 x 3 x 4 x 5 Words A. Anandkumar, D. Hsu, M. Janzamin, and S. M. Kakade. When are Overcomplete Representations Identifiable? Uniqueness of Tensor Decompositions Under Expansion Constraints, Preprint, June 2013.

58 Persistent Topic Models Bag of Words Model Persistent Topic Model Topic h Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ x 1 x 2 x 3 x 4 x 5 Words Topic h Mixture Topics Φ Φ Φ Φ y 1 y 2 y 3 Φ x 1 x 2 x 3 x 4 x 5 Words A. Anandkumar, D. Hsu, M. Janzamin, and S. M. Kakade. When are Overcomplete Representations Identifiable? Uniqueness of Tensor Decompositions Under Expansion Constraints, Preprint, June 2013.

59 Persistent Topic Models Bag of Words Model Persistent Topic Model Topic h Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ x 1 x 2 x 3 x 4 x 5 Words Topic h Mixture Topics Φ Φ Φ Φ y 1 y 2 y 3 Φ x 1 x 2 x 3 x 4 x 5 Words Single-topic model is a special case. Persistence: incorporates locality or order of words. A. Anandkumar, D. Hsu, M. Janzamin, and S. M. Kakade. When are Overcomplete Representations Identifiable? Uniqueness of Tensor Decompositions Under Expansion Constraints, Preprint, June 2013.

60 Persistent Topic Models Bag of Words Model Persistent Topic Model Topic h Mixture Topics y 1 y 2 y 3 y 4 y 5 Φ Φ Φ Φ Φ x 1 x 2 x 3 x 4 x 5 Words Topic h Mixture Topics Φ Φ Φ Φ y 1 y 2 y 3 Φ x 1 x 2 x 3 x 4 x 5 Words Single-topic model is a special case. Persistence: incorporates locality or order of words. Identifiability conditions for overcomplete models? A. Anandkumar, D. Hsu, M. Janzamin, and S. M. Kakade. When are Overcomplete Representations Identifiable? Uniqueness of Tensor Decompositions Under Expansion Constraints, Preprint, June 2013.

61 Identifiability of Overcomplete Models Recall Tucker Form of Moments for Bag-of-Words Model Tensor form: E[x_1 ⊗ x_2 ⊗ x_3 ⊗ x_4] = E[h^{⊗4}](Φ, Φ, Φ, Φ) Matricized form: E[(x_1 ⊗ x_2)(x_3 ⊗ x_4)^T] = (Φ ⊗ Φ) E[(h ⊗ h)(h ⊗ h)^T] (Φ ⊗ Φ)^T

62 Identifiability of Overcomplete Models Recall Tucker Form of Moments for Bag-of-Words Model Tensor form: E[x_1 ⊗ x_2 ⊗ x_3 ⊗ x_4] = E[h^{⊗4}](Φ, Φ, Φ, Φ) Matricized form: E[(x_1 ⊗ x_2)(x_3 ⊗ x_4)^T] = (Φ ⊗ Φ) E[(h ⊗ h)(h ⊗ h)^T] (Φ ⊗ Φ)^T For Persistent Topic Model Tensor form: E[x_1 ⊗ x_2 ⊗ x_3 ⊗ x_4] = E[hh^T](Φ ⊙ Φ, Φ ⊙ Φ) Matricized form: E[(x_1 ⊗ x_2)(x_3 ⊗ x_4)^T] = (Φ ⊙ Φ) E[hh^T] (Φ ⊙ Φ)^T (⊙ denotes the Khatri-Rao product, defined below)

63 Identifiability of Overcomplete Models Recall Tucker Form of Moments for Bag-of-Words Model Tensor form: E[x_1 ⊗ x_2 ⊗ x_3 ⊗ x_4] = E[h^{⊗4}](Φ, Φ, Φ, Φ) Matricized form: E[(x_1 ⊗ x_2)(x_3 ⊗ x_4)^T] = (Φ ⊗ Φ) E[(h ⊗ h)(h ⊗ h)^T] (Φ ⊗ Φ)^T For Persistent Topic Model Tensor form: E[x_1 ⊗ x_2 ⊗ x_3 ⊗ x_4] = E[hh^T](Φ ⊙ Φ, Φ ⊙ Φ) Matricized form: E[(x_1 ⊗ x_2)(x_3 ⊗ x_4)^T] = (Φ ⊙ Φ) E[hh^T] (Φ ⊙ Φ)^T Kronecker vs. Khatri-Rao Products Φ: topic-word matrix, is n × k. (Φ ⊗ Φ): Kronecker product, is an n² × k² matrix. (Φ ⊙ Φ): Khatri-Rao product, is an n² × k matrix.
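
A quick numerical sketch of the dimension count behind this distinction (sizes are illustrative): in the overcomplete regime the Kronecker product Φ ⊗ Φ cannot have full column rank, while the Khatri-Rao product Φ ⊙ Φ generically does.

```python
# Kronecker vs. Khatri-Rao in the overcomplete regime k > n.
import numpy as np

rng = np.random.default_rng(7)
n, k = 4, 6                                              # overcomplete: k > n
Phi = rng.standard_normal((n, k))

kron = np.kron(Phi, Phi)                                                            # (n^2, k^2)
khatri_rao = np.stack([np.kron(Phi[:, r], Phi[:, r]) for r in range(k)], axis=1)    # (n^2, k)

print(kron.shape, np.linalg.matrix_rank(kron))              # rank at most n^2, far below k^2
print(khatri_rao.shape, np.linalg.matrix_rank(khatri_rao))  # generically full column rank k
```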

64 Some Intuitions Bag-of-words Model: (Φ ⊗ Φ) E[(h ⊗ h)(h ⊗ h)^T] (Φ ⊗ Φ)^T Persistent Model: (Φ ⊙ Φ) E[hh^T] (Φ ⊙ Φ)^T Effective Topic-Word Matrix Given Fourth-Order Moments: Bag of Words Model: Kronecker product Φ ⊗ Φ, of size n² × k²: not identifiable. Persistent Model: Khatri-Rao product Φ ⊙ Φ, of size n² × k: identifiable.

65 Outline 1 Introduction 2 Form of Moments 3 Matrix Case: Learning using Pairwise Moments Identifiability and Learning of Topic-Word Matrix Learning Latent Space Parameters of the Topic Model 4 Tensor Case: Learning From Higher Order Moments Overcomplete Representations 5 Conclusion

66 Conclusion Moment-based Estimation of Latent Variable Models Moments are easy to estimate. Low-order moments have good concentration properties

67 Conclusion Moment-based Estimation of Latent Variable Models Moments are easy to estimate. Low-order moments have good concentration properties Tensor Decomposition Methods Moment tensors have tractable forms for many models, e.g. topic models, HMMs, Gaussian mixtures, ICA. Efficient CP tensor decomposition through power iterations. Efficient structured Tucker decomposition through ℓ_1 optimization. Structured topic-word matrices: expansion conditions Can be extended to overcomplete representations

68 Conclusion Moment-based Estimation of Latent Variable Models Moments are easy to estimate. Low-order moments have good concentration properties Tensor Decomposition Methods Moment tensors have tractable forms for many models, e.g. topic models, HMMs, Gaussian mixtures, ICA. Efficient CP tensor decomposition through power iterations. Efficient structured Tucker decomposition through ℓ_1 optimization. Structured topic-word matrices: expansion conditions Can be extended to overcomplete representations Practical Considerations for Tensor Methods Not covered in detail in this tutorial. Matrix algebra and iterative methods. Scalable: parallel implementation on GPUs
