Learning Tractable Graphical Models: Latent Trees and Tree Mixtures


1 Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Anima Anandkumar U.C. Irvine Joint work with Furong Huang, U.N. Niranjan, Daniel Hsu and Sham Kakade.

2 High-Dimensional Graphical Modeling Modeling conditional independencies through graphs: X_u ⊥ X_v | X_S. Learning and inference are NP-hard. [Figure: example graph in which baseball-team variables (Red Sox, Yankees, Mets, Dodgers, Phillies) and soccer-team variables (Manchester United, Chelsea, Arsenal) are separated by the node S = Baseball.] Tractable Models: Tree Models. Efficient inference using belief propagation.

5 Walk-up: Learning Tree Models Data processing inequality for Markov chains: I(X_1; X_3) ≤ min{ I(X_1; X_2), I(X_2; X_3) }. Tree Structure Estimation (Chow and Liu, 1968) MLE: max-weight spanning tree with estimated mutual information weights. Pairwise statistics suffice. With n samples and p nodes, sample complexity: (log p)/n = O(1).
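Since pairwise statistics suffice, the Chow-Liu estimator can be sketched directly: estimate all pairwise mutual informations and return a maximum-weight spanning tree. A minimal Python/NumPy sketch for binary data follows; the function names and the plug-in estimator are illustrative choices, not from the talk.

import numpy as np
from itertools import combinations

def empirical_mutual_info(x, y):
    """Plug-in estimate of I(X;Y) for two binary sample vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_tree(samples):
    """Chow-Liu (1968): max-weight spanning tree with estimated MI edge weights.
    samples: (n, p) array of 0/1 observations; returns a list of tree edges."""
    p = samples.shape[1]
    weights = {(i, j): empirical_mutual_info(samples[:, i], samples[:, j])
               for i, j in combinations(range(p), 2)}
    parent = list(range(p))          # union-find forest for Kruskal's algorithm
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            edges.append((i, j))
    return edges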

7 Learning Tractable Graphical Models Tractable Models: Tree Models. Efficient inference using belief propagation; the MLE is easy to compute. But tree models are highly restrictive. Latent tree graphical models: tree models with hidden variables, where the number and location of the hidden variables are unknown.

8 Application: Hierarchical Topic Modeling Data: word co-occurrences. Graph: topic-word structure. [Figure: latent tree learned over the word vocabulary, with hidden topic nodes h1 through h18 grouping related words such as baseball, hockey, games, players, team.]

10 Application of Latent Trees: Object Recognition Challenge: succinct representation of large-scale data. Input: 100 object categories, 4000 training images. Goal: learn co-occurrence probabilities. Solution: latent tree graphical models. [Figure: learned context model over object categories such as Sky, Floor, Wall, Road, Car, People, Light, Crosswalk.] Context Models and Out-of-context Objects, M. J. Choi, A. Torralba, and A. S. Willsky, Pattern Recognition Letters, 2012.

14 Tree Mixture Models Multiple graphs: context-specific dependencies. Each component is a tree model. Unsupervised learning: the class variable is latent or hidden. Why use tree mixtures? Efficient inference: run BP on the component trees and combine the results; marginalization and sampling are similarly efficient. Learning: alternatives to EM (Meila & Jordan)? In this talk: learning latent tree models and tree mixtures.

16 Summary of Results Latent Tree Models: the number and location of hidden variables are unknown; integrated structure and parameter estimation; local learning with global consistency guarantees. Mixtures of Trees: structure and parameters under different contexts are unknown; unsupervised setting with a hidden choice variable; efficient methods for consistent structure and parameter learning.

17 Previous Approaches Algorithms for Structure Estimation Chow and Liu (68): Tree estimation. Meinshausen and Bühlmann (06): Convex relaxation. Ravikumar, Wainwright, Lafferty (10): Convex relaxation. Bresler, Mossel and Sly (09): Bounded-degree graphs... Learning with Hidden Variables Erdős et al. (99): Latent trees. Daskalakis, Mossel and Roch (06): Latent trees. Choi, Tan, Anandkumar and Willsky (10): Latent trees. Chandrasekaran, Parrilo and Willsky (11): Latent Gaussian models. Anandkumar et al. (12): Tensor decompositions...

18 Outline 1 Introduction 2 Tests for Structure Learning 3 Parameter Learning through Tensor Methods 4 Integrating Structure and Parameter Learning 5 Mixtures of Trees 6 Conclusion

19 Learning Latent Tree Graphical Models Linear Multivariate Models Conditional independence w.r.t. the tree. Categorical k-state hidden variables. Multivariate d-dimensional observed variables, k ≤ d. When y is a neighbor of h, E[y | h] = A h. Includes discrete, Poisson and Gaussian models, Gaussian mixtures, etc.
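As a concrete instance of this model class, here is a hedged sketch of sampling from a tiny latent tree: one k-state hidden node with three observed children, each one-hot encoded so that E[y | h = e_r] is exactly the r-th column of that child's transition matrix A. The dimensions and distributions are illustrative assumptions, not values from the talk.

import numpy as np

rng = np.random.default_rng(0)
k, d, n = 3, 5, 10_000

# One hidden k-state node h with three observed d-dimensional children (a 3-star tree).
pi = np.array([0.5, 0.3, 0.2])                                  # P[h = e_r]
A = [rng.dirichlet(np.ones(d), size=k).T for _ in range(3)]     # d x k transition matrices

h = rng.choice(k, size=n, p=pi)
# One-hot encode each categorical child, so that E[y | h] = A h holds exactly.
Y = [np.eye(d)[[rng.choice(d, p=A[c][:, hv]) for hv in h]] for c in range(3)]

# Sanity check: the empirical E[y_1 | h = e_1] should be close to the first column of A[0].
print(Y[0][h == 0].mean(axis=0))
print(A[0][:, 0])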

26 Additive Tree Distances Information distances [d_{i,j}] for tree models. Gaussian scalar: d_{ij} := -log |ρ_{ij}|. Linear multivariate models: d_{ij} := -log ( ∏_{σ_ℓ > 0} σ_ℓ(E[y_i y_j^⊤]) / sqrt( det E[y_i y_i^⊤] det E[y_j y_j^⊤] ) ). [d_{i,j}] is an additive tree metric: d_{k,l} = Σ_{(i,j) ∈ Path(k,l;E)} d_{i,j}. Goal: learn the latent tree using the estimates [d̂_{i,j}].
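For the Gaussian scalar case the information distance is simply d_ij = -log|rho_ij|, and additivity along tree paths can be checked directly on samples. A minimal NumPy sketch on a 3-node Gaussian Markov chain X1 - X2 - X3; the correlation coefficients 0.8 and 0.6 are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy Gaussian Markov chain X1 - X2 - X3 (a path graph, i.e. a tree).
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + np.sqrt(1 - 0.8 ** 2) * rng.standard_normal(n)
x3 = 0.6 * x2 + np.sqrt(1 - 0.6 ** 2) * rng.standard_normal(n)

def info_distance(a, b):
    """Gaussian-scalar information distance d_ab = -log |corr(a, b)|."""
    rho = np.corrcoef(a, b)[0, 1]
    return -np.log(abs(rho))

d12, d23, d13 = info_distance(x1, x2), info_distance(x2, x3), info_distance(x1, x3)

# Additive tree metric: d13 should match d12 + d23 along the path 1 - 2 - 3.
print(d13, d12 + d23)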

32 Siblings Test Based on Information Distances Exact statistics: distances [d_{i,j}]. Let Φ_{ijk} := d_{i,k} − d_{j,k}. Then −d_{i,j} < Φ_{ijk} = Φ_{ijk'} < d_{i,j} for all k, k' ≠ i, j when i, j are leaves with a common parent, and Φ_{ijk} = d_{i,j} for all k ≠ i, j when i is a leaf and j is its parent. Sample statistics: ML estimates [d̂_{i,j}]. Use only short distances, d_{i,k}, d_{j,k} < τ, and relax the equality relationships.
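The sibling test only needs the statistics Phi_ijk = d_ik - d_jk over witnesses k that lie within the distance threshold tau. A hedged NumPy sketch, assuming a precomputed symmetric matrix d_hat of estimated information distances; the tolerance eps used to relax the equality relationships is an illustrative knob, not a value from the talk.

import numpy as np

def sibling_relation(d_hat, i, j, tau, eps):
    """Classify the pair (i, j) from estimated information distances.
    Returns 'siblings' if i and j look like leaves with a common parent,
    'parent' if one node looks like the parent of the other (leaf),
    'neither' otherwise.  Only witnesses k with short distances (< tau)
    to both i and j are used, and equalities are relaxed to within eps."""
    p = d_hat.shape[0]
    witnesses = [k for k in range(p)
                 if k not in (i, j) and d_hat[i, k] < tau and d_hat[j, k] < tau]
    if not witnesses:
        return 'neither'
    phi = np.array([d_hat[i, k] - d_hat[j, k] for k in witnesses])
    if phi.max() - phi.min() > eps:            # Phi_ijk not (approximately) constant in k
        return 'neither'
    phi_bar = phi.mean()
    if abs(abs(phi_bar) - d_hat[i, j]) < eps:  # |Phi| = d_ij: one node is the other's parent
        return 'parent'
    if abs(phi_bar) < d_hat[i, j] - eps:       # -d_ij < Phi < d_ij: leaves with a common parent
        return 'siblings'
    return 'neither'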

38 Recursive Grouping Recursive Grouping Algorithm (Choi, Tan, A., Willsky) Run the sibling test and remove leaves; build the tree from the bottom up. Consistent structure estimation, but a serial method with high computational complexity.

39 Outline 1 Introduction 2 Tests for Structure Learning 3 Parameter Learning through Tensor Methods 4 Integrating Structure and Parameter Learning 5 Mixtures of Trees 6 Conclusion

40 Overview of Proposed Parameter Learning Method Toy model: 3-star. Linear multivariate model. A^r_{x_i|h} := E(x_i | h = e_r), the r-th column of the transition matrix A_{x_i|h}, and λ_r := P[h = e_r]. E(x_1 ⊗ x_2 ⊗ x_3) = Σ_{r=1}^{k} λ_r A^r_{x_1|h} ⊗ A^r_{x_2|h} ⊗ A^r_{x_3|h}. Guaranteed Recovery through Tensor Decomposition The transition matrices A_{x_i|h} have full column rank. Linear algebraic operations: SVD and tensor power iterations. Tensor Decompositions for Learning Latent Variable Models by A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade and M. Telgarsky. Preprint, October 2012.

41 Overview of Tensor Decomposition Technique Let a_r = E(x_i | h = e_r) for all i and λ_r := P[h = e_r]. M_3 = E[x_1 ⊗ x_2 ⊗ x_3] = Σ_{i=1}^{k} λ_i a_i^{⊗3}. Intuition: if the a_i are orthogonal, M_3(I, a_1, a_1) := Σ_i λ_i ⟨a_i, a_1⟩² a_i = λ_1 a_1, so the a_i are eigenvectors of the tensor M_3. Convert to an orthogonal tensor using the pairwise moments M_2 := E[x_1 ⊗ x_2] = Σ_i λ_i a_i^{⊗2}. Whitening matrix W: W^⊤ M_2 W = I. The tensor M_3(W, W, W) := Σ_i λ_i (W^⊤ a_i)^{⊗3} is then an orthogonal tensor.
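The whitening plus tensor power iteration recipe can be sketched in a few lines of NumPy. The sketch below works on exact (noise-free) 3-star moments and assumes the columns a_r are linearly independent with k <= d; the fixed number of power iterations and the single random restart per component are simplifications.

import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 3

# Ground-truth 3-star model: columns a_r = E(x_i | h = e_r), weights lam_r = P[h = e_r].
A = rng.standard_normal((d, k))
lam = np.array([0.5, 0.3, 0.2])

# Exact second- and third-order moments (no sampling noise in this sketch).
M2 = (A * lam) @ A.T
M3 = np.einsum('r,ir,jr,kr->ijk', lam, A, A, A)

# Whitening: W with W.T @ M2 @ W = I_k, built from the top-k eigenpairs of M2.
eigval, eigvec = np.linalg.eigh(M2)
top = np.argsort(eigval)[::-1][:k]
W = eigvec[:, top] / np.sqrt(eigval[top])

# Whitened (orthogonally decomposable) tensor T = M3(W, W, W).
T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)

# Tensor power iterations recover one robust eigenvector at a time; deflate and repeat.
a_hat, lam_hat = [], []
for _ in range(k):
    theta = rng.standard_normal(k)
    theta /= np.linalg.norm(theta)
    for _ in range(100):
        theta = np.einsum('ijk,j,k->i', T, theta, theta)
        theta /= np.linalg.norm(theta)
    mu = np.einsum('ijk,i,j,k->', T, theta, theta, theta)      # eigenvalue lam_r ** (-1/2)
    lam_hat.append(1.0 / mu ** 2)
    a_hat.append(mu * np.linalg.pinv(W.T) @ theta)             # un-whiten to recover a_r
    T = T - mu * np.einsum('i,j,k->ijk', theta, theta, theta)  # deflation

print(np.sort(lam_hat), np.sort(lam))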

45 Parameter Learning in Latent Trees Learning through Hierarchical Tensor Decomposition Assume a known tree structure. Decompose different triplets: the hidden variable is the join point of the triplet on the tree. Alignment issue: tensor decomposition is an unsupervised method, so the hidden labels are permuted across different triplets. Solution: align using a common node shared by the triplets.

46 Outline 1 Introduction 2 Tests for Structure Learning 3 Parameter Learning through Tensor Methods 4 Integrating Structure and Parameter Learning 5 Mixtures of Trees 6 Conclusion

48 Integrated Learning So far: consistent structure learning through sibling tests on distances; parameter learning through tensor decomposition on triplets. Challenges How to integrate structure and parameter learning? Can we save on computation through integration? Can we learn parameters as we learn the structure? Can we parallelize learning for scalability? Key Ideas Divide and conquer: find (overlapping) groups of observed variables. Learn local subtrees (and parameters) over the groups independently. Merge the subtrees and tweak the parameters to obtain the global latent tree model.

49 Parallel Chow-Liu Based Grouping Algorithm Minimum spanning tree over the observed nodes using the information distances [d̂_{i,j}]. [Figure: MST over observed nodes v1 through v6, and the latent subtrees with hidden nodes h1 through h3 obtained by local grouping and merging.]
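The divide step can be sketched as: build a minimum spanning tree over the observed nodes from the estimated information distances, then hand each worker an overlapping group of nodes. Taking each group to be the closed neighborhood of an internal MST node is one natural choice consistent with the figure, but that particular grouping rule is an assumption of this sketch rather than a statement from the slide.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_groups(d_hat):
    """Divide step: MST over estimated information distances, then overlapping groups.
    d_hat: (p, p) symmetric matrix of estimated distances.  Returns (mst_edges, groups),
    where each group is the closed neighborhood of an internal MST node."""
    p = d_hat.shape[0]
    mst = minimum_spanning_tree(d_hat).toarray()
    edges = [(i, j) for i in range(p) for j in range(p) if mst[i, j] > 0]

    neighbors = {i: set() for i in range(p)}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)

    # Internal nodes (degree > 1) define overlapping groups: the node plus its neighbors.
    groups = [sorted(neighbors[i] | {i}) for i in range(p) if len(neighbors[i]) > 1]
    return edges, groups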

50 Alignment of Parameters Alignment correction is performed at three levels: in-group, across-group, and across-neighborhood.

51 Theorem (Consistency Guarantees) The proposed method consistently recovers the structure with O(log p) samples and the parameters with poly(p) samples. Extent of parallelism: size of groups Γ ≤ Δ^{1 + (u/l)δ}. Effective depth: δ := max_i min_j |Path(v_i, v_j; T)|. Maximum degree in the latent tree: Δ. Upper and lower bounds on distances between neighbors in the latent tree: u and l. Implications: for a homogeneous HMM, constant-sized groups; worst case: star graphs.

52 Computational Complexity N samples, d-dimensional observed variables, k-state hidden variables, p observed variables, z non-zero entries per sample, groups of size Γ.
Algorithm step | Time per worker | Degree of parallelism
Information distance estimation | O(Nz + d + k^3) | O(p^2)
Structure: minimum spanning tree | O(log p) | O(p^2)
Structure: local recursive grouping | O(Γ^3) | O(p/Γ)
Parameter: tensor decomposition | O(Γ k^3 + Γ d k^2) | O(p/Γ)
Merging and alignment correction | O(d k^2) | O(p/Γ)
Integrated Structure and Parameter Learning in Latent Tree Graphical Models by F. Huang, U. N. Niranjan, A. Anandkumar. Preprint, June 2014.

53 Proof Ideas Relating the Chow-Liu tree with the latent tree. Surrogate Sg(i) for node i: the observed node with the strongest correlation, Sg(i) := argmin_{j ∈ V} d_{i,j}. Neighborhood preservation: (i,j) ∈ T implies (Sg(i), Sg(j)) ∈ T_{ML}. Chow-Liu grouping reverses edge contractions. Proof by induction.

54 Experiments d = k = 2 dimensions, p = 9 variables. [Figure: structure error versus number of samples for CLRG-EM and the proposed method.] [Table: structure error, parameter error, and running time for varying d, p, and N.]

55 Outline 1 Introduction 2 Tests for Structure Learning 3 Parameter Learning through Tensor Methods 4 Integrating Structure and Parameter Learning 5 Mixtures of Trees 6 Conclusion

62 Mixtures of Graphical Models: Our Approach Consider data from a graphical model mixture with a hidden choice variable H and component graphs G_1, G_2. Output a tree mixture: the best tree approximation of each component. Steps: estimation of the union graph G∪; estimation of pairwise moments in each component; tree approximation of each component via the Chow-Liu algorithm. Efficient Learning of Tree Mixture Approximations.

70 Learning Graphical Model Mixtures Adapt the tensor decomposition method to graphical model mixtures? Challenges: cannot reduce to triplet tensor decomposition; need to learn higher-order moments of the mixture components. Solutions: G∪, the union graph, is learnt from a rank test. Use separators on the union graph G∪: X_u ⊥ X_v | X_S, H, where S is a separator in G∪ (e.g., S = {w, w'} in the figure). Learn a tree mixture approximation: estimate pairwise mixture moments. Efficient Estimation of Tree Mixture Approximations.

76 Sparse Graphical Model Selection: Intuitions First consider a graphical model with no latent variables. Markov property of graphical models: X_u ⊥ X_v | X_S ⟺ I(X_u; X_v | X_S) = 0. Alternative test for conditional independence: P(X_u = i, X_v = j | X_S = k) = P(X_u = i | X_S = k) P(X_v = j | X_S = k). Define M_{u,v,{s;k}} := [P(X_u = i, X_v = j, X_S = k)]_{i,j}. Then X_u ⊥ X_v | X_S ⟺ Rank(M_{u,v,{s;k}}) = 1. Rank Test on Pairwise Probability Matrices.
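The conditional independence check therefore reduces to a rank test on the matrix M_{u,v,{s;k}} of joint probabilities. A hedged NumPy sketch on empirical counts; the singular value threshold xi is an illustrative tuning parameter standing in for the slide's ξ_{n,p}.

import numpy as np

def effective_rank(M, xi):
    """Number of singular values of M above the threshold xi."""
    return int(np.sum(np.linalg.svd(M, compute_uv=False) > xi))

def cond_indep_rank_test(samples, u, v, s, xi=1e-2):
    """Test X_u _||_ X_v | X_s by checking Rank(M_{u,v,{s;k}}) = 1 for every value k of X_s.
    samples: (n, p) array of non-negative integer labels; u, v, s: column indices."""
    n = samples.shape[0]
    du, dv = samples[:, u].max() + 1, samples[:, v].max() + 1
    for k in np.unique(samples[:, s]):
        rows = samples[samples[:, s] == k]
        M = np.zeros((du, dv))
        for xu, xv in zip(rows[:, u], rows[:, v]):
            M[xu, xv] += 1.0 / n              # empirical P(X_u = i, X_v = j, X_s = k)
        if effective_rank(M, xi) > 1:
            return False                      # some conditioning value rejects rank one
    return True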

85 Extending Rank Tests to Mixtures The dimension of the latent H is r and each observed variable is d > r. First assume the Markov graph is the same for all components. Then X_u ⊥ X_v | X_S, H, and P(X_u, X_v | X_S) = Σ_{h=1}^{r} P(X_u | X_S, H = h) P(H = h | X_S) P(X_v | X_S, H = h). M_{u,v,{s;k}} := [P(X_u = i, X_v = j, X_S = k)]_{i,j} is a d × d matrix that factors as (d × r)(r × r)(r × d), so Rank(M_{u,v,{s;k}}) is at most r.

89 Rank Test for Mixtures Dim(H) is r and each observed variable is d > r. G∪: union of the Markov graphs of the components. η: bound on the size of separators between node pairs in G∪. M_{u,v,{s;k}} := [P(X_u = i, X_v = j, X_S = k)]_{i,j}. Declare (u, v) as an edge if min_{S ⊆ V \ {u,v}, |S| ≤ η} max_k Rank(M_{u,v,{s;k}}; ξ_{n,p}) > r. Small η: computationally efficient, uses only low-order statistics. Examples of graphs G∪ with small η: mixture of product distributions: G∪ is trivial and η = 0; mixture on the same tree: G∪ is a tree and η = 1; mixture on arbitrary trees: G∪ is a union of r trees and η = r. Simple Test for Estimation of the Union Graph of Mixtures.
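For mixtures the same test is run with threshold r instead of 1, minimizing over candidate separators of size at most eta. A self-contained sketch of the displayed decision rule; the cutoff xi again stands in for ξ_{n,p}, and enumerating all separators is only practical for small eta.

import numpy as np
from itertools import combinations

def declare_edge(samples, u, v, eta, r, xi=1e-2):
    """Declare (u, v) an edge of the union graph if no separator of size <= eta
    drops max_k Rank(M_{u,v,{S;k}}; xi) to at most r."""
    n, p = samples.shape
    others = [w for w in range(p) if w not in (u, v)]
    candidate_seps = [()] + [S for size in range(1, eta + 1)
                             for S in combinations(others, size)]
    du, dv = samples[:, u].max() + 1, samples[:, v].max() + 1

    for S in candidate_seps:
        cols = list(S)
        sub = samples[:, cols]                                # conditioning columns X_S
        configs = np.unique(sub, axis=0) if cols else np.zeros((1, 0), dtype=int)
        max_rank = 0
        for kconf in configs:
            mask = np.all(sub == kconf, axis=1) if cols else np.ones(n, dtype=bool)
            M = np.zeros((du, dv))
            for xu, xv in zip(samples[mask, u], samples[mask, v]):
                M[xu, xv] += 1.0 / n                          # empirical P(X_u=i, X_v=j, X_S=k)
            svals = np.linalg.svd(M, compute_uv=False)
            max_rank = max(max_rank, int(np.sum(svals > xi)))
        if max_rank <= r:
            return False   # this separator makes every conditional matrix rank at most r
    return True            # min over separators of the max rank exceeds r: keep the edge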

92 Guarantees on Rank Test Theorem (A., Hsu, Huang, Kakade 12) The rank test recovers the graph structure G∪ correctly w.h.p. on p nodes with n samples when (log p) / (ρ²_min n) = O(1). ρ_min: the minimum (r+1)-th singular value between neighbors. Efficient test with low sample and computational requirements. Recall the examples of graphs G∪ with small η: mixture of product distributions: G∪ is trivial and η = 0; mixture on the same tree: G∪ is a tree and η = 1; mixture on arbitrary trees: G∪ is a union of r trees and η = r.

95 Guarantees for Learning Graphical Model Mixtures Steps involved in the tree mixture approximation: rank tests for structure estimation of the union graph G∪; tensor decomposition for estimation of pairwise moments of the mixture components; Chow-Liu algorithm to estimate the mixture component trees. A computationally efficient algorithm for learning graphical model mixtures. Theorem (A., Hsu, Huang, Kakade 12) The above method recovers the correct tree mixture approximation w.h.p. on p nodes of an r-component mixture with n samples when n = poly(p, r). Efficient Learning of Multiple Graphs and Models in High Dimensions. Learning High-Dimensional Mixtures of Graphical Models by A. Anandkumar, D. Hsu, F. Huang, and S.M. Kakade. NIPS 2012.

96 Splice Data DNA SPLICE-junctions dataset: 60 variables (a sequence of DNA bases) plus a class variable for the splice junction type (EI, IE, none). [Figures: mixing weight estimation error (r = 3), conditional log-likelihood, and classification error versus the number of components, comparing EM Algorithm, Proposed Algorithm, and Proposed EM.]

97 Outline 1 Introduction 2 Tests for Structure Learning 3 Parameter Learning through Tensor Methods 4 Integrating Structure and Parameter Learning 5 Mixtures of Trees 6 Conclusion

98 Summary and Outlook Learning Latent Tree Models: integrated structure and parameter learning; a high level of parallelism without losing consistency. Learning Graphical Model Mixtures: tree mixture approximations; combinatorial search plus spectral decomposition; computational and sample guarantees.


Efficient Spectral Methods for Learning Mixture Models! Efficient Spectral Methods for Learning Mixture Models Qingqing Huang 2016 February Laboratory for Information & Decision Systems Based on joint works with Munther Dahleh, Rong Ge, Sham Kakade, Greg Valiant.

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics EECS 281A / STAT 241A Statistical Learning Theory Solutions to Problem Set 2 Fall 2011 Issued: Wednesday,

More information

Mathematical Formulation of Our Example

Mathematical Formulation of Our Example Mathematical Formulation of Our Example We define two binary random variables: open and, where is light on or light off. Our question is: What is? Computer Vision 1 Combining Evidence Suppose our robot

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Semi-Supervised Learning

Semi-Supervised Learning Semi-Supervised Learning getting more for less in natural language processing and beyond Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University 1 Semi-supervised Learning many human

More information

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1

Introduction to Machine Learning. Introduction to ML - TAU 2016/7 1 Introduction to Machine Learning Introduction to ML - TAU 2016/7 1 Course Administration Lecturers: Amir Globerson (gamir@post.tau.ac.il) Yishay Mansour (Mansour@tau.ac.il) Teaching Assistance: Regev Schweiger

More information

Lecture 4 October 18th

Lecture 4 October 18th Directed and undirected graphical models Fall 2017 Lecture 4 October 18th Lecturer: Guillaume Obozinski Scribe: In this lecture, we will assume that all random variables are discrete, to keep notations

More information

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes

Mixtures of Gaussians with Sparse Regression Matrices. Constantinos Boulis, Jeffrey Bilmes Mixtures of Gaussians with Sparse Regression Matrices Constantinos Boulis, Jeffrey Bilmes {boulis,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 98195-2500 UW Electrical Engineering

More information

Hidden Markov models

Hidden Markov models Hidden Markov models Charles Elkan November 26, 2012 Important: These lecture notes are based on notes written by Lawrence Saul. Also, these typeset notes lack illustrations. See the classroom lectures

More information

A brief introduction to Conditional Random Fields

A brief introduction to Conditional Random Fields A brief introduction to Conditional Random Fields Mark Johnson Macquarie University April, 2005, updated October 2010 1 Talk outline Graphical models Maximum likelihood and maximum conditional likelihood

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Recent Advances in Ranking: Adversarial Respondents and Lower Bounds on the Bayes Risk

Recent Advances in Ranking: Adversarial Respondents and Lower Bounds on the Bayes Risk Recent Advances in Ranking: Adversarial Respondents and Lower Bounds on the Bayes Risk Vincent Y. F. Tan Department of Electrical and Computer Engineering, Department of Mathematics, National University

More information

Lecture 2: Linear Algebra Review

Lecture 2: Linear Algebra Review EE 227A: Convex Optimization and Applications January 19 Lecture 2: Linear Algebra Review Lecturer: Mert Pilanci Reading assignment: Appendix C of BV. Sections 2-6 of the web textbook 1 2.1 Vectors 2.1.1

More information

17 Variational Inference

17 Variational Inference Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms for Inference Fall 2014 17 Variational Inference Prompted by loopy graphs for which exact

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

Basic math for biology

Basic math for biology Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

More information

Estimating Covariance Using Factorial Hidden Markov Models

Estimating Covariance Using Factorial Hidden Markov Models Estimating Covariance Using Factorial Hidden Markov Models João Sedoc 1,2 with: Jordan Rodu 3, Lyle Ungar 1, Dean Foster 1 and Jean Gallier 1 1 University of Pennsylvania Philadelphia, PA joao@cis.upenn.edu

More information