Large-Deviations and Applications for Learning Tree-Structured Graphical Models

1 Large-Deviations and Applications for Learning Tree-Structured Graphical Models Vincent Tan Stochastic Systems Group, Lab of Information and Decision Systems, Massachusetts Institute of Technology Thesis Defense (Nov 16, 2010) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 1 / 52

2 Acknowledgements The following is joint work with: Alan Willsky (MIT) Lang Tong (Cornell) Animashree Anandkumar (UC Irvine) John Fisher (MIT) Sujay Sanghavi (UT Austin) Matt Johnson (MIT) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 2 / 52

3 Outline 1 Motivation, Background and Main Contributions Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

4 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

5 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

6 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

7 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

8 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 4 / 52

9 Motivation: A Real-Life Example Manchester Asthma and Allergy Study (MAAS) More than n ≈ 1000 children Number of variables d ≈ 10^6 Environmental, Physiological and Genetic (SNP) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 5 / 52

10 Motivation: A Real-Life Example Manchester Asthma and Allergy Study (MAAS) More than n ≈ 1000 children Number of variables d ≈ 10^6 Environmental, Physiological and Genetic (SNP) [Figure: Manchester Asthma and Allergy Study logo] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 5 / 52

11 Motivation: Modeling Large Datasets I How do we model such data to make useful inferences? Simpson*, VYFT* et al. Beyond Atopy: Multiple Patterns of Sensitization in Relation to Asthma in a Birth Cohort Study, Am. J. Respir. Crit. Care Med. Feb Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 6 / 52

12 Motivation: Modeling Large Datasets I How do we model such data to make useful inferences? Model the relationships between variables by a sparse graph Reduce the number of interdependencies between the variables Simpson*, VYFT* et al. Beyond Atopy: Multiple Patterns of Sensitization in Relation to Asthma in a Birth Cohort Study, Am. J. Respir. Crit. Care Med. Feb Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 6 / 52

13 Motivation: Modeling Large Datasets I How do we model such data to make useful inferences? Model the relationships between variables by a sparse graph Reduce the number of interdependencies between the variables Prematurity Smoking Lung Function Airway Inflammation Acquired Immune Response Obesity Airway Obstruction Bronchial Hyperresponsiveness Immune Response to Virus Viral Infection Simpson*, VYFT* et al. Beyond Atopy: Multiple Patterns of Sensitization in Relation to Asthma in a Birth Cohort Study, Am. J. Respir. Crit. Care Med. Feb Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 6 / 52

14 Motivation: Modeling Large Datasets II Reduce the dimensionality of the covariates (features) for predicting a variable of interest (e.g., asthma) Information-theoretic limits? VYFT, Johnson and Willsky, Necessary and Sufficient Conditions for Salient Subset Recovery, Intl. Symp. on Info. Theory, Jul VYFT, Sanghavi, Fisher and Willsky, Learning Graphical Models for Hypothesis Testing and Classification, IEEE Trans. on Signal Processing, Nov Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 7 / 52

15 Motivation: Modeling Large Datasets II Reduce the dimensionality of the covariates (features) for predicting a variable of interest (e.g., asthma) Information-theoretic limits? Learning graphical models tailored specifically for hypothesis testing Can we learn better models in the finite-sample setting? VYFT, Johnson and Willsky, Necessary and Sufficient Conditions for Salient Subset Recovery, Intl. Symp. on Info. Theory, Jul VYFT, Sanghavi, Fisher and Willsky, Learning Graphical Models for Hypothesis Testing and Classification, IEEE Trans. on Signal Processing, Nov Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 7 / 52

16 Graphical Models: Introduction Graph structure G = (V, E) represents a multivariate distribution of a random vector X = (X 1,..., X d ) indexed by V = {1,..., d} Node i V corresponds to random variable X i Edge set E corresponds to conditional independencies Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 8 / 52

17 Graphical Models: Introduction Graph structure G = (V, E) represents a multivariate distribution of a random vector X = (X_1, ..., X_d) indexed by V = {1, ..., d} Node i ∈ V corresponds to random variable X_i Edge set E corresponds to conditional independencies: X_i ⊥ X_{V \ {nbd(i) ∪ i}} | X_{nbd(i)}, and more generally X_A ⊥ X_B | X_S whenever S separates A and B Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 8 / 52

18 From Conditional Independence to Gibbs Distribution Hammersley-Clifford Theorem (1971) Let P be the joint pmf of a graphical model Markov on G = (V, E): P(x) = (1/Z) exp[ Σ_{c ∈ C} Ψ_c(x_c) ] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 9 / 52

19 From Conditional Independence to Gibbs Distribution Hammersley-Clifford Theorem (1971) Let P be the joint pmf of a graphical model Markov on G = (V, E): P(x) = (1/Z) exp[ Σ_{c ∈ C} Ψ_c(x_c) ] where C is the set of maximal cliques. Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 9 / 52

20 From Conditional Independence to Gibbs Distribution Hammersley-Clifford Theorem (1971) Let P be the joint pmf of a graphical model Markov on G = (V, E): P(x) = (1/Z) exp[ Σ_{c ∈ C} Ψ_c(x_c) ] where C is the set of maximal cliques. Gaussian Graphical Models: the sparsity pattern of the inverse covariance matrix coincides with the dependency graph [Figure: dependency graph and inverse covariance matrix] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 9 / 52
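
To make the Gaussian remark concrete, here is a minimal sketch (illustrative, not from the thesis; the edge correlations are arbitrary) that builds the covariance of a zero-mean, unit-variance Gauss-Markov chain and checks that its inverse is tridiagonal, i.e., that the sparsity pattern of the precision matrix matches the dependency graph.

```python
import numpy as np

# Zero-mean, unit-variance Gauss-Markov chain X1 - X2 - X3 - X4.
# By Markovianity, corr(X_i, X_j) is the product of the edge correlations
# on the path between i and j, which fixes the whole covariance matrix.
rho = [0.6, 0.5, 0.4]          # illustrative edge correlations (not from the slides)
d = 4
Sigma = np.eye(d)
for i in range(d):
    for j in range(i + 1, d):
        Sigma[i, j] = Sigma[j, i] = np.prod(rho[i:j])

K = np.linalg.inv(Sigma)       # precision (inverse covariance) matrix
print(np.round(K, 3))
# All entries K[i, j] with |i - j| >= 2 are numerically zero: the sparsity
# pattern of the inverse covariance matrix is exactly the chain's edge set.
```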

21 Tree-Structured Graphical Models P(x) = Π_{i ∈ V} P_i(x_i) Π_{(i,j) ∈ E} P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) For the star with center X_1 and leaves X_2, X_3, X_4: P(x) = P_1(x_1) · P_{1,2}(x_1, x_2)/P_1(x_1) · P_{1,3}(x_1, x_3)/P_1(x_1) · P_{1,4}(x_1, x_4)/P_1(x_1) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 10 / 52

22 Tree-Structured Graphical Models P(x) = Π_{i ∈ V} P_i(x_i) Π_{(i,j) ∈ E} P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) Tree-structured Graphical Models: Tractable Learning and Inference Maximum-Likelihood learning of tree structure is tractable Chow-Liu Algorithm (1968) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 10 / 52

23 Tree-Structured Graphical Models P(x) = Π_{i ∈ V} P_i(x_i) Π_{(i,j) ∈ E} P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) Tree-structured Graphical Models: Tractable Learning and Inference Maximum-Likelihood learning of tree structure is tractable Chow-Liu Algorithm (1968) Inference on Trees is tractable Sum-Product Algorithm Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 10 / 52

24 Tree-Structured Graphical Models P(x) = Π_{i ∈ V} P_i(x_i) Π_{(i,j) ∈ E} P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) Tree-structured Graphical Models: Tractable Learning and Inference Maximum-Likelihood learning of tree structure is tractable Chow-Liu Algorithm (1968) Inference on Trees is tractable Sum-Product Algorithm Which other classes of graphical models are tractable for learning? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 10 / 52
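
As a quick sanity check on the tree factorization above, the following sketch (illustrative; the probabilities are arbitrary, and binary alphabets are assumed) evaluates a star-structured joint pmf over four variables both directly and through the node/pairwise-marginal factorization.

```python
import numpy as np
from itertools import product

# Star with center X1 and leaves X2, X3, X4 over binary alphabets.
# Build the joint from P1 and illustrative conditionals P(xj | x1), then verify
# that the factorization in terms of node and pairwise marginals reproduces it.
P1 = np.array([0.3, 0.7])
cond = {j: np.array([[0.8, 0.2],     # P(xj = . | x1 = 0)
                     [0.4, 0.6]])    # P(xj = . | x1 = 1)
        for j in (2, 3, 4)}

def joint(x):
    x1, x2, x3, x4 = x
    return P1[x1] * cond[2][x1, x2] * cond[3][x1, x3] * cond[4][x1, x4]

# Pairwise marginals P_{1,j}(x1, xj) = P1(x1) * P(xj | x1).
pair = {j: P1[:, None] * cond[j] for j in (2, 3, 4)}

for x in product((0, 1), repeat=4):
    x1 = x[0]
    factored = P1[x1]                                # prod_i P_i * prod_e P_e / (P_i P_j)
    for j in (2, 3, 4):
        factored *= pair[j][x1, x[j - 1]] / P1[x1]   # collapses to P_{1,j} / P_1 here
    assert np.isclose(factored, joint(x))
print("tree factorization reproduces the joint pmf")
```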

25 Main Contributions in Thesis: I Error Exponent Analysis of Tree Structure Learning (Ch. 3 and 4) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 11 / 52

26 Main Contributions in Thesis: I Error Exponent Analysis of Tree Structure Learning (Ch. 3 and 4): extremal tree structures (star, chain) High-Dimensional Structure Learning for Forest Models (Ch. 5) [Figure: structure learning in graphical models — trees, forests, latent trees] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 11 / 52

27 Main Contributions in Thesis: II Learning Graphical Models for Hypothesis Testing (Ch. 6) Devised algorithms for learning trees for hypothesis testing [Figure: generatively-learned (Chow-Liu) vs. discriminatively-learned tree distributions as projections of the empirical distributions onto the set of tree distributions T_p] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 12 / 52

28 Main Contributions in Thesis: II Learning Graphical Models for Hypothesis Testing (Ch. 6) Devised algorithms for learning trees for hypothesis testing Information-Theoretic Limits for Salient Subset Recovery (Ch. 7) Devised necessary and sufficient conditions for estimating the salient set of features Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 12 / 52

29 Main Contributions in Thesis: II Learning Graphical Models for Hypothesis Testing (Ch. 6) Devised algorithms for learning trees for hypothesis testing Information-Theoretic Limits for Salient Subset Recovery (Ch. 7) Devised necessary and sufficient conditions for estimating the salient set of features We will focus on Chapters 3-5 here. See thesis for Chapters 6 and 7. Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 12 / 52

30 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 13 / 52

31 Motivation ML learning of tree structure given i.i.d. X^d-valued samples Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

32 Motivation ML learning of tree structure given i.i.d. X^d-valued samples [Figure: error probability P^n(err) versus the number of samples n] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

33 Motivation ML learning of tree structure given i.i.d. X^d-valued samples P^n(err) ≐ exp(−n · Rate) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

34 Motivation ML learning of tree structure given i.i.d. X^d-valued samples P^n(err) ≐ exp(−n · Rate) When does the error probability decay exponentially? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

35 Motivation ML learning of tree structure given i.i.d. X^d-valued samples P^n(err) ≐ exp(−n · Rate) When does the error probability decay exponentially? What is the exact rate of decay of the probability of error? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

36 Motivation ML learning of tree structure given i.i.d. X^d-valued samples P^n(err) ≐ exp(−n · Rate) When does the error probability decay exponentially? What is the exact rate of decay of the probability of error? How does the error exponent depend on the parameters and structure of the true distribution? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

37 Main Contributions Discrete case: Provide the exact rate of decay for a given P Rate of decay ≈ SNR for learning Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 15 / 52

38 Main Contributions Discrete case: Provide the exact rate of decay for a given P Rate of decay ≈ SNR for learning Gaussian case: Extremal structures: Star (worst) and chain (best) for learning Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 15 / 52

39 Related Work in Structure Learning ML for trees: Max-weight spanning tree with mutual information edge weights (Chow & Liu 1968) Causal dependence trees: directed mutual information (Quinn, Coleman & Kiyavash 2010) Convex relaxation methods: l 1 regularization Gaussian graphical models (Meinshausen and Buehlmann 2006) Logistic regression for Ising models (Ravikumar et al. 2010) Learning thin junction trees through conditional mutual information tests (Chechetka et al. 2007) Conditional independence tests for bounded degree graphs (Bresler et al. 2008) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 16 / 52

40 Related Work in Structure Learning ML for trees: Max-weight spanning tree with mutual information edge weights (Chow & Liu 1968) Causal dependence trees: directed mutual information (Quinn, Coleman & Kiyavash 2010) Convex relaxation methods: l 1 regularization Gaussian graphical models (Meinshausen and Buehlmann 2006) Logistic regression for Ising models (Ravikumar et al. 2010) Learning thin junction trees through conditional mutual information tests (Chechetka et al. 2007) Conditional independence tests for bounded degree graphs (Bresler et al. 2008) We obtain and analyze error exponents for the ML learning of trees (and extensions to forests) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 16 / 52

41 ML Learning of Trees (Chow-Liu) I Samples x^n = {x_1, ..., x_n} drawn i.i.d. from P ∈ P(X^d), X is finite Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 17 / 52

42 ML Learning of Trees (Chow-Liu) I Samples x^n = {x_1, ..., x_n} drawn i.i.d. from P ∈ P(X^d), X is finite Solve the ML problem given the data x^n: P_ML ∈ argmax_{Q ∈ Trees} (1/n) Σ_{k=1}^n log Q(x_k) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 17 / 52

43 ML Learning of Trees (Chow-Liu) I Samples x^n = {x_1, ..., x_n} drawn i.i.d. from P ∈ P(X^d), X is finite Solve the ML problem given the data x^n: P_ML ∈ argmax_{Q ∈ Trees} (1/n) Σ_{k=1}^n log Q(x_k) Denote P̂(a) = P̂(a; x^n) as the empirical distribution of x^n Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 17 / 52

44 ML Learning of Trees (Chow-Liu) I Samples x^n = {x_1, ..., x_n} drawn i.i.d. from P ∈ P(X^d), X is finite Solve the ML problem given the data x^n: P_ML ∈ argmax_{Q ∈ Trees} (1/n) Σ_{k=1}^n log Q(x_k) Denote P̂(a) = P̂(a; x^n) as the empirical distribution of x^n Reduces to a max-weight spanning tree problem (Chow-Liu 1968): E_ML = argmax_{E_Q: Q ∈ Trees} Σ_{e ∈ E_Q} I(P̂_e) P̂_e is the marginal of the empirical distribution on e = (i, j) I(P̂_e) is the mutual information of the empirical P̂_e Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 17 / 52

45 ML Learning of Trees (Chow-Liu) II [Figure: four-node example on X_1, X_2, X_3, X_4] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 18 / 52

46 ML Learning of Trees (Chow-Liu) II [Figure] True MI {I(P_e)} → Max-weight spanning tree E_P Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 18 / 52

47 ML Learning of Trees (Chow-Liu) II [Figure] True MI {I(P_e)} → Max-weight spanning tree E_P Empirical MI {I(P̂_e)} computed from x^n Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 18 / 52

48 ML Learning of Trees (Chow-Liu) II [Figure] True MI {I(P_e)} → Max-weight spanning tree E_P Empirical MI {I(P̂_e)} from x^n → Max-weight spanning tree E_ML, which may differ from E_P Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 18 / 52
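
The slides above outline the Chow-Liu procedure. Below is a minimal sketch (illustrative, not the thesis code; it assumes binary samples and uses the external networkx library for the spanning tree) that estimates pairwise mutual informations from data and returns the max-weight spanning tree E_ML.

```python
import numpy as np
import networkx as nx

def empirical_mi(xi, xj):
    """Plug-in mutual information I(P_hat_{i,j}) for two binary sample vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(X):
    """X: (n, d) array of samples; returns the ML tree edge set E_ML."""
    n, d = X.shape
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            G.add_edge(i, j, weight=empirical_mi(X[:, i], X[:, j]))
    T = nx.maximum_spanning_tree(G, weight="weight")   # max-weight spanning tree
    return sorted(T.edges())

# Example: samples from a chain 0 - 1 - 2 - 3 (each bit is a noisy copy of the previous).
rng = np.random.default_rng(0)
n, d = 2000, 4
X = np.zeros((n, d), dtype=int)
X[:, 0] = rng.integers(0, 2, n)
for i in range(1, d):
    flip = rng.random(n) < 0.2
    X[:, i] = np.where(flip, 1 - X[:, i - 1], X[:, i - 1])
print(chow_liu_edges(X))   # typically recovers [(0, 1), (1, 2), (2, 3)]
```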

49 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

50 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} [Figure: true tree versus an incorrectly learned tree] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

51 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Find the error exponent K_P: K_P := lim_{n→∞} −(1/n) log P^n(E_ML ≠ E_P) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

52 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Find the error exponent K_P: K_P := lim_{n→∞} −(1/n) log P^n(E_ML ≠ E_P) P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

53 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Find the error exponent K_P: K_P := lim_{n→∞} −(1/n) log P^n(E_ML ≠ E_P) P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Naïvely, what could we do to compute K_P? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

54 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Find the error exponent K_P: K_P := lim_{n→∞} −(1/n) log P^n(E_ML ≠ E_P) P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Naïvely, what could we do to compute K_P? I-projections onto all trees? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

55 The Crossover Rate I [Figure: true MIs I(P_e) and empirical MIs I(P̂_e)] Correct Structure Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 20 / 52

56 The Crossover Rate I [Figure: true MIs I(P_e) and empirical MIs I(P̂_e)] Correct Structure Incorrect Structure! Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 20 / 52

57 The Crossover Rate I [Figure: true MIs I(P_e) and empirical MIs I(P̂_e)] Correct Structure Incorrect Structure! Structure Unaffected Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 20 / 52

58 The Crossover Rate I [Figure: two node pairs e and e′] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 21 / 52

59 The Crossover Rate I Given two node pairs e, e′ ∈ (V choose 2) with joint distribution P_{e,e′} ∈ P(X^4), s.t. I(P_e) > I(P_{e′}). Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 21 / 52

60 The Crossover Rate I Given two node pairs e, e′ ∈ (V choose 2) with joint distribution P_{e,e′} ∈ P(X^4), s.t. I(P_e) > I(P_{e′}). Consider the crossover event of the empirical MI {I(P̂_{e′}) ≥ I(P̂_e)} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 21 / 52

61 The Crossover Rate I Given two node pairs e, e′ ∈ (V choose 2) with joint distribution P_{e,e′} ∈ P(X^4), s.t. I(P_e) > I(P_{e′}). Consider the crossover event of the empirical MI {I(P̂_{e′}) ≥ I(P̂_e)} Def: Crossover Rate J_{e,e′} := lim_{n→∞} −(1/n) log P^n( I(P̂_{e′}) ≥ I(P̂_e) ) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 21 / 52

62 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

63 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

64 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } [Figure: the probability simplex P(X^4)] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

65 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } [Figure: the simplex P(X^4) with P_{e,e′} marked] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

66 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } [Figure: the simplex P(X^4) with P_{e,e′} and the constraint set {I(Q_e) = I(Q_{e′})}] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

67 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } [Figure: the minimizer Q*_{e,e′} on {I(Q_e) = I(Q_{e′})} achieving D(Q*_{e,e′} ‖ P_{e,e′})] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

68 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } I-projection (Csiszár) Sanov's Theorem Exact but not intuitive Non-Convex Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52
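
The exact rate requires the non-convex I-projection above, but the exponential decay of the crossover probability itself is easy to see by simulation. A rough Monte Carlo sketch (illustrative, not from the thesis; the two node pairs are taken to be independent binary pairs with different dependence strengths) is given below.

```python
import numpy as np

rng = np.random.default_rng(0)

def emp_mi(xi, xj):
    """Plug-in mutual information for two binary sample vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p = np.mean((xi == a) & (xj == b))
            if p > 0:
                mi += p * np.log(p / (np.mean(xi == a) * np.mean(xj == b)))
    return mi

def sample_pair(n, flip):
    """Binary pair where the second bit is a noisy copy of the first."""
    x = rng.integers(0, 2, n)
    y = np.where(rng.random(n) < flip, 1 - x, x)
    return x, y

# e has higher true MI (flip prob 0.25) than e' (flip prob 0.35).
for n in (50, 100, 200, 400):
    trials = 2000
    crossovers = 0
    for _ in range(trials):
        xe, ye = sample_pair(n, 0.25)
        xep, yep = sample_pair(n, 0.35)
        if emp_mi(xep, yep) >= emp_mi(xe, ye):     # crossover event
            crossovers += 1
    p = crossovers / trials
    # Last column: rough estimate of the decay exponent -(1/n) log P(crossover).
    print(n, p, -np.log(max(p, 1 / trials)) / n)
```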

69 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

70 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Easy only in some very special cases Star graph with I(Q_a) > I(Q_b) > 0 There is a unique crossover rate The unique crossover rate is the error exponent [Figure: star graph with edge distribution Q_a] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

71 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Easy only in some very special cases Star graph with I(Q_a) > I(Q_b) > 0 There is a unique crossover rate The unique crossover rate is the error exponent [Figure: star graph with edge distributions Q_a and Q_b] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

72 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Easy only in some very special cases Star graph with I(Q_a) > I(Q_b) > 0 There is a unique crossover rate The unique crossover rate is the error exponent [Figure: star graph with edge distributions Q_a and Q_b] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

73 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Easy only in some very special cases Star graph with I(Q_a) > I(Q_b) > 0 There is a unique crossover rate The unique crossover rate is the error exponent K_P = min_{R ∈ P(X^4)} { D(R ‖ Q_{a,b}) : I(R_{e′}) = I(R_e) } Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

74 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

75 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P and a competing tree T] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

76 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P and a competing tree T, with a non-edge e′ ∉ E_P] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

77 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P, a non-edge e′ ∉ E_P, and its path Path(e′; E_P) in the true tree] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

78 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P, a non-edge e′ ∉ E_P, and its path Path(e′; E_P); the dominant crossover occurs along this path] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

79 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P, a non-edge e′ ∉ E_P, and its path Path(e′; E_P); the dominant crossover occurs along this path] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

80 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander Theorem (Error Exponent) K_P = min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

81 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

82 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

83 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Proposition The following statements are equivalent: (a) The error probability decays exponentially, i.e., K_P > 0 (b) T_P is a connected tree, i.e., not a proper forest Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

84 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Proposition The following statements are equivalent: (a) The error probability decays exponentially, i.e., K_P > 0 (b) T_P is a connected tree, i.e., not a proper forest [Figure: P^n(err) versus n = # Samples] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

85 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Proposition The following statements are equivalent: (a) The error probability decays exponentially, i.e., K_P > 0 (b) T_P is a connected tree, i.e., not a proper forest [Figure: P^n(err) versus n = # Samples; the K_P > 0 curve decays exponentially] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

86 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Proposition The following statements are equivalent: (a) The error probability decays exponentially, i.e., K_P > 0 (b) T_P is a connected tree, i.e., not a proper forest [Figure: P^n(err) versus n = # Samples; the K_P > 0 curve decays exponentially, the K_P = 0 curve does not] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52
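
To illustrate the theorem, here is a small sketch (illustrative; the tree and the crossover-rate values J are hypothetical numbers, not computed from any distribution) that evaluates K_P = min over non-edges e′ of the minimum crossover rate along Path(e′; E_P), for a 4-node chain.

```python
import networkx as nx

# True tree: chain 1 - 2 - 3 - 4.
E_P = [(1, 2), (2, 3), (3, 4)]
T = nx.Graph(E_P)
nodes = list(T.nodes)

# Hypothetical crossover rates J[(e, e_prime)] for edge e on the path of non-edge e_prime.
J = {((1, 2), (1, 3)): 0.12, ((2, 3), (1, 3)): 0.09,
     ((2, 3), (2, 4)): 0.07, ((3, 4), (2, 4)): 0.11,
     ((1, 2), (1, 4)): 0.20, ((2, 3), (1, 4)): 0.15, ((3, 4), (1, 4)): 0.18}

def path_edges(T, u, v):
    """Edges of the true tree lying on the path between u and v."""
    p = nx.shortest_path(T, u, v)
    return [tuple(sorted(e)) for e in zip(p, p[1:])]

K_P = min(
    min(J[(e, (u, v))] for e in path_edges(T, u, v))
    for u in nodes for v in nodes
    if u < v and not T.has_edge(u, v)          # non-edges e' of the true tree
)
print(K_P)   # 0.07 here: the weakest crossover on the path of some non-edge dominates
```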

87 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

88 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

89 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) [Figure: P_e and P_{e′} close together] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

90 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) [Figure: P_e and P_{e′} close together] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

91 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) Euclidean Information Theory [Borade & Zheng 08]: for P ≈ Q, D(P ‖ Q) ≈ (1/2) Σ_a (P(a) − Q(a))² / P(a) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

92 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) Euclidean Information Theory [Borade & Zheng 08]: for P ≈ Q, D(P ‖ Q) ≈ (1/2) Σ_a (P(a) − Q(a))² / P(a) Def: Given P_e = P_{i,j}, the information density is S_e(X_i; X_j) := log [ P_{i,j}(X_i, X_j) / (P_i(X_i) P_j(X_j)) ], so that E[S_e] = I(P_e). Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

93 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52

94 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints [Figure: the exact problem min D(Q_{e,e′} ‖ P_{e,e′}) over {I(Q_e) = I(Q_{e′})}] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52

95 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints [Figure: the approximate problem min (1/2) ‖Q_{e,e′} − P_{e,e′}‖²_{P_{e,e′}} over the linearized constraint set Q(P_{e,e′})] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52

96 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints Theorem (Euclidean Approximation of Crossover Rate) J_{e,e′} = (I(P_{e′}) − I(P_e))² / (2 Var(S_{e′} − S_e)) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52

97 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints Theorem (Euclidean Approximation of Crossover Rate) J_{e,e′} = (I(P_{e′}) − I(P_e))² / (2 Var(S_{e′} − S_e)) = (E[S_{e′} − S_e])² / (2 Var(S_{e′} − S_e)) = (1/2) SNR Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52
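
The approximate rate is straightforward to evaluate numerically. Below is an illustrative sketch (not the thesis code; the joint pmf and the choice of e = (X1, X2), e′ = (X3, X4) are hypothetical) that computes (I(P_{e′}) − I(P_e))² / (2 Var(S_{e′} − S_e)) from a 2×2×2×2 joint pmf over the four variables.

```python
import numpy as np

def approx_crossover_rate(P4):
    """Euclidean approximation of the crossover rate for a 2x2x2x2 joint pmf
    over (X1, X2, X3, X4), with e = (1, 2) and e' = (3, 4)."""
    Pe  = P4.sum(axis=(2, 3))           # pairwise marginal P_e(x1, x2)
    Pep = P4.sum(axis=(0, 1))           # pairwise marginal P_e'(x3, x4)
    # Information densities S_e(x1, x2) and S_e'(x3, x4).
    Se  = np.log(Pe  / np.outer(Pe.sum(1),  Pe.sum(0)))
    Sep = np.log(Pep / np.outer(Pep.sum(1), Pep.sum(0)))
    diff = Sep[None, None, :, :] - Se[:, :, None, None]   # S_e' - S_e on X^4
    mean = np.sum(P4 * diff)            # E[S_e' - S_e] = I(P_e') - I(P_e)
    var = np.sum(P4 * (diff - mean) ** 2)
    return mean ** 2 / (2 * var)

# Example: e = (X1, X2) strongly dependent, e' = (X3, X4) weakly dependent,
# and the two pairs independent of each other (so P4 is a product of the two).
Pe  = np.array([[0.4, 0.1], [0.1, 0.4]])
Pep = np.array([[0.3, 0.2], [0.2, 0.3]])
P4 = Pe[:, :, None, None] * Pep[None, None, :, :]
print(approx_crossover_rate(P4))
```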

98 The Crossover Rate How good is the approximation? We consider a binary model [Figure: true rate and approximate rate J_{e,e′} plotted against the gap between I(P_e) and I(P_{e′})] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 28 / 52

99 Remarks for Learning Discrete Trees Characterized precisely the error exponent for structure learning: P^n(E_ML ≠ E_P) ≐ exp(−n K_P) VYFT, A. Anandkumar, L. Tong, A. S. Willsky, A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures, ISIT 2009, submitted to IEEE Trans. on Information Theory, revised in Oct Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 29 / 52

100 Remarks for Learning Discrete Trees Characterized precisely the error exponent for structure learning: P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Analysis tools include the method of types (large-deviations) and simple properties of trees VYFT, A. Anandkumar, L. Tong, A. S. Willsky, A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures, ISIT 2009, submitted to IEEE Trans. on Information Theory, revised in Oct Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 29 / 52

101 Remarks for Learning Discrete Trees Characterized precisely the error exponent for structure learning: P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Analysis tools include the method of types (large-deviations) and simple properties of trees Analyzed the very-noisy learning regime (Euclidean Information Theory) where learning is error-prone Extensions to learning the tree projection for non-trees have also been studied. VYFT, A. Anandkumar, L. Tong, A. S. Willsky, A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures, ISIT 2009, submitted to IEEE Trans. on Information Theory, revised in Oct Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 29 / 52

102 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 30 / 52

103 Setup Jointly Gaussian distribution in the very-noisy learning regime: p(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ), x ∈ R^d. Zero mean, unit variances Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 31 / 52

104 Setup Jointly Gaussian distribution in the very-noisy learning regime: p(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ), x ∈ R^d. Zero mean, unit variances Keep the correlation coefficients on the edges fixed; by Markovianity this specifies the Gaussian graphical model ρ_i is the correlation coefficient on edge e_i for i = 1, ..., d − 1 [Figure: chain with correlations ρ_1, ρ_2, ρ_3] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 31 / 52

105 Setup Jointly Gaussian distribution in the very-noisy learning regime: p(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ), x ∈ R^d. Zero mean, unit variances Keep the correlation coefficients on the edges fixed; by Markovianity this specifies the Gaussian graphical model ρ_i is the correlation coefficient on edge e_i for i = 1, ..., d − 1 [Figure: trees carrying edge correlations ρ_1, ρ_2, ρ_3] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 31 / 52

106 Setup Jointly Gaussian distribution in the very-noisy learning regime: p(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ), x ∈ R^d. Zero mean, unit variances Keep the correlation coefficients on the edges fixed; by Markovianity this specifies the Gaussian graphical model ρ_i is the correlation coefficient on edge e_i for i = 1, ..., d − 1 Compare the error exponents associated with different structures Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 31 / 52
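
The Markovianity statement above fixes the whole covariance: the correlation between any two nodes is the product of the edge correlations along the path joining them. A small illustrative sketch (not the thesis code; it assumes the networkx library and nodes labeled 0..d−1) that builds Σ for a given tree and edge correlations:

```python
import numpy as np
import networkx as nx

def tree_covariance(edges, rho):
    """Covariance of a zero-mean, unit-variance Gaussian tree model.
    edges: list of (i, j); rho: dict mapping each edge (i, j), i < j, to its correlation."""
    T = nx.Graph(edges)
    d = T.number_of_nodes()
    Sigma = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            path = nx.shortest_path(T, i, j)
            r = 1.0
            for e in zip(path, path[1:]):
                r *= rho[tuple(sorted(e))]     # product of correlations along the path
            Sigma[i, j] = Sigma[j, i] = r
    return Sigma

# The same edge correlations placed on a star (center 0) and on a chain.
star  = tree_covariance([(0, 1), (0, 2), (0, 3)],
                        {(0, 1): 0.4, (0, 2): 0.5, (0, 3): 0.6})
chain = tree_covariance([(0, 1), (1, 2), (2, 3)],
                        {(0, 1): 0.4, (1, 2): 0.5, (2, 3): 0.6})
print(star)
print(chain)
```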

107 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

108 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

109 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star with edge correlations ρ_1, ..., ρ_4; chain with permuted correlations ρ_π(1), ..., ρ_π(4), π a permutation] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

110 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star and permuted chain; sketch of P^n(err) versus n = # Samples] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

111 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star and permuted chain; sketch of P^n(err) versus n = # Samples — the chain has the fastest decay] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

112 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star and permuted chain; sketch of P^n(err) versus n = # Samples — the star has the slowest decay, the chain the fastest] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

113 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star and permuted chain; sketch of P^n(err) versus n = # Samples — star slowest, any other tree in between, chain fastest] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

114 Numerical Simulations Chain, Star and Hybrid for d = 10, with ρ_i = 0.1 i, i ∈ [1 : 9] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 33 / 52

115 Numerical Simulations Chain, Star and Hybrid for d = 10, with ρ_i = 0.1 i, i ∈ [1 : 9] Quantities plotted: P(error) and the empirical exponent −(1/n) log P(error) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 33 / 52

116 Numerical Simulations Chain, Star and Hybrid for d = 10, with ρ_i = 0.1 i, i ∈ [1 : 9] [Figure: simulated probability of error versus the number of samples n for Chain, Hybrid and Star] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 33 / 52

117 Numerical Simulations Chain, Star and Hybrid for d = 10, with ρ_i = 0.1 i, i ∈ [1 : 9] [Figure: simulated probability of error and simulated error exponent −(1/n) log P(error) versus the number of samples n for Chain, Hybrid and Star] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 33 / 52

118 Proof Idea and Intuition Correlation decay Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

119 Proof Idea and Intuition Correlation decay [Figure: star with a non-edge e′ ∉ E_P] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

120 Proof Idea and Intuition Correlation decay [Figure: star and chain, each with a non-edge e′ ∉ E_P] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

121 Proof Idea and Intuition Correlation decay [Figure: star and chain with non-edges e′ ∉ E_P; the star has O(d²) such pairs] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

122 Proof Idea and Intuition Correlation decay [Figure: star and chain with non-edges e′ ∉ E_P; the star has O(d²) such pairs, the chain O(d)] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

123 Proof Idea and Intuition Correlation decay Number of distance-two node pairs in: Star is O(d²), Markov chain is O(d) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52
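
The counting in the intuition above is easy to verify directly; a tiny illustrative sketch (using the networkx library, with d = 10 as an example):

```python
import networkx as nx

def num_distance_two_pairs(T):
    """Number of node pairs at graph distance exactly two in a tree T."""
    dist = dict(nx.all_pairs_shortest_path_length(T))
    nodes = list(T.nodes)
    return sum(1 for i in nodes for j in nodes if i < j and dist[i][j] == 2)

d = 10
star  = nx.star_graph(d - 1)      # center 0 with d-1 leaves
chain = nx.path_graph(d)
print(num_distance_two_pairs(star), num_distance_two_pairs(chain))
# Star: (d-1 choose 2) = 36 pairs, i.e. O(d^2); chain: d-2 = 8 pairs, i.e. O(d).
```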

124 Concluding Remarks for Learning Gaussian Trees Gaussianity allows us to perform further analysis to find the extremal structures for learning VYFT, A. Anandkumar, A. S. Willsky Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures, Allerton 2009, IEEE Trans. on Signal Processing, May Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 35 / 52

125 Concluding Remarks for Learning Gaussian Trees Gaussianity allows us to perform further analysis to find the extremal structures for learning Allows us to derive a data-processing inequality for crossover rates VYFT, A. Anandkumar, A. S. Willsky, Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures, Allerton 2009, IEEE Trans. on Signal Processing, May Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 35 / 52

126 Concluding Remarks for Learning Gaussian Trees Gaussianity allows us to perform further analysis to find the extremal structures for learning Allows us to derive a data-processing inequality for crossover rates Universal result — not (strongly) dependent on the choice of correlations ρ = {ρ_1, ..., ρ_{d−1}} VYFT, A. Anandkumar, A. S. Willsky, Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures, Allerton 2009, IEEE Trans. on Signal Processing, May Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 35 / 52

127 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 36 / 52

128 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

129 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Extend Liu et al.'s work to discrete models and improve convergence results Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

130 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Extend Liu et al.'s work to discrete models and improve convergence results Strategy: Remove weak edges [Figure: tree with a weak edge] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

131 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Extend Liu et al.'s work to discrete models and improve convergence results Strategy: Remove weak edges → Reduce the number of parameters Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

132 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Extend Liu et al.'s work to discrete models and improve convergence results Strategy: Remove weak edges → Reduce the number of parameters [Figure: resulting forest after removing the weak edge] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

133 Main Contributions Propose CLThres, a thresholding algorithm, for consistently learning forest-structured models Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 38 / 52

134 Main Contributions Propose CLThres, a thresholding algorithm, for consistently learning forest-structured models Prove convergence rates ("moderate deviations") for a fixed discrete graphical model P ∈ P(X^d) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 38 / 52

135 Main Contributions Propose CLThres, a thresholding algorithm, for consistently learning forest-structured models Prove convergence rates ("moderate deviations") for a fixed discrete graphical model P ∈ P(X^d) Prove achievable scaling laws on (n, d, k) (k is the number of edges) for consistent recovery in high dimensions. Roughly speaking, n ≳ log^{1+δ}(d − k) is achievable Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 38 / 52

136 Main Difficulty Unknown minimum mutual information I_min in the forest model Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 39 / 52

137 Main Difficulty Unknown minimum mutual information I_min in the forest model Markov order estimation [Merhav, Gutman, Ziv 1989] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 39 / 52

138 Main Difficulty Unknown minimum mutual information I_min in the forest model Markov order estimation [Merhav, Gutman, Ziv 1989] If known, can easily use a threshold, i.e., if I(P̂_{i,j}) < I_min, remove (i, j) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 39 / 52

139 Main Difficulty Unknown minimum mutual information I_min in the forest model Markov order estimation [Merhav, Gutman, Ziv 1989] If known, can easily use a threshold, i.e., if I(P̂_{i,j}) < I_min, remove (i, j) How to deal with the classic tradeoff between over- and underestimation errors? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 39 / 52

140 The CLThres Algorithm Compute the empirical mutual informations I(P̂_{i,j}) for all (i, j) ∈ V × V Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 40 / 52

141 The CLThres Algorithm Compute the empirical mutual informations I(P̂_{i,j}) for all (i, j) ∈ V × V Max-weight spanning tree: Ê_{d−1} := argmax_{E: Tree} Σ_{(i,j) ∈ E} I(P̂_{i,j}) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 40 / 52

142 The CLThres Algorithm Compute the empirical mutual informations I(P̂_{i,j}) for all (i, j) ∈ V × V Max-weight spanning tree: Ê_{d−1} := argmax_{E: Tree} Σ_{(i,j) ∈ E} I(P̂_{i,j}) Estimate the number of edges given the threshold ε_n: k̂_n := |{ (i, j) ∈ Ê_{d−1} : I(P̂_{i,j}) ≥ ε_n }| Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 40 / 52

143 The CLThres Algorithm Compute the empirical mutual informations I(P̂_{i,j}) for all (i, j) ∈ V × V Max-weight spanning tree: Ê_{d−1} := argmax_{E: Tree} Σ_{(i,j) ∈ E} I(P̂_{i,j}) Estimate the number of edges given the threshold ε_n: k̂_n := |{ (i, j) ∈ Ê_{d−1} : I(P̂_{i,j}) ≥ ε_n }| Output the forest with the top k̂_n edges Computational Complexity = O((n + log d) d²) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 40 / 52
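
A minimal sketch of the CLThres steps above (illustrative, not the thesis code; it assumes binary data, reuses the plug-in mutual information from the Chow-Liu sketch earlier, and uses networkx for the spanning tree): build the Chow-Liu tree, then keep only the edges whose empirical mutual information is at least ε_n.

```python
import numpy as np
import networkx as nx

def empirical_mi(xi, xj):
    """Plug-in mutual information for two binary sample vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(xi == a) * np.mean(xj == b)))
    return mi

def clthres(X, eps_n=None):
    """CLThres sketch: Chow-Liu tree followed by thresholding weak edges at eps_n."""
    n, d = X.shape
    if eps_n is None:
        eps_n = n ** -0.5                      # eps_n := n^{-1/2} satisfies the conditions
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            G.add_edge(i, j, weight=empirical_mi(X[:, i], X[:, j]))
    T = nx.maximum_spanning_tree(G, weight="weight")
    # Keep the top k_n edges, i.e. those with empirical MI at least eps_n.
    return sorted((i, j) for i, j, w in T.edges(data="weight") if w >= eps_n)

# Example: a forest with one edge (0, 1) and two isolated nodes 2, 3.
rng = np.random.default_rng(1)
n = 3000
x0 = rng.integers(0, 2, n)
x1 = np.where(rng.random(n) < 0.2, 1 - x0, x0)
X = np.column_stack([x0, x1, rng.integers(0, 2, n), rng.integers(0, 2, n)])
print(clthres(X))    # typically [(0, 1)]: the weak spanning-tree edges are pruned
```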

144 A Convergence Result for CLThres Assume that P ∈ P(X^d) is a fixed forest-structured graphical model; d does not grow with n Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 41 / 52

145 A Convergence Result for CLThres Assume that P ∈ P(X^d) is a fixed forest-structured graphical model; d does not grow with n Theorem ("Moderate Deviations") Assume that the sequence {ε_n}_{n=1}^∞ satisfies lim_{n→∞} ε_n = 0 and lim_{n→∞} n ε_n / log n = ∞ (ε_n := n^{−1/2} works) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 41 / 52

146 A Convergence Result for CLThres Assume that P ∈ P(X^d) is a fixed forest-structured graphical model; d does not grow with n Theorem ("Moderate Deviations") Assume that the sequence {ε_n}_{n=1}^∞ satisfies lim_{n→∞} ε_n = 0 and lim_{n→∞} n ε_n / log n = ∞ (ε_n := n^{−1/2} works) Then limsup_{n→∞} (1/(n ε_n)) log P(Ê_{k̂_n} ≠ E_P) ≤ −1 Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 41 / 52

147 A Convergence Result for CLThres Assume that P ∈ P(X^d) is a fixed forest-structured graphical model; d does not grow with n Theorem ("Moderate Deviations") Assume that the sequence {ε_n}_{n=1}^∞ satisfies lim_{n→∞} ε_n = 0 and lim_{n→∞} n ε_n / log n = ∞ (ε_n := n^{−1/2} works) Then limsup_{n→∞} (1/(n ε_n)) log P(Ê_{k̂_n} ≠ E_P) ≤ −1, i.e., P(Ê_{k̂_n} ≠ E_P) ≲ exp(−n ε_n) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 41 / 52


Latent Tree Approximation in Linear Model Latent Tree Approximation in Linear Model Navid Tafaghodi Khajavi Dept. of Electrical Engineering, University of Hawaii, Honolulu, HI 96822 Email: navidt@hawaii.edu ariv:1710.01838v1 [cs.it] 5 Oct 2017

More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department Decision-Making under Statistical Uncertainty Jayakrishnan Unnikrishnan PhD Defense ECE Department University of Illinois at Urbana-Champaign CSL 141 12 June 2010 Statistical Decision-Making Relevant in

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15

More information

A large-deviation analysis for the maximum likelihood learning of tree structures

A large-deviation analysis for the maximum likelihood learning of tree structures A large-deviation analysis for the maximum likelihood learning of tree structures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Lecture 22: Error exponents in hypothesis testing, GLRT

Lecture 22: Error exponents in hypothesis testing, GLRT 10-704: Information Processing and Learning Spring 2012 Lecture 22: Error exponents in hypothesis testing, GLRT Lecturer: Aarti Singh Scribe: Aarti Singh Disclaimer: These notes have not been subjected

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Review: Directed Models (Bayes Nets)

Review: Directed Models (Bayes Nets) X Review: Directed Models (Bayes Nets) Lecture 3: Undirected Graphical Models Sam Roweis January 2, 24 Semantics: x y z if z d-separates x and y d-separation: z d-separates x from y if along every undirected

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Random Field Models for Applications in Computer Vision

Random Field Models for Applications in Computer Vision Random Field Models for Applications in Computer Vision Nazre Batool Post-doctorate Fellow, Team AYIN, INRIA Sophia Antipolis Outline Graphical Models Generative vs. Discriminative Classifiers Markov Random

More information

3 Undirected Graphical Models

3 Undirected Graphical Models Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 3 Undirected Graphical Models In this lecture, we discuss undirected

More information

Inferning with High Girth Graphical Models

Inferning with High Girth Graphical Models Uri Heinemann The Hebrew University of Jerusalem, Jerusalem, Israel Amir Globerson The Hebrew University of Jerusalem, Jerusalem, Israel URIHEI@CS.HUJI.AC.IL GAMIR@CS.HUJI.AC.IL Abstract Unsupervised learning

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs Animashree Anandkumar UC Irvine a.anandkumar@uci.edu Ragupathyraj Valluvan UC Irvine rvalluva@uci.edu Abstract Graphical

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008 Readings: K&F: 16.3, 16.4, 17.3 Bayesian Param. Learning Bayesian Structure Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University October 6 th, 2008 10-708 Carlos Guestrin 2006-2008

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Structure Learning: the good, the bad, the ugly

Structure Learning: the good, the bad, the ugly Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 Structure Learning: the good, the bad, the ugly Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 29 th, 2006 1 Understanding the uniform

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Learning Tractable Graphical Models: Latent Trees and Tree Mixtures

Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Anima Anandkumar U.C. Irvine Joint work with Furong Huang, U.N. Niranjan, Daniel Hsu and Sham Kakade. High-Dimensional Graphical Modeling

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Scalable robust hypothesis tests using graphical models

Scalable robust hypothesis tests using graphical models Scalable robust hypothesis tests using graphical models Umamahesh Srinivas ipal Group Meeting October 22, 2010 Binary hypothesis testing problem Random vector x = (x 1,...,x n ) R n generated from either

More information

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3

More information

Gaussian Processes 1. Schedule

Gaussian Processes 1. Schedule 1 Schedule 17 Jan: Gaussian processes (Jo Eidsvik) 24 Jan: Hands-on project on Gaussian processes (Team effort, work in groups) 31 Jan: Latent Gaussian models and INLA (Jo Eidsvik) 7 Feb: Hands-on project

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Submodularity in Machine Learning

Submodularity in Machine Learning Saifuddin Syed MLRG Summer 2016 1 / 39 What are submodular functions Outline 1 What are submodular functions Motivation Submodularity and Concavity Examples 2 Properties of submodular functions Submodularity

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

High dimensional ising model selection using l 1 -regularized logistic regression

High dimensional ising model selection using l 1 -regularized logistic regression High dimensional ising model selection using l 1 -regularized logistic regression 1 Department of Statistics Pennsylvania State University 597 Presentation 2016 1/29 Outline Introduction 1 Introduction

More information

Supplemental for Spectral Algorithm For Latent Tree Graphical Models

Supplemental for Spectral Algorithm For Latent Tree Graphical Models Supplemental for Spectral Algorithm For Latent Tree Graphical Models Ankur P. Parikh, Le Song, Eric P. Xing The supplemental contains 3 main things. 1. The first is network plots of the latent variable

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models David Sontag New York University Lecture 4, February 16, 2012 David Sontag (NYU) Graphical Models Lecture 4, February 16, 2012 1 / 27 Undirected graphical models Reminder

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Discrete Markov Random Fields the Inference story. Pradeep Ravikumar

Discrete Markov Random Fields the Inference story. Pradeep Ravikumar Discrete Markov Random Fields the Inference story Pradeep Ravikumar Graphical Models, The History How to model stochastic processes of the world? I want to model the world, and I like graphs... 2 Mid to

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation Jason K. Johnson, Dmitry M. Malioutov and Alan S. Willsky Department of Electrical Engineering and Computer Science Massachusetts Institute

More information

1 Undirected Graphical Models. 2 Markov Random Fields (MRFs)

1 Undirected Graphical Models. 2 Markov Random Fields (MRFs) Machine Learning (ML, F16) Lecture#07 (Thursday Nov. 3rd) Lecturer: Byron Boots Undirected Graphical Models 1 Undirected Graphical Models In the previous lecture, we discussed directed graphical models.

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Machine learning - HT Maximum Likelihood

Machine learning - HT Maximum Likelihood Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce

More information

4 : Exact Inference: Variable Elimination

4 : Exact Inference: Variable Elimination 10-708: Probabilistic Graphical Models 10-708, Spring 2014 4 : Exact Inference: Variable Elimination Lecturer: Eric P. ing Scribes: Soumya Batra, Pradeep Dasigi, Manzil Zaheer 1 Probabilistic Inference

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Computational Genomics

Computational Genomics Computational Genomics http://www.cs.cmu.edu/~02710 Introduction to probability, statistics and algorithms (brief) intro to probability Basic notations Random variable - referring to an element / event

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

More information

Total positivity in Markov structures

Total positivity in Markov structures 1 based on joint work with Shaun Fallat, Kayvan Sadeghi, Caroline Uhler, Nanny Wermuth, and Piotr Zwiernik (arxiv:1510.01290) Faculty of Science Total positivity in Markov structures Steffen Lauritzen

More information

11 : Gaussian Graphic Models and Ising Models

11 : Gaussian Graphic Models and Ising Models 10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 18 Oct, 21, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models CPSC

More information