Large-Deviations and Applications for Learning Tree-Structured Graphical Models

1 Large-Deviations and Applications for Learning Tree-Structured Graphical Models Vincent Tan Stochastic Systems Group, Lab of Information and Decision Systems, Massachusetts Institute of Technology Thesis Defense (Nov 16, 2010) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 1 / 52

2 Acknowledgements The following is joint work with: Alan Willsky (MIT) Lang Tong (Cornell) Animashree Anandkumar (UC Irvine) John Fisher (MIT) Sujay Sanghavi (UT Austin) Matt Johnson (MIT) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 2 / 52

3 Outline 1 Motivation, Background and Main Contributions Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

4 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

5 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

6 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

7 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 3 / 52

8 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 4 / 52

9 Motivation: A Real-Life Example Manchester Asthma and Allergy Study (MAAS) More than n ≈ 1000 children Number of variables d ≈ 10^6 Environmental, Physiological and Genetic (SNP) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 5 / 52

10 Motivation: A Real-Life Example Manchester Asthma and Allergy Study (MAAS) More than n ≈ 1000 children Number of variables d ≈ 10^6 Environmental, Physiological and Genetic (SNP) [Figure: Manchester Asthma and Allergy Study logo] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 5 / 52

11 Motivation: Modeling Large Datasets I How do we model such data to make useful inferences? Simpson*, VYFT* et al. Beyond Atopy: Multiple Patterns of Sensitization in Relation to Asthma in a Birth Cohort Study, Am. J. Respir. Crit. Care Med. Feb Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 6 / 52

12 Motivation: Modeling Large Datasets I How do we model such data to make useful inferences? Model the relationships between variables by a sparse graph Reduce the number of interdependencies between the variables Simpson*, VYFT* et al. Beyond Atopy: Multiple Patterns of Sensitization in Relation to Asthma in a Birth Cohort Study, Am. J. Respir. Crit. Care Med. Feb Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 6 / 52

13 Motivation: Modeling Large Datasets I How do we model such data to make useful inferences? Model the relationships between variables by a sparse graph Reduce the number of interdependencies between the variables Prematurity Smoking Lung Function Airway Inflammation Acquired Immune Response Obesity Airway Obstruction Bronchial Hyperresponsiveness Immune Response to Virus Viral Infection Simpson*, VYFT* et al. Beyond Atopy: Multiple Patterns of Sensitization in Relation to Asthma in a Birth Cohort Study, Am. J. Respir. Crit. Care Med. Feb Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 6 / 52

14 Motivation: Modeling Large Datasets II Reduce the dimensionality of the covariates (features) for predicting a variable of interest (e.g., asthma) Information-theoretic limits? VYFT, Johnson and Willsky, Necessary and Sufficient Conditions for Salient Subset Recovery, Intl. Symp. on Info. Theory, Jul VYFT, Sanghavi, Fisher and Willsky, Learning Graphical Models for Hypothesis Testing and Classification, IEEE Trans. on Signal Processing, Nov Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 7 / 52

15 Motivation: Modeling Large Datasets II Reduce the dimensionality of the covariates (features) for predicting a variable of interest (e.g., asthma) Information-theoretic limits? Learning graphical models tailored specifically for hypothesis testing Can we learn better models in the finite-sample setting? VYFT, Johnson and Willsky, Necessary and Sufficient Conditions for Salient Subset Recovery, Intl. Symp. on Info. Theory, Jul VYFT, Sanghavi, Fisher and Willsky, Learning Graphical Models for Hypothesis Testing and Classification, IEEE Trans. on Signal Processing, Nov Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 7 / 52

16 Graphical Models: Introduction Graph structure G = (V, E) represents a multivariate distribution of a random vector X = (X 1,..., X d ) indexed by V = {1,..., d} Node i V corresponds to random variable X i Edge set E corresponds to conditional independencies Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 8 / 52

17 Graphical Models: Introduction Graph structure G = (V, E) represents a multivariate distribution of a random vector X = (X_1, ..., X_d) indexed by V = {1, ..., d} Node i ∈ V corresponds to random variable X_i Edge set E corresponds to conditional independencies: X_i ⊥ X_{V \ {nbd(i) ∪ i}} | X_{nbd(i)}, and more generally X_A ⊥ X_B | X_S whenever S separates A and B Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 8 / 52

18 From Conditional Independence to Gibbs Distribution Hammersley-Clifford Theorem (1971) Let P be the joint pmf of a graphical model Markov on G = (V, E): P(x) = (1/Z) exp[ Σ_{c ∈ C} Ψ_c(x_c) ] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 9 / 52

19 From Conditional Independence to Gibbs Distribution Hammersley-Clifford Theorem (1971) Let P be the joint pmf of a graphical model Markov on G = (V, E): P(x) = (1/Z) exp[ Σ_{c ∈ C} Ψ_c(x_c) ] where C is the set of maximal cliques. Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 9 / 52

20 From Conditional Independence to Gibbs Distribution Hammersley-Clifford Theorem (1971) Let P be the joint pmf of a graphical model Markov on G = (V, E): P(x) = (1/Z) exp[ Σ_{c ∈ C} Ψ_c(x_c) ] where C is the set of maximal cliques. Gaussian Graphical Models: the sparsity pattern of the inverse covariance matrix coincides with the dependency graph [Figure: dependency graph and inverse covariance matrix] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 9 / 52
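
To make the Gaussian remark concrete, here is a minimal sketch (illustrative, not from the thesis; the edge correlations are arbitrary) that builds the covariance of a zero-mean, unit-variance Gauss-Markov chain and checks that its inverse is tridiagonal, i.e., that the sparsity pattern of the precision matrix matches the dependency graph.

```python
import numpy as np

# Zero-mean, unit-variance Gauss-Markov chain X1 - X2 - X3 - X4.
# By Markovianity, corr(X_i, X_j) is the product of the edge correlations
# on the path between i and j, which fixes the whole covariance matrix.
rho = [0.6, 0.5, 0.4]          # illustrative edge correlations (not from the slides)
d = 4
Sigma = np.eye(d)
for i in range(d):
    for j in range(i + 1, d):
        Sigma[i, j] = Sigma[j, i] = np.prod(rho[i:j])

K = np.linalg.inv(Sigma)       # precision (inverse covariance) matrix
print(np.round(K, 3))
# All entries K[i, j] with |i - j| >= 2 are numerically zero: the sparsity
# pattern of the inverse covariance matrix is exactly the chain's edge set.
```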

21 Tree-Structured Graphical Models P(x) = Π_{i ∈ V} P_i(x_i) Π_{(i,j) ∈ E} P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) For the star with center X_1 and leaves X_2, X_3, X_4: P(x) = P_1(x_1) · P_{1,2}(x_1, x_2)/P_1(x_1) · P_{1,3}(x_1, x_3)/P_1(x_1) · P_{1,4}(x_1, x_4)/P_1(x_1) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 10 / 52

22 Tree-Structured Graphical Models P(x) = Π_{i ∈ V} P_i(x_i) Π_{(i,j) ∈ E} P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) Tree-structured Graphical Models: Tractable Learning and Inference Maximum-Likelihood learning of tree structure is tractable Chow-Liu Algorithm (1968) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 10 / 52

23 Tree-Structured Graphical Models P(x) = Π_{i ∈ V} P_i(x_i) Π_{(i,j) ∈ E} P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) Tree-structured Graphical Models: Tractable Learning and Inference Maximum-Likelihood learning of tree structure is tractable Chow-Liu Algorithm (1968) Inference on Trees is tractable Sum-Product Algorithm Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 10 / 52

24 Tree-Structured Graphical Models P(x) = Π_{i ∈ V} P_i(x_i) Π_{(i,j) ∈ E} P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) Tree-structured Graphical Models: Tractable Learning and Inference Maximum-Likelihood learning of tree structure is tractable Chow-Liu Algorithm (1968) Inference on Trees is tractable Sum-Product Algorithm Which other classes of graphical models are tractable for learning? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 10 / 52
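
As a quick sanity check on the tree factorization above, the following sketch (illustrative; the probabilities are arbitrary, and binary alphabets are assumed) evaluates a star-structured joint pmf over four variables both directly and through the node/pairwise-marginal factorization.

```python
import numpy as np
from itertools import product

# Star with center X1 and leaves X2, X3, X4 over binary alphabets.
# Build the joint from P1 and illustrative conditionals P(xj | x1), then verify
# that the factorization in terms of node and pairwise marginals reproduces it.
P1 = np.array([0.3, 0.7])
cond = {j: np.array([[0.8, 0.2],     # P(xj = . | x1 = 0)
                     [0.4, 0.6]])    # P(xj = . | x1 = 1)
        for j in (2, 3, 4)}

def joint(x):
    x1, x2, x3, x4 = x
    return P1[x1] * cond[2][x1, x2] * cond[3][x1, x3] * cond[4][x1, x4]

# Pairwise marginals P_{1,j}(x1, xj) = P1(x1) * P(xj | x1).
pair = {j: P1[:, None] * cond[j] for j in (2, 3, 4)}

for x in product((0, 1), repeat=4):
    x1 = x[0]
    factored = P1[x1]                                # prod_i P_i * prod_e P_e / (P_i P_j)
    for j in (2, 3, 4):
        factored *= pair[j][x1, x[j - 1]] / P1[x1]   # collapses to P_{1,j} / P_1 here
    assert np.isclose(factored, joint(x))
print("tree factorization reproduces the joint pmf")
```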

25 Main Contributions in Thesis: I Error Exponent Analysis of Tree Structure Learning (Ch. 3 and 4) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 11 / 52

26 Main Contributions in Thesis: I Error Exponent Analysis of Tree Structure Learning (Ch. 3 and 4): extremal tree structures (star, chain) High-Dimensional Structure Learning for Forest Models (Ch. 5) [Figure: structure learning in graphical models — trees, forests, latent trees] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 11 / 52

27 Main Contributions in Thesis: II Learning Graphical Models for Hypothesis Testing (Ch. 6) Devised algorithms for learning trees for hypothesis testing [Figure: generatively-learned (Chow-Liu) vs. discriminatively-learned tree distributions as projections of the empirical distributions onto the set of tree distributions T_p] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 12 / 52

28 Main Contributions in Thesis: II Learning Graphical Models for Hypothesis Testing (Ch. 6) Devised algorithms for learning trees for hypothesis testing Information-Theoretic Limits for Salient Subset Recovery (Ch. 7) Devised necessary and sufficient conditions for estimating the salient set of features Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 12 / 52

29 Main Contributions in Thesis: II Learning Graphical Models for Hypothesis Testing (Ch. 6) Devised algorithms for learning trees for hypothesis testing Information-Theoretic Limits for Salient Subset Recovery (Ch. 7) Devised necessary and sufficient conditions for estimating the salient set of features We will focus on Chapters 3-5 here. See thesis for Chapters 6 and 7. Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 12 / 52

30 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 13 / 52

31 Motivation ML learning of tree structure given i.i.d. X^d-valued samples Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

32 Motivation ML learning of tree structure given i.i.d. X^d-valued samples [Figure: error probability P^n(err) versus the number of samples n] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

33 Motivation ML learning of tree structure given i.i.d. X^d-valued samples P^n(err) ≐ exp(−n · Rate) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

34 Motivation ML learning of tree structure given i.i.d. X^d-valued samples P^n(err) ≐ exp(−n · Rate) When does the error probability decay exponentially? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

35 Motivation ML learning of tree structure given i.i.d. X^d-valued samples P^n(err) ≐ exp(−n · Rate) When does the error probability decay exponentially? What is the exact rate of decay of the probability of error? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

36 Motivation ML learning of tree structure given i.i.d. X^d-valued samples P^n(err) ≐ exp(−n · Rate) When does the error probability decay exponentially? What is the exact rate of decay of the probability of error? How does the error exponent depend on the parameters and structure of the true distribution? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 14 / 52

37 Main Contributions Discrete case: Provide the exact rate of decay for a given P Rate of decay ≈ SNR for learning Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 15 / 52

38 Main Contributions Discrete case: Provide the exact rate of decay for a given P Rate of decay ≈ SNR for learning Gaussian case: Extremal structures: Star (worst) and chain (best) for learning Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 15 / 52

39 Related Work in Structure Learning ML for trees: Max-weight spanning tree with mutual information edge weights (Chow & Liu 1968) Causal dependence trees: directed mutual information (Quinn, Coleman & Kiyavash 2010) Convex relaxation methods: l 1 regularization Gaussian graphical models (Meinshausen and Buehlmann 2006) Logistic regression for Ising models (Ravikumar et al. 2010) Learning thin junction trees through conditional mutual information tests (Chechetka et al. 2007) Conditional independence tests for bounded degree graphs (Bresler et al. 2008) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 16 / 52

40 Related Work in Structure Learning ML for trees: Max-weight spanning tree with mutual information edge weights (Chow & Liu 1968) Causal dependence trees: directed mutual information (Quinn, Coleman & Kiyavash 2010) Convex relaxation methods: l 1 regularization Gaussian graphical models (Meinshausen and Buehlmann 2006) Logistic regression for Ising models (Ravikumar et al. 2010) Learning thin junction trees through conditional mutual information tests (Chechetka et al. 2007) Conditional independence tests for bounded degree graphs (Bresler et al. 2008) We obtain and analyze error exponents for the ML learning of trees (and extensions to forests) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 16 / 52

41 ML Learning of Trees (Chow-Liu) I Samples x^n = {x_1, ..., x_n} drawn i.i.d. from P ∈ P(X^d), X is finite Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 17 / 52

42 ML Learning of Trees (Chow-Liu) I Samples x^n = {x_1, ..., x_n} drawn i.i.d. from P ∈ P(X^d), X is finite Solve the ML problem given the data x^n: P_ML ∈ argmax_{Q ∈ Trees} (1/n) Σ_{k=1}^n log Q(x_k) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 17 / 52

43 ML Learning of Trees (Chow-Liu) I Samples x^n = {x_1, ..., x_n} drawn i.i.d. from P ∈ P(X^d), X is finite Solve the ML problem given the data x^n: P_ML ∈ argmax_{Q ∈ Trees} (1/n) Σ_{k=1}^n log Q(x_k) Denote P̂(a) = P̂(a; x^n) as the empirical distribution of x^n Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 17 / 52

44 ML Learning of Trees (Chow-Liu) I Samples x^n = {x_1, ..., x_n} drawn i.i.d. from P ∈ P(X^d), X is finite Solve the ML problem given the data x^n: P_ML ∈ argmax_{Q ∈ Trees} (1/n) Σ_{k=1}^n log Q(x_k) Denote P̂(a) = P̂(a; x^n) as the empirical distribution of x^n Reduces to a max-weight spanning tree problem (Chow-Liu 1968): E_ML = argmax_{E_Q: Q ∈ Trees} Σ_{e ∈ E_Q} I(P̂_e) P̂_e is the marginal of the empirical distribution on e = (i, j) I(P̂_e) is the mutual information of the empirical P̂_e Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 17 / 52

45 ML Learning of Trees (Chow-Liu) II [Figure: four-node example on X_1, X_2, X_3, X_4] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 18 / 52

46 ML Learning of Trees (Chow-Liu) II [Figure] True MI {I(P_e)} → Max-weight spanning tree E_P Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 18 / 52

47 ML Learning of Trees (Chow-Liu) II [Figure] True MI {I(P_e)} → Max-weight spanning tree E_P Empirical MI {I(P̂_e)} computed from x^n Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 18 / 52

48 ML Learning of Trees (Chow-Liu) II [Figure] True MI {I(P_e)} → Max-weight spanning tree E_P Empirical MI {I(P̂_e)} from x^n → Max-weight spanning tree E_ML, which may differ from E_P Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 18 / 52
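
The slides above outline the Chow-Liu procedure. Below is a minimal sketch (illustrative, not the thesis code; it assumes binary samples and uses the external networkx library for the spanning tree) that estimates pairwise mutual informations from data and returns the max-weight spanning tree E_ML.

```python
import numpy as np
import networkx as nx

def empirical_mi(xi, xj):
    """Plug-in mutual information I(P_hat_{i,j}) for two binary sample vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(X):
    """X: (n, d) array of samples; returns the ML tree edge set E_ML."""
    n, d = X.shape
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            G.add_edge(i, j, weight=empirical_mi(X[:, i], X[:, j]))
    T = nx.maximum_spanning_tree(G, weight="weight")   # max-weight spanning tree
    return sorted(T.edges())

# Example: samples from a chain 0 - 1 - 2 - 3 (each bit is a noisy copy of the previous).
rng = np.random.default_rng(0)
n, d = 2000, 4
X = np.zeros((n, d), dtype=int)
X[:, 0] = rng.integers(0, 2, n)
for i in range(1, d):
    flip = rng.random(n) < 0.2
    X[:, i] = np.where(flip, 1 - X[:, i - 1], X[:, i - 1])
print(chow_liu_edges(X))   # typically recovers [(0, 1), (1, 2), (2, 3)]
```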

49 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

50 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} [Figure: true tree versus an incorrectly learned tree] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

51 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Find the error exponent K_P: K_P := lim_{n→∞} −(1/n) log P^n(E_ML ≠ E_P) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

52 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Find the error exponent K_P: K_P := lim_{n→∞} −(1/n) log P^n(E_ML ≠ E_P) P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

53 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Find the error exponent K_P: K_P := lim_{n→∞} −(1/n) log P^n(E_ML ≠ E_P) P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Naïvely, what could we do to compute K_P? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

54 Problem Statement Define P_ML to be the ML tree-structured distribution with edge set E_ML, and the error event is {E_ML ≠ E_P} Find the error exponent K_P: K_P := lim_{n→∞} −(1/n) log P^n(E_ML ≠ E_P) P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Naïvely, what could we do to compute K_P? I-projections onto all trees? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 19 / 52

55 The Crossover Rate I [Figure: true MIs I(P_e) and empirical MIs I(P̂_e)] Correct Structure Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 20 / 52

56 The Crossover Rate I [Figure: true MIs I(P_e) and empirical MIs I(P̂_e)] Correct Structure Incorrect Structure! Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 20 / 52

57 The Crossover Rate I [Figure: true MIs I(P_e) and empirical MIs I(P̂_e)] Correct Structure Incorrect Structure! Structure Unaffected Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 20 / 52

58 The Crossover Rate I [Figure: two node pairs e and e′] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 21 / 52

59 The Crossover Rate I Given two node pairs e, e′ ∈ (V choose 2) with joint distribution P_{e,e′} ∈ P(X^4), s.t. I(P_e) > I(P_{e′}). Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 21 / 52

60 The Crossover Rate I Given two node pairs e, e′ ∈ (V choose 2) with joint distribution P_{e,e′} ∈ P(X^4), s.t. I(P_e) > I(P_{e′}). Consider the crossover event of the empirical MI {I(P̂_{e′}) ≥ I(P̂_e)} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 21 / 52

61 The Crossover Rate I Given two node pairs e, e′ ∈ (V choose 2) with joint distribution P_{e,e′} ∈ P(X^4), s.t. I(P_e) > I(P_{e′}). Consider the crossover event of the empirical MI {I(P̂_{e′}) ≥ I(P̂_e)} Def: Crossover Rate J_{e,e′} := lim_{n→∞} −(1/n) log P^n( I(P̂_{e′}) ≥ I(P̂_e) ) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 21 / 52

62 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

63 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

64 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } [Figure: the probability simplex P(X^4)] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

65 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } [Figure: the simplex P(X^4) with P_{e,e′} marked] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

66 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } [Figure: the simplex P(X^4) with P_{e,e′} and the constraint set {I(Q_e) = I(Q_{e′})}] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

67 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } [Figure: the minimizer Q*_{e,e′} on {I(Q_e) = I(Q_{e′})} achieving D(Q*_{e,e′} ‖ P_{e,e′})] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52

68 The Crossover Rate II I(P_e) > I(P_{e′}), crossover event {I(P̂_{e′}) ≥ I(P̂_e)} Proposition The crossover rate for empirical mutual informations is J_{e,e′} = min_{Q ∈ P(X^4)} { D(Q ‖ P_{e,e′}) : I(Q_{e′}) = I(Q_e) } I-projection (Csiszár) Sanov's Theorem Exact but not intuitive Non-Convex Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 22 / 52
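
The exact rate requires the non-convex I-projection above, but the exponential decay of the crossover probability itself is easy to see by simulation. A rough Monte Carlo sketch (illustrative, not from the thesis; the two node pairs are taken to be independent binary pairs with different dependence strengths) is given below.

```python
import numpy as np

rng = np.random.default_rng(0)

def emp_mi(xi, xj):
    """Plug-in mutual information for two binary sample vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p = np.mean((xi == a) & (xj == b))
            if p > 0:
                mi += p * np.log(p / (np.mean(xi == a) * np.mean(xj == b)))
    return mi

def sample_pair(n, flip):
    """Binary pair where the second bit is a noisy copy of the first."""
    x = rng.integers(0, 2, n)
    y = np.where(rng.random(n) < flip, 1 - x, x)
    return x, y

# e has higher true MI (flip prob 0.25) than e' (flip prob 0.35).
for n in (50, 100, 200, 400):
    trials = 2000
    crossovers = 0
    for _ in range(trials):
        xe, ye = sample_pair(n, 0.25)
        xep, yep = sample_pair(n, 0.35)
        if emp_mi(xep, yep) >= emp_mi(xe, ye):     # crossover event
            crossovers += 1
    p = crossovers / trials
    # Last column: rough estimate of the decay exponent -(1/n) log P(crossover).
    print(n, p, -np.log(max(p, 1 / trials)) / n)
```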

69 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

70 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Easy only in some very special cases Star graph with I(Q_a) > I(Q_b) > 0 There is a unique crossover rate The unique crossover rate is the error exponent [Figure: star graph with edge distribution Q_a] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

71 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Easy only in some very special cases Star graph with I(Q_a) > I(Q_b) > 0 There is a unique crossover rate The unique crossover rate is the error exponent [Figure: star graph with edge distributions Q_a and Q_b] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

72 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Easy only in some very special cases Star graph with I(Q_a) > I(Q_b) > 0 There is a unique crossover rate The unique crossover rate is the error exponent [Figure: star graph with edge distributions Q_a and Q_b] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

73 Error Exponent for Structure Learning I How to calculate the error exponent K_P from the crossover rates J_{e,e′}? Easy only in some very special cases Star graph with I(Q_a) > I(Q_b) > 0 There is a unique crossover rate The unique crossover rate is the error exponent K_P = min_{R ∈ P(X^4)} { D(R ‖ Q_{a,b}) : I(R_{e′}) = I(R_e) } Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 23 / 52

74 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

75 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P and a competing tree T] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

76 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P and a competing tree T, with a non-edge e′ ∉ E_P] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

77 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P, a non-edge e′ ∉ E_P, and its path Path(e′; E_P) in the true tree] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

78 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P, a non-edge e′ ∉ E_P, and its path Path(e′; E_P); the dominant crossover occurs along this path] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

79 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander [Figure: the true tree T_P, a non-edge e′ ∉ E_P, and its path Path(e′; E_P); the dominant crossover occurs along this path] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

80 Error Exponent for Structure Learning II A large deviation is done in the least unlikely of all unlikely ways. Large deviations by F. den Hollander Theorem (Error Exponent) K_P = min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 24 / 52

81 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

82 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

83 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Proposition The following statements are equivalent: (a) The error probability decays exponentially, i.e., K_P > 0 (b) T_P is a connected tree, i.e., not a proper forest Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

84 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Proposition The following statements are equivalent: (a) The error probability decays exponentially, i.e., K_P > 0 (b) T_P is a connected tree, i.e., not a proper forest [Figure: P^n(err) versus n = # Samples] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

85 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Proposition The following statements are equivalent: (a) The error probability decays exponentially, i.e., K_P > 0 (b) T_P is a connected tree, i.e., not a proper forest [Figure: P^n(err) versus n = # Samples; the K_P > 0 curve decays exponentially] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52

86 Error Exponent for Structure Learning III P^n(E_ML ≠ E_P) ≐ exp[ −n min_{e′ ∉ E_P} min_{e ∈ Path(e′; E_P)} J_{e,e′} ] We have a finite-sample result too! See thesis Proposition The following statements are equivalent: (a) The error probability decays exponentially, i.e., K_P > 0 (b) T_P is a connected tree, i.e., not a proper forest [Figure: P^n(err) versus n = # Samples; the K_P > 0 curve decays exponentially, the K_P = 0 curve does not] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 25 / 52
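
To illustrate the theorem, here is a small sketch (illustrative; the tree and the crossover-rate values J are hypothetical numbers, not computed from any distribution) that evaluates K_P = min over non-edges e′ of the minimum crossover rate along Path(e′; E_P), for a 4-node chain.

```python
import networkx as nx

# True tree: chain 1 - 2 - 3 - 4.
E_P = [(1, 2), (2, 3), (3, 4)]
T = nx.Graph(E_P)
nodes = list(T.nodes)

# Hypothetical crossover rates J[(e, e_prime)] for edge e on the path of non-edge e_prime.
J = {((1, 2), (1, 3)): 0.12, ((2, 3), (1, 3)): 0.09,
     ((2, 3), (2, 4)): 0.07, ((3, 4), (2, 4)): 0.11,
     ((1, 2), (1, 4)): 0.20, ((2, 3), (1, 4)): 0.15, ((3, 4), (1, 4)): 0.18}

def path_edges(T, u, v):
    """Edges of the true tree lying on the path between u and v."""
    p = nx.shortest_path(T, u, v)
    return [tuple(sorted(e)) for e in zip(p, p[1:])]

K_P = min(
    min(J[(e, (u, v))] for e in path_edges(T, u, v))
    for u in nodes for v in nodes
    if u < v and not T.has_edge(u, v)          # non-edges e' of the true tree
)
print(K_P)   # 0.07 here: the weakest crossover on the path of some non-edge dominates
```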

87 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′} Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

88 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

89 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) [Figure: P_e and P_{e′} close together] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

90 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) [Figure: P_e and P_{e′} close together] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

91 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) Euclidean Information Theory [Borade & Zheng 08]: for P ≈ Q, D(P ‖ Q) ≈ (1/2) Σ_a (P(a) − Q(a))² / P(a) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

92 Approximating The Crossover Rate I Def: Very-noisy learning condition on P_{e,e′}: P_e ≈ P_{e′}, so that I(P_{e′}) ≈ I(P_e) Euclidean Information Theory [Borade & Zheng 08]: for P ≈ Q, D(P ‖ Q) ≈ (1/2) Σ_a (P(a) − Q(a))² / P(a) Def: Given P_e = P_{i,j}, the information density is S_e(X_i; X_j) := log [ P_{i,j}(X_i, X_j) / (P_i(X_i) P_j(X_j)) ], so that E[S_e] = I(P_e). Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 26 / 52

93 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52

94 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints [Figure: the exact problem min D(Q_{e,e′} ‖ P_{e,e′}) over {I(Q_e) = I(Q_{e′})}] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52

95 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints [Figure: the approximate problem min (1/2) ‖Q_{e,e′} − P_{e,e′}‖²_{P_{e,e′}} over the linearized constraint set Q(P_{e,e′})] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52

96 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints Theorem (Euclidean Approximation of Crossover Rate) J_{e,e′} = (I(P_{e′}) − I(P_e))² / (2 Var(S_{e′} − S_e)) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52

97 Approximating The Crossover Rate II Convexifying the optimization problem by linearizing the constraints Theorem (Euclidean Approximation of Crossover Rate) J_{e,e′} = (I(P_{e′}) − I(P_e))² / (2 Var(S_{e′} − S_e)) = (E[S_{e′} − S_e])² / (2 Var(S_{e′} − S_e)) = (1/2) SNR Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 27 / 52
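
The approximate rate is straightforward to evaluate numerically. Below is an illustrative sketch (not the thesis code; the joint pmf and the choice of e = (X1, X2), e′ = (X3, X4) are hypothetical) that computes (I(P_{e′}) − I(P_e))² / (2 Var(S_{e′} − S_e)) from a 2×2×2×2 joint pmf over the four variables.

```python
import numpy as np

def approx_crossover_rate(P4):
    """Euclidean approximation of the crossover rate for a 2x2x2x2 joint pmf
    over (X1, X2, X3, X4), with e = (1, 2) and e' = (3, 4)."""
    Pe  = P4.sum(axis=(2, 3))           # pairwise marginal P_e(x1, x2)
    Pep = P4.sum(axis=(0, 1))           # pairwise marginal P_e'(x3, x4)
    # Information densities S_e(x1, x2) and S_e'(x3, x4).
    Se  = np.log(Pe  / np.outer(Pe.sum(1),  Pe.sum(0)))
    Sep = np.log(Pep / np.outer(Pep.sum(1), Pep.sum(0)))
    diff = Sep[None, None, :, :] - Se[:, :, None, None]   # S_e' - S_e on X^4
    mean = np.sum(P4 * diff)            # E[S_e' - S_e] = I(P_e') - I(P_e)
    var = np.sum(P4 * (diff - mean) ** 2)
    return mean ** 2 / (2 * var)

# Example: e = (X1, X2) strongly dependent, e' = (X3, X4) weakly dependent,
# and the two pairs independent of each other (so P4 is a product of the two).
Pe  = np.array([[0.4, 0.1], [0.1, 0.4]])
Pep = np.array([[0.3, 0.2], [0.2, 0.3]])
P4 = Pe[:, :, None, None] * Pep[None, None, :, :]
print(approx_crossover_rate(P4))
```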

98 The Crossover Rate How good is the approximation? We consider a binary model [Figure: true rate and approximate rate J_{e,e′} plotted against the gap between I(P_e) and I(P_{e′})] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 28 / 52

99 Remarks for Learning Discrete Trees Characterized precisely the error exponent for structure learning: P^n(E_ML ≠ E_P) ≐ exp(−n K_P) VYFT, A. Anandkumar, L. Tong, A. S. Willsky, A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures, ISIT 2009, submitted to IEEE Trans. on Information Theory, revised in Oct Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 29 / 52

100 Remarks for Learning Discrete Trees Characterized precisely the error exponent for structure learning: P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Analysis tools include the method of types (large-deviations) and simple properties of trees VYFT, A. Anandkumar, L. Tong, A. S. Willsky, A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures, ISIT 2009, submitted to IEEE Trans. on Information Theory, revised in Oct Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 29 / 52

101 Remarks for Learning Discrete Trees Characterized precisely the error exponent for structure learning: P^n(E_ML ≠ E_P) ≐ exp(−n K_P) Analysis tools include the method of types (large-deviations) and simple properties of trees Analyzed the very-noisy learning regime (Euclidean Information Theory) where learning is error-prone Extensions to learning the tree projection for non-trees have also been studied. VYFT, A. Anandkumar, L. Tong, A. S. Willsky, A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures, ISIT 2009, submitted to IEEE Trans. on Information Theory, revised in Oct Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 29 / 52

102 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 30 / 52

103 Setup Jointly Gaussian distribution in the very-noisy learning regime: p(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ), x ∈ R^d. Zero mean, unit variances Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 31 / 52

104 Setup Jointly Gaussian distribution in the very-noisy learning regime: p(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ), x ∈ R^d. Zero mean, unit variances Keep the correlation coefficients on the edges fixed; by Markovianity this specifies the Gaussian graphical model ρ_i is the correlation coefficient on edge e_i for i = 1, ..., d − 1 [Figure: chain with correlations ρ_1, ρ_2, ρ_3] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 31 / 52

105 Setup Jointly Gaussian distribution in the very-noisy learning regime: p(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ), x ∈ R^d. Zero mean, unit variances Keep the correlation coefficients on the edges fixed; by Markovianity this specifies the Gaussian graphical model ρ_i is the correlation coefficient on edge e_i for i = 1, ..., d − 1 [Figure: trees carrying edge correlations ρ_1, ρ_2, ρ_3] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 31 / 52

106 Setup Jointly Gaussian distribution in the very-noisy learning regime: p(x) ∝ exp( −(1/2) xᵀ Σ⁻¹ x ), x ∈ R^d. Zero mean, unit variances Keep the correlation coefficients on the edges fixed; by Markovianity this specifies the Gaussian graphical model ρ_i is the correlation coefficient on edge e_i for i = 1, ..., d − 1 Compare the error exponents associated with different structures Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 31 / 52
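
The Markovianity statement above fixes the whole covariance: the correlation between any two nodes is the product of the edge correlations along the path joining them. A small illustrative sketch (not the thesis code; it assumes the networkx library and nodes labeled 0..d−1) that builds Σ for a given tree and edge correlations:

```python
import numpy as np
import networkx as nx

def tree_covariance(edges, rho):
    """Covariance of a zero-mean, unit-variance Gaussian tree model.
    edges: list of (i, j); rho: dict mapping each edge (i, j), i < j, to its correlation."""
    T = nx.Graph(edges)
    d = T.number_of_nodes()
    Sigma = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            path = nx.shortest_path(T, i, j)
            r = 1.0
            for e in zip(path, path[1:]):
                r *= rho[tuple(sorted(e))]     # product of correlations along the path
            Sigma[i, j] = Sigma[j, i] = r
    return Sigma

# The same edge correlations placed on a star (center 0) and on a chain.
star  = tree_covariance([(0, 1), (0, 2), (0, 3)],
                        {(0, 1): 0.4, (0, 2): 0.5, (0, 3): 0.6})
chain = tree_covariance([(0, 1), (1, 2), (2, 3)],
                        {(0, 1): 0.4, (1, 2): 0.5, (2, 3): 0.6})
print(star)
print(chain)
```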

107 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

108 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

109 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star with edge correlations ρ_1, ..., ρ_4; chain with permuted correlations ρ_π(1), ..., ρ_π(4), π a permutation] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

110 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star and permuted chain; sketch of P^n(err) versus n = # Samples] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

111 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star and permuted chain; sketch of P^n(err) versus n = # Samples — the chain has the fastest decay] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

112 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star and permuted chain; sketch of P^n(err) versus n = # Samples — the star has the slowest decay, the chain the fastest] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

113 The Gaussian Case: Extremal Tree Structures Theorem (Extremal Structures) Under the very-noisy assumption, Star graphs are hardest to learn (smallest approx error exponent) Markov chains are easiest to learn (largest approx error exponent) [Figure: star and permuted chain; sketch of P^n(err) versus n = # Samples — star slowest, any other tree in between, chain fastest] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 32 / 52

114 Numerical Simulations Chain, Star and Hybrid for d = 10, with ρ_i = 0.1 i, i ∈ [1 : 9] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 33 / 52

115 Numerical Simulations Chain, Star and Hybrid for d = 10, with ρ_i = 0.1 i, i ∈ [1 : 9] Quantities plotted: P(error) and the empirical exponent −(1/n) log P(error) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 33 / 52

116 Numerical Simulations Chain, Star and Hybrid for d = 10, with ρ_i = 0.1 i, i ∈ [1 : 9] [Figure: simulated probability of error versus the number of samples n for Chain, Hybrid and Star] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 33 / 52

117 Numerical Simulations Chain, Star and Hybrid for d = 10, with ρ_i = 0.1 i, i ∈ [1 : 9] [Figure: simulated probability of error and simulated error exponent −(1/n) log P(error) versus the number of samples n for Chain, Hybrid and Star] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 33 / 52

118 Proof Idea and Intuition Correlation decay Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

119 Proof Idea and Intuition Correlation decay [Figure: star with a non-edge e′ ∉ E_P] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

120 Proof Idea and Intuition Correlation decay [Figure: star and chain, each with a non-edge e′ ∉ E_P] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

121 Proof Idea and Intuition Correlation decay [Figure: star and chain with non-edges e′ ∉ E_P; the star has O(d²) such pairs] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

122 Proof Idea and Intuition Correlation decay [Figure: star and chain with non-edges e′ ∉ E_P; the star has O(d²) such pairs, the chain O(d)] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52

123 Proof Idea and Intuition Correlation decay Number of distance-two node pairs in: Star is O(d²), Markov chain is O(d) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 34 / 52
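
The counting in the intuition above is easy to verify directly; a tiny illustrative sketch (using the networkx library, with d = 10 as an example):

```python
import networkx as nx

def num_distance_two_pairs(T):
    """Number of node pairs at graph distance exactly two in a tree T."""
    dist = dict(nx.all_pairs_shortest_path_length(T))
    nodes = list(T.nodes)
    return sum(1 for i in nodes for j in nodes if i < j and dist[i][j] == 2)

d = 10
star  = nx.star_graph(d - 1)      # center 0 with d-1 leaves
chain = nx.path_graph(d)
print(num_distance_two_pairs(star), num_distance_two_pairs(chain))
# Star: (d-1 choose 2) = 36 pairs, i.e. O(d^2); chain: d-2 = 8 pairs, i.e. O(d).
```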

124 Concluding Remarks for Learning Gaussian Trees Gaussianity allows us to perform further analysis to find the extremal structures for learning VYFT, A. Anandkumar, A. S. Willsky Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures, Allerton 2009, IEEE Trans. on Signal Processing, May Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 35 / 52

125 Concluding Remarks for Learning Gaussian Trees Gaussianity allows us to perform further analysis to find the extremal structures for learning Allows us to derive a data-processing inequality for crossover rates VYFT, A. Anandkumar, A. S. Willsky, Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures, Allerton 2009, IEEE Trans. on Signal Processing, May Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 35 / 52

126 Concluding Remarks for Learning Gaussian Trees Gaussianity allows us to perform further analysis to find the extremal structures for learning Allows us to derive a data-processing inequality for crossover rates Universal result — not (strongly) dependent on the choice of correlations ρ = {ρ_1, ..., ρ_{d−1}} VYFT, A. Anandkumar, A. S. Willsky, Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures, Allerton 2009, IEEE Trans. on Signal Processing, May Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 35 / 52

127 Outline 1 Motivation, Background and Main Contributions 2 Learning Discrete Tree Models: Error Exponent Analysis 3 Learning Gaussian Tree Models: Extremal Structures 4 Learning High-Dimensional Forest-Structured Models 5 Related Topics and Conclusion Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 36 / 52

128 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

129 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Extend Liu et al.'s work to discrete models and improve convergence results Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

130 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Extend Liu et al.'s work to discrete models and improve convergence results Strategy: Remove weak edges [Figure: tree with a weak edge] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

131 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Extend Liu et al.'s work to discrete models and improve convergence results Strategy: Remove weak edges → Reduce the number of parameters Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

132 Motivation: Prevent Overfitting The Chow-Liu algorithm tells us how to learn trees Suppose we are in the high-dimensional setting where Samples n ≪ Variables d; learning forest-structured graphical models may reduce overfitting vis-à-vis trees [Liu, Lafferty and Wasserman, 2010] Extend Liu et al.'s work to discrete models and improve convergence results Strategy: Remove weak edges → Reduce the number of parameters [Figure: resulting forest after removing the weak edge] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 37 / 52

133 Main Contributions Propose CLThres, a thresholding algorithm, for consistently learning forest-structured models Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 38 / 52

134 Main Contributions Propose CLThres, a thresholding algorithm, for consistently learning forest-structured models Prove convergence rates ("moderate deviations") for a fixed discrete graphical model P ∈ P(X^d) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 38 / 52

135 Main Contributions Propose CLThres, a thresholding algorithm, for consistently learning forest-structured models Prove convergence rates ("moderate deviations") for a fixed discrete graphical model P ∈ P(X^d) Prove achievable scaling laws on (n, d, k) (k is the number of edges) for consistent recovery in high dimensions. Roughly speaking, n ≳ log^{1+δ}(d − k) is achievable Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 38 / 52

136 Main Difficulty Unknown minimum mutual information I_min in the forest model Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 39 / 52

137 Main Difficulty Unknown minimum mutual information I_min in the forest model Markov order estimation [Merhav, Gutman, Ziv 1989] Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 39 / 52

138 Main Difficulty Unknown minimum mutual information I_min in the forest model Markov order estimation [Merhav, Gutman, Ziv 1989] If known, can easily use a threshold, i.e., if I(P̂_{i,j}) < I_min, remove (i, j) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 39 / 52

139 Main Difficulty Unknown minimum mutual information I_min in the forest model Markov order estimation [Merhav, Gutman, Ziv 1989] If known, can easily use a threshold, i.e., if I(P̂_{i,j}) < I_min, remove (i, j) How to deal with the classic tradeoff between over- and underestimation errors? Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 39 / 52

140 The CLThres Algorithm Compute the empirical mutual informations I(P̂_{i,j}) for all (i, j) ∈ V × V Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 40 / 52

141 The CLThres Algorithm Compute the empirical mutual informations I(P̂_{i,j}) for all (i, j) ∈ V × V Max-weight spanning tree: Ê_{d−1} := argmax_{E: Tree} Σ_{(i,j) ∈ E} I(P̂_{i,j}) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 40 / 52

142 The CLThres Algorithm Compute the empirical mutual informations I(P̂_{i,j}) for all (i, j) ∈ V × V Max-weight spanning tree: Ê_{d−1} := argmax_{E: Tree} Σ_{(i,j) ∈ E} I(P̂_{i,j}) Estimate the number of edges given the threshold ε_n: k̂_n := |{ (i, j) ∈ Ê_{d−1} : I(P̂_{i,j}) ≥ ε_n }| Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 40 / 52

143 The CLThres Algorithm Compute the empirical mutual informations I(P̂_{i,j}) for all (i, j) ∈ V × V Max-weight spanning tree: Ê_{d−1} := argmax_{E: Tree} Σ_{(i,j) ∈ E} I(P̂_{i,j}) Estimate the number of edges given the threshold ε_n: k̂_n := |{ (i, j) ∈ Ê_{d−1} : I(P̂_{i,j}) ≥ ε_n }| Output the forest with the top k̂_n edges Computational Complexity = O((n + log d) d²) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 40 / 52
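
A minimal sketch of the CLThres steps above (illustrative, not the thesis code; it assumes binary data, reuses the plug-in mutual information from the Chow-Liu sketch earlier, and uses networkx for the spanning tree): build the Chow-Liu tree, then keep only the edges whose empirical mutual information is at least ε_n.

```python
import numpy as np
import networkx as nx

def empirical_mi(xi, xj):
    """Plug-in mutual information for two binary sample vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(xi == a) * np.mean(xj == b)))
    return mi

def clthres(X, eps_n=None):
    """CLThres sketch: Chow-Liu tree followed by thresholding weak edges at eps_n."""
    n, d = X.shape
    if eps_n is None:
        eps_n = n ** -0.5                      # eps_n := n^{-1/2} satisfies the conditions
    G = nx.Graph()
    for i in range(d):
        for j in range(i + 1, d):
            G.add_edge(i, j, weight=empirical_mi(X[:, i], X[:, j]))
    T = nx.maximum_spanning_tree(G, weight="weight")
    # Keep the top k_n edges, i.e. those with empirical MI at least eps_n.
    return sorted((i, j) for i, j, w in T.edges(data="weight") if w >= eps_n)

# Example: a forest with one edge (0, 1) and two isolated nodes 2, 3.
rng = np.random.default_rng(1)
n = 3000
x0 = rng.integers(0, 2, n)
x1 = np.where(rng.random(n) < 0.2, 1 - x0, x0)
X = np.column_stack([x0, x1, rng.integers(0, 2, n), rng.integers(0, 2, n)])
print(clthres(X))    # typically [(0, 1)]: the weak spanning-tree edges are pruned
```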

144 A Convergence Result for CLThres Assume that P ∈ P(X^d) is a fixed forest-structured graphical model; d does not grow with n Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 41 / 52

145 A Convergence Result for CLThres Assume that P ∈ P(X^d) is a fixed forest-structured graphical model; d does not grow with n Theorem ("Moderate Deviations") Assume that the sequence {ε_n}_{n=1}^∞ satisfies lim_{n→∞} ε_n = 0 and lim_{n→∞} n ε_n / log n = ∞ (ε_n := n^{−1/2} works) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 41 / 52

146 A Convergence Result for CLThres Assume that P ∈ P(X^d) is a fixed forest-structured graphical model; d does not grow with n Theorem ("Moderate Deviations") Assume that the sequence {ε_n}_{n=1}^∞ satisfies lim_{n→∞} ε_n = 0 and lim_{n→∞} n ε_n / log n = ∞ (ε_n := n^{−1/2} works) Then limsup_{n→∞} (1/(n ε_n)) log P(Ê_{k̂_n} ≠ E_P) ≤ −1 Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 41 / 52

147 A Convergence Result for CLThres Assume that P ∈ P(X^d) is a fixed forest-structured graphical model; d does not grow with n Theorem ("Moderate Deviations") Assume that the sequence {ε_n}_{n=1}^∞ satisfies lim_{n→∞} ε_n = 0 and lim_{n→∞} n ε_n / log n = ∞ (ε_n := n^{−1/2} works) Then limsup_{n→∞} (1/(n ε_n)) log P(Ê_{k̂_n} ≠ E_P) ≤ −1, i.e., P(Ê_{k̂_n} ≠ E_P) ≲ exp(−n ε_n) Vincent Tan (MIT) Large-Deviations for Learning Trees Thesis Defense 41 / 52


Latent Tree Approximation in Linear Model Latent Tree Approximation in Linear Model Navid Tafaghodi Khajavi Dept. of Electrical Engineering, University of Hawaii, Honolulu, HI 96822 Email: navidt@hawaii.edu ariv:1710.01838v1 [cs.it] 5 Oct 2017

More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

More information

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department

Uncertainty. Jayakrishnan Unnikrishnan. CSL June PhD Defense ECE Department Decision-Making under Statistical Uncertainty Jayakrishnan Unnikrishnan PhD Defense ECE Department University of Illinois at Urbana-Champaign CSL 141 12 June 2010 Statistical Decision-Making Relevant in

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9: Variational Inference Relaxations Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 24/10/2011 (EPFL) Graphical Models 24/10/2011 1 / 15

More information

A large-deviation analysis for the maximum likelihood learning of tree structures

A large-deviation analysis for the maximum likelihood learning of tree structures A large-deviation analysis for the maximum likelihood learning of tree structures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Lecture 22: Error exponents in hypothesis testing, GLRT

Lecture 22: Error exponents in hypothesis testing, GLRT 10-704: Information Processing and Learning Spring 2012 Lecture 22: Error exponents in hypothesis testing, GLRT Lecturer: Aarti Singh Scribe: Aarti Singh Disclaimer: These notes have not been subjected

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Review: Directed Models (Bayes Nets)

Review: Directed Models (Bayes Nets) X Review: Directed Models (Bayes Nets) Lecture 3: Undirected Graphical Models Sam Roweis January 2, 24 Semantics: x y z if z d-separates x and y d-separation: z d-separates x from y if along every undirected

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Random Field Models for Applications in Computer Vision

Random Field Models for Applications in Computer Vision Random Field Models for Applications in Computer Vision Nazre Batool Post-doctorate Fellow, Team AYIN, INRIA Sophia Antipolis Outline Graphical Models Generative vs. Discriminative Classifiers Markov Random

More information

3 Undirected Graphical Models

3 Undirected Graphical Models Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 3 Undirected Graphical Models In this lecture, we discuss undirected

More information

Inferning with High Girth Graphical Models

Inferning with High Girth Graphical Models Uri Heinemann The Hebrew University of Jerusalem, Jerusalem, Israel Amir Globerson The Hebrew University of Jerusalem, Jerusalem, Israel URIHEI@CS.HUJI.AC.IL GAMIR@CS.HUJI.AC.IL Abstract Unsupervised learning

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs Animashree Anandkumar UC Irvine a.anandkumar@uci.edu Ragupathyraj Valluvan UC Irvine rvalluva@uci.edu Abstract Graphical

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008 Readings: K&F: 16.3, 16.4, 17.3 Bayesian Param. Learning Bayesian Structure Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University October 6 th, 2008 10-708 Carlos Guestrin 2006-2008

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

3 : Representation of Undirected GM

3 : Representation of Undirected GM 10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Structure Learning: the good, the bad, the ugly

Structure Learning: the good, the bad, the ugly Readings: K&F: 15.1, 15.2, 15.3, 15.4, 15.5 Structure Learning: the good, the bad, the ugly Graphical Models 10708 Carlos Guestrin Carnegie Mellon University September 29 th, 2006 1 Understanding the uniform

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Learning Tractable Graphical Models: Latent Trees and Tree Mixtures

Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Learning Tractable Graphical Models: Latent Trees and Tree Mixtures Anima Anandkumar U.C. Irvine Joint work with Furong Huang, U.N. Niranjan, Daniel Hsu and Sham Kakade. High-Dimensional Graphical Modeling

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Scalable robust hypothesis tests using graphical models

Scalable robust hypothesis tests using graphical models Scalable robust hypothesis tests using graphical models Umamahesh Srinivas ipal Group Meeting October 22, 2010 Binary hypothesis testing problem Random vector x = (x 1,...,x n ) R n generated from either

More information

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS

Part I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Variational Inference II: Mean Field Method and Variational Principle Junming Yin Lecture 15, March 7, 2012 X 1 X 1 X 1 X 1 X 2 X 3 X 2 X 2 X 3

More information

Gaussian Processes 1. Schedule

Gaussian Processes 1. Schedule 1 Schedule 17 Jan: Gaussian processes (Jo Eidsvik) 24 Jan: Hands-on project on Gaussian processes (Team effort, work in groups) 31 Jan: Latent Gaussian models and INLA (Jo Eidsvik) 7 Feb: Hands-on project

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Submodularity in Machine Learning

Submodularity in Machine Learning Saifuddin Syed MLRG Summer 2016 1 / 39 What are submodular functions Outline 1 What are submodular functions Motivation Submodularity and Concavity Examples 2 Properties of submodular functions Submodularity

More information

Variational Inference (11/04/13)

Variational Inference (11/04/13) STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

High dimensional ising model selection using l 1 -regularized logistic regression

High dimensional ising model selection using l 1 -regularized logistic regression High dimensional ising model selection using l 1 -regularized logistic regression 1 Department of Statistics Pennsylvania State University 597 Presentation 2016 1/29 Outline Introduction 1 Introduction

More information

Supplemental for Spectral Algorithm For Latent Tree Graphical Models

Supplemental for Spectral Algorithm For Latent Tree Graphical Models Supplemental for Spectral Algorithm For Latent Tree Graphical Models Ankur P. Parikh, Le Song, Eric P. Xing The supplemental contains 3 main things. 1. The first is network plots of the latent variable

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Machine Learning, Fall 2009: Midterm

Machine Learning, Fall 2009: Midterm 10-601 Machine Learning, Fall 009: Midterm Monday, November nd hours 1. Personal info: Name: Andrew account: E-mail address:. You are permitted two pages of notes and a calculator. Please turn off all

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models David Sontag New York University Lecture 4, February 16, 2012 David Sontag (NYU) Graphical Models Lecture 4, February 16, 2012 1 / 27 Undirected graphical models Reminder

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Discrete Markov Random Fields the Inference story. Pradeep Ravikumar

Discrete Markov Random Fields the Inference story. Pradeep Ravikumar Discrete Markov Random Fields the Inference story Pradeep Ravikumar Graphical Models, The History How to model stochastic processes of the world? I want to model the world, and I like graphs... 2 Mid to

More information

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College

Distributed Estimation, Information Loss and Exponential Families. Qiang Liu Department of Computer Science Dartmouth College Distributed Estimation, Information Loss and Exponential Families Qiang Liu Department of Computer Science Dartmouth College Statistical Learning / Estimation Learning generative models from data Topic

More information

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1

More information

Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation Jason K. Johnson, Dmitry M. Malioutov and Alan S. Willsky Department of Electrical Engineering and Computer Science Massachusetts Institute

More information

1 Undirected Graphical Models. 2 Markov Random Fields (MRFs)

1 Undirected Graphical Models. 2 Markov Random Fields (MRFs) Machine Learning (ML, F16) Lecture#07 (Thursday Nov. 3rd) Lecturer: Byron Boots Undirected Graphical Models 1 Undirected Graphical Models In the previous lecture, we discussed directed graphical models.

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,

More information

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING

EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING EXAM IN STATISTICAL MACHINE LEARNING STATISTISK MASKININLÄRNING DATE AND TIME: June 9, 2018, 09.00 14.00 RESPONSIBLE TEACHER: Andreas Svensson NUMBER OF PROBLEMS: 5 AIDING MATERIAL: Calculator, mathematical

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Machine learning - HT Maximum Likelihood

Machine learning - HT Maximum Likelihood Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce

More information

4 : Exact Inference: Variable Elimination

4 : Exact Inference: Variable Elimination 10-708: Probabilistic Graphical Models 10-708, Spring 2014 4 : Exact Inference: Variable Elimination Lecturer: Eric P. ing Scribes: Soumya Batra, Pradeep Dasigi, Manzil Zaheer 1 Probabilistic Inference

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Computational Genomics

Computational Genomics Computational Genomics http://www.cs.cmu.edu/~02710 Introduction to probability, statistics and algorithms (brief) intro to probability Basic notations Random variable - referring to an element / event

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

More information

Total positivity in Markov structures

Total positivity in Markov structures 1 based on joint work with Shaun Fallat, Kayvan Sadeghi, Caroline Uhler, Nanny Wermuth, and Piotr Zwiernik (arxiv:1510.01290) Faculty of Science Total positivity in Markov structures Steffen Lauritzen

More information

11 : Gaussian Graphic Models and Ising Models

11 : Gaussian Graphic Models and Ising Models 10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 18 Oct, 21, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models CPSC

More information