Learning Markov Graphs Up To Edit Distance

Abhik Das, Praneeth Netrapalli, Sujay Sanghavi and Sriram Vishwanath
Department of ECE, The University of Texas at Austin, USA

Abstract

This paper presents a rate-distortion approach to Markov graph learning. It provides lower bounds on the number of samples required for any algorithm to learn the Markov graph structure of a probability distribution, up to edit distance. In particular, for both Ising and Gaussian models on p variables with degree at most d, we show that at least Ω((d - s/p) log p) samples are required for any algorithm to learn the graph structure up to edit distance s. Our bounds represent a strong converse; i.e., we show that for a smaller number of samples, the probability of error goes to 1 as the problem size increases. These results show that substantial gains in sample complexity may not be possible without paying a significant price in edit distance error.

1 Introduction

Markov networks, also known as undirected graphical models, describe the interdependencies, or lack thereof, among a collection of random variables using an undirected graph. As such, they have been used for modeling and designing applications in a multitude of settings, for example, social network modeling [1], [2], image processing/computer vision [3], [4] and computational biology [5], [6]. The problem of learning the graph structure of a Markov network from samples generated by the underlying graphical model is a well-studied one and is referred to as the problem of graphical model selection. There is diverse literature on various aspects of learning graphical models, from statistical physics to computational learning theory. It is only relatively recently that the information-theoretic limits for the problem of graphical model selection are being better understood. An understanding of the information-theoretic limits of high-dimensional learning problems in general, and of the problem of graphical model selection in particular, provides us with sample complexity bounds corresponding to lower bounds for learning. A useful tool for obtaining these bounds is Fano's inequality and its generalizations [7]. However, Fano's inequality results in weak converse bounds: the typical result obtained for the graphical model selection problem is that if the number of observed samples available to a learning algorithm falls below a certain threshold, the probability of error is bounded away from zero (for example, exceeds 1/2). Therefore, alternate information-theoretic tools are required for stronger bounds on sample complexity.

This motivates the formulation of strong converse type results, in the same spirit as in the case of noisy channel coding [8]. In other words, we desire results for the graphical model selection problem stating that unless the number of available samples exceeds some threshold, the probability of error in learning the structure of a Markov network goes to 1 as the problem size increases. Such information-theoretic limits are important as they provide an understanding of settings where recovery is impossible, regardless of the algorithm or the cleverness of its design.

In this paper, we focus on reconstruction of the graph structure of a Markov network within a pre-specified distortion, rather than exact reconstruction. As an overarching goal, we are interested in characterizing the rate-distortion limits of the problem of graphical model selection. We restrict ourselves to Markov networks whose underlying graphical structures have bounded degree. We derive detailed results for two well-known families: Ising models and Gaussian Markov networks. The distortion metric we choose is edit distance, which we define in the next section.

1.1 Related Work

There is a significant body of literature on deriving the information-theoretic limits on the sample complexity for exact learning of Markov networks, especially the specialized cases of Ising models [9], [10], and Gaussian Markov networks [11], [12]. The graph ensembles that have been considered include degree-bounded graphs [9]-[12], [13], graphs with limited edges [9] and random graphs [10], [12]. A common theme in deriving these theoretical bounds is to treat the graphical model selection problem as a noisy channel coding problem and apply Fano's inequality to characterize the limits. The graphical model selection problem in the presence of distortion has been examined in [12] for the ensemble of Erdős-Rényi graphs. The only known strong converse results have been derived in [10] and [14], for the cases of exact reconstruction of Ising models based on Erdős-Rényi graphs and Gaussian Markov networks based on degree-bounded graphs respectively. The performance of graphical model learning algorithms that output a set of graphs instead of a single one (similar to list-decoding) is examined in [15] for Gaussian and Ising models.

1.2 Summary of Results

We provide a comparison of our results in this paper with existing ones in the literature in Table 1. All the results are for the ensemble of graphs on p nodes with degree at most d. Our results are labeled "this paper". s denotes the maximum allowed edit distance between the original and recovered graphs. All existing results in the literature are for the case of exact recovery, i.e., s = 0. It is known that the edge weights play an important role in determining the complexity of learning graphical models [16]. However, this dependency is complex in general. For instance, both very low edge weights and very large edge weights increase the sample complexity of learning the graphical model: the first because of the difficulty of recognizing the presence of an edge and the latter because of long-range correlations.

Table 1: Comparison with existing results

  Model      Edge weight = Θ(1/d)                 Edge weight = Θ(1)
  Ising      Ω(d^2 log p) [9]                     Ω(exp(d) log p) [9]
             Ω((d - 8s/p) log p)  (this paper)    Ω((d - 8s/p) log p)  (this paper)
  Gaussian   Ω(d^2 log p) [11]                    Ω(d log p) [11]
             Ω((d - 4s/p) log p)  (this paper)    Ω((d - 4s/p) log p)  (this paper)

Existing results show a significant difference in lower bounds for edge weight scaling as Θ(1/d) as opposed to Θ(1). However, in this paper, we are able to show different lower bounds for the Gaussian case but not for the Ising model. A detailed description as well as a comparison of our results with the existing ones is given in Section 6.

The rest of this paper is organized as follows. We discuss some preliminaries and introduce the system model in Section 2. We consider the problem of learning graph structure up to a pre-specified distortion value (edit distance) and give strong limits on the sample complexity for arbitrary ensembles, Ising models and Gaussian Markov networks in Sections 3, 4 and 5 respectively. We finally conclude the paper with Section 6 and present the proofs in the appendix.

2 Preliminaries

We consider an undirected graph G = (V, E), where V = {1, ..., p} is the set of nodes and E ⊆ V × V is the set of edges. A Markov network is obtained by associating a random variable X_i with each node i ∈ V, taking values from an alphabet set A, and specifying a joint probability distribution p(·) over the vector X := (X_1, X_2, ..., X_p) that satisfies the following property:

  p(x_A, x_B | x_C) = p(x_A | x_C) p(x_B | x_C),

where A, B and C are any disjoint subsets of V such that every path in G from A to B passes through C, and x_A, x_B and x_C denote the restrictions of (x_1, ..., x_p) ∈ A^p to the indices in A, B and C respectively. Note that p(·) denotes a p.m.f. in the discrete case (i.e., A is a finite set) and a p.d.f. in the continuous case (i.e., A = R and the distribution is a continuous function).
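
As a quick sanity check of the global Markov property stated above, the following minimal Python sketch (not part of the paper; the three-node chain and the parameter value are illustrative assumptions) verifies numerically that p(x_A, x_B | x_C) = p(x_A | x_C) p(x_B | x_C) when the separating set C is the middle node of a chain.

```python
import itertools
import math

# Toy pairwise binary Markov network on a 3-node chain 0 - 1 - 2 (illustrative values).
theta = 0.5
edges = [(0, 1), (1, 2)]

def weight(x):
    # Unnormalized probability exp(sum over edges of theta * x_i * x_j).
    return math.exp(sum(theta * x[i] * x[j] for i, j in edges))

configs = list(itertools.product([-1, 1], repeat=3))
Z = sum(weight(x) for x in configs)
p = {x: weight(x) / Z for x in configs}

def marg(fixed):
    # Marginal probability of the partial assignment in `fixed` (dict: node -> value).
    return sum(p[x] for x in configs if all(x[i] == v for i, v in fixed.items()))

# Nodes 0 and 2 are separated by node 1, so p(x0, x2 | x1) = p(x0 | x1) * p(x2 | x1).
for x0, x1, x2 in configs:
    lhs = marg({0: x0, 1: x1, 2: x2}) / marg({1: x1})
    rhs = (marg({0: x0, 1: x1}) / marg({1: x1})) * (marg({1: x1, 2: x2}) / marg({1: x1}))
    assert abs(lhs - rhs) < 1e-12
print("conditional independence verified on the chain")
```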

We briefly describe the families of discrete and continuous graphical models considered in this paper.

Ising Model: This is a discrete probability distribution that is widely studied in statistical physics [17], and applicable to fields like computer vision [18] and game theory [19]. In this paper we consider a special case of the Ising model, called the zero-field binary Ising model. Here the alphabet is A = {-1, +1}. Given an undirected graph G on p nodes and a weight θ_ij ∈ R for each edge (i, j) ∈ E, the probability of a configuration x = (x_1, x_2, ..., x_p) ∈ A^p is given by the following expression:

  p(x) = exp( Σ_{(i,j)∈E} θ_ij x_i x_j ) / Σ_{x'∈A^p} exp( Σ_{(i,j)∈E} θ_ij x'_i x'_j ).

Gaussian Graphical Model: This is one of the most well-known families of continuous Markov networks. Here X has a multivariate Gaussian distribution. Without any loss of generality, it can be assumed that X has zero mean. Given an undirected graph G on p nodes and a positive definite matrix Θ, the p.d.f. of X is given by the following expression:

  p(x) = √( |Θ| / (2π)^p ) exp( -(1/2) x^T Θ x ),

where x = (x_1, ..., x_p) ∈ R^p. Θ is the inverse covariance matrix and is such that Θ(i, j) ≠ 0 if and only if (i, j) ∈ E. Θ is also called the potential matrix, since Θ(i, j) can be interpreted as the potential of edge (i, j) ∈ E. The following quantity, called the minimum magnitude of partial correlation coefficient, plays an important role in determining the complexity of the structure learning problem:

  λ(Θ) := min_{(i,j)∈E} |Θ(i, j)| / √( Θ(i, i) Θ(j, j) ).

This quantity is invariant to rescaling of the random variables and can be thought of as the minimum magnitude of a non-zero entry after normalizing the diagonal terms of Θ to one.

In this paper we restrict our attention to the ensemble of degree-bounded graphs, in light of the fact that extensive work on learning graphical models focuses on these graphs. We denote the set of all graphs on p nodes with maximum degree d by G_{p,d}. We also denote the set of all graphs on p nodes by U_p. For any two graphs G and H on the same set of p nodes, we define the edit distance d(G, H) as the minimum number of edge deletions and insertions required to convert G to H. In other words, d(G, H) is the cardinality of the symmetric difference between the edge sets of G and H.

2.1 Learning Algorithm and Error Criterion

We consider an ensemble of undirected graphs on a common set of p nodes, G = {G_1, ..., G_M}, and an ensemble of Markov networks K = {K_1, ..., K_M}, such that K_i is associated with G_i and the random variables X = (X_1, ..., X_p) draw values from the alphabet A. We choose a Markov network K ∈ K uniformly at random and obtain n i.i.d. vector samples X^n = (X^(1), ..., X^(n)) from the distribution specified by K. The problem we consider is to reconstruct the graph G associated with K given the samples X^n. This is also known as the problem of graphical model selection.
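
The edit distance d(G, H) defined above is simply the size of the symmetric difference of the two edge sets; the following small Python sketch (illustrative, not from the paper) makes this concrete.

```python
def edit_distance(edges_g, edges_h):
    """Edit distance d(G, H): size of the symmetric difference of the edge sets."""
    g = {frozenset(e) for e in edges_g}
    h = {frozenset(e) for e in edges_h}
    return len(g ^ h)

# Example: G has edges {1,2},{2,3}; H has edges {2,3},{3,4}.
# One deletion ({1,2}) and one insertion ({3,4}) convert G into H, so d(G, H) = 2.
print(edit_distance([(1, 2), (2, 3)], [(2, 3), (3, 4)]))  # -> 2
```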

A function φ : A^{pn} → U_p that maps the observed samples to an estimated graph Ĝ = φ(X^n) ∈ U_p is called a learning algorithm. Given a pre-specified s, we define the error event for the learning algorithm as {d(G, φ(X^n)) ≥ s}, i.e., the algorithm is correct only if the edit distance between the actual and estimated graphs is less than s. The probability of error can then be defined as

  P_e^(n)(φ) = P( d(G, φ(X^n)) ≥ s ) = (1/M) Σ_{i=1}^M P( d(G_i, φ(X^n)) ≥ s | K = K_i ).

In this paper, we derive lower bounds on the sample size n, in terms of the ensemble parameters, for any learning algorithm to reliably recover the underlying graph of a Markov network up to an edit distance of s. We do this by bounding P_e^(n)(φ) in terms of n and the ensemble parameters.

3 Lower Bounds for Arbitrary Ensembles

In this section, we state our result on lower bounds on the sample complexity for arbitrary ensembles of graphical models. We consider the same setup as described in Section 2.1. We have an ensemble of Markov network models K and the corresponding ensemble of undirected graphs G on p nodes. We choose a Markov network K ∈ K uniformly at random and obtain n i.i.d. sample vectors from its joint distribution. Our aim is to analyze the performance of any learning algorithm φ : A^{pn} → U_p and bound its probability of error. For this we define the following quantities: for G ∈ G,

  B(s, G) = { H : d(G, H) < s, H ∈ U_p },    B(s, G) = max_{G∈G} |B(s, G)|.

B(s, G) represents the maximum number of graphs in U_p that lie within edit distance s of any graph in the ensemble G. We also define another quantity, similar in structure to mutual information:

  I(K_i; X) := H(X) - H(X | K = K_i)  if |A| < ∞,
            := h(X) - h(X | K = K_i)  if A = R.

We define a lower bound R and an upper bound C on the following quantities:

  R ≤ log M - log B(s, G),    (1)
  C ≥ max_{1≤i≤M} I(K_i; X).    (2)
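
For small p, the ball B(s, G) and the counting bound used later (Lemma 1) can be checked by brute force. The sketch below (illustrative values of p and s, not from the paper) enumerates all graphs H ∈ U_p with d(G, H) < s and compares the count against Σ_{i<s} C(p(p-1)/2, i).

```python
import itertools
from math import comb

def ball_size(p, base_edges, s):
    """Count graphs H on p nodes with d(G, H) < s by brute force."""
    possible = list(itertools.combinations(range(p), 2))
    g = set(base_edges)
    count = 0
    for mask in itertools.product([0, 1], repeat=len(possible)):
        h = {e for e, bit in zip(possible, mask) if bit}
        if len(g ^ h) < s:          # edit distance is the size of the symmetric difference
            count += 1
    return count

p, s = 4, 3
g_edges = {(0, 1), (1, 2)}
print(ball_size(p, g_edges, s))                    # brute-force |B(s, G)|
print(sum(comb(comb(p, 2), i) for i in range(s)))  # sum_{i<s} C(p(p-1)/2, i), same value here
```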

Theorem. Consier an ensemble of Markov networks K = {K,..., K M } an the corresoning ensemble of unirecte grahs on noes, G = {G,..., G M }. Suose the ranom variables take values from alhabet A. If the number of samles satisfies n < R C then for any learning algorithm, we have the following lower boun on the robability of error: P e n 4nA K φ R nc 2 R nc 2 2. Here error means that the eit istance between the original an recovere grahs excees s an A K = max log var X K = K i i M X K = K i. where stans for.m.f. in the iscrete case an..f. in the continuous case. If we fin a goo uer boun for A K for a given ensemble, we can use Theorem to show that P e n in the high imensional setting as an n < R C. We ursue this aroach in the next two sections to rove results for ensembles of Ising an Gaussian grahical moels. 4 Lower Bouns for Ising Moels Our main result in this section is the following theorem which characterizes a lower boun on the number of samles require for consistent recovery of grahical moel from an ensemble K I,, whose construction is escribe below. For each grah G G,, consier the corresoning Ising moel with all ege arameters equal to θ where θ 0,. We enote this ensemble of Ising moels by K I,. Note that there is a bijection between grahs in G, an moels in K I,. Theorem 2. Suose K is chosen uniformly at ranom from K, I. If = oα for some 0 α <, s < α 6 an the number of samles n, available to a learning algorithm, satisfies n < [ 2 4 2s log 4 log 8 + s 2s log e log s ] = Ω α 8s log, then for any grahical moel learning algorithm, the robability of error P n e as. Proof strategy for Theorem 2: The roof of Theorem 2 follows from establishing the bouns R an C in an 2 an then using Theorem. Lemmas an 2 establish such bouns R an C resectively. A comlete roof of Theorem 2 can be foun in the aenix. We list the grahs in G, as {G,..., G M } an the corresoning Ising moels in K I, as {K,..., K M }. Now we state the following two lemmas bouning R an C for this ensemble. 6

Lemma. For grah ensemble G, with 2 an s 4, the following bouns hol: log M 4 log 8, 2 Bs, G /2, < s. s Lemma 2. Suose K is chosen uniformly at ranom from K, I. Then we have max IK i ; X. i 5 Lower Bouns for Gaussian Markov Grahs Our main result in this section is a lower boun on the number of samles require for consistent recovery of grahical moel from an ensemble of Gaussian Markov networks K, G, which is constructe as follows. Without loss of generality, we assume that is even. We choose erfect matchings on noes, each erfect matching chosen uniformly at ranom, an form a multigrah resulting from the union of the matchings. We refer to the set of all such multigrahs on noes, constructe in this fashion, as H. The uniform istribution on the set of erfect matchings also efines a robability istribution over H. We have the following lemma for this istribution [20]: Lemma 3. Consier a multigrah H = V, E, V = {, 2,..., }, forme from the union of ranom erfect matchings on V, the matchings being chosen accoring to a uniform istribution. Suose the eigenvalues of the weighte ajacency matrix of H, enote by A, are = λ A λ 2 A λ A. Define ρa = max 2 i λ i A. Then the following result hols: where c is a ositive real an τ = P ρa < 3 c τ, + 2. Next, we eliminate the multigrahs from H whose weighte ajacency matrices A satisfy ρa 3 an get a reuce set H. By Lemma 3, H\H forms a small fraction of H as. We fix constants λ 0, 4 δ, δ > 0 an efine µ := λ 4. Then for every multigrah H H, we generate a matrix Θ = 4 µ + δi + µa, where I is the ientity matrix an A is the weighte ajacency matrix of multigrah H. We refer to the resulting set of these matrices as T. Then the following roerty hols for this set: Lemma 4. The matrices in T are symmetric an ositive efinite. Proof. By construction, any matrix Θ T has the form Θ = 4 µ + δi + µa, where A is the weighte ajacency matrix of some multigrah H H, which makes it symmetric. Also, the construction of H ensures that ρa < 3. Therefore, the minimum eigenvalue of Θ is at least 7

The following property holds for this set:

Lemma 4. The matrices in T_p are symmetric and positive definite.

Proof. By construction, any matrix Θ ∈ T_p has the form Θ = (4dµ + δ)I + µA, where A is the weighted adjacency matrix of some multigraph H ∈ H'_p, which makes it symmetric. Also, the construction of H'_p ensures that ρ(A) < 3√d. Therefore, the minimum eigenvalue of Θ is at least 4dµ + δ - ρ(A)µ > dµ + δ > 0. This and the symmetry of Θ ensure that all the eigenvalues of Θ are positive. Hence any Θ ∈ T_p is a symmetric and positive definite matrix.

Note that the choice of µ ensures that λ(Θ) = λ for every Θ ∈ T_p. Lemma 4 shows that the matrices of T_p can serve as the inverse covariance matrices of Gaussian Markov networks. By construction, the underlying graph of each of these Markov networks comes from G_{p,d}. We denote this ensemble of Gaussian Markov networks by K^G_{p,d} and the corresponding graph ensemble by G'_{p,d} ⊆ G_{p,d}.

Theorem 3. Suppose K is chosen uniformly at random from K^G_{p,d}. If d = o(p^α) for some 0 ≤ α < 1/2, s < (1-2α)pd/8 and the number of samples n available to a learning algorithm satisfies

  n < [ (pd/2 - 2s) log p - (pd/2) log(4d^2) + s log(2s/e) - log s ] / [ p log( 1 + 4dλ/(1 - 4dλ) ) ]
    = Ω( (1-2α)(d - 4s/p) log p / log( 1 + 4dλ/(1 - 4dλ) ) ),

then for any graphical model learning algorithm, the probability of error P_e^(n) → 1 as p → ∞.

Proof strategy for Theorem 3: Analogous to the proof of Theorem 2, the proof of Theorem 3 follows from establishing the bounds R and C in (1) and (2) and then using Theorem 1. Lemmas 5 and 6 establish such bounds R and C respectively. A complete proof of Theorem 3 can be found in the appendix. We list the graphs in G'_{p,d} as {G_1, ..., G_M} and the Gaussian models in K^G_{p,d} as {K_1, ..., K_M}. We now state the following two lemmas bounding R and C for this ensemble.

Lemma 5. For the graph ensemble G'_{p,d}, the following bounds hold for large enough p:

  1. log M ≥ (pd/2) log( p/(4d^2) ),
  2. B(s, G'_{p,d}) < s · (p^2/2 choose s).

Lemma 6. Suppose K is chosen uniformly at random from K^G_{p,d}. Then we have

  max_{1≤i≤M} I(K_i; X) ≤ (p/2) log( 1 + 4dλ/(1 - 4dλ) ).
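
The quantity bounded in Lemma 6 can be explored numerically: for a zero-mean Gaussian mixture, h(X) is at most the entropy of a Gaussian with the mixture covariance, so (1/2)(log|Σ̄| - log|Σ_i|) upper bounds I(K_i; X). The sketch below (a small random instance of the Section 5 construction with illustrative parameters, not from the paper) computes this upper bound for each model in the ensemble.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, lam, delta, M = 8, 3, 0.05, 1.0, 4     # illustrative values; lam < 1/(4d)
mu = delta * lam / (1 - 4 * d * lam)

def random_precision_matrix():
    # Same construction as in Section 5: union of d random perfect matchings.
    A = np.zeros((p, p))
    for _ in range(d):
        perm = rng.permutation(p)
        for k in range(p // 2):
            i, j = perm[2 * k], perm[2 * k + 1]
            A[i, j] += 1.0
            A[j, i] += 1.0
    return (4 * d * mu + delta) * np.eye(p) + mu * A

Thetas = [random_precision_matrix() for _ in range(M)]
Sigmas = [np.linalg.inv(T) for T in Thetas]
Sigma_bar = sum(Sigmas) / M                  # covariance of the mixture X

# Upper bound on I(K_i; X) used in the proof of Lemma 6:
#   I(K_i; X) = h(X) - h(X | K = K_i) <= 0.5 * (log|Sigma_bar| - log|Sigma_i|)
bounds = [0.5 * (np.linalg.slogdet(Sigma_bar)[1] - np.linalg.slogdet(S)[1]) / np.log(2)
          for S in Sigmas]
print("max_i upper bound on I(K_i; X) (bits):", max(bounds))
```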

6 Conclusion & Discussion

Remarks about Theorems 2 & 3: Specializing our result to the case of exact recovery, i.e., setting s = 0, yields weaker lower bounds on sample complexity than existing results. However, for the Gaussian case [11] with edge weights Θ(1) our result matches the existing result, and for both the Ising [9] and Gaussian [11] cases with edge weights Θ(1/d), our result is only a factor of d away from the existing results. This gap is either due to a limitation of our proof technique or due to the difference in the kind of guarantee. Specifically, the lower bound results in [9] and [11] use Fano's inequality to obtain a weak converse, i.e., if the number of samples n scales below a certain threshold then the probability of error is lower-bounded by a constant as p → ∞. On the other hand, our result establishes a strong converse, i.e., if the number of samples n scales below a certain threshold then the probability of error converges to 1 as p → ∞.

In this paper, we develop a rate-distortion framework for graph learning, where we characterize lower bounds on sample complexity within a given distortion criterion. We use a strong converse framework to derive these bounds, indicating that it is near-impossible to learn the graphical model with any fewer samples. Our results show that, for both Ising and Gaussian models on p variables with maximum degree d, at least Ω((d - s/p) log p) samples are required for any learning algorithm to recover the graph structure to within edit distance s.

References

[1] A. Grabowski and R. Kosinski, "Ising-based model of opinion formation in a complex network of interpersonal interactions," Physica A: Statistical Mechanics and its Applications, vol. 361, pp. 651-664, 2006.
[2] F. Vega-Redondo, Complex Social Networks. Cambridge University Press, 2007.
[3] J. Besag, "On the statistical analysis of dirty pictures," Journal of the Royal Statistical Society, Series B, vol. 48, pp. 259-279, 1986.
[4] M. Choi, J. J. Lim, A. Torralba, and A. S. Willsky, "Exploiting hierarchical context on a large database of object categories," in IEEE CVPR, 2010.
[5] N. Friedman, "Inferring cellular networks using probabilistic graphical models," Science, Feb 2004.
[6] A. Ahmed, L. Song, and E. P. Xing, "Time-varying networks: Recovering temporally rewiring genetic networks during the life cycle of Drosophila melanogaster," tech. rep., arXiv, 2008.
[7] T. Cover and J. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.
[8] R. Gallager, Information Theory and Reliable Communication. Wiley, 1968.
[9] N. Santhanam and M. J. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions," arXiv, 2009.
[10] A. Anandkumar, V. Y. F. Tan, and A. Willsky, "High-dimensional structure learning of Ising models: Tractable graph families," arXiv preprint, 2011.

[11] W. Wang, M. J. Wainwright, and K. Ramchandran, "Information-theoretic bounds on model selection for Gaussian Markov random fields," in IEEE ISIT, 2010.
[12] A. Anandkumar, V. Y. F. Tan, and A. Willsky, "High-dimensional Gaussian graphical model selection: Tractable graph families," arXiv preprint, 2011.
[13] G. Bresler, E. Mossel, and A. Sly, "Reconstruction of Markov random fields from samples: Some observations and algorithms," in APPROX, pp. 343-356, 2008.
[14] I. Mitliagkas and S. Vishwanath, "Strong information-theoretic limits for source/model recovery," in Proc. of Allerton Conf. on Communication, Control and Computing, Monticello, USA, 2010.
[15] D. Vats and J. Moura, "Necessary conditions for consistent set-based graphical model selection," in IEEE ISIT, 2011.
[16] A. Montanari and J. A. Pereira, "Which graphical models are difficult to learn?," in Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, eds.), pp. 1303-1311, 2009.
[17] L. Reichl and J. Luscombe, "A modern course in statistical physics," American Journal of Physics, vol. 67, p. 1285, 1999.
[18] S. Geman and C. Graffigne, "Markov random field image models and their applications to computer vision," in Proceedings of the International Congress of Mathematicians, vol. 1, AMS, Providence, RI, 1986.
[19] Y. Zhang, "Modeling market mechanism with evolutionary games," arXiv preprint cond-mat/9803308, 1998.
[20] J. Friedman, "A proof of Alon's second eigenvalue conjecture and related problems," arXiv, 2004.

A Proofs

Proof (Theorem 1). The proof technique is similar to the standard strong converse proofs in information theory [8]. We prove the theorem for the case where A is a finite set and K is an ensemble of discrete Markov networks. The proof for the case when A = R and K is an ensemble of Markov networks with continuous density functions goes along the same lines.

We fix ε > 0 and define the following sets:

  B_i = { x^n ∈ A^{pn} : log [ p(x^n | K = K_i) / p(x^n) ] ≥ n(C + ε) },  i = 1, 2, ..., M.

The set B_i tries to capture all points in the sample space where the random variable log[ p(X^n | K = K_i) / p(X^n) ] is greater than its mean conditioned on K = K_i (strictly speaking, it only contains a subset of those points, since nC is an upper bound on the mean of the random variable). Given a learning algorithm φ : A^{pn} → U_p, we define the following sets:

  R_i = { x^n ∈ A^{pn} : d(φ(x^n), G_i) < s },  i = 1, 2, ..., M,
  S_i = { x^n ∈ A^{pn} : φ(x^n) = G_i },  i = 1, 2, ..., M.

R_i is the set of points in the sample space for which we have correct recovery if K = K_i, and S_i is the set of points in the sample space for which the learning algorithm returns G_i. Enumerating the graphs in B(s, G_i) as G_{i_1}, ..., G_{i_k}, we see that R_i = ∪_{t=1}^{k} S_{i_t}. Note that S_i ∩ S_j = ∅ for i ≠ j. The probability of correct decoding by the learning algorithm is given by

  P_c^(n) = (1/M) Σ_{i=1}^M Σ_{x^n ∈ R_i} p(x^n | K = K_i)
          = (1/M) Σ_{i=1}^M Σ_{x^n ∈ R_i ∩ B_i^c} p(x^n | K = K_i) + (1/M) Σ_{i=1}^M Σ_{x^n ∈ R_i ∩ B_i} p(x^n | K = K_i).

The first term can be bounded as follows:

  (1/M) Σ_{i=1}^M Σ_{x^n ∈ R_i ∩ B_i^c} p(x^n | K = K_i)
    ≤ (1/M) Σ_{i=1}^M Σ_{x^n ∈ R_i ∩ B_i^c} 2^{n(C+ε)} p(x^n)
    ≤ (2^{n(C+ε)}/M) Σ_{i=1}^M Σ_{x^n ∈ R_i} p(x^n)
    = (2^{n(C+ε)}/M) Σ_{i=1}^M Σ_{t=1}^{|B(s,G_i)|} Σ_{x^n ∈ S_{i_t}} p(x^n)
    ≤ (2^{n(C+ε)}/M) B(s, G) Σ_{x^n ∈ A^{pn}} p(x^n)
    ≤ 2^{n(C+ε)} 2^{-R}.

The second term can be bounded as follows:

  (1/M) Σ_{i=1}^M Σ_{x^n ∈ R_i ∩ B_i} p(x^n | K = K_i)
    ≤ (1/M) Σ_{i=1}^M Σ_{x^n ∈ B_i} p(x^n | K = K_i)
    = (1/M) Σ_{i=1}^M P( log [ p(X^n | K = K_i) / p(X^n) ] ≥ n(C + ε) | K = K_i )
    = (1/M) Σ_{i=1}^M P( Σ_{j=1}^n log [ p(X^(j) | K = K_i) / p(X^(j)) ] ≥ n(C + ε) | K = K_i ).

Since X^(1), ..., X^(n) are i.i.d. vectors, we have the following equality:

  var( Σ_{j=1}^n log [ p(X^(j) | K = K_i) / p(X^(j)) ] | K = K_i ) = n · var( log [ p(X | K = K_i) / p(X) ] | K = K_i ),

where var(·) is the variance of the random variable. Defining

  A_K = max_{1≤i≤M} var( log [ p(X | K = K_i) / p(X) ] | K = K_i )

and using (2) along with Chebyshev's inequality, we obtain the following bound:

  P( Σ_{j=1}^n log [ p(X^(j) | K = K_i) / p(X^(j)) ] ≥ n(C + ε) | K = K_i ) ≤ A_K / (nε^2),  i = 1, 2, ..., M.

Choosing ε = (R - nC)/(2n) and using the fact that P_e^(n) = 1 - P_c^(n) gives us

  P_e^(n) ≥ 1 - 4nA_K / (R - nC)^2 - 2^{-(R-nC)/2}.

Proof (Lemma 1). The first inequality follows from the proof of Theorem 1 in [9]. For the second inequality, note that for graphs G = (V, E_G) ∈ G_{p,d} and H = (V, E_H) ∈ U_p with d(G, H) < s, we have |E_G Δ E_H| < s, where E_G Δ E_H is the symmetric difference of the edge sets E_G and E_H. In other words, (V, E_G Δ E_H) is a graph on p nodes with fewer than s edges. Therefore, B(s, G_{p,d}) is no more than the number of graphs on p nodes with fewer than s edges.

This gives

  B(s, G_{p,d}) ≤ Σ_{i=0}^{s-1} (p^2/2 choose i) ≤ s · (p^2/2 choose s-1) < s · (p^2/2 choose s),

where we use the fact that (m choose i) ≤ (m choose j) for 0 ≤ i ≤ j ≤ m/2 and s ≤ p^2/4.

Proof (Lemma 2). Since X is a vector of p random variables taking values from {-1, 1}, we have H(X) ≤ p. Hence, by definition, max_i I(K_i; X) ≤ H(X) ≤ p.

Proof (Theorem 2). Using Lemmas 1 and 2, we can define bounds R and C in (1) and (2) as follows:

  R := (pd/4) log( p/(8d) ) - log[ s · (p^2/2 choose s) ],    (3)
  C := p.    (4)

Using the fact that (a choose b) ≤ (ae/b)^b, we obtain the following lower bound on R/C:

  R/C ≥ (1/p) [ (pd/4 - 2s) log p - (pd/4) log(8d) + s log(2s/e) - log s ].

Under the hypothesis of the theorem, we have n < R/(2C). Theorem 1 then shows that P_e^(n) satisfies

  P_e^(n) ≥ 1 - 8 A(K^I_{p,d}) / (RC) - 2^{-R/4}.    (5)

We now need to show that the last two terms on the RHS of (5) go to 0 as p → ∞. Since d = o(p^α) for some α < 1 and s < (1-α)pd/16, we have

  R = Θ(pd log p).    (6)

This shows that the last term on the RHS of (5) goes to 0 as p → ∞. To show the same for the second-to-last term, we give a bound on A(K^I_{p,d}). For this, we recall the definition of A(K^I_{p,d}):

  A(K^I_{p,d}) = max_{1≤i≤M} var( log [ p(X | K = K_i) / p(X) ] | K = K_i )
              ≤ max_{1≤i≤M} E[ ( log [ p(X | K = K_i) / p(X) ] )^2 | K = K_i ].    (7)

To bound A(K^I_{p,d}), we give a deterministic bound on log [ p(X | K = K_i) / p(X) ] and use (7).

Note that for K_i ∈ K^I_{p,d}, the total number of edges in the corresponding graph G_i does not exceed pd/2. Also, for every edge (k, l) in E_i, the edge set of G_i, we have θ_kl = θ = O(1/d). Then given K_i ∈ K^I_{p,d} and x ∈ {-1, 1}^p, we have the following upper bound:

  p(x | K = K_i) = exp( Σ_{(k,l)∈E_i} θ_kl x_k x_l ) / Σ_{x'∈{-1,1}^p} exp( Σ_{(k,l)∈E_i} θ_kl x'_k x'_l )
                ≤ exp(θpd/2) / ( 2^p exp(-θpd/2) ) = exp( θpd - p log 2 ).    (8)

Similarly we have the following lower bound:

  p(x | K = K_i) = exp( Σ_{(k,l)∈E_i} θ_kl x_k x_l ) / Σ_{x'∈{-1,1}^p} exp( Σ_{(k,l)∈E_i} θ_kl x'_k x'_l )
                ≥ exp(-θpd/2) / ( 2^p exp(θpd/2) ) = exp( -(θpd + p log 2) ).    (9)

Since p(x) is the average of p(x | K = K_i) over the ensemble, the same bounds hold for p(x). Using this observation as well as bounds (8) and (9) we obtain the following inequality:

  | log [ p(x | K = K_i) / p(x) ] | ≤ log exp(θpd - p log 2) - log exp(-(θpd + p log 2)) = 2θpd log e.    (10)

Since (10) holds for i = 1, 2, ..., M, using (7) and (10) we obtain

  A(K^I_{p,d}) ≤ 4 p^2 d^2 θ^2 (log e)^2 = O(p^2),    (11)

where we use the fact that θ = O(1/d). Using (4), (6) and (11) we see that the second term on the RHS of (5) goes to 0 as p → ∞ and hence the probability of error goes to 1.

Proof (Lemma 5). The upper bound on B(s, G'_{p,d}) is obtained using arguments similar to those presented in the proof of Lemma 1. For the lower bound on log M, note that there are N = p! / (2^{p/2} (p/2)!) possible perfect matchings on a set of p nodes. Therefore, a multigraph composed of d perfect matchings can be formed in N^d ways. Note that multiple copies of the same multigraph may be generated during this construction. Using Lemma 3, at least (1 - c/p^τ) N^d of these multigraphs have a weighted adjacency matrix A satisfying ρ(A) < 3√d, for some constant c > 0. Note that any given multigraph generated by d perfect matchings is a d-regular multigraph and has pd/2 edges. In general, each of the pd/2 edges can come from any of the d perfect matchings. Therefore, a single multigraph can potentially be generated by at most d^{pd/2} sets of d perfect matchings. Also, we require that the multigraphs have different underlying undirected graph structures in G_{p,d}. The fact that there can be at most d edges between two nodes of the multigraph and there are pd/2 edges overall gives the lower bound

  M = |H'_p| = |K^G_{p,d}| = |G'_{p,d}| ≥ (1 - c/p^τ) N^d / d^{pd/2}.

For p > (2c)^{1/τ}, we have 1 - c/p^τ ≥ 1/2.

Taking logarithms on both sides and simplifying gives

  log M ≥ log[ (1/2) ( p! / (2^{p/2} (p/2)!) )^d / d^{pd/2} ] ≥ (pd/2) log( p/(4d^2) ),

for p > (2c)^{1/τ} and using the bound p! ≥ (p/2)^{p/2} (p/2)!.

Proof (Lemma 6). By definition we have I(K_i; X) = h(X) - h(X | K = K_i). The differential entropy of X is upper bounded by the differential entropy of a Gaussian random vector with the same covariance matrix. This gives h(X) ≤ (1/2) log( (2πe)^p |Σ̄| ), where Σ̄ = (1/M) Σ_{i=1}^M Σ_i and Σ_i = Θ_i^{-1} is the covariance matrix associated with K_i. Also, h(X | K = K_i) = (1/2) log( (2πe)^p |Σ_i| ). This gives

  I(K_i; X) ≤ (1/2) ( log |Σ̄| - log |Σ_i| ) = (1/2) ( log |Σ̄| + log |Θ_i| ).

By construction, the diagonal entries of every inverse covariance matrix Θ_i are the same and equal to 4dµ + δ. So by Hadamard's inequality, |Θ_i| ≤ (4dµ + δ)^p. Also, as stated in the proof of Lemma 4, the minimum eigenvalue of Θ_i is at least δ. This means that the maximum eigenvalue of Σ_i is at most 1/δ, or ||Σ_i||_2 ≤ 1/δ. Hence, the maximum eigenvalue of Σ̄ does not exceed 1/δ, as ||Σ̄||_2 ≤ (1/M) Σ_{i=1}^M ||Σ_i||_2. This gives |Σ̄| ≤ ||Σ̄||_2^p ≤ δ^{-p}. Therefore, we obtain

  I(K_i; X) ≤ (p/2) log( (4dµ + δ)/δ ) = (p/2) log( 1 + 4dλ/(1 - 4dλ) ),

where we substitute µ = δλ/(1 - 4dλ).

Proof (Theorem 3). Following the lines of the proof of Theorem 2, we define the bounds R and C satisfying (1) and (2), as per Lemmas 5 and 6:

  R := (pd/2) log( p/(4d^2) ) - log[ s · (p^2/2 choose s) ],    (12)
  C := (p/2) log( 1 + 4dλ/(1 - 4dλ) ).    (13)

After some algebraic manipulations, we obtain the following lower bound on R/C:

  R/C ≥ [ (pd/2 - 2s) log p - (pd/2) log(4d^2) + s log(2s/e) - log s ] / [ (p/2) log( 1 + 4dλ/(1 - 4dλ) ) ].

Under the hypothesis of the theorem, we have n < R/(2C). Theorem 1 then shows that P_e^(n) satisfies

  P_e^(n) ≥ 1 - 8 A(K^G_{p,d}) / (RC) - 2^{-R/4}.    (14)

Even in this case, the last term on the RHS of (14) goes to 0 as p → ∞, since d = o(p^α) for some α < 1/2 and s < (1-2α)pd/8 implies

  R = Θ(pd log p).    (15)

To show that the second-to-last term on the RHS of (14) goes to 0 as p → ∞, we give a bound on A(K^G_{p,d}). For this, we first derive a deterministic bound on log [ p(X | K = K_i) / p(X) ]. We define D_max = max_i |Θ_i|, D_min = min_i |Θ_i|, λ_max to be the maximum among the eigenvalues of Θ_i, i = 1, 2, ..., M, and Θ̄ = (1/M) Σ_{i=1}^M Θ_i. Then given K_i ∈ K^G_{p,d} and x ∈ R^p, we have the following upper bound:

  p(x | K = K_i) / p(x) = √|Θ_i| exp( -(1/2) x^T Θ_i x ) / [ (1/M) Σ_{j=1}^M √|Θ_j| exp( -(1/2) x^T Θ_j x ) ]
    ≤ √(D_max/D_min) / [ (1/M) Σ_{j=1}^M exp( -(1/2) x^T Θ_j x ) ]
    ≤ √(D_max/D_min) exp( (1/2) x^T Θ̄ x )
    ≤ √(D_max/D_min) exp( (λ_max/2) x^T x ).

This gives

  log [ p(x | K = K_i) / p(x) ] ≤ (1/2) log(D_max/D_min) + (λ_max/2) x^T x.    (16)

Similarly, we have the following lower bound:

  p(x | K = K_i) / p(x) = √|Θ_i| exp( -(1/2) x^T Θ_i x ) / [ (1/M) Σ_{j=1}^M √|Θ_j| exp( -(1/2) x^T Θ_j x ) ]
    ≥ √(D_min/D_max) exp( -(1/2) x^T Θ_i x )
    ≥ √(D_min/D_max) exp( -(λ_max/2) x^T x ).

This gives

  log [ p(x | K = K_i) / p(x) ] ≥ -(1/2) log(D_max/D_min) - (λ_max/2) x^T x.    (17)

Inequalities (16) and (17) together give

  | log [ p(X | K = K_i) / p(X) ] | ≤ (1/2) log(D_max/D_min) + (λ_max/2) X^T X.

Therefore, we have

  var( log [ p(X | K = K_i) / p(X) ] | K = K_i )
    ≤ E[ ( (1/2) log(D_max/D_min) + (λ_max/2) X^T X )^2 | K = K_i ]
    ≤ (1/2) ( log(D_max/D_min) )^2 + (λ_max^2/2) E[ (X^T X)^2 | K = K_i ].

For the given ensemble, we have D_max ≤ (4dµ + δ)^p, D_min ≥ δ^p and λ_max = 4dµ + δ + dµ = 5dµ + δ, where µ = δλ/(1 - 4dλ). Using these bounds, we get

  var( log [ p(X | K = K_i) / p(X) ] | K = K_i )
    ≤ (p^2/2) ( log( 1 + 4dλ/(1 - 4dλ) ) )^2 + (δ^2/2) ( 1 + 5dλ/(1 - 4dλ) )^2 E[ (X^T X)^2 | K = K_i ].

We also have

  E[ (X^T X)^2 | K = K_i ] = ( Tr(Θ_i^{-1}) )^2 + 2 Tr(Θ_i^{-2}) ≤ p^2/δ^2 + 2p/δ^2 ≤ 2p^2/δ^2.

Then we get

  var( log [ p(X | K = K_i) / p(X) ] | K = K_i ) ≤ (p^2/2) ( log( 1 + 4dλ/(1 - 4dλ) ) )^2 + p^2 ( 1 + 5dλ/(1 - 4dλ) )^2.

Thus, we get the result:

  A(K^G_{p,d}) ≤ (p^2/2) ( log( 1 + 4dλ/(1 - 4dλ) ) )^2 + p^2 ( 1 + 5dλ/(1 - 4dλ) )^2 ≤ 3p^2 ( 1 + 5dλ/(1 - 4dλ) )^2.    (18)

For λ = O(1/d), we obtain A(K^G_{p,d}) = O(p^2). Using (13), (15) and (18), we see that the second term on the RHS of (14) goes to 0 as p → ∞ and hence the probability of error goes to 1.