Information-theoretic bounds on model selection for Gaussian Markov random fields



Wei Wang, Martin J. Wainwright, and Kannan Ramchandran
Department of Electrical Engineering and Computer Sciences, and Department of Statistics
UC Berkeley, Berkeley, CA 94720
{wangwei, wainwrig, kannanr}@eecs.berkeley.edu

ISIT 2010, Austin, Texas, U.S.A., June 13-18, 2010

Abstract

The problem of graphical model selection is to estimate the graph structure of an unknown Markov random field based on observed samples from the graphical model. For Gaussian Markov random fields, this problem is closely related to the problem of estimating the inverse covariance matrix of the underlying Gaussian distribution. This paper focuses on the information-theoretic limitations of Gaussian graphical model selection and inverse covariance estimation in the high-dimensional setting, in which the graph size $p$ and maximum node degree $d$ are allowed to grow as a function of the sample size $n$. Our first result establishes a set of necessary conditions on $n(p, d)$ for any recovery method to consistently estimate the underlying graph. Our second result provides necessary conditions for any decoder to produce an estimate $\widehat{\Theta}$ of the true inverse covariance matrix $\Theta$ satisfying $\|\widehat{\Theta} - \Theta\|_\infty < \delta$ in the elementwise $\ell_\infty$-norm (which implies analogous results in the Frobenius norm as well). Combined with previously known sufficient conditions for polynomial-time algorithms, these results yield sharp characterizations in several regimes of interest.

I. INTRODUCTION

Markov random fields, or undirected graphical models, are families of multivariate probability distributions whose factorization and conditional independence properties are characterized by the structure of an underlying graph [1]. Graphical model selection refers to the problem of estimating the graph structure based on observed samples from a Markov random field. This problem arises in a wide variety of settings, including statistical image analysis, natural language processing, and computational biology.
In many applications, this problem is of interest under high-dimensional scaling, meaning that both the graph size $p$ and the number of samples $n$ are large. Classical methods, such as those based directly on the sample covariance, are known (via random matrix theory [2]) to break down when $p/n$ does not go to zero. Consequently, in the high-dimensional regime where $p \gg n$, additional structure is required in order to obtain consistent estimators. Accordingly, a line of recent work has focused on developing computationally efficient methods to solve this problem by imposing sparsity on the underlying graph. In particular, methods based on $\ell_1$-regularization (e.g. [3], [4], [5], [6], [7]) have been shown to yield consistent estimators for Gaussian graphical models, or the associated inverse covariance matrices.

Complementary in nature to such achievable results are the information-theoretic limits associated with any procedure for graphical model selection. Such analysis can serve two purposes. First, it can demonstrate when known polynomial-time algorithms achieve the information-theoretic bounds. Second, it can reveal regimes in which there exists a gap between the performance of current methods and the fundamental limits. With this motivation, some previous work ([8], [9]) has studied both necessary and sufficient conditions for graphical model selection in discrete Markov random fields.

The focus of this paper is on the information-theoretic limits of Gaussian graphical model selection, in which the observed random vector has a multivariate Gaussian distribution. For Gaussian Markov random fields, by the Hammersley-Clifford theorem [1], the model selection problem is equivalent to estimating the off-diagonal sparsity pattern of the inverse covariance matrix. In this paper, we study the ensemble $\mathcal{G}_{p,d}$ of graphs on $p$ vertices with maximum degree at most $d$, and derive two main results. Our first result is to derive conditions on the sample size $n$, graph size $p$, and maximum node degree $d$ that are necessary for any method to correctly recover the underlying graph with probability of error going to zero.
Our second result addresses the problem of estimating the inverse covariance matrix $\Theta$, and establishes necessary conditions for any method to produce an estimate $\widehat{\Theta}$ satisfying $\|\widehat{\Theta} - \Theta\|_\infty < \delta$. Our results can be compared against known sufficient conditions for graph selection and inverse covariance estimation using $\ell_1$-penalized maximum likelihood [7], and reveal regimes in which this polynomial-time algorithm achieves the information-theoretic scaling. One consequence of our results is a set of conditions under which the scaling of the sample size $n = \Omega(d^2 \log p)$ is sharp.

This paper is organized as follows. In Section II, we begin with some background and a precise formulation of the problem. Section III provides the statements of our main results and a discussion of their consequences. Section IV describes a general framework for deriving information-theoretic lower bounds and discusses several approaches for bounding the mutual information that arises in Fano's inequality. Subsections IV-B and IV-C are devoted to the proofs of the necessary conditions for graphical model selection and inverse covariance estimation. Given space constraints, this paper only provides statements and high-level proof ideas; we refer the reader to the technical report [10] for details. We conclude in Section V with a discussion of open directions.

Figure 1. Illustration of Gaussian Markov random fields. (a) Given an undirected graph, associate a random variable $X_i$ with each vertex $i$ in the graph. A GMRF is the collection of Gaussian distributions over the vector $X$ that respect the structure of the graph. (b) Sparsity pattern of the inverse covariance matrix $\Theta$ associated with the GMRF in (a).

II. BACKGROUND AND PROBLEM FORMULATION

We begin with some background on Gaussian Markov random fields. We then formulate the graphical model selection problem, which for Gaussian models is directly related to estimation of the inverse covariance matrix. Our goal is to derive information-theoretic lower bounds on the number of samples required for recovery, which apply to any procedure regardless of its computational complexity.

A. Gaussian Markov random fields

Let $X = (X_1, \ldots, X_p)$ be a multivariate Gaussian random vector with zero mean and covariance matrix $\Sigma$. Its density is determined completely by the inverse covariance matrix $\Theta = \Sigma^{-1}$, and has the form

$$\phi(x; 0, \Sigma) = \frac{\sqrt{\det(\Theta)}}{(2\pi)^{p/2}} \exp\Big\{ -\tfrac{1}{2}\, x^T \Theta x \Big\}. \tag{1}$$

For a given undirected graph $G = (V, E)$ with vertex set $V$ and edge set $E \subseteq V \times V$, we associate a random variable $X_i$ with each vertex $i \in V$. The Gaussian Markov random field associated with the graph $G$ is the family of Gaussian distributions that respect the Markov properties of $G$. In particular, the off-diagonal sparsity pattern of the inverse covariance matrix $\Theta$ is specified by the edge structure of the graph, such that $\Theta_{ij} = 0$ if $(i, j) \notin E$ (see Figure 1). Given $n$ i.i.d. samples from an unknown Markov random field, the problem of estimating the inverse covariance matrix $\Theta$ corresponds to recovering the graphical model instance, while the problem of estimating the underlying graph $G$ corresponds to graphical model selection.

We define the maximum degree of the graph as

$$d := \max_{i \in V} \big| \{ j \in V : (i, j) \in E \} \big|, \tag{2}$$

which is equal to the maximum number of off-diagonal non-zeros per row of the inverse covariance matrix $\Theta$.
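As a small illustration of definition (2), the following sketch reads the edge set and the maximum degree off the sparsity pattern of an inverse covariance matrix. The function name `graph_from_precision`, the tolerance `tol`, and the 4-vertex example matrix are hypothetical choices made for this sketch; they do not come from the paper.

```python
import numpy as np

def graph_from_precision(theta, tol=1e-12):
    """Return (edges, max_degree) implied by an inverse covariance matrix.

    `tol` is an illustrative threshold for deciding that an entry is non-zero.
    """
    theta = np.asarray(theta, dtype=float)
    p = theta.shape[0]
    # An edge (i, j) is present exactly when the off-diagonal entry is non-zero.
    adjacency = np.abs(theta) > tol
    np.fill_diagonal(adjacency, False)  # diagonal entries are not counted as edges
    edges = {(i, j) for i in range(p) for j in range(i + 1, p) if adjacency[i, j]}
    max_degree = int(adjacency.sum(axis=1).max())
    return edges, max_degree

# Hypothetical example: chain graph 0 - 1 - 2 plus an isolated vertex 3.
theta = np.array([
    [1.0, 0.4, 0.0, 0.0],
    [0.4, 1.0, 0.4, 0.0],
    [0.0, 0.4, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
edges, d = graph_from_precision(theta)
print(sorted(edges), d)
```

In this example the middle vertex of the chain has two neighbours, so the maximum degree is 2 even though its row of the matrix has three non-zero entries once the diagonal is included.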
The degree count does not include self-loops at each vertex, which correspond to the diagonal entries $\Theta_{ii}$. We often write $\Theta(G)$ to emphasize the graph-based structure of $\Theta$.

B. Classes of graphical models

Let $\mathcal{G}_{p,d}$ be the family of undirected graphs on $p$ vertices with edge sets that have degree at most $d$. For a given graph $G \in \mathcal{G}_{p,d}$, let $\Sigma(G)$ be the covariance matrix of a Gaussian Markov random field (GMRF) defined by the graph $G$. By definition, the inverse covariance matrix $\Theta(G)$ must have off-diagonal non-zeros only in positions corresponding to edges in $E$. In addition to graph structure, the difficulty of graphical model selection also depends on properties of the inverse covariance matrix entries. We measure the minimum value of each matrix $\Theta(G)$ by the function

$$\lambda^*(\Theta(G)) := \min_{(s,t) \in E} \frac{|\Theta_{st}|}{\sqrt{\Theta_{ss}\,\Theta_{tt}}}, \tag{3}$$

so that it is invariant to rescaling of the data. We study the class $\mathcal{G}_{p,d}(\lambda)$ of Gaussian Markov random fields parameterized by a lower bound $\lambda$ on the minimum value, defined as the set of probability distributions $\phi_{\Theta(G)} = \phi(0, \Sigma(G))$ where the underlying graph $G \in \mathcal{G}_{p,d}$, the inverse covariance matrix satisfies $\Theta_{st} = 0$ if $(s, t) \notin E$, and $\lambda^*(\Theta(G)) \geq \lambda$.

C. Decoders and error metrics

Suppose we are given $n$ i.i.d. vector samples $X^n = (X^{(1)}, \ldots, X^{(n)}) \in \mathbb{R}^{n \times p}$ from an unknown distribution $\phi_{\Theta(G)}$ in the class $\mathcal{G}_{p,d}(\lambda)$. Graphical model selection refers to the problem of estimating the underlying graph $G$ based on the observations $X^n$. A decoder $\psi : \mathbb{R}^{n \times p} \to \mathcal{G}_{p,d}$ maps the observations $X^n$ to an estimated graph $\widehat{G} = \psi(X^n)$. We define the error metric between the estimate $\widehat{G}$ and the true underlying graph $G$ using the 0-1 loss function $\mathbb{I}[\psi(X^n) \neq G]$. For any decoder $\psi$, we define the maximal probability of error over the class $\mathcal{G}_{p,d}(\lambda)$ as

$$p_{\mathrm{err}}(\psi) := \max_{\phi_{\Theta(G)} \in \mathcal{G}_{p,d}(\lambda)} \mathbb{P}_{\Theta(G)}\big[ \psi(X^n) \neq G \big], \tag{4}$$

where the error probability $\mathbb{P}_{\Theta(G)}[\psi(X^n) \neq G] = \mathbb{E}_{\Theta(G)}\big[ \mathbb{I}[\psi(X^n) \neq G] \big]$ is taken with respect to the product distribution $\mathbb{P}_{\Theta(G)}(\cdot) = \phi^n(\cdot\,; 0, \Sigma(G))$ over $n$ i.i.d. samples.
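The minimum value in (3) and membership in the class $\mathcal{G}_{p,d}(\lambda)$ can be checked numerically. The sketch below is illustrative only: the helper names `min_normalized_value` and `in_model_class`, the tolerance, and the convention of returning infinity for a graph with no edges are assumptions of this example, not definitions from the paper.

```python
import numpy as np

def min_normalized_value(theta, tol=1e-12):
    """lambda*(Theta): smallest |Theta_st| / sqrt(Theta_ss * Theta_tt) over edges (s, t)."""
    theta = np.asarray(theta, dtype=float)
    diag = np.diag(theta)
    ratios = np.abs(theta) / np.sqrt(np.outer(diag, diag))
    off_diag = ~np.eye(theta.shape[0], dtype=bool)
    on_edge = off_diag & (np.abs(theta) > tol)
    if not on_edge.any():
        return float("inf")  # assumed convention: no edge constrains lambda*
    return float(ratios[on_edge].min())

def in_model_class(theta, d, lam, tol=1e-12):
    """Check maximum degree <= d and lambda*(Theta) >= lam."""
    theta = np.asarray(theta, dtype=float)
    adjacency = np.abs(theta) > tol
    np.fill_diagonal(adjacency, False)
    max_degree = int(adjacency.sum(axis=1).max())
    return max_degree <= d and min_normalized_value(theta, tol) >= lam

# Single edge between vertices 0 and 1: Theta = I + a * 1_S 1_S^T with S = {0, 1}.
a = 0.5
theta = np.eye(4)
theta[:2, :2] += a
lam_star = min_normalized_value(theta)   # equals a / (1 + a) for this matrix

# Rescaling the data by a diagonal matrix maps Theta to D Theta D and leaves lambda* unchanged.
D = np.diag([1.0, 2.0, 3.0, 4.0])
lam_star_rescaled = min_normalized_value(D @ theta @ D)
```

The value $a/(1+a)$ obtained here is the same expression that appears in the restricted ensembles of Section IV, where it is required to be at least $\lambda$.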
In contrast to graphical model selection (in which the goal is to recover the support set of $\Theta(G)$), the goal of inverse covariance estimation is to estimate the numerical values of the inverse covariance matrix. More precisely, a decoder $\widetilde{\psi} : \mathbb{R}^{n \times p} \to \mathcal{G}_{p,d}(\lambda)$ maps the samples $X^n$ to an estimate $\widehat{\Theta} = \widetilde{\psi}(X^n)$. We measure the error between the estimate $\widehat{\Theta}$ and the true inverse covariance matrix $\Theta$ using the elementwise $\ell_\infty$-norm $\|\widehat{\Theta} - \Theta\|_\infty := \max_{s,t} |\widehat{\Theta}_{st} - \Theta_{st}|$, and define the probability of error $\mathbb{P}_{\Theta(G)}[\|\widehat{\Theta} - \Theta\|_\infty \geq \delta/2]$. The maximal probability of error over the model class $\mathcal{G}_{p,d}(\lambda)$ is then defined as

$$p_{\mathrm{err}}(\widetilde{\psi}) := \max_{\phi_{\Theta(G)} \in \mathcal{G}_{p,d}(\lambda)} \mathbb{P}_{\Theta(G)}\big[ \|\widehat{\Theta} - \Theta\|_\infty \geq \delta/2 \big]. \tag{5}$$

Although the error metrics for graphical model selection and inverse covariance estimation are closely related, neither recovery guarantee is strictly stronger than the other. In particular, it is possible to recover the true graph (i.e. $\widehat{G} = G$) even when $\|\widehat{\Theta} - \Theta\|_\infty \geq \delta/2$, since the graph structure is determined only by which entries are zero. Conversely, it is also possible to recover an estimate satisfying $\|\widehat{\Theta} - \Theta\|_\infty < \delta/2$ and still fail to recover the true graph, if for instance there is a non-zero edge weight less than $\delta/2$.

With this set-up, our goal is to derive necessary conditions on the sample size $n(p, d, \lambda)$ for any decoder to reliably recover the underlying graph (or estimate the inverse covariance matrix). We say that recovery is asymptotically reliable over the graphical model class $\mathcal{G}_{p,d}(\lambda)$ if $p_{\mathrm{err}} \to 0$ as $n \to \infty$. Our analysis is high-dimensional in nature, in which the graph size $p$, maximum degree $d$, and minimum value $\lambda$ are all allowed to scale arbitrarily as the number of samples $n$ tends to infinity.

III. MAIN RESULTS AND CONSEQUENCES

In this section, we state our main results on the information-theoretic limits of Gaussian graphical model selection and inverse covariance estimation, and then discuss some of their consequences.

A. Graphical model selection

We begin with a set of necessary conditions for graphical model selection, applicable to any recovery method regardless of its computational complexity.

Theorem 1. Consider the class $\mathcal{G}_{p,d}(\lambda)$ of Gaussian Markov random fields with $\lambda \in [0, \tfrac{1}{2}]$. A necessary condition for asymptotically reliable graphical model selection over the class $\mathcal{G}_{p,d}(\lambda)$ is

$$n > \max\left\{ \frac{\log\binom{p-d}{2} - 1}{4\lambda^2},\; \frac{2\big(\log\binom{p}{d} - 1\big)}{\log\big(1 + \frac{d\lambda}{1-\lambda}\big) - \frac{d\lambda}{1 + (d-1)\lambda}} \right\}. \tag{6}$$

The proof of Theorem 1 (given in Section IV-B) constructs restricted ensembles of graphical models and then, viewing the observation process as a communication channel, applies Fano's inequality [11] in order to bound the probability of error.

The bounds in Theorem 1 capture how the sample size must grow with graph size $p$ and minimum value $\lambda$. In particular, in order for the sum of the edge weights in each neighborhood of the graph to stay bounded, the minimum value must scale as $\lambda = \Theta(1/d)$.
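To get a feel for the two terms in Theorem 1, the sketch below evaluates them numerically, assuming the necessary condition (6) reads $n > \max\{(\log\binom{p-d}{2} - 1)/(4\lambda^2),\ 2(\log\binom{p}{d} - 1)/(\log(1 + \frac{d\lambda}{1-\lambda}) - \frac{d\lambda}{1+(d-1)\lambda})\}$ with natural logarithms. The function names and the example values of $p$, $d$, $\lambda$ are hypothetical.

```python
import math

def log_binom(n, k):
    """Natural log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def theorem1_lower_bound(p, d, lam):
    """Return (first term, second term, max) of the assumed form of (6)."""
    if not (0.0 < lam <= 0.5):
        raise ValueError("lam must lie in (0, 1/2] for the bound to be finite")
    if d < 1 or p - d < 2:
        raise ValueError("need d >= 1 and p - d >= 2")
    # First term: single-edge ensemble, grows like log(p - d) / lam^2.
    first = (log_binom(p - d, 2) - 1.0) / (4.0 * lam ** 2)
    # Second term: d-clique ensemble; the denominator is positive for lam > 0.
    denom = math.log1p(d * lam / (1.0 - lam)) - d * lam / (1.0 + (d - 1) * lam)
    second = 2.0 * (log_binom(p, d) - 1.0) / denom
    return first, second, max(first, second)

# Regime lam = 1/d: the first term scales like d^2 * log(p - d).
first, second, bound = theorem1_lower_bound(p=1000, d=10, lam=0.1)
print(round(first, 1), round(second, 1))
```

The first term carries the $1/\lambda^2$ dependence, so with $\lambda = 1/d$ it grows quadratically in the degree; the second term depends on $\lambda$ only through $d\lambda$.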
When $\lambda = \Theta(1/d)$, the first bound in Theorem 1 implies that the sample size must scale as $n = \Omega(d^2 \log(p - d))$. For any constant $\lambda \in (0, 1/2]$, the second bound in Theorem 1 scales as $n = \Omega\big( \frac{d \log(p/d)}{\log(1 + d\lambda)} \big)$. Moreover, it implies that $n = \Omega(d^{1-\epsilon} \log(\frac{p}{d}))$ for any $\epsilon > 0$.

The information-theoretic bounds in Theorem 1 can be compared with previous work on polynomial-time methods for consistent graph selection. In particular, Ravikumar et al. [7] showed that a sufficient condition for $\ell_1$-regularized maximum likelihood to consistently estimate the underlying graph is $n = \Omega((d^2 + \lambda^{-2}) \log p)$. In the regime in which $\lambda = \Theta(1/d)$, this scaling matches the information-theoretic bounds in Theorem 1, showing that a polynomial-time method achieves the optimal rates (up to constant factors).

B. Inverse covariance estimation

We now state some necessary conditions for the closely related problem of inverse covariance estimation. Recall that $\|A\|_\infty := \max_{i,j} |A_{ij}|$ denotes the elementwise $\ell_\infty$-norm applied to a matrix.

Theorem 2. Consider the class of Gaussian Markov random fields $\mathcal{G}_{p,d}(\lambda)$. If there exists an estimator such that $\mathbb{P}[\|\widehat{\Theta} - \Theta\|_\infty < \delta/2] \geq 1/2$ uniformly over choices from $\mathcal{G}_{p,d}(\lambda)$, then we must have

$$n > \frac{\log(\,\cdots)\;\cdots}{\cdots\;\delta^{2}}. \tag{7}$$

The proof of Theorem 2, given in Section IV-C, is based on constructing restricted ensembles of graphical models with minimum separation $\delta$, and then applying Fano's inequality [11] to bound the probability of decoding error in distinguishing between such models.

Theorem 2 captures how the sample size must grow with the minimum separation between models $\delta$. A consequence of Theorem 2 is that if the recovery error decays at rate $\delta = 1/d$, then the sample size must scale as $n > d^{2}\,(\log(\,\cdots)\;\cdots)\,/\,\cdots$. Furthermore, Theorem 2 implies that the same necessary condition holds for inverse covariance estimation with other error metrics as well. In particular, let $\|A\|_F := \big( \sum_{i,j} A_{ij}^2 \big)^{1/2}$ denote the Frobenius norm.

Corollary 1. A necessary condition for asymptotically reliable inverse covariance estimation, with recovery error at most $\delta/2$ measured in the Frobenius norm, is $n > \frac{\log(\,\cdots)\;\cdots}{\cdots\;\delta^{2}}$, the same bound as in (7).
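One way to see why a Frobenius-norm guarantee is covered by the elementwise result is the standard comparison between the two norms. The derivation below is a sketch of that comparison, offered as a plausible reading and not necessarily the authors' exact argument.

```latex
\|A\|_\infty \;=\; \max_{i,j} |A_{ij}|
  \;\le\; \Big( \sum_{i,j} A_{ij}^2 \Big)^{1/2} \;=\; \|A\|_F
\qquad\Longrightarrow\qquad
\big\{ \|\widehat{\Theta} - \Theta\|_F < \delta/2 \big\}
  \;\subseteq\;
\big\{ \|\widehat{\Theta} - \Theta\|_\infty < \delta/2 \big\}.
```

Any estimator that is accurate to within $\delta/2$ in Frobenius norm is therefore also accurate to within $\delta/2$ elementwise, so a sample size that is necessary for the elementwise guarantee is also necessary for the Frobenius one.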
The necessary condition in Theorem 2 can be compared to known sufficient conditions for $\ell_1$-regularized maximum likelihood to consistently estimate the inverse covariance matrix. Ravikumar et al. [7] showed that if the sample size satisfies $n > c\,d^2 \log p$ for some constant $c > 0$, then with probability going to one, the $\ell_1$-regularized maximum likelihood method produces an estimate $\widehat{\Theta}$ satisfying $\|\widehat{\Theta} - \Theta\|_\infty = O\big( \sqrt{\log p / n} \big)$. Consequently, the performance of the polynomial-time algorithm in [7] matches the scaling of the information-theoretic bound in Theorem 2.

IV. PROOF SKETCHES

In this section, we describe our general framework for deriving necessary conditions for consistent graphical model selection and inverse covariance estimation. Our methods are information-theoretic in nature, inspired by techniques that have been used to derive minimax bounds in nonparametric estimation (e.g., [12], [13]).

A. Fano's method

Our general approach is to construct restricted ensembles of graphical models, and then use Fano's method to lower bound the probability of error in each restricted ensemble. Consider a restricted ensemble $\widetilde{\mathcal{G}}$ consisting of $M = |\widetilde{\mathcal{G}}|$ models, and let the model index $\widetilde{\theta}$ be chosen uniformly at random from $\{1, \ldots, M\}$. Given the observations $\widetilde{X}^n \in \mathbb{R}^{n \times \nu}$, the decoder $\widetilde{\psi}$ estimates the underlying graph structure with maximal probability of decoding error defined as

$$p_{\mathrm{err}}(\widetilde{\psi}) = \max_{j = 1, \ldots, M} \mathbb{P}_{\Theta(\widetilde{G}_j)}\big[ \widetilde{\psi}(\widetilde{X}^n) \neq \widetilde{G}_j \big]. \tag{8}$$

By Fano's inequality [11], the maximal probability of error over $\widetilde{\mathcal{G}}$ can be lower bounded as

$$p_{\mathrm{err}}(\widetilde{\psi}) \geq 1 - \frac{I(\widetilde{\theta}; \widetilde{X}^n) + 1}{\log M}. \tag{9}$$

In order to make use of the Fano bound, the key is to design ensembles of models for which $\log M$ is large, while the mutual information $I(\widetilde{\theta}; \widetilde{X}^n)$ is relatively small. Since it is typically difficult to evaluate the mutual information exactly, we discuss some upper bounds on it.

Entropy-based bound: Define the averaged covariance matrix

$$\bar{\Sigma} := \frac{1}{M} \sum_{j=1}^{M} \Sigma(\widetilde{G}_j). \tag{10}$$

The mutual information is upper bounded by $I(\widetilde{\theta}; \widetilde{X}^n) \leq \frac{n}{2} F(\widetilde{\mathcal{G}})$, where

$$F(\widetilde{\mathcal{G}}) := \log\det \bar{\Sigma} - \frac{1}{M} \sum_{j=1}^{M} \log\det \Sigma(\widetilde{G}_j). \tag{11}$$

KL-based bound: Let $P_j = f(\widetilde{X}^n \mid \widetilde{\theta} = j) = \phi^n(0, \Sigma(\widetilde{G}_j))$ denote the distribution of the $n$ i.i.d. samples under model $j$, for $j = 1, \ldots, M$. An alternative bound on the mutual information is given by

$$I(\widetilde{\theta}; \widetilde{X}^n) \leq \mathbb{E}_{\widetilde{\theta}}\big[ D(P_{\widetilde{\theta}} \,\|\, Q) \big] \tag{12}$$

for any distribution $Q$ over $\widetilde{X}^n$. Setting $Q = \phi^n(0, I_{\nu \times \nu})$, the KL distance can be expressed as

$$D(P_j \,\|\, Q) = \frac{n}{2} \Big\{ \log\det \Theta(\widetilde{G}_j) + \mathrm{trace}\big( \Sigma(\widetilde{G}_j) \big) - \nu \Big\}. \tag{13}$$

Note that we use natural logarithms ($\log = \log_e$) throughout this paper.

B. Analysis of graphical model selection

We now briefly outline the proofs of the necessary conditions in Theorem 1 on the sample size $n$ as a function of the number of vertices $p$, maximum degree $d$ and minimum value $\lambda$. We obtain two necessary conditions, which can be seen as end points of an entire family of bounds, by analyzing ensembles of graphs in which a subset $S$ of up to $d$ nodes form a clique (i.e. a fully connected subset), and the remaining nodes are all isolated.

Restricted ensemble A: We begin by deriving the first bound in Theorem 1, which captures how the sample size must grow with the minimum value $\lambda$. Consider a family of graphs on $p$ vertices, in which each edge set $E(S, T) = \{ (s, t) \mid s, t \in S \text{ or } s, t \in T \}$ defines a clique over a subset $S$ of size 2, and another clique over a disjoint subset $T$ of size $d$.
For a given graph $G = (V, E(S, T))$ and a parameter $a \geq 0$, we define the inverse covariance matrix $\Theta(G) := I + a\,\mathbf{1}_S \mathbf{1}_S^T + a\,\mathbf{1}_T \mathbf{1}_T^T$, where $\mathbf{1}_S$ and $\mathbf{1}_T$ are the indicator vectors of sets $S$ and $T$, respectively. The covariance matrix can then be computed as

$$\Sigma(G) = I - \frac{a}{1 + 2a}\, \mathbf{1}_S \mathbf{1}_S^T - \frac{a}{1 + da}\, \mathbf{1}_T \mathbf{1}_T^T. \tag{14}$$

The resulting class of graphical models is a subset of $\mathcal{G}_{p,d}(\lambda)$ if $\lambda^*(\Theta(G)) = \frac{a}{1+a} \geq \lambda$.

Suppose the decoder is given the indices of the $d$ vertices in $T$, and the parameter value $a$. Estimating the underlying graph structure $G$ now amounts to finding the remaining pair of nodes in $S$, out of $\binom{p-d}{2}$ possibilities. More precisely, given $(T, a)$, the decoder can extract the submatrix of observations $\widetilde{X}^n := (X^n)_{T^C} \in \mathbb{R}^{n \times (p-d)}$. When the original observations are sampled i.i.d. from the distribution $X^{(i)} \sim N(0, \Sigma)$, the modified observations are distributed according to $\widetilde{X}^{(i)} \sim N(0, \Sigma_{T^C T^C})$. Since the modified covariance matrix is of the form

$$\Sigma(\widetilde{G}) := \Sigma_{T^C T^C} = I - \frac{a}{1 + 2a}\, \mathbf{1}_S \mathbf{1}_S^T, \tag{15}$$

the inverse covariance matrix becomes

$$\Theta(\widetilde{G}) = \big( \Sigma(\widetilde{G}) \big)^{-1} = I + a\, \mathbf{1}_S \mathbf{1}_S^T. \tag{16}$$

Note that the underlying graph associated with $\Theta(\widetilde{G})$ is $\widetilde{G} := G \setminus T$ (i.e. the graph obtained by removing the vertices in set $T$ and all edges connected to $T$ from graph $G$). The remaining sub-problem is to determine, given the observations $\widetilde{X}^n$, the single-edge graph on $(p - d)$ vertices. Let $\widetilde{\mathcal{G}}$ denote the set of graphs on $(p - d)$ vertices with a single edge, and let $\widetilde{\mathcal{G}}(\lambda)$ denote the associated class of Gaussian Markov random fields with inverse covariance matrices defined as in (16). The proof then applies the Fano bound (9) over this restricted ensemble using the entropy-based bound on mutual information (11).

Restricted ensemble B: We now derive the second lower bound in Theorem 1 using an ensemble of $d$-clique graphs and the entropy-based bound on mutual information (11). Consider the ensemble of graphs consisting of edge sets $E(S) = \{ (s, t) \mid s, t \in S \}$ with $|S| = d$. For a given edge set $E(S)$ and parameter $a \geq 0$, define the inverse covariance matrix $\Theta(G) := I + a\, \mathbf{1}_S \mathbf{1}_S^T$, and its associated covariance matrix $\Sigma(G) = (\Theta(G))^{-1} = I - \frac{a}{1 + da}\, \mathbf{1}_S \mathbf{1}_S^T$. The cardinality of this restricted ensemble is $\binom{p}{d}$.
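The $d$-clique ensemble just described is small enough to enumerate for toy sizes, which gives a quick numerical check of the closed-form covariance and of the entropy-based quantity $F$. The sketch below assumes the Fano bound takes the form $p_{\mathrm{err}} \geq 1 - (\frac{n}{2} F + 1)/\log M$; the function names and the values $p = 8$, $d = 3$, $a = 0.25$ are hypothetical choices for illustration.

```python
import itertools
import math
import numpy as np

def indicator(p, subset):
    """Indicator vector 1_S of a vertex subset S."""
    ind = np.zeros(p)
    ind[list(subset)] = 1.0
    return ind

def clique_precision(p, subset, a):
    """Theta = I + a * 1_S 1_S^T for the clique on S."""
    ind = indicator(p, subset)
    return np.eye(p) + a * np.outer(ind, ind)

def clique_covariance(p, subset, a):
    """Closed form Sigma = I - a / (1 + |S| a) * 1_S 1_S^T."""
    ind = indicator(p, subset)
    return np.eye(p) - a / (1.0 + len(subset) * a) * np.outer(ind, ind)

def entropy_bound_F(covariances):
    """F = log det(average Sigma) - average of log det Sigma over the ensemble."""
    mean_cov = np.mean(covariances, axis=0)
    logdet_mean = np.linalg.slogdet(mean_cov)[1]
    mean_logdet = np.mean([np.linalg.slogdet(c)[1] for c in covariances])
    return float(logdet_mean - mean_logdet)

def fano_lower_bound(n, F, M):
    """Assumed Fano form: p_err >= 1 - (n/2 * F + 1) / log M, clipped at zero."""
    return max(0.0, 1.0 - (0.5 * n * F + 1.0) / math.log(M))

p, d, a = 8, 3, 0.25
subsets = list(itertools.combinations(range(p), d))  # all C(p, d) cliques of size d
covs = [clique_covariance(p, s, a) for s in subsets]
M = len(subsets)
F = entropy_bound_F(covs)
# Analytic upper bound on F obtained from log det(A) <= trace(A - I).
F_upper = math.log1p(d * a) - d * a / (1.0 + d * a)
```

With $\lambda = a/(1+a)$ the analytic bound `F_upper` equals $\log(1 + \frac{d\lambda}{1-\lambda}) - \frac{d\lambda}{1+(d-1)\lambda}$, the denominator of the second term in Theorem 1. The Fano bound stays above zero only while $\frac{n}{2}F + 1 < \log M$.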
The proof then follows by applying Fano's inequality (9) using the entropy-based bound (11).

C. Analysis for inverse covariance estimation

In this section, we provide the basic intuition underlying the proof of Theorem 2. We derive a set of necessary conditions for inverse covariance estimation using an ensemble of graphical models which share the same underlying graph, but vary by perturbing a single edge weight. These bounds capture the difficulty of distinguishing between models with inverse covariance matrices that are $\delta$-close, e.g. in the elementwise $\ell_\infty$-norm. Note that for any two models $\Theta^{(i)}$ and $\Theta^{(j)}$ in our ensemble, since $\|\Theta^{(i)} - \Theta^{(j)}\|_\infty = \delta$ by construction, there does not exist a matrix $\Theta$ satisfying both $\|\Theta - \Theta^{(i)}\|_\infty < \delta/2$ and $\|\Theta - \Theta^{(j)}\|_\infty < \delta/2$. Consequently, we can apply Fano's inequality (9) to bound the probability of error in the restricted ensemble, and the problem is reduced to bounding the mutual information between the model index and the observations.

Alternate KL bound: We begin by stating a variant of the KL-based bound on mutual information in (13), using KL distances between all pairs of models in the class, instead of KL distances between each model and the standard Gaussian distribution.

Pairwise KL-based bound: We define the symmetrized Kullback-Leibler divergence,

$$S(P_i \,\|\, P_j) := D(P_i \,\|\, P_j) + D(P_j \,\|\, P_i). \tag{17}$$

By convexity of the KL divergence, we have the following bound on mutual information:

$$I(\widetilde{\theta}; \widetilde{X}^n) \leq \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=i+1}^{M} S(P_i \,\|\, P_j). \tag{18}$$

For Gaussian Markov random fields, a straightforward calculation shows that the symmetrized KL distance is equal to

$$S(P_i \,\|\, P_j) = \frac{n}{2} \sum_{l=1}^{p} \sum_{m=1}^{p} \big( \Theta^{(i)}_{lm} - \Theta^{(j)}_{lm} \big) \big( \Sigma^{(j)}_{lm} - \Sigma^{(i)}_{lm} \big). \tag{19}$$

Restricted ensemble C: We now use these methods to derive necessary conditions for inverse covariance estimation (stated in Theorem 2), which capture how the sample size must grow with the minimum separation between models $\delta$. Consider a graph on $p$ vertices consisting of $\frac{p}{d+1}$ cliques, where each clique is of size $(d + 1)$. Let $N = \frac{p}{d+1}$, and let $\{S_1, \ldots, S_N\}$ denote the $N$ cliques with $|S_i| = d + 1$. We define the inverse covariance matrix associated with this graph as

$$\Theta := I + a \sum_{i=1}^{N} \mathbf{1}_{S_i} \mathbf{1}_{S_i}^T, \tag{20}$$

for some parameter $a \geq 0$. From this base model, we generate an ensemble of Gaussian Markov random fields in which each model perturbs the weight associated with one edge. Thus the model obtained by perturbing the weight on edge $(s, t)$ is defined by the inverse covariance matrix $\Theta^{(i)} := \Theta + \delta\,( \mathbf{1}_{st} \mathbf{1}_{st}^T - I_{st} )$ for some parameter $\delta \in (0, 1]$. Note that we are using $( \mathbf{1}_{st} \mathbf{1}_{st}^T - I_{st} )$ to denote the matrix with ones in locations $(s, t)$ and $(t, s)$, and zeros elsewhere.
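The symmetrized KL identity used for this ensemble can be checked numerically on a toy instance. The sketch below works per sample (for $n$ i.i.d. samples both quantities scale by $n$) and compares the elementwise-sum formula with a direct evaluation of the two Gaussian KL divergences. The function names and the values $p = 6$, $d = 2$, $a = 0.2$, $\delta = 0.1$ are hypothetical.

```python
import numpy as np

def base_precision(p, d, a):
    """Theta = I + a * sum_i 1_{S_i} 1_{S_i}^T for p / (d + 1) disjoint cliques of size d + 1."""
    assert p % (d + 1) == 0, "p must be a multiple of d + 1"
    theta = np.eye(p)
    for start in range(0, p, d + 1):
        block = slice(start, start + d + 1)
        theta[block, block] += a  # adds a to every entry of the clique block
    return theta

def perturb_edge(theta, s, t, delta):
    """Add delta to entries (s, t) and (t, s), leaving all other entries unchanged."""
    out = theta.copy()
    out[s, t] += delta
    out[t, s] += delta
    return out

def kl_gaussian(theta_p, theta_q):
    """Per-sample D(P || Q) for zero-mean Gaussians given their inverse covariances."""
    k = theta_p.shape[0]
    sigma_p = np.linalg.inv(theta_p)
    logdet_p = np.linalg.slogdet(theta_p)[1]
    logdet_q = np.linalg.slogdet(theta_q)[1]
    return 0.5 * (np.trace(theta_q @ sigma_p) - k + logdet_p - logdet_q)

def symmetrized_kl_elementwise(theta_i, theta_j):
    """Per-sample 0.5 * sum_{l,m} (Theta_i - Theta_j)_{lm} * (Sigma_j - Sigma_i)_{lm}."""
    sigma_i = np.linalg.inv(theta_i)
    sigma_j = np.linalg.inv(theta_j)
    return 0.5 * float(np.sum((theta_i - theta_j) * (sigma_j - sigma_i)))

p, d, a, delta = 6, 2, 0.2, 0.1
base = base_precision(p, d, a)
theta_i = perturb_edge(base, 0, 1, delta)   # perturb an edge in the first clique
theta_j = perturb_edge(base, 3, 4, delta)   # perturb an edge in the second clique

direct = kl_gaussian(theta_i, theta_j) + kl_gaussian(theta_j, theta_i)
closed_form = symmetrized_kl_elementwise(theta_i, theta_j)
separation = float(np.max(np.abs(theta_i - theta_j)))
```

The two perturbed models differ by exactly $\delta$ in the elementwise $\ell_\infty$-norm, which is the separation property the Fano argument relies on.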
The resulting ensemble of graphical models has cardinality $M = \frac{p}{d+1} \binom{d+1}{2}$. The proof then computes the KL-based bound on mutual information in (19) and applies Fano's inequality (9).

V. DISCUSSION

In this paper, we have studied the information-theoretic limits of Gaussian graphical model selection and inverse covariance estimation in the high-dimensional setting. Our analysis yields a set of necessary conditions for consistent graph selection with any method, which matches the scaling of known sufficient conditions [7] for $\ell_1$-regularized maximum likelihood in regimes in which the minimum value scales as $\lambda = \Theta(1/d)$. The tightness of the bounds in other regimes of $\lambda$ is an interesting open question. Furthermore, we derived a set of necessary conditions for inverse covariance estimation, which similarly matches the performance of polynomial-time recovery methods [7]. Our results consider recovery in the elementwise $\ell_\infty$ and Frobenius norms; the tightness of the necessary conditions for recovery in other norms is an interesting open question. At a high level, our analysis is based on a general framework for deriving information-theoretic bounds in which we view the observation process as a communication channel, and may be applicable to other problems as well.

Acknowledgment

The work of WW and KR was supported by NSF grant CCF and AFOSR grant FA. The work of MJW was supported by NSF grants CAREER-CCF and AFOSR-09NL184.

REFERENCES

[1] S. L. Lauritzen, Graphical Models. Oxford: Oxford University Press, 1996.
[2] V. A. Marcenko and L. A. Pastur, "Distribution of eigenvalues for some sets of random matrices," Annals of Probability.
[3] M. Yuan and Y. Lin, "Model selection and estimation in the Gaussian graphical model," Biometrika, vol. 94, no. 1, pp. 19-35, 2007.
[4] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, no. 3, pp. 432-441, 2008.
[5] A. d'Aspremont, O. Banerjee, and L. E. Ghaoui, "First order methods for sparse covariance selection," SIAM Journal on Matrix Analysis and its Applications, vol. 30, no. 1, pp. 56-66, 2008.
[6] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu, "Sparse permutation invariant covariance estimation," Electronic Journal of Statistics, vol. 2, pp. 494-515, 2008.
[7] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu, "High-dimensional covariance estimation by minimizing $\ell_1$-penalized log-determinant divergence," Department of Statistics, UC Berkeley, Tech. Rep. 767, November 2008.
[8] N. Santhanam and M. J. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions," in International Symposium on Information Theory (ISIT), Toronto, Canada, July 2008.
[9] G. Bresler, E. Mossel, and A. Sly, "Reconstruction of Markov random fields from samples: Some easy observations and algorithms," UC Berkeley, Tech. Rep., arXiv.
[10] W. Wang, M. J. Wainwright, and K. Ramchandran, "Information-theoretic bounds on model selection for Gaussian Markov random fields," Department of Statistics, UC Berkeley, Tech. Rep., May 2010.
[11] T. Cover and J. Thomas, Elements of Information Theory. New York: John Wiley and Sons, 1991.
[12] B. Yu, "Assouad, Fano and Le Cam," Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam, pp. 423-435, 1997.
[13] Y. Yang and A. Barron, "Information-theoretic determination of minimax rates of convergence," Annals of Statistics, vol. 27, no. 5, pp. 1564-1599, 1999.
