High-dimensional Ising model selection using ℓ1-regularized logistic regression


1 High-dimensional Ising model selection using ℓ1-regularized logistic regression. Department of Statistics, Pennsylvania State University. 597 presentation.

2 Outline. 1. Introduction: Background and Motivation; Problem Formulation. 2. Theoretical results: Primal-dual witness for graph recovery; Analysis under sample Fisher matrix assumptions; Extensions to general discrete MRFs.

3 Background. Markov random fields (MRFs)/undirected graphical models. An MRF is defined on an undirected graph G = (V, E) with vertex set V = {1, 2, ..., p} and edge set E ⊆ V × V. The structure of the graph encodes conditional independence assumptions among subsets of X = (X_1, ..., X_p), where each X_s is associated with a vertex s ∈ V. One important problem: estimate the underlying graph from n i.i.d. samples {x^(1), x^(2), ..., x^(n)} drawn from the distribution specified by the MRF.

4 Previous work on MRF structure estimation. Constraint-based approaches: estimate conditional independence relations from the data using hypothesis testing. Score-based approaches: combine a metric for the complexity of the graph with a measure of goodness of fit, e.g., the log-likelihood at the maximum likelihood parameters. Drawback: computing the partition function associated with the MRF is computationally intractable.

5 Basic approach. Perform a separate ℓ1-regularized logistic regression for each node, then use the sparsity pattern of the regression vector to infer the underlying neighborhood structure. Notation: sample size n; model dimension p; maximum neighborhood size d. Both p and d may tend to infinity as functions of the sample size n. We focus on the binary Ising model and then generalize the method to general discrete MRFs.

6 Background. Pairwise Markov random fields. X = (X_1, X_2, ..., X_p) is a random vector, and G = (V, E) is an undirected graph with vertex set V = {1, ..., p} and edge set E. The pairwise MRF is the family of distributions of X of the form

P(x) ∝ exp{ Σ_{(s,t) ∈ E} φ_st(x_s, x_t) }.

Ising model. We focus on the special case of the Ising model, in which X_s ∈ {−1, +1} for each vertex s ∈ V and φ_st(x_s, x_t) = θ*_st x_s x_t with θ*_st ∈ R, so that

P_θ*(x) = (1/Z(θ*)) exp{ Σ_{(s,t) ∈ E} θ*_st x_s x_t }.   (1)
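As a concrete illustration of model (1), the sketch below enumerates all 2^p configurations of a small Ising model to compute the partition function Z(θ*) and the joint probabilities by brute force. The coupling matrix `theta` and the chain graph are made-up toy values, not taken from the slides.

```python
import itertools
import numpy as np

def ising_joint(theta):
    """Brute-force joint distribution P(x) = exp(sum_{s<t} theta[s,t] x_s x_t) / Z."""
    p = theta.shape[0]
    configs = np.array(list(itertools.product([-1, 1], repeat=p)))
    # Unnormalized log-probability of each configuration: sum over edges (s < t).
    energy = np.array([sum(theta[s, t] * x[s] * x[t]
                           for s in range(p) for t in range(s + 1, p))
                       for x in configs])
    weights = np.exp(energy)
    Z = weights.sum()                 # partition function Z(theta)
    return configs, weights / Z, Z

# Toy 4-node chain graph with coupling strength 0.5 (hypothetical values).
theta = np.zeros((4, 4))
for s, t in [(0, 1), (1, 2), (2, 3)]:
    theta[s, t] = theta[t, s] = 0.5

configs, probs, Z = ising_joint(theta)
print(Z, probs.sum())  # probs.sum() should be 1.0
```

Brute-force enumeration is only feasible for tiny p; it also makes concrete why the partition function Z(θ*) is the computational bottleneck noted on the previous slide.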

7 Graphical model selection. Goal: given X_1^n := {x^(1), ..., x^(n)}, a set of n samples where each x^(i) ∈ {−1, +1}^p is drawn i.i.d. from a distribution P_θ*, the goal of graphical model selection is to infer the edge set E (equivalently, to determine which θ*_st are nonzero). Stronger criterion of signed edge recovery: given a graphical model with parameter θ*, define the edge sign vector

E*_st = sign(θ*_st) if (s, t) ∈ E, and E*_st = 0 otherwise.   (2)

8 Motivation. The classical notion of statistical consistency concerns the limiting behavior of the estimator as n → ∞ with p fixed. The analysis in this paper is high-dimensional in nature: as n → ∞, both p = p(n) and d = d(n) also scale as functions of n. Our goal is to establish sufficient conditions on the scaling of the triple (n, p, d) such that the proposed estimator is consistent:

P[Ê_n = E*] → 1 as n → +∞.

9 Neighborhood-based logistic regression. Recovering E* is equivalent to recovering the signed neighborhood of each vertex r ∈ V,

N_±(r) := {sign(θ*_rt) t : t ∈ N(r)},   (3)

where N(r) is the neighborhood set of r, i.e., N(r) := {t ∈ V : (r, t) ∈ E}. N_±(r) can be recovered from the sign-sparsity pattern of the subvector θ*_\r := {θ*_ru, u ∈ V \ r}. The conditional distribution of X_r given all other variables X_\r is

P_θ*(x_r | x_\r) = exp(2 x_r Σ_{t ∈ V\r} θ*_rt x_t) / [exp(2 x_r Σ_{t ∈ V\r} θ*_rt x_t) + 1].   (4)
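A minimal sketch of the conditional distribution (4): for fixed x_\r it is exactly a logistic (sigmoid) function of the linear form Σ_t θ*_rt x_t, which is what justifies treating X_r as the response in a logistic regression. The variable names and coupling values below are illustrative.

```python
import numpy as np

def conditional_prob(theta_r, x_rest, x_r):
    """P(X_r = x_r | X_\\r = x_rest) for the Ising model, per equation (4)."""
    a = 2.0 * x_r * np.dot(theta_r, x_rest)   # 2 * x_r * sum_t theta_rt x_t
    return np.exp(a) / (np.exp(a) + 1.0)       # logistic sigmoid of a

# Example with made-up couplings to three other nodes; the two probabilities sum to 1.
theta_r = np.array([0.5, -0.3, 0.0])
x_rest = np.array([1, -1, 1])
print(conditional_prob(theta_r, x_rest, +1) + conditional_prob(theta_r, x_rest, -1))
```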

10 Neighborhood-based logistic regression. The conditional distribution (4) shows that X_r can be viewed as the response variable in a logistic regression in which the variables X_\r play the role of covariates. Estimation procedure: compute an ℓ1-regularized logistic regression of X_r on X_\r. Given a sample X_1^n, the regularized regression problem is the convex program

min_{θ_\r ∈ R^{p−1}} { (1/n) Σ_{i=1}^n f(θ; x^(i)) − Σ_{u ∈ V\r} θ_ru μ̂_ru + λ_n ||θ_\r||_1 },   (5)

where μ̂_ru := (1/n) Σ_{i=1}^n x_r^(i) x_u^(i) and

f(θ; x) := log[ exp(Σ_{t ∈ V\r} θ_rt x_t) + exp(−Σ_{t ∈ V\r} θ_rt x_t) ].   (6)
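The program (5) is an ordinary ℓ1-penalized logistic regression of X_r on X_\r, so a minimal sketch can lean on scikit-learn's LogisticRegression as an off-the-shelf stand-in for the exact objective. The mapping between sklearn's C parameter and λ_n, and the factor of 2 relating sklearn's coefficients to θ (from the ±1 encoding in (4)), are assumptions of this sketch, not the paper's prescription.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def neighborhood_fit(X, r, lam):
    """l1-regularized logistic regression of node r on all other nodes (cf. program (5)).

    X: n x p array with entries in {-1, +1}; lam: regularization level lambda_n.
    Returns an estimate of theta_{\\r}, the couplings of node r to the other nodes.
    """
    n, _ = X.shape
    y = X[:, r]
    Z = np.delete(X, r, axis=1)
    # sklearn minimizes C * sum_i log(1 + exp(-y_i w.z_i)) + ||w||_1.  Under the
    # reparameterization w = 2*theta suggested by (4), C = 2/(n*lam) roughly matches
    # objective (5); treat this mapping as an assumption.
    clf = LogisticRegression(penalty="l1", C=2.0 / (n * lam), solver="liblinear",
                             fit_intercept=False)
    clf.fit(Z, y)
    return clf.coef_.ravel() / 2.0

def signed_neighborhood(theta_hat, r, tol=1e-8):
    """Estimated signed neighborhood from the sign-sparsity pattern, as in (7)."""
    others = [u for u in range(len(theta_hat) + 1) if u != r]
    return {u: int(np.sign(c)) for u, c in zip(others, theta_hat) if abs(c) > tol}
```

Running `neighborhood_fit` for every r ∈ V and collecting the nonzero signed coefficients yields the node-wise neighborhood estimates used in the next slide.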

11 Neighborhood-based logistic regression. By Lagrangian duality, problem (5) can be re-cast as a constrained problem over the ball ||θ_\r||_1 ≤ C(λ_n). After obtaining the minimizer θ̂_\r^n, estimate the signed neighborhood N_±(r) according to

N̂_±(r) := {sign(θ̂_ru) u : u ∈ V \ r, θ̂_ru ≠ 0}.   (7)

The full graph G is estimated consistently, {Ê_n = E*}, if N̂_±(r) = N_±(r) for all r ∈ V.

12 Theoretical results. Our analysis proceeds in two steps. First, we establish sufficient conditions for correct signed neighborhood recovery for a fixed node r ∈ V. Second, we use a union bound over all p nodes of the graph to conclude that consistent graph selection is also achieved.

13 Assumptions. We next state the main result on the performance of neighborhood logistic regression for graphical model selection. Definitions. Population Fisher information matrix (the Hessian of the conditional log-likelihood):

Q*_r := E_θ*{ ∇² log P_θ*[X_r | X_\r] }.

S := {(r, t) : t ∈ N(r)} denotes the index set of the true neighborhood, and Q*_SS denotes the corresponding d × d sub-matrix of Q*. The minimum edge weight θ*_min := min_{(r,t) ∈ E} |θ*_rt| determines the limits of model selection.

14 Assumptions.

(A1) Dependency condition: Λ_min(Q*_SS) ≥ C_min and Λ_max(E_θ*[X_\r X_\r^T]) ≤ D_max.   (8)

(A2) Incoherence condition: ||Q*_{S^c S} (Q*_SS)^{−1}||_∞ ≤ 1 − α for some α ∈ (0, 1].   (9)

15 Theorem 1. Consider an Ising graphical model with parameter vector θ* and associated edge set E* such that conditions (A1) and (A2) are satisfied by Q*, and let X_1^n be a set of n i.i.d. samples from the model specified by θ*. Suppose that the regularization parameter λ_n is selected to satisfy

λ_n ≥ (16(2 − α)/α) √(log p / n).   (10)

Then there exist positive constants L and K, independent of (n, p, d), such that if

n > L d³ log p,   (11)

16 Theorem 1 (continued). Then the following properties hold with probability at least 1 − 2 exp(−K λ_n² n):
(a) For each node r ∈ V, the ℓ1-regularized logistic regression (5), given data X_1^n, has a unique solution, and so uniquely specifies a signed neighborhood N̂_±(r).
(b) For each r ∈ V, the estimated signed neighborhood N̂_±(r) correctly excludes all edges not in the true neighborhood. Moreover, it correctly includes all edges (r, t) for which |θ*_rt| ≥ (10/C_min) √d λ_n.
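The sketch below simply evaluates the regularization level required by (10) and the sample-size requirement (11) for given (n, p, d). The incoherence parameter α and the constant L are unknown in practice, so the values used here are placeholders.

```python
import numpy as np

def lambda_n(n, p, alpha):
    """Lower bound on the regularization parameter from condition (10)."""
    return 16.0 * (2.0 - alpha) / alpha * np.sqrt(np.log(p) / n)

def sample_size_ok(n, p, d, L=1.0):
    """Scaling condition (11): n > L * d^3 * log p, with L a placeholder constant."""
    return n > L * d**3 * np.log(p)

print(lambda_n(n=5000, p=100, alpha=0.5))   # regularization level for illustrative values
print(sample_size_ok(n=5000, p=100, d=4))   # whether the (placeholder) scaling holds
```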

17 Corollary 1. Consider a sequence of Ising models satisfying conditions (A1) and (A2), and suppose that (n, p(n), d(n)) satisfies the scaling condition (11). Suppose that the sequence {λ_n} of regularization parameters satisfies condition (10) as well as λ_n² n → ∞, and that the minimum parameter weights satisfy

min_{(r,t) ∈ E} |θ*_rt| ≥ (10/C_min) √d λ_n   (12)

for sufficiently large n. Then the method is model selection consistent: if Ê_{p(n)} is the graph structure estimated by the method, then P[Ê_{p(n)} = E*_{p(n)}] → 1.

18 Analysis under sample Fisher matrix assumptions. The analysis required to prove Theorem 1 divides naturally into two parts. First, if (A1) and (A2) hold for the sample Fisher information matrix

Q^n := Ê[η(X; θ*) X_\r X_\r^T] = (1/n) Σ_{i=1}^n η(x^(i); θ*) x_\r^(i) (x_\r^(i))^T,   (13)

then the conditions of Theorem 1 are sufficient to ensure that the graph is recovered with high probability. Second, under the scaling condition, imposing the assumptions on Q* guarantees (with high probability) that analogous conditions hold for Q^n.
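A sketch of the sample Fisher information matrix (13). The explicit variance weight η(x; θ) = 4 p (1 − p), with p = P(X_r = +1 | x_\r) from the logistic form of (4), is written out here as an assumption rather than copied from the slides; variable names are illustrative.

```python
import numpy as np

def sample_fisher(X, r, theta_r):
    """Sample Fisher information matrix Q^n from equation (13).

    X: n x p data in {-1,+1}; theta_r: length p-1 parameter vector for node r.
    """
    Z = np.delete(X, r, axis=1)                 # covariates X_\r
    a = 2.0 * Z @ theta_r                       # 2 * sum_t theta_rt x_t
    prob = 1.0 / (1.0 + np.exp(-a))             # P(X_r = +1 | x_\r), from (4)
    eta = 4.0 * prob * (1.0 - prob)             # conditional variance weights (assumed form)
    return (Z.T * eta) @ Z / X.shape[0]         # (1/n) sum_i eta_i z_i z_i^T
```

Checking the minimum eigenvalue of the S x S block and the incoherence quantity of this matrix is how one would test (A1) and (A2) empirically on a given sample.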

19 Primal-dual witness for graph recovery. Our proof involves the explicit construction of an optimal primal-dual pair. The zero sub-gradient optimality conditions take the form

∇l(θ̂) + λ_n ẑ = 0,   (14)

where the dual vector ẑ ∈ R^{p−1} must satisfy the properties

ẑ_rt = sign(θ̂_rt) if θ̂_rt ≠ 0, and |ẑ_rt| ≤ 1 otherwise.   (15)

A pair (θ̂, ẑ) is a primal-dual optimal solution to the convex program and its dual iff (14) and (15) are satisfied. Such an optimal primal-dual pair correctly specifies the signed neighborhood of node r iff

sign(ẑ_rt) = sign(θ*_rt) for all (r, t) ∈ S := {(r, t) : t ∈ N(r)}, and   (16)

θ̂_ru = 0 for all (r, u) ∈ S^c, the complement of S.   (17)
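A sketch that checks the zero-subgradient conditions (14)-(15) for a candidate solution: on the support the dual ẑ = −∇l(θ̂)/λ_n must equal sign(θ̂), and off the support it must be bounded by 1 in absolute value. The gradient below is written from the loss in (5)-(6); the tolerances and names are illustrative.

```python
import numpy as np

def check_kkt(X, r, theta_hat, lam, tol=1e-6):
    """Verify the zero-subgradient (KKT) conditions (14)-(15) for a candidate theta_hat."""
    n = X.shape[0]
    y = X[:, r]
    Z = np.delete(X, r, axis=1)
    # Gradient of the loss l(theta) in (5): (1/n) sum_i [tanh(theta . z_i) - y_i] z_i.
    grad = Z.T @ (np.tanh(Z @ theta_hat) - y) / n
    z_hat = -grad / lam                          # candidate dual vector from (14)
    on_support = np.abs(theta_hat) > tol
    ok_support = np.allclose(z_hat[on_support], np.sign(theta_hat[on_support]), atol=1e-3)
    ok_off = np.all(np.abs(z_hat[~on_support]) <= 1.0 + tol)
    return ok_support, ok_off
```

In the proof, strict dual feasibility off the support (|ẑ_ru| < 1 on S^c) is exactly the condition that triggers Lemma 1 on the next slide.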

20 Lemma 1. Shared sparsity among optimal solutions and uniqueness of the optimal solution. Lemma: Suppose that there exists an optimal primal solution θ̂ with associated optimal dual vector ẑ such that ||ẑ_{S^c}||_∞ < 1. Then any optimal primal solution θ̃ must have θ̃_{S^c} = 0. Moreover, if the Hessian sub-matrix [∇² l(θ̂)]_SS is strictly positive definite, then θ̂ is the unique optimal solution.

21 Proposition 1. Model selection consistency when the assumptions are imposed directly on Q^n. Define the good event

M(X_1^n) := {X_1^n ∈ {−1, +1}^{n×p} : Q^n satisfies (A1) and (A2)}.

Proposition 1 (fixed design). If the event M(X_1^n) holds, n > L d² log p, and λ_n ≥ (16(2 − α)/α) √(log p / n), then with probability at least 1 − 2 exp(−K λ_n² n) → 1, the following properties hold.
(a) For each node r ∈ V, the ℓ1-regularized logistic regression has a unique solution, and so uniquely specifies N̂_±(r).
(b) For each r ∈ V, the estimated signed neighborhood N̂_±(r) correctly excludes all edges not in the true neighborhood. Moreover, it correctly includes all edges with |θ*_rt| ≥ (10/C_min) √d λ_n.

22 Extensions to general discrete Markov random fields. Each variable X_s takes values in X = {1, 2, ..., m}. Define the indicator functions I[x_s = j] = 1 if x_s = j and 0 otherwise, and similarly the pairwise indicator functions I[x_s = j, x_t = k]. Define a matrix Θ_\r ∈ R^{(m−1)² × (p−1)}, where column u is given by the vector θ_ru. Solve the following convex program:

min_{Θ_\r ∈ R^{(m−1)² × (p−1)}} { l(Θ_\r; X_1^n) + λ_n ||Θ_\r||_{2,1} }.   (18)
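In the extension (18), the penalty is the block ℓ1/ℓ2 norm ||Θ_\r||_{2,1}: the sum over the other nodes u of the ℓ2 norm of the block of parameters linking r and u. A minimal sketch of that norm, with block shapes following the (m−1)² indicator parameterization described above; the toy matrix is illustrative.

```python
import numpy as np

def group_l21_norm(Theta):
    """||Theta||_{2,1}: sum over columns (one block per node u) of the column l2 norms."""
    return np.sum(np.linalg.norm(Theta, axis=0))

# Toy example: m = 3 states and p - 1 = 4 other nodes, so each column has (m-1)^2 = 4 entries.
Theta = np.arange(16.0).reshape(4, 4)
print(group_l21_norm(Theta))
```

Penalizing whole columns at once zeroes out all parameters linking r and u together, which is what makes the sparsity pattern interpretable as an estimated neighborhood in the multi-state case.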

23 Three different classes of graphs: four-nearest-neighbor lattices, eight-nearest-neighbor lattices, and star-shaped graphs.

24 Generate random data sets {x^(1), ..., x^(n)} by Gibbs sampling for the lattice models, and by exact sampling for the star graph. Examine the performance for models with mixed couplings θ*_st = ±ω or with positive couplings θ*_st = ω. The regularization parameter is chosen as λ_n ∝ √(log p / n), consistent with condition (10). Perform simulations for sample sizes n = 10 β d log p, where the control parameter β ranges from 0.1 to 3, depending on the graph type.
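A sketch of the Gibbs sampler used to generate data for the lattice models: each site is resampled in turn from its conditional distribution (4) given its (up to) four grid neighbors. The grid size, coupling ω, and number of sweeps below are placeholders, not the exact simulation settings of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sample_grid(k, omega, sweeps=200):
    """One sample from a k x k four-nearest-neighbor Ising model with coupling omega."""
    x = rng.choice([-1, 1], size=(k, k))
    for _ in range(sweeps):
        for i in range(k):
            for j in range(k):
                # Sum of couplings to the (up to) four grid neighbors.
                s = 0.0
                for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < k and 0 <= nj < k:
                        s += omega * x[ni, nj]
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * s))   # P(X_ij = +1 | neighbors), from (4)
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x.ravel()

# Example: a few samples from a p = 64 node grid (k = 8) with positive coupling omega = 0.25.
X = np.array([gibbs_sample_grid(8, 0.25) for _ in range(5)])
```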

25 Results for the 4-nearest-neighbor grid model. Three different graph sizes p ∈ {64, 100, 225}. Each curve plots the success probability versus the control parameter β; each point is the average of N = 200 trials.

26 Results for the 8-nearest-neighbor grid model.

27 Results for star-shaped graphs. Construct star-shaped graphs with p vertices: one node serves as the hub and is connected to d < p − 1 of the remaining nodes. For linear sparsity we choose d = ⌈0.1 p⌉; for logarithmic sparsity we choose d = ⌈log p⌉. A triple of graph sizes p ∈ {64, 100, 225}.
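A sketch of the star-graph construction just described: node 0 serves as the hub and is connected to d of the remaining p − 1 nodes, with d = ⌈0.1 p⌉ for linear sparsity or d = ⌈log p⌉ for logarithmic sparsity. The coupling value is a placeholder.

```python
import numpy as np

def star_graph_theta(p, sparsity="linear", omega=0.25):
    """Coupling matrix for a star graph: hub node 0 connected to d of the other nodes."""
    d = int(np.ceil(0.1 * p)) if sparsity == "linear" else int(np.ceil(np.log(p)))
    theta = np.zeros((p, p))
    for t in range(1, d + 1):          # connect the hub to the first d non-hub nodes
        theta[0, t] = theta[t, 0] = omega
    return theta, d

theta, d = star_graph_theta(64, "logarithmic")
print(d)  # ceil(log 64) = 5 hub neighbors
```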

28 Conclusion. A technique based on ℓ1-regularized logistic regression can be used to perform consistent model selection in binary Ising graphical models. Our analysis applies to the high-dimensional setting, in which both the number of nodes p and the maximum neighborhood size d are allowed to grow as a function of the number of observations n. Simulation results confirm the accuracy of these theoretical predictions. The methods of this paper can be extended to general discrete graphical models with a larger number of states.

29 Future work. Our experimental results are consistent with the conjecture that the logistic regression procedure fails with high probability for sample sizes n smaller than O(d log p); it would be interesting to prove such a converse result. Another direction is the case of samples drawn in a non-i.i.d. manner from some unknown Markov random field; we suspect that similar results would hold for weakly dependent sampling schemes.
