High dimensional Ising model selection

Size: px

Start display at page:

Download "High dimensional Ising model selection"

Alexandra Fowler
5 years ago
Views:

1 High dimensional Ising model selection Pradeep Ravikumar UT Austin (based on work with John Lafferty, Martin Wainwright)

2 Sparse Ising model US Senate 109th Congress Banerjee et al, 2008 Estimate a sparse discrete graphical model (Banerjee et al. 08, Ravikumar, Wainwright, Lafferty 09.)

3 Binary Ising model p(x; θ,g)= 1 Z(θ) exp ( (s,t) E(G) θ st X s X t ) modeling the behavior of gases or magnets in statistical physics social network analysis, spatial statistics, computer vision X s { 1, +1}, s V (G)

4 General Graphical Models C X1 X2 X5 A X3 X4 B {X 1,X 3 } {X 4,X 5 } X 2 p(x) = 1 Z Ψ A(X A )Ψ B (X B )Ψ C (X C ) Set of distributions that factor according to a graph

5 Pairwise discrete graphical models p(x; θ,g)= 1 Z(θ) exp ( (s,t) E(G) θ st φ st (X s,x t ) ) φ st (x s,x t ) : arbitrary potential functions Ising x s x t Potts I(x s = x t ) Indicator I(x s,x t = j, k)

6 Samples from binary-valued pairwise MRFs Independence model θ st =0

7 Samples from binary-valued pairwise MRFs Medium coupling θ st 0.2

8 Samples from binary-valued pairwise MRFs Strong coupling θ st 0.8

9 Graphical model selection let G =(V, E) be an undirected graph on p = V vertices pairwise Markov random field: family of prob. distributions 1 P(x 1,..., x p ; θ) = Z(θ) exp { } θ st x s x t (s,t) E Problem of graph selection: given n independent and identically distributed (i.i.d.) samples of X =(X 1,..., X p ), identify the underlying graph structure

10 Ising Model Selection Given n observations x (1),..., x (n) drawn from above distribution, estimate graph G. Estimate: E(G) {(s, t) :θ st 0} Non-zero support set of θ

11 Sparse Model Estimation In many modern applications {images, signals, systems, networks} number of observations number of variables p n fmri images gene expression profiles social networks

12 Modern data settings Curse of dimensionality consistent procedures typically require p/n 0, sometimes even n>e p But if data has intrinsic structure with smaller dimension? Important to develop methods that leverage such structure Sparsity: many components or parameters are zero.

13 Sparsity l 0 quasi-norm l 1 norm l 2 norm [From Tropp, J. 2004] Sparsity: ell_0(params) is small Convex relaxation: ell_1(params) is small

14 Ising Model Estimation via MLE θ arg min θ (s,t) θ st [ 1 n n i=1 φ st (x (i) s,x (i) t ) ] log(z(θ)) + λ θ 1 Ê = {(s, t) : θ st 0} Optimization problem: ell_1 regularized maximum likelihood Intractable because of partition function Z(θ)

15 Global max. likelihood for discrete models? maximum likelihood for general graphical model in exponential family: θ = arg max θ st Ê[X s X t ] log Z(θ) θ R p }{{} (s,t) E empirical moments exact likelihood involves log partition function exp( 1 R log Z(θ) = n 2 xt Θx)dx exp ( (s,t) θ ) stx s x t x 1,+1 p for Gaussian RV for binary RV key consequence: likelihood computation is straightfoward for Gaussian MRFs (log-determinant) intractable for Ising models (binary pairwise MRFs) (Welsh, 1993)

16 Neighborhood Regression Estimate conditional distribution at each node instead of the joint distribution. Sparsity pattern of conditional distribution parameters: neighborhood structure in original graph. p(x r X V \r ; θ,g)= exp( t N(r) 2 θ rtx r X t ) exp( t N(r) 2 θ rtx r X t )+1 Don t have to worry about partition function!

17 Graph selection via neighborhood regression Observation: Recovering graph G equivalent to recovering neighborhood set N(s) for all s V. Method: Given n i.i.d. samples {X (1),...,X (n) }, perform logistic regression of each node X s on X \s := {X s,t s} to estimate neighborhood structure b N(s). 1 For each node s V, perform l 1 regularized logistic regression of X s on the remaining variables X \s : bθ[s] := arg min θ R p 1 ( 1 n nx i=1 f(θ; X (i) \s ) {z } logistic likelihood + ρ n θ 1 {z} ) regularization 2 Estimate the local neighborhood b N(s) as the support (non-negative entries) of the regression vector b θ[s]. 3 Combine the neighborhood estimates in a consistent manner (AND, or OR rule).

18 Empirical behavior: Unrescaled plots 1 Star graph; Linear fraction neighbors 0.8 Prob. success p = 64 p = p = Number of samples

19 Empirical behavior: Appropriately rescaled 1 Star graph; Linear fraction neighbors 0.8 Prob. success p = 64 p = p = Control parameter

20 High-dimensional analysis classical analysis: dimension p fixed, sample size n + high-dimensional analysis: allow both dimension p, sample size n, and maximum degree d to increase at arbitrary rates take n i.i.d. samples from MRF defined by G p,d study probability of success as a function of three parameters: Success(n, p, d) = P[Method recovers graph G p,d from n samples] theory is non-asymptotic: explicit probabilities for finite (n, p, d)

21 Sufficient conditions for consistent model selection graph sequences G p,d =(V, E) with p vertices, and maximum degree d. edge weights θ st θ min for all (s, t) E draw n i.i.d, samples, and analyze prob. success indexed by (n, p, d) Theorem Under incoherence conditions, for a rescaled sample size θ LR (n, p, d) := n d 3 log p > θ crit and regularization parameter ρ n c 1 τ than 1 2 exp ( c 2 (τ 2) log p ) 1: log p n (RavWaiLaf06), then with probability greater (a) Uniqueness: For each node s V, the l 1 -regularized logistic convex program has a unique solution. (Non-trivial since p n = not strictly convex). (b) Correct exclusion: The estimated sign neighborhood N(s) correctly excludes all edges not in the true neighborhood.

22 Sufficient conditions for consistent model selection graph sequences G p,d =(V, E) with p vertices, and maximum degree d. edge weights θ st θ min for all (s, t) E draw n i.i.d, samples, and analyze prob. success indexed by (n, p, d) Theorem Under incoherence conditions, for a rescaled sample size θ LR (n, p, d) := n d 3 log p > θ crit and regularization parameter ρ n c 1 τ than 1 2 exp ( c 2 (τ 2) log p ) 1: log p n (RavWaiLaf06), then with probability greater (a) Uniqueness: For each node s V, the l 1 -regularized logistic convex program has a unique solution. (Non-trivial since p n = not strictly convex). (b) Correct exclusion: The estimated sign neighborhood N(s) correctly excludes all edges not in the true neighborhood. (c) Correct inclusion: For θ min c 3 τ dρ n, the method selects the correct signed neighborhood. Consequence: For θ min = Ω(1/d), it suffices to have n = Ω(d 3 log p).

23 Results for 8-grid graphs 1 8!nearest neighbor grid (attractive) 0.8 Prob. success p = 64 p = p = Control parameter 3 4 Prob. of success P[Ĝ = G] versus rescaled sample size θ LR(n, p, d 3 )= n d 3 log p

24 Assumptions Define Fisher information matrix of logistic regression: Q := E θ [ 2 f(θ ; X) ]. A1. Dependency condition: Bounded eigenspectra: C min λ min (Q SS), and λ max (Q SS) C max. λ max (E θ [XX T ]) D max. A2. Incoherence There exists an ν (0, 1] such that Q S c S(Q SS) 1, 1 ν. where A, := max i j A ij. bounds on eigenvalues are fairly standard incoherence condition: partly necessary (prevention of degenerate models) partly an artifact of l1 -regularization incoherence condition is weaker than correlation decay

25 Assumptions I A1: Bounded eigenvalues

26 Assumptions II Q S c S (Q SS) 1 < 1 Energy (log likelihood) is suitably aligned (Tibshirani, 96)

27 Discrete graphical models P(X) exp( s V exp ( φ s (X s )+ φ st (X s,x t )) (s,t) E θ s;j I[X s = j]+ s V,j (s,t) E,j,k θ st;jk I[X s = j, X t = k] ) E {(s, t) : θ st 0} Each edge has multiple parameters

28 Discrete graphical models P(X r = j X V \r ; θ,g)= exp(θ r;j + t N(r) k θ rt;jk I[X t = k]) l exp(θ r;l + t N(r) k θ rt;lk I[X t = k]) Estimate conditional distributions at each node Optimization problem: Ell_1-Ell_q block-regularized logistic regression

29 Summary and questions Recovering structure for non-pairwise / higher order Ising graphical models Node-wise regression ~ pseudo-likelihood (product of local conditional probabilities) approximation of partition function Other tractable approximations? Recoverable using our method only for certain parameter settings Can fundamental limits on parameters be characterized?

High-dimensional graphical model selection: Practical and information-theoretic limits

1 High-dimensional graphical model selection: Practical and information-theoretic limits Martin Wainwright Departments of Statistics, and EECS UC Berkeley, California, USA Based on joint work with: John