High-dimensional graphical model selection: Practical and information-theoretic limits


1 High-dimensional graphical model selection: Practical and information-theoretic limits
Martin Wainwright, Departments of Statistics and EECS, UC Berkeley, California, USA
Based on joint work with: John Lafferty (CMU), Pradeep Ravikumar (UC Berkeley), and Prasad Santhanam (University of Hawaii)
Supported by grants from the National Science Foundation and a Sloan Foundation Fellowship

2 Introduction
classical asymptotic theory of statistical inference: number of observations $n \to +\infty$, model dimension $p$ stays fixed
not suitable for many modern applications: {images, signals, systems, networks} are frequently large ($p \approx 10^3$ to $10^8$), and function/surface estimation enforces the limit $p \to +\infty$
interesting consequences: might have $p = \Theta(n)$ or even $p \gg n$
curse of dimensionality: frequently impossible to obtain consistent procedures unless $p/n \to 0$
can be saved by a lower effective dimensionality, due to some form of complexity constraint: sparse vectors; {sparse, structured, low-rank} matrices; structured regression functions; graphical models (Markov random fields)

3 What are graphical models?
Markov random field: random vector $(X_1,\ldots,X_p)$ with distribution factoring according to a graph $G = (V, E)$.
[Figure: example graph with vertex subsets A, B, C, D.]
Hammersley-Clifford Theorem: $(X_1,\ldots,X_p)$ being Markov w.r.t. $G$ implies the factorization
$\mathbb{P}(x_1,\ldots,x_p) \propto \exp\{\theta_A(x_A) + \theta_B(x_B) + \theta_C(x_C) + \theta_D(x_D)\}$.
studied/used in various fields: spatial statistics, language modeling, computational biology, computer vision, statistical physics, ...

4 Graphical model selection
let $G = (V, E)$ be an undirected graph on $p = |V|$ vertices
pairwise Markov random field: family of probability distributions
$\mathbb{P}(x_1,\ldots,x_p; \theta) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{(s,t) \in E} \langle \theta_{st}, \phi_{st}(x_s, x_t) \rangle \Big\}$
given $n$ independent and identically distributed (i.i.d.) samples of $X = (X_1,\ldots,X_p)$, identify the underlying graph structure
complexity constraint: restrict to the subset $\mathcal{G}_{d,p}$ of graphs with maximum degree $d$
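To make the pairwise family concrete, here is a minimal sketch of a small binary (Ising) special case, where $\phi_{st}(x_s, x_t) = x_s x_t$ and the partition function $Z(\theta)$ is computed by brute-force enumeration (feasible only for small $p$). The star graph, edge weights, and function names are illustrative assumptions, not taken from the talk; the samples drawn here are reused in the later sketches.

```python
import itertools
import numpy as np

def ising_joint(p, edges, theta):
    """Enumerate all {-1,+1}^p configurations of the pairwise Ising model
    P(x) = (1/Z) exp( sum_{(s,t) in E} theta_st * x_s * x_t ); return (configs, probs)."""
    configs = np.array(list(itertools.product([-1, 1], repeat=p)))
    energies = np.zeros(len(configs))
    for (s, t), th in zip(edges, theta):
        energies += th * configs[:, s] * configs[:, t]
    weights = np.exp(energies)
    return configs, weights / weights.sum()          # division by Z(theta)

def sample_ising(configs, probs, n, rng):
    """Draw n i.i.d. samples from the enumerated distribution."""
    return configs[rng.choice(len(configs), size=n, p=probs)]

# Illustrative example: a star graph on p = 5 nodes with hub 0 (maximum degree d = 4).
p, edges, theta = 5, [(0, 1), (0, 2), (0, 3), (0, 4)], [0.5, 0.5, 0.5, 0.5]
configs, probs = ising_joint(p, edges, theta)
X = sample_ising(configs, probs, n=500, rng=np.random.default_rng(0))
```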

5 Illustration: Voting behavior of US senators
Graphical model fit to voting records of US senators (Banerjee, El Ghaoui & d'Aspremont, 2008)

6 Some issues in high-dimensional inference
Consider some fixed loss function and a fixed error level δ.
Limitations of tractable algorithms: given particular (polynomial-time) algorithms, for what sample sizes n do they succeed/fail to achieve error δ? Given a collection of methods, when does more computation reduce the minimum number of samples needed?
Information-theoretic limitations: data collection as communication from nature to the statistician: what are the fundamental limitations of the problem (Shannon capacity)? When are known (polynomial-time) methods optimal? When are there gaps between polynomial-time methods and optimal methods?

7 Previous/on-going work on graph selection
exact solution for trees (Chow & Liu, 1967)
local testing-based approaches (e.g., Spirtes et al., 2000; Kalisch & Buhlmann, 2008)
methods for Gaussian MRFs: $\ell_1$-regularized neighborhood regression (e.g., Meinshausen & Buhlmann, 2005; Wainwright, 2006; Zhao, 2006); $\ell_1$-regularized log-determinant (e.g., Yuan & Lin, 2006; d'Aspremont et al., 2007; Friedman, 2008; Ravikumar et al., 2008)
methods for discrete MRFs: neighborhood-based search method (Bresler, Mossel & Sly, 2008); $\ell_1$-regularized logistic regression (Ravikumar et al., 2006, 2008)
information-theoretic approaches: pseudolikelihood and BIC criterion (Csiszar & Talata, 2006); information-theoretic limitations (Santhanam & Wainwright, 2008)

8 Markov property and neighborhood structure
$\underbrace{(X_r \mid X_{V \setminus r})}_{\text{Condition on full graph}} \;\stackrel{d}{=}\; \underbrace{(X_r \mid X_{N(r)})}_{\text{Condition on Markov blanket}}$
Markov properties encode neighborhood structure: [Figure: node $r$ with neighborhood $N(r) = \{s, t, u, v, w\}$.]
basis of pseudolikelihood method (Besag, 1974)

9 Practical method via neighborhood regression
Observation: recovering the graph $G$ is equivalent to recovering the neighborhood set $N(r)$ for all $r \in V$.
Method: given $n$ i.i.d. samples $\{X^{(1)},\ldots,X^{(n)}\}$, perform logistic regression of each node $X_r$ on $X_{\setminus r} := \{X_t,\; t \neq r\}$ to estimate the neighborhood structure $\hat N(r)$.
1. For each node $r \in V$, perform $\ell_1$-regularized logistic regression of $X_r$ on the remaining variables $X_{\setminus r}$:
$\hat\theta[r] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \underbrace{\tfrac{1}{n}\sum_{i=1}^{n} f(\theta; X^{(i)}_{\setminus r})}_{\text{logistic likelihood}} + \underbrace{\rho_n \|\theta\|_1}_{\text{regularization}} \Big\}$
2. Estimate the local neighborhood $\hat N(r)$ as the support (non-zero entries) of the regression vector $\hat\theta[r]$.
3. Combine the neighborhood estimates in a consistent manner (AND or OR rule).
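A minimal sketch of steps 1-3, leaning on scikit-learn's $\ell_1$-penalized logistic regression rather than a dedicated solver. The mapping C = 1/(n·ρ_n) between scikit-learn's C and the penalty ρ_n, the constant in the choice of ρ_n, and the reuse of the samples X and dimension p from the sketch above are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_neighborhoods(X, rho_n):
    """Steps 1-2: for each node r, l1-regularized logistic regression of X_r on X_{\\r};
    return the estimated neighborhood (support of the fitted coefficient vector)."""
    n, p = X.shape
    nbhd = []
    for r in range(p):
        Z = np.delete(X, r, axis=1)
        clf = LogisticRegression(penalty="l1", C=1.0 / (n * rho_n),
                                 solver="liblinear", fit_intercept=False)
        clf.fit(Z, X[:, r])
        others = [t for t in range(p) if t != r]
        nbhd.append({others[j] for j in np.nonzero(clf.coef_.ravel())[0]})
    return nbhd

def combine(nbhd, rule="AND"):
    """Step 3: merge per-node neighborhood estimates into an edge set (AND or OR rule)."""
    edges_hat = set()
    for r, nb in enumerate(nbhd):
        for t in nb:
            if rule == "OR" or r in nbhd[t]:
                edges_hat.add((min(r, t), max(r, t)))
    return edges_hat

rho_n = 0.5 * np.sqrt(np.log(p) / X.shape[0])   # rho_n ~ sqrt(log p / n); constant arbitrary
print(combine(select_neighborhoods(X, rho_n), rule="AND"))
```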

10 High-dimensional analysis
classical analysis: dimension $p$ fixed, sample size $n \to +\infty$
high-dimensional analysis: allow dimension $p$, sample size $n$, and maximum degree $d$ to increase at arbitrary rates
take $n$ i.i.d. samples from the MRF defined by $G_{p,d}$ and study the probability of success as a function of the three parameters:
Success$(n, p, d) = \mathbb{P}[\text{Method recovers graph } G_{p,d} \text{ from } n \text{ samples}]$
theory is non-asymptotic: explicit probabilities for finite $(n, p, d)$
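As a rough illustration of Success$(n, p, d)$, the following Monte Carlo loop estimates the probability of exact recovery at several sample sizes. It assumes the ising_joint/sample_ising and select_neighborhoods/combine helpers from the sketches above are in scope; the graph, the constant in ρ_n, and the trial count are arbitrary choices.

```python
import numpy as np

true_edges = {(min(s, t), max(s, t)) for (s, t) in edges}
rng = np.random.default_rng(1)
trials = 20

for n in [100, 200, 400, 800]:
    rho_n = 0.5 * np.sqrt(np.log(p) / n)
    hits = sum(combine(select_neighborhoods(sample_ising(configs, probs, n, rng), rho_n),
                       "AND") == true_edges
               for _ in range(trials))
    print(n, hits / trials)   # empirical P[method recovers G_{p,d} from n samples]
```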

11 Empirical behavior: Unrescaled plots
[Figure: probability of success versus raw sample size $n$ for a star graph with a linear fraction of neighbors; curves for $p = 64$, $p = 100$, $p = 225$.]

12 Empirical behavior: Appropriately rescaled
[Figure: probability of success versus the control parameter $T_{LR}(n, p, d)$ for a star graph with a linear fraction of neighbors; curves for $p = 64$, $p = 100$, $p = 225$.]

13 Sufficient conditions for consistent model selection
graph sequences $G_{p,d} = (V, E)$ with $p$ vertices and maximum degree $d$; draw $n$ i.i.d. samples, and analyze the probability of success indexed by $(n, p, d)$
Theorem (RavWaiLaf06, RavWaiLaf08): If the rescaled sample size satisfies
$T_{LR}(n, p, d) := \frac{n}{d^3 \log p} > T_{crit}$
and the regularization parameter satisfies $\rho_n \geq c_1 \tau \sqrt{\frac{\log p}{n}}$, then with probability greater than $1 - 2\exp\big(-c_2(\tau - 2)\log p\big) \to 1$:
(a) For each node $r \in V$, the $\ell_1$-regularized logistic convex program has a unique solution. (Non-trivial since $p \gg n$ implies the objective is not strictly convex.)
(b) The estimated sign neighborhood $\hat N_{\pm}(r)$ correctly excludes all edges not in the true neighborhood.
(c) For $\theta_{\min} \geq c_3 \tau^2 \sqrt{\frac{d \log p}{n}}$, the method selects the correct signed neighborhood.
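For concreteness, a small helper restating the rescaling and the regularization choice from the theorem; the constants $c_1$, $\tau$, and $T_{crit}$ are left unspecified on the slide, so the values below are placeholders.

```python
import numpy as np

def control_parameter(n, p, d):
    """Rescaled sample size T_LR(n, p, d) = n / (d^3 * log p)."""
    return n / (d ** 3 * np.log(p))

def reg_parameter(n, p, c1=1.0, tau=2.5):
    """A regularization level satisfying rho_n >= c1 * tau * sqrt(log p / n)."""
    return c1 * tau * np.sqrt(np.log(p) / n)

print(control_parameter(n=600, p=64, d=3), reg_parameter(n=600, p=64))
```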

14 Some challenges in distinguishing graphs
[Figure: two small graphs on vertices A, B, C, D illustrating "guilt by association" and "hidden interactions".]
Conditions on the Fisher information matrix $Q^* = \mathbb{E}[\nabla^2 f(\theta^*; X)]$:
A1. Bounded eigenspectra: $\lambda(Q^*_{SS}) \in [C_{\min}, C_{\max}]$.
A2. Mutual incoherence: there exists a $\nu \in (0,1]$ such that $\|Q^*_{S^c S}(Q^*_{SS})^{-1}\|_{\infty,\infty} \leq 1 - \nu$, where $\|A\|_{\infty,\infty} := \max_i \sum_j |A_{ij}|$.
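The two conditions can be checked empirically for a given node. The sketch below does so for a leaf of the star graph from the earlier sampling sketch, using the fact that for ±1 variables the conditional of $X_r$ given the rest is logistic with coefficients $2\theta_{rt}$, and replacing the expectation in $Q^*$ by a sample average; the node choice and variable names are illustrative assumptions.

```python
import numpy as np

r = 1                                    # a leaf of the star graph; true neighborhood {0}
others = [t for t in range(p) if t != r]
theta_star = np.zeros(p - 1)             # true logistic coefficients: 2*theta_rt for t ~ r
for (s, t), th in zip(edges, theta):
    if r in (s, t):
        theta_star[others.index(t if s == r else s)] = 2 * th

Z = np.delete(X, r, axis=1)
sigma = 1.0 / (1.0 + np.exp(-(Z @ theta_star)))
eta = sigma * (1.0 - sigma)                         # variance function of the logistic model
Q = (Z * eta[:, None]).T @ Z / Z.shape[0]           # empirical Fisher information

S = list(np.nonzero(theta_star)[0])
Sc = [j for j in range(p - 1) if j not in S]
Q_SS, Q_ScS = Q[np.ix_(S, S)], Q[np.ix_(Sc, S)]
print("A1 eigenvalues of Q_SS:", np.linalg.eigvalsh(Q_SS))
print("A2 incoherence ||Q_ScS Q_SS^{-1}||_inf:",
      np.abs(Q_ScS @ np.linalg.inv(Q_SS)).sum(axis=1).max())   # want <= 1 - nu
```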

15 Proof sketch: Primal-dual certificate
construct a candidate primal-dual pair $(\hat\theta, \hat z) \in \mathbb{R}^{p-1} \times \mathbb{R}^{p-1}$; this is a proof technique, not a practical algorithm!
(A) For a fixed node $r$ with $S = N(r)$, solve the restricted program
$\hat\theta = \arg\min_{\theta \in \mathbb{R}^{p-1},\, \theta_{S^c} = 0} \Big\{ \frac{1}{n}\sum_{i=1}^{n} f(\theta; X^{(i)}_{\setminus r}) + \rho_n \|\theta\|_1 \Big\}$,
thereby obtaining the candidate solution $\hat\theta = (\hat\theta_S, 0_{S^c})$.
(B) Choose $\hat z_S \in \mathbb{R}^{|S|}$ as an element of the subdifferential $\partial\|\hat\theta_S\|_1$.
(C) Using the optimality conditions of the original convex program, solve for $\hat z_{S^c}$ and check whether strict dual feasibility $|\hat z_j| < 1$ for all $j \in S^c$ holds.
Lemma: the full convex program recovers the neighborhood if and only if the primal-dual witness succeeds.
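A rough numerical sketch of the witness construction (a proof device, not an algorithm), again for a leaf node of the star graph and reusing X, p, edges from the earlier sketches; the scikit-learn solver, the mapping C = 1/(n·ρ_n), and the constant in ρ_n are illustrative assumptions. Step (A) solves the $\ell_1$-penalized program restricted to the true support S, and steps (B)/(C) recover ẑ on S^c from the stationarity condition and test strict dual feasibility.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

r = 1                                                 # fixed node; true support S = {hub 0}
others = [t for t in range(p) if t != r]
S = [others.index(t if s == r else s) for (s, t) in edges if r in (s, t)]
Sc = [j for j in range(p - 1) if j not in S]

n = X.shape[0]
y = (X[:, r] + 1) // 2                                # relabel {-1,+1} -> {0,1}
Z = np.delete(X, r, axis=1)
rho_n = 0.5 * np.sqrt(np.log(p) / n)

# (A) restricted program: l1-penalized logistic regression using only the columns in S
clf = LogisticRegression(penalty="l1", C=1.0 / (n * rho_n),
                         solver="liblinear", fit_intercept=False)
clf.fit(Z[:, S], y)
theta_hat = np.zeros(p - 1)
theta_hat[S] = clf.coef_.ravel()                      # candidate solution (theta_S, 0_{S^c})

# (B)/(C) stationarity of the full program, grad + rho_n * z_hat = 0, gives z_hat on S^c
mu = 1.0 / (1.0 + np.exp(-(Z @ theta_hat)))           # fitted P(X_r = 1 | rest)
grad = Z.T @ (mu - y) / n                             # gradient of the (1/n)-scaled logistic loss
z_hat_Sc = -grad[Sc] / rho_n
print("strict dual feasibility holds:", np.max(np.abs(z_hat_Sc)) < 1.0)
```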

16 Information-theoretic limits on graph selection
thus far: have exhibited a particular polynomial-time method that can recover the structure if $n > \Omega(d^3 \log(p - d))$
but... is this a good result? are there polynomial-time methods that can do better?
information theory can answer the question: is there an exponential-time method that can do better? (Santhanam & Wainwright, 2008)

17 Graph selection as channel coding
graphical model selection is an unorthodox channel coding problem: nature sends $G \in \mathcal{G}_{d,p} := \{\text{graphs on } p \text{ vertices, max. degree } d\}$ through the channel $G \to \mathbb{P}(X \mid G) \to X^{(1)},\ldots,X^{(n)}$
decoding problem: use the observations $\{X^{(1)},\ldots,X^{(n)}\}$ to correctly distinguish the codeword
channel capacity for graph decoding: balance between the log number of models, $\log M(p, d) = \Theta\big(pd\log\frac{p}{d}\big)$, and the relative distinguishability of different models

18 Necessary conditions for graph recovery
take Ising models $\mathbb{P}_{\theta(G)}$ from $\mathcal{G}_{d,p}(\lambda, \omega)$: graphs with $p$ nodes and maximum degree $d$, parameters $|\theta_{st}| \geq \lambda$ for all edges $(s, t)$, and maximum neighborhood weight $\omega = \max_{s \in V} \sum_{t \in N(s)} |\theta_{st}|$
take $n$ i.i.d. observations, and study the probability of success in terms of $(n, p, d)$
Theorem (Santhanam & W., 2008): Necessary conditions: for sample size
$n \leq \max\Big\{ \frac{\log p}{2\lambda\tanh(\lambda)}, \; \frac{\exp(\omega/2)\,\lambda\, d \log(pd)}{16 \sinh(\lambda)}, \; \frac{d}{8}\log\frac{p}{8d} \Big\}$,
the probability of error of any algorithm over $\mathcal{G}_{d,p}(\lambda, \omega)$ is at least 1/2.
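Purely to get a sense of scale, the three terms of the displayed lower bound can be evaluated directly; the sketch below does so (the specific values of p, d, λ, ω are arbitrary illustrations).

```python
import numpy as np

def sample_size_lower_bound(p, d, lam, omega):
    """Maximum of the three necessary-condition terms from the theorem above."""
    t1 = np.log(p) / (2 * lam * np.tanh(lam))
    t2 = np.exp(omega / 2) * lam * d * np.log(p * d) / (16 * np.sinh(lam))
    t3 = (d / 8) * np.log(p / (8 * d))
    return max(t1, t2, t3)

print(sample_size_lower_bound(p=1000, d=5, lam=0.2, omega=1.0))
```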

19 Some consequences
note that the neighborhood weight $\omega = \max_{s \in V} \sum_{t \in N(s)} |\theta_{st}|$ is at least $d\lambda$; hence, need at least $n > \frac{\exp(d\lambda/2)\,\lambda}{16 \sinh(\lambda)}\, d \log(pd)$ samples
if $\lambda = O(1/d)$, then need at least $n > \frac{\log p}{\lambda^2} = \Omega(d^2 \log p)$ samples
$\ell_1$-regularized logistic regression (LR) is order-optimal for constant degrees
for $d$ tending to infinity, there is a gap between optimal methods and $\ell_1$: any method requires $n = \Omega(d^2 \log p)$ samples, while the LR method is guaranteed to work with $n = \Omega(d^3 \log p)$ samples

20 Geometric intuition underlying proofs
[Figure: the true model surrounded by competing model classes at distances $D_1$ and $D_2$.]
Error probability controlled by two competing quantities:

Model type   | Log # models          | Distance scaling
Near-by      | $\log p$              | $c_2/\theta^2$
Intermediate | $d \log p$            | $\sinh(\theta)/(\theta \exp(\theta d))$
Far-away     | $pd\log\frac{p}{d}$   | $c_2\, p$

21 Summary and open questions
$\ell_1$-regularized regression to select neighborhoods: succeeds with sample size $n > (c_1 \theta_{\min}^{-2} + c_2 d^3)\log p$
any method (including those with exponential complexity) fails for $n < (c_3 \theta_{\min}^{-2} + c_4 d^2)\log p$
some extensions: non-binary MRFs via block-structured regularization schemes; non-i.i.d. sampling models; other performance metrics (e.g., a fraction $(1 - \delta)$ of edges correct)
broader issue: optimal trade-offs between statistical and computational efficiency?

22 Some papers
Ravikumar, P., Wainwright, M. J. and Lafferty, J. (2008). High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Appeared at the NIPS Conference (2006); to appear in Annals of Statistics.
Santhanam, P. and Wainwright, M. J. (2008). Information-theoretic limitations of high-dimensional graphical model selection. Presented at the International Symposium on Information Theory.
Wainwright, M. J. (2006). Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming. To appear in IEEE Transactions on Information Theory.
Wainwright, M. J. (2007). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. Technical report, Department of Statistics, UC Berkeley, January 2007.