Graphical Model Selection

Trevor Hastie, Stanford University. Joint work with Jerome Friedman, Rob Tibshirani, Rahul Mazumder and Jason Lee. May 6, 2013.

Outline
- Undirected graphical models
- Gaussian models for quantitative variables
- Estimation with known structure
- Estimating the structure via L1 regularization
- Log-linear models for qualitative variables

Flow Cytometry
[Figure: estimated network over the proteins Raf, Mek, Plcg, PIP2, PIP3, Erk, Akt, Jnk, P38, PKC, PKA.]
11 proteins measured on 7466 cells. Shown is an estimated undirected graphical model, or Markov network. Raf and Jnk are conditionally independent, given the rest. PIP3 is independent of everything else, as is Erk. (Sachs et al., 2003). The model was estimated using the graphical lasso.

Undirected graphical models
[Figure: a simple three-node graph on X, Y, Z.]
Undirected graphical models represent the joint distribution of a set of variables. The dependence structure is represented by the presence or absence of edges. Pairwise Markov graphs represent densities having no higher than second-order dependencies (e.g. Gaussian).

Conditional Independence in Undirected Graphical Models
[Figure: four example graphs (a)-(d) on variables X, Y, Z, W.]
No edge joining X and Z means X is independent of Z given the rest. E.g. in (a), X and Z are conditionally independent given Y.

Gaussian graphical models
Suppose all the variables X = (X_1, ..., X_p) in a graph are Gaussian, with joint density X ~ N(μ, Σ). Let X = (Z, Y) where Y = X_p and Z = (X_1, ..., X_{p-1}). Then with

μ = (μ_Z, μ_Y),   Σ = ( Σ_ZZ  σ_ZY ; σ_ZY^T  σ_YY ),

we can write the conditional distribution of Y given Z (the rest) as

Y | Z = z ~ N( μ_Y + (z - μ_Z)^T Σ_ZZ^{-1} σ_ZY,  σ_YY - σ_ZY^T Σ_ZZ^{-1} σ_ZY ).

The regression coefficients β = Σ_ZZ^{-1} σ_ZY determine the conditional (in)dependence structure. In particular, if β_j = 0, then Y and Z_j are conditionally independent, given the rest.
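As a quick numerical illustration of the partitioned formulas above, the following sketch computes the conditional mean and variance of the last coordinate given the rest for a covariance matrix. This is a minimal example in Python/NumPy; the particular μ, Σ and z are made up purely for illustration.

```python
import numpy as np

# Hypothetical 3-variable Gaussian, for illustration only.
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.5, 0.0],
                  [0.3, 0.0, 1.0]])

# Partition X = (Z, Y) with Y the last coordinate.
Sigma_ZZ = Sigma[:-1, :-1]
sigma_ZY = Sigma[:-1, -1]
sigma_YY = Sigma[-1, -1]
mu_Z, mu_Y = mu[:-1], mu[-1]

# Regression coefficients beta = Sigma_ZZ^{-1} sigma_ZY.
beta = np.linalg.solve(Sigma_ZZ, sigma_ZY)

z = np.array([0.5, 0.0])                 # a value of Z to condition on
cond_mean = mu_Y + (z - mu_Z) @ beta     # E[Y | Z = z]
cond_var = sigma_YY - sigma_ZY @ beta    # Var[Y | Z = z]
print(cond_mean, cond_var)
# beta[j] == 0 would mean Y and Z_j are conditionally independent, given the rest.
```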

Inference through regression
Fit regressions of each variable X_j on the rest. Do variable selection to decide which coefficients should be zero. Meinshausen and Bühlmann (2006) use lasso regressions to achieve this (more later). Problem: in the Gaussian model, if X_j is conditionally independent of X_i, given the rest, then β_ji = 0. But then X_i is conditionally independent of X_j, given the rest, and β_ij = 0 as well. Regression methods don't honor this symmetry.

Θ = Σ^{-1} and conditional dependence structure

Σ = ( Σ_ZZ  σ_ZY ; σ_ZY^T  σ_YY ),   Θ = ( Θ_ZZ  θ_ZY ; θ_ZY^T  θ_YY )

Since ΣΘ = I, using partitioned inverses we get

θ_ZY = -θ_YY Σ_ZZ^{-1} σ_ZY = -θ_YY β_{Y|Z}.

Hence Θ contains all the conditional dependence information for the multivariate Gaussian model. In particular, any θ_ij = 0 implies conditional independence of X_i and X_j, given the rest.
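A small numerical check of this identity, in the same NumPy style: the last column of Θ = Σ^{-1} is the regression coefficient vector of X_p on the rest, scaled by -θ_YY. The random positive-definite Σ below is just a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)        # a random positive-definite covariance

Theta = np.linalg.inv(Sigma)           # precision matrix

# Regression of the last variable on the rest: beta = Sigma_ZZ^{-1} sigma_ZY.
beta = np.linalg.solve(Sigma[:-1, :-1], Sigma[:-1, -1])

# Partitioned-inverse identity: theta_ZY = -theta_YY * beta.
theta_ZY = Theta[:-1, -1]
theta_YY = Theta[-1, -1]
print(np.allclose(theta_ZY, -theta_YY * beta))   # True
```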

Estimating Θ by Gaussian Maximum Likelihood
Given a sample x_i, i = 1, ..., N, we can write down the Gaussian log-likelihood of the data:

l(μ, Σ; {x_i}) = -(N/2) log det(Σ) - (1/2) ∑_{i=1}^N (x_i - μ)^T Σ^{-1} (x_i - μ).

Partially maximizing w.r.t. μ we get μ̂ = x̄. Setting S = (1/N) ∑_{i=1}^N (x_i - x̄)(x_i - x̄)^T, we get (up to constants)

l(Θ; S) = log det Θ - trace(SΘ),

and (by some miracle)

dl(Θ; S)/dΘ = Θ^{-1} - S.

Hence Θ̂ = S^{-1} and of course Σ̂ = S.
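A minimal sketch of the unpenalized estimate on simulated data (NumPy, illustrative only): form S, take Θ̂ = S^{-1}, and evaluate the profiled log-likelihood log det Θ - trace(SΘ); the gradient Θ^{-1} - S vanishes at Θ̂ by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 4
X = rng.standard_normal((N, p))        # simulated data, just for illustration

xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / N      # ML sample covariance (divide by N)

Theta_hat = np.linalg.inv(S)           # unpenalized MLE of the precision matrix

def loglik(Theta, S):
    """Profiled Gaussian log-likelihood, up to constants."""
    sign, logdet = np.linalg.slogdet(Theta)
    return logdet - np.trace(S @ Theta)

print(np.allclose(np.linalg.inv(Theta_hat) - S, 0))   # gradient is zero: True
print(loglik(Theta_hat, S))
```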

Solving for Θ̂ through regression
We can solve for Θ̂ = S^{-1} one column at a time in the score equations Θ^{-1} - S = 0. Let W = Θ̂^{-1}. Suppose we solve for the last column of Θ. Using the partitioning as before, we can write

w_12 = -W_11 θ_12 / θ_22 = W_11 β,   with β = -θ_12/θ_22 (a (p-1)-vector).

Hence the score equation says

W_11 β - s_12 = 0.

This looks like an OLS estimating equation Z^T Z β = Z^T y.

Since W = Θ̂^{-1} = S, we have W_11 = S_11 and β̂ = S_11^{-1} s_12, the OLS regression coefficient of X_p on the rest. Again through partitioned inverses, we have that

θ̂_22 = 1/(s_22 - w_12^T β̂)

(the inverse mean squared residual). Hence we get θ̂_12 = -β̂ θ̂_22 from β̂. So with p regressions we construct Θ̂. This does not seem like such a big deal, because each of the regressions requires inverting a (p-1)×(p-1) matrix. The payoff comes when we restrict the regressions (next).
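To make the regression interpretation concrete, here is a short NumPy check (on a made-up covariance matrix) that the last column of S^{-1} is recovered from the OLS regression of X_p on the rest via β̂ = S_11^{-1} s_12, θ̂_22 = 1/(s_22 - s_12^T β̂) and θ̂_12 = -β̂ θ̂_22; in the unpenalized case W = S.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
S = A @ A.T + 5 * np.eye(5)            # stand-in for a sample covariance matrix

S11 = S[:-1, :-1]                      # covariance of "the rest"
s12 = S[:-1, -1]
s22 = S[-1, -1]

beta_hat = np.linalg.solve(S11, s12)    # OLS coefficients of X_p on the rest
theta22 = 1.0 / (s22 - s12 @ beta_hat)  # inverse mean squared residual
theta12 = -beta_hat * theta22

Theta = np.linalg.inv(S)
print(np.allclose(Theta[:-1, -1], theta12),
      np.allclose(Theta[-1, -1], theta22))   # True True
```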

Solving for Θ when the zero structure is known
We add Lagrange terms to the log-likelihood corresponding to the missing edges:

max_Θ  log det Θ - trace(SΘ) - ∑_{(j,k) ∉ E} γ_jk θ_jk

Score equations:

Θ^{-1} - S - Γ = 0

Γ is a matrix of Lagrange parameters with nonzero values for all pairs with edges absent. Can solve by regression as before, except now iteration is needed.

Graphical Regression Algorithm
With partitioning as before, given W_11 we need to solve

w_12 - s_12 - γ_12 = 0.

With w_12 = W_11 β, this is

W_11 β - s_12 - γ_12 = 0.

These are the score equations for a constrained regression, where
1. We use the current estimate W_11 rather than S_11 for the predictor covariance matrix.
2. We confine ourselves to the sub-system obtained by omitting the variables constrained to be zero:

W*_11 β* - s*_12 = 0.

We then fill in β̂ with β̂* (and zeros), replace w_12 ← W_11 β̂, and proceed to the next column. As we cycle around the columns, the W matrix changes, as do the regressions, until the system converges. We retain all the β̂s for each column in a matrix B. Only at convergence do we need to compute θ̂_22 = 1/(s_22 - w_12^T β̂) for each column, to recover the entire matrix Θ̂.

Simple four-variable example with known structure
[Figure: graph on X_1, ..., X_4 with edges (1,2), (1,4), (2,3), (3,4); edges (1,3) and (2,4) are absent.]

S = ( 10  1  5  4
       1 10  2  6
       5  2 10  3
       4  6  3 10 )

Σ̂ = ( 10.00  1.00  1.31  4.00
       1.00 10.00  2.00  0.87
       1.31  2.00 10.00  3.00
       4.00  0.87  3.00 10.00 ),

Θ̂ = (  0.12 -0.01  0.00 -0.05
      -0.01  0.11 -0.02  0.00
       0.00 -0.02  0.11 -0.03
      -0.05  0.00 -0.03  0.13 )
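The following sketch (Python/NumPy, written from the algorithm description above rather than from the authors' code, so treat it as illustrative) iterates the constrained regressions for this four-variable example. With the edges (1,3) and (2,4) removed it should reproduce Σ̂ and Θ̂ above up to rounding.

```python
import numpy as np

S = np.array([[10., 1., 5., 4.],
              [ 1.,10., 2., 6.],
              [ 5., 2.,10., 3.],
              [ 4., 6., 3.,10.]])
missing = {(0, 2), (1, 3)}             # absent edges (X1,X3), (X2,X4), 0-based

p = S.shape[0]
W = S.copy()                           # current estimate of Sigma-hat
B = np.zeros((p, p))                   # regression coefficients, column by column

for _ in range(100):                   # cycle over columns until convergence
    W_old = W.copy()
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        # positions among "rest" allowed to be nonzero (edge present)
        active = [i for i, k in enumerate(rest)
                  if (min(j, k), max(j, k)) not in missing]
        beta = np.zeros(p - 1)
        if active:
            W11 = W[np.ix_(rest, rest)]
            s12 = S[rest, j]
            # reduced system: W*_11 beta* = s*_12
            beta[active] = np.linalg.solve(W11[np.ix_(active, active)], s12[active])
        w12 = W[np.ix_(rest, rest)] @ beta
        W[rest, j] = w12
        W[j, rest] = w12
        B[rest, j] = beta
    if np.max(np.abs(W - W_old)) < 1e-10:
        break

# Recover Theta-hat at convergence.
Theta = np.zeros((p, p))
for j in range(p):
    rest = [k for k in range(p) if k != j]
    beta = B[rest, j]
    theta_jj = 1.0 / (S[j, j] - W[rest, j] @ beta)
    Theta[j, j] = theta_jj
    Theta[rest, j] = -beta * theta_jj

print(np.round(W, 2))      # should match Sigma-hat above
print(np.round(Theta, 2))  # should match Theta-hat above (zeros at missing edges)
```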

Estimating the graph structure using the lasso penalty
Use the lasso-regularized log-likelihood

max_Θ  log det Θ - trace(SΘ) - λ ‖Θ‖_1

with score equations

Θ^{-1} - S - λ Sign(Θ) = 0.

Solving column-wise leads as before to

W_11 β - s_12 + λ Sign(β) = 0.

Compare with the lasso problem

min_β  (1/2) ‖y - Zβ‖_2^2 + λ ‖β‖_1,

whose solution satisfies

Z^T Z β - Z^T y + λ Sign(β) = 0.

This leads to the graphical lasso algorithm.

Graphical Lasso Algorithm
1. Initialize W = S + λI. The diagonal of W remains unchanged in what follows.
2. Repeat for j = 1, 2, ..., p, 1, 2, ..., p, ... until convergence:
(a) Partition the matrix W into part 1, all but the jth row and column, and part 2, the jth row and column.
(b) Solve the estimating equations W_11 β - s_12 + λ Sign(β) = 0 using cyclical coordinate descent.
(c) Update w_12 = W_11 β̂.
3. In the final cycle (for each j) solve for θ̂_12 = -β̂ θ̂_22, with 1/θ̂_22 = w_22 - w_12^T β̂.

[Figure: estimated graphs for the flow-cytometry proteins (Raf, Mek, Plcg, PIP2, PIP3, Erk, Akt, Jnk, P38, PKC, PKA) at λ = 36, λ = 27, λ = 7 and λ = 0.]
Fit using the glasso package in R (on CRAN).

Cyclical coordinate descent
This is the Gauss-Seidel algorithm for solving the system

Wβ - s + λ Sign(β) = 0,

where Sign(β) = ±1 if β ≠ 0, and lies in [-1, 1] if β = 0. The ith row gives

w_ii β_i - (s_i - ∑_{j≠i} w_ij β_j) + λ Sign(β_i) = 0

β_i ← Soft(s_i - ∑_{j≠i} w_ij β_j, λ) / w_ii,

with Soft(z, λ) = sign(z)(|z| - λ)_+.
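Putting the last two slides together, here is a compact sketch of the graphical lasso in plain NumPy, written directly from the algorithm above rather than taken from the glasso package, so treat it as illustrative: an outer loop over columns, and an inner cyclical coordinate-descent solve of W_11 β - s_12 + λ Sign(β) = 0.

```python
import numpy as np

def soft(z, lam):
    """Soft-threshold operator: sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def graphical_lasso(S, lam, outer_iter=100, inner_iter=100, tol=1e-6):
    p = S.shape[0]
    W = S + lam * np.eye(p)               # initialize; diagonal stays fixed
    B = np.zeros((p, p - 1))              # beta for each column (warm starts)
    for _ in range(outer_iter):
        W_old = W.copy()
        for j in range(p):
            rest = [k for k in range(p) if k != j]
            W11 = W[np.ix_(rest, rest)]
            s12 = S[rest, j]
            beta = B[j].copy()
            # inner loop: cyclical coordinate descent on the lasso subproblem
            for _ in range(inner_iter):
                beta_old = beta.copy()
                for i in range(p - 1):
                    r = s12[i] - W11[i] @ beta + W11[i, i] * beta[i]
                    beta[i] = soft(r, lam) / W11[i, i]
                if np.max(np.abs(beta - beta_old)) < tol:
                    break
            B[j] = beta
            w12 = W11 @ beta
            W[rest, j] = w12
            W[j, rest] = w12
        if np.max(np.abs(W - W_old)) < tol * np.mean(np.abs(np.diag(S))):
            break
    # final cycle: recover Theta column by column
    Theta = np.zeros((p, p))
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        beta = B[j]
        theta_jj = 1.0 / (W[j, j] - W[rest, j] @ beta)
        Theta[j, j] = theta_jj
        Theta[rest, j] = -beta * theta_jj
    return Theta, W

# small illustration on a made-up sample covariance
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 8))
S = np.cov(X, rowvar=False)
Theta_hat, Sigma_hat = graphical_lasso(S, lam=0.2)
print(np.round(Theta_hat, 2))
```

For real problems one would of course use the glasso package in R mentioned above, or an equivalent routine such as sklearn.covariance.GraphicalLasso in Python.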

Other approaches
The graphical lasso scales as O(p² k), where k is the number of non-zero values of Θ; it can thus be O(p⁴) for dense problems.
Meinshausen and Bühlmann (2006) run a lasso regression of each X_j on all the rest, with different strategies for removing an edge (j,k); for example, remove the edge if both β̂_jk and β̂_kj are zero. Useful for p ≫ N problems, since the cost is O(p² N).
Can do as above, except constrain β_jk and β_kj jointly, using a group-lasso penalty on the pairs:

min_{β_1,...,β_p}  ∑_{j=1}^p ∑_{i=1}^N ( x_ij - ∑_{k≠j} x_ik β_jk )² + λ ∑_{k<j} √(β_jk² + β_kj²)
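A hedged sketch of the Meinshausen-Bühlmann approach using scikit-learn's Lasso (note that scikit-learn scales its penalty by 1/N, so alpha here is not on exactly the same scale as λ above); edges are kept with either an AND or an OR rule on the two coefficients. The data are random and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
N, p = 200, 10
X = rng.standard_normal((N, p))        # illustrative data

alpha = 0.1
Coef = np.zeros((p, p))                # Coef[j, k]: coefficient of X_k in the lasso for X_j
for j in range(p):
    rest = [k for k in range(p) if k != j]
    fit = Lasso(alpha=alpha).fit(X[:, rest], X[:, j])
    Coef[j, rest] = fit.coef_

# "AND" rule: keep edge (j, k) only if both coefficients are nonzero.
adj_and = (Coef != 0) & (Coef.T != 0)
# "OR" rule: keep edge (j, k) if either coefficient is nonzero.
adj_or = (Coef != 0) | (Coef.T != 0)
print(adj_and.sum() // 2, adj_or.sum() // 2)   # number of selected edges
```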

Same as above, except respect the symmetry between β_jk = -θ_jk/θ_jj and β_kj = -θ_kj/θ_kk with θ_jk = θ_kj:

min_Θ  (1/2) ∑_{j=1}^p [ N log δ_j + (1/δ_j) ∑_{i=1}^N ( x_ij - ∑_{k≠j} x_ik β_jk )² ] + λ ∑_{k<j} |θ_kj|

with β_jk = -θ_jk/θ_jj, β_kj = -θ_kj/θ_kk, θ_jk = θ_kj, and δ_j = 1/θ_jj.

Small simulation: p = 500, N = 500, λ chosen so that 25% of the θ_ij are nonzero. Timings in seconds:
Graphical lasso: 184
Meinshausen-Bühlmann: 12
Grouped paired lasso: 3
Symmetric paired lasso: 10

Qualitative Variables
With binary variables, the second-order Ising model. Corresponds to first-order interaction models in log-linear models. Conditional distributions are logistic regressions. Exact maximum-likelihood inference is difficult for large p; computations grow exponentially due to computation of the partition function. Approximations are based on lasso-penalized logistic regression (Wainwright et al, 2007). A symmetric version in Hoefling and Tibshirani (2008) uses the pseudo-likelihood.
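For binary data, the lasso-penalized logistic-regression approximation amounts to an L1-penalized logistic regression of each node on the rest; here is a minimal scikit-learn sketch on made-up binary data (the AND/OR symmetrization of the resulting coefficients is the same judgment call as in the Gaussian neighborhood-selection case).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
N, p = 300, 8
X = (rng.random((N, p)) < 0.5).astype(int)   # illustrative binary data

Coef = np.zeros((p, p))
for j in range(p):
    rest = [k for k in range(p) if k != j]
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(X[:, rest], X[:, j])
    Coef[j, rest] = clf.coef_[0]

adjacency = (Coef != 0) & (Coef.T != 0)      # AND rule; the OR rule uses |
print(adjacency.astype(int))
```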

Mixed variables
General Markov random field representation, with edge and node potentials. Work with PhD student Jason Lee.

p(x, y; Θ) ∝ exp( -∑_{s=1}^p ∑_{t=1}^p (1/2) β_st x_s x_t + ∑_{s=1}^p α_s x_s + ∑_{s=1}^p ∑_{j=1}^q ρ_sj(y_j) x_s + ∑_{j=1}^q ∑_{r=1}^q φ_rj(y_r, y_j) )

Pseudo-likelihood allows simple inference with mixed variables. Conditionals for continuous variables are Gaussian linear regression models; for categorical variables they are binomial or multinomial logistic regressions. Parameters come in symmetric blocks, and the inference should respect this symmetry (next slide).

Group-lasso penalties
Parameters come in blocks. Here we have an interaction between a pair of quantitative variables (red), a 2-level qualitative with a quantitative (blue), and an interaction between the 2-level and a 3-level qualitative.
Minimize a pseudo-likelihood with lasso and group-lasso penalties on the parameter blocks:

min_Θ  l(Θ) + λ ( ∑_{s=1}^p ∑_{t=1}^{s-1} |β_st| + ∑_{s=1}^p ∑_{j=1}^q ‖ρ_sj‖_2 + ∑_{j=1}^q ∑_{r=1}^{j-1} ‖φ_rj‖_F )

Solved using a proximal Newton algorithm for a decreasing sequence of values of λ [Lee and Hastie, 2013].

Large scale graphical lasso
The cost of glasso is O(np² + p^{3+a}) for some a ∈ [0, 1]; prohibitive for genomic-scale p. For many of these problems n ≪ p, so we can only fit very sparse solutions anyway.
Simple idea [Mazumder and Hastie, 2011]: Compute S and soft-threshold it: S^λ_ij = sign(S_ij)(|S_ij| - λ)_+. Reorder rows and columns to achieve a block-diagonal pattern [Tarjan, 1972]. Run glasso on each corresponding block of S with parameter λ, and then reconstruct. The solution solves the original glasso problem! A similar result was found independently by Witten et al, 2011.
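The screening step is easy to express with an off-the-shelf connected-components routine. This sketch (NumPy + SciPy, an illustration of the idea rather than the authors' code) thresholds |S| at λ, finds the blocks, and indicates where glasso would then be run separately on each block.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 30))      # illustrative data
S = np.cov(X, rowvar=False)
lam = 0.25

# Screening: an edge (i, j) can survive only if |S_ij| > lam,
# i.e. only if the soft-thresholded entry is nonzero.
adj = np.abs(S) > lam
np.fill_diagonal(adj, False)

n_comp, labels = connected_components(csr_matrix(adj), directed=False)
print("number of blocks:", n_comp)
for c in range(n_comp):
    block = np.where(labels == c)[0]
    if len(block) > 1:
        S_block = S[np.ix_(block, block)]
        # ... run the graphical lasso on S_block with the same lam,
        # then place the resulting sub-matrix back into the full Theta-hat.
```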

[Figure: schematic S → Soft(S, λ) → permute to block-diagonal S^λ_P, and three panels, (A) p = 2000, (B) p = 4718, (C) p = 24281, plotting λ against log10(size of components).]