Gaussian Graphical Models and Graphical Lasso

ELE 538B: Sparsity, Structure and Inference
Gaussian Graphical Models and Graphical Lasso
Yuxin Chen, Princeton University, Spring 2017

Multivariate Gaussians

Consider a random vector $x \sim \mathcal{N}(0, \Sigma)$ with pdf
$$f(x) = \frac{1}{(2\pi)^{p/2} \det(\Sigma)^{1/2}} \exp\Big( -\frac{1}{2} x^\top \Sigma^{-1} x \Big) \;\propto\; \det(\Theta)^{1/2} \exp\Big( -\frac{1}{2} x^\top \Theta x \Big),$$
where $\Sigma = \mathbb{E}[x x^\top] \succ 0$ is the covariance matrix and $\Theta = \Sigma^{-1}$ is the inverse covariance matrix / precision matrix.
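As a quick numerical cross-check of the precision-matrix form of the density (a minimal sketch, assuming an arbitrary positive definite example covariance; not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)      # an arbitrary positive definite covariance
Theta = np.linalg.inv(Sigma)         # precision matrix
x = rng.standard_normal(p)

# log f(x) in terms of Theta: 0.5*log det(Theta) - 0.5*x'Theta x - (p/2)*log(2*pi)
logf = 0.5 * np.linalg.slogdet(Theta)[1] - 0.5 * x @ Theta @ x - 0.5 * p * np.log(2 * np.pi)
print(np.isclose(logf, multivariate_normal(np.zeros(p), Sigma).logpdf(x)))   # True
```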

Undirected graphical models

[Figure: an undirected graph on 8 variables illustrating $x_1 \perp x_4 \mid \{x_2, x_3, x_5, x_6, x_7, x_8\}$]

Represent a collection of variables $x = [x_1, \ldots, x_p]^\top$ by a vertex set $V = \{1, \ldots, p\}$. Encode conditional independence by a set $E$ of edges: for any pair of vertices $u$ and $v$,
$$(u, v) \notin E \quad \Longleftrightarrow \quad x_u \perp x_v \mid x_{V \setminus \{u, v\}}.$$

Gaussian graphical models

Fact 5.1 (Homework). Consider a Gaussian vector $x \sim \mathcal{N}(0, \Sigma)$. For any $u$ and $v$, $x_u \perp x_v \mid x_{V \setminus \{u, v\}}$ iff $\Theta_{u,v} = 0$, where $\Theta = \Sigma^{-1}$.

conditional independence $\Longleftrightarrow$ sparsity
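A small numerical illustration of Fact 5.1 (a sketch with an assumed chain-structured precision matrix, not from the slides): here $\Theta_{1,4} = 0$, and the conditional covariance of $(x_1, x_4)$ given the remaining variables, which for a Gaussian equals the inverse of the corresponding $2 \times 2$ block of $\Theta$, has zero off-diagonal even though the marginal covariance $\Sigma_{1,4}$ is nonzero.

```python
import numpy as np

# chain graph on 4 variables with edges (1,2), (2,3), (3,4); note Theta[0, 3] == 0
Theta = np.array([[ 2., -1.,  0.,  0.],
                  [-1.,  2., -1.,  0.],
                  [ 0., -1.,  2., -1.],
                  [ 0.,  0., -1.,  2.]])
Sigma = np.linalg.inv(Theta)

# Cov(x_1, x_4 | rest) is the inverse of the {1,4} block of Theta (standard Gaussian identity)
cond_cov = np.linalg.inv(Theta[np.ix_([0, 3], [0, 3])])
print(cond_cov[0, 1])   # 0.0      -> conditionally independent
print(Sigma[0, 3])      # nonzero  -> marginally dependent
```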

Gaussian graphical models

[Figure: sparsity pattern of $\Theta$, whose zero entries correspond to the missing edges of the graph]

Likelihoods for Gaussian models

Draw $n$ i.i.d. samples $x^{(1)}, \ldots, x^{(n)} \sim \mathcal{N}(0, \Sigma)$; then the log-likelihood (up to an additive constant) is
$$\ell(\Theta) = \frac{1}{n} \sum_{i=1}^n \log f\big(x^{(i)}\big) = \frac{1}{2} \log\det(\Theta) - \frac{1}{2n} \sum_{i=1}^n x^{(i)\top} \Theta x^{(i)} = \frac{1}{2} \log\det(\Theta) - \frac{1}{2} \langle S, \Theta \rangle,$$
where $S := \frac{1}{n} \sum_{i=1}^n x^{(i)} x^{(i)\top}$ is the sample covariance and $\langle S, \Theta \rangle = \mathrm{tr}(S \Theta)$.

Maximum likelihood estimation:
$$\underset{\Theta \succ 0}{\text{maximize}} \quad \log\det(\Theta) - \langle S, \Theta \rangle$$
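A minimal numpy sketch of this log-likelihood and of the unpenalized MLE $\hat\Theta = S^{-1}$ (valid here because $n \gg p$, so $S$ is invertible); the true covariance is an arbitrary assumed example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 4
Sigma_true = np.array([[1.0, 0.4, 0.0, 0.0],
                       [0.4, 1.0, 0.3, 0.0],
                       [0.0, 0.3, 1.0, 0.2],
                       [0.0, 0.0, 0.2, 1.0]])
X = rng.multivariate_normal(np.zeros(p), Sigma_true, size=n)
S = X.T @ X / n                                   # sample covariance (mean known to be 0)

def loglik(Theta, S):
    """l(Theta) = 0.5*log det(Theta) - 0.5*<S, Theta>, up to an additive constant."""
    return 0.5 * np.linalg.slogdet(Theta)[1] - 0.5 * np.trace(S @ Theta)

Theta_mle = np.linalg.inv(S)                      # maximizer when S is invertible
print(loglik(Theta_mle, S) >= loglik(np.eye(p), S))   # True
```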

Challenge in the high-dimensional regime

Classical theory says that the MLE converges to the truth as the sample size $n \to \infty$. In practice, however, we are often in a regime where the sample size is small ($n < p$). In this regime $S$ is rank-deficient, and the MLE does not even exist.
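A brief illustration of the rank deficiency (a sketch with arbitrary placeholder data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                        # fewer samples than variables
X = rng.standard_normal((n, p))
S = X.T @ X / n
print(np.linalg.matrix_rank(S), p)   # 20 50: S is singular, so the MLE does not exist
```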

Graphical lasso (Friedman, Hastie, & Tibshirani '08)

Many pairs of variables are conditionally independent $\Longrightarrow$ many missing links in the graphical model (sparsity).

Key idea: apply the lasso by treating each node as a response variable:
$$\underset{\Theta \succ 0}{\text{maximize}} \quad \log\det(\Theta) - \langle S, \Theta \rangle - \lambda \underbrace{\|\Theta\|_1}_{\text{lasso penalty}}$$

It is a convex program! (homework)

First-order optimality condition:
$$\Theta^{-1} - S - \lambda \underbrace{\partial \|\Theta\|_1}_{\text{subgradient}} \ni 0 \qquad (5.1)$$
$$\Longrightarrow \quad \big(\Theta^{-1}\big)_{i,i} = S_{i,i} + \lambda, \qquad 1 \le i \le p$$
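For reference, scikit-learn ships a graphical lasso solver; a minimal usage sketch with placeholder random data is below. Its `alpha` plays the role of $\lambda$, though the exact penalty and stopping conventions may differ slightly from the formulation above.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))       # placeholder data; substitute your own samples

model = GraphicalLasso(alpha=0.2)        # alpha is the l1 regularization strength
model.fit(X)
Theta_hat = model.precision_             # sparse estimate of the precision matrix
print(np.count_nonzero(np.abs(Theta_hat) > 1e-8))   # number of surviving entries
```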

Blockwise coordinate descent

Idea: repeatedly cycle through all columns/rows and, in each step, optimize only a single column/row.

Notation: use $W$ to denote the working version of $\Theta^{-1}$. Partition all matrices into 1 column/row vs. the rest:
$$\Theta = \begin{bmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^\top & \theta_{22} \end{bmatrix}, \qquad S = \begin{bmatrix} S_{11} & s_{12} \\ s_{12}^\top & s_{22} \end{bmatrix}, \qquad W = \begin{bmatrix} W_{11} & w_{12} \\ w_{12}^\top & w_{22} \end{bmatrix}$$

Blockwise coordinate descent

Blockwise step: suppose we fix all but the last row/column. It follows from (5.1) that
$$0 \in W_{11} \beta - s_{12} - \lambda\, \partial \|\theta_{12}\|_1 = W_{11} \beta - s_{12} + \lambda\, \partial \|\beta\|_1, \qquad (5.2)$$
where $\beta = -\theta_{12} / \theta_{22}$ (by the partitioned matrix inverse formula, the last column of $\begin{bmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^\top & \theta_{22} \end{bmatrix}^{-1}$ is $\frac{1}{\tilde\theta_{22}} \begin{bmatrix} -\Theta_{11}^{-1} \theta_{12} \\ 1 \end{bmatrix}$ with $\tilde\theta_{22} := \theta_{22} - \theta_{12}^\top \Theta_{11}^{-1} \theta_{12} > 0$).

This coincides with the optimality condition for
$$\underset{\beta}{\text{minimize}} \quad \frac{1}{2} \big\| W_{11}^{1/2} \beta - W_{11}^{-1/2} s_{12} \big\|_2^2 + \lambda \|\beta\|_1. \qquad (5.3)$$
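To make the claimed coincidence explicit (a short check, filling in a step left implicit on the slide), take a subgradient of the objective in (5.3):
$$W_{11}^{1/2}\big(W_{11}^{1/2}\beta - W_{11}^{-1/2} s_{12}\big) + \lambda\,\partial\|\beta\|_1 = W_{11}\beta - s_{12} + \lambda\,\partial\|\beta\|_1,$$
and requiring that this set contain $0$ is exactly condition (5.2).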

Blockwise coordinate descent

Algorithm 5.1 (Block coordinate descent for graphical lasso)
Initialize $W = S + \lambda I$ and fix its diagonal entries $\{w_{i,i}\}$. Repeat until convergence, for $j = 1, \ldots, p$:
(i) Partition $W$ (resp. $S$) into 4 parts, where the upper-left part consists of all but the $j$th row/column
(ii) Solve $\underset{\beta}{\text{minimize}} \ \frac{1}{2} \| W_{11}^{1/2} \beta - W_{11}^{-1/2} s_{12} \|_2^2 + \lambda \|\beta\|_1$
(iii) Update $w_{12} = W_{11} \hat\beta$; set $\hat\theta_{12} = -\hat\theta_{22} \hat\beta$ with $\hat\theta_{22} = 1 / (w_{22} - w_{12}^\top \hat\beta)$
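Below is a compact Python sketch of Algorithm 5.1, assuming the inner lasso (5.3) is solved by coordinate descent with soft-thresholding (as in Friedman et al. '08); function names, tolerances, and iteration caps are illustrative rather than prescribed by the slides.

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(W11, s12, lam, n_iter=200, tol=1e-8):
    """Coordinate descent for  min_b 0.5 b'W11 b - b's12 + lam*||b||_1  (equivalent to (5.3))."""
    beta = np.zeros(len(s12))
    for _ in range(n_iter):
        beta_old = beta.copy()
        for j in range(len(s12)):
            r = s12[j] - W11[j] @ beta + W11[j, j] * beta[j]   # partial residual
            beta[j] = soft_threshold(r, lam) / W11[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def graphical_lasso(S, lam, n_sweeps=100, tol=1e-6):
    """Block coordinate descent sketch of Algorithm 5.1."""
    p = S.shape[0]
    W = S + lam * np.eye(p)              # initialization; diagonal of W stays fixed
    Theta = np.zeros((p, p))
    idx = np.arange(p)
    for _ in range(n_sweeps):
        W_prev = W.copy()
        for j in range(p):
            rest = idx != j                       # (i) all but the j-th row/column
            W11 = W[np.ix_(rest, rest)]
            s12 = S[rest, j]
            beta = lasso_cd(W11, s12, lam)        # (ii) solve the lasso sub-problem
            w12 = W11 @ beta                      # (iii) update W and Theta
            W[rest, j] = W[j, rest] = w12
            theta22 = 1.0 / (W[j, j] - w12 @ beta)
            Theta[j, j] = theta22
            Theta[rest, j] = Theta[j, rest] = -theta22 * beta
        if np.max(np.abs(W - W_prev)) < tol:
            break
    return Theta, W
```

This sketch is meant to make the blockwise structure explicit; it is not tuned for speed or for large $p$.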

Blockwise coordinate descent

The only remaining thing is to ensure $W \succ 0$. This is automatically satisfied:

Lemma 5.2 (Mazumder & Hastie '12). If we start with $W \succ 0$ satisfying $\|W - S\|_\infty \le \lambda$, then every row/column update maintains positive definiteness of $W$.

If we start with $W^{(0)} = S + \lambda I$, then $W^{(t)}$ will always be positive definite.

Proof of Lemma 5.2

A key observation for the proof of Lemma 5.2:

Fact 5.3 (Lemma 2, Mazumder & Hastie '12). Solving (5.3) is equivalent to solving
$$\underset{\gamma}{\text{minimize}} \quad (s_{12} + \gamma)^\top W_{11}^{-1} (s_{12} + \gamma) \quad \text{s.t.} \quad \|\gamma\|_\infty \le \lambda, \qquad (5.4)$$
where the solutions of the two problems are related by $\hat\beta = W_{11}^{-1} (s_{12} + \hat\gamma)$.

Proof idea: check that the optimality condition of (5.3) and that of (5.4) match.
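A sketch of that check (spelled out here since the slide leaves it implicit): by (5.2), the minimizer of (5.3) satisfies $0 \in W_{11}\hat\beta - s_{12} + \lambda\,\partial\|\hat\beta\|_1$. Set $\hat\gamma := W_{11}\hat\beta - s_{12}$, so that
$$\hat\beta = W_{11}^{-1}(s_{12} + \hat\gamma) \qquad \text{and} \qquad \hat\gamma \in -\lambda\,\partial\|\hat\beta\|_1 \ \Longrightarrow\ \|\hat\gamma\|_\infty \le \lambda,$$
i.e. $\hat\gamma$ is feasible for (5.4). The gradient of the objective of (5.4) at $\hat\gamma$ is $2 W_{11}^{-1}(s_{12} + \hat\gamma) = 2\hat\beta$; coordinatewise, $\hat\beta_j = 0$ allows $\hat\gamma_j$ to sit anywhere in $[-\lambda, \lambda]$, while $\hat\beta_j \neq 0$ forces the constraint to be active with $\hat\gamma_j = -\lambda\,\mathrm{sign}(\hat\beta_j)$, which is precisely the KKT condition for minimizing over the box $\|\gamma\|_\infty \le \lambda$.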

Proof of Lemma 5.2

Suppose in the $t$th iteration one has $\|W^{(t)} - S\|_\infty \le \lambda$ and
$$W^{(t)} \succ 0 \quad \Longleftrightarrow \quad W_{11}^{(t)} \succ 0 \ \text{ and } \ w_{22} - w_{12}^{(t)\top} \big(W_{11}^{(t)}\big)^{-1} w_{12}^{(t)} > 0 \qquad \text{(Schur complement)}.$$

We only update $w_{12}$, so it suffices to show
$$w_{22} - w_{12}^{(t+1)\top} \big(W_{11}^{(t)}\big)^{-1} w_{12}^{(t+1)} > 0. \qquad (5.5)$$

Recall that $w_{12}^{(t+1)} = W_{11}^{(t)} \hat\beta^{(t+1)}$. It follows from Fact 5.3 that
$$\big\| w_{12}^{(t+1)} - s_{12} \big\|_\infty \le \lambda \qquad \text{and} \qquad w_{12}^{(t+1)\top} \big(W_{11}^{(t)}\big)^{-1} w_{12}^{(t+1)} \le w_{12}^{(t)\top} \big(W_{11}^{(t)}\big)^{-1} w_{12}^{(t)}.$$

Since $w_{22} = s_{22} + \lambda$ remains unchanged, we establish (5.5).

Reference
[1] "Sparse inverse covariance estimation with the graphical lasso," J. Friedman, T. Hastie, and R. Tibshirani, Biostatistics, 2008.
[2] "The graphical lasso: new insights and alternatives," R. Mazumder and T. Hastie, Electronic Journal of Statistics, 2012.
[3] "Statistical learning with sparsity: the lasso and generalizations," T. Hastie, R. Tibshirani, and M. Wainwright, 2015.