Probabilistic Graphical Models


School of Computer Science
Probabilistic Graphical Models
Distributed ADMM for Gaussian Graphical Models
Yaoliang Yu
Lecture 29, April 29, 2015
Eric Xing @ CMU, 2005-2015 1

Networks / Graphs Eric Xing @ CMU, 2005-2015 2

Where do graphs come from? Prior knowledge: "Mom told me A is connected to B." Or estimate from data! We have seen this in previous classes and will see two more approaches today. Sometimes we may also be interested in edge weights (an easier problem). Real networks are BIG and require distributed optimization. Eric Xing @ CMU, 2005-2015 3

Structural learning for completely observed data MRF (recall). Data: i.i.d. samples $x^{(1)} = (x_1^{(1)}, \ldots, x_n^{(1)}),\; x^{(2)} = (x_1^{(2)}, \ldots, x_n^{(2)}),\; \ldots,\; x^{(M)} = (x_1^{(M)}, \ldots, x_n^{(M)})$. Eric Xing @ CMU, 2005-2015 4

Gaussian Graphical Models
Multivariate Gaussian density:
$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x-\mu)^{\top} \Sigma^{-1} (x-\mu) \right\}$
WLOG let $\mu = 0$ and write $Q = \Sigma^{-1} = (q_{ij})$. We can view this as a continuous Markov random field with potentials defined on every node and edge:
$p(x_1, \ldots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\left\{ -\tfrac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \right\}$
Eric Xing @ CMU, 2005-2015 5

The covariance and the precision matrices Covariance matrix Graphical model interpretation? Precision matrix Graphical model interpretation? Eric Xing @ CMU, 2005-2015 6

Sparse precision vs. sparse covariance in GGM
Example: the chain graph 1 - 2 - 3 - 4 - 5. The precision matrix is sparse, with nonzeros only on the diagonal and between chain neighbours:
$\Sigma^{-1} = \begin{pmatrix} 1 & 6 & 0 & 0 & 0 \\ 6 & 2 & 7 & 0 & 0 \\ 0 & 7 & 3 & 8 & 0 \\ 0 & 0 & 8 & 4 & 9 \\ 0 & 0 & 0 & 9 & 5 \end{pmatrix}$
while every entry of the covariance matrix $\Sigma$ is nonzero (the slide lists all 25 entries, roughly 0.01-0.15 in magnitude).
$(\Sigma^{-1})_{15} = 0 \iff X_1 \perp X_5 \mid X_{\mathrm{nbrs}(1)}$ (or $X_{\mathrm{nbrs}(5)}$)
$\Sigma_{15} = 0 \iff X_1 \perp X_5$
Eric Xing @ CMU, 2005-2015 7
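
To make the contrast concrete, here is a minimal numpy sketch (illustrative numbers of my own, not the slide's): the precision matrix of a chain graph is tridiagonal, yet its inverse, the covariance, is generically dense.

```python
import numpy as np

# Chain graph 1-2-3-4-5: precision matrix Q is tridiagonal (sparse),
# but its inverse Sigma has no zero entries.
p = 5
Q = np.eye(p) * 2.0
for i in range(p - 1):
    Q[i, i + 1] = Q[i + 1, i] = -0.9   # edges only between chain neighbours

Sigma = np.linalg.inv(Q)
print(np.round(Q, 2))                  # zeros outside the tridiagonal band
print(np.round(Sigma, 2))              # generically no zeros at all
print("Sigma[0, 4] =", round(Sigma[0, 4], 4))   # nonzero although Q[0, 4] == 0
```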

Another example. How to estimate this MRF? What if p >> n? The MLE does not exist in general! What about learning only a sparse graphical model? This is possible when s = o(n). Very often it is the structure of the GM that is the more interesting object. Eric Xing @ CMU, 2005-2015 8

Recall lasso Eric Xing @ CMU, 2005-2015 9
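
For reference (standard formulation; the notation $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$ is not from the slide), the lasso solves
$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p}\; \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$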

Graph Regression (Meinshausen & Buhlmann 06) Neighborhood selection Lasso: Eric Xing @ CMU, 2005-2015 10
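
In neighborhood selection, each variable is regressed on all the others with an $\ell_1$ penalty (standard statement of the method; notation not from the slide):
$\hat{\beta}^{(i)} = \arg\min_{\beta \in \mathbb{R}^{p-1}}\; \tfrac{1}{2n}\sum_{k=1}^{n}\Big(X_{i,k} - \sum_{j \ne i} \beta_j X_{j,k}\Big)^2 + \lambda \|\beta\|_1,$
and an edge (i, j) is declared whenever $\hat{\beta}^{(i)}_j \ne 0$, with the results for (i, j) and (j, i) combined by an AND or OR rule.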

Graph Regression Eric Xing @ CMU, 2005-2015 11

Graph Regression. Pros: computationally convenient; strong theoretical guarantee (p <= poly(n)). Cons: asymmetry; not minimax optimal. Eric Xing @ CMU, 2005-2015 12

The regularized MLE (Yuan & Lin 07)
$\min_Q\; -\log\det Q + \mathrm{tr}(QS) + \lambda \|Q\|_1$
S: sample covariance matrix, may be singular.
$\|Q\|_1$: may exclude the diagonal.
$-\log\det Q$: implicitly forces Q to be PSD and symmetric.
Pros: a single step for estimating the graph and the inverse covariance; it is the MLE!
Cons: computationally challenging, partly solved by glasso (Banerjee et al. 08, Friedman et al. 08).
Eric Xing @ CMU, 2005-2015 13
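
As a rough illustration (a sketch assuming scikit-learn's coordinate-descent GraphicalLasso solver, not the implementation referenced on the slide), `alpha` plays the role of lambda above:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))          # n = 200 samples, p = 20 variables

model = GraphicalLasso(alpha=0.1, max_iter=200).fit(X)
Q_hat = model.precision_                    # sparse estimate of Sigma^{-1}
n_edges = np.sum(np.abs(Q_hat[np.triu_indices(20, k=1)]) > 1e-8)
print("estimated edges:", n_edges)
```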

Many many follow-ups Eric Xing @ CMU, 2005-2015 14

A closer look at the RMLE
$\min_Q\; -\log\det Q + \mathrm{tr}(QS) + \lambda \|Q\|_1$
Set the (sub)derivative to 0: $-Q^{-1} + S + \lambda\,\mathrm{sign}(Q) = 0$, hence at the optimum $\|Q^{-1} - S\|_\infty \le \lambda$.
Can we (?!) instead solve: $\min_Q \|Q\|_1 \quad \text{s.t.}\quad \|Q^{-1} - S\|_\infty \le \lambda$
Eric Xing @ CMU, 2005-2015 15

CLIME (Cai et al. 11)
Further relaxation: $\min_Q \|Q\|_1 \quad \text{s.t.}\quad \|SQ - I\|_\infty \le \lambda$
The constraint controls $Q \approx S^{-1}$; the objective controls sparsity in Q.
Q is not required to be PSD or symmetric.
Separable! LP!!! Both the objective and the constraint are element-wise separable, and the problem can be reformulated as an LP.
Strong theoretical guarantee: variations are minimax-optimal (Cai et al. 12, Liu & Wang 12).
Eric Xing @ CMU, 2005-2015 16

But for BIG problems
$\min_Q \|Q\|_1 \quad \text{s.t.}\quad \|SQ - I\|_\infty \le \lambda$
Standard solvers for LP can be slow.
Embarrassingly parallel: solve each column of Q independently on each core/machine,
$\min_{q_i} \|q_i\|_1 \quad \text{s.t.}\quad \|S q_i - e_i\|_\infty \le \lambda$
thanks to not having a PSD constraint on Q.
Still troublesome if S is big; need to consider first-order methods.
Eric Xing @ CMU, 2005-2015 17
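
A minimal sketch of one column subproblem written out as an explicit LP, assuming SciPy's `linprog`; the split q = q_plus - q_minus and the helper name `clime_column` are mine, not from the lecture:

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(S, i, lam):
    """Solve  min ||q||_1  s.t. ||S q - e_i||_inf <= lam  for one column (LP sketch).

    Split q = q_plus - q_minus with q_plus, q_minus >= 0, so ||q||_1 = 1^T (q_plus + q_minus).
    """
    p = S.shape[0]
    e_i = np.zeros(p); e_i[i] = 1.0
    c = np.ones(2 * p)                       # objective: sum of both nonnegative parts
    A_ub = np.block([[ S, -S],               #   S q - e_i <= lam
                     [-S,  S]])              # -(S q - e_i) <= lam
    b_ub = np.concatenate([lam + e_i, lam - e_i])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")   # default bounds give q_parts >= 0
    return res.x[:p] - res.x[p:]
```

Each call is independent, so columns (or blocks of columns) can be dispatched to different cores or machines, exactly as the slide suggests.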

A gentle introduction to alternating direction method of multipliers (ADMM) Eric Xing @ CMU, 2005-2015 18

Optimization with coupling variables
Canonical form: $\min_{w,z}\; f(w) + g(z)\ \ \text{s.t.}\ \ Aw + Bz = c$, where $w \in \mathbb{R}^m$, $z \in \mathbb{R}^p$, $A: \mathbb{R}^m \to \mathbb{R}^q$, $B: \mathbb{R}^p \to \mathbb{R}^q$, $c \in \mathbb{R}^q$. The objective is uncoupled (good), but the constraint couples w and z (bad).
Numerically challenging because: the function f or g may be nonsmooth or constrained (i.e., can take the value $\infty$); the linear constraint couples the variables w and z; and at large scale, interior-point methods are not applicable.
Naively alternating over w and z does not work. Example: $\min w^2$ s.t. $w + z = 1$; the optimum clearly is $w = 0$. Start with, say, $w = 1 \Rightarrow z = 0 \Rightarrow w = 1 \Rightarrow z = 0 \Rightarrow \cdots$, stuck forever. However, without the coupling, we could solve for w and z separately.
Idea: try to decouple the variables in the constraint!
Eric Xing @ CMU, 2005-2014

Example: Empirical Risk Minimization (ERM)
$\min_w\; g(w) + \sum_{i=1}^{n} f_i(w)$
Each i corresponds to a training point $(x_i, y_i)$; the loss $f_i$ measures the fitness of the model parameter w:
least squares: $f_i(w) = (y_i - w^\top x_i)^2$
support vector machines: $f_i(w) = (1 - y_i w^\top x_i)_+$
boosting: $f_i(w) = \exp(-y_i w^\top x_i)$
logistic regression: $f_i(w) = \log(1 + \exp(-y_i w^\top x_i))$
g is the regularization function, e.g. $\lambda n\|w\|_2^2$ or $\lambda n\|w\|_1$.
Variables are coupled in the objective, but not in the constraint (there is none). Reformulate: transfer the coupling from the objective to the constraint, arriving at the canonical form and allowing a unified treatment later.
Eric Xing @ CMU, 2005-2014

Why canonical form?
ERM: $\min_w\; g(w) + \sum_{i=1}^{n} f_i(w)$
Canonical form: $\min_{w,z}\; f(w) + g(z)\ \ \text{s.t.}\ \ Aw + Bz = c$, where $w \in \mathbb{R}^m$, $z \in \mathbb{R}^p$, $A: \mathbb{R}^m \to \mathbb{R}^q$, $B: \mathbb{R}^p \to \mathbb{R}^q$, $c \in \mathbb{R}^q$
The ADMM algorithm (to be introduced shortly) excels at solving the canonical form.
The canonical form is a general template for constrained problems.
ERM (and many other problems) can be converted to canonical form through variable duplication (see next slide).
Eric Xing @ CMU, 2005-2013 21

How to: variable duplication
Duplicate variables to achieve the canonical form; all $w_i$ must (eventually) agree. Downside: many extra variables, which increases the problem size.
$\min_w\; g(w) + \sum_{i=1}^{n} f_i(w)$ becomes, with $v = [w_1, \ldots, w_n]^\top$ implicitly maintaining the duplicated variables,
$\min_{v,z}\; g(z) + \underbrace{\textstyle\sum_i f_i(w_i)}_{f(v)} \quad \text{s.t.} \quad \underbrace{v - [I, \ldots, I]^\top z = 0}_{w_i = z,\ \forall i}$
Global consensus constraint: $\forall i,\ w_i = z$.
Eric Xing @ CMU, 2005-2014 22

Augmented Lagrangian
Canonical form: $\min_{w,z}\; f(w) + g(z)\ \ \text{s.t.}\ \ Aw + Bz = c$, where $w \in \mathbb{R}^m$, $z \in \mathbb{R}^p$, $A: \mathbb{R}^m \to \mathbb{R}^q$, $B: \mathbb{R}^p \to \mathbb{R}^q$, $c \in \mathbb{R}^q$
Introduce a Lagrangian multiplier $\lambda$ to decouple the variables:
$\min_{w,z}\ \max_{\lambda}\ \underbrace{f(w) + g(z) + \lambda^\top (Aw + Bz - c) + \tfrac{\mu}{2}\|Aw + Bz - c\|_2^2}_{L_\mu(w, z;\, \lambda)}$
$L_\mu$: the augmented Lagrangian. A more complicated min-max problem, but no coupling constraints.
Eric Xing @ CMU, 2005-2014

Why Augmented Lagrangian?
$\min_{w,z}\ \max_{\lambda}\ \underbrace{f(w) + g(z) + \lambda^\top (Aw + Bz - c) + \tfrac{\mu}{2}\|Aw + Bz - c\|_2^2}_{L_\mu(w, z;\, \lambda)}$
The quadratic term gives numerical stability: it may lead to strong convexity in w or z (faster convergence when strongly convex); it allows a larger step size (due to higher stability); and it prevents the subproblems from diverging to infinity (again, stability).
But sometimes it is better to work with the normal Lagrangian.
Eric Xing @ CMU, 2005-2013 24

ADMM Algorithm
Fix the dual, block coordinate descent on the primal w, z:
$w^{t+1} \leftarrow \arg\min_w L_\mu(w, z^t; \lambda^t) = \arg\min_w\, f(w) + \tfrac{\mu}{2}\|Aw + Bz^t - c + \lambda^t/\mu\|^2$
$z^{t+1} \leftarrow \arg\min_z L_\mu(w^{t+1}, z; \lambda^t) = \arg\min_z\, g(z) + \tfrac{\mu}{2}\|Aw^{t+1} + Bz - c + \lambda^t/\mu\|^2$
Fix the primal w, z, gradient ascent on the dual:
$\lambda^{t+1} \leftarrow \lambda^t + \eta\,(Aw^{t+1} + Bz^{t+1} - c)$
The step size $\eta$ can be large, e.g. $\eta = \mu$. Usually one rescales the dual to remove the $\lambda^t/\mu$ factor (the "scaled" form).
Eric Xing @ CMU, 2005-2014
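
As a concrete instance of these updates (a sketch of my own, not code from the lecture), take the lasso in canonical form with $f(w) = \tfrac{1}{2}\|Xw - y\|^2$, $g(z) = \lambda\|z\|_1$, $A = I$, $B = -I$, $c = 0$, and the scaled dual $u = \lambda_{\text{dual}}/\mu$:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(X, y, lam, mu=1.0, iters=200):
    """Scaled ADMM for  min_w 0.5||Xw - y||^2 + lam*||z||_1  s.t. w - z = 0  (sketch)."""
    n, p = X.shape
    w = z = u = np.zeros(p)
    XtX_mu = X.T @ X + mu * np.eye(p)      # w-step is a ridge-type solve; matrix is fixed
    Xty = X.T @ y
    for _ in range(iters):
        w = np.linalg.solve(XtX_mu, Xty + mu * (z - u))   # argmin_w L_mu(w, z; u)
        z = soft_threshold(w + u, lam / mu)               # argmin_z L_mu(w, z; u)
        u = u + w - z                                     # dual ascent with step mu (scaled form)
    return z
```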

ERM revisited
Reformulate by duplicating variables:
$\min_{v,z}\; g(z) + \underbrace{\textstyle\sum_i f_i(w_i)}_{f(v)} \quad \text{s.t.} \quad \underbrace{v - [I, \ldots, I]^\top z = 0}_{w_i = z,\ \forall i}$
ADMM w-step:
$w^{t+1} \leftarrow \arg\min_w L_\mu(w, z^t; \lambda^t) = \arg\min_w\, f(w) + \tfrac{\mu}{2}\|Aw + Bz^t - c + \lambda^t/\mu\|^2 = \arg\min_w \sum_i \Big[ f_i(w_i) + \tfrac{\mu}{2}\|w_i - z^t + \lambda_i^t/\mu\|^2 \Big]$
Thanks to the duplication, this is completely decoupled: parallelizable, and in closed form if $f_i$ is simple.
Eric Xing @ CMU, 2005-2013 26
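
A sketch of the resulting consensus updates, with each $f_i$ accessed only through its proximal map (function names are illustrative; g is taken to be 0 here, so the z-step is a plain average):

```python
import numpy as np

def consensus_admm(prox_fs, dim, mu=1.0, iters=100):
    """Consensus ADMM sketch:  min_z sum_i f_i(w_i)  s.t. w_i = z for all i.

    prox_fs[i](v, mu) should return argmin_w f_i(w) + (mu/2)||w - v||^2.
    Each w_i-update below is independent of the others (parallelizable).
    """
    k = len(prox_fs)
    W = np.zeros((k, dim))            # duplicated variables w_i
    U = np.zeros((k, dim))            # scaled duals
    z = np.zeros(dim)
    for _ in range(iters):
        for i in range(k):            # embarrassingly parallel across i
            W[i] = prox_fs[i](z - U[i], mu)
        z = (W + U).mean(axis=0)      # z-step (g = 0, so just averaging)
        U = U + W - z                 # dual updates
    return z

# toy usage: f_i(w) = 0.5*||w - a_i||^2, whose prox is (a_i + mu*v)/(1 + mu)
a = [np.array([1.0, 0.0]), np.array([0.0, 3.0])]
proxes = [lambda v, mu, ai=ai: (ai + mu * v) / (1.0 + mu) for ai in a]
print(consensus_admm(proxes, dim=2))  # converges to the mean of the a_i
```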

ADMM: History and Related Methods
Augmented Lagrangian Method (ALM): solve w, z jointly even though coupled; (Bertsekas '82) and references therein.
Alternating Direction Method of Multipliers (ADMM): alternate w and z as on the previous slide; (Boyd et al. '10) and references therein.
Operator splitting for PDEs: Douglas, Peaceman, and Rachford (50s-70s); Glowinski et al. '80s; Gabay '83; Spingarn '85. (Eckstein & Bertsekas '92; He et al. '02) in variational inequalities.
Lots of recent work.
Eric Xing @ CMU, 2005-2014

ADMM: Linearization
The demanding step in each iteration of ADMM (similar for z):
$x^{t+1} \leftarrow \arg\min_x L_\mu(x, z^t; y^t) = \arg\min_x\, f(x) + g(z^t) + y^{t\top}(Ax + Bz^t - c) + \tfrac{\mu}{2}\|Ax + Bz^t - c\|_2^2$
Diagonal A: reduces to the proximal map (more later) of f. For $f(x) = \lambda\|x\|_1$, soft-shrinkage: $\mathrm{sign}(x)\,(|x| - \lambda/\mu)_+$.
Non-diagonal A: no closed form, messy inner loop. Instead, reduce to the diagonal case by either of the following.
A single gradient step (for a suitable step size $\eta$): $x^{t+1} \leftarrow x^t - \eta\big[\partial f(x^t) + A^\top y^t + \mu A^\top (Ax^t + Bz^t - c)\big]$
Or, linearize the quadratic at $x^t$:
$x^{t+1} \leftarrow \arg\min_x\, f(x) + y^{t\top} A x + (x - x^t)^\top \mu A^\top (Ax^t + Bz^t - c) + \tfrac{\mu}{2}\|x - x^t\|_2^2 = \arg\min_x\, f(x) + \tfrac{\mu}{2}\big\|x - x^t + A^\top(Ax^t + Bz^t - c + y^t/\mu)\big\|_2^2$
Intuition: x is re-computed in the next iteration anyway, so there is no need for a perfect x.
Eric Xing @ CMU, 2005-2014

Convergence Guarantees: Fixed-Point Theory
Recall some definitions.
Lagrangian: $\min_{x,z} L_0(x, z; y) = \underbrace{\min_x\, f(x) + y^\top A x}_{d_1(y)} + \underbrace{\min_z\, g(z) + y^\top (Bz - c)}_{d_2(y)}$
Proximal map: $P^\mu_f(w) := \arg\min_z\, \tfrac{1}{2\mu}\|z - w\|_2^2 + f(z)$; well-defined for convex f, non-expansive ($\|T(x) - T(y)\|_2 \le \|x - y\|_2$), and it generalizes the Euclidean projection.
Reflection map: $R^\mu_f(w) := 2 P^\mu_f(w) - w$
ADMM = Douglas-Rachford splitting, a fixed-point iteration:
$w \leftarrow \tfrac{1}{2}\big(w + R^\mu_{d_2}(R^\mu_{d_1}(w))\big), \qquad y \leftarrow P^\mu_{d_2}(w)$
Convergence follows, e.g. (Bauschke & Combettes '13); this explains why the dual y, not the primal x or z, always converges.
Eric Xing @ CMU, 2005-2014
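
Two concrete proximal maps in the notation above (a small sketch; parameter names are mine): the $\ell_1$ norm gives the soft-shrinkage quoted two slides back, and the indicator of a set gives back the Euclidean projection.

```python
import numpy as np

def prox_l1(w, mu, lam=1.0):
    """P^mu_f(w) for f = lam*||.||_1: componentwise soft-shrinkage with threshold mu*lam."""
    return np.sign(w) * np.maximum(np.abs(w) - mu * lam, 0.0)

def prox_linf_ball(w, mu, radius=1.0):
    """P^mu_f(w) for f = indicator of the l_inf ball: the Euclidean projection (mu drops out)."""
    return np.clip(w, -radius, radius)
```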

ADMM for CLIME Eric Xing @ CMU, 2005-2015 30

Apply ADMM to CLIME
$\min_Q \|Q\|_1 \quad \text{s.t.}\quad \|SQ - E\|_\infty \le \lambda$
Solve a block of columns of Q on each core/machine; E is the corresponding block of columns of I.
Step 1: reduce to the ADMM canonical form using variable duplication:
$\min_{Q,Z} \|Q\|_1 \quad \text{s.t.}\quad \|Z - E\|_\infty \le \lambda,\ \ Z = SQ$
$\min_{Q,Z} \|Q\|_1 + \iota[\|Z - E\|_\infty \le \lambda] \quad \text{s.t.}\quad Z = SQ$
(where $\iota[\cdot]$ is the indicator function: 0 if the constraint holds, $\infty$ otherwise)
Eric Xing @ CMU, 2005-2015 31

Apply ADMM to CLIME (cont'd)
Step 2: write out the augmented Lagrangian (with scaled dual variable Y and penalty $\rho$):
$L_\rho(Q, Z; Y) = \|Q\|_1 + \iota[\|Z - E\|_\infty \le \lambda] + \rho\,\mathrm{tr}[(SQ - Z)^\top Y] + \tfrac{\rho}{2}\|SQ - Z\|_F^2$
Step 3: perform primal-dual updates:
$Q \leftarrow \arg\min_Q\, \|Q\|_1 + \tfrac{\rho}{2}\|SQ - Z + Y\|_F^2$
$Z \leftarrow \arg\min_Z\, \iota[\|Z - E\|_\infty \le \lambda] + \tfrac{\rho}{2}\|SQ - Z + Y\|_F^2 = \arg\min_{\|Z - E\|_\infty \le \lambda} \tfrac{\rho}{2}\|SQ - Z + Y\|_F^2$
$Y \leftarrow Y + SQ - Z$
Eric Xing @ CMU, 2005-2015 32

Apply ADMM to CLIME (cont'd)
Step 4: solve the subproblems.
Dual Y: trivial.
Primal Z: projection onto an $\ell_\infty$ ball, separable, easy.
Primal Q: easy if S is orthogonal, but in general a lasso problem. Bypass the double loop by linearization (intuition: it is wasteful to solve the Q-subproblem to death):
$Q \leftarrow \arg\min_Q\, \|Q\|_1 + \rho\,\mathrm{tr}\!\big(Q^\top S(Y + SQ^t - Z)\big) + \tfrac{\eta}{2}\|Q - Q^t\|_F^2$
which is solved by soft-thresholding.
Putting things together:
$Q \leftarrow \arg\min_Q\, \|Q\|_1 + \tfrac{\rho}{2}\|SQ - Z + Y\|_F^2$ (replaced by its linearized version above)
$Z \leftarrow \arg\min_{\|Z - E\|_\infty \le \lambda} \tfrac{\rho}{2}\|SQ - Z + Y\|_F^2$
$Y \leftarrow Y + SQ - Z$
Eric Xing @ CMU, 2005-2015 33
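
Putting the three updates into a numpy sketch (my own illustration, with $\rho$ and a linearization parameter $\eta \ge \rho\|S\|_2^2$ chosen explicitly; not the implementation of Wang et al.):

```python
import numpy as np

def soft_threshold(M, t):
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def admm_clime_block(S, E, lam, rho=1.0, iters=300):
    """Linearized ADMM sketch for  min ||Q||_1  s.t. ||SQ - E||_inf <= lam  (one column block)."""
    p, b = E.shape
    eta = rho * np.linalg.norm(S, 2) ** 2 + 1e-12     # linearization parameter, eta >= rho*||S||_2^2
    Q = np.zeros((p, b)); Z = np.zeros((p, b)); Y = np.zeros((p, b))
    for _ in range(iters):
        # Q-step: linearize (rho/2)||SQ - Z + Y||_F^2 at the current Q, then soft-threshold
        G = S @ (S @ Q - Z + Y)
        Q = soft_threshold(Q - (rho / eta) * G, 1.0 / eta)
        # Z-step: project SQ + Y onto the l_inf ball of radius lam centered at E
        Z = E + np.clip(S @ Q + Y - E, -lam, lam)
        # scaled dual update
        Y = Y + S @ Q - Z
    return Q
```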

Exploring structure
The expensive step in ADMM-CLIME is the matrix-matrix multiplication SQ (and the like).
If p >> n, S is p x p but of rank at most n. Write $S = AA^\top$ with $A = [X_1, \ldots, X_n] \in \mathbb{R}^{p \times n}$ (up to a 1/n scaling), and compute $A(A^\top Q)$ instead of $SQ$.
Matrix * matrix is much faster than a for-loop of matrix * vector products.
It is preferable to solve a balanced block of columns.
Eric Xing @ CMU, 2005-2015 34
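
A small numpy illustration of the saving (the sizes below are made up):

```python
import numpy as np

# When p >> n and S = A A^T with A in R^{p x n}, S @ Q costs about p*p*b flops,
# while A @ (A.T @ Q) costs about 2*p*n*b flops and never forms the p x p matrix S.
p, n, b = 5000, 100, 50
A = np.random.randn(p, n) / np.sqrt(n)    # e.g. centered data columns X_1..X_n, scaled
Q = np.random.randn(p, b)

SQ_lowrank = A @ (A.T @ Q)                # cheap: never materializes S
# equivalent but far more expensive: S = A @ A.T; SQ = S @ Q
```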

Parallelization (Wang et al. 13). Embarrassingly parallel if A fits into memory; otherwise, chop A into small blocks and distribute, though communication may then be high. Eric Xing @ CMU, 2005-2015 35

Numerical results (Wang et al 13) Eric Xing @ CMU, 2005-2015 36

Numerical results cont (Wang et al 13) Eric Xing @ CMU, 2005-2015 37

Nonparanormal extensions Eric Xing @ CMU, 2005-2015 38

Nonparanormal (Liu et al. 09)
$Z_i = f_i(X_i),\ \ i = 1, \ldots, p, \qquad (Z_1, \ldots, Z_p) \sim N(0, \Sigma)$
$f_i$: unknown monotonic functions. We observe X, but not Z.
Independence is preserved under the transformations:
$X_i \perp X_j \mid X_{\text{rest}} \iff Z_i \perp Z_j \mid Z_{\text{rest}} \iff Q_{ij} = 0$
Can estimate the $f_i$ first, then apply glasso on $f_i(X_i)$; but estimating the functions can be slow (nonparametric rate).
Eric Xing @ CMU, 2005-2015 39

A neat observation
$Z_i = f_i(X_i),\ \ i = 1, \ldots, p, \qquad (Z_1, \ldots, Z_p) \sim N(0, \Sigma)$
Since $f_i$ is monotonic, $Z_{i,1}, Z_{i,2}, \ldots, Z_{i,n}$ is comonotone / concordant with $X_{i,1}, X_{i,2}, \ldots, X_{i,n}$: use a rank estimator! Let $R_{i,1}, R_{i,2}, \ldots, R_{i,n}$ be the ranks of $X_{i,1}, \ldots, X_{i,n}$.
We want $\Sigma_{ij} = \frac{1}{n}\sum_{k=1}^{n} Z_{i,k} Z_{j,k}$, but we do not observe Z. Maybe $\stackrel{?}{=} \frac{1}{n}\sum_{k=1}^{n} R_{i,k} R_{j,k}$???
Eric Xing @ CMU, 2005-2015 40

Kendall's tau
Assuming no ties: $\hat{\tau}_{ij} = \frac{1}{n(n-1)} \sum_{k,\ell} \mathrm{sign}\big[(R_{i,k} - R_{i,\ell})(R_{j,k} - R_{j,\ell})\big]$
Complexity of computing Kendall's tau? (Naively $O(n^2)$ per pair; $O(n \log n)$ with a merge-sort based algorithm.)
Key: for the Gaussian copula, $\Sigma_{ij} = \sin\!\big(\tfrac{\pi}{2}\,E[\hat{\tau}_{ij}]\big)$, so $\hat{\Sigma}_{ij} := \sin\!\big(\tfrac{\pi}{2}\hat{\tau}_{ij}\big)$ is a genuine, asymptotically unbiased estimate of $\Sigma_{ij}$. The map $t \mapsto \sin(\tfrac{\pi}{2}t)$ is Lipschitz, preserving concentration.
After obtaining $\hat{\Sigma}$, use whatever glasso-type method you like, e.g. CLIME. One can also use other rank estimators, e.g. Spearman's rho (with the analogous map $t \mapsto 2\sin(\tfrac{\pi}{6}t)$).
Eric Xing @ CMU, 2005-2015 41
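
A sketch of the plug-in pipeline, using SciPy's Kendall tau and scikit-learn's `graphical_lasso` as a stand-in for any downstream solver (the eigenvalue clip is a practical safeguard I added, not from the slide):

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.covariance import graphical_lasso

def nonparanormal_precision(X, alpha=0.1):
    """Rank-based plug-in sketch: Sigma_hat_ij = sin(pi/2 * tau_hat_ij), then a glasso-type solver."""
    n, p = X.shape
    tau = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            t, _ = kendalltau(X[:, i], X[:, j])
            tau[i, j] = tau[j, i] = t
    Sigma_hat = np.sin(np.pi / 2.0 * tau)          # diagonal maps to sin(pi/2) = 1
    # Sigma_hat need not be PSD; clip tiny/negative eigenvalues before handing it to the solver
    vals, vecs = np.linalg.eigh(Sigma_hat)
    Sigma_hat = (vecs * np.maximum(vals, 1e-4)) @ vecs.T
    _, Q_hat = graphical_lasso(Sigma_hat, alpha=alpha)
    return Q_hat
```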

Summary
Gaussian graphical model selection: neighborhood selection, regularized MLE, CLIME; implicit PSD vs. explicit PSD.
Distributed ADMM: a generic procedure; can distribute the matrix product.
Nonparanormal Gaussian graphical model: rank statistics; plug-in estimator.
Eric Xing @ CMU, 2005-2015 42

References
Graph lasso:
Meinshausen and Buhlmann, High-Dimensional Graphs and Variable Selection with the Lasso, Annals of Statistics, vol. 34, 2006.
Yuan and Lin, Model Selection and Estimation in the Gaussian Graphical Model, Biometrika, vol. 94, 2007.
Cai et al., A Constrained l_1 Minimization Approach for Sparse Precision Matrix Estimation, Journal of the American Statistical Association, 2011.
Banerjee et al., Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data, Journal of Machine Learning Research, 2008.
Friedman et al., Sparse Inverse Covariance Estimation with the Graphical Lasso, Biostatistics, 2008.
Nonparanormal:
Han Liu et al., The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs, Journal of Machine Learning Research, 2009.
Lingzhou Xue and Hui Zou, Regularized Rank-based Estimation of High-Dimensional Nonparanormal Graphical Models, Annals of Statistics, vol. 40, 2012.
Han Liu et al., High-Dimensional Semiparametric Gaussian Copula Graphical Models, Annals of Statistics, vol. 40, 2012.
ADMM:
Wang et al., Large Scale Distributed Sparse Precision Estimation, NIPS 2013.
Boyd et al., Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Foundations and Trends in Machine Learning, 2011.
Eric Xing @ CMU, 2005-2015 43