School of Computer Science
Probabilistic Graphical Models

Distributed ADMM for Gaussian Graphical Models

Yaoliang Yu
Lecture 29, April 29, 2015
Networks / Graphs
Where do graphs come from?
- Prior knowledge: "Mom told me A is connected to B."
- Estimated from data: we have seen this in previous classes, and will see two more methods today.
- Sometimes we may also be interested in edge weights (an easier problem).
- Real networks are BIG: they require distributed optimization.
Structural Learning for Completely Observed MRFs (Recall)
Data: $\mathbf{x}^{(1)} = (x_1^{(1)}, \dots, x_n^{(1)}),\; \mathbf{x}^{(2)} = (x_1^{(2)}, \dots, x_n^{(2)}),\; \dots,\; \mathbf{x}^{(M)} = (x_1^{(M)}, \dots, x_n^{(M)})$
Gaussian Graphical Models
Multivariate Gaussian density:
$p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\big\{-\tfrac{1}{2}(\mathbf{x}-\mu)^\top \Sigma^{-1} (\mathbf{x}-\mu)\big\}$
WLOG let $\mu = 0$ and write $Q = \Sigma^{-1}$. We can view this as a continuous Markov random field with potentials defined on every node and edge:
$p(x_1, \dots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\big\{-\tfrac{1}{2}\sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j\big\}$
The covariance and the precision matrices
- Covariance matrix $\Sigma$: graphical model interpretation?
- Precision matrix $Q = \Sigma^{-1}$: graphical model interpretation?
Sparse precision vs. sparse covariance in GGM
Chain GGM: X1 — X2 — X3 — X4 — X5
$\Sigma^{-1} = \begin{pmatrix} 1 & 6 & 0 & 0 & 0 \\ 6 & 2 & 7 & 0 & 0 \\ 0 & 7 & 3 & 8 & 0 \\ 0 & 0 & 8 & 4 & 9 \\ 0 & 0 & 0 & 9 & 5 \end{pmatrix}$
is sparse (tridiagonal), while its inverse $\Sigma$ is dense: every entry is nonzero (magnitudes between 0.01 and 0.15, e.g. $|\Sigma_{15}| = 0.15$).
- $(\Sigma^{-1})_{15} = 0 \iff X_1 \perp X_5 \mid X_{\mathrm{nbrs}(1)}$ (or $X_{\mathrm{nbrs}(5)}$)
- $\Sigma_{15} = 0 \iff X_1 \perp X_5$
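A quick numeric check of this contrast, as a minimal sketch in numpy (the precision matrix is the one above; everything else is incidental):

```python
import numpy as np

# Tridiagonal precision matrix of the 5-node chain X1 - X2 - ... - X5.
Q = np.array([[1., 6., 0., 0., 0.],
              [6., 2., 7., 0., 0.],
              [0., 7., 3., 8., 0.],
              [0., 0., 8., 4., 9.],
              [0., 0., 0., 9., 5.]])

Sigma = np.linalg.inv(Q)           # covariance = inverse of precision
print(np.round(Sigma, 2))          # dense: no zero entries
print(Q[0, 4], Sigma[0, 4])        # Q_{15} = 0, yet Sigma_{15} != 0
```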
Another example
- How do we estimate this MRF? What if p ≫ n? The MLE does not exist in general!
- What about learning only a sparse graphical model? This is possible when the sparsity level s = o(n).
- Very often it is the structure of the GM that is more interesting.
Recall the lasso:
$\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$
Graph Regression (Meinshausen & Bühlmann 06)
Neighborhood selection via the lasso: regress each node on all the others,
$\hat\beta^{(i)} = \arg\min_{\beta} \|X_i - X_{\setminus i}\beta\|_2^2 + \lambda\|\beta\|_1$,
and connect node $i$ to the nodes whose coefficients are nonzero.
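A minimal sketch of neighborhood selection, assuming a data matrix X with n rows (samples) and p columns (variables); the function name, the fixed penalty lam, and the OR symmetrization are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam):
    """Meinshausen-Buhlmann graph regression: lasso-regress each variable
    on all the others; nonzero coefficients define the neighborhood."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        rest = np.delete(np.arange(p), i)
        # sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha * ||b||_1
        beta = Lasso(alpha=lam).fit(X[:, rest], X[:, i]).coef_
        adj[i, rest] = beta != 0
    return adj | adj.T   # "OR" rule; the "AND" rule uses & instead
```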
Graph Regression
Graph Regression
Pros:
- Computationally convenient
- Strong theoretical guarantees (allows p ≤ poly(n))
Cons:
- Asymmetry: node i may select j while j does not select i
- Not minimax optimal
The regularized MLE (Yuan & Lin 07)
$\min_{Q} \; -\log\det Q + \mathrm{tr}(QS) + \lambda\|Q\|_1$
- $S$: sample covariance matrix; may be singular
- $\|Q\|_1$: may exclude the diagonal
- $-\log\det Q$: implicitly forces $Q$ to be PSD and symmetric
Pros: a single step estimates both the graph and the inverse covariance; it is the MLE!
Cons: computationally challenging, partly solved by glasso (Banerjee et al. 08, Friedman et al. 08)
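For concreteness, a hedged sketch using scikit-learn's glasso-style solver for this objective (the random data and the penalty alpha=0.1 are placeholders):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))         # placeholder data: n=200, p=10

model = GraphicalLasso(alpha=0.1).fit(X)   # alpha = l1 penalty weight
Q_hat = model.precision_                   # sparse estimate of Sigma^{-1}
print((np.abs(Q_hat) > 1e-8).sum(), "nonzero entries")
```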
Many, many follow-ups
A closer look at the RMLE
$\min_{Q} \; -\log\det Q + \mathrm{tr}(QS) + \lambda\|Q\|_1$
Set the derivative to 0: $-Q^{-1} + S + \lambda\,\mathrm{sign}(Q) = 0$, hence $\|Q^{-1} - S\|_\infty \le \lambda$.
Can we (?!) instead solve:
$\min_{Q} \|Q\|_1 \quad \text{s.t.} \quad \|Q^{-1} - S\|_\infty \le \lambda$
CLIME (Cai et al. 11)
Further relaxation:
$\min_{Q} \|Q\|_1 \quad \text{s.t.} \quad \|SQ - I\|_\infty \le \lambda$
- The constraint controls $Q \approx S^{-1}$; the objective controls the sparsity of $Q$
- $Q$ is not required to be PSD or symmetric
- Separable! LP!!! Both objective and constraint are element-wise separable, so the problem can be reformulated as a linear program
- Strong theoretical guarantees; variations are minimax-optimal (Cai et al. 12, Liu & Wang 12)
But for BIG problems
$\min_{Q} \|Q\|_1 \quad \text{s.t.} \quad \|SQ - I\|_\infty \le \lambda$
- Standard LP solvers can be slow
- Embarrassingly parallel: solve each column of $Q$ independently on each core/machine,
$\min_{q_i} \|q_i\|_1 \quad \text{s.t.} \quad \|Sq_i - e_i\|_\infty \le \lambda$
thanks to not having a PSD constraint on $Q$ (see the per-column LP sketch below)
- Still troublesome if $S$ is big; need to consider first-order methods
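To make the per-column LP concrete, a minimal sketch using the standard split q = q⁺ − q⁻ (the function name and the choice of scipy's HiGHS backend are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def clime_column(S, i, lam):
    """Solve min ||q||_1 s.t. ||S q - e_i||_inf <= lam as an LP,
    writing q = qp - qm with qp, qm >= 0."""
    p = S.shape[0]
    e = np.zeros(p); e[i] = 1.0
    c = np.ones(2 * p)                     # sum(qp) + sum(qm) = ||q||_1
    M = np.hstack([S, -S])                 # so that S q = M @ [qp; qm]
    A_ub = np.vstack([M, -M])              # S q - e <= lam and e - S q <= lam
    b_ub = np.concatenate([lam + e, lam - e])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:]
```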
A gentle introduction to the alternating direction method of multipliers (ADMM)
Optimization with coupled variables
Canonical form:
$\min_{w,z} f(w) + g(z) \quad \text{s.t.} \quad Aw + Bz = c$
where $w \in \mathbb{R}^m$, $z \in \mathbb{R}^p$, $A: \mathbb{R}^m \to \mathbb{R}^q$, $B: \mathbb{R}^p \to \mathbb{R}^q$, $c \in \mathbb{R}^q$. The objective is uncoupled, but the constraint couples $w$ and $z$.
Numerically challenging because:
- $f$ or $g$ may be nonsmooth or constrained (i.e., can take the value $\infty$)
- the linear constraint couples the variables $w$ and $z$
- at large scale, interior-point methods are not applicable
Naively alternating $w$ and $z$ does not work: consider $\min w^2$ s.t. $w + z = 1$; the optimum clearly has $w = 0$. Start with, say, $w = 1$ → $z = 0$ → $w = 1$ → $z = 0$ → … However, without the coupling we could solve for $w$ and $z$ separately.
Idea: try to decouple the variables in the constraint!
Example: Empirical Risk Minimization (ERM)
$\min_{w} \; g(w) + \sum_{i=1}^n f_i(w)$
Each $i$ corresponds to a training point $(x_i, y_i)$; the loss $f_i$ measures the fit of the model parameter $w$:
- least squares: $f_i(w) = (y_i - w^\top x_i)^2$
- support vector machines: $f_i(w) = (1 - y_i w^\top x_i)_+$
- boosting: $f_i(w) = \exp(-y_i w^\top x_i)$
- logistic regression: $f_i(w) = \log(1 + \exp(-y_i w^\top x_i))$
$g$ is the regularization function, e.g. $\lambda n\|w\|_2^2$ or $\lambda n\|w\|_1$.
Variables are coupled in the objective, but not in any constraint (there is none). Reformulate: transfer the coupling from the objective to the constraint, arriving at the canonical form and allowing a unified treatment later.
Why canonical form?
ERM: $\min_{w} g(w) + \sum_{i=1}^n f_i(w)$
Canonical form: $\min_{w,z} f(w) + g(z)$ s.t. $Aw + Bz = c$, where $w \in \mathbb{R}^m$, $z \in \mathbb{R}^p$, $A: \mathbb{R}^m \to \mathbb{R}^q$, $B: \mathbb{R}^p \to \mathbb{R}^q$, $c \in \mathbb{R}^q$
- The ADMM algorithm (to be introduced shortly) excels at solving the canonical form
- The canonical form is a general template for constrained problems
- ERM (and many other problems) can be converted to the canonical form through variable duplication (see next slide)
How to: variable duplication
Duplicate variables to achieve the canonical form:
$\min_{v,z} \; g(z) + \sum_i f_i(w_i) \quad \text{s.t.} \quad w_i = z, \; \forall i$
where $v = [w_1, \dots, w_n]^\top$; in canonical-form notation, $f(v) = \sum_i f_i(w_i)$ and the constraint reads $v - [I, \dots, I]^\top z = 0$.
- Global consensus constraint: $\forall i$, $w_i = z$, so all $w_i$ must (eventually) agree
- Downside: many extra variables increase the problem size; maintain the duplicated variables implicitly
Augmented Lagrangian
Canonical form: $\min_{w,z} f(w) + g(z)$ s.t. $Aw + Bz = c$, where $w \in \mathbb{R}^m$, $z \in \mathbb{R}^p$, $A: \mathbb{R}^m \to \mathbb{R}^q$, $B: \mathbb{R}^p \to \mathbb{R}^q$, $c \in \mathbb{R}^q$
Introduce a Lagrange multiplier $\lambda$ to decouple the variables:
$L_\mu(w, z; \lambda) = f(w) + g(z) + \lambda^\top(Aw + Bz - c) + \frac{\mu}{2}\|Aw + Bz - c\|_2^2$
$L_\mu$: the augmented Lagrangian. A more complicated min-max problem, but no coupling constraints.
Why the augmented Lagrangian?
The quadratic term gives numerical stability:
$L_\mu(w, z; \lambda) = f(w) + g(z) + \lambda^\top(Aw + Bz - c) + \frac{\mu}{2}\|Aw + Bz - c\|_2^2$
- May lead to strong convexity in $w$ or $z$; faster convergence when strongly convex
- Allows a larger step size (due to higher stability)
- Prevents the subproblems from diverging to infinity (again, stability)
- But sometimes it is better to work with the ordinary Lagrangian
ADMM Algorithm
Fix the dual $\lambda$; block coordinate descent on the primal $(w, z)$:
$w_{t+1} \leftarrow \arg\min_w L_\mu(w, z_t; \lambda_t) \equiv f(w) + \frac{\mu}{2}\|Aw + Bz_t - c + \lambda_t/\mu\|^2$
$z_{t+1} \leftarrow \arg\min_z L_\mu(w_{t+1}, z; \lambda_t) \equiv g(z) + \frac{\mu}{2}\|Aw_{t+1} + Bz - c + \lambda_t/\mu\|^2$
Fix the primal $(w, z)$; gradient ascent on the dual:
$\lambda_{t+1} \leftarrow \lambda_t + \eta\,(Aw_{t+1} + Bz_{t+1} - c)$
The step size $\eta$ can be large, e.g. $\eta = \mu$. Usually one rescales $\lambda \leftarrow \lambda/\mu$ to remove $\mu$.
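To make the three updates concrete, a minimal sketch of ADMM for the lasso, $\min_{w,z} \frac{1}{2}\|Xw - y\|^2 + \lambda\|z\|_1$ s.t. $w - z = 0$ (so A = I, B = −I, c = 0); the names, the fixed mu, and the iteration count are illustrative:

```python
import numpy as np

def soft(v, tau):
    """Soft-thresholding: prox of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_lasso(X, y, lam, mu=1.0, iters=200):
    n, d = X.shape
    w = np.zeros(d); z = np.zeros(d); u = np.zeros(d)  # u = scaled dual lambda/mu
    G = X.T @ X + mu * np.eye(d)    # cache the system used by every w-step
    Xty = X.T @ y
    for _ in range(iters):
        w = np.linalg.solve(G, Xty + mu * (z - u))  # w-step: ridge-like solve
        z = soft(w + u, lam / mu)                   # z-step: prox of l1
        u = u + (w - z)                             # dual ascent (step = mu)
    return z
```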
ERM revisited
Reformulate by duplicating variables:
$\min_{v,z} \; g(z) + \sum_i f_i(w_i) \quad \text{s.t.} \quad w_i = z, \; \forall i$
ADMM $w$-step:
$w_{t+1} \leftarrow \arg\min_w L_\mu(w, z_t; \lambda_t) = \sum_i \Big[ f_i(w_i) + \frac{\mu}{2}\|w_i - z_t + \lambda_t^i/\mu\|^2 \Big]$
Thanks to the duplication, the $w$-step is completely decoupled: parallelizable, and in closed form if $f_i$ is simple (see the sketch below).
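A hedged sketch of this consensus form with squared losses $f_i(w) = (y_i - w^\top x_i)^2$ and $g = \lambda n\|\cdot\|_1$, so that both steps are in closed form (the per-i solve uses the Sherman-Morrison formula; all names and constants are illustrative):

```python
import numpy as np

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def consensus_admm(X, y, lam, mu=1.0, iters=100):
    """Consensus ADMM for min_w lam*n*||w||_1 + sum_i (y_i - w'x_i)^2,
    with one duplicated w_i per data point. The inner loop over i is
    the embarrassingly parallel step (one small solve per i)."""
    n, d = X.shape
    W = np.zeros((n, d)); U = np.zeros((n, d)); z = np.zeros(d)  # U = scaled duals
    for _ in range(iters):
        for i in range(n):                        # parallelizable across i
            xi = X[i]
            b = 2 * y[i] * xi + mu * (z - U[i])
            # Solve (2 xi xi' + mu I) w_i = b via Sherman-Morrison.
            W[i] = (b - xi * (2 * (xi @ b)) / (mu + 2 * (xi @ xi))) / mu
        z = soft((W + U).mean(axis=0), lam / mu)  # z-step: averaging + prox
        U += W - z                                # dual updates
    return z
```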
ADMM: History and Related Methods
- Augmented Lagrangian Method (ALM): solve $w$ and $z$ jointly even though they are coupled; (Bertsekas '82) and references therein
- Alternating Direction Method of Multipliers (ADMM): alternate $w$ and $z$ as on the previous slide; (Boyd et al. '10) and references therein
- Operator splitting for PDEs: Douglas, Peaceman, and Rachford (50s-70s); Glowinski et al. '80s; Gabay '83; Spingarn '85
- (Eckstein & Bertsekas '92; He et al. '02) in variational inequalities
- Lots of recent work.
ADMM: Linearization
The demanding step in each iteration of ADMM (similarly for $z$):
$x_{t+1} \leftarrow \arg\min_x L_\mu(x, z_t; y_t) = f(x) + g(z_t) + y_t^\top(Ax + Bz_t - c) + \frac{\mu}{2}\|Ax + Bz_t - c\|_2^2$
- Diagonal $A$: reduces to the proximal map (more later) of $f$; for $f(x) = \|x\|_1$, soft-shrinkage: $\mathrm{sign}(x)(|x| - \mu)_+$
- Non-diagonal $A$: no closed form, messy inner loop. Instead, reduce to the diagonal case:
  - a single gradient step: $x_{t+1} \leftarrow x_t - \eta\big[\partial f(x_t) + A^\top y_t + \mu A^\top(Ax_t + Bz_t - c)\big]$
  - or linearize the quadratic at $x_t$: $x_{t+1} \leftarrow \arg\min_x f(x) + y_t^\top Ax + (x - x_t)^\top \mu A^\top(Ax_t + Bz_t - c) + \frac{\mu}{2}\|x - x_t\|_2^2 \equiv f(x) + \frac{\mu}{2}\|x - x_t + A^\top(Ax_t + Bz_t - c + y_t/\mu)\|_2^2$
- Intuition: $x$ is re-computed in the next iteration anyway; no need for a perfect $x$
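For $f = \|\cdot\|_1$ the linearized update is a single soft-shrinkage, as in this minimal sketch (names are illustrative; A, B, c are the canonical-form data):

```python
import numpy as np

def soft(v, tau):
    """prox of tau * ||.||_1: soft-shrinkage sign(v) * (|v| - tau)_+."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def linearized_x_step(x, z, y, A, B, c, mu):
    # Linearize (mu/2)||Ax + Bz - c||^2 around the current x, then take
    # the prox of f = ||.||_1 at the resulting point (threshold 1/mu).
    r = A @ x + B @ z - c + y / mu        # scaled residual
    return soft(x - A.T @ r, 1.0 / mu)
```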
Convergence Guarantees: Fixed-Point Theory
Recall some definitions:
- proximal map: $P^\mu_f(w) := \arg\min_z \frac{1}{2\mu}\|z - w\|_2^2 + f(z)$; well-defined for convex $f$, non-expansive ($\|T(x) - T(y)\|_2 \le \|x - y\|_2$), and generalizes the Euclidean projection
- reflection map: $R^\mu_f(w) := 2P^\mu_f(w) - w$
- Lagrangian: $\min_{x,z} L_0(x, z; y) = \underbrace{\min_x \big[f(x) + y^\top Ax\big]}_{-d_1(y)} + \underbrace{\min_z \big[g(z) + y^\top(Bz - c)\big]}_{-d_2(y)}$
ADMM = Douglas-Rachford splitting applied to the dual, a fixed-point iteration:
$w \leftarrow \tfrac{1}{2}\big(w + R^\mu_{d_2}(R^\mu_{d_1}(w))\big); \quad y \leftarrow P^\mu_{d_2}(w)$
Convergence follows, e.g., from (Bauschke & Combettes '13); this explains why the dual $y$, not the primal $x$ or $z$, always converges.
ADMM for CLIME
Apply ADMM to CLIME
$\min_{Q} \|Q\|_1 \quad \text{s.t.} \quad \|SQ - E\|_\infty \le \lambda$
Solve a block of columns of $Q$ on each core/machine; $E$ is the corresponding block of columns of $I$.
Step 1: reduce to the ADMM canonical form via variable duplication:
$\min_{Q,Z} \|Q\|_1 \;\; \text{s.t.} \;\; \|Z - E\|_\infty \le \lambda, \; Z = SQ$
$\equiv \; \min_{Q,Z} \|Q\|_1 + \iota[\|Z - E\|_\infty \le \lambda] \;\; \text{s.t.} \;\; Z = SQ$
(where $\iota[C]$ denotes the indicator function of the constraint $C$: 0 if it holds, $\infty$ otherwise)
Apply ADMM to CLIME (cont'd)
Step 2: write out the augmented Lagrangian (in scaled form, with dual variable $Y$):
$L(Q, Z; Y) = \|Q\|_1 + \iota[\|Z - E\|_\infty \le \lambda] + \rho\,\mathrm{tr}[(SQ - Z)^\top Y] + \frac{\rho}{2}\|SQ - Z\|_F^2$
Step 3: perform the primal-dual updates:
$Q \leftarrow \arg\min_Q \|Q\|_1 + \frac{\rho}{2}\|SQ - Z + Y\|_F^2$
$Z \leftarrow \arg\min_Z \iota[\|Z - E\|_\infty \le \lambda] + \frac{\rho}{2}\|SQ - Z + Y\|_F^2 = \arg\min_{\|Z - E\|_\infty \le \lambda} \|SQ - Z + Y\|_F^2$
$Y \leftarrow Y + SQ - Z$
Apply ADMM to CLIME (cont'd)
Step 4: solve the subproblems.
- Dual $Y$: trivial.
- Primal $Z$: projection onto an $\ell_\infty$ ball centered at $E$; separable, easy.
- Primal $Q$: easy if $S$ is orthogonal, but in general a lasso-type problem. Bypass the double loop by linearization (intuition: wasteful to solve for $Q$ to death):
$\min_Q \|Q\|_1 + \rho\,\mathrm{tr}\big(Q^\top S(Y + SQ_t - Z)\big) + \frac{\eta}{2}\|Q - Q_t\|_F^2$
which is solved by soft-thresholding.
Putting things together: iterate the linearized $Q$-step, the $Z$-projection, and the dual update $Y \leftarrow Y + SQ - Z$ (see the sketch below).
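Putting the four steps into code, a minimal sketch of the linearized ADMM iteration for one block of columns (rho, eta, the iteration count, and all names are illustrative choices, not the tuned implementation of Wang et al.):

```python
import numpy as np

def soft(V, tau):
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def clime_admm(S, E, lam, rho=1.0, eta=None, iters=500):
    """Linearized ADMM for min ||Q||_1 s.t. ||S Q - E||_inf <= lam,
    where E is a block of columns of the identity (S symmetric)."""
    p, k = E.shape
    if eta is None:
        # Linearization is stable for eta >= rho * ||S||_2^2.
        eta = rho * np.linalg.norm(S, 2) ** 2
    Q = np.zeros((p, k)); Z = np.zeros((p, k)); Y = np.zeros((p, k))
    for _ in range(iters):
        R = S @ Q - Z + Y                               # scaled residual
        Q = soft(Q - (rho / eta) * (S @ R), 1.0 / eta)  # linearized Q-step
        SQ = S @ Q
        Z = E + np.clip(SQ + Y - E, -lam, lam)          # project onto inf-ball at E
        Y = Y + SQ - Z                                  # dual update
    return Q
```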
Exploiting structure
The expensive step in ADMM-CLIME is matrix-matrix multiplication: $SQ$ and the like.
- If $p \gg n$, $S$ is $p \times p$ but of rank at most $n$
- Write $S = AA^\top$ with $A = [x_1, \dots, x_n] \in \mathbb{R}^{p \times n}$ (suitably scaled), and compute $A(A^\top Q)$
- A single matrix-matrix product is much faster than a for-loop of matrix-vector products
- Preferable to solve a balanced block of columns on each machine
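A small sketch of the low-rank trick (shapes are illustrative; the 1/sqrt(n) scaling assumes S is the sample covariance of centered data):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 5000, 100, 50
A = rng.standard_normal((p, n)) / np.sqrt(n)   # S = A @ A.T has rank <= n
Q = rng.standard_normal((p, k))

# Naive: form the p x p matrix S explicitly -- O(p^2) memory.
# S = A @ A.T; SQ = S @ Q
# Low-rank: never form S -- O(p*n*k) time, O(p*n) memory.
SQ = A @ (A.T @ Q)
```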
Parallelization (Wang et al. 13)
- Embarrassingly parallel if $A$ fits into memory
- Otherwise, chop $A$ into small blocks and distribute; communication may be high
Numerical results (Wang et al. 13)
Numerical results, cont'd (Wang et al. 13)
Nonparanormal extensions
Nonparanormal (Liu et al. 09)
$Z_i = f_i(X_i), \; i = 1, \dots, p, \qquad (Z_1, \dots, Z_p) \sim N(0, \Sigma)$
- $f_i$: unknown monotonic functions; we observe $X$, but not $Z$
- Conditional independence is preserved under the transformations:
$X_i \perp X_j \mid X_{\text{rest}} \iff Z_i \perp Z_j \mid Z_{\text{rest}} \iff Q_{ij} = 0$
- Can estimate the $f_i$ first, then apply glasso to $\hat f_i(X_i)$
- But estimating functions can be slow, with a nonparametric rate
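A hedged sketch of this plug-in route: estimate each $f_i$ by a normal-scores transform of the empirical CDF, then hand the transformed data to any glasso solver (this simplifies the truncated/Winsorized ECDF of Liu et al. to a plain rank-based one):

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_scores(X):
    """Estimate Z = f(X) columnwise: map ranks through the Gaussian
    quantile function (a stand-in for fitting each f_i)."""
    n = X.shape[0]
    U = rankdata(X, axis=0) / (n + 1)   # empirical CDF values in (0, 1)
    return norm.ppf(U)                  # Gaussian scores

# Z_hat = normal_scores(X); then run glasso / CLIME on Z_hat's covariance.
```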
A neat observation
Since $f_i$ is monotonic, $Z_{i,:}$ is comonotone (concordant) with $X_{i,:}$: use a rank estimator!
$Z_i = f_i(X_i), \qquad (Z_1, \dots, Z_p) \sim N(0, \Sigma)$
We want $\Sigma_{ij} = \frac{1}{n}\sum_{k=1}^n Z_{i,k} Z_{j,k}$, but we observe only $X_{i,1}, \dots, X_{i,n}$, with ranks $R_{i,1}, \dots, R_{i,n}$, not $Z_{i,1}, \dots, Z_{i,n}$.
Maybe??? $\hat\Sigma_{ij} \overset{?}{=} \frac{1}{n}\sum_{k=1}^n R_{i,k} R_{j,k}$
Kendall's tau
Assuming no ties:
$\hat\tau_{ij} = \frac{1}{n(n-1)} \sum_{k,\ell} \mathrm{sign}\big[(R_{i,k} - R_{i,\ell})(R_{j,k} - R_{j,\ell})\big]$
Complexity of computing Kendall's tau?
Key: $\Sigma_{ij} = \sin\big(\tfrac{\pi}{2} E(\hat\tau_{ij})\big)$, so $t \mapsto \sin(\tfrac{\pi}{2} t)$ turns $\hat\tau_{ij}$ into a genuine, asymptotically unbiased estimate of $\Sigma_{ij}$
- The map is Lipschitz, preserving concentration
- After obtaining $\hat\Sigma$, use whatever glasso you like, e.g., CLIME
- Can also use other rank estimators, e.g., Spearman's rho (with the map $t \mapsto 2\sin(\tfrac{\pi}{6} t)$)
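A minimal sketch of the rank-based plug-in estimator, using scipy's O(n log n) Kendall's tau (the function name is illustrative; the resulting Sigma_hat would then be fed to CLIME or glasso as above):

```python
import numpy as np
from scipy.stats import kendalltau

def rank_covariance(X):
    """Nonparanormal plug-in: Sigma_hat[i, j] = sin(pi/2 * tau_hat[i, j])."""
    n, p = X.shape
    Sigma = np.eye(p)
    for i in range(p):
        for j in range(i + 1, p):
            tau, _ = kendalltau(X[:, i], X[:, j])
            Sigma[i, j] = Sigma[j, i] = np.sin(np.pi / 2 * tau)
    return Sigma
```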
Summary
- Gaussian graphical model selection: neighborhood selection, regularized MLE, CLIME; implicit PSD vs. explicit PSD
- Distributed ADMM: a generic procedure; can distribute the matrix products
- Nonparanormal Gaussian graphical models: rank statistics, plug-in estimator
References
Graph lasso:
- High-Dimensional Graphs and Variable Selection with the Lasso, Meinshausen and Bühlmann, Annals of Statistics, vol. 34, 2006
- Model Selection and Estimation in the Gaussian Graphical Model, Yuan and Lin, Biometrika, vol. 94, 2007
- A Constrained l1 Minimization Approach for Sparse Precision Matrix Estimation, Cai et al., Journal of the American Statistical Association, 2011
- Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data, Banerjee et al., Journal of Machine Learning Research, 2008
- Sparse Inverse Covariance Estimation with the Graphical Lasso, Friedman et al., Biostatistics, 2008
Nonparanormal:
- The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs, Han Liu et al., Journal of Machine Learning Research, 2009
- Regularized Rank-based Estimation of High-Dimensional Nonparanormal Graphical Models, Lingzhou Xue and Hui Zou, Annals of Statistics, vol. 40, 2012
- High-Dimensional Semiparametric Gaussian Copula Graphical Models, Han Liu et al., Annals of Statistics, vol. 40, 2012
ADMM:
- Large Scale Distributed Sparse Precision Estimation, Wang et al., NIPS 2013
- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Boyd et al., Foundations and Trends in Machine Learning, 2011