Probabilistic Graphical Models

Size: px

Start display at page:

Download "Probabilistic Graphical Models"

Evelyn Greene
5 years ago
Views:

1 School of Computer Science Probabilistic Graphical Models Distributed ADMM for Gaussian Graphical Models Yaoliang Yu Lecture 29, April 29, 2015 Eric CMU,

2 Networks / Graphs Eric CMU,

3 Where do graphs come from? Prior knowledge Mom told me A is connected to B Estimate from data! We have seen this in previous classes Will see two more today Sometimes may also be interested in edge weights An easier problem Real networks are BIG Require distributed optimization Eric CMU,

4 Structural Learning for completely observed Data MRF (Recall) 1) ( x,, x ( ( 1) 1 n ( 2) ( 2) 1,, n ( x x M ) ( x,, x ) ) ( ( M ) 1 n ) Eric CMU,

5 Gaussian Graphical Models Multivariate Gaussian density: WOLG: let We can view this as a continuous Markov Random Field with potentials defined on every node and edge: { } ) - ( ) - ( - exp ) ( ), ( / / µ µ π µ x x x Σ Σ = Σ T n p ( ) = = < j i j i ij i i ii n p x x q x q Q Q x x x p / 2 1/ exp ) (2 ) 0,,,, ( π µ! 5 Eric CMU,

6 The covariance and the precision matrices Covariance matrix Graphical model interpretation? Precision matrix Graphical model interpretation? Eric CMU,

7 Sparse precision vs. sparse covariance in GGM Σ = Σ = X 1 Σ 15 = 0 X1 X 5 X nbrs (1) or nbrs(5) 1 X 5 Σ15 = 0 Eric CMU,

8 Another example How to estimate this MRF? What if p >> n MLE does not exist in general! What about only learning a sparse graphical model? This is possible when s=o(n) Very often it is the structure of the GM that is more interesting Eric CMU,

9 Recall lasso Eric CMU,

10 Graph Regression (Meinshausen & Buhlmann 06) Neighborhood selection Lasso: Eric CMU,

11 Graph Regression Eric CMU,

12 Graph Regression Pros: Computationally convenient Strong theoretical guarantee ( p <= pol(n) ) Cons: Asymmetry Not minimax optimal Eric CMU,

13 The regularized MLE (Yuan & Lin 07) min Q log det Q +tr(qs)+ kqk 1 S: sample covariance matrix, may be singular Q 1 : may exclude the diagonal log det Q: implicitly force Q to be PSD symmetric Pros Cons Single step for estimating graph and inverse covariance MLE! Computationally challenging, partly solved by Glasso (Banergee et al 08, Friedman et al 08) Eric CMU,

14 Many many follow-ups Eric CMU,

15 A closer look of RMLE min Q log det Q +tr(qs)+ kqk 1 Set derivative to 0: Q 1 + S + sign(q) =0 kq 1 Sk 1 apple Can we (?!): min Q kqk 1 s.t. kq 1 Sk 1 apple Eric CMU,

16 CLIME (Cai et al. 11) Further relaxation min Q kqk 1 s.t. ksq Ik 1 apple Constraint controls Q S 1 Objective controls sparsity in Q Q is not required to be PSD or symmetric Separable! LP!!! Both objective and constraint are element-wise separable Can be reformulated as LP Strong theoretical guarantee Variations are minimax-optimal (Cai et al. 12, Liu & Wang 12) Eric CMU,

17 But for BIG problems min Q kqk 1 s.t. ksq Ik 1 apple Standard solvers for LP can be slow Embarrassingly parallel: Solve each column of Q independently in each core/machine min q i kq i k 1 s.t. ksq i e i k 1 apple Thanks for not having PSD constraint on Q Still troublesome if S is big Need to consider first-order methods Eric CMU,

18 A gentle introduction to alternating direction method of multipliers (ADMM) Eric CMU,

19 Optimization with coupling variables J uncoupled L coupled Canonical form: min w,z f(w)+g(z), s.t. Aw + Bz = c, where w 2 R m,z 2 R p,a: R m! R q,b : R p! R q,c2 R q Numerically challenging because Function f or g nonsmooth or constrained (i.e., can take value ) Linear constraint couples the variables w and z Large scale, interior point methods NA Naively alternating x and z does not work Min w 2 s.t. w + z = 1; optimum clearly is w = 0 Start with say w = 1 à z = 0 à w = 1 à z = 0 However, without coupling, can solve separately w and z Idea: try to decouple vars in the constraint! 1 Eric CMU,

20 Example: Empirical Risk Minimization (ERM) Each i corresponds to a training point (x i, y i ) Loss f i measures the fitness of the model parameter w least squares: support vector machines: boosting: logistic regression: g is the regularization function, e.g. or Vars coupled in obj, but not in constraint (none) min w n g(w)+ X f i (w) f i (w) =(y i w > x i ) 2 L coupled i=1 f i (w) =(1 y i w > x i ) + f i (w) =exp( y i w > x i ) f i (w) = log(1 + exp( y i w > x i )) Reformulate: transfer coupling from obj to constraint Arrive at canonical form, allow unified treatment later nkwk 2 2 nkwk 1 Eric CMU,

21 Why canonical form? ERM: min w g(w)+ n X i=1 f i (w) Canonical form: min w,z f(w)+g(z), s.t. Aw + Bz = c, where w 2 R m,z 2 R p,a: R m! R q,b : R p! R q,c2 R q ADMM algorithm (to be introduced shortly) excels at solving the canonical form Canonical form is a general template for constrained problems ERM (and many other problems) can be converted to canonical form through variable duplication (see next slide) Eric CMU,

22 How to: variable duplication Duplicate variables to achieve canonical form All w i must (eventually) agree Downside: many extra variables, increase problem size min w n g(w)+ X f i (w) Implicitly maintain duplicated variables i=1 v =[w 1,...,w n ] > min g(z)+ X f i(w i ), s.t. w i = z,8i v,z i {z } {z } v [I,...,I] f(v) > z=0 Global consensus constraint: 8i, w i = z Eric CMU,

23 Augmented Lagrangian Canonical form: min w,z f(w)+g(z), s.t. Aw + Bz = c, where w 2 R m, z 2 R p,a: R m! R q,b : R p! R q, c 2 R q Intro Lagrangian multiplier to decouple variables min w,z > (Aw + Bz {z c)+ µ 2 kaw + Bz ck2 2 } L µ (w,z; ) L µ : augmented Lagrangian More complicated min-max problem, but no coupling constraints Eric CMU,

24 Why Augmented Lagrangian? Quadratic term gives numerical stability min w,z > (Aw + Bz {z c)+ µ 2 kaw + Bz ck2 2 } L µ (w,z; ) May lead to strong convexity in w or z Faster convergence when strongly convex Allows larger step size (due to higher stability) Prevents subproblems diverging to infinity (again, stability) But sometimes better to work with normal Lagrangian Eric CMU,

25 ADMM Algorithm Fix dual, block coordinate descent on primal w, z min w,z > (Aw + Bz {z c)+ µ 2 kaw + Bz ck2 2 } L µ (w,z; ) arg min L µ(w, z t ; w z t+1 arg min L µ (w t+1, z; z Fix primal w, z, gradient ascent on dual w t+1 t+1 t + (Aw t+1 + Bz t+1 c) Step size can be large, e.g. = µ Usually rescale to remove / t ) f(w)+ µ 2 kaw + Bzt c + t /µk 2 t ) g(z)+ µ 2 kawt+1 + Bz c + t /µk 2 Eric CMU,

26 ERM revisited Reformulate by duplicating variables min v,z g(z)+ X f i(w i ), s.t. w i = z, 8i i {z } {z } v [I,...,I] f(v) > z=0 ADMM x-step: w t+1 arg min w L µ(w, z t ; t ) f(w)+ µ 2 kaw + Bzt c + t /µk 2 = X i f i (w i )+ µ 2 kw i z t + t ik 2 Thanks to duplicating Completely decoupled Parallelizable Closed-form if f i is simple Eric CMU,

27 ADMM: History and Related Augmented Lagrangian Method (ALM): solve w, z jointly even though coupled (Bertsekas'82) and refs therein Alternating Direction of Multiplier Method (ADMM): alternate w and z as previous slide (Boyd et al.'10) and refs therein Operator splitting for PDEs: Douglas, Peaceman, and Rachford (50s-70s) Glowinsky et al.'80s, Gabay'83; Spingarn'85 (Eckstein & Bertsekas'92; He et al.'02) in variational inequality Lots of recent work. Eric CMU,

28 ADMM: Linearization x t+1 Demanding step in each iteration of ADMM (similar for z): Diagonal A: reduce to proximal map (more later) of f f(x) =kxk 1, soft-shrinkage: sign(x) ( x µ) + Non-diagonal A: no closed-form, messy inner loop Instead, reduce to diagonal A by arg min L µ (x, z t ; y t ) = f(x)+g(z t )+y t > (Ax + Bz t c)+ µ 2 kax + Bz t ck 2 2 x A single gradient step: x t+1 x t )+A > y t + µa > (Ax t + Bz t c) x t Or, linearize the quadratic at : x t+1 arg min f(x)+y t > Ax +(x x t ) > µa > (Ax t + Bz t c)+ µ x 2 kx x tk 2 2 {z } Intuition: x re-computed in the next iteration anyways No need for perfect x f(x)+ µ 2 kx x t+a > (Ax t +Bz t c+y t /µ)k 2 2 Eric CMU,

29 Convergence Guarantees: Fixedpoint theory Recall some definitions Lagrangian: ADMM = Douglas-Rachford splitting Fixed-point iteration! proximal map P µ f (w) := arg min z reflection map R µ f (w) :=2Pµ f (w) convergence follows, e.g. (Bauschke & Combettes'13) explains why dual y, not primal x or z, always converges 1 2µ kz wk2 2 + f(z) well-defined for convex f, non-expansive: kt (x) T (y)k 2 applekx yk 2 proximal map generalizes the Euclidean projection L 0 (x, z; y) =min f(x)+y > Ax +min g(z)+y > (Bz c) x z {z } {z } d 1 (y) d 2 (y) w 1 2 (w + Rµ d 2 (R µ d 1 (w))); y P µ d 2 (w) w Eric CMU,

30 ADMM for CLIME Eric CMU,

31 Apply ADMM to CLIME min Q kqk 1 s.t. ksq Ek 1 apple Solve a block of columns of Q in each core/machine E is the corresponding block in I Step 1: reduce to ADMM canonical form Use variable duplicating min Q,Z kqk 1 s.t. kz Ek 1 apple, Z = SQ min Q,Z kqk 1 +[kz Ek 1 apple ] s.t. Z = SQ Eric CMU,

32 Apply ADMM to CLIME (cont ) Step 2: Write out augmented Lagrangian L(Q, Z; Y )=kqk 1 +[kz Ek 1 apple ]+ tr[(sq Z)Y ]+ 2 ksq Zk2 F Step 3: Perform primal-dual updates Q Z arg min Q kqk ksq Z + Y k2 F arg min Z [kz Ek 1 apple ]+ 2 ksq Z + Y k2 F = arg min kz Ek 1 apple Y Y + SQ Z 2 ksq Z + Y k2 F Eric CMU,

33 Apply ADMM to CLIME (cont ) Step 4: Solve the subproblems Lagrangian dual Y: trivial Primal Z: projection to l_inf ball, separable, easy Primal Q: easy if S is orthogonal, in general a lasso problem Bypass double loop by linearization Intuition: wasteful to solve Q to death min Q kqk 1 + tr(q > S(Y + SQ t Z)) + 2 kq Q tk 2 F Soft-thresholding Putting things together Q Z arg min Q kqk ksq Z + Y k2 F arg min Z [kz Ek 1 apple ]+ 2 ksq Z + Y k2 F = arg min kz Ek 1 apple Y Y + SQ Z 2 ksq Z + Y k2 F Eric CMU,

34 Exploring structure Expensive step in ADMM-CLIME: Matrix-matrix multiplication: SQ and alike If p >> n, S is size p x p but of rank at most n Write S = AA, and do A(A Q) A =[X 1,...,X n ] 2 R p n Matrix * matrix >> for loop of matrix * vector Preferable to solve a balanced block of columns Eric CMU,

35 Parallelization (Wang et al 13) Embarrassingly parallel If A fits into memory Chop A into small blocks and distribute Communication may be high Eric CMU,

36 Numerical results (Wang et al 13) Eric CMU,

37 Numerical results cont (Wang et al 13) Eric CMU,

38 Nonparanormal extensions Eric CMU,

39 Nonparanormal (Liu et al. 09) Z i = f i (X i ), i =1,...,p (Z 1,...,Z p ) N(0, ) f i: unknown monotonic functions Observe X, but not Z Independence preserved under transformations X i? X j X rest () Z i? Z j Z rest () Q ij =0 Can estimate f i first, then apply glasso on f i (X i ) Estimating functions can be slow, nonparametric rate Eric CMU,

40 A neat observation Since f i is monotonic Z i,: comonotone / concordant with X i,: Use rank estimator! Z i = f i (X i ), i =1,...,p (Z 1,...,Z p ) N(0, ) Z i,1,z i,2,...,z i,n R i,1,r i,2,...,r i,n Want, but do not observe Z Maybe??? ij = 1 n X i,1,x i,2,...,x i,n nx k=1 Z i,k Z j,k? = 1 n nx k=1 R i,k R j,k Eric CMU,

41 Kendall s tau Assuming no ties: Key: ij = 1 n(n 1) k,` Complexity of computing Kendall s tau? Genuine, asymptotically unbiased estimate of t 7! 2sin( 6 t) X sign[(r i,k R i,`)(r j,k R j,`)] ij =2sin( 6 E( ij)) is a contraction, preserving concentration After having, use whatever glasso, e.g., CLIME Can also use other rank estimator, e.g., Spearman s rho Eric CMU,

42 Summary Gaussian graphical model selection Neighborhood selection, Regularized MLE, CLIME Implicit PSD vs. Explicit PSD Distributed ADMM Generic procedure Can distribute matrix product Nonparanormal Gaussian graphical model Rank statistics Plug-in estimator Eric CMU,

43 References Graph Lasso High-Dimensional Graphs and Variable Selection with the LASSO, Meinshausen and Buhlmann, Annals of Statistics, vol. 34, 2006 Model Selection and Estimation in the Gaussian Graphical Model, Yuan and Lin, Biometrika, vol. 94, 2007 A Constrained l_1 minimization Approach for Sparse Precision Matrix Estimation, Cai et al., Journal of the American Statistical Association, 2011 Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data, Banerjee et al, Journal of Machine Learning Research, 2008 Sparse Inverse Covariance Estimation with the Graphical Lasso, Friedman et al., Biostatistics, 2008 Nonparanormal The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs, Han Liu et al., Journal of Machine Learning Research, 2009 Regularized Rank-based Estimation of High-Dimensional Nonparanormal Graphical Models, Lingzhou Xue and Hui Zou, Annals of Statistics, vol. 40, 2012 High-Dimensional Semiparametric Gaussian Copula Graphical Models, Han Liu et al., Annals of Statistics, vol. 40, 2012 ADMM Large Scale Distributed Sparse Precision Estimation, Wang et al., NIPS 2013 Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Boyd et al., Foundation and Trends in Machine Learning, 2011 Eric CMU,

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss

An efficient ADMM algorithm for high dimensional precision matrix estimation via penalized quadratic loss arxiv:1811.04545v1 [stat.co] 12 Nov 2018 Cheng Wang School of Mathematical Sciences, Shanghai Jiao