Sparse Inverse Covariance Estimation for a Million Variables
1 Sparse Inverse Covariance Estimation for a Million Variables Inderjit S. Dhillon Depts of Computer Science & Mathematics The University of Texas at Austin SAMSI LDHD Opening Workshop Raleigh, North Carolina Sept 10, 2013 Joint work with C. Hsieh, M. Sustik, P. Ravikumar and R. Poldrack
2 Estimate Relationships between Variables Example: FMRI Brain Analysis. Goal: Reveal functional connections between regions of the brain. (Sun et al, 2009; Smith et al, 2011; Varoquaux et al, 2010; Ng et al, 2011) Input Output Figure from (Varoquaux et al, 2010)
3 Estimate Relationships between Variables Example: Gene regulatory network discovery. Goal: Gene network reconstruction using microarray data. (Schafer & Strimmer 2005; Andrei & Kendziorski 2009; Menendez et al, 2010; Yin and Li, 2011) Figure from (Banerjee et al, 2008)
4 Other Applications Financial Data Analysis: Model dependencies in multivariate time series (Xuan & Murphy, 2007). Sparse high dimensional models in economics (Fan et al, 2011). Social Network Analysis / Web data: Model co-authorship networks (Goldenberg & Moore, 2005). Model item-item similarity for recommender systems (Agarwal et al, 2011). Climate Data Analysis (Chen et al, 2010). Signal Processing (Zhang & Fung, 2013). Anomaly Detection (Ide et al, 2009).
5 Inverse Covariance Estimation Given: n i.i.d. samples $\{y_1, \ldots, y_n\}$, $y_i \in \mathbb{R}^p$, $y_i \sim \mathcal{N}(\mu, \Sigma)$. Goal: Estimate relationships between variables. The sample mean and covariance are given by $\hat\mu = \frac{1}{n}\sum_{i=1}^n y_i$ and $S = \frac{1}{n}\sum_{i=1}^n (y_i - \hat\mu)(y_i - \hat\mu)^T$. The covariance matrix $\Sigma$ may not directly convey relationships. An example chain graph: $y_j = 0.5\,y_{j-1} + \mathcal{N}(0,1)$, for which $\Sigma$ (shown on the slide) is dense.
6 Inverse Covariance Estimation (cont'd) Conditional independence is reflected as zeros in $\Theta$ ($= \Sigma^{-1}$): $\Theta_{ij} = 0 \iff y_i$ and $y_j$ are conditionally independent given the other variables. Chain graph example: $y_j = 0.5\,y_{j-1} + \mathcal{N}(0,1)$, for which $\Theta$ (shown on the slide) is tridiagonal. In a GMRF $G = (V, E)$, each node corresponds to a variable, and each edge corresponds to a non-zero entry in $\Theta$. Goal: Estimate the inverse covariance matrix $\Theta$.
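The chain-graph example above can be checked numerically. A minimal NumPy sketch (variable names are ours, not from the slides): the chain can be written as $Ly = e$ with $L$ lower bidiagonal, so $\Theta = L^T L$ is tridiagonal while $\Sigma$ is dense.

```python
import numpy as np

p = 8          # chain length (small, for illustration)
phi = 0.5      # autoregressive coefficient: y_j = 0.5*y_{j-1} + N(0,1)

# Write the chain as L y = e with e ~ N(0, I): L is lower bidiagonal
# with 1 on the diagonal and -phi on the subdiagonal.
L = np.eye(p) - phi * np.eye(p, k=-1)

# Then Sigma = Cov(y) = L^{-1} L^{-T} is dense, while
# Theta = Sigma^{-1} = L^T L is tridiagonal.
Sigma = np.linalg.inv(L) @ np.linalg.inv(L).T
Theta = L.T @ L

# Theta_ij = 0 exactly when |i - j| > 1: the chain structure is
# visible in the inverse covariance, not in Sigma itself.
far = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) > 1
off_tridiag = np.abs(Theta[far])
print(off_tridiag.max())          # 0.0
print(abs(Sigma[0, -1]) > 0)      # True: distant pairs are still correlated
```

This is exactly the statement on the slide: zeros of $\Theta$, not of $\Sigma$, encode conditional independence.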
7 Maximum Likelihood Estimator: LogDet Optimization Given the n samples, the likelihood is $P(y_1, \ldots, y_n; \hat\mu, \Theta) \propto \prod_{i=1}^n (\det \Theta)^{1/2} \exp\left(-\tfrac{1}{2}(y_i - \hat\mu)^T \Theta (y_i - \hat\mu)\right) = (\det \Theta)^{n/2} \exp\left(-\tfrac{1}{2}\sum_{i=1}^n (y_i - \hat\mu)^T \Theta (y_i - \hat\mu)\right)$. The log likelihood can be written as $\log P(y_1, \ldots, y_n; \hat\mu, \Theta) = \tfrac{n}{2}\log\det\Theta - \tfrac{n}{2}\operatorname{tr}(\Theta S) + \text{constant}$. Maximum likelihood estimator: $\hat\Theta = \arg\min_{X \succ 0}\{-\log\det X + \operatorname{tr}(SX)\}$. Problem: when p > n, the sample covariance matrix S is singular.
8 L1-regularized inverse covariance selection A sparse inverse covariance matrix is preferred: add $\ell_1$ regularization to promote sparsity. The resulting optimization problem: $\hat\Theta = \arg\min_{X \succ 0}\{-\log\det X + \operatorname{tr}(SX) + \lambda\|X\|_1\} = \arg\min_{X \succ 0} f(X)$, where $\|X\|_1 = \sum_{i,j=1}^p |X_{ij}|$. The regularization parameter $\lambda > 0$ controls the sparsity. The problem appears hard to solve: a non-smooth log-determinant program whose number of parameters scales quadratically with the number of variables.
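The regularized objective is easy to evaluate directly; a small sketch (function name ours) that computes $f(X) = -\log\det X + \operatorname{tr}(SX) + \lambda\|X\|_1$:

```python
import numpy as np

def objective(X, S, lam):
    """f(X) = -log det X + tr(S X) + lam * ||X||_1 (elementwise L1,
    summed over all entries). X must be symmetric positive definite."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

# Tiny sanity check: with S = I and lam = 0 the minimizer is X = I
# (the gradient -X^{-1} + S vanishes there), with objective value p.
p = 5
S = np.eye(p)
print(objective(np.eye(p), S, lam=0.0))            # 5.0
print(objective(1.5 * np.eye(p), S, lam=0.0) > 5)  # True: any other X is worse
```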
11 Scalability Block coordinate ascent (Banerjee et al, 2007): solves p = 1000 in 4000 secs. Graphical Lasso (Friedman et al, 2007): solves p = 1000 in 800 secs. VSM, PSM, SINCO, IPM, PQN, ALM: ALM solves p = 1000 in 300 secs. QUIC (Hsieh et al, 2011): solves p = 1000 in 10 secs, p = 10,000 in half an hour, but needs > 3 days when p = 30,000 (out of memory). Need for scalability: the FMRI dataset has more than 220,000 variables (48 billion parameters). BigQUIC (2013): p = 1,000,000 in 22.9 hrs (using 32 cores).
12 Scalability Main Ingredients: 1. Second-order Newton-like method (QUIC): quadratic convergence rate. 2. Division of variables into free and fixed sets (QUIC): efficient computation of each Newton direction. 3. Memory-efficient scheme using block coordinate descent (BigQUIC): scales to one million variables. 4. Approximate Hessian computation (BigQUIC): super-linear convergence rate.
13 QUIC: QUadratic approximation for Inverse Covariance estimation
14 Second Order Method Newton's method for a twice differentiable function: $x \leftarrow x - \eta(\nabla^2 f(x))^{-1}\nabla f(x)$. However, the sparse inverse covariance estimation objective is not differentiable: $f(X) = -\log\det X + \operatorname{tr}(SX) + \lambda\|X\|_1$. Most current solvers are first-order methods: block coordinate descent (GLASSO), projected gradient descent (PSM), greedy coordinate descent (SINCO), alternating linearization method (ALM).
17 Quadratic Approximation Write the objective as $f(X) = g(X) + h(X)$, where $g(X) = -\log\det X + \operatorname{tr}(SX)$ and $h(X) = \lambda\|X\|_1$. Form a quadratic approximation for $g(X_t + \Delta)$: $\bar g_{X_t}(\Delta) = \operatorname{tr}((S - W_t)\Delta) + \tfrac{1}{2}\operatorname{vec}(\Delta)^T (W_t \otimes W_t)\operatorname{vec}(\Delta) - \log\det X_t + \operatorname{tr}(S X_t)$, where $W_t = (X_t)^{-1} = \nabla_X \log\det(X)\big|_{X = X_t}$. Define the generalized Newton direction: $D_t = \arg\min_\Delta \bar g_{X_t}(\Delta) + \lambda\|X_t + \Delta\|_1$. Solve by coordinate descent (Hsieh et al, 2011) or other methods (Olsen et al, 2012).
18 Coordinate Descent Updates Use coordinate descent to solve $\arg\min_D \{\bar g_{X_t}(D) + \lambda\|X + D\|_1\}$. Each coordinate descent update: $\mu = \arg\min_\mu \bar g(D + \mu(e_i e_j^T + e_j e_i^T)) + 2\lambda|X_{ij} + D_{ij} + \mu|$. The above one-variable problem can be simplified to minimizing $\tfrac{1}{2}(W_{ij}^2 + W_{ii}W_{jj})\mu^2 + (S_{ij} - W_{ij} + w_i^T D w_j)\mu + \lambda|X_{ij} + D_{ij} + \mu|$. Quadratic form with L1 regularization: soft thresholding gives the exact solution.
20 Efficient solution of one-variable problem The minimum is achieved at $\mu = -c + \mathcal{S}(c - b/a, \lambda/a)$, where $\mathcal{S}(z, r) = \operatorname{sign}(z)\max\{|z| - r, 0\}$ is the soft-thresholding function, $a = W_{ij}^2 + W_{ii}W_{jj}$, $b = S_{ij} - W_{ij} + w_i^T D w_j$, and $c = X_{ij} + D_{ij}$. The main cost is in computing $w_i^T D w_j$: direct computation requires $O(p^2)$ flops. Instead, we maintain $U = DW$ after each coordinate update, and then compute $w_i^T u_j$: only $O(p)$ flops per update. This needs $O(p^2)$ memory, not scalable to large p ⟹ block coordinate descent method to avoid $O(p^2)$ storage (BigQUIC).
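The closed-form update above is short enough to verify numerically; a sketch (function names ours) that implements $\mu = -c + \mathcal{S}(c - b/a, \lambda/a)$ and checks it against a fine grid:

```python
import numpy as np

def soft_threshold(z, r):
    """S(z, r) = sign(z) * max(|z| - r, 0)."""
    return np.sign(z) * max(abs(z) - r, 0.0)

def coordinate_update(a, b, c, lam):
    """Minimize 0.5*a*mu^2 + b*mu + lam*|c + mu| over mu (a > 0).
    Closed form: mu = -c + S(c - b/a, lam/a)."""
    return -c + soft_threshold(c - b / a, lam / a)

# Numerical check of the closed form against a dense grid.
a, b, c, lam = 2.0, 0.7, -0.3, 0.5
f = lambda mu: 0.5 * a * mu**2 + b * mu + lam * abs(c + mu)
mu_star = coordinate_update(a, b, c, lam)
grid = np.linspace(-5, 5, 200001)
print(f(mu_star) <= f(grid).min() + 1e-8)   # True
```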
21 Line Search Adopt the Armijo rule: try step-sizes $\alpha \in \{\beta^0, \beta^1, \beta^2, \ldots\}$ s.t. $X_t + \alpha D_t$: 1. is positive definite; 2. satisfies a sufficient decrease condition $f(X_t + \alpha D_t) \le f(X_t) + \alpha\sigma\Delta_t$, where $\Delta_t = \operatorname{tr}(\nabla g(X_t) D_t) + \lambda\|X_t + D_t\|_1 - \lambda\|X_t\|_1$. Both conditions can be checked by performing a Cholesky factorization: $O(p^3)$ flops per line search iteration. Can do better by using Lanczos [K.C. Toh] and Conjugate Gradient.
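The two line-search conditions can be sketched directly: a minimal Armijo backtracking loop (names and defaults ours, not the QUIC implementation) that uses a Cholesky attempt as the positive-definiteness check:

```python
import numpy as np

def armijo_step(f, X, D, Delta, sigma=0.001, beta=0.5, max_tries=30):
    """Backtracking line search: try alpha in {1, beta, beta^2, ...}
    until X + alpha*D is positive definite (checked via Cholesky)
    and f decreases by at least alpha*sigma*Delta (Delta < 0)."""
    fX = f(X)
    alpha = 1.0
    for _ in range(max_tries):
        Xnew = X + alpha * D
        try:
            np.linalg.cholesky(Xnew)          # raises if not PD
            if f(Xnew) <= fX + alpha * sigma * Delta:
                return alpha, Xnew
        except np.linalg.LinAlgError:
            pass
        alpha *= beta
    raise RuntimeError("line search failed")

# Example with lam = 0: f(X) = -log det X + tr(SX), S = 2I, X = I.
# D = -0.5 I points toward the optimum X* = 0.5 I, and Delta = tr(grad g * D) = -1.
S = 2 * np.eye(2)
f = lambda X: -np.linalg.slogdet(X)[1] + np.trace(S @ X)
alpha, X_new = armijo_step(f, np.eye(2), -0.5 * np.eye(2), Delta=-1.0)
print(alpha)   # 1.0 -- the full step is accepted
```

Each backtracking trial here costs one Cholesky factorization, which is the $O(p^3)$ cost the slide refers to.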
23 Free and Fixed Sets Many variables never change to nonzero values: for p = 1869 (p² = 3.49 million), only 3% change to nonzero values. Our goal: variable selection! Our strategy: before solving for the Newton direction, guess the variables that should be updated: $(X_t)_{ij} \in S_{\text{fixed}}$ if $|\nabla_{ij} g(X_t)| < \lambda$ and $(X_t)_{ij} = 0$; otherwise $(X_t)_{ij} \in S_{\text{free}}$. We then perform the coordinate descent updates only on the $S_{\text{free}}$ set.
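The fixed/free selection rule is one vectorized line; a sketch (function name ours) using $\nabla g(X) = S - W$ with $W = X^{-1}$:

```python
import numpy as np

def free_set_mask(X, S, W, lam):
    """Boolean mask of the free set. A variable (i,j) is fixed when
    X_ij = 0 and |grad_ij g(X)| = |S_ij - W_ij| < lam (W = X^{-1});
    everything else is free and receives coordinate-descent updates."""
    grad = S - W
    fixed = (X == 0) & (np.abs(grad) < lam)
    return ~fixed

# On a diagonal X with only weak off-diagonal signal in S, all
# off-diagonal variables stay fixed and only the diagonal is free.
p = 50
rng = np.random.default_rng(1)
S = np.eye(p) + 0.01 * rng.standard_normal((p, p))
S = (S + S.T) / 2
X = np.eye(p)
W = np.linalg.inv(X)
mask = free_set_mask(X, S, W, lam=0.1)
print(mask.sum(), "free of", p * p)   # 50 free of 2500
```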
24 Algorithm QUIC: QUadratic approximation for sparse Inverse Covariance estimation. Input: empirical covariance matrix S, scalar λ, initial $X_0$. For t = 0, 1, ...: 1. Compute $W_t = X_t^{-1}$. 2. Form the second order approximation $\bar f_{X_t}(X)$ to $f(X)$ around $X_t$. 3. Partition the variables into free and fixed sets. 4. Use coordinate descent to find the descent direction $D_t = \arg\min_\Delta \bar f_{X_t}(X_t + \Delta)$ over the set of free variables (a Lasso problem). 5. Use an Armijo-rule based step-size selection to get α s.t. $X_{t+1} = X_t + \alpha D_t$ is positive definite and the objective sufficiently decreases.
25 BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
26 Difficulties in Scaling QUIC Coordinate descent requires $X_t$ and $X_t^{-1}$: needs $O(p^2)$ storage and $O(mp)$ computation per sweep, where $m = \|X_t\|_0$. Line search (compute the determinant of a big sparse matrix): needs $O(p^2)$ storage and $O(p^3)$ computation.
27 Line Search Given the sparse matrix $A = X_t + \alpha D$, we need to: 1. check its positive definiteness; 2. compute $\log\det(A)$. Cholesky factorization in QUIC requires $O(p^3)$ computation. If $A = \begin{pmatrix} a & b^T \\ b & C \end{pmatrix}$, then $\det(A) = \det(C)(a - b^T C^{-1} b)$, and A is positive definite iff C is positive definite and $a - b^T C^{-1} b > 0$. C is sparse, so we can compute $C^{-1}b$ using Conjugate Gradient (CG). Time complexity: $T_{CG} = O(mT)$, where T is the number of CG iterations.
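The Schur-complement identity can be exercised on a small dense example. A sketch with a hand-rolled conjugate gradient solver (all names ours; a real implementation would exploit sparsity and apply the split recursively):

```python
import numpy as np

def conj_grad(A, b, tol=1e-10, max_iter=1000):
    """Plain conjugate gradient for a symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ad = A @ d
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

# Split a symmetric PD matrix A as [[a, b^T], [b, C]]; then
# det(A) = det(C) * (a - b^T C^{-1} b), and A is PD iff C is PD
# and the Schur complement is positive.
rng = np.random.default_rng(0)
p = 60
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)            # symmetric positive definite

a, b, C = A[0, 0], A[1:, 0], A[1:, 1:]
schur = a - b @ conj_grad(C, b)        # needs only matrix-vector products
logdet_A = np.linalg.slogdet(C)[1] + np.log(schur)
print(np.isclose(logdet_A, np.linalg.slogdet(A)[1]))   # True
```

The point of the identity is that the $C^{-1}b$ solve touches $A$ only through sparse matrix-vector products, which is what gives the $O(mT)$ cost per CG solve.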
28 Difficulties in Scaling QUIC (after the CG-based line search) Coordinate descent requires $X_t$ and $X_t^{-1}$: needs $O(p^2)$ storage and $O(mp)$ computation per sweep, where $m = \|X_t\|_0$. Line search (compute the determinant of a big sparse matrix): now needs only $O(p)$ storage and $O(mp)$ computation.
30 Back-of-the-envelope calculations Coordinate descent requires m computations of $w_i^T D w_j$. To solve a problem of size $p = 10^4$ (avg degree = 10), QUIC takes 12 minutes. Extrapolate: to solve a problem with $p = 10^6$, QUIC would take 12 min × 10^4 ≈ 83 days (2.6 days using 32 cores). $p^2$ memory means 8 terabytes of main memory for $p = 10^6$. Naive solution that has O(m) memory requirement: compute $w_i, w_j$ by solving two linear systems by CG, $O(T_{CG})$ computation per coordinate update, $O(mT_{CG})$ computation per sweep. This would need 8,333 days to solve the problem for $p = 10^6$ (260 days using 32 cores). BigQUIC solves problems of size $p = 10^6$ in one day on 32 cores!
35 Coordinate Updates with Memory Cache Assume we can store M columns of W in memory. Coordinate descent update (i, j): compute $w_i^T D w_j$. If $w_i, w_j$ are not in memory: recompute by CG (solve $X w_i = e_i$): $O(T_{CG}) = O(mT)$ time.
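The caching idea can be sketched with a small LRU column cache (class and names ours; `np.linalg.solve` stands in for the CG solve). Updating coordinates block by block keeps the needed columns hot, so each column is computed only once:

```python
import numpy as np
from collections import OrderedDict

class ColumnCache:
    """Cache up to M columns of W = X^{-1}, recomputing a missing
    column w_i by solving X w_i = e_i (stand-in for CG) on a miss."""
    def __init__(self, X, M):
        self.X, self.M = X, M
        self.cache = OrderedDict()
        self.misses = 0

    def column(self, i):
        if i in self.cache:
            self.cache.move_to_end(i)        # LRU bookkeeping
        else:
            self.misses += 1
            e = np.zeros(self.X.shape[0]); e[i] = 1.0
            self.cache[i] = np.linalg.solve(self.X, e)
            if len(self.cache) > self.M:
                self.cache.popitem(last=False)   # evict least recent
        return self.cache[i]

# Block-by-block access pattern: several sweeps over each block of
# M coordinates reuse the cached columns, so misses == p exactly.
p = 100
X = np.eye(p) + 0.1 * np.eye(p, k=1) + 0.1 * np.eye(p, k=-1)
cache = ColumnCache(X, M=10)
for i in range(0, p, 10):                    # blocks of 10 coordinates
    for _ in range(3):                       # several sweeps per block
        for j in range(i, i + 10):
            _ = cache.column(j)
print(cache.misses)                          # 100: each column computed once
```

A random access pattern over the same coordinates would instead miss repeatedly, which is the motivation for the block partitioning on the next slides.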
36 Coordinate Updates: ideal case We want to find an update sequence that minimizes the number of cache misses: probably NP-hard. Our strategy: update variables block by block. The ideal case: there exists a partition $\{S_1, \ldots, S_k\}$ such that all free sets are in diagonal blocks. This only requires p column evaluations.
37 General case: block diagonal + sparse If the block partition is not perfect, the extra column computations can be characterized by boundary nodes. Given a partition $\{S_1, \ldots, S_k\}$, we define the boundary nodes as $B(S_q) \equiv \{j \mid j \in S_q \text{ and } \exists\, i \in S_z,\, z \ne q, \text{ s.t. } F_{ij} = 1\}$, where F is the adjacency matrix of the free set.
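The boundary-node definition translates directly into code; a sketch (function name ours) on a tiny graph where only one edge crosses the partition:

```python
import numpy as np

def boundary_nodes(F, partition):
    """B(S_q): nodes j in S_q adjacent (in the free-set graph F) to
    some node i in a different block S_z, z != q."""
    boundaries = []
    for S_q in partition:
        outside = np.ones(F.shape[0], dtype=bool)
        outside[S_q] = False                 # mark nodes outside this block
        boundaries.append([j for j in S_q if F[outside, j].any()])
    return boundaries

# 6-node graph: edges (0,1), (1,2), (2,3), (4,5); blocks {0,1,2}, {3,4,5}.
F = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (4, 5)]:
    F[i, j] = F[j, i] = 1
parts = [[0, 1, 2], [3, 4, 5]]
print(boundary_nodes(F, parts))   # [[2], [3]]: only the (2,3) edge crosses
```

Columns for boundary nodes must be recomputed when processing the neighboring block, which is why the sweep cost below is $p + \sum_q |B(S_q)|$ column evaluations.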
38 Detailed updates in general case
42 Graph Clustering Algorithm The number of columns to be computed in one sweep is $p + \sum_q |B(S_q)|$. This can be upper bounded by $p + \sum_q |B(S_q)| \le p + \sum_q \sum_{z \ne q} \sum_{i \in S_z,\, j \in S_q} F_{ij}$. Minimizing the right-hand side = minimizing the off-diagonal entries. Use graph clustering (METIS or Graclus) to find the partition.
43 Size of boundary nodes in real datasets ER dataset with p = 692. Random partition: compute 1687 columns. Partition by clustering: compute 775 columns. $O(kpT_{CG}) \to O(pT_{CG})$ flops per sweep.
44 BigQUIC Block coordinate descent with clustering: needs $O(m + p^2/k)$ storage and $O(mp)$ computation per sweep, where $m = \|X_t\|_0$. Line search (compute the determinant of a big sparse matrix): needs $O(p)$ storage and $O(mp)$ computation.
45 Convergence Guarantees for QUIC Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS, 2011): QUIC converges quadratically to the global optimum, that is, for some constant $0 < \kappa < 1$: $\lim_{t \to \infty} \frac{\|X_{t+1} - X^*\|_F}{\|X_t - X^*\|_F^2} = \kappa$.
47 BigQUIC Convergence Analysis Recall $W = X^{-1}$. When each $w_i$ is computed by CG ($X w_i = e_i$): the gradient $\nabla_{ij} g(X) = S_{ij} - W_{ij}$ on the free set can be computed once and stored in memory; the Hessian ($w_i^T D w_j$ in the coordinate updates) needs to be repeatedly computed. To reduce the time overhead, the Hessian should be computed approximately. Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013): If $\nabla g(X)$ is computed exactly and $\eta I \preceq \hat H_t \preceq \bar\eta I$ for some constants $\bar\eta \ge \eta > 0$ at every Newton iteration, then BigQUIC converges to the global optimum.
49 BigQUIC Convergence Analysis To obtain a super-linear convergence rate, we require the Hessian computation to be more and more accurate as $X_t$ approaches $X^*$. Measure the accuracy of the Hessian computation by $R = [r_1, \ldots, r_p]$, where $r_i = X w_i - e_i$. The optimality of $X_t$ can be measured by the minimum-norm subgradient: $\nabla^S_{ij} f(X) = \nabla_{ij} g(X) + \operatorname{sign}(X_{ij})\lambda$ if $X_{ij} \ne 0$, and $\nabla^S_{ij} f(X) = \operatorname{sign}(\nabla_{ij} g(X))\max(|\nabla_{ij} g(X)| - \lambda, 0)$ if $X_{ij} = 0$. Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013): In BigQUIC, if the residual matrix satisfies $\|R_t\| = O(\|\nabla^S f(X_t)\|^\ell)$ for some $0 < \ell \le 1$ as $t \to \infty$, then $\|X_{t+1} - X^*\| = O(\|X_t - X^*\|^{1+\ell})$ as $t \to \infty$. When $\|R_t\| = O(\|\nabla^S f(X_t)\|)$, BigQUIC has an asymptotic quadratic convergence rate.
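The minimum-norm subgradient used as the optimality measure is easy to compute; a sketch (function name ours) that also checks it vanishes at a known optimum:

```python
import numpy as np

def min_norm_subgrad(X, S, lam):
    """Minimum-norm subgradient grad^S f(X) of
    f(X) = -log det X + tr(SX) + lam*||X||_1.
    It is zero exactly at the optimum, so its norm measures optimality."""
    G = S - np.linalg.inv(X)                      # gradient of the smooth part
    return np.where(X != 0,
                    G + lam * np.sign(X),
                    np.sign(G) * np.maximum(np.abs(G) - lam, 0.0))

# For S = I the regularized optimum is X* = (1/(1+lam)) I: the diagonal
# stationarity condition -1/x + 1 + lam = 0 gives x = 1/(1+lam), and the
# off-diagonal condition |G_ij| <= lam holds with G_ij = 0 there.
lam = 0.1
X_opt = np.eye(4) / (1.0 + lam)
print(np.abs(min_norm_subgrad(X_opt, np.eye(4), lam)).max() < 1e-10)   # True
```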
50 Experimental results (scalability) (a) Scalability on random graph. (b) Scalability on chain graph. Figure: BigQUIC can solve one million dimensional problems.
51 Experimental results BigQUIC is faster even for medium size problems. Figure: Comparison on FMRI data with p = (maximum dimension that previous methods can handle).
52 Results on FMRI dataset 228,483 voxels, 518 time points. λ = 0.6 ⟹ average degree 8; BigQUIC took 5 hours. λ = 0.5 ⟹ average degree 38; BigQUIC took 21 hours. Findings: voxels with large degree were generally found in the gray matter; meaningful brain modules can be detected by modularity clustering.
53 Conclusions Memory-efficient quadratic approximation method for sparse inverse covariance estimation (BigQUIC). Key ingredients: a second order method; fixed/free set selection; a block coordinate descent solver for each subproblem; inexact computation of the Hessian with convergence analysis. BigQUIC can solve a 1-million dimensional problem in one day on a single 32-core machine. QUIC published in NIPS 2011: Sparse Inverse Covariance Matrix Estimation using Quadratic Approximation. BigQUIC to appear in NIPS 2013 (Oral): BIG & QUIC: Sparse inverse covariance estimation for a million variables.
54 References
[1] C. J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. NIPS (oral presentation), 2013.
[2] C. J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation using Quadratic Approximation. NIPS, 2011.
[3] C. J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-Like Methods for Sparse Inverse Covariance Estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An inexact interior point method for l1-regularized sparse covariance selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach. Machine Learning and Knowledge Discovery in Databases, 2010.
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationMATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.)
1/12 MATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.) Dominique Guillot Departments of Mathematical Sciences University of Delaware May 6, 2016
More informationLecture 2 Part 1 Optimization
Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss
More informationA Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression
A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationCME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.
CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationHigh-dimensional Covariance Estimation Based On Gaussian Graphical Models
High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix
More informationClassification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham
More informationRobust and sparse Gaussian graphical modelling under cell-wise contamination
Robust and sparse Gaussian graphical modelling under cell-wise contamination Shota Katayama 1, Hironori Fujisawa 2 and Mathias Drton 3 1 Tokyo Institute of Technology, Japan 2 The Institute of Statistical
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationIntroduction to Alternating Direction Method of Multipliers
Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationThe lasso: some novel algorithms and applications
1 The lasso: some novel algorithms and applications Newton Institute, June 25, 2008 Robert Tibshirani Stanford University Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak,
More informationAn Introduction to Graphical Lasso
An Introduction to Graphical Lasso Bo Chang Graphical Models Reading Group May 15, 2015 Bo Chang (UBC) Graphical Lasso May 15, 2015 1 / 16 Undirected Graphical Models An undirected graph, each vertex represents
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationBehavioral Data Mining. Lecture 7 Linear and Logistic Regression
Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression
More informationRobust estimation of scale and covariance with P n and its application to precision matrix estimation
Robust estimation of scale and covariance with P n and its application to precision matrix estimation Garth Tarr, Samuel Müller and Neville Weber USYD 2013 School of Mathematics and Statistics THE UNIVERSITY
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationThe graphical lasso: New insights and alternatives
Electronic Journal of Statistics Vol. 6 (2012) 2125 2149 ISSN: 1935-7524 DOI: 10.1214/12-EJS740 The graphical lasso: New insights and alternatives Rahul Mazumder Massachusetts Institute of Technology Cambridge,
More informationMATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models
1/13 MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models Dominique Guillot Departments of Mathematical Sciences University of Delaware May 4, 2016 Recall
More informationProximity-Based Anomaly Detection using Sparse Structure Learning
Proximity-Based Anomaly Detection using Sparse Structure Learning Tsuyoshi Idé (IBM Tokyo Research Lab) Aurelie C. Lozano, Naoki Abe, and Yan Liu (IBM T. J. Watson Research Center) 2009/04/ SDM 2009 /
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationFast Regularization Paths via Coordinate Descent
August 2008 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerry Friedman and Rob Tibshirani. August 2008 Trevor
More informationComputational statistics
Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationLearning Gaussian Graphical Models with Unknown Group Sparsity
Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationTopology identification via growing a Chow-Liu tree network
2018 IEEE Conference on Decision and Control (CDC) Miami Beach, FL, USA, Dec. 17-19, 2018 Topology identification via growing a Chow-Liu tree network Sepideh Hassan-Moghaddam and Mihailo R. Jovanović Abstract
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationDistributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology
More informationSub-Sampled Newton Methods
Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1
More informationIs the test error unbiased for these programs? 2017 Kevin Jamieson
Is the test error unbiased for these programs? 2017 Kevin Jamieson 1 Is the test error unbiased for this program? 2017 Kevin Jamieson 2 Simple Variable Selection LASSO: Sparse Regression Machine Learning
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationTechnical report no. SOL Department of Management Science and Engineering, Stanford University, Stanford, California, September 15, 2013.
PROXIMAL NEWTON-TYPE METHODS FOR MINIMIZING COMPOSITE FUNCTIONS JASON D. LEE, YUEKAI SUN, AND MICHAEL A. SAUNDERS Abstract. We generalize Newton-type methods for minimizing smooth functions to handle a
More informationLarge-scale Collaborative Ranking in Near-Linear Time
Large-scale Collaborative Ranking in Near-Linear Time Liwei Wu Depts of Statistics and Computer Science UC Davis KDD 17, Halifax, Canada August 13-17, 2017 Joint work with Cho-Jui Hsieh and James Sharpnack
More information