Sparse Inverse Covariance Estimation for a Million Variables
Cho-Jui Hsieh, The University of Texas at Austin
ICML workshop on Covariance Selection, Beijing, China, June 26, 2014
Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack, A. Banerjee, S. Becker, and P. Olsen
fMRI Brain Analysis
Goal: Reveal functional connections between regions of the brain (Sun et al., 2009; Smith et al., 2011; Varoquaux et al., 2010; Ng et al., 2011). p = 228,483 voxels. Figure from (Varoquaux et al., 2010).
Other Applications
Gene regulatory network discovery (Schafer & Strimmer, 2005; Andrei & Kendziorski, 2009; Menendez et al., 2010; Yin & Li, 2011).
Financial data analysis: model dependencies in multivariate time series (Xuan & Murphy, 2007); sparse high-dimensional models in economics (Fan et al., 2011).
Social network analysis / web data: model co-authorship networks (Goldenberg & Moore, 2005); model item-item similarity for recommender systems (Agarwal et al., 2011).
Climate data analysis (Chen et al., 2010).
Signal processing (Zhang & Fung, 2013).
Anomaly detection (Ide et al., 2009).
Time series prediction and classification (Wang et al., 2014).
L1-regularized inverse covariance selection
Goal: Estimate the inverse covariance matrix in the high-dimensional setting: p (# variables) ≫ n (# samples). Adding ℓ1 regularization means a sparse inverse covariance matrix is preferred. The ℓ1-regularized maximum likelihood estimator:

Σ̂⁻¹ = arg min_{X ≻ 0} { −log det X + tr(SX) [negative log likelihood] + λ‖X‖₁ } = arg min_{X ≻ 0} f(X),

where ‖X‖₁ = Σ_{i,j=1}^p |X_ij|. High-dimensional problems: the number of parameters scales quadratically with the number of variables.
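As a concrete illustration, the objective above can be evaluated directly. The sketch below (the helper names `neg_log_likelihood` and `objective` are our own, not from the talk) computes f(X) on a toy problem:

```python
import numpy as np

def neg_log_likelihood(X, S):
    """g(X) = -log det X + tr(S X), defined for X positive definite."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return -logdet + np.trace(S @ X)

def objective(X, S, lam):
    """f(X) = g(X) + lam * ||X||_1 (elementwise l1 norm)."""
    return neg_log_likelihood(X, S) + lam * np.abs(X).sum()

# Toy data: p = 3 variables, n = 50 samples.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 3))
S = np.cov(Y, rowvar=False, bias=True)   # empirical covariance
print(objective(np.eye(3), S, lam=0.1))  # value at a feasible start
```

At X = I with S = I and λ = 0, the value is simply tr(S) = p, which makes the formula easy to sanity-check by hand.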
Proximal Newton method
Split smooth and non-smooth terms: f(X) = g(X) + h(X), where g(X) = −log det X + tr(SX) and h(X) = λ‖X‖₁.
Form a quadratic approximation of g at X_t:

ḡ_{X_t}(Δ) = tr((S − W_t)Δ) + (1/2) vec(Δ)ᵀ (W_t ⊗ W_t) vec(Δ) − log det X_t + tr(S X_t),

where W_t = (X_t)⁻¹ = ∇_X log det(X) |_{X = X_t}.
Define the generalized Newton direction:

D_t = arg min_Δ ḡ_{X_t}(Δ) + λ‖X_t + Δ‖₁.
Algorithm: Proximal Newton method for sparse inverse covariance estimation
Input: Empirical covariance matrix S, scalar λ, initial X₀.
For t = 0, 1, ...
1. Compute the Newton direction: D_t = arg min_Δ ḡ_{X_t}(Δ) + λ‖X_t + Δ‖₁ (a Lasso problem).
2. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασδ_t, where δ_t is the predicted decrease. (Positive definiteness is checked via Cholesky factorization of X_t + αD_t.)
(Hsieh et al., 2011; Olsen et al., 2012; Lee et al., 2012; Dinh et al., 2013; Scheinberg et al., 2014)
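The line-search step can be sketched as follows. `armijo_step` is a hypothetical helper, and σ and the backtracking factor β are typical defaults rather than the values used in the talk:

```python
import numpy as np

def armijo_step(f, X, D, delta, sigma=0.25, beta=0.5, max_iter=30):
    """Backtracking line search: shrink alpha until X + alpha*D is
    positive definite (Cholesky succeeds) and the sufficient decrease
    condition f(X + alpha*D) <= f(X) + alpha*sigma*delta holds, where
    delta < 0 is the predicted decrease of the quadratic model."""
    fX = f(X)
    alpha = 1.0
    for _ in range(max_iter):
        X_new = X + alpha * D
        try:
            np.linalg.cholesky(X_new)    # raises if not positive definite
        except np.linalg.LinAlgError:
            alpha *= beta
            continue
        if f(X_new) <= fX + alpha * sigma * delta:
            return alpha, X_new
        alpha *= beta
    raise RuntimeError("line search did not converge")
```

For the Gaussian log-likelihood g alone with S = I, starting at X = 2I with direction D = −I (predicted decrease δ = tr((S − X⁻¹)D) = −1 for p = 2), the full step α = 1 is accepted.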
Scalability
How to apply the proximal Newton method to high-dimensional problems?
Number of parameters: too large (p²).
Hessian: cannot be explicitly formed (p² × p²).
Limited memory...
BigQUIC (2013): p = 1,000,000 (1 trillion parameters) in 22.9 hrs with 32 GB of memory (using a single machine with 32 cores).
Key ingredients:
Active variable selection — the complexity scales with the intrinsic dimensionality.
Block coordinate descent — handles problems where # parameters > memory size (Big & QUIC).
Other techniques: divide-and-conquer, efficient line search...
Extension to general dirty models.
Active Variable Selection
Motivation
We can directly implement a proximal Newton method; it takes 28 seconds on the ER dataset. Running Glasso takes only 25 seconds. Why, then, do people like to use the proximal Newton method?
Newton Iteration 1 ER dataset, p = 692, Newton iteration 1.
Newton Iteration 3 ER dataset, p = 692, Newton iteration 3.
Newton Iteration 5 ER dataset, p = 692, Newton iteration 5.
Newton Iteration 7
ER dataset, p = 692, Newton iteration 7. Only 2.7% of the variables have been changed to nonzero.
Active Variable Selection
Our goal: Reduce the number of variables from O(p²) to O(‖X‖₀). The solution usually has a small number of nonzeros: ‖X*‖₀ ≪ p².
Strategy: before solving for the Newton direction, guess the variables that should be updated.
Fixed set: S_fixed = {(i, j) : X_ij = 0 and |∇_ij g(X)| ≤ λ}.
The remaining variables constitute the free set. Solve the smaller subproblem:

D_t = arg min_{Δ : Δ_{S_fixed} = 0} ḡ_{X_t}(Δ) + λ‖X_t + Δ‖₁.
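A minimal sketch of the selection rule, assuming the gradient ∇g(X) = S − X⁻¹ has already been formed (the mask-based helper `split_fixed_free` is our naming, not QUIC's):

```python
import numpy as np

def split_fixed_free(X, grad_g, lam):
    """Boolean mask of free variables: X_ij != 0 or |grad_ij g(X)| > lam.
    The complementary fixed set is held at zero in the Newton subproblem."""
    return (X != 0) | (np.abs(grad_g) > lam)

# Example: at X = I the gradient of g is S - W with W = inv(X) = I.
S = np.array([[1.0, 0.6, 0.1],
              [0.6, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
X = np.eye(3)
grad_g = S - np.linalg.inv(X)
free = split_fixed_free(X, grad_g, lam=0.5)
print(free.sum(), "free variables out of", X.size)  # 5 free out of 9
```

Only the diagonal (nonzero in X) and the one large off-diagonal pair survive the screening; the Lasso subproblem is then solved over those entries alone.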
Active Variable Selection
Is this just a heuristic? No. The process is equivalent to a block coordinate descent algorithm with two blocks:
1. Update the fixed set → no change (by the definition of S_fixed).
2. Update the free set → solve the subproblem over the free variables.
Convergence is guaranteed.
Size of the free set
In practice, the size of the free set is small. Take the Hereditary dataset as an example: p = 1,869, so the number of variables is p² ≈ 3.49 million. The size of the free set drops to about 20,000 at the end.
Real datasets
(a) Time for Estrogen, p = 692. (b) Time for hereditarybc, p = 1,869.
Figure: Comparison of algorithms on real datasets. The results show QUIC converges faster than other methods.
Algorithm: QUIC (QUadratic approximation for sparse Inverse Covariance estimation)
Input: Empirical covariance matrix S, scalar λ, initial X₀.
For t = 0, 1, ...
1. Variable selection: select a free set of m ≪ p² variables.
2. Use coordinate descent to find the descent direction: D_t = arg min_Δ ḡ_{X_t}(Δ) + λ‖X_t + Δ‖₁ over the set of free variables (a Lasso problem).
3. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασδ_t. (Cholesky factorization of X_t + αD_t.)
Extensions
Also used in LIBLINEAR (Yuan et al., 2011) for ℓ1-regularized logistic regression.
Can be generalized to active subspace selection for the nuclear norm (Hsieh et al., 2014).
Can be generalized to general decomposable norms.
Efficient (block) Coordinate Descent with Limited Memory
Coordinate Descent Updates
Use coordinate descent to solve: arg min_D { ḡ_X(D) + λ‖X + D‖₁ }.
The current gradient is ∇ḡ_X(D) = WDW + S − W (in vectorized form, (W ⊗ W) vec(D) + vec(S − W)), where W = X⁻¹.
Each coordinate descent update has a closed form: D_ij ← D_ij − c + S(c − b/a, λ/a), where S(z, r) = sign(z) max{|z| − r, 0} is the soft-thresholding function, a = W_ij² + W_ii W_jj, b = S_ij − W_ij + w_iᵀ D w_j, and c = X_ij + D_ij.
The main cost is in computing w_iᵀ D w_j. Instead, we maintain U = DW after each coordinate update and compute w_iᵀ u_j: only O(p) flops per update.
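The closed-form update and the U = DW bookkeeping can be sketched as below (helper names are ours; D_ij and D_ji are updated symmetrically, as in QUIC):

```python
import numpy as np

def soft_threshold(z, r):
    """S(z, r) = sign(z) * max(|z| - r, 0)."""
    return np.sign(z) * max(abs(z) - r, 0.0)

def cd_update(D, U, W, S, X, i, j, lam):
    """One closed-form coordinate-descent step on entry (i, j) of the
    Lasso subproblem; D_ji is updated symmetrically.  The invariant
    U = D @ W is maintained so w_i^T D w_j = W[i] @ U[:, j] costs O(p)."""
    a = W[i, j] ** 2 + W[i, i] * W[j, j]
    b = S[i, j] - W[i, j] + W[i] @ U[:, j]
    c = X[i, j] + D[i, j]
    mu = -c + soft_threshold(c - b / a, lam / a)
    if mu != 0.0:
        D[i, j] += mu
        U[i, :] += mu * W[j, :]        # row i of D changed
        if i != j:
            D[j, i] += mu
            U[j, :] += mu * W[i, :]    # row j of D changed

# One update at (0, 1) starting from D = 0, with X = W = I.
W = np.eye(2); X = np.eye(2)
S = np.array([[1.0, 0.8], [0.8, 1.0]])
D = np.zeros((2, 2)); U = np.zeros((2, 2))
cd_update(D, U, W, S, X, 0, 1, lam=0.1)
```

With a = 1, b = 0.8, c = 0, the step is μ = S(−0.8, 0.1) = −0.7, and the invariant U = DW survives the update.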
Difficulties in Scaling QUIC
Consider the case where p ≈ 1 million and m = ‖X_t‖₀ ≈ 50 million. Coordinate descent requires X_t and W = X_t⁻¹:
needs O(p²) storage;
needs O(mp) computation per sweep, where m = ‖X_t‖₀.
Coordinate Updates with Memory Cache
Assume we can store M columns of W in memory. A coordinate descent update at (i, j) must compute w_iᵀ D w_j. If w_i or w_j is not in memory, recompute it by conjugate gradient, solving X w_i = e_i: O(T_CG) time.
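A minimal conjugate-gradient sketch for recomputing a missing column (`cg_column` is a hypothetical name; a production code would use a preconditioned, sparse implementation):

```python
import numpy as np

def cg_column(X, i, tol=1e-10, max_iter=1000):
    """Recompute column i of W = X^{-1} by conjugate gradient on
    X w = e_i; only matrix-vector products with X are needed, so a
    sparse X never has to be inverted or stored densely."""
    p = X.shape[0]
    b = np.zeros(p); b[i] = 1.0
    w = np.zeros(p)
    r = b.copy()                 # residual b - X @ w with w = 0
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Xd = X @ d
        alpha = rs / (d @ Xd)
        w += alpha * d
        r -= alpha * Xd
        rs_new = r @ r
        d = r + (rs_new / rs) * d
        rs = rs_new
    return w
```

Since X is positive definite, CG converges in at most p iterations in exact arithmetic, and far fewer in practice when X is well conditioned.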
Coordinate Updates with Memory Cache w 1, w 2, w 3, w 4 stored in memory.
Coordinate Updates with Memory Cache Cache hit, do not need to recompute w i, w j.
Coordinate Updates with Memory Cache Cache miss, recompute w i, w j.
Coordinate updates: the ideal case
We want to find an update sequence that minimizes the number of cache misses: this is probably NP-hard. Our strategy: update variables block by block. In the ideal case, there exists a partition {S₁, ..., S_k} such that all free sets are in diagonal blocks: then only p column evaluations are required.
General case: block diagonal + sparse
If the block partition is not perfect, the extra column computations can be characterized by boundary nodes. Given a partition {S₁, ..., S_k}, we define the boundary nodes of S_q as

B(S_q) := {j : j ∈ S_q and ∃ i ∈ S_z, z ≠ q, such that F_ij = 1},

where F is the adjacency matrix of the free set.
Graph Clustering Algorithm
The number of columns to be computed in one sweep is p + Σ_q |B(S_q)|, which can be upper bounded by

p + Σ_q |B(S_q)| ≤ p + Σ_q Σ_{z ≠ q} Σ_{i ∈ S_z, j ∈ S_q} F_ij.

Use graph clustering (METIS or Graclus) to find the partition. Example: on the fMRI dataset (p = 0.228 million) with 20 blocks, a random partition needs 1.6 million column computations; graph clustering needs only 0.237 million column computations.
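The boundary-node count for a given partition can be computed directly from the free-set adjacency matrix; the sketch below (`boundary_counts` is our own helper) mirrors the definition of B(S_q):

```python
import numpy as np

def boundary_counts(F, blocks):
    """|B(S_q)| for each block: nodes j in S_q with a free-set neighbor
    outside S_q.  Columns computed per sweep = p + sum of these counts."""
    p = F.shape[0]
    label = np.empty(p, dtype=int)
    for q, S_q in enumerate(blocks):
        label[np.asarray(S_q)] = q
    counts = []
    for q, S_q in enumerate(blocks):
        cnt = sum(1 for j in S_q
                  if any(label[i] != q for i in np.nonzero(F[:, j])[0]))
        counts.append(cnt)
    return counts

# Chain graph 0-1-2-3 split into two blocks: one boundary node per block.
F = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    F[i, j] = F[j, i] = 1
print(boundary_counts(F, [[0, 1], [2, 3]]))  # [1, 1] -> 4 + 2 columns
```

A good clustering is exactly one that makes these counts small, which is why METIS/Graclus partitions beat random ones so decisively in the fMRI example.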
BigQUIC
Block coordinate descent with clustering:
storage: reduced from O(p²) to O(m + p²/k);
computation: O(mp) per sweep (unchanged), where m = ‖X_t‖₀.
Algorithm: BigQUIC
Input: Samples Y, scalar λ, initial X₀.
For t = 0, 1, ...
1. Variable selection: select a free set of m ≪ p² variables.
2. Construct a partition by clustering.
3. Run block coordinate descent to find the descent direction: D_t = arg min_Δ ḡ_{X_t}(Δ) + λ‖X_t + Δ‖₁ over the set of free variables.
4. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασδ_t. (Positive definiteness is checked via the Schur complement with the conjugate gradient method.)
Convergence Analysis
Convergence Guarantees for QUIC
Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS 2011). QUIC converges quadratically to the global optimum; that is, for some constant 0 < κ < 1:

lim_{t → ∞} ‖X_{t+1} − X*‖_F / ‖X_t − X*‖_F² = κ.

Other convergence proofs for proximal Newton: (Lee et al., 2012) discuss convergence for general loss functions and regularizers; (Katyas et al., 2014) discuss the global convergence rate.
BigQUIC Convergence Analysis
Recall W = X⁻¹. When each w_i is computed by CG (solving X w_i = e_i):
The gradient ∇_ij g(X) = S_ij − W_ij on the free set can be computed once and stored in memory.
The Hessian (w_iᵀ D w_j in coordinate updates) needs to be repeatedly computed. To reduce the time overhead, the Hessian should be computed approximately.
Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013). If ∇g(X) is computed exactly and ηI ⪯ Ĥ_t ⪯ η̄I for some constants η̄ ≥ η > 0 at every Newton iteration, then BigQUIC converges to the global optimum.
BigQUIC Convergence Analysis
To achieve a super-linear convergence rate, we require the Hessian computation to become more and more accurate as X_t approaches X*.
Measure the accuracy of the Hessian computation by R = [r₁, ..., r_p], where r_i = X w_i − e_i.
The optimality of X_t can be measured by the minimum-norm subgradient:

∇ˢ_ij f(X) = ∇_ij g(X) + sign(X_ij) λ,  if X_ij ≠ 0;
∇ˢ_ij f(X) = sign(∇_ij g(X)) max(|∇_ij g(X)| − λ, 0),  if X_ij = 0.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013). In BigQUIC, if the residual matrix satisfies ‖R_t‖ = O(‖∇ˢ f(X_t)‖ˡ) for some 0 < l ≤ 1 as t → ∞, then ‖X_{t+1} − X*‖ = O(‖X_t − X*‖^{1+l}) as t → ∞.
When ‖R_t‖ = O(‖∇ˢ f(X_t)‖), BigQUIC has an asymptotic quadratic convergence rate.
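The optimality measure defined above translates directly into code; a short sketch (helper name assumed):

```python
import numpy as np

def min_norm_subgrad(X, grad_g, lam):
    """Minimum-norm subgradient grad^S f(X) of f = g + lam * ||.||_1.
    It vanishes exactly at the optimum, so its norm measures optimality."""
    active = grad_g + lam * np.sign(X)                            # X_ij != 0
    inactive = np.sign(grad_g) * np.maximum(np.abs(grad_g) - lam, 0.0)
    return np.where(X != 0, active, inactive)
```

For example, at X = I with gradient ∇g(X) = diag(−λ, −λ) plus off-diagonals smaller than λ, the measure is exactly zero, i.e. X satisfies the first-order optimality conditions.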
Experimental results (scalability)
(a) Scalability on random graph. (b) Scalability on chain graph.
Figure: BigQUIC can solve one-million-dimensional problems.
Experimental results
BigQUIC is faster even for medium-size problems.
Figure: Comparison on fMRI data with a p = 20,000 subset (the maximum dimension that previous methods can handle).
Results on fMRI dataset
228,483 voxels, 518 time points.
λ = 0.6 ⇒ average degree 8; BigQUIC took 5 hours.
λ = 0.5 ⇒ average degree 38; BigQUIC took 21 hours.
Findings: voxels with large degree were generally found in the gray matter; meaningful brain modules can be detected by modularity clustering.
Quic & Dirty: A Quadratic Approximation Approach for Dirty Statistical Models
Dirty Statistical Models
Superposition-structured models or "dirty" models (Yang et al., 2013):

min_{ {θ^(r)}_{r=1}^k } L(Σ_r θ^(r)) + Σ_r λ_r ‖θ^(r)‖_{A_r} := F(θ).

An example: GMRF with latent features (Chandrasekaran et al., 2012):

min_{S, L : L ⪰ 0, S − L ≻ 0} −log det(S − L) + ⟨S − L, Σ⟩ + λ_S ‖S‖₁ + λ_L ‖L‖_*.

When there are two groups of random variables X, Y, write Θ = [Θ_XX, Θ_XY; Θ_YX, Θ_YY]. If we only observe X and Y is the set of hidden variables, the marginal precision matrix is Θ_XX − Θ_XY Θ_YY⁻¹ Θ_YX, and therefore has a sparse + low-rank structure.
Error bound: (Meng et al., 2014). Other applications: robust PCA (low rank + sparse), multitask learning (sparse + group sparse), ...
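The sparse + low-rank structure of the marginal precision matrix can be checked numerically. The following sketch verifies the Schur-complement identity on a random joint precision matrix with 3 observed and 2 hidden variables (dimensions chosen for illustration only):

```python
import numpy as np

# Random positive definite joint precision over 3 observed (X) and
# 2 hidden (Y) variables.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Theta = A @ A.T + 5 * np.eye(5)
Txx, Txy = Theta[:3, :3], Theta[:3, 3:]
Tyx, Tyy = Theta[3:, :3], Theta[3:, 3:]

# Marginal precision of X: the Schur complement of Theta_YY.
marg_prec = Txx - Txy @ np.linalg.solve(Tyy, Tyx)

# It matches the inverse of the X-block of the joint covariance,
Sigma = np.linalg.inv(Theta)
assert np.allclose(marg_prec, np.linalg.inv(Sigma[:3, :3]))
# and the correction term has rank at most the number of hidden variables.
assert np.linalg.matrix_rank(Txy @ np.linalg.solve(Tyy, Tyx)) <= 2
```

If Θ_XX is itself sparse, the marginal precision is exactly "sparse minus low rank", which is what the S − L decomposition recovers.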
Quadratic Approximation
The Newton direction d:

[d^(1), ..., d^(k)] = arg min_{Δ^(1), ..., Δ^(k)} ḡ(θ + Δ) + Σ_{r=1}^k λ_r ‖θ^(r) + Δ^(r)‖_{A_r},

where

ḡ(θ + Δ) = g(θ) + ⟨Σ_{r=1}^k Δ^(r), G⟩ + (1/2) Δᵀ H Δ,

and H := ∇²g(θ) is the k × k block matrix whose blocks all equal H̄ := ∇²L(θ̄):

H = [H̄ ⋯ H̄; ⋮ ⋱ ⋮; H̄ ⋯ H̄].
Quadratic Approximation
A general active subspace selection for decomposable norms:
ℓ1 norm: variable selection.
Nuclear norm: subspace selection.
Group sparse norm: group selection.
Method for solving the quadratic approximation subproblem: alternating minimization.
Convergence analysis: the Hessian is not positive definite, yet we still obtain quadratic convergence.
Experimental Comparisons
(a) ER dataset. (b) Leukemia dataset.
Figure: Comparison on gene expression data (GMRF with latent variables).
Conclusions
How to scale the proximal Newton method to high-dimensional problems:
Active variable selection — the complexity scales with the intrinsic dimensionality.
Block coordinate descent — handles problems where # parameters > memory size (Big & QUIC).
Other techniques: divide-and-conquer, efficient line search...
BigQUIC: a memory-efficient quadratic approximation method for sparse inverse covariance estimation (scales to 1 million by 1 million problems).
With careful design, we can apply the proximal Newton method to models with superposition-structured regularization (sparse + low rank; sparse + group sparse; low rank + group sparse; ...).
References
[1] C.-J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. NIPS (oral presentation), 2013.
[2] C.-J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation. NIPS, 2011.
[3] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-Like Methods for Sparse Inverse Covariance Estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An Inexact Interior Point Method for L1-Regularized Sparse Covariance Selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse Inverse Covariance Selection via Alternating Linearization Methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning Sparse Gaussian Markov Networks Using a Greedy Coordinate Ascent Approach. Machine Learning and Knowledge Discovery in Databases, 2010.