BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
Cho-Jui Hsieh, The University of Texas at Austin
ICML Workshop on Covariance Selection, Beijing, China, June 26, 2014
Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack, A. Banerjee, S. Becker, and P. Olsen

FMRI Brain Analysis. Goal: reveal functional connections between regions of the brain (Sun et al., 2009; Smith et al., 2011; Varoquaux et al., 2010; Ng et al., 2011). p = 228,483 voxels. Figure from (Varoquaux et al., 2010).

Other Applications. Gene regulatory network discovery (Schafer & Strimmer, 2005; Andrei & Kendziorski, 2009; Menendez et al., 2010; Yin & Li, 2011). Financial data analysis: model dependencies in multivariate time series (Xuan & Murphy, 2007); sparse high-dimensional models in economics (Fan et al., 2011). Social network analysis / web data: model co-authorship networks (Goldenberg & Moore, 2005); model item-item similarity for recommender systems (Agarwal et al., 2011). Climate data analysis (Chen et al., 2010). Signal processing (Zhang & Fung, 2013). Anomaly detection (Ide et al., 2009). Time series prediction and classification (Wang et al., 2014).

L1-regularized inverse covariance selection. Goal: estimate the inverse covariance matrix in the high-dimensional setting, p (# variables) ≫ n (# samples). Adding ℓ1 regularization encourages a sparse inverse covariance matrix. The ℓ1-regularized maximum likelihood estimator:
Σ̂⁻¹ = arg min_{X ≻ 0} { −log det X + tr(SX) + λ‖X‖_1 } = arg min_{X ≻ 0} f(X),
where −log det X + tr(SX) is the negative log likelihood and ‖X‖_1 = Σ_{i,j} |X_ij|. This is a high-dimensional problem: the number of parameters scales quadratically with the number of variables.
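For concreteness, here is a minimal NumPy sketch (not the authors' code; the synthetic data, the λ value, and the function name are illustrative) that evaluates this objective f(X) on a small problem:

```python
import numpy as np

def neg_log_likelihood_l1(X, S, lam):
    """f(X) = -log det X + tr(S X) + lam * ||X||_1 (elementwise l1), for X positive definite."""
    sign, logdet = np.linalg.slogdet(X)
    if sign <= 0:
        return np.inf                              # outside the positive-definite cone
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

rng = np.random.default_rng(0)
p, n = 50, 20                                      # high-dimensional regime: p > n
Y = rng.standard_normal((n, p))
S_hat = np.cov(Y, rowvar=False, bias=True)         # empirical covariance (p x p)
print(neg_log_likelihood_l1(np.eye(p), S_hat, lam=0.1))
```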

Proximal Newton method. Split smooth and non-smooth terms: f(X) = g(X) + h(X), where g(X) = −log det X + tr(SX) and h(X) = λ‖X‖_1. Form a quadratic approximation of g around X_t:
ḡ_{X_t}(Δ) = tr((S − W_t)Δ) + (1/2) vec(Δ)ᵀ (W_t ⊗ W_t) vec(Δ) − log det X_t + tr(SX_t),
where W_t = X_t⁻¹ = −∇_X log det(X)|_{X = X_t}. Define the generalized Newton direction:
D_t = arg min_Δ ḡ_{X_t}(Δ) + λ‖X_t + Δ‖_1.

Algorithm: Proximal Newton method for sparse inverse covariance estimation.
Input: empirical covariance matrix S, scalar λ, initial X_0. For t = 0, 1, ...
1. Compute the Newton direction: D_t = arg min_Δ f̄_{X_t}(X_t + Δ) (a Lasso problem).
2. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite (checked via a Cholesky factorization of X_t + αD_t) and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασδ_t.
(Hsieh et al., 2011; Olsen et al., 2012; Lee et al., 2012; Dinh et al., 2013; Scheinberg et al., 2014)
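As a rough illustration of step 2, the sketch below (assumptions: f is the full objective, delta_t is the predicted decrease of the quadratic model, and the constants sigma and beta are illustrative) backtracks on α, using a Cholesky factorization as the positive-definiteness test:

```python
import numpy as np

def armijo_step(X, D, f, delta_t, sigma=1e-4, beta=0.5, max_backtracks=30):
    """Return alpha such that X + alpha*D is positive definite and
    f(X + alpha*D) <= f(X) + alpha*sigma*delta_t (delta_t < 0 is the predicted decrease)."""
    fX = f(X)
    alpha = 1.0
    for _ in range(max_backtracks):
        X_new = X + alpha * D
        try:
            np.linalg.cholesky(X_new)      # raises LinAlgError if X_new is not positive definite
        except np.linalg.LinAlgError:
            alpha *= beta
            continue
        if f(X_new) <= fX + alpha * sigma * delta_t:
            return alpha
        alpha *= beta
    raise RuntimeError("line search failed")
```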

Scalability. How can we apply the proximal Newton method to high-dimensional problems? The number of parameters is too large (p²), the Hessian cannot be explicitly formed (p² × p²), and memory is limited... BigQUIC (2013): p = 1,000,000 (1 trillion parameters) in 22.9 hours with 32 GB of memory, using a single machine with 32 cores. Key ingredients: active variable selection (the complexity scales with the intrinsic dimensionality) and block coordinate descent (handles problems where # parameters > memory size; Big & Quic). Other techniques: divide-and-conquer, efficient line search... Extension to general dirty models.

Active Variable Selection

Motivation. We can directly implement a proximal Newton method; it takes 28 seconds on the ER dataset. Running Glasso takes only 25 seconds. Why, then, do people like to use the proximal Newton method?

Newton Iteration 1 ER dataset, p = 692, Newton iteration 1.

Newton Iteration 3 ER dataset, p = 692, Newton iteration 3.

Newton Iteration 5 ER dataset, p = 692, Newton iteration 5.

Newton Iteration 7. ER dataset, p = 692, Newton iteration 7. Only 2.7% of the variables have been changed to nonzero.

Active Variable Selection. Our goal: reduce the number of variables from O(p²) to O(‖X*‖_0). The solution usually has a small number of nonzeros: ‖X*‖_0 ≪ p². Strategy: before solving for the Newton direction, guess which variables should be updated. Fixed set: S_fixed = {(i, j) : X_ij = 0 and |∇_ij g(X)| ≤ λ}. The remaining variables constitute the free set. Solve the smaller subproblem:
D_t = arg min_{Δ : Δ_{S_fixed} = 0} ḡ_{X_t}(Δ) + λ‖X_t + Δ‖_1.
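A minimal NumPy sketch of this selection rule (dense and illustrative; the function name is hypothetical), using ∇g(X) = S − X⁻¹:

```python
import numpy as np

def free_set_mask(X, S, lam):
    """Boolean mask of free variables; everything else is fixed at zero for this Newton step."""
    W = np.linalg.inv(X)                   # grad g(X) = S - X^{-1}
    G = S - W
    fixed = (X == 0) & (np.abs(G) <= lam)
    return ~fixed
```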

Active Variable Selection. Is this just a heuristic? No: the process is equivalent to a block coordinate descent algorithm with two blocks. (1) Updating the fixed set produces no change (by the definition of S_fixed). (2) Updating the free set amounts to solving the subproblem. Convergence is therefore guaranteed.

Size of free set. In practice, the size of the free set is small. Take the hereditarybc dataset as an example: p = 1,869, so the number of variables is p² ≈ 3.49 million. The size of the free set drops to about 20,000 by the end.

Real datasets. (a) Time for Estrogen, p = 692. (b) Time for hereditarybc, p = 1,869.
Figure: Comparison of algorithms on real datasets. The results show that QUIC converges faster than the other methods.

Algorithm: QUIC (QUadratic approximation for sparse Inverse Covariance estimation).
Input: empirical covariance matrix S, scalar λ, initial X_0. For t = 0, 1, ...
1. Variable selection: select a free set of m ≪ p² variables.
2. Use coordinate descent to find the descent direction D_t = arg min_Δ f̄_{X_t}(X_t + Δ) over the set of free variables (a Lasso problem).
3. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite (Cholesky factorization of X_t + αD_t) and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασδ_t.

Extensions. The same idea is used in LIBLINEAR (Yuan et al., 2011) for ℓ1-regularized logistic regression. It can be generalized to active subspace selection for the nuclear norm (Hsieh et al., 2014), and further to general decomposable norms.

Efficient (block) Coordinate Descent with Limited Memory

Coordinate Descent Updates. Use coordinate descent to solve arg min_D { ḡ_X(D) + λ‖X + D‖_1 }. The current gradient is ∇ḡ_X(D) = (W ⊗ W) vec(D) + vec(S − W), i.e., in matrix form WDW + S − W, where W = X⁻¹. Each coordinate update has a closed-form solution: D_ij ← D_ij + μ with
μ = −c + S(c − b/a, λ/a),
where S(z, r) = sign(z) max{|z| − r, 0} is the soft-thresholding function, a = W_ij² + W_ii W_jj, b = S_ij − W_ij + w_iᵀ D w_j, and c = X_ij + D_ij. The main cost is computing w_iᵀ D w_j. Instead of recomputing it, we maintain U = DW after each coordinate update and compute w_iᵀ u_j in only O(p) flops per update.
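The following sketch (dense NumPy, illustrative only; free_pairs, cd_sweep, and soft_threshold are hypothetical names, and diagonal updates are omitted) performs one such sweep over off-diagonal free variables while maintaining U = DW:

```python
import numpy as np

def soft_threshold(z, r):
    return np.sign(z) * max(abs(z) - r, 0.0)

def cd_sweep(X, W, S, lam, D, U, free_pairs):
    """One sweep over off-diagonal free pairs (i < j); diagonal updates are analogous."""
    for i, j in free_pairs:
        a = W[i, j] ** 2 + W[i, i] * W[j, j]
        b = S[i, j] - W[i, j] + W[i, :] @ U[:, j]   # w_i^T D w_j = w_i^T u_j, O(p) per update
        c = X[i, j] + D[i, j]
        mu = -c + soft_threshold(c - b / a, lam / a)
        if mu != 0.0:
            D[i, j] += mu
            D[j, i] += mu                           # keep D symmetric
            U[i, :] += mu * W[j, :]                 # maintain U = D W
            U[j, :] += mu * W[i, :]
    return D, U
```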

Difficulties in Scaling QUIC. Consider the case where p ≈ 1 million and m = ‖X_t‖_0 ≈ 50 million. Coordinate descent requires X_t and W = X_t⁻¹: it needs O(p²) storage and O(mp) computation per sweep, where m = ‖X_t‖_0.

Coordinate Updates with Memory Cache. Assume we can store M columns of W in memory. A coordinate descent update at (i, j) must compute w_iᵀ D w_j. If w_i or w_j is not in memory, recompute it by conjugate gradient (solve X w_i = e_i), which takes O(T_CG) time.
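On a cache miss, a column of W can be recovered by a conjugate-gradient solve, which only needs matrix-vector products with the sparse X. A minimal SciPy sketch (the function name is hypothetical; the tolerance is left at SciPy's default):

```python
import numpy as np
from scipy.sparse.linalg import cg

def column_of_inverse(X_sparse, i):
    """Solve X w_i = e_i by conjugate gradient; X must be symmetric positive definite."""
    p = X_sparse.shape[0]
    e_i = np.zeros(p)
    e_i[i] = 1.0
    w_i, info = cg(X_sparse, e_i)          # tighten the tolerance near the optimum if needed
    if info != 0:
        raise RuntimeError("CG did not converge")
    return w_i
```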

Coordinate Updates with Memory Cache. w_1, w_2, w_3, w_4 stored in memory.

Coordinate Updates with Memory Cache. Cache hit: no need to recompute w_i, w_j.

Coordinate Updates with Memory Cache. Cache miss: recompute w_i, w_j.

Coordinate Updates: the ideal case. We would like an update sequence that minimizes the number of cache misses, but finding one is probably NP-hard. Our strategy: update variables block by block. In the ideal case there exists a partition {S_1, ..., S_k} such that the free set lies entirely within the diagonal blocks; then only p column evaluations are required.

General case: block diagonal + sparse. If the block partition is not perfect, the extra column computations can be characterized by boundary nodes. Given a partition {S_1, ..., S_k}, we define the boundary nodes of S_q as
B(S_q) ≡ {j : j ∈ S_q and ∃ i ∈ S_z, z ≠ q, such that F_ij = 1},
where F is the adjacency matrix of the free set.

Graph Clustering Algorithm. The number of columns to be computed in one sweep is p + Σ_q |B(S_q)|, which can be upper bounded by
p + Σ_q |B(S_q)| ≤ p + Σ_q Σ_{z ≠ q} Σ_{i ∈ S_z, j ∈ S_q} F_ij.
Use graph clustering (METIS or Graclus) to find the partition. Example: on the fMRI dataset (p = 0.228 million) with 20 blocks, a random partition needs 1.6 million column computations, while graph clustering needs only 0.237 million.
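A small NumPy sketch (hypothetical function; inputs are a boolean adjacency F of the free set and a list of index blocks) that counts the column evaluations one sweep would need under a given partition:

```python
import numpy as np

def column_evaluations(F, blocks):
    """F: p x p boolean adjacency of the free set; blocks: list of integer index arrays."""
    p = F.shape[0]
    total = p                                   # every within-block column is computed once
    for Sq in blocks:
        outside = np.setdiff1d(np.arange(p), Sq)
        # boundary columns: j in S_q with F[i, j] = 1 for some i outside S_q
        boundary = np.any(F[np.ix_(outside, Sq)], axis=0).sum()
        total += boundary
    return total

# With METIS/Graclus-style clustering the boundary terms are small; a random
# partition can be far worse (1.6M vs 0.237M column computations on fMRI).
```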

BigQUIC: block coordinate descent with clustering. Storage: O(p²) → O(m + p²/k). Computation per sweep: O(mp) → O(mp) (unchanged), where m = ‖X_t‖_0.

Algorithm: BigQUIC.
Input: samples Y, scalar λ, initial X_0. For t = 0, 1, ...
1. Variable selection: select a free set of m ≪ p² variables.
2. Construct a partition by clustering.
3. Run block coordinate descent to find the descent direction D_t = arg min_Δ f̄_{X_t}(X_t + Δ) over the set of free variables.
4. Line search: use an Armijo-rule based step-size selection to find α such that X_{t+1} = X_t + αD_t is positive definite and satisfies the sufficient decrease condition f(X_t + αD_t) ≤ f(X_t) + ασδ_t (positive definiteness is checked via a Schur complement with the conjugate gradient method).

Convergence Analysis

Convergence Guarantees for QUIC.
Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS 2011). QUIC converges quadratically to the global optimum; that is, for some constant 0 < κ < 1,
lim_{t→∞} ‖X_{t+1} − X*‖_F / ‖X_t − X*‖_F² = κ.
Other convergence proofs for the proximal Newton method: (Lee et al., 2012) discuss convergence for general loss functions and regularizers; (Katyas et al., 2014) discuss the global convergence rate.

BigQUIC Convergence Analysis. Recall W = X⁻¹. When each w_i is computed by CG (solving X w_i = e_i): the gradient ∇_ij g(X) = S_ij − W_ij on the free set can be computed once and stored in memory, but the Hessian terms (w_iᵀ D w_j in the coordinate updates) must be recomputed repeatedly. To reduce the time overhead, the Hessian should be computed approximately.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013). If ∇g(X) is computed exactly and ηI ⪯ Ĥ_t ⪯ η̄I for some constants η̄ ≥ η > 0 at every Newton iteration, then BigQUIC converges to the global optimum.

BigQUIC Convergence Analysis. To obtain a super-linear convergence rate, the Hessian computation must become more and more accurate as X_t approaches X*. Measure the accuracy of the Hessian computation by R = [r_1, ..., r_p], where r_i = X w_i − e_i. The optimality of X_t can be measured by the minimum-norm subgradient
∇^S_ij f(X) = ∇_ij g(X) + sign(X_ij) λ,  if X_ij ≠ 0,
∇^S_ij f(X) = sign(∇_ij g(X)) max(|∇_ij g(X)| − λ, 0),  if X_ij = 0.

Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013). In BigQUIC, if the residual matrix satisfies ‖R_t‖ = O(‖∇^S f(X_t)‖^l) for some 0 < l ≤ 1 as t → ∞, then ‖X_{t+1} − X*‖ = O(‖X_t − X*‖^{1+l}) as t → ∞. In particular, when ‖R_t‖ = O(‖∇^S f(X_t)‖), BigQUIC has an asymptotic quadratic convergence rate.
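The optimality measure ∇^S f(X) above has a direct entrywise implementation; a minimal NumPy sketch (hypothetical function name, with G = ∇g(X) = S − X⁻¹ assumed precomputed):

```python
import numpy as np

def min_norm_subgradient(X, G, lam):
    """Minimum-norm subgradient of f = g + lam*||.||_1; X is optimal iff this matrix is zero."""
    nonzero = X != 0
    return np.where(nonzero,
                    G + np.sign(X) * lam,
                    np.sign(G) * np.maximum(np.abs(G) - lam, 0.0))
```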

Experimental results (scalability). (a) Scalability on a random graph. (b) Scalability on a chain graph.
Figure: BigQUIC can solve one-million-dimensional problems.

Experimental results. BigQUIC is faster even for medium-sized problems.
Figure: Comparison on fMRI data with a p = 20,000 subset (the maximum dimension that previous methods can handle).

Results on the fMRI dataset: 228,483 voxels, 518 time points. λ = 0.6 ⇒ average degree 8; BigQUIC took 5 hours. λ = 0.5 ⇒ average degree 38; BigQUIC took 21 hours. Findings: voxels with large degree were generally found in the gray matter, and meaningful brain modules can be detected by modularity clustering.

Quic & Dirty: A Quadratic Approximation Approach for Dirty Statistical Models

Dirty Statistical Models. Superposition-structured models, or dirty models (Yang et al., 2013):
min_{{θ^(r)}_{r=1}^k} L(Σ_r θ^(r)) + Σ_r λ_r ‖θ^(r)‖_{A_r} =: F(θ).
An example: GMRF with latent features (Chandrasekaran et al., 2012):
min_{S, L : L ⪰ 0, S − L ≻ 0} −log det(S − L) + ⟨S − L, Σ⟩ + λ_S ‖S‖_1 + λ_L ‖L‖_*.
When there are two groups of random variables X and Y, write Θ = [Θ_XX, Θ_XY; Θ_YX, Θ_YY]. If we only observe X and Y is the set of hidden variables, the marginal precision matrix is Θ_XX − Θ_XY Θ_YY⁻¹ Θ_YX, and therefore has a low rank + sparse structure. Error bound: (Meng et al., 2014). Other applications: robust PCA (low rank + sparse), multitask learning (sparse + group sparse), ...
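To make the latent-GMRF example concrete, here is a minimal NumPy sketch (hypothetical function and argument names; feasibility is checked crudely) that evaluates its objective:

```python
import numpy as np

def latent_gmrf_objective(S_mat, L_mat, Sigma_hat, lam_S, lam_L):
    """-log det(S - L) + <S - L, Sigma> + lam_S*||S||_1 + lam_L*||L||_*, with L PSD and S - L PD."""
    R = S_mat - L_mat
    eigs_L = np.linalg.eigvalsh(L_mat)
    try:
        np.linalg.cholesky(R)                      # S - L must be positive definite
    except np.linalg.LinAlgError:
        return np.inf
    if eigs_L[0] < -1e-12:                         # L must be positive semidefinite
        return np.inf
    _, logdet = np.linalg.slogdet(R)
    nuclear = eigs_L.sum()                         # nuclear norm of a PSD matrix = its trace
    return -logdet + np.trace(Sigma_hat @ R) + lam_S * np.abs(S_mat).sum() + lam_L * nuclear
```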

Quadratic Approximation. The Newton direction d = [d^(1), ..., d^(k)]:
[d^(1), ..., d^(k)] = arg min_{Δ^(1), ..., Δ^(k)} ḡ(θ + Δ) + Σ_{r=1}^k λ_r ‖θ^(r) + Δ^(r)‖_{A_r},
where
ḡ(θ + Δ) = g(θ) + Σ_{r=1}^k ⟨Δ^(r), G⟩ + (1/2) Δᵀ H Δ,
and H := ∇² g(θ) is the k × k block matrix whose blocks all equal H̄ := ∇² L(θ̄), with θ̄ = Σ_r θ^(r).
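The block structure of H follows from g(θ^(1), ..., θ^(k)) = L(Σ_r θ^(r)): the quadratic term only depends on the sum of the components. A tiny NumPy check (illustrative sizes, random symmetric H̄ as a stand-in for ∇²L):

```python
import numpy as np

k, d = 2, 4                                   # two parameter components of size d
H_bar = np.random.default_rng(0).standard_normal((d, d))
H_bar = H_bar @ H_bar.T                       # symmetric PSD stand-in for the Hessian of L
H = np.kron(np.ones((k, k)), H_bar)           # k x k block matrix with identical blocks

deltas = [np.ones(d), 2 * np.ones(d)]
stacked = np.concatenate(deltas)
lhs = 0.5 * stacked @ H @ stacked             # quadratic term on the stacked direction
rhs = 0.5 * sum(deltas) @ H_bar @ sum(deltas) # same value, computed from the sum alone
print(np.allclose(lhs, rhs))                  # True
```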

Quadratic Approximation. A general active subspace selection for decomposable norms: ℓ1 norm → variable selection; nuclear norm → subspace selection; group sparse norm → group selection. Method for solving the quadratic approximation subproblem: alternating minimization. Convergence analysis: even though the Hessian is not positive definite, we still obtain quadratic convergence.

Experimental Comparisons. (a) ER dataset. (b) Leukemia dataset.
Figure: Comparison on gene expression data (GMRF with latent variables).

Conclusions. How to scale the proximal Newton method to high-dimensional problems: active variable selection (the complexity scales with the intrinsic dimensionality); block coordinate descent (handles problems where # parameters > memory size; Big & Quic); other techniques: divide-and-conquer, efficient line search... BigQUIC: a memory-efficient quadratic approximation method for sparse inverse covariance estimation (scales to 1 million by 1 million problems). With a careful design, we can apply the proximal Newton method to models with superposition-structured regularization (sparse + low rank; sparse + group sparse; low rank + group sparse; ...).

References
[1] C.-J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables. NIPS (oral presentation), 2013.
[2] C.-J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation. NIPS, 2011.
[3] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-Like Methods for Sparse Inverse Covariance Estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An Inexact Interior Point Method for L1-Regularized Sparse Covariance Selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse Inverse Covariance Selection via Alternating Linearization Methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning Sparse Gaussian Markov Networks Using a Greedy Coordinate Ascent Approach. Machine Learning and Knowledge Discovery in Databases, 2010.