Sparse Inverse Covariance Estimation for a Million Variables
1 Sparse Inverse Covariance Estimation for a Million Variables Inderjit S. Dhillon Depts of Computer Science & Mathematics The University of Texas at Austin SAMSI LDHD Opening Workshop Raleigh, North Carolina Sept 10, 2013 Joint work with C. Hsieh, M. Sustik, P. Ravikumar and R. Poldrack
2 Estimate Relationships between Variables Example: FMRI Brain Analysis. Goal: Reveal functional connections between regions of the brain. (Sun et al, 2009; Smith et al, 2011; Varoquaux et al, 2010; Ng et al, 2011) Input Output Figure from (Varoquaux et al, 2010)
3 Estimate Relationships between Variables Example: Gene regulatory network discovery. Goal: Gene network reconstruction using microarray data. (Schafer & Strimmer 2005; Andrei & Kendziorski 2009; Menendez et al, 2010; Yin and Li, 2011) Figure from (Banerjee et al, 2008)
4 Other Applications Financial Data Analysis: Model dependencies in multivariate time series (Xuan & Murphy, 2007). Sparse high dimensional models in economics (Fan et al, 2011). Social Network Analysis / Web data: Model co-authorship networks (Goldenberg & Moore, 2005). Model item-item similarity for recommender systems (Agarwal et al, 2011). Climate Data Analysis (Chen et al, 2010). Signal Processing (Zhang & Fung, 2013). Anomaly Detection (Ide et al, 2009).
5 Inverse Covariance Estimation Given: n i.i.d. samples $\{y_1, \ldots, y_n\}$, $y_i \in \mathbb{R}^p$, $y_i \sim \mathcal{N}(\mu, \Sigma)$. Goal: Estimate relationships between variables. The sample mean and covariance are given by $\hat\mu = \frac{1}{n}\sum_{i=1}^n y_i$ and $S = \frac{1}{n}\sum_{i=1}^n (y_i - \hat\mu)(y_i - \hat\mu)^T$. The covariance matrix $\Sigma$ may not directly convey relationships. An example chain graph: $y_j = 0.5\,y_{j-1} + \mathcal{N}(0,1)$, for which $\Sigma$ (shown on the slide) is dense.
6 Inverse Covariance Estimation (cont'd) Conditional independence is reflected as zeros in $\Theta$ ($= \Sigma^{-1}$): $\Theta_{ij} = 0 \iff y_i$ and $y_j$ are conditionally independent given the other variables. Chain graph example: $y_j = 0.5\,y_{j-1} + \mathcal{N}(0,1)$, for which $\Theta$ (shown on the slide) is tridiagonal. In a GMRF $G = (V, E)$, each node corresponds to a variable, and each edge corresponds to a non-zero entry in $\Theta$. Goal: Estimate the inverse covariance matrix $\Theta$.
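The chain-graph example above can be checked numerically. A minimal NumPy sketch (variable names are ours, not from the slides): the chain can be written as $Ly = e$ with $L$ lower bidiagonal, so $\Theta = L^T L$ is tridiagonal while $\Sigma$ is dense.

```python
import numpy as np

p = 8          # chain length (small, for illustration)
phi = 0.5      # autoregressive coefficient: y_j = 0.5*y_{j-1} + N(0,1)

# Write the chain as L y = e with e ~ N(0, I): L is lower bidiagonal
# with 1 on the diagonal and -phi on the subdiagonal.
L = np.eye(p) - phi * np.eye(p, k=-1)

# Then Sigma = Cov(y) = L^{-1} L^{-T} is dense, while
# Theta = Sigma^{-1} = L^T L is tridiagonal.
Sigma = np.linalg.inv(L) @ np.linalg.inv(L).T
Theta = L.T @ L

# Theta_ij = 0 exactly when |i - j| > 1: the chain structure is
# visible in the inverse covariance, not in Sigma itself.
far = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) > 1
off_tridiag = np.abs(Theta[far])
print(off_tridiag.max())          # 0.0
print(abs(Sigma[0, -1]) > 0)      # True: distant pairs are still correlated
```

This is exactly the statement on the slide: zeros of $\Theta$, not of $\Sigma$, encode conditional independence.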
7 Maximum Likelihood Estimator: LogDet Optimization Given the n samples, the likelihood is $P(y_1, \ldots, y_n; \hat\mu, \Theta) \propto \prod_{i=1}^n (\det \Theta)^{1/2} \exp\left(-\tfrac{1}{2}(y_i - \hat\mu)^T \Theta (y_i - \hat\mu)\right) = (\det \Theta)^{n/2} \exp\left(-\tfrac{1}{2}\sum_{i=1}^n (y_i - \hat\mu)^T \Theta (y_i - \hat\mu)\right)$. The log likelihood can be written as $\log P(y_1, \ldots, y_n; \hat\mu, \Theta) = \tfrac{n}{2}\log\det\Theta - \tfrac{n}{2}\operatorname{tr}(\Theta S) + \text{constant}$. Maximum likelihood estimator: $\hat\Theta = \arg\min_{X \succ 0}\{-\log\det X + \operatorname{tr}(SX)\}$. Problem: when p > n, the sample covariance matrix S is singular.
8 L1-regularized inverse covariance selection A sparse inverse covariance matrix is preferred: add $\ell_1$ regularization to promote sparsity. The resulting optimization problem: $\hat\Theta = \arg\min_{X \succ 0}\{-\log\det X + \operatorname{tr}(SX) + \lambda\|X\|_1\} = \arg\min_{X \succ 0} f(X)$, where $\|X\|_1 = \sum_{i,j=1}^p |X_{ij}|$. The regularization parameter $\lambda > 0$ controls the sparsity. The problem appears hard to solve: a non-smooth log-determinant program whose number of parameters scales quadratically with the number of variables.
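The regularized objective is easy to evaluate directly; a small sketch (function name ours) that computes $f(X) = -\log\det X + \operatorname{tr}(SX) + \lambda\|X\|_1$:

```python
import numpy as np

def objective(X, S, lam):
    """f(X) = -log det X + tr(S X) + lam * ||X||_1 (elementwise L1,
    summed over all entries). X must be symmetric positive definite."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()

# Tiny sanity check: with S = I and lam = 0 the minimizer is X = I
# (the gradient -X^{-1} + S vanishes there), with objective value p.
p = 5
S = np.eye(p)
print(objective(np.eye(p), S, lam=0.0))            # 5.0
print(objective(1.5 * np.eye(p), S, lam=0.0) > 5)  # True: any other X is worse
```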
11 Scalability Block coordinate ascent (Banerjee et al, 2007): solves p = 1000 in 4000 secs. Graphical Lasso (Friedman et al, 2007): solves p = 1000 in 800 secs. VSM, PSM, SINCO, IPM, PQN, ALM: ALM solves p = 1000 in 300 secs. QUIC (Hsieh et al, 2011): solves p = 1000 in 10 secs, p = 10,000 in half an hour, but needs > 3 days when p = 30,000 (out of memory). Need for scalability: the FMRI dataset has more than 220,000 variables (48 billion parameters). BigQUIC (2013): p = 1,000,000 in 22.9 hrs (using 32 cores).
12 Scalability Main Ingredients: 1. Second-order Newton-like method (QUIC): quadratic convergence rate. 2. Division of variables into free and fixed sets (QUIC): efficient computation of each Newton direction. 3. Memory-efficient scheme using block coordinate descent (BigQUIC): scales to one million variables. 4. Approximate Hessian computation (BigQUIC): super-linear convergence rate.
13 QUIC: QUadratic approximation for Inverse Covariance estimation
14 Second Order Method Newton's method for a twice differentiable function: $x \leftarrow x - \eta(\nabla^2 f(x))^{-1}\nabla f(x)$. However, the sparse inverse covariance estimation objective is not differentiable: $f(X) = -\log\det X + \operatorname{tr}(SX) + \lambda\|X\|_1$. Most current solvers are first-order methods: block coordinate descent (GLASSO), projected gradient descent (PSM), greedy coordinate descent (SINCO), alternating linearization method (ALM).
17 Quadratic Approximation Write the objective as $f(X) = g(X) + h(X)$, where $g(X) = -\log\det X + \operatorname{tr}(SX)$ and $h(X) = \lambda\|X\|_1$. Form a quadratic approximation for $g(X_t + \Delta)$: $\bar g_{X_t}(\Delta) = \operatorname{tr}((S - W_t)\Delta) + \tfrac{1}{2}\operatorname{vec}(\Delta)^T (W_t \otimes W_t)\operatorname{vec}(\Delta) - \log\det X_t + \operatorname{tr}(S X_t)$, where $W_t = (X_t)^{-1} = \nabla_X \log\det(X)\big|_{X = X_t}$. Define the generalized Newton direction: $D_t = \arg\min_\Delta \bar g_{X_t}(\Delta) + \lambda\|X_t + \Delta\|_1$. Solve by coordinate descent (Hsieh et al, 2011) or other methods (Olsen et al, 2012).
18 Coordinate Descent Updates Use coordinate descent to solve $\arg\min_D \{\bar g_{X_t}(D) + \lambda\|X + D\|_1\}$. Each coordinate descent update: $\mu = \arg\min_\mu \bar g(D + \mu(e_i e_j^T + e_j e_i^T)) + 2\lambda|X_{ij} + D_{ij} + \mu|$. The above one-variable problem can be simplified to minimizing $\tfrac{1}{2}(W_{ij}^2 + W_{ii}W_{jj})\mu^2 + (S_{ij} - W_{ij} + w_i^T D w_j)\mu + \lambda|X_{ij} + D_{ij} + \mu|$. Quadratic form with L1 regularization: soft thresholding gives the exact solution.
20 Efficient solution of one-variable problem The minimum is achieved at $\mu = -c + \mathcal{S}(c - b/a, \lambda/a)$, where $\mathcal{S}(z, r) = \operatorname{sign}(z)\max\{|z| - r, 0\}$ is the soft-thresholding function, $a = W_{ij}^2 + W_{ii}W_{jj}$, $b = S_{ij} - W_{ij} + w_i^T D w_j$, and $c = X_{ij} + D_{ij}$. The main cost is in computing $w_i^T D w_j$: direct computation requires $O(p^2)$ flops. Instead, we maintain $U = DW$ after each coordinate update, and then compute $w_i^T u_j$: only $O(p)$ flops per update. This needs $O(p^2)$ memory, not scalable to large p ⟹ block coordinate descent method to avoid $O(p^2)$ storage (BigQUIC).
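The closed-form update above is short enough to verify numerically; a sketch (function names ours) that implements $\mu = -c + \mathcal{S}(c - b/a, \lambda/a)$ and checks it against a fine grid:

```python
import numpy as np

def soft_threshold(z, r):
    """S(z, r) = sign(z) * max(|z| - r, 0)."""
    return np.sign(z) * max(abs(z) - r, 0.0)

def coordinate_update(a, b, c, lam):
    """Minimize 0.5*a*mu^2 + b*mu + lam*|c + mu| over mu (a > 0).
    Closed form: mu = -c + S(c - b/a, lam/a)."""
    return -c + soft_threshold(c - b / a, lam / a)

# Numerical check of the closed form against a dense grid.
a, b, c, lam = 2.0, 0.7, -0.3, 0.5
f = lambda mu: 0.5 * a * mu**2 + b * mu + lam * abs(c + mu)
mu_star = coordinate_update(a, b, c, lam)
grid = np.linspace(-5, 5, 200001)
print(f(mu_star) <= f(grid).min() + 1e-8)   # True
```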
21 Line Search Adopt the Armijo rule: try step-sizes $\alpha \in \{\beta^0, \beta^1, \beta^2, \ldots\}$ s.t. $X_t + \alpha D_t$: 1. is positive definite; 2. satisfies a sufficient decrease condition $f(X_t + \alpha D_t) \le f(X_t) + \alpha\sigma\Delta_t$, where $\Delta_t = \operatorname{tr}(\nabla g(X_t) D_t) + \lambda\|X_t + D_t\|_1 - \lambda\|X_t\|_1$. Both conditions can be checked by performing a Cholesky factorization: $O(p^3)$ flops per line search iteration. Can do better by using Lanczos [K.C. Toh] and Conjugate Gradient.
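The two line-search conditions can be sketched directly: a minimal Armijo backtracking loop (names and defaults ours, not the QUIC implementation) that uses a Cholesky attempt as the positive-definiteness check:

```python
import numpy as np

def armijo_step(f, X, D, Delta, sigma=0.001, beta=0.5, max_tries=30):
    """Backtracking line search: try alpha in {1, beta, beta^2, ...}
    until X + alpha*D is positive definite (checked via Cholesky)
    and f decreases by at least alpha*sigma*Delta (Delta < 0)."""
    fX = f(X)
    alpha = 1.0
    for _ in range(max_tries):
        Xnew = X + alpha * D
        try:
            np.linalg.cholesky(Xnew)          # raises if not PD
            if f(Xnew) <= fX + alpha * sigma * Delta:
                return alpha, Xnew
        except np.linalg.LinAlgError:
            pass
        alpha *= beta
    raise RuntimeError("line search failed")

# Example with lam = 0: f(X) = -log det X + tr(SX), S = 2I, X = I.
# D = -0.5 I points toward the optimum X* = 0.5 I, and Delta = tr(grad g * D) = -1.
S = 2 * np.eye(2)
f = lambda X: -np.linalg.slogdet(X)[1] + np.trace(S @ X)
alpha, X_new = armijo_step(f, np.eye(2), -0.5 * np.eye(2), Delta=-1.0)
print(alpha)   # 1.0 -- the full step is accepted
```

Each backtracking trial here costs one Cholesky factorization, which is the $O(p^3)$ cost the slide refers to.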
23 Free and Fixed Sets Many variables never change to nonzero values: for p = 1869 (p² = 3.49 million), only 3% change to nonzero values. Our goal: variable selection! Our strategy: before solving for the Newton direction, guess the variables that should be updated: $(X_t)_{ij} \in S_{\text{fixed}}$ if $|\nabla_{ij} g(X_t)| < \lambda$ and $(X_t)_{ij} = 0$; otherwise $(X_t)_{ij} \in S_{\text{free}}$. We then perform the coordinate descent updates only on the $S_{\text{free}}$ set.
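The fixed/free selection rule is one vectorized line; a sketch (function name ours) using $\nabla g(X) = S - W$ with $W = X^{-1}$:

```python
import numpy as np

def free_set_mask(X, S, W, lam):
    """Boolean mask of the free set. A variable (i,j) is fixed when
    X_ij = 0 and |grad_ij g(X)| = |S_ij - W_ij| < lam (W = X^{-1});
    everything else is free and receives coordinate-descent updates."""
    grad = S - W
    fixed = (X == 0) & (np.abs(grad) < lam)
    return ~fixed

# On a diagonal X with only weak off-diagonal signal in S, all
# off-diagonal variables stay fixed and only the diagonal is free.
p = 50
rng = np.random.default_rng(1)
S = np.eye(p) + 0.01 * rng.standard_normal((p, p))
S = (S + S.T) / 2
X = np.eye(p)
W = np.linalg.inv(X)
mask = free_set_mask(X, S, W, lam=0.1)
print(mask.sum(), "free of", p * p)   # 50 free of 2500
```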
24 Algorithm QUIC: QUadratic approximation for sparse Inverse Covariance estimation. Input: empirical covariance matrix S, scalar λ, initial $X_0$. For t = 0, 1, ...: 1. Compute $W_t = X_t^{-1}$. 2. Form the second order approximation $\bar f_{X_t}(X)$ to $f(X)$ around $X_t$. 3. Partition the variables into free and fixed sets. 4. Use coordinate descent to find the descent direction $D_t = \arg\min_\Delta \bar f_{X_t}(X_t + \Delta)$ over the set of free variables (a Lasso problem). 5. Use an Armijo-rule based step-size selection to get α s.t. $X_{t+1} = X_t + \alpha D_t$ is positive definite and the objective sufficiently decreases.
25 BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
26 Difficulties in Scaling QUIC Coordinate descent requires $X_t$ and $X_t^{-1}$: needs $O(p^2)$ storage and $O(mp)$ computation per sweep, where $m = \|X_t\|_0$. Line search (compute the determinant of a big sparse matrix): needs $O(p^2)$ storage and $O(p^3)$ computation.
27 Line Search Given the sparse matrix $A = X_t + \alpha D$, we need to: 1. check its positive definiteness; 2. compute $\log\det(A)$. Cholesky factorization in QUIC requires $O(p^3)$ computation. If $A = \begin{pmatrix} a & b^T \\ b & C \end{pmatrix}$, then $\det(A) = \det(C)(a - b^T C^{-1} b)$, and A is positive definite iff C is positive definite and $a - b^T C^{-1} b > 0$. C is sparse, so we can compute $C^{-1}b$ using Conjugate Gradient (CG). Time complexity: $T_{CG} = O(mT)$, where T is the number of CG iterations.
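The Schur-complement identity can be exercised on a small dense example. A sketch with a hand-rolled conjugate gradient solver (all names ours; a real implementation would exploit sparsity and apply the split recursively):

```python
import numpy as np

def conj_grad(A, b, tol=1e-10, max_iter=1000):
    """Plain conjugate gradient for a symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b - A @ x
    d = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ad = A @ d
        alpha = rs / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

# Split a symmetric PD matrix A as [[a, b^T], [b, C]]; then
# det(A) = det(C) * (a - b^T C^{-1} b), and A is PD iff C is PD
# and the Schur complement is positive.
rng = np.random.default_rng(0)
p = 60
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)            # symmetric positive definite

a, b, C = A[0, 0], A[1:, 0], A[1:, 1:]
schur = a - b @ conj_grad(C, b)        # needs only matrix-vector products
logdet_A = np.linalg.slogdet(C)[1] + np.log(schur)
print(np.isclose(logdet_A, np.linalg.slogdet(A)[1]))   # True
```

The point of the identity is that the $C^{-1}b$ solve touches $A$ only through sparse matrix-vector products, which is what gives the $O(mT)$ cost per CG solve.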
28 Difficulties in Scaling QUIC (after the CG-based line search) Coordinate descent requires $X_t$ and $X_t^{-1}$: needs $O(p^2)$ storage and $O(mp)$ computation per sweep, where $m = \|X_t\|_0$. Line search (compute the determinant of a big sparse matrix): now needs only $O(p)$ storage and $O(mp)$ computation.
30 Back-of-the-envelope calculations Coordinate descent requires m computations of $w_i^T D w_j$. To solve a problem of size $p = 10^4$ (avg degree = 10), QUIC takes 12 minutes. Extrapolate: to solve a problem with $p = 10^6$, QUIC would take 12 min × 10^4 ≈ 83 days (2.6 days using 32 cores). $p^2$ memory means 8 terabytes of main memory for $p = 10^6$. Naive solution that has O(m) memory requirement: compute $w_i, w_j$ by solving two linear systems by CG, $O(T_{CG})$ computation per coordinate update, $O(mT_{CG})$ computation per sweep. This would need 8,333 days to solve the problem for $p = 10^6$ (260 days using 32 cores). BigQUIC solves problems of size $p = 10^6$ in one day on 32 cores!
35 Coordinate Updates with Memory Cache Assume we can store M columns of W in memory. Coordinate descent update (i, j): compute $w_i^T D w_j$. If $w_i, w_j$ are not in memory: recompute by CG (solve $X w_i = e_i$): $O(T_{CG}) = O(mT)$ time.
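The caching idea can be sketched with a small LRU column cache (class and names ours; `np.linalg.solve` stands in for the CG solve). Updating coordinates block by block keeps the needed columns hot, so each column is computed only once:

```python
import numpy as np
from collections import OrderedDict

class ColumnCache:
    """Cache up to M columns of W = X^{-1}, recomputing a missing
    column w_i by solving X w_i = e_i (stand-in for CG) on a miss."""
    def __init__(self, X, M):
        self.X, self.M = X, M
        self.cache = OrderedDict()
        self.misses = 0

    def column(self, i):
        if i in self.cache:
            self.cache.move_to_end(i)        # LRU bookkeeping
        else:
            self.misses += 1
            e = np.zeros(self.X.shape[0]); e[i] = 1.0
            self.cache[i] = np.linalg.solve(self.X, e)
            if len(self.cache) > self.M:
                self.cache.popitem(last=False)   # evict least recent
        return self.cache[i]

# Block-by-block access pattern: several sweeps over each block of
# M coordinates reuse the cached columns, so misses == p exactly.
p = 100
X = np.eye(p) + 0.1 * np.eye(p, k=1) + 0.1 * np.eye(p, k=-1)
cache = ColumnCache(X, M=10)
for i in range(0, p, 10):                    # blocks of 10 coordinates
    for _ in range(3):                       # several sweeps per block
        for j in range(i, i + 10):
            _ = cache.column(j)
print(cache.misses)                          # 100: each column computed once
```

A random access pattern over the same coordinates would instead miss repeatedly, which is the motivation for the block partitioning on the next slides.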
36 Coordinate Updates: ideal case We want to find an update sequence that minimizes the number of cache misses: probably NP-hard. Our strategy: update variables block by block. The ideal case: there exists a partition $\{S_1, \ldots, S_k\}$ such that all free sets are in diagonal blocks. This only requires p column evaluations.
37 General case: block diagonal + sparse If the block partition is not perfect, the extra column computations can be characterized by boundary nodes. Given a partition $\{S_1, \ldots, S_k\}$, we define the boundary nodes as $B(S_q) \equiv \{j \mid j \in S_q \text{ and } \exists\, i \in S_z,\, z \ne q, \text{ s.t. } F_{ij} = 1\}$, where F is the adjacency matrix of the free set.
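The boundary-node definition translates directly into code; a sketch (function name ours) on a tiny graph where only one edge crosses the partition:

```python
import numpy as np

def boundary_nodes(F, partition):
    """B(S_q): nodes j in S_q adjacent (in the free-set graph F) to
    some node i in a different block S_z, z != q."""
    boundaries = []
    for S_q in partition:
        outside = np.ones(F.shape[0], dtype=bool)
        outside[S_q] = False                 # mark nodes outside this block
        boundaries.append([j for j in S_q if F[outside, j].any()])
    return boundaries

# 6-node graph: edges (0,1), (1,2), (2,3), (4,5); blocks {0,1,2}, {3,4,5}.
F = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (4, 5)]:
    F[i, j] = F[j, i] = 1
parts = [[0, 1, 2], [3, 4, 5]]
print(boundary_nodes(F, parts))   # [[2], [3]]: only the (2,3) edge crosses
```

Columns for boundary nodes must be recomputed when processing the neighboring block, which is why the sweep cost below is $p + \sum_q |B(S_q)|$ column evaluations.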
38 Detailed updates in general case
42 Graph Clustering Algorithm The number of columns to be computed in one sweep is $p + \sum_q |B(S_q)|$. This can be upper bounded by $p + \sum_q |B(S_q)| \le p + \sum_q \sum_{z \ne q} \sum_{i \in S_z,\, j \in S_q} F_{ij}$. Minimizing the right-hand side = minimizing the off-diagonal entries. Use graph clustering (METIS or Graclus) to find the partition.
43 Size of boundary nodes in real datasets ER dataset with p = 692. Random partition: compute 1687 columns. Partition by clustering: compute 775 columns. $O(kpT_{CG}) \to O(pT_{CG})$ flops per sweep.
44 BigQUIC Block coordinate descent with clustering: needs $O(m + p^2/k)$ storage and $O(mp)$ computation per sweep, where $m = \|X_t\|_0$. Line search (compute the determinant of a big sparse matrix): needs $O(p)$ storage and $O(mp)$ computation.
45 Convergence Guarantees for QUIC Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS, 2011): QUIC converges quadratically to the global optimum, that is, for some constant $0 < \kappa < 1$: $\lim_{t \to \infty} \frac{\|X_{t+1} - X^*\|_F}{\|X_t - X^*\|_F^2} = \kappa$.
47 BigQUIC Convergence Analysis Recall $W = X^{-1}$. When each $w_i$ is computed by CG ($X w_i = e_i$): the gradient $\nabla_{ij} g(X) = S_{ij} - W_{ij}$ on the free set can be computed once and stored in memory; the Hessian ($w_i^T D w_j$ in the coordinate updates) needs to be repeatedly computed. To reduce the time overhead, the Hessian should be computed approximately. Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013): If $\nabla g(X)$ is computed exactly and $\eta I \preceq \hat H_t \preceq \bar\eta I$ for some constants $\bar\eta \ge \eta > 0$ at every Newton iteration, then BigQUIC converges to the global optimum.
49 BigQUIC Convergence Analysis To obtain a super-linear convergence rate, we require the Hessian computation to be more and more accurate as $X_t$ approaches $X^*$. Measure the accuracy of the Hessian computation by $R = [r_1, \ldots, r_p]$, where $r_i = X w_i - e_i$. The optimality of $X_t$ can be measured by the minimum-norm subgradient: $\nabla^S_{ij} f(X) = \nabla_{ij} g(X) + \operatorname{sign}(X_{ij})\lambda$ if $X_{ij} \ne 0$, and $\nabla^S_{ij} f(X) = \operatorname{sign}(\nabla_{ij} g(X))\max(|\nabla_{ij} g(X)| - \lambda, 0)$ if $X_{ij} = 0$. Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013): In BigQUIC, if the residual matrix satisfies $\|R_t\| = O(\|\nabla^S f(X_t)\|^\ell)$ for some $0 < \ell \le 1$ as $t \to \infty$, then $\|X_{t+1} - X^*\| = O(\|X_t - X^*\|^{1+\ell})$ as $t \to \infty$. When $\|R_t\| = O(\|\nabla^S f(X_t)\|)$, BigQUIC has an asymptotic quadratic convergence rate.
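The minimum-norm subgradient used as the optimality measure is easy to compute; a sketch (function name ours) that also checks it vanishes at a known optimum:

```python
import numpy as np

def min_norm_subgrad(X, S, lam):
    """Minimum-norm subgradient grad^S f(X) of
    f(X) = -log det X + tr(SX) + lam*||X||_1.
    It is zero exactly at the optimum, so its norm measures optimality."""
    G = S - np.linalg.inv(X)                      # gradient of the smooth part
    return np.where(X != 0,
                    G + lam * np.sign(X),
                    np.sign(G) * np.maximum(np.abs(G) - lam, 0.0))

# For S = I the regularized optimum is X* = (1/(1+lam)) I: the diagonal
# stationarity condition -1/x + 1 + lam = 0 gives x = 1/(1+lam), and the
# off-diagonal condition |G_ij| <= lam holds with G_ij = 0 there.
lam = 0.1
X_opt = np.eye(4) / (1.0 + lam)
print(np.abs(min_norm_subgrad(X_opt, np.eye(4), lam)).max() < 1e-10)   # True
```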
50 Experimental results (scalability) (a) Scalability on random graph. (b) Scalability on chain graph. Figure: BigQUIC can solve one million dimensional problems.
51 Experimental results BigQUIC is faster even for medium size problems. Figure: Comparison on FMRI data with p = (maximum dimension that previous methods can handle).
52 Results on FMRI dataset 228,483 voxels, 518 time points. λ = 0.6 ⟹ average degree 8; BigQUIC took 5 hours. λ = 0.5 ⟹ average degree 38; BigQUIC took 21 hours. Findings: voxels with large degree were generally found in the gray matter; meaningful brain modules can be detected by modularity clustering.
53 Conclusions Memory-efficient quadratic approximation method for sparse inverse covariance estimation (BigQUIC). Key ingredients: a second order method; fixed/free set selection; a block coordinate descent solver for each subproblem; inexact computation of the Hessian with convergence analysis. BigQUIC can solve a 1-million dimensional problem in one day on a single 32-core machine. QUIC published in NIPS 2011: Sparse Inverse Covariance Matrix Estimation using Quadratic Approximation. BigQUIC to appear in NIPS 2013 (Oral): BIG & QUIC: Sparse inverse covariance estimation for a million variables.
54 References
[1] C. J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. NIPS (oral presentation), 2013.
[2] C. J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse Inverse Covariance Matrix Estimation using Quadratic Approximation. NIPS, 2011.
[3] C. J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A Divide-and-Conquer Procedure for Sparse Inverse Covariance Estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-Like Methods for Sparse Inverse Covariance Estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An inexact interior point method for l1-regularized sparse covariance selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach. Machine Learning and Knowledge Discovery in Databases, 2010.
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationMATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.)
1/12 MATH 829: Introduction to Data Mining and Analysis Graphical Models III - Gaussian Graphical Models (cont.) Dominique Guillot Departments of Mathematical Sciences University of Delaware May 6, 2016
More informationLecture 2 Part 1 Optimization
Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss
More informationA Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression
A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationCME342 Parallel Methods in Numerical Analysis. Matrix Computation: Iterative Methods II. Sparse Matrix-vector Multiplication.
CME342 Parallel Methods in Numerical Analysis Matrix Computation: Iterative Methods II Outline: CG & its parallelization. Sparse Matrix-vector Multiplication. 1 Basic iterative methods: Ax = b r = b Ax
More information11 : Gaussian Graphic Models and Ising Models
10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood
More informationMixture Models & EM. Nicholas Ruozzi University of Texas at Dallas. based on the slides of Vibhav Gogate
Mixture Models & EM icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Previously We looed at -means and hierarchical clustering as mechanisms for unsupervised learning -means
More informationHigh-dimensional Covariance Estimation Based On Gaussian Graphical Models
High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix
More informationClassification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham
More informationRobust and sparse Gaussian graphical modelling under cell-wise contamination
Robust and sparse Gaussian graphical modelling under cell-wise contamination Shota Katayama 1, Hironori Fujisawa 2 and Mathias Drton 3 1 Tokyo Institute of Technology, Japan 2 The Institute of Statistical
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationIntroduction to Alternating Direction Method of Multipliers
Introduction to Alternating Direction Method of Multipliers Yale Chang Machine Learning Group Meeting September 29, 2016 Yale Chang (Machine Learning Group Meeting) Introduction to Alternating Direction
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationThe lasso: some novel algorithms and applications
1 The lasso: some novel algorithms and applications Newton Institute, June 25, 2008 Robert Tibshirani Stanford University Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Gen Nowak,
More informationAn Introduction to Graphical Lasso
An Introduction to Graphical Lasso Bo Chang Graphical Models Reading Group May 15, 2015 Bo Chang (UBC) Graphical Lasso May 15, 2015 1 / 16 Undirected Graphical Models An undirected graph, each vertex represents
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationBehavioral Data Mining. Lecture 7 Linear and Logistic Regression
Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression
More informationRobust estimation of scale and covariance with P n and its application to precision matrix estimation
Robust estimation of scale and covariance with P n and its application to precision matrix estimation Garth Tarr, Samuel Müller and Neville Weber USYD 2013 School of Mathematics and Statistics THE UNIVERSITY
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationThe graphical lasso: New insights and alternatives
Electronic Journal of Statistics Vol. 6 (2012) 2125 2149 ISSN: 1935-7524 DOI: 10.1214/12-EJS740 The graphical lasso: New insights and alternatives Rahul Mazumder Massachusetts Institute of Technology Cambridge,
More informationMATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models
1/13 MATH 829: Introduction to Data Mining and Analysis Graphical Models II - Gaussian Graphical Models Dominique Guillot Departments of Mathematical Sciences University of Delaware May 4, 2016 Recall
More informationProximity-Based Anomaly Detection using Sparse Structure Learning
Proximity-Based Anomaly Detection using Sparse Structure Learning Tsuyoshi Idé (IBM Tokyo Research Lab) Aurelie C. Lozano, Naoki Abe, and Yan Liu (IBM T. J. Watson Research Center) 2009/04/ SDM 2009 /
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationFast Regularization Paths via Coordinate Descent
August 2008 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerry Friedman and Rob Tibshirani. August 2008 Trevor
More informationComputational statistics
Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationLearning Gaussian Graphical Models with Unknown Group Sparsity
Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationTopology identification via growing a Chow-Liu tree network
2018 IEEE Conference on Decision and Control (CDC) Miami Beach, FL, USA, Dec. 17-19, 2018 Topology identification via growing a Chow-Liu tree network Sepideh Hassan-Moghaddam and Mihailo R. Jovanović Abstract
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationDistributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology
More informationSub-Sampled Newton Methods
Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1
More informationIs the test error unbiased for these programs? 2017 Kevin Jamieson
Is the test error unbiased for these programs? 2017 Kevin Jamieson 1 Is the test error unbiased for this program? 2017 Kevin Jamieson 2 Simple Variable Selection LASSO: Sparse Regression Machine Learning
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationTechnical report no. SOL Department of Management Science and Engineering, Stanford University, Stanford, California, September 15, 2013.
PROXIMAL NEWTON-TYPE METHODS FOR MINIMIZING COMPOSITE FUNCTIONS JASON D. LEE, YUEKAI SUN, AND MICHAEL A. SAUNDERS Abstract. We generalize Newton-type methods for minimizing smooth functions to handle a
More informationLarge-scale Collaborative Ranking in Near-Linear Time
Large-scale Collaborative Ranking in Near-Linear Time Liwei Wu Depts of Statistics and Computer Science UC Davis KDD 17, Halifax, Canada August 13-17, 2017 Joint work with Cho-Jui Hsieh and James Sharpnack
More information