Sparse Inverse Covariance Estimation for a Million Variables. Cho-Jui Hsieh, The University of Texas at Austin. ICML workshop on Covariance Selection, Beijing, China, June 26, 2014
1 Sparse Inverse Covariance Estimation for a Million Variables. Cho-Jui Hsieh, The University of Texas at Austin. ICML workshop on Covariance Selection, Beijing, China, June 26, 2014. Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack, A. Banerjee, S. Becker, and P. Olsen.
2 FMRI Brain Analysis. Goal: reveal functional connections between regions of the brain (Sun et al., 2009; Smith et al., 2011; Varoquaux et al., 2010; Ng et al., 2011). p = 228,483 voxels. Figure from (Varoquaux et al., 2010).
3 Other Applications. Gene regulatory network discovery (Schafer & Strimmer, 2005; Andrei & Kendziorski, 2009; Menendez et al., 2010; Yin & Li, 2011). Financial data analysis: model dependencies in multivariate time series (Xuan & Murphy, 2007); sparse high-dimensional models in economics (Fan et al., 2011). Social network analysis / web data: model co-authorship networks (Goldenberg & Moore, 2005); model item-item similarity for recommender systems (Agarwal et al., 2011). Climate data analysis (Chen et al., 2010). Signal processing (Zhang & Fung, 2013). Anomaly detection (Ide et al., 2009). Time series prediction and classification (Wang et al., 2014).
4 L1-regularized inverse covariance selection. Goal: estimate the inverse covariance matrix in the high-dimensional setting: p (# variables) ≫ n (# samples). Adding $\ell_1$ regularization encourages a sparse inverse covariance matrix. The $\ell_1$-regularized maximum likelihood estimator is
$$\hat{\Sigma}^{-1} = \arg\min_{X \succ 0} \Big\{ \underbrace{-\log\det X + \mathrm{tr}(SX)}_{\text{negative log likelihood}} + \lambda \|X\|_1 \Big\} = \arg\min_{X \succ 0} f(X),$$
where $\|X\|_1 = \sum_{i,j=1}^{p} |X_{ij}|$. High-dimensional problems: the number of parameters scales quadratically with the number of variables.
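To make the estimator concrete, here is a minimal sketch of the objective $f(X)$ in Python/NumPy (not the authors' code; the function name and interface are illustrative):

```python
# Minimal sketch of f(X) = -log det(X) + tr(S X) + lambda * ||X||_1.
# Assumes X is symmetric positive definite.
import numpy as np

def objective(X, S, lam):
    sign, logdet = np.linalg.slogdet(X)  # numerically safer than log(det(X))
    assert sign > 0, "X must be positive definite"
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()
```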
5 Proximal Newton method. Split smooth and non-smooth terms: $f(X) = g(X) + h(X)$, where $g(X) = -\log\det X + \mathrm{tr}(SX)$ and $h(X) = \lambda\|X\|_1$. Form a quadratic approximation for $g(X_t + \Delta)$:
$$\bar{g}_{X_t}(\Delta) = \mathrm{tr}\big((S - W_t)\Delta\big) + \tfrac{1}{2}\,\mathrm{vec}(\Delta)^T (W_t \otimes W_t)\,\mathrm{vec}(\Delta) - \log\det X_t + \mathrm{tr}(SX_t),$$
where $W_t = (X_t)^{-1} = \nabla_X \log\det(X)\big|_{X = X_t}$. Define the generalized Newton direction:
$$D_t = \arg\min_{\Delta} \ \bar{g}_{X_t}(\Delta) + \lambda\|X_t + \Delta\|_1.$$
6 Algorithm: Proximal Newton method for sparse inverse covariance estimation. Input: empirical covariance matrix $S$, scalar $\lambda$, initial $X_0$. For $t = 0, 1, \ldots$:
1. Compute the Newton direction: $D_t = \arg\min_\Delta \bar{f}_{X_t}(X_t + \Delta)$ (a Lasso problem).
2. Line search: use an Armijo-rule based step-size selection to get $\alpha$ such that $X_{t+1} = X_t + \alpha D_t$ is positive definite and satisfies the sufficient decrease condition $f(X_t + \alpha D_t) \le f(X_t) + \alpha\sigma\delta_t$ (positive definiteness checked via Cholesky factorization of $X_t + \alpha D_t$).
(Hsieh et al., 2011; Olsen et al., 2012; Lee et al., 2012; Dinh et al., 2013; Scheinberg et al., 2014)
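A hedged sketch of this outer loop, assuming a `newton_direction` subroutine that solves the Lasso subproblem and reusing the `objective` helper from the earlier sketch (the step-size constants are illustrative):

```python
import numpy as np

def is_pos_def(X):
    # Cholesky succeeds iff X is symmetric positive definite
    try:
        np.linalg.cholesky(X)
        return True
    except np.linalg.LinAlgError:
        return False

def prox_newton_step(X, S, lam, newton_direction, sigma=1e-4, beta=0.5):
    D = newton_direction(X, S, lam)      # step 1: solve the Lasso subproblem
    G = S - np.linalg.inv(X)             # gradient of the smooth part g
    # model decrease delta_t used in the sufficient decrease (Armijo) test
    delta = np.sum(G * D) + lam * (np.abs(X + D).sum() - np.abs(X).sum())
    alpha = 1.0
    while True:                          # step 2: backtracking line search
        X_new = X + alpha * D
        if is_pos_def(X_new) and \
           objective(X_new, S, lam) <= objective(X, S, lam) + sigma * alpha * delta:
            return X_new
        alpha *= beta
```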
7 Scalability. How can we apply the proximal Newton method to high-dimensional problems? Number of parameters: too large ($p^2$). Hessian: cannot be explicitly formed ($p^2 \times p^2$). Limited memory... BigQUIC (2013): $p = 1{,}000{,}000$ (1 trillion parameters) in 22.9 hours with 32 GB of memory (using a single machine with 32 cores). Key ingredients: active variable selection (the complexity scales with the intrinsic dimensionality); block coordinate descent (handles problems where # parameters > memory size; Big & Quic). Other techniques: divide-and-conquer, efficient line search... Extension to general dirty models.
8 Active Variable Selection
9 Motivation. We can directly implement a proximal Newton method: it takes 28 seconds on the ER dataset. Running Glasso takes only 25 seconds. So why do people like to use the proximal Newton method?
10 Newton Iteration 1 ER dataset, p = 692, Newton iteration 1.
11 Newton Iteration 3 ER dataset, p = 692, Newton iteration 3.
12 Newton Iteration 5 ER dataset, p = 692, Newton iteration 5.
13 Newton Iteration 7. ER dataset, p = 692, Newton iteration 7. Only 2.7% of the variables have been changed to nonzero.
14 Active Variable Selection. Our goal: reduce the number of variables from $O(p^2)$ to $O(\|X^*\|_0)$. The solution usually has a small number of nonzeros: $\|X^*\|_0 \ll p^2$. Strategy: before solving for the Newton direction, guess the variables that should be updated. Fixed set: $S_{\text{fixed}} = \{(i, j) : X_{ij} = 0 \text{ and } |\nabla_{ij} g(X)| \le \lambda\}$. The remaining variables constitute the free set. Solve the smaller subproblem:
$$D_t = \arg\min_{\Delta : \Delta_{S_{\text{fixed}}} = 0} \ \bar{g}_{X_t}(\Delta) + \lambda\|X_t + \Delta\|_1.$$
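A minimal sketch of this selection rule, assuming the gradient $\nabla g(X) = S - X^{-1}$ has already been computed (names are illustrative):

```python
import numpy as np

def free_mask(X, grad_g, lam, tol=1e-12):
    # (i, j) is fixed if X_ij = 0 and |grad_ij g(X)| <= lam;
    # all remaining variables form the free set.
    fixed = (np.abs(X) < tol) & (np.abs(grad_g) <= lam)
    return ~fixed  # boolean mask of free variables
```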
15 Active Variable Selection. Is this just a heuristic? No. The process is equivalent to a block coordinate descent algorithm with two blocks: 1. Update the fixed set: no change (by the definition of $S_{\text{fixed}}$). 2. Update the free set: update via the subproblem. Convergence is guaranteed.
16 Size of free set. In practice, the size of the free set is small. Take the Hereditary dataset as an example: p = 1,869, so the number of variables is $p^2 = 3.49$ million. The size of the free set drops to 20,000 by the end.
17 Real datasets. Figure: comparison of algorithms on real datasets: (a) time for Estrogen, p = 692; (b) time for hereditarybc, p = 1,869. The results show QUIC converges faster than the other methods.
18 Algorithm QUIC: QUadratic approximation for sparse Inverse Covariance estimation. Input: empirical covariance matrix $S$, scalar $\lambda$, initial $X_0$. For $t = 0, 1, \ldots$:
1. Variable selection: select a free set of $m \ll p^2$ variables.
2. Use coordinate descent to find the descent direction $D_t = \arg\min_\Delta \bar{f}_{X_t}(X_t + \Delta)$ over the set of free variables (a Lasso problem).
3. Line search: use an Armijo-rule based step-size selection to get $\alpha$ such that $X_{t+1} = X_t + \alpha D_t$ is positive definite and satisfies the sufficient decrease condition $f(X_t + \alpha D_t) \le f(X_t) + \alpha\sigma\delta_t$ (Cholesky factorization of $X_t + \alpha D_t$).
19 Extensions. Also used in LIBLINEAR (Yuan et al., 2011) for $\ell_1$-regularized logistic regression. Can be generalized to active subspace selection for the nuclear norm (Hsieh et al., 2014), and to general decomposable norms.
20 Efficient (block) Coordinate Descent with Limited Memory
21 Coordinate Descent Updates. Use coordinate descent to solve $\arg\min_D \{\bar{g}_X(D) + \lambda\|X + D\|_1\}$. Current gradient:
$$\nabla \bar{g}_X(D) = (W \otimes W)\,\mathrm{vec}(D) + S - W = WDW + S - W, \quad \text{where } W = X^{-1}.$$
Closed-form solution for each coordinate descent update:
$$D_{ij} \leftarrow D_{ij} - c + \mathcal{S}(c - b/a, \lambda/a),$$
where $\mathcal{S}(z, r) = \mathrm{sign}(z)\max\{|z| - r, 0\}$ is the soft-thresholding function, $a = W_{ij}^2 + W_{ii}W_{jj}$, $b = S_{ij} - W_{ij} + w_i^T D w_j$, and $c = X_{ij} + D_{ij}$. The main cost is in computing $w_i^T D w_j$. Instead of forming it from scratch, we maintain $U = DW$ after each coordinate update and compute $w_i^T u_j$: only $O(p)$ flops per update.
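A sketch of one off-diagonal coordinate update with the cached $U = DW$, following the formulas above as reconstructed from the slide (the diagonal case is analogous with $a = W_{ii}^2$ and no symmetric twin):

```python
import numpy as np

def soft_threshold(z, r):
    return np.sign(z) * max(abs(z) - r, 0.0)

def update_coordinate(i, j, X, D, S, W, U, lam):
    # assumes i != j; W = inv(X) and U = D @ W are maintained by the caller
    a = W[i, j] ** 2 + W[i, i] * W[j, j]
    b = S[i, j] - W[i, j] + W[i, :] @ U[:, j]  # w_i^T D w_j via the cache: O(p)
    c = X[i, j] + D[i, j]
    mu = -c + soft_threshold(c - b / a, lam / a)
    D[i, j] += mu
    D[j, i] += mu                              # keep D symmetric
    U[i, :] += mu * W[j, :]                    # rank-one updates keep U = D @ W
    U[j, :] += mu * W[i, :]
```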
22 Difficulties in Scaling QUIC. Consider the case where $p \approx 1$ million and $m = \|X_t\|_0 \approx 50$ million. Coordinate descent requires $X_t$ and $W = X_t^{-1}$: it needs $O(p^2)$ storage and $O(mp)$ computation per sweep, where $m = \|X_t\|_0$.
23 Coordinate Updates with Memory Cache. Assume we can store $M$ columns of $W$ in memory. Coordinate descent update $(i, j)$: compute $w_i^T D w_j$. If $w_i, w_j$ are not in memory, recompute them by CG, solving $X w_i = e_i$: $O(T_{CG})$ time.
24 Coordinate Updates with Memory Cache. $w_1, w_2, w_3, w_4$ stored in memory.
25 Coordinate Updates with Memory Cache. Cache hit: no need to recompute $w_i, w_j$.
26 Coordinate Updates with Memory Cache. Cache miss: recompute $w_i, w_j$.
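A sketch of the cache logic, assuming SciPy's conjugate gradient solver; the eviction policy here is a simple placeholder, not the paper's:

```python
import numpy as np
from scipy.sparse.linalg import cg

def get_column(i, X, cache, max_cached):
    if i in cache:                 # cache hit: no recomputation needed
        return cache[i]
    e = np.zeros(X.shape[0])
    e[i] = 1.0
    w, info = cg(X, e)             # cache miss: solve X w_i = e_i by CG
    if len(cache) >= max_cached:   # evict an arbitrary column (placeholder policy)
        cache.pop(next(iter(cache)))
    cache[i] = w
    return w
```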
27 Coordinate Updates: ideal case. We want to find an update sequence that minimizes the number of cache misses: probably NP-hard. Our strategy: update variables block by block. The ideal case: there exists a partition $\{S_1, \ldots, S_k\}$ such that all free sets lie in the diagonal blocks. This requires only $p$ column evaluations.
28 General case: block diagonal + sparse. If the block partition is not perfect, the extra column computations can be characterized by boundary nodes. Given a partition $\{S_1, \ldots, S_k\}$, we define the boundary nodes as
$$B(S_q) \equiv \{j \mid j \in S_q \text{ and } \exists\, i \in S_z, z \ne q, \text{ s.t. } F_{ij} = 1\},$$
where $F$ is the adjacency matrix of the free set.
29 Graph Clustering Algorithm. The number of columns to be computed in one sweep is $p + \sum_q |B(S_q)|$. This can be upper bounded by
$$p + \sum_q |B(S_q)| \ \le\ p + \sum_q \sum_{z \ne q} \sum_{i \in S_z,\, j \in S_q} F_{ij}.$$
Use graph clustering (METIS or Graclus) to find the partition. Example: on the fMRI dataset with 20 blocks, a random partition needs 1.6 million column computations, while graph clustering needs substantially fewer.
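A sketch of the boundary-node count for a given partition, using the definition above ($F$ is the 0/1 adjacency matrix of the free set; `blocks[i]` is the cluster id of node $i$; names are illustrative):

```python
import numpy as np

def num_boundary_nodes(F, blocks):
    total = 0
    for q in np.unique(blocks):
        inside = (blocks == q)
        # j in S_q is a boundary node if it has a free-set edge into another block
        cross = F[np.ix_(~inside, inside)].sum(axis=0) > 0
        total += int(np.sum(cross))
    # columns computed per sweep is roughly p + total
    return total
```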
30 BigQUIC. Block coordinate descent with clustering: storage drops from $O(p^2)$ to $O(m + p^2/k)$; computation per sweep remains $O(mp)$, where $m = \|X_t\|_0$.
31 Algorithm BigQUIC. Input: samples $Y$, scalar $\lambda$, initial $X_0$. For $t = 0, 1, \ldots$:
1. Variable selection: select a free set of $m \ll p^2$ variables.
2. Construct a partition by clustering.
3. Run block coordinate descent to find the descent direction $D_t = \arg\min_\Delta \bar{f}_{X_t}(X_t + \Delta)$ over the set of free variables.
4. Line search: use an Armijo-rule based step-size selection to get $\alpha$ such that $X_{t+1} = X_t + \alpha D_t$ is positive definite and satisfies the sufficient decrease condition $f(X_t + \alpha D_t) \le f(X_t) + \alpha\sigma\delta_t$ (positive definiteness checked via Schur complements with the conjugate gradient method).
32 Convergence Analysis
33 Convergence Guarantees for QUIC. Theorem (Hsieh, Sustik, Dhillon & Ravikumar, NIPS, 2011): QUIC converges quadratically to the global optimum; that is, for some constant $0 < \kappa < 1$:
$$\lim_{t \to \infty} \frac{\|X_{t+1} - X^*\|_F}{\|X_t - X^*\|_F^2} = \kappa.$$
Other convergence proofs for proximal Newton: (Lee et al., 2012) discuss convergence for general loss functions and regularizers; (Scheinberg et al., 2014) discuss the global convergence rate.
34-35 BigQUIC Convergence Analysis. Recall $W = X^{-1}$. When each $w_i$ is computed by CG ($X w_i = e_i$): the gradient $\nabla_{ij} g(X) = S_{ij} - W_{ij}$ on the free set can be computed once and stored in memory; the Hessian ($w_i^T D w_j$ in the coordinate updates) needs to be repeatedly computed. To reduce the time overhead, the Hessian should be computed approximately.
Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013): If $\nabla g(X)$ is computed exactly and $\underline{\eta} I \preceq \hat{H}_t \preceq \bar{\eta} I$ for some constants $\bar{\eta} \ge \underline{\eta} > 0$ at every Newton iteration, then BigQUIC converges to the global optimum.
36-37 BigQUIC Convergence Analysis. To obtain a super-linear convergence rate, we require the Hessian computation to become more and more accurate as $X_t$ approaches $X^*$. Measure the accuracy of the Hessian computation by $R = [r_1, \ldots, r_p]$, where $r_i = X w_i - e_i$. The optimality of $X_t$ can be measured by the minimum-norm subgradient
$$\nabla^S_{ij} f(X) = \begin{cases} \nabla_{ij} g(X) + \mathrm{sign}(X_{ij})\,\lambda & \text{if } X_{ij} \ne 0, \\ \mathrm{sign}(\nabla_{ij} g(X)) \max\big(|\nabla_{ij} g(X)| - \lambda, 0\big) & \text{if } X_{ij} = 0. \end{cases}$$
Theorem (Hsieh, Sustik, Dhillon, Ravikumar & Poldrack, 2013): In BigQUIC, if the residual matrix satisfies $\|R_t\| = O(\|\nabla^S f(X_t)\|^l)$ for some $0 < l \le 1$ as $t \to \infty$, then $\|X_{t+1} - X^*\| = O(\|X_t - X^*\|^{1+l})$ as $t \to \infty$. When $\|R_t\| = O(\|\nabla^S f(X_t)\|)$, BigQUIC has an asymptotic quadratic convergence rate.
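A sketch of the minimum-norm subgradient $\nabla^S f(X)$ used as the optimality measure, assuming $\nabla g(X) = S - X^{-1}$ is available:

```python
import numpy as np

def min_norm_subgrad(X, grad_g, lam):
    # elementwise: grad + sign(X)*lam where X != 0; soft-thresholded grad otherwise
    return np.where(
        X != 0,
        grad_g + np.sign(X) * lam,
        np.sign(grad_g) * np.maximum(np.abs(grad_g) - lam, 0.0),
    )
```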
38 Experimental results (scalability). Figure: BigQUIC can solve one-million-dimensional problems: (a) scalability on a random graph; (b) scalability on a chain graph.
39 Experimental results. BigQUIC is faster even for medium-size problems. Figure: comparison on fMRI data with a smaller subset (the maximum dimension that previous methods can handle).
40 Results on FMRI dataset. 228,483 voxels, 518 time points. $\lambda = 0.6$ ⇒ average degree 8; BigQUIC took 5 hours. $\lambda = 0.5$ ⇒ average degree 38; BigQUIC took 21 hours. Findings: voxels with large degree were generally found in the gray matter; meaningful brain modules can be detected by modularity clustering.
41 Quic & Dirty: A Quadratic Approximation Approach for Dirty Statistical Models
42 Dirty Statistical Models. Superposition-structured models or dirty models (Yang et al., 2013):
$$\min_{\{\theta^{(r)}\}_{r=1}^k} \ L\Big(\sum_r \theta^{(r)}\Big) + \sum_r \lambda_r \big\|\theta^{(r)}\big\|_{A_r} := F(\theta).$$
An example: GMRF with latent features (Chandrasekaran et al., 2012):
$$\min_{S, L :\, L \succeq 0,\, S - L \succ 0} \ -\log\det(S - L) + \langle S - L, \Sigma\rangle + \lambda_S \|S\|_1 + \lambda_L \|L\|_*.$$
When there are two groups of random variables $X, Y$, write $\Theta = [\Theta_{XX}, \Theta_{XY}; \Theta_{YX}, \Theta_{YY}]$. If we only observe $X$ and $Y$ is the set of hidden variables, the marginal precision matrix is $\Theta_{XX} - \Theta_{XY}\Theta_{YY}^{-1}\Theta_{YX}$, which therefore has a low rank + sparse structure. Error bound: (Meng et al., 2014). Other applications: robust PCA (low rank + sparse), multitask learning (sparse + group sparse), ...
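A sketch of the latent-variable GMRF objective above, assuming dense NumPy arrays with $L \succeq 0$ and $S - L \succ 0$ (for a PSD matrix, the nuclear norm equals the trace):

```python
import numpy as np

def latent_gmrf_objective(S, L, Sigma, lam_S, lam_L):
    R = S - L
    sign, logdet = np.linalg.slogdet(R)
    assert sign > 0, "S - L must be positive definite"
    return (-logdet + np.sum(R * Sigma)      # <S - L, Sigma>
            + lam_S * np.abs(S).sum()        # sparse penalty on S
            + lam_L * np.trace(L))           # nuclear norm of PSD L
```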
43 Quadratic Approximation. The Newton direction $d$:
$$[d^{(1)}, \ldots, d^{(k)}] = \arg\min_{\Delta^{(1)}, \ldots, \Delta^{(k)}} \ \bar{g}(\theta + \Delta) + \sum_{r=1}^k \lambda_r \big\|\theta^{(r)} + \Delta^{(r)}\big\|_{A_r},$$
where
$$\bar{g}(\theta + \Delta) = g(\theta) + \Big\langle \sum_r \Delta^{(r)}, G \Big\rangle + \frac{1}{2}\,\mathrm{vec}(\Delta)^T H\,\mathrm{vec}(\Delta)$$
and
$$H := \nabla^2 g(\theta) = \begin{bmatrix} \bar{H} & \cdots & \bar{H} \\ \vdots & \ddots & \vdots \\ \bar{H} & \cdots & \bar{H} \end{bmatrix}, \qquad \bar{H} := \nabla^2 L(\bar{\theta}).$$
44 Quadratic Approximation. A general active subspace selection for decomposable norms: $\ell_1$ norm corresponds to variable selection; nuclear norm to subspace selection; group sparse norm to group selection. Method for solving the quadratic approximation subproblem: alternating minimization. Convergence analysis: the Hessian is not positive definite, yet quadratic convergence still holds.
45 Experimental Comparisons. Figure: comparison on gene expression data (GMRF with latent variables): (a) ER dataset; (b) Leukemia dataset.
46 Conclusions. How to scale the proximal Newton method to high-dimensional problems: active variable selection (the complexity scales with the intrinsic dimensionality); block coordinate descent (handles problems where # parameters > memory size; Big & Quic); other techniques: divide-and-conquer, efficient line search... BigQUIC: a memory-efficient quadratic approximation method for sparse inverse covariance estimation (scales to 1 million by 1 million problems). With careful design, we can apply the proximal Newton method to models with superposition-structured regularization (sparse + low rank; sparse + group sparse; low rank + group sparse; ...).
47 References
[1] C. J. Hsieh, M. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. NIPS (oral presentation), 2013.
[2] C. J. Hsieh, M. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. NIPS, 2011.
[3] C. J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A divide-and-conquer procedure for sparse inverse covariance estimation. NIPS, 2012.
[4] P. A. Olsen, F. Oztoprak, J. Nocedal, and S. J. Rennie. Newton-like methods for sparse inverse covariance estimation. Optimization Online, 2012.
[5] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. JMLR, 2008.
[6] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.
[7] L. Li and K.-C. Toh. An inexact interior point method for l1-regularized sparse covariance selection. Mathematical Programming Computation, 2010.
[8] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. NIPS, 2010.
[9] K. Scheinberg and I. Rish. Learning sparse Gaussian Markov networks using a greedy coordinate ascent approach. Machine Learning and Knowledge Discovery in Databases, 2010.