Lasso: Algorithms and Extensions
1 ELE 538B: Sparsity, Structure and Inference
Lasso: Algorithms and Extensions
Yuxin Chen, Princeton University, Spring 2017
2 Outline
- Proximal operators
- Proximal gradient methods for lasso and its extensions
- Nesterov's accelerated algorithm
3 Proximal operators
4 Gradient descent
$$\operatorname*{minimize}_{\beta \in \mathbb{R}^p} \quad f(\beta)$$
where $f(\beta)$ is convex and differentiable.

Algorithm 4.1 (Gradient descent): for $t = 0, 1, \ldots$:
$$\beta^{t+1} = \beta^t - \mu_t \nabla f(\beta^t)$$
where $\mu_t$ is the step size / learning rate.
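As a concrete illustration, here is a minimal numpy sketch of Algorithm 4.1 applied to least squares, $f(\beta) = \frac{1}{2}\|X\beta - y\|^2$; the synthetic data, step size, and iteration count are illustrative assumptions, not taken from the slides.

```python
# Gradient descent for f(beta) = 0.5 * ||X beta - y||^2 on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X @ rng.standard_normal(20)

mu = 1.0 / np.linalg.norm(X, 2) ** 2   # step size 1/L, with L = ||X||_op^2
beta = np.zeros(20)
for t in range(500):
    grad = X.T @ (X @ beta - y)        # gradient of f at beta^t
    beta = beta - mu * grad            # beta^{t+1} = beta^t - mu_t * grad
```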
5 A proximal point of view of GD
$$\beta^{t+1} = \arg\min_{\beta} \Big\{ \underbrace{f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle}_{\text{linear approximation}} + \underbrace{\tfrac{1}{2\mu_t}\|\beta - \beta^t\|^2}_{\text{proximal term}} \Big\}$$
When $\mu_t$ is small, $\beta^{t+1}$ tends to stay close to $\beta^t$.
6 Proximal operator
If we define the proximal operator
$$\operatorname{prox}_h(b) := \arg\min_{\beta} \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$$
for any convex function $h$, then one can write
$$\beta^{t+1} = \operatorname{prox}_{\mu_t f_t}(\beta^t), \qquad \text{where } f_t(\beta) := f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle$$
7 Why consider proximal operators?
$$\operatorname{prox}_h(b) := \arg\min_{\beta} \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$$
- It is well defined under very general conditions (including nonsmooth convex functions)
- The operator can be evaluated efficiently for many widely used functions (in particular, regularizers)
- This abstraction is conceptually and mathematically simple, and covers many well-known optimization algorithms
8 Example: characteristic functions
If $h$ is the characteristic function of a convex set $C$,
$$h(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$$
then
$$\operatorname{prox}_h(b) = \arg\min_{\beta \in C} \|\beta - b\|^2 \qquad \text{(Euclidean projection onto } C\text{)}$$
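For instance, projection onto a box is just coordinate-wise clipping. A minimal sketch (the box bounds and test vector are illustrative):

```python
# prox of the characteristic function of the box C = [lo, hi]^p:
# Euclidean projection, computed coordinate-wise by clipping.
import numpy as np

def project_box(b, lo, hi):
    """Euclidean projection of b onto {beta : lo <= beta_i <= hi}."""
    return np.clip(b, lo, hi)

print(project_box(np.array([-2.0, 0.3, 5.0]), -1.0, 1.0))  # [-1.   0.3  1. ]
```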
9 Example: $\ell_1$ norm
If $h(\beta) = \|\beta\|_1$, then
$$\operatorname{prox}_{\lambda h}(b) = \psi_{\mathrm{st}}(b; \lambda)$$
where the soft-thresholding operator $\psi_{\mathrm{st}}(\cdot)$, given entry-wise by $(\psi_{\mathrm{st}}(b;\lambda))_i = \operatorname{sign}(b_i)\,(|b_i| - \lambda)_+$, is applied in an entry-wise manner.
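A sketch of entry-wise soft thresholding in numpy, following the formula above; the test vector is an arbitrary example:

```python
# Entry-wise soft thresholding: the prox of lambda * ||.||_1.
import numpy as np

def soft_threshold(b, lam):
    """psi_st(b; lam): shrink each entry toward 0 by lam, zeroing small ones."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

print(soft_threshold(np.array([3.0, -0.5, 1.2]), 1.0))  # [ 2.  -0.   0.2]
```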
10 Example: $\ell_2$ norm
$$\operatorname{prox}_h(b) := \arg\min_{\beta} \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$$
If $h(\beta) = \|\beta\|$, then
$$\operatorname{prox}_{\lambda h}(b) = \Big(1 - \frac{\lambda}{\|b\|}\Big)_+ b$$
where $a_+ := \max\{a, 0\}$. This is called block soft thresholding.
11 Example: log barrier
$$\operatorname{prox}_h(b) := \arg\min_{\beta} \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$$
If $h(\beta) = -\sum_{i=1}^p \log \beta_i$, then
$$\big(\operatorname{prox}_{\lambda h}(b)\big)_i = \frac{b_i + \sqrt{b_i^2 + 4\lambda}}{2}$$
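A quick numerical check of this closed form, with illustrative values of $b$ and $\lambda$: the prox must satisfy the stationarity condition $x - b - \lambda/x = 0$ of the one-dimensional objective $\frac{1}{2}(x-b)^2 - \lambda\log x$.

```python
# Verify the log-barrier prox formula on one coordinate via stationarity.
import numpy as np

b, lam = -1.5, 0.7                       # illustrative values
x = (b + np.sqrt(b**2 + 4 * lam)) / 2    # closed-form prox, entry-wise
print(np.isclose(x - b - lam / x, 0.0))  # True: stationarity condition holds
```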
12 Nonexpansiveness of proximal operators
Recall that when $h(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$, $\operatorname{prox}_h(\beta)$ is the Euclidean projection $P_C$ onto $C$, which is nonexpansive:
$$\|P_C(\beta_1) - P_C(\beta_2)\| \le \|\beta_1 - \beta_2\|$$
13 Nonexpansiveness of proximal operators
Nonexpansiveness is a property of the general $\operatorname{prox}_h(\cdot)$.

Fact 4.1 (Nonexpansiveness)
$$\|\operatorname{prox}_h(\beta_1) - \operatorname{prox}_h(\beta_2)\| \le \|\beta_1 - \beta_2\|$$
In some sense, the proximal operator behaves like a projection.
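Fact 4.1 is easy to sanity-check numerically, e.g., for the $\ell_1$ prox; the random inputs and threshold below are arbitrary assumptions:

```python
# Numerical sanity check of Fact 4.1 for the l1 prox on random pairs.
import numpy as np

def soft_threshold(b, lam):
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

rng = np.random.default_rng(1)
b1, b2 = rng.standard_normal(50), rng.standard_normal(50)
lhs = np.linalg.norm(soft_threshold(b1, 0.3) - soft_threshold(b2, 0.3))
print(lhs <= np.linalg.norm(b1 - b2))   # True: the prox never expands distances
```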
14 Proof of nonexpansiveness
Let $z_1 = \operatorname{prox}_h(\beta_1)$ and $z_2 = \operatorname{prox}_h(\beta_2)$. The subgradient characterizations of $z_1$ and $z_2$ read
$$\beta_1 - z_1 \in \partial h(z_1) \qquad \text{and} \qquad \beta_2 - z_2 \in \partial h(z_2)$$
The claim would follow (together with Cauchy-Schwarz) if
$$(\beta_1 - \beta_2)^\top (z_1 - z_2) \ge \|z_1 - z_2\|^2$$
To see this, the subgradient inequalities give
$$h(z_2) \ge h(z_1) + \langle \beta_1 - z_1,\, z_2 - z_1 \rangle \qquad \text{and} \qquad h(z_1) \ge h(z_2) + \langle \beta_2 - z_2,\, z_1 - z_2 \rangle$$
Adding the two yields $(\beta_1 - z_1 - \beta_2 + z_2)^\top (z_1 - z_2) \ge 0$, which rearranges to the desired inequality.
15 Proximal gradient methods
16 Optimizing composite functions
$$\text{(Lasso)} \qquad \operatorname*{minimize}_{\beta \in \mathbb{R}^p} \quad \underbrace{\tfrac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{:=g(\beta)} = f(\beta) + g(\beta)$$
where $f(\beta)$ is differentiable, and $g(\beta)$ is non-smooth.

Since $g(\beta)$ is non-differentiable, we cannot run vanilla gradient descent.
17 Proximal gradient methods
One strategy: replace $f(\beta)$ with its linear approximation, and compute the proximal solution
$$\beta^{t+1} = \arg\min_{\beta} \Big\{ f(\beta^t) + \langle \nabla f(\beta^t),\, \beta - \beta^t \rangle + g(\beta) + \tfrac{1}{2\mu_t}\|\beta - \beta^t\|^2 \Big\}$$
The optimality condition reads
$$0 \in \nabla f(\beta^t) + \partial g(\beta^{t+1}) + \tfrac{1}{\mu_t}\big(\beta^{t+1} - \beta^t\big)$$
which is equivalent to the optimality condition of
$$\beta^{t+1} = \arg\min_{\beta} \Big\{ g(\beta) + \tfrac{1}{2\mu_t}\big\|\beta - \big(\beta^t - \mu_t \nabla f(\beta^t)\big)\big\|^2 \Big\} = \operatorname{prox}_{\mu_t g}\big(\beta^t - \mu_t \nabla f(\beta^t)\big)$$
18 Proximal gradient methods
Alternate between gradient updates on $f$ and proximal minimization on $g$.

Algorithm 4.2 (Proximal gradient methods): for $t = 0, 1, \ldots$:
$$\beta^{t+1} = \operatorname{prox}_{\mu_t g}\big(\beta^t - \mu_t \nabla f(\beta^t)\big)$$
where $\mu_t$ is the step size / learning rate.
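A generic sketch of Algorithm 4.2; `grad_f` and `prox_g` are assumed user-supplied callables, with `prox_g` receiving the step size so that scaled penalties such as $\mu_t\lambda$ can be applied inside it:

```python
# Generic proximal gradient loop: the prox is passed in as a function,
# so the same loop serves lasso, group lasso, elastic net, etc.
import numpy as np

def proximal_gradient(grad_f, prox_g, beta0, mu, n_iter=500):
    """Iterate beta^{t+1} = prox_{mu*g}(beta^t - mu * grad_f(beta^t))."""
    beta = beta0.copy()
    for _ in range(n_iter):
        beta = prox_g(beta - mu * grad_f(beta), mu)
    return beta
```

For lasso, one would pass `grad_f = lambda b: X.T @ (X @ b - y)` and `prox_g = lambda z, mu: soft_threshold(z, mu * lam)`, using the soft-thresholding function sketched earlier.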
19 Projected gradient methods
When $g(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$ is the characteristic function of a convex set $C$:
$$\beta^{t+1} = P_C\big(\beta^t - \mu_t \nabla f(\beta^t)\big) := \arg\min_{\beta \in C} \big\|\beta - \big(\beta^t - \mu_t \nabla f(\beta^t)\big)\big\|$$
This is a first-order method to solve the constrained optimization
$$\operatorname*{minimize}_{\beta} \quad f(\beta) \qquad \text{s.t. } \beta \in C$$
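For example, with $C = \{\beta : \beta \ge 0\}$ the projection is a coordinate-wise maximum with zero; a sketch for nonnegative least squares on synthetic data (all names and sizes are illustrative):

```python
# Projected gradient for nonnegative least squares: C = {beta : beta >= 0},
# so P_C is coordinate-wise max with 0.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((80, 15))
y = X @ np.abs(rng.standard_normal(15))

mu = 1.0 / np.linalg.norm(X, 2) ** 2
beta = np.zeros(15)
for _ in range(1000):
    beta = np.maximum(beta - mu * X.T @ (X @ beta - y), 0.0)  # gradient step, then P_C
```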
20 Proximal gradient methods for lasso
For lasso: $f(\beta) = \tfrac{1}{2}\|X\beta - y\|^2$ and $g(\beta) = \lambda\|\beta\|_1$, so
$$\operatorname{prox}_g(\beta) = \arg\min_{b} \Big\{ \tfrac{1}{2}\|\beta - b\|^2 + \lambda\|b\|_1 \Big\} = \psi_{\mathrm{st}}(\beta; \lambda)$$
$$\Longrightarrow \quad \beta^{t+1} = \psi_{\mathrm{st}}\big(\beta^t - \mu_t X^\top(X\beta^t - y);\ \mu_t\lambda\big) \qquad \text{(iterative soft thresholding)}$$
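A minimal sketch of this iteration (often called ISTA); the fixed step size $1/L$ with $L = \|X\|_{\mathrm{op}}^2$ and the iteration count are assumptions:

```python
# Iterative soft thresholding for lasso: prox gradient on f + g above.
import numpy as np

def ista(X, y, lam, n_iter=1000):
    mu = 1.0 / np.linalg.norm(X, 2) ** 2           # fixed step size 1/L
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = beta - mu * X.T @ (X @ beta - y)        # gradient step on f
        beta = np.sign(z) * np.maximum(np.abs(z) - mu * lam, 0.0)  # psi_st(z; mu*lam)
    return beta
```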
21 Proximal gradient methods for group lasso
Sometimes variables have a natural group structure, and it is desirable to set all variables within a group to be zero (or nonzero) simultaneously:
$$\text{(group lasso)} \qquad \underbrace{\tfrac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \underbrace{\lambda \sum_{j=1}^k \|\beta_j\|}_{:=g(\beta)}$$
where $\beta_j \in \mathbb{R}^{p/k}$ and $\beta = [\beta_1^\top, \ldots, \beta_k^\top]^\top$. Here
$$\operatorname{prox}_g(\beta) = \psi_{\mathrm{bst}}(\beta; \lambda) := \Big[\Big(1 - \frac{\lambda}{\|\beta_j\|}\Big)_+ \beta_j\Big]_{1 \le j \le k}$$
$$\Longrightarrow \quad \beta^{t+1} = \psi_{\mathrm{bst}}\big(\beta^t - \mu_t X^\top(X\beta^t - y);\ \mu_t\lambda\big)$$
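A sketch of the group-lasso prox; encoding the partition as a list of index arrays (`groups`) is an illustrative choice, not from the slides:

```python
# Block soft thresholding: scale each block by (1 - lam/||b_j||)_+,
# zeroing small blocks entirely.
import numpy as np

def block_soft_threshold(b, lam, groups):
    """groups: list of index arrays partitioning the coordinates of b."""
    out = b.copy()
    for idx in groups:
        norm = np.linalg.norm(b[idx])
        out[idx] = 0.0 if norm <= lam else (1 - lam / norm) * b[idx]
    return out
```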
22 Proximal gradient methods for elastic net
Lasso does not handle highly correlated variables well: if there is a group of highly correlated variables, lasso often picks one from the group and ignores the rest. Sometimes we make a compromise between the lasso and $\ell_2$ penalties:
$$\text{(elastic net)} \qquad \underbrace{\tfrac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \lambda\big(\underbrace{\|\beta\|_1 + (\gamma/2)\|\beta\|_2^2}_{:=g(\beta)}\big)$$
Here
$$\operatorname{prox}_{\lambda g}(\beta) = \frac{1}{1 + \lambda\gamma}\,\psi_{\mathrm{st}}(\beta; \lambda)$$
$$\Longrightarrow \quad \beta^{t+1} = \frac{1}{1 + \mu_t\lambda\gamma}\,\psi_{\mathrm{st}}\big(\beta^t - \mu_t X^\top(X\beta^t - y);\ \mu_t\lambda\big)$$
i.e., soft thresholding followed by multiplicative shrinkage.
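A sketch of one elastic-net proximal gradient step, following the update above; all argument names are illustrative:

```python
# One elastic-net update: gradient step, soft thresholding,
# then multiplicative shrinkage by 1/(1 + mu*lam*gamma).
import numpy as np

def elastic_net_step(beta, X, y, lam, gamma, mu):
    z = beta - mu * X.T @ (X @ beta - y)                    # gradient step on f
    z = np.sign(z) * np.maximum(np.abs(z) - mu * lam, 0.0)  # psi_st(z; mu*lam)
    return z / (1.0 + mu * lam * gamma)                     # multiplicative shrinkage
```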
23 Interpretation: majorization-minimization
$$f_{\mu_t}(\beta, \beta^t) := f(\beta^t) + \underbrace{\langle \nabla f(\beta^t),\, \beta - \beta^t \rangle}_{\text{linearization}} + \underbrace{\tfrac{1}{2\mu_t}\|\beta - \beta^t\|^2}_{\text{trust region penalty}}$$
majorizes $f(\beta)$ if $0 < \mu_t < 1/L$, where $L$ is the Lipschitz constant¹ of $\nabla f(\cdot)$.

Proximal gradient descent is a majorization-minimization algorithm:
$$\beta^{t+1} = \arg\min_{\beta} \big\{ \underbrace{f_{\mu_t}(\beta, \beta^t)}_{\text{majorization}} + g(\beta) \big\}$$

¹ This means $\|\nabla f(\beta) - \nabla f(b)\| \le L\|\beta - b\|$ for all $\beta$ and $b$.
24 Convergence rate of proximal gradient methods
Theorem 4.2 (fixed step size; Nesterov '07). Suppose $g$ is convex, and $f$ is differentiable and convex whose gradient has Lipschitz constant $L$. If $\mu_t \equiv \mu \in (0, 1/L)$, then
$$f(\beta^t) + g(\beta^t) - \min_{\beta}\big\{ f(\beta) + g(\beta) \big\} \le O\Big(\frac{1}{t}\Big)$$
- The step size requires an upper bound on $L$
- One may prefer backtracking line search to a fixed step size
- Question: can we further improve the convergence rate?
25 Nesterov's accelerated gradient methods
26 Nesterov's accelerated method
Problem of gradient descent: zigzagging.
Nesterov's idea: include a momentum term to avoid overshooting.
27 Nesterov's accelerated method
Nesterov's idea: include a momentum term to avoid overshooting:
$$\beta^t = \operatorname{prox}_{\mu_t g}\big(b^{t-1} - \mu_t \nabla f(b^{t-1})\big)$$
$$b^t = \beta^t + \underbrace{\alpha_t\big(\beta^t - \beta^{t-1}\big)}_{\text{momentum term}} \qquad \text{(extrapolation)}$$
- A simple (but mysterious) choice of extrapolation parameter: $\alpha_t = \frac{t-1}{t+2}$
- Fixed step size $\mu_t \equiv \mu \in (0, 1/L)$ or backtracking line search
- Same computational cost per iteration as proximal gradient
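A minimal sketch of this accelerated iteration for lasso (the FISTA-style scheme), combining the prox step with the extrapolation above; the step size and iteration count are assumptions:

```python
# Accelerated proximal gradient for lasso with alpha_t = (t-1)/(t+2).
import numpy as np

def fista(X, y, lam, n_iter=1000):
    mu = 1.0 / np.linalg.norm(X, 2) ** 2         # fixed step size 1/L
    beta = np.zeros(X.shape[1])                   # beta^0
    b = beta.copy()                               # extrapolated point b^0
    for t in range(1, n_iter + 1):
        z = b - mu * X.T @ (X @ b - y)            # gradient step at b^{t-1}
        beta_new = np.sign(z) * np.maximum(np.abs(z) - mu * lam, 0.0)
        b = beta_new + (t - 1) / (t + 2) * (beta_new - beta)  # momentum extrapolation
        beta = beta_new
    return beta
```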
28 Convergence rate of Nesterov's accelerated method
Theorem 4.3 (Nesterov '83, Nesterov '07). Suppose $f$ is differentiable and convex and $g$ is convex. If one takes $\alpha_t = \frac{t-1}{t+2}$ and a fixed step size $\mu_t \equiv \mu \in (0, 1/L)$, then
$$f(\beta^t) + g(\beta^t) - \min_{\beta}\big\{ f(\beta) + g(\beta) \big\} \le O\Big(\frac{1}{t^2}\Big)$$
In general, this rate cannot be improved if one only uses gradient information!
29 Numerical experiments (for lasso)
[Figure: convergence comparison; figure credit: Hastie, Tibshirani, & Wainwright '15]
30 Reference
[1] "Proximal algorithms," N. Parikh and S. Boyd, Foundations and Trends in Optimization.
[2] "Convex optimization algorithms," D. Bertsekas, Athena Scientific.
[3] "Convex optimization: algorithms and complexity," S. Bubeck, Foundations and Trends in Machine Learning.
[4] "Statistical learning with sparsity: the Lasso and generalizations," T. Hastie, R. Tibshirani, and M. Wainwright.
[5] "Model selection and estimation in regression with grouped variables," M. Yuan and Y. Lin, Journal of the Royal Statistical Society.
[6] "A method of solving a convex programming problem with convergence rate O(1/k^2)," Y. Nesterov, Soviet Mathematics Doklady.
31 Reference
[7] "Gradient methods for minimizing composite functions," Y. Nesterov, Technical Report.
[8] "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences.