ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
1 ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
Sparse Recovery using L1 Minimization: Algorithms
Yuejie Chi, Department of Electrical and Computer Engineering, Spring 2018
2 Outline
- Lasso with orthogonal design
- Proximal operators
- Proximal gradient methods for lasso and its extensions
- Nesterov's accelerated algorithm (FISTA)
These tools are useful in general for composite optimization problems.
3 Sparse recovery by l0 regularization
As a warm-up, consider a sparsifying basis $X \in \mathbb{R}^{n \times n}$ that is orthonormal. We wish to solve the following sparsity-promoting problem regularized by the l0 norm:

$$\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2}\|y - X\beta\|^2 + \lambda \|\beta\|_0 \qquad (4.1)$$

where the second term penalizes the sparsity level.
- The first term is the approximation error: $\|y - X\hat{\beta}\|^2$.
- The second term is the model complexity: $\|\hat{\beta}\|_0$.
We will discuss regularized algorithms throughout this lecture.
4 Orthogonal design
Since X is orthonormal, $\|y - X\beta\|^2 = \|X^\top y - \beta\|^2$. Without loss of generality, suppose X = I; then (4.1) reduces to

$$\hat{\beta} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left[ \frac{1}{2}(y_i - \beta_i)^2 + \lambda \, \mathbb{1}\{\beta_i \neq 0\} \right]$$

Solving this problem gives hard thresholding:

$$\hat{\beta}_i = \begin{cases} 0, & |y_i| \le \sqrt{2\lambda} \\ y_i, & |y_i| > \sqrt{2\lambda} \end{cases}$$

Keep large coefficients; discard small coefficients.
5 The case X = I
$$\hat{\beta}_i = \psi_{\mathrm{ht}}(y_i; \sqrt{2\lambda}) := \begin{cases} 0, & |y_i| \le \sqrt{2\lambda} \\ y_i, & |y_i| > \sqrt{2\lambda} \end{cases} \qquad \text{(hard thresholding)}$$

Hard thresholding preserves data outside the threshold zone.
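As an illustration, here is a minimal NumPy sketch of the hard-thresholding rule above for the case X = I; the function name and the vectorized form are my own, but the threshold $\sqrt{2\lambda}$ follows the formula on the slide.

```python
import numpy as np

def hard_threshold(y, lam):
    """Entrywise hard thresholding: keep y_i when |y_i| > sqrt(2*lam), set it to 0 otherwise."""
    return np.where(np.abs(y) > np.sqrt(2.0 * lam), y, 0.0)

# example with lam = 0.5 (threshold 1): only the two large entries survive
# hard_threshold(np.array([0.1, -3.0, 0.4, 2.5]), 0.5) -> array([ 0. , -3. ,  0. ,  2.5])
```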
6 Convex relaxation: Lasso (Tibshirani '96)
Lasso (least absolute shrinkage and selection operator):

$$\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2}\|y - X\beta\|^2 + \lambda \|\beta\|_1 \qquad (4.2)$$

for some regularization parameter λ > 0. It is equivalent to

$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 \quad \text{s.t.} \quad \|\beta\|_1 \le t$$

for some t that depends on λ (no explicit formula).
- This is a quadratic program (QP) with convex constraints.
- λ controls model complexity: larger λ restricts the parameters more; smaller λ frees up more parameters.
Also related to basis pursuit:

$$\hat{\beta} = \arg\min_{\beta} \|\beta\|_1 \quad \text{s.t.} \quad \|y - X\beta\| \le \epsilon$$
7 Lasso vs. ridge regression
[Figure comparing the lasso constraint set $\|\beta\|_1 \le t$ with the ridge constraint set $\|\beta\|_2 \le t$, both minimizing $\|y - X\beta\|^2$; $\hat{\beta}$ marks the least squares solution. Fig. credit: Hastie, Tibshirani, & Wainwright]
8 A Bayesian interpretation
Orthogonal design: $y = \beta + \eta$ with $\eta \sim \mathcal{N}(0, \sigma^2 I)$. Impose an i.i.d. prior on the $\beta_i$ to encourage sparsity (a Gaussian is not a good choice):

$$\text{(Laplacian prior)} \qquad P(\beta_i = z) = \frac{\lambda}{2} e^{-\lambda |z|}$$

The tail decays exponentially.
9 A Bayesian interpretation of Lasso
Posterior of β:

$$P(\beta \mid y) \propto P(y \mid \beta) P(\beta) \propto \prod_{i=1}^{n} e^{-\frac{(y_i - \beta_i)^2}{2\sigma^2}} \cdot \frac{\lambda}{2} e^{-\lambda |\beta_i|} \propto \exp\left\{ -\sum_{i=1}^{n} \left[ \frac{(y_i - \beta_i)^2}{2\sigma^2} + \lambda |\beta_i| \right] \right\}$$

⟹ maximum a posteriori (MAP) estimator:

$$\arg\min_{\beta} \; \sum_{i=1}^{n} \left[ \frac{(y_i - \beta_i)^2}{2\sigma^2} + \lambda |\beta_i| \right] \qquad \text{(Lasso)}$$

Implication: Lasso is the MAP estimator under a Laplacian prior.
10 Example: orthogonal design
Suppose X = I; then Lasso reduces to

$$\hat{\beta} = \arg\min_{\beta} \; \sum_{i=1}^{n} \left[ \frac{1}{2}(y_i - \beta_i)^2 + \lambda |\beta_i| \right]$$

The Lasso estimate $\hat{\beta}$ is then given by soft thresholding:

$$\hat{\beta}_i = \begin{cases} y_i - \lambda, & y_i \ge \lambda \\ y_i + \lambda, & y_i \le -\lambda \\ 0, & \text{else} \end{cases}$$
11 Example: orthogonal design
$$\hat{\beta}_i = \psi_{\mathrm{st}}(y_i; \lambda) = \begin{cases} y_i - \lambda, & y_i \ge \lambda \\ y_i + \lambda, & y_i \le -\lambda \\ 0, & \text{else} \end{cases} \qquad \text{(soft thresholding)}$$

Soft thresholding shrinks data towards 0 outside the threshold zone.
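A matching NumPy sketch of the soft-thresholding operator $\psi_{\mathrm{st}}$; the function name is my own, and the sign/maximum form is an equivalent way of writing the three-case formula above.

```python
import numpy as np

def soft_threshold(y, lam):
    """Entrywise soft thresholding: shrink each entry towards 0 by lam, zeroing small entries."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

# example with lam = 1: soft_threshold(np.array([0.5, -3.0, 2.0]), 1.0) -> array([ 0., -2.,  1.])
```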
12 Optimality condition for convex functions
For any convex function f(β), $\beta^\star$ is an optimal solution iff $0 \in \partial f(\beta^\star)$, where $\partial f(\beta)$ is the set of all subgradients at β.
The subgradient of $f(\beta) = \frac{1}{2}(y - \beta)^2 + \lambda|\beta|$ can be written as $g = \beta - y + \lambda s$, where s is a subgradient of $|\beta|$:

$$\begin{cases} s = \mathrm{sign}(\beta), & \text{if } \beta \neq 0 \\ s \in [-1, 1], & \text{if } \beta = 0 \end{cases} \qquad (4.3)$$

We see that $\hat{\beta} = \psi_{\mathrm{st}}(y; \lambda)$ by checking the optimality condition in two cases:
- If $|y| \le \lambda$, taking $\beta = 0$ and $s = y/\lambda$ gives $g = 0$.
- If $|y| > \lambda$, taking $\beta = y - \mathrm{sign}(y)\lambda$ gives $g = 0$.
13 Solving LASSO in general cases
14 Composite optimization problems
General composite optimization problem:

$$\hat{\beta} = \arg\min_{\beta} \; \{ F(\beta) = f(\beta) + g(\beta) \}$$

- f(β) is convex and differentiable, e.g. the approximation error.
- g(β) is convex, possibly non-differentiable, e.g. regularizers.
Examples:
- LASSO: $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$ and $g(\beta) = \lambda\|\beta\|_1$.
- Matrix completion: $f(X) = \|\mathcal{P}_\Omega(Y - X)\|_F^2$ and $g(X) = \lambda\|X\|_*$, where $\|X\|_*$ is the nuclear norm.
15 Motivation
Standard methods (e.g. subgradient methods) for solving composite optimization problems have very slow convergence rates. We will discuss accelerated proximal gradient methods, which
- are iterative and have low computational cost (first-order algorithms, requiring the computation of a single gradient per iteration);
- achieve the accelerated O(1/t²) convergence rate;
- perform well in practice and work for a large class of problems.
16 Proximal gradient methods
17 Gradient descent
$$\text{minimize}_{\beta \in \mathbb{R}^p} \; f(\beta)$$
where f(β) is convex and smooth (differentiable).

Algorithm 4.1 Gradient descent
for t = 0, 1, ...:
$$\beta^{t+1} = \beta^t - \mu_t \nabla f(\beta^t)$$
where $\mu_t$ is the step size / learning rate.
18 A proximal point of view of GD
$$\beta^{t+1} = \arg\min_{\beta} \Big\{ \underbrace{f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle}_{\text{linear approximation}} + \underbrace{\tfrac{1}{2\mu_t}\|\beta - \beta^t\|^2}_{\text{proximal term}} \Big\}$$

When $\mu_t$ is small, $\beta^{t+1}$ tends to stay close to $\beta^t$.
19 Proximal operator
If we define the proximal operator

$$\mathrm{prox}_h(b) := \arg\min_{\beta} \Big\{ \frac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$$

for any convex function h, then one can write

$$\beta^{t+1} = \mathrm{prox}_{\mu_t f_t}(\beta^t), \qquad \text{where } f_t(\beta) := f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle$$

Gradient descent thus performs a proximal mapping at every iteration.
20 Why consider proximal operators?
$$\mathrm{prox}_h(b) := \arg\min_{\beta} \Big\{ \frac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$$
- It is well-defined under very general conditions (including nonsmooth convex functions).
- The operator can be evaluated efficiently for many widely used functions (in particular, regularizers).
- This abstraction is conceptually and mathematically simple, and covers many well-known optimization algorithms.
21 Example: characteristic functions
If h is the characteristic function of a set C,

$$h(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$$

then

$$\mathrm{prox}_h(b) = \arg\min_{\beta \in C} \|\beta - b\|^2 \qquad \text{(Euclidean projection onto } C\text{)}$$
22 Example: l1 norm
If $h(\beta) = \|\beta\|_1$, then

$$\mathrm{prox}_{\lambda h}(b) = \psi_{\mathrm{st}}(b; \lambda)$$

where the soft thresholding $\psi_{\mathrm{st}}(\cdot)$ is applied in an entrywise manner.
23 Example: l2 norm
$$\mathrm{prox}_h(b) := \arg\min_{\beta} \Big\{ \frac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$$
If $h(\beta) = \|\beta\|$, then

$$\mathrm{prox}_{\lambda h}(b) = \Big(1 - \frac{\lambda}{\|b\|}\Big)_+ b$$

where $a_+ := \max\{a, 0\}$. This is called block soft thresholding.
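A small sketch of block soft thresholding, which reappears below in the group-lasso update; it assumes b is a single block (vector) and lam >= 0, and the function name is my own.

```python
import numpy as np

def block_soft_threshold(b, lam):
    """prox of lam * ||.||_2: scale b by (1 - lam/||b||)_+, i.e. return 0 when ||b|| <= lam."""
    norm_b = np.linalg.norm(b)
    if norm_b <= lam:
        return np.zeros_like(b)
    return (1.0 - lam / norm_b) * b
```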
24 Example: log barrier
$$\mathrm{prox}_h(b) := \arg\min_{\beta} \Big\{ \frac{1}{2}\|\beta - b\|^2 + h(\beta) \Big\}$$
If $h(\beta) = -\sum_{i=1}^{p} \log \beta_i$, then

$$\big(\mathrm{prox}_{\lambda h}(b)\big)_i = \frac{b_i + \sqrt{b_i^2 + 4\lambda}}{2}$$
25 Nonexpansiveness of proximal operators
Recall that when $h(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$, $\mathrm{prox}_h(\beta)$ is the Euclidean projection $\mathcal{P}_C$ onto C, which is nonexpansive:

$$\|\mathcal{P}_C(\beta_1) - \mathcal{P}_C(\beta_2)\| \le \|\beta_1 - \beta_2\|$$
26 Nonexpansiveness of proximal operators
Nonexpansiveness is a property of the general $\mathrm{prox}_h(\cdot)$:

Fact 4.1 (Nonexpansiveness)
$$\|\mathrm{prox}_h(\beta_1) - \mathrm{prox}_h(\beta_2)\| \le \|\beta_1 - \beta_2\|$$

In some sense, the proximal operator behaves like a projection.
27 Proof of nonexpansiveness
Let $z_1 = \mathrm{prox}_h(\beta_1)$ and $z_2 = \mathrm{prox}_h(\beta_2)$. The subgradient characterizations of $z_1$ and $z_2$ read

$$\beta_1 - z_1 \in \partial h(z_1) \qquad \text{and} \qquad \beta_2 - z_2 \in \partial h(z_2)$$

The claim would follow (together with Cauchy-Schwarz) if

$$(\beta_1 - \beta_2)^\top (z_1 - z_2) \ge \|z_1 - z_2\|^2 \iff (\beta_1 - z_1 - \beta_2 + z_2)^\top (z_1 - z_2) \ge 0$$

which holds since, by the subgradient inequalities,

$$h(z_2) \ge h(z_1) + \langle \beta_1 - z_1, z_2 - z_1 \rangle \qquad \text{and} \qquad h(z_1) \ge h(z_2) + \langle \beta_2 - z_2, z_1 - z_2 \rangle$$

Adding the two inequalities gives the claim.
28 Proximal gradient methods
29 Optimizing composite functions
$$\text{(Lasso)} \qquad \text{minimize}_{\beta \in \mathbb{R}^p} \; \underbrace{\frac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \underbrace{\lambda\|\beta\|_1}_{:=g(\beta)} = f(\beta) + g(\beta)$$
where f(β) is differentiable and g(β) is non-smooth.
Since g(β) is non-differentiable, we cannot run vanilla gradient descent.
30 Proximal gradient methods
One strategy: replace f(β) with its linear approximation, and compute the proximal solution

$$\beta^{t+1} = \arg\min_{\beta} \Big\{ f(\beta^t) + \langle \nabla f(\beta^t), \beta - \beta^t \rangle + g(\beta) + \frac{1}{2\mu_t}\|\beta - \beta^t\|^2 \Big\}$$

The optimality condition reads

$$0 \in \nabla f(\beta^t) + \partial g(\beta^{t+1}) + \frac{1}{\mu_t}\big(\beta^{t+1} - \beta^t\big)$$

which is equivalent to the optimality condition of

$$\beta^{t+1} = \arg\min_{\beta} \Big\{ g(\beta) + \frac{1}{2\mu_t}\big\|\beta - \big(\beta^t - \mu_t \nabla f(\beta^t)\big)\big\|^2 \Big\} = \mathrm{prox}_{\mu_t g}\big(\beta^t - \mu_t \nabla f(\beta^t)\big)$$
31 Proximal gradient methods
Alternate between gradient updates on f and proximal mappings on g.

Algorithm 4.2 Proximal gradient methods
for t = 0, 1, ...:
$$\beta^{t+1} = \mathrm{prox}_{\mu_t g}\big(\beta^t - \mu_t \nabla f(\beta^t)\big)$$
where $\mu_t$ is the step size / learning rate.
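A generic sketch of Algorithm 4.2 with a fixed step size; `grad_f` and `prox_g` are assumed to be supplied by the caller (the prox here takes the current point and the step size so it can evaluate $\mathrm{prox}_{\mu g}$), and the fixed iteration count is a placeholder rather than a principled stopping rule.

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, beta0, mu, n_iters=500):
    """Proximal gradient method: a gradient step on f followed by a proximal step on g."""
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iters):
        z = beta - mu * grad_f(beta)   # gradient step on the smooth part f
        beta = prox_g(z, mu)           # proximal step, i.e. prox_{mu * g}(z)
    return beta
```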
32 Projected gradient methods
When $g(\beta) = \begin{cases} 0, & \text{if } \beta \in C \\ \infty, & \text{else} \end{cases}$ (with C convex) is a characteristic function:

$$\beta^{t+1} = \mathcal{P}_C\big(\beta^t - \mu_t \nabla f(\beta^t)\big) := \arg\min_{\beta \in C} \big\|\beta - \big(\beta^t - \mu_t \nabla f(\beta^t)\big)\big\|$$

This is a first-order method to solve the constrained optimization problem

$$\text{minimize}_{\beta} \; f(\beta) \quad \text{s.t.} \quad \beta \in C$$
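As a concrete instance of this projected-gradient special case, here is a sketch for least squares over the nonnegative orthant; the constraint set C is my own choice for illustration, and its Euclidean projection is simply clipping at zero.

```python
import numpy as np

def nonnegative_least_squares(X, y, mu, n_iters=500):
    """Projected gradient for min 0.5*||X beta - y||^2 subject to beta >= 0."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y)
        beta = np.maximum(beta - mu * grad, 0.0)   # Euclidean projection onto the nonnegative orthant
    return beta
```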
33 Proximal gradient methods for lasso
For lasso: $f(\beta) = \frac{1}{2}\|X\beta - y\|^2$ and $g(\beta) = \lambda\|\beta\|_1$, so

$$\mathrm{prox}_g(\beta) = \arg\min_{b} \Big\{ \frac{1}{2}\|\beta - b\|^2 + \lambda\|b\|_1 \Big\} = \psi_{\mathrm{st}}(\beta; \lambda)$$

$$\Longrightarrow \quad \beta^{t+1} = \psi_{\mathrm{st}}\big(\beta^t - \mu_t X^\top(X\beta^t - y); \; \mu_t\lambda\big) \qquad \text{(iterative soft thresholding)}$$
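Putting the pieces together, a minimal iterative soft thresholding (ISTA) sketch for the lasso; the step size mu is assumed to satisfy mu < 1/L with L = sigma_max(X^T X), and the fixed iteration count again stands in for a proper convergence check.

```python
import numpy as np

def ista(X, y, lam, mu, n_iters=1000):
    """Proximal gradient (iterative soft thresholding) for the lasso."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        z = beta - mu * (X.T @ (X @ beta - y))                      # gradient step on f
        beta = np.sign(z) * np.maximum(np.abs(z) - mu * lam, 0.0)   # prox of mu*lam*||.||_1
    return beta
```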
34 Proximal gradient methods for group lasso
Sometimes variables have a natural group structure, and it is desirable to set all variables within a group to zero (or nonzero) simultaneously:

$$\text{(group lasso)} \qquad \underbrace{\frac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \underbrace{\lambda \sum_{j=1}^{k} \|\beta_j\|}_{:=g(\beta)}$$

where $\beta_j \in \mathbb{R}^{p/k}$ and $\beta = [\beta_1^\top, \ldots, \beta_k^\top]^\top$. Then

$$\mathrm{prox}_g(\beta) = \psi_{\mathrm{bst}}(\beta; \lambda) := \left[ \Big(1 - \frac{\lambda}{\|\beta_j\|}\Big)_+ \beta_j \right]_{1 \le j \le k}$$

$$\Longrightarrow \quad \beta^{t+1} = \psi_{\mathrm{bst}}\big(\beta^t - \mu_t X^\top(X\beta^t - y); \; \mu_t\lambda\big)$$
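A sketch of this group-lasso update, applying the block soft thresholding from the l2-norm example groupwise; `groups` is assumed to be a list of index arrays partitioning the coordinates, which is my own interface choice.

```python
import numpy as np

def group_soft_threshold(b, lam, groups):
    """Apply block soft thresholding to each group of coordinates."""
    out = np.zeros_like(b)
    for idx in groups:
        norm_g = np.linalg.norm(b[idx])
        if norm_g > lam:
            out[idx] = (1.0 - lam / norm_g) * b[idx]
    return out

def group_lasso(X, y, lam, mu, groups, n_iters=1000):
    """Proximal gradient for the group lasso with a fixed step size mu."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        z = beta - mu * (X.T @ (X @ beta - y))
        beta = group_soft_threshold(z, mu * lam, groups)
    return beta
```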
35 Proximal gradient methods for elastic net
Lasso does not handle highly correlated variables well: if there is a group of highly correlated variables, lasso often picks one from the group and ignores the rest. Sometimes we make a compromise between the lasso and l2 penalties:

$$\text{(elastic net)} \qquad \underbrace{\frac{1}{2}\|X\beta - y\|^2}_{:=f(\beta)} + \lambda\underbrace{\big\{\|\beta\|_1 + (\gamma/2)\|\beta\|_2^2\big\}}_{:=g(\beta)}$$

$$\mathrm{prox}_{\lambda g}(\beta) = \frac{1}{1 + \lambda\gamma}\,\psi_{\mathrm{st}}(\beta; \lambda)$$

$$\Longrightarrow \quad \beta^{t+1} = \frac{1}{1 + \mu_t\lambda\gamma}\,\psi_{\mathrm{st}}\big(\beta^t - \mu_t X^\top(X\beta^t - y); \; \mu_t\lambda\big)$$

soft thresholding followed by multiplicative shrinkage
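The elastic-net update only changes the proximal step: soft thresholding followed by multiplicative shrinkage, exactly as in the formula above. A sketch under the same assumptions as the ISTA code:

```python
import numpy as np

def elastic_net(X, y, lam, gamma, mu, n_iters=1000):
    """Proximal gradient for the elastic net."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        z = beta - mu * (X.T @ (X @ beta - y))
        st = np.sign(z) * np.maximum(np.abs(z) - mu * lam, 0.0)   # soft thresholding
        beta = st / (1.0 + mu * lam * gamma)                      # multiplicative shrinkage
    return beta
```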
36 Interpretation: majorization-minimization
$$f_{\mu_t}(\beta, \beta^t) := f(\beta^t) + \underbrace{\langle \nabla f(\beta^t), \beta - \beta^t \rangle}_{\text{linearization}} + \underbrace{\frac{1}{2\mu_t}\|\beta - \beta^t\|^2}_{\text{trust region penalty}}$$

majorizes f(β) if $0 < \mu_t < \frac{1}{L}$, where L is the Lipschitz constant¹ of $\nabla f(\cdot)$.
Proximal gradient descent is a majorization-minimization algorithm:

$$\beta^{t+1} = \arg\min_{\beta} \big\{ \underbrace{f_{\mu_t}(\beta, \beta^t) + g(\beta)}_{\text{majorization}} \big\} \qquad \text{(minimization)}$$

¹ This means $\|\nabla f(\beta) - \nabla f(b)\| \le L\|\beta - b\|$ for all β and b.
37 Assumptions for convergence
- $g: \mathbb{R}^n \to \mathbb{R}$ is a continuous convex function, possibly nonsmooth;
- $f: \mathbb{R}^n \to \mathbb{R}$ is a smooth convex function that is continuously differentiable with Lipschitz gradient:
$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^n,$$
which implies
$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|_2^2, \quad \forall x, y.$$
38 Convergence rate of proximal gradient methods
Theorem 4.2 (fixed step size; Nesterov '07)
Suppose g is convex, and f is differentiable and convex with an L-Lipschitz gradient. If $\mu_t \equiv \mu \in (0, 1/L)$, then

$$F(\beta^t) - F(\hat{\beta}) \le O\Big(\frac{\|\beta^0 - \hat{\beta}\|^2}{t\mu}\Big)$$

- The step size requires an upper bound on L.
- For LASSO problems, we have $L = \sigma_{\max}(X^\top X)$.
- One may prefer backtracking line search to a fixed step size.
Question: can we further improve the convergence rate?
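In practice the Lipschitz constant for the lasso objective can be computed directly from X; a small sketch (the 0.99 safety factor is my own choice to stay strictly inside the (0, 1/L) range of the theorem):

```python
import numpy as np

def lasso_step_size(X):
    """Fixed step size for proximal gradient on 0.5*||X beta - y||^2 + lam*||beta||_1."""
    L = np.linalg.norm(X, ord=2) ** 2   # sigma_max(X)^2 = sigma_max(X^T X), the Lipschitz constant of grad f
    return 0.99 / L                     # a fixed mu in (0, 1/L), as required by Theorem 4.2
```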
39 Nesterov's accelerated gradient methods
- We will first examine Nesterov's acceleration method (1983) for smooth convex functions;
- we will then extend it to optimizing composite functions via FISTA (Beck and Teboulle, 2009), which carries Nesterov's idea over to proximal gradient methods.
40 Nesterov's accelerated method
Problem of gradient descent: zigzagging.
Nesterov's idea: include a momentum term to avoid overshooting.
41 Accelerated gradient descent
$$\text{minimize}_{\beta \in \mathbb{R}^p} \; f(\beta)$$
where f(β) is convex and smooth (differentiable).

Algorithm 4.3 Accelerated gradient descent
for t = 0, 1, ...:
$$\beta^t = b^{t-1} - \mu_t \nabla f(b^{t-1})$$
$$b^t = \beta^t + \alpha_t \underbrace{\big(\beta^t - \beta^{t-1}\big)}_{\text{momentum term}}$$
where $\mu_t$ is the step size / learning rate and $\alpha_t$ is the extrapolation parameter.
42 With/Without Acceleration
43 Momentum parameter
A simple (but mysterious) choice of extrapolation parameter:
$$\alpha_t = \frac{t - 1}{t + 2}$$
Another choice: let $s_1 = 1$, $s_{t+1} = \frac{1 + \sqrt{1 + 4 s_t^2}}{2}$, and $\alpha_t = \frac{s_t - 1}{s_{t+1}}$.
44 Accelerated proximal method (FISTA)
Nesterov's idea: include a momentum term to avoid overshooting.
$$\beta^t = \mathrm{prox}_{\mu_t g}\big(b^{t-1} - \mu_t \nabla f(b^{t-1})\big)$$
$$b^t = \beta^t + \alpha_t \underbrace{\big(\beta^t - \beta^{t-1}\big)}_{\text{momentum term}} \qquad \text{(extrapolation)}$$
- A simple (but mysterious) choice of extrapolation parameter: $\alpha_t = \frac{t-1}{t+2}$
- Fixed step size $\mu_t \equiv \mu \in (0, 1/L)$ or backtracking line search
- Same computational cost per iteration as proximal gradient
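A minimal FISTA sketch for the lasso, combining the soft-thresholding prox with the momentum step; it uses the simple choice $\alpha_t = (t-1)/(t+2)$ and a fixed step size, and the iteration count is again a placeholder for a proper stopping rule.

```python
import numpy as np

def fista(X, y, lam, mu, n_iters=1000):
    """FISTA for the lasso: proximal gradient step at an extrapolated point, plus momentum."""
    beta = np.zeros(X.shape[1])
    b = beta.copy()
    for t in range(1, n_iters + 1):
        z = b - mu * (X.T @ (X @ b - y))                               # gradient step at extrapolated point
        beta_new = np.sign(z) * np.maximum(np.abs(z) - mu * lam, 0.0)  # soft-thresholding prox
        alpha = (t - 1.0) / (t + 2.0)                                  # extrapolation parameter
        b = beta_new + alpha * (beta_new - beta)                       # momentum / extrapolation
        beta = beta_new
    return beta
```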
45 Interpretation
46 Convergence of FISTA
Theorem 4.3 (Nesterov '83, Nesterov '07, Beck & Teboulle '09)
Suppose f is differentiable and convex and g is convex. If one takes $\alpha_t = \frac{t-1}{t+2}$ and a fixed step size $\mu_t \equiv \mu \in (0, 1/L)$, then

$$F(\beta^t) - F(\hat{\beta}) \le O\Big(\frac{1}{t^2}\Big)$$

- This improves upon the O(1/t) convergence rate of the proximal gradient method.
- The rate is in general unimprovable.
47 Numerical experiments (for lasso)
[Figure credit: Hastie, Tibshirani, & Wainwright '15]
48 Computational-statistical trade-off
Suppose there is indeed a ground truth $\beta^\star$ and we wish $\hat{\beta}$ to be close to $\beta^\star$; we have a sequence $\{\beta^t\}$ and hope that $\beta^t$ converges to $\hat{\beta}$. At a fixed t, we may bound

$$\|\beta^t - \beta^\star\|_2 \le \underbrace{\|\beta^t - \hat{\beta}\|_2}_{\text{computational error}} + \underbrace{\|\hat{\beta} - \beta^\star\|_2}_{\text{statistical error}}$$
49 References
[1] Proximal algorithms, N. Parikh and S. Boyd, Foundations and Trends in Optimization.
[2] Convex optimization algorithms, D. Bertsekas, Athena Scientific.
[3] Convex optimization: algorithms and complexity, S. Bubeck, Foundations and Trends in Machine Learning.
[4] Statistical learning with sparsity: the Lasso and generalizations, T. Hastie, R. Tibshirani, and M. Wainwright.
[5] Model selection and estimation in regression with grouped variables, M. Yuan and Y. Lin, Journal of the Royal Statistical Society.
[6] A method of solving a convex programming problem with convergence rate O(1/k²), Y. Nesterov, Soviet Mathematics Doklady.
50 References (continued)
[7] Gradient methods for minimizing composite functions, Y. Nesterov, Technical Report.
[8] A fast iterative shrinkage-thresholding algorithm for linear inverse problems, A. Beck and M. Teboulle, SIAM Journal on Imaging Sciences.