Proximal Gradient Descent and Acceleration
Ryan Tibshirani, Convex Optimization 10-725/36-725
Last time: subgradient method

Consider the problem
$$\min_x f(x)$$
with $f$ convex, and $\mathrm{dom}(f) = \mathbb{R}^n$. Subgradient method: choose an initial $x^{(0)} \in \mathbb{R}^n$, and repeat:
$$x^{(k)} = x^{(k-1)} - t_k g^{(k-1)}, \quad k = 1, 2, 3, \ldots$$
where $g^{(k-1)} \in \partial f(x^{(k-1)})$. We use pre-set rules for the step sizes (e.g., the diminishing step size rule). If $f$ is Lipschitz, then the subgradient method has convergence rate $O(1/\epsilon^2)$.

Upside: very generic. Downside: can be very slow!
Outline

Today:
- Proximal gradient descent
- Convergence analysis
- ISTA, matrix completion
- Special cases
- Acceleration
Decomposable functions

Suppose $f(x) = g(x) + h(x)$, where
- $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$
- $h$ is convex, not necessarily differentiable

If $f$ were differentiable, then the gradient descent update would be:
$$x^+ = x - t \nabla f(x)$$
Recall the motivation: minimize the quadratic approximation to $f$ around $x$, replacing $\nabla^2 f(x)$ by $\frac{1}{t} I$:
$$x^+ = \operatorname*{argmin}_z \; \underbrace{f(x) + \nabla f(x)^T (z - x) + \frac{1}{2t} \|z - x\|_2^2}_{\tilde f_t(z)}$$
In our case $f$ is not differentiable, but $f = g + h$, with $g$ differentiable. Why don't we make a quadratic approximation to $g$, and leave $h$ alone? I.e., update
$$x^+ = \operatorname*{argmin}_z \; \tilde g_t(z) + h(z)$$
$$\phantom{x^+} = \operatorname*{argmin}_z \; g(x) + \nabla g(x)^T (z - x) + \frac{1}{2t} \|z - x\|_2^2 + h(z)$$
$$\phantom{x^+} = \operatorname*{argmin}_z \; \underbrace{\frac{1}{2t} \big\| z - \big(x - t \nabla g(x)\big) \big\|_2^2}_{\text{stay close to gradient update for } g} + \underbrace{h(z)}_{\text{also make } h \text{ small}}$$
Proximal gradient descent

Define the proximal mapping:
$$\mathrm{prox}_t(x) = \operatorname*{argmin}_z \; \frac{1}{2t} \|x - z\|_2^2 + h(z)$$
Proximal gradient descent: choose an initial $x^{(0)}$, repeat
$$x^{(k)} = \mathrm{prox}_{t_k}\big( x^{(k-1)} - t_k \nabla g(x^{(k-1)}) \big), \quad k = 1, 2, 3, \ldots$$
To make this update step look familiar, we can rewrite it as
$$x^{(k)} = x^{(k-1)} - t_k G_{t_k}(x^{(k-1)})$$
where $G_t$ is the generalized gradient of $f$,
$$G_t(x) = \frac{x - \mathrm{prox}_t\big(x - t \nabla g(x)\big)}{t}$$
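As a concrete illustration, here is a minimal sketch of this iteration in Python/NumPy. The callables `grad_g` and `prox`, the fixed step size `t`, and the function names are our own assumptions, not part of the slides:

```python
import numpy as np

def proximal_gradient(x0, grad_g, prox, t, num_iters=100):
    # grad_g(x): gradient of the smooth part g
    # prox(x, t): evaluates prox_t(x) for the nonsmooth part h
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        # gradient step on g, then prox step on h
        x = prox(x - t * grad_g(x), t)
    return x
```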
What good did this do?

You have a right to be suspicious... it may look like we just swapped one minimization problem for another.

The key point is that $\mathrm{prox}_t(\cdot)$ can be computed analytically for a lot of important functions $h$. Note:
- The mapping $\mathrm{prox}_t(\cdot)$ doesn't depend on $g$ at all, only on $h$
- The smooth part $g$ can be complicated; we only need to compute its gradients

Convergence analysis will be in terms of the number of iterations of the algorithm. Keep in mind that each iteration evaluates $\mathrm{prox}_t(\cdot)$ once, and this can be cheap or expensive, depending on $h$.
Example: ISTA

Given $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, recall the lasso criterion:
$$f(\beta) = \underbrace{\tfrac{1}{2} \|y - X\beta\|_2^2}_{g(\beta)} + \underbrace{\lambda \|\beta\|_1}_{h(\beta)}$$
The prox mapping is now
$$\mathrm{prox}_t(\beta) = \operatorname*{argmin}_{z \in \mathbb{R}^p} \; \frac{1}{2t} \|\beta - z\|_2^2 + \lambda \|z\|_1 = S_{\lambda t}(\beta)$$
where $S_\lambda(\beta)$ is the soft-thresholding operator,
$$[S_\lambda(\beta)]_i = \begin{cases} \beta_i - \lambda & \text{if } \beta_i > \lambda \\ 0 & \text{if } -\lambda \le \beta_i \le \lambda \\ \beta_i + \lambda & \text{if } \beta_i < -\lambda \end{cases}, \quad i = 1, \ldots, p$$
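In code, soft-thresholding is one line. A sketch in NumPy (the function name is ours):

```python
import numpy as np

def soft_threshold(beta, lam):
    # shrink each entry toward zero by lam; entries within [-lam, lam] become 0
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)
```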
Recall $\nabla g(\beta) = -X^T(y - X\beta)$, hence the proximal gradient update is:
$$\beta^+ = S_{\lambda t}\big( \beta + t X^T (y - X\beta) \big)$$
Often called the iterative soft-thresholding algorithm (ISTA) (Beck and Teboulle, 2008, A fast iterative shrinkage-thresholding algorithm for linear inverse problems). A very simple algorithm to compute a lasso solution.

[Figure: convergence of proximal gradient (ISTA) vs. the subgradient method on a lasso example, plotting $f - f^\star$ against iteration $k$]
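Putting the gradient step and soft-thresholding together gives ISTA. A minimal sketch, assuming the `soft_threshold` helper above and a fixed step size $t = 1/L$, where $L$ is the largest eigenvalue of $X^T X$ (the Lipschitz constant of $\nabla g$):

```python
import numpy as np

def ista(X, y, lam, num_iters=500):
    # fixed step size t = 1/L, with L the largest eigenvalue of X^T X
    t = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        # gradient step on the least squares loss, then soft-threshold
        beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
    return beta
```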
Convergence analysis

With criterion $f(x) = g(x) + h(x)$, we assume:
- $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$, and $\nabla g$ is Lipschitz continuous with constant $L > 0$
- $h$ is convex, and $\mathrm{prox}_t(x) = \operatorname*{argmin}_z \{ \|x - z\|_2^2 / (2t) + h(z) \}$ can be evaluated

Theorem: Proximal gradient descent with fixed step size $t \le 1/L$ satisfies
$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}$$
Proximal gradient descent has convergence rate $O(1/k)$, or $O(1/\epsilon)$. Same as gradient descent! But remember, this counts the number of iterations, not operations.
Backtracking line search

Similar to gradient descent, but operates on $g$ and not $f$. We fix a parameter $0 < \beta < 1$. At each iteration, start with $t = 1$, and while
$$g\big(x - t G_t(x)\big) > g(x) - t \nabla g(x)^T G_t(x) + \frac{t}{2} \|G_t(x)\|_2^2$$
shrink $t = \beta t$. Else perform the prox gradient update.

Under the same assumptions, we get the same rate:

Theorem: Proximal gradient descent with backtracking line search satisfies
$$f(x^{(k)}) - f^\star \le \frac{\|x^{(0)} - x^\star\|_2^2}{2 t_{\min} k}$$
where $t_{\min} = \min\{1, \beta / L\}$
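A sketch of one backtracking step, with the same assumed callables as the generic sketch earlier (`g`, `grad_g`, and `prox` are user-supplied, and the names are ours):

```python
def backtracking_prox_step(x, g, grad_g, prox, beta=0.5):
    # shrink t until the sufficient-decrease condition above holds
    t = 1.0
    while True:
        G = (x - prox(x - t * grad_g(x), t)) / t   # generalized gradient G_t(x)
        if g(x - t * G) <= g(x) - t * grad_g(x) @ G + (t / 2) * (G @ G):
            return x - t * G, t                    # accept the prox gradient update
        t *= beta
```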
Example: matrix completion

Given a matrix $Y \in \mathbb{R}^{m \times n}$, we only observe entries $Y_{ij}$, $(i,j) \in \Omega$. Suppose we want to fill in the missing entries (e.g., for a recommender system), so we solve a matrix completion problem:
$$\min_{B \in \mathbb{R}^{m \times n}} \; \frac{1}{2} \sum_{(i,j) \in \Omega} (Y_{ij} - B_{ij})^2 + \lambda \|B\|_{\mathrm{tr}}$$
Here $\|B\|_{\mathrm{tr}}$ is the trace (or nuclear) norm of $B$,
$$\|B\|_{\mathrm{tr}} = \sum_{i=1}^r \sigma_i(B)$$
where $r = \mathrm{rank}(B)$ and $\sigma_1(B) \ge \ldots \ge \sigma_r(B) \ge 0$ are the singular values.
Define $P_\Omega$, the projection operator onto the observed set:
$$[P_\Omega(B)]_{ij} = \begin{cases} B_{ij} & (i,j) \in \Omega \\ 0 & (i,j) \notin \Omega \end{cases}$$
Then the criterion is
$$f(B) = \underbrace{\tfrac{1}{2} \|P_\Omega(Y) - P_\Omega(B)\|_F^2}_{g(B)} + \underbrace{\lambda \|B\|_{\mathrm{tr}}}_{h(B)}$$
Two ingredients needed for proximal gradient descent:
- Gradient calculation: $\nabla g(B) = -(P_\Omega(Y) - P_\Omega(B))$
- Prox function: $\mathrm{prox}_t(B) = \operatorname*{argmin}_{Z \in \mathbb{R}^{m \times n}} \frac{1}{2t} \|B - Z\|_F^2 + \lambda \|Z\|_{\mathrm{tr}}$
Claim: $\mathrm{prox}_t(B) = S_{\lambda t}(B)$, matrix soft-thresholding at the level $\lambda t$. Here $S_\lambda(B)$ is defined by
$$S_\lambda(B) = U \Sigma_\lambda V^T$$
where $B = U \Sigma V^T$ is an SVD, and $\Sigma_\lambda$ is diagonal with $(\Sigma_\lambda)_{ii} = \max\{\Sigma_{ii} - \lambda, 0\}$.

Why? Note that $\mathrm{prox}_t(B) = Z$, where $Z$ satisfies
$$0 \in Z - B + \lambda t \, \partial \|Z\|_{\mathrm{tr}}$$
Fact: if $Z = U \Sigma V^T$, then
$$\partial \|Z\|_{\mathrm{tr}} = \{ U V^T + W : \|W\|_{\mathrm{op}} \le 1, \; U^T W = 0, \; W V = 0 \}$$
Now plug in $Z = S_{\lambda t}(B)$ and check that we can get 0.
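A sketch of matrix soft-thresholding in NumPy (our naming): compute one SVD and shrink the singular values:

```python
import numpy as np

def matrix_soft_threshold(B, lam):
    # SVD of B, then soft-threshold the singular values
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    # (U * s) @ Vt is U diag(s) V^T without forming the diagonal matrix
    return (U * np.maximum(sigma - lam, 0.0)) @ Vt
```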
Hence the proximal gradient update step is:
$$B^+ = S_{\lambda t}\big( B + t \big( P_\Omega(Y) - P_\Omega(B) \big) \big)$$
Note that $\nabla g(B)$ is Lipschitz continuous with $L = 1$, so we can choose fixed step size $t = 1$. The update step is now:
$$B^+ = S_\lambda\big( P_\Omega(Y) + P_\Omega^\perp(B) \big)$$
where $P_\Omega^\perp$ projects onto the unobserved set, $P_\Omega(B) + P_\Omega^\perp(B) = B$.

This is the soft-impute algorithm (Mazumder et al. 2011, Spectral regularization algorithms for learning large incomplete matrices), a simple and effective method for matrix completion.
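A minimal soft-impute sketch, assuming the `matrix_soft_threshold` helper above and a boolean mask for $\Omega$ (names and defaults are placeholders):

```python
import numpy as np

def soft_impute(Y, observed, lam, num_iters=100):
    # observed: boolean mask for Omega; unobserved entries of Y are ignored
    B = np.zeros_like(Y, dtype=float)
    for _ in range(num_iters):
        # P_Omega(Y) + P_Omega^perp(B): keep observed Y, fill the rest from B
        filled = np.where(observed, Y, B)
        B = matrix_soft_threshold(filled, lam)
    return B
```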
Special cases

Proximal gradient descent is also called composite gradient descent, or generalized gradient descent.

Why generalized? This refers to the several special cases, when minimizing $f = g + h$:
- $h = 0$: gradient descent
- $h = I_C$: projected gradient descent
- $g = 0$: proximal minimization algorithm

Therefore these algorithms all have $O(1/\epsilon)$ convergence rate.
Projected gradient descent

Given a closed, convex set $C \subseteq \mathbb{R}^n$,
$$\min_{x \in C} g(x) \;\iff\; \min_{x \in \mathbb{R}^n} g(x) + I_C(x)$$
where
$$I_C(x) = \begin{cases} 0 & x \in C \\ \infty & x \notin C \end{cases}$$
is the indicator function of $C$. Hence
$$\mathrm{prox}_t(x) = \operatorname*{argmin}_{z \in \mathbb{R}^n} \frac{1}{2t} \|x - z\|_2^2 + I_C(z) = \operatorname*{argmin}_{z \in C} \|x - z\|_2^2$$
I.e., $\mathrm{prox}_t(x) = P_C(x)$, the projection operator onto $C$.
Therefore the proximal gradient update step is:
$$x^+ = P_C\big( x - t \nabla g(x) \big)$$
i.e., perform the usual gradient update and then project back onto $C$. Called projected gradient descent.

(Note: the projected subgradient method works too)
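A sketch reusing the generic proximal gradient loop with the prox replaced by a projection; the example projection onto the nonnegative orthant is our own choice:

```python
import numpy as np

def projected_gradient(x0, grad_g, project, t, num_iters=100):
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        # gradient step, then project back onto C
        x = project(x - t * grad_g(x))
    return x

# example projection: the nonnegative orthant C = {x : x >= 0}
project_nonneg = lambda x: np.maximum(x, 0.0)
```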
Proximal minimization algorithm

Consider, for $h$ convex (not necessarily differentiable),
$$\min_x h(x)$$
The proximal gradient update step is just:
$$x^+ = \operatorname*{argmin}_z \frac{1}{2t} \|x - z\|_2^2 + h(z)$$
Called the proximal minimization algorithm. Faster than the subgradient method, but not implementable unless we know the prox in closed form.
What happens if we can't evaluate the prox?

Theory for proximal gradient, with $f = g + h$, assumes that the prox function can be evaluated, i.e., assumes the minimization
$$\mathrm{prox}_t(x) = \operatorname*{argmin}_{z \in \mathbb{R}^n} \frac{1}{2t} \|x - z\|_2^2 + h(z)$$
can be done exactly. In general, all bets are off if we just minimize this approximately.

But, if you can precisely control the errors in approximating the prox operator, then you can recover the original convergence rates (Schmidt et al. 2011, Convergence rates of inexact proximal-gradient methods for convex optimization). In practice, if the prox evaluation is done approximately, then it should be done to fairly high precision.
Acceleration

It turns out we can accelerate proximal gradient descent in order to achieve the optimal $O(1/\sqrt{\epsilon})$ convergence rate. Four ideas (three acceleration methods) by Nesterov:
- 1983: original acceleration idea for smooth functions
- 1988: another acceleration idea for smooth functions
- 2005: smoothing techniques for nonsmooth functions, coupled with the original acceleration idea
- 2007: acceleration idea for composite functions (each step uses the entire history of previous steps and makes two prox calls)

We will follow Beck and Teboulle (2008), an extension of Nesterov (1983) to composite functions (each step uses information from the last two steps and makes one prox call).
Accelerated proximal gradient method

Our problem, as before:
$$\min_x g(x) + h(x)$$
where $g$ is convex, differentiable, and $h$ is convex. Accelerated proximal gradient method: choose an initial point $x^{(0)} = x^{(-1)} \in \mathbb{R}^n$, repeat for $k = 1, 2, 3, \ldots$:
$$v = x^{(k-1)} + \frac{k-2}{k+1} \big(x^{(k-1)} - x^{(k-2)}\big)$$
$$x^{(k)} = \mathrm{prox}_{t_k}\big( v - t_k \nabla g(v) \big)$$
- The first step $k = 1$ is just the usual proximal gradient update
- After that, $v = x^{(k-1)} + \frac{k-2}{k+1}(x^{(k-1)} - x^{(k-2)})$ carries some momentum from previous iterations
- $h = 0$ gives the accelerated gradient method
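A minimal sketch of the accelerated method, with the same assumed callables as before. Note that at $k = 1$ we have $x^{(0)} = x^{(-1)}$, so $v = x$ and the step reduces to plain proximal gradient:

```python
import numpy as np

def accelerated_prox_gradient(x0, grad_g, prox, t, num_iters=100):
    x = np.array(x0, dtype=float)
    x_prev = x.copy()                   # x^(0) = x^(-1)
    for k in range(1, num_iters + 1):
        # momentum step; v == x at k = 1 since x == x_prev
        v = x + (k - 2) / (k + 1) * (x - x_prev)
        x_prev = x
        x = prox(v - t * grad_g(v), t)
    return x
```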
[Figure: momentum weights $(k-2)/(k+1)$ plotted against $k$]
Recall the lasso example: acceleration can really help!

[Figure: $f - f^\star$ vs. iteration $k$ on the lasso example, comparing the subgradient method, proximal gradient, and Nesterov acceleration]

Note: accelerated proximal gradient is not a descent method ("Nesterov ripples")
Convergence analysis

As usual, we are minimizing $f(x) = g(x) + h(x)$, assuming:
- $g$ is convex, differentiable, $\mathrm{dom}(g) = \mathbb{R}^n$, and $\nabla g$ is Lipschitz continuous with constant $L > 0$
- $h$ is convex, and its prox function can be evaluated

Theorem: The accelerated proximal gradient method with fixed step size $t \le 1/L$ satisfies
$$f(x^{(k)}) - f^\star \le \frac{2 \|x^{(0)} - x^\star\|_2^2}{t (k+1)^2}$$
It achieves the optimal rate $O(1/k^2)$ for first-order methods! I.e., a rate of $O(1/\sqrt{\epsilon})$.
Backtracking line search

There are a few ways to do this with acceleration... here is a simple method (more complicated strategies exist): fix $\beta < 1$, $t_0 = 1$. At iteration $k$, start with $t = t_{k-1}$, and while
$$g(x^+) > g(v) + \nabla g(v)^T (x^+ - v) + \frac{1}{2t} \|x^+ - v\|_2^2$$
shrink $t = \beta t$, and let $x^+ = \mathrm{prox}_t(v - t \nabla g(v))$. Else keep $x^+$.

Under the same assumptions, we get the same rate:

Theorem: The accelerated proximal gradient method with backtracking line search satisfies
$$f(x^{(k)}) - f^\star \le \frac{2 \|x^{(0)} - x^\star\|_2^2}{t_{\min} (k+1)^2}$$
where $t_{\min} = \min\{1, \beta / L\}$
Is acceleration always useful?

Acceleration can be a very effective speedup tool... but should it always be used?

In practice, the speedup from acceleration is diminished in the presence of warm starts. I.e., suppose we want to solve the lasso problem for tuning parameter values $\lambda_1 > \lambda_2 > \ldots > \lambda_r$:
- When solving for $\lambda_1$, initialize $x^{(0)} = 0$, and record the solution $\hat{x}(\lambda_1)$
- When solving for $\lambda_j$, initialize $x^{(0)} = \hat{x}(\lambda_{j-1})$, the recorded solution for $\lambda_{j-1}$

Over a fine enough grid of $\lambda$ values, proximal gradient descent can often perform just as well without acceleration. A sketch of this warm-started path follows.
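A sketch of a warm-started lasso path, reusing the `soft_threshold` helper from the ISTA example; the grid and iteration count are placeholders:

```python
import numpy as np

def lasso_path(X, y, lambdas, num_iters=100):
    t = 1.0 / np.linalg.norm(X, 2) ** 2        # fixed step size t = 1/L
    beta = np.zeros(X.shape[1])                # x^(0) = 0 for the largest lambda
    path = []
    for lam in sorted(lambdas, reverse=True):  # lambda_1 > lambda_2 > ... > lambda_r
        for _ in range(num_iters):             # ISTA, warm-started at the previous solution
            beta = soft_threshold(beta + t * X.T @ (y - X @ beta), lam * t)
        path.append(beta.copy())
    return path
```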
Sometimes backtracking and acceleration can be disadvantageous!

Recall the matrix completion problem: the proximal gradient update is
$$B^+ = S_{\lambda t}\big( B + t \big( P_\Omega(Y) - P_\Omega(B) \big) \big)$$
where $S_{\lambda t}$ is the matrix soft-thresholding operator... which requires an SVD.

- One backtracking loop evaluates the generalized gradient $G_t(x)$, i.e., evaluates $\mathrm{prox}_t(x)$, across various values of $t$. For matrix completion, this means multiple SVDs...
- Acceleration changes the argument we pass to the prox: $v - t \nabla g(v)$ instead of $x - t \nabla g(x)$. For matrix completion (and $t = 1$),
$$B - \nabla g(B) = \underbrace{P_\Omega(Y)}_{\text{sparse}} + \underbrace{P_\Omega^\perp(B)}_{\text{low rank}} \;\Rightarrow\; \text{fast SVD}$$
$$V - \nabla g(V) = \underbrace{P_\Omega(Y)}_{\text{sparse}} + \underbrace{P_\Omega^\perp(V)}_{\text{not necessarily low rank}} \;\Rightarrow\; \text{slow SVD}$$
References and further reading

Nesterov's four ideas (three acceleration methods):
- Y. Nesterov (1983), A method for solving a convex programming problem with convergence rate $O(1/k^2)$
- Y. Nesterov (1988), On an approach to the construction of optimal methods of minimization of smooth convex functions
- Y. Nesterov (2005), Smooth minimization of non-smooth functions
- Y. Nesterov (2007), Gradient methods for minimizing composite objective function
Extensions and/or analyses:
- A. Beck and M. Teboulle (2008), A fast iterative shrinkage-thresholding algorithm for linear inverse problems
- S. Becker, J. Bobin, and E. Candes (2009), NESTA: a fast and accurate first-order method for sparse recovery
- P. Tseng (2008), On accelerated proximal gradient methods for convex-concave optimization
... and there are many more.

Helpful lecture notes/books:
- E. Candes, Lecture notes for Math 301, Stanford University, Winter
- Y. Nesterov (1998), Introductory lectures on convex optimization: a basic course, Chapter 2
- L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring