10-725: Optimization                                                     Fall 2012
Lecture 11: September 25
Lecturer: Geoff Gordon/Ryan Tibshirani                    Scribes: Subhodeep Moitra
Note: LaTeX template courtesy of UC Berkeley EECS dept.
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

11.1 Accelerated Backtracking Line Search

In the proof for accelerated generalized gradient descent we saw an $O(1/k^2)$ convergence rate, which is optimal; gradient descent, on the other hand, has an $O(1/k)$ convergence rate. The proofs of these convergence rates are very different and are made under different assumptions. For the same reason, there are a number of different accelerated backtracking schemes, built around different criteria. We will examine one of the simpler schemes. The assumptions this scheme needs are:

- A Lipschitz-gradient-type condition at each step:
  $g(x^+) \le g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2t} \|x^+ - y\|^2$.

- A condition on $\theta_k$:
  $\frac{(1 - \theta_k) t_k}{\theta_k^2} \le \frac{t_{k-1}}{\theta_{k-1}^2}$.

- $t_k$ is monotonically decreasing, i.e. $t_k \le t_{k-1}$. The problem with this condition is that if you choose a small step size initially, you are forced to continue using small step sizes later on as well.

11.1.1 Algorithm

Choose $\beta < 1$ and $t_0 = 1$.
for $k = 1, 2, 3, \ldots$ (until convergence)
    $t_k = t_{k-1}$
    $x^+ = \mathrm{prox}_{t_k}(y - t_k \nabla g(y))$
    while $g(x^+) > g(y) + \nabla g(y)^T (x^+ - y) + \frac{1}{2 t_k} \|x^+ - y\|^2$
        $t_k = \beta t_k$
        $x^+ = \mathrm{prox}_{t_k}(y - t_k \nabla g(y))$
    end while
end for

The method checks whether $g(x^+)$ is small enough; if not, it shrinks $t_k$ by a factor $\beta$ and recomputes $x^+$. This scheme satisfies the required conditions and is thus able to achieve the $O(1/k^2)$ convergence rate.
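To make the loop structure concrete, here is a minimal Python sketch of accelerated proximal gradient descent with this backtracking rule, for a composite objective $f = g + h$ with smooth part $g$. The function names (g, grad_g, prox), the use of the standard $(k-2)/(k+1)$ momentum weight (the same one that appears in the FISTA update below), and the fixed iteration cap are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def accel_prox_grad_backtracking(g, grad_g, prox, x0, beta=0.5, t0=1.0, max_iter=500):
    """Sketch: accelerated proximal gradient with the backtracking rule above.

    g, grad_g : smooth part of the objective and its gradient
    prox      : prox(z, t), proximal operator of the non-smooth part at step size t
    """
    x_prev = x = np.asarray(x0, dtype=float)
    t = t0                                   # t only ever shrinks, so t_k <= t_{k-1}
    for k in range(1, max_iter + 1):
        # auxiliary point with momentum weight (k-2)/(k+1); at k = 1 the momentum term is zero
        y = x + (k - 2) / (k + 1) * (x - x_prev)
        gy, grad_y = g(y), grad_g(y)
        while True:
            x_new = prox(y - t * grad_y, t)
            d = x_new - y
            # accept t once g(x^+) <= g(y) + grad g(y)^T (x^+ - y) + ||x^+ - y||^2 / (2t)
            if g(x_new) <= gy + grad_y @ d + (d @ d) / (2.0 * t):
                break
            t *= beta                        # otherwise shrink the step size and retry
        x_prev, x = x, x_new
    return x
```

Because t carries over between outer iterations and is only ever multiplied by $\beta < 1$, the step sizes in this sketch are automatically monotonically decreasing, matching the third assumption above.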

11.1.2 Convergence rates

From the above discussion, the following theorem summarizes the $O(1/k^2)$ convergence rate.

Theorem 11.1  The accelerated generalized gradient method with backtracking satisfies
  $f(x^{(k)}) - f(x^\star) \le \frac{2 \|x^{(0)} - x^\star\|^2}{t_{\min} (k+1)^2}$,
where $t_{\min} = \min_{i=1,\ldots,k} t_i$ (equal to $t_k$ here, since the step sizes are decreasing).

11.2 FISTA

The Fast Iterative Soft-Thresholding Algorithm (FISTA) is the accelerated version of ISTA, and applies to problems whose objective is a convex differentiable function plus an $\ell_1$ norm, such as the lasso. The lasso problem is
  $\min_x \; \frac{1}{2} \|y - Ax\|^2 + \lambda \|x\|_1$.

ISTA solution
Applying the ordinary (non-accelerated) generalized gradient method gives the iterative soft-thresholding algorithm (ISTA); see Lecture 8:
  $x^{(k)} = S_{\lambda t_k}\big(x^{(k-1)} + t_k A^T (y - A x^{(k-1)})\big), \quad k = 1, 2, 3, \ldots$
where $S_\lambda(\cdot)$ is the componentwise soft-thresholding operator,
  $[S_\lambda(x)]_i = \begin{cases} x_i - \lambda & x_i > \lambda \\ 0 & -\lambda \le x_i \le \lambda \\ x_i + \lambda & x_i < -\lambda. \end{cases}$
This is obtained by solving the prox problem:
  $\mathrm{prox}_t(x) = \arg\min_{z \in \mathbb{R}^n} \frac{1}{2t} \|x - z\|^2 + \lambda \|z\|_1 = S_{\lambda t}(x)$.

FISTA solution
The accelerated version solves the same prox problem, but the input to the prox is formed at an auxiliary point that adds a momentum term to the current iterate:
  $v = x^{(k-1)} + \frac{k-2}{k+1} \big(x^{(k-1)} - x^{(k-2)}\big)$,
  $x^{(k)} = S_{\lambda t_k}\big(v + t_k A^T (y - A v)\big), \quad k = 1, 2, 3, \ldots$

Figures 11.1 and 11.2 compare the performance of ISTA and FISTA; FISTA clearly beats ISTA, converging roughly an order of magnitude faster.
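Before the figures, both updates are short enough to state directly in code. Below is a minimal NumPy sketch of ISTA and FISTA for the lasso with a fixed step size, e.g. $t = 1/\|A\|_2^2$, a valid step for the smooth part; the function names, the fixed iteration counts, and the synthetic data in the usage comment are illustrative choices, not from the lecture.

```python
import numpy as np

def soft_threshold(x, tau):
    # [S_tau(x)]_i = sign(x_i) * max(|x_i| - tau, 0)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, y, lam, t, n_iter=1000):
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x + t * A.T @ (y - A @ x), lam * t)
    return x

def fista(A, y, lam, t, n_iter=1000):
    x_prev = x = np.zeros(A.shape[1])
    for k in range(1, n_iter + 1):
        v = x + (k - 2) / (k + 1) * (x - x_prev)            # momentum step
        x_prev, x = x, soft_threshold(v + t * A.T @ (y - A @ v), lam * t)
    return x

# Example usage with sizes mirroring the figures (n = 100, p = 500):
# rng = np.random.default_rng(0)
# A = rng.standard_normal((100, 500)); y = rng.standard_normal(100)
# t = 1.0 / np.linalg.norm(A, 2) ** 2
# x_hat = fista(A, y, lam=1.0, t=t)
```

The two functions differ only in the point at which the gradient step is taken (the iterate itself for ISTA, the momentum point v for FISTA), which is exactly the change that buys the $O(1/k^2)$ rate.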

Figure 11.1: Performance of ISTA vs FISTA for lasso regression (n = 100, p = 500)

Figure 11.2: Performance of ISTA vs FISTA for lasso logistic regression (n = 100, p = 500)

11.3 Failure Cases for Acceleration

Acceleration achieves the optimal $O(1/k^2)$ convergence rate for gradient-based methods. However, it does not help under all conditions: in some cases it performs no better than the non-accelerated method, and in others it can actually hurt performance. We now present some cases in which acceleration fails to do well.

11.3.1 Warm Starts

In iterative algorithms, warm starting can be an effective strategy for speeding up convergence. For example, in cross-validation runs for the lasso, one can use the optimal $\hat{x}$ found at the previous tuning parameter value as a warm start for the next one. Let the tuning parameters for the lasso be $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_r$. When solving for $\lambda_1$, initialize $x^{(0)} = 0$ and record the solution $\hat{x}(\lambda_1)$; then, when solving for $\lambda_j$, initialize $x^{(0)}(\lambda_j) = \hat{x}(\lambda_{j-1})$. It has been observed that over a fine grid of $\lambda$ values, generalized gradient descent with warm starts can perform just as well as the accelerated version.

11.3.2 Matrix Completion

In the case of matrix completion, acceleration, and even backtracking, can hurt performance. The matrix completion problem is described in Lecture 8. Briefly: given a matrix $A$ of which only the entries $(i, j) \in \Omega$ are observed, we want to fill in the remaining entries while keeping the matrix low rank. We solve
  $\min_X \; \frac{1}{2} \|P_\Omega(A) - P_\Omega(X)\|_F^2 + \lambda \|X\|_*$,
where $\|X\|_* = \sum_{i=1}^{r} \sigma_i(X)$ is the nuclear norm, $r$ is the rank of $X$, and $P_\Omega(\cdot)$ is the projection onto the observed entries,
  $[P_\Omega(X)]_{ij} = \begin{cases} X_{ij} & (i, j) \in \Omega \\ 0 & (i, j) \notin \Omega. \end{cases}$
The generalized gradient descent updates, also known as the soft-impute algorithm, are
  $X^+ = S_\lambda\big(P_\Omega(A) + P_\Omega^\perp(X)\big)$,
where $P_\Omega^\perp$ projects onto the unobserved entries and $S_\lambda(\cdot)$ is the matrix soft-thresholding operator, which requires an SVD to compute:
  $S_\lambda(X) = U \Sigma_\lambda V^T$, with $(\Sigma_\lambda)_{ii} = \max\{\Sigma_{ii} - \lambda, 0\}$.

Computing the SVD is expensive, costing up to $O(mn^2)$ operations for an $m \times n$ matrix. This is a problem for the backtracking line search, since each backtracking loop evaluates the generalized gradient $G_t(X)$ at several values of $t$, and each evaluation requires solving the prox; this amounts to computing the SVD several times and can be slow. The matrix completion problem also does not work well with acceleration, since acceleration changes the argument passed to the prox function: we pass $y - t \nabla g(y)$ instead of $x - t \nabla g(x)$, where $y$ carries the momentum term. This can make the matrix passed to the prox higher rank, which makes the SVD more expensive to compute. Figure 11.3 shows an example where using acceleration performs worse than plain soft-impute [M].
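Before turning to the figure, here is a minimal NumPy sketch of the soft-impute update above, illustrating why each iteration is dominated by the SVD. The dense full SVD, the function names, and the fixed iteration count are simplifying assumptions; practical implementations exploit the sparse-plus-low-rank structure of the prox argument to compute a cheaper truncated SVD instead.

```python
import numpy as np

def svd_soft_threshold(X, lam):
    # S_lam(X) = U * max(Sigma - lam, 0) * V^T, the prox of the nuclear norm
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # the expensive step: up to O(m n^2)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def soft_impute(A, mask, lam, n_iter=100):
    """mask[i, j] is True for observed entries (i, j) in Omega."""
    X = np.zeros_like(A, dtype=float)
    for _ in range(n_iter):
        # X^+ = S_lam( P_Omega(A) + P_Omega^perp(X) )
        Z = np.where(mask, A, X)       # observed entries from A, unobserved from current X
        X = svd_soft_threshold(Z, lam)
    return X
```

With a fixed step size $t = 1$, the plain (non-accelerated) iteration only ever soft-thresholds matrices of this observed-plus-current-iterate form, which is what keeps soft-impute practical at scale.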

Figure 11.3: Performance of soft-impute vs accelerated generalized gradient descent on a small and a large problem

References

[M] R. Mazumder, T. Hastie and R. Tibshirani, "Spectral Regularization Algorithms for Learning Large Incomplete Matrices," Journal of Machine Learning Research, 11 (2010), pp. 2287-2322.