Lecture 6: Projected Gradient Descent


EE227C. Lecturer: Professor Martin Wainwright. Scribe: Alvin Wan

Consider the following update:

x^{l+1} = \Pi_C(x^l - \alpha \nabla f(x^l))

Theorem. Say f : \mathbb{R}^d \to \mathbb{R} is (m, M)-strongly convex and smooth. Then with \alpha = \frac{2}{m+M},

\|x^l - x^*\|_2 \le \left( \frac{1 - m/M}{1 + m/M} \right)^l \|x^0 - x^*\|_2

Take x^+ to be the unconstrained optimum, and x^* to be the constrained optimum.

Proof: Define S_\alpha(x) = x - \alpha \nabla f(x). Then with \alpha = \frac{2}{m+M}, we have

\|S_\alpha(x) - x^+\|_2 \le \frac{1 - m/M}{1 + m/M} \|x - x^+\|_2

Define S(x) = x - \alpha \nabla f(x) and T(x) = \Pi_C(S(x)). Our method generates

x^{l+1} = \Pi_C(S(x^l)) = T(x^l)

In past lectures, we proved that S is a contraction: for some 0 < \gamma < 1,

\|S(x) - S(y)\|_2 \le \gamma \|x - y\|_2 \quad \forall x, y \in \mathbb{R}^d

We proved this for y = \arg\min_{u \in \mathbb{R}^d} f(u), but it holds in general. Now examine our algorithm. Since projections are non-expansive,

\|T(x) - T(y)\|_2 = \|\Pi_C(S(x)) - \Pi_C(S(y))\|_2 \le \|S(x) - S(y)\|_2 \le \gamma \|x - y\|_2,

so T is a \gamma-contraction. This tells us rigorously that T : \mathbb{R}^d \to C has a unique fixed point \hat{x} \in C, which satisfies T(\hat{x}) = \hat{x}. Unwrap the recursion to get

\|x^l - \hat{x}\|_2 \le \gamma^l \|x^0 - \hat{x}\|_2
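As an aside, the update is short to implement. Below is a minimal NumPy sketch; the quadratic objective, the unit-ball choice of C, and the names project_ball and projected_gradient_descent are all illustrative assumptions, not part of the lecture.

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto C = {x : ||x||_2 <= radius}
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def projected_gradient_descent(grad_f, x0, alpha, project, iters=200):
    # x^{l+1} = Pi_C(x^l - alpha * grad_f(x^l))
    x = x0
    for _ in range(iters):
        x = project(x - alpha * grad_f(x))
    return x

# Toy strongly convex quadratic f(x) = 0.5 x'Ax - b'x with m = 1, M = 10,
# so the theorem's step size is alpha = 2 / (m + M) = 2/11.
A = np.diag([1.0, 10.0])
b = np.array([5.0, 5.0])
x_hat = projected_gradient_descent(lambda x: A @ x - b, np.zeros(2),
                                   alpha=2 / 11, project=project_ball)
```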

So the iterates converge to the fixed point \hat{x}. However, we haven't used anything specific about optimality yet, only non-expansiveness. So we still need to show that \hat{x} = x^* = \arg\min_{x \in C} f(x). By the optimality conditions for projection, with z = \hat{x} - \alpha \nabla f(\hat{x}),

\langle \Pi_C(z) - z, y - \Pi_C(z) \rangle \ge 0 \quad \forall y \in C

We know T(\hat{x}) = \Pi_C(\hat{x} - \alpha \nabla f(\hat{x})) = \hat{x}, so the above becomes

\langle \alpha \nabla f(\hat{x}), y - \hat{x} \rangle \ge 0 \quad \forall y \in C

How does this prove our claim? The gradient forms a non-negative inner product with every feasible direction. In other words, there is no feasible direction to descend along. Thus \hat{x} = x^*.

Let's now analyze f weakly convex (convex but not strongly convex) and M-smooth. For the unconstrained case, we proved that with \alpha = \frac{1}{M},

f(x^l) - f(x^*) \le \frac{M \|x^0 - x^*\|_2^2}{2l}

Theorem. The same bound holds for projected gradient descent.

Assume M-smoothness and weak convexity. We claim the same bound with the same step size holds in the projected case. The key is to prove that the progress you make scales quadratically in the size of the gradient. In the unconstrained case, we showed

f(x^{l+1}) - f(x^l) \le -\frac{1}{2M} \|\nabla f(x^l)\|_2^2

We're no longer following the negative gradient, but the algorithm can still be written as a descent method with a different descent direction. Call it g_C(x^l), and rewrite the algorithm to isolate g_C. We need to prove something analogous for the constrained problem. Consider \alpha = \frac{1}{M}:

x^{l+1} = x^l - \alpha g_C(x^l)

Define T(x) = \Pi_C(x - \frac{1}{M} \nabla f(x)); then g_C(x) = M(x - T(x)). We started with gradients, but there are plenty of other descent directions to use. We can replace the gradient with any direction that causes a potential function to decrease.

Lemma. For x, y \in C,

f(T(x)) - f(y) \le \langle g_C(x), x - y \rangle - \frac{1}{2M} \|g_C(x)\|_2^2.
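Before applying the lemma, here is a tiny sketch of the gradient mapping just defined; the helper name gradient_mapping and the abstract project callable are my own choices.

```python
import numpy as np

def gradient_mapping(x, grad_f, project, M):
    # T(x) = Pi_C(x - (1/M) grad f(x));  g_C(x) = M * (x - T(x)).
    # With project the identity, T(x) = x - (1/M) grad f(x), so
    # g_C(x) reduces to grad_f(x), matching the unconstrained case.
    T_x = project(x - grad_f(x) / M)
    return M * (x - T_x)
```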

For an unconstrained problem, g_C(x) is just the gradient. Apply the lemma with x = x^l and y = x^l:

f(T(x^l)) - f(x^l) = f(x^{l+1}) - f(x^l) \le -\frac{1}{2M} \|g_C(x^l)\|_2^2

The above is a descent guarantee. Compare this statement to what we did in the unconstrained case: formally, it's the same, so we can use the exact same algebra to get our telescoping property. Apply the lemma with x = x^l and y = x^*:

f(x^{l+1}) - f(x^*) \le \langle g_C(x^l), x^l - x^* \rangle - \frac{1}{2M} \|g_C(x^l)\|_2^2 = \frac{M}{2} \left( \|x^l - x^*\|_2^2 - \|x^{l+1} - x^*\|_2^2 \right)

Now we can average. Since each step decreases the objective,

f(x^T) - f(x^*) \le \frac{1}{T} \sum_{l=0}^{T-1} \left( f(x^{l+1}) - f(x^*) \right) \le \frac{M}{2T} \sum_{l=0}^{T-1} \left( \|x^l - x^*\|_2^2 - \|x^{l+1} - x^*\|_2^2 \right) \le \frac{M}{2T} \|x^0 - x^*\|_2^2

where the middle sum telescopes. If you look at this, it seems strange: we almost defined something circular. But it's actually the right intuition.
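The descent guarantee is easy to sanity-check numerically. A small sketch, reusing the hypothetical toy quadratic and unit-ball constraint assumed earlier:

```python
import numpy as np

# Check f(x^{l+1}) - f(x^l) <= -(1/2M) ||g_C(x^l)||_2^2 along the iterates.
A = np.diag([1.0, 10.0]); b = np.array([5.0, 5.0]); M = 10.0
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
project = lambda v: v if np.linalg.norm(v) <= 1 else v / np.linalg.norm(v)

x = np.zeros(2)
for _ in range(100):
    T_x = project(x - grad_f(x) / M)
    g_C = M * (x - T_x)
    assert f(T_x) - f(x) <= -(g_C @ g_C) / (2 * M) + 1e-9
    x = T_x
```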

f(t (x)) f(y) = (f(t (x)) f(x)) + (f(x) f(y)) f(x), T (x) x + M 2 Π C(x) x 2 2 + f(x), x y Note that we have f(x)x and f(x)( x) also note that we can relate the second term to g C. f(x), T (x) y + M 2 1 M (g C(x)) 2 2 f(x), T (x) y + 1 2M g C(x) 2 2 Using the side result, we get the following. g C (x), Π(x) x + (x y) + 1 2M g C(x) 2 2 = g C (x), x y + ( 1 2M 1 M ) g C(x) 2 2 Intuitively, this algorithm works because of the direction g C (x) that works. M smoothness and convexity are very useful, powerful utilities. The overall conclusion is nontrivial, but when you use these simple tools correctly, you can build interesting things. So far, we ve assumed functions are differentiable and convex. We will break down both of these. First, we relax the condition of differentiability. Consider a sub-differentiable function. Why do I care? It s very clean and classical for convex functions. A standard problem that people like to solve is the following. min x R d Ax b 2 2 + λ x 1 Consider the matrix analog. Many researchers study matrix completion. Some are in particular interested in low-rank completion. min x S d d (i,j) Ω (B ij X ij ) 2 + λ X nuc This L1 norm is a weak surrogate for sparsity. In 2 dimensions, its epigraph is a convex set. Let f : R d R {+ }. Then, z R d is a sub-gradient of f at x, written z f(x)) if f(y) f(x) + z, y x y R d. 4

Danskin's theorem gives us conditions on functions of the form

f(x) = \max_{z \in Z} \varphi(x, z)

Under suitable conditions, the subdifferential exists and is the convex hull of a set of gradients:

\partial f(x) = \mathrm{conv}\{\nabla_x \varphi(x, z) : z \text{ achieves } \max_{z' \in Z} \varphi(x, z')\}
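To make this concrete, consider a hypothetical finite max of affine functions. The formula says the gradient of any maximizing piece is a valid subgradient, and the subdifferential is the convex hull of the active gradients:

```python
import numpy as np

# f(x) = max_i <a_i, x> + c_i, a pointwise max of affine pieces phi(x, i).
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # rows a_i (made up)
c = np.zeros(3)

def subgradient_of_max(x):
    vals = A @ x + c
    active = np.flatnonzero(np.isclose(vals, vals.max()))
    return A[active[0]]  # any active a_i is a subgradient of f at x

print(subgradient_of_max(np.array([2.0, -1.0])))  # -> [1. 0.]
```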