
This can be two lectures! Still need: examples of non-convex problems, applications for matrix factorization, and the Moreau decomposition x = prox_f(x) + prox_{f^*}(x); use it to get the prox of norms!

PROXIMAL METHODS

WHY PROXIMAL METHODS
Smooth functions: minimize f(x). Tools: gradient descent, Newton's method, quasi-Newton, conjugate gradients, etc.
Non-differentiable: minimize f(x). Tool: proximal methods.
Constrained problems: minimize f(x) subject to g(x) ≤ 0, h(x) = 0. Tool: Lagrangian methods.

PROXIMAL OPERATOR
 prox_f(z, t) = argmin_x f(x) + (1/(2t)) ‖x − z‖²
The first term is the objective; the second is the proximal penalty: stay close to z. The parameter t acts as a stepsize.
[Figure: prox_f(z, t) pulls z toward the minimizer of f.]

GRADIENT INTERPRETATION
 prox_f(z, t) = argmin_x f(x) + (1/(2t)) ‖x − z‖²
Assume f is differentiable. Setting the gradient to zero:
 ∇f(x) + (1/t)(x − z) = 0, so x = z − t ∇f(x).
Two types of gradient descent:
 forward gradient: x = z − t ∇f(z)
 backward gradient: x = z − t ∇f(x)

SUBDIFFERENTIAL
 prox_f(z, t) = argmin_x f(x) + (1/(2t)) ‖x − z‖²
If f is non-differentiable, the optimality condition becomes
 0 ∈ ∂f(x) + (1/t)(x − z), i.e. 0 = g + (1/t)(x − z) for some g ∈ ∂f(x).
Backward gradient descent: x = z − t g.

EXAMPLE: L1
 prox_f(z, t) = argmin_x |x| + (1/(2t)) ‖x − z‖²
Is the subdifferential unique? Is the solution unique?
[Figure: prox_f(z, t) plotted as a function of z.]

PROPERTIES
Backward gradient descent: prox_f(z, t) = argmin_x f(x) + (1/(2t)) ‖x − z‖², i.e. x = z − t g with g ∈ ∂f(x).
Forward gradient descent: x = z − t ∇f(z).
Existence: the backward step exists for any proper convex function (and some non-convex ones); the forward step exists only for smooth functions.
Uniqueness: the backward step always gives a unique result; the (sub)gradient used in the forward step is unique only if f is differentiable.

RESOLVENT
 prox_f(z, t) = argmin_x f(x) + (1/(2t)) ‖x − z‖²
For non-differentiable f:
 0 ∈ ∂f(x) + (1/t)(x − z)
 z ∈ t ∂f(x) + x
 z ∈ (t ∂f + I) x
 x = (t ∂f + I)⁻¹ z = prox_f(z, t)
The resolvent (t ∂f + I)⁻¹ is single-valued: the prox is unique.

EXAMPLE: L1
 prox_f(z, t) = argmin_x |x| + (1/(2t)) ‖x − z‖²
Optimality conditions:
 1 + (1/t)(x⋆ − z) = 0, if x⋆ > 0
 −1 + (1/t)(x⋆ − z) = 0, if x⋆ < 0
 (1/t)(x⋆ − z) ∈ [−1, 1], if x⋆ = 0
Solution:
 shrink(z, t) = z − t, if x⋆ > 0 (i.e. z > t)
 shrink(z, t) = z + t, if x⋆ < 0 (i.e. z < −t)
 shrink(z, t) = 0, otherwise
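In code, the scalar solution above applies componentwise. A minimal NumPy sketch of the shrink (soft-thresholding) operator:

```python
import numpy as np

def shrink(z, t):
    """Soft-thresholding: the prox of t*|.|, applied componentwise.
    Returns z - t for z > t, z + t for z < -t, and 0 otherwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
```

The same function handles scalars and arrays, which is convenient in the splitting methods below.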

THE RIGOROUS WAY
 prox_f(z, t) = argmin_x |x| + (1/(2t)) ‖x − z‖²
Use the dual representation |x| = max_{λ ∈ [−1, 1]} λx:
 min_x max_{λ ∈ [−1, 1]} λx + (1/(2t)) ‖x − z‖²
Swap min and max; optimality in x gives λ + (1/t)(x − z) = 0, i.e. x = z − tλ. Substituting yields the dual problem
 max_{λ ∈ [−1, 1]} λz − (t/2) λ²
with solution λ⋆ = min{max{z/t, −1}, 1}. Hence
 x = z − t λ⋆ = z − min{max{z, −t}, t} = shrink(z, t).

SHRINK OPERATOR
[Figure: plot of y = shrink(x, 1) on [−3, 3]: identically zero on [−1, 1], slope 1 outside.]

NUCLEAR NORM
 ‖X‖_* = Σ_i σ_i, the sum of the singular values.
Why would you use this regularizer?
If X is PSD, we call this the trace norm: ‖X‖_* = Σ_i σ_i = trace(X).

NUCLEAR NORM PROX
 minimize_X ‖X‖_* + (1/(2t)) ‖X − Z‖²
The solution has the same singular vectors as Z. Writing X = U S_X Vᵀ and Z = U S_Z Vᵀ:
 minimize ‖U S_X Vᵀ‖_* + (1/(2t)) ‖U S_X Vᵀ − U S_Z Vᵀ‖²
 = minimize ‖S_X‖_1 + (1/(2t)) ‖S_X − S_Z‖²
so S_X = shrink(S_Z, t) and X = U shrink(S_Z, t) Vᵀ.
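This singular value thresholding step is a few lines of NumPy; a sketch (prox_nuclear is my name for it, not notation from the slides):

```python
import numpy as np

def shrink(z, t):
    """Componentwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_nuclear(Z, t):
    """Prox of t*||.||_*: shrink the singular values, keep the singular vectors."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(shrink(s, t)) @ Vt
```

Singular values below t are zeroed, so the result is exactly low rank for large enough t.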

CHARACTERISTIC FUNCTIONS
Let C be some convex set, and
 χ_C(x) = 0, if x ∈ C; ∞, otherwise.
Proximal: minimize_x χ_C(x) + (1/(2t)) ‖x − z‖².
What does this do? Does the stepsize matter?

EXAMPLES
 2-norm ball: C = B_2 = {x : ‖x‖ ≤ 1}: prox_2(z) = (z/‖z‖) min{‖z‖, 1}
 infinity ball: C = B_∞ = {x : ‖x‖_∞ ≤ 1}: prox_∞(z)_i = min{max{z_i, −1}, 1}
 positive orthant: C = {x : x ≥ 0}: prox_+(z) = max{z, 0}
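These three projections are one-liners; a NumPy sketch (function names are mine):

```python
import numpy as np

def proj_l2_ball(z):
    """Project onto {x : ||x||_2 <= 1}: rescale if outside the ball."""
    return z / max(np.linalg.norm(z), 1.0)

def proj_linf_ball(z):
    """Project onto {x : ||x||_inf <= 1}: clip each component to [-1, 1]."""
    return np.clip(z, -1.0, 1.0)

def proj_nonneg(z):
    """Project onto {x : x >= 0}: zero out negative components."""
    return np.maximum(z, 0.0)
```

Note z/max{‖z‖, 1} is the same map as (z/‖z‖) min{‖z‖, 1}, but avoids dividing by zero at z = 0.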

HARDER EXAMPLES
 semidefinite cone: C = S_+ = {X : X ⪰ 0}: with the eigendecomposition Z = U S Uᵀ, prox_{S_+}(Z) = U max{S, 0} Uᵀ
 L1 ball: C = B_1 = {x : ‖x‖_1 ≤ 1}: prox_1(z) = ???

PROXIMAL-POINT METHOD
 minimize f(x)
Backward gradient descent:
 x^{k+1} = prox_f(x^k, t) = x^k − t g^{k+1}, g^{k+1} ∈ ∂f(x^{k+1})
 x^{k+1} = argmin_x f(x) + (1/(2t)) ‖x − x^k‖²
Stepsize restriction? Does it depend on the norm? Would you ever use this?

FORWARD-BACKWARD SPLITTING
 minimize_x g(x) + f(x)
g is non-smooth: handle it with a backward gradient step. f is smooth: handle it with a forward gradient step.
FBS:
 forward step: x̂ = x^k − t ∇f(x^k)
 backward step: x^{k+1} = prox_g(x̂, t)

WHY FORWARD-BACKWARD?
 minimize_x g(x) + f(x)
forward step: x̂ = x^k − t ∇f(x^k); backward step: x^{k+1} = prox_g(x̂, t). Combined:
 x^{k+1} = x^k − t ∇f(x^k) − t g^{k+1}, g^{k+1} ∈ ∂g(x^{k+1})
Fixed-point property:
 x⋆ = x⋆ − t ∇f(x⋆) − t g⋆ with g⋆ ∈ ∂g(x⋆) ⟺ 0 ∈ ∇f(x⋆) + ∂g(x⋆).
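The two steps fit in a short generic loop; prox_g and grad_f are caller-supplied, and a fixed stepsize t ≤ 1/L is assumed (all names are mine):

```python
import numpy as np

def fbs(prox_g, grad_f, x0, t, iters=500):
    """Forward-backward splitting: forward gradient step on f,
    backward (proximal) step on g."""
    x = x0
    for _ in range(iters):
        x_hat = x - t * grad_f(x)   # forward step
        x = prox_g(x_hat, t)        # backward step
    return x

# usage: minimize |x| + 0.5*(x - 2)^2; the minimizer is shrink(2, 1) = 1
shrink = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
x_star = fbs(shrink, lambda x: x - 2.0, np.array([0.0]), t=1.0)
```

In the usage example ∇f is 1-Lipschitz, so t = 1 satisfies the stepsize restriction.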

GRADIENT INTERPRETATION
 forward step: x̂ = x^k − t ∇f(x^k)
 backward step: x^{k+1} = prox_g(x̂, t), i.e. x^{k+1} = x^k − t ∇f(x^k) − t g^{k+1} with g^{k+1} ∈ ∂g(x^{k+1})
[Figure: the forward step moves x^k to x̂; the backward step pulls x̂ toward x⋆.]

MAJORIZATION-MINIMIZATION
 minimize h(x)
Surrogate function m with m(x^k) = h(x^k) and m(x) ≥ h(x) for all x: the surrogate majorizes h.
[Figure: m sits on top of h, touching it at x^k; minimizing m gives x^{k+1}.]

MM ALGORITHM
 minimize h(x) = g(x) + f(x)
 f(x) ≤ f(x^k) + ⟨x − x^k, ∇f(x^k)⟩ + (1/(2t)) ‖x − x^k‖², valid for t ≤ 1/L.
[Figure: the quadratic upper bound m touches h at x^k.]

MM ALGORITHM
 minimize h(x) = g(x) + f(x), with f(x) ≤ f(x^k) + ⟨x − x^k, ∇f(x^k)⟩ + (1/(2t)) ‖x − x^k‖².
Minimize the surrogate:
 minimize_x m(x) = g(x) + f(x^k) + ⟨x − x^k, ∇f(x^k)⟩ + (1/(2t)) ‖x − x^k‖²
 = minimize_x g(x) + (1/(2t)) ‖x − x^k + t ∇f(x^k)‖² (+ constant)
whose solution is prox_g(x^k − t ∇f(x^k), t): one FBS step.

PROXIMAL POINT INTERPRETATION
 minimize h(x) = g(x) + f(x), with f(x) ≤ f(x^k) + ⟨x − x^k, ∇f(x^k)⟩ + (1/(2t)) ‖x − x^k‖² for t ≤ 1/L.
Build a proximal penalty / distance measure:
 d(x, x^k) = (1/(2t)) ‖x − x^k‖² − f(x) + ⟨x − x^k, ∇f(x^k)⟩ + f(x^k) ≥ 0.
Then
 minimize_x h(x) + d(x, x^k) = minimize_x g(x) + (1/(2t)) ‖x − x^k + t ∇f(x^k)‖² (+ constant)
whose solution is again prox_g(x^k − t ∇f(x^k), t).

SPARSE LEAST SQUARES
 minimize g(x) + f(x): minimize µ‖x‖_1 + (1/2) ‖Ax − b‖², g non-smooth, f smooth.
forward step:
 x̂ = x^k − t ∇f(x^k) = x^k − t Aᵀ(Ax^k − b)
backward step:
 x^{k+1} = argmin_x g(x) + (1/(2t)) ‖x − x̂‖² = argmin_x µ‖x‖_1 + (1/(2t)) ‖x − x̂‖²
Iterative shrinkage/thresholding:
 x^{k+1} = shrink(x̂, µt)
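A self-contained sketch of this iteration (ISTA), with the stepsize t = 1/‖AᵀA‖ chosen as an assumption:

```python
import numpy as np

def ista(A, b, mu, iters=2000):
    """Iterative shrinkage/thresholding for mu*||x||_1 + 0.5*||Ax - b||^2."""
    t = 1.0 / np.linalg.norm(A.T @ A, 2)   # t <= 1/L with L = ||A^T A||
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x_hat = x - t * A.T @ (A @ x - b)                             # forward step
        x = np.sign(x_hat) * np.maximum(np.abs(x_hat) - mu * t, 0.0)  # shrink
    return x
```

When A = I this reduces to a single shrink of b, which makes a handy sanity check.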

SPARSE LOGISTIC
 minimize g(x) + f(x): minimize µ‖x‖_1 + L(Ax, y), where L(z, y) = Σ_i log(1 + e^{−y_i z_i}).
forward step:
 x̂ = x^k − t ∇f(x^k) = x^k − t Aᵀ ∇L(Ax^k, y)
backward step:
 x^{k+1} = argmin_x µ‖x‖_1 + (1/(2t)) ‖x − x̂‖² = shrink(x̂, µt)

EXAMPLE: DEMOCRATIC REPRESENTATIONS
 minimize (1/2) ‖Ax − b‖² subject to ‖x‖_∞ ≤ µ
 ⟺ minimize χ_{‖·‖_∞ ≤ µ}(x) + (1/2) ‖Ax − b‖²
forward step:
 x̂ = x^k − t Aᵀ(Ax^k − b)
backward step:
 x^{k+1} = argmin_x χ_{‖·‖_∞ ≤ µ}(x) + (1/(2t)) ‖x − x̂‖² = min{max{x̂, −µ}, µ}

EXAMPLE: LASSO
 minimize (1/2) ‖Ax − b‖² subject to ‖x‖_1 ≤ µ (implicit constraints)
 ⟺ minimize χ_{‖·‖_1 ≤ µ}(x) + (1/2) ‖Ax − b‖²
forward step:
 x̂ = x^k − t Aᵀ(Ax^k − b)
backward step:
 x^{k+1} = argmin_x χ_{‖·‖_1 ≤ µ}(x) + (1/(2t)) ‖x − x̂‖²

TOTAL VARIATION
 minimize µ‖∇x‖_1 + (1/2) ‖x − f‖²
Dualize the easy way: µ|z| = max_{λ ∈ [−1, 1]} µλz = max_{λ ∈ [−µ, µ]} λz. Why?
 min_x max_{λ ∈ [−µ, µ]} ⟨λ, ∇x⟩ + (1/2) ‖x − f‖²
 = max_{λ ∈ [−µ, µ]} min_x ⟨∇ᵀλ, x⟩ + (1/2) ‖x − f‖²
Optimality condition for the inner minimization: ∇ᵀλ + x − f = 0, so x = f − ∇ᵀλ. Substituting:
 max_{λ ∈ [−µ, µ]} ⟨∇ᵀλ, f − ∇ᵀλ⟩ + (1/2) ‖∇ᵀλ‖²
 = max_{λ ∈ [−µ, µ]} ⟨∇ᵀλ, f⟩ − (1/2) ‖∇ᵀλ‖²
 = max_{λ ∈ [−µ, µ]} −(1/2) ‖∇ᵀλ − f‖² (+ constant)

FBS FOR TV
 Primal: minimize µ‖∇x‖_1 + (1/2) ‖x − f‖². Dual: max_{λ ∈ [−µ, µ]} −(1/2) ‖∇ᵀλ − f‖², with x = f − ∇ᵀλ.
With χ_µ(λ) = 0 if λ ∈ [−µ, µ], ∞ otherwise, the dual becomes
 minimize_λ χ_µ(λ) + (1/2) ‖∇ᵀλ − f‖²
forward step:
 λ̂ = λ^k − t ∇(∇ᵀλ^k − f)
backward step:
 λ^{k+1} = argmin_λ χ_µ(λ) + (1/(2t)) ‖λ − λ̂‖² = min{max{−µ, λ̂}, µ}
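A 1-D sketch of this dual iteration, using a forward-difference matrix D in place of ∇ and the safe stepsize t = 1/4 ≤ 1/‖DDᵀ‖ (the discretization and stepsize are my assumptions):

```python
import numpy as np

def tv_denoise_1d(f, mu, iters=1000, t=0.25):
    """Solve min_x mu*||Dx||_1 + 0.5*||x - f||^2 via FBS on the dual,
    where D is the 1-D forward-difference operator."""
    n = len(f)
    D = np.diff(np.eye(n), axis=0)                 # (n-1) x n: (Dx)_i = x_{i+1} - x_i
    lam = np.zeros(n - 1)
    for _ in range(iters):
        lam_hat = lam - t * D @ (D.T @ lam - f)    # forward step on the dual
        lam = np.clip(lam_hat, -mu, mu)            # backward step: box projection
    return f - D.T @ lam                           # recover the primal solution
```

For large mu the box constraint is inactive and the output flattens toward the mean of f; for mu = 0 it returns f unchanged.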

SVM
 minimize_w (1/2) ‖w‖² + C h(LDw), where h(z) = Σ_i max{1 − z_i, 0} (hinge loss)
Use the trick C max{1 − z, 0} = max_{λ ∈ [0, C]} λ(1 − z):
 min_w max_{λ ∈ [0, C]} (1/2) ‖w‖² + ⟨λ, 1 − LDw⟩
Optimality in w: w − DᵀLλ = 0.
Dual:
 maximize −(1/2) ‖DᵀLλ‖² + 1ᵀλ subject to λ ∈ [0, C]

SVM
 maximize −(1/2) ‖DᵀLλ‖² + 1ᵀλ subject to λ ∈ [0, C]
forward ASCENT step:
 λ̂ = λ^k − t (LDDᵀLλ^k − 1)
backward step:
 λ^{k+1} = min{max{0, λ̂}, C}

NETFLIX PROBLEM
[Figure: a movies-by-people ratings matrix with missing entries.]
Data imputation: fill in the missing values. Low-rank assumption: everyone is described by their affinity for horror, action, rom-com, etc. This is matrix completion.

NUCLEAR NORM FORM
 minimize rank(X) + (1/2) ‖M ⊙ X − D‖² (M the mask of observed entries, D the data, ⊙ coordinate-wise multiplication)
relax:
 minimize ‖X‖_* + (1/2) ‖M ⊙ X − D‖²
forward step:
 X̂ = X^k − t M ⊙ (M ⊙ X^k − D) (why no transpose?)
backward step:
 X^{k+1} = argmin_X ‖X‖_* + (1/(2t)) ‖X − X̂‖² (what does this do?)

MATRIX FACTORIZATION
 minimize_{X,Y} (1/2) ‖XY − D‖² subject to X, Y ≥ 0
forward step:
 X̂ = X^k − t (X^k Y^k − D)(Y^k)ᵀ
 Ŷ = Y^k − t (X^k)ᵀ(X^k Y^k − D)
backward step:
 X^{k+1} = max{X̂, 0}, Y^{k+1} = max{Ŷ, 0}
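A sketch of this alternating projected-gradient scheme on synthetic data; the fixed stepsize, iteration count, and random initialization are my choices, not prescriptions from the slides:

```python
import numpy as np

def nmf_pgd(D, r, iters=2000, t=0.01, seed=0):
    """Alternating forward (gradient) / backward (projection) steps for
    minimize 0.5*||XY - D||^2 subject to X >= 0, Y >= 0."""
    rng = np.random.default_rng(seed)
    X = rng.random((D.shape[0], r))
    Y = rng.random((r, D.shape[1]))
    for _ in range(iters):
        R = X @ Y - D                           # residual
        X = np.maximum(X - t * R @ Y.T, 0.0)    # forward step in X, then project
        R = X @ Y - D
        Y = np.maximum(Y - t * X.T @ R, 0.0)    # forward step in Y, then project
    return X, Y
```

The problem is non-convex, so this only finds a local solution; as noted at the top of the transcript, non-convex examples deserve more attention.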

CONVERGENCE
 minimize h(x) = g(x) + f(x)
Theorem. Suppose the stepsize for FBS satisfies t ≤ 2/L, where L is a Lipschitz constant for ∇f. Then h(x^k) − h(x⋆) ≤ O(1/k).
The stepsize restriction does not depend on g. Proved by reduction to the proximal-point method.

GRADIENT VS. NESTEROV (Nemirovski and Yudin)
[Figure: gradient descent iterates x^0, x^1, x^2, x^3, x^4 zig-zag toward the optimum; the Nesterov iterates take a smoother, accelerated path.]
Gradient: O(1/k). Nesterov: O(1/k²), which is optimal.

FISTA
Fast iterative shrinkage/thresholding (Beck and Teboulle '09)
 x^{k+1} = prox_g(y^k − t ∇f(y^k), t)
 α^{k+1} = (1 + √(1 + 4(α^k)²)) / 2
 y^{k+1} = x^{k+1} + ((α^k − 1)/α^{k+1}) (x^{k+1} − x^k)  (momentum)
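A sketch of FISTA exactly as stated above; prox_g and grad_f are supplied by the caller and a fixed stepsize t ≤ 1/L is assumed:

```python
import numpy as np

def fista(prox_g, grad_f, x0, t, iters=200):
    """FISTA: an FBS step taken at the extrapolated point y, plus momentum."""
    x, y, alpha = x0, x0, 1.0
    for _ in range(iters):
        x_new = prox_g(y - t * grad_f(y), t)                  # FBS step at y
        alpha_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * alpha**2))
        y = x_new + ((alpha - 1.0) / alpha_new) * (x_new - x)  # momentum
        x, alpha = x_new, alpha_new
    return x
```

Swapping this loop in for the plain FBS/ISTA iteration improves the rate from O(1/k) to O(1/k²).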

SPARSA / FASTA
Idea: an adaptive stepsize outperforms acceleration.
Model f(x) ≈ (α/2) xᵀx. Secant equation: ∇f(x^{k+1}) − ∇f(x^k) = α (x^{k+1} − x^k).
Write Δg = ∇f(x^{k+1}) − ∇f(x^k) and Δx = x^{k+1} − x^k. The least-squares solution of
 minimize_α (1/2) ‖α Δx − Δg‖²
is α = (Δxᵀ Δg)/‖Δx‖², so the stepsize is t = 1/α = ‖Δx‖²/(Δxᵀ Δg).
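This least-squares stepsize is the Barzilai-Borwein (spectral) rule; a sketch:

```python
import numpy as np

def bb_stepsize(x_new, x_old, g_new, g_old):
    """Barzilai-Borwein stepsize from the secant equation:
    t = ||dx||^2 / (dx . dg), with dx and dg the iterate and gradient changes."""
    dx = x_new - x_old
    dg = g_new - g_old
    return float(dx @ dx) / float(dx @ dg)
```

On a quadratic f(x) = (c/2)‖x‖² the gradient change is exactly c Δx, so the rule recovers t = 1/c.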

BACKTRACKING
Based on the MM interpretation: require
 f(x^{k+1}) < f(x^k) + ⟨x^{k+1} − x^k, ∇f(x^k)⟩ + (1/(2t_k)) ‖x^{k+1} − x^k‖²
Here 1/t_k behaves like a Lipschitz constant.
FBS + backtracking:
 Choose stepsize t_k
 x^{k+1} = prox_g(x^k − t_k ∇f(x^k), t_k)
 While f(x^{k+1}) > f(x^k) + ⟨x^{k+1} − x^k, ∇f(x^k)⟩ + (1/(2t_k)) ‖x^{k+1} − x^k‖²:
  t_k ← t_k/2
  x^{k+1} = prox_g(x^k − t_k ∇f(x^k), t_k)
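One backtracking FBS step, sketched directly from the pseudocode above (halving the stepsize until the majorization condition holds):

```python
import numpy as np

def fbs_backtracking_step(f, grad_f, prox_g, x, t):
    """One FBS step; halve the stepsize until the quadratic
    upper-bound (majorization) condition is satisfied."""
    fx, gx = f(x), grad_f(x)
    while True:
        x_new = prox_g(x - t * gx, t)
        d = x_new - x
        if f(x_new) <= fx + gx @ d + (d @ d) / (2.0 * t):
            return x_new, t        # accept the step and the final stepsize
        t /= 2.0                   # condition violated: halve and retry
```

No Lipschitz constant is needed in advance: for an L-smooth f the loop must terminate once t drops below 1/L.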

STOPPING CONDITIONS
 minimize h(x) = g(x) + f(x)
Stop when the residual is small... but what's the residual?
From x^{k+1} = argmin_x g(x) + (1/(2t)) ‖x − x̂‖²:
 0 ∈ ∂g(x^{k+1}) + (1/t)(x^{k+1} − x̂) ⟹ (1/t)(x̂ − x^{k+1}) ∈ ∂g(x^{k+1})
Form the residual:
 r^{k+1} = (1/t)(x̂ − x^{k+1}) + ∇f(x^{k+1}) ∈ ∂h(x^{k+1})

HOW SMALL IS SMALL? RELATIVE RESIDUAL
 minimize h(x) = g(x) + f(x), residual r^{k+1} = (1/t)(x̂ − x^{k+1}) + ∇f(x^{k+1}) ∈ ∂h(x^{k+1})
Stop when
 r^{k+1} = (1/t)(x̂ − x^{k+1}) + ∇f(x^{k+1}) ≈ 0,
or equivalently when
 (1/t)(x̂ − x^{k+1}) ≈ −∇f(x^{k+1}).
Relative residual:
 ‖r^{k+1}‖ / max{‖(1/t)(x̂ − x^{k+1})‖, ‖∇f(x^{k+1})‖}