A Tutorial on Primal-Dual Algorithm
1 A Tutorial on Primal-Dual Algorithm. Shenlong Wang, University of Toronto. March 31
2 Energy minimization: MAP inference for MRFs
Typical energies consist of a regularization term and a data term, and are used for a wide range of problems:
$\min_x \{ E(x) = R(x) + D(x) \}$
The minimizer provides the best configuration for the problem. The energy is related to the posterior probability via a Gibbs distribution:
$p(x) = \frac{1}{Z} \exp(-E(x))$
3 Energy minimization: discrete vs. continuous MRF setting
Depending on the output space, we have:
Discrete setting: each variable $y_i$ takes a label from a discrete label set $U \subset \mathbb{Z}$.
Continuous setting: each variable $y_i$ is a continuous value from $U \subseteq \mathbb{R}$.
We will focus on the continuous setting today.
4 Why not black-box convex optimization?
A generic iterative algorithm for convex optimization:
1. Pick any initial vector $x_0 \in \mathbb{R}^n$, set $k = 0$
2. Compute a search direction $d_k \in \mathbb{R}^n$
3. Choose a step size $\alpha_k$ such that $f(x_k + \alpha_k d_k) < f(x_k)$
4. Set $x_{k+1} = x_k + \alpha_k d_k$, $k = k + 1$
5. Stop if converged, else go to 2
Plug-and-play, with lots of choices: steepest descent, conjugate gradient, Newton, quasi-Newton (e.g. L-BFGS). These methods do not use the structure of the problem, and thus may not be the most efficient choice. And what if it is difficult to compute $d_k$? [Slide image from Cremers (2014)]
5 Notations
Variational imaging people use different notation:
$\min_u \int_\Omega |\nabla u| + \frac{\lambda}{2}\|u - f\|_2^2 \, dx$
where $u : \Omega \to \mathbb{R}$ is a function over the continuous image space $\Omega \subset \mathbb{R}^n$ and the objective is a functional of $u$. Keep in mind what it looks like in our language:
$\min_x \|Gx\|_1 + \frac{\lambda}{2}\|Kx - f\|_2^2$
6 Warm-up: semi-continuity
A function $f(x)$ is called lower (upper) semi-continuous at a point $x_0$ if function values for arguments near $x_0$ are either close to $f(x_0)$ or greater than (less than) $f(x_0)$. See the cited reference for a formal definition.
The floor function $\lfloor x \rfloor$ is upper semi-continuous; the ceiling function $\lceil x \rceil$ is lower semi-continuous. The indicator function of any open set is lower semi-continuous; the indicator function of a closed set is upper semi-continuous.
Semi-continuity is used to convert inequality constraints into the objective function.
7 Warm-up: Lipschitz continuity
A function $f(x)$ is called Lipschitz continuous on $B^n$ if:
$|f(x) - f(y)| \le L\|x - y\|, \quad \forall x, y \in B^n$
where the constant $L$ is an upper bound on the maximum steepness of $f(x)$.
Lipschitz continuity is stronger than continuity, but weaker than continuous differentiability. $f(x) = |x|$ is Lipschitz continuous but not continuously differentiable.
8 Warm-up: convex conjugate
The convex conjugate $f^*(y)$ of a function $f(x)$ is defined as:
$f^*(y) = \sup_{x \in \mathrm{dom}\, f} \langle x, y \rangle - f(x)$
Each pair $(y, f^*(y))$ corresponds to a tangent line of the function. Example, for $f(x) = |x|$:
$f^*(y) = \sup_x \langle x, y \rangle - |x| = \begin{cases} 0 & \text{if } |y| \le 1 \\ \infty & \text{otherwise} \end{cases}$
[Image from Wikipedia]
9 Warm-up: proximal operator
The proximal operator (or proximal mapping) of a convex function $f$ is:
$\mathrm{prox}_f(x) = \arg\min_u \left( f(u) + \frac{1}{2}\|u - x\|_2^2 \right)$
$f$ can be nonsmooth, have embedded constraints, etc. Evaluating $\mathrm{prox}_f$ involves solving a convex optimization problem, but it often has an analytic solution or a simple linear-time algorithm.
10 Warm-up: proximal operator examples
Quadratic function $f(x) = \frac{1}{2}x^T P x + q^T x + r$ ($P \succeq 0$):
$\mathrm{prox}_f(x) = (I + P)^{-1}(x - q)$
$\ell_1$ norm $f(x) = \|x\|_1$ (soft thresholding):
$\mathrm{prox}_f(x)_i = \begin{cases} x_i - 1 & x_i \ge 1 \\ 0 & |x_i| \le 1 \\ x_i + 1 & x_i \le -1 \end{cases}$
Logarithmic barrier $f(x) = -\sum_{i=1}^n \log x_i$:
$\mathrm{prox}_f(x)_i = \frac{x_i + \sqrt{x_i^2 + 4}}{2}$
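As a quick sanity check, both of the nonsmooth proximal operators above are one-liners in numpy. The sketch below is illustrative only; the function names and the unit threshold are my choices, not part of the tutorial:

```python
import numpy as np

def prox_l1(x, t=1.0):
    # Soft thresholding: prox of t*||x||_1, applied elementwise.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_log_barrier(x, t=1.0):
    # Prox of -t*sum(log(x_i)): positive root of u^2 - x*u - t = 0.
    return (x + np.sqrt(x**2 + 4.0 * t)) / 2.0

x = np.array([-2.0, -0.5, 0.3, 1.5])
print(prox_l1(x))           # [-1.  0.  0.  0.5]
print(prox_log_barrier(x))  # all entries strictly positive
```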
11 Warm-up: useful properties of the proximal operator
If $f$ is closed and convex then $\mathrm{prox}_f(x)$ exists and is unique for all $x$.
Separable sum: if $f(x) = \sum_{i=1}^N f_i(x_i)$, then $(\mathrm{prox}_f(x))_i = \mathrm{prox}_{f_i}(x_i)$.
Fixed point: the point $x^\star$ minimizes $f$ if and only if $x^\star$ is a fixed point: $x^\star = \mathrm{prox}_f(x^\star)$.
Scaling and translation: define $h(x) = f(tx + b)$ with $t \ne 0$; then $\mathrm{prox}_h(x) = \frac{1}{t}\left(\mathrm{prox}_{t^2 f}(tx + b) - b\right)$.
Conjugate (Moreau decomposition): $\mathrm{prox}_{tf^*}(x) = x - t\,\mathrm{prox}_{f/t}(x/t)$.
These properties are the keys to designing parallel optimization methods. (A numeric check of the conjugate identity follows below.)
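The conjugate identity is easy to verify numerically. For $f = \|\cdot\|_1$ the conjugate is the indicator of the $\ell_\infty$ unit ball, whose prox is clipping; a minimal check of my own, for illustration:

```python
import numpy as np

def prox_l1(x, t):
    # Soft thresholding = prox of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# prox_{t f*}(x) = x - t * prox_{f/t}(x/t), with f = ||.||_1:
# the left side is clipping to [-1, 1], i.e. the prox of the
# indicator of the l-infinity unit ball.
x, t = np.random.randn(5), 0.7
lhs = np.clip(x, -1.0, 1.0)
rhs = x - t * prox_l1(x / t, 1.0 / t)
assert np.allclose(lhs, rhs)
```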
12 Proximal gradient method: problem
$\min_x f(x) = g(x) + h(x)$
$g$ convex, differentiable, with $\nabla g$ Lipschitz continuous with constant $L$
$h$ convex, possibly nondifferentiable; $\mathrm{prox}_h$ is inexpensive
This rules out many methods, e.g. conjugate gradient.
E.g. lasso: $\min_x \|x\|_1 + \frac{\lambda}{2}\|Ax + b\|_2^2$
13 Proximal gradient method
Same setting: $g$ convex and differentiable with $\nabla g$ Lipschitz (constant $L$); $h$ convex, possibly nondifferentiable, with inexpensive $\mathrm{prox}_h$.
Proximal gradient algorithm:
$x^{(t)} = \mathrm{prox}_{\tau_t h}\left(x^{(t-1)} - \tau_t \nabla g(x^{(t-1)})\right)$
$O(1/N)$ convergence rate, i.e. to get $f(x^{(k)}) - f(x^\star) \le \epsilon$, we need $O(1/\epsilon)$ iterations. (A minimal implementation for the lasso follows below.)
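A minimal numpy sketch of this iteration for the lasso example above (ISTA). The function names and the fixed step size $\tau = 1/L$ are my own choices; the soft threshold is the prox of $\tau\|\cdot\|_1$ from the warm-up:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, n_iter=500):
    """Proximal gradient for min_x ||x||_1 + (lam/2) ||Ax + b||_2^2."""
    x = np.zeros(A.shape[1])
    L = lam * np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    tau = 1.0 / L                                # fixed step size
    for _ in range(n_iter):
        grad = lam * A.T @ (A @ x + b)           # gradient of the smooth term
        x = soft_threshold(x - tau * grad, tau)  # prox of tau * ||.||_1
    return x
```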
14 Proximal gradient method: accelerated gradient method
Same setting as above. Accelerated proximal gradient algorithm [Nesterov (2004), Beck and Teboulle (2009)]:
$x^{(t)} = \mathrm{prox}_{\tau_t h}\left(y^{(t-1)} - \tau_t \nabla g(y^{(t-1)})\right)$
$y^{(t)} = x^{(t)} + \frac{t-1}{t+2}\left(x^{(t)} - x^{(t-1)}\right)$
$O(1/N^2)$ convergence rate for a first-order method! I.e. to get $f(x^{(k)}) - f(x^\star) \le \epsilon$, we only need $O(1/\sqrt{\epsilon})$ iterations.
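The acceleration is a two-line change to the ISTA sketch above, adding the extrapolation sequence $y$ (again an illustrative sketch, reusing soft_threshold from before):

```python
def fista(A, b, lam, n_iter=500):
    """Accelerated proximal gradient (FISTA-style) for the same lasso problem."""
    x = x_prev = np.zeros(A.shape[1])
    tau = 1.0 / (lam * np.linalg.norm(A, 2) ** 2)
    for t in range(1, n_iter + 1):
        y = x + (t - 1) / (t + 2) * (x - x_prev)   # momentum / extrapolation
        grad = lam * A.T @ (A @ y + b)
        x_prev, x = x, soft_threshold(y - tau * grad, tau)
    return x
```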
15 Proximal gradient method: experiments
[Figure: proximal gradient vs. accelerated proximal gradient; image from Gordon & Tibshirani (2012)]
16 Motivation: energy minimization
$\min_{x \in X} f(Kx) + g(x)$
$K$ is a linear and continuous operator; $X$ is a Hilbert space.
$f : X \to \mathbb{R} \cup \{\infty\}$ and $g : X \to \mathbb{R} \cup \{\infty\}$ are convex, not necessarily continuous or differentiable, but their proximal operators are inexpensive.
17 Motivation: examples
Many problems fall within this subclass:
Lasso: $\min_x \|x\|_1 + \frac{\lambda}{2}\|Ax + b\|_2^2$
Low-level vision: $\min_{u,v} \|\nabla u\|_1 + \|\nabla v\|_1 + \lambda\|\rho(u, v)\|_1$
Linear programming: $\min_x \langle c, x \rangle$ subject to $Ax = b$, $x \ge 0$
SVM: $\min_{w,b} \|w\|_2^2 + \sum_{i=1}^n \max(0,\, 1 - y_i(\langle w, x_i \rangle + b))$
18 Primal-dual algorithm: problem formulation
$\min_{x \in X} f(Kx) + g(x)$ (primal)
Recalling the convex conjugate, $f(Kx) = \max_y \langle Kx, y \rangle - f^*(y)$, we have:
$\min_{x \in X} \max_{y \in Y} \langle Kx, y \rangle + g(x) - f^*(y)$ (primal-dual)
$\max_{y \in Y} -\left(f^*(y) + g^*(-K^* y)\right)$ (dual)
Primal-dual gap: $f(Kx) + g(x) + f^*(y) + g^*(-K^* y)$
For convex functions, the primal-dual gap is 0 at the optimal solution.
19 Primal-dual algorithm: optimality condition
Today we focus on the primal-dual problem:
$\min_{x \in X} \max_{y \in Y} \langle Kx, y \rangle + g(x) - f^*(y)$ (primal-dual)
Why primal-dual? The proximal operator of $f(Kx)$ is not trivial, but we can obtain the proximal operators of $f^*$ and $g$ easily.
A saddle point $(\hat{x}, \hat{y}) \in X \times Y$ of this min-max problem should satisfy:
$K\hat{x} - \partial f^*(\hat{y}) \ni 0$
$K^*\hat{y} + \partial g(\hat{x}) \ni 0$
We iterate according to this condition.
20 Primal-dual algorithm: the algorithm [Chambolle and Pock (2011); Pock, Cremers, Bischof, Chambolle (2009)]
$\min_{x \in X} \max_{y \in Y} \langle Kx, y \rangle + g(x) - f^*(y)$
Choose step sizes $\sigma > 0$ and $\tau > 0$ such that $\sigma\tau L^2 < 1$, where $L = \|K\|$, and $\theta \in [0, 1]$. Choose an initialization $(x^0, y^0)$. For each iteration:
$y^{(n+1)} = \mathrm{prox}_{\sigma f^*}\left(y^{(n)} + \sigma K \bar{x}^{(n)}\right)$ (dual proximal)
$x^{(n+1)} = \mathrm{prox}_{\tau g}\left(x^{(n)} - \tau K^* y^{(n+1)}\right)$ (primal proximal)
$\bar{x}^{(n+1)} = x^{(n+1)} + \theta\left(x^{(n+1)} - x^{(n)}\right)$ (extrapolation)
Essentially, we alternate proximal gradient ascent in $y$ and proximal gradient descent in $x$.
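The whole algorithm fits in a few lines once $K$, $K^*$, and the two proximal operators are given as callables. A minimal generic sketch (the names are mine; the step sizes must satisfy $\sigma\tau\|K\|^2 < 1$):

```python
import numpy as np

def chambolle_pock(K, Kt, prox_sigma_fstar, prox_tau_g, x0, y0,
                   sigma, tau, theta=1.0, n_iter=200):
    """Generic first-order primal-dual iteration (Chambolle-Pock style).

    K, Kt: the linear operator and its adjoint, as callables.
    prox_sigma_fstar, prox_tau_g: proximal operators of sigma*f^* and tau*g.
    """
    x, y = x0.copy(), y0.copy()
    x_bar = x.copy()
    for _ in range(n_iter):
        y = prox_sigma_fstar(y + sigma * K(x_bar))    # dual proximal step
        x_new = prox_tau_g(x - tau * Kt(y))           # primal proximal step
        x_bar = x_new + theta * (x_new - x)           # extrapolation
        x = x_new
    return x, y
```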
21 Primal-dual algorithm: convergence
The algorithm's convergence rate depends on the type of problem [see Chambolle and Pock (2011) for detailed proofs]:
Completely nonsmooth problem: $O(1/N)$ for the duality gap.
Sum of a smooth and a nonsmooth term: $O(1/N^2)$ for $\|x - x^\star\|^2$.
Completely smooth problem: $O(\omega^N)$ with $\omega < 1$ for $\|x - x^\star\|^2$.
22 Discussion: parallel implementation
$y^{(n+1)} = \mathrm{prox}_{\sigma f^*}\left(y^{(n)} + \sigma K \bar{x}^{(n)}\right)$ (dual proximal)
$x^{(n+1)} = \mathrm{prox}_{\tau g}\left(x^{(n)} - \tau K^* y^{(n+1)}\right)$ (primal proximal)
$\bar{x}^{(n+1)} = x^{(n+1)} + \theta\left(x^{(n+1)} - x^{(n)}\right)$ (extrapolation)
For the problems we usually have in vision: $x$ and $y$ are defined on a regular grid; $f$ and $g$ are usually separable sums; and only a small number of variables is involved in the gradient part $Kx$ (high-order potential). Perfect for GPU parallel computing!
23 Discussion: Arrow-Hurwicz method ($\theta = 0$) [Arrow, Hurwicz, Uzawa (1958)]
$y^{(n+1)} = \mathrm{prox}_{\sigma f^*}\left(y^{(n)} + \sigma K x^{(n)}\right)$ (dual proximal)
$x^{(n+1)} = \mathrm{prox}_{\tau g}\left(x^{(n)} - \tau K^* y^{(n+1)}\right)$ (primal proximal)
This also tackles the primal-dual problem, but without the momentum step. Theoretically it loses the $O(1/N^2)$ convergence guarantee (or people haven't proved it yet), but in practice it is still fast for some problems.
24 Discussion: alternating direction method of multipliers (ADMM)
$\min_{x \in X} f(Kx) + g(x)$ (primal)
We can conduct a decomposition:
$\min_{x \in X} f(y) + g(x)$ subject to $Kx - y = 0$
and solve the following augmented Lagrangian problem:
$L_\tau(x, y, z) = f(y) + g(x) + z^T(Kx - y) + \frac{\tau}{2}\|Kx - y\|_2^2$
$y^{(n+1)} = \arg\min_y f(y) - \langle y, z^{(n)} \rangle + \frac{\tau}{2}\|Kx^{(n)} - y\|_2^2$ (primal)
$x^{(n+1)} = \arg\min_x g(x) + \langle Kx, z^{(n)} \rangle + \frac{\tau}{2}\|Kx - y^{(n+1)}\|_2^2$ (primal)
$z^{(n+1)} = z^{(n)} + \tau(Kx^{(n+1)} - y^{(n+1)})$ (dual)
The primal-dual method is equivalent to ADMM when $K = I$. But in the general case primal-dual is usually faster, since solving the subproblems of ADMM is harder.
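To make the "harder subproblems" point concrete, here is a sketch of ADMM for the ROF-style instance $\min_x \|Kx\|_1 + \frac{\lambda}{2}\|x - u\|_2^2$, with $K$ given as an explicit matrix (an illustrative setup of my own): the $y$-step is a cheap soft threshold, but the $x$-step requires solving a linear system in $\lambda I + \tau K^T K$.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_rof(K, u, lam, tau=1.0, n_iter=200):
    """ADMM sketch for min_x ||Kx||_1 + (lam/2)||x - u||_2^2 with split y = Kx."""
    m, n = K.shape
    x, y, z = np.zeros(n), np.zeros(m), np.zeros(m)
    M = lam * np.eye(n) + tau * K.T @ K        # x-step system matrix, fixed
    for _ in range(n_iter):
        # y-step: prox of ||.||_1 (soft thresholding)
        y = soft_threshold(K @ x + z / tau, 1.0 / tau)
        # x-step: a linear system; this is the "harder subproblem"
        x = np.linalg.solve(M, lam * u - K.T @ z + tau * K.T @ y)
        # z-step: dual ascent on the constraint Kx - y = 0
        z = z + tau * (K @ x - y)
    return x
```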
25 Discussion: experimental results
[Table from Chambolle and Pock (2010): ROF model on a grayscale image, run to error tolerance $\epsilon$]
AHZC: Arrow-Hurwicz primal-dual method
FISTA: fast iterative shrinkage-thresholding, $O(1/N^2)$ convergence rate
NEST: Nesterov's method on the dual problem, $O(1/N^2)$ convergence rate
ADMM: alternating direction method of multipliers
PGM: proximal gradient method, $O(1/N)$ convergence rate
CFP: fixed-point method on the dual
26 Example: image denoising
ROF model [Rudin, Osher, Fatemi (1992)] for image denoising:
$\min_x \|\nabla x\|_1 + \frac{\lambda}{2}\|x - u\|_2^2$
Here $f(\nabla x) = \sum_i \|\nabla x_i\|$, where $\nabla x_i$ is the two-dimensional intensity gradient vector at image pixel $i$, and
$f^*(y) = \delta_P(y) = \begin{cases} 0 & y \in P \\ \infty & y \notin P \end{cases}, \quad P = \{p : \forall i,\ \|p_i\| \le 1\}$
The proximal operator of a convex-set indicator function is just the Euclidean projection onto the feasible closed set $P$. Thus:
$\mathrm{prox}_{f^*}(y)_i = \frac{y_i}{\max(\|y_i\|, 1)}$
The proximal operator of $g(x) = \frac{\lambda}{2}\|x - u\|_2^2$ is in closed form:
$\mathrm{prox}_{\tau g}(x) = \frac{x + \lambda\tau u}{1 + \lambda\tau}$
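Putting the pieces together, here is a sketch of the primal-dual iteration for this ROF model, assuming the standard forward-difference discretization of the gradient (for which $\|\nabla\|^2 \le 8$); the operator definitions and parameter choices are mine, not lifted from the slides:

```python
import numpy as np

def grad(u):
    # Forward differences with Neumann boundary (last row/column stays zero).
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return np.stack([gx, gy])

def div(p):
    # Discrete divergence, defined so that div = -grad^T (the negative adjoint).
    px, py = p
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]; dx[:, 1:-1] = px[:, 1:-1] - px[:, :-2]; dx[:, -1] = -px[:, -2]
    dy[0, :] = py[0, :]; dy[1:-1, :] = py[1:-1, :] - py[:-2, :]; dy[-1, :] = -py[-2, :]
    return dx + dy

def rof_denoise(u, lam, n_iter=100):
    """Primal-dual sketch for min_x ||grad x||_1 + (lam/2)||x - u||_2^2."""
    sigma = tau = 1.0 / 3.0            # sigma * tau * ||grad||^2 = 8/9 < 1
    x = u.copy(); x_bar = u.copy()
    y = np.zeros((2,) + u.shape)
    for _ in range(n_iter):
        # Dual step: gradient ascent, then pointwise projection onto ||y_i|| <= 1.
        y = y + sigma * grad(x_bar)
        y = y / np.maximum(1.0, np.sqrt((y ** 2).sum(axis=0)))
        # Primal step: K* = -div, then the closed-form prox of (lam/2)||. - u||^2.
        x_new = (x + tau * div(y) + lam * tau * u) / (1.0 + lam * tau)
        x_bar = 2.0 * x_new - x        # extrapolation with theta = 1
        x = x_new
    return x
```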
27 Example: TV-$L^1$ optical flow [Zach, Pock, Bischof (2007)]
$\min_{u,v} \|\nabla u\|_1 + \|\nabla v\|_1 + \lambda\|\rho(u, v)\|_1$
$u, v$: horizontal and vertical motion fields
$\rho(u, v)$: first-order Taylor approximation of the photometric error
[Figure caption: $u_k$: estimation of inverse depth from a single view]
28 Example: linear programming
$\min_x \langle c, x \rangle$ subject to $Ax = b$, $x \ge 0$
Introducing Lagrange multipliers $y$:
$\min_x \max_y \langle Ax - b, y \rangle + \langle c, x \rangle$ subject to $x \ge 0$
Applying the primal-dual algorithm:
$y^{(n+1)} = y^{(n)} + \sigma\left(A\bar{x}^{(n)} - b\right)$
$x^{(n+1)} = \mathrm{proj}_{[0,\infty)}\left(x^{(n)} - \tau(A^T y^{(n+1)} + c)\right)$
$\bar{x}^{(n+1)} = x^{(n+1)} + \theta\left(x^{(n+1)} - x^{(n)}\right)$
where $\mathrm{proj}_{[0,\infty)}$ is simply a truncation function.
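A direct transcription of these three updates into numpy (an illustrative sketch; choosing $\sigma = \tau = 0.9/\|A\|_2$ gives $\sigma\tau\|A\|^2 = 0.81 < 1$):

```python
import numpy as np

def pd_lp(c, A, b, theta=1.0, n_iter=5000):
    """Primal-dual sketch for min <c, x> s.t. Ax = b, x >= 0."""
    sigma = tau = 0.9 / np.linalg.norm(A, 2)      # sigma * tau * ||A||^2 < 1
    x = np.zeros(A.shape[1]); x_bar = x.copy()
    y = np.zeros(A.shape[0])
    for _ in range(n_iter):
        y = y + sigma * (A @ x_bar - b)                   # dual gradient ascent
        x_new = np.maximum(x - tau * (A.T @ y + c), 0.0)  # truncation = proj onto x >= 0
        x_bar = x_new + theta * (x_new - x)               # extrapolation
        x = x_new
    return x, y
```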
29 Example: discrete MRF inference
LP relaxation:
$\min \sum_{i} \theta_i(x_i) + \sum_{f} \theta_f(x_f), \quad x_i \in L_i$
relaxes to
$\min_\mu \sum_{i, x_i} \theta_i(x_i)\,\mu_i(x_i) + \sum_{f, x_f} \theta_f(x_f)\,\mu_f(x_f)$
subject to $\sum_{x_i} \mu_i(x_i) = 1 \;\forall i$; $\quad \sum_{x_f} \mu_f(x_f) = 1 \;\forall f$; $\quad \sum_{x_f \setminus x_i} \mu_f(x_f) = \mu_i(x_i) \;\forall f, i, x_i$
30 Example: binary labeling [Image from Cremers (2014)]
The Potts model over a 2D grid can be written as the following convex relaxation:
$\min_x \|Dx\|_1 + \langle x, w \rangle$ subject to $0 \le x_i \le 1, \;\forall i$
Preconditioned primal-dual algorithm, with diagonal step-size matrices $\Sigma$ and $T$:
$y^{(n+1)} = \mathrm{proj}_{[-1,1]}\left(y^{(n)} + \Sigma D \bar{x}^{(n)}\right)$
$x^{(n+1)} = \mathrm{proj}_{[0,1]}\left(x^{(n)} - T(D^T y^{(n+1)} + w)\right)$
$\bar{x}^{(n+1)} = x^{(n+1)} + \theta\left(x^{(n+1)} - x^{(n)}\right)$
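The "preconditioned" step sizes replace the scalars $\sigma, \tau$ by diagonal matrices. Assuming the diagonal preconditioning rule of Pock and Chambolle (2011) is what is meant here (the slides do not spell it out), one valid choice per row/column of the operator is:

```python
import numpy as np

def diag_precond(K, alpha=1.0):
    """Diagonal step sizes (cf. Pock & Chambolle 2011):
    Sigma_j = 1 / sum_i |K_ji|^(2-alpha),  T_i = 1 / sum_j |K_ji|^alpha."""
    A = np.abs(K)
    Sigma = 1.0 / np.maximum((A ** (2.0 - alpha)).sum(axis=1), 1e-12)  # one per dual variable
    T = 1.0 / np.maximum((A ** alpha).sum(axis=0), 1e-12)              # one per primal variable
    return Sigma, T
```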
31 Example: multi-view stereo reconstruction [Newcombe et al. (2011)]
$x$: inverse depth estimate
$\min_x \lambda(x)\|\nabla x\|_\epsilon + C(x)$
$\lambda(x)$: element-wise weighting function
$\|\cdot\|_\epsilon$: robust Huber loss
$C(x)$: non-convex matching cost
32 Example: multi-view stereo reconstruction [Newcombe et al. (2011)]
Introducing an auxiliary variable $u$:
$\min_{x,u} \lambda(x)\|\nabla x\|_\epsilon + C(u) + \frac{1}{2\theta}\|x - u\|_2^2$
Alternate:
Minimize $C(u) + \frac{1}{2\theta}\|x - u\|_2^2$ w.r.t. $u$ by smart brute-force search
Minimize $\|\nabla x\|_\epsilon + \frac{1}{2\theta}\|x - u\|_2^2$ w.r.t. $x$ by primal-dual
Reduce $\theta$ and repeat
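The $u$-subproblem decouples per pixel, so the "smart brute-force search" can be as simple as scanning a sampled cost volume. A toy sketch (the cost_volume layout and all names here are my own assumptions, not from the DTAM paper):

```python
import numpy as np

def search_u(x, cost_volume, depths, theta):
    """Pointwise exhaustive search: u_i = argmin_d C_i(d) + (1/(2*theta)) * (x_i - d)^2.

    x: current inverse-depth estimate, shape (n_pixels,)
    cost_volume: sampled matching costs C_i(d), shape (n_pixels, n_depths)
    depths: the sampled inverse-depth values, shape (n_depths,)
    """
    total = cost_volume + (x[:, None] - depths[None, :]) ** 2 / (2.0 * theta)
    return depths[np.argmin(total, axis=1)]
```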
33 Summary
A first-order primal-dual algorithm for a class of structured convex optimization problems:
The objective function can be non-differentiable.
Easy to implement (we just need to derive the proximal operators).
Optimal convergence rate on multiple sub-classes.
34 References
Chambolle, Antonin, and Thomas Pock. "A first-order primal-dual algorithm for convex problems with applications to imaging." Journal of Mathematical Imaging and Vision 40.1 (2011).
Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer Science & Business Media.
Nesterov, Yurii. "Gradient methods for minimizing composite objective function." CORE Discussion Paper 2007/76. UCL.
Parikh, Neal, and Stephen P. Boyd. "Proximal algorithms." Foundations and Trends in Optimization 1.3 (2014).
Newcombe, Richard A., Steven J. Lovegrove, and Andrew J. Davison. "DTAM: Dense tracking and mapping in real-time." ICCV 2011.
Pock, Thomas, et al. "Global solutions of variational models with convex regularization." SIAM Journal on Imaging Sciences 3.4 (2010).
Nieuwenhuis, Claudia, Eno Töppe, and Daniel Cremers. "A survey and comparison of discrete and continuous multi-label optimization approaches for the Potts model." International Journal of Computer Vision (2013).
Zach, Christopher, Thomas Pock, and Horst Bischof. "A duality based approach for realtime TV-L1 optical flow." Pattern Recognition (DAGM 2007). Springer Berlin Heidelberg.