First-order methods for structured nonsmooth optimization


First-order methods for structured nonsmooth optimization
Sangwoon Yun
Department of Mathematics Education, Sungkyunkwan University
Oct 19, 2016
Center for Mathematical Analysis & Computation, Yonsei University

Outline I. Coordinate (Gradient) Descent Method II. Incremental Gradient Method III. (Linearized) Alternating Direction Method of Multipliers

I. Coordinate (Gradient) Descent Method

Structured Nonsmooth Optimization

$$\min_x F(x) = f(x) + P(x)$$
$f$: real-valued, (convex) smooth on $\mathrm{dom}\, f$
$P$: proper, convex, lsc
In particular, $P$ separable, i.e., $P(x) = \sum_{j=1}^n P_j(x_j)$, for example:
Bound constraint: $P(x) = \begin{cases} 0 & \text{if } l \le x \le u \\ \infty & \text{else,} \end{cases}$ where $l \le u$ (possibly with $-\infty$ or $\infty$ components);
$\ell_1$-norm: $P(x) = \lambda \|x\|_1$ with $\lambda > 0$;
or the indicator function of linear constraints ($Ax = b$).

Bound-constrained Optimization

$$\min_{l \le x \le u} f(x),$$
where $f: \mathbb{R}^N \to \mathbb{R}$ is smooth and $l \le u$ (possibly with $-\infty$ or $\infty$ components). This can be reformulated as the following unconstrained optimization problem:
$$\min_x \ f(x) + P(x), \quad \text{where } P(x) = \begin{cases} 0 & \text{if } l \le x \le u \\ \infty & \text{else.} \end{cases}$$

$\ell_1$-regularized Convex Minimization

1. $\ell_1$-regularized linear least squares problem: find $x$ so that $Ax - b \approx 0$ and $x$ has few nonzeros. Formulate this as an unconstrained convex optimization problem:
$$\min_{x \in \mathbb{R}^n} \ \|Ax - b\|_2^2 + \lambda \|x\|_1.$$
2. $\ell_1$-regularized logistic regression problem:
$$\min_{w \in \mathbb{R}^{n-1},\, v \in \mathbb{R}} \ \frac{1}{m} \sum_{i=1}^m \log\bigl(1 + \exp(-(w^T a_i + v b_i))\bigr) + \lambda \|w\|_1,$$
where $a_i = b_i z_i$ and $(z_i, b_i) \in \mathbb{R}^{n-1} \times \{-1, 1\}$, $i = 1, \ldots, m$, are a given set of (observed or training) examples.
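
The following minimal sketch (NumPy, synthetic data, illustrative names) simply evaluates the two objectives, following the formulas as written above.

```python
# A minimal sketch (NumPy, synthetic data, illustrative names) that evaluates the two
# l1-regularized objectives above, following the formulas as written on the slide.
import numpy as np

def l1_least_squares(x, A, b, lam):
    """||Ax - b||_2^2 + lam*||x||_1."""
    r = A @ x - b
    return r @ r + lam * np.abs(x).sum()

def l1_logistic(w, v, a, b, lam):
    """(1/m) sum_i log(1 + exp(-(w^T a_i + v*b_i))) + lam*||w||_1; rows of a are the a_i."""
    margins = a @ w + v * b
    return np.mean(np.log1p(np.exp(-margins))) + lam * np.abs(w).sum()

rng = np.random.default_rng(0)
A, b, x = rng.standard_normal((20, 5)), rng.standard_normal(20), rng.standard_normal(5)
print(l1_least_squares(x, A, b, 0.1))
```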

Support Vector Machines

Support Vector Machines: Support Vector Classification

Training points: $z_i \in \mathbb{R}^p$, $i = 1, \ldots, n$. Consider a simple case with two classes (linearly separable case). Define a label vector $a$:
$$a_i = \begin{cases} 1 & \text{if } z_i \text{ is in class 1} \\ -1 & \text{if } z_i \text{ is in class 2.} \end{cases}$$
A hyperplane ($0 = w^T z - b$) separates the data with the maximal margin. The margin is the distance of the hyperplane to the nearest of the positive and negative points. The nearest points lie on the planes $\pm 1 = w^T z - b$.

Support Vector Machines: Support Vector Classification (Convex Quadratic Programming Problem)

The (original) optimization problem:
$$\min_{w, b} \ \tfrac{1}{2} \|w\|_2^2 \quad \text{subject to} \quad a_i (w^T z_i - b) \ge 1, \ i = 1, \ldots, n.$$
The modified optimization problem (allows, but penalizes, the failure of a point to reach the correct margin):
$$\min_{w, b, \xi} \ \tfrac{1}{2} \|w\|_2^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad a_i (w^T z_i - b) \ge 1 - \xi_i, \ \xi_i \ge 0, \ i = 1, \ldots, n.$$

Support Vector Machines: SVM (Dual) Optimization Problem (Convex Quadratic Program)

$$\min_x \ \tfrac{1}{2} x^T Q x - e^T x \quad \text{subject to} \quad 0 \le x_i \le C, \ i = 1, \ldots, n, \quad a^T x = 0,$$
where $a \in \{-1, 1\}^n$, $0 < C$, $e = [1, \ldots, 1]^T$, $Q \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite with $Q_{ij} = a_i a_j K(z_i, z_j)$, $K: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is a kernel function, and $z_i \in \mathbb{R}^p$ is the $i$th data point, $i = 1, \ldots, n$.

Popular choices of $K$:
linear kernel $K(z_i, z_j) = z_i^T z_j$
radial basis function kernel $K(z_i, z_j) = \exp(-\gamma \|z_i - z_j\|_2^2)$
sigmoid kernel $K(z_i, z_j) = \tanh(\gamma z_i^T z_j)$
where $\gamma$ is a constant.

$Q$ is an $n \times n$ fully dense matrix and may even be indefinite ($n \ge 5000$).
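
As a small illustration of how the dense matrix $Q$ arises, the sketch below (NumPy, illustrative data and parameter $\gamma$) assembles $Q_{ij} = a_i a_j K(z_i, z_j)$ for the radial basis function kernel listed above.

```python
# A small sketch (NumPy, illustrative data and parameter gamma) of assembling the dense
# SVM dual matrix Q_ij = a_i a_j K(z_i, z_j) with the radial basis function kernel.
import numpy as np

def rbf_kernel(Z, gamma):
    """K(z_i, z_j) = exp(-gamma*||z_i - z_j||_2^2) for all pairs of rows of Z."""
    sq = np.sum(Z**2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T))

def svm_dual_matrix(Z, a, gamma=0.5):
    """Q = diag(a) K diag(a): dense n-by-n, as noted on the slide."""
    K = rbf_kernel(Z, gamma)
    return (a[:, None] * K) * a[None, :]

rng = np.random.default_rng(2)
Z = rng.standard_normal((10, 3))              # 10 training points in R^3
a = rng.choice([-1.0, 1.0], size=10)          # class labels
Q = svm_dual_matrix(Z, a)
```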

Rank Minimization

Now imagine that we only observe a few entries of a data matrix. Is it then possible to accurately guess the entries that we have not seen?

Netflix problem: given a sparse matrix where $M_{ij}$ is the rating given by user $i$ on movie $j$, predict the rating a user would assign to a movie he has not seen, i.e., we would like to infer users' preferences for unrated movies. (Impossible in general!) The problem is ill-posed.

Intuitively, users' preferences depend only on a few factors, i.e., $\mathrm{rank}(M)$ is small. Thus the task can be formulated as the low-rank matrix completion problem (affine rank minimization):
$$\min_{X \in \mathbb{R}^{m \times n}} \ \{ \mathrm{rank}(X) \mid X_{ij} = M_{ij}, \ (i, j) \in \Omega \} \quad \text{(NP-hard!)},$$
where $\Omega$ is an index set of $p$ observed entries.

Rank Minimization: Nuclear norm minimization

$$\min_{X \in \mathbb{R}^{m \times n}} \ \Bigl\{ \|X\|_* := \sum_{i=1}^m \sigma_i(X) \ \Bigm|\ X_{ij} = M_{ij}, \ (i, j) \in \Omega \Bigr\},$$
where the $\sigma_i(X)$'s are the singular values of $X$.

A more general nuclear norm minimization problem:
$$\min_{X \in \mathbb{R}^{m \times n}} \ \{ \|X\|_* : \mathcal{A}(X) = b \}.$$
When the matrix variable is restricted to be diagonal, the above problem reduces to the following $\ell_1$-minimization problem:
$$\min_{x \in \mathbb{R}^n} \ \{ \|x\|_1 : Ax = b \}.$$

Nuclear Norm Regularized Least Squares Problem

If the observation $b$ is contaminated with noise, consider the following nuclear norm regularized least squares problem:
$$\min_{X \in \mathbb{R}^{m \times n}} \ \tfrac{1}{2} \|\mathcal{A}(X) - b\|_2^2 + \mu \|X\|_*,$$
where $\mu > 0$ is a given parameter. This problem has appeared in many applications of engineering and science, including:
collaborative filtering
global positioning
system identification
remote sensing
computer vision
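
A standard building block for this problem (e.g. inside a proximal gradient scheme) is the proximal map of $\mu\|\cdot\|_*$, which soft-thresholds the singular values. The sketch below shows that generic map in NumPy; it is a common tool, not necessarily the particular solver discussed in the talk.

```python
# A hedged sketch of the proximal map of mu*||.||_* (singular value soft-thresholding),
# a generic building block for nuclear norm regularized problems.
import numpy as np

def prox_nuclear(X, mu):
    """argmin_Y 0.5*||Y - X||_F^2 + mu*||Y||_*  via soft-thresholding the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - mu, 0.0)) @ Vt

X = np.arange(12.0).reshape(3, 4)
print(np.linalg.svd(prox_nuclear(X, 1.0), compute_uv=False))  # singular values after shrinkage
```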

Sparse Portfolio Selection

How to allocate an investor's available capital into a prefixed set of assets, with the aims of maximizing the expected return and minimizing the investment risk.

Traditional Markowitz portfolio selection model:
$$\min_x \ x^T Q x \quad \text{subject to} \quad \mu^T x = \beta, \ e^T x = 1, \ x \ge 0,$$
where $\beta$ is the desired expected return of the portfolio.

Modified models:
$$\min_x \ x^T Q x \quad \text{subject to} \quad \mu^T x = \beta, \ e^T x = 1, \ x \ge 0, \ \|x\|_0 \le K;$$
$$\min_x \ \|x\|_0 \quad \text{subject to} \quad \mu^T x = \beta, \ x^T Q x \le \alpha, \ e^T x = 1, \ x \ge 0.$$

Sparse Covariance Selection

Undirected graphical models offer a way to describe and explain the relationships among a set of variables, a central element of multivariate data analysis.

From a sample covariance matrix, we wish to estimate the true covariance matrix, in which some of the entries of its inverse are zero, by maximizing the $\ell_1$-regularized log-likelihood:
$$\max_{X \in S^n} \ \log\det X - \langle \hat{\Sigma}, X \rangle - \sum_{(i,j) \notin V} \rho_{ij} |X_{ij}| \quad \text{subject to} \quad X_{ij} = 0, \ (i, j) \in V,$$
where:
$\hat{\Sigma} \in S^n_+$ is an empirical covariance matrix, and $\hat{\Sigma}$ is singular or nearly so;
we may want to impose structural conditions on $\Sigma^{-1}$, such as conditional independence, which is reflected as zero entries in $\Sigma^{-1}$;
$V$ is a collection of all pairs of conditionally independent nodes;
$\rho_{ij} > 0$ is a parameter controlling the trade-off between the goodness-of-fit and the sparsity of $X$.
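
The sketch below (NumPy) simply evaluates the penalized log-likelihood above for a given positive definite $X$; the representation of $V$ (a list of index pairs) and of $\rho$ (a full matrix of weights) is an illustrative assumption.

```python
# A small sketch (NumPy) that evaluates the penalized log-likelihood for sparse covariance
# selection; the encoding of V and rho is an illustrative assumption.
import numpy as np

def penalized_loglik(X, Sigma_hat, rho, V):
    """log det X - <Sigma_hat, X> - sum_{(i,j) not in V} rho_ij * |X_ij|."""
    sign, logdet = np.linalg.slogdet(X)
    assert sign > 0, "X must be positive definite"
    mask = np.ones(X.shape, dtype=bool)
    for (i, j) in V:                      # entries in V are constrained to zero, not penalized
        mask[i, j] = mask[j, i] = False
    return logdet - np.sum(Sigma_hat * X) - np.sum(rho[mask] * np.abs(X[mask]))
```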

Sparse Covariance Selection

$-\log\det X + \langle \hat{\Sigma}, X \rangle$ is strictly convex and continuously differentiable on its domain $S^n_{++}$, and takes $O(n^3)$ operations to evaluate. In applications, $n$ can exceed 5000.

The dual problem can be expressed as:
$$\min_{X \in S^n} \ -\log\det X - n \quad \text{subject to} \quad |(X - \hat{\Sigma})_{ij}| \le \upsilon_{ij}, \ i, j = 1, \ldots, n,$$
where $\upsilon_{ij} = \rho_{ij}$ for all $(i, j) \notin V$ and $\upsilon_{ij} = \infty$ for all $(i, j) \in V$.

Coordinate Descent Method

When $P \equiv 0$: given $x \in \mathbb{R}^n$, choose $i \in \mathcal{N} = \{1, \ldots, n\}$ and update
$$x^{\mathrm{new}} = \arg\min_{u:\ u_j = x_j \ \forall j \ne i} f(u).$$
Repeat until convergence.

Gauss-Seidel rule: choose $i$ cyclically, $1, 2, \ldots, n, 1, 2, \ldots$
Gauss-Southwell rule: choose $i$ with $\bigl|\tfrac{\partial f}{\partial x_i}(x)\bigr|$ maximum.
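
A minimal sketch of the smooth case $P \equiv 0$ with the Gauss-Seidel rule, on the convex quadratic $f(x) = \tfrac{1}{2}x^T A x - b^T x$, whose coordinate-wise minimizer has a closed form (data and iteration count are illustrative):

```python
# A minimal sketch of cyclic (Gauss-Seidel) coordinate descent for the smooth case P = 0,
# on the convex quadratic f(x) = 0.5*x^T A x - b^T x.
import numpy as np

def coordinate_descent(A, b, sweeps=100):
    n = len(b)
    x = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):                           # Gauss-Seidel rule: cycle i = 1,...,n
            # exact minimization of f over x_i with the other coordinates held fixed
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)                        # symmetric positive definite
b = rng.standard_normal(5)
x = coordinate_descent(A, b)
print(np.linalg.norm(A @ x - b))                     # residual of the stationarity condition
```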

Coordinate Descent Method

Properties:
If $f$ is convex, then every cluster point of the $x$-sequence is a minimizer.
If $f$ is nonconvex, then Gauss-Seidel can cycle [1], but Gauss-Southwell still converges.
Convergence is possible when $P \not\equiv 0$ [2].

[1] M. J. D. Powell, On search directions for minimization algorithms, Math. Program. 4 (1973), 193-201.
[2] P. Tseng, Convergence of block coordinate descent method for nondifferentiable minimization, J. Optim. Theory Appl. 109 (2001), 473-492.

Coord. Gradient Descent Method

Descent direction: for $x \in \mathrm{dom}\, P$, choose a nonempty $J \subseteq \mathcal{N}$ and $H \succ 0$, then solve the direction subproblem
$$\min_{d:\ d_j = 0 \ \forall j \notin J} \ \bigl\{ \nabla f(x)^T d + \tfrac{1}{2} d^T H d + P(x + d) - P(x) \bigr\}.$$
Let $d_H(x; J)$ and $q_H(x; J)$ be the optimal solution and optimal objective value of the direction subproblem.

Facts:
$d_H(x; \mathcal{N}) = 0 \iff F'(x; d) \ge 0 \ \forall d \in \mathbb{R}^n$.
If $H$ is diagonal, then $d_H(x; J) = \sum_{j \in J} d_H(x; j)$ and $q_H(x; J) = \sum_{j \in J} q_H(x; j)$.
$q_H(x; J) \le -\tfrac{1}{2} d^T H d$, where $d = d_H(x; J)$.

Coord. Gradient Descent Method

This coordinate gradient descent approach may be viewed as a hybrid of gradient projection and coordinate descent. In particular:

If $J = \mathcal{N}$ and $P(x) = \begin{cases} 0 & \text{if } l \le x \le u \\ \infty & \text{else,} \end{cases}$ then $d_H(x; \mathcal{N})$ is a scaled gradient-projection direction for bound-constrained optimization.
If $f$ is quadratic and we choose $H = \nabla^2 f(x)$, then $d_H(x; J)$ is a (block) coordinate descent direction.
If $H$ is diagonal, then the subproblems can be solved in parallel.
If $P \equiv 0$, then $d_H(x)_j = -\nabla f(x)_j / H_{jj}$.
If $P(x) = \begin{cases} 0 & \text{if } l \le x \le u \\ \infty & \text{else,} \end{cases}$ then $d_H(x)_j = \mathrm{median}\{\, l_j - x_j,\ -\nabla f(x)_j / H_{jj},\ u_j - x_j \,\}$.
If $P$ is the ($\lambda$-scaled) 1-norm, then $d_H(x)_j = \mathrm{median}\{\, (-\nabla f(x)_j - \lambda)/H_{jj},\ -x_j,\ (-\nabla f(x)_j + \lambda)/H_{jj} \,\}$.

Coord. Gradient Descent Method

Stepsize (Armijo rule): choose $\alpha$ to be the largest element of $\{\beta^k\}_{k=0,1,\ldots}$ satisfying
$$F(x + \alpha d) - F(x) \le \sigma \alpha\, q_H(x; J) \qquad (0 < \beta < 1,\ 0 < \sigma < 1).$$
For the $\ell_1$-regularized linear least squares problem, the minimization rule $\alpha \in \arg\min\{F(x + t d) \mid t \ge 0\}$ or the limited minimization rule $\alpha \in \arg\min\{F(x + t d) \mid 0 \le t \le s\}$, where $0 < s < \infty$, can also be used.
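
Putting the last few slides together, the sketch below performs one CGD iteration for $F(x) = \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$ with a diagonal $H$, the median formula for the direction, and Armijo backtracking. The $\tfrac{1}{2}$ factor, the choice $H_{jj} = \|A_j\|^2$, and all names are illustrative assumptions, not the talk's exact setup.

```python
# A hedged sketch of one CGD iteration for F(x) = 0.5*||Ax - b||_2^2 + lam*||x||_1 with a
# diagonal H, the median direction formula, and the Armijo rule.
import numpy as np

def cgd_step(x, A, b, lam, J, beta=0.5, sigma=0.1):
    g = A.T @ (A @ x - b)                               # gradient of the smooth part
    H = np.sum(A**2, axis=0) + 1e-12                    # diagonal H (illustrative choice)
    d = np.zeros_like(x)
    d[J] = np.median(np.stack([(-g[J] - lam) / H[J],    # median formula for P = lam*||.||_1
                               -x[J],
                               (-g[J] + lam) / H[J]]), axis=0)
    q = g[J] @ d[J] + 0.5 * H[J] @ d[J]**2 + lam * (np.abs(x + d).sum() - np.abs(x).sum())
    F = lambda y: 0.5 * np.sum((A @ y - b)**2) + lam * np.abs(y).sum()
    alpha = 1.0
    while F(x + alpha * d) - F(x) > sigma * alpha * q:  # Armijo backtracking
        alpha *= beta
    return x + alpha * d

rng = np.random.default_rng(1)
A, b, x = rng.standard_normal((30, 10)), rng.standard_normal(30), np.zeros(10)
for k in range(50):
    x = cgd_step(x, A, b, lam=0.5, J=np.array([k % 10]))   # Gauss-Seidel choice of J
```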

Coord. Gradient Descent Method

Choice of $J$:
Gauss-Seidel rule: $J$ cycles through $\{1\}, \{2\}, \ldots, \{n\}$.
Gauss-Southwell-r rule: $\|d_D(x; J)\|_\infty \ge \upsilon\, \|d_D(x; \mathcal{N})\|_\infty$,
Gauss-Southwell-q rule: $q_D(x; J) \le \upsilon\, q_D(x; \mathcal{N})$,
where $0 < \upsilon \le 1$ and $D \succ 0$ is diagonal (e.g., $D = \mathrm{diag}(H)$).

Coord. Gradient Descent Method

Advantages of CGD: the CGD method is simple, highly parallelizable, and well suited to large-scale problems. CGD not only has cheaper iterations than exact coordinate descent, but also stronger global convergence properties.

Convergence Results: Global convergence

If $\underline{\lambda} I \preceq D, H \preceq \bar{\lambda} I$ with $0 < \underline{\lambda} \le \bar{\lambda}$, $J$ is chosen by the Gauss-Seidel, Gauss-Southwell-r, or Gauss-Southwell-q rule, and $\alpha$ is chosen by the Armijo rule, then every cluster point of the $x$-sequence generated by the CGD method is a stationary point of $F$.

Convergence Results: Local convergence rate

If $\underline{\lambda} I \preceq D, H \preceq \bar{\lambda} I$ with $0 < \underline{\lambda} \le \bar{\lambda}$, $J$ is chosen by the Gauss-Seidel or Gauss-Southwell-q rule, $\alpha$ is chosen by the Armijo rule, and, in addition, $P$ and $f$ satisfy any of the following assumptions, then the $x$-sequence generated by the CGD method converges at an R-linear rate.

C1 $f$ is strongly convex and $\nabla f$ is Lipschitz continuous on $\mathrm{dom}\, P$.
C2 $f$ is (nonconvex) quadratic; $P$ is polyhedral.
C3 $f(x) = g(Ex) + q^T x$, where $E \in \mathbb{R}^{m \times N}$, $q \in \mathbb{R}^N$, $g$ is strongly convex, and $\nabla g$ is Lipschitz continuous on $\mathbb{R}^m$; $P$ is polyhedral.
C4 $f(x) = \max_{y \in Y} \{(Ex)^T y - g(y)\} + q^T x$, where $Y \subseteq \mathbb{R}^m$ is polyhedral, $E \in \mathbb{R}^{m \times N}$, $q \in \mathbb{R}^N$, $g$ is strongly convex, and $\nabla g$ is Lipschitz continuous on $\mathbb{R}^m$; $P$ is polyhedral.

Convergence Results: Complexity Bound

If $f$ is convex with Lipschitz continuous gradient, then the number of iterations for achieving $\epsilon$-optimality is:

Gauss-Seidel rule:
$$O\!\left( \frac{n^2 L r^0}{\epsilon} \right),$$
where $L$ is a Lipschitz constant and $r^0 = \max\{\, \mathrm{dist}(x, X^*)^2 \mid F(x) \le F(x^0) \,\}$.

Gauss-Southwell-q rule:
$$O\!\left( \frac{L r^0}{\upsilon \epsilon} + \max\Bigl\{ 0,\ \tfrac{L}{\upsilon} \ln\bigl(\tfrac{e^0}{r^0}\bigr) \Bigr\} \right),$$
where $e^0 = F(x^0) - \min_{x \in X^*} F(x)$.

II. Incremental Gradient Method

Sum of several functions

$$\min_x F(x) := f(x) + cP(x),$$
where $c > 0$, $P: \mathbb{R}^n \to (-\infty, \infty]$ is a proper, convex, lower semicontinuous (lsc) function, and
$$f(x) := \sum_{i=1}^m f_i(x),$$
where each function $f_i$ is real-valued and smooth (i.e., continuously differentiable) on $\mathbb{R}^n$.

Sum of several functions

In applications, $m$ is often large (exceeding $10^4$). In this case, traditional gradient methods would be inefficient since they require evaluating $\nabla f_i(x)$ for all $i$ before updating $x$. Incremental gradient methods (IGM), in contrast, update $x$ after evaluating $\nabla f_i(x)$ for only one or a few $i$. In the unconstrained case $P \equiv 0$, this method has the basic form
$$x^{k+1} = x^k - \alpha_k \nabla f_{i_k}(x^k), \qquad k = 0, 1, \ldots,$$
where $i_k$ is chosen to cycle through $1, \ldots, m$ (i.e., $i_0 = 1, i_1 = 2, \ldots, i_{m-1} = m, i_m = 1, \ldots$) and $\alpha_k > 0$.
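
A minimal sketch of this basic incremental iteration ($P \equiv 0$) on a toy least-squares sum $f(x) = \sum_i \tfrac{1}{2}(a_i^T x - b_i)^2$, with one common diminishing-stepsize choice; as the next slide notes, the stepsize has to shrink for global convergence.

```python
# A minimal sketch of the basic incremental gradient iteration (P = 0) on a toy
# least-squares sum, with a common diminishing stepsize; data layout is illustrative.
import numpy as np

def incremental_gradient(A, b, passes=50):
    m, n = A.shape
    x = np.zeros(n)
    k = 0
    for _ in range(passes):
        for i in range(m):                             # i_k cycles through 1,...,m
            alpha = 1.0 / (1.0 + k)                    # diminishing stepsize
            x = x - alpha * (A[i] @ x - b[i]) * A[i]   # gradient of the i-th component only
            k += 1
    return x
```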

Sum of several functions

For global convergence of IGMs, the stepsize $\alpha_k$ (also called the "learning rate") needs to diminish to zero, which can lead to slow convergence [3]. If a constant stepsize is used, only convergence to an approximate solution can be shown [4]. Methods to overcome this difficulty were proposed [5]. However, these methods need additional assumptions, such as $\nabla f_i(x^*) = 0$ for all $i$ at a stationary point $x^*$, to achieve global convergence without the stepsize tending to zero. Moreover, their extension to $P \not\equiv 0$ is problematic.

[3] D. P. Bertsekas, A new class of incremental gradient methods for least squares problems, SIAM J. Optim., 7 (1997), 913-926.
[4] M. V. Solodov, Incremental gradient algorithms with stepsizes bounded away from zero, Comput. Optim. Appl., 11 (1998), 23-35.
[5] P. Tseng, An incremental gradient(-projection) method with momentum term and adaptive stepsize rule, SIAM J. Optim., 8 (1998), 506-531.

Sum of several functions

For the case of $P \equiv 0$, Blatt, Hero and Gauchman [6] proposed the method
$$g^k = g^{k-1} + \nabla f_{i_k}(x^k) - \nabla f_{i_k}(x^{k-m}), \qquad x^{k+1} = x^k - \alpha g^k,$$
with $\alpha > 0$, $g^{-1} = \sum_{i=1}^m \nabla f_i(x^{i-m-1})$, and $x^0, x^{-1}, \ldots, x^{-m} \in \mathbb{R}^n$ given.

It computes the gradient of a single component function at each iteration, but instead of updating $x$ using this gradient, it uses the sum of the $m$ most recently computed gradients. This method requires more storage ($O(mn)$ instead of $O(n)$) and slightly more communication/computation per iteration.

[6] D. Blatt, A. O. Hero, and H. Gauchman, A convergent incremental gradient method with a constant step size, SIAM J. Optim., 18 (2007), 29-51.
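
A hedged sketch of this aggregated-gradient idea: keep the $m$ most recently computed component gradients ($O(mn)$ storage) and step along their running sum with a constant stepsize. Initialization and the stepsize below are simplified, illustrative choices.

```python
# A hedged sketch of the aggregated incremental gradient idea of Blatt, Hero and Gauchman,
# with simplified initialization and an illustrative constant stepsize.
import numpy as np

def aggregated_incremental_gradient(grad_i, x0, m, alpha=1e-2, iters=1000):
    """grad_i(i, x) returns the gradient of the i-th component f_i at x, i = 0,...,m-1."""
    x = x0.copy()
    G = np.array([grad_i(i, x) for i in range(m)])  # table of the latest gradient of each f_i
    g = G.sum(axis=0)                               # running sum of the stored gradients
    for k in range(iters):
        i = k % m                                   # cyclic component selection
        new_gi = grad_i(i, x)
        g = g + new_gi - G[i]                       # replace the old gradient of f_i in the sum
        G[i] = new_gi
        x = x - alpha * g
    return x
```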

IGM: Constant Stepsize

0. Choose $x^0, x^{-1}, \ldots \in \mathrm{dom}\, P$ and $\alpha \in (0, 1]$. Initialize $k = 0$. Go to 1.
1. Choose $H^k \succ 0$ and $0 \le \tau_i^k \le k$ for $i = 1, \ldots, m$, and compute $g^k$, $d^k$, and $x^{k+1}$ by
$$g^k = \sum_{i=1}^m \nabla f_i(x^{\tau_i^k}), \qquad d^k = \arg\min_{d \in \mathbb{R}^n} \Bigl\{ \langle g^k, d \rangle + \tfrac{1}{2} \langle d, H^k d \rangle + cP(x^k + d) \Bigr\}, \qquad x^{k+1} = x^k + \alpha d^k.$$
Increment $k$ by 1 and return to 1.

The method of Blatt et al. corresponds to the special case of $P \equiv 0$, $H^k = I$, $K = m - 1$, and
$$\tau_i^k = \begin{cases} k & \text{if } i = (k \bmod m) + 1; \\ \tau_i^{k-1} & \text{otherwise,} \end{cases} \qquad 1 \le i \le m, \ k \ge m.$$

IGM: Constant Stepsize

Assumption 1
(a) $\tau_i^k \ge k - K$ for all $i$ and $k$, where $K \ge 0$ is an integer.
(b) $\underline{\lambda} I \preceq H^k \preceq \bar{\lambda} I$ for all $k$, where $0 < \underline{\lambda} \le \bar{\lambda}$.

Assumption 2
$$\|\nabla f_i(y) - \nabla f_i(z)\| \le L_i \|y - z\| \quad \forall y, z \in \mathrm{dom}\, P,$$
for some $L_i \ge 0$, $i = 1, \ldots, m$. Let $L = \sum_{i=1}^m L_i$.

IGM: Constant Stepsize — Global Convergence

Let $\{x^k\}$, $\{d^k\}$, $\{H^k\}$ be sequences generated by Algorithm 1 under Assumptions 1 and 2, and with $\alpha < 2\underline{\lambda} / (L(2K + 1))$. Then $\{d^k\} \to 0$ and every cluster point of $\{x^k\}$ is a stationary point.

IGM: Adaptive Stepsize

0. Choose $x^0, x^{-1}, \ldots \in \mathrm{dom}\, P$, $\beta \in (0, 1)$, $\sigma > \tfrac{1}{2}$, and $\underline{\alpha} \in (0, 1]$. Initialize $k = 0$. Go to 1.
1. Choose $H^k \succ 0$ and $0 \le \tau_i^k \le k$ for $i = 1, \ldots, m$, and compute $g^k$, $d^k$ by
$$g^k = \sum_{i=1}^m \nabla f_i(x^{\tau_i^k}), \qquad d^k = \arg\min_{d \in \mathbb{R}^n} \Bigl\{ \langle g^k, d \rangle + \tfrac{1}{2} \langle d, H^k d \rangle + cP(x^k + d) \Bigr\}.$$
Choose $\alpha_k^{\mathrm{init}} \in [\underline{\alpha}, 1]$ and let $\alpha_k$ be the largest element of $\{\alpha_k^{\mathrm{init}} \beta^j\}_{j=0,1,\ldots}$ satisfying the descent-like condition
$$F_c(x^k + \alpha_k d^k) \le F_c(x^k) - \sigma L \alpha_k \|d^k\|^2 + L \sum_{j=(k-K)_+}^{k-1} \alpha_j \|d^j\|_2^2,$$
and set $x^{k+1} = x^k + \alpha_k d^k$. Increment $k$ by 1 and return to 1.

IGM: Adaptive Stepsize — Global Convergence

Let $\{x^k\}$, $\{d^k\}$, $\{H^k\}$, $\{\alpha_k\}$ be sequences generated by Algorithm 2 under Assumptions 1 and 2. Then the following results hold.
(a) For each $k \ge 0$, the descent-like condition holds whenever $\alpha_k \le \bar{\alpha}$, where $\bar{\alpha} = \dfrac{\underline{\lambda}}{L(\sigma K + K/2 + 1/2)}$.
(b) We have $\alpha_k \ge \min\{\underline{\alpha}, \beta\bar{\alpha}\}$ for all $k$.
(c) $\{d^k\} \to 0$ and every cluster point of $\{x^k\}$ is a stationary point.

IGM: 1-memory with Adaptive Stepsize

0. Choose $x^0 \in \mathrm{dom}\, P$. Initialize $k = 0$ and $g^{-1} = 0$. Go to 1.
1. Choose $H^k \succ 0$ and compute $g^k$, $d^k$, and $x^{k+1}$ by
$$g^k = \frac{k}{k+1}\, g^{k-1} + \frac{m}{k+1}\, \nabla f_{i_k}(x^k) \quad \text{with } i_k = (k \bmod m) + 1,$$
$$d^k = \arg\min_{d \in \mathbb{R}^n} \Bigl\{ \langle g^k, d \rangle + \tfrac{1}{2} \langle d, H^k d \rangle + cP(x^k + d) \Bigr\}, \qquad x^{k+1} = x^k + \alpha_k d^k, \quad \text{with } \alpha_k \in (0, 1].$$
Increment $k$ by 1 and return to 1.

IGM: 1-memory with Adaptive Stepsize

Assumption 3
(a) $\sum_{k=0}^\infty \alpha_k = \infty$.
(b) $\lim_{l \to \infty} \sum_{j=0}^{l} \dfrac{j+1}{l+1}\, \delta_j = 0$, where $\delta_j := \max_{i=0,1,\ldots,m} \|x^{k+i} - x^{k+m}\|$ with $k = jm - 1$ (and $x^{-1} = x^0$).

IGM: 1-memory with Adaptive Stepsize — Global Convergence

Let $\{x^k\}$, $\{d^k\}$, $\{H^k\}$, $\{\alpha_k\}$ be sequences generated by Algorithm 3 under Assumptions 1(b), 2, and 3. Then the following results hold.
(a) $\{\|x^{k+1} - x^k\|\} \to 0$ and $\{\|\nabla f(x^k) - g^k\|\} \to 0$.
(b) $\liminf_{k \to \infty} \|d^k\| = 0$.
(c) If $\{x^k\}$ is bounded, then there exists a cluster point of $\{x^k\}$ that is a stationary point.

III. (Linearized) Alternating Direction Method of Multipliers

Total Variation Regularized Linear Least Squares Problem

Image restoration tasks such as image deconvolution, image inpainting, and image denoising are often formulated as an inverse problem
$$b = Au + \eta,$$
where $u \in \mathbb{R}^n$ is the unknown true image, $b \in \mathbb{R}^l$ is the observed image (or measurements), $\eta$ is Gaussian noise, and $A \in \mathbb{R}^{l \times n}$ is a linear operator: typically a convolution operator in deconvolution, a projection in inpainting, and the identity in denoising.

The unknown image can be recovered by solving the TV regularized linear least squares problem (Primal):
$$\min_{x \in \mathbb{R}^n} \ \tfrac{1}{2} \|Ax - b\|_2^2 + \mu \|\nabla x\|,$$
where $\mu > 0$ and $\|\nabla x\| = \sum_{i=1}^n \|(\nabla x)_i\|_2$ with $(\nabla x)_i \in \mathbb{R}^2$.
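
A small sketch of the discrete isotropic TV term $\|\nabla x\| = \sum_i \|(\nabla x)_i\|_2$ used above, with a forward-difference, replicated-boundary discretization (one common convention, assumed here rather than taken from the talk).

```python
# A small sketch of the discrete isotropic total variation with forward differences
# and replicated boundaries (an assumed, common discretization).
import numpy as np

def total_variation(u):
    """u: 2-D image; sum over pixels of the Euclidean norm of the discrete gradient."""
    dx = np.diff(u, axis=1, append=u[:, -1:])   # horizontal differences
    dy = np.diff(u, axis=0, append=u[-1:, :])   # vertical differences
    return np.sqrt(dx**2 + dy**2).sum()

u = np.zeros((8, 8)); u[2:6, 2:6] = 1.0
print(total_variation(u))                       # TV of a small piecewise-constant patch
```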

Gaussian Noise — Figure: denoising. (a) original: 512×512, (b) noisy image, (c) recovered image.

Gaussian Noise — Figure: deblurring. (a) original: 512×512, (b) motion-blurred image, (c) recovered image.

Gaussian Noise — Figure: inpainting. (a) original: 512×512, (b) scratched image, (c) recovered image.

Algorithms: Saddle Point Formulation and Dual Problem

1. Dual norm:
$$\min_u \max_{\|p\|_\infty \le 1} \ \langle \nabla u, p \rangle + \tfrac{1}{2}\|Au - b\|_2^2.$$
2. Convex conjugate (Legendre-Fenchel transform) of $J(u) = \|\nabla u\|$:
$$\min_u \sup_p \ \langle p, \nabla u \rangle - J^*(p) + \tfrac{1}{2}\|Au - b\|_2^2, \quad \text{where } J^*(p) = \sup_z \langle p, z \rangle - J(z).$$
3. Lagrangian function:
$$\max_p \inf_{u, z} L_1(u, p, z), \quad \text{where } L_1(u, p, z) := \tfrac{1}{2}\|Au - b\|_2^2 + \|z\| + \langle p, \nabla u - z \rangle.$$

Algorithms

4. Dual problem:
$$\max_p \Bigl\{ \inf_u \ \langle p, \nabla u \rangle - J^*(p) + \tfrac{1}{2}\|Au - b\|_2^2 \Bigr\} = \max_p \ \bigl\{ -J^*(p) - H^*(\operatorname{div} p) \bigr\},$$
where $H(u) = \tfrac{1}{2}\|Au - b\|_2^2$.
5. Lagrangian function with $y = \operatorname{div} p$:
$$\max_{p, y} \inf_u L_2(u, p, y), \quad \text{where } L_2(u, p, y) := -J^*(p) - H^*(y) + \langle u, \operatorname{div} p - y \rangle.$$

ADMM: Alternating Direction Method of Multipliers [7]

Augmented Lagrangian function:
$$L_\alpha(u, p, z) := \tfrac{1}{2}\|Au - b\|_2^2 + \|z\| + \langle p, \nabla u - z \rangle + \tfrac{\alpha}{2}\|\nabla u - z\|_2^2$$
$$u^{k+1} = \arg\min_u \ \tfrac{1}{2}\|Au - b\|_2^2 + \langle p^k, \nabla u - z^k \rangle + \tfrac{\alpha}{2}\|\nabla u - z^k\|_2^2$$
$$z^{k+1} = \arg\min_z \ \|z\| + \langle p^k, \nabla u^{k+1} - z \rangle + \tfrac{\alpha}{2}\|\nabla u^{k+1} - z\|_2^2$$
$$p^{k+1} = p^k + \alpha(z^{k+1} - \nabla u^{k+1})$$

ADMM on the primal (or 3) is equivalent to Douglas-Rachford [8] on the dual (4). ADMM on the dual (or 5) is equivalent to Douglas-Rachford on the primal.

[7] E. Esser, X. Zhang, and T. Chan, A general framework for a class of first order primal-dual algorithms for convex optimization in imaging science, SIAM J. Imaging Sci., 2010.
[8] J. Douglas and H. H. Rachford, On the numerical solution of heat conduction problems in two or three space variables, Trans. Amer. Math. Soc., 1956.

ADMM: Alternating Minimization Algorithm & Forward-Backward Splitting Method

AMA [9] on the primal (or 3):
$$u^{k+1} = \arg\min_u \ \tfrac{1}{2}\|Au - b\|_2^2 + \langle p^k, \nabla u - z^k \rangle$$
$$z^{k+1} = \arg\min_z \ \|z\| + \langle p^k, \nabla u^{k+1} - z \rangle + \tfrac{\alpha}{2}\|\nabla u^{k+1} - z\|_2^2$$
$$p^{k+1} = p^k + \alpha(z^{k+1} - \nabla u^{k+1})$$

FBS [10] on the dual (4):
$$u^{k+1} = \arg\min_u \ \tfrac{1}{2}\|Au - b\|_2^2 + \langle p^k, \nabla u \rangle$$
$$p^{k+1} = \arg\min_p \ J^*(p) - \langle p, \nabla u^{k+1} \rangle + \tfrac{1}{2\alpha}\|p - p^k\|_2^2$$

FBS on the dual is equivalent to AMA on the primal (or 3).

[9] P. Tseng, Applications of a splitting algorithm to decomposition in convex programming and variational inequalities, SIAM J. Control Optim., 1991.
[10] P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting, Multiscale Model. Simul., 2005.

Proximal Splitting Method

Linearly constrained separable convex programming problem:
$$\min_{x, y} \ \{ f(x) + g(y) \mid Qx = y \}.$$
Augmented Lagrangian function:
$$L_\alpha(x, y, z) := f(x) + g(y) + \langle z, y - Qx \rangle + \tfrac{\alpha}{2}\|Qx - y\|_2^2.$$
$$x^{k+1} = \arg\min_x L_\alpha(x, y^k, z^k)$$
$$y^{k+1} = \arg\min_y L_\alpha(x^{k+1}, y, z^k)$$
$$z^{k+1} = z^k + \alpha(Qx^{k+1} - y^{k+1})$$
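
A hedged sketch of this ADMM scheme, specialized to $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$ and $g(y) = \mu\|y\|_1$ so that every subproblem is in closed form. The multiplier sign convention follows the common ADMM form and may differ from the slides by the sign of $z$; $\alpha$, $\mu$, and the iteration count are illustrative.

```python
# A hedged sketch of ADMM for min f(x) + g(y) s.t. Qx = y, specialized to a least-squares f
# and an l1 regularizer g; sign conventions follow the common ADMM form.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_separable(A, b, Q, mu, alpha=1.0, iters=200):
    x = np.zeros(A.shape[1]); y = np.zeros(Q.shape[0]); z = np.zeros(Q.shape[0])
    M = A.T @ A + alpha * Q.T @ Q                      # x-subproblem system matrix (fixed)
    for _ in range(iters):
        x = np.linalg.solve(M, A.T @ b - Q.T @ z + alpha * Q.T @ y)  # x-update (least squares)
        y = soft_threshold(Q @ x + z / alpha, mu / alpha)            # y-update (prox of g)
        z = z + alpha * (Q @ x - y)                                  # multiplier update
    return x, y

# e.g. with Q = identity this solves the l1-regularized least squares problem from Part I
```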

Proximal Splitting Method

If $f(x) = \tfrac{1}{2}\|Ax - b\|_2^2$, then
$$x^{k+1} = (A^T A + \alpha Q^T Q)^{-1} (A^T b + Q^T z^k + \alpha Q^T y^k).$$
If $f(x) = \langle \mathbf{1}, Ax \rangle - \langle b, \log(Ax) \rangle$ (Poisson noise), we need an inner solver or an inversion involving $Q$ and/or $A$.
If $f(x) = \langle x + b e^{-x}, \mathbf{1} \rangle$ (multiplicative noise), we need an inner solver.

Poisson Noise — Figure: deblurring. (a) original, (b) blurred image, (c) recovered image.

Multiplicative Noise

Multiplicative Noise — Figure. Copyright: Sandia Nat. Lab. (http://www.sandia.gov/radar/sar.html)

Multiplicative Noise — Figure: 4M ($2^{22}$) pixel image @ 20 sec.


Proximal Splitting Method

Goal: to avoid any inner iterations or inversions involving the Laplacian operator that are required in algorithms based on the augmented Lagrangian.

Alternating minimization algorithm:
$$x^{k+1} = \arg\min_x L_0(x, y^k, z^k), \qquad y^{k+1} = \arg\min_y L_\alpha(x^{k+1}, y, z^k), \qquad z^{k+1} = z^k + \alpha(Qx^{k+1} - y^{k+1}).$$

Linearized augmented Lagrangian with proximal function:
$$x^{k+1} = \arg\min_x LL_\alpha(x; x^k, y^k, z^k), \qquad y^{k+1} = \arg\min_y L_\alpha(x^{k+1}, y, z^k), \qquad z^{k+1} = z^k + \alpha(Qx^{k+1} - y^{k+1}),$$
where
$$LL_\alpha(x; x^k, y^k, z^k) := \langle \nabla f(x^k), x - x^k \rangle + g(y^k) + \langle z^k, y^k - Qx \rangle + \alpha \langle Q^T(Qx^k - y^k), x - x^k \rangle + \tfrac{1}{2\delta}\|x - x^k\|_2^2.$$
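
A hedged sketch of this linearized variant: the $x$-subproblem is replaced by one explicit step ($f$ and the quadratic penalty linearized at $x^k$, plus a $\tfrac{1}{2\delta}$ proximal term), so no linear system in $A$ or $Q$ has to be solved. The stepsize choice and sign conventions below match the previous sketch and are illustrative, not necessarily the talk's exact scheme.

```python
# A hedged sketch of a linearized ADMM step for f(x) = 0.5*||Ax-b||^2, g(y) = mu*||y||_1:
# the x-update is a single gradient-type step, avoiding any inversion in A or Q.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def linearized_admm(A, b, Q, mu, alpha=1.0, delta=None, iters=300):
    if delta is None:
        delta = 1.0 / (np.linalg.norm(A, 2)**2 + alpha * np.linalg.norm(Q, 2)**2)  # safe choice
    x = np.zeros(A.shape[1]); y = np.zeros(Q.shape[0]); z = np.zeros(Q.shape[0])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                                      # gradient of f at x^k
        x = x - delta * (grad + Q.T @ z + alpha * Q.T @ (Q @ x - y))  # linearized x-update
        y = soft_threshold(Q @ x + z / alpha, mu / alpha)             # y-update (prox of g)
        z = z + alpha * (Q @ x - y)                                   # multiplier update
    return x
```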

Thank you!