Majorization Minimization - the Technique of Surrogate


Majorization Minimization - the Technique of Surrogate
Andersen Ang
Mathématique et de Recherche opérationnelle, Faculté polytechnique de Mons, UMONS, Mons, Belgium
email: manshun.ang@umons.ac.be, homepage: angms.science
June 6, 2017
1 / 22

Overview
1 Introduction to Majorization Minimization
   What is Majorization Minimization
   Definition of surrogate
2 How to construct a surrogate
   By quadratic upper bound of a convex smooth function
   By quadratic lower bound of a strongly convex function
   By Taylor expansion: 1st and 2nd order
   By inequalities
2 / 22

What is Majorization Minimization
Majorization Minimization (MM) is an optimization technique. More accurately, MM itself is not an algorithm, but a framework for constructing an optimization algorithm. An example of MM is Expectation Maximization (the EM algorithm). Another name for MM is the successive upper-bound minimization method.
3 / 22

The idea of MM: successive upper-bound minimization
Want to solve min_{x ∈ Q} f(x).
How to solve it: construct an iterative algorithm that produces a sequence {x_k} such that the objective function is non-increasing: f(x_{k+1}) ≤ f(x_k).
Problem: if f is complicated, we cannot handle the problem directly.
Idea: attack the problem indirectly. Generate the sequence {x_k} that minimizes f by means of another, simpler function g such that minimizing g helps to minimize f. g is called the surrogate function (or auxiliary function).
How does minimizing g help to minimize f? It does if g is an upper bound of f.
4 / 22

The idea of MM - successive upper-bound minimization
In other words, the idea of MM is:
1. [The original problem is too complicated]. We want to solve min_{x ∈ Q} f(x), but f is too complicated, or solving min_{x ∈ Q} f(x) directly is too expensive.
2a. [Indirect attack on the problem via a surrogate]. Find/construct a simpler function g such that solving min_{x ∈ Q} g(x) is cheaper.
2b. Finally, use the solution of min_{x ∈ Q} g(x) to solve min_{x ∈ Q} f(x).
Questions: how to find g? How to use the information from g to minimize f?
5 / 22

The surrogate g
The surrogate function g(x) can be defined as a parametric function of the form g(x | θ), where θ is the parameter.
In MM for minimization, θ can be taken to be x_k, i.e. the information about the variable x at the current iteration is used to construct g.
The surrogate helps to minimize f by taking the variable at the next iteration to be the minimizer of the current surrogate:
   x_{k+1} = arg min_{x ∈ Q} g_k(x | x_k)
6 / 22

Overall framework of MM
To solve min_{x ∈ Q} f(x), use the following surrogate scheme:
(1) Initialize x_0.
(2) Construct a surrogate function at x_k as g_k(x | x_k).
(3) Update: x_{k+1} = arg min_{x ∈ Q} g_k(x | x_k).
(4) Repeat (2)-(3) until convergence.
Note that the surrogate function changes in each iteration, as g_k(x | x_k) depends on the changing iterate x_k.
Also notice that the process of minimizing g helps to minimize f, as the sequence {x_k} produced satisfies f(x_{k+1}) ≤ f(x_k).
7 / 22
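As an illustration (my own sketch, not from the slides), steps (1)-(4) can be written as a generic loop. The names `mm_minimize` and the toy surrogate minimizer are hypothetical; the toy problem uses f(x) = x^4 on [-1, 1], where f''(x) ≤ 12, so minimizing the quadratic upper bound with β = 12 gives the step x_k - f'(x_k)/12:

```python
def mm_minimize(f, surrogate_min, x0, max_iter=100, tol=1e-12):
    """Generic MM loop; surrogate_min(xk) returns arg min_x g_k(x | x_k)."""
    x = x0
    history = [f(x)]
    for _ in range(max_iter):
        x = surrogate_min(x)                 # step (3): minimize the surrogate
        history.append(f(x))
        if history[-2] - history[-1] < tol:  # f(x_k) is non-increasing
            break
    return x, history

# Toy problem: f(x) = x^4, surrogate minimizer x_k - f'(x_k)/beta with beta = 12.
f = lambda x: x**4
x_star, hist = mm_minimize(f, lambda xk: xk - 4.0 * xk**3 / 12.0, x0=0.9)
assert all(a >= b for a, b in zip(hist, hist[1:]))  # monotone decrease
```

The loop itself never looks inside f beyond evaluating it; all problem structure lives in the surrogate minimizer, which is the point of the framework.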

More questions
But, more precisely:
(1) How to construct g?
(2) What conditions must g satisfy?
(3) How to optimize g?
(4) How to know that optimizing g is cheaper than optimizing f?
8 / 22

Definition of surrogate
Two key conditions for a surrogate g are:
1. g majorizes the original function f at x_k for all other points. Mathematically, g_k(x | x_k) ≥ f(x), ∀x, x_k ∈ Q.
2. g touches the original function f at x_k, i.e. at the point x = x_k. Mathematically, g_k(x_k | x_k) = f(x_k), ∀x_k ∈ Q.
[Figure: the surrogate g lies above f and touches it at x_k; the minimizer of g gives the next iterate x_{k+1}.]
9 / 22

The convergence theorem of MM
Theorem. If the surrogate g satisfies the two conditions:
1. g_k(x | x_k) ≥ f(x), ∀x.
2. g_k(x_k | x_k) = f(x_k), ∀x_k.
then the iterative method x_{k+1} = arg min_{x ∈ Q} g_k(x | x_k) produces a non-increasing sequence f(x_k), i.e. f(x_{k+1}) ≤ f(x_k), that converges to a local optimum.
Proof.
f(x_{k+1}) ≤ g_k(x_{k+1} | x_k)   [by condition 1]
           ≤ g_k(x_k | x_k)       [x_{k+1} minimizes g_k]
           = f(x_k)               [by condition 2]
10 / 22

Construct a surrogate by the quadratic upper bound of a smooth convex function
Suppose f is convex (both f and dom f) and β-smooth (∇f is Lipschitz continuous with parameter β). Then f is bounded above as follows: for any x_0 ∈ dom f and all x ∈ dom f,
   f(x) ≤ f(x_0) + ∇f(x_0)^T (x − x_0) + (β/2) ‖x − x_0‖²₂.
The surrogate can be defined as this upper bound:
   g_k(x | x_k) = f(x_k) + ∇f(x_k)^T (x − x_k) + (β/2) ‖x − x_k‖²₂.
Pros: simple construction of g.
Cons: needs knowledge of β.
11 / 22
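A quick numerical sanity check of this bound (my own example, not from the slides): f(x) = log(1 + e^x) is β-smooth with β = 1/4, since f''(x) = σ(x)(1 − σ(x)) ≤ 1/4. The quadratic upper bound then majorizes f, touches it at x_k, and its minimizer is the gradient step x_{k+1} = x_k − ∇f(x_k)/β:

```python
import numpy as np

f = lambda x: np.log1p(np.exp(x))      # logistic loss, beta-smooth, beta = 1/4
grad = lambda x: 1.0 / (1.0 + np.exp(-x))
beta = 0.25

xk = 0.7                               # current iterate
g = lambda x: f(xk) + grad(xk) * (x - xk) + 0.5 * beta * (x - xk)**2

xs = np.linspace(-5.0, 5.0, 1001)
assert np.all(g(xs) >= f(xs) - 1e-12)  # g majorizes f everywhere
assert abs(g(xk) - f(xk)) < 1e-12      # g touches f at x_k

# Minimizing g in closed form gives the gradient step with step size 1/beta.
x_next = xk - grad(xk) / beta
assert f(x_next) <= f(xk)              # MM descent property
```

This is why gradient descent with step size 1/β can be read as an MM method: each step exactly minimizes this quadratic surrogate.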

Construct a surrogate by the quadratic lower bound of a strongly convex function
Suppose we want to do Minorization Maximization (the opposite of MM). Suppose f is α-strongly convex. Then f is bounded below as follows: for any x_0 ∈ dom f and all x ∈ dom f,
   f(x) ≥ f(x_0) + ∇f(x_0)^T (x − x_0) + (α/2) ‖x − x_0‖²₂.
The surrogate can be defined as this lower bound:
   g_k(x | x_k) = f(x_k) + ∇f(x_k)^T (x − x_k) + (α/2) ‖x − x_k‖²₂.
Pros: simple construction of g.
Cons: needs knowledge of α.
12 / 22

Construct a surrogate by the first-order Taylor expansion
Taylor expansion. The Taylor expansion of a differentiable f at a point x_0 is
   f(x) = f(x_0) + ∇f(x_0)^T (x − x_0) + O,
where O denotes the higher-order terms.
If f is convex, the first-order Taylor approximation is a global underestimator of f, i.e. f(x) ≥ f(x_0) + ∇f(x_0)^T (x − x_0). This is useful for Minorization Maximization.
If f is concave, the first-order Taylor approximation is a global overestimator of f, i.e. f(x) ≤ f(x_0) + ∇f(x_0)^T (x − x_0). This is useful for Majorization Minimization.
13 / 22
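Both claims can be checked numerically (my own examples, not from the slides): exp is convex, so its tangent line is a global underestimator; log is concave, so its tangent line is a global overestimator:

```python
import numpy as np

x0 = 1.0
xs = np.linspace(0.1, 5.0, 500)

# Convex f(x) = exp(x): tangent at x0 lies below f everywhere.
taylor_exp = np.exp(x0) + np.exp(x0) * (xs - x0)
assert np.all(np.exp(xs) >= taylor_exp - 1e-12)  # global underestimator

# Concave f(x) = log(x): tangent at x0 lies above f everywhere.
taylor_log = np.log(x0) + (1.0 / x0) * (xs - x0)
assert np.all(np.log(xs) <= taylor_log + 1e-12)  # global overestimator
```

In the concave case the tangent line is itself a valid MM surrogate: it majorizes f and touches it at x_0.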

Construct a surrogate by majorizing the second-order Taylor expansion
Consider the Taylor expansion at x_0 again:
   f(x) = f(x_0) + ∇f(x_0)^T (x − x_0) + O.
Suppose f is twice differentiable. Now express the higher-order term O explicitly in Lagrange form:
   O = (1/2)(x − x_0)^T ∇²f(ξ) (x − x_0),
where ξ is some point between x and x_0 (by the mean value theorem). The Taylor expansion of f at the point x_0 becomes
   f(x) = f(x_0) + ∇f(x_0)^T (x − x_0) + (1/2)(x − x_0)^T ∇²f(ξ) (x − x_0).
14 / 22

Construct a surrogate by majorizing the second-order Taylor expansion
One can construct a surrogate of the form
   g(x | x_0) = f(x_0) + ∇f(x_0)^T (x − x_0) + (1/2)(x − x_0)^T M (x − x_0),   if M ⪰ ∇²f(x), ∀x.
The key is M ⪰ ∇²f(x): if M − ∇²f(x) is positive semi-definite ∀x (including the case x = ξ), then
   g(x | x_0) − f(x) = (1/2)(x − x_0)^T (M − ∇²f(ξ)) (x − x_0) ≥ 0.
Hence g majorizes f. How to form M: M = ∇²f + δI with δ > 0.
15 / 22

Construct a surrogate by majorizing the second-order Taylor expansion - least squares example
Consider f(x) = ‖Ax − b‖²₂ (squared 2-norm):
   ‖Ax − b‖² = (Ax − b)^T (Ax − b) = x^T A^T A x − x^T A^T b − b^T A x + b^T b = x^T A^T A x − 2x^T A^T b + b^T b,
   ∇_x ‖Ax − b‖² = 2A^T A x − 2A^T b = 2A^T (Ax − b),
   ∇²_x ‖Ax − b‖² = 2A^T A.
The second-order Taylor expansion of f(x) = ‖Ax − b‖² (exact, since f is quadratic) is
   f(x) = f(x_0) + (2A^T (Ax_0 − b))^T (x − x_0) + (x − x_0)^T A^T A (x − x_0).
Thus the following g majorizes f:
   g(x | x_0) = f(x_0) + (2A^T (Ax_0 − b))^T (x − x_0) + (x − x_0)^T M (x − x_0),   where M ⪰ A^T A.
A simple way to construct M is M = A^T A + δI with δ > 0.
16 / 22
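A minimal sketch of this surrogate in action (my own example and variable names, not from the slides). Setting the gradient of g(x | x_0) to zero gives the closed-form update x_{k+1} = x_k − (1/2) M⁻¹ ∇f(x_k), whose iterates decrease f monotonically and converge to the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: float(np.sum((A @ x - b)**2))

# Surrogate matrix M = A^T A + delta*I, so M - A^T A = delta*I is PSD.
delta = 1.0
M = A.T @ A + delta * np.eye(5)
Minv = np.linalg.inv(M)

x = np.zeros(5)
vals = [f(x)]
for _ in range(300):
    gradf = 2.0 * A.T @ (A @ x - b)   # grad f(x_k) = 2 A^T (A x_k - b)
    x = x - 0.5 * Minv @ gradf        # minimizer of g(. | x_k)
    vals.append(f(x))

assert all(u >= v - 1e-9 for u, v in zip(vals, vals[1:]))  # monotone decrease
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]  # exact least-squares solution
assert np.allclose(x, x_ls, atol=1e-6)       # MM iterates converge to it
```

Larger δ makes each step cheaper to trust (the surrogate is looser) but slows convergence; δ → 0 recovers one-shot Newton for this quadratic f.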

Construct a surrogate by majorizing the second-order Taylor expansion - NMF example
In Non-negative Matrix Factorization, we have f(W, h) = ‖Wh − x‖²₂, where x is given and W, h are the variables. The second-order Taylor expansion of f in h (exact, since f is quadratic in h) is
   f(W, h) = f(h_0) + (2W^T (Wh_0 − x))^T (h − h_0) + (h − h_0)^T W^T W (h − h_0).
We can construct M as M = Diag( [W^T W h_0]_i / [h_0]_i ); then M ⪰ W^T W and
   g(h | h_0) = f(h_0) + (2W^T (Wh_0 − x))^T (h − h_0) + (h − h_0)^T M (h − h_0).
For details, see the slides "Convergence analysis of NMF algorithm" and the original paper by Lee and Seung (2001).
17 / 22
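Minimizing this diagonal-M surrogate in closed form gives h_i = h_{0,i} · [W^T x]_i / [W^T W h_0]_i, which is the Lee-Seung multiplicative update. A minimal sketch (my own toy data; W is held fixed and only h is updated, which is one half of the usual alternating scheme):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((30, 4))       # fixed nonnegative factor
x = W @ rng.random(4)         # target vector, lying in the range of W
h = np.full(4, 0.5)           # positive initial h

f = lambda h: float(np.sum((W @ h - x)**2))

# Multiplicative update h <- h * (W^T x) / (W^T W h): the minimizer of the
# diagonal-M surrogate; it preserves nonnegativity automatically.
vals = [f(h)]
for _ in range(500):
    h = h * (W.T @ x) / (W.T @ (W @ h))
    vals.append(f(h))

assert all(u >= v - 1e-9 for u, v in zip(vals, vals[1:]))  # monotone decrease
assert np.all(h >= 0)                                      # feasibility kept
```

The nonnegativity constraint never needs explicit projection here: since all factors in the update are nonnegative, a positive h stays positive, which is the practical appeal of this particular surrogate.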

Construct surrogates by inequalities
Jensen's inequality. If f is convex, then
   f( Σ_i λ_i t_i ) ≤ Σ_i λ_i f(t_i),
where λ_i ≥ 0 and Σ_i λ_i = 1.
Example. Let t_i = (c_i/λ_i)(x_i − y_i) + c^T y. Then
   λ^T t = Σ_i λ_i t_i = Σ_i ( c_i x_i − c_i y_i + λ_i c^T y ) = Σ_i c_i x_i − Σ_i c_i y_i + ( Σ_i λ_i ) c^T y = c^T x − c^T y + c^T y = c^T x.
18 / 22

Construct surrogates by inequalities
By Jensen's inequality f( Σ_i λ_i t_i ) ≤ Σ_i λ_i f(t_i),
   f( λ^T t ) ≤ Σ_i λ_i f(t_i) = Σ_i λ_i f( (c_i/λ_i)(x_i − y_i) + c^T y ).
As λ^T t = c^T x, thus
   f( c^T x ) ≤ Σ_i λ_i f( (c_i/λ_i)(x_i − y_i) + c^T y ).
So for a convex function f, a surrogate function of f(c^T x) is g(x | y) = Σ_i λ_i f( (c_i/λ_i)(x_i − y_i) + c^T y ), with λ_i ≥ 0 and Σ_i λ_i = 1.
19 / 22

Construct surrogates by inequalities
Example. If c, x and y are all positive, let t_i = (x_i / y_i) c^T y and λ_i = c_i y_i / c^T y (hence again λ^T t = c^T x). Then by Jensen's inequality,
   f( λ^T t ) ≤ Σ_i λ_i f(t_i) = Σ_i ( c_i y_i / c^T y ) f( (x_i / y_i) c^T y ).
Hence
   f( c^T x ) ≤ Σ_i ( c_i y_i / c^T y ) f( (x_i / y_i) c^T y ).
So for a convex function f, a surrogate function of f(c^T x), with positive c and x, is g(x | y) = Σ_i ( c_i y_i / c^T y ) f( (x_i / y_i) c^T y ), where y also has to be positive.
Advantage of the surrogates in these two examples: g is separable in the coordinates of x, and thus parallelizable.
20 / 22
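Both the weight construction and the inequality can be verified numerically (my own example, with f = exp as the convex function; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
c = rng.random(5) + 0.1   # all positive
x = rng.random(5) + 0.1
y = rng.random(5) + 0.1

f = np.exp                 # a convex f, applied element-wise
cty = float(c @ y)
lam = c * y / cty          # lambda_i = c_i y_i / c^T y; sums to 1
t = (x / y) * cty          # t_i = (x_i / y_i) c^T y

g_at_x = float(np.sum(lam * f(t)))               # separable surrogate g(x | y)

assert np.isclose(lam.sum(), 1.0)                # valid Jensen weights
assert np.isclose(float(lam @ t), float(c @ x))  # weights recover c^T x
assert f(float(c @ x)) <= g_at_x + 1e-12         # f(c^T x) <= g(x | y)

# Touching condition: at x = y every t_i = c^T y, so g(y | y) = f(c^T y).
t_y = (y / y) * cty
assert np.isclose(float(np.sum(lam * f(t_y))), f(cty))
```

Separability means each coordinate x_i appears in exactly one term of the sum, so the surrogate can be minimized coordinate-by-coordinate, in parallel.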

Construct surrogates by inequalities
Other inequalities:
Cauchy-Schwarz inequality: u^T v ≤ ‖u‖ ‖v‖.
Arithmetic-geometric mean inequality: ( Π_{i=1}^n x_i )^{1/n} ≤ (1/n) Σ_{i=1}^n x_i.
Chebyshev, Hölder, and so on...
21 / 22

Last page - summary
Introduction to Majorization Minimization:
(1) Initialize x_0.
(2) Construct a surrogate function at x_k as g_k(x | x_k).
(3) Update: x_{k+1} = arg min_{x ∈ Q} g_k(x | x_k).
(4) Repeat (2)-(3) until convergence.
The surrogate function:
1. g majorizes the original function f at x_k for all other points. Mathematically, g_k(x | x_k) ≥ f(x), ∀x, x_k ∈ Q.
2. g touches the original function f at x_k, i.e. at the point x = x_k.
Construction of the surrogate function via various methods.
End of document
22 / 22