Majorization Minimization - the Technique of Surrogate


Majorization Minimization - the Technique of Surrogate
Andersen Ang
Mathématique et de Recherche opérationnelle, Faculté polytechnique de Mons, UMONS, Mons, Belgium
email: manshun.ang@umons.ac.be, homepage: angms.science
June 6, 2017
1 / 22

Overview
1 Introduction to Majorization Minimization
   What is Majorization Minimization
   Definition of surrogate
2 How to construct a surrogate
   By quadratic upper bound of a convex smooth function
   By quadratic lower bound of a strongly convex function
   By Taylor expansion: 1st and 2nd order
   By inequalities
2 / 22

What is Majorization Minimization
Majorization Minimization (MM) is an optimization technique. More accurately, MM itself is not an algorithm, but a framework for constructing an optimization algorithm. An example of MM is Expectation Maximization (the EM algorithm). Another name for MM is the successive upper-bound minimization method.
3 / 22

The idea of MM: successive upper-bound minimization
Want to solve min_{x ∈ Q} f(x).
How to solve it: construct an iterative algorithm that produces a sequence {x_k} such that the objective function is non-increasing: f(x_{k+1}) ≤ f(x_k).
Problem: if f is complicated, we cannot handle the problem directly.
Idea: attack the problem indirectly. Generate the sequence {x_k} that minimizes f by means of another, simpler function g such that minimizing g helps to minimize f. g is called the surrogate function (or auxiliary function).
How does minimizing g help to minimize f? It does if g is an upper bound of f.
4 / 22

The idea of MM - successive upper-bound minimization
In other words, the idea of MM is:
1. [The original problem is too complicated]. We want to solve min_{x ∈ Q} f(x), but f is too complicated, or solving min_{x ∈ Q} f(x) directly is too expensive.
2a. [Indirect attack on the problem via a surrogate]. Find/construct a simpler function g such that solving min_{x ∈ Q} g(x) is cheaper.
2b. Finally, use the solution of min_{x ∈ Q} g(x) to solve min_{x ∈ Q} f(x).
Questions: how to find g? How to use the information from g to minimize f?
5 / 22

The surrogate g
The surrogate function g(x) can be defined as a parametric function of the form g(x | θ), where θ is the parameter.
In MM for minimization, θ can be taken to be x_k, i.e. the information about the variable x at the current iteration is used to construct g.
The surrogate helps to minimize f by taking the variable at the next iteration to be the minimizer of the current surrogate:
   x_{k+1} = arg min_{x ∈ Q} g_k(x | x_k)
6 / 22

Overall framework of MM
To solve min_{x ∈ Q} f(x), use the following surrogate scheme:
(1) Initialize x_0.
(2) Construct a surrogate function at x_k as g_k(x | x_k).
(3) Update: x_{k+1} = arg min_{x ∈ Q} g_k(x | x_k).
(4) Repeat (2)-(3) until convergence.
Note that the surrogate function changes in each iteration, as g_k(x | x_k) depends on the changing iterate x_k.
Also notice that the process of minimizing g helps to minimize f, as the sequence {x_k} produced satisfies f(x_{k+1}) ≤ f(x_k).
7 / 22
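As an illustration (my own sketch, not from the slides), steps (1)-(4) can be written as a generic loop. The names `mm_minimize` and the toy surrogate minimizer are hypothetical; the toy problem uses f(x) = x^4 on [-1, 1], where f''(x) ≤ 12, so minimizing the quadratic upper bound with β = 12 gives the step x_k - f'(x_k)/12:

```python
def mm_minimize(f, surrogate_min, x0, max_iter=100, tol=1e-12):
    """Generic MM loop; surrogate_min(xk) returns arg min_x g_k(x | x_k)."""
    x = x0
    history = [f(x)]
    for _ in range(max_iter):
        x = surrogate_min(x)                 # step (3): minimize the surrogate
        history.append(f(x))
        if history[-2] - history[-1] < tol:  # f(x_k) is non-increasing
            break
    return x, history

# Toy problem: f(x) = x^4, surrogate minimizer x_k - f'(x_k)/beta with beta = 12.
f = lambda x: x**4
x_star, hist = mm_minimize(f, lambda xk: xk - 4.0 * xk**3 / 12.0, x0=0.9)
assert all(a >= b for a, b in zip(hist, hist[1:]))  # monotone decrease
```

The loop itself never looks inside f beyond evaluating it; all problem structure lives in the surrogate minimizer, which is the point of the framework.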

More questions
But, more precisely:
(1) How to construct g?
(2) What conditions must g satisfy?
(3) How to optimize g?
(4) How to know that optimizing g is cheaper than optimizing f?
8 / 22

Definition of surrogate
Two key conditions for a surrogate g are:
1. g majorizes the original function f at x_k for all other points. Mathematically, g_k(x | x_k) ≥ f(x), ∀x, x_k ∈ Q.
2. g touches the original function f at x_k, i.e. at the point x = x_k. Mathematically, g_k(x_k | x_k) = f(x_k), ∀x_k ∈ Q.
[Figure: the surrogate g lies above f and touches it at x_k; the minimizer of g gives the next iterate x_{k+1}.]
9 / 22

The convergence theorem of MM
Theorem. If the surrogate g satisfies the two conditions:
1. g_k(x | x_k) ≥ f(x), ∀x.
2. g_k(x_k | x_k) = f(x_k), ∀x_k.
then the iterative method x_{k+1} = arg min_{x ∈ Q} g_k(x | x_k) produces a non-increasing sequence f(x_k), i.e. f(x_{k+1}) ≤ f(x_k), that converges to a local optimum.
Proof.
f(x_{k+1}) ≤ g_k(x_{k+1} | x_k)   [by condition 1]
           ≤ g_k(x_k | x_k)       [x_{k+1} minimizes g_k]
           = f(x_k)               [by condition 2]
10 / 22

Construct a surrogate by the quadratic upper bound of a smooth convex function
Suppose f is convex (both f and dom f) and β-smooth (∇f is Lipschitz continuous with parameter β). Then f is bounded above as follows: for any x_0 ∈ dom f and all x ∈ dom f,
   f(x) ≤ f(x_0) + ∇f(x_0)^T (x − x_0) + (β/2) ‖x − x_0‖²₂.
The surrogate can be defined as this upper bound:
   g_k(x | x_k) = f(x_k) + ∇f(x_k)^T (x − x_k) + (β/2) ‖x − x_k‖²₂.
Pros: simple construction of g.
Cons: needs knowledge of β.
11 / 22
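A quick numerical sanity check of this bound (my own example, not from the slides): f(x) = log(1 + e^x) is β-smooth with β = 1/4, since f''(x) = σ(x)(1 − σ(x)) ≤ 1/4. The quadratic upper bound then majorizes f, touches it at x_k, and its minimizer is the gradient step x_{k+1} = x_k − ∇f(x_k)/β:

```python
import numpy as np

f = lambda x: np.log1p(np.exp(x))      # logistic loss, beta-smooth, beta = 1/4
grad = lambda x: 1.0 / (1.0 + np.exp(-x))
beta = 0.25

xk = 0.7                               # current iterate
g = lambda x: f(xk) + grad(xk) * (x - xk) + 0.5 * beta * (x - xk)**2

xs = np.linspace(-5.0, 5.0, 1001)
assert np.all(g(xs) >= f(xs) - 1e-12)  # g majorizes f everywhere
assert abs(g(xk) - f(xk)) < 1e-12      # g touches f at x_k

# Minimizing g in closed form gives the gradient step with step size 1/beta.
x_next = xk - grad(xk) / beta
assert f(x_next) <= f(xk)              # MM descent property
```

This is why gradient descent with step size 1/β can be read as an MM method: each step exactly minimizes this quadratic surrogate.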

Construct a surrogate by the quadratic lower bound of a strongly convex function
Suppose we want to do Minorization Maximization (the opposite of MM). Suppose f is α-strongly convex. Then f is bounded below as follows: for any x_0 ∈ dom f and all x ∈ dom f,
   f(x) ≥ f(x_0) + ∇f(x_0)^T (x − x_0) + (α/2) ‖x − x_0‖²₂.
The surrogate can be defined as this lower bound:
   g_k(x | x_k) = f(x_k) + ∇f(x_k)^T (x − x_k) + (α/2) ‖x − x_k‖²₂.
Pros: simple construction of g.
Cons: needs knowledge of α.
12 / 22

Construct a surrogate by the first-order Taylor expansion
Taylor expansion. The Taylor expansion of a differentiable f at a point x_0 is
   f(x) = f(x_0) + ∇f(x_0)^T (x − x_0) + O,
where O denotes the higher-order terms.
If f is convex, the first-order Taylor approximation is a global underestimator of f, i.e. f(x) ≥ f(x_0) + ∇f(x_0)^T (x − x_0). This is useful for Minorization Maximization.
If f is concave, the first-order Taylor approximation is a global overestimator of f, i.e. f(x) ≤ f(x_0) + ∇f(x_0)^T (x − x_0). This is useful for Majorization Minimization.
13 / 22
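Both claims can be checked numerically (my own examples, not from the slides): exp is convex, so its tangent line is a global underestimator; log is concave, so its tangent line is a global overestimator:

```python
import numpy as np

x0 = 1.0
xs = np.linspace(0.1, 5.0, 500)

# Convex f(x) = exp(x): tangent at x0 lies below f everywhere.
taylor_exp = np.exp(x0) + np.exp(x0) * (xs - x0)
assert np.all(np.exp(xs) >= taylor_exp - 1e-12)  # global underestimator

# Concave f(x) = log(x): tangent at x0 lies above f everywhere.
taylor_log = np.log(x0) + (1.0 / x0) * (xs - x0)
assert np.all(np.log(xs) <= taylor_log + 1e-12)  # global overestimator
```

In the concave case the tangent line is itself a valid MM surrogate: it majorizes f and touches it at x_0.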

Construct a surrogate by majorizing the second-order Taylor expansion
Consider the Taylor expansion at x_0 again:
   f(x) = f(x_0) + ∇f(x_0)^T (x − x_0) + O.
Suppose f is twice differentiable. Now express the higher-order term O explicitly in Lagrange form:
   O = (1/2)(x − x_0)^T ∇²f(ξ) (x − x_0),
where ξ is some point between x and x_0 (by the mean value theorem). The Taylor expansion of f at the point x_0 becomes
   f(x) = f(x_0) + ∇f(x_0)^T (x − x_0) + (1/2)(x − x_0)^T ∇²f(ξ) (x − x_0).
14 / 22

Construct a surrogate by majorizing the second-order Taylor expansion
One can construct a surrogate of the form
   g(x | x_0) = f(x_0) + ∇f(x_0)^T (x − x_0) + (1/2)(x − x_0)^T M (x − x_0),   if M ⪰ ∇²f(x), ∀x.
The key is M ⪰ ∇²f(x): if M − ∇²f(x) is positive semi-definite ∀x (including the case x = ξ), then
   g(x | x_0) − f(x) = (1/2)(x − x_0)^T (M − ∇²f(ξ)) (x − x_0) ≥ 0.
Hence g majorizes f. How to form M: M = ∇²f + δI with δ > 0.
15 / 22

Construct a surrogate by majorizing the second-order Taylor expansion - least squares example
Consider f(x) = ‖Ax − b‖²₂ (squared 2-norm):
   ‖Ax − b‖² = (Ax − b)^T (Ax − b) = x^T A^T A x − x^T A^T b − b^T A x + b^T b = x^T A^T A x − 2x^T A^T b + b^T b,
   ∇_x ‖Ax − b‖² = 2A^T A x − 2A^T b = 2A^T (Ax − b),
   ∇²_x ‖Ax − b‖² = 2A^T A.
The second-order Taylor expansion of f(x) = ‖Ax − b‖² (exact, since f is quadratic) is
   f(x) = f(x_0) + (2A^T (Ax_0 − b))^T (x − x_0) + (x − x_0)^T A^T A (x − x_0).
Thus the following g majorizes f:
   g(x | x_0) = f(x_0) + (2A^T (Ax_0 − b))^T (x − x_0) + (x − x_0)^T M (x − x_0),   where M ⪰ A^T A.
A simple way to construct M is M = A^T A + δI with δ > 0.
16 / 22
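A minimal sketch of this surrogate in action (my own example and variable names, not from the slides). Setting the gradient of g(x | x_0) to zero gives the closed-form update x_{k+1} = x_k − (1/2) M⁻¹ ∇f(x_k), whose iterates decrease f monotonically and converge to the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
f = lambda x: float(np.sum((A @ x - b)**2))

# Surrogate matrix M = A^T A + delta*I, so M - A^T A = delta*I is PSD.
delta = 1.0
M = A.T @ A + delta * np.eye(5)
Minv = np.linalg.inv(M)

x = np.zeros(5)
vals = [f(x)]
for _ in range(300):
    gradf = 2.0 * A.T @ (A @ x - b)   # grad f(x_k) = 2 A^T (A x_k - b)
    x = x - 0.5 * Minv @ gradf        # minimizer of g(. | x_k)
    vals.append(f(x))

assert all(u >= v - 1e-9 for u, v in zip(vals, vals[1:]))  # monotone decrease
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]  # exact least-squares solution
assert np.allclose(x, x_ls, atol=1e-6)       # MM iterates converge to it
```

Larger δ makes each step cheaper to trust (the surrogate is looser) but slows convergence; δ → 0 recovers one-shot Newton for this quadratic f.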

Construct a surrogate by majorizing the second-order Taylor expansion - NMF example
In Non-negative Matrix Factorization, we have f(W, h) = ‖Wh − x‖²₂, where x is given and W, h are the variables. The second-order Taylor expansion of f in h (exact, since f is quadratic in h) is
   f(W, h) = f(h_0) + (2W^T (Wh_0 − x))^T (h − h_0) + (h − h_0)^T W^T W (h − h_0).
We can construct M as M = Diag( [W^T W h_0]_i / [h_0]_i ); then M ⪰ W^T W and
   g(h | h_0) = f(h_0) + (2W^T (Wh_0 − x))^T (h − h_0) + (h − h_0)^T M (h − h_0).
For details, see the slides "Convergence analysis of NMF algorithm" and the original paper by Lee and Seung (2001).
17 / 22
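Minimizing this diagonal-M surrogate in closed form gives h_i = h_{0,i} · [W^T x]_i / [W^T W h_0]_i, which is the Lee-Seung multiplicative update. A minimal sketch (my own toy data; W is held fixed and only h is updated, which is one half of the usual alternating scheme):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((30, 4))       # fixed nonnegative factor
x = W @ rng.random(4)         # target vector, lying in the range of W
h = np.full(4, 0.5)           # positive initial h

f = lambda h: float(np.sum((W @ h - x)**2))

# Multiplicative update h <- h * (W^T x) / (W^T W h): the minimizer of the
# diagonal-M surrogate; it preserves nonnegativity automatically.
vals = [f(h)]
for _ in range(500):
    h = h * (W.T @ x) / (W.T @ (W @ h))
    vals.append(f(h))

assert all(u >= v - 1e-9 for u, v in zip(vals, vals[1:]))  # monotone decrease
assert np.all(h >= 0)                                      # feasibility kept
```

The nonnegativity constraint never needs explicit projection here: since all factors in the update are nonnegative, a positive h stays positive, which is the practical appeal of this particular surrogate.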

Construct surrogates by inequalities
Jensen's inequality. If f is convex, then
   f( Σ_i λ_i t_i ) ≤ Σ_i λ_i f(t_i),
where λ_i ≥ 0 and Σ_i λ_i = 1.
Example. Let t_i = (c_i/λ_i)(x_i − y_i) + c^T y. Then
   λ^T t = Σ_i λ_i t_i = Σ_i ( c_i x_i − c_i y_i + λ_i c^T y ) = Σ_i c_i x_i − Σ_i c_i y_i + ( Σ_i λ_i ) c^T y = c^T x − c^T y + c^T y = c^T x.
18 / 22

Construct surrogates by inequalities
By Jensen's inequality f( Σ_i λ_i t_i ) ≤ Σ_i λ_i f(t_i),
   f( λ^T t ) ≤ Σ_i λ_i f(t_i) = Σ_i λ_i f( (c_i/λ_i)(x_i − y_i) + c^T y ).
As λ^T t = c^T x, thus
   f( c^T x ) ≤ Σ_i λ_i f( (c_i/λ_i)(x_i − y_i) + c^T y ).
So for a convex function f, a surrogate function of f(c^T x) is g(x | y) = Σ_i λ_i f( (c_i/λ_i)(x_i − y_i) + c^T y ), with λ_i ≥ 0 and Σ_i λ_i = 1.
19 / 22

Construct surrogates by inequalities
Example. If c, x and y are all positive, let t_i = (x_i / y_i) c^T y and λ_i = c_i y_i / c^T y (hence again λ^T t = c^T x). Then by Jensen's inequality,
   f( λ^T t ) ≤ Σ_i λ_i f(t_i) = Σ_i ( c_i y_i / c^T y ) f( (x_i / y_i) c^T y ).
Hence
   f( c^T x ) ≤ Σ_i ( c_i y_i / c^T y ) f( (x_i / y_i) c^T y ).
So for a convex function f, a surrogate function of f(c^T x), with positive c and x, is g(x | y) = Σ_i ( c_i y_i / c^T y ) f( (x_i / y_i) c^T y ), where y also has to be positive.
Advantage of the surrogates in these two examples: g is separable in the coordinates of x, and thus parallelizable.
20 / 22
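Both the weight construction and the inequality can be verified numerically (my own example, with f = exp as the convex function; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
c = rng.random(5) + 0.1   # all positive
x = rng.random(5) + 0.1
y = rng.random(5) + 0.1

f = np.exp                 # a convex f, applied element-wise
cty = float(c @ y)
lam = c * y / cty          # lambda_i = c_i y_i / c^T y; sums to 1
t = (x / y) * cty          # t_i = (x_i / y_i) c^T y

g_at_x = float(np.sum(lam * f(t)))               # separable surrogate g(x | y)

assert np.isclose(lam.sum(), 1.0)                # valid Jensen weights
assert np.isclose(float(lam @ t), float(c @ x))  # weights recover c^T x
assert f(float(c @ x)) <= g_at_x + 1e-12         # f(c^T x) <= g(x | y)

# Touching condition: at x = y every t_i = c^T y, so g(y | y) = f(c^T y).
t_y = (y / y) * cty
assert np.isclose(float(np.sum(lam * f(t_y))), f(cty))
```

Separability means each coordinate x_i appears in exactly one term of the sum, so the surrogate can be minimized coordinate-by-coordinate, in parallel.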

Construct surrogates by inequalities
Other inequalities:
Cauchy-Schwarz inequality: u^T v ≤ ‖u‖ ‖v‖.
Arithmetic-geometric mean inequality: ( Π_{i=1}^n x_i )^{1/n} ≤ (1/n) Σ_{i=1}^n x_i.
Chebyshev, Hölder, and so on...
21 / 22

Last page - summary
Introduction to Majorization Minimization:
(1) Initialize x_0.
(2) Construct a surrogate function at x_k as g_k(x | x_k).
(3) Update: x_{k+1} = arg min_{x ∈ Q} g_k(x | x_k).
(4) Repeat (2)-(3) until convergence.
The surrogate function:
1. g majorizes the original function f at x_k for all other points. Mathematically, g_k(x | x_k) ≥ f(x), ∀x, x_k ∈ Q.
2. g touches the original function f at x_k, i.e. at the point x = x_k.
Construction of the surrogate function via various methods.
End of document
22 / 22