Fenchel Duality between Strong Convexity and Lipschitz Continuous Gradient

Fenchel Duality between Strong Convexity and Lipschitz Continuous Gradient Xingyu Zhou The Ohio State University zhou.2055@osu.edu December 5, 2017 Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 1 / 47

Main Result Outline 1 Main Result 2 Review of Convex Function 3 Strong Convexity Implications of Strong Convexity 4 Lipschitz Continuous Gradient 5 Conjugate Function Useful Results 6 Proof of Main Result Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 2 / 47

Main Result Main Result Theorem (i) If f is closed and strong convex with parameter µ, then f has a Lipschitz continuous gradient with parameter 1 µ. (ii) If f is convex and has a Lipschitz continuous gradient with parameter L, then f is strong convex with parameter 1 L. We provide a very simple proof for this theorem (two-line). To this end, we first present equivalent conditions for strong convexity and Lipschitz continuous gradient. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 3 / 47

Review of Convex Function Outline 1 Main Result 2 Review of Convex Function 3 Strong Convexity Implications of Strong Convexity 4 Lipschitz Continuous Gradient 5 Conjugate Function Useful Results 6 Proof of Main Result Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 4 / 47

Review of Convex Function (Convexity) Suppose f : R n R with the extended-value extension. Then, f is said to be a convex function if for any α [0, 1], for any x and y. f (αx + (1 α)y) αf (x) + (1 α)f (y) Note: This definition does not require the differentiability of f. We are familiar that this definition is equivalent to the first-order condition and monotonicity condition assuming f is differentiable. In the following, we will first show that this equivalence still holds for a non-smooth function. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 6 / 47

Review of Convex Function for Convexity Lemma (Equivalence for Convexity) Suppose f : R n R with the extended-value extension. Then, the following statements are equivalent: (i) (Jensen s inequality): f is convex. (ii) (First-order): f (y) f (x) + s T x (y x) for all x, y and any s x f (x). (iii) (Monotonicity of subgradient): (s y s x ) T (y x) 0 for all x, y and any s x f (x), s y f (y). Note: The lemma is a non-trivial extension as we will see later in the proof. Moreover, this lemma will be extremely useful as we later deal with strong convexity. Specifically, we will prove this lemma by showing (i) = (ii) = (iii) = (i). Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 8 / 47

Review of Convex Function Proof (i) = (ii): Since f is convex, by definition, we have for any 0 < α 1 f (x + α(y x)) αf (y) + (1 α)f (x) f (x + α(y x)) f (x) f (y) f (x) α (a) = f (x, y x) f (y) f (x) (b) = s T x (y x) f (y) f (x) s x f (x) (a) is obtained by letting α 0 and the definition of directional derivate. Note that the limit always exists for a convex f by the monotonicity of difference quotient. (b) follows from the fact that f (x, v) = sup s f (x) s T v for a convex f. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 9 / 47

Review of Convex Function Proof (Cont d) (ii) = (iii): From (ii), we have Adding together directly yields (iii). f (y) f (x) + s T x (y x) f (x) f (y) + s T y (x y) Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 10 / 47

Review of Convex Function Proof (Cont d) (iii) = (i): To this end, we will show (iii) = (ii) and (ii) = (i). First, (iii) = (ii): Let φ(α) := f (x + α(y x)) and x α := x + α(y x), then f (y) f (x) = φ(1) φ(0) = 1 0 s T α (y x)dα (1) where s α f (x α ) for t [0, 1]. Then by (iii), for any s x f (x), we have s T α (y x) s T x (y x) which combined with inequality above directly implies (ii). Second, (ii) = (i): By (ii), we have f (y) f (x α ) + s T t (y x α ) (2) f (x) f (x α ) + s T t (x x α ) (3) Multiplying (2) by α and (3) by 1 α, and adding together yields (i). Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 11 / 47

Strong Convexity Outline 1 Main Result 2 Review of Convex Function 3 Strong Convexity Implications of Strong Convexity 4 Lipschitz Continuous Gradient 5 Conjugate Function Useful Results 6 Proof of Main Result Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 12 / 47

Strong Convexity Strongly Convex Function (Strong Convexity) Suppose f : R n R with the extended-value extension. Then, f is said to be strongly convex with parameter µ if is convex. g(x) = f (x) µ 2 x T x Note: As before, this definition does not require the differentiability of f. In the following, by applying the previous equivalent conditions of convexity to g, we can easily establish equivalence between strong convexity. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 14 / 47

Strong Convexity for Strong Convexity Lemma (Equivalence for Strong Convexity) Suppose f : R n R with the extended-value extension. Then, the following statements are equivalent: (i) f is strongly convex with parameter µ. (ii) f (αx + (1 α)y) αf (x) + (1 α)f (y) µ 2 α(1 α) y x 2 for any x, y. (iii) f (y) f (x) + s T x (y x) + µ 2 y x 2 for all x, y and any s x f (x). (iv) (s y s x ) T (y x) µ y x 2 for all x, y and any s x f (x), s y f (y). Proof: (i) (ii): It follows from the definition of convexity for g(x) = f (x) µ 2 x T x. (i) (iii): It follows from the equivalent first-order condition of convexity for g. (i) (iv): It follows from the equivalent monotonicity of (sub)-gradient of convexity for g. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 16 / 47

Strong Convexity Implications of Strong Convexity Outline 1 Main Result 2 Review of Convex Function 3 Strong Convexity Implications of Strong Convexity 4 Lipschitz Continuous Gradient 5 Conjugate Function Useful Results 6 Proof of Main Result Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 17 / 47

Strong Convexity Implications of Strong Convexity Conditions Implied by Strong Convexity (SC) Lemma (Implications of SC) Suppose f : R n R with the extended-value extension. The following conditions are all implied by strong convexity with parameter µ: (i) 1 2 s x 2 µ(f (x) f ), x and s x f (x). (ii) s y s x µ y x x, y and any s x f (x), s y f (y). (iii) f (y) f (x) + s T x (y x) + 1 2µ s y s x 2 x, y and any s x f (x), s y f (y). (iv) (s y s x ) T (y x) 1 µ s y s x 2 x, y and any s x f (x), s y f (y). Note: In the case of smooth function f, (i) reduced to Polyak-Lojasiewicz (PL) inequality, which is more general than SC and is extremely useful for linear convergence. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 18 / 47

Strong Convexity Implications of Strong Convexity Proof (i) 1 2 s x 2 µ(f (x) f ) (SC) = (i): By equivalence of strong convexity in the previous lemma, we have f (y) f (x) + sx T (y x) + µ y x 2 2 Taking minimization with respect to y on both sides, yields Re-arranging it yields (i). f f (x) 1 2µ s x 2 Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 19 / 47

Strong Convexity Implications of Strong Convexity Proof (Cont d) (ii) s y s x µ y x SC = (ii): By equivalence of strong convexity in the previous lemma, we have (s y s x ) T (y x) µ y x 2 Applying Cauchy-Schwartz inequality to the left-hand-side, yields (ii). Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 20 / 47

Strong Convexity Implications of Strong Convexity Proof (Cont d) (iii) f (y) f (x) + s T x (y x) + 1 2µ s y s x 2 SC = (iii): For any given s x f (x), let φ x (z) := f (z) s T x z. First, since (s φ z 1 s φ z 2 )(z 1 z 2 ) = (s f z 1 s f z 2 )(z 1 z 2 ) µ z 1 z 2 2 which implies that φ x (z) is also strong convexity with parameter µ. Then, applying PL-inequality to φ x (z), yields Re-arranging it yields (iii). φ x = f (x) sx T x φ x (y) 1 sy φ 2 2µ = f (y) sx T y 1 2µ s y s x 2. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 21 / 47

Strong Convexity Implications of Strong Convexity Proof (Cont d) (iv) (s y s x ) T (y x) 1 µ s y s x 2 SC = (iv): From previous result, we have Adding together yields (iv). f (y) f (x) + s T x (y x) + 1 2µ s y s x 2 f (x) f (y) + s T y (x y) + 1 2µ s x s y 2 Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 22 / 47

Lipschitz Continuous Gradient Outline 1 Main Result 2 Review of Convex Function 3 Strong Convexity Implications of Strong Convexity 4 Lipschitz Continuous Gradient 5 Conjugate Function Useful Results 6 Proof of Main Result Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 23 / 47

Lipschitz Continuous Gradient A differentiable function f is said to have an L-Lipschitz continuous gradient if for some L > 0 f (x) f (y) L x y, x, y. Note: The definition does not require the convexity of f. In the following, we will see different conditions that are equivalent or implied by Lipschitz continuous gradient condition. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 25 / 47

Lipschitz Continuous Gradient Relationship [0] f (x) f (y) L x y, x, y. [1] g(x) = L 2 x T x f (x) is convex, x [2] f (y) f (x) + f (x) T (y x) + L 2 y x 2, x, y. [3] ( f (x) f (y) T (x y) L x y 2, x, y. α(1 α)l [4] f (αx + (1 α)y) αf (x) + (1 α)f (y) x y 2, x, y and α [ 2 [5] f (y) f (x) + f (x) T (y x) + 1 2L f (y) f (x) 2, x, y. [6] ( f (x) f (y) T (x y) 1 L f (x) f (y) 2, x, y. [7] f (αx + (1 α)y) αf (x) + (1 α)f (y) α(1 α) f (x) f (y) 2, x, y 2L Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 27 / 47

Lipschitz Continuous Gradient Relationship (Cont d) Lemma For a function f with a Lipschitz continuous gradient over R n, the following relations hold: [5] [7] = [6] = [0] = [1] [2] [3] [4] (4) If the function f is convex, then all the conditions [0]-[7] are equivalent. Note: From (5), we can see that conditions [0]-[7] can be grouped into four classes. Class A: [1]-[4] Class B: [5],[7] Class C: [6] Class D: [0] Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 28 / 47

Lipschitz Continuous Gradient Proof Class A [1] g(x) = L 2 x T x f (x) is convex [2] f (y) f (x) + f (x) T (y x) + L y x 2 2 [3] ( f (x) f (y) T (x y) L x y 2 [4] f (αx + (1 α)y) αf (x) + (1 α)f (y) α(1 α)l x y 2 2 [1] [2]: It follows from the first-order equivalence of convexity. [1] [3]: It follows from the monotonicity of gradient equivalence of convexity. [1] [4]: It follows from the definition of convexity. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 29 / 47

Lipschitz Continuous Gradient Proof (Cont d) Class B [5] f (y) f (x) + f (x) T (y x) + 1 2L f (y) f (x) 2. [7] f (αx + (1 α)y) αf (x) + (1 α)f (y) [5] = [7]: Let x α := αx + (1 α)y, then α(1 α) f (x) f (y) 2. 2L f (x) f (x α ) + f (x α ) T (x x α ) + 1 2L f (x) f (x α) 2 f (y) f (x α ) + f (z) T (y x α ) + 1 2L f (y) f (x α) 2 Multiplying the first inequality with α and second inequality with 1 α, and adding them together and using α x 2 + (1 α) y 2 α(1 α) x y 2, yields [7]. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 30 / 47

Lipschitz Continuous Gradient Proof (Cont d) Class B [5] f (y) f (x) + f (x) T (y x) + 1 2L f (y) f (x) 2. [7] f (αx + (1 α)y) αf (x) + (1 α)f (y) [7] = [5]: Interchanging x and y in [7] and re-writing it as f (y) f (x) + Letting α 0, yields [5]. f (x + α(y x)) f (x) α α(1 α) f (x) f (y) 2. 2L + 1 α f (x) f (y) 2 2L Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 31 / 47

Lipschitz Continuous Gradient Relationship (Cont d) Lemma For a function f with a Lipschitz continuous gradient over R n, the following relations hold: [5] [7] = [6] = [0] = [1] [2] [3] [4] (5) If the function f is convex, then all the conditions [0]-[7] are equivalent. Note: From (5), we can see that conditions [0]-[7] can be grouped into four classes. Class A: [1]-[4] Class B: [5],[7] Class C: [6] Class D: [0] Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 32 / 47

Lipschitz Continuous Gradient Proof (Cont d) Class B ([5]) = Class C ([6]) = Class D ([0]) = Class A ([3]) [5] f (y) f (x) + f (x) T (y x) + 1 2L f (y) f (x) 2. [6] ( f (x) f (y)) T (x y) 1 f (x) f (y) 2 L [0] f (x) f (y) L x y [3] ( f (x) f (y)) T (x y) L x y 2 [5] = [6]: Interchanging x and y and adding together. [6] = [0]: It follows directly from Cauchy-Schwartz inequality. [0] = [3]: It following directly from Cauchy-Schwartz inequality. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 33 / 47

Lipschitz Continuous Gradient Proof (Cont d) Class A ([3]) = Class B ([5]) for convex f [3] ( f (x) f (y)) T (x y) L x y 2. [5] f (y) f (x) + f (x) T (y x) + 1 2L f (y) f (x) 2. [3] = [5]: Let φ x (z) := f (z) f (x) T z. Note that since f is convex, φ x (z) attains its minimum at z = x. By [3], we have which implies that ( φ x (z 1 ) φ x (z 2 ) T (z 1 z 2 ) L z 1 z 2 2 φ x (z) φ x (y) + φ x (y) T (z y) + L 2 z y 2. Taking minimization over z, yields [5]. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 34 / 47

Conjugate Function Outline 1 Main Result 2 Review of Convex Function 3 Strong Convexity Implications of Strong Convexity 4 Lipschitz Continuous Gradient 5 Conjugate Function Useful Results 6 Proof of Main Result Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 35 / 47

Conjugate Function (Conjugate) The conjugate of a function f is f (s) = sup (s T x f (x)) x domf Note: f is convex and closed (even f is not). (Subgradient) s is a subgradient of f at x if Note: f need not be convex. f (y) f (x) + s T (y x) for all y Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 37 / 47

Conjugate Function Useful Results Outline 1 Main Result 2 Review of Convex Function 3 Strong Convexity Implications of Strong Convexity 4 Lipschitz Continuous Gradient 5 Conjugate Function Useful Results 6 Proof of Main Result Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 38 / 47

Conjugate Function Useful Results Useful Results Lemma (Fenchel-Young s Equality) Consider the following conditions for a general f : [1] f (s) = s T x f (x) [2] s f (x) [3] x f (s) Then, we have [1] [2] = [3] Further, if f is closed and convex, then all these conditions are equivalent. Lemma (Differentiability) For a closed and strictly convex f, f (s) = argmax x (s T x f (x)). Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 39 / 47

Conjugate Function Useful Results Proof [1] f (s) = s T x f (x) [2] s f (x) [1] [2]: s f (x) f (y) f (x) + s T (y x) y s T x f (x) s T y f (y) y s T x f (x) f (s) Since f (s) s T x f (x) always holds, [1] [2]. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 40 / 47

Conjugate Function Useful Results Proof (Cont d) [2] s f (x) [3] x f (s) [2] = [3]: If [2] holds, then by previous result we have f (s) = s T x f (x). Thus, f (z) = sup(z T u f (u)) u z T x f (x) = s T x f (x) + x T (z s) = f (s) + x T (z s) This holds for all z, which implies x f (s). For a closed and convex function, we have [3] = [2]: This follows from f = f. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 41 / 47

Conjugate Function Useful Results Proof (Differentiability) Suppose x 1 f (s) and x 2 f (s). By closeness and convexity, we have [3] = [2], which gives s f x 1 and s f x 2 By previous result ([2] = [1]), we have x 1 = argmax(s T x f (x)) and x 2 = argmax(s T x f (x)) By strictly convexity, s T x f (x) has a unique maximizer for every s. Thus, x := x 1 = x 2 Thus f (s) = {x}, which implies f is differentiable at s and f (s) = argmax x (s T x f (x)). Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 42 / 47

Proof of Main Result Outline 1 Main Result 2 Review of Convex Function 3 Strong Convexity Implications of Strong Convexity 4 Lipschitz Continuous Gradient 5 Conjugate Function Useful Results 6 Proof of Main Result Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 43 / 47

Proof of Main Result Proof of Main Result Theorem (i) If f is closed and strong convex with parameter µ, then f has a Lipschitz continuous gradient with parameter 1 µ. (ii) If f is convex and has a Lipschitz continuous gradient with parameter L, then f is strong convex with parameter 1 L. Proof: (1): By implication of strong convexity, we have which implies s x s y µ x y s x f (x), s y f (y) s x s y µ f (s x ) f (s y ) Hence, f has a Lipschitz continuous gradient with 1 µ. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 44 / 47

Proof of Main Result Proof (Cont d) (2): By implication of Lipschitz continuous gradient for convex f, we have which implies ( f (x) f (y) T (x y) 1 f (x) f (y) 2 L (s x s y ) T (x y) 1 L (s x s y ) x f (s x ), y f (s y ) Hence, f is strongly convex with parameter 1 L. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 45 / 47

Proof of Main Result Summary A very simple proof of Fenchel duality between strong convexity and Lipschitz continuous gradient. A summary of conditions for strong convexity. A summary of conditions for Lipschitz continuous gradient Common tricks: Taking limit (α 0). Taking minimization over both sides. Interchanging x and y, and adding together. Applying mean-value theorem. Applying Cauchy-Schwartz inequality. Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 46 / 47

Proof of Main Result Thank you! Q & A Xingyu Zhou (OSU) Fenchel Duality December 5, 2017 47 / 47