Lecture: Smoothing
http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html
Acknowledgement: these slides are based on Prof. Lieven Vandenberghe's lecture notes
Outline
- introduction
- smoothing via conjugate
- examples
First-order convex optimization methods
complexity of finding an ε-suboptimal point of f(x):
- subgradient method: f nondifferentiable with Lipschitz constant G
  O((G/ε)²) iterations
- proximal gradient method: f = g + h, where h is a simple nondifferentiable function and g is differentiable with L-Lipschitz continuous gradient
  O(L/ε) iterations
- fast proximal gradient methods: O(√(L/ε)) iterations
Non-differentiable optimization by smoothing
for nondifferentiable f that cannot be handled by the proximal gradient method:
- replace f with a differentiable approximation f_µ (parametrized by µ)
- minimize f_µ by the (fast) gradient method
complexity: #iterations for the (fast) gradient method depends on L_µ/ε_µ
- L_µ is the Lipschitz constant of ∇f_µ
- ε_µ is the accuracy with which the smooth problem is solved
trade-off in the amount of smoothing (choice of µ):
- large L_µ (less smoothing) gives a more accurate approximation
- small L_µ (more smoothing) gives faster convergence
Example: Huber penalty as smoothed absolute value

    φ_µ(z) = z²/(2µ)      if |z| ≤ µ
           = |z| − µ/2    if |z| ≥ µ

µ controls accuracy and smoothness:
- accuracy: |z| − µ/2 ≤ φ_µ(z) ≤ |z|
- smoothness: φ_µ''(z) ≤ 1/µ
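The Huber penalty and its accuracy bound can be checked numerically; a minimal sketch using NumPy (the function name `huber` is ours, not from the slides):

```python
import numpy as np

def huber(z, mu):
    """Huber penalty: quadratic near 0, linear in the tails."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu), np.abs(z) - mu / 2)

# accuracy bound from the slide: |z| - mu/2 <= huber(z, mu) <= |z|
z = np.linspace(-3, 3, 101)
mu = 0.5
assert np.all(huber(z, mu) <= np.abs(z) + 1e-12)
assert np.all(np.abs(z) - mu / 2 <= huber(z, mu) + 1e-12)
```

The two branches meet at |z| = µ with matching value and derivative, which is what makes φ_µ continuously differentiable.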
Huber penalty approximation of 1-norm minimization

    f(x) = ||Ax − b||_1,    f_µ(x) = Σ_{i=1}^m φ_µ(a_i^T x − b_i)

accuracy: from f(x) − mµ/2 ≤ f_µ(x) ≤ f(x),

    f(x) − f* ≤ f_µ(x) − f_µ* + mµ/2

to achieve f(x) − f* ≤ ε, we need f_µ(x) − f_µ* ≤ ε_µ = ε − mµ/2
Lipschitz constant of ∇f_µ is L_µ = ||A||_2²/µ
complexity: for µ = ε/m, with ε_µ = ε − mµ/2 = ε/2,

    L_µ/ε_µ = ||A||_2² / (µ(ε − mµ/2)) = 2m||A||_2² / ε²

i.e., O(√(L_µ/ε_µ)) = O(1/ε) iteration complexity for the fast gradient method
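A hedged sketch of this scheme: plain gradient descent with step size 1/L_µ on the Huber-smoothed 1-norm objective, on small random data (all names and data are illustrative; the slides use a fast gradient method in place of the simple loop below):

```python
import numpy as np

def huber(z, mu):
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu), np.abs(z) - mu / 2)

def huber_grad(z, mu):
    # derivative of the Huber penalty: z/mu in the quadratic region, sign(z) outside
    return np.clip(z / mu, -1.0, 1.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
mu = 0.01

L = np.linalg.norm(A, 2) ** 2 / mu          # Lipschitz constant of grad f_mu
x = np.zeros(5)
for _ in range(2000):
    r = A @ x - b
    x -= (1 / L) * (A.T @ huber_grad(r, mu))

f_mu = huber(A @ x - b, mu).sum()
f = np.abs(A @ x - b).sum()
# accuracy bound from the slide: f(x) - m*mu/2 <= f_mu(x) <= f(x), here m = 20
assert f - 20 * mu / 2 <= f_mu <= f + 1e-9
```

Since the step size is 1/L_µ, the smoothed objective decreases monotonically from its value at x = 0.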
Outline
- introduction
- smoothing via conjugate
- examples
Minimum of strongly convex function
if x is a minimizer of a strongly convex function f, then it is unique and

    f(y) ≥ f(x) + (µ/2)||y − x||_2²    for all y ∈ dom f

(µ is the strong convexity constant of f)
proof: if some y does not satisfy the inequality, then for small θ > 0,

    f((1−θ)x + θy) ≤ (1−θ)f(x) + θf(y) − (µ/2)θ(1−θ)||y − x||_2²
                  = f(x) + θ(f(y) − f(x) − (µ/2)||y − x||_2²) + (µθ²/2)||x − y||_2²
                  < f(x)

contradicting the assumption that x is a minimizer
Conjugate of strongly convex function
suppose f is closed and strongly convex with constant µ, with conjugate

    f*(y) = sup_{x ∈ dom f} (y^T x − f(x))

- f* is defined and differentiable at all y, with gradient

    ∇f*(y) = argmax_x (y^T x − f(x))

- ∇f* is Lipschitz continuous with constant 1/µ:

    ||∇f*(u) − ∇f*(v)||_2 ≤ (1/µ)||u − v||_2
outline of proof
- y^T x − f(x) has a unique maximizer x_y for every y (follows from closedness and strong convexity of f(x) − y^T x)
- ∇f*(y) = x_y
- from strong convexity (with x_u = ∇f*(u), x_v = ∇f*(v)):

    f(x_u) − v^T x_u ≥ f(x_v) − v^T x_v + (µ/2)||x_u − x_v||_2²
    f(x_v) − u^T x_v ≥ f(x_u) − u^T x_u + (µ/2)||x_u − x_v||_2²

adding the left- and right-hand sides of the inequalities gives

    µ||x_u − x_v||_2² ≤ (x_u − x_v)^T (u − v)

by the Cauchy–Schwarz inequality, µ||x_u − x_v||_2 ≤ ||u − v||_2
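The 1/µ Lipschitz bound on ∇f* can be sanity-checked on a strongly convex quadratic f(x) = x^T P x / 2 with P ⪰ µI, where ∇f*(y) = P^{-1}y in closed form (an illustrative sketch, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
mu = 0.5
P = M @ M.T + mu * np.eye(4)       # P >= mu*I, so f(x) = x^T P x / 2 is mu-strongly convex

# gradient of the conjugate: argmax_x (y^T x - f(x)) solves P x = y
grad_conj = lambda y: np.linalg.solve(P, y)

u, v = rng.standard_normal(4), rng.standard_normal(4)
lhs = np.linalg.norm(grad_conj(u) - grad_conj(v))
rhs = np.linalg.norm(u - v) / mu
assert lhs <= rhs + 1e-12          # ||grad f*(u) - grad f*(v)|| <= ||u - v|| / mu
```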
Proximity function
d is a proximity function for a closed convex set C if
- d is continuous and strongly convex
- C ⊆ dom d
d(x) measures the distance of x to the center x_d = argmin_{x∈C} d(x) of C
normalization: we will assume the strong convexity constant is 1 and inf_{x∈C} d(x) = 0
for a normalized proximity function,

    d(x) ≥ (1/2)||x − x_d||_2²    for all x ∈ C
common proximity functions
- d(x) = ||x − u||_2²/2, with x_d = u ∈ C
- d(x) = Σ_{i=1}^n w_i (x_i − u_i)²/2 with w_i ≥ 1, x_d = u ∈ C
- d(x) = Σ_{i=1}^n x_i log x_i + log n for C = {x ⪰ 0, 1^T x = 1}, with x_d = (1/n)1
example (probability simplex): the entropy function above and d(x) = (1/2)||x − (1/n)1||_2²
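A small numerical check (illustrative, names ours) that the entropy proximity function is normalized: it vanishes at the center x_d = (1/n)1 and dominates (1/2)||x − x_d||_2² on the simplex:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
entropy_d = lambda x: np.sum(x * np.log(x)) + np.log(n)

for _ in range(100):
    x = rng.random(n) + 1e-9
    x /= x.sum()                  # random point in the probability simplex
    lower = 0.5 * np.linalg.norm(x - np.ones(n) / n) ** 2
    assert entropy_d(x) >= lower - 1e-12
assert abs(entropy_d(np.ones(n) / n)) < 1e-12   # equals 0 at the center
```

The lower bound is Pinsker's inequality for the KL divergence to the uniform distribution, combined with ||·||_2 ≤ ||·||_1.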
Smoothing via conjugate
conjugate (dual) representation: suppose f can be expressed as

    f(x) = sup_{y ∈ dom h} ((Ax + b)^T y − h(y)) = h*(Ax + b)

where h is closed and convex with bounded domain
smooth approximation: choose a proximity function d for C = cl dom h and take

    f_µ(x) = sup_{y ∈ dom h} ((Ax + b)^T y − h(y) − µd(y)) = (h + µd)*(Ax + b)

f_µ is differentiable because h + µd is strongly convex
Example: absolute value
conjugate representation:

    |x| = sup_{−1 ≤ y ≤ 1} xy = h*(x),    h(y) = I_{[−1,1]}(y)

proximity function: choosing d(y) = y²/2 gives the Huber penalty

    f_µ(x) = sup_{−1 ≤ y ≤ 1} (xy − µy²/2) = x²/(2µ)     if |x| ≤ µ
                                           = |x| − µ/2   if |x| > µ

proximity function: choosing d(y) = 1 − √(1 − y²) gives

    f_µ(x) = sup_{−1 ≤ y ≤ 1} (xy + µ√(1 − y²) − µ) = √(x² + µ²) − µ
another conjugate representation of |x|:

    |x| = sup_{y1+y2=1, y ⪰ 0} x(y1 − y2)

i.e., |x| = h*(Ax) for h = I_C, the indicator of C = {y ⪰ 0 : y1 + y2 = 1}, with A = [1  −1]^T
proximity function for C:

    d(y) = y1 log y1 + y2 log y2 + log 2

smooth approximation:

    f_µ(x) = sup_{y1+y2=1, y ⪰ 0} (xy1 − xy2 − µ(y1 log y1 + y2 log y2 + log 2))
           = µ log((e^{x/µ} + e^{−x/µ})/2)
comparison: three smooth approximations of the absolute value (figure)
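In place of the figure, the three approximations can be compared numerically; a sketch (function names are ours) checking that each lies below |x| and within µ of it on a grid:

```python
import numpy as np

mu = 0.25
huber = lambda x: np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)
sqrt_approx = lambda x: np.sqrt(x**2 + mu**2) - mu
logexp = lambda x: mu * np.log((np.exp(x / mu) + np.exp(-x / mu)) / 2)

x = np.linspace(-2, 2, 201)
for f in (huber, sqrt_approx, logexp):
    assert np.all(f(x) <= np.abs(x) + 1e-9)          # each lies below |x| ...
    assert np.all(np.abs(x) - f(x) <= mu + 1e-9)     # ... and within mu of it
```

The maximal gaps differ: µ/2 for the Huber penalty, µ for the square-root approximation (at x = 0), and µ log 2 for the log-sum-exp form; each equals µD for that choice of proximity function.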
Gradient of smooth approximation

    f_µ(x) = (h + µd)*(Ax + b) = sup_{y ∈ dom h} ((Ax + b)^T y − h(y) − µd(y))

from the properties of the conjugate of a strongly convex function (see above):
- f_µ is differentiable, with gradient

    ∇f_µ(x) = A^T argmax_{y ∈ dom h} ((Ax + b)^T y − h(y) − µd(y))

- ∇f_µ(x) is Lipschitz continuous with constant L_µ = ||A||_2²/µ
Accuracy of smooth approximation

    f(x) − µD ≤ f_µ(x) ≤ f(x),    D = sup_{y ∈ dom h} d(y)

note D < +∞ because dom h is bounded and dom h ⊆ dom d
the lower bound follows from

    f_µ(x) = sup_{y ∈ dom h} ((Ax + b)^T y − h(y) − µd(y))
           ≥ sup_{y ∈ dom h} ((Ax + b)^T y − h(y) − µD)
           = f(x) − µD

the upper bound follows from

    f_µ(x) ≤ sup_{y ∈ dom h} ((Ax + b)^T y − h(y)) = f(x)
Complexity
to find a solution of the nondifferentiable problem with accuracy f(x) − f* ≤ ε:
solve the smoothed problem with accuracy ε_µ = ε − µD, so that

    f(x) − f* ≤ f_µ(x) + µD − f_µ* ≤ ε_µ + µD = ε

Lipschitz constant of ∇f_µ is L_µ = ||A||_2²/µ
complexity: for µ = ε/(2D),

    L_µ/ε_µ = ||A||_2² / (µ(ε − µD)) = 4D||A||_2² / ε²

gives an O(1/ε) iteration bound for the fast gradient method
efficiency in practice can be improved by decreasing µ gradually
Outline
- introduction
- smoothing via conjugate
- examples
Piecewise-linear approximation
conjugate representation:

    f(x) = max_{i=1,...,m} (a_i^T x + b_i) = sup_{y ⪰ 0, 1^T y = 1} (Ax + b)^T y

proximity function:

    d(y) = Σ_{i=1}^m y_i log y_i + log m

smooth approximation:

    f_µ(x) = µ log ( Σ_{i=1}^m e^{(a_i^T x + b_i)/µ} ) − µ log m
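A minimal, numerically stable implementation of this log-sum-exp smoothing (function name ours), together with its accuracy bound max_i z_i − µ log m ≤ f_µ ≤ max_i z_i:

```python
import numpy as np

def smooth_max(z, mu):
    """mu * log(sum_i exp(z_i/mu)) - mu*log(m): smooth approximation of max(z)."""
    z = np.asarray(z, dtype=float)
    zmax = z.max()          # subtract the max before exponentiating, for stability
    return mu * np.log(np.exp((z - zmax) / mu).sum()) + zmax - mu * np.log(len(z))

z = np.array([1.0, -0.5, 0.9, 0.2])
mu = 0.1
assert z.max() - mu * np.log(len(z)) <= smooth_max(z, mu) <= z.max() + 1e-12
```

Here z_i plays the role of a_i^T x + b_i; composing with the affine map recovers f_µ(x).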
1-Norm approximation

    f(x) = ||Ax − b||_1

conjugate representation:

    f(x) = sup_{||y||_∞ ≤ 1} (Ax − b)^T y

proximity function:

    d(y) = (1/2) Σ_i w_i y_i²    (with w_i ≥ 1)

smooth approximation: the Huber approximation

    f_µ(x) = Σ_{i=1}^m φ_{µw_i}(a_i^T x − b_i)
Maximum eigenvalue
conjugate representation: for X ∈ S^n,

    f(X) = λ_max(X) = sup_{Y ⪰ 0, tr Y = 1} tr(XY)

proximity function: negative matrix entropy

    d(Y) = Σ_{i=1}^n λ_i(Y) log λ_i(Y) + log n

smooth approximation:

    f_µ(X) = sup_{Y ⪰ 0, tr Y = 1} (tr(XY) − µd(Y))
           = µ log ( Σ_{i=1}^n e^{λ_i(X)/µ} ) − µ log n
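The same log-sum-exp form applied to the eigenvalues can be sketched with NumPy's symmetric eigensolver (illustrative; the bound λ_max − µ log n ≤ f_µ ≤ λ_max follows as on the previous slide):

```python
import numpy as np

def smooth_lambda_max(X, mu):
    """mu*log(sum_i exp(lam_i/mu)) - mu*log(n) for symmetric X."""
    lam = np.linalg.eigvalsh(X)
    lmax = lam.max()        # shift by the largest eigenvalue for numerical stability
    return mu * np.log(np.exp((lam - lmax) / mu).sum()) + lmax - mu * np.log(len(lam))

rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
X = (M + M.T) / 2           # random symmetric test matrix
mu = 0.05
lmax = np.linalg.eigvalsh(X).max()
assert lmax - mu * np.log(6) <= smooth_lambda_max(X, mu) <= lmax + 1e-10
```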
Nuclear norm
nuclear norm: f(X) = ||X||_* is the sum of the singular values of X ∈ R^{m×n}
conjugate representation:

    f(X) = sup_{||Y||_2 ≤ 1} tr(X^T Y)

proximity function: d(Y) = (1/2)||Y||_F²
smooth approximation:

    f_µ(X) = sup_{||Y||_2 ≤ 1} (tr(X^T Y) − µd(Y)) = Σ_i φ_µ(σ_i(X))

the sum of the Huber penalties applied to the singular values of X
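A sketch of the smoothed nuclear norm via the SVD (function names ours). Each singular value satisfies σ_i − µ/2 ≤ φ_µ(σ_i) ≤ σ_i, so f_µ differs from ||X||_* by at most rµ/2 with r = min(m, n):

```python
import numpy as np

def huber(z, mu):
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu), np.abs(z) - mu / 2)

def smooth_nuclear(X, mu):
    """Sum of Huber penalties of the singular values of X."""
    sigma = np.linalg.svd(X, compute_uv=False)
    return huber(sigma, mu).sum()

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 3))
mu = 0.1
nuc = np.linalg.svd(X, compute_uv=False).sum()
# accuracy: ||X||_* - r*mu/2 <= f_mu(X) <= ||X||_*, with r = min(5, 3) = 3
assert nuc - 3 * mu / 2 <= smooth_nuclear(X, mu) <= nuc + 1e-10
```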
Lagrange dual function

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0, i = 1,...,m
                x ∈ C

with f_i(x) convex, C closed and bounded
smooth approximation of the dual function: choose a proximity function d for C and take

    g_µ(λ) = inf_{x ∈ C} (f_0(x) + Σ_{i=1}^m λ_i f_i(x) + µd(x))

g_µ is the dual function of the perturbed primal problem

    minimize    f_0(x) + µd(x)
    subject to  f_i(x) ≤ 0, i = 1,...,m
                x ∈ C
References
- D. Bertsekas, Nonlinear Programming (1995), §5.4.5
- Yu. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming (2005)