Operators. Pontus Giselsson


1  Operators, Pontus Giselsson

2  Today's lecture
- forward step operators
- gradient step operators (a subclass of forward step operators)
- resolvents
- proximal operators (a subclass of resolvents)
- reflected resolvents
- reflected proximal operators (or proximal reflectors)

3  Forward step operator
- suppose that T : R^n → R^n is single-valued
- the forward step operator is (Id − γT)

4  Cocoercivity
- suppose that T is 1/β-cocoercive with β = 1/2; then Id − γT with γ = 3 is α-averaged, decide α:
  [figure: cocoercivity and averagedness discs for γT and Id − γT]
- T is 2-cocoercive ⇒ γT is 2/3-cocoercive ⇒ (Id − γT) is 3/4-averaged
- generally: suppose γ ∈ (0, 2/β); then 1/β-cocoercivity of T ⇒ γβ/2-averagedness of (Id − γT)

5  Iterating the forward step operator
- since 1/β-cocoercivity of T ⇒ γβ/2-averagedness of (Id − γT),
  iterating x_{k+1} = (Id − γT)x_k converges to a fixed point (if one exists), provided γ ∈ (0, 2/β)
  (a small numerical sketch follows below)
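
To make the iteration concrete, here is a minimal sketch (an assumed example, not from the slides): T(x) = Ax − b with A symmetric positive definite is the gradient of a convex quadratic, hence 1/β-cocoercive with β = λ_max(A), and the forward-step iteration with γ ∈ (0, 2/β) drives T(x_k) to zero.

```python
# A minimal sketch (assumed example, not from the slides): iterate the forward
# step x_{k+1} = (Id - gamma*T) x_k for a cocoercive T.
# Here T(x) = A x - b with A symmetric positive definite, so T = grad f for the
# convex quadratic f(x) = 0.5 x'Ax - b'x, and T is 1/beta-cocoercive with
# beta = lambda_max(A).
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M.T @ M + np.eye(5)              # symmetric positive definite
b = rng.standard_normal(5)

beta = np.linalg.eigvalsh(A).max()   # smoothness / Lipschitz constant of T
gamma = 1.0 / beta                   # any gamma in (0, 2/beta) works

T = lambda x: A @ x - b
x = np.zeros(5)
for _ in range(500):
    x = x - gamma * T(x)             # forward step (Id - gamma*T)

print(np.linalg.norm(T(x)))          # close to 0: x approaches a zero of T, i.e., a fixed point of Id - gamma*T
```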

6  Lipschitz continuity
- suppose that T is 1/2-Lipschitz and γ = 1; motivate that Id − γT is not averaged
  [figure: γT is 1/2-Lipschitz, so Id − γT can be 1.5-Lipschitz]
- cannot make Id − γT nonexpansive independently of γ
- iterating the forward step of a merely Lipschitz T is not guaranteed to converge

7  Gradient step operator
- suppose that f : R^n → R is differentiable
- the forward step becomes the gradient step operator of f: Id − γ∇f
- if f is (proper closed and) convex, then
  1/β-cocoercivity of ∇f ⇔ β-Lipschitz continuity of ∇f ⇔ β-smoothness of f
- if f is β-smooth, the gradient method x_{k+1} = (Id − γ∇f)x_k converges for γ ∈ (0, 2/β)
  (since ∇f is 1/β-cocoercive)

8  Stronger properties
- assume f is β-smooth and σ-strongly convex
- then γf is γβ-smooth and γσ-strongly convex
- then γf − (γσ/2)‖·‖^2 is γ(β − σ)-smooth, or γ∇f − γσ Id is 1/(γ(β − σ))-cocoercive
- (Id − γ∇f) is δ-Lipschitz, decide δ:
  [figure: Lipschitz discs with radii γβ − 1 and 1 − γσ for γ∇f − γσ Id, γ∇f, and Id − γ∇f]
- (Id − γ∇f) is max(γβ − 1, 1 − γσ)-Lipschitz
- contractive if 1 − γσ < 1 and γβ − 1 < 1, i.e., γ ∈ (0, 2/β)
- the gradient method x_{k+1} = (Id − γ∇f)x_k then converges linearly
- optimal γ (centered circle in the figure) given by γ = 2/(β + σ), which gives δ = (β/σ − 1)/(β/σ + 1)

9  Summary: Gradient step operator
- σ > 0: Id − γ∇f with γ ∈ (0, 2/β) is contractive;
  optimal γ = 2/(β + σ) gives factor (= γβ − 1 = 1 − γσ) equal to (β/σ − 1)/(β/σ + 1)
- σ = 0: Id − γ∇f with γ = 2α/β, α ∈ (0, 1) (so that 1 − γβ = 1 − 2α): (Id − γ∇f) is α-averaged
  [figure: discs with radii |1 − γβ| and |1 − γσ|]
  (a numerical check of the contraction factor follows below)
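
As a numerical illustration of the summary (an assumed example, not from the slides), the sketch below checks the contraction factor of Id − γ∇f for a diagonal quadratic with eigenvalues in [σ, β]: the Lipschitz constant of I − γA is the maximum of |1 − γλ_i| over the eigenvalues, and γ = 2/(β + σ) gives the factor (β/σ − 1)/(β/σ + 1).

```python
# A minimal sketch (assumed example): for the quadratic f(x) = 0.5 x'Ax with the
# eigenvalues of A in [sigma, beta], the gradient step Id - gamma*grad f = I - gamma*A
# has Lipschitz constant max_i |1 - gamma*lambda_i|; the step gamma = 2/(beta + sigma)
# gives the factor (beta/sigma - 1)/(beta/sigma + 1) from the slide.
import numpy as np

sigma, beta = 1.0, 10.0
lam = np.linspace(sigma, beta, 6)              # eigenvalues of a diagonal A

def contraction_factor(gamma):
    # Lipschitz constant of I - gamma*A for A = diag(lam)
    return np.abs(1.0 - gamma * lam).max()

gamma_opt = 2.0 / (beta + sigma)
print(contraction_factor(gamma_opt))           # 0.8181... = 9/11
print((beta / sigma - 1) / (beta / sigma + 1)) # same value
```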

10  Resolvent
- the resolvent J_A : D → R^n of a monotone operator A is defined as J_A = (Id + A)^{-1}
- due to Minty, if A is maximally monotone, then D = R^n
  (dom((Id + A)^{-1}) = ran(Id + A) = R^n iff A is maximally monotone)
- this is important for algorithms involving the resolvent
- we will consider resolvents of maximally monotone operators

11  Properties of resolvent
- assume A is σ-strongly monotone (σ = 0 implies monotone)
- Id + A is (1 + σ)-strongly monotone:
  ⟨Ax − Ay + (x − y), x − y⟩ ≥ σ‖x − y‖^2 + ‖x − y‖^2 = (1 + σ)‖x − y‖^2
- properties of J_A = (Id + A)^{-1}? J_A = (Id + A)^{-1} is (1 + σ)-cocoercive
  [figure: J_A x − J_A y confined to a disc determined by (1 + σ)-cocoercivity]
- σ = 0: J_A is 1/2-averaged (or 1-cocoercive or firmly nonexpansive)
  (a numerical check follows below)
- σ > 0: J_A is 1/(1 + σ)-contractive
  (iterating the resolvent converges to a fixed point, if one exists)
- note: the resolvent is single-valued
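
A small numerical check of these properties (an assumed example, not from the slides): for the linear maximally monotone operator A(x) = Mx, where the symmetric part of M is positive semidefinite, the resolvent is the matrix (I + M)^{-1}, and firm nonexpansiveness ‖J_A x − J_A y‖^2 ≤ ⟨J_A x − J_A y, x − y⟩ can be verified directly.

```python
# A minimal sketch (assumed example): resolvent of the linear maximally monotone
# operator A(x) = M x, where the symmetric part of M is positive semidefinite.
# The resolvent is the matrix (I + M)^{-1}; firm nonexpansiveness
# ||J_A x - J_A y||^2 <= <J_A x - J_A y, x - y> is checked numerically.
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 4)); S = S.T @ S    # positive semidefinite symmetric part
K = rng.standard_normal((4, 4)); K = K - K.T    # skew-symmetric part (monotone, not a gradient)
M = S + K

J = np.linalg.inv(np.eye(4) + M)                # resolvent J_A = (I + M)^{-1}

x, y = rng.standard_normal(4), rng.standard_normal(4)
d = J @ x - J @ y
print(d @ d <= d @ (x - y) + 1e-12)             # True: firm nonexpansiveness (1-cocoercivity)
```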

12  Lipschitz continuity
- assume A is β-Lipschitz continuous; then (besides J_A being 1-cocoercive)
  2⟨J_A x − J_A y, x − y⟩ ≥ ‖x − y‖^2 + (1 − β^2)‖J_A x − J_A y‖^2
- proof sketch: A + βId is 1/(2β)-cocoercive
  [figure: dotted: Ax − Ay; gray: (βId + A)x − (βId + A)y]
- using βId = Id + (β − 1)Id, the definition of a cocoercive operator, and the definition of
  the inverse gives the result

13  Graphical representation
- assume A is 1-Lipschitz continuous
- then (besides being 1-cocoercive) J_A is 1/2-strongly monotone
  [figure: 1-cocoercivity: J_A x − J_A y in the dotted region;
   1/2-strong monotonicity: J_A x − J_A y to the right of the dashed line]

14  Lipschitz continuity and strong monotonicity
- let A be 1-Lipschitz and σ-strongly monotone (with 0 ≤ σ < 1)
- σ-strong monotonicity of A ⇒ (1 + σ)-cocoercivity of J_A
- 1-Lipschitz continuity of A ⇒ 1/2-strong monotonicity of J_A
- intersect the regions to find the region where both properties are present
  [figure: J_A x − J_A y ends up in the gray region (σ = 1/9 and β = 1 in the figure)]

15  Proximal operators
- assume f is proper closed and convex; then ∂f is maximally monotone
- let A = ∂f, then
  J_A(z) = argmin_x { f(x) + (1/2)‖x − z‖^2 } =: prox_f(z)
  where prox_f is called the prox (proximal) operator
- proof: x = prox_f(z) if and only if
  0 ∈ ∂f(x) + x − z  ⇔  z ∈ (Id + ∂f)(x)  ⇔  x = (Id + ∂f)^{-1}(z)
  (a numerical sketch follows below)
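
The sketch below illustrates the argmin definition (an assumed example, not from the slides): for the convex quadratic f(x) = (1/2)xᵀQx, the prox has the closed form (I + Q)^{-1}z = J_{∇f}(z), which is compared against a generic numerical minimization of f(x) + (1/2)‖x − z‖^2 using scipy.optimize.minimize.

```python
# A minimal sketch (assumed example): the prox as the argmin above, for the
# convex quadratic f(x) = 0.5 x'Qx. The closed form prox_f(z) = (I + Q)^{-1} z
# equals the resolvent J_{grad f}(z); it is compared with a generic numerical
# minimization of f(x) + 0.5*||x - z||^2.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
Q = rng.standard_normal((3, 3)); Q = Q.T @ Q        # f(x) = 0.5 x'Qx, convex
z = rng.standard_normal(3)

prox_closed = np.linalg.solve(np.eye(3) + Q, z)     # (I + Q)^{-1} z = J_{grad f}(z)

obj = lambda x: 0.5 * x @ Q @ x + 0.5 * np.sum((x - z) ** 2)
prox_numeric = minimize(obj, np.zeros(3), tol=1e-12).x

print(np.max(np.abs(prox_closed - prox_numeric)))   # tiny: the two computations agree
```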

16  Proximal operator characterization
- the proximal operator satisfies prox_f = ∇h^* where h = f + (1/2)‖·‖^2
- why? h is proper closed and convex, and ∂h = ∂f + Id
- therefore ∇h^* = (∂h)^{-1} = (∂f + Id)^{-1} = J_{∂f}
- can this be used to derive tighter properties of J_{∂f}?

17  Proximal operator properties
- we have prox_f(z) = ∇h^*(z) where h = f + (1/2)‖·‖^2
- recall the equivalent dual properties:
  (i) f is σ-strongly convex
  (ii) ∂f is σ-strongly monotone
  (iii) ∇f^* is σ-cocoercive
  (iv) ∇f^* is 1/σ-Lipschitz continuous
  (v) f^* is 1/σ-smooth
- assume f is σ-strongly convex ⇒ h is (1 + σ)-strongly convex
  ⇒ ∇h^* = prox_f is (1 + σ)-cocoercive (same as in the general case)

18  Lipschitz continuity
- let h = f + (1/2)‖·‖^2, i.e., prox_f = ∇h^*
- assume that f is β-smooth and σ-strongly convex, 0 ≤ σ ≤ β
- then h is (β + 1)-smooth and (σ + 1)-strongly convex
- therefore h^* is 1/(1 + β)-strongly convex and 1/(1 + σ)-smooth,
  and h^* − 1/(2(1 + β))‖·‖^2 is (1/(1 + σ) − 1/(1 + β))-smooth
- finally ∇h^* − 1/(1 + β) Id is (1/(1 + σ) − 1/(1 + β))^{-1}-cocoercive (if β > σ)
  [figure: ∇h^* − 1/(1 + β) Id = prox_f − 1/(1 + β) Id inside the dashed circle;
   ∇h^* = prox_f in the gray area (shift by 1/(1 + β) Id); the figure has β = 17/3 and σ = 3/17]

19  Comparison
- assume A is a maximally monotone operator and that f is proper closed and convex (PCC)
- assume that A and ∇f are 1-Lipschitz
- J_A and prox_f end up in the darker and lighter gray areas, respectively
  [figure: regions for J_A x − J_A y and prox_f x − prox_f y]
- conclusion: under Lipschitz assumptions, resolvents of subdifferentials are confined to smaller regions

20  Proximal operator for separable functions
- consider a separable function g(x) = Σ_{i=1}^n g_i(x_i)
- the prox is also separable:
  prox_g(z) = argmin_x { g(x) + (1/2)‖x − z‖^2 }
            = argmin_x { Σ_{i=1}^n ( g_i(x_i) + (1/2)(x_i − z_i)^2 ) }
            = ( argmin_{x_1} { g_1(x_1) + (1/2)(x_1 − z_1)^2 }, ..., argmin_{x_n} { g_n(x_n) + (1/2)(x_n − z_n)^2 } )
- cheap evaluation; good to have in algorithms (see the sketch below)
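
A standard concrete instance (assumed here as an example, not taken from the slides): for g(x) = γ‖x‖_1 the scalar subproblems are solved in closed form by soft-thresholding, so the prox is evaluated componentwise.

```python
# A minimal sketch (assumed example, not from the slides): the prox of the
# separable function g(x) = gamma*||x||_1 = gamma * sum_i |x_i| splits into n
# scalar problems, each solved in closed form by soft-thresholding.
import numpy as np

def prox_l1(z, gamma):
    # componentwise: argmin_{x_i} gamma*|x_i| + 0.5*(x_i - z_i)^2
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

z = np.array([2.0, -0.3, 0.7, -1.5])
print(prox_l1(z, 1.0))    # approximately [ 1.  0.  0. -0.5]
```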

21  Separability and compositions
- assume that g is separable, i.e., g(x) = Σ_{i=1}^n g_i(x_i)
- let h = g ∘ L where L is an arbitrary linear operator
- the prox becomes
  prox_h(z) = argmin_x { h(x) + (1/2)‖x − z‖^2 } = argmin_x { g(Lx) + (1/2)‖x − z‖^2 }
- separability is lost in general

22  Moreau's identity
- the following relation holds between the prox of f and of f^*:
  prox_f + prox_{f^*} = Id
- when f is scaled by γ, we have
  prox_{γf} + prox_{(γf)^*} = prox_{γf} + γ prox_{γ^{-1}f^*} ∘ (γ^{-1} Id) = Id
- when f is composed with L, we have
  prox_{γ(f∘L)}(z) = z − γL^T μ  where  μ ∈ Argmin_μ { f^*(μ) + (γ/2)‖L^T μ − γ^{-1} z‖^2 }
  (assuming that the Argmin is nonempty)
- these identities are very useful! (a small numerical check follows below)
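
A quick numerical check of the first identity (an assumed example, not from the slides): with f = ‖·‖_1, prox_f is soft-thresholding and f^* is the indicator of the unit ℓ∞-ball, so prox_{f^*} is componentwise clipping to [−1, 1]; their sum reproduces the input.

```python
# A minimal sketch (assumed example) of Moreau's identity prox_f + prox_{f*} = Id
# with f = ||.||_1: f* is the indicator of the unit l_inf-ball, so prox_{f*} is
# the componentwise projection (clip) onto [-1, 1].
import numpy as np

soft = lambda z: np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)   # prox_f, f = ||.||_1
clip = lambda z: np.clip(z, -1.0, 1.0)                           # prox_{f*}

z = np.array([3.0, -0.4, 0.9, -2.2])
print(np.allclose(soft(z) + clip(z), z))                         # True
```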

23  Reflected resolvent
- the reflected resolvent R_A of a monotone operator A is defined as
  R_A := 2J_A − Id
- it gives the reflection of x through J_A x (hence the name)
  [figure: x, J_A x, and the reflected point R_A x]
- if A = ∂f then the reflected proximal operator is given by
  R_{∂f} = 2 prox_f − Id (sometimes denoted rprox_f or R_f)

24  Properties of reflected resolvent
- in the general case (A monotone), the reflected resolvent R_A is β-Lipschitz; what is β?
- β = 1, i.e., R_A is nonexpansive
- proof:
  1. J_A x − J_A y lies within the dashed region (since J_A is 1-cocoercive in the general case)
  2. 2J_A x − 2J_A y lies within the dotted region (multiply by 2)
  3. (2J_A − Id)x − (2J_A − Id)y = (2J_A x − 2J_A y) − (x − y) lies in the gray area (shift by −Id)
  [figure: the dashed, dotted, and gray regions]

25  Further properties of reflected resolvent
- properties of R_A are obtained by multiplying the resolvent (J_A) region by 2 (radially)
  and shifting by −Id
- examples: A = ∇f with f β-smooth and σ-strongly convex
  [three figures: β = ∞, σ = 1/4;  β = 4, σ = 0;  β = 4, σ = 1/4]
- left: negatively averaged, middle: averaged, right: contractive
  (fairly easy to visualize, can be harder to prove)

26  More properties of reflected proximal operator
- assume ∇f is σ-strongly monotone and β-Lipschitz
- then prox_{γf} − 1/(1 + γβ) Id is (1/(1 + γσ) − 1/(1 + γβ))^{-1}-cocoercive (if β > σ)
- it can be shown that R_{γf} is max( (1 − γσ)/(1 + γσ), (γβ − 1)/(1 + γβ) )-contractive
- the contraction factor for R_{γf} is optimized for γ = 1/√(βσ)
  (which gives a contraction factor of (√(β/σ) − 1)/(√(β/σ) + 1))

27  A reflected resolvent identity
- assume that A is a maximally monotone operator and γ ∈ (0, ∞)
- then R_{γA}(Id + γA) = Id − γA
- proof:
  R_{γA}(Id + γA) = 2(Id + γA)^{-1}(Id + γA) − (Id + γA) = 2Id − (Id + γA) = Id − γA
  where the second step holds since (Id + γA)^{-1} = J_{γA} is single-valued
  (a numerical check follows below)
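
The identity is easy to verify numerically (an assumed example, not from the slides): for the linear maximally monotone operator A(x) = Mx both sides can be evaluated directly.

```python
# A minimal sketch (assumed example) of the identity
# R_{gamma A}(Id + gamma A) = Id - gamma A
# for the linear maximally monotone operator A(x) = M x.
import numpy as np

rng = np.random.default_rng(3)
S = rng.standard_normal((4, 4)); S = S.T @ S     # positive semidefinite symmetric part
K = rng.standard_normal((4, 4)); K = K - K.T     # skew-symmetric part
M = S + K
gamma = 0.7

I = np.eye(4)
J = np.linalg.inv(I + gamma * M)                 # resolvent J_{gamma A}
R = 2 * J - I                                    # reflected resolvent R_{gamma A}

x = rng.standard_normal(4)
lhs = R @ (x + gamma * (M @ x))                  # R_{gamma A}((Id + gamma A) x)
rhs = x - gamma * (M @ x)                        # (Id - gamma A) x
print(np.allclose(lhs, rhs))                     # True
```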
