Computational Statistics and Optimisation. Joseph Salmon Télécom Paristech, Institut Mines-Télécom
2 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
3 Fenchel duality for a stopping criterion
Let $F$ be the objective function; fix a small $\varepsilon > 0$ and stop when
$$\frac{F(\theta^{t+1}) - F(\theta^t)}{F(\theta^t)} \le \varepsilon \quad \text{or} \quad \|\nabla F(\theta^t)\| \le \varepsilon$$
Alternative: leverage the duality gap.
Notation: $F(\theta) = f(X\theta) + g(\theta)$ with $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^p \to \mathbb{R}$ and $X$ an $n \times p$ matrix.
Fenchel duality: consider the problem $\min_\theta F(\theta)$; then the following holds:
$$\sup_u \bigl\{ -f^*(u) - g^*(-X^\top u) \bigr\} \;\le\; \inf_\theta \bigl\{ f(X\theta) + g(\theta) \bigr\}$$
Moreover, if $f$ and $g$ are convex then, under mild assumptions, equality of both sides holds (strong duality, no duality gap).
proof: use the Fenchel–Young inequality
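The blackboard step announced above can be sketched as follows, using only the Fenchel–Young inequality $f(z) + f^*(u) \ge \langle u, z\rangle$ (valid for any $z$, $u$):

```latex
% Fenchel--Young applied twice, once to f at z = X\theta and once to g at \theta:
f(X\theta) + f^*(u) \ge \langle u, X\theta\rangle,
\qquad
g(\theta) + g^*(-X^\top u) \ge \langle -X^\top u, \theta\rangle .
% Summing the two inequalities and using
% \langle u, X\theta\rangle = \langle X^\top u, \theta\rangle,
% the right-hand sides cancel:
f(X\theta) + g(\theta) \;\ge\; -f^*(u) - g^*(-X^\top u)
\quad \text{for all } \theta \in \mathbb{R}^p,\; u \in \mathbb{R}^n .
```

Taking the infimum over $\theta$ on the left and the supremum over $u$ on the right gives the weak duality stated on the slide.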
4 Fenchel duality
Denote by:
- $\theta^*$: a primal optimal solution of $\inf_\theta \{f(X\theta) + g(\theta)\}$
- $u^*$: a dual solution of $\sup_u \{-f^*(u) - g^*(-X^\top u)\}$
Define the duality gap by:
$$\Delta(\theta, u) = F(\theta) + f^*(u) + g^*(-X^\top u)$$
Property of the duality gap: for all $u$,
$$\Delta(\theta, u) \ge F(\theta) - F(\theta^*) \ge 0$$
proof: Fenchel duality applied to a primal solution $\theta^*$.
Motivation: $\Delta(\theta, u) \le \varepsilon \Rightarrow F(\theta) - F(\theta^*) \le \varepsilon$.
5 Example: duality gap for the Lasso
Lasso objective: $F(\theta) = \frac{1}{2}\|X\theta - y\|_2^2 + \lambda\|\theta\|_1$
- $f(z) = \frac{1}{2}\|z - y\|_2^2$; $f^*(u) = \frac{1}{2}\|u\|_2^2 + \langle u, y\rangle$ (translation property)
- $g(\theta) = \lambda\|\theta\|_1$; $g^*(u) = \iota_{\{u : \|u\|_\infty \le \lambda\}}$ ($\ell_\infty$-ball indicator)
Duality gap:
$$\Delta(\theta, u) = F(\theta) + f^*(u) + g^*(-X^\top u) = F(\theta) + \frac{1}{2}\|u\|_2^2 + \langle u, y\rangle$$
as soon as $\|X^\top u\|_\infty \le \lambda$; otherwise the bound is $+\infty$, hence useless.
Rem: at optimal solutions and under mild assumptions, $\Delta(\theta^*, u^*) = 0$.
6 Example: duality gap for the Lasso (II)
Possible choice: $\theta^t$ (current iterate of any iterative algorithm), $r^t = X\theta^t - y$ (minus the current residuals), and
$$u^t = \mu_t r^t \quad \text{with} \quad \mu_t = \min\bigl(1, \lambda / \|X^\top r^t\|_\infty\bigr)$$
Motivation for this choice: at the optimum, $u^* = \nabla f(X\theta^*)$.
Stopping criterion:
$$\frac{1}{2}\|r^t\|_2^2 + \lambda\|\theta^t\|_1 + \frac{1}{2}\|u^t\|_2^2 + \langle u^t, y\rangle \le \varepsilon
\;\Leftrightarrow\;
\frac{1}{2}(1 + \mu_t^2)\|r^t\|_2^2 + \lambda\|\theta^t\|_1 + \mu_t\langle r^t, y\rangle \le \varepsilon$$
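The gap above fits in a few lines of code. The sketch below is mine (NumPy; the function name is not from the slides); it uses the dual point $u = \mu r$ with $\mu = \min(1, \lambda/\|X^\top r\|_\infty)$ exactly as chosen on the slide:

```python
import numpy as np

def lasso_duality_gap(X, y, theta, lam):
    """Duality gap for the Lasso at theta, with the dual-feasible
    scaling u = mu * r, mu = min(1, lam / ||X^T r||_inf)."""
    r = X @ theta - y                                   # r = X theta - y
    primal = 0.5 * r @ r + lam * np.abs(theta).sum()    # F(theta)
    mu = min(1.0, lam / max(np.abs(X.T @ r).max(), 1e-12))
    u = mu * r                                          # feasible: ||X^T u||_inf <= lam
    return primal + 0.5 * u @ u + u @ y                 # F(theta) + f*(u)
```

By construction the returned value upper-bounds $F(\theta) - F(\theta^*)$, so stopping once it drops below $\varepsilon$ is a valid criterion for any iterative Lasso solver.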
7 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
8 Convergence: Lipschitz gradient
$$\theta^{t+1} = \theta^t - \alpha \nabla f(\theta^t)$$
Convergence rate for a fixed step size.
Hypothesis: $f$ convex, differentiable with $L$-Lipschitz gradient: for all $\theta, \theta'$,
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\|\theta - \theta'\|$$
Result: for any minimizer $\theta^*$ of $f$, if $\alpha \le \frac{1}{L}$ then $\theta^t$ satisfies
$$f(\theta^T) - f(\theta^*) \le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha T}$$
Rem: if $f$ is twice differentiable, this corresponds to $\nabla^2 f(\theta) \preceq L \,\mathrm{Id}$.
Example: for $\theta \mapsto \frac{1}{2}\|X\theta - y\|_2^2$, $L = \lambda_{\max}(X^\top X)$ (spectral radius).
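To make the statement concrete, here is a minimal sketch (mine, not from the slides) of fixed-step gradient descent on the least-squares example, with the step $\alpha = 1/L$ and $L = \lambda_{\max}(X^\top X)$ from the slide:

```python
import numpy as np

def gradient_descent_ls(X, y, T, alpha=None):
    """Fixed-step gradient descent on f(theta) = 0.5 * ||X theta - y||^2.
    Default step alpha = 1/L with L = lambda_max(X^T X), as in the theorem."""
    L = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of grad f
    if alpha is None:
        alpha = 1.0 / L
    theta = np.zeros(X.shape[1])
    for _ in range(T):
        theta = theta - alpha * X.T @ (X @ theta - y)   # gradient step
    return theta
```

On a well-conditioned problem the iterates match the least-squares solution after a modest number of iterations, consistent with the $O(1/T)$ bound (and in fact faster, since this $f$ is strongly convex).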
9 Convergence: proof
Point 1: an $L$-Lipschitz gradient implies the quadratic upper bound: for all $x, y$,
$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2$$
Point 2: use the definition $\theta^{t+1} = \theta^t - \alpha \nabla f(\theta^t)$ and get
$$f(\theta^{t+1}) \le f(\theta^t) - \Bigl(1 - \frac{L\alpha}{2}\Bigr)\alpha \|\nabla f(\theta^t)\|^2$$
Point 3: use convexity, $0 < \alpha \le \frac{1}{L}$ and $ab = \bigl(a^2 + b^2 - (a - b)^2\bigr)/2$:
$$f(\theta^{t+1}) \le f(\theta^*) + \nabla f(\theta^t)^\top(\theta^t - \theta^*) - \frac{\alpha}{2}\|\nabla f(\theta^t)\|^2
= f(\theta^*) + \frac{1}{2\alpha}\bigl(\|\theta^t - \theta^*\|^2 - \|\theta^{t+1} - \theta^*\|^2\bigr)$$
10 Convergence proof (bis)
Point 4: telescoping sums:
$$\frac{1}{T}\sum_{t=0}^{T-1}\bigl(f(\theta^{t+1}) - f(\theta^*)\bigr)
\le \frac{1}{T}\cdot\frac{1}{2\alpha}\bigl(\|\theta^0 - \theta^*\|^2 - \|\theta^T - \theta^*\|^2\bigr)
\le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha T}$$
From Point 2, $f(\theta^{t+1}) \le f(\theta^t)$, hence
$$f(\theta^T) - f(\theta^*) \le \frac{1}{T}\sum_{t=0}^{T-1}\bigl(f(\theta^{t+1}) - f(\theta^*)\bigr) \le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha T}$$
11 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
12 Composite minimization
One aims at minimizing $F = f + g$ where:
- $f$ is smooth: $\nabla f$ is $L$-Lipschitz
- $g$ is proximable (prox-capable):
$$\mathrm{prox}_{\alpha g}(y) = \arg\min_{z \in \mathbb{R}^p} \Bigl(\frac{1}{2}\|z - y\|_2^2 + \alpha g(z)\Bigr)$$
is efficiently computable.
Examples: projection onto a box constraint, (block) soft-thresholding operator, shrinkage operator (ridge).
13 Forward-backward algorithm
Notation: $\varphi_\alpha(\theta) := \mathrm{prox}_{\alpha g}\bigl(\theta - \alpha \nabla f(\theta)\bigr)$
Forward-backward algorithm
Input: initialization $\theta^0$, step size $\alpha$
Result: $\theta^T$
while not converged do
    $\theta^{t+1} = \varphi_\alpha(\theta^t)$
end
Rem:
$$\varphi_\alpha(\theta) = \arg\min_{\theta'} \Bigl(f(\theta) + \langle \nabla f(\theta), \theta' - \theta\rangle + \frac{1}{2\alpha}\|\theta' - \theta\|^2 + g(\theta')\Bigr)$$
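The iteration above can be sketched for the Lasso, where $g = \lambda\|\cdot\|_1$ and the prox is entrywise soft-thresholding (the code and names are mine, assuming the fixed step $\alpha = 1/L$ used in the analysis that follows):

```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau * |.| applied entrywise: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam, T):
    """Forward-backward (ISTA) for F = 0.5 ||X theta - y||^2 + lam ||theta||_1."""
    L = np.linalg.eigvalsh(X.T @ X).max()
    alpha = 1.0 / L
    theta = np.zeros(X.shape[1])
    for _ in range(T):
        grad = X.T @ (X @ theta - y)                               # forward step
        theta = soft_threshold(theta - alpha * grad, alpha * lam)  # backward (prox) step
    return theta
```

For $\lambda \ge \|X^\top y\|_\infty$ the iterates stay at $0$, which is then the exact minimizer; for smaller $\lambda$ the objective decreases monotonically, as the analysis below guarantees.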
14 Convergence: $f$ with Lipschitz gradient
$$\theta^{t+1} = \varphi_\alpha(\theta^t) = \mathrm{prox}_{\alpha g}\bigl(\theta^t - \alpha \nabla f(\theta^t)\bigr)$$
Convergence rate for a fixed step size.
Hypothesis: $f$ convex, differentiable with $L$-Lipschitz gradient: for all $\theta, \theta'$,
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\|\theta - \theta'\|$$
Result: for any minimizer $\theta^*$ of $F$, if $\alpha \le \frac{1}{L}$ then $\theta^t$ satisfies
$$F(\theta^T) - F(\theta^*) \le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha T}$$
Rem: same bound as in the case $g = 0$.
15 Proof: gradient
Point 1: for $\alpha \le 1/L$ and $\hat{x} = \varphi_\alpha(\bar{x})$, then for all $y$:
$$F(\hat{x}) + \frac{\|\hat{x} - y\|_2^2}{2\alpha} \le F(y) + \frac{\|\bar{x} - y\|_2^2}{2\alpha}$$
Proof: let
$$H_\alpha(y) = f(\bar{x}) + \langle \nabla f(\bar{x}), y - \bar{x}\rangle + \frac{1}{2\alpha}\|y - \bar{x}\|^2 + g(y)$$
- $H_\alpha$ is $\frac{1}{\alpha}$-strongly convex since $\alpha \le 1/L$ and $H_\alpha(\cdot) - \frac{1}{2\alpha}\|\cdot\|_2^2$ is convex (cf. for instance page 280 of Hiriart-Urruty and Lemaréchal (1993))
- $\hat{x} = \arg\min_y H_\alpha(y)$ (cf. two slides up)
By $\frac{1}{\alpha}$-strong convexity, $H_\alpha(\hat{x}) + \frac{1}{2\alpha}\|\hat{x} - y\|_2^2 \le H_\alpha(y)$.
16 Point 1 (continued)
Expanding $H_\alpha$ on both sides:
$$g(\hat{x}) + f(\bar{x}) + \langle \nabla f(\bar{x}), \hat{x} - \bar{x}\rangle + \frac{1}{2\alpha}\bigl(\|\hat{x} - \bar{x}\|^2 + \|\hat{x} - y\|^2\bigr)
\le g(y) + f(\bar{x}) + \langle \nabla f(\bar{x}), y - \bar{x}\rangle + \frac{1}{2\alpha}\|y - \bar{x}\|^2$$
By convexity of $f$:
$$f(\bar{x}) + \langle \nabla f(\bar{x}), y - \bar{x}\rangle \le f(y)$$
and by the choice $\alpha \le 1/L$, the following bound holds:
$$f(\hat{x}) \le f(\bar{x}) + \langle \nabla f(\bar{x}), \hat{x} - \bar{x}\rangle + \frac{1}{2\alpha}\|\hat{x} - \bar{x}\|^2$$
Hence Point 1:
$$F(\hat{x}) + \frac{1}{2\alpha}\|\hat{x} - y\|^2 \le F(y) + \frac{1}{2\alpha}\|y - \bar{x}\|^2$$
17 Theorem proof
Point 1 states $F(\hat{x}) + \frac{1}{2\alpha}\|\hat{x} - y\|^2 \le F(y) + \frac{1}{2\alpha}\|y - \bar{x}\|^2$, so choosing $y = \theta^*$ (any minimizer of $F$), $\bar{x} = \theta^t$ and $\hat{x} = \theta^{t+1}$:
$$F(\theta^{t+1}) + \frac{1}{2\alpha}\|\theta^{t+1} - \theta^*\|^2 \le F(\theta^*) + \frac{1}{2\alpha}\|\theta^* - \theta^t\|^2$$
This leads to Point 3 of the smooth case:
$$F(\theta^{t+1}) \le F(\theta^*) + \frac{1}{2\alpha}\bigl(\|\theta^t - \theta^*\|^2 - \|\theta^{t+1} - \theta^*\|^2\bigr)$$
and then the same telescoping argument leads to the bound.
18 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
19 Forward-backward accelerated algorithm
Notation: $\varphi_\alpha(\theta) := \mathrm{prox}_{\alpha g}\bigl(\theta - \alpha \nabla f(\theta)\bigr)$
Accelerated forward-backward algorithm
Input: initialization $\theta^0$, step size $\alpha$, a sequence $(\mu_t)_{t \in \mathbb{N}}$ satisfying $\mu_1 = 1$ and $\mu_{t+1}^2 - \mu_{t+1} \le \mu_t^2$
Result: $\theta^T$
while not converged do
    $\theta^{t+1} = \varphi_\alpha(z^t)$
    $z^{t+1} = \theta^{t+1} + \frac{\mu_t - 1}{\mu_{t+1}}\bigl(\theta^{t+1} - \theta^t\bigr)$
end
Examples of admissible sequences:
- $\mu_{t+1} = \sqrt{\mu_t^2 + 1/4} + 1/2$ (i.e., $\mu_{t+1}^2 - \mu_{t+1} = \mu_t^2$)
- $\mu_{t+1} = (t+1)/2$
- $\mu_{t+1} = (t + a - 1)/a$
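The accelerated scheme for the Lasso can be sketched as follows (my code, not from the slides), using the first admissible sequence above, $\mu_{t+1} = \sqrt{\mu_t^2 + 1/4} + 1/2$, written in the equivalent form $(1 + \sqrt{1 + 4\mu_t^2})/2$:

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista(X, y, lam, T):
    """Accelerated forward-backward (FISTA) for the Lasso with step 1/L."""
    L = np.linalg.eigvalsh(X.T @ X).max()
    alpha = 1.0 / L
    theta = np.zeros(X.shape[1])
    z = theta.copy()                 # extrapolated point z^t
    mu = 1.0                         # mu_1 = 1
    for _ in range(T):
        theta_new = soft_threshold(z - alpha * X.T @ (X @ z - y), alpha * lam)
        mu_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * mu * mu))
        z = theta_new + ((mu - 1.0) / mu_new) * (theta_new - theta)  # momentum
        theta, mu = theta_new, mu_new
    return theta
```

Compared to the ISTA sketch, the only changes are the auxiliary point $z^t$ and the momentum coefficient $(\mu_t - 1)/\mu_{t+1}$; note the iteration is not monotone in $F$, but the final accuracy matches the $O(1/T^2)$ bound.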
20 Convergence: Lipschitz gradient
$$\theta^{t+1} = \varphi_\alpha(z^t) = \mathrm{prox}_{\alpha g}\bigl(z^t - \alpha \nabla f(z^t)\bigr)$$
Convergence rate for a fixed step size.
Hypothesis: $f$ convex, differentiable with $L$-Lipschitz gradient: for all $\theta, \theta'$,
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\|\theta - \theta'\|$$
Result: for any minimizer $\theta^*$ of $F$, if $\alpha \le \frac{1}{L}$ then $\theta^t$ satisfies
$$F(\theta^T) - F(\theta^*) \le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha \mu_T^2}$$
Rem: for the common choices given above, $\mu_t$ grows linearly in $t$, hence a rate in $O(1/T^2)$.
Rem: define $F^* = F(\theta^*)$ for the proof.
21 Proof: rate for the Nesterov acceleration
Point 1 with $\hat{x} = \varphi_\alpha(z^t)$, $\bar{x} = z^t$, $y = (1 - 1/\mu_{t+1})\theta^t + (1/\mu_{t+1})\theta^*$:
$$F(\hat{x}) + \frac{\|\hat{x} - y\|_2^2}{2\alpha} \le F(y) + \frac{\|\bar{x} - y\|_2^2}{2\alpha}$$
With $u^{t+1} = \theta^t + \mu_{t+1}(\theta^{t+1} - \theta^t)$, a little algebra gives:
$$F(\theta^{t+1}) + \frac{\|u^{t+1} - \theta^*\|_2^2}{2\alpha\mu_{t+1}^2} \le F(y) + \frac{\|u^t - \theta^*\|_2^2}{2\alpha\mu_{t+1}^2}$$
$$F(\theta^{t+1}) - F^* - \Bigl(1 - \frac{1}{\mu_{t+1}}\Bigr)\bigl(F(\theta^t) - F^*\bigr)
\le \frac{\|u^t - \theta^*\|_2^2}{2\alpha\mu_{t+1}^2} - \frac{\|u^{t+1} - \theta^*\|_2^2}{2\alpha\mu_{t+1}^2}$$
$$\mu_{t+1}^2\,\Delta F^{t+1} - \bigl(\mu_{t+1}^2 - \mu_{t+1}\bigr)\,\Delta F^t
\le \frac{\|u^t - \theta^*\|_2^2}{2\alpha} - \frac{\|u^{t+1} - \theta^*\|_2^2}{2\alpha}$$
(by convexity of $F$, with $\Delta F^{t+1} := F(\theta^{t+1}) - F^*$)
22 Proof continued
Define $\rho_{t+1} := \mu_{t+1} - \mu_{t+1}^2 + \mu_t^2 \ge 0$, so
$$\mu_{t+1}^2\,\Delta F^{t+1} - \bigl(\mu_{t+1}^2 - \mu_{t+1}\bigr)\,\Delta F^t
= \mu_{t+1}^2\,\Delta F^{t+1} - \mu_t^2\,\Delta F^t + \rho_{t+1}\,\Delta F^t
\le \frac{\|u^t - \theta^*\|_2^2}{2\alpha} - \frac{\|u^{t+1} - \theta^*\|_2^2}{2\alpha}$$
Telescoping again (with the convention $\mu_0 = 0$ and $u^0 = \theta^0$):
$$\mu_T^2\,\Delta F^T + \sum_{t=0}^{T-1} \rho_{t+1}\,\Delta F^t
\le \frac{\|u^0 - \theta^*\|_2^2}{2\alpha} - \frac{\|u^T - \theta^*\|_2^2}{2\alpha}$$
$$\mu_T^2\,\Delta F^T \le \frac{\|u^0 - \theta^*\|_2^2}{2\alpha}$$
23 Convergence of the iterates
Very recent result: Chambolle and Dossal (2014).
Proof out of the scope of this course.
More reading on the previous themes:
- Nesterov (2004) for proofs, strong convexity, etc.
- Beck and Teboulle (2009) for the ISTA/FISTA analysis
- Chambolle and Dossal (2014) for FISTA with a larger choice of updating rules
24 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
25 Coordinate descent
Objective: solve $\arg\min_{\theta \in \mathbb{R}^p} f(\theta)$
Initialization: $\theta^{(0)}$, then:
$$\theta_1^{(k)} \in \arg\min_{\theta_1 \in \mathbb{R}} f\bigl(\theta_1, \theta_2^{(k-1)}, \theta_3^{(k-1)}, \dots, \theta_p^{(k-1)}\bigr)$$
$$\theta_2^{(k)} \in \arg\min_{\theta_2 \in \mathbb{R}} f\bigl(\theta_1^{(k)}, \theta_2, \theta_3^{(k-1)}, \dots, \theta_p^{(k-1)}\bigr)$$
$$\theta_3^{(k)} \in \arg\min_{\theta_3 \in \mathbb{R}} f\bigl(\theta_1^{(k)}, \theta_2^{(k)}, \theta_3, \dots, \theta_p^{(k-1)}\bigr)$$
$$\vdots$$
$$\theta_p^{(k)} \in \arg\min_{\theta_p \in \mathbb{R}} f\bigl(\theta_1^{(k)}, \theta_2^{(k)}, \theta_3^{(k)}, \dots, \theta_p\bigr)$$
Cycle over the coordinates.
26 Motivation
Convergence guarantees toward a minimum hold for smooth functions, cf. Tseng (2001).
27 Motivation
Convergence guarantees toward a minimum also hold for sums of a smooth function and a separable (possibly non-smooth) function, cf. Tseng (2001).
28 Motivation
NO CONVERGENCE toward a minimum for non-separable/non-smooth functions: some points get stuck.
32 Motivation
- Coordinate descent can be extremely fast
- Possibly visit the coordinates in an arbitrary order (cyclic, random, more refined methods, etc.)
- Possibly by blocks: update not only one coordinate but a bunch of them (optimize according to your architecture)
Exo: test, on the ridge regression problem, the relative performance of cyclic and random sampling of the coordinates.
33 CD for least squares
$$\arg\min_{\theta \in \mathbb{R}^p} f(\theta) \quad \text{for} \quad f(\theta) = \frac{1}{2}\|y - X\theta\|^2 = \frac{1}{2}\sum_{i=1}^n \Bigl(y_i - \sum_{j=1}^p \theta_j x_{ij}\Bigr)^2$$
Reminder:
$$\nabla f(\theta) = X^\top (X\theta - y), \quad \text{i.e.,} \quad \frac{\partial f}{\partial \theta_j}(\theta) = x_j^\top (X\theta - y)$$
Minimize w.r.t. $\theta_j$ with the $\theta_k$ ($k \ne j$) fixed:
$$0 = \frac{\partial f}{\partial \theta_j}(\theta) = x_j^\top (X\theta - y) = x_j^\top x_j\, \theta_j + x_j^\top\Bigl(\sum_{k \ne j} x_k \theta_k - y\Bigr)$$
$$\Leftrightarrow \theta_j = \frac{x_j^\top \bigl(y - \sum_{k \ne j} x_k \theta_k\bigr)}{\|x_j\|_2^2} = \frac{x_j^\top \bigl(y - \sum_{k=1}^p x_k \theta_k + x_j \theta_j\bigr)}{\|x_j\|_2^2}$$
34 CD for least squares II
Clever update scheme with low memory impact, by storing:
- the current residual $r^{(k)} = y - X\theta^{(k)}$ (size-$n$ vector)
- the current coefficients $\theta^{(k)}$ (size-$p$ vector)
Coordinate descent for least squares
Input: observations $y$, features $X = [x_1, \dots, x_p]$, initial $\theta^{(0)}$
Result: vector $\theta^{(k)}$
while not converged do
    Pick a coordinate $j$
    $r_{\mathrm{int}} \leftarrow r^{(k)} + x_j \theta_j^{(k)}$
    $\theta_j^{(k+1)} \leftarrow x_j^\top r_{\mathrm{int}} / \|x_j\|_2^2$
    $r^{(k+1)} = r_{\mathrm{int}} - x_j \theta_j^{(k+1)}$
end
Rem: computational simplification when $\|x_j\| = 1$ (NORMALIZE!)
Rem: the residual update can be done in place
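The residual-caching scheme can be sketched as follows (my NumPy code; variable names mirror the pseudo-code above, with the residual kept as $y - X\theta$ so each coordinate update costs $O(n)$):

```python
import numpy as np

def cd_least_squares(X, y, n_passes):
    """Cyclic coordinate descent for 0.5 * ||y - X theta||^2,
    keeping the residual r = y - X theta up to date."""
    n, p = X.shape
    theta = np.zeros(p)
    r = y.copy()                        # residual for theta = 0
    sq_norms = (X ** 2).sum(axis=0)     # ||x_j||^2, precomputed once
    for _ in range(n_passes):
        for j in range(p):
            r_int = r + X[:, j] * theta[j]          # partial residual
            theta[j] = X[:, j] @ r_int / sq_norms[j]
            r = r_int - X[:, j] * theta[j]          # residual update
    return theta
```

On a well-conditioned problem the iterates converge to the least-squares solution; with normalized columns the division disappears, which is the simplification noted on the slide.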
35 Ridge regression with coordinate descent
$$\arg\min_{\theta \in \mathbb{R}^p} f(\theta) \quad \text{for} \quad f(\theta) = \frac{1}{2}\sum_{i=1}^n \Bigl(y_i - \sum_{j=1}^p \theta_j x_{ij}\Bigr)^2 + \frac{\lambda}{2}\sum_{j=1}^p \theta_j^2$$
$$\nabla f(\theta) = X^\top (X\theta - y) + \lambda\theta, \quad \text{i.e.,} \quad \frac{\partial f}{\partial \theta_j}(\theta) = x_j^\top (X\theta - y) + \lambda\theta_j$$
Minimize w.r.t. $\theta_j$, fixing the $\theta_k$ ($k \ne j$):
$$0 = \frac{\partial f}{\partial \theta_j}(\theta) = x_j^\top (X\theta - y) + \lambda\theta_j = \bigl(x_j^\top x_j + \lambda\bigr)\theta_j + x_j^\top\Bigl(\sum_{k \ne j} x_k \theta_k - y\Bigr)$$
$$\Leftrightarrow \theta_j = \frac{x_j^\top \bigl(y - \sum_{k \ne j} x_k \theta_k\bigr)}{\|x_j\|_2^2 + \lambda}$$
36 Ridge regression with coordinate descent II
Clever update scheme with low memory impact, by storing:
- the current residual $r^{(k)} = y - X\theta^{(k)}$ (size-$n$ vector)
- the current coefficients $\theta^{(k)}$ (size-$p$ vector)
Coordinate descent for ridge regression
Input: observations $y$, features $X = [x_1, \dots, x_p]$, initial $\theta^{(0)}$
Result: vector $\theta^{(k)}$
while not converged do
    Pick a coordinate $j$
    $r_{\mathrm{int}} \leftarrow r^{(k)} + x_j \theta_j^{(k)}$
    $\theta_j^{(k+1)} \leftarrow x_j^\top r_{\mathrm{int}} / \bigl(\|x_j\|_2^2 + \lambda\bigr)$
    $r^{(k+1)} = r_{\mathrm{int}} - x_j \theta_j^{(k+1)}$
end
Rem: computational simplification when $\|x_j\| = 1$ (NORMALIZE!)
Rem: the residual update can be done in place
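Compared to the least-squares sketch, only the denominator of the coordinate update changes. A minimal version (my code; the reference solution used in the check below is the usual closed form $(X^\top X + \lambda I)^{-1}X^\top y$):

```python
import numpy as np

def cd_ridge(X, y, lam, n_passes):
    """Cyclic coordinate descent for 0.5*||y - X theta||^2 + 0.5*lam*||theta||^2."""
    n, p = X.shape
    theta = np.zeros(p)
    r = y.copy()                        # residual r = y - X theta
    sq_norms = (X ** 2).sum(axis=0)
    for _ in range(n_passes):
        for j in range(p):
            r_int = r + X[:, j] * theta[j]
            theta[j] = X[:, j] @ r_int / (sq_norms[j] + lam)   # ridge denominator
            r = r_int - X[:, j] * theta[j]
    return theta
```

Since the ridge objective is strongly convex, the cyclic sweeps converge linearly, which makes this a convenient test case for the cyclic-vs-random exercise mentioned earlier.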
37 Lasso with coordinate descent
$$\arg\min_{\theta \in \mathbb{R}^p} f(\theta) \quad \text{for} \quad f(\theta) = \frac{1}{2}\|y - X\theta\|^2 + \lambda\sum_{j=1}^p |\theta_j|$$
Minimize w.r.t. $\theta_j$, fixing the $\theta_k$ ($k \ne j$):
$$\hat\theta_j = \arg\min_{\theta_j \in \mathbb{R}} f(\theta_1, \dots, \theta_p)
= \arg\min_{\theta_j \in \mathbb{R}} \frac{1}{2}\Bigl\|y - \sum_{k \ne j}\theta_k x_k - x_j\theta_j\Bigr\|^2 + \lambda|\theta_j|$$
$$= \arg\min_{\theta_j \in \mathbb{R}} \frac{1}{2}\|x_j\|^2\theta_j^2 - \Bigl\langle y - \sum_{k \ne j}\theta_k x_k,\, x_j\Bigr\rangle\theta_j + \lambda|\theta_j|$$
$$= \arg\min_{\theta_j \in \mathbb{R}} \frac{1}{2}\Bigl[\theta_j - \frac{1}{\|x_j\|^2}\Bigl\langle y - \sum_{k \ne j}\theta_k x_k,\, x_j\Bigr\rangle\Bigr]^2 + \frac{\lambda}{\|x_j\|^2}|\theta_j|$$
38 Lasso with coordinate descent
$$\hat\theta_j = \arg\min_{\theta_j \in \mathbb{R}} \frac{1}{2}\Bigl[\theta_j - \frac{1}{\|x_j\|^2}\Bigl\langle y - \sum_{k \ne j}\theta_k x_k,\, x_j\Bigr\rangle\Bigr]^2 + \frac{\lambda}{\|x_j\|^2}|\theta_j|$$
Soft-thresholding:
$$S_\lambda(z) = \arg\min_{x \in \mathbb{R}} \frac{1}{2}(z - x)^2 + \lambda|x| = \mathrm{sign}(z)\,\bigl(|z| - \lambda\bigr)_+$$
Update rule:
$$\hat\theta_j = S_{\lambda/\|x_j\|^2}\Bigl(\frac{1}{\|x_j\|^2}\Bigl\langle y - \sum_{k \ne j}\theta_k x_k,\, x_j\Bigr\rangle\Bigr)$$
39 Lasso with coordinate descent II
Clever update scheme with low memory impact, by storing:
- the current residual $r^{(k)} = y - X\theta^{(k)}$ (size-$n$ vector)
- the current coefficients $\theta^{(k)}$ (size-$p$ vector)
Coordinate descent for the Lasso
Input: observations $y$, features $X = [x_1, \dots, x_p]$, initial $\theta^{(0)}$
Result: vector $\theta^{(k)}$
while not converged do
    Pick a coordinate $j$
    $r_{\mathrm{int}} \leftarrow r^{(k)} + x_j \theta_j^{(k)}$
    $\theta_j^{(k+1)} \leftarrow S_{\lambda/\|x_j\|_2^2}\bigl(x_j^\top r_{\mathrm{int}} / \|x_j\|_2^2\bigr)$
    $r^{(k+1)} = r_{\mathrm{int}} - x_j \theta_j^{(k+1)}$
end
Rem: computational simplification when $\|x_j\| = 1$ (NORMALIZE!)
Rem: the residual update can be done in place
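The Lasso variant combines the residual trick with soft-thresholding. A self-contained sketch (my code; the helper is redefined here so the block stands alone):

```python
import numpy as np

def soft_threshold(z, tau):
    """S_tau(z) = sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def cd_lasso(X, y, lam, n_passes):
    """Cyclic coordinate descent for 0.5*||y - X theta||^2 + lam*||theta||_1,
    with the residual r = y - X theta kept up to date."""
    theta = np.zeros(X.shape[1])
    r = y.copy()
    sq_norms = (X ** 2).sum(axis=0)
    for _ in range(n_passes):
        for j in range(theta.size):
            r_int = r + X[:, j] * theta[j]
            theta[j] = soft_threshold(X[:, j] @ r_int / sq_norms[j],
                                      lam / sq_norms[j])
            r = r_int - X[:, j] * theta[j]
    return theta
```

For $\lambda \ge \|X^\top y\|_\infty$ the first sweep thresholds every coordinate to zero, recovering the known fact that $\theta = 0$ is then optimal; below that level, the iterates approach a point satisfying the Lasso optimality condition $\|X^\top (y - X\theta)\|_\infty \le \lambda$.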
40 Similarity with forward-backward
Update rule for CD: $\theta_j^{(k+1)} \leftarrow S_{\lambda/\|x_j\|^2}\bigl(x_j^\top r_{\mathrm{int}} / \|x_j\|^2\bigr)$
Update rule for FB: $\theta^{(k+1)} \leftarrow S_{\lambda/L}\bigl(\theta^{(k)} + X^\top r^{(k)} / L\bigr)$, where $L$ is the Lipschitz constant of the gradient, $L = \lambda_{\max}(X^\top X)$.
Rem: the forward-backward update can be really useful if the following operation can be performed efficiently:
$$\begin{cases} \mathbb{R}^n \to \mathbb{R}^p \\ r \mapsto X^\top r \end{cases}$$
Common examples include the FFT, wavelet transforms, etc.
Rem: note that the residual is usually a full vector (not sparse).
41 Optimisation
Other alternatives to obtain a/the Lasso solution:
- LARS (Efron et al. (2004)) for the full Lasso path
- Forward-backward (ISTA, FISTA, cf. Beck and Teboulle (2009))
- Conditional gradient / Frank-Wolfe (Jaggi (2013)), useful for distributed datasets
42 Références I
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1), 2009.
A. Chambolle and C. Dossal. How to make sure the iterates of FISTA converge. 2014.
B. Efron, T. Hastie, I. M. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2), 2004. With discussion, and a rejoinder by the authors.
J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms. I, volume 305. Springer-Verlag, Berlin, 1993.
M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML, 2013.
43 Références II
Y. Nesterov. Introductory lectures on convex optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.
P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3), 2001.
Math 273a: Optimization Overview of First-Order Optimization Algorithms Wotao Yin Department of Mathematics, UCLA online discussions on piazza.com 1 / 9 Typical flow of numerical optimization Optimization
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationSingular integral operators and the Riesz transform
Singular integral operators and the Riesz transform Jordan Bell jordan.bell@gmail.com Department of Mathematics, University of Toronto November 17, 017 1 Calderón-Zygmund kernels Let ω n 1 be the measure
More informationLecture 16: October 22
0-725/36-725: Conve Optimization Fall 208 Lecturer: Ryan Tibshirani Lecture 6: October 22 Scribes: Nic Dalmasso, Alan Mishler, Benja LeRoy Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationPrimal-dual coordinate descent
Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationLecture 19 - Covariance, Conditioning
THEORY OF PROBABILITY VLADIMIR KOBZAR Review. Lecture 9 - Covariance, Conditioning Proposition. (Ross, 6.3.2) If X,..., X n are independent normal RVs with respective parameters µ i, σi 2 for i,..., n,
More informationCoordinate descent methods
Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5
More informationLecture 2 February 25th
Statistical machine learning and convex optimization 06 Lecture February 5th Lecturer: Francis Bach Scribe: Guillaume Maillard, Nicolas Brosse This lecture deals with classical methods for convex optimization.
More informationLecture 23: November 21
10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More informationRegression.
Regression www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts linear regression RMSE, MAE, and R-square logistic regression convex functions and sets
More informationNon-asymptotic bound for stochastic averaging
Non-asymptotic bound for stochastic averaging S. Gadat and F. Panloup Toulouse School of Economics PGMO Days, November, 14, 2017 I - Introduction I - 1 Motivations I - 2 Optimization I - 3 Stochastic Optimization
More informationApproximation in the Zygmund Class
Approximation in the Zygmund Class Odí Soler i Gibert Joint work with Artur Nicolau Universitat Autònoma de Barcelona New Developments in Complex Analysis and Function Theory, 02 July 2018 Approximation
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More informationLecture 8: February 9
0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we
More informationarxiv: v1 [stat.ml] 19 Feb 2016
GAP Safe Screening Rules for Sparse-Group Lasso Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon CNRS LTCI, Télécom ParisTech, Université Paris-Saclay 46 rue Barrault, 75013, Paris, France
More informationJanuary 29, Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes Stiefel 1 / 13
Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière Hestenes Stiefel January 29, 2014 Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes
More informationMariusz Jurkiewicz, Bogdan Przeradzki EXISTENCE OF SOLUTIONS FOR HIGHER ORDER BVP WITH PARAMETERS VIA CRITICAL POINT THEORY
DEMONSTRATIO MATHEMATICA Vol. XLVIII No 1 215 Mariusz Jurkiewicz, Bogdan Przeradzki EXISTENCE OF SOLUTIONS FOR HIGHER ORDER BVP WITH PARAMETERS VIA CRITICAL POINT THEORY Communicated by E. Zadrzyńska Abstract.
More informationA Unified Approach to Proximal Algorithms using Bregman Distance
A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department
More informationBlock Coordinate Descent for Regularized Multi-convex Optimization
Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline
More informationMachine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014
Case Study 3: fmri Prediction Fused LASSO LARS Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, 2014 Emily Fox 2014 1 LASSO Regression
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationEE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1
EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex
More informationProximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R
More information9. Dual decomposition and dual algorithms
EE 546, Univ of Washington, Spring 2016 9. Dual decomposition and dual algorithms dual gradient ascent example: network rate control dual decomposition and the proximal gradient method examples with simple
More information