Computational Statistics and Optimisation. Joseph Salmon Télécom Paristech, Institut Mines-Télécom
2 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
3 Fenchel duality for a stopping criterion
Let $F$ be the objective function; fix a small $\varepsilon > 0$ and stop when
$$\frac{F(\theta^{t+1}) - F(\theta^t)}{F(\theta^t)} \le \varepsilon \quad \text{or} \quad \|\nabla F(\theta^t)\| \le \varepsilon$$
Alternative: leverage the duality gap.
Notation: $F(\theta) = f(X\theta) + g(\theta)$ with $f : \mathbb{R}^n \to \mathbb{R}$, $g : \mathbb{R}^p \to \mathbb{R}$ and $X$ an $n \times p$ matrix.
Fenchel duality: consider the problem $\min_\theta F(\theta)$; then the following holds:
$$\sup_u \bigl\{ -f^*(u) - g^*(-X^\top u) \bigr\} \;\le\; \inf_\theta \bigl\{ f(X\theta) + g(\theta) \bigr\}$$
Moreover, if $f$ and $g$ are convex then, under mild assumptions, equality of both sides holds (strong duality, no duality gap).
proof: use the Fenchel–Young inequality
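The blackboard step announced above can be sketched as follows, using only the Fenchel–Young inequality $f(z) + f^*(u) \ge \langle u, z\rangle$ (valid for any $z$, $u$):

```latex
% Fenchel--Young applied twice, once to f at z = X\theta and once to g at \theta:
f(X\theta) + f^*(u) \ge \langle u, X\theta\rangle,
\qquad
g(\theta) + g^*(-X^\top u) \ge \langle -X^\top u, \theta\rangle .
% Summing the two inequalities and using
% \langle u, X\theta\rangle = \langle X^\top u, \theta\rangle,
% the right-hand sides cancel:
f(X\theta) + g(\theta) \;\ge\; -f^*(u) - g^*(-X^\top u)
\quad \text{for all } \theta \in \mathbb{R}^p,\; u \in \mathbb{R}^n .
```

Taking the infimum over $\theta$ on the left and the supremum over $u$ on the right gives the weak duality stated on the slide.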
4 Fenchel duality
Denote by:
- $\theta^*$: a primal optimal solution of $\inf_\theta \{f(X\theta) + g(\theta)\}$
- $u^*$: a dual solution of $\sup_u \{-f^*(u) - g^*(-X^\top u)\}$
Define the duality gap by:
$$\Delta(\theta, u) = F(\theta) + f^*(u) + g^*(-X^\top u)$$
Property of the duality gap: for all $u$,
$$\Delta(\theta, u) \ge F(\theta) - F(\theta^*) \ge 0$$
proof: Fenchel duality applied to a primal solution $\theta^*$.
Motivation: $\Delta(\theta, u) \le \varepsilon \Rightarrow F(\theta) - F(\theta^*) \le \varepsilon$.
5 Example: duality gap for the Lasso
Lasso objective: $F(\theta) = \frac{1}{2}\|X\theta - y\|_2^2 + \lambda\|\theta\|_1$
- $f(z) = \frac{1}{2}\|z - y\|_2^2$; $f^*(u) = \frac{1}{2}\|u\|_2^2 + \langle u, y\rangle$ (translation property)
- $g(\theta) = \lambda\|\theta\|_1$; $g^*(u) = \iota_{\{u : \|u\|_\infty \le \lambda\}}$ ($\ell_\infty$-ball indicator)
Duality gap:
$$\Delta(\theta, u) = F(\theta) + f^*(u) + g^*(-X^\top u) = F(\theta) + \frac{1}{2}\|u\|_2^2 + \langle u, y\rangle$$
as soon as $\|X^\top u\|_\infty \le \lambda$; otherwise the bound is $+\infty$, hence useless.
Rem: at optimal solutions and under mild assumptions, $\Delta(\theta^*, u^*) = 0$.
6 Example: duality gap for the Lasso (II)
Possible choice: $\theta^t$ (current iterate of any iterative algorithm), $r^t = X\theta^t - y$ (minus the current residuals), and
$$u^t = \mu_t r^t \quad \text{with} \quad \mu_t = \min\bigl(1, \lambda / \|X^\top r^t\|_\infty\bigr)$$
Motivation for this choice: at the optimum, $u^* = \nabla f(X\theta^*)$.
Stopping criterion:
$$\frac{1}{2}\|r^t\|_2^2 + \lambda\|\theta^t\|_1 + \frac{1}{2}\|u^t\|_2^2 + \langle u^t, y\rangle \le \varepsilon
\;\Leftrightarrow\;
\frac{1}{2}(1 + \mu_t^2)\|r^t\|_2^2 + \lambda\|\theta^t\|_1 + \mu_t\langle r^t, y\rangle \le \varepsilon$$
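The gap above fits in a few lines of code. The sketch below is mine (NumPy; the function name is not from the slides); it uses the dual point $u = \mu r$ with $\mu = \min(1, \lambda/\|X^\top r\|_\infty)$ exactly as chosen on the slide:

```python
import numpy as np

def lasso_duality_gap(X, y, theta, lam):
    """Duality gap for the Lasso at theta, with the dual-feasible
    scaling u = mu * r, mu = min(1, lam / ||X^T r||_inf)."""
    r = X @ theta - y                                   # r = X theta - y
    primal = 0.5 * r @ r + lam * np.abs(theta).sum()    # F(theta)
    mu = min(1.0, lam / max(np.abs(X.T @ r).max(), 1e-12))
    u = mu * r                                          # feasible: ||X^T u||_inf <= lam
    return primal + 0.5 * u @ u + u @ y                 # F(theta) + f*(u)
```

By construction the returned value upper-bounds $F(\theta) - F(\theta^*)$, so stopping once it drops below $\varepsilon$ is a valid criterion for any iterative Lasso solver.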
7 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
8 Convergence: Lipschitz gradient
$$\theta^{t+1} = \theta^t - \alpha \nabla f(\theta^t)$$
Convergence rate for a fixed step size.
Hypothesis: $f$ convex, differentiable with $L$-Lipschitz gradient: for all $\theta, \theta'$,
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\|\theta - \theta'\|$$
Result: for any minimizer $\theta^*$ of $f$, if $\alpha \le \frac{1}{L}$ then $\theta^t$ satisfies
$$f(\theta^T) - f(\theta^*) \le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha T}$$
Rem: if $f$ is twice differentiable, this corresponds to $\nabla^2 f(\theta) \preceq L \,\mathrm{Id}$.
Example: for $\theta \mapsto \frac{1}{2}\|X\theta - y\|_2^2$, $L = \lambda_{\max}(X^\top X)$ (spectral radius).
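To make the statement concrete, here is a minimal sketch (mine, not from the slides) of fixed-step gradient descent on the least-squares example, with the step $\alpha = 1/L$ and $L = \lambda_{\max}(X^\top X)$ from the slide:

```python
import numpy as np

def gradient_descent_ls(X, y, T, alpha=None):
    """Fixed-step gradient descent on f(theta) = 0.5 * ||X theta - y||^2.
    Default step alpha = 1/L with L = lambda_max(X^T X), as in the theorem."""
    L = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of grad f
    if alpha is None:
        alpha = 1.0 / L
    theta = np.zeros(X.shape[1])
    for _ in range(T):
        theta = theta - alpha * X.T @ (X @ theta - y)   # gradient step
    return theta
```

On a well-conditioned problem the iterates match the least-squares solution after a modest number of iterations, consistent with the $O(1/T)$ bound (and in fact faster, since this $f$ is strongly convex).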
9 Convergence: proof
Point 1: an $L$-Lipschitz gradient implies the quadratic upper bound: for all $x, y$,
$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2$$
Point 2: use the definition $\theta^{t+1} = \theta^t - \alpha \nabla f(\theta^t)$ and get
$$f(\theta^{t+1}) \le f(\theta^t) - \Bigl(1 - \frac{L\alpha}{2}\Bigr)\alpha \|\nabla f(\theta^t)\|^2$$
Point 3: use convexity, $0 < \alpha \le \frac{1}{L}$ and $ab = \bigl(a^2 + b^2 - (a - b)^2\bigr)/2$:
$$f(\theta^{t+1}) \le f(\theta^*) + \nabla f(\theta^t)^\top(\theta^t - \theta^*) - \frac{\alpha}{2}\|\nabla f(\theta^t)\|^2
= f(\theta^*) + \frac{1}{2\alpha}\bigl(\|\theta^t - \theta^*\|^2 - \|\theta^{t+1} - \theta^*\|^2\bigr)$$
10 Convergence proof (bis)
Point 4: telescoping sums:
$$\frac{1}{T}\sum_{t=0}^{T-1}\bigl(f(\theta^{t+1}) - f(\theta^*)\bigr)
\le \frac{1}{T}\cdot\frac{1}{2\alpha}\bigl(\|\theta^0 - \theta^*\|^2 - \|\theta^T - \theta^*\|^2\bigr)
\le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha T}$$
From Point 2, $f(\theta^{t+1}) \le f(\theta^t)$, hence
$$f(\theta^T) - f(\theta^*) \le \frac{1}{T}\sum_{t=0}^{T-1}\bigl(f(\theta^{t+1}) - f(\theta^*)\bigr) \le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha T}$$
11 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
12 Composite minimization
One aims at minimizing $F = f + g$ where:
- $f$ is smooth: $\nabla f$ is $L$-Lipschitz
- $g$ is proximable (prox-capable):
$$\mathrm{prox}_{\alpha g}(y) = \arg\min_{z \in \mathbb{R}^p} \Bigl(\frac{1}{2}\|z - y\|_2^2 + \alpha g(z)\Bigr)$$
is efficiently computable.
Examples: projection onto a box constraint, (block) soft-thresholding operator, shrinkage operator (ridge).
13 Forward-backward algorithm
Notation: $\varphi_\alpha(\theta) := \mathrm{prox}_{\alpha g}\bigl(\theta - \alpha \nabla f(\theta)\bigr)$
Forward-backward algorithm
Input: initialization $\theta^0$, step size $\alpha$
Result: $\theta^T$
while not converged do
    $\theta^{t+1} = \varphi_\alpha(\theta^t)$
end
Rem:
$$\varphi_\alpha(\theta) = \arg\min_{\theta'} \Bigl(f(\theta) + \langle \nabla f(\theta), \theta' - \theta\rangle + \frac{1}{2\alpha}\|\theta' - \theta\|^2 + g(\theta')\Bigr)$$
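The iteration above can be sketched for the Lasso, where $g = \lambda\|\cdot\|_1$ and the prox is entrywise soft-thresholding (the code and names are mine, assuming the fixed step $\alpha = 1/L$ used in the analysis that follows):

```python
import numpy as np

def soft_threshold(z, tau):
    """prox of tau * |.| applied entrywise: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista(X, y, lam, T):
    """Forward-backward (ISTA) for F = 0.5 ||X theta - y||^2 + lam ||theta||_1."""
    L = np.linalg.eigvalsh(X.T @ X).max()
    alpha = 1.0 / L
    theta = np.zeros(X.shape[1])
    for _ in range(T):
        grad = X.T @ (X @ theta - y)                               # forward step
        theta = soft_threshold(theta - alpha * grad, alpha * lam)  # backward (prox) step
    return theta
```

For $\lambda \ge \|X^\top y\|_\infty$ the iterates stay at $0$, which is then the exact minimizer; for smaller $\lambda$ the objective decreases monotonically, as the analysis below guarantees.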
14 Convergence: $f$ with Lipschitz gradient
$$\theta^{t+1} = \varphi_\alpha(\theta^t) = \mathrm{prox}_{\alpha g}\bigl(\theta^t - \alpha \nabla f(\theta^t)\bigr)$$
Convergence rate for a fixed step size.
Hypothesis: $f$ convex, differentiable with $L$-Lipschitz gradient: for all $\theta, \theta'$,
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\|\theta - \theta'\|$$
Result: for any minimizer $\theta^*$ of $F$, if $\alpha \le \frac{1}{L}$ then $\theta^t$ satisfies
$$F(\theta^T) - F(\theta^*) \le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha T}$$
Rem: same bound as in the case $g = 0$.
15 Proof: gradient
Point 1: for $\alpha \le 1/L$ and $\hat{x} = \varphi_\alpha(\bar{x})$, then for all $y$:
$$F(\hat{x}) + \frac{\|\hat{x} - y\|_2^2}{2\alpha} \le F(y) + \frac{\|\bar{x} - y\|_2^2}{2\alpha}$$
Proof: let
$$H_\alpha(y) = f(\bar{x}) + \langle \nabla f(\bar{x}), y - \bar{x}\rangle + \frac{1}{2\alpha}\|y - \bar{x}\|^2 + g(y)$$
- $H_\alpha$ is $\frac{1}{\alpha}$-strongly convex since $\alpha \le 1/L$ and $H_\alpha(\cdot) - \frac{1}{2\alpha}\|\cdot\|_2^2$ is convex (cf. for instance page 280 of Hiriart-Urruty and Lemaréchal (1993))
- $\hat{x} = \arg\min_y H_\alpha(y)$ (cf. two slides up)
By $\frac{1}{\alpha}$-strong convexity, $H_\alpha(\hat{x}) + \frac{1}{2\alpha}\|\hat{x} - y\|_2^2 \le H_\alpha(y)$.
16 Point 1 (continued)
Expanding $H_\alpha$ on both sides:
$$g(\hat{x}) + f(\bar{x}) + \langle \nabla f(\bar{x}), \hat{x} - \bar{x}\rangle + \frac{1}{2\alpha}\bigl(\|\hat{x} - \bar{x}\|^2 + \|\hat{x} - y\|^2\bigr)
\le g(y) + f(\bar{x}) + \langle \nabla f(\bar{x}), y - \bar{x}\rangle + \frac{1}{2\alpha}\|y - \bar{x}\|^2$$
By convexity of $f$:
$$f(\bar{x}) + \langle \nabla f(\bar{x}), y - \bar{x}\rangle \le f(y)$$
and by the choice $\alpha \le 1/L$, the following bound holds:
$$f(\hat{x}) \le f(\bar{x}) + \langle \nabla f(\bar{x}), \hat{x} - \bar{x}\rangle + \frac{1}{2\alpha}\|\hat{x} - \bar{x}\|^2$$
Hence Point 1:
$$F(\hat{x}) + \frac{1}{2\alpha}\|\hat{x} - y\|^2 \le F(y) + \frac{1}{2\alpha}\|y - \bar{x}\|^2$$
17 Theorem proof
Point 1 states $F(\hat{x}) + \frac{1}{2\alpha}\|\hat{x} - y\|^2 \le F(y) + \frac{1}{2\alpha}\|y - \bar{x}\|^2$, so choosing $y = \theta^*$ (any minimizer of $F$), $\bar{x} = \theta^t$ and $\hat{x} = \theta^{t+1}$:
$$F(\theta^{t+1}) + \frac{1}{2\alpha}\|\theta^{t+1} - \theta^*\|^2 \le F(\theta^*) + \frac{1}{2\alpha}\|\theta^* - \theta^t\|^2$$
This leads to Point 3 of the smooth case:
$$F(\theta^{t+1}) \le F(\theta^*) + \frac{1}{2\alpha}\bigl(\|\theta^t - \theta^*\|^2 - \|\theta^{t+1} - \theta^*\|^2\bigr)$$
and then the same telescoping argument leads to the bound.
18 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
19 Forward-backward accelerated algorithm
Notation: $\varphi_\alpha(\theta) := \mathrm{prox}_{\alpha g}\bigl(\theta - \alpha \nabla f(\theta)\bigr)$
Accelerated forward-backward algorithm
Input: initialization $\theta^0$, step size $\alpha$, a sequence $(\mu_t)_{t \in \mathbb{N}}$ satisfying $\mu_1 = 1$ and $\mu_{t+1}^2 - \mu_{t+1} \le \mu_t^2$
Result: $\theta^T$
while not converged do
    $\theta^{t+1} = \varphi_\alpha(z^t)$
    $z^{t+1} = \theta^{t+1} + \frac{\mu_t - 1}{\mu_{t+1}}\bigl(\theta^{t+1} - \theta^t\bigr)$
end
Examples of admissible sequences:
- $\mu_{t+1} = \sqrt{\mu_t^2 + 1/4} + 1/2$ (i.e., $\mu_{t+1}^2 - \mu_{t+1} = \mu_t^2$)
- $\mu_{t+1} = (t+1)/2$
- $\mu_{t+1} = (t + a - 1)/a$
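The accelerated scheme for the Lasso can be sketched as follows (my code, not from the slides), using the first admissible sequence above, $\mu_{t+1} = \sqrt{\mu_t^2 + 1/4} + 1/2$, written in the equivalent form $(1 + \sqrt{1 + 4\mu_t^2})/2$:

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fista(X, y, lam, T):
    """Accelerated forward-backward (FISTA) for the Lasso with step 1/L."""
    L = np.linalg.eigvalsh(X.T @ X).max()
    alpha = 1.0 / L
    theta = np.zeros(X.shape[1])
    z = theta.copy()                 # extrapolated point z^t
    mu = 1.0                         # mu_1 = 1
    for _ in range(T):
        theta_new = soft_threshold(z - alpha * X.T @ (X @ z - y), alpha * lam)
        mu_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * mu * mu))
        z = theta_new + ((mu - 1.0) / mu_new) * (theta_new - theta)  # momentum
        theta, mu = theta_new, mu_new
    return theta
```

Compared to the ISTA sketch, the only changes are the auxiliary point $z^t$ and the momentum coefficient $(\mu_t - 1)/\mu_{t+1}$; note the iteration is not monotone in $F$, but the final accuracy matches the $O(1/T^2)$ bound.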
20 Convergence: Lipschitz gradient
$$\theta^{t+1} = \varphi_\alpha(z^t) = \mathrm{prox}_{\alpha g}\bigl(z^t - \alpha \nabla f(z^t)\bigr)$$
Convergence rate for a fixed step size.
Hypothesis: $f$ convex, differentiable with $L$-Lipschitz gradient: for all $\theta, \theta'$,
$$\|\nabla f(\theta) - \nabla f(\theta')\| \le L\|\theta - \theta'\|$$
Result: for any minimizer $\theta^*$ of $F$, if $\alpha \le \frac{1}{L}$ then $\theta^t$ satisfies
$$F(\theta^T) - F(\theta^*) \le \frac{\|\theta^0 - \theta^*\|^2}{2\alpha \mu_T^2}$$
Rem: for the common choices given above, $\mu_t$ grows linearly in $t$, hence a rate in $O(1/T^2)$.
Rem: define $F^* = F(\theta^*)$ for the proof.
21 Proof: rate for the Nesterov acceleration
Point 1 with $\hat{x} = \varphi_\alpha(z^t)$, $\bar{x} = z^t$, $y = (1 - 1/\mu_{t+1})\theta^t + (1/\mu_{t+1})\theta^*$:
$$F(\hat{x}) + \frac{\|\hat{x} - y\|_2^2}{2\alpha} \le F(y) + \frac{\|\bar{x} - y\|_2^2}{2\alpha}$$
With $u^{t+1} = \theta^t + \mu_{t+1}(\theta^{t+1} - \theta^t)$, a little algebra gives:
$$F(\theta^{t+1}) + \frac{\|u^{t+1} - \theta^*\|_2^2}{2\alpha\mu_{t+1}^2} \le F(y) + \frac{\|u^t - \theta^*\|_2^2}{2\alpha\mu_{t+1}^2}$$
$$F(\theta^{t+1}) - F^* - \Bigl(1 - \frac{1}{\mu_{t+1}}\Bigr)\bigl(F(\theta^t) - F^*\bigr)
\le \frac{\|u^t - \theta^*\|_2^2}{2\alpha\mu_{t+1}^2} - \frac{\|u^{t+1} - \theta^*\|_2^2}{2\alpha\mu_{t+1}^2}$$
$$\mu_{t+1}^2\,\Delta F^{t+1} - \bigl(\mu_{t+1}^2 - \mu_{t+1}\bigr)\,\Delta F^t
\le \frac{\|u^t - \theta^*\|_2^2}{2\alpha} - \frac{\|u^{t+1} - \theta^*\|_2^2}{2\alpha}$$
(by convexity of $F$, with $\Delta F^{t+1} := F(\theta^{t+1}) - F^*$)
22 Proof continued
Define $\rho_{t+1} := \mu_{t+1} - \mu_{t+1}^2 + \mu_t^2 \ge 0$, so
$$\mu_{t+1}^2\,\Delta F^{t+1} - \bigl(\mu_{t+1}^2 - \mu_{t+1}\bigr)\,\Delta F^t
= \mu_{t+1}^2\,\Delta F^{t+1} - \mu_t^2\,\Delta F^t + \rho_{t+1}\,\Delta F^t
\le \frac{\|u^t - \theta^*\|_2^2}{2\alpha} - \frac{\|u^{t+1} - \theta^*\|_2^2}{2\alpha}$$
Telescoping again (with the convention $\mu_0 = 0$ and $u^0 = \theta^0$):
$$\mu_T^2\,\Delta F^T + \sum_{t=0}^{T-1} \rho_{t+1}\,\Delta F^t
\le \frac{\|u^0 - \theta^*\|_2^2}{2\alpha} - \frac{\|u^T - \theta^*\|_2^2}{2\alpha}$$
$$\mu_T^2\,\Delta F^T \le \frac{\|u^0 - \theta^*\|_2^2}{2\alpha}$$
23 Convergence of the iterates
Very recent result: Chambolle and Dossal (2014).
Proof out of the scope of this course.
More reading on the previous themes:
- Nesterov (2004) for proofs, strong convexity, etc.
- Beck and Teboulle (2009) for the ISTA/FISTA analysis
- Chambolle and Dossal (2014) for FISTA with a larger choice of updating rules
24 Plan
- Duality gap and stopping criterion
- Back to gradient descent analysis
- Forward-backward analysis
- Forward-backward accelerated
- Coordinate descent
25 Coordinate descent
Objective: solve $\arg\min_{\theta \in \mathbb{R}^p} f(\theta)$
Initialization: $\theta^{(0)}$, then:
$$\theta_1^{(k)} \in \arg\min_{\theta_1 \in \mathbb{R}} f\bigl(\theta_1, \theta_2^{(k-1)}, \theta_3^{(k-1)}, \dots, \theta_p^{(k-1)}\bigr)$$
$$\theta_2^{(k)} \in \arg\min_{\theta_2 \in \mathbb{R}} f\bigl(\theta_1^{(k)}, \theta_2, \theta_3^{(k-1)}, \dots, \theta_p^{(k-1)}\bigr)$$
$$\theta_3^{(k)} \in \arg\min_{\theta_3 \in \mathbb{R}} f\bigl(\theta_1^{(k)}, \theta_2^{(k)}, \theta_3, \dots, \theta_p^{(k-1)}\bigr)$$
$$\vdots$$
$$\theta_p^{(k)} \in \arg\min_{\theta_p \in \mathbb{R}} f\bigl(\theta_1^{(k)}, \theta_2^{(k)}, \theta_3^{(k)}, \dots, \theta_p\bigr)$$
Cycle over the coordinates.
26 Motivation
Convergence guarantees toward a minimum hold for smooth functions, cf. Tseng (2001).
27 Motivation
Convergence guarantees toward a minimum also hold for sums of a smooth function and a separable (possibly non-smooth) function, cf. Tseng (2001).
28 Motivation
NO CONVERGENCE toward a minimum for non-separable/non-smooth functions: some points get stuck.
32 Motivation
- Coordinate descent can be extremely fast
- Possibly visit the coordinates in an arbitrary order (cyclic, random, more refined methods, etc.)
- Possibly by blocks: update not only one coordinate but a bunch of them (optimize according to your architecture)
Exo: test, on the ridge regression problem, the relative performance of cyclic and random sampling of the coordinates.
33 CD for least squares
$$\arg\min_{\theta \in \mathbb{R}^p} f(\theta) \quad \text{for} \quad f(\theta) = \frac{1}{2}\|y - X\theta\|^2 = \frac{1}{2}\sum_{i=1}^n \Bigl(y_i - \sum_{j=1}^p \theta_j x_{ij}\Bigr)^2$$
Reminder:
$$\nabla f(\theta) = X^\top (X\theta - y), \quad \text{i.e.,} \quad \frac{\partial f}{\partial \theta_j}(\theta) = x_j^\top (X\theta - y)$$
Minimize w.r.t. $\theta_j$ with the $\theta_k$ ($k \ne j$) fixed:
$$0 = \frac{\partial f}{\partial \theta_j}(\theta) = x_j^\top (X\theta - y) = x_j^\top x_j\, \theta_j + x_j^\top\Bigl(\sum_{k \ne j} x_k \theta_k - y\Bigr)$$
$$\Leftrightarrow \theta_j = \frac{x_j^\top \bigl(y - \sum_{k \ne j} x_k \theta_k\bigr)}{\|x_j\|_2^2} = \frac{x_j^\top \bigl(y - \sum_{k=1}^p x_k \theta_k + x_j \theta_j\bigr)}{\|x_j\|_2^2}$$
34 CD for least squares II
Clever update scheme with low memory impact, by storing:
- the current residual $r^{(k)} = y - X\theta^{(k)}$ (size-$n$ vector)
- the current coefficients $\theta^{(k)}$ (size-$p$ vector)
Coordinate descent for least squares
Input: observations $y$, features $X = [x_1, \dots, x_p]$, initial $\theta^{(0)}$
Result: vector $\theta^{(k)}$
while not converged do
    Pick a coordinate $j$
    $r_{\mathrm{int}} \leftarrow r^{(k)} + x_j \theta_j^{(k)}$
    $\theta_j^{(k+1)} \leftarrow x_j^\top r_{\mathrm{int}} / \|x_j\|_2^2$
    $r^{(k+1)} = r_{\mathrm{int}} - x_j \theta_j^{(k+1)}$
end
Rem: computational simplification when $\|x_j\| = 1$ (NORMALIZE!)
Rem: the residual update can be done in place
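The residual-caching scheme can be sketched as follows (my NumPy code; variable names mirror the pseudo-code above, with the residual kept as $y - X\theta$ so each coordinate update costs $O(n)$):

```python
import numpy as np

def cd_least_squares(X, y, n_passes):
    """Cyclic coordinate descent for 0.5 * ||y - X theta||^2,
    keeping the residual r = y - X theta up to date."""
    n, p = X.shape
    theta = np.zeros(p)
    r = y.copy()                        # residual for theta = 0
    sq_norms = (X ** 2).sum(axis=0)     # ||x_j||^2, precomputed once
    for _ in range(n_passes):
        for j in range(p):
            r_int = r + X[:, j] * theta[j]          # partial residual
            theta[j] = X[:, j] @ r_int / sq_norms[j]
            r = r_int - X[:, j] * theta[j]          # residual update
    return theta
```

On a well-conditioned problem the iterates converge to the least-squares solution; with normalized columns the division disappears, which is the simplification noted on the slide.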
35 Ridge regression with coordinate descent
$$\arg\min_{\theta \in \mathbb{R}^p} f(\theta) \quad \text{for} \quad f(\theta) = \frac{1}{2}\sum_{i=1}^n \Bigl(y_i - \sum_{j=1}^p \theta_j x_{ij}\Bigr)^2 + \frac{\lambda}{2}\sum_{j=1}^p \theta_j^2$$
$$\nabla f(\theta) = X^\top (X\theta - y) + \lambda\theta, \quad \text{i.e.,} \quad \frac{\partial f}{\partial \theta_j}(\theta) = x_j^\top (X\theta - y) + \lambda\theta_j$$
Minimize w.r.t. $\theta_j$, fixing the $\theta_k$ ($k \ne j$):
$$0 = \frac{\partial f}{\partial \theta_j}(\theta) = x_j^\top (X\theta - y) + \lambda\theta_j = \bigl(x_j^\top x_j + \lambda\bigr)\theta_j + x_j^\top\Bigl(\sum_{k \ne j} x_k \theta_k - y\Bigr)$$
$$\Leftrightarrow \theta_j = \frac{x_j^\top \bigl(y - \sum_{k \ne j} x_k \theta_k\bigr)}{\|x_j\|_2^2 + \lambda}$$
36 Ridge regression with coordinate descent II
Clever update scheme with low memory impact, by storing:
- the current residual $r^{(k)} = y - X\theta^{(k)}$ (size-$n$ vector)
- the current coefficients $\theta^{(k)}$ (size-$p$ vector)
Coordinate descent for ridge regression
Input: observations $y$, features $X = [x_1, \dots, x_p]$, initial $\theta^{(0)}$
Result: vector $\theta^{(k)}$
while not converged do
    Pick a coordinate $j$
    $r_{\mathrm{int}} \leftarrow r^{(k)} + x_j \theta_j^{(k)}$
    $\theta_j^{(k+1)} \leftarrow x_j^\top r_{\mathrm{int}} / \bigl(\|x_j\|_2^2 + \lambda\bigr)$
    $r^{(k+1)} = r_{\mathrm{int}} - x_j \theta_j^{(k+1)}$
end
Rem: computational simplification when $\|x_j\| = 1$ (NORMALIZE!)
Rem: the residual update can be done in place
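Compared to the least-squares sketch, only the denominator of the coordinate update changes. A minimal version (my code; the reference solution used in the check below is the usual closed form $(X^\top X + \lambda I)^{-1}X^\top y$):

```python
import numpy as np

def cd_ridge(X, y, lam, n_passes):
    """Cyclic coordinate descent for 0.5*||y - X theta||^2 + 0.5*lam*||theta||^2."""
    n, p = X.shape
    theta = np.zeros(p)
    r = y.copy()                        # residual r = y - X theta
    sq_norms = (X ** 2).sum(axis=0)
    for _ in range(n_passes):
        for j in range(p):
            r_int = r + X[:, j] * theta[j]
            theta[j] = X[:, j] @ r_int / (sq_norms[j] + lam)   # ridge denominator
            r = r_int - X[:, j] * theta[j]
    return theta
```

Since the ridge objective is strongly convex, the cyclic sweeps converge linearly, which makes this a convenient test case for the cyclic-vs-random exercise mentioned earlier.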
37 Lasso with coordinate descent
$$\arg\min_{\theta \in \mathbb{R}^p} f(\theta) \quad \text{for} \quad f(\theta) = \frac{1}{2}\|y - X\theta\|^2 + \lambda\sum_{j=1}^p |\theta_j|$$
Minimize w.r.t. $\theta_j$, fixing the $\theta_k$ ($k \ne j$):
$$\hat\theta_j = \arg\min_{\theta_j \in \mathbb{R}} f(\theta_1, \dots, \theta_p)
= \arg\min_{\theta_j \in \mathbb{R}} \frac{1}{2}\Bigl\|y - \sum_{k \ne j}\theta_k x_k - x_j\theta_j\Bigr\|^2 + \lambda|\theta_j|$$
$$= \arg\min_{\theta_j \in \mathbb{R}} \frac{1}{2}\|x_j\|^2\theta_j^2 - \Bigl\langle y - \sum_{k \ne j}\theta_k x_k,\, x_j\Bigr\rangle\theta_j + \lambda|\theta_j|$$
$$= \arg\min_{\theta_j \in \mathbb{R}} \frac{1}{2}\Bigl[\theta_j - \frac{1}{\|x_j\|^2}\Bigl\langle y - \sum_{k \ne j}\theta_k x_k,\, x_j\Bigr\rangle\Bigr]^2 + \frac{\lambda}{\|x_j\|^2}|\theta_j|$$
38 Lasso with coordinate descent
$$\hat\theta_j = \arg\min_{\theta_j \in \mathbb{R}} \frac{1}{2}\Bigl[\theta_j - \frac{1}{\|x_j\|^2}\Bigl\langle y - \sum_{k \ne j}\theta_k x_k,\, x_j\Bigr\rangle\Bigr]^2 + \frac{\lambda}{\|x_j\|^2}|\theta_j|$$
Soft-thresholding:
$$S_\lambda(z) = \arg\min_{x \in \mathbb{R}} \frac{1}{2}(z - x)^2 + \lambda|x| = \mathrm{sign}(z)\,\bigl(|z| - \lambda\bigr)_+$$
Update rule:
$$\hat\theta_j = S_{\lambda/\|x_j\|^2}\Bigl(\frac{1}{\|x_j\|^2}\Bigl\langle y - \sum_{k \ne j}\theta_k x_k,\, x_j\Bigr\rangle\Bigr)$$
39 Lasso with coordinate descent II
Clever update scheme with low memory impact, by storing:
- the current residual $r^{(k)} = y - X\theta^{(k)}$ (size-$n$ vector)
- the current coefficients $\theta^{(k)}$ (size-$p$ vector)
Coordinate descent for the Lasso
Input: observations $y$, features $X = [x_1, \dots, x_p]$, initial $\theta^{(0)}$
Result: vector $\theta^{(k)}$
while not converged do
    Pick a coordinate $j$
    $r_{\mathrm{int}} \leftarrow r^{(k)} + x_j \theta_j^{(k)}$
    $\theta_j^{(k+1)} \leftarrow S_{\lambda/\|x_j\|_2^2}\bigl(x_j^\top r_{\mathrm{int}} / \|x_j\|_2^2\bigr)$
    $r^{(k+1)} = r_{\mathrm{int}} - x_j \theta_j^{(k+1)}$
end
Rem: computational simplification when $\|x_j\| = 1$ (NORMALIZE!)
Rem: the residual update can be done in place
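The Lasso variant combines the residual trick with soft-thresholding. A self-contained sketch (my code; the helper is redefined here so the block stands alone):

```python
import numpy as np

def soft_threshold(z, tau):
    """S_tau(z) = sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def cd_lasso(X, y, lam, n_passes):
    """Cyclic coordinate descent for 0.5*||y - X theta||^2 + lam*||theta||_1,
    with the residual r = y - X theta kept up to date."""
    theta = np.zeros(X.shape[1])
    r = y.copy()
    sq_norms = (X ** 2).sum(axis=0)
    for _ in range(n_passes):
        for j in range(theta.size):
            r_int = r + X[:, j] * theta[j]
            theta[j] = soft_threshold(X[:, j] @ r_int / sq_norms[j],
                                      lam / sq_norms[j])
            r = r_int - X[:, j] * theta[j]
    return theta
```

For $\lambda \ge \|X^\top y\|_\infty$ the first sweep thresholds every coordinate to zero, recovering the known fact that $\theta = 0$ is then optimal; below that level, the iterates approach a point satisfying the Lasso optimality condition $\|X^\top (y - X\theta)\|_\infty \le \lambda$.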
40 Similarity with forward-backward
Update rule for CD: $\theta_j^{(k+1)} \leftarrow S_{\lambda/\|x_j\|^2}\bigl(x_j^\top r_{\mathrm{int}} / \|x_j\|^2\bigr)$
Update rule for FB: $\theta^{(k+1)} \leftarrow S_{\lambda/L}\bigl(\theta^{(k)} + X^\top r^{(k)} / L\bigr)$, where $L$ is the Lipschitz constant of the gradient, $L = \lambda_{\max}(X^\top X)$.
Rem: the forward-backward update can be really useful if the following operation can be performed efficiently:
$$\begin{cases} \mathbb{R}^n \to \mathbb{R}^p \\ r \mapsto X^\top r \end{cases}$$
Common examples include the FFT, wavelet transforms, etc.
Rem: note that the residual is usually a full vector (not sparse).
41 Optimisation
Other alternatives to obtain a/the Lasso solution:
- LARS (Efron et al. (2004)) for the full Lasso path
- Forward-backward (ISTA, FISTA, cf. Beck and Teboulle (2009))
- Conditional gradient / Frank-Wolfe (Jaggi (2013)), useful for distributed datasets
42 Références I
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1), 2009.
A. Chambolle and C. Dossal. How to make sure the iterates of FISTA converge. 2014.
B. Efron, T. Hastie, I. M. Johnstone, and R. Tibshirani. Least angle regression. Ann. Statist., 32(2), 2004. With discussion, and a rejoinder by the authors.
J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms. I, volume 305. Springer-Verlag, Berlin, 1993.
M. Jaggi. Revisiting Frank-Wolfe: projection-free sparse convex optimization. In ICML, 2013.
43 Références II
Y. Nesterov. Introductory lectures on convex optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.
P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3), 2001.
Math 273a: Optimization Overview of First-Order Optimization Algorithms Wotao Yin Department of Mathematics, UCLA online discussions on piazza.com 1 / 9 Typical flow of numerical optimization Optimization
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationSingular integral operators and the Riesz transform
Singular integral operators and the Riesz transform Jordan Bell jordan.bell@gmail.com Department of Mathematics, University of Toronto November 17, 017 1 Calderón-Zygmund kernels Let ω n 1 be the measure
More informationLecture 16: October 22
0-725/36-725: Conve Optimization Fall 208 Lecturer: Ryan Tibshirani Lecture 6: October 22 Scribes: Nic Dalmasso, Alan Mishler, Benja LeRoy Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationPrimal-dual coordinate descent
Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationLecture 19 - Covariance, Conditioning
THEORY OF PROBABILITY VLADIMIR KOBZAR Review. Lecture 9 - Covariance, Conditioning Proposition. (Ross, 6.3.2) If X,..., X n are independent normal RVs with respective parameters µ i, σi 2 for i,..., n,
More informationCoordinate descent methods
Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5
More informationLecture 2 February 25th
Statistical machine learning and convex optimization 06 Lecture February 5th Lecturer: Francis Bach Scribe: Guillaume Maillard, Nicolas Brosse This lecture deals with classical methods for convex optimization.
More informationLecture 23: November 21
10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More informationRegression.
Regression www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts linear regression RMSE, MAE, and R-square logistic regression convex functions and sets
More informationNon-asymptotic bound for stochastic averaging
Non-asymptotic bound for stochastic averaging S. Gadat and F. Panloup Toulouse School of Economics PGMO Days, November, 14, 2017 I - Introduction I - 1 Motivations I - 2 Optimization I - 3 Stochastic Optimization
More informationApproximation in the Zygmund Class
Approximation in the Zygmund Class Odí Soler i Gibert Joint work with Artur Nicolau Universitat Autònoma de Barcelona New Developments in Complex Analysis and Function Theory, 02 July 2018 Approximation
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More informationLecture 8: February 9
0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we
More informationarxiv: v1 [stat.ml] 19 Feb 2016
GAP Safe Screening Rules for Sparse-Group Lasso Eugene Ndiaye, Olivier Fercoq, Alexandre Gramfort, Joseph Salmon CNRS LTCI, Télécom ParisTech, Université Paris-Saclay 46 rue Barrault, 75013, Paris, France
More informationJanuary 29, Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes Stiefel 1 / 13
Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière Hestenes Stiefel January 29, 2014 Non-linear conjugate gradient method(s): Fletcher Reeves Polak Ribière January 29, 2014 Hestenes
More informationMariusz Jurkiewicz, Bogdan Przeradzki EXISTENCE OF SOLUTIONS FOR HIGHER ORDER BVP WITH PARAMETERS VIA CRITICAL POINT THEORY
DEMONSTRATIO MATHEMATICA Vol. XLVIII No 1 215 Mariusz Jurkiewicz, Bogdan Przeradzki EXISTENCE OF SOLUTIONS FOR HIGHER ORDER BVP WITH PARAMETERS VIA CRITICAL POINT THEORY Communicated by E. Zadrzyńska Abstract.
More informationA Unified Approach to Proximal Algorithms using Bregman Distance
A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department
More informationBlock Coordinate Descent for Regularized Multi-convex Optimization
Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline
More informationMachine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, Emily Fox 2014
Case Study 3: fmri Prediction Fused LASSO LARS Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 4 th, 2014 Emily Fox 2014 1 LASSO Regression
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationEE 546, Univ of Washington, Spring Proximal mapping. introduction. review of conjugate functions. proximal mapping. Proximal mapping 6 1
EE 546, Univ of Washington, Spring 2012 6. Proximal mapping introduction review of conjugate functions proximal mapping Proximal mapping 6 1 Proximal mapping the proximal mapping (prox-operator) of a convex
More informationProximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R
More information9. Dual decomposition and dual algorithms
EE 546, Univ of Washington, Spring 2016 9. Dual decomposition and dual algorithms dual gradient ascent example: network rate control dual decomposition and the proximal gradient method examples with simple
More information