
1 Master 2 MathBigData. S. Gaïffas. November 3. CMAP - Ecole Polytechnique

2 Outline
1. Supervised learning recap: Introduction; Loss functions, linearity
2. Penalization: Introduction; Ridge; Sparsity; Lasso
3. Some tools from convex optimization: Quick recap; Proximal operators; Subdifferential, Fenchel conjugate
4. ISTA and FISTA: The general problem; Gradient descent; ISTA; FISTA; Linesearch
5. Duality gap: Fenchel duality; Duality gap

4 Supervised learning: Setting

- Data: x_i ∈ X, y_i ∈ Y for i = 1, ..., n
- x_i is an input and y_i is an output
- The x_i are called features, with X = R^d
- The y_i are called labels: Y = {−1, 1} or Y = {0, 1} for binary classification, Y = {1, ..., K} for multiclass classification, Y = R for regression

Goal: given a new x_s, predict y_s.

5 Supervised learning: Loss functions, linearity

What to do: minimize with respect to f : R^d → R the quantity

  R_n(f) = (1/n) Σ_{i=1}^n ℓ(y_i, f(x_i)),

where ℓ is a loss function: ℓ(y_i, f(x_i)) small means that y_i is close to f(x_i). R_n(f) is called the goodness-of-fit or empirical risk. Computing f̂ is called the training or estimation step.

6 Supervised learning: Loss functions, linearity

Hence:
- when d is large, it is impossible to fit a complex function f on the data
- when n is large, training is too time-consuming for a complex function f

So choose a linear function f:

  f(x) = ⟨x, θ⟩ = Σ_{j=1}^d x_j θ_j,

for some parameter vector θ ∈ R^d to be trained.

Remark: f is linear with respect to x_i, but you can choose the x_i based on the data; hence the model is not necessarily linear w.r.t. the original features.

7 Supervised learning: Loss functions, linearity

Training the model: compute

  θ̂ ∈ argmin_{θ ∈ R^d} R_n(θ),   where   R_n(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩).

Classical losses:
- ℓ(y, z) = ½ (y − z)²: least-squares loss, linear regression (label y ∈ R)
- ℓ(y, z) = (1 − yz)₊: hinge loss, or SVM loss (binary classification, label y ∈ {−1, 1})
- ℓ(y, z) = log(1 + e^{−yz}): logistic loss (binary classification, label y ∈ {−1, 1})
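
These three losses are straightforward to write in NumPy; the snippet below is an illustration added here (the function names are ours, not from the slides).

```python
import numpy as np

def least_squares_loss(y, z):
    """Least-squares loss 0.5 * (y - z)^2 (regression, y real-valued)."""
    return 0.5 * (y - z) ** 2

def hinge_loss(y, z):
    """Hinge (SVM) loss (1 - y*z)_+ for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * z)

def logistic_loss(y, z):
    """Logistic loss log(1 + exp(-y*z)) for labels y in {-1, +1}."""
    return np.logaddexp(0.0, -y * z)  # numerically stable form of log(1 + e^{-yz})
```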

8 Supervised learning: Loss functions, linearity. The three losses as functions of z: ℓ_leastsq(y, z) = ½ (y − z)², ℓ_hinge(y, z) = (1 − yz)₊, ℓ_logistic(y, z) = log(1 + e^{−yz}).

10 Penalization: Introduction

You should never actually fit a model by minimizing only

  θ̂ ∈ argmin_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩).

You should minimize instead

  θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) },

where
- pen is a penalization function that encodes a prior assumption on θ: it forbids θ from being too complex
- λ > 0 is a tuning or smoothing parameter that balances goodness-of-fit and penalization

11 Penalization: Introduction

Why use penalization in

  θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) }?

Because penalization, for a well-chosen λ > 0, allows one to avoid overfitting.

12 Penalization: Ridge

The most classical penalization is the ridge penalization

  pen(θ) = ‖θ‖₂² = Σ_{j=1}^d θ_j².

It penalizes the energy of θ, measured by the squared ℓ₂-norm.

Sparsity-inducing penalization. It would be nice to find a model where θ̂_j = 0 for many coordinates j: few features are useful for prediction, and the model is simpler, with a smaller dimension. We say that θ̂ is sparse. How to do it?

13 Penalization: Sparsity

It is tempting to use

  θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ ‖θ‖₀ },

where ‖θ‖₀ = #{j : θ_j ≠ 0}. But to do this exactly, you need to try all possible subsets of non-zero coordinates of θ: 2^d possibilities. Impossible!

14 Penalization: Lasso

A solution: the Lasso penalization (least absolute shrinkage and selection operator)

  pen(θ) = ‖θ‖₁ = Σ_{j=1}^d |θ_j|.

This is penalization based on the ℓ₁-norm.
- In a noiseless setting [compressed sensing, basis pursuit], in a certain regime, ℓ₁-minimization gives the same solution as ℓ₀-minimization
- But the Lasso-penalized problem is easy to compute

Why does ℓ₁-penalization lead to sparsity?

15 Penalization: Lasso. Why does the ℓ₂ (ridge) penalization not induce sparsity?

16 Penalization: Lasso

Hence, a minimizer

  θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ ‖θ‖₁ }

is typically sparse (θ̂_j = 0 for many j):
- for λ large (larger than some constant), θ̂_j = 0 for all j
- for λ = 0, there is no penalization
- between the two, the sparsity depends on the value of λ: once again, it is a regularization or penalization parameter

17 Penalization: Lasso

For the least-squares loss,

  θ̂ ∈ argmin_{θ ∈ R^d} { (1/2n) ‖Y − Xθ‖₂² + (λ/2) ‖θ‖₂² }

is called ridge linear regression, and

  θ̂ ∈ argmin_{θ ∈ R^d} { (1/2n) ‖Y − Xθ‖₂² + λ ‖θ‖₁ }

is called Lasso linear regression.
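
As a quick illustration (not part of the slides), both estimators are available in scikit-learn; note that scikit-learn's alpha plays the role of λ, up to the exact scaling convention used in its objective.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
theta_true = np.zeros(d)
theta_true[:5] = 3.0                      # only 5 features are truly useful
y = X @ theta_true + rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge non-zero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 50
print("lasso non-zero coefficients:", np.sum(lasso.coef_ != 0))  # typically close to 5
```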

18 Penalization: Lasso

Consider the one-dimensional minimization problem, for λ > 0 and b ∈ R,

  min_{a ∈ R} ½ (a − b)² + λ |a|.

The derivative of the objective at 0⁺ is d⁺ = λ − b, and at 0⁻ it is d⁻ = −λ − b. Let a* be the solution:
- a* = 0 iff d⁺ ≥ 0 and d⁻ ≤ 0, namely |b| ≤ λ
- a* ≥ 0 iff d⁺ ≤ 0, namely b ≥ λ, and then a* = b − λ
- a* ≤ 0 iff d⁻ ≥ 0, namely b ≤ −λ, and then a* = b + λ

Hence a* = sign(b) (|b| − λ)₊, where a₊ = max(0, a).

19 Penalization: Lasso

As a consequence, we have

  argmin_{a ∈ R^d} { ½ ‖a − b‖₂² + λ ‖a‖₁ } = S_λ(b),

where

  S_λ(b) = sign(b) (|b| − λ)₊   (applied coordinate-wise)

is the soft-thresholding operator.
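
A minimal NumPy sketch of the soft-thresholding operator, vectorized over coordinates (the function name is ours).

```python
import numpy as np

def soft_thresholding(b, lam):
    """Coordinate-wise soft-thresholding S_lam(b) = sign(b) * max(|b| - lam, 0)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Coordinates with |b_j| <= lam are set exactly to zero:
print(soft_thresholding(np.array([-3.0, -0.5, 0.2, 2.0]), lam=1.0))  # [-2. -0.  0.  1.]
```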

21 Quick recap

A function f : R^d → [−∞, +∞] is
- convex if f(t x + (1 − t) x′) ≤ t f(x) + (1 − t) f(x′) for any x, x′ ∈ R^d and t ∈ [0, 1]
- proper if it never takes the value −∞ and is not identically +∞ [note that a proper f is valued in (−∞, +∞]]
- lower-semicontinuous (l.s.c.) if and only if, for any x and any sequence x_n → x, we have f(x) ≤ lim inf_n f(x_n)

The set of convex, proper, l.s.c. functions is often denoted Γ₀(R^d) or Γ₀.

22 Proximal operator

For any convex l.s.c. g and any y ∈ R^d, we define the proximal operator

  prox_g(y) = argmin_{x ∈ R^d} { ½ ‖x − y‖₂² + g(x) }

(a strongly convex problem, hence a unique minimizer). We have seen that soft-thresholding is the proximal operator of the scaled ℓ₁-norm:

  prox_{λ‖·‖₁}(y) = S_λ(y) = sign(y) (|y| − λ)₊.

Proximal operators and proximal algorithms are now fundamental tools for optimization in machine learning.

23 Examples of proximal operators

- If g(x) = c for a constant c, then prox_g = Id
- If C is a convex set and g(x) = δ_C(x) = 0 if x ∈ C, +∞ if x ∉ C, then prox_g = proj_C, the projection onto C
- If g(x) = ⟨b, x⟩ + c, then prox_{λg}(x) = x − λb
- If g(x) = ½ x⊤Ax + ⟨b, x⟩ + c with A symmetric positive semi-definite, then prox_{λg}(x) = (I + λA)⁻¹ (x − λb)

24 Examples of proximal operators

- If g(x) = ½ ‖x‖₂², then prox_{λg}(x) = x / (1 + λ), the shrinkage operator
- If g(x) = −log x, then prox_{λg}(x) = (x + √(x² + 4λ)) / 2
- If g(x) = ‖x‖₂, then prox_{λg}(x) = (1 − λ / ‖x‖₂)₊ x, the block soft-thresholding operator

25 Examples of proximal operators

- If g(x) = ‖x‖₁ + (γ/2) ‖x‖₂² (elastic-net) with γ > 0, then prox_{λg}(x) = (1 / (1 + λγ)) prox_{λ‖·‖₁}(x)
- If g(x) = Σ_{B ∈ G} ‖x_B‖₂ where G is a partition of {1, ..., d}, then (prox_{λg}(x))_B = (1 − λ / ‖x_B‖₂)₊ x_B for B ∈ G: block soft-thresholding, used for the group Lasso
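
The proximal operators listed above translate directly into NumPy; the sketch below is an added illustration (function names and the encoding of groups as a list of index arrays are our choices).

```python
import numpy as np

def prox_l1(x, lam):
    """Soft-thresholding: prox of lam * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_elastic_net(x, lam, gamma):
    """Prox of lam * (||x||_1 + gamma/2 * ||x||_2^2)."""
    return prox_l1(x, lam) / (1.0 + lam * gamma)

def prox_group_l2(x, lam, groups):
    """Block soft-thresholding: prox of lam * sum_B ||x_B||_2 over a partition of indices."""
    out = np.zeros_like(x, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(x[idx])
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out[idx] = scale * x[idx]
    return out
```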

26 Subdifferential, Fenchel conjugate

The subdifferential of f ∈ Γ₀ at x is the set

  ∂f(x) = { g ∈ R^d : f(y) ≥ f(x) + ⟨g, y − x⟩ for all y ∈ R^d }.

Each element is called a subgradient.
- Optimality criterion: 0 ∈ ∂f(x) iff f(x) ≤ f(y) for all y
- If f is differentiable at x, then ∂f(x) = {∇f(x)}
- Example: ∂|·|(0) = [−1, 1]

27 Fenchel conjugate

The Fenchel conjugate of a function f on R^d is given by

  f*(x) = sup_{y ∈ R^d} { ⟨x, y⟩ − f(y) }.

It is always a convex function (as a supremum of continuous, affine functions). f*(x) is the smallest constant c such that the affine function y ↦ ⟨y, x⟩ − c is below f.

Fenchel-Young inequality: f(x) + f*(y) ≥ ⟨x, y⟩ for any x and y.

28 Fenchel conjugate and subgradients

Legendre-Fenchel identity: if f ∈ Γ₀, we have

  ⟨x, y⟩ = f(x) + f*(y)   ⟺   y ∈ ∂f(x)   ⟺   x ∈ ∂f*(y).

Example: the Fenchel conjugate of a norm ‖·‖ is

  ‖·‖*(x) = sup_{y ∈ R^d} { ⟨x, y⟩ − ‖y‖ } = δ_{{x′ ∈ R^d : ‖x′‖_* ≤ 1}}(x),

where ‖x‖_* = max_{y ∈ R^d : ‖y‖ ≤ 1} ⟨x, y⟩ is the dual norm of ‖·‖ [recall that the dual norm of ‖·‖_p is ‖·‖_q with 1/p + 1/q = 1].

29 Some extras

- f : R^d → R is L-smooth if it is continuously differentiable and ‖∇f(x) − ∇f(y)‖₂ ≤ L ‖x − y‖₂ for any x, y ∈ R^d. When f is twice continuously differentiable, this is equivalent to H_f(x) ⪯ L I_d for all x, where H_f(x) is the Hessian at x [i.e., L I_d − H_f(x) is positive semi-definite].
- f is μ-strongly convex if f(·) − (μ/2) ‖·‖₂² is convex. Equivalent to f(y) ≥ f(x) + ⟨g, y − x⟩ + (μ/2) ‖y − x‖₂² for any g ∈ ∂f(x). Equivalent to H_f(x) ⪰ μ I_d when f is twice differentiable.
- f is μ-strongly convex iff f* is (1/μ)-smooth.

31 The general problem we want to solve

How do we solve

  θ̂ ∈ argmin_{θ ∈ R^d} { (1/n) Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) + λ pen(θ) }?

Put for short

  f(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩)   and   g(θ) = λ pen(θ).

Assume that
- f is convex and L-smooth
- g is convex and continuous, but possibly non-smooth (for instance the ℓ₁ penalization)
- g is prox-capable: its proximal operator is not hard to compute

32 Examples

Smoothness of f:
- Least squares: ∇f(θ) = (1/n) X⊤(Xθ − Y), with L = ‖X⊤X‖_op / n
- Logistic loss: ∇f(θ) = −(1/n) Σ_{i=1}^n y_i x_i / (1 + e^{y_i ⟨x_i, θ⟩}), with L = ‖X⊤X‖_op / (4n)

Prox-capability of g: we gave explicit proximal operators for many penalizations above.
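
These gradients and smoothness constants are easy to compute numerically; the snippet below is an added sketch for the least-squares and logistic cases (function names are ours).

```python
import numpy as np

def lsq_grad(theta, X, y):
    """Gradient of f(theta) = (1/2n) * ||y - X theta||_2^2."""
    return X.T @ (X @ theta - y) / X.shape[0]

def lsq_lipschitz(X):
    """Smoothness constant L = ||X^T X||_op / n of the least-squares risk."""
    return np.linalg.norm(X.T @ X, ord=2) / X.shape[0]

def logistic_grad(theta, X, y):
    """Gradient of f(theta) = (1/n) * sum_i log(1 + exp(-y_i <x_i, theta>)), y_i in {-1, +1}."""
    margins = y * (X @ theta)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / X.shape[0]
```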

33 Gradient descent

Now, how do we minimize f + g? Key point: the descent lemma. If f is convex and L-smooth, then for any L′ ≥ L,

  f(θ′) ≤ f(θ) + ⟨∇f(θ), θ′ − θ⟩ + (L′/2) ‖θ′ − θ‖₂²

for any θ, θ′ ∈ R^d. At iteration k, the current point is θ^k. I use the descent lemma:

  f(θ) ≤ f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L/2) ‖θ − θ^k‖₂².

34 Gradient descent

Remark that

  argmin_{θ ∈ R^d} { f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L/2) ‖θ − θ^k‖₂² } = argmin_{θ ∈ R^d} ‖θ − (θ^k − (1/L) ∇f(θ^k))‖₂².

Hence, choose

  θ^{k+1} = θ^k − (1/L) ∇f(θ^k).

This is the basic gradient descent algorithm [cf. previous lecture]. Gradient descent is based on a majorization-minimization principle, with a quadratic majorant given by the descent lemma. But we forgot about g...
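
For reference, a minimal sketch of this constant-step gradient descent (the loop is ours; it assumes a gradient function and a known Lipschitz constant, as above).

```python
import numpy as np

def gradient_descent(grad_f, theta0, L, n_iter=100):
    """Gradient descent with constant step size 1/L."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iter):
        theta -= grad_f(theta) / L   # theta^{k+1} = theta^k - (1/L) grad f(theta^k)
    return theta
```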

35 ISTA

Let's put back g:

  f(θ) + g(θ) ≤ f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L/2) ‖θ − θ^k‖₂² + g(θ),

and again

  argmin_{θ ∈ R^d} { f(θ^k) + ⟨∇f(θ^k), θ − θ^k⟩ + (L/2) ‖θ − θ^k‖₂² + g(θ) }
    = argmin_{θ ∈ R^d} { (L/2) ‖θ − (θ^k − (1/L) ∇f(θ^k))‖₂² + g(θ) }
    = argmin_{θ ∈ R^d} { ½ ‖θ − (θ^k − (1/L) ∇f(θ^k))‖₂² + (1/L) g(θ) }
    = prox_{g/L}( θ^k − (1/L) ∇f(θ^k) ).

The prox operator naturally appears because of the descent lemma.

36 ISTA

Proximal gradient descent algorithm [also called ISTA]
- Input: starting point θ^0, Lipschitz constant L > 0 for ∇f
- For k = 1, 2, ... until convergence: θ^k = prox_{g/L}( θ^{k−1} − (1/L) ∇f(θ^{k−1}) )
- Return the last θ^k

Also called forward-backward splitting. For the Lasso with least-squares loss, the iteration is

  θ^k = S_{λ/L}( θ^{k−1} − (1/(nL)) (X⊤X θ^{k−1} − X⊤Y) ),

where S_λ is the soft-thresholding operator.
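
A compact NumPy sketch of ISTA for the Lasso with least-squares loss, following this iteration (the function name is ours).

```python
import numpy as np

def ista_lasso(X, y, lam, n_iter=500):
    """ISTA for min_theta (1/2n)*||y - X theta||_2^2 + lam*||theta||_1."""
    n, d = X.shape
    L = np.linalg.norm(X.T @ X, ord=2) / n                # Lipschitz constant of grad f
    theta = np.zeros(d)
    for _ in range(n_iter):
        z = theta - X.T @ (X @ theta - y) / (n * L)       # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox step: S_{lam/L}
    return theta
```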

37 ISTA

Put for short F = f + g, and take any θ* ∈ argmin_{θ ∈ R^d} F(θ).

Theorem (Beck and Teboulle, 2009). If the sequence {θ^k} is generated by ISTA, then

  F(θ^k) − F(θ*) ≤ L ‖θ^0 − θ*‖₂² / (2k).

The convergence rate is O(1/k). Is it possible to improve the O(1/k) rate?

38 FISTA

Yes! Using accelerated proximal gradient descent (called FISTA; Nesterov 83, 04, Beck and Teboulle 09). Idea: to find θ^{k+1}, use an interpolation between θ^k and θ^{k−1}.

Accelerated proximal gradient descent algorithm [FISTA]
- Input: starting points z^1 = θ^0, Lipschitz constant L > 0 for ∇f, t_1 = 1
- For k = 1, 2, ... until convergence:
    θ^k = prox_{g/L}( z^k − (1/L) ∇f(z^k) )
    t_{k+1} = (1 + √(1 + 4 t_k²)) / 2
    z^{k+1} = θ^k + ((t_k − 1) / t_{k+1}) (θ^k − θ^{k−1})
- Return the last θ^k
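
The same Lasso example as before, now with the FISTA updates above (a sketch; the function name is ours).

```python
import numpy as np

def fista_lasso(X, y, lam, n_iter=500):
    """FISTA for min_theta (1/2n)*||y - X theta||_2^2 + lam*||theta||_1."""
    n, d = X.shape
    L = np.linalg.norm(X.T @ X, ord=2) / n
    theta = np.zeros(d)
    z, t = theta.copy(), 1.0
    for _ in range(n_iter):
        w = z - X.T @ (X @ z - y) / (n * L)                             # gradient step at z^k
        theta_next = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # prox step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = theta_next + ((t - 1.0) / t_next) * (theta_next - theta)    # interpolation
        theta, t = theta_next, t_next
    return theta
```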

39 FISTA

Theorem (Beck and Teboulle, 2009). If the sequence {θ^k} is generated by FISTA, then

  F(θ^k) − F(θ*) ≤ 2L ‖θ^0 − θ*‖₂² / (k + 1)².

The convergence rate is O(1/k²). Is O(1/k²) the optimal rate in general?

40 FISTA

Yes. Put g = 0.

Theorem (Nesterov). For any optimization procedure satisfying

  θ^{k+1} ∈ θ^1 + span( ∇f(θ^1), ..., ∇f(θ^k) ),

there is a convex and L-smooth function f on R^d such that

  min_{1 ≤ j ≤ k} f(θ^j) − f(θ*) ≥ (3/32) L ‖θ^1 − θ*‖₂² / (k + 1)²

for any 1 ≤ k ≤ (d − 1)/2.

41 FISTA. Comparison of ISTA and FISTA: FISTA is not a descent algorithm, while ISTA is.

42 FISTA [Proof of convergence of FISTA on the blackboard]

43 Backtracking linesearch

What if we don't know L > 0? ‖X⊤X‖_op can be long to compute, and letting L evolve along the iterations k generally improves the convergence speed.

Backtracking linesearch. Idea:
- start from a very small Lipschitz constant L
- between iterations k and k + 1, choose the smallest L satisfying the descent lemma at z^k

44 Backtracking linesearch

At iteration k of FISTA, we have z^k and a constant L_k:
1. Put L ← L_k
2. Do an iteration θ ← prox_{g/L}( z^k − (1/L) ∇f(z^k) )
3. Check if this step satisfies the descent lemma at z^k:
     f(θ) + g(θ) ≤ f(z^k) + ⟨∇f(z^k), θ − z^k⟩ + (L/2) ‖θ − z^k‖₂² + g(θ)
4. If yes, then set θ^{k+1} ← θ, L_{k+1} ← L, and continue FISTA
5. If not, then put L ← 2L (say), and go back to step 2

The sequence L_k is non-decreasing; between iterations k and k + 1, a tweak is to decrease it a little to get (much) faster convergence.
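
A sketch of FISTA with this backtracking scheme, written generically for a smooth part f (with gradient grad_f) and a prox-capable g (all names are ours); note that the g terms cancel on both sides of the descent-lemma check.

```python
import numpy as np

def fista_backtracking(f, grad_f, prox_g, theta0, L0=1.0, n_iter=200):
    """FISTA with backtracking linesearch on L. prox_g(x, step) must return prox_{step*g}(x)."""
    theta = np.array(theta0, dtype=float)
    z, t, L = theta.copy(), 1.0, L0
    for _ in range(n_iter):
        fz, gz = f(z), grad_f(z)
        while True:
            cand = prox_g(z - gz / L, 1.0 / L)
            diff = cand - z
            if f(cand) <= fz + gz @ diff + 0.5 * L * (diff @ diff):
                break                      # descent lemma holds at z^k: accept the step
            L *= 2.0                       # otherwise increase L and retry
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = cand + ((t - 1.0) / t_next) * (cand - theta)
        theta, t = cand, t_next
    return theta
```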

46 Fenchel duality

How do we stop an iterative optimization algorithm? If F is the objective function, fix a small ε > 0 and stop when

  |F(θ^{k+1}) − F(θ^k)| / |F(θ^k)| ≤ ε   or   ‖∇F(θ^k)‖₂ ≤ ε.

An alternative is to compute the duality gap.

Fenchel duality. Consider the problem

  min_θ f(Aθ) + g(θ)

with f : R^d → R, g : R^p → R and a d × p matrix A. We have

  sup_u { −f*(u) − g*(−A⊤u) } ≤ inf_θ { f(Aθ) + g(θ) }.

Moreover, if f and g are convex, then under mild assumptions equality of both sides holds (strong duality, no duality gap).

47 Fenchel duality

Fenchel duality:

  sup_u { −f*(u) − g*(−A⊤u) } ≤ inf_θ { f(Aθ) + g(θ) }.

- The right part is the primal problem
- The left part is a dual formulation of the primal problem
- If θ* is an optimum for the primal and u* is an optimum for the dual, then −f*(u*) − g*(−A⊤u*) = f(Aθ*) + g(θ*)

When g(θ) = λ ‖θ‖ where λ > 0 and ‖·‖ is a norm, this writes

  sup_{u : ‖A⊤u‖_* ≤ λ} −f*(u) ≤ inf_θ { f(Aθ) + λ ‖θ‖ },

where ‖·‖_* is the dual norm.

48 Duality gap

If (θ*, u*) is a pair of primal/dual solutions, then u* ∈ ∂f(Aθ*), or u* = ∇f(Aθ*) if f is differentiable. Namely, at the optimum we have

  ‖A⊤ ∇f(Aθ*)‖_* ≤ λ   and   f(Aθ*) + λ ‖θ*‖ + f*(∇f(Aθ*)) = 0.

Natural stopping rule: imagine we are at iteration k of an optimization algorithm, with current primal variable θ^k. Define

  u^k = u(θ^k) = min( 1, λ / ‖A⊤ ∇f(Aθ^k)‖_* ) ∇f(Aθ^k)

and stop at iteration k when, for a given small ε > 0,

  f(Aθ^k) + λ ‖θ^k‖ + f*(u^k) ≤ ε.

49 Duality gap

Back to machine learning:

  Σ_{i=1}^n ℓ(y_i, ⟨x_i, θ⟩) = Σ_{i=1}^n ℓ(y_i, (Xθ)_i) = f(Xθ),

with f(z) = Σ_{i=1}^n ℓ(y_i, z_i) for z = [z_1 ... z_n] and X the matrix with rows x_1, ..., x_n. The gradient with respect to θ is

  ∇_θ f(Xθ) = Σ_{i=1}^n ℓ′(y_i, ⟨x_i, θ⟩) x_i,

where ℓ′(y, z) = ∂ℓ(y, z) / ∂z. For the duality gap, we need to compute f*.

50 Duality gap

- For least squares, f(z) = ½ ‖y − z‖₂², we have f*(u) = ½ ‖u‖₂² + ⟨u, y⟩
- For logistic regression, f(z) = Σ_{i=1}^n log(1 + e^{−y_i z_i}), we have

    f*(u) = Σ_{i=1}^n [ (1 + u_i y_i) log(1 + u_i y_i) − u_i y_i log(−u_i y_i) ]

  if −u_i y_i ∈ (0, 1] for any i = 1, ..., n, and f*(u) = +∞ otherwise

51 Duality gap

Example: stopping criterion for the Lasso based on the duality gap.
- Compute the residual r^k = Xθ^k − y
- Compute the dual variable u^k = min( 1, λ / ‖X⊤r^k‖_∞ ) r^k
- Stop if ½ ‖r^k‖₂² + λ ‖θ^k‖₁ + ½ ‖u^k‖₂² + ⟨u^k, y⟩ ≤ ε
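
A small NumPy sketch of this stopping test for the Lasso objective ½ ‖y − Xθ‖₂² + λ ‖θ‖₁ (the function name is ours).

```python
import numpy as np

def lasso_duality_gap(theta, X, y, lam):
    """Duality gap for min_theta 0.5*||y - X theta||_2^2 + lam*||theta||_1."""
    r = X @ theta - y                                       # residual r^k
    denom = max(np.max(np.abs(X.T @ r)), 1e-12)             # ||X^T r^k||_inf (guarded)
    u = min(1.0, lam / denom) * r                           # dual-feasible variable u^k
    primal = 0.5 * (r @ r) + lam * np.sum(np.abs(theta))
    dual = -(0.5 * (u @ u) + u @ y)                         # dual objective -f*(u)
    return primal - dual                                    # stop when this is <= eps
```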
