The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting
1 The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting

Paul Grigas, Robert Freund, and Rahul Mazumder
Massachusetts Institute of Technology
May

2 Modified Title

Original title: The Frank-Wolfe Algorithm: New Results, Connections to Statistical Boosting, and Computation

3 Problem of Interest

Our problem of interest is:
    h* := max_λ h(λ)  s.t.  λ ∈ Q
Q ⊆ E is convex and compact
h(·) : Q → R is concave and differentiable with Lipschitz gradient on Q
Assume it is easy to solve linear optimization problems on Q

4 Frank-Wolfe (FW) Method

Frank-Wolfe Method for maximizing h(λ) on Q
Initialize at λ_1 ∈ Q, (optional) initial upper bound B_0, k ← 1.
1. Compute ∇h(λ_k).
2. Compute λ̃_k ∈ arg max_{λ∈Q} {h(λ_k) + ∇h(λ_k)^T (λ − λ_k)}.
   B^w_k ← h(λ_k) + ∇h(λ_k)^T (λ̃_k − λ_k).  G_k ← ∇h(λ_k)^T (λ̃_k − λ_k).
3. (Optional: compute other upper bound B^o_k), update best bound B_k ← min{B_{k−1}, B^w_k, B^o_k}.
4. Set λ_{k+1} ← λ_k + ᾱ_k (λ̃_k − λ_k), where ᾱ_k ∈ [0, 1).
Note the condition ᾱ_k ∈ [0, 1).
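
A minimal Python sketch of the method above (my own illustration, not from the talk): the user supplies h, its gradient grad_h, a linear maximization oracle lin_max_oracle(g) returning a point of arg max_{λ∈Q} g^T λ, and a step-size rule. All names are assumptions for the sketch.

```python
import numpy as np

def frank_wolfe_max(h, grad_h, lin_max_oracle, lam, step_size, num_iters, B=np.inf):
    """Frank-Wolfe for maximizing a concave, smooth h over a compact convex set Q.

    lin_max_oracle(g) must return a point of arg max_{lam in Q} g^T lam.
    step_size(k) returns the step-size alpha_k in [0, 1).
    Tracks the Wolfe upper bound B_k^w and the Wolfe gap G_k at every iteration.
    """
    for k in range(1, num_iters + 1):
        g = grad_h(lam)                       # step 1: gradient at the current iterate
        lam_tilde = lin_max_oracle(g)         # step 2: linear optimization over Q
        G = g @ (lam_tilde - lam)             # Wolfe gap G_k
        B = min(B, h(lam) + G)                # step 3: best upper bound so far (Wolfe bound)
        alpha = step_size(k)                  # step 4: step-size in [0, 1)
        lam = lam + alpha * (lam_tilde - lam)
    return lam, B
```

For example, when Q is the ℓ1 ball of radius δ, the oracle simply picks the coordinate of g with the largest magnitude (see the LASSO slides later).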

5 Wolfe Bound

Wolfe Bound is B^w_k := h(λ_k) + ∇h(λ_k)^T (λ̃_k − λ_k) ≥ h*
Suppose h(·) arises from minmax structure: h(λ) = min_{x∈P} φ(x, λ)
P is convex and compact
φ(x, λ) : P × Q → R is convex in x and concave in λ
Define f(x) := max_{λ∈Q} φ(x, λ). Then a dual problem is min_{x∈P} f(x)
In the FW method, let B^m_k := f(x_k) where x_k ∈ arg min_{x∈P} {φ(x, λ_k)}
Then B^w_k ≥ B^m_k
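
A small numerical check (my own example, not from the slides) of the ordering B^m_k ≤ B^w_k on a toy minmax instance: φ(x, λ) = xλ − λ² with P = [1, 2] and Q = [0, 1], so h(λ) = λ − λ².

```python
import numpy as np

# Toy minmax instance (illustrative): phi(x, lam) = x*lam - lam^2, P = [1, 2], Q = [0, 1].
# Then h(lam) = min_{x in P} phi(x, lam) = lam - lam^2 for lam >= 0 (minimizer x_k = 1),
# and f(x) = max_{lam in Q} phi(x, lam) = x^2 / 4 on P.
lam_k = 0.0                                   # current iterate
x_k = 1.0                                     # arg min_x phi(x, lam_k)
grad_h = x_k - 2.0 * lam_k                    # Danskin: grad h(lam_k) = d/dlam phi(x_k, lam_k)

lam_grid = np.linspace(0.0, 1.0, 1001)        # fine grid over Q
B_w = np.max((lam_k - lam_k**2) + grad_h * (lam_grid - lam_k))    # Wolfe bound B_k^w
B_m = np.max(x_k * lam_grid - lam_grid**2)                        # minmax bound B_k^m = f(x_k)

print(B_m, B_w, B_m <= B_w)                   # 0.25  1.0  True
```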

6 Wolfe Gap

Wolfe Gap is G_k := B^w_k − h(λ_k) = ∇h(λ_k)^T (λ̃_k − λ_k)
Recall B^m_k ≤ B^w_k
Unlike generic duality gaps, the Wolfe gap is determined completely by the current iterate λ_k. It arises in:
Khachiyan's analysis of FW for rounding of polytopes,
FW for boosting methods in statistics and learning,
perhaps elsewhere...

7 Pre-start Step for FW

Pre-start Step of Frank-Wolfe method, given λ_0 ∈ Q and (optional) upper bound B_{−1}
1. Compute ∇h(λ_0).
2. Compute λ̃_0 ∈ arg max_{λ∈Q} {h(λ_0) + ∇h(λ_0)^T (λ − λ_0)}.
   B^w_0 ← h(λ_0) + ∇h(λ_0)^T (λ̃_0 − λ_0).  G_0 ← ∇h(λ_0)^T (λ̃_0 − λ_0).
3. (Optional: compute other upper bound B^o_0), update best bound B_0 ← min{B_{−1}, B^w_0, B^o_0}.
4. Set λ_1 ← λ̃_0.
This is the same as a regular FW step at λ_0 with ᾱ_0 = 1.

8 Curvature Constant C_{h,Q} and Classical Metrics

Following Clarkson, define C_{h,Q} to be the minimal value of C for which λ, λ̃ ∈ Q and α ∈ [0, 1] implies:
    h(λ + α(λ̃ − λ)) ≥ h(λ) + ∇h(λ)^T (α(λ̃ − λ)) − Cα²
Let Diam_Q := max {‖λ − λ̃‖ : λ, λ̃ ∈ Q}
Let L := L_{h,Q} be the smallest constant L for which λ, λ̃ ∈ Q implies:
    ‖∇h(λ) − ∇h(λ̃)‖* ≤ L ‖λ − λ̃‖
It is straightforward to bound C_{h,Q} ≤ ½ L_{h,Q} (Diam_Q)²

9 Auxiliary Sequences {α_k} and {β_k}

Define the following two auxiliary sequences as functions of the step-sizes {ᾱ_k} from the Frank-Wolfe method:
    β_k = 1 / ∏_{j=1}^{k−1} (1 − ᾱ_j)   and   α_k = β_k ᾱ_k / (1 − ᾱ_k),   k ≥ 1
(By convention, ∏_{j=1}^{0} (·) = 1 and ∑_{i=1}^{0} (·) = 0.)
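
A short helper (a sketch, names mine) that turns a step-size sequence {ᾱ_k} into the dual averages sequences {α_k} and {β_k} defined above; it is convenient for evaluating the bounds in the two theorems that follow.

```python
import numpy as np

def dual_averages_sequences(alpha_bar):
    """alpha_bar[k] holds the step-size for k = 1..K (index 0 is unused).

    Returns alpha, beta with beta[k] = 1 / prod_{j=1}^{k-1} (1 - alpha_bar[j])
    and alpha[k] = beta[k] * alpha_bar[k] / (1 - alpha_bar[k]); requires alpha_bar[k] < 1.
    """
    K = len(alpha_bar) - 1
    beta = np.ones(K + 2)                 # beta[1] = 1 by the empty-product convention
    alpha = np.zeros(K + 1)
    for k in range(1, K + 1):
        alpha[k] = beta[k] * alpha_bar[k] / (1.0 - alpha_bar[k])
        beta[k + 1] = beta[k] / (1.0 - alpha_bar[k])
    return alpha, beta
```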

10 Two Technical Theorems

Theorem: Consider the iterate sequences of the Frank-Wolfe method {λ_k} and {λ̃_k} and the sequence of upper bounds {B_k} on h*, using the step-size sequence {ᾱ_k}. For the auxiliary sequences {α_k} and {β_k}, and for any k ≥ 1, the following inequality holds:
    B_k − h(λ_{k+1}) ≤ (B_k − h(λ_1)) / β_{k+1} + (C_{h,Q} / β_{k+1}) ∑_{i=1}^{k} α_i² / β_{i+1}
The summation expression appears also in the dual averages method of Nesterov. This is not a coincidence; indeed it is by design, see Grigas.
We will henceforth refer to the sequences {α_k} and {β_k} as the dual averages sequences associated with the FW step-sizes.

11 Second Technical Theorem

Theorem: Consider the iterate sequences of the Frank-Wolfe method {λ_k} and {λ̃_k} and the sequence of Wolfe gaps {G_k} from Step (2.), using the step-size sequence {ᾱ_k}. For the auxiliary sequences {α_k} and {β_k}, and for any k ≥ 1 and l ≥ k + 1, the following inequality holds:
    min_{i∈{k+1,...,l}} G_i ≤ (1 / ∑_{i=k+1}^{l} ᾱ_i) [ (B_k − h(λ_1)) / β_{k+1} + (C_{h,Q} / β_{k+1}) ∑_{i=1}^{k} α_i² / β_{i+1} + C_{h,Q} ∑_{i=k+1}^{l} ᾱ_i² ]

12 A Well-Studied Step-size Sequence

Suppose we initiate the Frank-Wolfe method with the Pre-start step from a given value λ_0 ∈ Q (which by definition assigns the step-size ᾱ_0 = 1 as discussed earlier), and then use the step-sizes:
    ᾱ_i = 2 / (i + 2)  for i ≥ 0   (1)
Guarantee: Under the step-size sequence (1), the following inequalities hold for all k ≥ 1:
    B_k − h(λ_{k+1}) ≤ 4 C_{h,Q} / (k + 4)   and   min_{i∈{1,...,k}} G_i ≤ 8.7 C_{h,Q} / k

13 Simple Averaging

Suppose we initiate the Frank-Wolfe method with the Pre-start step, and then use the following step-size sequence:
    ᾱ_i = 1 / (i + 1)  for i ≥ 1   (2)
This has the property that λ_{k+1} is the simple average of λ̃_0, λ̃_1, λ̃_2, ..., λ̃_k
Guarantee: Under the step-size sequence (2), the following inequalities hold for all k ≥ 1:
    B_k − h(λ_{k+1}) ≤ C_{h,Q} (1 + ln(k + 1)) / (k + 1)   and   min_{i∈{1,...,k}} G_i ≤ 2.9 C_{h,Q} (2 + ln(k + 1)) / k

14 Constant Step-size

Given ᾱ ∈ (0, 1), suppose we initiate the Frank-Wolfe method with the Pre-start step, and then use the following step-size sequence:
    ᾱ_i = ᾱ  for i ≥ 1   (3)
(This step-size rule arises in the analysis of the Incremental Forward Stagewise Regression algorithm FS_ε)
Guarantee: Under the step-size sequence (3), the following inequality holds for all k ≥ 1:
    B_k − h(λ_{k+1}) ≤ C_{h,Q} [ (1 − ᾱ)^{k+1} + ᾱ ]

15 Constant Step-size, continued

If we decide a priori to run the Frank-Wolfe method for k iterations, then the optimized value of ᾱ in the previous guarantee is:
    ᾱ = 1 − 1 / (k + 1)^{1/k}   (4)
which yields the following:
Guarantee: For any k ≥ 1, using the constant step-size (4) for all iterations, the following inequality holds:
    B_k − h(λ_{k+1}) ≤ C_{h,Q} [1 + ln(k + 1)] / k
Furthermore, after l = 2k + 2 iterations it also holds that:
    min_{i∈{1,...,l}} G_i ≤ 2 C_{h,Q} [ln(l)] / (l − 2)
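
For concreteness, the fixed step-size rules (1)-(4) written as Python callables (my own naming) that can be plugged into a Frank-Wolfe loop such as the sketch given earlier.

```python
# Rule (1): the well-studied rule; the Pre-start step corresponds to alpha_0 = 2/(0+2) = 1.
def step_classic(i):
    return 2.0 / (i + 2)

# Rule (2): simple averaging, so lambda_{k+1} is the average of lambda_tilde_0, ..., lambda_tilde_k.
def step_simple_average(i):
    return 1.0 / (i + 1)

# Rule (3): a constant step-size alpha_bar in (0, 1).
def step_constant(alpha_bar):
    return lambda i: alpha_bar

# Rule (4): the constant optimized for a fixed budget of k iterations.
def optimized_constant(k):
    return 1.0 - 1.0 / (k + 1) ** (1.0 / k)
```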

16 On the Need for Warm Start Analysis

Recall the well-studied step-size sequence initiated with a Pre-start step from λ_0 ∈ Q:
    ᾱ_i = 2 / (i + 2)  for i ≥ 0
and computational guarantee:
    B_k − h(λ_{k+1}) ≤ 4 C_{h,Q} / (k + 4)
This guarantee is independent of λ_0
If h(λ_0) ≪ h*, this is good
But if h(λ_0) ≥ h* − ε, then this is not good
Let us see how we can take advantage of a warm start λ_0...

17 A Warm Start Step-size Rule

Let us start the Frank-Wolfe method at an initial point λ_1 and an upper bound B_0
Let C_1 be a given estimate of the curvature constant C_{h,Q}
Use the step-size sequence:
    ᾱ_i = 2 / ( 4C_1/(B_1 − h(λ_1)) + i + 1 )  for i ≥ 1   (5)
One can think of this as if the Frank-Wolfe method had run for 4C_1/(B_1 − h(λ_1)) iterations before arriving at λ_1
Guarantee: Using (5), the following inequality holds for all k ≥ 1:
    B_k − h(λ_{k+1}) ≤ 4 max{C_1, C_{h,Q}} / ( 4C_1/(B_1 − h(λ_1)) + k )

18 Warm Start Step-size Rule, continued

λ_1 ∈ Q is the initial value
C_1 is a given estimate of C_{h,Q}
Guarantee: Using (5), the following inequality holds for all k ≥ 1:
    B_k − h(λ_{k+1}) ≤ 4 max{C_1, C_{h,Q}} / ( 4C_1/(B_1 − h(λ_1)) + k )
Easy to see that C_1 = C_{h,Q} optimizes the above guarantee
If B_1 − h(λ_1) is small, the incremental decrease in the guarantee from an additional iteration is lessened. This is different from first-order methods that use prox functions and/or projections.
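
The static warm-start rule (5) as a small helper (again a sketch with names of my choosing); it needs only the curvature estimate C_1 and the initial bound gap B_1 − h(λ_1).

```python
def warm_start_step(C1, gap1):
    """Rule (5): alpha_i = 2 / (4*C1/gap1 + i + 1), where gap1 = B_1 - h(lambda_1) > 0."""
    offset = 4.0 * C1 / gap1
    return lambda i: 2.0 / (offset + i + 1)
```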

19 Dynamic Version of Warm-Start Analysis

The warm-start step-sizes:
    ᾱ_i = 2 / ( 4C_1/(B_1 − h(λ_1)) + i + 1 )  for i ≥ 1
are based on two pieces of information at λ_1: B_1 − h(λ_1), and C_1
This is a static warm-start step-size strategy
Let us see how we can improve the computational guarantee by treating every iterate as if it were the initial iterate...

20 Dynamic Version of Warm-Starts, continued

At a given iteration k of FW, we will presume that we have:
λ_k ∈ Q,
an upper bound B_{k−1} on h* (from the previous iteration), and
an estimate C_{k−1} of C_{h,Q} (also from the previous iteration)
Consider ᾱ_k of the form:
    ᾱ_k := 2 / ( 4C_k/(B_k − h(λ_k)) + 2 )   (6)
where ᾱ_k will depend explicitly on the value of C_k.
Let us now discuss how C_k is computed...

21 Updating the Estimate of the Curvature Constant

We require that C_k (and ᾱ_k, which depends explicitly on C_k) satisfy C_k ≥ C_{k−1} and:
    h(λ_k + ᾱ_k(λ̃_k − λ_k)) ≥ h(λ_k) + ᾱ_k (B_k − h(λ_k)) − C_k ᾱ_k²   (7)
We first test if C_k := C_{k−1} satisfies (7), and if so we set C_k ← C_{k−1}
If not, perform a standard doubling strategy, testing values C_k ← 2C_{k−1}, 4C_{k−1}, 8C_{k−1}, ..., until (7) is satisfied
Alternatively, say if h(·) is quadratic, C_k can be determined analytically
It will always hold that C_k ≤ max{C_0, 2C_{h,Q}}

22 Frank-Wolfe Method with Dynamic Step-sizes

FW Method with Dynamic Step-sizes for maximizing h(λ) on Q
Initialize at λ_1 ∈ Q, initial estimate C_0 of C_{h,Q}, (optional) initial upper bound B_0, k ← 1.
1. Compute ∇h(λ_k).
2. Compute λ̃_k ∈ arg max_{λ∈Q} {h(λ_k) + ∇h(λ_k)^T (λ − λ_k)}.
   B^w_k ← h(λ_k) + ∇h(λ_k)^T (λ̃_k − λ_k).  G_k ← ∇h(λ_k)^T (λ̃_k − λ_k).
3. (Optional: compute other upper bound B^o_k), update best bound B_k ← min{B_{k−1}, B^w_k, B^o_k}.
4. Compute C_k for which the following conditions hold: C_k ≥ C_{k−1}, and
   h(λ_k + ᾱ_k(λ̃_k − λ_k)) ≥ h(λ_k) + ᾱ_k (B_k − h(λ_k)) − C_k ᾱ_k², where ᾱ_k := 2 / ( 4C_k/(B_k − h(λ_k)) + 2 ).
5. Set λ_{k+1} ← λ_k + ᾱ_k (λ̃_k − λ_k), where ᾱ_k ∈ [0, 1).
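
A sketch of step 4 of the dynamic method (illustrative code, not from the talk): starting from the previous estimate C_{k−1}, double it until condition (7) holds for the step-size given by rule (6), then take the step.

```python
def dynamic_fw_step(h, lam, lam_tilde, B_k, C_prev):
    """Return (C_k, alpha_k, lam_next) with C_k >= C_prev satisfying condition (7)."""
    gap = B_k - h(lam)                             # B_k - h(lambda_k) > 0
    C = C_prev
    while True:
        alpha = 2.0 / (4.0 * C / gap + 2.0)        # rule (6), depends on the current C
        lam_next = lam + alpha * (lam_tilde - lam)
        # condition (7): sufficient increase relative to the current bound gap
        if h(lam_next) >= h(lam) + alpha * gap - C * alpha ** 2:
            return C, alpha, lam_next
        C *= 2.0                                   # doubling strategy
```

The loop terminates because condition (7) holds once C ≥ C_{h,Q} (since B_k ≤ B^w_k implies B_k − h(λ_k) ≤ G_k), which is also why the doubling strategy keeps the estimates below max{C_0, 2C_{h,Q}}.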

23 Dynamic Warm Start Step-size Rule, continued

Guarantee: Using the Frank-Wolfe method with Dynamic Step-sizes, the following inequality holds for all k ≥ 1 and l ≥ 1:
    B_{k+l} − h(λ_{k+l}) ≤ 4 C_{k+l} / ( 4C_{k+l}/(B_k − h(λ_k)) + l )
Furthermore, if the doubling strategy is used to update the estimates C_k of C_{h,Q}, it holds that C_{k+l} ≤ max{C_0, 2 C_{h,Q}}

24 Frank-Wolfe with Inexact Gradient

Consider the case when the gradient ∇h(·) is given inexactly
d'Aspremont's δ-oracle for the gradient: given λ̄ ∈ Q, the δ-oracle returns g_δ(λ̄) that satisfies:
    (∇h(λ̄) − g_δ(λ̄))^T (λ̂ − λ) ≤ δ  for all λ̂, λ ∈ Q
Devolder/Glineur/Nesterov's (δ, L)-oracle for the function value and the gradient: given λ̄ ∈ Q, the (δ, L)-oracle returns h_{(δ,L)}(λ̄) and g_{(δ,L)}(λ̄) that satisfy, for all λ ∈ Q:
    h(λ) ≤ h_{(δ,L)}(λ̄) + g_{(δ,L)}(λ̄)^T (λ − λ̄)
and
    h(λ) ≥ h_{(δ,L)}(λ̄) + g_{(δ,L)}(λ̄)^T (λ − λ̄) − (L/2) ‖λ − λ̄‖² − δ

25 Frank-Wolfe with d'Aspremont δ-oracle for Gradient

Suppose that we use a δ-oracle for the gradient. Then the main technical theorem is amended as follows:
Theorem: Consider the iterate sequences of the Frank-Wolfe method {λ_k} and {λ̃_k} and the sequence of upper bounds {B_k} on h*, using the step-size sequence {ᾱ_k}. For the auxiliary sequences {α_k} and {β_k}, and for any k ≥ 1, the following inequality holds:
    B_k − h(λ_{k+1}) ≤ (B_k − h(λ_1)) / β_{k+1} + (C_{h,Q} / β_{k+1}) ∑_{i=1}^{k} α_i² / β_{i+1} + 2δ
The error δ does not accumulate
All bounds presented earlier are amended by 2δ

26 Frank-Wolfe with DGN (δ, L)-oracle for Function and Gradient

Suppose that we use a (δ, L)-oracle for the function and gradient. Then the errors accumulate.

Method                  | Guarantee without Errors | Guarantee with Errors | Sparsity of Iterates
Frank-Wolfe             | O(1/k)                   | O(1/k) + O(δk)        | YES
Prox Gradient           | O(1/k)                   | O(1/k) + O(δ)         | NO
Accelerated Prox Grad.  | O(1/k²)                  | O(1/k²) + O(δk)       | NO

No method dominates all three criteria.

27 Applications to Statistical Boosting

We consider two applications in statistical boosting:
1. Linear Regression
2. Binary Classification / Supervised Learning

28 Linear Regression

Consider linear regression: y = Xβ + e
y ∈ R^n is the response, β ∈ R^p are the coefficients, X ∈ R^{n×p} is the model matrix, and e ∈ R^n are the errors
In the high-dimensional statistical regime, especially with p ≫ n, we desire:
good predictive performance,
good performance on samples (residuals r := y − Xβ are small),
an interpretable model (β coefficients are not excessively large, ‖β‖ ≤ δ), and
a sparse solution (β has few non-zero coefficients)

29 LASSO

The form of the LASSO we consider is:
    L*_δ = min_β L(β) := ½ ‖y − Xβ‖₂²  s.t.  ‖β‖₁ ≤ δ
δ > 0 is the regularization parameter
As is well-known, the ℓ₁-norm regularizer induces sparse solutions
Let ‖β‖₀ denote the number of non-zero coefficients of the vector β

30 LASSO with Frank-Wolfe

Consider using Frank-Wolfe to solve the LASSO:
    L*_δ = min_β L(β) := ½ ‖y − Xβ‖₂²  s.t.  ‖β‖₁ ≤ δ
The linear optimization subproblem min_{‖β‖₁ ≤ δ} c^T β is trivial to solve for any c:
let j* ∈ arg max_{j∈{1,...,p}} |c_j|; then −δ sgn(c_{j*}) e_{j*} ∈ arg min_{‖β‖₁ ≤ δ} c^T β
extreme points of the feasible region are ±δe_1, ..., ±δe_p
therefore ‖β^{k+1}‖₀ ≤ ‖β^k‖₀ + 1

31 Frank-Wolfe for LASSO

Adaptation of Frank-Wolfe Method for LASSO
Initialize at β^0 with ‖β^0‖₁ ≤ δ. At iteration k:
1. Compute:
   r^k ← y − Xβ^k
   j_k ∈ arg max_{j∈{1,...,p}} |(r^k)^T X_j|
2. Set:
   β^{k+1}_{j_k} ← (1 − ᾱ_k) β^k_{j_k} + ᾱ_k δ sgn((r^k)^T X_{j_k})
   β^{k+1}_j ← (1 − ᾱ_k) β^k_j  for j ≠ j_k,
   where ᾱ_k ∈ [0, 1]
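
A direct NumPy sketch of the adaptation above (variable names are mine), using the 2/(k+2) rule as the default step-size.

```python
import numpy as np

def frank_wolfe_lasso(X, y, delta, num_iters, step_size=lambda k: 2.0 / (k + 2)):
    """Frank-Wolfe for min 0.5 * ||y - X beta||_2^2  s.t.  ||beta||_1 <= delta.

    Each step moves toward a single signed, scaled coordinate vector, so the
    iterate gains at most one new non-zero coordinate per iteration.
    """
    n, p = X.shape
    beta = np.zeros(p)                       # beta^0 = 0 satisfies ||beta^0||_1 <= delta
    for k in range(num_iters):
        r = y - X @ beta                     # residuals r^k
        corr = X.T @ r                       # (r^k)^T X_j for every column j
        j = np.argmax(np.abs(corr))          # j_k
        alpha = step_size(k)
        beta *= (1.0 - alpha)                # shrink every coordinate by (1 - alpha_k)
        beta[j] += alpha * delta * np.sign(corr[j])
    return beta
```

For instance, beta = frank_wolfe_lasso(X, y, delta=1.0, num_iters=50) returns an iterate with ‖β‖₁ ≤ δ and at most 50 non-zero entries.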

32 Computational Guarantees for Frank-Wolfe on LASSO

Guarantees: Suppose that we use the Frank-Wolfe Method to solve the LASSO problem, with either the fixed step-size rule ᾱ_i = 2/(i+2) or a line-search to determine ᾱ_i, for i ≥ 0. Then after k iterations, there exists an i ∈ {0,...,k} satisfying:
    ½ ‖y − Xβ^i‖₂² − L*_δ ≤ 17.4 ‖X‖_{1,2}² δ² / k
    ‖X^T(y − Xβ^i)‖_∞ ≤ (1/(2δ)) ‖Xβ_LS‖₂² + 17.4 ‖X‖_{1,2}² δ / k
    ‖β^i‖₀ ≤ k
    ‖β^i‖₁ ≤ δ
where β_LS is the least-squares solution, and so ‖Xβ_LS‖₂ ≤ ‖y‖₂

33 Frank-Wolfe on LASSO, and FS_ε

There are some structural connections between Frank-Wolfe on LASSO and Incremental Forward Stagewise Regression (FS_ε)
There are even more connections in the case of Frank-Wolfe with constant step-size

34 Binary Classification / Supervised Learning

The set-up of the binary classification boosting problem consists of:
Data/training examples (x_1, y_1), ..., (x_m, y_m), where each x_i ∈ X and each y_i ∈ {−1, 1}
A set of n base classifiers H = {h_1(·), ..., h_n(·)}, where each h_j(·) : X → [−1, 1]
h_j(x_i) is classifier j's score of example x_i
y_i h_j(x_i) > 0 if and only if classifier j correctly classifies example x_i
We would like to construct a classifier H = λ_1 h_1(·) + ... + λ_n h_n(·) that performs significantly better than any individual classifier in H

35 Binary Classification, continued

We construct a new classifier H = H_λ := λ_1 h_1(·) + ... + λ_n h_n(·) from nonnegative multipliers λ ∈ R^n_+
    H(x) = H_λ(x) := λ_1 h_1(x) + ... + λ_n h_n(x)
(associate sgn(H) with H for ease of notation)
In the high-dimensional case where n ≫ m ≫ 0, we desire:
good predictive performance,
good performance on the training data (y_i H_λ(x_i) > 0 for all examples x_i),
a good interpretable model (λ coefficients are not excessively large, ‖λ‖ ≤ δ), and
λ has few non-zero coefficients (‖λ‖₀ is small)

36 Weak Learner

Form the matrix A ∈ R^{m×n} by A_ij = y_i h_j(x_i)
A_ij > 0 if and only if classifier h_j(·) correctly classifies example x_i
We suppose we have a weak learner W(·) that, for any distribution w on the examples (w ∈ Δ_m := {w ∈ R^m : e^T w = 1, w ≥ 0}), returns the base classifier h_{j*}(·) in H that does best on the weighted examples determined by w:
    ∑_{i=1}^m w_i y_i h_{j*}(x_i) = max_{j=1,...,n} ∑_{i=1}^m w_i y_i h_j(x_i) = max_{j=1,...,n} w^T A_j

37 Performance Objectives: Maximize the Margin

The margin of the classifier H_λ is defined by:
    p(λ) := min_{i∈{1,...,m}} y_i H_λ(x_i) = min_{i∈{1,...,m}} (Aλ)_i
One can then solve:
    p* = max_λ p(λ)  s.t.  λ ∈ Δ_n

38 Performance Objectives: Minimize the Exponential Loss

The (logarithm of the) exponential loss of the classifier H_λ is defined by:
    L_exp(λ) := ln( (1/m) ∑_{i=1}^m exp(−(Aλ)_i) )
One can then solve:
    L*_{exp,δ} = min_λ L_exp(λ)  s.t.  ‖λ‖₁ ≤ δ, λ ≥ 0

39 Minimizing Exponential Loss with Frank-Wolfe

Consider using Frank-Wolfe to minimize the exponential loss problem:
    L*_{exp,δ} = min_λ L_exp(λ)  s.t.  ‖λ‖₁ ≤ δ, λ ≥ 0
The linear optimization subproblem min_{‖λ‖₁ ≤ δ, λ ≥ 0} c^T λ is trivial to solve for any c:
0 is a solution if c ≥ 0; else let j* ∈ arg min_{j∈{1,...,n}} c_j, then δ e_{j*} ∈ arg min_{‖λ‖₁ ≤ δ, λ ≥ 0} c^T λ
extreme points of the feasible region are 0, δe_1, ..., δe_n
therefore ‖λ^{k+1}‖₀ ≤ ‖λ^k‖₀ + 1

40 Frank-Wolfe for Exponential Loss Minimization

Adaptation of Frank-Wolfe Method for Minimizing Exponential Loss
Initialize at λ^0 ≥ 0 with ‖λ^0‖₁ ≤ δ. Set w^0_i = exp(−(Aλ^0)_i) / ∑_{l=1}^m exp(−(Aλ^0)_l), i = 1, ..., m. Set k = 0. At iteration k:
1. Compute j_k ← W(w^k)
2. Choose ᾱ_k ∈ [0, 1] and set:
   λ^{k+1}_{j_k} ← (1 − ᾱ_k) λ^k_{j_k} + ᾱ_k δ
   λ^{k+1}_j ← (1 − ᾱ_k) λ^k_j  for j ≠ j_k
   w^{k+1}_i ← (w^k_i)^{1−ᾱ_k} exp(−ᾱ_k δ A_{i,j_k}), i = 1, ..., m, and re-normalize w^{k+1} so that e^T w^{k+1} = 1
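
A NumPy sketch of the adaptation above (names are mine), where the weak learner W(w) is implemented directly as arg max_j w^T A_j:

```python
import numpy as np

def frank_wolfe_exp_loss(A, delta, num_iters, step_size=lambda k: 2.0 / (k + 2)):
    """Frank-Wolfe for min L_exp(lambda)  s.t.  ||lambda||_1 <= delta, lambda >= 0,
    where A[i, j] = y_i * h_j(x_i) and the weak learner is W(w) = argmax_j w^T A_j.
    """
    m, n = A.shape
    lam = np.zeros(n)                              # lambda^0 = 0 is feasible
    w = np.exp(-(A @ lam))
    w /= w.sum()                                   # w_i^0 proportional to exp(-(A lambda^0)_i)
    for k in range(num_iters):
        j = np.argmax(w @ A)                       # step 1: weak learner W(w^k)
        alpha = step_size(k)                       # choose alpha_k in [0, 1]
        lam *= (1.0 - alpha)                       # step 2: move toward the vertex delta * e_j
        lam[j] += alpha * delta
        w = w ** (1.0 - alpha) * np.exp(-alpha * delta * A[:, j])
        w /= w.sum()                               # re-normalize so that e^T w = 1
    return lam
```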

41 Computational Guarantees for FW for Binary Classification

Guarantees: Suppose that we use the Frank-Wolfe Method to solve the exponential loss minimization problem, with either the fixed step-size rule ᾱ_i = 2/(i+2) or a line-search to determine ᾱ_i, using ᾱ_0 = 1. Then after k iterations:
    L_exp(λ^k) − L*_{exp,δ} ≤ 8δ² / (k + 3)
    p* − p(λ̄^k) ≤ 8δ / k + ln(m) / δ
    ‖λ^k‖₀ ≤ k
    ‖λ^k‖₁ ≤ δ
where λ̄^k is the normalization of λ^k, namely λ̄^k = λ^k / (e^T λ^k)

42 Frank-Wolfe for Binary Classification, and AdaBoost

There are some structural connections between Frank-Wolfe for binary classification and AdaBoost
L_exp(λ) is a (Nesterov-)smoothing of the margin using smoothing parameter μ = 1:
write the margin as
    p(λ) := min_{i∈{1,...,m}} (Aλ)_i = min_{w∈Δ_m} w^T Aλ
do μ-smoothing of the margin using the entropy function e(w):
    p_μ(λ) := min_{w∈Δ_m} (w^T Aλ + μ e(w))
then p_μ(λ) = −μ L_exp(λ/μ), and it easily follows that
    (1/μ) p(λ) ≤ −L_exp(λ/μ) ≤ (1/μ) p(λ) + ln(m)
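
A tiny numerical sanity check (my own, not from the slides) of the sandwich inequality above on a random matrix A with entries in [−1, 1], using μ = 1/4:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 10
A = rng.uniform(-1.0, 1.0, size=(m, n))      # A_ij = y_i h_j(x_i) lies in [-1, 1]
lam = rng.uniform(0.0, 1.0, size=n)
mu = 0.25

def L_exp(lam):
    return np.log(np.mean(np.exp(-(A @ lam))))   # ln( (1/m) sum_i exp(-(A lam)_i) )

p = np.min(A @ lam)                           # margin p(lambda)
print(p / mu <= -L_exp(lam / mu) <= p / mu + np.log(m))   # True
```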

43 Comments

Computational evaluation of Frank-Wolfe for boosting:
LASSO
binary classification
Structural connections in G-F-M:
AdaBoost is equivalent to Mirror Descent for maximizing the margin
Incremental Forward Stagewise Regression (FS_ε) is equivalent to subgradient descent for minimizing ‖X^T r‖_∞
