The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting
1 The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting
Paul Grigas, Robert Freund, and Rahul Mazumder
Massachusetts Institute of Technology
May
2 Modified Title
Original title: The Frank-Wolfe Algorithm: New Results, Connections to Statistical Boosting, and Computation
3 Problem of Interest
Our problem of interest is:
h* := max_{λ ∈ Q} h(λ)
where:
- Q ⊆ E is convex and compact,
- h(·): Q → R is concave and differentiable with Lipschitz gradient on Q, and
- we assume it is easy to solve linear optimization problems on Q.
4 Frank-Wolfe (FW) Method
Frank-Wolfe Method for maximizing h(λ) on Q
Initialize at λ_1 ∈ Q, (optional) initial upper bound B_0, k ← 1.
1. Compute ∇h(λ_k).
2. Compute λ̃_k ← argmax_{λ ∈ Q} { h(λ_k) + ∇h(λ_k)^T (λ − λ_k) }.
   B^w_k ← h(λ_k) + ∇h(λ_k)^T (λ̃_k − λ_k); G_k ← ∇h(λ_k)^T (λ̃_k − λ_k).
3. (Optional: compute other upper bound B^o_k.) Update best bound B_k ← min{B_{k−1}, B^w_k, B^o_k}.
4. Set λ_{k+1} ← λ_k + ᾱ_k (λ̃_k − λ_k), where ᾱ_k ∈ [0, 1).
Note the condition ᾱ_k ∈ [0, 1).
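The four steps above can be sketched in a few lines of code. This is an illustrative sketch only: the toy objective, the simplex feasible region standing in for Q, and all function names are our assumptions, not part of the talk.

```python
import numpy as np

def frank_wolfe(h, grad_h, lin_opt, lam, iters, step_size):
    """Sketch of the FW method for maximizing a concave h over Q.

    lin_opt(g) is the linear optimization oracle returning
    argmax_{lam in Q} g @ lam; step_size(k) returns alpha_bar_k in [0, 1).
    """
    B = np.inf                                    # best upper bound B_k
    for k in range(1, iters + 1):
        g = grad_h(lam)                           # step 1: gradient
        lam_t = lin_opt(g)                        # step 2: FW vertex
        Bw = h(lam) + g @ (lam_t - lam)           # Wolfe bound B^w_k
        G = g @ (lam_t - lam)                     # Wolfe gap G_k
        B = min(B, Bw)                            # step 3: best bound
        lam = lam + step_size(k) * (lam_t - lam)  # step 4: convex step
    return lam, B, G

# Toy instance (assumed): maximize h(lam) = -0.5||lam - c||^2 over the unit
# simplex, whose linear oracle returns the best standard basis vector.
c = np.array([0.6, 0.4])
h = lambda lam: -0.5 * np.sum((lam - c) ** 2)
grad_h = lambda lam: c - lam
lin_opt = lambda g: np.eye(2)[np.argmax(g)]
lam, B, G = frank_wolfe(h, grad_h, lin_opt, np.array([1.0, 0.0]),
                        2000, lambda k: 2.0 / (k + 2.0))
```

Here c lies in the simplex, so h* = 0; the iterate approaches c and the best bound B shrinks toward h* from above.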
5 Wolfe Bound
The Wolfe Bound is B^w_k := h(λ_k) + ∇h(λ_k)^T (λ̃_k − λ_k) ≥ h*.
Suppose h(·) arises from minmax structure: h(λ) = min_{x ∈ P} φ(x, λ), where:
- P is convex and compact, and
- φ(x, λ): P × Q → R is convex in x and concave in λ.
Define f(x) := max_{λ ∈ Q} φ(x, λ). Then a dual problem is min_{x ∈ P} f(x).
In the FW method, let B^m_k := f(x_k), where x_k ∈ argmin_{x ∈ P} φ(x, λ_k). Then B^w_k ≤ B^m_k.
6 Wolfe Gap
The Wolfe Gap is G_k := B^w_k − h(λ_k) = ∇h(λ_k)^T (λ̃_k − λ_k).
Recall B^m_k ≥ B^w_k.
Unlike generic duality gaps, the Wolfe gap is determined completely by the current iterate λ_k. It arises in:
- Khachiyan's analysis of FW for rounding of polytopes,
- FW for boosting methods in statistics and learning,
- perhaps elsewhere...
7 Pre-start Step for FW
Pre-start step of the Frank-Wolfe method, given λ_0 ∈ Q and (optional) upper bound B_{−1}:
1. Compute ∇h(λ_0).
2. Compute λ̃_0 ← argmax_{λ ∈ Q} { h(λ_0) + ∇h(λ_0)^T (λ − λ_0) }.
   B^w_0 ← h(λ_0) + ∇h(λ_0)^T (λ̃_0 − λ_0); G_0 ← ∇h(λ_0)^T (λ̃_0 − λ_0).
3. (Optional: compute other upper bound B^o_0.) Update best bound B_0 ← min{B_{−1}, B^w_0, B^o_0}.
4. Set λ_1 ← λ̃_0.
This is the same as a regular FW step at λ_0 with ᾱ_0 = 1.
8 Curvature Constant C_{h,Q} and Classical Metrics
Following Clarkson, define C_{h,Q} to be the minimal value of C for which λ, λ̃ ∈ Q and α ∈ [0, 1] imply:
h(λ + α(λ̃ − λ)) ≥ h(λ) + ∇h(λ)^T (α(λ̃ − λ)) − Cα²
Let Diam_Q := max{ ‖λ − λ̃‖ : λ, λ̃ ∈ Q }.
Let L := L_{h,Q} be the smallest constant L for which λ, λ̃ ∈ Q implies:
‖∇h(λ) − ∇h(λ̃)‖* ≤ L ‖λ − λ̃‖
It is straightforward to bound C_{h,Q} ≤ (1/2) L_{h,Q} (Diam_Q)².
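The bound C_{h,Q} ≤ (1/2) L (Diam_Q)² can be sanity-checked numerically. The concave quadratic, the box feasible region, and all names below are our illustrative assumptions:

```python
import numpy as np

# Check the curvature inequality for h(lam) = -0.5 lam^T M lam
# on the box Q = [-1, 1]^2 (an assumed toy instance).
M = np.array([[2.0, 0.0], [0.0, 1.0]])
h = lambda lam: -0.5 * lam @ M @ lam
grad_h = lambda lam: -M @ lam

L = np.linalg.norm(M, 2)                      # Lipschitz constant of grad h
diam = np.linalg.norm(np.array([2.0, 2.0]))   # l2 diameter of the box
C_bound = 0.5 * L * diam ** 2                 # C_{h,Q} <= L * Diam^2 / 2

rng = np.random.default_rng(0)
for _ in range(1000):
    lam = rng.uniform(-1, 1, 2)
    lam_t = rng.uniform(-1, 1, 2)
    a = rng.uniform(0, 1)
    step = lam + a * (lam_t - lam)
    lower = h(lam) + grad_h(lam) @ (a * (lam_t - lam)) - C_bound * a ** 2
    assert h(step) >= lower - 1e-12           # curvature inequality holds
```

For a quadratic the check follows directly from a second-order expansion, so every random draw must satisfy the inequality.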
9 Auxiliary Sequences {α_k} and {β_k}
Define the following two auxiliary sequences as functions of the step-sizes {ᾱ_k} from the Frank-Wolfe method:
β_k = 1 / ∏_{j=1}^{k−1} (1 − ᾱ_j)  and  α_k = β_k ᾱ_k / (1 − ᾱ_k), for k ≥ 1
(By convention, an empty product is 1 and an empty sum is 0.)
10 Two Technical Theorems
Theorem. Consider the iterate sequences of the Frank-Wolfe method {λ_k} and {λ̃_k} and the sequence of upper bounds {B_k} on h*, using the step-size sequence {ᾱ_k}. For the auxiliary sequences {α_k} and {β_k}, and for any k ≥ 1, the following inequality holds:
B_k − h(λ_{k+1}) ≤ (B_k − h(λ_1)) / β_{k+1} + (C_{h,Q} / β_{k+1}) Σ_{i=1}^k α_i² / β_{i+1}
The summation expression appears also in the dual averages method of Nesterov. This is not a coincidence; indeed, it is by design, see Grigas. We will henceforth refer to the sequences {α_k} and {β_k} as the dual averages sequences associated with the FW step-sizes.
11 Second Technical Theorem
Theorem. Consider the iterate sequences of the Frank-Wolfe method {λ_k} and {λ̃_k} and the sequence of Wolfe gaps {G_k} from Step (2.), using the step-size sequence {ᾱ_k}. For the auxiliary sequences {α_k} and {β_k}, and for any k ≥ 1 and l ≥ k + 1, the following inequality holds:
min_{i ∈ {k+1,…,l}} G_i ≤ (1 / Σ_{i=k+1}^l ᾱ_i) [ (B_k − h(λ_1)) / β_{k+1} + (C_{h,Q} / β_{k+1}) Σ_{i=1}^k α_i² / β_{i+1} + C_{h,Q} Σ_{i=k+1}^l ᾱ_i² ]
12 A Well-Studied Step-size Sequence
Suppose we initiate the Frank-Wolfe method with the pre-start step from a given value λ_0 ∈ Q (which by definition assigns the step-size ᾱ_0 = 1, as discussed earlier), and then use the step-sizes:
ᾱ_i = 2 / (i + 2) for i ≥ 0   (1)
Guarantee. Under the step-size sequence (1), the following inequalities hold for all k ≥ 1:
B_k − h(λ_{k+1}) ≤ 4 C_{h,Q} / (k + 4)  and  min_{i ∈ {1,…,k}} G_i ≤ 8.7 C_{h,Q} / k
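For this step-size rule the dual averages sequences defined earlier have simple closed forms, β_k = k(k+1)/2 and α_k = k+1, which a short computation confirms (the function name is ours):

```python
def dual_averages_sequences(step_sizes):
    """Compute the auxiliary sequences {beta_k} and {alpha_k} from the
    FW step-sizes {alpha_bar_k} (1-indexed)."""
    betas, alphas = [], []
    prod = 1.0
    for ab in step_sizes:
        beta_k = 1.0 / prod            # beta_k = 1 / prod_{j<k} (1 - ab_j)
        betas.append(beta_k)
        alphas.append(beta_k * ab / (1.0 - ab))
        prod *= (1.0 - ab)
    return betas, alphas

# Specialize to the well-studied rule alpha_bar_i = 2/(i+2), i = 1, 2, ...
step_sizes = [2.0 / (i + 2.0) for i in range(1, 50)]
betas, alphas = dual_averages_sequences(step_sizes)
```

For example, at k = 5 this gives β_5 = 5·6/2 = 15 and α_5 = 6, matching the closed forms.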
13 Simple Averaging
Suppose we initiate the Frank-Wolfe method with the pre-start step, and then use the following step-size sequence:
ᾱ_i = 1 / (i + 1) for i ≥ 1   (2)
This has the property that λ_{k+1} is the simple average of λ̃_0, λ̃_1, λ̃_2, …, λ̃_k.
Guarantee. Under the step-size sequence (2), the following inequalities hold for all k ≥ 1:
B_k − h(λ_{k+1}) ≤ C_{h,Q} (1 + ln(k + 1)) / (k + 1)  and  min_{i ∈ {1,…,k}} G_i ≤ 2.9 C_{h,Q} (2 + ln(k + 1)) / k
14 Constant Step-size
Given ᾱ ∈ (0, 1), suppose we initiate the Frank-Wolfe method with the pre-start step, and then use the following step-size sequence:
ᾱ_i = ᾱ for i ≥ 1   (3)
(This step-size rule arises in the analysis of the Incremental Forward Stagewise Regression algorithm FS_ε.)
Guarantee. Under the step-size sequence (3), the following inequality holds for all k ≥ 1:
B_k − h(λ_{k+1}) ≤ C_{h,Q} [ (1 − ᾱ)^{k+1} + ᾱ ]
15 Constant Step-size, continued
If we decide a priori to run the Frank-Wolfe method for k iterations, then the optimized value of ᾱ in the previous guarantee is:
ᾱ = 1 − (1 / (k + 1))^{1/k}   (4)
which yields the following:
Guarantee. For any k ≥ 1, using the constant step-size (4) for all iterations, the following inequality holds:
B_k − h(λ_{k+1}) ≤ C_{h,Q} [1 + ln(k + 1)] / k
Furthermore, after l = 2k + 2 iterations it also holds that:
min_{i ∈ {1,…,l}} G_i ≤ 2 C_{h,Q} [1 + ln(l)] / (l − 2)
16 On the Need for Warm Start Analysis
Recall the well-studied step-size sequence initiated with a pre-start step from λ_0 ∈ Q:
ᾱ_i = 2 / (i + 2) for i ≥ 0
and computational guarantee:
B_k − h(λ_{k+1}) ≤ 4 C_{h,Q} / (k + 4)
This guarantee is independent of λ_0.
- If h(λ_0) ≪ h*, this is good.
- But if h(λ_0) ≥ h* − ε, then this is not good.
Let us see how we can take advantage of a warm start λ_0.
17 A Warm Start Step-size Rule
Let us start the Frank-Wolfe method at an initial point λ_1 and an upper bound B_0. Let C_1 be a given estimate of the curvature constant C_{h,Q}. Use the step-size sequence:
ᾱ_i = 2 / ( 4C_1 / (B_1 − h(λ_1)) + i + 1 ) for i ≥ 1   (5)
One can think of this as if the Frank-Wolfe method had run for 4C_1 / (B_1 − h(λ_1)) iterations before arriving at λ_1.
Guarantee. Using (5), the following inequality holds for all k ≥ 1:
B_k − h(λ_{k+1}) ≤ 4 max{C_1, C_{h,Q}} / ( 4C_1 / (B_1 − h(λ_1)) + k )
18 Warm Start Step-size Rule, continued
λ_1 ∈ Q is the initial value; C_1 is a given estimate of C_{h,Q}.
Guarantee. Using (5), the following inequality holds for all k ≥ 1:
B_k − h(λ_{k+1}) ≤ 4 max{C_1, C_{h,Q}} / ( 4C_1 / (B_1 − h(λ_1)) + k )
It is easy to see that C_1 ← C_{h,Q} optimizes the above guarantee.
If B_1 − h(λ_1) is small, the incremental decrease in the guarantee from an additional iteration is lessened. This is different from first-order methods that use prox functions and/or projections.
19 Dynamic Version of Warm-Start Analysis
The warm-start step-sizes:
ᾱ_i = 2 / ( 4C_1 / (B_1 − h(λ_1)) + i + 1 ) for i ≥ 1
are based on two pieces of information at λ_1: B_1 − h(λ_1), and C_1.
This is a static warm-start step-size strategy.
Let us see how we can improve the computational guarantee by treating every iterate as if it were the initial iterate...
20 Dynamic Version of Warm-Starts, continued
At a given iteration k of FW, we presume that we have:
- λ_k ∈ Q,
- an upper bound B_{k−1} on h* (from the previous iteration), and
- an estimate C_{k−1} of C_{h,Q} (also from the previous iteration).
Consider ᾱ_k of the form:
ᾱ_k := 2 / ( 4C_k / (B_k − h(λ_k)) + 2 )   (6)
where ᾱ_k will depend explicitly on the value of C_k. Let us now discuss how C_k is computed...
21 Updating the Estimate of the Curvature Constant
We require that C_k (and ᾱ_k, which depends explicitly on C_k) satisfy C_k ≥ C_{k−1} and:
h(λ_k + ᾱ_k (λ̃_k − λ_k)) ≥ h(λ_k) + ᾱ_k (B_k − h(λ_k)) − C_k ᾱ_k²   (7)
We first test whether C_k := C_{k−1} satisfies (7), and if so we set C_k ← C_{k−1}.
If not, we perform a standard doubling strategy, testing the values C_k ← 2C_{k−1}, 4C_{k−1}, 8C_{k−1}, …, until (7) is satisfied.
Alternatively, say if h(·) is quadratic, C_k can be determined analytically.
It will always hold that C_k ≤ max{C_0, 2C_{h,Q}}.
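The doubling strategy above can be sketched directly from condition (7) and the step-size formula (6). The toy objective and all names are assumptions for illustration:

```python
import numpy as np

def update_curvature_estimate(h, lam, lam_t, B_k, C_prev):
    """Doubling strategy for the curvature estimate C_k: find the smallest
    C in {C_prev, 2 C_prev, 4 C_prev, ...} such that condition (7) holds
    with alpha_k = 2 / (4C/(B_k - h(lam)) + 2)."""
    gap = B_k - h(lam)
    C = C_prev
    while True:
        alpha = 2.0 / (4.0 * C / gap + 2.0)
        new_val = h(lam + alpha * (lam_t - lam))
        if new_val >= h(lam) + alpha * gap - C * alpha ** 2:   # condition (7)
            return C, alpha
        C *= 2.0                                               # double and retry

# Toy demo (assumed): h(lam) = -0.5||lam - c||^2, with B_k = 0
# (a valid upper bound here since h* = 0), starting from a small C_prev.
c = np.array([0.6, 0.4])
h = lambda lam: -0.5 * np.sum((lam - c) ** 2)
C_k, alpha_k = update_curvature_estimate(h, np.array([1.0, 0.0]),
                                         np.array([0.0, 1.0]), 0.0, 0.015)
```

Starting from a deliberately small estimate, the loop doubles twice before (7) is satisfied, illustrating that an underestimate of C_{h,Q} is corrected automatically.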
22 Frank-Wolfe Method with Dynamic Step-sizes
FW Method with Dynamic Step-sizes for maximizing h(λ) on Q
Initialize at λ_1 ∈ Q, initial estimate C_0 of C_{h,Q}, (optional) initial upper bound B_0, k ← 1.
1. Compute ∇h(λ_k).
2. Compute λ̃_k ← argmax_{λ ∈ Q} { h(λ_k) + ∇h(λ_k)^T (λ − λ_k) }.
   B^w_k ← h(λ_k) + ∇h(λ_k)^T (λ̃_k − λ_k); G_k ← ∇h(λ_k)^T (λ̃_k − λ_k).
3. (Optional: compute other upper bound B^o_k.) Update best bound B_k ← min{B_{k−1}, B^w_k, B^o_k}.
4. Compute C_k for which the following conditions hold: C_k ≥ C_{k−1}, and
   h(λ_k + ᾱ_k (λ̃_k − λ_k)) ≥ h(λ_k) + ᾱ_k (B_k − h(λ_k)) − C_k ᾱ_k², where ᾱ_k := 2 / ( 4C_k / (B_k − h(λ_k)) + 2 ).
5. Set λ_{k+1} ← λ_k + ᾱ_k (λ̃_k − λ_k), where ᾱ_k ∈ [0, 1).
23 Dynamic Warm Start Step-size Rule, continued
Guarantee. Using the Frank-Wolfe method with dynamic step-sizes, the following inequality holds for all k ≥ 1 and l ≥ 1:
B_{k+l} − h(λ_{k+l}) ≤ 4 C_{k+l} / ( 4C_{k+l} / (B_k − h(λ_k)) + l )
Furthermore, if the doubling strategy is used to update the estimates C_k of C_{h,Q}, it holds that C_{k+l} ≤ max{C_0, 2C_{h,Q}}.
24 Frank-Wolfe with Inexact Gradient
Consider the case when the gradient ∇h(·) is given inexactly.
d'Aspremont's δ-oracle for the gradient: given λ̄ ∈ Q, the δ-oracle returns g_δ(λ̄) that satisfies:
(∇h(λ̄) − g_δ(λ̄))^T (λ̂ − λ̃) ≤ δ for all λ̂, λ̃ ∈ Q
Devolder/Glineur/Nesterov's (δ, L)-oracle for the function value and the gradient: given λ̄ ∈ Q, the (δ, L)-oracle returns h_{(δ,L)}(λ̄) and g_{(δ,L)}(λ̄) that satisfy, for all λ ∈ Q:
h(λ) ≤ h_{(δ,L)}(λ̄) + g_{(δ,L)}(λ̄)^T (λ − λ̄)
and
h(λ) ≥ h_{(δ,L)}(λ̄) + g_{(δ,L)}(λ̄)^T (λ − λ̄) − (L/2) ‖λ − λ̄‖² − δ
25 Frank-Wolfe with d'Aspremont δ-oracle for Gradient
Suppose that we use a δ-oracle for the gradient. Then the main technical theorem is amended as follows:
Theorem. Consider the iterate sequences of the Frank-Wolfe method {λ_k} and {λ̃_k} and the sequence of upper bounds {B_k} on h*, using the step-size sequence {ᾱ_k}. For the auxiliary sequences {α_k} and {β_k}, and for any k ≥ 1, the following inequality holds:
B_k − h(λ_{k+1}) ≤ (B_k − h(λ_1)) / β_{k+1} + (C_{h,Q} / β_{k+1}) Σ_{i=1}^k α_i² / β_{i+1} + 2δ
The error δ does not accumulate.
All bounds presented earlier are amended by 2δ.
26 Frank-Wolfe with DGN (δ, L)-oracle for Function and Gradient
Suppose that we use a (δ, L)-oracle for the function and gradient. Then the errors accumulate:

Method                 | Guarantee without Errors | Guarantee with Errors | Sparsity of Iterates
Frank-Wolfe            | O(1/k)                   | O(1/k) + O(δk)        | YES
Prox Gradient          | O(1/k)                   | O(1/k) + O(δ)         | NO
Accelerated Prox Grad. | O(1/k²)                  | O(1/k²) + O(δk)       | NO

No method dominates on all three criteria.
27 Applications to Statistical Boosting
We consider two applications in statistical boosting:
1. Linear Regression
2. Binary Classification / Supervised Learning
28 Linear Regression
Consider linear regression: y = Xβ + e
y ∈ R^n is the response, β ∈ R^p are the coefficients, X ∈ R^{n×p} is the model matrix, and e ∈ R^n are the errors.
In the high-dimensional statistical regime, especially with p ≫ n, we desire:
- good predictive performance,
- good performance on samples (residuals r := y − Xβ are small),
- an interpretable model: β coefficients are not excessively large (‖β‖₁ ≤ δ), and
- a sparse solution (β has few non-zero coefficients).
29 LASSO
The form of the LASSO we consider is:
L*_δ = min_β L(β) := (1/2) ‖y − Xβ‖₂²  s.t. ‖β‖₁ ≤ δ
δ > 0 is the regularization parameter.
As is well known, the ℓ₁-norm regularizer induces sparse solutions.
Let ‖β‖₀ denote the number of non-zero coefficients of the vector β.
30 LASSO with Frank-Wolfe
Consider using Frank-Wolfe to solve the LASSO:
L*_δ = min_β L(β) := (1/2) ‖y − Xβ‖₂²  s.t. ‖β‖₁ ≤ δ
The linear optimization subproblem min_{‖β‖₁ ≤ δ} c^T β is trivial to solve for any c: let j* ∈ argmax_{j ∈ {1,…,p}} |c_j|; then −δ sgn(c_{j*}) e_{j*} is optimal.
The extreme points of the feasible region are ±δe_1, …, ±δe_p; therefore ‖β^{k+1}‖₀ ≤ ‖β^k‖₀ + 1.
31 Frank-Wolfe for LASSO
Adaptation of the Frank-Wolfe Method for LASSO
Initialize at β^0 with ‖β^0‖₁ ≤ δ. At iteration k:
1. Compute: r^k ← y − Xβ^k; j_k ∈ argmax_{j ∈ {1,…,p}} |(r^k)^T X_j|
2. Set:
   β^{k+1}_{j_k} ← (1 − ᾱ_k) β^k_{j_k} + ᾱ_k δ sgn((r^k)^T X_{j_k})
   β^{k+1}_j ← (1 − ᾱ_k) β^k_j for j ≠ j_k,
   where ᾱ_k ∈ [0, 1].
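The two steps above translate into a short loop. A minimal sketch, with an assumed toy instance (identity design matrix, so the solution is the ℓ₁-ball projection of y) and names of our choosing:

```python
import numpy as np

def frank_wolfe_lasso(X, y, delta, iters):
    """Sketch of the Frank-Wolfe adaptation for the LASSO.
    Each step updates a single coordinate, so ||beta^k||_0 <= k."""
    beta = np.zeros(X.shape[1])
    for k in range(iters):
        r = y - X @ beta                   # residuals r^k
        corr = X.T @ r                     # (r^k)^T X_j for all j
        j = int(np.argmax(np.abs(corr)))   # coordinate chosen by the LO oracle
        alpha = 2.0 / (k + 2.0)            # well-studied step-size rule
        beta *= (1.0 - alpha)
        beta[j] += alpha * delta * np.sign(corr[j])
    return beta

# Tiny assumed instance: X = identity in R^3.
X = np.eye(3)
y = np.array([1.0, 0.5, 0.0])
beta = frank_wolfe_lasso(X, y, delta=0.8, iters=300)
```

Every iterate is a convex combination of points in the ℓ₁ ball of radius δ, so the constraint ‖β‖₁ ≤ δ holds throughout without any projection step.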
32 Computational Guarantees for Frank-Wolfe on LASSO
Guarantees. Suppose that we use the Frank-Wolfe method to solve the LASSO problem, with either the fixed step-size rule ᾱ_i = 2/(i+2) or a line-search to determine ᾱ_i, for i ≥ 0. Then after k iterations, there exists an i ∈ {0, …, k} satisfying:
- (1/2) ‖y − Xβ^i‖₂² − L*_δ ≤ 17.4 ‖X‖²_{1,2} δ² / k
- ‖X^T (y − Xβ^i)‖_∞ ≤ ‖Xβ_LS‖₂² / (2δ) + ‖X‖²_{1,2} δ / k
- ‖β^i‖₀ ≤ k and ‖β^i‖₁ ≤ δ
where β_LS is the least-squares solution, and so ‖Xβ_LS‖₂ ≤ ‖y‖₂.
33 Frank-Wolfe on LASSO, and FS_ε
There are some structural connections between Frank-Wolfe on LASSO and Incremental Forward Stagewise Regression (FS_ε).
There are even more connections in the case of Frank-Wolfe with constant step-size.
34 Binary Classification / Supervised Learning
The set-up of the binary classification boosting problem consists of:
- data/training examples (x_1, y_1), …, (x_m, y_m), where each x_i ∈ X and each y_i ∈ {−1, 1};
- a set of n base classifiers H = {h_1(·), …, h_n(·)}, where each h_j(·): X → [−1, 1];
- h_j(x_i) is classifier j's score of example x_i;
- y_i h_j(x_i) > 0 if and only if classifier j correctly classifies example x_i.
We would like to construct a classifier H = λ_1 h_1(·) + ⋯ + λ_n h_n(·) that performs significantly better than any individual classifier in H.
35 Binary Classification, continued
We construct a new classifier H = H_λ := λ_1 h_1(·) + ⋯ + λ_n h_n(·) from nonnegative multipliers λ ∈ R^n_+:
H(x) = H_λ(x) := λ_1 h_1(x) + ⋯ + λ_n h_n(x)
(we associate sgn(H) with H for ease of notation)
In the high-dimensional case where n ≫ m, we desire:
- good predictive performance,
- good performance on the training data (y_i H_λ(x_i) > 0 for all examples x_i),
- a good interpretable model: λ coefficients are not excessively large (‖λ‖₁ ≤ δ), and
- λ has few non-zero coefficients (‖λ‖₀ is small).
36 Weak Learner
Form the matrix A ∈ R^{m×n} by A_{ij} = y_i h_j(x_i).
A_{ij} > 0 if and only if classifier h_j(·) correctly classifies example x_i.
We suppose we have a weak learner W(·) that, for any distribution w on the examples (w ∈ Δ_m := {w ∈ R^m : e^T w = 1, w ≥ 0}), returns the base classifier h_{j*}(·) in H that does best on the weighted examples determined by w:
Σ_{i=1}^m w_i y_i h_{j*}(x_i) = max_{j=1,…,n} Σ_{i=1}^m w_i y_i h_j(x_i) = max_{j=1,…,n} w^T A_j
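Over a finite pool of base classifiers, the weak learner is just a weighted column argmax of A; a minimal sketch with assumed toy data:

```python
import numpy as np

def weak_learner(A, w):
    """W(w): index j maximizing w^T A_j, i.e. the base classifier with the
    best w-weighted performance (A_ij = y_i * h_j(x_i))."""
    return int(np.argmax(A.T @ w))

# Two examples, two base classifiers (assumed): classifier 0 is correct on
# both examples, classifier 1 is correct on one and wrong on the other.
A = np.array([[1.0, -1.0],
              [1.0,  1.0]])
j = weak_learner(A, np.array([0.5, 0.5]))
```

Under the uniform distribution the scores are w^T A = (1.0, 0.0), so the weak learner returns classifier 0.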
37 Performance Objectives: Maximize the Margin
The margin of the classifier H_λ is defined by:
p(λ) := min_{i ∈ {1,…,m}} y_i H_λ(x_i) = min_{i ∈ {1,…,m}} (Aλ)_i
One can then solve:
p* = max_{λ ∈ Δ_n} p(λ)
38 Performance Objectives: Minimize the Exponential Loss
The (logarithm of the) exponential loss of the classifier H_λ is defined by:
L_exp(λ) := ln( (1/m) Σ_{i=1}^m exp(−(Aλ)_i) )
One can then solve:
L*_{exp,δ} = min_λ L_exp(λ)  s.t. ‖λ‖₁ ≤ δ, λ ≥ 0
39 Minimizing Exponential Loss with Frank-Wolfe
Consider using Frank-Wolfe to minimize the exponential loss:
L*_{exp,δ} = min_λ L_exp(λ)  s.t. ‖λ‖₁ ≤ δ, λ ≥ 0
The linear optimization subproblem min_{‖λ‖₁ ≤ δ, λ ≥ 0} c^T λ is trivial to solve for any c: the solution is 0 if c ≥ 0; otherwise let j* ∈ argmin_{j ∈ {1,…,n}} c_j, and then δ e_{j*} is optimal.
The extreme points of the feasible region are 0, δe_1, …, δe_n; therefore ‖λ^{k+1}‖₀ ≤ ‖λ^k‖₀ + 1.
40 Frank-Wolfe for Exponential Loss Minimization
Adaptation of the Frank-Wolfe Method for Minimizing Exponential Loss
Initialize at λ^0 ≥ 0 with ‖λ^0‖₁ ≤ δ. Set w^0_i = exp(−(Aλ^0)_i) / Σ_{l=1}^m exp(−(Aλ^0)_l), i = 1, …, m. Set k ← 0. At iteration k:
1. Compute j_k ← W(w^k).
2. Choose ᾱ_k ∈ [0, 1] and set:
   λ^{k+1}_{j_k} ← (1 − ᾱ_k) λ^k_{j_k} + ᾱ_k δ
   λ^{k+1}_j ← (1 − ᾱ_k) λ^k_j for j ≠ j_k
   w^{k+1}_i ← (w^k_i)^{1−ᾱ_k} exp(−ᾱ_k δ A_{i,j_k}), i = 1, …, m, and re-normalize so that e^T w^{k+1} = 1.
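The adapted method above can be sketched as a short boosting loop; here the weak learner is a direct column argmax of A, and the toy data and names are our assumptions:

```python
import numpy as np

def fw_exponential_loss(A, delta, iters):
    """Sketch of the FW adaptation for minimizing the exponential loss
    over {lambda >= 0, ||lambda||_1 <= delta}."""
    m, n = A.shape
    lam = np.zeros(n)                     # lambda^0 = 0
    w = np.full(m, 1.0 / m)               # w^0: uniform distribution
    for k in range(iters):
        j = int(np.argmax(A.T @ w))       # weak learner W(w^k)
        alpha = 2.0 / (k + 2.0)
        lam *= (1.0 - alpha)
        lam[j] += alpha * delta
        # multiplicative weight update, then re-normalize so e^T w = 1
        w = w ** (1.0 - alpha) * np.exp(-alpha * delta * A[:, j])
        w /= w.sum()
    return lam, w

# Toy data (assumed): classifier 0 classifies both examples correctly,
# so nearly all weight should land on coordinate 0.
A = np.array([[1.0,  0.5],
              [1.0, -0.5]])
lam, w = fw_exponential_loss(A, delta=2.0, iters=100)
```

Note that the weight update is exactly equivalent to maintaining w_i ∝ exp(−(Aλ)_i), since the FW step scales Aλ by (1 − ᾱ_k) and adds ᾱ_k δ A_{·,j_k}.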
41 Computational Guarantees for FW for Binary Classification
Guarantees. Suppose that we use the Frank-Wolfe method to solve the exponential loss minimization problem, with either the fixed step-size rule ᾱ_i = 2/(i+2) or a line-search to determine ᾱ_i, using ᾱ_0 = 1. Then after k iterations:
- L_exp(λ^k) − L*_{exp,δ} ≤ 8δ² / (k + 3)
- p* − p(λ̄^k) ≤ 8δ / k + ln(m) / δ
- ‖λ^k‖₀ ≤ k and ‖λ^k‖₁ ≤ δ
where λ̄^k is the normalization of λ^k, namely λ̄^k = λ^k / (e^T λ^k).
42 Frank-Wolfe for Binary Classification, and AdaBoost
There are some structural connections between Frank-Wolfe for binary classification and AdaBoost.
L_exp(λ) is a (Nesterov-) smoothing of the margin with smoothing parameter µ: write the margin as
p(λ) := min_{i ∈ {1,…,m}} (Aλ)_i = min_{w ∈ Δ_m} w^T Aλ
and do µ-smoothing of the margin using the entropy function e(w):
p_µ(λ) := min_{w ∈ Δ_m} ( w^T Aλ + µ e(w) )
Then p_µ(λ) = −µ L_exp(λ/µ), and it easily follows that:
(1/µ) p(λ) ≤ −L_exp(λ/µ) ≤ (1/µ) p(λ) + ln(m)
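The smoothing identity can be checked by the standard entropy calculation; a sketch of the derivation, assuming the entropy function is e(w) = Σ_i w_i ln w_i + ln m (so that 0 ≤ e(w) ≤ ln m on Δ_m, consistent with the bounds above):

```latex
\begin{align*}
p_\mu(\lambda) \;=\; \min_{w \in \Delta_m}\Big\{ w^T A\lambda + \mu\, e(w) \Big\}
  \;=\; -\mu \ln\!\Big( \tfrac{1}{m}\sum_{i=1}^m e^{-(A\lambda)_i/\mu} \Big)
  \;=\; -\mu\, L_{\exp}(\lambda/\mu),
\end{align*}
% attained at w_i \propto e^{-(A\lambda)_i/\mu}.  Since 0 \le \mu\, e(w) \le \mu \ln m,
% we get p(\lambda) \le p_\mu(\lambda) \le p(\lambda) + \mu \ln m, and dividing by \mu:
% \tfrac{1}{\mu}\, p(\lambda) \;\le\; -L_{\exp}(\lambda/\mu) \;\le\; \tfrac{1}{\mu}\, p(\lambda) + \ln m.
```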
43 Comments
Computational evaluation of Frank-Wolfe for boosting:
- LASSO
- binary classification
Structural connections in G-F-M:
- AdaBoost is equivalent to Mirror Descent for maximizing the margin
- Incremental Forward Stagewise Regression (FS_ε) is equivalent to subgradient descent for minimizing ‖X^T r‖_∞
More informationSupport Vector Machines
Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationAccelerated Proximal Gradient Methods for Convex Optimization
Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationCOMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017
COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationMehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013
Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationManfred K. Warmuth - UCSC S.V.N. Vishwanathan - Purdue & Microsoft Research. Updated: March 23, Warmuth (UCSC) ICML 09 Boosting Tutorial 1 / 62
Updated: March 23, 2010 Warmuth (UCSC) ICML 09 Boosting Tutorial 1 / 62 ICML 2009 Tutorial Survey of Boosting from an Optimization Perspective Part I: Entropy Regularized LPBoost Part II: Boosting from
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationVoting (Ensemble Methods)
1 2 Voting (Ensemble Methods) Instead of learning a single classifier, learn many weak classifiers that are good at different parts of the data Output class: (Weighted) vote of each classifier Classifiers
More informationEE364b Convex Optimization II May 30 June 2, Final exam
EE364b Convex Optimization II May 30 June 2, 2014 Prof. S. Boyd Final exam By now, you know how it works, so we won t repeat it here. (If not, see the instructions for the EE364a final exam.) Since you
More informationAnnouncements Kevin Jamieson
Announcements My office hours TODAY 3:30 pm - 4:30 pm CSE 666 Poster Session - Pick one First poster session TODAY 4:30 pm - 7:30 pm CSE Atrium Second poster session December 12 4:30 pm - 7:30 pm CSE Atrium
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October
More informationLecture 23: November 19
10-725/36-725: Conve Optimization Fall 2018 Lecturer: Ryan Tibshirani Lecture 23: November 19 Scribes: Charvi Rastogi, George Stoica, Shuo Li Charvi Rastogi: 23.1-23.4.2, George Stoica: 23.4.3-23.8, Shuo
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationAdaptive Online Gradient Descent
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow
More information10-725/ Optimization Midterm Exam
10-725/36-725 Optimization Midterm Exam November 6, 2012 NAME: ANDREW ID: Instructions: This exam is 1hr 20mins long Except for a single two-sided sheet of notes, no other material or discussion is permitted
More informationLearning with L q<1 vs L 1 -norm regularisation with exponentially many irrelevant features
Learning with L q
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationCS6375: Machine Learning Gautam Kunapuli. Support Vector Machines
Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationLogistic Regression with the Nonnegative Garrote
Logistic Regression with the Nonnegative Garrote Enes Makalic Daniel F. Schmidt Centre for MEGA Epidemiology The University of Melbourne 24th Australasian Joint Conference on Artificial Intelligence 2011
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationCS 590D Lecture Notes
CS 590D Lecture Notes David Wemhoener December 017 1 Introduction Figure 1: Various Separators We previously looked at the perceptron learning algorithm, which gives us a separator for linearly separable
More information3.10 Lagrangian relaxation
3.10 Lagrangian relaxation Consider a generic ILP problem min {c t x : Ax b, Dx d, x Z n } with integer coefficients. Suppose Dx d are the complicating constraints. Often the linear relaxation and the
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationSparse Approximation and Variable Selection
Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation
More information9 Classification. 9.1 Linear Classifiers
9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive
More information