The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting

The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting
Paul Grigas, Robert Freund, and Rahul Mazumder
http://web.mit.edu/rfreund/www/talks.html
Massachusetts Institute of Technology
May 2013

Modified Title
Original title: The Frank-Wolfe Algorithm: New Results, Connections to Statistical Boosting, and Computation

Problem of Interest
Our problem of interest is:
$h^* := \max_{\lambda} \; h(\lambda)$ s.t. $\lambda \in Q$
- $Q \subseteq E$ is convex and compact
- $h(\cdot) : Q \to \mathbb{R}$ is concave and differentiable with Lipschitz gradient on $Q$
- Assume it is easy to solve linear optimization problems on $Q$

Frank-Wolfe (FW) Method
Frank-Wolfe Method for maximizing $h(\lambda)$ on $Q$
Initialize at $\lambda_1 \in Q$, (optional) initial upper bound $B_0$, $k \leftarrow 1$.
1. Compute $\nabla h(\lambda_k)$.
2. Compute $\tilde\lambda_k \leftarrow \arg\max_{\lambda \in Q}\{h(\lambda_k) + \nabla h(\lambda_k)^T(\lambda - \lambda_k)\}$.
   $B^w_k \leftarrow h(\lambda_k) + \nabla h(\lambda_k)^T(\tilde\lambda_k - \lambda_k)$, $G_k \leftarrow \nabla h(\lambda_k)^T(\tilde\lambda_k - \lambda_k)$.
3. (Optional: compute other upper bound $B^o_k$), update best bound $B_k \leftarrow \min\{B_{k-1}, B^w_k, B^o_k\}$.
4. Set $\lambda_{k+1} \leftarrow \lambda_k + \bar\alpha_k(\tilde\lambda_k - \lambda_k)$, where $\bar\alpha_k \in [0, 1)$.
Note the condition $\bar\alpha_k \in [0, 1)$.
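As a concrete companion to this description, here is a minimal Python sketch of the loop; the names (frank_wolfe_max, grad_h, lin_max_oracle, step_size) are illustrative choices of mine rather than the talk's, and the optional bound $B^o_k$ is omitted.

import numpy as np

def frank_wolfe_max(h, grad_h, lin_max_oracle, lam0, step_size, num_iters):
    """Sketch of the Frank-Wolfe method for maximizing a smooth concave h over Q.

    lin_max_oracle(g) must return an element of argmax_{lam in Q} g^T lam.
    step_size(k) returns alpha_bar_k in [0, 1).
    """
    lam = np.asarray(lam0, dtype=float)
    B = np.inf                                 # best upper bound on h^* so far
    gaps = []
    for k in range(1, num_iters + 1):
        g = grad_h(lam)                        # step 1: gradient at the current iterate
        lam_tilde = lin_max_oracle(g)          # step 2: linear optimization over Q
        Bw = h(lam) + g @ (lam_tilde - lam)    # Wolfe upper bound B^w_k
        G = g @ (lam_tilde - lam)              # Wolfe gap G_k
        gaps.append(G)
        B = min(B, Bw)                         # step 3: update the best bound
        lam = lam + step_size(k) * (lam_tilde - lam)   # step 4: convex-combination step
    return lam, B, gaps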

Wolfe Bound
The Wolfe Bound is $B^w_k := h(\lambda_k) + \nabla h(\lambda_k)^T(\tilde\lambda_k - \lambda_k) \ge h^*$.
Suppose $h(\cdot)$ arises from minmax structure: $h(\lambda) = \min_{x \in P} \phi(x, \lambda)$
- $P$ is convex and compact
- $\phi(x, \lambda) : P \times Q \to \mathbb{R}$ is convex in $x$ and concave in $\lambda$
Define $f(x) := \max_{\lambda \in Q} \phi(x, \lambda)$. Then a dual problem is $\min_{x \in P} f(x)$.
In the FW method, let $B^m_k := f(x_k)$ where $x_k \in \arg\min_{x \in P} \phi(x, \lambda_k)$.
Then $B^w_k \ge B^m_k$.

Wolfe Gap
The Wolfe Gap is $G_k := B^w_k - h(\lambda_k) = \nabla h(\lambda_k)^T(\tilde\lambda_k - \lambda_k)$.
Recall $B^m_k \le B^w_k$.
Unlike generic duality gaps, the Wolfe gap is determined completely by the current iterate $\lambda_k$. It arises in:
- Khachiyan's analysis of FW for rounding of polytopes,
- FW for boosting methods in statistics and learning,
- perhaps elsewhere...

Pre-start Step for FW
Pre-start Step of the Frank-Wolfe method, given $\lambda_0 \in Q$ and (optional) upper bound $B_{-1}$:
1. Compute $\nabla h(\lambda_0)$.
2. Compute $\tilde\lambda_0 \leftarrow \arg\max_{\lambda \in Q}\{h(\lambda_0) + \nabla h(\lambda_0)^T(\lambda - \lambda_0)\}$.
   $B^w_0 \leftarrow h(\lambda_0) + \nabla h(\lambda_0)^T(\tilde\lambda_0 - \lambda_0)$, $G_0 \leftarrow \nabla h(\lambda_0)^T(\tilde\lambda_0 - \lambda_0)$.
3. (Optional: compute other upper bound $B^o_0$), update best bound $B_0 \leftarrow \min\{B_{-1}, B^w_0, B^o_0\}$.
4. Set $\lambda_1 \leftarrow \tilde\lambda_0$.
This is the same as a regular FW step at $\lambda_0$ with $\bar\alpha_0 = 1$.

Curvature Constant $C_{h,Q}$ and Classical Metrics
Following Clarkson, define $C_{h,Q}$ to be the minimal value of $C$ for which $\lambda, \tilde\lambda \in Q$ and $\alpha \in [0,1]$ implies:
$h(\lambda + \alpha(\tilde\lambda - \lambda)) \ge h(\lambda) + \nabla h(\lambda)^T(\alpha(\tilde\lambda - \lambda)) - C\alpha^2$
Let $\mathrm{Diam}_Q := \max\{\|\lambda - \tilde\lambda\| : \lambda, \tilde\lambda \in Q\}$.
Let $L := L_{h,Q}$ be the smallest constant $L$ for which $\lambda, \tilde\lambda \in Q$ implies: $\|\nabla h(\lambda) - \nabla h(\tilde\lambda)\| \le L\|\lambda - \tilde\lambda\|$.
It is straightforward to bound $C_{h,Q} \le \tfrac{1}{2}L_{h,Q}(\mathrm{Diam}_Q)^2$.
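The last bound follows in one line from the Lipschitz-gradient property (a standard argument, filled in here for completeness): for concave $h$ with $L_{h,Q}$-Lipschitz gradient, any $\lambda, \tilde\lambda \in Q$ and $\alpha \in [0,1]$ give

$h(\lambda + \alpha(\tilde\lambda - \lambda)) \;\ge\; h(\lambda) + \nabla h(\lambda)^T(\alpha(\tilde\lambda - \lambda)) - \tfrac{L_{h,Q}}{2}\,\alpha^2\|\tilde\lambda - \lambda\|^2 \;\ge\; h(\lambda) + \nabla h(\lambda)^T(\alpha(\tilde\lambda - \lambda)) - \tfrac{1}{2}L_{h,Q}(\mathrm{Diam}_Q)^2\,\alpha^2,$

so the smallest admissible $C$ in the curvature definition is at most $\tfrac{1}{2}L_{h,Q}(\mathrm{Diam}_Q)^2$.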

Auxiliary Sequences $\{\alpha_k\}$ and $\{\beta_k\}$
Define the following two auxiliary sequences as functions of the step-sizes $\{\bar\alpha_k\}$ from the Frank-Wolfe method:
$\beta_k = \frac{1}{\prod_{j=1}^{k-1}(1 - \bar\alpha_j)}$ and $\alpha_k = \frac{\beta_k \bar\alpha_k}{1 - \bar\alpha_k}$, for $k \ge 1$
(By convention $\prod_{j=1}^{0}\,\cdot = 1$ and $\sum_{i=1}^{0}\,\cdot = 0$.)
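For concreteness, a small Python helper (my naming) that builds these dual averages sequences directly from a list of Frank-Wolfe step-sizes; it assumes every $\bar\alpha_k < 1$, as in the method. For the rule $\bar\alpha_k = 2/(k+2)$ it produces $\beta_k = k(k+1)/2$ and $\alpha_k = k+1$.

import numpy as np

def dual_averages_sequences(alpha_bars):
    """Given alpha_bars = [alpha_bar_1, alpha_bar_2, ...], return (alpha_k, beta_k) with
    beta_k = 1 / prod_{j=1}^{k-1}(1 - alpha_bar_j) and alpha_k = beta_k * alpha_bar_k / (1 - alpha_bar_k)."""
    alphas, betas = [], []
    prod = 1.0                          # empty-product convention: prod_{j=1}^{0} = 1
    for abar in alpha_bars:             # abar plays the role of alpha_bar_k
        beta = 1.0 / prod
        betas.append(beta)
        alphas.append(beta * abar / (1.0 - abar))
        prod *= (1.0 - abar)            # extend the product to include alpha_bar_k
    return np.array(alphas), np.array(betas)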

Two Technical Theorems
Theorem. Consider the iterate sequences of the Frank-Wolfe Method $\{\lambda_k\}$ and $\{\tilde\lambda_k\}$ and the sequence of upper bounds $\{B_k\}$ on $h^*$, using the step-size sequence $\{\bar\alpha_k\}$. For the auxiliary sequences $\{\alpha_k\}$ and $\{\beta_k\}$, and for any $k \ge 1$, the following inequality holds:
$B_k - h(\lambda_{k+1}) \;\le\; \frac{B_k - h(\lambda_1)}{\beta_{k+1}} + \frac{C_{h,Q}}{\beta_{k+1}}\sum_{i=1}^{k}\frac{\alpha_i^2}{\beta_{i+1}}$
The summation expression appears also in the dual averages method of Nesterov. This is not a coincidence; indeed it is by design, see Grigas.
We will henceforth refer to the sequences $\{\alpha_k\}$ and $\{\beta_k\}$ as the dual averages sequences associated with the FW step-sizes.

Second Technical Theorem
Theorem. Consider the iterate sequences of the Frank-Wolfe Method $\{\lambda_k\}$ and $\{\tilde\lambda_k\}$ and the sequence of Wolfe gaps $\{G_k\}$ from Step (2.), using the step-size sequence $\{\bar\alpha_k\}$. For the auxiliary sequences $\{\alpha_k\}$ and $\{\beta_k\}$, and for any $k \ge 1$ and $\ell \ge k+1$, the following inequality holds:
$\min_{i \in \{k+1,\dots,\ell\}} G_i \;\le\; \frac{1}{\sum_{i=k+1}^{\ell}\bar\alpha_i}\left[\frac{B_k - h(\lambda_1)}{\beta_{k+1}} + \frac{C_{h,Q}}{\beta_{k+1}}\sum_{i=1}^{k}\frac{\alpha_i^2}{\beta_{i+1}} + C_{h,Q}\sum_{i=k+1}^{\ell}\bar\alpha_i^2\right]$

A Well-Studied Step-size Sequence
Suppose we initiate the Frank-Wolfe method with the Pre-start step from a given value $\lambda_0 \in Q$ (which by definition assigns the step-size $\bar\alpha_0 = 1$, as discussed earlier), and then use the step-sizes:
$\bar\alpha_i = \frac{2}{i+2}$ for $i \ge 0$ (1)
Guarantee: Under the step-size sequence (1), the following inequalities hold for all $k \ge 1$:
$B_k - h(\lambda_{k+1}) \le \frac{4C_{h,Q}}{k+4}$ and $\min_{i \in \{1,\dots,k\}} G_i \le \frac{8.7\,C_{h,Q}}{k}$
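As a toy usage example of the frank_wolfe_max sketch given earlier with rule (1): the quadratic objective, the simplex feasible set, and all names below are made up purely for illustration and are not the applications discussed in this talk.

import numpy as np

# Hypothetical problem: maximize h(lam) = -0.5 * ||lam - c||^2 over the unit simplex Q.
# The linear maximization oracle over the simplex just picks the best vertex e_j.
c = np.array([0.3, 0.9, 0.1])
h = lambda lam: -0.5 * np.sum((lam - c) ** 2)
grad_h = lambda lam: c - lam
lin_max_oracle = lambda g: np.eye(len(g))[np.argmax(g)]
step_size = lambda k: 2.0 / (k + 2)            # rule (1)

lam, B, gaps = frank_wolfe_max(h, grad_h, lin_max_oracle,
                               lam0=np.array([1.0, 0.0, 0.0]),
                               step_size=step_size, num_iters=200)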

Simple Averaging
Suppose we initiate the Frank-Wolfe method with the Pre-start step, and then use the following step-size sequence:
$\bar\alpha_i = \frac{1}{i+1}$ for $i \ge 1$ (2)
This has the property that $\lambda_{k+1}$ is the simple average of $\tilde\lambda_0, \tilde\lambda_1, \tilde\lambda_2, \dots, \tilde\lambda_k$.
Guarantee: Under the step-size sequence (2), the following inequalities hold for all $k \ge 1$:
$B_k - h(\lambda_{k+1}) \le \frac{C_{h,Q}(1 + \ln(k+1))}{k+1}$ and, for $k \ge 2$, $\min_{i \in \{1,\dots,k\}} G_i \le \frac{2.9\,C_{h,Q}(2 + \ln(k+1))}{k-1}$

Constant Step-size
Given $\bar\alpha \in (0,1)$, suppose we initiate the Frank-Wolfe method with the Pre-start step, and then use the following step-size sequence:
$\bar\alpha_i = \bar\alpha$ for $i \ge 1$ (3)
(This step-size rule arises in the analysis of the Incremental Forward Stagewise Regression algorithm $\mathrm{FS}_\varepsilon$.)
Guarantee: Under the step-size sequence (3), the following inequality holds for all $k \ge 1$:
$B_k - h(\lambda_{k+1}) \le C_{h,Q}\left[(1 - \bar\alpha)^{k+1} + \bar\alpha\right]$

Constant Step-size, continued
If we decide a priori to run the Frank-Wolfe method for $k$ iterations, then the optimized value of $\bar\alpha$ in the previous guarantee is:
$\bar\alpha = 1 - \frac{1}{\sqrt[k]{k+1}}$ (4)
which yields the following:
Guarantee: For any $k \ge 1$, using the constant step-size (4) for all iterations, the following inequality holds:
$B_k - h(\lambda_{k+1}) \le \frac{C_{h,Q}\left[1 + \ln(k+1)\right]}{k}$
Furthermore, after $\ell = 2k + 2$ iterations it also holds that:
$\min_{i \in \{1,\dots,\ell\}} G_i \le \frac{2C_{h,Q}\left[0.35 + 2\ln(\ell)\right]}{\ell - 2}$

On the Need for Warm Start Analysis
Recall the well-studied step-size sequence, initiated with a Pre-start step from $\lambda_0 \in Q$:
$\bar\alpha_i = \frac{2}{i+2}$ for $i \ge 0$
and its computational guarantee:
$B_k - h(\lambda_{k+1}) \le \frac{4C_{h,Q}}{k+4}$
This guarantee is independent of $\lambda_0$.
If $h(\lambda_0) \ll h^*$, this is good.
But if $h(\lambda_0) \ge h^* - \varepsilon$, then this is not good.
Let us see how we can take advantage of a warm start $\lambda_0$...

A Warm Start Step-size Rule
Let us start the Frank-Wolfe method at an initial point $\lambda_1$ and an upper bound $B_0$.
Let $C_1$ be a given estimate of the curvature constant $C_{h,Q}$.
Use the step-size sequence:
$\bar\alpha_i = \frac{2}{\frac{4C_1}{B_1 - h(\lambda_1)} + i + 1}$ for $i \ge 1$ (5)
One can think of this as if the Frank-Wolfe method had run for $\frac{4C_1}{B_1 - h(\lambda_1)}$ iterations before arriving at $\lambda_1$.
Guarantee: Using (5), the following inequality holds for all $k \ge 1$:
$B_k - h(\lambda_{k+1}) \le \frac{4\max\{C_1, C_{h,Q}\}}{\frac{4C_1}{B_1 - h(\lambda_1)} + k}$
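Rule (5) is easy to transcribe into a small helper; the function name and argument names below are mine.

def warm_start_step_size(i, C1, B1, h_lam1):
    """Rule (5): alpha_bar_i = 2 / (4*C1/(B1 - h(lambda_1)) + i + 1), for i >= 1."""
    k0 = 4.0 * C1 / (B1 - h_lam1)   # acts like a number of iterations already performed
    return 2.0 / (k0 + i + 1.0)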

Warm Start Step-size Rule, continued
$\lambda_1 \in Q$ is the initial value; $C_1$ is a given estimate of $C_{h,Q}$.
Guarantee: Using (5), the following inequality holds for all $k \ge 1$:
$B_k - h(\lambda_{k+1}) \le \frac{4\max\{C_1, C_{h,Q}\}}{\frac{4C_1}{B_1 - h(\lambda_1)} + k}$
It is easy to see that $C_1 = C_{h,Q}$ optimizes the above guarantee.
If $B_1 - h(\lambda_1)$ is small, the incremental decrease in the guarantee from an additional iteration is lessened. This is different from first-order methods that use prox functions and/or projections.

Dynamic Version of Warm-Start Analysis
The warm-start step-sizes:
$\bar\alpha_i = \frac{2}{\frac{4C_1}{B_1 - h(\lambda_1)} + i + 1}$ for $i \ge 1$
are based on two pieces of information at $\lambda_1$: $B_1 - h(\lambda_1)$, and $C_1$.
This is a static warm-start step-size strategy.
Let us see how we can improve the computational guarantee by treating every iterate as if it were the initial iterate...

Dynamic Version of Warm-Starts, continued
At a given iteration $k$ of FW, we will presume that we have: $\lambda_k \in Q$, an upper bound $B_{k-1}$ on $h^*$ (from the previous iteration), and an estimate $C_{k-1}$ of $C_{h,Q}$ (also from the previous iteration).
Consider $\bar\alpha_k$ of the form:
$\bar\alpha_k := \frac{2}{\frac{4C_k}{B_k - h(\lambda_k)} + 2}$ (6)
where $\bar\alpha_k$ will depend explicitly on the value of $C_k$. Let us now discuss how $C_k$ is computed...

Updating the Estimate of the Curvature Constant
We require that $C_k$ (and $\bar\alpha_k$, which depends explicitly on $C_k$) satisfy $C_k \ge C_{k-1}$ and:
$h(\lambda_k + \bar\alpha_k(\tilde\lambda_k - \lambda_k)) \ge h(\lambda_k) + \bar\alpha_k(B_k - h(\lambda_k)) - C_k\bar\alpha_k^2$ (7)
We first test whether $C_k := C_{k-1}$ satisfies (7), and if so we set $C_k \leftarrow C_{k-1}$.
If not, perform a standard doubling strategy, testing values $C_k \leftarrow 2C_{k-1}, 4C_{k-1}, 8C_{k-1}, \dots$, until (7) is satisfied.
Alternatively, say if $h(\cdot)$ is quadratic, $C_k$ can be determined analytically.
It will always hold that $C_k \le \max\{C_0, 2C_{h,Q}\}$.
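A sketch of this doubling test in Python; h is the objective, the other arguments are quantities already available at iteration k, and the helper name plus the max_doublings safeguard are my own additions.

def update_curvature_estimate(h, lam, lam_tilde, B_k, C_prev, max_doublings=60):
    """Return (C_k, alpha_bar_k) with C_k >= C_prev satisfying condition (7),
    doubling the candidate C_k until the condition holds. The step-size is
    recomputed from rule (6) for each candidate."""
    C = C_prev
    for _ in range(max_doublings):
        alpha = 2.0 / (4.0 * C / (B_k - h(lam)) + 2.0)       # rule (6)
        lhs = h(lam + alpha * (lam_tilde - lam))
        rhs = h(lam) + alpha * (B_k - h(lam)) - C * alpha ** 2
        if lhs >= rhs:                                       # condition (7) holds
            return C, alpha
        C *= 2.0                                             # standard doubling step
    raise RuntimeError("curvature estimate did not stabilize")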

Frank-Wolfe Method with Dynamic Step-sizes
FW Method with Dynamic Step-sizes for maximizing $h(\lambda)$ on $Q$
Initialize at $\lambda_1 \in Q$, initial estimate $C_0$ of $C_{h,Q}$, (optional) initial upper bound $B_0$, $k \leftarrow 1$.
1. Compute $\nabla h(\lambda_k)$.
2. Compute $\tilde\lambda_k \leftarrow \arg\max_{\lambda \in Q}\{h(\lambda_k) + \nabla h(\lambda_k)^T(\lambda - \lambda_k)\}$.
   $B^w_k \leftarrow h(\lambda_k) + \nabla h(\lambda_k)^T(\tilde\lambda_k - \lambda_k)$, $G_k \leftarrow \nabla h(\lambda_k)^T(\tilde\lambda_k - \lambda_k)$.
3. (Optional: compute other upper bound $B^o_k$), update best bound $B_k \leftarrow \min\{B_{k-1}, B^w_k, B^o_k\}$.
4. Compute $C_k$ for which the following conditions hold: $C_k \ge C_{k-1}$, and $h(\lambda_k + \bar\alpha_k(\tilde\lambda_k - \lambda_k)) \ge h(\lambda_k) + \bar\alpha_k(B_k - h(\lambda_k)) - C_k\bar\alpha_k^2$, where $\bar\alpha_k := \frac{2}{\frac{4C_k}{B_k - h(\lambda_k)} + 2}$.
5. Set $\lambda_{k+1} \leftarrow \lambda_k + \bar\alpha_k(\tilde\lambda_k - \lambda_k)$, where $\bar\alpha_k \in [0, 1)$.

Dynamic Warm Start Step-size Rule, continued
Guarantee: Using the Frank-Wolfe method with Dynamic Step-sizes, the following inequality holds for all $k \ge 1$ and $\ell \ge 1$:
$B_{k+\ell} - h(\lambda_{k+\ell}) \le \frac{4C_{k+\ell}}{\frac{4C_{k+\ell}}{B_k - h(\lambda_k)} + \ell}$
Furthermore, if the doubling strategy is used to update the estimates $C_k$ of $C_{h,Q}$, it holds that $C_{k+\ell} \le \max\{C_0, 2C_{h,Q}\}$.

Frank-Wolfe with Inexact Gradient
Consider the case when the gradient $\nabla h(\cdot)$ is given inexactly.
d'Aspremont's $\delta$-oracle for the gradient: given $\bar\lambda \in Q$, the $\delta$-oracle returns $g_\delta(\bar\lambda)$ that satisfies:
$|(\nabla h(\bar\lambda) - g_\delta(\bar\lambda))^T(\hat\lambda - \lambda)| \le \delta$ for all $\hat\lambda, \lambda \in Q$
Devolder/Glineur/Nesterov's $(\delta, L)$-oracle for the function value and the gradient: given $\bar\lambda \in Q$, the $(\delta, L)$-oracle returns $h_{(\delta,L)}(\bar\lambda)$ and $g_{(\delta,L)}(\bar\lambda)$ that satisfy, for all $\lambda \in Q$:
$h(\lambda) \le h_{(\delta,L)}(\bar\lambda) + g_{(\delta,L)}(\bar\lambda)^T(\lambda - \bar\lambda)$
and
$h(\lambda) \ge h_{(\delta,L)}(\bar\lambda) + g_{(\delta,L)}(\bar\lambda)^T(\lambda - \bar\lambda) - \frac{L}{2}\|\lambda - \bar\lambda\|^2 - \delta$

Frank-Wolfe with d'Aspremont $\delta$-oracle for the Gradient
Suppose that we use a $\delta$-oracle for the gradient. Then the main technical theorem is amended as follows:
Theorem. Consider the iterate sequences of the Frank-Wolfe Method $\{\lambda_k\}$ and $\{\tilde\lambda_k\}$ and the sequence of upper bounds $\{B_k\}$ on $h^*$, using the step-size sequence $\{\bar\alpha_k\}$. For the auxiliary sequences $\{\alpha_k\}$ and $\{\beta_k\}$, and for any $k \ge 1$, the following inequality holds:
$B_k - h(\lambda_{k+1}) \le \frac{B_k - h(\lambda_1)}{\beta_{k+1}} + \frac{C_{h,Q}}{\beta_{k+1}}\sum_{i=1}^{k}\frac{\alpha_i^2}{\beta_{i+1}} + 2\delta$
The error $\delta$ does not accumulate.
All bounds presented earlier are amended by $2\delta$.

Frank-Wolfe with DGN $(\delta, L)$-oracle for the Function and Gradient
Suppose that we use a $(\delta, L)$-oracle for the function and gradient. Then the errors accumulate:

Method | Guarantee without Errors | Guarantee with Errors | Sparsity of Iterates
Frank-Wolfe | $O(1/k)$ | $O(1/k) + O(\delta k)$ | YES
Prox Gradient | $O(1/k)$ | $O(1/k) + O(\delta)$ | NO
Accelerated Prox Grad. | $O(1/k^2)$ | $O(1/k^2) + O(\delta k)$ | NO

No method dominates all three criteria.

Applications to Statistical Boosting
We consider two applications in statistical boosting:
1. Linear Regression
2. Binary Classification / Supervised Learning

Linear Regression
Consider linear regression: $y = X\beta + e$
$y \in \mathbb{R}^n$ is the response, $\beta \in \mathbb{R}^p$ are the coefficients, $X \in \mathbb{R}^{n \times p}$ is the model matrix, and $e \in \mathbb{R}^n$ are the errors.
In the high-dimensional statistical regime, especially with $p \gg n$, we desire:
- good predictive performance,
- good performance on samples (residuals $r := y - X\beta$ are small),
- an interpretable model: $\beta$ coefficients are not excessively large ($\|\beta\|_1 \le \delta$), and
- a sparse solution ($\beta$ has few non-zero coefficients).

LASSO
The form of the LASSO we consider is:
$L^*_\delta = \min_\beta \; L(\beta) := \tfrac{1}{2}\|y - X\beta\|_2^2$ s.t. $\|\beta\|_1 \le \delta$
$\delta > 0$ is the regularization parameter.
As is well known, the $\ell_1$-norm regularizer induces sparse solutions.
Let $\|\beta\|_0$ denote the number of non-zero coefficients of the vector $\beta$.

LASSO with Frank-Wolfe
Consider using Frank-Wolfe to solve the LASSO:
$L^*_\delta = \min_\beta \; L(\beta) := \tfrac{1}{2}\|y - X\beta\|_2^2$ s.t. $\|\beta\|_1 \le \delta$
The linear optimization subproblem $\min_{\|\beta\|_1 \le \delta} c^T\beta$ is trivial to solve for any $c$: let $j^* \in \arg\max_{j \in \{1,\dots,p\}} |c_j|$; then $-\delta\,\mathrm{sgn}(c_{j^*})e_{j^*} \in \arg\min_{\|\beta\|_1 \le \delta} c^T\beta$.
The extreme points of the feasible region are $\pm\delta e_1, \dots, \pm\delta e_p$.
Therefore $\|\beta^{k+1}\|_0 \le \|\beta^k\|_0 + 1$.
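A minimal Python sketch of this linear oracle (the function name is mine):

import numpy as np

def l1_ball_lin_min_oracle(c, delta):
    """Return a minimizer of c^T beta over the l1 ball ||beta||_1 <= delta:
    the signed vertex -delta * sgn(c_j*) * e_j*, where j* maximizes |c_j|."""
    j = int(np.argmax(np.abs(c)))
    beta = np.zeros_like(c, dtype=float)
    beta[j] = -delta * np.sign(c[j])
    return beta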

Frank-Wolfe for LASSO
Adaptation of the Frank-Wolfe Method for the LASSO
Initialize at $\beta^0$ with $\|\beta^0\|_1 \le \delta$. At iteration $k$:
1. Compute:
$r^k \leftarrow y - X\beta^k$
$j_k \in \arg\max_{j \in \{1,\dots,p\}} |(r^k)^T X_j|$
2. Set:
$\beta^{k+1}_{j_k} \leftarrow (1 - \bar\alpha_k)\beta^k_{j_k} + \bar\alpha_k\,\delta\,\mathrm{sgn}((r^k)^T X_{j_k})$
$\beta^{k+1}_j \leftarrow (1 - \bar\alpha_k)\beta^k_j$ for $j \ne j_k$,
where $\bar\alpha_k \in [0,1]$.
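A compact Python sketch of this adaptation, maintaining the residual explicitly; the function name, the default step-size rule, and starting from $\beta^0 = 0$ are my own choices for illustration.

import numpy as np

def frank_wolfe_lasso(X, y, delta, num_iters, step_size=lambda k: 2.0 / (k + 2)):
    """Frank-Wolfe applied to min 0.5*||y - X beta||_2^2 s.t. ||beta||_1 <= delta."""
    n, p = X.shape
    beta = np.zeros(p)
    for k in range(num_iters):
        r = y - X @ beta                            # residual r^k
        corr = X.T @ r                              # (r^k)^T X_j for every column j
        j = int(np.argmax(np.abs(corr)))            # j_k: most correlated column
        alpha = step_size(k)
        beta *= (1.0 - alpha)                       # shrink every coordinate
        beta[j] += alpha * delta * np.sign(corr[j]) # move toward the vertex delta*sgn(.)e_{j_k}
    return beta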

Computational Guarantees for Frank-Wolfe on LASSO
Guarantees. Suppose that we use the Frank-Wolfe Method to solve the LASSO problem, with either the fixed step-size rule $\bar\alpha_i = \frac{2}{i+2}$ or a line-search to determine $\bar\alpha_i$, for $i \ge 0$. Then after $k$ iterations, there exists an $i \in \{0,\dots,k\}$ satisfying:
$\tfrac{1}{2}\|y - X\beta^i\|_2^2 - L^*_\delta \le \frac{17.4\,\|X\|_{1,2}^2\,\delta^2}{k}$
$\|X^T(y - X\beta^i)\|_\infty \le \frac{\|X\beta_{LS}\|_2^2}{2\delta} + \frac{17.4\,\|X\|_{1,2}^2\,\delta}{k}$
$\|\beta^i\|_0 \le k$
$\|\beta^i\|_1 \le \delta$
where $\beta_{LS}$ is the least-squares solution, and so $\|X\beta_{LS}\|_2 \le \|y\|_2$.

Frank-Wolfe on LASSO, and $\mathrm{FS}_\varepsilon$
There are some structural connections between Frank-Wolfe on the LASSO and Incremental Forward Stagewise Regression ($\mathrm{FS}_\varepsilon$).
There are even more connections in the case of Frank-Wolfe with constant step-size.

Binary Classification / Supervised Learning
The set-up of the binary classification boosting problem consists of:
- Data/training examples $(x_1, y_1), \dots, (x_m, y_m)$ where each $x_i \in \mathcal{X}$ and each $y_i \in [-1, 1]$
- A set of $n$ base classifiers $\mathcal{H} = \{h_1(\cdot), \dots, h_n(\cdot)\}$ where each $h_j(\cdot) : \mathcal{X} \to [-1, 1]$
- $h_j(x_i)$ is classifier $j$'s score of example $x_i$
- $y_i h_j(x_i) > 0$ if and only if classifier $j$ correctly classifies example $x_i$
We would like to construct a classifier $H = \lambda_1 h_1(\cdot) + \dots + \lambda_n h_n(\cdot)$ that performs significantly better than any individual classifier in $\mathcal{H}$.

Binary Classification, continued
We construct a new classifier $H = H_\lambda := \lambda_1 h_1(\cdot) + \dots + \lambda_n h_n(\cdot)$ from nonnegative multipliers $\lambda \in \mathbb{R}^n_+$:
$H(x) = H_\lambda(x) := \lambda_1 h_1(x) + \dots + \lambda_n h_n(x)$
(associate $\mathrm{sgn}(H)$ with $H$ for ease of notation)
In the high-dimensional case where $n \gg m \gg 0$, we desire:
- good predictive performance,
- good performance on the training data ($y_i H_\lambda(x_i) > 0$ for all examples $x_i$),
- a good interpretable model: $\lambda$ coefficients are not excessively large ($\|\lambda\|_1 \le \delta$), and
- $\lambda$ has few non-zero coefficients ($\|\lambda\|_0$ is small).

Weak Learner
Form the matrix $A \in \mathbb{R}^{m \times n}$ by $A_{ij} = y_i h_j(x_i)$.
$A_{ij} > 0$ if and only if classifier $h_j(\cdot)$ correctly classifies example $x_i$.
We suppose we have a weak learner $\mathcal{W}(\cdot)$ that, for any distribution $w$ on the examples ($w \in \Delta_m := \{w \in \mathbb{R}^m : e^T w = 1, w \ge 0\}$), returns the base classifier $h_{j^*}(\cdot)$ in $\mathcal{H}$ that does best on the weighted examples determined by $w$:
$\sum_{i=1}^m w_i y_i h_{j^*}(x_i) = \max_{j=1,\dots,n} \sum_{i=1}^m w_i y_i h_j(x_i) = \max_{j=1,\dots,n} w^T A_j$
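With A formed this way, the weak learner reduces to a column argmax; a minimal sketch (the function name is mine):

import numpy as np

def weak_learner(w, A):
    """Given a distribution w over the m examples and A[i, j] = y_i * h_j(x_i),
    return the index j* of the base classifier maximizing the weighted edge w^T A_j."""
    return int(np.argmax(w @ A))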

Performance Objectives: Maximize the Margin
The margin of the classifier $H_\lambda$ is defined by:
$p(\lambda) := \min_{i \in \{1,\dots,m\}} y_i H_\lambda(x_i) = \min_{i \in \{1,\dots,m\}} (A\lambda)_i$
One can then solve:
$p^* = \max_\lambda \; p(\lambda)$ s.t. $\lambda \in \Delta_n$

Performance Objectives: Minimize the Exponential Loss
The (logarithm of the) exponential loss of the classifier $H_\lambda$ is defined by:
$L_{\exp}(\lambda) := \ln\left(\frac{1}{m}\sum_{i=1}^m \exp(-(A\lambda)_i)\right)$
One can then solve:
$L^*_{\exp,\delta} = \min_\lambda \; L_{\exp}(\lambda)$ s.t. $\|\lambda\|_1 \le \delta$, $\lambda \ge 0$

Minimizing Exponential Loss with Frank-Wolfe
Consider using Frank-Wolfe to minimize the exponential loss problem:
$L^*_{\exp,\delta} = \min_\lambda \; L_{\exp}(\lambda)$ s.t. $\|\lambda\|_1 \le \delta$, $\lambda \ge 0$
The linear optimization subproblem $\min_{\|\lambda\|_1 \le \delta,\ \lambda \ge 0} c^T\lambda$ is trivial to solve for any $c$: $0 \in \arg\min_{\|\lambda\|_1 \le \delta,\ \lambda \ge 0} c^T\lambda$ if $c \ge 0$; else let $j^* \in \arg\max_{j \in \{1,\dots,n\}} \{-c_j\}$, and then $\delta e_{j^*} \in \arg\min_{\|\lambda\|_1 \le \delta,\ \lambda \ge 0} c^T\lambda$.
The extreme points of the feasible region are $0, \delta e_1, \dots, \delta e_n$.
Therefore $\|\lambda^{k+1}\|_0 \le \|\lambda^k\|_0 + 1$.
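A sketch of this subproblem's solution in Python (the function name is mine):

import numpy as np

def nonneg_l1_lin_min_oracle(c, delta):
    """Return a minimizer of c^T lam over {lam >= 0, ||lam||_1 <= delta}:
    0 if every c_j is nonnegative, otherwise delta on the most negative coordinate."""
    j = int(np.argmin(c))
    lam = np.zeros_like(c, dtype=float)
    if c[j] < 0:
        lam[j] = delta
    return lam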

Frank-Wolfe for Exponential Loss Minimization
Adaptation of the Frank-Wolfe Method for Minimizing the Exponential Loss
Initialize at $\lambda^0 \ge 0$ with $\|\lambda^0\|_1 \le \delta$. Set $w^0_i = \frac{\exp(-(A\lambda^0)_i)}{\sum_{l=1}^m \exp(-(A\lambda^0)_l)}$, $i = 1,\dots,m$. Set $k = 0$.
At iteration $k$:
1. Compute $j_k \leftarrow \mathcal{W}(w^k)$
2. Choose $\bar\alpha_k \in [0,1]$ and set:
$\lambda^{k+1}_{j_k} \leftarrow (1 - \bar\alpha_k)\lambda^k_{j_k} + \bar\alpha_k\,\delta$
$\lambda^{k+1}_j \leftarrow (1 - \bar\alpha_k)\lambda^k_j$ for $j \ne j_k$
$w^{k+1}_i \leftarrow (w^k_i)^{1-\bar\alpha_k}\exp(-\bar\alpha_k\,\delta\,A_{i,j_k})$, $i = 1,\dots,m$, and re-normalize $w^{k+1}$ so that $e^T w^{k+1} = 1$
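A Python sketch of this boosting-style adaptation; it starts from $\lambda^0 = 0$ (so the initial weights are uniform), inlines the weak learner as a column argmax, and uses the $2/(k+2)$ rule by default (so $\bar\alpha_0 = 1$). The function name and these defaults are my own choices.

import numpy as np

def frank_wolfe_exp_loss(A, delta, num_iters, step_size=lambda k: 2.0 / (k + 2)):
    """Frank-Wolfe on min L_exp(lam) s.t. ||lam||_1 <= delta, lam >= 0,
    where A[i, j] = y_i * h_j(x_i)."""
    m, n = A.shape
    lam = np.zeros(n)
    w = np.full(m, 1.0 / m)                   # w^0 is uniform since A @ lam^0 = 0
    for k in range(num_iters):
        j = int(np.argmax(w @ A))             # weak learner: best weighted edge
        alpha = step_size(k)
        lam *= (1.0 - alpha)
        lam[j] += alpha * delta
        w = w ** (1.0 - alpha) * np.exp(-alpha * delta * A[:, j])   # multiplicative update
        w /= w.sum()                          # re-normalize so that e^T w = 1
    return lam, w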

Computational Guarantees for FW for Binary Classification
Guarantees. Suppose that we use the Frank-Wolfe Method to solve the exponential loss minimization problem, with either the fixed step-size rule $\bar\alpha_i = \frac{2}{i+2}$ or a line-search to determine $\bar\alpha_i$, using $\bar\alpha_0 = 1$. Then after $k$ iterations:
$L_{\exp}(\lambda^k) - L^*_{\exp,\delta} \le \frac{8\delta^2}{k+3}$
$p^* - p(\bar\lambda^k) \le \frac{8\delta}{k+3} + \frac{\ln(m)}{\delta}$
$\|\lambda^k\|_0 \le k$
$\|\lambda^k\|_1 \le \delta$
where $\bar\lambda^k$ is the normalization of $\lambda^k$, namely $\bar\lambda^k = \frac{\lambda^k}{e^T\lambda^k}$.

Frank-Wolfe for Binary Classification, and AdaBoost
There are some structural connections between Frank-Wolfe for binary classification and AdaBoost.
$L_{\exp}(\lambda)$ is a (Nesterov-)smoothing of the margin, using smoothing parameter $\mu = 1$:
- write the margin as $p(\lambda) := \min_{i \in \{1,\dots,m\}} (A\lambda)_i = \min_{w \in \Delta_m} w^T A\lambda$
- do $\mu$-smoothing of the margin using the entropy function $e(w)$:
$p_\mu(\lambda) := \min_{w \in \Delta_m}\{w^T A\lambda + \mu e(w)\}$
- then $p_\mu(\lambda) = -\mu L_{\exp}(\lambda/\mu)$
- it easily follows that $\frac{1}{\mu}p(\lambda) \le -L_{\exp}(\lambda/\mu) \le \frac{1}{\mu}p(\lambda) + \ln(m)$
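A quick numerical sanity check of that sandwich inequality on random data; the dimensions, the random matrix, and the value of $\mu$ below are made up for illustration, and the entropy convention assumed is the one that makes the smoothing identity exact.

import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 50, 10, 0.5
A = rng.uniform(-1.0, 1.0, size=(m, n))      # A_ij = y_i h_j(x_i) lies in [-1, 1]
lam = rng.uniform(0.0, 1.0, size=n)

margin = np.min(A @ lam)                                  # p(lambda)
L_exp = np.log(np.mean(np.exp(-(A @ (lam / mu)))))        # L_exp(lambda / mu)

# Sandwich from the slide: (1/mu) p(lambda) <= -L_exp(lambda/mu) <= (1/mu) p(lambda) + ln(m)
assert margin / mu <= -L_exp <= margin / mu + np.log(m) + 1e-12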

Comments
Computational evaluation of Frank-Wolfe for boosting:
- LASSO
- binary classification
Structural connections in G-F-M:
- AdaBoost is equivalent to Mirror Descent for maximizing the margin
- Incremental Forward Stagewise Regression ($\mathrm{FS}_\varepsilon$) is equivalent to subgradient descent for minimizing $\|X^T r\|_\infty$