Supplement for Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Yi Xu¹, Qihang Lin², Tianbao Yang¹

¹Department of Computer Science, The University of Iowa, Iowa City, IA, USA. ²Department of Management Sciences, The University of Iowa, Iowa City, IA, USA. Correspondence to: Tianbao Yang <tianbao-yang@uiowa.edu>.

1. Proof of Theorem 1

Theorem 1. Suppose Assumption 1 holds and F(w) obeys the LGC (6). Given δ ∈ (0, 1), let δ̃ = δ/K, K = ⌈log₂(ε₀/ε)⌉, D₁ ≥ cε₀/ε^{1−θ}, and let t be the smallest integer such that t ≥ max{9, 1728 log(1/δ̃)} G²D₁²/ε₁². Then ASSG-c guarantees that, with a probability 1 − δ, F(w_K) − F_* ≤ 2ε. As a result, the iteration complexity of ASSG-c for achieving a 2ε-optimal solution with a high probability 1 − δ is O(c²G²⌈log₂(ε₀/ε)⌉ log(1/δ̃)/ε^{2(1−θ)}) provided D₁ = O(cε₀/ε^{1−θ}).

Proof. Let w_{k,†} denote the closest point to w_k in S_ε. Define ε_k = ε₀/2^k. Note that D_k = D₁/2^{k−1} ≥ cε_{k−1}/ε^{1−θ} and η_k = ε_k/(3G²). We will show by induction that F(w_k) − F_* ≤ ε_k + ε for k = 0, 1, ..., with a high probability, which leads to our conclusion when k = K. The inequality holds obviously for k = 0. Conditioned on F(w_{k−1}) − F_* ≤ ε_{k−1} + ε, we will show that F(w_k) − F_* ≤ ε_k + ε with a high probability. By Lemma 1, we have

||w_{k−1,†} − w_{k−1}|| ≤ (c/ε^{1−θ})(F(w_{k−1}) − F(w_{k−1,†})) ≤ cε_{k−1}/ε^{1−θ} ≤ D_k.   (1)

We apply Lemma 2 to the k-th stage of Algorithm 1 conditioned on the randomness in previous stages. With a probability 1 − δ̃ we have

F(w_k) − F(w_{k−1,†}) ≤ η_kG²/2 + ||w_{k−1} − w_{k−1,†}||²/(2η_k t) + 4GD_k√(3 log(1/δ̃))/√t.   (2)

We now consider two cases for w_{k−1}. First, we assume F(w_{k−1}) − F_* ≤ ε, i.e., w_{k−1} ∈ S_ε. Then we have w_{k−1,†} = w_{k−1} and

F(w_k) − F(w_{k−1,†}) ≤ η_kG²/2 + 4GD_k√(3 log(1/δ̃))/√t ≤ ε_k/6 + ε_k/3 = ε_k/2,

the second inequality using the facts that η_k = ε_k/(3G²) and t ≥ 1728 log(1/δ̃)G²D_k²/ε_k², which holds for every k because D_k/ε_k = D₁/ε₁. As a result,

F(w_k) − F_* ≤ F(w_{k−1,†}) − F_* + ε_k/2 ≤ ε + ε_k.

Next, we consider F(w_{k−1}) − F_* > ε, i.e., w_{k−1} ∉ S_ε. Then we have F(w_{k−1,†}) − F_* = ε. Combining (1) and (2), we get

F(w_k) − F(w_{k−1,†}) ≤ η_kG²/2 + D_k²/(2η_k t) + 4GD_k√(3 log(1/δ̃))/√t.

Since η_k = ε_k/(3G²) and t ≥ max{9, 1728 log(1/δ̃)}G²D_k²/ε_k², each term in the RHS of the above inequality is bounded by ε_k/3. As a result, F(w_k) − F(w_{k−1,†}) ≤ ε_k, which implies F(w_k) − F_* ≤ ε_k + ε with a probability 1 − δ̃. Therefore by induction, with a probability at least (1 − δ̃)^K we have

F(w_K) − F_* ≤ ε_K + ε ≤ 2ε.

Since δ̃ = δ/K, then (1 − δ̃)^K ≥ 1 − δ and we complete the proof.
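Remark (illustration, not part of the original supplement). The following is a minimal Python sketch of the multi-stage scheme that the proof of Theorem 1 analyzes, written to connect the constants in the theorem to the updates. The oracle subgrad, the ball projection, and the use of the averaged iterate as the restart point are assumptions mirroring the proof's notation; the authoritative description is Algorithm 1 in the main paper, which is not reproduced here.

    import numpy as np

    def assg_c(subgrad, w0, eps0, eps, G, D1, delta=0.05):
        # K stages; per-stage budget t taken from Theorem 1 (eps_1 = eps0/2).
        K = int(np.ceil(np.log2(eps0 / eps)))
        delta_tilde = delta / K
        t = int(np.ceil(max(9.0, 1728.0 * np.log(1.0 / delta_tilde))
                        * (G * D1 / (eps0 / 2.0)) ** 2))
        w = np.asarray(w0, dtype=float)
        for k in range(1, K + 1):
            eps_k = eps0 / 2.0 ** k
            eta_k = eps_k / (3.0 * G ** 2)      # stage step size eta_k
            D_k = D1 / 2.0 ** (k - 1)           # shrinking ball radius D_k
            center = w.copy()
            u = w.copy()
            avg = np.zeros_like(w)
            for _ in range(t):
                avg += u / t                    # running average of the iterates
                u = u - eta_k * subgrad(u)
                r = np.linalg.norm(u - center)  # project back onto B(center, D_k)
                if r > D_k:
                    u = center + (D_k / r) * (u - center)
            w = avg                             # restart next stage from the average
        return w

Halving ε_k, η_k, and D_k at every stage while keeping the budget t fixed is exactly what drives the induction F(w_k) − F_* ≤ ε_k + ε in the proof.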

2. Proof of Lemma 3

Lemma 3. For any γ > 0 and τ ≥ 1, we have ||ŵ − w_τ|| ≤ 3γG and ||w_τ − w₁|| ≤ 2γG, where ŵ = arg min_{w∈K} F̂(w) and F̂(w) = F(w) + ||w − w₁||²/(2γ).

Proof. By the optimality of ŵ, we have for any w ∈ K

(∂F(ŵ) + (ŵ − w₁)/γ)ᵀ(w − ŵ) ≥ 0.

Let w = w₁; we have

∂F(ŵ)ᵀ(w₁ − ŵ) ≥ ||ŵ − w₁||²/γ.

Because ||∂F(ŵ)|| ≤ G due to ||∂f(w; ξ)|| ≤ G, then ||ŵ − w₁|| ≤ γG. Next, we bound ||w_τ − w₁||. According to the update of w_{τ+1} in (9), we have

||w_{τ+1} − w₁|| ≤ ||−η_τ ∂f(w_τ; ξ_τ) + (1 − η_τ/γ)(w_τ − w₁)||

(a projection onto K, if present in (9), does not increase the distance to w₁ ∈ K). We prove ||w_τ − w₁|| ≤ 2γG by induction. First, we consider τ = 1, where w_1 = w₁ and η_1 = 2γ; then

||w_2 − w₁|| ≤ η_1 ||∂f(w_1; ξ_1)|| ≤ 2γG.

Then we consider any τ ≥ 2, where η_τ/γ = 2/τ ≤ 1. Then

||w_{τ+1} − w₁|| ≤ η_τ G + (1 − 2/τ)||w_τ − w₁|| ≤ 2γG/τ + (1 − 2/τ)·2γG ≤ 2γG.

Therefore

||ŵ − w_τ|| ≤ ||ŵ − w₁|| + ||w_τ − w₁|| ≤ γG + 2γG = 3γG.
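Remark (illustration, not part of the original supplement). One stage of the regularized variant analyzed by Lemma 3 is plain SGD on F̂(w) = F(w) + ||w − w₁||²/(2γ) with step size η_τ = 2γ/τ. A hedged sketch, with placeholder names (the exact update (9) is in the main paper):

    import numpy as np

    def assg_r_stage(subgrad, w1, gamma, t):
        # SGD on F_hat(w) = F(w) + ||w - w1||^2 / (2*gamma), eta_tau = 2*gamma/tau.
        # By Lemma 3 the iterates stay in a ball of radius 2*gamma*G around w1,
        # so the regularized stochastic gradient g_tau is bounded by 3G.
        w1 = np.asarray(w1, dtype=float)
        w = w1.copy()
        avg = np.zeros_like(w1)
        for tau in range(1, t + 1):
            avg += w / t                        # w_hat_t = (1/t) * sum_tau w_tau
            g = subgrad(w) + (w - w1) / gamma   # g_tau in the proof of Lemma 4
            w = w - (2.0 * gamma / tau) * g     # eta_tau = 2*gamma/tau
        return avg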

3. Proof of Theorem 2

Theorem 2. Suppose Assumption 1 holds and F(w) obeys the LGC (6). Given δ ∈ (0, 1/e), let δ̃ = δ/K, K = ⌈log₂(ε₀/ε)⌉, γ₁ = 2c²ε₀/ε^{2(1−θ)}, and let t be the smallest integer such that t ≥ max{3, 136γ₁G²(1 + log(4 log t/δ̃) + log t)/ε₀}. Then ASSG-r guarantees that, with a probability 1 − δ, F(w_K) − F_* ≤ 2ε. As a result, the iteration complexity of ASSG-r for achieving a 2ε-optimal solution with a high probability 1 − δ is O(c²G²⌈log₂(ε₀/ε)⌉ log(4 log t/δ̃)/ε^{2(1−θ)}) provided γ₁ = O(c²ε₀/ε^{2(1−θ)}).

To prove the theorem, we first show the following high-probability convergence bound.

Lemma 4. Given w₁ ∈ K, apply t iterations of (9). For any fixed w ∈ K, δ ∈ (0, 1/e), and t ≥ 3, with a probability at least 1 − δ, the following inequality holds:

F(ŵ_t) − F(w) ≤ ||w − w₁||²/(2γ) + 34γG²(1 + log t + log(4 log t/δ))/t,

where ŵ_t = Σ_{τ=1}^t w_τ/t.

Proof. Let g_τ = ∂f(w_τ; ξ_τ) + (w_τ − w₁)/γ and ∂F̂(w_τ) = ∂F(w_τ) + (w_τ − w₁)/γ, and let ŵ = arg min_{w∈K} F̂(w). Note that ||g_τ|| ≤ 3G by Lemma 3. According to the standard analysis for the stochastic gradient method, we have

g_τᵀ(w_τ − ŵ) ≤ (||w_τ − ŵ||² − ||w_{τ+1} − ŵ||²)/(2η_τ) + (η_τ/2)||g_τ||²,

and hence

∂F̂(w_τ)ᵀ(w_τ − ŵ) ≤ (||w_τ − ŵ||² − ||w_{τ+1} − ŵ||²)/(2η_τ) + (η_τ/2)||g_τ||² + (∂F̂(w_τ) − g_τ)ᵀ(w_τ − ŵ).

By the strong convexity of F̂ we have

F̂(w_τ) − F̂(ŵ) ≤ ∂F̂(w_τ)ᵀ(w_τ − ŵ) − ||w_τ − ŵ||²/(2γ).

By combining the two inequalities above and summing across τ = 1, ..., t, with η_τ = 2γ/τ so that 1/(2η_τ) − 1/(2η_{τ−1}) = 1/(4γ), we have

Σ_{τ=1}^t (F̂(w_τ) − F̂(ŵ)) ≤ −(1/(4γ)) Σ_{τ=1}^t ||w_τ − ŵ||² + Σ_{τ=1}^t ζ_τ + 9γG²(1 + log t),   (3)

where ζ_τ = (∂F(w_τ) − ∂f(w_τ; ξ_τ))ᵀ(w_τ − ŵ) and the last term uses ||g_τ|| ≤ 3G and Σ_{τ=1}^t η_τ ≤ 2γ(1 + log t). Next, we bound Σ_τ ζ_τ. We need the following lemma.

Lemma 5 (Lemma 3 of (Kakade & Tewari, 2008)). Suppose X₁, ..., X_t is a martingale difference sequence with |X_τ| ≤ b. Let Var_τ X_τ = Var(X_τ | X₁, ..., X_{τ−1}), where Var denotes the variance, and let V_t = Σ_{τ=1}^t Var_τ X_τ be the sum of the conditional variances of the X_τ's. Further, let σ_t = √V_t. Then we have, for any δ < 1/e and t ≥ 3,

Pr( Σ_{τ=1}^t X_τ > max{2σ_t, 3b√(log(1/δ))} √(log(1/δ)) ) ≤ 4δ log t.

To proceed with the proof of Lemma 4, we let X_τ = ζ_τ and Ď_t = (Σ_{τ=1}^t ||w_τ − ŵ||²)^{1/2}. Then X₁, ..., X_t is a martingale difference sequence. Let D = 3γG and note that |X_τ| ≤ 2G||w_τ − ŵ|| ≤ 2GD by Lemma 3. By Lemma 5 applied with δ replaced by δ/(4 log t), for any δ < 1/e and t ≥ 3, with a probability 1 − δ we have

Σ_{τ=1}^t X_τ ≤ max{2σ_t, 6GD√(log(4 log t/δ))} √(log(4 log t/δ)).   (4)

Note that

Var_τ X_τ ≤ E_τ[||∂F(w_τ) − ∂f(w_τ; ξ_τ)||²] ||w_τ − ŵ||² ≤ 4G²||w_τ − ŵ||²,

so that σ_t ≤ 2GĎ_t. As a result, with a probability 1 − δ,

Σ_{τ=1}^t X_τ ≤ 4GĎ_t√(log(4 log t/δ)) + 6GD log(4 log t/δ) ≤ Ď_t²/(4γ) + 16γG² log(4 log t/δ) + 6GD log(4 log t/δ),   (5)

where the last step uses 4GĎ_t√(log(4 log t/δ)) ≤ Ď_t²/(4γ) + 16γG² log(4 log t/δ). Since 6GD = 18γG², combining (3) and (5) gives, with a probability 1 − δ,

Σ_{τ=1}^t (F̂(w_τ) − F̂(ŵ)) ≤ 34γG² log(4 log t/δ) + 9γG²(1 + log t).

Thus, with a probability 1 − δ, by the convexity of F̂,

F̂(ŵ_t) − F̂(ŵ) ≤ (1/t) Σ_{τ=1}^t (F̂(w_τ) − F̂(ŵ)) ≤ 34γG²(1 + log t + log(4 log t/δ))/t.

Using the facts that F(ŵ_t) ≤ F̂(ŵ_t) and F̂(ŵ) ≤ F̂(w) = F(w) + ||w − w₁||²/(2γ), we have

F(ŵ_t) − F(w) − ||w − w₁||²/(2γ) ≤ 34γG²(1 + log t + log(4 log t/δ))/t.

Next, let us start to prove Theorem 2.

Proof of Theorem 2. Let w_{k,†} denote the closest point to w_k in the ε-sublevel set S_ε. Define ε_k = ε₀/2^k. First, we note that γ_k = γ₁/2^{k−1} = 2c²ε_{k−1}/ε^{2(1−θ)}. We will show by induction that F(w_k) − F_* ≤ ε_k + ε for k = 0, 1, ..., with a high probability, which leads to our conclusion when k = K. The inequality holds obviously for k = 0. Conditioned on F(w_{k−1}) − F_* ≤ ε_{k−1} + ε, we will show that F(w_k) − F_* ≤ ε_k + ε with a high probability. We apply Lemma 4 to the k-th stage of Algorithm 2 conditioned on the randomness in previous stages. With a probability at least 1 − δ̃ we have

F(w_k) − F(w_{k−1,†}) ≤ ||w_{k−1,†} − w_{k−1}||²/(2γ_k) + 34γ_kG²(1 + log t + log(4 log t/δ̃))/t.   (6)

We now consider two cases for w_{k−1}. First, we assume F(w_{k−1}) − F_* ≤ ε, i.e., w_{k−1} ∈ S_ε. Then we have w_{k−1,†} = w_{k−1} and

F(w_k) − F(w_{k−1,†}) ≤ 34γ_kG²(1 + log t + log(4 log t/δ̃))/t ≤ ε_k/2,

where the last inequality uses the fact that t ≥ 136γ_kG²(1 + log(4 log t/δ̃) + log t)/ε_{k−1}, which holds for every k because γ_k/ε_{k−1} = γ₁/ε₀. As a result, F(w_k) − F_* ≤ F(w_{k−1,†}) − F_* + ε_k/2 ≤ ε + ε_k.

Next, we consider F(w_{k−1}) − F_* > ε, i.e., w_{k−1} ∉ S_ε. Then we have F(w_{k−1,†}) − F_* = ε. Similar to the proof of Theorem 1, by Lemma 1 we have

||w_{k−1,†} − w_{k−1}|| ≤ cε_{k−1}/ε^{1−θ}.   (7)

Combining (6) and (7), we have

F(w_k) − F(w_{k−1,†}) ≤ c²ε_{k−1}²/(2γ_kε^{2(1−θ)}) + 34γ_kG²(1 + log t + log(4 log t/δ̃))/t.

Using the facts that γ_k = 2c²ε_{k−1}/ε^{2(1−θ)} and t ≥ 136γ_kG²(1 + log t + log(4 log t/δ̃))/ε_{k−1}, we get

F(w_k) − F(w_{k−1,†}) ≤ ε_k/2 + ε_k/2 = ε_k,

which together with the fact that F(w_{k−1,†}) = F_* + ε implies F(w_k) − F_* ≤ ε + ε_k. Therefore by induction, we have, with a probability at least (1 − δ̃)^K,

F(w_K) − F_* ≤ ε_K + ε ≤ 2ε,

where the last inequality is due to the value of K = ⌈log₂(ε₀/ε)⌉. Since δ̃ = δ/K, then (1 − δ̃)^K ≥ 1 − δ.
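Remark (a check added for clarity, using the stage parameters as reconstructed in the proof above). Both per-stage requirements scale identically across stages, which is why a single budget t suffices for all k:

    γ_k = γ₁/2^{k−1} = 2c²ε_{k−1}/ε^{2(1−θ)}  ⟹  c²ε_{k−1}²/(2γ_k ε^{2(1−θ)}) = ε_{k−1}/4 = ε_k/2,

    γ_k/ε_{k−1} = γ₁/ε₀  ⟹  t ≥ 136γ₁G²(1 + log t + log(4 log t/δ̃))/ε₀ is the same condition as
                              t ≥ 136γ_kG²(1 + log t + log(4 log t/δ̃))/ε_{k−1} = 68γ_kG²(·)/ε_k for every k.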

4. Proof of Theorem 3

Theorem 3 (RASSG with unknown c). Let δ ≤ 1/2, ω = 2(1 − θ), and K = ⌈log₂(ε₀/ε)⌉ in Algorithm 3. Suppose D^{(1)} is sufficiently large so that there exists ε̂ ∈ [ε, ε₀/2] with which F(·) satisfies the LGC (6) on S_ε̂ with θ ∈ (0, 1) and the constant c, and D^{(1)} = cε₀/ε̂^{1−θ}. Let δ̂ = δ/(K(K + 1)) and t₁ = max{9, 1728 log(1/δ̂)}G²(D^{(1)})²/ε₁², where ε₁ = ε₀/2. Then with at most S = ⌈log₂(ε̂/ε)⌉ + 1 calls of ASSG-c, Algorithm 3 finds a solution w^{(S)} such that F(w^{(S)}) − F_* ≤ 2ε. The total number of iterations of RASSG for obtaining the 2ε-optimal solution is upper bounded by T_S = O(⌈log₂(ε₀/ε)⌉ log(1/δ̂) G²(D^{(1)})²(ε̂/ε)^{2(1−θ)}/ε₀²) = O(c²G²⌈log₂(ε₀/ε)⌉ log(1/δ̂)/ε^{2(1−θ)}).

Proof. Since K = ⌈log₂(ε₀/ε)⌉ ≥ ⌈log₂(ε₀/ε̂)⌉, D^{(1)} = cε₀/ε̂^{1−θ}, and t₁ = max{9, 1728 log(1/δ̂)}G²(D^{(1)})²/ε₁², following the proof of Theorem 1 we can show that, with a probability at least 1 − δ/(K + 1),

F(w^{(1)}) − F_* ≤ 2ε̂.   (8)

By running ASSG-c starting from w^{(1)}, which satisfies (8), with K = ⌈log₂(ε₀/ε)⌉ ≥ ⌈log₂(2ε̂/(ε̂/2))⌉, D^{(2)} = c(2ε̂)/(ε̂/2)^{1−θ}, and t₂ = 2^{2(1−θ)}t₁, and noting that the LGC (6) holds on S_{ε̂/2} ⊆ S_ε̂ with the same θ and c, Theorem 1 ensures that F(w^{(2)}) − F_* ≤ ε̂ with a probability at least (1 − δ/(K + 1))². By continuing the process, with S = ⌈log₂(ε̂/ε)⌉ + 1 calls of ASSG-c we can prove that, with a probability at least (1 − δ/(K + 1))^S ≥ 1 − Sδ/(K + 1) ≥ 1 − δ,

F(w^{(S)}) − F_* ≤ 2ε̂/2^{S−1} ≤ 2ε.

The total number of iterations for the S calls of ASSG-c is bounded by

T_S = K Σ_{s=1}^S t_s = K t₁ Σ_{s=1}^S 2^{2(s−1)(1−θ)} ≤ K t₁ 2^{2(S−1)(1−θ)} · 2^{2(1−θ)}/(2^{2(1−θ)} − 1) ≤ O(K t₁ (2ε̂/ε)^{2(1−θ)}) ≤ O(c²G²⌈log₂(ε₀/ε)⌉ log(1/δ̂)/ε^{2(1−θ)}),

where we used 2^{S−1} ≤ 2ε̂/ε and t₁ = O(log(1/δ̂)c²G²/ε̂^{2(1−θ)}).

5. Proof of Theorem 4

Theorem 4 (RASSG with unknown θ). Let δ ≤ 1/2, ω = 2, and K = ⌈log₂(ε₀/ε)⌉ in Algorithm 3. Assume γ^{(1)} is sufficiently large so that there exists ε̂ ∈ [ε, ε₀/2] rendering that γ^{(1)} = 2B²_ε̂ε₀/ε̂². Let δ̂ = δ/(K(K + 1)) and let t₁ be the smallest integer such that t₁ ≥ max{3, 136γ^{(1)}G²(1 + log(4 log t₁/δ̂) + log t₁)/ε₀}. Then with at most S = ⌈log₂(ε̂/ε)⌉ + 1 calls of ASSG-r, Algorithm 3 finds a solution w^{(S)} such that F(w^{(S)}) − F_* ≤ 2ε. The total number of iterations of RASSG for obtaining the 2ε-optimal solution is upper bounded by T_S = O(⌈log₂(ε₀/ε)⌉ log(4 log t₁/δ̂) G²B²_ε̂/ε²).

Proof. The proof is similar to the proof of Theorem 3, and we reprove it for completeness. It is easy to show that, by Lemma 1, F(·) satisfies the LGC (6) on S_ε̂ with θ = 1 and the constant ĉ = B_ε̂/ε̂, so that γ^{(1)} = 2ĉ²ε₀. Following the proof of Theorem 2, with K = ⌈log₂(ε₀/ε)⌉ ≥ ⌈log₂(ε₀/ε̂)⌉, we can then show that, with a probability at least 1 − δ/(K + 1),

F(w^{(1)}) − F_* ≤ 2ε̂.   (9)

By running ASSG-r starting from w^{(1)}, which satisfies (9), with the same K, γ^{(2)} = 4γ^{(1)}, and t₂ = 4t₁, and noting that γ^{(2)} = 2(2ĉ)²ε₀ ≥ 2ĉ²_{ε̂/2}ε₀ because ĉ_{ε̂/2} = B_{ε̂/2}/(ε̂/2) ≤ 2B_ε̂/ε̂ = 2ĉ (while Lemma 6 shows ĉ_{ε̂/2} ≥ ĉ, so the increase of γ is indeed needed), Theorem 2 ensures that F(w^{(2)}) − F_* ≤ ε̂ with a probability at least (1 − δ/(K + 1))². By continuing the process, with S = ⌈log₂(ε̂/ε)⌉ + 1, we can prove that, with a probability at least (1 − δ/(K + 1))^S ≥ 1 − δ,

F(w^{(S)}) − F_* ≤ 2ε̂/2^{S−1} ≤ 2ε.

The total number of iterations for the S calls of ASSG-r is bounded by

T_S = K Σ_{s=1}^S t_s = K t₁ Σ_{s=1}^S 4^{s−1} ≤ (4/3) K t₁ 4^{S−1} ≤ O(K t₁ (2ε̂/ε)²) ≤ O(⌈log₂(ε₀/ε)⌉ log(4 log t₁/δ̂) G²B²_ε̂/ε²),

where we used 2^{S−1} ≤ 2ε̂/ε and t₁ = O(G²B²_ε̂ log(4 log t₁/δ̂)/ε̂²).
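Remark (illustration, not part of the original supplement). The restart meta-scheme behind Theorems 3 and 4 can be sketched as follows; assg_call is a hypothetical callable standing for one full K-stage run of ASSG-c or ASSG-r, and Algorithm 3 itself is in the main paper:

    def rassg(assg_call, w0, t1, K, S, omega):
        # Each call reruns the K-stage subroutine from the previous output
        # with the per-stage budget multiplied by 2**omega, so no knowledge
        # of the LGC constant c (Theorem 3) or of theta (Theorem 4) is needed.
        w, t = w0, t1
        for s in range(S):
            w = assg_call(w, t, K)    # one call with budget t per stage
            t = int(t * 2 ** omega)   # omega = 2*(1 - theta), or 2 if theta unknown
        return w

The geometric growth of t is what makes the total cost a constant factor times the cost of the final call, which is the step used to bound T_S in both proofs.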

6. Monotonicity of B_ε/ε

Lemma 6. B_ε/ε is monotonically decreasing in ε.

Proof. Consider ε₁ > ε₂ > 0. Let x₁ be any point on L_{ε₁} such that dist(x₁, S_*) = B_{ε₁}, and let x* be the closest point to x₁ in the optimal set S_*, so that ||x₁ − x*|| = B_{ε₁}. We define a new point x̃ between x* and x₁ as

x̃ = (B_{ε₂}/B_{ε₁}) x₁ + ((B_{ε₁} − B_{ε₂})/B_{ε₁}) x*.

Since 0 < B_{ε₂} < B_{ε₁}, x̃ is strictly between x* and x₁, and dist(x̃, S_*) = ||x̃ − x*|| = (B_{ε₂}/B_{ε₁})||x₁ − x*|| = B_{ε₂} (x* remains the closest point in S_* to any point on the segment [x*, x₁]). By the convexity of F, since F(x*) = F_* and x̃ lies on the segment [x*, x₁], we have

(F(x̃) − F_*)/dist(x̃, S_*) ≤ (F(x₁) − F_*)/dist(x₁, S_*) = ε₁/B_{ε₁}.

Note that we must have F(x̃) − F_* ≥ ε₂ since, otherwise, we can move x̃ towards x₁ until F(x̃) − F_* = ε₂ but dist(x̃, S_*) > B_{ε₂}, contradicting the definition of B_{ε₂}. Then the proof is completed by applying F(x̃) − F_* ≥ ε₂ and dist(x̃, S_*) = B_{ε₂} to the previous inequality.
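Remark (a one-dimensional illustration added here, not in the original text). The ratio B_ε/ε is constant for a sharp function and grows as ε shrinks for a smooth one:

    F(x) = |x|:  L_ε = {±ε},   B_ε = ε,   B_ε/ε ≡ 1;
    F(x) = x²:   L_ε = {±√ε},  B_ε = √ε,  B_ε/ε = 1/√ε, which decreases as ε increases,

both consistent with Lemma 6.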

7. Additional Experiments

The datasets used in our experiments are from the libsvm website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). We summarize their basic statistics in Table 1.

Table 1. Statistics of real datasets
Name           | #Training (n) | #Features (d) | Type
covtype.binary | 581,012       | 54            | Classification
real-sim       | 72,309        | 20,958        | Classification
url            | 2,396,130     | 3,231,961     | Classification
million songs  | 463,715       | 90            | Regression
E2006-tfidf    | 16,087        | 150,360       | Regression
E2006-log1p    | 16,087        | 4,272,227     | Regression

To examine the convergence behavior of ASSG with different values of t, the number of iterations per stage, we also provide the results of ASSG with a very large t and add them into Figure 2. For completeness, we replot all these results in Figure 2. The results show that ASSG with a smaller t converges much faster to an ε-level set than ASSG with a larger t, while ASSG with a larger t can converge to a much smaller ε. In some cases, ASSG with a larger t is not as good as SSG in the earlier stages, but overall it converges faster to a smaller ε than SSG. We also present the results for a second value of t₁ in Figure 3, which are similar to those in Figure 2. In Figures 2 and 3, we compare RASSG with SVRG++ in terms of running time (CPU time), since SVRG++ computes a full gradient in each outer loop while RASSG only goes through one sample in each iteration. Following many previous studies, we also include the results in terms of the number of full gradient passes in Figure 4, for both settings of t₁. Similar trends can be found in Figure 4, indicating that RASSG converges faster than the other three algorithms.

[Figure 2: panels comparing SSG, ASSG (two values of t), and RASSG on hinge loss + ℓ₁ norm (covtype.binary, real-sim), robust regression + ℓ₁ norm (million songs, E2006-tfidf), huber loss + ℓ₁ norm (million songs, E2006-tfidf), and RASSG vs. SVRG++ on squared hinge + ℓ₁ norm (url) and huber loss + ℓ₁ norm (E2006-log1p); axes are log₁₀(objective gap) vs. number of iterations or CPU time (s).]
Figure 2. Comparison of different algorithms for solving different problems on different datasets.

[Figure 3: the same panels under the second setting of t₁.]
Figure 3. Comparison of different algorithms for solving different problems on different datasets.

[Figure 4: panels (a) and (b) show squared hinge + ℓ₁ norm (url) and huber loss + ℓ₁ norm (E2006-log1p) under both settings, plotted against the number of gradient passes.]
Figure 4. Comparison of different algorithms for solving different problems on different datasets by number of gradient passes.

References

Kakade, Sham M. and Tewari, Ambuj. On the generalization ability of online strongly convex programming algorithms. In NIPS, pp. 801–808, 2008.