
New Optimisation Methods for Machine Learning

Aaron Defazio

(Under Examination)

A thesis submitted for the degree of Doctor of Philosophy of The Australian National University

November 2014

© Aaron Defazio 2014

Except where otherwise indicated, this thesis is my own original work.

Aaron Defazio
7 November 2014


Acknowledgements

I would like to thank several NICTA researchers for conversations and brainstorming sessions during the course of my PhD, particularly Scott Sanner and my supervisor Tiberio Caetano. I would like to thank Justin Domke for many discussions about the Finito algorithm, and his assistance with developing and checking the proof. Likewise, for the SAGA algorithm I would like to thank Francis Bach and Simon Lacoste-Julien for discussion and assistance with the proofs. The SAGA algorithm was discovered in collaboration with them while visiting the INRIA lab, with some financial support from INRIA.

I would also like to thank my family for all their support during the course of my PhD, particularly my mother for giving me a place to stay for part of the duration of the PhD as well as food, love and support. I do not thank her often enough.

I also would like to thank NICTA for their scholarship during the course of the PhD. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.


Abstract

In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation.

For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. The Finito algorithm is best suited to strongly convex problems where the number of terms is of the same order as the condition number of the problem. We prove the fast convergence rate of Finito for strongly convex problems and demonstrate its state-of-the-art empirical performance on 5 datasets. The SAGA algorithm we introduce is complementary to the Finito algorithm. It is more generally applicable, as it can be applied to problems without strong convexity, and to problems that have a non-differentiable regularisation term. In both cases we establish strong convergence rate proofs. It is also better suited to sparser problems than Finito. The SAGA method has a broader and simpler theory than any existing fast method for the problem class of finite sums, in particular it is the first such method that can provably be applied to non-strongly convex problems with non-differentiable regularisers without introduction of additional regularisation.

For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the scale-free property. For the structure learning problem, we establish SHORTCUT, an O(n^2.5) expected time approximate structure learning method for Gaussian graphical models. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models.

Our thesis as a whole introduces a new suite of techniques for machine learning practitioners that increases the size and type of problems that can be efficiently solved. Our work is backed by extensive theory, including proofs of convergence for each method discussed.

Contents

1 Introduction and Overview
1.1 Convex Machine Learning Problems
1.2 Problem Structure and Black Box Methods
1.3 Early & Late Stage Convergence
1.4 Approximations
1.5 Non-differentiability in Machine Learning
1.6 Publications Related to This Thesis

2 Incremental Gradient Methods
2.1 Problem Setup
2.1.1 Exploiting problem structure
2.1.2 Randomness and expected convergence rates
2.1.3 Data access order
2.2 Early Incremental Gradient Methods
2.3 Stochastic Dual Coordinate Descent (SDCA)
2.3.1 Alternative steps
2.3.2 Reducing storage requirements
2.3.3 Accelerated SDCA
2.4 Stochastic Average Gradient (SAG)
2.5 Stochastic Variance Reduced Gradient (SVRG)

3 New Dual Incremental Gradient Methods
3.1 The Finito Algorithm
3.1.1 Additional notation
3.1.2 Method
3.1.3 Storage costs
3.2 Permutation & the Importance of Randomness
3.3 Experiments
3.4 The MISO Method
3.5 A Primal Form of SDCA
3.6 Prox-Finito: a Novel Midpoint Algorithm
3.6.1 Prox-Finito in relation to Finito
3.6.2 Non-Uniform Lipschitz Constants
3.7 Finito Theory
3.7.1 Main proof
3.8 Prox-Finito Theory
3.8.1 Main result
3.8.2 Proof of Theorem 3.4

4 New Primal Incremental Gradient Methods
4.1 Composite Objectives
4.2 SAGA Algorithm
4.3 Relation to Existing Methods
4.3.1 SAG
4.3.2 SVRG
4.3.3 Finito
4.4 Implementation
4.5 Experiments
4.6 SAGA Theory
4.6.1 Linear convergence for strongly convex problems
4.6.2 1/k convergence for non-strongly convex problems
4.7 Understanding the Convergence of the SVRG Method
4.8 Verifying SAGA Constants
4.8.1 Strongly convex step size γ = 1/(2(µ + L))
4.8.2 Strongly convex step size γ = 1/(3L)
4.8.3 Non-strongly convex step size γ = 1/(3L)

5 Access Orders and Complexity Bounds
5.1 Lower Complexity Bounds
5.1.1 Technical assumptions
5.1.2 Simple (1 − 1/n)^k bound
5.1.3 Minimisation of non-strongly convex finite sums
5.1.4 Open problems
5.2 Access Orderings
5.3 MISO Robustness

6 Beyond Finite Sums: Learning Graphical Models
6.1 Beyond the Finite Sum Structure
6.2 The Structure Learning Problem
6.3 Covariance Selection
6.3.1 Direct optimisation approaches
6.3.2 Neighbourhood selection
6.3.3 Thresholding approaches
6.3.4 Conditional thresholding
6.4 Alternative Regularisers

7 Learning Scale Free Networks
7.1 Combinatorial Objective
7.2 Submodularity
7.3 Optimisation
7.3.1 Alternating direction method of multipliers
7.3.2 Proximal operator using dual decomposition
7.4 Alternative Degree Priors
7.5 Experiments
7.5.1 Reconstruction of synthetic networks
7.5.2 Reconstruction of a gene activation network
7.5.3 Runtime comparison: different proximal operator methods
7.5.4 Runtime comparison: submodular relaxation against other approaches
7.6 Proof of Correctness

8 Fast Approximate Structural Inference
8.1 SHORTCUT
8.2 Running Time
8.3 Experiments
8.3.1 Synthetic datasets
8.3.2 Real world datasets
8.4 Theoretical Properties

9 Fast Approximate Parameter Inference
9.1 Model Class
9.1.1 Improper models
9.1.2 Precision matrix restrictions
9.2 An Approximate Constrained Maximum Entropy Learning Algorithm
9.2.1 Maximum Entropy Learning
9.2.2 The Bethe Approximation
9.2.3 Maximum entropy learning of unconstrained Gaussian distributions
9.2.4 Restricted Gaussian distributions
9.3 Maximum Likelihood Learning with Belief Propagation
9.4 Collaborative Filtering
9.5 The Item Graph
9.5.1 Limitations of previous approaches
9.6 The Item Field Model
9.7 Prediction Rule
9.8 Experiments
9.9 Related Work
9.10 Extensions
9.10.1 Missing Data & Kernel Functions
9.10.2 Conditional Random Field Variants

10 Conclusion and Discussion
10.1 Incremental Gradient Methods
10.1.1 Summary of contributions
10.1.2 Applications
10.1.3 Open problems
10.2 Learning Graph Models
10.2.1 Summary of contributions
10.2.2 Applications
10.2.3 Open Problems

A Basic Convexity Theorems
A.1 Definitions
A.2 Useful Properties of Convex Conjugates
A.3 Types of Duality
A.4 Properties of Differentiable Functions
A.5 Convexity Bounds
A.5.1 Taylor like bounds
A.5.2 Gradient difference bounds
A.5.3 Inner product bounds
A.5.4 Strengthened bounds using both Lipschitz and strong convexity

B Miscellaneous Lemmas

Bibliography


Chapter 1

Introduction and Overview

Numerical optimisation is in many ways the core problem in modern machine learning. Virtually all learning problems can be tackled by formulating a real valued objective function expressing some notion of loss or suboptimality which can be optimised over. Indeed, approaches that don't have well founded objective functions are rare, perhaps contrastive divergence (Hinton, 2002) and some sampling schemes being notable examples. Many methods that started as heuristics were able to be significantly improved once well-founded objectives were discovered and exploited, non-tree belief propagation and its relation to the Bethe approximation (Yedidia et al., 2000), and the later development of tree weighted variants (Wainwright et al., 2003), being a notable example.

The core of this thesis is the development of several new numerical optimisation schemes, which either address limitations of existing approaches, or improve on the performance of state-of-the-art algorithms. These methods increase the breadth and depth of machine learning problems that are tractable on modern computers.

1.1 Convex Machine Learning Problems

In this work we particularly focus on problems that have convex objectives. This is a major restriction, and one at the core of much of modern optimisation theory, but one that nevertheless requires justification. The primary reasons for targeting convex problems are their ubiquitousness in applications and the relative ease of solving them. Logistic regression, least-squares, support vector machines, hidden Markov models, conditional random fields and tree-weighted belief propagation all involve convex models. All of these techniques have seen real world application, although their use has been overshadowed in recent years by non-convex models such as neural networks.

Convex optimisation is still of interest when addressing non-convex problems though. Many algorithms that were developed for convex problems, motivated by their provably fast convergence, have later been applied to non-convex problems with good empirical results.

The class of convex numerical problems is sometimes considered synonymous with that of computationally tractable problems.¹ This is no longer necessarily the case in practice, as we can tackle non-convex problems of massive scale using modern approaches (i.e. Dean et al., 2012). Instead, convex problems can be better thought of as the reliably solvable problems. For convex problems we can almost always establish theoretical results giving a practical bound on the amount of computation time required to solve a given convex problem (Nesterov and Nemirovski, 1994). Together with the small or no tuning required by convex optimisation algorithms, they can be used as building blocks within larger programs; details of the problem can be abstracted away from the users. This is not the case for non-convex problems, where known methods require substantial hand tuning.

Given these advantages, many researchers consider convex optimisation a solved problem. This is largely the result of undergraduate courses and text books treating interior point methods as the principal solution to convex problems. This view is particularly prevalent among statisticians, under the reweighted least-squares nomenclature. While Newton's method is strikingly successful on small problems, its approximately cubic running time per iteration resulting from the need to do a linear solve means that it scales extremely poorly to problems with large numbers of variables. It is also unable to directly handle non-differentiable problems common in machine learning. Both of these shortcomings have been addressed to some degree (Nocedal, 1980; Liu and Nocedal, 1989; Andrew and Gao, 2007), by the use of low-rank approximations and tricks for specific non-differentiable structures, although problems remain.

An additional complication is a divergence between the numerical optimisation and machine learning communities. Numerical convex optimisation researchers in the 80s and 90s largely focused on solving problems with large numbers of complex constraints, particularly Quadratic Programming (QP) and Linear Programming (LP) problems. These advances were applicable to the kernel methods of the early 2000s, but at odds with many of the more modern machine learning problems which are characterised by large numbers of potentially non-differentiable terms. The core examples would be linear support vector machines, other max-margin methods and neural networks with non-differentiable activation functions. The problem we address in Chapter 7 also fits into this class.

In this thesis we will focus on smooth optimisation problems that obey the Lipschitz smoothness criterion. A function f is Lipschitz smooth with constant L if its gradients are Lipschitz continuous. That is, for all x, y in R^d:

$$ \| f'(x) - f'(y) \| \le L \| x - y \|. $$

¹ We should note that the general statement that all convex problems are computationally tractable is actually not true in the usual computer science sense. There exist convex problems that are NP-hard, in the sense that problems in the NP-hard class are reducible to convex optimisation problems (i.e. de Klerk and Pasechnik, 2006).

Lipschitz smooth functions are differentiable, and if their Hessian matrix exists it is bounded in spectral norm. The other assumption we will sometimes make is that of strong convexity. A function f is strongly convex with constant µ if for all x, y in R^d and α in [0, 1]:

$$ f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha) f(y) - \alpha(1-\alpha)\frac{\mu}{2}\|x - y\|^2. $$

Essentially, rather than the usual convexity interpolation bound f(αx + (1−α)y) ≤ αf(x) + (1−α)f(y), we have it strengthened by a quadratic term.

1.2 Problem Structure and Black Box Methods

The last few years have seen a resurgence in convex optimisation centred around the technique of exploiting problem structure, an approach we take as well. When no structure is assumed by the optimisation method about the problem other than the degree of convexity, very strong results are known about the best possible convergence rates obtainable. These results date back to the seminal work of Nemirovsky and Yudin (1983) and Nesterov (1998, earlier work in Russian). These results have contributed to the widely held attitude that convex optimisation is a solved problem. But when the problem has some sort of additional structure these worst-case theoretical results are no longer applicable. Indeed, a series of recent results suggest that practically all problems of interest have such structure, allowing advances in theoretical, not just practical convergence. For example, non-differentiable problems under reasonable Lipschitz continuity assumptions can be solved with an error reduction of O(√t) times after t iterations at best, for standard measures of convergence rate (Nesterov, 1998, Theorem 3.2.1). In practice, virtually all non-differentiable problems can be treated by a smoothing transformation, giving an O(t) reduction in error after t iterations when an optimal algorithm is used (Nesterov, 2005).

Many problems of interest have a structure where most terms in the objective involve only a small number of variables. This is the case for example in inference problems on graphical models. In such cases block coordinate descent methods can give better theoretical and practical results (Richtarik and Takac, 2011).

Another exploitable structure involves a sum of two terms F(x) = f(x) + h(x), where the first term f(x) is structurally nice, say smooth and differentiable, but potentially complex to evaluate, and where the second term h(x) is non-differentiable. As long as h(x) is simple in the sense that its proximal operator is easy to evaluate, then algorithms exist with the same theoretical convergence rate as if h(x) was not part of the objective at all (F(x) = f(x)) (Beck and Teboulle, 2009).

[Figure 1.1: Schematic illustration of convergence rates (suboptimality against iteration) for LBFGS, SGD and an incremental gradient method.]

The proximal operator is a key construction in this work, and indeed in modern optimisation theory. It is defined for a function h and constant γ as:

$$ \operatorname{prox}^h_\gamma(v) = \arg\min_x \left\{ h(x) + \frac{\gamma}{2}\|x - v\|^2 \right\}. $$

Some definitions of the proximal operator use the weighting 1/(2γ) instead of γ/2; we use this form throughout this work. The proximal operator is itself an optimisation problem, and so in general it is only useful when the function h is simple. In many cases of interest the proximal operator has a closed form solution.
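As a minimal illustration of one such closed form case, consider h(x) = λ‖x‖₁: under the γ/2 weighting above, the proximal operator reduces to componentwise soft thresholding. The short Python sketch below computes it directly; the function and variable names are illustrative only and are not drawn from the text.

```python
import numpy as np

def prox_l1(v, lam, gamma):
    """Proximal operator of h(x) = lam * ||x||_1 under the weighting
    prox^h_gamma(v) = argmin_x { h(x) + (gamma/2) ||x - v||^2 }.
    The closed form solution is soft thresholding at the level lam / gamma."""
    return np.sign(v) * np.maximum(np.abs(v) - lam / gamma, 0.0)

# Small entries of v are shrunk exactly to zero, larger ones move towards zero.
v = np.array([3.0, -0.2, 0.5, -2.0])
print(prox_l1(v, lam=1.0, gamma=2.0))  # [ 2.5 -0.   0.  -1.5]
```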

The first four chapters of this work focus on quite possibly the simplest problem structure, that of a finite summation. This occurs when there is a large number of terms with similar structure added together or averaged in the objective. Recent results have shown that for strongly convex problems better convergence rates are possible under such summation structures than is possible for black box problems (Schmidt et al., 2013; Shalev-Shwartz and Zhang, 2013b). We provide three new algorithms for this problem structure, discussed in Chapters 3 and 4. We also discuss properties of problems in the finite sum class extensively in Chapter 5.

1.3 Early & Late Stage Convergence

When dealing with problems with a finite sum structure, practitioners have traditionally had to make a key trade-off between stochastic methods, which access the objective one term at a time, and batch methods, which work directly with the full objective. Stochastic methods such as SGD exhibit rapid convergence during early stages of optimisation, yielding a good approximate solution quickly, but this convergence slows down over time; getting a high accuracy solution is nearly impossible with SGD. Fortunately, in machine learning it is often the case that a low accuracy solution gives just as good a result as a high accuracy solution for minimising the test loss on held out data. A high accuracy solution can effectively over-fit to the training data. Running SGD for a small number of epochs is common in practice.

Batch methods on the other hand are slowly converging but steady; if run for long enough they yield a high accuracy solution. For strongly convex problems, the difference in convergence is between an O(1/t) error after t iterations for SGD versus an O(r^t) error (r < 1) for LBFGS², the most popular batch method (Nocedal, 1980). We have illustrated the difference schematically in Figure 1.1. The SGD and LBFGS lines here are typical of simple logistic regression problems, where SGD gives acceptable solutions after 5-10 epochs (passes over the data), whereas LBFGS eventually gives a better solution, taking 30-100 iterations to do so. LBFGS is particularly well suited to use in a distributed computing setting, and it is sometimes the case that LBFGS will give better results ultimately on the test loss, particularly for poorly conditioned (high-curvature) problems.

Figure 1.1 also illustrates the kind of convergence that the recently developed class of incremental gradient methods potentially offers. Incremental gradient methods have the same linear O(r^t) error after t epochs as a batch method, but with a coefficient r dramatically better. The difference is in theory thousands of times faster convergence, and in practice usually 10-20 times better. Essentially incremental gradient methods are able to offer the best of both worlds, having rapid initial convergence without the later stage slow-down of SGD.

Another traditional advantage of batch methods over stochastic methods is their ease of use. Methods such as LBFGS require no hand tuning to be applied to virtually any smooth problem. Some tuning of the memory constant that holds the number of past gradients to remember at each step can give faster convergence, but bad choices of this constant still result in convergence. SGD and other traditional stochastic methods on the other hand require a step size parameter and a parameter annealing schedule to be set. SGD is sensitive to these choices, and will diverge for poor choices.

Incremental gradient methods offer a solution to the tuning problem as well. Most incremental gradient algorithms have only a single step size parameter that needs to be set. Fortunately the convergence rate is fairly robust to the value of this parameter. The SDCA algorithm reduces this to 0 parameters, but at the expense of being limited to problems with efficient to compute proximal operators.

² Quasi-newton methods are often cited as having super-linear convergence. This is only true if the dimensionality of the underlying parameter space is comparable to the number of iterations used. In machine learning the parameter space is usually much larger in effective dimension than the number of iterations.

[Figure 1.2: A Gaussian graphical model defined by the precision matrix P, together with the non-sparse covariance matrix C it induces, with rounding to 1 significant figure. Correlations are indicated by negative edge weights in a Gaussian model.]

1.4 Approximations

The exploitation of problem structure is not always directly possible with the objectives we encounter in machine learning. A case we focus on in this work is the learning of weight parameters in a Gaussian graphical model structure. This is an undirected graph structure with weights associated with both edges and nodes. These weights are the entries of the precision matrix (inverse covariance matrix) of a Gaussian distribution. Absent edges effectively have a weight of zero (Figure 1.2). A formal definition is given in Chapter 6.

A key approach to such problems is the use of approximations that introduce additional structure in the objective which we can exploit. The regularised maximum likelihood objective for fitting a Gaussian graphical model can require time O(n³) to evaluate³. This is prohibitively long on many problems of interest. Instead, approximations can be introduced that decompose the objective, allowing more efficient techniques to be used. In Chapter 9 we show how the Bethe approximation may be applied for learning the edge weights on restricted classes of Gaussian graphical models. This approximation allows for the use of an efficient dual decomposition optimisation method, and has direct practical applicability in the domain of recommendation systems.

Besides parameter learning, the other primary task involving graphs is directly learning the structure. Structure learning for Gaussian graphical models is a problem that has seen a lot of interest in machine learning. The structure can be used in a machine learning pipeline as the precursor to parameter learning, or it can be used for its own sake as an indicator of correlation structure in a dataset.

³ Theoretically it takes time equivalent to the big-O cost of a fast matrix multiplication such as Strassen's algorithm (≈ O(n^2.8)), but in practice simpler O(n³) techniques are used.

The use of approximations in structure learning is more widespread than in parameter learning, and we give an overview of approaches in Chapter 6. We improve on an existing technique in Chapter 8, where we show that an existing approximation can be further approximated, giving a substantial practical and theoretical speed-up by a factor of O(√n).

1.5 Non-differentiability in Machine Learning

As mentioned, machine learning problems tend to have substantial non-differentiable structure compared to the constraint structures more commonly addressed in numerical optimisation. These two forms of structure are in a sense two sides of the same coin, as for convex problems the transformation to the dual problem can often convert from one to the other. The primary example being support vector machines, where non-differentiability in the primal hinge loss is converted to a constraint set when the dual is considered. Recent progress in optimisation has seen the use of proximal methods as the tool of choice for handling both structures in machine learning problems. When using a regularised loss objective of the form F(x) = f(x) + h(x) as mentioned above in Section 1.2, the non-differentiability can be in the regulariser h(x) or the loss term f(x). We introduce methods addressing both cases in this work.

The SAGA algorithm of Chapter 4 is a new primal method, the first primal incremental gradient method able to be used on non-strongly convex problems with non-differentiable regularisers directly. It makes use of the proximal operator of the regulariser. It can also be used on problems with constraints, where the function h(x) is the indicator function of the constraint set, and the proximal operator is projection onto the constraint set.

In this work we also introduce a new non-differentiable regulariser for the above mentioned graph structure learning problem, which can also be attacked using proximal methods. Its non-differentiable structure is substantially more complex than other regularisers used in machine learning, requiring a special optimisation procedure to be used just to evaluate the proximal operator (typically proximal operators in machine learning have closed form solutions).

For non-differentiable losses, we introduce the Prox-Finito algorithm (Section 3.6). This incremental gradient algorithm uses the proximal operator of the single datapoint loss. It provides a bridge between the Finito algorithm (Section 3.1) and the SDCA algorithm (Shalev-Shwartz and Zhang, 2013b), having properties of both methods.

1.6 Publications Related to This Thesis

The majority of the content in this thesis has been published as conference articles. For the work on incremental gradient methods, the Finito method has been published as Defazio et al. (2014b), and the SAGA method as Defazio et al. (2014a). Chapters 3 & 4 contain much more detailed theory than has been previously published. Some of the discussion in Chapter 5 appears in Defazio et al. (2014b) also. For the portion of this thesis on Gaussian graphical models, Chapter 7 largely follows the publication Defazio and Caetano (2012a). Chapter 9 is based on the work in Defazio and Caetano (2012b), although heavily revised.

Chapter 2

Incremental Gradient Methods

In this chapter we give an introduction to the class of incremental gradient (IG) methods. Incremental gradient methods are simply the class of methods that take advantage of any known summation structure in an optimisation objective by accessing the objective one term at a time. Objectives that are decomposable as a sum of a number of terms come up often in applied mathematics and scientific computing, but are particularly prevalent in machine learning applications. Research in the last two decades on optimisation problems with a summation structure has focused more on the stochastic approximation setting, where the summation is assumed to be over an infinite set of terms. The finite sum case that incremental gradient methods cover has seen a resurgence in recent years after the discovery that there exist fast incremental gradient methods whose convergence rates are better than any possible black box method for finite sums with particular (common) structures.

We provide an extensive overview of all known fast incremental gradient methods in the later parts of this chapter. Building on the described methods, in Chapters 3 & 4 we introduce three novel fast incremental gradient methods. Depending on the problem structure, each of these methods can have state-of-the-art performance.

2.1 Problem Setup

We are interested in minimising functions of the form

$$ f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), $$

where x ∈ R^d and each f_i is convex and Lipschitz smooth with constant L. We will also consider the case where each f_i is additionally strongly convex with constant µ. Incremental gradient methods are algorithms that at each step evaluate the gradient and function value of only a single f_i.
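To make the finite sum structure concrete, the sketch below builds such an objective for l2-regularised binary logistic regression, exposing the per-term evaluations (f_i(x), f_i'(x)) that incremental gradient methods work with. It is an illustrative example only (the helper names and the choice of loss are ours, not taken from the text); each term is µ-strongly convex and Lipschitz smooth as assumed above.

```python
import numpy as np

def make_finite_sum(A, b, mu):
    """Finite sum objective f(x) = (1/n) sum_i f_i(x) with
    f_i(x) = log(1 + exp(-b_i <a_i, x>)) + (mu/2) ||x||^2."""
    n = A.shape[0]

    def f_i(x, i):
        return np.log1p(np.exp(-b[i] * A[i].dot(x))) + 0.5 * mu * x.dot(x)

    def grad_f_i(x, i):
        sigma = 1.0 / (1.0 + np.exp(-b[i] * A[i].dot(x)))  # sigmoid of the margin
        return -(1.0 - sigma) * b[i] * A[i] + mu * x

    def f(x):
        # Full objective; an incremental method would rarely evaluate this.
        return np.mean([f_i(x, i) for i in range(n)])

    return f, f_i, grad_f_i
```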

We will measure convergence rates in terms of the number of (f_i(x), f_i'(x)) evaluations; normally these are much cheaper computationally than evaluations of the whole function gradient f', such as performed by the gradient descent algorithm. We use the notation x* to denote a minimiser of f. For strongly convex problems this is the unique minimiser.

This setup differs from the traditional black box smooth convex optimisation problem only in that we are assuming that our function is decomposable into a finite sum structure. This finite sum structure is widespread in machine learning applications. For example, the standard framework of Empirical Risk Minimisation (ERM) takes this form, where for a loss function L : R^d × R → R and data label tuples (x_i, y_i), we have:

$$ R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i), $$

where h is the hypothesis function that we intend to optimise over. The most common case of ERM is minimisation of the negative log-likelihood, for instance the classical logistic regression problem.

2.1.1 Exploiting problem structure

Given the very general nature of the finite sum structure, we can not expect to get faster convergence than we would by accessing the whole gradient without additional assumptions. For example, suppose the summation only has one term, or alternatively each f_i is the zero function except one of them. Notice that the Lipschitz smoothness and strong convexity assumptions we made are on each f_i rather than on f. This is a key point. If the directions of maximum curvature of each term are aligned and of similar magnitude, then we can expect the term Lipschitz smoothness to be similar to the smoothness of the whole function. However, it is easy to construct problems for which this is not the case, in fact the Lipschitz smoothness of f may be n times smaller than that of each f_i. In that case the incremental gradient methods will give no improvement over black box optimisation methods.

For machine learning problems, and particularly the empirical risk minimisation problem, this worst case behaviour is not common. The curvature and hence the Lipschitz constants are defined largely by the loss function, which is shared between the terms, rather than the data point. Common data preprocessing methods such as data whitening can improve this even further.

The requirement that the magnitude of the Lipschitz constants be approximately balanced can be relaxed in some cases. It is possible to formulate IG methods where the convergence is stated in terms of the average of the Lipschitz constants of the f_i instead of the maximum.

This is the case for the Prox-Finito algorithm described in Section 3.6. All known methods that make use of the average Lipschitz constant require knowledge of the ratios of the Lipschitz constants of the f_i terms, which limits their practicality unfortunately.

Regardless of the condition number of the problem, if we have a summation with enough terms optimisation becomes easy. This is made precise in the definition that follows.

Definition 2.1. The big data condition: For some known constant β,

$$ n \ge \beta \frac{L}{\mu}. $$

This condition obviously requires strong convexity and Lipschitz smoothness so that L/µ is well defined. It is a very strong assumption for small n, as the condition number L/µ in typical machine learning problems is at least in the thousands. For applications of this assumption, β is typically between 1 and 8. Several of the methods we describe below have a fixed and very fast convergence rate independent of the condition number when this big-data condition holds.

2.1.2 Randomness and expected convergence rates

This thesis works extensively with optimisation methods that make random decisions during the course of the algorithm. Unlike the stochastic approximation setting, we are dealing with deterministic, known optimisation problems; the stochasticity is introduced by our optimisation methods, it is not inherent in the problem. We introduce randomness because it allows us to get convergence rates faster than that of any currently known deterministic methods. The caveat is that these convergence rates are in expectation, so they don't always hold precisely.

This is not as bad as it first seems though. Determining that the expectation of a general random variable converges is normally quite a weak result, as its value may vary around the expectation substantially in practice, potentially by far more than it converges by. The reason why this is not an issue for the optimisation methods we consider is that all the random variables we bound are non-negative. A non-negative random variable X with a very small expectation, say E[X] = 10^{-5}, is with high probability close to its expectation. This is a fundamental result implied by Markov's inequality. For example, suppose E[X] = 10^{-5} and we want to bound the probability that X is greater than 10^{-3}, i.e. a factor of 100 worse than its expectation. Then Markov's inequality tells us that:

$$ P(X \ge 10^{-3}) \le \frac{1}{100}. $$

So there is only a 1% chance of X being larger than 100 times its expected value here. We will largely focus on methods with linear convergence in the following chapters, so in order to increase the probability of the value X holding by a factor r, only a logarithmic number of additional iterations in r is required (O(log r)). We would also like to note that Markov's inequality can be quite conservative. Our experiments in later chapters show little in the way of random noise attributable to the optimisation procedure, particularly when the amount of data is large.

2.1.3 Data access order

The source of randomness in all the methods considered in this chapter is the order of accessing the f_i terms. By access we mean the evaluation of f_i(x) and f_i'(x) at an x of our choice. This is more formally known as an oracle evaluation (see Section 5.1), and typically constitutes the most computationally expensive part of the main loop of each algorithm we consider. The access order is defined on a per-epoch basis, where an epoch is n evaluations. Only three different access orders are considered in this work:

Cyclic: Each step uses j = 1 + (k mod n). Effectively we access the f_i in the order they appear, then loop back to the beginning at the end of every epoch.

Permuted: Within each epoch, j is sampled without replacement from the set of indices not accessed yet in that epoch. This is equivalent to permuting the f_i at the beginning of each epoch, then using the cyclic order within the epoch.

Randomised: The value of j is sampled uniformly at random with replacement from 1, ..., n.

The permuted terminology is our nomenclature, whereas the other two terms are standard.

2.2 Early Incremental Gradient Methods

The classical incremental gradient (IG) method is simply a step of the form:

$$ x^{k+1} = x^k - \gamma_k f'_j(x^k), $$

where at step k we use cyclic access, taking j = 1 + (k mod n). This is similar to the more well known stochastic gradient descent, but with a cyclic order of access of the data instead of a random order.
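As an illustrative sketch (reusing the hypothetical grad_f_i helper from the earlier example, and a constant step size rather than a decreasing schedule), the classical IG/SGD step with the three access orders described above can be written as follows; only the way the index j is drawn differs between the variants.

```python
import numpy as np

def incremental_gradient(grad_f_i, x0, n, gamma, epochs, order="cyclic", seed=0):
    """x <- x - gamma * f_j'(x), with j drawn by the chosen access order."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        if order == "cyclic":
            indices = np.arange(n)
        elif order == "permuted":
            indices = rng.permutation(n)     # without replacement within the epoch
        elif order == "randomised":
            indices = rng.integers(0, n, n)  # with replacement
        else:
            raise ValueError(order)
        for j in indices:
            x = x - gamma * grad_f_i(x, j)
    return x
```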

We have introduced here a superscript notation x^k for the variable x at step k. We use this notation throughout this work.

It turns out to be much easier to analyse such methods under a random access ordering. For the random order IG method (i.e. SGD) on smooth strongly convex problems, the following rate holds for appropriately chosen step sizes:

$$ E\left[ f(x^k) - f(x^*) \right] \le \frac{L}{2k} \left\| x^0 - x^* \right\|^2. $$

The step size scheme required is of the form γ_k = q/k, where q is a constant that depends on the gradient norm bound R as well as the degree of strong convexity µ. It may be required to be quite small in some cases. This is what is known as a sublinear rate of convergence, as the dependence on k is of the form O(L/2k), which is slower than the linear rate O((1 − a)^k) for any a ∈ (0, 1) asymptotically.

Incremental gradient methods for strongly convex smooth problems were of little interest up until the development of fast variants (discussed below), as the sublinear rates for the previously known methods did not compare favourably to the linear rate of quasi-Newton methods. For non-strongly convex problems, or strongly convex but non-smooth problems, the story is quite different. In those cases, the theoretical and practical rates are hard to beat with full (sub-)gradient methods. The non-convex case is of particular interest in machine learning. SGD has been the de facto standard optimisation method for neural networks for example since the 1980s (Rumelhart et al., 1986). Such incremental gradient methods have a long history, having been applied to specific problems as far back as the 1960s (Widrow and Hoff, 1960). An up-to-date survey can be found in Bertsekas (2012).

2.3 Stochastic Dual Coordinate Descent (SDCA)

The stochastic dual coordinate descent method (Shalev-Shwartz and Zhang, 2013b) is based on the principle that for problems with explicit quadratic regularisers, the dual takes a particularly easy to work with form. Recall the finite sum structure f(x) = (1/n) Σ_i f_i(x) defined earlier. Instead of assuming that each f_i is strongly convex, we instead need to consider the regularised objective:

$$ f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \frac{\mu}{2}\|x\|^2. $$

For any strongly convex f_i, we may transform our function to this form by replacing each f_i with f_i − (µ/2)‖x‖², then including a separate regulariser.

This changes the Lipschitz smoothness constant for each f_i to L − µ, and preserves convexity. We are now ready to consider the dual transformation. We apply the technique of dual decomposition, where we decouple the terms in our objective as follows:

$$ \min_{x, x_1, \ldots, x_i, \ldots, x_n} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) + \frac{\mu}{2}\|x\|^2, \quad \text{s.t.} \;\; x_i = x \;\; \text{for all } i. $$

This reformulation initially achieves nothing, but the key idea is that we now have a constrained optimisation problem, and so we may apply Lagrangian duality (Section A.3). The Lagrangian function is:

$$ \mathcal{L}(x, x_1, \ldots, \alpha_1, \ldots) = \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) + \frac{\mu}{2}\|x\|^2 + \frac{1}{n}\sum_{i=1}^{n} \langle \alpha_i, x - x_i \rangle = \frac{1}{n}\sum_{i=1}^{n} \left( f_i(x_i) - \langle \alpha_i, x_i \rangle \right) + \frac{\mu}{2}\|x\|^2 + \left\langle \frac{1}{n}\sum_{i=1}^{n} \alpha_i, x \right\rangle, \quad (2.1) $$

where α_i ∈ R^d are the introduced dual variables. The Lagrangian dual function is formed by taking the minimum of L with respect to each x_i and x, leaving α, the set of α_i for i = 1 ... n, free:

$$ D(\alpha) = \frac{1}{n}\sum_{i=1}^{n} \min_{x_i} \left\{ f_i(x_i) - \langle \alpha_i, x_i \rangle \right\} + \min_{x} \left\{ \frac{\mu}{2}\|x\|^2 + \left\langle \frac{1}{n}\sum_{i=1}^{n} \alpha_i, x \right\rangle \right\}. \quad (2.2) $$

Now recall that the definition of the convex conjugate (Section A.2) says that:

$$ \min_x \left\{ f(x) - \langle a, x \rangle \right\} = -\sup_x \left\{ \langle a, x \rangle - f(x) \right\} = -f^*(a). $$

Clearly we can plug this in for each f_i to get:

$$ D(\alpha) = -\frac{1}{n}\sum_{i=1}^{n} f_i^*(\alpha_i) + \min_{x} \left\{ \frac{\mu}{2}\|x\|^2 + \left\langle \frac{1}{n}\sum_{i=1}^{n} \alpha_i, x \right\rangle \right\}. $$

We still need to simplify the remaining min term, which is also in the form of a convex conjugate. We know that squared norms are self-conjugate, and scaling a function by a positive constant b transforms its conjugate from f^*(a) to b f^*(a/b), so we in fact have:

$$ D(\alpha) = -\frac{1}{n}\sum_{i=1}^{n} f_i^*(\alpha_i) - \frac{\mu}{2} \left\| \frac{1}{\mu n}\sum_{i=1}^{n} \alpha_i \right\|^2. $$

This is the objective directly maximised by SDCA. As the name implies, SDCA is randomised (block) coordinate ascent on this objective, where only one α_i is changed each step. In coordinate descent we have the option of performing a gradient step in a coordinate direction, or an exact minimisation. For the exact coordinate minimisation, the update is easy to derive:

$$ \alpha_j^{k+1} = \arg\min_{\alpha_j} \left[ \frac{1}{n}\sum_{i=1}^{n} f_i^*(\alpha_i) + \frac{\mu}{2} \left\| \frac{1}{\mu n}\sum_{i=1}^{n} \alpha_i \right\|^2 \right] = \arg\min_{\alpha_j} \left[ f_j^*(\alpha_j) + \frac{\mu n}{2} \left\| \frac{1}{\mu n}\sum_{i=1}^{n} \alpha_i \right\|^2 \right]. \quad (2.3) $$

The primal point x^k corresponding to the dual variables α^k at step k is the minimiser of the conjugate problem x^k = arg min_x { (µ/2)‖x‖² + ⟨(1/n) Σ_i α_i^k, x⟩ }, which in closed form is simply x^k = −(1/(µn)) Σ_i α_i^k. This can be used to further simplify Equation 2.3. The full method is Algorithm 2.1.

Algorithm 2.1 SDCA (exact coordinate descent)
Initialise x^0 and α_i^0 as the zero vector, for all i.
Step k+1:
1. Pick an index j uniformly at random.
2. Update α_j, leaving the other α_i unchanged:
$$ \alpha_j^{k+1} = \arg\min_{y} \left[ f_j^*(y) + \frac{\mu n}{2} \left\| x^k - \frac{1}{\mu n}\left( y - \alpha_j^k \right) \right\|^2 \right]. $$
3. Update x^{k+1} = x^k − (1/(µn)) (α_j^{k+1} − α_j^k).
At completion, for smooth f_i return x^k. For non-smooth, return a tail average of the x^k sequence.

The SDCA method has a simple geometric convergence rate in the dual objective D of the form:

$$ E\left[ D(\alpha^*) - D(\alpha^k) \right] \le \left( 1 - \frac{\mu}{L + n\mu} \right)^k \left[ D(\alpha^*) - D(\alpha^0) \right]. $$

This is easily extended to a statement about the duality gap f(x^k) − D(α^k), and hence the suboptimality f(x^k) − f(x^*), by using the relation:

$$ f(x^k) - D(\alpha^k) \le \frac{L + n\mu}{\mu} \left( D(\alpha^*) - D(\alpha^k) \right). $$

2.3.1 Alternative steps

The full coordinate minimisation step discussed in the previous section is not always practical. If we are treating each element f_i in the summation (1/n) Σ_i f_i(x) as a single data point loss, then even for the simple binary logistic loss there is not a closed form solution for the exact coordinate step. We can use a black-box 1D optimisation method to find the coordinate minimiser, but this will generally require 20-30 exponential function evaluations, together with one vector dot product. For multiclass logistic loss, the subproblem solve is not fast enough to yield a usable algorithm.

In the case of non-differentiable losses, the situation is better. Most non-differentiable functions we use in machine learning, such as the hinge loss, yield closed form solutions. For performance reasons we often want to treat each f_i as a minibatch loss, in which case we virtually never have a closed form solution for the subproblem, even in the non-differentiable case.

Shalev-Shwartz and Zhang (2013a) describe a number of other possible steps which lead to the same theoretical convergence rate as the exact minimisation step, but which are more usable in practice:

Interval line search: It turns out that it is sufficient to perform the minimisation in Equation 2.3 along the interval between the current dual variable α_j^k and the point u = f_j'(x^k). The update takes the form:

$$ s = \arg\min_{s \in [0,1]} \left[ f_j^*\left( \alpha_j^k + s(u - \alpha_j^k) \right) + \frac{\mu n}{2} \left\| x^k - \frac{s}{\mu n}\left( u - \alpha_j^k \right) \right\|^2 \right], \qquad \alpha_j^{k+1} = \alpha_j^k + s(u - \alpha_j^k). $$

Constant step: If the value of the Lipschitz smoothness constant L is known, we can calculate a conservative value for the parameter s instead of optimising over it with an interval line search. This gives an update of the form:

$$ \alpha_j^{k+1} = \alpha_j^k + s(u - \alpha_j^k) \quad \text{where} \quad s = \frac{n\mu}{n\mu + L}. $$

This method is much slower in practice than performing a line-search, just as a 1/L step size with gradient descent is much slower than performing a line search.
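For a rough picture of how the constant step variant fits together, here is a sketch in the general (full dimensional dual variable) form of this section, using the sign conventions established above. It is an illustrative reconstruction rather than a reference implementation, and grad_f_i again denotes a hypothetical helper returning f_i'.

```python
import numpy as np

def sdca_constant_step(grad_f_i, d, n, mu, L, epochs, seed=0):
    """SDCA with the conservative constant step s = n*mu / (n*mu + L).
    One d-dimensional dual variable alpha_i is stored per term, and the
    primal point is maintained as x = -(1/(mu*n)) * sum_i alpha_i."""
    rng = np.random.default_rng(seed)
    alpha = np.zeros((n, d))   # dual variables, all zero initially
    x = np.zeros(d)            # consistent with alpha = 0
    s = n * mu / (n * mu + L)
    for _ in range(epochs * n):
        j = rng.integers(n)
        u = grad_f_i(x, j)                 # candidate dual point u = f_j'(x^k)
        delta = s * (u - alpha[j])
        alpha[j] += delta
        x -= delta / (mu * n)              # keep x and alpha consistent
    return x
```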

2.3.2 Reducing storage requirements

We have presented the SDCA algorithm in full generality above. This results in dual variables α_i of dimension d, for which the total storage nd can be prohibitive. In practice, the dual variables often lie on a low-dimensional subspace. This is the case with linear classifiers and regressors, where an r class problem has gradients on an r dimensional subspace. A linear classifier takes the form f_i(x) = φ_i(X_i^T x), for a fixed loss φ_i : R^r → R and data instance matrix X_i : d × r. In the simplest case X_i is just the data point duplicated as r rows. Then the dual variables α_i are r dimensional, and the x^k updates change to:

$$ x^k = -\frac{1}{\mu n} \sum_{i=1}^{n} X_i \alpha_i, \qquad \alpha_j^{k+1} = \arg\min_{\alpha} \left[ \phi_j^*(\alpha) + \frac{\mu n}{2} \left\| x^k - \frac{1}{\mu n} X_j \left( \alpha - \alpha_j^k \right) \right\|^2 \right]. $$

This is the form of SDCA presented by Shalev-Shwartz and Zhang (2013a), although with the negation of our dual variables.

2.3.3 Accelerated SDCA

The SDCA method is also currently the only fast incremental gradient method to have a known accelerated variant. By acceleration, we refer to the modification of an optimisation method to improve the convergence rate by an amount greater than any constant factor. This terminology is common in optimisation although a precise definition is not normally given.

The ASDCA method (Shalev-Shwartz and Zhang, 2013a) works by utilising the regular SDCA method as a sub-procedure. It has an outer loop, which at each step invokes SDCA on a modified problem

$$ x^{k+1} = \arg\min_x \left\{ f(x) + \frac{\lambda}{2}\|x - y\|^2 \right\}, $$

where y is chosen as an over-relaxed step of the form y = x^k + β(x^k − x^{k−1}), for some known constant β. The constant λ is likewise computed from the Lipschitz smoothness and strong convexity constants. These regularised sub-problems f(x) + (λ/2)‖x − y‖² have a greater degree of strong convexity than f(x), and so individually are much faster to solve. By a careful choice of the accuracy to which they are computed, the total number of steps made between all the subproblem solves is much smaller than would be required if regular SDCA is applied directly to f(x) to reach the same accuracy.

8 Icremetal Gradet Methods much smaller tha would be requred f regular SDCA s appled drectly to f (x) to reach the same accuracy. I partcular, they state that to reach a accuracy of e expectato for the fucto value, they eed k teratos, where: k = Õ d + m ( s )! dl µ, d L log(/e). µ The Õ otato hdes costat factors. Ths rate s ot of the same precse form as the other covergece rates wll dscuss ths chapter. We ca make some geeral statemets though. Whe s the rage of the bg-data codto, ths rate s o better tha regular SDCA s rate, ad probably worse practce due to overheads hdde by the Õ otato. Whe s much smaller tha L µ, the potetally t ca be much faster tha regular SDCA. Ufortuately, the ASDCA procedure has sgfcat computatoal overheads that make t ot ecessarly the best choce practce. Probably the bggest ssue however s a sestvty to the Lpschtz smoothess ad strog covexty costats. It assumes these are kow, ad f the used values dffer from the true values, t may be sgfcatly slower tha regular SDCA. I cotrast, regular SDCA requres o kowledge of the Lpschtz smoothess costats (for the prox varat at least), just the strog covexty (regularsato) costat. 2.4 Stochastc Average Gradet (SAG) The SAG algorthm (Schmdt et al., 203) s the closest form to the classcal SGD algorthm amog the fast cremetal gradet methods. Istead of storg dual varables a lke SDCA above, we store a table of past gradets y, whch has the same storage cost geeral, d. The SAG method s gve Algorthm 2.2. They key equato for SAG s the step: x k+ = x k g  y k. Essetally we move the drecto of the average of the past gradets. Note that ths average cotas oe past gradet for each term, ad they are equally weghted. Ths ca be cotrasted to the SGD method wth mometum, whch uses a geometrcally decayg weghted sum of all past gradet evaluatos. SGD wth mometum however s ot a learly coverget method. It s surprsg that usg equal weghts lke ths actually yelds a much faster covergg algorthm, eve though some of the gradets the summato ca be extremely out of date.

Algorithm 2.2 SAG
Initialise x^0 as the zero vector, and y_i = f_i'(x^0) for each i.
Step k+1:
1. Pick an index j uniformly at random.
2. Update x using step length constant γ:
$$ x^{k+1} = x^k - \frac{\gamma}{n} \sum_{i=1}^{n} y_i^k. $$
3. Set y_j^{k+1} = f_j'(x^{k+1}). Leave y_i^{k+1} = y_i^k for i ≠ j.

SAG is an evolution of the earlier incremental averaged gradient method (IAG, Blatt et al., 2007), which has the same update with a different constant factor, and with cyclic access used instead of randomised. It has a more limited convergence theory covering quadratic or bounded gradient problems, and a much slower rate of convergence.

The convergence rate of SAG for strongly convex problems is of the same order as SDCA, although the constants are not quite as good. In particular, we have an expected convergence rate in terms of function value suboptimality of:

$$ E[f(x^k) - f(x^*)] \le \left( 1 - \min\left\{ \frac{1}{8n}, \frac{\mu}{16L} \right\} \right)^k L_0, $$

where L_0 is a complex expression involving f evaluated at x^0 − (γ/n) Σ_i y_i^0 and a quadratic form of x^0 and each y_i^0. This theoretical convergence rate is between 8 and 16 times worse than SDCA. In practice SAG is often faster than SDCA though, suggesting that the SAG theory is not tight.

A nice feature of SAG is that unlike SDCA, it can be directly applied to non-strongly convex problems. Differentiability is still required though. The convergence rate is then in terms of the average iterate x̄^k = (1/k) Σ_{l=1}^{k} x^l:

$$ E[f(\bar{x}^k) - f(x^*)] \le \frac{32n}{k} L_0. $$

The SAG algorithm has great practical performance, but it is surprisingly difficult to analyse theoretically. The above rates are likely conservative by a factor of between 4 and 8. Due to the difficulty of analysis, the proximal version for composite losses has not yet had its theoretical convergence established.
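A minimal sketch of the SAG update (illustrative only, not the authors' released implementation) maintains the gradient table y_i together with its running sum, so each iteration costs one term gradient plus O(d) bookkeeping:

```python
import numpy as np

def sag(grad_f_i, d, n, gamma, epochs, seed=0):
    """Stochastic Average Gradient, following Algorithm 2.2."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    y = np.array([grad_f_i(x, i) for i in range(n)])  # table of past gradients
    y_sum = y.sum(axis=0)                             # running sum of the table
    for _ in range(epochs * n):
        j = rng.integers(n)
        x = x - (gamma / n) * y_sum        # step along the average of the table
        new_grad = grad_f_i(x, j)          # refresh a single table entry
        y_sum += new_grad - y[j]
        y[j] = new_grad
    return x
```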

2.5 Stochastic Variance Reduced Gradient (SVRG)

The SVRG method (Johnson and Zhang, 2013) is a recently developed fast incremental gradient method. It was developed to address the potentially high storage costs of SDCA and SAG, by trading off storage against computation. The SVRG method is given in Algorithm 2.3.

Algorithm 2.3 SVRG
Initialise x^0 as the zero vector, g^0 = (1/n) Σ_i f_i'(x^0) and x̃^0 = x^0.
Step k+1:
1. Pick j uniformly at random.
2. Update x:
$$ x^{k+1} = x^k - \eta f_j'(x^k) + \eta f_j'(\tilde{x}^k) - \eta g^k. $$
3. Every m iterations, set x̃ and recalculate the full gradient at that point:
x̃^{k+1} = x^{k+1},  g^{k+1} = (1/n) Σ_i f_i'(x̃^{k+1}).
Otherwise leave x̃^{k+1} = x̃^k and g^{k+1} = g^k.
At completion return x̃.

Unlike the other methods discussed, there is a tunable parameter m, which specifies the number of iterations to complete before the current gradient approximation is recalibrated by computing a full gradient f'(x̃) at the last iterate before the recalibration, x̃ := x^k. Essentially, instead of maintaining a table of past gradients y_i for each i like SAG does, the algorithm just stores the location x̃ at which those gradients should be evaluated, then re-evaluates them when needed by just computing f_j'(x̃). Like the SAG algorithm, at each step we need to know the updated term gradient f_j'(x^k), the old term gradient f_j'(x̃) and the average of the old gradients f'(x̃). Since we are not storing the old term gradients, just their average, we need to calculate two term gradients instead of the one term gradient calculated by SAG at each step.

The S2GD method (Konečný and Richtárik, 2013) was concurrently developed with SVRG. It has the same update as SVRG, just differing in the theoretical choice of x̃ discussed in the next paragraph. We use SVRG henceforth to refer to both methods.

The update x̃^{k+1} = x^{k+1} in step 3 above is technically not supported by the theory. Instead, one of the following two updates are used:

1. x̃ is the average of the x values from the last m iterations. This is the variant suggested by Johnson and Zhang (2013).

2. x̃ is a randomly sampled x from the last m iterations. This is used in the S2GD variant (Konečný and Richtárik, 2013).

These alternative updates are required theoretically as the convergence between recalibrations is expressed in terms of the sum of function values of the last m points, Σ_{r=k−m}^{k} [f(x^r) − f(x^*)], instead of in terms of f(x^k) − f(x^*) directly. Variant 1 avoids this issue by using Jensen's inequality to pull the summation inside:

$$ \frac{1}{m} \sum_{r=k-m}^{k} \left[ f(x^r) - f(x^*) \right] \ge f\left( \frac{1}{m} \sum_{r=k-m}^{k} x^r \right) - f(x^*). $$

Variant 2 uses a sampled x, which in expectation will also have the required value. In practice, there is a very high probability that f(x^k) − f(x^*) is less than the last-m sum, so just taking x̃ = x^k works.

The SVRG method has the following convergence rate if k is a multiple of m:

$$ E[f(\tilde{x}^k) - f(x^*)] \le \rho^{k/m} \left[ f(\tilde{x}^0) - f(x^*) \right], \quad \text{where} \quad \rho = \frac{1}{\mu\eta(1 - 4L\eta)m} + \frac{4L\eta(m+1)}{(1 - 4L\eta)m}. $$

Note also that each step requires two term gradients, so the rate must be halved when comparing against the other methods described in this chapter. There is also the cost of the recalibration pass, which (depending on m) can further increase the run time to three times that of SAG per step. This convergence rate has quite a different form from that of the other methods considered in this section, making direct comparison difficult. However, for most parameter values this theoretical rate is worse than that of the other fast incremental gradient methods. In Section 4.7 we give an analysis of SVRG that requires additional assumptions, but gives a rate that is directly comparable to the other fast incremental gradient methods.
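The inner/outer structure of SVRG is easy to see in code. The sketch below is a simplified illustration that uses the practical recalibration choice x̃ = last iterate discussed above (rather than the averaged or sampled variants required by the theory); the names and structure are ours, not taken from the text.

```python
import numpy as np

def svrg(grad_f_i, d, n, eta, m, outer_iters, seed=0):
    """SVRG: variance-reduced inner steps between full gradient recalibrations."""
    rng = np.random.default_rng(seed)
    x_tilde = np.zeros(d)
    for _ in range(outer_iters):
        # Recalibration pass: full gradient at the snapshot point x_tilde.
        g = sum(grad_f_i(x_tilde, i) for i in range(n)) / n
        x = x_tilde.copy()
        for _ in range(m):
            j = rng.integers(n)
            # Variance reduced gradient: f_j'(x) - f_j'(x_tilde) + g.
            x = x - eta * (grad_f_i(x, j) - grad_f_i(x_tilde, j) + g)
        x_tilde = x  # practical choice; theory uses an average or sampled iterate
    return x_tilde
```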


Chapter 3

New Dual Incremental Gradient Methods

In this chapter we introduce a novel fast incremental gradient method for strongly convex problems that we call Finito. Like SDCA, SVRG and SAG, Finito is a stochastic method that is able to achieve linear convergence rates for strongly convex problems. Although the Finito algorithm only uses primal quantities directly, the proof of its convergence rate uses lower bounds extensively, so it can be considered a dual method, like SDCA. Similar to SDCA, its theory does not support its use on non-strongly convex problems, although there are no practical issues with its application.

In Section 3.7 we prove the convergence rate of the Finito method under the big-data condition described in the previous chapter. This theoretical rate is better than the SAG and SVRG rates but not quite as good as the SDCA rate. In Section 3.3 we compare Finito empirically against SAG and SDCA, showing that it converges faster, particularly if the permuted access order is used.

The relationship between Finito and SDCA allows a kind of midpoint algorithm to be constructed, which has favourable properties of both methods. We call this midpoint Prox-Finito. It is described in Section 3.6. An earlier version of the work in this chapter has been published as Defazio et al. (2014b).

3.1 The Finito Algorithm

As discussed in Chapter 2, we are interested in convex functions of the form

$$ f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w). $$