arXiv:1502.06177v1 [cs.LG] 22 Feb 2015


SDCA without Duality

Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel

arXiv:1502.06177v1 [cs.LG] 22 Feb 2015

Abstract

Stochastic Dual Coordinate Ascent is a popular method for solving regularized loss minimization for the case of convex losses. In this paper we show how a variant of SDCA can be applied for non-convex losses. We prove a linear convergence rate even if individual loss functions are non-convex, as long as the expected loss is convex.

1 Introduction

The following regularized loss minimization problem is associated with many machine learning methods:

$$\min_{w \in \mathbb{R}^d} P(w) := \frac{1}{n}\sum_{i=1}^{n} \phi_i(w) + \frac{\lambda}{2}\|w\|^2 .$$

One of the most popular methods for solving this problem is Stochastic Dual Coordinate Ascent (SDCA). [8] analyzed this method, and showed that when each $\phi_i$ is $L$-smooth and convex, the convergence rate of SDCA is $\tilde{O}\big((L/\lambda + n)\log(1/\epsilon)\big)$.

As its name indicates, SDCA is derived by considering a dual problem. In this paper, we consider the possibility of applying SDCA to problems in which individual $\phi_i$ are non-convex, e.g., deep learning optimization problems. In many such cases, the dual problem is meaningless. Instead of directly using the dual problem, we describe and analyze a variant of SDCA in which only gradients of $\phi_i$ are being used (similar to option 5 in the pseudo code of Prox-SDCA given in [6]). Following [3], we show that SDCA is a variant of Stochastic Gradient Descent (SGD), that is, its update is based on an unbiased estimate of the gradient. But, unlike vanilla SGD, for SDCA the variance of the estimation of the gradient tends to zero as we converge to a minimum.

For the case in which each $\phi_i$ is $L$-smooth and convex, we derive the same linear convergence rate of $\tilde{O}\big((L/\lambda + n)\log(1/\epsilon)\big)$ as in [8], but with a simpler, direct, dual-free proof. We also provide a linear convergence rate for the case in which individual $\phi_i$ can be non-convex, as long as the average of the $\phi_i$ is convex. The rate for non-convex losses has a worse dependence on $L/\lambda$, and we leave it open to see if a better rate can be obtained for the non-convex case.

Related work: In recent years, many methods for optimizing regularized loss minimization problems have been proposed. For example, SAG [5], SVRG [3], Finito [2], SAGA [1], and S2GD [4]. The best convergence rate is for accelerated SDCA [6]. A systematic study of the convergence rate of the different methods under non-convex losses is left to future work.

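2 SDCA without Duality

We maintain pseudo-dual vectors $\alpha_1, \ldots, \alpha_n$, where each $\alpha_i \in \mathbb{R}^d$.

Dual-Free SDCA$(P, T, \eta, \alpha^{(0)})$
Goal: Minimize $P(w) = \frac{1}{n}\sum_{i=1}^{n} \phi_i(w) + \frac{\lambda}{2}\|w\|^2$
Input: Objective $P$, number of iterations $T$, step size $\eta$ s.t. $\beta := \eta\lambda n < 1$, initial dual vectors $\alpha^{(0)} = (\alpha_1^{(0)}, \ldots, \alpha_n^{(0)})$
Initialize: $w^{(0)} = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i^{(0)}$
For $t = 1, \ldots, T$
    Pick $i$ uniformly at random from $[n]$
    Update: $\alpha_i^{(t)} = \alpha_i^{(t-1)} - \eta\lambda n\,\big(\nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}\big)$
    Update: $w^{(t)} = w^{(t-1)} - \eta\,\big(\nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}\big)$

Observe that SDCA keeps the primal-dual relation

$$w^{(t-1)} = \frac{1}{\lambda n}\sum_{i=1}^{n} \alpha_i^{(t-1)} .$$

Observe also that the update of $\alpha$ can be rewritten as

$$\alpha_i^{(t)} = (1-\beta)\,\alpha_i^{(t-1)} + \beta\,\big(-\nabla\phi_i(w^{(t-1)})\big),$$

namely, the new value of $\alpha_i$ is a convex combination of its old value and the negation of the gradient. Finally, observe that, conditioned on the value of $w^{(t-1)}$ and $\alpha^{(t-1)}$, we have that

$$\mathbb{E}[w^{(t)}] = w^{(t-1)} - \eta\,\big(\mathbb{E}\,\nabla\phi_i(w^{(t-1)}) + \mathbb{E}\,\alpha_i^{(t-1)}\big) = w^{(t-1)} - \eta\,\Big(\nabla\frac{1}{n}\sum_{i=1}^{n}\phi_i(w^{(t-1)}) + \lambda w^{(t-1)}\Big) = w^{(t-1)} - \eta\,\nabla P(w^{(t-1)}) .$$

That is, SDCA is in fact an instance of Stochastic Gradient Descent. As we will see in the analysis section below, the advantage of SDCA over a vanilla SGD algorithm is that the variance of the update goes to zero as we converge to an optimum.

3 Analysis

The theorem below provides a linear convergence rate for smooth and convex functions. The rate matches the analysis given in [8], but the analysis is simpler and does not rely on duality.

Theorem 1. Assume that each $\phi_i$ is $L$-smooth and convex, and the algorithm is run with $\eta \le \frac{1}{L + \lambda n}$. Let $w^*$ be the minimizer of $P(w)$ and let $\alpha_i^* = -\nabla\phi_i(w^*)$. Then, for every $t \ge 1$,

$$\mathbb{E}\Big[\frac{\lambda}{2}\|w^{(t)} - w^*\|^2 + \frac{1}{2Ln}\sum_{i=1}^{n}\|\alpha_i^{(t)} - \alpha_i^*\|^2\Big] \le e^{-\eta\lambda t}\Big[\frac{\lambda}{2}\|w^{(0)} - w^*\|^2 + \frac{1}{2Ln}\sum_{i=1}^{n}\|\alpha_i^{(0)} - \alpha_i^*\|^2\Big] .$$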

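In particular, setting $\eta = \frac{1}{L + \lambda n}$, then after $T \ge \tilde{\Omega}\big(\frac{L}{\lambda} + n\big)$ iterations we will have $\mathbb{E}[P(w^{(T)}) - P(w^*)] \le \epsilon$.

The theorem below provides a linear convergence rate for smooth functions, without assuming that individual $\phi_i$ are convex. We only require that the average of the $\phi_i$ is convex. The dependence on $L/\lambda$ is worse in this case.

Theorem 2. Assume that each $\phi_i$ is $L$-smooth and that the average function, $\frac{1}{n}\sum_{i=1}^{n}\phi_i$, is convex. Let $w^*$ be the minimizer of $P(w)$ and let $\alpha_i^* = -\nabla\phi_i(w^*)$. Then, if we run SDCA with $\eta = \min\big\{\frac{\lambda}{2L^2}, \frac{1}{2\lambda n}\big\}$, we have that

$$\mathbb{E}\Big[\frac{\lambda}{2}\|w^{(t)} - w^*\|^2 + \frac{\lambda}{2L^2 n}\sum_{i=1}^{n}\|\alpha_i^{(t)} - \alpha_i^*\|^2\Big] \le e^{-\eta\lambda t}\Big[\frac{\lambda}{2}\|w^{(0)} - w^*\|^2 + \frac{\lambda}{2L^2 n}\sum_{i=1}^{n}\|\alpha_i^{(0)} - \alpha_i^*\|^2\Big] .$$

It follows that whenever $T \ge \tilde{\Omega}\big(\frac{L^2}{\lambda^2} + n\big)$ we have that $\mathbb{E}[P(w^{(T)}) - P(w^*)] \le \epsilon$.

3.1 SDCA as variance-reduced SGD

As we have shown before, SDCA is an instance of SGD, in the sense that the update can be written as $w^{(t)} = w^{(t-1)} - \eta v_t$, with $v_t = \nabla\phi_i(w^{(t-1)}) + \alpha_i^{(t-1)}$ satisfying $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$.

The advantage of SDCA over a generic SGD is that the variance of the update goes to zero as we converge to the optimum. To see this, observe that

$$\mathbb{E}\|v_t\|^2 = \mathbb{E}\|\alpha_i^{(t-1)} + \nabla\phi_i(w^{(t-1)})\|^2 = \mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^* + \alpha_i^* + \nabla\phi_i(w^{(t-1)})\|^2 \le 2\,\mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + 2\,\mathbb{E}\|-\nabla\phi_i(w^{(t-1)}) - \alpha_i^*\|^2 .$$

Theorem 1 or Theorem 2 tells us that the term $\mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2$ goes to zero as $e^{-\eta\lambda t}$. For the second term, by smoothness of $\phi_i$ we have

$$\|-\nabla\phi_i(w^{(t-1)}) - \alpha_i^*\|^2 = \|\nabla\phi_i(w^{(t-1)}) - \nabla\phi_i(w^*)\|^2 \le L^2\|w^{(t-1)} - w^*\|^2 ,$$

and therefore, using Theorem 1 or Theorem 2 again, the second term also goes to zero as $e^{-\eta\lambda t}$. All in all, when $t \ge \tilde{\Omega}\big(\frac{1}{\eta\lambda}\log(1/\epsilon)\big)$ we will have that $\mathbb{E}\|v_t\|^2 \le \epsilon$.

4 Proofs

Observe that $0 = \nabla P(w^*) = \frac{1}{n}\sum_{i=1}^{n}\nabla\phi_i(w^*) + \lambda w^*$, which implies that $w^* = \frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i^*$.

Define $u_i = -\nabla\phi_i(w^{(t-1)})$ and $v_t = -u_i + \alpha_i^{(t-1)}$. We also denote two potentials:

$$A_t = \frac{1}{n}\sum_{j=1}^{n}\|\alpha_j^{(t)} - \alpha_j^*\|^2 , \qquad B_t = \|w^{(t)} - w^*\|^2 .$$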
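The Dual-Free SDCA updates of Section 2 and the variance claim of Section 3.1 can be checked numerically on a small problem. The following is a minimal sketch, not the author's implementation: it assumes the squared loss $\phi_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$ (so each $\phi_i$ is convex and $L$-smooth with $L = \max_i \|x_i\|^2$), synthetic Gaussian data, zero initial pseudo-dual vectors, and the step size $\eta = 1/(L + \lambda n)$ from Theorem 1. The function name `dual_free_sdca` and all variable names are illustrative choices.

```python
import numpy as np

def dual_free_sdca(X, y, lam, T, seed=0):
    """Dual-Free SDCA for P(w) = (1/n) sum_i 0.5*(x_i.w - y_i)^2 + (lam/2)||w||^2.

    Assumption for this sketch: squared loss, alpha^(0) = 0.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = np.max(np.sum(X ** 2, axis=1))   # smoothness constant of each phi_i
    eta = 1.0 / (L + lam * n)            # step size of Theorem 1; beta = eta*lam*n < 1
    alpha = np.zeros((n, d))             # pseudo-dual vectors alpha_i
    w = alpha.sum(axis=0) / (lam * n)    # w^(0) = (1/(lam*n)) * sum_i alpha_i^(0)
    v_sq = np.empty(T)                   # records ||v_t||^2 at every step
    for t in range(T):
        i = rng.integers(n)                      # pick i uniformly at random
        grad_i = (X[i] @ w - y[i]) * X[i]        # gradient of phi_i at w^(t-1)
        v = grad_i + alpha[i]                    # v_t, an unbiased estimate of grad P
        alpha[i] -= eta * lam * n * v            # pseudo-dual update
        w -= eta * v                             # primal update
        v_sq[t] = v @ v
    return w, alpha, v_sq

# Synthetic problem (hypothetical data, chosen only for illustration).
data_rng = np.random.default_rng(1)
n, d, lam = 50, 5, 0.1
X = data_rng.normal(size=(n, d))
y = X @ data_rng.normal(size=d) + 0.1 * data_rng.normal(size=n)

w, alpha, v_sq = dual_free_sdca(X, y, lam, T=20000)

# Closed-form minimizer of the ridge objective, used as the reference w*.
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

print("distance to w*:", np.linalg.norm(w - w_star))
print("mean ||v_t||^2, first 100 steps:", v_sq[:100].mean())
print("mean ||v_t||^2, last 100 steps:", v_sq[-100:].mean())
```

In this sketch the iterate should approach the closed-form ridge solution, the primal-dual relation $w = \frac{1}{\lambda n}\sum_i \alpha_i$ should hold up to rounding at every step, and the recorded $\|v_t\|^2$ should shrink toward zero, unlike the gradient noise of plain SGD with a fixed step size.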

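We will first analyze the evolution of $A_t$ and $B_t$. If on round $t$ we update using element $i$ then $\alpha_i^{(t)} = (1-\beta)\,\alpha_i^{(t-1)} + \beta u_i$, where $\beta = \eta\lambda n$. It follows that

$$
\begin{aligned}
A_{t-1} - A_t &= \frac{1}{n}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{1}{n}\|\alpha_i^{(t)} - \alpha_i^*\|^2 \\
&= \frac{1}{n}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{1}{n}\|(1-\beta)(\alpha_i^{(t-1)} - \alpha_i^*) + \beta(u_i - \alpha_i^*)\|^2 \\
&= \frac{1}{n}\Big[\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - (1-\beta)\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \beta\|u_i - \alpha_i^*\|^2 + \beta(1-\beta)\|\alpha_i^{(t-1)} - u_i\|^2\Big] \\
&= \frac{\beta}{n}\Big[\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1-\beta)\|v_t\|^2\Big] \\
&= \eta\lambda\Big[\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1-\beta)\|v_t\|^2\Big] . \qquad (1)
\end{aligned}
$$

In addition,

$$B_{t-1} - B_t = \|w^{(t-1)} - w^*\|^2 - \|w^{(t)} - w^*\|^2 = 2\eta\,(w^{(t-1)} - w^*)^\top v_t - \eta^2\|v_t\|^2 . \qquad (2)$$

The proofs of Theorem 1 and Theorem 2 will follow by studying different combinations of $A_t$ and $B_t$.

4.1 Proof of Theorem 2

Define

$$C_t = \frac{\lambda}{2}\Big[\frac{1}{L^2}A_t + B_t\Big] .$$

Combining (1) and (2) we obtain

$$
\begin{aligned}
C_{t-1} - C_t &= \frac{\eta\lambda^2}{2L^2}\Big[\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1-\beta)\|v_t\|^2\Big] + \frac{\lambda}{2}\Big[2\eta\,(w^{(t-1)} - w^*)^\top v_t - \eta^2\|v_t\|^2\Big] \\
&= \eta\lambda\Big[\frac{\lambda}{2L^2}\Big(\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2\Big) + \Big(\frac{\lambda(1-\beta)}{2L^2} - \frac{\eta}{2}\Big)\|v_t\|^2 + (w^{(t-1)} - w^*)^\top v_t\Big] .
\end{aligned}
$$

The definition of $\eta$ implies that $\eta \le \lambda(1-\beta)/L^2$, so the coefficient of $\|v_t\|^2$ is non-negative. By smoothness of each $\phi_i$ we have $\|u_i - \alpha_i^*\|^2 = \|\nabla\phi_i(w^{(t-1)}) - \nabla\phi_i(w^*)\|^2 \le L^2\|w^{(t-1)} - w^*\|^2$. Therefore,

$$C_{t-1} - C_t \ge \eta\lambda\Big[\frac{\lambda}{2L^2}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{\lambda}{2}\|w^{(t-1)} - w^*\|^2 + (w^{(t-1)} - w^*)^\top v_t\Big] .$$

Taking expectation of both sides (w.r.t. the choice of $i$ and conditioned on $w^{(t-1)}$ and $\alpha^{(t-1)}$) and noting that $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$, we obtain that

$$\mathbb{E}[C_{t-1} - C_t] \ge \eta\lambda\Big[\frac{\lambda}{2L^2}\,\mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{\lambda}{2}\|w^{(t-1)} - w^*\|^2 + (w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)})\Big] .$$

Using the strong convexity of $P$ we have $(w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \ge P(w^{(t-1)}) - P(w^*) + \frac{\lambda}{2}\|w^{(t-1)} - w^*\|^2$ and $P(w^{(t-1)}) - P(w^*) \ge \frac{\lambda}{2}\|w^{(t-1)} - w^*\|^2$, which together yields $(w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)}) \ge \lambda\|w^{(t-1)} - w^*\|^2$. Therefore,

$$\mathbb{E}[C_{t-1} - C_t] \ge \eta\lambda\Big[\frac{\lambda}{2L^2}\,\mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \frac{\lambda}{2}\|w^{(t-1)} - w^*\|^2\Big] = \eta\lambda\,C_{t-1} .$$

It follows that

$$\mathbb{E}[C_t] \le (1 - \eta\lambda)\,C_{t-1} ,$$

and repeating this recursively we end up with

$$\mathbb{E}[C_t] \le (1 - \eta\lambda)^t\,C_0 \le e^{-\eta\lambda t}\,C_0 ,$$

which concludes the proof of the first part of Theorem 2. The second part follows by observing that $P$ is $(L+\lambda)$-smooth, which gives $P(w) - P(w^*) \le \frac{L+\lambda}{2}\|w - w^*\|^2$.

4.2 Proof of Theorem 1

In the proof of Theorem 2 we bounded the term $\|u_i - \alpha_i^*\|^2$ by $L^2\|w^{(t-1)} - w^*\|^2$ based on the smoothness of $\phi_i$. We now assume that $\phi_i$ is also convex, which enables us to bound $\|u_i - \alpha_i^*\|^2$ based on the current sub-optimality.

Lemma 1. Assume that each $\phi_i$ is $L$-smooth and convex. Then, for every $w$,

$$\frac{1}{n}\sum_{i=1}^{n}\|\nabla\phi_i(w) - \nabla\phi_i(w^*)\|^2 \le 2L\Big(P(w) - P(w^*) - \frac{\lambda}{2}\|w - w^*\|^2\Big) .$$

Proof. For every $i$, define

$$g_i(w) = \phi_i(w) - \phi_i(w^*) - \nabla\phi_i(w^*)^\top (w - w^*) .$$

Clearly, since $\phi_i$ is $L$-smooth so is $g_i$. In addition, by convexity of $\phi_i$ we have $g_i(w) \ge 0$ for all $w$. It follows that $g_i$ is non-negative and smooth, and therefore, it is self-bounded (see Section 12.1.3 in [7]):

$$\|\nabla g_i(w)\|^2 \le 2L\,g_i(w) .$$

Using the definition of $g_i$, we obtain

$$\|\nabla\phi_i(w) - \nabla\phi_i(w^*)\|^2 = \|\nabla g_i(w)\|^2 \le 2L\,g_i(w) = 2L\Big[\phi_i(w) - \phi_i(w^*) - \nabla\phi_i(w^*)^\top (w - w^*)\Big] .$$

Taking expectation over $i$ and observing that $P(w) = \mathbb{E}\,\phi_i(w) + \frac{\lambda}{2}\|w\|^2$ and $0 = \nabla P(w^*) = \mathbb{E}\,\nabla\phi_i(w^*) + \lambda w^*$ we obtain

$$
\begin{aligned}
\mathbb{E}\|\nabla\phi_i(w) - \nabla\phi_i(w^*)\|^2 &\le 2L\Big[P(w) - \frac{\lambda}{2}\|w\|^2 - P(w^*) + \frac{\lambda}{2}\|w^*\|^2 + \lambda\,{w^*}^\top (w - w^*)\Big] \\
&= 2L\Big[P(w) - P(w^*) - \frac{\lambda}{2}\|w - w^*\|^2\Big] .
\end{aligned}
$$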
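As a sanity check on Lemma 1, the inequality can be evaluated numerically at random points. The sketch below is only an illustration under stated assumptions: it uses the squared loss $\phi_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$ with $L = \max_i \|x_i\|^2$, random Gaussian data, and the closed-form ridge solution as $w^*$. The helper names `P`, `grads` and `lemma_gap` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 40, 4, 0.5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def P(w):
    # P(w) = (1/n) sum_i 0.5*(x_i.w - y_i)^2 + (lam/2)||w||^2
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * (w @ w)

def grads(w):
    # Row i holds grad phi_i(w) = (x_i.w - y_i) * x_i
    return (X @ w - y)[:, None] * X

L = np.max(np.sum(X ** 2, axis=1))   # every phi_i is L-smooth for this L
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def lemma_gap(w):
    # Right-hand side minus left-hand side of Lemma 1; should be >= 0.
    lhs = np.mean(np.sum((grads(w) - grads(w_star)) ** 2, axis=1))
    rhs = 2 * L * (P(w) - P(w_star) - 0.5 * lam * np.sum((w - w_star) ** 2))
    return rhs - lhs

# Evaluate the gap at 100 random perturbations of w*.
gaps = [lemma_gap(w_star + rng.normal(size=d)) for _ in range(100)]
print("smallest gap over 100 random points:", min(gaps))
```

For the squared loss the right-hand side equals $L \cdot \frac{1}{n}\sum_i (x_i^\top (w - w^*))^2$ and the left-hand side equals $\frac{1}{n}\sum_i \|x_i\|^2 (x_i^\top (w - w^*))^2$, so the gap is non-negative exactly because $\|x_i\|^2 \le L$ for every $i$.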

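We now consider the potential

$$D_t = \frac{1}{2L}A_t + \frac{\lambda}{2}B_t .$$

Combining (1) and (2) we obtain

$$
\begin{aligned}
D_{t-1} - D_t &= \frac{\eta\lambda}{2L}\Big[\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2 + (1-\beta)\|v_t\|^2\Big] + \frac{\lambda}{2}\Big[2\eta\,(w^{(t-1)} - w^*)^\top v_t - \eta^2\|v_t\|^2\Big] \\
&= \eta\lambda\Big[\frac{1}{2L}\Big(\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2\Big) + \Big(\frac{1-\beta}{2L} - \frac{\eta}{2}\Big)\|v_t\|^2 + (w^{(t-1)} - w^*)^\top v_t\Big] \\
&\ge \eta\lambda\Big[\frac{1}{2L}\Big(\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \|u_i - \alpha_i^*\|^2\Big) + (w^{(t-1)} - w^*)^\top v_t\Big] ,
\end{aligned}
$$

where in the last inequality we used the assumption $\eta \le \frac{1}{L + \lambda n}$, which implies $\eta \le \frac{1-\beta}{L}$. Take expectation of the above w.r.t. the choice of $i$, using Lemma 1, using $\mathbb{E}[v_t] = \nabla P(w^{(t-1)})$, and using convexity of $P$, which yields $P(w^*) - P(w^{(t-1)}) \ge (w^* - w^{(t-1)})^\top \nabla P(w^{(t-1)})$, we obtain

$$
\begin{aligned}
\mathbb{E}[D_{t-1} - D_t] &\ge \eta\lambda\Big[\frac{1}{2L}\,\mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \frac{1}{2L}\,\mathbb{E}\|u_i - \alpha_i^*\|^2 + (w^{(t-1)} - w^*)^\top \mathbb{E}[v_t]\Big] \\
&\ge \eta\lambda\Big[\frac{1}{2L}\,\mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 - \Big(P(w^{(t-1)}) - P(w^*) - \frac{\lambda}{2}\|w^{(t-1)} - w^*\|^2\Big) + (w^{(t-1)} - w^*)^\top \nabla P(w^{(t-1)})\Big] \\
&\ge \eta\lambda\Big[\frac{1}{2L}\,\mathbb{E}\|\alpha_i^{(t-1)} - \alpha_i^*\|^2 + \frac{\lambda}{2}\|w^{(t-1)} - w^*\|^2\Big] = \eta\lambda\,D_{t-1} .
\end{aligned}
$$

This gives $\mathbb{E}[D_t] \le (1 - \eta\lambda)\,D_{t-1} \le e^{-\eta\lambda}\,D_{t-1}$, which concludes the proof of the first part of the theorem. The second part follows by observing that $P$ is $(L+\lambda)$-smooth, which gives $P(w) - P(w^*) \le \frac{L+\lambda}{2}\|w - w^*\|^2$.

References

[1] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646-1654, 2014.
[2] Aaron J. Defazio, Tibério S. Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. arXiv preprint arXiv:1407.2710, 2014.
[3] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315-323, 2013.
[4] Jakub Konečný and Peter Richtárik. Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666, 2013.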

[5] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663-2671, 2012.
[6] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming SERIES A and B (to appear), 2015.
[7] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[8] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567-599, Feb 2013.