Supplement for "SADAGRAD: Strongly Adaptive Stochastic Gradient Methods"

Zaiyi Chen*, Yi Xu*, Enhong Chen, Tianbao Yang

1. Proof of Proposition 1

Proposition 1. Let $\epsilon > 0$ be fixed, $H_0 = \gamma I$ with $\gamma \ge \max_t \|g_t\|_\infty$, $\mathrm{E}[F(w_1) - F(w_*)] \le \epsilon_0$, and let the iteration number $T$ satisfy
$$T \ge \max\left\{ \frac{4\epsilon_0(\gamma + \max_i \|g_{1:T,i}\|_2)}{\eta\epsilon},\; \frac{4\eta d \max_i \|g_{1:T,i}\|_2}{\epsilon} \right\}.$$
Then Algorithm 1 gives a solution $\hat{w}_T$ such that $\mathrm{E}[F(\hat{w}_T)] - F_* \le \epsilon$.

Proof. Let $\psi_0(w) = 0$ and $\|x\|_H^2 = x^\top H x$. First, we can see that $\psi_{t-1}(w) \le \psi_t(w)$ for any $t \ge 1$. Define $z_t = \sum_{\tau=1}^{t} g_\tau$ and $\zeta_t = \sum_{\tau=1}^{t} (\nabla F(w_\tau) - g_\tau)^\top (w_\tau - w)$, and let $\psi_t^*$ be defined by
$$\psi_t^*(g) = \sup_{x \in \Omega}\; g^\top x - \frac{1}{\eta}\psi_t(x).$$
Taking the summation of the objective gap over all iterations, we have
$$\sum_{t=1}^{T} (F(w_t) - F(w)) \le \sum_{t=1}^{T} \nabla F(w_t)^\top (w_t - w) = \zeta_T + \sum_{t=1}^{T} g_t^\top (w_t - w)$$
$$\le \zeta_T + \frac{1}{\eta}\psi_T(w) + \sum_{t=1}^{T} g_t^\top w_t + \sup_{x \in \Omega}\left\{ -\sum_{t=1}^{T} g_t^\top x - \frac{1}{\eta}\psi_T(x) \right\} = \zeta_T + \frac{1}{\eta}\psi_T(w) + \sum_{t=1}^{T} g_t^\top w_t + \psi_T^*(-z_T).$$
Note that
$$\psi_T^*(-z_T) = -z_T^\top w_{T+1} - \frac{1}{\eta}\psi_T(w_{T+1}) \le -z_T^\top w_{T+1} - \frac{1}{\eta}\psi_{T-1}(w_{T+1}) \le \sup_{x \in \Omega}\left\{ -z_T^\top x - \frac{1}{\eta}\psi_{T-1}(x) \right\} = \psi_{T-1}^*(-z_T) \le \psi_{T-1}^*(-z_{T-1}) - g_T^\top \nabla\psi_{T-1}^*(-z_{T-1}) + \frac{\eta}{2}\|g_T\|_{\psi_{T-1}^*}^2,$$
where the last inequality uses the fact that $\psi_t(w)$ is 1-strongly convex w.r.t. $\|\cdot\|_{H_t}$ and consequently $\psi_t^*$ is $\eta$-smooth w.r.t. $\|\cdot\|_{H_t^{-1}}$. Thus, since $\nabla\psi_{T-1}^*(-z_{T-1}) = w_T$, we have
$$\sum_{t=1}^{T} g_t^\top w_t + \psi_T^*(-z_T) \le \sum_{t=1}^{T} g_t^\top w_t + \psi_{T-1}^*(-z_{T-1}) - g_T^\top w_T + \frac{\eta}{2}\|g_T\|_{\psi_{T-1}^*}^2 = \sum_{t=1}^{T-1} g_t^\top w_t + \psi_{T-1}^*(-z_{T-1}) + \frac{\eta}{2}\|g_T\|_{\psi_{T-1}^*}^2.$$
By repeating this process, we have
$$\sum_{t=1}^{T} g_t^\top w_t + \psi_T^*(-z_T) \le \psi_0^*(-z_0) + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2 = \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2.$$
Then
$$\sum_{t=1}^{T} (F(w_t) - F(w)) \le \zeta_T + \frac{1}{\eta}\psi_T(w) + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2. \qquad (1)$$

Following the analysis in (Duchi et al., 2011), we have
$$\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2 \le 2\sum_{i=1}^{d}\|g_{1:T,i}\|_2.$$
Thus, using $\psi_T(w) = (w - w_1)^\top H_T (w - w_1)$ with $H_T = \gamma I + \mathrm{diag}(s_T)$ and $s_{T,i} = \|g_{1:T,i}\|_2$, we have
$$\sum_{t=1}^{T}(F(w_t) - F(w)) \le \zeta_T + \frac{\gamma\|w - w_1\|^2}{\eta} + \frac{(w - w_1)^\top\mathrm{diag}(s_T)(w - w_1)}{\eta} + \eta\sum_{i=1}^{d}\|g_{1:T,i}\|_2 \le \zeta_T + \frac{(\gamma + \max_i\|g_{1:T,i}\|_2)\|w - w_1\|^2}{\eta} + \eta d\max_i\|g_{1:T,i}\|_2.$$
Now by the value of $T \ge \max\{4\epsilon_0(\gamma + \max_i\|g_{1:T,i}\|_2)/(\eta\epsilon),\; 4\eta d\max_i\|g_{1:T,i}\|_2/\epsilon\}$, we have
$$\frac{\gamma + \max_i\|g_{1:T,i}\|_2}{\eta T} \le \frac{\epsilon}{4\epsilon_0}, \qquad \frac{\eta d\max_i\|g_{1:T,i}\|_2}{T} \le \frac{\epsilon}{4}.$$
Dividing by $T$ on both sides, setting $w = w_*$ and using the convexity of $F(w)$, we have
$$F(\hat{w}_T) - F_* \le \frac{\epsilon}{4\epsilon_0}\|w_* - w_1\|^2 + \frac{\epsilon}{4} + \frac{\zeta_T}{T}.$$
Let $\mathcal{F}_t$ be the filtration associated with Algorithm 1 in the paper. Noticing that $T$ is a random variable with respect to $\mathcal{F}_t$, we cannot get rid of the last term directly. Define the sequence $X_N$ as
$$X_N = \sum_{i=1}^{N}\langle g_i - \mathrm{E}[g_i],\, w_i - w_*\rangle, \qquad (2)$$
where $\mathrm{E}[g_i] = \nabla F(w_i)$. Since $\mathrm{E}[g_{t+1} - \mathrm{E}[g_{t+1}]\mid \mathcal{F}_t] = 0$ and $w_{t+1} = \arg\min_{w\in\Omega}\, \eta\, w^\top\sum_{\tau=1}^{t} g_\tau + \psi_t(w)$ is measurable with respect to $g_1,\dots,g_t$ and $w_1,\dots,w_t$, it is easy to see that $\{X_N\}$ is a martingale with respect to $\mathcal{F}_N$, i.e., $\mathrm{E}[X_N - X_{N-1}\mid\mathcal{F}_{N-1}] = 0$. On the other hand, since $\|g_t\|$ is upper bounded (e.g., by $G$), following the statement of $T$ in the proposition, $T$ is bounded by a finite deterministic number $N$ (depending only on $G$, $d$, $\eta$, $\epsilon$ and $\epsilon_0$), so $T \le N < \infty$ always holds. Then, following Lemma 1 below, we have $\mathrm{E}[X_T] = 0$, i.e., $\mathrm{E}[\zeta_T] = 0$ for $w = w_*$. Now taking the expectation we have
$$\mathrm{E}[F(\hat{w}_T)] - F_* \le \frac{\epsilon}{4\epsilon_0}\mathrm{E}\|w_* - w_1\|^2 + \frac{\epsilon}{4} \le \frac{\epsilon}{2\epsilon_0}\mathrm{E}[F(w_1) - F(w_*)] + \frac{\epsilon}{4} \le \frac{\epsilon}{2} + \frac{\epsilon}{4} \le \epsilon,$$
where the second inequality uses the condition (2) in the paper and the last one uses $\mathrm{E}[F(w_1) - F(w_*)] \le \epsilon_0$. Then we finish the proof.

Lemma 1. Let $\{\xi_n\}$ be a martingale difference sequence w.r.t. the filtration $\mathcal{F}_n$, and let $T$ be a stopping time such that $\{T \ge n\} \in \mathcal{F}_{n-1}$ for all $n$. If $0 < T \le N < \infty$, then we have
$$\mathrm{E}\Big[\sum_{n=1}^{T}\xi_n\Big] = 0.$$

Proof.
$$\mathrm{E}\Big[\sum_{n=1}^{T}\xi_n\Big] = \mathrm{E}\Big[\sum_{n=1}^{N}\xi_n I(T \ge n)\Big] = \sum_{n=1}^{N}\mathrm{E}\big[\xi_n I(T \ge n)\big] = \sum_{n=1}^{N}\mathrm{E}\big[\mathrm{E}[\xi_n I(T \ge n)\mid\mathcal{F}_{n-1}]\big] = \sum_{n=1}^{N}\mathrm{E}\big[I(T \ge n)\,\mathrm{E}[\xi_n\mid\mathcal{F}_{n-1}]\big] = 0,$$
where $I(\cdot)$ is the indicator function. The first equation follows from $T \le N$; the second from exchanging the finite sum and the expectation; the third from the definition of conditional expectation; the fourth from the definition of a stopping time ($\{T \ge n\} = \{T \le n-1\}^c \in \mathcal{F}_{n-1}$), so that $I(T \ge n)$ can be pulled out of the conditional expectation (Theorem 5.1.6 in (Durrett, 2010)); and the last from the definition of a martingale difference sequence, $\mathrm{E}[\xi_n\mid\mathcal{F}_{n-1}] = 0$.
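Lemma 1 is the optional-stopping fact used to drop the noise term above. The following is a small numerical sanity check of that fact, written here only for illustration (it is not part of the original supplement): it simulates a fair ±1 martingale difference sequence with a stopping time bounded by N and confirms that the empirical mean of X_T is close to zero.

import numpy as np

# Illustrative check of Lemma 1: for a stopping time T bounded by N,
# E[X_T] = E[sum_{n<=T} xi_n] = 0.  All names below are ours, not the paper's.
rng = np.random.default_rng(0)
N = 50                                        # deterministic bound on the stopping time
num_runs = 100_000
totals = np.empty(num_runs)
for r in range(num_runs):
    xi = rng.choice([-1.0, 1.0], size=N)      # martingale differences, E[xi_n | past] = 0
    partial = np.cumsum(xi)
    hits = np.nonzero(partial >= 3)[0]        # stop the first time the running sum reaches 3
    T = hits[0] + 1 if hits.size > 0 else N   # capped by N, so 0 < T <= N
    totals[r] = partial[T - 1]                # X_T
print(totals.mean())                          # close to 0, consistent with E[X_T] = 0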

2. Proof of Theorem 1

Theorem 1. Consider SCO (1) with the property (2) and a given $\epsilon > 0$. Assume $H_0 = \gamma I$ in Algorithm 1, $\gamma \ge \max_{k,\tau}\|g_\tau^k\|_\infty$, $F(w_0) - F_* \le \epsilon_0$, and let $T_k$ be the minimum number such that
$$T_k \ge \max\left\{\frac{16(\gamma + \max_i\|g_{1:T_k,i}\|_2)}{\theta\epsilon_k},\; \theta d\max_i\|g_{1:T_k,i}\|_2\right\}.$$
With $K = \lceil\log_2(\epsilon_0/\epsilon)\rceil$, we have $\mathrm{E}[F(w_K)] - F_* \le \epsilon$.

Proof of Theorem 1. We will show by induction that $\mathrm{E}[F(w_k)] - F_* \le \epsilon_k := \epsilon_0/2^k$ for $k = 0, 1, \dots, K$, which leads to our conclusion when $K = \lceil\log_2(\epsilon_0/\epsilon)\rceil$. The inequality holds obviously for $k = 0$. Conditioned on $\mathrm{E}[F(w_{k-1})] - F_* \le \epsilon_{k-1}$, we will show that $\mathrm{E}[F(w_k)] - F_* \le \epsilon_k$. We will modify Proposition 1, then use it on the $k$-th epoch of the algorithm conditioned on the randomness in the previous epochs. Let $\mathrm{E}$ denote the expectation over all randomness, and let $\mathrm{E}_{1:k-1}$ denote the expectation over the randomness in the $k$-th epoch conditioned on the randomness before the $k$-th epoch. Given $w_{k-1}$, we let $w_{k-1}^*$ denote the optimal solution that is closest to $w_{k-1}$. According to the proof of Proposition 1, we have
$$\mathrm{E}_{1:k-1}[F(w_k) - F(w_{k-1}^*)] \le \frac{\gamma + \max_i\|g_{1:T_k,i}\|_2}{\eta_k T_k}\,\mathrm{E}_{1:k-1}\|w_{k-1} - w_{k-1}^*\|^2 + \frac{\eta_k d\max_i\|g_{1:T_k,i}\|_2}{T_k} + \frac{1}{T_k}\sum_{t=1}^{T_k}\mathrm{E}_{1:k-1}\langle\mathrm{E}[g_t] - g_t,\, w_t - w_{k-1}^*\rangle.$$
By the value of $\eta_k = \theta\epsilon_k/2$ and $T_k \ge \max\{16(\gamma + \max_i\|g_{1:T_k,i}\|_2)/(\theta\epsilon_k),\; \theta d\max_i\|g_{1:T_k,i}\|_2\}$, we thus have
$$\frac{\gamma + \max_i\|g_{1:T_k,i}\|_2}{\eta_k T_k} \le \frac{1}{8}, \qquad \frac{\eta_k d\max_i\|g_{1:T_k,i}\|_2}{T_k} \le \frac{\epsilon_k}{2},$$
so that
$$\mathrm{E}_{1:k-1}[F(w_k) - F(w_{k-1}^*)] \le \mathrm{E}_{1:k-1}\Big[\frac{1}{8}\|w_{k-1} - w_{k-1}^*\|^2\Big] + \frac{\epsilon_k}{2} + \frac{1}{T_k}\sum_{t=1}^{T_k}\mathrm{E}_{1:k-1}\langle\mathrm{E}[g_t] - g_t,\, w_t - w_{k-1}^*\rangle.$$
Note that we only assume the condition (2), which does not necessarily imply the uniqueness of the optimal solution. Then, following similar arguments to those in Proposition 1 (Lemma 1 applied within the $k$-th epoch), the last term vanishes and we have
$$\mathrm{E}_{1:k-1}[F(w_k) - F(w_{k-1}^*)] \le \mathrm{E}_{1:k-1}\Big[\frac{1}{8}\|w_{k-1} - w_{k-1}^*\|^2\Big] + \frac{\epsilon_k}{2}.$$
Taking the expectation over the randomness in stages $1, \dots, k-1$, we have
$$\mathrm{E}[F(w_k)] - F_* \le \mathrm{E}\Big[\frac{1}{8}\|w_{k-1} - w_{k-1}^*\|^2\Big] + \frac{\epsilon_k}{2} \le \frac{1}{4}\mathrm{E}[F(w_{k-1}) - F_*] + \frac{\epsilon_k}{2} \le \frac{\epsilon_{k-1}}{4} + \frac{\epsilon_k}{2} = \epsilon_k,$$
where the second inequality uses the condition (2). Therefore by induction, we have $\mathrm{E}[F(w_K)] - F_* \le \epsilon_K \le \epsilon$.
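Theorem 1 analyzes the epoch-restart scheme: each epoch runs the adaptive inner solver for $T_k$ iterations starting from the previous epoch's averaged iterate, with the target gap $\epsilon_k$ halved from epoch to epoch and the step parameter $\eta_k$ shrunk proportionally. The sketch below is our own minimal illustration of that scheme, not the paper's exact Algorithm 1; the inner update is plain diagonal AdaGrad, and stochastic_grad, project and T_of_epoch are hypothetical user-supplied pieces.

import numpy as np

def adagrad_epoch(w1, T, eta, gamma, stochastic_grad, project):
    """One epoch of a diagonal-AdaGrad-style solver; returns the averaged iterate."""
    w, s_sq, avg = w1.copy(), np.zeros_like(w1), np.zeros_like(w1)
    for _ in range(T):
        g = stochastic_grad(w)
        s_sq += g * g                      # running sum of squared gradient coordinates
        H = gamma + np.sqrt(s_sq)          # diagonal of H_t = gamma*I + diag(s_t)
        w = project(w - eta * g / H)       # adaptive step followed by projection onto Omega
        avg += w / T
    return avg

def restarted_adagrad(w0, eps0, eps, eta0, gamma, stochastic_grad, project, T_of_epoch):
    """K = ceil(log2(eps0/eps)) epochs; eps_k and eta_k are halved every epoch."""
    K = int(np.ceil(np.log2(eps0 / eps)))
    w, eta_k = w0, eta0
    for k in range(1, K + 1):
        eta_k = eta_k / 2                  # eta_k proportional to eps_k = eps0 / 2^k
        w = adagrad_epoch(w, T_of_epoch(k), eta_k, gamma, stochastic_grad, project)
    return w

The per-epoch iteration number $T_k$ in the theorem grows roughly like $1/\epsilon_k$, so the final epoch dominates the total cost, which is what yields the overall $O(1/\epsilon)$ worst-case complexity stated in Theorem 3.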

3. Proof of Theorem 2

Lemma 2. Consider SCO (4) with the property (2). Let $H_0 = \gamma I$ in Algorithm 3 and $\gamma \ge \max_t\|g_t\|_\infty$. For any $w \in \Omega$ and its closest optimal solution $w_*$, we have
$$F(\hat{w}_T) - F(w) \le \frac{G\|w_{T+1} - w_1\|}{T} + \frac{1}{T}\sum_{t=1}^{T}(\mathrm{E}[g_t] - g_t)^\top(w_t - w) + \frac{\eta d\max_i\|g_{1:T,i}\|_2}{T} + \frac{(\gamma + \max_i\|g_{1:T,i}\|_2)\|w - w_1\|^2}{\eta T},$$
where $\hat{w}_T = \sum_{t=2}^{T+1} w_t/T$.

Proof. This proof is similar to the proof of Proposition 1, but we do not take the expectation here. For completeness, we give the proof here. Throughout the whole proof, we use the notation $g_t$ for the stochastic gradient of $f$ at $w_t$, so that $\mathrm{E}[g_t] = \nabla f(w_t)$. Let $\psi_0(w) = 0$ and $\|x\|_H^2 = x^\top H x$. First, we can see that $\psi_{t-1}(w) \le \psi_t(w)$ for any $t \ge 1$. Define $z_t = \sum_{\tau=1}^{t} g_\tau$ and $\zeta_t = \sum_{\tau=1}^{t}(\nabla f(w_\tau) - g_\tau)^\top(w_\tau - w)$, and let $\psi_t^*$ be defined by
$$\psi_t^*(g) = \sup_{x\in\Omega}\; g^\top x - \frac{1}{\eta}\psi_t(x) - t\,\phi(x).$$
Taking the summation of the objective gap over all iterations, we have
$$\sum_{t=1}^{T}\big(f(w_t) - f(w) + \phi(w_t) - \phi(w)\big) \le \sum_{t=1}^{T}\big(\nabla f(w_t)^\top(w_t - w) + \phi(w_t) - \phi(w)\big) = \zeta_T + \sum_{t=1}^{T}\big(g_t^\top(w_t - w) + \phi(w_t) - \phi(w)\big)$$
$$\le \zeta_T + \frac{1}{\eta}\psi_T(w) + \sum_{t=1}^{T}\big(g_t^\top w_t + \phi(w_t)\big) + \sup_{x\in\Omega}\left\{-\sum_{t=1}^{T} g_t^\top x - \frac{1}{\eta}\psi_T(x) - T\phi(x)\right\} = \zeta_T + \frac{1}{\eta}\psi_T(w) + \sum_{t=1}^{T}\big(g_t^\top w_t + \phi(w_t)\big) + \psi_T^*(-z_T). \qquad (3)$$
Note that
$$\psi_T^*(-z_T) = -z_T^\top w_{T+1} - \frac{1}{\eta}\psi_T(w_{T+1}) - T\phi(w_{T+1}) \le -z_T^\top w_{T+1} - \frac{1}{\eta}\psi_{T-1}(w_{T+1}) - (T-1)\phi(w_{T+1}) - \phi(w_{T+1})$$
$$\le \sup_{x\in\Omega}\left\{-z_T^\top x - \frac{1}{\eta}\psi_{T-1}(x) - (T-1)\phi(x)\right\} - \phi(w_{T+1}) = \psi_{T-1}^*(-z_T) - \phi(w_{T+1}) \le \psi_{T-1}^*(-z_{T-1}) - g_T^\top\nabla\psi_{T-1}^*(-z_{T-1}) + \frac{\eta}{2}\|g_T\|_{\psi_{T-1}^*}^2 - \phi(w_{T+1}),$$
where the last inequality uses the fact that $\psi_t(w)$ is 1-strongly convex w.r.t. $\|\cdot\|_{H_t}$ and consequently $\psi_t^*$ is $\eta$-smooth w.r.t. $\|\cdot\|_{H_t^{-1}}$. Thus, since $\nabla\psi_{T-1}^*(-z_{T-1}) = w_T$, we have
$$\sum_{t=1}^{T}\big(g_t^\top w_t + \phi(w_t)\big) + \psi_T^*(-z_T) \le \sum_{t=1}^{T-1}\big(g_t^\top w_t + \phi(w_t)\big) + \psi_{T-1}^*(-z_{T-1}) + \frac{\eta}{2}\|g_T\|_{\psi_{T-1}^*}^2 + \phi(w_T) - \phi(w_{T+1}).$$
By repeating this process, we have
$$\sum_{t=1}^{T}\big(g_t^\top w_t + \phi(w_t)\big) + \psi_T^*(-z_T) \le \psi_0^*(-z_0) + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2 + \phi(w_1) - \phi(w_{T+1}) = \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2 + \phi(w_1) - \phi(w_{T+1}). \qquad (4)$$
Plugging inequality (4) into inequality (3), we then have
$$\sum_{t=1}^{T}(F(w_t) - F(w)) \le \zeta_T + \frac{1}{\eta}\psi_T(w) + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2 + \phi(w_1) - \phi(w_{T+1}).$$
By adding $F(w_{T+1}) - F(w_1)$ to both sides of the above inequality and using the fact that $F(w) = f(w) + \phi(w)$, we get
$$\sum_{t=2}^{T+1}(F(w_t) - F(w)) \le \zeta_T + \frac{1}{\eta}\psi_T(w) + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2 + f(w_{T+1}) - f(w_1).$$
Following the analysis in (Duchi et al., 2011), we have
$$\sum_{t=1}^{T}\|g_t\|_{\psi_{t-1}^*}^2 \le 2\sum_{i=1}^{d}\|g_{1:T,i}\|_2.$$
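The display above is the standard adaptive bound from Duchi et al. (2011); per coordinate it reads $\sum_{t=1}^{T} g_{t,i}^2/\|g_{1:t,i}\|_2 \le 2\|g_{1:T,i}\|_2$. The following short script is our own purely illustrative numerical check of that per-coordinate inequality on random gradients.

import numpy as np

# Numerical check (ours, illustrative) of the per-coordinate AdaGrad inequality:
# sum_t g_{t,i}^2 / ||g_{1:t,i}||_2 <= 2 * ||g_{1:T,i}||_2.
rng = np.random.default_rng(0)
g = rng.normal(size=(1000, 3))                  # T = 1000 random gradients in R^3
cum_norms = np.sqrt(np.cumsum(g * g, axis=0))   # ||g_{1:t,i}||_2 for every t and i
lhs = np.sum(g * g / cum_norms, axis=0)         # left-hand side, coordinate-wise
rhs = 2.0 * cum_norms[-1]                       # right-hand side, coordinate-wise
print(np.all(lhs <= rhs + 1e-9))                # True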

Thus
$$\sum_{t=2}^{T+1}(F(w_t) - F(w)) \le \zeta_T + \frac{\gamma\|w - w_1\|^2}{\eta} + \frac{(w - w_1)^\top\mathrm{diag}(s_T)(w - w_1)}{\eta} + \eta\sum_{i=1}^{d}\|g_{1:T,i}\|_2 + f(w_{T+1}) - f(w_1)$$
$$\le \zeta_T + \frac{(\gamma + \max_i\|g_{1:T,i}\|_2)\|w - w_1\|^2}{\eta} + \eta d\max_i\|g_{1:T,i}\|_2 + \nabla f(w_{T+1})^\top(w_{T+1} - w_1) \le \zeta_T + \frac{(\gamma + \max_i\|g_{1:T,i}\|_2)\|w - w_1\|^2}{\eta} + \eta d\max_i\|g_{1:T,i}\|_2 + G\|w_{T+1} - w_1\|,$$
where the last inequality holds by the Cauchy-Schwarz inequality and the fact that $\|\nabla f(w_{T+1})\| \le G$. Dividing by $T$ on both sides, we finish the proof by using the convexity of $F(w)$.

Theorem 2. For a given $\epsilon > 0$, let $K = \lceil\log_2(\epsilon_0/\epsilon)\rceil$. Assume $H_0 = \gamma I$, $\gamma \ge \max_{k,\tau}\|g_\tau^k\|_\infty$, $F(w_0) - F_* \le \epsilon_0$, and let $T_k$ be the minimum number such that
$$\frac{G\|w_{T_k+1}^k - w_{k-1}\|}{T_k} \le \frac{\epsilon_k}{6} \quad\text{and}\quad T_k \ge A_k, \quad\text{where}\quad A_k = \max\left\{\frac{24(\gamma + \max_i\|g_{1:T_k,i}\|_2)}{\theta\epsilon_k},\; \theta d\max_i\|g_{1:T_k,i}\|_2\right\}.$$
Then Algorithm 4 guarantees that $\mathrm{E}[F(w_K)] - F_* \le \epsilon$.

Proof. This result is proved by revising Lemma 2 to hold for a bounded stopping time of the supermartingale sequence $X_t$ in (2). Taking the expectation of Lemma 2, we have
$$\mathrm{E}[F(\hat{w}_T)] - F(w) \le \mathrm{E}\Big[\frac{G\|w_{T+1} - w_1\|}{T}\Big] + \mathrm{E}\Big[\frac{1}{T}\sum_{t=1}^{T}(\mathrm{E}[g_t] - g_t)^\top(w_t - w)\Big] + \mathrm{E}\Big[\frac{\eta d\max_i\|g_{1:T,i}\|_2}{T}\Big] + \mathrm{E}\Big[\frac{(\gamma + \max_i\|g_{1:T,i}\|_2)\|w - w_1\|^2}{\eta T}\Big].$$
Then, following the same arguments as in Proposition 1, we have
$$\mathrm{E}\Big[\frac{1}{T}\sum_{t=1}^{T}(\mathrm{E}[g_t] - g_t)^\top(w_t - w)\Big] = 0.$$
Similar to the induction of Theorem 1, let $\eta_k = \theta\epsilon_k/2$ and let the iteration number $T_k$ in the $k$-th epoch be the smallest number satisfying the following inequalities:
$$\frac{\gamma + \max_i\|g_{1:T_k,i}\|_2}{\eta_k T_k} \le \frac{1}{12}, \qquad \frac{\eta_k d\max_i\|g_{1:T_k,i}\|_2}{T_k} \le \frac{\epsilon_k}{2}, \qquad \frac{G\|w_{T_k+1}^k - w_{k-1}\|}{T_k} \le \frac{\epsilon_k}{6}.$$
Thus, conditioned on the $1,\dots,(k-1)$-th epochs, we have
$$\mathrm{E}_{1:k-1}[F(w_k) - F(w_{k-1}^*)] \le \mathrm{E}_{1:k-1}\Big[\frac{1}{12}\|w_{k-1} - w_{k-1}^*\|^2\Big] + \frac{2\epsilon_k}{3}.$$
Taking the expectation over the randomness in stages $1,\dots,k-1$, we have
$$\mathrm{E}[F(w_k)] - F_* \le \mathrm{E}\Big[\frac{1}{12}\|w_{k-1} - w_{k-1}^*\|^2\Big] + \frac{2\epsilon_k}{3} \le \frac{1}{6}\mathrm{E}[F(w_{k-1}) - F_*] + \frac{2\epsilon_k}{3} \le \frac{\epsilon_{k-1}}{6} + \frac{2\epsilon_k}{3} = \epsilon_k,$$
where the second inequality uses the condition (2). Therefore by induction, we have $\mathrm{E}[F(w_K)] - F_* \le \epsilon_K \le \epsilon$.
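Lemma 2 and Theorem 2 cover the composite problem $F(w) = f(w) + \phi(w)$, where every adaptive step must also account for the regularizer $\phi$. As a concrete illustration (ours, not the paper's Algorithm 3 or 4), the sketch below instantiates one composite adaptive step for $\phi(w) = \lambda\|w\|_1$, whose per-coordinate proximal map is soft-thresholding scaled by the same diagonal matrix $H_t$.

import numpy as np

def composite_adagrad_step(w, g, s_sq, eta, gamma, lam):
    """Adaptive step on the smooth part f, then the prox of lam*||.||_1 (ours, illustrative)."""
    s_sq = s_sq + g * g                     # accumulate squared gradient coordinates
    H = gamma + np.sqrt(s_sq)               # diagonal of H_t = gamma*I + diag(s_t)
    u = w - eta * g / H                     # gradient step on f
    thresh = eta * lam / H                  # per-coordinate proximal threshold
    w_new = np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)   # soft-thresholding
    return w_new, s_sq

# Hypothetical usage on a toy problem f(w) = 0.5*||w||^2 with noisy gradients:
rng = np.random.default_rng(0)
w, s_sq = np.ones(5), np.zeros(5)
for _ in range(200):
    g = w + 0.1 * rng.normal(size=5)        # stochastic gradient of f
    w, s_sq = composite_adagrad_step(w, g, s_sq, eta=0.5, gamma=1.0, lam=0.1)
print(w)                                    # driven toward a sparse vector near zero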

4. Proof of Theorem 3

Theorem 3. Under the same assumptions as Theorem 1 and $F(w_0) - F_* \le \epsilon_0$, where $w_0$ is an initial solution, let $\epsilon^{(s)} = \epsilon_0/2^s$, $K = \lceil\log_2(\epsilon_0/\epsilon)\rceil$, and let $T^{(s)}$ be the minimum number such that
$$T^{(s)} \ge \max\left\{\frac{16(\gamma + \max_i\|g_{1:T^{(s)},i}\|_2)}{\theta\epsilon^{(s)}},\; \theta d\max_i\|g_{1:T^{(s)},i}\|_2\right\}.$$
Then with at most a total number of $S = K + 1$ calls of SADAGRAD and a worst-case iteration complexity of $O(1/\epsilon)$, Algorithm 5 finds a solution $w^{(S)}$ such that $\mathrm{E}[F(w^{(S)})] - F_* \le \epsilon$.

Proof. Since $F(w_0) - F_* \le \epsilon_0$, following the proof of Theorem 1, the first call of SADAGRAD returns $w^{(1)}$ with
$$\mathrm{E}[F(w^{(1)})] - F_* \le \frac{\epsilon_0}{2} = \epsilon^{(1)}.$$
By running SADAGRAD from $w^{(1)}$, Theorem 1 ensures that
$$\mathrm{E}[F(w^{(2)})] - F_* \le \frac{\mathrm{E}[F(w^{(1)})] - F_*}{2} \le \frac{\epsilon_0}{4} = \epsilon^{(2)}.$$
By continuing the process, with $S = K + 1$, we have
$$\mathrm{E}[F(w^{(S)})] - F_* \le \frac{\epsilon_0}{2^S} \le \epsilon. \qquad (5)$$
The total number of iterations for the $S$ calls of SADAGRAD is upper bounded by
$$T_{\mathrm{total}} = \sum_{s=1}^{S} T^{(s)} \le C\sum_{s=1}^{S}\frac{2^s}{\epsilon_0} \le \frac{C\,2^{S+1}}{\epsilon_0} = O\!\left(\frac{1}{\epsilon}\right)$$
for some $C > 0$, since the per-call iteration numbers grow geometrically and the sum is therefore dominated by the last call.

Acknowledgement

We thank Prof. Qihe Tang from the University of Iowa for his help on the proof of Lemma 1.

References

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

Durrett, Rick. Probability: Theory and Examples. Cambridge University Press, 2010.