Rates of Convergence by Moduli of Continuity

Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity of a stochastic process ad certai growth-like properties of the populatio quatity beig modeled or optimized Our treatmet roughly follows va der Vaart ad Weller 1, Chapter 3, though we make a few simplificatios i attempt to make the approach somewhat cleaer To set otatio, let Θ be some parameter space with distace d, ad let R : Θ R be a sequece of radom risk fuctioals with expectatio Rθ := ER θ A typical example of such a process is whe we have data X i X ad a loss fuctio l : Θ X R, for example, the iid loss may be the egative log likelihood log p θ x for some model p θ We the draw X i P, ad we defie R θ := 1 lθ, X i ad Rθ := Elθ, X i=1 We would like to uderstad the covergece rate properties of θ = argmi θ Θ R θ to θ 0 := argmi θ Θ Rθ, the populatio miimizer It is atural, based o a Taylor expasio, to assume that i a eighborhood of θ 0, the populatio risk grows at least quadratically because Rθ 0 = 0 Thus, throughout this ote, we assume that there is a costat η > 0 ad a growth costat > 0 such that Rθ Rθ 0 + dθ, θ 0 for θ Θ st dθ, θ 0 η 1 With such a coditio, it is possible to give rates of covergece of θ to θ 0, at least so log as the radom fuctios R do ot have so much variability i a eighborhood of θ 0 that they swamp the quadratic growth away from θ 0 Rates of covergece ad compariso of fuctios Because we would like to uderstad miimizig the populatio risk R ad fidig θ 0, we do ot particularly care if Rθ ad R θ are close While havig R θ Rθ uiformly is sufficiet to guaratee that θ θ 0, it is ot ecessary Ideed, all we really care about is that R θ > R θ 0 for θ sufficietly far from θ 0 That is, as we expect to have roughly R θ R θ 0 Rθ Rθ 0, where Rθ Rθ 0 + dθ, θ 0, so that we hope that R θ > R θ 0 wheever dθ, θ 0 is large eough that it swamps the stochastic error i R θ R θ 0 Moreover, as log as θ is close eough to θ 0, we ca give stroger covergece guaratees, because we expect VarR θ R θ 0 to be smaller tha VarR θ by itself A bit more precisely, we must have deviatios roughly 1

1 Varlθ, X i ay uiform estimate of Rθ, by the cetral limit theorem However, if θ is ear θ 0, the R θ R θ 0 = Rθ Rθ 0 + O P 1 Varlθ; X lθ0 ; X, ad the latter variace may be substatially smaller tha Varlθ, X whe dθ, θ 0 is small With the above motivatio i mid, as we wish to compare R θ R θ 0 to Rθ Rθ 0, our first step i providig rates of covergece is to uderstad the modulus of cotiuity of the process θ R θ i a eighborhood of θ 0 We make the followig defiitio Defiitio 1 Let Θ δ := {θ Θ : dθ, θ 0 δ} The expected modulus of cotiuity of the process R i a radius δ aroud θ 0 is E sup R θ Rθ R θ 0 Rθ 0 θ Θ δ For otatioal coveiece, we also defie the error processes θ, x := lθ, x Rθ lθ 0, x Rθ 0 ad θ := R θ Rθ R θ 0 Rθ 0 Both of these processes are evidetly mea zero We are most ofte cocered with upper bouds o the modulus of cotiuity relative to 1/ the typical cetral limit theorem rate That is, we cosider fuctios φ of the form that E sup R θ Rθ R θ 0 Rθ 0 θ Θ δ φδ Ofte, these fuctios satisfy φδ σδ, where σ is a type of stadard deviatio/variace measure though for fuller geerality, we will cosider fuctios φδ = σδ α for parameters α 0, A example makes this more apparet Example 1: Let l be L-Lipschitz i Θ R d ad take the orm as the distace fuctio, that is, lθ; x lθ ; x L θ θ Recallig the compariso process, we the have λ L θ θ E exp λ θ, X exp by the stadard sub-gaussia iequality for bouded radom variables Thus, lettig NΘ δ,, ɛ be the coverig umber of Θ δ for the orm, we have log NΘ δ,, ɛ d log 1 + δ ɛ ad log NΘ δ,, ɛ = 0 for ɛ δ Thus, a stadard etropy itegral calculatio, usig that L θ is a -sub-gaussia process, yields E sup θ c L δ log NΘδ,, ɛdɛ c L d δ log 1 + δ dɛ c L d δ, θ Θ δ ɛ 0 where c is a umerical costat That is, we have modulus of cotiuity boud with φδ = L dδ, or σ = L d 0

1 For ituitio: o-stochastic bouds o differeces i empirical risk Because we would like to uderstad the relative differeces betwee R ad R, we begi for ituitio by assumig that we have the ucoditioal boud that θ φδ wheever dθ, θ 0 δ The ituitively, we must have d θ, θ 0 small wheever the quadratic growth dθ, θ 0 i Rθ away from θ 0 domiates or overcomes the stochastic error φδ/ i our estimatio Let us make this rigorous, ad begi by assumig that dθ, θ 0 η, that is, θ is i the regio of quadratic growth 1 of R away from Rθ 0, ad let = 1 for simplicity ad wlog Now, let δ = dθ, θ 0, ad assume that R θ R θ 0, that is, θ has smaller empirical risk tha θ 0 The we have R θ R θ 0 = R θ 0 Rθ 0 + Rθ + Rθ 0 Rθ }{{} dθ,θ 0 R θ 0 Rθ 0 + Rθ dθ, θ 0, where we have used the coditio 1 Rearragig, we fid that dθ, θ 0 R θ 0 Rθ 0 + Rθ R θ θ φδ That is, we have the key iequality δ φδ 3 This iequality is the key isight to all of our cosideratios of moduli of cotiuity: if φδ does ot grow as fast as δ ad δ were large, this would cotradict iequality 3, so δ = dθ, θ 0 must be small Said differetly, for suitably large δ suitably large will decrease as grows, the quadratic growth δ will evetually swamp the stochastic error φδ/ based o iequality 3 More carefully, suppose that φδ σδ α for some α 0, The iequality 3 implies δ σδα, or δ σ 1 α Moduli of cotiuity ad covergece guaratees We ow show how to make the o-stochastic heuristic argumet of the precedig sectio rigorous Assume that we have the modulus of cotiuity boud E sup θ = E sup R θ Rθ R θ 0 Rθ 0 φδ 4 θ Θ δ θ Θ δ for all δ η, where η > 0 is the regio of strog covexity of R iequality 1 Assume additioally that φδ σδ α for some variace parameter σ ad a power α 0, The we choose the rate 3

δ to be the poit at which the quadratic growth domiates the stochastic error i the modulus of cotiuity 4, that is, { δ := if δ 0 : δ φδ } 5 Notig that φδ σδ α, the we certaily have that σ δ = 1 α is sufficiet to satisfy this domiatio coditio, that is, we have δ δ Moreover, we have φδ/δ 1, ad similarly for δ Thus, at least ituitively, we expect that the rate of covergece of θ to θ 0 should be roughly of order δ δ, because this is the poit at which the curvature of the risk domiates the stochastic error i its estimatio We may formalize this i the followig theorem Theorem 1 Rates of covergece Let δ be the smallest domiatig radius 5, where the empirical risk R satisfies the modulus coditio 4 ad φδ σδ α Assume also that θ = argmi θ R θ is cosistet, that is, θ p θ0 The d θ, θ 0 = O P δ = O P δ = O P σ 1 α Proof Our proof builds off of a so-called peelig argumet, where we argue that the behavior of the local relative errors θ is ice o shells aroud θ 0 Ideed, for each ad all j N, defie the shells S j, := { θ Θ : δ j 1 dθ, θ 0 δ j} Recall the defiitio η > 0 of the radius i the quadratic growth coditio 1 Now, fix ay t R +, ad cosider the evet that d θ, θ 0 t δ The either d θ, θ 0 η or we have θ S j, for some j with j t but j δ η I particular, P d θ, θ 0 t δ j:j t, j δ η P θ S j, + Pd θ, θ 0 η 6 The fial term is o1, so we may igore it i what follows Now, cosider the evet that θ S j, This implies that there exists some θ S j, such that R θ R θ 0, i tur implyig R θ R θ 0 Rθ 0 + Rθ + Rθ 0 Rθ R θ 0 Rθ 0 + Rθ dθ, θ 0, where we have used the growth coditio 1 that Rθ Rθ 0 + dθ, θ 0, which holds for θ S j, as dθ, θ 0 η Notig that dθ, θ 0 δ j, we rearrage the precedig iequality to obtai that θ S j, implies δ j R θ 0 Rθ 0 R θ Rθ sup θ S j, θ Returig to the probability sum 6, we thus have P θ S j, P sup θ δ j θ S j, 4 Esup θ S j, θ δ j

But of course, by assumptio 4, this i tur has the boud P θ S j, j φ j δ δ j jα by the defiitio 5 of the critical radius for δ Summig iequality 6, we thus obtai P d θ, θ 0 t δ 4 φδ δ j jα j α + o1 For ay ɛ > 0, we may take t sufficietly large that j t j α ɛ, which is the defiitio of d θ, θ 0 = O P δ j t Refereces 1 A W va der Vaart ad J A Weller Weak Covergece ad Empirical Processes: With Applicatios to Statistics Spriger, New York, 1996 5