Rates of Convergence by Moduli of Continuity

Similar documents
Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Glivenko-Cantelli Classes

Lecture 19: Convergence

ECE 901 Lecture 13: Maximum Likelihood Estimation

5.1 A mutual information bound based on metric entropy

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 2 9/9/2013. Large Deviations for i.i.d. Random Variables

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Empirical Process Theory and Oracle Inequalities

Lecture 3: August 31

Sequences and Series of Functions

7.1 Convergence of sequences of random variables

Empirical Processes: Glivenko Cantelli Theorems

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 19 11/17/2008 LAWS OF LARGE NUMBERS II THE STRONG LAW OF LARGE NUMBERS

(A sequence also can be thought of as the list of function values attained for a function f :ℵ X, where f (n) = x n for n 1.) x 1 x N +k x N +4 x 3

1 Convergence in Probability and the Weak Law of Large Numbers

The Growth of Functions. Theoretical Supplement

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ST5215: Advanced Statistical Theory

Fall 2013 MTH431/531 Real analysis Section Notes

Notes 5 : More on the a.s. convergence of sums

Lecture 6: Coupon Collector s problem

Infinite Sequences and Series

1 Review and Overview

A survey on penalized empirical risk minimization Sara A. van de Geer

Sieve Estimators: Consistency and Rates of Convergence

Lecture 2. The Lovász Local Lemma

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1.

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

Lecture 13: Maximum Likelihood Estimation

This section is optional.

6.3 Testing Series With Positive Terms

Lecture 11 October 27

Statistical Inference Based on Extremum Estimators

Point Estimation: properties of estimators 1 FINITE-SAMPLE PROPERTIES. finite-sample properties (CB 7.3) large-sample properties (CB 10.

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

7.1 Convergence of sequences of random variables

Output Analysis and Run-Length Control

LECTURE 14 NOTES. A sequence of α-level tests {ϕ n (x)} is consistent if

Metric Space Properties

Lecture 9: September 19

Lecture 12: September 27

Lecture 10 October Minimaxity and least favorable prior sequences

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors

Math 2784 (or 2794W) University of Connecticut

ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

STAT Homework 1 - Solutions

MATH 413 FINAL EXAM. f(x) f(y) M x y. x + 1 n

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Regression with quadratic loss

Random Variables, Sampling and Estimation

Sequences. Notation. Convergence of a Sequence

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara

Lecture Notes 15 Hypothesis Testing (Chapter 10)

Singular Continuous Measures by Michael Pejic 5/14/10

Lecture 2: Concentration Bounds

ECE 330:541, Stochastic Signals and Systems Lecture Notes on Limit Theorems from Probability Fall 2002

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

62. Power series Definition 16. (Power series) Given a sequence {c n }, the series. c n x n = c 0 + c 1 x + c 2 x 2 + c 3 x 3 +

An Introduction to Asymptotic Theory

Distribution of Random Samples & Limit theorems

INFINITE SEQUENCES AND SERIES

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 6 9/23/2013. Brownian motion. Introduction

Math 525: Lecture 5. January 18, 2018

sin(n) + 2 cos(2n) n 3/2 3 sin(n) 2cos(2n) n 3/2 a n =

Seunghee Ye Ma 8: Week 5 Oct 28

Math 113, Calculus II Winter 2007 Final Exam Solutions

Notes On Median and Quantile Regression. James L. Powell Department of Economics University of California, Berkeley

Advanced Stochastic Processes.

Exercise 4.3 Use the Continuity Theorem to prove the Cramér-Wold Theorem, Theorem. (1) φ a X(1).

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Advanced Analysis. Min Yan Department of Mathematics Hong Kong University of Science and Technology

TR/46 OCTOBER THE ZEROS OF PARTIAL SUMS OF A MACLAURIN EXPANSION A. TALBOT

Machine Learning Brett Bernstein

y X F n (y), To see this, let y Y and apply property (ii) to find a sequence {y n } X such that y n y and lim sup F n (y n ) F (y).

Technical Proofs for Homogeneity Pursuit

Let us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.

Self-normalized deviation inequalities with application to t-statistic

Differentiable Convex Functions

lim za n n = z lim a n n.

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Chapter 5. Inequalities. 5.1 The Markov and Chebyshev inequalities

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Basics of Probability Theory (for Theory of Computation courses)

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Maximum Likelihood Estimation and Complexity Regularization

Ada Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities

Solutions to HW Assignment 1

CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5

G. R. Pasha Department of Statistics Bahauddin Zakariya University Multan, Pakistan

Definition 4.2. (a) A sequence {x n } in a Banach space X is a basis for X if. unique scalars a n (x) such that x = n. a n (x) x n. (4.

1 Review and Overview

Monte Carlo Integration

Notes 27 : Brownian motion: path properties

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 21 11/27/2013

Lecture 15: Learning Theory: Concentration Inequalities

Optimally Sparse SVMs

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

Beurling Integers: Part 2

Notes 19 : Martingale CLT

Transcription:

Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity of a stochastic process ad certai growth-like properties of the populatio quatity beig modeled or optimized Our treatmet roughly follows va der Vaart ad Weller 1, Chapter 3, though we make a few simplificatios i attempt to make the approach somewhat cleaer To set otatio, let Θ be some parameter space with distace d, ad let R : Θ R be a sequece of radom risk fuctioals with expectatio Rθ := ER θ A typical example of such a process is whe we have data X i X ad a loss fuctio l : Θ X R, for example, the iid loss may be the egative log likelihood log p θ x for some model p θ We the draw X i P, ad we defie R θ := 1 lθ, X i ad Rθ := Elθ, X i=1 We would like to uderstad the covergece rate properties of θ = argmi θ Θ R θ to θ 0 := argmi θ Θ Rθ, the populatio miimizer It is atural, based o a Taylor expasio, to assume that i a eighborhood of θ 0, the populatio risk grows at least quadratically because Rθ 0 = 0 Thus, throughout this ote, we assume that there is a costat η > 0 ad a growth costat > 0 such that Rθ Rθ 0 + dθ, θ 0 for θ Θ st dθ, θ 0 η 1 With such a coditio, it is possible to give rates of covergece of θ to θ 0, at least so log as the radom fuctios R do ot have so much variability i a eighborhood of θ 0 that they swamp the quadratic growth away from θ 0 Rates of covergece ad compariso of fuctios Because we would like to uderstad miimizig the populatio risk R ad fidig θ 0, we do ot particularly care if Rθ ad R θ are close While havig R θ Rθ uiformly is sufficiet to guaratee that θ θ 0, it is ot ecessary Ideed, all we really care about is that R θ > R θ 0 for θ sufficietly far from θ 0 That is, as we expect to have roughly R θ R θ 0 Rθ Rθ 0, where Rθ Rθ 0 + dθ, θ 0, so that we hope that R θ > R θ 0 wheever dθ, θ 0 is large eough that it swamps the stochastic error i R θ R θ 0 Moreover, as log as θ is close eough to θ 0, we ca give stroger covergece guaratees, because we expect VarR θ R θ 0 to be smaller tha VarR θ by itself A bit more precisely, we must have deviatios roughly 1

1 Varlθ, X i ay uiform estimate of Rθ, by the cetral limit theorem However, if θ is ear θ 0, the R θ R θ 0 = Rθ Rθ 0 + O P 1 Varlθ; X lθ0 ; X, ad the latter variace may be substatially smaller tha Varlθ, X whe dθ, θ 0 is small With the above motivatio i mid, as we wish to compare R θ R θ 0 to Rθ Rθ 0, our first step i providig rates of covergece is to uderstad the modulus of cotiuity of the process θ R θ i a eighborhood of θ 0 We make the followig defiitio Defiitio 1 Let Θ δ := {θ Θ : dθ, θ 0 δ} The expected modulus of cotiuity of the process R i a radius δ aroud θ 0 is E sup R θ Rθ R θ 0 Rθ 0 θ Θ δ For otatioal coveiece, we also defie the error processes θ, x := lθ, x Rθ lθ 0, x Rθ 0 ad θ := R θ Rθ R θ 0 Rθ 0 Both of these processes are evidetly mea zero We are most ofte cocered with upper bouds o the modulus of cotiuity relative to 1/ the typical cetral limit theorem rate That is, we cosider fuctios φ of the form that E sup R θ Rθ R θ 0 Rθ 0 θ Θ δ φδ Ofte, these fuctios satisfy φδ σδ, where σ is a type of stadard deviatio/variace measure though for fuller geerality, we will cosider fuctios φδ = σδ α for parameters α 0, A example makes this more apparet Example 1: Let l be L-Lipschitz i Θ R d ad take the orm as the distace fuctio, that is, lθ; x lθ ; x L θ θ Recallig the compariso process, we the have λ L θ θ E exp λ θ, X exp by the stadard sub-gaussia iequality for bouded radom variables Thus, lettig NΘ δ,, ɛ be the coverig umber of Θ δ for the orm, we have log NΘ δ,, ɛ d log 1 + δ ɛ ad log NΘ δ,, ɛ = 0 for ɛ δ Thus, a stadard etropy itegral calculatio, usig that L θ is a -sub-gaussia process, yields E sup θ c L δ log NΘδ,, ɛdɛ c L d δ log 1 + δ dɛ c L d δ, θ Θ δ ɛ 0 where c is a umerical costat That is, we have modulus of cotiuity boud with φδ = L dδ, or σ = L d 0

1 For ituitio: o-stochastic bouds o differeces i empirical risk Because we would like to uderstad the relative differeces betwee R ad R, we begi for ituitio by assumig that we have the ucoditioal boud that θ φδ wheever dθ, θ 0 δ The ituitively, we must have d θ, θ 0 small wheever the quadratic growth dθ, θ 0 i Rθ away from θ 0 domiates or overcomes the stochastic error φδ/ i our estimatio Let us make this rigorous, ad begi by assumig that dθ, θ 0 η, that is, θ is i the regio of quadratic growth 1 of R away from Rθ 0, ad let = 1 for simplicity ad wlog Now, let δ = dθ, θ 0, ad assume that R θ R θ 0, that is, θ has smaller empirical risk tha θ 0 The we have R θ R θ 0 = R θ 0 Rθ 0 + Rθ + Rθ 0 Rθ }{{} dθ,θ 0 R θ 0 Rθ 0 + Rθ dθ, θ 0, where we have used the coditio 1 Rearragig, we fid that dθ, θ 0 R θ 0 Rθ 0 + Rθ R θ θ φδ That is, we have the key iequality δ φδ 3 This iequality is the key isight to all of our cosideratios of moduli of cotiuity: if φδ does ot grow as fast as δ ad δ were large, this would cotradict iequality 3, so δ = dθ, θ 0 must be small Said differetly, for suitably large δ suitably large will decrease as grows, the quadratic growth δ will evetually swamp the stochastic error φδ/ based o iequality 3 More carefully, suppose that φδ σδ α for some α 0, The iequality 3 implies δ σδα, or δ σ 1 α Moduli of cotiuity ad covergece guaratees We ow show how to make the o-stochastic heuristic argumet of the precedig sectio rigorous Assume that we have the modulus of cotiuity boud E sup θ = E sup R θ Rθ R θ 0 Rθ 0 φδ 4 θ Θ δ θ Θ δ for all δ η, where η > 0 is the regio of strog covexity of R iequality 1 Assume additioally that φδ σδ α for some variace parameter σ ad a power α 0, The we choose the rate 3

δ to be the poit at which the quadratic growth domiates the stochastic error i the modulus of cotiuity 4, that is, { δ := if δ 0 : δ φδ } 5 Notig that φδ σδ α, the we certaily have that σ δ = 1 α is sufficiet to satisfy this domiatio coditio, that is, we have δ δ Moreover, we have φδ/δ 1, ad similarly for δ Thus, at least ituitively, we expect that the rate of covergece of θ to θ 0 should be roughly of order δ δ, because this is the poit at which the curvature of the risk domiates the stochastic error i its estimatio We may formalize this i the followig theorem Theorem 1 Rates of covergece Let δ be the smallest domiatig radius 5, where the empirical risk R satisfies the modulus coditio 4 ad φδ σδ α Assume also that θ = argmi θ R θ is cosistet, that is, θ p θ0 The d θ, θ 0 = O P δ = O P δ = O P σ 1 α Proof Our proof builds off of a so-called peelig argumet, where we argue that the behavior of the local relative errors θ is ice o shells aroud θ 0 Ideed, for each ad all j N, defie the shells S j, := { θ Θ : δ j 1 dθ, θ 0 δ j} Recall the defiitio η > 0 of the radius i the quadratic growth coditio 1 Now, fix ay t R +, ad cosider the evet that d θ, θ 0 t δ The either d θ, θ 0 η or we have θ S j, for some j with j t but j δ η I particular, P d θ, θ 0 t δ j:j t, j δ η P θ S j, + Pd θ, θ 0 η 6 The fial term is o1, so we may igore it i what follows Now, cosider the evet that θ S j, This implies that there exists some θ S j, such that R θ R θ 0, i tur implyig R θ R θ 0 Rθ 0 + Rθ + Rθ 0 Rθ R θ 0 Rθ 0 + Rθ dθ, θ 0, where we have used the growth coditio 1 that Rθ Rθ 0 + dθ, θ 0, which holds for θ S j, as dθ, θ 0 η Notig that dθ, θ 0 δ j, we rearrage the precedig iequality to obtai that θ S j, implies δ j R θ 0 Rθ 0 R θ Rθ sup θ S j, θ Returig to the probability sum 6, we thus have P θ S j, P sup θ δ j θ S j, 4 Esup θ S j, θ δ j

But of course, by assumptio 4, this i tur has the boud P θ S j, j φ j δ δ j jα by the defiitio 5 of the critical radius for δ Summig iequality 6, we thus obtai P d θ, θ 0 t δ 4 φδ δ j jα j α + o1 For ay ɛ > 0, we may take t sufficietly large that j t j α ɛ, which is the defiitio of d θ, θ 0 = O P δ j t Refereces 1 A W va der Vaart ad J A Weller Weak Covergece ad Empirical Processes: With Applicatios to Statistics Spriger, New York, 1996 5