Sieve Estimators: Consistency and Rates of Convergence


EECS 598: Statistical Learning Theory, Winter 2014, Topic 6
Lecturer: Clayton Scott. Scribes: Julian Katz-Samuels, Brandon Oselio, Pin-Yu Chen

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

1 Introduction

We can decompose the excess risk of a discrimination rule as follows:

\[ R(\hat{h}_n) - R^* = \underbrace{R(\hat{h}_n) - R(\mathcal{H})}_{\text{estimation error}} + \underbrace{R(\mathcal{H}) - R^*}_{\text{approximation error}}. \]

The first term is the estimation error and measures the performance of the discrimination rule with respect to the best hypothesis in $\mathcal{H}$. In previous lectures, we studied performance guarantees for this quantity when $\hat{h}_n$ is ERM. Now, we also consider the approximation error, which captures how well a sequence of hypothesis classes $\{\mathcal{H}_k\}$ approximates the Bayes decision boundary. For example, consider the histogram classifiers in Figures 1 and 2. As the grid becomes more fine-grained, the class of hypotheses approximates the Bayes decision boundary with increasing accuracy. The examination of the approximation error will lead to the design of sieve estimators, which perform ERM over $\mathcal{H}_{k_n}$, where $k_n$ grows with $n$ at an appropriate rate such that both the approximation error and the estimation error converge to 0. Note that while the estimation error is random because it depends on the sample, the approximation error is not random.

2 Approximation Error

The following assumption will be adopted to establish universal consistency.

Definition 1. A sequence of sets of classifiers $\mathcal{H}_1, \mathcal{H}_2, \ldots$ is said to have the universal approximation property (UAP) if for all $P_{XY}$, $R(\mathcal{H}_k) \to R^*$ as $k \to \infty$.

Observe that $R(\mathcal{H}_k)$ is not random; it does not depend on the training data. It does, however, depend on $P_{XY}$ because it is an expectation.

The following theorem gives a sufficient condition for a sequence of classes to have the UAP.

Theorem 1. Suppose $\mathcal{H}_k$ consists of classifiers that are piecewise constant on a partition of $\mathcal{X}$ into cells $A_{k,1}, A_{k,2}, \ldots$. If $\sup_{j \ge 1} \mathrm{diam}(A_{k,j}) \to 0$ as $k \to \infty$, then $\{\mathcal{H}_k\}$ has the UAP.

Proof. See the proof of Theorem 6.1 in [1].

Example. Suppose $\mathcal{X} = [0,1]^d$. Consider $\mathcal{H}_k = \{$histogram classifiers based on cells of side length $1/k\}$. Then $\mathrm{diam}(A_{k,j}) = \sqrt{d}/k$ for all $j$. In particular,

\[ \sup_{j \ge 1} \mathrm{diam}(A_{k,j}) = \frac{\sqrt{d}}{k} \to 0 \quad \text{as } k \to \infty. \]

By the above theorem, $\{\mathcal{H}_k\}$ has the UAP.
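As a quick numerical companion to this example (added here for illustration; the distribution and all constants below are invented for the demo), one can estimate the approximation error $R(\mathcal{H}_k) - R^*$ of histogram classes on $[0,1]^2$ by Monte Carlo. For a distribution with deterministic labels ($R^* = 0$), the best classifier in $\mathcal{H}_k$ assigns each cell its majority label:

```python
import numpy as np

# Deterministic labels (R* = 0) with a smooth Bayes decision boundary:
# y = 1{ x2 > 0.25*sin(2*pi*x1) + 0.5 } on X = [0,1]^2.
rng = np.random.default_rng(0)
N = 1_000_000                       # large sample; k^2 cells << N, so the
X = rng.random((N, 2))              # in-sample estimate below is adequate
y = (X[:, 1] > 0.25 * np.sin(2 * np.pi * X[:, 0]) + 0.5).astype(int)

for k in [2, 4, 8, 16, 32]:
    # Flatten each point's (row, col) cell indices into a single cell id.
    cell = (np.minimum((X * k).astype(int), k - 1) * [1, k]).sum(axis=1)
    ones = np.bincount(cell, weights=y, minlength=k * k)
    total = np.bincount(cell, minlength=k * k)
    h_star = (ones > total / 2).astype(int)  # best-in-class: cellwise majority
    print(f"k = {k:2d}   estimated R(H_k) - R* = {np.mean(h_star[cell] != y):.4f}")
```

For a boundary like this one, the estimated error shrinks roughly like $1/k$, consistent with the box-counting analysis of Section 6 below.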

Figure 1: Histogram classifier based on a 6 x 6 grid.
Figure 2: Histogram classifier based on a 12 x 12 grid.

3 Convergence of Random Variables

This section offers a brief review of convergence in probability and almost sure convergence, and gives some useful results for proving both kinds of convergence.

Definition 2. Let $Z_1, Z_2, \ldots$ be a sequence of random variables in $\mathbb{R}$. We say that $Z_n$ converges to $Z$ in probability, written $Z_n \xrightarrow{i.p.} Z$, if

\[ \forall \epsilon > 0, \quad \lim_{n \to \infty} \Pr(|Z_n - Z| \ge \epsilon) = 0. \]

We say $Z_n$ converges to $Z$ almost surely (with probability one), written $Z_n \xrightarrow{a.s.} Z$, if

\[ \Pr\left(\{\omega : \lim_{n \to \infty} Z_n(\omega) = Z(\omega)\}\right) = 1. \]

Example. In the analysis of discrimination rules, we consider $\Omega = \{\omega = ((X_1, Y_1), (X_2, Y_2), \ldots)\} = (\mathcal{X} \times \mathcal{Y})^{\infty}$, where $(X_i, Y_i) \overset{i.i.d.}{\sim} P_{XY}$. To study convergence of the risk to the Bayes risk, we take $Z_n = R(\hat{h}_n)$, where $\hat{h}_n$ is a classifier based on the first $n$ entries of $\omega$, namely $(X_1, Y_1), \ldots, (X_n, Y_n)$, and $Z = R^*$.

Now, we consider some useful lemmas for proving convergence in probability and almost sure convergence. We begin with convergence in probability.

Lemma 1. If $Z_n \xrightarrow{a.s.} Z$, then $Z_n \xrightarrow{i.p.} Z$.

Proof. This was proved in EECS 501.

Lemma 2. If $\mathbb{E}\{|Z_n - Z|\} \to 0$ as $n \to \infty$, then $Z_n \xrightarrow{i.p.} Z$.

Proof. By Markov's inequality,

\[ \Pr(|Z_n - Z| \ge \epsilon) \le \frac{\mathbb{E}\{|Z_n - Z|\}}{\epsilon} \to 0. \]
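To see Definition 2 and Lemma 2 in action, here is a small simulation (an added illustration, not from the original notes). For $Z_n$ the sample mean of $n$ i.i.d. Bernoulli($p$) draws and $Z = p$, we have $\mathbb{E}|Z_n - Z| \le \sqrt{p(1-p)/n} \to 0$, so Lemma 2 gives $Z_n \xrightarrow{i.p.} p$, and the probability in the definition can be estimated empirically:

```python
import numpy as np

# Z_n = sample mean of n Bernoulli(p) draws; Z = p. We estimate
# Pr(|Z_n - Z| >= eps) over many independent replications and watch
# it decay to 0, as convergence in probability requires.
rng = np.random.default_rng(0)
p, eps, reps = 0.3, 0.05, 5000

for n in [10, 100, 1000, 10000]:
    Zn = rng.binomial(n, p, size=reps) / n   # reps independent copies of Z_n
    print(f"n = {n:5d}   Pr(|Z_n - p| >= {eps}) ~= {np.mean(np.abs(Zn - p) >= eps):.4f}")
```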

Now, we turn our attention to almost sure convergence. We begin with the Borel-Cantelli lemma, which yields a useful corollary.

Lemma 3 (Borel-Cantelli lemma). Consider a probability space $(\Omega, \mathcal{A}, P)$. Let $\{A_n\}_{n \ge 1}$ be a sequence of events. If $\sum_{n \ge 1} P(A_n) < \infty$, then

\[ P\left(\limsup_n A_n\right) = 0, \]

where

\[ \limsup_n A_n := \{\omega \in \Omega : \forall N\ \exists n \ge N \text{ such that } \omega \in A_n\} = \{\omega \text{ that occur infinitely often in the sequence of events}\} = \bigcap_{N \ge 1} \bigcup_{n \ge N} A_n. \]

Proof. Observe that $P(\limsup_n A_n) = P(\lim_{N \to \infty} \bigcup_{n \ge N} A_n)$. Let $B_N = \bigcup_{n \ge N} A_n$. Observe that $B_1 \supseteq B_2 \supseteq B_3 \supseteq \cdots$, i.e., $\{B_N\}$ is a decreasing sequence of events. Then, by continuity of $P$, we have

\[ P\left(\lim_{N \to \infty} B_N\right) = \lim_{N \to \infty} P(B_N) = \lim_{N \to \infty} P\left(\bigcup_{n \ge N} A_n\right) \le \lim_{N \to \infty} \sum_{n \ge N} P(A_n) = 0, \]

where the inequality follows from the union bound and the last equality follows from the hypothesis that $\sum_{n \ge 1} P(A_n) < \infty$.

The following corollary of the Borel-Cantelli lemma is very useful for showing almost sure convergence.

Corollary 1. If for all $\epsilon > 0$ we have

\[ \sum_{n \ge 1} \Pr(|Z_n - Z| \ge \epsilon) < \infty, \]

then $Z_n \xrightarrow{a.s.} Z$.

Proof. Define the event $A^{\epsilon} = \{\omega \in \Omega : \forall N\ \exists n \ge N \text{ such that } |Z_n(\omega) - Z(\omega)| \ge \epsilon\}$. Let $A_n^{\epsilon} = \{\omega \in \Omega : |Z_n(\omega) - Z(\omega)| \ge \epsilon\}$. Observe that $\limsup_n A_n^{\epsilon} = A^{\epsilon}$. By the hypothesis, we may apply the Borel-Cantelli lemma to obtain $\Pr(A^{\epsilon}) = 0$. Now define

\[ A = \{\omega \in \Omega : Z_n(\omega) \not\to Z(\omega)\} = \{\omega \in \Omega : \exists \epsilon > 0\ \forall N\ \exists n \ge N \text{ such that } |Z_n(\omega) - Z(\omega)| \ge \epsilon\}. \]

In words, this means: for some $\epsilon$, no matter how far along you go in the sequence, you can find some $n$ such that $|Z_n(\omega) - Z(\omega)| \ge \epsilon$. Now, let $\{\epsilon_j\}$ be a strictly decreasing sequence converging to 0. Observe that $A \subseteq \bigcup_{j \ge 1} A^{\epsilon_j}$. To see this, take some $\omega \in A$. Then, for some $\epsilon > 0$, $\forall N\ \exists n \ge N$ such that $|Z_n(\omega) - Z(\omega)| \ge \epsilon$. As $j \to \infty$, eventually there is some $j$ such that $\epsilon_j \le \epsilon$, so that $\omega \in A^{\epsilon_j}$. Therefore,

\[ \Pr(A) \le \Pr\left(\bigcup_{j \ge 1} A^{\epsilon_j}\right) \le \sum_{j \ge 1} \Pr(A^{\epsilon_j}) = 0. \]

This concludes the proof.
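As a sanity check on Corollary 1 (an added example, not in the original notes), combining it with Hoeffding's inequality yields the strong law of large numbers for bounded random variables; this is exactly the summability template reused in Section 4 below:

```latex
% Added example: Corollary 1 + Hoeffding => a.s. convergence of sample means.
% Let X_1, X_2, \dots be i.i.d. taking values in [0,1], and set
% Z_n = \frac{1}{n}\sum_{i=1}^n X_i and Z = \mathbb{E}[X_1].
% Hoeffding's inequality gives, for every \epsilon > 0,
\Pr\bigl(|Z_n - Z| \ge \epsilon\bigr) \le 2 e^{-2n\epsilon^2},
\qquad \text{so} \qquad
\sum_{n \ge 1} \Pr\bigl(|Z_n - Z| \ge \epsilon\bigr)
\le \sum_{n \ge 1} 2 e^{-2n\epsilon^2}
= \frac{2 e^{-2\epsilon^2}}{1 - e^{-2\epsilon^2}} < \infty.
% By Corollary 1, Z_n -> Z almost surely: the strong law of large numbers
% for bounded variables.
```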

4 Sieve Estimators

Let $\{\mathcal{H}_k\}$ be a sequence of sets of classifiers with the universal approximation property (UAP). Assume that we have a uniform deviation bound for each $\mathcal{H}_k$ (e.g., if $\mathcal{H}_k$ is finite or a VC class). We denote by $\hat{h}_{n,k}$ the classifier we obtain from empirical risk minimization (ERM) over $\mathcal{H}_k$:

\[ \hat{h}_{n,k} = \arg\min_{h \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{h(X_i) \ne Y_i\}}. \]

Let $k_n$ be an integer-valued sequence and define $\hat{h}_n = \hat{h}_{n,k_n}$. Since we have the UAP, it should be clear that the approximation error goes to 0 as long as $k_n \to \infty$ with $n$. We wish to restrict the rate of growth of $k_n$ appropriately so that the estimation error also goes to zero, in probability or almost surely. This is called a sieve estimator, whose name is generally attributed to Grenander [3]. Formally, we need

\[ R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \to 0, \quad \text{in probability or almost surely.} \]

Let's apply some previously developed theory to this problem. Consider the case where $|\mathcal{H}_k| < \infty$. We have seen that

\[ \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le 2 |\mathcal{H}_{k_n}| e^{-n\epsilon^2/2}. \tag{1} \]

We need to make sure that $|\mathcal{H}_{k_n}|$ does not dominate the exponential term, so that the sum in Corollary 1 converges. Rewrite the right-hand side of Eq. (1) as

\[ 2 |\mathcal{H}_{k_n}| e^{-n\epsilon^2/2} = 2 e^{\ln|\mathcal{H}_{k_n}| - n\epsilon^2/2}. \]

Thus it suffices to have $\frac{\ln|\mathcal{H}_{k_n}|}{n} \to 0$ as $n \to \infty$. Another way to write this is using little-oh notation. We say that a sequence $a_n = o(b_n)$ if $\frac{a_n}{b_n} \to 0$. So, the above is equivalent to $\ln|\mathcal{H}_{k_n}| = o(n)$. Under this condition,

\[ \forall \epsilon > 0,\ \exists N_{\epsilon} \text{ such that } \forall n \ge N_{\epsilon}, \quad \frac{\ln|\mathcal{H}_{k_n}|}{n} \le \frac{\epsilon^2}{4}. \]

Then for all $\epsilon > 0$,

\[ \sum_{n \ge 1} \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le N_{\epsilon} + \sum_{n \ge N_{\epsilon}} 2 e^{-n\epsilon^2/4} < \infty. \tag{2} \]

The summation on the right is simply a converging geometric series. By Corollary 1 we have established

Theorem 2. Let $\{\mathcal{H}_k\}$ satisfy the UAP with $|\mathcal{H}_k| < \infty$. Let $k_n$ be an integer sequence such that, as $n \to \infty$, $k_n \to \infty$ and $\ln|\mathcal{H}_{k_n}| = o(n)$. Then $R(\hat{h}_n) \to R^*$ almost surely, i.e., $\hat{h}_n$ is strongly universally consistent.

Corollary 2. Let $\mathcal{H}_k = \{$histogram classifiers with side length $1/k\}$ on $\mathcal{X} = [0,1]^d$. Let $k_n \to \infty$ such that $k_n^d = o(n)$. Then $\hat{h}_n$ is strongly universally consistent.

Proof. We saw previously in Section 2 that histograms have the UAP. Also, $|\mathcal{H}_k| = 2^{k^d}$, so $\ln|\mathcal{H}_{k_n}| = k_n^d \ln 2 = o(n)$, and the corollary follows from Theorem 2.

In Corollary 2, we require $\frac{k_n^d}{n} \to 0$. Note that $n / k_n^d$ is the number of samples per cell. Therefore the number of samples per cell must go to infinity; this seems like a reasonable condition for strong consistency.
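Here is a minimal sketch of such a sieve estimator in code (added for illustration; the function name, the $\{0,1\}$ label encoding, and the schedule $k_n = \lfloor n^{1/(d+1)} \rfloor$, which satisfies $k_n \to \infty$ and $k_n^d = o(n)$ as required by Corollary 2, are all choices made for the demo). ERM over histogram classifiers reduces to a majority vote within each cell:

```python
import numpy as np

def sieve_histogram_classifier(X, y):
    """Minimal sketch: ERM over histogram classifiers on [0,1]^d with side
    length 1/k_n, where k_n = floor(n^(1/(d+1))), so that k_n -> infinity
    and k_n^d = o(n) (the hypotheses of Corollary 2). Labels are assumed
    to lie in {0, 1}. Returns a function mapping test points to labels."""
    n, d = X.shape
    k = max(1, int(n ** (1.0 / (d + 1))))
    # Assign each training point to its cell (a d-tuple of bin indices).
    bins = np.minimum((X * k).astype(int), k - 1)
    counts = {}
    for b, label in zip(map(tuple, bins), y):
        c = counts.setdefault(b, [0, 0])  # [count of 0s, count of 1s]
        c[int(label)] += 1
    def predict(X_test):
        bt = np.minimum((X_test * k).astype(int), k - 1)
        out = []
        for b in map(tuple, bt):
            c0, c1 = counts.get(b, (1, 0))  # empty cells default to label 0
            out.append(int(c1 > c0))
        return np.array(out)
    return predict

# Usage: predict = sieve_histogram_classifier(X_train, y_train)
#        y_hat   = predict(X_test)
```

Any schedule with $k_n \to \infty$ and $k_n^d = o(n)$ gives consistency; Section 6 below identifies the growth rate that optimizes the rate of convergence.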

Now, let's consider the case where the VC dimension of $\mathcal{H}_k$ is finite. From our discussion of VC theory, we have that for all $\epsilon > 0$,

\[ \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le 8\, S(\mathcal{H}_{k_n}, n)\, e^{-n\epsilon^2/128}. \]

We also know from Sauer's lemma that

\[ S(\mathcal{H}, n) \le \left(\frac{en}{V}\right)^{V}, \]

where $V$ is the VC dimension of $\mathcal{H}$. Using this, we deduce

\[ \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le 8\, S(\mathcal{H}_{k_n}, n)\, e^{-n\epsilon^2/128} \le C\, n^{V_{k_n}} e^{-n\epsilon^2/128} = C \exp\left(V_{k_n} \ln n - \frac{n\epsilon^2}{128}\right), \]

where $C$ is simply a constant that does not depend on $n$. If we choose $k_n$ such that $V_{k_n} = o(n/\ln n)$, then we obtain the following inequality, which is similar to the one in Eq. (2):

\[ \forall \epsilon > 0,\ \exists N_{\epsilon} \text{ such that } \forall n \ge N_{\epsilon}, \quad \frac{V_{k_n} \ln n}{n} \le \frac{\epsilon^2}{128} - \frac{\epsilon^2}{256} = \frac{\epsilon^2}{256}. \]

Therefore

\[ \sum_{n \ge 1} \Pr\left(R(\hat{h}_n) - R(\mathcal{H}_{k_n}) \ge \epsilon\right) \le N_{\epsilon} + C \sum_{n \ge N_{\epsilon}} e^{-n\epsilon^2/256} < \infty. \]

We have established

Theorem 3. Let $\{\mathcal{H}_k\}$ satisfy the UAP, and let $V_k < \infty$ denote the VC dimension of $\mathcal{H}_k$. Let $k_n$ be an integer-valued sequence such that, as $n \to \infty$, $k_n \to \infty$ and $V_{k_n} = o(n/\ln n)$. Then $R(\hat{h}_n) \to R^*$ almost surely, i.e., $\hat{h}_n$ is strongly universally consistent.

5 Rates of Convergence

One could ask whether there is a universal rate of convergence. Put more formally, is there an $\hat{h}_n$ such that $\mathbb{E}[R(\hat{h}_n)] - R^*$ converges to 0 at a fixed rate, for any joint distribution $P_{XY}$? It turns out that the answer is no. A theorem from [2] makes this concrete.

Theorem 4. Let $\{a_n\}$ with $a_n \to 0$ be such that $\frac{1}{16} \ge a_1 \ge a_2 \ge \cdots$. For any $\hat{h}_n$, there exists a joint distribution $P_{XY}$ such that $R^* = 0$ and $\mathbb{E}[R(\hat{h}_n)] \ge a_n$.

Proof. See Chapter 7 of [1].

Since the sequence $\{a_n\}$ is arbitrary, we can choose it to converge to 0 as slowly as we like, showing that there is no universal rate of convergence. In order to establish rates of convergence, we will therefore have to place restrictions on $P_{XY}$. Some possible reasonable restrictions on $P_{XY}$ include:

- $P_{X|Y=y}$ is a continuous distribution.
- $\eta(x) = \Pr(Y = 1 \mid X = x)$ is smooth.
- there exists $t_0 > 0$ such that $P_X\left(\left|\eta(X) - \frac{1}{2}\right| \le t_0\right) = 0$.
- the Bayes decision boundary is smooth.

Below we consider one particular distributional assumption under which we can establish a rate of convergence.

Figure 3: Illustration of the Bayes decision boundary (BDB).

6 Box-Counting Class

Let $\mathcal{X} = [0,1]^d$. For $m \ge 1$, define $P_m$ to be the partition of $\mathcal{X}$ into cubes of side length $1/m$. Note that $|P_m| = m^d$.

Definition 3. The box-counting class $\mathcal{B}$ is the set of all $P_{XY}$ such that

(A) $P_X$ has a bounded density.

(B) There exists a constant $C$ such that for all $m \ge 1$, the Bayes decision boundary (BDB) $\{x : \eta(x) = 1/2\}$ passes through at most $C m^{d-1}$ elements of $P_m$ (see Fig. 3).

Condition (B) essentially says that the Bayes decision boundary has dimension $d - 1$. Look up the term "box-counting dimension" for more.

We have the following rate of convergence result.

Theorem 5. Let $\mathcal{H}_k = \{$histogram classifiers based on $P_k\}$, assume $P_{XY} \in \mathcal{B}$, and let $\hat{h}_n$ be a sieve estimator with $k_n \asymp n^{1/(d+2)}$. Then

\[ \mathbb{E}[R(\hat{h}_n)] - R^* = O\left(n^{-\frac{1}{d+2}}\right). \]
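Before the proof, it may help to see where the exponent comes from. The following back-of-the-envelope balance (a sketch added here, with constants and logarithmic factors suppressed) mirrors the formal argument below:

```latex
% Added heuristic: balance estimation and approximation errors for histograms.
% With |H_k| = 2^{k^d}, the estimation error scales like sqrt(k^d / n);
% under the box-counting assumption, the approximation error scales like 1/k.
\sqrt{\frac{k^d}{n}} \asymp \frac{1}{k}
\;\Longleftrightarrow\; k^{d+2} \asymp n
\;\Longleftrightarrow\; k \asymp n^{1/(d+2)},
\qquad \text{giving the rate} \quad \frac{1}{k} \asymp n^{-1/(d+2)}.
```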

Proof. Recall the decomposition into estimation and approximation errors:

\[ \mathbb{E}[R(\hat{h}_n)] - R^* = \mathbb{E}\left[R(\hat{h}_n) - R(\mathcal{H}_{k_n})\right] + \left(R(\mathcal{H}_{k_n}) - R^*\right). \]

To bound the estimation error, denote $\Delta_n := R(\hat{h}_n) - R(\mathcal{H}_{k_n})$. By the law of total expectation,

\[ \mathbb{E}[\Delta_n] = \Pr(\Delta_n \ge \epsilon) \underbrace{\mathbb{E}[\Delta_n \mid \Delta_n \ge \epsilon]}_{\le 1 \text{, trivial since } R \le 1} + \Pr(\Delta_n < \epsilon) \underbrace{\mathbb{E}[\Delta_n \mid \Delta_n < \epsilon]}_{< \epsilon} \le \delta + \epsilon. \]

Choosing $\epsilon = \sqrt{\frac{2[k^d \ln 2 + \ln(2/\delta)]}{n}}$ and $\delta = 1/n$, we have $\mathbb{E}[R(\hat{h}_n) - R(\mathcal{H}_{k_n})] = O\left(\sqrt{k_n^d / n}\right)$ (the $\ln n$ contribution from $\delta = 1/n$ is of lower order, since $k_n^d$ grows polynomially in $n$).

To bound the approximation error, let $h^*$ denote the Bayes classifier and $h_k^*$ the best classifier in $\mathcal{H}_k$. Observe

\[ R(h_k^*) - R(h^*) = 2\, \mathbb{E}_X\left[\left|\eta(X) - \tfrac{1}{2}\right| \mathbb{1}_{\{h_k^*(X) \ne h^*(X)\}}\right] \le \mathbb{E}_X\left[\mathbb{1}_{\{h_k^*(X) \ne h^*(X)\}}\right] = P_X\left(h_k^*(X) \ne h^*(X)\right), \tag{3} \]

where the inequality uses $|\eta(x) - \frac{1}{2}| \le \frac{1}{2}$.

Let $\Delta := \{x : h_k^*(x) \ne h^*(x)\}$, and let $f$ be the density of $P_X$. Then

\[ P_X(\Delta) = \int_{\Delta} f(x)\, dx \le \|f\|_{\infty}\, \lambda(\Delta), \]

where $\lambda$ denotes volume (Lebesgue measure). Now, classification errors occur only on cells intersecting the Bayes decision boundary (see the shaded cells in Fig. 3), and so

\[ \lambda(\Delta) \le \frac{C k^{d-1}}{k^d} = O\left(\frac{1}{k}\right). \]

Setting $\frac{1}{k} = \sqrt{\frac{k^d}{n}}$, we see that $k_n \asymp n^{1/(d+2)}$ gives the best rate for histograms.

Histograms do not achieve the best rate among all discrimination rules for the box-counting class; this best rate is $\mathbb{E}[R(\hat{h}_n)] - R^* = O(n^{-1/d})$ [4]. In the next set of notes we'll examine a discrimination rule that achieves this rate to within a logarithmic factor.

Exercises

1. With $\mathcal{X} = \mathbb{R}^d$, let $\mathcal{H}_k = \{h(x) = \mathbb{1}_{\{f(x) \ge 0\}} : f \text{ is a polynomial of degree at most } k\}$. Let $\mathcal{P}$ be the set of all joint distributions $P_{XY}$ such that $\inf_{h \in \mathcal{H}_k} R(h) \to R^*$ as $k \to \infty$. Determine an explicit sufficient condition on $k_n$ for the sieve estimator to be consistent for all $P_{XY} \in \mathcal{P}$.

2. Up to this point in our discussion of empirical risk minimization, we have always assumed that an empirical risk minimizer exists. However, it could be that the infimum of the empirical risk is not attained by any classifier. In this case, we can still have a consistent sieve estimator based on an approximate empirical risk minimizer. Thus, let $\tau_k$ be a sequence of positive real numbers decreasing to zero. Define $\hat{h}_{n,k}$ to be any classifier

\[ \hat{h}_{n,k} \in \left\{h \in \mathcal{H}_k : \hat{R}_n(h) \le \inf_{h' \in \mathcal{H}_k} \hat{R}_n(h') + \tau_k\right\}, \]

where $\hat{R}_n$ denotes the empirical risk. Now consider the discrimination rule $\hat{h}_n := \hat{h}_{n,k_n}$. Show that Theorems 2 and 3 still hold for this discrimination rule. State any additional assumptions on the rate of convergence of $\tau_k$ that may be needed.

3. Assume $\mathcal{X} \subseteq \mathbb{R}^d$. A function $f : \mathcal{X} \to \mathbb{R}$ is said to be Lipschitz continuous if there exists a constant $L > 0$ such that for all $x, x' \in \mathcal{X}$, $|f(x) - f(x')| \le L \|x - x'\|$. Show that if one coordinate of the Bayes decision boundary is a Lipschitz continuous function of the others, then (B) holds in the definition of the box-counting class. An example of the stated condition is when the Bayes decision boundary is the graph of, say, $x_d = f(x_1, \ldots, x_{d-1})$, where $f$ is Lipschitz.

4. Establish the following partial converse to Lemma 2: If $Z_n - Z$ is bounded, then $Z_n \xrightarrow{i.p.} Z$ implies $\mathbb{E}\{|Z_n - Z|\} \to 0$. Therefore, in the context of classification, $\mathbb{E}[R(\hat{h}_n)] \to R^*$ is equivalent to weak consistency.

References

[1] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.

[2] L. Devroye, "Any discrimination rule can have an arbitrarily bad probability of error for finite sample size," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 4, pp. 154-157, 1982.

[3] U. Grenander, Abstract Inference, Wiley, 1981.

[4] C. Scott and R. Nowak, "Minimax-optimal classification with dyadic decision trees," IEEE Trans. Inform. Theory, vol. 52, pp. 1335-1353, 2006.