Glivenko-Cantelli Classes


CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 4

Glivenko-Cantelli Classes

Lecturer: Peter Bartlett    Scribe: Michelle Besi

1 Introduction

This lecture will cover Glivenko-Cantelli (GC) classes and introduce Rademacher averages. We are interested in GC classes because, for these classes, we get uniform convergence of the empirical average to the true expectation. Rademacher averages provide a measure of complexity. In this lecture, the primary focus will be on introducing the GC classes of functions and proving the GC Theorem. It will end with the definition of Rademacher averages.

Recall from previous lectures that we can choose

$$\hat{f}_n = \arg\min_{f \in F} \hat{R}_n(f),$$

and we want

$$R(\hat{f}_n) - \inf_{f \in F} R(f)$$

to be small, and it suffices to show that

$$\sup_{f \in F} |R(f) - \hat{R}_n(f)|$$

is small. Indeed, for any $f^* \in F$,

$$R(\hat{f}_n) - R(f^*) = \big(R(\hat{f}_n) - \hat{R}_n(\hat{f}_n)\big) + \big(\hat{R}_n(\hat{f}_n) - \hat{R}_n(f^*)\big) + \big(\hat{R}_n(f^*) - R(f^*)\big) \le 2 \sup_{f \in F} |R(f) - \hat{R}_n(f)|,$$

since the middle term is non-positive by the definition of $\hat{f}_n$. For GC classes of functions this sufficient condition is satisfied as $n$ gets large.

2 GC Classes

We begin with a definition of the GC class of functions.

Definition. $F$ is a GC class if, for all $\epsilon > 0$,

$$\lim_{n \to \infty} \sup_P P^n\Big( \sup_{f \in F} |Ef - \hat{E}_n f| > \epsilon \Big) = 0.$$

Note: $P^n$ means $n$ independent draws from the distribution $P$.
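For instance (a sanity check, under the assumption that $f$ takes values in $[0,1]$): any singleton class $F = \{f\}$ is GC, since Hoeffding's inequality gives, uniformly over distributions $P$,

$$P^n\big( |Ef - \hat{E}_n f| > \epsilon \big) \le 2 e^{-2n\epsilon^2} \longrightarrow 0.$$

The content of the GC property is that this convergence holds simultaneously over an entire, typically infinite, class of functions.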

2.1 The Glivenko-Cantelli Theorem

Let:

- $x_1, \ldots, x_n$ be i.i.d. data points from a distribution $P$,
- $F_n(t)$ be the empirical distribution function,
- $F(t)$ be the true distribution function of $P$.

We have the following expressions for the CDFs:

$$F_n(t) = \hat{E}_n 1[x \le t], \qquad F(t) = E\, 1[x \le t].$$

Now define:

$$G = \{ x \mapsto 1[x \le \theta] : \theta \in \mathbb{R} \}.$$

That is, there is a one-to-one mapping between $G$ and $\mathbb{R}$. Therefore the Glivenko-Cantelli Theorem states: for all $P$,

$$\sup_{g \in G} |Eg - \hat{E}_n g| \to 0.$$

Thus, we can interpret this classical result as a result about uniform convergence over this class of subsets of the reals.

2.2 GC Theorem

We'll now formally present the GC Theorem, and give a proof that is suggestive of an approach that applies much more generally (which we'll meet in the next lecture).

Theorem 2.1. Define:

$$F_n(t) = P_n((-\infty, t]) \quad \text{(the empirical distribution function)},$$

$$F(t) = P((-\infty, t]) \quad \text{(the true distribution function of $P$)}.$$

For all probability distributions $P$ on $\mathbb{R}$, $F_n \xrightarrow{a.s.} F$ uniformly on $\mathbb{R}$. Or, symbolically, we can write

$$\sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \xrightarrow{a.s.} 0.$$

Note: the law of large numbers ensures pointwise convergence of distribution functions; with the GC class of functions, however, we obtain something stronger, namely uniform convergence.
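The theorem is easy to watch in simulation. A minimal sketch (assuming NumPy and SciPy are available; the standard normal $P$ and these sample sizes are arbitrary choices) that estimates $\sup_t |F_n(t) - F(t)|$, using the fact that the supremum is attained at a jump of $F_n$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
for n in [10, 100, 1000, 10000]:
    x = np.sort(rng.normal(size=n))
    cdf = norm.cdf(x)  # true F at the sample points
    # F_n equals i/n at x_(i) and (i-1)/n just before it, so the sup
    # over all t is the max deviation over these 2n values.
    upper = np.arange(1, n + 1) / n
    lower = np.arange(0, n) / n
    sup_dev = max(np.abs(upper - cdf).max(), np.abs(lower - cdf).max())
    print(f"n = {n:6d}   sup |F_n - F| ~ {sup_dev:.4f}")
```

The printed deviations shrink toward zero as $n$ grows, which is exactly the uniform convergence the theorem asserts.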

The proof of the Glivenko-Cantelli Theorem involves three parts:

1. Use of the McDiarmid concentration inequality
2. Use of symmetrization
3. Application of simple restrictions

Proof.

1. Through application of the McDiarmid concentration inequality (changing a single $x_i$ changes $\sup_{g \in G} (\hat{E}_n g - Eg)$ by at most $1/n$), we know that with probability at least $1 - \exp(-2\epsilon^2 n)$,

$$\sup_{g \in G} \big( \hat{E}_n g - Eg \big) \le E \sup_{g \in G} \big( \hat{E}_n g - Eg \big) + \epsilon.$$

That is, the deviations are concentrated around their expectation.

2. Next we apply symmetrization. Recall that we ultimately would like to prove

$$\sup_{g \in G} |\hat{E}_n g - Eg| \xrightarrow{a.s.} 0.$$

Also, note that we can write

$$\hat{E}_n g - Eg = \frac{1}{n} \sum_{i=1}^n g(x_i) - E g(x).$$

Let $x'_1, \ldots, x'_n$ be i.i.d. copies of $x_1, \ldots, x_n$. Then

$$E \sup_{g \in G} \big( \hat{E}_n g - Eg \big) = E \sup_{g \in G} \Big( \frac{1}{n} \sum_{i=1}^n g(x_i) - Eg \Big) \quad \text{(expanding the definition of $\hat{E}_n$)}$$

$$= E \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \big( g(x_i) - E g(x'_i) \big)$$

$$= E \sup_{g \in G} E\Big[ \frac{1}{n} \sum_{i=1}^n \big( g(x_i) - g(x'_i) \big) \,\Big|\, x_1, \ldots, x_n \Big] \quad \text{(properties of conditional expectation)}$$

$$\le E\, E\Big[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \big( g(x_i) - g(x'_i) \big) \,\Big|\, x_1, \ldots, x_n \Big] \quad \text{(bringing the $E$ out front)}$$

$$= E \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \epsilon_i \big( g(x_i) - g(x'_i) \big),$$

where $\epsilon_i$ is a Rademacher variable (uniform on $\{\pm 1\}$); the last equality holds because $g(x_i) - g(x'_i)$ has a symmetric distribution, so multiplying by $\epsilon_i$ does not change it. So we have the following upper bound on the previous expression:

$$E \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \epsilon_i \big( g(x_i) - g(x'_i) \big) \le E \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \epsilon_i g(x_i) + E \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n (-\epsilon_i) g(x'_i) = 2\, E \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \epsilon_i g(x_i).$$
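As a quick numerical check of the symmetrization step, here is a Monte Carlo sketch (assuming NumPy; $P = \mathrm{Uniform}[0,1]$ and $n = 50$ are arbitrary choices) comparing $E \sup_g (\hat{E}_n g - Eg)$ with $2\, E \sup_g \frac{1}{n} \sum_i \epsilon_i g(x_i)$ for the threshold class $G$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 4000

lhs = rhs = 0.0
for _ in range(trials):
    x = np.sort(rng.uniform(size=n))
    # sup_g (E_n g - Eg): for g = 1[. <= theta] under Uniform[0,1],
    # Eg = theta, and the sup is attained at a sample point (or is 0).
    lhs += max(0.0, (np.arange(1, n + 1) / n - x).max())
    # sup_g (1/n) sum_i eps_i g(x_i): the restrictions of G to the sample
    # are prefixes of the sorted data, so take a maximal prefix sum.
    eps = rng.choice([-1.0, 1.0], size=n)
    rhs += 2.0 * max(0.0, np.cumsum(eps).max()) / n

print(f"E sup (E_n g - Eg) ~ {lhs / trials:.3f}   2 E sup ... ~ {rhs / trials:.3f}")
```

The first estimate comes out below the second, as the symmetrization inequality requires.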

3. Next, we consider simple restrictions of $G$ to the sample. We can write

$$2\, E \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \epsilon_i g(x_i) = 2\, E\, E\Big[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \epsilon_i g(x_i) \,\Big|\, x_1, \ldots, x_n \Big],$$

where the inner conditional expectation is the Rademacher average of $G$ given $x_1, \ldots, x_n$. But

$$\{ (g(x_1), \ldots, g(x_n)) : g \in G \} = \{ (g(x_{(1)}), \ldots, g(x_{(n)})) : g \in G \},$$

and this set has cardinality at most $n + 1$, where we have ordered the data: $\{x_1, \ldots, x_n\} = \{x_{(1)}, \ldots, x_{(n)}\}$ with $x_{(1)} \le \cdots \le x_{(n)}$. (On the ordered sample, each threshold function restricts to a vector of the form $(1, \ldots, 1, 0, \ldots, 0)$, and there are only $n + 1$ such vectors.)

Next we apply the following lemma to bound the expression from above:

$$2\, E\, E\Big[ \sup_{g \in G} \frac{1}{n} \sum_{i=1}^n \epsilon_i g(x_i) \,\Big|\, x_1, \ldots, x_n \Big].$$

Lemma 2.2. For finite $A \subset \mathbb{R}^n$ with $R = \max_{a \in A} \big( \sum_i a_i^2 \big)^{1/2}$, we have

$$E \sup_{a \in A} \underbrace{\frac{1}{n} \sum_{i=1}^n \epsilon_i a_i}_{Z_a} \le \frac{R \sqrt{2 \log |A|}}{n}.$$

Proof. For any $s > 0$,

$$\exp\big( s\, E \sup_{a} Z_a \big) \le E \exp\big( s \sup_a Z_a \big) \quad \text{(because the exponential function is convex)}$$

$$= E \sup_a \exp(s Z_a) \le \sum_{a \in A} E \exp(s Z_a)$$

$$\le |A| \exp\Big( \frac{s^2 R^2}{2 n^2} \Big) \quad \text{(by Hoeffding's inequality, since $\sum_i a_i^2 \le R^2$)}.$$

Taking logarithms and dividing by $s$,

$$E \sup_a Z_a \le \inf_{s > 0} \Big( \frac{\log |A|}{s} + \frac{s R^2}{2 n^2} \Big) = \frac{R \sqrt{2 \log |A|}}{n}.$$

Note: for our application, $R \le \sqrt{n}$ and $|A| \le n + 1$. Hence

$$\Pr\Big( \sup_{g \in G} \big( \hat{E}_n g - Eg \big) \ge \epsilon + 2 \sqrt{\frac{2 \log(n+1)}{n}} \Big) \le \exp(-2 \epsilon^2 n),$$

which completes the proof. (Almost sure convergence follows by the Borel-Cantelli lemma, since the right-hand side is summable in $n$.)
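To get a feel for the rate, a worked instance with an arbitrarily chosen sample size: for $n = 10{,}000$,

$$2 \sqrt{\frac{2 \log(n+1)}{n}} = 2 \sqrt{\frac{2 \log(10001)}{10000}} \approx 2 \sqrt{0.00184} \approx 0.086,$$

so with probability at least $1 - e^{-2\epsilon^2 n}$ the uniform deviation is at most $\epsilon + 0.086$; the bound decays like $\sqrt{\log n / n}$.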

3 Rademacher averages

Definition. For a class $F$ of real-valued functions defined on $X$, for i.i.d. $x_1, \ldots, x_n \in X$, and for independent Rademacher random variables $\epsilon_1, \ldots, \epsilon_n$, define:

$$R_n(F) = E \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i) \quad \text{(Rademacher averages of $F$)},$$

$$\hat{R}_n(F) = E\Big[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i) \,\Big|\, x_1, \ldots, x_n \Big] \quad \text{(empirical Rademacher averages)}.$$
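For a finite class (or a finite set of restrictions, as in the proof above), $\hat{R}_n(F)$ is straightforward to approximate by Monte Carlo. A minimal sketch, assuming NumPy; the function name and the matrix representation of the class are illustrative choices, not from the notes:

```python
import numpy as np

def empirical_rademacher(fvals, trials=5000, seed=0):
    """Monte Carlo estimate of hat{R}_n(F) for a finite class,
    given fvals[j, i] = f_j(x_i) on a fixed sample x_1, ..., x_n."""
    rng = np.random.default_rng(seed)
    _, n = fvals.shape
    eps = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher signs
    # For each draw of signs, take the sup over the class, then average.
    return (eps @ fvals.T / n).max(axis=1).mean()

# Example: the threshold class G restricted to any n-point sample has only
# n + 1 distinct restrictions: (0,...,0), (1,0,...,0), ..., (1,...,1).
n = 1000
fvals = np.tril(np.ones((n + 1, n)), k=-1)  # row k has k leading ones
print(empirical_rademacher(fvals))
# Comes out around sqrt(2/(pi*n)) ~ 0.025, below the Lemma 2.2
# bound sqrt(2*log(n+1)/n) ~ 0.118.
```

The gap between the Monte Carlo value and the lemma's bound reflects that Lemma 2.2 only uses the cardinality of the restricted class, not its structure.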