Empirical Process Theory and Oracle Inequalities

Stat 928: Statistical Learning Theory
Lecture 10: Empirical Process Theory and Oracle Inequalities
Instructor: Sham Kakade

1 Risk vs. Risk

See Lecture 0 for a discussion on terminology.

2 The Union Bound / Bonferroni

Consider $m$ events $E_1, \ldots, E_m$. We have

$$P(E_1 \cup \cdots \cup E_m) \le P(E_1) + \cdots + P(E_m).$$

In other words, with probability at least $1 - P(E_1) - \cdots - P(E_m)$, none of the events $E_i$ ($i = 1, \ldots, m$) occurs. We assume the probability $\sum_j P(E_j)$ is small. The union bound is relatively tight when the events $E_j$ are independent:

$$P(E_1 \cup \cdots \cup E_m) \ge \sum_j P(E_j) - \sum_{j < k} P(E_j \cap E_k) \ge \sum_j P(E_j) - 0.5\Big(\sum_j P(E_j)\Big)^2.$$

If the $E_j$ are correlated, then it is not tight. For example, when they are completely correlated, $E_1 = \cdots = E_m$, then $P(E_1 \cup \cdots \cup E_m) = m^{-1} \sum_j P(E_j)$. We will come back to this when we discuss chaining.
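To make the tightness discussion concrete, here is a small Monte Carlo sketch in Python (an illustration of my own; the specific events and constants are assumptions, not from the notes). It compares the union bound $\sum_j P(E_j)$ with the actual probability $P(E_1 \cup \cdots \cup E_m)$ for independent events and for completely correlated events.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, trials = 20, 0.005, 200_000        # m rare events, each with probability p

# Independent events: E_j = {U_j < p} for independent uniforms U_1, ..., U_m.
U = rng.random((trials, m))
p_union_indep = np.mean((U < p).any(axis=1))

# Completely correlated events: E_1 = ... = E_m = {U < p} for a single uniform U.
u = rng.random(trials)
p_union_corr = np.mean(u < p)

print(f"union bound sum_j P(E_j)      : {m * p:.4f}")
print(f"P(union), independent events  : {p_union_indep:.4f}")
print(f"P(union), correlated events   : {p_union_corr:.4f}")
```

With these constants the independent case comes out close to the union bound, while the fully correlated case is smaller by roughly a factor of $m$.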

3 Motivation of Empirical Process

Consider a learning problem with observations $Z_i = (X_i, Y_i)$, a prediction rule $f(X_i)$, and a loss function $L(f(X_i), Y_i)$. Assume further that $f$ is parameterized by $\theta \in \Theta$ as $f_\theta(X_i)$. For example, $f_\theta(x) = \theta^\top x$ may be a linear function, and $L(f_\theta(x), y) = (\theta^\top x - y)^2$ is the least-squares loss. In the following we introduce the simplified notation

$$g_\theta(Z_i) = L(f_\theta(X_i), Y_i).$$

We are interested in estimating $\hat{\theta}$ from training data; that is, $\hat{\theta}$ depends on the $Z_i$. Since we are using the training data as a surrogate for the test (true underlying) distribution, we hope the training error is similar to the test error. In learning theory, we are interested in estimating the following tail quantities for some $\epsilon > 0$:

$$P\Big(\frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}}(Z_i) \ge E g_{\hat{\theta}}(Z) + \epsilon\Big) \quad \text{and} \quad P\Big(\frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}}(Z_i) \le E g_{\hat{\theta}}(Z) - \epsilon\Big).$$

The above two quantities can be bounded using the following two quantities:

$$P\Big(\frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}}(Z_i) \ge E g_{\hat{\theta}}(Z) + \epsilon\Big) \le P\Big[\sup_{\theta \in \Theta}\Big(\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z)\Big) \ge \epsilon\Big]$$

and

$$P\Big(\frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}}(Z_i) \le E g_{\hat{\theta}}(Z) - \epsilon\Big) \le P\Big[\sup_{\theta \in \Theta}\Big(E g_\theta(Z) - \frac{1}{n}\sum_{i=1}^n g_\theta(Z_i)\Big) \ge \epsilon\Big].$$

Notation: in the above setting, the collection of random variables $\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i)$ indexed by $\theta \in \Theta$ is called an empirical process. We may also call $\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z)$ an empirical process. For each fixed $\theta$, $\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z) \to 0$ in probability by the law of large numbers. However, in empirical process theory we are interested in a uniform law of large numbers, that is, whether the supremum of the empirical process,

$$\sup_{\theta \in \Theta}\Big|\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z)\Big|,$$

converges to zero in probability. Given training data $Z_1^n = \{Z_1, \ldots, Z_n\}$, we may let $\hat{\theta}(Z_1^n)$ achieve the supremum above. Then

$$\sup_{\theta \in \Theta}\Big|\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) - E g_\theta(Z)\Big| = \Big|\frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}(Z_1^n)}(Z_i) - E g_{\hat{\theta}(Z_1^n)}(Z)\Big|,$$

where $\hat{\theta}(Z_1^n)$ depends on the training data. This means that $g_{\hat{\theta}(Z_1^n)}(Z_i)$ is no longer a sum of independent random variables. The supremum of the empirical process is basically the worst-case deviation between the empirical mean (training error) and the true mean (test error) over parameters chosen based on the training data. Conceptually, as long as you select $\hat{\theta}$ based on the training data, you need to use empirical process theory and a uniform law of large numbers. However, if you only consider a fixed $\theta$ independent of the training data, then you can use the standard law of large numbers, because the $g_\theta(Z_i)$ are independent random variables.
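The following minimal Python sketch (my own illustration, using the one-dimensional threshold classifiers analyzed in Section 7 and assuming $x$ uniform on $[-1, 1]$) contrasts the training/test gap at one fixed $\theta$ with the worst-case gap over a grid of $\theta$ values; the latter, typically larger, quantity is what the uniform law of large numbers must control.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, theta_star = 200, 0.8, 0.3            # sample size, P(eps = 1), true threshold
thetas = np.linspace(-1, 1, 401)            # finite grid standing in for Theta

# Draw training data: x uniform on [-1, 1], y = eps * (2*I(x >= theta_star) - 1).
x = rng.uniform(-1, 1, n)
eps = np.where(rng.random(n) < p, 1, -1)
y = eps * np.where(x >= theta_star, 1, -1)

def train_err(theta):
    """Empirical 0-1 loss of f_theta(x) = 2*I(x >= theta) - 1 on the training set."""
    return np.mean(np.where(x >= theta, 1, -1) != y)

def test_err(theta):
    """True 0-1 risk under this model: Bayes error plus mass of the disagreement region."""
    return (1 - p) + (2 * p - 1) * abs(theta - theta_star) / 2

gaps = np.array([abs(train_err(t) - test_err(t)) for t in thetas])
print("gap at the fixed theta = theta_star:", gaps[np.argmin(np.abs(thetas - theta_star))])
print("worst-case gap over the grid       :", gaps.max())
```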

4 Oracle Inequality for empirical risk minimization

Consider the empirical risk minimization (ERM) algorithm

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{i=1}^n g_\theta(Z_i),$$

and the optimal parameter that minimizes the test error (with an infinite amount of data):

$$\theta_* = \arg\min_{\theta \in \Theta} E g_\theta(Z).$$

We want to know how much worse the test error performance of $\hat{\theta}$ is compared to that of $\theta_*$. Results of this flavor are referred to as oracle inequalities. We can obtain a simple oracle inequality using the uniform law of large numbers for the empirical process as follows. Assume that we have the tail bound for the empirical mean of $g_{\theta_*}(Z)$:

$$P\Big(\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) - E g_{\theta_*}(Z) \ge \epsilon_1\Big) \le \delta_1(\epsilon_1).$$

Assume also that we have the following uniform tail bound for the empirical process, for some $\gamma \in [0, 1)$:

$$P\Big(\sup_{\theta \in \Theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big] \ge \epsilon_2\Big) \le \delta_2(\epsilon_2).$$

Taking the union bound, we obtain that with probability $1 - \delta_1(\epsilon_1) - \delta_2(\epsilon_2)$,

$$\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) - E g_{\theta_*}(Z) < \epsilon_1, \qquad -\frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}}(Z_i) + (1-\gamma) E g_{\hat{\theta}}(Z) + \gamma E g_{\theta_*}(Z) < \epsilon_2.$$

Since by the definition of ERM we have

$$\frac{1}{n}\sum_{i=1}^n g_{\hat{\theta}}(Z_i) \le \frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i),$$

adding the three inequalities gives

$$(1-\gamma) E g_{\hat{\theta}}(Z) + \gamma E g_{\theta_*}(Z) - E g_{\theta_*}(Z) < \epsilon_1 + \epsilon_2.$$

That is, we have

$$E g_{\hat{\theta}}(Z) < E g_{\theta_*}(Z) + (1-\gamma)^{-1}(\epsilon_1 + \epsilon_2).$$

If $\Theta$ contains only a finite number of functions, $N = |\Theta|$, then we can simply apply the union bound:

$$\begin{aligned}
P\Big(\sup_{\theta \in \Theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big] \ge \epsilon\Big)
&= P\Big(\bigcup_{\theta \in \Theta}\Big\{-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z) \ge \epsilon\Big\}\Big) \\
&\le \sum_{\theta \in \Theta} P\Big(-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z) \ge \epsilon\Big).
\end{aligned}$$
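For a finite class with losses in $[0, 1]$, Hoeffding's inequality together with the union bound above gives explicit choices of $\epsilon_1$ and $\epsilon_2$. The sketch below (a rough illustration with $\gamma = 0$; the function name and interface are hypothetical, not from the notes) computes the resulting excess-risk bound.

```python
import math

def erm_excess_risk_bound(n, N, delta):
    """Oracle-inequality excess risk bound for ERM over a finite class of size N.

    Assumes losses g_theta(Z) in [0, 1] and gamma = 0, and splits the failure
    probability delta evenly between the two events:
      eps_1 = sqrt(ln(2/delta) / (2n))      (Hoeffding at theta_star)
      eps_2 = sqrt(ln(2N/delta) / (2n))     (Hoeffding + union bound over Theta)
    With probability at least 1 - delta,  E g_theta_hat < E g_theta_star + eps_1 + eps_2.
    """
    eps_1 = math.sqrt(math.log(2 / delta) / (2 * n))
    eps_2 = math.sqrt(math.log(2 * N / delta) / (2 * n))
    return eps_1 + eps_2

# Example: 1000 samples, 100 candidate parameters, failure probability 0.05.
print(erm_excess_risk_bound(n=1000, N=100, delta=0.05))
```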

5 Recap: Oracle Inequality

Consider the empirical risk minimization algorithm

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \sum_{i=1}^n g_\theta(Z_i),$$

and the optimal parameter that minimizes the test error (with an infinite amount of data):

$$\theta_* = \arg\min_{\theta \in \Theta} E g_\theta(Z).$$

Suppose that

$$P\Big(\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) - E g_{\theta_*}(Z) \ge \epsilon_1\Big) \le \delta_1(\epsilon_1),$$

which means that the training error of the optimal parameter isn't much larger than its test error. Assume also that we have the following uniform tail bound for the empirical process for some $\gamma \in [0, 1)$,

$$P\Big(\sup_{\theta \in \Theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big] \ge \epsilon_2\Big) \le \delta_2(\epsilon_2),$$

which means that the training error of an arbitrary inferior parameter isn't much smaller than its test error. Then we have the oracle inequality: with probability $1 - \delta_1(\epsilon_1) - \delta_2(\epsilon_2)$,

$$E g_{\hat{\theta}}(Z) < E g_{\theta_*}(Z) + (1-\gamma)^{-1}(\epsilon_1 + \epsilon_2).$$

This means that the generalization performance of ERM isn't much worse than that of the optimal parameter.

6 Lower bracketing covering number

If $\Theta$ is infinite, then we can use the idea of covering numbers. There are different definitions. Let $\mathcal{G} = \{g_\theta : \theta \in \Theta\}$ be the function class of the empirical process. $\mathcal{G}_N = \{g_1(z), \ldots, g_N(z)\}$ is an $\epsilon$-lower bracketing cover of $\mathcal{G}$ if for all $\theta \in \Theta$, there exists $j = j(\theta)$ such that

$$\sup_z \big[g_j(z) - g_\theta(z)\big] \le 0, \qquad E g_j(z) \ge E g_\theta(z) - \epsilon.$$

The smallest cardinality $N_{LB}(\mathcal{G}, \epsilon)$ of such a $\mathcal{G}_N$ is called the $\epsilon$-lower bracketing covering number. Similarly, one can define an upper bracketing covering number. The logarithm of a covering number is called the entropy. We should mention that the functions $g_j(z)$ need not themselves be functions $g_\theta(z)$ for some $\theta \in \Theta$.

Let $\mathcal{G}(\epsilon/2)$ be an $\epsilon/2$ lower bracketing cover of $\mathcal{G}$, and pick $j = j(\theta)$ for each $\theta$. Thus,

$$\begin{aligned}
\sup_{\theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big]
&= \sup_{\theta}\Big[-\frac{1}{n}\sum_{i=1}^n \big[g_\theta(Z_i) - g_{j(\theta)}(Z_i)\big] - \frac{1}{n}\sum_{i=1}^n g_{j(\theta)}(Z_i) + (1-\gamma) E g_{j(\theta)}(Z) + \gamma E g_{\theta_*}(Z) + (1-\gamma)\big[E g_\theta(Z) - E g_{j(\theta)}(Z)\big]\Big] \\
&\le \max_{j \le |\mathcal{G}(\epsilon/2)|}\Big[-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + (1-\gamma) E g_j(Z) + \gamma E g_{\theta_*}(Z) + (1-\gamma)\epsilon/2\Big],
\end{aligned}$$

since $g_{j(\theta)}(z) \le g_\theta(z)$ pointwise and $E g_\theta(Z) - E g_{j(\theta)}(Z) \le \epsilon/2$. Therefore

$$\begin{aligned}
P\Big(\sup_{\theta}\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + (1-\gamma) E g_\theta(Z) + \gamma E g_{\theta_*}(Z)\Big] \ge \epsilon\Big)
&\le P\Big(\max_{j \le |\mathcal{G}(\epsilon/2)|}\Big[-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + (1-\gamma) E g_j(Z) + \gamma E g_{\theta_*}(Z) + (1-\gamma)\epsilon/2\Big] \ge \epsilon\Big) \\
&= P\Big(\max_{j \le |\mathcal{G}(\epsilon/2)|}\Big[-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + E g_j(Z) - \gamma\big(E g_j(Z) - E g_{\theta_*}(Z)\big)\Big] \ge 0.5(1+\gamma)\epsilon\Big) \\
&\le \sum_{j=1}^{|\mathcal{G}(\epsilon/2)|} P\Big(-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + E g_j(Z) - \gamma\big(E g_j(Z) - E g_{\theta_*}(Z)\big) \ge 0.5(1+\gamma)\epsilon\Big).
\end{aligned}$$

The summation bound with $\gamma > 0$ is a form of an idea in empirical process theory referred to as peeling, sometimes also called shell bounds. We will present a simple example below to illustrate the basic concepts.

7 A Simple Example

This example is meant to get you familiar with the intuitions and notation. We will consider more complex examples in future lectures, but the basic idea resembles this example. Consider a one-dimensional classification problem with $x \in [-1, 1]$ and $y \in \{\pm 1\}$. Assume that, conditioned on $x$, the class label is given by $y = \epsilon\,(2 I(x \ge \theta_*) - 1)$ for some unknown $\theta_*$, with independent random noise $\epsilon \in \{-1, 1\}$ and $p = P(\epsilon = 1) > 0.5$. This means that the optimal Bayes classifier is $f_*(x) = 1$ when $x \ge \theta_*$ and $f_*(x) = -1$ when $x < \theta_*$, and the Bayes error is $1 - p$. Since we don't know the true threshold, we consider a family of classifiers $f_\theta(x) = 2 I(x \ge \theta) - 1$, with $\theta$ to be learned from the training data. Given a sample $Z = (X, Y)$, the classification error function for this classifier is

$$g_\theta(Z) = I(f_\theta(X) \ne Y).$$

Given training data $Z_1^n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, we can learn a threshold $\hat{\theta}$ using empirical risk minimization, which finds $\theta$ by minimizing the training error:

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^n g_\theta(Z_i).$$

We want to know the generalization performance of $\hat{\theta}$ compared to the Bayes error, that is, to give an upper bound on $E g_{\hat{\theta}}(Z) - (1 - p)$. We will examine the following issues in order to understand what is going on:

- $1/\sqrt{n}$ convergence (using the Chernoff bound) versus $1/n$ convergence (using a refined Chernoff bound or Bennett's inequality).
- The role of peeling.

7.1 Bracketing cover of the function class

Given $\epsilon$, let $\theta_j = -1 + j\epsilon$ for $j = 1, \ldots, 2/\epsilon$. Let

$$g_j(z) = \begin{cases} 0 & \text{if } x \in [\theta_j - \epsilon, \theta_j], \\ g_{\theta_j}(z) & \text{otherwise,} \end{cases}$$

where $z = (x, y)$. It follows that for any $\theta \in [-1, 1]$, if we let $j$ be the smallest index such that $\theta_j \ge \theta$, then $g_j(z) = 0 \le g_\theta(z)$ when $x \in [\theta_j - \epsilon, \theta_j]$ (an interval containing $[\theta, \theta_j]$), and $g_j(z) = g_{\theta_j}(z) = g_\theta(z)$ when $x \notin [\theta_j - \epsilon, \theta_j]$, where $z = (x, y)$. Moreover,

$$E g_\theta(z) - E g_j(z) = E\big[I(x \in [\theta_j - \epsilon, \theta_j])\, g_\theta(z)\big] \le \epsilon,$$

provided, for instance, that $x$ is uniformly distributed on $[-1, 1]$ (so that the interval has probability at most $\epsilon/2$). Note that since only the analysis depends on the covering number, in general we can design a cover that depends on the truth $\theta_*$ and covers the space non-uniformly. This is not considered here.
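As a sanity check on the construction in Section 7.1, the sketch below (my own illustration, assuming $x$ uniform on $[-1, 1]$; the helper names are hypothetical) builds the bracket functions $g_j$ on a grid and numerically verifies the two lower-bracketing conditions, $g_j \le g_\theta$ pointwise and $E g_\theta - E g_j \le \epsilon$, for a few values of $\theta$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, theta_star, eps = 0.8, 0.3, 0.1
theta_grid = -1 + eps * np.arange(1, int(2 / eps) + 1)    # theta_j = -1 + j*eps, j = 1..2/eps

def g_theta(theta, x, y):
    """0-1 loss of the threshold classifier f_theta(x) = 2*I(x >= theta) - 1."""
    return (np.where(x >= theta, 1, -1) != y).astype(float)

def g_bracket(j, x, y):
    """Lower bracket g_j: zero on [theta_j - eps, theta_j], equal to g_{theta_j} elsewhere."""
    tj = theta_grid[j]
    out = g_theta(tj, x, y)
    out[(x >= tj - eps) & (x <= tj)] = 0.0
    return out

# Large sample, used to check g_j <= g_theta pointwise and to estimate the expectations.
x = rng.uniform(-1, 1, 200_000)
noise = np.where(rng.random(x.size) < p, 1, -1)
y = noise * np.where(x >= theta_star, 1, -1)

for theta in (-0.73, 0.0, 0.41):
    j = int(np.searchsorted(theta_grid, theta))           # smallest j with theta_j >= theta
    gt, gj = g_theta(theta, x, y), g_bracket(j, x, y)
    print(f"theta={theta:+.2f}: g_j <= g_theta everywhere: {bool((gj <= gt).all())}, "
          f"E g_theta - E g_j = {gt.mean() - gj.mean():.4f} (<= eps = {eps})")
```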

7.2 Using the standard Chernoff bound without peeling

At $\theta_*$, we have from the Chernoff bound:

$$P\Big(\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) - E g_{\theta_*}(Z) \ge \epsilon\Big) \le \exp(-2 n \epsilon^2).$$

Alternatively, we say that with probability $1 - \delta_1$:

$$\frac{1}{n}\sum_{i=1}^n g_{\theta_*}(Z_i) - E g_{\theta_*}(Z) < \epsilon_1 = \sqrt{\ln(1/\delta_1)/(2n)}.$$

Now we want to evaluate the uniform deviation using the lower bracketing cover $\mathcal{G}(\epsilon/2)$ (taking $\gamma = 0$):

$$P\Big(\sup_\theta\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + E g_\theta(Z)\Big] \ge \epsilon\Big) \le \sum_{j=1}^{|\mathcal{G}(\epsilon/2)|} P\Big(-\frac{1}{n}\sum_{i=1}^n g_j(Z_i) + E g_j(Z) \ge 0.5\epsilon\Big) \le \frac{4}{\epsilon}\, e^{-n\epsilon^2/2}.$$

We used $|\mathcal{G}(\epsilon/2)| \le 4/\epsilon$. Alternatively, we say that with probability $1 - \delta_2$ (and note that we may take $\epsilon_2 \ge 2/n$):

$$\sup_\theta\Big[-\frac{1}{n}\sum_{i=1}^n g_\theta(Z_i) + E g_\theta(Z)\Big] < \epsilon_2 = \sqrt{2\big(\ln(4/\epsilon_2) - \ln\delta_2\big)/n}.$$

Let $\delta = 2\delta_1 = 2\delta_2$. Since $\epsilon_2 \ge 2/n$, we have $\ln(4/\epsilon_2) \le \ln(2n)$, so $\epsilon_2 \le \sqrt{2(\ln(2n) + \ln(2/\delta))/n}$. Combining these via the oracle inequality of Section 5 (with $\gamma = 0$) and using $E g_{\theta_*}(Z) = 1 - p$, we have with probability at least $1 - \delta$:

$$E g_{\hat{\theta}}(Z) - (1 - p) < \sqrt{\ln(2/\delta)/(2n)} + \sqrt{2\big(\ln(2n) + \ln(2/\delta)\big)/n} \le \sqrt{2\ln(2n)/n} + 3\sqrt{\ln(2/\delta)/(2n)}.$$

Thus, without peeling, the standard Chernoff bound gives an excess-risk rate of order $\sqrt{\ln(n)/n}$.
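To close the loop, here is an end-to-end simulation sketch (my own illustration, again assuming $x$ uniform on $[-1, 1]$; it is not part of the notes): it runs ERM over thresholds on data from the model above, estimates the excess risk $E g_{\hat{\theta}}(Z) - (1 - p)$, and compares it with the $\sqrt{2\ln(2n)/n} + 3\sqrt{\ln(2/\delta)/(2n)}$ bound just derived. The bound is loose but should dominate the observed excess risk, and both shrink as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
p, theta_star, delta = 0.8, 0.3, 0.05

def sample(n):
    """Draw n samples: x uniform on [-1, 1], y = eps * (2*I(x >= theta_star) - 1)."""
    x = rng.uniform(-1, 1, n)
    eps = np.where(rng.random(n) < p, 1, -1)
    return x, eps * np.where(x >= theta_star, 1, -1)

def erm_threshold(x, y):
    """Exact ERM over thresholds; the empirical risk only changes at the data points."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    # errs[k] = number of mistakes when the k smallest points are labeled -1 and the rest +1
    pos_below = np.concatenate(([0], np.cumsum(ys == 1)))
    neg_above = np.concatenate((np.cumsum((ys == -1)[::-1])[::-1], [0]))
    k = int(np.argmin(pos_below + neg_above))
    if k == 0:
        return -1.0                       # everything classified +1
    if k == len(xs):
        return 1.0                        # everything classified -1
    return 0.5 * (xs[k - 1] + xs[k])      # any threshold strictly between x_(k) and x_(k+1)

def excess_risk(theta):
    """E g_theta(Z) - (1 - p) under uniform x: (2p - 1) * |theta - theta_star| / 2."""
    return (2 * p - 1) * abs(theta - theta_star) / 2

for n in (100, 1000, 10000):
    gaps = [excess_risk(erm_threshold(*sample(n))) for _ in range(50)]
    bound = np.sqrt(2 * np.log(2 * n) / n) + 3 * np.sqrt(np.log(2 / delta) / (2 * n))
    print(f"n={n:6d}: average excess risk = {np.mean(gaps):.4f}   bound = {bound:.4f}")
```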