Lecture 15: Learning Theory: Concentration Inequalities


STAT 425: Introduction to Nonparametric Statistics, Winter 2018
Instructor: Yen-Chi Chen

15.1 Introduction

Recall that in the lecture on classification, we have seen that the empirical risk may not uniformly converge to the actual risk. We want a bound of the type

    sup_{c∈C} |R̂_n(c) − R(c)| ≤ (something that converges to 0 in probability)        (15.1)

so that we can use empirical risk minimization to train our classifier. How do we achieve this? In the normal means model, we have seen that if we have N IID normals with mean 0 and variance σ², then the maximal error is at rate O(σ√(2 log N)). This implies that if σ² = σ²_n → 0 fast enough, the maximal value of many normals still converges to 0. The empirical risk R̂_n(c) is a sample average of the risk evaluated at the observations. Thus, we do have

    R̂_n(c) − R(c) ≈ N(0, σ²_c / n)

for some σ²_c, and the variance σ²_c / n does converge to 0. Therefore, as long as we do not have too many classifiers being considered, i.e., the number of elements in C is not too large, we should be able to derive a uniform convergence. In what follows, we introduce a method called Hoeffding's inequality that allows us to derive equation (15.1) without using any normal distribution approximation.
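To get a numerical feel for the normal means heuristic above, here is a minimal simulation sketch (not part of the original notes; the sample sizes, number of repetitions, and seed are arbitrary choices) comparing the maximum of N IID N(0, σ²) draws with the σ√(2 log N) rate.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

# Average (over repetitions) of max_j |Z_j| for N IID N(0, sigma^2) variables,
# compared with the sigma*sqrt(2 log N) rate quoted above.
for N in [10, 100, 1000, 10000]:
    maxima = np.abs(rng.normal(0.0, sigma, size=(500, N))).max(axis=1)
    print(f"N={N:6d}  average max|Z_j| = {maxima.mean():.3f}   "
          f"sigma*sqrt(2 log N) = {sigma * np.sqrt(2 * np.log(N)):.3f}")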

15.2 Hoeffding's Inequality

For bounded IID random variables, Hoeffding's inequality is a powerful tool to bound the behavior of their average. We first start with a property of a bounded random variable.

Lemma 15.1 Let X be a random variable with E(X) = 0 and a ≤ X ≤ b. Then for any positive number t,

    E(e^{tX}) ≤ e^{t²(b−a)²/8}.

Proof: We will use the fact that e^{tx} is a convex function of x for every positive t. Recall that a function g is convex if for any two points a < b and any α ∈ [0, 1],

    g(αa + (1−α)b) ≤ αg(a) + (1−α)g(b).

Because X ∈ [a, b], we define α_X such that

    X = α_X b + (1 − α_X) a,

which implies

    α_X = (X − a)/(b − a).

Using the fact that e^{tx} is convex,

    e^{tX} ≤ α_X e^{tb} + (1 − α_X) e^{ta} = ((X − a)/(b − a)) e^{tb} + ((b − X)/(b − a)) e^{ta}.

Now taking the expectation on both sides,

    E(e^{tX}) ≤ ((E(X) − a)/(b − a)) e^{tb} + ((b − E(X))/(b − a)) e^{ta}
             = (b/(b − a)) e^{ta} − (a/(b − a)) e^{tb}
             = e^{g(s)},                                                        (15.2)

where s = t(b − a), g(s) = −γs + log(1 − γ + γe^s), and γ = −a/(b − a). Note that g(0) = g'(0) = 0 and g''(s) ≤ 1/4 for all positive s. Using Taylor's theorem,

    g(s) = g(0) + s g'(0) + (1/2) s² g''(s̃)

for some s̃ ∈ [0, s]. Thus, we conclude

    g(s) ≤ (1/2) s² · (1/4) = s²/8.

Then equation (15.2) implies

    E(e^{tX}) ≤ e^{g(s)} ≤ e^{s²/8} = e^{t²(b−a)²/8}.

With this lemma, we can prove Hoeffding's inequality.

Theorem 15.2 Let X_1, ..., X_n be IID with E(X_1) = µ and a ≤ X_1 ≤ b. Then

    P(|X̄_n − µ| ≥ ε) ≤ 2 e^{−2nε²/(b−a)²}.

Proof: We first prove that P(X̄_n − µ ≥ ε) ≤ e^{−2nε²/(b−a)²}. Let Y_i = X_i − µ. Because the exponential function is monotonic, for any positive t,

    P(X̄_n − µ ≥ ε) = P(Ȳ_n ≥ ε) = P((1/n) Σ_i Y_i ≥ ε)
                   = P(e^{(t/n) Σ_i Y_i} ≥ e^{tε})
                   ≤ e^{−tε} E(e^{(t/n) Σ_i Y_i})                       (by Markov's inequality)
                   = e^{−tε} E(e^{tY_1/n}) E(e^{tY_2/n}) ··· E(e^{tY_n/n})
                   = e^{−tε} (E(e^{tY_1/n}))^n
                   ≤ e^{−tε} e^{t²(b−a)²/(8n)}                          (by Lemma 15.1),

where the last step applies Lemma 15.1 to Y_1, which has mean 0 and lies in the interval [a−µ, b−µ] of length b−a.

Because the above inequality holds for all positive t, we can choose t to optimize the bound. To make the bound as sharp as possible, we need to find t such that −tε + t²(b−a)²/(8n) is minimized. Taking the derivative with respect to t and setting it to 0, we obtain

    t* = 4nε/(b−a)²

and

    −t*ε + t*²(b−a)²/(8n) = −2nε²/(b−a)².

Thus, the inequality becomes

    P(X̄_n − µ ≥ ε) ≤ e^{−t*ε} e^{t*²(b−a)²/(8n)} = e^{−2nε²/(b−a)²}.

The same proof also applies to P(X̄_n − µ ≤ −ε) and yields the same bound. Therefore, we conclude that

    P(|X̄_n − µ| ≥ ε) ≤ 2 e^{−2nε²/(b−a)²}.

An inequality like the one in Theorem 15.2 is called a concentration inequality, and the phenomenon described by this type of inequality is called concentration of measure: indeed, the probability measure of X̄_n is concentrating around µ, its expectation. Concentration inequalities are common theoretical tools in analyzing the properties of a statistic, and they are particularly powerful when applied to estimators and classifiers.
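As a quick sanity check of Theorem 15.2, here is a minimal simulation sketch (not from the notes; the Bernoulli(1/2) choice and the values of n, ε, and the seed are arbitrary) comparing the empirical tail probability with the bound 2e^{−2nε²/(b−a)²} for a = 0, b = 1.

import numpy as np

rng = np.random.default_rng(1)
n, eps, reps = 200, 0.1, 20_000
mu = 0.5                       # Bernoulli(1/2) observations, so a = 0 and b = 1

# Empirical estimate of P(|Xbar_n - mu| >= eps) over many repetitions.
X = rng.binomial(1, mu, size=(reps, n))
deviations = np.abs(X.mean(axis=1) - mu)
empirical_tail = (deviations >= eps).mean()

hoeffding_bound = 2 * np.exp(-2 * n * eps**2 / (1 - 0)**2)
print(f"empirical tail  : {empirical_tail:.4f}")
print(f"Hoeffding bound : {hoeffding_bound:.4f}")   # the bound should be the larger one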

15.2.1 Application in Classification

Hoeffding's inequality has a direct application to the classification problem. Remember that in the binary classification case with 0–1 loss, the loss at each observation is L(c(X_i), Y_i) = I(c(X_i) ≠ Y_i), a bounded random variable with a = 0 and b = 1! The empirical risk

    R̂_n(c) = (1/n) Σ_{i=1}^n L(c(X_i), Y_i)

is an unbiased estimator of E(R̂_n(c)) = R(c). Thus, Hoeffding's inequality implies that for a given classifier c,

    P(|R̂_n(c) − R(c)| ≥ ε) ≤ 2 e^{−2nε²}.                                      (15.3)

A powerful feature of Hoeffding's inequality is that it holds regardless of the classifier. Namely, even if we are considering many different types of classifiers (some are decision trees, some are kNN, some are logistic regressions), they all satisfy equation (15.3). Assume that the collection of classifiers C contains only N classifiers c_1, ..., c_N. Then we have

    P(sup_{c∈C} |R̂_n(c) − R(c)| ≥ ε) = P(max_{j=1,...,N} |R̂_n(c_j) − R(c_j)| ≥ ε)
                                      ≤ Σ_{j=1}^N P(|R̂_n(c_j) − R(c_j)| ≥ ε)
                                      ≤ N · 2 e^{−2nε²}.

Thus, as long as N · 2 e^{−2nε²} → 0, we have

    P(sup_{c∈C} |R̂_n(c) − R(c)| ≥ ε) → 0,   i.e.,   sup_{c∈C} |R̂_n(c) − R(c)| →_P 0,

which is the desired property we want in equation (15.1).
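To illustrate the union bound over a finite collection, here is a small simulation sketch (my own illustrative construction, not from the notes): a dictionary of N = 51 threshold classifiers c_t(x) = I(x > t) applied to Uniform(0,1) features with labels Y = I(X > 0.5) flipped with probability 0.2, for which the true risk R(c_t) = 0.2 + 0.6|t − 0.5| is available in closed form.

import numpy as np

rng = np.random.default_rng(2)
n, reps, eps, noise = 500, 2000, 0.08, 0.2
thresholds = np.linspace(0.0, 1.0, 51)        # finite dictionary of N classifiers
N = len(thresholds)

# True risk of c_t(x) = I(x > t) when Y = I(X > 0.5) with labels flipped w.p. `noise`.
true_risk = noise + (1 - 2 * noise) * np.abs(thresholds - 0.5)

exceed = 0
for _ in range(reps):
    X = rng.uniform(0, 1, n)
    Y = (X > 0.5).astype(int) ^ (rng.uniform(size=n) < noise).astype(int)
    pred = (X[:, None] > thresholds[None, :]).astype(int)     # n-by-N predictions
    emp_risk = (pred != Y[:, None]).mean(axis=0)              # empirical risk of each classifier
    exceed += np.max(np.abs(emp_risk - true_risk)) >= eps

print(f"P(sup_j |R_hat(c_j) - R(c_j)| >= eps) ~= {exceed / reps:.4f}")
print(f"union bound N*2*exp(-2*n*eps^2)       = {N * 2 * np.exp(-2 * n * eps**2):.4f}")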

15.2.2 Convergence Rate

Hoeffding's inequality directly implies that |X̄_n − µ| = O_P(1/√n). It actually implies a bound on the expectation E|X̄_n − µ| as well. To see this, recall that for a positive random variable T,

    E(T) = ∫_0^∞ P(T > t) dt.

Thus, for any positive s,

    E|X̄_n − µ|² = ∫_0^∞ P(|X̄_n − µ|² > t) dt
                = ∫_0^∞ P(|X̄_n − µ| > √t) dt
                = ∫_0^s P(|X̄_n − µ| > √t) dt + ∫_s^∞ P(|X̄_n − µ| > √t) dt
                ≤ ∫_0^s 1 dt + ∫_s^∞ 2 e^{−2nt/(b−a)²} dt          (Hoeffding on the second term)
                = s + ((b−a)²/n) e^{−2ns/(b−a)²}.

It is not easy to optimize this exactly, but there are simple choices of s that give a good result. We will choose s = ((b−a)²/(2n)) log(n/(b−a)), which leads to

    E|X̄_n − µ|² ≤ ((b−a)²/(2n)) log(n/(b−a)) + (b−a)³/n² = O(log n / n).

Finally, using the Cauchy–Schwarz inequality,

    E|X̄_n − µ| ≤ √(E|X̄_n − µ|²) = O(√(log n / n)),                              (15.4)

which is the desired bound on the expectation.

In the case of classification, when the collection of classifiers C contains N classifiers, we have

    P(sup_{c∈C} |R̂_n(c) − R(c)| ≥ ε) ≤ N · 2 e^{−2nε²}.

This implies

    sup_{c∈C} |R̂_n(c) − R(c)| = O_P(√(log N / n)).

Moreover, using a calculation similar to the one for deriving equation (15.4), one can show that

    E(sup_{c∈C} |R̂_n(c) − R(c)|) = O(√(log N / n)).
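The following minimal sketch (not from the notes; Uniform(0,1) data, so a = 0 and b = 1, and the sample sizes are arbitrary) estimates E|X̄_n − µ| by Monte Carlo and compares it with the √(log n / n) envelope from the argument above; the true decay is of order 1/√n, so the envelope is slightly conservative.

import numpy as np

rng = np.random.default_rng(3)
reps = 2000

# Monte Carlo estimate of E|Xbar_n - mu| for Uniform(0,1) samples (mu = 1/2),
# compared with the sqrt(log n / n) envelope obtained around equation (15.4).
for n in [50, 200, 800, 3200]:
    X = rng.uniform(0, 1, size=(reps, n))
    mean_abs_dev = np.abs(X.mean(axis=1) - 0.5).mean()
    print(f"n={n:5d}  E|Xbar_n - mu| ~= {mean_abs_dev:.4f}   "
          f"sqrt(log n / n) = {np.sqrt(np.log(n) / n):.4f}")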

15.2.3 Application in Histogram

We have derived the convergence rate of a histogram density estimator at a given point. However, we have not yet analyzed the uniform convergence rate of the histogram. Using Hoeffding's inequality, we are able to obtain such a uniform convergence rate.

Assume that X_1, ..., X_n ~ F with a PDF p that has a non-zero density over [0, 1]. Suppose we use a histogram with M bins:

    B_1 = [0, 1/M), B_2 = [1/M, 2/M), ..., B_M = [(M−1)/M, 1].

Let p̂_n be the histogram density estimator:

    p̂_n(x) = (M/n) Σ_{i=1}^n I(X_i ∈ B(x)),

where B(x) is the bin that x belongs to. The goal is to bound sup_x |p̂_n(x) − p(x)|. We know that the difference p̂_n(x) − p(x) can be written as

    p̂_n(x) − p(x) = [p̂_n(x) − E(p̂_n(x))] + [E(p̂_n(x)) − p(x)],

where the first term is the stochastic variation and the second is the bias. The bias analysis we have done in Lecture 6 can be generalized to every point x, so we have

    sup_x |E(p̂_n(x)) − p(x)| = O(1/M).

So we only need to bound the stochastic variation part sup_x |p̂_n(x) − E(p̂_n(x))|. Although we are taking the supremum over every x, there are only M bins B_1, ..., B_M, so we can rewrite the stochastic part as

    sup_x |p̂_n(x) − E(p̂_n(x))| = max_{j=1,...,M} M |(1/n) Σ_{i=1}^n I(X_i ∈ B_j) − P(X_1 ∈ B_j)|.

Because the indicator function takes only two values, 0 and 1, we have

    P(sup_x |p̂_n(x) − E(p̂_n(x))| > ε)
        = P(max_{j=1,...,M} |(1/n) Σ_{i=1}^n I(X_i ∈ B_j) − P(X_1 ∈ B_j)| > ε/M)
        ≤ M · 2 e^{−2n(ε/M)²} = 2M e^{−2nε²/M²}.

Thus,

    sup_x |p̂_n(x) − E(p̂_n(x))| = O_P(√(M² log M / n)).

This, together with the uniform bias, implies

    sup_x |p̂_n(x) − p(x)| ≤ sup_x |E(p̂_n(x)) − p(x)| + sup_x |p̂_n(x) − E(p̂_n(x))|
                          = O(1/M) + O_P(√(M² log M / n)).

Note that this bound is not the tightest bound we can obtain. Using Bernstein's inequality¹, you can obtain a faster convergence rate:

    sup_x |p̂_n(x) − p(x)| = O(1/M) + O_P(√(M log M / n)).

¹ https://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)
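Below is a rough simulation sketch of this trade-off in M (entirely my own illustration; the density p(x) = 0.5 + x on [0, 1], the sample size, and the evaluation grid are arbitrary choices). The printed rate terms are the bias rate 1/M and the Bernstein-improved variance rate √(M log M / n), without constants.

import numpy as np

rng = np.random.default_rng(4)

def sample(n):
    # draw from the density p(x) = 0.5 + x on [0, 1] via the inverse CDF
    u = rng.uniform(0, 1, n)
    return (-1 + np.sqrt(1 + 8 * u)) / 2

def sup_error(n, M, grid):
    X = sample(n)
    counts = np.histogram(X, bins=M, range=(0.0, 1.0))[0]
    p_hat = counts * M / n                         # histogram density estimator
    bin_idx = np.minimum((grid * M).astype(int), M - 1)
    return np.max(np.abs(p_hat[bin_idx] - (0.5 + grid)))

grid = np.linspace(0, 1, 2001)
n = 5000
for M in [5, 20, 80, 320]:
    errs = [sup_error(n, M, grid) for _ in range(50)]
    print(f"M={M:4d}  average sup|p_hat - p| ~= {np.mean(errs):.3f}   "
          f"1/M = {1/M:.3f}   sqrt(M log M / n) = {np.sqrt(M * np.log(M) / n):.3f}")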

15.3 VC Theory

Hoeffding's inequality has helped us to partially solve the problem of proving equation (15.1) when the number of classifiers is finite. However, in many cases we are working with an infinite number of classifiers, so Hoeffding's inequality does not work. For instance, in logistic regression, the collection of all possible classifiers is

    C_logistic = {c_{β0,β} : β0 ∈ R, β ∈ R^d},

where

    c_{β0,β}(x) = I( e^{β0 + β^T x} / (1 + e^{β0 + β^T x}) > 1/2 ).

The parameter indices β0 ∈ R, β ∈ R^d both contain an infinite number of points (the real line contains an infinite number of points). Thus, Hoeffding's inequality will not work in this case.

To overcome this problem, we first think about the performance of the classifier c_{β0,β} for different values of (β0, β). For simplicity, we assume that the classifiers c_θ are indexed by a quantity θ ∈ R^p, i.e., there are p parameters controlling each classifier, and let

    C_Θ = {c_θ : θ ∈ R^p}.

In the case of logistic regression, p = d + 1 and θ = (β0, β). Consider two classifiers c_{θ1} and c_{θ2}. When θ1 and θ2 are close, we expect their performance to be similar. Just as in logistic regression, when two classifiers have similar parameters, their decision boundaries (the boundary between the two class labels) should be similar. Namely, the performance of many classifiers may be similar to each other, so when using the concentration inequality, we do not need to treat all of them as independent classifiers. Thus, although there is an infinite number of classifiers in C_Θ, there might only be a few effective independent classifiers. Instead of using the total number of classifiers in constructing a uniform bound, we will use

    P(sup_{c∈C_Θ} |R̂_n(c) − R(c)| ≥ ε) ≤ N_eff · 2 e^{−2nε²},                     (15.5)

where N_eff is some measure of the number of effective independent classifiers. In a sense, N_eff measures the complexity of C_Θ. There are many ways of measuring the number of effective independent classifiers, such as the covering number and the bracketing number; here we will introduce a simple and famous quantity called the VC dimension.
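The idea that infinitely many parameter values yield only a limited number of effectively different classifiers can be seen on a fixed sample: the logistic classifier above predicts 1 exactly when β0 + β^T x > 0, so distinct behaviors correspond to distinct label vectors on the data. The sketch below (my own illustration; the dimensions, sample size, and random parameter draws are arbitrary) counts how many distinct label vectors a large number of random (β0, β) values actually produce.

import numpy as np

rng = np.random.default_rng(5)
n, d, num_draws = 20, 2, 100_000

X = rng.normal(size=(n, d))                     # a fixed sample of n points in R^d
beta0 = rng.normal(size=num_draws)
beta = rng.normal(size=(num_draws, d))

# Label vector produced by each classifier c_{beta0,beta}(x) = I(beta0 + beta^T x > 0);
# many different parameter values give exactly the same labeling of the sample.
labels = beta0[:, None] + beta @ X.T > 0
num_distinct = np.unique(labels, axis=0).shape[0]

print(f"parameter draws                      : {num_draws}")
print(f"distinct label vectors on the sample : {num_distinct}")
print(f"number of all possible labelings 2^n : {2**n}")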

15.3.1 Shattering Number and VC Dimension

In the 0–1 loss case, given a classifier c, the empirical risk and the actual risk are

    R̂_n(c) = (1/n) Σ_{i=1}^n I(Y_i ≠ c(X_i)),    R(c) = E(I(Y ≠ c(X))) = P(Y ≠ c(X)).

If we define a new random variable Z_i = (X_i, Y_i), then the event {Y_i ≠ c(X_i)} = {Z_i ∈ A_c} for some region A_c that is determined by the classifier c. Thus, we can rewrite the difference between the empirical risk and the actual risk as

    R̂_n(c) − R(c) = (1/n) Σ_{i=1}^n I(Y_i ≠ c(X_i)) − P(Y ≠ c(X))
                  = (1/n) Σ_{i=1}^n I(Z_i ∈ A_c) − P(Z ∈ A_c).

The uniform bound in equation (15.1) can then be rewritten as

    sup_{c∈C} |R̂_n(c) − R(c)| = sup_{c∈C} |(1/n) Σ_{i=1}^n I(Z_i ∈ A_c) − P(Z ∈ A_c)|
                              = sup_{A∈𝒜} |(1/n) Σ_{i=1}^n I(Z_i ∈ A) − P(Z ∈ A)|,

where 𝒜 = {A_c : c ∈ C} is a collection of regions. Thus, the effective number N_eff should be related to how complex the collection 𝒜 is. If 𝒜 contains many very different regions, then N_eff is expected to be large.

Note that the quantity (1/n) Σ_{i=1}^n I(Z_i ∈ A) is just the proportion of our data that falls within the region A_c, whereas the quantity P(Z ∈ A) is the probability of observing a random quantity Z within the region A_c. Thus, eventually we are trying to bound the difference between the sample proportion, used as an estimator, and the probability of falling into a given region, which is the classical problem in STAT 101. However, the challenge here is that we are considering many, many, many regions, so the problem is much more complicated.

The VC theory shows the following bound:

    P( sup_{A∈𝒜} |(1/n) Σ_{i=1}^n I(Z_i ∈ A) − P(Z ∈ A)| > ε ) ≤ 8 (n+1)^{ν(𝒜)} e^{−nε²/32}.        (15.6)

Namely, 8(n+1)^{ν(𝒜)} = N_eff is the effective number we want, and the quantity ν(𝒜) is called the VC dimension of 𝒜.²

To define the VC dimension, we first introduce the concept of shattering. Shattering is a concept about using sets in a collection of regions R to partition a set of points. Here are some simple examples of R:

1. R_1 = {(−∞, t] : t ∈ R}. This is related to the uniform bound of the EDF.
2. R_2 = {[a, b] : a ≤ b}.
3. R_3 = {[x_0 − r, x_0 + r] : r ≥ 0} for some fixed x_0.
4. R_4 = {[x_0 − r, x_0 + r] : r ≥ 0, x_0 ∈ R}. Note that the difference between R_3 and R_4 is that this collection is much larger than the previous one, because x_0 in the previous example is fixed whereas here x_0 is allowed to change.

The above examples are cases where R contains subregions in 1D, but the idea can be generalized to other dimensions. Here are some examples of d-dimensional regions:

1. R_5 = {(−∞, t] × (−∞, s] : t, s ∈ R}. In this case, d = 2 and this is related to the performance of a 2D EDF.
2. R_6 = {[a_1, b_1] × [a_2, b_2] × ··· × [a_d, b_d] : a_j ≤ b_j, j = 1, ..., d}. In this case the dimension is d.
3. R_7 = {B(x_0, r) : r ≥ 0}, where B(x_0, r) = {x : ‖x − x_0‖ ≤ r} is a d-dimensional ball with radius r centered at a fixed x_0.
4. R_8 = {B_{β,c} : β ∈ R^d, c ∈ R}, where B_{β,c} = {x : β^T x + c ≥ 0} is a half space in d dimensions. This is related to the linear classifier and logistic regression.

Let F = {x_1, ..., x_n} ⊂ R^d be a set of n points. For a subset G ⊂ F, we say R picks out G if there exists R ∈ R such that G = R ∩ F. Namely, there exists a subregion in R such that G is the intersection of F and this subregion.

Example 1. Consider the case d = 1 and R = R_1 and let F = {1, 5, 8}. The subset G = {1, 5} can be picked out by R_1 because we can choose (−∞, 6] ∈ R_1 such that (−∞, 6] ∩ F = G. Note that the choice may not be unique; (−∞, 7] ∈ R_1 also picks out G. However, the subset G = {5, 8} cannot be picked out by any element of R_1.

Let S(R, F) be the number of subsets of F that can be picked out by R. By definition, if F contains n elements, then S(R, F) ≤ 2^n because there are at most 2^n distinct subsets of n elements, including the empty set.

² Note that the quantity in the exponential part becomes ε²/32 rather than 2ε². The rate in ε² is the same; only the constants are different. This change is due to the fact that the proof uses not just Hoeffding's inequality but also another technique called symmetrization, so the constants change.

Example 2. In the setting of Example 1, S(R_1, F) = 4 because R_1 can only pick out the following subsets: ∅, {1}, {1, 5}, {1, 5, 8}. However, if we consider R_2, the quantity becomes S(R_2, F) = 7. The only subset that R_2 cannot pick out is {1, 8}.

Example 3. Now assume that we are working with F = {1, 5}. Then S(R_1, F) = 3 because we can pick out ∅, {1}, {1, 5}, and S(R_2, F) = 4 = 2² because every subset of F can be picked out.

F is shattered by R if S(R, F) = 2^n, where n is the number of elements in F. Namely, F is shattered by R if every subset of F can be picked out by R. In Examples 2 and 3, we see that F = {1, 5} is shattered by R_2 but F = {1, 5, 8} is not.

The VC dimension of R, denoted by ν(R), is the maximal number of distinct points that can be shattered by R. Namely, if R has VC dimension ν, then we will be able to find a set of ν points such that R can pick out every subset of this set.

With the VC dimension, we can then use equation (15.6) to obtain equation (15.1). Here are the VC dimensions corresponding to the regions we have mentioned:

1. R_1 = {(−∞, t] : t ∈ R}  ⟹  ν(R_1) = 1.
2. R_2 = {[a, b] : a ≤ b}  ⟹  ν(R_2) = 2.
3. R_3 = {[x_0 − r, x_0 + r] : r ≥ 0} for some fixed x_0  ⟹  ν(R_3) = 1.
4. R_4 = {[x_0 − r, x_0 + r] : r ≥ 0, x_0 ∈ R}  ⟹  ν(R_4) = 2.
5. R_5 = {(−∞, t] × (−∞, s] : t, s ∈ R}  ⟹  ν(R_5) = 2.
6. R_6 = {[a_1, b_1] × [a_2, b_2] × ··· × [a_d, b_d] : a_j ≤ b_j, j = 1, ..., d}  ⟹  ν(R_6) = 2d.
7. R_7 = {B(x_0, r) : r ≥ 0}, where B(x_0, r) = {x : ‖x − x_0‖ ≤ r}  ⟹  ν(R_7) = 1.
8. R_8 = {B_{β,c} : β ∈ R^d, c ∈ R}  ⟹  ν(R_8) = d + 1.
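A brute-force check of these small examples is easy to code. The sketch below (my own illustration, not from the notes) enumerates the subsets of a finite point set picked out by R_1 (left rays) and by R_2 (intervals) and reproduces the counts in Examples 2 and 3.

def picked_out_by_rays(F):
    # subsets of F picked out by R_1 = {(-inf, t]}: the k smallest points, k = 0, ..., |F|
    pts = sorted(F)
    subsets = {frozenset()}
    for k in range(1, len(pts) + 1):
        subsets.add(frozenset(pts[:k]))
    return subsets

def picked_out_by_intervals(F):
    # subsets of F picked out by R_2 = {[a, b]}: contiguous runs of the sorted points
    pts = sorted(F)
    subsets = {frozenset()}
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))
    return subsets

print("S(R_1, {1,5,8}) =", len(picked_out_by_rays([1, 5, 8])))        # 4, as in Example 2
print("S(R_2, {1,5,8}) =", len(picked_out_by_intervals([1, 5, 8])))   # 7; only {1, 8} is missed
print("S(R_1, {1,5})   =", len(picked_out_by_rays([1, 5])))           # 3, as in Example 3
print("S(R_2, {1,5})   =", len(picked_out_by_intervals([1, 5])))      # 4 = 2^2, so {1, 5} is shattered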

Example: uniform convergence of the EDF. Let F̂_n be the EDF based on an IID random sample from a CDF F. Using VC theory, we can prove that

    P( sup_x |F̂_n(x) − F(x)| > ε ) = P( sup_{A∈R_1} |(1/n) Σ_{i=1}^n I(X_i ∈ A) − P(X ∈ A)| > ε )
                                    ≤ 8 (n+1)^{ν(R_1)} e^{−nε²/32} = 8 (n+1) e^{−nε²/32},

which converges to 0 as n → ∞.³ Moreover, using the method in Section 15.2.2, we have

    sup_x |F̂_n(x) − F(x)| = O_P(√(log n / n))

and

    E( sup_x |F̂_n(x) − F(x)| ) = O(√(log n / n)).

³ Note that there is a better bound for the EDF called the DKW (Dvoretzky–Kiefer–Wolfowitz) inequality:

    P( sup_x |F̂_n(x) − F(x)| > ε ) ≤ 2 e^{−2nε²}.
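As a final illustration (my own sketch, not from the notes; the uniform distribution, ε, and sample sizes are arbitrary choices), the following simulation estimates P(sup_x |F̂_n(x) − F(x)| > ε) and compares it with both the DKW bound and the much looser VC bound (15.6).

import numpy as np

rng = np.random.default_rng(7)
reps, eps = 5000, 0.05

# sup_x |F_hat_n(x) - F(x)| for Uniform(0,1) samples (so F(x) = x on [0,1]);
# the supremum is attained at the sample points, so it can be computed exactly there.
for n in [100, 400, 1600]:
    sups = np.empty(reps)
    for r in range(reps):
        X = np.sort(rng.uniform(0, 1, n))
        i = np.arange(1, n + 1)
        sups[r] = np.max(np.maximum(i / n - X, X - (i - 1) / n))
    tail = (sups > eps).mean()
    dkw = 2 * np.exp(-2 * n * eps**2)
    vc = 8 * (n + 1) * np.exp(-n * eps**2 / 32)
    print(f"n={n:5d}  P(sup|F_hat - F| > {eps}) ~= {tail:.4f}   "
          f"DKW bound = {dkw:.4f}   VC bound = {vc:.1f}")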