ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

R. Nowak, 5/7/2009

In the previous lectures we made use of the Chernoff/Hoeffding bounds in our analysis of classifier errors. Hoeffding's inequality states that for a sum of independent random variables $0 \le L_i \le 1$, $i = 1, \ldots, n$,

$$P\left( \left| \frac{1}{n}\sum_{i=1}^n E[L_i] - \frac{1}{n}\sum_{i=1}^n L_i \right| > \epsilon \right) \le 2 e^{-2 n \epsilon^2}.$$

If $L_i = \ell(f(X_i), Y_i)$, the loss of $f$ in the prediction of $Y_i$ from $X_i$, then we have

$$P\left( |R(f) - \hat{R}_n(f)| > \epsilon \right) \le 2 e^{-2 n \epsilon^2}.$$

When considering a countable collection $\mathcal{F}$ of candidate predictors, and penalties $c(f)$ assigned to each $f \in \mathcal{F}$ that satisfy the summability condition $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$, we showed that

$$E[R(\hat{f}_n)] - R^* \le \inf_{f \in \mathcal{F}} \left\{ R(f) - R^* + \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$

Consider the two terms in this upper bound: $R(f) - R^*$ is a bound on the approximation error of a model $f$, and the remainder is a bound on the estimation error associated with $f$. Thus, we see that complexity regularization automatically optimizes a balance between approximation and estimation errors. Note that the bound is valid for any bounded loss function.

The above upper bound is at least of order $1/\sqrt{n}$. This is the best one can expect, in general, when considering the 0/1 or $\ell_1$ (absolute error) loss functions, but in regression we are often interested in the squared error, or $\ell_2$, loss (corresponding to the mean square error risk). The squared error typically decays faster than the 0/1 or absolute error (since squaring small numbers makes them smaller yet). Unfortunately, the Chernoff/Hoeffding bounds are not capable of handling such cases, and more sophisticated techniques are required. Before delving into those methods, consider the following simple example.

Example 1. To illustrate the distinction between classification and regression, consider a simple, scalar signal-plus-noise problem. Let $Y_i = \theta + W_i$, $i = 1, \ldots, n$, where $\theta$ is a fixed unknown scalar parameter and the $W_i$ are independent, zero-mean, unit-variance random variables. Let $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n Y_i$. Then we have

$$E[|\hat{\theta} - \theta|^2] = E\left[ \left( \frac{1}{n}\sum_{i=1}^n W_i \right)^2 \right] = \frac{1}{n^2}\sum_{i=1}^n E[W_i^2] = \frac{1}{n}.$$

Thus, the mean square error decays like $1/n$, notably faster than $1/\sqrt{n}$. The convergence rate $1/n$ is called the parametric rate for regression, since it is the rate at which the MSE decays in simple parametric inference problems.
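
The $1/n$ rate in Example 1 is easy to check numerically. Below is a minimal Monte Carlo sketch (not part of the original notes); the Gaussian choice for the noise $W_i$, the value of $\theta$, and the number of trials are arbitrary illustrative assumptions.

```python
# Minimal Monte Carlo sketch (not from the notes) checking that the MSE of the
# sample mean in Example 1 decays like 1/n.  The Gaussian noise and the value
# of theta are arbitrary illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.3         # fixed "unknown" parameter, chosen arbitrarily
trials = 20000      # Monte Carlo repetitions per sample size

for n in (10, 100, 1000):
    W = rng.standard_normal((trials, n))      # zero-mean, unit-variance noise
    theta_hat = theta + W.mean(axis=1)        # theta_hat = (1/n) sum_i Y_i
    mse = np.mean((theta_hat - theta) ** 2)   # estimate of E[(theta_hat - theta)^2]
    print(f"n={n:5d}   MSE ~ {mse:.5f}   1/n = {1/n:.5f}")
```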

A similar conclusion can be arrived at through a large deviation analysis. According to the Central Limit Theorem,

$$\sqrt{n}\,(\hat{\theta} - \theta) \stackrel{dist}{\longrightarrow} N(0, 1), \quad \text{as } n \to \infty.$$

A simple tail bound on the Gaussian distribution gives us

$$P\left( \sqrt{n}\,|\hat{\theta} - \theta| > t \right) \le 2 e^{-t^2/2},$$

for $n$ large, which implies that

$$P\left( (\hat{\theta} - \theta)^2 > \epsilon \right) \le 2 e^{-n\epsilon/2}.$$

This is a bound on the deviations of the squared error $(\hat{\theta} - \theta)^2$. The squared error concentration inequality implies that $E[|\hat{\theta} - \theta|^2] = O(1/n)$ (just write $E[(\hat{\theta} - \theta)^2] = \int_0^\infty P((\hat{\theta} - \theta)^2 > t)\, dt$). Note that the main difference between Hoeffding's inequality and the above concentration bound is the dependence on $\epsilon$: linear in the latter and quadratic in the former, and therefore the former is much weaker for small $\epsilon$.

1 Risk Bounds for Squared Error Loss

Let $\mathcal{X}$ be the feature space (e.g., $\mathcal{X} = \mathbb{R}^d$) and $\mathcal{Y} = [-b/2, b/2]$, where $b > 0$ is known; in other words, assume the label space is bounded. Consider the squared error loss $\ell(y, y') = (y - y')^2$. Take $\mathcal{F}$ such that each $f \in \mathcal{F}$ is a map $f : \mathcal{X} \to \mathcal{Y}$. We have training data $\{X_i, Y_i\}_{i=1}^n$ i.i.d. according to $P_{XY}$. The empirical risk function is

$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n (f(X_i) - Y_i)^2,$$

simply the average of the squared prediction errors. The risk is therefore the MSE, $R(f) = E[(f(X) - Y)^2]$. We know that the function $f^*$ that minimizes the MSE is just the conditional expectation of $Y$ given $X$ (also known as the regression function): $f^*(x) = E[Y \mid X = x]$. Now let $R^* = R(f^*)$.

We want to select an $\hat{f}_n \in \mathcal{F}$, using the training data $\{X_i, Y_i\}_{i=1}^n$, such that the excess risk $E[R(\hat{f}_n)] - R^* \ge 0$ is small. As we did in lecture 9, we will take advantage of the fact that $\hat{R}_n(f)$ concentrates (in probability) around $R(f)$, but to take advantage of the particular aspects of the squared error loss it is convenient to look at relative versions of the risk, namely the excess risk and its empirical counterpart:

$$\mathcal{E}(f) := R(f) - R(f^*), \qquad \hat{\mathcal{E}}_n(f) := \hat{R}_n(f) - \hat{R}_n(f^*).$$

The first thing to note is that, as shown in an earlier lecture,

$$\mathcal{E}(f) = E[(f(X) - f^*(X))^2].$$
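
To make these quantities concrete, here is a small numerical sketch (not from the notes). The uniform design, the regression function $f^*(x) = x - 1/2$, the bounded noise, the value $b = 2$, and the candidate $f$ are all illustrative assumptions; the code computes $\hat{R}_n(f)$ and $\hat{R}_n(f^*)$, hence $\hat{\mathcal{E}}_n(f)$, and compares them with a Monte Carlo estimate of $\mathcal{E}(f) = E[(f(X) - f^*(X))^2]$.

```python
# Small numerical sketch (not from the notes) of the quantities just defined.
# Assumptions (all illustrative): X ~ Uniform[0,1], regression function
# f*(x) = x - 1/2, bounded noise so that Y lies in [-b/2, b/2] with b = 2, and
# one fixed candidate predictor f.
import numpy as np

rng = np.random.default_rng(1)
b = 2.0                              # label space Y = [-1, 1]
f_star = lambda x: x - 0.5           # regression function (known here by construction)
f = lambda x: 0.8 * (x - 0.5)        # a fixed candidate predictor mapping into Y

for n in (100, 10_000, 1_000_000):
    X = rng.uniform(0.0, 1.0, n)
    Y = f_star(X) + rng.uniform(-0.5, 0.5, n)       # bounded noise keeps Y in [-1, 1]
    emp_risk_f = np.mean((f(X) - Y) ** 2)           # empirical risk of f
    emp_risk_fstar = np.mean((f_star(X) - Y) ** 2)  # empirical risk of f*
    excess_hat = emp_risk_f - emp_risk_fstar        # empirical excess risk
    excess = np.mean((f(X) - f_star(X)) ** 2)       # ~ E[(f(X) - f*(X))^2]
    print(f"n={n:8d}   empirical excess = {excess_hat:+.5f}   excess ~ {excess:.5f}")
```

As $n$ grows, the empirical excess risk concentrates around the excess risk, which is exactly the behavior quantified in the remainder of the lecture.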

Furthermore, note that $E[\hat{\mathcal{E}}_n(f)] = \mathcal{E}(f)$, and that $\hat{\mathcal{E}}_n(f)$ is an average of independent random variables:

$$\hat{\mathcal{E}}_n(f) = -\frac{1}{n}\sum_{i=1}^n U_i, \quad \text{where } U_i = -(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2.$$

Therefore,

$$\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) = \frac{1}{n}\sum_{i=1}^n \left( U_i - E[U_i] \right).$$

Clearly the strong law of large numbers tells us that, for a fixed prediction rule $f$, $\hat{\mathcal{E}}_n(f) \to \mathcal{E}(f)$ as $n \to \infty$. All we need to determine is the speed of this convergence. We will derive a bound for the difference $[R(f) - R(f^*)] - [\hat{R}_n(f) - \hat{R}_n(f^*)] = \frac{1}{n}\sum_{i=1}^n (U_i - E[U_i])$. The following derivation is due to Andrew Barron [A. R. Barron, "Complexity regularization with application to artificial neural networks," in Nonparametric Functional Estimation and Related Topics, Kluwer Academic Publishers, 1991, pp. 561-576].

We are looking for a bound of the form $P(\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) > \epsilon) < \delta$. If the variables $U_i$ are independent and bounded, then we can apply Hoeffding's inequality. However, a more useful bound for our regression problem can be derived if the variables $U_i$ satisfy the following moment condition:

$$E\left[ |U_i - E[U_i]|^k \right] \le \frac{\mathrm{var}(U_i)}{2}\, k!\, h^{k-2} \qquad (1)$$

for some $h > 0$ and all $k \ge 2$. The moment condition can be difficult to verify in general, but it does hold, for example, for bounded random variables. In that case the Craig-Bernstein (CB) inequality (Craig, 1933) states that, for independent r.v.'s $U_i$ satisfying (1),

$$P\left( \frac{1}{n}\sum_{i=1}^n (U_i - E[U_i]) \ge \frac{t}{n\epsilon} + \frac{\epsilon\, n\, \mathrm{var}\!\left(\frac{1}{n}\sum_{i=1}^n U_i\right)}{2(1-c)} \right) \le e^{-t},$$

for $0 < \epsilon h \le c < 1$ and $t > 0$. This shows that the tail decays exponentially in $t$, rather than exponentially in $t^2$. Recall Hoeffding's inequality (for variables $Z_i$ taking values in $[0,1]$):

$$P\left( \frac{1}{n}\sum_{i=1}^n (Z_i - E[Z_i]) \ge t \right) \le e^{-2 n t^2}.$$

If $t \le 1/2$, then $2t^2 \le t$, which implies $e^{-2nt^2} \ge e^{-nt}$; that is, for small deviations Hoeffding's bound decays much more slowly. This indicates that the CB inequality may be much tighter than Hoeffding's when the variance term $\frac{\epsilon\, n\, \mathrm{var}(\frac{1}{n}\sum_i U_i)}{2(1-c)}$ is small.
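
To see the difference quantitatively, the short sketch below (not from the notes) compares, at a matched confidence level $1-\delta$, the deviation bound implied by Hoeffding's inequality with the deviation bound $\frac{\log(1/\delta)}{n\epsilon} + \frac{\epsilon\,\mathrm{var}(U_1)}{2(1-c)}$ read off from the CB inequality (for i.i.d. $U_i$, $n\,\mathrm{var}(\frac{1}{n}\sum_i U_i) = \mathrm{var}(U_1)$). Since $Y$, $f(X)$, and $f^*(X)$ all lie in $[-b/2, b/2]$, each $U_i$ takes values in $[-b^2, b^2]$, an interval of length $2b^2$, which is what the Hoeffding bound uses. The values of $n$, $b$, $\epsilon$, and the (deliberately small) variance are illustrative assumptions.

```python
# Hedged comparison sketch (not from the notes): deviation bounds at matched
# confidence 1 - delta from Hoeffding's inequality vs. the Craig-Bernstein
# inequality as stated above.  All parameter values are illustrative.
import numpy as np

n = 10000
b = 1.0                   # labels in [-b/2, b/2]; then U_i takes values in [-b^2, b^2]
sigma2 = 0.01             # assumed (small) variance of U_i
h = 2 * b**2 / 3          # moment-condition constant (Proposition 1 below)
eps = 0.3                 # illustrative; with c = eps*h we have 0 < eps*h <= c < 1
c = eps * h

for delta in (1e-2, 1e-4, 1e-6):
    t = np.log(1 / delta)
    # Hoeffding for variables ranging over an interval of length 2*b^2:
    # deviation <= 2*b^2 * sqrt(log(1/delta) / (2n))
    dev_hoeffding = 2 * b**2 * np.sqrt(t / (2 * n))
    # Craig-Bernstein: deviation <= t/(n*eps) + eps*var(U_1)/(2*(1-c))
    dev_cb = t / (n * eps) + eps * sigma2 / (2 * (1 - c))
    print(f"delta={delta:.0e}   Hoeffding dev <= {dev_hoeffding:.5f}   CB dev <= {dev_cb:.5f}")
```

When the variance is small, the CB deviation bound is an order of magnitude tighter at the same confidence level, which is exactly the regime exploited below.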

To use the CB inequality, we need to bound the variance of $U_i$. Note that

$$\mathrm{var}(U_i) = \mathrm{var}\!\left( -(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2 \right).$$

Recall our assumption that $Y$ is bounded; in particular, $Y$ is contained in an interval of length $b$ (without loss of generality we can take $\mathcal{Y} = [-b/2, b/2]$).

Proposition 1. The moment condition (1) holds with $h = 2b^2/3$.

Proof. Left as an exercise.

Proposition 2. The variance of $U_i$ satisfies

$$\mathrm{var}(U_i) \le 5 b^2\, \mathcal{E}(f). \qquad (2)$$

Proof. We can write $U_i$ as

$$U_i = 2 Y_i f(X_i) - 2 Y_i f^*(X_i) + f^*(X_i)^2 - f(X_i)^2 = 2 Y_i \big(f(X_i) - f^*(X_i)\big) - \big(f(X_i) - f^*(X_i)\big)\big(f(X_i) + f^*(X_i)\big),$$

and therefore

$$U_i = \underbrace{2 \big(Y_i - f^*(X_i)\big)\big(f(X_i) - f^*(X_i)\big)}_{T_1} - \underbrace{\big(f(X_i) - f^*(X_i)\big)^2}_{T_2}.$$

Note that the variance of $U_i$ is upper-bounded by its second moment, that is,

$$\mathrm{var}(U_i) \le E[U_i^2] = E[(T_1 - T_2)^2] = E[T_1^2] + E[T_2^2] - 2 E[T_1 T_2].$$

Also note that the covariance of $T_1$ and $T_2$ is zero:

$$E[T_1 T_2] = E\big[ E[T_1 T_2 \mid X_i] \big] = E\big[ T_2\, E[T_1 \mid X_i] \big] = E\big[ T_2 \cdot 2\big(f(X_i) - f^*(X_i)\big)\, E[Y_i - f^*(X_i) \mid X_i] \big] = 0.$$

This is evident when you recall that $f^*(x) = E[Y \mid X = x]$. Now we can bound the second moments of $T_1$ and $T_2$. Begin by recalling that $\mathcal{E}(f) = E[(f(X) - f^*(X))^2]$. Then

$$E[T_1^2] = 4 E\big[ (Y_i - f^*(X_i))^2 (f(X_i) - f^*(X_i))^2 \big] \le 4 E\big[ b^2 (f(X_i) - f^*(X_i))^2 \big] = 4 b^2 \mathcal{E}(f),$$

$$E[T_2^2] = E\big[ (f(X_i) - f^*(X_i))^4 \big] = E\big[ (f(X_i) - f^*(X_i))^2 (f(X_i) - f^*(X_i))^2 \big] \le E\big[ b^2 (f(X_i) - f^*(X_i))^2 \big] = b^2 \mathcal{E}(f).$$

So $\mathrm{var}(U_i) \le 5 b^2 E[(f(X_i) - f^*(X_i))^2] = 5 b^2 \mathcal{E}(f)$, and consequently $\mathrm{var}\!\left(\frac{1}{n}\sum_{i=1}^n U_i\right) \le \frac{5 b^2 \mathcal{E}(f)}{n}$.
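
As a sanity check, the following sketch (not from the notes, using the same illustrative toy model as before) estimates $\mathrm{var}(U_i)$ and $5 b^2 \mathcal{E}(f)$ by Monte Carlo for one candidate $f$; in this example the bound of Proposition 2 holds with a comfortable margin.

```python
# Monte Carlo sanity check (not from the notes) of Proposition 2, with the
# illustrative toy model: X ~ Uniform[0,1], f*(x) = x - 1/2, bounded noise,
# b = 2, and one candidate predictor f mapping into [-b/2, b/2].
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
b = 2.0
f_star = lambda x: x - 0.5
f = lambda x: 2.0 * (x - 0.5)                    # candidate predictor, values in [-1, 1]

X = rng.uniform(0.0, 1.0, n)
Y = f_star(X) + rng.uniform(-0.5, 0.5, n)        # labels stay in [-1, 1]

U = -(Y - f(X)) ** 2 + (Y - f_star(X)) ** 2      # U_i as defined above
excess = np.mean((f(X) - f_star(X)) ** 2)        # ~ E[(f(X) - f*(X))^2]
print(f"var(U_i) ~ {U.var():.4f}   5*b^2*excess ~ {5 * b**2 * excess:.4f}")
```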

Using the CB inequality (for properly chosen values of $\epsilon$ and $c$, to be discussed later) we have that, with probability at least $1 - e^{-t}$,

$$\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) \le \frac{t}{n\epsilon} + \frac{5 \epsilon b^2 \mathcal{E}(f)}{2(1-c)}.$$

In other words, with probability at least $1 - \delta$ (where $\delta = e^{-t}$),

$$\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) \le \frac{\log(1/\delta)}{n\epsilon} + \frac{5 \epsilon b^2 \mathcal{E}(f)}{2(1-c)}. \qquad (3)$$

Now, suppose we have assigned positive numbers $c(f)$ to each $f \in \mathcal{F}$ satisfying the Kraft inequality

$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1.$$

Note that (3) holds for every $\delta > 0$. In particular, we can let $\delta$ be a function of $f$: $\delta(f) = 2^{-c(f)} \delta$. So we can use this $\delta(f)$ along with the procedure introduced in lecture 9 (i.e., the union bound followed by the Kraft inequality) to obtain the following: for any $\delta > 0$, with probability at least $1 - \delta$,

$$\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) \le \frac{c(f)\log 2 + \log(1/\delta)}{n\epsilon} + \frac{5 \epsilon b^2 \mathcal{E}(f)}{2(1-c)}, \quad \forall f \in \mathcal{F}. \qquad (4)$$

Now set $c = \epsilon h = 2 b^2 \epsilon / 3$ and define

$$\alpha = \frac{5 \epsilon b^2}{2(1-c)}.$$

Taking $\epsilon < \frac{6}{19 b^2}$ guarantees that $\alpha < 1$. Using this fact we have, with probability at least $1 - \delta$,

$$(1 - \alpha)\,\mathcal{E}(f) \le \hat{\mathcal{E}}_n(f) + \frac{c(f)\log 2 + \log(1/\delta)}{n\epsilon}, \quad \forall f \in \mathcal{F}.$$

Since we want to find an $f \in \mathcal{F}$ that minimizes $\mathcal{E}(f)$, it is a good bet to minimize the right-hand side of the above bound. Recall that $\hat{\mathcal{E}}_n(f) = \hat{R}_n(f) - \hat{R}_n(f^*)$, where the term $\hat{R}_n(f^*)$ does not depend on $f$, and so define

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \frac{c(f)\log 2}{n\epsilon} \right\},$$

so that $\hat{f}_n$ minimizes the upper bound. Thus, with probability at least $1 - \delta$,

$$(1 - \alpha)\,\mathcal{E}(\hat{f}_n) \le \hat{\mathcal{E}}_n(\hat{f}_n) + \frac{c(\hat{f}_n)\log 2 + \log(1/\delta)}{n\epsilon} \le \hat{\mathcal{E}}_n(\tilde{f}) + \frac{c(\tilde{f})\log 2 + \log(1/\delta)}{n\epsilon}, \qquad (5)$$

where $\tilde{f} \in \mathcal{F}$ is arbitrary (but not a function of the training data). Now we use the Craig-Bernstein inequality to bound the difference between $\hat{\mathcal{E}}_n(\tilde{f})$ and $\mathcal{E}(\tilde{f})$. In order to get the correct direction in the bound we apply CB to $-U_i$ instead (a very similar derivation as before): with probability at least $1 - \delta$,

$$\hat{\mathcal{E}}_n(\tilde{f}) - \mathcal{E}(\tilde{f}) \le \alpha\, \mathcal{E}(\tilde{f}) + \frac{\log(1/\delta)}{n\epsilon}. \qquad (6)$$

Now we can again use a union bound to combine (5) and (6): for any $\delta > 0$, with probability at least $1 - 2\delta$,

$$\mathcal{E}(\hat{f}_n) \le \frac{1+\alpha}{1-\alpha}\, \mathcal{E}(\tilde{f}) + \frac{c(\tilde{f})\log 2 + 2\log(1/\delta)}{n\epsilon\,(1-\alpha)}.$$
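
The selection rule above is easy to implement for a concrete countable class. The sketch below (not part of the notes) uses piecewise-constant predictors on $[0,1]$ with $k$ equal-width bins ($k = 1, \ldots, K_{\max}$) and bin values quantized to an $m$-point grid in $[-b/2, b/2]$, with $c(f) = \log_2 K_{\max} + k \log_2 m$ bits, so that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. The data-generating model, $K_{\max}$, $m$, and $\epsilon$ are illustrative assumptions. Because the penalty depends only on $k$, the inner minimization over the quantized bin values decouples across bins.

```python
# Sketch (not part of the notes) of the penalized selection rule
#   argmin_F { R_hat_n(f) + c(f) log 2 / (n eps) }
# over a countable class of quantized piecewise-constant predictors.
import numpy as np

rng = np.random.default_rng(3)
b, n = 2.0, 2000
Kmax, m = 50, 65
grid = np.linspace(-b / 2, b / 2, m)             # allowed (quantized) bin values
eps = 0.9 * 6 / (19 * b**2)                      # any 0 < eps < 6/(19*b^2)

f_star = lambda x: np.sin(2 * np.pi * x) / 2     # regression function, values in [-1/2, 1/2]
X = rng.uniform(0.0, 1.0, n)
Y = np.clip(f_star(X) + 0.3 * rng.standard_normal(n), -b / 2, b / 2)   # keep labels in Y

best = None
for k in range(1, Kmax + 1):
    bins = np.minimum((X * k).astype(int), k - 1)            # bin index of each X_i
    values = np.empty(k)
    for j in range(k):
        yj = Y[bins == j]
        mean_j = yj.mean() if yj.size else 0.0
        values[j] = grid[np.argmin(np.abs(grid - mean_j))]   # nearest grid value to bin mean
    emp_risk = np.mean((values[bins] - Y) ** 2)              # empirical risk of this k-bin model
    c_f = np.log2(Kmax) + k * np.log2(m)                     # prefix-code length in bits
    penalized = emp_risk + c_f * np.log(2) / (n * eps)       # penalized empirical risk
    if best is None or penalized < best[0]:
        best = (penalized, k)

print(f"selected k = {best[1]} bins, penalized criterion = {best[0]:.4f}")
```

On a typical run this selects a small number of bins: the codelength penalty suppresses fine partitions whose reduction in empirical risk does not pay for their description length, which is the approximation/estimation trade-off the bound formalizes.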

At this point we have shown the following PAC bound.

Theorem 1. Consider the squared error loss. Let $\mathcal{X}$ be the feature space and $\mathcal{Y} = [-b/2, b/2]$ be the label space. Let $\{X_i, Y_i\}_{i=1}^n$ be i.i.d. according to $P_{XY}$, unknown. Let $\mathcal{F}$ be a collection of predictors (i.e., the $f \in \mathcal{F}$ are functions $f : \mathcal{X} \to \mathcal{Y}$) such that there are numbers $c(f)$ satisfying

$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1.$$

Select a function $\hat{f}_n$ according to

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \frac{c(f)\log 2}{n\epsilon} \right\},$$

with $0 < \epsilon < \frac{6}{19 b^2}$. Then, for any $\delta > 0$, with probability at least $1 - 2\delta$,

$$\mathcal{E}(\hat{f}_n) \le \frac{1+\alpha}{1-\alpha}\, \mathcal{E}(f) + \frac{c(f)\log 2 + 2\log(1/\delta)}{n\epsilon\,(1-\alpha)}, \quad \forall f \in \mathcal{F},$$

where $\alpha = \frac{5 b^2 \epsilon}{2(1 - 2 b^2 \epsilon / 3)}$ and $\mathcal{E}(f) = R(f) - R^* = E[(f(X) - f^*(X))^2]$.

Finally, we can use this result to get a bound on the expected excess risk. Although this result is just a corollary of the above theorem, we state it as a theorem due to its importance.

Theorem 2. Under the conditions of the above theorem,

$$E\left[ (\hat{f}_n(X) - f^*(X))^2 \right] \le \inf_{f \in \mathcal{F}} \left\{ \frac{1+\alpha}{1-\alpha}\, E[(f(X) - f^*(X))^2] + \frac{c(f)\log 2}{n\epsilon\,(1-\alpha)} \right\} + \frac{4}{n\epsilon\,(1-\alpha)}.$$

Proof. Let

$$\tilde{f} = \arg\inf_{f \in \mathcal{F}} \left\{ \frac{1+\alpha}{1-\alpha}\, \mathcal{E}(f) + \frac{c(f)\log 2}{n\epsilon\,(1-\alpha)} \right\}$$

and define

$$\Phi(\hat{f}_n) = \mathcal{E}(\hat{f}_n) - \frac{1+\alpha}{1-\alpha}\, \mathcal{E}(\tilde{f}) - \frac{c(\tilde{f})\log 2}{n\epsilon\,(1-\alpha)}.$$

The previous theorem implies that

$$\Pr\left( \Phi(\hat{f}_n) > \frac{2\log(1/\delta)}{n\epsilon\,(1-\alpha)} \right) \le 2\delta.$$

Take $\delta = e^{-n\epsilon(1-\alpha)t/2}$, so that $\frac{2\log(1/\delta)}{n\epsilon(1-\alpha)} = t$ and hence $P(\Phi(\hat{f}_n) > t) \le 2 e^{-n\epsilon(1-\alpha)t/2}$. Then

$$E[\Phi(\hat{f}_n)] \le \int_0^\infty P(\Phi(\hat{f}_n) \ge t)\, dt \le \int_0^\infty 2 e^{-n\epsilon(1-\alpha)t/2}\, dt = \frac{4}{n\epsilon\,(1-\alpha)},$$

concluding the proof.

As a final remark, notice that the above bound can be much better than the one derived for general losses. In particular, if $f^* \in \mathcal{F}$ and $c(f^*)$ is not too large (e.g., $c(f^*) = O(\log n)$), then we have

$$E[R(\hat{f}_n)] - R(f^*) = O\!\left( \frac{\log n}{n} \right),$$

within a logarithmic factor of the parametric rate of convergence!
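
To see how the oracle bound of Theorem 2 scales, the small sketch below (illustrative values only) evaluates its right-hand side at a single hypothetical candidate $f$ with excess risk $\mathcal{E}(f)$ and penalty $c(f)$; evaluating at one candidate upper-bounds the infimum over $\mathcal{F}$. In the hypothetical case $f = f^*$ with $c(f^*) = \log_2 n$ bits, the bound shrinks like $\log(n)/n$, illustrating the final remark.

```python
# Sketch (illustrative values only) of how the Theorem 2 bound scales with n.
import numpy as np

def theorem2_bound(n, b, c_f, excess_f, eps=None):
    """Evaluate (1+a)/(1-a)*E(f) + c(f)*log2/(n*eps*(1-a)) + 4/(n*eps*(1-a))."""
    if eps is None:
        eps = 0.5 * 6 / (19 * b**2)                      # any 0 < eps < 6/(19*b^2)
    alpha = 5 * b**2 * eps / (2 * (1 - 2 * b**2 * eps / 3))
    return ((1 + alpha) / (1 - alpha) * excess_f
            + c_f * np.log(2) / (n * eps * (1 - alpha))
            + 4 / (n * eps * (1 - alpha)))

b = 1.0
for n in (10**2, 10**3, 10**4, 10**5):
    # hypothetical case f = f*: zero excess risk, penalty c(f*) = log2(n) bits
    print(f"n={n:6d}   bound ~ {theorem2_bound(n, b, c_f=np.log2(n), excess_f=0.0):.5f}")
```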