ECE 901 Lecture: Complexity Regularization and the Squared Loss
R. Nowak, 5/7/2009

In the previous lectures we made use of the Chernoff/Hoeffding bounds for our analysis of classifier errors. Hoeffding's inequality states that for a sum of independent random variables $0 \le L_i \le 1$, $i = 1, \ldots, n$,
$$P\left( \left| \frac{1}{n}\sum_{i=1}^n E[L_i] - \frac{1}{n}\sum_{i=1}^n L_i \right| > \epsilon \right) \le 2 e^{-2n\epsilon^2}.$$
If $L_i = \ell(f(X_i), Y_i)$, the loss of $f$ in the prediction of $Y_i$ from $X_i$, then we have
$$P\left( |R(f) - \hat{R}_n(f)| > \epsilon \right) \le 2 e^{-2n\epsilon^2}.$$
When considering a countable collection $\mathcal{F}$ of candidate predictors, and penalties $c(f)$ assigned to each $f \in \mathcal{F}$ that satisfy the summability condition $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$, we showed that
$$E[R(\hat{f}_n)] - R^* \le \inf_{f \in \mathcal{F}} \left\{ R(f) - R^* + \sqrt{\frac{c(f) \log 2 + \frac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$
Consider the two terms in this upper bound: $R(f) - R^*$ is a bound on the approximation error of a model $f$, and the remainder is a bound on the estimation error associated with $f$. Thus, we see that complexity regularization automatically optimizes a balance between approximation and estimation errors. Note that the bound is valid for any bounded loss function.

The above upper bound is at least $1/\sqrt{n}$. This is the best one can expect, in general, when considering the 0/1 or $\ell_1$ (absolute error) loss functions, but in regression we are often interested in the squared error or $\ell_2$ loss (corresponding to the mean square error risk). The squared error typically decays faster than the 0/1 or absolute error (since squaring small numbers makes them smaller yet). Unfortunately, the Chernoff/Hoeffding bounds are not capable of handling such cases, and more sophisticated techniques are required. Before delving into those methods, consider the following simple example.

Example 1. To illustrate the distinction between classification and regression, consider a simple, scalar signal-plus-noise problem. Consider $Y_i = \theta + W_i$, $i = 1, \ldots, n$, where $\theta$ is a fixed unknown scalar parameter and the $W_i$ are independent, zero-mean, unit-variance random variables. Let $\hat{\theta} = \frac{1}{n}\sum_{i=1}^n Y_i$. Then we have
$$E[|\hat{\theta} - \theta|^2] = E\left[ \left( \frac{1}{n}\sum_{i=1}^n W_i \right)^2 \right] = \frac{1}{n^2}\sum_{i=1}^n E[W_i^2] = \frac{1}{n}.$$
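A quick Monte Carlo simulation (an illustration added here, not part of the original notes; the $\pm 1$ coin-flip noise is just one convenient zero-mean, unit-variance choice for $W_i$) confirms the $1/n$ decay of the MSE:

```python
import random

def mse_of_sample_mean(n, theta=1.0, trials=5000, seed=0):
    """Monte Carlo estimate of E[|theta_hat - theta|^2] for the sample mean
    theta_hat = (1/n) sum Y_i, with Y_i = theta + W_i and W_i = +/-1 coin
    flips (a zero-mean, unit-variance noise distribution)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta_hat = theta + sum(rng.choice((-1.0, 1.0)) for _ in range(n)) / n
        total += (theta_hat - theta) ** 2
    return total / trials

# The estimated MSE tracks the parametric rate 1/n:
for n in (10, 40, 160):
    print(n, mse_of_sample_mean(n), 1 / n)
```

Quadrupling $n$ should cut the estimated MSE by roughly a factor of four, matching the computation above.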
Thus, the mean square error decays like $1/n$, notably faster than $1/\sqrt{n}$. The convergence rate of $1/n$ is called the parametric rate for regression, since it is the rate at which the MSE decays in simple parametric inference. A similar conclusion can be arrived at through large deviation analysis. According to the Central Limit Theorem,
$$\sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{dist} N(0, 1), \quad \text{as } n \to \infty.$$
A simple tail bound on the Gaussian distribution gives us
$$P(\sqrt{n}\,|\hat{\theta} - \theta| > t) \le 2 e^{-t^2/2},$$
for $n$ large, which implies that
$$P(|\hat{\theta} - \theta|^2 > \epsilon) \le 2 e^{-n\epsilon/2}.$$
This is a bound on the deviations of the squared error $|\hat{\theta} - \theta|^2$. The squared error concentration inequality implies that $E[|\hat{\theta} - \theta|^2] = O(1/n)$ (just write $E[(\hat{\theta} - \theta)^2] = \int_0^\infty P((\hat{\theta} - \theta)^2 > t)\, dt$). Note that the main difference between Hoeffding's inequality and the above concentration bound is the dependence in $\epsilon$: linear in the latter and quadratic in the former, and therefore the former is much weaker.

1 Risk Bounds for Squared Error Loss

Let $\mathcal{X}$ be the feature space (e.g., $\mathcal{X} = \mathbb{R}^d$) and $\mathcal{Y} = [-b/2, b/2]$, where $b > 0$ is known. In other words, assume the label space is bounded. Consider the squared error loss
$$\ell(y, y') = (y - y')^2.$$
Take $\mathcal{F}$ such that each $f \in \mathcal{F}$ is a map $f : \mathcal{X} \to \mathcal{Y}$. We have training data $\{X_i, Y_i\}_{i=1}^n$ i.i.d. $\sim P_{XY}$. The empirical risk function is
$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n (f(X_i) - Y_i)^2,$$
simply the average of the squared prediction errors. The risk is therefore the MSE
$$R(f) = E[(f(X) - Y)^2].$$
We know that the function $f^*$ that minimizes the MSE is just the conditional expectation of $Y$ given $X$ (also known as the regression function):
$$f^*(x) = E[Y \mid X = x].$$
Now let $R^* = R(f^*)$. We want to select an $\hat{f}_n \in \mathcal{F}$ using the training data $\{X_i, Y_i\}_{i=1}^n$ such that the excess risk
$$E[R(\hat{f}_n)] - R^* \ge 0$$
is small. As we did in lecture 9, we will take advantage of the fact that $\hat{R}_n(f)$ concentrates (in probability) around $R(f)$, but to exploit the particular aspects of the squared error loss it is convenient to look at relative versions of the risk, namely the excess risk and its empirical counterpart:
$$\mathcal{E}(f) := R(f) - R(f^*), \qquad \hat{\mathcal{E}}_n(f) := \hat{R}_n(f) - \hat{R}_n(f^*).$$
The first thing to note is that, as shown in a previous lecture,
$$\mathcal{E}(f) = E[(f(X) - f^*(X))^2].$$
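The excess-risk identity $\mathcal{E}(f) = E[(f(X) - f^*(X))^2]$ can be checked exactly on a small discrete example. The toy distribution below (uniform $X$, symmetric bounded noise, with made-up values chosen only for illustration) lets us compute all expectations as finite sums:

```python
import itertools

# Toy discrete joint distribution: X uniform on {0, 1}, noise W = +/-1/4 equally
# likely, Y = fstar(X) + W.  All four (x, w) states have probability 1/4.
fstar = {0: -0.25, 1: 0.25}   # the regression function f*(x) = E[Y | X = x]
f     = {0: 0.0,   1: 0.5}    # some other candidate predictor

def risk(g):
    """Exact MSE risk R(g) = E[(g(X) - Y)^2], by enumerating the joint law."""
    return sum(0.25 * (g[x] - (fstar[x] + w)) ** 2
               for x, w in itertools.product((0, 1), (-0.25, 0.25)))

excess = risk(f) - risk(fstar)                            # E(f) = R(f) - R(f*)
dist2  = sum(0.5 * (f[x] - fstar[x]) ** 2 for x in (0, 1))  # E[(f(X) - f*(X))^2]
print(excess, dist2)   # the two quantities coincide
```

Here $R(f^*)$ is just the noise variance, and the excess risk of $f$ equals its mean squared distance to $f^*$, exactly as the identity predicts.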
Furthermore, note that $E[\hat{\mathcal{E}}_n(f)] = \mathcal{E}(f)$, and that $\hat{\mathcal{E}}_n(f)$ is a sum of independent random variables:
$$\hat{\mathcal{E}}_n(f) = -\frac{1}{n}\sum_{i=1}^n U_i, \quad \text{where } U_i = -(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2.$$
Therefore,
$$\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) = \frac{1}{n}\sum_{i=1}^n (U_i - E[U_i]).$$
Clearly the strong law of large numbers tells us that for a fixed prediction rule $f$,
$$\hat{\mathcal{E}}_n(f) \to \mathcal{E}(f), \quad \text{as } n \to \infty.$$
All we need now is to determine the speed of convergence. We will derive a bound for the difference $[R(f) - R(f^*)] - [\hat{R}_n(f) - \hat{R}_n(f^*)] = \mathcal{E}(f) - \hat{\mathcal{E}}_n(f)$. The following derivation is due to Andrew Barron.[1]

We are looking for a bound of the form $P(\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) > \epsilon) < \delta$. If the variables $U_i$ are independent and bounded, then we can apply Hoeffding's inequality. However, a more useful bound for our regression problem can be derived if the variables $U_i$ satisfy the following moment condition:
$$E[\,|U_i - E[U_i]|^k\,] \le \frac{\mathrm{var}(U_i)\, k!}{2}\, h^{k-2} \quad (1)$$
for some $h > 0$ and all $k \ge 2$. The moment condition can be difficult to verify in general, but it does hold, for example, for bounded random variables. In that case the Craig-Bernstein (CB) inequality (Craig 1933) states that, for independent r.v.'s $U_i$ satisfying (1),
$$P\left( \frac{1}{n}\sum_{i=1}^n (U_i - E[U_i]) \ge \frac{t}{n\epsilon} + \frac{\epsilon \cdot \frac{1}{n}\sum_{i=1}^n \mathrm{var}(U_i)}{2(1 - c)} \right) \le e^{-t},$$
for $0 < \epsilon h \le c < 1$ and $t > 0$. This shows that the tail decays exponentially in $t$, rather than exponentially in $t^2$. Recall Hoeffding's inequality (for variables $Z_i \in [0, 1]$):
$$P\left( \frac{1}{n}\sum_{i=1}^n (Z_i - E[Z_i]) \ge t \right) \le e^{-2nt^2}.$$
If $t \le 1$, then $t^2 \le t$, which implies $e^{-2nt^2} \ge e^{-2nt}$. This indicates that the CB inequality may be much tighter than Hoeffding's when the variance term $\frac{\epsilon \cdot \frac{1}{n}\sum_i \mathrm{var}(U_i)}{2(1-c)}$ is small.

To use the CB inequality, we need to bound the variance of $U_i$. Note that
$$\mathrm{var}(U_i) = \mathrm{var}\left( -(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2 \right).$$
Recall our assumption that $Y$ is bounded, in particular that $\mathcal{Y}$ is contained in an interval of length $b$ (without loss of generality we can assume $\mathcal{Y} = [-b/2, b/2]$).

[1] A. R. Barron, "Complexity regularization with application to artificial neural networks," in Nonparametric Functional Estimation and Related Topics. Kluwer Academic Publishers, 1991, pp. 561-576.
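To see why the CB inequality helps, we can compare the deviation each bound guarantees at a given confidence level. The sketch below uses made-up numbers ($n$, $\delta$, a small variance, and illustrative $h$ and $\epsilon$) purely to exhibit the $1/n$ versus $1/\sqrt{n}$ scaling:

```python
import math

def hoeffding_dev(n, delta):
    """Deviation u with P((1/n) sum(Z_i - E[Z_i]) >= u) <= delta, inverted
    from Hoeffding's bound e^{-2 n u^2} for variables in [0, 1]."""
    return math.sqrt(math.log(1 / delta) / (2 * n))

def craig_bernstein_dev(n, delta, var_u, h, eps):
    """Deviation guaranteed with probability 1 - delta by Craig-Bernstein:
    t/(n*eps) + eps*var/(2*(1-c)) with t = log(1/delta), taking c = eps*h."""
    c = eps * h
    assert 0 < c < 1          # required range 0 < eps*h <= c < 1
    return math.log(1 / delta) / (n * eps) + eps * var_u / (2 * (1 - c))

n, delta = 1000, 0.01
hd = hoeffding_dev(n, delta)
# hypothetical low-variance case: var(U_i) = 1e-3, moment constant h = 1/3, eps = 1
cb = craig_bernstein_dev(n, delta, var_u=1e-3, h=1 / 3, eps=1.0)
print(hd, cb)   # CB deviation ~ log(1/delta)/n, far below Hoeffding's ~ 1/sqrt(n)
```

When the variance term is tiny, the CB deviation is dominated by the $t/(n\epsilon)$ term, an order-$1/n$ quantity, whereas Hoeffding can never do better than order $1/\sqrt{n}$.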
Proposition 1. The moment condition (1) holds with $h = 2b^2/3$.

Proof. Left as an exercise.

Proposition 2. The variance of $U_i$ satisfies
$$\mathrm{var}(U_i) \le 5b^2\, \mathcal{E}(f). \quad (2)$$

Proof. We can write $U_i$ as
$$U_i = -(Y_i - f(X_i))^2 + (Y_i - f^*(X_i))^2$$
$$= 2Y_i f(X_i) - 2Y_i f^*(X_i) + f^*(X_i)^2 - f(X_i)^2$$
$$= 2Y_i f(X_i) - 2Y_i f^*(X_i) + 2f^*(X_i)^2 - f^*(X_i)^2 - f(X_i)^2 + 2f(X_i)f^*(X_i) - 2f(X_i)f^*(X_i)$$
$$= \underbrace{2(Y_i - f^*(X_i))(f(X_i) - f^*(X_i))}_{T_1} - \underbrace{(f(X_i) - f^*(X_i))^2}_{T_2}.$$
Note that the variance of $U_i$ is upper-bounded by its second moment, that is,
$$\mathrm{var}(U_i) \le E[U_i^2] = E[(T_1 - T_2)^2] = E[T_1^2] + E[T_2^2] - 2E[T_1 T_2].$$
Also note that the covariance of $T_1$ and $T_2$ is zero:
$$E[T_1 T_2] = E\big[ 2(Y_i - f^*(X_i))(f(X_i) - f^*(X_i)) \cdot (f(X_i) - f^*(X_i))^2 \big]$$
$$= E\big[ E[T_1 T_2 \mid X_i] \big] = E\big[ 2(f(X_i) - f^*(X_i))^3\, E[Y_i - f^*(X_i) \mid X_i] \big] = 0.$$
This is evident when you recall that $f^*(x) = E[Y \mid X = x]$. Now we can bound the second moments of $T_1$ and $T_2$. Begin by recalling that $\mathcal{E}(f) = E[(f(X) - f^*(X))^2]$. Now
$$E[T_1^2] = 4E\big[ \big( (Y_i - f^*(X_i))(f(X_i) - f^*(X_i)) \big)^2 \big] = 4E\big[ (Y_i - f^*(X_i))^2 (f(X_i) - f^*(X_i))^2 \big] \le 4b^2\, \mathcal{E}(f),$$
since $|Y_i - f^*(X_i)| \le b$, and
$$E[T_2^2] = E\big[ (f(X_i) - f^*(X_i))^4 \big] = E\big[ (f(X_i) - f^*(X_i))^2 (f(X_i) - f^*(X_i))^2 \big] \le b^2\, \mathcal{E}(f).$$
So $\mathrm{var}(U_i) \le 5b^2 E[(f(X_i) - f^*(X_i))^2] = 5b^2\, \mathcal{E}(f)$, and thus $\frac{1}{n}\sum_{i=1}^n \mathrm{var}(U_i) \le 5b^2\, \mathcal{E}(f)$.

Using the CB inequality (for properly chosen values of $\epsilon$ and $c$, to be discussed later) we have that, with probability at least $1 - e^{-t}$,
$$\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) \le \frac{t}{n\epsilon} + \frac{5\epsilon b^2\, \mathcal{E}(f)}{2(1 - c)}.$$
In other words, with probability at least $1 - \delta$ (where $\delta = e^{-t}$),
$$\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) \le \frac{\log\frac{1}{\delta}}{n\epsilon} + \frac{5\epsilon b^2\, \mathcal{E}(f)}{2(1 - c)}. \quad (3)$$
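Both the identity $E[U_i] = -\mathcal{E}(f)$ and the variance bound of Proposition 2 can be verified by exact enumeration on a toy model (a hypothetical discrete distribution with $b = 1$, so $\mathcal{Y} = [-1/2, 1/2]$; same sort of construction as earlier, chosen only for illustration):

```python
import itertools

# Toy model with b = 1: X uniform on {0, 1}, Y = fstar(X) + W, W = +/-1/4
# equally likely.  Each (x, w) state has probability 1/4, so sums are exact.
fstar = {0: -0.25, 1: 0.25}
f     = {0: 0.0,   1: 0.5}
states = [(x, w, 0.25) for x, w in itertools.product((0, 1), (-0.25, 0.25))]

def U(x, w):
    """U_i = -(Y - f(X))^2 + (Y - f*(X))^2 evaluated at state (x, w)."""
    y = fstar[x] + w
    return -(y - f[x]) ** 2 + (y - fstar[x]) ** 2

EU    = sum(p * U(x, w) for x, w, p in states)
EU2   = sum(p * U(x, w) ** 2 for x, w, p in states)
var_U = EU2 - EU ** 2
excess = sum(0.5 * (f[x] - fstar[x]) ** 2 for x in (0, 1))  # E(f) = E[(f - f*)^2]

print(EU, -excess)         # E[U_i] = -E(f)
print(var_U, 5 * excess)   # var(U_i) <= 5 b^2 E(f), with b = 1
```

In this instance the variance bound is loose (as a worst-case bound should be), but it holds, and $E[U_i]$ matches $-\mathcal{E}(f)$ exactly.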
Now, suppose we have assigned positive numbers $c(f)$ to each $f \in \mathcal{F}$ satisfying the Kraft inequality:
$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1.$$
Note that (3) holds for all $\delta > 0$. In particular, we can let $\delta$ be a function of $f$: $\delta(f) = 2^{-c(f)}\delta$. So we can use this $\delta(f)$ along with the procedure introduced in lecture 9 (i.e., the union bound followed by the Kraft inequality) to obtain the following. For any $\delta > 0$, with probability at least $1 - \delta$,
$$\mathcal{E}(f) - \hat{\mathcal{E}}_n(f) \le \frac{c(f)\log 2 + \log\frac{1}{\delta}}{n\epsilon} + \frac{5\epsilon b^2\, \mathcal{E}(f)}{2(1 - c)}, \quad \forall f \in \mathcal{F}. \quad (4)$$
Now set $c = \epsilon h = 2b^2\epsilon/3$ and define
$$\alpha = \frac{5\epsilon b^2}{2(1 - c)}.$$
Taking $\epsilon < \frac{6}{19b^2}$ guarantees that $\alpha < 1$. Using this fact we have
$$(1 - \alpha)\,\mathcal{E}(f) \le \hat{\mathcal{E}}_n(f) + \frac{c(f)\log 2 + \log\frac{1}{\delta}}{n\epsilon}, \quad \forall f \in \mathcal{F},$$
with probability at least $1 - \delta$. Since we want to find an $f \in \mathcal{F}$ that minimizes $\mathcal{E}(f)$, it is a good bet to minimize the right-hand side of the above bound. Recall that $\hat{\mathcal{E}}_n(f) = \hat{R}_n(f) - \hat{R}_n(f^*)$, and so define
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \frac{c(f)\log 2}{n\epsilon} \right\},$$
so that $\hat{f}_n$ minimizes the upper bound (the term $\hat{R}_n(f^*)$ does not depend on $f$). Thus, with probability at least $1 - \delta$,
$$(1 - \alpha)\,\mathcal{E}(\hat{f}_n) \le \hat{\mathcal{E}}_n(\hat{f}_n) + \frac{c(\hat{f}_n)\log 2 + \log\frac{1}{\delta}}{n\epsilon} \le \hat{\mathcal{E}}_n(f) + \frac{c(f)\log 2 + \log\frac{1}{\delta}}{n\epsilon}, \quad (5)$$
where $f \in \mathcal{F}$ is arbitrary (but not a function of the training data). Now we use the Craig-Bernstein inequality to bound the difference between $\hat{\mathcal{E}}_n(f)$ and $\mathcal{E}(f)$. In order to get the correct direction in the bound we apply CB to $-U_i$ instead (a very similar derivation as before). With probability at least $1 - \delta$,
$$\hat{\mathcal{E}}_n(f) \le \mathcal{E}(f) + \alpha\, \mathcal{E}(f) + \frac{\log\frac{1}{\delta}}{n\epsilon}. \quad (6)$$
Now we can again use a union bound to combine (5) and (6). For any $\delta > 0$, with probability at least $1 - 2\delta$,
$$\mathcal{E}(\hat{f}_n) \le \frac{1 + \alpha}{1 - \alpha}\, \mathcal{E}(f) + \frac{c(f)\log 2 + 2\log\frac{1}{\delta}}{(1 - \alpha)\, n\epsilon}.$$
At this point we have shown the following PAC bound.

Theorem 1. Consider the squared error loss. Let $\mathcal{X}$ be the feature space and $\mathcal{Y} = [-b/2, b/2]$ be the label space. Let $\{X_i, Y_i\}_{i=1}^n$ be i.i.d. according to $P_{XY}$, unknown. Let $\mathcal{F}$ be a collection of predictors (i.e., each $f \in \mathcal{F}$ is a function $f : \mathcal{X} \to \mathcal{Y}$) such that there are numbers $c(f)$ satisfying $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$.
Select a function $\hat{f}_n$ according to
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \frac{c(f)\log 2}{n\epsilon} \right\},$$
with $0 < \epsilon < \frac{6}{19b^2}$. Then, for any $\delta > 0$, with probability at least $1 - 2\delta$,
$$\mathcal{E}(\hat{f}_n) \le \frac{1 + \alpha}{1 - \alpha}\, \mathcal{E}(f) + \frac{c(f)\log 2 + 2\log\frac{1}{\delta}}{n\epsilon(1 - \alpha)}, \quad \forall f \in \mathcal{F},$$
where $\alpha = \frac{15 b^2 \epsilon}{6 - 4 b^2 \epsilon}$ (equivalently, $\alpha = \frac{5\epsilon b^2}{2(1 - c)}$ with $c = 2b^2\epsilon/3$), and $\mathcal{E}(f) = R(f) - R^* = E[(f(X) - f^*(X))^2]$.

Finally we can use this result to get a bound on the expected excess risk. Although this result is just a corollary of the above theorem, we state it as a theorem due to its importance.

Theorem 2. Under the conditions of the above theorem,
$$E\big[ (\hat{f}_n(X) - f^*(X))^2 \big] \le \inf_{f \in \mathcal{F}} \left\{ \frac{1 + \alpha}{1 - \alpha}\, E\big[ (f(X) - f^*(X))^2 \big] + \frac{c(f)\log 2}{n\epsilon(1 - \alpha)} \right\} + \frac{4}{(1 - \alpha)\, n\epsilon}.$$

Proof. Let
$$\tilde{f} = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1 + \alpha}{1 - \alpha}\, \mathcal{E}(f) + \frac{c(f)\log 2}{n\epsilon(1 - \alpha)} \right\}$$
and define
$$\Phi(\hat{f}_n) = \mathcal{E}(\hat{f}_n) - \frac{1 + \alpha}{1 - \alpha}\, \mathcal{E}(\tilde{f}) - \frac{c(\tilde{f})\log 2}{n\epsilon(1 - \alpha)}.$$
The previous theorem implies that
$$\Pr\left( \Phi(\hat{f}_n) > \frac{2\log(1/\delta)}{(1 - \alpha)\, n\epsilon} \right) \le 2\delta.$$
Take $\delta = e^{-(1 - \alpha)n\epsilon t/2}$, so that $\Pr(\Phi(\hat{f}_n) > t) \le 2 e^{-(1 - \alpha)n\epsilon t/2}$. Then
$$E\big[ \Phi(\hat{f}_n) \big] \le \int_0^\infty P(\Phi(\hat{f}_n) > t)\, dt \le \int_0^\infty 2 e^{-(1 - \alpha)n\epsilon t/2}\, dt = \frac{4}{(1 - \alpha)\, n\epsilon},$$
concluding the proof.

As a final remark, notice that the above bound can be much better than the one derived for general losses. In particular, if $f^* \in \mathcal{F}$ and $c(f^*)$ is not too large (e.g., $c(f^*) = O(\log n)$), then we have
$$E[R(\hat{f}_n)] - R(f^*) = O\left( \frac{\log n}{n} \right),$$
within a logarithmic factor of the parametric rate of convergence!
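To make the selection rule of Theorem 1 concrete, here is a minimal sketch on an invented finite class of constant predictors with hypothetical penalties (the class, the penalty assignment $c(f_k) = 2|k| + 1$, and the data distribution are all assumptions for illustration, not part of the notes):

```python
import math
import random

b = 1.0                        # labels live in [-b/2, b/2]
eps = 3 / (19 * b ** 2)        # any 0 < eps < 6/(19 b^2) is admissible; half the cap

# Hypothetical countable class: constant predictors f_k(x) = k/8 for k = -4..4,
# penalized by c(f_k) = 2|k| + 1.  These penalties satisfy the Kraft inequality.
models = {k / 8: 2 * abs(k) + 1 for k in range(-4, 5)}
assert sum(2.0 ** -c for c in models.values()) <= 1   # Kraft check

def fhat(ys):
    """Complexity-regularized selection: argmin_f R_hat_n(f) + c(f) log 2 / (n eps)."""
    n = len(ys)
    def objective(m):
        emp_risk = sum((m - y) ** 2 for y in ys) / n
        return emp_risk + models[m] * math.log(2) / (n * eps)
    return min(models, key=objective)

# Toy data: f* is the constant 0.25 (a member of the class), bounded noise:
rng = random.Random(0)
ys = [0.25 + rng.uniform(-0.25, 0.25) for _ in range(2000)]
print(fhat(ys))
```

With this much data the penalty terms are small relative to the empirical-risk differences, so the rule recovers the constant $0.25$; at small $n$ the penalty biases the choice toward cheaply coded (small $c(f)$) models, which is exactly the approximation/estimation trade-off the bound formalizes.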