REGRESSION WITH QUADRATIC LOSS

MAXIM RAGINSKY

Regression with quadratic loss is another basic problem studied in statistical learning theory. We have a random couple Z = (X, Y), where, as before, X is an R^d-valued feature vector (or input vector) and Y is the real-valued response (or output). We assume that the unknown joint distribution P = P_Z = P_{XY} of (X, Y) belongs to some class P of probability distributions over R^d × R. The learning problem, then, is to produce a predictor of Y given X on the basis of an i.i.d. training sample Z^n = (Z_1, ..., Z_n) = ((X_1, Y_1), ..., (X_n, Y_n)) from P. A predictor is just a (measurable) function f : R^d → R, and we evaluate its performance by the expected quadratic loss

    L(f) := E[(Y − f(X))^2].

As we have seen before, the smallest expected loss is achieved by the regression function f*(x) = E[Y | X = x], i.e.,

    L* := inf_f L(f) = L(f*) = E[(Y − E[Y | X])^2].

Moreover, for any other f we have

    L(f) = L* + ||f − f*||^2_{L^2(P_X)},

where

    ||f − f*||^2_{L^2(P_X)} = ∫_{R^d} |f(x) − f*(x)|^2 P_X(dx).

Since we do not know P, in general we cannot hope to learn f*, so, as before, we instead aim at finding a good approximation to the best predictor in some class F of functions f : R^d → R, i.e., to use the training data Z^n to construct a predictor f̂_n ∈ F such that

    L(f̂_n) ≈ L*(F) := inf_{f ∈ F} L(f)

with high probability. We will assume that the marginal distribution P_X of the feature vector is supported on a closed subset X ⊆ R^d, and that the joint distribution P of (X, Y) is such that, with probability one,

(1)    |Y| ≤ M and |f*(X)| ≤ M

for some constant 0 < M < ∞. Thus we can assume that the training samples belong to the set Z = X × [−M, M]. We will also assume that the class F is a subset of a suitable reproducing kernel Hilbert space (RKHS) H_K induced by some Mercer kernel K : X × X → R. It will be useful to define

(2)    C_K := sup_{x ∈ X} √(K(x, x));

we will assume that C_K is finite. The following simple bound will come in handy:

Date: March 28, 2011.
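The decomposition L(f) = L* + ||f − f*||^2_{L^2(P_X)} can be illustrated with a quick Monte Carlo check. The particular distribution, regression function, and competitor predictor below are illustrative assumptions, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative model (an assumption for this sketch): X uniform on [0, 1],
# regression function f*(x) = sin(2*pi*x), noise independent of X.
X = rng.uniform(0.0, 1.0, n)
fstar = np.sin(2 * np.pi * X)
Y = fstar + rng.normal(0.0, 0.5, n)

f = lambda x: 0.5 * x                   # an arbitrary competitor predictor

L_f = np.mean((Y - f(X)) ** 2)          # empirical estimate of L(f)
L_star = np.mean((Y - fstar) ** 2)      # empirical estimate of L* = L(f*)
gap = np.mean((f(X) - fstar) ** 2)      # estimate of ||f - f*||^2 in L^2(P_X)

# Up to Monte Carlo error, L(f) = L* + ||f - f*||^2:
print(L_f, L_star + gap)
```

The cross term 2 E[(Y − f*(X))(f*(X) − f(X))] vanishes because Y − f*(X) has zero conditional mean given X, which is exactly why the decomposition holds with no cross term.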
Lemma 1. For any function f : X → R, define the sup norm

    ||f||_∞ := sup_{x ∈ X} |f(x)|.

Then for any f ∈ H_K we have

(3)    ||f||_∞ ≤ C_K ||f||_K.

Proof. For any f ∈ H_K and x ∈ X,

    |f(x)| = |⟨f, K_x⟩_K| ≤ ||f||_K ||K_x||_K = ||f||_K √(K(x, x)),

where the first step is by the reproducing kernel property, while the second step is by Cauchy–Schwarz. Taking the supremum of both sides over X, we get (3). □

1. ERM over a ball in RKHS

First, we will look at the simplest case: ERM over a ball in H_K. Thus, we pick the radius λ > 0 and take F = F_λ := {f ∈ H_K : ||f||_K ≤ λ}. The ERM algorithm outputs the predictor

    f̂_n = arg min_{f ∈ F_λ} L_n(f) ≡ arg min_{f ∈ F_λ} (1/n) Σ_{i=1}^n (Y_i − f(X_i))^2,

where L_n(f) denotes, as usual, the empirical loss (in this case, the empirical quadratic loss) of f.

Theorem 1. With probability at least 1 − δ,

(4)    L(f̂_n) ≤ L*(F_λ) + 8(M + 2C_K λ)^2 / √n + (M^2 + C_K^2 λ^2) √(32 log(1/δ) / n).

Proof. First let us introduce some notation. Let us denote the quadratic loss function (y, u) ↦ (y − u)^2 by ℓ(y, u), and for any f : R^d → R let

    ℓ∘f(x, y) := ℓ(y, f(x)) = (y − f(x))^2.

Let ℓ∘F_λ denote the function class {ℓ∘f : f ∈ F_λ}. Let f*_λ denote any minimizer of L(f) over F_λ, i.e., L(f*_λ) = L*(F_λ). As usual, we write

(5)    L(f̂_n) − L*(F_λ) = L(f̂_n) − L(f*_λ)
         = L(f̂_n) − L_n(f̂_n) + L_n(f̂_n) − L_n(f*_λ) + L_n(f*_λ) − L(f*_λ)
         ≤ 2 sup_{f ∈ F_λ} |L_n(f) − L(f)| = 2 sup_{f ∈ F_λ} |P_n(ℓ∘f) − P(ℓ∘f)| = 2 Δ_n(ℓ∘F_λ),

where we have defined the uniform deviation

    Δ_n(ℓ∘F) := sup_{f ∈ F} |P_n(ℓ∘f) − P(ℓ∘f)|.

Next we show that, as a function of the training sample Z^n, g(Z^n) = Δ_n(ℓ∘F_λ) has bounded differences. Indeed, for any 1 ≤ i ≤ n, any z^n ∈ Z^n, and any z'_i ∈ Z, let z^n_{(i)} denote z^n with the ith
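Lemma 1 is easy to test numerically for functions in the span of kernel sections, where the RKHS norm is computable from the Gram matrix: for f = Σ_j α_j K(·, x_j) we have ||f||_K^2 = αᵀGα with G_{jk} = K(x_j, x_k). A small sketch (the Gaussian kernel, for which C_K = 1, is an illustrative choice):

```python
import numpy as np

# Gaussian (RBF) kernel on R: K(x, x') = exp(-(x - x')^2 / 2),
# so K(x, x) = 1 for all x and hence C_K = sup_x sqrt(K(x, x)) = 1.
def K(x, xp):
    return np.exp(-0.5 * (x - xp) ** 2)

rng = np.random.default_rng(1)
centers = rng.normal(size=8)     # f = sum_j alpha_j K(., x_j)
alpha = rng.normal(size=8)

G = K(centers[:, None], centers[None, :])   # Gram matrix
rkhs_norm = np.sqrt(alpha @ G @ alpha)      # ||f||_K

# Approximate the sup norm on a fine grid:
grid = np.linspace(-10, 10, 2001)
f_vals = (alpha[None, :] * K(grid[:, None], centers[None, :])).sum(axis=1)
sup_norm = np.abs(f_vals).max()             # ~ ||f||_inf

# Lemma 1 with C_K = 1: ||f||_inf <= ||f||_K
assert sup_norm <= rkhs_norm
```

The grid maximum can only underestimate the true sup norm, so the check is conservative; the inequality itself holds for every f in the RKHS, not just finite kernel expansions.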
coordinate replaced by z'_i. Then

    |g(z^n) − g(z^n_{(i)})| ≤ (1/n) sup_{f ∈ F_λ} |(y_i − f(x_i))^2 − (y'_i − f(x'_i))^2|
      ≤ (2/n) sup_{f ∈ F_λ} sup_{x ∈ X, |y| ≤ M} (y − f(x))^2
      ≤ (4/n) (M^2 + sup_{f ∈ F_λ} ||f||^2_∞)
      ≤ (4/n) (M^2 + C_K^2 λ^2),

where the last line is by Lemma 1. Thus, Δ_n(ℓ∘F_λ) has the bounded difference property with c_1 = ... = c_n = 4(M^2 + C_K^2 λ^2)/n, so McDiarmid's inequality says that, for any t > 0,

    P( Δ_n(ℓ∘F_λ) ≥ E[Δ_n(ℓ∘F_λ)] + t ) ≤ exp( − n t^2 / (8 (M^2 + C_K^2 λ^2)^2) ).

Therefore, letting

    t = 2 (M^2 + C_K^2 λ^2) √(2 log(1/δ) / n),

we see that

    Δ_n(ℓ∘F_λ) ≤ E[Δ_n(ℓ∘F_λ)] + 2 (M^2 + C_K^2 λ^2) √(2 log(1/δ) / n)

with probability at least 1 − δ. Moreover, by symmetrization we have

(6)    E[Δ_n(ℓ∘F_λ)] ≤ 2 E[R_n(ℓ∘F_λ(Z^n))],

where

(7)    R_n(ℓ∘F_λ(Z^n)) = (1/n) E_σ[ sup_{f ∈ F_λ} | Σ_{i=1}^n σ_i ℓ∘f(Z_i) | ]

is the Rademacher average of the (random) set

    ℓ∘F_λ(Z^n) = {(ℓ∘f(Z_1), ..., ℓ∘f(Z_n)) : f ∈ F_λ}
               = {((Y_1 − f(X_1))^2, ..., (Y_n − f(X_n))^2) : f ∈ F_λ}.

To bound the Rademacher average in (7), we will need to use the contraction principle. To that end, let us fix any y ∈ [−M, M] and any u, v ∈ [−C_K λ, C_K λ]. Then

    |ℓ(y, u) − ℓ(y, v)| = |(y^2 − 2yu + u^2) − (y^2 − 2yv + v^2)|
      = |2y(v − u) − (v^2 − u^2)|
      ≤ 2|y| |u − v| + |u + v| |u − v|
      ≤ (2M + 2C_K λ) |u − v|.

Hence, by the contraction principle we can write

(8)    R_n(ℓ∘F_λ(Z^n)) ≤ (2(M + C_K λ)/n) E_σ[ sup_{f ∈ F_λ} | Σ_{i=1}^n σ_i (Y_i − f(X_i)) | ].
Moreover,

(9)    E_σ[ sup_{f ∈ F_λ} | Σ_{i=1}^n σ_i (Y_i − f(X_i)) | ]
      ≤ E_σ| Σ_{i=1}^n σ_i Y_i | + E_σ[ sup_{f ∈ F_λ} | Σ_{i=1}^n σ_i f(X_i) | ]
      ≤ √(Σ_{i=1}^n Y_i^2) + n R_n(F_λ(X^n))
      ≤ √n (M + C_K λ),

where the first step uses the triangle inequality, the second step uses the result from the previous lecture on the expected absolute value of Rademacher sums, and the third step uses (1) and the bound on the Rademacher average over a ball in an RKHS. Combining (6) through (9) (and overbounding (9) slightly), we conclude that

(10)    Δ_n(ℓ∘F_λ) ≤ 4 (M + 2C_K λ)^2 / √n + 2 (M^2 + C_K^2 λ^2) √(2 log(1/δ) / n)

with probability at least 1 − δ. Finally, combining this with (5), we get (4). □

2. Regularized least squares in an RKHS

The observation we have made many times by now is that, when the joint distribution of the input–output pair (X, Y) ∈ X × R is unknown, there is no hope in general to learn the optimal predictor f* from a finite training sample. Thus, restricting our attention to some hypothesis space F, which is a proper subset of the class of all measurable functions f : X → R, is a form of insurance: if we do not do this, then we can always find some function f that attains zero empirical loss, yet performs spectacularly badly on inputs outside the training set. When this happens, we say that our learned predictor overfits. On the other hand, if our hypothesis space F consists of well-behaved functions, then it is possible to learn a predictor that achieves a graceful balance between in-sample data fit and out-of-sample generalization. The price we pay is the approximation error

    L*(F) − L* ≡ inf_{f ∈ F} L(f) − inf_{f : X → R} L(f) ≥ 0.

In the regression setting, the approximation error can be expressed as

    L*(F) − L* = inf_{f ∈ F} ||f − f*||^2_{L^2(P_X)},

where f*(x) = E[Y | X = x] is the regression function (the MMSE predictor of Y given X). When seen from this perspective, the use of a restricted hypothesis space F is a form of regularization, i.e., a way of guaranteeing that the learned predictor performs well outside the training sample. However, this is not the only way to achieve regularization. In this section, we will analyze another way: complexity regularization.
In a nutshell, complexity regularization is a modification of the ERM scheme that allows us to search over a fairly rich hypothesis space by adding a penalty term. Complexity regularization is a very general technique with wide applicability. We will look at a particular example of complexity regularization over an RKHS and derive a simple bound on its generalization performance. To set things up, let λ > 0 be a regularization parameter. Introduce the regularized quadratic loss

    J_λ(f) := L(f) + λ ||f||^2_K

and its empirical counterpart

    J_{λ,n}(f) := L_n(f) + λ ||f||^2_K.
Define the functions

(11)    f_λ := arg min_{f ∈ H_K} J_λ(f)

and

(12)    f̂_{λ,n} := arg min_{f ∈ H_K} J_{λ,n}(f).

We will refer to (12) as the regularized kernel least squares (RKLS) algorithm. Note that the minimization in (11) and (12) takes place over the entire RKHS H_K, rather than a subset, say, a ball. However, the addition of the regularization term λ||f||^2_K ensures that the RKLS algorithm does not just select any function f ∈ H_K that happens to fit the training data well; instead, it weighs the goodness-of-fit term L_n(f) against the complexity term λ||f||^2_K, since a very large value of ||f||^2_K would indicate that f might wiggle around a lot and, therefore, overfit the training sample. The regularization parameter λ > 0 controls the relative importance of the goodness-of-fit and complexity terms.

We have the following basic bound on the generalization performance of RKLS:

Theorem 2. With probability at least 1 − δ,

(13)    L(f̂_{λ,n}) − L* ≤ A(λ) + 4M^2 (1 + 2C_K/√λ)^2 / √n + 2 (2M^2 + (C_K^2/λ)(M^2 + A(λ))) √(2 log(2/δ) / n),

where

    A(λ) := inf_{f ∈ H_K} [L(f) + λ ||f||^2_K] − L*

is the regularized approximation error.

Proof. We start with the following lemma:

Lemma 2.

(14)    L(f̂_{λ,n}) − L* ≤ δ_n(f̂_{λ,n}) − δ_n(f_λ) + A(λ),

where δ_n(f) := L(f) − L_n(f) for all f.

Proof. First, an obvious overbounding gives

    L(f̂_{λ,n}) − L* ≤ J_λ(f̂_{λ,n}) − L*.

Then

    J_λ(f̂_{λ,n}) = L(f̂_{λ,n}) + λ ||f̂_{λ,n}||^2_K
      = L(f̂_{λ,n}) − L_n(f̂_{λ,n}) + L_n(f̂_{λ,n}) + λ ||f̂_{λ,n}||^2_K      [the last two terms are J_{λ,n}(f̂_{λ,n})]
      = L(f̂_{λ,n}) − L_n(f̂_{λ,n}) + J_{λ,n}(f̂_{λ,n}) − J_{λ,n}(f_λ) + J_{λ,n}(f_λ)
      ≤ L(f̂_{λ,n}) − L_n(f̂_{λ,n}) + J_{λ,n}(f_λ)
      = L(f̂_{λ,n}) − L_n(f̂_{λ,n}) + L_n(f_λ) + λ ||f_λ||^2_K
      = L(f̂_{λ,n}) − L_n(f̂_{λ,n}) + L_n(f_λ) − L(f_λ) + L(f_λ) + λ ||f_λ||^2_K
      = L(f̂_{λ,n}) − L_n(f̂_{λ,n}) + L_n(f_λ) − L(f_λ) + J_λ(f_λ),

where the inequality holds because f̂_{λ,n} minimizes J_{λ,n}.
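Although the notes do not pursue the computational side, the minimization in (12) reduces to a finite-dimensional linear system: by the representer theorem, the minimizer lies in the span of the kernel sections K(·, X_i), and the coefficient vector solves (G + nλI)α = Y, where G is the kernel Gram matrix. A minimal sketch (the Gaussian kernel and the synthetic data are illustrative assumptions, not from the notes):

```python
import numpy as np

# RKLS in closed form: f(x) = sum_i alpha_i K(x, X_i) with
# alpha = (G + n*lam*I)^{-1} Y, G_{ij} = K(X_i, X_j).
def K(x, xp):
    return np.exp(-0.5 * (x - xp) ** 2)   # illustrative Gaussian kernel

rng = np.random.default_rng(2)
n, lam = 50, 0.1
X = rng.uniform(-3, 3, n)
Y = np.sin(X) + 0.1 * rng.normal(size=n)  # illustrative data

G = K(X[:, None], X[None, :])
alpha = np.linalg.solve(G + n * lam * np.eye(n), Y)

def J(beta):
    """Empirical regularized objective J_{lam,n} for f = sum_i beta_i K(., X_i):
    (1/n) * sum_i (Y_i - f(X_i))^2 + lam * ||f||_K^2, with ||f||_K^2 = beta' G beta."""
    fit = G @ beta
    return np.mean((Y - fit) ** 2) + lam * beta @ G @ beta

# alpha should minimize J over coefficient vectors (J is convex in beta,
# and alpha is its stationary point), so random perturbations cannot improve it:
for _ in range(100):
    assert J(alpha) <= J(alpha + 0.01 * rng.normal(size=n)) + 1e-12
```

Solving this n × n system is the standard way RKLS (kernel ridge regression) is implemented in practice; the nλI term also keeps the system well conditioned, mirroring the stabilizing role of the penalty in the analysis.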
This gives

    L(f̂_{λ,n}) − L* ≤ L(f̂_{λ,n}) − L_n(f̂_{λ,n}) + L_n(f_λ) − L(f_λ) + J_λ(f_λ) − L*
      = L(f̂_{λ,n}) − L_n(f̂_{λ,n}) + L_n(f_λ) − L(f_λ) + inf_{f ∈ H_K} [L(f) + λ ||f||^2_K] − L*
      = δ_n(f̂_{λ,n}) − δ_n(f_λ) + A(λ),

and we are done. □

Lemma 2 shows that the excess loss of the regularized empirical loss minimizer f̂_{λ,n} is bounded from above by the sum of three terms: the deviation δ_n(f̂_{λ,n}) = L(f̂_{λ,n}) − L_n(f̂_{λ,n}) of f̂_{λ,n} itself, the (negative) deviation −δ_n(f_λ) = L_n(f_λ) − L(f_λ) of the best regularized predictor f_λ, and the approximation error A(λ). To prove Theorem 2, we will need to obtain high-probability bounds on the two deviation terms. To that end, we need a lemma:

Lemma 3. The functions f_λ and f̂_{λ,n} satisfy the bounds

(15)    ||f_λ||_∞ ≤ C_K √(A(λ)/λ)

and

(16)    ||f̂_{λ,n}||_K ≤ M/√λ with probability one,

respectively.

Proof. To prove (15), we use the fact that

    A(λ) = L(f_λ) − L* + λ ||f_λ||^2_K ≥ λ ||f_λ||^2_K,

which gives ||f_λ||_K ≤ √(A(λ)/λ). From this and from (3) we obtain (15). For (16), we use the fact that f̂_{λ,n} minimizes J_{λ,n}(f) over all f. In particular,

    J_{λ,n}(f̂_{λ,n}) = L_n(f̂_{λ,n}) + λ ||f̂_{λ,n}||^2_K ≤ J_{λ,n}(0) = (1/n) Σ_{i=1}^n Y_i^2 ≤ M^2 w.p. 1,

where the last step follows from (1). Rearranging and using the fact that L_n(f) ≥ 0 for all f, we get (16). □

Now we are ready to bound δ_n(f̂_{λ,n}). For any R ≥ 0, let F_R = {f ∈ H_K : ||f||_K ≤ R} denote the zero-centered ball of radius R in the RKHS H_K. Then Lemma 3 says that f̂_{λ,n} ∈ F_{M/√λ} with probability one. Therefore, with probability one we have

    δ_n(f̂_{λ,n}) = δ_n(f̂_{λ,n}) · 1{f̂_{λ,n} ∈ F_{M/√λ}}
      ≤ sup_{f ∈ F_{M/√λ}} |δ_n(f)| · 1{f̂_{λ,n} ∈ F_{M/√λ}}
      ≤ Δ_n(ℓ∘F_{M/√λ}).
Consequently, we can carry out the same analysis as in the proof of Theorem 1. First of all, the function g(Z^n) = Δ_n(ℓ∘F_{M/√λ}) has bounded differences with

    c_1 = ... = c_n ≤ (4/n) (M^2 + sup_{f ∈ F_{M/√λ}} ||f||^2_∞) ≤ (4M^2/n) (1 + C_K^2/λ),

where the last step uses (16) and Lemma 1. Therefore, with probability at least 1 − δ/2,

(17)    δ_n(f̂_{λ,n}) ≤ Δ_n(ℓ∘F_{M/√λ}) ≤ 4M^2 (1 + 2C_K/√λ)^2 / √n + 2M^2 (1 + C_K^2/λ) √(2 log(2/δ) / n),

where the second step follows from (10) with δ replaced by δ/2 and with the ball radius M/√λ in place of λ.

It remains to bound −δ_n(f_λ). This is, actually, much easier, since we are dealing with a single data-independent function. In particular, note that we can write

    −δ_n(f_λ) = (1/n) Σ_{i=1}^n (Y_i − f_λ(X_i))^2 − E[(Y − f_λ(X))^2] = (1/n) Σ_{i=1}^n U_i,

where U_i := (Y_i − f_λ(X_i))^2 − E[(Y − f_λ(X))^2], 1 ≤ i ≤ n, are i.i.d. random variables with E[U_i] = 0 and

    |U_i| ≤ sup_{y ∈ [−M,M]} sup_{x ∈ X} (|y| + |f_λ(x)|)^2 ≤ 2 (M^2 + ||f_λ||^2_∞) ≤ 2 (M^2 + C_K^2 A(λ)/λ)

with probability one, where we have used (1) and (15). We can therefore use Hoeffding's inequality to write, for any t ≥ 0,

    P( −δ_n(f_λ) ≥ t ) = P( (1/n) Σ_{i=1}^n U_i ≥ t ) ≤ exp( − n t^2 / (8 (M^2 + C_K^2 A(λ)/λ)^2) ).

This implies that

(18)    −δ_n(f_λ) ≤ 2 (M^2 + C_K^2 A(λ)/λ) √(2 log(2/δ) / n)

with probability at least 1 − δ/2. Combining (17) and (18) with (14), we get (13). □
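The norm bound (16) of Lemma 3, which drives the whole argument above, can be sanity-checked numerically for the closed-form RKLS solution α = (G + nλI)^{-1} Y given by the representer theorem. The kernel and the data-generating choices below are illustrative assumptions:

```python
import numpy as np

# Check Lemma 3, eq. (16): if |Y_i| <= M, then the RKLS solution satisfies
# ||f_{lam,n}||_K <= M / sqrt(lam), because lam*||f||_K^2 <= J_{lam,n}(f) <= J_{lam,n}(0) <= M^2.
def K(x, xp):
    return np.exp(-0.5 * (x - xp) ** 2)   # illustrative Gaussian kernel

rng = np.random.default_rng(3)
n, M = 200, 1.0
X = rng.uniform(-3, 3, n)
Y = np.clip(np.sin(X) + 0.3 * rng.normal(size=n), -M, M)  # enforce |Y_i| <= M
G = K(X[:, None], X[None, :])

for lam in [0.01, 0.1, 1.0]:
    alpha = np.linalg.solve(G + n * lam * np.eye(n), Y)
    rkhs_norm = np.sqrt(alpha @ G @ alpha)   # ||f_{lam,n}||_K = sqrt(alpha' G alpha)
    assert rkhs_norm <= M / np.sqrt(lam)     # the bound (16)
```

Note how the bound degrades as λ → 0: with little regularization the solution is allowed a large RKHS norm, which is exactly the regime where the deviation term (17) blows up.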