Regression with quadratic loss

Maxim Raginsky
October 13, 2015

Regression with quadratic loss is another basic problem studied in statistical learning theory. We have a random couple $Z = (X, Y)$, where, as before, $X$ is an $\mathbb{R}^d$-valued feature vector (or input vector) and $Y$ is the real-valued response (or output). We assume that the unknown joint distribution $P = P_Z = P_{XY}$ of $(X, Y)$ belongs to some class $\mathcal{P}$ of probability distributions over $\mathbb{R}^d \times \mathbb{R}$. The learning problem, then, is to produce a predictor of $Y$ given $X$ on the basis of an i.i.d. training sample $Z^n = (Z_1, \dots, Z_n) = ((X_1, Y_1), \dots, (X_n, Y_n))$ from $P$. A predictor is just a measurable function $f : \mathbb{R}^d \to \mathbb{R}$, and we evaluate its performance by the expected quadratic loss
\[
L(f) \triangleq \mathbb{E}\big[(Y - f(X))^2\big].
\]
As we have seen before, the smallest expected loss is achieved by the regression function $f^*(x) \triangleq \mathbb{E}[Y \mid X = x]$, i.e.,
\[
L^* \triangleq \inf_f L(f) = L(f^*) = \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big].
\]
Moreover, for any other $f$ we have
\[
L(f) = L^* + \|f - f^*\|^2_{L^2(P_X)},
\]
where
\[
\|f - f^*\|^2_{L^2(P_X)} = \int_{\mathbb{R}^d} |f(x) - f^*(x)|^2 \, P_X(dx).
\]
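For completeness, here is the standard conditioning argument behind this decomposition (a reminder, not new material): writing $Y - f(X) = (Y - f^*(X)) + (f^*(X) - f(X))$ and expanding the square,
\[
L(f) = \mathbb{E}\big[(Y - f^*(X))^2\big] + 2\,\mathbb{E}\big[(Y - f^*(X))(f^*(X) - f(X))\big] + \mathbb{E}\big[(f^*(X) - f(X))^2\big].
\]
The cross term vanishes, since conditioning on $X$ gives
\[
\mathbb{E}\big[(Y - f^*(X))(f^*(X) - f(X))\big] = \mathbb{E}\big[(f^*(X) - f(X))\,\mathbb{E}[Y - f^*(X) \mid X]\big] = 0,
\]
and the two remaining terms are exactly $L^*$ and $\|f - f^*\|^2_{L^2(P_X)}$.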

Since we do not know $P$, in general we cannot hope to learn $f^*$, so, as before, we instead aim at finding a good approximation to the best predictor in some class $\mathcal{F}$ of functions $f : \mathbb{R}^d \to \mathbb{R}$, i.e., to use the training data $Z^n$ to construct a predictor $\widehat{f}_n \in \mathcal{F}$ such that
\[
L(\widehat{f}_n) \approx L^*(\mathcal{F}) \triangleq \inf_{f \in \mathcal{F}} L(f)
\]
with high probability. We will assume that the marginal distribution $P_X$ of the feature vector is supported on a closed subset $\mathsf{X} \subseteq \mathbb{R}^d$, and that the joint distribution $P$ of $(X, Y)$ is such that, with probability one,
\[
|Y| \le M \quad \text{and} \quad |f^*(X)| \le M \tag{1}
\]
for some constant $0 < M < \infty$. Thus we can assume that the training samples belong to the set $\mathsf{Z} = \mathsf{X} \times [-M, M]$. We will also assume that the class $\mathcal{F}$ is a subset of a suitable reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ induced by some Mercer kernel $K : \mathsf{X} \times \mathsf{X} \to \mathbb{R}$. It will be useful to define
\[
C_K \triangleq \sup_{x \in \mathsf{X}} \sqrt{K(x, x)}; \tag{2}
\]
we will assume that $C_K$ is finite. The following simple bound will come in handy:

Lemma 1. For any function $f : \mathsf{X} \to \mathbb{R}$, define the sup norm $\|f\|_\infty \triangleq \sup_{x \in \mathsf{X}} |f(x)|$. Then for any $f \in \mathcal{H}_K$ we have
\[
\|f\|_\infty \le C_K \|f\|_K. \tag{3}
\]

Proof. For any $f \in \mathcal{H}_K$ and $x \in \mathsf{X}$,
\[
|f(x)| = |\langle f, K_x \rangle_K| \le \|f\|_K \|K_x\|_K = \|f\|_K \sqrt{K(x, x)},
\]
where the first step is by the reproducing kernel property, while the second step is by Cauchy-Schwarz. Taking the supremum of both sides over $\mathsf{X}$, we get (3).
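As a quick numerical illustration of Lemma 1 (not part of the notes; the Gaussian kernel and all parameter choices below are assumptions made only for the example), take a function of the form $f = \sum_i \alpha_i K(\cdot, x_i)$, for which $\|f\|_K^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j)$, and compare a grid estimate of $\|f\|_\infty$ with $C_K \|f\|_K$. For the Gaussian kernel, $K(x, x) = 1$ for every $x$, so $C_K = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(x, xp, sigma=0.5):
    # Gaussian (Mercer) kernel on R; K(x, x) = 1 for every x, so C_K = 1.
    return np.exp(-(x - xp) ** 2 / (2 * sigma ** 2))

# An element of the RKHS: f = sum_i alpha_i K(., x_i).
centers = rng.uniform(-1.0, 1.0, size=8)
alpha = rng.standard_normal(8)

G = k(centers[:, None], centers[None, :])   # Gram matrix K(x_i, x_j)
rkhs_norm = np.sqrt(alpha @ G @ alpha)      # ||f||_K

grid = np.linspace(-2.0, 2.0, 2001)
f_vals = k(grid[:, None], centers[None, :]) @ alpha
sup_norm = np.abs(f_vals).max()             # grid estimate of ||f||_inf

print(f"||f||_inf ~ {sup_norm:.4f}  <=  C_K * ||f||_K = {rkhs_norm:.4f}")
```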

1 ERM over a ball in RKHS

First, we will look at the simplest case: ERM over a ball in $\mathcal{H}_K$. Thus, we pick the radius $\lambda > 0$ and take
\[
\mathcal{F} = \mathcal{F}_\lambda \triangleq \big\{ f \in \mathcal{H}_K : \|f\|_K \le \lambda \big\}.
\]
The ERM algorithm outputs the predictor
\[
\widehat{f}_n \triangleq \operatorname*{arg\,min}_{f \in \mathcal{F}_\lambda} L_n(f) = \operatorname*{arg\,min}_{f \in \mathcal{F}_\lambda} \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2,
\]
where $L_n(f)$ denotes, as usual, the empirical loss (in this case, the empirical quadratic loss) of $f$.
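Computation is not the topic of these notes, but for concreteness, here is a rough sketch of how one might approximate this constrained ERM numerically. It assumes (an assumption of the sketch, not something proved here) that the minimizer may be taken in the span of the $K(\cdot, X_i)$, as in the representer theorem, and it handles the norm constraint by sweeping the Lagrange multiplier of the corresponding penalized problem; the Gaussian kernel, the multiplier grid, and the synthetic data are all arbitrary choices made for illustration.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    # Pairwise Gaussian kernel matrix; K(x, x) = 1, so C_K = 1.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def erm_over_ball(X, y, lam, sigma=0.5, mu_grid=np.logspace(-8, 4, 200)):
    """Approximate argmin of L_n(f) over {||f||_K <= lam}: parametrize
    f = sum_i alpha_i K(., X_i), solve the penalized problem for a grid of
    multipliers mu, and keep the feasible solution with the smallest L_n."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    best = None
    for mu in mu_grid:
        alpha = np.linalg.solve(K + n * mu * np.eye(n), y)
        norm_K = np.sqrt(alpha @ K @ alpha)        # ||f||_K
        emp_loss = np.mean((y - K @ alpha) ** 2)   # L_n(f)
        if norm_K <= lam and (best is None or emp_loss < best[0]):
            best = (emp_loss, alpha)
    return best

# Tiny usage example on synthetic data with |Y| <= M = 1.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = np.clip(np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(50), -1.0, 1.0)
emp_loss, alpha = erm_over_ball(X, y, lam=2.0)
print(f"L_n(f_hat) = {emp_loss:.4f}")
```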

Theorem 1. With probability at least $1 - \delta$,
\[
L(\widehat{f}_n) \le L^*(\mathcal{F}_\lambda) + \frac{16(M + C_K \lambda)^2}{\sqrt{n}} + (M^2 + C_K^2 \lambda^2) \sqrt{\frac{32 \log(1/\delta)}{n}}. \tag{4}
\]

Proof. First let us introduce some notation. Let us denote the quadratic loss function $(y, u) \mapsto (y - u)^2$ by $\ell(y, u)$, and for any $f : \mathbb{R}^d \to \mathbb{R}$ let
\[
\ell_f(x, y) \triangleq \ell(y, f(x)) = (y - f(x))^2.
\]
Let $\ell \circ \mathcal{F}_\lambda$ denote the function class $\{\ell_f : f \in \mathcal{F}_\lambda\}$. Let $f_\lambda$ denote any minimizer of $L(f)$ over $\mathcal{F}_\lambda$, i.e., $L(f_\lambda) = L^*(\mathcal{F}_\lambda)$. As usual, we write
\[
\begin{aligned}
L(\widehat{f}_n) - L^*(\mathcal{F}_\lambda) &= L(\widehat{f}_n) - L(f_\lambda) \\
&= L(\widehat{f}_n) - L_n(\widehat{f}_n) + L_n(\widehat{f}_n) - L_n(f_\lambda) + L_n(f_\lambda) - L(f_\lambda) \\
&\le 2 \sup_{f \in \mathcal{F}_\lambda} |L_n(f) - L(f)| = 2 \sup_{f \in \mathcal{F}_\lambda} |P_n \ell_f - P \ell_f| = 2\, \Delta_n(\ell \circ \mathcal{F}_\lambda),
\end{aligned} \tag{5}
\]
where the inequality holds because $L_n(\widehat{f}_n) - L_n(f_\lambda) \le 0$ by the definition of $\widehat{f}_n$, while each of the other two differences is at most $\sup_{f \in \mathcal{F}_\lambda} |L_n(f) - L(f)|$, and where we have defined the uniform deviation
\[
\Delta_n(\ell \circ \mathcal{F}) \triangleq \sup_{f \in \mathcal{F}} |P_n \ell_f - P \ell_f|.
\]
Next we show that, as a function of the training sample $Z^n$, $g(Z^n) = \Delta_n(\ell \circ \mathcal{F}_\lambda)$ has bounded differences. Indeed, for any $1 \le i \le n$, any $z^n \in \mathsf{Z}^n$, and any $z'_i \in \mathsf{Z}$, let $z^{n,i}$ denote $z^n$ with the $i$th coordinate replaced by $z'_i$. Then
\[
|g(z^n) - g(z^{n,i})| \le \frac{1}{n} \sup_{f \in \mathcal{F}_\lambda} \big| (y_i - f(x_i))^2 - (y'_i - f(x'_i))^2 \big| \le \frac{2}{n} \sup_{f \in \mathcal{F}_\lambda} \sup_{x \in \mathsf{X}} \sup_{|y| \le M} (y - f(x))^2 \le \frac{4}{n} \Big( M^2 + \sup_{f \in \mathcal{F}_\lambda} \|f\|_\infty^2 \Big) \le \frac{4(M^2 + C_K^2 \lambda^2)}{n},
\]
where the last step is by Lemma 1. Thus, $\Delta_n(\ell \circ \mathcal{F}_\lambda)$ has the bounded difference property with $c_1 = \dots = c_n = 4(M^2 + C_K^2 \lambda^2)/n$, so McDiarmid's inequality says that, for any $t > 0$,
\[
\mathbb{P}\Big( \Delta_n(\ell \circ \mathcal{F}_\lambda) \ge \mathbb{E}\, \Delta_n(\ell \circ \mathcal{F}_\lambda) + t \Big) \le \exp\left( -\frac{n t^2}{8 (M^2 + C_K^2 \lambda^2)^2} \right).
\]
Therefore, letting
\[
t = 2 (M^2 + C_K^2 \lambda^2) \sqrt{\frac{2 \log(1/\delta)}{n}},
\]
we see that
\[
\Delta_n(\ell \circ \mathcal{F}_\lambda) \le \mathbb{E}\, \Delta_n(\ell \circ \mathcal{F}_\lambda) + 2 (M^2 + C_K^2 \lambda^2) \sqrt{\frac{2 \log(1/\delta)}{n}}
\]
with probability at least $1 - \delta$. Moreover, by symmetrization we have
\[
\mathbb{E}\, \Delta_n(\ell \circ \mathcal{F}_\lambda) \le 2\, \mathbb{E}\, R_n\big( \ell \circ \mathcal{F}_\lambda(Z^n) \big), \tag{6}
\]
where
\[
R_n\big( \ell \circ \mathcal{F}_\lambda(Z^n) \big) = \frac{1}{n}\, \mathbb{E}_{\sigma^n}\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \sum_{i=1}^n \sigma_i \ell_f(Z_i) \right| \right] \tag{7}
\]
is the Rademacher average of the random set
\[
\ell \circ \mathcal{F}_\lambda(Z^n) = \big\{ (\ell_f(Z_1), \dots, \ell_f(Z_n)) : f \in \mathcal{F}_\lambda \big\} = \big\{ ((Y_1 - f(X_1))^2, \dots, (Y_n - f(X_n))^2) : f \in \mathcal{F}_\lambda \big\},
\]
with $\sigma_1, \dots, \sigma_n$ i.i.d. Rademacher random variables independent of $Z^n$, as in previous lectures. To bound the Rademacher average in (7), we will need to use the contraction principle. To that end, consider the function $\varphi(t) = t^2$. On the interval $[-A, A]$ for some $A > 0$, this function is Lipschitz with constant $2A$, i.e.,
\[
|s^2 - t^2| \le 2A |s - t|, \qquad -A \le s, t \le A.
\]
Thus, since $|Y_i| \le M$ and $|f(X_i)| \le C_K \lambda$ for all $1 \le i \le n$, by the contraction principle we can write
\[
R_n\big( \ell \circ \mathcal{F}_\lambda(Z^n) \big) \le 4 (M + C_K \lambda)\, \mathbb{E}_{\sigma^n}\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i (Y_i - f(X_i)) \right| \right]. \tag{8}
\]
Moreover,
\[
\mathbb{E}\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i (Y_i - f(X_i)) \right| \right] \le \mathbb{E}\left| \frac{1}{n} \sum_{i=1}^n \sigma_i Y_i \right| + \mathbb{E}\left[ \sup_{f \in \mathcal{F}_\lambda} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i f(X_i) \right| \right] \le \sqrt{\frac{\mathbb{E}[Y^2]}{n}} + \mathbb{E}\, R_n\big( \mathcal{F}_\lambda(X^n) \big) \le \frac{M + C_K \lambda}{\sqrt{n}}, \tag{9}
\]
where the first step uses the triangle inequality, the second step uses the result from previous lectures on the expected absolute value of Rademacher sums, and the third step uses (1) and the bound on the Rademacher average over a ball in an RKHS.
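For reference, the RKHS-ball bound invoked in the last step of (9) is presumably the one derived earlier in the course; the computation is standard and short. For any points $x_1, \dots, x_n \in \mathsf{X}$,
\[
\mathbb{E}_{\sigma^n}\left[ \sup_{\|f\|_K \le \lambda} \left| \frac{1}{n} \sum_{i=1}^n \sigma_i f(x_i) \right| \right] = \frac{\lambda}{n}\, \mathbb{E}_{\sigma^n}\left\| \sum_{i=1}^n \sigma_i K_{x_i} \right\|_K \le \frac{\lambda}{n} \sqrt{\sum_{i=1}^n K(x_i, x_i)} \le \frac{C_K \lambda}{\sqrt{n}},
\]
where the equality uses the reproducing property $f(x_i) = \langle f, K_{x_i} \rangle_K$ together with $\sup_{\|f\|_K \le \lambda} |\langle f, g \rangle_K| = \lambda \|g\|_K$, the first inequality follows from Jensen's inequality and $\mathbb{E}[\sigma_i \sigma_j] = \mathbf{1}\{i = j\}$, and the last inequality is (2).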

Combining (6) through (9) and overbounding (9) slightly, we conclude that
\[
\Delta_n(\ell \circ \mathcal{F}_\lambda) \le \frac{8 (M + C_K \lambda)^2}{\sqrt{n}} + 2 (M^2 + C_K^2 \lambda^2) \sqrt{\frac{2 \log(1/\delta)}{n}} \tag{10}
\]
with probability at least $1 - \delta$. Finally, combining this with (5), we get (4).

2 Regularized least squares in an RKHS

The observation we have made many times by now is that when the joint distribution of the input-output pair $(X, Y) \in \mathsf{X} \times \mathbb{R}$ is unknown, there is no hope in general to learn the optimal predictor $f^*$ from a finite training sample. Thus, restricting our attention to some hypothesis space $\mathcal{F}$, which is a proper subset of the class of all measurable functions $f : \mathsf{X} \to \mathbb{R}$, is a form of insurance: if we do not do this, then we can always find some function $f$ that attains zero empirical loss, yet performs spectacularly badly on the inputs outside the training set. When this happens, we say that our learned predictor overfits. On the other hand, if our hypothesis space $\mathcal{F}$ consists of well-behaved functions, then it is possible to learn a predictor that achieves a graceful balance between in-sample data fit and out-of-sample generalization. The price we pay is the approximation error
\[
L^*(\mathcal{F}) - L^* \triangleq \inf_{f \in \mathcal{F}} L(f) - \inf_{f : \mathsf{X} \to \mathbb{R}} L(f) \ge 0.
\]
In the regression setting, the approximation error can be expressed as
\[
L^*(\mathcal{F}) - L^* = \inf_{f \in \mathcal{F}} \|f - f^*\|^2_{L^2(P_X)},
\]
where $f^*(x) = \mathbb{E}[Y \mid X = x]$ is the regression function (the MMSE predictor of $Y$ given $X$). When seen from this perspective, the use of a restricted hypothesis space $\mathcal{F}$ is a form of regularization, a way of guaranteeing that the learned predictor performs well outside the training sample. However, this is not the only way to achieve regularization. In this section, we will analyze another way: complexity regularization. In a nutshell, complexity regularization is a modification of the ERM scheme that allows us to search over a fairly rich hypothesis space by adding a penalty term. Complexity regularization is a very general technique with wide applicability. We will look at a particular example of complexity regularization over an RKHS and derive a simple bound on its generalization performance.

To set things up, let $\gamma > 0$ be a regularization parameter. Introduce the regularized quadratic loss
\[
J_\gamma(f) \triangleq L(f) + \gamma \|f\|_K^2
\]
and its empirical counterpart
\[
J_{n,\gamma}(f) \triangleq L_n(f) + \gamma \|f\|_K^2.
\]
Define the functions
\[
f_\gamma \triangleq \operatorname*{arg\,min}_{f \in \mathcal{H}_K} J_\gamma(f) \tag{11}
\]
and
\[
f_{n,\gamma} \triangleq \operatorname*{arg\,min}_{f \in \mathcal{H}_K} J_{n,\gamma}(f). \tag{12}
\]
We will refer to (12) as the regularized kernel least squares (RKLS) algorithm. Note that the minimization in (11) and (12) takes place in the entire RKHS $\mathcal{H}_K$, rather than a subset (say, a ball). However, the addition of the regularization term $\gamma \|f\|_K^2$ ensures that the RKLS algorithm does not just select any function $f \in \mathcal{H}_K$ that happens to fit the training data well; instead, it weighs the goodness-of-fit term $L_n(f)$ against the complexity term $\gamma \|f\|_K^2$, since a very large value of $\|f\|_K^2$ would indicate that $f$ might wiggle around a lot and, therefore, overfit the training sample. The regularization parameter $\gamma > 0$ controls the relative importance of the goodness-of-fit and the complexity terms.
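As an aside on computation (not discussed in these notes): by the representer theorem, a standard fact about RKHS problems of this form, the minimizer in (12) can be written as $f_{n,\gamma} = \sum_{i=1}^n \alpha_i K(\cdot, X_i)$ with $\alpha = (\mathbf{K} + n\gamma I)^{-1} Y^n$, where $\mathbf{K}$ is the kernel Gram matrix of the training inputs. A minimal sketch follows; the Gaussian kernel and the synthetic data are assumptions of the example, not part of the notes.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    # Pairwise Gaussian kernel matrix K(a_i, b_j).
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def rkls_fit(X, y, gamma, sigma=0.5):
    """Minimize J_{n,gamma}(f) = L_n(f) + gamma * ||f||_K^2 over H_K.
    Via the representer theorem, f_{n,gamma}(x) = sum_i alpha_i K(x, X_i)
    with alpha = (K + n*gamma*I)^{-1} y."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + n * gamma * np.eye(n), y)

def rkls_predict(alpha, X_train, X_new, sigma=0.5):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

# Usage on synthetic data with |Y| <= M = 1 (bounded, as in assumption (1)).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = np.clip(np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(100), -1.0, 1.0)
alpha = rkls_fit(X, y, gamma=0.05)
print(rkls_predict(alpha, X, np.array([[0.0], [0.5]])))
```

Larger $\gamma$ shrinks the learned coefficients and hence $\|f_{n,\gamma}\|_K$, in line with the bound $\|f_{n,\gamma}\|_K \le M/\sqrt{\gamma}$ established in Lemma 3 below.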

We have the following basic bound on the generalization performance of RKLS:

Theorem 2. With probability at least $1 - \delta$,
\[
L(f_{n,\gamma}) - L^* \le A(\gamma) + \frac{16 M^2 (1 + C_K^2/\gamma)}{\sqrt{n}} + 2 \left( 2 M^2 + \frac{C_K^2 (M^2 + A(\gamma))}{\gamma} \right) \sqrt{\frac{2 \log(2/\delta)}{n}}, \tag{13}
\]
where
\[
A(\gamma) \triangleq \inf_{f \in \mathcal{H}_K} \big[ L(f) + \gamma \|f\|_K^2 \big] - L^*
\]
is the regularized approximation error.

Proof. We start with the following lemma:

Lemma 2.
\[
L(f_{n,\gamma}) - L^* \le \delta_n(f_{n,\gamma}) - \delta_n(f_\gamma) + A(\gamma), \tag{14}
\]
where $\delta_n(f) \triangleq L(f) - L_n(f)$ for all $f$.

Proof. First, an obvious overbounding gives $L(f_{n,\gamma}) - L^* \le J_\gamma(f_{n,\gamma}) - L^*$. Then
\[
\begin{aligned}
J_\gamma(f_{n,\gamma}) &= L(f_{n,\gamma}) + \gamma \|f_{n,\gamma}\|_K^2 \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + \underbrace{L_n(f_{n,\gamma}) + \gamma \|f_{n,\gamma}\|_K^2}_{= J_{n,\gamma}(f_{n,\gamma})} \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + J_{n,\gamma}(f_{n,\gamma}) - J_{n,\gamma}(f_\gamma) + J_{n,\gamma}(f_\gamma) \\
&\le L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + J_{n,\gamma}(f_\gamma) \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) + \gamma \|f_\gamma\|_K^2 \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + L(f_\gamma) + \gamma \|f_\gamma\|_K^2 \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + J_\gamma(f_\gamma),
\end{aligned}
\]
where the inequality holds because $f_{n,\gamma}$ minimizes $J_{n,\gamma}$, so that $J_{n,\gamma}(f_{n,\gamma}) \le J_{n,\gamma}(f_\gamma)$. This gives
\[
\begin{aligned}
L(f_{n,\gamma}) - L^* &\le L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + J_\gamma(f_\gamma) - L^* \\
&= L(f_{n,\gamma}) - L_n(f_{n,\gamma}) + L_n(f_\gamma) - L(f_\gamma) + \inf_{f \in \mathcal{H}_K} \big[ L(f) + \gamma \|f\|_K^2 \big] - L^* \\
&= \delta_n(f_{n,\gamma}) - \delta_n(f_\gamma) + A(\gamma),
\end{aligned}
\]
and we are done.

Lemma 2 shows that the excess loss of the regularized empirical loss minimizer $f_{n,\gamma}$ is bounded from above by the sum of three terms: the deviation $\delta_n(f_{n,\gamma}) = L(f_{n,\gamma}) - L_n(f_{n,\gamma})$ of $f_{n,\gamma}$ itself, the negative deviation $-\delta_n(f_\gamma) = L_n(f_\gamma) - L(f_\gamma)$ of the best regularized predictor $f_\gamma$, and the approximation error $A(\gamma)$. To prove Theorem 2, we will need to obtain high-probability bounds on the two deviation terms. To that end, we need a lemma:

Lemma 3. The functions $f_\gamma$ and $f_{n,\gamma}$ satisfy the bounds
\[
\|f_\gamma\|_\infty \le C_K \sqrt{A(\gamma)/\gamma} \tag{15}
\]
and
\[
\|f_{n,\gamma}\|_K \le M/\sqrt{\gamma} \quad \text{with probability one}, \tag{16}
\]
respectively.

Proof. To prove (15), we use the fact that
\[
A(\gamma) = L(f_\gamma) - L^* + \gamma \|f_\gamma\|_K^2 \ge \gamma \|f_\gamma\|_K^2,
\]
which gives $\|f_\gamma\|_K \le \sqrt{A(\gamma)/\gamma}$. From this and from (3) we obtain (15). For (16), we use the fact that $f_{n,\gamma}$ minimizes $J_{n,\gamma}(f)$ over all $f$. In particular,
\[
J_{n,\gamma}(f_{n,\gamma}) = L_n(f_{n,\gamma}) + \gamma \|f_{n,\gamma}\|_K^2 \le J_{n,\gamma}(0) = \frac{1}{n} \sum_{i=1}^n Y_i^2 \le M^2 \quad \text{w.p. } 1,
\]
where the last step follows from (1). Rearranging and using the fact that $L_n(f) \ge 0$ for all $f$, we get (16).

Now we are ready to bound $\delta_n(f_{n,\gamma})$. For any $R \ge 0$, let $\mathcal{F}_R = \{ f \in \mathcal{H}_K : \|f\|_K \le R \}$ denote the zero-centered ball of radius $R$ in the RKHS $\mathcal{H}_K$. Then Lemma 3 says that $f_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}}$ with probability one. Therefore, with probability one we have
\[
\delta_n(f_{n,\gamma}) = \delta_n(f_{n,\gamma})\, \mathbf{1}\{ f_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}} \} \le |\delta_n(f_{n,\gamma})|\, \mathbf{1}\{ f_{n,\gamma} \in \mathcal{F}_{M/\sqrt{\gamma}} \} \le \underbrace{\sup_{f \in \mathcal{F}_{M/\sqrt{\gamma}}} |\delta_n(f)|}_{\Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}})} = \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}}).
\]
Consequently, we can carry out the same analysis as in the proof of Theorem 1. First of all, the function $g(Z^n) = \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}})$ has bounded differences with
\[
c_1 = \dots = c_n \le \frac{4}{n} \Big( M^2 + \sup_{f \in \mathcal{F}_{M/\sqrt{\gamma}}} \|f\|_\infty^2 \Big) \le \frac{4 M^2 (1 + C_K^2/\gamma)}{n},
\]
where the last step uses (16) and Lemma 1. Therefore, with probability at least $1 - \delta/2$,
\[
\delta_n(f_{n,\gamma}) \le \Delta_n(\ell \circ \mathcal{F}_{M/\sqrt{\gamma}}) \le \frac{8 M^2 (1 + C_K/\sqrt{\gamma})^2}{\sqrt{n}} + 2 M^2 (1 + C_K^2/\gamma) \sqrt{\frac{2 \log(2/\delta)}{n}}, \tag{17}
\]
where the second step follows from (10) with $\delta$ replaced by $\delta/2$ and with $\lambda = M/\sqrt{\gamma}$.
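The remaining deviation term will be handled by Hoeffding's inequality. For convenience, here is the form used below (a standard statement, recalled rather than proved): if $U_1, \dots, U_n$ are i.i.d. random variables with $\mathbb{E} U_i = 0$ and $|U_i| \le B$ with probability one, then for every $t \ge 0$,
\[
\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n U_i \ge t \right) \le \exp\left( -\frac{n t^2}{2 B^2} \right).
\]
With $B = 2\big( M^2 + C_K^2 A(\gamma)/\gamma \big)$, this is exactly the bound applied in the next display.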

It remains to bound $-\delta_n(f_\gamma)$. This is, actually, much easier, since we are dealing with a single data-independent function. In particular, note that we can write
\[
-\delta_n(f_\gamma) = \frac{1}{n} \sum_{i=1}^n (Y_i - f_\gamma(X_i))^2 - \mathbb{E}\big[ (Y - f_\gamma(X))^2 \big] = \frac{1}{n} \sum_{i=1}^n U_i,
\]
where
\[
U_i \triangleq (Y_i - f_\gamma(X_i))^2 - \mathbb{E}\big[ (Y - f_\gamma(X))^2 \big], \qquad 1 \le i \le n,
\]
are i.i.d. random variables with $\mathbb{E} U_i = 0$ and
\[
|U_i| \le \sup_{y \in [-M, M]} \sup_{x \in \mathsf{X}} (y - f_\gamma(x))^2 \le 2 \big( M^2 + \|f_\gamma\|_\infty^2 \big) \le 2 \left( M^2 + \frac{C_K^2 A(\gamma)}{\gamma} \right)
\]
with probability one, where we have used (1) and (15). We can therefore use Hoeffding's inequality to write, for any $t \ge 0$,
\[
\mathbb{P}\big( -\delta_n(f_\gamma) \ge t \big) = \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n U_i \ge t \right) \le \exp\left( -\frac{n t^2}{8 \big( M^2 + C_K^2 A(\gamma)/\gamma \big)^2} \right).
\]
This implies that
\[
-\delta_n(f_\gamma) \le 2 \left( M^2 + \frac{C_K^2 A(\gamma)}{\gamma} \right) \sqrt{\frac{2 \log(2/\delta)}{n}} \tag{18}
\]
with probability at least $1 - \delta/2$. Combining (17) and (18) with (14), and overbounding $(1 + C_K/\sqrt{\gamma})^2 \le 2(1 + C_K^2/\gamma)$ in the first term of (17), we get (13).
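To close with an illustration of the tradeoff that (13) quantifies (this experiment is not part of the notes; the regression function, the kernel, and all parameter values below are arbitrary choices), one can generate synthetic data with a known $f^*$, run RKLS for several values of $\gamma$, and estimate the excess risk $L(f_{n,\gamma}) - L^* = \|f_{n,\gamma} - f^*\|^2_{L^2(P_X)}$ by Monte Carlo: small $\gamma$ keeps the approximation error $A(\gamma)$ small but inflates the $1/\gamma$ factors in the deviation terms, while large $\gamma$ does the opposite.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_kernel(A, B, sigma=0.3):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

f_star = lambda x: np.sin(4.0 * x[:, 0])            # known regression function

def sample(n):
    X = rng.uniform(-1.0, 1.0, size=(n, 1))
    Y = f_star(X) + rng.uniform(-0.3, 0.3, size=n)   # bounded noise, so |Y| <= M
    return X, Y

n = 200
X, Y = sample(n)
X_test, _ = sample(5000)                             # fresh draws from P_X for the risk integral
K = gaussian_kernel(X, X)

for gamma in [1e-4, 1e-2, 1e0]:
    alpha = np.linalg.solve(K + n * gamma * np.eye(n), Y)      # RKLS coefficients
    preds = gaussian_kernel(X_test, X) @ alpha
    excess = np.mean((preds - f_star(X_test)) ** 2)  # ~ ||f_{n,gamma} - f*||^2_{L^2(P_X)}
    print(f"gamma = {gamma:g}: estimated excess risk ~ {excess:.4f}")
```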