E0 370 Statistical Learning Theory    Lecture 10    Sep 15, 2011

Excess Error, Approximation Error, and Estimation Error

Lecturer: Shivani Agarwal    Scribe: Shivani Agarwal

1 Introduction

So far, we have considered the finite sample setting: given a finite sample $S \in (X \times Y)^m$ drawn according to $D^m$, we have seen how to obtain high confidence bounds on the generalization error of a function learned from $S$, usually in terms of some empirical quantity that measures the performance of the function on $S$. Another question of interest concerns the behaviour of a learning algorithm in the infinite sample limit: as it receives more and more data, does the algorithm converge to an optimal prediction rule, i.e. does the generalization error of the learned function approach the optimal error?

Recall that for a distribution $D$ on $X \times Y$ and a loss $\ell : Y \times Y \to [0, \infty)$, the optimal error w.r.t. $D$ and $\ell$ is the lowest possible error achievable by any function $h : X \to Y$:
$$\mathrm{er}^{\ell}_{D,*} = \inf_{h : X \to Y} \mathrm{er}^{\ell}_{D}[h]. \tag{1}$$
For the 0-1 loss, the optimal error is known as the Bayes error. To formalize the above, for any function $h : X \to Y$, define its excess error w.r.t. $D$ and $\ell$ as
$$\mathrm{er}^{\ell}_{D}[h] - \mathrm{er}^{\ell}_{D,*}. \tag{2}$$
We would like to study the behaviour of the excess error of the function learned by an algorithm from a training sample $S \in (X \times Y)^m$ as $m \to \infty$.

As we have seen, since minimizing the error over all possible functions in $Y^X$ can be difficult, most learning algorithms select a function from some fixed function class $H \subseteq Y^X$. In such cases, we can only hope to achieve generalization error close to the lowest possible within the class; we refer to this as the optimal error within $H$ w.r.t. $D$ and $\ell$:
$$\mathrm{er}^{\ell}_{D}[H] = \inf_{h \in H} \mathrm{er}^{\ell}_{D}[h]. \tag{3}$$
It is then useful to view the excess error of functions $h \in H$ as a sum of the following two terms:
$$\mathrm{er}^{\ell}_{D}[h] - \mathrm{er}^{\ell}_{D,*} = \big(\mathrm{er}^{\ell}_{D}[h] - \mathrm{er}^{\ell}_{D}[H]\big) + \big(\mathrm{er}^{\ell}_{D}[H] - \mathrm{er}^{\ell}_{D,*}\big). \tag{4}$$
The first term is called the estimation error, and measures how far $h$ is from the optimal within $H$. The second term, called the approximation error, measures how close one can get to the optimal error using functions in $H$; this is an inherent property of the function class, and forms a lower bound on the excess error of any function learned from $H$. In the following we will focus on the estimation error, which is what a learning algorithm learning from a function class $H$ can hope to minimize.
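To see the decomposition in Eq. (4) play out numerically, here is a small worked example in Python. Everything in it is an illustrative assumption, not from the notes: instances are uniform on $[0,1]$, the true label is $+1$ iff $x \ge 0.3$ with labels flipped independently with probability $0.1$ (so the Bayes error is $0.1$), and $H$ is the class of thresholds $h_\theta(x) = \mathrm{sign}(x - \theta)$ with $\theta$ restricted to $[0.5, 1]$, so that the approximation error is strictly positive. The names ETA, T_STAR, and er are hypothetical.

```python
# Worked instance of the decomposition in Eq. (4). All specifics are
# assumptions for illustration: X uniform on [0,1]; true label +1 iff
# x >= T_STAR, flipped independently with probability ETA (so the Bayes
# error is ETA); H = thresholds h_theta(x) = sign(x - theta) with theta
# restricted to [0.5, 1], so the approximation error is nonzero.

ETA = 0.1     # label noise rate = Bayes error er_{D,*}
T_STAR = 0.3  # Bayes-optimal threshold (outside the range allowed for H)

def er(theta):
    # h_theta disagrees with the Bayes classifier exactly on the interval
    # between T_STAR and theta; there it errs with probability 1 - ETA,
    # elsewhere with probability ETA.
    return ETA + (1 - 2 * ETA) * abs(theta - T_STAR)

er_H = er(0.5)                 # er_D[H]: best achievable with theta in [0.5, 1]
theta = 0.7                    # some particular h in H
estimation = er(theta) - er_H  # first term of Eq. (4): 0.16
approximation = er_H - ETA     # second term of Eq. (4): 0.16
excess = er(theta) - ETA       # their sum: 0.32
assert abs(excess - (estimation + approximation)) < 1e-12
print(estimation, approximation, excess)
```

Here the two terms happen to be equal; moving $\theta$ toward $0.5$ shrinks the estimation error while the approximation error stays fixed, exactly as Eq. (4) separates the two effects.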
We first give a couple of definitions.

2 Statistical Consistency

Definition. Let $H \subseteq Y^X$. Let $A : \cup_{m=1}^{\infty}(X \times Y)^m \to H$ be a learning algorithm that, given a training sample $S \in \cup_{m=1}^{\infty}(X \times Y)^m$, returns a function $h_S \in H$. Let $D$ be a probability distribution on $X \times Y$ and $\ell : Y \times Y \to [0, \infty)$. We say $A$ is statistically consistent in $H$ w.r.t. $D$ and $\ell$ if the estimation error of the function learned by $A$ from $S$ converges in probability to zero, i.e. if for all $\epsilon > 0$,
$$\mathbf{P}_{S \sim D^m}\big(\mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{D}[H] \ge \epsilon\big) \to 0 \ \text{ as } m \to \infty.$$
If $A$ is consistent in $H$ w.r.t. $D$ and $\ell$ for all distributions $D$ on $X \times Y$, we say $A$ is universally consistent in $H$ w.r.t. $\ell$.¹

Definition. Let $A : \cup_{m=1}^{\infty}(X \times Y)^m \to Y^X$ be a learning algorithm that, given a training sample $S \in \cup_{m=1}^{\infty}(X \times Y)^m$, returns a function $h_S : X \to Y$. Let $D$ be a probability distribution on $X \times Y$ and $\ell : Y \times Y \to [0, \infty)$. We say $A$ is Bayes consistent w.r.t. $D$ and $\ell$ if the excess error of the function learned by $A$ from $S$ converges in probability to zero, i.e. if for all $\epsilon > 0$,
$$\mathbf{P}_{S \sim D^m}\big(\mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{D,*} \ge \epsilon\big) \to 0 \ \text{ as } m \to \infty.$$
If $A$ is Bayes consistent w.r.t. $D$ and $\ell$ for all distributions $D$ on $X \times Y$, we say $A$ is universally Bayes consistent w.r.t. $\ell$.²

One can also define analogous notions of strong consistency, which require almost sure convergence instead of convergence in probability.

¹ Note that one could also define a notion of consistency in terms of convergence in expectation, which would require that $\mathbf{E}_{S \sim D^m}\big[\mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{D}[H]\big] \to 0$ as $m \to \infty$. It is easy to show that a sequence of bounded, non-negative random variables converges in probability if and only if it converges in expectation (show this!), and therefore when the loss function $\ell$ is bounded, consistency in terms of convergence in probability is equivalent to consistency in terms of convergence in expectation.

² Note that the term Bayes consistency is usually used to refer to convergence to the optimal error for binary classification with the 0-1 loss; we will use the term for any learning problem/loss function to distinguish it from consistency within $H$.

3 Consistency of Empirical Risk Minimization in H

Let $H \subseteq Y^X$ and $\ell : Y \times Y \to [0, \infty)$. Consider the empirical risk minimization (ERM) algorithm in $H$, which given a training sample $S \in (X \times Y)^m$ returns³
$$h_S \in \arg\min_{h \in H} \mathrm{er}^{\ell}_{S}[h]. \tag{5}$$
Then for any distribution $D$ on $X \times Y$, we can write the estimation error of $h_S$ as
$$\mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{D}[H] = \big(\mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{S}[h_S]\big) + \big(\mathrm{er}^{\ell}_{S}[h_S] - \mathrm{er}^{\ell}_{D}[H]\big) \tag{6}$$
$$\le \big(\mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{S}[h_S]\big) + \sup_{h \in H}\big(\mathrm{er}^{\ell}_{S}[h] - \mathrm{er}^{\ell}_{D}[h]\big) \tag{7}$$
$$\le 2 \sup_{h \in H}\big|\mathrm{er}^{\ell}_{S}[h] - \mathrm{er}^{\ell}_{D}[h]\big|, \tag{8}$$
where (7) uses the fact that $\mathrm{er}^{\ell}_{S}[h_S] \le \mathrm{er}^{\ell}_{S}[h]$ for all $h \in H$. Therefore, uniform convergence of empirical errors in $H$ implies consistency of ERM in $H$! In particular, for binary classification, we immediately have the following:

Theorem 3.1. Let $H \subseteq \{\pm 1\}^X$ and $\ell = \ell_{0\text{-}1}$. If $\mathrm{VCdim}(H) = d < \infty$, then ERM in $H$ is universally consistent in $H$ w.r.t. $\ell_{0\text{-}1}$.

Proof. Let $D$ be any probability distribution on $X \times \{\pm 1\}$. Let $\epsilon > 0$. Then
$$\begin{aligned}
\mathbf{P}_{S \sim D^m}\big(\mathrm{er}_{D}[h_S] - \mathrm{er}_{D}[H] \ge \epsilon\big)
&\le \mathbf{P}\Big(\sup_{h \in H}\big|\mathrm{er}_{S}[h] - \mathrm{er}_{D}[h]\big| \ge \tfrac{\epsilon}{2}\Big) && \text{(by Eq. (8))} \quad (9)\\
&\le 4\Big(\frac{2em}{d}\Big)^{d} e^{-m\epsilon^2/32} && \text{(by previous results)} \quad (10)\\
&\to 0 \ \text{ as } m \to \infty. && (11)
\end{aligned}$$

³ We assume for simplicity that the minimum is achieved in $H$; the results we discuss continue to hold if $h_S$ is selected to be any function in $H$ whose empirical error is within an appropriately small precision of $\inf_{h \in H} \mathrm{er}^{\ell}_{S}[h]$.
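Before the remarks that follow, a quick simulation may help build intuition for Theorem 3.1. This is a minimal sketch under the same illustrative toy distribution as in the earlier example (again an assumption, not part of the notes); $H$ is now the full threshold class on $[0,1]$, which has VC-dimension 1 and contains the Bayes classifier, so $\mathrm{er}_{D}[H]$ equals the Bayes error and the printed quantity is exactly the estimation error. The helpers sample and erm are hypothetical names.

```python
# ERM over thresholds h_theta(x) = +1 iff x >= theta, under the assumed toy
# distribution (uniform X on [0,1], Bayes threshold 0.3, noise rate 0.1).
import random

ETA, T_STAR = 0.1, 0.3

def er(theta):
    # Closed-form generalization error of h_theta (cf. the previous sketch).
    return ETA + (1 - 2 * ETA) * abs(theta - T_STAR)

def sample(m, rng):
    xs = [rng.random() for _ in range(m)]
    ys = [(1 if x >= T_STAR else -1) * (-1 if rng.random() < ETA else 1)
          for x in xs]
    return xs, ys

def erm(xs, ys):
    """ERM over all thresholds via one sorted sweep.

    The empirical error is piecewise constant in theta, changing by +/- 1/m
    as theta crosses a sample point, so scanning the sorted points suffices.
    """
    pairs = sorted(zip(xs, ys))
    errors = sum(1 for _, y in pairs if y == -1)  # theta = 0: predict +1 everywhere
    best_err, best_theta = errors, 0.0
    for x, y in pairs:
        errors += 1 if y == 1 else -1  # theta just above x: x now predicted -1
        if errors < best_err:
            best_err, best_theta = errors, x + 1e-12
    return best_theta

rng = random.Random(0)
for m in [10, 100, 1000, 10000]:
    theta_hat = erm(*sample(m, rng))
    # Estimation error er_D[h_S] - er_D[H]; here er_D[H] = Bayes error = ETA.
    print(f"m = {m:5d}   estimation error = {er(theta_hat) - ETA:.4f}")
```

Any single run is noisy, but as $m$ grows the estimation error should shrink, broadly in line with the $O\big(\sqrt{\ln m / m}\big)$ rate discussed in the remarks below.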
Several remarks are in order:

1. As we have noted before, for binary classification, ERM is typically not computationally efficient, except for some simple classes $H$. We will later discuss consistency of algorithms that minimize a convex upper bound on $\ell_{0\text{-}1}$.

2. Note that for any $0 < \delta \le 1$, we have with probability at least $1 - \delta$ over $S$,
$$\mathrm{er}_{D}[h_S] - \mathrm{er}_{D}[H] \le c\,\sqrt{\frac{d \ln m + \ln\frac{1}{\delta}}{m}}.$$
As a function of the sample size $m$, this gives a rate of convergence of $O\big(\sqrt{\ln m / m}\big)$ for the estimation error. For distributions $D$ for which $\mathrm{er}_{D}[H] = 0$ (so that there is a target function $t \in H$ such that with probability 1, the true label $y$ of any instance $x$ under $D$ is given by $t(x)$, i.e. $\mathbf{P}_{(x,y) \sim D}(y = t(x)) = 1$), one can actually show a faster rate of convergence of $O(\ln m / m)$. This follows from a better uniform convergence bound for such distributions, with an $e^{-cm\epsilon}$ term in the bound rather than $e^{-cm\epsilon^2}$; we probably will not show this for the general case, but will show this for finite $H$ in a later lecture. A derivation for the general case can be found for example in [1].

3. It is important to note that the above result applies only to classes of finite VC-dimension. Since no such class can have zero approximation error for all distributions, ERM in such a class cannot achieve universal Bayes consistency.

4. For classes $H$ of finite VC-dimension, the above result actually establishes that ERM in $H$ is strongly universally consistent in $H$, by virtue of the Borel-Cantelli lemma (see [1]).

4 Consistency of Structural Risk Minimization in $H = \cup_k H_k$

Let $H_1 \subseteq H_2 \subseteq \ldots$, where $H_k \subseteq Y^X$. Let $\ell : Y \times Y \to [0, \infty)$. Given a training sample $S \in (X \times Y)^m$, the structural risk minimization (SRM) algorithm in $(H_k)_{k=1}^{\infty}$ returns
$$h_S = h_S^{\hat{k}}, \quad \hat{k} \in \arg\min_{k \in \mathbb{N}}\Big(\mathrm{er}^{\ell}_{S}[h_S^{k}] + \mathrm{penalty}(k, m)\Big), \tag{12}$$
where $h_S^{k} \in H_k$ is the function returned by ERM in $H_k$, and $\mathrm{penalty}(k, m)$ is a penalty term that increases with the complexity of $H_k$. Under certain conditions, one can show that SRM in $(H_k)_{k=1}^{\infty}$ is consistent in $H = \cup_{k=1}^{\infty} H_k$; if in addition the sequence $(H_k)_{k=1}^{\infty}$ is such that $H = \cup_{k=1}^{\infty} H_k$ has zero approximation error, then SRM in $(H_k)_{k=1}^{\infty}$ can also be Bayes consistent. For example, for binary classification, we have the following result:

Theorem 4.1 (Lugosi and Zeger, 1996). Let $H_1 \subseteq H_2 \subseteq \ldots$, where $H_k \subseteq \{\pm 1\}^X$, $\mathrm{VCdim}(H_k) = d_k < \infty$, and $d_k < d_{k+1}$. Let $\ell = \ell_{0\text{-}1}$. Then SRM with penalties given by
$$\mathrm{penalty}(k, m) = \sqrt{\frac{8 d_k \ln(em) + k}{m}}$$
is universally consistent in $H = \cup_{k=1}^{\infty} H_k$ w.r.t. $\ell_{0\text{-}1}$.
Proof. Let $D$ be any probability distribution on $X \times \{\pm 1\}$. Let $\epsilon > 0$. We can write the estimation error of $h_S$ as
$$\mathrm{er}_{D}[h_S] - \mathrm{er}_{D}[H] = \Big(\mathrm{er}_{D}[h_S] - \inf_{k}\big(\mathrm{er}_{S}[h_S^{k}] + \mathrm{penalty}(k, m)\big)\Big) + \Big(\inf_{k}\big(\mathrm{er}_{S}[h_S^{k}] + \mathrm{penalty}(k, m)\big) - \mathrm{er}_{D}[H]\Big). \tag{13}$$
Therefore we have
$$\mathbf{P}_{S \sim D^m}\big(\mathrm{er}_{D}[h_S] - \mathrm{er}_{D}[H] \ge \epsilon\big) \le \mathbf{P}\Big(\mathrm{er}_{D}[h_S] - \inf_{k}\big(\mathrm{er}_{S}[h_S^{k}] + \mathrm{penalty}(k, m)\big) \ge \tfrac{\epsilon}{2}\Big) + \mathbf{P}\Big(\inf_{k}\big(\mathrm{er}_{S}[h_S^{k}] + \mathrm{penalty}(k, m)\big) - \mathrm{er}_{D}[H] \ge \tfrac{\epsilon}{2}\Big). \tag{14}$$
We will bound each probability in turn. For the first probability, we have
$$\begin{aligned}
\mathbf{P}\Big(&\mathrm{er}_{D}[h_S] - \inf_{k}\big(\mathrm{er}_{S}[h_S^{k}] + \mathrm{penalty}(k, m)\big) \ge \tfrac{\epsilon}{2}\Big) && (15)\\
&\le \mathbf{P}\Big(\sup_{k}\big(\mathrm{er}_{D}[h_S^{k}] - \mathrm{er}_{S}[h_S^{k}] - \mathrm{penalty}(k, m)\big) \ge \tfrac{\epsilon}{2}\Big) && \text{(since } h_S = h_S^{\hat{k}} \text{ for some } \hat{k}\text{)} \quad (16)\\
&\le \sum_{k=1}^{\infty} \mathbf{P}\Big(\mathrm{er}_{D}[h_S^{k}] - \mathrm{er}_{S}[h_S^{k}] \ge \tfrac{\epsilon}{2} + \mathrm{penalty}(k, m)\Big) && \text{(by union bound)} \quad (17)\\
&\le \sum_{k=1}^{\infty} 4\Big(\frac{2em}{d_k}\Big)^{d_k} e^{-m(\epsilon/2 + \mathrm{penalty}(k, m))^2/8} && (18)\\
&\le \sum_{k=1}^{\infty} 4\Big(\frac{2em}{d_k}\Big)^{d_k} e^{-m\epsilon^2/32}\, e^{-m\,\mathrm{penalty}(k, m)^2/8} && \text{(using } (a+b)^2 \ge a^2 + b^2\text{)} \quad (19)\\
&= 4\, e^{-m\epsilon^2/32} \sum_{k=1}^{\infty} \Big(\frac{2em}{d_k}\Big)^{d_k} e^{-(8 d_k \ln(em) + k)/8} && (20)\\
&= 4\, e^{-m\epsilon^2/32} \sum_{k=1}^{\infty} \Big(\frac{2}{d_k}\Big)^{d_k} e^{-k/8} && (21)\\
&\le \frac{8\, e^{-m\epsilon^2/32}}{1 - e^{-1/8}}, && (22)
\end{aligned}$$
where the last step uses $(2/d_k)^{d_k} \le 2$ and $\sum_{k=1}^{\infty} e^{-k/8} \le 1/(1 - e^{-1/8})$. For the second probability, let $k_0$ be such that
$$\mathrm{er}_{D}[H_{k_0}] \le \mathrm{er}_{D}[H] + \frac{\epsilon}{4}, \tag{23}$$
and let $m_0$ be such that for all $m \ge m_0$,
$$\mathrm{penalty}(k_0, m) \le \frac{\epsilon}{8}. \tag{24}$$
Then we have
$$\begin{aligned}
\mathbf{P}\Big(&\inf_{k}\big(\mathrm{er}_{S}[h_S^{k}] + \mathrm{penalty}(k, m)\big) - \mathrm{er}_{D}[H] \ge \tfrac{\epsilon}{2}\Big) && (25)\\
&\le \mathbf{P}\Big(\inf_{k}\big(\mathrm{er}_{S}[h_S^{k}] + \mathrm{penalty}(k, m)\big) - \mathrm{er}_{D}[H_{k_0}] \ge \tfrac{\epsilon}{4}\Big) && \text{(by Eq. (23))} \quad (26)\\
&\le \mathbf{P}\Big(\mathrm{er}_{S}[h_S^{k_0}] + \mathrm{penalty}(k_0, m) - \mathrm{er}_{D}[H_{k_0}] \ge \tfrac{\epsilon}{4}\Big) && (27)\\
&\le \mathbf{P}\Big(\mathrm{er}_{S}[h_S^{k_0}] - \mathrm{er}_{D}[H_{k_0}] \ge \tfrac{\epsilon}{8}\Big), \quad \text{for } m \ge m_0 && \text{(by Eq. (24))} \quad (28)\\
&\le \mathbf{P}\Big(\sup_{h \in H_{k_0}}\big|\mathrm{er}_{S}[h] - \mathrm{er}_{D}[h]\big| \ge \tfrac{\epsilon}{8}\Big) && (29)\\
&\le 4\Big(\frac{2em}{d_{k_0}}\Big)^{d_{k_0}} e^{-m\epsilon^2/512}. && (30)
\end{aligned}$$
Thus we have
$$\mathbf{P}_{S \sim D^m}\big(\mathrm{er}_{D}[h_S] - \mathrm{er}_{D}[H] \ge \epsilon\big) \le \frac{8\, e^{-m\epsilon^2/32}}{1 - e^{-1/8}} + 4\Big(\frac{2em}{d_{k_0}}\Big)^{d_{k_0}} e^{-m\epsilon^2/512}, \quad \text{for } m \ge m_0 \tag{31}$$
$$\to 0 \ \text{ as } m \to \infty. \tag{32}$$
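To get a feel for how the SRM rule (12) trades empirical fit against the penalty of Theorem 4.1, here is a minimal sketch. The nested sequence is an illustrative assumption, not from the notes: $H_k$ is the class of $\pm 1$-valued functions constant on each of the $2^k$ dyadic bins of $[0,1]$, so that $H_k \subseteq H_{k+1}$, $\mathrm{VCdim}(H_k) = 2^k$ is strictly increasing, and ERM in $H_k$ is a per-bin majority vote; the data again come from the assumed toy distribution of the earlier sketches, and the helpers penalty and erm_histogram are hypothetical names.

```python
# A sketch of SRM (Eq. (12)) with the penalty of Theorem 4.1, over assumed
# nested classes H_k = {functions constant on each of 2^k dyadic bins}.
import math
import random

ETA, T_STAR = 0.1, 0.3

def sample(m, rng):
    xs = [rng.random() for _ in range(m)]
    ys = [(1 if x >= T_STAR else -1) * (-1 if rng.random() < ETA else 1)
          for x in xs]
    return xs, ys

def penalty(k, m, d_k):
    # penalty(k, m) = sqrt((8 d_k ln(em) + k) / m), as in Theorem 4.1.
    return math.sqrt((8 * d_k * math.log(math.e * m) + k) / m)

def erm_histogram(xs, ys, k):
    # ERM in H_k: majority label in each of the 2^k bins (ties -> +1).
    bins = 2 ** k
    votes = [0] * bins
    for x, y in zip(xs, ys):
        votes[min(int(x * bins), bins - 1)] += y
    labels = [1 if v >= 0 else -1 for v in votes]
    emp = sum(1 for x, y in zip(xs, ys)
              if labels[min(int(x * bins), bins - 1)] != y) / len(xs)
    return labels, emp

rng = random.Random(1)
m = 5000
xs, ys = sample(m, rng)
best_k, best_obj = None, float("inf")
for k in range(1, 8):
    d_k = 2 ** k
    _, emp = erm_histogram(xs, ys, k)
    obj = emp + penalty(k, m, d_k)  # the SRM objective of Eq. (12)
    print(f"k={k}  d_k={d_k:3d}  emp={emp:.3f}  "
          f"pen={penalty(k, m, d_k):.3f}  obj={obj:.3f}")
    if obj < best_obj:
        best_k, best_obj = k, obj
print("SRM selects k =", best_k)
```

Note how conservative the penalty is relative to the empirical errors: for moderate $m$, SRM picks a small $k$, and richer classes are selected only once $m$ is large enough that $\mathrm{penalty}(k, m)$ is small next to the empirical gains — the same mechanism that Eqs. (23)–(24) exploit in the proof.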
A couple of remarks:

1. As noted above, if the sequence $(H_k)_{k=1}^{\infty}$ is such that
$$\inf_{k} \inf_{h \in H_k} \mathrm{er}^{0\text{-}1}_{D}[h] = \mathrm{er}^{0\text{-}1}_{D,*}$$
for all distributions $D$ on $X \times \{\pm 1\}$ (i.e. if the approximation error of $H = \cup_{k=1}^{\infty} H_k$ is zero for all $D$), then SRM in $(H_k)_{k=1}^{\infty}$ as above is universally Bayes consistent w.r.t. $\ell_{0\text{-}1}$.

2. Again, except for the simplest problems, SRM (particularly for binary classification) is often not computationally feasible; however it is useful as a theoretical tool for understanding model selection techniques and Bayes consistency, and can also serve as a guide for the development of approximate algorithms.

5 Consistency and Learnability: Two Sides of the Same Coin

In the next few lectures we will turn to learnability, and then return to a more detailed discussion of statistical consistency. As we will see, the two notions are closely related, although they arose in different communities and tend to emphasize somewhat different aspects:

Statistical Consistency:
- Origins in statistics.
- Starts with a learning algorithm; asks if it is statistically consistent.
- Both consistency within $H$ and Bayes consistency of interest.
- Mostly distribution-free; also interested in low-noise settings.
- Focus on convergence rates (in $\epsilon$, $\delta$).

Learnability:
- Origins in theoretical computer science.
- Starts with a function class $H$; asks if there is a learning algorithm that is statistically consistent in $H$ (with an additional requirement we will see next time).
- By definition, interest is in consistency w.r.t. $H$.
- Often assumes $\mathrm{er}^{\ell}_{D}[H] = 0$ (target function setting); mostly distribution-free otherwise, but sometimes interested in specific distributions such as the uniform distribution over the Boolean cube $X = \{0, 1\}^n$.
- Focus on sample complexity $m(\epsilon, \delta)$ and computational complexity.

6 Next Lecture

In the next lecture we will introduce the notion of learnability, and will give a few basic results and examples to illustrate the concept. The next few lectures after that will discuss more results and examples related to learnability, before we return to talk more about statistical consistency.

References

[1] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.