TECHNICAL REPORT NO Generalization and Regularization in Nonlinear Learning Systems 1


DEPARTMENT OF STATISTICS
University of Wisconsin
1210 West Dayton St.
Madison, WI

TECHNICAL REPORT
February 28, 2000

Generalization and Regularization in Nonlinear Learning Systems ¹

by Grace Wahba

¹ Prepared for the Handbook of Brain Theory and Neural Networks, Second Edition, Michael Arbib, Ed., within the space and reference limitations of the Handbook. This TR is an updated version of the entry of the same name in the First Edition, 1995, which was also printed as a TR. Supported by NIH Grant EY09946 and an NSF DMS grant.

Generalization and Regularization in Nonlinear Learning Systems
Grace Wahba
Department of Statistics, University of Wisconsin, 1210 W. Dayton St., Madison, WI
wahba@stat.wisc.edu
February 24

1 Introduction

In this article we will describe generalization and regularization from the point of view of multivariate function estimation in a statistical context. Multivariate function estimation is not, in principle, distinguishable from supervised machine learning. However, until fairly recently supervised machine learning and multivariate function estimation had fairly distinct groups of practitioners, and small overlap in language, literature, and in the kinds of practical problems under study. In any case, we are given a training set, consisting of pairs of input (feature) vectors and associated outputs $\{t(i), y_i\}$, for $n$ training or example subjects, $i = 1, \ldots, n$. From this data, it is desired to construct a map which generalizes well, that is, given a new value of $t$, the map will provide a reasonable prediction for the unobserved output associated with this $t$.

Most applications fall into one of two broad categories, which might be called nonparametric regression and classification. In nonparametric regression, $y$ may be (any) real number or a vector of $r$ real numbers. The desired algorithm will produce an estimate $\hat{f}(t)$ of the expected value of a (new) $y$ to be associated with a (new) attribute vector $t$. In the (two-class) classification problem $y_i$ will be an indicator of whether or not the example (subject) came from class A. In some classification applications, the desired algorithm will, given $t$, return an indicator which predicts whether or not an example with attribute vector $t$ comes from class A ("hard" classification). In other applications the desired algorithm will return $p(t)$, an estimate of the probability that the example with attribute vector $t$ is in class A ("soft" classification). In some applications the feature vector $t$ of dimension $d$ contains zeroes and ones (for example as in a bitmap of handwriting); in others it may contain real numbers representing some physical quantities. Ordered or unordered category indicators are also possible, as in medical demographic studies.
Regularization, loosely speaking, means that while the desired map is constructed to approximately send the observed feature vectors to the observed outputs, constraints are applied to the construction of the map with the goal of reducing the generalization error. In some applications, these constraints embody a priori information concerning the true relationship between input and output; alternatively, various ad hoc constraints have sometimes been shown to work well in practice. Girosi, Jones and Poggio (1995) give a wide-ranging review.

2 Generalization and Regularization in Nonparametric Regression

2.1 Single Input Spline Smoothing

We will use Figure 1 to illustrate the ideas of generalization and regularization in the simplest possible nonparametric regression setup, that is, $d = 1$, $r = 1$, with $t$ any real number in some interval of the real line. The circles (which are identical in each of the three panels of Figure 1) represent $n = 100$ (synthetically generated) input-output pairs $\{t(i), y_i\}$, generated according to the model

$$y_i = f_{TRUE}(t(i)) + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $f_{TRUE}(t) = 4.26\,(e^{-t} - 4e^{-2t} + 3e^{-3t})$, and the $\epsilon_i$ came from a pseudo-random number generator for Normally distributed random variables with mean 0 and standard deviation $\sigma = 0.2$. Given this training data $\{t(i), y_i,\ i = 1, \ldots, n\}$, the learning problem is to create a map which, if given a new value of $t$, will predict the response $y(t)$. In this case, the data are noisy, so that even if the new $t$ coincides with some predictor variable $t(i)$ in the training set, merely predicting $y$ as the response $y_i$ is not likely to be satisfactory. Also, this does not yet provide any ability to make predictions when $t$ does not exactly match any predictor values in the training set. It is desired to generate a curve which will allow a reasonable prediction of the response for any $t$ within a reasonable vicinity of the set of training predictors $\{t(i)\}$.

The dashed line in each panel of Figure 1 is $f_{TRUE}(t)$; the three solid black lines in the three panels of Figure 1 are three solutions to the variational problem: Find $f$ in the (Hilbert) space $W_2$ of functions with continuous first derivatives and square integrable second derivatives which minimizes

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - f(t(i)))^2 + \lambda \int (f^{(2)}(u))^2\, du, \qquad (2)$$

for three different values of $\lambda$. The parameter $\lambda$ is known as the regularization or smoothing parameter. As $\lambda \to \infty$, $f_\lambda$ tends to the least squares straight line best fitting the data, and as $\lambda \to 0$ the solution tends to that curve in $W_2$ which minimizes the penalty functional $J(f) = \int (f^{(2)}(u))^2\, du$ subject to interpolating the data (provided the $\{t(i)\}$ are distinct). This latter interpolating curve is known as a cubic interpolating spline, and minimizers of (2) are known as smoothing splines. See Wahba (1990) and references cited there for further information concerning these and other properties of splines noted below, and further references. In the top panel of Figure 1 $\lambda$ has been chosen too small, and the wiggly solid line is attempting to fit the data too closely.
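The effect of $\lambda$ in (2) can be explored numerically. The sketch below is not the cubic smoothing spline itself. It assumes equally spaced $t(i)$ on $[0, 3.5]$ (the report does not state the design) and replaces the integral of the squared second derivative by a sum of squared second differences of the fitted values, so that the minimizer comes from a single linear solve. The function names, the seed and the interval are illustrative choices, not taken from the report.

```python
import numpy as np

def f_true(t):
    # Test function from model (1).
    return 4.26 * (np.exp(-t) - 4.0 * np.exp(-2.0 * t) + 3.0 * np.exp(-3.0 * t))

def make_data(n=100, sigma=0.2, seed=0):
    # Assumption: equally spaced design on [0, 3.5]; the report does not give it.
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 3.5, n)
    y = f_true(t) + sigma * rng.normal(size=n)
    return t, y

def second_difference_matrix(n, h):
    # (n-2) x n matrix D with (D f)_k = (f_k - 2 f_{k+1} + f_{k+2}) / h^2.
    D = np.zeros((n - 2, n))
    for k in range(n - 2):
        D[k, k:k + 3] = np.array([1.0, -2.0, 1.0]) / h**2
    return D

def discrete_smoother(t, y, lam):
    # Minimizes (1/n) * ||y - f||^2 + lam * h * ||D f||^2 over the vector f,
    # a discrete stand-in for criterion (2). The normal equations are
    # (I + n * lam * h * D'D) f = y.
    n = len(t)
    h = t[1] - t[0]
    D = second_difference_matrix(n, h)
    A = np.eye(n) + n * lam * h * (D.T @ D)
    return np.linalg.solve(A, y)

t, y = make_data()
for lam in (1e-10, 1e-3, 1e2):   # small, intermediate and large smoothing parameters
    fit = discrete_smoother(t, y, lam)
    train_err = np.mean((y - fit) ** 2)           # L(y, f): distance to the data
    true_err = np.mean((f_true(t) - fit) ** 2)    # distance to f_TRUE, unobservable in practice
    print(f"lambda={lam:g}  L(y, f)={train_err:.4f}  L(f_TRUE, f)={true_err:.4f}")
```

In this discrete analogue the two limits behave as described for (2): a very small `lam` nearly reproduces the data, and a very large `lam` forces the second differences to zero, which leaves the least squares straight line.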
It can be seen that using the wiggly curve in the top panel is not likely to give a good prediction of $y$, assuming that future predictor-response data is generated by the same mechanism as the training data. In the middle panel, $\lambda$ has been chosen too large, the curve has been forced to flatten out, and again it can be seen that the heavy line will not give a good prediction of $y$. In the bottom panel, $\lambda$ has been chosen by generalized cross validation (GCV). This is a method which behaves similarly to leaving-out-one in many cases but with computational and theoretical advantages. See Li (1986), Wahba (1990, Chapter 4), Girard (1998). It can be seen that the $\lambda$ obtained this way does a good job of choosing the right amount of smoothing to best recover $f_{TRUE}$ of Equation (1). The $f_{TRUE}$ of Equation (1) would provide the best predictor of the response in an expected mean square error sense if future data were generated according to Equation (1). The curve in the bottom panel has a reasonable ability to generalize, that is, to predict the response given a new value $t$ of the predictor variable, at least if $t$ is not too far from the training predictor set $\{t(i)\}$.

Figure 1: Training data (circles) have been generated by adding noise to $f_{TRUE}(t)$, shown by the dashed curve in each panel. All three panels have the same data. Top: Solid curve is fitted spline with $\lambda$ too small. Middle: Solid curve is fitted spline with $\lambda$ too large. Bottom: Solid curve is fitted spline with $\lambda$ obtained by generalized cross validation.

For each positive $\lambda$, there exists a unique $\kappa = \kappa(\lambda)$ so that the minimizer $f_\lambda$ of (2) is also the solution to the problem: Find $f$ in $W_2$ to minimize

$$L(y, f) = \frac{1}{n}\sum_{i=1}^{n} (y_i - f(t(i)))^2 \qquad (3)$$

subject to the condition

$$J(f) = \int (f^{(2)}(u))^2\, du \le \kappa. \qquad (4)$$

As $\lambda$ becomes large, the associated $\kappa(\lambda)$ becomes small, and conversely. In general, the term regularization refers to solving some problem involving best fitting, subject to some constraint(s) on the solution. These constraints may be of various forms. When they involve a quadratic penalty involving derivatives, like $J(f)$, the method is commonly referred to as Tikhonov regularization. The tighter the constraints (i.e. the smaller $\kappa$, equivalently the larger $\lambda$) the further away the solution $f_\lambda$ will generally be from the training data, that is, $L$ will be larger. As the constraints get weaker and weaker, ultimately (if there are enough degrees of freedom in the method) the solution will interpolate the data. However, as is clear from Figure 1, a curve which runs through all the data points is not a good solution. A fundamental problem in machine learning with noisy and/or incomplete data is to balance the tightness of the constraints with the goodness of fit to the data, in such a way as to minimize the generalization error, that is, the error made in predicting the unobserved response for new values of $t$. This tradeoff is by now well known as the bias-variance tradeoff, or, equivalently, the goodness of fit - model complexity tradeoff.

Methods abound in the statistical literature for univariate curve fitting, including Parzen kernel estimates, nearest neighbor estimates, orthogonal series estimates, least squares regression spline estimates, and, recently, wavelet estimates. Each method has one or more regularization parameters, be they kernel window widths, numbers of nearest neighbors included, number of terms in the orthogonal series expansion or regression basis, or factors or thresholds for shrinking or truncating wavelet coefficients, that control this tradeoff. See Ramsay and Silverman (1997) and references cited there.

2.2 Multiple Input, Single Hidden Layer Feed-Forward Neural Net

A multiple input, single hidden layer feed-forward neural net (NN) predictor for the learning problem of Section 1 is typically of the form

$$f_{NN}(t) = \sigma_0\Big(b + \sum_{j=1}^{N} w_j\, \sigma_h(a_j' t + b_j)\Big) \qquad (5)$$

where the $a_j$ and $t$ are $d$-vectors. The function $\sigma_h$ is the so-called activation function of the hidden layer and $\sigma_0$ is the activation function for the output. $\sigma_h$ is generally a sigmoidal function, for example, $\sigma_h(\tau) = e^{\tau}/(1 + e^{\tau})$, while $\sigma_0$ may be linear, sigmoidal or a threshold unit.
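A minimal sketch of evaluating Equation (5) for one input vector follows. The array layout, the argument names and the numerical parameter values are hypothetical, chosen only to make the formula concrete; the output activation defaults to the linear case.

```python
import numpy as np

def sigmoid(tau):
    # sigma_h(tau) = e^tau / (1 + e^tau), written in the equivalent form 1 / (1 + e^-tau).
    return 1.0 / (1.0 + np.exp(-tau))

def f_nn(t, a, b_hidden, w, b_out, sigma_out=lambda z: z):
    """Evaluate Equation (5) for one input vector t.

    a: (N, d) array whose rows are the a_j; b_hidden: (N,) offsets b_j;
    w: (N,) output weights w_j; b_out: scalar offset b;
    sigma_out: output activation sigma_0 (linear by default).
    """
    hidden = sigmoid(a @ t + b_hidden)      # N hidden unit outputs
    return sigma_out(b_out + w @ hidden)    # weighted sum passed through sigma_0

# Hypothetical parameter values with N = 3 hidden units and d = 2 inputs.
rng = np.random.default_rng(0)
N, d = 3, 2
a = rng.normal(size=(N, d))
b_hidden = rng.normal(size=N)
w = rng.normal(size=N)
t_new = np.array([0.5, -1.0])

print(f_nn(t_new, a, b_hidden, w, b_out=0.1))                     # linear output unit
print(f_nn(t_new, a, b_hidden, w, b_out=0.1, sigma_out=sigmoid))  # sigmoidal output unit
```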
Here $N$ is the number of hidden units, and the $w_j$, $a_j$ and $b_j$ are learned from the training data by some appropriate iterative descent algorithm that tries to steer these values towards minimizing some distance measure, typically $L(y, f_{NN}) = \frac{1}{n}\sum_{i=1}^{n} (y_i - f_{NN}(t(i)))^2$. It is clear that if $N$ is sufficiently large, and the descent algorithm is run long enough, it should be possible to drive $L$ as close as one likes to 0. (In practice it is possible to get stuck in local minima.) However, it is also clear intuitively from Figure 1 that driving $L$ all the way to zero is not a desirable thing to do. Regularization in this problem may be done by controlling the size of $N$, by imposing penalties on the $w_j$, by stopping the descent algorithm early, that is, not driving down $L$ as far as it can go, or by various combinations of these strategies. Each will influence how closely $f_{NN}$ will fit the data, how wiggly it will be, and how well it will be able to predict unobserved data that is generated by a similar mechanism as the observed data.

2.3 Multiple Input Radial Basis Function and Related Estimates

Radial basis functions are rapidly becoming a popular method for nonparametric regression. We first describe a general form of nonparametric regression which will specialize to radial basis functions and other methods of interest. Let $R(s, t)$ be any symmetric, strictly positive definite function on $E^d \times E^d$. Here strictly positive definite means that for any $K = 1, 2, \ldots$ the $K \times K$ matrix with $j,k$th entry $R(s(j), s(k))$ is strictly positive definite whenever the $s(1), \ldots, s(K)$ are distinct. (A symmetric $K \times K$ matrix $M$ is said to be positive definite if for any $K$ dimensional column vector $x$, $x'Mx$ is greater than or equal to 0, and is said to be strictly positive definite if $x'Mx$ is strictly greater than 0 for every nonzero $x$.) Positive definiteness will play a key role in the discussion below because (among other reasons) any positive definite matrix can be the covariance matrix of a random vector and any positive definite function $R(s, t)$ can be the covariance function of some stochastic process, $X(t)$. That is, there exists $X(\cdot)$ such that $\mathrm{Cov}\, X(s)X(t) = R(s, t)$.
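The definition can be checked numerically for a particular choice of $R$. The sketch below assumes the Gaussian function $R(s,t) = \exp(-\|s-t\|^2/\text{scale}^2)$, a standard example of a strictly positive definite function; the points, the scale and the function names are illustrative. It builds the matrix with entries $R(s(j), s(k))$ for distinct points, confirms that its smallest eigenvalue is positive, and uses a Cholesky factor to draw a random vector having that matrix as its covariance.

```python
import numpy as np

def gaussian_R(s, t, scale=0.5):
    # Assumed example of a strictly positive definite function on E^d x E^d.
    return np.exp(-np.sum((s - t) ** 2) / scale**2)

def gram_matrix(points, R):
    # K x K matrix with (j, k) entry R(s(j), s(k)).
    K = len(points)
    return np.array([[R(points[j], points[k]) for k in range(K)] for j in range(K)])

rng = np.random.default_rng(1)
points = rng.uniform(-1.0, 1.0, size=(6, 2))   # six distinct points in E^2
M = gram_matrix(points, gaussian_R)

smallest = np.linalg.eigvalsh(M).min()
print("smallest eigenvalue:", smallest)        # positive when M is strictly positive definite

# A strictly positive definite M is a valid covariance matrix:
# if z has identity covariance, chol @ z has covariance chol @ chol.T = M.
chol = np.linalg.cholesky(M)
x = chol @ rng.normal(size=len(points))
print("one draw with covariance M:", x)
```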
Given training data $\{t(i), y_i\}$, it is always possible in principle to obtain a (regularized) input-output map from this data by letting the model $f_{R,\lambda}$ be of the form

$$f_{R,\lambda}(t) = \sum_{j=1}^{N} c_j R(t, s(j)), \qquad (6)$$

where the $s(j)$ are $N$ centers which are placed at distinct values of the $\{t(i)\}$ and $c = (c_1, \ldots, c_N)'$ is chosen to minimize $L(y, f) + \lambda J(f)$. Here

$$L(y, f_{R,\lambda}) = \frac{1}{n}\sum_{i=1}^{n} (y_i - f_{R,\lambda}(t(i)))^2 \qquad (7)$$

and the regularizing penalty $J(\cdot)$ is of the form

$$J(f_{R,\lambda}) = \sum_{j,k=1}^{N} c_j c_k J_{jk} \qquad (8)$$

where the $J_{jk}$ are the entries of a non-negative definite quadratic form. The (strict) positive definiteness of $R$ guarantees that

$$L(y, f_{R,\lambda}) + \lambda J(f_{R,\lambda}) \qquad (9)$$

always has a unique minimizer in $c$, for any non-negative $\lambda$. This follows by substituting (6) into (9), and using the fact that the columns of the $n \times N$ matrix with $i,j$ entry $R(t(i), s(j))$ are linearly independent, since they are just $N$ columns of the positive definite matrix with $i,j$ entry $R(t(i), t(j))$.

Radial basis function estimates are obtained for the special case where $R(s, t)$ is of the special form

$$R(s, t) = r(\|W(s - t)\|), \qquad (10)$$

where $W$ is some linear transformation on $E^d$ and the norm is Euclidean distance. That is, $R(s, t)$ depends only on some generalized distance in $E^d$ between $s$ and $t$. The regularization, that is, the effecting of the tradeoff between goodness of fit to the data and smoothness of the solution, is performed by reducing $N$, and/or increasing $\lambda$. The choice of $W$ will also affect the wiggliness of $f_{R,\lambda}$ in the radial basis function case. Alternatively, a model can be obtained by choosing $N$ small and minimizing $L(y, f)$. In that case $N$ and $W$ are the smoothing parameters.

In the special case $N = n$, $s(i) = t(i)$, the $f_{R,\lambda}$ can (for any positive definite $R$) be shown to be Bayes estimates, see Kimeldorf and Wahba (1970), Wahba (1990). Arguments can be given to show that if $n$ is large and $N < n$ is not too small, then they are good approximations to Bayes estimates, see Wahba (1990, Chapter 7). In the special case $J_{ij} = R(t(i), t(j))$, the Bayes model is easy to describe and we do it here; it is:

$$y_i = X(t(i)) + \epsilon_i, \qquad (11)$$

with $X(t)$ a zero mean Gaussian stochastic process with covariance $E\,X(s)X(t) = bR(s, t)$ and the $\epsilon_i$ independent zero mean Gaussian random variables with common variance $\sigma^2$, and independent of $X(t)$.
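For the special case $N = n$, $s(i) = t(i)$, $J_{ij} = R(t(i), t(j))$, the minimizer of (9) has a closed form. Writing $K$ for the matrix with entries $R(t(i), t(j))$, setting the gradient of $\frac{1}{n}\|y - Kc\|^2 + \lambda c'Kc$ to zero gives $(K + n\lambda I)c = y$. The sketch below, with an assumed Gaussian radial basis function and hypothetical data, computes this estimate and compares it with the conditional expectation of $X(t)$ given $y$ under model (11). With the $1/n$ scaling of (7), the two coincide when $\lambda = \sigma^2/(nb)$.

```python
import numpy as np

def rbf(s, t, width=0.5):
    # Assumed radial basis choice: R(s, t) = exp(-||s - t||^2 / width^2).
    return np.exp(-np.sum((s - t) ** 2) / width**2)

def kernel_matrix(A, B, R=rbf):
    # Matrix with (i, j) entry R(A[i], B[j]).
    return np.array([[R(a, b) for b in B] for a in A])

def fit_coefficients(T, y, lam, R=rbf):
    # Minimizer of (1/n) * ||y - K c||^2 + lam * c' K c with K_ij = R(t(i), t(j)):
    # the solution of (K + n * lam * I) c = y.
    n = len(y)
    K = kernel_matrix(T, T, R)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(T_new, T, c, R=rbf):
    # f_{R,lambda}(t) = sum_j c_j R(t, t(j)), Equation (6) with centers s(j) = t(j).
    return kernel_matrix(T_new, T, R) @ c

rng = np.random.default_rng(2)
n = 40
T = rng.uniform(0.0, 3.0, size=(n, 1))
y = np.sin(2.0 * T[:, 0]) + 0.2 * rng.normal(size=n)   # hypothetical test signal

sigma2, b = 0.04, 2.0
lam = sigma2 / (n * b)
c = fit_coefficients(T, y, lam)
T_new = np.array([[0.5], [1.5], [2.5]])
f_reg = predict(T_new, T, c)

# Conditional expectation of X(t) given y under model (11):
# Cov(X(t), y) Cov(y)^{-1} y, with Cov(y) = b K + sigma2 I.
K = kernel_matrix(T, T)
f_bayes = b * kernel_matrix(T_new, T) @ np.linalg.solve(b * K + sigma2 * np.eye(n), y)
print(np.max(np.abs(f_reg - f_bayes)))   # agreement up to rounding error
```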
In this case, the minimizer $f_{R,\lambda}$ of $L(y, f) + \lambda J(f)$, evaluated at $t$, is the conditional expectation of $X(t)$, given $y_1, \ldots, y_n$, provided that $\lambda$ is chosen as $\sigma^2/nb$. In general, pretending that one has a prior and computing the posterior mean or mode will have a regularizing effect. The discussion above extends to symmetric positive definite functions on arbitrary domains for $t$, including those mentioned in Section 1.

Thin plate splines in $d$ variables (of order $m$) consist of radial basis functions plus polynomials of total degree less than $m$ in $d$ variables. ($2m - d > 0$ is required for technical reasons.) Letting $t = (t_1, \ldots, t_d)$, the thin plate splines are minimizers (in an appropriate function space) of

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - f(t(i)))^2 + \lambda \sum_{\alpha_1 + \cdots + \alpha_d = m} \frac{m!}{\alpha_1! \cdots \alpha_d!} \int \cdots \int \left(\frac{\partial^m f}{\partial t_1^{\alpha_1} \cdots \partial t_d^{\alpha_d}}\right)^2 dt_1 \cdots dt_d. \qquad (12)$$

Setting $d = 1$, $m = 2$ gives the cubic spline case discussed earlier. Note that there is no penalty on polynomials of total degree less than $m$. The thin plate splines with a particular choice of $\lambda$ are Bayes estimates with an improper prior (that is, infinite variance) on the polynomials of total degree less than $m$, see Wahba (1990) and references cited there.

Related variations on regularized estimates include additive smoothing splines, which are of the form

$$f(t) = \mu + \sum_{\alpha=1}^{d} f_\alpha(t_\alpha) \qquad (13)$$

where $\mu$ and the $f_\alpha$ are the solution to a variational problem of the form: Find $\mu$ and $f_1, \ldots, f_d$ in a certain function space to minimize

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - f(t(i)))^2 + \sum_{\alpha=1}^{d} \lambda_\alpha J_\alpha(f_\alpha). \qquad (14)$$

The $J_\alpha$ may be of the form of $J$ in Equation (4). Here, there is a regularization parameter for each component. See Hastie and Tibshirani (1990), Wahba (1990). These additive models generalize to smoothing spline analysis of variance (SS-ANOVA) models. In the SS-ANOVA models interaction terms of the form $f_{\alpha\beta}(t_\alpha, t_\beta)$, $f_{\alpha\beta\gamma}(t_\alpha, t_\beta, t_\gamma)$, etc., which satisfy side conditions making them uniquely determined, are added to the representation in Equation (13), and corresponding penalty terms with regularization parameters are added in Equation (14). The $f_\alpha$, etc., may be generalized to themselves being radial basis functions. Behind these models are positive definite functions which are built up via tensor sums and products of positive definite functions, see Gu and Wahba (1993), Wahba (1990), Wahba, Wang, Gu, Klein and Klein (1995). Regression spline ANOVA models can be obtained by setting the $f_\alpha$, $f_{\alpha\beta}$ etc. as linear combinations of a (relatively small) number of basis functions (usually splines). In this case the number of basis functions is probably the most influential regularization parameter. These and similar methods again all have either explicit or implicit regularization parameters which govern the balance between the complexity of the model and the fit to the data - the bias-variance tradeoff.

The usual criterion for the generalization error when the fit involves minimizing the observed residual sum of squares is the expected (comparative) residual sum of squares for new data, $E\,L(y_{new}, f_\lambda) = \sigma^2 + L(f_{TRUE}, f_\lambda)$.
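This decomposition can be verified in a few lines, under the assumptions of model (1): the new observations are $y_{new,i} = f_{TRUE}(t(i)) + \epsilon_{new,i}$ at the same $t(i)$, with new noise of mean 0 and variance $\sigma^2$ that is independent of the training data, so that $f_\lambda$ is held fixed when the expectation is taken. The symbol $\epsilon_{new,i}$ is introduced here only for the derivation.

$$
\begin{aligned}
E\,L(y_{new}, f_\lambda)
&= \frac{1}{n}\sum_{i=1}^{n} E\big(f_{TRUE}(t(i)) - f_\lambda(t(i)) + \epsilon_{new,i}\big)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} \big(f_{TRUE}(t(i)) - f_\lambda(t(i))\big)^2
 + \frac{2}{n}\sum_{i=1}^{n} \big(f_{TRUE}(t(i)) - f_\lambda(t(i))\big)\, E\,\epsilon_{new,i}
 + \frac{1}{n}\sum_{i=1}^{n} E\,\epsilon_{new,i}^2 \\
&= L(f_{TRUE}, f_\lambda) + 0 + \sigma^2 .
\end{aligned}
$$

Since $\sigma^2$ does not depend on $\lambda$, minimizing the expected residual sum of squares for new data over $\lambda$ is the same as minimizing $L(f_{TRUE}, f_\lambda)$.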
Here the $y_{new}$ are new observations. Leaving out one, leaving out 10%, leaving out a 1/3 representative sample ("tuning set") and GCV ("in sample" tuning) are popular methods for choosing the tuning parameters to minimize this criterion. Codes in Splus (smooth.spline()), SAS (tpspline), netlib (/gcv), Funfits (sreg, tps), R (smooth.Pspline, gss) and elsewhere are available for implementing the univariate spline, thin plate spline and additive and interaction (ANOVA) splines with GCV to choose single or multiple smoothing parameters. Netlib, Funfits and R are freeware. The smooth.Pspline code in R was used to generate Figure 1.

3 Generalization and Regularization in Soft Classification

Soft classification is a natural goal in certain kinds of demographic medical studies. For example, suppose a large training set is available from a demographic study, consisting of observations $\{t(i), y_i\}$ where $y_i$ is an indicator (1 or 0) of the presence or absence of some disease in subject $i$ at the end of the study, and $t(i)$ is a vector of values of risk factors for this subject at the beginning of the study. With this kind of data, it is frequently of interest to make a soft classification, that is, to estimate the probability $p(t)$ that a new subject with predictor vector $t$ will contract the disease. A doctor, given this model, may advise new patients which risk factor(s) are important for them to control to reduce the probability of their contracting the disease. A regularized (that is, "smooth") estimate for $p(t)$ is desirable. Regularized estimates can be obtained as follows. First, define

$$f(t) = \log[p(t)/(1 - p(t))]. \qquad (15)$$

$f$ is known in the statistics literature as the log odds ratio, or logit. Then $p(t)$ is a sigmoidal function of $f(t)$, that is, $p(t) = e^{f(t)}/(1 + e^{f(t)})$. We will get a regularized estimate for $f$. $L(y, f)$ of Equation (3) will be replaced by an expression more suitable for 0-1 data, by using the likelihood for this data. To describe the likelihood, note that if $y$ is a random variable with $\mathrm{Prob}[y = 1] = p$ and $\mathrm{Prob}[y = 0] = (1 - p)$, then the probability density (or likelihood) $P(y, p)$ for $y$ when $p$ is true is just $P(y, p) = p^{y}(1 - p)^{(1 - y)}$; this merely says $P(1, p) = p$ and $P(0, p) = (1 - p)$. Thus, the likelihood for $y_1, \ldots, y_n$ (assuming that the $y_i$ are independent) is

$$P(y_1, \ldots, y_n;\ p(t(1)), \ldots, p(t(n))) = \prod_{i=1}^{n} p(t(i))^{y_i} (1 - p(t(i)))^{(1 - y_i)}. \qquad (16)$$

Substituting $f$ for $p$ in (16) and taking the negative logarithm gives the negative log likelihood $\mathcal{L}(y, f)$ in terms of $f$:

$$-\log P(y_1, \ldots, y_n;\ f(t(1)), \ldots, f(t(n))) \equiv \mathcal{L}(y, f) = \sum_{i=1}^{n} \big[\log(1 + e^{f(t(i))}) - y_i f(t(i))\big]. \qquad (17)$$

It is natural for $\mathcal{L}(y, f)$ to replace $L(y, f)$ of (3), (7), (14) when $y_i$ is restricted to 0 or 1, since $L(y, f_{TRUE})$ is (a multiple of) the negative log likelihood for $y$ generated by a model with Gaussian noise like (1).

A neural net implementation of soft classification would consist of finding $f_{NN}(t) = \mathrm{logit}\, p_{NN}(t)$ of the form of Equation (5) to minimize $\mathcal{L}(y, f)$ of (17). If $N$ is large enough, then, in principle, $f_{NN}$ may be driven so that $p_{NN}(t(i))$ is close to 1 if $y_i$ is 1, and is close to 0 if $y_i$ is 0. Again, it is intuitively clear that this is not desirable. As before, a regularized, or smooth, $f_{NN}$ can be obtained by controlling $N$, penalizing the $w_j$, stopping the iterative fitting early, or some combination of these. Penalized likelihood estimates of $f$ are obtained by minimizing $\mathcal{L}(y, f) + J_\lambda(f)$ where $J_\lambda(f)$ is a penalty functional corresponding to those in Equations (2), (9), (12) or (14) and its generalizations.

A popular definition for the generalization error is the (unobservable) comparative Kullback-Leibler distance of the estimate to the true probability distribution, which can be shown to be given by $E\,\mathcal{L}(y_{new}, f_\lambda) = \mathcal{L}(p_{TRUE}, f_\lambda)$.
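Equation (17) is simple to evaluate directly. The sketch below, with hypothetical 0-1 outcomes and logit values, checks that (17) equals the negative logarithm of the product in (16), and shows how the criterion is driven towards 0 when $f(t(i))$ is pushed to large positive values where $y_i = 1$ and large negative values where $y_i = 0$. The function names are illustrative.

```python
import numpy as np

def prob_from_logit(f):
    # p = e^f / (1 + e^f), the inverse of the logit in Equation (15).
    return 1.0 / (1.0 + np.exp(-f))

def neg_log_lik(y, f):
    # Equation (17): sum_i [log(1 + e^{f_i}) - y_i f_i].
    # np.logaddexp(0, f) computes log(1 + e^f) without overflow for large f.
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    return float(np.sum(np.logaddexp(0.0, f) - y * f))

def neg_log_lik_from_p(y, p):
    # Negative logarithm of the product in Equation (16).
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

y = np.array([1, 0, 1, 1, 0])
f = np.array([2.0, -1.0, 0.5, 3.0, 0.0])   # hypothetical logit values f(t(i))
print(neg_log_lik(y, f), neg_log_lik_from_p(y, prob_from_logit(f)))

# Pushing f(t(i)) toward +infinity where y_i = 1 and -infinity where y_i = 0
# drives the criterion toward 0: a perfect fit to the training labels.
for scale in (1.0, 10.0, 100.0):
    print(scale, neg_log_lik(y, scale * (2 * y - 1)))
```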
An estimate of the $\lambda$ which minimizes this criterion can be obtained by withholding a representative subset $y_{[left\ out]}$ of the training set and choosing $\lambda$ to minimize $\mathcal{L}(y_{[left\ out]}, f_\lambda)$. Leaving-out-one estimates are also possible but generally not feasible in this case. Generalized approximate cross validation (GACV) is a feasible in-sample method of choosing $\lambda$; based on a leaving-out-one argument, it has been shown in simulation studies to provide a good estimate of the minimizer of $\mathcal{L}(p_{TRUE}, f_\lambda)$, see Wahba, Lin, Gao, Xiang, Klein and Klein (1999).

4 Generalization and Regularization in Hard Classification

In the hard classification problem (here we will consider only two classes for simplicity), we are only interested in estimating whether an example with vector $t$ is in class A or not. This is the typical situation in, for example, character recognition, voice recognition, and other situations where it is known that the $t$'s from the two classes being examined are generally well separated. In that case (assuming, for simplicity, that the examples from the two classes are represented in the training set equally, as is the future population of interest, and that costs of misclassification are the same for both classes), the optimum classifier (to minimize the expected cost) would be A if $p(t)$ is greater than one-half, and not A otherwise. Equivalently, the same rule can be implemented by examining the sign of the logit $f(t)$. Here we are identifying A with the 1's, and optimum is with respect to minimizing the expected cost of future misclassification. Unfortunately, in general it is neither desirable nor feasible to estimate the logit $f$ directly by the methods of Section 3, because in the well separated case $f$ takes on values near $\pm\infty$, and, if $d$ and/or the sample size $n$ is large, solving the penalized likelihood problem of Section 3 is likely to be numerically unstable. Recently, support vector machines (SVM's) have been shown to provide an excellent method for classification in this situation. See Burges (1998). The support vector machine (SVM) is implemented by coding the $y_i$ as $\pm 1$ according as the $i$th example is in A or not. Given a positive definite function $R(s, t)$, we find a function $f$ of the form

$$f(t) = b + \sum_{i=1}^{n} c_i R(t, t(i))$$

by finding $b$ and $c = (c_1, \ldots, c_n)$ to minimize

$$\frac{1}{n}\sum_{i=1}^{n} (1 - y_i f(t(i)))_+ + \lambda \sum_{i,j} c_i c_j R(t(i), t(j)) \qquad (18)$$

where $(\tau)_+ = \tau$ for $\tau > 0$ and 0 otherwise. Letting $f_\lambda$ be the minimizer of (18), the classification algorithm is: for a new attribute vector $t$, assign A if $f_\lambda(t) > 0$ and not A if $f_\lambda(t) < 0$. Lin (1999) has demonstrated the remarkable result that, under general circumstances with appropriately chosen $\lambda$, the SVM estimate $f_\lambda$ tends almost everywhere to either 1 or $-1$ and is an estimate of $\mathrm{sign} f_{TRUE} \equiv \mathrm{sign}(p_{TRUE} - \frac{1}{2})$, which is exactly what is needed to carry out the optimum classification algorithm. A popular choice for $R(s, t)$ is $R(s, t) = \exp\{-\frac{1}{\sigma^2}\|s - t\|^2\}$ where $\|\cdot\|$ is the Euclidean norm. In this choice of $R(\cdot, \cdot)$ the result may be sensitive to both $\sigma$ and $\lambda$. As before, the $\lambda$ and $\sigma$ may be chosen by leaving out a representative subset of the observations and choosing $\lambda$ and $\sigma$ to minimize some measure of the generalization error. Here the natural choice for generalization error would be the misclassification rate. A version of GACV for SVM's, again based on a leaving-out-one argument, may be used as an in-sample method for choosing $\lambda$ and $\sigma$, see Wahba, Lin and Zhang (1999). The generalization error target for the GACV is $E\,\frac{1}{n}\sum_{i=1}^{n} (1 - y_{i,new} f_\lambda(t(i)))_+$. However, $\frac{1}{2} E\,\frac{1}{n}\sum_{i=1}^{n} (1 - y_{i,new}\, \mathrm{sign}[f_\lambda(t(i))])_+$ is the expected misclassification rate, so that to the extent that $f_\lambda$ resembles $\mathrm{sign} f_\lambda$, this criterion will be appropriate for the generalization error.

5 Choosing How Much to Regularize

At the time of this writing, it is a matter of lively debate and much research how to choose the various regularization parameters. Leaving out a large fraction of the training sample for this purpose and tuning the regularization parameter(s) to best predict the left-out data (according to whatever criterion of best prediction is adopted) is conceptually simple, defensible, and widely used (this is called out-of-sample tuning). Successively leaving-out-one, successively leaving-out-10%, and the in-sample methods GCV and GACV are all popular. See also Ye (1998), who discusses in-sample tuning methods related to GCV in the Gaussian case which allow comparisons across different regularized estimates.
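Out-of-sample tuning as just described can be sketched in a few lines. The example below is an illustration under stated assumptions, not the procedure used in the report: it takes a regularized estimate of the form (6) with a Gaussian $R$, hypothetical one-dimensional data, squared error as the prediction criterion, and a randomly chosen one-third tuning set. All function names and numerical settings are illustrative.

```python
import numpy as np

def gauss_kernel(A, B, width=0.5):
    # Matrix with (i, j) entry exp(-(A[i] - B[j])^2 / width^2) for scalar inputs.
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / width**2)

def fit_predict(t_fit, y_fit, t_eval, lam):
    # Regularized estimate of the form (6): solve (K + n * lam * I) c = y,
    # then evaluate sum_j c_j R(t, t(j)) at the points t_eval.
    n = len(t_fit)
    c = np.linalg.solve(gauss_kernel(t_fit, t_fit) + n * lam * np.eye(n), y_fit)
    return gauss_kernel(t_eval, t_fit) @ c

def holdout_choice(t, y, lams, frac_left_out=1 / 3, seed=0):
    # Set aside a random tuning set, fit on the rest for each lambda,
    # and return the lambda with the smallest squared error on the tuning set.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(t))
    n_out = int(round(frac_left_out * len(t)))
    out, keep = idx[:n_out], idx[n_out:]
    errors = [np.mean((y[out] - fit_predict(t[keep], y[keep], t[out], lam)) ** 2)
              for lam in lams]
    return lams[int(np.argmin(errors))], errors

rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0.0, 3.0, size=90))
y = np.sin(2.0 * t) + 0.2 * rng.normal(size=90)   # hypothetical data
lams = [10.0 ** k for k in range(-6, 2)]
best, errors = holdout_choice(t, y, lams)
print("chosen lambda:", best)
```

The same loop extends to several tuning parameters (for example the kernel width together with $\lambda$) by searching over a grid of pairs, at the cost of one fit per grid point.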
In the Normally distributed observational error case, if the standard deviation of the observational error ($\sigma$ in Equation (1)) is known then unbiased risk estimates become available. See Li (1986), Wahba (1990) and references cited there. When there is a Bayesian model behind the regularization procedure, then maximum likelihood estimates may be derived, see Wahba (1985), although in order for these and other Bayes estimates to do a good job of minimizing the generalization error in practice, it is usually necessary that the priors on which they are based are realistic.

6 Which method is best?

Feedforward neural nets, radial basis functions, and various forms of splines all provide regularized or regularizable methods for estimating smooth functions of several variables, given a training set $\{t(i), y_i\}$: Which approach is best? Unfortunately, there is not, nor is there likely to be, a single answer to that question. The answer most surely depends on the particular nature of the underlying but unknown truth, the nature of any prior information that might be available about this truth, the nature of any noise in the data, the ability of the experimenter to choose the various smoothing or regularization parameters well, the size of the data set, the use to which the answer will be put, and the computational facilities available. From a mathematical point of view, the classes of functions well approximated by neural nets, radial basis functions, additive and interaction splines (ANOVA splines) are not the same, although all of these methods have the capability of approximating large classes of functions. Of course, if a large enough data set is available, models utilizing all of these approaches may be built, and tuned, and then compared on data that has been set aside for this

purpose. In-sample tuning methods for comparison across different regularized estimates in the hard and soft classification contexts are an area of active research.

REFERENCES

Burges, C. (1998), A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2.

Girard, D. (1998), Asymptotic comparison of (partial) cross-validation, GCV and randomized GCV in nonparametric regression, Ann. Statist. 126.

Girosi, F., Jones, M. & Poggio, T. (1995), Regularization theory and neural networks architectures, Neural Computation 7.

Gu, C. & Wahba, G. (1993), Semiparametric analysis of variance with tensor product thin plate splines, J. Royal Statistical Soc. Ser. B 55.

Hastie, T. & Tibshirani, R. (1990), Generalized Additive Models, Chapman and Hall.

Kimeldorf, G. & Wahba, G. (1970), A correspondence between Bayesian estimation of stochastic processes and smoothing by splines, Ann. Math. Statist. 41.

Li, K. C. (1986), Asymptotic optimality of C_L and generalized cross validation in ridge regression with application to spline smoothing, Ann. Statist. 14.

Lin, Y. (1999), Support vector machines and the Bayes rule in classification, Technical Report 1014, Department of Statistics, University of Wisconsin, Madison WI.

Ramsay, J. & Silverman, B. (1997), Functional Data Analysis, Springer.

Wahba, G. (1985), A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem, Ann. Statist. 13.

Wahba, G. (1990), Spline Models for Observational Data, SIAM. CBMS-NSF Regional Conference Series in Applied Mathematics, v. 59.

Wahba, G., Lin, X., Gao, F., Xiang, D., Klein, R. & Klein, B. (1999), The bias-variance tradeoff and the randomized GACV, in M. Kearns, S. Solla & D. Cohn, eds, Advances in Information Processing Systems 11, MIT Press. Full oral presentation at NIPS 11.

Wahba, G., Lin, Y. & Zhang, H. (1999), Generalized approximate cross validation for support vector machines, or, another way to look at margin-like quantities, Technical Report 1006, Department of Statistics, University of Wisconsin, Madison WI. To appear, Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Scholkopf and D. Schuurmans, eds, MIT Press.

Wahba, G., Wang, Y., Gu, C., Klein, R. & Klein, B. (1995), Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy, Ann. Statist. 23. Neyman Lecture.

Ye, J. (1998), On measuring and correcting the effects of data mining and model selection, J. Amer. Statist. Assoc. 93.
