Stability of Transductive Regression Algorithms


Stability of Transductive Regression Algorithms

Corinna Cortes, Google Research, 76 Ninth Avenue, New York, NY 10011
Mehryar Mohri, Courant Institute of Mathematical Sciences and Google Research, 251 Mercer Street, New York, NY 10012
Dmitry Pechyony, Technion - Israel Institute of Technology, Haifa 32000, Israel
Ashish Rastogi (rastogi@cs.nyu.edu), Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012

Keywords: transductive inference, stability, regression, learning theory

Abstract

This paper uses the notion of algorithmic stability to derive novel generalization bounds for several families of transductive regression algorithms, both by using convexity and closed-form solutions. Our analysis helps compare the stability of these algorithms. It suggests that several existing algorithms might not be stable but prescribes a technique to make them stable. It also reports the results of experiments with local transductive regression demonstrating the benefit of our stability bounds for model selection, in particular for determining the radius of the local neighborhood used by the algorithm.

1. Introduction

Many learning problems in information extraction, computational biology, natural language processing and other domains can be formulated as transductive inference problems (Vapnik, 1982). In the transductive setting, the learning algorithm receives both a labeled training set, as in the standard induction setting, and a set of unlabeled test points. The objective is to predict the labels of the test points. No other test points will ever be considered. This setting arises in a variety of applications. Often, the points to label are known but they have not been assigned a label due to the prohibitive cost of labeling. This motivates the use of transductive algorithms, which leverage the unlabeled data during training to improve learning performance.

[Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).]

This paper deals with transductive regression, which arises in problems such as predicting the real-valued labels of the nodes of a known graph in computational biology, or the scores associated with known documents in information extraction or search engine tasks. Several algorithms have been devised for the specific setting of transductive regression (Belkin et al., 2004b; Chapelle et al., 1999; Schuurmans & Southey, 2002; Cortes & Mohri, 2007). Several other algorithms introduced for transductive classification can in fact be viewed as transductive regression ones, since their objective function is based on the squared loss, e.g., (Belkin et al., 2004a; 2004b). Cortes and Mohri (2007) also gave explicit VC-dimension generalization bounds for transductive regression that hold for all bounded loss functions and coincide with the tight classification bounds of Vapnik (1998) when applied to classification.

This paper presents novel algorithm-dependent generalization bounds for transductive regression. Since they are algorithm-specific, these bounds can often be tighter than bounds based on general complexity measures such as the VC-dimension. Our analysis is based on the notion of algorithmic stability. In Sec. 2 we give a formal definition of the transductive regression setting and the notion of stability for transduction. Our bounds generalize the stability bounds given by Bousquet and Elisseeff (2002) for the inductive setting and extend to regression the stability-based transductive classification bounds of El-Yaniv and Pechyony (2006).

Standard concentration bounds such as McDiarmid's bound (McDiarmid, 1989) cannot be readily applied to the transductive regression setting since the points are not drawn independently but uniformly without replacement from a finite set. Instead, a generalization of McDiarmid's bound that holds for random variables sampled without replacement is used, as in (El-Yaniv & Pechyony, 2006). Sec. 3.1 gives a simpler proof of this bound. This concentration bound is used to derive a general transductive regression stability bound in Sec. 3.2. In Sec. 4, we present the stability coefficients for a family of local transductive regression algorithms. The analysis in this section is based on convexity. In Sec. 5, we study the stability of other transductive regression algorithms (Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003) based on their closed-form solutions, and propose a modification of the seemingly unstable algorithms that makes them stable and guarantees a non-trivial generalization bound. Finally, Sec. 6 shows the results of experiments with local transductive regression demonstrating the benefit of our stability bounds for model selection, in particular for determining the radius of the local neighborhood used by the algorithm. This provides a partial validation of our bounds and analysis.

2. Definitions

Let us first describe the transductive learning setting. Assume that a full sample X of m + u examples is given. The learning algorithm further receives the labels of a random subset S of X of size m, which serves as a training sample. The remaining u unlabeled examples, x_{m+1}, ..., x_{m+u} ∈ X, serve as test data. We denote by X → (S, T) a partitioning of X into the training set S and the test set T. The transductive learning problem consists of predicting accurately the labels y_{m+1}, ..., y_{m+u} of the test examples; no other test examples will ever be considered (Vapnik, 1998). The specific problem where the labels are real-valued numbers, as in the case studied in this paper, is that of transductive regression. It differs from standard (induction) regression since the learning algorithm is given the unlabeled test examples beforehand and can thus exploit this information to improve performance.

[Footnote: Another natural setting for transduction is one where the training and test samples are both drawn according to the same distribution and where the test points, but not their labels, are made available to the learning algorithm. However, as pointed out by Vapnik (1998), any generalization bound in the setting we analyze directly yields a bound for this other setting, essentially by taking the expectation.]

We denote by c(h, x) the cost of an error of a hypothesis h on a point x labeled with y(x). The cost function commonly used in regression is the squared loss c(h, x) = (h(x) − y(x))². In the remainder of this paper, we will assume a squared loss, but many of our results generalize to other convex cost functions. The training and test errors of h are respectively

R̂(h) = (1/m) Σ_{k=1}^m c(h, x_k)   and   R(h) = (1/u) Σ_{k=1}^u c(h, x_{m+k}).

The generalization bounds we derive are based on the notion of transductive algorithmic stability.

Definition 1 (Transduction β-stability). Let L be a transductive learning algorithm, let h denote the hypothesis returned by L for X → (S, T) and h' the hypothesis returned for X → (S', T'). L is said to be uniformly β-stable with respect to the cost function c if there exists β ≥ 0 such that for any two partitionings X → (S, T) and X → (S', T') that differ in exactly one training (and thus test) point, and for all x ∈ X,

|c(h', x) − c(h, x)| ≤ β.    (1)
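To make the setting concrete, the following Python sketch (our illustration, not part of the paper's algorithms) partitions a toy full sample into S and T, computes the training and test errors just defined, and empirically probes the β of Definition 1 by swapping one training point with one test point. The constant-predictor learner and the synthetic data are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy full sample X of m + u points with real-valued labels y.
    m, u = 50, 50
    X = rng.normal(size=(m + u, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m + u)

    def squared_loss(pred, target):
        return (pred - target) ** 2

    def train(train_idx):
        # Toy "learning algorithm": predict the mean of the training labels everywhere.
        return float(np.mean(y[train_idx]))

    # Random partitioning X -> (S, T): S of size m, sampled uniformly without replacement.
    perm = rng.permutation(m + u)
    S, T = perm[:m], perm[m:]

    h = train(S)
    R_hat = np.mean(squared_loss(h, y[S]))   # training error R^(h)
    R = np.mean(squared_loss(h, y[T]))       # test error R(h)
    print(f"R_hat = {R_hat:.3f}, R = {R:.3f}")

    # Empirical probe of the beta of Definition 1: swap one training point with one
    # test point and record the largest change in cost over all points of X.
    beta_hat = 0.0
    for i in S[:10]:
        for j in T[:10]:
            S_swapped = np.setdiff1d(np.append(S, j), [i])
            h_swapped = train(S_swapped)
            beta_hat = max(beta_hat, np.abs(squared_loss(h, y) - squared_loss(h_swapped, y)).max())
    print(f"empirical estimate of beta: {beta_hat:.4f}")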
3. Transduction Stability Bounds

3.1. Concentration Bound for Sampling without Replacement

Stability-based generalization bounds in the inductive setting are based on McDiarmid's inequality (1989). In the transductive setting, the points are drawn uniformly without replacement and thus are not independent. Therefore, McDiarmid's concentration bound cannot be readily used. Instead, a generalization of McDiarmid's bound for sampling without replacement is needed, as in El-Yaniv and Pechyony (2006). We will denote by S_1^m a sequence of random variables S_1, ..., S_m, write S_1^m = x_1^m as a shorthand for the equalities S_i = x_i, i = 1, ..., m, and write Pr[x | x_1^{i-1}, x_i] = Pr[S_{i+1}^m = x | S_1^{i-1} = x_1^{i-1}, S_i = x_i].

Theorem 1 (McDiarmid, 1989). Let S_1^m be a sequence of random variables, each S_i taking values in the set X, and assume that a measurable function φ: X^m → R satisfies: for all i ∈ [1, m] and all x_1^{i-1}, x_i, x_i',

|E_S[φ | S_1^{i-1} = x_1^{i-1}, S_i = x_i] − E_S[φ | S_1^{i-1} = x_1^{i-1}, S_i = x_i']| ≤ c_i.

Then, for all ε > 0,

Pr[|φ − E[φ]| ≥ ε] ≤ 2 exp(−2ε² / Σ_{i=1}^m c_i²).

The following is a concentration bound for sampling without replacement, needed to analyze the generalization of transductive algorithms.

Theorem 2. Let x_1^m be a sequence of random variables sampled from an underlying set X of m + u elements without replacement, and let φ: X^m → R be a symmetric function such that for all i ∈ [1, m] and for all x_1, ..., x_m ∈ X and x_1', ..., x_m' ∈ X,

|φ(x_1, ..., x_m) − φ(x_1, ..., x_{i−1}, x_i', x_{i+1}, ..., x_m)| ≤ c.

Then, for all ε > 0,

Pr[|φ − E[φ]| ≥ ε] ≤ 2 exp(−2ε² / (α(m,u) c²)),   where α(m,u) = mu/(m+u−1/2) · 1/(1 − 1/(2 max{m,u})).

Proof. For a fixed i ∈ [1, m], let g(x_1^i) = E_S[φ | S_1^{i−1} = x_1^{i−1}, S_i = x_i] − E_S[φ | S_1^{i−1} = x_1^{i−1}, S_i = x_i']. Then,

g(x_1^i) = Σ_x φ(x_1^{i−1}, x_i, x) Pr[x | x_1^{i−1}, x_i] − Σ_{x'} φ(x_1^{i−1}, x_i', x') Pr[x' | x_1^{i−1}, x_i'].

For uniform sampling without replacement, the probability terms can be written as

Pr[x | x_1^{i−1}, x_i] = Π_{k=i+1}^m 1/(m+u−k+1) = u!/(m+u−i)!.

Thus, g(x_1^i) = [u!/(m+u−i)!] [Σ_x φ(x_1^{i−1}, x_i, x) − Σ_{x'} φ(x_1^{i−1}, x_i', x')]. To compute the expression between brackets, we divide the set of permutations {x} into two sets: those that contain x_i' and those that do not. If a permutation x contains x_i', we can write it as x_{i+1}^{k−1} x_i' x_{k+1}^m, where k is such that x_k = x_i'. We then match it up with the permutation x_{i+1}^{k−1} x_i x_{k+1}^m from the set {x'}. These two permutations, taken together with their common prefix, contain exactly the same elements, and since the function φ is symmetric in its arguments, the difference in the value of the function on the two permutations is zero. In the other case, if a permutation x does not contain the element x_i', then we simply match it up with the same permutation in {x'}. The matching permutations appearing in the summation are then x_i x and x_i' x, which clearly differ only with respect to x_i. The difference in the value of φ in this case can be bounded by c. The number of such permutations is (m+u−i−1)!/(u−1)!, which leads to the following upper bound:

|Σ_x φ(x_1^{i−1}, x_i, x) − Σ_{x'} φ(x_1^{i−1}, x_i', x')| ≤ [(m+u−i−1)!/(u−1)!] c,

which implies that |g(x_1^i)| ≤ [u!/(m+u−i)!] · [(m+u−i−1)!/(u−1)!] c = [u/(m+u−i)] c. Then, combining Theorem 1 with the identity Σ_{i=1}^m u²/(m+u−i)² ≤ mu²/((u−1/2)(m+u−1/2)) yields that

Pr[|φ − E[φ]| ≥ ε] ≤ 2 exp(−2ε² / (α'(m,u) c²)),   where α'(m,u) = mu/(m+u−1/2) · 1/(1 − 1/(2u)).

The function φ is symmetric in m and u in the sense that selecting one of the sets uniquely determines the other set. The statement of the theorem then follows from a similar bound with α''(m,u) = mu/(m+u−1/2) · 1/(1 − 1/(2m)), taking the tighter of the two.
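As a quick sanity check of Theorem 2, the small simulation below (an illustration under toy assumptions of ours, not an experiment from the paper) samples m points without replacement from a finite set, takes φ to be the mean of the sampled values, for which one may take c = (max − min)/m, and compares the observed deviation probabilities with the bound 2 exp(−2ε²/(α(m,u)c²)).

    import numpy as np

    rng = np.random.default_rng(1)

    # Finite underlying set X of m + u real values.
    m, u = 40, 60
    values = rng.uniform(0.0, 1.0, size=m + u)
    c = (values.max() - values.min()) / m          # change of phi under one replacement

    def alpha(m, u):
        # alpha(m, u) of Theorem 2.
        return (m * u) / (m + u - 0.5) / (1.0 - 1.0 / (2 * max(m, u)))

    # phi(S) = mean of the m values sampled without replacement (a symmetric function).
    n_trials = 20000
    phis = np.array([rng.choice(values, size=m, replace=False).mean()
                     for _ in range(n_trials)])
    E_phi = values.mean()                          # E[phi] under uniform sampling

    for eps in (0.01, 0.02, 0.05):
        empirical_tail = np.mean(np.abs(phis - E_phi) >= eps)
        bound = 2 * np.exp(-2 * eps**2 / (alpha(m, u) * c**2))
        print(f"eps={eps:.2f}: empirical {empirical_tail:.4f} <= bound {min(bound, 1.0):.4f}")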
3.2. Transductive Stability Bound

To obtain a general transductive regression stability bound, we apply the concentration bound of Theorem 2 to the random variable φ(S) = R(h) − R̂(h). To do so, we need to bound E_S[φ(S)], where S is a random subset of X of size m, and |φ(S) − φ(S')|, where S and S' are samples differing by exactly one point.

Lemma 1. Let H be a bounded hypothesis set (∀x ∈ X, |h(x) − y(x)| ≤ B) and L a β-stable algorithm returning the hypotheses h and h' for two training sets S and S' of size m each, respectively, differing in exactly one point. Then,

|φ(S) − φ(S')| ≤ 2β + B²(m + u)/(mu).    (2)

Proof. By definition, S and S' differ in exactly one point. Let x_i ∈ S and x_{m+j} ∈ S' be the points in which the two sets differ. The lemma follows from the observation that for each one of the common labeled points in S and S', and for each one of the common test points in T and T' (recall T = X \ S, T' = X \ S'), the difference in cost is bounded by β, while for x_i and x_{m+j} the difference in cost is bounded by B². Then, it follows that

|φ(S) − φ(S')| ≤ ((m−1)/m)β + ((u−1)/u)β + B²/m + B²/u ≤ 2β + B²(1/m + 1/u).

Lemma 2. Let h be the hypothesis returned by a β-stable algorithm L. Then, |E_S[φ(S)]| ≤ β.

Proof. By definition of φ(S), its expectation is (1/u) Σ_{k=1}^u E_S[c(h, x_{m+k})] − (1/m) Σ_{k=1}^m E_S[c(h, x_k)]. Since E_S[c(h, x_{m+j})] is the same for all j ∈ [1, u], and E_S[c(h, x_i)] is the same for all i ∈ [1, m], for any i and j, E_S[φ(S)] = E_S[c(h, x_{m+j})] − E_S[c(h, x_i)] = E_{S'}[c(h', x_i)] − E_S[c(h, x_i)]. Thus, |E_S[φ(S)]| = |E_{S,S' ⊆ X}[c(h', x_i) − c(h, x_i)]| ≤ β.

Theorem 3. Let H be a bounded hypothesis set (∀x ∈ X, |h(x) − y(x)| ≤ B) and L a β-stable algorithm. Let h be the hypothesis returned by L when trained on X → (S, T). Then, for any δ > 0, with probability at least 1 − δ,

R(h) ≤ R̂(h) + β + (2β + B²(m + u)/(mu)) √(α(m,u) ln(1/δ) / 2).

Proof. The result follows directly from Theorem 2 and Lemmas 1 and 2.

This is a general bound that applies to any transductive algorithm. To apply it, the stability coefficient β, which depends on m and u, needs to be determined.
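Since all quantities in Theorem 3 are explicit, the slack term can be evaluated directly. The sketch below (ours) computes it for given β, B, m, u and δ; the example values at the end are arbitrary and only meant to show that a stability coefficient of order 1/m yields a slack of order 1/√m when m = u.

    import numpy as np

    def alpha(m, u):
        # alpha(m, u) of Theorem 2.
        return (m * u) / (m + u - 0.5) / (1.0 - 1.0 / (2 * max(m, u)))

    def theorem3_slack(beta, B, m, u, delta):
        # Slack of Theorem 3: beta + (2 beta + B^2 (m+u)/(m u)) sqrt(alpha(m,u) ln(1/delta) / 2).
        return beta + (2 * beta + B**2 * (m + u) / (m * u)) * np.sqrt(
            alpha(m, u) * np.log(1.0 / delta) / 2.0)

    # Arbitrary illustrative values.
    for m in (100, 1000, 10000):
        print(m, theorem3_slack(beta=1.0 / m, B=1.0, m=m, u=m, delta=0.05))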

In the subsequent sections, we derive bounds on β for a number of transductive regression algorithms (Cortes & Mohri, 2007; Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003).

4. Stability of Local Transductive Regression Algorithms

This section describes and analyzes a general family of local transductive regression algorithms (LTR) generalizing the algorithm of Cortes and Mohri (2007). LTR algorithms can be viewed as a generalization of the so-called kernel regularization-based learning algorithms to the transductive setting. The objective function that they minimize is of the form

F(h, S) = ||h||²_K + (C/m) Σ_{k=1}^m c(h, x_k) + (C'/u) Σ_{k=1}^u c̃(h, x_{m+k}),    (3)

where ||·||_K is the norm in the reproducing kernel Hilbert space (RKHS) with associated kernel K, C ≥ 0 and C' ≥ 0 are trade-off parameters, and c̃(h, x) = (h(x) − ỹ(x))² is the error of the hypothesis h on the unlabeled point x with respect to a pseudo-target ỹ. Pseudo-targets are obtained from the labels y(x) of neighboring points by a local weighted average. Neighborhoods can be defined as a ball of radius r around each point in the feature space. We will denote by β_loc the score-stability coefficient of the local algorithm used, that is, the maximal amount by which the two hypotheses differ on any given point when trained on samples disagreeing on one point. This notion is stronger than that of cost-based stability.
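Because the LTR objective (3) is a weighted regularized least-squares problem, it admits a closed-form solution once the pseudo-targets are fixed. The sketch below (our simplified illustration, not the authors' implementation) expands h over the kernel sections of all m + u points and solves for the coefficients; the pseudo-targets are computed as a plain average of the labels falling in a ball of radius r, a simplification of the local weighted average described above, and the Gaussian kernel and toy data are assumptions of the sketch.

    import numpy as np

    rng = np.random.default_rng(2)

    m, u, r = 60, 60, 1.0          # sample sizes and neighborhood radius
    C, C_prime, sigma = 1.0, 0.1, 1.0

    X = rng.normal(size=(m + u, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m + u)   # toy targets
    X_lab, y_lab = X[:m], y[:m]                          # labeled part S
    X_unl = X[m:]                                        # unlabeled part T

    def gaussian_kernel(A, B, sigma):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    # Pseudo-targets: plain average of the labels falling in a ball of radius r
    # (a simplified stand-in for the local weighted average described above).
    d2 = ((X_unl[:, None, :] - X_lab[None, :, :]) ** 2).sum(-1)
    in_ball = d2 <= r**2
    y_tilde = np.where(in_ball.any(1),
                       (in_ball * y_lab).sum(1) / np.maximum(in_ball.sum(1), 1), 0.0)

    # Minimizer of (3) via the expansion h(.) = sum_i a_i K(x_i, .):
    # a = (I + W K)^{-1} W t, with W = diag(C/m on S, C'/u on T) and t = (y, y_tilde).
    K = gaussian_kernel(X, X, sigma)
    w = np.concatenate([np.full(m, C / m), np.full(u, C_prime / u)])
    t = np.concatenate([y_lab, y_tilde])
    a = np.linalg.solve(np.eye(m + u) + w[:, None] * K, w * t)
    h = K @ a                                            # h evaluated on all m + u points

    print("training MSE:", np.mean((h[:m] - y[:m]) ** 2))
    print("test MSE    :", np.mean((h[m:] - y[m:]) ** 2))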
In this section, we use the bounded-labels assumption, that is, ∀x ∈ S, |y(x)| ≤ M. We also assume that for any x ∈ X, K(x, x) ≤ κ². We will use the following bound, based on the reproducing property and the Cauchy-Schwarz inequality, valid for any hypothesis h ∈ H: for all x ∈ X,

|h(x)| = |⟨h, K(x, ·)⟩| ≤ ||h||_K √(K(x, x)) ≤ κ ||h||_K.    (4)

Lemma 3. Let h be the hypothesis minimizing (3). Assume that for any x ∈ X, K(x, x) ≤ κ². Then, for any x ∈ X, |h(x)| ≤ κM √(C + C').

Proof. The proof is a straightforward adaptation of the technique of (Bousquet & Elisseeff, 2002) to LTR algorithms. By Eqn. (4), |h(x)| ≤ κ||h||_K. Let 0 ∈ H be the hypothesis assigning label zero to all examples. By definition of h, F(h, S) ≤ F(0, S) ≤ (C + C')M². Using ||h||²_K ≤ F(h, S) yields the statement.

Since |h(x)| ≤ κM√(C + C'), this immediately gives us the bound |h(x) − y(x)| ≤ M(1 + κ√(C + C')). Thus, we are in a position to apply Theorem 3 with B = AM, where A = 1 + κ√(C + C'). We now derive a bound on the stability coefficient β. To do so, the key property we will use is the convexity of h ↦ c(h, x). Note, however, that in the case of c̃, the pseudo-targets may depend on the training set S. This dependency matters when we wish to apply convexity with two hypotheses h and h' obtained by training on different samples S and S'. For convenience, for any two such fixed hypotheses h and h', we extend the definition of c̃ as follows. For all t ∈ [0, 1],

c̃(th + (1−t)h', x) = ((th + (1−t)h')(x) − (tỹ + (1−t)ỹ'))².

This allows us to use the same convexity property for c̃ as for c for any two fixed hypotheses h and h', as verified by the following lemma, and does not affect the proofs otherwise.

Lemma 4. Let h be a hypothesis obtained by training on S and h' a hypothesis obtained by training on S'. Then, for all t ∈ [0, 1],

t c̃(h, x) + (1−t) c̃(h', x) ≥ c̃(th + (1−t)h', x).    (5)

Proof. Let ỹ = ỹ(x) be the pseudo-target value at x when the training set is S and ỹ' = ỹ'(x) when the training set is S'. For all t ∈ [0, 1],

t c̃(h, x) + (1−t) c̃(h', x) − c̃(th + (1−t)h', x) = t(h(x) − ỹ)² + (1−t)(h'(x) − ỹ')² − [t(h(x) − ỹ) + (1−t)(h'(x) − ỹ')]².

The statement of the lemma follows directly from the convexity of x ↦ x² over the real numbers.

Let h' be a hypothesis obtained by training on S' and h by training on S, and let Δ = h' − h. Then, for all x ∈ X,

|c(h', x) − c(h, x)| = |Δ(x)| |(h(x) − y(x)) + (h'(x) − y(x))| ≤ 2M(1 + κ√(C + C')) |Δ(x)|.

As in (4), for all x ∈ X, |Δ(x)| ≤ κ||Δ||_K, thus for all x ∈ X,

|c(h', x) − c(h, x)| ≤ 2M(1 + κ√(C + C')) κ ||Δ||_K.    (6)

Lemma 5. Assume that for all x ∈ X, |y(x)| ≤ M. Let S and S' be two samples differing by exactly one point, and let x_i denote the point in which they differ that belongs to S. Let h be the hypothesis returned by the algorithm minimizing the objective function F(h, S), h' the hypothesis obtained by minimization of F(h, S'), and let ỹ and ỹ' be the corresponding pseudo-targets. Then,

(C/m)[c(h', x_i) − c(h, x_i)] + (C'/u)[c̃(h', x_i) − c̃(h, x_i)] ≤ 2AM (κ||Δ||_K (C/m + C'/u) + β_loc C'/u),

where Δ = h' − h and A = 1 + κ√(C + C').

Proof. Let c̃(h_i, ỹ_i) denote c̃(h, x_i) and c̃(h'_i, ỹ'_i) denote c̃(h', x_i). By Lemma 3 and the bounded-labels assumption,

c̃(h'_i, ỹ'_i) − c̃(h_i, ỹ_i) = c̃(h'_i, ỹ'_i) − c̃(h'_i, ỹ_i) + c̃(h'_i, ỹ_i) − c̃(h_i, ỹ_i) = (ỹ_i − ỹ'_i)(ỹ_i + ỹ'_i − 2h'_i) + (h'_i − h_i)(h'_i + h_i − 2ỹ_i).

By the score-stability of the local estimates, |ỹ'(x_i) − ỹ(x_i)| ≤ β_loc. Thus,

c̃(h'_i, ỹ'_i) − c̃(h_i, ỹ_i) ≤ 2AM(β_loc + κ||Δ||_K).    (7)

Using (6) leads, after simplification, to the statement of the lemma.

The proof of the following theorem is based on Lemma 4 and Lemma 5 and is given in the appendix.

Theorem 4. Assume that for all x ∈ X, |y(x)| ≤ M and that there exists κ such that ∀x ∈ X, K(x, x) ≤ κ². Further, assume that the local estimator has uniform stability coefficient β_loc. Let A = 1 + κ√(C + C'). Then, LTR is uniformly β-stable with

β ≤ 2(AM)²κ² [ (C/m + C'/u) + √( (C/m + C'/u)² + 2C'β_loc/(u AM κ²) ) ].

Our experiments with LTR will demonstrate the benefit of this bound for model selection (Sec. 6).

5. Stability Based on Closed-Form Solutions

5.1. Unconstrained Regularization Algorithms

In this section, we consider a family of transductive regression algorithms that can be formulated as the following optimization problem:

min_h  hᵀQh + (h − y)ᵀC(h − y),    (8)

where Q ∈ R^{(m+u)×(m+u)} is a symmetric regularization matrix, C ∈ R^{(m+u)×(m+u)} is a symmetric matrix of empirical weights (in practice it is often a diagonal matrix), y ∈ R^{m+u} contains the target values of the labeled points together with the pseudo-target values of the unlabeled points (in some formulations, the pseudo-target value is 0), and h ∈ R^{m+u} is a column vector whose ith component is the predicted target value for x_i. The closed-form solution of (8) is given by

h = (C⁻¹Q + I)⁻¹ y.    (9)

The formulation (8) is quite general and includes as special cases the algorithms of (Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003). We present a general framework for bounding the stability coefficient of these algorithms and then examine the stability coefficient of each of these algorithms in turn.

For a symmetric matrix A ∈ R^{n×n}, we will denote by λ_M(A) its largest eigenvalue and by λ_m(A) its smallest. Then, for any v ∈ R^n, λ_m(A)||v||_2 ≤ ||Av||_2 ≤ λ_M(A)||v||_2. We will also use, in the proof of the following proposition, the fact that for symmetric positive semi-definite matrices A, B ∈ R^{n×n}, λ_M(AB) ≤ λ_M(A)λ_M(B).

Proposition 1. Let h and h' solve (8) under test and training sets that differ in exactly one point, and let C, C', y, y' be the corresponding empirical weight matrices and target value vectors. Then,

||h' − h||_2 ≤ ||y' − y||_2 / (λ_m(Q)/λ_M(C') + 1) + λ_M(Q) ||C'⁻¹ − C⁻¹||_2 ||y||_2 / [ (λ_m(Q)/λ_M(C') + 1)(λ_m(Q)/λ_M(C) + 1) ].

Proof. Let Δ = h' − h and Δy = y' − y. Let C̄ = (C⁻¹Q + I)⁻¹ and C̄' = (C'⁻¹Q + I)⁻¹. By definition,

Δ = C̄'y' − C̄y = C̄'Δy + (C̄' − C̄)y = C̄'Δy + (C̄'[(C⁻¹ − C'⁻¹)Q]C̄)y.

Thus,

||Δ||_2 ≤ ||Δy||_2 λ_M(C̄') + λ_M(Q) ||C'⁻¹ − C⁻¹||_2 ||y||_2 λ_M(C̄')λ_M(C̄).    (10)

Furthermore, λ_M(C̄) ≤ 1/(λ_m(Q)/λ_M(C) + 1), and similarly for C̄'. Plugging these bounds back into Eqn. (10) yields the statement.

Since ||h' − h||_∞ is bounded by ||h' − h||_2, the proposition provides a bound on the score-stability of h for the transductive regression algorithms of Zhou et al. (2004), Wu and Schölkopf (2007) and Zhu et al. (2003). For each of these algorithms, the pseudo-targets used are zero. If we make the bounded-labels assumption (∀x ∈ X, |y(x)| ≤ M, for some M > 0), it is not difficult to show that ||y' − y||_2 ≤ √2 M and ||y||_2 ≤ M√m. We now examine each algorithm in turn.
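Before doing so, the following sketch (ours) makes the closed-form solution (9) and the notion of score-stability concrete: it builds a simple graph-Laplacian regularizer Q as an illustrative choice, solves h = (C⁻¹Q + I)⁻¹y for two partitionings differing in one training point, and measures ||h' − h||_∞, the quantity controlled through ||h' − h||_2 in Proposition 1. The data, labels and constant weight are assumptions of the sketch.

    import numpy as np

    rng = np.random.default_rng(3)

    m, u, mu_weight = 30, 30, 1.0
    n = m + u
    X = rng.normal(size=(n, 2))
    labels = np.tanh(X[:, 0])                    # toy bounded labels, |y| <= M = 1

    # An illustrative regularizer Q: the (unnormalized) Laplacian of a similarity graph.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2)
    np.fill_diagonal(W, 0.0)
    Q = np.diag(W.sum(1)) - W                    # symmetric PSD, smallest eigenvalue 0

    def solve_closed_form(train_idx):
        # y carries the labels on training points and pseudo-target 0 on test points;
        # C is diagonal with the same weight mu on every point.
        y = np.zeros(n)
        y[train_idx] = labels[train_idx]
        C_inv = np.eye(n) / mu_weight
        return np.linalg.solve(C_inv @ Q + np.eye(n), y)   # h = (C^{-1} Q + I)^{-1} y

    perm = rng.permutation(n)
    S = perm[:m]
    S_swapped = np.append(np.delete(S, 0), perm[m])        # differs from S in one training point

    h = solve_closed_form(S)
    h_swapped = solve_closed_form(S_swapped)
    print("score change ||h' - h||_inf =", np.abs(h_swapped - h).max())
    print("||h' - h||_2                =", np.linalg.norm(h_swapped - h))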

Consistency method (CM). In the CM algorithm (Zhou et al., 2004), the matrix Q is the normalized Laplacian of a weight matrix W ∈ R^{(m+u)×(m+u)} that captures the affinity between pairs of points in the full sample X. Thus, Q = I − D^{−1/2}WD^{−1/2}, where D ∈ R^{(m+u)×(m+u)} is the diagonal matrix with [D]_{i,i} = Σ_j [W]_{i,j}. Note that λ_m(Q) = 0. Furthermore, the matrices C and C' are identical in CM, both diagonal with (i,i)th entry equal to a positive constant µ > 0. Thus C = C' and, using Prop. 1, we obtain the following bound on the score-stability of the CM algorithm: β_CM ≤ √2 M.

Local learning regularization (LL Reg). In the LL Reg algorithm (Wu & Schölkopf, 2007), the regularization matrix Q is (I − A)ᵀ(I − A), where I ∈ R^{(m+u)×(m+u)} is the identity matrix and A ∈ R^{(m+u)×(m+u)} is a non-negative weight matrix that captures the local similarity between all pairs of points in X. A is normalized, i.e., each of its rows sums to 1. Let C_l, C_u > 0 be two positive constants. The matrix C is diagonal with [C]_{i,i} = C_l if x_i ∈ S and C_u otherwise. Let C_max = max{C_l, C_u} and C_min = min{C_l, C_u}. Thus, ||C'⁻¹ − C⁻¹||_2 = √2 (1/C_min − 1/C_max). By the Perron-Frobenius theorem, the eigenvalues of A lie in the interval (−1, 1] and λ_M(A) ≤ 1. Thus, λ_m(Q) ≥ 0 and λ_M(Q) ≤ 4, and we have the following bound on the score-stability of the LL Reg algorithm:

β_LL Reg ≤ √2 M + 4√(2m) M (1/C_min − 1/C_max) ≤ √2 M + 4√(2m) M / C_min.

Gaussian Mean Fields algorithm (GMF). GMF (Zhu et al., 2003) is very similar to LL Reg and admits exactly the same stability coefficient.

Thus, the stability coefficients of the CM, LL Reg, and GMF algorithms can be large. Without additional constraints on the matrix Q, these algorithms do not seem to be stable enough for the generalization bound of Theorem 3 to converge. A particular example of such a constraint is the condition Σ_{i=1}^{m+u} h(x_i) = 0 used by Belkin et al.'s algorithm (2004a). In the next section, we give a generalization bound for this algorithm and then describe a general method for making the algorithms just examined stable.

5.2. Stability of Constrained Regularization Algorithms

This subsection analyzes constrained regularization algorithms such as the Laplacian-based graph regularization algorithm of Belkin et al. (2004a). Given a weighted graph G = (X, E) in which the edge weights represent the extent of similarity between vertices, the task consists of predicting the vertex labels. The hypothesis h returned by the algorithm is the solution of the following optimization problem:

min_{h ∈ H}  hᵀLh + (C/m) Σ_{i=1}^m (h(x_i) − y_i)²   subject to   Σ_{i=1}^{m+u} h(x_i) = 0,    (11)

where L ∈ R^{(m+u)×(m+u)} is a smoothness matrix, e.g., the graph Laplacian, and {y_i : i ∈ [1, m]} are the target values of the labeled nodes. The hypothesis set H in this case can be thought of as a hyperplane in R^{m+u} that is orthogonal to the vector 1 ∈ R^{m+u}. Maintaining the notation used in (Belkin et al., 2004a), we let P_H denote the operator corresponding to the orthogonal projection onto H. For a sample S drawn without replacement from X, define I_S ∈ R^{(m+u)×(m+u)} to be the diagonal matrix with [I_S]_{i,i} = 1 if x_i ∈ S and 0 otherwise. Similarly, let y_S ∈ R^{m+u} be the column vector with [y_S]_i = y_i if x_i ∈ S and 0 otherwise. The closed-form solution on a training sample S is given by (Belkin et al., 2004a):

h_S = P_H ((m/C) L + I_S)⁻¹ y_S.    (12)

Theorem 5. Assume that the vertex labels of the graph G = (X, E) and the hypothesis h obtained by optimizing Eqn. (11) are both bounded (∀x, |h(x)| ≤ M and |y(x)| ≤ M for some M > 0). Let A = 1 + κ√C. Then, for any δ > 0, with probability at least 1 − δ,

R(h) ≤ R̂(h) + β + (2β + (AM)²(m + u)/(mu)) √(α(m,u) ln(1/δ) / 2),

with α(m,u) = mu/(m+u−1/2) · 1/(1 − 1/(2 max{m,u})) and

β ≤ 4√2 M² / (mλ_2/C − √2) + 4√2 M² / (mλ_2/C − √2)²,

where λ_2 is the second smallest eigenvalue of the Laplacian.

Proof. The proof is similar to that of (Belkin et al., 2004a) but uses our general transductive regression bound instead.
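The constrained closed-form solution is easy to reproduce numerically. The sketch below (ours) uses the expression h_S = P_H((m/C)L + I_S)⁻¹y_S as reconstructed in (12), so the exact (m/C) scaling of L should be treated as an assumption of this sketch; it builds a toy graph Laplacian, forms the projection onto the hyperplane orthogonal to the all-ones vector, and computes the hypothesis for a random training sample.

    import numpy as np

    rng = np.random.default_rng(4)

    m, u, C = 40, 40, 1.0
    n = m + u

    # Toy weighted graph on the n vertices and its Laplacian L.
    X = rng.normal(size=(n, 2))
    W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W
    labels = X[:, 0] - X[:, 0].mean()            # toy vertex labels

    # Projection P_H onto the hyperplane { h : sum_i h_i = 0 }, orthogonal to the ones vector.
    ones = np.ones((n, 1))
    P_H = np.eye(n) - ones @ ones.T / n

    # Training sample S, indicator matrix I_S and padded label vector y_S.
    S = rng.permutation(n)[:m]
    I_S = np.zeros((n, n))
    I_S[S, S] = 1.0
    y_S = np.zeros(n)
    y_S[S] = labels[S]

    # Closed-form solution (12); the (m/C) scaling of L is the assumption noted above.
    h_S = P_H @ np.linalg.solve((m / C) * L + I_S, y_S)
    print("sum of predictions (should be ~0):", h_S.sum())
    print("training MSE:", np.mean((h_S[S] - labels[S]) ** 2))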
The generalization bound we just presented differs in several respects from that of Belkin et al. (2004a). Our bound explicitly depends on both m and u, while theirs shows only a dependency on m. Also, our bound does not depend on the number of times a point is sampled in the training set (the parameter t in their analysis), thanks to our analysis based on sampling without replacement. Contrasting the stability coefficient of Belkin et al.'s algorithm with the stability coefficient of LTR (Theorem 4), we note that it does not depend on C' and β_loc. This is because unlabeled points do not enter the objective function, and thus C' = 0 and ỹ(x) = 0 for all x ∈ X.

However, the stability does depend on the second smallest eigenvalue λ_2 of the Laplacian, and the bound diverges as mλ_2 approaches √2 C. In all our regression experiments, we observed that this algorithm does not perform as well in comparison with LTR.

5.3. Making Seemingly Unstable Algorithms Stable

In Sec. 5.2, we saw that imposing additional constraints on the hypothesis, e.g., Σ_i h(x_i) = 0, allowed one to derive non-trivial stability bounds. This idea can be generalized, and similar non-trivial stability bounds can be derived for stable versions of the algorithms presented in Sec. 5.1: CM, LL Reg, and GMF. Recall that the stability bound in Prop. 1 is inversely proportional to the smallest eigenvalue λ_m(Q). The main difficulty with using the proposition for these algorithms is that λ_m(Q) = 0 in each case. Let v_m denote the eigenvector corresponding to λ_m(Q) and let λ_2 be the second smallest eigenvalue of Q. One can modify (8) and constrain the solution to be orthogonal to v_m by imposing h · v_m = 0. In the case of (Belkin et al., 2004a), v_m = 1. This modification, motivated by the algorithm of (Belkin et al., 2004a), is equivalent to increasing the smallest eigenvalue to λ_2. As an example, by imposing the additional constraint, we can show that the stability coefficient of CM becomes bounded by O(C/λ_2) instead of Θ(1). Thus, if C = O(1/m) and λ_2 = Ω(1), it is bounded by O(1/m) and the generalization bound converges as O(1/√m).
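The modification just described can be implemented by solving (8) in the subspace orthogonal to the bottom eigenvector v_m of Q, which effectively replaces λ_m(Q) = 0 by λ_2 in Proposition 1. The sketch below (ours, on toy data and with an arbitrary weight C) applies this to a CM-style problem: it compares the two smallest eigenvalues of Q and computes both the unconstrained solution (9) and the constrained one.

    import numpy as np

    rng = np.random.default_rng(5)

    m, u, C_const = 30, 30, 0.1
    n = m + u
    X = rng.normal(size=(n, 2))
    labels = np.tanh(X[:, 0])

    # Normalized Laplacian Q = I - D^{-1/2} W D^{-1/2}, as in CM; lambda_m(Q) = 0.
    W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(W, 0.0)
    d = W.sum(1)
    Q = np.eye(n) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    C = C_const * np.eye(n)                      # empirical weight matrix of (8)

    S = rng.permutation(n)[:m]
    y = np.zeros(n)
    y[S] = labels[S]                             # zero pseudo-targets on test points

    eigvals, eigvecs = np.linalg.eigh(Q)
    print("lambda_m(Q) =", eigvals[0], " lambda_2(Q) =", eigvals[1])

    # Unconstrained solution (9).
    h = np.linalg.solve(np.linalg.inv(C) @ Q + np.eye(n), y)

    # Stabilized solution: solve (8) restricted to the complement of v_m, the bottom
    # eigenvector of Q, so that the effective smallest eigenvalue becomes lambda_2.
    B = eigvecs[:, 1:]                           # orthonormal basis of that complement
    z = np.linalg.solve(B.T @ (Q + C) @ B, B.T @ C @ y)
    h_stable = B @ z

    print("|<h_stable, v_m>|  =", abs(h_stable @ eigvecs[:, 0]))
    print("max |h - h_stable| =", np.abs(h - h_stable).max())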
6. Experiments

6.1. Model Selection Based on the Bound

This section reports the results of experiments using our stability-based generalization bound for model selection for the LTR algorithm. A crucial parameter of this algorithm is the stability coefficient β_loc(r) of the local algorithm, which computes pseudo-targets ỹ_x based on a ball of radius r around each point. We derive an expression for β_loc(r) and show, using extensive experiments with multiple data sets, that the value r* minimizing the bound is a remarkably good estimate of the best r for the test error. This demonstrates the benefit of our generalization bound for model selection, avoiding the need for a held-out validation set.

The experiments were carried out on several publicly available regression data sets: Boston Housing, Elevators and Ailerons (www.liaad.up.pt/~ltorgo/regression/datasets.html). For each of these data sets, we used m = u, inspired by the observation that, all other parameters being fixed, the bound of Theorem 3 is tightest when m = u. The values of the input variables were normalized to have mean zero and variance one. For the Boston Housing data set, the total number of examples was 506. For the Elevators and the Ailerons data sets, a random subset of 2000 examples was used; other random subsets of 2000 examples led to similar results. The Boston Housing experiments were repeated for 50 random partitions, while for the Elevators and the Ailerons data sets the experiments were repeated for 20 random partitions each. Since the target values for the Elevators and the Ailerons data sets were extremely small, they were scaled by factors of 1000 and 100 respectively in a pre-processing step.

In our experiments, we estimated the pseudo-target of a point x' ∈ T as a weighted average of the labeled points x ∈ N(x') in a neighborhood of x': ỹ_{x'} = Σ_{x ∈ N(x')} α_x y_x / Σ_{x ∈ N(x')} α_x. The weights are defined in terms of a similarity measure K(x, x') captured by a kernel K: α_x = K(x, x'). Let m(r) be the number of labeled points in N(x'). Then, it is easy to show that β_loc ≤ 4 α_max M / (α_min m(r)), where α_max = max_{x ∈ N(x')} α_x and α_min = min_{x ∈ N(x')} α_x. Thus, for a Gaussian kernel with parameter σ, β_loc ≤ 4M / (m(r) e^{−2r²/σ²}).

To estimate β_loc, one needs an estimate of m(r), the number of samples in a ball of radius r around an unlabeled point x'. In our experiments, we estimated m(r) as the number of samples in a ball of radius r around the origin. Since all features are normalized to mean zero and variance one, the origin is also the centroid of the set X. We implemented a dual solution of LTR and used Gaussian kernels, for which the parameter σ was selected using cross-validation on the training set. Experiments were repeated across 36 different pairs of values of (C, C'). For each pair, we varied the radius r of the neighborhood used to determine the pseudo-target estimates from zero to the radius of the ball containing all points. Figure 1(a) shows the mean values of the test MSE of our experiments on the Boston Housing data set for typical values of C and C'. Figures 1(b)-(c) show similar results for the Ailerons and Elevators data sets. For the sake of comparison, we also report results for induction. The relative standard deviations on the MSE are not indicated, but were typically of the order of 10%. LTR generally achieves a significant improvement over induction.

The generalization bound we derived in Theorem 3 consists of the training error and a complexity term that depends on the parameters of the LTR algorithm (C, C', M, m, u, κ, β_loc, δ). Only two terms depend upon the choice of the radius r: R̂(h) and β_loc.
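The model-selection procedure can thus be summarized in a few lines: for each candidate radius r, compute β_loc(r), plug it together with the LTR stability coefficient into the slack term of Theorem 3, and select the radius minimizing training error plus slack. The sketch below (ours) shows only the bookkeeping: the parameter values are arbitrary, training_error(r) is a hypothetical placeholder for an actual run of LTR at radius r, m_of_r is a crude stand-in for the empirical m(r), and beta_ltr uses the expression of Theorem 4 as reconstructed above.

    import numpy as np

    # Fixed parameters of the experiment (illustrative values).
    M, kappa, C, C_prime, m, u, delta, sigma = 1.0, 1.0, 1.0, 0.1, 1000, 1000, 0.05, 1.0
    A = 1 + kappa * np.sqrt(C + C_prime)

    def alpha(m, u):
        return (m * u) / (m + u - 0.5) / (1.0 - 1.0 / (2 * max(m, u)))

    def beta_loc(r, m_r):
        # beta_loc(r) <= 4 M / (m(r) e^{-2 r^2 / sigma^2}) for a Gaussian kernel.
        return 4 * M / (m_r * np.exp(-2 * r**2 / sigma**2))

    def beta_ltr(b_loc):
        # Stability coefficient of LTR (Theorem 4, as reconstructed above).
        s = C / m + C_prime / u
        return 2 * (A * M * kappa)**2 * (s + np.sqrt(s**2 + 2 * C_prime * b_loc / (u * A * M * kappa**2)))

    def slack(beta):
        # Slack term of Theorem 3 with B = A M.
        return beta + (2 * beta + (A * M)**2 * (m + u) / (m * u)) * np.sqrt(
            alpha(m, u) * np.log(1 / delta) / 2)

    def training_error(r):
        # Placeholder: in the actual procedure this is the training MSE of LTR run with radius r.
        return 0.5 * np.exp(-r) + 0.05

    radii = np.linspace(0.1, 3.0, 30)
    m_of_r = np.maximum(1, (m * (radii / radii.max())**2).astype(int))   # crude stand-in for m(r)
    objective = [training_error(r) + slack(beta_ltr(beta_loc(r, mr)))
                 for r, mr in zip(radii, m_of_r)]
    print("radius selected by the bound: r* =", radii[int(np.argmin(objective))])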

[Figure 1. MSE against the radius r of LTR for three data sets: (a) Boston Housing, (b) Ailerons, (c) Elevators. Each plot compares Transduction, Induction, and Training Error + Slack Term. The small horizontal bar indicates the location (mean ± one standard deviation) of the minimum of the empirically determined r.]

Thus, keeping all other parameters fixed, the theoretically optimal radius r* is the one that minimizes the training error plus the slack term. The figures also include plots of the training error combined with the complexity term, appropriately scaled. The empirical minimum of the radius r coincides with or is close to r*. The optimal r based on test MSE is indicated with error bars.

6.2. Stable Versions of Unstable Algorithms

We refer to the stable version of the CM algorithm presented in Sec. 5.3 as CM STABLE. We compared CM and CM STABLE empirically on the same data sets, again using m = u. For the normalized Laplacian we used k-nearest-neighbors graphs based on the Euclidean distance. The parameters k and C were chosen by five-fold cross-validation over the training set. The experiment was repeated 20 times with random partitions. The averaged mean-squared errors, with standard deviations, are reported in Table 1.

Table 1. Averaged mean-squared errors (± standard deviation) for CM and CM STABLE.

    Dataset      CM            CM STABLE
    Elevators       ±              ±
    Ailerons     049 ±             ±
    Housing      5793 ±            ± 65

We conclude from this experiment that CM and CM STABLE have the same performance. However, as we showed previously, CM STABLE has a non-trivial risk bound and thus comes with some guarantee.

7. Conclusion

We presented a comprehensive analysis of the stability of transductive regression algorithms with novel generalization bounds for a number of algorithms. Since they are algorithm-dependent, our bounds are often tighter than those based on complexity measures such as the VC-dimension. Our experiments also show the effectiveness of our bounds for model selection and the good performance of LTR algorithms.

Acknowledgments

The research of Mehryar Mohri and Ashish Rastogi was partially supported by the New York State Office of Science Technology and Academic Research (NYSTAR). This project was also sponsored in part by the Department of the Army, Award Number W81XWH. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick, MD, is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

Belkin, M., Matveeva, I., & Niyogi, P. (2004a). Regularization and semi-supervised learning on large graphs. COLT.
Belkin, M., Niyogi, P., & Sindhwani, V. (2004b). Manifold regularization (Technical Report). University of Chicago.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. JMLR, 2, 499-526.
Chapelle, O., Vapnik, V., & Weston, J. (1999). Transductive inference for estimating values of functions. NIPS 12.
Cortes, C., & Mohri, M. (2007). On transductive regression. NIPS 19.
El-Yaniv, R., & Pechyony, D. (2006). Stable transductive learning. COLT (pp. 35-49).
McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics (pp. 148-188). Cambridge University Press, Cambridge.
Schuurmans, D., & Southey, F. (2002). Metric-based methods for adaptive model selection and regularization. Machine Learning, 48, 51-84.
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Berlin: Springer.

Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley-Interscience.
Wu, M., & Schölkopf, B. (2007). Transductive classification via local learning regularization. AISTATS.
Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. NIPS 16.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML (pp. 912-919).

A. Proof of Theorem 4

Proof. Let h and h' be defined as in Lemma 5. By definition,

h = argmin_{h ∈ H} F(h, S)   and   h' = argmin_{h ∈ H} F(h, S').    (13)

Thus, F(h, S) − F(h + tΔ, S) ≤ 0 and F(h', S') − F(h' − tΔ, S') ≤ 0, where Δ = h' − h. For any hypothesis h, any Δ and any x, let Δc(h, tΔ, x) denote c(h, x) − c(h + tΔ, x), and similarly for c̃. Then, summing these two inequalities yields

(C/m) Σ_{k ≠ i} [Δc(h, tΔ, x_k) + Δc(h', −tΔ, x_k)] + (C'/u) Σ_{k ≠ j} [Δc̃(h, tΔ, x_{m+k}) + Δc̃(h', −tΔ, x_{m+k})] + (C/m) Δc(h, tΔ, x_i) + (C'/u) Δc̃(h, tΔ, x_{m+j}) + (C/m) Δc(h', −tΔ, x_{m+j}) + (C'/u) Δc̃(h', −tΔ, x_i) + E ≤ 0,    (14)

with E = ||h||²_K − ||h + tΔ||²_K + ||h'||²_K − ||h' − tΔ||²_K. By convexity of c(·, x) in its first argument, it follows that for all k ∈ [1, m+u], Δc(h, tΔ, x_k) ≥ t Δc(h, Δ, x_k) and Δc(h', −tΔ, x_k) ≥ t Δc(h', −Δ, x_k). By Lemma 4, similar inequalities hold for c̃. Moreover, for the common points the resulting terms satisfy Δc(h, Δ, x_k) + Δc(h', −Δ, x_k) = 0, and likewise for c̃, since h + Δ = h' and h' − Δ = h. It is not hard to see, by using the definition of ||·||_K in terms of the inner product of the reproducing kernel Hilbert space, that E = 2t(1 − t)||Δ||²_K. Simplifying (14) with these observations leads to

2t(1 − t)||Δ||²_K = E ≤ −(Ct/m)[Δc(h, Δ, x_i) + Δc(h', −Δ, x_{m+j})] − (C't/u)[Δc̃(h, Δ, x_{m+j}) + Δc̃(h', −Δ, x_i)].

The right-hand side can be bounded using Lemma 5, leading to a second-degree inequality in ||Δ||_K. Upper-bounding ||Δ||_K by the positive root and using (6) to bound |c(h', x) − c(h, x)| leads to the bound on β and proves the statement of the theorem.


More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Neural Network Learning as an Inverse Problem

Neural Network Learning as an Inverse Problem Neural Network Learning as an Inverse Proble VĚRA KU RKOVÁ, Institute of Coputer Science, Acadey of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague 8, Czech Republic. Eail: vera@cs.cas.cz

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD

ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical and Matheatical Sciences 04,, p. 7 5 ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD M a t h e a t i c s Yu. A. HAKOPIAN, R. Z. HOVHANNISYAN

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Multi-view Discriminative Manifold Embedding for Pattern Classification

Multi-view Discriminative Manifold Embedding for Pattern Classification Multi-view Discriinative Manifold Ebedding for Pattern Classification X. Wang Departen of Inforation Zhenghzou 450053, China Y. Guo Departent of Digestive Zhengzhou 450053, China Z. Wang Henan University

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

4.2 First-Order Logic

4.2 First-Order Logic 64 First-Order Logic and Type Theory The problem can be seen in the two qestionable rles In the existential introdction, the term a has not yet been introdced into the derivation and its se can therefore

More information

Structured Prediction Theory Based on Factor Graph Complexity

Structured Prediction Theory Based on Factor Graph Complexity Structured Prediction Theory Based on Factor Graph Coplexity Corinna Cortes Google Research New York, NY 00 corinna@googleco Mehryar Mohri Courant Institute and Google New York, NY 00 ohri@cisnyuedu Vitaly

More information

W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS

W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS. Introduction When it coes to applying econoetric odels to analyze georeferenced data, researchers are well

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

arxiv: v3 [stat.ml] 9 Aug 2016

arxiv: v3 [stat.ml] 9 Aug 2016 with Specialization to Linear Classifiers Pascal Gerain Aaury Habrard François Laviolette 3 ilie Morvant INRIA, SIRRA Project-Tea, 75589 Paris, France, et DI, École Norale Supérieure, 7530 Paris, France

More information