Stability of Transductive Regression Algorithms


Stability of Transductive Regression Algorithms

Corinna Cortes, Google Research, 76 Ninth Avenue, New York, NY 10011
Mehryar Mohri, Courant Institute of Mathematical Sciences and Google Research, 251 Mercer Street, New York, NY 10012
Dmitry Pechyony, Technion - Israel Institute of Technology, Haifa 32000, Israel
Ashish Rastogi (rastogi@cs.nyu.edu), Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012

Keywords: transductive inference, stability, regression, learning theory

Abstract

This paper uses the notion of algorithmic stability to derive novel generalization bounds for several families of transductive regression algorithms, both by using convexity and closed-form solutions. Our analysis helps compare the stability of these algorithms. It suggests that several existing algorithms might not be stable but prescribes a technique to make them stable. It also reports the results of experiments with local transductive regression demonstrating the benefit of our stability bounds for model selection, in particular for determining the radius of the local neighborhood used by the algorithm.

1. Introduction

Many learning problems in information extraction, computational biology, natural language processing and other domains can be formulated as transductive inference problems (Vapnik, 1982). In the transductive setting, the learning algorithm receives both a labeled training set, as in the standard induction setting, and a set of unlabeled test points. The objective is to predict the labels of the test points. No other test points will ever be considered. This setting arises in a variety of applications. Often, the points to label are known but they have not been assigned a label due to the prohibitive cost of labeling. This motivates the use of transductive algorithms, which leverage the unlabeled data during training to improve learning performance.

[Appearing in Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 2008. Copyright 2008 by the author(s)/owner(s).]

This paper deals with transductive regression, which arises in problems such as predicting the real-valued labels of the nodes of a known graph in computational biology, or the scores associated with known documents in information extraction or search engine tasks. Several algorithms have been devised for the specific setting of transductive regression (Belkin et al., 2004b; Chapelle et al., 1999; Schuurmans & Southey, 2002; Cortes & Mohri, 2007). Several other algorithms introduced for transductive classification can in fact be viewed as transductive regression ones, since their objective function is based on the squared loss, e.g., (Belkin et al., 2004a; 2004b). Cortes and Mohri (2007) also gave explicit VC-dimension generalization bounds for transductive regression that hold for all bounded loss functions and coincide with the tight classification bounds of Vapnik (1998) when applied to classification.

This paper presents novel algorithm-dependent generalization bounds for transductive regression. Since they are algorithm-specific, these bounds can often be tighter than bounds based on general complexity measures such as the VC-dimension. Our analysis is based on the notion of algorithmic stability. In Sec. 2 we give a formal definition of the transductive regression setting and the notion of stability for transduction. Our bounds generalize the stability bounds given by Bousquet and Elisseeff (2002) for the inductive setting and extend to regression the stability-based transductive classification bounds of El-Yaniv and Pechyony (2006).

Standard concentration bounds such as McDiarmid's bound (McDiarmid, 1989) cannot be readily applied to the transductive regression setting since the points are not drawn independently but uniformly without replacement from a finite set. Instead, a generalization of McDiarmid's bound that holds for random variables sampled without replacement is used, as in (El-Yaniv & Pechyony, 2006). Sec. 3.1 gives a simpler proof of this bound. This concentration bound is used to derive a general transductive regression stability bound in Sec. 3.2. In Sec. 4, we present the stability coefficients for a family of local transductive regression algorithms. The analysis in this section is based on convexity. In Sec. 5, we study the stability of other transductive regression algorithms (Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003) based on their closed-form solutions, and propose a modification of the seemingly unstable algorithms that makes them stable and guarantees a non-trivial generalization bound. Finally, Sec. 6 shows the results of experiments with local transductive regression demonstrating the benefit of our stability bounds for model selection, in particular for determining the radius of the local neighborhood used by the algorithm. This provides a partial validation of our bounds and analysis.

2. Definitions

Let us first describe the transductive learning setting. Assume that a full sample X of m + u examples is given. The learning algorithm further receives the labels of a random subset S of X of size m, which serves as a training sample. The remaining u unlabeled examples, x_{m+1}, ..., x_{m+u} ∈ X, serve as test data. We denote by X → (S, T) a partitioning of X into the training set S and the test set T. The transductive learning problem consists of predicting accurately the labels y_{m+1}, ..., y_{m+u} of the test examples; no other test examples will ever be considered (Vapnik, 1998). The specific problem where the labels are real-valued numbers, as in the case studied in this paper, is that of transductive regression. It differs from standard (induction) regression since the learning algorithm is given the unlabeled test examples beforehand and can thus exploit this information to improve performance.

[Footnote: Another natural setting for transduction is one where the training and test samples are both drawn according to the same distribution and where the test points, but not their labels, are made available to the learning algorithm. However, as pointed out by Vapnik (1998), any generalization bound in the setting we analyze directly yields a bound for this other setting, essentially by taking the expectation.]

We denote by c(h, x) the cost of an error of a hypothesis h on a point x labeled with y(x). The cost function commonly used in regression is the squared loss c(h, x) = (h(x) − y(x))². In the remainder of this paper, we will assume a squared loss, but many of our results generalize to other convex cost functions. The training and test errors of h are respectively

R̂(h) = (1/m) Σ_{k=1}^m c(h, x_k)   and   R(h) = (1/u) Σ_{k=1}^u c(h, x_{m+k}).

The generalization bounds we derive are based on the notion of transductive algorithmic stability.

Definition 1 (Transduction β-stability). Let L be a transductive learning algorithm, let h denote the hypothesis returned by L for X → (S, T) and h' the hypothesis returned for X → (S', T'). L is said to be uniformly β-stable with respect to the cost function c if there exists β ≥ 0 such that for any two partitionings X → (S, T) and X → (S', T') that differ in exactly one training (and thus test) point, and for all x ∈ X,

|c(h', x) − c(h, x)| ≤ β.    (1)
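To make the setting concrete, the following Python sketch (our illustration, not part of the paper's algorithms) partitions a toy full sample into S and T, computes the training and test errors just defined, and empirically probes the β of Definition 1 by swapping one training point with one test point. The constant-predictor learner and the synthetic data are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy full sample X of m + u points with real-valued labels y.
    m, u = 50, 50
    X = rng.normal(size=(m + u, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m + u)

    def squared_loss(pred, target):
        return (pred - target) ** 2

    def train(train_idx):
        # Toy "learning algorithm": predict the mean of the training labels everywhere.
        return float(np.mean(y[train_idx]))

    # Random partitioning X -> (S, T): S of size m, sampled uniformly without replacement.
    perm = rng.permutation(m + u)
    S, T = perm[:m], perm[m:]

    h = train(S)
    R_hat = np.mean(squared_loss(h, y[S]))   # training error R^(h)
    R = np.mean(squared_loss(h, y[T]))       # test error R(h)
    print(f"R_hat = {R_hat:.3f}, R = {R:.3f}")

    # Empirical probe of the beta of Definition 1: swap one training point with one
    # test point and record the largest change in cost over all points of X.
    beta_hat = 0.0
    for i in S[:10]:
        for j in T[:10]:
            S_swapped = np.setdiff1d(np.append(S, j), [i])
            h_swapped = train(S_swapped)
            beta_hat = max(beta_hat, np.abs(squared_loss(h, y) - squared_loss(h_swapped, y)).max())
    print(f"empirical estimate of beta: {beta_hat:.4f}")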
3. Transduction Stability Bounds

3.1. Concentration Bound for Sampling without Replacement

Stability-based generalization bounds in the inductive setting are based on McDiarmid's inequality (1989). In the transductive setting, the points are drawn uniformly without replacement and thus are not independent. Therefore, McDiarmid's concentration bound cannot be readily used. Instead, a generalization of McDiarmid's bound for sampling without replacement is needed, as in El-Yaniv and Pechyony (2006). We will denote by S_1^m a sequence of random variables S_1, ..., S_m, write S_1^m = x_1^m as a shorthand for the equalities S_i = x_i, i = 1, ..., m, and write Pr[x | x_1^{i-1}, x_i] = Pr[S_{i+1}^m = x | S_1^{i-1} = x_1^{i-1}, S_i = x_i].

Theorem 1 (McDiarmid, 1989). Let S_1^m be a sequence of random variables, each S_i taking values in the set X, and assume that a measurable function φ: X^m → R satisfies: for all i ∈ [1, m] and all x_1^{i-1}, x_i, x_i',

|E_S[φ | S_1^{i-1} = x_1^{i-1}, S_i = x_i] − E_S[φ | S_1^{i-1} = x_1^{i-1}, S_i = x_i']| ≤ c_i.

Then, for all ε > 0,

Pr[|φ − E[φ]| ≥ ε] ≤ 2 exp(−2ε² / Σ_{i=1}^m c_i²).

The following is a concentration bound for sampling without replacement, needed to analyze the generalization of transductive algorithms.

Theorem 2. Let x_1^m be a sequence of random variables sampled from an underlying set X of m + u elements without replacement, and let φ: X^m → R be a symmetric function such that for all i ∈ [1, m] and for all x_1, ..., x_m ∈ X and x_1', ..., x_m' ∈ X,

|φ(x_1, ..., x_m) − φ(x_1, ..., x_{i−1}, x_i', x_{i+1}, ..., x_m)| ≤ c.

Then, for all ε > 0,

Pr[|φ − E[φ]| ≥ ε] ≤ 2 exp(−2ε² / (α(m,u) c²)),   where α(m,u) = mu/(m+u−1/2) · 1/(1 − 1/(2 max{m,u})).

Proof. For a fixed i ∈ [1, m], let g(x_1^i) = E_S[φ | S_1^{i−1} = x_1^{i−1}, S_i = x_i] − E_S[φ | S_1^{i−1} = x_1^{i−1}, S_i = x_i']. Then,

g(x_1^i) = Σ_x φ(x_1^{i−1}, x_i, x) Pr[x | x_1^{i−1}, x_i] − Σ_{x'} φ(x_1^{i−1}, x_i', x') Pr[x' | x_1^{i−1}, x_i'].

For uniform sampling without replacement, the probability terms can be written as

Pr[x | x_1^{i−1}, x_i] = Π_{k=i+1}^m 1/(m+u−k+1) = u!/(m+u−i)!.

Thus, g(x_1^i) = [u!/(m+u−i)!] [Σ_x φ(x_1^{i−1}, x_i, x) − Σ_{x'} φ(x_1^{i−1}, x_i', x')]. To compute the expression between brackets, we divide the set of permutations {x} into two sets: those that contain x_i' and those that do not. If a permutation x contains x_i', we can write it as x_{i+1}^{k−1} x_i' x_{k+1}^m, where k is such that x_k = x_i'. We then match it up with the permutation x_{i+1}^{k−1} x_i x_{k+1}^m from the set {x'}. These two permutations, taken together with their common prefix, contain exactly the same elements, and since the function φ is symmetric in its arguments, the difference in the value of the function on the two permutations is zero. In the other case, if a permutation x does not contain the element x_i', then we simply match it up with the same permutation in {x'}. The matching permutations appearing in the summation are then x_i x and x_i' x, which clearly differ only with respect to x_i. The difference in the value of φ in this case can be bounded by c. The number of such permutations is (m+u−i−1)!/(u−1)!, which leads to the following upper bound:

|Σ_x φ(x_1^{i−1}, x_i, x) − Σ_{x'} φ(x_1^{i−1}, x_i', x')| ≤ [(m+u−i−1)!/(u−1)!] c,

which implies that |g(x_1^i)| ≤ [u!/(m+u−i)!] · [(m+u−i−1)!/(u−1)!] c = [u/(m+u−i)] c. Then, combining Theorem 1 with the identity Σ_{i=1}^m u²/(m+u−i)² ≤ mu²/((u−1/2)(m+u−1/2)) yields that

Pr[|φ − E[φ]| ≥ ε] ≤ 2 exp(−2ε² / (α'(m,u) c²)),   where α'(m,u) = mu/(m+u−1/2) · 1/(1 − 1/(2u)).

The function φ is symmetric in m and u in the sense that selecting one of the sets uniquely determines the other set. The statement of the theorem then follows from a similar bound with α''(m,u) = mu/(m+u−1/2) · 1/(1 − 1/(2m)), taking the tighter of the two.
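As a quick sanity check of Theorem 2, the small simulation below (an illustration under toy assumptions of ours, not an experiment from the paper) samples m points without replacement from a finite set, takes φ to be the mean of the sampled values, for which one may take c = (max − min)/m, and compares the observed deviation probabilities with the bound 2 exp(−2ε²/(α(m,u)c²)).

    import numpy as np

    rng = np.random.default_rng(1)

    # Finite underlying set X of m + u real values.
    m, u = 40, 60
    values = rng.uniform(0.0, 1.0, size=m + u)
    c = (values.max() - values.min()) / m          # change of phi under one replacement

    def alpha(m, u):
        # alpha(m, u) of Theorem 2.
        return (m * u) / (m + u - 0.5) / (1.0 - 1.0 / (2 * max(m, u)))

    # phi(S) = mean of the m values sampled without replacement (a symmetric function).
    n_trials = 20000
    phis = np.array([rng.choice(values, size=m, replace=False).mean()
                     for _ in range(n_trials)])
    E_phi = values.mean()                          # E[phi] under uniform sampling

    for eps in (0.01, 0.02, 0.05):
        empirical_tail = np.mean(np.abs(phis - E_phi) >= eps)
        bound = 2 * np.exp(-2 * eps**2 / (alpha(m, u) * c**2))
        print(f"eps={eps:.2f}: empirical {empirical_tail:.4f} <= bound {min(bound, 1.0):.4f}")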
3.2. Transductive Stability Bound

To obtain a general transductive regression stability bound, we apply the concentration bound of Theorem 2 to the random variable φ(S) = R(h) − R̂(h). To do so, we need to bound E_S[φ(S)], where S is a random subset of X of size m, and |φ(S) − φ(S')|, where S and S' are samples differing by exactly one point.

Lemma 1. Let H be a bounded hypothesis set (∀x ∈ X, |h(x) − y(x)| ≤ B) and L a β-stable algorithm returning the hypotheses h and h' for two training sets S and S' of size m each, respectively, differing in exactly one point. Then,

|φ(S) − φ(S')| ≤ 2β + B²(m + u)/(mu).    (2)

Proof. By definition, S and S' differ in exactly one point. Let x_i ∈ S and x_{m+j} ∈ S' be the points in which the two sets differ. The lemma follows from the observation that for each one of the common labeled points in S and S', and for each one of the common test points in T and T' (recall T = X \ S, T' = X \ S'), the difference in cost is bounded by β, while for x_i and x_{m+j} the difference in cost is bounded by B². Then, it follows that

|φ(S) − φ(S')| ≤ ((m−1)/m)β + ((u−1)/u)β + B²/m + B²/u ≤ 2β + B²(1/m + 1/u).

Lemma 2. Let h be the hypothesis returned by a β-stable algorithm L. Then, |E_S[φ(S)]| ≤ β.

Proof. By definition of φ(S), its expectation is (1/u) Σ_{k=1}^u E_S[c(h, x_{m+k})] − (1/m) Σ_{k=1}^m E_S[c(h, x_k)]. Since E_S[c(h, x_{m+j})] is the same for all j ∈ [1, u], and E_S[c(h, x_i)] is the same for all i ∈ [1, m], for any i and j, E_S[φ(S)] = E_S[c(h, x_{m+j})] − E_S[c(h, x_i)] = E_{S'}[c(h', x_i)] − E_S[c(h, x_i)]. Thus, |E_S[φ(S)]| = |E_{S,S' ⊆ X}[c(h', x_i) − c(h, x_i)]| ≤ β.

Theorem 3. Let H be a bounded hypothesis set (∀x ∈ X, |h(x) − y(x)| ≤ B) and L a β-stable algorithm. Let h be the hypothesis returned by L when trained on X → (S, T). Then, for any δ > 0, with probability at least 1 − δ,

R(h) ≤ R̂(h) + β + (2β + B²(m + u)/(mu)) √(α(m,u) ln(1/δ) / 2).

Proof. The result follows directly from Theorem 2 and Lemmas 1 and 2.

This is a general bound that applies to any transductive algorithm. To apply it, the stability coefficient β, which depends on m and u, needs to be determined.
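Since all quantities in Theorem 3 are explicit, the slack term can be evaluated directly. The sketch below (ours) computes it for given β, B, m, u and δ; the example values at the end are arbitrary and only meant to show that a stability coefficient of order 1/m yields a slack of order 1/√m when m = u.

    import numpy as np

    def alpha(m, u):
        # alpha(m, u) of Theorem 2.
        return (m * u) / (m + u - 0.5) / (1.0 - 1.0 / (2 * max(m, u)))

    def theorem3_slack(beta, B, m, u, delta):
        # Slack of Theorem 3: beta + (2 beta + B^2 (m+u)/(m u)) sqrt(alpha(m,u) ln(1/delta) / 2).
        return beta + (2 * beta + B**2 * (m + u) / (m * u)) * np.sqrt(
            alpha(m, u) * np.log(1.0 / delta) / 2.0)

    # Arbitrary illustrative values.
    for m in (100, 1000, 10000):
        print(m, theorem3_slack(beta=1.0 / m, B=1.0, m=m, u=m, delta=0.05))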

In the subsequent sections, we derive bounds on β for a number of transductive regression algorithms (Cortes & Mohri, 2007; Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003).

4. Stability of Local Transductive Regression Algorithms

This section describes and analyzes a general family of local transductive regression algorithms (LTR) generalizing the algorithm of Cortes and Mohri (2007). LTR algorithms can be viewed as a generalization of the so-called kernel regularization-based learning algorithms to the transductive setting. The objective function that they minimize is of the form

F(h, S) = ||h||²_K + (C/m) Σ_{k=1}^m c(h, x_k) + (C'/u) Σ_{k=1}^u c̃(h, x_{m+k}),    (3)

where ||·||_K is the norm in the reproducing kernel Hilbert space (RKHS) with associated kernel K, C ≥ 0 and C' ≥ 0 are trade-off parameters, and c̃(h, x) = (h(x) − ỹ(x))² is the error of the hypothesis h on the unlabeled point x with respect to a pseudo-target ỹ. Pseudo-targets are obtained from the labels y(x) of neighboring points by a local weighted average. Neighborhoods can be defined as a ball of radius r around each point in the feature space. We will denote by β_loc the score-stability coefficient of the local algorithm used, that is, the maximal amount by which the two hypotheses differ on any given point when trained on samples disagreeing on one point. This notion is stronger than that of cost-based stability.
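Because the LTR objective (3) is a weighted regularized least-squares problem, it admits a closed-form solution once the pseudo-targets are fixed. The sketch below (our simplified illustration, not the authors' implementation) expands h over the kernel sections of all m + u points and solves for the coefficients; the pseudo-targets are computed as a plain average of the labels falling in a ball of radius r, a simplification of the local weighted average described above, and the Gaussian kernel and toy data are assumptions of the sketch.

    import numpy as np

    rng = np.random.default_rng(2)

    m, u, r = 60, 60, 1.0          # sample sizes and neighborhood radius
    C, C_prime, sigma = 1.0, 0.1, 1.0

    X = rng.normal(size=(m + u, 2))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m + u)   # toy targets
    X_lab, y_lab = X[:m], y[:m]                          # labeled part S
    X_unl = X[m:]                                        # unlabeled part T

    def gaussian_kernel(A, B, sigma):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2))

    # Pseudo-targets: plain average of the labels falling in a ball of radius r
    # (a simplified stand-in for the local weighted average described above).
    d2 = ((X_unl[:, None, :] - X_lab[None, :, :]) ** 2).sum(-1)
    in_ball = d2 <= r**2
    y_tilde = np.where(in_ball.any(1),
                       (in_ball * y_lab).sum(1) / np.maximum(in_ball.sum(1), 1), 0.0)

    # Minimizer of (3) via the expansion h(.) = sum_i a_i K(x_i, .):
    # a = (I + W K)^{-1} W t, with W = diag(C/m on S, C'/u on T) and t = (y, y_tilde).
    K = gaussian_kernel(X, X, sigma)
    w = np.concatenate([np.full(m, C / m), np.full(u, C_prime / u)])
    t = np.concatenate([y_lab, y_tilde])
    a = np.linalg.solve(np.eye(m + u) + w[:, None] * K, w * t)
    h = K @ a                                            # h evaluated on all m + u points

    print("training MSE:", np.mean((h[:m] - y[:m]) ** 2))
    print("test MSE    :", np.mean((h[m:] - y[m:]) ** 2))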
In this section, we use the bounded-labels assumption, that is, ∀x ∈ S, |y(x)| ≤ M. We also assume that for any x ∈ X, K(x, x) ≤ κ². We will use the following bound, based on the reproducing property and the Cauchy-Schwarz inequality, valid for any hypothesis h ∈ H: for all x ∈ X,

|h(x)| = |⟨h, K(x, ·)⟩| ≤ ||h||_K √(K(x, x)) ≤ κ ||h||_K.    (4)

Lemma 3. Let h be the hypothesis minimizing (3). Assume that for any x ∈ X, K(x, x) ≤ κ². Then, for any x ∈ X, |h(x)| ≤ κM √(C + C').

Proof. The proof is a straightforward adaptation of the technique of (Bousquet & Elisseeff, 2002) to LTR algorithms. By Eqn. (4), |h(x)| ≤ κ||h||_K. Let 0 ∈ H be the hypothesis assigning label zero to all examples. By definition of h, F(h, S) ≤ F(0, S) ≤ (C + C')M². Using ||h||²_K ≤ F(h, S) yields the statement.

Since |h(x)| ≤ κM√(C + C'), this immediately gives us the bound |h(x) − y(x)| ≤ M(1 + κ√(C + C')). Thus, we are in a position to apply Theorem 3 with B = AM, where A = 1 + κ√(C + C'). We now derive a bound on the stability coefficient β. To do so, the key property we will use is the convexity of h ↦ c(h, x). Note, however, that in the case of c̃, the pseudo-targets may depend on the training set S. This dependency matters when we wish to apply convexity with two hypotheses h and h' obtained by training on different samples S and S'. For convenience, for any two such fixed hypotheses h and h', we extend the definition of c̃ as follows. For all t ∈ [0, 1],

c̃(th + (1−t)h', x) = ((th + (1−t)h')(x) − (tỹ + (1−t)ỹ'))².

This allows us to use the same convexity property for c̃ as for c for any two fixed hypotheses h and h', as verified by the following lemma, and does not affect the proofs otherwise.

Lemma 4. Let h be a hypothesis obtained by training on S and h' a hypothesis obtained by training on S'. Then, for all t ∈ [0, 1],

t c̃(h, x) + (1−t) c̃(h', x) ≥ c̃(th + (1−t)h', x).    (5)

Proof. Let ỹ = ỹ(x) be the pseudo-target value at x when the training set is S and ỹ' = ỹ'(x) when the training set is S'. For all t ∈ [0, 1],

t c̃(h, x) + (1−t) c̃(h', x) − c̃(th + (1−t)h', x) = t(h(x) − ỹ)² + (1−t)(h'(x) − ỹ')² − [t(h(x) − ỹ) + (1−t)(h'(x) − ỹ')]².

The statement of the lemma follows directly from the convexity of x ↦ x² over the real numbers.

Let h' be a hypothesis obtained by training on S' and h by training on S, and let Δ = h' − h. Then, for all x ∈ X,

|c(h', x) − c(h, x)| = |Δ(x)| |(h(x) − y(x)) + (h'(x) − y(x))| ≤ 2M(1 + κ√(C + C')) |Δ(x)|.

As in (4), for all x ∈ X, |Δ(x)| ≤ κ||Δ||_K, thus for all x ∈ X,

|c(h', x) − c(h, x)| ≤ 2M(1 + κ√(C + C')) κ ||Δ||_K.    (6)

Lemma 5. Assume that for all x ∈ X, |y(x)| ≤ M. Let S and S' be two samples differing by exactly one point, and let x_i denote the point in which they differ that belongs to S. Let h be the hypothesis returned by the algorithm minimizing the objective function F(h, S), h' the hypothesis obtained by minimization of F(h, S'), and let ỹ and ỹ' be the corresponding pseudo-targets. Then,

(C/m)[c(h', x_i) − c(h, x_i)] + (C'/u)[c̃(h', x_i) − c̃(h, x_i)] ≤ 2AM (κ||Δ||_K (C/m + C'/u) + β_loc C'/u),

where Δ = h' − h and A = 1 + κ√(C + C').

Proof. Let c̃(h_i, ỹ_i) denote c̃(h, x_i) and c̃(h'_i, ỹ'_i) denote c̃(h', x_i). By Lemma 3 and the bounded-labels assumption,

c̃(h'_i, ỹ'_i) − c̃(h_i, ỹ_i) = c̃(h'_i, ỹ'_i) − c̃(h'_i, ỹ_i) + c̃(h'_i, ỹ_i) − c̃(h_i, ỹ_i) = (ỹ_i − ỹ'_i)(ỹ_i + ỹ'_i − 2h'_i) + (h'_i − h_i)(h'_i + h_i − 2ỹ_i).

By the score-stability of the local estimates, |ỹ'(x_i) − ỹ(x_i)| ≤ β_loc. Thus,

c̃(h'_i, ỹ'_i) − c̃(h_i, ỹ_i) ≤ 2AM(β_loc + κ||Δ||_K).    (7)

Using (6) leads, after simplification, to the statement of the lemma.

The proof of the following theorem is based on Lemma 4 and Lemma 5 and is given in the appendix.

Theorem 4. Assume that for all x ∈ X, |y(x)| ≤ M and that there exists κ such that ∀x ∈ X, K(x, x) ≤ κ². Further, assume that the local estimator has uniform stability coefficient β_loc. Let A = 1 + κ√(C + C'). Then, LTR is uniformly β-stable with

β ≤ 2(AM)²κ² [ (C/m + C'/u) + √( (C/m + C'/u)² + 2C'β_loc/(u AM κ²) ) ].

Our experiments with LTR will demonstrate the benefit of this bound for model selection (Sec. 6).

5. Stability Based on Closed-Form Solutions

5.1. Unconstrained Regularization Algorithms

In this section, we consider a family of transductive regression algorithms that can be formulated as the following optimization problem:

min_h  hᵀQh + (h − y)ᵀC(h − y),    (8)

where Q ∈ R^{(m+u)×(m+u)} is a symmetric regularization matrix, C ∈ R^{(m+u)×(m+u)} is a symmetric matrix of empirical weights (in practice it is often a diagonal matrix), y ∈ R^{m+u} contains the target values of the labeled points together with the pseudo-target values of the unlabeled points (in some formulations, the pseudo-target value is 0), and h ∈ R^{m+u} is a column vector whose ith component is the predicted target value for x_i. The closed-form solution of (8) is given by

h = (C⁻¹Q + I)⁻¹ y.    (9)

The formulation (8) is quite general and includes as special cases the algorithms of (Belkin et al., 2004a; Wu & Schölkopf, 2007; Zhou et al., 2004; Zhu et al., 2003). We present a general framework for bounding the stability coefficient of these algorithms and then examine the stability coefficient of each of these algorithms in turn.

For a symmetric matrix A ∈ R^{n×n}, we will denote by λ_M(A) its largest eigenvalue and by λ_m(A) its smallest. Then, for any v ∈ R^n, λ_m(A)||v||_2 ≤ ||Av||_2 ≤ λ_M(A)||v||_2. We will also use, in the proof of the following proposition, the fact that for symmetric positive semi-definite matrices A, B ∈ R^{n×n}, λ_M(AB) ≤ λ_M(A)λ_M(B).

Proposition 1. Let h and h' solve (8) under test and training sets that differ in exactly one point, and let C, C', y, y' be the corresponding empirical weight matrices and target value vectors. Then,

||h' − h||_2 ≤ ||y' − y||_2 / (λ_m(Q)/λ_M(C') + 1) + λ_M(Q) ||C'⁻¹ − C⁻¹||_2 ||y||_2 / [ (λ_m(Q)/λ_M(C') + 1)(λ_m(Q)/λ_M(C) + 1) ].

Proof. Let Δ = h' − h and Δy = y' − y. Let C̄ = (C⁻¹Q + I)⁻¹ and C̄' = (C'⁻¹Q + I)⁻¹. By definition,

Δ = C̄'y' − C̄y = C̄'Δy + (C̄' − C̄)y = C̄'Δy + (C̄'[(C⁻¹ − C'⁻¹)Q]C̄)y.

Thus,

||Δ||_2 ≤ ||Δy||_2 λ_M(C̄') + λ_M(Q) ||C'⁻¹ − C⁻¹||_2 ||y||_2 λ_M(C̄')λ_M(C̄).    (10)

Furthermore, λ_M(C̄) ≤ 1/(λ_m(Q)/λ_M(C) + 1), and similarly for C̄'. Plugging these bounds back into Eqn. (10) yields the statement.

Since ||h' − h||_∞ is bounded by ||h' − h||_2, the proposition provides a bound on the score-stability of h for the transductive regression algorithms of Zhou et al. (2004), Wu and Schölkopf (2007) and Zhu et al. (2003). For each of these algorithms, the pseudo-targets used are zero. If we make the bounded-labels assumption (∀x ∈ X, |y(x)| ≤ M, for some M > 0), it is not difficult to show that ||y' − y||_2 ≤ √2 M and ||y||_2 ≤ M√m. We now examine each algorithm in turn.
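Before doing so, the following sketch (ours) makes the closed-form solution (9) and the notion of score-stability concrete: it builds a simple graph-Laplacian regularizer Q as an illustrative choice, solves h = (C⁻¹Q + I)⁻¹y for two partitionings differing in one training point, and measures ||h' − h||_∞, the quantity controlled through ||h' − h||_2 in Proposition 1. The data, labels and constant weight are assumptions of the sketch.

    import numpy as np

    rng = np.random.default_rng(3)

    m, u, mu_weight = 30, 30, 1.0
    n = m + u
    X = rng.normal(size=(n, 2))
    labels = np.tanh(X[:, 0])                    # toy bounded labels, |y| <= M = 1

    # An illustrative regularizer Q: the (unnormalized) Laplacian of a similarity graph.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2)
    np.fill_diagonal(W, 0.0)
    Q = np.diag(W.sum(1)) - W                    # symmetric PSD, smallest eigenvalue 0

    def solve_closed_form(train_idx):
        # y carries the labels on training points and pseudo-target 0 on test points;
        # C is diagonal with the same weight mu on every point.
        y = np.zeros(n)
        y[train_idx] = labels[train_idx]
        C_inv = np.eye(n) / mu_weight
        return np.linalg.solve(C_inv @ Q + np.eye(n), y)   # h = (C^{-1} Q + I)^{-1} y

    perm = rng.permutation(n)
    S = perm[:m]
    S_swapped = np.append(np.delete(S, 0), perm[m])        # differs from S in one training point

    h = solve_closed_form(S)
    h_swapped = solve_closed_form(S_swapped)
    print("score change ||h' - h||_inf =", np.abs(h_swapped - h).max())
    print("||h' - h||_2                =", np.linalg.norm(h_swapped - h))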

Consistency method (CM). In the CM algorithm (Zhou et al., 2004), the matrix Q is the normalized Laplacian of a weight matrix W ∈ R^{(m+u)×(m+u)} that captures the affinity between pairs of points in the full sample X. Thus, Q = I − D^{−1/2}WD^{−1/2}, where D ∈ R^{(m+u)×(m+u)} is the diagonal matrix with [D]_{i,i} = Σ_j [W]_{i,j}. Note that λ_m(Q) = 0. Furthermore, the matrices C and C' are identical in CM, both diagonal with (i,i)th entry equal to a positive constant µ > 0. Thus C = C' and, using Prop. 1, we obtain the following bound on the score-stability of the CM algorithm: β_CM ≤ √2 M.

Local learning regularization (LL Reg). In the LL Reg algorithm (Wu & Schölkopf, 2007), the regularization matrix Q is (I − A)ᵀ(I − A), where I ∈ R^{(m+u)×(m+u)} is the identity matrix and A ∈ R^{(m+u)×(m+u)} is a non-negative weight matrix that captures the local similarity between all pairs of points in X. A is normalized, i.e., each of its rows sums to 1. Let C_l, C_u > 0 be two positive constants. The matrix C is diagonal with [C]_{i,i} = C_l if x_i ∈ S and C_u otherwise. Let C_max = max{C_l, C_u} and C_min = min{C_l, C_u}. Thus, ||C'⁻¹ − C⁻¹||_2 = √2 (1/C_min − 1/C_max). By the Perron-Frobenius theorem, the eigenvalues of A lie in the interval (−1, 1] and λ_M(A) ≤ 1. Thus, λ_m(Q) ≥ 0 and λ_M(Q) ≤ 4, and we have the following bound on the score-stability of the LL Reg algorithm:

β_LL Reg ≤ √2 M + 4√(2m) M (1/C_min − 1/C_max) ≤ √2 M + 4√(2m) M / C_min.

Gaussian Mean Fields algorithm (GMF). GMF (Zhu et al., 2003) is very similar to LL Reg and admits exactly the same stability coefficient.

Thus, the stability coefficients of the CM, LL Reg, and GMF algorithms can be large. Without additional constraints on the matrix Q, these algorithms do not seem to be stable enough for the generalization bound of Theorem 3 to converge. A particular example of such a constraint is the condition Σ_{i=1}^{m+u} h(x_i) = 0 used by Belkin et al.'s algorithm (2004a). In the next section, we give a generalization bound for this algorithm and then describe a general method for making the algorithms just examined stable.

5.2. Stability of Constrained Regularization Algorithms

This subsection analyzes constrained regularization algorithms such as the Laplacian-based graph regularization algorithm of Belkin et al. (2004a). Given a weighted graph G = (X, E) in which the edge weights represent the extent of similarity between vertices, the task consists of predicting the vertex labels. The hypothesis h returned by the algorithm is the solution of the following optimization problem:

min_{h ∈ H}  hᵀLh + (C/m) Σ_{i=1}^m (h(x_i) − y_i)²   subject to   Σ_{i=1}^{m+u} h(x_i) = 0,    (11)

where L ∈ R^{(m+u)×(m+u)} is a smoothness matrix, e.g., the graph Laplacian, and {y_i : i ∈ [1, m]} are the target values of the labeled nodes. The hypothesis set H in this case can be thought of as a hyperplane in R^{m+u} that is orthogonal to the vector 1 ∈ R^{m+u}. Maintaining the notation used in (Belkin et al., 2004a), we let P_H denote the operator corresponding to the orthogonal projection onto H. For a sample S drawn without replacement from X, define I_S ∈ R^{(m+u)×(m+u)} to be the diagonal matrix with [I_S]_{i,i} = 1 if x_i ∈ S and 0 otherwise. Similarly, let y_S ∈ R^{m+u} be the column vector with [y_S]_i = y_i if x_i ∈ S and 0 otherwise. The closed-form solution on a training sample S is given by (Belkin et al., 2004a):

h_S = P_H ((m/C) L + I_S)⁻¹ y_S.    (12)

Theorem 5. Assume that the vertex labels of the graph G = (X, E) and the hypothesis h obtained by optimizing Eqn. (11) are both bounded (∀x, |h(x)| ≤ M and |y(x)| ≤ M for some M > 0). Let A = 1 + κ√C. Then, for any δ > 0, with probability at least 1 − δ,

R(h) ≤ R̂(h) + β + (2β + (AM)²(m + u)/(mu)) √(α(m,u) ln(1/δ) / 2),

with α(m,u) = mu/(m+u−1/2) · 1/(1 − 1/(2 max{m,u})) and

β ≤ 4√2 M² / (mλ_2/C − √2) + 4√2 M² / (mλ_2/C − √2)²,

where λ_2 is the second smallest eigenvalue of the Laplacian.

Proof. The proof is similar to that of (Belkin et al., 2004a) but uses our general transductive regression bound instead.
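The constrained closed-form solution is easy to reproduce numerically. The sketch below (ours) uses the expression h_S = P_H((m/C)L + I_S)⁻¹y_S as reconstructed in (12), so the exact (m/C) scaling of L should be treated as an assumption of this sketch; it builds a toy graph Laplacian, forms the projection onto the hyperplane orthogonal to the all-ones vector, and computes the hypothesis for a random training sample.

    import numpy as np

    rng = np.random.default_rng(4)

    m, u, C = 40, 40, 1.0
    n = m + u

    # Toy weighted graph on the n vertices and its Laplacian L.
    X = rng.normal(size=(n, 2))
    W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W
    labels = X[:, 0] - X[:, 0].mean()            # toy vertex labels

    # Projection P_H onto the hyperplane { h : sum_i h_i = 0 }, orthogonal to the ones vector.
    ones = np.ones((n, 1))
    P_H = np.eye(n) - ones @ ones.T / n

    # Training sample S, indicator matrix I_S and padded label vector y_S.
    S = rng.permutation(n)[:m]
    I_S = np.zeros((n, n))
    I_S[S, S] = 1.0
    y_S = np.zeros(n)
    y_S[S] = labels[S]

    # Closed-form solution (12); the (m/C) scaling of L is the assumption noted above.
    h_S = P_H @ np.linalg.solve((m / C) * L + I_S, y_S)
    print("sum of predictions (should be ~0):", h_S.sum())
    print("training MSE:", np.mean((h_S[S] - labels[S]) ** 2))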
The generalization bound we just presented differs in several respects from that of Belkin et al. (2004a). Our bound explicitly depends on both m and u, while theirs shows only a dependency on m. Also, our bound does not depend on the number of times a point is sampled in the training set (the parameter t in their analysis), thanks to our analysis based on sampling without replacement. Contrasting the stability coefficient of Belkin et al.'s algorithm with the stability coefficient of LTR (Theorem 4), we note that it does not depend on C' and β_loc. This is because unlabeled points do not enter the objective function, and thus C' = 0 and ỹ(x) = 0 for all x ∈ X.

However, the stability does depend on the second smallest eigenvalue λ_2 of the Laplacian, and the bound diverges as mλ_2 approaches √2 C. In all our regression experiments, we observed that this algorithm does not perform as well in comparison with LTR.

5.3. Making Seemingly Unstable Algorithms Stable

In Sec. 5.2, we saw that imposing additional constraints on the hypothesis, e.g., Σ_i h(x_i) = 0, allowed one to derive non-trivial stability bounds. This idea can be generalized, and similar non-trivial stability bounds can be derived for stable versions of the algorithms presented in Sec. 5.1: CM, LL Reg, and GMF. Recall that the stability bound in Prop. 1 is inversely proportional to the smallest eigenvalue λ_m(Q). The main difficulty with using the proposition for these algorithms is that λ_m(Q) = 0 in each case. Let v_m denote the eigenvector corresponding to λ_m(Q) and let λ_2 be the second smallest eigenvalue of Q. One can modify (8) and constrain the solution to be orthogonal to v_m by imposing h · v_m = 0. In the case of (Belkin et al., 2004a), v_m = 1. This modification, motivated by the algorithm of (Belkin et al., 2004a), is equivalent to increasing the smallest eigenvalue to λ_2. As an example, by imposing the additional constraint, we can show that the stability coefficient of CM becomes bounded by O(C/λ_2) instead of Θ(1). Thus, if C = O(1/m) and λ_2 = Ω(1), it is bounded by O(1/m) and the generalization bound converges as O(1/√m).
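The modification just described can be implemented by solving (8) in the subspace orthogonal to the bottom eigenvector v_m of Q, which effectively replaces λ_m(Q) = 0 by λ_2 in Proposition 1. The sketch below (ours, on toy data and with an arbitrary weight C) applies this to a CM-style problem: it compares the two smallest eigenvalues of Q and computes both the unconstrained solution (9) and the constrained one.

    import numpy as np

    rng = np.random.default_rng(5)

    m, u, C_const = 30, 30, 0.1
    n = m + u
    X = rng.normal(size=(n, 2))
    labels = np.tanh(X[:, 0])

    # Normalized Laplacian Q = I - D^{-1/2} W D^{-1/2}, as in CM; lambda_m(Q) = 0.
    W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(W, 0.0)
    d = W.sum(1)
    Q = np.eye(n) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
    C = C_const * np.eye(n)                      # empirical weight matrix of (8)

    S = rng.permutation(n)[:m]
    y = np.zeros(n)
    y[S] = labels[S]                             # zero pseudo-targets on test points

    eigvals, eigvecs = np.linalg.eigh(Q)
    print("lambda_m(Q) =", eigvals[0], " lambda_2(Q) =", eigvals[1])

    # Unconstrained solution (9).
    h = np.linalg.solve(np.linalg.inv(C) @ Q + np.eye(n), y)

    # Stabilized solution: solve (8) restricted to the complement of v_m, the bottom
    # eigenvector of Q, so that the effective smallest eigenvalue becomes lambda_2.
    B = eigvecs[:, 1:]                           # orthonormal basis of that complement
    z = np.linalg.solve(B.T @ (Q + C) @ B, B.T @ C @ y)
    h_stable = B @ z

    print("|<h_stable, v_m>|  =", abs(h_stable @ eigvecs[:, 0]))
    print("max |h - h_stable| =", np.abs(h - h_stable).max())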
6. Experiments

6.1. Model Selection Based on the Bound

This section reports the results of experiments using our stability-based generalization bound for model selection for the LTR algorithm. A crucial parameter of this algorithm is the stability coefficient β_loc(r) of the local algorithm, which computes pseudo-targets ỹ_x based on a ball of radius r around each point. We derive an expression for β_loc(r) and show, using extensive experiments with multiple data sets, that the value r* minimizing the bound is a remarkably good estimate of the best r for the test error. This demonstrates the benefit of our generalization bound for model selection, avoiding the need for a held-out validation set.

The experiments were carried out on several publicly available regression data sets: Boston Housing, Elevators and Ailerons (www.liaad.up.pt/~ltorgo/regression/datasets.html). For each of these data sets, we used m = u, inspired by the observation that, all other parameters being fixed, the bound of Theorem 3 is tightest when m = u. The values of the input variables were normalized to have mean zero and variance one. For the Boston Housing data set, the total number of examples was 506. For the Elevators and the Ailerons data sets, a random subset of 2000 examples was used; other random subsets of 2000 examples led to similar results. The Boston Housing experiments were repeated for 50 random partitions, while for the Elevators and the Ailerons data sets the experiments were repeated for 20 random partitions each. Since the target values for the Elevators and the Ailerons data sets were extremely small, they were scaled by factors of 1000 and 100 respectively in a pre-processing step.

In our experiments, we estimated the pseudo-target of a point x' ∈ T as a weighted average of the labeled points x ∈ N(x') in a neighborhood of x': ỹ_{x'} = Σ_{x ∈ N(x')} α_x y_x / Σ_{x ∈ N(x')} α_x. The weights are defined in terms of a similarity measure K(x, x') captured by a kernel K: α_x = K(x, x'). Let m(r) be the number of labeled points in N(x'). Then, it is easy to show that β_loc ≤ 4 α_max M / (α_min m(r)), where α_max = max_{x ∈ N(x')} α_x and α_min = min_{x ∈ N(x')} α_x. Thus, for a Gaussian kernel with parameter σ, β_loc ≤ 4M / (m(r) e^{−2r²/σ²}).

To estimate β_loc, one needs an estimate of m(r), the number of samples in a ball of radius r around an unlabeled point x'. In our experiments, we estimated m(r) as the number of samples in a ball of radius r around the origin. Since all features are normalized to mean zero and variance one, the origin is also the centroid of the set X. We implemented a dual solution of LTR and used Gaussian kernels, for which the parameter σ was selected using cross-validation on the training set. Experiments were repeated across 36 different pairs of values of (C, C'). For each pair, we varied the radius r of the neighborhood used to determine the pseudo-target estimates from zero to the radius of the ball containing all points. Figure 1(a) shows the mean values of the test MSE of our experiments on the Boston Housing data set for typical values of C and C'. Figures 1(b)-(c) show similar results for the Ailerons and Elevators data sets. For the sake of comparison, we also report results for induction. The relative standard deviations on the MSE are not indicated, but were typically of the order of 10%. LTR generally achieves a significant improvement over induction.

The generalization bound we derived in Theorem 3 consists of the training error and a complexity term that depends on the parameters of the LTR algorithm (C, C', M, m, u, κ, β_loc, δ). Only two terms depend upon the choice of the radius r: R̂(h) and β_loc.
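The model-selection procedure can thus be summarized in a few lines: for each candidate radius r, compute β_loc(r), plug it together with the LTR stability coefficient into the slack term of Theorem 3, and select the radius minimizing training error plus slack. The sketch below (ours) shows only the bookkeeping: the parameter values are arbitrary, training_error(r) is a hypothetical placeholder for an actual run of LTR at radius r, m_of_r is a crude stand-in for the empirical m(r), and beta_ltr uses the expression of Theorem 4 as reconstructed above.

    import numpy as np

    # Fixed parameters of the experiment (illustrative values).
    M, kappa, C, C_prime, m, u, delta, sigma = 1.0, 1.0, 1.0, 0.1, 1000, 1000, 0.05, 1.0
    A = 1 + kappa * np.sqrt(C + C_prime)

    def alpha(m, u):
        return (m * u) / (m + u - 0.5) / (1.0 - 1.0 / (2 * max(m, u)))

    def beta_loc(r, m_r):
        # beta_loc(r) <= 4 M / (m(r) e^{-2 r^2 / sigma^2}) for a Gaussian kernel.
        return 4 * M / (m_r * np.exp(-2 * r**2 / sigma**2))

    def beta_ltr(b_loc):
        # Stability coefficient of LTR (Theorem 4, as reconstructed above).
        s = C / m + C_prime / u
        return 2 * (A * M * kappa)**2 * (s + np.sqrt(s**2 + 2 * C_prime * b_loc / (u * A * M * kappa**2)))

    def slack(beta):
        # Slack term of Theorem 3 with B = A M.
        return beta + (2 * beta + (A * M)**2 * (m + u) / (m * u)) * np.sqrt(
            alpha(m, u) * np.log(1 / delta) / 2)

    def training_error(r):
        # Placeholder: in the actual procedure this is the training MSE of LTR run with radius r.
        return 0.5 * np.exp(-r) + 0.05

    radii = np.linspace(0.1, 3.0, 30)
    m_of_r = np.maximum(1, (m * (radii / radii.max())**2).astype(int))   # crude stand-in for m(r)
    objective = [training_error(r) + slack(beta_ltr(beta_loc(r, mr)))
                 for r, mr in zip(radii, m_of_r)]
    print("radius selected by the bound: r* =", radii[int(np.argmin(objective))])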

[Figure 1. MSE against the radius r of LTR for three data sets: (a) Boston Housing, (b) Ailerons, (c) Elevators. Each plot compares Transduction, Induction, and Training Error + Slack Term. The small horizontal bar indicates the location (mean ± one standard deviation) of the minimum of the empirically determined r.]

Thus, keeping all other parameters fixed, the theoretically optimal radius r* is the one that minimizes the training error plus the slack term. The figures also include plots of the training error combined with the complexity term, appropriately scaled. The empirical minimum of the radius r coincides with or is close to r*. The optimal r based on test MSE is indicated with error bars.

6.2. Stable Versions of Unstable Algorithms

We refer to the stable version of the CM algorithm presented in Sec. 5.3 as CM STABLE. We compared CM and CM STABLE empirically on the same data sets, again using m = u. For the normalized Laplacian we used k-nearest-neighbors graphs based on the Euclidean distance. The parameters k and C were chosen by five-fold cross-validation over the training set. The experiment was repeated 20 times with random partitions. The averaged mean-squared errors, with standard deviations, are reported in Table 1.

Table 1. Averaged mean-squared errors (± standard deviation) for CM and CM STABLE.

    Dataset      CM            CM STABLE
    Elevators       ±              ±
    Ailerons     049 ±             ±
    Housing      5793 ±            ± 65

We conclude from this experiment that CM and CM STABLE have the same performance. However, as we showed previously, CM STABLE has a non-trivial risk bound and thus comes with some guarantee.

7. Conclusion

We presented a comprehensive analysis of the stability of transductive regression algorithms with novel generalization bounds for a number of algorithms. Since they are algorithm-dependent, our bounds are often tighter than those based on complexity measures such as the VC-dimension. Our experiments also show the effectiveness of our bounds for model selection and the good performance of LTR algorithms.

Acknowledgments

The research of Mehryar Mohri and Ashish Rastogi was partially supported by the New York State Office of Science Technology and Academic Research (NYSTAR). This project was also sponsored in part by the Department of the Army, Award Number W81XWH. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick, MD, is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

Belkin, M., Matveeva, I., & Niyogi, P. (2004a). Regularization and semi-supervised learning on large graphs. COLT.
Belkin, M., Niyogi, P., & Sindhwani, V. (2004b). Manifold regularization (Technical Report). University of Chicago.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. JMLR, 2, 499-526.
Chapelle, O., Vapnik, V., & Weston, J. (1999). Transductive inference for estimating values of functions. NIPS 12.
Cortes, C., & Mohri, M. (2007). On transductive regression. NIPS 19.
El-Yaniv, R., & Pechyony, D. (2006). Stable transductive learning. COLT (pp. 35-49).
McDiarmid, C. (1989). On the method of bounded differences. Surveys in Combinatorics (pp. 148-188). Cambridge University Press, Cambridge.
Schuurmans, D., & Southey, F. (2002). Metric-based methods for adaptive model selection and regularization. Machine Learning, 48, 51-84.
Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Berlin: Springer.

Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley-Interscience.
Wu, M., & Schölkopf, B. (2007). Transductive classification via local learning regularization. AISTATS.
Zhou, D., Bousquet, O., Lal, T., Weston, J., & Schölkopf, B. (2004). Learning with local and global consistency. NIPS 16.
Zhu, X., Ghahramani, Z., & Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML (pp. 912-919).

A. Proof of Theorem 4

Proof. Let h and h' be defined as in Lemma 5. By definition,

h = argmin_{h ∈ H} F(h, S)   and   h' = argmin_{h ∈ H} F(h, S').    (13)

Thus, F(h, S) − F(h + tΔ, S) ≤ 0 and F(h', S') − F(h' − tΔ, S') ≤ 0, where Δ = h' − h. For any hypothesis h, any Δ and any x, let Δc(h, tΔ, x) denote c(h, x) − c(h + tΔ, x), and similarly for c̃. Then, summing these two inequalities yields

(C/m) Σ_{k ≠ i} [Δc(h, tΔ, x_k) + Δc(h', −tΔ, x_k)] + (C'/u) Σ_{k ≠ j} [Δc̃(h, tΔ, x_{m+k}) + Δc̃(h', −tΔ, x_{m+k})] + (C/m) Δc(h, tΔ, x_i) + (C'/u) Δc̃(h, tΔ, x_{m+j}) + (C/m) Δc(h', −tΔ, x_{m+j}) + (C'/u) Δc̃(h', −tΔ, x_i) + E ≤ 0,    (14)

with E = ||h||²_K − ||h + tΔ||²_K + ||h'||²_K − ||h' − tΔ||²_K. By convexity of c(·, x) in its first argument, it follows that for all k ∈ [1, m+u], Δc(h, tΔ, x_k) ≥ t Δc(h, Δ, x_k) and Δc(h', −tΔ, x_k) ≥ t Δc(h', −Δ, x_k). By Lemma 4, similar inequalities hold for c̃. Moreover, for the common points the resulting terms satisfy Δc(h, Δ, x_k) + Δc(h', −Δ, x_k) = 0, and likewise for c̃, since h + Δ = h' and h' − Δ = h. It is not hard to see, by using the definition of ||·||_K in terms of the inner product of the reproducing kernel Hilbert space, that E = 2t(1 − t)||Δ||²_K. Simplifying (14) with these observations leads to

2t(1 − t)||Δ||²_K = E ≤ −(Ct/m)[Δc(h, Δ, x_i) + Δc(h', −Δ, x_{m+j})] − (C't/u)[Δc̃(h, Δ, x_{m+j}) + Δc̃(h', −Δ, x_i)].

The right-hand side can be bounded using Lemma 5, leading to a second-degree inequality in ||Δ||_K. Upper-bounding ||Δ||_K by the positive root and using (6) to bound |c(h', x) − c(h, x)| leads to the bound on β and proves the statement of the theorem.


More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Neural Network Learning as an Inverse Problem

Neural Network Learning as an Inverse Problem Neural Network Learning as an Inverse Proble VĚRA KU RKOVÁ, Institute of Coputer Science, Acadey of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague 8, Czech Republic. Eail: vera@cs.cas.cz

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD

ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical and Matheatical Sciences 04,, p. 7 5 ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD M a t h e a t i c s Yu. A. HAKOPIAN, R. Z. HOVHANNISYAN

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Multi-view Discriminative Manifold Embedding for Pattern Classification

Multi-view Discriminative Manifold Embedding for Pattern Classification Multi-view Discriinative Manifold Ebedding for Pattern Classification X. Wang Departen of Inforation Zhenghzou 450053, China Y. Guo Departent of Digestive Zhengzhou 450053, China Z. Wang Henan University

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

4.2 First-Order Logic

4.2 First-Order Logic 64 First-Order Logic and Type Theory The problem can be seen in the two qestionable rles In the existential introdction, the term a has not yet been introdced into the derivation and its se can therefore

More information

Structured Prediction Theory Based on Factor Graph Complexity

Structured Prediction Theory Based on Factor Graph Complexity Structured Prediction Theory Based on Factor Graph Coplexity Corinna Cortes Google Research New York, NY 00 corinna@googleco Mehryar Mohri Courant Institute and Google New York, NY 00 ohri@cisnyuedu Vitaly

More information

W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS

W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS W-BASED VS LATENT VARIABLES SPATIAL AUTOREGRESSIVE MODELS: EVIDENCE FROM MONTE CARLO SIMULATIONS. Introduction When it coes to applying econoetric odels to analyze georeferenced data, researchers are well

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Matheatical Sciences, 251 Mercer Street, New York,

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

arxiv: v3 [stat.ml] 9 Aug 2016

arxiv: v3 [stat.ml] 9 Aug 2016 with Specialization to Linear Classifiers Pascal Gerain Aaury Habrard François Laviolette 3 ilie Morvant INRIA, SIRRA Project-Tea, 75589 Paris, France, et DI, École Norale Supérieure, 7530 Paris, France

More information