SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming

Size: px
Start display at page:

Download "SVM Soft Margin Classifiers: Linear Programming versus Quadratic Programming"

Transcription

1 SVM Soft Margin Classifiers: Linear Prograing versus Quadratic Prograing Qiang Wu Ding-Xuan Zhou Departent of Matheatics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China Support vector achine soft argin classifiers are iportant learning algoriths for classification probles. They can be stated as convex optiization probles and are suitable for a large data setting. Linear prograing SVM classifier is specially efficient for very large size saples. But little is known about its convergence, copared with the well understood quadratic prograing SVM classifier. In this paper, we point out the difficulty and provide an error analysis. Our analysis shows that the convergence behavior of the linear prograing SVM is alost the sae as that of the quadratic prograing SVM. This is ipleented by setting a stepping stone between the linear prograing SVM and the classical 1 nor soft argin classifier. An upper bound for the isclassification error is presented for general probability distributions. Explicit learning rates are derived for deterinistic and weakly separable distributions, and for distributions satisfying soe Tsybakov noise condition. 1

2 1 Introduction Support vector achines SVM s) for an iportant subject in learning theory. They are very efficient for any applications, especially for classification probles. The classical SVM odel, the so-called 1 nor soft argin SVM, was introduced with polynoial kernels by Boser et al. 1992) and with general kernels by Cortes and Vapnik 1995). Since then any different fors of SVM algoriths were introduced for different purposes e.g. Niyogi and Girosi 1996; Vapnik 1998). Aong the the linear prograing LP) SVM Bradley and Mangasarian 2000; Kecan and Hadzic 2000; Niyogi and Girosi 1996; Pedroso and N. Murata 2001; Vapnik 1998) is an iportant one because of its linearity and flexibility for large data setting. The ter linear prograing eans the algorith is based on linear prograing optiization. Correspondingly, the 1 nor soft argin SVM is also called quadratic prograing QP) SVM since it is based on quadratic prograing optiization Vapnik 1998). Many experients deonstrate that LP-SVM is efficient and perfors even better than QP-SVM for soe purposes: capable of solving huge saple size probles Bradley and Mangasarian 2000), iproving the coputational speed Pedroso and N. Murata 2001), and reducing the nuber of support vectors Kecan and Hadzic 2000). While the convergence of QP-SVM has becoe pretty well understood because of recent works Steinwart 2002; Zhang 2004; Wu and Zhou 2003; Scovel and Steinwart 2003; Wu et al. 2004), little is known for LP-SVM. The purpose of this paper is to point out the ain difficulty and then provide error analysis for LP-SVM. Consider the binary classification setting. Let X, d) be a copact etric space and Y = 1, 1}. A binary classifier is a function f : X Y which labels every point x X with soe y Y. Both LP-SVM and QP-SVM considered here are kernel based classifiers. A function K : X X R is called a Mercer kernel if it is continuous, syetric and positive seidefinite, i.e., for any finite set of distinct points 2

3 x 1,, x l } X, the atrix Kx i, x j )) l i,j=1 is positive seidefinite. Let z = x 1, y 1 ),, x, y )} X Y ) be the saple. Motivated by reducing the nuber of support vectors of the 1 nor soft argin SVM, Vapnik 1998) introduced the LP-SVM algorith associated to a Mercer Kernel K. It is based on the following linear prograing optiization proble: in α R +, b R subject to 1 ξ i + 1 } α i C ) y i α j y j Kx i, x j ) + b 1 ξ i j=1 ξ i 0, i = 1,,. 1.1) Here α = α 1,, α ), ξ i s are slack variables. The trade-off paraeter C = C) > 0 depends on and is crucial. If ) α z = α 1,z,, α,z ), b z solves the optiization proble 1.1), the LP-SVM classifier is given by sgnf z ) with f z x) = α i,z y i Kx, x i ) + b z. 1.2) For a real-valued function f : X R, its sign function is defined as sgnf)x) = 1 if fx) 0 and sgnf)x) = 1 otherwise. The QP-SVM is based on a quadratic prograing optiization proble: in α R +, b R subject to 1 ξ i + 1 } 2 C α i y i Kx i, x j )α j y j i,j=1 ) y i α j y j Kx i, x j ) + b 1 ξ i j=1 ξ i 0, i = 1,,. 1.3) Here C = C) > 0 is also a trade-off paraeter depending on the saple size. If α z = α 1,z,, α,z ), b z ) solves the optiization proble 1.3), then the 1 nor soft argin classifier is defined by sgn f z ) with f z x) = α i,z y i Kx, x i ) + b z. 1.4) 3

4 Observe that both LP-SVM classifier 1.1) and QP-SVM classifier 1.3) are ipleented by convex optiization probles. Copared with this, neural network learning algoriths are often perfored by nonconvex optiization probles. The reproducing kernel property of Mercer kernels ensures nice approxiation power of SVM classifiers. Recall that the Reproducing Kernel Hilbert Space RKHS) H K associated with a Mercer kernel K is defined Aronszajn 1950) to be the closure of the linear span of the set of functions K x := Kx, ) : x X} with the inner product <, > K satisfying < K x, K y > K = Kx, y). The reproducing property is given by < f, K x > K = fx), x X, f H K. 1.5) The QP-SVM is well understood. It has attractive approxiation properties see 2.2) below) because the learning schee can be represented as a Tikhonov regularization Evgeniou et al. 2000) odified by an offset) associated with the RKHS: f z = arg in f=f +b H K +R 1 1 yi fx i ) ) C f 2 K }, 1.6) where t) + = ax 0, t}. Set H K := H K + R. For a function f = f 1 + b 1 H K, we denote f = f 1 and b f = b 1. Write b fz as b z. It turns out that 1.6) is the sae as 1.3) together with 1.4). To see this, we first note that f z ust lies in the span of K xi } according to the representation theore Wahba 1990). Next, the dual proble of 1.6) shows Vapnik 1998) that the coefficient of K xi, α i y i, has the sae sign as y i. Finally, the definition of the H K nor yields f z 2 K = α iy i K xi 2 K = i,j=1 α iy i Kx i, x j )α j y j. The rich knowledge on Tikhonov regularization schees and the idea of bias-variance trade-off developed in the neural network literature provide a atheatical foundation of the QP-SVM. In particular, the convergence is well understood due to the work done within the last a few years. Here the for 1.6) illustrate soe advantages of the QP-SVM: the iniization is 4

5 taken over the whole space H K, so we expect the QP-SVM has soe good approxiation power, siilar to the approxiation error of the space H K. Things are totally different for LP-SVM. Set } H K,z = α i y i Kx, x i ) : α = α 1,, α ) R +. Then the LP-SVM schee 1.1) can be written as f z = arg in f=f +b H K,z +R 1 1 yi fx i ) ) + 1 )} + C Ωf. 1.7) Here we have denoted Ωf ) = yα l 1 = α i for f = α iy i K xi with α i 0. It plays the role of a nor of f in soe sense. This is not a Hilbert space nor, which raises the technical difficulty for the atheatical analysis. More seriously, the hypothesis space H K,z depends on the saple z. The centers x i of the basis functions in H K,z are deterined by the saple z, not free. One ight consider regularization schees in the space of all linear cobinations with free centers, but whether the iniization can be reduced into a convex optiization proble of size, like 1.1), is unknown. Also, it is difficult to relate the corresponding optiu in a ball with radius C) to fz with respect to the estiation error. Thus separating the error for LP-SVM into two ters of saple error and approxiation error is not as iediate as for the QP-SVM or neural network ethods Niyogi and Girosi 1996) where the centers are free. In this paper, we shall overcoe this difficulty by setting a stepping stone. Turn to the error analysis. Let ρ be a Borel probability easure on Z := X Y and X, Y) be the corresponding rando variable. The prediction power of a classifier f is easured by its isclassification error, i.e., the probability of the event fx ) Y: Rf) = ProbfX ) Y} = P Y = fx) x) dρ X. 1.8) Here ρ X is the arginal distribution and ρ x) is the conditional distribution of ρ. The classifier iniizing the isclassification error is called the Bayes 5 X

6 rule f c. It takes the for 1, if P Y = 1 x) P Y = 1 x), f c x) = 1, if P Y = 1 x) < P Y = 1 x). If we define the regression function of ρ as f ρ x) = ydρy x) = P Y = 1 x) P Y = 1 x), x X, Y then f c = sgnf ρ ). Note that for a real-valued function f, sgnf) gives a classifier and its isclassification error will be denoted by Rf) for abbreviation. Though the Bayes rule exists, it can not be found directly since ρ is unknown. Instead, we have in hand a set of saples z = z i } = x 1, y 1 ),, x, y )} N). Throughout the paper we assue z 1,, z } are independently and identically distributed according to ρ. A classification algorith constructs a classifier f z based on z. Our goal is to understand how to choose the paraeter C = C) in the algorith 1.1) so that the LP-SVM classifier sgnf z ) can approxiate the Bayes rule f c with satisfactory convergence rates as ). Our approach provides clues to study learning algoriths with penalty functional different fro the RKHS nor Niyogi and Girosi 1996; Evgeniou et al. 2000). It can be extended to schees with general loss functions Rosasco et al. 2004; Lugosi and Vayatis 2004; Wu et al. 2004). 2 Main Results In this paper we investigate learning rates, the decay of the excess isclassification error Rf z ) Rf c ) as and C) becoe large. Consider the QP-SVM classification algorith f z defined by 1.3). Steinwart 2002) showed that R f z ) Rf c ) 0 as and C = C) ), when H K is dense in CX), the space of continuous functions on X with the nor. Lugosi and Vayatis 2004) found that for the exponential loss, the excess isclassification error of regularized boosting algoriths 6

7 can be estiated by the excess generalization error. An iportant result on the relation between the isclassification error and generalization error for a convex loss function is due to Zhang 2004). See Bartlett et al. 2003), and Chen et al. 2004) for extensions to general loss functions. Here we consider the hinge loss V y, fx)) = 1 yfx)) +. The generalization error is defined as Ef) = V y, fx)) dρ. Z Note that f c is a iniizer of Ef). Then Zhang s results asserts that Rf) Rf c ) Ef) Ef c ), f : X R. 2.1) Thus, the excess isclassification error R f z ) Rf c ) can be bounded by the excess generalization error E f z ) Ef c ), and the following error decoposition Wu and Zhou 2003) holds: E f z ) Ef c ) E ) ) fz Ez fz + Ez fk, C) E fk, C) } + D C). 2.2) Here E z f) = 1 V y i, fx i )). The function f K, C depends on C and is defined as f K,C := arg in Ef) + 1 } 2C f 2 K, C > ) f H K The decoposition 2.2) akes the error analysis for QP-SVM easy, siilar to that in Niyogi and Girosi 1996). The second ter of 2.2) easures the approxiation power of H K for ρ. by Definition 2.1. The regularization error of the syste K, ρ) is defined DC) := inf Ef) Ef c ) + 1 } f H K 2C f 2 K. 2.4) The regularization error for a regularizing function f K,C H K is defined as DC) := Ef K,C ) Ef c ) + 1 2C f K,C 2 K. 2.5) 7

8 In Wu and Zhou 2003) we showed that Ef) Ef c ) f f c L 1 ρx. Hence the regularization error can be estiated by the approxiation in a weighted L 1 space, as done in Sale and Zhou 2003), and Chen et al. 2004). Definition 2.2. We say that the probability easure ρ can be approxiated by H K with exponent 0 < β 1 if there exists a constant c β such that H1) DC) cβ C β, C > 0. The first ter of 2.2) is called the saple error. It has been well understood in learning theory by concentration inequalities, e.g. Vapnik 1998), Devroye et al. 1997), Niyogi 1998), Cucker and Sale 2001), Bousquet and Elisseeff 2002). The approaches developed in Barron 1990), Bartlett 1998), Niyogi and Girosi 1996), and Zhang 2004) separate the regularization error and the saple error concerning f z. In particular, for the QP-SVM, Zhang 2004) proved that E z Z E fz ) } inf Ef) + 1 } f H K 2 C f 2 K + 2 C. 2.6) It follows that E z Z E fz ) Ef c ) } 2 C D C)+. When H1) holds, Zhang s bound in connection with 2.1) yields E z Z R fz ) Rf c ) } = O C β )+ 2 C. This is siilar to soe well-known bounds for the neural network learning algoriths, see e.g. Theore 3.1 in Niyogi and Girosi 1996). The best learning rate derived fro 2.6) by choosing C = 1/β+1) is E z Z R fz ) Rf c ) } = O α), α = β β ) Observe that the saple error bound 2 C in 2.6) is independent of the kernel K or the distribution ρ. If soe inforation about K or ρ is available, the saple error and hence the excess isclassification error can be iproved. The inforation we need about K is the capacity easured by covering nubers. 8

9 Definition 2.3. Let F be a subset of a etric space. For any ε > 0, the covering nuber N F, ε) is defined to be the inial integer l N such that there exist l balls with radius ε covering F. In this paper we only use the unifor covering nuber. Covering nubers easured by epirical distances are also used in the literature van der Vaart and Wellner 1996). For coparisons, see Pontil 2003). Let B R = f H K : f K R}. It is a subset of CX) and the covering nuber is well defined. We denote the covering nuber of the unit ball B 1 as N ε) := N B 1, ε ), ε > ) Definition 2.4. The RKHS H K is said to have logarithic coplexity exponent s 1 if there exists a constant c s > 0 such that H2) log N ε) c s log1/ε) ) s. It has polynoial coplexity exponent s > 0 if there is soe c s > 0 such that H2 ) log N ε) c s 1/ε ) s. The unifor covering nuber has been extensively studied in learning theory. In particular, we know that for the Gaussian kernel Kx, y) = exp x y 2 /σ 2 } with σ > 0 on a bounded subset X of R n, H2) holds with s = n+1, see Zhou 2002); if K is C r with r > 0 Sobolev soothness), then H2 ) is valid with s = 2n/r, see Zhou 2003). The inforation we need about ρ is a Tsybakov noise condition Tsybakov 2004). Definition 2.5. Let 0 q. We say that ρ has Tsybakov noise exponent q if there exists a constant c q > 0 such that H3) P X x X : fρ x) c q t} ) t q. All distributions have at least noise exponent 0. Deterinistic distributions which satisfy f ρ x) 1) have the noise exponent q = with c = 1. 9

10 Using the above conditions about K and ρ, Scovel and Steinwart 2003) showed that when H1), H2 ) and H3) hold, for every ɛ > 0 and every δ > 0, with confidence 1 δ, R f z ) Rf c ) = O α), α = 4βq + 1) ɛ. 2.9) 2q + sq + 4)1 + β) When no conditions are assued for the distribution i.e., q = 0) or s = 2 for the kernel the worse case when epirical covering nubers are used, see van der Vaart and Wellner 1996), the rate is reduced to α = β ɛ, arbitrarily β+1 close to Zhang s rate 2.7). Recently, Wu et al. 2004) iprove the rate 2.9) and show that under the sae assuptions H1), H2 ) and H3), for every ɛ, δ > 0, with confidence 1 δ, R f z ) Rf c ) = O α), α = in βq + 1) βq + 2) + q + 1 β)s/2 ɛ, 2β }. β ) When soe condition is assued for the kernel but not for the distribution, i.e., s < 2 but q = 0, the rate 2.10) has power α = in β 2β ɛ, 2β+1 β)s/2 β+1}. This is better than 2.7) or 2.9) or the rates given in Bartlett et al. 2003; Blanchard et al. 2004, see Chen et al. 2004; Wu et al for detailed coparisons) if β < 1. This iproveent is possible due to the projection operator. Definition 2.6. The projection operator π is defined on the space of easurable functions f : X R as 1, if fx) > 1, πf)x) = 1, if fx) < 1, fx), if 1 fx) 1. The idea of projections appeared in argin-based bound analysis, e.g. Bartlett 1998), Lugosi and Vayatis 2004), Zhang 2002), Anthony and Bartlett 1999). We used the projection operator for the purpose of bounding isclassification and generalization errors in Chen et al. 2004). It helps 10

11 us to get sharper bounds of the saple error: probability inequalities are applied to rando variables involving functions π f z ) bounded by 1), not to f z the corresponding bound increases to infinity as C becoes large). In this paper we apply the projection operator to the LP-SVM. Turn to our ain goal, the LP-SVM classification algorith f z defined by 1.1). To our knowledge, the convergence of the algorith has not been verified, even for distributions strictly separable by a universal kernel. What is the ain difficulty in the error analysis? One difficulty lies in the error decoposition: nothing like 2.2) exists for LP-SVM in the literature. Bounds for the regularization or approxiation error independent of z are not available. We do not know whether it can be bounded by a nor in the whole space H K or a nor siilar to those in Niyogi and Girosi 1996). In the paper we overcoe the difficulty by eans of a stepping stone fro QP-SVM to LP-SVM. Then we can provide error analysis for general distributions. In particular, explicit learning rates will be presented. To this end, we first ake an error decoposition. Theore 1. Let C > 0, 0 < η 1 and f K,C H K. There holds Rf z ) Rf c ) 2ηRf c ) + S, C, η) + 2DηC), where S, C, η) is the saple error defined by } ) ) } S, C, η) := Eπf z )) E z πf z )) +1+η) E z fk,c E fk,c. 2.11) Theore 1 will be proved in Section 4. The ter DηC) is the regularization error Sale and Zhou 2004) defined for a regularizing function f K,C arbitrarily chosen) by 2.5). In Chen et al. 2004), we showed that where DC) DC) κ2 2C 2.12) κ := E 0 /1 + κ), κ = sup Kx, x), x X 11 E0 := inf Eb) Efc ) }. b R

12 Also, κ = 0 only for very special distributions. Hence the decay of DC) cannot be faster than O1/C) in general. Thus, to have satisfactory convergence rates, C can not be too sall, and it usually takes the for of τ for soe τ > 0. The constant κ is the nor of the inclusion H K CX): f κ f K, f H K. 2.13) Next we focus on analyzing the learning rates. Since a unifor rate is ipossible for all probability distributions as shown in Theore 7.2 of Devroye et al. 1997), we need to consider subclasses. The choice of η is iportant in the upper bound in Theore 1. If the distribution is deterinistic, i.e., Rf c ) = 0, we ay choose η = 1. When Rf c ) > 0, we ust choose η = η) 0 as in order to get the convergence rate. Of course the latter choice ay lead to a slightly worse rate. Thus, we will consider these two cases separately. The following proposition gives the bound for deterinistic distributions. Proposition 2.1. Suppose Rf c ) = 0. If f K,C is a function in H K satisfying V y, f K,C x)) M, then for every 0 < δ < 1, with confidence 1 δ there holds Rf z ) 32ε,C + 20M log2/δ) 3 + 8DC), where with a constant c s depending on c s, κ and s, ε,c is given by 22 log 2 [ δ + c s log CM log 2 ) )] } s + log CDC), if H2) holds; δ 35 log 2/δ ) 1 + c s) ) 1 s 1+s CM) 1+s + 32c s CDC)) s 1+s, if H2 ) holds. 3 1/1+s) Proposition 2.1 will be proved in Section 6. As corollaries we obtain learning rates for strictly separable distributions and for weakly separable distributions. Definition 2.7. We say that ρ is strictly separable by H K with argin γ > 0 if there is soe function f γ H K such that f γ K = 1 and yf γ x) γ alost everywhere. 12

13 For QP-SVM, the strictly separable case is well understood, see e.g. Vapnik 1998), Cristianini and Shawe-Taylor 2000) and vast references therein. For LP-SVM, we have Corollary 2.1. If ρ is strictly separable by H K with argin γ > 0 and H2) holds, then Rf z ) 704 log 2 δ + c s log + log 1 ) } s + 4 γ 2 Cγ. 2 ) In particular, this will yield the learning rate O log ) s by taking C = /γ 2. Proof. Take f K,C = f γ /γ. Then V y, f K,C x)) 0 and DC) equals 1 f 2C γ /γ 2 K = 1. The conclusion follows fro Proposition 2.1 by choosing 2Cγ 2 M = 0. Reark 2.1. For strictly separable distributions, we verify the optial rate when H2) holds. Siilar rates are true for ore general kernels. But we oit details here. Definition 2.8. We say that ρ is weakly) separable by H K if there is soe function fsp H K, called the separating function, such that f sp K = 1 and yfspx) > 0 alost everywhere. It has separating exponent θ 0, ] if for soe γ θ > 0, there holds ρ X 0 < f sp x) < γ θ t) t θ. 2.14) Corollary 2.2. Suppose that ρ is separable by H K with 2.14) valid. i) If H2) holds, then log + log C) s Rf z ) = O ) + C θ θ+2. log )s This gives the learning rate O ) by taking C = θ+2)/θ. 13

14 ii) If H2 ) holds, then Rf z ) = O C s 1+s 2s θ+2 C + ) ) 1 1+s + C θ θ+2. This yields the learning rate O θ θ+2 sθ+2s+θ ) by taking C = sθ+2s+θ. Proof. Take f K,C = C 1 θ+2 f sp/γ θ. By the definition of fsp, we have yf K,C x) 0 alost everywhere. Hence 0 V y, f K,C x)) 1. Moreover, Ef K,C ) = X 1 C 1 θ+2 γ θ ) } fspx) dρ X = ρ X 0 < fspx) < γ θ C 1 θ+2 + which is bounded by C θ θ+2. Therefore, DC) )C θ 2γθ 2 θ+2. Then the conclusion follows fro Proposition 2.1 by choosing M = 1. Exaple. Let X = [ 1/2, 1/2] and ρ be the Borel probability easure on Z such that ρ X is the Lebesgue easure on X and 1, if 1/2 x < 0, f ρ x) = 1, if 0 < x < 1/2. If we take the linear kernel Kx, y) = x y, then θ = 1, γ θ = 1/2. Since H2) is satisfied with s = 1, the learning rate is O log ) by taking C = 3. Reark 2.2. The condition 2.14) with θ = is exactly the definition of strictly separable distribution and γ θ is the argin. The choice of f K,C and the regularization error play essential roles to get our error bounds. It influences the strategy of choosing the regularization paraeter odel selection) and deterines learning rates. For weakly separable distributions we chose f K,C to be ultiples of a separating function in Corollary 2.2. For the general case, it can be the choice 2.3). Let s analyze learning rates for distributions having polynoially decaying regularization error, i.e., H1) with β 1. This is reasonable because of 2.12). 14

15 Theore 2. Suppose that Rf c ) = 0 and the hypotheses H1), H2 ) hold with 0 < s < and 0 < β 1, respectively. Take C = ζ with ζ := in 1, 2 }. Then for every 0 < δ < 1 there exists a constant c s+β 1+β depending on s, β, δ such that with confidence 1 δ, } 2β Rf z ) c α, α = in 1 + β, β. s + β Next we consider general distributions satisfying Tsybakov condition Tsybakov 2004). Theore 3. Assue the hypotheses H1), H2 ) and H3) with 0 < s <, 0 < β 1, and 0 q. Take C = ζ with } 2 ζ := in β + 1, q + 1)β + 1). sq + 1) + βq qs + s) For every ɛ > 0 and every 0 < δ < 1 there exists a constant c depending on s, q, β, δ, and ɛ such that with confidence 1 δ, } 2β Rf z ) Rf c ) c α, α = in β + 1, βq + 1) sq + 1) + βq qs + s) ɛ. Reark 2.3. Since Rf c ) is usually sall for a eaningful classification proble, the upper bound in Theore 1 tells that the perforance of LP- SVM is siilar to that of QP-SVM. However, to have convergence rates, we need to choose η = η) 0 as becoes large. This akes our rate worse than that of QP-SVM. This is the case when the capacity index s is large. When s is very sall, the rate is O α ) with α close to in q+1, 2β }, q+2 β+1 which coincides to the rate 2.10), and is better than the rates 2.7) or 2.9) for QP-SVM. As any C kernel satisfies H2 ) for an arbitrarily sall s > 0 Zhou 2003), this is the case for polynoial or Gaussian kernels, usually used in practice. Reark 2.4. Here we use a stepping stone fro QP-SVM to LP-SVM. So the derived learning rates for the LP-SVM are essentially no worse than those of QP-SVM. It would be interesting to introduce different tools to get 15

16 learning rates for the LP-SVM, better than those of QP-SVM. Also, the choice of the trade-off paraeter C in Theore 3 depends on the indices β approxiation), s capacity), and q noise condition). This gives a rate which is optial by our approach. One can take other choices ζ > 0 for C = ζ ), independent of β, s, q, and then derive learning rates according to the proof of Theore 3. But the derived rates are worse than the one stated in Theore 3. It would be of iportance to give soe ethods for choosing C adaptively. Reark 2.5. When epirical covering nubers are used, the capacity index can be restricted to s [0, 2]. Siilar learning rates can be derived, as done in Blanchard et al. 2004), Wu et al. 2004). 3 Stepping Stone Recall that in 1.7), the penalty ter Ωf ) is usually not a nor. This akes the schee difficult to analyze. Since the solution f z of the LP-SVM has a representation siilar to f z in QP-SVM, we expect close relations between these schees. Hence the latter ay play roles in the analysis for the forer. To this end, we need to estiate Ω f z ), the l 1 nor of the coefficients of the solution f z to 1.4). Lea 3.1. For every C > 0, the function f z defined by 1.3) and 1.4) satisfies Ω f z ) = α i,z CE z f z ) + f z 2 K. Proof. The dual proble of the 1 nor soft argin SVM Vapnik 1998) tells us that the coefficients α i,z in the expression 1.4) of f z satisfy 0 α i,z C and α i,z y i = ) The definition of the loss function V iplies that 1 y i fz x i ) V y i, f z x i ) ). 16

17 Then α i,z α i,z y i fz x i ) α i,z V y i, f z x i ) ). Applying the upper bound for α i,z in 3.1), we can bound the right side above as α i,z V y i, f z x i ) ) C V y i, f z x i ) ) = CE ) z fz. Applying the second relation in 3.1) yields It follows that α i,z y i bz = 0. α i,z y i fz x i ) = α i,z y i f z x i ) + b ) z = α i,z y i f z x i ). But f z x i ) = α j,z y j Kx i, x j ). We have j=1 α i,z y i fz x i ) = α i,z y i α j,z y j Kx i, x j ) = f z 2 K. i,j=1 Hence the bound for Ω f z ) follows. Reark 3.1. Dr.Yiing Ying pointed out to us that actually the equality holds in Lea 3.1. This follows fro the KKT conditions. But we only need the inequality here. 4 Error Decoposition In this section, we estiate Rf z ) Rf c ). 17

18 Since sgnπf)) = sgnf), we have Rf) = Rπf)). Using 2.1) to πf), we obtain Rf) Rf c ) = Rπf)) Rf c ) Eπf)) Ef c ). 4.1) It is easy to see that V y, πf)x)) V y, fx)). Hence Eπf)) Ef) and E z πf)) E z f). 4.2) We are in a position to prove Theore 1 which, by 4.1), is an easy consequence of the following result. Proposition 4.1. Let C > 0, 0 < η 1 and f K,C H K. Then Eπf z )) Ef c ) + 1 C Ωf z ) 2ηRf c ) + S, C, η) + 2DηC), where S, C, η) is defined by 2.11). Proof. Take f z to be the solution of 1.4) with C = ηc. We see fro the definition of f z and 4.2) that E z πfz ) ) + 1 ) C Ωf z ) E z f z ) + 1 C Ω f ) z ) 0. This enables us to decopose Eπf z )) + 1 C Ωf z ) as Eπf z )) + 1 C Ωf z ) E πf z ) ) E z πfz ) )} + E z f z ) + 1 C Ω f ) z ). Lea 3.1 gives Ω f z ) CE z f z ) + f z 2 K. But C = ηc. Hence Eπf z )) + 1 C Ωf z ) E πf z ) ) E z πfz ) )} η)e z f z ) + 1 C f z 2 K. Next we use the function f K,C to analyze the second ter of the above bound and get E z f z ) η)c f z 2 K E z f z ) C f z 2 K E z f K,C ) C f K,C 2 K. This bound can be written as E z f K,C ) Ef K,C ) } + Ef K,C )+ 1 2 C f K,C 2 K 18 }.

19 Cobining the above two steps, we find that Eπf z )) Ef c ) + 1 C Ωf z ) is bounded by E πf z ) ) E z πfz ) )} ) ) η) E } z fk,c E fk,c +1 + η) E ) ) } f K,C E fc + 1 f 2ηC K,C 2 K + ηef c ). By the fact Ef c ) = 2Rf c ) and the definition of DC), we draw our conclusion. 5 Probability Inequalities In this section we give soe probability inequalities. They odify the Bernstein inequality and extend our previous work in Chen et al. 2004) which was otivated by saple error estiates for the square loss e.g. Barron 1990; Bartlett 1998; Cucker and Sale 2001, and Mendelson 2002). Recall the Bernstein inequality: Let ξ be a rando variable on Z with ean µ and variance σ 2. If ξ µ M, then Prob µ 1 } ξz i ) > ε 2 exp ε 2 2σ 2 + 1Mε) 3 The one-side Bernstein inequality holds without the leading factor 2. Proposition 5.1. Let ξ be a rando variable on Z satisfying µ 0, ξ µ M alost everywhere, and σ 2 cµ τ for soe 0 τ 2. Then for every ε > 0 there holds µ 1 Prob ξz } i) > ε 1 τ µ τ + ε τ ) exp ε 2 τ }. 2c Mε1 τ ) Proof. The one-side Bernstein inequality tells us that µ 1 Prob ξz } } i) > ε 1 τ ε 2 τ µ τ + ε τ ) µ τ + ε τ ) 1 2 exp 2 2 ). σ Mε1 τ 2 µ τ + ε τ ) }.

20 Since σ 2 cµ τ, we have σ 2 + M 3 ε1 τ 2 µ τ + ε τ ) 1 2 cµ τ + M 3 ε1 τ µ τ + ε τ ) µ τ + ε τ ) c Mε1 τ). This yields the desired inequality. Note that f z depends on z and thus runs over a set of functions as z changes. We need a probability inequality concerning the unifor convergence. Denote E g := Z gz)dρ. Lea 5.1. Let 0 τ 1, M > 0, c 0, and G be a set of functions on Z such that for every g G, E g 0, g E g M and E g 2 ce g) τ. Then for ε > 0, Prob E g 1 sup gz i) g G E g) τ + ε τ ) 1 2 > 4ε 1 τ 2 } N G, ε ) exp ε 2 τ 2c Mε1 τ ) Proof. Let g j } N j=1 G with N = N G, ε ) such that for every g G there is soe j 1,..., N } satisfying g g j ε. Then by Proposition 5.1, a standard procedure Cucker and Sale 2001; Mukherjee et al 2002; Chen et al. 2004) leads to the conclusion. Reark 5.1. Various fors of probability inequalities using epirical covering nubers can be found in the literature. For siplicity we give the current for in Lea 5.1 which is enough for our purpose. Let us find the hypothesis space covering f z when z runs over all possible saples. This is ipleented in the following two leas. By the idea of bounding the offset fro Wu and Zhou 2003), and Chen et al. 2004), we can prove the following. Lea 5.2. For any C > 0, N and z Z, we can find a solution f z of 1.7) satisfying in 1 i f zx i ) 1. Hence b z 1 + f z. We shall always choose f z as in Lea 5.2. In fact, the only restriction we need to ake for the iniizer f z is to choose α i = 0 and b z = y, i.e., f z x) = y whenever y i = y for all 1 i with soe y Y. 20 }.

21 Lea 5.3. For every C > 0, we have f z H K and f z K κωf z ) κc. Proof. It is trivial that f z H K. By the reproducing property 1.5), 1/2 ) 1/2 fz K = α i,z α j,z y i y j Kx i, x j )) κ α i,z α j,z = κωfz ). i,j=1 i,j=1 Bounding the solution to 1.7) by the choice f = 0 + 0, we have Ef z ) + 1 Ωf C z ) E0) + 0 = 1. This gives Ωfz ) C, and copletes the proof. By Lea 5.3 and Lea 5.2 we know that πf z ) lies in F R := πf) : f B R + [ 1 + κr), 1 + κr ]} 5.1) with R = κc. The following lea Chen et al. 2004) gives the covering nuber estiate for F R. Lea 5.4. Let F R be given by 5.1) with R > 0. For any ε > 0 there holds ) 21 + κr) ε ) N F R, ε) + 1 N. ε 2R Using the function set F R defined by 5.1), we set for R > 0, G R = V y, fx) ) } V y, f c x)) : f F R. 5.2) By Lea 5.4 and the additive property of the log function, we have Lea 5.5. Let G R given by 5.2) with R > 0. i) If H2) holds, then there exists a constant c s > 0 such that log N G R, ε) c s log R ε ) s. ii) If H2 ) holds, then there exists a constant c s > 0 such that R log N G R, ε) c s ε ) s. 21

22 The following lea was proved by Scovel and Steinwart 2003) for general functions f : X R. With the projection, here f has range [ 1, 1] and a sipler proof is given. Lea 5.6. Assue H3). For every function f : X [ 1, 1] there holds ) } 2 E V y, fx)) V y, f c x)) 8 1 ) ) q q/q+1) q+1 Ef) Ef c ). 2c q Proof. Since fx) [ 1, 1], we have V y, fx)) V y, f c x)) = yf c x) fx)). It follows that Ef) Ef c ) = f c x) fx))f ρ x)dρ X = f c x) fx) f ρ x) dρ X X and ) } 2 E V y, fx)) V y, f c x)) = f c x) fx) 2 dρ X. X Let t > 0 and separate the doain X into two sets: X t + := x X : f ρ x) > c q t} and Xt := x X : f ρ x) c q t}. On X t + we have f c x) fx) 2 2 f c x) fx) fρx) c qt. On Xt we have f c x) fx) 2 4. It follows fro Assuption H3) that f c x) fx) 2 dρ X 2 Ef) Ef c ) ) +4ρ X Xt ) 2 Ef) Ef c ) ) +4t q. c q t c q t X X Choosing t = Ef) Ef c ))/2c q ) } 1/q+1) yields the desired bound. Take the function set G in Lea 5.1 to be G R. Then a function g in G R takes the for gx, y) = V y, πf)x) ) V y, f c x)) with πf) F R. Obviously we have g 2, E g = Eπf)) Ef c ) and 1 gz i) = E z πf)) E z f c ). When Assuption H3) is valid, Lea 5.6 tells us that E g 2 c E g ) τ with τ = q and c = 8 ) 1 q/q+1). q+1 2c q Applying Lea 5.1 and solving the equation log N G R, ε ) ε 2 τ 2c ε1 τ ) = log δ, we see the following corollary fro Lea 5.5 and Lea

23 Corollary 5.1. Let G R be defined by 5.2) with R > 0 and H3) hold with 0 q. For every 0 < δ < 1, with confidence at least 1 δ, there holds } } Ef) Ef c ) E z f) E z f c ) 4ε,R + 4ε for all f F R, where ε,r is given by 5 8 ) ) 1 q/q+1) log 1 + 2c q + 1 δ c slog R + log ) s ) ) 1 q/q+1) 2c q + 1 c s R s ) q+1) q+2+qs+s log 1 δ Rate Analysis q+2 2q+1),R ) q+1 q+2 ) q+1 q+2 } q 2q+1) Ef) Ef c ), if H2) holds, ), if H2 ) holds. Let us now prove the ain results stated in Section 2. We first prove Proposition 2.1. Proof of Proposition 2.1. Since Rf c ) = 0, V y, f c x)) = 0 alost everywhere and Ef c ) = 0. Take η = 1 in Proposition 4.1. We first consider the rando variable ξ = V y, f K,C x)). Since 0 ξ M and E ξ = Ef K,C ) DC), we have σ 2 ξ) E ξ 2 M E ξ MDC). Applying the one-side Bernstein inequality to ξ, we see by solving the quadratic equation ε2 = log δ/2 ) that with probability 1 δ/2, 2σ 2 +Mε/3) E z f K,C ) Ef K,C ) 2M log ) 2 δ + 3 2σ 2 ξ) log 2/δ ) 5M log ) 2 δ + DC) ) Next we estiate Eπf z )) E z πf z )). By the definition of f z, there holds 1 C Ωf z ) E z f z ) + 1 C Ωf z ) E z f z ) + 1 C Ω f z ). 23

24 According to Lea 3.1, this is bounded by 2 E z f z ) + 1 connection with the definition of f z yields 2C f z 2 K 1 C Ωf z ) 2 E z f z ) + 1 2C f ) z 2 K 2 E z f K,C ) + 1 ) 2C f K,C 2 K. Since Ef c ) = 0, DC) = Ef K,C ) + 1 2C f K,C 2 K. It follows that 1 ) C Ωf z ) 2 E z f K,C ) Ef K,C ) + DC). ). This in Together with Lea 5.3 and 6.1), this tells us that with probability 1 δ/2 ) 5M log 2/δ ) fz K κωfz ) R := 2κC + 2DC). 3 As we are considering a deterinistic case, H3) holds with q = and c = 1. Recall the definition of G R in 5.2). Corollary 5.1 with q = and R given as above iplies that Eπf z )) E z πf z )) 4ε,C + 4 ε,c Eπfz )) with confidence 1 δ where ε,c is defined in the stateent. Putting the above two estiates into Proposition 4.1, we have with confidence 1 δ, Eπf z )) 4ε,C + 4 ε,c Eπfz )) + 10M log 2/δ ) + 4DC). 3 Solving the quadratic inequality for Eπf z )) leads to Eπf z )) 32ε,C + 20M log 2/δ ) + 8DC). 3 Then our conclusion follows fro 4.1). Finally, we turn to the proof of Theores 2 and 3. To this end, we need a bound for f K,C K. According to the definition, 1 2C f K,C 2 K DC). Then we have 24

25 Lea 6.1. For every C > 0, there hold f K,C K DC)) 1/2 2C and f DC)) 1/2. K,C 1 + 2κ 2C Proof of Theore 2. Take f K,C = f K,C in Proposition 4.1. Then by DC)) 1/2. Lea 6.1 we ay take M = 2 + 2κ 2C Proposition 2.1 with Assuption H2 ) yields Rf z ) c s,β,δ C 1 β)s/s+1) 1 1+s + C1 β)s/s+1) 1 1+s Take C = in 1 s+β, 2 1+β }. Then C 1+β)/2 C 1+β)/2 ) s 1+s + C 1 β } 2 + C β. 1 and the proof is coplete. Proof of Theore 3. Denote z = E πf z ) ) Ef c ) + 1 C Ωf z ). Then we have Ωf z ) C z. This in connection with Lea 5.3 yields f z K κωf z ) κc z. 6.2) Take f K,C = f K, C with C = ηc in Proposition 4.1. It tells us that z 2ηRf c ) + S, C, η) + 2 DηC). Set η = C β/β+1). Then C = ηc = C 1/β+1). By the fact Rf c ) 1 2 and Assuption H1), z S, C, η) c β )C β β ) Recall the expression 2.11) for S, C, η). Here f K,C = f K, C. So we have S, C, η) = E πf z ) ) ) Ef c ) E z πfz ) ) )} E z f c ) ) +1 + η) E z fk, C) Ez f c ) E )} fk, C) Efc ) } +η E z f c ) Ef c ) =: S η)s 2 + ηs 3. Take t 1, C 1 to be deterined later. For R 1, denote WR) := z Z : f z K R}. 6.4) 25

26 For S 1, we apply Corollary 5.1 with δ = e t 1/e. We know that there Z of easure at ost δ = e t such that is a set V 1) R R s ) q+1 q+2+qs+s R s S 1 c s,q t + ) q+1 q+2+qs+s q+2 2q+1) q 2q+1) z }, z WR)\V 1) R. Here c s,q := 32 8 ) 1 q/q+1) 2c q + 1 3) c s + 1) 1 is a constant depending only on q and s. To estiate S 2, consider ξ = V y, f Cx) ) K, V y, f c x)) on Z, ρ). By Lea 6.1, we have f K, C 1 + 2κ 2 C D C) 1 + 2κ 2c β C 1 β 2β+1). Write ξ = ξ 1 + ξ 2 where ξ 1 := V y, f K, Cx) ) V y, π f K, C)x) ), ξ 2 := V y, π f K, C)x) ) V y, f c x)). It is easy to check that 0 ξ 1 2κ 2c β C 1 β 2β+1). Hence σ 2 ξ 1 ) is bounded by 2κ 2c β C 1 β 2β+1) E ξ1. Then the one-side Bernstein inequality with δ = e t tells us that there is a set V 2) Z of easure at ost δ = e t such that for every z Z \ V 2), there holds 1 ξ 1 z i ) E ξ 1 4κ 2c β C 1 β 3 2β+1) t + 2σ2 ξ 1 )t 10κ 2cβ C 1 β 3 2β+1) t +E ξ 1. For ξ 2, by Lea 5.6, σ 2 ξ 2 ) 8 1 2c q ) q/q+1) E ξ 2 ) q q+1. But ξ 2 2. So the one-side Bernstein inequality tells us again that there is a set V 3) Z of easure at ost δ = e t such that for every z Z \V 3), there holds 1 ξ 1 z i ) E ξ 1 4t 3 + 4σ2 ξ 1 )t 26 4t ) q q+2 t q+2 + E ξ 2. 2c q )q+1

27 Here we have used the following eleentary inequality with b := E ξ 2 ) q 2q+2 and a := c q ) q/q+1) t/ ) 1/2 : a b q + 2 2q + 2 a2q+2)/q+2) + q 2q + 2 b2q+2)/q, a, b > 0. Cobing the two estiates for ξ 1, ξ 2 with the fact that E ξ = E ξ 1 +E ξ 2 = E fk, C) Efc ) D C) c β C β/β+1) we see that S 2 c q,β t C 1 β 2β+1) + 1 ) ) q+1 q+2 + C β β+1, z Z \ V 2) R \ V 3) R, where c q,β := 10κ 2c β / c q ) q/q+1) + c β is a constant depending on q, β. The last ter is S 3 1. Putting the above three estiates for S 1, S 2, S 3 to 6.3), we find that for every z WR) \ V 1) R \ V 2) \ V 3) there holds R s 1 z 2c s,q t ) q+1) q+2+qs+s + 8c q,β t ) q+1 q+2 + C β β+1 C 1/2 )} ) Here we have used another eleentary inequality for α = q/2q + 2) 0, 1) and x = z : x ax α + b, a, b, x > 0 = x ax2a) 1/1 α), 2b}. Now we can choose C to be } C := in 2, q+1)β+1) sq+1)+βq+2+qs+s). 6.6) It ensures that ) q+1 1 q+2 C β β+1 and 1 ) q+1) q+2+qs+s C sq+1)+βq+2+qs+s) β+1)q+2+qs+s). With this choice of C, 6.5) iplies that with a set V R := V 1) R V 2) R V 3) R easure at ost 3e t, ) z C β β+1 2c s,q t C 1 sq+1) } q+2+qs+s β+1 R + 24c q,β t, z WR) \ V R. 6.7) We shall finish our proof by using 6.2) and 6.7) iteratively. 27 of

28 Start with the bound R = R 0) := κc. Lea 5.3 verifies WR 0) ) = Z. At this first step, by 6.7) and 6.2) we have Z = WR 0) ) WR 1) ) V R 0), where R 1) := κc 1 β+1 2c s,q tκ + 1)) C β β+1 } sq+1) q+2+qs+s + 24cq,β t. Now we iterate. For n = 2, 3,, we derive fro 6.7) and 6.2) that Z = WR 0) ) WR 1) ) V R 0) WR n) ) n 1 j=0 V R j) ), where each set V R j) has easure at ost 3e t and the nuber R n) is given by } R n) = κc 1 β+1 2c s,q tκ + 1)) n C β sq+1) β+1 q+2+qs+s) n + 24c q,β tκ + 1)n. Note that ɛ > 0 is fixed. We choose n 0 N to be large enough such that ) n0 +1) sq + 1) ɛ s + 2s q qs + s β + q + 2 ). q + 1 In the n 0 -th step of our iteration we have shown that for z WR n0) ), fz K κc 1 β+1 2c s,q tκ + 1)) n 0 C β sq+1) β+1 q+2+qs+s) n cq,β tκ + 1)n 0 }. This together with 6.5) gives z cs, q, β, ɛ)t n 0 ax 2β β+1, βq+1) +ɛ} sq+1)+βs+2+qs+s). This is true for z WR n0) ) \ V R n 0 ). Since the set n 0 j=0 V R j) has easure at ost 3n 0 +1)e t, we know that the set WR n0) )\V R n 0 ) has easure at least 1 3n 0 + 1)e t. Note that E πf z ) ) Ef c ) z. Take t = log 3n 0+1) ). δ Then the proof is finished by 4.1). Acknowledgents This work is partially supported by the Research Grants Council of Hong Kong [Project No. CityU ] and by City University of Hong Kong [Project No ]. The corresponding author is Ding-Xuan Zhou. 28

29 References Anthony, M., and Bartlett, P. L. 1999). Neural Network Learning: Theoretical Foundations. Cabridge University Press. Aronszajn, N. 1950). Theory of reproducing kernels. Trans. Aer. Math. Soc., 68, Barron, A. R. 1990). Coplexity regularization with applications to artificial neural networks. In Nonparaetric Functional Estiation G. Roussa, ed.), Dortrecht: Kluwer. Bartlett, P. L. 1998). The saple coplexity of pattern classification with neural networks: the size of the weights is ore iportant than the size of the network. IEEE Trans. Infor. Theory, 44, Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. 2003). Convexity, classification, and risk bounds. Preprint. Blanchard, B., Bousquet, O., and Massart, P. 2004). Statistical perforance of support vector achines. Preprint. Boser, B. E., Guyon, I., and Vapnik, V. 1992). A training algorith for optial argin classifiers. In Proceedings of the Fifth Annual Workshop of Coputational Learning Theory, Vol. 5, Pittsburgh: ACM. Bousquet, O., and Elisseeff, A. 2002). Stability and generalization. J. Machine Learning Research, 2, Bradley, P. S., and Mangasarian, O. L. 2000). Massive data discriination via linear support vector achines. Optiization Methods and Software, 13, Chen, D. R., Wu, Q., Ying, Y., and Zhou, D. X. 2004). Support vector achine soft argin classifiers: error analysis. J. Machine Learning Research, 5, Cortes, C., and Vapnik, V. 1995). Support-vector networks. Mach. Learning, 20, Cristianini, N., and Shawe-Taylor, J. 2000). An Introduction to Support Vector Machines. Cabridge University Press. 29

30 Cucker, F., and Sale, S. 2001). On the atheatical foundations of learning. Bull. Aer. Math. Soc., 39, Devroye, L., Györfi, L., and Lugosi, G. 1997). A probabilistic Theory of Pattern Recognition. New York: Springer-Verlag. Evgeniou, T., Pontil, M., and Poggio, T. 2000). Regularization networks and support vector achines. Adv. Coput. Math., 13, Kecan, V., and Hadzic, I. 2000). Support vector selection by linear prograing. Proc. of IJCNN, 5, Lugosi, G., and Vayatis, N. 2004). On the Bayes-risk consistency of regularized bossting ethods. Ann. Statis., 32, Mendelson, S. 2002). Iproving the saple coplexity using global data. IEEE Trans. Infor. Theory, 48, Mukherjee, S., Rifkin, R., and Poggio, T. 2002). Regression and classification with regularization. In Lecture Notes in Statistics: Nonlinear Estiation and Classification, D. D. Denison, M. H. Hansen, C. C. Holes, B. Mallick, and B. Yu eds.), New York: Springer-Verlag. Niyogi, P. 1998). The Inforational Coplexity of Learning. Kluwer. Niyogi, P., and Girosi, F. 1996). On the relationship between generalization error, hypothesis coplexity, and saple coplexity for radial basis functions. Neural Cop., 8, Pedroso, J. P., and Murata, N. 2001). Support vector achines with different nors: otivation, forulations and results. Pattern recognition Letters, 22, Pontil, M. 2003). A note on different covering nubers in learning theory. J. Coplexity, 19, Rosasco, L., De Vito, E., Caponnetto, A., Piana, M., and Verri, A. 2004). Are loss functions all the sae? Neural Cop., 16, Scovel, C., and Steinwart, I. 2003). Fast rates for support vector achines. Preprint. Sale, S., and Zhou, D. X. 2003). Estiating the approxiation error in learning theory. Anal. Appl., 1,

31 Sale, S., and Zhou, D. X. 2004). Shannon sapling and function reconstruction fro point values. Bull. Aer. Math. Soc., 41, Steinwart, I. 2002). Support vector achines are universally consistent. J. Coplexity, 18, Tsybakov, A. B. 2004). Optial aggregation of classifiers in statistical learning. Ann. Statis., 32, van der Vaart, A. W., and Wellner, J. A. 1996). Weak Convergence and Epirical Processes. New York: Springer-Verlag. Vapnik, V. 1998). Statistical Learning Theory. John Wiley & Sons. Wahba, G. 1990). Spline Models for Observational Data. SIAM. Wu, Q., Ying, Y., and Zhou, D. X. 2004). Multi-kernel regularized classifiers. Preprint. Wu, Q., and Zhou, D. X. 2004). Analysis of support vector achine classification. Preprint. Zhang, T. 2004). Statistical behavior and consistency of classification ethods based on convex risk iniization. Ann. Statis., 32, Zhang, T. 2002). Covering nuber bounds of certain regularized linear function classes. J. Machine Learning Research, 2, Zhou, D. X. 2002). The covering nuber in learning theory. J. Coplexity, 18, Zhou, D. X. 2003). Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Infor. Theory, 49,

Learnability of Gaussians with flexible variances

Learnability of Gaussians with flexible variances Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007

More information

Shannon Sampling II. Connections to Learning Theory

Shannon Sampling II. Connections to Learning Theory Shannon Sapling II Connections to Learning heory Steve Sale oyota echnological Institute at Chicago 147 East 60th Street, Chicago, IL 60637, USA E-ail: sale@athberkeleyedu Ding-Xuan Zhou Departent of Matheatics,

More information

Multi-kernel Regularized Classifiers

Multi-kernel Regularized Classifiers Multi-kernel Regularized Classifiers Qiang Wu, Yiing Ying, and Ding-Xuan Zhou Departent of Matheatics, City University of Hong Kong Kowloon, Hong Kong, CHINA wu.qiang@student.cityu.edu.hk, yying@cityu.edu.hk,

More information

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis E0 370 tatistical Learning Theory Lecture 6 (Aug 30, 20) Margin Analysis Lecturer: hivani Agarwal cribe: Narasihan R Introduction In the last few lectures we have seen how to obtain high confidence bounds

More information

Yongquan Zhang a, Feilong Cao b & Zongben Xu a a Institute for Information and System Sciences, Xi'an Jiaotong. Available online: 11 Mar 2011

Yongquan Zhang a, Feilong Cao b & Zongben Xu a a Institute for Information and System Sciences, Xi'an Jiaotong. Available online: 11 Mar 2011 This article was downloaded by: [Xi'an Jiaotong University] On: 15 Noveber 2011, At: 18:34 Publisher: Taylor & Francis Infora Ltd Registered in England and Wales Registered Nuber: 1072954 Registered office:

More information

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization Recent Researches in Coputer Science Support Vector Machine Classification of Uncertain and Ibalanced data using Robust Optiization RAGHAV PAT, THEODORE B. TRAFALIS, KASH BARKER School of Industrial Engineering

More information

1 Bounding the Margin

1 Bounding the Margin COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #12 Scribe: Jian Min Si March 14, 2013 1 Bounding the Margin We are continuing the proof of a bound on the generalization error of AdaBoost

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

A Simple Regression Problem

A Simple Regression Problem A Siple Regression Proble R. M. Castro March 23, 2 In this brief note a siple regression proble will be introduced, illustrating clearly the bias-variance tradeoff. Let Y i f(x i ) + W i, i,..., n, where

More information

An l 1 Regularized Method for Numerical Differentiation Using Empirical Eigenfunctions

An l 1 Regularized Method for Numerical Differentiation Using Empirical Eigenfunctions Journal of Matheatical Research with Applications Jul., 207, Vol. 37, No. 4, pp. 496 504 DOI:0.3770/j.issn:2095-265.207.04.0 Http://jre.dlut.edu.cn An l Regularized Method for Nuerical Differentiation

More information

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab

Support Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley ENSIAG 2 / osig 1 Second Seester 2012/2013 Lesson 20 2 ay 2013 Kernel ethods and Support Vector achines Contents Kernel Functions...2 Quadratic

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

3.8 Three Types of Convergence

3.8 Three Types of Convergence 3.8 Three Types of Convergence 3.8 Three Types of Convergence 93 Suppose that we are given a sequence functions {f k } k N on a set X and another function f on X. What does it ean for f k to converge to

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE227C (Spring 2018): Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee227c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee227c@berkeley.edu October

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Are Loss Functions All the Same?

Are Loss Functions All the Same? Are Loss Functions All the Same? L. Rosasco E. De Vito A. Caponnetto M. Piana A. Verri November 11, 2003 Abstract In this paper we investigate the impact of choosing different loss functions from the viewpoint

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Derivative reproducing properties for kernel methods in learning theory

Derivative reproducing properties for kernel methods in learning theory Journal of Computational and Applied Mathematics 220 (2008) 456 463 www.elsevier.com/locate/cam Derivative reproducing properties for kernel methods in learning theory Ding-Xuan Zhou Department of Mathematics,

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

Manifold learning via Multi-Penalty Regularization

Manifold learning via Multi-Penalty Regularization Manifold learning via Multi-Penalty Regularization Abhishake Rastogi Departent of Matheatics Indian Institute of Technology Delhi New Delhi 006, India abhishekrastogi202@gail.co Abstract Manifold regularization

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS
