Multi-kernel Regularized Classifiers
Qiang Wu, Yiming Ying, and Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong, Kowloon, Hong Kong, CHINA

Abstract

A family of classification algorithms generated from Tikhonov regularization schemes is considered. They involve multi-kernel spaces and general convex loss functions. Our main purpose is to provide satisfactory estimates for the excess misclassification error of these multi-kernel regularized classifiers. The error analysis consists of two parts: regularization error and sample error. Allowing multi-kernels in the algorithm improves the regularization error and approximation error, which is one advantage of the multi-kernel setting. For a general loss function, we show how to bound the regularization error by the approximation in some weighted L^q spaces. For the sample error, we use a projection operator. The projection, in connection with the decay of the regularization error, enables us to improve convergence rates in the literature even for the one-kernel schemes and special loss functions: the least square loss and the hinge loss for support vector machine soft margin classifiers. Existence of a solution to the optimization problem for the regularization scheme associated with multi-kernels is verified when the kernel functions are continuous with respect to the index set. Gaussian kernels with flexible variances and probability distributions with some noise conditions are used to illustrate the general theory.

Keywords and Phrases: Classification algorithm, multi-kernel regularization scheme, convex loss function, misclassification error, regularization error and sample error.

Supported by the Research Grants Council of Hong Kong [Project No. CityU ]. Corresponding author: Ding-Xuan Zhou.
1. Introduction

We study binary classification algorithms generated from Tikhonov regularization schemes associated with general convex loss functions and multi-kernel spaces. These algorithms produce binary classifiers f : X → {1, −1} from a compact metric space X (called the input space) to the output space Y = {1, −1} (representing the two classes). Such a classifier f assigns a class label f(x) ∈ Y to each point x (when X ⊂ R^n, x is a vector representing an event with each component corresponding to a specific measurement). The classifiers considered here have the form sgn(f), defined as sgn(f)(x) = 1 if f(x) ≥ 0 and sgn(f)(x) = −1 if f(x) < 0, induced by real-valued functions. These functions are solutions of optimization problems associated with a sample z = {(x_i, y_i)}_{i=1}^m, drawn independently according to an (unknown) probability distribution ρ on Z = X × Y. The nature of such an optimization problem (called a Tikhonov regularization scheme) is determined by two objects: a loss function and a hypothesis space.

Definition 1. A function φ : R → R_+ is called an activating loss (function) for classification if it is convex, φ'_−(0) < 0, and inf_{t∈R} φ(t) = 0.

Typical examples of activating losses include the hinge loss φ_h(t) = (1−t)_+ = max{1−t, 0} for SVM classification and the exponential loss φ_exp(t) = e^{−t} for boosting.

Let φ be an activating loss. For a real-valued function f, when sgn(f) is used for classification or prediction, the local error incurred for the event x and output y is measured by the value φ(yf(x)). The average of the local errors,

E^φ(f) = ∫_Z φ(yf(x)) dρ,

is called the error or generalization error.

The convexity and the condition φ'_−(0) < 0 imply that φ(yf(x)) ≥ φ(0) > 0 when yf(x) < 0, i.e., when sgn(f)(x) predicts the class label y incorrectly. So local errors can be small only if yf(x) > 0. Hence minimizing the generalization error is expected to lead to a function predicting the labels satisfactorily.
This gives the intuition that φ is admissible for classification problems, as verified by many examples in practice. Since the generalization error involving the unknown distribution ρ is not computable,
its discretization, computable in terms of the sample z, is used instead. It is defined as

E_z^φ(f) = (1/m) Σ_{i=1}^m φ(y_i f(x_i))

and called the empirical error. Regularized learning schemes are implemented by minimizing a penalized version of the empirical error over a set of functions, called a hypothesis space H, equipped with a functional Ω : H → R_+. The penalty functional Ω reflects constraints imposed on functions from the hypothesis space in various desirable forms.

Definition 2. Given a function φ : R → R_+ and a hypothesis space H together with a penalty functional Ω, the regularized classifier generated from a sample z ∈ Z^m is defined as sgn(f_z), where f_z is a minimizer of the Tikhonov regularization scheme

f_z := arg min_{f∈H} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ Ω(f) }.   (1.1)

Here λ is a positive constant called the regularization parameter. It depends on m: λ = λ(m), and usually λ(m) → 0 as m becomes large.

Reproducing kernel Hilbert spaces are often used as the hypothesis space in (1.1). They play an important role in learning theory because of their reproducing property. Let K : X × X → R be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points {x_1, ..., x_l} ⊂ X, the matrix (K(x_i, x_j))_{i,j=1}^l is positive semidefinite. Such a function is called a Mercer kernel.

The Reproducing Kernel Hilbert Space (RKHS) H_K associated with the Mercer kernel K is defined (see [1]) to be the completion of the linear span of the set of functions {K_x = K(x, ·) : x ∈ X} with the inner product ⟨·, ·⟩_K given by ⟨K_x, K_y⟩_K = K(x, y). The reproducing property of H_K is

⟨K_x, f⟩_K = f(x),  ∀x ∈ X, f ∈ H_K.   (1.2)

The classical soft margin classifiers [35] correspond to the scheme (1.1) with H = H_K:

f_z = arg min_{f∈H_K} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_K^2 }.   (1.3)

In this paper we introduce a multi-kernel setting where H is the union of a set of reproducing kernel Hilbert spaces (and Ω(f) is the squared infimum norm of f).
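As a concrete illustration of the one-kernel scheme (1.3), the following sketch (not from the paper; the function names, the choice of the least square loss φ(t) = (1−t)^2 and the Gaussian kernel are assumptions made for the example) solves the finite-dimensional problem that (1.3) reduces to via the representer theorem: writing f = Σ_j c_j K(x_j, ·), the objective (1/m)||Kc − y||^2 + λ c^T K c is minimized by solving (K + λ m I) c = y.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Mercer kernel K(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def regularized_classifier(X, y, lam, sigma):
    """Scheme (1.3) with the least square loss phi(t) = (1 - t)^2.

    By the representer theorem f_z = sum_j c_j K(x_j, .), and minimizing
    (1/m) sum_i (1 - y_i f(x_i))^2 + lam ||f||_K^2 over the coefficients c
    reduces to the linear system (K + lam * m * I) c = y.
    """
    m = len(y)
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * m * np.eye(m), y)

    def predict(X_new):
        f = gaussian_kernel(X_new, X, sigma) @ c
        return np.where(f >= 0, 1, -1)  # sgn(f) as defined above

    return predict
```

The closed-form solve is specific to the least square loss; for the hinge loss the inner problem is the usual SVM quadratic program instead.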
Definition 3. Let K_Σ = {K_σ : σ ∈ Σ} be a set of Mercer kernels on X. The multi-kernel space associated with K_Σ is defined to be the union H_Σ = ∪_{σ∈Σ} H_{K_σ}. For f ∈ H_Σ, we take

||f||_Σ = inf { ||f||_{K_σ} : f ∈ H_{K_σ}, σ ∈ Σ }.   (1.4)

Taking H_Σ as the hypothesis space and Ω(f) = ||f||_Σ^2 in (1.1) leads to the following scheme in the multi-kernel space H_Σ:

f_z = arg min_{f∈H_Σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_Σ^2 }.   (1.5)

The corresponding multi-kernel regularized classifier is given by sgn(f_z).

Denote (H_{K_σ}, ||·||_{K_σ}) as (H_σ, ||·||_σ) for simplicity. The regularization scheme in the multi-kernel space H_Σ can be rewritten as a two-layer minimization problem:

f_z = arg min_{σ∈Σ} min_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ^2 }.   (1.6)

It reduces to (1.3) when Σ contains only one element.

Our study of general multi-kernel schemes is motivated by recent work on learning algorithms with varying kernels. In [8] support vector machines with multiple parameters are investigated. In [19, 24] mixture density estimation is considered, and Gaussian kernels with variance σ^2 flexible on an interval [σ_1^2, σ_2^2] with 0 < σ_1 < σ_2 < +∞ are used for deriving bounds. Approximation properties of multi-kernel spaces are studied in [44]. Multi-task learning algorithms involve kernels from a convex hull of several Mercer kernels and spaces with changing norms, e.g. [16, 18].

The first natural concern about the optimization problem (1.5) or (1.6) is the existence of a minimizer. This is assured by the compactness of the metric index set Σ and the continuity of K_σ with respect to σ ∈ Σ, as stated in the next result, which follows from Proposition 1 given in Section 2.

Theorem 1. Let φ be an activating loss. If the index set Σ is a compact metric space and, for each pair (x, y), the function K_σ(x, y) is continuous with respect to σ ∈ Σ, then a solution f_z to the multi-kernel scheme (1.6) exists.
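The two-layer structure of (1.6) suggests a direct numerical sketch: minimize over f ∈ H_σ for each σ, then minimize the resulting objective over σ. The code below (an illustration, not the paper's algorithm; it assumes the least square loss, a Gaussian kernel family, and a finite grid standing in for the compact index set Σ) implements exactly this outer/inner split.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Mercer kernel K_sigma(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def multi_kernel_fit(X, y, lam, sigmas):
    """Two-layer minimization (1.6) over a finite grid approximating Sigma.

    With the least square loss the inner problem has the closed form
    c_sigma = (K_sigma + lam*m*I)^{-1} y; the outer layer keeps the sigma
    whose penalized empirical objective is smallest.
    """
    m = len(y)
    best = None
    for sigma in sigmas:
        K = gaussian_kernel(X, X, sigma)
        c = np.linalg.solve(K + lam * m * np.eye(m), y)
        fx = K @ c
        # empirical error + lam * ||f||_sigma^2, with ||f||_sigma^2 = c^T K c
        obj = np.mean((1.0 - y * fx) ** 2) + lam * (c @ K @ c)
        if best is None or obj < best[0]:
            best = (obj, sigma, c)
    return best  # (objective value, sigma_hat, coefficients)
```

Theorem 1 guarantees that the exact outer minimum over a compact Σ exists; the grid here only approximates it.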
In particular, f_z exists in the one-kernel setting (1.3). We shall assume the existence of a solution to the optimization problem (1.6) throughout the error analysis of multi-kernel regularized classifiers, the main goal of this paper.

Let (X, Y) be the random variable on X × Y with the probability distribution ρ. The misclassification error for a classifier f : X → Y is defined to be the probability of the event {f(X) ≠ Y}:

R(f) = Prob{f(X) ≠ Y} = ∫_X P(Y ≠ f(x) | x) dρ_X.   (1.7)

Here ρ_X is the marginal distribution on X and P(·|x) is the conditional distribution. The target of our error analysis is to understand how sgn(f_z) approximates the Bayes rule, the best classifier with respect to the misclassification error: f_c = arg inf R(f), with the infimum taken over all classifiers. Denote η(x) = P(Y = 1|x) and recall the regression function

f_ρ(x) = ∫_Y y dρ(y|x) = P(Y = 1|x) − P(Y = −1|x) = 2η(x) − 1,  x ∈ X.   (1.8)

Then the Bayes rule is given (e.g. [15]) by the sign of the regression function: f_c = sgn(f_ρ). Estimating the excess misclassification error

R(sgn(f_z)) − R(f_c)   (1.9)

for the multi-kernel regularized classification algorithm (1.6) is our main purpose.

For the one-kernel setting (1.3) and special choices of φ, the error analysis has been extensively investigated in the literature, especially when ρ is strictly separable (with a positive margin). Besides the hinge loss φ_h corresponding to the SVM 1-norm soft margin classifier [35, 25, 28, 11, 37], examples of loss functions include: (1) φ_q(t) = (1−t)_+^q for the SVM q-norm (q > 1) soft margin classifier, see [35, 20, 9]; (2) the least square loss φ_ls(t) = (1−t)^2, see e.g. [12, 15, 17, 23, 29, 31, 40]; (3) the exponential loss φ_exp(t) = e^{−t}, see [40, 4]; (4) the logistic regression losses φ(t) = log(1 + e^{−t}) or 1/(1 + e^{−t}), see [40, 4].

For the error bounds, we will focus on activating loss functions achieving zero, which allows us to provide a powerful analysis.
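The loss functions listed above can be written down directly as functions of the margin t = yf(x); the sketch below (illustrative helper names, not part of the paper) collects them. Note that the hinge, q-norm and least square losses attain their infimum 0 (at t = 1), while the exponential and logistic losses only approach 0 as t → +∞.

```python
import numpy as np

# All losses act on the margin t = y * f(x); phi(t) >= phi(0) > 0 when t < 0.
def hinge(t):          return np.maximum(1.0 - t, 0.0)        # phi_h
def q_norm(t, q=2.0):  return np.maximum(1.0 - t, 0.0) ** q   # phi_q, q > 1
def least_square(t):   return (1.0 - t) ** 2                  # phi_ls
def exponential(t):    return np.exp(-t)                      # phi_exp
def logistic(t):       return np.log(1.0 + np.exp(-t))        # logistic regression

# hinge, q_norm and least_square achieve the value 0 (classifying losses);
# exponential and logistic are activating but never reach 0.
```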
Definition 4. An activating loss is called a classifying loss if the infimum 0 can be achieved. It is called normalized if 1 is its minimal zero.

Examples of classifying losses include the hinge loss φ_h, the q-norm loss φ_q for SVM classification and the least square loss φ_ls(t) = (1−t)^2. They are all normalized.

Our error analysis will be carried out in Sections 3-5. It uses an error decomposition procedure for regularization schemes introduced in [9, 38], with the aid of an iteration technique [30, 38] and a projection operator [9]. The convergence rates will be stated in terms of the sample size m with proper choices of the regularization parameter λ = λ(m) → 0.

Our analysis is powerful: it yields fast convergence rates. Let us demonstrate this for the SVM. Assume X ⊂ R^n and, for some s > n, the multi-kernels K_Σ satisfy

sup_{σ∈Σ} ||K_σ||_{C^s(X×X)} < ∞.   (1.10)

This means that {K_σ : σ ∈ Σ} is a set of C^s Mercer kernels with a uniform bound. The convergence rate for the SVM with such multi-kernels can be stated as follows.

Theorem 2. Let φ = φ_h and f_z be given by (1.6). Assume for some 0 < β ≤ 1 and c_β > 0,

inf_{σ∈Σ} inf_{f∈H_σ} { ||f − f_c||_{L^1_{ρ_X}} + λ ||f||_σ^2 } ≤ c_β λ^β,  ∀λ > 0.   (1.11)

If (1.10) holds for some s > n, choose λ(m) = (1/m)^{min{ 1/(2β+(1−β)n/s), 2/(1+β) }}. For any ε > 0 and 0 < δ < 1, there exists a constant c independent of m such that, with confidence 1 − δ,

R(sgn(f_z)) − R(f_c) ≤ c (1/m)^θ,   (1.12)

where θ = min{ β/(2β+(1−β)n/s) − ε, 2β/(1+β) }.

In Theorem 2, the condition (1.11) measures the approximation power of the multi-kernel space H_Σ in L^1_{ρ_X}, acting on the function f_c. It can be described by some interpolation spaces of the pair (H_Σ, L^1_{ρ_X}). So only the sign of f_ρ is involved in (1.11). If further information about the distribution ρ is available, one expects sharper error estimates. For example, when ρ satisfies a so-called Tsybakov noise condition

ρ_X({x ∈ X : 0 < |f_ρ(x)| ≤ t}) ≤ c_ζ t^ζ,  ∀t > 0,   (1.13)
with some ζ ∈ [0, ∞] and c_ζ > 0, then the power θ in the error bound (1.12) can be improved to θ = min{ β(ζ+1)/(β(ζ+2)+(ζ+1−β)n/s) − ε, 2β/(1+β) }. This will be shown in Theorem 6 below (in Section 5). Note that any distribution satisfies (1.13) with ζ = 0. The case ζ = ∞ amounts to requiring that |f_ρ(x)| is bounded away from zero or f_ρ(x) = 0, meaning that the two classes are well separated.

Our result is completely new for the multi-kernel setting. Even for the one-kernel setting H_Σ = H_K, Theorem 2 provides the best convergence rate for the SVM under the same assumption (1.11) on the approximation power of H_K and the regularity condition of the kernel (K ∈ C^s with s > n): the capacity-independent estimates derived by Zhang [40] yield the learning rate (1.12) with θ = β/(1+β); under the noise condition (1.13), Steinwart and Scovel [30] obtained the learning rate (1.12) with θ = 2β(ζ+1)/((2+ζ+ζn/s)(1+β)) − ε. Since s > n, our rate is sharper than theirs.

2. Optimization Problem for Regularization with Multi-kernels

We divide the study of the optimization problem (1.6) into two steps. First, fix σ ∈ Σ. Denote the optimal solution in the RKHS H_σ as

f_{z,σ} = arg inf_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ^2 }.

Define the dual function ψ : R → R of φ by

ψ(v) = sup_{u∈R} { vu − φ(u) },  v ∈ R.   (2.1)

By the reproducing property (1.2), the optimization problem for solving f_{z,σ} on H_σ can be reduced to one on R^m. The following relation between the primal problem and its dual is well known (see e.g. [41]):

inf_{f∈H_σ} { (1/m) Σ_{i=1}^m φ(y_i f(x_i)) + λ ||f||_σ^2 } = sup_{α∈R^m} R̂(α, σ),

where

R̂(α, σ) := −(1/m) Σ_{i=1}^m ψ(−m α_i y_i) − (1/(4λ)) Σ_{i,j=1}^m α_i K_σ(x_i, x_j) α_j,  α ∈ R^m.
Moreover, both optimizers exist. If α̂_σ = arg max_{α∈R^m} R̂(α, σ), then sgn((α̂_σ)_i) = y_i and

f_{z,σ}(x) = (1/(2λ)) Σ_{i=1}^m (α̂_σ)_i K_σ(x_i, x).

Next, consider the multi-kernel scheme (1.6). A solution f_z can be represented as

f_z(x) = (1/(2λ)) Σ_{i=1}^m α̂_i K_{σ̂}(x_i, x)

if an optimal point (α̂, σ̂) of the following dual problem exists:

(α̂, σ̂) = arg min_{σ∈Σ} max_{α∈R^m} R̂(α, σ).   (2.2)

We show that under some mild conditions, (2.2) can be solved.

Proposition 1. Under the conditions of Theorem 1, an optimal point (α̂, σ̂) of (2.2) can be achieved. Hence an optimal solution f_z to the multi-kernel regularization scheme (1.5) always exists.

Proof. We first claim that there exists a constant C(φ, m) depending on φ and the sample size m such that

||α̂_σ||_{ℓ^∞(R^m)} ≤ C(φ, m),  ∀σ ∈ Σ.   (2.3)

To verify our claim, recall that α̂_σ is a maximizer of R̂(·, σ). This yields

R̂(α̂_σ, σ) ≥ R̂(0, σ) = −ψ(0) = −sup_{u∈R} {−φ(u)} = inf_{u∈R} φ(u) = 0.

Since K_σ is positive semidefinite, it follows that

−(1/m) Σ_{i=1}^m ψ(−m(α̂_σ)_i y_i) = R̂(α̂_σ, σ) + (1/(4λ)) Σ_{i,j=1}^m (α̂_σ)_i K_σ(x_i, x_j) (α̂_σ)_j ≥ R̂(α̂_σ, σ) ≥ 0.

However, for each v ∈ R,

ψ(−v) = sup_{u∈R} { −uv − φ(u) } ≥ −φ(0).

Therefore, for each i ∈ {1, ..., m}, we have

ψ(−m(α̂_σ)_i y_i) ≤ −Σ_{j≠i} ψ(−m(α̂_σ)_j y_j) ≤ (m−1)φ(0).   (2.4)
Now we prove our claim in two cases. Recall that the convexity of φ implies that the one-sided derivatives φ'_+ and φ'_− exist, are nondecreasing, and satisfy φ'_−(t) ≤ φ'_+(t) for any t ∈ R.

Case 1: φ'_+(t) ≤ 0 for each t ∈ R. In this case, φ is nonincreasing and lim_{u→+∞} φ(u) = inf_{u∈R} φ(u) = 0. This in connection with the definition of the dual function implies

ψ(−v) = sup_{u∈R} { −uv − φ(u) } ≥ lim_{u→+∞} (−uv) = +∞,  ∀v < 0.   (2.5)

It follows from (2.4) that m(α̂_σ)_i y_i ≥ 0 for each i. Definition 1 also tells us that φ is strictly decreasing on (−∞, 0] and lim_{t→−∞} φ(t) = +∞. Then the inverse function φ^{−1} is well defined on [φ(0), +∞). Choosing u = φ^{−1}(√v) for v ≥ (φ(0))^2 in the definition of ψ, we see that

ψ(−v) ≥ −v φ^{−1}(√v) − φ(φ^{−1}(√v)) = −v φ^{−1}(√v) − √v.

It follows that for any v ≥ max{1, (φ(−2))^2} there holds ψ(−v) ≥ 2v − √v ≥ v ≥ √v. Hence

v ≤ max{ 1, (φ(−2))^2, (ψ(−v))^2 },  ∀v ≥ 0.

Combining with (2.4), this implies that

m(α̂_σ)_i y_i ≤ max{ 1, (φ(−2))^2, (m−1)^2 (φ(0))^2 } =: C_1(φ, m).

As y_i = ±1 and m(α̂_σ)_i y_i ≥ 0, we know that m|(α̂_σ)_i| = m(α̂_σ)_i y_i ≤ C_1(φ, m) for each i. This proves our claim in Case 1: ||α̂_σ||_{ℓ^∞(R^m)} ≤ C_1(φ, m).

Case 2: φ'_+(t_0) > 0 for some t_0 ∈ R. In this case, t_0 > 0 and φ is strictly increasing on [t_0, +∞). Then for v ≥ max{1, (φ(t_0+2))^2}, there exists some u_v ≥ t_0 + 2 such that φ(u_v) = √v. Choosing u = u_v in the definition of ψ, we see that ψ(v) ≥ u_v v − φ(u_v) can be bounded from below as

ψ(v) ≥ (t_0 + 2)v − √v ≥ v ≥ √v,  ∀v ≥ max{1, (φ(t_0+2))^2}.   (2.6)

On the other hand, since φ is strictly decreasing on (−∞, 0], for v ≥ max{1, (φ(−2))^2} there exists some u_v ≤ −2 such that φ(u_v) = √v. It follows that

ψ(−v) ≥ −u_v v − φ(u_v) ≥ 2v − √v ≥ v ≥ √v,  ∀v ≥ max{1, (φ(−2))^2}.
This in connection with (2.6) implies that ψ(−v) > (m−1)φ(0) whenever

|v| > max{ (m−1)^2 (φ(0))^2, (φ(t_0+2))^2, 1, (φ(−2))^2 } =: C_2(φ, m).

Combining with (2.4), we see again that m|(α̂_σ)_i| = |m(α̂_σ)_i y_i| ≤ C_2(φ, m) for each i ∈ {1, ..., m}. This proves our claim in Case 2: ||α̂_σ||_{ℓ^∞(R^m)} ≤ C_2(φ, m).

Therefore, (2.3) holds with C(φ, m) = max{C_1(φ, m), C_2(φ, m)}.

Next, we apply our claim (2.3) to prove the proposition. Denote Ĝ(σ) = max_{α∈R^m} R̂(α, σ) = R̂(α̂_σ, σ). To prove the existence of a solution (α̂, σ̂) = (α̂_{σ̂}, σ̂) to the problem (2.2), it is sufficient to prove that the function Ĝ(σ) is continuous on the compact metric space (Σ, d_Σ). Let σ_1, σ_0 ∈ Σ. By the definition of Ĝ(σ) and R̂(α, σ), we have

Ĝ(σ_1) − Ĝ(σ_0) = R̂(α̂_{σ_1}, σ_1) − R̂(α̂_{σ_0}, σ_0) ≤ R̂(α̂_{σ_1}, σ_1) − R̂(α̂_{σ_1}, σ_0) = (1/(4λ)) Σ_{i,j=1}^m (α̂_{σ_1})_i ( K_{σ_0}(x_i, x_j) − K_{σ_1}(x_i, x_j) ) (α̂_{σ_1})_j.

By symmetry, there holds

Ĝ(σ_0) − Ĝ(σ_1) ≤ (1/(4λ)) Σ_{i,j=1}^m (α̂_{σ_0})_i ( K_{σ_1}(x_i, x_j) − K_{σ_0}(x_i, x_j) ) (α̂_{σ_0})_j.

By the continuity of K_σ(x_i, x_j) at σ_0 for each pair (i, j), we know that for any ε > 0, there exists some δ > 0 such that |K_{σ_1}(x_i, x_j) − K_{σ_0}(x_i, x_j)| ≤ 4λε/(m C(φ, m))^2 whenever d_Σ(σ_1, σ_0) < δ. It follows from (2.3) and the above two bounds that |Ĝ(σ_1) − Ĝ(σ_0)| ≤ ε. This shows the continuity of Ĝ at σ_0. Since σ_0 is an arbitrary point of Σ, Ĝ is continuous on Σ. Therefore, a minimizer of Ĝ on Σ exists: σ̂ = arg min_{σ∈Σ} Ĝ(σ). Thus,

min_{σ∈Σ} max_{α} R̂(α, σ) = min_{σ∈Σ} Ĝ(σ) = Ĝ(σ̂) = max_{α} R̂(α, σ̂).

Moreover, the maximizer of R̂(·, σ̂) always exists. This tells us that the optimum of R̂(α, σ) is achievable. By the relationship between the primal problem and its dual, we obtain the existence of a solution to the multi-kernel regularization scheme (1.5). This completes the proof of the proposition.

Example 1. Let Σ = [σ_1, σ_2] with 0 < σ_1 ≤ σ_2 < ∞ and let K_σ be the Gaussian kernel K_σ(x, y) = exp(−|x − y|^2/(2σ^2)) on a compact subset X of R^n. Then a solution to the optimization problem (1.6) exists.
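The two hypotheses of Theorem 1 used in Example 1 can be checked numerically for the Gaussian family: each K_σ is a Mercer kernel (its kernel matrices are symmetric positive semidefinite), and K_σ(x, y) varies continuously in σ. The sketch below (an illustration with made-up sample points, not from the paper) verifies both on a random finite point set.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3))  # points in a compact subset of R^3

def K(sigma):
    """Gaussian kernel matrix K_sigma(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Mercer property: each kernel matrix is symmetric positive semidefinite.
for s in (0.5, 1.0, 2.0):
    assert np.min(np.linalg.eigvalsh(K(s))) > -1e-10

# Continuity in the index sigma (hypothesis of Theorem 1): a small change
# of sigma changes every kernel value only slightly.
assert np.max(np.abs(K(1.0) - K(1.0 + 1e-6))) < 1e-5
```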
3. Error Analysis: A General Framework

In this section, we give a general framework for our error analysis, consisting of a comparison theorem, a projection operator and an error decomposition procedure. It provides bounds for the excess misclassification error in terms of a regularization error and a sample error, studied separately in the next two sections.

3.1. Comparison Theorems

As in the learning rate stated in Theorem 2, the error analysis aims at bounding the excess misclassification error R(sgn(f_z)) − R(f_c). But the algorithm is designed by minimizing a penalized empirical error E_z^φ associated with the loss function φ. Knowledge of regularization schemes or empirical risk minimization processes would only lead us to expect the convergence of E^φ(f_z) as m → ∞. So relations between the misclassification error and the generalization error become crucial. Some work has been done on this topic [4, 40, 2]. Here we only mention the comparison theorems which will be used in this paper.

Denote R̄ = R ∪ {±∞}. Define f_ρ^φ = arg min E^φ(f), with the minimum taken over all functions f : X → R̄. Note that f_ρ^φ always exists since φ is convex. It satisfies sgn(f_ρ^φ) = f_c, an admissibility condition for the loss function; see [29, 2].

The first comparison theorem is for the hinge loss φ_h(t) = (1−t)_+.

Proposition 2. Let φ = φ_h be the hinge loss. We have f_ρ^{φ_h} = f_c and, for every measurable function f : X → R,

R(sgn(f)) − R(f_c) ≤ E^{φ_h}(f) − E^{φ_h}(f_c).   (3.1)

The fact f_c = f_ρ^{φ_h} was proved in [36]. The relation (3.1) was proved in [40]. The following comparison theorem for general activating loss functions was given in [9]. Note that the convexity of φ implies φ''(0) ≥ 0.
Proposition 3. If an activating loss φ satisfies φ''(0) > 0, then there exists a constant c_φ > 0 such that for any measurable function f : X → R, there holds

R(sgn(f)) − R(f_c) ≤ c_φ √( E^φ(f) − E^φ(f_ρ^φ) ).

Tighter comparison bounds are possible under some noise conditions. We say that ρ has a Tsybakov noise exponent α ≥ 0 if for some c_α > 0 and every measurable f : X → Y,

ρ_X({x ∈ X : f(x) ≠ f_c(x)}) ≤ c_α ( R(f) − R(f_c) )^α.   (3.2)

All distributions satisfy (3.2) with α = 0 and c_α = 1. The following sharper comparison bound for α > 0 follows immediately from [4, Lemma 6] and Proposition 3.

Corollary 1. Let φ be a classifying loss satisfying φ''(0) > 0. If ρ satisfies the Tsybakov noise condition (3.2) for some α ∈ [0, 1] and c_α > 0, then

R(sgn(f)) − R(f_c) ≤ 2 c_φ c_α ( E^φ(f) − E^φ(f_ρ^φ) )^{1/(2−α)},  ∀f : X → R.

3.2. Projection Operator

By the comparison theorems, we only need to bound the excess generalization error E^φ(f_z) − E^φ(f_ρ^φ) in order to study the performance of the classifier sgn(f_z). But we can do better using the special feature of a classifying loss that it achieves a zero. A key technical tool here is a projection operator. To simplify the notation and statements, we restrict our discussion to normalized classifying loss functions.

First we show that the target function f_ρ^φ can be chosen to be bounded. For x ∈ X, set the univariate convex function

Q(t) = Q_x(t) := ∫_Y φ(yt) dρ(y|x),  t ∈ R.   (3.3)

Its one-sided derivatives exist, are nondecreasing and satisfy Q'_−(t) ≤ Q'_+(t) for every t ∈ R. Denote

f_ρ^−(x) = sup{ t ∈ R : Q'_−(t) < 0 },  f_ρ^+(x) = inf{ t ∈ R : Q'_+(t) > 0 }.
Theorem 3. Let φ be a normalized classifying loss. Then

(a) for each x ∈ X, the univariate function Q given by (3.3) is strictly decreasing on (−∞, f_ρ^−(x)], strictly increasing on [f_ρ^+(x), +∞), and constant on [f_ρ^−(x), f_ρ^+(x)];

(b) f_ρ^φ : X → R is a minimizer of the generalization error E^φ if and only if, for almost every x ∈ X with respect to ρ_X, f_ρ^φ(x) is a minimizer of Q_x, that is, there holds

f_ρ^−(x) ≤ f_ρ^φ(x) ≤ f_ρ^+(x);   (3.4)

(c) we may choose a minimizer f_ρ^φ of E^φ satisfying f_ρ^φ(x) ∈ [−1, 1] for each x ∈ X.

Proof. Let x ∈ X and consider the univariate continuous function Q given by (3.3). It is strictly decreasing on the interval (−∞, f_ρ^−(x)), since Q'_−(t) < 0 on this interval. In the same way, Q'_+(t) > 0 for t > f_ρ^+(x), so Q is strictly increasing on (f_ρ^+(x), +∞). For t ∈ (f_ρ^−(x), f_ρ^+(x)), we have 0 ≤ Q'_−(t) ≤ Q'_+(t) ≤ 0, hence Q is constant there, taking the minimal value of Q on R. This proves (a).

Since E^φ(f) = ∫_X Q_x(f(x)) dρ_X(x), statement (b) follows directly from (a).

By assumption, φ is convex and has minimal zero 1. This implies that φ is strictly decreasing on (−∞, 1] and nondecreasing on [1, +∞). So Q(t) ≥ Q(1) for t > 1 and Q(t) ≥ Q(−1) for t < −1. Hence a minimum of Q can always be achieved on [−1, 1], and we may choose f_ρ^φ such that f_ρ^φ(x) ∈ [−1, 1]. This proves statement (c).

In what follows we shall always choose f_ρ^φ with ||f_ρ^φ||_∞ ≤ 1 for normalized classifying loss functions. Then we can make full use of the projection operator introduced in [9].

Definition 5. The projection operator π is defined on the space of measurable functions f : X → R as

π(f)(x) = 1 if f(x) > 1;  π(f)(x) = −1 if f(x) < −1;  π(f)(x) = f(x) if −1 ≤ f(x) ≤ 1.   (3.5)

It is easy to see that π(f) and f induce the same classifier, i.e., sgn(π(f)) = sgn(f). Applying this fact to the comparison theorems, it suffices to bound the excess generalization error of π(f_z) instead of f_z. This leads to better estimates, as we will see later. The following property of the projection operator is immediate from the definition of φ.
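Numerically, the projection (3.5) is nothing more than clipping function values into [−1, 1]; the sketch below (an illustration, with a hypothetical helper name) makes that explicit.

```python
import numpy as np

def project(f_vals):
    """Projection operator pi of (3.5): clip values of f into [-1, 1].

    sgn(pi(f)) = sgn(f), so the induced classifier is unchanged, while for a
    normalized classifying loss phi(y * pi(f)(x)) <= phi(y * f(x)), since phi
    is decreasing on (-inf, 1] and nondecreasing on [1, +inf).
    """
    return np.clip(f_vals, -1.0, 1.0)
```

For example, with the hinge loss and y = 1, a value f(x) = −3 incurs loss 4, while the projected value −1 incurs loss 2, with the same predicted label.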
Proposition 4. If φ is a normalized classifying loss function, then there holds almost surely

φ(y π(f)(x)) ≤ φ(y f(x)).   (3.6)

Hence for any measurable function f, we have E^φ(π(f)) ≤ E^φ(f) and E_z^φ(π(f)) ≤ E_z^φ(f).

3.3. Error Decomposition

Now we can present the error decomposition which leads to bounds of the excess generalization error for π(f_z). Define

f_λ = arg min_{f∈H_Σ} { E^φ(f) + λ ||f||_Σ^2 }.

Proposition 5. Let φ be a normalized classifying loss and f_z be given by (1.6). Then

E^φ(π(f_z)) − E^φ(f_ρ^φ) + λ ||f_z||_Σ^2 ≤ D(λ) + S_{z,λ},   (3.7)

where D(λ) is the regularization error of the multi-kernel space H_Σ, defined [27] as

D(λ) = inf_{σ∈Σ} inf_{f∈H_σ} { E^φ(f) − E^φ(f_ρ^φ) + λ ||f||_σ^2 },   (3.8)

and

S_{z,λ} = { E^φ(π(f_z)) − E_z^φ(π(f_z)) } + { E_z^φ(f_λ) − E^φ(f_λ) }.   (3.9)

Proof. Write E^φ(π(f_z)) − E^φ(f_ρ^φ) + λ ||f_z||_Σ^2 as

{ E^φ(π(f_z)) − E_z^φ(π(f_z)) } + { (E_z^φ(π(f_z)) + λ ||f_z||_Σ^2) − (E_z^φ(f_λ) + λ ||f_λ||_Σ^2) } + { E_z^φ(f_λ) − E^φ(f_λ) } + { E^φ(f_λ) − E^φ(f_ρ^φ) + λ ||f_λ||_Σ^2 }.

By Proposition 4, E_z^φ(π(f_z)) ≤ E_z^φ(f_z). This in connection with the definition of f_z tells us that the second term is ≤ 0. Note that S_{z,λ} is just the sum of the first and third terms. By the definition of f_λ, the last term equals D(λ). This proves (3.7).

The regularization error term D(λ) in the error decomposition (3.7) is independent of the sample. It can be estimated by K-functionals, as discussed in Section 4.
The last term S_{z,λ} in (3.7) is called the sample error. Without the projection, it is well understood owing to the vast literature in learning theory. The projection operator enables us to improve the sample error estimates, as stated in Theorem 5 below.

The comparison theorems and the error decomposition switch the goal of the error analysis to the estimation of the regularization error and the sample error. For instance, to prove Theorem 2, we first apply Proposition 2 to π(f_z) and then Proposition 5; this tells us that R(sgn(f_z)) − R(f_c) is bounded by the sum of S_{z,λ} and D(λ).

4. Estimating Regularization Error and Approximation Error

In this section, we discuss the estimation of the regularization error. The convexity of φ implies that φ'(t) = φ'_+(t) = φ'_−(t) for almost every t ∈ R.

Theorem 4. Let φ be a normalized classifying loss. Then

E^φ(f) − E^φ(f_ρ^φ) ≤ ||φ'||_{L^∞[−||f||_∞, ||f||_∞]} ||f − f_ρ^φ||_{L^1_{ρ_X}}.

If, moreover, φ is C^1 and φ' is absolutely continuous on R, we have

E^φ(f) − E^φ(f_ρ^φ) ≤ ||φ''||_{L^∞[−||f||_∞−1, ||f||_∞+1]} ||f − f_ρ^φ||^2_{L^2_{ρ_X}}.

Proof. With the function Q = Q_x defined in (3.3), write E^φ(f) − E^φ(f_ρ^φ) as

E^φ(f) − E^φ(f_ρ^φ) = ∫_X { Q(f(x)) − Q(f_ρ^φ(x)) } dρ_X.

Since φ'_−(0) < 0 and φ(t) ≥ 0, we have φ(0) > 0 and φ'_±(t) < 0 for t < 0. Let P(t) = max{ |φ'_±(t)|, |φ'_±(−t)| } for t > 0. We only need to prove

Q(f(x)) − Q(f_ρ^φ(x)) ≤ P(|f(x)|) |f(x) − f_ρ^φ(x)|   (4.1)

for those x with Q(f(x)) − Q(f_ρ^φ(x)) > 0. According to Theorem 3, such a point x satisfies f(x) ∉ [f_ρ^−(x), f_ρ^+(x)].

If f(x) > f_ρ^+(x), then Q is strictly increasing on [f(x), +∞). Hence f(x) > f_ρ^φ(x). By Theorem 3, we have

Q(f(x)) − Q(f_ρ^φ(x)) ≤ Q'_−(f(x)) ( f(x) − f_ρ^φ(x) ).
Note that both φ'_− and φ'_+ are nondecreasing, and Q(t) = η(x)φ(t) + (1−η(x))φ(−t). Hence

Q'_−(f(x)) = η(x)φ'_−(f(x)) − (1−η(x))φ'_+(−f(x)) ≤ max{ |φ'_±(|f(x)|)|, |φ'_±(−|f(x)|)| },

no matter whether f(x) ≥ 0 or not. Thus, (4.1) holds true when f(x) > f_ρ^+(x). In the same way, if f(x) < f_ρ^−(x), then Q is strictly decreasing on (−∞, f(x)]. Hence f(x) < f_ρ^φ(x). Theorem 3 yields again

Q(f(x)) − Q(f_ρ^φ(x)) ≤ −Q'_+(f(x)) ( f_ρ^φ(x) − f(x) ).

Since −Q'_+(f(x)) = −η(x)φ'_+(f(x)) + (1−η(x))φ'_−(−f(x)) ≤ P(|f(x)|), we see that (4.1) also holds when f(x) < f_ρ^−(x). This proves the first statement.

If φ is C^1 and φ' is absolutely continuous on R, we know from Theorem 3 that Q'(f_ρ^φ(x)) = 0. Hence

Q(f(x)) − Q(f_ρ^φ(x)) = ∫_{f_ρ^φ(x)}^{f(x)} { Q'(u) − Q'(f_ρ^φ(x)) } du ≤ ||Q''||_{L^∞(I)} |f(x) − f_ρ^φ(x)|^2 / 2,

where I is the interval between f_ρ^φ(x) and f(x). Then the second statement follows.

In the above, L^q_{ρ_X} is the L^q space with norm ||f||_{L^q_{ρ_X}} = ( ∫_X |f(x)|^q dρ_X )^{1/q}. Thus we can use the rich knowledge from approximation theory to estimate the regularization error. See [9] for details on bounding the regularization error for the SVM q-norm soft margin classifiers by means of K-functionals in L^q_{ρ_X}.

One advantage of multi-kernel algorithms is the improvement of the regularization error compared with the one-kernel setting. For examples and discussion, see [44, 30, 26].

5. Sample Error Estimates and Learning Rates

We are now in a position to estimate the sample error and derive the learning rates. Throughout this section, we assume that the kernels are uniformly bounded in the sense that

κ := sup_{σ∈Σ} ||K_σ||_{C(X×X)} < ∞.   (4.2)
To state our result, we need to introduce several further concepts and notations. The quantity E^φ(π(f_z)) − E_z^φ(π(f_z)) in the sample error (3.9) needs to be estimated by some uniform law of large numbers. To this end, we need the capacity of the hypothesis space, which plays an essential role in sample error estimates. In this paper, we use covering numbers measured by empirical distances.

Definition 6. Let F be a set of functions on Z and z = (z_1, ..., z_m) ∈ Z^m. The metric d_{2,z} is defined on F by

d_{2,z}(f, g) = ( (1/m) Σ_{i=1}^m ( f(z_i) − g(z_i) )^2 )^{1/2}.

For every ε > 0, the covering number of F with respect to d_{2,z} is defined as

N_{2,z}(F, ε) = inf{ l ∈ N : ∃ {f_i}_{i=1}^l ⊂ F such that F = ∪_{i=1}^l { f ∈ F : d_{2,z}(f, f_i) ≤ ε } }.

The function sets in our situation are balls of the multi-kernel space of the form

B_R = { f ∈ H_Σ : ||f||_Σ ≤ R } = ∪_{σ∈Σ} { f ∈ H_σ : ||f||_σ ≤ R }.

We need the empirical covering number of B_1, defined as

N(ε) = sup_{m∈N} sup_{x∈X^m} N_{2,x}(B_1, ε).   (4.3)

For a function f : Z → R, denote Ef = ∫_Z f(z) dρ.

Theorem 5. Let φ be a normalized classifying loss. Assume the following conditions with exponents q > 0, τ ∈ [0, 1] and p ∈ (0, 2):

(1) an increment condition for φ with a constant c_q > 0:

φ(t) ≤ c_q |t|^q,  ∀|t| ≥ 1;   (4.4)

(2) a variance-expectation bound for the pair (φ, ρ) with the exponent τ and some c_τ > 0:

E{ ( φ(yf(x)) − φ(y f_ρ^φ(x)) )^2 } ≤ c_τ ( E^φ(f) − E^φ(f_ρ^φ) )^τ,  ∀ ||f||_∞ ≤ 1;   (4.5)

(3) a capacity condition for the function set B_1 with a constant c_p > 0:

log N(ε) ≤ c_p (1/ε)^p,  ∀ε > 0.   (4.6)
If D(λ) ≤ c_β λ^β for some 0 < β ≤ 1 and c_β > 0, then for any ε > 0 and 0 < δ < 1, there exists a constant c independent of m such that, with λ = λ(m) = (1/m)^γ, we have with confidence 1 − δ,

E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ c (1/m)^θ,   (4.7)

where

γ = min{ 2/(β(4−2τ+pτ) + p(1−β)), 2/(2β + q − βq) }   (4.8)

and

θ = min{ 2β/(β(4−2τ+pτ) + p(1−β)) − ε, 2β/(2β + q − βq) }.   (4.9)

The proof of Theorem 5 will be given at the end of this section. Before applying Theorem 5, let us remark on the assumptions.

The increment condition (4.4) is satisfied for all commonly used loss functions, including the hinge loss and the least square loss.

The variance-expectation condition (4.5) for the pair (φ, ρ) always holds for τ = 0 with c_τ = (max{φ(−1), φ(1)})^2. This can be seen from the fact that |φ(yf(x)) − φ(y f_ρ^φ(x))| ≤ max{φ(−1), φ(1)}. Larger exponents τ are possible when φ has high convexity (such as φ_ls in Theorem 7 below) or when the distribution ρ satisfies some conditions (such as the Tsybakov noise condition (1.13) in Theorem 6 below).

The capacity condition (4.6) always holds with p ≤ 2 if K_Σ contains only one kernel. Note that for any function set F ⊂ C(X), the empirical covering number N_{2,x}(F, ε) is bounded by N_∞(F, ε), the (uniform) covering number of F under the metric ||·||_∞, since d_{2,x}(f, g) ≤ ||f − g||_∞. So in the multi-kernel setting, the behavior of the covering number N(ε) can be estimated from the uniform smoothness of the kernels in Σ according to [43].

Example 2. If the set Σ of kernels on X ⊂ R^n satisfies (1.10) for some s > 0, then there is a constant c_s > 0 such that log N(ε) ≤ c_s (1/ε)^{2n/s} for any ε > 0.

The regularization error D(λ) decays to zero once H_Σ is dense in C(X). By the discussion in Section 4, a decay rate with exponent β can be estimated if some a priori knowledge of the distribution is available; see [9] for explicit examples.

Let us now show how to apply Theorem 5 to derive learning rates. Recall Proposition 3 and Corollary 1. A direct corollary of Theorem 5 is as follows.
Corollary 2. Under the assumptions of Theorem 5, if φ''(0) > 0, then for any ε > 0 and 0 < δ < 1, there is a constant c independent of m such that, with confidence 1 − δ,

R(sgn(f_z)) − R(f_c) ≤ c (1/m)^{θ/2},   (4.10)

where λ = (1/m)^γ and γ, θ are given by (4.8) and (4.9), respectively. If, in addition, ρ satisfies the noise condition (3.2) with 0 < α ≤ 1, the power θ/2 in (4.10) can be improved to θ/(2−α).

Next we consider two classical classification algorithms: SVM classification and the least square method.

5.1. Learning Rates for SVM Classification

For SVM classification with the hinge loss, we illustrate how noise conditions on the distribution ρ raise the variance-expectation exponent τ in (4.5) from 0 (for general distributions) to τ = ζ/(ζ+1) > 0.

Theorem 6. Let φ = φ_h and let the multi-kernels {K_σ : σ ∈ Σ} satisfy (4.6). Assume

inf_{σ∈Σ} inf_{f∈H_σ} { E^{φ_h}(f) − E^{φ_h}(f_c) + λ ||f||_σ^2 } ≤ c_β λ^β,  ∀λ > 0   (4.11)

with 0 < β ≤ 1 and c_β > 0, and that ρ satisfies the noise condition (1.13) with ζ ∈ [0, ∞] and c_ζ > 0. Choose λ = λ(m) = (1/m)^{min{ 2(ζ+1)/(2β(ζ+2)+p(ζ+1−β)), 2/(β+1) }}. For any ε > 0 and 0 < δ < 1, there exists a constant C_ε > 0 independent of m such that, with confidence 1 − δ,

R(sgn(f_z)) − R(f_c) ≤ C_ε (1/m)^θ,  θ = min{ 2β(ζ+1)/(2β(ζ+2)+p(ζ+1−β)) − ε, 2β/(1+β) }.

Proof. Observe that φ_h satisfies the increment condition (4.4) with q = 1 and c_q = 2. Because of the noise condition (1.13), we know from [30] and [38] that condition (4.5) is valid with the exponent τ = ζ/(ζ+1) and a constant c_τ depending only on c_ζ and ζ. Then the conclusion follows from Theorem 5 and Proposition 2.

Theorem 2 stated in the introduction is a special case of Theorem 6 for multi-kernels with a uniform bound in C^s.

Proof of Theorem 2. By Example 2, (4.6) holds with p = 2n/s. Since φ_h is Lipschitz, Theorem 4 yields E^{φ_h}(f) − E^{φ_h}(f_c) ≤ ||f − f_c||_{L^1_{ρ_X}}. Hence (1.11) implies (4.11). Take ζ = 0, since no assumption on the noise is made. Theorem 2 then follows from Theorem 6.
5.2. Learning Rates with the Least-square Loss

Consider the least-square loss φ_ls(t) = (1 − t)² investigated in [31]. We illustrate how high convexity of the loss function yields a large variance-expectation exponent τ in (4.5). Here φ_ls(yf(x)) = (1 − yf(x))² = (y − f(x))² since y² = 1 for y ∈ Y. So we know [35] that f_ρ^φ = f_ρ, and the high convexity of φ_ls ensures [12] that (4.5) holds true with τ = 1 and C_τ = 1. The increment condition (4.4) for φ_ls holds with q = 2. Moreover, E^{φ_ls}(f) − E^{φ_ls}(f_ρ) = ‖f − f_ρ‖²_{L²_{ρ_X}}. Putting all these into Proposition 3 and Corollary 2, we obtain the following learning rate.

Theorem 7. Consider (1.6) with φ = φ_ls and multi-kernels {K^σ : σ ∈ Σ} satisfying (4.6) with some p ∈ (0, 2). Assume that for some 0 < β ≤ 1 and c_β > 0,

inf_{σ ∈ Σ} inf_{f ∈ H_σ} { ‖f − f_ρ‖²_{L²_{ρ_X}} + λ‖f‖_σ² } ≤ c_β λ^β, ∀λ > 0. (4.12)

Then by choosing λ = λ(m) = (1/m)^{min{2/(2β+p), 1}}, for any ε > 0 and 0 < δ < 1, there exists a constant C_ε independent of m such that with confidence 1 − δ,

R(sgn(f_z)) − R(f_c) ≤ C_ε (1/m)^θ with θ = (1/2) min{ 2β/(2β+p) − ε, β }. (4.13)

If, moreover, ρ satisfies (3.2), then θ can be improved to θ = (1/(2−α)) min{2β/(2β+p) − ε, β}. In particular, when inf_{x∈X} |f_ρ(x)| > 0, (4.13) holds with θ = min{2β/(2β+p) − ε, β}.

The above learning rate is better than those in the literature, e.g. [13, 23, 6, 40]. When the kernels are C^∞ with (1.10) valid for any s > 0, we may take p in Theorem 7 to be arbitrarily small, and the power θ in (4.13) becomes min{1/2 − ε, β/2}.

Example 3. Let φ(t) = (1 − t)², Σ = [σ₁, σ₂] with 0 < σ₁ ≤ σ₂ < ∞, and let K^σ be the Gaussian kernel K^σ(x, y) = exp{−|x − y|²/(2σ²)} on X ⊂ IR^n. Assume (4.12). Let ε > 0 and λ = λ(m) = (1/m)^{min{1/β − ε, 1}}. Then with confidence 1 − δ, we have

R(sgn(f_z)) − R(f_c) ≤ c̃ (1/m)^{θ/2}, θ = min{1 − ε, β}.

If ρ satisfies the noise condition (3.2) with 0 < α ≤ 1, then the power θ/2 can be improved to θ/(2−α) = (1/(2−α)) min{1 − ε, β}. When inf_{x∈X} |f_ρ(x)| > 0, we can replace θ/2 by min{1 − ε, β}.
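To make the double minimization inf_{σ∈Σ} inf_{f∈H_σ} of Example 3 concrete, here is a minimal sketch (our own construction, not the paper's algorithm; data, grid, and names are hypothetical, and a finite grid stands in for the interval Σ = [σ₁, σ₂]). For each Gaussian width σ it solves the regularized least-squares problem in H_σ in closed form and keeps the width with the smallest regularized empirical objective:

```python
import numpy as np

def gram(x, sigma):
    # Gaussian kernel matrix K^sigma on a 1-D sample
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def multi_kernel_ls(x, y, sigmas, lam):
    m = len(x)
    best = None
    for s in sigmas:
        K = gram(x, s)
        # representer form f = sum_j alpha_j K^s(x_j, .): the objective
        # (1/m)||K a - y||^2 + lam a^T K a is minimized by (K + m lam I) a = y
        alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
        f = K @ alpha
        obj = np.mean((f - y) ** 2) + lam * alpha @ K @ alpha  # ||f||_sigma^2 = a^T K a
        if best is None or obj < best[0]:
            best = (obj, s, alpha)
    return best

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)
y = np.sign(np.sin(3 * x))                      # labels in {-1, +1}
obj, sigma_star, alpha = multi_kernel_ls(x, y, sigmas=[0.05, 0.2, 1.0], lam=1e-3)
acc = np.mean(np.sign(gram(x, sigma_star) @ alpha) == y)
```

The classifier is then sgn(f) with f built from the selected width σ*; on this toy sample it recovers most labels.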
5.3. Proof of the Main Result

To end this section, we prove our main result, Theorem 5. To this end, we shall use the following concentration inequality.

Proposition 6. Let F be a set of measurable functions on Z, and let B, c > 0 and τ ∈ [0, 1] be constants such that each function f ∈ F satisfies ‖f‖_∞ ≤ B and IE(f²) ≤ c(IEf)^τ. If for some a > 0 and p ∈ (0, 2),

sup_{m ∈ IN} sup_{z ∈ Z^m} log N_{2,z}(F, ε) ≤ a ε^{−p}, ∀ε > 0, (4.14)

then there exists a constant c′_p depending only on p such that for any t > 0, with probability at least 1 − e^{−t}, there holds

IEf − (1/m) Σ_{i=1}^m f(z_i) ≤ (1/2) η^{1−τ} (IEf)^τ + c′_p η + 2 (ct/m)^{1/(2−τ)} + 18Bt/m, ∀f ∈ F,

where

η := max{ c^{(2−p)/(4−2τ+pτ)} (a/m)^{2/(4−2τ+pτ)}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }.

To prove Proposition 6, we need some preparations.

Definition 7. A function ψ : IR₊ → IR₊ is sub-root if it is non-negative, non-decreasing, and ψ(r)/√r is non-increasing. For a sub-root function ψ and any D > 0, the equation ψ(r) = r/D has a unique positive solution.

The following proposition is given in [3, Theorem 3].

Proposition 7. Let F be a class of measurable, square integrable functions such that IEf − f ≤ b for all f ∈ F. Let ψ be a sub-root function, D be some positive constant, and r* be the unique solution to ψ(r) = r/D. Assume that

IE[ max{ 0, sup_{f ∈ F, IEf² ≤ r} ( IEf − (1/m) Σ_{i=1}^m f(z_i) ) } ] ≤ ψ(r), ∀r ≥ r*.

Then for all t > 0 and all K > D/7, with probability at least 1 − e^{−t} there holds

IEf − (1/m) Σ_{i=1}^m f(z_i) ≤ IEf²/K + 50K r*/D² + (K + 9b) t/m, ∀f ∈ F.
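Definition 7 makes the equation ψ(r) = r/D well posed. The following sketch (our own illustration, with arbitrary made-up constants standing in for c_p, D, B, a, m) finds the unique positive solution by bisection for a sub-root function of the shape later used in (4.16), and checks it against a closed-form bound of the shape (4.17):

```python
# Hypothetical constants; psi below has the form 2 c_p max{ r^(1/2 - p/4) (a/m)^(1/2),
# B^((2-p)/(2+p)) (a/m)^(2/(2+p)) }, which is non-decreasing with psi(r)/sqrt(r)
# non-increasing, hence sub-root.
cp, D, B, a, m, p = 1.3, 2.0, 1.0, 0.5, 100, 1.0

def psi(r):
    return 2 * cp * max(r ** (0.5 - p / 4) * (a / m) ** 0.5,
                        B ** ((2 - p) / (2 + p)) * (a / m) ** (2 / (2 + p)))

lo, hi = 1e-12, 1e6
for _ in range(200):                 # bisection: psi(r) - r/D changes sign once
    mid = 0.5 * (lo + hi)
    if psi(mid) > mid / D:
        lo = mid
    else:
        hi = mid
r_star = hi
# closed-form bound of the shape (4.17)
bound = max((2 * cp * D) ** (4 / (2 + p)),
            2 * cp * D * B ** ((2 - p) / (2 + p))) * (a / m) ** (2 / (2 + p))
```

The bisection solution stays below the closed-form bound, as the fixed point of each branch of the max does.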
We need to find the sub-root function ψ in our setting. To this end, introduce the Rademacher variables ε_i, i = 1, …, m. Then

IE[ sup_{f ∈ F, IEf² ≤ r} ( IEf − (1/m) Σ_{i=1}^m f(z_i) ) ] ≤ 2 IE[ sup_{f ∈ F, IEf² ≤ r} (1/m) Σ_{i=1}^m ε_i f(z_i) ]. (4.15)

The right hand side is called the local Rademacher process. It can be bounded by using empirical covering numbers and the entropy integral; see [34]. The following result is a scaled version of Proposition 5.4 in [30], where the case B = 1 is given.

Proposition 8. Let F be a class of measurable functions from Z to [−B, B]. Assume (4.14) for some p ∈ (0, 2) and a > 0. Then there exists a constant c_p depending only on p such that

IE[ sup_{f ∈ F, IEf² ≤ r} (1/m) Σ_{i=1}^m ε_i f(z_i) ] ≤ c_p max{ r^{1/2 − p/4} (a/m)^{1/2}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }.

According to Proposition 8 and (4.15), in applying Proposition 7 one should take

ψ(r) = 2c_p max{ r^{1/2 − p/4} (a/m)^{1/2}, B^{(2−p)/(2+p)} (a/m)^{2/(2+p)} }. (4.16)

Then the solution r* to the equation ψ(r) = r/D satisfies

r* ≤ max{ (2c_p D)^{4/(2+p)}, 2c_p D B^{(2−p)/(2+p)} } (a/m)^{2/(2+p)}. (4.17)

Proof of Proposition 6. Let ψ be defined by (4.16) and r* be the solution to ψ(r) = r/D. Since ‖f‖_∞ ≤ B, we have IEf − f ≤ b := 2B for each f ∈ F. Choose K = D/5. By Proposition 7 and the condition IEf² ≤ c(IEf)^τ, we know that with probability at least 1 − e^{−t} there holds

IEf − (1/m) Σ_{i=1}^m f(z_i) ≤ (5c/D)(IEf)^τ + (10/D) r* + (D/5 + 18B) t/m, ∀f ∈ F. (4.18)

Recall that r* satisfies (4.17). Take D = 10c η^{τ−1}, where η is given in our statement. Then 5c/D = (1/2) η^{1−τ}. The expression of η in connection with the bound (4.17) for r* tells
us that (10/D) r* ≤ c″_p η, where c″_p is a constant depending only on p and c_p, hence only on p. Observe from the choice of D that

(D/5)(t/m) = 2ct η^{τ−1}/m ≤ 2 max{ η, (ct/m)^{1/(2−τ)} },

according to whether η ≥ (ct/m)^{1/(2−τ)} or not. Take c′_p to be the constant c″_p + 2, depending only on p. Then the desired inequality holds for each f ∈ F. This proves Proposition 6.

We now turn to our key analysis and prove Theorem 5. Let us first explain our main ideas. In the sample error term of (3.7), the quantity E_z^φ(f_λ) − E^φ(f_λ) is easy to handle. It can be estimated by the one-side Bernstein inequality for the single random variable φ(yf_λ(x)) on Z. This will be done in the first step of the proof with a mild technical modification: consider the random variable ξ = φ(yf_λ(x)) − φ(yf_ρ^φ(x)) instead of φ(yf_λ(x)).

The quantity E^φ(π(f_z)) − E_z^φ(π(f_z)) is more difficult, and we need Proposition 6 to estimate it. Here the function set will be F = { φ(yπ(f)(x)) − φ(yf_ρ^φ(x)) : f ∈ B_R } with such a radius R that B_R contains f_z, i.e., R is a bound for ‖f_z‖_Σ. On the other hand, a smaller radius R yields better estimates. Hence good bounds for ‖f_z‖_Σ play an important role in the sample error estimates.

A rough bound for ‖f_z‖_Σ immediately follows from the definition of f_z. By choosing f = 0, we find λ‖f_z‖²_Σ ≤ E_z^φ(f_z) + λ‖f_z‖²_Σ ≤ E_z^φ(0) + λ·0 = φ(0). This proves

Lemma 1. For every λ > 0, there holds ‖f_z‖_Σ ≤ √(φ(0)/λ).

We may use the bound √(φ(0)/λ) as R in F and apply Proposition 6 to get some rough estimates for E^φ(π(f_z)) − E_z^φ(π(f_z)). However, the empirical error E_z^φ(f) is a good approximation of the generalization error E^φ(f). Hence the penalty value ‖f_z‖_Σ is expected to be close to ‖f_λ‖_Σ, which is bounded by √(D(λ)/λ):

λ‖f_λ‖²_Σ ≤ E^φ(f_λ) − E^φ(f_ρ^φ) + λ‖f_λ‖²_Σ = D(λ). (4.19)

This expectation will be realized by an iteration technique used in [30] and [38]. By this technique, we shall show under some assumptions that with high confidence ‖f_z‖_Σ has a bound arbitrarily close to √(D(λ)/λ) (in the order of λ).
We are in a position to estimate the sample error and prove Theorem 5.

Proof of Theorem 5. Write the sample error as

S_{z,λ} = [ (E^φ(π(f_z)) − E^φ(f_ρ^φ)) − (E_z^φ(π(f_z)) − E_z^φ(f_ρ^φ)) ] + [ (E_z^φ(f_λ) − E_z^φ(f_ρ^φ)) − (E^φ(f_λ) − E^φ(f_ρ^φ)) ] := S_1 + S_2.

We divide our estimation into three steps. Take t ≥ 1, which will be determined later. Denote B = max{φ(−1), φ(1)}.

Step 1: estimate S_2. Consider the random variable ξ = φ(yf_λ(x)) − φ(yf_ρ^φ(x)) on Z. Denote

ξ = ξ_1 + ξ_2 = [ φ(yf_λ(x)) − φ(yπ(f_λ)(x)) ] + [ φ(yπ(f_λ)(x)) − φ(yf_ρ^φ(x)) ].

First we bound ξ_1. By (1.2), (4.2) and (4.19), we have ‖f_λ‖_∞ ≤ κ‖f_λ‖_Σ ≤ κ√(D(λ)/λ). We may assume the last quantity to be greater than one, since otherwise ξ_1 ≡ 0. Then the increment condition on φ tells us

0 ≤ ξ_1 ≤ B_λ := c_q κ^q (D(λ)/λ)^{q/2}.

Hence |ξ_1 − IE(ξ_1)| ≤ B_λ. Applying the one-side Bernstein inequality to ξ_1, we know that for any ε > 0,

Prob{ (1/m) Σ_{i=1}^m ξ_1(z_i) − IEξ_1 > ε } ≤ exp{ − mε² / (2(σ²(ξ_1) + B_λ ε/3)) }.

Solving the quadratic equation mε²/(2(σ²(ξ_1) + B_λ ε/3)) = t for ε, we see that there exists a subset U_1 of Z^m with measure at least 1 − e^{−t} such that for every z ∈ U_1,

(1/m) Σ_{i=1}^m ξ_1(z_i) − IEξ_1 ≤ B_λ t/(3m) + √( (B_λ t/(3m))² + 2σ²(ξ_1) t/m ) ≤ 2B_λ t/(3m) + √( 2t σ²(ξ_1)/m ).

But the fact 0 ≤ ξ_1 ≤ B_λ implies σ²(ξ_1) ≤ B_λ IE(ξ_1). Therefore, we have

(1/m) Σ_{i=1}^m ξ_1(z_i) − IEξ_1 ≤ 7B_λ t/(6m) + IEξ_1, ∀z ∈ U_1.
Next we consider ξ_2. Since both yπ(f_λ)(x) and yf_ρ^φ(x) lie in [−1, 1], ξ_2 is a random variable satisfying |ξ_2| ≤ B. Applying the one-side Bernstein inequality as above, we know that there exists another subset U_2 of Z^m with measure at least 1 − e^{−t} such that for every z ∈ U_2,

(1/m) Σ_{i=1}^m ξ_2(z_i) − IEξ_2 ≤ 2Bt/(3m) + √( 2t σ²(ξ_2)/m ).

By (4.5), we have σ²(ξ_2) ≤ C_τ (IEξ_2)^τ. Applying the elementary inequality

1/q̂ + 1/q̂′ = 1 with q̂, q̂′ > 1 ⟹ a·b ≤ (1/q̂) a^{q̂} + (1/q̂′) b^{q̂′}, a, b ≥ 0,

with q̂ = 2/(2−τ), q̂′ = 2/τ, a = √(2tC_τ/m) and b = (IEξ_2)^{τ/2}, we see that

√( 2t σ²(ξ_2)/m ) ≤ √(2tC_τ/m) (IEξ_2)^{τ/2} ≤ (1 − τ/2)(2tC_τ/m)^{1/(2−τ)} + (τ/2) IEξ_2.

Hence

(1/m) Σ_{i=1}^m ξ_2(z_i) − IEξ_2 ≤ 2Bt/(3m) + (2tC_τ/m)^{1/(2−τ)} + IEξ_2, ∀z ∈ U_2.

Combine the above estimates for ξ_1 and ξ_2 with the fact IEξ_1 + IEξ_2 = IEξ ≤ D(λ) ≤ c_β λ^β. We conclude that

S_2 ≤ (7B_λ + 4B) t/(6m) + (2tC_τ/m)^{1/(2−τ)} + D(λ), ∀z ∈ U_1 ∩ U_2. (4.20)

Step 2: estimate S_1. By Proposition 5, one has

Δ_z := E^φ(π(f_z)) − E^φ(f_ρ^φ) + λ‖f_z‖²_Σ ≤ S_1 + S_2 + D(λ). (4.21)

Let R > 0. Apply Proposition 6 to the function set

F = { φ(yπ(f)(x)) − φ(yf_ρ^φ(x)) : f ∈ B_R }.

Since

|φ(yπ(f)(x)) − φ(yπ(g)(x))| ≤ |φ′(−1)| |π(f)(x) − π(g)(x)| ≤ |φ′(−1)| |f(x) − g(x)|,

there holds N_{2,z}(F, ε) ≤ N_{2,z}(B_R, ε/|φ′(−1)|). Hence (4.6) yields (4.14) with a = c_p |φ′(−1)|^p R^p.
Since |φ(yπ(f)(x))| ≤ B and |φ(yf_ρ^φ(x))| ≤ B, we know that ‖f‖_∞ ≤ B for every f ∈ F. The assumption (4.5) tells us that IEf² ≤ c(IEf)^τ with c = C_τ. Thus all the conditions in Proposition 6 hold, and we know that there is a subset V(R) of Z^m with measure at least 1 − e^{−t} such that for every z ∈ V(R) and every f ∈ B_R,

( E^φ(π(f)) − E^φ(f_ρ^φ) ) − ( E_z^φ(π(f)) − E_z^φ(f_ρ^φ) ) ≤ (1/2) η_R^{1−τ} ( E^φ(π(f)) − E^φ(f_ρ^φ) )^τ + c′_p η_R + 2 (C_τ t/m)^{1/(2−τ)} + 18Bt/m, (4.22)

where η_R = η is given in Proposition 6 with c = C_τ and a = c_p |φ′(−1)|^p R^p, i.e.,

η_R = max{ C_τ^{(2−p)/(4−2τ+pτ)} ( c_p |φ′(−1)|^p R^p / m )^{2/(4−2τ+pτ)}, B^{(2−p)/(2+p)} ( c_p |φ′(−1)|^p R^p / m )^{2/(2+p)} }.

Let W(R) be the subset of Z^m defined by

W(R) = { z ∈ U_1 ∩ U_2 : f_z ∈ B_R }.

Let z ∈ W(R) ∩ V(R). Then (4.22) holds for f_z. Together with the estimate (4.20) for S_2 and (4.21), we know that

Δ_z ≤ (1/2) η_R^{1−τ} ( E^φ(π(f_z)) − E^φ(f_ρ^φ) )^τ + c′_p η_R + 4 (C_τ t/m)^{1/(2−τ)} + 19Bt/m + 3B_λ t/(2m) + 2D(λ).

When τ = 1 this yields

Δ_z ≤ c̃_p η_R + 8 (C_τ t/m)^{1/(2−τ)} + (38Bt + 3B_λ t)/m + 4D(λ), (4.23)

where c̃_p = max{2c′_p, 1}. Here we have bounded 2c′_p by c̃_p. When 0 < τ < 1, we use the elementary inequality: if a, b > 0 and 0 < τ < 1, then

x ≤ a x^τ + b, x > 0 ⟹ x ≤ max{ (2a)^{1/(1−τ)}, 2b }.

We find that (4.23) still holds. By the choice of λ = λ(m) = (1/m)^γ, one easily checks that

η_R ≤ c_{p,τ} λ^β max{ (R² λ^{1−β})^{p/(4−2τ+pτ)}, (R² λ^{1−β})^{p/(2+p)} }
for some c_{p,τ} > 0. But 4 − 2τ + pτ ≥ 2 + p, hence if R > λ^{(β−1)/2}, then

η_R ≤ c_{p,τ} λ^β (R² λ^{1−β})^{p/(2+p)} = c_{p,τ} λ^{(p+2β)/(2+p)} R^{2p/(2+p)}. (4.24)

The choice of λ together with the assumption D(λ) ≤ c_β λ^β on the regularization error and t ≥ 1 also implies

8 (C_τ t/m)^{1/(2−τ)} + (38Bt + 3B_λ t)/m + 4D(λ) ≤ c_{q,τ,β} t λ^β (4.25)

for some c_{q,τ,β} > 0. Putting the estimates (4.25) and (4.24) into (4.23), we obtain

Δ_z ≤ c̃_p c_{p,τ} λ^{(p+2β)/(2+p)} R^{2p/(2+p)} + c_{q,τ,β} t λ^β, ∀z ∈ W(R) ∩ V(R), (4.26)

whenever R > λ^{(β−1)/2}. This implies that ‖f_z‖_Σ ≤ √(Δ_z/λ) ≤ g(R), where g : IR₊ → IR₊ is a univariate function defined as

g(R) = √(c̃_p c_{p,τ}) λ^{(β−1)/(2+p)} R^{p/(2+p)} + √(c_{q,τ,β} t) λ^{(β−1)/2}. (4.27)

It follows that

W(R) ∩ V(R) ⊂ W(g(R)), ∀R > λ^{(β−1)/2}. (4.28)

Step 3: by iteration, find a small ball B_R that, with high confidence, contains f_z. Lemma 1 means that W(R₀) = U_1 ∩ U_2 for R₀ = √(φ(0)/λ). When R₀ > λ^{(β−1)/2}, we use our conclusion (4.28) iteratively. Denote g^[0](R) = R, g^[1](R) = g(R), and g^[l](R) = g(g^[l−1](R)) for l ≥ 2. According to (4.28), if

g^[j](R) > λ^{(β−1)/2}, j = 0, 1, …, l−1, (4.29)

then

W(R) ∩ V(R) ∩ V(g^[1](R)) ∩ ⋯ ∩ V(g^[l−1](R)) ⊂ W(g^[l](R)). (4.30)

Observe that g(R) = d₀ R^{p/(2+p)} + d₁ with d₀, d₁ > 0 given in (4.27). Then

g^[2](R) = d₀ ( d₀ R^{p/(2+p)} + d₁ )^{p/(2+p)} + d₁ ≤ d₀^{1 + p/(2+p)} R^{(p/(2+p))²} + d₁ + d₀ d₁^{p/(2+p)},
and in general, for l ∈ IN,

g^[l](R) ≤ d₀^{1 + p/(2+p) + ⋯ + (p/(2+p))^{l−1}} R^{(p/(2+p))^l} + d₀^{1 + p/(2+p) + ⋯ + (p/(2+p))^{l−2}} d₁^{(p/(2+p))^{l−1}} + ⋯ + d₀ d₁^{p/(2+p)} + d₁.

This in connection with the expressions for d₀ and d₁ gives

g^[l](R) ≤ c₀^{(2+p)/4} λ^{(β−1)(1 − (p/(2+p))^l)/2} R^{(p/(2+p))^l} + Σ_{i=0}^{l−1} c₀^{(2+p)/4} (c₁ t)^{(p/(2+p))^i} λ^{(β−1)/2},

where c₀ = max{1, c̃_p c_{p,τ}} and c₁ = max{1, c_{q,τ,β}}. In particular, for R = R₀, there holds

g^[l](R₀) ≤ c₀^{(2+p)/4} λ^{(β−1)/2} { (φ(0))^{(1/2)(p/(2+p))^l} λ^{−(β/2)(p/(2+p))^l} + c₁ t l }.

For ε > 0, choose l₀ ∈ IN such that l₀ ≥ log(1/(2ε)) / log((2+p)/p). Then (1/2)(p/(2+p))^{l₀} ≤ ε. It follows that

g^[l₀](R₀) ≤ c₀^{(2+p)/4} λ^{(β−1)/2} { (φ(0))^{(1/2)(p/(2+p))^{l₀}} λ^{−βε} + c₁ t l₀ }

when (4.29) with l = l₀ and R = R₀ holds.

When (4.29) with l = l₀ and R = R₀ is not valid, we have g^[j₀](R₀) ≤ λ^{(β−1)/2} for some j₀ ∈ {0, 1, …, l₀ − 1}. Take l_ε = l₀ when (4.29) with l = l₀ and R = R₀ holds, and l_ε = j₀ otherwise. In both cases, we have

g^[l_ε](R₀) ≤ c_ε λ^{(β−1)/2 − βε} =: R_ε, (4.31)

where c_ε := c₀^{(2+p)/4} ( (φ(0))^{(1/2)(p/(2+p))^{l₀}} + c₁ t l₀ ).

Take l = l_ε ≤ l₀ and R = R₀ in (4.30). Since W(R₀) = U_1 ∩ U_2, we know that there is a subset V_ε of Z^m with measure at most l₀ e^{−t} such that U_1 ∩ U_2 ⊂ W(R_ε) ∪ V_ε. Then the measure of the set W(R_ε) is at least 1 − (l₀ + 2) e^{−t}.
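The iteration of Step 3 is easy to visualize: g has the form g(R) = d₀ R^{p/(2+p)} + d₁, so iterating from a rough radius R₀ contracts rapidly toward a fixed-point scale, which is why a fixed number l₀ of steps suffices. A numerical sketch (our own, with hypothetical values for d₀, d₁, R₀):

```python
# Hypothetical stand-ins for the quantities of (4.27) and Lemma 1.
p = 1.0
d0, d1, R0 = 2.0, 0.5, 1000.0

def g(R):
    # one iteration step: g(R) = d0 * R^(p/(2+p)) + d1
    return d0 * R ** (p / (2 + p)) + d1

iterates = [R0]
for _ in range(6):
    iterates.append(g(iterates[-1]))
```

After a handful of steps the iterates are essentially stationary near the solution of R = d₀ R^{p/(2+p)} + d₁.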
Apply (4.23) with R = R_ε and notice (4.25). Let z ∈ W(R_ε) ∩ V(R_ε). We know that

Δ_z ≤ c̃_p η_{R_ε} + c_{q,τ,β} t λ^β.

It is easy to check that η_{R_ε} ≤ c_{p,τ} c_ε (1/m)^θ. Therefore, with the constant c̃ = c̃_p c_{p,τ} c_ε + c_{q,τ,β} t, there holds

E^φ(π(f_z)) − E^φ(f_ρ^φ) ≤ Δ_z ≤ c̃ (1/m)^θ.

Taking t = log((l₀ + 3)/δ), the measure of the set W(R_ε) ∩ V(R_ε) is at least 1 − δ. Then Theorem 5 is proved.

6. Extensions

A key point of our analysis is to find essential bounds for the penalty functional values of regularization schemes. This approach can be extended to regularization schemes with more general loss functions and general penalty functionals.

Let the hypothesis space H be a function set containing 0. It is assigned a functional Ω : H → IR₊ satisfying Ω(0) = 0. Beyond the multi-kernel space H_Σ, such a hypothesis space arises in the linear programming support vector machine classifier [38] in a one-kernel setting, with the penalty functional Ω(f) defined for f ∈ H = H_{K,z} = { Σ_{i=1}^m α_i y_i K_{x_i} : α_i ≥ 0 } as Ω(f) = Σ_{i=1}^m α_i.

Let Y be a subset of IR, and V : IR² → IR₊ be a general loss function. The general regularization scheme in H associated with V and the penalty functional Ω is defined for the sample z as

f_z^V = arg min_{f ∈ H} { (1/m) Σ_{i=1}^m V(y_i, f(x_i)) + λ Ω(f) }. (5.1)

All the results we obtained for the multi-kernel regularized classifiers (1.6) can be established for the more general scheme (5.1) under the assumption that the pair (V, ρ) is M-admissible: there is a constant M > 0 such that |y| ≤ M almost surely with respect to ρ, and for each y ∈ [−M, M], V(y, t) is a convex function of the variable t ∈ IR satisfying

V(y, t) ≥ V(y, M) for t > M, and V(y, t) ≥ V(y, −M) for t < −M. (5.2)
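The M-admissibility condition (5.2) is straightforward to check numerically for a given loss. A sketch (our own, not from the text) for the ε-insensitive regression loss V(y, t) = max{|y − t| − ε, 0} with M = 1:

```python
# Check (5.2) on a grid: for |y| <= M,
#   V(y, t) >= V(y, M)  for t > M,  and  V(y, t) >= V(y, -M)  for t < -M.
# Here V(y, t) = psi(y - t) with psi(u) = max(|u| - eps, 0), which is even,
# convex, increasing on [0, inf) and psi(0) = 0.
def V(y, t, eps=0.1):
    return max(abs(y - t) - eps, 0.0)

M = 1.0
ok = all(
    V(y, t) >= V(y, M) and V(y, -t) >= V(y, -M)
    for y in (-1.0, -0.3, 0.0, 0.7, 1.0)   # sample of [-M, M]
    for t in (1.01, 1.5, 2.0, 5.0)         # t > M; by symmetry -t < -M
)
```

Every regression loss of the form ψ(y − t) with ψ as above passes this check, consistent with the M-admissibility claim below.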
An important family of regularization schemes (5.1) are those for regression with a general loss function: take Y = IR and V(y, f(x)) = ψ(y − f(x)), where ψ : IR → IR₊ is even, convex, and increasing on [0, +∞) with ψ(0) = 0. If |y| ≤ M almost surely with respect to ρ, then (V, ρ) is M-admissible. Our approach can be used to analyze the convergence of ∫_Z V(y, f_z^V(x)) dρ to inf_{f ∈ H} ∫_Z V(y, f(x)) dρ.

Example 4. Let ε > 0. The ε-insensitive norm used for regression is the univariate loss function ψ defined [35] as ψ(t) = max{|t| − ε, 0}. It would be interesting to analyze the convergence of the scheme (5.1) as ε tends to zero.

For the classification algorithm (1.6), some of our error bounds can be extended to nonclassifying loss functions (such as the exponential loss), i.e., those activating loss functions whose infimum cannot be achieved. For this purpose, we need a more general projection operator.

Definition 8. For M > 0, the projection operator π_M at level M is defined on the space of measurable functions f : X → IR as

π_M(f)(x) = M if f(x) > M; π_M(f)(x) = −M if f(x) < −M; and π_M(f)(x) = f(x) if −M ≤ f(x) ≤ M.

Using this projection operator, we can obtain similar error decompositions by revising the regularization error and introducing a level M adapted to the behavior of the loss function (the convergence rate of φ(t) as t → ∞). Then some learning rates can be obtained, following our approach.

Acknowledgement. While the paper was being revised as requested, we learned that a kernel-searching method leading to the regularization scheme (1.6) was studied recently in [22]. The learnability of multi-kernel spaces associated with Gaussian kernels with flexible variances, i.e., Σ = (0, +∞) in Example 3, was also verified recently in [39]. We thank the referees for their careful reading and constructive suggestions, which helped us improve the paper.

References
[1] N. Aronszajn, Theory of reproducing kernels, Trans. Amer. Math. Soc. 68 (1950).

[2] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, Convexity, classification, and risk bounds, preprint.

[3] G. Blanchard, O. Bousquet and P. Massart, Statistical performance of support vector machines, preprint.

[4] G. Blanchard, G. Lugosi and N. Vayatis, On the rate of convergence of regularized boosting classifiers, J. Mach. Learning Res. 4 (2003).

[5] B. E. Boser, I. Guyon, and V. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the Fifth Annual Workshop on Computational Learning Theory 5 (1992), Pittsburgh, ACM.

[6] O. Bousquet and A. Elisseeff, Stability and generalization, J. Mach. Learning Res. 2 (2002).

[7] L. Breiman, Arcing classifiers (discussion paper), Ann. Stat. 26 (1998).

[8] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, Choosing multiple parameters for support vector machines, Mach. Learning 46 (2002).

[9] D. R. Chen, Q. Wu, Y. Ying and D. X. Zhou, Support vector machine soft margin classifiers: error analysis, J. Mach. Learning Res. 5 (2004).

[10] C. Cortes and V. Vapnik, Support-vector networks, Mach. Learning 20 (1995).

[11] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press.

[12] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. 39 (2001).

[13] F. Cucker and S. Smale, Best choices for regularization parameters in learning theory: On the bias-variance problem, Found. Comput. Math. 2 (2002).

[14] F. Cucker and D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, monograph manuscript in preparation for Cambridge University Press.
More information13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices
CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay
More informationOn Constant Power Water-filling
On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives
More informationTail estimates for norms of sums of log-concave random vectors
Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics
More informationLower Bounds for Quantized Matrix Completion
Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &
More informationOPTIMIZATION in multi-agent networks has attracted
Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract
More informationThe Weierstrass Approximation Theorem
36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined
More informationOn the Use of A Priori Information for Sparse Signal Approximations
ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique
More informationNon-Parametric Non-Line-of-Sight Identification 1
Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,
More informationBlock designs and statistics
Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent
More informationKeywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution
Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality
More informationLecture 21. Interior Point Methods Setup and Algorithm
Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and
More informationJournal of Mathematical Analysis and Applications
J Math Anal Appl 386 202 205 22 Contents lists available at ScienceDirect Journal of Matheatical Analysis and Applications wwwelsevierco/locate/jaa Sei-Supervised Learning with the help of Parzen Windows
More informationGrafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space
Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing
More informationLecture 9: Multi Kernel SVM
Lecture 9: Multi Kernel SVM Stéphane Canu stephane.canu@litislab.eu Sao Paulo 204 April 6, 204 Roadap Tuning the kernel: MKL The ultiple kernel proble Sparse kernel achines for regression: SVR SipleMKL:
More informationComputable Shell Decomposition Bounds
Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung
More informationTail Estimation of the Spectral Density under Fixed-Domain Asymptotics
Tail Estiation of the Spectral Density under Fixed-Doain Asyptotics Wei-Ying Wu, Chae Young Li and Yiin Xiao Wei-Ying Wu, Departent of Statistics & Probability Michigan State University, East Lansing,
More informationQuantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search
Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths
More informationarxiv: v3 [cs.lg] 7 Jan 2016
Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co
More informationA Bernstein-Markov Theorem for Normed Spaces
A Bernstein-Markov Theore for Nored Spaces Lawrence A. Harris Departent of Matheatics, University of Kentucky Lexington, Kentucky 40506-0027 Abstract Let X and Y be real nored linear spaces and let φ :
More informationarxiv: v1 [cs.lg] 8 Jan 2019
Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v
More informationMax-Product Shepard Approximation Operators
Max-Product Shepard Approxiation Operators Barnabás Bede 1, Hajie Nobuhara 2, János Fodor 3, Kaoru Hirota 2 1 Departent of Mechanical and Syste Engineering, Bánki Donát Faculty of Mechanical Engineering,
More informationL p moments of random vectors via majorizing measures
L p oents of rando vectors via ajorizing easures Olivier Guédon, Mark Rudelson Abstract For a rando vector X in R n, we obtain bounds on the size of a saple, for which the epirical p-th oents of linear
More informationResearch Article Robust ε-support Vector Regression
Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,
More information1 Proof of learning bounds
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a
More informationComputable Shell Decomposition Bounds
Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago
More informationA MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION
A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University
More informationPattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition
Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition
More informationLearnability and Stability in the General Learning Setting
Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu
More informationarxiv: v1 [cs.ds] 3 Feb 2014
arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/
More informationTight Information-Theoretic Lower Bounds for Welfare Maximization in Combinatorial Auctions
Tight Inforation-Theoretic Lower Bounds for Welfare Maxiization in Cobinatorial Auctions Vahab Mirrokni Jan Vondrák Theory Group, Microsoft Dept of Matheatics Research Princeton University Redond, WA 9805
More informationNyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,
More informationA Theoretical Analysis of a Warm Start Technique
A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful
More informationPolygonal Designs: Existence and Construction
Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G
More informationGeneralized eigenfunctions and a Borel Theorem on the Sierpinski Gasket.
Generalized eigenfunctions and a Borel Theore on the Sierpinski Gasket. Kasso A. Okoudjou, Luke G. Rogers, and Robert S. Strichartz May 26, 2006 1 Introduction There is a well developed theory (see [5,
More informationVC Dimension and Sauer s Lemma
CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions
More informationBipartite subgraphs and the smallest eigenvalue
Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.
More informationProbability Distributions
Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples
More informationConvex Programming for Scheduling Unrelated Parallel Machines
Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly
More informationA Theoretical Framework for Deep Transfer Learning
A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il
More information3.3 Variational Characterization of Singular Values
3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and
More informationFairness via priority scheduling
Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation
More informationConstrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008
LIDS Report 2779 1 Constrained Consensus and Optiization in Multi-Agent Networks arxiv:0802.3922v2 [ath.oc] 17 Dec 2008 Angelia Nedić, Asuan Ozdaglar, and Pablo A. Parrilo February 15, 2013 Abstract We
More informationTesting Properties of Collections of Distributions
Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the
More informationSupport recovery in compressed sensing: An estimation theoretic approach
Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de
More information