Manifold learning via Multi-Penalty Regularization
Manifold Learning via Multi-Penalty Regularization

Abhishake Rastogi
Department of Mathematics, Indian Institute of Technology Delhi, New Delhi 110016, India

Abstract. Manifold regularization is an approach which exploits the geometry of the marginal distribution. The main goal of this paper is to analyze the convergence issues of such regularization algorithms in learning theory. We propose a more general multi-penalty framework and establish optimal convergence rates under a general smoothness assumption. We present a theoretical analysis of the performance of multi-penalty regularization over a reproducing kernel Hilbert space, and discuss error estimates of the regularization schemes under certain prior assumptions on the joint probability measure on the sample space. We analyze the convergence rates of the learning algorithms measured in the norm of the reproducing kernel Hilbert space and in the norm of the Hilbert space of square-integrable functions. Convergence is established in a probabilistic sense via exponential tail inequalities. In order to optimize the regularization functional, a crucial issue is to select the regularization parameters so as to ensure good performance of the solution. We propose a new parameter choice rule, the penalty balancing principle, based on augmented Tikhonov regularization. The superiority of multi-penalty regularization over single-penalty regularization is demonstrated on an academic example and the two moon data set.

Keywords: Learning theory, Multi-penalty regularization, General source condition, Optimal rates, Penalty balancing principle.

Mathematics Subject Classification 2010: 68T05, 68Q32.

1 Introduction

Let X be a compact metric space and Y ⊂ R, with a joint probability measure ρ on Z = X × Y. Suppose z = {(x_i, y_i)}_{i=1}^m ⊂ Z is an observation set drawn from the unknown probability measure ρ. The learning problem [1, 2, 3, 4] aims to approximate a function f_z based on z such that f_z(x) ≈ y.
The goodness of the estimator can be measured by the generalization error of a function f, defined as

  E(f) := E_ρ(f) = ∫_Z V(f(x), y) dρ(x, y),    (1)

DOI: 10.5121/ijaia
where V : Y × Y → R is the loss function. The minimizer of E(f) for the square loss V(f(x), y) = (f(x) − y)^2 is given by

  f_ρ(x) := ∫_Y y dρ(y|x),    (2)

where f_ρ is called the regression function. Its optimality is clear from the following identity:

  E(f) = ∫_X (f(x) − f_ρ(x))^2 dρ_X(x) + σ_ρ^2,

where σ_ρ^2 = ∫_X ∫_Y (y − f_ρ(x))^2 dρ(y|x) dρ_X(x). The regression function f_ρ belongs to the space of square-integrable functions provided that

  ∫_Z y^2 dρ(x, y) < ∞.    (3)

Therefore our objective becomes to estimate the regression function f_ρ.

Single-penalty regularization is widely used to infer an estimator from a given set of random samples [5, 6, 7, 8, 9, 10]. Smale et al. [9, 11, 12] provided the foundations of the theoretical analysis of the square-loss regularization scheme under Hölder's source condition. Caponnetto et al. [6] improved the error estimates to optimal convergence rates for the regularized least-squares algorithm using a polynomial decay condition on the eigenvalues of the integral operator. Sometimes, however, one may need to add further penalties in order to incorporate more features in the regularized solution. Multi-penalty regularization has been studied by various authors for both inverse problems and learning algorithms [13, 14, 15, 16, 17, 18, 19, 20]. Belkin et al. [13] discussed manifold regularization, which controls the complexity of the function in the ambient space as well as the geometry of the probability space:

  f* = argmin_{f ∈ H_K} { (1/m) Σ_{i=1}^m (f(x_i) − y_i)^2 + λ_A ||f||_{H_K}^2 + λ_I Σ_{i,j=1}^n (f(x_i) − f(x_j))^2 ω_ij },    (4)

where {(x_i, y_i) ∈ X × Y : 1 ≤ i ≤ m} ∪ {x_i ∈ X : m < i ≤ n} is the given set of labeled and unlabeled data, λ_A and λ_I are non-negative regularization parameters, the ω_ij are non-negative weights, H_K is a reproducing kernel Hilbert space and ||·||_{H_K} is its norm. The manifold regularization algorithm has further been developed and widely studied in the vector-valued framework to analyze multi-task learning problems [21, 22, 23, 24] (see also references therein). This motivates us to analyze the problem theoretically.
The convergence of the multi-penalty regularizer is discussed under a general source condition in [25], but the convergence rates obtained there are not optimal. Here we achieve the optimal minimax convergence rates using the polynomial decay condition on the eigenvalues of the integral operator. In order to optimize the regularization functional, a crucial problem is the parameter choice strategy. Various a priori and a posteriori parameter choice rules have been proposed for single-penalty regularization [26, 27, 28, 29, 30] (see also references therein). Many parameter selection approaches have been discussed for multi-penalized ill-posed inverse problems, such as the discrepancy principle [15, 31], the quasi-optimality principle [18, 32], the balanced-discrepancy principle [33], the heuristic L-curve [34], noise-structure-based parameter choice rules [35, 36, 37], and approaches which require reduction to single-penalty regularization [38]. Due to the growing interest in multi-penalty
International Journal of Artificial Intelligence and Applications (IJAIA), Vol. 8, No. 5, September 2017

regularization in learning, multi-parameter choice rules have been discussed in the learning theory framework, such as the discrepancy principle [15, 16], the balanced-discrepancy principle [25], and a parameter choice strategy based on the generalized cross-validation score [19]. Here we discuss the penalty balancing principle (PB-principle), considered for multi-penalty regularization in ill-posed problems [33], to choose the regularization parameters in our learning theory framework.

1.1 Mathematical Preliminaries and Notations

Definition 1.1. Let X be a non-empty set and H a Hilbert space of real-valued functions on X. If the pointwise evaluation map F_x : H → R, defined by F_x(f) = f(x) for all f ∈ H, is continuous for every x ∈ X, then H is called a reproducing kernel Hilbert space.

For each reproducing kernel Hilbert space H there exists a Mercer kernel K : X × X → R such that, for K_x : X → R defined as K_x(y) = K(x, y), the span of the set {K_x : x ∈ X} is dense in H. Moreover, there is a one-to-one correspondence between Mercer kernels and reproducing kernel Hilbert spaces [39]. So we denote by H_K the reproducing kernel Hilbert space corresponding to a Mercer kernel K, and by ||·||_K its norm.

Definition 1.2. The sampling operator S_x : H_K → R^m associated with a discrete subset x = {x_i}_{i=1}^m is defined by S_x(f) = (f(x_i))_{i=1}^m.

Denote by S_x* : R^m → H_K the adjoint of S_x, where R^m carries the weighted inner product ⟨c, d⟩ = (1/m) Σ_{i=1}^m c_i d_i. Then for c ∈ R^m,

  ⟨f, S_x* c⟩_K = ⟨S_x f, c⟩ = (1/m) Σ_{i=1}^m c_i f(x_i) = ⟨f, (1/m) Σ_{i=1}^m c_i K_{x_i}⟩_K,  f ∈ H_K,

so the adjoint is given by

  S_x* c = (1/m) Σ_{i=1}^m c_i K_{x_i},  c = (c_1, …, c_m) ∈ R^m.

From the following estimate we observe that S_x is a bounded operator:

  ||S_x f||^2 = (1/m) Σ_{i=1}^m ⟨f, K_{x_i}⟩_K^2 ≤ ||f||_K^2 (1/m) Σ_{i=1}^m ||K_{x_i}||_K^2 ≤ κ^2 ||f||_K^2,

which implies ||S_x|| ≤ κ, where κ := sup_{x ∈ X} √K(x, x).

For each (x_i, y_i) ∈ Z, y_i = f_ρ(x_i) + η_{x_i}, where the probability distribution of η_{x_i} has mean 0 and variance σ_{x_i}^2. Denote σ^2 := (1/m) Σ_{i=1}^m σ_{x_i}^2 < ∞.

1.2 Learning Scheme
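As a concrete illustration of the sampling operator and its adjoint defined above, the pairing identity ⟨S_x f, c⟩ = ⟨f, S_x* c⟩_K can be checked numerically for a function in the span of kernel sections. The following is a minimal sketch under illustrative assumptions: a Gaussian kernel, random points on [0, 1], and the (1/m)-weighted inner product on R^m; none of these choices come from the paper's experiments.

```python
import numpy as np

def K(x, y, gamma=1.0):
    """Gaussian Mercer kernel matrix K(x_i, y_j) (illustrative choice)."""
    return np.exp(-gamma * (np.asarray(x)[:, None] - np.asarray(y)[None, :]) ** 2)

rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 5)        # centers: f = sum_j a_j K_{t_j} lies in H_K
a = rng.normal(size=5)
x = rng.uniform(0, 1, 8)        # sampling points, m = 8
c = rng.normal(size=8)
m = len(x)

Sx_f = K(x, t) @ a              # S_x f = (f(x_1), ..., f(x_m))
lhs = (1 / m) * c @ Sx_f        # <S_x f, c> with the (1/m)-weighted product
# S_x* c = (1/m) sum_i c_i K_{x_i}; by the reproducing property
# <K_{t_j}, K_{x_i}>_K = K(t_j, x_i), so <f, S_x* c>_K reduces to:
rhs = (1 / m) * a @ (K(t, x) @ c)
print(abs(lhs - rhs))           # the two pairings agree
```

The agreement of `lhs` and `rhs` is exactly the adjoint relation that makes S_x* c = (1/m) Σ_i c_i K_{x_i} the correct expression.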
The optimization functional (4) can be expressed as

  f* = argmin_{f ∈ H_K} { ||S_x f − y||^2 + λ_A ||f||_K^2 + λ_I ||(S_x* L S_x)^{1/2} f||_K^2 },    (5)
where x = {x_i ∈ X : 1 ≤ i ≤ n}, ||y||^2 = (1/m) Σ_{i=1}^m y_i^2, and L = D − W, with W = (ω_ij) a weight matrix with non-negative entries and D the diagonal matrix with D_ii = Σ_{j=1}^n ω_ij. Here we consider a more general regularized learning scheme based on two penalties:

  f_{z,λ} := argmin_{f ∈ H_K} { ||S_x f − y||^2 + λ1 ||f||_K^2 + λ2 ||Bf||_K^2 },    (6)

where B : H_K → H_K is a bounded operator and λ1, λ2 are non-negative parameters.

Theorem 1.1. If S_x* S_x + λ1 I + λ2 B*B is invertible, then the optimization functional (6) has the unique minimizer

  f_{z,λ} = S S_x* y,  where S := (S_x* S_x + λ1 I + λ2 B*B)^{−1}.

Proof. We know that for f ∈ H_K,

  ||S_x f − y||^2 + λ1 ||f||_K^2 + λ2 ||Bf||_K^2 = ⟨(S_x* S_x + λ1 I + λ2 B*B)f, f⟩_K − 2⟨S_x* y, f⟩_K + ||y||^2.

Taking the functional derivative with respect to f ∈ H_K, we see that any minimizer f_{z,λ} of (6) satisfies

  (S_x* S_x + λ1 I + λ2 B*B) f_{z,λ} = S_x* y.

This proves Theorem 1.1.

Define f_{x,λ} as the minimizer of the optimization problem

  f_{x,λ} := argmin_{f ∈ H_K} { (1/m) Σ_{i=1}^m (f(x_i) − f_ρ(x_i))^2 + λ1 ||f||_K^2 + λ2 ||Bf||_K^2 },    (7)

which gives

  f_{x,λ} = S S_x* S_x f_ρ.    (8)

The data-free version of the considered regularization scheme (6) is

  f_λ := argmin_{f ∈ H_K} { ||f − f_ρ||_ρ^2 + λ1 ||f||_K^2 + λ2 ||Bf||_K^2 },    (9)

where ||·||_ρ := ||·||_{L^2_{ρ_X}}. Then we get the expression

  f_λ = (L_K + λ1 I + λ2 B*B)^{−1} L_K f_ρ.    (10)

Similarly, the single-penalty data-free minimizer is

  f_{λ1} := argmin_{f ∈ H_K} { ||f − f_ρ||_ρ^2 + λ1 ||f||_K^2 },    (11)

which implies

  f_{λ1} = (L_K + λ1 I)^{−1} L_K f_ρ,    (12)

where the integral operator L_K : L^2_{ρ_X} → L^2_{ρ_X} is a self-adjoint, non-negative, compact operator, defined as

  L_K(f)(x) := ∫_X K(x, t) f(t) dρ_X(t),  x ∈ X.
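In a finite-dimensional surrogate, the normal equation of the minimizer above becomes a linear system for the coefficients of f = Σ_i α_i K(x_i, ·). The sketch below assumes B*B = S_x* L S_x for a graph Laplacian L (as in the manifold penalty), in which case the coefficients solve ((1/m) J K + λ1 I + λ2 L K) α = (1/m) J y, where J projects onto the labeled points; the Gaussian kernel, the data, and all parameter values are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def gauss(X, Y, gamma=1.0):
    """Illustrative Gaussian kernel matrix."""
    d2 = (np.asarray(X)[:, None] - np.asarray(Y)[None, :]) ** 2
    return np.exp(-gamma * d2)

def two_penalty_fit(x, y, Lap, lam1, lam2, gamma=1.0):
    """Coefficients alpha of f = sum_i alpha_i K(x_i, .) minimizing
    (1/m)||S_x f - y||^2 + lam1 ||f||_K^2 + lam2 <f, S_x* Lap S_x f>_K.
    The first m = len(y) points of x are labeled; the rest are unlabeled."""
    n, m = len(x), len(y)
    Kmat = gauss(x, x, gamma)
    J = np.zeros(n); J[:m] = 1.0                  # selects labeled outputs
    A = (1 / m) * J[:, None] * Kmat + lam1 * np.eye(n) + lam2 * Lap @ Kmat
    b = np.zeros(n); b[:m] = y / m                # (1/m) J y, zero-padded
    return np.linalg.solve(A, b)

# Sanity check: with lam2 = 0 the labeled coefficients reduce to kernel ridge
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 12)                          # 8 labeled + 4 unlabeled points
y = np.sin(2 * np.pi * x[:8]) + 0.05 * rng.normal(size=8)
Lap = np.zeros((12, 12))                           # trivial Laplacian for the check
alpha = two_penalty_fit(x, y, Lap, lam1=1e-3, lam2=0.0)
```

With λ2 = 0 the unlabeled coefficients vanish and the labeled block solves (K + mλ1 I)α = y, the standard kernel ridge system, which serves as a correctness check of the assembled normal equation; with λ2 > 0 and a genuine graph Laplacian the same routine is a sketch of the manifold-regularized solution (4).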
The integral operator L_K can also be defined as a self-adjoint operator on H_K; we use the same notation L_K for both operators. Using the singular value decomposition L_K = Σ_{i=1}^∞ t_i ⟨·, e_i⟩_K e_i for an orthonormal system {e_i} in H_K and a sequence of singular values κ^2 ≥ t_1 ≥ t_2 ≥ … ≥ 0, we define

  φ(L_K) = Σ_{i=1}^∞ φ(t_i) ⟨·, e_i⟩_K e_i,

where φ is a continuous increasing index function defined on the interval [0, κ^2] with φ(0) = 0. We require some prior assumptions on the probability measure ρ to achieve uniform convergence rates for the learning algorithms.

Assumption 1 (Source condition). Suppose

  f_ρ ∈ Ω_{φ,R} := { f ∈ H_K : f = φ(L_K) g and ||g||_K ≤ R }.

The condition f_ρ ∈ Ω_{φ,R} is usually referred to as a general source condition [40].

Assumption 2 (Polynomial decay condition). We assume that the eigenvalues t_n of the integral operator L_K follow a polynomial decay: for fixed positive constants α, β and b > 1,

  α n^{−b} ≤ t_n ≤ β n^{−b}  for all n ∈ N.

Following Bauer et al. [5] and Caponnetto et al. [6], we consider the class of probability measures P_φ which satisfy the source condition, and the class P_{φ,b} which satisfies both the source condition and the polynomial decay condition.

Under the polynomial decay condition, the effective dimension N(λ1) can be estimated from Proposition 3 of [6] as follows:

  N(λ1) := Tr((L_K + λ1 I)^{−1} L_K) ≤ (βb/(b − 1)) λ1^{−1/b},  for b > 1,    (13)

where Tr(A) := Σ_{k=1}^∞ ⟨A e_k, e_k⟩ for some orthonormal basis {e_k}.

Shuai Lu et al. [41] and Blanchard et al. [42] considered a logarithmic decay condition on the effective dimension N(λ1):

Assumption 3 (Logarithmic decay). Assume that there exists some positive constant c > 0 such that

  N(λ1) ≤ c log(1/λ1),  for all λ1 > 0.    (14)

2 Convergence Analysis

In this section, we discuss the convergence of the multi-penalty regularization scheme on the reproducing kernel Hilbert space under the smoothness priors considered in the learning theory framework.
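The effective-dimension bound N(λ1) ≲ λ1^{−1/b} above can be checked numerically: for a polynomially decaying spectrum t_i = i^{−b}, the quantity N(λ1) = Σ_i t_i/(t_i + λ1) grows at the predicted rate. A small sketch (the truncated spectrum and the values of b and λ1 are illustrative):

```python
import numpy as np

b = 2.0
t = np.arange(1, 200001, dtype=float) ** (-b)   # eigenvalues t_i = i^{-b}
for lam in [1e-2, 1e-3, 1e-4]:
    N = np.sum(t / (t + lam))                   # effective dimension N(lam)
    print(f"lam={lam:.0e}  N={N:9.2f}  N*lam^(1/b)={N * lam ** (1 / b):.3f}")
```

The last column stays roughly constant as λ1 decreases, consistent with N(λ1) ≤ C λ1^{−1/b}; for this particular spectrum the sum behaves like a multiple of λ1^{−1/2}.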
We address the convergence rates of the multi-penalty regularizer by estimating the sample error f_{z,λ} − f_λ and the approximation error f_λ − f_ρ in the interpolation norm. In Theorem 2.1, the upper convergence rates of the multi-penalty regularized solution f_{z,λ} are derived from the estimates of
Propositions 2.2 and 2.3 for the class of probability measures P_{φ,b}. We discuss the error estimates under the general source condition, with a parameter choice rule based on the index function φ and the sample size m. Under the polynomial decay condition, in Theorem 2.2 we obtain the optimal minimax convergence rates in terms of the index function φ, the parameter b and the number of samples m. In particular, under Hölder's source condition, we present the convergence rates under the logarithmic decay condition on the effective dimension in Corollary 2.2.

Proposition 2.1. Let z be i.i.d. samples drawn according to the probability measure ρ with the hypothesis |y_i| ≤ M for each (x_i, y_i) ∈ Z. Then for 0 ≤ s ≤ 1/2 and for every 0 < δ < 1, with probability 1 − δ,

  ||L_K^s(f_{z,λ} − f_{x,λ})||_K ≤ 2 λ1^{s−1/2} { Ξ (1 + √(2 log(2/δ))) + (4κM/(3m√λ1)) log(2/δ) },

where N_{x_i}(λ1) = Tr((L_K + λ1 I)^{−1} K_{x_i} ⊗ K_{x_i}) and Ξ = (1/m)(Σ_{i=1}^m σ_{x_i}^2 N_{x_i}(λ1))^{1/2}, for the variance σ_{x_i}^2 of the probability distribution of η_{x_i} = y_i − f_ρ(x_i).

Proof. The expression f_{z,λ} − f_{x,λ} can be written as S S_x*(y − S_x f_ρ). Then we find that

  ||L_K^s(f_{z,λ} − f_{x,λ})||_K ≤ I_1 ||L_K^s(L_K + λ1 I)^{−1/2}|| ||(L_K + λ1 I)^{1/2} S (L_K + λ1 I)^{1/2}|| ≤ I_1 I_2 ||L_K^s(L_K + λ1 I)^{−1/2}||,    (15)

where I_1 = ||(L_K + λ1 I)^{−1/2} S_x*(y − S_x f_ρ)||_K and I_2 = ||(L_K + λ1 I)^{1/2}(S_x* S_x + λ1 I)^{−1}(L_K + λ1 I)^{1/2}||; the second inequality uses the non-negativity of B*B.

For a sufficiently large sample size, the following inequality holds:

  √m ≥ (8κ^2/λ1) log(4/δ).    (16)

Then from Theorem 2 of [43] we have, with confidence 1 − δ/2,

  I_3 := ||(L_K + λ1 I)^{−1/2}(L_K − S_x* S_x)(L_K + λ1 I)^{−1/2}|| ≤ ||S_x* S_x − L_K||/λ1 ≤ (4κ^2/(√m λ1)) log(4/δ) ≤ 1/2.

Then the Neumann series gives

  I_2 = || Σ_{i=0}^∞ { (L_K + λ1 I)^{−1/2}(L_K − S_x* S_x)(L_K + λ1 I)^{−1/2} }^i || ≤ Σ_{i=0}^∞ I_3^i = 1/(1 − I_3) ≤ 2.    (17)

Now we have

  ||L_K^s(L_K + λ1 I)^{−1/2}|| ≤ sup_{0 < t ≤ κ^2} t^s/(t + λ1)^{1/2} ≤ λ1^{s−1/2}  for 0 ≤ s ≤ 1/2.    (18)

To estimate the bound for ||(L_K + λ1 I)^{−1/2} S_x*(y − S_x f_ρ)||_K using McDiarmid's inequality (Lemma 2 of [2]), define the function F : R^m → R as

  F(y) = ||(L_K + λ1 I)^{−1/2} S_x*(y − S_x f_ρ)||_K.
So

  F^2(y) = ||(1/m) Σ_{i=1}^m (y_i − f_ρ(x_i)) (L_K + λ1 I)^{−1/2} K_{x_i}||_K^2
         = (1/m^2) Σ_{i,j=1}^m (y_i − f_ρ(x_i))(y_j − f_ρ(x_j)) ⟨(L_K + λ1 I)^{−1} K_{x_i}, K_{x_j}⟩_K.

The independence of the samples, together with E_y(y_i − f_ρ(x_i)) = 0 and E_y(y_i − f_ρ(x_i))^2 = σ_{x_i}^2, implies

  E_y(F^2) = (1/m^2) Σ_{i=1}^m σ_{x_i}^2 N_{x_i}(λ1) = Ξ^2,

where N_{x_i}(λ1) = Tr((L_K + λ1 I)^{−1} K_{x_i} ⊗ K_{x_i}) and Ξ = (1/m)(Σ_{i=1}^m σ_{x_i}^2 N_{x_i}(λ1))^{1/2}. Since E_y(F) ≤ √(E_y(F^2)), it implies E_y(F) ≤ Ξ.

Let y^i = (y_1, …, y_{i−1}, y'_i, y_{i+1}, …, y_m), where y'_i is another sample at x_i. We have

  |F(y) − F(y^i)| ≤ ||(L_K + λ1 I)^{−1/2} S_x*(y − y^i)||_K = (1/m)|y_i − y'_i| ||(L_K + λ1 I)^{−1/2} K_{x_i}||_K ≤ 2κM/(m√λ1).

This can be taken as B in Lemma 2(2) of [2]. Now

  E_{y'_i}(|F(y) − E_{y'_i}(F(y))|^2) ≤ (1/m^2) ∫_Y ( ∫_Y |y_i − y'_i| ||(L_K + λ1 I)^{−1/2} K_{x_i}||_K dρ(y'_i|x_i) )^2 dρ(y_i|x_i)
    ≤ (1/m^2) ∫_Y ∫_Y (y_i − y'_i)^2 N_{x_i}(λ1) dρ(y'_i|x_i) dρ(y_i|x_i) ≤ (2/m^2) σ_{x_i}^2 N_{x_i}(λ1),

which implies Σ_{i=1}^m σ_i^2(F) ≤ 2Ξ^2. In view of Lemma 2(2) of [2], for every ε > 0,

  Prob_{y ∈ Y^m} { F(y) − E_y(F(y)) ≥ ε } ≤ exp{ −ε^2 / (4(Ξ^2 + εκM/(3m√λ1))) } =: δ.

In terms of δ, the probability inequality becomes

  Prob_{y ∈ Y^m} { F(y) ≤ Ξ(1 + √(2 log(1/δ))) + (4κM/(3m√λ1)) log(1/δ) } ≥ 1 − δ.

Incorporating this inequality together with (17), (18) in (15), we get the desired result.

Proposition 2.2. Let z be i.i.d. samples drawn according to the probability measure ρ with the hypothesis |y_i| ≤ M for each (x_i, y_i) ∈ Z. Suppose f_ρ ∈ Ω_{φ,R}. Then for 0 ≤ s ≤ 1/2 and for every 0 < δ < 1, with probability 1 − δ,

  ||L_K^s(f_{z,λ} − f_λ)||_K ≤ 2λ1^{s−1/2} { (3M√N(λ1))/√m + (4κ/√(mλ1)) ||f_λ − f_ρ||_ρ + (16/√m) ||f_λ − f_ρ||_K + (7κM)/(m√λ1) } log(4/δ).
Proof. We can express f_{x,λ} − f_λ = S(S_x* S_x − L_K)(f_ρ − f_λ), which implies

  ||L_K^s(f_{x,λ} − f_λ)||_K ≤ I_4 || (1/m) Σ_{i=1}^m (f_ρ(x_i) − f_λ(x_i)) K_{x_i} − L_K(f_ρ − f_λ) ||_K,

where I_4 = ||L_K^s S||. Using Lemma 3 of [2] for the function f_ρ − f_λ, we get with confidence 1 − δ/2,

  ||L_K^s(f_{x,λ} − f_λ)||_K ≤ I_4 ( (4κ ||f_λ − f_ρ||_∞/(3m)) log(4/δ) + (κ ||f_λ − f_ρ||_ρ/√m)(1 + √(8 log(4/δ))) ).    (19)

For a sufficiently large sample (16), from Theorem 2 of [43] we get

  ||(L_K − S_x* S_x)(L_K + λ1 I)^{−1}|| ≤ ||S_x* S_x − L_K||/λ1 ≤ (4κ^2/(√m λ1)) log(4/δ) ≤ 1/2

with confidence 1 − δ/2, which implies

  ||(L_K + λ1 I)(S_x* S_x + λ1 I)^{−1}|| = || Σ_{i=0}^∞ { (L_K − S_x* S_x)(L_K + λ1 I)^{−1} }^i || ≤ 2.    (20)

We have

  ||L_K^s(L_K + λ1 I)^{−1}|| ≤ sup_{0 < t ≤ κ^2} t^s/(t + λ1) ≤ λ1^{s−1}  for 0 ≤ s ≤ 1.    (21)

Now equations (20) and (21) imply the following inequality:

  I_4 ≤ ||L_K^s(S_x* S_x + λ1 I)^{−1}|| ≤ ||L_K^s(L_K + λ1 I)^{−1}|| ||(L_K + λ1 I)(S_x* S_x + λ1 I)^{−1}|| ≤ 2λ1^{s−1}.    (22)

Let ξ(x) = σ_x^2 N_x(λ1) be the random variable. Then it satisfies ||ξ||_∞ ≤ 4κ^2 M^2/λ1, E_x(ξ) ≤ M^2 N(λ1) and σ^2(ξ) ≤ 4κ^2 M^4 N(λ1)/λ1. Using Bernstein's inequality we get

  Prob_{x ∈ X^m} { (1/m) Σ_{i=1}^m σ_{x_i}^2 N_{x_i}(λ1) − M^2 N(λ1) > t } ≤ exp{ −(m t^2/2) / (4κ^2 M^4 N(λ1)/λ1 + 4κ^2 M^2 t/(3λ1)) },

which implies, with confidence 1 − δ/2,

  m Ξ^2 = (1/m) Σ_{i=1}^m σ_{x_i}^2 N_{x_i}(λ1) ≤ M^2 N(λ1) + (8κ^2 M^2/(√m λ1)) log(2/δ).    (23)

We get the required error estimate by combining the estimate of Proposition 2.1 with inequalities (19), (22), (23).

Proposition 2.3. Suppose f_ρ ∈ Ω_{φ,R}. Then under the assumption that φ(t) and t^s/φ(t) are nondecreasing functions, we have

  ||L_K^s(f_λ − f_ρ)||_K ≤ λ1^s ( R φ(λ1) + λ2 λ1^{−3/2} M ||B*B|| ).    (24)

Proof. To obtain this estimate, we decompose f_λ − f_ρ as (f_λ − f_{λ1}) + (f_{λ1} − f_ρ). The first term can be expressed as

  f_λ − f_{λ1} = −λ2 (L_K + λ1 I + λ2 B*B)^{−1} B*B f_{λ1}.
Then we get

  ||L_K^s(f_λ − f_{λ1})||_K ≤ λ2 ||L_K^s(L_K + λ1 I)^{−1}|| ||B*B|| ||f_{λ1}||_K ≤ λ2 λ1^{s−1} ||B*B|| ||f_{λ1}||_K ≤ λ2 λ1^{s−3/2} M ||B*B||,    (25)

and for the second term,

  ||L_K^s(f_{λ1} − f_ρ)||_K ≤ R ||r_{λ1}(L_K) L_K^s φ(L_K)|| ≤ R λ1^s φ(λ1),    (26)

where r_{λ1}(t) = λ1(t + λ1)^{−1}. Combining these error bounds, we achieve the required estimate.

Theorem 2.1. Let z be i.i.d. samples drawn according to a probability measure ρ ∈ P_{φ,b}. Suppose φ(t) and t^s/φ(t) are nondecreasing functions. Then under the parameter choice λ1 ∈ (0, 1], λ1 = Ψ^{−1}(m^{−1/2}), λ2 = (Ψ^{−1}(m^{−1/2}))^{3/2} φ(Ψ^{−1}(m^{−1/2})), where Ψ(t) = t^{1/2 + 1/(2b)} φ(t), for 0 ≤ s ≤ 1/2 and for all 0 < δ < 1, the following error estimate holds with confidence 1 − δ:

  Prob_{z ∈ Z^m} { ||L_K^s(f_{z,λ} − f_ρ)||_K ≤ C (Ψ^{−1}(m^{−1/2}))^s φ(Ψ^{−1}(m^{−1/2})) log(4/δ) } ≥ 1 − δ,

where C = 4κM + (2 + 8κ)(R + M ||B*B||) + 6M √(βb/(b − 1)), and

  lim_{τ→∞} limsup_{m→∞} sup_{ρ ∈ P_{φ,b}} Prob_{z ∈ Z^m} { ||L_K^s(f_{z,λ} − f_ρ)||_K > τ (Ψ^{−1}(m^{−1/2}))^s φ(Ψ^{−1}(m^{−1/2})) } = 0.

Proof. Let Ψ(t) = t^{1/2 + 1/(2b)} φ(t). Then, writing y = Ψ(t),

  lim_{t→0} Ψ(t)^2/t = lim_{y→0} y^2/Ψ^{−1}(y) = 0,

so that under the parameter choice λ1 = Ψ^{−1}(m^{−1/2}) we have lim_{m→∞} m λ1 = ∞. Moreover, m^{−1/2} = λ1^{1/2 + 1/(2b)} φ(λ1) implies

  1/(√m √λ1) = λ1^{1/(2b)} φ(λ1),

so that for sufficiently large m the condition (16) is satisfied. Using λ1 ≤ 1, from Propositions 2.2 and 2.3 and the estimate (13) it follows that, with confidence 1 − δ,

  ||L_K^s(f_{z,λ} − f_ρ)||_K ≤ C (Ψ^{−1}(m^{−1/2}))^s φ(Ψ^{−1}(m^{−1/2})) log(4/δ),    (27)

where C = 4κM + (2 + 8κ)(R + M ||B*B||) + 6M √(βb/(b − 1)). Now defining τ := C log(4/δ) gives δ = 4e^{−τ/C}, and the estimate (27) can be re-expressed as

  Prob_{z ∈ Z^m} { ||L_K^s(f_{z,λ} − f_ρ)||_K > τ (Ψ^{−1}(m^{−1/2}))^s φ(Ψ^{−1}(m^{−1/2})) } ≤ 4e^{−τ/C}.    (28)

Letting τ → ∞ proves the second assertion.

Theorem 2.2. Let z be i.i.d. samples drawn according to a probability measure ρ ∈ P_{φ,b}. Then under the parameter choice λ0 ∈ (0, 1], λ0 = Ψ^{−1}(m^{−1/2}), λ_j = (Ψ^{−1}(m^{−1/2}))^{3/2} φ(Ψ^{−1}(m^{−1/2})) for 1 ≤ j ≤ p, where Ψ(t) = t^{1/2 + 1/(2b)} φ(t), for all 0 < η < 1 the following error estimates hold with confidence 1 − η:
(i) If φ(t) and t/φ(t) are nondecreasing functions, then we have

  Prob_{z ∈ Z^m} { ||f_{z,λ} − f_H||_H ≤ C' φ(Ψ^{−1}(m^{−1/2})) log(4/η) } ≥ 1 − η

and

  lim_{τ→∞} limsup_{m→∞} sup_{ρ ∈ P_{φ,b}} Prob_{z ∈ Z^m} { ||f_{z,λ} − f_H||_H > τ φ(Ψ^{−1}(m^{−1/2})) } = 0,

where C' = 2R + 4κM + 2M ||B*B||_{L(H)} + 4κΣ.

(ii) If φ(t) and √t/φ(t) are nondecreasing functions, then we have

  Prob_{z ∈ Z^m} { ||f_{z,λ} − f_H||_ρ ≤ C' (Ψ^{−1}(m^{−1/2}))^{1/2} φ(Ψ^{−1}(m^{−1/2})) log(4/η) } ≥ 1 − η

and

  lim_{τ→∞} limsup_{m→∞} sup_{ρ ∈ P_{φ,b}} Prob_{z ∈ Z^m} { ||f_{z,λ} − f_H||_ρ > τ (Ψ^{−1}(m^{−1/2}))^{1/2} φ(Ψ^{−1}(m^{−1/2})) } = 0.

Corollary 2.1. Under the same assumptions as Theorem 2.1, for Hölder's source condition f_ρ ∈ Ω_{φ,R}, φ(t) = t^r, for 0 ≤ s ≤ 1/2 and for all 0 < δ < 1, with confidence 1 − δ, for the parameter choice λ1 = m^{−b/(2br+b+1)} and λ2 = m^{−(2br+3b)/(4br+2b+2)} we have the following convergence rate:

  ||L_K^s(f_{z,λ} − f_ρ)||_K ≤ C m^{−b(r+s)/(2br+b+1)} log(4/δ)  for 0 ≤ r ≤ 1 − s.

Remark 2.1. For the Hölder source condition f_H ∈ Ω_{φ,R}, φ(t) = t^r, with the parameter choice λ0 ∈ (0, 1], λ0 = m^{−b/(2br+b+1)} and λ_j = m^{−(2br+3b)/(4br+2b+2)} for 1 ≤ j ≤ p, we obtain the optimal minimax convergence rates O(m^{−br/(2br+b+1)}) for 0 ≤ r ≤ 1 in the RKHS-norm and O(m^{−(2br+b)/(4br+2b+2)}) for 0 ≤ r ≤ 1/2 in the L^2_{ρ_X}-norm, respectively.

Corollary 2.2. Under the logarithmic decay condition on the effective dimension N(λ1), for Hölder's source condition f_ρ ∈ Ω_{φ,R}, φ(t) = t^r, for 0 ≤ s ≤ 1/2 and for all 0 < δ < 1, with confidence 1 − δ, for the parameter choice λ1 = (log m / m)^{1/(2r+1)} and λ2 = (log m / m)^{(2r+3)/(4r+2)} we have the following convergence rate:

  ||L_K^s(f_{z,λ} − f_ρ)||_K ≤ C (log m / m)^{(s+r)/(2r+1)} log(4/δ)  for 0 ≤ r ≤ 1 − s.

Remark 2.2. The upper convergence rates of the regularized solution are estimated in the interpolation norm for the parameter s ∈ [0, 1/2]. In particular, we obtain the error estimates in the H_K-norm for s = 0 and in the L^2_{ρ_X}-norm for s = 1/2. We present the error estimates of the multi-penalty regularizer over the regularity class P_{φ,b} in Theorem 2.1 and Corollary 2.1. We can also obtain the convergence rates of the estimator f_{z,λ} under the source condition, without the polynomial decay of the eigenvalues of the integral operator L_K, by substituting N(λ1) ≤ κ^2/λ1.
In addition, for B = (S_x* L S_x)^{1/2} we obtain the error estimates of the manifold regularization scheme (30) considered in [13].

Remark 2.3. The parameter choice is said to be optimal if the minimax lower rates coincide with the upper convergence rates for some λ = λ(m). For the parameter choice λ1 = Ψ^{−1}(m^{−1/2}) and
λ2 = (Ψ^{−1}(m^{−1/2}))^{3/2} φ(Ψ^{−1}(m^{−1/2})), Theorem 2.2 shares its upper convergence rates with the lower convergence rates of Theorems 3.1 and 3.2 of [44]. Therefore the choice of parameters is optimal.

Remark 2.4. The results can easily be generalized to n-penalty regularization in the vector-valued framework. For simplicity, we discuss the two-parameter regularization scheme in the scalar-valued function setting.

Remark 2.5. We can also address the convergence issues of the binary classification problem [45] using our error estimates, similarly to the discussions in Section 3.3 of [5] and Section 5 of [9].

The proposed choice of parameters in Theorem 2.1 is based on the regularity parameters, which are generally not known in practice. In the following section, we discuss parameter choice rules based on the samples.

3 Parameter Choice Rules

Most regularized learning algorithms depend on a tuning parameter, whose appropriate choice is crucial to ensure good performance of the regularized solution. Many parameter choice strategies have been discussed for single-penalty regularization schemes, for both ill-posed problems and learning algorithms [27, 28] (see also references therein). Various parameter choice rules have been studied for multi-penalty regularization schemes [15, 18, 19, 25, 31, 32, 33, 36, 46]. Ito et al. [33] studied a balancing principle for choosing the regularization parameters based on the augmented Tikhonov regularization approach for ill-posed inverse problems. In the learning theory framework, we discuss a fixed-point algorithm based on the penalty balancing principle considered in [33].

The Bayesian inference approach provides a mechanism for selecting the regularization parameters through hierarchical modeling. Various authors have successfully applied this approach to different problems. Thompson et al. [47] applied it to parameter selection in image restoration. Jin et al. [48] considered the approach for the ill-posed Cauchy problem of steady-state heat conduction.
The posterior probability density function (PPDF) for the functional (6) is given by

  P(f, σ^2, µ | z) ∝ (σ^2)^{−m/2} exp( −||S_x f − y||^2/(2σ^2) ) · µ1^{n1/2} exp( −(µ1/2) ||f||_K^2 ) · µ2^{n2/2} exp( −(µ2/2) ||Bf||_K^2 )
      · µ1^{α1} e^{−β1 µ1} · µ2^{α1} e^{−β1 µ2} · (σ^2)^{−α'_o} e^{−β'_o/σ^2},

where (α1, β1) is the parameter pair for µ = (µ1, µ2) and (α'_o, β'_o) is the parameter pair for the inverse variance σ^{−2}. In the Bayesian inference approach, we select the parameter set (f, σ^2, µ) which maximizes the PPDF. Taking the negative logarithm and simplifying, the problem can be reformulated as the minimization of

  J(f, τ, µ) = τ ||S_x f − y||^2 + µ1 ||f||_K^2 + µ2 ||Bf||_K^2 + β(µ1 + µ2) − α(log µ1 + log µ2) + β_o τ − α_o log τ,

where τ = 1/σ^2, β = 2β1, α = n1 + 2α1 − 2, β_o = 2β'_o, α_o = m + 2α'_o − 2. We assume that the scalars τ and µ_i have Gamma distributions with known parameter pairs. The resulting functional is known as the augmented Tikhonov (a-Tikhonov) regularization. For the non-informative prior β_o = β = 0, the optimality conditions of the a-Tikhonov functional reduce to
  f_{z,λ} = argmin_{f ∈ H_K} { ||S_x f − y||^2 + λ1 ||f||_K^2 + λ2 ||Bf||_K^2 },
  µ1 = α/||f_{z,λ}||_K^2,  µ2 = α/||B f_{z,λ}||_K^2,  τ = α_o/||S_x f_{z,λ} − y||^2,

where λ1 = µ1/τ, λ2 = µ2/τ and γ = α_o/α. This can be reformulated as

  f_{z,λ} = argmin_{f ∈ H_K} { ||S_x f − y||^2 + λ1 ||f||_K^2 + λ2 ||Bf||_K^2 },
  λ1 = ||S_x f_{z,λ} − y||^2 / (γ ||f_{z,λ}||_K^2),  λ2 = ||S_x f_{z,λ} − y||^2 / (γ ||B f_{z,λ}||_K^2),

which implies λ1 ||f_{z,λ}||_K^2 = λ2 ||B f_{z,λ}||_K^2. The rule thus selects the regularization parameters λ in the functional (6) by balancing each penalty against the fidelity term; hence the name "penalty balancing principle". Now we describe the fixed-point algorithm based on the PB-principle.

Algorithm 1 (Parameter choice rule: Penalty Balancing Principle).
1. For an initial value λ^0 = (λ1^0, λ2^0), start with k = 0.
2. Calculate f_{z,λ^k} and update λ by
     λ1^{k+1} = ( ||S_x f_{z,λ^k} − y||^2 + λ2^k ||B f_{z,λ^k}||_K^2 ) / ( (1 + γ) ||f_{z,λ^k}||_K^2 ),
     λ2^{k+1} = ( ||S_x f_{z,λ^k} − y||^2 + λ1^k ||f_{z,λ^k}||_K^2 ) / ( (1 + γ) ||B f_{z,λ^k}||_K^2 ).
3. If the stopping criterion ||λ^{k+1} − λ^k|| < ε is satisfied, stop; otherwise set k = k + 1 and go to step 2.

4 Numerical Realization

In this section, the performance of single-penalty regularization versus multi-penalty regularization is demonstrated using an academic example and the two moon data set. For single-penalty regularization the parameters are chosen according to the quasi-optimality principle, while for two-parameter regularization they are chosen according to the PB-principle. We consider the well-known academic example [28, 16, 49] to test multi-penalty regularization under the PB-principle parameter choice rule:

  f_ρ(x) = (1/10) { x + 2 ( e^{−8(4π/3 − x)^2} − e^{−8(π/2 − x)^2} − e^{−8(3π/2 − x)^2} ) },  x ∈ [0, 2π],    (29)

which belongs to the reproducing kernel Hilbert space H_K corresponding to the kernel K(x, y) = xy + exp(−8(x − y)^2). We generate noisy data 100 times in the form y = f_ρ(x) + δξ, corresponding to the inputs x = {x_i = π(i − 1)/10}, where ξ follows the uniform distribution over [−1, 1] with δ = 0.02.
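The PB-principle fixed-point iteration above can be sketched in a finite-dimensional surrogate, writing f = Σ_i α_i K(x_i, ·) so that ||f||_K^2 = α^T K α and, for a graph-Laplacian penalty with B*B = S_x* L S_x, ||Bf||_K^2 = α^T K L K α on the sample. The data, kernel, chain-graph Laplacian and the value γ = 1 below are illustrative assumptions, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40
x = np.sort(rng.uniform(0, 2 * np.pi, m))
y = np.sin(x) + 0.1 * rng.normal(size=m)
Kmat = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
W = np.diag(np.ones(m - 1), 1); W = W + W.T        # chain graph on sorted points
Lap = np.diag(W.sum(axis=1)) - W                   # L = D - W
gamma = 1.0                                        # gamma = alpha_o / alpha (assumed)

lam = np.array([1e-2, 1e-2])                       # initial (lam1, lam2)
converged = False
for _ in range(100):
    # solve the two-penalty normal equation ((1/m)K + lam1 I + lam2 L K) a = y/m
    A = (1 / m) * Kmat + lam[0] * np.eye(m) + lam[1] * Lap @ Kmat
    alpha = np.linalg.solve(A, y / m)
    f = Kmat @ alpha
    fid = np.mean((f - y) ** 2)                    # ||S_x f - y||^2
    p1 = alpha @ Kmat @ alpha                      # ||f||_K^2
    p2 = alpha @ Kmat @ Lap @ (Kmat @ alpha)       # ||B f||_K^2
    new = np.array([(fid + lam[1] * p2) / ((1 + gamma) * p1),
                    (fid + lam[0] * p1) / ((1 + gamma) * p2)])
    if np.max(np.abs(new - lam) / lam) < 1e-8:
        converged = True
        break
    lam = new
print("lam =", lam, " penalties:", lam[0] * p1, lam[1] * p2)
```

At any fixed point, the two update equations force λ1 ||f||_K^2 = λ2 ||Bf||_K^2, i.e. the penalties balance, which the final line displays; convergence of the plain iteration is not guaranteed in general, which is why a stopping criterion (and, in practice, an iteration cap) is used.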
We consider the following multi-penalty functional, proposed in manifold regularization [13, 15]:

  argmin_{f ∈ H_K} { (1/m) Σ_{i=1}^m (f(x_i) − y_i)^2 + λ1 ||f||_K^2 + λ2 ||(S_x* L S_x)^{1/2} f||_K^2 },    (30)

where x = {x_i ∈ X : 1 ≤ i ≤ n} and L = D − W, with W = (ω_ij) a weight matrix with non-negative entries and D the diagonal matrix with D_ii = Σ_{j=1}^n ω_ij.

In our experiment, we illustrate the error estimates of the single-penalty regularizers f = f_{z,λ1}, f = f_{z,λ2} and the multi-penalty regularizer f = f_{z,λ}, using the relative error measure ||f − f_ρ||/||f|| for the academic example in the sup-norm, the H_K-norm and the empirical norm, shown in Fig. 1 (a), (b) and (c), respectively.

Figure 1: Relative errors of the different estimators for the academic example in the ||·||_K-norm (a), the empirical norm (b) and the infinity norm (c), over 100 test problems with noise δ = 0.02 for all estimators.

Now we compare the performance of multi-penalty regularization against the single-penalty regularization method using the well-known two moon data set (Fig. 2) in the context of manifold learning. The data set contains 200 examples, with k labeled examples for each class. We perform the experiment 500 times, taking l = 2k = 2, 6, 10, 20 labeled points at random. We solve the manifold regularization problem (30) for the Mercer kernel K(x_i, x_j) = exp(−γ||x_i − x_j||^2) with the exponential weights ω_ij = exp(−||x_i − x_j||^2/(4b)), for some b, γ > 0. We fix the initial parameters λ1 = 10^{−4} and λ2, the kernel parameter γ = 3.5, and the weight parameter b in all experiments. The performance of the single-penalty (λ2 = 0) and the proposed multi-penalty regularizer (30) is presented in Fig. 2 and Table 1. Based on the considered examples, we observe that the proposed multi-penalty regularization with the penalty balancing principle parameter choice outperforms the single-penalty regularizers.
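The data generation for the academic example (29) and a baseline single-penalty fit can be sketched as follows. The test grid, the fixed value λ1 = 10^{−4} used here, and the substitution of plain kernel ridge regression for the quasi-optimality- and PB-tuned estimators are illustrative assumptions:

```python
import numpy as np

def f_rho(x):
    """The academic example (29) on [0, 2*pi]."""
    return (x + 2 * (np.exp(-8 * (4 * np.pi / 3 - x) ** 2)
                     - np.exp(-8 * (np.pi / 2 - x) ** 2)
                     - np.exp(-8 * (3 * np.pi / 2 - x) ** 2))) / 10

def kernel(a, b):
    """K(x, y) = x*y + exp(-8 (x - y)^2), the kernel used in the example."""
    a, b = np.asarray(a), np.asarray(b)
    return np.outer(a, b) + np.exp(-8 * (a[:, None] - b[None, :]) ** 2)

rng = np.random.default_rng(0)
x = np.pi * np.arange(21) / 10              # inputs x_i = pi (i - 1)/10
delta = 0.02
y = f_rho(x) + delta * rng.uniform(-1, 1, x.size)
m = x.size

lam1 = 1e-4                                  # single-penalty regularizer (assumed value)
alpha = np.linalg.solve(kernel(x, x) + m * lam1 * np.eye(m), y)
xt = np.linspace(0, 2 * np.pi, 200)
sup_err = np.max(np.abs(kernel(xt, x) @ alpha - f_rho(xt)))
print("sup-norm error:", sup_err)
```

Since f_ρ lies in H_K for this kernel, even the untuned single-penalty fit recovers the target closely at this noise level; replacing λ2 = 0 by the manifold penalty of (30) and tuning (λ1, λ2) with the PB-principle would give the multi-penalty counterpart compared in Fig. 1.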
Figure 2: Decision surfaces generated from two labeled samples (red stars) by the single-penalty regularizer (a), based on the quasi-optimality principle, and by the manifold regularizer (b), based on the PB-principle.

Table 1: Statistical performance of the single-penalty (λ2 = 0) and multi-penalty regularizers of the functional (30). Symbols: number of labeled points (l); percentage of successfully predicted points (SP %); maximum number of wrongly classified points (WC).

5 Conclusion

In summary, we achieved the optimal minimax rates for the multi-penalized regression problem under the general source condition with decay conditions on the effective dimension. In particular, the convergence analysis of multi-penalty regularization provides error estimates for the manifold regularization problem. We can also address the convergence issues of the binary classification problem using our error estimates. We discussed the penalty balancing principle, based on augmented Tikhonov regularization, for the choice of the regularization parameters. Many other parameter choice rules have been proposed for multi-parameter regularization schemes; a rigorous analysis of the different parameter choice rules for multi-penalty regularization schemes is a natural next problem of interest. Finally, the superiority of multi-penalty regularization over single-penalty regularization was shown using the academic example and the two moon data set.

Acknowledgements: The author is grateful for the valuable suggestions and comments of the anonymous referees that led to improvements in the quality of the paper.
References

[1] O. Bousquet, S. Boucheron, and G. Lugosi, "Introduction to statistical learning theory," in Advanced Lectures on Machine Learning, Berlin/Heidelberg: Springer, 2004.
[2] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bull. Amer. Math. Soc. (N.S.), vol. 39, no. 1, pp. 1–49, 2002.
[3] T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Adv. Comput. Math., vol. 13, no. 1, pp. 1–50, 2000.
[4] V. N. Vapnik, Statistical Learning Theory, New York: Wiley, 1998.
[5] F. Bauer, S. Pereverzev, and L. Rosasco, "On regularization algorithms in learning theory," J. Complexity, vol. 23, no. 1, pp. 52–72, 2007.
[6] A. Caponnetto and E. De Vito, "Optimal rates for the regularized least-squares algorithm," Found. Comput. Math., vol. 7, no. 3, pp. 331–368, 2007.
[7] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Math. Appl., vol. 375, Dordrecht, The Netherlands: Kluwer Academic Publishers Group, 1996.
[8] L. Lo Gerfo, L. Rosasco, F. Odone, E. De Vito, and A. Verri, "Spectral algorithms for supervised learning," Neural Computation, vol. 20, no. 7, 2008.
[9] S. Smale and D. X. Zhou, "Learning theory estimates via integral operators and their approximations," Constr. Approx., vol. 26, no. 2, pp. 153–172, 2007.
[10] A. N. Tikhonov and V. Y. Arsenin, Solutions of Ill-posed Problems, Washington, DC: W. H. Winston, 1977.
[11] S. Smale and D. X. Zhou, "Shannon sampling and function reconstruction from point values," Bull. Amer. Math. Soc., vol. 41, no. 3, 2004.
[12] S. Smale and D. X. Zhou, "Shannon sampling II: Connections to learning theory," Appl. Comput. Harmonic Anal., vol. 19, no. 3, 2005.
[13] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, 2006.
[14] D. Düvelmeyer and B. Hofmann, "A multi-parameter regularization approach for estimating parameters in jump diffusion processes," J. Inverse Ill-Posed Probl., vol. 14, no. 9, 2006.
[15] S. Lu and S. V. Pereverzev, "Multi-parameter regularization and its numerical realization," Numer. Math., vol. 118, no. 1, pp. 1–31, 2011.
[16] S. Lu, S. Pereverzyev Jr., and S. Sivananthan, "Multiparameter regularization for construction of extrapolating estimators in statistical learning theory," in Multiscale Signal Analysis and Modeling, New York: Springer, 2013.
[17] Y. Lu, L. Shen, and Y. Xu, "Multi-parameter regularization methods for high-resolution image reconstruction with displacement errors," IEEE Trans. Circuits Syst. I: Regular Papers, vol. 54, no. 8, 2007.
[18] V. Naumova and S. V. Pereverzyev, "Multi-penalty regularization with a component-wise penalization," Inverse Problems, vol. 29, no. 7, 075002, 2013.
[19] S. N. Wood, "Modelling and smoothing parameter estimation with multiple quadratic penalties," J. R. Statist. Soc. B, vol. 62, 2000.
[20] P. Xu, Y. Fukuda, and Y. Liu, "Multiple parameter regularization: Numerical solutions and applications to the determination of geopotential from precise satellite orbits," J. Geodesy, vol. 80, no. 1, pp. 17–27, 2006.
[21] Y. Luo, D. Tao, C. Xu, D. Li, and C. Xu, "Vector-valued multi-view semi-supervised learning for multi-label image classification," in AAAI, 2013.
[22] Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu, and Y. Wen, "Multiview vector-valued manifold regularization for multilabel image classification," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, 2013.
[23] H. Q. Minh, L. Bazzani, and V. Murino, "A unifying framework in vector-valued reproducing kernel Hilbert spaces for manifold regularization and co-regularized multi-view learning," J. Mach. Learn. Res., vol. 17, no. 25, pp. 1–72, 2016.
[24] H. Q. Minh and V. Sindhwani, "Vector-valued manifold regularization," in International Conference on Machine Learning, 2011.
[25] Abhishake and S. Sivananthan, "Multi-penalty regularization in learning theory," J. Complexity, vol. 36, pp. 141–165, 2016.
[26] F. Bauer and S. Kindermann, "The quasi-optimality criterion for classical inverse problems," Inverse Problems, vol. 24, 035002, 2008.
[27] A. Caponnetto and Y. Yao, "Cross-validation based adaptation for regularization operators in learning theory," Anal. Appl., vol. 8, no. 2, pp. 161–183, 2010.
[28] E. De Vito, S. Pereverzyev, and L. Rosasco, "Adaptive kernel methods using the balancing principle," Found. Comput. Math., vol. 10, no. 4, pp. 455–479, 2010.
[29] V. A. Morozov, "On the solution of functional equations by the method of regularization," Soviet Math. Dokl., vol. 7, 1966.
[30] J. Xie and J. Zou, "An improved model function method for choosing regularization parameters in linear inverse problems," Inverse Problems, vol. 18, no. 3, 2002.
[31] S. Lu, S. V. Pereverzev, and U. Tautenhahn, "A model function method in regularized total least squares," Appl. Anal., vol. 89, no. 11, 2010.
[32] M. Fornasier, V. Naumova, and S. V. Pereverzyev, "Parameter choice strategies for multipenalty regularization," SIAM J. Numer. Anal., vol. 52, no. 4, 2014.
[33] K. Ito, B. Jin, and T. Takeuchi, "Multi-parameter Tikhonov regularization — an augmented approach," Chinese Ann. Math., vol. 35, no. 3, 2014.
[34] M. Belge, M. E. Kilmer, and E. L. Miller, "Efficient determination of multiple regularization parameters in a generalized L-curve framework," Inverse Problems, vol. 18, pp. 1161–1183, 2002.
[35] F. Bauer and O. Ivanyshyn, "Optimal regularization with two interdependent regularization parameters," Inverse Problems, vol. 23, no. 1, 2007.
[36] F. Bauer and S. V. Pereverzev, "An utilization of a rough approximation of a noise covariance within the framework of multi-parameter regularization," Int. J. Tomogr. Stat., vol. 4, pp. 1–12, 2006.
[37] Z. Chen, Y. Lu, Y. Xu, and H. Yang, "Multi-parameter Tikhonov regularization for linear ill-posed operator equations," J. Comp. Math., vol. 26, 2008.
[38] C. Brezinski, M. Redivo-Zaglia, G. Rodriguez, and S. Seatzu, "Multi-parameter regularization techniques for ill-conditioned linear systems," Numer. Math., vol. 94, no. 2, 2003.
[39] N. Aronszajn, "Theory of reproducing kernels," Trans. Amer. Math. Soc., vol. 68, pp. 337–404, 1950.
[40] P. Mathé and S. V. Pereverzev, "Geometry of linear ill-posed problems in variable Hilbert scales," Inverse Problems, vol. 19, no. 3, pp. 789–803, 2003.
[41] S. Lu, P. Mathé, and S. Pereverzyev, "Balancing principle in supervised learning for a general regularization scheme," RICAM-Report No. 38, 2016.
[42] G. Blanchard and P. Mathé, "Discrepancy principle for statistical inverse problems with application to conjugate gradient iteration," Inverse Problems, vol. 28, no. 11, 115011, 2012.
[43] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone, "Learning from examples as an inverse problem," J. Mach. Learn. Res., vol. 6, pp. 883–904, 2005.
[44] A. Rastogi and S. Sivananthan, "Optimal rates for the regularized learning algorithms under general source condition," Front. Appl. Math. Stat., vol. 3, Article 3, 2017.
[45] S. Boucheron, O. Bousquet, and G. Lugosi, "Theory of classification: A survey of some recent advances," ESAIM: Probability and Statistics, vol. 9, pp. 323–375, 2005.
[46] S. Lu and S. Pereverzev, Regularization Theory for Ill-posed Problems: Selected Topics, vol. 58, Berlin: Walter de Gruyter, 2013.
[47] A. M. Thompson and J. Kay, "On some choices of regularization parameter in image restoration," Inverse Problems, vol. 9, 1993.
[48] B. Jin and J. Zou, "Augmented Tikhonov regularization," Inverse Problems, vol. 25, no. 2, 025001, 2009.
[49] C. A. Micchelli and M. Pontil, "Learning the kernel function via regularization," J. Mach. Learn. Res., vol. 6, 2005.
More information