Near-Orthogonality Regularization in Kernel Methods: Supplementary Material
1 ADMM-Based Algorithms

1.1 Detailed Derivation of Solving $A$

Given $H \in \mathbb{R}^{K \times K}$ where $H_{ij} = \langle f_i, \hat{f}_j \rangle$, the sub-problem defined over $A$ is

  $\min_A \ \lambda D_\phi(A, I) - \langle P, A \rangle - \langle Q, A \rangle + \frac{\rho}{2}\|H - A\|_F^2 + \frac{\rho}{2}\|H^\top - A\|_F^2.$  (1)

$D_\phi(A, I)$ has three cases, which we discuss separately.

When $D_\phi(A, I)$ is the squared Frobenius norm, the problem becomes

  $\min_A \ \lambda \|A - I\|_F^2 - \langle P, A \rangle - \langle Q, A \rangle + \frac{\rho}{2}\|H - A\|_F^2 + \frac{\rho}{2}\|H^\top - A\|_F^2.$  (2)

Taking the derivative and setting it to zero, we get the optimal solution for $A$:

  $A = \big(2\lambda I + P + Q + \rho(H + H^\top)\big)/(2\lambda + 2\rho).$  (3)

When $D_\phi(A, I)$ is the log-determinant divergence, the problem is specialized to

  $\min_A \ \lambda(\mathrm{tr}(A) - \log\det(A)) - \langle P, A \rangle - \langle Q, A \rangle + \frac{\rho}{2}\|H - A\|_F^2 + \frac{\rho}{2}\|H^\top - A\|_F^2 \quad \text{s.t.}\ A \succ 0.$  (4)

Taking the derivative of the objective function w.r.t. $A$ and setting it to zero gives $\lambda(I - A^{-1}) - P - Q + 2\rho A - \rho(H + H^\top) = 0$; multiplying on the right by $A/(2\rho)$, we get

  $A^2 + \frac{1}{2\rho}\big(\lambda I - P - Q - \rho(H + H^\top)\big)A - \frac{\lambda}{2\rho} I = 0.$  (5)

Let $B = \frac{1}{2\rho}\big(\lambda I - P - Q - \rho(H + H^\top)\big)$ and $C = -\frac{\lambda}{2\rho} I$; Eq.(5) can be written as

  $A^2 + BA + C = 0.$  (6)

According to (Higham and Kim, 2001), since $B$ and $C$ commute, the solution of this equation is

  $A = \frac{-B + \sqrt{B^2 - 4C}}{2}.$  (7)

Taking an eigendecomposition $B = \Phi \Sigma \Phi^\top$, we can compute $A$ as $A = \Phi \hat{\Sigma} \Phi^\top$, where $\hat{\Sigma}$ is a diagonal matrix with $\hat{\Sigma}_{kk} = \frac{1}{2}\big({-\Sigma_{kk}} + \sqrt{\Sigma_{kk}^2 + 2\lambda/\rho}\big)$. Since $\sqrt{\Sigma_{kk}^2 + 2\lambda/\rho} > |\Sigma_{kk}|$, we know $\hat{\Sigma}_{kk} > 0$. Hence $A$ is positive definite.

When $D_\phi(A, I)$ is the von Neumann divergence, the problem becomes

  $\min_A \ \lambda\,\mathrm{tr}(A \log A - A) - \langle P, A \rangle - \langle Q, A \rangle + \frac{\rho}{2}\|H - A\|_F^2 + \frac{\rho}{2}\|H^\top - A\|_F^2 \quad \text{s.t.}\ A \succeq 0.$  (8)
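Before turning to the von Neumann case, the two updates derived so far can be sanity-checked numerically. The sketch below is a hypothetical small instance (the random $P$, $Q$, $H$ merely stand in for the actual ADMM quantities): it implements the closed-form update of Eq.(3) and the eigendecomposition solution of Eqs.(5)-(7).

```python
# Numeric sanity check for the squared-Frobenius update, Eq.(3), and the
# log-determinant update, Eqs.(5)-(7). P, Q, H are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
K, lam, rho = 5, 0.7, 1.3

P = rng.standard_normal((K, K)); P = (P + P.T) / 2   # symmetrized stand-in
Q = rng.standard_normal((K, K)); Q = (Q + Q.T) / 2
H = rng.standard_normal((K, K))

# Squared Frobenius norm: Eq.(3).
A_sfn = (2 * lam * np.eye(K) + P + Q + rho * (H + H.T)) / (2 * lam + 2 * rho)

# Log-determinant divergence: solve A^2 + B A + C = 0 per eigenvalue of B.
B = (lam * np.eye(K) - P - Q - rho * (H + H.T)) / (2 * rho)
Sigma, Phi = np.linalg.eigh(B)                        # B is symmetric
Sigma_hat = (-Sigma + np.sqrt(Sigma**2 + 2 * lam / rho)) / 2
A_ldd = Phi @ np.diag(Sigma_hat) @ Phi.T

# A_ldd should satisfy the quadratic matrix equation and be positive definite.
C = -(lam / (2 * rho)) * np.eye(K)
residual = A_ldd @ A_ldd + B @ A_ldd + C
print(np.abs(residual).max())                # tiny (float round-off)
print(np.linalg.eigvalsh(A_ldd).min() > 0)   # True
```

The log-determinant update satisfies the matrix equation exactly because $A$ and $B$ share the eigenbasis $\Phi$, so Eq.(6) reduces to $K$ scalar quadratics in the eigenvalues.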
Setting the gradient of the objective function w.r.t. $A$ to zero, we get

  $\lambda \log A + 2\rho A = P + Q + \rho(H + H^\top).$  (9)

Let $D = P + Q + \rho(H + H^\top)$. We perform an eigendecomposition $D = \Phi \Sigma \Phi^\top$ and parameterize $A$ as $A = \Phi \hat{\Sigma} \Phi^\top$; then we obtain

  $\lambda \log A + 2\rho A = \Phi\big(\lambda \log \hat{\Sigma} + 2\rho \hat{\Sigma}\big)\Phi^\top.$  (10)

Plugging this equation into Eq.(9), we get the following equation:

  $\lambda \log \hat{\Sigma} + 2\rho \hat{\Sigma} = \Sigma,$  (11)

which amounts to solving $K$ independent one-variable equations taking the form

  $\lambda \log \hat{\Sigma}_{ii} + 2\rho \hat{\Sigma}_{ii} = \Sigma_{ii},$  (12)

where $i = 1, \dots, K$. This equation has a closed-form solution:

  $\hat{\Sigma}_{ii} = \frac{\lambda}{2\rho}\,\omega\Big(\frac{\Sigma_{ii}}{\lambda} - \log\frac{\lambda}{2\rho}\Big),$  (13)

where $\omega(\cdot)$ is the Wright omega function (Gorenflo et al., 2007). Due to the presence of $\log$, $\hat{\Sigma}_{ii}$ is required to be positive, and the solution always exists since the ranges of $\lambda \log \hat{\Sigma}_{ii} + 2\rho \hat{\Sigma}_{ii}$ and $\Sigma_{ii}$ are both $(-\infty, \infty)$. Hence $A$ is guaranteed to be positive definite.

1.2 ADMM-Based Algorithm for BMD-KSC

BMD-KSC has two sets of parameters: sparse codes $\{a_n\}_{n=1}^N$ and a dictionary of RKHS functions $\{f_i\}_{i=1}^K$. We use a coordinate descent algorithm to learn these two parameter sets, which iteratively performs the following two steps until convergence: (1) fixing $\{f_i\}_{i=1}^K$, solve $\{a_n\}_{n=1}^N$; (2) fixing $\{a_n\}_{n=1}^N$, solve $\{f_i\}_{i=1}^K$.

We first discuss step (1). The sub-problem defined over $a_n$ is

  $\min_{a_n} \ \frac{1}{2}\Big\|k(x_n, \cdot) - \sum_{i=1}^K a_{ni} f_i\Big\|_{\mathcal{H}}^2 + \lambda_1 \|a_n\|_1,$  (14)

where $\big\|k(x_n, \cdot) - \sum_{i=1}^K a_{ni} f_i\big\|_{\mathcal{H}}^2 = k(x_n, x_n) - 2 a_n^\top h_n + a_n^\top G a_n$ with $h_n \in \mathbb{R}^K$, $(h_n)_i = \langle f_i, k(x_n, \cdot) \rangle$, and $G \in \mathbb{R}^{K \times K}$, $G_{ij} = \langle f_i, f_j \rangle$. This is a standard Lasso (Tibshirani, 1996) problem and can be solved using many algorithms.

Next we discuss step (2), which learns $\{f_i\}_{i=1}^K$ using the ADMM-based algorithm outlined in the main paper. The updates of all variables are the same as those in BMD-KDML, except $f_i$. Let $b_n^i = k(x_n, \cdot) - \sum_{j \neq i} a_{nj} f_j$; then the sub-problem defined over $f_i$ is

  $\min_{f_i} \ \frac{1}{2}\sum_{n=1}^N \|b_n^i - a_{ni} f_i\|_{\mathcal{H}}^2 + \frac{\lambda_2}{2}\|f_i\|_{\mathcal{H}}^2 + \langle g_i, f_i \rangle + \sum_{j=1}^K P_{ij}\langle f_i, \hat{f}_j \rangle + \sum_{j=1}^K Q_{ij}\langle f_i, \hat{f}_j \rangle + \frac{\rho}{2}\sum_{j=1}^K \big(\langle f_i, \hat{f}_j \rangle - A_{ij}\big)^2 + \frac{\rho}{2}\sum_{j=1}^K \big(\langle f_i, \hat{f}_j \rangle - A_{ji}\big)^2.$  (15)

This problem can be solved with functional gradient descent.
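As a concrete illustration of step (1): the Lasso sub-problem reduces to $\min_a \frac{1}{2} a^\top G a - a^\top h_n + \lambda_1 \|a\|_1$, which can be handled by any of the "many algorithms" mentioned above. Below is a minimal ISTA sketch; the $G$ and $h$ here are synthetic stand-ins for the RKHS quantities, not values from the paper.

```python
# ISTA sketch for the reduced Lasso sub-problem of step (1):
#   min_a 1/2 a^T G a - a^T h + lambda1 ||a||_1
# G and h are synthetic stand-ins for the Gram matrix and kernel evaluations.
import numpy as np

rng = np.random.default_rng(1)
K, lam1 = 8, 0.1

M = rng.standard_normal((K, K))
G = M @ M.T / K + np.eye(K)       # well-conditioned PSD stand-in Gram matrix
h = rng.standard_normal(K)

def soft(v, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

a = np.zeros(K)
step = 1.0 / np.linalg.eigvalsh(G).max()   # 1/L, L = Lipschitz const. of grad
for _ in range(1000):
    a = soft(a - step * (G @ a - h), step * lam1)

# KKT check: at a Lasso optimum, |G a - h| <= lambda1 componentwise.
print(np.all(np.abs(G @ a - h) <= lam1 + 1e-6))  # True
```

Proximal-gradient methods fit this sub-problem well because the smooth part is a fixed $K \times K$ quadratic, so each iteration costs only $O(K^2)$.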
The functional gradient of the objective function w.r.t. $f_i$ is

  $\sum_{n=1}^N a_{ni}\big(a_{ni} f_i - b_n^i\big) + \lambda_2 f_i + g_i + \sum_{j=1}^K \big(P_{ij} + Q_{ij} + 2\rho \langle f_i, \hat{f}_j \rangle - \rho(A_{ij} + A_{ji})\big)\hat{f}_j.$  (16)

2 Proof of Lemma 1

For ease of notation, we use $\langle \cdot, \cdot \rangle$ to denote the inner product in the RKHS and $\|\cdot\|$ to denote the Hilbert norm. To prove Lemma 1, we need the following lemma, which is an extension of Lemma 4 in (Xie et al., 2015a) from vectors to RKHS functions.

Lemma 2. Let $\mathcal{F} = \{f_i\}_{i=1}^K$ and let $G$ be the Gram matrix defined on $\mathcal{F}$. Let $g_i$ be the functional gradient of $\det(G)$ w.r.t. $f_i$; then $\langle g_i, f_j \rangle = 0$ for all $j \neq i$, and $\langle g_i, f_i \rangle > 0$.
Proof. We decompose $f_i$ into $f_i = f_i^{\parallel} + f_i^{\perp}$, where $f_i^{\parallel}$ is in the span of $\mathcal{F} \setminus \{f_i\}$: $f_i^{\parallel} = \sum_{j \neq i} a_j f_j$ with linear coefficients $\{a_j\}_{j \neq i}$, and $f_i^{\perp}$ is orthogonal to $\mathcal{F} \setminus \{f_i\}$: $\langle f_i^{\perp}, f_j \rangle = 0$ for all $j \neq i$. Let $c_j$ denote the $j$-th column of $G$. Subtracting $\sum_{j \neq i} a_j c_j$ from the $i$-th column, we get

  $\det(G) = \det\begin{pmatrix} \langle f_1, f_1 \rangle & \cdots & 0 & \cdots & \langle f_1, f_K \rangle \\ \vdots & & \vdots & & \vdots \\ \langle f_i, f_1 \rangle & \cdots & \langle f_i, f_i^{\perp} \rangle & \cdots & \langle f_i, f_K \rangle \\ \vdots & & \vdots & & \vdots \\ \langle f_K, f_1 \rangle & \cdots & 0 & \cdots & \langle f_K, f_K \rangle \end{pmatrix}.$  (17)

Expanding the determinant according to the $i$-th column, we get

  $\det(G) = \det(G_{-i})\,\langle f_i, f_i^{\perp} \rangle,$  (18)

where $G_{-i}$ is the Gram matrix defined on $\mathcal{F} \setminus \{f_i\}$. Then the functional gradient $g_i$ of $\det(G)$ w.r.t. $f_i$ is $2\det(G_{-i}) f_i^{\perp}$, which is orthogonal to $f_j$ for all $j \neq i$. Since $G$ is full rank, we know $\det(G) > 0$ and $\det(G_{-i}) > 0$; from Eq.(18), $\langle f_i, f_i^{\perp} \rangle = \det(G)/\det(G_{-i}) > 0$. Hence $\langle g_i, f_i \rangle = 2\det(G_{-i})\langle f_i, f_i^{\perp} \rangle > 0$.

Now we are ready to prove Lemma 1. Some of the proof techniques draw inspiration from (Xie et al., 2015a). Let $\tilde{f}_i = f_i + \eta(2 f_i - \hat{g}_i)$ denote the updated functions. We first compute $\hat{s}_{ij}$:

  $\hat{s}_{ij} = \frac{\langle \tilde{f}_i, \tilde{f}_j \rangle}{\|\tilde{f}_i\|\,\|\tilde{f}_j\|} = \frac{\langle f_i + \eta(2 f_i - \hat{g}_i),\ f_j + \eta(2 f_j - \hat{g}_j) \rangle}{\|f_i + \eta(2 f_i - \hat{g}_i)\|\,\|f_j + \eta(2 f_j - \hat{g}_j)\|}.$  (19)

The functional gradient of $\log\det(G)$ w.r.t. $f_i$ is computed as $\hat{g}_i = g_i/\det(G)$. According to Lemma 2 and the fact that $\det(G) > 0$, we know $\langle \hat{g}_i, f_j \rangle = 0$ for all $j \neq i$, and $\langle \hat{g}_i, f_i \rangle > 0$. Then we have

  $\langle \tilde{f}_i, \tilde{f}_j \rangle = \big\langle (1 + 2\eta) f_i - \eta \hat{g}_i,\ (1 + 2\eta) f_j - \eta \hat{g}_j \big\rangle = (1 + 2\eta)^2 \langle f_i, f_j \rangle + \eta^2 \langle \hat{g}_i, \hat{g}_j \rangle = \langle f_i, f_j \rangle \Big( (1 + 2\eta)^2 + \eta^2 \frac{\langle \hat{g}_i, \hat{g}_j \rangle}{\langle f_i, f_j \rangle} \Big)$  (20)

and

  $\|\tilde{f}_i\|^2 = (1 + 2\eta)^2 \|f_i\|^2 - 2\eta(1 + 2\eta)\langle f_i, \hat{g}_i \rangle + \eta^2 \|\hat{g}_i\|^2.$  (21)

According to the Taylor expansion $\sqrt{1 + x} = 1 + \frac{x}{2} + o(x)$, we have

  $\|\tilde{f}_i\| = (1 + 2\eta)\|f_i\| \sqrt{1 - \frac{2\eta \langle f_i, \hat{g}_i \rangle}{(1 + 2\eta)\|f_i\|^2} + \frac{\eta^2 \|\hat{g}_i\|^2}{(1 + 2\eta)^2 \|f_i\|^2}} = (1 + 2\eta)\|f_i\| \Big( 1 - \eta \frac{\langle f_i, \hat{g}_i \rangle}{\|f_i\|^2} + o(\eta) \Big).$  (22)
Similarly, $\|\tilde{f}_j\| = (1 + 2\eta)\|f_j\|\big(1 - \eta \langle f_j, \hat{g}_j \rangle / \|f_j\|^2 + o(\eta)\big)$. Then, using $\frac{1}{1 + x} = 1 - x + o(x)$,

  $\hat{s}_{ij} = \frac{\langle f_i, f_j \rangle \Big( (1 + 2\eta)^2 + \eta^2 \frac{\langle \hat{g}_i, \hat{g}_j \rangle}{\langle f_i, f_j \rangle} \Big)}{(1 + 2\eta)^2 \|f_i\| \|f_j\| \Big( 1 - \eta \frac{\langle f_i, \hat{g}_i \rangle}{\|f_i\|^2} + o(\eta) \Big)\Big( 1 - \eta \frac{\langle f_j, \hat{g}_j \rangle}{\|f_j\|^2} + o(\eta) \Big)}$  (26)

  $= s_{ij} \Big( 1 + \eta \Big( \frac{\langle f_i, \hat{g}_i \rangle}{\|f_i\|^2} + \frac{\langle f_j, \hat{g}_j \rangle}{\|f_j\|^2} \Big) + o(\eta) \Big) > s_{ij}$  (27)

for sufficiently small $\eta > 0$, since $\langle f_i, \hat{g}_i \rangle > 0$ and $\langle f_j, \hat{g}_j \rangle > 0$. This holds for all $i \neq j$; hence $s(\tilde{\mathcal{F}}) > s(\mathcal{F})$. The proof completes.

3 Proof of Theorem 1

For ease of presentation, we use $n$ to denote the number of data examples. Note that this number is denoted by $N$ in the main paper. Some of the proof techniques draw inspiration from (Xie et al., 2015b). A well-established result in learning theory is that the generalization error can be upper bounded by the Rademacher complexity. We start from the Rademacher complexity, seek a further upper bound of it, and show how $s(\mathcal{F})$ affects this upper bound.

The Rademacher complexity $R_n(\mathcal{A})$ of the loss function set $\mathcal{A}$ is defined as

  $R_n(\mathcal{A}) = \mathbb{E}\Big[\sup_{l \in \mathcal{A}} \frac{1}{n}\sum_{i=1}^n \sigma_i \, l(u(x_i, y_i), t_i)\Big],$  (28)

where the $\sigma_i$ are uniform over $\{-1, 1\}$ and $\{(x_i, y_i, t_i)\}_{i=1}^n$ are i.i.d. samples drawn from $p^*$. Another form of Rademacher complexity (Bartlett and Mendelson, 2003) can be written as $R_n(\mathcal{A}) = \mathbb{E}\big[\sup_{l \in \mathcal{A}} \big|\frac{1}{n}\sum_{i=1}^n \sigma_i \, l(u(x_i, y_i), t_i)\big|\big]$. The Rademacher complexity can be utilized to upper bound the estimation error, as shown in Lemma 3.

Lemma 3. (Anthony and Bartlett, 1999; Bartlett and Mendelson, 2003; Liang, 2015) With probability at least $1 - \delta$, for $\gamma \geq \sup_{x, y, t, u} l(u(x, y), t)$,

  $L(\hat{u}) - L(u^*) \leq 4 R_n(\mathcal{A}) + 2\gamma\sqrt{\frac{2\log(2/\delta)}{n}}.$  (29)

Our analysis starts from this lemma, and we seek a further upper bound of $R_n(\mathcal{A})$. The analysis needs an upper bound of the Rademacher complexity of the hypothesis set $\mathcal{U}$, which is given in Lemma 4.

Lemma 4. Let $R_n(\mathcal{U})$ denote the Rademacher complexity of the hypothesis set $\mathcal{U}$; then

  $R_n(\mathcal{U}) \leq \frac{8 B^2(k) B_2^2(k, C) K}{\sqrt{n}}.$  (30)

Proof.
Let $\mathcal{V} = \{v : (x, y) \mapsto (f(x) - f(y))^2,\ f \in \mathcal{H}\}$ denote the set of hypotheses $v(x, y) = (f(x) - f(y))^2$. We have

  $R_n(\mathcal{U}) = \mathbb{E}\Big[\sup_{u \in \mathcal{U}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i \sum_{j=1}^K (f_j(x_i) - f_j(y_i))^2\Big|\Big] = \mathbb{E}\Big[\sup_{u \in \mathcal{U}} \Big|\frac{1}{n}\sum_{j=1}^K \sum_{i=1}^n \sigma_i (f_j(x_i) - f_j(y_i))^2\Big|\Big] \leq K\,\mathbb{E}\Big[\sup_{u \in \mathcal{U}} \max_j \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i (f_j(x_i) - f_j(y_i))^2\Big|\Big] = K\,\mathbb{E}\Big[\sup_{v \in \mathcal{V}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i (f(x_i) - f(y_i))^2\Big|\Big] = K R_n(\mathcal{V}).$  (31)
Let $\mathcal{G} = \{g : (x, y) \mapsto f(x) - f(y),\ f \in \mathcal{H}\}$ denote the set of hypotheses $g(x, y) = f(x) - f(y)$, and let $h(x) = x^2$; then $R_n(\mathcal{V}) = R_n(h \circ \mathcal{G})$. We have $h(0) = 0$, and $h$ is Lipschitz continuous on the range of $\mathcal{G}$ with Lipschitz constant $L$, which can be bounded as follows:

  $L = \sup_{g \in \mathcal{G}} |h'(g)| = \sup_{f, x, y} 2|f(x) - f(y)| \leq 4 \sup_{f, x} |f(x)| = 4 \sup_{f, x} |\langle f, k(x, \cdot) \rangle| \leq 4 \sup_{f, x} \|f\|\,\|k(x, \cdot)\| \leq 4 B(k) B_2(k, C).$  (32)

According to the composition property of Rademacher complexity (Theorem 12 in Bartlett and Mendelson (2003)), we have

  $R_n(h \circ \mathcal{G}) \leq 4 B(k) B_2(k, C)\, R_n(\mathcal{G}).$  (33)

Now we bound $R_n(\mathcal{G})$:

  $R_n(\mathcal{G}) = \mathbb{E}\Big[\sup_{g \in \mathcal{G}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i (f(x_i) - f(y_i))\Big|\Big] = \mathbb{E}\Big[\sup_{g \in \mathcal{G}} \Big|\frac{1}{n}\sum_{i=1}^n \sigma_i \big(\langle f, k(x_i, \cdot) \rangle - \langle f, k(y_i, \cdot) \rangle\big)\Big|\Big] \leq \frac{1}{n}\mathbb{E}\Big[\sup_{g \in \mathcal{G}} \|f\|\,\Big\|\sum_{i=1}^n \sigma_i \big(k(x_i, \cdot) - k(y_i, \cdot)\big)\Big\|\Big] \leq \frac{B(k)}{n}\,\mathbb{E}\Big[\Big\|\sum_{i=1}^n \sigma_i \big(k(x_i, \cdot) - k(y_i, \cdot)\big)\Big\|\Big]$
  $= \frac{B(k)}{n}\,\mathbb{E}_{(x, y)}\Big[\mathbb{E}_{\sigma}\Big[\Big\|\sum_{i=1}^n \sigma_i \big(k(x_i, \cdot) - k(y_i, \cdot)\big)\Big\|\Big]\Big] \leq \frac{B(k)}{n}\,\mathbb{E}_{(x, y)}\Big[\sqrt{\mathbb{E}_{\sigma}\Big[\Big\|\sum_{i=1}^n \sigma_i \big(k(x_i, \cdot) - k(y_i, \cdot)\big)\Big\|^2\Big]}\Big] \quad (\text{concavity of } \sqrt{\cdot})$
  $= \frac{B(k)}{n}\,\mathbb{E}_{(x, y)}\Big[\sqrt{\sum_{i=1}^n \big\|k(x_i, \cdot) - k(y_i, \cdot)\big\|^2}\Big] \quad (\sigma_i \perp \sigma_j \text{ for } i \neq j)$
  $= \frac{B(k)}{n}\,\mathbb{E}_{(x, y)}\Big[\sqrt{\sum_{i=1}^n \big(k(x_i, x_i) + k(y_i, y_i) - 2 k(x_i, y_i)\big)}\Big] \leq \frac{2 B(k) B_2(k, C)}{\sqrt{n}}.$  (34)

Putting Eq.(33) and Eq.(34) together, we have

  $R_n(\mathcal{V}) \leq \frac{8 B^2(k) B_2^2(k, C)}{\sqrt{n}}.$  (35)

Plugging into $R_n(\mathcal{U}) \leq K R_n(\mathcal{V})$ completes the proof.

In addition, we need the following bound on $u(x, y)$.

Lemma 5.

  $\sup_{x, y, u} u(x, y) \leq J,$  (36)

where $J = 4 B^2(k) B_2^2(k, C)\big((K - 1) s(\mathcal{F}) + 1\big)$.

Proof. Let $F = [f_1, \dots, f_K]$. We have

  $u(x, y) = \sum_{j=1}^K (f_j(x) - f_j(y))^2 = \sum_{j=1}^K \big(\langle f_j, k(x, \cdot) \rangle - \langle f_j, k(y, \cdot) \rangle\big)^2 = \big\|F^\top\big(k(x, \cdot) - k(y, \cdot)\big)\big\|^2 \leq \|F\|_{op}^2\,\big\|k(x, \cdot) - k(y, \cdot)\big\|^2 = \|F\|_{op}^2\,\big(k(x, x) + k(y, y) - 2 k(x, y)\big) \leq 4 B_2^2(k, C)\,\|F\|_{op}^2,$  (37)

where $\|\cdot\|_{op}$ denotes the operator norm. Furthermore,

  $\|F\|_{op}^2 = \sup_{\|a\| = 1} \|F a\|^2 = \sup_{\|a\| = 1} \sum_{i=1}^K \sum_{j=1}^K a_i a_j \langle f_i, f_j \rangle \leq \sup_{\|a\| = 1} \sum_{i=1}^K \sum_{j=1}^K |a_i|\,|a_j|\,\|f_i\|\,\|f_j\|\,|\cos\theta_{ij}| \leq B^2(k) \sup_{\|a\| = 1} \Big( \sum_{i=1}^K \sum_{j \neq i} |a_i|\,|a_j|\,s(\mathcal{F}) + \sum_{i=1}^K a_i^2 \Big),$  (38)
where $\theta_{ij}$ is the angle between $f_i$ and $f_j$. Define $\bar{a} = [|a_1|, \dots, |a_K|]^\top$ and $Q \in \mathbb{R}^{K \times K}$ with $Q_{ij} = s(\mathcal{F})$ for $i \neq j$ and $Q_{ii} = 1$; then $\|\bar{a}\| = \|a\|$ and

  $\|F\|_{op}^2 \leq B^2(k) \sup_{\|\bar{a}\| = 1} \bar{a}^\top Q \bar{a} \leq B^2(k) \sup_{\|\bar{a}\| = 1} \lambda_{\max}(Q)\,\|\bar{a}\|^2 = B^2(k)\,\lambda_{\max}(Q),$  (39)

where $\lambda_{\max}(Q)$ is the largest eigenvalue of $Q$. By simple linear algebra we can get $\lambda_{\max}(Q) = (K - 1) s(\mathcal{F}) + 1$, so

  $\|F\|_{op}^2 \leq B^2(k)\big((K - 1) s(\mathcal{F}) + 1\big).$  (40)

Substituting into Eq.(37), we have

  $u(x, y) \leq 4 B^2(k) B_2^2(k, C)\big((K - 1) s(\mathcal{F}) + 1\big).$  (41)

Then $\sup_{x, y, u} u(x, y) \leq J$ with

  $J = 4 B^2(k) B_2^2(k, C)\big((K - 1) s(\mathcal{F}) + 1\big).$  (42)

The proof completes.

Given these lemmas, we proceed to prove Theorem 1. Since $\big|\frac{\partial l(u(x, y), t)}{\partial u(x, y)}\big| = \frac{1}{1 + \exp(t\,u(x, y))} \leq \frac{1}{1 + \exp(-J)}$, $l(u(x, y), t)$ is Lipschitz continuous with respect to the first argument, with constant $\frac{1}{1 + \exp(-J)}$. Applying the composition property of Rademacher complexity, we have

  $R_n(\mathcal{A}) \leq \frac{1}{1 + \exp(-J)}\,R_n(\mathcal{U}).$  (43)

Using Lemma 4, we have

  $R_n(\mathcal{A}) \leq \frac{8 B^2(k) B_2^2(k, C) K}{(1 + \exp(-J))\sqrt{n}}.$  (44)

In addition, $\sup_{x, y, t, u} l(u(x, y), t) \leq \log(1 + \exp(J))$. Substituting this inequality and Eq.(44) into Lemma 3 completes the proof.

4 Proof of Theorem 2

First, we derive an upper bound of the Babel function (Tropp, 2004):

  $\mu_m\big(\{f_i\}_{i=1}^K\big) = \max_{i \in \{1, \dots, K\}}\ \max_{\Lambda \subseteq \{1, \dots, K\} \setminus \{i\},\ |\Lambda| = m}\ \sum_{j \in \Lambda} \big|\langle f_j, f_i \rangle\big| \leq \max_{i}\ \max_{\Lambda}\ \sum_{j \in \Lambda} \|f_i\|\,\|f_j\|\,s(\mathcal{F}) \leq m B^2(k)\,s(\mathcal{F}).$  (45)

In Theorem 4 of Vainsencher et al. (2011), we set the upper bound of $\mu_m(\{f_i\}_{i=1}^K)$ to $m B^2(k) s(\mathcal{F})$ and then get Theorem 2.

5 Visualization

Given the learned RKHS functions $\{f_i\}_{i=1}^K$, we compute a $K \times K$ matrix $S$ where $S_{ij}$ is the absolute value of the cosine similarity between $f_i$ and $f_j$. We then visualize this matrix using a heatmap obtained with the imagesc function in MATLAB. Figure 1 shows the heatmaps of the matrices learned by different methods on the MIMIC-III dataset. The number of RKHS functions was fixed to 100. For all matrices, the diagonal entries equal 1. From the visualization, we observed the following. For the BMD-KDML methods KDML-(SFN,VND,LDD)-(RTR,RFF), the heatmaps have low energy on the off-diagonal entries, which indicates that these matrices are close to an identity matrix and hence the learned RKHS functions are close to being orthogonal.
For the unregularized KDML, the heatmap has high energy on the off-diagonal entries. The matrices of KDML-(DPP,Angle) have lower energy compared with KDML and KDML-SHN, but their energy is higher than that of the BMD-KDML methods.
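The visualization itself is straightforward to reproduce. Below is a minimal sketch with synthetic stand-in functions (random vectors rather than learned RKHS functions), using matplotlib's imshow as an analogue of MATLAB's imagesc:

```python
# Build the K x K matrix S of absolute cosine similarities and display it as
# a heatmap. F is a random stand-in for the learned functions f_i.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
K, d = 100, 50

F = rng.standard_normal((K, d))           # rows stand in for the functions f_i
G = F @ F.T                               # Gram matrix <f_i, f_j>
norms = np.sqrt(np.diag(G))
S = np.abs(G / np.outer(norms, norms))    # S_ij = |cos(f_i, f_j)|, S_ii = 1

plt.imshow(S, cmap="viridis", vmin=0, vmax=1)  # analogue of MATLAB imagesc
plt.colorbar()
plt.title("Pairwise absolute cosine similarities")
plt.savefig("similarity_heatmap.png", dpi=150)
```

Fixing the color scale to [0, 1] (via vmin/vmax) makes the off-diagonal "energy" comparable across methods, which is what the observations above rely on.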
[Figure 1: Heatmaps of the pairwise cosine similarities. Panels: KDML, KDML-SHN, KDML-DPP, KDML-Angle, KDML-SFN-RTR, KDML-VND-RTR, KDML-LDD-RTR, KDML-SFN-RFF, KDML-VND-RFF, KDML-LDD-RFF.]

References

Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2003.

Rudolf Gorenflo, Yuri Luchko, and Francesco Mainardi. Analytical properties and applications of the Wright function. arXiv preprint math-ph/0701069, 2007.

Nicholas J. Higham and Hyun-Min Kim. Solving a quadratic matrix equation by Newton's method with exact line searches. SIAM Journal on Matrix Analysis and Applications, 23(2):303-316, 2001.

Percy Liang. Lecture notes on statistical learning theory, 2015.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.

Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.

Daniel Vainsencher, Shie Mannor, and Alfred M. Bruckstein. The sample complexity of dictionary learning. Journal of Machine Learning Research, 12:3259-3281, 2011.

P. Xie, Y. Deng, and E. Xing. Diversifying restricted Boltzmann machine for document modeling. In SIGKDD, 2015a.

P. Xie, Y. Deng, and E. Xing. On the generalization error bounds of neural networks under mutual angular regularization. arXiv:1511.07110, 2015b.
More informationIntroduction to Optimization Techniques. How to Solve Equations
Itroductio to Optimizatio Techiques How to Solve Equatios Iterative Methods of Optimizatio Iterative methods of optimizatio Solutio of the oliear equatios resultig form a optimizatio problem is usually
More informationb i u x i U a i j u x i u x j
M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here
More information1+x 1 + α+x. x = 2(α x2 ) 1+x
Math 2030 Homework 6 Solutios # [Problem 5] For coveiece we let α lim sup a ad β lim sup b. Without loss of geerality let us assume that α β. If α the by assumptio β < so i this case α + β. By Theorem
More informationLecture 7: Properties of Random Samples
Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ
More informationSupplementary Materials for Statistical-Computational Phase Transitions in Planted Models: The High-Dimensional Setting
Supplemetary Materials for Statistical-Computatioal Phase Trasitios i Plated Models: The High-Dimesioal Settig Yudog Che The Uiversity of Califoria, Berkeley yudog.che@eecs.berkeley.edu Jiamig Xu Uiversity
More informationTRACES OF HADAMARD AND KRONECKER PRODUCTS OF MATRICES. 1. Introduction
Math Appl 6 2017, 143 150 DOI: 1013164/ma201709 TRACES OF HADAMARD AND KRONECKER PRODUCTS OF MATRICES PANKAJ KUMAR DAS ad LALIT K VASHISHT Abstract We preset some iequality/equality for traces of Hadamard
More informationSolutions for Math 411 Assignment #2 1
Solutios for Math 4 Assigmet #2 A2. For each of the followig fuctios f : C C, fid where f(z is complex differetiable ad where f(z is aalytic. You must justify your aswer. (a f(z = e x2 y 2 (cos(2xy + i
More informationSequences and Limits
Chapter Sequeces ad Limits Let { a } be a sequece of real or complex umbers A ecessary ad sufficiet coditio for the sequece to coverge is that for ay ɛ > 0 there exists a iteger N > 0 such that a p a q
More informationLinear Classifiers III
Uiversität Potsdam Istitut für Iformatik Lehrstuhl Maschielles Lere Liear Classifiers III Blaie Nelso, Tobias Scheffer Cotets Classificatio Problem Bayesia Classifier Decisio Liear Classifiers, MAP Models
More informationOrthogonal Gaussian Filters for Signal Processing
Orthogoal Gaussia Filters for Sigal Processig Mark Mackezie ad Kiet Tieu Mechaical Egieerig Uiversity of Wollogog.S.W. Australia Abstract A Gaussia filter usig the Hermite orthoormal series of fuctios
More informationMath 104: Homework 2 solutions
Math 04: Homework solutios. A (0, ): Sice this is a ope iterval, the miimum is udefied, ad sice the set is ot bouded above, the maximum is also udefied. if A 0 ad sup A. B { m + : m, N}: This set does
More informationInformation-based Feature Selection
Iformatio-based Feature Selectio Farza Faria, Abbas Kazeroui, Afshi Babveyh Email: {faria,abbask,afshib}@staford.edu 1 Itroductio Feature selectio is a topic of great iterest i applicatios dealig with
More information6.883: Online Methods in Machine Learning Alexander Rakhlin
6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform
More informationMath 128A: Homework 1 Solutions
Math 8A: Homework Solutios Due: Jue. Determie the limits of the followig sequeces as. a) a = +. lim a + = lim =. b) a = + ). c) a = si4 +6) +. lim a = lim = lim + ) [ + ) ] = [ e ] = e 6. Observe that
More information[ 11 ] z of degree 2 as both degree 2 each. The degree of a polynomial in n variables is the maximum of the degrees of its terms.
[ 11 ] 1 1.1 Polyomial Fuctios 1 Algebra Ay fuctio f ( x) ax a1x... a1x a0 is a polyomial fuctio if ai ( i 0,1,,,..., ) is a costat which belogs to the set of real umbers ad the idices,, 1,...,1 are atural
More informationPAijpam.eu ON TENSOR PRODUCT DECOMPOSITION
Iteratioal Joural of Pure ad Applied Mathematics Volume 103 No 3 2015, 537-545 ISSN: 1311-8080 (prited versio); ISSN: 1314-3395 (o-lie versio) url: http://wwwijpameu doi: http://dxdoiorg/1012732/ijpamv103i314
More informationComplex Analysis Spring 2001 Homework I Solution
Complex Aalysis Sprig 2001 Homework I Solutio 1. Coway, Chapter 1, sectio 3, problem 3. Describe the set of poits satisfyig the equatio z a z + a = 2c, where c > 0 ad a R. To begi, we see from the triagle
More informationMath 61CM - Solutions to homework 3
Math 6CM - Solutios to homework 3 Cédric De Groote October 2 th, 208 Problem : Let F be a field, m 0 a fixed oegative iteger ad let V = {a 0 + a x + + a m x m a 0,, a m F} be the vector space cosistig
More informationStochastic Matrices in a Finite Field
Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices
More informationA note on the Frobenius conditional number with positive definite matrices
Li et al. Joural of Iequalities ad Applicatios 011, 011:10 http://www.jouralofiequalitiesadapplicatios.com/cotet/011/1/10 RESEARCH Ope Access A ote o the Frobeius coditioal umber with positive defiite
More informationLECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK)
LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK) Everythig marked by is ot required by the course syllabus I this lecture, all vector spaces is over the real umber R. All vectors i R is viewed as a colum
More informationSession 5. (1) Principal component analysis and Karhunen-Loève transformation
200 Autum semester Patter Iformatio Processig Topic 2 Image compressio by orthogoal trasformatio Sessio 5 () Pricipal compoet aalysis ad Karhue-Loève trasformatio Topic 2 of this course explais the image
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More informationA remark on p-summing norms of operators
A remark o p-summig orms of operators Artem Zvavitch Abstract. I this paper we improve a result of W. B. Johso ad G. Schechtma by provig that the p-summig orm of ay operator with -dimesioal domai ca be
More informationOn Nonsingularity of Saddle Point Matrices. with Vectors of Ones
Iteratioal Joural of Algebra, Vol. 2, 2008, o. 4, 197-204 O Nosigularity of Saddle Poit Matrices with Vectors of Oes Tadeusz Ostrowski Istitute of Maagemet The State Vocatioal Uiversity -400 Gorzów, Polad
More information