Questions and answers, kernel part

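October 8, 2015

1 Questions

1.1 Question 1: properties of kernels, PCA, representer theorem

1. [2 points] Let $\mathcal{F}$ be an RKHS defined on some domain $\mathcal{X}$, with feature map $\phi(x)$ $\forall x \in \mathcal{X}$ and reproducing kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. Recall the reproducing property: $\forall f(\cdot) \in \mathcal{F}$,
\[ \langle f(\cdot), \phi(x) \rangle_{\mathcal{F}} = \langle f(\cdot), k(x, \cdot) \rangle_{\mathcal{F}} = f(x) \tag{1} \]
(we will equivalently use the shorthand $f \in \mathcal{F}$). Given $f$ takes the form $f(\cdot) = \sum_{i=1}^n a_i k(x_i, \cdot)$, show that
\[ \|f(\cdot)\|_{\mathcal{F}}^2 = \sum_{i=1}^n \sum_{j=1}^n a_i k(x_i, x_j) a_j. \]

2. [3 points] Show that for a function $f \in \mathcal{F}$, $\max_{x \in \mathcal{X}} |f(x)| < \infty$ when the kernel is bounded, $k(x, x') \le K < \infty$ $\forall x, x' \in \mathcal{X}$. You will need Cauchy-Schwarz, $|\langle f_1, f_2 \rangle_{\mathcal{F}}| \le \|f_1\|_{\mathcal{F}} \|f_2\|_{\mathcal{F}}$, $\forall f_1, f_2 \in \mathcal{F}$, and the knowledge that $\|f\|_{\mathcal{F}} < \infty$ since otherwise $f$ would not be in $\mathcal{F}$.

3. [5 points] Define the empirical feature space covariance (ignore centering) as
\[ \hat{C}_{XX} := \frac{1}{n} \sum_{i=1}^n \phi(x_i) \otimes \phi(x_i), \]
where $(f_1 \otimes f_2) f_3 = \langle f_2, f_3 \rangle_{\mathcal{F}} f_1$, $\forall f_1, f_2, f_3 \in \mathcal{F}$. The eigenfunctions $f$ of $\hat{C}_{XX}$ satisfy $f \lambda = \hat{C}_{XX} f$.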

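Assuming $f(\cdot) = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$, show that $\alpha \in \mathbb{R}^n$ is given by the solutions to $n \lambda \alpha = K \alpha$, $K_{ij} = k(x_i, x_j)$, assuming $K$ invertible.

4. [5 points] We have a set of paired observations $(x_1, y_1), \ldots, (x_n, y_n)$ (regression or classification). We are given the learning problem
\[ f^* = \arg\min_{f \in \mathcal{F}} J(f), \tag{2} \]
where
\[ J(f) = L_y(f(x_1), \ldots, f(x_n)) + \Omega\left(\|f\|_{\mathcal{F}}^2\right), \]
the loss $L$ depends on $x_i$ only via $f(x_i)$, $\Omega$ is non-decreasing, and $y$ is the vector of $y_i$. Prove that a solution takes the form $f^* = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$ (this is the representer theorem).

5. [5 points] A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if $\forall n \ge 1$, $\forall (a_1, \ldots, a_n) \in \mathbb{R}^n$, $\forall (x_1, \ldots, x_n) \in \mathcal{X}^n$,
\[ \sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j) \ge 0, \tag{3} \]
and strictly positive definite if the equality to zero holds only when $a_i = 0$ $\forall i \in \{1, \ldots, n\}$. We consider the case where the positive definiteness is not strict. In this case, there exists some set of weights $\{a_i\}_{i=1}^n$, not all zero, and corresponding points $\{x_i\}_{i=1}^n$ such that
\[ \sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j) = 0. \]
Show that the function
\[ f(x_{n+1}) = \sum_{i=1}^n a_i k(x_i, x_{n+1}) = 0 \]
at every point $x_{n+1} \in \mathcal{X}$. This is a powerful result: it shows that $\|f\|_{\mathcal{F}} = 0 \implies f(x) = 0$ $\forall x \in \mathcal{X}$.

Hints: since $k$ is positive definite, it remains true that
\[ \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} a_i a_j k(x_i, x_j) \ge 0. \]
Find the condition on $a_{n+1}$ to ensure this holds for every possible $x_{n+1}$. Check whether this condition can still be enforced when $f(x_{n+1}) = \sum_{i=1}^n a_i k(x_i, x_{n+1}) \ne 0$.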

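1.2 Question 2: covariance, dependence

1. [3 points] Let $\mathcal{F}$ be a reproducing kernel Hilbert space defined on a domain $\mathcal{X}$, and $\mathcal{G}$ be a reproducing kernel Hilbert space defined on a domain $\mathcal{Y}$. The RKHS $\mathcal{F}$ has kernel $k(x, x')$ and feature map $\phi(x)$, and $\mathcal{G}$ has kernel $l(y, y')$ and feature map $\psi(y)$. Given the random variables $X \sim P_x$ on $\mathcal{X}$ and $Y \sim P_y$ on $\mathcal{Y}$, we define $\mu_X \in \mathcal{F}$ and $\mu_Y \in \mathcal{G}$ to be mean embeddings satisfying
\[ \langle \mu_X, f \rangle_{\mathcal{F}} = E_X f(X) \quad \forall f \in \mathcal{F}, \qquad \langle \mu_Y, g \rangle_{\mathcal{G}} = E_Y g(Y) \quad \forall g \in \mathcal{G}, \]
and in particular
\[ \langle \mu_X, \phi(x) \rangle_{\mathcal{F}} = \langle \mu_X, k(x, \cdot) \rangle_{\mathcal{F}} = E_X k(x, X), \tag{4} \]
\[ \langle \mu_Y, \psi(y) \rangle_{\mathcal{G}} = \langle \mu_Y, l(y, \cdot) \rangle_{\mathcal{G}} = E_Y l(y, Y). \tag{5} \]
The Hilbert-Schmidt operators mapping from $\mathcal{G}$ to $\mathcal{F}$ form a Hilbert space, written $\mathrm{HS}(\mathcal{G}, \mathcal{F})$. (Footnote: the inner product is
\[ \langle L, M \rangle_{\mathrm{HS}} = \sum_{j \in J} \langle L f_j, M f_j \rangle_{\mathcal{F}}, \tag{6} \]
independent of the choice of orthonormal basis $\{f_j\}_{j \in J}$ of $\mathcal{G}$; however, you don't need to use this information to answer the question.) Define the tensor product $f \otimes g \in \mathrm{HS}(\mathcal{G}, \mathcal{F})$ such that
\[ (f \otimes g) h = \langle g, h \rangle_{\mathcal{G}} f. \tag{7} \]
Show that
\[ \|\mu_X \otimes \mu_Y\|_{\mathrm{HS}}^2 = E_{XX'} k(X, X') \, E_{YY'} l(Y, Y'), \tag{8} \]
where $X'$ has distribution $P_x$ and is independent of $X$, and $Y'$ has distribution $P_y$ and is independent of $Y$. You may use without proof that
\[ \langle A, f \otimes g \rangle_{\mathrm{HS}} = \langle f, A g \rangle_{\mathcal{F}}, \tag{9} \]
where $A \in \mathrm{HS}(\mathcal{G}, \mathcal{F})$. Please reference the numbers of the above equations as you use them in your proof.

2. [4 points] Given a probability distribution $P_{xy}$ over the pair of random variables $(X, Y)$ with respective marginal distributions $P_x$ and $P_y$, the uncentered covariance operator $C_{XY}$ is an element of $\mathrm{HS}(\mathcal{G}, \mathcal{F})$ defined such that
\[ \langle C_{XY}, A \rangle_{\mathrm{HS}} = E_{XY} \langle \phi(X) \otimes \psi(Y), A \rangle_{\mathrm{HS}}. \tag{10} \]
The Hilbert-Schmidt Independence Criterion is defined in terms of kernels as
\[ \mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = \|C_{XY} - \mu_X \otimes \mu_Y\|_{\mathrm{HS}}^2. \]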

Prove that the population expression for HSIC in terms of expectations of kernels takes the form
\[ \mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = E_{XY} E_{X'Y'} [k(X, X') l(Y, Y')] + E_{XX'} k(X, X') \, E_{YY'} l(Y, Y') - 2 E_{XY} [E_{X'} k(X, X') \, E_{Y'} l(Y, Y')], \]
where the pair $(X', Y')$ has distribution $P_{xy}$ and is independent of $(X, Y)$. You will need eq. (8) from the previous part.

3. [2 points] Show that at independence, i.e., when $P_{xy} = P_x P_y$, then $\mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = 0$.

4. [2 points] Given a sample $z := \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $P_{xy}$, write an unbiased empirical estimate of $\|C_{XY}\|_{\mathrm{HS}}^2$.

5. [5 points] Derive a biased estimate of $\|C_{XY}\|_{\mathrm{HS}}^2$ by computing $\|\hat{C}_{XY}\|_{\mathrm{HS}}^2$, where
\[ \hat{C}_{XY} := \frac{1}{n} \sum_{i=1}^n \phi(x_i) \otimes \psi(y_i). \]
Derive an expression for the bias in the latter expression, i.e., the expected difference between this estimate and the unbiased estimate, in terms of expectations of kernel functions. What happens to the bias as $n$ increases?

6. [4 points] Consider a relation between $x$ and $y$ given as
\[ y_i = x_i^2 + \varepsilon_i, \]
where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise, and $x_i \sim U([-1, 1])$ is drawn from the uniform distribution on $[-1, 1]$. See Figure 1 for an illustration of pairs $(x_i, y_i)$ drawn i.i.d. according to this relation. What is the population HSIC when both $k$ and $l$ are linear, i.e. $k(x_i, x_j) = x_i x_j$ and $l(y_i, y_j) = y_i y_j$? No proof is needed, a description of your reasons is sufficient.

Next, define the maximum singular vectors $f \in \mathcal{F}$ and $g \in \mathcal{G}$ of the centered empirical covariance operator as
\[ \arg\max_{\|f\|_{\mathcal{F}} \le 1, \; \|g\|_{\mathcal{G}} \le 1} \left\langle f, \left( \hat{C}_{XY} - \hat{\mu}_X \otimes \hat{\mu}_Y \right) g \right\rangle_{\mathcal{F}}, \]
where $\hat{\mu}_X$ and $\hat{\mu}_Y$ are the empirical estimates of the respective mean embeddings. Sketch $f$ and $g$ when $k(x_i, x_j) = \exp\left(-(x_i - x_j)^2 / \gamma\right)$ is the RBF kernel, and $l(y_i, y_j)$ is the linear kernel (note: $g$ can only be a straight line in this case). Again, no proof is needed, only a sketch of what you expect to see.

Figure 1: Sample of relation between x and y. (Scatter plot; horizontal axis X, vertical axis Y.)

1.3 Question 3: kernel ranking

Ranking problem: we receive pairs $\{(x_i, y_i)\}_{i=1}^n$, where $x_i$ are the objects to be ranked, and $y_i \in \{1, 2, \ldots, M\}$ are the associated ranks. $M$ is the highest rank, 1 is the lowest rank; two points can have an equal rank, in which case $y_i = y_j$; we also assume $M < n$, and that at least one example is seen for every allowable $y$ value. We represent the input points in terms of feature maps $\phi(x_i)$ to a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k(x, x')$. We set up the following optimization problem:
\[ \min_{w \in \mathcal{H}, \; \xi^u, \xi^l \in \mathbb{R}^n, \; b \in \mathbb{R}^{M+1}} \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^n (\xi_i^l + \xi_i^u), \tag{11} \]
subject to
\[ \langle w, \phi(x_i) \rangle_{\mathcal{H}} \le b_{y_i} - 1 + \xi_i^l, \tag{12} \]
\[ \langle w, \phi(x_i) \rangle_{\mathcal{H}} \ge b_{y_i - 1} + 1 - \xi_i^u, \tag{13} \]
\[ \xi_i^u, \xi_i^l \ge 0, \]
where $\{b_y\}_{y=0}^M$ are parameters of the algorithm which must be learned, and $C > 0$ is a user-defined constant.

1. (4 points) Sketch a figure describing what the above optimization problem is doing.

2. (7 points) Write the Lagrangian for the kernel ranking problem. State the KKT conditions as they apply to the problem (you are given that strong duality holds; please define the meaning of strong duality). You may use
\[ \frac{d}{dw} \|w\|_{\mathcal{H}}^2 = 2w, \qquad \frac{d}{dw} \langle w, \phi(x_i) \rangle_{\mathcal{H}} = \phi(x_i). \]

3. (5 points) Show that the Lagrange dual function for this optimization problem takes the form
\[ g(\alpha^u, \alpha^l) = -\frac{1}{4} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i^u - \alpha_i^l)(\alpha_j^u - \alpha_j^l) k(x_i, x_j) + \sum_{i=1}^n (\alpha_i^l + \alpha_i^u). \]
Hint: from the previous part, you should have a form for $w$ that looks like
\[ w = \frac{1}{2} \sum_{i=1}^n (\alpha_i^u - \alpha_i^l) \phi(x_i). \]

4. (4 points) What do the KKT conditions imply about the allowable range of $\alpha_i$? Describe where points with $\alpha_i^u = 0$, $\alpha_i^u = C$, and $\alpha_i^u \in (0, C)$ are situated. Please provide proofs to justify your answers. You do not need to provide an accompanying figure (although you are welcome to do so if you find this makes things easier to explain).

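2 Answers

2.1 Question 1

1. The norm is written
\[
\begin{aligned}
\|f(\cdot)\|_{\mathcal{F}}^2 = \langle f(\cdot), f(\cdot) \rangle_{\mathcal{F}}
&= \left\langle \sum_{i=1}^n a_i k(x_i, \cdot), \sum_{j=1}^n a_j k(x_j, \cdot) \right\rangle_{\mathcal{F}} \\
&= \sum_{i=1}^n \sum_{j=1}^n a_i a_j \langle k(x_i, \cdot), k(x_j, \cdot) \rangle_{\mathcal{F}} \\
&= \sum_{i=1}^n \sum_{j=1}^n a_i a_j k(x_i, x_j),
\end{aligned}
\]
where the reproducing property is used in the final line.

2. The proof is:
\[
\begin{aligned}
\max_{x \in \mathcal{X}} |f(x)| &= \max_{x \in \mathcal{X}} |\langle f, \phi(x) \rangle_{\mathcal{F}}| \\
&\le \max_{x \in \mathcal{X}} \|f\|_{\mathcal{F}} \|\phi(x)\|_{\mathcal{F}} \\
&= \|f\|_{\mathcal{F}} \max_{x \in \mathcal{X}} \sqrt{\langle \phi(x), \phi(x) \rangle_{\mathcal{F}}} \\
&\le \|f\|_{\mathcal{F}} \sqrt{K} < \infty.
\end{aligned}
\]

3. First substituting in the covariance on the RHS, we have
\[
\begin{aligned}
f \lambda = \hat{C}_{XX} f &= \left( \frac{1}{n} \sum_{i=1}^n \phi(x_i) \otimes \phi(x_i) \right) f \\
&= \frac{1}{n} \sum_{i=1}^n \phi(x_i) \left\langle \phi(x_i), \sum_{j=1}^n \alpha_j \phi(x_j) \right\rangle_{\mathcal{F}} \\
&= \frac{1}{n} \sum_{i=1}^n \phi(x_i) \sum_{j=1}^n \alpha_j k(x_i, x_j).
\end{aligned}
\]
Now project both sides onto all of the $\phi(x_q)$, $\forall q \in \{1, \ldots, n\}$:
\[ \langle \phi(x_q), \mathrm{LHS} \rangle_{\mathcal{F}} = \lambda \langle \phi(x_q), f \rangle_{\mathcal{F}} = \lambda \sum_{i=1}^n \alpha_i k(x_q, x_i), \]
\[ \langle \phi(x_q), \mathrm{RHS} \rangle_{\mathcal{F}} = \frac{1}{n} \sum_{i=1}^n k(x_q, x_i) \sum_{j=1}^n \alpha_j k(x_i, x_j). \]
Writing this as a matrix equation,
\[ n \lambda K \alpha = K^2 \alpha \quad \text{or} \quad n \lambda \alpha = K \alpha, \]
where the second form uses the invertibility of $K$.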
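The eigen-equation in answer 3 can be checked numerically. The sketch below is only an illustration under a simplifying assumption that is not part of the question: the kernel is linear, so $\phi(x) = x$ and $\hat{C}_{XX}$ is an ordinary matrix. The variable names and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))   # row i is phi(x_i) = x_i (linear kernel, an assumption)

K = X @ X.T                   # Gram matrix, K_ij = k(x_i, x_j)
C_hat = X.T @ X / n           # (1/n) sum_i phi(x_i) phi(x_i)^T, no centering

# Solve K alpha = mu alpha; the matching eigenvalue of C_hat is lambda = mu / n.
mu, A = np.linalg.eigh(K)     # eigenvalues returned in ascending order
alpha = A[:, -1]              # coefficients of the leading eigenfunction
lam = mu[-1] / n

f = X.T @ alpha               # f = sum_i alpha_i phi(x_i)
residual = np.linalg.norm(C_hat @ f - lam * f)
print(residual)               # should be at the level of floating-point error
```

With a nonlinear kernel the feature map is usually not available explicitly, which is why the problem is solved through $K$ rather than through $\hat{C}_{XX}$.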

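4. Denote by $f_s$ the projection of $f$ onto the subspace
\[ \mathrm{span}\{k(x_i, \cdot) : 1 \le i \le n\}, \tag{14} \]
such that $f = f_s + f_\perp$, where $f_s = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$.

Regularizer: $\|f\|_{\mathcal{F}}^2 = \|f_s\|_{\mathcal{F}}^2 + \|f_\perp\|_{\mathcal{F}}^2 \ge \|f_s\|_{\mathcal{F}}^2$, so
\[ \Omega\left(\|f\|_{\mathcal{F}}^2\right) \ge \Omega\left(\|f_s\|_{\mathcal{F}}^2\right), \]
and this term is minimized for $f = f_s$.

Individual terms $f(x_i)$ in the loss:
\[ f(x_i) = \langle f, k(x_i, \cdot) \rangle_{\mathcal{F}} = \langle f_s + f_\perp, k(x_i, \cdot) \rangle_{\mathcal{F}} = \langle f_s, k(x_i, \cdot) \rangle_{\mathcal{F}}, \]
so
\[ L_y(f(x_1), \ldots, f(x_n)) = L_y(f_s(x_1), \ldots, f_s(x_n)). \]
Hence the loss $L(\ldots)$ only depends on the component of $f$ in the data subspace, and the regularizer $\Omega(\ldots)$ is minimized when $f = f_s$.

Note: if $\Omega$ is only non-decreasing, $f = f_s$ is a minimizer but minimizers with $f_\perp \ne 0$ may also exist. If $\Omega$ is strictly increasing, $\|f_\perp\|_{\mathcal{F}} = 0$ is required at the minimum, so every minimizer takes the stated form.

5. For $k$ identically zero, the statement holds trivially. Assume that $k$ is not identically zero. We expand out
\[ 0 \le \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} a_i a_j k(x_i, x_j) = \underbrace{\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j)}_{=0} + 2 a_{n+1} \underbrace{\sum_{i=1}^{n} a_i k(x_i, x_{n+1})}_{:=b} + a_{n+1}^2 \underbrace{k(x_{n+1}, x_{n+1})}_{:=c}. \]
Positive definiteness gives $c \ge 0$. If $c > 0$, the minimum of the above expression over $a_{n+1}$ occurs when $a_{n+1} = -b/c$. For the expression to be non-negative at this minimum,
\[ 0 \le c \frac{b^2}{c^2} - \frac{2 b b}{c} = -\frac{b^2}{c}, \]
and since $c > 0$ the only possibility is $b = 0$. If $c = 0$, the expression reduces to $2 a_{n+1} b$, which is non-negative for every $a_{n+1}$ only when $b = 0$. In both cases
\[ \sum_{i=1}^n a_i k(x_i, x_{n+1}) = 0 \quad \forall x_{n+1} \in \mathcal{X}. \]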
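The conclusion of answer 5 can be illustrated with a kernel that is positive definite but not strictly so. The sketch below assumes a linear kernel on $\mathbb{R}^2$ with three points (a hypothetical choice, not taken from the question), so that the Gram matrix has rank at most 2 and a non-zero weight vector with vanishing quadratic form exists.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 2))      # three points in R^2 (hypothetical data)
K = X @ X.T                      # linear-kernel Gram matrix, rank at most 2

# X^T is 2x3, so its null space is non-trivial; the last right singular
# vector is a unit-norm weight vector a with X^T a = 0.
_, _, Vt = np.linalg.svd(X.T)
a = Vt[-1]

quad_form = a @ K @ a            # sum_ij a_i a_j k(x_i, x_j), should vanish

# Evaluate f(x_{n+1}) = sum_i a_i k(x_i, x_{n+1}) at fresh points.
X_new = rng.normal(size=(5, 2))
f_new = X_new @ X.T @ a
print(quad_form, f_new)          # all entries should be numerically zero
```

Here the weights are not all zero, yet the function they define is zero everywhere, as the proof requires.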

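2.2 Question 2

1. The proof is:
\[
\begin{aligned}
\langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}}
&\overset{(a)}{=} \langle \mu_X, (\mu_X \otimes \mu_Y) \mu_Y \rangle_{\mathcal{F}} \\
&\overset{(b)}{=} \langle \mu_X, \mu_X \rangle_{\mathcal{F}} \langle \mu_Y, \mu_Y \rangle_{\mathcal{G}} \\
&\overset{(c)}{=} E_X \mu_X(X) \, E_Y \mu_Y(Y) \\
&\overset{(d)}{=} E_X \langle \mu_X, k(X, \cdot) \rangle_{\mathcal{F}} \, E_Y \langle \mu_Y, l(Y, \cdot) \rangle_{\mathcal{G}} \\
&\overset{(c)}{=} E_{XX'} k(X, X') \, E_{YY'} l(Y, Y'),
\end{aligned}
\]
where in step (a) we apply (9), in step (b) we apply (7), and in the two steps (c) we apply the mean embedding definitions, in particular (4) and (5). Step (d) is the reproducing property.

2. We begin with the expansion
\[ \mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = \|C_{XY} - \mu_X \otimes \mu_Y\|_{\mathrm{HS}}^2 = \langle C_{XY}, C_{XY} \rangle_{\mathrm{HS}} + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}} - 2 \langle C_{XY}, \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}}. \tag{15} \]
There are three terms in the expansion of (15). To write the first in terms of kernels, we apply (10) twice, then (9) and (7), denoting by $(X', Y')$ an independent copy of the pair of variables $(X, Y)$:
\[
\begin{aligned}
\|C_{XY}\|_{\mathrm{HS}}^2 = \langle C_{XY}, C_{XY} \rangle_{\mathrm{HS}}
&= E_{X,Y} \langle \phi(X) \otimes \psi(Y), C_{XY} \rangle_{\mathrm{HS}} \\
&= E_{X,Y} E_{X',Y'} \langle \phi(X) \otimes \psi(Y), \phi(X') \otimes \psi(Y') \rangle_{\mathrm{HS}} \\
&= E_{X,Y} E_{X',Y'} \langle \phi(X), [\phi(X') \otimes \psi(Y')] \psi(Y) \rangle_{\mathcal{F}} \\
&= E_{X,Y} E_{X',Y'} \left[ \langle \phi(X), \phi(X') \rangle_{\mathcal{F}} \langle \psi(Y), \psi(Y') \rangle_{\mathcal{G}} \right] \\
&= E_{X,Y} E_{X',Y'} [k(X, X') l(Y, Y')].
\end{aligned} \tag{16}
\]
The second term was proved previously in (8). For the cross-terms,
\[
\begin{aligned}
\langle C_{XY}, \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}}
&= E_{X,Y} \langle \phi(X) \otimes \psi(Y), \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}} \\
&= E_{X,Y} \left( \langle \phi(X), \mu_X \rangle_{\mathcal{F}} \langle \psi(Y), \mu_Y \rangle_{\mathcal{G}} \right) \\
&= E_{X,Y} [E_{X'} k(X, X') \, E_{Y'} l(Y, Y')].
\end{aligned}
\]

3. At independence, the expectations on the pair $(X, Y)$ factorize as products of expectations on $X$ and $Y$, hence
\[ \mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = E_{XX'} k(X, X') \, E_{YY'} l(Y, Y') + E_{XX'} k(X, X') \, E_{YY'} l(Y, Y') - 2 E_{XX'} k(X, X') \, E_{YY'} l(Y, Y') = 0. \]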
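As a rough numerical companion to answers 2 and 3, each expectation in the three-term expression can be replaced by a sample average. This plug-in estimate is biased and is only a sketch: the RBF kernel on both variables, the bandwidth, the noise level 0.1 and the function names `rbf_gram` and `hsic_plugin` are all assumptions made for illustration.

```python
import numpy as np

def rbf_gram(z, gamma=1.0):
    """Gram matrix of the RBF kernel exp(-(z_i - z_j)^2 / gamma) for 1-D data."""
    d2 = (z[:, None] - z[None, :]) ** 2
    return np.exp(-d2 / gamma)

def hsic_plugin(K, L):
    """Plug-in (biased) estimate of the three-term HSIC^2 expression."""
    n = K.shape[0]
    term1 = (K * L).sum() / n**2                            # E_XY E_X'Y' [k l]
    term2 = K.sum() * L.sum() / n**4                        # E_XX' k  E_YY' l
    term3 = (K.sum(axis=1) * L.sum(axis=1)).sum() / n**3    # E_XY [E_X' k  E_Y' l]
    return term1 + term2 - 2.0 * term3

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y_dep = x**2 + 0.1 * rng.normal(size=n)                     # depends on x
y_ind = rng.uniform(-1.0, 1.0, size=n)**2 + 0.1 * rng.normal(size=n)  # independent of x

K = rbf_gram(x)
L_dep = rbf_gram(y_dep)
L_ind = rbf_gram(y_ind)
print(hsic_plugin(K, L_dep), hsic_plugin(K, L_ind))
```

The estimate for the independent sample is expected to be close to zero, although not exactly zero at finite sample size.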

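4. An unbiased estimate of $A := \|C_{XY}\|_{\mathrm{HS}}^2$ is
\[ \hat{A} := \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \ne i} k_{ij} l_{ij}, \]
where we use the shorthand $k_{ij} = k(x_i, x_j)$ and $l_{ij} = l(y_i, y_j)$. Note that $E(\hat{A}) = E_{X,Y} E_{X',Y'} k(X, X') l(Y, Y') = \|C_{XY}\|_{\mathrm{HS}}^2$ from eq. (16).

5. The biased estimate of $A := \|C_{XY}\|_{\mathrm{HS}}^2$ is
\[ \hat{A}_b := \|\hat{C}_{XY}\|_{\mathrm{HS}}^2 = \left\langle \frac{1}{n} \sum_{i=1}^n \phi(x_i) \otimes \psi(y_i), \frac{1}{n} \sum_{j=1}^n \phi(x_j) \otimes \psi(y_j) \right\rangle_{\mathrm{HS}} = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n k_{ij} l_{ij} = \frac{1}{n^2} \mathrm{tr}(KL). \]
The difference between the biased and unbiased estimates is
\[
\begin{aligned}
\hat{A}_b - \hat{A} &= \frac{1}{n^2} \sum_{i,j=1}^n k_{ij} l_{ij} - \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \ne i} k_{ij} l_{ij} \\
&= \frac{1}{n^2} \sum_{i=1}^n k_{ii} l_{ii} + \left( \frac{1}{n^2} - \frac{1}{n(n-1)} \right) \sum_{i=1}^n \sum_{j \ne i} k_{ij} l_{ij} \\
&= \frac{1}{n} \left( \frac{1}{n} \sum_{i=1}^n k_{ii} l_{ii} - \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \ne i} k_{ij} l_{ij} \right),
\end{aligned}
\]
thus the expectation of this difference (i.e., the bias) is
\[ E\left( \hat{A}_b - \hat{A} \right) = \frac{1}{n} \left( E_{XY} [k(X, X) l(Y, Y)] - E_{X,Y} E_{X',Y'} [k(X, X') l(Y, Y')] \right), \]
and is therefore $O(n^{-1})$: the bias vanishes as $n$ increases.

6. When both kernels are linear the population HSIC will be zero, as there is no pair of functions in these function classes which can transform the variables to have a high linear covariance. Concretely, with linear kernels the criterion only measures the covariance between $x$ and $y$, and $\mathrm{cov}(x, x^2 + \varepsilon) = E[x^3] = 0$ for $x$ distributed symmetrically about zero. When $k$ is an RBF kernel, and $l$ is a linear kernel, we expect the mappings in Figure 2.

Figure 2: Maximum singular vectors of covariance operator. Left plot is original point cloud, center plot contains both mappings, right plot contains mapped variables. (Panel titles: "Dependence witness, X", showing f(x) against x, and "Dependence witness, Y", showing g(y) against y; the mapped-variable plot shows g(Y) against f(X) and is annotated "Correlation: 0.94".)

2.3 Question 3

This is a minor modification of the ranking algorithm in [1, Section 8.1.1].

1. Sketch is in Figure 3. The algorithm sets the thresholds $\{b_y\}_{y=0}^M$ such that $\langle w, \phi(x_i) \rangle_{\mathcal{H}}$ is beneath the threshold $b_{y_i}$ by a margin $1/\|w\|_{\mathcal{H}}$, but above the threshold $b_{y_i - 1}$ by a margin $1/\|w\|_{\mathcal{H}}$. Some points are allowed within the margins, however these attract a penalty of $\xi_i^l$ or $\xi_i^u$, respectively (the sum of these penalties constitutes the loss). The parameter $C$ trades off the margin size with the loss.

2. Strong duality means that the maximum of the dual function coincides with the minimum of the primal function subject to the problem constraints. Recall the optimization problem:
\[ \min_{w \in \mathcal{H}, \; \xi^u, \xi^l \in \mathbb{R}^n, \; b \in \mathbb{R}^{M+1}} \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^n (\xi_i^l + \xi_i^u), \tag{17} \]
subject to
\[ \langle w, \phi(x_i) \rangle_{\mathcal{H}} \le b_{y_i} - 1 + \xi_i^l, \tag{18} \]
\[ \langle w, \phi(x_i) \rangle_{\mathcal{H}} \ge b_{y_i - 1} + 1 - \xi_i^u, \tag{19} \]
\[ \xi_i^u, \xi_i^l \ge 0. \]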
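To make the role of the slack variables concrete: for a fixed $w$ and fixed thresholds, the smallest slacks satisfying the constraints, read here as $\langle w, \phi(x_i) \rangle \le b_{y_i} - 1 + \xi_i^l$ and $\langle w, \phi(x_i) \rangle \ge b_{y_i - 1} + 1 - \xi_i^u$, are hinge-type quantities. The sketch below evaluates them and the resulting objective; the function name `ranking_objective` and the toy numbers are hypothetical.

```python
import numpy as np

def ranking_objective(proj, y, b, w_norm_sq, C):
    """Smallest feasible slacks and objective value for fixed w and b.

    proj[i] stands for <w, phi(x_i)>, y[i] is a rank in {1, ..., M},
    and b holds the thresholds b_0, ..., b_M (length M + 1).
    """
    xi_l = np.maximum(0.0, proj - b[y] + 1.0)       # too close to or above b_{y_i}
    xi_u = np.maximum(0.0, b[y - 1] + 1.0 - proj)   # too close to or below b_{y_i - 1}
    objective = w_norm_sq + C * (xi_l + xi_u).sum()
    return xi_l, xi_u, objective

# Toy example with M = 2 ranks and hand-picked thresholds.
b = np.array([-3.0, 0.0, 3.0])
y = np.array([1, 1, 2, 2])
proj = np.array([-1.5, -0.5, 1.0, 4.0])

xi_l, xi_u, objective = ranking_objective(proj, y, b, w_norm_sq=1.0, C=2.0)
print(xi_l, xi_u, objective)
```

In this toy example the second point lies inside the margin below $b_1$ and the fourth point lies above $b_2$, so only those two incur a penalty.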

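Figure 3: Sketch of ranking algorithm. (Labels in the sketch: the mapped points $\phi(x_1)$, $\phi(x_2)$, $\phi(x_3)$, the direction $w$, and the thresholds $b_0$, $b_1$, $b_2$, $b_3$.)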

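The Lagrangian is:
\[
\begin{aligned}
L := {} & \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^n (\xi_i^l + \xi_i^u) - \sum_{i=1}^n (\eta_i^l \xi_i^l + \eta_i^u \xi_i^u) \\
& + \sum_{i=1}^n \alpha_i^l \left( \langle w, \phi(x_i) \rangle_{\mathcal{H}} - b_{y_i} + 1 - \xi_i^l \right) \\
& + \sum_{i=1}^n \alpha_i^u \left( -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi_i^u \right).
\end{aligned}
\]
The KKT conditions: knowing strong duality holds and using general notation
\[ \text{minimize } f_0(x) \quad \text{subject to } f_i(x) \le 0, \quad i = 1, \ldots, m \tag{20} \]
for convex $f_0, \ldots, f_m$, the KKT conditions are
\[
\begin{aligned}
f_i(x) &\le 0, \quad i = 1, \ldots, m \\
\lambda_i &\ge 0, \quad i = 1, \ldots, m \\
\lambda_i f_i(x) &= 0, \quad i = 1, \ldots, m \\
\nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) &= 0.
\end{aligned} \tag{21}
\]
These are necessary and sufficient for optimality under strong duality. The condition $\lambda_i f_i = 0$ translates to
\[
\begin{aligned}
0 &= \eta_i^l \xi_i^l \\
0 &= \eta_i^u \xi_i^u \\
0 &= \alpha_i^l \left( \langle w, \phi(x_i) \rangle_{\mathcal{H}} - b_{y_i} + 1 - \xi_i^l \right) \\
0 &= \alpha_i^u \left( -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi_i^u \right).
\end{aligned}
\]
The dual variables satisfy $\alpha_i^l, \alpha_i^u, \eta_i^l, \eta_i^u \ge 0$. Taking derivatives wrt the primal parameters and setting to zero gives the remaining KKT conditions for this problem,
\[ \frac{\partial L}{\partial w} = 2w + \sum_{i=1}^n \alpha_i^l \phi(x_i) - \sum_{i=1}^n \alpha_i^u \phi(x_i) = 0 \tag{22} \]
\[ \frac{\partial L}{\partial \xi_i^l} = C - \alpha_i^l - \eta_i^l = 0 \tag{23} \]
\[ \frac{\partial L}{\partial \xi_i^u} = C - \alpha_i^u - \eta_i^u = 0 \tag{24} \]
\[ \frac{\partial L}{\partial b_y} = -\sum_{i : y_i = y} \alpha_i^l + \sum_{i : y_i = y + 1} \alpha_i^u = 0 \quad \forall y \in \{1, \ldots, M - 1\} \tag{25} \]
\[ \frac{\partial L}{\partial b_0} = \sum_{i : y_i = 1} \alpha_i^u = 0 \tag{26} \]
\[ \frac{\partial L}{\partial b_M} = -\sum_{i : y_i = M} \alpha_i^l = 0, \tag{27} \]
where (25) applies for each interior threshold $y \in \{1, \ldots, M - 1\}$, and (26), (27) cover the boundary thresholds $b_0$ and $b_M$. We interpret (25) to state that $b_y$ is the upper threshold for points with rank $y$, and the lower threshold for points of rank $y + 1$.

3. We use the minimum Lagrangian wrt the primal parameters, which we can readily compute since we have the point at which the primal derivatives are zero. From (22),
\[ w = \frac{1}{2} \sum_{i=1}^n (\alpha_i^u - \alpha_i^l) \phi(x_i). \]
Substituting the KKT conditions back into the Lagrangian, we get the Lagrange dual function,
\[
\begin{aligned}
g(\alpha^u, \alpha^l) := {} & \frac{1}{4} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i^u - \alpha_i^l)(\alpha_j^u - \alpha_j^l) k(x_i, x_j) + C \sum_{i=1}^n (\xi_i^l + \xi_i^u) \\
& + \sum_{i=1}^n \alpha_i^l \left( \frac{1}{2} \sum_{j=1}^n (\alpha_j^u - \alpha_j^l) k(x_i, x_j) - b_{y_i} + 1 - \xi_i^l \right) \\
& + \sum_{i=1}^n \alpha_i^u \left( -\frac{1}{2} \sum_{j=1}^n (\alpha_j^u - \alpha_j^l) k(x_i, x_j) + b_{y_i - 1} + 1 - \xi_i^u \right) \\
& - \sum_{i=1}^n \left[ \xi_i^l (C - \alpha_i^l) + \xi_i^u (C - \alpha_i^u) \right] \\
= {} & -\frac{1}{4} \sum_{i=1}^n \sum_{j=1}^n (\alpha_i^u - \alpha_i^l)(\alpha_j^u - \alpha_j^l) k(x_i, x_j) + \sum_{i=1}^n (\alpha_i^l + \alpha_i^u).
\end{aligned}
\]
Here the slack terms cancel by (23) and (24), the threshold terms cancel by (25)-(27), and the constants $+1$ in the two constraint terms leave the linear term in $\alpha$. To get the desired solution, $g$ must be maximized wrt $\alpha_i^u, \alpha_i^l$, subject to $0 \le \alpha_i^u, \alpha_i^l \le C$ and (25)-(27).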
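The cancellation in answer 3 can be sanity-checked numerically: for dual variables that satisfy the stationarity conditions in $\xi$ and $b$, the Lagrangian evaluated at $w = \frac{1}{2} \sum_i (\alpha_i^u - \alpha_i^l) \phi(x_i)$ should not depend on $b$ or $\xi$, and should equal $-\frac{1}{4} \sum_{ij} (\alpha_i^u - \alpha_i^l)(\alpha_j^u - \alpha_j^l) k(x_i, x_j) + \sum_i (\alpha_i^l + \alpha_i^u)$. The sketch assumes a linear kernel ($\phi(x) = x$) and hand-picked dual variables; all numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 1.0
y = np.array([1, 1, 2, 2, 3, 3])          # ranks, M = 3, two points per rank
X = rng.normal(size=(6, 2))               # phi(x_i) = x_i (linear kernel, an assumption)
K = X @ X.T

# Hand-picked dual variables in [0, C] satisfying the b-stationarity conditions:
# alpha_u vanishes on rank 1, alpha_l vanishes on rank M, and the alpha_l mass
# on rank y equals the alpha_u mass on rank y + 1.
alpha_u = np.array([0.0, 0.0, 0.3, 0.1, 0.2, 0.5])
alpha_l = np.array([0.25, 0.15, 0.4, 0.3, 0.0, 0.0])
eta_u, eta_l = C - alpha_u, C - alpha_l   # stationarity in xi

# Arbitrary thresholds b_0..b_M and non-negative slacks.
b = rng.normal(size=4)
xi_l = rng.uniform(size=6)
xi_u = rng.uniform(size=6)

beta = alpha_u - alpha_l
w = 0.5 * X.T @ beta                      # stationarity in w
proj = X @ w                              # <w, phi(x_i)>

lagrangian = (
    w @ w
    + C * (xi_l + xi_u).sum()
    - (eta_l * xi_l + eta_u * xi_u).sum()
    + (alpha_l * (proj - b[y] + 1.0 - xi_l)).sum()
    + (alpha_u * (-proj + b[y - 1] + 1.0 - xi_u)).sum()
)
dual = -0.25 * beta @ K @ beta + (alpha_l + alpha_u).sum()
print(lagrangian, dual)                   # the two values should agree
```

Changing `b`, `xi_l` or `xi_u` should leave the agreement intact, which is the numerical counterpart of those terms dropping out of the dual function.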

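4. From (23) and (24) together with $\eta_i^l, \eta_i^u \ge 0$, the dual variables are confined to $0 \le \alpha_i^l \le C$ and $0 \le \alpha_i^u \le C$. There are three cases:

(a) When $\alpha_i^u = C$, then from (24), $\eta_i^u = 0$ for these points, and it is possible for $\xi_i^u > 0$ from (21). Next,
\[ 0 = -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi_i^u \implies \langle w, \phi(x_i) \rangle_{\mathcal{H}} = b_{y_i - 1} + 1 - \xi_i^u, \]
and the projection $\langle w, \phi(x_i) \rangle_{\mathcal{H}}$ is above the threshold $b_{y_i - 1}$ by $1 - \xi_i^u$ (potentially within the margin, or even on the wrong side of the threshold for large enough $\xi_i^u$).

(b) When $\alpha_i^u = 0$ then $\eta_i^u = C$, hence $\xi_i^u = 0$, and
\[ -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi_i^u \le 0 \implies \langle w, \phi(x_i) \rangle_{\mathcal{H}} \ge b_{y_i - 1} + 1, \]
and the point is on or above the margin for the lower threshold.

(c) When $\alpha_i^u \in (0, C)$ then $\eta_i^u \ne 0$, hence $\xi_i^u = 0$. Moreover
\[ 0 = -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi_i^u \implies \langle w, \phi(x_i) \rangle_{\mathcal{H}} = b_{y_i - 1} + 1, \]
and these points are on the margin above the lower threshold $b_{y_i - 1}$.

References

[1] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.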