
Questions and answers, kernel part

October 8, 2015

1 Questions

1.1 Question 1: properties of kernels, PCA, representer theorem

1. [2 points] Let $\mathcal{F}$ be an RKHS defined on some domain $\mathcal{X}$, with feature map $\phi(x)$ for $x \in \mathcal{X}$ and reproducing kernel $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{F}}$. Recall the reproducing property: for $f(\cdot) \in \mathcal{F}$,
$$\langle f(\cdot), \phi(x) \rangle_{\mathcal{F}} = \langle f(\cdot), k(x, \cdot) \rangle_{\mathcal{F}} = f(x) \qquad (1)$$
(we will equivalently use the shorthand $f \in \mathcal{F}$). Given that $f$ takes the form $f(\cdot) = \sum_{i=1}^{n} a_i k(x_i, \cdot)$, show that
$$\|f(\cdot)\|^2_{\mathcal{F}} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i k(x_i, x_j) a_j.$$

2. [3 points] Show that for a function $f \in \mathcal{F}$, $\max_{x \in \mathcal{X}} f(x) < \infty$ when the kernel is bounded, i.e. $k(x, x') \le K < \infty$ for all $x, x' \in \mathcal{X}$. You will need Cauchy-Schwarz, $\langle f_1, f_2 \rangle_{\mathcal{F}} \le \|f_1\|_{\mathcal{F}} \|f_2\|_{\mathcal{F}}$ for $f_1, f_2 \in \mathcal{F}$, and the knowledge that $\|f\|_{\mathcal{F}} < \infty$, since otherwise $f$ would not be in $\mathcal{F}$.

3. [5 points] Define the empirical feature space covariance (ignore centering) as
$$\widehat{C}_{XX} := \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \otimes \phi(x_i),$$
where $(f_1 \otimes f_2) f_3 = \langle f_2, f_3 \rangle_{\mathcal{F}} f_1$ for $f_1, f_2, f_3 \in \mathcal{F}$. The eigenfunctions $f$ of $\widehat{C}_{XX}$ satisfy $f \lambda = \widehat{C}_{XX} f$.

Assuming $f(\cdot) = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$, show that $\alpha \in \mathbb{R}^n$ is given by the solutions to
$$n \lambda \alpha = K \alpha, \qquad K_{ij} = k(x_i, x_j),$$
assuming $K$ is invertible.

4. [5 points] We have a set of paired observations $(x_1, y_1), \ldots, (x_n, y_n)$ (regression or classification). We are given the learning problem
$$f^* = \arg\min_{f \in \mathcal{F}} J(f), \qquad (2)$$
where
$$J(f) = L_y\left(f(x_1), \ldots, f(x_n)\right) + \Omega\left(\|f\|^2_{\mathcal{F}}\right),$$
the loss $L_y$ depends on $x_i$ only via $f(x_i)$, $\Omega$ is non-decreasing, and $y$ is the vector of the $y_i$. Prove that a solution takes the form
$$f^* = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot)$$
(this is the representer theorem).

5. [5 points] A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if, for all $n \ge 1$, $(a_1, \ldots, a_n) \in \mathbb{R}^n$ and $(x_1, \ldots, x_n) \in \mathcal{X}^n$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \ge 0, \qquad (3)$$
and strictly positive definite if the equality to zero holds only when $a_i = 0$ for all $i \in \{1, \ldots, n\}$. We consider the case where the positive definiteness is not strict. In this case, there exists some set of weights $\{a_i\}_{i=1}^{n}$ and corresponding points $\{x_i\}_{i=1}^{n}$ such that
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) = 0.$$
Show that the function
$$f(x_{n+1}) = \sum_{i=1}^{n} a_i k(x_i, x_{n+1}) = 0$$
at every point $x_{n+1} \in \mathcal{X}$. This is a powerful result: it shows that $\|f\|_{\mathcal{H}} = 0 \implies f(x) = 0$ for all $x \in \mathcal{X}$.

Hints: since $k$ is positive definite, it remains true that
$$\sum_{i=1}^{n+1} \sum_{j=1}^{n+1} a_i a_j k(x_i, x_j) \ge 0.$$
Find the condition on $a_{n+1}$ to ensure this holds for every possible $x_{n+1}$. Check whether this condition can still be enforced when $f(x_{n+1}) = \sum_{i=1}^{n} a_i k(x_i, x_{n+1}) \ne 0$.

1.2 Question 2: covariance, dependence

1. [3 points] Let $\mathcal{F}$ be a reproducing kernel Hilbert space defined on a domain $\mathcal{X}$, and $\mathcal{G}$ be a reproducing kernel Hilbert space defined on a domain $\mathcal{Y}$. The RKHS $\mathcal{F}$ has kernel $k(x, x')$ and feature map $\phi(x)$, and $\mathcal{G}$ has kernel $l(y, y')$ and feature map $\psi(y)$. Given the random variables $X \sim P_x$ on $\mathcal{X}$ and $Y \sim P_y$ on $\mathcal{Y}$, we define $\mu_X \in \mathcal{F}$ and $\mu_Y \in \mathcal{G}$ to be mean embeddings satisfying
$$\langle \mu_X, f \rangle_{\mathcal{F}} = \mathbb{E}_X f(X) \quad \forall f \in \mathcal{F}, \qquad \langle \mu_Y, g \rangle_{\mathcal{G}} = \mathbb{E}_Y g(Y) \quad \forall g \in \mathcal{G},$$
and in particular
$$\langle \mu_X, \phi(x) \rangle_{\mathcal{F}} = \langle \mu_X, k(x, \cdot) \rangle_{\mathcal{F}} = \mathbb{E}_X k(x, X), \qquad (4)$$
and
$$\langle \mu_Y, \psi(y) \rangle_{\mathcal{G}} = \langle \mu_Y, l(y, \cdot) \rangle_{\mathcal{G}} = \mathbb{E}_Y l(y, Y). \qquad (5)$$
The Hilbert-Schmidt operators mapping from $\mathcal{G}$ to $\mathcal{F}$ form a Hilbert space, written $\mathrm{HS}(\mathcal{G}, \mathcal{F})$. (The inner product is
$$\langle L, M \rangle_{\mathrm{HS}} = \sum_{j \in J} \langle L f_j, M f_j \rangle_{\mathcal{F}}, \qquad (6)$$
independent of the choice of orthonormal basis $\{f_j\}_{j \in J}$ of $\mathcal{G}$; however, you don't need to use this information to answer the question.) Define the tensor product $f \otimes g \in \mathrm{HS}(\mathcal{G}, \mathcal{F})$ such that
$$(f \otimes g)\, h = \langle g, h \rangle_{\mathcal{G}} \, f. \qquad (7)$$
Show that
$$\|\mu_X \otimes \mu_Y\|^2_{\mathrm{HS}} = \mathbb{E}_{XX'} k(X, X') \, \mathbb{E}_{YY'} l(Y, Y'), \qquad (8)$$
where $X'$ has distribution $P_x$ and is independent of $X$, and $Y'$ has distribution $P_y$ and is independent of $Y$. You may use without proof that
$$\langle A, f \otimes g \rangle_{\mathrm{HS}} = \langle f, A g \rangle_{\mathcal{F}}, \qquad (9)$$
where $A \in \mathrm{HS}(\mathcal{G}, \mathcal{F})$. Please reference the numbers of the above equations as you use them in your proof.

2. [4 points] Given a probability distribution $P_{xy}$ over the pair of random variables $(X, Y)$ with respective marginal distributions $P_x$ and $P_y$, the uncentered covariance operator $C_{XY}$ is an element of $\mathrm{HS}(\mathcal{G}, \mathcal{F})$ defined such that
$$\langle C_{XY}, A \rangle_{\mathrm{HS}} = \mathbb{E}_{XY} \langle \phi(X) \otimes \psi(Y), A \rangle_{\mathrm{HS}}. \qquad (10)$$
The Hilbert-Schmidt Independence Criterion is defined in terms of kernels as
$$\mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = \|C_{XY} - \mu_X \otimes \mu_Y\|^2_{\mathrm{HS}}.$$

Prove that the population expression for $\mathrm{HSIC}^2$ in terms of expectations of kernels takes the form
$$\mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = \mathbb{E}_{XY} \mathbb{E}_{X'Y'}\left[k(X, X')\, l(Y, Y')\right] + \mathbb{E}_{XX'} k(X, X')\, \mathbb{E}_{YY'} l(Y, Y') - 2\, \mathbb{E}_{XY}\left[\mathbb{E}_{X'} k(X, X')\, \mathbb{E}_{Y'} l(Y, Y')\right],$$
where the pair $(X', Y')$ has distribution $P_{xy}$ and is independent of $(X, Y)$. You will need eq. (8) from the previous section.

3. [2 points] Show that at independence, i.e., when $P_{xy} = P_x P_y$, then $\mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = 0$.

4. [2 points] Given a sample $z := \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $P_{xy}$, write an unbiased empirical estimate of $\|C_{XY}\|^2_{\mathrm{HS}}$.

5. [5 points] Derive a biased estimate of $\|C_{XY}\|^2_{\mathrm{HS}}$ by computing $\|\widehat{C}_{XY}\|^2_{\mathrm{HS}}$, where
$$\widehat{C}_{XY} := \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \otimes \psi(y_i).$$
Derive an expression for the bias in the latter expression, i.e., the expected difference between this estimate and the unbiased estimate, in terms of expectations of kernel functions. What happens to the bias as $n$ increases?

6. [4 points] Consider a relation between $x$ and $y$ given as $y_i = x_i^2 + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise, and $x_i \sim U([-1, 1])$ is drawn from the uniform distribution on $[-1, 1]$. See Figure 1 for an illustration of pairs $(x_i, y_i)$ drawn i.i.d. according to this relation. What is the population HSIC when both $k$ and $l$ are linear, i.e. $k(x_i, x_j) = x_i x_j$ and $l(y_i, y_j) = y_i y_j$? No proof is needed; a description of your reasons is sufficient.

Next, define the maximum singular vectors $f \in \mathcal{F}$ and $g \in \mathcal{G}$ of the centered empirical covariance operator as
$$\arg\max_{\substack{f \in \mathcal{F},\, \|f\|_{\mathcal{F}} \le 1 \\ g \in \mathcal{G},\, \|g\|_{\mathcal{G}} \le 1}} \left\langle f, \left(\widehat{C}_{XY} - \hat{\mu}_X \otimes \hat{\mu}_Y\right) g \right\rangle_{\mathcal{F}},$$
where $\hat{\mu}_X$ and $\hat{\mu}_Y$ are the empirical estimates of the respective mean embeddings. Sketch $f$ and $g$ when $k(x_i, x_j) = \exp\left(-(x_i - x_j)^2 / \gamma\right)$ is the RBF kernel, and $l(y_i, y_j)$ is the linear kernel (note: $g$ can only be a straight line in this case). Again, no proof is needed, only a sketch of what you expect to see.

[Figure 1: Sample of relation between $x$ and $y$ (scatter of the pairs $(x_i, y_i)$; $X$ on the horizontal axis, $Y$ on the vertical axis).]

1.3 Question 3: kernel ranking

Ranking problem: we receive pairs $\{(x_i, y_i)\}_{i=1}^{n}$, where the $x_i$ are the objects to be ranked, and $y_i \in \{1, 2, \ldots, M\}$ are the associated ranks. $M$ is the highest rank and $1$ is the lowest rank; two points can have an equal rank, in which case $y_i = y_j$; we also assume $M < n$, and that at least one example is seen for every allowable $y$ value. We represent the input points in terms of feature maps $\phi(x_i)$ to a reproducing kernel Hilbert space $\mathcal{H}$ with kernel $k(x, x')$. We set up the following optimization problem:
$$\min_{w \in \mathcal{H},\; \xi^u, \xi^l \in \mathbb{R}^n,\; b \in \mathbb{R}^{M+1}} \|w\|^2_{\mathcal{H}} + C \sum_{i=1}^{n} (\xi^l_i + \xi^u_i), \qquad (11)$$
subject to
$$\langle w, \phi(x_i) \rangle_{\mathcal{H}} \le b_{y_i} - 1 + \xi^l_i \qquad (12)$$
$$\langle w, \phi(x_i) \rangle_{\mathcal{H}} \ge b_{y_i - 1} + 1 - \xi^u_i \qquad (13)$$
$$\xi^u_i, \xi^l_i \ge 0,$$
where $\{b_y\}_{y=0}^{M}$ are parameters of the algorithm which must be learned, and $C > 0$ is a user-defined constant.

1. (4 points) Sketch a figure describing what the above optimization problem is doing.

2. (7 points) Write the Lagrangian for the kernel ranking problem. State the KKT conditions as they apply to the problem (you are given that strong duality holds; please define the meaning of strong duality). You may use
$$\frac{d}{dw} \|w\|^2_{\mathcal{H}} = 2w, \qquad \frac{d}{dw} \langle w, \phi(x_i) \rangle_{\mathcal{H}} = \phi(x_i).$$

3. (5 points) Show that the Lagrange dual function for this optimization problem takes the form
$$g(\alpha^u, \alpha^l) = \sum_{i=1}^{n} (\alpha^u_i + \alpha^l_i) - \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha^u_i - \alpha^l_i)(\alpha^u_j - \alpha^l_j) k(x_i, x_j).$$
Hint: from the previous part, you should have a form for $w$ that looks like
$$w = \frac{1}{2} \sum_{i=1}^{n} (\alpha^u_i - \alpha^l_i) \phi(x_i).$$

4. (4 points) What do the KKT conditions imply about the allowable range of the $\alpha_i$? Describe where points with $\alpha^u_i = 0$, $\alpha^u_i = C$, and $\alpha^u_i \in (0, C)$ are situated. Please provide proofs to justify your answers. You do not need to provide an accompanying figure (although you are welcome to do so if you find this makes things easier to explain).

2 Answers

2.1 Question 1

1. The norm is written
$$\|f(\cdot)\|^2_{\mathcal{F}} = \langle f(\cdot), f(\cdot) \rangle_{\mathcal{F}} = \left\langle \sum_{i=1}^{n} a_i k(x_i, \cdot), \sum_{j=1}^{n} a_j k(x_j, \cdot) \right\rangle_{\mathcal{F}} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \langle k(x_i, \cdot), k(x_j, \cdot) \rangle_{\mathcal{F}} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j),$$
where the reproducing property is used in the final step.

2. The proof is:
$$\max_{x \in \mathcal{X}} f(x) = \max_{x \in \mathcal{X}} \langle f, \phi(x) \rangle_{\mathcal{F}} \le \|f\|_{\mathcal{F}} \max_{x \in \mathcal{X}} \|\phi(x)\|_{\mathcal{F}} = \|f\|_{\mathcal{F}} \max_{x \in \mathcal{X}} \sqrt{\langle \phi(x), \phi(x) \rangle_{\mathcal{F}}} \le \|f\|_{\mathcal{F}} \sqrt{K} < \infty.$$

3. First substituting in the covariance on the right-hand side, we have
$$f \lambda = \widehat{C}_{XX} f = \left( \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \otimes \phi(x_i) \right) f = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \left\langle \phi(x_i), \sum_{j=1}^{n} \alpha_j \phi(x_j) \right\rangle_{\mathcal{F}} = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \sum_{j=1}^{n} \alpha_j k(x_i, x_j).$$
Now project both sides onto each of the $\phi(x_q)$:
$$\langle \phi(x_q), \mathrm{LHS} \rangle_{\mathcal{F}} = \lambda \langle \phi(x_q), f \rangle_{\mathcal{F}} = \lambda \sum_{i=1}^{n} \alpha_i k(x_q, x_i), \qquad q \in \{1, \ldots, n\}.$$
Writing this as a matrix equation,
$$\lambda K \alpha = \frac{1}{n} K^2 \alpha \quad \text{or} \quad n \lambda \alpha = K \alpha.$$
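As an illustration of parts 1 and 3 (this numerical sketch is an addition, not part of the original answers), the following numpy snippet builds a Gram matrix $K$ for an arbitrary toy dataset and RBF kernel, checks the identity $\|f\|^2_{\mathcal{F}} = a^\top K a$, and obtains the (nonzero) eigenvalues of the empirical covariance operator from the eigenvalue problem $n \lambda \alpha = K \alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 2))                       # toy inputs, n = 20 points in R^2

def rbf_gram(a, b, gamma=1.0):
    """Gram matrix K_ij = exp(-||a_i - b_j||^2 / gamma)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / gamma)

K = rbf_gram(x, x)
n = K.shape[0]

# Part 1: for f = sum_i a_i k(x_i, .), the squared RKHS norm is a^T K a.
a = rng.normal(size=n)
print("||f||^2_F =", a @ K @ a)

# Part 3: the nonzero eigenvalues of the empirical covariance operator
# coincide with the eigenvalues of K / n (from n*lambda*alpha = K*alpha).
evals, alphas = np.linalg.eigh(K)
print("largest covariance eigenvalue:", evals[-1] / n)
```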

4. Denote by $f_s$ the projection of $f$ onto the subspace
$$\mathrm{span}\left\{ k(x_i, \cdot) \,:\, 1 \le i \le n \right\}, \qquad (14)$$
such that
$$f = f_s + f_\perp, \qquad \text{where } f_s = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot).$$
Regularizer: $\|f\|^2_{\mathcal{F}} = \|f_s\|^2_{\mathcal{F}} + \|f_\perp\|^2_{\mathcal{F}} \ge \|f_s\|^2_{\mathcal{F}}$, so
$$\Omega\left(\|f\|^2_{\mathcal{F}}\right) \ge \Omega\left(\|f_s\|^2_{\mathcal{F}}\right),$$
and this term is minimized for $f = f_s$. Individual terms $f(x_i)$ in the loss:
$$f(x_i) = \langle f, k(x_i, \cdot) \rangle_{\mathcal{F}} = \langle f_s + f_\perp, k(x_i, \cdot) \rangle_{\mathcal{F}} = \langle f_s, k(x_i, \cdot) \rangle_{\mathcal{F}},$$
so
$$L_y\left(f(x_1), \ldots, f(x_n)\right) = L_y\left(f_s(x_1), \ldots, f_s(x_n)\right).$$
Hence
- the loss $L_y(\ldots)$ only depends on the component of $f$ in the data subspace,
- the regularizer $\Omega(\ldots)$ is minimized when $f = f_s$.
Note: if $\Omega$ is strictly non-decreasing, then $\|f_\perp\|_{\mathcal{F}} = 0$ is required at the minimum. If $\Omega$ is strictly increasing, the minimum is unique.

5. For $k$ identically zero, the statement holds trivially. Assume that $k$ is not identically zero. We expand out
$$0 \le \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} a_i a_j k(x_i, x_j) = \underbrace{\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j)}_{=0} + 2 a_{n+1} \underbrace{\sum_{i=1}^{n} a_i k(x_i, x_{n+1})}_{:=b} + a_{n+1}^2 \underbrace{k(x_{n+1}, x_{n+1})}_{:=c}.$$
The minimum of the above expression over $a_{n+1}$ occurs when $a_{n+1} = -b/c$ (knowing $k$ is not identically zero). For the expression to be non-negative at this minimum,
$$0 \le c \frac{b^2}{c^2} - 2b \frac{b}{c} = -\frac{b^2}{c}.$$
However $c > 0$, so the only possibility is $b = 0$, i.e.
$$\sum_{i=1}^{n} a_i k(x_i, x_{n+1}) = 0 \quad \forall x_{n+1} \in \mathcal{X}.$$
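As a concrete, standard instance of the representer theorem from part 4 (this sketch is an addition, not part of the original answer), take the squared loss and the regularizer $\Omega(\|f\|^2_{\mathcal{F}}) = n\lambda \|f\|^2_{\mathcal{F}}$: restricted to $f = \sum_i \alpha_i k(x_i, \cdot)$, the objective becomes $\|y - K\alpha\|^2 + n\lambda\, \alpha^\top K \alpha$, whose minimizer is $\alpha = (K + n\lambda I)^{-1} y$ (kernel ridge regression). A minimal numpy sketch with an arbitrary toy dataset and bandwidth:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = x ** 2 + 0.1 * rng.normal(size=x.size)            # noisy targets

gamma, lam = 0.2, 1e-2
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / gamma)    # RBF Gram matrix
n = x.size

alpha = np.linalg.solve(K + n * lam * np.eye(n), y)    # representer coefficients

def f(t):
    """Evaluate f(t) = sum_i alpha_i k(x_i, t) at new points t."""
    k_t = np.exp(-(x[:, None] - np.atleast_1d(t)[None, :]) ** 2 / gamma)
    return k_t.T @ alpha

print(f(np.array([0.0, 0.5, 1.0])))                    # predictions, approximately t**2
```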

2.2 Question 2

1. The proof is:
$$\langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}} \overset{(a)}{=} \langle \mu_X, (\mu_X \otimes \mu_Y)\, \mu_Y \rangle_{\mathcal{F}} \overset{(b)}{=} \langle \mu_X, \mu_X \rangle_{\mathcal{F}}\, \langle \mu_Y, \mu_Y \rangle_{\mathcal{G}} \overset{(c)}{=} \mathbb{E}_X \mu_X(X)\, \mathbb{E}_Y \mu_Y(Y) \overset{(d)}{=} \mathbb{E}_X \langle \mu_X, k(X, \cdot) \rangle_{\mathcal{F}}\, \mathbb{E}_Y \langle \mu_Y, l(Y, \cdot) \rangle_{\mathcal{G}} \overset{(c)}{=} \mathbb{E}_{XX'} k(X, X')\, \mathbb{E}_{YY'} l(Y, Y'),$$
where in step (a) we apply (9), in step (b) we apply (7), and in the two steps (c) we apply (4) and (5). Step (d) is the reproducing property.

2. We begin with the expansion
$$\mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = \|C_{XY} - \mu_X \otimes \mu_Y\|^2_{\mathrm{HS}} = \langle C_{XY}, C_{XY} \rangle_{\mathrm{HS}} + \langle \mu_X \otimes \mu_Y, \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}} - 2 \langle C_{XY}, \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}}. \qquad (15)$$
There are three terms in the expansion of (15). To write the first in terms of kernels, we apply (9) and then (10) twice, denoting by $(X', Y')$ an independent copy of the pair of variables $(X, Y)$:
$$\langle C_{XY}, C_{XY} \rangle_{\mathrm{HS}} = \|C_{XY}\|^2_{\mathrm{HS}} = \mathbb{E}_{X,Y} \langle \phi(X) \otimes \psi(Y), C_{XY} \rangle_{\mathrm{HS}} = \mathbb{E}_{X,Y} \mathbb{E}_{X',Y'} \langle \phi(X) \otimes \psi(Y), \phi(X') \otimes \psi(Y') \rangle_{\mathrm{HS}} = \mathbb{E}_{X,Y} \mathbb{E}_{X',Y'} \langle \phi(X), [\phi(X') \otimes \psi(Y')]\, \psi(Y) \rangle_{\mathcal{F}} = \mathbb{E}_{X,Y} \mathbb{E}_{X',Y'} \left[ \langle \phi(X), \phi(X') \rangle_{\mathcal{F}}\, \langle \psi(Y), \psi(Y') \rangle_{\mathcal{G}} \right] = \mathbb{E}_{X,Y} \mathbb{E}_{X',Y'} \left[ k(X, X')\, l(Y, Y') \right]. \qquad (16)$$
For the cross-terms,
$$\langle C_{XY}, \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}} = \mathbb{E}_{X,Y} \langle \phi(X) \otimes \psi(Y), \mu_X \otimes \mu_Y \rangle_{\mathrm{HS}} = \mathbb{E}_{X,Y} \left( \langle \phi(X), \mu_X \rangle_{\mathcal{F}}\, \langle \psi(Y), \mu_Y \rangle_{\mathcal{G}} \right) = \mathbb{E}_{X,Y} \left[ \mathbb{E}_{X'} k(X, X')\, \mathbb{E}_{Y'} l(Y, Y') \right].$$
The final part, $\|\mu_X \otimes \mu_Y\|^2_{\mathrm{HS}}$, was proved previously.

3. At independence, the expectations on the pair $(X, Y)$ factorize as products of expectations on $X$ and $Y$, hence
$$\mathrm{HSIC}^2(\mathcal{F}, \mathcal{G}, P_{xy}) = \mathbb{E}_{XX'} k(X, X')\, \mathbb{E}_{YY'} l(Y, Y') + \mathbb{E}_{XX'} k(X, X')\, \mathbb{E}_{YY'} l(Y, Y') - 2\, \mathbb{E}_{XX'} k(X, X')\, \mathbb{E}_{YY'} l(Y, Y') = 0.$$

4. An unbiased estimate of $A := \|C_{XY}\|^2_{\mathrm{HS}}$ is
$$\widehat{A} := \frac{1}{n(n-1)} \sum_{i \ne j} k_{ij} l_{ij},$$
where we use the shorthand $k_{ij} = k(x_i, x_j)$ and $l_{ij} = l(y_i, y_j)$. Note that $\mathbb{E}(\widehat{A}) = \mathbb{E}_{X,Y} \mathbb{E}_{X',Y'} k(X, X') l(Y, Y') = \|C_{XY}\|^2_{\mathrm{HS}}$ from eq. (16).

5. The biased estimate of $A := \|C_{XY}\|^2_{\mathrm{HS}}$ is
$$\widehat{A}_b := \|\widehat{C}_{XY}\|^2_{\mathrm{HS}} = \left\langle \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \otimes \psi(y_i), \; \frac{1}{n} \sum_{j=1}^{n} \phi(x_j) \otimes \psi(y_j) \right\rangle_{\mathrm{HS}} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k_{ij} l_{ij} = \frac{1}{n^2} \mathrm{tr}(KL).$$
The difference between the biased and unbiased estimates is
$$\widehat{A}_b - \widehat{A} = \frac{1}{n^2} \sum_{i,j=1}^{n} k_{ij} l_{ij} - \frac{1}{n(n-1)} \sum_{i \ne j} k_{ij} l_{ij} = \frac{1}{n^2} \sum_{i=1}^{n} k_{ii} l_{ii} + \left( \frac{1}{n^2} - \frac{1}{n(n-1)} \right) \sum_{i \ne j} k_{ij} l_{ij} = \frac{1}{n^2} \sum_{i=1}^{n} k_{ii} l_{ii} - \frac{1}{n^2(n-1)} \sum_{i \ne j} k_{ij} l_{ij},$$
thus the expectation of this difference (i.e., the bias) is
$$\mathbb{E}\left( \widehat{A}_b - \widehat{A} \right) = \frac{1}{n} \left( \mathbb{E}_{XY}\left[ k(X, X)\, l(Y, Y) \right] - \mathbb{E}_{X,Y} \mathbb{E}_{X',Y'}\left[ k(X, X')\, l(Y, Y') \right] \right),$$
and is therefore $O(1/n)$.

6. When both kernels are linear, the population HSIC will be zero, as there is no pair of functions in these function classes which can transform the variables to have a high linear covariance: with linear kernels HSIC reduces to the squared covariance of $X$ and $Y$, and $\mathrm{cov}(X, X^2 + \epsilon) = \mathbb{E}[X^3] = 0$ since $X$ is uniform on $[-1, 1]$ and $\epsilon$ is independent zero-mean noise. When $k$ is an RBF kernel, and $l$ is a linear kernel, we expect the mappings shown in Figure 2.
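To make parts 4-6 concrete (this snippet is an addition, not part of the original answer), the following numpy sketch draws the toy data of Question 2.6, forms the biased and unbiased estimates of $\|C_{XY}\|^2_{\mathrm{HS}}$ above, and also the biased HSIC estimate $\frac{1}{n^2}\mathrm{tr}(KHLH)$ with centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, which equals $\|\widehat{C}_{XY} - \hat\mu_X \otimes \hat\mu_Y\|^2_{\mathrm{HS}}$; the sample size and RBF bandwidth are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(-1, 1, n)
y = x ** 2 + 0.05 * rng.normal(size=n)              # the relation in Question 2.6

def gram(z, kind, gamma=0.5):
    """Gram matrix for a 1-D sample z, either linear or RBF."""
    if kind == "linear":
        return np.outer(z, z)
    return np.exp(-(z[:, None] - z[None, :]) ** 2 / gamma)

L = gram(y, "linear")
H = np.eye(n) - np.ones((n, n)) / n                 # centering matrix

for kind in ("linear", "rbf"):
    K = gram(x, kind)
    P = K * L                                        # entrywise products k_ij * l_ij
    biased = P.sum() / n ** 2                        # ||C_hat_XY||^2 = tr(KL) / n^2
    unbiased = (P.sum() - np.trace(P)) / (n * (n - 1))
    hsic_b = np.trace(K @ H @ L @ H) / n ** 2        # biased HSIC estimate
    print(f"k {kind:6s}: biased={biased:.4f}  unbiased={unbiased:.4f}  HSIC={hsic_b:.4f}")
```

With the linear kernel on $x$ the HSIC estimate is close to zero, while with the RBF kernel it is clearly positive, matching the discussion in part 6.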

[Figure 2: Maximum singular vectors of covariance operator. Left plot is original point cloud; center plot contains both mappings (panels "Dependence witness, X" showing $f(x)$, and "Dependence witness, Y" showing $g(y)$); right plot contains mapped variables, with correlation 0.94.]
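For a numerical counterpart to Figure 2 (an addition, not part of the original answer): since $l$ is the linear kernel on a scalar $y$, the RKHS $\mathcal{G}$ is one-dimensional, so the maximizing $g$ is $g(y) = \pm y$ after normalization, and the maximizing $f$ is simply the normalized image of that direction under the centered cross-covariance operator, $f \propto \frac{1}{n}\sum_i (y_i - \bar{y})\,(\phi(x_i) - \hat\mu_X)$. The sketch below evaluates this witness on a grid; the bandwidth and sample size are arbitrary, and the resulting $f$ has the roughly quadratic shape shown in the center panel of Figure 2:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(-1, 1, n)
y = x ** 2 + 0.05 * rng.normal(size=n)

gamma = 0.25
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / gamma)    # RBF Gram matrix on x
H = np.eye(n) - np.ones((n, n)) / n
coef = (y - y.mean()) / n                               # coefficients of the unnormalized witness
f_norm = np.sqrt(coef @ (H @ K @ H) @ coef)             # RKHS norm via the centered Gram matrix

def f(t):
    """Witness f(t) = (1/norm) * sum_i coef_i (k(x_i, t) - mean_j k(x_j, t))."""
    kt = np.exp(-(x[:, None] - np.atleast_1d(t)[None, :]) ** 2 / gamma)   # shape (n, m)
    return coef @ (kt - kt.mean(axis=0, keepdims=True)) / f_norm

grid = np.linspace(-1, 1, 9)
print(np.round(f(grid), 3))   # approximately quadratic in t, up to sign and smoothing
```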

2.3 Question 3

This is a minor modification of the ranking algorithm in [1, Section 8.1.1].

1. The sketch is in Figure 3. The algorithm sets the thresholds $\{b_y\}_{y=0}^{M}$ such that $\langle w, \phi(x_i) \rangle_{\mathcal{H}}$ is beneath the threshold $b_{y_i}$ by a margin $1/\|w\|_{\mathcal{H}}$, but above the threshold $b_{y_i - 1}$ by a margin $1/\|w\|_{\mathcal{H}}$. Some points are allowed within the margins, however these attract a penalty of $\xi^l_i$ or $\xi^u_i$, respectively (the sum of these penalties constitutes the loss). The parameter $C$ trades off the margin size with the loss.

[Figure 3: Sketch of ranking algorithm, showing the direction $w$, the thresholds $b_0, \ldots, b_3$, and example points $\phi(x_1)$, $\phi(x_2)$, $\phi(x_3)$.]

2. Strong duality means that the maximum of the dual function coincides with the minimum of the primal function subject to the problem constraints. Recall the optimization problem:
$$\min_{w \in \mathcal{H},\; \xi^u, \xi^l \in \mathbb{R}^n,\; b \in \mathbb{R}^{M+1}} \|w\|^2_{\mathcal{H}} + C \sum_{i=1}^{n} (\xi^l_i + \xi^u_i), \qquad (17)$$
subject to
$$\langle w, \phi(x_i) \rangle_{\mathcal{H}} \le b_{y_i} - 1 + \xi^l_i \qquad (18)$$
$$\langle w, \phi(x_i) \rangle_{\mathcal{H}} \ge b_{y_i - 1} + 1 - \xi^u_i \qquad (19)$$
$$\xi^u_i, \xi^l_i \ge 0.$$

The Lagrangian is:
$$\mathcal{L} := \|w\|^2_{\mathcal{H}} + C \sum_{i=1}^{n} (\xi^l_i + \xi^u_i) - \sum_{i=1}^{n} (\eta^l_i \xi^l_i + \eta^u_i \xi^u_i) + \sum_{i=1}^{n} \alpha^l_i \left( \langle w, \phi(x_i) \rangle_{\mathcal{H}} - b_{y_i} + 1 - \xi^l_i \right) + \sum_{i=1}^{n} \alpha^u_i \left( -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi^u_i \right).$$

The KKT conditions: knowing strong duality holds, and using the general notation
$$\text{minimize } f_0(x) \quad \text{subject to } f_i(x) \le 0, \; i = 1, \ldots, m, \qquad (20)$$
for convex $f_0, \ldots, f_m$, the KKT conditions are
$$f_i(x) \le 0, \quad i = 1, \ldots, m$$
$$\lambda_i \ge 0, \quad i = 1, \ldots, m$$
$$\lambda_i f_i(x) = 0, \quad i = 1, \ldots, m \qquad (21)$$
$$\nabla f_0(x) + \sum_{i=1}^{m} \lambda_i \nabla f_i(x) = 0.$$
These are necessary and sufficient for optimality under strong duality. The condition $\lambda_i f_i = 0$ translates to
$$0 = \eta^l_i \xi^l_i$$
$$0 = \eta^u_i \xi^u_i$$
$$0 = \alpha^l_i \left( \langle w, \phi(x_i) \rangle_{\mathcal{H}} - b_{y_i} + 1 - \xi^l_i \right)$$
$$0 = \alpha^u_i \left( -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi^u_i \right).$$
The dual variables satisfy $\alpha^l_i, \alpha^u_i, \eta^l_i, \eta^u_i \ge 0$. Taking derivatives with respect to the primal parameters and setting them to zero gives the remaining KKT conditions for this problem:

$$\frac{\partial \mathcal{L}}{\partial w} = 2w + \sum_{i=1}^{n} \alpha^l_i \phi(x_i) - \sum_{i=1}^{n} \alpha^u_i \phi(x_i) = 0 \qquad (22)$$
$$\frac{\partial \mathcal{L}}{\partial \xi^l_i} = C - \alpha^l_i - \eta^l_i = 0 \qquad (23)$$
$$\frac{\partial \mathcal{L}}{\partial \xi^u_i} = C - \alpha^u_i - \eta^u_i = 0 \qquad (24)$$
$$\frac{\partial \mathcal{L}}{\partial b_y} = -\sum_{i : y_i = y} \alpha^l_i + \sum_{i : y_i = y + 1} \alpha^u_i = 0, \qquad y \in \{1, \ldots, M - 1\} \qquad (25)$$
$$\frac{\partial \mathcal{L}}{\partial b_0} = \sum_{i : y_i = 1} \alpha^u_i = 0 \qquad (26)$$
$$\frac{\partial \mathcal{L}}{\partial b_M} = -\sum_{i : y_i = M} \alpha^l_i = 0. \qquad (27)$$
We interpret (25) to state that $b_y$ is the upper threshold for points with rank $y$, and the lower threshold for points of rank $y + 1$.

3. We use the minimum of the Lagrangian with respect to the primal parameters, which we can readily compute since we have the point at which the primal derivatives are zero. From (22),
$$w = \frac{1}{2} \sum_{i=1}^{n} (\alpha^u_i - \alpha^l_i) \phi(x_i).$$
Substituting the KKT conditions back into the Lagrangian, we get the Lagrange dual function,
$$g(\alpha^u, \alpha^l) := \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha^u_i - \alpha^l_i)(\alpha^u_j - \alpha^l_j) k(x_i, x_j) + C \sum_{i=1}^{n} (\xi^l_i + \xi^u_i) + \sum_{i=1}^{n} \alpha^l_i \left( \frac{1}{2} \sum_{j=1}^{n} (\alpha^u_j - \alpha^l_j) k(x_i, x_j) - b_{y_i} + 1 - \xi^l_i \right) + \sum_{i=1}^{n} \alpha^u_i \left( -\frac{1}{2} \sum_{j=1}^{n} (\alpha^u_j - \alpha^l_j) k(x_i, x_j) + b_{y_i - 1} + 1 - \xi^u_i \right) - \sum_{i=1}^{n} \left[ \xi^l_i (C - \alpha^l_i) + \xi^u_i (C - \alpha^u_i) \right]$$
$$= \sum_{i=1}^{n} (\alpha^u_i + \alpha^l_i) - \frac{1}{4} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha^u_i - \alpha^l_i)(\alpha^u_j - \alpha^l_j) k(x_i, x_j),$$
where the $\xi$ terms cancel by (23) and (24), and the terms in $b$ vanish by (25)-(27). To get the desired solution, it must be maximized with respect to $\alpha^u_i, \alpha^l_i$.
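For a numerical sanity check of the primal (17)-(19) and of the case analysis below (this sketch is an addition, not part of the original answer; it assumes numpy and cvxpy are available, and the toy data, RBF bandwidth and $C$ are arbitrary), one can use the same projection argument as in the representer theorem to write $w = \sum_i \beta_i \phi(x_i)$, so that $\langle w, \phi(x_j)\rangle_{\mathcal{H}} = (K\beta)_j$ and $\|w\|^2_{\mathcal{H}} = \beta^\top K \beta$, and solve the resulting quadratic program directly:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-1, 1, 30))
y = np.digitize(x, bins=[-0.3, 0.4]) + 1             # ranks y_i in {1, 2, 3}
n, M, C = x.size, 3, 1.0

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)    # RBF Gram matrix
R = np.linalg.cholesky(K + 1e-8 * np.eye(n))         # K ~ R R^T, so ||w||^2 = ||R^T beta||^2

beta = cp.Variable(n)                                 # w = sum_i beta_i phi(x_i)
b = cp.Variable(M + 1)                                # thresholds b_0, ..., b_M
xi_l = cp.Variable(n, nonneg=True)
xi_u = cp.Variable(n, nonneg=True)

f = K @ beta                                          # f_i = <w, phi(x_i)>
cons = [f[i] <= b[int(y[i])] - 1 + xi_l[i] for i in range(n)]        # constraint (18)
cons += [f[i] >= b[int(y[i]) - 1] + 1 - xi_u[i] for i in range(n)]   # constraint (19)

obj = cp.Minimize(cp.sum_squares(R.T @ beta) + C * cp.sum(xi_l + xi_u))
cp.Problem(obj, cons).solve()

print("thresholds b:", np.round(b.value, 2))
print("points inside a margin or misranked:", int(np.sum(xi_l.value + xi_u.value > 1e-6)))
```

The dual values the solver returns for the constraints (18) and (19) play the roles of $\alpha^l_i$ and $\alpha^u_i$, so the three cases below can be checked numerically against `cons[i].dual_value`.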

4. There are three cases:

(a) When $\alpha^u_i = C$, then from (24), $\eta^u_i = 0$ for these points, and it is possible for $\xi^u_i > 0$ from (21). Next,
$$0 = -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi^u_i$$
$$\langle w, \phi(x_i) \rangle_{\mathcal{H}} = b_{y_i - 1} + 1 - \xi^u_i,$$
and the projection $\langle w, \phi(x_i) \rangle_{\mathcal{H}}$ is above the threshold $b_{y_i - 1}$ by $1 - \xi^u_i$ (potentially within the margin, or even on the wrong side of the threshold for large enough $\xi^u_i$).

(b) When $\alpha^u_i = 0$ then $\eta^u_i = C$, hence $\xi^u_i = 0$, and
$$-\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi^u_i \le 0$$
$$\langle w, \phi(x_i) \rangle_{\mathcal{H}} \ge 1 + b_{y_i - 1},$$
and the point is on or above the margin for the lower threshold.

(c) When $\alpha^u_i \in (0, C)$ then $\eta^u_i \ne 0$, hence $\xi^u_i = 0$. Moreover
$$0 = -\langle w, \phi(x_i) \rangle_{\mathcal{H}} + b_{y_i - 1} + 1 - \xi^u_i$$
$$\langle w, \phi(x_i) \rangle_{\mathcal{H}} = b_{y_i - 1} + 1,$$
and these points are on the margin above the lower threshold $b_{y_i - 1}$.

References

[1] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.