Solving SVM: Quadratic Programming

Dwight O’Neal’
5 years ago

MA 751 Part 7

1. Quadratic programming (QP)

Introducing Lagrange multipliers $\alpha_j$ and $\mu_j$ (this can be justified in QP for inequality as well as equality constraints), we define the Lagrangian

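$$L(a, b, \xi, \alpha, \mu) = \frac{1}{n}\sum_{j=1}^n \xi_j + \lambda a^T K a - \sum_{j=1}^n \alpha_j\Big[y_j\Big(\sum_{i=1}^n a_i K(x_i, x_j) + b\Big) - 1 + \xi_j\Big] - \sum_{j=1}^n \mu_j \xi_j.$$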

Solving using quadratic programming

By Lagrange multiplier theory for constraints with inequalities, the minimum of the original problem corresponds to a saddle point of this Lagrangian in $a, b, \xi, \alpha = (\alpha_1, \dots, \alpha_n), \mu = (\mu_1, \dots, \mu_n)$: the Lagrangian is minimized with respect to $a, b, \xi$ (derivatives vanish) and maximized with respect to the Lagrange multipliers $\alpha, \mu$, subject to the constraints

$$\alpha_i,\ \mu_i \ge 0. \tag{5}$$

Derivatives:

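$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{j=1}^n \alpha_j y_j = 0; \tag{6a}$$

$$\frac{\partial L}{\partial \xi_j} = 0 \;\Rightarrow\; \frac{1}{n} - \alpha_j - \mu_j = 0. \tag{6b}$$

Plugging in, we get the reduced Lagrangian

$$L(a, \alpha) = \lambda a^T K a - \sum_{j=1}^n \alpha_j\Big[y_j \sum_{i=1}^n a_i K(x_i, x_j) - 1\Big] = \sum_{j=1}^n \alpha_j + \lambda a^T K a - \alpha^T Y K a,$$

where

$$Y = \begin{pmatrix} y_1 & 0 & \cdots & 0 \\ 0 & y_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & y_n \end{pmatrix}$$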

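(note that (6a) eliminates the $b$ term and (6b) eliminates the $\xi_j$ terms), with the same constraints (5). Now:

$$\frac{\partial L}{\partial a_j} = 0 \;\Rightarrow\; 2\lambda K a - K Y \alpha = 0 \tag{7}$$

$$\Rightarrow\; a_i = \frac{\alpha_i y_i}{2\lambda}.$$

Plug in for $a_i$ using (7), replacing $Ka$ by $\frac{1}{2\lambda} K Y \alpha$ everywhere: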

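$$L(a, \alpha) = \sum_{j=1}^n \alpha_j - \frac{1}{4\lambda}\alpha^T Y K Y^T \alpha = \sum_{j=1}^n \alpha_j - \frac{1}{4\lambda}\alpha^T P \alpha,$$

where $P = Y K Y^T$.

Constraints: $\alpha_j, \mu_j \ge 0$; by (6b) this implies

$$0 \le \alpha_j \le \frac{1}{n}.$$

Define $C = \frac{1}{2\lambda n}$ and $\bar\alpha = \frac{1}{2\lambda}\alpha$ [note this does not mean complex conjugate!]. Since the multipliers maximize the Lagrangian, we want to maximize (division by the constant $2\lambda$ is OK; it does not change the optimizing $\bar\alpha$)

$$\frac{1}{2\lambda}\sum_{j=1}^n 2\lambda\bar\alpha_j - \frac{1}{2\lambda}\cdot\frac{(2\lambda)^2}{4\lambda}\,\bar\alpha^T P \bar\alpha = \sum_{j=1}^n \bar\alpha_j - \frac{1}{2}\bar\alpha^T P \bar\alpha, \tag{8}$$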

subject to the constraint $0 \le \bar\alpha_i \le C$; it is also convenient to include (6a) as a constraint: $\bar\alpha \cdot y = 0$. Thus the constraints are:

$$0 \le \bar\alpha \le C; \qquad \bar\alpha \cdot y = 0.$$

Summarizing the above relationships:

$$f(x) = \sum_{j=1}^n a_j K(x, x_j) + b,$$

where

$$a_j = \frac{\alpha_j y_j}{2\lambda}, \qquad \alpha_j = 2\lambda\bar\alpha_j,$$

and the $\bar\alpha_j$ are the maximizers of (8) under the constraints above, with $P = Y K Y^T$. After the $a_j$ are determined, $b$ must be computed directly by plugging into the original optimization problem.

More briefly,

$$f(x) = \sum_{j=1}^n \bar\alpha_j y_j K(x, x_j) + b,$$

where the $\bar\alpha_j$ maximize (8).

Finally, to find $b$, we must plug into the original optimization problem: that is, we minimize

$$\frac{1}{n}\sum_{j=1}^n \big(1 - y_j f(x_j)\big)_+ + \lambda\|f\|_K^2 = \frac{1}{n}\sum_{j=1}^n \Big(1 - y_j\Big[\sum_{i=1}^n a_i K(x_j, x_i) + b\Big]\Big)_+ + \lambda a^T K a.$$

2. The RKHS for SVM

General SVM: the solution function is (see (4) above)

$$f(x) = \sum_{j} a_j K(x, x_j) + b,$$

with the solution for $a_j$ given by quadratic programming as above.

Consider a simple case (linear kernel):

$$K(x, x_j) = x \cdot x_j.$$

Then we have

$$f(x) = \sum_j (a_j x_j) \cdot x + b \equiv w \cdot x + b, \qquad \text{where } w \equiv \sum_j a_j x_j.$$

This fixes the kernel. What class of functions is the corresponding space $\mathcal{H}$? We claim it is the set of linear functions of $x$:

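$$\mathcal{H} = \{\, w \cdot x \mid w \in \mathbb{R}^n \,\}$$

(with $n$ here the dimension of $x$), with inner product

$$\langle w_1 \cdot x,\ w_2 \cdot x \rangle_{\mathcal{H}} = w_1 \cdot w_2,$$

is the RKHS of $K(x, y)$ above.

Indeed, to show that $K(x, y) = x \cdot y$ is the reproducing kernel for $\mathcal{H}$, note that if $f(x) = x \cdot w \in \mathcal{H}$, then, recalling $K(x, y) = x \cdot y$,

$$\langle f(\cdot),\ K(\cdot, y) \rangle_{\mathcal{H}} = w \cdot y = f(y),$$

as desired.

Thus the matrix $K_{ij} = x_i \cdot x_j$, and we find the optimal separator $f(x) = w \cdot x$ by solving for $w$ as before.

Note that when we add $b$ to $f(x)$ (as done earlier), we have all affine functions $f(x) = w \cdot x + b$.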
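The identity between the kernel expansion and the linear form can be checked numerically. The following is a minimal sketch; the points `xs`, coefficients `a`, offset `b`, and the helper names are illustrative assumptions, not values from the notes.

```python
# Illustrative check: with the linear kernel K(x, y) = x.y, the expansion
# sum_j a_j K(x, x_j) + b equals w.x + b with w = sum_j a_j x_j.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x, y):
    return dot(x, y)

def f_kernel(x, xs, a, b):
    """Kernel form: f(x) = sum_j a_j K(x, x_j) + b."""
    return sum(aj * linear_kernel(x, xj) for aj, xj in zip(a, xs)) + b

def weight_vector(xs, a):
    """w = sum_j a_j x_j, computed coordinate by coordinate."""
    dim = len(xs[0])
    return [sum(aj * xj[k] for aj, xj in zip(a, xs)) for k in range(dim)]

xs = [(1.0, 2.0), (-3.0, 0.5), (0.0, -1.0)]   # hypothetical training points
a = [0.5, -0.25, 2.0]                          # hypothetical coefficients
b = 0.75                                       # hypothetical offset
w = weight_vector(xs, a)

x_new = (2.0, -4.0)
print(w, f_kernel(x_new, xs, a, b), dot(w, x_new) + b)
```

Both evaluations agree, which is the content of the claim that the linear-kernel SVM function is an affine function of $x$.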

Note the above inner product gives the norm

$$\|w \cdot x\|_{\mathcal{H}}^2 = \|w\|_{\mathbb{R}^n}^2 = \sum_{j=1}^n w_j^2.$$

Why use this norm? A priori information content.

Final classification rule:

$$f(x) > 0 \Rightarrow y = 1; \qquad f(x) < 0 \Rightarrow y = -1.$$

Learning from training data:

$$N f = (f(x_1), \dots, f(x_n)) = (y_1, \dots, y_n).$$

Thus

$$\mathcal{H} = \{\, f(x) = w \cdot x : w \in \mathbb{R}^n \,\}$$

is the set of linear separator functions (known as perceptrons in neural network theory).

Consider the separating hyperplane $H: f(x) = 0$.

Toy example:


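Example

Information

$$N f = \{\, [(1, 1), 1],\ [(1, -1), 1],\ [(-1, -1), -1],\ [(-1, 1), -1] \,\}$$

(red $= +1$; blue $= -1$);

$$f = w \cdot x + b = \sum_i a_i \underbrace{(x_i \cdot x)}_{K(x_i, x)} + b,$$

so

$$w = \sum_i a_i x_i.$$

$$L(f) = \frac{1}{4}\sum_{j=1}^4 \big(1 - f(x_j) y_j\big)_+ + \frac{1}{2}|w|^2 \tag{*}$$

(we let $\lambda = 1/2$; minimize with respect to $w, b$). Equivalent:

$$L(f) = \frac{1}{4}\sum_{j=1}^4 \xi_j + \frac{1}{2}|w|^2, \qquad y_j f(x_j) \ge 1 - \xi_j; \quad \xi_j \ge 0.$$

[Note effectively $\xi_j = \big(1 - (w \cdot x_j + b) y_j\big)_+$.]

Define the kernel matrix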

$$K_{ij} = K(x_i, x_j) = x_i \cdot x_j = \begin{pmatrix} 2 & 0 & -2 & 0 \\ 0 & 2 & 0 & -2 \\ -2 & 0 & 2 & 0 \\ 0 & -2 & 0 & 2 \end{pmatrix},$$

$$\|f\|_{\mathcal{H}}^2 = |w|^2 = a^T K a = 2\sum_{i=1}^4 a_i^2 - 4 a_1 a_3 - 4 a_2 a_4,$$

where $a = (a_1, a_2, a_3, a_4)^T$.

Formulate

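$$L(f) = L(a, b, \xi) = \frac{1}{4}\sum_{j=1}^4 \xi_j + \frac{1}{2} a^T K a = \frac{1}{4}\sum_{j=1}^4 \xi_j + \sum_{i=1}^4 a_i^2 - 2 a_1 a_3 - 2 a_2 a_4$$

subject to (Eq. 4a):

$$\xi_j \ge 1 - y_j\big[(Ka)_j + b\big], \qquad \xi_j \ge 0.$$

Lagrange multipliers $\alpha = (\alpha_1, \dots, \alpha_n)^T$, $\mu = (\mu_1, \dots, \mu_n)^T$ (see (4b)): optimize

$$L(a, b, \xi, \alpha, \mu) = \frac{1}{4}\sum_{i=1}^4 \xi_i + \frac{1}{2} a^T K a - \sum_{j=1}^4 \alpha_j\Big(\big[(Ka)_j + b\big] y_j - 1 + \xi_j\Big) - \sum_{j=1}^4 \mu_j \xi_j$$

$$= \frac{1}{4}\sum_{j=1}^4 \xi_j + \frac{1}{2} a^T K a - \alpha^T Y K a - b\,\alpha^T y + \sum_{j=1}^4 \alpha_j - \alpha \cdot \xi - \mu \cdot \xi \tag{10}$$

with constraints $\alpha_i, \mu_i \ge 0$.

The solution has (see (7) above)

$$\alpha = 2\lambda Y^{-1} a$$

(recall

$$Y = \begin{pmatrix} y_1 & 0 & \cdots & 0 \\ 0 & y_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & y_n \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix})$$

and (7a above)

$$\bar\alpha = \frac{1}{2\lambda}\alpha = \alpha.$$

Finally optimize (8):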

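$$L = \sum_{i=1}^4 \bar\alpha_i - \frac{1}{2}\bar\alpha^T P \bar\alpha,$$

where

$$P = Y K Y^T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} 2 & 0 & -2 & 0 \\ 0 & 2 & 0 & -2 \\ -2 & 0 & 2 & 0 \\ 0 & -2 & 0 & 2 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix}$$

$$= \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \end{pmatrix} \begin{pmatrix} 2 & 0 & 2 & 0 \\ 0 & 2 & 0 & 2 \\ -2 & 0 & -2 & 0 \\ 0 & -2 & 0 & -2 \end{pmatrix} = \begin{pmatrix} 2 & 0 & 2 & 0 \\ 0 & 2 & 0 & 2 \\ 2 & 0 & 2 & 0 \\ 0 & 2 & 0 & 2 \end{pmatrix}.$$

The constraint is

$$0 \le \bar\alpha \le C = \frac{1}{2\lambda n} = \frac{1}{4}. \tag{10a}$$

Thus optimize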

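$$L = \sum_{i=1}^4 \bar\alpha_i - \sum_{i=1}^4 \bar\alpha_i^2 - 2\bar\alpha_1\bar\alpha_3 - 2\bar\alpha_2\bar\alpha_4 = \sum_{i=1}^4 \bar\alpha_i - (\bar\alpha_1 + \bar\alpha_3)^2 - (\bar\alpha_2 + \bar\alpha_4)^2 = u + v - u^2 - v^2,$$

where

$$u = \bar\alpha_1 + \bar\alpha_3, \qquad v = \bar\alpha_2 + \bar\alpha_4.$$

Setting the derivatives to zero (since $L$ is concave in $u, v$, this stationary point is its maximum):

$$1 - 2u = 0; \quad 1 - 2v = 0 \;\Rightarrow\; u = v = \frac{1}{2}.$$

Thus we have $\bar\alpha_i = \frac{1}{4}$ for all $i$ (recall the constraint (10a)). Then

$$\alpha = 2\lambda\bar\alpha = \bar\alpha = \begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix}.$$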
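The same solution can be reproduced numerically. The sketch below builds $K$ and $P = YKY^T$ for the four toy points and then maximizes the dual objective (8) under $0 \le \bar\alpha_i \le C$ and $\bar\alpha \cdot y = 0$. The pairwise coordinate-ascent update (in the spirit of SMO) is an illustrative choice made here, not the solver used in the notes, and the function names are hypothetical.

```python
# Sketch: pairwise coordinate ascent on the dual (8) for the toy example.
# The update rule is an assumed, SMO-style method chosen for brevity.

xs = [(1, 1), (1, -1), (-1, -1), (-1, 1)]   # toy points x_1..x_4
ys = [1, 1, -1, -1]                          # labels y_1..y_4
lam = 0.5
n = len(xs)
C = 1.0 / (2 * lam * n)                      # box bound, here 1/4

K = [[sum(p * q for p, q in zip(xi, xj)) for xj in xs] for xi in xs]
P = [[ys[i] * ys[j] * K[i][j] for j in range(n)] for i in range(n)]  # Y K Y^T

def dual_objective(alpha):
    """sum_i alpha_i - (1/2) alpha^T P alpha."""
    quad = sum(alpha[i] * P[i][j] * alpha[j] for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def step_interval(value, sign):
    """Range of t that keeps value + sign * t inside [0, C]."""
    return (-value, C - value) if sign > 0 else (value - C, value)

alpha = [0.0] * n                            # feasible start: alpha . y = 0
for sweep in range(20):
    for i in range(n):
        for j in range(i + 1, n):
            # Move alpha_i by y_i * t and alpha_j by -y_j * t,
            # which leaves alpha . y unchanged.
            grad = [1.0 - sum(P[k][m] * alpha[m] for m in range(n)) for k in range(n)]
            slope = ys[i] * grad[i] - ys[j] * grad[j]
            curvature = K[i][i] + K[j][j] - 2 * K[i][j]
            if curvature <= 1e-12:
                continue
            lo_i, hi_i = step_interval(alpha[i], ys[i])
            lo_j, hi_j = step_interval(alpha[j], -ys[j])
            t = min(max(slope / curvature, max(lo_i, lo_j)), min(hi_i, hi_j))
            alpha[i] += ys[i] * t
            alpha[j] -= ys[j] * t

print(P)
print(alpha, dual_objective(alpha))
```

On this data the iteration stops at $\bar\alpha_i = 1/4$ for every $i$, where the box constraint (10a) is active. For larger problems a dedicated QP or SVM package would normally be used instead.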

Thus

$$a = \frac{Y\alpha}{2\lambda} = \begin{pmatrix} 1/4 \\ 1/4 \\ -1/4 \\ -1/4 \end{pmatrix}.$$

Thus

$$w = \sum_i a_i x_i = \frac{1}{4}(x_1 + x_2 - x_3 - x_4) = \frac{1}{4}(4, 0) = (1, 0).$$

Margin $= \dfrac{1}{|w|} = 1$.
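As a quick numerical check of this step, the sketch below recomputes $a$, $w$, and the margin from $\bar\alpha_i = 1/4$, and confirms that $|w|^2 = a^T K a$. Variable names are illustrative.

```python
import math

xs = [(1, 1), (1, -1), (-1, -1), (-1, 1)]   # toy points
ys = [1, 1, -1, -1]                          # labels
lam = 0.5
alpha_bar = [0.25] * 4                       # solution of the dual found above

alpha = [2 * lam * ab for ab in alpha_bar]               # alpha = 2 * lambda * alpha_bar
a = [al * y / (2 * lam) for al, y in zip(alpha, ys)]     # a_j = alpha_j y_j / (2 lambda)
w = [sum(aj * xj[k] for aj, xj in zip(a, xs)) for k in range(2)]   # w = sum_j a_j x_j

norm_w = math.sqrt(sum(wk * wk for wk in w))
margin = 1 / norm_w

# Cross-check: |w|^2 should equal a^T K a with K_ij = x_i . x_j.
K = [[sum(p * q for p, q in zip(xi, xj)) for xj in xs] for xi in xs]
aKa = sum(a[i] * K[i][j] * a[j] for i in range(4) for j in range(4))

print(a, w, margin, aKa)
```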

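Now we find $b$ separately from the original equation (*); we will minimize with respect to $b$ the original functional

$$L(f) = \frac{1}{4}\sum_{j=1}^4 \big(1 - (w \cdot x_j + b) y_j\big)_+ + \frac{1}{2}|w|^2$$

$$= \frac{1}{4}\Big\{ \big[1 - (1 + b)(1)\big]_+ + \big[1 - (1 + b)(1)\big]_+ + \big[1 - (-1 + b)(-1)\big]_+ + \big[1 - (-1 + b)(-1)\big]_+ \Big\} + \frac{1}{2}$$

$$= \frac{1}{4}\Big\{ [-b]_+ + [-b]_+ + [b]_+ + [b]_+ \Big\} + \frac{1}{2} = \frac{1}{2}\Big\{ [-b]_+ + [b]_+ \Big\} + \frac{1}{2}.$$

Clearly the above is minimized when $b = 0$.

Thus $w = (1, 0)$; $b = 0$ $\Rightarrow$

$$f(x) = w \cdot x + b = x_1.$$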
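The choice $b = 0$ can also be confirmed by evaluating the original functional on a grid of offsets. This is a minimal sketch; the grid of candidate values and the helper names are assumptions made for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def hinge(t):
    """(t)_+ = max(0, t)."""
    return max(0.0, t)

def objective(w, b, xs, ys, lam):
    """(1/n) sum_j (1 - y_j (w.x_j + b))_+ + lam * |w|^2."""
    n = len(xs)
    loss = sum(hinge(1 - y * (dot(w, x) + b)) for x, y in zip(xs, ys)) / n
    return loss + lam * dot(w, w)

xs = [(1, 1), (1, -1), (-1, -1), (-1, 1)]   # toy points
ys = [1, 1, -1, -1]                          # labels
w = (1.0, 0.0)                               # weight vector found above
lam = 0.5

bs = [k / 10 for k in range(-20, 21)]        # candidate offsets in [-2, 2]
values = [objective(w, b, xs, ys, lam) for b in bs]
best_b = bs[values.index(min(values))]

# Classify the training points with f(x) = w.x + b.
predictions = [1 if dot(w, x) + best_b > 0 else -1 for x in xs]
print(best_b, min(values), predictions)
```

The minimum on the grid is at $b = 0$ with objective value $1/2$, and all four training points are classified correctly by $f(x) = x_1$.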


4. Other choices of kernel

Recall that in SVM we have used the kernel $K(x, y) = x \cdot y$. There are many other choices of kernel, e.g.,

$$K(x, y) = e^{-|x - y|} \quad \text{or} \quad K(x, y) = (1 + |x \cdot y|)^n;$$

note that we must choose a kernel function which is positive definite.

How do these choices change the discrimination function $f(x)$ in SVM?

Ex 1: Gaussian kernel

$$K_\sigma(x, y) = e^{-\frac{|x - y|^2}{2\sigma^2}}$$

[can show pos. def. Mercer kernel]

SVM: from (4) above we have

$$f(x) = \sum_j a_j K(x, x_j) + b = \sum_j a_j e^{-\frac{|x - x_j|^2}{2\sigma^2}} + b,$$

where the examples $x_j$ in $F$ have known classifications $y_j$, and $a_j, b$ are obtained by quadratic programming.
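A minimal sketch of this decision function follows. The coefficients `a` and offset `b` below are placeholders chosen for illustration (they are not the result of solving the quadratic program with a Gaussian kernel), and the function names are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma):
    """K_sigma(x, y) = exp(-|x - y|^2 / (2 sigma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def decision_function(x, xs, a, b, sigma):
    """f(x) = sum_j a_j K_sigma(x, x_j) + b."""
    return sum(aj * gaussian_kernel(x, xj, sigma) for aj, xj in zip(a, xs)) + b

xs = [(1, 1), (1, -1), (-1, -1), (-1, 1)]   # example points x_j
a = [0.25, 0.25, -0.25, -0.25]              # placeholder coefficients (assumed)
b = 0.0                                     # placeholder offset (assumed)
sigma = 1.0

right = decision_function((2, 0), xs, a, b, sigma)
left = decision_function((-2, 0), xs, a, b, sigma)
print(right, left)
```

With these placeholder values the function is positive near the points labeled $+1$ and negative near those labeled $-1$; smaller values of `sigma` make each term more sharply localized around its $x_j$.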

What kind of classifier is this? It depends on $\sigma$ (see Vert movie).

Note Movie1 varies $\sigma$ in the Gaussian ($\sigma = \infty$ corresponds to a linear SVM); then movie2 varies the margin $\frac{1}{|w|}$ (in linear feature space $F_2$) as determined by changing $\lambda$ or equivalently $C = \frac{1}{2\lambda n}$.

5. Software available

Software which implements the quadratic programming algorithm above includes:

SVMLight
SVMTorch
LIBSVM

A Matlab package which implements most of these is Spider.

6. Example application: handwritten digit recognition - USPS (Schölkopf, Burges, Vapnik)

Handwritten digits:


Examples

Training set: 7300; a separate test set is used.

10-class classifier; the $i$-th class has a separating SVM function

$$f_i(x) = w_i \cdot x + b_i.$$

The chosen class is

$$\text{Class} = \operatorname*{argmax}_{i \in \{0, \dots, 9\}} f_i(x).$$

$\Phi$: digit $g \to$ feature vector $\Phi(g) = x \in F$.

Kernels in feature space $F$:

RBF: $K(x_i, x_j) = e^{-\frac{|x_i - x_j|^2}{2\sigma^2}}$

Polynomial: $K = (x_i \cdot x_j + \theta)^d$

Sigmoidal: $K = \tanh(\kappa (x_i \cdot x_j) + \theta)$
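The three kernels and the argmax rule above can be sketched as follows. The three scoring functions in the demo are hypothetical stand-ins for trained per-class SVM functions $f_i$ (three classes in two dimensions rather than ten digit classes), and the parameter names are assumptions.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def rbf_kernel(x, y, sigma):
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def polynomial_kernel(x, y, theta, d):
    return (dot(x, y) + theta) ** d

def sigmoid_kernel(x, y, kappa, theta):
    return math.tanh(kappa * dot(x, y) + theta)

def predict_class(x, scorers):
    """Return the index i that maximizes f_i(x)."""
    scores = [f(x) for f in scorers]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical per-class linear scoring functions f_i(x) = w_i . x + b_i.
scorers = [
    lambda x: dot((1, 0), x),      # class 0
    lambda x: dot((0, 1), x),      # class 1
    lambda x: dot((-1, -1), x),    # class 2
]

print(predict_class((2, 1), scorers),
      predict_class((-1, 3), scorers),
      predict_class((-2, -2), scorers))
print(polynomial_kernel((1, 2), (3, 4), theta=1, d=2))
```

In a kernelized version each scorer would be a kernel expansion $\sum_j a_j K(x, x_j) + b$ with its own coefficients; the argmax step is unchanged.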

Computational Biology Applications

References:

T. Golub et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Science.

S. Ramaswamy et al. Multiclass Cancer Diagnosis Using Tumor Gene Expression Signatures. PNAS 2001.

Applications

7. Microarrays for cancer classification

Goal: infer cancer genetics by looking at microarray data. A microarray reveals expression patterns and can hopefully be used to discriminate between similar cancers, and thus lead to better treatments.

Usual problem: small sample size (e.g., 50 cancer tissue samples), high dimensionality (e.g., 20,000-30,000). Curse of dimensionality.

Example 1: Myeloid vs. Lymphoblastic leukemias

ALL: acute lymphoblastic leukemia
AML: acute myeloblastic leukemia

SVM training: leave-one-out cross-validation
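Leave-one-out cross-validation trains on all samples but one and tests on the held-out sample, repeating this for every sample. The sketch below shows the procedure with a generic `fit`/`predict` pair. The nearest-class-mean classifier and the six data points are stand-ins chosen to keep the example short; they are not the SVM or the leukemia data from the lecture.

```python
def leave_one_out_error(xs, ys, fit, predict):
    """Fraction of samples misclassified when each one is held out in turn."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        if predict(model, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)

# Stand-in classifier (an assumption for illustration): nearest class mean.
def fit_nearest_mean(xs, ys):
    means = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        means[label] = [sum(coord) / len(pts) for coord in zip(*pts)]
    return means

def predict_nearest_mean(means, x):
    def sq_dist(label):
        return sum((xi - mi) ** 2 for xi, mi in zip(x, means[label]))
    return min(means, key=sq_dist)

# Hypothetical, well-separated two-class data.
xs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
ys = [-1, -1, -1, 1, 1, 1]

print(leave_one_out_error(xs, ys, fit_nearest_mean, predict_nearest_mean))
```

To use an SVM instead, `fit` would solve the quadratic program on the training fold and `predict` would return the sign of $f(x)$.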


Fig. 1 (S. Mukherjee): Myeloid and Lymphoblastic Leukemia classification by SVM

Fig. 2 (S. Mukherjee): AML vs. ALL error rates with increasing sample size

In the above figure the curves represent error rates with a split between training and test sets. The red dot represents leave-one-out cross-validation.
