SVM example: cancer classification Support Vector Machines

Size: px

Start display at page:

Download "SVM example: cancer classification Support Vector Machines"

Anthony Fleming
5 years ago
Views:

1 SVM example: cancer classification Support Vector Machines 1. Cancer genomics: TCGA The cancer genome atlas (TCGA) will provide high-quality cancer data for large scale analysis by many groups:

2 SVM example: cancer classification

3 SVM example: cancer classification

4 SVM example: cancer classification

5 SVM example: cancer classification 2. Example: cancer classification Source: T. Furey, N. Cristianini, et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics 16, Consider a set of 40 samples of colon cancer tissue, and 22 samples of normal colon tissue (62 all together).

6 SVM example: cancer classification For each sample = compute Let x œðbßáßb ". Ñœ microarray profile of sample = '# HœÖx 3 ßC 3 3œ" be collection of samples and correct classifications: C œ " 3 œ " if x3 cancerous if x non-cancerous. 3 We want function 0ÐxÑ œ C which for a new (test) sample x predicts its Cœ "Þ

7 SVM example: cancer classification Note the set of all possible profiles is x œðbßáßb Ñ ".. œj œfeature space of microarray We denote x œ feature vector J With the data set H, can we find the right function 0: J Ä U which generalizes the above examples, so that 0ÐxÑ œ C for all feature vectors?

8 SVM example: cancer classification Easier: find a 0 for which 0ÐxÑ! if Cœ"à 0ÐxÑ! if Cœ " (and 0ÐxÑ >> 1 indicates we are more certain C œ ").

9 4. Error function Loss function Consider the error measure: we want 0ÐxÑ! whenever Cœ" and want 0ÐxÑ! whenever Cœ " Measure the error (or penalty) for bad choice of C by Z Ð0ÐxÑß CÑ œ Ð" C0ÐxÑÑ maxa" C0ÐxÑß! bþ small if Cß 0ÐxÑ have same sign œ œ Þ large otherwise

10 Loss function This is the hinge error function.

11 Loss function Notice a margin is built in: error is! only if C0ÐxÑ " (more stringent requirement than just C0ÐxÑ!Ñ Thus data-based error (penalty) is 8 " /. œ " ZÐ0Ðx4ÑßC4Ñ 8 4œ" Not enough to determine 0! As usual need a priori (prior) information. What other information do we have?

12 Loss function Note surface L: 0ÐxÑ œ! will separate "positive" x with 0ÐxÑ!, and "negative" x with 0ÐxÑ! À

13 Loss function Fig. 1. Red points have Cœ+1 and blue have Cœ 1 in space J. L: 0ÐxÑ œ! is separating surface.

14 Loss function Additional information: introduce penalty (loss) functional PÐ0Ñ which is large when 0 is 'bad'. E.G., bad maybe non-smooth, etc. Form of PÐ0Ñ: assume 0ÐxÑ is allowed to range over collection [ of functions. Assume form of [ is an RKHS. Thus e.g. # PÐ0Ñ œ m0moþ Will specify desirable norm m m O later -- but for now:

15 Loss function Solve regularization problem for the above norm and loss Z : 8 " # 0! œ arg min " Ð" C40Ðx4ÑÑ - m0moþ (1) 8 0 [ 4œ"

16 Slack variables 5. Finding 0 : Introduction of slack variables Define new variables 0 4 Note if we find the min over 0 [ and 0 4 of with the constraint 8 " 0 [0 ß 4 4œ" arg min "04 - m0mo # (1a) 8

17 we get the same solution 0. Slack variables C0Ð 4 x4ñ " !ß To see this, note the constraints are 0 4 Ð!ß " C 4 0Ðx 4 ÑÑ œ Ð" C 4 0Ðx 4 ÑÑ max, (1b) which yields the claim. (Clearly in fact in minimizing sum we will end up with œð" C0Ðx ÑÑ )

18 Summary: the 0 Solving SVM which minimizes 8 " 0 œ arg min " Ð" C40Ðx4ÑÑ - m0mo # Þ (1) 8 0 [ 4œ" is given by the quadratic programming solution: 8 0ÐxÑ œ " + 4OÐxßx4 Ñ,. (4) 4œ" We find a œò+ " ßáß+ 8Ó X from + 4 œ! 4C4.

19 Solving SVM Here vector! œð! ßáß! Ñis defined by " 8 "! œ arg min" X! 4! T! (9) #! 8 4œ" with constraints "!Ÿ! Ÿ à! y œ! #-8

20 We define Solving SVM y œðc ßáßC ÑœHœ " 8 classifications of known samples, and Tœ YKY X ß K œðk34ñœoðx3ßx4ñ with >2 x 3 œ3 sample (e.g. microarray).

21 Solving SVM Finally, to find,, must plug into original optimization problem: that is, we minimize with respect to, 8 " "Ð" C 0Ðx ÑÑ -m0m 8 4œ" 4 4 O # 8 8 " œ " C + OÐ ß Ñ, 8 " " X 4 3 x4 x3 -a Ka after finding a. 4œ" 3œ"

22 Right RKHS for SVM 2Þ The RKHS for support vector machine General SVM: solution function is (see (4) above) 0ÐxÑ œ " + OÐxßx Ñ,ß with sol'n for + 4 given by quadratic programming as above. A simple case (linear kernel): OÐxx ß 4Ñ œ x x4. Then we have

23 Right RKHS for SVM 0ÐxÑœ" Ð+ x Ñ x, w x,ß where w " + 4x4 Þ (10) What class of RKHS [ does this correspond to? Claim the set of linear functions of x with inner product 4 [ œö w xlw.

24 Right RKHS for SVM Ø w" xß w# xù œ w" w# is the RKHS of OÐxy ß Ñ above.

25 Right RKHS for SVM Thus matrix K œ x x, and we find the optimal separator by choosing w as in (10). 0ÐxÑ œ w x Note add, to 0ÐxÑ (as earlier), so have all separator functions 0ÐxÑ œ w x,.

26 Right RKHS for SVM Note above inner product gives the norm 8 # [ # [ # 8 " # 4 4œ" m0ðxñm œ m w xm œ lwl œ A Þ Why use this norm? A priori information content. Final classification rule: 0Ðx Ñ! Ê C œ "; 0ÐxÑ! Ê C œ "Þ

27 Right RKHS for SVM Learning from training data: H0 œ Ð0Ðx Ñßáß0Ðx ÑÑ œ ÐC ßáßC ÑÞ Thus can show RKHS here is " 8 " 8 [ œö0ðxñœ w x Àw 8 is set of linear separator functions (known as neural network theory). perceptrons in Consider separating hyperplane LÀ0ÐxÑœ! :

28 3 Þ Toy example: Toy example

29 Information Toy example H0 œ ÖÒÐ"ß "Ñß "Óß Ò a"ß " bß "Óß ÒÐ "ß "Ñß "Óß ÒÐ "ß "Ñß "Ó (red œ +1à blue œ "Ñà 0œ w x, œ " + 3 ï Ð x3 xñ, 3 OÐx ßxÑ 3

30 so Toy example w œ " + x Þ # # [ w Recall m0m œ l l, so PÐ0Ñ œ " " Ð" 0Ðx ÑC Ñ " 4 4 lwl % # 4 # (- œ "Î#à minimize wrt w,,ñþ

31 Equivalent: Toy example % PÐ0Ñ œ " " 0 " 4 lwl % # 4œ" # C0Ð 4 x4ñ " 04à 04!Þ [Note effectively œð" aw x, bcñ ]

32 Define kernel matrix Toy example K34 œoðx3ßx4ñœ x3 x4 œ Ô #! #!! #! # Ö Ù #! #! Õ! #! # Ø # X # m0m[ œ lwl œ a Ka œ # " + 3 % a + " + $ + # + % b Þ % 3œ"

33 Ô + " + # where a œ Ö ÙÞ ã Õ+ Ø. Toy example

34 Toy example Solution has (see (8a) above) " "! œ# -] a œ] a Ô C "! á! Ô "!!!! C# á!! "!! recall Y œ Ö Ù œ Ö Ù ã ã ã ã!! "! Õ!! á C Ø Õ!!! " Ø 8

35 and (8a above) Finally optimize (8) Toy example "! œ! œ! Þ #- % " " X! 4 T ß #!! 4œ" where

36 Toy example P œ YKY Ô "!!! Ô #! #! Ô "!!!! "!! ÙÖ! #! # ÙÖ! "!! œ Ö ÙÖ ÙÖ Ù!! "! #! #!!! "! Õ!!! " ØÕ! #! # ØÕ!!! " Ø Ô "!!! Ô #! #!! "!! ÙÖ! #! # œ Ö ÙÖ Ù!! "! #! #! Õ!!! " ØÕ! #! # Ø X

37 Toy example Ô #! #!! #! # œ Ö ÙÞ #! #! Õ! #! # Ø

38 constraints are Toy example " "!Ÿ! 4 ŸG œ Þ #-8 %! œ! y œ!"! #! $! % Þ (11)

39 Thus optimize Toy example % % # _ œ"! "! 4 #!! #!! " 4 " $ # % 4œ" 4œ" % œ " # #! Ð!! Ñ Ð!! Ñ Þ 3œ" 3 " $ # % # ß

40 where Minimizing: Toy example? œ!"! $ œ!#! % Þ " #?œ!à " #@œ! Ê?œ@œ " Þ # Clearly this is largest if we make?œ@œ " happen (see constraint (10)) if! 4 œ a 4. % " # à this can only

41 Toy example So Ô "Î4 "Î4! œ Ö ÙÞ "Î4 Õ"Î4 Ø

42 Thus Toy example Ô "Î4 "Î4 a œ]! œö ÙÞ "Î4 Õ "Î4 Ø Thus w œ " + x œ " Ðx x x x Ñ œ " að%ß!ñb œ Ð"ß!Ñ % % Margin = " 3 3 " # $ %. lwl œ" (we'll revisit this--).

43 Toy example Now plug in a find, separately from original equation (9); we will minimize with respect to, the original functional

44 Toy example " _Ð0Ñ œ ", C l l % " a aw x b b w # " œ " Ð",ÑÐ"Ñ " Ð",ÑÐ"Ñ % œ c d c d c" Ð ",ÑÐ "Ñd cð" Ð ",ÑÐ "Ñd "

45 Toy example " œ,,,, " % œ c d c d c d c d " œ,, " # ec d c d f. Clearly the above is minimized when,œ!. Thus w œ Ð"ß!Ñà, œ! Ê 0ÐxÑ œ w x, œ B"

46 Toy example

47 Toy example [note in this case the margins reach just out to the closest data vectors; this always happens if - is small enough; see Theorem below].

48 SVM: Geometric interpretation SVM: Geometric interpretation 1. Basics Recall: if 0Ðx) œ w x, for some w J, we have defined: (independent of,). m0m[ œ lwl

49 SVM: Geometric interpretation Fig 2: SVM geometry (2 dimensions)

50 SVM: Geometric interpretation Recall Lagrangian (full loss function) to be minimized: 8 " _ Ð0Ñ œ Ð" C 0Ð ÑÑ - l l _ _ 8 " # 4 x4 w. : (8a) 4œ" (minimization over Ðwß,ÑÑÞ Why was this a good choice for _? What should - be? Consider variables (see (1b) earlier) 0 4 œð" C0Ð 4 x 4 ÑÑ Þ

51 Then SVM: Geometric interpretation 8 " _ œ 0-8 " # 4 w (8b) 4œ" In feature space J, define positive direction be parallel to w, negative direction antiparallel to w. For x J, value of 0ÐxÑœ w x, determined by.ðxñ œ distance of x from the separating hyperplane L! À 0Ðx Ñ œ!.

52 SVM: Geometric interpretation Define margin hyperplane (see diagram) L": 0ÐxÑ œ ". We assume.ðxñ positive in positive direction (parallel to w), negative in negative direction (antiparallel to w).

53 Specifically SVM: Geometric interpretation 0ÐxÑ œ lwl.ðx Ñ since gradient f0ðxñ œ w, so 0 increases along w rate lwl per unit change of x in w direction. Note if C œ " (i.e., x is in positive class), " lwl " lwl œð" l l.ð ÑÑ œ! if.ðx Ñ w x " lwl.ðx Ñ if.ðx Ñ Þ If x on positive side of L (.ÐxÑ Ñ: " " lwl

54 if x on negative side of L " : SVM: Geometric interpretation 0 4 œ!, 0 4 w x w " œ " l l.ð Ñ œ + l lðdistance from L Ñ.

55 Thus if C œ " 4 SVM: Geometric interpretation 0 4 œ! if x4 on "correct" side of margin œ L ". lwl Ðdistance from L Ñ if x on "wrong" side of L " 4 " Similarly, defining the "negative margin" hyperplane L " À 0Ðx Ñ œ ", we have if C œ " ( x in negative class) œ! if x on "correct" side of margin œ L lwl distance from L if x on "wrong" side of L 4 " " 4 ". Therefore (see above figure)

56 SVM: Geometric interpretation " 0 œlwl H 4 4 with H the total distance of points on the "wrong" sides of their respective margin hyperplanes L " ß i.e., H œ "total error"þ Also: distance from separating hyperplane " hyperplane L œ Þ " lwl L! to margin

57 SVM: Geometric interpretation [note: vectors on wrong side of margins are only ones needed for quadratic programming calculation; these are the support vectors] [fewer support vectors machine] Ê easier calculation Ê sparse Conclusion: Minimization of full Lagrangian (1) involves a balance between minimizing total error and the margin width " lwl parameter -.! 4, the balance determined by the regularization 0 4

58 1. Special case: Perfect separability If classes perfectly separable:

59 Minimizing 8 " Pœ " # 0 œp P 8 î - w ðñò 4œ" 4. : P. P: " involves maximizing margin lwl and minimizing the total error! 0 with the balance determined by Choose and so bisects the two groups with the w, L! maximum "margin" (see diagram above), and the

60 hyperplanes L " touch closest x4 to L! (such x4 are support vectors). Then still have " 0 œ total error œ!ß 4 4 while margin " lwl is as large as possible.

61 We thus have in perfectly separable case: Theorem: The wß, which minimize (1) give 0ÐxÑœ w x, whose separating hyperplane LÀ0ÐxÑœ! gives the widest margin, if - is sufficiently small.

62 Summary: In the general case we choose m0m[ œ w ß and we minimize 8 " 0 - w 4œ" 4 # subject to CÐ 4 w x,ñ " !. This is the basic SVM algorithm for finding 0ÐxÑ; see earlier for the QP algorithm leads to this.

63 2. The reproducing kernel As shown earlier the reproducing kernel OÐxy ß Ñfor [ above is ordinary dot product of vectors: OÐxy ß Ñ œ x yþ

64 4Þ Result: SVM on cancer Colon cancer application Recall: 40 samples colon cancer tissue 22 samples of normal colon tissue (62 total). For each sample computed Let x œðbßáßb Ñœ ". microarray profile '# HœÖx 3 ßC 3 3œ"

65 Colon cancer application be collection of samples and correct classifications: C œ " 3 œ " if x3 cancerous if x non-cancerous. 3 Result: using leave one out cross validation obtained: Feature space J is 6,500 dimensional (6,500 genes) Misclassification of 6/62 tissues using leave one out cross validation.

66 Handwritten digit recognition 5. Example application: handwritten digit recognition - USPS (Scholkopf, Burges, Vapnik) Handwritten digits:

67 Handwritten digit recognition

68 Handwritten digit recognition Training set (sample size): 7300; Test set: class classifier; 3 >2 class has a separating SVM function 0Ð 3 xñœ w3 x, 3 Chosen class is F Class œ argmax0 Ð ÑÞ 3 Ö!ßáß* 3 x À digit 1 Ä feature vector FÐ1Ñ œ x J

69 Handwritten digit recognition Kernels in feature space J : Results: l x3 x4 l# RBF: OÐx3ßx4Ñ œ / # 5 #. Polynomial: OœÐ x3 x4 ) Ñ Sigmoidal: Oœtanh Ð, Ð x x Ñ ) Ñ 3 4

70 Handwritten digit recognition

Solving SVM: Quadratic Programming

MA 751 Part 7 Solving SVM: Quadratic Programming 1. Quadratic programming (QP): Introducing Lagrange multipliers α 4 and. 4 (can be justified in QP for inequality as well as equality constraints) we define