S1324: Support Vector Machines

1 S1324: Support Vector Machines Chloé-Agathe Azencott based on slides by J.-P. Vert

2 Overview Large margin linear classification Support vector machines (SVMs) The linearly separable case: hard-margin SVMs The non-linearly separable case: soft-margin SVMs Non-linear SVMs and kernels Kernels for strings Kernels for graphs Course material:

3 Classification in bioinformatics

4 Cancer diagnosis Problem 1 Given the expression levels of 20k genes in a leukemia, is it an acute lymphocytic or myeloid leukemia (ALL or AML)?

5 Cancer prognosis Problem 2 Given the expression levels of 20k genes in a tumour after surgery, is it likely to relapse later?

6 Pharmacogenomics / Toxicogenomics Problem 3 Given the genome of a person, which drug should we give?

7 Protein annotation Data available Secreted proteins: MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA... MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW... MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL Non-secreted proteins: MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG... MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG... MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..... Problem 4 Given a newly sequenced protein, is it secreted or not?

8 Drug discovery active inactive inactive active inactive active Problem 5 Given a new candidate molecule, is it likely to be active?

9 A common topic

13 On real data...

14 Pattern recognition, aka supervised classification Challenges High dimension Few samples Structured data Heterogeneous data Prior knowledge Fast and scalable implementations Interpretable models

15 Hard-margin Linear Support Vector Machines

16 Linear classifier

24 Which one is better?

25 The margin of a linear classifier

30 Largest margin classifier (hard-margin SVM)

31 Support vectors

32 More formally The training set is a finite set of n data/class pairs: S = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ R^p and y_i ∈ {−1, +1}. We assume (for the moment) that the data are linearly separable, i.e., that there exists (w, b) ∈ R^p × R such that: w·x_i + b > 0 if y_i = +1, and w·x_i + b < 0 if y_i = −1.

33 How to find the largest separating hyperplane? For a given linear classifier f(x) = w·x + b, consider the "tube" defined by the values −1 and +1 of the decision function: the region between the hyperplanes w·x + b = +1 and w·x + b = −1, on either side of the separating hyperplane w·x + b = 0.

34 The margin is 2/‖w‖ Indeed, take a point x_1 on the hyperplane w·x + b = 0 and a point x_2 on w·x + b = 1, with x_2 − x_1 parallel to w. Subtracting the two equations gives w·(x_2 − x_1) = 1, hence ‖x_2 − x_1‖ = 1/‖w‖, and therefore: γ = 2‖x_2 − x_1‖ = 2/‖w‖.

35 All training points should be on the correct side of the dotted line For positive examples (y_i = +1) this means: w·x_i + b ≥ 1. For negative examples (y_i = −1) this means: w·x_i + b ≤ −1. Both cases are summarized by: for i = 1, ..., n, y_i(w·x_i + b) ≥ 1.

36 Finding the optimal hyperplane Find (w, b) which minimize: ‖w‖² under the constraints: for i = 1, ..., n, y_i(w·x_i + b) − 1 ≥ 0. This is a classical quadratic program on R^(p+1).

37 Lagrange multipliers Goal: find the critical points of f : R^(p+1) → R on a surface defined by g(x) = C. The level surface f(x) = const and the surface g(x) = C must be tangent, i.e. the normal vectors to f(x) = const and g(x) = C must point in the same direction: ∇f(x) = α ∇g(x). α is called a Lagrange multiplier. The Lagrangian is defined by L(x, α) = f(x) − α g(x). For multiple constraints g_1(x) = 0, ..., g_r(x) = 0: L(x, α) = f(x) − Σ_{i=1}^r α_i g_i(x). To solve: ∇L = 0.

38 Karush-Kuhn-Tucker Goal: find the critical points of f : R^(p+1) → R under the constraint h(x) ≤ 0. Either h(x) < 0 and ∇f(x) = 0, or h(x) = 0 and ∇f(x) = α ∇h(x). Hence solve: ∇f(x) − α ∇h(x) = 0 and α h(x) = 0, under the constraint that α ≥ 0 (the gradients have opposite directions).

39 Duality Lagrangian L : X × R^r → R, L(x, α) = f(x) − Σ_{i=1}^r α_i h_i(x). Lagrange dual function q : R^r → R, q(α) = inf_{x ∈ X} L(x, α). q is concave in α (even if L is not convex). The dual function yields lower bounds on the optimal value f* of the primal problem when α ∈ R^r_+: q(α) ≤ f* for all α ∈ R^r_+.

40 Duality Primal problem: minimize f over X s.t. h(x) ≥ 0. Lagrange dual problem: maximize q over R^r_+. Weak duality: if f* optimizes the primal and d* optimizes the dual, then d* ≤ f*. Always holds. Strong duality: f* = d*. Holds under specific conditions (constraint qualification), e.g. Slater's: f convex and h affine. More on optimization: see S1734 (Nicolas Petit).

41 Lagrangian In order to minimize: (1/2)‖w‖² under the constraints: for i = 1, ..., n, y_i(w·x_i + b) − 1 ≥ 0, we introduce one dual variable α_i for each constraint, i.e., for each training point. The Lagrangian is: L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^n α_i (y_i(w·x_i + b) − 1).

42 Lagrangian L(w, b, α) is convex quadratic in w. It is minimized for: ∇_w L = w − Σ_{i=1}^n α_i y_i x_i = 0, i.e. w = Σ_{i=1}^n α_i y_i x_i. L(w, b, α) is affine in b. Its minimum is −∞ except if: ∂L/∂b = −Σ_{i=1}^n α_i y_i = 0.

43 Dual function We therefore obtain the Lagrange dual function: q(α) = inf_{w ∈ R^p, b ∈ R} L(w, b, α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n y_i y_j α_i α_j x_i·x_j if Σ_{i=1}^n α_i y_i = 0, and −∞ otherwise. The dual problem is: maximize q(α) subject to α ≥ 0.

44 Dual problem Find α ∈ R^n which maximizes L(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i·x_j, under the (simple) constraints α_i ≥ 0 (for i = 1, ..., n), and Σ_{i=1}^n α_i y_i = 0. This is a quadratic program on R^n, with "box constraints". α* can be found efficiently using dedicated optimization software.

45 Recovering the optimal hyperplane Once α* is found, we recover (w*, b*) corresponding to the optimal hyperplane. w* is given by: w* = Σ_{i=1}^n α_i* y_i x_i, and the decision function is therefore: f*(x) = w*·x + b* = Σ_{i=1}^n α_i* y_i x_i·x + b*.
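
To make slides 44-45 concrete, here is a small numeric sketch (not part of the original slides) that solves the hard-margin dual with the quadprog package and then recovers (w*, b*); the synthetic data, the tiny ridge added to the quadratic term, and the 1e-6 support-vector threshold are all ad hoc choices of this sketch.

# Hard-margin SVM: solve the dual QP of slide 44, then recover (w*, b*) as on slide 45.
library(quadprog)

set.seed(1)
x <- rbind(matrix(rnorm(20, mean =  2), ncol = 2),   # 10 points of class +1
           matrix(rnorm(20, mean = -2), ncol = 2))   # 10 points of class -1
y <- rep(c(1, -1), each = 10)
n <- nrow(x)

# Dual: maximize sum(alpha) - 1/2 alpha' Q alpha with Q_ij = y_i y_j x_i.x_j,
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0.
Q <- (y %*% t(y)) * (x %*% t(x))
D <- Q + diag(1e-8, n)                  # small ridge so solve.QP sees a positive definite matrix
A <- cbind(y, diag(n))                  # first column: the equality constraint y'alpha = 0
qp <- solve.QP(Dmat = D, dvec = rep(1, n), Amat = A, bvec = rep(0, n + 1), meq = 1)
alpha <- qp$solution

# Recover the hyperplane: w* = sum_i alpha_i* y_i x_i, b* from any support vector.
w  <- colSums(alpha * y * x)
sv <- which(alpha > 1e-6)               # support vectors: alpha_i* > 0
b  <- mean(y[sv] - x[sv, , drop = FALSE] %*% w)
f  <- function(z) as.numeric(z %*% w + b)    # decision function f*(x) = w*.x + b*
mean(sign(f(x)) == y)                        # training accuracy (1 if the data are separable)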

46 Interpretation: support vectors [Figure: training points with α = 0 vs. support vectors with α > 0]

47 Soft-Margin Linear Support Vector Machines

48 What if data are not linearly separable?

52 Soft-margin SVM Find a trade-off between large margin and few errors. Mathematically: min_f { 1/margin(f) + C × errors(f) }, where C is a parameter.

53 Soft-margin SVM formulation The margin of a labeled point (x, y) is margin(x, y) = y(w·x + b). The error is 0 if margin(x, y) > 1, and 1 − margin(x, y) otherwise. The soft-margin SVM solves: min_{w,b} { ‖w‖² + C Σ_{i=1}^n max(0, 1 − y_i(w·x_i + b)) }.

54 Soft-margin SVM and hinge loss min_{w,b} { Σ_{i=1}^n ℓ_hinge(w·x_i + b, y_i) + λ‖w‖² }, for λ = 1/C and the hinge loss function: ℓ_hinge(u, y) = max(1 − yu, 0) = 0 if yu ≥ 1, and 1 − yu otherwise. [Figure: hinge loss ℓ(f(x), y) as a function of yf(x)]
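
As a quick illustration of slide 54 (my addition, not in the slides), the hinge loss and the regularized objective take only a few lines of R; the toy points, weights and λ below are arbitrary.

# Hinge loss l_hinge(u, y) = max(0, 1 - y*u), vectorized over training points.
hinge <- function(u, y) pmax(0, 1 - y * u)

# Soft-margin objective: sum_i l_hinge(w.x_i + b, y_i) + lambda * ||w||^2.
svm_objective <- function(w, b, x, y, lambda) {
  u <- as.numeric(x %*% w + b)
  sum(hinge(u, y)) + lambda * sum(w^2)
}

# Toy check: only points on the wrong side of their margin contribute to the loss.
x <- rbind(c(2, 2), c(-2, -1), c(0.2, 0.1))
y <- c(1, -1, 1)
svm_objective(w = c(1, 1), b = 0, x = x, y = y, lambda = 0.1)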

55 Dual formulation of soft-margin SVM (exercise) Maximize L(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i·x_j, under the constraints: 0 ≤ α_i ≤ C for i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0.

56 Interpretation: bounded and unbounded support vectors [Figure: points with α = 0, bounded support vectors with α = C, unbounded support vectors with 0 < α < C]

57 Primal (for large n) vs dual (for large p) optimization 1. Find (w, b) ∈ R^(p+1) which solve: min_{w,b} { Σ_{i=1}^n ℓ_hinge(w·x_i + b, y_i) + λ‖w‖² }. 2. Find α ∈ R^n which maximizes L(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i·x_j, under the constraints: 0 ≤ α_i ≤ C for i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0.

58 Non-Linear Support Vector Machines and Kernels

59 Sometimes linear methods are not interesting

60 Solution: nonlinear mapping to a feature space For x = (x_1, x_2), let Φ(x) = (x_1², x_2²). The decision function is: f(x) = x_1² + x_2² − R² = (1, 1)·(x_1², x_2²) − R² = β·Φ(x) + b.

61 Kernel = inner product in the feature space Definition For a given mapping Φ : X → H from the space of objects X to some Hilbert space of features H, the kernel between two objects x and x′ is the inner product of their images in the feature space: for all x, x′ ∈ X, K(x, x′) = Φ(x)·Φ(x′).

62 Example Let X = H = R², and for x = (x_1, x_2) let Φ(x) = (x_1², x_2²). Then K(x, x′) = Φ(x)·Φ(x′) = x_1²(x_1′)² + x_2²(x_2′)².
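
A tiny numeric check of slide 62 (my addition): computing Φ explicitly and computing K directly give the same number.

# Feature map Phi(x) = (x1^2, x2^2) and the corresponding kernel on R^2.
phi <- function(x) c(x[1]^2, x[2]^2)
K   <- function(x, xp) x[1]^2 * xp[1]^2 + x[2]^2 * xp[2]^2

x  <- c(1.5, -0.5)
xp <- c(0.3,  2.0)
sum(phi(x) * phi(xp))   # inner product computed in the feature space
K(x, xp)                # same value, computed without forming Phi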

63 The kernel tricks Two tricks: 1. Many linear algorithms (in particular the linear SVM) can be performed in the feature space of Φ(x) without explicitly computing the images Φ(x), but instead by computing kernels K(x, x′). 2. It is sometimes possible to easily compute kernels which correspond to complex, large-dimensional feature spaces: K(x, x′) is often much simpler to compute than Φ(x) and Φ(x′).

64 Trick 1: SVM in the original space Train the SVM by maximizing max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j x_i·x_j, under the constraints: 0 ≤ α_i ≤ C for i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0. Predict with the decision function f(x) = Σ_{i=1}^n α_i y_i x_i·x + b.

65 Trick 1: SVM in the feature space Train the SVM by maximizing max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j Φ(x_i)·Φ(x_j), under the constraints: 0 ≤ α_i ≤ C for i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0. Predict with the decision function f(x) = Σ_{i=1}^n α_i y_i Φ(x_i)·Φ(x) + b.

66 Trick 1: SVM in the feature space with a kernel Train the SVM by maximizing max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j K(x_i, x_j), under the constraints: 0 ≤ α_i ≤ C for i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0. Predict with the decision function f(x) = Σ_{i=1}^n α_i y_i K(x_i, x) + b.
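
Trick 1 in code (a sketch, not from the slides): if I recall kernlab's interface correctly, ksvm can be trained directly from a precomputed kernel matrix, so any kernel K(x, x′) can be plugged into the dual above; the quadratic kernel and the value C = 10 below are arbitrary illustration choices.

# Train an SVM from a precomputed kernel matrix (here K(x, x') = (x.x')^2).
library(kernlab)

set.seed(2)
x <- matrix(rnorm(100), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1, "pos", "neg"))   # non-linear labels

Kquad <- function(u, v) (sum(u * v))^2
Kmat  <- matrix(0, nrow(x), nrow(x))
for (i in seq_len(nrow(x)))
  for (j in seq_len(nrow(x)))
    Kmat[i, j] <- Kquad(x[i, ], x[j, ])

svp <- ksvm(as.kernelMatrix(Kmat), y, type = "C-svc", C = 10)
error(svp)   # training error of the kernelized SVM (kernlab accessor)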

67 Trick 2 illustration: polynomial kernel For x = (x_1, x_2) ∈ R², let Φ(x) = (x_1², √2 x_1 x_2, x_2²) ∈ R³: K(x, x′) = x_1²(x_1′)² + 2 x_1 x_2 x_1′ x_2′ + x_2²(x_2′)² = (x_1 x_1′ + x_2 x_2′)² = (x·x′)².

68 Trick 2 illustration: polynomial kernel More generally, for x, x′ ∈ R^p, K(x, x′) = (x·x′ + 1)^d is an inner product in a feature space of all monomials of degree up to d (left as an exercise; see the sketch below for d = 2).
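
The exercise of slide 68 can be checked numerically for d = 2 and p = 2 (my sketch): with Φ(x) = (1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²), the inner product Φ(u)·Φ(v) equals (u·v + 1)².

# Explicit monomial feature map for the degree-2 polynomial kernel on R^2.
phi2 <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                      x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

u <- c(0.7, -1.2)
v <- c(2.0,  0.4)
sum(phi2(u) * phi2(v))   # inner product of the monomial features
(sum(u * v) + 1)^2       # the kernel (u.v + 1)^2 -- same value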

69 Combining tricks: learn a polynomial discrimination rule with SVM Train the SVM by maximizing max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j y_i y_j (x_i·x_j + 1)^d, under the constraints: 0 ≤ α_i ≤ C for i = 1, ..., n, and Σ_{i=1}^n α_i y_i = 0. Predict with the decision function f(x) = Σ_{i=1}^n α_i y_i (x_i·x + 1)^d + b.

70 Illustration: toy nonlinear problem > plot(x, col=ifelse(y>0,1,2), pch=ifelse(y>0,1,2)) [Figure: training data, x2 vs. x1]

71 Illustration: toy nonlinear problem, linear SVM > library(kernlab) > svp <- ksvm(x, y, type="C-svc", kernel="vanilladot") > plot(svp, data=x) [Figure: SVM classification plot]

72 Illustration: toy nonlinear problem, polynomial SVM > svp <- ksvm(x, y, type="C-svc", kernel=polydot(degree=2)) > plot(svp, data=x) [Figure: SVM classification plot]

73 Which functions K(x, x′) are kernels? Definition A function K(x, x′) defined on a set X is a kernel if and only if there exists a feature space (Hilbert space) H and a mapping Φ : X → H such that, for any x, x′ in X: K(x, x′) = ⟨Φ(x), Φ(x′)⟩_H.

74 Positive Definite (p.d.) functions Definition A positive definite (p.d.) function on the set X is a function K : X × X → R which is symmetric: for all (x, x′) ∈ X², K(x, x′) = K(x′, x), and which satisfies, for all N ∈ N, (x_1, x_2, ..., x_N) ∈ X^N and (a_1, a_2, ..., a_N) ∈ R^N: Σ_{i=1}^N Σ_{j=1}^N a_i a_j K(x_i, x_j) ≥ 0.

75 Kernels are p.d. functions Theorem (Aronszajn, 1950) K is a kernel if and only if it is a positive definite function.

76 Proof? Kernel ⟹ p.d. function: ⟨Φ(x), Φ(x′)⟩ = ⟨Φ(x′), Φ(x)⟩, and Σ_{i=1}^N Σ_{j=1}^N a_i a_j ⟨Φ(x_i), Φ(x_j)⟩ = ‖Σ_{i=1}^N a_i Φ(x_i)‖² ≥ 0. P.d. function ⟹ kernel: more difficult...

77 Kernel examples Polynomial (X = R^m): K(u, v) = (u·v + c)^d. Gaussian Radial Basis Function (RBF) (X = R^m): K(u, v) = exp(−‖u − v‖² / (2σ²)). Min (X = R_+): K(u, v) = min(u, v).
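
The three kernels of slide 77 written as plain R functions (my sketch); u and v are numeric vectors, or non-negative scalars for the min kernel.

# Polynomial kernel (u.v + c)^d on R^m.
k_poly <- function(u, v, c = 1, d = 2) (sum(u * v) + c)^d

# Gaussian RBF kernel exp(-||u - v||^2 / (2 sigma^2)) on R^m.
k_rbf <- function(u, v, sigma = 1) exp(-sum((u - v)^2) / (2 * sigma^2))

# Min kernel min(u, v) on R_+.
k_min <- function(u, v) min(u, v)

k_poly(c(1, 2), c(3, 4)); k_rbf(c(1, 2), c(3, 4)); k_min(0.3, 1.7)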

78 Example: SVM with a Gaussian kernel Training: max_{α ∈ R^n} Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j exp(−‖x_i − x_j‖² / (2σ²)), s.t. 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0. Prediction: f(x) = Σ_{i=1}^n α_i y_i exp(−‖x − x_i‖² / (2σ²)).
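
A kernlab sketch for slide 78 (my addition). Note that kernlab's rbfdot kernel is parameterized as exp(−σ‖u − v‖²), so its sigma argument plays the role of 1/(2σ²) in the slide's notation; the value 0.5 below is arbitrary.

# Gaussian-kernel SVM with kernlab.
library(kernlab)

set.seed(3)
x <- matrix(rnorm(200), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1, "pos", "neg"))

svp <- ksvm(x, y, type = "C-svc", kernel = "rbfdot",
            kpar = list(sigma = 0.5), C = 1)
table(predict(svp, x), y)   # confusion matrix on the training points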

79 Example: SVM with a Gaussian kernel f(x) = Σ_{i=1}^n α_i y_i exp(−‖x − x_i‖² / (2σ²)) [Figure: SVM classification plot]

80 Linear vs nonlinear SVM

81 Regularity vs data fitting trade-off

82 C controls the trade-off min_f { 1/margin(f) + C × errors(f) }

83 Why it is important to control the trade-off

84 How to choose C in practice Split your dataset in two ("train" and "test"). Train SVMs with different values of C on the "train" set. Compute the accuracy of each SVM on the "test" set. Choose the C which minimizes the "test" error (you may repeat this several times: cross-validation).
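
A minimal version of this recipe in R (my sketch, using kernlab): a single random 70/30 split and an arbitrary grid of C values.

# Select C on a held-out set: train on "train", evaluate on "test".
library(kernlab)

set.seed(4)
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1, "pos", "neg"))

train <- sample(nrow(x), size = 0.7 * nrow(x))
Cs  <- c(0.01, 0.1, 1, 10, 100)
acc <- sapply(Cs, function(C) {
  svp <- ksvm(x[train, ], y[train], type = "C-svc",
              kernel = "rbfdot", kpar = list(sigma = 0.5), C = C)
  mean(predict(svp, x[-train, ]) == y[-train])
})
data.frame(C = Cs, test_accuracy = acc)
Cs[which.max(acc)]   # retained value of C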

85 Kernels for Strings

86 Supervised sequence classification Data (training) Goal Secreted proteins: MASKATLLLAFTLLFATCIARHQQRQQQQNQCQLQNIEA... MARSSLFTFLCLAVFINGCLSQIEQQSPWEFQGSEVW... MALHTVLIMLSLLPMLEAQNPEHANITIGEPITNETLGWL Non-secreted proteins: MAPPSVFAEVPQAQPVLVFKLIADFREDPDPRKVNLGVG... MAHTLGLTQPNSTEPHKISFTAKEIDVIEWKGDILVVG... MSISESYAKEIKTAFRQFTDFPIEGEQFEDFLPIIGNP..... Build a classifier to predict whether new proteins are secreted or not.

87 String kernels The idea Map each string x ∈ X to a vector Φ(x) ∈ F. Train a classifier for vectors on the images Φ(x_1), ..., Φ(x_n) of the training set (nearest neighbor, linear perceptron, logistic regression, support vector machine...). [Figure: mapping Φ from the space of sequences X to the feature space F]

88 Example: substring indexation The approach Index the feature space by fixed-length strings, i.e., Φ(x) = (Φ_u(x))_{u ∈ A^k}, where Φ_u(x) can be: the number of occurrences of u in x (without gaps): spectrum kernel (Leslie et al., 2002); the number of occurrences of u in x up to m mismatches (without gaps): mismatch kernel (Leslie et al., 2004); the number of occurrences of u in x allowing gaps, with a weight decaying exponentially with the number of gaps: substring kernel (Lodhi et al., 2002).

89 Spectrum kernel (1/2) Kernel definition The 3-spectrum of x = CGGSLIAMMWFGV is: (CGG, GGS, GSL, SLI, LIA, IAM, AMM, MMW, MWF, WFG, FGV). Let Φ_u(x) denote the number of occurrences of u in x. The k-spectrum kernel is: K(x, x′) := Σ_{u ∈ A^k} Φ_u(x) Φ_u(x′).

90 Spectrum kernel (2/2) Implementation The computation of the kernel is formally a sum over |A|^k terms, but at most |x| − k + 1 terms are non-zero in Φ(x), so the kernel can be computed in O(|x| + |x′|) with pre-indexation of the strings. Fast classification of a sequence x in O(|x|): f(x) = w·Φ(x) = Σ_u w_u Φ_u(x) = Σ_{i=1}^{|x|−k+1} w_{x_i...x_{i+k−1}}. Remarks Works with any string (natural language, time series...). Fast and scalable, a good default method for string classification. Variants allow matching of k-mers up to m mismatches.
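
A base-R sketch of the k-spectrum kernel of slides 89-90 (my addition, without the pre-indexation speed-ups): count the k-mers of each string and take the dot product of the two count vectors.

# k-spectrum kernel: K(x, x') = sum_u Phi_u(x) Phi_u(x'),
# where Phi_u(x) counts the occurrences of the k-mer u in x.
kmer_counts <- function(x, k) {
  starts <- 1:(nchar(x) - k + 1)
  table(substring(x, starts, starts + k - 1))
}

spectrum_kernel <- function(x, xp, k = 3) {
  cx  <- kmer_counts(x, k)
  cxp <- kmer_counts(xp, k)
  common <- intersect(names(cx), names(cxp))        # only shared k-mers contribute
  sum(as.numeric(cx[common]) * as.numeric(cxp[common]))
}

spectrum_kernel("CGGSLIAMMWFGV", "CLIVMMNRLMWFGV", k = 3)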

91 Local alignment kernel (Saigo et al., 2004) Example alignment π of x = CGGSLIAMMWFGV and y = CLIVMMNRLMWFGV: CGGSLIAMM----WFGV / C---LIVMMNRLMWFGV, with score s_{S,g}(π) = S(C,C) + S(L,L) + S(I,I) + S(A,V) + 2S(M,M) + S(W,W) + S(F,F) + S(G,G) + S(V,V) − g(3) − g(4). SW_{S,g}(x, y) := max_{π ∈ Π(x,y)} s_{S,g}(π) is not a kernel. K_LA^(β)(x, y) := Σ_{π ∈ Π(x,y)} exp(β s_{S,g}(x, y, π)) is a kernel.

92 LA kernel is p.d.: proof (1/2) Definition: Convolution kernel (Haussler, 1999) Let K_1 and K_2 be two p.d. kernels for strings. The convolution of K_1 and K_2, denoted K_1 ⋆ K_2, is defined for any x, y ∈ X by: (K_1 ⋆ K_2)(x, y) := Σ_{x_1 x_2 = x, y_1 y_2 = y} K_1(x_1, y_1) K_2(x_2, y_2). Lemma If K_1 and K_2 are p.d. then K_1 ⋆ K_2 is p.d.

93 LA kernel is p.d.: proof (2/2) K_LA^(β) = Σ_{n=0}^∞ K_0 ⋆ (K_a^(β) ⋆ K_g^(β))^(n−1) ⋆ K_a^(β) ⋆ K_0, with: the constant kernel K_0(x, y) := 1; a kernel for letters K_a^(β)(x, y) := 0 if |x| ≠ 1 or |y| ≠ 1, exp(βS(x, y)) otherwise; a kernel for gaps K_g^(β)(x, y) = exp[β(g(|x|) + g(|y|))].

94 The choice of kernel matters [Figure: number of families with given performance vs. ROC50, for SVM-LA, SVM-pairwise, SVM-Mismatch and SVM-Fisher] Performance on the SCOP superfamily recognition benchmark (from Saigo et al., 2004).

95 Kernels for Graphs

96 Virtual screening for drug discovery active inactive inactive active inactive active NCI AIDS screen results (from

97 Image retrieval and classification From Harchaoui and Bach (2007).

98 Graph kernels 1 Represent each graph x by a vector Φ(x) H, either explicitly or implicitly through the kernel K (x, x ) = Φ(x) Φ(x ). 2 Use a linear method for classification in H. X

101 Indexing by all subgraphs? Theorem Computing all subgraph occurrences is NP-hard. Proof. The linear graph of size n is a subgraph of a graph X with n vertices iff X has a Hamiltonian path. The decision problem whether a graph has a Hamiltonian path is NP-complete.

104 Indexing by specific subgraphs Substructure selection We can imagine more limited sets of substructures that lead to more computationally efficient indexing (non-exhaustive list): substructures selected by domain knowledge (MDL fingerprint); all paths up to length k (Openeye fingerprint, Nicholls 2005); all shortest paths (Borgwardt and Kriegel, 2005); all subgraphs up to k vertices (graphlet kernel, Shervashidze et al., 2009); all frequent subgraphs in the database (Helma et al., 2004).

105 Example: indexing by all shortest paths [Figure: two labeled graphs and their shortest-path count vector, e.g. (0, ..., 0, 2, 0, ..., 0, 1, 0, ...)] Properties (Borgwardt and Kriegel, 2005) There are O(n²) shortest paths. The vector of counts can be computed in O(n⁴) with the Floyd-Warshall algorithm.

107 Example: indexing by all subgraphs up to k vertices Properties (Shervashidze et al., 2009) Naive enumeration scales as O(n^k). Enumeration of connected graphlets in O(nd^(k−1)) for graphs with degree d and k ≤ 5. Randomly sample subgraphs if enumeration is infeasible.

109 Walks Definition A walk of a graph (V, E) is a sequence v_1, ..., v_n ∈ V such that (v_i, v_{i+1}) ∈ E for i = 1, ..., n − 1. We note W_n(G) the set of walks with n vertices of the graph G, and W(G) the set of all walks.

110 Walks ≠ paths

111 Walk kernel Definition Let S_n denote the set of all possible label sequences of walks of length n (including vertex and edge labels), and S = ∪_{n ≥ 1} S_n. For any graph G, let a weight λ_G(w) be associated to each walk w ∈ W(G). Let the feature vector Φ(G) = (Φ_s(G))_{s ∈ S} be defined by: Φ_s(G) = Σ_{w ∈ W(G)} λ_G(w) 1(s is the label sequence of w). A walk kernel is a graph kernel defined by: K_walk(G_1, G_2) = Σ_{s ∈ S} Φ_s(G_1) Φ_s(G_2).

113 Walk kernel examples The nth-order walk kernel is the walk kernel with λ_G(w) = 1 if the length of w is n, 0 otherwise. It compares two graphs through their common walks of length n. The random walk kernel is obtained with λ_G(w) = P_G(w), where P_G is a Markov random walk on G. In that case we have: K(G_1, G_2) = P(label(W_1) = label(W_2)), where W_1 and W_2 are two independent random walks on G_1 and G_2, respectively (Kashima et al., 2003). The geometric walk kernel is obtained (when it converges) with λ_G(w) = β^length(w), for β > 0. In that case the feature space is of infinite dimension (Gärtner et al., 2003).

116 Computation of walk kernels Proposition These three kernels (nth-order, random and geometric walk kernels) can be computed efficiently in polynomial time.

117 Product graph Definition Let G_1 = (V_1, E_1) and G_2 = (V_2, E_2) be two graphs with labeled vertices. The product graph G = G_1 × G_2 is the graph G = (V, E) with: 1. V = {(v_1, v_2) ∈ V_1 × V_2 : v_1 and v_2 have the same label}, 2. E = {((v_1, v_2), (v_1′, v_2′)) ∈ V × V : (v_1, v_1′) ∈ E_1 and (v_2, v_2′) ∈ E_2}. [Figure: two labeled graphs G1, G2 and their product graph G1 × G2]

118 Walk kernel and product graph Lemma There is a bijection between: 1. the pairs of walks w_1 ∈ W_n(G_1) and w_2 ∈ W_n(G_2) with the same label sequences, and 2. the walks on the product graph w ∈ W_n(G_1 × G_2). Corollary K_walk(G_1, G_2) = Σ_{s ∈ S} Φ_s(G_1) Φ_s(G_2) = Σ_{(w_1, w_2) ∈ W(G_1) × W(G_2)} λ_{G_1}(w_1) λ_{G_2}(w_2) 1(l(w_1) = l(w_2)) = Σ_{w ∈ W(G_1 × G_2)} λ_{G_1 × G_2}(w).

120 Computation of the nth-order walk kernel For the nth-order walk kernel we have λ_{G_1 × G_2}(w) = 1 if the length of w is n, 0 otherwise. Therefore: K_nth-order(G_1, G_2) = Σ_{w ∈ W_n(G_1 × G_2)} 1. Let A be the adjacency matrix of G_1 × G_2. Then we get: K_nth-order(G_1, G_2) = Σ_{i,j} [A^n]_{i,j} = 1ᵀ A^n 1. Computation in O(n |G_1| |G_2| d_1 d_2), where d_i is the maximum degree of G_i.
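
A toy base-R sketch of slides 117-120 (my own example, not from the slides): build the product graph of two small vertex-labeled graphs and compute the nth-order walk kernel as 1ᵀ A^n 1 on its adjacency matrix; here n counts the edges of a walk.

# nth-order walk kernel K(G1, G2) = 1' A^n 1, with A the adjacency matrix
# of the labeled product graph G1 x G2.
walk_kernel_nth <- function(A1, lab1, A2, lab2, n) {
  V <- which(outer(lab1, lab2, "=="), arr.ind = TRUE)   # product vertices: label-matched pairs
  m <- nrow(V)
  A <- matrix(0, m, m)
  for (p in seq_len(m)) for (q in seq_len(m))
    A[p, q] <- A1[V[p, 1], V[q, 1]] * A2[V[p, 2], V[q, 2]]
  An <- diag(m)
  for (i in seq_len(n)) An <- An %*% A                  # A^n by repeated multiplication
  sum(An)                                               # = 1' A^n 1
}

# Toy labeled graphs: a triangle and a path of three vertices.
A1 <- matrix(c(0,1,1, 1,0,1, 1,1,0), 3, 3); lab1 <- c("A", "A", "B")
A2 <- matrix(c(0,1,0, 1,0,1, 0,1,0), 3, 3); lab2 <- c("A", "B", "A")
walk_kernel_nth(A1, lab1, A2, lab2, n = 2)   # common labeled walks with 2 edges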

121 Computation of random and geometric walk kernels In both cases λ_G(w) for a walk w = v_1 ... v_n can be decomposed as: λ_G(v_1 ... v_n) = λ_i(v_1) Π_{i=2}^n λ_t(v_{i−1}, v_i). Let Λ_i be the vector of λ_i(v) and Λ_t be the matrix of λ_t(v, v′): K_walk(G_1, G_2) = Σ_{n=1}^∞ Σ_{w ∈ W_n(G_1 × G_2)} λ_i(v_1) Π_{i=2}^n λ_t(v_{i−1}, v_i) = Σ_{n=0}^∞ Λ_i Λ_t^n 1 = Λ_i (I − Λ_t)^{−1} 1. Computation in O(|G_1|³ |G_2|³).
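
Continuing the same toy setting (my sketch, with the simplification that the weight β^length is applied directly on the product graph), the geometric walk kernel can be computed in closed form with the matrix inverse of slide 121, provided β is small enough for the series to converge.

# Geometric walk kernel: sum over all common walks weighted by beta^length,
# computed as 1' (I - beta*A)^{-1} 1 on the product-graph adjacency matrix A.
geometric_walk_kernel <- function(A1, lab1, A2, lab2, beta) {
  V <- which(outer(lab1, lab2, "=="), arr.ind = TRUE)
  m <- nrow(V)
  A <- matrix(0, m, m)
  for (p in seq_len(m)) for (q in seq_len(m))
    A[p, q] <- A1[V[p, 1], V[q, 1]] * A2[V[p, 2], V[q, 2]]
  stopifnot(beta * max(abs(eigen(A)$values)) < 1)   # convergence of the geometric series
  sum(solve(diag(m) - beta * A))                    # = 1' (I - beta A)^{-1} 1
}

A1 <- matrix(c(0,1,1, 1,0,1, 1,1,0), 3, 3); lab1 <- c("A", "A", "B")
A2 <- matrix(c(0,1,0, 1,0,1, 0,1,0), 3, 3); lab2 <- c("A", "B", "A")
geometric_walk_kernel(A1, lab1, A2, lab2, beta = 0.2)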

122 Extension: branching walks (Ramon and Gärtner, 2003; Mahé and Vert, 2009) T(v, n + 1) = Σ_{v′ ∈ N(v)} λ_t(v, v′) T(v′, n). [Figure: tree patterns rooted at an atom of a molecular graph]

123 2D subtree vs walk kernels [Figure: AUC of the walk and subtree kernels for each of the 60 NCI cancer cell lines] Screening of inhibitors for 60 cancer cell lines.

124 Image classification (Harchaoui and Bach, 2007) COREL14 dataset: 1400 natural images in 14 classes. Compare a kernel between histograms (H), the walk kernel (W), the subtree kernel (TW), the weighted subtree kernel (wTW), and a combination (M). [Figure: test error of each kernel on Corel14]

125 SVM summary Large margin classifier Control of the regularization / data fitting trade-off with C Linear or nonlinear (with the kernel trick) Extension to strings, graphs... and many others

126 References
N. Aronszajn. Theory of reproducing kernels. Trans. Am. Math. Soc., 68, 1950.
K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In ICDM '05: Proceedings of the Fifth IEEE International Conference on Data Mining, pages 74-81, Washington, DC, USA, 2005. IEEE Computer Society.
Z. Harchaoui and F. Bach. Image classification with segmentation graph kernels. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), pages 1-8. IEEE Computer Society, 2007.
D. Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.
C. Helma, T. Cramer, S. Kramer, and L. De Raedt. Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. J. Chem. Inf. Comput. Sci., 44(4), 2004.
C. Leslie and R. Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5, 2004.
C. Leslie, E. Eskin, and W. Noble. The spectrum kernel: a string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauerdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, Singapore, 2002. World Scientific.

127 References (cont.)
H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. J. Mach. Learn. Res., 2, 2002.
P. Mahé and J.-P. Vert. Graph kernels based on tree patterns for molecules. Mach. Learn., 75(1):3-35, 2009.
A. Nicholls. OEChem, version 1.3.4, OpenEye Scientific Software. Website, 2005.
J. Ramon and T. Gärtner. Expressivity versus efficiency of graph kernels. In T. Washio and L. De Raedt, editors, Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pages 65-74, 2003.
H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20(11), 2004. URL http://bioinformatics.oupjournals.org/cgi/content/abstract/20/11/1682.
N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In 12th International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, Florida, USA, 2009. Society for Artificial Intelligence and Statistics.
