Linear Classifiers (Kernels)
1 Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Linear Classifiers (Kernels) Blaine Nelson, Christoph Sawade, Tobias Scheffer
2 Exam Dates & Course Conclusion There are 2 exam dates: Feb 20th and March 25th. Next week Dr. Landwehr will give you info for registering; please think about which date would be best for you. Remaining lectures: Jan. 21 Hypothesis Evaluation; Jan. 28 Summary of Topics; Feb. 4 <Study Time, No Lecture>.
3 Contents Kernels for Structured Data Spaces: String Kernels, Graph Kernels. Main idea: kernel learning separates data & learning. The learning algorithm is developed to achieve a reasonable separation of classes in a feature space. The kernel function is developed to express a pairwise notion of similarity that corresponds to an inner product in some feature space --- domain-specific! The kernel abstraction allows us to learn on data that is non-numeric / structured.
4 Recall: Kernel Functions The kernel function k(x, x′) = φ(x)ᵀφ(x′) computes the inner product of the feature mappings of two instances. The kernel function can often be computed without an explicit representation of φ(x). E.g., the polynomial kernel: k_poly(x_i, x_j) = (x_iᵀx_j + 1)^p. Infinite-dimensional feature mappings are possible, e.g., the RBF kernel: k_RBF(x_i, x_j) = exp(−γ‖x_i − x_j‖²). For every positive definite kernel there is a feature mapping φ such that k(x, x′) = φ(x)ᵀφ(x′). For a given kernel matrix, the Mercer map provides a feature mapping.
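These two kernels can be sketched in a few lines of Python (a minimal illustration; the function names `k_poly` and `k_rbf` and the default parameters are ours, not from the lecture):

```python
import math

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def k_poly(x, y, p=2):
    # polynomial kernel: (x^T y + 1)^p
    return (dot(x, y) + 1) ** p

def k_rbf(x, y, gamma=1.0):
    # RBF kernel: exp(-gamma * ||x - y||^2);
    # corresponds to an infinite-dimensional feature mapping
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq)
```

For example, `k_poly([1, 2], [3, 4])` with p = 2 gives (11 + 1)² = 144, and `k_rbf(x, x)` is always 1.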
5 Recall: Polynomial Kernels Kernel: k_poly(x_i, x_j) = (x_iᵀx_j + 1)^p, with 2D input and p = 2:
k_poly(x_i, x_j) = (x_iᵀx_j + 1)² = (x_{i1}x_{j1} + x_{i2}x_{j2} + 1)²
= x_{i1}²x_{j1}² + x_{i2}²x_{j2}² + 2x_{i1}x_{j1}x_{i2}x_{j2} + 2x_{i1}x_{j1} + 2x_{i2}x_{j2} + 1
= (x_{i1}², x_{i2}², √2 x_{i1}x_{i2}, √2 x_{i1}, √2 x_{i2}, 1) (x_{j1}², x_{j2}², √2 x_{j1}x_{j2}, √2 x_{j1}, √2 x_{j2}, 1)ᵀ
= φ(x_i)ᵀφ(x_j)
The feature map φ contains all monomials of degree ≤ 2 over the input attributes.
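The identity between the kernel and the explicit feature map is easy to check numerically; a small sketch (the feature ordering follows the expansion above, and the example vectors are ours):

```python
import math

def phi(x):
    # explicit feature map for the 2D polynomial kernel with p = 2:
    # all monomials of degree <= 2, with sqrt(2) weights on the mixed terms
    x1, x2 = x
    r2 = math.sqrt(2)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]

def k_poly(x, y):
    return (x[0] * y[0] + x[1] * y[1] + 1) ** 2

xi, xj = [1.0, 2.0], [3.0, -1.0]
implicit = k_poly(xi, xj)                                   # kernel trick
explicit = sum(a * b for a, b in zip(phi(xi), phi(xj)))     # explicit map
# implicit and explicit agree up to floating-point error
```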
6 STRING KERNELS
7 Strings: Motivation Strings are a common non-numeric type of data. Documents & e-mails are strings. DNA & protein sequences are strings.
8 String Kernels String: a sequence of characters from an alphabet Σ, written s = s_1 s_2 … s_n with |s| = n. The set of all strings is Σ* = ∪_{n∈ℕ} Σ^n. Substring: s_{i:j} = s_i s_{i+1} … s_j. Subsequence: for any i ∈ {0,1}^n, s_i is the elements of s corresponding to the entries of i that are 1. E.g., if s = abcd, then s_{(1,0,0,1)} = ad. A string kernel is a real-valued function on Σ* × Σ*. We need positive definite kernels. We will design kernels by looking at a feature space of substrings / subsequences.
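The index-vector notation for subsequences can be made concrete with a one-line helper (illustrative only, not part of the lecture):

```python
def subsequence(s, idx):
    # s_i: keep the characters of s at positions where the 0/1 vector idx is 1
    return "".join(c for c, b in zip(s, idx) if b)

# slide example: s = "abcd", i = (1, 0, 0, 1) selects "ad"
```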
9 Bag-of-Words Kernel For textual data, a simple feature representation is indexed by the words contained in the string. Example instance x: "Dear Beneficiary, your address has been picked online in this years MICROSOFT CONSUMER AWARD as a Winner of One Hundred and Fifty Five Thousand Pounds Sterling". Features: "word #1 occurs?", …, "word #m occurs?" for a vocabulary of m words (e.g. Aardvark, Beneficiary, Friend, Sterling, Science, …). The bag-of-words kernel computes the number of common words between 2 texts; can it be computed efficiently?
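A sketch of the bag-of-words kernel, assuming whitespace tokenization and binary word-occurrence features. It runs in time roughly linear in the text lengths by hashing words instead of building an explicit m-dimensional vector, which answers the efficiency question:

```python
from collections import Counter

def bow_kernel(s, t, binary=True):
    # inner product of word-occurrence vectors; with binary=True this is
    # simply the number of distinct words common to both texts
    ws, wt = Counter(s.split()), Counter(t.split())
    if binary:
        return len(set(ws) & set(wt))
    # count-based variant: inner product of word-count vectors
    return sum(ws[w] * wt[w] for w in ws)
```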
10 Spectrum Kernel Consider a feature space with features corresponding to every length-p substring over the alphabet Σ. φ_s(u) is the number of times u ∈ Σ^p is contained in string s. The p-spectrum kernel is
κ_p(s, t) = Σ_{u ∈ Σ^p} φ_s(u) φ_t(u)
Example (p = 2, Σ = {a, b}):
φ        aa  ab  ba  bb        K       aaab  bbab  aaaa  baab
aaab      2   1   0   0        aaab      5     1     6     3
bbab      0   1   1   1        bbab      1     3     0     2
aaaa      3   0   0   0        aaaa      6     0     9     3
baab      1   1   1   0        baab      3     2     3     3
11 Spectrum Kernel Computation Without explicitly computing this feature map, the p-spectrum kernel can be computed as
κ_p(s, t) = Σ_{i=1}^{|s|−p+1} Σ_{j=1}^{|t|−p+1} I(s_{i:i+p−1} = t_{j:j+p−1})
This computation is O(p · |s| · |t|). Using trie data structures, it can be reduced to O(p · max(|s|, |t|)). Naturally, we can also compute (weighted) sums over different substring lengths.
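A direct implementation of the p-spectrum kernel using hashed substring counts (a sketch; this avoids the explicit |Σ|^p-dimensional feature vector but is not the trie-based optimization):

```python
from collections import Counter

def spectrum_kernel(s, t, p):
    # phi_s(u) = number of occurrences of the length-p substring u in s;
    # the kernel is the inner product of these count vectors
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[j:j + p] for j in range(len(t) - p + 1))
    return sum(cs[u] * ct[u] for u in cs)
```

For the example above, `spectrum_kernel("aaab", "aaaa", 2)` = 2 · 3 = 6, since aa occurs twice in aaab and three times in aaaa.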
12 All-Subsequences Kernel A subsequence is an ordered subset of a string. Every subsequence of a string s of length n is uniquely indexed by some i ∈ {0,1}^n; the subsequence corresponding to i is s_i. Consider a feature space with features corresponding to every string over the alphabet Σ. φ_s(u) is the number of times u ∈ Σ* occurs as a subsequence of s. The all-subsequences kernel is
κ(s, t) = Σ_{u ∈ Σ*} φ_s(u) φ_t(u)
13 All-Subsequences Kernel The all-subsequences kernel is κ(s, t) = Σ_{u ∈ Σ*} φ_s(u) φ_t(u), where φ_s(u) is the number of times u occurs as a subsequence of s. Example feature counts for the strings aab, bab, bba (degree-3 features and the empty subsequence omitted):
φ       a   b   aa  ab  ba  bb
aab     2   1   1   2   0   0
bab     1   2   0   1   1   1
bba     1   2   0   0   2   1
Problem: there are min(C(|s|, k), |Σ|^k) subsequences of length k in s — exponentially many features.
14 All-Subsequences Kernel How can we avoid the exponential size of the explicit feature space? Rewrite the all-subsequences kernel as κ(s, t) = Σ_{(i,j)} I(s_i = t_j), a sum over all pairs of index vectors that select matching subsequences. These matching subsequences can be split into 2 possibilities: either the last character of s is not used in the match, or the last character of s is used in the match. For a string ending in character σ:
κ(sσ, t) = κ(s, t) + Σ_{u,v : t = uσv} κ(s, u)
15 All-Subsequences Kernel κ(s, t) = number of matching subsequences of s and t. Ignore the last character of s: the number of matching subsequences of s_{1:n−1} and t. Match the last character of s to the k-th character of t: the number of matching subsequences of s_{1:n−1} and t_{1:k−1}. Together:
κ(s, t) = κ(s_{1:n−1}, t) + Σ_{k : t_k = s_n} κ(s_{1:n−1}, t_{1:k−1})
16 All-Subsequences Kernel Based on this decomposition, we get a recursion with base cases κ(s, ε) = 1 and κ(ε, t) = 1 for all s, t (only the empty subsequence matches), and recursions
κ(s, t) = κ(s_{1:n−1}, t) + Σ_{k : t_k = s_n} κ(s_{1:n−1}, t_{1:k−1})
κ(s, t) = κ(s, t_{1:m−1}) + Σ_{k : s_k = t_m} κ(s_{1:k−1}, t_{1:m−1})
The 1st term corresponds to ignoring the last character of s (resp. t); the 2nd term corresponds to the possible matches of that last character within the other string. The naïve recursion is still exponential → dynamic programming.
17 Dynamic Programming Solution Example: count the matching subsequences of s = learning and t = machine in a DP table (rows: characters of learning; columns: characters of machine; the tables are omitted in this transcription). Initial state: matches only 1 subsequence (the empty one).
18 Dynamic Programming Solution l does not match any character in machine.
19 Dynamic Programming Solution e matches the last character in machine → e added.
20 Dynamic Programming Solution a matches the 2nd character in machine → a added.
21 Dynamic Programming Solution r does not match any character in machine.
22 Dynamic Programming Solution n matches the 6th character in machine → n and an added.
23 Dynamic Programming Solution i matches the 5th character in machine → i and ai added.
24 Dynamic Programming Solution n matches the 6th character in machine → n, in, an, ain added.
25 Dynamic Programming Solution g does not match any character in machine.
26 Dynamic Programming Solution Total matching subsequences: 11.
27 All-Subsequences Kernel Using caching of sub-results, this dynamic programming solution runs in O(|s| · |t|).
AllSubseqKernel(s, t)
  FOR j = 0 TO |t|: DP[0,j] = 1
  FOR i = 1 TO |s|:
    last = 0; cache[0] = 0
    FOR k = 1 TO |t|:
      cache[k] = cache[last]
      IF t_k = s_i THEN cache[k] += DP[i-1,k-1]; last = k
    FOR k = 0 TO |t|:
      DP[i,k] = DP[i-1,k] + cache[k]
  RETURN DP[|s|, |t|]
Note: strings are 1-indexed, but DP & cache have a 0-index for the empty prefix.
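The pseudocode translates almost line for line into Python (a sketch; 0-based Python indices replace the slide's 1-based ones, and the function name is ours):

```python
def all_subseq_kernel(s, t):
    # DP[i][k] = number of matching subsequence pairs of s[:i] and t[:k],
    # counting the empty subsequence once (the base cases kappa(s, eps) = 1)
    n, m = len(s), len(t)
    DP = [[1] * (m + 1)] + [[0] * (m + 1) for _ in range(n)]
    for i in range(1, n + 1):
        cache = [0] * (m + 1)
        for k in range(1, m + 1):
            # cache[k] = sum of DP[i-1][j-1] over positions j <= k
            # where t[j-1] matches the new character s[i-1]
            cache[k] = cache[k - 1]
            if t[k - 1] == s[i - 1]:
                cache[k] += DP[i - 1][k - 1]
        for k in range(m + 1):
            DP[i][k] = DP[i - 1][k] + cache[k]
    return DP[n][m]
```

On the slides' walkthrough this reproduces the final count: `all_subseq_kernel("learning", "machine")` returns 11, including the empty subsequence.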
28 String Kernels Here we have seen a number of string kernels that can be efficiently computed (using dynamic programming, tries, etc.): the bag-of-words kernel, the p-spectrum kernel, and the all-subsequences kernel. Many other variants exist (fixed-length subsequence, gap-weighted subsequence, mismatch, etc.). The choice of kernel depends on the notion of similarity appropriate for the application domain. Kernel normalization / centering are common.
29 GRAPH KERNELS
30 Graphs: Motivation Graphs are often used to model objects and their relationships to one another: bioinformatics (molecule relationships), the internet, social networks. Central questions: How similar are two graphs? How similar are two nodes within a graph?
31 Graph Kernel: Example Consider a dataset of websites with links constituting the edges in the graph. A kernel on the nodes of the graph would be useful for learning w.r.t. the web-pages. A kernel on graphs would be useful for comparing different components of the internet (e.g. domains).
32 Graph Kernel: Example Consider a set of chemical pathways (sequences of interactions among molecules), i.e. graphs. A node kernel would be a useful way to measure the similarity of different molecules' roles within these pathways. A graph kernel would be a useful measure of similarity for different pathways.
33 Graphs: Definition A graph G = (V, E) is specified by a set of nodes V = {v_1, …, v_n} and a set of edges E ⊆ V × V. Data structures for representing graphs: adjacency matrix A = (a_ij) with a_ij = I[(v_i, v_j) ∈ E]; adjacency list; incidence matrix. Example: G_1 = (V_1, E_1) with V_1 = {v_1, …, v_4} and E_1 = {(v_1, v_1), (v_1, v_2), (v_2, v_3), (v_4, v_…)} (last edge and adjacency matrix garbled in transcription).
34 Similarity between Graphs Central Question: How similar are two graphs? 1st possibility: the number of isomorphisms between all (sub-)graphs. (Example graphs: G_1 = (V_1, E_1) with nodes v_1, …, v_5 and G_2 = (V_2, E_2) with nodes v_a, …, v_e.)
35 Isomorphisms of Graphs Isomorphism: two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2) are isomorphic if there exists a bijective mapping f : V_1 → V_2 such that (v_i, v_j) ∈ E_1 ⇔ (f(v_i), f(v_j)) ∈ E_2.
36 Isomorphisms of Graphs Computing the number of isomorphisms between all (sub-)graphs is NP-hard!
37 Similarity between Graphs Central Question: How similar are two graphs? 2nd possibility: counting the number of common paths in the graphs.
38 Common Paths in Graphs The number of paths of length 0 is just the number of nodes in the graph.
39 Common Paths in Graphs The number of paths of length 1 from one node to another is given by the adjacency matrix: (A_1)_ij = 1 iff there is an edge from v_i to v_j.
40 Common Paths in Graphs The number of paths of length k from one node to another is given by the k-th power of the adjacency matrix: (A_1^k)_ij is the number of length-k paths from v_i to v_j.
41 Common Paths in Graphs Proof sketch: by induction on k. (A^{k+1})_ij = Σ_l (A^k)_il (A)_lj sums, over all intermediate nodes v_l, the number of length-k paths from v_i to v_l that can be extended by an edge (v_l, v_j).
42 Common Paths in Graphs The total number of paths of length k is Σ_{i,j=1}^n (A^k)_ij = 1ᵀ A^k 1.
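The claim is easy to check with a few lines of Python (a sketch using plain nested lists; note that these counts are walks, i.e. nodes may repeat along the way):

```python
def mat_mul(A, B):
    # product of two square matrices given as nested lists
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def total_paths(A, k):
    # 1^T A^k 1: the sum of all entries of A^k,
    # i.e. the total number of length-k walks in the graph
    n = len(A)
    P = [[int(i == j) for j in range(n)] for i in range(n)]  # A^0 = I
    for _ in range(k):
        P = mat_mul(P, A)
    return sum(map(sum, P))
```

For a single undirected edge, A = [[0,1],[1,0]], `total_paths(A, 0)` = 2 (the two nodes) and `total_paths(A, k)` = 2 for every k.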
43 Common Paths in Graphs Common paths of two graphs are given by their product graph G_× = (V_×, E_×):
V_× = V_1 × V_2
E_× = { ((v, w), (v′, w′)) : (v, v′) ∈ E_1 ∧ (w, w′) ∈ E_2 }
Example: G_1 with nodes a, b, c and G_2 with nodes 1, 2 yield the product graph G_× with nodes a1, a2, b1, b2, c1, c2.
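In matrix form, the adjacency matrix of the product graph is the Kronecker product A_× = A_1 ⊗ A_2 (a standard fact, not spelled out on the slide); a plain-Python sketch:

```python
def kron(A, B):
    # Kronecker product: ((v, w), (v', w')) is an edge of the product graph
    # iff (v, v') is an edge of G1 and (w, w') is an edge of G2
    m = len(B)
    return [[A[i // m][j // m] * B[i % m][j % m]
             for j in range(len(A) * m)]
            for i in range(len(A) * m)]
```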
44 Similarity between Graphs Similarity between graphs: the number of common paths in their product graph. Length 0: CP_0 = Σ_{i,j=1}^n (A_×^0)_ij = 6 (the 6 nodes of the product graph).
45 Similarity between Graphs Length ≤ 1: CP_1 = CP_0 + Σ_{i,j} (A_×)_ij = 6 + 6 = 12.
46 Similarity between Graphs Length ≤ 2: CP_2 = CP_1 + Σ_{i,j} (A_×²)_ij = 12 + 4 = 16.
47 Similarity between Graphs Length ≤ 3: CP_3 = CP_2 + Σ_{i,j} (A_×³)_ij = 16 + 0 = 16.
48 Similarity between Graphs In total, CP = Σ_{k=0}^∞ Σ_{i,j=1}^n (A_×^k)_ij; here A_×^k = 0 for k > 2, so CP = 16. (The product-graph adjacency matrices are omitted in this transcription.)
49 Similarity between Graphs With cycles, there can be an infinite number of paths! A truncated count sums only up to length L: CP_L = Σ_{k=0}^L Σ_{i,j=1}^n (A_×^k)_ij. (The cyclic example graphs and matrices are omitted in this transcription.)
50 Similarity between Graphs With cycles, there can be an infinite number of paths! We must downweight the influence of long paths. Random walk kernels:
k(G_1, G_2) = (1 / (|V_1||V_2|)) Σ_{k=0}^∞ λ^k Σ_{i,j} (A_×^k)_ij = 1ᵀ(I − λA_×)^{−1} 1 / (|V_1||V_2|)
k(G_1, G_2) = (1 / (|V_1||V_2|)) Σ_{k=0}^∞ (λ^k / k!) Σ_{i,j} (A_×^k)_ij = 1ᵀ exp(λA_×) 1 / (|V_1||V_2|)
These kernels can be calculated by means of the Sylvester equation in O(n³).
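A sketch of the geometric random-walk kernel that approximates the closed form by truncating the series after K terms (pure Python, no linear-algebra library; λ must be small enough for the series to converge, and the function names are ours):

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def kron(A, B):
    m = len(B)
    return [[A[i // m][j // m] * B[i % m][j % m]
             for j in range(len(A) * m)] for i in range(len(A) * m)]

def random_walk_kernel(A1, A2, lam=0.1, K=30):
    # truncated series sum_{k<=K} lam^k 1^T Ax^k 1 / (|V1||V2|),
    # where Ax = A1 (x) A2 is the product-graph adjacency matrix
    Ax = kron(A1, A2)
    n = len(Ax)
    P = [[int(i == j) for j in range(n)] for i in range(n)]  # Ax^0 = I
    total, coeff = 0.0, 1.0
    for _ in range(K + 1):
        total += coeff * sum(map(sum, P))
        P = mat_mul(P, Ax)
        coeff *= lam
    return total / (len(A1) * len(A2))
```

The exact closed form 1ᵀ(I − λA_×)^{−1}1 / (|V_1||V_2|) requires a matrix inverse (or the Sylvester-equation trick from the slide); the truncation is just the simplest way to see the numbers.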
51 Similarity between Nodes Assumption: nodes are similar if they are connected by many paths. Random walk kernels between nodes:
k(v_i, v_j) = Σ_{k=0}^∞ λ^k (A^k)_ij = ((I − λA)^{−1})_ij
k(v_i, v_j) = Σ_{k=0}^∞ (λ^k / k!) (A^k)_ij = (exp(λA))_ij
52 Additional Graph-Kernels Shortest-path kernel: all shortest paths between pairs of nodes are computed by the Floyd-Warshall algorithm with run time O(|V|³); then all pairs of shortest paths between the 2 graphs are compared, with O(|V_1|² |V_2|²) comparisons. Subtree kernel: idea: use tree structures as indexes in the feature space; can be recursively computed for a fixed-height tree; subtrees are downweighted according to their height.
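A sketch of the shortest-path kernel using an indicator base kernel on path lengths (i.e. two shortest paths count as similar iff their lengths are equal; this is a simplification of the general formulation, and all names are ours):

```python
from collections import Counter

def floyd_warshall(A):
    # all-pairs shortest path lengths from an adjacency matrix, O(|V|^3)
    n = len(A)
    INF = float("inf")
    D = [[0 if i == j else (1 if A[i][j] else INF) for j in range(n)]
         for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

def shortest_path_kernel(A1, A2):
    # sum over all pairs of node pairs of I(path lengths equal);
    # the O(|V1|^2 |V2|^2) comparisons are collapsed via length counting
    def length_counts(A):
        D = floyd_warshall(A)
        return Counter(d for row in D for d in row if 0 < d < float("inf"))
    c1, c2 = length_counts(A1), length_counts(A2)
    return sum(c1[d] * c2[d] for d in c1)
```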
53 Summary Kernel functions provide a measure of similarity that allows us to compare non-numeric data. String kernels: based on the space of all (sub)strings, they count the number of common occurrences within 2 strings. Graph kernels: they use common structures within graphs as the basis for their feature space: paths (all-paths kernel, random-walk kernel, shortest-path kernel) and subtrees (subtree kernel). Kernels are also defined on other structures (e.g. trees, images, …). The kernel is selected for a particular domain.
CS3133 - A Term 2009: Foundations of Computer Science Prof. Carolina Ruiz Homework 2 WPI By Li Feng, Shweta Srivastava, and Carolina Ruiz Chapter 4 Problem 1: (10 Points) Exercise 4.3 Solution 1: S is
More informationValence automata over E-unitary inverse semigroups
Valence automata over E-unitary inverse semigroups Erzsi Dombi 30 May 2018 Outline Motivation Notation and introduction Valence automata Bicyclic and polycyclic monoids Motivation Chomsky-Schützenberger
More informationIntroduction to Machine Learning Midterm Exam
10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationSupport Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM
1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationCombinatorial Optimization
Combinatorial Optimization Problem set 8: solutions 1. Fix constants a R and b > 1. For n N, let f(n) = n a and g(n) = b n. Prove that f(n) = o ( g(n) ). Solution. First we observe that g(n) 0 for all
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis. Decision Trees
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Intelligent Data Analysis Decision Trees Paul Prasse, Niels Landwehr, Tobias Scheffer Decision Trees One of many applications:
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA. Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA Tobias Scheffer Overview Principal Component Analysis (PCA) Kernel-PCA Fisher Linear Discriminant Analysis t-sne 2 PCA: Motivation
More informationarxiv: v1 [cs.ds] 9 Apr 2018
From Regular Expression Matching to Parsing Philip Bille Technical University of Denmark phbi@dtu.dk Inge Li Gørtz Technical University of Denmark inge@dtu.dk arxiv:1804.02906v1 [cs.ds] 9 Apr 2018 Abstract
More informationCircuits. Lecture 11 Uniform Circuit Complexity
Circuits Lecture 11 Uniform Circuit Complexity 1 Recall 2 Recall Non-uniform complexity 2 Recall Non-uniform complexity P/1 Decidable 2 Recall Non-uniform complexity P/1 Decidable NP P/log NP = P 2 Recall
More informationLecture 12 Simplification of Context-Free Grammars and Normal Forms
Lecture 12 Simplification of Context-Free Grammars and Normal Forms COT 4420 Theory of Computation Chapter 6 Normal Forms for CFGs 1. Chomsky Normal Form CNF Productions of form A BC A, B, C V A a a T
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationHierarchical Overlap Graph
Hierarchical Overlap Graph B. Cazaux and E. Rivals LIRMM & IBC, Montpellier 8. Feb. 2018 arxiv:1802.04632 2018 B. Cazaux & E. Rivals 1 / 29 Overlap Graph for a set of words Consider the set P := {abaa,
More informationAutomata Theory CS F-08 Context-Free Grammars
Automata Theory CS411-2015F-08 Context-Free Grammars David Galles Department of Computer Science University of San Francisco 08-0: Context-Free Grammars Set of Terminals (Σ) Set of Non-Terminals Set of
More informationSection 1.3 Ordered Structures
Section 1.3 Ordered Structures Tuples Have order and can have repetitions. (6,7,6) is a 3-tuple () is the empty tuple A 2-tuple is called a pair and a 3-tuple is called a triple. We write (x 1,, x n )
More informationWeek Some Warm-up Questions
1 Some Warm-up Questions Week 1-2 Abstraction: The process going from specific cases to general problem. Proof: A sequence of arguments to show certain conclusion to be true. If... then... : The part after
More informationCS-C Data Science Chapter 8: Discrete methods for analyzing large binary datasets
CS-C3160 - Data Science Chapter 8: Discrete methods for analyzing large binary datasets Jaakko Hollmén, Department of Computer Science 30.10.2017-18.12.2017 1 Rest of the course In the first part of the
More informationSolutions to Problem Set 3
V22.0453-001 Theory of Computation October 8, 2003 TA: Nelly Fazio Solutions to Problem Set 3 Problem 1 We have seen that a grammar where all productions are of the form: A ab, A c (where A, B non-terminals,
More informationFiniteness conditions and index in semigroup theory
Finiteness conditions and index in semigroup theory Robert Gray University of Leeds Leeds, January 2007 Robert Gray (University of Leeds) 1 / 39 Outline 1 Motivation and background Finiteness conditions
More informationCS375: Logic and Theory of Computing
CS375: Logic and Theory of Computing Fuhua (Frank) Cheng Department of Computer Science University of Kentucky 1 Table of Contents: Week 1: Preliminaries (set algebra, relations, functions) (read Chapters
More informationarxiv: v1 [math.ra] 15 Jul 2013
Additive Property of Drazin Invertibility of Elements Long Wang, Huihui Zhu, Xia Zhu, Jianlong Chen arxiv:1307.3816v1 [math.ra] 15 Jul 2013 Department of Mathematics, Southeast University, Nanjing 210096,
More information1 More finite deterministic automata
CS 125 Section #6 Finite automata October 18, 2016 1 More finite deterministic automata Exercise. Consider the following game with two players: Repeatedly flip a coin. On heads, player 1 gets a point.
More informationComplexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler
Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard
More informationOutline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.
Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationFall, 2017 CIS 262. Automata, Computability and Complexity Jean Gallier Solutions of the Practice Final Exam
Fall, 2017 CIS 262 Automata, Computability and Complexity Jean Gallier Solutions of the Practice Final Exam December 6, 2017 Problem 1 (10 pts). Let Σ be an alphabet. (1) What is an ambiguous context-free
More informationUndecibability. Hilbert's 10th Problem: Give an algorithm that given a polynomial decides if the polynomial has integer roots or not.
Undecibability Hilbert's 10th Problem: Give an algorithm that given a polynomial decides if the polynomial has integer roots or not. The problem was posed in 1900. In 1970 it was proved that there can
More information