Universität Potsdam, Institut für Informatik, Lehrstuhl
Linear Classifiers (Kernels)
Blaine Nelson, Christoph Sawade, Tobias Scheffer
Exam Dates & Course Conclusion
There are 2 exam dates: Feb 20th and March 25th. Next week Dr. Landwehr will give you info for registering; please think about which date would be best for you.
Remaining lectures: Jan. 21 Hypothesis Evaluation; Jan. 28 Summary of Topics; Feb. 4 <Study Time, No Lecture>
Contents
Kernels for structured data spaces: string kernels, graph kernels.
Main idea: kernel learning separates data & learning.
- The learning algorithm is developed to achieve a reasonable separation of classes in a feature space.
- The kernel function is developed to express a pairwise notion of similarity that corresponds to an inner product in some feature space -- domain-specific!
The kernel abstraction allows us to learn on data that is non-numeric / structured.
Recall: Kernel Functions
A kernel function $k(x, x') = \phi(x)^\top \phi(x')$ computes the inner product of the feature mappings of two instances. The kernel function can often be computed without an explicit representation $\phi(x)$; e.g., the polynomial kernel $k_{\text{poly}}(x_i, x_j) = (x_i^\top x_j + 1)^p$.
Infinite-dimensional feature mappings are possible; e.g., the RBF kernel $k_{\text{RBF}}(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$.
For every positive definite kernel there is a feature mapping $\phi$ such that $k(x, x') = \phi(x)^\top \phi(x')$. For a given kernel matrix, the Mercer map provides such a feature mapping.
Recall: Polynomial Kernels
Kernel: $k_{\text{poly}}(x_i, x_j) = (x_i^\top x_j + 1)^p$, 2-D input, $p = 2$:
$$k_{\text{poly}}(x_i, x_j) = (x_i^\top x_j + 1)^2 = (x_{i1} x_{j1} + x_{i2} x_{j2} + 1)^2$$
$$= x_{i1}^2 x_{j1}^2 + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} + 1$$
$$= \underbrace{\left(x_{i1}^2,\; x_{i2}^2,\; \sqrt{2}\, x_{i1} x_{i2},\; \sqrt{2}\, x_{i1},\; \sqrt{2}\, x_{i2},\; 1\right)}_{\phi(x_i)^\top} \underbrace{\left(x_{j1}^2,\; x_{j2}^2,\; \sqrt{2}\, x_{j1} x_{j2},\; \sqrt{2}\, x_{j1},\; \sqrt{2}\, x_{j2},\; 1\right)^\top}_{\phi(x_j)}$$
The feature map contains all monomials of degree $\le 2$ over the input attributes.
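The expansion above can be checked numerically; a minimal sketch (the names `phi` and `k_poly` are illustrative, not from the lecture):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2-D polynomial kernel with p = 2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     1.0])

def k_poly(xi, xj, p=2):
    """Polynomial kernel, computed without the explicit feature map."""
    return (xi @ xj + 1) ** p

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Both routes give the same value: (1*3 + 2*(-1) + 1)^2 = 4
assert np.isclose(k_poly(xi, xj), phi(xi) @ phi(xj))
```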
STRING KERNELS
Strings: Motivation
Strings are a common non-numeric type of data: documents & email are strings; DNA & protein sequences are strings.
String Kernels
A string is a sequence of characters from an alphabet $\Sigma$, written $s = s_1 s_2 \ldots s_n$ with $|s| = n$. The set of all strings is $\Sigma^* = \bigcup_{n \in \mathbb{N}} \Sigma^n$.
Substring: $s_{i:j} = s_i s_{i+1} \ldots s_j$.
Subsequence: for any $\mathbf{i} \in \{0,1\}^n$, $s[\mathbf{i}]$ consists of the elements of $s$ at the positions where $\mathbf{i}$ is 1. E.g., if $s =$ "abcd", then $s[(1,0,0,1)] =$ "ad".
A string kernel is a real-valued function on $\Sigma^* \times \Sigma^*$. We need positive definite kernels; we will design kernels by looking at a feature space of substrings / subsequences.
Bag-of-Words Kernel
For textual data, a simple feature representation is indexed by the words contained in the string.
[Example: a spam email ("Dear Beneficiary, your Email address has been picked online in this years MICROSOFT CONSUMER AWARD as a Winner of One Hundred and Fifty Five Thousand Pounds Sterling") mapped to a binary vector over a dictionary of $m \approx 1{,}000{,}000$ words, indicating for each word whether it occurs: Aardvark 0, Beneficiary 1, Friend 0, Sterling 1, Science 0, ...]
The bag-of-words kernel computes the number of common words between two texts. Is it efficient?
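As a sketch (assuming whitespace tokenization and a binary word-occurrence representation; the function name is illustrative), the bag-of-words kernel reduces to a set intersection, computable in expected time linear in the text lengths:

```python
def bow_kernel(s, t):
    """Number of distinct words occurring in both texts (binary bag-of-words)."""
    return len(set(s.lower().split()) & set(t.lower().split()))

# Common words: "your" and "email" -> kernel value 2
bow_kernel("dear beneficiary your email", "your email address")
```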
Spectrum Kernel
Consider a feature space with one feature for every length-$p$ string over the alphabet $\Sigma$: $\phi_s(u)$ is the number of times $u \in \Sigma^p$ occurs as a substring of $s$. The $p$-spectrum kernel is
$$\kappa_p(s, t) = \sum_{u \in \Sigma^p} \phi_s(u)\, \phi_t(u)$$
Example ($p = 2$): feature vectors $\phi$ and resulting kernel matrix $K$:

       |  aa ab ba bb        |  aaab bbab aaaa baab
  aaab |   2  1  0  0   aaab |     5    1    6    3
  bbab |   0  1  1  1   bbab |     1    3    0    2
  aaaa |   3  0  0  0   aaaa |     6    0    9    3
  baab |   1  1  1  0   baab |     3    2    3    3
Spectrum Kernel Computation
Without explicitly computing this feature map, the $p$-spectrum kernel can be computed as
$$\kappa_p(s, t) = \sum_{i=1}^{|s|-p+1} \sum_{j=1}^{|t|-p+1} I\left(s_{i:i+p-1} = t_{j:j+p-1}\right)$$
This computation is $O(p\,|s|\,|t|)$. Using trie data structures, it can be reduced to $O(p \max(|s|, |t|))$.
Naturally, we can also compute (weighted) sums of spectrum kernels over different substring lengths.
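A compact sketch of this computation via substring counts (equivalent to the double sum above; names are illustrative):

```python
from collections import Counter

def spectrum_kernel(s, t, p):
    """p-spectrum kernel: inner product of substring-count feature vectors."""
    phi_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    phi_t = Counter(t[j:j + p] for j in range(len(t) - p + 1))
    return sum(count * phi_t[u] for u, count in phi_s.items())

# Shared 2-grams of "aaab" and "baab": "aa" (2*1) and "ab" (1*1) -> 3
spectrum_kernel("aaab", "baab", 2)
```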
All-Subsequences Kernel
A subsequence is an ordered subset of a string: every subsequence of a string $s$ of length $n$ is uniquely indexed by some $\mathbf{i} \in \{0,1\}^n$; the subsequence corresponding to $\mathbf{i}$ is $s[\mathbf{i}]$.
Consider a feature space with one feature for every string over the alphabet $\Sigma$: $\phi_s(u)$ is the number of times $u \in \Sigma^*$ occurs as a subsequence of $s$. The all-subsequences kernel is
$$\kappa(s, t) = \sum_{u \in \Sigma^*} \phi_s(u)\, \phi_t(u)$$
All-Subsequences Kernel
Example feature vectors $\phi_s(u)$ (where $\varepsilon$ denotes the empty string):

      | ε  a  b aa ab ba bb aaa aab aba abb baa bab bba bbb
  aab | 1  2  1  1  2  0  0   0   1   0   0   0   0   0   0
  bab | 1  1  2  0  1  1  1   0   0   0   0   0   1   0   0
  bba | 1  1  2  0  0  2  1   0   0   0   0   0   0   1   0

Problem: there are $\min\left(\binom{|s|}{k}, |\Sigma|^k\right)$ subsequences of length $k$ in $s$ -- exponentially many features.
All-Subsequences Kernel
How can we avoid the exponential size of the explicit feature space? Consider rewriting the all-subsequences kernel as
$$\kappa(s, t) = \sum_{\mathbf{i}, \mathbf{j}} I\left(s[\mathbf{i}] = t[\mathbf{j}]\right)$$
These matching subsequences can be split into 2 possibilities: either the last character of $s$ is not used in the match, or it is. Appending a character $\sigma$ to $s$:
$$\kappa(s\sigma, t) = \underbrace{\sum_{\mathbf{i}, \mathbf{j}} I\left(s[\mathbf{i}] = t[\mathbf{j}]\right)}_{\kappa(s,\,t)} + \sum_{u:\; t = u\sigma v} \underbrace{\sum_{\mathbf{i}, \mathbf{j}} I\left(s[\mathbf{i}] = u[\mathbf{j}]\right)}_{\kappa(s,\,u)}$$
All-Subsequences Kernel
$\kappa(s, t)$ = # matching subsequences of $s$ and $t$:
- Ignore the last character of $s$: # matching subsequences of $s_{1:n-1}$ and $t$, i.e. $\kappa(s_{1:n-1}, t)$.
- Match the last character of $s$ to the $k$-th character of $t$: # matching subsequences of $s_{1:n-1}$ and $t_{1:k-1}$, summed as $\sum_{k:\, t_k = s_n} \kappa(s_{1:n-1}, t_{1:k-1})$.
All-Subsequences Kernel
Based on this decomposition, we get a recursion with base cases
$$\kappa(s, \varepsilon) = 1 \quad \text{and} \quad \kappa(\varepsilon, s) = 1 \quad \text{for all } s$$
and recursions (with $n = |s|$, $m = |t|$)
$$\kappa(s, t) = \kappa(s_{1:n-1}, t) + \sum_{k:\, t_k = s_n} \kappa(s_{1:n-1}, t_{1:k-1})$$
$$\kappa(s, t) = \kappa(s, t_{1:m-1}) + \sum_{k:\, s_k = t_m} \kappa(s_{1:k-1}, t_{1:m-1})$$
The 1st term corresponds to ignoring the last character of $s$ / $t$; the 2nd term corresponds to the possible matches of that last character within the other string. The naïve recursion is still exponential -- use dynamic programming.
Dynamic Programming Solution
Compute $\kappa$("learning", "machine") by filling a table whose entry in row $i$, column $k$ is the number of matching subsequences of the first $i$ characters of "learning" and the first $k$ characters of "machine":
- Initial state: an empty prefix matches only 1 subsequence (the empty one), so row 0 and column 0 are all 1.
- "l" does not match any character in "machine": row unchanged.
- "e" matches the last character of "machine": "e" added.
- "a" matches the 2nd character of "machine": "a" added.
- "r" does not match any character in "machine".
- "n" matches the 6th character of "machine": "n" and "an" added.
- "i" matches the 5th character of "machine": "i" and "ai" added.
- "n" matches the 6th character of "machine": "n", "in", "an", "ain" added.
- "g" does not match any character in "machine".
The completed table:

      | ε  m  a  c  h  i  n  e
    ε | 1  1  1  1  1  1  1  1
    l | 1  1  1  1  1  1  1  1
    e | 1  1  1  1  1  1  1  2
    a | 1  1  2  2  2  2  2  3
    r | 1  1  2  2  2  2  2  3
    n | 1  1  2  2  2  2  4  5
    i | 1  1  2  2  2  4  6  7
    n | 1  1  2  2  2  4 10 11
    g | 1  1  2  2  2  4 10 11

Total matching subsequences: 11.
All-Subsequences Kernel
Using caching of sub-results, this dynamic programming solution runs in $O(|s|\,|t|)$. A direct implementation (strings are 0-indexed in code; row/column 0 of the DP table corresponds to the empty prefix):

    def all_subsequences_kernel(s, t):
        n, m = len(s), len(t)
        dp = [[1] * (m + 1)]                  # row 0: empty prefix of s matches only the empty subsequence
        for i in range(1, n + 1):
            # cache[k] = sum of dp[i-1][j-1] over all j <= k with t[j-1] == s[i-1]
            cache = [0] * (m + 1)
            for k in range(1, m + 1):
                cache[k] = cache[k - 1]
                if t[k - 1] == s[i - 1]:
                    cache[k] += dp[i - 1][k - 1]
            dp.append([dp[i - 1][k] + cache[k] for k in range(m + 1)])
        return dp[n][m]                       # e.g. all_subsequences_kernel("learning", "machine") == 11
String Kernels
Here we have seen a number of string kernels that can be efficiently computed (using dynamic programming, tries, etc.): the bag-of-words kernel, the p-spectrum kernel, and the all-subsequences kernel. Many other variants exist (fixed-length subsequence, gap-weighted subsequence, mismatch, etc.).
The choice of kernel depends on the notion of similarity appropriate for the application domain. Kernel normalization / centering are common.
GRAPH KERNELS
Graphs: Motivation
Graphs are often used to model objects and their relationships to one another: bioinformatics (molecule relationships), the Internet, social networks.
Central questions: How similar are two graphs? How similar are two nodes within a graph?
Graph Kernel: Example
Consider a dataset of websites with links constituting the edges in the graph. A kernel on the nodes of the graph would be useful for learning w.r.t. the web pages; a kernel on graphs would be useful for comparing different components of the Internet (e.g. domains).
Graph Kernel: Example
Consider a set of chemical pathways (sequences of interactions among molecules), i.e. graphs. A node kernel would be a useful way to measure the similarity of different molecules' roles within these pathways; a graph kernel would be a useful measure of similarity between different pathways.
Graphs: Definition
A graph $G = (V, E)$ is specified by a set of nodes $V = \{v_1, \ldots, v_n\}$ and a set of edges $E \subseteq V \times V$.
Data structures for representing graphs: adjacency matrix $A = (a_{ij})_{i,j=1}^n$ with $a_{ij} = I\left[(v_i, v_j) \in E\right]$; adjacency list; incidence matrix.
Example: $G_1 = (V_1, E_1)$ with $V_1 = \{v_1, \ldots, v_4\}$ and $E_1 = \{(v_1, v_1), (v_1, v_2), (v_2, v_3), (v_4, v_2)\}$:
$$A_1 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$
Similarity between Graphs
Central question: How similar are two graphs? 1st possibility: the number of isomorphisms between all (sub-)graphs.
[Figure: example graphs $G_1 = (V_1, E_1)$ with nodes $v_1, \ldots, v_5$ and $G_2 = (V_2, E_2)$ with nodes $v_a, \ldots, v_e$.]
Isomorphisms of Graphs
Isomorphism: two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic if there exists a bijective mapping $f: V_1 \to V_2$ such that $(v_i, v_j) \in E_1 \Leftrightarrow (f(v_i), f(v_j)) \in E_2$.
Problem: this approach is NP-hard!
Similarity between Graphs
Central question: How similar are two graphs? 2nd possibility: counting the number of common paths in the graphs.
Common Paths in Graphs
The number of paths of length 0 is just the number of nodes in the graph.
Common Paths in Graphs
The number of paths of length 1 from one node to another is given by the adjacency matrix (rows: from; columns: to):
$$A_1 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$
Common Paths in Graphs
The number of paths of length $k$ from one node to another is given by the $k$-th power of the adjacency matrix:
$$A_1^2 = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
Common Paths in Graphs
The number of paths of length $k$ from one node to another is given by the $k$-th power of the adjacency matrix. Proof? By induction: $(A^k)_{ij} = \sum_l (A^{k-1})_{il}\, A_{lj}$ sums, over all intermediate nodes $v_l$, the paths of length $k-1$ from $v_i$ to $v_l$ that can be extended by the edge $(v_l, v_j)$. For our example,
$$A_1^k = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \quad \text{for } k > 2$$
Total number of paths of length $k$: $\sum_{i,j=1}^n (A^k)_{ij} = \mathbf{1}^\top A^k \mathbf{1}$.
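This path-counting identity can be checked with NumPy on the example graph $G_1$ (a sketch; `matrix_power` computes $A^k$):

```python
import numpy as np

# Adjacency matrix of G1 from the example:
# edges (v1,v1), (v1,v2), (v2,v3), (v4,v2)
A1 = np.array([[1, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 0],
               [0, 1, 0, 0]])

ones = np.ones(4)
A1_sq = np.linalg.matrix_power(A1, 2)   # (A^2)_ij = # paths of length 2 from v_i to v_j
total_len_2 = ones @ A1_sq @ ones       # 1^T A^2 1 = total # of paths of length 2
```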
Common Paths in Graphs
Common paths of two graphs are given by their product graph $G_\times = (V_\times, E_\times)$:
$$V_\times = V_1 \times V_2, \qquad E_\times = \left\{ \left((v, v'), (w, w')\right) \mid (v, w) \in E_1 \wedge (v', w') \in E_2 \right\}$$
Example: $G_1$ with nodes a, b, c and $G_2$ with nodes 1, 2 yield the product graph $G_\times$ with nodes a1, a2, b1, b2, c1, c2.
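Under the node ordering $(v, v')$ used above, the adjacency matrix of the product graph is the Kronecker product $A_1 \otimes A_2$ of the two adjacency matrices. A minimal sketch with hypothetical two-node graphs (not the lecture's example):

```python
import numpy as np

A1 = np.array([[0, 1],      # hypothetical G1: single edge a -> b
               [0, 0]])
A2 = np.array([[0, 1],      # hypothetical G2: single edge 1 -> 2
               [0, 0]])

# Product graph over nodes a1, a2, b1, b2:
# edge ((v,v'),(w,w')) exists iff v -> w in G1 and v' -> w' in G2,
# so here there is exactly one edge, a1 -> b2
A_prod = np.kron(A1, A2)
```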
Similarity between Graphs
Similarity between graphs: the number of common paths in their product graph. Paths of length 0 (identity matrix over the product nodes a1, a2, b1, b2, c1, c2):
$$A^0 = I_6, \qquad CP_0 = \sum_{i,j=1}^{6} (A^0)_{ij} = 6$$
Similarity between Graphs
Paths of length 1 in the product graph (rows: from; columns: to; node order a1, a2, b1, b2, c1, c2):
$$A = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad CP_1 = CP_0 + \sum_{i,j=1}^{6} A_{ij} = 6 + 6 = 12$$
Similarity between Graphs
Paths of length 2 in the product graph:
$$A^2 = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad CP_2 = CP_1 + \sum_{i,j=1}^{6} (A^2)_{ij} = 12 + 4 = 16$$
Similarity between Graphs
Paths of length 3 in the product graph: $A^3 = 0$, so $CP_3 = CP_2 + 0 = 16$.
Similarity between Graphs
Since $A^k = 0$ for all $k > 2$, the total number of common paths is
$$CP = \sum_{k=0}^{\infty} \sum_{i,j=1}^{6} (A^k)_{ij} = 16$$
Similarity between Graphs
With cycles, there can be an infinite number of paths! For a product graph containing cycles, the powers $A^k$ never vanish; in this example
$$A^k = \begin{pmatrix} 1 & k & 1 & k & 1 & k \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \quad (k \ge 1), \qquad CP_L = \sum_{k=0}^{L} \sum_{i,j=1}^{6} (A^k)_{ij} = \frac{3}{2} L^2 + \frac{15}{2} L + 6 \;\to\; \infty$$
Similarity between Graphs
With cycles, there can be an infinite number of paths! We must downweight the influence of long paths. Random walk kernels (with $A$ the adjacency matrix of the product graph):
$$k(G_1, G_2) = \frac{1}{|V_1||V_2|} \sum_{k=0}^{\infty} \lambda^k \sum_{i,j} (A^k)_{ij} = \frac{\mathbf{1}^\top (I - \lambda A)^{-1} \mathbf{1}}{|V_1||V_2|}$$
$$k(G_1, G_2) = \frac{1}{|V_1||V_2|} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \sum_{i,j} (A^k)_{ij} = \frac{\mathbf{1}^\top \exp(\lambda A)\, \mathbf{1}}{|V_1||V_2|}$$
These kernels can be calculated by means of the Sylvester equation in $O(n^3)$.
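The geometric (first) variant can be sketched as follows (illustrative names; this solves the linear system instead of forming the inverse, and assumes $\lambda$ is small enough that the geometric series converges, i.e. $\lambda < 1/\max_i |\mu_i|$ for the eigenvalues $\mu_i$ of $A$):

```python
import numpy as np

def geometric_rw_kernel(A1, A2, lam):
    """Geometric random-walk kernel: 1^T (I - lam*A)^{-1} 1 / (|V1| |V2|),
    with A the product-graph adjacency matrix (Kronecker product of A1, A2)."""
    A = np.kron(A1, A2)
    n = A.shape[0]                                   # n = |V1| * |V2|
    ones = np.ones(n)
    x = np.linalg.solve(np.eye(n) - lam * A, ones)   # x = (I - lam*A)^{-1} 1
    return ones @ x / n
```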
Similarity between Nodes
Assumption: nodes are similar if they are connected by many paths. Random walk kernels on the nodes of a graph with adjacency matrix $A$:
$$k(v_i, v_j) = \sum_{k=0}^{\infty} \lambda^k (A^k)_{ij} = \left( (I - \lambda A)^{-1} \right)_{ij}$$
$$k(v_i, v_j) = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} (A^k)_{ij} = \left( \exp(\lambda A) \right)_{ij}$$
Additional Graph Kernels
Shortest-path kernel: compute all shortest paths between pairs of nodes with the Floyd-Warshall algorithm (run time $O(|V|^3)$), then compare all pairs of shortest paths between the 2 graphs: $O(|V_1|^2 |V_2|^2)$.
Subtree kernel: use tree structures as indices in the feature space; it can be computed recursively for a fixed-height tree, with deeper trees downweighted according to their height.
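A sketch of the shortest-path kernel using the simplest pairwise comparison, an indicator of equal path lengths (function names are illustrative; assumes unweighted directed graphs given as 0/1 adjacency matrices):

```python
import numpy as np

def floyd_warshall(A):
    """All-pairs shortest-path lengths from a 0/1 adjacency matrix, O(|V|^3)."""
    n = A.shape[0]
    D = np.where(A > 0, 1.0, np.inf)    # direct edges have length 1
    np.fill_diagonal(D, 0.0)
    for k in range(n):
        # allow node k as an intermediate point on any path
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

def shortest_path_kernel(A1, A2):
    """Count pairs of node pairs whose finite shortest-path lengths agree."""
    D1, D2 = floyd_warshall(A1), floyd_warshall(A2)
    return sum(1 for d1 in D1.ravel() for d2 in D2.ravel()
               if np.isfinite(d1) and d1 == d2)
```

The nested comparison over all $|V_1|^2 \cdot |V_2|^2$ pairs of shortest paths mirrors the run time stated above; in practice the indicator is often replaced by a kernel on path lengths.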
Summary
Kernel functions provide a measure of similarity that allows us to compare non-numeric data.
String kernels: based on the space of all (sub)strings, they count the # of common occurrences within 2 strings.
Graph kernels: they use common structures within graphs as the basis for their feature space -- paths (all-paths kernel, random-walk kernel, shortest-path kernel) and subtrees (subtree kernel).
Kernels are also defined on other structures (e.g. trees, images, ...). The kernel is selected for a particular domain.