Linear Classifiers (Kernels)


Universität Potsdam, Institut für Informatik, Lehrstuhl Maschinelles Lernen. Linear Classifiers (Kernels). Blaine Nelson, Christoph Sawade, Tobias Scheffer

Exam Dates & Course Conclusion

There are 2 exam dates: Feb 20th and March 25th. Next week Dr. Landwehr will give you info for registering; please think about which date would be best for you.

Remaining lectures:
- Jan. 21: Hypothesis Evaluation
- Jan. 28: Summary of Topics
- Feb. 4: <Study Time, No Lecture>

Contents

- Kernels for structured data spaces: string kernels, graph kernels
- Main idea: kernel learning separates data & learning
  - The learning algorithm is developed to achieve a reasonable separation of classes in a feature space.
  - The kernel function is developed to express a pairwise notion of similarity that corresponds to an inner product in some feature space; this is domain-specific!
- The kernel abstraction allows us to learn on data that is non-numeric / structured.

Recall: Kernel Functions

- A kernel function k(x, x') = φ(x)^T φ(x') computes the inner product of the feature mappings of two instances.
- The kernel function can often be computed without an explicit representation φ(x).
  E.g., polynomial kernel: k_poly(x_i, x_j) = (x_i^T x_j + 1)^p
- Infinite-dimensional feature mappings are possible.
  E.g., RBF kernel: k_RBF(x_i, x_j) = exp(-γ ||x_i - x_j||^2)
- For every positive definite kernel there is a feature mapping φ(x) such that k(x, x') = φ(x)^T φ(x').
- For a given kernel matrix, the Mercer map provides a feature mapping.
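A minimal Python sketch of the two example kernels above (illustrative, not from the slides; the values of p, γ and the test points are arbitrary):

```python
import numpy as np

def k_poly(x, z, p=2):
    # Polynomial kernel: (x^T z + 1)^p
    return (np.dot(x, z) + 1) ** p

def k_rbf(x, z, gamma=1.0):
    # RBF kernel: exp(-gamma * ||x - z||^2); implicit feature space is infinite-dimensional
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k_poly(x, z))  # (0.5 - 2 + 1)^2 = 0.25
print(k_rbf(x, z))
```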

Recall: Polynomial Kernels

Kernel: k_poly(x_i, x_j) = (x_i^T x_j + 1)^p, with 2D input and p = 2:

k_poly(x_i, x_j) = (x_i^T x_j + 1)^2
                 = (x_i1 x_j1 + x_i2 x_j2 + 1)^2
                 = x_i1^2 x_j1^2 + x_i2^2 x_j2^2 + 2 x_i1 x_j1 x_i2 x_j2 + 2 x_i1 x_j1 + 2 x_i2 x_j2 + 1
                 = (x_i1^2, x_i2^2, √2 x_i1 x_i2, √2 x_i1, √2 x_i2, 1) (x_j1^2, x_j2^2, √2 x_j1 x_j2, √2 x_j1, √2 x_j2, 1)^T
                 = φ(x_i)^T φ(x_j)

The feature map φ contains all monomials of degree ≤ 2 over the input attributes.
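A quick numerical check of this identity (a hypothetical sketch; the points x_i, x_j are arbitrary):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2D input: all monomials of degree <= 2
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
lhs = (np.dot(xi, xj) + 1) ** 2   # kernel evaluation, no explicit feature map
rhs = np.dot(phi(xi), phi(xj))    # inner product in the explicit feature space
assert np.isclose(lhs, rhs)       # both equal 0.25 for these points
```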

STRING KERNELS

Strings: Motivation

Strings are a common non-numeric type of data:
- Documents & email are strings
- DNA & protein sequences are strings

String Kernels

- String: a sequence of characters from an alphabet Σ, written as s = s_1 s_2 … s_n with |s| = n.
- The set of all strings is Σ* = ∪_{n ∈ ℕ} Σ^n.
- Substring: s_{i:j} = s_i s_{i+1} … s_j.
- Subsequence: for any i ∈ {0,1}^n, s_i is the sequence of elements of s at the positions where i is 1.
  E.g., if s = abcd, then s_{(1,0,0,1)} = ad.
- A string kernel is a real-valued function on Σ* × Σ*. We need positive definite kernels.
- We will design kernels by looking at a feature space of substrings / subsequences.

Bag-of-Words Kernel

For textual data, a simple feature representation is indexed by the words contained in the string.

Example instance x (a spam email): "Dear Beneficiary, your Email address has been picked online in this years MICROSOFT CONSUMER AWARD as a Winner of One Hundred and Fifty Five Thousand Pounds Sterling …"

Feature vector (word #1 occurs? … word #m occurs?, with m ≈ 1,000,000):
Aardvark 0, Beneficiary 1, Friend 0, Sterling 1, Science 0, …

The bag-of-words kernel computes the number of common words between two texts. Can it be computed efficiently?
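A small Python sketch (hypothetical, not from the slides): representing each text by the sparse set of its words makes the kernel cheap to evaluate despite the huge implicit dimension m:

```python
def bow_kernel(s, t):
    # Bag-of-words kernel with binary features: number of common words.
    # The sparse sets avoid materializing the ~10^6-dimensional feature vector.
    # Note the naive whitespace tokenization; real tokenizers strip punctuation.
    return len(set(s.lower().split()) & set(t.lower().split()))

print(bow_kernel("Dear Beneficiary, your Email address",
                 "your email has been picked"))  # 2 ('your', 'email')
```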

Spectrum Kernel

Consider a feature space with features corresponding to every length-p string over the alphabet Σ: φ_u(s) is the number of times u ∈ Σ^p is contained as a substring in s. The p-spectrum kernel is then

κ_p(s, t) = Σ_{u ∈ Σ^p} φ_u(s) φ_u(t)

Example (p = 2, Σ = {a, b}):

φ       aa  ab  ba  bb
aaab     2   1   0   0
bbab     0   1   1   1
aaaa     3   0   0   0
baab     1   1   1   0

K     aaab  bbab  aaaa  baab
aaab     5     1     6     3
bbab     1     3     0     2
aaaa     6     0     9     3
baab     3     2     3     3

Spectrum Kernel Computation

Without explicitly computing the feature map, the p-spectrum kernel can be computed as

κ_p(s, t) = Σ_{i=1}^{|s|-p+1} Σ_{j=1}^{|t|-p+1} I(s_{i:i+p-1} = t_{j:j+p-1})

This computation is O(p |s| |t|). Using trie data structures, it can be reduced to O(p · max(|s|, |t|)). Naturally, we can also compute (weighted) sums of spectrum kernels over different substring lengths.
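A short Python sketch (hypothetical; it uses hash maps instead of the tries mentioned above, giving roughly O(p(|s| + |t|)) expected time):

```python
from collections import Counter

def spectrum_kernel(s, t, p):
    # p-spectrum kernel: inner product of sparse substring-count maps
    phi_s = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    phi_t = Counter(t[j:j + p] for j in range(len(t) - p + 1))
    return sum(phi_s[u] * phi_t[u] for u in phi_s.keys() & phi_t.keys())

# Reproduces entries of the kernel matrix K above:
print(spectrum_kernel("aaab", "aaab", 2))  # 5
print(spectrum_kernel("aaab", "aaaa", 2))  # 6
print(spectrum_kernel("bbab", "baab", 2))  # 2
```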

All-Subsequences Kernel

- A subsequence is an ordered subset of a string.
- Every subsequence of a string s of length n is uniquely indexed by some i ∈ {0,1}^n; the subsequence corresponding to i is s_i.
- Consider a feature space with features corresponding to every string over the alphabet Σ: φ_u(s) is the number of times u ∈ Σ* occurs as a subsequence of s.
- The all-subsequences kernel is then

  κ(s, t) = Σ_{u ∈ Σ*} φ_u(s) φ_u(t)

All-Subsequences Kernel

The all-subsequences kernel is κ(s, t) = Σ_{u ∈ Σ*} φ_u(s) φ_u(t), where φ_u(s) is the number of times u occurs as a subsequence of s.

φ     ε  a  b  aa  ab  ba  bb  aaa  aab  aba  abb  baa  bab  bba  bbb
aab   1  2  1   1   2   0   0    0    1    0    0    0    0    0    0
bab   1  1  2   0   1   1   1    0    0    0    0    0    1    0    0
bba   1  1  2   0   0   2   1    0    0    0    0    0    0    1    0

Problem: there are up to min((|s| choose k), |Σ|^k) subsequences of length k in s, which is exponentially many.

All-Subsequences Kernel

How can we avoid the exponential size of the explicit feature space? Rewrite the all-subsequences kernel as a count of matching index pairs:

κ(s, t) = Σ_{i,j} I(s_i = t_j)

These matching subsequences can be split into 2 possibilities, according to whether the last character of the first string is used in the match:

κ(sσ, t) = Σ_{i,j} I(s_i = t_j) + Σ_{u: t = uσv} Σ_{i,j} I(s_i = u_j)
         = κ(s, t) + Σ_{u: t = uσv} κ(s, u)

All-Subsequences Kernel

κ(s, t) = number of matching subsequences of s and t:
- Ignore the last character of s: number of matching subsequences of s_{1:n-1} and t.
- Match the last character of s to the k-th character of t: number of matching subsequences of s_{1:n-1} and t_{1:k-1}.

κ(s, t) = κ(s_{1:n-1}, t) + Σ_{k: t_k = s_n} κ(s_{1:n-1}, t_{1:k-1})

All-Subsequences Kernel

Based on this decomposition, we get a recursion with base cases

κ(s, ε) = 1 and κ(ε, s) = 1 for all s

and recursions

κ(s, t) = κ(s_{1:n-1}, t) + Σ_{k: t_k = s_n} κ(s_{1:n-1}, t_{1:k-1})
κ(s, t) = κ(s, t_{1:m-1}) + Σ_{k: s_k = t_m} κ(s_{1:k-1}, t_{1:m-1})

The 1st term corresponds to ignoring the last character of s / t; the 2nd term corresponds to the possible matches of that last character within the other string. The naïve recursion is still exponential → dynamic programming.

Dynamic Programming Solution

Initial state: the empty prefix matches only 1 subsequence (the empty one).

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1
e    1
a    1
r    1
n    1
i    1
n    1
g    1

Dynamic Programming Solution

'l' does not match any character in "machine".

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1
a    1
r    1
n    1
i    1
n    1
g    1

Dynamic Programming Solution

'e' matches the last character in "machine" → "e" added.

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1  1  1  1  1  1  1  2
a    1
r    1
n    1
i    1
n    1
g    1

Dynamic Programming Solution

'a' matches the 2nd character in "machine" → "a" added.

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1  1  1  1  1  1  1  2
a    1  1  2  2  2  2  2  3
r    1
n    1
i    1
n    1
g    1

Dynamic Programming Solution

'r' does not match any character in "machine".

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1  1  1  1  1  1  1  2
a    1  1  2  2  2  2  2  3
r    1  1  2  2  2  2  2  3
n    1
i    1
n    1
g    1

Dynamic Programming Solution

'n' matches the 6th character in "machine" → "n" and "an" added.

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1  1  1  1  1  1  1  2
a    1  1  2  2  2  2  2  3
r    1  1  2  2  2  2  2  3
n    1  1  2  2  2  2  4  5
i    1
n    1
g    1

Dynamic Programming Solution

'i' matches the 5th character in "machine" → "i" and "ai" added.

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1  1  1  1  1  1  1  2
a    1  1  2  2  2  2  2  3
r    1  1  2  2  2  2  2  3
n    1  1  2  2  2  2  4  5
i    1  1  2  2  2  4  6  7
n    1
g    1

Dynamic Programming Solution

'n' matches the 6th character in "machine" → "n", "in", "an", "ain" added.

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1  1  1  1  1  1  1  2
a    1  1  2  2  2  2  2  3
r    1  1  2  2  2  2  2  3
n    1  1  2  2  2  2  4  5
i    1  1  2  2  2  4  6  7
n    1  1  2  2  2  4 10 11
g    1

Dynamic Programming Solution

'g' does not match any character in "machine".

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1  1  1  1  1  1  1  2
a    1  1  2  2  2  2  2  3
r    1  1  2  2  2  2  2  3
n    1  1  2  2  2  2  4  5
i    1  1  2  2  2  4  6  7
n    1  1  2  2  2  4 10 11
g    1  1  2  2  2  4 10 11

Dynamic Programming Solution

     ε  m  a  c  h  i  n  e
ε    1  1  1  1  1  1  1  1
l    1  1  1  1  1  1  1  1
e    1  1  1  1  1  1  1  2
a    1  1  2  2  2  2  2  3
r    1  1  2  2  2  2  2  3
n    1  1  2  2  2  2  4  5
i    1  1  2  2  2  4  6  7
n    1  1  2  2  2  4 10 11
g    1  1  2  2  2  4 10 11

Total matching subsequences: 11

All-Subsequences Kernel

Using caching of sub-results, this dynamic programming solution runs in O(|s| |t|).

AllSubseqKernel(s, t)
  FOR j = 0 … |t|: DP[0, j] = 1
  FOR i = 1 … |s|:
    last = 0; cache[0] = 0
    FOR k = 1 … |t|:
      cache[k] = cache[last]
      IF t_k = s_i THEN
        cache[k] += DP[i-1, k-1]
        last = k
    FOR k = 0 … |t|: DP[i, k] = DP[i-1, k] + cache[k]
  RETURN DP[|s|, |t|]

Note: strings are 1-indexed, but DP & cache have a 0-index for the empty prefix.
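A runnable Python version of this dynamic program (a sketch following the pseudocode above), checked against the worked example:

```python
def all_subseq_kernel(s, t):
    # DP[i][k] = number of matching subsequences of s[:i] and t[:k]
    # (index 0 stands for the empty prefix).
    DP = [[1] * (len(t) + 1)]                  # kappa(empty, .) = 1
    for i in range(1, len(s) + 1):
        cache, last, row = [0] * (len(t) + 1), 0, [1]
        for k in range(1, len(t) + 1):
            cache[k] = cache[last]             # matches strictly left of position k
            if t[k - 1] == s[i - 1]:           # last char of s[:i] matches t_k
                cache[k] += DP[i - 1][k - 1]
                last = k
            row.append(DP[i - 1][k] + cache[k])
        DP.append(row)
    return DP[len(s)][len(t)]

print(all_subseq_kernel("learning", "machine"))  # 11, as in the table above
```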

String Kernels

- We have seen a number of string kernels that can be computed efficiently (using dynamic programming, tries, etc.):
  - Bag-of-words kernel
  - p-spectrum kernel
  - All-subsequences kernel
- Many other variants exist (fixed-length subsequence, gap-weighted subsequence, mismatch, etc.).
- The choice of kernel depends on the notion of similarity appropriate for the application domain.
- Kernel normalization / centering are common.

GRAPH KERNELS

Graphs: Motivation

Graphs are often used to model objects and their relationships to one another:
- Bioinformatics: molecule relationships
- Internet, social networks

Central questions: How similar are two graphs? How similar are two nodes within a graph?

Graph Kernel: Example

Consider a dataset of websites with links constituting the edges in the graph.
- A kernel on the nodes of the graph would be useful for learning w.r.t. the web pages.
- A kernel on graphs would be useful for comparing different components of the internet (e.g. domains).

Graph Kernel: Example

Consider a set of chemical pathways (sequences of interactions among molecules), i.e. graphs.
- A node kernel would be a useful way to measure the similarity of different molecules' roles within these pathways.
- A graph kernel would be a useful measure of similarity for different pathways.

Graphs: Definition

A graph G = (V, E) is specified by
- a set of nodes: V = {v_1, …, v_n}
- a set of edges: E ⊆ V × V

Data structures for representing graphs: adjacency matrix A = (a_ij)_{i,j=1}^n with a_ij = I((v_i, v_j) ∈ E); adjacency list; incidence matrix.

Example: G_1 = (V_1, E_1) with V_1 = {v_1, …, v_4} and E_1 = {(v_1, v_1), (v_1, v_2), (v_2, v_3), (v_4, v_2)}:

      1 1 0 0
A_1 = 0 0 1 0
      0 0 0 0
      0 1 0 0

Similarity between Graphs

Central question: How similar are two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2)?

1st possibility: the number of isomorphisms between all (sub-)graphs.

(Figure: two example graphs, G_1 with nodes v_1, …, v_5 and G_2 with nodes v_a, …, v_e.)

Isomorphisms of Graphs

Isomorphism: two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2) are isomorphic if there exists a bijective mapping f: V_1 → V_2 such that (v_i, v_j) ∈ E_1 ⟺ (f(v_i), f(v_j)) ∈ E_2.

Isomorphisms of Graphs

Counting isomorphisms between all (sub-)graphs is NP-hard, so this similarity measure is impractical.

Similarity between Graphs

Central question: How similar are two graphs?

2nd possibility: counting the number of common paths in the graphs.

Common Paths in Graphs

The number of paths of length 0 is just the number of nodes in the graph (4 for the example graph G_1).

Common Paths in Graphs

The number of paths of length 1 from one node to any other is given by the adjacency matrix (rows: from, columns: to):

      1 1 0 0
A_1 = 0 0 1 0
      0 0 0 0
      0 1 0 0

Common Paths in Graphs

The number of paths of length k from one node to any other is given by the k-th power of the adjacency matrix:

        1 1 1 0
A_1^2 = 0 0 0 0
        0 0 0 0
        0 0 1 0

Common Paths in Graphs

The number of paths of length k from one node to any other is given by the k-th power of the adjacency matrix. Proof? (Sketch: by induction on k, (A^k)_ij = Σ_l (A^{k-1})_il · a_lj sums, over each intermediate node v_l, the paths of length k-1 from v_i to v_l extended by an edge (v_l, v_j).)

For k > 2:

        1 1 1 0
A_1^k = 0 0 0 0
        0 0 0 0
        0 0 0 0

Common Paths in Graphs

The total number of paths of length k is therefore

Σ_{i,j=1}^n (A^k)_ij = 1^T A^k 1
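A brief numpy sketch of this path counting on the example graph G_1 (illustrative, not from the slides):

```python
import numpy as np

# Adjacency matrix of G_1: edges (v1,v1), (v1,v2), (v2,v3), (v4,v2)
A1 = np.array([[1, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 0],
               [0, 1, 0, 0]])

for k in range(1, 5):
    Ak = np.linalg.matrix_power(A1, k)
    # (A^k)_ij = number of paths of length k from v_i to v_j;
    # summing all entries (1^T A^k 1) gives the total number of length-k paths
    print(k, int(Ak.sum()))   # 4, 4, 3, 3, ... (the self-loop keeps paths alive)
```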

Common Paths in Graphs

Common paths are given by the product graph G_× = (V_×, E_×):

V_× = V_1 × V_2
E_× = {((v, v'), (w, w')) : (v, w) ∈ E_1 ∧ (v', w') ∈ E_2}

(Figure: G_1 with nodes a, b, c and G_2 with nodes 1, 2 yield the product graph G_× with nodes a1, a2, b1, b2, c1, c2.)
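Given the adjacency matrices A_1 and A_2, the adjacency matrix of this product graph is their Kronecker product A_× = A_1 ⊗ A_2. A numpy sketch using the example on the following slides (the individual adjacency matrices of G_1 and G_2 are my reconstruction, consistent with the product-graph matrices shown there):

```python
import numpy as np

# G_1: a -> b -> c ; G_2: two nodes with self-loops and edge 1 -> 2
A1 = np.array([[0, 1, 0],
               [0, 0, 1],
               [0, 0, 0]])
A2 = np.array([[1, 1],
               [0, 1]])

# Product-graph adjacency matrix; node order a1, a2, b1, b2, c1, c2
Ax = np.kron(A1, A2)

# Common paths of length k in G_1 and G_2 = paths of length k in the product graph
CP = sum(int(np.linalg.matrix_power(Ax, k).sum()) for k in range(4))
print(CP)  # 6 + 6 + 4 + 0 = 16, matching the slides
```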

Similarity between Graphs

Similarity between graphs: the number of common paths in their product graph.

Paths of length 0 (node order a1, a2, b1, b2, c1, c2):

A^0 = I (the 6×6 identity matrix)

CP_0 = Σ_{i,j=1}^n (A^0)_ij = 6

Similarity between Graphs

Paths of length 1 (rows: from, columns: to; node order a1, a2, b1, b2, c1, c2):

    0 0 1 1 0 0
    0 0 0 1 0 0
A = 0 0 0 0 1 1
    0 0 0 0 0 1
    0 0 0 0 0 0
    0 0 0 0 0 0

CP_1 = CP_0 + Σ_{i,j=1}^n (A)_ij = 6 + 6 = 12

Similarity between Graphs

Paths of length 2:

      0 0 0 0 1 2
      0 0 0 0 0 1
A^2 = 0 0 0 0 0 0
      0 0 0 0 0 0
      0 0 0 0 0 0
      0 0 0 0 0 0

CP_2 = CP_1 + Σ_{i,j=1}^n (A^2)_ij = 12 + 4 = 16

Similarity between Graphs

Paths of length 3:

A^3 = 0 (the 6×6 zero matrix)

CP_3 = CP_2 + Σ_{i,j=1}^n (A^3)_ij = 16 + 0 = 16

Similarity between Graphs

Since A^k = 0 for all k > 2, the total number of common paths is

CP = Σ_{k=0}^∞ Σ_{i,j=1}^n (A^k)_ij = 16

Similarity between Graphs

With cycles, there can be an infinite number of paths! Example: if G_2 gains a cycle, the powers of the product graph's adjacency matrix no longer vanish:

      1 k 1 k 1 k
      0 1 0 1 0 1
A^k = 0 0 0 0 0 0
      0 0 0 0 0 0
      0 0 0 0 0 0
      0 0 0 0 0 0

so the partial sums grow without bound:

CP_L = Σ_{k=0}^L Σ_{i,j=1}^n (A^k)_ij = (3/2) L^2 + (15/2) L + 6

Similarity between Graphs

With cycles, there can be an infinite number of paths! We must downweight the influence of long paths.

Random walk kernels (with A the adjacency matrix of the product graph):

k(G_1, G_2) = (1 / (|V_1| |V_2|)) Σ_{k=0}^∞ λ^k Σ_{i,j} (A^k)_ij = 1^T (I - λA)^{-1} 1 / (|V_1| |V_2|)

k(G_1, G_2) = (1 / (|V_1| |V_2|)) Σ_{k=0}^∞ (λ^k / k!) Σ_{i,j} (A^k)_ij = 1^T exp(λA) 1 / (|V_1| |V_2|)

These kernels can be calculated by means of the Sylvester equation in O(n^3).
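A direct numpy/scipy sketch of both closed forms (illustrative; it forms the product-graph matrix explicitly, whereas the Sylvester-equation method mentioned above avoids this; λ must be small enough for the geometric series to converge, i.e. λ < 1/ρ(A)):

```python
import numpy as np
from scipy.linalg import expm

def random_walk_kernels(A1, A2, lam=0.1):
    # Product-graph adjacency matrix via the Kronecker product
    Ax = np.kron(A1, A2)
    n = Ax.shape[0]
    one = np.ones(n)
    norm = A1.shape[0] * A2.shape[0]   # |V1| * |V2|
    # Geometric weights: sum_k lambda^k A^k = (I - lambda A)^(-1)
    k_geo = one @ np.linalg.solve(np.eye(n) - lam * Ax, one) / norm
    # Exponential weights: sum_k (lambda^k / k!) A^k = exp(lambda A)
    k_exp = one @ expm(lam * Ax) @ one / norm
    return k_geo, k_exp
```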

Similarity between Nodes

Assumption: nodes are similar if they are connected by many paths.

Random walk kernels (now between nodes v_i, v_j of a single graph with adjacency matrix A):

k(v_i, v_j) = Σ_{k=0}^∞ λ^k (A^k)_ij = ((I - λA)^{-1})_ij

k(v_i, v_j) = Σ_{k=0}^∞ (λ^k / k!) (A^k)_ij = (exp(λA))_ij

Additional Graph Kernels

Shortest-path kernel:
- All shortest paths between pairs of nodes are computed by the Floyd-Warshall algorithm with run time O(|V|^3).
- Compare all pairs of shortest paths between 2 graphs: O(|V_1|^2 |V_2|^2).

Subtree kernel:
- Idea: use tree structures as indices in the feature space.
- Can be computed recursively for a fixed tree height.
- Subtrees are downweighted according to their height.
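A minimal numpy sketch of the Floyd-Warshall step underlying the shortest-path kernel (hypothetical, not from the slides; unit edge weights are assumed):

```python
import numpy as np

def floyd_warshall(A):
    # All-pairs shortest path lengths in O(|V|^3) from an adjacency matrix
    n = A.shape[0]
    D = np.where(A > 0, 1.0, np.inf)   # direct edges have length 1, else unreachable
    np.fill_diagonal(D, 0.0)
    for k in range(n):                 # allow v_k as an intermediate node
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D
```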

Summary

- Kernel functions provide a measure of similarity that allows us to compare non-numeric data.
- String kernels: based on the space of all strings, they count the number of common occurrences within 2 strings.
- Graph kernels: they use common structures within graphs as the basis for their feature space.
  - Paths: all-paths kernel, random-walk kernel, shortest-path kernel
  - Subtrees: subtree kernel
- Kernels are also defined on other structures (e.g. trees, images, …).
- The kernel is selected for a particular domain.