ICML - Kernels & RKHS Workshop. Distances and Kernels for Structured Objects

ICML - Kernels & RKHS Workshop Distances and Kernels for Structured Objects Marco Cuturi - Kyoto University Kernel & RKHS Workshop 1

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. Kernel & RKHS Workshop 2

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When observations are in R^n, Distances and Positive Definite Kernels share many properties. Kernel & RKHS Workshop 3

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When observations are in R^n, Distances and Positive Definite Kernels share many properties. At their interface lies the family of Negative Definite Kernels. Kernel & RKHS Workshop 4

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When observations are in R^n: [Venn diagram: the set D of distances] Kernel & RKHS Workshop 5

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When observations are in R^n: [Venn diagram: D and p.d.k] (note: intersection not to be taken literally) Kernel & RKHS Workshop 6

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When observations are in R^n: [Venn diagram: D, n.d.k, p.d.k] Kernel & RKHS Workshop 7

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When observations are in R^n: [Venn diagram: D, n.d.k, p.d.k] Hilbertian metrics are a sweet spot, both in theory and practice. Kernel & RKHS Workshop 8

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When comparing structured data (constrained subsets of R^n, n very large)... Kernel & RKHS Workshop 9

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When comparing structured data (constrained subsets of R^n, n very large): Classical distances on R^n that ignore such constraints perform poorly. Kernel & RKHS Workshop 10

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When comparing structured data (constrained subsets of R^n, n very large): Classical distances on R^n that ignore such constraints perform poorly. Combinatorial distances (to be defined) take them into account: (string, tree) edit distances, DTW, optimal matchings, transportation distances. Kernel & RKHS Workshop 11

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When comparing structured data (constrained subsets of R^n, n very large): Classical distances on R^n that ignore such constraints perform poorly. Combinatorial distances (to be defined) take them into account: (string, tree) edit distances, DTW, optimal matchings, transportation distances. Combinatorial distances are not negative definite (in the general case). Kernel & RKHS Workshop 12

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When comparing structured data (constrained subsets of R^n, n very large): [Venn diagram: D, n.d.k, p.d.k] Kernel & RKHS Workshop 13

Outline Distances and Positive Definite Kernels are crucial ingredients in many popular ML algorithms. When comparing structured data (constrained subsets of R^n, n very large): [Venn diagram: D, n.d.k, p.d.k] Main message of this talk: we can recover p.d. kernels from combinatorial distances through generating functions. Kernel & RKHS Workshop 14

Distances and Kernels Kernel & RKHS Workshop 15

Distances A bivariate function d : X × X → R_+, (x,y) ↦ d(x,y), defined on a set X is a distance if ∀x,y,z ∈ X: d(x,y) = d(y,x) (symmetry); d(x,y) = 0 ⇔ x = y (definiteness); d(x,z) ≤ d(x,y) + d(y,z) (triangle inequality). [figure: triangle on points x, y, z with edge lengths d(x,y), d(y,z), d(x,z)] Kernel & RKHS Workshop 16

Kernels (Symmetric & Positive Definite) A bivariate function k : X × X → R, (x,y) ↦ k(x,y), defined on a set X is a positive definite kernel if ∀x,y ∈ X, k(x,y) = k(y,x) (symmetry), and ∀n ∈ N, {x_1,…,x_n} ∈ X^n, c ∈ R^n, Σ_{i,j=1}^n c_i c_j k(x_i,x_j) ≥ 0. Kernel & RKHS Workshop 17
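
A quick numerical illustration of the definition (a minimal sketch, assuming numpy is available): build the Gram matrix of a classical p.d. kernel on a few sample points and check symmetry and the sign of the quadratic forms through the smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # 20 sample points in R^3

def gaussian_gram(X, t=0.5):
    """Gram matrix of k(x,y) = exp(-t * ||x-y||^2), a classical p.d. kernel."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-t * sq)

K = gaussian_gram(X)
# Symmetry, and nonnegativity of all quadratic forms sum_ij c_i c_j k(x_i, x_j),
# which is equivalent to the smallest eigenvalue of the Gram matrix being >= 0.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)
```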

Matrices Convex cone of n×n distance matrices - dimension n(n−1)/2: M_n = {X ∈ R^{n×n} : x_ii = 0; for i < j, x_ij > 0; x_ik + x_kj − x_ij ≥ 0}. 3·(n choose 3) + (n choose 2) linear inequalities; n equalities. Kernel & RKHS Workshop 18

Matrices Convex cone of n×n distance matrices - dimension n(n−1)/2: M_n = {X ∈ R^{n×n} : x_ii = 0; for i ≠ j, x_ij > 0; x_ik + x_kj − x_ij ≥ 0}. 3·(n choose 3) + (n choose 2) linear inequalities; n equalities. Convex cone of n×n p.s.d. matrices - dimension n(n+1)/2: S_n^+ = {X ∈ R^{n×n} : X = X^T; ∀z ∈ R^n, z^T X z ≥ 0}. ∀z ∈ R^n, ⟨X, zz^T⟩ ≥ 0: infinite number of inequalities; (n choose 2) equalities. Kernel & RKHS Workshop 19

Cones [figure: the cone of hollow symmetric 3×3 matrices [[0,x,y],[x,0,z],[y,z,0]] and the p.s.d. cone S_2^+ of matrices [[α,β],[β,γ]]; image: Dattorro] Kernel & RKHS Workshop 20

Functions & Matrices d distance: ∀n ∈ N, {x_1,…,x_n} ∈ X^n, [d(x_i,x_j)] ∈ M_n. k kernel: ∀n ∈ N, {x_1,…,x_n} ∈ X^n, [k(x_i,x_j)] ∈ S_n^+. Kernel & RKHS Workshop 21

Extreme Rays & Facets M_n is a polyhedral cone. Facets = the 3·(n choose 3) hyperplanes d_ik + d_kj − d_ij = 0. Avis (1980) shows that extreme rays are arbitrarily complex using graph metrics. Kernel & RKHS Workshop 22

Extreme Rays & Facets M_n is a polyhedral cone. Facets = the 3·(n choose 3) hyperplanes d_ik + d_kj − d_ij = 0. Avis (1980) shows that extreme rays are arbitrarily complex using graph metrics. [figure: weighted graph on vertices 1,…,5 with edge weights d_12, d_23, d_14, d_34, d_45] Kernel & RKHS Workshop 23

Extreme Rays & Facets M_n is a polyhedral cone. Facets = the 3·(n choose 3) hyperplanes d_ik + d_kj − d_ij = 0. Avis (1980) shows that extreme rays are arbitrarily complex using graph metrics. [figure: weighted graph on vertices 1,…,5 with edge weights d_12, d_23, d_14, d_34, d_45] d_13 = min(d_12 + d_23, d_14 + d_34) Kernel & RKHS Workshop 24

Extreme Rays & Facets M_n is a polyhedral cone. Facets = the 3·(n choose 3) hyperplanes d_ik + d_kj − d_ij = 0. Avis (1980) shows that extreme rays are arbitrarily complex using graph metrics. Let G_{n,p} be a random graph with n points and edge probability P(ij ∈ G_{n,p}) = p. If for some 0 < ε < 1/5, n^{−1/5+ε} ≤ p ≤ 1 − n^{−1/4+ε}, then the distance induced by G is an extreme ray of M_n with probability 1 − o(1). Kernel & RKHS Workshop 25

Extreme Rays & Facets M_n is a polyhedral cone. Facets = the 3·(n choose 3) hyperplanes d_ik + d_kj − d_ij = 0. Avis (1980) shows that extreme rays are arbitrarily complex using graph metrics. Let G_{n,p} be a random graph with n points and edge probability P(ij ∈ G_{n,p}) = p. If for some 0 < ε < 1/5, n^{−1/5+ε} ≤ p ≤ 1 − n^{−1/4+ε}, then the distance induced by G is an extreme ray of M_n with probability 1 − o(1). Grishukin (2005) characterizes the extreme rays of M_7 (about 60,000 of them). Kernel & RKHS Workshop 26

Extreme Rays & Facets M_n is a polyhedral cone. Facets = the 3·(n choose 3) hyperplanes d_ik + d_kj − d_ij = 0. Avis (1980) shows that extreme rays are arbitrarily complex using graph metrics. S_n^+ is a self-dual, homogeneous cone. Overall far easier to study: Facets are isomorphic to S_k^+ for k < n. Extreme rays are exactly the p.s.d. matrices of rank 1, zz^T. Kernel & RKHS Workshop 27

Extreme Rays & Facets M_n is a polyhedral cone. Facets = the 3·(n choose 3) hyperplanes d_ik + d_kj − d_ij = 0. Avis (1980) shows that extreme rays are arbitrarily complex using graph metrics. S_n^+ is a self-dual, homogeneous cone. Overall far easier to study: Facets are isomorphic to S_k^+ for k < n. Extreme rays are exactly the p.s.d. matrices of rank 1, zz^T. Eigendecomposition: if K ∈ S_n^+ then K = Σ_{i=1}^n λ_i z_i z_i^T. Integral representations for p.d. kernels themselves (Bochner's theorem). Kernel & RKHS Workshop 28

Checking, Projection, Learning Optimizing in M_n is relatively difficult. Checking whether X is in M_n requires up to 3·(n choose 3) comparisons. Projection: triangle-fixing algorithms (Brickell et al. (2008)), no convergence-speed guarantee. No simple barrier function. Optimizing in S_n^+ is relatively easy. Checking whether X is in S_n^+ only requires finding the minimal eigenvalue (eigs). Projection: threshold negative eigenvalues. log-det barrier, semidefinite programming. Kernel & RKHS Workshop 29

Checking, Projection, Learning Optimizing in M_n is relatively difficult. Checking whether X is in M_n requires up to 3·(n choose 3) comparisons. Projection: triangle-fixing algorithms (Brickell et al. (2008)), no convergence-speed guarantee. No simple barrier function. Optimizing in S_n^+ is relatively easy. Checking whether X is in S_n^+ only requires finding the minimal eigenvalue (eigs). Projection: threshold negative eigenvalues. log-det barrier, semidefinite programming. Real metric learning in M_n is difficult, Mahalanobis learning in S_n^+ is easier. Kernel & RKHS Workshop 30
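
A minimal sketch of this contrast, assuming numpy: membership in M_n is checked by looping over all triangle inequalities, while membership in S_n^+ and projection onto it only require an eigendecomposition.

```python
import numpy as np
from itertools import permutations

def in_metric_cone(D, tol=1e-10):
    """Zero diagonal, symmetry, nonnegativity, and all triangle inequalities."""
    n = D.shape[0]
    if not (np.allclose(np.diag(D), 0) and np.allclose(D, D.T) and (D >= -tol).all()):
        return False
    # Loops over ordered triples, i.e. covers the 3 * C(n,3) distinct inequalities.
    return all(D[i, k] + D[k, j] - D[i, j] >= -tol
               for i, j, k in permutations(range(n), 3))

def in_psd_cone(K, tol=1e-10):
    """Only the minimal eigenvalue is needed."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

def project_psd(K):
    """Projection onto S_n^+: threshold negative eigenvalues at zero."""
    w, V = np.linalg.eigh((K + K.T) / 2)
    return V @ np.diag(np.maximum(w, 0)) @ V.T
```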

Negative Definite Kernels Convex cone of n×n negative definite kernels - dimension n(n+1)/2: N_n = {X ∈ R^{n×n} : X = X^T; ∀z ∈ R^n with z^T 1 = 0, z^T X z ≤ 0}. Infinitely many linear inequalities; (n choose 2) equalities. Kernel & RKHS Workshop 31

Negative Definite Kernels Convex cone of n×n negative definite kernels - dimension n(n+1)/2: N_n = {X ∈ R^{n×n} : X = X^T; ∀z ∈ R^n with z^T 1 = 0, z^T X z ≤ 0}. Infinitely many linear inequalities; (n choose 2) equalities. ψ n.d. kernel: ∀n ∈ N, {x_1,…,x_n} ∈ X^n, [ψ(x_i,x_j)] ∈ N_n. Kernel & RKHS Workshop 32

A few important results on Negative Definite Kernels If ψ is a negative definite kernel on X then there exist a Hilbert space H, a mapping x ↦ ϕ_x ∈ H, and a real-valued function f on X s.t. ψ(x,y) = ‖ϕ_x − ϕ_y‖² + f(x) + f(y). If ∀x ∈ X, ψ(x,x) = 0, then f = 0 and √ψ is a semi-distance. If {ψ = 0} = {(x,x), x ∈ X}, then √ψ is a distance. If ψ(x,x) ≥ 0, then for 0 < α < 1, ψ^α is negative definite. k := e^{−tψ} is positive definite for all t > 0. Kernel & RKHS Workshop 33
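
A numerical sanity check of the last two points (a sketch, assuming numpy): the squared Euclidean distance ψ(x,y) = ‖x−y‖² is negative definite, and e^{−tψ}, i.e. the Gaussian kernel, is positive definite for every t > 0.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
psi = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # psi(x,y) = ||x - y||^2

# Negative definiteness: z^T Psi z <= 0 for every z with z^T 1 = 0.
z = rng.normal(size=30); z -= z.mean()
print(z @ psi @ z <= 1e-10)

# Schoenberg: k = exp(-t * psi) is positive definite for all t > 0.
for t in (0.01, 0.1, 1.0, 10.0):
    print(t, np.linalg.eigvalsh(np.exp(-t * psi)).min() >= -1e-8)
```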

A Rough Sketch We can now give a more precise meaning to [Venn diagram: D, n.d.k, p.d.k] Kernel & RKHS Workshop 34

A Rough Sketch ...using this diagram: [diagram relating D(X), N(X) and P(X): N(X) with vanishing diagonal, via ψ = d² and back via d = √ψ (Hilbertian metrics) or d(x,y) = √(ψ(x,y) − (ψ(x,x)+ψ(y,y))/2) (pseudo-Hilbertian metrics); P(X), via k = exp(−ψ) and ψ = −log k; P_∞(X), the infinitely divisible kernels] Kernel & RKHS Workshop 35

Importance of this link One of the biggest practical issues with kernel methods is that of diagonal dominance. Cauchy-Schwarz: k(x,y) ≤ √(k(x,x) k(y,y)). Kernel & RKHS Workshop 36

Importance of this link One of the biggest practical issues with kernel methods is that of diagonal dominance. Cauchy-Schwarz: k(x,y) ≤ √(k(x,x) k(y,y)). Diagonal dominance: k(x,y) ≪ √(k(x,x) k(y,y)). Kernel & RKHS Workshop 37

Importance of this link One of the biggest practical issues with kernel methods is that of diagonal dominance. Cauchy-Schwarz: k(x,y) ≤ √(k(x,x) k(y,y)). Diagonal dominance: k(x,y) ≪ √(k(x,x) k(y,y)). If k is infinitely divisible, k^α with small α is positive definite and less diagonally dominant. Kernel & RKHS Workshop 38

Importance of this link One of the biggest practical issues with kernel methods is that of diagonal dominance. Cauchy-Schwarz: k(x,y) ≤ √(k(x,x) k(y,y)). Diagonal dominance: k(x,y) ≪ √(k(x,x) k(y,y)). If k is infinitely divisible, k^α with small α is positive definite and less diagonally dominant. This explains the success of Gaussian kernels e^{−t‖x−y‖²} and Laplace kernels e^{−t‖x−y‖}, and arguably the failure of many non-infinitely divisible kernels, which are too difficult to tune. Kernel & RKHS Workshop 39
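
A small illustration of this remedy (a sketch, assuming numpy): with a badly tuned Gaussian kernel the Gram matrix is almost the identity; raising it entrywise to a small power α, which is licit whenever k is infinitely divisible (for the Gaussian kernel it simply rescales t), keeps it positive definite while lifting the off-diagonal entries.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 10))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

K = np.exp(-1.0 * sq)                 # Gaussian kernel with too small a bandwidth
off = ~np.eye(15, dtype=bool)
print("mean off-diagonal entry:", K[off].mean())       # nearly 0: diagonally dominant

K_alpha = K ** 0.05                   # still a Gaussian kernel, with t' = 0.05 * t
print("mean off-diagonal entry:", K_alpha[off].mean()) # much closer to 1
print("still p.d.:", np.linalg.eigvalsh(K_alpha).min() >= -1e-10)
```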

Questions Worth Asking Two questions: Let d be a distance that is not negative definite. Is it possible that e^{−t_1 d} is positive definite for some t_1 ∈ R? ε-infinite divisibility: is there a distance d such that e^{−td} is positive definite for all t > ε? Kernel & RKHS Workshop 40

Questions Worth Asking Two questions: Let d be a distance that is not negative definite. Is it possible that e^{−t_1 d} is positive definite for some t_1 ∈ R? Yes, examples exist: the Stein distance (Sra, 2011) and the Inverse Generalized Variance kernel (C. et al., 2005) for p.s.d. matrices. ε-infinite divisibility: is there a distance d such that e^{−td} is positive definite for all t > ε? (open) Kernel & RKHS Workshop 41

Positivity & Combinatorial Distances Kernel & RKHS Workshop 42

Structured Objects Objects in a countable set: variable-length strings, trees, graphs, permutations. Constrained vectors: positive vectors, histograms. Vectors of different sizes: variable-length time series. Kernel & RKHS Workshop 43

Structured Objects Objects in a countable set: variable-length strings, trees, graphs, sets. Constrained vectors: positive vectors, histograms. Vectors of different sizes: variable-length time series. How can we define a kernel or a distance on such sets? In most cases, applying standard distances on R^n or even N^n is meaningless. Kernel & RKHS Workshop 44

Back to fundamentals Distances are optimal by nature, and quantify shortest-length paths. Graph metrics are defined that way. [figure: weighted graph on vertices 1,…,5 with edge weights d_12, d_23, d_14, d_34, d_45] Triangle inequalities are defined precisely to enforce this optimality: d(x,y) ≤ d(x,z) + d(z,y). Kernel & RKHS Workshop 45

Back to fundamentals Distances are optimal by nature, and quantify shortest-length paths. Graph metrics are defined that way. [figure: weighted graph on vertices 1,…,5 with edge weights d_12, d_23, d_14, d_34, d_45] Triangle inequalities are defined precisely to enforce this optimality: d(x,y) ≤ d(x,z) + d(z,y). Many distances on structured objects rely on optimization. Kernel & RKHS Workshop 46

Back to fundamentals p.d. kernels are additive by nature: k is positive definite ⇔ there exists ϕ : X → H such that k(x,y) = ⟨ϕ(x), ϕ(y)⟩_H. X ∈ S_n^+ ⇔ there exists L ∈ R^{n×n} with X = L^T L. Kernel & RKHS Workshop 47

Back to fundamentals p.d. kernels are additive by nature: k is positive definite ⇔ there exists ϕ : X → H such that k(x,y) = ⟨ϕ(x), ϕ(y)⟩_H. X ∈ S_n^+ ⇔ there exists L ∈ R^{n×n} with X = L^T L. Many kernels on structured objects rely on defining explicit (possibly infinite) feature vectors; there is a very large literature on this subject which we will not address here. Kernel & RKHS Workshop 48
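
A minimal numerical counterpart of the factorization statement (a sketch, assuming numpy): any p.s.d. Gram matrix can be written as L^T L, and the columns of L act as explicit feature vectors whose dot products recover the kernel.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(10, 3))
K = (X @ X.T) ** 2                       # polynomial kernel k(x,y) = <x,y>^2, p.d.

# K is p.s.d., hence K = L^T L; an eigendecomposition gives such a factor
# (Cholesky would also work when K is nonsingular).
w, V = np.linalg.eigh(K)
L = np.diag(np.sqrt(np.maximum(w, 0))) @ V.T
print(np.allclose(K, L.T @ L))           # the columns of L are explicit feature vectors
```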

Combinatorial Distances To define a distance, an approach which has been repeatedly used is to: consider two inputs x, y; define a countable set of mappings from x to y, T(x,y); define a cost c(τ) for each element τ of T(x,y); define a distance between x, y as d(x,y) = min_{τ ∈ T(x,y)} c(τ). Kernel & RKHS Workshop 49

Combinatorial Distances To define a distance, an approach which has been repeatedly used is to: consider two inputs x, y; define a countable set of mappings from x to y, T(x,y); define a cost c(τ) for each element τ of T(x,y); define a distance between x, y as d(x,y) = min_{τ ∈ T(x,y)} c(τ). Symmetry, definiteness and triangle inequalities depend on c and T. Kernel & RKHS Workshop 50

Combinatorial Distances To define a distance, an approach which has been repeatedly used is to: consider two inputs x, y; define a countable set of mappings from x to y, T(x,y); define a cost c(τ) for each element τ of T(x,y); define a distance between x, y as d(x,y) = min_{τ ∈ T(x,y)} c(τ). Symmetry, definiteness and triangle inequalities depend on c and T. In many cases, T is endowed with a dot product and c(τ) = ⟨τ, θ⟩ for some θ. Kernel & RKHS Workshop 51
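
One classical instance of this recipe (a sketch, not taken from the slides): the Levenshtein edit distance, where T(x,y) is the set of edit scripts from string x to string y with unit substitution/insertion/deletion costs, and the min is computed by dynamic programming.

```python
def edit_distance(x, y):
    """d(x,y) = min over edit scripts of their total (unit) cost, by dynamic programming."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete the first i symbols of x
    for j in range(m + 1):
        D[0][j] = j                      # insert the first j symbols of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (x[i - 1] != y[j - 1])
            D[i][j] = min(sub, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D[n][m]

print(edit_distance("DOING", "DONE"))    # 2: delete I, substitute G -> E
```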

Combinatorial Distances are not Negative Definite d(x,y) = min_{τ ∈ T(x,y)} c(τ). In most cases such distances are not negative definite: [Venn diagram: D, n.d.k, p.d.k]. Can we use them to define kernels? [Venn diagram: D, n.d.k, p.d.k] So far, yes, always using the same technique. Kernel & RKHS Workshop 52

An alternative definition of minimality For a family of numbers a_n, n ∈ N, soft-min a_n = −log Σ_n e^{−a_n}. [figure: two samples of 10 values with their minimum and soft-minimum; top: Min: 0.19, Soft-min: −1.4369; bottom: Min: 0.206, Soft-min: −1.5755] Kernel & RKHS Workshop 53
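
A direct transcription of this definition (a sketch, assuming numpy and scipy): the soft-min is a smoothed minimum, always below the true min, and computed stably with logsumexp.

```python
import numpy as np
from scipy.special import logsumexp

def soft_min(a):
    """soft-min(a) = -log sum_n exp(-a_n)."""
    return -logsumexp(-np.asarray(a))

a = np.random.default_rng(3).uniform(0, 2, size=10)
print(a.min(), soft_min(a))              # soft-min <= min, as in the figure above
```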

Soft-min of costs - Generating Functions d(x,y) = min_{τ ∈ T(x,y)} c(τ). e^{−d} is not positive definite in the general case. Kernel & RKHS Workshop 54

Soft-min of costs - Generating Functions d(x,y) = min_{τ ∈ T(x,y)} c(τ). e^{−d} is not positive definite in the general case. δ(x,y) = soft-min_{τ ∈ T(x,y)} c(τ). e^{−δ} has been proved to be positive definite in all known cases. Kernel & RKHS Workshop 55

Soft-min of costs - Generating Functions d(x,y) = min_{τ ∈ T(x,y)} c(τ). e^{−d} is not positive definite in the general case. δ(x,y) = soft-min_{τ ∈ T(x,y)} c(τ). e^{−δ} has been proved to be positive definite in all known cases. e^{−δ(x,y)} = Σ_{τ ∈ T(x,y)} e^{−⟨τ,θ⟩} = G_{T(x,y)}(θ), where G_{T(x,y)} is the generating function of the set of all mappings between x and y. Kernel & RKHS Workshop 56

Example: Optimal assignment distance between two sets Input: x = {x_1,…,x_n}, y = {y_1,…,y_n} ∈ X^n. [figure: two point sets x_1,…,x_4 and y_1,…,y_4] Kernel & RKHS Workshop 57

Example: Optimal assignment distance between two sets Input: x = {x_1,…,x_n}, y = {y_1,…,y_n} ∈ X^n. [figure: a matching between x_1,…,x_4 and y_1,…,y_4 with costs d_12, d_13, d_24, d_34] cost parameter: distance d on X. mapping variable: permutation σ in S_n. cost: Σ_{i=1}^n d(x_i, y_{σ(i)}). Kernel & RKHS Workshop 58

Example: Optimal assignment distance between two sets Input: x = {x_1,…,x_n}, y = {y_1,…,y_n} ∈ X^n. [figure: a matching between x_1,…,x_4 and y_1,…,y_4 with costs d_12, d_13, d_24, d_34] cost parameter: distance d on X. mapping variable: permutation σ in S_n. cost: Σ_{i=1}^n d(x_i, y_{σ(i)}) = ⟨P_σ, D⟩ where D = [d(x_i,y_j)]. d_Assig.(x,y) = min_{σ ∈ S_n} Σ_{i=1}^n d(x_i, y_{σ(i)}) = min_{σ ∈ S_n} ⟨P_σ, D⟩ Kernel & RKHS Workshop 59

Example: Optimal assignment distance between two sets d_Assig.(x,y) = min_{σ ∈ S_n} Σ_{i=1}^n d(x_i, y_{σ(i)}) = min_{σ ∈ S_n} ⟨P_σ, D⟩. Define k = e^{−d}. If k is positive definite on X then k_Perm(x,y) = Σ_{σ ∈ S_n} e^{−⟨P_σ, D⟩} = Permanent([k(x_i,y_j)]) is positive definite (C. 2007). e^{−d_Assig.} is not (Fröhlich et al. 2005, Vert 2008). Kernel & RKHS Workshop 60
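
A brute-force sketch for tiny sets (assuming numpy; the permanent is hard to compute in general, so this is purely illustrative): compare the optimal-assignment distance with the permanent of [k(x_i,y_j)], which sums e^{−⟨P_σ,D⟩} over all permutations instead of minimizing.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(4)
x, y = rng.normal(size=(4, 2)), rng.normal(size=(4, 2))
D = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)   # D_ij = d(x_i, y_j)
K = np.exp(-D)                                               # k = e^{-d}

costs = [sum(D[i, s[i]] for i in range(4)) for s in permutations(range(4))]
d_assig = min(costs)                            # optimal assignment distance
k_perm = sum(np.exp(-c) for c in costs)         # = Permanent([k(x_i, y_j)])

# Same quantity computed directly as a permanent of the kernel matrix.
perm_K = sum(np.prod([K[i, s[i]] for i in range(4)]) for s in permutations(range(4)))
print(d_assig, k_perm, np.isclose(k_perm, perm_K))
```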

Example: Optimal alignment between two strings Input: x = (x_1,…,x_n) ∈ X^n, y = (y_1,…,y_m) ∈ X^m, X finite. x = DOING, y = DONE. Kernel & RKHS Workshop 61

Example: Optimal alignment between two strings Input: x = (x_1,…,x_n) ∈ X^n, y = (y_1,…,y_m) ∈ X^m, X finite. x = DOING, y = DONE. mapping variable: alignment π = ((π_1(1),…,π_1(q)), (π_2(1),…,π_2(q))), an increasing path. [figure: alignment grid between DOING and DONE] Kernel & RKHS Workshop 62

Example: Optimal alignment between two strings Input: x = (x_1,…,x_n) ∈ X^n, y = (y_1,…,y_m) ∈ X^m, X finite. x = DOING, y = DONE. mapping variable: alignment π = ((π_1(1),…,π_1(q)), (π_2(1),…,π_2(q))), an increasing path. [figure: alignment grid between DOING and DONE] cost parameter: distance d on X + gap function g : N → R. c(π) = Σ_{i=1}^{|π|} d(x_{π_1(i)}, y_{π_2(i)}) + Σ_{i=1}^{|π|−1} [g(π_1(i+1) − π_1(i)) + g(π_2(i+1) − π_2(i))] Kernel & RKHS Workshop 63

Example: Optimal alignment between two strings Input: x = (x_1,…,x_n) ∈ X^n, y = (y_1,…,y_m) ∈ X^m, X finite. x = DOING, y = DONE. mapping variable: alignment π = ((π_1(1),…,π_1(q)), (π_2(1),…,π_2(q))), an increasing path. [figure: alignment grid between DOING and DONE] cost parameter: distance d on X + gap function g : N → R. c(π) = Σ_{i=1}^{|π|} d(x_{π_1(i)}, y_{π_2(i)}) + Σ_{i=1}^{|π|−1} [g(π_1(i+1) − π_1(i)) + g(π_2(i+1) − π_2(i))]. d_align(x,y) = min_{π ∈ Alignments} c(π) Kernel & RKHS Workshop 64

Example: Optimal alignment between two strings d_align(x,y) = min_{π ∈ Alignments} c(π). Define k = e^{−d}. If k is positive definite on X then k_LA(x,y) = Σ_{π ∈ Alignments} e^{−c(π)} is positive definite (Saigo et al. 2003). Kernel & RKHS Workshop 65

Example: Optimal time warping between two time series Input: x = (x_1,…,x_n), y = (y_1,…,y_m), two time series with values in X. [figure: two sequences of points x_1,…,x_5 and y_1,…,y_7 in X] Kernel & RKHS Workshop 66

Example: Optimal time warping between two time series Input: x = (x_1,…,x_n), y = (y_1,…,y_m), two time series with values in X. mapping variable: π = ((π_1(1),…,π_1(q)), (π_2(1),…,π_2(q))), an increasing contiguous path. [figure: 5×7 grid of pairwise costs D_ij] Kernel & RKHS Workshop 67

Example: Optimal time warping between two time series Input: x = (x_1,…,x_n), y = (y_1,…,y_m), two time series with values in X. mapping variable: π = ((π_1(1),…,π_1(q)), (π_2(1),…,π_2(q))), an increasing contiguous path. [figure: 5×7 grid of pairwise costs D_ij] cost parameter: distance d on X. cost: c(π) = Σ_{i=1}^{|π|} d(x_{π_1(i)}, y_{π_2(i)}) Kernel & RKHS Workshop 68

Example: Optimal time warping between two time series Input: x = (x_1,…,x_n), y = (y_1,…,y_m), two time series with values in X. mapping variable: π = ((π_1(1),…,π_1(q)), (π_2(1),…,π_2(q))), an increasing contiguous path. [figure: 5×7 grid of pairwise costs D_ij, with an alignment path π highlighted] cost parameter: distance d on X. cost: c(π) = Σ_{i=1}^{|π|} d(x_{π_1(i)}, y_{π_2(i)}). d_DTW(x,y) = min_{π ∈ Alignments} c(π) Kernel & RKHS Workshop 69

Example: Optimal time warping between two time series d_DTW(x,y) = min_{π ∈ Alignments} c(π). Define k = e^{−d}. If k is positive definite and geometrically divisible on X then k_GA(x,y) = Σ_{π ∈ Alignments} e^{−c(π)} is positive definite (C. et al. 2007, C. 2011). Kernel & RKHS Workshop 70
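
A compact sketch of both computations (assuming numpy, with |x_i − y_j| as ground cost; the k_GA recursion below follows the soft-min idea above and is written from the definitions rather than copied from the talk): replacing min by a sum over all alignments turns the DTW dynamic program into an O(nm) accumulation of the generating function.

```python
import numpy as np

def dtw(x, y):
    """d_DTW: min over increasing contiguous paths of the summed ground costs."""
    n, m = len(x), len(y)
    C = np.full((n + 1, m + 1), np.inf); C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i, j] = abs(x[i - 1] - y[j - 1]) + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    return C[n, m]

def ga_kernel(x, y, t=1.0):
    """k_GA: same recursion with (min, +) replaced by (+, *), i.e. a sum over all alignments."""
    n, m = len(x), len(y)
    M = np.zeros((n + 1, m + 1)); M[0, 0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = np.exp(-t * abs(x[i - 1] - y[j - 1]))
            M[i, j] = k * (M[i - 1, j - 1] + M[i - 1, j] + M[i, j - 1])
    return M[n, m]

x, y = [0.0, 1.0, 2.0, 1.0], [0.0, 2.0, 1.0]
print(dtw(x, y), -np.log(ga_kernel(x, y)))    # -log k_GA is a soft-min over alignment costs
```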

Example: Edit-distance between two trees Input: two labeled trees x, y. mapping variable: sequence of substitutions/deletions/insertions of vertices. cost parameter: γ, distance between labels and cost for deletion/insertion. d_TreeEdit(x,y) = min_{σ ∈ EditScripts(x,y)} Σ_i γ(σ_i) Kernel & RKHS Workshop 71

Example: Edit-distance between two trees Input: two labeled trees x, y. mapping variable: sequence of substitutions/deletions/insertions of vertices. cost parameter: γ, distance between labels and cost for deletion/insertion. d_TreeEdit(x,y) = min_{σ ∈ EditScripts(x,y)} Σ_i γ(σ_i). Positive definiteness of the generating function (if e^{−γ} is p.d.) proved by Shin & Kuboyama 2008; Shin, C., Kuboyama 2011. Kernel & RKHS Workshop 72

Example: Transportation distance between discrete histograms Input: two integer histograms x, y ∈ N^d such that Σ_{i=1}^d x_i = Σ_{i=1}^d y_i = N. mapping: transportation matrices U(r,c) = {X ∈ N^{d×d} : X 1_d = x, X^T 1_d = y}. cost parameter: M, a distance matrix in M_d. d_W(x,y) = min_{X ∈ U(r,c)} ⟨X, M⟩ Kernel & RKHS Workshop 73

Example: Transportation distance between discrete histograms d_W(x,y) = min_{X ∈ U(r,c)} ⟨X, M⟩. Define k_ij = e^{−m_ij}. If [k_ij] is positive definite then k_M(x,y) = Σ_{X ∈ U(r,c)} e^{−⟨X, M⟩} is positive definite (C., submitted). Kernel & RKHS Workshop 74
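
A sketch of the distance side only (assuming scipy; the histograms and the ground cost M below are made up for illustration): d_W is a linear program over the transportation polytope, whereas the p.d. kernel above replaces the min by a sum over all integral transport plans.

```python
import numpy as np
from scipy.optimize import linprog

x = np.array([3, 1, 2]); y = np.array([2, 2, 2]); d = len(x)   # equal total mass N = 6
M = np.abs(np.subtract.outer(np.arange(d), np.arange(d))).astype(float)  # ground costs

# Equality constraints: row sums of X equal x, column sums equal y (X flattened row-wise).
A_eq = np.zeros((2 * d, d * d))
for i in range(d):
    A_eq[i, i * d:(i + 1) * d] = 1.0        # sum_j X_ij = x_i
    A_eq[d + i, i::d] = 1.0                 # sum_i X_ij = y_i
b_eq = np.concatenate([x, y]).astype(float)

# The LP optimum is attained at an integral vertex, so it also equals the min over U(r,c).
res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print("d_W(x, y) =", res.fun)
```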

To wrap up [Venn diagram: D, n.d.k, p.d.k] d(x,y) = min_{τ ∈ T(x,y)} c(τ), δ(x,y) = soft-min_{τ ∈ T(x,y)} c(τ). e^{−δ(x,y)} = Σ_{τ ∈ T(x,y)} e^{−⟨τ,θ⟩} = G_{T(x,y)}(θ) is positive definite in many (all) cases. Kernel & RKHS Workshop 75

Open problems A unified framework? Convolution kernels (Haussler, 1998); mapping kernels (Shin & Kuboyama 2008) were an important addition; extension to countable mapping kernels (Shin 2011); extension to symmetric functions (not just exp) (Shin 2011). To speed up computations, is it possible to restrict the sum to a subset of T(x,y)? C. 2011 with DTW; C. submitted with transportation distances. Kernel & RKHS Workshop 76