Introduction to Machine Learning
Lecture 9
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Kernel Methods
Motivation
- Non-linear decision boundary.
- Efficient computation of inner products in high dimension.
- Flexible selection of more complex features.
This Lecture
- Definitions
- SVMs with kernels
- Closure properties
- Sequence Kernels
Non-Linear Separation
- Linear separation is impossible in most problems.
- Non-linear mapping from the input space to a high-dimensional feature space: Φ: X → F.
- Generalization ability: independent of dim(F), depends only on the margin ρ and the sample size m.
Kernel Methods
Idea: define K: X × X → R, called a kernel, such that:
Φ(x) · Φ(y) = K(x, y).
K is often interpreted as a similarity measure.
Benefits:
- Efficiency: K is often more efficient to compute than Φ and the dot product.
- Flexibility: K can be chosen arbitrarily so long as the existence of Φ is guaranteed (symmetry and positive definiteness condition).
PDS Condition
Definition: a kernel K: X × X → R is positive definite symmetric (PDS) if for any {x_1, ..., x_m} ⊆ X, the matrix K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semi-definite (SPSD).
K is SPSD if it is symmetric and one of the two equivalent conditions holds:
- its eigenvalues are non-negative;
- for any c ∈ R^{m×1}, c⊤Kc = Σ_{i,j=1}^m c_i c_j K(x_i, x_j) ≥ 0.
Terminology: PDS for kernels, SPSD for kernel matrices (see (Berg et al., 1984)).
Example - Polynomial Kernels
Definition: for all x, y ∈ R^N, K(x, y) = (x · y + c)^d, with c > 0.
Example: for N = 2 and d = 2,
K(x, y) = (x_1 y_1 + x_2 y_2 + c)²
        = (x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c) · (y_1², y_2², √2 y_1 y_2, √(2c) y_1, √(2c) y_2, c).
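As a quick numeric check of this expansion, a minimal Python sketch; the feature map phi below is just the six-dimensional mapping written above for N = 2, d = 2 (the function names are for illustration only):

import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # (x . y + c)^d
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    # explicit feature map for N = 2, d = 2, matching the expansion above
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))   # both equal 4.0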
XOR Problem
Use a second-degree polynomial kernel with c = 1: each point (x_1, x_2) maps to (x_1², x_2², √2 x_1 x_2, √2 x_1, √2 x_2, 1), so the four XOR points (±1, ±1) map to (1, 1, ±√2, ±√2, ±√2, 1).
(figure: the four points are not linearly separable in the (x_1, x_2) plane, but become linearly separable in the feature space by x_1 x_2 = 0.)
Other Standard PDS Kernels
Gaussian kernels: K(x, y) = exp(−‖x − y‖² / (2σ²)), σ ≠ 0.
Sigmoid kernels: K(x, y) = tanh(a(x · y) + b), a, b ≥ 0.
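A short sketch illustrating the PDS condition numerically for the Gaussian kernel: build the kernel matrix on a sample and check that its eigenvalues are non-negative up to round-off (the sample and σ here are arbitrary choices for illustration):

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))              # arbitrary sample {x_1, ..., x_m}
K = np.array([[gaussian_kernel(x, y) for y in X] for x in X])
assert np.allclose(K, K.T)                    # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-10   # eigenvalues non-negative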
This Lecture
- Definitions
- SVMs with kernels
- Closure properties
- Sequence Kernels
Reproducing Kernel Hilbert Space (Aronszajn, 1950)
Theorem: let K: X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping Φ from X to H such that
for all x, y ∈ X, K(x, y) = ⟨Φ(x), Φ(y)⟩.
Furthermore, the following reproducing property holds:
for all f ∈ H, x ∈ X, f(x) = ⟨f, Φ(x)⟩ = ⟨f, K(x, ·)⟩.
Notes:
- H is called the reproducing kernel Hilbert space (RKHS) associated to K.
- A Hilbert space such that there exists Φ: X → H with K(x, y) = ⟨Φ(x), Φ(y)⟩ for all x, y ∈ X is also called a feature space associated to K; Φ is called a feature mapping.
- Feature spaces associated to K are in general not unique.
Consequence: SVMs with PDS Kernels (Boser, Guyon, and Vapnik, 1992)
Constrained optimization:
max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j K(x_i, x_j)
subject to: 0 ≤ α_i ≤ C ∧ Σ_{i=1}^m α_i y_i = 0, for all i ∈ [1, m].
Solution:
h(x) = sgn(Σ_{i=1}^m α_i y_i K(x_i, x) + b),
with b = y_i − Σ_{j=1}^m α_j y_j K(x_j, x_i) for any x_i with 0 < α_i < C.
Here K(x_i, x_j) = Φ(x_i) · Φ(x_j) and K(x_i, x) = Φ(x_i) · Φ(x) replace the inner products of the linear SVM.
SVMs with PDS Kernels
Constrained optimization:
max_α 2 1⊤α − (α ∘ y)⊤K(α ∘ y)
subject to: 0 ≤ α ≤ C ∧ α⊤y = 0.
Solution:
h = sgn(Σ_{i=1}^m α_i y_i K(x_i, ·) + b),
with b = y_i − (α ∘ y)⊤K e_i for any x_i with 0 < α_i < C.
(∘ denotes the Hadamard product.)
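In practice, any PDS kernel can be plugged into an off-the-shelf SVM solver through its Gram matrix. A minimal sketch using scikit-learn's SVC with a precomputed kernel (the use of scikit-learn is an assumption of this sketch; the lecture itself is solver-agnostic):

import numpy as np
from sklearn.svm import SVC

def fit_kernel_svm(X, y, kernel, C=1.0):
    # Gram matrix K[i, j] = K(x_i, x_j) on the training sample
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K, y)
    return clf

# Prediction uses kernel values between test and training points:
# K_test[i, j] = kernel(x_test_i, x_train_j), then clf.predict(K_test).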
Generalization: Representer Theorem (Kimeldorf and Wahba, 1971)
Theorem: let K: X × X → R be a PDS kernel and H its corresponding RKHS. Then, for any non-decreasing function G: R → R and any loss function L: R^m → R ∪ {+∞}, the optimization problem
argmin_{h∈H} F(h) = argmin_{h∈H} G(‖h‖²_H) + L(h(x_1), ..., h(x_m))
admits a solution of the form h* = Σ_{i=1}^m α_i K(x_i, ·).
If G is further assumed to be increasing, then any solution has this form.
Proof: let H_1 = span({K(x_i, ·): i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h_1 + h⊥ according to H = H_1 ⊕ H_1⊥.
Since G is non-decreasing, G(‖h_1‖²) ≤ G(‖h_1‖² + ‖h⊥‖²) = G(‖h‖²).
By the reproducing property, for all i ∈ [1, m], h(x_i) = ⟨h, K(x_i, ·)⟩ = ⟨h_1, K(x_i, ·)⟩ = h_1(x_i).
Thus, L(h(x_1), ..., h(x_m)) = L(h_1(x_1), ..., h_1(x_m)) and F(h_1) ≤ F(h).
If G is increasing, then F(h_1) < F(h) when h⊥ ≠ 0, and any solution of the optimization problem must be in H_1.
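A concrete instance of the theorem (an illustration, not from the slide): kernel ridge regression takes G(‖h‖²) = λ‖h‖² and L the squared loss, and the coefficients α of the guaranteed solution h = Σ_i α_i K(x_i, ·) solve a linear system. A minimal sketch, with hypothetical helper names:

import numpy as np

def krr_fit(X, y, kernel, lam):
    # minimize lam * ||h||_H^2 + sum_i (h(x_i) - y_i)^2 over the RKHS H;
    # by the representer theorem h = sum_i alpha_i K(x_i, .), and
    # alpha solves (K + lam I) alpha = y.
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, kernel, x):
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))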
Kernel-Based Algorithms
PDS kernels are used to extend a variety of algorithms in classification and other areas:
- regression,
- ranking,
- dimensionality reduction,
- clustering.
But how do we define PDS kernels?
This Lecture
- Definitions
- SVMs with kernels
- Closure properties
- Sequence Kernels
Closure Properties of PDS Kernels
Theorem: positive definite symmetric (PDS) kernels are closed under:
- sum,
- product,
- tensor product,
- pointwise limit,
- composition with a power series.
Closure Properties - Proof
Proof:
- closure under sum: c⊤Kc ≥ 0 ∧ c⊤K′c ≥ 0 ⟹ c⊤(K + K′)c ≥ 0.
- closure under product: writing K = MM⊤,
Σ_{i,j=1}^m c_i c_j (K_{ij} K′_{ij}) = Σ_{i,j=1}^m c_i c_j [Σ_{k=1}^m M_{ik} M_{jk}] K′_{ij}
= Σ_{k=1}^m [Σ_{i,j=1}^m c_i c_j M_{ik} M_{jk} K′_{ij}] = Σ_{k=1}^m z_k⊤K′z_k ≥ 0,
with z_k = (c_1 M_{1k}, ..., c_m M_{mk})⊤.
Closure under tensor product:
- definition: for all x_1, x_2, y_1, y_2 ∈ X,
(K_1 ⊗ K_2)(x_1, y_1, x_2, y_2) = K_1(x_1, x_2) K_2(y_1, y_2).
- thus, a PDS kernel as the product of the two PDS kernels
(x_1, y_1, x_2, y_2) ↦ K_1(x_1, x_2) and (x_1, y_1, x_2, y_2) ↦ K_2(y_1, y_2).
Closure under pointwise limit: if for all x, y ∈ X,
lim_{n→∞} K_n(x, y) = K(x, y),
then (∀n, c⊤K_n c ≥ 0) ⟹ lim_{n→∞} c⊤K_n c = c⊤Kc ≥ 0.
Closure under composition with a power series:
- assumptions: K is a PDS kernel with |K(x, y)| < ρ for all x, y ∈ X, and f(x) = Σ_{n=0}^∞ a_n xⁿ, a_n ≥ 0, is a power series with radius of convergence ρ.
- f ∘ K is a PDS kernel since Kⁿ is PDS by closure under product, Σ_{n=0}^N a_n Kⁿ is PDS by closure under sum, and f ∘ K is PDS by closure under pointwise limit.
Example: for any PDS kernel K, exp(K) is PDS.
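These closure properties are easy to check numerically; a sketch on random SPSD matrices (the tolerance absorbs round-off):

import numpy as np

def min_eig(M):
    return np.linalg.eigvalsh(M).min()

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
B = rng.standard_normal((10, 10))
K1, K2 = A @ A.T, B @ B.T                # two SPSD kernel matrices

assert min_eig(K1 + K2) > -1e-9          # closure under sum
assert min_eig(K1 * K2) > -1e-9          # closure under (Hadamard) product
assert min_eig(np.exp(K1)) > -1e-9       # entrywise exp(K1): power series in K1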
This Lecture
- Definitions
- SVMs with kernels
- Closure properties
- Sequence Kernels
Sequence Kernels
Definition: kernels defined over pairs of strings.
Motivation: computational biology, text and speech classification.
Idea: two sequences are related when they share some common substrings or subsequences.
Example: the sum over common substrings of the product of their counts.
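A minimal sketch of that example kernel in plain Python (the naive substring enumeration is quadratic and only meant to make the definition concrete):

from collections import Counter

def substring_counts(s):
    # counts of all non-empty substrings of s
    counts = Counter()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            counts[s[i:j]] += 1
    return counts

def substring_kernel(s, t):
    # sum over common substrings u of count_s(u) * count_t(u)
    cs, ct = substring_counts(s), substring_counts(t)
    return sum(c * ct[u] for u, c in cs.items())

# e.g. substring_kernel("ab", "abb") == 4: common substrings a, b (twice), ab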
Weighted Transducers
(figure: a four-state weighted transducer with states 0, 1, 2, 3/0.1 and transitions such as a:b/0.1, a:b/0.5, b:a/0.2, b:a/0.3, a:a/0.4, b:a/0.6.)
T(x, y) = sum of the weights of all accepting paths with input x and output y.
Example: T(abb, baa) = .1 × .2 × .3 × .1 + .5 × .3 × .6 × .1.
Rational Kernels over Strings (Cortes et al., 2004)
Definition: a kernel K: Σ* × Σ* → R is rational if K = T for some weighted transducer T.
Definition: let T_1: Σ* × Δ* → R and T_2: Δ* × Ω* → R be two weighted transducers. Then, the composition of T_1 and T_2 is defined for all x ∈ Σ*, y ∈ Ω* by
(T_1 ∘ T_2)(x, y) = Σ_z T_1(x, z) T_2(z, y).
Definition: the inverse of a transducer T: Σ* × Δ* → R is the transducer T⁻¹: Δ* × Σ* → R obtained from T by swapping input and output labels.
Composition
Theorem: the composition of two weighted transducers is also a weighted transducer.
Proof: constructive proof based on the composition algorithm:
- states identified with pairs of states;
- ε-free case: transitions defined by
E = {((q_1, q′_1), a, c, w_1 ⊗ w_2, (q_2, q′_2)) : (q_1, a, b, w_1, q_2) ∈ E_1, (q′_1, b, c, w_2, q′_2) ∈ E_2};
- general case: use of an intermediate ε-filter.
Composition Algorithm - ε-free Case
(figure: two weighted transducers and their composition; composed states are pairs such as (0, 0), (1, 1), ..., (3, 3)/.42, and weights multiply along matched transitions.)
Complexity: O(|T_1| |T_2|) in general, linear in some cases.
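A runnable sketch of the ε-free transition rule above, in the probability semiring (the tuple format (src, in, out, weight, dst) and the function name compose are assumptions of this sketch; unreachable paired states are left unpruned for simplicity):

from collections import defaultdict

def compose(E1, E2):
    # E = {((q1, p1), a, c, w1*w2, (q2, p2)) :
    #      (q1, a, b, w1, q2) in E1, (p1, b, c, w2, p2) in E2}
    by_input = defaultdict(list)
    for (p1, b, c, w2, p2) in E2:
        by_input[b].append((p1, c, w2, p2))
    E = []
    for (q1, a, b, w1, q2) in E1:
        for (p1, c, w2, p2) in by_input[b]:
            E.append(((q1, p1), a, c, w1 * w2, (q2, p2)))
    return E

# e.g. compose([(0, "a", "b", 0.5, 1)], [(0, "b", "c", 0.5, 1)])
# == [((0, 0), "a", "c", 0.25, (1, 1))]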
Redundant ε-Paths Problem (MM et al., 1996)
(figure: composing T_1 and T_2 in the presence of ε-transitions creates redundant ε-paths; the ε's of T_1 and T_2 are marked ε_2 and ε_1 in T̃_1 and T̃_2, and a three-state filter transducer F with x:x self-loops selects a single ε-path per pair.)
Solution: T = T̃_1 ∘ F ∘ T̃_2.
PDS Rational Kernels - General Construction
Theorem: for any weighted transducer T: Σ* × Δ* → R, the function K = T ∘ T⁻¹ is a PDS rational kernel.
Proof: by definition, for all x, y ∈ Σ*,
K(x, y) = Σ_z T(x, z) T(y, z).
K is the pointwise limit of the kernels (K_n)_{n≥0} defined by
for all x, y ∈ Σ*, K_n(x, y) = Σ_{|z|≤n} T(x, z) T(y, z).
Each K_n is PDS since for any sample (x_1, ..., x_m), the kernel matrix is K_n = AA⊤ with A = (T(x_i, z_j))_{i∈[1,m], j∈[1,N]}, where z_1, ..., z_N are the strings of length at most n.
Counting Transducers
(figure: transducer T_X with self-loops b:ε/1, a:ε/1 at states 0 and 1/1 and a transition X:X/1 from 0 to 1; e.g. X = ab, Z = bbabaabba, with the two alignments εεabεεεεε and εεεεεabεε.)
X may be a string or an automaton representing a regular expression.
Count of X in Z: sum of the weights of the accepting paths of Z ∘ T_X.
Transducer Counting Bigrams
(figure: T_bigram, with self-loops b:ε/1, a:ε/1 at states 0 and 2/1 and transitions a:a/1, b:b/1 from 0 to 1 and from 1 to 2.)
Counts of the bigram ab in Z given by Z ∘ T_bigram ∘ ab.
Transducer Counting Gappy Bigrams
(figure: T_gappy bigram, identical to T_bigram except for self-loops b:ε/λ, a:ε/λ at state 1, which penalize gap symbols.)
Counts of the gappy bigram ab in Z given by Z ∘ T_gappy bigram ∘ ab, with gap penalty λ ∈ (0, 1).
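A direct computation of what Z ∘ T_gappy bigram ∘ ab evaluates to (a sketch with a hypothetical function name: each pair of positions i < j with z_i = a and z_j = b contributes λ^(j−i−1), one factor λ per symbol deleted on the middle self-loops):

def gappy_bigram_count(z, a, b, lam):
    # sum over occurrences of a...b in z of lam**(number of gap symbols)
    total = 0.0
    for i, zi in enumerate(z):
        if zi == a:
            for j in range(i + 1, len(z)):
                if z[j] == b:
                    total += lam ** (j - i - 1)
    return total

# e.g. gappy_bigram_count("abb", "a", "b", 0.5) == 1.5  (gaps of size 0 and 1)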
Kernels for Other Discrete Structures
Similarly, PDS kernels can be defined on other discrete structures: images, graphs, parse trees, automata, weighted automata.
References
- Nachman Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
- Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning, pages 43-54. MIT Press, Cambridge, MA, 1999.
- Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, Berlin-New York, 1984.
- Bernhard Boser, Isabelle M. Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of COLT 1992, pages 144-152, Pittsburgh, PA, 1992.
- Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and Algorithms. Journal of Machine Learning Research (JMLR), 5:1035-1062, 2004.
- Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.
- George Kimeldorf and Grace Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82-95, 1971.
References (continued)
- James Mercer. Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations. Proceedings of the Royal Society of London, Series A, 83(559):69-70, 1909.
- Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and Speech Processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended Finite State Models of Language, Budapest, Hungary, 1996.
- Fernando C. N. Pereira and Michael D. Riley. Speech Recognition by Composition of Weighted Finite Automata. In Finite-State Language Processing, pages 431-453. MIT Press, 1997.
- I. J. Schoenberg. Metric Spaces and Positive Definite Functions. Transactions of the American Mathematical Society, 44(3):522-536, 1938.
- Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
- Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
- Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
Appendix
Shortest-Distance Problem
Definition: for any regulated weighted transducer T, define the shortest distance from state q to the final states F as
d(q, F) = ⊕_{π∈P(q,F)} w[π].
Problem: compute d(q, F) for all states q ∈ Q.
Algorithms:
- generalization of Floyd-Warshall;
- single-source shortest-distance algorithm.
All-Pairs Shortest-Distance Algorithm (MM, 2002)
Assumption: closed semiring (not necessarily idempotent).
Idea: generalization of the Floyd-Warshall algorithm.
Properties:
- time complexity: Ω(|Q|³ (T_⊕ + T_⊗ + T_*));
- space complexity: Ω(|Q|²) with an in-place implementation.
Closed Semirings (Lehmann, 1977)
Definition: a semiring is closed if the closure x* is well defined for all elements and if associativity, commutativity, and distributivity apply to countable sums.
Examples:
- tropical semiring;
- probability semiring when including infinity or when restricted to well-defined closures.
Pseudocode
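A sketch of the generalized Floyd-Warshall iteration over a closed semiring (the names plus, times, and star stand in for ⊕, ⊗, and closure; this is an illustration under those assumptions, not necessarily the lecture's exact pseudocode):

def all_pairs_distance(W, plus, times, star):
    # Lehmann-style generalized Floyd-Warshall: W[i][j] is the edge weight
    # from i to j (the semiring zero if absent); returns all-pairs distances
    # d[i][j] = (+)-sum over paths of the (x)-product of edge weights.
    n = len(W)
    d = [row[:] for row in W]
    for k in range(n):
        skk = star(d[k][k])                               # (d_kk)*
        row_k = [times(skk, d[k][j]) for j in range(n)]
        col_k = [d[i][k] for i in range(n)]
        for i in range(n):
            for j in range(n):
                d[i][j] = plus(d[i][j], times(col_k[i], row_k[j]))
    return d

# Tropical semiring instance: plus=min, times=lambda a, b: a + b,
# star=lambda x: 0.0, with float("inf") as the semiring zero for missing edges.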
Single-Source Shortest-Distance Algorithm (MM, 2002)
Assumption: k-closed semiring:
∀x ∈ K, ⊕_{i=0}^{k+1} xⁱ = ⊕_{i=0}^{k} xⁱ.
Idea: generalization of relaxation, but must keep track of the total weight added to d[q] since the last time q was dequeued.
Properties: works with any queue discipline and any k-closed semiring. Classical algorithms are special instances.
Pseudocode

Generic-Single-Source-Shortest-Distance(G, s)
    for i ← 1 to |Q|
        do d[i] ← r[i] ← 0
    d[s] ← r[s] ← 1
    S ← {s}
    while S ≠ ∅
        do q ← head(S)
           Dequeue(S)
           r′ ← r[q]
           r[q] ← 0
           for each e ∈ E[q]
               do if d[n[e]] ≠ d[n[e]] ⊕ (r′ ⊗ w[e])
                   then d[n[e]] ← d[n[e]] ⊕ (r′ ⊗ w[e])
                        r[n[e]] ← r[n[e]] ⊕ (r′ ⊗ w[e])
                        if n[e] ∉ S
                            then Enqueue(S, n[e])
    d[s] ← 1
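A direct Python transcription of the pseudocode above (a sketch: the semiring is passed as (plus, times, zero, one) callables, and a FIFO queue discipline is chosen, which yields a Bellman-Ford-like instance):

from collections import deque

def generic_sssd(n, edges, s, plus, times, zero, one):
    # edges: list of (src, weight, dst); returns d[q] = distance from s to q
    out = [[] for _ in range(n)]
    for (q, w, nxt) in edges:
        out[q].append((w, nxt))
    d = [zero] * n
    r = [zero] * n                  # weight added to d[q] since last dequeue
    d[s] = r[s] = one
    queue, on_queue = deque([s]), {s}
    while queue:
        q = queue.popleft()
        on_queue.discard(q)
        r_q, r[q] = r[q], zero
        for (w, nxt) in out[q]:
            delta = times(r_q, w)
            if d[nxt] != plus(d[nxt], delta):
                d[nxt] = plus(d[nxt], delta)
                r[nxt] = plus(r[nxt], delta)
                if nxt not in on_queue:
                    queue.append(nxt)
                    on_queue.add(nxt)
    d[s] = one
    return d

# Tropical instance: generic_sssd(n, edges, s, min, lambda a, b: a + b,
#                                 float("inf"), 0.0)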
Notes
Complexity: depends on the queue discipline used:
O(|Q| + (T_⊕ + T_⊗ + C(A)) |E| max_{q∈Q} N(q) + (C(I) + C(E)) Σ_{q∈Q} N(q)).
- coincides with that of Dijkstra and Bellman-Ford for shortest-first and FIFO orders;
- linear, O(|Q| + (T_⊕ + T_⊗) |E|), for acyclic graphs using the topological order.
Approximation: ε-k-closed semirings, e.g., for graphs in the probability semiring.