Introduction to Machine Learning Lecture 9. Mehryar Mohri Courant Institute and Google Research
1 Introduction to Machine Learning Lecture 9 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
2 Kernel Methods
3 Motivation Non-linear decision boundary. Efficient computation of inner products in high dimension. Flexible selection of more complex features. page 3
4 This Lecture Definitions. SVMs with kernels. Closure properties. Sequence kernels. page 4
5 Non-Linear Separation Linear separation impossible in most problems. Non-linear mapping from input space to high-dimensional feature space: Φ: X → F. Generalization ability: independent of dim(F), depends only on the margin ρ and the sample size m. page 5
6 Kernel Methods Idea: define K: X × X → R, called a kernel, such that: Φ(x)·Φ(y) = K(x, y). K is often interpreted as a similarity measure. Benefits: Efficiency: K is often more efficient to compute than Φ and the dot product. Flexibility: K can be chosen arbitrarily so long as the existence of Φ is guaranteed (symmetry and positive-definiteness condition). page 6
7 PDS Condition Definition: a kernel K: X × X → R is positive definite symmetric (PDS) if for any {x_1, ..., x_m} ⊆ X, the matrix K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semi-definite (SPSD). K is SPSD if it is symmetric and satisfies one of the two equivalent conditions: its eigenvalues are non-negative; for any c ∈ R^{m×1}, cᵀKc = Σ_{i,j=1}^m c_i c_j K(x_i, x_j) ≥ 0. Terminology: PDS for kernels, SPSD for kernel matrices (see (Berg et al., 1984)). page 7
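The SPSD condition on a kernel matrix is easy to check numerically. A minimal sketch (not from the slides; the helper name `is_spsd` is illustrative), using the eigenvalue characterization above:

```python
import numpy as np

def is_spsd(K, tol=1e-10):
    """Check that a kernel matrix is symmetric positive semi-definite."""
    if not np.allclose(K, K.T):
        return False
    # Eigenvalues of a symmetric matrix are real; SPSD iff all are >= 0.
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

# Gram matrix of the linear kernel K(x, y) = x . y on three points:
# K = X X^T is SPSD by construction (c'Kc = ||X'c||^2 >= 0).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = X @ X.T
print(is_spsd(K))    # True
```

The same check applied to the Gram matrix of an arbitrary symmetric function, rather than a kernel, will in general fail.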
8 Example - Polynomial Kernels Definition: ∀x, y ∈ R^N, K(x, y) = (x·y + c)^d, c > 0. Example: for N = 2 and d = 2, K(x, y) = (x_1y_1 + x_2y_2 + c)² = (x_1², x_2², √2 x_1x_2, √(2c) x_1, √(2c) x_2, c) · (y_1², y_2², √2 y_1y_2, √(2c) y_1, √(2c) y_2, c). page 8
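The identity between the kernel and the explicit six-dimensional feature map can be verified directly. A small sketch (helper names `poly_kernel` and `phi` are illustrative, not from the slides):

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x . y + c)^d."""
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    """Explicit feature map for (x . y + c)^2 in dimension N = 2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# K(x, y) and phi(x) . phi(y) agree term by term.
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

Computing K directly costs O(N), while the explicit map lives in dimension O(N^d): this is the efficiency benefit from the earlier slide.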
9 XOR Problem Use the second-degree polynomial kernel with c = 1, mapping (x_1, x_2) to (x_1², x_2², √2 x_1x_2, √2 x_1, √2 x_2, 1): (1, 1) → (1, 1, +√2, +√2, +√2, 1); (−1, 1) → (1, 1, −√2, −√2, +√2, 1); (−1, −1) → (1, 1, +√2, −√2, −√2, 1); (1, −1) → (1, 1, −√2, +√2, −√2, 1). Linearly non-separable in the input space; linearly separable in feature space by the hyperplane x_1x_2 = 0. page 9
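The separating coordinate can be exhibited numerically: after the mapping, the single feature √2 x_1x_2 already has the sign of the XOR label. A sketch of the four points above (the `phi` helper is illustrative):

```python
import numpy as np

pts = np.array([(1, 1), (-1, 1), (-1, -1), (1, -1)], dtype=float)
labels = np.array([+1, -1, +1, -1])   # XOR labeling: sign of x1 * x2

def phi(p, c=1.0):
    """Feature map of the degree-2 polynomial kernel with c = 1."""
    x1, x2 = p
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, c])

# The third coordinate, sqrt(2) x1 x2, separates the two classes on its own:
feature = np.array([phi(p)[2] for p in pts])
assert np.all(np.sign(feature) == labels)
```

No linear function of (x_1, x_2) alone can achieve this, which is the point of the slide.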
10 Other Standard PDS Kernels Gaussian kernels: K(x, y) = exp(−‖x − y‖² / (2σ²)), σ ≠ 0. Sigmoid kernels: K(x, y) = tanh(a(x·y) + b), a, b ≥ 0. page 10
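A Gaussian Gram matrix can be computed in a vectorized way from pairwise squared distances; its diagonal is 1 and it is PSD, as the PDS property requires. A sketch (the function name `gaussian_gram` is illustrative):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x . y, computed for all pairs.
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_gram(X, sigma=1.0)
assert np.allclose(np.diag(K), 1.0)              # K(x, x) = 1
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)   # SPSD, as PDS requires
```

The sigmoid kernel, by contrast, is not PDS for all parameter choices, which is why the slide states the condition a, b ≥ 0.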
11 This Lecture Definitions. SVMs with kernels. Closure properties. Sequence kernels. page 11
12 Reproducing Kernel Hilbert Space (Aronszajn, 1950) Theorem: let K: X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping Φ from X to H such that: ∀x, y ∈ X, K(x, y) = Φ(x)·Φ(y). Furthermore, the following reproducing property holds: ∀f ∈ H, ∀x ∈ X, f(x) = ⟨f, Φ(x)⟩ = ⟨f, K(x, ·)⟩. page 12
13 Notes: H is called the reproducing kernel Hilbert space (RKHS) associated to K. A Hilbert space such that there exists Φ: X → H with K(x, y) = Φ(x)·Φ(y) for all x, y ∈ X is also called a feature space associated to K. Φ is called a feature mapping. Feature spaces associated to K are in general not unique. page 13
14 Consequence: SVMs with PDS Kernels (Boser, Guyon, and Vapnik, 1992) Constrained optimization: max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_iα_j y_iy_j K(x_i, x_j), subject to: 0 ≤ α_i ≤ C ∧ Σ_{i=1}^m α_iy_i = 0, i ∈ [1, m]. Solution: h(x) = sgn(Σ_{i=1}^m α_iy_i K(x_i, x) + b), with b = y_i − Σ_{j=1}^m α_jy_j K(x_j, x_i) for any x_i with 0 < α_i < C. The kernel values K(x_i, x_j), K(x_i, x), and K(x_j, x_i) replace the inner products Φ(x_i)·Φ(x_j), Φ(x_i)·Φ(x), and Φ(x_j)·Φ(x_i) of the linear case. page 14
15 SVMs with PDS Kernels Constrained optimization: max_α 2·1ᵀα − (α ∘ y)ᵀK(α ∘ y), subject to: 0 ≤ α ≤ C ∧ αᵀy = 0, where ∘ denotes the Hadamard product. Solution: h = sgn(Σ_{i=1}^m α_iy_i K(x_i, ·) + b), with b = y_i − (α ∘ y)ᵀKe_i for any x_i with 0 < α_i < C. page 15
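The vector form translates directly into a few lines of linear algebra. A sketch evaluating the dual objective and the resulting hypothesis (not a solver; the helper names and the toy data are illustrative):

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Vector-form SVM dual: 2 1'alpha - (alpha o y)' K (alpha o y),
    where o denotes the Hadamard (entrywise) product."""
    ay = alpha * y
    return 2.0 * np.sum(alpha) - ay @ K @ ay

def decision(alpha, y, b, K_train_test):
    """h(x) = sgn(sum_i alpha_i y_i K(x_i, x) + b), one column per test point."""
    return np.sign((alpha * y) @ K_train_test + b)

# Toy check with the linear kernel on two separable points.
X = np.array([[1.0], [-1.0]])
y = np.array([+1.0, -1.0])
K = X @ X.T
alpha = np.array([0.5, 0.5])      # feasible, and the dual optimum for C >= 0.5
print(dual_objective(alpha, y, K))          # 1.0
K_test = X @ np.array([[2.0], [-3.0]]).T    # kernel values, train x test
print(decision(alpha, y, 0.0, K_test))      # [ 1. -1.]
```

In practice a QP solver maximizes `dual_objective` over the feasible set; only the kernel matrix K enters, never Φ.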
16 Generalization: Representer Theorem (Kimeldorf and Wahba, 1971) Theorem: let K: X × X → R be a PDS kernel and H its corresponding RKHS. Then, for any non-decreasing function G: R → R and any L: R^m → R ∪ {+∞}, the optimization problem argmin_{h∈H} F(h) = argmin_{h∈H} G(‖h‖²_H) + L(h(x_1), ..., h(x_m)) admits a solution of the form h = Σ_{i=1}^m α_i K(x_i, ·). If G is further assumed to be increasing, then any solution has this form. page 16
17 Proof: let H_1 = span({K(x_i, ·): i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h_1 + h^⊥ according to H = H_1 ⊕ H_1^⊥. Since G is non-decreasing, G(‖h_1‖²) ≤ G(‖h_1‖² + ‖h^⊥‖²) = G(‖h‖²). By the reproducing property, for all i ∈ [1, m], h(x_i) = ⟨h, K(x_i, ·)⟩ = ⟨h_1, K(x_i, ·)⟩ = h_1(x_i). Thus, L(h(x_1), ..., h(x_m)) = L(h_1(x_1), ..., h_1(x_m)) and F(h_1) ≤ F(h). If G is increasing, then F(h_1) < F(h) when h^⊥ ≠ 0, and any solution of the optimization problem must be in H_1. page 17
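Kernel ridge regression is a simple instance of the theorem, with G(u) = λu and L the squared loss (not discussed on the slide; shown here as an illustrative sketch): the representer theorem guarantees h = Σ_i α_i K(x_i, ·), and the coefficients solve (K + λI)α = y.

```python
import numpy as np

def krr_fit(K, y, lam=0.1):
    """Kernel ridge regression: minimize lam ||h||^2_H + sum_i (h(x_i) - y_i)^2.
    By the representer theorem h = sum_i alpha_i K(x_i, .), and the
    optimal coefficients are alpha = (K + lam I)^{-1} y."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def krr_predict(alpha, K_train_test):
    """h evaluated via kernel values between training and test points."""
    return alpha @ K_train_test

# Fit sin(3x) on 20 points with a Gaussian kernel.
X = np.linspace(-1, 1, 20)[:, None]
y = np.sin(3 * X[:, 0])
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 0.5)
alpha = krr_fit(K, y, lam=1e-3)
# Fitted values track the targets closely on the training points.
assert np.max(np.abs(krr_predict(alpha, K) - y)) < 0.2
```

The optimization over the infinite-dimensional H reduces to an m-dimensional linear system, exactly as the theorem predicts.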
18 Kernel-Based Algorithms PDS kernels used to extend a variety of algorithms in classification and other areas: regression, ranking, dimensionality reduction, clustering. But how do we define PDS kernels? page 18
19 This Lecture Definitions. SVMs with kernels. Closure properties. Sequence kernels. page 19
20 Closure Properties of PDS Kernels Theorem: Positive definite symmetric (PDS) kernels are closed under: sum, product, tensor product, pointwise limit, composition with a power series. page 20
21 Closure Properties - Proof Proof: closure under sum: cᵀKc ≥ 0 ∧ cᵀK′c ≥ 0 ⇒ cᵀ(K + K′)c ≥ 0. Closure under product: with K = MMᵀ, Σ_{i,j=1}^m c_ic_j (K_ij K′_ij) = Σ_{i,j=1}^m c_ic_j [Σ_{k=1}^m M_ik M_jk] K′_ij = Σ_{k=1}^m [Σ_{i,j=1}^m (c_iM_ik)(c_jM_jk) K′_ij] = Σ_{k=1}^m z_kᵀK′z_k ≥ 0, with z_k = (c_1M_1k, ..., c_mM_mk)ᵀ. page 21
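The product step is the Schur product theorem in matrix form: the entrywise product of two SPSD matrices is SPSD. A quick numeric check of this claim on random SPSD matrices (the helper `random_spsd` is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spsd(m):
    """Random SPSD matrix: K = M M^T is SPSD by construction."""
    M = rng.standard_normal((m, m))
    return M @ M.T

K1, K2 = random_spsd(5), random_spsd(5)
hadamard = K1 * K2   # entrywise (Schur) product, as in the closure proof
# Closure under product: the entrywise product is again SPSD.
assert np.all(np.linalg.eigvalsh(hadamard) >= -1e-8)
```

Note that the closure is for the entrywise product of kernel values, not the matrix product K1 @ K2, which need not even be symmetric.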
22 Closure under tensor product: definition: for all x_1, x_2, y_1, y_2 ∈ X, (K_1 ⊗ K_2)(x_1, y_1, x_2, y_2) = K_1(x_1, x_2) K_2(y_1, y_2). Thus, it is a PDS kernel as the product of the two PDS kernels (x_1, y_1, x_2, y_2) ↦ K_1(x_1, x_2) and (x_1, y_1, x_2, y_2) ↦ K_2(y_1, y_2). Closure under pointwise limit: if for all x, y ∈ X, lim_{n→∞} K_n(x, y) = K(x, y), then (∀n, cᵀK_nc ≥ 0) ⇒ lim_{n→∞} cᵀK_nc = cᵀKc ≥ 0. page 22
23 Closure under composition with a power series: assumptions: K a PDS kernel with |K(x, y)| < ρ for all x, y ∈ X, and f(x) = Σ_{n=0}^∞ a_nx^n, a_n ≥ 0, a power series with radius of convergence ρ. Then f ∘ K is a PDS kernel: K^n is PDS by closure under product, Σ_{n=0}^N a_nK^n is PDS by closure under sum, and f ∘ K is PDS by closure under pointwise limit. Example: for any PDS kernel K, exp(K) is PDS. page 23
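The exp(K) example can be checked numerically: applying exp entrywise to an SPSD matrix (the power series Σ xⁿ/n! has all coefficients ≥ 0 and infinite radius of convergence) yields another SPSD matrix. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
K = M @ M.T                 # an SPSD kernel matrix
expK = np.exp(K)            # entrywise exp: power series with a_n >= 0
# f o K remains SPSD, as closure under composition with a power series states.
assert np.all(np.linalg.eigvalsh(expK) >= -1e-8)
```

This is exactly how the Gaussian kernel can be seen to be PDS: it is a normalized version of exp applied to the linear kernel.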
24 This Lecture Definitions. SVMs with kernels. Closure properties. Sequence kernels. page 24
25 Sequence Kernels Definition: Kernels defined over pairs of strings. Motivation: computational biology, text and speech classification. Idea: two sequences are related when they share some common substrings or subsequences. Example: sum of the product of the counts of common substrings. page 25
26 Weighted Transducers [figure: a weighted transducer over states 0-3, with transitions such as b:a/0.6, b:a/0.2, a:b/0.1, a:b/0.5, a:a/0.4, b:a/0.3 and final weight 0.1] T(x, y) = sum of the weights of all accepting paths with input x and output y. The figure's example computes T(abb, baa) from the paths with input abb and output baa. page 26
27 Rational Kernels over Strings (Cortes et al., 2004) Definition: a kernel K: Σ* × Σ* → R is rational if K = T for some weighted transducer T. Definition: let T_1: Σ* × Δ* → R and T_2: Δ* × Ω* → R be two weighted transducers. Then, the composition of T_1 and T_2 is defined for all x ∈ Σ*, y ∈ Ω* by (T_1 ∘ T_2)(x, y) = Σ_z T_1(x, z) T_2(z, y). Definition: the inverse of a transducer T: Σ* × Δ* → R is the transducer T⁻¹: Δ* × Σ* → R obtained from T by swapping input and output labels. page 27
28 Composition Theorem: the composition of two weighted transducers is also a weighted transducer. Proof: constructive proof based on the composition algorithm: states identified with pairs; ε-free case: transitions defined by E′ = {((q_1, q_1′), a, c, w_1 ⊗ w_2, (q_2, q_2′)) : (q_1, a, b, w_1, q_2) ∈ E_1, (q_1′, b, c, w_2, q_2′) ∈ E_2}; general case: use of an intermediate ε-filter. page 28
29 Composition Algorithm ε-free Case [figure: two weighted transducers and the result of their composition; each state of the result is a pair of states of the operands, e.g. (0, 1), (2, 1), (3, 2), and each transition weight is the product of the matched component weights] Complexity: O(|T_1||T_2|) in general, linear in some cases. page 29
30 Redundant ε-paths Problem (MM et al. 1996) 0 a:a 1 b:! 2 c:! 3 d:d 4 0 a:d 1!:e 2 d:a T 1 3 T 2 T 1 0 a:a 1 b:!2 2 c:!2 3 d:d 4 0 a:d 1!1: e 2 d:a 3 T2!:!1!:!1!:!1!:!1!:!1!2:!!2:!!2:!!2:!!"%!"& ).) '.' '.( #"!,!#"!#-,!#"!#- $"! #"&!"& (.' (.( *.'!"& *.(,!#"!!$ #"!,/"/-,!!"!!-,!!"!!-,!!"!!- $"! %"!,!#"!#-,!#"!#-,/"/- +.*!2:!1 x:x 0!1:!1 x:x!2:!2 x:x!1:!1 1!2:!2 F T = T 1 F T 2. 2 page 30
31 PDS Rational Kernels General Construction Theorem: for any weighted transducer T: Σ* × Δ* → R, the function K = T ∘ T⁻¹ is a PDS rational kernel. Proof: by definition, for all x, y ∈ Σ*, K(x, y) = Σ_z T(x, z) T(y, z). K is the pointwise limit of (K_n)_{n≥0} defined by: ∀x, y ∈ Σ*, K_n(x, y) = Σ_{|z|≤n} T(x, z) T(y, z). K_n is PDS since for any sample (x_1, ..., x_m), K_n = AAᵀ with A = (T(x_i, z_j))_{i∈[1,m], j∈[1,N]}. page 31
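For a finite set of strings, the construction K = T ∘ T⁻¹ can be simulated with a plain weight table and checked to be SPSD. A sketch with hypothetical transducer weights, purely illustrative:

```python
import numpy as np

# A toy transducer as a finite weight table T[x][z] (hypothetical values).
T = {
    "ab":  {"a": 1.0, "b": 2.0},
    "ba":  {"a": 2.0, "b": 1.0},
    "abb": {"a": 1.0, "b": 3.0, "bb": 1.0},
}

def K(x, y):
    """K(x, y) = sum_z T(x, z) * T(y, z), i.e. K = T o T^{-1}."""
    Tx, Ty = T[x], T[y]
    return sum(w * Ty[z] for z, w in Tx.items() if z in Ty)

xs = list(T)
G = np.array([[K(x, y) for y in xs] for x in xs])
assert np.allclose(G, G.T)                      # symmetric by construction
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)  # SPSD, as the theorem states
```

The Gram matrix is AAᵀ for the matrix A of values T(x_i, z_j), which is why positivity holds regardless of the weights chosen.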
32 Counting Transducers [figure: counting transducer T_Z with self-loops b:ε/1 and a:ε/1 on both states and a transition 0 —Z:Z/1→ 1/1] Example: Z = ab, X = bbabaabba; the two occurrences of Z in X correspond to the alignments εεabεεεεε and εεεεεabεε. Z may be a string or an automaton representing a regular expression. Counts of Z in X: sum of the weights of the accepting paths of X ∘ T_Z. page 32
33 Transducer Counting Bigrams [figure: bigram counting transducer T_bigram, with self-loops b:ε/1 and a:ε/1 on states 0 and 2 and transitions 0 —a:a/1, b:b/1→ 1 —a:a/1, b:b/1→ 2/1] The count of a bigram such as ab in Z is given by the weight Z ∘ T_bigram assigns to it. page 33
34 Transducer Counting Gappy Bigrams [figure: T_gappy bigram, identical to T_bigram except that the middle state carries self-loops b:ε/λ and a:ε/λ, allowing gaps at a per-symbol penalty λ] The count of a gappy bigram such as ab in Z is given by the weight Z ∘ T_gappy bigram assigns to it, with gap penalty λ ∈ (0, 1). page 34
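The weight computed by the gappy-bigram transducer on a string can be reproduced with a direct enumeration: each occurrence of the pair (a, b) with g symbols in between contributes λ^g. A sketch (the function name is illustrative, and this brute-force loop stands in for the transducer composition):

```python
def gappy_bigram_count(x, a, b, lam):
    """Weighted count of the gappy bigram (a, b) in x: an occurrence
    with g symbols between a and b contributes lam**g."""
    total = 0.0
    for i, ci in enumerate(x):
        if ci != a:
            continue
        for j in range(i + 1, len(x)):
            if x[j] == b:
                total += lam ** (j - i - 1)
    return total

# In "abab": contiguous ab at positions 0-1 and 2-3, plus a at 0 with b at 3
# (gap 2), so the weighted count is 1 + 1 + lam**2.
print(gappy_bigram_count("abab", "a", "b", 0.5))   # 2.25
```

At λ → 0 only contiguous occurrences survive (matching T_bigram of the previous slide), and at λ = 1 all gapped pairs count equally.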
35 Kernels for Other Discrete Structures Similarly, PDS kernels can be defined on other discrete structures: Images, graphs, parse trees, automata, weighted automata. page 35
36 References
N. Aronszajn. Theory of Reproducing Kernels. Trans. Amer. Math. Soc., 68, 1950.
Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, USA, 1999.
Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer-Verlag: Berlin-New York, 1984.
Bernhard Boser, Isabelle M. Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of COLT 1992, Pittsburgh, PA, 1992.
Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and Algorithms. Journal of Machine Learning Research (JMLR), 5, 2004.
Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.
G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Mathematical Analysis and Applications, 33, 1 (1971).
page 36 Courant Institute, NYU
37 References
James Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. In Proceedings of the Royal Society of London, Series A, Vol. 83, No. 559, 1909.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted automata in text and speech processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended Finite State Models of Language. Budapest, Hungary, 1996.
Fernando C. N. Pereira and Michael D. Riley. Speech recognition by composition of weighted finite automata. In Finite-State Language Processing. MIT Press, 1997.
I. J. Schoenberg. Metric spaces and positive definite functions. Transactions of the American Mathematical Society, Vol. 44, No. 3, 1938.
Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
page 37
38 Appendix
39 Shortest-Distance Problem Definition: for any regulated weighted transducer T, define the shortest distance from state q to the final states F as d(q, F) = ⊕_{π∈P(q,F)} w[π]. Problem: compute d(q, F) for all states q ∈ Q. Algorithms: Generalization of Floyd-Warshall. Single-source shortest-distance algorithm. page 39
40 All-Pairs Shortest-Distance Algorithm (MM, 2002) Assumption: closed semiring (not necessarily idempotent). Idea: generalization of the Floyd-Warshall algorithm. Properties: Time complexity: Ω(|Q|³ (T_⊕ + T_⊗ + T_⋆)). Space complexity: Ω(|Q|²) with an in-place implementation. page 40
41 Closed Semirings (Lehmann, 1977) Definition: a semiring is closed if the closure x⋆ is well defined for all elements and if associativity, commutativity, and distributivity apply to countable sums. Examples: Tropical semiring. Probability semiring when including infinity or when restricted to well-defined closures. page 41
42 Pseudocode page 42
43 Single-Source Shortest-Distance Algorithm (MM, 2002) Assumption: k-closed semiring: ∀x ∈ K, ⊕_{i=0}^{k+1} xⁱ = ⊕_{i=0}^{k} xⁱ. Idea: generalization of relaxation, but must keep track of the weight added to d[q] since the last time q was enqueued. Properties: works with any queue discipline and any k-closed semiring. Classical algorithms are special instances. page 43
44 Pseudocode
Generic-Single-Source-Shortest-Distance(G, s)
 1  for i ← 1 to |Q|
 2      do d[i] ← r[i] ← 0̄
 3  d[s] ← r[s] ← 1̄
 4  S ← {s}
 5  while S ≠ ∅
 6      do q ← head(S)
 7         Dequeue(S)
 8         r′ ← r[q]
 9         r[q] ← 0̄
10         for each e ∈ E[q]
11             do if d[n[e]] ≠ d[n[e]] ⊕ (r′ ⊗ w[e])
12                 then d[n[e]] ← d[n[e]] ⊕ (r′ ⊗ w[e])
13                      r[n[e]] ← r[n[e]] ⊕ (r′ ⊗ w[e])
14                      if n[e] ∉ S
15                          then Enqueue(S, n[e])
16  d[s] ← 1̄
page 44
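The pseudocode above can be instantiated directly, parameterized by the semiring operations. A sketch (variable names mirror the pseudocode; the edge representation is an assumption of this sketch), instantiated with the tropical semiring, where it reduces to a classical shortest-path computation:

```python
from collections import deque

def shortest_distance(Q, E, s, oplus, otimes, zero, one):
    """Generic single-source shortest-distance over a k-closed semiring.
    E maps each state q to a list of (weight, next_state) edges."""
    d = {q: zero for q in Q}   # tentative shortest distances
    r = {q: zero for q in Q}   # weight added to d[q] since q was last dequeued
    d[s] = r[s] = one
    S = deque([s])
    while S:
        q = S.popleft()
        rq, r[q] = r[q], zero
        for w, nxt in E.get(q, []):
            new = oplus(d[nxt], otimes(rq, w))
            if new != d[nxt]:
                d[nxt] = new
                r[nxt] = oplus(r[nxt], otimes(rq, w))
                if nxt not in S:
                    S.append(nxt)
    d[s] = one                 # line 16 of the pseudocode
    return d

# Tropical semiring (min, +, inf, 0): classical shortest paths, FIFO queue.
INF = float("inf")
E = {0: [(1.0, 1), (4.0, 2)], 1: [(2.0, 2)]}
d = shortest_distance([0, 1, 2], E, 0, min, lambda a, b: a + b, INF, 0.0)
print(d)   # {0: 0.0, 1: 1.0, 2: 3.0}
```

Swapping in (+, ×, 0, 1) would give the probability semiring, provided the graph satisfies the k-closedness assumption.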
45 Notes Complexity: depends on the queue discipline used: O(|Q| + (T_⊕ + T_⊗ + C(A)) Σ_{q∈Q} N(q) |E[q]| + (C(I) + C(E)) Σ_{q∈Q} N(q)). Coincides with that of Dijkstra and Bellman-Ford for shortest-first and FIFO orders. Linear, O(|Q| + (T_⊕ + T_⊗)|E|), for acyclic graphs using the topological order. Approximation: ε-k-closed semirings, e.g., for graphs in the probability semiring. page 45
More informationSample Selection Bias Correction
Sample Selection Bias Correction Afshin Rostamizadeh Joint work with: Corinna Cortes, Mehryar Mohri & Michael Riley Courant Institute & Google Research Motivation Critical Assumption: Samples for training
More informationSupport Vector Machines.
Support Vector Machines www.cs.wisc.edu/~dpage 1 Goals for the lecture you should understand the following concepts the margin slack variables the linear support vector machine nonlinear SVMs the kernel
More informationEE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
More informationA note on the generalization performance of kernel classifiers with margin. Theodoros Evgeniou and Massimiliano Pontil
MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 68 November 999 C.B.C.L
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationLearning Linearly Separable Languages
Learning Linearly Separable Languages Leonid Kontorovich 1, Corinna Cortes 2, and Mehryar Mohri 3,2 1 Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213. 2 Google Research, 1440 Broadway,
More informationML (cont.): SUPPORT VECTOR MACHINES
ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version
More informationSVM TRADE-OFF BETWEEN MAXIMIZE THE MARGIN AND MINIMIZE THE VARIABLES USED FOR REGRESSION
International Journal of Pure and Applied Mathematics Volume 87 No. 6 2013, 741-750 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu doi: http://dx.doi.org/10.12732/ijpam.v87i6.2
More informationKernel Method: Data Analysis with Positive Definite Kernels
Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University
More informationLearning with Imperfect Data
Mehryar Mohri Courant Institute and Google mohri@cims.nyu.edu Joint work with: Yishay Mansour (Tel-Aviv & Google) and Afshin Rostamizadeh (Courant Institute). Standard Learning Assumptions IID assumption.
More informationFoundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research
Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationThe Representor Theorem, Kernels, and Hilbert Spaces
The Representor Theorem, Kernels, and Hilbert Spaces We will now work with infinite dimensional feature vectors and parameter vectors. The space l is defined to be the set of sequences f 1, f, f 3,...
More informationLinear Classification and SVM. Dr. Xin Zhang
Linear Classification and SVM Dr. Xin Zhang Email: eexinzhang@scut.edu.cn What is linear classification? Classification is intrinsically non-linear It puts non-identical things in the same class, so a
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationKernel Methods in Machine Learning
Kernel Methods in Machine Learning Autumn 2015 Lecture 1: Introduction Juho Rousu ICS-E4030 Kernel Methods in Machine Learning 9. September, 2015 uho Rousu (ICS-E4030 Kernel Methods in Machine Learning)
More informationSupport Vector Machines: Kernels
Support Vector Machines: Kernels CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 14.1, 14.2, 14.4 Schoelkopf/Smola Chapter 7.4, 7.6, 7.8 Non-Linear Problems
More informationKaggle.
Administrivia Mini-project 2 due April 7, in class implement multi-class reductions, naive bayes, kernel perceptron, multi-class logistic regression and two layer neural networks training set: Project
More informationJeff Howbert Introduction to Machine Learning Winter
Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable
More informationStatistical Natural Language Processing
199 CHAPTER 4 Statistical Natural Language Processing 4.0 Introduction............................. 199 4.1 Preliminaries............................. 200 4.2 Algorithms.............................. 201
More information10/05/2016. Computational Methods for Data Analysis. Massimo Poesio SUPPORT VECTOR MACHINES. Support Vector Machines Linear classifiers
Computational Methods for Data Analysis Massimo Poesio SUPPORT VECTOR MACHINES Support Vector Machines Linear classifiers 1 Linear Classifiers denotes +1 denotes -1 w x + b>0 f(x,w,b) = sign(w x + b) How
More informationbelow, kernel PCA Eigenvectors, and linear combinations thereof. For the cases where the pre-image does exist, we can provide a means of constructing
Kernel PCA Pattern Reconstruction via Approximate Pre-Images Bernhard Scholkopf, Sebastian Mika, Alex Smola, Gunnar Ratsch, & Klaus-Robert Muller GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany fbs,
More informationBoosting Ensembles of Structured Prediction Rules
Boosting Ensembles of Structured Prediction Rules Corinna Cortes Google Research 76 Ninth Avenue New York, NY 10011 corinna@google.com Vitaly Kuznetsov Courant Institute 251 Mercer Street New York, NY
More informationAnalysis of Multiclass Support Vector Machines
Analysis of Multiclass Support Vector Machines Shigeo Abe Graduate School of Science and Technology Kobe University Kobe, Japan abe@eedept.kobe-u.ac.jp Abstract Since support vector machines for pattern
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationKernel Principal Component Analysis
Kernel Principal Component Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationA GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong
A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,
More information