Introduction to Machine Learning, Lecture 9. Mehryar Mohri, Courant Institute and Google Research. mohri@cims.nyu.edu

Kernel Methods

Motivation: non-linear decision boundary; efficient computation of inner products in high dimension; flexible selection of more complex features.

This Lecture: definitions; SVMs with kernels; closure properties; sequence kernels.

Non-Linear Separation. Linear separation is impossible in most problems. Non-linear mapping from the input space to a high-dimensional feature space: Φ: X → F. Generalization ability: independent of dim(F), depends only on the margin ρ and the sample size m.

Kernel Methods. Idea: define K: X × X → R, called a kernel, such that Φ(x) · Φ(y) = K(x, y). K is often interpreted as a similarity measure. Benefits: Efficiency: K is often more efficient to compute than Φ and the dot product. Flexibility: K can be chosen arbitrarily so long as the existence of Φ is guaranteed (symmetry and positive definiteness condition).

PDS Condition. Definition: a kernel K: X × X → R is positive definite symmetric (PDS) if for any {x_1, ..., x_m} ⊆ X, the matrix K = [K(x_i, x_j)]_{ij} ∈ R^{m×m} is symmetric positive semi-definite (SPSD). K is SPSD if it is symmetric and one of the two equivalent conditions holds: its eigenvalues are non-negative; for any c ∈ R^{m×1}, c⊤Kc = Σ_{i,j=1}^m c_i c_j K(x_i, x_j) ≥ 0. Terminology: PDS for kernels, SPSD for kernel matrices (see (Berg et al., 1984)).
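As a small numerical illustration (a sketch assuming numpy is available; the random sample and the linear kernel K(x, y) = x · y are illustrative choices), both equivalent SPSD conditions can be checked directly on a kernel matrix:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                     # sample {x_1, ..., x_m}, m = 5 points in R^3
K = X @ X.T                                     # kernel matrix [K(x_i, x_j)]_ij for the linear kernel

# Condition 1: all eigenvalues of the symmetric matrix K are non-negative.
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # True, up to numerical tolerance

# Condition 2: c^T K c >= 0 for any c in R^m.
c = rng.normal(size=5)
print(c @ K @ c >= -1e-10)                      # True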

Example: Polynomial Kernels. Definition: for all x, y ∈ R^N, K(x, y) = (x · y + c)^d, c > 0. Example: for N = 2 and d = 2,
K(x, y) = (x_1 y_1 + x_2 y_2 + c)² = (x_1², x_2², √2 x_1 x_2, √(2c) x_1, √(2c) x_2, c) · (y_1², y_2², √2 y_1 y_2, √(2c) y_1, √(2c) y_2, c).
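A quick check (a sketch assuming numpy) that this degree-2 polynomial kernel coincides with an explicit inner product in the 6-dimensional feature space written above; the test points are arbitrary:

import numpy as np

def phi(x, c):
    # explicit feature map for N = 2, d = 2
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x, y, c = np.array([1.0, 2.0]), np.array([-0.5, 3.0]), 1.0
print((x @ y + c) ** 2)        # kernel value: 42.25
print(phi(x, c) @ phi(y, c))   # same value via the explicit feature map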

XOR Problem. Use the second-degree polynomial kernel with c = 1. [Figure: the four XOR points (1, 1), (−1, 1), (−1, −1), (1, −1) are linearly non-separable in the input space (x_1, x_2); their images under the corresponding feature map, (1, 1, ±√2, ±√2, ±√2, 1), are linearly separable in the feature space by the hyperplane corresponding to x_1 x_2 = 0.]

Other Standard PDS Kernels. Gaussian kernels: K(x, y) = exp(−‖x − y‖² / (2σ²)), σ ≠ 0. Sigmoid kernels: K(x, y) = tanh(a(x · y) + b), a, b ≥ 0.

This Lecture: definitions; SVMs with kernels; closure properties; sequence kernels.

Reproducing Kernel Hilbert Space (Aronszajn, 1950). Theorem: let K: X × X → R be a PDS kernel. Then, there exists a Hilbert space H and a mapping Φ from X to H such that for all x, y ∈ X, K(x, y) = Φ(x) · Φ(y). Furthermore, the following reproducing property holds: for all f ∈ H and x ∈ X, f(x) = ⟨f, Φ(x)⟩ = ⟨f, K(x, ·)⟩.

Notes: H is called the reproducing kernel Hilbert space (RKHS) associated to K. A Hilbert space H such that there exists Φ: X → H with K(x, y) = Φ(x) · Φ(y) for all x, y ∈ X is also called a feature space associated to K, and Φ is called a feature mapping. Feature spaces associated to K are in general not unique.

Consequence: SVMs with PDS Kernels (Boser, Guyon, and Vapnik, 1992). Constrained optimization:
max_α Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j K(x_i, x_j)
subject to: 0 ≤ α_i ≤ C and Σ_{i=1}^m α_i y_i = 0, i ∈ [1, m].
Solution: h(x) = sgn(Σ_{i=1}^m α_i y_i K(x_i, x) + b), with b = y_i − Σ_{j=1}^m α_j y_j K(x_j, x_i) for any x_i with 0 < α_i < C.
Each kernel value replaces the corresponding inner product of the linear case: K(x_i, x_j) = Φ(x_i) · Φ(x_j), K(x_i, x) = Φ(x_i) · Φ(x), K(x_j, x_i) = Φ(x_j) · Φ(x_i).

SVMs with PDS Kernels (vector form). Constrained optimization:
max_α 2 · 1⊤α − (α ∘ y)⊤ K (α ∘ y)
subject to: 0 ≤ α ≤ C and α⊤y = 0, where ∘ denotes the Hadamard product.
Solution: h = sgn(Σ_{i=1}^m α_i y_i K(x_i, ·) + b), with b = y_i − (α ∘ y)⊤ K e_i for any x_i with 0 < α_i < C.
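To make the two formulations above concrete, here is a minimal sketch that trains an SVM from a precomputed polynomial kernel matrix, assuming numpy and scikit-learn are installed; the XOR-like data and the precomputed-kernel interface are illustrative choices, not the lecture's reference implementation.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])              # XOR-like labels, not linearly separable

def poly_kernel(A, B, c=1.0, d=2):
    return (A @ B.T + c) ** d               # (x . y + c)^d for all pairs of rows

K_train = poly_kernel(X, X)                 # m x m kernel matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y)

X_test = rng.normal(size=(5, 2))
K_test = poly_kernel(X_test, X)             # rows: test points, columns: training points
print(clf.predict(K_test))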

Generalization: Representer Theorem (Kimeldorf and Wahba, 1971). Theorem: let K: X × X → R be a PDS kernel and H its corresponding RKHS. Then, for any non-decreasing function G: R → R and any L: R^m → R ∪ {+∞}, the optimization problem
argmin_{h ∈ H} F(h) = argmin_{h ∈ H} G(‖h‖²_H) + L(h(x_1), ..., h(x_m))
admits a solution of the form h* = Σ_{i=1}^m α_i K(x_i, ·). If G is further assumed to be increasing, then any solution has this form.

Proof: let H_1 = span({K(x_i, ·): i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h_1 + h^⊥ according to H = H_1 ⊕ H_1^⊥. Since G is non-decreasing, G(‖h_1‖²) ≤ G(‖h_1‖² + ‖h^⊥‖²) = G(‖h‖²). By the reproducing property, for all i ∈ [1, m], h(x_i) = ⟨h, K(x_i, ·)⟩ = ⟨h_1, K(x_i, ·)⟩ = h_1(x_i). Thus, L(h(x_1), ..., h(x_m)) = L(h_1(x_1), ..., h_1(x_m)) and F(h_1) ≤ F(h). If G is increasing, then F(h_1) < F(h) whenever h^⊥ ≠ 0, and any solution of the optimization problem must be in H_1.
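To see the representer theorem at work on a specific objective, the following sketch (assuming numpy; the Gaussian kernel, squared loss, and regularizer G(‖h‖²) = λ‖h‖² are illustrative choices) solves kernel ridge regression, whose minimizer has exactly the form h = Σ_i α_i K(x_i, ·) with α = (K + λI)⁻¹ y in closed form.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.exp(-d2 / (2 * sigma**2))

lam = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)       # expansion coefficients

def h(x_new):
    # h(x) = sum_i alpha_i K(x_i, x), the representer form
    return gaussian_kernel(np.atleast_2d(x_new), X) @ alpha

print(h(np.array([[0.5]])), np.sin(0.5))                   # fitted value vs. noiseless target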

Kernel-Based Algorithms. PDS kernels are used to extend a variety of algorithms in classification and other areas: regression, ranking, dimensionality reduction, clustering. But how do we define PDS kernels?

This Lecture: definitions; SVMs with kernels; closure properties; sequence kernels.

Closure Properties of PDS Kernels. Theorem: positive definite symmetric (PDS) kernels are closed under: sum, product, tensor product, pointwise limit, composition with a power series.

Closure Properties: Proof. Closure under sum: c⊤Kc ≥ 0 ∧ c⊤K′c ≥ 0 ⇒ c⊤(K + K′)c ≥ 0. Closure under product: with K = MM⊤,
Σ_{i,j=1}^m c_i c_j (K_{ij} K′_{ij}) = Σ_{i,j=1}^m c_i c_j (Σ_{k=1}^m M_{ik} M_{jk}) K′_{ij} = Σ_{k=1}^m Σ_{i,j=1}^m c_i c_j M_{ik} M_{jk} K′_{ij} = Σ_{k=1}^m z_k⊤ K′ z_k ≥ 0,
with z_k = (c_1 M_{1k}, ..., c_m M_{mk})⊤.

Closure under tensor product: definition: for all x_1, x_2, y_1, y_2 ∈ X, (K_1 ⊗ K_2)(x_1, y_1, x_2, y_2) = K_1(x_1, x_2) K_2(y_1, y_2). Thus, K_1 ⊗ K_2 is a PDS kernel as the product of the two PDS kernels (x_1, y_1, x_2, y_2) ↦ K_1(x_1, x_2) and (x_1, y_1, x_2, y_2) ↦ K_2(y_1, y_2). Closure under pointwise limit: if for all x, y ∈ X, lim_{n→∞} K_n(x, y) = K(x, y), then (∀n, c⊤K_n c ≥ 0) ⇒ lim_{n→∞} c⊤K_n c = c⊤Kc ≥ 0.

Closure under composition with a power series: assumptions: K is a PDS kernel with |K(x, y)| < ρ for all x, y ∈ X, and f(x) = Σ_{n=0}^∞ a_n x^n, a_n ≥ 0, is a power series with radius of convergence ρ. Then f ∘ K is a PDS kernel: K^n is PDS by closure under product, Σ_{n=0}^N a_n K^n is PDS by closure under sum, and f ∘ K is PDS by closure under pointwise limit. Example: for any PDS kernel K, exp(K) is PDS.
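A quick numerical sanity check of three of these closure properties at the level of kernel matrices (a sketch assuming numpy; the sample and kernels are illustrative): the sum, the Hadamard product, and the elementwise exponential of PDS kernel matrices all remain positive semi-definite.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K1 = X @ X.T                        # linear kernel matrix (PDS)
K2 = (X @ X.T + 1.0) ** 2           # degree-2 polynomial kernel matrix (PDS)

def is_psd(M, tol=1e-8):
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

print(is_psd(K1 + K2))              # closure under sum
print(is_psd(K1 * K2))              # closure under (Hadamard) product
print(is_psd(np.exp(K1)))           # closure under composition with exp's power series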

This Lecture: definitions; SVMs with kernels; closure properties; sequence kernels.

Sequence Kernels. Definition: kernels defined over pairs of strings. Motivation: computational biology, text and speech classification. Idea: two sequences are related when they share some common substrings or subsequences. Example: sum of the product of the counts of common substrings.
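A minimal plain-Python sketch of that last example: a sequence kernel given by the sum, over common substrings (capped here at an illustrative maximum length), of the products of their occurrence counts.

from collections import Counter

def substring_counts(s, max_len=3):
    # multiset of all substrings of s of length 1..max_len
    return Counter(s[i:i + n] for n in range(1, max_len + 1)
                   for i in range(len(s) - n + 1))

def substring_kernel(x, y, max_len=3):
    cx, cy = substring_counts(x, max_len), substring_counts(y, max_len)
    return sum(cx[u] * cy[u] for u in cx.keys() & cy.keys())

print(substring_kernel("abab", "ab"))   # common substrings a, b, ab: 2*1 + 2*1 + 2*1 = 6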

Weighted Transducers. [Figure: a weighted transducer with states 0, 1, 2 and final state 3/0.1, and transitions labeled input:output/weight such as a:b/0.1, b:a/0.2, b:a/0.6, a:b/0.5, a:a/0.4, b:a/0.3.] T(x, y) = sum of the weights of all accepting paths with input x and output y. Example: T(abb, baa) = 0.1 · 0.2 · 0.3 · 0.1 + 0.5 · 0.3 · 0.6 · 0.1 = 0.0096.

Rational Kernels over Strings (Cortes et al., 2004). Definition: a kernel K: Σ* × Σ* → R is rational if K = T for some weighted transducer T. Definition: let T_1: Σ* × Δ* → R and T_2: Δ* × Ω* → R be two weighted transducers. Then, the composition of T_1 and T_2 is defined for all x ∈ Σ*, y ∈ Ω* by (T_1 ∘ T_2)(x, y) = Σ_z T_1(x, z) T_2(z, y). Definition: the inverse of a transducer T: Σ* × Δ* → R is the transducer T⁻¹: Δ* × Σ* → R obtained from T by swapping input and output labels.
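To make the composition formula concrete, a small sketch that applies it to transducers given extensionally as finite weighted relations (plain Python dicts mapping (input, output) pairs to weights); production implementations such as OpenFst operate on the automata themselves, so this only illustrates the definition.

def compose(T1, T2):
    # (T1 o T2)(x, y) = sum over intermediate strings z of T1(x, z) * T2(z, y)
    result = {}
    for (x, z1), w1 in T1.items():
        for (z2, y), w2 in T2.items():
            if z1 == z2:
                result[(x, y)] = result.get((x, y), 0.0) + w1 * w2
    return result

T1 = {("ab", "A"): 0.5, ("ab", "B"): 0.25, ("ba", "A"): 1.0}
T2 = {("A", "x"): 0.2, ("B", "x"): 0.4}
print(compose(T1, T2))   # {('ab', 'x'): 0.2, ('ba', 'x'): 0.2}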

Composition. Theorem: the composition of two weighted transducers is also a weighted transducer. Proof: constructive proof based on the composition algorithm; states are identified with pairs of states. ε-free case: transitions defined by
E = ⋃_{(q_1, a, b, w_1, q_2) ∈ E_1, (q_1′, b, c, w_2, q_2′) ∈ E_2} {((q_1, q_1′), a, c, w_1 ⊗ w_2, (q_2, q_2′))}.
General case: use of an intermediate ε-filter.

Composition Algorithm: ε-free Case. [Figure: two weighted transducers and their composition; each state of the result is a pair (q_1, q_2) of states of the operands, and each transition pairs a transition of the first transducer with a matching transition of the second, e.g. a:b/0.1 composed with b:a/0.2 yields a:a/0.02.] Complexity: O(|T_1| |T_2|) in general, linear in some cases.
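A sketch of the ε-free pairing rule above, with each transducer represented as a plain list of (source, input, output, weight, destination) transitions; start states, final weights, and the ε-filter of the general case are deliberately omitted.

def compose_arcs(E1, E2):
    # pair (q1, a, b, w1, q2) in E1 with (p1, b, c, w2, p2) in E2 on the shared label b
    E = []
    for (q1, a, b, w1, q2) in E1:
        for (p1, b2, c, w2, p2) in E2:
            if b == b2:
                E.append(((q1, p1), a, c, w1 * w2, (q2, p2)))
    return E

E1 = [(0, "a", "b", 0.1, 1), (1, "b", "b", 0.3, 2)]
E2 = [(0, "b", "a", 0.2, 1), (1, "b", "c", 0.5, 2)]
print(compose_arcs(E1, E2))   # e.g. ((0, 0), 'a', 'a', 0.02, (1, 1)), ...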

Redundant ε-paths Problem (MM et al. 1996) 0 a:a 1 b:! 2 c:! 3 d:d 4 0 a:d 1!:e 2 d:a T 1 3 T 2 T 1 0 a:a 1 b:!2 2 c:!2 3 d:d 4 0 a:d 1!1: e 2 d:a 3 T2!:!1!:!1!:!1!:!1!:!1!2:!!2:!!2:!!2:!!"%!"& ).) '.' '.( #"!,!#"!#-,!#"!#- $"! #"&!"& (.' (.( *.'!"& *.(,!#"!!$ #"!,/"/-,!!"!!-,!!"!!-,!!"!!- $"! %"!,!#"!#-,!#"!#-,/"/- +.*!2:!1 x:x 0!1:!1 x:x!2:!2 x:x!1:!1 1!2:!2 F T = T 1 F T 2. 2 page 30

PDS Rational Kernels: General Construction. Theorem: for any weighted transducer T: Σ* × Σ* → R, the function K = T ∘ T⁻¹ is a PDS rational kernel. Proof: by definition, for all x, y ∈ Σ*, K(x, y) = Σ_z T(x, z) T(y, z). K is the pointwise limit of (K_n)_{n≥0} defined by K_n(x, y) = Σ_{|z| ≤ n} T(x, z) T(y, z) for all x, y ∈ Σ*. Each K_n is PDS since, for any sample (x_1, ..., x_m), K_n = AA⊤ with A = (T(x_i, z_j))_{i ∈ [1, m], j ∈ [1, N]}.
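The last step of the proof can be mirrored numerically (a sketch assuming numpy): if a matrix A holds arbitrary weights T(x_i, z_j) over a finite set of output strings, the induced kernel matrix A Aᵀ is positive semi-definite by construction.

import numpy as np

A = np.array([[0.1, 0.0, 0.4],       # T(x_1, z_1), T(x_1, z_2), T(x_1, z_3)
              [0.2, 0.3, 0.0],       # T(x_2, .)
              [0.0, 0.5, 0.6]])      # T(x_3, .)
K = A @ A.T                          # K(x_i, x_j) = sum_z T(x_i, z) T(x_j, z)
print(np.linalg.eigvalsh(K) >= -1e-12)   # all True: K is positive semi-definite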

Counting Transducers. [Figure: counting transducer T_X with self-loops a:ε/1 and b:ε/1 at both states and a transition X:X/1 from state 0 to final state 1/1; example: X = ab, Z = bbabaabba, with the two matching alignments εεabεεεεε and εεεεεabεε.] X may be a string or an automaton representing a regular expression. Counts of X in Z: sum of the weights of the accepting paths of Z ∘ T_X.

Transducer Counting Bigrams. [Figure: transducer T_bigram with self-loops a:ε/1 and b:ε/1 at states 0 and 2, and transitions a:a/1, b:b/1 from 0 to 1 and from 1 to the final state 2/1.] Counts of the bigram ab in Z given by Z ∘ T_bigram ∘ ab.

Transducer Counting Gappy Bigrams. [Figure: transducer T_gappy bigram, identical to T_bigram except that the self-loops at the middle state carry weight λ: a:ε/λ, b:ε/λ.] Counts of the gappy bigram ab in Z given by Z ∘ T_gappy bigram ∘ ab, with gap penalty λ ∈ (0, 1).
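A direct, non-transducer sketch of the quantity this transducer computes for the fixed pattern ab: each occurrence of a followed later by b contributes λ raised to the number of symbols skipped between them.

def gappy_bigram_count(z, a, b, lam):
    total = 0.0
    for i, ci in enumerate(z):
        if ci != a:
            continue
        for j in range(i + 1, len(z)):
            if z[j] == b:
                total += lam ** (j - i - 1)    # gap of length j - i - 1
    return total

print(gappy_bigram_count("aab", "a", "b", 0.5))   # 0.5**1 + 0.5**0 = 1.5
print(gappy_bigram_count("ab", "a", "b", 0.5))    # 1.0, no gap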

Kernels for Other Discrete Structures. Similarly, PDS kernels can be defined on other discrete structures: images, graphs, parse trees, automata, weighted automata.

References
N. Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In Advances in Kernel Methods: Support Vector Learning, pages 43-54. MIT Press, Cambridge, MA, USA, 1999.
Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on Semigroups. Springer-Verlag: Berlin-New York, 1984.
Bernhard Boser, Isabelle M. Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of COLT 1992, pages 144-152, Pittsburgh, PA, 1992.
Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and Algorithms. Journal of Machine Learning Research (JMLR), 5:1035-1062, 2004.
Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.
G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82-95, 1971.

References (continued)
James Mercer. Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations. In Proceedings of the Royal Society of London, Series A, Vol. 83, No. 559, pp. 69-70, 1909.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and Speech Processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended Finite State Models of Language, Budapest, Hungary, 1996.
Fernando C. N. Pereira and Michael D. Riley. Speech Recognition by Composition of Weighted Finite Automata. In Finite-State Language Processing, pages 431-453. MIT Press, 1997.
I. J. Schoenberg. Metric Spaces and Positive Definite Functions. Transactions of the American Mathematical Society, Vol. 44, No. 3, pp. 522-536, 1938.
Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.
Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Appendix

Shortest-Distance Problem. Definition: for any regulated weighted transducer T, define the shortest distance from state q to the set of final states F as
d(q, F) = ⊕_{π ∈ P(q, F)} w[π].
Problem: compute d(q, F) for all states q ∈ Q. Algorithms: generalization of Floyd-Warshall; single-source shortest-distance algorithm.

All-Pairs Shortest-Distance Algorithm (MM, 2002). Assumption: closed semiring (not necessarily idempotent). Idea: generalization of the Floyd-Warshall algorithm. Properties: time complexity Ω(|Q|³ (T_⊕ + T_⊗ + T_*)); space complexity Ω(|Q|²) with an in-place implementation.

Closed Semirings (Lehmann, 1977). Definition: a semiring is closed if the closure is well defined for all elements and if associativity, commutativity, and distributivity apply to countable sums. Examples: the tropical semiring; the probability semiring when including infinity or when restricted to well-defined closures.

Pseudocode

Single-Source Shortest-Distance Algorithm (MM, 2002). Assumption: k-closed semiring: ∀x ∈ K, ⊕_{i=0}^{k+1} x^i = ⊕_{i=0}^{k} x^i. Idea: generalization of relaxation, but must keep track of the weight added to d[q] since the last time q was extracted from the queue. Properties: works with any queue discipline and any k-closed semiring; classical algorithms are special instances.

Pseudocode
Generic-Single-Source-Shortest-Distance(G, s)
  for i ← 1 to |Q|
    do d[i] ← r[i] ← 0
  d[s] ← r[s] ← 1
  S ← {s}
  while S ≠ ∅
    do q ← head(S)
       Dequeue(S)
       r′ ← r[q]
       r[q] ← 0
       for each e ∈ E[q]
         do if d[n[e]] ≠ d[n[e]] ⊕ (r′ ⊗ w[e])
              then d[n[e]] ← d[n[e]] ⊕ (r′ ⊗ w[e])
                   r[n[e]] ← r[n[e]] ⊕ (r′ ⊗ w[e])
                   if n[e] ∉ S
                      then Enqueue(S, n[e])
  d[s] ← 1
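A runnable sketch of the pseudocode above in plain Python; the adjacency-list graph encoding, the FIFO queue discipline, and the probability-semiring instantiation (⊕ = +, ⊗ = ×, 0 = 0.0, 1 = 1.0) are illustrative choices, not part of the algorithm's statement.

from collections import deque

def shortest_distance(graph, source, zero=0.0, one=1.0,
                      plus=lambda a, b: a + b, times=lambda a, b: a * b):
    # graph: dict mapping each state to a list of (next_state, weight) transitions
    d = {q: zero for q in graph}     # tentative shortest distances
    r = {q: zero for q in graph}     # weight added to d[q] since q was last dequeued
    d[source] = r[source] = one
    queue, in_queue = deque([source]), {source}
    while queue:
        q = queue.popleft()
        in_queue.discard(q)
        r_q, r[q] = r[q], zero       # freeze and reset the accumulated weight
        for nxt, w in graph[q]:
            update = times(r_q, w)
            if d[nxt] != plus(d[nxt], update):
                d[nxt] = plus(d[nxt], update)
                r[nxt] = plus(r[nxt], update)
                if nxt not in in_queue:
                    queue.append(nxt)
                    in_queue.add(nxt)
    d[source] = one
    return d

g = {0: [(1, 0.5), (2, 0.25)], 1: [(2, 0.5)], 2: []}
print(shortest_distance(g, 0))       # {0: 1.0, 1: 0.5, 2: 0.5}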

Notes. Complexity: depends on the queue discipline used:
O(|Q| + (T_⊕ + T_⊗ + C(A)) |E| max_{q ∈ Q} N(q) + (C(I) + C(E)) Σ_{q ∈ Q} N(q)).
It coincides with that of Dijkstra and Bellman-Ford for the shortest-first and FIFO queue disciplines, and is linear, O(|Q| + (T_⊕ + T_⊗) |E|), for acyclic graphs using the topological order. Approximation: ε-k-closed semirings, e.g., for graphs in the probability semiring.