Kernel Methods
Quang Nguyen
University of Pittsburgh, CS 3750, Fall 2011

Outline
- Motivation
- Examples
- Kernels: definitions, kernel trick, basic properties, Mercer condition
- Constructing feature spaces: Hilbert space, reproducing kernel Hilbert space (RKHS)
- Constructing kernels
- Representer theorem
- Kernel examples
- Choosing feature space and kernel
- Summary

Motivation
- Machine learning theory and algorithms are well developed for the linear case:
  - Support Vector Machines (SVM)
  - Ridge regression
  - PCA
  - and more
- Real-world data analysis problems are often nonlinear.
- Solution?

Motivation (cont'd)
- Idea: map the data from the original input space into a (usually high-dimensional) feature space where linear relations exist among the data, and apply a linear algorithm in that space.

Motivation (cont'd)
- Challenge: computation in a high-dimensional space is difficult.
- Key idea: if we choose the mapping wisely, we can do the computation in the feature space implicitly while keeping all work in the original input space.

Learning
- Given: input/output sets X, Y and training examples (x_1, y_1), ..., (x_m, y_m) in X x Y.
- Goal: generalization on unseen data. Given a new input x in X, find the corresponding y such that (x, y) is "similar" to (x_1, y_1), ..., (x_m, y_m).
- Similarity measure:
  - for outputs: a loss function (e.g. for Y = {1, -1}, the zero-one loss)
  - for inputs: a kernel

Kernels
- A kernel is a function k: X x X -> R, (x, x') -> k(x, x').
- A kernel is symmetric: k(x, x') = k(x', x).
- A kernel can be constructed by defining a mapping Φ: X -> H from the input space X to a feature space H such that k(x, x') = <Φ(x), Φ(x')>.
- Why do we want this?
  - It allows us to apply many ML algorithms in dot product (feature) spaces.
  - It gives us freedom to choose Φ, so we can design a large variety of models to solve a given problem.

Kernel Trick
- We map patterns from the input space X into a high-dimensional feature space H and compare them using the dot product.
- Choose the mapping Φ such that the dot product can be evaluated directly through a nonlinear function k in X.
- This avoids any computation in H.

Kernel Example
- Assume x = [x_1, x_2]^T and a feature mapping Φ that maps the input to a quadratic feature set:
  Φ(x) = [x_1^2, x_2^2, sqrt(2) x_1 x_2, sqrt(2) x_1, sqrt(2) x_2, 1]^T
- Kernel function for this feature space:
  k(x, z) = Φ(x)^T Φ(z)
          = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1
          = (x_1 z_1 + x_2 z_2 + 1)^2
          = (1 + x^T z)^2
- Computation in the higher-dimensional space is performed implicitly in the original input space.

Kernel Example: Support Vector Machines
- The solution of the dual problem gives us w = Σ_{i=1}^n α_i y_i x_i.
- The decision boundary: w^T x + w_0 = Σ_{i in SV} α_i y_i (x_i^T x) + w_0
- The decision: y = sign( Σ_{i in SV} α_i y_i (x_i^T x) + w_0 )
- Mapping to a feature space, the decision becomes:
  y = sign( Σ_{i in SV} α_i y_i <Φ(x), Φ(x_i)> + w_0 ),
  where <Φ(x), Φ(x_i)> is exactly the kernel k(x, x_i).
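A minimal sketch (assuming NumPy) of the quadratic kernel example above: the 6-dimensional explicit feature map and the implicit kernel computation give the same value.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-d input [x1, x2]."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def k_quad(x, z):
    """Same quantity computed implicitly in the input space (kernel trick)."""
    return (1.0 + np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(np.dot(phi(x), phi(z)))   # explicit: dot product of 6-d feature vectors
print(k_quad(x, z))             # implicit: same value, computed in 2-d
```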

SVM with Gaussian Kernel
- k(x, x') = exp(-||x - x'||^2 / (2σ^2))

Mercer's Condition
- Question: how do we know whether a prospective kernel k is valid, i.e. is a dot product in some feature space?
- Mercer's condition (Vapnik 1995): there exists a mapping Φ and an expansion k(x, x') = Σ_i Φ(x)_i Φ(x')_i if and only if, for any g(x) such that ∫ g(x)^2 dx is finite,
  ∫∫ k(x, x') g(x) g(x') dx dx' >= 0.
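A minimal sketch (assuming scikit-learn is installed, which the slides do not mention) of an SVM with the Gaussian kernel on a toy non-linearly separable dataset; scikit-learn's gamma corresponds to 1/(2σ^2) in the formula above.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # curved class boundary
clf = SVC(kernel="rbf", gamma=1.0, C=1.0)                    # Gaussian-kernel SVM
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy; a linear SVM would do much worse here
```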

Positive Definite Kernels
- It can be shown that the admissible class of kernels coincides with the class of positive definite (pd) kernels.
- Definition: k: X x X -> R is called a pd kernel if
  - k is symmetric: k(x, x') = k(x', x), and
  - for all x_1, ..., x_m in X and c_1, ..., c_m in R: Σ_{i,j} c_i c_j K_ij >= 0, where K_ij = k(x_i, x_j).
- K is called the Gram matrix or kernel matrix.

Basic properties of pd kernels
1. Kernels from feature maps: if Φ maps X into a dot product space H, then k(x, x') = <Φ(x), Φ(x')> is a pd kernel on X x X.
2. Positivity on the diagonal: k(x, x) >= 0.
3. Cauchy-Schwarz inequality: k(x, x')^2 <= k(x, x) k(x', x').
4. Vanishing diagonals: if k(x, x) = 0 for all x, then k(x, x') = 0 for all x, x'.
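A minimal sketch (assuming NumPy) of the definition: build the Gram matrix of the Gaussian kernel on a few random points and check numerically that it is positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))             # 10 points in R^3
sigma = 1.0

# K_ij = k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

eigvals = np.linalg.eigvalsh(K)          # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)           # True: all eigenvalues are (numerically) non-negative
```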

From Kernels to Feature Spaces
- Question: given a pd kernel on the input space, how can we construct a feature space such that the kernel computes the dot product in that space?
- I.e., how do we construct a mapping Φ and a space H, Φ: X -> H, such that k(x, x') = <Φ(x), Φ(x')>?

Constructing the Feature Space
1. Define a feature map Φ: X -> R^X, x -> k(., x), where k(., x) denotes the function that assigns the value k(x', x) to each x' (e.g. a bump centered at x for the Gaussian kernel).
2. Turn it into a linear space by adding linear combinations:
   f(.) = Σ_{i=1}^m α_i k(., x_i), with m in N, α_i in R, x_i in X,
   g(.) = Σ_{j=1}^{m'} β_j k(., x'_j).

Constructing the Feature Space (cont'd)
3. Add a dot product to the space:
   <f, g> = Σ_{i=1}^m Σ_{j=1}^{m'} α_i β_j k(x_i, x'_j)
          = Σ_{j=1}^{m'} β_j f(x'_j)   (independent of the expansion of f)
          = Σ_{i=1}^m α_i g(x_i)       (independent of the expansion of g)
   In particular, <k(., x), f> = Σ_i α_i k(x_i, x) = f(x).
   Since we defined Φ(x) = k(., x), this gives <k(., x), k(., x')> = k(x, x') = <Φ(x), Φ(x')>.
   k is therefore called a reproducing kernel. Hofmann et al. (2008) prove that <., .> is in fact a dot product and that k is a pd kernel (symmetric and positive definite by definition).
4. Complete the space with respect to the induced norm to get a reproducing kernel Hilbert space (RKHS).

Hilbert Spaces
- A Hilbert space is a complete vector space with a dot product and the norm it induces.
- Definition (dot product on a vector space V): a real function <x, y>: V x V -> R such that for all x, y, z in V and c in R:
  - <x, y> = <y, x>
  - <cx, y> = c <x, y>
  - <x + y, z> = <x, z> + <y, z>
  - <x, x> > 0 when x != 0
- Complete space: every Cauchy sequence {x_n} in the space converges, i.e. for all ε > 0 there exists N such that ||x_n - x_m|| < ε for all n, m > N.
- Completeness implies (by the Riesz representation theorem) that for every x in X there is a unique function k(., x) such that f(x) = <f, k(., x)> for all f in H.
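A minimal sketch (assuming NumPy; the Expansion class and its names are mine, not from the slides) of the construction above: functions are finite expansions f = Σ_i α_i k(., x_i), the dot product is <f, g> = Σ_{i,j} α_i β_j k(x_i, x'_j), and taking the dot product with k(., x) reproduces the value f(x).

```python
import numpy as np

k = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)   # Gaussian kernel on R^d

class Expansion:
    """f(.) = sum_i alpha_i k(., x_i), an element of the pre-RKHS."""
    def __init__(self, alphas, points):
        self.alphas, self.points = alphas, points
    def __call__(self, x):
        return sum(a * k(xi, x) for a, xi in zip(self.alphas, self.points))
    def dot(self, other):
        # <f, g> = sum_i sum_j alpha_i beta_j k(x_i, x'_j)
        return sum(a * b * k(xi, zj)
                   for a, xi in zip(self.alphas, self.points)
                   for b, zj in zip(other.alphas, other.points))

rng = np.random.default_rng(3)
f = Expansion([0.5, -1.0, 2.0], list(rng.normal(size=(3, 2))))
x = rng.normal(size=2)
k_x = Expansion([1.0], [x])            # Phi(x) = k(., x) as an element of the space
print(np.isclose(k_x.dot(f), f(x)))    # True: the reproducing property <k(., x), f> = f(x)
```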

Constructing Kernels
Let k_1, k_2 be valid (symmetric, positive definite) kernels. The following are also valid kernels:
1. k(u, v) = α k_1(u, v) + β k_2(u, v), for α, β >= 0
2. k(u, v) = k_1(u, v) k_2(u, v)
3. k(u, v) = k_1(f(u), f(v)), where f: X -> X
4. k(u, v) = g(u) g(v), for g: X -> R
5. k(u, v) = f(k_1(u, v)), where f is a polynomial with positive coefficients
6. k(u, v) = exp(k_1(u, v))
7. k(u, v) = exp(-||u - v||^2 / (2σ^2))

Representer Theorem
Denote by Ω: [0, ∞) -> R a strictly monotonically increasing function, by X a set, and by c: (X x R^2)^n -> R ∪ {∞} an arbitrary cost function. Then each minimizer f in H of the regularized risk functional
  c((x_1, y_1, f(x_1)), ..., (x_n, y_n, f(x_n))) + Ω(||f||_H)
admits a representation of the form
  f(x) = Σ_{i=1}^n α_i k(x_i, x).
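A minimal sketch (assuming NumPy) illustrating closure rules 1 and 2: non-negative combinations and elementwise products of Gram matrices built from pd kernels remain positive semidefinite, so the combined functions are again valid kernels.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))

def gram(kernel, X):
    """Gram matrix K_ij = kernel(x_i, x_j)."""
    return np.array([[kernel(x, z) for z in X] for x in X])

k1 = lambda x, z: (1.0 + x @ z) ** 2                       # polynomial kernel
k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2) / 2.0)      # Gaussian kernel

K_sum = 0.5 * gram(k1, X) + 2.0 * gram(k2, X)   # rule 1: alpha, beta >= 0
K_prod = gram(k1, X) * gram(k2, X)              # rule 2: elementwise (Schur) product

for K in (K_sum, K_prod):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True for both combined kernels
```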

Representer Theorem (cont'd)
- Significance: although the optimization problem appears to live in a (possibly infinite-dimensional) space H, the solution lies in the span of the n particular kernel functions centered on the n training points.
- Note that we only need to solve for the coefficients α_i, i = 1, ..., n.

Examples: Kernels on vectors
- Polynomial: k(x, x') = (c + <x, x'>)^p, with p in N, c >= 0
- Gaussian: k(x, x') = exp(-||x - x'||^2 / (2σ^2))
- Radial basis: k(x, x') = exp(-(1/2) ||x - x'||^2)
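The representer theorem in action: a minimal sketch (assuming NumPy; kernel ridge regression is a standard algorithm used here for illustration, not taken from the slides) where the solution is exactly a finite expansion f(x) = Σ_i α_i k(x_i, x), so only the n coefficients α need to be found.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """Gram matrix of the Gaussian kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)        # noisy nonlinear target

lam = 0.1
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # solve only for n coefficients

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
f_test = gaussian_kernel(X_test, X) @ alpha            # f(x) = sum_i alpha_i k(x, x_i)
print(f_test)
```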

Example: String Kernel
- We want to compare two strings, e.g. measure how similar two strings are.
- Given an index sequence i = (i_1, ..., i_|u|) with 1 <= i_1 < ... < i_|u| <= |s|, define the subsequence u of string s picked out by i: u = s(i) = s(i_1) ... s(i_|u|).
- l(i) = i_|u| - i_1 + 1 is the length spanned by the subsequence in s.
- Feature map: [Φ(s)]_u = Σ_{i: s(i) = u} λ^{l(i)}, where 0 < λ < 1 is a decay parameter.
- Example: subsequence u = "asd", strings s_1 = "Nasdaq", s_2 = "lass das":
  [Φ(s_1)]_u = λ^3, [Φ(s_2)]_u = 2 λ^5
- Kernel: k(s, t) = Σ_u [Φ(s)]_u [Φ(t)]_u = Σ_u Σ_{i: s(i) = u} Σ_{j: t(j) = u} λ^{l(i)} λ^{l(j)}
- Applications: document analysis, spam filtering, annotation of DNA sequences, etc.

Examples: Kernels on other structures
- Tree kernels
- Graph kernels
- Kernels on sets and subspaces
- and more
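A minimal sketch (pure Python) of a single coordinate of the string-kernel feature map above: [Φ(s)]_u sums λ^{l(i)} over all index sequences i with s(i) = u. The full kernel would additionally sum this product over all subsequences u; this sketch only reproduces the worked example.

```python
from itertools import combinations

def phi_u(s, u, lam):
    """[Phi(s)]_u = sum over increasing index sequences i with s(i) = u of lam^(span)."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):   # all increasing index sequences
        if all(s[i] == c for i, c in zip(idx, u)):    # s(i) spells out u
            span = idx[-1] - idx[0] + 1               # l(i) = i_|u| - i_1 + 1
            total += lam ** span
    return total

lam = 0.5
print(phi_u("Nasdaq", "asd", lam))    # lambda^3   = 0.125
print(phi_u("lass das", "asd", lam))  # 2*lambda^5 = 0.0625
```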

How to choose the best feature space
- The problem of choosing the optimal feature space for a given problem is non-trivial.
- Since we only know the feature space through the kernel that we use, the problem reduces to choosing the best kernel for learning.
- The performance of a kernel algorithm is highly dependent on the kernel.
- The best kernel depends on the specific problem.

Choosing the best kernel
- We can use prior knowledge about the problem, e.g. the expected shape of the solution, to significantly improve performance.
- If the kernel is not appropriate for the problem, one needs to tune the kernel; performance is problem-dependent, so no universally good choice exists.
- Bad news: we need prior knowledge.
- Good news: some knowledge can be extracted from the data.

Approaches to choosing the best kernel
- Approach 1: tune the hyper-parameters of a given family of kernel functions.
  - E.g. with the Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2σ^2)), set the kernel width σ.
  - However, for non-vectorial data (e.g. strings), this approach does not work well with the popular kernels; people have devised their own kernels for problems such as protein classification.

Approaches to choosing the best kernel (cont'd)
- Approach 2: learn the kernel matrix directly from the data.
  - This approach is more promising.
  - Goals of this approach:
    - not being restricted to one family of kernels that may not be appropriate for the given problem
    - stop hand-tuning hyper-parameters and instead learn the kernel matrix along with the parameter settings
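A minimal sketch (assuming scikit-learn, which the slides do not mention) of Approach 1 above: cross-validated grid search over the Gaussian kernel width (gamma, roughly 1/(2σ^2)) and the SVM regularization parameter C.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
param_grid = {"gamma": [0.1, 0.5, 1.0, 5.0], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)               # selected width and C
```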

Learning the kernel matrix
- Problems with learning the kernel matrix:
  - It is not clear what the best criterion to optimize is.
  - The optimization is difficult to solve.
  - The choice of the class of kernel matrices to be considered is important.
- Implementation issue: it may not be possible to store the entire kernel matrix for large data sets.

Summary
- Kernels make it possible to look for linear relations in high-dimensional spaces at low computational cost.
- Inner products of the inputs in the feature space can be calculated in the original space.
- Kernels can be applied to non-vectorial data: strings, trees, graphs, etc.
- Finding the best feature space and kernel is non-trivial.

References
- Hofmann, Schölkopf, Smola. Kernel methods in machine learning. Annals of Statistics, 36(3):1171-1220, 2008.
- Schölkopf, Smola. A short introduction to kernel methods. Advanced Lectures on Machine Learning, LNAI 2600, pp. 41-64, 2003.
- Hal Daumé III. From zero to reproducing kernel Hilbert spaces in twelve pages or less. Unpublished.
- Vapnik, V. The Nature of Statistical Learning Theory. Springer, New York, 1995.
- Bartlett, P. Lectures from a statistical learning theory course, Spring 2008.
- Some slides/pictures from Milos Hauskrecht and David Krebs.