Kernels: A Machine Learning Overview

S.V.N. Vishy Vishwanathan
vishy@axiom.anu.edu.au
National ICT Australia and the Australian National University

Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter Bartlett

Overview

(Really Really) Quick Review of Basics
Functional Analysis Viewpoint of RKHS: Evaluation Functional, Kernels and RKHS, Mercer's Theorem
Properties of Kernels: Positive Semi-Definiteness, Constructing Kernels
RKHS Regularization: Norm in an RKHS, Representer Theorem, Fourier Perspective

Machine Learning

Data: Pairs of observations $(x_i, y_i)$ drawn from an underlying distribution $P(x, y)$.
Examples: (blood status, cancer), (transactions, fraud).
Task: Find a function $f(x)$ which predicts $y$ given $x$. The function $f(x)$ must generalize well.

What Are Kernels?

Problem: Given pairs of observations $(x_i, y_i)$, we want to decide the label of a new point $x$.
Intuition: If $k(x, x_i)$ is a measure of the influence of $x_i$ on $x$, then
  $y(x) = \sum_i k(x, x_i)\, y_i$
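
A minimal numerical sketch of this intuition (the Gaussian kernel, bandwidth and toy data below are illustrative choices, not from the slides):

```python
import numpy as np

def gaussian_kernel(x, xi, sigma=1.0):
    # k(x, x_i) = exp(-||x - x_i||^2 / sigma^2), used as the "measure of influence"
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)

def predict(x, X_train, y_train, sigma=1.0):
    # y(x) = sum_i k(x, x_i) y_i, the kernel-weighted vote from the slide
    return sum(gaussian_kernel(x, xi, sigma) * yi for xi, yi in zip(X_train, y_train))

X_train = np.array([[0.0], [1.0], [3.0], [4.0]])   # toy 1-D inputs
y_train = np.array([-1.0, -1.0, 1.0, 1.0])         # labels in {-1, +1}
print(np.sign(predict(np.array([0.5]), X_train, y_train)))  # -1.0: close to the negative points
print(np.sign(predict(np.array([3.5]), X_train, y_train)))  #  1.0: close to the positive points
```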

Linear Functions?

General Form: We typically use functions of the form
  $f(x) = \Lambda(\langle \phi(x), \theta \rangle - g(\theta))$
That is, we map the data to $\phi(x)$ and then apply $\Lambda$ to obtain the output.
Special Cases: For linear regression $\Lambda$ is the identity; for classification use $\Lambda = \mathrm{sign}$; for density estimation use $\Lambda = \exp$.
The RKHS Connection: We need a way to tie this to kernels. The RKHS setting is suited for this purpose: we implicitly map the data to a high-dimensional space.

Vector Spaces

Vector Space: A set $\mathcal{X}$ such that for all $x, y \in \mathcal{X}$ and $\alpha \in \mathbb{R}$ we have
  $x + y \in \mathcal{X}$ (addition)
  $\alpha x \in \mathcal{X}$ (scalar multiplication)
Examples: The rational numbers $\mathbb{Q}$ over the rational field; the real numbers $\mathbb{R}$; likewise $\mathbb{R}^n$.
Counterexamples: The functions $f : [0, 1] \to [0, 1]$ do not form a vector space (sums and scalings can leave the range). $\mathbb{Z}$ is not a vector space over the real field. The alphabet $\{a, \ldots, z\}$ is not a vector space (how do you define the $+$ and $\cdot$ operators?).

Banach Spaces

Normed Space: A pair $(\mathcal{X}, \|\cdot\|)$, where $\mathcal{X}$ is a vector space and $\|\cdot\| : \mathcal{X} \to \mathbb{R}_0^+$, is a normed space if for all $x, y \in \mathcal{X}$ and all $\alpha \in \mathbb{R}$ it satisfies
  $\|x\| = 0$ if and only if $x = 0$
  $\|\alpha x\| = |\alpha| \, \|x\|$ (scaling)
  $\|x + y\| \le \|x\| + \|y\|$ (triangle inequality)
A norm not satisfying the first condition is called a pseudo-norm.
Norm and Metric: A norm induces a metric via $d(x, y) := \|x - y\|$.
Banach Space: A vector space $\mathcal{X}$ together with a norm, complete in the metric defined by the norm.

Hilbert Spaces

Inner Product Space: A pair $(\mathcal{X}, \langle \cdot, \cdot \rangle_\mathcal{X})$, where $\mathcal{X}$ is a vector space and $\langle \cdot, \cdot \rangle : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, is an inner product space if for all $x, y, z \in \mathcal{X}$ and all $\alpha \in \mathbb{R}$ it satisfies
  $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ (additivity)
  $\langle \alpha x, y \rangle = \alpha \langle x, y \rangle$ (linearity)
  $\langle x, y \rangle = \langle y, x \rangle$ (symmetry)
  $\langle x, x \rangle \ge 0$, and $\langle x, y \rangle = 0$ for all $y$ implies $x = 0$
Dot Product and Norm: A dot product induces a norm via $\|x\| := \sqrt{\langle x, x \rangle}$.
Hilbert Space: A vector space $\mathcal{X}$ endowed with a dot product $\langle \cdot, \cdot \rangle$, complete in the metric induced by the dot product.

Hilbert Spaces: Examples

Euclidean Spaces: Take $\mathbb{R}^m$ endowed with the dot product $\langle x, y \rangle := \sum_{i=1}^m x_i y_i$.
$\ell_2$ Spaces: Infinite sequences of real numbers, with the dot product $\langle x, y \rangle := \sum_{i=1}^\infty x_i y_i$.
Function Spaces $L_2(\mathcal{X})$: A function is square integrable if $\int f(x)^2 \, dx < \infty$. For square integrable functions $f, g : \mathcal{X} \to \mathbb{R}$ define $\langle f, g \rangle := \int_\mathcal{X} f(x) g(x) \, dx$.
Polarization Identity: To recover the dot product from the norm compute $\|x + y\|^2 - \|x\|^2 - \|y\|^2 = 2 \langle x, y \rangle$.
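
A quick numerical sanity check of the polarization identity in the Euclidean case (a sketch; the vectors are arbitrary random draws):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

# ||x + y||^2 - ||x||^2 - ||y||^2 should equal 2 <x, y>
lhs = np.linalg.norm(x + y) ** 2 - np.linalg.norm(x) ** 2 - np.linalg.norm(y) ** 2
rhs = 2 * np.dot(x, y)
print(np.isclose(lhs, rhs))  # True
```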

Positive Matrices

Positive Definite Matrix: A matrix $M \in \mathbb{R}^{m \times m}$ for which for all $x \in \mathbb{R}^m$ we have $x^\top M x > 0$ if $x \neq 0$.
Such a matrix has only positive eigenvalues, since for every eigenvector $x$ we have $x^\top M x = \lambda x^\top x = \lambda \|x\|^2 > 0$ and thus $\lambda > 0$.
Induced Norms and Metrics: Every positive definite matrix induces a norm via $\|x\|_M^2 := x^\top M x$. The triangle inequality can be seen by writing
  $\|x + x'\|_M^2 = (x + x')^\top M^{\frac{1}{2}} M^{\frac{1}{2}} (x + x') = \|M^{\frac{1}{2}}(x + x')\|^2$
and using the triangle inequality for $M^{\frac{1}{2}} x$ and $M^{\frac{1}{2}} x'$.
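
A small numerical check of the induced norm (a sketch; the matrix $M$ below is built to be positive definite, which is an assumed construction, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
M = A @ A.T + np.eye(4)                     # positive definite by construction
print(np.all(np.linalg.eigvalsh(M) > 0))    # True: only positive eigenvalues

def norm_M(x):
    # ||x||_M = sqrt(x^T M x), the norm induced by the positive definite matrix M
    return np.sqrt(x @ M @ x)

x, y = rng.normal(size=4), rng.normal(size=4)
print(norm_M(x + y) <= norm_M(x) + norm_M(y))  # True: triangle inequality
```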

Our Setting

Notation: Let $\mathcal{X}$ be a learning domain and $\mathbb{R}^\mathcal{X} := \{f : \mathcal{X} \to \mathbb{R}\}$. The hypothesis set is $\mathcal{H} \subseteq \mathbb{R}^\mathcal{X}$.
What We Want: There are many nasty functions in $\mathbb{R}^\mathcal{X}$, so we restrict our attention to nice hypothesis sets: we want to learn a function which is smooth.
Restriction: We look at functions of the form
  $\mathcal{H}_0 = \{f(x) = \sum_{i \in I} \alpha_i k(x, x_i),\ x_i \in \mathcal{X},\ \alpha_i \in \mathbb{R}\}$
where $I$ is an index set and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel function.

Making Dot Products

Definition: Let $g(x) = \sum_{j \in J} \beta_j k(x, x_j)$; then
  $\langle f, g \rangle_{\mathcal{H}_0} := \sum_{i, j} \alpha_i \beta_j k(x_i, x_j)$
Properties: If $k$ is symmetric then $\langle f, g \rangle_{\mathcal{H}_0} = \langle g, f \rangle_{\mathcal{H}_0}$. If $k$ is positive semi-definite then $\langle f, f \rangle_{\mathcal{H}_0} \ge 0$.
Completion: $(\mathcal{H}_0, \langle \cdot, \cdot \rangle_{\mathcal{H}_0})$ defines a dot-product space. To obtain a Hilbert space $\mathcal{H}$, complete $\mathcal{H}_0$; $\langle \cdot, \cdot \rangle_{\mathcal{H}_0}$ extends naturally to $\langle \cdot, \cdot \rangle_\mathcal{H}$.
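
A minimal sketch of this dot product for two finite expansions (the Gaussian kernel and the coefficients are illustrative assumptions):

```python
import numpy as np

def k(x, y, sigma=1.0):
    # Gaussian kernel, chosen here only for illustration
    return np.exp(-(x - y) ** 2 / sigma ** 2)

# f(x) = sum_i alpha_i k(x, xf_i) and g(x) = sum_j beta_j k(x, xg_j)
xf, alpha = np.array([0.0, 1.0, 2.0]), np.array([0.5, -1.0, 2.0])
xg, beta = np.array([0.5, 1.5]), np.array([1.0, 0.3])

K_fg = k(xf[:, None], xg[None, :])     # k(xf_i, xg_j)
K_ff = k(xf[:, None], xf[None, :])

print(alpha @ K_fg @ beta)             # <f, g>_{H_0} = sum_{i,j} alpha_i beta_j k(xf_i, xg_j)
print(alpha @ K_ff @ alpha >= 0)       # True: <f, f>_{H_0} >= 0 since k is positive semi-definite
```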

RKHS

Definition: If there is a $k$ such that for every $f \in \mathcal{H}$
  $\langle f(\cdot), k(\cdot, x) \rangle_\mathcal{H} = f(x)$
then $\mathcal{H}$ is called a Reproducing Kernel Hilbert Space.
Evaluation Functional: The functional $\delta_x$ which maps $f$ to $f(x)$, i.e. $\delta_x(f) := f(x)$; observe that $\delta_x$ is linear.
Theorem: If $(\mathcal{H}, \langle \cdot, \cdot \rangle_\mathcal{H})$ is an RKHS then $\delta_x$ is continuous. Equivalently, $\delta_x$ is bounded, i.e. for every $x$ there is an $M_x$ such that for all $f$
  $|f(x)| \le M_x \|f\|_\mathcal{H}$

Constructing Kernels

Matrices: Let $\mathcal{X} = \{1, \ldots, d\}$, $f(i) = f_i$, $\mathcal{H} = \mathbb{R}^d$, and $\langle f, g \rangle_\mathcal{H} = f^\top M g$ with $M \succ 0$. The reproducing property requires $f = K M f$ for all $f$, hence $K = M^{-1}$.
$n$th-Order Polynomials: Let $\mathcal{X} = [a, b]$ and $\mathcal{H} = \tau_n[a, b]$ (polynomials of degree at most $n$). Define
  $\langle f, g \rangle_\mathcal{H} := \sum_{i=0}^n f^{(i)}(c)\, g^{(i)}(c)$ for some $c \in [a, b]$
then
  $k(x, y) = \sum_{i=0}^n \frac{(x - c)^i}{i!} \, \frac{(y - c)^i}{i!}$
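
A small numerical check of the reproducing property for the polynomial example above (a sketch; the specific coefficients, $c = 0$, and the test point are arbitrary choices). Since $\frac{d^i}{dx^i} k(x, y)\big|_{x=c} = (y - c)^i / i!$, the inner product $\langle f, k(\cdot, y) \rangle_\mathcal{H}$ is the Taylor expansion of $f$ about $c$, which equals $f(y)$ for polynomials of degree at most $n$:

```python
import numpy as np
from math import factorial

n, c = 3, 0.0
a = np.array([2.0, -1.0, 0.5, 3.0])     # f(x) = sum_i a_i (x - c)^i, a polynomial of degree n

def f(x):
    return sum(a[i] * (x - c) ** i for i in range(n + 1))

y = 1.7
f_derivs = np.array([factorial(i) * a[i] for i in range(n + 1)])          # f^{(i)}(c)
k_derivs = np.array([(y - c) ** i / factorial(i) for i in range(n + 1)])  # d^i/dx^i k(x, y) at x = c

inner = f_derivs @ k_derivs             # <f, k(., y)>_H = sum_i f^{(i)}(c) k^{(i)}(c)
print(np.isclose(inner, f(y)))          # True: the reproducing property holds
```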

General Recipe

Define $\Gamma_x$: Let $c_x \in L_2(\mathcal{X})$. For $f \in L_2(\mathcal{X})$ define $\Gamma_x(f) = \langle c_x, f \rangle_{L_2} := g(x)$.
Define $\Gamma$: Using the pointwise definition above, define $\Gamma(f) := g$.
Define an RKHS: Now let $\mathcal{H} = \mathrm{image}(\Gamma)$ and observe that evaluation is bounded:
  $|g(x)| = |\langle c_x, f \rangle_{L_2}| \le \|c_x\|_{L_2} \|f\|_{L_2}$
The kernel is $k(x, y) = \langle c_x, c_y \rangle_{L_2}$.

Mercer's Theorem

Statement: Let $k \in L_\infty(\mathcal{X}^2)$ be the kernel of a linear operator $T_k : L_2(\mathcal{X}) \to L_2(\mathcal{X})$,
  $(T_k f)(x) := \int_\mathcal{X} k(x, y) f(y) \, d\mu(y)$
such that
  $\int_{\mathcal{X}^2} k(x, y) f(x) f(y) \, d\mu(y) \, d\mu(x) \ge 0$
Let $(\psi_j, \lambda_j)$ be the normalized eigensystem of $k$; then $(\lambda_j)_j \in \ell_1$ and, almost everywhere,
  $k(x, y) = \sum_j \lambda_j \psi_j(x) \psi_j(y)$
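
A finite-sample analogue of this expansion (a sketch, not the theorem itself; the Gaussian kernel and the grid are assumptions): the eigendecomposition of a Gram matrix plays the role of the eigensystem $(\psi_j, \lambda_j)$, and summing the rank-one terms recovers the kernel on the sample.

```python
import numpy as np

X = np.linspace(0.0, 1.0, 20)[:, None]
K = np.exp(-(X - X.T) ** 2)                 # Gaussian-kernel Gram matrix on a grid

lam, Psi = np.linalg.eigh(K)                # eigenvalues lam_j and orthonormal eigenvectors Psi[:, j]
print(np.all(lam > -1e-10))                 # True: the Gram matrix is positive semi-definite
K_rebuilt = (Psi * lam) @ Psi.T             # sum_j lam_j psi_j psi_j^T
print(np.allclose(K, K_rebuilt))            # True: the expansion reconstructs k on the sample
```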

RKHS from Mercer's Theorem

Construction: We define $\mathcal{H} := \{\sum_i c_i \psi_i(x)\}$ and
  $\langle f, g \rangle_\mathcal{H} := \sum_i \frac{c_i d_i}{\lambda_i}$
where the $\psi_i$ are the eigenfunctions and the $\lambda_i$ the eigenvalues of $k$.
Validity: The series $\sum_i c_i^2 / \lambda_i$ must converge.
Reproducing Property: We can check
  $\langle f(\cdot), k(\cdot, x) \rangle_\mathcal{H} = \sum_i c_i \lambda_i \psi_i(x) / \lambda_i = \sum_i c_i \psi_i(x) = f(x)$

Kernels in Practice

Intuition: Kernels are measures of similarity. By the reproducing property,
  $\langle k(\cdot, x), k(\cdot, y) \rangle_\mathcal{H} = k(x, y)$
so they define a dot product via the map $\phi : \mathcal{X} \to \mathcal{H}$, $x \mapsto k(\cdot, x) := \phi(x)$.
Why is this Interesting: No assumption about the input domain $\mathcal{X}$! If we can define meaningful dot products on $\mathcal{X}$, we are in business. Kernel methods have been successfully applied to discrete data: strings, trees, graphs, automata, transducers, etc.

Kernels and Nonlinearity

Problem: Linear functions are often too simple to provide good estimators.
Idea 1: Map to a higher-dimensional feature space via $\Phi : x \mapsto \Phi(x)$ and solve the problem there; replace every $\langle x, y \rangle$ by $\langle \Phi(x), \Phi(y) \rangle$.
Idea 2: Instead of computing $\Phi(x)$ explicitly, use a kernel function $k(x, y) := \langle \Phi(x), \Phi(y) \rangle$. A large class of functions is admissible as kernels, and non-vectorial data can be handled if we can compute a meaningful $k(x, y)$.
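
A minimal illustration of Idea 2 (the homogeneous quadratic kernel and the dimension are chosen only for illustration): the explicit degree-2 feature map $\Phi$ and the kernel $k(x, y) = \langle x, y \rangle^2$ give the same dot product.

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for x in R^2: all monomials of degree 2
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def k(x, y):
    # homogeneous quadratic kernel: k(x, y) = <x, y>^2
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.isclose(np.dot(phi(x), phi(y)), k(x, y)))  # True: <Phi(x), Phi(y)> = k(x, y)
```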

A Few Kernels

Gaussian Kernel: Let $x, y \in \mathbb{R}^d$ and $\sigma^2 \in \mathbb{R}^+$; then
  $k(x, y) := \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right)$
Polynomial Kernel: Let $x, y \in \mathbb{R}^d$ and $d \in \mathbb{N}$; then
  $k(x, y) := \left(\frac{\langle x, y \rangle}{\sigma^2} + 1\right)^d$
computes all monomials up to degree $d$.
String Kernel: Let $x, y \in A^*$ and $w_s$ be weights; then
  $k(x, y) := \sum_{s \sqsubseteq x,\, s' \sqsubseteq y} w_s \, \delta_{s, s'} = \sum_{s \in A^*} \#_s(x) \, \#_s(y) \, w_s$
where $\#_s(x)$ is the number of occurrences of the substring $s$ in $x$.
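
A sketch of these three kernels in code. The string kernel here assumes uniform weights $w_s = 1$ over substrings up to a maximum length; that choice (and the parameter defaults) is an illustrative assumption, not from the slides.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / sigma^2)
    return np.exp(-np.sum((x - y) ** 2) / sigma ** 2)

def polynomial_kernel(x, y, degree=3, sigma=1.0):
    # k(x, y) = (<x, y> / sigma^2 + 1)^degree
    return (np.dot(x, y) / sigma ** 2 + 1.0) ** degree

def string_kernel(x, y, max_len=3):
    # k(x, y) = sum_s #_s(x) #_s(y), uniform weights w_s = 1 over substrings of length <= max_len
    def counts(s):
        c = {}
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                c[s[i:i + n]] = c.get(s[i:i + n], 0) + 1
        return c
    cx, cy = counts(x), counts(y)
    return sum(v * cy.get(s, 0) for s, v in cx.items())

x, y = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(gaussian_kernel(x, y), polynomial_kernel(x, y))
print(string_kernel("cacao", "cocoa"))   # shared substrings ("c", "a", "o", ...) each contribute
```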

Norms in RKHS

Occam's Razor: Of all functions which explain the data, pick the simplest one.
Simple Functions: We need a way to characterize a simple function; a low function norm in the RKHS corresponds to a smooth function.
Regularization: To encourage simplicity we minimize
  $f_s = \operatorname{argmin}_{f \in \mathcal{H}} \; \frac{1}{m} \sum_{i=1}^m c(f(x_i), y_i) + \lambda \|f\|_\mathcal{H}^2$
where $c(\cdot, \cdot)$ is any loss function and $\lambda$ is a trade-off parameter.

Representer Theorem

Statement: Let $c : (\mathcal{X} \times \mathbb{R} \times \mathbb{R})^m \to \mathbb{R} \cup \{\infty\}$ denote a loss function and let $\Omega : [0, \infty) \to \mathbb{R}$ be a strictly increasing function. The objective function (regularized risk) is
  $c((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))) + \Omega(\|f\|_\mathcal{H})$
Each minimizer $f \in \mathcal{H}$ of the above admits a representation
  $f(x) = \sum_{i=1}^m \alpha_i k(x_i, x)$
The solution lies in the span of the $m$ kernel functions centered at the data points; those points for which $\alpha_i \neq 0$ are the support vectors.
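
One concrete instance of this setup is kernel ridge regression (a sketch; the Gaussian kernel, bandwidth, $\lambda$, and toy data are assumptions). With the squared loss from the previous slide and $\Omega(t) = \lambda t^2$, the coefficients have the closed form $\alpha = (K + \lambda m I)^{-1} y$, and the solution has exactly the representer form $f(x) = \sum_i \alpha_i k(x_i, x)$.

```python
import numpy as np

def gram(X, Z, sigma=0.5):
    # Gaussian Gram matrix K[i, j] = exp(-(x_i - z_j)^2 / sigma^2) for 1-D inputs
    return np.exp(-(X[:, None] - Z[None, :]) ** 2 / sigma ** 2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 3.0, size=30))
y = np.sin(2.0 * X) + 0.1 * rng.normal(size=30)

m, lam = len(X), 0.1
K = gram(X, X)
alpha = np.linalg.solve(K + lam * m * np.eye(m), y)   # minimizer of (1/m) sum (f(x_i)-y_i)^2 + lam ||f||_H^2

X_test = np.linspace(0.0, 3.0, 5)
f_test = gram(X_test, X) @ alpha                      # f(x) = sum_i alpha_i k(x_i, x)
print(np.round(f_test, 2))
print(np.round(np.sin(2.0 * X_test), 2))              # the fit should roughly track the noiseless target
```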

Proof

Sketch: Replace $\Omega(\|f\|_\mathcal{H})$ by $\Omega(\|f\|_\mathcal{H}^2)$.
Decompose $f$ into a part in the span of the $k(x_i, \cdot)$ and an orthogonal part $f_\perp$:
  $f(x) = f_\parallel(x) + f_\perp(x) = \sum_i \alpha_i k(x_i, x) + f_\perp(x)$
Since $\langle f_\perp, k(x_i, \cdot) \rangle = 0$ we have $f(x_j) = \sum_i \alpha_i k(x_i, x_j)$, so the loss term does not depend on $f_\perp$.
Now observe that
  $\Omega(\|f\|_\mathcal{H}^2) \ge \Omega\Big(\big\|\sum_i \alpha_i k(x_i, \cdot)\big\|_\mathcal{H}^2\Big)$
The objective function is therefore minimized when $f_\perp = 0$.

Green's Function

Adjoint Operator: Linear operators $T$ and $T^*$ are adjoint if $\langle T f, g \rangle = \langle f, T^* g \rangle$. A differential operator is a linear operator.
Green's Function: Let $L$ be linear and $k$ be the kernel of an integral operator. If $L k(x, y) = \delta(x - y)$ then $k$ is the Green's function of $L$.
Intuition: Given $f$ and $L u = f$, find $u$. The Green's function is the kernel of $L^{-1}$. You can verify $u = L^{-1} f$ since
  $L L^{-1} f(x) = \int L k(x, y) f(y) \, dy = \int \delta(x - y) f(y) \, dy = f(x)$

Kernels are Green's Functions

Objective Function: Suppose $L$ is a linear differential operator. We impose smoothing by minimizing
  $c((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))) + \|L f\|_{L_2}^2$
Define an RKHS: Define $\mathcal{H}$ as the completion of
  $\mathcal{H}_0 = \{f \in L_2 : \|L f\|_{L_2} < \infty\}$
with the dot product $\langle f, g \rangle_\mathcal{H} := \langle L f, L g \rangle_{L_2}$. The Green's function of $L^* L$ is a reproducing kernel for $\mathcal{H}$.
Generalization: $L$ can be any linear mapping into a dot product space.

Fourier Transform

Definition: For $f : \mathbb{R}^n \to \mathbb{R}$ the Fourier transform is
  $\hat{f}(\omega) = (2\pi)^{-\frac{n}{2}} \int f(x) \exp(-i \langle \omega, x \rangle) \, dx$
and the inverse Fourier transform is
  $f(x) = (2\pi)^{-\frac{n}{2}} \int \hat{f}(\omega) \exp(i \langle \omega, x \rangle) \, d\omega$
Parseval's Theorem: For a function $f$ we have $\langle f, f \rangle_{L_2} = \langle \hat{f}, \hat{f} \rangle_{L_2}$.
Properties: For a function $f$ and a differential operator $L$ we have
  $\|L f\|_{L_2}^2 = (2\pi)^{-\frac{n}{2}} \int \frac{|\hat{f}(\omega)|^2}{\mu(\omega)} \, d\omega$
where $\mu(\omega)$ is a nonnegative function determined by $L$.

Green's Function via Fourier

Dot Products: We define $\mathcal{H}_0 = \{f : \|L f\|_2 < \infty\}$ with the dot product
  $\langle f, g \rangle_{\mathcal{H}_0} = (2\pi)^{-\frac{n}{2}} \int \frac{\hat{f}(\omega) \, \overline{\hat{g}(\omega)}}{\mu(\omega)} \, d\omega$
The RKHS $\mathcal{H}$ is the completion of $\mathcal{H}_0$.
Green's Function: We guess the Green's function (kernel) of $L^* L$ as
  $k(x, y) = (2\pi)^{-\frac{n}{2}} \int \exp(i \langle \omega, x - y \rangle) \, \mu(\omega) \, d\omega$
Verify that $\widehat{k(\cdot, x)}(\omega) = \mu(\omega) \exp(-i \langle \omega, x \rangle)$, so that $\langle f, k(\cdot, x) \rangle_{\mathcal{H}_0} = f(x)$.
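
A 1-D numerical illustration of this Fourier construction (a sketch; the Gaussian spectral density $\mu(\omega)$, the $\exp(-\tau^2 / 2\sigma^2)$ normalization of the target kernel, and the quadrature grid are all assumptions): integrating $\exp(i \omega (x - y)) \mu(\omega)$ against a Gaussian $\mu$ recovers a Gaussian kernel.

```python
import numpy as np

sigma, tau = 0.7, 1.3                           # bandwidth and the difference x - y
omega = np.linspace(-40.0, 40.0, 200001)        # quadrature grid for the spectral integral
d_omega = omega[1] - omega[0]

mu = sigma * np.exp(-sigma ** 2 * omega ** 2 / 2.0)        # assumed spectral density mu(omega)
integrand = np.cos(omega * tau) * mu                       # real part of exp(i omega tau) mu(omega)
k_from_mu = (2.0 * np.pi) ** -0.5 * np.sum(integrand) * d_omega

k_direct = np.exp(-tau ** 2 / (2.0 * sigma ** 2))          # Gaussian kernel at x - y = tau
print(np.isclose(k_from_mu, k_direct, atol=1e-6))          # True
```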

Questions?