Hilbert Space Methods in Learning

Hilbert Space Methods in Learning. Guest lecturer: Risi Kondor. 6772 Advanced Machine Learning and Perception (Jebara), Columbia University, October 15, 2003.

1. A general formulation of the learning problem: empirical and true errors, overfitting, error bounds and what they tell us about the design of algorithms. 2. Hilbert space methods: Reproducing Kernel Hilbert Spaces, kernels, algorithms (SVM, Gaussian Processes, Kernel PCA). Tutorial online at http://www.cs.columbia.edu/risi/notes/tutorial6672.pdf

The Learning Problem

Regression. Learn a function f : x ↦ y. Linear functions, order-p polynomials, splines, etc. Examples: Boston housing problem, robot grasps, motorcycle data, etc.

Classification. Separate +1 labeled points from −1 labeled points. Examples: face recognition, DNA splice site identification, document classification, call type classification.

Supervised learning. Input space: X, e.g. X = R^n. Output space: Y; Y = {−1, +1} for classification, Y = R for regression. Training set: S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, with x_i ∈ X and y_i ∈ Y. "Truth": deterministic, y = f_0(x); or probabilistic, y ∼ p(y|x) (more general). Goal: construct a hypothesis f : X → Y to predict y given x.

The Empirical Risk. Empirical risk (training error): R_emp[f] = (1/m) ∑_{i=1}^m L(f(x_i), y_i), where L : Y × Y → R is the loss function. Zero-one loss for classification: L(ŷ, y) = 1 if ŷ ≠ y, 0 otherwise. Squared error loss for regression: L(ŷ, y) = (y − ŷ)².
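
As a quick illustration (mine, not from the slides), a minimal NumPy sketch of the empirical risk under the two losses just defined; all names and the toy data are made up for the example.

```python
import numpy as np

def zero_one_loss(y_hat, y):
    """L(y_hat, y) = 1 if y_hat != y, 0 otherwise."""
    return (y_hat != y).astype(float)

def squared_loss(y_hat, y):
    """L(y_hat, y) = (y - y_hat)^2."""
    return (y - y_hat) ** 2

def empirical_risk(f, X, y, loss):
    """R_emp[f] = (1/m) * sum_i L(f(x_i), y_i)."""
    y_hat = np.array([f(x) for x in X])
    return loss(y_hat, y).mean()

# Toy training set and a constant classifier that always predicts +1.
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([-1, +1, +1])
f_const = lambda x: +1
print(empirical_risk(f_const, X_train, y_train, zero_one_loss))  # 1/3
```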

A Bad Learning Algorithm (the memorization algorithm). Set f(x) = +1 when x = x_i and y_i = +1 for some i, and f(x) = −1 otherwise. For zero-one loss, perfect performance on the training data: R_emp[f] = (1/m) ∑_{i=1}^m L(f(x_i), y_i) = 0. Will it generalize well to testing examples? Why not?
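
A small sketch (again my own illustration) of why memorization fails: zero empirical risk on the training set, but essentially chance-level error on fresh examples from the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Labels given by the 'truth' f_0: the sign of the first coordinate."""
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] > 0, +1, -1)
    return X, y

X_train, y_train = make_data(50)
X_test, y_test = make_data(1000)

def memorizer(x):
    """f(x) = +1 if x exactly matches a positive training example, -1 otherwise."""
    for xi, yi in zip(X_train, y_train):
        if np.array_equal(x, xi) and yi == +1:
            return +1
    return -1

train_err = np.mean([memorizer(x) != yi for x, yi in zip(X_train, y_train)])
test_err = np.mean([memorizer(x) != yi for x, yi in zip(X_test, y_test)])
print(train_err)  # 0.0: perfect on the training data
print(test_err)   # about 0.5: no better than guessing on new points
```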

The True Risk. Assume some distribution on inputs: p(x). Distribution on (x, y) examples: p(x, y) = δ(y − f_0(x)) p(x), or p(x, y) = p(y|x) p(x). True risk: R[f] = E[L(f(x), y)] = ∫_{X×Y} p(x, y) L(f(x), y) dx dy. This is what we really want to minimize in discriminative learning.

True Risk vs. Empirical Risk. R[f] = E[L(f(x), y)] versus R_emp[f] = (1/m) ∑_{i=1}^m L(f(x_i), y_i). Just minimizing R_emp is BAD (see the previous algorithm). Optimizing the training error at the expense of the testing error is called overfitting. But we do not know p(x, y)! Can we still do anything?

Bounding the True Risk. For many practical learning algorithms we can bound the gap between R[f] = E[L(f(x), y)] and R_emp[f] = (1/m) ∑_{i=1}^m L(f(x_i), y_i). Uniform error bounds: for any distribution D, with probability 1 − δ (over the choice of training set), R[f] − R_emp[f] ≤ ε for all hypotheses f ∈ F simultaneously. PAC bound: probably approximately correct.

KEY CONCEPT: Capacity Control. P[ R[f] − R_emp[f] ≤ ε ] ≥ 1 − δ. Generally, ε is a complicated function of δ, depending crucially on the hypothesis class F that f is chosen from. The compromise: model flexibility (large F, complexity) versus generalization performance (want small ε, generality).

Capacity control. [Three fits to the same data: too inflexible? just right? overfitting?] How do we quantify the complexity of f?

Uniform Error Bounds. P[ R[f] − R_emp[f] ≤ ε for all f ∈ F ] ≥ 1 − δ, equivalently P[ sup_{f∈F} ( R[f] − R_emp[f] ) ≤ ε ] ≥ 1 − δ. Not equivalent to: P[ R[f] − R_emp[f] ≤ ε ] ≥ 1 − δ for each f ∈ F separately.

Vapnik–Chervonenkis type bounds. With probability 1 − δ, sup_{f∈F} ( R[f] − R_emp[f] ) ≤ √[ ( h (log(2m/h) + 1) − log(δ/4) ) / m ], where h is the VC dimension of F. Linear discriminators in R^n: h = n + 1; with margin γ in a ball of radius D: h = min(n, D²/γ²) + 1. Large margin is good!
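
To see how such a bound behaves numerically, here is a small sketch (mine) that plugs hypothetical numbers into the VC expression above; the margin-based value of h is invented purely for illustration.

```python
import numpy as np

def vc_bound(h, m, delta):
    """sqrt( (h*(log(2m/h) + 1) - log(delta/4)) / m )."""
    return np.sqrt((h * (np.log(2 * m / h) + 1) - np.log(delta / 4)) / m)

n, m, delta = 100, 1000, 0.05          # linear discriminators in R^100, 1000 examples
h_full = n + 1
h_margin = min(n, 4) + 1               # pretend D^2 / gamma^2 = 4 (hypothetical)

print(vc_bound(h_full, m, delta))      # roughly 0.64 -- a weak guarantee for 0-1 loss
print(vc_bound(h_margin, m, delta))    # roughly 0.2 -- large margin tightens the bound
```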

Covering number bounds. With probability 1 − δ, sup_{f∈F} ( R[f] − R_emp[f] ) ≤ 16M √[ ( log(12m · E[N_1(S, ε/8)]) − log δ ) / m ], where M is an upper bound on L(f(x), y). The covering number N_1(S, ε) is the number of vectors v_1, v_2, ..., v_n needed to ensure that for any f ∈ F there is a v_k with (1/m) ∑_i | L(f(x_i), y_i) − (v_k)_i | ≤ ε.

Stability-based bounds. If f* is the hypothesis returned by a β-stable algorithm, then with probability 1 − δ, R[f*] − R_emp[f*] ≤ β + 2(mβ + M) √( 2 log(2/δ) / m ), where M is an upper bound on L(f(x), y). An algorithm is β-stable if, for all training sets and any example (x, y), L(f*(x), y) changes by at most β when we replace any one of the training examples by any other example.

Rademacher bounds. For Ra_r[f] = r, with probability 1 − δ, R[f] ≤ inf_{α>0} [ (1 + α) R_emp[f] + (1 + 1/(4α)) · 31r + 50 b log(2/δ) / n ]. Rademacher average: Ra_r[f] = E_{S,σ} [ sup_{f ∈ F : E L(f(x),y) ≤ r} (1/n) ∑_i σ_i L(f(x_i), y_i) ], where P[σ_i = +1] = P[σ_i = −1] = 1/2.
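
For a concrete feel, here is a Monte Carlo sketch (my own, not from the lecture) of the empirical Rademacher average of a small finite hypothesis class under zero-one loss, using the common (1/m)·∑ normalization.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny finite hypothesis class: threshold classifiers on the first coordinate.
thresholds = np.linspace(-2, 2, 21)
def predict(theta, X):
    return np.where(X[:, 0] > theta, +1, -1)

X = rng.normal(size=(200, 1))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=200) > 0, +1, -1)

def empirical_rademacher(X, y, n_mc=2000):
    """Monte Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i L(f(x_i), y_i) ]."""
    m = len(y)
    losses = np.stack([(predict(t, X) != y).astype(float) for t in thresholds])  # |F| x m
    vals = []
    for _ in range(n_mc):
        sigma = rng.choice([-1.0, 1.0], size=m)   # P[sigma_i = +1] = P[sigma_i = -1] = 1/2
        vals.append((losses @ sigma).max() / m)   # sup over the finite class
    return np.mean(vals)

print(empirical_rademacher(X, y))  # a small value for this low-complexity class
```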

Structural Risk Minimization. If we have a bound of the form P[ sup_{f∈F} ( R[f] − R_emp[f] ) ≤ ε_F ] ≥ 1 − δ: 1. Fix δ. 2. Compute f_F = arg min_{f∈F} [ R_emp[f] + ε_F ] for a sequence of spaces F_1 ⊂ F_2 ⊂ ... ⊂ F_k. 3. Return the f_{F_i} with the smallest R_emp[f_{F_i}] + ε_{F_i}. Does this work?
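
A toy sketch (mine) of the SRM recipe over nested polynomial classes; the penalty ε_F used here is a made-up stand-in, not a bound from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression data: a cubic plus noise.
X = np.linspace(-1, 1, 40)
y = X**3 - 0.5 * X + 0.1 * rng.normal(size=X.shape)
m = len(X)

scores = {}
# Nested spaces F_1 in F_2 in ... : polynomials of increasing degree.
for degree in range(1, 10):
    coeffs = np.polyfit(X, y, degree)
    r_emp = np.mean((np.polyval(coeffs, X) - y) ** 2)   # empirical risk, squared loss
    eps_F = 0.05 * np.sqrt((degree + 1) / m)            # invented complexity penalty
    scores[degree] = r_emp + eps_F
    print(degree, round(r_emp, 4), round(eps_F, 4))

print("SRM picks degree", min(scores, key=scores.get))  # typically the cubic class
```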

The problem with error bounds. Most bounds are hopelessly loose: typically, for 1 − δ = 0.95 we get something like ε = 3000. The main culprit is the uniformity requirement. Can we still use them for anything, or are they just a weird sport? The form of the bounds is important, even if their value is not. In particular: large margin is good.

Hilbert Space Methods

SVMs: the old story. Kernel k : X × X → R, a positive definite similarity measure. Feature map Φ : X → F obeying k(x, x′) = ⟨Φ(x), Φ(x′)⟩. E.g. the Gaussian kernel: k(x, x′) = e^{−‖x − x′‖² / (2σ²)}. Find the maximum margin separating hyperplane in the high-dimensional feature space! f(x) = sgn[ b + ∑_i α_i k(x_i, x) ].
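
A sketch (mine) of this "old story" using scikit-learn's SVC with an RBF kernel: after training, the decision function really is sgn(b + ∑_i α_i k(x_i, x)) over the support vectors, with sklearn's dual_coef_ playing the role of the (signed) α_i.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Two Gaussian blobs labelled -1 / +1.
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

sigma = 1.0
gamma = 1.0 / (2 * sigma**2)      # sklearn's RBF kernel is exp(-gamma * ||x - x'||^2)
svc = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def gaussian_kernel(x, xp, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))

def decision(x):
    """f(x) = sgn(b + sum_i alpha_i k(x_i, x)), summed over support vectors."""
    s = svc.intercept_[0]
    for coef, sv in zip(svc.dual_coef_[0], svc.support_vectors_):
        s += coef * gaussian_kernel(x, sv, sigma)
    return np.sign(s)

x_new = np.array([0.5, 0.5])
print(decision(x_new), svc.predict(x_new[None])[0])  # the two should agree
```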

We want a more general story behind Hilbert space methods. How do we tell what a good kernel is, anyway? We want large margin; what kernel will give us large margin? Lessons so far: capacity control is crucial; large margin is good; pursue an abstract approach looking for a general f : X → Y, and worry about the actual algorithm later.

Regularized Risk. Motivated by the form of the error bounds, minimize R_reg[f] = (1/m) ∑_{i=1}^m L(f(x_i), y_i) + Ω[f] (empirical risk plus regularizer) over some large space of functions H. Ω[f] is a penalty term penalizing hypotheses that are too complex. Effectively SRM. See the regularization networks of Poggio & Girosi.

Regularized Spaces of Functions. Given {(x_1, y_1), ..., (x_m, y_m)}, look for f : X → Y in some linear space of functions H minimizing R_reg[f], progressively specializing the penalty:

R_reg[f] = (1/m) ∑_i L(f(x_i), y_i) + ‖f‖²_H

R_reg[f] = (1/m) ∑_i L(f(x_i), y_i) + ⟨f, f⟩_H          (Hilbert space)

R_reg[f] = (1/m) ∑_i L(⟨f, k_{x_i}⟩, y_i) + ⟨f, f⟩_H      (RKHS)

The k_x are prototypical functions such that f(x) = ⟨f, k_x⟩.

Representer Theorem. The minimizer of R_reg[f] = (1/m) ∑_i L(⟨f, k_{x_i}⟩, y_i) + ⟨f, f⟩_H will be in the span of k_{x_1}, k_{x_2}, ..., k_{x_m}! The hypothesis can be written f(x) = ⟨f, k_x⟩ = ∑_i α_i ⟨k_{x_i}, k_x⟩ = ∑_i α_i k(x, x_i), where k(x, x′) = ⟨k_x, k_{x′}⟩. All we need to find are α_1, α_2, ..., α_m. How do we construct the RKHS?
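
For squared error loss the minimization has a closed form; here is a sketch (mine) of kernel ridge regression that uses the representer expansion f = ∑_i α_i k_{x_i} and an explicit regularization weight λ, which the slides absorb into the penalty term.

```python
import numpy as np

rng = np.random.default_rng(4)

def gaussian_gram(A, B, sigma=0.5):
    """K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

# Toy 1-D regression data.
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

lam = 0.01                          # regularization weight (assumed, not from the slides)
K = gaussian_gram(X, X)
# Squared loss + representer theorem  =>  (K + m*lam*I) alpha = y.
alpha = np.linalg.solve(K + len(y) * lam * np.eye(len(y)), y)

def f(x_new):
    """f(x) = sum_i alpha_i k(x, x_i)."""
    return gaussian_gram(np.atleast_2d(x_new), X) @ alpha

print(f([1.5]))   # roughly sin(1.5) ~ 1.0
```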

Constructing the RKHS. f(x) = ⟨f, k_x⟩. Bootstrap everything from k(x, x′) = ⟨k_x, k_{x′}⟩ for x, x′ ∈ X!
1. Anything outside span{ k_x : x ∈ X } is uninteresting, so f = ∫_X β(x) k_x dx.
2. To evaluate f(x′) use f(x′) = ⟨f, k_{x′}⟩ = ∫_X β(x) ⟨k_x, k_{x′}⟩ dx = ∫_X β(x) k(x, x′) dx.
3. To compute ⟨f, f⟩ use ⟨f, f⟩ = ∫_X ∫_X β(x) β(x′) ⟨k_x, k_{x′}⟩ dx dx′ = ∫_X ∫_X β(x) β(x′) k(x, x′) dx dx′.
4. Note that k_x(x′) = ⟨k_x, k_{x′}⟩ = k(x, x′), so we simply have k_x = k(x, ·).
5. H is a particular instance of a feature space F if we set Φ(x) = k_x.
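
A finite analogue (my own sketch): if f is a finite combination of the prototype functions k_z, then evaluation and the RKHS norm reduce to plain kernel sums, exactly as in steps 2 and 3 above.

```python
import numpy as np

def k(x, xp, sigma=0.7):
    """Gaussian kernel k(x, x') = exp(-(x - x')^2 / (2 sigma^2)) on R."""
    return np.exp(-(x - xp) ** 2 / (2 * sigma**2))

# f = sum_j beta_j k_{z_j}: a discrete stand-in for f = integral of beta(x) k_x dx.
z = np.array([-1.0, 0.0, 1.5])
beta = np.array([0.5, -1.0, 2.0])

def f(x):
    """Reproducing property: f(x) = <f, k_x> = sum_j beta_j k(z_j, x)."""
    return sum(b * k(zj, x) for b, zj in zip(beta, z))

# RKHS norm: <f, f> = sum_{i,j} beta_i beta_j k(z_i, z_j).
K = k(z[:, None], z[None, :])
print(f(0.3), beta @ K @ beta)
```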

Correspondence. R_reg[f] = (1/m) ∑_i L(⟨f, k_{x_i}⟩, y_i) + ⟨f, f⟩_H. Kernel methods make sense from the regularization theory point of view if the kernel corresponds to a sensible regularization operator Ω[f] = ⟨f, f⟩_H.

Fourier regularization. Fourier transform on R^n: f̂(ω) = (2π)^{−n/2} ∫_{R^n} f(x) e^{−iω·x} dx. Inverse transform: f(x) = (2π)^{−n/2} ∫_{R^n} f̂(ω) e^{iω·x} dω. Fourier regularization: Ω[f] = ⟨f, f⟩_H = ∫ e^{σ²‖ω‖²/2} |f̂(ω)|² dω. Corresponding kernel: k(x, x′) = e^{−‖x − x′‖²/(2σ²)}. The Gaussian kernel will heavily penalize non-smooth functions!
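
A one-dimensional numerical sanity check (mine): the kernel whose Fourier transform is the reciprocal of the penalty weight e^{σ²ω²/2} is, up to a normalization constant, exactly the Gaussian kernel above.

```python
import numpy as np

sigma = 1.0
omega = np.linspace(-20, 20, 4001)            # frequency grid, wide enough for the tails
k_hat = np.exp(-sigma**2 * omega**2 / 2)      # reciprocal of the penalty weight
dw = omega[1] - omega[0]

def kernel_from_weight(x):
    """Inverse Fourier transform of k_hat at x (real part, Riemann sum)."""
    return np.sum(k_hat * np.cos(omega * x)) * dw

xs = np.linspace(0.0, 3.0, 7)
numeric = np.array([kernel_from_weight(x) for x in xs])
analytic = (np.sqrt(2 * np.pi) / sigma) * np.exp(-xs**2 / (2 * sigma**2))
print(np.max(np.abs(numeric - analytic)))     # essentially zero: the kernel is Gaussian
```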

Other kernels. Homogeneous polynomial: k(x, x′) = (x · x′)^p. Non-homogeneous polynomial: k(x, x′) = (x · x′ + 1)^p. tanh kernel: k(x, x′) = tanh(κ (x · x′) + δ). Triangular kernel: k(x, x′) = 1 − ‖x − x′‖/d. String kernels: k(string_1, string_2). Kernels on distributions: Fisher kernel, etc. Diffusion kernels: k(x, x′) = [e^{βH}]_{x,x′} (H the generator, e.g. a graph Laplacian). Similarity measure ↔ regularization.
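
Quick NumPy sketches (mine) of a few of the kernels listed above. Caveats: the tanh kernel is not positive definite for all parameter choices, and the triangular kernel as written is only a valid kernel on a suitably bounded domain.

```python
import numpy as np

def poly_kernel(x, xp, p=3, homogeneous=False):
    """(x . x')^p, or (x . x' + 1)^p for the non-homogeneous version."""
    c = 0.0 if homogeneous else 1.0
    return (np.dot(x, xp) + c) ** p

def tanh_kernel(x, xp, kappa=0.1, delta=-1.0):
    """tanh(kappa * (x . x') + delta)."""
    return np.tanh(kappa * np.dot(x, xp) + delta)

def triangular_kernel(x, xp, d=2.0):
    """1 - ||x - x'|| / d, as written on the slide."""
    return 1.0 - np.linalg.norm(np.asarray(x) - np.asarray(xp)) / d

x, xp = np.array([1.0, 0.5]), np.array([0.2, -0.3])
print(poly_kernel(x, xp), tanh_kernel(x, xp), triangular_kernel(x, xp))
```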

Algorithms

Modularity of Hilbert space methods. f* = arg min_{f∈H} [ (1/m) ∑_i L(⟨f, k_{x_i}⟩, y_i) + ⟨f, f⟩_H ]: the loss term determines the algorithm, the inner product (regularizer) determines the kernel. The same algorithm (SVM) can be used in very different contexts by changing the kernel: kernel engineering. The regularization scheme can be studied independently of the application (classification, regression, etc.). ANY kernel method can be formulated as one of these minimization problems.

Soft margin SVMs. Relax the problem to learning continuous functions f : X → R with the hinge loss L(f(x), y) = C max(0, 1 − y f(x)). Then f* = arg min_{f∈H} [ (1/m) ∑_i L(⟨f, k_{x_i}⟩, y_i) + ⟨f, f⟩_H ] reduces to the soft margin SVM: f* = arg min_{f∈H} [ ⟨f, f⟩ + C ∑_i ξ_i ] subject to y_i f(x_i) ≥ 1 − ξ_i and ξ_i ≥ 0. Probably the most popular algorithm for classification.
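
The SVM is normally trained by solving the dual quadratic program; purely as an illustration of minimizing the regularized hinge objective directly, here is a kernelized subgradient-descent sketch (mine, no bias term, all constants invented).

```python
import numpy as np

rng = np.random.default_rng(5)

def gram(A, B, sigma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

# Two noisy blobs labelled -1 / +1.
X = np.vstack([rng.normal(-1, 0.8, (40, 2)), rng.normal(+1, 0.8, (40, 2))])
y = np.array([-1.0] * 40 + [+1.0] * 40)

m, C = len(y), 10.0
K = gram(X, X)
alpha = np.zeros(m)

# Objective: (1/m) sum_i C*max(0, 1 - y_i f(x_i)) + alpha^T K alpha,
# with f(x_i) = (K alpha)_i by the representer expansion.
for t in range(1, 2001):
    margins = y * (K @ alpha)
    violated = margins < 1
    grad = 2 * K @ alpha - (C / m) * K[:, violated] @ y[violated]
    alpha -= (0.1 / np.sqrt(t)) * grad

print("training accuracy:", np.mean(np.sign(K @ alpha) == y))  # high for separable-ish blobs
```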

Kernel Regression. If we set L(f(x), y) = C max(0, |y − f(x)| − ε) (the ε-insensitive loss), then f* = arg min_{f∈H} [ (1/m) ∑_i L(⟨f, k_{x_i}⟩, y_i) + ⟨f, f⟩_H ] reduces to soft kernel (support vector) regression: f* = arg min_{f∈H} [ ⟨f, f⟩ + C ∑_i (ξ_i + ξ_i*) ] subject to y_i − f(x_i) ≤ ε + ξ_i and f(x_i) − y_i ≤ ε + ξ_i*, with ξ_i, ξ_i* ≥ 0.
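
A short sketch (mine) of the same idea with scikit-learn's SVR, which solves this ε-insensitive formulation (plus a bias term) with an RBF kernel; all parameter values are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(6)

# Noisy samples of a sine curve.
X = np.sort(rng.uniform(-3, 3, (80, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

# epsilon is the width of the insensitive tube; C penalises the slacks xi_i, xi_i*.
svr = SVR(kernel="rbf", gamma=0.5, C=10.0, epsilon=0.1).fit(X, y)

x_test = np.array([[0.0], [1.5]])
print(svr.predict(x_test))   # roughly sin(0) = 0 and sin(1.5) ~ 1.0
print(len(svr.support_))     # points on or outside the tube become support vectors
```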
