EECS 281B / STAT 241B: Advanced Topics in Statistical Learning, Spring 2009

Lecture 4: February 2

Lecturer: Martin Wainwright. Scribe: Luqman Hodgkinson.

Note: These lecture notes are still rough, and have only been mildly proofread.

Announcements

GSI: Junming Yin (junming@eecs.berkeley.edu); his office hours are posted on the webpage. HW #1 is due Monday, February 9th in class (no late assignments).

Today

- kernels and their uses
- reproducing kernel Hilbert spaces

Recap

So far we have looked at two simple non-parametric classifiers based on linear discriminant functions:

1. the perceptron
2. the max-margin SVM

Given samples $\{x^{(i)}, y^{(i)}\}$, these are based on functions $f_\theta(x) = \langle \theta, x \rangle$ with
$$\theta = \sum_{i=1}^n \alpha_i y^{(i)} x^{(i)}.$$
The indices $\{i : \alpha_i \neq 0\}$ index the subset of samples contributing to the classifier. In the perceptron, $\alpha_i$ is related to the number of times point $i$ is misclassified, because we increase $\alpha_i$ by one each time.

For these methods
$$f_\theta(x) = \langle \theta, x \rangle = \sum_{i=1}^n \alpha_i y^{(i)} \langle x, x^{(i)} \rangle,$$
so we never need to represent $\theta$ explicitly; we just need the weights and a representation of the inner products. This is true both for the linear perceptron algorithm and when solving the max-margin optimization problem. For the linear max-margin classifier, in the dual we can find $\alpha$ directly and never need to compute $\theta$ explicitly. This suggests a route to more powerful classifiers via the kernel trick.
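To make the dual representation concrete, here is a minimal sketch (not from the original notes; the toy data and function names are illustrative) of a perceptron that stores only the weights $\alpha_i$ and evaluates $f_\theta$ through inner products, never forming $\theta$. Replacing the Euclidean inner product by a kernel $K$ is exactly the kernel trick introduced next.

```python
import numpy as np

def dual_perceptron(X, y, inner=np.dot, max_epochs=100):
    """Dual-form perceptron: theta = sum_i alpha_i y^(i) x^(i) is never formed;
    we keep only alpha and evaluate f(x) = sum_i alpha_i y^(i) <x^(i), x>.
    Swapping `inner` for a kernel K gives the kernelized version (Section 4.1)."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(n):
            f_j = sum(alpha[i] * y[i] * inner(X[i], X[j]) for i in range(n))
            if y[j] * f_j <= 0:          # point j misclassified
                alpha[j] += 1.0          # alpha_j counts mistakes on point j
                mistakes += 1
        if mistakes == 0:                # training data separated
            break
    return alpha

# A linearly separable toy problem (illustrative data).
X = np.array([[2., 1.], [1., 2.], [-1., -2.], [-2., -1.]])
y = np.array([1., 1., -1., -1.])
alpha = dual_perceptron(X, y)
f = lambda x: sum(alpha[i] * y[i] * X[i] @ x for i in range(len(X)))
print(alpha, [np.sign(f(x)) for x in X])
```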

4.1 The Kernel Trick

Idea: Find a feature map, i.e. a function $\Phi : \mathcal{X} \to \mathcal{Z}$, where $\mathcal{X}$ is the feature space and $\mathcal{Z}$ is a much higher-dimensional space that can even be infinite-dimensional. Linear classifiers in the higher-dimensional space $\mathcal{Z}$ can be very non-linear in the original feature space.

E.g. $\Phi((x_1, \ldots, x_d)) = (x_1, \ldots, x_d, \ldots, x_i x_j, \ldots, x_i x_j x_k, \ldots, x_d^3)$. Here $x \in \mathbb{R}^d$ and $\Phi(x) \in \mathbb{R}^{\binom{d}{1} + \binom{d}{2} + \binom{d}{3}}$, where the exponent is of order $d^3$. This can be generalized to map to tuples of all monomials of degree at most $k$; in the example above, $k = 3$.

Working in these high-dimensional spaces can be very difficult. If $d$ is on the order of $10^3$, then for $k = 3$ the dimension of $\mathcal{Z}$ is on the order of $10^9$. For $k = 10$ it is on the order of $10^{30}$, though $k = 10$ is most likely too high and may fit noise in the data.

The kernel trick is a computational solution to working in these high dimensions. If we had an efficient way of computing inner products, we would never need to work in the higher-dimensional space, just in the space of inner products. Suppose there were a function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that
$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{Z}} \quad \forall x, y \in \mathcal{X}.$$
Then we could kernelize, i.e. replace every inner product with $K$, which is nice because then we would not pay the price of working in the high-dimensional space.
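As a sanity check on the identity $K(x,y) = \langle \Phi(x), \Phi(y) \rangle_{\mathcal{Z}}$, the sketch below (illustrative, not part of the original notes) takes the explicit map $\Phi(x) = (x_i x_j)_{i,j=1}^d$ of all ordered degree-2 monomials, for which $\langle \Phi(x), \Phi(y) \rangle = \langle x, y \rangle^2$, and compares it against evaluating the kernel directly: the same number is obtained with $O(d)$ work instead of $O(d^2)$.

```python
import numpy as np

d = 500
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Explicit feature map: all ordered products x_i x_j, a d^2-dimensional vector.
def phi(v):
    return np.outer(v, v).ravel()

lhs = phi(x) @ phi(y)         # inner product in the d^2-dimensional space Z
rhs = (x @ y) ** 2            # kernel evaluation K(x, y) = <x, y>^2, O(d) work
print(np.allclose(lhs, rhs))  # True: same number, vastly different cost
```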

4.2 Hilbert Spaces

4.2.1 Definitions

A Hilbert space $\mathcal{H}$ is an inner product space (i.e. a set of elements closed under addition and scalar multiplication, on which an inner product $\langle f, g \rangle_{\mathcal{H}}$ is defined for all $f, g \in \mathcal{H}$) that is complete with respect to the norm $\|\cdot\|_{\mathcal{H}}$, where $\|f\|_{\mathcal{H}} := \sqrt{\langle f, f \rangle_{\mathcal{H}}}$. Complete means that any Cauchy sequence $\{f_n\} \subset \mathcal{H}$ converges to some $f \in \mathcal{H}$. A Cauchy sequence $\{f_n\}$ is a sequence such that $\forall \epsilon > 0$ there exists $N(\epsilon)$ such that $\forall n, m \geq N(\epsilon)$, $\|f_n - f_m\|_{\mathcal{H}} < \epsilon$.

4.2.2 Examples

Example (a): $\mathbb{R}^d$ with the Euclidean inner product is a finite-dimensional Hilbert space.

Example (b): $\ell^2(\mathbb{N}) = \{(a_1, a_2, a_3, \ldots) \mid \sum_{i=1}^{\infty} a_i^2 < +\infty\}$, i.e., the set of all square-summable sequences, with $\langle a, b \rangle = \sum_{i=1}^{\infty} a_i b_i$, is an infinite-dimensional Hilbert space.

Example (c): Say $\mathcal{C} = \{f_i, i = 1, \ldots, d\}$ is some class of functions, e.g., maybe a bunch of separate estimators; if we take a weighted sum of them, it is an aggregation technique. The set
$$\mathrm{span}(\mathcal{C}) = \Big\{ \sum_{i=1}^d a_i f_i \;\Big|\; (a_1, \ldots, a_d) \in \mathbb{R}^d \Big\}$$
is closed under addition and multiplication by scalars. We can define an inner product with $G \in \mathbb{R}^{d \times d}$, any symmetric positive definite matrix, by $\langle f_i, f_j \rangle := G_{ij}$ for $i, j = 1, \ldots, d$, and thus
$$\Big\langle \sum_i a_i f_i, \sum_j b_j f_j \Big\rangle = \sum_{i,j} a_i b_j G_{ij}.$$
This is a finite-dimensional Hilbert space, but a rich one, as the base functions can be quite complicated.

Example (d): $L^2[0,1] = \{f : [0,1] \to \mathbb{R} \mid \int_0^1 f^2(t)\,dt < +\infty\}$, with $\langle f, g \rangle = \int_0^1 f(t) g(t)\,dt$, so that $\int_0^1 f^2(t)\,dt < +\infty$ is equivalent to $\|f\|_{\mathcal{H}}^2 < +\infty$. This space is too rich, meaning it is not a reproducing kernel Hilbert space (RKHS). Everything in (a)-(c) is a RKHS, so we can find kernels we like.

Example (e): $\mathcal{H} = \{f : [0,1] \to \mathbb{R} \mid f(0) = 0,\ f \text{ differentiable almost everywhere with } f' \in L^2[0,1]\}$. Here $f$ can have isolated bad points, say a corner point, but not pathologically many. This is a Sobolev space (see Adams, 1975). The inner product is interesting, defined as
$$\langle f, g \rangle_{\mathcal{H}} = \int_0^1 f'(t) g'(t)\,dt.$$

Examples (a)-(c) and (e) are RKHSs. Example (d) is not. Exercise: think about this.

4.3 Reproducing Kernel Hilbert Spaces

We will focus on Hilbert spaces of functions $f : \mathcal{X} \to \mathbb{R}$. A reproducing kernel Hilbert space (RKHS) is a Hilbert space in which, for every $x \in \mathcal{X}$, there exists $R_x \in \mathcal{H}$ such that
$$\langle R_x, f \rangle_{\mathcal{H}} = f(x) \quad \forall f \in \mathcal{H}.$$
That is, function evaluation has a linear representation in the Hilbert space. $R_x$ is the representer of evaluation for $x$. Note that this is just one of many equivalent definitions of a RKHS.

Example (d) (revisited): Can we find $g$ such that $\int_0^1 f(t) g(t)\,dt = f(x)$? We are tempted to think of $\delta$-functions to put infinite mass at $x$. In (d), $\delta$-functions are not in $\mathcal{H}$.

4.4 Positive Semi-Definite (PSD) Kernels

A function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive semidefinite (PSD) kernel if for all $n \in \mathbb{N}$ ($n = 1, 2, \ldots$) and all $\{x^{(i)}, i = 1, \ldots, n\} \subset \mathcal{X}$, the matrix $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = K(x^{(i)}, x^{(j)})$ is symmetric and positive semi-definite. A matrix $K \in \mathbb{R}^{n \times n}$ is positive semi-definite if $a^T K a \geq 0$ for all $a \in \mathbb{R}^n$.

4.5 Relationship Between RKHS and PSD Kernels

Theorem 4.1. To every RKHS, there is a unique PSD kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Conversely, given a PSD kernel, we can construct a RKHS such that $R_x(\cdot) = K(\cdot, x)$ is the representer of evaluation.

Note that $K(\cdot, x) = R_x(\cdot)$ belongs to the Hilbert space in this correspondence between kernels and the reproducing property. Given a kernel, we can build rich classes of functions. The feature set can be a subset of $\mathbb{R}$, but not necessarily: we can have kernels on combinatorial objects such as graphs, subsequences, or all subsets of a set. These are combinatorial kernels that are not continuous. Discrete kernels on combinatorial objects are very important in fields such as computational biology.
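The PSD condition of Section 4.4 is easy to probe numerically. The sketch below (a sanity check, not a proof; the sample points and test vector are arbitrary choices, not from the notes) forms the Gram matrix of the quadratic kernel $K(x,y) = \langle x, y \rangle^2$ from the earlier sketch and confirms symmetry, nonnegative eigenvalues, and $a^T K a \geq 0$ for a random $a$.

```python
import numpy as np

rng = np.random.default_rng(1)

def gram(kernel, points):
    """Gram matrix K_ij = kernel(x_i, x_j) for a list of sample points."""
    return np.array([[kernel(xi, xj) for xj in points] for xi in points])

# Quadratic kernel K(x, y) = <x, y>^2 evaluated at random points in R^5.
X = rng.standard_normal((10, 5))
K = gram(lambda u, v: (u @ v) ** 2, list(X))

print(np.allclose(K, K.T))                     # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # eigenvalues nonnegative, up to round-off
a = rng.standard_normal(10)
print(a @ K @ a >= -1e-10)                     # quadratic form a^T K a >= 0
```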

Example (c) (revisited): $\mathcal{H} = \mathrm{span}\{f_1, \ldots, f_d\}$, i.e. all linear combinations of the functions $f_i : \mathcal{X} \to \mathbb{R}$, with inner products $G_{ij} = \langle f_i, f_j \rangle$, where $G$ is any symmetric positive definite $d \times d$ matrix. One can prove that $G$ satisfies the properties of an inner product.

Let $\{e_1, e_2, \ldots, e_d\}$ be any orthonormal basis of $\mathcal{H}$. Since $\mathcal{H}$ is a vector space of functions, we can always do Gram-Schmidt to get an orthonormal basis. (This assumes that $f_1, \ldots, f_d$ are linearly independent.) Define
$$K(y, x) = \sum_{i=1}^d e_i(x) e_i(y).$$
For any $x \in \mathcal{X}$, $K(\cdot, x) = \sum_{i=1}^d e_i(x) e_i(\cdot) \in \mathcal{H}$, as each $e_i(x)$ is a constant. We need to check that it reproduces. As the $\{e_i\}$ are orthonormal,
$$\langle e_i, e_j \rangle_{\mathcal{H}} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise.} \end{cases}$$
Take any $f = \sum_{i=1}^d a_i e_i \in \mathcal{H}$ for $a \in \mathbb{R}^d$. Then
$$\langle f, K(\cdot, x) \rangle_{\mathcal{H}} = \Big\langle \sum_{i=1}^d a_i e_i, \sum_{j=1}^d e_j(x) e_j \Big\rangle_{\mathcal{H}} = \sum_{i,j} a_i e_j(x) \langle e_i, e_j \rangle_{\mathcal{H}} = \sum_{i=1}^d a_i e_i(x) = f(x).$$
Therefore, this particular Hilbert space is a RKHS. This proof applies more generally to generic finite-dimensional Hilbert spaces: any finite-dimensional Hilbert space of functions is reproducing, and the same line of argument can be used to construct the associated kernel.
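To make this construction concrete, here is a small numerical sketch (my own illustration; the monomial basis, the random choice of $G$, and all variable names are arbitrary, not from the notes). It orthonormalizes $f_1, \ldots, f_d$ with respect to the $G$-inner product via a Cholesky factorization, forms $K(y,x) = \sum_i e_i(x) e_i(y)$, and checks the reproducing property numerically.

```python
import numpy as np

rng = np.random.default_rng(2)

# Basis functions f_1, ..., f_d (an arbitrary illustrative choice): monomials.
fs = [lambda t, p=p: t ** p for p in range(1, 5)]      # f_i(t) = t^i, d = 4
d = len(fs)
phi = lambda t: np.array([f(t) for f in fs])           # (f_1(t), ..., f_d(t))

# Any symmetric positive definite G defines the inner product <f_i, f_j> = G_ij.
A = rng.standard_normal((d, d))
G = A @ A.T + d * np.eye(d)

# Orthonormal basis e_k = sum_i B_ki f_i with B G B^T = I (Cholesky: B = L^{-1}).
L = np.linalg.cholesky(G)
B = np.linalg.inv(L)
e = lambda t: B @ phi(t)                               # (e_1(t), ..., e_d(t))

# Kernel K(y, x) = sum_k e_k(x) e_k(y); its expansion in the f basis is
# K(., x) = sum_j c_j f_j with coefficients c = G^{-1} phi(x).
K = lambda y, x: e(x) @ e(y)
x, y = 0.37, 0.81
c = np.linalg.solve(G, phi(x))
print(np.isclose(K(y, x), phi(y) @ c))                 # same function, two forms

# Reproducing property: for f = sum_i a_i f_i, <f, K(., x)>_H = a^T G c = f(x).
a = rng.standard_normal(d)
print(np.isclose(a @ G @ c, a @ phi(x)))               # True
```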

Example (e) (revisited): Sobolev. $\mathcal{H} = \{f : [0,1] \to \mathbb{R} \mid f \text{ differentiable almost everywhere},\ \int_0^1 f'(t)^2\,dt < +\infty,\ f(0) = 0\}$. Even though $L^2$ is not a RKHS, this Hilbert space is a RKHS. In particular, let us define the kernel as $K(x, y) := \min(x, y)$. Then $R_y(\cdot) = K(\cdot, y) = \min(\cdot, y)$, which is sketched in Figure 4.1. We need to check that $R_y(\cdot) \in \mathcal{H}$. Clearly $R_y(0) = 0$. What about the differentiability property? The derivative is
$$R_y'(x) = \begin{cases} 1 & \text{for } x < y \\ 0 & \text{for } x > y. \end{cases}$$
Thus $R_y(\cdot)$ is not differentiable everywhere, but it has only isolated weirdness at the point $x = y$. We can integrate the square of the derivative and check that it is finite, so $R_y(\cdot)$ does belong to the space.

[Figure 4.1: The function $R_y(x) = \min(x, y)$ for Example (e) (revisited): it rises with slope 1 on $[0, y]$ and stays constant at the value $y$ on $[y, 1]$.]

To check the reproducing property:
$$\langle R_y, f \rangle = \int_0^1 R_y'(t) f'(t)\,dt = \int_0^y f'(t)\,dt = f(y) - f(0) \quad \text{(by the Fundamental Theorem of Calculus)} \quad = f(y),$$
since $f(0) = 0$. Thus $\mathcal{H}$ is a RKHS.

To complete the correspondence between the Sobolev space $\mathcal{H}$ from Example (e) and the PSD kernel guaranteed by Theorem 4.1, one needs to show that $K(x, y) = \min(x, y)$ is a PSD kernel function, i.e. for any $k$ points $x_1, \ldots, x_k$, the matrix $K \in \mathbb{R}^{k \times k}$ with $K_{ij} = \min(x_i, x_j)$ is positive semi-definite.

Exercise: Prove this by demonstrating a Gaussian process with $K(x, y) = \min(x, y)$ as its covariance function.
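As a numerical companion to the exercise (evidence, not a proof; the grid and path count below are arbitrary choices): standard Brownian motion started at $W_0 = 0$ has $\mathrm{Cov}(W_s, W_t) = \min(s, t)$, so the empirical covariance of many simulated paths on a grid should approach the min-kernel Gram matrix, which, being a covariance matrix, must be positive semi-definite.

```python
import numpy as np

rng = np.random.default_rng(3)
grid = np.linspace(0.05, 1.0, 20)            # evaluation points in (0, 1]
dt = grid[1] - grid[0]
n_paths = 100_000

# Simulate Brownian motion on the grid: W_{t_k} is a cumulative sum of
# independent Gaussian increments (the first has variance grid[0]).
incr = rng.normal(0.0, np.sqrt(dt), size=(n_paths, len(grid)))
incr[:, 0] = rng.normal(0.0, np.sqrt(grid[0]), size=n_paths)
W = np.cumsum(incr, axis=1)

emp_cov = (W.T @ W) / n_paths                # estimates E[W_s W_t], since E[W_t] = 0
K_min = np.minimum.outer(grid, grid)         # min-kernel Gram matrix on the grid

print(np.max(np.abs(emp_cov - K_min)))       # small, and shrinks as n_paths grows
print(np.linalg.eigvalsh(K_min).min() >= 0)  # covariance matrices are PSD
```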