
Feature Mapping

Consider the following mapping φ for an example x = (x_1, ..., x_D):

    φ : x → (x_1^2, x_2^2, ..., x_D^2, x_1x_2, x_1x_3, ..., x_1x_D, ..., x_{D-1}x_D)

It's an example of a quadratic mapping: each new feature uses a pair of the original features.

Problem: such a mapping usually makes the number of features blow up, here from D to O(D^2) (see the sketch below).
- Computing the mapping itself can be inefficient in such cases.
- Moreover, using the mapped representation can be inefficient too: e.g., imagine computing the similarity between two examples as φ(x) · φ(z).

Thankfully, kernels help us avoid both of these issues:
- The mapping doesn't have to be explicitly computed.
- Computations with the mapped features remain efficient.
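To make the blow-up concrete, here is a minimal sketch in NumPy (the helper name `quadratic_features` is ours, not from the slides): it materializes the quadratic mapping explicitly and counts the resulting features.

```python
import numpy as np

def quadratic_features(x):
    """Explicit quadratic mapping: all squares and pairwise products.
    D original features become D*(D+1)//2 mapped features."""
    D = len(x)
    return np.array([x[i] * x[j] for i in range(D) for j in range(i, D)])

x = np.random.randn(1000)        # D = 1000 original features
phi_x = quadratic_features(x)
print(phi_x.shape)               # (500500,) -- the O(D^2) blow-up
```

Computing φ(x) · φ(z) this way costs O(D^2) time and memory per pair of examples; the kernel trick on the next slide gets the same number in O(D).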

Kernels as High-Dimensional Feature Mapping

Consider two examples x = (x_1, x_2) and z = (z_1, z_2), and suppose we are given a function k (a kernel) that takes x and z as inputs:

    k(x,z) = (x · z)^2
           = (x_1 z_1 + x_2 z_2)^2
           = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2
           = (x_1^2, √2 x_1x_2, x_2^2) · (z_1^2, √2 z_1z_2, z_2^2)
           = φ(x) · φ(z)

The above k implicitly defines a mapping φ to a higher-dimensional space:

    φ(x) = (x_1^2, √2 x_1x_2, x_2^2)

Note that we didn't have to define or compute this mapping: simply defining the kernel a certain way gives rise to a higher-dimensional mapping φ. Moreover, the kernel k(x,z) also computes the dot product φ(x) · φ(z), which would otherwise be much more expensive to compute explicitly. All kernel functions have these properties.
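This identity is easy to check numerically. A quick sketch (NumPy; the names `phi` and `k` are ours):

```python
import numpy as np

def phi(x):
    # The 3-D mapping implicitly defined by the quadratic kernel (2-D input)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, z):
    # Quadratic kernel: one 2-D dot product, then a square -- no mapping needed
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(k(x, z))                  # 1.0
print(np.dot(phi(x), phi(z)))   # 1.0, the same value via the explicit mapping
```

Both lines print the same number: the kernel computes the feature-space dot product without ever forming φ(x) or φ(z).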

Kernels: Formally Defined

Recall: each kernel k has an associated feature mapping φ, which takes an input x ∈ X (the input space) and maps it to F (the "feature space"). The kernel k(x,z) takes two inputs and gives their similarity in F:

    φ : X → F
    k : X × X → R,  k(x,z) = φ(x) · φ(z)

F needs to be a vector space with a dot product defined on it, also called a Hilbert space.

Can just any function be used as a kernel function? No: it must satisfy Mercer's condition.

Mercer's Condition

For k to be a kernel function, there must exist a Hilbert space F for which k defines a dot product. This holds if k is a positive semi-definite function:

    ∫∫ f(x) k(x,z) f(z) dx dz ≥ 0   ∀ f ∈ L_2

This is Mercer's condition.

Let k_1 and k_2 be two kernel functions; then the following are kernels as well:
- k(x,z) = k_1(x,z) + k_2(x,z) (direct sum)
- k(x,z) = α k_1(x,z) for α ≥ 0 (scalar product)
- k(x,z) = k_1(x,z) k_2(x,z) (direct product)

Kernels can also be constructed by composing these rules, as the sketch below illustrates.
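These closure rules can be checked empirically on any finite sample: combining valid kernels keeps the matrix of pairwise evaluations positive semi-definite, while an arbitrary similarity function generally fails. A minimal sketch (ours, assuming NumPy):

```python
import numpy as np

def linear(x, z):
    return np.dot(x, z)

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

# Closure rules: nonnegative scaling, direct sum, direct product
combo = lambda x, z: 2.0 * linear(x, z) + rbf(x, z) * linear(x, z)

# Not a kernel: raw Euclidean distance is not positive semi-definite
dist = lambda x, z: np.linalg.norm(x - z)

X = np.random.randn(30, 3)
for f in (combo, dist):
    K = np.array([[f(a, b) for b in X] for a in X])
    # min eigenvalue: ~0 or positive for combo, clearly negative for dist
    print(np.linalg.eigvalsh(K).min())
```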

The Kernel Matrix

The kernel function k also defines the kernel matrix K over the data. Given N examples {x_1, ..., x_N}, the (i,j)-th entry of K is defined as:

    K_ij = k(x_i, x_j) = φ(x_i) · φ(x_j)

- K_ij is the similarity between the i-th and j-th examples in the feature space F.
- K is the N × N matrix of pairwise similarities between the examples in F.
- K is a symmetric matrix.
- K is positive semi-definite, and in fact positive definite except in a few degenerate cases. For a positive definite matrix, z^T K z > 0 for all nonzero z ∈ R^N (equivalently, all its eigenvalues are positive).

The kernel matrix K is also known as the Gram matrix.
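In code, building the Gram matrix and verifying both properties takes a few lines. A sketch (ours, assuming NumPy), using an RBF kernel, which is introduced on the next slide:

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

X = np.random.randn(50, 4)                         # N = 50 examples
K = np.array([[rbf(a, b) for b in X] for a in X])  # N x N Gram matrix

print(np.allclose(K, K.T))                   # True: K is symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-9)  # True: eigenvalues (near) nonnegative
```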

Some Examples of Kernels

The following are the most popular kernels for real-valued vector inputs.

- Linear (trivial) kernel: k(x,z) = x · z (the mapping φ is the identity; no mapping at all)
- Quadratic kernel: k(x,z) = (x · z)^2 or (1 + x · z)^2
- Polynomial kernel of degree d: k(x,z) = (x · z)^d or (1 + x · z)^d
- Radial basis function (RBF) kernel: k(x,z) = exp(−γ ||x − z||^2), where γ is a hyperparameter (also called the kernel bandwidth)

The RBF kernel corresponds to an infinite-dimensional feature space F (i.e., you can't actually write down the vector φ(x)).

Note: kernel hyperparameters (e.g., d, γ) are chosen via cross-validation.
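For reference, each of these kernels fits in a line or two. A sketch (ours, assuming NumPy), with the hyperparameters d and γ exposed as arguments:

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, d=2, c=1.0):
    # c = 0 gives (x.z)^d; c = 1 gives (1 + x.z)^d; d = 2 is the quadratic kernel
    return (c + np.dot(x, z)) ** d

def rbf_kernel(x, z, gamma=1.0):
    # gamma is the bandwidth hyperparameter; larger gamma means narrower bumps
    return np.exp(-gamma * np.sum((x - z) ** 2))
```

In practice one would grid-search d and γ with cross-validation, as the note above says.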

Using Kernels

Kernels can turn a linear model into a nonlinear one. Recall: the kernel k(x,z) represents a dot product in some high-dimensional feature space F.

Any learning algorithm in which the examples only appear as dot products x_i · x_j can be kernelized (i.e., made nonlinear) by replacing the x_i · x_j terms with φ(x_i) · φ(x_j) = k(x_i, x_j).

Most learning algorithms are like that: the Perceptron, SVMs, linear regression, etc. Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering and Principal Component Analysis).
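As a concrete instance of this recipe, here is a minimal kernelized perceptron sketch (ours, not from the slides). The weight vector w = Σ_i α_i y_i φ(x_i) is never formed; training and prediction touch the data only through k(·,·):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """X: (N, D) array; y: labels in {-1, +1}.
    alpha[i] counts how often example i was misclassified."""
    N = len(X)
    alpha = np.zeros(N)
    K = np.array([[kernel(a, b) for b in X] for a in X])  # Gram matrix
    for _ in range(epochs):
        for i in range(N):
            # Dual form of w . x_i : sum_j alpha_j y_j k(x_j, x_i)
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1.0   # mistake: "add x_i to w", in dual form
    return alpha

def predict(alpha, X, y, kernel, x_new):
    score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X))
    return np.sign(score)
```

With the linear kernel this reduces exactly to the ordinary perceptron; swapping in the RBF kernel makes the same algorithm learn nonlinear decision boundaries.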