Introduction to Machine Learning


Introduction to Machine Learning: Kernel Methods
Varun Chandola, Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA. chandola@buffalo.edu

Outline
- Kernel methods: motivation, extension to non-vector data, examples
- Kernel regression and the kernel trick
- Choosing kernel functions: constructing new kernels using building blocks, the RBF (Gaussian) kernel, probabilistic kernel functions, kernels for other types of data
- More about kernels: kernel machines, generalizing RBF


Regression for Non-Vector Data
- What if $\mathbf{x} \notin \mathbb{R}^D$? Does $\mathbf{w}^\top \mathbf{x}$ still make sense? How do we adapt?
- One option: extract features from $\mathbf{x}$, but this is not always possible.
- Sometimes it is easier, and more natural, to compare two objects directly: a similarity function, or kernel.

A Similarity Kernel
- A domain-defined measure of similarity.
- Example (strings): length of the longest common subsequence, or the inverse of the edit distance.
- Example (multi-attribute categorical vectors): number of matching attribute values.
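As a quick illustration, here is a minimal Python sketch of two such domain-defined similarities; the function names and test inputs are made up for this example.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of two strings (a string similarity)."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if s[i] == t[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def categorical_match(u, v):
    """Number of attributes with matching values (a similarity for categorical vectors)."""
    return sum(a == b for a, b in zip(u, v))

print(lcs_length("kernel", "colonel"))                                          # 3
print(categorical_match(["red", "small", "round"], ["red", "large", "round"]))  # 2
```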

Can Regression be Adapted to Use a Kernel?
Ridge regression estimate:
$$\mathbf{w} = (\lambda I_D + X^\top X)^{-1} X^\top \mathbf{y}$$
Prediction at $\mathbf{x}_*$:
$$y_* = \mathbf{w}^\top \mathbf{x}_* = \left((\lambda I_D + X^\top X)^{-1} X^\top \mathbf{y}\right)^\top \mathbf{x}_*$$
This still needs training and test examples as length-$D$ vectors. Rearranging the above (Sherman-Morrison-Woodbury formula, i.e. the Matrix Inversion Lemma; see Murphy p. 120 or the Matrix Cookbook):
$$y_* = \mathbf{y}^\top (\lambda I_N + X X^\top)^{-1} X \mathbf{x}_*$$
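To make the rearrangement concrete, here is a small numpy check, under the assumption of randomly generated data and an arbitrarily chosen lambda, that the D-dimensional "primal" solution and the N-dimensional "dual" form give the same prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 5, 0.1
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
x_star = rng.normal(size=D)

# Primal: w = (lambda I_D + X^T X)^{-1} X^T y
w = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
pred_primal = w @ x_star

# Dual: y_* = y^T (lambda I_N + X X^T)^{-1} X x_*
pred_dual = y @ np.linalg.solve(lam * np.eye(N) + X @ X.T, X @ x_star)

print(np.allclose(pred_primal, pred_dual))  # True
```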


Using the Dot Product
$$y_* = \mathbf{y}^\top (\lambda I_N + X X^\top)^{-1} X \mathbf{x}_*$$
What is $X X^\top$?
$$X X^\top = \begin{bmatrix} \langle \mathbf{x}_1, \mathbf{x}_1\rangle & \langle \mathbf{x}_1, \mathbf{x}_2\rangle & \cdots & \langle \mathbf{x}_1, \mathbf{x}_N\rangle \\ \langle \mathbf{x}_2, \mathbf{x}_1\rangle & \langle \mathbf{x}_2, \mathbf{x}_2\rangle & \cdots & \langle \mathbf{x}_2, \mathbf{x}_N\rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle \mathbf{x}_N, \mathbf{x}_1\rangle & \langle \mathbf{x}_N, \mathbf{x}_2\rangle & \cdots & \langle \mathbf{x}_N, \mathbf{x}_N\rangle \end{bmatrix}$$
What is $X \mathbf{x}_*$?
$$X \mathbf{x}_* = \begin{bmatrix} \langle \mathbf{x}_1, \mathbf{x}_*\rangle \\ \langle \mathbf{x}_2, \mathbf{x}_*\rangle \\ \vdots \\ \langle \mathbf{x}_N, \mathbf{x}_*\rangle \end{bmatrix}$$

Generalizing to Non-linear Regression
Consider a set of $P$ functions that can be applied to an input example $\mathbf{x}$: $\boldsymbol{\phi} = \{\phi_1, \phi_2, \ldots, \phi_P\}$.
$$\Phi = \begin{bmatrix} \phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \cdots & \phi_P(\mathbf{x}_1) \\ \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \cdots & \phi_P(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(\mathbf{x}_N) & \phi_2(\mathbf{x}_N) & \cdots & \phi_P(\mathbf{x}_N) \end{bmatrix}$$
Prediction:
$$y_* = \mathbf{y}^\top (\lambda I_N + \Phi \Phi^\top)^{-1} \Phi\, \boldsymbol{\phi}(\mathbf{x}_*)$$
Each entry in $\Phi \Phi^\top$ is of the form $\langle \boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}')\rangle$.

The Great Kernel Trick
- Replace the dot product $\langle \mathbf{x}_i, \mathbf{x}_j\rangle$ with a function $k(\mathbf{x}_i, \mathbf{x}_j)$.
- Replace $X X^\top$ with $K$, the Gram matrix, where $k$ is the kernel function and $K[i][j] = k(\mathbf{x}_i, \mathbf{x}_j)$ is the similarity between two data objects.
- Kernel regression:
$$y_* = \mathbf{y}^\top (\lambda I_N + K)^{-1} \mathbf{k}(\mathbf{x}, \mathbf{x}_*)$$
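A minimal sketch of kernel regression with this formula, using the RBF kernel defined on a later slide; the 1-D dataset and the hyperparameters are made up for illustration.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
N, lam = 40, 0.1
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix
alpha = np.linalg.solve(lam * np.eye(N) + K, y)                # (lambda I_N + K)^{-1} y

x_star = np.array([0.5])
k_star = np.array([rbf_kernel(xi, x_star) for xi in X])        # k(x, x_*)
y_star = alpha @ k_star                                        # y^T (lambda I_N + K)^{-1} k(x, x_*)
print(y_star, np.sin(0.5))                                     # prediction is roughly sin(0.5)
```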

How to Construct a Kernel?
- We already know the simplest kernel function: $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j$.
- Approach 1: start with basis functions, $k(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}_j)$.
- Approach 2: direct design (good for non-vector inputs); measure the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ directly. The Gram matrix must be positive semi-definite, and $k$ should be symmetric.

Using Building Blocks
Given valid kernels $k_1$ and $k_2$, the following are also valid kernels:
- $k(\mathbf{x}_i, \mathbf{x}_j) = c\, k_1(\mathbf{x}_i, \mathbf{x}_j)$
- $k(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i)\, k_1(\mathbf{x}_i, \mathbf{x}_j)\, f(\mathbf{x}_j)$
- $k(\mathbf{x}_i, \mathbf{x}_j) = q(k_1(\mathbf{x}_i, \mathbf{x}_j))$, where $q$ is a polynomial with non-negative coefficients
- $k(\mathbf{x}_i, \mathbf{x}_j) = \exp(k_1(\mathbf{x}_i, \mathbf{x}_j))$
- $k(\mathbf{x}_i, \mathbf{x}_j) = k_1(\mathbf{x}_i, \mathbf{x}_j) + k_2(\mathbf{x}_i, \mathbf{x}_j)$
- $k(\mathbf{x}_i, \mathbf{x}_j) = k_1(\mathbf{x}_i, \mathbf{x}_j)\, k_2(\mathbf{x}_i, \mathbf{x}_j)$
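A short sketch of the closure rules in action: a composite kernel built as a polynomial of a linear kernel plus a scaled RBF kernel (the constants and data are arbitrary choices for this example), with its Gram matrix checked numerically for positive semi-definiteness.

```python
import numpy as np

k1 = lambda a, b: a @ b                                       # linear kernel
k2 = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))         # RBF kernel
k_new = lambda a, b: (k1(a, b) + 1.0) ** 2 + 3.0 * k2(a, b)   # q(k1) + c * k2

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))
K = np.array([[k_new(xi, xj) for xj in X] for xi in X])
print(np.all(np.linalg.eigvalsh(K) >= -1e-8))                 # True: still a valid kernel
```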

Popular Kernels
If $K$ is positive definite, $k$ is called a Mercer kernel.
- Radial Basis Function (Gaussian) kernel:
$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x}_i - \mathbf{x}_j\|^2\right)$$
- Cosine similarity:
$$k(\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i^\top \mathbf{x}_j}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}$$
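For reference, a sketch of vectorized Gram-matrix computations for these two kernels, assuming the rows of X are the examples.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T       # pairwise squared distances
    return np.exp(-d2 / (2 * sigma ** 2))

def cosine_gram(X):
    norms = np.linalg.norm(X, axis=1)
    return (X @ X.T) / np.outer(norms, norms)

X = np.random.default_rng(4).normal(size=(5, 3))
print(rbf_gram(X).shape, cosine_gram(X).shape)          # (5, 5) (5, 5)
```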

The RBF Kernel
$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x}_i - \mathbf{x}_j\|^2\right)$$
This kernel corresponds to mapping inputs to an infinite-dimensional feature space.

Probabilistic Kernel Functions
- Allow using generative distributions in discriminative settings.
- Use a class-independent probability distribution for the input $\mathbf{x}$:
$$k(\mathbf{x}_i, \mathbf{x}_j) = p(\mathbf{x}_i \mid \theta)\, p(\mathbf{x}_j \mid \theta)$$
- Two inputs are more similar if both have high probabilities.
- Bayesian kernel:
$$k(\mathbf{x}_i, \mathbf{x}_j) = \int p(\mathbf{x}_i \mid \theta)\, p(\mathbf{x}_j \mid \theta)\, p(\theta)\, d\theta$$

Kernels for Non-vector Data
- String kernels
- Pyramid kernels

Why Use Kernels?
- $x \in \mathbb{R}$: no linear separator exists.
- Map $x \mapsto \{x, x^2\}$: the data become separable in the 2-D space.
[Figure: 1-D data on the $x$ axis and the same data, now separable, on the $(x, x^2)$ plane.]
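A tiny numerical illustration of this 1-D example; the data points and the threshold are made up.

```python
import numpy as np

x = np.array([-3.0, -2.5, -0.5, 0.2, 0.8, 2.7, 3.1])
y = np.array([   1,    1,   -1,  -1,  -1,   1,   1])    # +1: outer points, -1: inner points

features = np.column_stack([x, x ** 2])                 # map x -> (x, x^2)
# In the (x, x^2) space the threshold x^2 = 4 separates the two classes perfectly.
separable = np.all(features[y == 1, 1] > 4.0) and np.all(features[y == -1, 1] < 4.0)
print(separable)                                        # True
```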

Another Example
- $\mathbf{x} \in \mathbb{R}^2$: no linear separator exists.
- Map $\mathbf{x} \mapsto \{x_1^2, \sqrt{2}\, x_1 x_2, x_2^2\}$.
- A circle as the decision boundary in the input space becomes a linear boundary in the mapped space.
[Figure: 2-D data with a circular decision boundary and its image in the 3-D feature space.]

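A sketch of this 2-D example with synthetic data: after the map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, the circle $x_1^2 + x_2^2 = r^2$ becomes the plane $z_1 + z_3 = r^2$ in feature space, so a linear rule suffices there. The data and the radius are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 4.0, -1, 1)     # inside vs outside a radius-2 circle

Z = np.column_stack([X[:, 0] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1] ** 2])
# Linear rule in feature space: sign(z1 + z3 - 4) reproduces the circular boundary exactly.
pred = np.where(Z[:, 0] + Z[:, 2] < 4.0, -1, 1)
print(np.mean(pred == y))                                   # 1.0
```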

The Gaussian Kernel
The squared dot-product kernel ($\mathbf{x}_i, \mathbf{x}_j \in \mathbb{R}^2$) has an explicit feature map:
$$k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j)^2 = \boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}_j), \qquad \boldsymbol{\phi}(\mathbf{x}_i) = \{x_{i1}^2, \sqrt{2}\, x_{i1} x_{i2}, x_{i2}^2\}$$
What about the Gaussian kernel (radial basis function)?
$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{1}{2\sigma^2}\|\mathbf{x}_i - \mathbf{x}_j\|^2\right)$$
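A quick numerical check, on two arbitrary 2-D vectors, that this $\boldsymbol{\phi}$ realizes the squared dot-product kernel.

```python
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(a) @ phi(b), (a @ b) ** 2)   # both equal 1.0
```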

Why is the Gaussian Kernel Mapping to Infinite Dimensions?
Assume $\sigma = 1$ and $\mathbf{x} \in \mathbb{R}$ (denoted $x$). Then
$$k(x_i, x_j) = \exp(-x_i^2)\exp(-x_j^2)\exp(2 x_i x_j) = \exp(-x_i^2)\exp(-x_j^2)\sum_{k=0}^{\infty} \frac{2^k x_i^k x_j^k}{k!} = \sum_{k=0}^{\infty}\left(\frac{2^{k/2}}{\sqrt{k!}}\, x_i^k \exp(-x_i^2)\right)\left(\frac{2^{k/2}}{\sqrt{k!}}\, x_j^k \exp(-x_j^2)\right)$$
using the Maclaurin series expansion of $\exp(2 x_i x_j)$. Hence
$$k(x_i, x_j) = \boldsymbol{\phi}(x_i)^\top \boldsymbol{\phi}(x_j), \qquad \boldsymbol{\phi}(x) = \left[\exp(-x^2),\ \frac{2^{1/2}}{\sqrt{1!}}\, x \exp(-x^2),\ \frac{2^{2/2}}{\sqrt{2!}}\, x^2 \exp(-x^2),\ \ldots\right]^\top$$
an infinite-dimensional feature vector.
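A numerical check of the expansion above, assuming (following the slide's algebra) that the constant $1/(2\sigma^2)$ is taken to be 1, so $k(x_i, x_j) = \exp(-(x_i - x_j)^2)$; the truncated feature map converges quickly to the kernel value.

```python
import numpy as np
from math import exp, factorial

def phi(x, K=20):
    # first K coordinates of the infinite feature map: sqrt(2^k / k!) x^k exp(-x^2)
    return np.array([(2.0 ** k / factorial(k)) ** 0.5 * x ** k * exp(-x ** 2) for k in range(K)])

xi, xj = 0.7, -1.2
print(phi(xi) @ phi(xj))       # ~0.027, the truncated feature-space dot product
print(exp(-(xi - xj) ** 2))    # ~0.027, the kernel value
```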

Kernel Machines
- We can use a kernel function to generate new features: evaluate the kernel between each input and a set of $K$ centroids:
$$\boldsymbol{\phi}(\mathbf{x}) = [k(\mathbf{x}, \boldsymbol{\mu}_1), k(\mathbf{x}, \boldsymbol{\mu}_2), \ldots, k(\mathbf{x}, \boldsymbol{\mu}_K)]$$
- Regression: $y = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})$; classification: $y \sim \mathrm{Ber}(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}))$.
- If $k$ is a Gaussian kernel, this is a Radial Basis Function (RBF) network.
- How to choose the $\boldsymbol{\mu}_i$? Clustering, or random selection.
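A sketch of this construction on made-up 1-D data, with centroids chosen by simple random selection: build RBF features from $K$ centroids and fit a linear model on them.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K, sigma = 100, 10, 0.5
X = rng.uniform(-3, 3, size=N)
y = np.sin(X) + 0.1 * rng.normal(size=N)

mu = rng.choice(X, size=K, replace=False)                          # random centroids
Phi = np.exp(-(X[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))  # N x K RBF features
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                        # linear fit y ~ Phi w

x_new = 1.0
phi_new = np.exp(-(x_new - mu) ** 2 / (2 * sigma ** 2))
print(phi_new @ w, np.sin(x_new))                                  # prediction close to sin(1.0)
```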

Generalizing RBF
Another option: use every input example as a centroid:
$$\boldsymbol{\phi}(\mathbf{x}) = [k(\mathbf{x}, \mathbf{x}_1), k(\mathbf{x}, \mathbf{x}_2), \ldots, k(\mathbf{x}, \mathbf{x}_N)]$$
