Introduction to Machine Learning: Kernel Methods
Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA
chandola@buffalo.edu
Chandola@UB CSE 474/574
Outline
- Kernel Methods
  - Extension to Non-Vector Data Examples
  - Kernel Regression
  - Kernel Trick
- Choosing Kernel Functions
  - Constructing New Kernels Using Building Blocks
  - RBF Kernel
  - Probabilistic Kernel Functions
  - Kernels for Other Types of Data
- More About Kernels
  - Motivation
  - Gaussian Kernel
  - Kernel Machines
  - Generalizing RBF
Regression for Non-Vector Data Examples
What if x ∉ ℝᴰ? Does wᵀx make sense? How do we adapt?
- One option: extract a feature vector from x, but this is not always possible.
- Sometimes it is easier, and more natural, to compare two objects directly: a similarity function, or kernel.
A Similarity Kernel
A domain-defined measure of similarity.
- Example (Strings): length of the longest common subsequence, or the inverse of edit distance
- Example (Multi-attribute Categorical Vectors): number of matching values
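As a concrete sketch of the string example above, a longest-common-subsequence similarity can be computed by dynamic programming (the function names and the cosine-style normalization are illustrative choices, not from any library):

```python
def lcs_length(s, t):
    """Dynamic-programming length of the longest common subsequence of s and t."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def string_kernel(s, t):
    # Normalize so that k(s, s) = 1 for any non-empty string.
    return lcs_length(s, t) / (lcs_length(s, s) * lcs_length(t, t)) ** 0.5
```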
Can Regression be Adapted to Use a Kernel?
Ridge regression estimate:
w = (λIᴰ + XᵀX)⁻¹Xᵀy
Prediction at x⁎:
y⁎ = wᵀx⁎ = ((λIᴰ + XᵀX)⁻¹Xᵀy)ᵀx⁎
This still needs training and test examples as D-length vectors.
Rearranging the above (Sherman-Morrison-Woodbury formula or Matrix Inversion Lemma [see Murphy p. 120, Matrix Cookbook]):
y⁎ = yᵀ(λIᴺ + XXᵀ)⁻¹Xx⁎
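The rearranged "dual" form can be checked numerically. A minimal sketch (random data and dimensions are assumptions for illustration) comparing the primal prediction wᵀx⁎ against yᵀ(λIᴺ + XXᵀ)⁻¹Xx⁎:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 8, 3, 0.5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
x_star = rng.normal(size=D)

# Primal: w = (lam I_D + X^T X)^{-1} X^T y, then predict w^T x*
w = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
primal = w @ x_star

# Dual (rearranged): y* = y^T (lam I_N + X X^T)^{-1} X x*
dual = y @ np.linalg.solve(lam * np.eye(N) + X @ X.T, X @ x_star)

assert np.isclose(primal, dual)
```

The dual form only touches the training data through the N×N matrix XXᵀ of pairwise dot products, which is what makes the kernel substitution on the following slides possible.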
Using the Dot Product
y⁎ = yᵀ(λIᴺ + XXᵀ)⁻¹Xx⁎
What is XXᵀ?
XXᵀ =
⎡ ⟨x₁, x₁⟩  ⟨x₁, x₂⟩  …  ⟨x₁, xᴺ⟩ ⎤
⎢ ⟨x₂, x₁⟩  ⟨x₂, x₂⟩  …  ⟨x₂, xᴺ⟩ ⎥
⎢     ⋮          ⋮      ⋱     ⋮    ⎥
⎣ ⟨xᴺ, x₁⟩  ⟨xᴺ, x₂⟩  …  ⟨xᴺ, xᴺ⟩ ⎦
What is Xx⁎?
Xx⁎ = [⟨x₁, x⁎⟩, ⟨x₂, x⁎⟩, …, ⟨xᴺ, x⁎⟩]ᵀ
Generalizing to Non-linear Regression
Consider a set of P functions that can be applied to an input example x:
φ = {φ₁, φ₂, …, φ_P}
Φ =
⎡ φ₁(x₁)  φ₂(x₁)  …  φ_P(x₁) ⎤
⎢ φ₁(x₂)  φ₂(x₂)  …  φ_P(x₂) ⎥
⎢    ⋮       ⋮     ⋱     ⋮    ⎥
⎣ φ₁(xᴺ)  φ₂(xᴺ)  …  φ_P(xᴺ) ⎦
Prediction:
y⁎ = yᵀ(λIᴺ + ΦΦᵀ)⁻¹Φφ(x⁎)
Each entry in ΦΦᵀ is ⟨φ(xᵢ), φ(xⱼ)⟩
The Great Kernel Trick
- Replace the dot product ⟨xᵢ, xⱼ⟩ with a function k(xᵢ, xⱼ)
- Replace XXᵀ with K, the Gram matrix; k is the kernel function
- K[i][j] = k(xᵢ, xⱼ): the similarity between two data objects
Kernel Regression:
y⁎ = yᵀ(λIᴺ + K)⁻¹k(X, x⁎)
where k(X, x⁎) = [k(x₁, x⁎), …, k(xᴺ, x⁎)]ᵀ
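The kernel regression formula above can be sketched directly (the helper names and the RBF kernel choice are illustrative assumptions):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    # Gaussian (RBF) kernel between two vectors
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_regression_predict(X, y, x_star, lam=0.1, kernel=rbf):
    N = len(X)
    # Gram matrix K[i][j] = k(x_i, x_j)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Vector of kernel evaluations against the test point: k(X, x*)
    k_star = np.array([kernel(xi, x_star) for xi in X])
    # y* = y^T (lam I_N + K)^{-1} k(X, x*)
    return y @ np.linalg.solve(lam * np.eye(N) + K, k_star)
```

With a small λ, predicting at a training point nearly recovers its training target, since k(X, x⁎) is then a column of K.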
How to Construct a Kernel?
We already know the simplest kernel function: k(xᵢ, xⱼ) = xᵢᵀxⱼ
Approach 1: Start with basis functions
k(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
Approach 2: Direct design (good for non-vector inputs)
- Measure similarity between xᵢ and xⱼ
- The Gram matrix must be positive semi-definite
- k should be symmetric
Using Building Blocks
Given valid kernels k₁ and k₂, the following are also valid kernels:
- k(xᵢ, xⱼ) = c·k₁(xᵢ, xⱼ), for c > 0
- k(xᵢ, xⱼ) = f(xᵢ)k₁(xᵢ, xⱼ)f(xⱼ)
- k(xᵢ, xⱼ) = q(k₁(xᵢ, xⱼ)), where q is a polynomial with non-negative coefficients
- k(xᵢ, xⱼ) = exp(k₁(xᵢ, xⱼ))
- k(xᵢ, xⱼ) = k₁(xᵢ, xⱼ) + k₂(xᵢ, xⱼ)
- k(xᵢ, xⱼ) = k₁(xᵢ, xⱼ)·k₂(xᵢ, xⱼ)
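These closure rules can be sanity-checked numerically: each combination of two valid base kernels should still produce a positive semi-definite Gram matrix. A sketch (the data, base kernels, and tolerance are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))

def k1(a, b):
    return a @ b  # linear kernel

def k2(a, b):
    return np.exp(-0.5 * np.sum((a - b) ** 2))  # RBF kernel

def gram(k, X):
    return np.array([[k(a, b) for b in X] for a in X])

combos = [
    lambda a, b: 3.0 * k1(a, b),        # scaling by c > 0
    lambda a, b: k1(a, b) + k2(a, b),   # sum of kernels
    lambda a, b: k1(a, b) * k2(a, b),   # product of kernels
    lambda a, b: np.exp(k1(a, b)),      # exponentiation
]
for k in combos:
    # All eigenvalues should be >= 0 up to numerical tolerance
    assert np.linalg.eigvalsh(gram(k, X)).min() > -1e-8
```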
Popular Kernels
If K is positive definite, k is a Mercer kernel.
Radial Basis Function (Gaussian) Kernel:
k(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / (2σ²))
Cosine Similarity:
k(xᵢ, xⱼ) = xᵢᵀxⱼ / (‖xᵢ‖·‖xⱼ‖)
The RBF Kernel
k(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / (2σ²))
Maps inputs to an infinite-dimensional space.
Probabilistic Kernel Functions
Allows using generative distributions in discriminative settings. Uses a class-independent probability distribution for the input x:
k(xᵢ, xⱼ) = p(xᵢ | θ)p(xⱼ | θ)
Two inputs are more similar if both have high probabilities.
Bayesian Kernel:
k(xᵢ, xⱼ) = ∫ p(xᵢ | θ)p(xⱼ | θ)p(θ) dθ
Kernels for Non-vector Data
- String Kernels
- Pyramid Kernels
Why Use Kernels?
x ∈ ℝ: no linear separator. Map x → {x, x²}: separable in the 2D space.
[Figure: 1D data on the x axis, and the same data plotted as (x, x²), where a line separates the classes]
Another Example
x ∈ ℝ²: no linear separator. Map x → {x₁², √2·x₁x₂, x₂²}: a circle becomes the decision boundary.
[Figure: 2D data separable only by a circle, and the same data after the quadratic mapping, separable by a plane]
The Gaussian Kernel
The squared dot product kernel (xᵢ, xⱼ ∈ ℝ²):
k(xᵢ, xⱼ) = (xᵢᵀxⱼ)² = φ(xᵢ)ᵀφ(xⱼ)
φ(xᵢ) = {xᵢ₁², √2·xᵢ₁xᵢ₂, xᵢ₂²}
What about the Gaussian kernel (radial basis function)?
k(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / (2σ²))
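The stated identity for the squared dot product kernel can be verified directly with the quadratic feature map:

```python
import numpy as np

def phi(x):
    # Quadratic feature map from the slide: x -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
# The map realizes the squared dot product kernel: phi(a).phi(b) = (a.b)^2
assert np.isclose(phi(a) @ phi(b), (a @ b) ** 2)
```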
Why the Gaussian Kernel Maps to Infinite Dimensions
Assume 2σ² = 1 and x ∈ ℝ (denoted as x). Then:
k(xᵢ, xⱼ) = exp(−(xᵢ − xⱼ)²) = exp(−xᵢ²)·exp(−xⱼ²)·exp(2xᵢxⱼ)
= exp(−xᵢ²)·exp(−xⱼ²)·Σ_{k=0}^∞ (2ᵏxᵢᵏxⱼᵏ) / k!   (Maclaurin series of exp)
= Σ_{k=0}^∞ [(2^{k/2}/√k!)·xᵢᵏ·exp(−xᵢ²)]·[(2^{k/2}/√k!)·xⱼᵏ·exp(−xⱼ²)]
So k(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) with the infinite-dimensional feature map
φₖ(x) = (2^{k/2}/√k!)·xᵏ·exp(−x²), k = 0, 1, 2, …
Kernel Machines
We can use a kernel function to generate new features: evaluate the kernel between each input and a set of K centroids.
φ(x) = [k(x, μ₁), k(x, μ₂), …, k(x, μ_K)]
Regression: y = wᵀφ(x); Classification: y ∼ Ber(sigm(wᵀφ(x)))
If k is a Gaussian kernel, this is a Radial Basis Function (RBF) network.
How to choose the μᵢ?
- Clustering
- Random selection
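A minimal sketch of building kernel-machine features against randomly selected centroids (the data, helper names, and number of centroids are assumptions for illustration):

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def kernel_features(X, centroids, sigma=1.0):
    # phi(x) = [k(x, mu_1), ..., k(x, mu_K)] for every row x of X
    return np.array([[rbf(x, mu, sigma) for mu in centroids] for x in X])

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
# "Random selection": pick 5 training points as centroids
centroids = X[rng.choice(len(X), size=5, replace=False)]
Phi = kernel_features(X, centroids)
# Phi is N x K; each column is "similarity to one centroid",
# and any linear model can now be fit on Phi instead of X.
assert Phi.shape == (20, 5)
```

Swapping the random selection for cluster centers (e.g. k-means centroids) is the other choice mentioned above; the feature construction itself is unchanged.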
Generalizing RBF
Another option: use every input example as a centroid.
φ(x) = [k(x, x₁), k(x, x₂), …, k(x, xᴺ)]