Introduction to Machine Learning



1 Introduction to Machine Learning: Kernel Methods
Varun Chandola
Computer Science & Engineering, State University of New York at Buffalo, Buffalo, NY, USA
CSE 474/574

2 Outline
Kernel Methods
Extension to Non-Vector Data Examples
Kernel Regression
Kernel Trick
Choosing Kernel Functions
Constructing New Kernels Using Building Blocks
Kernels
RBF Kernel
Probabilistic Kernel Functions
Kernels for Other Types of Data
More About Kernels
Motivation
Gaussian Kernel
Kernel Machines
Generalizing RBF


4 Regression for Non-Vector Data Examples
What if $x \notin \mathbb{R}^D$? Does $w^\top x$ make sense?
How to adapt?
1. Extract features from $x$.
2. This is not always possible.
Sometimes it is easier or more natural to compare two objects directly, using a similarity function, or kernel.

5 A Similarity Kernel
Domain-defined measure of similarity.
Example (strings): length of the longest common subsequence, or inverse of the edit distance.
Example (multi-attribute categorical vectors): number of matching values.
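The two example similarity measures above can be sketched in a few lines of Python. This is only an illustration: the function names `lcs_length` and `matching_values` are hypothetical, and nothing here checks whether the resulting similarity is a valid (positive semi-definite) kernel.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def matching_values(u, v) -> int:
    """Number of attribute positions at which two categorical vectors agree."""
    return sum(1 for a, b in zip(u, v) if a == b)
```

For instance, `matching_values(("red", "S", "cotton"), ("red", "M", "cotton"))` counts two matching attributes.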

6 Can Regression be Adapted to Use a Kernel?
Ridge regression estimate: $w = (\lambda I_D + X^\top X)^{-1} X^\top y$
Prediction at $x^*$: $y^* = w^\top x^* = ((\lambda I_D + X^\top X)^{-1} X^\top y)^\top x^*$
Still needs training and test examples as $D$-length vectors.
Rearranging the above (Sherman-Morrison-Woodbury formula, or matrix inversion lemma [see Murphy p. 120, Matrix Cookbook]):
$y^* = y^\top (\lambda I_N + X X^\top)^{-1} X x^*$
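As a quick numerical sanity check of the rearrangement above, the following sketch compares the primal ($D \times D$) and rearranged ($N \times N$) forms of the ridge prediction on random data. The data, the value of `lam`, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 20, 5, 0.1
X = rng.normal(size=(N, D))      # training inputs, one example per row
y = rng.normal(size=N)           # training targets
x_star = rng.normal(size=D)      # test input

# Primal form: w = (lam I_D + X^T X)^{-1} X^T y, then y* = w^T x*
w = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)
y_primal = w @ x_star

# Rearranged form: y* = y^T (lam I_N + X X^T)^{-1} X x*
alpha = np.linalg.solve(lam * np.eye(N) + X @ X.T, y)
y_dual = alpha @ (X @ x_star)

print(y_primal, y_dual)  # the two values should agree up to rounding
```

The second form only touches the data through $XX^\top$ and $Xx^*$, which is what makes the later replacement by a kernel possible.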


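8 Using the Dot Product
$y^* = y^\top (\lambda I_N + XX^\top)^{-1} X x^*$
What is $XX^\top$?
$$XX^\top = \begin{pmatrix}
\langle x_1, x_1 \rangle & \langle x_1, x_2 \rangle & \cdots & \langle x_1, x_N \rangle \\
\langle x_2, x_1 \rangle & \langle x_2, x_2 \rangle & \cdots & \langle x_2, x_N \rangle \\
\vdots & \vdots & \ddots & \vdots \\
\langle x_N, x_1 \rangle & \langle x_N, x_2 \rangle & \cdots & \langle x_N, x_N \rangle
\end{pmatrix}$$
What is $Xx^*$?
$$Xx^* = \begin{pmatrix} \langle x_1, x^* \rangle \\ \langle x_2, x^* \rangle \\ \vdots \\ \langle x_N, x^* \rangle \end{pmatrix}$$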

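9 Generalizing to Non-linear Regression
Consider a set of $P$ functions that can be applied on an input example $x$: $\phi = \{\phi_1, \phi_2, \ldots, \phi_P\}$
$$\Phi = \begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_P(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_P(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_P(x_N)
\end{pmatrix}$$
Prediction: $y^* = y^\top (\lambda I_N + \Phi\Phi^\top)^{-1} \Phi \phi(x^*)$
Each entry in $\Phi\Phi^\top$ is $\langle \phi(x), \phi(x') \rangle$.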
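A minimal sketch of this prediction rule, assuming a hypothetical polynomial basis $\{1, x, x^2\}$ on scalar inputs and a small regularizer. The basis, the toy data, and the value of `lam` are illustrative choices, not part of the slides.

```python
import numpy as np

# Hypothetical basis functions phi_1, phi_2, phi_3 (P = 3)
basis = [lambda x: 1.0, lambda x: x, lambda x: x ** 2]


def phi(x):
    """Apply every basis function to a scalar input."""
    return np.array([f(x) for f in basis], dtype=float)


x_train = np.linspace(-1.0, 1.0, 11)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train ** 2   # noise-free quadratic targets

Phi = np.vstack([phi(x) for x in x_train])            # N x P design matrix
lam = 1e-6
x_star = 0.5

# y* = y^T (lam I_N + Phi Phi^T)^{-1} Phi phi(x*)
alpha = np.linalg.solve(lam * np.eye(len(x_train)) + Phi @ Phi.T, y_train)
y_star = alpha @ (Phi @ phi(x_star))
```

Because the targets are an exact quadratic and `lam` is tiny, `y_star` should land very close to the true value $1 + 2(0.5) - 3(0.5)^2 = 1.25$.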

10 The Great Kernel Trick
Replace the dot product $\langle x_i, x_j \rangle$ with a function $k(x_i, x_j)$.
Replace $XX^\top$ with $K$.
$K$: Gram matrix; $k$: kernel function.
$K[i][j] = k(x_i, x_j)$: similarity between two data objects.
Kernel regression: $y^* = y^\top (\lambda I_N + K)^{-1} k(X, x^*)$, where $k(X, x^*) = [k(x_1, x^*), \ldots, k(x_N, x^*)]^\top$ takes the place of $Xx^*$.
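The kernel regression formula can be sketched with the kernel passed in as an ordinary Python function, so the same code works for any pairwise similarity. The helper names `gram_matrix` and `kernel_ridge_predict` are hypothetical.

```python
import numpy as np


def gram_matrix(k, A, B):
    """Matrix of kernel evaluations k(a, b) for every a in A and b in B."""
    return np.array([[k(a, b) for b in B] for a in A], dtype=float)


def kernel_ridge_predict(k, X_train, y_train, x_star, lam):
    """y* = y^T (lam I_N + K)^{-1} k(X, x*) for an arbitrary kernel function k."""
    K = gram_matrix(k, X_train, X_train)
    k_star = np.array([k(x, x_star) for x in X_train], dtype=float)
    alpha = np.linalg.solve(lam * np.eye(len(X_train)) + K, np.asarray(y_train, dtype=float))
    return float(alpha @ k_star)
```

With the linear kernel `lambda a, b: a @ b` this reduces to ordinary ridge regression; the training objects never need to be vectors as long as `k` can compare them.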

11 How to Construct a Kernel?
Already know the simplest kernel function: $k(x_i, x_j) = x_i^\top x_j$
Approach 1: Start with basis functions: $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$
Approach 2: Direct design (good for non-vector inputs)
Measure similarity between $x_i$ and $x_j$.
Gram matrix must be positive semi-definite.
$k$ should be symmetric.
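For a directly designed similarity, the two conditions above can at least be checked numerically on a sample of data. The sketch below tests one given Gram matrix for symmetry and positive semi-definiteness; the function name `is_valid_gram` and the tolerance are assumptions, and passing on one sample does not prove that the kernel is valid in general.

```python
import numpy as np


def is_valid_gram(K, tol=1e-8):
    """Check that K is square, symmetric, and has no eigenvalue below -tol."""
    K = np.asarray(K, dtype=float)
    if K.ndim != 2 or K.shape[0] != K.shape[1]:
        return False
    if not np.allclose(K, K.T, atol=tol):
        return False
    eigvals = np.linalg.eigvalsh(K)   # eigenvalues of a symmetric matrix
    return bool(eigvals.min() >= -tol)
```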

12 Using Building Blocks
Given valid kernels $k_1$ and $k_2$, the following are also valid kernels:
$k(x_i, x_j) = c\,k_1(x_i, x_j)$, with constant $c > 0$
$k(x_i, x_j) = f(x_i)\,k_1(x_i, x_j)\,f(x_j)$
$k(x_i, x_j) = q(k_1(x_i, x_j))$, where $q$ is a polynomial with non-negative coefficients
$k(x_i, x_j) = \exp(k_1(x_i, x_j))$
$k(x_i, x_j) = k_1(x_i, x_j) + k_2(x_i, x_j)$
$k(x_i, x_j) = k_1(x_i, x_j)\,k_2(x_i, x_j)$
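Since each building block takes kernels and returns a kernel, it maps naturally onto higher-order functions. The following sketch covers four of the rules; the combinator names (`scale`, `add`, `product`, `exponentiate`) and the plain-Python `linear` kernel are hypothetical.

```python
import math


def linear(a, b):
    """Dot-product kernel on equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))


def scale(k1, c):
    """c * k1, intended for c > 0."""
    return lambda a, b: c * k1(a, b)


def add(k1, k2):
    """k1 + k2."""
    return lambda a, b: k1(a, b) + k2(a, b)


def product(k1, k2):
    """k1 * k2."""
    return lambda a, b: k1(a, b) * k2(a, b)


def exponentiate(k1):
    """exp(k1)."""
    return lambda a, b: math.exp(k1(a, b))


# Example composition: 2 * linear + linear * linear
k = add(scale(linear, 2.0), product(linear, linear))
```

Here `k((1, 2), (3, 4))` evaluates $2 \cdot 11 + 11^2$, since the dot product of the two inputs is 11.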

13 Popular Kernels
If $K$ is positive definite, $k$ is a Mercer kernel.
Radial basis function (RBF) or Gaussian kernel: $k(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$
Cosine similarity: $k(x_i, x_j) = \frac{x_i^\top x_j}{\|x_i\|\,\|x_j\|}$
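Both kernels translate directly into NumPy. This is a minimal sketch; the function names and the default `sigma=1.0` are assumptions, and `cosine_kernel` does not guard against zero-length vectors.

```python
import numpy as np


def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian kernel exp(-||xi - xj||^2 / (2 sigma^2))."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


def cosine_kernel(xi, xj):
    """Cosine similarity xi^T xj / (||xi|| ||xj||); undefined for zero vectors."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    return float(np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj)))
```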

14 The RBF Kernel
$k(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$
Maps inputs to an infinite-dimensional space.

15 Probabilistic Kernel Functions
Allows using generative distributions in discriminative settings.
Uses a class-independent probability distribution for input $x$: $k(x_i, x_j) = p(x_i \mid \theta)\,p(x_j \mid \theta)$
Two inputs are more similar if both have high probabilities.
Bayesian kernel: $k(x_i, x_j) = \int p(x_i \mid \theta)\,p(x_j \mid \theta)\,p(\theta)\,d\theta$
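As one possible illustration, take $p(x \mid \theta)$ to be a one-dimensional Gaussian with mean $\theta$ and fixed standard deviation. That model choice, the function names, and the Monte Carlo approximation of the integral from samples of $p(\theta)$ are all assumptions of this sketch rather than anything specified in the slides.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x; mu may be a scalar or an array."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def prob_kernel(xi, xj, mu=0.0, sigma=1.0):
    """k(xi, xj) = p(xi | theta) p(xj | theta) with theta = mu held fixed."""
    return float(gaussian_pdf(xi, mu, sigma) * gaussian_pdf(xj, mu, sigma))


def bayesian_kernel(xi, xj, prior_samples, sigma=1.0):
    """Monte Carlo estimate of the integral over theta, given samples from p(theta)."""
    prior_samples = np.asarray(prior_samples, dtype=float)
    values = gaussian_pdf(xi, prior_samples, sigma) * gaussian_pdf(xj, prior_samples, sigma)
    return float(np.mean(values))
```

Two inputs near the mode score higher than two inputs in the tails, which matches the "both have high probabilities" reading above.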

16 Kernels for Non-vector Data
String kernel
Pyramid kernels

17 Why Use Kernels?
$x \in \mathbb{R}$: no linear separator.
Map $x \to \{x, x^2\}$: separable in 2D space.
[Figure: plots with axes $x$ and $x^2$]
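A small numerical sketch of this idea, using made-up one-dimensional data in which the positive class lies far from the origin on both sides. The data values and the helper `separable_by_threshold` are hypothetical.

```python
import numpy as np


def separable_by_threshold(values, labels):
    """True if some single threshold on `values` reproduces `labels` (either orientation)."""
    s = np.sort(values)
    for t in (s[:-1] + s[1:]) / 2.0:      # candidate thresholds between sorted neighbours
        pred = values > t
        if np.array_equal(pred, labels) or np.array_equal(~pred, labels):
            return True
    return False


x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
labels = np.abs(x) > 1.0                  # positive class sits on both sides of the origin

features = np.column_stack([x, x ** 2])   # the map x -> {x, x^2}

in_1d = separable_by_threshold(x, labels)               # threshold on x alone
in_2d = separable_by_threshold(features[:, 1], labels)  # threshold on the x^2 coordinate
```

A threshold on the $x^2$ coordinate is a horizontal line in the mapped space, so success there means the classes are linearly separable in 2D.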

18 Another Example
$x \in \mathbb{R}^2$: no linear separator.
Map $x \to \{x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\}$.
A circle as the decision boundary.
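To see why a circle becomes a linear boundary after this map, note that $x_1^2 + x_2^2 = r^2$ is linear in the mapped coordinates. The sketch below checks this on random points; the radius, the sample, and the weight vector `w` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-5.0, 5.0, size=(200, 2))
r = 3.0

# Circular rule in the original space
inside_circle = X[:, 0] ** 2 + X[:, 1] ** 2 < r ** 2

# Feature map x -> (x1^2, sqrt(2) x1 x2, x2^2)
Z = np.column_stack([X[:, 0] ** 2, np.sqrt(2.0) * X[:, 0] * X[:, 1], X[:, 1] ** 2])

# Linear rule in feature space: w^T z + b < 0 with w = (1, 0, 1), b = -r^2
w = np.array([1.0, 0.0, 1.0])
b = -r ** 2
inside_linear = Z @ w + b < 0
```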


20 The Gaussian Kernel
The squared dot product kernel ($x_i, x_j \in \mathbb{R}^2$): $k(x_i, x_j) = (x_i^\top x_j)^2 = \phi(x_i)^\top \phi(x_j)$
$\phi(x_i) = \{x_{i1}^2, \sqrt{2}\,x_{i1} x_{i2}, x_{i2}^2\}$
What about the Gaussian kernel (radial basis function)?
$k(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x_i - x_j\|^2\right)$
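The identity between the squared dot product and the explicit feature map can be checked numerically. This is a small sketch with arbitrarily chosen inputs; the function names are hypothetical.

```python
import numpy as np


def phi(x):
    """Explicit feature map (x1^2, sqrt(2) x1 x2, x2^2) for x in R^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])


def squared_dot_kernel(xi, xj):
    """k(xi, xj) = (xi^T xj)^2, computed without forming phi."""
    return float(np.dot(xi, xj) ** 2)


xi = np.array([1.0, 2.0])
xj = np.array([3.0, -1.0])

explicit = float(phi(xi) @ phi(xj))     # dot product in the 3-D feature space
implicit = squared_dot_kernel(xi, xj)   # same value from the 2-D inputs
```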

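21 Why is the Gaussian Kernel Mapping to Infinite Dimensions?
Assume $2\sigma^2 = 1$, so that $k(x_i, x_j) = \exp(-(x_i - x_j)^2)$, and $x \in \mathbb{R}$ (denoted as $x$).
$$k(x_i, x_j) = \exp(-x_i^2)\exp(-x_j^2)\exp(2 x_i x_j)$$
$$= \exp(-x_i^2)\exp(-x_j^2)\sum_{k=0}^{\infty} \frac{2^k x_i^k x_j^k}{k!}$$
$$= \sum_{k=0}^{\infty} \left(\frac{2^{k/2}}{\sqrt{k!}}\, x_i^k \exp(-x_i^2)\right)\left(\frac{2^{k/2}}{\sqrt{k!}}\, x_j^k \exp(-x_j^2)\right)$$
using the Maclaurin series expansion of $\exp(2 x_i x_j)$. Written as a dot product of two infinite-dimensional vectors:
$$k(x_i, x_j) = \begin{pmatrix} \exp(-x_i^2) \\ \frac{2^{1/2}}{\sqrt{1!}}\, x_i^1 \exp(-x_i^2) \\ \frac{2^{2/2}}{\sqrt{2!}}\, x_i^2 \exp(-x_i^2) \\ \vdots \end{pmatrix}^{\top} \begin{pmatrix} \exp(-x_j^2) \\ \frac{2^{1/2}}{\sqrt{1!}}\, x_j^1 \exp(-x_j^2) \\ \frac{2^{2/2}}{\sqrt{2!}}\, x_j^2 \exp(-x_j^2) \\ \vdots \end{pmatrix}$$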
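The expansion can be checked numerically by truncating the infinite feature vector after a finite number of terms. The sketch assumes the scalar kernel $\exp(-(x_i - x_j)^2)$ used in the expansion; the function names and the truncation lengths are illustrative.

```python
import math


def gaussian_kernel(xi, xj):
    """Scalar Gaussian kernel exp(-(xi - xj)^2)."""
    return math.exp(-((xi - xj) ** 2))


def truncated_features(x, n_terms):
    """First n_terms entries of the feature vector: 2^{k/2} / sqrt(k!) * x^k * exp(-x^2)."""
    return [
        2 ** (k / 2) / math.sqrt(math.factorial(k)) * x ** k * math.exp(-(x ** 2))
        for k in range(n_terms)
    ]


def approx_kernel(xi, xj, n_terms):
    """Dot product of the truncated feature vectors."""
    fi = truncated_features(xi, n_terms)
    fj = truncated_features(xj, n_terms)
    return sum(a * b for a, b in zip(fi, fj))
```

Comparing `approx_kernel(0.7, -0.3, n)` with `gaussian_kernel(0.7, -0.3)` for growing `n` shows the truncated dot product approaching the exact kernel value, which is the sense in which the feature space is infinite-dimensional.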

22 Kernel Machines
We can use a kernel function to generate new features.
Evaluate the kernel function for each input and a set of $K$ centroids: $\phi(x) = [k(x, \mu_1), k(x, \mu_2), \ldots, k(x, \mu_K)]$
$y = w^\top \phi(x)$, $y \sim \mathrm{Ber}(w^\top \phi(x))$
If $k$ is a Gaussian kernel: Radial Basis Function (RBF) network.
How to choose $\mu_i$?
Clustering
Random selection
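A minimal sketch of an RBF network for regression, using the random-selection option for the centroids and a ridge fit for $w$. The toy sine data, the number of centroids, `sigma`, and `lam` are all assumptions chosen for illustration.

```python
import numpy as np


def rbf_features(X, centroids, sigma=1.0):
    """phi(x) = [k(x, mu_1), ..., k(x, mu_K)] with a Gaussian kernel, for every row of X."""
    sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0])

# Random selection: pick K training points as centroids
K = 10
centroids = X[rng.choice(len(X), size=K, replace=False)]

Phi = rbf_features(X, centroids)          # N x K feature matrix
lam = 1e-3
w = np.linalg.solve(lam * np.eye(K) + Phi.T @ Phi, Phi.T @ y)
y_hat = Phi @ w                           # y = w^T phi(x) for each training input
mse = float(np.mean((y_hat - y) ** 2))
```

Swapping the random selection for cluster centers (for example from k-means) changes only the line that builds `centroids`.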

23 Generalizing RBF
Another option: use every input example as a centroid.
$\phi(x) = [k(x, x_1), k(x, x_2), \ldots, k(x, x_N)]$
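With every training example used as a centroid, the feature matrix for the training set is simply the $N \times N$ Gram matrix. A short sketch, assuming a Gaussian kernel and a tiny made-up data set:

```python
import numpy as np


def rbf_gram(X, sigma=1.0):
    """Row i is phi(x_i) = [k(x_i, x_1), ..., k(x_i, x_N)] for a Gaussian kernel."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Phi = rbf_gram(X)   # one feature per training example, so Phi is N x N
```

The number of features now grows with the number of training examples, which is the price of not having to choose centroids.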

