Kernels. Machine Learning CSE446 Carlos Guestrin University of Washington. October 28, Carlos Guestrin

Size: px

Start display at page:

Download "Kernels. Machine Learning CSE446 Carlos Guestrin University of Washington. October 28, Carlos Guestrin"

Cecil Francis
5 years ago
Views:

1 Kernels Machine Learning CSE446 Carlos Guestrin University of Washington October 28, 2013 Carlos Guestrin Linear Separability: More formally, Using Margin Data linearly separable, if there exists a vector a margin Such that Carlos Guestrin

2 Perceptron Analysis: Linearly Separable Case Theorem [Block, Novikoff]: Given a sequence of labeled examples: Each feature vector has bounded norm: If dataset is linearly separable: Then the number of mistakes made by the online perceptron on any such sequence is bounded by Carlos Guestrin Beyond Linearly Separable Case Perceptron algorithm is super cool! No assumption about data distribution! Could be generated by an oblivious adversary, no need to be iid Makes a fixed number of mistakes, and it s done for ever! Even if you see infinite data However, real world not linearly separable Can t expect never to make mistakes again Analysis extends to non-linearly separable case Very similar bound, see Freund & Schapire Converges, but ultimately may not give good accuracy (make many many many mistakes) Carlos Guestrin

3 What if the data is not linearly separable? Use features of features of features of features. Feature space can get really large really quickly! Carlos Guestrin Higher order polynomials number of monomial terms number of input dimensions d=4 d=3 d=2 m input features d degree of polynomial grows fast! d = 6, m = 100 about 1.6 billion terms Carlos Guestrin

4 Perceptron Revisited Given weight vector w (t), predict point x by: Mistake at time t: w (t+1) ç w (t) + y (t) x (t) Thus, write weight vector in terms of mistaken data points only: Let M (t) be time steps up to t when mistakes were made: Prediction rule now: When using high dimensional features: Carlos Guestrin Dot-product of polynomials exactly d Carlos Guestrin

5 Finally the Kernel Trick!!! (Kernelized Perceptron Every time you make a mistake, remember (x (t),y (t) ) Kernelized Perceptron prediction for x: y (j) j2m (t) sign(w (t) (x)) = X (x (j) ) (x) = X y (j) k(x (j), x) j2m (t) Carlos Guestrin Polynomial kernels All monomials of degree d in O(d) operations: How about all monomials of degree up to d? Solution 0: exactly d Better solution: Carlos Guestrin

Common kernels Polynomials of degree exactly d Polynomials of degree up to d Gaussian (squared exponential) kernel 2 Sigmoid Carlos Guestrin 2005-2013 11 What you need to know Notion of online

6 Common kernels Polynomials of degree exactly d Polynomials of degree up to d Gaussian (squared exponential) kernel 2 Sigmoid Carlos Guestrin What you need to know Notion of online learning Perceptron algorithm Mistake bounds and proofs The kernel trick Kernelized Perceptron Derive polynomial kernel Common kernels In online learning, report averaged weights at the end Carlos Guestrin

7 Your Midterm Content: Everything up to last Wednesday (Perceptron) Only 80mins, so arrive early and settle down quickly, we ll start and end on time Open book Textbook, Books, Course notes, Personal notes Bring a calculator that can do log J No: Computers, tablets, phones, other materials, internet devices, wireless telepathy or wandering eyes The exam: Covers key concepts and ideas, work on understanding the big picture, and differences between methods Carlos Guestrin Support Vector Machines Machine Learning CSE446 Carlos Guestrin University of Washington October 28, 2013 Carlos Guestrin

8 Linear classifiers Which line is better? Carlos Guestrin 15 Pick the one with the largest margin! w.x + w 0 = 0 confidence = y j (w x j + w 0 ) Carlos Guestrin 16 8

9 Maximize the margin w.x + w 0 = 0 max,w,w 0 y j (w x j + w 0 ), 8j 2 {1,...,N} Carlos Guestrin 17 But there are many planes w.x + w 0 = Carlos Guestrin 18 9

10 Review: Normal to a plane x j = x j + w w w.x + w 0 = Carlos Guestrin 19 A Convention: Normalized margin Canonical hyperplanes x j = x j + w w w.x + w 0 = +1 w.x + w 0 = 0 w.x + w 0 = -1 x + x - margin 2γ Carlos Guestrin 20 10

11 Margin maximization using canonical hyperplanes w.x + w 0 = +1 w.x + w 0 = 0 w.x + w 0 = -1 Unnormalized problem: max,w,w 0 Normalized Problem: y j (w x j + w 0 ), 8j 2 {1,...,N} margin 2γ min w,w 0 w 2 2 y j (w x j + w 0 ) 1, 8j 2 {1,...,N} Carlos Guestrin 21 Support vector machines (SVMs) w.x + w 0 = +1 w.x + w 0 = 0 w.x + w 0 = -1 min w,w 0 w 2 2 y j (w x j + w 0 ) 1, 8j 2 {1,...,N} Solve efficiently by many methods, e.g., quadratic programming (QP) Well-studied solution algorithms Stochastic gradient descent Hyperplane defined by support vectors margin 2γ Carlos Guestrin 22 11

12 What if the data is not linearly separable? Use features of features of features of features Carlos Guestrin 23 What if the data is still not linearly separable? min w,w 0 w 2 2 y j (w x j + w 0 ) 1, 8j If data is not linearly separable, some points don t satisfy margin constraint: How bad is the violation? Tradeoff margin violation with w : Carlos Guestrin 24 12

13 SVMs for Non-Linearly Separable meet my friend the Perceptron Perceptron was minimizing the hinge loss: NX j=1 y j (w x j + w 0 ) + SVMs minimizes the regularized hinge loss!! w C NX j=1 1 y j (w x j + w 0 ) + Carlos Guestrin Stochastic Gradient Descent for SVMs NX j=1 Perceptron minimization: y j (w x j + w 0 ) + SGD for Perceptron: SVMs minimization: NX w C 1 y j (w x j + w 0 ) + j=1 SGD for SVMs: w (t+1) w (t) + h i y (t) (w (t) x (t) ) apple 0 y (t) x (t) Carlos Guestrin

14 What you need to know Maximizing margin Derivation of SVM formulation Non-linearly separable case Hinge loss A.K.A. adding slack variables SVMs = Perceptron + L2 regularization Can also use kernels with SVMs Can optimize SVMs with SGD Many other approaches possible Carlos Guestrin 27 14

Kernelized Perceptron Support Vector Machines

Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting: