
Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701

SVM, linearly separable case: n training points (x_1, ..., x_n), each x_j a d-dimensional feature vector. Primal problem: min_{w,b} ½‖w‖² subject to y_j (w·x_j + b) ≥ 1 for all j, where w assigns a weight to each feature (a d-dimensional problem). This is a convex quadratic program (quadratic objective, linear constraints), but it is expensive to solve when d is very large, so it is often solved in its dual form (an n-dimensional problem).

Constrained optimization (toy example): minimize x² subject to x ≥ b. The solution is x* = max(b, 0): when b ≤ 0 the constraint is inactive (x* = 0), and when b > 0 the constraint is active and tight (x* = b).

Constrained optimization: the dual problem (case b > 0). Primal problem: min_x x² subject to x ≥ b. Moving the constraint into the objective gives the Lagrangian L(x, α) = x² + α(b − x), with α ≥ 0: α = 0 when the constraint is inactive, α > 0 when the constraint is active. Dual problem: max_{α ≥ 0} d(α), where d(α) = min_x L(x, α).

Connection between primal and dual. Primal optimal value: p* = min_{x ≥ b} x². Dual optimal value: d* = max_{α ≥ 0} d(α). Weak duality: the dual solution lower bounds the primal solution, i.e. d* ≤ p*. To see this, note that for every feasible x (i.e. x ≥ b) and feasible α (i.e. α ≥ 0), d(α) = min_x L(x, α) ≤ x² + α(b − x) ≤ x², since α(b − x) ≤ 0; minimizing the right side over feasible x and maximizing the left over α gives d* ≤ p*. Moreover, the dual problem (a maximization) is always concave, even if the primal is not convex.

Connection between primal and dual (continued). Weak duality: d* ≤ p* always. Strong duality: d* = p* holds for many problems of interest, e.g. when the primal is a feasible convex problem with linear constraints.

Connection between primal and dual: what does strong duality say about α* (the α that achieves the optimal dual value) and x* (the x that achieves the optimal primal value)? Whenever strong duality holds, the following conditions (the KKT conditions) hold for x* and α*:
1. ∇_x L(x*, α*) = 0, i.e. the gradient of the Lagrangian at (x*, α*) is zero.
2. x* ≥ b, i.e. x* is primal feasible.
3. α* ≥ 0, i.e. α* is dual feasible.
4. α*(x* − b) = 0 (complementary slackness).
We use the first condition to relate x* and α*, and the last one (complementary slackness) to argue that α* = 0 if the constraint is inactive and α* > 0 if the constraint is active and tight.

Solving the dual. Dual objective: d(α) = min_x L(x, α); the optimization over x is unconstrained. We then need to maximize d(α) over α ≥ 0: solve the unconstrained problem in α and then take the positive part, which gives α* = max(2b, 0). So α* = 0 when the constraint is inactive and α* > 0 when the constraint is active and tight. (See the worked derivation below.)
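Spelling out the toy derivation (this assumes, as in the preceding slides, the primal min_x x² subject to x ≥ b):

\[
\begin{aligned}
L(x, \alpha) &= x^2 + \alpha\,(b - x), \qquad \alpha \ge 0,\\
\frac{\partial L}{\partial x} = 2x - \alpha = 0 \;&\Rightarrow\; x = \tfrac{\alpha}{2},\\
d(\alpha) = \min_x L(x, \alpha) &= \alpha b - \tfrac{\alpha^2}{4},\\
\frac{\partial d}{\partial \alpha} = b - \tfrac{\alpha}{2} = 0 \;&\Rightarrow\; \alpha^* = \max(2b,\, 0), \qquad x^* = \tfrac{\alpha^*}{2} = \max(b,\, 0).
\end{aligned}
\]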

Dual SVM, linearly separable case: n training points (x_1, ..., x_n), where each x_i is a d-dimensional feature vector. Primal problem: min_{w,b} ½‖w‖² subject to y_i (w·x_i + b) ≥ 1 for all i; here w is a weight per feature (a d-dimensional problem). Forming the Lagrangian with multipliers α_i ≥ 0 and eliminating w and b yields the dual, in which α assigns a weight to each training point (an n-dimensional problem).

Dual SVM, linearly separable case. Dual problem: max_α Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j) subject to α_i ≥ 0 and Σ_i α_i y_i = 0. If we can solve for the α_i's (the dual problem), then we have a solution for w and b (the primal problem): w = Σ_i α_i y_i x_i.

Dual SVM, linearly separable case. The dual problem is also a QP; its solution gives the α_j's, and then w = Σ_j α_j y_j x_j. What about b?

Dual SVM: sparsity of the dual solution. Only a few α_j's are non-zero: those for which the constraint is active and tight, i.e. y_j (w·x_j + b) = 1. Support vectors are the training points j whose α_j's are non-zero.

Dual SVM, linearly separable case. The dual problem is also a QP, and its solution gives the α_j's. Use any support vector with α_k > 0 to compute b, since its constraint is tight: y_k (w·x_k + b) = 1, so b = y_k − w·x_k.
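One way to see these formulas in action, sketched with scikit-learn (the library choice and the toy data are mine, not from the lecture): fit a nearly hard-margin linear SVM, read off the support vectors and dual coefficients, and recover w and b from them.

    import numpy as np
    from sklearn.svm import SVC

    # Tiny linearly separable toy problem (illustrative data only).
    X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
                  [-1.0, -1.0], [-2.0, -2.5], [-3.0, -1.5]])
    y = np.array([1, 1, 1, -1, -1, -1])

    # A large C approximates the hard-margin SVM.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    # dual_coef_ stores alpha_j * y_j for the support vectors only.
    alpha_times_y = clf.dual_coef_.ravel()
    sv = clf.support_vectors_

    # w = sum_j alpha_j y_j x_j  (only support vectors contribute).
    w = alpha_times_y @ sv
    # b from any support vector k, whose constraint is tight: b = y_k - w . x_k
    k = clf.support_[0]
    b = y[k] - w @ X[k]

    print(w, b)                       # recovered primal solution
    print(clf.coef_, clf.intercept_)  # should match sklearn's own w, b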

Dual SVM, non-separable case. Primal problem: min_{w, b, {ξ_j}} ½‖w‖² + C Σ_j ξ_j subject to y_j (w·x_j + b) ≥ 1 − ξ_j and ξ_j ≥ 0. The Lagrangian L(w, b, {ξ_j}, α, µ) introduces Lagrange multipliers α_j ≥ 0 for the margin constraints and µ_j ≥ 0 for the constraints ξ_j ≥ 0; minimizing it over w, b and ξ gives the dual.

Dual SVM, non-separable case. Dual problem: max_α Σ_j α_j − ½ Σ_j Σ_k α_j α_k y_j y_k (x_j · x_k) subject to 0 ≤ α_j ≤ C and Σ_j α_j y_j = 0. The upper bound α_j ≤ C comes from ∂L/∂ξ_j = 0 (which gives C − α_j − µ_j = 0 with µ_j ≥ 0). Intuition: as C → ∞, we recover the hard-margin SVM. The dual problem is again a QP, and its solution gives the α_j's.

So why solve the dual SVM? Some quadratic programming algorithms can solve the dual faster than the primal, especially in high dimensions (d >> n). But, more importantly, the dual enables the kernel trick!

Separable using higher-order features. [Figure: data that is not linearly separable in (x_1, x_2) becomes separable after mapping to higher-order features, e.g. r = x_1² + x_2² and the angle θ, or x_1² plotted against x_1.]

What if the data is not linearly separable? Use features of features of features of features: Φ(x) = (x_1², x_2², x_1 x_2, ..., exp(x_1), ...). The feature space becomes really large very quickly!

Higher-order polynomials: with m input features and polynomial degree d, the number of terms grows fast! For d = 6 and m = 100, there are about 1.6 billion degree-6 terms.
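A quick sanity check of that count (my own back-of-the-envelope computation, not from the slides): the number of monomials of degree exactly d in m variables is C(m + d − 1, d), and of degree at most d is C(m + d, d).

    from math import comb

    m, d = 100, 6
    exactly_d = comb(m + d - 1, d)   # monomials of degree exactly d
    up_to_d = comb(m + d, d)         # monomials of degree at most d

    print(f"{exactly_d:,}")  # 1,609,344,100 -> "about 1.6 billion terms"
    print(f"{up_to_d:,}")    # 1,705,904,746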

The dual formulation only depends on dot products, not on w! Replace each x with Φ(x): the feature space is high-dimensional, but we never need it explicitly, as long as we can compute the dot product Φ(x)·Φ(x') quickly using some kernel K(x, x').

Dot product of polynomial features. For d = 1, Φ(u)·Φ(v) = u·v; for d = 2 (features such as u_1², u_2², √2·u_1u_2), Φ(u)·Φ(v) = (u·v)²; in general, for degree-d polynomial features, Φ(u)·Φ(v) = (u·v)^d.
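A small numerical check of the d = 2 case (the explicit feature map with the √2 cross term is one standard choice, not spelled out on the slide):

    import numpy as np

    def phi2(x):
        """Explicit degree-2 feature map for 2-d input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
        x1, x2 = x
        return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

    rng = np.random.default_rng(0)
    u, v = rng.normal(size=2), rng.normal(size=2)

    explicit = phi2(u) @ phi2(v)   # dot product in feature space
    kernel   = (u @ v) ** 2        # polynomial kernel of degree 2

    print(np.isclose(explicit, kernel))  # True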

Finally: the kernel trick! Never represent the features explicitly; compute the dot products in closed form. This gives constant-time, high-dimensional dot products for many classes of features.

Common kernels: polynomials of degree exactly d, K(u, v) = (u·v)^d; polynomials of degree up to d, K(u, v) = (u·v + 1)^d; Gaussian/radial basis kernels, K(u, v) = exp(−‖u − v‖² / 2σ²) (these contain polynomials of all orders; recall the series expansion of exp); and the sigmoid kernel.

Mercer kernels: which functions are valid kernels, i.e. correspond to dot products of some feature vectors ϕ(x)? Answer: Mercer kernels K, which are continuous, symmetric, and positive semi-definite: for any finite set of points, the Gram matrix K with K_ij = K(x_i, x_j) satisfies zᵀKz ≥ 0 for all z.
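A quick empirical illustration of positive semi-definiteness (my own sketch; checking the eigenvalues of one sampled Gram matrix is only a sanity check, not a proof):

    import numpy as np

    def rbf_gram(X, sigma=1.0):
        """Gram matrix of the Gaussian/RBF kernel on the rows of X."""
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
        return np.exp(-d2 / (2 * sigma**2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    K = rbf_gram(X)

    eigvals = np.linalg.eigvalsh(K)
    print(eigvals.min() >= -1e-10)   # True: the Gram matrix is (numerically) PSD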

Overfitting: with kernels the feature space is huge, so what about overfitting? Maximizing the margin leads to a sparse set of support vectors, and some interesting theory says that SVMs search for simple hypotheses with large margin, so they are often robust to overfitting.

What about classification time? For a new input x, if we needed to represent Φ(x) explicitly we would be in trouble! Recall the classifier: sign(w·Φ(x) + b). Using kernels we are cool: since w = Σ_i α_i y_i Φ(x_i), the classifier becomes sign(Σ_i α_i y_i K(x_i, x) + b), which never forms Φ(x).

SVMs with kernels: choose a set of features and a kernel function, solve the dual problem to obtain the support vectors and their α_i's, and at classification time compute f(x) = Σ_i α_i y_i K(x_i, x) + b and classify as sign(f(x)).
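A minimal sketch of this prediction rule (scikit-learn supplies the α_i's; the RBF data and parameters are illustrative choices of mine):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.datasets import make_moons

    # Illustrative data and kernel parameters (not from the lecture).
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

    def rbf(a, b, gamma=1.0):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def decision(x):
        # f(x) = sum_i alpha_i y_i K(x_i, x) + b, summing only over support vectors;
        # sklearn stores alpha_i * y_i in dual_coef_.
        s = sum(c * rbf(sv, x) for c, sv in zip(clf.dual_coef_.ravel(),
                                                clf.support_vectors_))
        return s + clf.intercept_[0]

    x0 = X[0]
    print(np.isclose(decision(x0), clf.decision_function([x0])[0]))  # True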

SVMs with Kernels [figure: Iris dataset, class 2 vs. classes 1,3, linear kernel]

SVMs with Kernels [figure: Iris dataset, class 1 vs. classes 2,3, polynomial kernel of degree 2]

SVMs with Kernels [figure: Iris dataset, class 1 vs. classes 2,3, Gaussian RBF kernel]

SVMs with Kernels [figure: Iris dataset, class 1 vs. classes 2,3, Gaussian RBF kernel]

SVMs with Kernels [figure: chessboard dataset, Gaussian RBF kernel]

SVMs with Kernels [figure: chessboard dataset, polynomial kernel]

Corel Dataset [figure]

Corel Dataset [figure: results from Olivier Chapelle, 1998]

USPS Handwritten digits [figure]

SVMs vs. kernel regression. Both predict with a kernel-weighted combination of training points: SVMs use f(x) = Σ_i α_i y_i K(x_i, x) + b, while kernel regression uses f(x) = Σ_i K(x_i, x) y_i / Σ_i K(x_i, x). Differences: SVMs learn the weights α_i (and bandwidth) and often give a sparse solution; kernel regression uses fixed weights and only learns the bandwidth, its solution may not be sparse, but it is much simpler to implement.
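For contrast, a minimal Nadaraya-Watson kernel regression sketch (one common form of kernel regression; the 1-d data and bandwidth are illustrative choices of mine):

    import numpy as np

    def gaussian_kernel(u, bandwidth):
        return np.exp(-0.5 * (u / bandwidth) ** 2)

    def kernel_regression(x_query, X_train, y_train, bandwidth=0.3):
        # Nadaraya-Watson: fixed kernel weights; only the bandwidth is tuned.
        w = gaussian_kernel(x_query - X_train, bandwidth)
        return np.sum(w * y_train) / np.sum(w)

    rng = np.random.default_rng(0)
    X_train = np.sort(rng.uniform(0, 2 * np.pi, 100))
    y_train = np.sin(X_train) + 0.1 * rng.normal(size=100)

    print(kernel_regression(np.pi / 2, X_train, y_train))  # roughly sin(pi/2) = 1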

SVMs vs. logistic regression: SVMs minimize the hinge loss; logistic regression minimizes the log-loss.

SVMs vs. logistic regression. SVM: hinge loss. Logistic regression: log loss (the negative log conditional likelihood). [Figure: hinge loss, log loss, and 0-1 loss plotted as functions of the margin y f(x) over roughly [−1, 1]; hinge and log loss act as convex surrogates for the 0-1 loss.]
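Written as functions of the margin m = y f(x) (standard textbook forms, since the slide shows them only as a plot):

\[
\ell_{0\text{-}1}(m) = \mathbf{1}[m \le 0], \qquad
\ell_{\mathrm{hinge}}(m) = \max(0,\, 1 - m), \qquad
\ell_{\log}(m) = \log\bigl(1 + e^{-m}\bigr).
\]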

SVMs vs. logistic regression:

                                          SVMs          Logistic Regression
  Loss function                           Hinge loss    Log-loss
  High-dimensional features with kernels  Yes!          Yes!
  Sparse solution                         Often yes!    Almost always no!
  Semantics of output                     Margin        Real probabilities

Kernels in logistic regression: define the weights in terms of the training features, w = Σ_i α_i Φ(x_i), so that w·Φ(x) = Σ_i α_i K(x_i, x), and derive a simple gradient descent rule on the α_i's.
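A compact sketch of that idea (my own implementation choices, not from the slides: RBF kernel, full-batch gradient descent, and an RKHS-norm regularizer):

    import numpy as np

    def rbf_gram(X, Z, gamma=1.0):
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def kernel_logreg(X, y, gamma=1.0, lam=0.1, lr=0.01, n_iter=2000):
        """Gradient descent on alpha for f(x) = sum_i alpha_i K(x_i, x); y in {-1,+1}."""
        K = rbf_gram(X, X, gamma)
        alpha = np.zeros(len(X))
        for _ in range(n_iter):
            f = K @ alpha
            # d/df_i of log(1 + exp(-y_i f_i)) is -y_i / (1 + exp(y_i f_i))
            g = -y / (1.0 + np.exp(y * f))
            grad = K @ g + lam * (K @ alpha)   # includes (lam/2) * alpha^T K alpha regularizer
            alpha -= lr * grad
        return alpha

    # Tiny illustrative check on an XOR-like problem.
    X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
    y = np.array([1, 1, -1, -1])
    alpha = kernel_logreg(X, y, gamma=2.0)
    print(np.sign(rbf_gram(X, X, 2.0) @ alpha) == y)  # should print all True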

Can we use kernels in regression?

Ridge regression: recall the solution β̂ = (AᵀA + λI)⁻¹AᵀY, where A is the n × p matrix of training inputs. Hence AᵀA is a p × p matrix whose entries are (sample) correlations between the features, NOT inner products between the data points; the inner-product matrix would be AAᵀ, which is n × n (also known as the Gram matrix). Using the dual formulation, we will write the solution in terms of AAᵀ.

Ridge regression, β̂ = (AᵀA + λI)⁻¹AᵀY: similarity with SVMs.

Ridge primal problem:
\[ \min_{\beta,\, z_i} \; \sum_{i=1}^n z_i^2 + \lambda \|\beta\|_2^2 \quad \text{s.t.}\;\; z_i = Y_i - X_i \beta \]

SVM primal problem:
\[ \min_{w,\, \xi_i} \; C \sum_{i=1}^n \xi_i + \tfrac{1}{2}\|w\|_2^2 \quad \text{s.t.}\;\; \xi_i = \max(1 - Y_i\, X_i w,\; 0) \]

Lagrangian for the ridge problem:
\[ L(\beta, z, \alpha) = \sum_{i=1}^n z_i^2 + \lambda \|\beta\|_2^2 + \sum_{i=1}^n \alpha_i\,(z_i - Y_i + X_i \beta), \]
where α_i is a Lagrange parameter, one per training point.

Ridge regression (dual), β̂ = (AᵀA + λI)⁻¹AᵀY. Dual variables: α = {α_i}, i = 1, ..., n. The dual problem is
\[ \max_\alpha \; \min_{\beta,\, z_i} \; \sum_{i=1}^n z_i^2 + \lambda\|\beta\|_2^2 + \sum_{i=1}^n \alpha_i\,(z_i - Y_i + X_i\beta). \]
Taking derivatives of the Lagrangian with respect to β and z_i gives
\[ \beta = -\tfrac{1}{2\lambda} A^T \alpha, \qquad z_i = -\tfrac{\alpha_i}{2}, \]
and substituting back yields the (n-dimensional, unconstrained) dual problem
\[ \max_\alpha \; -\tfrac{1}{4}\,\alpha^T\alpha \;-\; \tfrac{1}{4\lambda}\,\alpha^T A A^T \alpha \;-\; \alpha^T Y. \]

Ridge regression (dual): (AᵀA + λI)⁻¹AᵀY = Aᵀ(AAᵀ + λI)⁻¹Y. Maximizing the dual
\[ \max_\alpha \; -\tfrac{1}{4}\,\alpha^T\alpha \;-\; \tfrac{1}{4\lambda}\,\alpha^T A A^T \alpha \;-\; \alpha^T Y \]
gives
\[ \hat\alpha = -2\lambda\,(A A^T + \lambda I)^{-1} Y, \]
from which we can get back
\[ \hat\beta = -\tfrac{1}{2\lambda} A^T \hat\alpha = A^T (A A^T + \lambda I)^{-1} Y, \]
a weighted combination of the training points, with α̂ giving the weight of each training point (but typically not sparse).

Kernelized ridge regression. Using the dual, the solution can be re-written as β̂ = Aᵀ(AAᵀ + λI)⁻¹Y. How does this help? We only need to invert an n × n matrix (instead of p × p, or m × m in an m-dimensional feature space). More importantly, the kernel trick: AAᵀ involves only inner products between the training points. But there is still an extra Aᵀ: recall that the predicted label for a test point X is Xβ̂ = XAᵀ(AAᵀ + λI)⁻¹Y, and XAᵀ contains only inner products between the test point X and the training points!

Kernelized ridge regression (continued). Replacing inner products with kernel evaluations, the prediction for a test point X becomes
\[ \hat f_n(X) = K_X\,(K + \lambda I)^{-1} Y, \qquad K_X(i) = \Phi(X)\cdot\Phi(X_i), \qquad K(i, j) = \Phi(X_i)\cdot\Phi(X_j). \]
We work only with kernels and never need to write out the high-dimensional feature vectors Φ(X). This is ridge regression with (implicit) nonlinear features.
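A minimal from-scratch sketch of these formulas (the RBF kernel, toy 1-d data, and ridge parameter are my own illustrative choices):

    import numpy as np

    def rbf_kernel(A, B, gamma=10.0):
        """K(i, j) = exp(-gamma * ||a_i - b_j||^2)."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def fit_kernel_ridge(X, Y, lam=0.1, gamma=10.0):
        # Dual weights: (K + lam I)^{-1} Y
        K = rbf_kernel(X, X, gamma)
        return np.linalg.solve(K + lam * np.eye(len(X)), Y)

    def predict(X_test, X_train, weights, gamma=10.0):
        # f(x) = K_x (K + lam I)^{-1} Y = K_x @ weights
        return rbf_kernel(X_test, X_train, gamma) @ weights

    rng = np.random.default_rng(0)
    X_train = rng.uniform(0, 1, size=(50, 1))
    Y_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.normal(size=50)

    w = fit_kernel_ridge(X_train, Y_train)
    X_test = np.array([[0.25]])
    print(predict(X_test, X_train, w))  # roughly sin(pi/2) = 1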

Kernelized ridge regression bf n (X) =K X (K + where Work with kernels, never need to write out the high-dim vectors Examples of kernels: Polynomials of degree exactly d Polynomials of degree up to d Gaussian/Radial kernels I) 1 Y K X (i) = (X) (X i ) K(i, j) = (X i ) (X j ) Ridge Regression with (implicit) nonlinear features (X)! f(x) = (X) 49