Max-Margin Classifier

Max-Margin Classifier. Oliver Schulte, CMPT 726. Bishop PRML Ch. 7.

Outline: Maximum Margin Criterion; Math; Maximizing the Margin; Non-Separable Data.

Kernels and Non-linear Mappings. Where does the maximization problem come from? The intuition comes from the primal version, which is based on a feature mapping φ. Theorem: every valid kernel k(x, y) is the inner product φ(x)^T φ(y) for some set of basis functions (feature mapping) φ. The feature space φ(x) could be high-dimensional, even infinite-dimensional. This is good because if the data aren't separable in the original input space x, they may be separable in the feature space φ(x). We can think about how to find a good linear separator using the dot product in high dimensions, then transfer this back to kernels in the original input space.

Why Kernels? If we can use dot products with features, why bother with kernels? It is often easier to specify how similar two things are (a dot product) than to construct an explicit feature space φ, e.g. for graphs, sets, strings (NIPS 2009 best student paper award). There are also high-dimensional (even infinite-dimensional) spaces that have efficient-to-compute kernels.

Kernel Trick. In previous lectures on linear models, we would explicitly compute φ(x_i) for each datapoint and run the algorithm in feature space. For some feature spaces, we can compute the dot product φ(x_i)^T φ(x_j) efficiently; this efficient computation is a kernel function k(x_i, x_j) = φ(x_i)^T φ(x_j). The kernel trick is to rewrite an algorithm so that x enters only in the form of dot products. The menu: kernel trick examples, then kernel functions.

A Kernel Trick. Let's look at the nearest-neighbour classification algorithm. For input point x_i, find the point x_j with smallest distance: ||x_i - x_j||^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i - 2 x_i^T x_j + x_j^T x_j. If we used a non-linear feature space φ(·): ||φ(x_i) - φ(x_j)||^2 = φ(x_i)^T φ(x_i) - 2 φ(x_i)^T φ(x_j) + φ(x_j)^T φ(x_j) = k(x_i, x_i) - 2 k(x_i, x_j) + k(x_j, x_j). So nearest-neighbour can be done in a high-dimensional feature space without actually moving to it.

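To make the identity above concrete, here is a minimal sketch of kernelized 1-nearest-neighbour classification. NumPy, the Gaussian (RBF) kernel, and the toy data are assumptions for illustration; the slides do not specify a particular kernel or implementation.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2) -- an assumed choice."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_nn_predict(x, X_train, t_train, kernel=rbf_kernel):
    """1-nearest-neighbour in feature space, using only kernel evaluations:
    ||phi(x) - phi(x_n)||^2 = k(x, x) - 2 k(x, x_n) + k(x_n, x_n)."""
    dists = [kernel(x, x) - 2 * kernel(x, x_n) + kernel(x_n, x_n)
             for x_n in X_train]
    return t_train[int(np.argmin(dists))]

# Toy usage: two well-separated classes in R^2
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
t_train = np.array([-1, -1, +1, +1])
print(kernel_nn_predict(np.array([1.8, 2.2]), X_train, t_train))  # expect +1
```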

Example: The Quadratic Kernel Function. Consider again the kernel function k(x, z) = (1 + x^T z)^2. With x, z ∈ R^2: k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2 = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (1, √2 x_1, √2 x_2, x_1^2, √2 x_1 x_2, x_2^2)(1, √2 z_1, √2 z_2, z_1^2, √2 z_1 z_2, z_2^2)^T = φ(x)^T φ(z). So this particular kernel function does correspond to a dot product in a feature space (it is valid). Computing k(x, z) is faster than explicitly computing φ(x)^T φ(z); in higher dimensions and with larger exponents it is much faster.

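A quick numerical check of this identity; a sketch assuming NumPy, with φ written out exactly as in the expansion on the slide:

```python
import numpy as np

def quad_kernel(x, z):
    """k(x, z) = (1 + x^T z)^2 for x, z in R^2."""
    return (1.0 + x @ z) ** 2

def phi(x):
    """Explicit feature map matching the expansion of (1 + x^T z)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])
print(quad_kernel(x, z))   # kernel evaluation in the original 2-D space
print(phi(x) @ phi(z))     # same value via the 6-dimensional feature space
```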

Regression, Kernelized. Many classifiers can be written using only dot products; kernelization = replace dot products by kernels. E.g., the kernel solution for regularized least squares regression is y(x) = k(x)^T (K + λ I_N)^{-1} t, versus φ(x)^T (Φ^T Φ + λ I_M)^{-1} Φ^T t for the original version. N is the number of datapoints (size of the Gram matrix K); M is the number of basis functions (size of the matrix Φ^T Φ). Bad if N > M, but good otherwise. k(x) = (k(x, x_1), ..., k(x, x_N)) is the vector of kernel values over the data points x_n.
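
The kernelized solution y(x) = k(x)^T (K + λ I_N)^{-1} t can be implemented in a few lines; the following is a minimal kernel ridge regression sketch. NumPy, the RBF kernel, and the toy sine data are assumptions, not part of the slide.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = np.sum(A**2, 1)[:, None] - 2 * A @ B.T + np.sum(B**2, 1)[None, :]
    return np.exp(-gamma * sq)

# Training data: noisy samples of a sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1
K = rbf_kernel_matrix(X, X)                            # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), t)   # (K + lambda I)^{-1} t

def predict(x_new):
    """y(x) = k(x)^T (K + lambda I)^{-1} t, with k(x)_n = k(x, x_n)."""
    k = rbf_kernel_matrix(np.atleast_2d(x_new), X)[0]
    return k @ alpha

print(predict([0.0]), np.sin(0.0))   # prediction near the true value at x = 0
```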

Outline: Maximum Margin Criterion; Math; Maximizing the Margin; Non-Separable Data.

Linear Classification. Consider a two-class classification problem. Use a linear model y(x) = w^T φ(x) + b followed by a threshold function. For now, let's assume the training data are linearly separable. Recall that the perceptron would converge to a perfect classifier for such data, but there are many such perfect classifiers.

Max Margin. [Figure: decision boundary y = 0 with margin boundaries y = 1 and y = −1.] We can define the margin of a classifier as the minimum distance to any example. In support vector machines the decision boundary which maximizes the margin is chosen; intuitively, this is the line right in the middle between the two classes.

Marginal Geometry. [Figure: decision regions R1 (y > 0) and R2 (y < 0) separated by the boundary y = 0 in the (x_1, x_2) plane, with w normal to the boundary.] Recall from Ch. 4: y(x) = w^T x + b (using Ch. 4 notation x for the input rather than φ(x)). Then w^T x/||w|| + b/||w|| = y(x)/||w|| is the signed distance to the decision boundary.

Support Vectors. [Figure: margin boundaries y = 1 and y = −1 around the decision boundary y = 0.] Assuming the data are separated by the hyperplane, the distance to the decision boundary is t_n y(x_n)/||w||. The maximum margin criterion chooses w, b by: arg max_{w,b} { (1/||w||) min_n [t_n (w^T φ(x_n) + b)] }. Points attaining this minimum are known as support vectors.
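
To make the criterion concrete, the sketch below computes the margin min_n t_n (w^T x_n + b)/||w|| of a given hyperplane and reports which points attain it (the candidate support vectors). NumPy, identity features φ(x) = x, and the toy data are assumptions for illustration only.

```python
import numpy as np

def margin(w, b, X, t):
    """Margin of the hyperplane (w, b): min_n t_n (w^T x_n + b) / ||w||.
    Also returns the indices attaining the minimum (candidate support vectors)."""
    dists = t * (X @ w + b) / np.linalg.norm(w)
    return dists.min(), np.flatnonzero(np.isclose(dists, dists.min()))

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(w, b, X, t))   # minimum signed distance and the points achieving it
```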

Canonical Representation. This optimization problem is complex: arg max_{w,b} { (1/||w||) min_n [t_n (w^T φ(x_n) + b)] }. Note that rescaling w → κw and b → κb does not change the distance t_n y(x_n)/||w|| (many equivalent answers). So for the point x* closest to the surface, we can use w → w/y(x*) and b → b/y(x*) so that t* (w^T φ(x*) + b) = 1. All other points are at least this far away: for all n, t_n (w^T φ(x_n) + b) ≥ 1. Under these constraints, the optimization becomes arg max_{w,b} 1/||w|| = arg min_{w,b} (1/2) ||w||^2.

Canonical Representation. So the optimization problem is now a constrained optimization problem: arg min_{w,b} (1/2) ||w||^2 s.t. for all n, t_n (w^T φ(x_n) + b) ≥ 1. To solve this, we need to take a detour into Lagrange multipliers.
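
Before the Lagrange-multiplier detour, note that this constrained problem can also be handed directly to a generic solver. The sketch below uses SciPy's SLSQP with the constraints t_n (w^T x_n + b) ≥ 1; SciPy, identity features, and the toy data are assumptions, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Linearly separable toy data, identity features phi(x) = x
X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(wb):
    w = wb[:-1]
    return 0.5 * w @ w                      # (1/2) ||w||^2

constraints = [
    {"type": "ineq",                        # SLSQP expects fun(x) >= 0
     "fun": (lambda wb, xn=xn, tn=tn: tn * (wb[:-1] @ xn + wb[-1]) - 1.0)}
    for xn, tn in zip(X, t)
]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP",
               constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```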

Outline: Maximum Margin Criterion; Math; Maximizing the Margin; Non-Separable Data.

Lagrange Multipliers. [Figure: constraint surface g(x) = 0 with ∇f(x) and ∇g(x) at a point x_A.] Consider the problem: max_x f(x) s.t. g(x) = 0. Points on g(x) = 0 must have ∇g(x) normal to the surface. A stationary point must have no change in f in the direction of the constraint surface, so ∇f(x) must also be normal to the surface. So there must be some λ ≠ 0 such that ∇f(x) + λ ∇g(x) = 0. Define the Lagrangian: L(x, λ) = f(x) + λ g(x). Stationary points of L(x, λ) have ∇_x L(x, λ) = ∇f(x) + λ ∇g(x) = 0 and ∂L/∂λ = g(x) = 0.

Lagrange Multipliers Example. [Figure: the constraint line g(x_1, x_2) = 0 in the (x_1, x_2) plane with the stationary point (x_1*, x_2*).] Consider the problem: max f(x_1, x_2) = 1 − x_1^2 − x_2^2 s.t. g(x_1, x_2) = x_1 + x_2 − 1 = 0. Lagrangian: L(x, λ) = 1 − x_1^2 − x_2^2 + λ(x_1 + x_2 − 1). Stationary points require: ∂L/∂x_1 = −2x_1 + λ = 0, ∂L/∂x_2 = −2x_2 + λ = 0, ∂L/∂λ = x_1 + x_2 − 1 = 0. So the stationary point is (x_1*, x_2*) = (1/2, 1/2), with λ = 1.
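
The same stationary point can be checked symbolically; a small sketch assuming SymPy (not part of the slides):

```python
import sympy as sp

x1, x2, lam = sp.symbols("x1 x2 lam")
f = 1 - x1**2 - x2**2
g = x1 + x2 - 1
L = f + lam * g                                   # Lagrangian L(x, lambda)

# Stationarity in x1, x2 plus the constraint itself
sol = sp.solve([sp.diff(L, x1), sp.diff(L, x2), g], [x1, x2, lam], dict=True)
print(sol)   # [{x1: 1/2, x2: 1/2, lam: 1}]
```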

Lagrange Multipliers: Inequality Constraints. [Figure: region g(x) > 0 with boundary g(x) = 0, showing an interior point x_A, a boundary point x_B, and the gradients ∇f(x), ∇g(x).] Consider the problem: max_x f(x) s.t. g(x) ≥ 0. This is optimization over a region: solutions lie either at stationary points (gradient 0) in the region g(x) > 0, or on the boundary. L(x, λ) = f(x) + λ g(x). Solutions are at stationary points of the Lagrangian with either: ∇f(x) = 0 and λ = 0 (in the region), or ∇f(x) = −λ ∇g(x) with λ > 0 (on the boundary; λ > 0 because we are maximizing f). For both, λ g(x) = 0. So: find stationary points of L s.t. g(x) ≥ 0, λ ≥ 0, λ g(x) = 0.

Outline: Maximum Margin Criterion; Math; Maximizing the Margin; Non-Separable Data.

Now Where Were We? So the optimization problem is now a constrained optimization problem: arg min_{w,b} (1/2) ||w||^2 s.t. for all n, t_n (w^T φ(x_n) + b) ≥ 1. For this problem, the Lagrangian (with N multipliers a_n ≥ 0) is: L(w, b, a) = (1/2) ||w||^2 − Σ_{n=1}^N a_n { t_n (w^T φ(x_n) + b) − 1 }. We can take the derivatives of L with respect to w and b and set them to 0: w = Σ_{n=1}^N a_n t_n φ(x_n), and 0 = Σ_{n=1}^N a_n t_n.

Dual Formulation. Plugging those equations into L removes w and b, resulting in a version of L where ∇_{w,b} L = 0: L(a) = Σ_{n=1}^N a_n − (1/2) Σ_{n=1}^N Σ_{m=1}^N a_n a_m t_n t_m φ(x_n)^T φ(x_m). This new L is the dual representation of the problem (maximize it subject to the constraints a_n ≥ 0 and Σ_n a_n t_n = 0). There is another formula for finding b, like the bias in linear regression.
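
The dual can likewise be handed to a generic solver. The sketch below (SciPy, a linear kernel, and the earlier toy data are assumptions) maximizes L(a) subject to a_n ≥ 0 and Σ_n a_n t_n = 0, then recovers w = Σ_n a_n t_n x_n and b from a support vector via t_n (w^T x_n + b) = 1.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([+1.0, +1.0, -1.0, -1.0])
K = X @ X.T                                    # linear kernel Gram matrix

def neg_dual(a):
    # Minimize the negative of L(a) = sum(a) - 1/2 sum_nm a_n a_m t_n t_m K_nm
    return 0.5 * (a * t) @ K @ (a * t) - a.sum()

res = minimize(neg_dual, x0=np.zeros(len(X)), method="SLSQP",
               bounds=[(0, None)] * len(X),
               constraints=[{"type": "eq", "fun": lambda a: a @ t}])
a = res.x
w = (a * t) @ X                                # w = sum_n a_n t_n x_n
sv = np.argmax(a)                              # a point with a_n > 0 lies on the margin
b = t[sv] - w @ X[sv]                          # from t_n (w^T x_n + b) = 1, t_n in {+1,-1}
print("a =", np.round(a, 3), "w =", w, "b =", b)
```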

Outline: Maximum Margin Criterion; Math; Maximizing the Margin; Non-Separable Data.

Non-Separable Data. [Figure: margin boundaries y = 1, y = 0, y = −1 with points labelled ξ = 0, ξ < 1, and ξ > 1.] For most problems, the data will not be linearly separable (even in feature space φ). We can relax the constraints from t_n y(x_n) ≥ 1 to t_n y(x_n) ≥ 1 − ξ_n. The ξ_n ≥ 0 are called slack variables: ξ_n = 0 satisfies the original constraint, so x_n is on the margin or on the correct side of the margin; 0 < ξ_n < 1 means inside the margin but still correctly classified; ξ_n > 1 means misclassified.

Loss Function for Non-Separable Data. [Same figure as above.] Non-zero slack variables are bad; penalize them while maximizing the margin: min C Σ_{n=1}^N ξ_n + (1/2) ||w||^2. The constant C > 0 controls the importance of a large margin versus incorrect (non-zero slack) points, and is set using cross-validation. The optimization is the same quadratic objective with different constraints, and is still convex.
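
In practice this soft-margin problem is usually handed to a library solver. A minimal sketch assuming scikit-learn (not referenced on the slides), where C plays exactly the role described above and is chosen by cross-validation:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy non-separable data: two overlapping Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(1.5, 1, (50, 2))])
t = np.array([-1] * 50 + [+1] * 50)

# Cross-validate C: small C tolerates slack (wide margin), large C penalizes it
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, t)
print("best C:", grid.best_params_["C"],
      "CV accuracy:", round(grid.best_score_, 3))
```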

SVM Loss Function. The SVM for the separable case solved the problem: arg min_w (1/2) ||w||^2 s.t. for all n, t_n y_n ≥ 1. We can write this as: arg min_w Σ_{n=1}^N E_∞(t_n y_n − 1) + λ ||w||^2, where E_∞(z) = 0 if z ≥ 0 and ∞ otherwise. The non-separable case relaxes this to: arg min_w Σ_{n=1}^N E_SV(t_n y_n − 1) + λ ||w||^2, where E_SV(t_n y_n − 1) = [1 − y_n t_n]_+ is the hinge loss, with [u]_+ = u if u ≥ 0 and 0 otherwise.

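The regularized hinge-loss view lends itself to a direct (sub)gradient implementation. The rough sketch below assumes NumPy and toy data; it is not how the slides solve the problem (they use the constrained/dual form), just an illustration of minimizing Σ_n [1 − t_n y_n]_+ + λ||w||^2.

```python
import numpy as np

def hinge_objective(w, b, X, t, lam):
    """sum_n [1 - t_n (w^T x_n + b)]_+  +  lam * ||w||^2"""
    margins = 1 - t * (X @ w + b)
    return np.maximum(margins, 0).sum() + lam * w @ w

def train_svm_subgradient(X, t, lam=0.01, lr=0.01, epochs=200):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = 1 - t * (X @ w + b)
        active = margins > 0                       # points with non-zero hinge loss
        # Subgradient of the objective with respect to w and b
        grad_w = -(t[active][:, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -t[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[1.0, 1.0], [2.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
t = np.array([+1.0, +1.0, -1.0, -1.0])
w, b = train_svm_subgradient(X, t)
print(np.sign(X @ w + b))          # should reproduce the labels t
```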

Loss Functions. [Figure: loss E(z) plotted against z = y_n t_n.] Compare the loss functions that linear classifiers use for learning, as a function of z = y_n t_n (with t_n ∈ {+1, −1}); there is a misclassification error iff z < 0. Black is the misclassification error. Transformed simple linear classifier, squared error: (y_n − t_n)^2. Transformed logistic regression, cross-entropy error: −t_n ln y_n. SVM, hinge loss: ξ_n = [1 − y_n t_n]_+, positive only if there is a mistake, otherwise 0; encourages sparse solutions.

Two Views of Learning as Optimization. The original SVM goal was of the form: find the simplest hypothesis that is consistent with the data, or maximize simplicity given a consistency constraint. This general idea appears in much scientific model building, in image processing, and in other applications. Bayesian methods use a criterion of the form: find a trade-off between simplicity and data fit, or maximize a sum of the form (data fit + λ · simplicity), e.g. ln p(D | M) + λ ln p(M), where the model prior p(M) is higher for simpler models.

Pros and Cons of Learning Criteria. The Bayesian approach has a solid probabilistic foundation in Bayes' theorem and seems especially suitable for noisy data. The constraint-based approach is often easy for users to understand, often leads to sparser, simpler models, and is suitable for clean data.

Conclusion. Readings: Ch. 7 up to and including Ch. 7.1.2. Many algorithms can be rewritten with only dot products of features; we've seen NN, perceptron, and regression, and also PCA and SVMs (later). The maximum margin criterion decides on the decision boundary for linearly separable data; relax with slack variables for the non-separable case. Global optimization is possible in both cases: the problem is convex (no local optima), so descent methods converge to the global optimum.

Logistic Regression Learning: Iterative Reweighted Least Squares. Iterative reweighted least squares (IRLS) is a descent method: as in gradient descent, start with an initial guess and improve it. Gradient descent takes a step (how large?) in the gradient direction. IRLS is a special case of the Newton-Raphson method: approximate the function using a second-order Taylor expansion, f̂(v) = f(w) + ∇f(w)^T (v − w) + (1/2)(v − w)^T ∇²f(w)(v − w). The closed-form solution minimizing this is straightforward, since the approximation is quadratic and its derivatives are linear. In IRLS this second-order Taylor expansion ends up being a weighted least-squares problem, as in the linear regression case; hence the name IRLS.

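A compact sketch of IRLS for logistic regression following the Newton update described above. NumPy and the toy data are assumptions; the gradient Φ^T(y − t) and Hessian Φ^T R Φ with R = diag(y_n(1 − y_n)) are the standard Bishop Ch. 4 expressions, which the slide alludes to but does not spell out.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=10):
    """Iterative reweighted least squares for logistic regression.
    Each iteration is a Newton-Raphson update, equivalent to a weighted
    least-squares problem with weights R = diag(y_n (1 - y_n))."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1 - y))                 # weighting matrix
        grad = Phi.T @ (y - t)                   # gradient of the cross-entropy error
        H = Phi.T @ R @ Phi                      # Hessian
        w = w - np.linalg.solve(H, grad)         # Newton step
    return w

# Toy 1-D data (non-separable, so the maximum likelihood solution is finite)
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
t = np.array([0, 0, 1, 0, 1, 1])
Phi = np.hstack([np.ones((len(X), 1)), X])       # bias feature appended
print(irls_logistic(Phi, t))
```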

Newton-Raphson. [Figure 9.16 from Boyd and Vandenberghe, Convex Optimization: the function f (solid) and its second-order approximation f̂ at x (dashed); the Newton step Δx_nt is what must be added to x to give the minimizer of f̂.] Excellent reference, free for download online: http://www.stanford.edu/~boyd/cvxbook/. The Newton step: for x ∈ dom f, the vector Δx_nt = −∇²f(x)^{-1} ∇f(x).