Day 4: Classification, Support Vector Machines
Introduction to Machine Learning Summer School
June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
21 June 2018
Topics so far
Supervised learning, linear regression
o Overfitting, bias-variance trade-off
o Ridge and lasso regression, gradient descent
Yesterday
o Classification, logistic regression
o Regularization for logistic regression
o Multi-class classification
Today
o Maximum margin classifiers
o Kernel trick
Classification
Supervised learning: estimate a mapping f from input x ∈ X to output y ∈ Y
o Regression: Y = R or other continuous variables
o Classification: Y takes a discrete set of values
Examples:
o Y = {spam, not spam}
o digits (categories, not numeric values): Y = {0, 1, 2, ..., 9}
Many successful applications of ML in vision, speech, NLP, healthcare
Parametric classifiers
H = {x ↦ w·x + b : w ∈ R^d, b ∈ R}
ŷ(x) = sign(w·x + b)
w·x + b = 0 is the (linear) decision boundary or separating hyperplane
o separates R^d into two halfspaces (regions)
o w·x + b > 0 gets label +1 and w·x + b < 0 gets label -1
More generally, ŷ(x) = sign(f(x)) → the decision boundary is f(x) = 0
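As a concrete illustration, here is a minimal numpy sketch of such a classifier (the parameter values and data points are made up for the example):

```python
import numpy as np

# Hypothetical parameters of a linear classifier in R^2
w = np.array([2.0, -1.0])   # normal direction of the hyperplane
b = 0.5                     # offset

def predict(X, w, b):
    """Return +1/-1 labels by thresholding the linear score w.x + b."""
    scores = X @ w + b       # positive on one side of the hyperplane
    return np.where(scores > 0, 1, -1)

X = np.array([[1.0, 1.0], [-1.0, 2.0]])
print(predict(X, w, b))     # points on either side of w.x + b = 0
```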
Surrogate losses
The correct loss to use is the 0-1 loss after thresholding:
l_01(f(x), y) = 1[sign(f(x)) ≠ y] = 1[f(x)y < 0]
Linear regression uses the squared loss l_sq(f(x), y) = (f(x) - y)^2
l_01 is hard to optimize over, so find another loss l(f(x), y) that is:
o Convex (for any fixed y) → easier to minimize
o An upper bound on l_01 → small l implies small l_01
The squared loss satisfies both → but has large loss even when l_01(f(x), y) = 0
Two more surrogate losses in this course:
o Logistic loss: l_log(f(x), y) = log(1 + exp(-f(x)y))
o Hinge loss: l_hinge(f(x), y) = max(0, 1 - f(x)y)
[Figure: the losses l(f(x), y) plotted against f(x)y]
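All four losses are functions of the product f(x)y, the horizontal axis in the figure. A small sketch to compare them numerically (the function names here are chosen for illustration):

```python
import numpy as np

def zero_one(m):   # l_01: 1 when the sign is wrong, i.e. f(x)y < 0
    return (m < 0).astype(float)

def squared(m):    # l_sq = (f(x) - y)^2 = (1 - f(x)y)^2 when y is in {-1, +1}
    return (1 - m) ** 2

def logistic(m):   # l_log = log(1 + exp(-f(x)y))
    return np.log1p(np.exp(-m))

def hinge(m):      # l_hinge = max(0, 1 - f(x)y)
    return np.maximum(0.0, 1 - m)

m = np.linspace(-2, 3, 6)   # margin values f(x)y
for loss in (zero_one, squared, logistic, hinge):
    print(loss.__name__, np.round(loss(m), 2))
# logistic and hinge upper-bound the 0-1 loss and decay for m > 0;
# the squared loss keeps growing even when m >> 0, i.e. even when l_01 = 0
```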
Logistic regression: ERM on a surrogate loss
Logistic loss: l(f(x), y) = log(1 + exp(-f(x)y))
S = {(x_i, y_i) : i = 1, 2, ..., N}, X = R^d, Y = {-1, 1}
Linear model: f(x) = w·x + b
Minimize the training loss:
ŵ, b̂ = argmin_{w,b} Σ_i log(1 + exp(-(w·x_i + b) y_i))
Output classifier: ŷ(x) = sign(ŵ·x + b̂)
Logistic regression
ŵ, b̂ = argmin_{w,b} Σ_i log(1 + exp(-(w·x_i + b) y_i))
o Convex optimization problem
o Can solve using gradient descent
o Can also add the usual regularization: l_2, l_1
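A minimal gradient-descent sketch for this objective (the toy data, step size, and iteration count are all choices made for the example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 2
X = rng.normal(size=(N, d))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy linearly separable labels

w, b = np.zeros(d), 0.0
lr = 0.1
for _ in range(500):
    m = y * (X @ w + b)               # margins y_i (w.x_i + b)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m))
    g = -1.0 / (1.0 + np.exp(m))      # per-sample derivative wrt the margin
    grad_w = (g * y) @ X / N          # chain rule: dm/dw = y_i x_i
    grad_b = np.mean(g * y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b, np.mean(np.sign(X @ w + b) == y))   # training accuracy
```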
Linear decision boundaries
ŷ(x) = sign(w·x + b)
{x : w·x + b = 0} is a hyperplane in R^d
o the decision boundary
o w is the direction of the normal
o b is the offset
{x : w·x + b = 0} divides R^d into two halfspaces (regions)
o {x : w·x + b ≥ 0} gets label +1 and {x : w·x + b < 0} gets label -1
The classifier maps x to a 1D coordinate x' = w·x + b and thresholds it
Linear separators in 2D
[Figure slides: examples of linear separators in 2D. Slide credit: Nati Srebro]
Margin of a classifier
Margin: the distance from the separating hyperplane to the closest instance point
Large margins are more stable:
o small perturbations of the data do not change the prediction
[Figure: separating hyperplane with margin between two classes]
Maximum margin classifier
S = {(x_i, y_i) : i = 1, 2, ..., N}, binary classes Y = {-1, 1}
Assume the data is linearly separable:
o ∃ (w, b) such that for all i = 1, 2, ..., N: y_i = sign(w·x_i + b), i.e., y_i(w·x_i + b) > 0
The maximum margin separator is given by
ŵ, b̂ = argmax_{w,b} min_i y_i(w·x_i + b) / ||w||
o y_i(w·x_i + b) / ||w|| is the margin of sample i; the min picks the smallest margin
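The inner quantity is easy to compute for any candidate (w, b); a small sketch with made-up data and parameters:

```python
import numpy as np

def margins(X, y, w, b):
    """Geometric margin y_i (w.x_i + b) / ||w|| of every sample."""
    return y * (X @ w + b) / np.linalg.norm(w)

X = np.array([[2.0, 1.0], [0.0, 2.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -0.5

print(margins(X, y, w, b))         # per-sample margins (all > 0: separable)
print(margins(X, y, w, b).min())   # the margin of this classifier
```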
Maximum margin classifier
ŵ, b̂ = argmax_{w,b} min_i y_i(w·x_i + b) / ||w||
Claim 1: If (ŵ, b̂) is a solution, then for any γ > 0, (γŵ, γb̂) is also a solution
Option 1: we can fix ||w|| = 1 to get
ŵ, b̂ = argmax_{||w||=1, b} min_i y_i(w·x_i + b)
Option 2: we can also fix min_i y_i(w·x_i + b) = 1
o the margin is now 1/||w||
o instead of increasing the margin we can reduce the norm ||w||
Max-margin classifier: equivalent formulation
Solve: w̃, b̃ = argmin_{w,b} ||w||^2 s.t. ∀i, y_i(w·x_i + b) ≥ 1
Hard margin Support Vector Machine (SVM)
Claim 2: Equivalent to the previous slide: (w̃/||w̃||, b̃/||w̃||) is a solution of
ŵ, b̂ = argmax_{||w||=1, b} min_i y_i(w·x_i + b)
Proof:
1. Let γ̂ = min_i y_i(ŵ·x_i + b̂). Then min_i y_i((ŵ/γ̂)·x_i + b̂/γ̂) = 1, so (ŵ/γ̂, b̂/γ̂) is feasible for the argmin
2. Hence ||w̃|| ≤ ||ŵ/γ̂|| = 1/γ̂
3. min_i y_i((w̃/||w̃||)·x_i + b̃/||w̃||) ≥ 1/||w̃|| ≥ γ̂, so (w̃/||w̃||, b̃/||w̃||) attains the maximum margin
Maximum margin classifier: formulations
Original formulation:
ŵ, b̂ = argmax_{w,b} min_i y_i(w·x_i + b) / ||w||
Fixing ||w|| = 1:
ŵ, b̂ = argmax_{w,b} min_i y_i(w·x_i + b) s.t. ||w|| = 1
Fixing min_i y_i(w·x_i + b) = 1:
w̃, b̃ = argmin_{w,b} ||w||^2 s.t. ∀i, y_i(w·x_i + b) ≥ 1
Henceforth, b will be absorbed into w by adding an additional feature of 1 to x (see the sketch below)
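Both Claim 1 (scale invariance of the margin) and the bias-absorption trick are easy to verify numerically; a small sketch with made-up data:

```python
import numpy as np

X = np.array([[2.0, 1.0], [0.0, 2.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 1.0]), -0.5

def margin(X, y, w, b):
    return (y * (X @ w + b) / np.linalg.norm(w)).min()

# Claim 1: rescaling (w, b) by any gamma > 0 leaves the margin unchanged
print(margin(X, y, w, b), margin(X, y, 3.7 * w, 3.7 * b))

# Absorbing b: append a constant-1 feature and fold b into w
X_aug = np.hstack([X, np.ones((len(X), 1))])
w_aug = np.append(w, b)
print(np.allclose(X_aug @ w_aug, X @ w + b))   # identical scores, no separate b
```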
Margin and norm
margin = min_i y_i(ŵ·x_i) / ||ŵ||
Remember in regression: small norm solutions have low complexity
o Is this true for maximum margin classifiers?
o What about classification with logistic loss Σ_i log(1 + exp(-y_i w·x_i))?
o How to do capacity control in maximum margin classifier learning?
Some places refer to min_i y_i(w·x_i) as the margin → this implicitly assumes a normalization
o min_i y_i(w·x_i) is meaningless without knowing what ||w|| is
Solutions of hard margin SVM
ŵ = argmin_w ||w||^2 s.t. ∀i, y_i(w·x_i) ≥ 1
Theorem: ŵ ∈ span{x_i : i = 1, 2, ..., N}, i.e., ∃ {β̂_i : i = 1, 2, ..., N} such that ŵ = Σ_i β̂_i x_i
Denote S = span{x_i : i = 1, 2, ..., N} and S^⊥ = {z ∈ R^d : ∀i, z·x_i = 0}
o For any z ∈ R^d, z = z_S + z_{S^⊥} with z_S ∈ S and z_{S^⊥} ∈ S^⊥
o ||z||^2 = ||z_S||^2 + ||z_{S^⊥}||^2
Three step proof:
1. Decompose ŵ = ŵ_S + ŵ_{S^⊥}
2. min_i y_i(ŵ·x_i) ≥ 1 implies min_i y_i(ŵ_S·x_i) ≥ 1 (because ŵ_{S^⊥}·x_i = 0 for all i)
3. If ŵ_{S^⊥} ≠ 0, then ||ŵ_S||^2 < ||ŵ||^2, contradicting optimality of ŵ
Representer Theorem
ŵ = argmin_w ||w||^2 s.t. ∀i, y_i(w·x_i) ≥ 1
Theorem: ŵ ∈ span{x_i : i = 1, 2, ..., N}, i.e., ∃ {β̂_i : i = 1, 2, ..., N} such that ŵ = Σ_i β̂_i x_i
o Special case of the representer theorem
Theorem (extension): additionally, {β̂_i} also satisfies β̂_i = 0 for all i such that y_i(ŵ·x_i) > 1
Proof: (animation next slide)
[Figure: animation showing that only the points closest to the boundary constrain ŵ. Illustration credit: Nati Srebro]
Representer Theorem
ŵ = argmin_w ||w||^2 s.t. ∀i, y_i(w·x_i) ≥ 1
Theorem: ∃ {β̂_i : i = 1, 2, ..., N} such that ŵ = Σ_i β̂_i x_i, and {β̂_i} also satisfies β̂_i = 0 for all i such that y_i(ŵ·x_i) > 1
SV(ŵ) = {i : y_i(ŵ·x_i) = 1}: the datapoints closest to the boundary
o called support vectors
o hence "support vector machine"
ŵ = Σ_{i ∈ SV(ŵ)} β̂_i x_i
How do we get ŵ?
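One way to see the theorem in action is a numerical check with scikit-learn's SVC (this library call is not part of the slides; a large C approximates the hard-margin problem):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

# dual_coef_ holds the nonzero beta_i, stored only for the support vectors;
# the representer theorem says w is a combination of exactly those points
w_from_betas = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_betas, clf.coef_))   # True
print(len(clf.support_), "support vectors out of", len(X))
```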
Optimizing the SVM problem
ŵ = argmin_w ||w||^2 s.t. ∀i, y_i(w·x_i) ≥ 1
1. Can do sub-gradient descent (next class)
2. Special case of a quadratic program:
min_z (1/2) z^T P z + q^T z s.t. Gz ≤ h, Az = b
o Change of variables: ŵ = Σ_{i=1}^N β̂_i x_i
Optimizing the SVM problem
ŵ = argmin_w ||w||^2 s.t. ∀i, y_i(w·x_i) ≥ 1
Change of variables w = Σ_{i=1}^N β_i x_i:
min_β Σ_i Σ_j β_i β_j (x_i·x_j) s.t. ∀i, y_i Σ_j β_j (x_i·x_j) ≥ 1
= min_{β ∈ R^N} β^T G β s.t. ∀i, y_i (Gβ)_i ≥ 1
G ∈ R^{N×N} with G_ij = x_i·x_j is called the Gram matrix
Convex program: quadratic programming
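A sketch of this QP in cvxpy (assuming cvxpy is installed; the tiny ridge added to G is a numerical convenience so the solver accepts it as positive semidefinite, and b has been absorbed as on the earlier slide):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (15, 2)), rng.normal(-2, 1, (15, 2))])
y = np.array([1.0] * 15 + [-1.0] * 15)
N = len(y)

G = X @ X.T                    # Gram matrix, G_ij = x_i . x_j
G += 1e-8 * np.eye(N)          # tiny ridge for numerical PSD-ness

beta = cp.Variable(N)
prob = cp.Problem(cp.Minimize(cp.quad_form(beta, G)),
                  [cp.multiply(y, G @ beta) >= 1])
prob.solve()

w = X.T @ beta.value           # recover w = sum_i beta_i x_i
print(np.round(beta.value, 3)) # nonzero only on the support vectors
print((y * (X @ w)).min())     # ~1: the closest points sit on the margin
```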
The Kernel
min_w ||w||^2 s.t. ∀i, y_i(w·x_i) ≥ 1  ⟷  min_{β ∈ R^N} β^T G β s.t. ∀i, y_i (Gβ)_i ≥ 1
The optimization problem depends on the x_i only through the values G_ij = x_i·x_j for i, j ∈ [N]
What about prediction?
ŵ·x = Σ_i β̂_i (x_i·x): again only through inner products
The function K(x, x') = x·x' is called the kernel
Learning non-linear classifiers using feature transformations, i.e., f(x) = w·φ(x) for some φ(x):
o the only thing we need to know is K_φ(x, x') = φ(x)·φ(x')
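Once K replaces every inner product, training and prediction never need φ explicitly; a sketch of kernelized prediction with a degree-2 polynomial kernel (the kernel choice and the coefficient values are illustrative):

```python
import numpy as np

def poly_kernel(A, B, degree=2):
    """K(x, x') = (1 + x . x')^degree, computed for all pairs of rows."""
    return (1.0 + A @ B.T) ** degree

X_train = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
beta = np.array([0.5, 0.5, -0.2])   # hypothetical dual coefficients

def f(X_new):
    # f(x) = w . phi(x) = sum_i beta_i K(x_i, x): phi(x) is never formed
    return poly_kernel(X_new, X_train) @ beta

print(f(np.array([[1.0, 1.0]])))
```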
Kernels as prior knowledge
If we think that positive examples can (almost) be separated by some ellipse, then we should use polynomials of degree 2
A kernel encodes a measure of similarity between objects. A bit like nearest neighbors, except that it must be a valid inner product function.
Slide credit: Nati Srebro
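For intuition (a check added here, not from the slides): the degree-2 polynomial kernel corresponds to an explicit quadratic feature map, so a linear separator in φ-space is an ellipse or other conic in the original space:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, x') = (1 + x . x')^2 in 2D."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2,
                     x1 ** 2, x2 ** 2, r2 * x1 * x2])

x, z = np.array([0.3, -1.2]), np.array([2.0, 1.0])
print(phi(x) @ phi(z))        # inner product in the 6D feature space
print((1.0 + x @ z) ** 2)     # kernel value in 2D: identical
```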