Day 4: Classification, support vector machines

Day 4: Classification, support vector machines. Introduction to Machine Learning Summer School, June 18 - June 29, 2018, Chicago. Instructor: Suriya Gunasekar, TTI Chicago. 21 June 2018.

Topics so far
o Supervised learning, linear regression
  - Overfitting, bias-variance trade-off
  - Ridge and lasso regression, gradient descent
Yesterday
o Classification, logistic regression
o Regularization for logistic regression
o Multi-class classification
Today
o Maximum margin classifiers
o Kernel trick

Classification
Supervised learning: estimate a mapping $f$ from input $x \in \mathcal{X}$ to output $y \in \mathcal{Y}$
o Regression: $\mathcal{Y} = \mathbb{R}$ or other continuous variables
o Classification: $\mathcal{Y}$ takes a discrete set of values
Examples:
o spam filtering, $\mathcal{Y} = \{\text{spam}, \text{no spam}\}$
o digit recognition, $\mathcal{Y} = \{0, 1, 2, \ldots, 9\}$ (discrete labels, not numeric values)
Many successful applications of ML in vision, speech, NLP, healthcare

Parametric classifiers
$\mathcal{H} = \{x \mapsto w \cdot x + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}$
$\hat{y}(x) = \mathrm{sign}(w \cdot x + b)$
$\{x : w \cdot x + b = 0\}$ is the (linear) decision boundary or separating hyperplane; it separates $\mathbb{R}^d$ into two halfspaces (regions): $w \cdot x + b > 0$ gets label $+1$ and $w \cdot x + b < 0$ gets label $-1$
More generally, $\hat{y}(x) = \mathrm{sign}(f(x))$, so the decision boundary is $\{x : f(x) = 0\}$
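As a concrete illustration of the prediction rule $\hat{y}(x) = \mathrm{sign}(w \cdot x + b)$, here is a minimal numpy sketch; the weights, offset, and data points are made up for illustration:

```python
import numpy as np

# A hypothetical linear classifier in d = 2 dimensions: y_hat(x) = sign(w.x + b).
w = np.array([2.0, -1.0])   # normal direction of the hyperplane (illustrative values)
b = 0.5                     # offset

X = np.array([[1.0, 1.0],
              [-1.0, 2.0],
              [0.0, 0.0]])  # three example points, one per row

scores = X @ w + b          # w.x + b for each point
y_hat = np.sign(scores)     # +1 on one side of the hyperplane, -1 on the other
print(scores, y_hat)
```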

Surrogate Losses
The correct loss to use is the 0-1 loss after thresholding:
$\ell_{01}(f(x), y) = \mathbf{1}[\mathrm{sign}(f(x)) \ne y] = \mathbf{1}[f(x)\,y < 0]$
Linear regression uses the squared loss $\ell_{\mathrm{sq}}(f(x), y) = (f(x) - y)^2$
$\ell_{01}$ is hard to optimize over, so find another (surrogate) loss $\ell(f(x), y)$ that is:
o Convex (for any fixed $y$), hence easier to minimize
o An upper bound of $\ell_{01}$, so that small $\ell$ implies small $\ell_{01}$
The squared loss satisfies both, but it has large loss even when $\ell_{01}(f(x), y) = 0$
Two more surrogate losses in this course:
o Logistic loss $\ell_{\log}(f(x), y) = \log(1 + \exp(-f(x)\,y))$
o Hinge loss $\ell_{\mathrm{hinge}}(f(x), y) = \max(0, 1 - f(x)\,y)$
[Plot: the losses $\ell(f(x), y)$ as functions of the margin $f(x)\,y$]
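The surrogate losses above can be compared numerically as functions of the margin $f(x)\,y$; a small sketch (the margin values are chosen arbitrarily):

```python
import numpy as np

# Surrogate losses as a function of the margin m = f(x) * y.
m = np.linspace(-2, 2, 9)            # a few margin values

zero_one = (m < 0).astype(float)     # 0-1 loss after thresholding
squared  = (m - 1) ** 2              # squared loss (f(x) - y)^2, rewritten via m = f(x)y for y in {-1, +1}
logistic = np.log(1 + np.exp(-m))    # logistic loss
hinge    = np.maximum(0, 1 - m)      # hinge loss

# Each surrogate upper-bounds the 0-1 loss; the squared loss keeps growing even for m > 1.
print(np.column_stack([m, zero_one, squared, logistic, hinge]))
```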

Logistic regression: ERM on surrogate loss
Logistic loss $\ell_{\log}(f(x), y) = \log(1 + \exp(-f(x)\,y))$
$S = \{(x_i, y_i) : i = 1, 2, \ldots, N\}$, $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$
Linear model $f(x) = w \cdot x + b$
Minimize the training loss:
$(\hat{w}, \hat{b}) = \arg\min_{w, b} \sum_i \log(1 + \exp(-(w \cdot x_i + b)\, y_i))$
Output classifier $\hat{y}(x) = \mathrm{sign}(\hat{w} \cdot x + \hat{b})$

Logistic Regression
$(\hat{w}, \hat{b}) = \arg\min_{w, b} \sum_i \log(1 + \exp(-(w \cdot x_i + b)\, y_i))$
Convex optimization problem
Can solve using gradient descent
Can also add the usual regularization: $\ell_2$, $\ell_1$
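A minimal gradient descent sketch for this objective, assuming numpy; the loss is averaged over the sample and an optional $\ell_2$ penalty is included, and the function name and hyperparameters are illustrative rather than taken from the slides:

```python
import numpy as np

def logistic_regression_gd(X, y, lam=0.0, lr=0.1, n_iters=1000):
    """Gradient descent on the averaged logistic loss with an optional l2 penalty.
    X: (N, d) array of inputs, y: (N,) array of labels in {-1, +1}."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        m = (X @ w + b) * y                 # margins f(x_i) * y_i
        s = -y / (1 + np.exp(m))            # d loss_i / d f(x_i)
        grad_w = X.T @ s / N + 2 * lam * w  # average data gradient + l2 term
        grad_b = s.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# usage: w_hat, b_hat = logistic_regression_gd(X_train, y_train, lam=0.01)
```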

Linear decision boundaries
$\hat{y}(x) = \mathrm{sign}(w \cdot x + b)$
$\{x : w \cdot x + b = 0\}$ is a hyperplane in $\mathbb{R}^d$
o the decision boundary
o $w$ is the direction of the normal
o $b$ is the offset
$\{x : w \cdot x + b = 0\}$ divides $\mathbb{R}^d$ into two halfspaces (regions):
o $\{x : w \cdot x + b \ge 0\}$ gets label $+1$ and $\{x : w \cdot x + b < 0\}$ gets label $-1$
The classifier maps $x$ to a 1D coordinate $x' = w \cdot x + b$
[Figure: 2D points $(x_1, x_2)$ projected onto the normal direction of the hyperplane]

Linear separators in 2D
[Figure: examples of linear separators in 2D. Slide credit: Nati Srebro]

Margin of a classifier
Margin: distance of the closest instance point to the linear hyperplane
Large margins are more stable:
o small perturbations of the data do not change the prediction
[Figure: two separators in the $(x_1, x_2)$ plane, one with a small margin and one with a large margin]
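The margin of a given linear classifier can be computed directly from this definition; a small numpy sketch on made-up separable data:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Distance from the hyperplane w.x + b = 0 to the closest point.
    The signed distance of x_i is y_i * (w.x_i + b) / ||w||; the margin is the
    minimum over i (it is negative if some point is misclassified)."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Toy linearly separable data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(np.array([1.0, 1.0]), 0.0, X, y))  # margin of this particular separator
```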

Maximum margin classifier
$S = \{(x_i, y_i) : i = 1, 2, \ldots, N\}$, binary classes $\mathcal{Y} = \{-1, 1\}$
Assume the data is linearly separable:
o $\exists (w, b)$ such that for all $i = 1, 2, \ldots, N$: $y_i = \mathrm{sign}(w \cdot x_i + b)$, i.e., $y_i (w \cdot x_i + b) > 0$
The maximum margin separator is given by
$(\hat{w}, \hat{b}) = \arg\max_{w, b} \frac{1}{\|w\|_2} \min_i\, y_i (w \cdot x_i + b)$
where $y_i (w \cdot x_i + b)$ is the margin of sample $i$ and the $\min_i$ picks out the smallest margin
[Figure: maximum margin separator]

Maximum margin classifier
$(\hat{w}, \hat{b}) = \arg\max_{w, b} \frac{1}{\|w\|_2} \min_i\, y_i (w \cdot x_i + b)$
Claim 1: If $(\hat{w}, \hat{b})$ is a solution, then for any $\gamma > 0$, $(\gamma \hat{w}, \gamma \hat{b})$ is also a solution
Option 1: we can fix $\|w\|_2 = 1$ to get
$(\hat{w}, \hat{b}) = \arg\max_{\|w\|_2 = 1,\, b} \min_i\, y_i (w \cdot x_i + b)$
Option 2: we can also fix $\min_i y_i (w \cdot x_i + b) = 1$
o the margin is now $\frac{1}{\|w\|_2}$
o instead of increasing the margin we can reduce the norm $\|w\|_2$

Max-margin classifier: equivalent formulation
Solve:
$(\tilde{w}, \tilde{b}) = \arg\min_{w, b} \|w\|_2^2 \ \text{ s.t. } \forall i,\ y_i (w \cdot x_i + b) \ge 1$
This is the hard margin Support Vector Machine (SVM)
Claim 2: equivalently to the previous slide, $\left(\frac{\tilde{w}}{\|\tilde{w}\|_2}, \frac{\tilde{b}}{\|\tilde{w}\|_2}\right)$ is a solution of $(\hat{w}, \hat{b}) = \arg\max_{\|w\|_2 = 1,\, b} \min_i\, y_i (w \cdot x_i + b)$
Proof:
1. Let $\min_i y_i (\hat{w} \cdot x_i + \hat{b}) = \hat{\gamma}$; then $\min_i y_i \left(\frac{\hat{w}}{\hat{\gamma}} \cdot x_i + \frac{\hat{b}}{\hat{\gamma}}\right) \ge 1$, so $(\hat{w}/\hat{\gamma}, \hat{b}/\hat{\gamma})$ is feasible for the hard margin problem
2. $\left\|\frac{\hat{w}}{\hat{\gamma}}\right\|_2 = \frac{1}{\hat{\gamma}}$, hence by optimality $\|\tilde{w}\|_2 \le \frac{1}{\hat{\gamma}}$
3. $\min_i y_i \left(\frac{\tilde{w}}{\|\tilde{w}\|_2} \cdot x_i + \frac{\tilde{b}}{\|\tilde{w}\|_2}\right) = \frac{1}{\|\tilde{w}\|_2} \min_i y_i (\tilde{w} \cdot x_i + \tilde{b}) \ge \frac{1}{\|\tilde{w}\|_2} \ge \hat{\gamma}$, so the normalized $(\tilde{w}, \tilde{b})$ attains the maximum margin $\hat{\gamma}$

Maximum margin classifier: formulations
Original formulation:
$(\hat{w}, \hat{b}) = \arg\max_{w, b} \frac{1}{\|w\|_2} \min_i\, y_i (w \cdot x_i + b)$
Fixing $\|w\|_2 = 1$:
$(\hat{w}, \hat{b}) = \arg\max_{w, b} \min_i\, y_i (w \cdot x_i + b) \ \text{ s.t. } \|w\|_2 = 1$
Fixing $\min_i y_i (w \cdot x_i + b) = 1$:
$(\tilde{w}, \tilde{b}) = \arg\min_{w, b} \|w\|_2^2 \ \text{ s.t. } \forall i,\ y_i (w \cdot x_i + b) \ge 1$
Henceforth, $b$ will be absorbed into $w$ by adding an additional feature of 1 to $x$
[Figure: maximum margin separator]

Margin and norm
margin $= \min_i \dfrac{y_i\, w \cdot x_i}{\|w\|_2}$
Remember, in regression small norm solutions have low complexity!
o Is this true for maximum margin classifiers?
o What about classification with the logistic loss $\sum_i \log(1 + \exp(-y_i\, w \cdot x_i))$?
o How to do capacity control in maximum margin classifier learning?
In some places $\min_i y_i\, w \cdot x_i$ is referred to as the margin; this implicitly assumes a normalization of $\|w\|_2$
o $\min_i y_i\, w \cdot x_i$ is meaningless without knowing what $\|w\|_2$ is!
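A quick numerical check of the caution above: scaling $w$ by $\gamma$ scales $\min_i y_i\, w \cdot x_i$ by $\gamma$ but leaves the normalized margin unchanged (toy data, with $b$ absorbed as a constant feature of 1):

```python
import numpy as np

# Scaling w changes the unnormalized margin min_i y_i (w.x_i) but not
# the geometric margin min_i y_i (w.x_i) / ||w||_2.
X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-2.0, -1.0, 1.0], [-1.0, -3.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0, 0.0])

for gamma in [1.0, 10.0, 0.1]:
    unnormalized = np.min(y * (X @ (gamma * w)))
    geometric = unnormalized / np.linalg.norm(gamma * w)
    print(gamma, unnormalized, geometric)   # unnormalized scales with gamma, geometric does not
```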

Solutions of hard margin SVM
$\hat{w} = \arg\min_w \|w\|_2^2 \ \text{ s.t. } \forall i,\ y_i\, w \cdot x_i \ge 1$
Theorem: $\hat{w} \in \mathrm{span}\{x_i : i = 1, 2, \ldots, N\}$, i.e., $\exists \{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{w} = \sum_i \hat{\beta}_i x_i$
Denote $S = \mathrm{span}\{x_i : i = 1, 2, \ldots, N\}$ and $S^\perp = \{z \in \mathbb{R}^d : \forall i,\ z \cdot x_i = 0\}$
o For any $z \in \mathbb{R}^d$, $z = z_S + z_{S^\perp}$ with $z_S \in S$ and $z_{S^\perp} \in S^\perp$
o $\|z\|_2^2 = \|z_S\|_2^2 + \|z_{S^\perp}\|_2^2$
Three step proof:
1. Decompose $\hat{w} = \hat{w}_S + \hat{w}_{S^\perp}$
2. $\min_i y_i\, \hat{w} \cdot x_i \ge 1 \implies \min_i y_i\, \hat{w}_S \cdot x_i \ge 1$ (because $\hat{w}_{S^\perp} \cdot x_i = 0$ for all $i$)
3. If $\hat{w}_{S^\perp} \ne 0$, then $\|\hat{w}_S\|_2^2 < \|\hat{w}\|_2^2$, contradicting the optimality of $\hat{w}$; hence $\hat{w}_{S^\perp} = 0$ and $\hat{w} \in S$

Representer Theorem
$\hat{w} = \arg\min_w \|w\|_2^2 \ \text{ s.t. } \forall i,\ y_i\, w \cdot x_i \ge 1$
Theorem: $\hat{w} \in \mathrm{span}\{x_i : i = 1, 2, \ldots, N\}$, i.e., $\exists \{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{w} = \sum_i \hat{\beta}_i x_i$
o Special case of the representer theorem
Theorem (ext): additionally, $\{\hat{\beta}_i\}$ also satisfies $\hat{\beta}_i = 0$ for all $i$ such that $y_i\, \hat{w} \cdot x_i > 1$
Proof?: (animation on the next slide)

[Figure: illustration of the hard margin solution $\hat{w}$ and the points that determine it. Illustration credit: Nati Srebro]

Representer Theorem
$\hat{w} = \arg\min_w \|w\|_2^2 \ \text{ s.t. } \forall i,\ y_i\, w \cdot x_i \ge 1$
Theorem: $\exists \{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{w} = \sum_i \hat{\beta}_i x_i$; moreover, $\{\hat{\beta}_i\}$ also satisfies $\hat{\beta}_i = 0$ for all $i$ such that $y_i\, \hat{w} \cdot x_i > 1$
$SV(\hat{w}) = \{i : y_i\, \hat{w} \cdot x_i = 1\}$ are the datapoints closest to the decision boundary
o called support vectors
o hence "support vector machine"
$\hat{w} = \sum_{i \in SV(\hat{w})} \hat{\beta}_i x_i$
How do we get $\hat{w}$?
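One way to see these objects in practice is scikit-learn's `SVC`, which solves the soft-margin SVM; a very large `C` approximates the hard margin problem. This library usage is an illustration with made-up data, not part of the slides:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative values).
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0], [-2.0, -1.0], [-1.0, -3.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

# w_hat is a combination of the support vectors only: dual_coef_ holds the
# nonzero beta_i, and beta_i = 0 for every non-support vector.
w_from_sv = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_sv, clf.coef_))      # True: both ways of writing w_hat agree
print(clf.support_)                           # indices of the support vectors
```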

Optimizing the SVM problem
$\hat{w} = \arg\min_w \|w\|_2^2 \ \text{ s.t. } \forall i,\ y_i\, w \cdot x_i \ge 1$
1. Can do sub-gradient descent (next class)
2. Special case of a quadratic program
$\min_z \frac{1}{2} z^\top P z + q^\top z \ \text{ s.t. } Gz \le h,\ Az = b$
o Change of variables: $w = \sum_{i=1}^N \beta_i x_i$

Optimizing the SVM problem
$\hat{w} = \arg\min_w \|w\|_2^2 \ \text{ s.t. } \forall i,\ y_i\, w \cdot x_i \ge 1$
Change of variables $w = \sum_{i=1}^N \beta_i x_i$:
$\min_{\beta \in \mathbb{R}^N} \sum_i \sum_j \beta_i \beta_j\, x_i \cdot x_j \ \text{ s.t. } \forall i,\ y_i \sum_j \beta_j\, x_j \cdot x_i \ge 1$
$= \min_{\beta \in \mathbb{R}^N} \beta^\top G \beta \ \text{ s.t. } \forall i,\ y_i (G\beta)_i \ge 1$
$G \in \mathbb{R}^{N \times N}$ with $G_{ij} = x_i \cdot x_j$ is called the Gram matrix
Convex program: quadratic programming
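The $\beta$ formulation can be handed to any convex solver. The sketch below uses scipy's general-purpose SLSQP method on toy data rather than a dedicated QP solver; the data and solver choice are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data with the bias absorbed (last feature is a constant 1).
X = np.array([[2.0, 2.0, 1.0], [3.0, 1.0, 1.0], [-2.0, -1.0, 1.0], [-1.0, -3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = X @ X.T                                     # Gram matrix, G_ij = x_i . x_j

# min_beta beta^T G beta   s.t.   y_i (G beta)_i >= 1 for all i
objective = lambda beta: beta @ G @ beta
constraint = {"type": "ineq", "fun": lambda beta: y * (G @ beta) - 1}
res = minimize(objective, np.zeros(len(y)), constraints=[constraint], method="SLSQP")

beta = res.x
w = X.T @ beta                                  # recover w_hat = sum_i beta_i x_i
print(res.success, w, y * (X @ w))              # constraint values should be >= 1 (up to tolerance)
```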

The Kernel
$\min_w \|w\|_2^2 \ \text{ s.t. } \forall i,\ y_i\, w \cdot x_i \ge 1 \quad \Longleftrightarrow \quad \min_{\beta \in \mathbb{R}^N} \beta^\top G \beta \ \text{ s.t. } \forall i,\ y_i (G\beta)_i \ge 1$
The optimization problem depends on the $x_i$ only through the values $G_{ij} = x_i \cdot x_j$ for $i, j \in [N]$
What about prediction? $\hat{w} \cdot x = \sum_i \beta_i\, x_i \cdot x$
The function $K(x, x') = x \cdot x'$ is called the kernel
Learning non-linear classifiers using feature transformations, i.e., $f(x) = w \cdot \varphi(x)$ for some $\varphi(x)$:
o the only thing we need to know is $K_\varphi(x, x') = \varphi(x) \cdot \varphi(x')$
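A sketch of how both training and prediction touch the data only through a kernel, using a degree-2 polynomial kernel as the example $K$; the coefficients $\beta$ are hypothetical stand-ins for the output of the $\beta$ optimization above:

```python
import numpy as np

def poly2_kernel(A, B):
    """Degree-2 polynomial kernel K(x, x') = (1 + x.x')^2, computed for all pairs of rows."""
    return (1.0 + A @ B.T) ** 2

X_train = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-2.0, 1.5]])
beta = np.array([0.5, -0.3, 0.1, -0.2])          # hypothetical coefficients for illustration

G = poly2_kernel(X_train, X_train)               # Gram matrix G_ij = K(x_i, x_j): all training needs
x_new = np.array([[1.0, 1.0]])
f_new = poly2_kernel(X_train, x_new).T @ beta    # prediction f(x_new) = sum_i beta_i K(x_i, x_new)
print(G.shape, f_new)
```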

Kernels as Prior Knowledge
If we think that positive examples can (almost) be separated by some ellipse, then we should use polynomials of degree 2.
A kernel encodes a measure of similarity between objects. A bit like NN (nearest neighbours), except that it must be a valid inner product function.
Slide credit: Nati Srebro
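For instance, the degree-2 polynomial kernel $K(x, x') = (1 + x \cdot x')^2$ is exactly the inner product of an explicit degree-2 feature map, which is what makes ellipse-like boundaries linear in the feature space. A small numerical check, with the feature map written out for 2-D inputs:

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map whose inner product equals (1 + x.x')^2 for 2-D inputs."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi2(x) @ phi2(z))     # inner product in the expanded feature space
print((1.0 + x @ z) ** 2)    # kernel computed directly; the two numbers agree
```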