Introduction to Discriminative Machine Learning

Introduction to Discriminative Machine Learning
Yang Wang, Vision & Media Lab, Simon Fraser University
CRV Tutorial, Kelowna, May 24, 2009

Hand-written Digit Recognition
Input: an image of a digit. Output: a label in {0, 1, 2, ..., 9}. [Belongie et al., PAMI 2002]

Face Detection
Input: an image patch. Output: a label in {face, non-face}. [Viola & Jones, SCTV 2001]

Object Categorization
Input: an image. Output: a category in {motorbike, airplane, face, ...}. Categories in [Fergus et al., CVPR 2003]: Motorbikes, Airplanes, Faces, Cars (Side), Cars (Rear), Spotted Cats, Background.

Classifier Construction
How do we compute a decision boundary in image-feature space that distinguishes cars from non-cars? (Slide adapted from K. Grauman & B. Leibe)

Generative vs. Discriminative
Generative: separately model the class-conditional and prior densities, i.e., Pr(image, car) and Pr(image, ¬car); note that Pr(image, car) = Pr(car | image) Pr(image).
Discriminative: directly model the posterior, i.e., Pr(car | image) and Pr(¬car | image).
[Figure: joint densities (generative) and posteriors (discriminative) plotted against the image feature.]
(Slide adapted from K. Grauman & B. Leibe)

Generative vs. Discriminative
Generative:
- possibly interpretable
- can draw samples
- models variability that is unimportant to classification
- often hard to build a good model with few parameters

Discriminative:
- appealing when it is infeasible (or undesirable) to model the data itself
- excels in practice
- often cannot provide uncertainty in predictions
- often non-interpretable

(Slide adapted from K. Grauman & B. Leibe)

Discriminative Methods
- Nearest neighbor (10^6 examples): Shakhnarovich, Viola & Darrell 2003; Berg, Berg & Malik 2005; ...
- Neural networks: LeCun, Bottou, Bengio & Haffner 1998; Rowley, Baluja & Kanade 1998; ...
- Support vector machines: Guyon & Vapnik; Heisele, Serre & Poggio 2001; ...
- Boosting: Viola & Jones 2001; Torralba et al. 2004; Opelt et al. 2006; ...
- Conditional random fields: McCallum, Freitag & Pereira 2000; Kumar & Hebert 2003; ...

(Slide adapted from K. Grauman & B. Leibe)

Linear Classification
Let $x_i$ be the $i$-th data point and $y_i \in \{+1, -1\}$ its class label. How would you classify these data with a linear hyperplane? Many separating hyperplanes exist; which one is the best?

Classification and Margin
Write the separating hyperplane as $w \cdot x + b = 0$, with margin boundaries $w \cdot x + b = +c$ and $w \cdot x + b = -c$ at distances $d_+$ and $d_-$ from the closest positive and negative examples. Maximize the margin $m = d_+ + d_-$:

$$\max_{w,b}\ m = d_+ + d_-$$
$$\text{s.t.}\quad w \cdot x_i + b \ge +c,\ \forall i: y_i = +1$$
$$\qquad\ \ w \cdot x_i + b \le -c,\ \forall i: y_i = -1$$

The two constraints combine into $y_i (w \cdot x_i + b) \ge c,\ \forall i$.

Max-Margin Classification
The margin is $m = d_+ + d_- = \frac{2c}{\|w\|}$. So the max-margin classification problem is

$$\max_{w,b}\ \frac{2c}{\|w\|} \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge c,\ \forall i$$

Since $c$ and $w$ can be rescaled together without changing the result, we may fix $c = 1$:

$$\max_{w,b}\ \frac{2}{\|w\|} \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$

Max-Margin Classification
Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$

This is the (hard-margin) Support Vector Machine: a quadratic objective with linear constraints, i.e., a quadratic program.
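Because this is just a small quadratic program, any off-the-shelf QP solver can handle it. Below is a minimal sketch using cvxpy on made-up toy data; the solver choice and the data are assumptions for illustration, not part of the tutorial.

```python
# Hard-margin SVM as a quadratic program (a sketch, assuming cvxpy is installed).
import numpy as np
import cvxpy as cp

# Toy linearly separable data (hypothetical).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1 for all i
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print("w =", w.value, "b =", b.value)
print("predictions:", np.sign(X @ w.value + b.value))  # classify by sign(w . x + b)
```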

Duality
Primal SVM:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$

Dual SVM:

$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0,\ \forall i$$

The primal solution is recovered from the dual as $w = \sum_{i=1}^{n} \alpha_i y_i x_i$.

Support Vectors
Training data with nonzero $\alpha_i$ are called support vectors. Since $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, the solution depends only on them:

$$w = \sum_{i \in SV} \alpha_i y_i x_i$$
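To see this concretely, the sketch below (assuming scikit-learn, whose SVC wraps LIBSVM) reconstructs $w$ from the support vectors alone; the data are made up for illustration.

```python
# Sketch: only the support vectors determine the SVM solution (assumes scikit-learn).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1e6)  # a very large C approximates the hard margin
clf.fit(X, y)

print("support vectors:", len(clf.support_), "of", len(X), "training points")

# dual_coef_ stores alpha_i * y_i for the support vectors, so
# w = sum_{i in SV} alpha_i y_i x_i is just a matrix product:
w = clf.dual_coef_ @ clf.support_vectors_
print("w from duals: ", w.ravel())
print("w from solver:", clf.coef_.ravel())
```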

Support Vector Machines
Training: solve the dual

$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0,\ \forall i$$

Testing: classify a new point $z$ by the sign of

$$w \cdot z + b = \sum_{i \in SV} \alpha_i y_i (x_i \cdot z) + b$$

Both training and testing access the data only through inner products (this is what enables the kernel trick), so there is no need to form $w$ explicitly.

Non-linearly Separable Data
[Figure: data for which no linear hyperplane separates the two classes without error.]

Slack Variables
Recall the hard-margin SVM:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$

Introduce a slack variable $\xi_i \ge 0$ for each example, allowing its constraint to be violated at a cost. This gives the soft-margin SVM:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ \forall i$$
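The parameter $C$ trades margin width against violations. A quick sketch (assuming scikit-learn; the overlapping toy data are made up) shows its effect:

```python
# Sketch: the soft-margin parameter C (assumes scikit-learn).
# Small C tolerates margin violations; large C penalizes them heavily.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2) + [1, 1], rng.randn(50, 2) - [1, 1]])  # overlapping classes
y = np.hstack([np.ones(50), -np.ones(50)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>6}: {len(clf.support_):3d} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```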

Dual Form of Soft-Margin SVM
Primal SVM:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ \forall i$$

Dual SVM:

$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C,\ \forall i$$

The only change from the hard-margin dual is the upper bound $C$ on each $\alpha_i$; as before, $w = \sum_{i=1}^{n} \alpha_i y_i x_i$.

Non-linear Decision Boundary
Map the original data into a higher-dimensional feature space via $\Phi$, where a linear boundary suffices. For example, with $x = (x_1, x_2)$:

$$\Phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$$

Kernel Trick
Recall the linear SVM in dual form:

$$\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{s.t.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C,\ \forall i$$

After mapping the data with $\Phi$, the inner product $x_i \cdot x_j$ becomes $\Phi(x_i) \cdot \Phi(x_j)$. Define the kernel

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$$

and substitute $K(x_i, x_j)$ for the inner product in the dual. We never need to compute $\Phi$ explicitly, only $K$.
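As a quick check of why this matters, the sketch below verifies that the quadratic map $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ from the previous slide satisfies $\Phi(u) \cdot \Phi(v) = (u \cdot v)^2$, so the kernel value is available without ever forming $\Phi$ (the numbers are arbitrary):

```python
# Sketch: the kernel trick for the degree-2 feature map.
import numpy as np

def phi(x):
    # Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(u) @ phi(v))  # explicit feature map -> 1.0
print((u @ v) ** 2)     # kernel evaluation   -> 1.0, identical
```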

Examples of Kernels
Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
Polynomial kernel: $K(x_i, x_j) = (1 + x_i \cdot x_j)^p$
Radial basis kernel: $K(x_i, x_j) = \exp\left(-\tfrac{1}{2}\|x_i - x_j\|^2\right)$
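For a feel of how these behave, here is a sketch (assuming scikit-learn; the two-moons dataset and parameter values are illustrative choices) that fits each kernel to data a linear boundary cannot separate:

```python
# Sketch: comparing the kernels above on non-linearly separable data (assumes scikit-learn).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 3}),    # (coef0 + x_i . x_j)^degree
                       ("rbf", {"gamma": 0.5})]:   # exp(-gamma ||x_i - x_j||^2)
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(f"{kernel:>6}: training accuracy {clf.score(X, y):.2f}")
```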

Multi-Class Classification
So far we have only talked about binary classification. What about multi-class?
Answer: with N classes, learn N binary SVMs (one-vs-rest), as sketched below:
SVM 1 learns "output = 1" vs. "output ≠ 1"
SVM 2 learns "output = 2" vs. "output ≠ 2"
...
SVM N learns "output = N" vs. "output ≠ N"
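A minimal sketch of this one-vs-rest scheme (assuming scikit-learn; the iris dataset stands in for a 3-class problem):

```python
# Sketch: one-vs-rest multi-class classification with binary SVMs (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes -> 3 binary SVMs

clf = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print("binary SVMs trained:", len(clf.estimators_))
print("training accuracy:", clf.score(X, y))
```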

Multi-Class SVM
Define a joint feature vector $\Phi(x, y)$: stack the image feature vector $f(x)$ into the block corresponding to class $y$, with zeros in all other blocks. For three classes:

$$\Phi(x, y=1) = (f(x), 0, 0), \quad \Phi(x, y=2) = (0, f(x), 0), \quad \Phi(x, y=3) = (0, 0, f(x))$$

A single weight vector $w$ then scores every class, and the classification rule is

$$y^* = \arg\max_{y}\ w \cdot \Phi(x, y)$$
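A small numpy sketch of this rule, with made-up dimensions and stand-in weights (everything here is hypothetical):

```python
# Sketch: multi-class SVM prediction y* = argmax_y w . Phi(x, y).
import numpy as np

def make_phi(f_x, y, n_classes):
    """Place f(x) in the block for class y, zeros elsewhere."""
    d = len(f_x)
    phi = np.zeros(n_classes * d)
    phi[y * d:(y + 1) * d] = f_x
    return phi

n_classes, d = 3, 4
rng = np.random.RandomState(0)
w = rng.randn(n_classes * d)  # stand-in for a trained weight vector
f_x = rng.randn(d)            # feature vector of one test example

scores = [w @ make_phi(f_x, y, n_classes) for y in range(n_classes)]
print("class scores:", np.round(scores, 3), "-> predicted class:", int(np.argmax(scores)))
```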

Multi-Class SVM
Training:

$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i$$
$$\text{s.t.}\quad \forall i,\ \forall y \ne y_i:\ w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ \forall i$$

SVM Software
Good news: you do not need to implement an SVM yourself.
SVM-Light: http://svmlight.joachims.org/
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
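For instance, LIBSVM ships a Python interface; the sketch below assumes the pip package `libsvm` (which exposes the `svmutil` module) and made-up data, so treat it as illustrative rather than canonical. The command-line tools svm-train / svm-predict accept the same options.

```python
# Sketch: training and prediction with LIBSVM's Python interface.
from libsvm.svmutil import svm_train, svm_predict

# Toy problem: labels and feature dicts {index: value} (hypothetical data).
y = [1, 1, -1, -1]
x = [{1: 2.0, 2: 2.0}, {1: 3.0, 2: 3.0}, {1: -1.0, 2: -1.0}, {1: -2.0, 2: -1.5}]

# '-t 0' selects the linear kernel, '-c 1' sets the soft-margin parameter C.
model = svm_train(y, x, "-t 0 -c 1")
labels, accuracy, values = svm_predict(y, x, model)
```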

Classification with Latent Variables [Felzenszwalb et al., CVPR 2008]
A standard linear model scores an image as $f_w(x) = w \cdot \Phi(x)$. With latent variables, the score maximizes over hidden configurations:

$$f_w(x) = \max_{z}\ w \cdot \Phi(x, z)$$

where $z$ ranges over the positions of the parts and $\Phi(x, z)$ is a feature vector (HOG features from the root filter and the parts) that depends on both the image and the part positions.

Latent SVM

$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i$$
$$\text{s.t.}\quad \forall i:\ \xi_i \ge 0,\quad y_i f_w(x_i) \ge 1 - \xi_i, \quad \text{where } f_w(x_i) = \max_{z}\ w \cdot \Phi(x_i, z)$$

Structured Output
The output $y$ is no longer a single label but a structured object drawn from a large output space $Y$.

Structural SVM

$$\min_{w,\xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i$$
$$\text{s.t.}\quad \forall i,\ \forall y \ne y_i:\ w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge \Delta(y, y_i) - \xi_i, \quad \xi_i \ge 0,\ \forall i$$

Compared with the multi-class SVM, the fixed margin of 1 is replaced by a loss $\Delta(y, y_i)$ that rescales the required margin, and there is now an exponential number of constraints (one per candidate output $y$).
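Equivalently (a standard reformulation, not stated on the slide), the optimal slack is the margin-rescaled hinge loss $\xi_i = \max_y \left[\Delta(y, y_i) + w \cdot \Phi(x_i, y)\right] - w \cdot \Phi(x_i, y_i)$, which is never negative since $y = y_i$ contributes zero. A tiny sketch with made-up scores:

```python
# Sketch: the slack xi_i as a margin-rescaled hinge loss (made-up scores).
import numpy as np

scores = np.array([4.0, 3.5, 1.0])  # w . Phi(x_i, y) for each candidate y
y_true = 0
delta = np.array([0.0, 1.0, 2.0])   # Delta(y, y_i); zero for the true output

# xi_i = max_y [ Delta(y, y_i) + w.Phi(x_i, y) ] - w.Phi(x_i, y_i)
xi = np.max(delta + scores) - scores[y_true]
print("slack / structured hinge loss:", xi)  # max(4.0, 4.5, 3.0) - 4.0 = 0.5
```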

Thank you!