Introduction to Discriminative Machine Learning
Yang Wang, Vision & Media Lab, Simon Fraser University
CRV Tutorial, Kelowna, May 24, 2009
Hand-written Digit Recognition
image → {0, 1, 2, ..., 9}
[Belongie et al., PAMI 2002]
Face Detection
image patch → {face, non-face}
[Viola & Jones, SCTV 2001]
Object Categorization
image → {motorbike, airplane, face, ...}
Categories: Motorbikes, Airplanes, Faces, Cars (Side), Cars (Rear), Spotted Cats, Background
[Fergus et al., CVPR 2003]
Classifier Construction
How do we compute a decision boundary in image-feature space that distinguishes cars from non-cars?
Slide adapted from K. Grauman & B. Leibe
Generative vs. Discriminative
Generative: separately model the class-conditional and prior densities, e.g. $\Pr(\text{image}, \text{car})$ and $\Pr(\text{image}, \neg\text{car})$ as functions of the image feature. The joint factorizes as $\Pr(\text{image}, \text{car}) = \Pr(\text{car} \mid \text{image}) \Pr(\text{image})$.
Discriminative: directly model the posterior, e.g. $\Pr(\text{car} \mid \text{image})$ and $\Pr(\neg\text{car} \mid \text{image})$.
[Figure: joint densities vs. posterior probabilities plotted against the image feature]
Slide adapted from K. Grauman & B. Leibe
Generative vs. Discriminative
Generative:
- possibly interpretable; can draw samples
- models variability that is unimportant to classification
- often hard to build a good model with few parameters
Discriminative:
- appealing when it is infeasible (or undesirable) to model the data itself
- often excel in practice
- often cannot provide uncertainty in predictions; often non-interpretable
Slide adapted from K. Grauman & B. Leibe
Discriminative Methods
- Nearest neighbor (e.g., 10^6 examples): Shakhnarovich, Viola & Darrell 2003; Berg, Berg & Malik 2005; ...
- Neural networks: LeCun, Bottou, Bengio & Haffner 1998; Rowley, Baluja & Kanade 1998; ...
- Support vector machines: Guyon & Vapnik; Heisele, Serre & Poggio 2001; ...
- Boosting: Viola & Jones 2001; Torralba et al. 2004; Opelt et al. 2006; ...
- Conditional random fields: McCallum, Freitag & Pereira 2000; Kumar & Hebert 2003; ...
Slide adapted from K. Grauman & B. Leibe
Linear Classification
$x_i$: the $i$-th data point; $y_i \in \{+1, -1\}$: the class label of $x_i$.
How would you classify these data with a linear hyperplane? Which hyperplane is the best?
Classification and Margin
The decision hyperplane is $w \cdot x + b = 0$, flanked by the margin hyperplanes $w \cdot x + b = +c$ and $w \cdot x + b = -c$, at distances $d_+$ and $d_-$ from the nearest positive and negative examples. We maximize the margin:
$$\max_{w,b} \; m = d_+ + d_-$$
$$\text{s.t.} \quad w \cdot x_i + b \ge +c, \;\; \forall i: y_i = +1$$
$$\qquad\; w \cdot x_i + b \le -c, \;\; \forall i: y_i = -1$$
The two constraints combine into $y_i (w \cdot x_i + b) \ge c, \; \forall i$.
Max-Margin Classification
The margin is $m = d_+ + d_- = \frac{2c}{\|w\|}$. This gives the max-margin classification problem:
$$\max_{w,b} \; \frac{2c}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge c, \; \forall i$$
Since $c$ and $w$ can be rescaled together without changing the decision boundary, we fix $c = 1$:
$$\max_{w,b} \; \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; \forall i$$
Max-Margin Classification
Maximizing $\frac{2}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$:
$$\min_{w,b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; \forall i$$
This is a quadratic objective with linear constraints: the Support Vector Machine.
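As a concrete illustration (an aside, not part of the original slides), here is a minimal scikit-learn sketch of this problem on toy data; a very large C in SVC approximates the hard-margin constraint. The data points are made up for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: x_i in R^2, y_i in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin SVM:
# effectively no constraint violations are tolerated.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
# Every training point satisfies y_i (w . x_i + b) >= 1 (up to tolerance).
print("margins:", y * (X @ w + b))
```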
Duality
Primal SVM:
$$\min_{w,b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; \forall i$$
Dual SVM:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \; \forall i$$
The primal solution is recovered as $w = \sum_{i=1}^{n} \alpha_i y_i x_i$.
Support Vectors
Training data with nonzero $\alpha_i$ are called support vectors. Since
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i,$$
only the support vectors determine the decision boundary.
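Continuing the illustrative scikit-learn sketch from above, the fitted model exposes exactly these quantities: support_vectors_ holds the $x_i$ with nonzero $\alpha_i$, and dual_coef_ holds the products $\alpha_i y_i$, so $w$ can be rebuilt from the support vectors alone.

```python
# Continuing with `clf` from the sketch above:
# dual_coef_[0] stores alpha_i * y_i for each support vector.
alpha_y = clf.dual_coef_[0]
sv = clf.support_vectors_

# w = sum over support vectors of alpha_i * y_i * x_i
w_from_sv = alpha_y @ sv
print("w rebuilt from support vectors:", w_from_sv)  # matches clf.coef_[0]
```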
Support Vector Machines
Training:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \ge 0, \; \forall i$$
Testing a new point $z$:
$$w \cdot z + b = \sum_{i \in SV} \alpha_i y_i (x_i \cdot z) + b$$
Both training and testing access the data only through dot products ($x_i \cdot x_j$ and $x_i \cdot z$), so there is no need to form $w$ explicitly; this is what enables the kernel trick.
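A small sketch of the test-time rule above, again using the assumed `clf`, `alpha_y`, and `sv` from the earlier snippets: the decision value for a new point comes from dot products with the support vectors only, and agrees with the library's own decision_function.

```python
z = np.array([1.5, 2.0])

# f(z) = sum over SV of alpha_i * y_i * (x_i . z) + b
f_z = alpha_y @ (sv @ z) + clf.intercept_[0]
print(f_z, clf.decision_function([z])[0])  # the two values agree
```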
Non-linearly Separable Data
[Figure: data that no linear hyperplane can separate perfectly]
Slack Variables
Slack variables $\xi_i \ge 0$ measure how much each point violates the margin.
Recall the hard-margin SVM:
$$\min_{w,b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; \forall i$$
The soft-margin SVM relaxes the constraints and penalizes violations:
$$\min_{w,b,\xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; \forall i$$
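A hedged sketch of the role of C (the data and C values are illustrative): small C tolerates more margin violations, while large C approaches the hard-margin solution.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two overlapping Gaussian blobs: not linearly separable.
X = np.vstack([rng.randn(50, 2) + [1.5, 1.5], rng.randn(50, 2)])
y = np.array([+1] * 50 + [-1] * 50)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Margin violations are points with y_i (w . x_i + b) < 1, i.e. xi_i > 0.
    margins = y * clf.decision_function(X)
    print(f"C={C}: {np.sum(margins < 1)} violations, "
          f"{len(clf.support_)} support vectors")
```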
Dual Form of Soft-Margin SVM
Primal SVM:
$$\min_{w,b,\xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; \forall i$$
Dual SVM:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C, \; \forall i, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
As before, $w = \sum_{i=1}^{n} \alpha_i y_i x_i$.
Non-linear Decision Boundary
Map the original data $x = (x_1, x_2)$ into a higher-dimensional feature space via $\Phi$, where a linear boundary may exist, e.g.
$$\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$$
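For this particular map, the dot product in feature space equals the squared dot product in the original space, $\Phi(x) \cdot \Phi(x') = (x \cdot x')^2$; a quick numeric check (the test points are illustrative):

```python
import numpy as np

def phi(x):
    # Explicit feature map: (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp))   # 1.0
print((x @ xp) ** 2)      # 1.0 -- same value, no explicit map needed
```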
Kernel Trick
Recall the linear SVM in dual form:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
$$\text{s.t.} \quad 0 \le \alpha_i \le C, \; \forall i, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
To work in the feature space, replace $x_i \cdot x_j$ with $\Phi(x_i) \cdot \Phi(x_j)$. Defining the kernel
$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j),$$
the dual depends on the data only through $K(x_i, x_j)$; we never need to compute $\Phi$ explicitly.
Examples of Kernels
Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
Polynomial kernel: $K(x_i, x_j) = (1 + x_i \cdot x_j)^p$
Radial basis kernel: $K(x_i, x_j) = \exp\left(-\frac{1}{2} \|x_i - x_j\|^2\right)$
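These kernels map directly onto scikit-learn's options (a sketch; the degree and gamma values are illustrative). A custom kernel can also be passed as a callable returning the Gram matrix.

```python
from sklearn.svm import SVC

linear = SVC(kernel="linear")
poly   = SVC(kernel="poly", degree=3, gamma=1, coef0=1)  # (1 + x_i . x_j)^3
rbf    = SVC(kernel="rbf", gamma=0.5)           # exp(-0.5 ||x_i - x_j||^2)

# Equivalently, any function returning the Gram matrix K(X, Y) works:
squared_dot = SVC(kernel=lambda X, Y: (X @ Y.T) ** 2)
```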
Multi-Class Classification
So far we have only talked about binary classification. What about multi-class?
Answer: with N classes, learn N binary SVMs (one-vs-rest), as sketched below:
SVM 1 learns "output = 1" vs. "output ≠ 1"
SVM 2 learns "output = 2" vs. "output ≠ 2"
...
SVM N learns "output = N" vs. "output ≠ N"
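A minimal sketch of the one-vs-rest scheme (function names are illustrative); at test time we take the class whose SVM gives the largest decision value.

```python
import numpy as np
from sklearn.svm import SVC

def one_vs_rest_fit(X, y, classes):
    # One binary SVM per class: "output = k" vs. "output != k".
    return [SVC(kernel="linear").fit(X, np.where(y == k, 1, -1))
            for k in classes]

def one_vs_rest_predict(svms, classes, X):
    # Pick the class whose SVM is most confident.
    scores = np.column_stack([clf.decision_function(X) for clf in svms])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```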
Multi-Class SVM Define feature vector Φ(x, y) 23
f (x) Multi-Class SVM Define feature vector Φ(x, y) 23
f (x) Φ(x,y = 1) Φ(x,y = 2) Φ(x,y = 3) Multi-Class SVM Define feature vector Φ(x, y) 23
f (x) Multi-Class SVM Define feature vector Φ(x, y) Φ(x,y = 1) Φ(x,y = 2) Φ(x,y = 3) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23
f (x) Multi-Class SVM Define feature vector Φ(x, y) Φ(x,y = 1) Φ(x,y = 2) Φ(x,y = 3) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 w 23
f (x) Multi-Class SVM Define feature vector Φ(x, y) Φ(x,y = 1) Φ(x,y = 2) Φ(x,y = 3) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 w Classification rule y = argmax y w Φ(x,y) 23
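A sketch of this joint feature map and argmax rule (the helper names are illustrative):

```python
import numpy as np

def joint_feature(f_x, y, num_classes):
    # Phi(x, y): place f(x) in the block for class y, zeros elsewhere.
    phi = np.zeros(num_classes * len(f_x))
    phi[y * len(f_x):(y + 1) * len(f_x)] = f_x
    return phi

def predict(w, f_x, num_classes):
    # y* = argmax_y  w . Phi(x, y)
    scores = [w @ joint_feature(f_x, y, num_classes)
              for y in range(num_classes)]
    return int(np.argmax(scores))
```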
Multi-Class SVM
$$\min_{w,\xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i$$
$$\text{s.t.} \quad w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge 1 - \xi_i, \; \forall i, \; \forall y \ne y_i$$
$$\qquad\; \xi_i \ge 0, \; \forall i$$
SVM Software
Good news: you do not need to implement an SVM yourself.
SVM-Light: http://svmlight.joachims.org/
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBLINEAR: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
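As an aside (not from the slides): scikit-learn wraps two of these libraries, so the sketches in this tutorial run on the same solvers. SVC is backed by LIBSVM and LinearSVC by LIBLINEAR.

```python
from sklearn.svm import SVC, LinearSVC

kernel_svm = SVC(kernel="rbf")  # trains with LIBSVM
linear_svm = LinearSVC()        # trains with LIBLINEAR (fast for linear SVMs)
```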
Classification with Latent Variables
A linear classifier scores an image as $f_w(x) = w \cdot \Phi(x)$. With latent variables, we instead score
$$f_w(x) = \max_{z \in Z} \; w \cdot \Phi(x, z)$$
where $z$ ranges over the (unobserved) positions of the parts, i.e. a vector of part offsets, and $\Phi(x, z)$ is a vector of image features (HOG features from the root filter and part filters) that depends on the part positions.
[Felzenszwalb et al., CVPR 2008]
Latent SVM
$$\min_{w,\xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i$$
$$\text{s.t.} \quad y_i f_w(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; \forall i$$
where $f_w(x_i) = \max_{z} w \cdot \Phi(x_i, z)$.
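The objective is non-convex because of the max over $z$; Felzenszwalb et al. train it by alternating between inferring the latent variables and solving a standard SVM. Below is a hedged, simplified sketch of that alternation (the latent set, feature function, and loop count are illustrative; the original method fixes $z$ only for positive examples, whereas this sketch fixes it for all examples).

```python
import numpy as np
from sklearn.svm import SVC

def latent_svm_train(X, y, latent_set, phi, num_iters=10):
    """Alternate: (1) fix w, pick the best z per example; (2) fix z, fit an SVM."""
    w = np.zeros(len(phi(X[0], latent_set[0])))
    for _ in range(num_iters):
        # Step 1: complete the latent variables, z_i = argmax_z  w . Phi(x_i, z).
        feats = np.array([max((phi(x, z) for z in latent_set),
                              key=lambda p: w @ p) for x in X])
        # Step 2: with z fixed, this is a standard linear SVM.
        clf = SVC(kernel="linear", C=1.0).fit(feats, y)
        w = clf.coef_[0]
    return w
```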
Structured Output
The label $y$ can be a structured object drawn from a large output space $Y$ (e.g., a sequence of labels or a labeling of all image pixels), not just a single class.
[Figure illustrating a structured output space $Y$]
Structural SVM
$$\min_{w,\xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i} \xi_i$$
$$\text{s.t.} \quad w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge \Delta(y, y_i) - \xi_i, \; \forall i, \; \forall y \ne y_i$$
$$\qquad\; \xi_i \ge 0, \; \forall i$$
The margin is rescaled by the loss $\Delta(y, y_i)$, and the number of constraints is exponential in the size of the output.
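Because enumerating the exponentially many constraints is infeasible, structural SVMs are typically trained with a cutting-plane scheme: repeatedly find the most violated constraint and add it to a working set. A hedged sketch of the constraint-finding step (the argmax routine is problem-specific and assumed given):

```python
def most_violated_constraint(w, x_i, y_i, phi, delta, argmax_over_Y):
    """Loss-augmented inference: y* = argmax_y  Delta(y, y_i) + w . Phi(x_i, y).

    `argmax_over_Y` is an assumed problem-specific solver (e.g., Viterbi for
    sequences); it searches Y efficiently without enumerating it.
    """
    y_star = argmax_over_Y(lambda y: delta(y, y_i) + w @ phi(x_i, y))
    # The constraint for y* joins the working set if it is violated:
    #   w . Phi(x_i, y_i) - w . Phi(x_i, y*) >= Delta(y*, y_i) - xi_i
    return y_star
```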
Thank you!