Introduction to Discriminative Machine Learning

1 Introduction to Discriminative Machine Learning. Yang Wang, Vision & Media Lab, Simon Fraser University. CRV Tutorial, Kelowna, May 24, 2009

4 Hand-written Digit Recognition: image → {0,1,2,...,9} [Belongie et al. PAMI 2002]

7 Face Detection: image patch → {face, non-face} [Viola & Jones, SCTV 2001]

10 Object Categorization: image → {motorbike, airplane, face, ...}. Example categories: Motorbikes, Airplanes, Faces, Cars (Side), Cars (Rear), Spotted Cats, Background [Fergus et al., CVPR 2003]

11 Classifier Construction: How to compute a decision boundary to distinguish cars from non-cars? [Figure axis: image feature] Slide adapted from K. Grauman & B. Leibe

13 Generative vs. Discriminative
$\Pr(\text{image}, \text{car}) = \Pr(\text{car} \mid \text{image})\,\Pr(\text{image})$
Generative: separately model class-conditional and prior densities. [Figure: $\Pr(\text{image}, \text{car})$ and $\Pr(\text{image}, \neg\text{car})$ plotted against the image feature]
Discriminative: directly model the posterior. [Figure: $\Pr(\text{car} \mid \text{image})$ and $\Pr(\neg\text{car} \mid \text{image})$ plotted against the image feature; x = data]
Slide adapted from K. Grauman & B. Leibe
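
The relation between the two views can be checked numerically. The following is a minimal sketch, assuming a single scalar image feature and two hypothetical Gaussian class-conditional densities with equal variance and equal priors (the means, variance, and the function names are illustrative assumptions, not values from the tutorial). Under those assumptions the posterior obtained through Bayes' rule coincides with a logistic function of the feature, which is the kind of function a discriminative model would fit directly.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; a hypothetical class-conditional model."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def generative_posterior(x, prior_car=0.5):
    """Pr(car | x) via Bayes' rule from class-conditional densities and priors."""
    joint_car = gaussian_pdf(x, 2.0, 1.0) * prior_car               # Pr(x, car)
    joint_not_car = gaussian_pdf(x, -2.0, 1.0) * (1.0 - prior_car)  # Pr(x, not car)
    return joint_car / (joint_car + joint_not_car)

def discriminative_posterior(x, w=4.0, b=0.0):
    """Pr(car | x) modeled directly as a logistic function of the feature."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))
```

With these assumed densities the log-odds equal 4x, so both functions return the same value for every x. The generative route needs two density models and a prior, while the discriminative route only needs the two numbers w and b.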

14 Generative vs. Discriminative
Generative: possibly interpretable; can draw samples; models variability unimportant to classification; often hard to build a good model with few parameters.
Discriminative: appealing when infeasible (or undesirable) to model the data itself; excel in practice; often cannot provide uncertainty in predictions; often non-interpretable.
Slide adapted from K. Grauman & B. Leibe

16 Discriminative Methods
Nearest neighbor ($10^6$ examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; ...
Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; ...
Support vector machine: Guyon, Vapnik; Heisele, Serre, Poggio 2001; ...
Boosting: Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006; ...
Conditional random field: McCallum, Freitag, Pereira 2000; Lafferty et al.; Kumar, Hebert 2003; ...
Slide adapted from K. Grauman & B. Leibe

21 Linear Classification
$x_i$: i-th data point; $y_i$: class label (+1, -1) of $x_i$.
How would you classify these data by a linear hyperplane? Which hyperplane is the best?
[Figure: positive (+) and negative (-) training points with candidate separating hyperplanes]

33 Classification and Margin
[Figure: positive (+) and negative (-) points separated by the hyperplane $w \cdot x + b = 0$ with normal vector $w$; the parallel hyperplanes $w \cdot x + b = +c$ and $w \cdot x + b = -c$ lie at distances $d_+$ and $d_-$ from it.]
$$\max_{w,b}\; m = d_- + d_+ \quad \text{s.t.} \quad w \cdot x_i + b \ge +c,\ \forall i: y_i = +1; \qquad w \cdot x_i + b \le -c,\ \forall i: y_i = -1$$
The two sets of constraints combine into $y_i (w \cdot x_i + b) \ge c,\ \forall i$.

37 Max-Margin Classification
The margin is $m = d_- + d_+ = \frac{2c}{\|w\|}$.
Here is our max-margin classification problem:
$$\max_{w,b}\; \frac{2c}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge c,\ \forall i$$
c and w can be rescaled without changing the result, so the problem becomes:
$$\max_{w,b}\; \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$

42 Max-Margin Classification
$$\max_{w,b}\; \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$
is equivalent to
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$
Support Vector Machine: the objective is quadratic and the constraints are linear.
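
To make the constraints and the objective concrete, here is a minimal sketch that checks feasibility and computes the margin width for a candidate (w, b). The four 2-D points and the helper names are hypothetical and chosen only for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def is_feasible(w, b, points, labels):
    """True if y_i (w . x_i + b) >= 1 holds for every training example."""
    return all(y * (dot(w, x) + b) >= 1.0 for x, y in zip(points, labels))

def margin_width(w):
    """Width 2 / ||w|| of the band between w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

def primal_objective(w):
    """The quadratic objective (1/2) ||w||^2 that the SVM minimizes."""
    return 0.5 * dot(w, w)

# Hypothetical toy data: two positive and two negative points.
points = [(2.0, 0.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 1.0)]
labels = [+1, +1, -1, -1]
```

On this toy set, w = (1, 0), b = -1 is feasible with margin 2. Doubling both to w = (2, 0), b = -2 is still feasible but halves the margin and raises the objective, while shrinking them to w = (0.5, 0), b = -0.5 violates the constraints. This is why minimizing the norm of w subject to the constraints yields the widest margin.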

45 Duality
Primal SVM:
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$
Dual SVM:
$$\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0; \qquad \alpha_i \ge 0,\ \forall i$$
The two are linked by $w = \sum_{i=1}^{N} \alpha_i y_i x_i$.
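
The dual objective and the relation between the dual variables and w can be evaluated directly. This is a minimal sketch on a hypothetical two-point dataset; it does not solve the dual, it only evaluates the objective at given values of alpha and recovers w from them.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alphas, points, labels):
    """sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(points)
    pairwise = sum(
        alphas[i] * alphas[j] * labels[i] * labels[j] * dot(points[i], points[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alphas) - 0.5 * pairwise

def weights_from_dual(alphas, points, labels):
    """w = sum_i alpha_i y_i x_i, computed coordinate by coordinate."""
    dim = len(points[0])
    return [
        sum(a * y * x[k] for a, y, x in zip(alphas, labels, points))
        for k in range(dim)
    ]

# Hypothetical toy data: one positive and one negative point.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [+1, -1]
```

For this dataset the equality constraint forces both alphas to be equal, and the objective 2a - 2a^2 peaks at a = 0.5. Plugging alpha = (0.5, 0.5) into `weights_from_dual` gives w = (1, 0), the same hyperplane the primal problem would select.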

50 Support Vectors
Training data with nonzero $\alpha_i$ are called support vectors.
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$$
[Figure: positive (+) and negative (-) points, the separating hyperplane with normal vector $w$, and the support vectors on the margin]

55 Support Vector Machines
Training:
$$\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0; \qquad \alpha_i \ge 0,\ \forall i$$
Testing:
$$w \cdot z + b = \sum_{i \in SV} \alpha_i y_i (x_i \cdot z) + b$$
There is no need to form $w$ explicitly. In both training and testing the data appear only through the inner products $x_i \cdot x_j$ and $x_i \cdot z$ (kernel trick).
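
The testing formula can be written so that w is never formed. The sketch below is a minimal illustration with hypothetical support vectors and dual values; the `kernel` argument defaults to the plain inner product, and the bias is recovered from one support vector under the hard-margin assumption that it lies exactly on the margin.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def decision_value(z, support_vectors, labels, alphas, b, kernel=dot):
    """sum over support vectors of alpha_i y_i K(x_i, z), plus b."""
    return sum(
        a * y * kernel(x, z)
        for a, y, x in zip(alphas, labels, support_vectors)
    ) + b

def bias_from_support_vector(k, support_vectors, labels, alphas, kernel=dot):
    """Assumes support vector k satisfies y_k (w . x_k + b) = 1 (hard margin)."""
    return labels[k] - decision_value(
        support_vectors[k], support_vectors, labels, alphas, 0.0, kernel
    )

# Hypothetical solution of a two-point problem.
support_vectors = [(1.0, 0.0), (-1.0, 0.0)]
labels = [+1, -1]
alphas = [0.5, 0.5]
b = bias_from_support_vector(0, support_vectors, labels, alphas)
```

Here b evaluates to 0, and `decision_value((2.0, 3.0), support_vectors, labels, alphas, b)` equals 2, which matches w . z with w = (1, 0). Because the data enter only through `kernel`, swapping in a different kernel function changes the classifier without touching the rest of the code.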

60 Non-linearly Separable Data
[Figure: positive and negative points that no linear hyperplane separates]

66 Slack Variables
[Figure: points on the wrong side of their margin, with slack values $\xi_i$ and $\xi_j$]
Recall the hard-margin SVM:
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1,\ \forall i$$
Soft-margin SVM:
$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \forall i; \qquad \xi_i \ge 0,\ \forall i$$
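
For a fixed (w, b), the smallest slack that satisfies both soft-margin constraints is max(0, 1 - y_i (w . x_i + b)). The following minimal sketch uses that fact to evaluate the soft-margin objective on a hypothetical three-point dataset containing one point that violates the margin; it evaluates the objective only and does not optimize it.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slacks(w, b, points, labels):
    """Smallest feasible xi_i for each example: max(0, 1 - y_i (w . x_i + b))."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(points, labels)]

def soft_margin_objective(w, b, points, labels, C):
    """(1/2) ||w||^2 + C * sum_i xi_i with the slacks chosen as above."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, points, labels))

# Hypothetical toy data; the third point sits on the wrong side of the boundary.
points = [(2.0, 0.0), (0.0, 0.0), (0.5, 0.0)]
labels = [+1, -1, +1]
```

With w = (1, 0) and b = -1 the slacks are 0, 0, and 1.5, so the objective is 0.5 + 1.5 C. A larger C penalizes the violating point more heavily, which pushes the solution toward the hard-margin behavior.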

70 Dual Form of Soft-Margin SVM
Primal SVM:
$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \forall i; \qquad \xi_i \ge 0,\ \forall i$$
Dual SVM:
$$\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \forall i; \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
The two are linked by $w = \sum_{i=1}^{N} \alpha_i y_i x_i$.

75 Non-linear Decision Boundary
[Figure: the original data $x = (x_1, x_2)$ are mapped by $\Phi(\cdot)$ into a higher dimensional feature space]
$$\Phi(x) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right)$$
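
A quick numerical check shows why this particular map is convenient. The sketch below assumes the middle coordinate of the map is sqrt(2) x1 x2 (the radical is an assumption, since the extracted slide text is ambiguous). With that choice, the inner product in the 3-D feature space equals the squared inner product in the original 2-D space, so it can be computed without ever building the mapped vectors.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def feature_map(x):
    """Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2); the sqrt(2) factor is assumed."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def squared_dot_kernel(x, z):
    """K(x, z) = (x . z)^2, computed in the original 2-D space."""
    return dot(x, z) ** 2

x = (1.0, 2.0)
z = (3.0, 4.0)
explicit = dot(feature_map(x), feature_map(z))  # inner product after mapping
implicit = squared_dot_kernel(x, z)             # same value, no mapping needed
```

For these two points x . z = 11, so both `explicit` and `implicit` come out as 121 up to floating-point rounding.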

80 Kernel Trick
Recall the linear SVM in the dual form:
$$\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \forall i; \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$
Replacing each $x_i$ by $\Phi(x_i)$ turns the inner product into $\Phi(x_i) \cdot \Phi(x_j)$.
Define: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. The dual becomes:
$$\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \forall i; \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$

81 Examples of Kernels
Linear kernel: $K(x_i, x_j) = x_i \cdot x_j$
Polynomial kernel: $K(x_i, x_j) = (1 + x_i \cdot x_j)^p$
Radial basis kernel: $K(x_i, x_j) = \exp\left(-\frac{1}{2}\|x_i - x_j\|^2\right)$
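
The three kernels translate almost line for line into code. This is a minimal sketch in plain Python; the function names are illustrative, and the radial basis kernel is written with the fixed factor 1/2 in the exponent, as read from the slide, rather than with a tunable bandwidth.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return dot(x, z)

def polynomial_kernel(x, z, p=2):
    """K(x, z) = (1 + x . z)^p"""
    return (1.0 + dot(x, z)) ** p

def rbf_kernel(x, z):
    """K(x, z) = exp(-(1/2) ||x - z||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-0.5 * squared_distance)
```

For x = (1, 2) and z = (3, 4), the linear kernel gives 11 and the degree-2 polynomial kernel gives 144. The radial basis kernel equals 1 when both arguments coincide and decays toward 0 as the points move apart.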

83 Multi-Class Classification
So far we have only talked about binary classification. What about multi-class?
Answer: with N classes, learn N binary SVMs.
SVM 1 learns "output=1" vs "output!=1"
SVM 2 learns "output=2" vs "output!=2"
...
SVM N learns "output=N" vs "output!=N"
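
The one-vs-rest recipe has two parts: relabeling the data for each binary problem, and combining the binary classifiers at test time. The sketch below is a minimal illustration. The slide does not say how the outputs are combined, so picking the class whose SVM gives the largest decision value is an assumption (a common convention), and the (w, b) pairs are hypothetical stand-ins for trained SVMs.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def one_vs_rest_labels(labels, target_class):
    """Relabel for the binary problem 'output = target_class' vs the rest."""
    return [+1 if y == target_class else -1 for y in labels]

def predict_one_vs_rest(x, binary_models):
    """binary_models maps class -> (w, b); return the class with the top score."""
    return max(
        binary_models,
        key=lambda c: dot(binary_models[c][0], x) + binary_models[c][1],
    )

# Hypothetical linear models, one per class.
binary_models = {
    1: ((1.0, 0.0), 0.0),
    2: ((0.0, 1.0), 0.0),
    3: ((-1.0, -1.0), 0.0),
}
```

For example, `one_vs_rest_labels([1, 2, 3, 2], 2)` yields the labels for SVM 2, and `predict_one_vs_rest((0.2, 0.9), binary_models)` selects class 2 because its score 0.9 is the largest of the three.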

89 Multi-Class SVM
Define the feature vector $\Phi(x, y)$.
[Figure: the image feature $f(x)$, the vectors $\Phi(x, y=1)$, $\Phi(x, y=2)$, $\Phi(x, y=3)$, and the weight vector $w$]
Classification rule: $y^* = \arg\max_y\, w \cdot \Phi(x, y)$

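90 Multi-Class SVM
$$\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge 1 - \xi_i,\ \forall i,\ \forall y \ne y_i; \qquad \xi_i \ge 0,\ \forall i$$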
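
The classification rule and the constraints can be exercised on a small example. The sketch below assumes one common construction of the joint feature vector: copy f(x) into the block belonging to class y and fill the other blocks with zeros. That construction, the weight vector, and the helper names are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def joint_feature(features, label, num_classes):
    """Assumed Phi(x, y): f(x) in the block of class `label` (1-based), zeros elsewhere."""
    dim = len(features)
    phi = [0.0] * (dim * num_classes)
    start = (label - 1) * dim
    phi[start:start + dim] = features
    return phi

def predict(w, features, num_classes):
    """Classification rule: argmax over y of w . Phi(x, y)."""
    return max(
        range(1, num_classes + 1),
        key=lambda y: dot(w, joint_feature(features, y, num_classes)),
    )

def slack(w, features, true_label, num_classes):
    """Smallest xi_i that satisfies every constraint for this example."""
    true_score = dot(w, joint_feature(features, true_label, num_classes))
    violations = [
        1.0 - (true_score - dot(w, joint_feature(features, y, num_classes)))
        for y in range(1, num_classes + 1)
        if y != true_label
    ]
    return max(0.0, max(violations))

w = [1.0, 0.0, 0.0, 1.0, -1.0, -1.0]  # hypothetical weights, three blocks of size 2
features = (1.0, 2.0)                 # hypothetical f(x)
```

With these numbers the class scores are 1, 2, and -3, so the rule predicts class 2. If the true label is 2 the slack is 0, because class 2 beats both rivals by at least 1. If the true label were 1 the slack would be 2.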

91 SVM Software
Good news: you do not need to implement an SVM yourself.
SVM-Light http://svmlight.joachims.
LIBSVM
LIBLINEAR

96 Classification with Latent Variables
Without latent variables: $f_w(x) = w \cdot \Phi(x)$
With latent variables: $f_w(x) = \max_z\, w \cdot \Phi(x, z)$
z = positions of parts (vector of part offsets)
$\Phi(x, z)$ = vector involving image features and part positions (vector of HOG features, from root filter & ...)
[Felzenszwalb et al., CVPR08]

98 Latent SVM
$$\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i f_w(x_i) \ge 1 - \xi_i,\ \forall i; \qquad \xi_i \ge 0,\ \forall i$$
where $f_w(x_i) = \max_z\, w \cdot \Phi(x_i, z)$.
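
The score of a latent SVM is a maximum over the latent variable, which a short example makes concrete. The sketch below is a toy stand-in, not the part-based model of Felzenszwalb et al.: the input is a hypothetical 1-D signal, the latent variable z is the offset of a fixed-width window, and Phi(x, z) is simply the window contents.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def window_feature(signal, offset, width=2):
    """Hypothetical Phi(x, z): the slice of the signal at latent offset z."""
    return signal[offset:offset + width]

def latent_score(w, signal, width=2):
    """f_w(x) = max over z of w . Phi(x, z), returned with the maximizing z."""
    offsets = range(len(signal) - width + 1)
    best = max(offsets, key=lambda z: dot(w, window_feature(signal, z, width)))
    return dot(w, window_feature(signal, best, width)), best

def latent_slack(w, signal, label, width=2):
    """Smallest xi_i satisfying y_i f_w(x_i) >= 1 - xi_i and xi_i >= 0."""
    score, _ = latent_score(w, signal, width)
    return max(0.0, 1.0 - label * score)

w = (1.0, 1.0)                   # hypothetical weights
signal = [0.0, 1.0, 3.0, 2.0]    # hypothetical input
```

The three windows score 1, 4, and 5, so `latent_score(w, signal)` returns the score 5 at offset 2. A positive example with this score needs no slack, while a negative example would need a slack of 6.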

101 Structured Output
[Figure; output space labeled Y]

104 Structural SVM
$$\min_{w,\xi}\; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad w \cdot \Phi(x_i, y_i) - w \cdot \Phi(x_i, y) \ge \Delta(y, y_i) - \xi_i,\ \forall i,\ \forall y \ne y_i; \qquad \xi_i \ge 0,\ \forall i$$
Compared with the multi-class SVM, the fixed margin 1 is replaced by the loss $\Delta(y, y_i)$. The number of constraints (one per $y \ne y_i$) is exponential.
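
A brute-force example shows both the loss-scaled constraints and why their number grows exponentially. The sketch below assumes a hypothetical sequence-labeling task: the output y is a tuple of labels in {-1, +1}, Phi(x, y) has one unary and one pairwise component, and the loss is the Hamming distance. Enumerating all outputs is only workable for tiny lengths, since there are 2^L of them.

```python
from itertools import product

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def joint_feature(x, y):
    """Hypothetical Phi(x, y): (sum_t x_t y_t, sum_t y_t y_{t+1})."""
    unary = sum(xt * yt for xt, yt in zip(x, y))
    pairwise = sum(a * b for a, b in zip(y, y[1:]))
    return (unary, float(pairwise))

def hamming(y, y_true):
    """Delta(y, y_true): number of positions where the labels differ."""
    return sum(1 for a, b in zip(y, y_true) if a != b)

def all_outputs(length):
    """Every labeling in {-1, +1}^length; there are 2**length of them."""
    return list(product((-1, +1), repeat=length))

def structural_slack(w, x, y_true):
    """Smallest xi_i satisfying all constraints for one training example."""
    true_score = dot(w, joint_feature(x, y_true))
    worst = 0.0
    for y in all_outputs(len(x)):
        if y == y_true:
            continue
        violation = hamming(y, y_true) - (true_score - dot(w, joint_feature(x, y)))
        worst = max(worst, violation)
    return worst

x = [1.0, -1.0, 1.0]   # hypothetical input sequence
y_true = (1, -1, 1)    # hypothetical ground-truth labeling
```

With w = (1, 0) every wrong labeling is beaten by more than its Hamming loss, so the slack is 0. With the weaker w = (0.25, 0) the labeling that flips all three positions violates its constraint by 1.5, which becomes the slack. Practical solvers avoid the enumeration by searching for the most violated output directly.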

105 Thank you!
