Machine Learning 1: Linear Classifiers
Marius Kloft
Humboldt University of Berlin, Summer Term 2014


Recap

Past lectures:

L1: Examples of machine learning applications. Formalization: a learning machine/algorithm learns, from inputs $x_1, \dots, x_n$ and labels $y_1, \dots, y_n$, a function (classifier, predictor) that predicts the unknown label $y$ of a new input $x$.

L2: Bayesian decision theory. Playing god: what would the optimal decision be if we knew everything? The Bayes classifier is the theoretically optimal classifier: given input $x$, predict $f(x) := \arg\max_y P(Y = y \mid X = x)$. By Bayes' rule, this is equivalent to predicting $f(x) := \arg\max_y P(X = x \mid Y = y)\, P(Y = y)$.

L3: Gaussian model: the data come from two Gaussians, $P(X = x \mid Y = +1) = \mathcal{N}(\mu_+, \Sigma_+)$ and $P(X = x \mid Y = -1) = \mathcal{N}(\mu_-, \Sigma_-)$.


From the theoretical Bayes classifier to practical classifiers...

In practice:
- Replace $P(Y = +1)$ by its estimate $n_+/n$, where $n_+ := |\{i : y_i = +1\}|$.
- Replace the parameters of the Gaussian distributions by their estimates $\hat\mu_+, \hat\mu_-, \hat\Sigma_+, \hat\Sigma_-$:
  $\hat\mu_+ := \frac{1}{n_+} \sum_{i : y_i = +1} x_i, \qquad \hat\Sigma_+ := \frac{1}{n_+} \sum_{i : y_i = +1} (x_i - \hat\mu_+)(x_i - \hat\mu_+)^\top$
  (analogously for the negative class).
- Classify according to $\hat f(x) := \arg\max_{y \in \{-,+\}} p_{\hat\mu_y, \hat\Sigma_y}(x)\, \frac{n_y}{n}$, where $p_{\hat\mu_y, \hat\Sigma_y}$ is the multivariate Gaussian pdf,
  $p_{\hat\mu_+, \hat\Sigma_+}(x) := \frac{1}{\sqrt{(2\pi)^d \det(\hat\Sigma_+)}} \exp\!\left(-\tfrac{1}{2}(x - \hat\mu_+)^\top \hat\Sigma_+^{-1} (x - \hat\mu_+)\right).$
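As a concrete illustration, here is a minimal sketch of this plug-in classifier in Python (NumPy and SciPy assumed; the function names are our own, not from the slides):

import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classifier(X, y):
    """Estimate class prior, mean, and covariance per class."""
    params = {}
    for label in (+1, -1):
        Xc = X[y == label]
        params[label] = (len(Xc) / len(X),                      # prior n_y / n
                         Xc.mean(axis=0),                       # mu_hat
                         np.cov(Xc, rowvar=False, bias=True))   # Sigma_hat (1/n_y normalization)
    return params

def predict(x, params):
    """Predict the class maximizing p(x | y) * prior."""
    scores = {label: prior * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for label, (prior, mu, cov) in params.items()}
    return max(scores, key=scores.get)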

This yields three different kinds of classifiers, differing in their assumptions on the covariance!

Assumptions:
- Both Gaussians are isotropic (no covariance, same variance): formally, $\hat\Sigma_+ = \sigma^2 I = \hat\Sigma_-$, where $I$ is the $d \times d$ identity matrix.
- Both Gaussians have equal covariance: $\hat\Sigma_+ = \hat\Sigma_-$.

Classifier                               isotropic   equal covariance
Nearest centroid classifier (NCC)        yes         yes
Linear discriminant analysis (LDA)       no          yes
Quadratic discriminant analysis (QDA)    no          no

For simplicity, consider the case $n_+ = n_-$ (the general case is a trivial extension).


NCC: Nearest centroid classifier (formerly known as the "simple no-name classifier")

Derivation in a nutshell:
- The decision surface is given by $p_{\hat\mu_+, \hat\Sigma_+}(x)\, \frac{n_+}{n} = p_{\hat\mu_-, \hat\Sigma_-}(x)\, \frac{n_-}{n}$.
- Insert the definition of the Gaussian pdf, $p_{\hat\mu_+, \hat\Sigma_+}(x) := \frac{1}{\sqrt{(2\pi)^d \det(\hat\Sigma_+)}} \exp\!\left(-\tfrac{1}{2}(x - \hat\mu_+)^\top \hat\Sigma_+^{-1} (x - \hat\mu_+)\right)$.
- This simplifies a lot because $\hat\Sigma_+ = \sigma^2 I = \hat\Sigma_-$.
- Easy calculation: because of the assumptions, many terms cancel out, and the decision surface boils down to $\|x - \hat\mu_+\|^2 = \|x - \hat\mu_-\|^2$, or equivalently:
  $$\underbrace{(\hat\mu_+ - \hat\mu_-)^\top}_{=:\, w^\top} x \;+\; \underbrace{\tfrac{1}{2}\left(\|\hat\mu_-\|^2 - \|\hat\mu_+\|^2\right)}_{=:\, b} \;=\; 0.$$

A classifier of the form $w^\top x + b = 0$ is called a linear classifier.


NCC: Nearest centroid classifier (continued)

Training
1: function TrainNCC(x_1, ..., x_n, y_1, ..., y_n)
2:   precompute $\hat\mu_+$ and $\hat\mu_-$ (as defined earlier)
3:   [extension: also compute $n_+$ and $n_-$]
4:   compute w and b (see previous slide)
5:   return w, b
6: end function

Prediction
1: function PredictLinear(x, w, b)
2:   if $w^\top x + b > 0$ then return y = +1
3:   else return y = -1
4:   end if
5: end function
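The pseudocode translates almost line for line into Python; a minimal sketch (NumPy assumed, names of our own choosing):

import numpy as np

def train_ncc(X, y):
    """NCC weights from the two class centroids."""
    mu_pos = X[y == +1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    w = mu_pos - mu_neg
    b = 0.5 * (np.dot(mu_neg, mu_neg) - np.dot(mu_pos, mu_pos))
    return w, b

def predict_linear(x, w, b):
    """Generic linear prediction rule sign(w^T x + b)."""
    return +1 if np.dot(w, x) + b > 0 else -1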


Linear discriminant analysis (LDA)

- Assume equal covariance, $\hat\Sigma_+ = \hat\Sigma_-$ (but, unlike NCC, not isotropy).
- The derivation is similar to that of the NCC.
- Yields $w = \hat\Sigma_+^{-1}(\hat\mu_+ - \hat\mu_-)$.

However, the assumption of equal covariance is often violated in practice.


Fisher's discriminant analysis (FDA)

FDA trick:
- Put $\hat\Sigma := \frac{1}{2}(\hat\Sigma_+ + \hat\Sigma_-)$.
- Predict via $w = \hat\Sigma^{-1}(\hat\mu_+ - \hat\mu_-)$.
- Compute b so that the number of training errors, $\sum_{i=1}^n 1_{\{y_i \neq \mathrm{sign}(w^\top x_i + b)\}}$, is minimized.

Training
1: function LinearDiscriminantAnalysis(x_1, ..., x_n, y_1, ..., y_n)
2:   precompute $\hat\mu_+$ and $\hat\mu_-$, as well as $\hat\Sigma = \frac{1}{2}(\hat\Sigma_+ + \hat\Sigma_-)$
3:   put $w = \hat\Sigma^{-1}(\hat\mu_+ - \hat\mu_-)$
4:   compute b by minimizing the number of training errors (see above)
5:   return w, b
6: end function

Prediction again via the function PredictLinear (defined earlier).
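A minimal Python sketch of this procedure (NumPy assumed; scanning the candidate thresholds $-w^\top x_i$ is one reasonable way to minimize the training error, our choice rather than the slides'):

import numpy as np

def train_fda(X, y):
    """Fisher's discriminant: pooled covariance, then w = Sigma^{-1}(mu+ - mu-)."""
    Xp, Xn = X[y == +1], X[y == -1]
    mu_pos, mu_neg = Xp.mean(axis=0), Xn.mean(axis=0)
    sigma = 0.5 * (np.cov(Xp, rowvar=False, bias=True) +
                   np.cov(Xn, rowvar=False, bias=True))
    w = np.linalg.solve(sigma, mu_pos - mu_neg)
    # Choose b minimizing the number of training errors; every relevant
    # decision threshold is represented by some -w^T x_i.
    candidates = -X @ w
    errors = [np.sum(np.sign(X @ w + b) != y) for b in candidates]
    b = candidates[int(np.argmin(errors))]
    return w, b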


Linear Classifiers

Generally, classifiers of the form $f(x) = \mathrm{sign}(w^\top x + b)$ are called linear classifiers.

Remark: the trick for computing b (see previous slide) can be used for all linear classifiers.

What are the advantages and disadvantages of linear classifiers?

Advantages
+ Easy to understand and interpret
+ In practice: work well surprisingly often
+ Fast to train and evaluate

Disadvantages
- Suboptimal performance if the true decision boundary is non-linear
- This occurs for very complex problems, such as recognition problems and many others


Roadmap

- We will now introduce linear support vector machines (SVMs).
- Coming lecture: non-linear SVMs.
- The SVM is a very successful, state-of-the-art learning algorithm.


Linear Support Vector Machines

Core idea: separate the data with a large margin.

How can we formally describe this idea? (Maximize the margin such that all data points lie outside of the margin...)

Note: from now on, part of the lecture will take place at the board.


Elementary Geometry Recap

Recall from linear algebra the definition of the component of a vector a with respect to another vector b [illustrated by board picture]:
$$\mathrm{comp}_b\, a := \frac{a^\top b}{\|b\|}$$

This follows from elementary geometry, $\cos(\angle(a, b)) := \frac{\mathrm{comp}_b\, a}{\|a\|}$, combined with $\cos(\angle(a, b)) = \frac{a^\top b}{\|a\|\,\|b\|}$ [illustrated by board picture].
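A quick numerical check of these two identities (a NumPy sketch with example vectors of our own):

import numpy as np

a = np.array([3.0, 1.0])
b = np.array([2.0, 0.0])

comp_b_a = a @ b / np.linalg.norm(b)   # component of a along b
cos_angle = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# comp_b(a) = ||a|| * cos(angle(a, b))
assert np.isclose(comp_b_a, np.linalg.norm(a) * cos_angle)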


Linear SVMs (continued)

Formalizing the geometric intuition [see board picture]:
- Denote the margin by $\gamma$.
- Task: maximize the margin $\gamma$ such that all positive data points lie on one side, $\gamma \leq \mathrm{comp}_w\, x_i$ for all i with $y_i = +1$, and all negative points on the other, $\mathrm{comp}_w\, x_i \leq -\gamma$ for all i with $y_i = -1$.
- The maximization is over the variables $\gamma \in \mathbb{R}$ and $w \in \mathbb{R}^d$.


Linear SVMs (continued)

Note:
- If $y_i = +1$, then $\gamma \leq \mathrm{comp}_w\, x_i$ is (multiplying both sides of the inequality by $y_i$) the same as $\gamma \leq y_i\, \mathrm{comp}_w\, x_i$.
- If $y_i = -1$, then $\mathrm{comp}_w\, x_i \leq -\gamma$ is (multiplying both sides of the inequality by $y_i$, which flips the inequality) the same as $\gamma \leq y_i\, \mathrm{comp}_w\, x_i$.
- So in both cases we have $\gamma \leq y_i\, \mathrm{comp}_w\, x_i$.
- By the definition of the component of a vector with respect to another vector: $\mathrm{comp}_w\, x_i = \frac{w^\top x_i}{\|w\|}$.

Thus, the problem from the previous slide becomes:

Linear SVM (a first, preliminary definition)
$$\max_{\gamma \in \mathbb{R},\, w \in \mathbb{R}^d} \gamma \quad \text{s.t.} \quad \gamma \leq y_i\, \frac{w^\top x_i}{\|w\|} \quad \text{for all } i = 1, \dots, n$$

Remark: we read "s.t." as "subject to the constraints".
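The constraint has a direct computational reading: for any candidate $w$, the largest feasible $\gamma$ is the smallest signed distance $y_i\, w^\top x_i / \|w\|$ over the training set. A NumPy sketch on toy data of our own:

import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 1.0])  # some candidate direction

# Largest gamma satisfying gamma <= y_i * w^T x_i / ||w|| for all i
gamma = np.min(y * (X @ w) / np.linalg.norm(w))
print(gamma)  # the geometric margin of w on this data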


Linear SVMs (continued)

More generally, we allow positioning the hyperplane off the origin by introducing a so-called bias b:

Hard-Margin SVM with bias
$$\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d} \gamma \quad \text{s.t.} \quad \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) \quad \text{for all } i = 1, \dots, n$$

Problem: sometimes the above problem is infeasible, because no separating hyperplane exists!


Limitations of Hard-Margin SVMs

- Any three points in the plane ($\mathbb{R}^2$) in general position can be shattered (separated under every labeling) by a hyperplane (= linear classifier).
- But there are configurations of four points which no hyperplane can shatter (e.g., the XOR configuration).
- More generally: any n + 1 points in general position in $\mathbb{R}^n$ can be shattered by a hyperplane, but there are configurations of n + 2 points which no hyperplane can shatter.

Limitations of Hard-Margin SVMs (continued)

Another problem is that of outliers potentially corrupting the SVM: a single mislabeled or extreme point can force a very small margin, or make the problem infeasible altogether.


Remedy: Soft-Margin SVMs

Core idea: introduce for each input $x_i$ a slack variable $\xi_i \geq 0$ that allows for some slight violations of the margin separation:

Linear Soft-Margin SVM (with bias)
$$\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d,\, \xi_1, \dots, \xi_n \geq 0} \; \gamma - C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \forall i:\; \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) + \xi_i$$

- We also minimize $\sum_{i=1}^n \xi_i$ so that only slight violations of the margin separation are allowed.
- C is a trade-off parameter (to be set in advance): the higher C, the more we penalize violations of the margin separation.
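In practice one rarely solves this problem by hand. A minimal sketch using scikit-learn's linear-kernel SVC (an assumption on tooling, not part of the slides) shows where the trade-off parameter C enters:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# C controls the penalty on margin violations: large C approaches the hard margin
clf = SVC(kernel="linear", C=10.0)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)  # learned w and b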


Support Vectors

Denote by $\gamma^*$, $w^*$, and $b^*$ the maximizing arguments of the linear SVM problem from the previous slide, that is:
$$(\gamma^*, w^*, b^*) := \arg\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d,\, \xi_1, \dots, \xi_n \geq 0} \; \gamma - C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \forall i:\; \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) + \xi_i$$

All vectors $x_i$ with $\gamma^* \geq y_i \left( \frac{(w^*)^\top x_i}{\|w^*\|} + b^* \right)$ are called support vectors: the points that lie exactly on the margin or violate it.

What does this mean geometrically?
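Continuing the scikit-learn sketch from above (again an assumption on tooling, not part of the slides), the trained model exposes exactly this set:

# Indices and coordinates of the support vectors found during training
print(clf.support_)          # indices i of the support vectors
print(clf.support_vectors_)  # the vectors x_i themselves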

SVM Training

How can we train SVMs? That is, how do we solve the underlying optimization problem?


Convex Optimization Problems

It is known from decades of research in numerical mathematics that so-called convex optimization problems (to be introduced in detail in the next lecture) can be solved very efficiently.

Convex optimization problem
$$\min_{v \in \mathbb{R}^d} f(v) \quad \text{s.t.} \quad f_i(v) \leq 0, \; i = 1, \dots, m; \qquad h_i(v) = 0, \; i = 1, \dots, l,$$
where $f, f_1, \dots, f_m$ are convex functions (introduced in the next lecture) and $h_1, \dots, h_l$ are linear functions.

- Can we translate the linear SVM maximization problem into a convex minimization problem?
- How do we solve such a problem?
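To preview where this is heading: the soft-margin SVM is usually rewritten in the equivalent convex form $\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i(w^\top x_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$ (covered in the next lecture). A sketch of that convex problem using the cvxpy modeling library, our illustration rather than the slides':

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
n, d = X.shape
C = 10.0

w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n, nonneg=True)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]  # margin constraints with slack
cp.Problem(objective, constraints).solve()
print(w.value, b.value)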


Conclusion

Linear classifiers:
- Classifiers motivated by Bayesian decision theory:
  - Nearest Centroid Classifier (NCC)
  - Linear Discriminant Analysis / Fisher's Linear Discriminant
- Support Vector Machines:
  - Motivated geometrically
  - Maximize the margin between positive and negative inputs
  - Can be described as a numerical optimization problem

How to optimize? We will show that the SVM can be formulated as a convex optimization problem.
