Machine Learning 1: Linear Classifiers
Marius Kloft, Humboldt University of Berlin
Summer Term 2014

Recap

Past lectures:

L1: Examples of machine learning applications. Formalization: a learning machine (algorithm) learns from inputs x_1, ..., x_n and labels y_1, ..., y_n a function (classifier, predictor) that predicts the unknown label y of a new input x.

L2: Bayesian decision theory. Playing god: what would be the optimal decision if we knew everything? The Bayes classifier is the theoretically optimal classifier: given input x, predict f(x) := \arg\max_y P(Y = y \mid X = x). By Bayes' rule, this is equivalent to predicting f(x) := \arg\max_y P(X = x \mid Y = y)\, P(Y = y).

L3: Gaussian model: the data come from two Gaussians,
P(X = x \mid Y = +1) = N(\mu_+, \Sigma_+), \qquad P(X = x \mid Y = -1) = N(\mu_-, \Sigma_-).

From the theoretical Bayes classifier to practical classifiers...

In practice:

Replace P(Y = +1) by its estimate n_+/n, where n_+ := |\{i : y_i = +1\}|.

Replace the parameters of the Gaussian distributions by their estimates \hat{\mu}_+, \hat{\mu}_-, \hat{\Sigma}_+ and \hat{\Sigma}_-:
\hat{\mu}_+ := \frac{1}{n_+} \sum_{i : y_i = +1} x_i, \qquad \hat{\Sigma}_+ := \frac{1}{n_+} \sum_{i : y_i = +1} (x_i - \hat{\mu}_+)(x_i - \hat{\mu}_+)^\top
(and analogously for \hat{\mu}_- and \hat{\Sigma}_-).

Classify according to
\hat{f}(x) := \arg\max_{y \in \{-,+\}} p_{\hat{\mu}_y, \hat{\Sigma}_y}(x) \cdot \frac{n_y}{n},
where p_{\hat{\mu}_y, \hat{\Sigma}_y} is the multivariate Gaussian pdf,
p_{\hat{\mu}_+, \hat{\Sigma}_+}(x) := \frac{1}{\sqrt{(2\pi)^d \det(\hat{\Sigma}_+)}} \exp\!\left(-\tfrac{1}{2}(x - \hat{\mu}_+)^\top \hat{\Sigma}_+^{-1} (x - \hat{\mu}_+)\right).
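To make these plug-in estimates concrete, here is a minimal NumPy/SciPy sketch (function and variable names are my own, not from the lecture; scipy.stats.multivariate_normal supplies the Gaussian pdf):

import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classifier(X, y):
    """Plug-in estimates of the class priors, means, and covariances."""
    params = {}
    for c in (+1, -1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                        # estimate of mu_c
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)   # estimate of Sigma_c, 1/n_c normalization as on the slide
        params[c] = (mu, Sigma, len(Xc) / len(X))   # prior estimate n_c / n
    return params

def predict_gaussian(X, params):
    """Predict the class maximizing p_{mu_c, Sigma_c}(x) * n_c / n."""
    score = {c: multivariate_normal.pdf(X, mean=mu, cov=Sigma) * prior
             for c, (mu, Sigma, prior) in params.items()}
    return np.where(score[+1] >= score[-1], +1, -1)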

This yields three different kinds of classifiers, differing in their assumptions on the covariance!

Assumptions:
- Both Gaussians are isotropic (no covariance, same variance); formally: \hat{\Sigma}_+ = \sigma^2 I and \hat{\Sigma}_- = \sigma^2 I, where I is the d x d identity matrix.
- Both Gaussians have equal covariance: \hat{\Sigma}_+ = \hat{\Sigma}_-.

Classifier \ Assumption               | isotropic | equal covariance
Nearest centroid classifier (NCC)     |    yes    |       yes
Linear discriminant analysis (LDA)    |    no     |       yes
Quadratic discriminant analysis (QDA) |    no     |       no

For simplicity, consider the case where n_+ = n_- (the general case is a trivial extension).

NCC: Nearest centroid classifier (formerly known as the simple no-name classifier)

Derivation in a nutshell:
- The decision surface is given by p_{\hat{\mu}_+, \hat{\Sigma}_+}(x) \cdot \frac{n_+}{n} = p_{\hat{\mu}_-, \hat{\Sigma}_-}(x) \cdot \frac{n_-}{n}.
- Insert the definition of the Gaussian pdf,
p_{\hat{\mu}_+, \hat{\Sigma}_+}(x) := \frac{1}{\sqrt{(2\pi)^d \det(\hat{\Sigma}_+)}} \exp\!\left(-\tfrac{1}{2}(x - \hat{\mu}_+)^\top \hat{\Sigma}_+^{-1} (x - \hat{\mu}_+)\right).
- This simplifies a lot because \hat{\Sigma}_+ = \sigma^2 I = \hat{\Sigma}_-.
- Easy calculation: because of the assumptions, a lot of terms cancel out, and the decision surface simply boils down to \|x - \hat{\mu}_+\|^2 = \|x - \hat{\mu}_-\|^2, or equivalently:
\underbrace{(\hat{\mu}_+ - \hat{\mu}_-)^\top}_{=: w^\top} x + \underbrace{\tfrac{1}{2}\left(\|\hat{\mu}_-\|^2 - \|\hat{\mu}_+\|^2\right)}_{=: b} = 0.
- A classifier of the form w^\top x + b = 0 is called a linear classifier.
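For reference, the "easy calculation" spelled out (my own expansion; the slide leaves it to the reader). Expanding both squared norms and cancelling the common \|x\|^2 term gives

\|x - \hat{\mu}_+\|^2 = \|x - \hat{\mu}_-\|^2
\iff \|x\|^2 - 2\hat{\mu}_+^\top x + \|\hat{\mu}_+\|^2 = \|x\|^2 - 2\hat{\mu}_-^\top x + \|\hat{\mu}_-\|^2
\iff -2(\hat{\mu}_+ - \hat{\mu}_-)^\top x + \|\hat{\mu}_+\|^2 - \|\hat{\mu}_-\|^2 = 0
\iff (\hat{\mu}_+ - \hat{\mu}_-)^\top x + \tfrac{1}{2}\left(\|\hat{\mu}_-\|^2 - \|\hat{\mu}_+\|^2\right) = 0.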

NCC: Nearest centroid classifier (continued)

Training
1: function TRAINNCC(x_1, ..., x_n, y_1, ..., y_n)
2:   precompute \hat{\mu}_+ and \hat{\mu}_- (see the plug-in estimates above)
3:   [extension: also compute n_+ and n_-]
4:   compute w and b (see previous slide)
5:   return w, b
6: end function

Prediction
1: function PREDICTLINEAR(x, w, b)
2:   if w^\top x + b > 0 then return y = +1
3:   else return y = -1
4:   end if
5: end function
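A minimal NumPy sketch of the two pseudocode functions above (my own naming; a rough translation, not the lecture's reference code):

import numpy as np

def train_ncc(X, y):
    """X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}."""
    mu_plus = X[y == +1].mean(axis=0)    # centroid of the positive class
    mu_minus = X[y == -1].mean(axis=0)   # centroid of the negative class
    w = mu_plus - mu_minus
    b = 0.5 * (np.sum(mu_minus ** 2) - np.sum(mu_plus ** 2))
    return w, b

def predict_linear(X, w, b):
    """Return +1 where w^T x + b > 0 and -1 otherwise."""
    return np.where(X @ w + b > 0, +1, -1)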

Linear discriminant analysis (LDA)
- Additionally assume equal covariance, \hat{\Sigma}_+ = \hat{\Sigma}_-.
- The derivation is similar to NCC.
- Yields w = \hat{\Sigma}_+^{-1}(\hat{\mu}_+ - \hat{\mu}_-).
- The assumption of equal covariance is violated in practice.

Fisher's discriminant analysis (FDA)

FDA trick:
- Put \hat{\Sigma} := \tfrac{1}{2}(\hat{\Sigma}_+ + \hat{\Sigma}_-).
- Predict via w = \hat{\Sigma}^{-1}(\hat{\mu}_+ - \hat{\mu}_-).
- Compute b so that \sum_{i=1}^n 1_{y_i \neq \mathrm{sign}(w^\top x_i + b)} is minimized (i.e., b minimizes the number of training errors).

Training
1: function LINEARDISCRIMINANTANALYSIS(x_1, ..., x_n, y_1, ..., y_n)
2:   precompute \hat{\mu}_+ and \hat{\mu}_-, as well as \hat{\Sigma} = \tfrac{1}{2}(\hat{\Sigma}_+ + \hat{\Sigma}_-)
3:   put w = \hat{\Sigma}^{-1}(\hat{\mu}_+ - \hat{\mu}_-)
4:   compute b minimizing the number of training errors (see above)
5:   return w, b
6: end function

Prediction again via function PREDICTLINEAR (see the NCC slides above).
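A corresponding NumPy sketch of the FDA trick (my own code; the bias search over midpoints between consecutive projected points is one simple way to minimize the training error, not prescribed by the slide):

import numpy as np

def train_fda(X, y):
    Xp, Xm = X[y == +1], X[y == -1]
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)

    def class_cov(Xc, mu):
        Z = Xc - mu
        return Z.T @ Z / len(Xc)          # 1/n_c normalization, as on the earlier slide

    Sigma = 0.5 * (class_cov(Xp, mu_p) + class_cov(Xm, mu_m))
    w = np.linalg.solve(Sigma, mu_p - mu_m)

    # Choose b minimizing the number of training errors: it suffices to test
    # thresholds between consecutive projected points (and beyond the extremes).
    scores = X @ w
    s = np.sort(scores)
    thresholds = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    errors = [np.sum(np.sign(scores - t) != y) for t in thresholds]
    b = -thresholds[int(np.argmin(errors))]
    return w, b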

Linear Classifiers

Generally, classifiers of the form f(x) = \mathrm{sign}(w^\top x + b) are called linear classifiers.

Remark: the trick for computing b (see the previous slide) can be used for all linear classifiers.

What are the advantages and disadvantages of linear classifiers?

Advantages
+ Easy to understand and interpret
+ In practice: work well surprisingly often
+ Fast

Disadvantages
- Suboptimal performance if the true decision boundary is non-linear
- This occurs for very complex problems, such as recognition problems and many others

Roadmap
- We will now introduce linear support vector machines (SVMs).
- Coming lecture: non-linear SVMs.
- The SVM is a very successful, state-of-the-art learning algorithm.

Linear Support Vector Machines

Core idea: separate the data with a large margin.

How can we formally describe this idea? (Maximize the margin such that all data points lie outside of the margin...)

Note: from now on, part of the lecture will take place at the board.

Elementary Geometry Recap

Recall from linear algebra the definition of the component of a vector a with respect to another vector b [illustrated by board picture]:
\mathrm{comp}_b\, a := \frac{a^\top b}{\|b\|}.

This follows from elementary geometry:
\cos(\angle(a, b)) := \frac{\mathrm{comp}_b\, a}{\|a\|} \quad \text{and} \quad \cos(\angle(a, b)) = \frac{a^\top b}{\|a\|\,\|b\|}
[illustrated by board picture].
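A quick numerical sanity check of these two identities (a throwaway NumPy snippet, not part of the lecture):

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

comp_b_a = a @ b / np.linalg.norm(b)                          # component of a along b -> 3.0
cos_angle = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # -> 0.6
print(comp_b_a, comp_b_a / np.linalg.norm(a), cos_angle)      # 3.0 0.6 0.6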

Linear SVMs (continued)

Formalizing the geometric intuition [see board picture]:
- Denote the margin by \gamma.
- Task: maximize the margin \gamma such that all positive data points lie on one side,
\gamma \leq \mathrm{comp}_w\, x_i \quad \text{for all } i \text{ with } y_i = +1,
and all negative points on the other,
\mathrm{comp}_w\, x_i \leq -\gamma \quad \text{for all } i \text{ with } y_i = -1.
- The maximization is over the variables \gamma \in \mathbb{R} and w \in \mathbb{R}^d.

Linear SVMs (continued)

Note:
- If y_i = +1, then \gamma \leq \mathrm{comp}_w\, x_i is (multiplying both sides of the inequality by y_i) the same as \gamma \leq y_i\, \mathrm{comp}_w\, x_i.
- If y_i = -1, then \mathrm{comp}_w\, x_i \leq -\gamma is (multiplying both sides of the inequality by y_i, which flips the inequality) the same as \gamma \leq y_i\, \mathrm{comp}_w\, x_i.
- So in both cases we have \gamma \leq y_i\, \mathrm{comp}_w\, x_i.
- By the definition of the component of a vector with respect to another vector: \mathrm{comp}_w\, x_i = \frac{w^\top x_i}{\|w\|}.

Thus, the problem from the previous slide becomes:

Linear SVM: a first preliminary definition
\max_{\gamma \in \mathbb{R},\, w \in \mathbb{R}^d} \gamma \quad \text{s.t.} \quad \gamma \leq y_i\, \frac{w^\top x_i}{\|w\|} \quad \text{for all } i = 1, \ldots, n

Remark: we read "s.t." as "subject to the constraints".

Linear SVMs (continued)

More generally, we allow positioning the hyperplane off the origin by introducing a so-called bias b:

Hard-Margin SVM with bias
\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d} \gamma \quad \text{s.t.} \quad \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) \quad \text{for all } i = 1, \ldots, n

Problem: sometimes the above problem is infeasible, because no separating hyperplane exists!

Limitations of Hard-Margin SVMs
- Any three points in general position in the plane (R^2) can be shattered (separated, for every labeling) by a hyperplane (= linear classifier).
- But there are configurations of four points which no hyperplane can shatter (e.g., the XOR configuration, where diagonally opposite corners of a square share a label).
- More generally: any n + 1 points in general position in R^n can be shattered by a hyperplane, but there are configurations of n + 2 points which no hyperplane can shatter.

Limitations of Hard-Margin SVMs (continued)

Another problem is that of outliers, which can potentially corrupt the SVM solution:

Remedy: Soft-Margin SVMs

Core idea: introduce for each input x_i a slack variable \xi_i \geq 0 that allows for slight violations of the margin separation:

Linear Soft-Margin SVM (with bias)
\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d,\, \xi_1, \ldots, \xi_n \geq 0} \;\; \gamma - C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \forall i: \;\; \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) + \xi_i

- We also penalize \sum_{i=1}^n \xi_i in the objective, so that only slight violations of the margin separation are allowed.
- C is a trade-off parameter (to be set in advance): the higher C, the more we penalize violations of the margin separation.
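The problem as stated above is not yet in a form a standard solver accepts (that reformulation comes in the next lecture). As a preview, here is a small NumPy sketch that trains a linear soft-margin SVM by sub-gradient steps on the commonly used equivalent objective 0.5*||w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + b)); the rescaling that links this to the gamma-formulation above is not shown here, and all names, step sizes, and defaults are my own choices:

import numpy as np

def train_svm_subgradient(X, y, C=1.0, lr=1e-3, epochs=200, seed=0):
    """Per-example sub-gradient steps on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:         # x_i lies on or inside the margin (hinge is active)
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:                                  # only the regularizer contributes
                w -= lr * w
    return w, b

In practice one would hand this problem to a dedicated solver or an off-the-shelf SVM implementation; the sketch is only meant to show that the optimization is tractable.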

Support Vectors

Denote by \gamma^* and w^* the arguments of the linear SVM maximization of the previous slide, that is:
(\gamma^*, w^*) := \arg\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d,\, \xi_1, \ldots, \xi_n \geq 0} \;\; \gamma - C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \forall i: \;\; \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) + \xi_i

All vectors x_i with \gamma^* \geq y_i \left( \frac{(w^*)^\top x_i}{\|w^*\|} + b^* \right) are called support vectors.

What does this mean geometrically?
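Continuing the sketch from the soft-margin slide (and assuming its rescaled formulation, in which the margin constraint reads y_i(w^T x_i + b) >= 1 - xi_i), the support vectors are simply the points on or inside the margin. A hypothetical usage example on toy data:

import numpy as np

# toy data: two Gaussian blobs (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(50, 2)),
               rng.normal(-2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w, b = train_svm_subgradient(X, y, C=1.0)   # from the sketch above
margins = y * (X @ w + b)
support = margins <= 1.0 + 1e-6             # points on or inside the margin
print("number of support vectors:", int(support.sum()))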

SVM training

How can we train SVMs, that is, how do we solve the above optimization problem?

Convex Optimization Problems

It is known from decades of research in numerical mathematics that so-called convex optimization problems (to be introduced in detail in the next lecture) can be solved very efficiently.

Convex optimization problem
\min_{v \in \mathbb{R}^d} f(v) \quad \text{s.t.} \quad f_i(v) \leq 0, \; i = 1, \ldots, m, \qquad h_i(v) = 0, \; i = 1, \ldots, l,
where f, f_1, \ldots, f_m are convex functions (introduced in the next lecture) and h_1, \ldots, h_l are linear functions.

- Can we translate the linear SVM maximization problem into a convex minimization problem?
- How do we solve this problem?
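To illustrate the standard form, here is a tiny convex problem solved with the cvxpy library (my choice of tool; it is not mentioned in the lecture): one convex objective, one convex inequality constraint, and one linear equality constraint.

import numpy as np
import cvxpy as cp

v = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(v - np.array([3.0, 2.0])))  # convex f(v)
constraints = [v[0] + v[1] - 1 <= 0,   # convex f_1(v) <= 0
               v[0] - v[1] == 0]       # linear h_1(v) = 0
problem = cp.Problem(objective, constraints)
problem.solve()
print(problem.status, v.value)         # optimal, approximately [0.5, 0.5]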

Conclusion

Linear classifiers:
- Classifiers motivated by Bayesian decision theory:
  - Nearest Centroid Classifier (NCC)
  - Linear Discriminant Analysis / Fisher's Linear Discriminant
- Support Vector Machines:
  - Motivated geometrically
  - Maximize the margin between positive and negative inputs
  - Can be described as a numerical optimization problem

How to optimize? We will show that the SVM can be formulated as a convex optimization problem.