Machine Learning 1: Linear Classifiers
Marius Kloft, Humboldt University of Berlin
Summer Term 2014

Recap

Past lectures:

L1: Examples of machine learning applications. Formalization: a learning machine (algorithm) learns from inputs x_1, ..., x_n and labels y_1, ..., y_n a function (classifier, predictor) that predicts the unknown label y of a new input x.

L2: Bayesian decision theory. Playing god: what would be the optimal decision if we knew everything? The Bayes classifier is the theoretically optimal classifier: given input x, predict f(x) := \arg\max_y P(Y = y \mid X = x). By Bayes' rule, this is equivalent to predicting f(x) := \arg\max_y P(X = x \mid Y = y)\, P(Y = y).

L3: Gaussian model: the data come from two Gaussians,
P(X = x \mid Y = +1) = N(\mu_+, \Sigma_+), \qquad P(X = x \mid Y = -1) = N(\mu_-, \Sigma_-).

From the theoretical Bayes classifier to practical classifiers...

In practice:

Replace P(Y = +1) by its estimate n_+/n, where n_+ := |\{i : y_i = +1\}|.

Replace the parameters of the Gaussian distributions by their estimates \hat{\mu}_+, \hat{\mu}_-, \hat{\Sigma}_+ and \hat{\Sigma}_-:
\hat{\mu}_+ := \frac{1}{n_+} \sum_{i : y_i = +1} x_i, \qquad \hat{\Sigma}_+ := \frac{1}{n_+} \sum_{i : y_i = +1} (x_i - \hat{\mu}_+)(x_i - \hat{\mu}_+)^\top
(and analogously for \hat{\mu}_- and \hat{\Sigma}_-).

Classify according to
\hat{f}(x) := \arg\max_{y \in \{-,+\}} p_{\hat{\mu}_y, \hat{\Sigma}_y}(x) \cdot \frac{n_y}{n},
where p_{\hat{\mu}_y, \hat{\Sigma}_y} is the multivariate Gaussian pdf,
p_{\hat{\mu}_+, \hat{\Sigma}_+}(x) := \frac{1}{\sqrt{(2\pi)^d \det(\hat{\Sigma}_+)}} \exp\!\left(-\tfrac{1}{2}(x - \hat{\mu}_+)^\top \hat{\Sigma}_+^{-1} (x - \hat{\mu}_+)\right).
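To make these plug-in estimates concrete, here is a minimal NumPy/SciPy sketch (function and variable names are my own, not from the lecture; scipy.stats.multivariate_normal supplies the Gaussian pdf):

import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classifier(X, y):
    """Plug-in estimates of the class priors, means, and covariances."""
    params = {}
    for c in (+1, -1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)                        # estimate of mu_c
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)   # estimate of Sigma_c, 1/n_c normalization as on the slide
        params[c] = (mu, Sigma, len(Xc) / len(X))   # prior estimate n_c / n
    return params

def predict_gaussian(X, params):
    """Predict the class maximizing p_{mu_c, Sigma_c}(x) * n_c / n."""
    score = {c: multivariate_normal.pdf(X, mean=mu, cov=Sigma) * prior
             for c, (mu, Sigma, prior) in params.items()}
    return np.where(score[+1] >= score[-1], +1, -1)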

This yields three different kinds of classifiers, differing in their assumptions on the covariance!

Assumptions:
- Both Gaussians are isotropic (no covariance, same variance); formally: \hat{\Sigma}_+ = \sigma^2 I and \hat{\Sigma}_- = \sigma^2 I, where I is the d x d identity matrix.
- Both Gaussians have equal covariance: \hat{\Sigma}_+ = \hat{\Sigma}_-.

Classifier \ Assumption               | isotropic | equal covariance
Nearest centroid classifier (NCC)     |    yes    |       yes
Linear discriminant analysis (LDA)    |    no     |       yes
Quadratic discriminant analysis (QDA) |    no     |       no

For simplicity, consider the case where n_+ = n_- (the general case is a trivial extension).

NCC: Nearest centroid classifier (formerly known as the simple no-name classifier)

Derivation in a nutshell:
- The decision surface is given by p_{\hat{\mu}_+, \hat{\Sigma}_+}(x) \cdot \frac{n_+}{n} = p_{\hat{\mu}_-, \hat{\Sigma}_-}(x) \cdot \frac{n_-}{n}.
- Insert the definition of the Gaussian pdf,
p_{\hat{\mu}_+, \hat{\Sigma}_+}(x) := \frac{1}{\sqrt{(2\pi)^d \det(\hat{\Sigma}_+)}} \exp\!\left(-\tfrac{1}{2}(x - \hat{\mu}_+)^\top \hat{\Sigma}_+^{-1} (x - \hat{\mu}_+)\right).
- This simplifies a lot because \hat{\Sigma}_+ = \sigma^2 I = \hat{\Sigma}_-.
- Easy calculation: because of the assumptions, a lot of terms cancel out, and the decision surface simply boils down to \|x - \hat{\mu}_+\|^2 = \|x - \hat{\mu}_-\|^2, or equivalently:
\underbrace{(\hat{\mu}_+ - \hat{\mu}_-)^\top}_{=: w^\top} x + \underbrace{\tfrac{1}{2}\left(\|\hat{\mu}_-\|^2 - \|\hat{\mu}_+\|^2\right)}_{=: b} = 0.
- A classifier of the form w^\top x + b = 0 is called a linear classifier.
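For reference, the "easy calculation" spelled out (my own expansion; the slide leaves it to the reader). Expanding both squared norms and cancelling the common \|x\|^2 term gives

\|x - \hat{\mu}_+\|^2 = \|x - \hat{\mu}_-\|^2
\iff \|x\|^2 - 2\hat{\mu}_+^\top x + \|\hat{\mu}_+\|^2 = \|x\|^2 - 2\hat{\mu}_-^\top x + \|\hat{\mu}_-\|^2
\iff -2(\hat{\mu}_+ - \hat{\mu}_-)^\top x + \|\hat{\mu}_+\|^2 - \|\hat{\mu}_-\|^2 = 0
\iff (\hat{\mu}_+ - \hat{\mu}_-)^\top x + \tfrac{1}{2}\left(\|\hat{\mu}_-\|^2 - \|\hat{\mu}_+\|^2\right) = 0.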

NCC: Nearest centroid classifier (continued)

Training
1: function TRAINNCC(x_1, ..., x_n, y_1, ..., y_n)
2:   precompute \hat{\mu}_+ and \hat{\mu}_- (see the plug-in estimates above)
3:   [extension: also compute n_+ and n_-]
4:   compute w and b (see previous slide)
5:   return w, b
6: end function

Prediction
1: function PREDICTLINEAR(x, w, b)
2:   if w^\top x + b > 0 then return y = +1
3:   else return y = -1
4:   end if
5: end function
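A minimal NumPy sketch of the two pseudocode functions above (my own naming; a rough translation, not the lecture's reference code):

import numpy as np

def train_ncc(X, y):
    """X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}."""
    mu_plus = X[y == +1].mean(axis=0)    # centroid of the positive class
    mu_minus = X[y == -1].mean(axis=0)   # centroid of the negative class
    w = mu_plus - mu_minus
    b = 0.5 * (np.sum(mu_minus ** 2) - np.sum(mu_plus ** 2))
    return w, b

def predict_linear(X, w, b):
    """Return +1 where w^T x + b > 0 and -1 otherwise."""
    return np.where(X @ w + b > 0, +1, -1)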

Linear discriminant analysis (LDA)
- Additionally assume equal covariance, \hat{\Sigma}_+ = \hat{\Sigma}_-.
- The derivation is similar to NCC.
- Yields w = \hat{\Sigma}_+^{-1}(\hat{\mu}_+ - \hat{\mu}_-).
- The assumption of equal covariance is violated in practice.

Fisher's discriminant analysis (FDA)

FDA trick:
- Put \hat{\Sigma} := \tfrac{1}{2}(\hat{\Sigma}_+ + \hat{\Sigma}_-).
- Predict via w = \hat{\Sigma}^{-1}(\hat{\mu}_+ - \hat{\mu}_-).
- Compute b so that \sum_{i=1}^n 1_{y_i \neq \mathrm{sign}(w^\top x_i + b)} is minimized (i.e., b minimizes the number of training errors).

Training
1: function LINEARDISCRIMINANTANALYSIS(x_1, ..., x_n, y_1, ..., y_n)
2:   precompute \hat{\mu}_+ and \hat{\mu}_-, as well as \hat{\Sigma} = \tfrac{1}{2}(\hat{\Sigma}_+ + \hat{\Sigma}_-)
3:   put w = \hat{\Sigma}^{-1}(\hat{\mu}_+ - \hat{\mu}_-)
4:   compute b minimizing the number of training errors (see above)
5:   return w, b
6: end function

Prediction again via function PREDICTLINEAR (see the NCC slides above).
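A corresponding NumPy sketch of the FDA trick (my own code; the bias search over midpoints between consecutive projected points is one simple way to minimize the training error, not prescribed by the slide):

import numpy as np

def train_fda(X, y):
    Xp, Xm = X[y == +1], X[y == -1]
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)

    def class_cov(Xc, mu):
        Z = Xc - mu
        return Z.T @ Z / len(Xc)          # 1/n_c normalization, as on the earlier slide

    Sigma = 0.5 * (class_cov(Xp, mu_p) + class_cov(Xm, mu_m))
    w = np.linalg.solve(Sigma, mu_p - mu_m)

    # Choose b minimizing the number of training errors: it suffices to test
    # thresholds between consecutive projected points (and beyond the extremes).
    scores = X @ w
    s = np.sort(scores)
    thresholds = np.concatenate(([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]))
    errors = [np.sum(np.sign(scores - t) != y) for t in thresholds]
    b = -thresholds[int(np.argmin(errors))]
    return w, b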

Linear Classifiers

Generally, classifiers of the form f(x) = \mathrm{sign}(w^\top x + b) are called linear classifiers.

Remark: the trick for computing b (see the previous slide) can be used for all linear classifiers.

What are the advantages and disadvantages of linear classifiers?

Advantages
+ Easy to understand and interpret
+ In practice: work well surprisingly often
+ Fast

Disadvantages
- Suboptimal performance if the true decision boundary is non-linear
- This occurs for very complex problems, such as recognition problems and many others

Roadmap
- We will now introduce linear support vector machines (SVMs).
- Coming lecture: non-linear SVMs.
- The SVM is a very successful, state-of-the-art learning algorithm.

Linear Support Vector Machines

Core idea: separate the data with a large margin.

How can we formally describe this idea? (Maximize the margin such that all data points lie outside of the margin...)

Note: from now on, part of the lecture will take place at the board.

Elementary Geometry Recap

Recall from linear algebra the definition of the component of a vector a with respect to another vector b [illustrated by board picture]:
\mathrm{comp}_b\, a := \frac{a^\top b}{\|b\|}.

This follows from elementary geometry:
\cos(\angle(a, b)) := \frac{\mathrm{comp}_b\, a}{\|a\|} \quad \text{and} \quad \cos(\angle(a, b)) = \frac{a^\top b}{\|a\|\,\|b\|}
[illustrated by board picture].
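A quick numerical sanity check of these two identities (a throwaway NumPy snippet, not part of the lecture):

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

comp_b_a = a @ b / np.linalg.norm(b)                          # component of a along b -> 3.0
cos_angle = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # -> 0.6
print(comp_b_a, comp_b_a / np.linalg.norm(a), cos_angle)      # 3.0 0.6 0.6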

Linear SVMs (continued)

Formalizing the geometric intuition [see board picture]:
- Denote the margin by \gamma.
- Task: maximize the margin \gamma such that all positive data points lie on one side,
\gamma \leq \mathrm{comp}_w\, x_i \quad \text{for all } i \text{ with } y_i = +1,
and all negative points on the other,
\mathrm{comp}_w\, x_i \leq -\gamma \quad \text{for all } i \text{ with } y_i = -1.
- The maximization is over the variables \gamma \in \mathbb{R} and w \in \mathbb{R}^d.

Linear SVMs (continued)

Note:
- If y_i = +1, then \gamma \leq \mathrm{comp}_w\, x_i is (multiplying both sides of the inequality by y_i) the same as \gamma \leq y_i\, \mathrm{comp}_w\, x_i.
- If y_i = -1, then \mathrm{comp}_w\, x_i \leq -\gamma is (multiplying both sides of the inequality by y_i, which flips the inequality) the same as \gamma \leq y_i\, \mathrm{comp}_w\, x_i.
- So in both cases we have \gamma \leq y_i\, \mathrm{comp}_w\, x_i.
- By the definition of the component of a vector with respect to another vector: \mathrm{comp}_w\, x_i = \frac{w^\top x_i}{\|w\|}.

Thus, the problem from the previous slide becomes:

Linear SVM: a first preliminary definition
\max_{\gamma \in \mathbb{R},\, w \in \mathbb{R}^d} \gamma \quad \text{s.t.} \quad \gamma \leq y_i\, \frac{w^\top x_i}{\|w\|} \quad \text{for all } i = 1, \ldots, n

Remark: we read "s.t." as "subject to the constraints".

Linear SVMs (continued)

More generally, we allow positioning the hyperplane off the origin by introducing a so-called bias b:

Hard-Margin SVM with bias
\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d} \gamma \quad \text{s.t.} \quad \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) \quad \text{for all } i = 1, \ldots, n

Problem: sometimes the above problem is infeasible, because no separating hyperplane exists!

Limitations of Hard-Margin SVMs
- Any three points in general position in the plane (R^2) can be shattered (separated, for every labeling) by a hyperplane (= linear classifier).
- But there are configurations of four points which no hyperplane can shatter (e.g., the XOR configuration, where diagonally opposite corners of a square share a label).
- More generally: any n + 1 points in general position in R^n can be shattered by a hyperplane, but there are configurations of n + 2 points which no hyperplane can shatter.

Limitations of Hard-Margin SVMs (continued)

Another problem is that of outliers, which can potentially corrupt the SVM solution:

Remedy: Soft-Margin SVMs

Core idea: introduce for each input x_i a slack variable \xi_i \geq 0 that allows for slight violations of the margin separation:

Linear Soft-Margin SVM (with bias)
\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d,\, \xi_1, \ldots, \xi_n \geq 0} \;\; \gamma - C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \forall i: \;\; \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) + \xi_i

- We also penalize \sum_{i=1}^n \xi_i in the objective, so that only slight violations of the margin separation are allowed.
- C is a trade-off parameter (to be set in advance): the higher C, the more we penalize violations of the margin separation.
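The problem as stated above is not yet in a form a standard solver accepts (that reformulation comes in the next lecture). As a preview, here is a small NumPy sketch that trains a linear soft-margin SVM by sub-gradient steps on the commonly used equivalent objective 0.5*||w||^2 + C * sum_i max(0, 1 - y_i(w^T x_i + b)); the rescaling that links this to the gamma-formulation above is not shown here, and all names, step sizes, and defaults are my own choices:

import numpy as np

def train_svm_subgradient(X, y, C=1.0, lr=1e-3, epochs=200, seed=0):
    """Per-example sub-gradient steps on 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (X[i] @ w + b) < 1:         # x_i lies on or inside the margin (hinge is active)
                w -= lr * (w - C * y[i] * X[i])
                b += lr * C * y[i]
            else:                                  # only the regularizer contributes
                w -= lr * w
    return w, b

In practice one would hand this problem to a dedicated solver or an off-the-shelf SVM implementation; the sketch is only meant to show that the optimization is tractable.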

Support Vectors

Denote by \gamma^* and w^* the arguments of the linear SVM maximization of the previous slide, that is:
(\gamma^*, w^*) := \arg\max_{\gamma, b \in \mathbb{R},\, w \in \mathbb{R}^d,\, \xi_1, \ldots, \xi_n \geq 0} \;\; \gamma - C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad \forall i: \;\; \gamma \leq y_i \left( \frac{w^\top x_i}{\|w\|} + b \right) + \xi_i

All vectors x_i with \gamma^* \geq y_i \left( \frac{(w^*)^\top x_i}{\|w^*\|} + b^* \right) are called support vectors.

What does this mean geometrically?
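Continuing the sketch from the soft-margin slide (and assuming its rescaled formulation, in which the margin constraint reads y_i(w^T x_i + b) >= 1 - xi_i), the support vectors are simply the points on or inside the margin. A hypothetical usage example on toy data:

import numpy as np

# toy data: two Gaussian blobs (illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 1.0, size=(50, 2)),
               rng.normal(-2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w, b = train_svm_subgradient(X, y, C=1.0)   # from the sketch above
margins = y * (X @ w + b)
support = margins <= 1.0 + 1e-6             # points on or inside the margin
print("number of support vectors:", int(support.sum()))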

SVM training

How can we train SVMs, that is, how do we solve the above optimization problem?

Convex Optimization Problems

It is known from decades of research in numerical mathematics that so-called convex optimization problems (to be introduced in detail in the next lecture) can be solved very efficiently.

Convex optimization problem
\min_{v \in \mathbb{R}^d} f(v) \quad \text{s.t.} \quad f_i(v) \leq 0, \; i = 1, \ldots, m, \qquad h_i(v) = 0, \; i = 1, \ldots, l,
where f, f_1, \ldots, f_m are convex functions (introduced in the next lecture) and h_1, \ldots, h_l are linear functions.

- Can we translate the linear SVM maximization problem into a convex minimization problem?
- How do we solve this problem?
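To illustrate the standard form, here is a tiny convex problem solved with the cvxpy library (my choice of tool; it is not mentioned in the lecture): one convex objective, one convex inequality constraint, and one linear equality constraint.

import numpy as np
import cvxpy as cp

v = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(v - np.array([3.0, 2.0])))  # convex f(v)
constraints = [v[0] + v[1] - 1 <= 0,   # convex f_1(v) <= 0
               v[0] - v[1] == 0]       # linear h_1(v) = 0
problem = cp.Problem(objective, constraints)
problem.solve()
print(problem.status, v.value)         # optimal, approximately [0.5, 0.5]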

Conclusion

Linear classifiers:
- Classifiers motivated by Bayesian decision theory:
  - Nearest Centroid Classifier (NCC)
  - Linear Discriminant Analysis / Fisher's Linear Discriminant
- Support Vector Machines:
  - Motivated geometrically
  - Maximize the margin between positive and negative inputs
  - Can be described as a numerical optimization problem

How to optimize? We will show that the SVM can be formulated as a convex optimization problem.