LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning


Linear vs non-linear classifiers: in k-NN we saw an example of a non-linear classifier: the decision boundary, i.e., the line that separates the three classes (x, circle, diamond) from each other, is not a straight line. In contrast, for linear classifiers the decision boundary is a straight line (or a hyperplane). *http://nlp.stanford.edu/

Classification problem: based on a labeled set of data points (X, y), learn a function f: X → y in order to predict the label (class) of new, unseen data points. The input feature vectors x may be numeric, but the target variable y is categorical. Binary classification: the target variable y can take only 2 values, representing the possible classes. Multiclass classification: the target variable y can take more than 2 values (but a finite number of them).

Linear classifiers: what is the class of the unknown record (the bullet point in the figure)? The separating line will tell us! Linearly separable sets: there exists a hyperplane (here, a line) that correctly classifies all points.

Binary classification: two classes to choose from (yes or no, negative or positive). Input: training data, n-dimensional vectors x (or points). Process: find a linear function f(x) = w_1 x_1 + ... + w_n x_n + w_0 = w^T x (with x augmented by a constant component x_0 = 1). The sign (+ or −) of f(x) shows the class of a newly given point, so the predicted class y′ of a new point x′ is: y′ = +1 if f(x′) > 0, y′ = −1 if f(x′) < 0. For f(x) = 0 we get the boundary hyperplane.
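A minimal sketch of this decision rule in Python (NumPy assumed; the weights and test point below are made-up values for illustration only):

import numpy as np

w = np.array([0.4, -0.7])   # illustrative weights w_1, w_2
w0 = 0.1                    # illustrative intercept w_0

def predict(x):
    """Return +1 or -1 according to the sign of f(x) = w^T x + w_0."""
    f = w @ x + w0
    return 1 if f > 0 else -1

print(predict(np.array([2.0, 0.5])))   # class depends on which side of the line x falls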

Binary classification: the decision boundary w^T x = 0 separates the region where w^T x > 0 (class 1) from the region where w^T x < 0 (class -1).

Learning the classification boundary: there are many algorithms, which differ in how they measure whether the model fits the training data well (typically via a loss function) and in whether they use regularization. We are going to see the following: the simple Perceptron, Logistic Regression, the Support Vector Classifier (SVC) (not only the linear case), and Naive Bayes.

The simple Perceptron algorithm computes a vector w that linearly separates the input points (with no guarantee that the optimal solution will be found). It always converges if the points are linearly separable; many different lines are possible depending on the starting point. A small learning rate lowers the probability of misclassifying correct points. Rosenblatt algorithm: 1. Randomly initialize w. 2. Set the learning rate η to a value between 0 and 1. 3. Repeat until all points are correctly classified: 4. For each point m in the set of misclassified points M: 5. w ← w + η y_m x_m.
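A minimal NumPy sketch of the Rosenblatt update above (the toy linearly separable data are made up for illustration; the bias is handled by appending a constant 1 to each point):

import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
X = np.hstack([X, np.ones((40, 1))])           # append x_0 = 1 so w_0 is learned too
y = np.array([-1] * 20 + [1] * 20)

w = rng.normal(size=3)                         # 1. randomly initialize w
eta = 0.1                                      # 2. learning rate between 0 and 1
while True:                                    # 3. repeat until all points are correct
    misclassified = [m for m in range(len(y)) if y[m] * (w @ X[m]) <= 0]
    if not misclassified:
        break
    for m in misclassified:                    # 4. for each misclassified point
        w = w + eta * y[m] * X[m]              # 5. w <- w + eta * y_m * x_m

print("learned w:", w)
print("remaining training errors:", sum(y[i] * (X[i] @ w) <= 0 for i in range(len(y))))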

Perceptron example

Support Vector Machine: among the many possible separating lines between class 1 and class -1, which one is the best? Intuitively, the one far away from the data points, i.e., the one for which the margin is maximized. Margin: the distance of the closest data point from the decision boundary. (Slides inspired by Jinwei Gu, An Introduction of Support Vector Machine.)

Support Vector Machine: the data points (vectors) that define the margin are called support vectors. They form a small subset of the dataset and are used to define the decision boundary; the rest of the data points do not participate in the decision.

Support Vector Machine: the further a point is from the decision boundary, the more confident we are about its class assignment.

Support Vector Machine: so, picking the green line (the SVM classifier's choice) gives a classification safety margin: a slight error in the measurement will not affect the result. Larger-margin classifier → better generalization ability and noise tolerance.

Support Vector Machine: for example, if we had chosen the blue line, the highlighted point would have been assigned to class 1, whereas with the green line it is correctly classified in class -1.

Support Vector Machine, learning phase: we search for w (the weight vector) and b (the intercept) that define the decision boundary. For the decision hyperplane it holds that w^T x + b = 0, and n = w / ||w|| is the unit-length normal vector of the hyperplane.

Support Vector Machine, learning phase: for the support vectors x_+ and x_- it holds that w^T x_+ + b = +1 and w^T x_- + b = -1. For the margin width it holds that M = 2d = 2 (w^T x_+ + b) / ||w|| = 2 / ||w||, where d is the distance of a support vector from the decision boundary and n = w / ||w|| is the unit-length normal vector of the hyperplane.

Support Vector Machine, learning phase. Optimization problem: maximize the margin width 2 / ||w|| such that w^T x_i + b ≥ +1 for y_i = +1 and w^T x_i + b ≤ -1 for y_i = -1.

Support Vector Machine, learning phase. Optimization problem (alternatively): minimize (1/2) ||w||^2 such that y_i (w^T x_i + b) ≥ 1.

Support Vector Machine, learning phase: this is quadratic programming with linear constraints. Introducing Lagrange multipliers α_i, the problem "minimize (1/2) ||w||^2 such that y_i (w^T x_i + b) ≥ 1" becomes: minimize the Lagrangian L_p(w, b, α) = (1/2) ||w||^2 − Σ_{i=1..n} α_i [ y_i (w^T x_i + b) − 1 ], s.t. α_i ≥ 0.

Support Vector Machine, learning phase: setting the derivatives of L_p to zero gives ∂L_p/∂w = 0 ⇒ w = Σ_{i=1..n} α_i y_i x_i and ∂L_p/∂b = 0 ⇒ Σ_{i=1..n} α_i y_i = 0.

Support Vector Machine, learning phase. Lagrangian dual problem: maximize Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j x_i^T x_j, s.t. α_i ≥ 0 (a property of the α_i introduced as Lagrange multipliers) and Σ_{i=1..n} α_i y_i = 0 (the result of differentiating the original Lagrangian w.r.t. b).

Support Vector Machine, learning phase: from the KKT (Karush–Kuhn–Tucker) conditions we know that α_i [ y_i (w^T x_i + b) − 1 ] = 0. Thus, only support vectors have α_i ≠ 0. The solution has the form w = Σ_{i=1..n} α_i y_i x_i = Σ_{i∈SV} α_i y_i x_i and b = y_i − w^T x_i, where x_i is any support vector.

Support Vector Machine, testing phase: the linear discriminant function is g(x) = w^T x + b = Σ_{i∈SV} α_i y_i x_i^T x + b. If g(x) ≥ 0 then x is in class 1, otherwise it belongs to class -1. Note that g relies on a dot product between the test point x and the support vectors x_i; also keep in mind that solving the optimization problem involved computing the dot products x_i^T x_j between all pairs of training points.
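As a sketch (assuming scikit-learn and NumPy are available, with a toy dataset made up here), g(x) can be recomputed from the fitted support vectors alone and compared with the library's own decision function; scikit-learn's dual_coef_ attribute stores the products α_i y_i:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = np.array([0.5, 0.3])
# g(x) = sum_{i in SV} alpha_i y_i x_i^T x + b, with dual_coef_ = alpha_i * y_i
g = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print(g, clf.decision_function([x_new])[0])    # the two values agree
print("predicted class:", 1 if g >= 0 else -1)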

Soft margin classification: what if a circle point exists among the dots? Then the data are not linearly separable!

Soft margin classification: allow for small classification errors by introducing slack variables ξ_i. A non-zero value of ξ_i allows x_i to not meet the margin requirement, at a cost proportional to the value of ξ_i.

Soft margin classification, optimization problem: min_{w, ξ_i} ||w||^2 + C Σ_{i=1..N} ξ_i subject to y_i (w^T x_i + b) ≥ 1 − ξ_i for i = 1...N. C is the penalty that we pay for each error.
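A small illustration of the role of C (a sketch assuming scikit-learn, on overlapping toy data made up here): a small C tolerates more margin violations and keeps more support vectors, while a large C penalizes errors more heavily:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# overlapping classes, so some slack xi_i > 0 is unavoidable
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")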

What about non-linearly separable data? Here, a straight line cannot separate the data well; a circle describes the separation better. What if we introduce a third dimension representing that circle, x_3 = x_1^2 + x_2^2, i.e., apply a non-linear transformation Φ(x)? Now the data in the transformed space defined by (x_1, x_2, x_3) are linearly separable! So, let's use Φ(x) instead of x.
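A sketch of this idea (toy circular data and the helper phi() are made up here; scikit-learn assumed): adding the feature x_3 = x_1^2 + x_2^2 makes the inner disc and the outer ring linearly separable, so a linear classifier succeeds in the transformed space where it fails in the original one:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# inner disc (class -1) and outer ring (class +1)
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([-1] * 100 + [1] * 100)

def phi(X):
    # non-linear transformation Phi(x) = (x1, x2, x1^2 + x2^2)
    return np.column_stack([X, (X ** 2).sum(axis=1)])

linear_orig = SVC(kernel="linear").fit(X, y)
linear_phi = SVC(kernel="linear").fit(phi(X), y)
print("accuracy in original space:   ", linear_orig.score(X, y))
print("accuracy in transformed space:", linear_phi.score(phi(X), y))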

Support Vector Classifier, learning phase (for the original space): minimize L_p(w, b, α) = (1/2) ||w||^2 − Σ_{i=1..n} α_i [ y_i (w^T x_i + b) − 1 ], s.t. α_i ≥ 0. Lagrangian dual problem: maximize Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j x_i^T x_j, s.t. α_i ≥ 0 and Σ_{i=1..n} α_i y_i = 0. The data points participate only as dot products x_i^T x_j, so let's replace them with the transformed data points Φ(x_i)!

Support Vector Machine kernels, learning phase: maximize Σ_{i=1..n} α_i − (1/2) Σ_{i=1..n} Σ_{j=1..n} α_i α_j y_i y_j Φ(x_i)^T Φ(x_j), s.t. α_i ≥ 0 and Σ_{i=1..n} α_i y_i = 0. K(x_i, x_j) = Φ(x_i)^T Φ(x_j) is called the kernel function. Kernel trick: instead of having to transform each data point to the new space, we can directly replace the dot product with the kernel.

Support Vector Machine kernels, testing phase: g(x) = Σ_{i∈SV} α_i y_i K(x_i, x) + b. If g(x) ≥ 0 then x is in class 1, otherwise it belongs to class -1.

Popular kernels: the polynomial kernel K(x, y) = (x^T y + 1)^d (d = 1: linear kernel; d = 2: quadratic kernel) and the (Gaussian) radial basis function (RBF) kernel K(x, y) = exp(−γ ||x − y||^2).
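These kernels are easy to write down directly; a minimal NumPy sketch (the values of d, gamma and the test vectors are arbitrary choices for illustration):

import numpy as np

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (x^T y + 1)^d ; d = 1 gives the linear kernel
    return (x @ y + 1) ** d

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y_vec = np.array([0.5, -1.0])
print(polynomial_kernel(x, y_vec, d=1))   # linear kernel value
print(polynomial_kernel(x, y_vec, d=2))   # quadratic kernel value
print(rbf_kernel(x, y_vec))               # RBF kernel value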

Logistic Regression: despite its name, it is a classification method. It applies a sigmoid function σ(f(x)) = 1 / (1 + e^(−f(x))) over a linear classifier f(x) = w^T x. So, for a point x we have two cases: if σ(f(x)) ≥ 0.5 then y = 1; if σ(f(x)) < 0.5 then y = -1. It can also be interpreted as the confidence (a probability) with which an element is in a binary (1, -1) class y: p(y | x) = 1 / (1 + e^(−y w^T x)). A sigmoid (sigma-like) function is fit to the data.

Logistic Regression computes the confidence (probability) with which an element is in a binary (1, -1) class y: p(y | x) = 1 / (1 + e^(−y w^T x)). Predict y = 1 when p(y = 1 | x) ≥ 0.5 and y = -1 otherwise. Obviously, p(y = 1 | x) = 1 − p(y = -1 | x).

Logistic Regression: logistic regression is linear* because the decision boundary it produces is linear as a function of x. To see this, consider the probabilities with which a point x belongs to class y = 1 and to class y = -1: p(y = 1 | x) = 1 / (1 + e^(−w^T x)), so 1 − p(y = 1 | x) = e^(−w^T x) / (1 + e^(−w^T x)) and log[ p(y = 1 | x) / (1 − p(y = 1 | x)) ] = w^T x. On the decision boundary the probabilities are equal to 0.5, so log(0.5 / 0.5) = 0 = w^T x: the decision boundary is a hyperplane. *More concretely, it is a generalized linear model.
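A minimal sketch of this prediction rule in Python (NumPy assumed; the weights and test point are made-up illustrative values, with the bias handled by appending x_0 = 1):

import numpy as np

w = np.array([1.5, -2.0, 0.3])   # illustrative weights; last entry acts as w_0 via x_0 = 1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x):
    """p(y = 1 | x) = sigmoid(w^T x), with x augmented by a constant 1."""
    x_aug = np.append(x, 1.0)
    return sigmoid(w @ x_aug)

x_new = np.array([0.8, 0.2])
p = predict_proba(x_new)
print(p, 1 if p >= 0.5 else -1)   # probability of class 1 and the predicted label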

Logistic Regression: why do we choose the sigmoid function to fit the data? Compare a least-squares fit of w_1 x + w_0 to y with a fit of σ(w_1 x + w_0) to y: the fit of w_1 x + w_0 is dominated by the more distant points and causes misclassification, whereas LR regresses the sigmoid to the class data.

Logistic Regression: similarly in the 2D space (figures comparing the LR decision boundary with a linear least-squares fit).

Logistic Regression: to compute the parameters w we have to solve an optimization problem. For this we use the data likelihood function: l(w) = Π_{i=1..N} 1 / (1 + e^(−y_i w^T x_i)). Thus, we search for the w that maximizes the likelihood of the data: log Π_{i=1..N} 1 / (1 + e^(−y_i w^T x_i)) = Σ_{i=1..N} log[ 1 / (1 + e^(−y_i w^T x_i)) ] = − Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)). For convenience, the loss function is the negative logarithm of the data likelihood: L(w) = Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)).

Logistic Regression: finally, w is the argument that minimizes L(w): w = argmin_w Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)). As in linear regression, in logistic regression we can add a regularization term r(w) (L1 or L2): w = argmin_w ( Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)) + C r(w) ), where C is the tuning parameter and r(w) is ||w||_2^2 or ||w||_1. Note that via regularization we also allow for small misclassifications (to be compared with the soft-margin Support Vector Classifier).

Logistic Regression: w = argmin_w ( Σ_{i=1..N} log(1 + e^(−y_i w^T x_i)) + C r(w) ). For correctly classified points −y_i w^T x_i is negative, and thus log(1 + e^(−y_i w^T x_i)) is near zero. For incorrectly classified points −y_i w^T x_i is positive, and thus log(1 + e^(−y_i w^T x_i)) can be large. Hence the optimization penalizes parameters which lead to such misclassifications.
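A NumPy sketch of this regularized loss (the helper logistic_loss and the toy data are made up for illustration): a weight vector aligned with the labeling gives a small loss, the opposite direction a large one.

import numpy as np

def logistic_loss(w, X, y, C=0.1, penalty="l2"):
    """L(w) = sum_i log(1 + exp(-y_i w^T x_i)) + C * r(w), with y_i in {-1, +1}."""
    margins = y * (X @ w)
    data_term = np.sum(np.log1p(np.exp(-margins)))
    r = np.sum(w ** 2) if penalty == "l2" else np.sum(np.abs(w))
    return data_term + C * r

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)          # a linearly separable toy labeling
print(logistic_loss(np.array([1.0, 1.0]), X, y))    # aligned with the labeling: small loss
print(logistic_loss(np.array([-1.0, -1.0]), X, y))  # opposite direction: large loss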

Naive Bayes Classifier: assumes that the features x are conditionally independent given the output y. Here we will see the method for discrete values. It uses the Bayes rule.

Probability Basics: prior, conditional and joint probability for random variables. Prior probability: P(x_1). Conditional probability: P(x_1 | x_2), P(x_2 | x_1). Joint probability: for x = (x_1, x_2), P(x) = P(x_1, x_2). Relationship: P(x_1, x_2) = P(x_2 | x_1) P(x_1) = P(x_1 | x_2) P(x_2). Independence: P(x_2 | x_1) = P(x_2), P(x_1 | x_2) = P(x_1), P(x_1, x_2) = P(x_1) P(x_2).

Naive Bayes Classifier: assumes that the features x are conditionally independent given the output y. Here we will see the method for discrete values. It uses the Bayes rule: P(y | x) = P(x | y) P(y) / P(x), i.e., Posterior = (Likelihood × Prior) / Evidence. P(y | x): the posterior probability of the class (target) given an attribute vector (predictor). P(y): the prior class probability. P(x | y): the likelihood, i.e., the probability of generating the predictor x given target y. P(x): the evidence, i.e., the prior probability of the predictor x.

Naive Bayes Classifier, Maximum A Posteriori (MAP) classification rule: to classify an input x, find the posterior probability P(y_i | x) for each output class y_i and assign to x the label y with the highest posterior among all P(y_i | x). Note that since the factor P(x) in the Bayes rule P(y_i | x) = P(x | y_i) P(y_i) / P(x) is common to all P(y_i | x), we can omit it when applying MAP. Using the conditional-independence assumption, the classification rule can be written as: y = argmax_y P(y) Π_{i=1..n} P(x_i | y). The Naive Bayes classifier has a fast training phase, even with many features, because it collects statistics from each feature individually.

Naive Bayes Classifier as a linear classifier: the Naive Bayes classifier can be seen as a linear classifier if we consider the log version of the classification rule: f(x) = log[ P(y=1 | x) / P(y=0 | x) ] = log P(y=1 | x) − log P(y=0 | x) = (log θ_1 − log θ_0)^T x + (log P(y=1) − log P(y=0)). f(x) is a linear function of x, and f(x) = 0 is the decision boundary.

Naive Bayes Classifier An example: Play Tennis depending on weather conditions Is it probable that we will play tennis on a (Sunny, Hot, Normal, Strong) day?

Naive Bayes Classifier, Step 1: build a frequency table for each attribute.

Outlook      Play=Yes  Play=No
Sunny        2         3
Overcast     4         0
Rain         3         2

Temperature  Play=Yes  Play=No
Hot          2         2
Mild         4         2
Cool         3         1

Humidity     Play=Yes  Play=No
High         3         4
Normal       6         1

Wind         Play=Yes  Play=No
Strong       3         3
Weak         6         2

Naive Bayes Classifier, Step 2: build a likelihood table for each attribute.

Outlook      Play=Yes  Play=No
Sunny        2/9       3/5
Overcast     4/9       0/5
Rain         3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity     Play=Yes  Play=No
High         3/9       4/5
Normal       6/9       1/5

Wind         Play=Yes  Play=No
Strong       3/9       3/5
Weak         6/9       2/5

Class priors: P(Play=Yes) = 9/14, P(Play=No) = 5/14.

Naive Bayes Classifier, Step 3: compute the probabilities (using the likelihood tables above). For example, P(Yes | Strong): P(x | y) = P(Strong | Yes) = 3/9 = 0.33, P(y) = P(Yes) = 9/14 = 0.64, P(x) = P(Strong) = 6/14 = 0.43, so P(Yes | Strong) = 0.33 × 0.64 / 0.43 = 0.49.

Naive Bayes Classifier. Q: Is it probable that we will play tennis on a day x = (Outlook: Sunny, Temp: Hot, Humidity: Normal, Wind: Strong)? A: Let's apply MAP and compare P(x | Yes) P(Yes) with P(x | No) P(No). Likelihood_Yes = P(x | Yes) = P(Outlook=Sunny | Yes) × P(Temp=Hot | Yes) × P(Humidity=Normal | Yes) × P(Wind=Strong | Yes) = 2/9 × 2/9 × 6/9 × 3/9 ≈ 0.011. Likelihood_No = P(x | No) = P(Outlook=Sunny | No) × P(Temp=Hot | No) × P(Humidity=Normal | No) × P(Wind=Strong | No) = 3/5 × 2/5 × 1/5 × 3/5 ≈ 0.029. P(Yes) = 9/14 = 0.64 and P(No) = 5/14 = 0.36, so P(x | Yes) P(Yes) ≈ 0.011 × 0.64 ≈ 0.007 and P(x | No) P(No) ≈ 0.029 × 0.36 ≈ 0.010. Since P(x | No) P(No) > P(x | Yes) P(Yes), x is classified as a No.
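The same computation as a short Python sketch (the likelihoods are read off the tables above):

# Likelihoods of (Sunny, Hot, Normal, Strong) read from the likelihood tables
p_x_given_yes = (2/9) * (2/9) * (6/9) * (3/9)
p_x_given_no = (3/5) * (2/5) * (1/5) * (3/5)
p_yes, p_no = 9/14, 5/14

score_yes = p_x_given_yes * p_yes
score_no = p_x_given_no * p_no
print(round(p_x_given_yes, 3), round(p_x_given_no, 3))   # ~0.011 and ~0.029
print(round(score_yes, 3), round(score_no, 3))           # ~0.007 and ~0.010
print("prediction:", "Yes" if score_yes > score_no else "No")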

Naive Bayes Classifier, the zero-frequency problem: when an attribute value has zero frequency for some class y_i, the likelihood P(x | y_i) will be equal to zero! To avoid this, we can simply augment all the counts by one. E.g., for Outlook: the counts Sunny 2/3, Overcast 4/0, Rain 3/2 (Yes/No) become Sunny 3/4, Overcast 5/1, Rain 4/3.
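A sketch of this add-one (Laplace) correction for the Outlook counts above (the helper function is made up for illustration):

# Outlook counts for Play=Yes and Play=No, from the frequency table
counts_yes = {"Sunny": 2, "Overcast": 4, "Rain": 3}
counts_no = {"Sunny": 3, "Overcast": 0, "Rain": 2}

def smoothed_likelihoods(counts):
    # add 1 to every count so that no likelihood is exactly zero
    smoothed = {value: c + 1 for value, c in counts.items()}
    total = sum(smoothed.values())
    return {value: c / total for value, c in smoothed.items()}

print(smoothed_likelihoods(counts_no))   # P(Overcast | No) is now 1/8 instead of 0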

Binary vs multiclass classifiers (figures: a binary problem and a multiclass problem).

Multiclass classification (linear): each sample can have one class out of many possibilities. The linear algorithms that we saw (LR, SVC, Naive Bayes) can be used essentially as they are for multiclass problems too. How? One-vs-All. One-vs-All: a binary problem with class 0 being some class and class 1 being the rest of the classes. If the set of possible classes is Y, then for each y_i ∈ Y learn a line separating class y_i from the rest of the classes using a binary classifier, and then use all the learned lines to separate the classes.

One vs All classification (figures: one separating line learned for each class, then all the lines combined).

One vs All classification: e.g., for LR we compute p(y_i | x) for each class y_i; the class with the highest probability is the predicted class of a new point x.
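A sketch of One-vs-All with logistic regression (scikit-learn assumed; the three-class toy data are made up here): one binary classifier is trained per class, and a new point gets the class whose classifier is most confident:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
centers = np.array([[0, 4], [-3, -2], [3, -2]])
X = np.vstack([rng.normal(c, 0.8, (30, 2)) for c in centers])
y = np.repeat([0, 1, 2], 30)

classifiers = {}
for cls in np.unique(y):
    # binary problem: class `cls` vs. the rest
    classifiers[cls] = LogisticRegression().fit(X, (y == cls).astype(int))

x_new = np.array([[0.5, 3.0]])
scores = {cls: clf.predict_proba(x_new)[0, 1] for cls, clf in classifiers.items()}
print(scores)
print("predicted class:", max(scores, key=scores.get))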

One vs All classification: do you see any problem here? What about the points in the triangle between the three lines? Choose the line closest to the point → the class with the highest probability.

Sources: Cambridge UP, Support vector machines and machine learning on documents; Xiaojin Zhu, Naive Bayes Classifier; Ke Chen, Naive Bayes Classifier; Ng, Multiclass classification; Jinwei Gu, An Introduction of Support Vector Machine.