Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016


Topics
- Probabilistic approach and Bayes decision theory
- Generative models: Gaussian Bayes classifier, Naïve Bayes
- Discriminative models: logistic regression

Classification problem: probabilistic view
Each feature is treated as a random variable, and the class label is also a random variable. We observe the feature values of a random sample and want to infer its class label.
- Evidence: feature vector $x$
- Query: class label

Definitions
- Posterior probability: $p(C_k \mid x)$
- Likelihood or class-conditional probability: $p(x \mid C_k)$
- Prior probability: $p(C_k)$
- $p(x)$: pdf of the feature vector $x$, with $p(x) = \sum_{k=1}^{K} p(x \mid C_k)\,p(C_k)$
- $p(x \mid C_k)$: pdf of the feature vector $x$ for samples of class $C_k$
- $p(C_k)$: probability that the label is $C_k$

Bayes decision rule ($K = 2$)
If $P(C_1 \mid x) > P(C_2 \mid x)$ decide $C_1$, otherwise decide $C_2$.

$$P(\text{error} \mid x) = \begin{cases} P(C_2 \mid x) & \text{if we decide } C_1 \\ P(C_1 \mid x) & \text{if we decide } C_2 \end{cases}$$

If we use the Bayes decision rule: $P(\text{error} \mid x) = \min\{P(C_1 \mid x), P(C_2 \mid x)\}$. With this rule, $P(\text{error} \mid x)$ is as small as possible for every $x$, and thus the rule minimizes the probability of error.

Optimal classifier
The optimal decision is the one that minimizes the expected number of mistakes. We show that the Bayes classifier is an optimal classifier.

Bayes decision rule: minimizing the misclassification rate ($K = 2$)
Decision regions: $R_k = \{x \mid \alpha(x) = k\}$; all points in $R_k$ are assigned to class $C_k$.

$$p(\text{error}) = \mathbb{E}_{x,y}\left[I(\alpha(x) \neq y)\right] = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx = \int_{R_1} p(C_2 \mid x)\,p(x)\,dx + \int_{R_2} p(C_1 \mid x)\,p(x)\,dx$$

Choose the class with the highest $p(C_k \mid x)$ as $\alpha(x)$.

Bayes minimum error
Bayes minimum error classifier: $\min_{\alpha(\cdot)} \mathbb{E}_{x,y}\left[I(\alpha(x) \neq y)\right]$ (zero-one loss).
If we knew the probabilities in advance, this optimization problem would be solved easily: $\alpha(x) = \arg\max_y p(y \mid x)$.
In practice, we estimate $p(y \mid x)$ from a set of training samples $\mathcal{D}$.

Bayes theorem
$$p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{p(x)} \qquad \left(\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}\right)$$
Here the posterior is $p(C_k \mid x)$, the likelihood (class-conditional probability) is $p(x \mid C_k)$, the prior is $p(C_k)$, and $p(x) = \sum_{k=1}^{K} p(x \mid C_k)\,p(C_k)$ is the pdf of the feature vector $x$, as defined above.

Bayes decision rule: example
Bayes decision: choose the class with the highest $p(C_k \mid x)$.
[Figure: class-conditional densities $p(x \mid C_1)$, $p(x \mid C_2)$ and posteriors $p(C_1 \mid x)$, $p(C_2 \mid x)$ with the resulting decision regions, for priors $p(C_1) = 2/3$, $p(C_2) = 1/3$.]
$$p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{p(x)}, \qquad p(x) = p(C_1)\,p(x \mid C_1) + p(C_2)\,p(x \mid C_2)$$
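
A minimal sketch of this rule in code: the two univariate Gaussian class-conditional densities below are made-up placeholders (only the priors $2/3$ and $1/3$ come from the slide), and the posterior is computed via Bayes' theorem.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|C1), p(x|C2); parameters are illustrative only
likelihoods = [norm(loc=2.0, scale=1.0), norm(loc=5.0, scale=1.5)]
priors = np.array([2/3, 1/3])                                    # p(C1), p(C2) from the slide

def posterior(x):
    """Return [p(C1|x), p(C2|x)] via Bayes' theorem."""
    joint = np.array([d.pdf(x) for d in likelihoods]) * priors   # p(x|Ck) p(Ck)
    return joint / joint.sum()                                   # divide by p(x)

post = posterior(3.2)
print(post, "-> decide C%d" % (post.argmax() + 1))
```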

Bayesian decision rule: equivalent forms
If $P(C_1 \mid x) > P(C_2 \mid x)$ decide $C_1$, otherwise decide $C_2$.
Equivalently: if $\dfrac{p(x \mid C_1)\,P(C_1)}{p(x)} > \dfrac{p(x \mid C_2)\,P(C_2)}{p(x)}$ decide $C_1$, otherwise decide $C_2$.
Equivalently: if $p(x \mid C_1)\,P(C_1) > p(x \mid C_2)\,P(C_2)$ decide $C_1$, otherwise decide $C_2$.

Bayes decision rule: example (continued)
Bayes decision: choose the class with the highest $p(C_k \mid x)$.
[Figure: class-conditional densities $p(x \mid C_1)$, $p(x \mid C_2)$ and the corresponding posteriors $p(C_1 \mid x)$, $p(C_2 \mid x)$ for priors $p(C_1) = 2/3$, $p(C_2) = 1/3$, with the resulting decision regions.]

Bayes classifier
Simple Bayes classifier: estimate the posterior probability of each class. What should the decision criterion be? Choose the class with the highest $p(C_k \mid x)$. The optimal decision is the one that minimizes the expected number of mistakes.

Diabetes example
[Figure: distribution of white blood cell count. This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.]

Diabetes example
The doctor has a prior $p(y = 1) = 0.2$. Prior: in the absence of any observation, what do I know about the probability of the classes?
A patient comes in with white blood cell count $x$. Does the patient have diabetes, i.e., what is $p(y = 1 \mid x)$? Given a new observation, we still need to compute the posterior.

Diabetes example
[Figure: class-conditional densities $p(x \mid y = 0)$ and $p(x \mid y = 1)$ over white blood cell count. Adapted from Sanja Fidler's slides, University of Toronto, CSC411.]

Estimating probability densities from data
Assume Gaussian distributions for $p(x \mid C_1)$ and $p(x \mid C_2)$. Recall that for samples $\{x^{(1)}, \ldots, x^{(N)}\}$, under a Gaussian assumption the MLE estimates are
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^{N}\left(x^{(n)} - \hat{\mu}\right)^2$$

Diabetes example
Fit a Gaussian to each class, e.g. $p(x \mid y = 1) = \mathcal{N}(\mu_1, \sigma_1^2)$ with
$$\mu_1 = \frac{1}{N_1}\sum_{n:\,y^{(n)} = 1} x^{(n)}, \qquad \sigma_1^2 = \frac{1}{N_1}\sum_{n:\,y^{(n)} = 1}\left(x^{(n)} - \mu_1\right)^2, \qquad N_1 = \sum_{n} \mathbb{1}\left[y^{(n)} = 1\right]$$
and similarly for $p(x \mid y = 0)$. This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.
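
A sketch of these MLE formulas in code, using a small synthetic 1-D dataset in place of the real white-blood-cell counts (the data values below are invented for illustration):

```python
import numpy as np

# Synthetic 1-D feature values and binary labels (illustrative only)
x = np.array([3.1, 2.8, 4.0, 6.5, 7.2, 6.9, 3.5, 7.8])
y = np.array([0,   0,   0,   1,   1,   1,   0,   1  ])

def fit_gaussian(xs):
    """MLE of a univariate Gaussian: sample mean and (biased) sample variance."""
    mu = xs.mean()
    var = ((xs - mu) ** 2).mean()
    return mu, var

mu0, var0 = fit_gaussian(x[y == 0])   # p(x|y=0) = N(mu0, var0)
mu1, var1 = fit_gaussian(x[y == 1])   # p(x|y=1) = N(mu1, var1)
prior1 = y.mean()                     # MLE of p(y=1); a doctor's prior could be used instead

def gauss_pdf(v, mu, var):
    return np.exp(-(v - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def posterior_y1(v):
    p1 = gauss_pdf(v, mu1, var1) * prior1
    p0 = gauss_pdf(v, mu0, var0) * (1 - prior1)
    return p1 / (p0 + p1)

print(posterior_y1(5.0))              # p(y=1 | x=5.0)
```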

Diabetes example
Add a second feature: the plasma glucose value. This example has been adapted from Sanja Fidler's slides, University of Toronto, CSC411.

Generative approach for this example
Multivariate Gaussian distributions for $p(x \mid C_k)$:
$$p(x \mid y = k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)\right\}, \qquad k = 1, 2$$
Prior distribution $p(C_k)$: $p(y = 1) = \pi$, $p(y = 0) = 1 - \pi$.

MLE for a multivariate Gaussian
For samples $\{x^{(1)}, \ldots, x^{(N)}\}$, under a multivariate Gaussian assumption the MLE estimates are:
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}\left(x^{(n)} - \hat{\mu}\right)\left(x^{(n)} - \hat{\mu}\right)^T$$

Generative approach: example
Maximum likelihood estimation with $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$:
$$\hat{\pi} = \frac{N_1}{N}, \qquad \hat{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} y^{(n)} x^{(n)}, \qquad \hat{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N}\left(1 - y^{(n)}\right) x^{(n)}$$
$$\hat{\Sigma}_1 = \frac{1}{N_1}\sum_{n=1}^{N} y^{(n)}\left(x^{(n)} - \hat{\mu}_1\right)\left(x^{(n)} - \hat{\mu}_1\right)^T, \qquad \hat{\Sigma}_2 = \frac{1}{N_2}\sum_{n=1}^{N}\left(1 - y^{(n)}\right)\left(x^{(n)} - \hat{\mu}_2\right)\left(x^{(n)} - \hat{\mu}_2\right)^T$$
$$N_1 = \sum_{n=1}^{N} y^{(n)}, \qquad N_2 = N - N_1$$
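
These estimators translate directly into a few lines of NumPy. The sketch below assumes a hypothetical feature matrix `X` of shape (N, d) and 0/1 labels `y`; it simply follows the MLE formulas above.

```python
import numpy as np

def fit_gaussian_generative(X, y):
    """MLE for a two-class generative model with per-class Gaussians.
    X: (N, d) feature matrix, y: (N,) array of 0/1 labels (class 1 <-> y=1, class 2 <-> y=0)."""
    pi = y.mean()                                   # estimate of p(y = 1)
    mu1 = X[y == 1].mean(axis=0)
    mu2 = X[y == 0].mean(axis=0)
    d1 = X[y == 1] - mu1
    d2 = X[y == 0] - mu2
    Sigma1 = d1.T @ d1 / d1.shape[0]                # per-class covariance (biased MLE)
    Sigma2 = d2.T @ d2 / d2.shape[0]
    return pi, mu1, mu2, Sigma1, Sigma2
```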

Decision boundary for the Gaussian Bayes classifier
The boundary is where $p(C_1 \mid x) = p(C_2 \mid x)$, i.e. $\ln p(C_1 \mid x) = \ln p(C_2 \mid x)$. Since $p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{p(x)}$:
$$\ln p(x \mid C_1) + \ln p(C_1) - \ln p(x) = \ln p(x \mid C_2) + \ln p(C_2) - \ln p(x)$$
$$\Rightarrow \ln p(x \mid C_1) + \ln p(C_1) = \ln p(x \mid C_2) + \ln p(C_2)$$
where
$$\ln p(x \mid C_k) = -\frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k)$$

Decision boundary
[Figure: class-conditional densities $p(x \mid C_1)$ and $p(x \mid C_2)$, the posterior $p(C_1 \mid x)$, and the decision boundary where $p(C_1 \mid x) = p(C_2 \mid x)$.]

Shared covariance matrix
When the classes share a single covariance matrix $\Sigma = \Sigma_1 = \Sigma_2$:
$$p(x \mid C_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)\right\}, \qquad k = 1, 2$$
$$p(C_1) = \pi, \qquad p(C_2) = 1 - \pi$$

Likelihood
$$\prod_{n=1}^{N} p(x^{(n)}, y^{(n)} \mid \pi, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} p(x^{(n)} \mid y^{(n)}, \mu_1, \mu_2, \Sigma)\,p(y^{(n)} \mid \pi)$$

Shared covariance matrix
Maximum likelihood estimation with $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$:
$$\hat{\pi} = \frac{N_1}{N}, \qquad \hat{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} y^{(n)} x^{(n)}, \qquad \hat{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N}\left(1 - y^{(n)}\right) x^{(n)}$$
$$\hat{\Sigma} = \frac{1}{N}\left[\sum_{n \in C_1}\left(x^{(n)} - \hat{\mu}_1\right)\left(x^{(n)} - \hat{\mu}_1\right)^T + \sum_{n \in C_2}\left(x^{(n)} - \hat{\mu}_2\right)\left(x^{(n)} - \hat{\mu}_2\right)^T\right]$$

Decision boundary with a shared covariance matrix
$$\ln p(x \mid C_1) + \ln p(C_1) = \ln p(x \mid C_2) + \ln p(C_2)$$
$$\ln p(x \mid C_k) = -\frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma| - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1}(x - \mu_k)$$
The quadratic term $x^T \Sigma^{-1} x$ is identical on both sides and cancels, so the resulting decision boundary is linear in $x$.

Bayes decision rule: multi-class misclassification rate
For the multi-class problem, the probability of error of the Bayes decision rule is simpler to compute via the probability of a correct decision, $P(\text{error}) = 1 - P(\text{correct})$, where
$$P(\text{correct}) = \sum_{i=1}^{K} \int_{R_i} p(x, C_i)\,dx = \sum_{i=1}^{K} \int_{R_i} p(C_i \mid x)\,p(x)\,dx$$
$R_i$: the subset of feature space assigned to class $C_i$ by the classifier.

Bayes minimum error
Bayes minimum error classifier: $\min_{\alpha(\cdot)} \mathbb{E}_{x,y}\left[I(\alpha(x) \neq y)\right]$ (zero-one loss), solved by $\alpha(x) = \arg\max_y p(y \mid x)$.

Minimizing the Bayes risk (expected loss)
$$\mathbb{E}_{x,y}\left[L(\alpha(x), y)\right] = \int \sum_{j=1}^{K} L(\alpha(x), C_j)\,p(x, C_j)\,dx = \int p(x) \sum_{j=1}^{K} L(\alpha(x), C_j)\,p(C_j \mid x)\,dx$$
For each $x$ we minimize the inner sum, called the conditional risk. Bayes minimum loss (risk) decision rule:
$$\alpha(x) = \arg\min_{i=1,\ldots,K} \sum_{j=1}^{K} L_{ij}\,p(C_j \mid x)$$
$L_{ij}$: the loss of assigning a sample to $C_i$ when the correct class is $C_j$.

Minimizing expected loss: special case (loss = misclassification rate)
If action $\alpha(x) = i$ is taken and the true category is $C_j$, the decision is correct if $i = j$ and incorrect otherwise. Zero-one loss function:
$$L_{ij} = 1 - \delta_{ij} = \begin{cases} 0 & i = j \\ 1 & \text{otherwise} \end{cases}$$
$$\alpha(x) = \arg\min_{i=1,\ldots,K} \sum_{j=1}^{K} L_{ij}\,p(C_j \mid x) = \arg\min_{i=1,\ldots,K}\left(0 \cdot p(C_i \mid x) + \sum_{j \neq i} p(C_j \mid x)\right) = \arg\min_{i=1,\ldots,K}\left(1 - p(C_i \mid x)\right) = \arg\max_{i=1,\ldots,K} p(C_i \mid x)$$
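
A minimal sketch of the minimum-risk rule: the posterior vector and the asymmetric loss matrix below are hypothetical, and with the zero-one loss the rule reduces to picking the largest posterior, exactly as derived above.

```python
import numpy as np

posteriors = np.array([0.45, 0.55])           # hypothetical p(C_1|x), p(C_2|x)

# L[i, j]: loss of deciding class i+1 when the true class is j+1 (hypothetical, asymmetric)
L_asym = np.array([[0.0, 1.0],
                   [5.0, 0.0]])               # wrongly deciding C_2 is 5x more costly
L_01 = 1.0 - np.eye(2)                        # zero-one loss

def bayes_decision(L, post):
    cond_risk = L @ post                      # conditional risk of each action
    return cond_risk.argmin()

print(bayes_decision(L_asym, posteriors))     # 0: decide C_1 despite its smaller posterior
print(bayes_decision(L_01, posteriors), posteriors.argmax())   # both pick C_2 under 0-1 loss
```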

Probabilistic discriminant functions
Discriminant functions are a popular way of representing a classifier: one function $f_i(x)$ per class $C_i$ ($i = 1, \ldots, K$), and $x$ is assigned to class $C_i$ if $f_i(x) > f_j(x)$ for all $j \neq i$.
Representing the Bayesian classifier with discriminant functions:
- Classifier minimizing the error rate: $f_i(x) = P(C_i \mid x)$
- Classifier minimizing the risk: $f_i(x) = -\sum_{j=1}^{K} L_{ij}\,p(C_j \mid x)$

Naïve Bayes classifier
Generative methods can require a high number of parameters. The Naïve Bayes assumption is conditional independence of the features given the class:
$$p(x \mid C_k) = p(x_1 \mid C_k)\,p(x_2 \mid C_k)\cdots p(x_d \mid C_k)$$

Naïve Bayes classifier
In the decision phase, it finds the label of $x$ according to:
$$\arg\max_{k=1,\ldots,K} p(C_k \mid x) = \arg\max_{k=1,\ldots,K} p(C_k)\prod_{i=1}^{d} p(x_i \mid C_k)$$
since $p(x \mid C_k) = p(x_1 \mid C_k)\cdots p(x_d \mid C_k)$ and therefore $p(C_k \mid x) \propto p(C_k)\prod_{i=1}^{d} p(x_i \mid C_k)$.

Naïve Bayes classifier
It finds $d$ univariate distributions $p(x_1 \mid C_k), \ldots, p(x_d \mid C_k)$ instead of one multivariate distribution $p(x \mid C_k)$.
- Example 1: for a Gaussian class-conditional density $p(x \mid C_k)$, it estimates $d + d$ parameters (one mean and one variance per dimension) instead of $d + \frac{d(d+1)}{2}$ parameters.
- Example 2: for a Bernoulli class-conditional density $p(x \mid C_k)$, it estimates $d$ parameters (one mean per dimension) instead of $2^d - 1$ parameters.
It first estimates the class-conditional densities $p(x_1 \mid C_k), \ldots, p(x_d \mid C_k)$ and the prior probability $p(C_k)$ for each class ($k = 1, \ldots, K$) from the training set.
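
To make the parameter saving of Example 1 concrete, here is a minimal Gaussian naïve Bayes sketch that fits only per-dimension means and variances for each class; the inputs `X` and `y` are hypothetical placeholders.

```python
import numpy as np

def fit_gaussian_nb(X, y, K):
    """Fit per-class priors and per-dimension Gaussian parameters (naive Bayes).
    X: (N, d), y: (N,) with labels in {0, ..., K-1}."""
    priors = np.array([(y == k).mean() for k in range(K)])
    means = np.stack([X[y == k].mean(axis=0) for k in range(K)])   # (K, d)
    vars_ = np.stack([X[y == k].var(axis=0) for k in range(K)])    # (K, d), biased MLE
    return priors, means, vars_

def predict_gaussian_nb(x, priors, means, vars_):
    """Return the class maximizing log p(C_k) + sum_i log p(x_i | C_k)."""
    log_lik = -0.5 * (np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_).sum(axis=1)
    return np.argmax(np.log(priors) + log_lik)
```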

Naïve Bayes: discrete example
Training data (Diabetes D, Smoke S, Heart Disease H):
D  S  H
Y  N  Y
Y  N  N
N  Y  N
N  Y  N
N  N  N
N  Y  Y
N  N  N
N  Y  Y
N  N  N
Y  N  N
Estimated probabilities: $p(H=\text{Yes}) = 0.3$, $p(D=\text{Yes} \mid H=\text{Yes}) = 1/3$, $p(S=\text{Yes} \mid H=\text{Yes}) = 2/3$, $p(D=\text{Yes} \mid H=\text{No}) = 2/7$, $p(S=\text{Yes} \mid H=\text{No}) = 2/7$.
Decision on $x = [\text{Yes}, \text{Yes}]$ (a person who has diabetes and also smokes):
$$p(H=\text{Yes} \mid x) \propto p(H=\text{Yes})\,p(D=\text{Yes} \mid H=\text{Yes})\,p(S=\text{Yes} \mid H=\text{Yes}) = 0.3 \times \tfrac{1}{3} \times \tfrac{2}{3} \approx 0.066$$
$$p(H=\text{No} \mid x) \propto p(H=\text{No})\,p(D=\text{Yes} \mid H=\text{No})\,p(S=\text{Yes} \mid H=\text{No}) = 0.7 \times \tfrac{2}{7} \times \tfrac{2}{7} \approx 0.057$$
Thus decide $H = \text{Yes}$.
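
The short sketch below reproduces the counts and the two unnormalized posteriors (about 0.066 vs. 0.057) from the table above.

```python
import numpy as np

# Columns: Diabetes, Smoke, Heart disease (Y -> 1, N -> 0); rows copied from the table
data = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 0],
                 [0, 1, 1], [0, 0, 0], [0, 1, 1], [0, 0, 0], [1, 0, 0]])
D, S, H = data[:, 0], data[:, 1], data[:, 2]

p_H = H.mean()                                             # p(H = Yes) = 0.3
p_D_given_H = {h: D[H == h].mean() for h in (0, 1)}        # p(D = Yes | H)
p_S_given_H = {h: S[H == h].mean() for h in (0, 1)}        # p(S = Yes | H)

# Unnormalized posteriors for x = [D = Yes, S = Yes]
score_yes = p_H * p_D_given_H[1] * p_S_given_H[1]          # ~0.066
score_no = (1 - p_H) * p_D_given_H[0] * p_S_given_H[0]     # ~0.057
print(score_yes, score_no, "-> decide H =", "Yes" if score_yes > score_no else "No")
```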

Probabilistic classifiers
How can we find the probabilities required in the Bayes decision rule? Probabilistic classification approaches fall into two main categories:
- Generative: estimate the joint pdf $p(x, C_k)$, or alternatively estimate both $p(x \mid C_k)$ and $p(C_k)$, and then use them to find $p(C_k \mid x)$.
- Discriminative: directly estimate $p(C_k \mid x)$ for each class $C_k$.

Generative approach
- Inference stage: determine the class-conditional densities $p(x \mid C_k)$ and the priors $p(C_k)$, then use Bayes' theorem to find $p(C_k \mid x)$.
- Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if $p(C_i \mid x) > p(C_j \mid x)$ for all $j \neq i$, then decide $C_i$.

Discriminative vs. generative approach
[Figure comparing the two approaches; from Bishop.]

Class-conditional densities vs. posterior
[Figure from Bishop: class-conditional densities $p(x \mid C_1)$, $p(x \mid C_2)$ and the posterior $p(C_1 \mid x)$.]
For Gaussian class-conditionals with a shared covariance matrix, the posterior is a logistic sigmoid of a linear function of $x$:
$$p(C_1 \mid x) = \sigma(w^T x + w_0), \qquad w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T \Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}$$
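
A sketch of these formulas in code, assuming the shared-covariance parameters (`mu1`, `mu2`, `Sigma`, `pi1`) have already been estimated, e.g. by the MLE shown earlier; the example parameters at the bottom are made up for illustration.

```python
import numpy as np

def posterior_c1_factory(mu1, mu2, Sigma, pi1):
    """Return a function x -> p(C1|x) = sigmoid(w^T x + w0) for shared-covariance Gaussian classes."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(pi1 / (1 - pi1)))
    return lambda x: 1.0 / (1.0 + np.exp(-(w @ x + w0)))

# Example with made-up parameters (illustrative only)
post_c1 = posterior_c1_factory(np.array([1.0, 1.0]), np.array([-1.0, 0.0]),
                               np.eye(2), pi1=2/3)
print(post_c1(np.array([0.5, 0.5])))
```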

Discriminative approach
- Inference stage: determine the posterior class probabilities $P(C_k \mid x)$ directly.
- Decision stage: after learning the model (inference stage), make the optimal class assignment for a new input: if $P(C_i \mid x) > P(C_j \mid x)$ for all $j \neq i$, then decide $C_i$.

Posterior probabilities
- Two-class: $p(C_k \mid x)$ can be written as a logistic sigmoid for a wide choice of $p(x \mid C_k)$ distributions:
$$p(C_1 \mid x) = \sigma(a(x)) = \frac{1}{1 + \exp(-a(x))}$$
- Multi-class: $p(C_k \mid x)$ can be written as a softmax for a wide choice of $p(x \mid C_k)$:
$$p(C_k \mid x) = \frac{\exp(a_k(x))}{\sum_{j=1}^{K} \exp(a_j(x))}$$

Discriminative approach: logistic regression
More general than discriminant functions: $f(x; w)$ predicts the posterior probability $P(y = 1 \mid x)$ ($K = 2$):
$$f(x; w) = \sigma(w^T x), \qquad x = [1, x_1, \ldots, x_d]^T, \qquad w = [w_0, w_1, \ldots, w_d]^T$$
$\sigma(\cdot)$ is the activation function, here the sigmoid (logistic) function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Logistic regression ($K = 2$, $y \in \{0, 1\}$)
$f(x; w)$: probability that $y = 1$ given $x$ (parameterized by $w$):
$$P(y = 1 \mid x, w) = f(x; w), \qquad P(y = 0 \mid x, w) = 1 - f(x; w), \qquad f(x; w) = \sigma(w^T x), \qquad 0 \leq f(x; w) \leq 1$$
$f(x; w)$ is the estimated probability of $y = 1$ on input $x$. Example: cancer diagnosis (malignant vs. benign); $f(x; w) = 0.7$ means a 70% chance of the tumor being malignant.

Logistic regression: decision surface
A decision surface is $f(x; w) = \text{constant}$, e.g.
$$f(x; w) = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}} = 0.5$$
Decision surfaces are linear functions of $x$: if $f(x; w) \geq 0.5$ then $y = 1$, else $y = 0$; equivalently, if $w^T x + w_0 \geq 0$ then $y = 1$, else $y = 0$.

Logistic regression: ML estimation
Maximum (conditional) log-likelihood:
$$\hat{w} = \arg\max_w \log \prod_{i=1}^{n} p(y^{(i)} \mid w, x^{(i)}), \qquad p(y^{(i)} \mid w, x^{(i)}) = f(x^{(i)}; w)^{y^{(i)}}\left(1 - f(x^{(i)}; w)\right)^{1 - y^{(i)}}$$
$$\log p(y \mid X, w) = \sum_{i=1}^{n}\left[y^{(i)}\log f(x^{(i)}; w) + (1 - y^{(i)})\log\left(1 - f(x^{(i)}; w)\right)\right]$$

Logistic regression: cost function
$$\hat{w} = \arg\min_w J(w), \qquad J(w) = -\sum_{i=1}^{n}\log p(y^{(i)} \mid w, x^{(i)}) = -\sum_{i=1}^{n}\left[y^{(i)}\log f(x^{(i)}; w) + (1 - y^{(i)})\log\left(1 - f(x^{(i)}; w)\right)\right]$$
There is no closed-form solution for $\nabla_w J(w) = 0$; however, $J(w)$ is convex.

Logistic regression: gradient descent
$$w^{t+1} = w^t - \eta\,\nabla_w J(w^t), \qquad \nabla_w J(w) = \sum_{i=1}^{n}\left(f(x^{(i)}; w) - y^{(i)}\right)x^{(i)}$$
Is it similar to the gradient of SSE for linear regression?
$$\nabla_w J(w) = \sum_{i=1}^{n}\left(w^T x^{(i)} - y^{(i)}\right)x^{(i)}$$
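
A compact sketch of this update rule; the learning rate, iteration count, averaging of the gradient, and the bias-column convention are implementation choices rather than part of the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent on the negative log-likelihood.
    X: (n, d) features, y: (n,) labels in {0, 1}."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])        # prepend x_0 = 1 for the bias w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        f = sigmoid(Xb @ w)                              # f(x^(i); w) for all i
        grad = Xb.T @ (f - y)                            # sum_i (f_i - y_i) x^(i)
        w -= eta * grad / len(y)                         # averaged step for stability
    return w

# Toy usage on a linearly separable 1-D problem
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
w = train_logistic_regression(X, y)
print(sigmoid(np.hstack([1, [2.5]]) @ w))                # predicted p(y=1 | x=2.5)
```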

Logistic regression: loss function
$$\text{Loss}(y, f(x; w)) = -y\log f(x; w) - (1 - y)\log\left(1 - f(x; w)\right)$$
Since $y = 1$ or $y = 0$:
$$\text{Loss}(y, f(x; w)) = \begin{cases} -\log f(x; w) & \text{if } y = 1 \\ -\log\left(1 - f(x; w)\right) & \text{if } y = 0 \end{cases}$$
How is it related to the zero-one loss?
$$\text{Loss}(y, \hat{y}) = \begin{cases} 1 & \hat{y} \neq y \\ 0 & \hat{y} = y \end{cases}, \qquad f(x; w) = \frac{1}{1 + \exp(-w^T x)}$$

Logistic regression: cost function (summary)
Logistic Regression (LR) has a more appropriate cost function for classification than SSE and the perceptron criterion. Why is the cost function of LR more suitable than
$$J(w) = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - f(x^{(i)}; w)\right)^2, \qquad \text{where } f(x; w) = \sigma(w^T x)?$$
- The conditional distribution $p(y \mid x, w)$ in the classification problem is not Gaussian (it is Bernoulli).
- The cost function of LR is also convex, unlike this SSE cost with a sigmoid.

Multi-class logistic regression
For each class $k$, $f_k(x; W)$ predicts the probability of $y = k$, i.e., $P(y = k \mid x, W)$. On a new input $x$, pick the class that maximizes $f_k(x; W)$:
$$\alpha(x) = \arg\max_{k=1,\ldots,K} f_k(x); \qquad \text{if } f_k(x) > f_j(x)\ \forall j \neq k, \text{ then decide } C_k$$

Multi-class logistic regression ($K > 2$, $y \in \{1, 2, \ldots, K\}$)
$$f_k(x; W) = p(y = k \mid x) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K}\exp(w_j^T x)}$$
This is the normalized exponential (a.k.a. softmax). If $w_k^T x \gg w_j^T x$ for all $j \neq k$, then $p(C_k \mid x) \approx 1$ and $p(C_j \mid x) \approx 0$. Compare with the generative form
$$p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_{j=1}^{K} p(x \mid C_j)\,p(C_j)}$$

Logistic regression: multi-class cost function
$$\hat{W} = \arg\min_W J(W), \qquad J(W) = -\log\prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}, W) = -\log\prod_{i=1}^{n}\prod_{k=1}^{K} f_k(x^{(i)}; W)^{y_k^{(i)}} = -\sum_{i=1}^{n}\sum_{k=1}^{K} y_k^{(i)}\log f_k(x^{(i)}; W)$$
Here $y$ is a vector of length $K$ (1-of-K coding), e.g. $y = [0, 0, 1, 0]^T$ when the target class is $C_3$; $W = [w_1 \cdots w_K]$, and $Y = [y^{(1)} \cdots y^{(n)}]^T$ is the matrix of targets with entries $y_k^{(i)}$.

Logistic regression: multi-class gradient descent
$$w_j^{t+1} = w_j^t - \eta\,\nabla_{w_j} J(W^t), \qquad \nabla_{w_j} J(W) = \sum_{i=1}^{n}\left(f_j(x^{(i)}; W) - y_j^{(i)}\right)x^{(i)}$$
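
A sketch of the multi-class update, with 1-of-K targets built from integer labels; the step size, iteration count, and gradient averaging are arbitrary choices for illustration.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # stabilize the exponentials
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_softmax_regression(X, y, K, eta=0.1, n_iters=1000):
    """X: (n, d) features, y: (n,) integer labels in {0, ..., K-1}."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # bias column
    Y = np.eye(K)[y]                          # (n, K) one-hot (1-of-K) targets
    W = np.zeros((Xb.shape[1], K))            # one column w_j per class
    for _ in range(n_iters):
        F = softmax(Xb @ W)                   # f_j(x^(i); W) for all i, j
        grad = Xb.T @ (F - Y)                 # column j: sum_i (f_j - y_j) x^(i)
        W -= eta * grad / n
    return W
```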

Logistic Regression (LR): summary
- LR is a linear classifier.
- The LR optimization problem is obtained by maximum likelihood, assuming a Bernoulli distribution for the conditional probabilities with mean $\frac{1}{1 + e^{-w^T x}}$.
- There is no closed-form solution for its optimization problem, but the cost function is convex, and the global optimum can be found by gradient descent (equivalently, gradient ascent on the log-likelihood).

Discriminative vs. generative: number of parameters ($d$-dimensional feature space)
- Logistic regression: $d + 1$ parameters, $w = (w_0, w_1, \ldots, w_d)$.
- Generative approach with Gaussian class-conditionals and a shared covariance matrix: $2d$ parameters for the means, $d(d+1)/2$ parameters for the shared covariance matrix, and one parameter for the class prior $p(C_1)$.
Moreover, LR is more robust and less sensitive to incorrect modeling assumptions.

Summary of alternatives
- Generative: the most demanding, because it finds the joint distribution $p(x, C_k)$; usually needs a large training set to estimate $p(x \mid C_k)$; but it can also find $p(x)$, which is useful for outlier or novelty detection.
- Discriminative: specifies only what is really needed (i.e., $p(C_k \mid x)$); more computationally efficient.

Resources
C. Bishop, Pattern Recognition and Machine Learning, Chapters 4.2-4.3.