Bayes Classifiers. CAP5610 Machine Learning. Instructor: Guo-Jun QI


Recap: Joint distributions

Joint distribution over:
- Input vector X = (X1, X2), where X1 = B or ¬B (drinking beer or not) and X2 = H or ¬H (headache or not)
- Output Y = F or ¬F (binary class: flu or not)

Y    X1    X2    P(Y, X1, X2)
¬F   ¬B    ¬H    0.4
¬F   ¬B    H     0.1
¬F   B     ¬H    0.17
¬F   B     H     0.2
F    ¬B    ¬H    0.05
F    ¬B    H     0.05
F    B     ¬H    0.015
F    B     H     0.015

Prior distribution

Summing the joint table over X1 and X2:
- Prior for the positive class: P(Y=F) = 0.05 + 0.05 + 0.015 + 0.015 = 0.13
- Prior for the negative class: P(Y=¬F) = 0.4 + 0.1 + 0.17 + 0.2 = 0.87

Class-conditional distribution

By the definition of conditional probability,
P(X1, X2 | Y=F) = P(X1, X2, Y=F) / P(Y=F)
P(X1=B, X2=H | Y=F) = P(X1=B, X2=H, Y=F) / P(Y=F) = 0.015 / 0.13 ≈ 0.1154

P(X1, X2 | Y=F):
          X2=H      X2=¬H
X1=B      0.1154    ?
X1=¬B     ?         ?

Class-conditional distribution

P(X1, X2 | Y=¬F) = P(X1, X2, Y=¬F) / P(Y=¬F)
P(X1=B, X2=H | Y=¬F) = P(X1=B, X2=H, Y=¬F) / P(Y=¬F) = 0.2 / 0.87 ≈ 0.2299

P(X1, X2 | Y=¬F):
          X2=H      X2=¬H
X1=B      0.2299    ?
X1=¬B     ?         ?

Posterior distribution

P(Y=F | X1, X2) = P(X1, X2, Y=F) / P(X1, X2)
P(Y=F | X1=B, X2=H) = P(X1=B, X2=H, Y=F) / P(X1=B, X2=H) = 0.015 / (0.015 + 0.2) ≈ 0.0698

P(Y=¬F | X1, X2) = P(X1, X2, Y=¬F) / P(X1, X2)
P(Y=¬F | X1=B, X2=H) = 0.2 / (0.015 + 0.2) ≈ 0.9302
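To make the running example concrete, here is a small Python sketch (not part of the original slides) that reproduces these numbers from the joint table. The dictionary layout and variable names are my own; the (B, H) rows are taken directly from the slides, while the assignment of the remaining rows follows the reconstruction of the table above.

```python
# Joint distribution P(Y, X1, X2): keys are (flu, beer, headache).
# Row assignments other than the (B, H) cells are an assumption.
joint = {
    (False, False, False): 0.4,
    (False, False, True):  0.1,
    (False, True,  False): 0.17,
    (False, True,  True):  0.2,
    (True,  False, False): 0.05,
    (True,  False, True):  0.05,
    (True,  True,  False): 0.015,
    (True,  True,  True):  0.015,
}

# Prior: marginalize out X1 and X2.
p_flu = sum(p for (flu, _, _), p in joint.items() if flu)        # 0.13
p_not_flu = 1.0 - p_flu                                          # 0.87

# Class-conditional: P(X1=B, X2=H | Y) = P(B, H, Y) / P(Y).
p_bh_given_flu = joint[(True, True, True)] / p_flu               # ~0.1154
p_bh_given_not_flu = joint[(False, True, True)] / p_not_flu      # ~0.2299

# Posterior: P(Y=F | X1=B, X2=H) = P(B, H, F) / P(X1=B, X2=H).
p_bh = joint[(True, True, True)] + joint[(False, True, True)]    # 0.215
p_flu_given_bh = joint[(True, True, True)] / p_bh                # ~0.0698

print(p_flu, p_bh_given_flu, p_bh_given_not_flu, p_flu_given_bh)
```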

Prior, class-conditional, and posterior distributions

Prior distribution for a class, P(Y): it takes no input; it is the fraction of a particular class in the population.
- What is the fraction of the digit 9 among all digits [0-9]?
- What is the fraction of people who are infected with flu?

Class-conditional distribution

Given a class, it is the distribution from which examples of that class are drawn, e.g., the distribution of images of the digit 9, or of symptoms X = (X1, X2) given a class, as in the table P(X1, X2 | Y=¬F) above.

Posterior Distribution

Posterior distribution over classes, P(Y | X1, X2): given an input X, how likely is each particular class? For example, how likely is it that a given image shows the digit 9?

Decision Theory

Maximum A Posteriori (MAP) rule: given an input vector X, make an optimal decision about the class label Y, in the sense of minimizing the classification error.
- Case I: when P(Y=F | X1, X2) > P(Y=¬F | X1, X2), X is assigned to F (i.e., X is infected with flu).
- Case II: when P(Y=F | X1, X2) < P(Y=¬F | X1, X2), X is assigned to ¬F (i.e., X is not infected with flu).
Claim (proved next): the MAP rule gives the minimal classification error.
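As an illustration, a minimal sketch of the MAP rule for a two-class problem where only P(Y=F | X) is given; the function name is illustrative, not from the slides.

```python
def map_decision(p_flu_given_x: float) -> str:
    """Return 'F' if P(Y=F|X) exceeds P(Y=¬F|X) = 1 - P(Y=F|X), else '¬F'."""
    return "F" if p_flu_given_x > 1.0 - p_flu_given_x else "¬F"

# For X = (B, H) in the running example, P(Y=F|X) ~= 0.0698, so MAP predicts ¬F.
print(map_decision(0.0698))  # -> "¬F"
```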

Proof

A decision region is a region of the feature space such that every point in it is assigned to a particular class.

[Figure: the feature space (X1, X2) split into regions R1 and R2 by a decision boundary, with F and ¬F examples on either side.]

Proof

p(error) = ∫_{R1} p(X, Y=C2) dX + ∫_{R2} p(X, Y=C1) dX
         = ∫_{R1} P(Y=C2 | X) p(X) dX + ∫_{R2} P(Y=C1 | X) p(X) dX

Each X belongs to either R1 or R2. To minimize the error rate, each X should be assigned to the region of the class with the larger posterior, so that only the smaller posterior contributes to the error integrand; this is exactly the MAP rule.

Likelihood Ratio

Maximum A Posteriori rule:
- Case I: when P(Y=F | X1, X2) > P(Y=¬F | X1, X2), X is assigned to F (i.e., X is infected with flu).
- Case II: when P(Y=F | X1, X2) < P(Y=¬F | X1, X2), X is assigned to ¬F (i.e., X is not infected with flu).

Likelihood ratio: f(X) = P(Y=F | X1, X2) / P(Y=¬F | X1, X2).
Where f(X) > 1, X is assigned to F; otherwise X is assigned to ¬F.

Discriminant Function

Given an input X, a discriminant function decides its class by comparing f(X) with a certain threshold:
f(X) > 1: X ∈ F;  f(X) < 1: X ∈ ¬F.
A discriminant function does not need to be positive, and its threshold does not need to be 1.

A linear discriminant function: f(X) = X1 + X2 - 1 with threshold 0.

[Figure: linear decision boundary X1 + X2 = 1 through (0, 1) and (1, 0); f(X) > 0 in region R1, f(X) < 0 in region R2.]
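A quick sketch of this linear discriminant in Python (variable and function names are illustrative only):

```python
def f(x1: float, x2: float) -> float:
    """Linear discriminant f(X) = X1 + X2 - 1, thresholded at 0."""
    return x1 + x2 - 1.0

def region(x1: float, x2: float) -> str:
    return "R1 (f > 0)" if f(x1, x2) > 0 else "R2 (f < 0)"

print(region(0.8, 0.9))  # above the line X1 + X2 = 1 -> R1
print(region(0.1, 0.2))  # below the line -> R2
```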

Bayes Error

The Bayes error is the minimal error, achieved by the MAP rule. It is a lower bound on the error rate of any classifier.

p(error | X) = P(Y=C1 | X), if P(Y=C2 | X) > P(Y=C1 | X)
             = P(Y=C2 | X), if P(Y=C1 | X) > P(Y=C2 | X)
             = min{ P(Y=C1 | X), P(Y=C2 | X) }

Nearest Neighbor Error

The asymptotic error of the nearest neighbor classifier (1-NN) is smaller than twice the Bayes error.

Given an example X, let X_NN be its nearest neighbor; the true class of X is Y and the true class of X_NN is Y_NN.

p_NN(error | X, X_NN) = P(Y=C1, Y_NN=C2 | X, X_NN) + P(Y=C2, Y_NN=C1 | X, X_NN)
                      = P(Y=C1 | X) P(Y_NN=C2 | X_NN) + P(Y=C2 | X) P(Y_NN=C1 | X_NN)

When the training set is large enough (its size approaching infinity), X_NN approaches X, so
p_NN(error | X) = 2 P(Y=C1 | X) P(Y=C2 | X)

Bayes error and NN asymptotic error

Bayes error: p(error | X) = min{ P(Y=C1 | X), P(Y=C2 | X) }
NN asymptotic error: p_NN(error | X) = 2 P(Y=C1 | X) P(Y=C2 | X)

Since 2 P(Y=C1 | X) P(Y=C2 | X) = 2 min{...} max{...} and max{...} ≤ 1,
p_NN(error | X) ≤ 2 p(error | X)
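The bound can be checked numerically; the following sketch (not from the slides) evaluates both errors for a few values of the posterior p = P(Y=C1 | X):

```python
def bayes_error(p: float) -> float:
    """Bayes error for a two-class problem with posterior p = P(Y=C1|X)."""
    return min(p, 1.0 - p)

def nn_asymptotic_error(p: float) -> float:
    """Asymptotic 1-NN error 2*p*(1-p) for the same posterior."""
    return 2.0 * p * (1.0 - p)

for p in (0.1, 0.3, 0.5, 0.9):
    b, nn = bayes_error(p), nn_asymptotic_error(p)
    assert nn <= 2.0 * b  # the 1-NN error never exceeds twice the Bayes error
    print(f"p={p:.1f}  Bayes={b:.3f}  1-NN={nn:.3f}")
```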

Bayesian Classifier

Compare the posterior distributions: given an input feature vector X, predict
Y = C1, if P(Y=C1 | X) > P(Y=C2 | X)
Y = C2, if P(Y=C2 | X) > P(Y=C1 | X)
where P(Y=Ci | X) ∝ P(X | Y=Ci) P(Y=Ci), i = 1, 2.

Practical Issue

Prior distribution P(Y=Ci), i = 1, 2: count the fraction of each class in the training set.
Class-conditional distribution P(X | Y=Ci), i = 1, 2: modeled from the training examples belonging to each class.

Four training examples:
X1 (drinking beer)   X2 (headache)   Y (flu)
0                    1               1
1                    1               1
1                    0               0
0                    0               0

P(X=(0,1) | Y=1) = #(X=(0,1), Y=1) / #(Y=1) = 1/2
P(X=(1,1) | Y=1) = #(X=(1,1), Y=1) / #(Y=1) = 1/2
P(X=(1,0) | Y=1) = #(X=(1,0), Y=1) / #(Y=1) = 0/2
P(X=(0,0) | Y=1) = #(X=(0,0), Y=1) / #(Y=1) = 0/2
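The counting estimates on this slide can be written directly in Python; the sketch below (illustrative names, not from the slides) computes the priors and the full class-conditional table from the four training examples:

```python
from collections import Counter

# (x1, x2, y): x1 = drinking beer, x2 = headache, y = flu.
train = [(0, 1, 1), (1, 1, 1), (1, 0, 0), (0, 0, 0)]

# Prior: fraction of each class in the training set.
n = len(train)
prior = {c: sum(1 for *_, y in train if y == c) / n for c in (0, 1)}  # {0: 0.5, 1: 0.5}

# Full class-conditional: P(X=(x1,x2) | Y=c) = #(X=(x1,x2), Y=c) / #(Y=c).
counts = Counter(((x1, x2), y) for x1, x2, y in train)
n_pos = sum(1 for *_, y in train if y == 1)
for x in [(0, 1), (1, 1), (1, 0), (0, 0)]:
    print(x, counts[(x, 1)] / n_pos)  # 1/2, 1/2, 0/2, 0/2 as on the slide
```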

N attributes in the feature vector

Input vector: X = (X1, X2, ..., XN). To estimate P(X | Y=Ci), how many examples are needed?
Assume X is a binary vector; then at least 2^N examples are required just so that each possible assignment of binary attributes to X has one training example.
- N = 20: 2^N = 1,048,576
- N = 30: 2^N = 1,073,741,824
- In MNIST, N = 28x28 = 784 pixels, so 2^N ≈ 1.01x10^236, let alone the case where X is a continuous vector.
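For reference, the counts above can be reproduced with a couple of lines of Python (2**784 is my computation for a 28x28 binary image; the slide rounds it to about 1.01x10^236):

```python
print(2**20)            # 1048576
print(2**30)            # 1073741824
print(f"{2**784:.3e}")  # ~1.017e+236
```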

Naive Bayes

Assume the attributes are independent of each other given a class Ci, i = 1, 2:
P(X1, X2, ..., XN | Y=Ci) = P(X1 | Y=Ci) P(X2 | Y=Ci) ... P(XN | Y=Ci)
Each P(Xn | Y=Ci) can be estimated independently. If Xn is binary, at least 2 training examples are required to estimate each P(Xn | Y=Ci), so at least 2N training examples (rather than 2^N) suffice to estimate the joint class-conditional distribution.

Naive Bayes

P(X1=0 | Y=1) = #(X1=0, Y=1) / #(Y=1) = 1/2
P(X1=1 | Y=1) = ?

X1 (drinking beer)   X2 (headache)   Y (flu)
0                    1               1
1                    1               1
1                    0               0
0                    0               0

An exercise: complete the Naive Bayes estimates (not homework).
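A minimal Naive Bayes sketch over the same four training examples, assuming the factorized counting estimates described above (names are illustrative, and Laplace smoothing is omitted so the counts match the slide):

```python
train = [(0, 1, 1), (1, 1, 1), (1, 0, 0), (0, 0, 0)]  # (x1, x2, y)

def estimate(train):
    """Estimate priors and per-attribute conditionals P(X_n = v | Y = c) by counting."""
    classes = {y for *_, y in train}
    prior = {c: sum(1 for *_, y in train if y == c) / len(train) for c in classes}
    cond = {}  # cond[(n, v, c)] = P(X_n = v | Y = c)
    for c in classes:
        rows = [x[:-1] for x in train if x[-1] == c]
        for n in range(2):
            for v in (0, 1):
                cond[(n, v, c)] = sum(1 for r in rows if r[n] == v) / len(rows)
    return prior, cond

def predict(x, prior, cond):
    """MAP prediction under the Naive Bayes factorization P(Y) * prod_n P(X_n | Y)."""
    score = {c: prior[c] * cond[(0, x[0], c)] * cond[(1, x[1], c)] for c in prior}
    return max(score, key=score.get)

prior, cond = estimate(train)
print(cond[(0, 0, 1)])                # P(X1=0 | Y=1) = 1/2, as on the slide
print(predict((1, 1), prior, cond))   # predicts flu (Y=1) for beer + headache
```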

Summary

- Prior distribution, class-conditional distribution, and posterior distribution
- Maximum A Posteriori (MAP) rule to decide the class assigned to each input vector X
- Likelihood ratio and discriminant function
- Decision boundary and decision region
- Practical issue: estimating the prior and class-conditional distributions from training examples
- Naive Bayes