ECE 6430 Pattern Recognition and Analysis, Fall 2011
Lecture Notes - 2

What does Bayes' theorem give us? Let's revisit the ball-in-the-box example.

Figure 1: Boxes with colored balls

Last class we answered the question: what is the overall probability that the selection procedure will pick a green ball? Now let's look at another problem: given that we have picked a green ball, what is the probability that it came from the blue box? Or the red box?

We can solve the problem of reversing the conditional probability by using Bayes' theorem:

p(B = b | F = g) = p(F = g | B = b) p(B = b) / p(F = g)    (1)

Note that we already know all the probabilities on the RHS from earlier!

Prior probability: p(B = b) or p(B = r) is the probability available before we observe the identity of the ball.

Posterior probability: p(B | F) is the probability available after we observe the identity of the ball.

How does Bayes' theorem relate to training data? Prior probabilities, p(C_k), can be estimated from the proportions of the training data that fall into each class. What does this mean? How many times was each class chosen? (What is the probability of choosing the blue box?)

Class-conditional probability, p(x_n | C_k): this is estimated from histograms of the data for each class. Why do we need it? Take the example of the handwritten letters 'a' and 'b'. We tried to classify them based on height alone, and there was a lot of overlap between classes C_1 and C_2. Height is the most obvious case, but even with a larger number of features such overlap occurs. Both boxes (classes) contain both orange and green balls!

What about the denominator, p(x_n)? Once we have the prior and the class-conditional probabilities, we can calculate it as

p(x_n) = p(x_n | C_1) p(C_1) + ... + p(x_n | C_K) p(C_K)    (2)

This is just a normalizing value! Why? Summarizing, for decision making:

posterior ∝ likelihood × prior    (3)

Given a new data value x_nt, the probability of misclassification is minimized if we assign the data to the class C_k for which the posterior probability p(C_k | x_nt) is largest:

choose C_k if p(C_k | x_nt) > p(C_j | x_nt) for all j ≠ k    (4)

Rejection threshold in the Bayesian context:

if max_k p(C_k | x_n) ≥ θ, classify x_n as C_k; otherwise (max_k p(C_k | x_n) < θ), reject x_n    (5)

Note that in the textbook, discrete probabilities are denoted by a capital P(), while I have not made that distinction.
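To make equations (1)-(5) concrete, here is a minimal Python sketch of the box example. The box priors, the proportions of green and orange balls, and the rejection threshold θ are hypothetical numbers chosen for illustration; they are not taken from the figure.

# A minimal sketch of equations (1)-(5); the priors, ball proportions,
# and threshold below are hypothetical numbers, not from the notes.

priors = {'blue': 0.6, 'red': 0.4}                      # p(B = b), p(B = r)
likelihood = {'blue': {'green': 0.75, 'orange': 0.25},  # p(F = color | B = box)
              'red':  {'green': 0.25, 'orange': 0.75}}

def posterior(color):
    """p(B = box | F = color) for every box, using equations (1) and (2)."""
    evidence = sum(likelihood[b][color] * priors[b] for b in priors)  # p(F = color), eq. (2)
    return {b: likelihood[b][color] * priors[b] / evidence for b in priors}

def classify(color, theta=0.8):
    """MAP decision (eq. (4)) with rejection threshold theta (eq. (5))."""
    post = posterior(color)
    best = max(post, key=post.get)
    return best if post[best] >= theta else 'reject'

print(posterior('green'))  # {'blue': 0.818..., 'red': 0.181...}
print(classify('green'))   # 'blue' (its posterior exceeds theta = 0.8)

Lowering the ball proportions toward 50/50 drives the posterior toward the prior, and the same observation would then be rejected rather than classified.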

Discriminant functions

Discriminant functions y_1(x), ..., y_M(x) are defined such that an input vector x is assigned to class C_k if y_k(x) > y_j(x) for all j ≠ k. If we compare this to our earlier rule of minimizing the probability of misclassification, we would have

y_k(x) = p(C_k | x)    (6)

Applying Bayes' theorem, we will have

y_k(x) = p(x | C_k) p(C_k)    (7)

Note that when defining the discriminant function, we can discard the denominator p(x_n).

Figure 2: Joint probabilities compared, p(x, C_1) = p(x | C_1) p(C_1)

In general, the decision boundaries are given by the regions where the discriminant functions are equal, y_k(x) = y_j(x). Since we are only comparing relative magnitudes, we can replace y(x) with another monotonic function of it and expect the same decisions, e.g.

z_k(x) = ln p(x | C_k) + ln p(C_k)    (8)
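As one way to picture equation (8), the sketch below builds log-discriminants from one-dimensional Gaussian class-conditional densities. The Gaussian form and the parameter values are assumptions made only for this example, not something stated in the notes.

# A sketch of equation (8) with one-dimensional Gaussian class-conditionals;
# the Gaussian assumption and the parameter values are illustrative only.
import math

classes = {
    'C1': {'prior': 0.5, 'mean': 0.0, 'std': 1.0},
    'C2': {'prior': 0.5, 'mean': 2.0, 'std': 1.0},
}

def log_discriminant(x, c):
    """z_k(x) = ln p(x | C_k) + ln p(C_k), eq. (8)."""
    p = classes[c]
    log_lik = -0.5 * math.log(2 * math.pi * p['std'] ** 2) \
              - (x - p['mean']) ** 2 / (2 * p['std'] ** 2)
    return log_lik + math.log(p['prior'])

def decide(x):
    """Assign x to the class with the largest discriminant value."""
    return max(classes, key=lambda c: log_discriminant(x, c))

print(decide(0.3))  # 'C1'
print(decide(1.7))  # 'C2'; the boundary is where z_1(x) = z_2(x), here x = 1

With equal priors and equal variances the decision boundary sits midway between the class means, exactly where the two discriminants are equal.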

Curve fitting revisited

Figure 3: Probability in curve fitting

Remember: curve fitting involves finding one set of values for w that minimizes the error between y(x_n, w) and the desired output or target values, t_n.

Error function: measures the misfit between the function y(x_n, w), for any given value of w, and the training set data points. Using the sum of the squares of the errors:

E = (1/2) Σ_{n=1}^{N} {y(x_n; w) - t_n}^2    (9)

What does p(t | x_0) mean? What are we doing here? Choosing a specific estimate y(x) of the value of t for each input x. The regression function y(x) minimizes the expected squared loss,

E(L) = ∫∫ {y(x) - t}^2 p(x, t) dx dt    (10)

Choose y(x) to minimize E(L).
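Before deriving the minimizer, here is a quick numerical look at the expected squared loss in equation (10): we draw synthetic samples of t at a fixed input x and evaluate the average squared loss for several candidate estimates y. The sinusoidal target function and the noise level are made up for this example.

# Monte Carlo evaluation of the expected squared loss at a fixed x;
# the target (a sine curve) and the noise level are synthetic.
import math
import random

random.seed(0)
t_samples = [math.sin(1.0) + random.gauss(0, 0.3) for _ in range(10000)]  # samples of t | x = 1.0

def avg_sq_loss(y):
    """Monte Carlo estimate of E[{y - t}^2] at this fixed x."""
    return sum((y - t) ** 2 for t in t_samples) / len(t_samples)

cond_mean = sum(t_samples) / len(t_samples)          # sample estimate of E(t | x = 1.0)
for y in (cond_mean - 0.2, cond_mean, cond_mean + 0.2):
    print(round(y, 3), round(avg_sq_loss(y), 4))     # the loss is smallest at the conditional mean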

Taking the partial derivative with respect to y(x) and equating it to zero, we end up with

y(x) = E(t | x)    (11)

Figure 4: Bayesian curve fit

Generalizing to multiple target variables, we will have

y(x) = E(t | x)    (12)

Minimizing risk

Sometimes, misclassifying one way might be more detrimental than the other. E.g. identification of a tumor: it would be riskier to classify a real tumor as a non-tumor than the other way around.

Loss matrix element l_kj = penalty associated with assigning a pattern to class C_j when it belongs to C_k.

R_k = Σ_{j=1}^{c} l_kj ∫_{R_j} p(x | C_k) dx    (13)

Assign x to class C_j if

Σ_k l_kj p(x | C_k) p(C_k) < Σ_k l_ki p(x | C_k) p(C_k)    for all i ≠ j    (14)
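A small Python sketch of the risk-minimizing rule in equation (14), using the tumor example. The loss matrix entries, the priors, and the class-conditional values are hypothetical; the point is only that a large penalty l_kj can overturn the plain MAP decision.

# A sketch of the risk-minimizing rule (eq. (14)) for the tumor example;
# the loss matrix, priors, and class-conditional values are hypothetical.

classes = ['tumor', 'non-tumor']
prior = {'tumor': 0.05, 'non-tumor': 0.95}
# l_kj: penalty for assigning to class j when the true class is k
loss = {('tumor', 'tumor'): 0.0,     ('tumor', 'non-tumor'): 100.0,
        ('non-tumor', 'tumor'): 1.0, ('non-tumor', 'non-tumor'): 0.0}

def conditional_risk(likelihood, j):
    """Sum_k l_kj p(x | C_k) p(C_k), the quantity compared in eq. (14)."""
    return sum(loss[(k, j)] * likelihood[k] * prior[k] for k in classes)

def decide(likelihood):
    """Choose the class whose conditional risk is smallest."""
    return min(classes, key=lambda j: conditional_risk(likelihood, j))

# The posterior favors 'non-tumor' in the first case, but the heavy penalty
# for missing a real tumor flips the minimum-risk decision to 'tumor'.
print(decide({'tumor': 0.3, 'non-tumor': 0.7}))       # 'tumor'
print(decide({'tumor': 0.001, 'non-tumor': 0.999}))   # 'non-tumor'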