Chapter 2: Bayesian Decision Theory. Pattern Recognition, Soochow, Fall Semester.

Decision Theory
Decision: making a choice under uncertainty.
Pattern recognition: mapping a pattern to a category. Given a test sample, its category is uncertain and a decision has to be made.
In essence, PR is a decision process.

Bayesian Decision Theory
Bayesian decision theory is a statistical approach to pattern recognition; the fundamentals of most PR algorithms are rooted in it.
Basic assumptions: the decision problem is posed (formalized) in probabilistic terms, and all the relevant probability values are known.
Key principle: Bayes theorem.

Bayes Theorem
P(H|X) = P(X|H) P(H) / P(X), where:
X: the observed sample (also called evidence; e.g., the length of a fish);
H: the hypothesis (e.g., the fish belongs to the salmon category);
P(H): the prior probability that H holds (e.g., the probability of catching a salmon);
P(X|H): the likelihood of observing X given that H holds (e.g., the probability of observing a 3-inch fish given that it is a salmon);
P(X): the evidence probability that X is observed (e.g., the probability of observing a fish of 3-inch length);
P(H|X): the posterior probability that H holds given X (e.g., the probability that the fish is a salmon given that its length is 3 inches).
Thomas Bayes (1702-1761)
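As a minimal sketch of the theorem on the fish example (all numbers below are hypothetical, not from the lecture): assume P(salmon) = 2/3, and that a 3-inch length has likelihood 0.5 under salmon and 0.2 under sea bass.

```python
# Hypothetical numbers for the fish example of Bayes theorem.
p_salmon = 2 / 3                 # P(H): prior probability of salmon
p_x_given_salmon = 0.5           # P(X|H): likelihood of a 3-inch fish, salmon
p_x_given_seabass = 0.2          # likelihood of a 3-inch fish, sea bass

# P(X): evidence, by the law of total probability
p_x = p_x_given_salmon * p_salmon + p_x_given_seabass * (1 - p_salmon)

# P(H|X): posterior probability of salmon given the observed length
p_salmon_given_x = p_x_given_salmon * p_salmon / p_x
```

With these numbers the evidence is P(X) = 0.4 and the posterior rises from 2/3 to 5/6.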

A Specific Example
State of nature: future events that might occur, e.g., the next fish arriving along the conveyor belt. The state of nature is unpredictable: it is hard to predict which type will emerge next. From a statistical/probabilistic point of view, the state of nature is best regarded as a random variable; e.g., let ω denote the (discrete) random variable representing the state of nature (class) of fish types.

Prior Probability
The prior probability is the probability distribution that reflects one's prior knowledge of the random variable ω.
Probability distribution (for a discrete random variable): let P(ω_i), i = 1, ..., c, be the probability distribution on ω with c possible states of nature, such that P(ω_i) ≥ 0 and Σ_{i=1}^{c} P(ω_i) = 1.
E.g., P(ω_1) = P(ω_2): the catch produced as much sea bass as salmon; P(ω_1) > P(ω_2): the catch produced more sea bass than salmon.

Decision Before Observation
The problem: make a decision on the type of fish arriving next, where 1) the prior probabilities are known, and 2) no observation is allowed.
Naive decision rule: decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2.
This is the best we can do without an observation. With fixed prior probabilities it yields the same decision every time. It works well when P(ω_1) is much greater (or much smaller) than P(ω_2), but poorly when P(ω_1) is close to P(ω_2) [only a 50% chance of being right if P(ω_1) = P(ω_2) = 0.5].
Incorporate observations into the decision!

Probability Density Function (pdf)
Probability density function (for a continuous random variable): let p(x) be the probability density function of the continuous random variable x taking values in R, such that p(x) ≥ 0 and ∫ p(x) dx = 1.
For a continuous random variable it no longer makes sense to talk about the probability that x takes a particular value (it is almost always zero). We instead talk about the probability of x falling into a region R, say R = (a, b), which is computed from the pdf as P(x ∈ (a, b)) = ∫_a^b p(x) dx.
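The region probability P(x ∈ (a, b)) = ∫_a^b p(x) dx can be sketched numerically; the standard normal pdf and the trapezoidal rule below are assumptions for illustration, not part of the lecture.

```python
import math

def pdf(x):
    # example pdf: standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def prob_region(a, b, n=10_000):
    # trapezoidal-rule approximation of the integral of pdf over (a, b)
    h = (b - a) / n
    s = 0.5 * (pdf(a) + pdf(b)) + sum(pdf(a + i * h) for i in range(1, n))
    return s * h

p = prob_region(-1.0, 1.0)   # P(-1 < x < 1) for a standard normal, about 0.68
```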

Incorporate Observations
The problem: suppose the fish lightness measurement x is observed; how can we incorporate this knowledge?
Class-conditional probability density function: a pdf for x given that the state of nature (class) is ω_j, i.e., p(x|ω_j). The class-conditional pdfs describe the differences in the distribution of observations under different classes.

Class-Conditional PDF
An illustrative example (class-conditional pdfs for lightness): horizontal axis, lightness of fish scales; vertical axis, class-conditional pdf values; black curve, sea bass; red curve, salmon. The area under each curve is 1.0 (normalization). Sea bass are somewhat brighter than salmon.

Decision After Observation
Known: the prior probabilities P(ω_j) and the class-conditional pdfs p(x|ω_j).
Unknown: the posterior probability P(ω_j|x), the quantity we naturally want to use in the decision (by exploiting the observation x for the test example, e.g., fish lightness).
Bayes formula: P(ω_j|x) = p(x|ω_j) P(ω_j) / p(x), which converts the prior probability into the posterior probability.

Bayes Formula Revisited
Joint probability density function: p(x, ω_j) = p(x|ω_j) P(ω_j).
Marginal distribution / law of total probability [ref. pp.615]: p(x) = Σ_{j=1}^{c} p(x|ω_j) P(ω_j).

Bayes Formula Revisited (Cont.)
Bayes decision rule: decide ω_1 if P(ω_1|x) > P(ω_2|x); otherwise decide ω_2.
P(ω_j) and p(x|ω_j) are assumed to be known; p(x) is irrelevant for the Bayesian decision (it serves as a normalization factor and is not related to any state of nature).

Bayes Formula Revisited (Cont.)
Special case I: equal prior probabilities. The decision depends only on the likelihoods p(x|ω_j).
Special case II: equal likelihoods. The rule degenerates to the naive decision rule.
Normally, the prior probability and the likelihood function act together in the Bayesian decision process.

Bayes Formula Revisited (Cont.)
An illustrative example (class-conditional pdfs for lightness): what will the posterior probability for either type of fish look like?

Bayes Formula Revisited (Cont.)
An illustrative example (posterior probabilities for either type of fish): horizontal axis, lightness of fish scales; vertical axis, posterior probability; black curve, sea bass; red curve, salmon. For each value of x, the higher curve yields the output of the Bayesian decision, and the posteriors on the two curves sum to 1.0.

Another Example
Problem statement: a new medical test is used to detect whether a patient has a certain cancer, with test result either + (positive) or - (negative).
For a patient with this cancer, the probability of a positive test result is 0.98.
For a patient without this cancer, the probability of a negative test result is 0.97.
The probability that any person has this cancer is 0.008.
Question: if a positive test result is returned for some person, does he/she have this kind of cancer or not?

Another Example (Cont.)
P(cancer|+) = P(+|cancer) P(cancer) / P(+) = (0.98 × 0.008) / (0.98 × 0.008 + 0.03 × 0.992) = 0.00784 / 0.0376 ≈ 0.21, so P(no cancer|+) ≈ 0.79 > P(cancer|+).
Decision: no cancer!
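The cancer-test numbers from the slide can be checked directly in a few lines of Python:

```python
# Numbers from the slide: sensitivity 0.98, specificity 0.97, prior 0.008.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_neg_given_no_cancer = 0.97
p_pos_given_no_cancer = 1 - p_neg_given_no_cancer   # false-positive rate 0.03

# Evidence P(+) by the law of total probability
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_no_cancer * (1 - p_cancer))

# Posterior P(cancer | +)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
```

The posterior comes out near 0.21, below 0.5, so the Bayes decision is "no cancer" despite the positive result.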

Feasibility of Bayes Formula
To compute the posterior probability, we need to know the prior probabilities P(ω_j) and the likelihoods p(x|ω_j). How do we know these probabilities?
A simple solution: counting relative frequencies.
An advanced solution: conducting density estimation.

A Further Example
Problem statement: based on the height of a car on some campus, decide whether it costs more than $50,000 or not.
Quantities to know: the priors P(ω_1), P(ω_2) and the class-conditional pdfs p(x|ω_1), p(x|ω_2), where ω_1 denotes price > $50,000 and ω_2 denotes price ≤ $50,000. These are estimated by counting relative frequencies over collected samples.

A Further Example (Cont.)
Collecting samples: suppose we have randomly picked 1209 cars on the campus, got their prices from their owners, and measured their heights.
Compute P(ω_i): the number of cars in ω_1 is 221 and in ω_2 is 988, so P(ω_1) = 221/1209 ≈ 0.18 and P(ω_2) = 988/1209 ≈ 0.82.

A Further Example (Cont.)
Compute p(x|ω_j): discretize the height spectrum (say [0.5m, 2.5m]) into 20 intervals, each of length 0.1m, and then count the number of cars falling into each interval for either class.
Suppose x falls into the interval I_x = [1.0m, 1.1m). For ω_1, the number of cars in I_x is 46; for ω_2, it is 59. Thus P(x ∈ I_x|ω_1) = 46/221 ≈ 0.21 and P(x ∈ I_x|ω_2) = 59/988 ≈ 0.06.

A Further Example (Cont.)
Question: for a car with height 1.05m, is its price greater than $50,000?
Using the estimated quantities: P(x ∈ I_x|ω_1) P(ω_1) = (46/221)(221/1209) = 46/1209, while P(x ∈ I_x|ω_2) P(ω_2) = 59/1209. Since 59 > 46, P(ω_2|x) > P(ω_1|x), so we decide ω_2: the price is not greater than $50,000.
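The whole relative-frequency estimate and decision can be sketched in Python using the counts given on the slides (1209 cars, 221 in ω_1, 988 in ω_2, and 46 vs 59 cars in the height interval [1.0m, 1.1m)):

```python
# Counts from the slides.
prior_w1, prior_w2 = 221 / 1209, 988 / 1209          # P(w1), P(w2)
lik_w1, lik_w2 = 46 / 221, 59 / 988                  # P(x in I_x | w_i)

# Unnormalized posteriors; p(x) cancels out of the comparison.
score_w1 = lik_w1 * prior_w1   # note the priors cancel: this is 46/1209
score_w2 = lik_w2 * prior_w2   # and this is 59/1209
decision = "w1 (> $50,000)" if score_w1 > score_w2 else "w2 (<= $50,000)"
```

Because the counts cancel into 46/1209 versus 59/1209, the comparison reduces to comparing the joint counts 46 and 59 directly.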

Is Bayes Decision Rule Optimal?
Bayes decision rule (in the case of two classes): whenever we observe a particular x, the probability of error is P(error|x) = P(ω_1|x) if we decide ω_2, and P(ω_2|x) if we decide ω_1.
Under the Bayes decision rule, P(error|x) = min[P(ω_1|x), P(ω_2|x)]. Since for every x we ensure that P(error|x) is as small as possible, the average probability of error over all possible x must also be as small as possible.

Bayes Decision Rule: The General Case
By allowing the use of more than one feature: x ∈ R^d (d-dimensional Euclidean space).
By allowing more than two states of nature: a finite set {ω_1, ..., ω_c} of c states of nature.
By allowing actions other than merely deciding the state of nature: a finite set {α_1, ..., α_a} of a possible actions.

Bayes Decision Rule: The General Case (Cont.)
By introducing a loss function more general than the probability of error: λ(α_i|ω_j), the loss incurred for taking action α_i when the state of nature is ω_j.
A simple loss function (two classes, three actions), with entries λ(α_i|ω_j):
           α_1    α_2    α_3
Class ω_1:   5     50    10,000
Class ω_2:  60      3    0

Bayes Decision Rule: The General Case (Cont.)
The problem: given a particular x, we have to decide which action to take, so we need to know the loss of taking each action. If the action taken is α_i and the true state of nature is ω_j, we incur the loss λ(α_i|ω_j). However, the true state of nature is uncertain, so we consider the expected (average) loss.

Bayes Decision Rule: The General Case (Cont.)
Expected loss: R(α_i|x) = Σ_{j=1}^{c} λ(α_i|ω_j) P(ω_j|x), averaging by enumerating over all possible states of nature: λ(α_i|ω_j) is the loss incurred by taking action α_i when the true state of nature is ω_j, and P(ω_j|x) is the probability that ω_j is the true state of nature. The expected loss is also called the (conditional) risk.

Bayes Decision Rule: The General Case (Cont.)
Suppose we have the two-class, three-action loss function given earlier. For a particular x, R(α_1|x) = λ(α_1|ω_1) P(ω_1|x) + λ(α_1|ω_2) P(ω_2|x); similarly, we can get R(α_2|x) and R(α_3|x).
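The conditional-risk computation can be sketched with the slide's loss numbers (5, 50, 10,000 and 60, 3, 0), read here, as an assumption, with rows as true classes and columns as actions; the posteriors are hypothetical:

```python
# lambda(a_i | w_j): rows are true classes w1, w2; columns are actions a1..a3.
loss = [[5.0, 50.0, 10_000.0],   # true class w1
        [60.0, 3.0, 0.0]]        # true class w2
post = [0.7, 0.3]                # hypothetical posteriors P(w1|x), P(w2|x)

# Conditional risk R(a_i|x) = sum_j lambda(a_i|w_j) * P(w_j|x)
risks = [sum(loss[j][i] * post[j] for j in range(2)) for i in range(3)]
best_action = min(range(3), key=lambda i: risks[i])
```

With these posteriors the risks are 21.5, 35.9, and 7000, so the minimum-risk action is α_1.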

Bayes Decision Rule: The General Case (Cont.)
The task: find a mapping α(x) from patterns to actions; in other words, for every x, the decision function α(x) assumes one of the a actions.
Overall risk R: the expected loss with decision function α(x), R = ∫ R(α(x)|x) p(x) dx, where R(α(x)|x) is the conditional risk for pattern x with action α(x) and p(x) is the pdf for patterns.

Bayes Decision Rule: The General Case (Cont.)
If for every x we ensure that the conditional risk R(α(x)|x) is as small as possible, then the overall risk over all possible x must be as small as possible.
Bayes decision rule (general case): take the action α* = argmin_{α_i} R(α_i|x).
The resulting overall risk is called the Bayes risk (denoted R*): the best performance achievable given p(x) and the loss function.

Two-Category Classification
Special case (two states of nature): write λ_ij = λ(α_i|ω_j), the loss incurred for deciding ω_i when the true state of nature is ω_j.
The conditional risks: R(α_1|x) = λ_11 P(ω_1|x) + λ_12 P(ω_2|x) and R(α_2|x) = λ_21 P(ω_1|x) + λ_22 P(ω_2|x).
Is R(α_1|x) < R(α_2|x)? Yes: decide ω_1. No: decide ω_2.

Two-Category Classification (Cont.)
By definition, re-arrangement, and Bayes theorem, the rule becomes a likelihood-ratio test: decide ω_1 if
p(x|ω_1) / p(x|ω_2) > [(λ_12 − λ_22) / (λ_21 − λ_11)] · [P(ω_2) / P(ω_1)],
where the right-hand side is a constant θ independent of x. Here λ_21 > λ_11 and λ_12 > λ_22, since the loss for an error is ordinarily greater than the loss for a correct decision.
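A small sketch of the likelihood-ratio test; the losses and priors below are hypothetical values chosen only to illustrate the threshold:

```python
# Hypothetical losses (l21 > l11, l12 > l22) and priors.
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0
prior1, prior2 = 0.4, 0.6

# Constant threshold theta, independent of x.
theta = (l12 - l22) / (l21 - l11) * prior2 / prior1   # = 2 * 1.5 = 3

def decide(lik1, lik2):
    """Decide w1 iff the likelihood ratio p(x|w1)/p(x|w2) exceeds theta."""
    return "w1" if lik1 / lik2 > theta else "w2"
```

For example, likelihoods (0.8, 0.2) give a ratio of 4 > 3 and hence ω_1, while equal likelihoods give ω_2.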

Minimum-Error-Rate Classification
Classification setting (c possible states of nature; actions decide categories).
Zero-one (symmetric) loss function: λ(α_i|ω_j) = 0 if i = j, and 1 if i ≠ j; i.e., assign no loss (0) to a correct decision and a unit loss (1) to any incorrect decision (equal cost).

Minimum-Error-Rate Classification (Cont.)
The error rate is the probability that action α_i is wrong. Under zero-one loss, R(α_i|x) = Σ_{j≠i} P(ω_j|x) = 1 − P(ω_i|x), so minimizing the conditional risk is equivalent to maximizing the posterior probability: decide ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i.
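The equivalence between minimizing zero-one risk and maximizing the posterior can be checked numerically (the posteriors below are hypothetical):

```python
# Hypothetical posteriors for three classes.
post = [0.2, 0.5, 0.3]

# Zero-one loss: 0 on the diagonal, 1 elsewhere.
loss = [[0 if i == j else 1 for j in range(3)] for i in range(3)]

# R(a_i|x) = sum_j loss[i][j] * P(w_j|x) = 1 - P(w_i|x)
risks = [sum(loss[i][j] * post[j] for j in range(3)) for i in range(3)]
assert all(abs(r - (1 - p)) < 1e-12 for r, p in zip(risks, post))

min_risk_action = min(range(3), key=lambda i: risks[i])
max_post_class = max(range(3), key=lambda i: post[i])
```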

Discriminant Function
Classification: pattern → category (actions decide categories).
Discriminant functions are a useful way to represent classifiers: one function g_i(x) per category, deciding ω_i if g_i(x) > g_j(x) for all j ≠ i.

Discriminant Function (Cont.)
Minimum risk: g_i(x) = −R(α_i|x). Minimum error rate: g_i(x) = P(ω_i|x).
Various discriminant functions may give identical classification results: if f is a monotonically increasing function, then f(g_i(x)) is equivalent to g_i(x) in decision; e.g., g_i(x) = ln p(x|ω_i) + ln P(ω_i).

Discriminant Function (Cont.)
Decision regions: c discriminant functions induce c decision regions R_i ⊂ R^d, where R_i = {x : g_i(x) > g_j(x) for all j ≠ i}.
Decision boundary: the surface in feature space where ties occur among the largest discriminant functions.

Expected Value
Expected value (a.k.a. expectation, mean, or average) of a random variable x.
Discrete case: E[x] = Σ_x x P(x). Continuous case: E[x] = ∫ x p(x) dx. Notation: μ = E[x].

Expected Value (Cont.)
Given a random variable x and a function f, the expected value of f(x) is: discrete case, E[f(x)] = Σ_x f(x) P(x); continuous case, E[f(x)] = ∫ f(x) p(x) dx.
Variance: discrete case, Var(x) = Σ_x (x − μ)^2 P(x); continuous case, Var(x) = ∫ (x − μ)^2 p(x) dx. Notation: σ^2 = Var(x), where σ is the standard deviation.
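A quick numeric sketch of expectation and variance for a discrete distribution (the values and probabilities below are hypothetical):

```python
# Hypothetical discrete distribution.
xs = [1, 2, 3, 4]
ps = [0.1, 0.2, 0.3, 0.4]

mean = sum(x * p for x, p in zip(xs, ps))               # E[x]
var = sum((x - mean) ** 2 * p for x, p in zip(xs, ps))  # E[(x - mean)^2]

# Equivalent shortcut: Var(x) = E[x^2] - (E[x])^2
var2 = sum(x * x * p for x, p in zip(xs, ps)) - mean ** 2
```

Here E[x] = 3 and Var(x) = 1, by either formula.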

Gaussian Density: Univariate Case
Gaussian density (a.k.a. normal density) for a continuous random variable:
p(x) = (1 / (√(2π) σ)) exp(−(x − μ)^2 / (2σ^2)), with mean μ and variance σ^2.
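The univariate density can be written directly from the formula:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(x; mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

peak = gaussian_pdf(0.0, 0.0, 1.0)   # value at the mean: 1/sqrt(2*pi)
```

The density peaks at x = μ and is symmetric about it.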

Vector Random Variables
For a vector random variable x = (x_1, ..., x_d)^T: joint pdf p(x); marginal pdf on the i-th component.
Expected vector: μ = E[x] = (μ_1, ..., μ_d)^T, where each μ_i is computed from the marginal pdf on the i-th component.

Vector Random Variables (Cont.)
Covariance matrix: Σ = E[(x − μ)(x − μ)^T], whose (i, j)-th entry σ_ij is computed from the marginal pdf on the pair of random variables (x_i, x_j).
Properties of Σ [Appendix A.4.9, pp.617]: symmetric and positive semidefinite.

Gaussian Density: Multivariate Case
p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ)), where x is a d-dimensional column vector, μ is the d-dimensional mean vector, and Σ is the d × d covariance matrix.
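A minimal sketch of the multivariate density for d = 2, using the closed-form inverse and determinant of a 2×2 matrix so it stays dependency-free:

```python
import math

def mvn_pdf_2d(x, mu, cov):
    """N(x; mu, cov) for d = 2 via the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    # squared Mahalanobis distance (x-mu)^T cov^{-1} (x-mu)
    m2 = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    # (2*pi)^{d/2} |cov|^{1/2} = 2*pi*sqrt(det) for d = 2
    return math.exp(-0.5 * m2) / (2 * math.pi * math.sqrt(det))
```

At the mean with identity covariance this gives 1/(2π), the product of two univariate standard-normal peaks.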


Discriminant Functions for Gaussian Density
Minimum-error-rate classification: g_i(x) = ln p(x|ω_i) + ln P(ω_i) = −(1/2)(x − μ_i)^T Σ_i^{−1} (x − μ_i) − (d/2) ln 2π − (1/2) ln |Σ_i| + ln P(ω_i). The term −(d/2) ln 2π is constant and can be ignored.

Case I: Σ_i = σ^2 I
Covariance matrix: σ^2 times the identity matrix I.

Case I: Σ_i = σ^2 I (Cont.)
The terms that are the same for all states of nature can be ignored, leaving linear discriminant functions: g_i(x) = w_i^T x + w_{i0}, with weight vector w_i = μ_i / σ^2 and threshold/bias w_{i0} = −μ_i^T μ_i / (2σ^2) + ln P(ω_i).
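The Case I linear discriminant can be sketched as follows; the means, variance, and test point are hypothetical:

```python
import math

def linear_g(x, mu, sigma2, prior):
    """Case I discriminant: g(x) = w^T x + w0 with
    w = mu / sigma^2, w0 = -mu^T mu / (2 sigma^2) + ln P(w)."""
    w = [m / sigma2 for m in mu]
    w0 = -sum(m * m for m in mu) / (2 * sigma2) + math.log(prior)
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# Hypothetical two-class problem in 2-D with equal priors.
mu1, mu2, sigma2 = [0.0, 0.0], [2.0, 2.0], 1.0
x = [0.4, 0.5]
decision = 1 if linear_g(x, mu1, sigma2, 0.5) > linear_g(x, mu2, sigma2, 0.5) else 2
```

With equal priors and equal spherical covariances, this reduces to assigning x to the nearer mean, so the point [0.4, 0.5] goes to class 1.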

Case II: Σ_i = Σ
Covariance matrix: identical for all classes. The quadratic term (x − μ_i)^T Σ^{−1} (x − μ_i) is the squared Mahalanobis distance, which reduces to the squared Euclidean distance when Σ = I.
P. C. Mahalanobis (1893-1972)

Case II: Σ_i = Σ (Cont.)
The terms that are the same for all states of nature can be ignored, again leaving linear discriminant functions: g_i(x) = w_i^T x + w_{i0}, with weight vector w_i = Σ^{−1} μ_i and threshold/bias w_{i0} = −(1/2) μ_i^T Σ^{−1} μ_i + ln P(ω_i).

Case III: Σ_i arbitrary
Quadratic discriminant functions: g_i(x) = x^T W_i x + w_i^T x + w_{i0}, with quadratic matrix W_i = −(1/2) Σ_i^{−1}, weight vector w_i = Σ_i^{−1} μ_i, and threshold/bias w_{i0} = −(1/2) μ_i^T Σ_i^{−1} μ_i − (1/2) ln |Σ_i| + ln P(ω_i).
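A 2-D sketch of the Case III discriminant. Rather than expanding W_i, w_i, and w_{i0} separately, the code below evaluates the algebraically equal Mahalanobis form −(1/2)(x−μ)^T Σ^{−1}(x−μ) − (1/2)ln|Σ| + ln P(ω); the inputs are hypothetical:

```python
import math

def quadratic_g(x, mu, cov, prior):
    """Case III discriminant for d = 2, written in the Mahalanobis form,
    which equals x^T W x + w^T x + w0 with W = -cov^{-1}/2, w = cov^{-1} mu,
    w0 = -mu^T cov^{-1} mu / 2 - ln|cov|/2 + ln P."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    m2 = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return -0.5 * m2 - 0.5 * math.log(det) + math.log(prior)
```

With identity covariance and prior 1, g reduces to minus half the squared Euclidean distance to the mean.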

Summary
Bayesian Decision Theory: PR is essentially a decision process.
Basic concepts: states of nature; probability distribution and probability density function (pdf); class-conditional pdf; joint pdf, marginal distribution, law of total probability.
Bayes theorem: prior + likelihood + observation → posterior probability.
Bayes decision rule: decide the state of nature with the maximum posterior.

Summary (Cont.)
Feasibility of the Bayes decision rule (prior probability + likelihood): solution I, counting relative frequencies; solution II, conducting density estimation (Chapters 3, 4).
Bayes decision rule, the general scenario: allowing more than one feature; allowing more than two states of nature; allowing actions other than merely deciding the state of nature; loss function λ(α_i|ω_j).

Summary (Cont.)
Expected loss (conditional risk): averaging by enumerating over all possible states of nature.
General Bayes decision rule: take the action with the minimum expected loss.
Minimum-error-rate classification: actions decide states of nature; the zero-one loss function assigns no loss/unit loss to correct/incorrect decisions.

Summary (Cont.)
Discriminant functions: a general way to represent classifiers, one function per category, inducing decision regions and decision boundaries.
Gaussian/normal density; discriminant functions for Gaussian pdfs: linear discriminant functions (Cases I and II) and quadratic discriminant functions (Case III).