Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD



Bayesian decision theory: recall that we have
- a state of the world (the class) y
- observations x
- a decision function g(x)
- L[g(x), y], the loss of predicting y with g(x)

The expected value of the loss is called the risk,

Risk = E_{X,Y}[ L(g(x), y) ]

which can be written as

Risk = Σ_i ∫ P_{X,Y}(x, i) L[g(x), i] dx

where the sum runs over the M classes.

Bayesian decision theory: from this,

Risk = Σ_i ∫ P_{X,Y}(x, i) L[g(x), i] dx
     = ∫ P_X(x) Σ_i P_{Y|X}(i|x) L[g(x), i] dx     (by the chain rule, P_{X,Y}(x, i) = P_{Y|X}(i|x) P_X(x))
     = ∫ P_X(x) R(x) dx
     = E_X[ R(x) ]

where

R(x) = Σ_i L[g(x), i] P_{Y|X}(i|x)

is the conditional risk, given the observation x.

Bayesian decision theory: since, by definition, L[g(x), y] ≥ 0 for all x, y, it follows that

R(x) = Σ_i L[g(x), i] P_{Y|X}(i|x) ≥ 0

hence Risk = E_X[ R(x) ] is minimum if we minimize R(x) at every x, i.e. if we pick the decision function

g*(x) = argmin_g Σ_i L[g, i] P_{Y|X}(i|x)

Bayesian decision theory: this is the Bayes decision rule

g*(x) = argmin_g Σ_i L[g, i] P_{Y|X}(i|x)

i.e., at each x, pick the decision with the smaller conditional risk. The associated risk

R* = ∫ P_X(x) Σ_i L[g*(x), i] P_{Y|X}(i|x) dx = ∫ P_X(x) R*(x) dx

is the Bayes risk, and cannot be beaten.
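As an illustration of this rule, here is a minimal Python sketch (not from the slides; the loss matrix and posterior values are made-up numbers) that picks the decision minimizing the conditional risk R(x) at a single observation x.

import numpy as np

# Hypothetical loss matrix L[decision, true class] for a 3-class problem.
L = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])

# Hypothetical posterior P_{Y|X}(i|x) at some observation x.
post = np.array([0.2, 0.5, 0.3])

# Conditional risk of each decision g: R_g(x) = sum_i L[g, i] P_{Y|X}(i|x).
cond_risk = L @ post

g_star = int(np.argmin(cond_risk))   # Bayes decision at this x
print(cond_risk, g_star)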

Example: let's consider a binary classification problem, y ∈ {0, 1}, for which the conditional risk is

R(x) = Σ_{i=0}^{1} L[g(x), i] P_{Y|X}(i|x) = L[g(x), 0] P_{Y|X}(0|x) + L[g(x), 1] P_{Y|X}(1|x)

We have two options:

g(x) = 0:  R_0(x) = L[0,0] P_{Y|X}(0|x) + L[0,1] P_{Y|X}(1|x)
g(x) = 1:  R_1(x) = L[1,0] P_{Y|X}(0|x) + L[1,1] P_{Y|X}(1|x)

and should pick the one of smaller conditional risk.

Eample.e. pck 0 f R 0 < R and otherwse ths can be wrtten as, pck 0 f 0 L[0,0] + < 0 L[,0] + L[0,] < L[,] or { L[0,0] [,0] } { L[,] L[0,] } 0 L < < 0 usually there s no loss assocated wth the correct decson L[,] L[0,0] 0 and ths s the same as 0 L[,0] > L[0,] 7

Example: or, pick 0 if

P_{Y|X}(0|x) / P_{Y|X}(1|x) > L[0,1] / L[1,0]

and, applying Bayes rule, this is equivalent to: pick 0 if

P_{X|Y}(x|0) / P_{X|Y}(x|1) > T* = ( L[0,1] P_Y(1) ) / ( L[1,0] P_Y(0) )

i.e. we pick 0 when the probability of x given y = 0, divided by that given y = 1, is greater than a threshold. The optimal threshold T* depends on the costs of the two types of error and on the probabilities of the two classes.
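A small Python sketch of this likelihood-ratio test (the losses, priors and Gaussian class-conditionals below are hypothetical choices, not values from the slides):

import numpy as np

# Hypothetical costs and priors.
L01, L10 = 1.0, 3.0          # L[0,1]: cost of saying 0 when y=1; L[1,0]: cost of saying 1 when y=0
p0, p1 = 0.5, 0.5            # class priors P_Y(0), P_Y(1)

T_star = (L01 * p1) / (L10 * p0)   # optimal threshold on the likelihood ratio

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def decide(x, mu0=0.0, mu1=1.0, sigma=0.5):
    # Likelihood ratio P_{X|Y}(x|0) / P_{X|Y}(x|1); pick 0 when it exceeds T*.
    lr = gaussian(x, mu0, sigma) / gaussian(x, mu1, sigma)
    return 0 if lr > T_star else 1

print([decide(x) for x in (-0.2, 0.4, 0.9)])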

Example: let's consider the 0-1 loss

L[g, y] = 1 if g ≠ y,  0 if g = y

In this case the optimal decision function is

g*(x) = argmin_i Σ_j L[i, j] P_{Y|X}(j|x)
      = argmin_i Σ_{j≠i} P_{Y|X}(j|x)
      = argmin_i [ 1 − P_{Y|X}(i|x) ]
      = argmax_i P_{Y|X}(i|x)

Example: for the 0-1 loss, the optimal decision rule is the maximum a-posteriori probability rule

g*(x) = argmax_i P_{Y|X}(i|x)

What is the associated risk?

Risk = ∫ P_X(x) Σ_j L[g*(x), j] P_{Y|X}(j|x) dx
     = ∫ P_X(x) Σ_{j≠g*(x)} P_{Y|X}(j|x) dx
     = ∫ P_{X,Y}(x, y ≠ g*(x)) dx

Example: but

R* = ∫ P_{X,Y}(x, y ≠ g*(x)) dx

is really just the probability of error of the decision rule. Note that the same result would hold for any g(x), i.e. R would be the probability of error of g(x). This implies the following: for the 0-1 loss
- the Bayes decision rule is the MAP rule, g*(x) = argmax_i P_{Y|X}(i|x)
- the risk is the probability of error of this rule (the Bayes error)
- there is no other decision function with lower error
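The sketch below (a made-up discrete example, not from the slides) computes the exact probability of error of the MAP rule and of an arbitrary alternative rule; the MAP error is never larger.

import numpy as np

# Hypothetical discrete example: X takes 4 values, two equiprobable classes.
p_y = np.array([0.5, 0.5])
p_x_given_y = np.array([[0.4, 0.3, 0.2, 0.1],    # P_{X|Y}(x|0)
                        [0.1, 0.2, 0.3, 0.4]])   # P_{X|Y}(x|1)

p_xy = p_x_given_y * p_y[:, None]          # joint P_{X,Y}(x, i)
map_rule = np.argmax(p_xy, axis=0)         # MAP decision for each value of x

# Probability of error of a rule g: sum over x of P_{X,Y}(x, y != g(x)).
def prob_error(rule):
    return sum(p_xy[1 - rule[x], x] for x in range(p_xy.shape[1]))

other_rule = np.array([0, 0, 0, 1])        # some alternative decision rule
print(prob_error(map_rule), prob_error(other_rule))   # the MAP error is the smaller one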

MAP rule: it can usually be written in a simple form, given a probabilistic model for X and Y. Consider the two-class problem, i.e. y = 0 or y = 1. The BDR is

g*(x) = argmax_i P_{Y|X}(i|x) = { 0, if P_{Y|X}(0|x) ≥ P_{Y|X}(1|x)
                                  1, if P_{Y|X}(0|x) < P_{Y|X}(1|x)

i.e. pick 0 when P_{Y|X}(0|x) ≥ P_{Y|X}(1|x), and 1 otherwise. Using Bayes rule,

P_{Y|X}(i|x) = P_{X|Y}(x|i) P_Y(i) / P_X(x)

MAP rule: noting that P_X(x) is a non-negative quantity, this is the same as: pick 0 when

P_{X|Y}(x|0) P_Y(0) ≥ P_{X|Y}(x|1) P_Y(1)

By the same reasoning, this can easily be generalized to

g*(x) = argmax_i P_{X|Y}(x|i) P_Y(i)

Note that:
- many class-conditional distributions are exponential (e.g. the Gaussian)
- this product can be tricky to compute (e.g. the tail probabilities are quite small)
- we can take advantage of the fact that we only care about the order of the terms on the right-hand side

The log trick: this is the "log trick", which is to take logs. Note that the log is a monotonically increasing function,

a > b  ⟺  log a > log b

from which

g*(x) = argmax_i P_{X|Y}(x|i) P_Y(i) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

The order is preserved.

MAP rule: in summary, for the zero/one loss, the following three decision rules are optimal and equivalent

1) g*(x) = argmax_i P_{Y|X}(i|x)
2) g*(x) = argmax_i [ P_{X|Y}(x|i) P_Y(i) ]
3) g*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

1) is usually hard to use; 3) is frequently easier than 2).
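Why 3) is easier than 2) in practice: the product of many small likelihoods underflows, while the sum of log-likelihoods does not. A minimal Python sketch, under an assumed (hypothetical) setup of many independent unit-variance Gaussian features:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 1000-dimensional observation with independent Gaussian
# features, class means 0 and 1 in every dimension, unit variance, equal priors.
mu = np.array([0.0, 1.0])
x = rng.normal(mu[1], 1.0, size=1000)      # a sample drawn from class 1

def log_gauss(x, m):
    # log of the unit-variance Gaussian density, feature by feature
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - m) ** 2

# Rule 2): product of the likelihoods times the prior -- underflows to 0.0 here.
products = [float(np.prod(np.exp(log_gauss(x, m)))) * 0.5 for m in mu]

# Rule 3): sum of log-likelihoods plus log-prior -- numerically well behaved.
log_scores = [log_gauss(x, m).sum() + np.log(0.5) for m in mu]

print(products)               # [0.0, 0.0]: rule 2) is useless at this dimension
print(np.argmax(log_scores))  # 1: rule 3) still recovers the correct class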

Example: the Bayes decision rule is usually highly intuitive. Example: communications. A bit y is transmitted by a source, corrupted by noise in the channel, and received as x by a decoder.

[figure: source → channel (noise) → decoder]

Q: what should the optimal decoder do to recover y?

Example: intuitively, it appears that the decoder should just threshold x: pick a threshold T, with decision rule

g(x) = { 0, if x < T
         1, if x > T

What is the threshold value? Let's solve the problem with the BDR.

Example: we need
- class probabilities: in the absence of any other info, let's say P_Y(0) = P_Y(1) = 1/2
- class-conditional densities: noise results from thermal processes, electrons moving around and bumping into each other; a lot of independent events that add up; by the central limit theorem it appears reasonable to assume that the noise is Gaussian

We denote a Gaussian random variable of mean μ and variance σ² by N(μ, σ²).

Example: the Gaussian probability density function is

G(x, μ, σ) = 1/(√(2π) σ) exp( −(x − μ)² / (2σ²) )

Since the noise is Gaussian, and assuming it is just added to the signal, we have

x = y + ε,   ε ~ N(0, σ²)

In both cases (y = 0 or y = 1), x corresponds to a constant plus zero-mean Gaussian noise; the constant simply adds to the mean of the Gaussian.
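A direct transcription of this density into Python (a small helper for the sketches below; not code from the slides):

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # G(x, mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

# With the additive-noise model x = y + eps, eps ~ N(0, sigma^2), the class-conditionals
# are P_{X|Y}(x|i) = gaussian_pdf(x, i, sigma) for i in {0, 1}.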

Example: in summary,

P_{X|Y}(x|0) = G(x, 0, σ)
P_{X|Y}(x|1) = G(x, 1, σ)
P_Y(0) = P_Y(1) = 1/2

or, graphically, [figure: the two Gaussian class-conditionals, centered at 0 and 1]

Example: to compute the BDR, we recall that

g*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

and note that terms which are constant as a function of i can be dropped, since we are just looking for the i that maximizes the function. Since this is the case for the class probabilities (P_Y(0) = P_Y(1) = 1/2), we have

g*(x) = argmax_i log P_{X|Y}(x|i)

BDR: this is intuitive; we pick the class that best explains (gives higher probability to) the observation. In this case we can solve visually: [figure: pick 0 to the left of the crossing point of the two densities, pick 1 to the right] but the mathematical solution is equally simple.

BDR: let's consider the more general case

P_{X|Y}(x|0) = G(x, μ0, σ)
P_{X|Y}(x|1) = G(x, μ1, σ)

for which

g*(x) = argmax_i log G(x, μ_i, σ)
      = argmax_i log [ 1/(√(2π) σ) exp( −(x − μ_i)² / (2σ²) ) ]
      = argmax_i [ −(x − μ_i)² / (2σ²) − log(√(2π) σ) ]
      = argmin_i (x − μ_i)²

BDR: or

g*(x) = argmin_i (x − μ_i)²
      = argmin_i [ x² − 2xμ_i + μ_i² ]
      = argmin_i [ −2xμ_i + μ_i² ]

The optimal decision is, therefore: pick 0 if

−2xμ0 + μ0² < −2xμ1 + μ1²

or, pick 0 if

x (μ1 − μ0) < (μ1² − μ0²)/2,   i.e.   x < (μ0 + μ1)/2     (assuming μ1 > μ0)

BDR: for a problem with Gaussian classes, equal variances and equal class probabilities, the optimal decision boundary is the threshold at the mid-point between the two means:

pick 0 if x < (μ0 + μ1)/2, pick 1 otherwise

[figure: two Gaussians with the decision threshold at the mid-point of the means]
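A quick numerical check of this result, in Python (the means below are hypothetical; equal variances and equal priors are assumed, as in the text):

import numpy as np

mu0, mu1 = -1.0, 2.0
threshold = 0.5 * (mu0 + mu1)

xs = np.linspace(mu0 - 3, mu1 + 3, 101)
bdr = np.argmin((xs[:, None] - np.array([mu0, mu1])) ** 2, axis=1)  # argmin_i (x - mu_i)^2
thresholded = (xs > threshold).astype(int)                          # pick 1 above the mid-point

print(np.array_equal(bdr, thresholded))   # True: the BDR is the mid-point threshold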

BDR: back to our signal decoding problem. In this case μ0 = 0, μ1 = 1, so T = 0.5, and the decision rule is

g(x) = { 0, if x < 0.5
         1, if x > 0.5

This is, once again, intuitive: we place the threshold midway between the two signal levels.

BDR: what is the point of going through all the math? Now we know that the intuitive threshold is actually optimal, and in which sense it is optimal (minimum probability of error). The Bayesian solution keeps us honest: it forces us to make all our assumptions explicit. Assumptions we have made:
- uniform class probabilities: P_Y(0) = P_Y(1) = 1/2
- Gaussianity
- the variance is the same under the two states: P_{X|Y}(x|i) = G(x, μ_i, σ)
- noise is additive: x = y + ε

Even for a trivial problem, we have made lots of assumptions.

BDR: what if the class probabilities are not the same? e.g. a coding scheme that makes one bit value much more frequent than the other, so that P_Y(1) >> P_Y(0). How does this change the optimal decision rule?

g*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]
      = argmax_i [ log ( 1/(√(2π) σ) exp( −(x − μ_i)² / (2σ²) ) ) + log P_Y(i) ]
      = argmax_i [ −(x − μ_i)² / (2σ²) + log P_Y(i) ]
      = argmin_i [ (x − μ_i)² − 2σ² log P_Y(i) ]

BDR: or

g*(x) = argmin_i [ (x − μ_i)² − 2σ² log P_Y(i) ]
      = argmin_i [ x² − 2xμ_i + μ_i² − 2σ² log P_Y(i) ]
      = argmin_i [ −2xμ_i + μ_i² − 2σ² log P_Y(i) ]

The optimal decision is, therefore: pick 0 if

−2xμ0 + μ0² − 2σ² log P_Y(0) < −2xμ1 + μ1² − 2σ² log P_Y(1)

or, pick 0 if

x (μ1 − μ0) < (μ1² − μ0²)/2 + σ² log [ P_Y(0) / P_Y(1) ]

i.e.

x < (μ0 + μ1)/2 + σ²/(μ1 − μ0) log [ P_Y(0) / P_Y(1) ]
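A short Python check that the closed-form threshold agrees with the argmin form of the rule (the means, variance and priors below are hypothetical values chosen for illustration):

import numpy as np

# Hypothetical parameters: class 1 much more likely than class 0.
mu0, mu1, sigma = 0.0, 1.0, 0.5
p0, p1 = 0.2, 0.8

# Closed-form threshold derived above: pick 0 iff x is below it.
x_star = 0.5 * (mu0 + mu1) + (sigma ** 2 / (mu1 - mu0)) * np.log(p0 / p1)

# Brute-force BDR: argmin_i [(x - mu_i)^2 - 2 sigma^2 log P_Y(i)].
def bdr(x):
    scores = [(x - mu) ** 2 - 2 * sigma ** 2 * np.log(p) for mu, p in ((mu0, p0), (mu1, p1))]
    return int(np.argmin(scores))

xs = np.linspace(-2, 3, 501)
assert all(bdr(x) == int(x > x_star) for x in xs)   # the two forms agree
print(x_star)   # below 0.5: the larger prior of class 1 pulls the threshold toward mu0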

BDR: what is the role of the prior for the class probabilities?

pick 0 if  x < (μ0 + μ1)/2 + σ²/(μ1 − μ0) log [ P_Y(0) / P_Y(1) ]

The prior moves the threshold up or down, in an intuitive way:
- P_Y(0) > P_Y(1): the threshold increases; since 0 has higher probability, we care more about errors on the 0 side; by using a higher threshold we make it more likely to pick 0
- if P_Y(1) = 0, all we care about is 0; the threshold becomes infinite, and we never say 1

How relevant is the prior? It is weighed by σ²/(μ1 − μ0).

BDR: how relevant is the prior? It is weighed by the inverse of the normalized distance between the means, (μ1 − μ0)/σ², i.e. the distance between the means in units of variance.
- if the classes are very far apart, the prior makes no difference: this is the easy situation, the observations are very clear, Bayes says "forget the prior knowledge"
- if the classes are exactly equal (same mean), the prior gets infinite weight: in this case the observations do not say anything about the class, Bayes says "forget about the data, just use the knowledge that you started with", even if that means always saying 0 or always saying 1
