E395 - Pattern Recognition
Solutions to Introduction to Pattern Recognition, Chapter 2: Bayesian pattern classification

This work comes under the terms of the Creative Commons BY-SA 2.0 license, http://creativecommons.org/licenses/by-sa/2.0/

Preface

This document is a solution manual for selected exercises from Introduction to Pattern Recognition by Arne Leijon. The notation followed in the textbook is fully respected here. A short review of the topics discussed in the corresponding chapter of the textbook is given as a reference. For complete proofs of these results and for the problem texts, refer to the textbook.

Problem definition and optimization

This chapter of the textbook generalizes the classification problem introduced in the previous chapter. The extensions are:
- more than two categories are considered;
- more than one signal feature can be employed;
- the performance criterion is generalized.

The new scheme of the classifier is depicted in Fig. 1, taken from the textbook.

[Figure 1. General signal classification: a signal source with state S in {1, ..., M_s}, a transduction stage, feature extraction producing the observation vector X, and a classifier that evaluates the discriminant functions g_1(x), ..., g_{M_d}(x) and selects the maximum.]

All the elements of this classification system are described statistically: the signal state can take any value from the set {j = 1, ..., M_s} with a priori probability P_S(j). The observed feature vector x is the outcome of a random vector X whose distribution depends on the state S and can be written as f_{X|S}(x|j) for each of the M_s possible states. The decision rule D makes use of the information given by the a posteriori probability P_{S|X}(j|x), obtained with the Bayes rule as

    P_{S|X}(j|x) = \frac{f_{X|S}(x|j) P_S(j)}{\sum_{j'=1}^{M_s} f_{X|S}(x|j') P_S(j')}

to perform some action for any incoming observation vector x. This decision mechanism is the result of an optimization process aimed at fulfilling a performance criterion. The criterion is defined by a cost function L(D=i, S=j) that describes the loss the system incurs when it takes the decision D=i while the source is in the state S=j. Since all decisions are taken with regard to the observation vector x, which is only statistically related to the true state S of the source, we can predict (statistically) a Conditional Expected Loss, or Conditional Risk:

    R(D=i | x) = \sum_{j=1}^{M_s} L(D=i, S=j) P_{S|X}(j|x)

The optimal decision is hence the one that leads to the minimum risk; this is the Bayes Minimum-Risk decision rule:

    D = \arg\min_i R(D=i | x)

This rule is proved to minimize the total expected loss Q = E[R(D(X) | X)] over all possible outcomes of the random vector X.

Special cases

If the decision is to guess the state of the source, and the loss function is

    L(D=i, S=j) = 0 if i = j, 1 otherwise,

then the optimal decision rule introduced before simplifies to the Maximum A Posteriori (MAP) decision rule:

    D = \arg\max_i f_{X|S}(x|i) P_S(i)

If the previous conditions hold and the a priori probabilities are all equal (P_S(j) = 1/M_s for all j), the resulting decision rule is called the Maximum Likelihood (ML) decision rule:

    D = \arg\max_i f_{X|S}(x|i)

In general, any decision rule can be expressed in the form

    D = \arg\max_{i=1,...,M_d} g_i(x)

and the g_i are called discriminant functions.
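
As a concrete illustration of the rules above, here is a minimal Python sketch (not part of the textbook; the priors, Gaussian class-conditional densities and loss matrix are made-up example values) that evaluates the conditional risk R(D=i|x) for every decision and picks the minimizer; with a 0/1 loss it reduces to the MAP rule:

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.7, 0.3])                     # P_S(j), assumed values
    loss = np.array([[0.0, 1.0],                      # L(D=i, S=j), 0/1 loss
                     [1.0, 0.0]])

    def likelihoods(x):
        # f_{X|S}(x|j): two assumed Gaussian class-conditional densities
        return np.array([norm.pdf(x, loc=0.0, scale=1.0),
                         norm.pdf(x, loc=2.0, scale=1.0)])

    def bayes_min_risk_decision(x):
        post = priors * likelihoods(x)
        post /= post.sum()                            # Bayes rule: P_{S|X}(j|x)
        risks = loss @ post                           # R(D=i|x) = sum_j L(i,j) P_{S|X}(j|x)
        return int(np.argmin(risks))                  # 0-based index of the minimum-risk decision

    for x in (-1.0, 0.5, 1.5, 3.0):
        print(x, bayes_min_risk_decision(x))          # with 0/1 loss this is the MAP rule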

Exercse. We observe two sequences and we know that one s generated by a human beng and the other by a random-number generator of a computer. There are two possble states of the source: x = { 3 4 5; 4 5 3} S = {, } = {[h, c], [c, h]} Where c stands for computer and h for human beng. The a pror probablty of the states are equally dstrbuted: P S () = P S () = To contnue the soluton of ths problem we have to formulate some assumptons: n the absence of other nformaton t s reasonable to assume that the machne generates unformly dstrbuted numbers, and that any sequence of the knd consdered n ths example has the same probablty of beng generated: P ({ 3 4 5} c) = P ({ 4 5 3} c) = q common sense experence (and perhaps psychologcal arguments) would suggest that the probablty that a human beng generates the sequence { 3 4 5} s hgher than that of generatng the sequence { 4 5 3}. In symbols: P ({ 3 4 5} h) = p ; P ({ 4 5 3} h) = p ; p > p Combnng the events, and assumng that they are ndependent we can wrte: Applyng Bayes rule: P S X (j x) = = P X S (x ) = P ({ 3 4 5; 4 5 3} [h, c]) = p q P X S (x ) = P ({ 3 4 5; 4 5 3} [c, h]) = qp P S (j)p X S (x j) P S ()P X S (x ) + P S ()P X S (x ) qp j q(p + p ) = p j p + p that can be read as the probablty of the state j gven the observaton x. The optmal MAP guess about the source s equvalent to the maxmum lkelhood optmal guess: S opt = arg max P S X (j x) = arg max j j p j = arg max p + p j Accordng to our assumptons on the values of p and p the optmal guess s S = : the human beng has most probably generated the sequence { 3 4 5} whle the machne the sequence { 4 5 3}. p j 3 (9)

Exercse. a) The mnmum error probablty crteron s acheved by consderng the loss functon: {, j L(D =, S = j) = 0, = j Snce the a pror probabltes of the state are unformly dstrbuted, we are n the Maxmum lkelhood case: the decson rule s D = arg max f X S (x ) where f X S (x ) = e (x µ ) σ πσ To smplfy the decson rule I chose to maxmze a monotone ncreasng functon of the argument nstead of the argument tself, for example takng the logarthm: g ln f X S (x ) (x µ ) σ (x µ ) where we smplfy all the constant terms that don t affect the maxmzaton process. Snce the decson mechansm checks whether g s greater or smaller than g, whch are monotone functons of the argument x, ths can be mplemented by a smple threshold x t wth g (x t ) = g (x t ). Substtutng: (x t µ ) = (x t µ ) x t = µ + µ Ths result was predctable when consderng that two Gaussan dstrbutons wth the same 0. f X S (x s )P(S=s ) 0.8 0.6 0.4 0. 0. 0.08 0.06 0.04 x t P E 0.0 µ µ 0 x Fgure. 4 (9)

b) As previously explained (see Chapter 1 and Fig. 2), if we assume µ_1 > µ_2, as in this case, the total probability of error is given by:

    P_E = P_S(1) \int_{-\infty}^{x_t} f_{X|S}(x|1)\,dx + P_S(2) \int_{x_t}^{+\infty} f_{X|S}(x|2)\,dx

Substituting the given values, and exploiting the symmetry:

    P_E = \frac{1}{2}\int_{-\infty}^{0} N(2,1)\,dx + \frac{1}{2}\int_{0}^{+\infty} N(-2,1)\,dx = \int_{-\infty}^{0} N(2,1)\,dx = 1 - \Phi(2) \approx 0.023

For the numerical value refer to BETA (Beta, Mathematics Handbook, Studentlitteratur), p. 405.
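
A one-line numeric check of this value (scipy's normal CDF plays the role of the Φ table in BETA; the means ±2 and unit variance are the values assumed above):

    from scipy.stats import norm

    mu, x_t = 2.0, 0.0
    p_e = 0.5 * norm.cdf(x_t, loc=mu, scale=1.0) + 0.5 * norm.sf(x_t, loc=-mu, scale=1.0)
    print(p_e, 1.0 - norm.cdf(2.0))               # both ~ 0.0228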

Exercise 2.3

We have a signal source with N possible outcomes j = 1, ..., N, governed by the known probabilities P_S(j). There are N + 1 possible decisions: D = j, j = 1, ..., N, meaning "the state was S = j", and D = N + 1, meaning "no decision". The cost function is given in the exercise text as

    L(D=i, S=j) = 0 if i = j (i, j = 1, ..., N);  r if i = N+1 (j = 1, ..., N);  c otherwise,

which sets the cost to 0 if the decision was correct, to c if it was taken but incorrect, and to r if the decision was rejected.

a) The expected cost is by definition R(D=i|x) = \sum_j L(D=i, S=j) P_{S|X}(j|x). To compute it we consider two different cases:

1) the decision is taken (i ≠ N+1):

    R(D=i|x) = c \sum_{j \neq i} P_{S|X}(j|x) = c\,(1 - P_{S|X}(i|x))

The last equality holds because \sum_j P_{S|X}(j|x) = 1. In this case we know that the minimum expected cost is achieved with the decision function

    D = \arg\max_i P_{S|X}(i|x)

2) the decision is not taken (i = N+1):

    R(D=N+1|x) = r \sum_{j=1}^{N} P_{S|X}(j|x) = r

and the decision is D = "no decision".

The last thing to check is which is the best choice between the first and the second case for each x. The decision will not be rejected if, for some i ≠ N+1,

    R(D=i|x) ≤ R(D=N+1|x)  ⟺  c[1 - P_{S|X}(i|x)] ≤ r  ⟺  P_{S|X}(i|x) ≥ 1 - r/c

This way we have proved that the decision function D proposed in the example is optimal.

b) If r = 0, rejecting a decision is free of cost; if c → ∞, a wrong decision would be enormously costly. In both cases it is never worth risking an error: D will always reject the decision. From a mathematical point of view, the condition to accept a decision (no rejection) becomes P_{S|X}(i|x) ≥ 1, which is never satisfied unless the observation x can only be generated by the source state S = i (so that the equality holds) and there is no doubt about the decision to take.

c) If r > c, rejecting a decision will always be more expensive than attempting one: no decision will ever be rejected. From the mathematical point of view, the acceptance condition requires the probability of the state given the observation to be greater than a negative number, which is always satisfied by probabilities:

    P_{S|X}(i|x) ≥ 1 - r/c ≡ -ɛ,    with ɛ > 0

d) For i = 1, ..., N the discriminant functions correspond to the ones in point a), which we know to be optimal. We have to prove that the choice of the decision N+1 leads to the same condition as in point a). Decision N+1 is chosen (\arg\max_i g_i = N+1) if and only if, for all i = 1, ..., N,

    g_{N+1} = \left(1 - \frac{r}{c}\right) \sum_{j=1}^{N} f_{X|S}(x|j) P_S(j) > g_i = f_{X|S}(x|i) P_S(i)

    ⟺  \frac{f_{X|S}(x|i) P_S(i)}{\sum_{j=1}^{N} f_{X|S}(x|j) P_S(j)} < 1 - \frac{r}{c},    i = 1, ..., N,

which, applying Bayes rule, is exactly P_{S|X}(i|x) < 1 - r/c, as we wanted to prove.

e) The three functions g_i are:

    g_1 = P_S(1) f_{X|S}(x|1) = \frac{1}{2}\frac{1}{\sqrt{2\pi}} e^{-(x-1)^2/2}
    g_2 = P_S(2) f_{X|S}(x|2) = \frac{1}{2}\frac{1}{\sqrt{2\pi}} e^{-(x+1)^2/2}
    g_3 = \left(1 - \frac{r}{c}\right)[g_1 + g_2] = \frac{3}{4}[g_1 + g_2]

These functions are plotted in Fig. 3.
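
The discriminant functions of point d), including the rejection discriminant g_{N+1}, are straightforward to implement. The sketch below uses the two-Gaussian example of point e) with the cost ratio r/c = 1/4 implied there:

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.5, 0.5])                 # P_S(1), P_S(2)
    means = np.array([1.0, -1.0])                 # means of the class-conditional Gaussians
    r_over_c = 0.25                               # cost ratio, so 1 - r/c = 3/4

    def discriminants(x):
        g = priors * norm.pdf(x, loc=means, scale=1.0)        # g_i = P_S(i) f_{X|S}(x|i)
        g_reject = (1.0 - r_over_c) * g.sum()                 # g_{N+1} = (1 - r/c) sum_j g_j
        return np.append(g, g_reject)

    def decide(x):
        # Returns 1 or 2 for the two states, 3 for "no decision".
        return int(np.argmax(discriminants(x))) + 1

    for x in (-2.0, -0.3, 0.0, 0.3, 2.0):
        print(x, decide(x))    # |x| < (1/2) ln 3 ~ 0.55 falls in the rejection region (part f)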

[Figure 3. The discriminant functions g_1, g_2 and g_3 over the interval -4 ≤ x ≤ 4.]

f) The decision D = 3 is taken if and only if g_3 > max[g_1, g_2], which happens in the region indicated in the figure. The total probability of rejection is

    P_D(3) = P_{X|S}(-x_0 < x < x_0 | 1) P_S(1) + P_{X|S}(-x_0 < x < x_0 | 2) P_S(2)
           = \frac{1}{2}\int_{-x_0}^{x_0} N(1,1)\,dx + \frac{1}{2}\int_{-x_0}^{x_0} N(-1,1)\,dx

Since the problem is fully symmetric, the two terms are equal and

    P_D(3) = \int_{-x_0}^{x_0} N(-1,1)\,dx = \Phi(x_0 + 1) - \Phi(-x_0 + 1)

The last thing to do is to find the value of x_0, i.e. the value at which g_3 = g_1:

    g_3 = g_1
    \left(1 - \frac{r}{c}\right)\left[N(1,1) + N(-1,1)\right] = N(1,1)
    \left(1 - \frac{r}{c}\right)\frac{1}{\sqrt{2\pi}}\left[e^{-(x-1)^2/2} + e^{-(x+1)^2/2}\right] = \frac{1}{\sqrt{2\pi}}\,e^{-(x-1)^2/2}
    \left(1 - \frac{r}{c}\right)\left(1 + e^{-2x}\right) = 1
    e^{2x} = \frac{c - r}{r}
    x = \frac{1}{2}\ln\frac{c - r}{r}

With the values specified by the problem, x_0 = \frac{1}{2}\ln 3. The total probability of rejection is then

    P_R = P_D(3) = \Phi\left(\tfrac{1}{2}\ln 3 + 1\right) - \Phi\left(-\tfrac{1}{2}\ln 3 + 1\right) = \Phi(1.549) - \Phi(0.451) \approx 0.27,

where the function Φ is tabulated in BETA (Beta, Mathematics Handbook, Studentlitteratur), p. 405.
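
A short numeric cross-check of x_0 and P_R in Python (scipy's norm.cdf again stands in for the tabulated Φ):

    import numpy as np
    from scipy.stats import norm

    r_over_c = 0.25
    x0 = 0.5 * np.log((1.0 - r_over_c) / r_over_c)     # (1/2) ln((c - r)/r) = (1/2) ln 3

    # g_3 and g_1 really cross at x0:
    g1 = 0.5 * norm.pdf(x0, loc=1.0)
    g2 = 0.5 * norm.pdf(x0, loc=-1.0)
    g3 = (1.0 - r_over_c) * (g1 + g2)
    assert np.isclose(g1, g3)

    p_r = norm.cdf(x0 + 1.0) - norm.cdf(-x0 + 1.0)     # total probability of rejection
    print(x0, p_r)                                     # ~ 0.549 and ~ 0.27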

[Figure 4. Rejection is never considered: the discriminant functions g_1, g_2 and g_3 in a case where g_3 stays below max(g_1, g_2) for every x.]

The decision g_3 (rejection) is never chosen if g_3(x) < max[g_1(x), g_2(x)] for all x ∈ R. This is guaranteed if g_3(0) < g_1(0), as is clear from Fig. 4. Since, by symmetry,

    g_3(0) = \left(1 - \frac{r}{c}\right)[g_1(0) + g_2(0)] = 2\left(1 - \frac{r}{c}\right) g_1(0),

rejection is never considered if r > c/2.

Intersection of Two Gaussian Distributions

The problem of finding for which x the joint probability P_{X,S}(x, 0) is greater or smaller than P_{X,S}(x, 1) is common in the exercises seen so far. This corresponds to finding where P_S(0) f_{X|S}(x|0) ≷ P_S(1) f_{X|S}(x|1). In the case of Gaussian distributions N(\mu_i, \sigma_i^2), if we set p_i = P_S(i) with i = 0, 1, the intersection points are:

    x_{1,2} = \frac{\sigma_1^2 \mu_0 - \sigma_0^2 \mu_1 \pm \sigma_0 \sigma_1 \sqrt{(\mu_0 - \mu_1)^2 + 2(\sigma_1^2 - \sigma_0^2)\ln\frac{p_0 \sigma_1}{p_1 \sigma_0}}}{\sigma_1^2 - \sigma_0^2}
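
The closed-form intersection points are easy to verify numerically. The sketch below implements the formula for arbitrary (made-up) parameters and checks that the two weighted densities really agree at the returned points:

    import numpy as np
    from scipy.stats import norm

    def gaussian_intersections(mu0, s0, mu1, s1, p0, p1):
        # Roots of p0 N(x; mu0, s0^2) = p1 N(x; mu1, s1^2), assuming s0 != s1.
        disc = (mu0 - mu1) ** 2 + 2.0 * (s1 ** 2 - s0 ** 2) * np.log(p0 * s1 / (p1 * s0))
        root = s0 * s1 * np.sqrt(disc)
        return (s1 ** 2 * mu0 - s0 ** 2 * mu1 + np.array([root, -root])) / (s1 ** 2 - s0 ** 2)

    # Made-up example parameters:
    mu0, s0, mu1, s1, p0, p1 = 0.0, 1.0, 3.0, 2.0, 0.6, 0.4
    for x in gaussian_intersections(mu0, s0, mu1, s1, p0, p1):
        assert np.isclose(p0 * norm.pdf(x, mu0, s0), p1 * norm.pdf(x, mu1, s1))
        print(x)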

In the special case in which \sigma_0 = \sigma_1 = \sigma, which is the most interesting one in our case (the same noise affects the observations from both states), there is at most one finite solution, and a single threshold on the value of x solves the problem described above:

    x = \frac{\mu_1 + \mu_0}{2} + \frac{\sigma^2}{\mu_0 - \mu_1}\ln\frac{p_1}{p_0}

If the prior probabilities of the source are equal (p_0 = p_1 = 1/2), this reduces to

    x = \frac{\mu_1 + \mu_0}{2}

or, if the means are opposite to each other (\mu_0 = -\mu and \mu_1 = \mu),

    x = -\frac{\sigma^2}{2\mu}\ln\frac{p_1}{p_0}
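
Finally, a quick sketch checking the equal-variance threshold and its two special cases (all parameter values are arbitrary examples):

    import numpy as np
    from scipy.stats import norm

    def equal_variance_threshold(mu0, mu1, sigma, p0, p1):
        # x = (mu1 + mu0)/2 + sigma^2/(mu0 - mu1) * ln(p1/p0)
        return (mu1 + mu0) / 2 + sigma ** 2 / (mu0 - mu1) * np.log(p1 / p0)

    mu0, mu1, sigma, p0, p1 = 2.0, -1.0, 1.5, 0.3, 0.7      # arbitrary example values
    x = equal_variance_threshold(mu0, mu1, sigma, p0, p1)
    assert np.isclose(p0 * norm.pdf(x, mu0, sigma), p1 * norm.pdf(x, mu1, sigma))

    # Equal priors give the midpoint of the means:
    assert np.isclose(equal_variance_threshold(mu0, mu1, sigma, 0.5, 0.5), (mu0 + mu1) / 2)

    # Opposite means mu0 = -mu, mu1 = mu give -(sigma^2 / (2 mu)) ln(p1/p0):
    mu = 1.2
    assert np.isclose(equal_variance_threshold(-mu, mu, sigma, p0, p1),
                      -sigma ** 2 / (2 * mu) * np.log(p1 / p0))
    print("threshold:", x)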