Bayes Decision Theory - II

Bayes Decision Theory - II
Ken Kreutz-Delgado (Nuno Vasconcelos)
ECE 175, Winter 2012 - UCSD

Nearest Neighbor Classifier
We are considering supervised classification. The Nearest Neighbor (NN) classifier uses a training set D = {(x_1, y_1), ..., (x_n, y_n)}, where each x_i is a vector of observations and y_i is the corresponding class label, together with a vector x to classify. The NN decision rule is: set y = y_{i*}, where

  i* = argmin_{i ∈ {1, ..., n}} d(x, x_i).

Here argmin means: the i that minimizes the distance.
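
As a concrete illustration of the NN rule above, here is a minimal sketch in Python/NumPy; the Euclidean choice of d(·,·) and all names are mine, not part of the lecture.

```python
import numpy as np

def nn_classify(x, X_train, y_train):
    """NN rule: label x with the class of the closest training point."""
    dists = np.linalg.norm(X_train - x, axis=1)   # d(x, x_i) for every training example
    i_star = np.argmin(dists)                     # i* = argmin_i d(x, x_i)
    return y_train[i_star]

# tiny usage example with a two-point, two-class training set
X_train = np.array([[0.0, 0.0], [2.0, 2.0]])
y_train = np.array([0, 1])
print(nn_classify(np.array([0.3, -0.1]), X_train, y_train))   # -> 0
```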

Optimal Classifiers
We have seen that performance depends on the metric; some metrics are better than others. The meaning of "better" is connected to how well adapted the metric is to the properties of the data. But can we be more rigorous? What do we mean by "optimal"? To talk about optimality we define a cost or loss: the predictor maps an observation x to a prediction ŷ = f(x), and the loss L(y, ŷ) is the function that we want to minimize. The loss depends on the true label y and the prediction ŷ, and tells us how good our predictor is.

Loss Functions & Classification Errors
Loss is a function of classification errors. What errors can we have? Two types: false positives and false negatives. Consider a face detection problem (decide "face" or "non-face"): if you see a non-face and say "face", you have a false positive (false alarm); if you see a face and say "non-face", you have a false negative (miss, failure to detect). Obviously, we have corresponding sub-classes for non-errors: true positives and true negatives. The positive/negative part reflects what we say or decide; the true/false part reflects the true class label (the "true state of the world").

(Conditional) Risk
To weigh different errors differently, we introduce a loss function. Denote the cost of classifying an X from class i as j by L[i → j]. One way to measure how good the classifier is, is to use the (data-conditional) expected value of the loss, aka the (conditional) risk:

  R(x, i) = E{ L[Y → i] | x } = ∑_j L[j → i] P_{Y|X}(j|x).

Note that the (data-conditional) risk is a function of both the decision "decide class i" and the conditioning data (measured feature vector) x.
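
As a sketch of how this quantity can be computed in practice, the snippet below stores the loss as a matrix L[j, i] (cost of deciding i when the true class is j) and the posterior P_{Y|X}(·|x) as a vector; the function names are illustrative, not from the lecture.

```python
import numpy as np

def conditional_risk(L, posterior):
    """R(x, i) = sum_j L[j -> i] P_{Y|X}(j|x), returned for every decision i."""
    # L: (c, c) array with L[j, i] = cost of deciding i when the true class is j
    # posterior: (c,) array with posterior[j] = P_{Y|X}(j|x)
    return posterior @ L

def bayes_decision(L, posterior):
    """Bayes decision rule: pick the decision with minimal conditional risk."""
    return int(np.argmin(conditional_risk(L, posterior)))
```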

Loss Functions
Example: two snakes and eating poisonous dart frogs. The regular snake will die if it eats a dart frog; frogs are a good snack for the predator dart-snake. This leads to the losses

  Regular snake        dart frog   regular frog
  decide "regular"        ∞            0
  decide "dart"           0           10

  Predator snake       dart frog   regular frog
  decide "regular"       10            0
  decide "dart"           0           10

What is the optimal decision when the snakes find a frog like these?

Minimum Risk Classification
We have seen that if both snakes have

  P_{Y|X}(j|x) = 0 for j = dart, 1 for j = regular,

then both say "regular". However, if

  P_{Y|X}(j|x) = 0.1 for j = dart, 0.9 for j = regular,

then the vulnerable snake says "dart" while the predator says "regular". (For the vulnerable snake, R(x, regular) = 0.1·∞ = ∞ while R(x, dart) = 0.9·10 = 9; for the predator, R(x, regular) = 0.1·10 = 1 < 9.) Its infinite loss for saying "regular" when the frog is a dart frog makes the vulnerable snake much more cautious!
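
The same comparison, spelled out as a short computation (a sketch; the loss tables follow the slide above, with NumPy's inf standing in for the vulnerable snake's infinite loss):

```python
import numpy as np

# rows: true class (dart frog, regular frog); columns: decision ("regular", "dart")
L_vulnerable = np.array([[np.inf,  0.0],    # true dart frog
                         [   0.0, 10.0]])   # true regular frog
L_predator   = np.array([[  10.0,  0.0],
                         [   0.0, 10.0]])

posterior = np.array([0.1, 0.9])            # P(dart|x) = 0.1, P(regular|x) = 0.9

for name, L in [("vulnerable snake", L_vulnerable), ("predator snake", L_predator)]:
    risks = posterior @ L                   # R(x, "regular"), R(x, "dart")
    decision = ["regular", "dart"][int(np.argmin(risks))]
    print(name, risks, "->", decision)
# vulnerable snake: risks [inf, 9.0] -> "dart"; predator snake: risks [1.0, 9.0] -> "regular"
```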

BDR = Minimizing Conditional Risk
Note that the definition of risk immediately defines the optimal classifier as the one that minimizes the conditional risk for a given observation x. The optimal decision is the Bayes Decision Rule (BDR):

  i*(x) = argmin_i R(x, i) = argmin_i ∑_j L[j → i] P_{Y|X}(j|x).

The BDR yields the optimal (minimal) risk:

  R*(x) = R(x, i*) = min_i ∑_j L[j → i] P_{Y|X}(j|x).

What is a Decision Rule?
Consider the c-ary classification problem with class labels i ∈ {1, ..., c}. Given an observation (feature) x to be classified, a decision rule is a function d = d(·) of the observation that takes its values in the set of class labels, d(x) ∈ {1, ..., c}. Note that d*(x) = i*(x), defined on the previous slide, is an optimal decision rule in the sense that, for a specific value of x, it minimizes the conditional risk R(x, i) over all possible decisions i ∈ {1, ..., c}.

(d-dependent) Total Average Risk
Given a decision rule d and the conditional risk R(x, i), we can consider the (d-dependent) conditional risk R(x, d(x)). We can now define the total (d-dependent) expected or average risk (aka the d-risk):

  R(d) = E_X{ R(x, d(x)) }.

Note that we have averaged over all possible measurements (features) x that we might encounter in the world. Note also that R(d) is a function of a function (a function of d)! The d-risk R(d) is a measure of how we expect to perform on average when we use the fixed decision rule d over and over again on a large set of real-world data. It is natural to ask if there is an optimal decision rule which minimizes the average risk R(d) over the class of all possible decision rules.

Minimizing the Average Risk R(d)
Optimizing the total risk R(d) seems hard, because we are trying to minimize it over a family of functions (decision rules) d. However, since

  R(d) = E_X{ R(x, d(x)) } = ∫ R(x, d(x)) P_X(x) dx,

one can equivalently minimize the data-conditional risk R(x, d(x)) point-wise in x, i.e. solve for the value of the optimal decision rule at each x:

  d*(x) = argmin_{d(x)} R(x, d(x)) = argmin_i R(x, i).

Thus d*(x) = i*(x)!! I.e. the BDR, which we already know optimizes the data-conditional risk, ALSO optimizes the average risk R(d) over ALL possible decision rules d!! This makes sense: if the BDR is optimal for every single situation x, it must be optimal on average over all x.

The 0/1 Loss Function
An important special case of interest: zero loss for no error and equal loss for the two error types. This is equivalent to the zero/one loss:

  L[j → i] = 0 if i = j, and 1 if i ≠ j.

For the snake prediction example:

  snake prediction     dart frog   regular frog
  decide "regular"         1            0
  decide "dart"            0            1

Under this loss the optimal Bayes decision rule (BDR) is

  i*(x) = d*(x) = argmin_i ∑_j L[j → i] P_{Y|X}(j|x) = argmin_i ∑_{j≠i} P_{Y|X}(j|x).

0/1 Loss Yields MAP Decision Rule
Note that:

  i*(x) = argmin_i ∑_{j≠i} P_{Y|X}(j|x) = argmin_i [ 1 − P_{Y|X}(i|x) ] = argmax_i P_{Y|X}(i|x).

Thus the optimal decision for the 0/1 loss is: pick the class that is most probable given the observation x. i*(x) is known as the Maximum a Posteriori Probability (MAP) solution. This is also known as the Bayes Decision Rule (BDR) for the 0/1 loss. We will often simplify our discussion by assuming this loss, but you should always be aware that other losses may be used.

BDR for the 0/1 Loss
Consider the evaluation of the BDR for the 0/1 loss,

  i*(x) = argmax_i P_{Y|X}(i|x).

This is also called the Maximum a Posteriori Probability (MAP) rule. It is usually not trivial to evaluate the posterior probabilities P_{Y|X}(i|x). This is due to the fact that we are trying to infer the cause (class i) from the consequence (observation x), i.e. we are trying to solve a non-trivial inverse problem. E.g., imagine that I want to evaluate P_{Y|X}(person | "has two eyes"). This strongly depends on what the other classes are.

Posterior Probabilities and Detection
If the two classes are people and cars, then P_{Y|X}(person | "has two eyes") = 1. But if the classes are people and cats, then P_{Y|X}(person | "has two eyes") = ½, if there are equal numbers of cats and people to uniformly choose from [this is additional info!]. How do we deal with this problem? We note that it is much easier to infer the consequence from the cause. E.g., it is easy to infer that P_{X|Y}("has two eyes" | person) = 1. This does not depend on any other classes; we do not need any additional information. Given a class, just count the frequency of the observation.

Bayes Rule
How do we go from P_{X|Y}(x|j) to P_{Y|X}(j|x)? We use Bayes rule:

  P_{Y|X}(i|x) = P_{X|Y}(x|i) P_Y(i) / P_X(x).

Consider the two-class problem, i.e. Y = 0 or Y = 1. The BDR under 0/1 loss is

  i*(x) = argmax_i P_{Y|X}(i|x) = { 0, if P_{Y|X}(0|x) ≥ P_{Y|X}(1|x);  1, if P_{Y|X}(0|x) < P_{Y|X}(1|x) }.
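
A minimal sketch of this inversion (variable names are mine): class-conditional likelihoods and priors in, posteriors out.

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes rule: P_{Y|X}(i|x) = P_{X|Y}(x|i) P_Y(i) / P_X(x)."""
    joint = np.asarray(likelihoods) * np.asarray(priors)   # P_{X|Y}(x|i) P_Y(i)
    return joint / joint.sum()                             # divide by P_X(x)

# two-class example: P(x|0) = 0.3, P(x|1) = 0.1, equal priors
print(posteriors([0.3, 0.1], [0.5, 0.5]))   # -> [0.75, 0.25], so the BDR picks class 0
```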

BDR for 0/1 Loss: Binary Classification
Pick 0 when P_{Y|X}(0|x) ≥ P_{Y|X}(1|x), and 1 otherwise. Using Bayes rule on both sides of this inequality yields

  P_{X|Y}(x|0) P_Y(0) / P_X(x) ≥ P_{X|Y}(x|1) P_Y(1) / P_X(x).

Noting that P_X(x) is a non-negative quantity, this is the same as the rule: pick 0 when

  P_{X|Y}(x|0) P_Y(0) ≥ P_{X|Y}(x|1) P_Y(1),

i.e.

  i*(x) = argmax_i P_{X|Y}(x|i) P_Y(i).

The Log Trick
Sometimes it is not convenient to work directly with pdfs. One helpful trick is to take logs. Note that the log is a monotonically increasing function: a ≥ b ⇔ log a ≥ log b. From this we have

  i*(x) = argmax_i P_{X|Y}(x|i) P_Y(i)
        = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]
        = argmin_i [ −log P_{X|Y}(x|i) − log P_Y(i) ].

Standard (0/1) BDR
In summary, for the zero/one loss, the following three decision rules are optimal and equivalent:

  1) i*(x) = argmax_i P_{Y|X}(i|x)
  2) i*(x) = argmax_i P_{X|Y}(x|i) P_Y(i)
  3) i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

Form 1) is usually the hardest to use; 3) is frequently easier than 2).
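
The toy check below (one-dimensional Gaussian class-conditionals with made-up parameters, not from the lecture) confirms that forms 2) and 3) pick the same class; form 1) would as well, since it only rescales 2) by P_X(x).

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

mus, sigmas, priors = [0.0, 3.0], [1.0, 2.0], [0.7, 0.3]
x = 1.8

lik = np.array([gauss_pdf(x, m, s) for m, s in zip(mus, sigmas)])
form2 = np.argmax(lik * np.array(priors))            # argmax_i P_{X|Y}(x|i) P_Y(i)
form3 = np.argmax(np.log(lik) + np.log(priors))      # argmax_i log P_{X|Y}(x|i) + log P_Y(i)
assert form2 == form3
print(int(form2))
```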

(Standard 0/1-Loss) BDR - Example
So far the BDR is an abstract rule. How does one implement the optimal decision in practice? In addition to having a loss function, you need to know, model, or estimate the probabilities! Example: suppose that you run a gas station. On Mondays you have a promotion to sell more gas. Q: is the promotion working? I.e., is Y = 0 (no) or Y = 1 (yes)? A good observation to answer this question is the inter-arrival time t between cars: high t: not working (Y = 0); low t: working well (Y = 1).

BDR - Example
What are the class-conditional and prior probabilities? Model the probability of arrival of a car by an exponential density (a standard pdf to use): continuous-valued inter-arrival times are assumed to be exponentially distributed. Hence

  P_{X|Y}(t|i) = λ_i e^{−λ_i t},

where λ_i is the arrival rate (cars/s). The expected value of the inter-arrival time is E_{X|Y}[x|i] = 1/λ_i. Consecutive times are assumed to be independent:

  P_{X_1,...,X_n | Y}(t_1, ..., t_n | i) = ∏_{k=1}^n P_{X|Y}(t_k|i) = ∏_{k=1}^n λ_i e^{−λ_i t_k}.

BDR - Example
Let's assume that we know the λ_i and the (prior) class probabilities P_Y(i) = p_i, i = 0, 1, and that we have measured a collection of n times during the day, D = {t_1, ..., t_n}. The probabilities are of exponential form, therefore it is easier to use the log-based BDR:

  i*(D) = argmax_i [ log P_{X|Y}(D|i) + log P_Y(i) ]
        = argmax_i [ ∑_{k=1}^n log(λ_i e^{−λ_i t_k}) + log p_i ]
        = argmax_i [ ∑_{k=1}^n (−λ_i t_k + log λ_i) + log p_i ]
        = argmax_i [ −λ_i ∑_{k=1}^n t_k + n log λ_i + log p_i ].

BDR - Example
This means we pick 0 when

  −λ_0 ∑_k t_k + n log λ_0 + log p_0 ≥ −λ_1 ∑_k t_k + n log λ_1 + log p_1,

or

  (λ_1 − λ_0) ∑_{k=1}^n t_k ≥ n log(λ_1/λ_0) + log(p_1/p_0),

or (reasonably taking λ_1 > λ_0)

  (1/n) ∑_{k=1}^n t_k ≥ [ 1/(λ_1 − λ_0) ] [ log(λ_1/λ_0) + (1/n) log(p_1/p_0) ],

and 1 otherwise. Does this decision rule make sense? Let's assume, for simplicity, that p_0 = p_1 = 1/2.
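
A small sketch of this rule as code; the rates, priors, and measured times are made up for illustration.

```python
import numpy as np

lam0, lam1 = 1.0, 3.0                         # arrival rates for "not working" / "working"
p0, p1 = 0.5, 0.5                             # priors P_Y(0), P_Y(1)
times = np.array([0.9, 0.4, 1.3, 0.6, 0.8])   # inter-arrival times measured during the day
n = len(times)

# pick Y = 0 ("did not work") when the daily average exceeds the threshold
threshold = (np.log(lam1 / lam0) + np.log(p1 / p0) / n) / (lam1 - lam0)
decision = 0 if times.mean() >= threshold else 1
print(times.mean(), threshold, decision)      # 0.8 >= ~0.55, so Y = 0 here
```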

BDR - Example
For p_0 = p_1 = ½, we pick "promotion did not work" (Y = 0) if

  (1/n) ∑_{k=1}^n t_k ≥ T,   with T = [ 1/(λ_1 − λ_0) ] log(λ_1/λ_0).

The left-hand side is the (sample) average inter-arrival time for the day. This means that there is an optimal choice of a threshold T above which we say "promotion did not work". This makes sense! What is the shape of this threshold? [Figure: T plotted as a function of λ_1, assuming λ_0 = 1.] The higher λ_1, the more likely we are to say "promotion did not work".

BDR - Example
When p_0 = p_1 = ½, we pick "did not work" (Y = 0) when

  (1/n) ∑_{k=1}^n t_k ≥ T,   with T = [ 1/(λ_1 − λ_0) ] log(λ_1/λ_0).

Assuming λ_0 = 1, T decreases with λ_1. I.e., for a given daily average, the larger λ_1, the easier it is to say "did not work". This means that as the expected rate of arrival for good days increases, we impose a tougher standard on the average measured inter-arrival times: the average has to be smaller for us to accept the day as a good one. Once again, this makes sense! A sensible answer is usually the case with the BDR (a good way to check your math).

The Gaussian Classifier
One important case is that of multivariate Gaussian classes. The pdf of class i is a Gaussian of mean μ_i and covariance Σ_i:

  P_{X|Y}(x|i) = [ 1 / √((2π)^d |Σ_i|) ] exp{ −½ (x − μ_i)^T Σ_i^{−1} (x − μ_i) }.

The BDR is

  i*(x) = argmax_i { −½ (x − μ_i)^T Σ_i^{−1} (x − μ_i) − ½ log((2π)^d |Σ_i|) + log P_Y(i) }.

Implementation of a Gaussian Classifier
To design a Gaussian classifier (e.g. homework): start from a collection of datasets, where the i-th class dataset D^(i) = {x_1^(i), ..., x_{n_i}^(i)} is a set of n_i examples from class i. For each class, estimate the Gaussian parameters:

  μ̂_i = (1/n_i) ∑_j x_j^(i),
  Σ̂_i = (1/n_i) ∑_j (x_j^(i) − μ̂_i)(x_j^(i) − μ̂_i)^T,
  P̂_Y(i) = n_i / n,

where n = ∑_k n_k is the total number of examples over all c classes. Via the "plug-in" rule, the BDR is approximated as

  i*(x) = argmax_i { −½ (x − μ̂_i)^T Σ̂_i^{−1} (x − μ̂_i) − ½ log((2π)^d |Σ̂_i|) + log P̂_Y(i) }.
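
A compact sketch of this recipe (function and variable names are mine): estimate the per-class means, covariances, and priors, then evaluate the plug-in BDR.

```python
import numpy as np

def fit_gaussian_classifier(X, y):
    """Estimate (mu_i, Sigma_i, P_Y(i)) for each class i from labeled data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sigma = (Xc - mu).T @ (Xc - mu) / len(Xc)     # ML covariance estimate
        params[c] = (mu, Sigma, len(Xc) / len(X))     # prior estimate n_i / n
    return params

def gaussian_bdr(x, params):
    """Plug-in BDR: argmax_i of Gaussian log-likelihood plus log prior."""
    scores = {}
    for c, (mu, Sigma, prior) in params.items():
        diff = x - mu
        scores[c] = (-0.5 * diff @ np.linalg.solve(Sigma, diff)
                     - 0.5 * np.log(np.linalg.det(2.0 * np.pi * Sigma))
                     + np.log(prior))
    return max(scores, key=scores.get)
```

Note that log det(2πΣ) equals log((2π)^d |Σ|), so the score matches the plug-in expression above.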

Gaussian Classifier
The Gaussian classifier can be written as

  i*(x) = argmin_i [ d_i²(x, μ_i) + α_i ],

with

  d_i²(x, y) = (x − y)^T Σ_i^{−1} (x − y),   α_i = log((2π)^d |Σ_i|) − 2 log P_Y(i),

and can be seen as a nearest class-neighbor classifier with a "funny metric": each class has its own distance measure (sum the Mahalanobis-squared distance for that class, then add the constant α_i). We effectively have different metrics in the data (feature) space that are class dependent.

Gaussian Classifier
A special case of interest is when all classes have the same covariance, Σ_i = Σ:

  i*(x) = argmin_i [ d²(x, μ_i) + α_i ],

with

  d²(x, y) = (x − y)^T Σ^{−1} (x − y),   α_i = −2 log P_Y(i).

Note that α_i can be dropped when all classes have equal prior probability. This is reminiscent of the NN classifier with Mahalanobis distance: instead of finding the nearest data-point neighbor of x, it looks for the nearest class "prototype" (or archetype, exemplar, template, representative, ideal, or form), defined as the class mean μ_i.
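
A sketch of this shared-covariance case (illustrative names): classify x by the class mean that is closest in Mahalanobis distance, corrected by the prior term α_i.

```python
import numpy as np

def nearest_prototype(x, means, Sigma, priors):
    """argmin_i (x - mu_i)^T Sigma^{-1} (x - mu_i) - 2 log P_Y(i)."""
    Sigma_inv = np.linalg.inv(Sigma)
    costs = []
    for mu, p in zip(means, priors):
        diff = x - mu
        costs.append(diff @ Sigma_inv @ diff - 2.0 * np.log(p))   # d^2(x, mu_i) + alpha_i
    return int(np.argmin(costs))

means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
print(nearest_prototype(np.array([2.0, 0.5]), means, Sigma, [0.5, 0.5]))   # -> 1
```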

Binary Classifier Special Case
Consider Σ_i = Σ with two classes. One important property of this case is that the decision boundary is a hyperplane (homework). This can be shown by computing the set of points x such that

  d²(x, μ_0) + α_0 = d²(x, μ_1) + α_1

and showing that they satisfy

  w^T (x − x_0) = 0.

This is the equation of a hyperplane with normal w. x_0 can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case w and x_0 are parallel. [Figure: sample points x_1, x_2, x_3 from the two classes and the separating hyperplane with normal w through x_0.]

Gaussian M-ary Classifier Special Case
If all the class covariances are the identity, Σ_i = I, then

  i*(x) = argmin_i [ d²(x, μ_i) + α_i ],

with

  d²(x, y) = ||x − y||²,   α_i = −2 log P_Y(i).

This is called (simple, Cartesian) template matching with the class means as templates, e.g. for digit classification. [Figure: digit class templates.] Compare the complexity of this classifier to NN classifiers!

The Sigmoid Function
We have derived much of the above from the log-based BDR:

  i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ].

When there are only two classes, i = 0, 1, it is also interesting to manipulate the original definition as follows:

  i*(x) = argmax_i g_i(x),

where

  g_i(x) = P_{Y|X}(i|x) = P_{X|Y}(x|i) P_Y(i) / P_X(x),
  P_X(x) = P_{X|Y}(x|0) P_Y(0) + P_{X|Y}(x|1) P_Y(1).

The Sigmoid Function
Note that this can be written as

  i*(x) = argmax_i g_i(x),   g_1(x) = 1 − g_0(x),

  g_0(x) = 1 / ( 1 + [ P_{X|Y}(x|1) P_Y(1) ] / [ P_{X|Y}(x|0) P_Y(0) ] ).

For Gaussian classes, the posterior probabilities are

  g_0(x) = 1 / ( 1 + exp{ ½ [ d_0²(x, μ_0) − d_1²(x, μ_1) + α_0 − α_1 ] } ),

where, as before,

  d_i²(x, y) = (x − y)^T Σ_i^{−1} (x − y),   α_i = log((2π)^d |Σ_i|) − 2 log P_Y(i).

The Sigmoid ("S-shaped") Function
The posterior pdf for class i = 0,

  g_0(x) = 1 / ( 1 + exp{ ½ [ d_0²(x, μ_0) − d_1²(x, μ_1) + α_0 − α_1 ] } ),

is a sigmoid and looks like this: [Figure: an S-shaped curve; the decision boundary is where g_0(x) = 0.5.]
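
A sketch of this posterior for two Gaussian classes (names are mine; the ½ factor follows from the definitions of d_i² and α_i used above):

```python
import numpy as np

def class0_posterior(x, mus, Sigmas, priors):
    """g_0(x) = 1 / (1 + exp{ 0.5 [ (d_0^2 + a_0) - (d_1^2 + a_1) ] })."""
    q = []
    for mu, Sigma, p in zip(mus, Sigmas, priors):
        diff = x - mu
        d2 = diff @ np.linalg.solve(Sigma, diff)                        # Mahalanobis squared
        alpha = np.log(np.linalg.det(2.0 * np.pi * Sigma)) - 2.0 * np.log(p)
        q.append(d2 + alpha)
    return 1.0 / (1.0 + np.exp(0.5 * (q[0] - q[1])))

# halfway between two equal-covariance, equal-prior classes the posterior is 0.5
mus = [np.array([0.0]), np.array([2.0])]
Sigmas = [np.eye(1), np.eye(1)]
print(class0_posterior(np.array([1.0]), mus, Sigmas, [0.5, 0.5]))   # -> 0.5
```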

The Sigmoid Function in Neural Nets
The sigmoid appears in neural networks, where it can be interpreted as a posterior pdf for a Gaussian binary classification problem when the covariances are the same.

The Sigmoid Function in Neural Nets
But not necessarily when the covariances are different.

END