9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov

Size: px

Start display at page:

Download "9.913 Pattern Recognition for Vision. Class IV Part I Bayesian Decision Theory Yuri Ivanov"

Tamsin Blankenship
6 years ago
Views:

1 9.93 Class IV Part I Bayesan Decson Theory Yur Ivanov

2 TOC Roadmap to Machne Learnng Bayesan Decson Makng Mnmum Error Rate Decsons Mnmum Rsk Decsons Mnmax Crteron Operatng Characterstcs

3 Notaton x - scalar varable x - vector varable, sometmes x when clear px ( ) - probablty densty functon (contnuous varable) Px ( ) - probablty mass functon (dscrete varable) p ( abc,,, d e, f, gh, ) denstyof these functonof these - condtonal densty f ( x) d x f( x, x,... x ) dx dx... dx 2 n 2... f( x, x,... x ) dx dx... dx 2 n 2 n n A w > B - f A > B then the answer s ω, otherwse ω 2 w 2

4 Machne Learnng Learnng machne: G x S y LM ŷ = f ( xw, ) G Generator, or Nature: mplements p(x) S Supervsor: mplements p(y x) LM - Learnng Machne: mplements f(x, w)

5 Loss and Rsk L( y, f ( x, w )) - Loss Functon how much penalty we get for devatons from true y z R ( w ) = L( y, f ( x, w )) p( x, y) dxdy - Expected Rsk how much penalty we get on average Goal of learnng s to fnd f(x,w*) such that R(w*) s mnmal.

6 Quck Illustraton What does t mean? Rw ( ) L( y, f ( xw, )) pxydxdy (, ) = From basc probablty: pxy (, ) = py ( xpx ) ( ) If no nose: py ( x ) = d ( ygx, ( )) Rw ( ) = Lgx ( ( ), f ( xw, )) pxdx ( )

7 Illustraton cont. So, Rw ( ) = Lgx ( ( ), f ( xw, )) pxdx ( ) y x

8 Example Classfcaton: Measurements: Labels: x [ -, y = { 0,} ] - 0 x 0 y <= p(x, y) x y = 0 y = Fnd f(x,w*) such that R(w*) s mnmal.

9 Example Let s choose f ( xa, ) = H ( xa, ) - step functon L( y, f ( xa, )) = -d ( y - H ( xa, )) - + for every mstake (supermposed) Y = 0 [ d Y ] Ra ( ) = - ( - H ( xa, )) pxy (, = Y ) dx

10 Learnng n Realty Fundamental problem: where do we get p(x, y)??? What we want: z R ( w ) = L ( y, f ( x, w )) p( x, y ) d xdy Approxmate: estmate rsk functonal by averagng loss over observed (tranng) data. What we get: n R ( w ) R ( w ) = L ( y, f ( x, w )) e n = Replace expected rsk wth emprcal rsk

11 What We Call Learnng Taxonomes of machne learnng: by source of evaluaton Supervsed, Transductve, Unsupervsed, Renforcement by nductve prncple ERM, SRM, MDL, Bayesan estmaton by objectve classfcaton, regresson, vector quantzaton and densty estmaton

12 Taxonomy by Evaluaton Source Supervsed (classfcaton, regresson) Evaluaton source - mmedate error vector, that s, we get to see the true y Transductve Evaluaton source mmedate error vector for SOME of the data Unsupervsed (clusterng, densty estmaton) Evaluaton source - nternal metrc we don t get to see true y Renforcement Evaluaton source - envronment we get to see some scalar value (possbly delayed) that n some way related to whether the label we chose was correct

13 Taxonomy by Inductve Prncple Emprcal Rsk Mnmzaton (ERM) = Structural Rsk Mnmzaton (SRM) w, h = Mnmum Descrpton Length Bayesan Estmaton n mn L ( y, f ( x, w )), f ( x, w ) ` w n n mn( L ( y, f ( x, w )) +F ( h )), n f ( x, w ) `, ` ` `, k h k 2 k n l ( D, H ) = l ( D H ) + l( H ) mn( L ( y, f ( x, w )) + F ( w )) w n z n P ( x C ) = P ( x q ) p ( q C ) d q mn( L ( y, f ( x, w )) + F( P( w))) n = = complexty

14 Taxonomy by Learnng Objectve Classfcaton L( y, f ( x, w )) = - d ( y, f ( x, w)) Regresson L( y, f ( x, w )) = ( y - f ( x, w )) 2 Densty Estmaton L( f ( x, w )) = - log( f ( x, w)) Clusterng/Vector Quantzaton L( f ( x, w )) = ( x - f ( x, w )) ( x - f ( x, w))

15 The Lay of the Land n fnd f ( or w ) = arg mn( L ( y, f ( x, w )) +F) f ( or w ) n = Classfcaton Regresson Densty Estmaton Vector Quantzaton/Clusterng Supervsed Transductve Unsupervsed Renforcement ERM SRM MDL Bayesan Regularzaton

16 Class Prors Makng a decson about observaton x s fndng a rule that says: If x s n regon A, decde a, f x s n regon B, decde b w w = w - state of nature { } C = P ( w) - Pror probablty Shorthand P ( w ) P( w = w ) C = P ( w ) = Poor man s decson rule: Decde w f P ( w ) > P( w2) otherwse w 2

17 Class-Condtonal Densty px ( w ) - class-condtonal densty functon - p ( x w ) dx = px ( w ) - densty for x gven that the nature s n the state w How do we decde whch class x came from? Decdng on ths s not far?

18 Jont Densty p ( x, w ) = p ( x w ) P( w) - jont densty functon Good It s far Bad not very convenent. P ( w ) = 0.2 P ( w 2) = 0.8 Ths relates to the measurement probablstcally, we need a functon!

19 Bayes Rule and Posteror We want a probablty of ω for each value of x: P ( w x) Note that p ( x, w ) = p ( x w ) P ( w ) = P ( w xpx ) ( ) lkelhood pror p ( x w ) P( w ) P ( w x ) = px ( ) posteror evdence Bayes rule Contnuum of bnary dstrbutons

20 Margnalzaton Bayes Rule: how to convert pror to posteror by usng measurements: P ( w x ) = p ( x w ) P( w ) px ( ) What s p(x)? C p ( x ) = p ( x, w ) = p ( x w ) P( w ) C = = Or, more generally: p ( x ) = pxy (, ) dy - margnalzaton

21 Makng Decsons Wth posterors we can compare class probabltes Intutvely: w = arg max p ( w x) ( ) What s the probablty of error? p ( error x ) = p ( w x ) f we choose 2 p ( w x ) f we choose 2 P ( error ) = P ( error, x ) dx = P ( error xpx ) ( ) dx x w w

22 Errors How bad s a decson threshold? Pe ( ) = Px ( R, w ) + Px ( R, w ) 2 2 T* Recall that Pa ( A ) = pada ( ) a A Pe () = px (, w ) dx + px (, w ) dx = x R 2 x R 2 R R 2 = + x R px ( w ) P ( w ) dx px ( w ) P ( w ) dx 2 2 x R 2

23 Bayes Decson Rule To mnmze P ( error ) = P ( error xpx ) ( ) dx we need to make P(error x) as small as possble for all x [ ] [ x ] mn P ( error ) = mn P ( error ) px ( ) dx Bayes error w [ P w x P w x ] = mn ( ), ( 2 ) px ( ) dx Bayes decson rule: Decde w f P ( w x ) > P ( w x); otherwse w 2 2

24 Bayes Decson Rule For Bayes decson rule (back to the st example): R regon where we always choose ω R 2 regon where we always choose ω 2

25 Loss Functon w, w 2,..., w c - set of { } a, a 2,..., a a - set of { } For classes actons L ( a w ) - Loss functon, penalty sustaned for takng We make ths up j an acton α when the state of nature s ω j d x R condtonal rsk for takng an acton α n ω j s the expected (average) loss for classfyng ONE x: c R ( a x ) = L ( a w ) P ( w x) j j j =

26 Bayes Rsk We wll see a lot of x Ø c ø R = R ( a xp ) ( x ) dx = Œ L ( a w j ) P ( w j x) œ px ( ) dx º j = ß Ø ø = Œ L p x œ dx es. To see how well we do, we average agan: c ( a w j ) ( w j, ) º j = ß Ths s exactly the expresson for expected rsk from before Smlarly to the earler argument about P(error): [ ] [ ] * R = R a x) ( ) = mn mn ( px dx R Bayes rsk -

27 Quck Summary L ( a w ) j - loss R ( a x ) = E Ø w x º L( a wj ) ø ß x [ ( a )] R = E R x - condtonal rsk (expected loss) - total rsk (expected cond. rsk) R * = mn [ R] - Bayes rsk (mnmum rsk)

28 Mnmum Rsk Classfcaton Two classes and a smple acton: a l = L( a w ) j j w - decde to choose = { w, w } 2 ω L Ø ø = Œ 3 0 œ º ß Then: R ( a x ) = l P ( w x ) + l2 P ( w 2 x ) R ( a 2 x ) = l2 P ( w x ) + l22 P ( w 2 x ) Obvous decson decde n favor of the class wth mnmal rsk: w R ( a x ) < R ( a x ) 2 w 2

29 Lkelhood Rato Test Rewrtng R(α x) s: w w ( l - l ) P ( w x ) > ( l - l ) P ( w x ) 2 w ( l - l ) p ( x w ) P ( w ) > ( l - l ) p ( x w ) P ( w ) w w ( w ) ( > ( w ) w 2 ( 2 Class model Class pror p x p x l - l ) P ( w ) l - l ) P ( w ) 2 2 Lkelhood Rato Test

30 LRT Example You are drvng to Blockbuster s to return a vdeo due today It s 5 mn to mdnght You ht a red lght You see a car that you 60% sure looks lke a polce car Traffc fne s $5 AND you are late Blockbuster s fne s $0 Should you run the red lght?

31 Mnmum Rsk P ( polce x ) = 0.6 P ( polce x ) = 0.4 You pay run not run polce not polce $5 $0 $0 $0 L 5 0 = Ł 0 0 ł R ( run x ) = l P ( polce x ) + l2 P ( polce x ) = $9 R ( wat x ) = l2 P ( polce x ) + l22 P ( polce x ) = $0 The rsk s hgher f you wat

32 LRT Way OK WAIT Let s say we have ths polceness feature Decson threshold px px w ( w ) ( > ( w ) w 2 ( l - l ) P ( w ) l - l ) P ( w )

33 LRT Example OK WAIT OK WAIT Threshold s dependent on prors px px w ( w ) ( > ( w ) w 2 ( l - l ) P ( w ) l - l ) P ( w )

34 LRT Example L 5 0 = 0 Ł 0 ł L 5-60 = 0 Ł 0 ł OK WAIT OK OK Threshold s dependent on loss WAIT px px w ( w ) ( > ( w ) w 2 ( l - l ) P ( w ) l - l ) P ( w )

35 Mnmum Error Rate Classfcaton Let s smplfy the Mn. Rsk classfcaton: 0 L = 0 Ł ł Then the condtonal rsk becomes: c R ( a x ) = L ( a w ) P ( w x) j j j = = P ( w j x ) = -P( w x) " j - zero-one loss, just counts errors So, ω havng the hghest value of the posteror mnmzes the rsk: P ( w x ) > P ( w x ) " j w j - good ol Bayes decson rule

36 Mnmax Crteron Is there a decson rule such that the rsk s nsenstve to prors? R R [ ( w x) ( w )] = + After some algebra: l P l P x dx 2 2 R 2 [ ( w x) ( w )] + + l P l P x dx RP ( ( w )) = l + ( l - l ) px ( w ) dx + P ( w ) f( T) R Mnmax rsk Goal fnd the decson boundary T, such that f(t) s 0. At the boundary for whch the mnmum rsk s maxmal the rsk s ndependent of prors.

37 Messy Illustraton of the Mnmax Soluton R + P ( w ) f( T) mm -cannot be smaller than Bayes rsk RPw ( ( )) R + P ( w ) f( T) mm Bayes rsk s concave down 0 T : f( T ) 0 T : f( T ) = 0 P( w) Worst possble Bayes rsk s ndependent of prors

38 Dscrmnant Functons and Decson Surfaces Dscrmnant functons convenently represent classfers: C = ( ), ( ),... g ( x)} { g x g 2 x n w : = arg max g ( x) ( ) Eg: g ( x ) =-R ( a x) g ( x ) = P ( w x) g ( x ) = ln p( x w ) + ln P( w ) Dscrmnants DO NOT have to relate to probabltes.

39 Dscrmnant for Bnary Classfcaton For two-class problem: gx ( ) = g - g ( x) ( x ) 2 Then (assumng that classes are encoded as - and +): w = sgn( gx ( )) Eg: gx ( ) = P ( w x ) - P ( w x) 2 px ( w ) P ( w ) gx ( ) = ln + ln px ( w ) P ( w ) 2 2

40 Boundary for Two Normal Dstrbutons If we assume a Gaussan for a class model: px ( w ) = (2 p ) S d / 2 /2 e T - ( x m x m ) - ( - ) S ( - ) 2 and the mnmum error rate classfer: [ ( w ) P( w )] g ( x ) = ln px = =- ( ( ) T - ( ) ) ) d / 2 /2 x - m S x - m - ln Ø p S ø + P w 2 º (2 ß ln ( )

41 Boundary Between Two Normal Dstrbutons Dscrmnant (after some algebra): g ( x ) = g ( x )- g ( x ) == x Wx + wx + w - quadratc where Specal cases: T 2 0 W = w =S m -S m w o = ( - - S ) -S well, the rest of t ) S =S 2 W = 0 g ( x )- lnear 2) s I W Ø ø S = = Œ - œ ( ) 2 s 2s I = s I g x - a crcle º 2 ß - a matrx - a vector - a scalar

42 Evaluatng Decsons Is 70% classfcaton rate good or bad? 70% = BAD 70% = GOOD Dscrmnablty: d ' = m - m 2 Hgh d means that the classes are easy to dscrmnate. s

43 ROC We do not know m, m2, s but we can get: P x x x w * ( > 2 ) P x x x w * ( > ) P x x x w * ( < 2 ) P x x x w * ( < ) - probablty of ht - probablty of false alarm - probablty of mss - probablty of correct rejecton Each x* corresponds to a pont on ht/false_alarm plane. Ths s called an ROC curve

44 ROC Curve

45 Realty In practce, t s done for a sngle parameter Usng the data for whch true ω s known: Identfy a parameter of nterest Identfy the parameter range Vary the parameter wthn the range Compute P(ht) and P(false_alarm) emprcally for each value It tells us how well the classfer can deal wth the data set.

46 Homework Part I A chance puzzle Try to solve t Understand the soluton Smulate n Matlab Buld an ROC curve Almost lke n class Can you mplement t effcently?

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering /

P R. Lecture 4. Theory and Applications of Pattern Recognition. Dept. of Electrical and Computer Engineering / Theory and Applcatons of Pattern Recognton 003, Rob Polkar, Rowan Unversty, Glassboro, NJ Lecture 4 Bayes Classfcaton Rule Dept. of Electrcal and Computer Engneerng 0909.40.0 / 0909.504.04 Theory & Applcatons