CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 3
1 CS434a/541a: Pattern Recognition. Prof. Olga Veksler. Lecture 3
2 Announcements Link to errata for the textbook Reading assignment Assignment 1 handed out, due Oct. 4 Please send me an email with your name and course number in the subject and, in the body: whether you are a graduate or an undergraduate student, your department, and, if undergraduate, the year you are in
3 Announcements Information session on "HOW TO APPLY FOR EXTERNAL AWARDS" Location: Social Science Centre, Room 036 Tuesday, September 1, 2004, from 4 pm to 5 pm (Lutfiyya)
4 Today Finish Matlab Introduction Course Roadmap Change in Notation (for consistency with the textbook) Conditional distributions (forgot to review) Bayesian Decision Theory Two-category classification Multiple-category classification Discriminant Functions
5 Course Road Map (figure: a spectrum from "a lot is known / easier" to "little is known / harder")
6 Bayesian Decision Theory Know the probability distribution of the categories Never happens in the real world Do not even need training data Can design the optimal classifier Example: a respected fish expert says that salmon length has distribution N(5,1) and sea bass length has distribution N(10,4)
7 ML and Bayesian Parameter Estimation The shape of the probability distribution is known Happens sometimes Have labeled training data (salmon, bass, salmon, salmon, ...) Need to estimate the parameters of the probability distribution from the training data Example: a respected fish expert says salmon length has distribution N(µ1, σ1²) and sea bass length has distribution N(µ2, σ2²) Need to estimate the parameters µ1, σ1, µ2, σ2 Then can use the methods from Bayesian decision theory
8 Linear Discriminant Functions and Neural Nets No probability distribution (no shape or parameters are known) Have labeled data (salmon, bass, salmon, salmon, ...) The shape of the discriminant function is known (e.g. a linear discriminant function separating salmon from bass in the length–lightness feature space) Need to estimate the parameters of the discriminant function (the parameters of the line, in the case of a linear discriminant)
9 Non-Parametric Methods Neither the probability distribution nor the discriminant function is known Happens quite often All we have is labeled data (salmon, bass, salmon, salmon, ...) Estimate the probability distribution from the labeled data
10 Unsupervised Learning and Clustering Data is not labeled Happens quite often 1. Estimate the probability distribution from the unlabeled data 2. Cluster the data
11 Course Road Map 1. Bayesian decision theory (rare case): know the probability distribution of the categories, do not even need training data, can design the optimal classifier 2. ML and Bayesian parameter estimation: need to estimate the parameters of the probability distribution, need training data 3. Non-parametric methods: no probability distribution, labeled data 4. Linear discriminant functions and neural nets: the shape of the discriminant functions is known, need to estimate the parameters of the discriminant functions 5. Unsupervised learning and clustering: no probability distribution and unlabeled data
12 Notation Change (for consistency with the textbook) Pr[A] — probability of event A P(x) — probability mass function of discrete r.v. X p(x) — probability density function of continuous r.v. X p(x,y) — joint probability density of r.v.s X and Y p(x|y) — conditional density of X given Y P(x|y) — conditional mass of X given Y
13 More on Probability For events A and B, we have defined conditional probability Pr(A|B) = Pr(A ∩ B) / Pr(B) Law of total probability: Pr(A) = Σ_{k=1..n} Pr(A|B_k) Pr(B_k) Bayes rule: Pr(B_i|A) = Pr(A|B_i) Pr(B_i) / Σ_{k=1..n} Pr(A|B_k) Pr(B_k) Usually we model with random variables, not events. Need equivalents of these laws for mass and density functions (we could go from random variables back to events, but that is time consuming)
14 Conditional Mass Function: Discrete RV For a discrete RV nothing is new, because the mass function really is a probability law Define the conditional mass function of X given Y = y by P(x|y) = P(x,y) / P(y) (y is fixed) This is a probability mass function because Σ_x P(x|y) = Σ_x P(x,y) / P(y) = P(y) / P(y) = 1 This is really nothing new because P(x|y) = P(x,y) / P(y) = Pr[X = x, Y = y] / Pr[Y = y] = Pr[X = x | Y = y]
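The definition above can be checked numerically. Below is a minimal Python sketch (the course itself uses Matlab); the joint table is hypothetical, with numbers chosen to agree with the cats-and-dogs slide later in the lecture, and `marginal` and `cond` are illustrative helper names, not anything from the course.

```python
# Hypothetical joint mass P(x, y) over ear size and animal.
joint = {
    ("small", "dog"): 0.05, ("large", "dog"): 0.45,
    ("small", "cat"): 0.40, ("large", "cat"): 0.10,
}

def marginal(y):
    # P(y) = sum over x of P(x, y)
    return sum(p for (_, y2), p in joint.items() if y2 == y)

def cond(x, y):
    # conditional mass P(x | y) = P(x, y) / P(y)
    return joint[(x, y)] / marginal(y)

# For fixed y, the conditional mass is a genuine mass function: it sums to 1.
total = cond("small", "dog") + cond("large", "dog")
print(cond("large", "dog"), total)
```

Running this shows P(large ears | dog) = 0.9, and the conditional masses for a fixed y sum to 1, as the slide argues.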
15 Conditional Mass Function: Bayes Rule Law of total probability: P(x) = Σ_y P(x,y) = Σ_y P(x|y) P(y) Bayes rule: P(y|x) = P(y,x) / P(x) = P(x|y) P(y) / Σ_y' P(x|y') P(y')
16 Conditional Density Function: Continuous RV Does it make sense to talk about the conditional density p(x|y) if Y is a continuous random variable? After all, Pr[Y = y] = 0, so we will never see Y = y in practice Measurements have limited accuracy. Can interpret observation y as an observation in the interval [y-ε, y+ε], and observation x as an observation in the interval [x-ε, x+ε]
17 Conditional Density Function: Continuous RV Let B(x) denote the interval [x-ε, x+ε] Pr[X ∈ B(x)] = ∫_{x-ε}^{x+ε} p(x) dx ≈ 2ε p(x) Similarly Pr[Y ∈ B(y)] ≈ 2ε p(y) and Pr[X ∈ B(x), Y ∈ B(y)] ≈ 4ε² p(x,y) Thus we should have p(x|y) ≈ Pr[X ∈ B(x) | Y ∈ B(y)] / (2ε) = Pr[X ∈ B(x), Y ∈ B(y)] / (2ε Pr[Y ∈ B(y)]) which simplifies to p(x|y) ≈ p(x,y) / p(y)
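The key approximation Pr[X ∈ B(x)] ≈ 2ε p(x) can be checked numerically. Here is a small Python sketch (not from the lecture): it compares a midpoint-rule integral of the standard normal density over [x-ε, x+ε] against 2ε p(x); `prob_interval` is an illustrative name of my own.

```python
import math

def p(x):
    # standard normal N(0,1) density, assumed here just as a concrete example
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def prob_interval(x, eps, n=10000):
    # crude midpoint-rule approximation of Pr[X in [x-eps, x+eps]]
    h = 2 * eps / n
    return sum(p(x - eps + (i + 0.5) * h) for i in range(n)) * h

x, eps = 0.5, 1e-3
exact = prob_interval(x, eps)
approx = 2 * eps * p(x)
print(abs(exact - approx) < 1e-8)   # True: the approximation is tight for small ε
```

The error of the 2ε p(x) approximation shrinks like ε³, which is why the slide can treat it as an equality in the limit.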
18 Conditional Density Function: Continuous RV Define the conditional density function of X given Y = y by p(x|y) = p(x,y) / p(y) (y is fixed) This is a probability density function because ∫ p(x|y) dx = ∫ p(x,y) / p(y) dx = p(y) / p(y) = 1 Law of total probability: p(x) = ∫ p(x,y) dy = ∫ p(x|y) p(y) dy
19 Conditional Density Function: Bayes Rule Bayes rule: p(y|x) = p(y,x) / p(x) = p(x|y) p(y) / ∫ p(x|y') p(y') dy'
20 Mixed Discrete and Continuous X discrete, Y continuous — Bayes rule: P(x|y) = p(y|x) P(x) / p(y) X continuous, Y discrete — Bayes rule: p(x|y) = P(y|x) p(x) / P(y)
21 Bayesian Decision Theory Know the probability distribution of the categories Almost never the case in real life! Nevertheless useful, since the other cases can be reduced to this one after some work Do not even need training data Can design the optimal classifier
22 Cats and Dogs Suppose we have these conditional probability mass functions for cats and dogs: P(small ears | dog) = 0.1, P(large ears | dog) = 0.9 P(small ears | cat) = 0.8, P(large ears | cat) = 0.2 Observe an animal with large ears. Dog or cat? Makes sense to say dog, because the probability of observing large ears on a dog is much larger than the probability of observing large ears on a cat: Pr[large ears | dog] = 0.9 > 0.2 = Pr[large ears | cat] We choose the event of larger probability, i.e. the maximum likelihood event
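The maximum likelihood decision on this slide is a one-liner in code. A minimal Python sketch (the course uses Matlab; `ml_class` is an illustrative name, not from the lecture), with the slide's own numbers:

```python
# Conditional mass functions from the slide: P(observation | class).
cond_mass = {
    "dog": {"small": 0.1, "large": 0.9},
    "cat": {"small": 0.8, "large": 0.2},
}

def ml_class(observation):
    # maximum likelihood: pick the class with the largest P(observation | class)
    return max(cond_mass, key=lambda c: cond_mass[c][observation])

print(ml_class("large"))   # dog  (0.9 > 0.2)
print(ml_class("small"))   # cat  (0.8 > 0.1)
```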
23 Example: Fish Sorting A respected fish expert says that Salmon length has distribution N(5,1) Sea bass length has distribution N(10,4) Recall that if an r.v. is N(µ,σ²), then its density is p(l) = 1/(√(2π) σ) exp(-(l-µ)²/(2σ²)) Thus the class conditional densities are p(l|salmon) = 1/√(2π) exp(-(l-5)²/2) p(l|bass) = 1/(2√(2π)) exp(-(l-10)²/(2·4))
24 Likelihood Function Thus the class conditional densities are p(l|salmon) = 1/√(2π) exp(-(l-5)²/2) and p(l|bass) = 1/(2√(2π)) exp(-(l-10)²/8) Fix the length l, let the fish class vary. Then we get the likelihood function (it is not a density and not a probability mass): p(l = fixed | class) = 1/√(2π) exp(-(l-5)²/2) if class = salmon, 1/(2√(2π)) exp(-(l-10)²/8) if class = bass
25 Likelihood vs. Class Conditional Density (figure: p(l|class) plotted against length, with the observation at length 7 marked) Suppose a fish has length 7. How do we classify it?
26 ML (Maximum Likelihood) Classifier We would like to choose salmon if Pr[length = 7 | salmon] > Pr[length = 7 | bass] However, since length is a continuous r.v., Pr[length = 7 | salmon] = Pr[length = 7 | bass] = 0 Instead, we choose the class which maximizes the likelihood: p(l|salmon) = 1/√(2π) exp(-(l-5)²/2), p(l|bass) = 1/(2√(2π)) exp(-(l-10)²/(2·4)) ML classifier: for an observed l, decide salmon if p(l|salmon) > p(l|bass), else decide bass In words: if p(l|salmon) > p(l|bass), classify as salmon, else classify as bass
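The ML classifier for the fish example is easy to sketch. A minimal Python version (the course uses Matlab; `gauss` and `ml_classify` are illustrative names), using the slide's N(5,1) and N(10,4) densities:

```python
import math

def gauss(l, mu, var):
    # N(mu, var) density evaluated at l
    return math.exp(-(l - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ml_classify(l):
    # ML rule: pick the class with the larger likelihood at the observed length
    return "salmon" if gauss(l, 5, 1) > gauss(l, 10, 4) else "bass"

print(ml_classify(6.5))   # salmon
print(ml_classify(7.0))   # bass
```

At l = 7 the bass likelihood is the larger one, matching the slide's interval justification for choosing bass.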
27 Interval Justification p(7|bass) > p(7|salmon), so Pr[l ∈ B(7) | bass] ≈ 2ε p(7|bass) > 2ε p(7|salmon) ≈ Pr[l ∈ B(7) | salmon] Thus we choose the class (bass) which is more likely to have given the observation
28 Decision Boundary (figure: classify as salmon for lengths below 6.70, as sea bass above)
29 Priors A prior comes from prior knowledge; no data has been seen yet Suppose a fish expert says: in the fall, there are twice as many salmon as sea bass Prior for our fish sorting problem: P(salmon) = 2/3, P(bass) = 1/3 With the addition of the prior to our model, how should we classify a fish of length 7?
30 How Does the Prior Change the Decision Boundary? Without priors: salmon below 6.70, sea bass above How should this change with the prior P(salmon) = 2/3, P(bass) = 1/3?
31 Bayes Decision Rule 1. Have likelihood functions p(length|salmon) and p(length|bass) 2. Have priors P(salmon) and P(bass) Question: having observed a fish of a certain length, do we classify it as salmon or bass? Natural idea: salmon if P(salmon|length) > P(bass|length), bass if P(bass|length) > P(salmon|length)
32 Posterior P(salmon|length) and P(bass|length) are called posterior distributions, because the data (length) was revealed (post data) How to compute the posteriors? Not obvious. From Bayes rule: P(salmon|length) = p(length|salmon) P(salmon) / p(length) Similarly: P(bass|length) = p(length|bass) P(bass) / p(length)
33 MAP (Maximum A Posteriori) Classifier P(salmon|length) >?< P(bass|length) ⇔ p(length|salmon) P(salmon) / p(length) >?< p(length|bass) P(bass) / p(length) ⇔ p(length|salmon) P(salmon) >?< p(length|bass) P(bass) (salmon if >, bass if <)
34 Back to the Fish Sorting Example Likelihoods: p(l|salmon) = 1/√(2π) exp(-(l-5)²/2), p(l|bass) = 1/(2√(2π)) exp(-(l-10)²/8) Priors: P(salmon) = 2/3, P(bass) = 1/3 Solve the inequality 2/3 · 1/√(2π) exp(-(l-5)²/2) > 1/3 · 1/(2√(2π)) exp(-(l-10)²/8) This gives the new decision boundary 7.18: salmon below, sea bass above The new decision boundary makes sense, since we expect to see more salmon
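Extending the earlier ML sketch to a MAP classifier only requires multiplying each likelihood by its prior. A minimal Python sketch (illustrative names, not from the lecture), with the slide's priors P(salmon) = 2/3, P(bass) = 1/3:

```python
import math

def gauss(l, mu, var):
    # N(mu, var) density evaluated at l
    return math.exp(-(l - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior = {"salmon": 2 / 3, "bass": 1 / 3}
params = {"salmon": (5, 1), "bass": (10, 4)}   # (mean, variance) per class

def map_classify(l):
    # MAP rule: maximize likelihood * prior (the evidence p(l) cancels)
    score = {c: gauss(l, *params[c]) * prior[c] for c in prior}
    return max(score, key=score.get)

print(map_classify(7.0))   # salmon: the 2/3 prior moves the boundary to the right
```

A fish of length 7 is now classified as salmon (7 is below the new 7.18 boundary), whereas the ML rule without priors called it bass.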
35 Prior P(s) = 2/3 and P(b) = 1/3 vs. prior P(s) = 0.999 and P(b) = 0.001 (figure: the salmon region grows as the salmon prior grows)
36 Likelihood vs. Posteriors Likelihood p(l|fish class): a density with respect to length; the area under each curve is 1 Posterior P(fish class|l): a mass function with respect to fish class, so for each l, P(salmon|l) + P(bass|l) = 1
37 More on the Posterior posterior (our goal): P(c|l) = p(l|c) P(c) / p(l), with likelihood p(l|c) and prior P(c) given The normalizing factor p(l) is often not even needed for classification, since p(l) does not depend on the class c. If we do need it, from the law of total probability: p(l) = p(l|salmon) P(salmon) + p(l|bass) P(bass) Notice this formula consists of likelihoods and priors, which are given
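Computing the full posterior, evidence included, is a small extension of the MAP score. A minimal Python sketch (illustrative names), which also checks the property from the previous slide that the posteriors sum to 1 for any l:

```python
import math

def gauss(l, mu, var):
    # N(mu, var) density evaluated at l
    return math.exp(-(l - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior = {"salmon": 2 / 3, "bass": 1 / 3}
params = {"salmon": (5, 1), "bass": (10, 4)}

def posterior(l):
    # P(c | l) = p(l | c) P(c) / p(l), with p(l) from the law of total probability
    joint = {c: gauss(l, *params[c]) * prior[c] for c in prior}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

post = posterior(7.0)
print(round(post["salmon"] + post["bass"], 10))   # 1.0: posteriors sum to 1
```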
38 More on the Posterior P(c|l) = p(l|c) P(c) / p(l), where c is the cause (class) and l is the effect (length) If cause c is present, it is easy to determine the probability of effect l with the likelihood p(l|c) Usually we observe the effect l without knowing the cause c. It is hard to determine the cause c directly, because there may be several causes which could produce the same effect l Bayes rule makes it easy to determine the posterior P(c|l), if we know the likelihood p(l|c) and the prior P(c)
39 More on Priors A prior comes from prior knowledge; no data has been seen yet If there is a reliable source of prior knowledge, it should be used Some problems cannot even be solved reliably without a good prior However, the prior alone is not enough; we still need the likelihood P(salmon) = 2/3, P(sea bass) = 1/3 If I don't let you see the data, but ask you to guess, will you choose salmon or sea bass?
40 More on the MAP Classifier P(c|l) = p(l|c) P(c) / p(l) (posterior = likelihood × prior / evidence) Do not care about p(l) when maximizing P(c|l): P(c|l) ∝ p(l|c) P(c) If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier: P(c|l) ∝ p(l|c) If for some observation l, p(l|salmon) = p(l|bass), then this observation is uninformative and the decision is based solely on the prior: P(c|l) ∝ P(c)
41 Justification for the MAP Classifier Let's compute the probability of error for the MAP decision P(salmon|l) >?< P(bass|l): For any particular l, the probability of error is Pr[error|l] = P(bass|l) if we decide salmon, P(salmon|l) if we decide bass Thus the MAP classifier is optimal for each individual l!
42 Justification for the MAP Classifier We are interested in minimizing the error not just for one l; we really want to minimize the average error over all l: Pr[error] = ∫ p(error,l) dl = ∫ Pr[error|l] p(l) dl If Pr[error|l] is as small as possible, the integral is as small as possible But the MAP classifier makes Pr[error|l] as small as possible Thus the MAP classifier minimizes the probability of error!
43 Today Bayesian decision theory Multiple classes General loss functions Multivariate normal random variables Classifiers Discriminant functions
44 More General Case Let's generalize a little bit: Have more than one feature: x = [x1, x2, ..., xd] Have more than 2 classes: {c1, c2, ..., cm}
45 More General Case As before, for each j we have p(x|cj), the likelihood of observation x given that the true class is cj P(cj), the prior probability of class cj P(cj|x), the posterior probability of class cj given that we observed data x Evidence, or probability density for the data: p(x) = Σ_{j=1..m} p(x|cj) P(cj)
46 Minimum Error Rate Classification Want to minimize the average probability of error: Pr[error] = ∫ p(error,x) dx = ∫ Pr[error|x] p(x) dx (need to make Pr[error|x] as small as possible) Pr[error|x] = 1 - P(ci|x) if we decide class ci Pr[error|x] is minimized by the MAP classifier: decide ci if P(ci|x) > P(cj|x) for all j ≠ i The MAP classifier is optimal if we want to minimize the probability of error
47 General Bayesian Decision Theory In close cases we may want to refuse to make a decision (let a human expert handle the tough cases): allow actions {α1, α2, ..., αk} Suppose some mistakes are more costly than others (classifying a benign tumor as cancer is not as bad as classifying cancer as a benign tumor): allow loss functions λ(αi|cj) describing the loss incurred when taking action αi when the true class is cj
48 Conditional Risk Suppose we observe x and wish to take action αi If the true class is cj, by definition we incur loss λ(αi|cj) The probability that the true class is cj after observing x is P(cj|x) The expected loss associated with taking action αi is called the conditional risk, and it is: R(αi|x) = Σ_{j=1..m} λ(αi|cj) P(cj|x)
49 Conditional Risk R(αi|x) = Σ_{j=1..m} λ(αi|cj) P(cj|x) is the penalty for taking action αi if we observe x: a sum over disjoint events (the different classes), where each term λ(αi|cj) P(cj|x) is the part of the overall penalty which comes from the event that the true class is cj
50 Example: Zero-One Loss Function The action αi is the decision that the true class is ci: λ(αi|cj) = 0 if i = j (no mistake), 1 otherwise (mistake) R(αi|x) = Σ_{j=1..m} λ(αi|cj) P(cj|x) = Σ_{j≠i} P(cj|x) = 1 - P(ci|x) = Pr[error if we decide ci] Thus the MAP classifier optimizes R(αi|x): the MAP classifier is the Bayes decision rule under the zero-one loss function
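The identity on this slide, that the zero-one conditional risk equals 1 - P(ci|x), is easy to verify in code. A minimal Python sketch with hypothetical posteriors at some fixed x (the class names and numbers are made up for illustration):

```python
# Hypothetical posteriors P(c | x) at one observation x.
posterior = {"c1": 0.5, "c2": 0.3, "c3": 0.2}

def risk_zero_one(decide):
    # conditional risk under zero-one loss: sum of posteriors of all other classes
    return sum(p for c, p in posterior.items() if c != decide)

risks = {c: risk_zero_one(c) for c in posterior}
print(min(risks, key=risks.get))   # c1: smallest risk = largest posterior (MAP)
```

For every class, risk = 1 - posterior, so minimizing risk and maximizing the posterior pick the same class, exactly as the slide claims.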
51 Overall Risk The decision rule is a function α(x) which for every x specifies an action out of {α1, α2, ..., αk} The average risk for α(x): R = ∫ R(α(x)|x) p(x) dx (need to make R(α(x)|x) as small as possible for every x) The Bayes decision rule: for every x, α(x) is the action which minimizes the conditional risk R(αi|x) = Σ_{j=1..m} λ(αi|cj) P(cj|x) The Bayes decision rule α(x) is optimal, i.e. gives the minimum possible overall risk R*
52 Bayes Risk: Example Salmon is more tasty and expensive than sea bass: λ_sb = λ(salmon|bass) = 2 (classify bass as salmon) λ_bs = λ(bass|salmon) = 1 (classify salmon as bass) λ_ss = λ_bb = 0 (no mistake, no loss) Likelihoods: p(l|salmon) = 1/√(2π) exp(-(l-5)²/2), p(l|bass) = 1/(2√(2π)) exp(-(l-10)²/(2·4)) Priors: P(salmon) = P(bass) = 1/2 Risk: R(α|l) = λ(α|s) P(s|l) + λ(α|b) P(b|l) R(salmon|l) = λ_ss P(s|l) + λ_sb P(b|l) = λ_sb P(b|l) R(bass|l) = λ_bs P(s|l) + λ_bb P(b|l) = λ_bs P(s|l)
53 Bayes Risk: Example R(salmon|l) = λ_sb P(b|l), R(bass|l) = λ_bs P(s|l) Bayes decision rule (optimal for our loss function): decide salmon if λ_sb P(b|l) < λ_bs P(s|l) Need to solve P(b|l) λ_sb < P(s|l) λ_bs Using P(s|l) = p(l|s) P(s) / p(l) and P(b|l) = p(l|b) P(b) / p(l), and since the priors are equal, this is equivalent to p(l|b) λ_sb < p(l|s) λ_bs
54 Bayes Risk: Example Need to solve p(l|b) λ_sb < p(l|s) λ_bs Substituting the likelihoods and losses: 2 · 1/(2√(2π)) exp(-(l-10)²/8) < 1 · 1/√(2π) exp(-(l-5)²/2) exp(-(l-10)²/8) < exp(-(l-5)²/2) -(l-10)²/8 < -(l-5)²/2 4(l-5)² - (l-10)² < 0 3l² - 20l < 0 l < 20/3 ≈ 6.67 New decision boundary: salmon below, sea bass above
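The boundary the slide derives algebraically can also be found numerically. A minimal Python sketch (illustrative names; losses and densities taken from the slides): it bisects for the length where the two loss-weighted likelihoods are equal, under the slide's equal priors.

```python
import math

def gauss(l, mu, var):
    # N(mu, var) density evaluated at l
    return math.exp(-(l - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

lam_sb, lam_bs = 2.0, 1.0   # losses: bass->salmon costs 2, salmon->bass costs 1

def decide(l):
    # equal priors, so compare loss-weighted likelihoods directly
    return "salmon" if lam_sb * gauss(l, 10, 4) < lam_bs * gauss(l, 5, 1) else "bass"

# Bisect for the boundary where the two weighted likelihoods cross.
lo, hi = 5.0, 10.0
for _ in range(60):
    mid = (lo + hi) / 2
    if decide(mid) == "salmon":
        lo = mid
    else:
        hi = mid
print(round(lo, 3))   # 6.667 = 20/3, matching the algebra on the slide
```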
55 Likelihood Ratio Rule In the 2-category case, use the likelihood ratio rule: decide c1 if p(x|c1) / p(x|c2) > [(λ12 - λ22) / (λ21 - λ11)] · P(c2) / P(c1) The left side is the likelihood ratio; the right side is a fixed number, independent of x If the above inequality holds, decide c1; otherwise decide c2
56 Discriminant Functions All decision rules have the same structure: at observation x, choose the class ci s.t. gi(x) > gj(x) for all j ≠ i, where gi(x) is a discriminant function ML decision rule: gi(x) = p(x|ci) MAP decision rule: gi(x) = P(ci|x) Bayes decision rule: gi(x) = -R(αi|x)
57 Discriminant Functions The classifier can be viewed as a network which computes m discriminant functions g1(x), g2(x), ..., gm(x) from the features x1, x2, ..., xd and selects the category corresponding to the largest discriminant gi(x) can be replaced with any monotonically increasing function of gi(x); the results will be unchanged
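The claim that a monotonically increasing transform leaves the decisions unchanged can be checked directly. A minimal Python sketch (illustrative names; the fish densities and priors come from the earlier slides): it compares the MAP discriminant gi(x) = p(x|ci) P(ci) against its logarithm, a monotone transform, on a few observations.

```python
import math

def gauss(l, mu, var):
    # N(mu, var) density evaluated at l
    return math.exp(-(l - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

prior = {"salmon": 2 / 3, "bass": 1 / 3}
params = {"salmon": (5, 1), "bass": (10, 4)}

def g_map(c, l):
    # MAP discriminant: likelihood * prior (unnormalized posterior)
    return gauss(l, *params[c]) * prior[c]

def g_log(c, l):
    # monotone transform of g_map: its logarithm, expanded in closed form
    mu, var = params[c]
    return (-(l - mu) ** 2 / (2 * var)
            - 0.5 * math.log(2 * math.pi * var)
            + math.log(prior[c]))

for l in (4.0, 7.0, 9.0):
    a = max(prior, key=lambda c: g_map(c, l))
    b = max(prior, key=lambda c: g_log(c, l))
    assert a == b   # the log transform never changes the decision
print("decisions agree")
```

Working with log-discriminants is the usual practical choice, since products of small densities underflow while their logs do not.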
58 Decision Regions Discriminant functions split the feature vector space X into decision regions (figure: regions for c1, c2, c3; in each region the decided class is the one with g(x) = max_i {gi(x)})
59 Important Points If we know the probability distributions of the classes, we can design the optimal classifier The definition of "optimal" depends on the chosen loss function Under the minimum error rate (zero-one loss function): No prior: the ML classifier is optimal Have a prior: the MAP classifier is optimal More general loss function: the general Bayes classifier is optimal
Nearest Neighbor Pattern Classification T. M. Cover and P. E. Hart May 15, 2018 1 The Intro The nearest neighbor algorithm/rule (NN) is the simplest nonparametric decisions procedure, that assigns to unclassified
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?
Linear Regression Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2014 1 What about continuous variables? n Billionaire says: If I am measuring a continuous variable, what
More informationBayes Rule for Minimizing Risk
Bayes Rule for Minimizing Risk Dennis Lee April 1, 014 Introduction In class we discussed Bayes rule for minimizing the probability of error. Our goal is to generalize this rule to minimize risk instead
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationMachine Learning 4771
Machine Learning 4771 Instructor: Tony Jebara Topic 11 Maximum Likelihood as Bayesian Inference Maximum A Posteriori Bayesian Gaussian Estimation Why Maximum Likelihood? So far, assumed max (log) likelihood
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationLecture 2. Bayes Decision Theory
Lecture 2. Bayes Decision Theory Prof. Alan Yuille Spring 2014 Outline 1. Bayes Decision Theory 2. Empirical risk 3. Memorization & Generalization; Advanced topics 1 How to make decisions in the presence
More informationLecture 1: Bayesian Framework Basics
Lecture 1: Bayesian Framework Basics Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de April 21, 2014 What is this course about? Building Bayesian machine learning models Performing the inference of
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationSYDE 372 Introduction to Pattern Recognition. Probability Measures for Classification: Part I
SYDE 372 Introduction to Pattern Recognition Probability Measures for Classification: Part I Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 Why use probability
More informationMachine Learning for Signal Processing Bayes Classification and Regression
Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For
More informationDensity Estimation: ML, MAP, Bayesian estimation
Density Estimation: ML, MAP, Bayesian estimation CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Introduction Maximum-Likelihood Estimation Maximum
More informationBAYESIAN DECISION THEORY
Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will
More informationMachine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall
Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume
More informationBayes Decision Theory - I
Bayes Decision Theory - I Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Statistical Learning from Data Goal: Given a relationship between a feature vector and a vector y, and iid data samples ( i,y i ), find
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationBayesian decision making
Bayesian decision making Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských partyzánů 1580/3, Czech Republic http://people.ciirc.cvut.cz/hlavac,
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationClassification: The rest of the story
U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher
More informationMachine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?
Linear Regression Machine Learning CSE546 Sham Kakade University of Washington Oct 4, 2016 1 What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do
More informationCS 188: Artificial Intelligence Spring Today
CS 188: Artificial Intelligence Spring 2006 Lecture 9: Naïve Bayes 2/14/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Bayes rule Today Expectations and utilities Naïve
More informationMachine Learning Lecture 2
Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationCOS402- Artificial Intelligence Fall Lecture 10: Bayesian Networks & Exact Inference
COS402- Artificial Intelligence Fall 2015 Lecture 10: Bayesian Networks & Exact Inference Outline Logical inference and probabilistic inference Independence and conditional independence Bayes Nets Semantics
More informationClustering K-means. Machine Learning CSE546. Sham Kakade University of Washington. November 15, Review: PCA Start: unsupervised learning
Clustering K-means Machine Learning CSE546 Sham Kakade University of Washington November 15, 2016 1 Announcements: Project Milestones due date passed. HW3 due on Monday It ll be collaborative HW2 grades
More informationDeep Learning for Computer Vision
Deep Learning for Computer Vision Lecture 3: Probability, Bayes Theorem, and Bayes Classification Peter Belhumeur Computer Science Columbia University Probability Should you play this game? Game: A fair
More informationNon-parametric Methods
Non-parametric Methods Machine Learning Alireza Ghane Non-Parametric Methods Alireza Ghane / Torsten Möller 1 Outline Machine Learning: What, Why, and How? Curve Fitting: (e.g.) Regression and Model Selection
More informationCOM336: Neural Computing
COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk
More informationMachine Learning Lecture 2
Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability
More informationClustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.
Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)
More informationLecture 23 Maximum Likelihood Estimation and Bayesian Inference
Lecture 23 Maximum Likelihood Estimation and Bayesian Inference Thais Paiva STA 111 - Summer 2013 Term II August 7, 2013 1 / 31 Thais Paiva STA 111 - Summer 2013 Term II Lecture 23, 08/07/2013 Lecture
More informationMachine Learning (CS 567) Lecture 2
Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol
More informationLinear Models for Classification
Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,
More informationUniversity of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries
University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout :. The Multivariate Gaussian & Decision Boundaries..15.1.5 1 8 6 6 8 1 Mark Gales mjfg@eng.cam.ac.uk Lent
More informationComputer Vision Group Prof. Daniel Cremers. 3. Regression
Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationMultivariate statistical methods and data mining in particle physics
Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general
More informationBayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington
Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The
More informationMidterm, Fall 2003
5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are
More information7 Gaussian Discriminant Analysis (including QDA and LDA)
36 Jonathan Richard Shewchuk 7 Gaussian Discriminant Analysis (including QDA and LDA) GAUSSIAN DISCRIMINANT ANALYSIS Fundamental assumption: each class comes from normal distribution (Gaussian). X N(µ,
More informationProblem Set 2. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 30
Problem Set MAS 6J/1.16J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 30 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain
More information1 Probabilities. 1.1 Basics 1 PROBABILITIES
1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability
More informationClassification & Information Theory Lecture #8
Classification & Information Theory Lecture #8 Introduction to Natural Language Processing CMPSCI 585, Fall 2007 University of Massachusetts Amherst Andrew McCallum Today s Main Points Automatically categorizing
More information