CS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
1 CS434a/541a: Pattern Recognition. Prof. Olga Veksler. Lecture 5
2 Today: Introduction to parameter estimation. Two methods for parameter estimation: Maximum Likelihood Estimation, Bayesian Estimation.
3 Introduction. Bayesian Decision Theory in previous lectures tells us how to design an optimal classifier if we knew: P(c_i) (priors) and p(x|c_i) (class-conditional densities). Unfortunately, we rarely have this complete information! Suppose we know the shape of the distribution, but not the parameters. Two types of parameter estimation: Maximum Likelihood Estimation, Bayesian Estimation.
4 ML and Bayesian Parameter Estimation. Shape of the probability distribution is known (happens sometimes). Labeled training data: salmon, bass, salmon, salmon. Need to estimate parameters of the probability distribution from the training data. Example: a respected fish expert says salmon's length has distribution N(µ_1, σ_1²) and sea bass's length has distribution N(µ_2, σ_2²). Need to estimate parameters µ_1, σ_1², µ_2, σ_2². Then design classifiers according to Bayesian decision theory. (A lot is known: easier. Little is known: harder.)
5 Independence Across Classes. We have training data for each class: salmon, sea bass, salmon, salmon, sea bass, sea bass. When estimating parameters for one class, we will only use the data collected for that class. Reasonable assumption: data from class c_i gives no information about the distribution of class c_j. Estimate parameters for the distribution of salmon from the salmon samples; estimate parameters for the distribution of bass from the bass samples.
6 Independence Across Classes. For each class c_i we have a proposed density p_i(x|c_i) with unknown parameters θ_i which we need to estimate. Since we assumed independence of data across the classes, estimation is an identical procedure for all classes. To simplify notation, we drop sub-indexes and say that we need to estimate parameters θ for density p(x); the fact that we need to do so for each class, on the training data that came from that class, is implied.
7 ML vs. Bayesian Parameter Estimation. Maximum Likelihood: parameters θ are unknown but fixed (i.e. not random variables). Bayesian Estimation: parameters θ are random variables having some known a priori distribution (prior) p(θ). After parameters are estimated with either ML or Bayesian Estimation, we use methods from Bayesian decision theory for classification.
8 Maximum Likelihood Parameter Estimation. We have a density p(x) which is completely specified by parameters θ = [θ_1, ..., θ_k]. If p(x) is N(µ, σ²) then θ = [µ, σ²]. To highlight that p(x) depends on parameters θ we will write p(x|θ). Note the overloaded notation: p(x|θ) is not a conditional density. Let D = {x_1, x_2, ..., x_n} be the n independent training samples in our data. If p(x) is N(µ, σ²) then x_1, x_2, ..., x_n are i.i.d. samples from N(µ, σ²).
9 Maximum Likelihood Parameter Estimation. Consider the following function, which is called the likelihood of θ with respect to the set of samples D: p(D|θ) = ∏_{k=1}^{n} p(x_k|θ) = F(θ). Note: if D is fixed, p(D|θ) is not a density in θ. The maximum likelihood estimate (abbreviated MLE) of θ is the value θ̂ that maximizes the likelihood function p(D|θ): θ̂ = argmax_θ p(D|θ).
10 Maximum Likelihood Estimation (MLE). p(D|θ) = ∏_{k=1}^{n} p(x_k|θ). If D is allowed to vary and θ is fixed, by independence p(D|θ) is the joint density for D = {x_1, x_2, ..., x_n}. If θ is allowed to vary and D is fixed, p(D|θ) is not a density; it is the likelihood F(θ)! Recall our approximation-of-integral trick: Pr[D ∈ B((x_1,...,x_n), ε)] ≈ εⁿ ∏_{k=1}^{n} p(x_k|θ). Thus ML chooses the θ that is most likely to have generated the observed data D.
11 ML Parameter Estimation vs. ML Classifier. Recall the ML classifier: for fixed data x, decide the class c_i which maximizes p(x|c_i). Compare with ML parameter estimation: for fixed data D, choose the θ that maximizes p(D|θ). The ML classifier and ML parameter estimation use the same principle applied to different problems.
12 Maximum Likelihood Estimation (MLE). Instead of maximizing p(D|θ) directly, it is usually easier to maximize ln(p(D|θ)). Since log is monotonic: θ̂ = argmax_θ p(D|θ) = argmax_θ ln p(D|θ). To simplify notation, write l(θ) = ln(p(D|θ)). Then θ̂ = argmax_θ l(θ) = argmax_θ ln ∏_{k=1}^{n} p(x_k|θ) = argmax_θ Σ_{k=1}^{n} ln p(x_k|θ).
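As an illustrative sketch (not from the slides), the Python snippet below shows why the log-likelihood is preferred in practice: for many samples the raw likelihood underflows in floating point, while the log-likelihood is well behaved and has the same argmax. The sample data, σ value, and grid are hypothetical; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(loc=5.0, scale=sigma, size=1000)   # hypothetical i.i.d. sample, sigma known
mus = np.linspace(3.0, 7.0, 401)                   # candidate values of the unknown mean

# raw likelihood p(D|mu): the product of 1000 small factors underflows to 0.0
lik = np.array([np.prod(np.exp(-(x - m)**2 / (2*sigma**2)) / np.sqrt(2*np.pi*sigma**2))
                for m in mus])

# log-likelihood l(mu) = sum_k ln p(x_k|mu): numerically stable, same argmax (log is monotonic)
loglik = np.array([np.sum(-0.5*np.log(2*np.pi*sigma**2) - (x - m)**2 / (2*sigma**2))
                   for m in mus])

print(lik.max())                  # typically 0.0 -> numerical underflow
print(mus[np.argmax(loglik)])     # close to the sample mean x.mean()
```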
14 MLE: Maximization Methods. Maximizing l(θ) can be done using standard methods from calculus. Let θ = (θ_1, θ_2, ..., θ_p)^t and let ∇_θ = (∂/∂θ_1, ∂/∂θ_2, ..., ∂/∂θ_p)^t be the gradient operator. The set of necessary conditions for an optimum is: ∇_θ l = 0. Also have to check that the θ satisfying the above condition is a maximum, not a minimum or saddle point. Also check the boundary of the range of θ.
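When no closed form is available, the same maximization can be done numerically by minimizing the negative log-likelihood with a generic optimizer. This is a minimal sketch under stated assumptions (SciPy available, made-up Gaussian data, σ reparameterized through its log to keep it positive), not part of the original lecture.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=500)    # hypothetical i.i.d. Gaussian sample

def neg_log_lik(params):
    """Negative log-likelihood of N(mu, sigma^2); minimizing it maximizes l(theta)."""
    mu, log_sigma = params                       # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5*np.log(2*np.pi*sigma**2) - (x - mu)**2 / (2*sigma**2))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)    # close to x.mean() and the (1/n) standard deviation x.std()
```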
15 MLE Example: Gaussian with unknown µ. Fortunately for us, the ML estimates for any densities we would care about have already been computed. Let's go through an example anyway. Let p(x|µ) be N(µ, σ²), that is, σ² is known but µ is unknown and needs to be estimated, so θ = µ. Then µ̂ = argmax_µ l(µ) = argmax_µ Σ_{k=1}^{n} ln p(x_k|µ) = argmax_µ Σ_{k=1}^{n} ln [ (1/√(2πσ²)) exp(−(x_k−µ)²/(2σ²)) ] = argmax_µ Σ_{k=1}^{n} [ −ln√(2πσ²) − (x_k−µ)²/(2σ²) ].
16 MLE Example: Gaussian with unknown µ. µ̂ = argmax_µ Σ_{k=1}^{n} [ −ln√(2πσ²) − (x_k−µ)²/(2σ²) ]. Setting the derivative to zero: (d/dµ) l(µ) = Σ_{k=1}^{n} (1/σ²)(x_k − µ) = 0, so Σ_{k=1}^{n} x_k − nµ = 0, giving µ̂ = (1/n) Σ_{k=1}^{n} x_k. Thus the ML estimate of the mean is just the average value of the training data — very intuitive! The average of the training data would be our guess for the mean even if we didn't know about ML estimates.
17 MLE for Gaussian with unknown µ, σ². Similarly it can be shown that if p(x|µ,σ²) is N(µ, σ²), that is, both mean and variance are unknown, then (again a very intuitive result): µ̂ = (1/n) Σ_{k=1}^{n} x_k and σ̂² = (1/n) Σ_{k=1}^{n} (x_k − µ̂)². Similarly it can be shown that if p(x|µ,Σ) is N(µ, Σ), that is, x is a multivariate Gaussian with both mean and covariance matrix unknown, then µ̂ = (1/n) Σ_{k=1}^{n} x_k and Σ̂ = (1/n) Σ_{k=1}^{n} (x_k − µ̂)(x_k − µ̂)^t.
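The multivariate formulas above can be checked directly on synthetic data. A minimal sketch (not from the slides), assuming NumPy and hypothetical true parameters; note the MLE covariance uses 1/n, not the 1/(n−1) that np.cov would use by default.

```python
import numpy as np

rng = np.random.default_rng(2)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=2000)   # n x d data matrix

n = X.shape[0]
mu_hat = X.mean(axis=0)                    # (1/n) sum_k x_k
diff = X - mu_hat
cov_hat = diff.T @ diff / n                # MLE covariance with 1/n

print(mu_hat)                              # close to true_mu
print(cov_hat)                             # close to true_cov for large n
```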
18 Today: Finish Maximum Likelihood Estimation. Bayesian Parameter Estimation. New topic: Non-Parametric Density Estimation.
19 How to Measure Performance of MLE? How good is an ML estimate (or, actually, any other estimate of a parameter)? The natural measure of error would be |θ̂ − θ|. But θ̂ is random; we cannot compute it before we carry out the experiments. We want to say something meaningful about our estimate as a function of n. A way to resolve this difficulty is to average the error, i.e. compute the mean absolute error: E[|θ̂ − θ|] = ∫ |θ̂ − θ| p(x_1, x_2, ..., x_n | θ) dx_1 dx_2 ... dx_n.
20 How to Measure Performance of MLE? It is usually much easier to compute an almost equivalent measure of performance, the mean squared error: E[(θ̂ − θ)²]. Do a little algebra and use Var(X) = E(X²) − (E(X))²: E[(θ̂ − θ)²] = Var(θ̂) + (E(θ̂) − θ)². The first term is the variance (the estimator should have low variance); the second is the squared bias (the expectation of the estimator should be close to the true θ).
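The bias-variance decomposition of the MSE can be verified by Monte Carlo: repeat the "experiment" many times, compute the estimate each time, and compare the empirical MSE with variance plus squared bias. A sketch with made-up values (not from the slides), assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(4)
true_mu, sigma, n, trials = 5.0, 2.0, 20, 100_000

# many repeated datasets of size n; the estimator is the sample mean
estimates = np.array([rng.normal(true_mu, sigma, size=n).mean() for _ in range(trials)])

mse = np.mean((estimates - true_mu)**2)
variance = estimates.var()
bias_sq = (estimates.mean() - true_mu)**2

print(mse, variance + bias_sq)    # the two agree: MSE = Var + bias^2
print(variance, sigma**2 / n)     # variance of the sample mean is sigma^2 / n
```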
21 How to Measure Performance of MLE? E[(θ̂ − θ)²] = Var(θ̂) + (E(θ̂) − θ)² (variance plus squared bias). [Figure: three sketches of the sampling density p(θ̂) around the true θ.] Ideal case: no bias, low variance. Bad case: large bias, low variance. Bad case: no bias, high variance.
22 Bias and Variance for MLE of the Mean. Let's compute the bias for the ML estimate of the mean: E[µ̂] = E[(1/n) Σ_{k=1}^{n} x_k] = (1/n) Σ_{k=1}^{n} E[x_k] = (1/n)·nµ = µ. Thus this estimate is unbiased! How about the variance of the ML estimate of the mean? E[(µ̂ − µ)²] = E[µ̂² − 2µµ̂ + µ²] = E[µ̂²] − 2µE[µ̂] + µ² = σ²/n. Thus the variance is very small for a large number of samples (the more samples, the smaller the variance). Thus the MLE of the mean is a very good estimator.
23 Bias and Variance for MLE of the Mean. Suppose someone claims they have a new great estimator for the mean: just take the first sample! µ̂ = x_1. This estimator is unbiased: E[µ̂] = E[x_1] = µ. However its variance is E[(µ̂ − µ)²] = E[(x_1 − µ)²] = σ². Thus the variance can be very large and does not improve as we increase the number of samples. [Figure: p(µ̂) with no bias but high variance.]
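A quick simulation makes the contrast concrete: both estimators are unbiased, but only the sample mean's variance shrinks with n. This is an illustrative sketch with hypothetical values, assuming NumPy; it is not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(5)
true_mu, sigma, n, trials = 5.0, 2.0, 100, 50_000

data = rng.normal(true_mu, sigma, size=(trials, n))
mean_est = data.mean(axis=1)     # MLE: average of all n samples
first_est = data[:, 0]           # "just take the first sample"

print(mean_est.mean(), first_est.mean())   # both ~ true_mu (unbiased)
print(mean_est.var(), sigma**2 / n)        # ~ sigma^2 / n
print(first_est.var(), sigma**2)           # ~ sigma^2, independent of n
```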
24 MLE Bias for Mean and Variance. How about the ML estimate of the variance? E[σ̂²] = E[(1/n) Σ_{k=1}^{n} (x_k − µ̂)²] = ((n−1)/n)·σ² ≠ σ². Thus this estimate is biased! This is because we used µ̂ instead of the true µ. Bias → 0 as n → infinity, so the estimate is asymptotically unbiased. Unbiased estimate: σ̂² = (1/(n−1)) Σ_{k=1}^{n} (x_k − µ̂)². The variance of the MLE of the variance can be shown to go to 0 as n goes to infinity.
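The bias factor (n−1)/n can be seen numerically by comparing the 1/n and 1/(n−1) versions over many repeated samples. A minimal sketch with made-up parameters (not from the slides), assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(6)
true_var, n, trials = 4.0, 10, 200_000

data = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
mle_var = data.var(axis=1, ddof=0)        # 1/n       -> biased MLE
unbiased_var = data.var(axis=1, ddof=1)   # 1/(n-1)   -> unbiased estimate

print(mle_var.mean(), (n - 1) / n * true_var)   # E[MLE] = ((n-1)/n) * sigma^2
print(unbiased_var.mean(), true_var)            # unbiased estimator averages to sigma^2
```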
25 MLE for Uniform Distribution U[0,θ]. X is U[0,θ] if its density is 1/θ inside [0,θ] and 0 otherwise (uniform distribution on [0,θ]): p(x|θ) = 1/θ for x in [0,θ], 0 otherwise. The likelihood is F(θ) = ∏_{k=1}^{n} p(x_k|θ) = 0 if θ < max{x_1,...,x_n}, and 1/θⁿ if θ ≥ max{x_1,...,x_n}. Since 1/θⁿ decreases in θ, θ̂ = argmax_θ ∏_k p(x_k|θ) = max{x_1,...,x_n}. This is not very pleasing, since θ surely should be larger than any observed x_k!
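A tiny simulation illustrates this unpleasant property: the MLE for the uniform is the sample maximum, which is always at or below the true θ. Hypothetical values, NumPy assumed; not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(7)
true_theta, n = 10.0, 20
x = rng.uniform(0.0, true_theta, size=n)

theta_hat = x.max()             # MLE for U[0, theta]
print(theta_hat, true_theta)    # theta_hat <= true_theta, so the MLE underestimates theta
```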
26 Bayesian Parameter Estimation. Suppose we have some idea of the range where the parameters θ should be. Shouldn't we formalize such prior knowledge in hopes that it will lead to better parameter estimation? Let θ be a random variable with prior distribution p(θ). This is the key difference between ML and Bayesian parameter estimation. This key assumption allows us to fully exploit the information provided by the data.
27 Bayesian Parameter Estimation. As in MLE, suppose p(x|θ) is completely specified if θ is given. But now θ is a random variable with prior p(θ). Unlike the MLE case, p(x|θ) is a conditional density. After we observe the data D, using Bayes rule we can compute the posterior p(θ|D). Recall that for the MAP classifier we find the class c_i that maximizes the posterior p(c|D). By analogy, a reasonable estimate of θ is the one that maximizes the posterior p(θ|D). But θ is not our final goal; our final goal is the unknown p(x). Therefore a better thing to do is to compute p(x|D) — this is as close as we can come to the unknown p(x)!
28 Bayesian Estimation: Formula for p(x|D). From the definition of the joint distribution: p(x|D) = ∫ p(x, θ|D) dθ. Using the definition of conditional probability: p(x|D) = ∫ p(x|θ, D) p(θ|D) dθ. But p(x|θ, D) = p(x|θ), since p(x|θ) is completely specified by θ. Thus p(x|D) = ∫ p(x|θ) p(θ|D) dθ, where p(x|θ) is known and p(θ|D) is unknown. Using Bayes formula: p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ, with p(D|θ) = ∏_{k=1}^{n} p(x_k|θ).
29 Bayesian Estimation vs. MLE. So in principle p(x|D) can be computed. In practice, it may be hard to do the integration analytically; we may have to resort to numerical methods: p(x|D) = ∫ p(x|θ) [ ∏_k p(x_k|θ) p(θ) / ∫ ∏_k p(x_k|θ) p(θ) dθ ] dθ. Contrast this with the MLE solution, which requires differentiation of the likelihood to get p(x|θ̂). Differentiation is easy and can always be done analytically.
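One simple numerical route, when the integrals have no closed form, is to discretize θ on a grid and replace each integral by a sum. The sketch below does this for a Gaussian likelihood with unknown mean and a broad prior; all values, the prior, and the grid are hypothetical assumptions, and only NumPy is used. It is an illustration of the recipe above, not the lecture's own method.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma = 1.0
x_obs = rng.normal(3.0, sigma, size=15)            # observed data D

theta = np.linspace(-10, 10, 2001)                 # grid over the unknown mean theta
prior = np.exp(-theta**2 / (2 * 5.0**2))           # hypothetical broad Gaussian prior
prior /= np.trapz(prior, theta)

# p(theta|D) proportional to prod_k p(x_k|theta) * p(theta); normalize numerically
loglik = np.array([np.sum(-0.5*np.log(2*np.pi*sigma**2) - (x_obs - t)**2/(2*sigma**2))
                   for t in theta])
posterior = np.exp(loglik - loglik.max()) * prior
posterior /= np.trapz(posterior, theta)

# p(x|D) = integral of p(x|theta) p(theta|D) dtheta, evaluated at a new point x
x_new = 2.5
px_given_theta = np.exp(-(x_new - theta)**2/(2*sigma**2)) / np.sqrt(2*np.pi*sigma**2)
px_given_D = np.trapz(px_given_theta * posterior, theta)
print(px_given_D)
```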
30 Bayesian Estimation vs. MLE. p(x|D) can be thought of as the weighted average of the proposed model over all possible values of θ: p(x|D) = ∫ p(x|θ) p(θ|D) dθ, where p(x|θ) is the proposed model with a certain θ and p(θ|D) is the support θ receives from the data. Contrast this with the MLE solution, which always gives us a single model: p(x|θ̂). When we have many possible solutions, taking their sum weighted by their probabilities seems better than spitting out one solution.
31 Bayesian Estimation: Example for U[0,θ]. Let X be U[0,θ]. Recall p(x|θ) = 1/θ inside [0,θ], else 0. Suppose we assume a U[0,10] prior on θ: p(θ) = 1/10 for θ in [0,10] — a good prior to use if we just know the range of θ but don't know anything else. We need to compute p(x|D) = ∫ p(x|θ) p(θ|D) dθ, with p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ and p(D|θ) = ∏_{k=1}^{n} p(x_k|θ).
32 Bayesian Estimation: Example for U[0,θ]. We need to compute p(x|D) = ∫ p(x|θ) p(θ|D) dθ using p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ and p(D|θ) = ∏_k p(x_k|θ). When computing the MLE of θ, we had p(D|θ) = 1/θⁿ for θ ≥ max{x_1,...,x_n}, and 0 otherwise. Thus p(θ|D) = c/θⁿ for max{x_1,...,x_n} ≤ θ ≤ 10, and 0 otherwise, where c is the normalizing constant, i.e. 1/c = ∫_{max{x_1,...,x_n}}^{10} (1/θⁿ) dθ.
33 Bayesian Estimation: Example for U[0,θ]. We need to compute p(x|D) = ∫ p(x|θ) p(θ|D) dθ with p(θ|D) = c/θⁿ for max{x_1,...,x_n} ≤ θ ≤ 10, 0 otherwise. We have 2 cases. Case 1: x < max{x_1, x_2, ..., x_n}: p(x|D) = ∫_{max{x_1,...,x_n}}^{10} (1/θ)(c/θⁿ) dθ, a constant independent of x. Case 2: x > max{x_1, x_2, ..., x_n}: p(x|D) = ∫_{x}^{10} (1/θ)(c/θⁿ) dθ = (c/n)(1/xⁿ − 1/10ⁿ), which decreases as x grows and vanishes at x = 10.
34 Bayesian Estimation: Example for U[0,θ]. [Figure: the ML density p(x|θ̂) and the Bayes density p(x|D) plotted together.] Note that even for x > max{x_1, x_2, ..., x_n}, the Bayes density is not zero, which makes sense. Curious fact: the Bayes density is not uniform, i.e. it does not have the functional form that we have assumed!
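The uniform example can be reproduced numerically on a grid, which makes the figure's point visible: the Bayes density p(x|D) stays positive beyond the largest observed sample, while the ML plug-in density p(x|θ̂) drops to zero there. This is a sketch under assumed values (true θ = 4, n = 10, U[0,10] prior), assuming NumPy; it is not the lecture's own code.

```python
import numpy as np

rng = np.random.default_rng(9)
true_theta, n = 4.0, 10
x_obs = rng.uniform(0.0, true_theta, size=n)
m = x_obs.max()                                    # MLE of theta

theta = np.linspace(1e-3, 10.0, 20001)             # U[0,10] prior on theta
post = np.where(theta >= m, theta**(-float(n)), 0.0)   # p(theta|D) prop. to 1/theta^n on [m,10]
post /= np.trapz(post, theta)

xs = np.linspace(0.0, 10.0, 500)
# p(x|D) = int p(x|theta) p(theta|D) dtheta, with p(x|theta) = 1/theta for theta >= x
px_D = np.array([np.trapz(np.where(theta >= max(xv, m), 1.0/theta, 0.0) * post, theta)
                 for xv in xs])

px_ML = np.where(xs <= m, 1.0/m, 0.0)              # ML plug-in density p(x|theta_hat)
print(px_D[xs > m][:3])                            # nonzero beyond the largest observed sample
print(px_ML[xs > m][:3])                           # exactly zero there
```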
35 ML vs. Bayesian Estimation with Broad Prior. Suppose p(θ) is flat and broad (close to a uniform prior). p(θ|D) tends to sharpen around a peak if there is a lot of data. Thus p(D|θ) ∝ p(θ|D)/p(θ) will have the same sharp peak as p(θ|D). But by definition, the peak of p(D|θ) is the ML estimate θ̂. The integral is dominated by the peak: p(x|D) = ∫ p(x|θ) p(θ|D) dθ ≈ p(x|θ̂) ∫ p(θ|D) dθ = p(x|θ̂). Thus as n goes to infinity, the Bayesian estimate approaches the density corresponding to the MLE!
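The sharpening of p(θ|D) around the MLE with growing n can be observed with a small grid experiment. A sketch with a flat prior and hypothetical data sizes, assuming NumPy; not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0
theta_grid = np.linspace(-20, 20, 4001)      # grid over the unknown mean
prior = np.ones_like(theta_grid)             # flat, broad prior on [-20, 20]
prior /= np.trapz(prior, theta_grid)

for n in (5, 50, 500):
    x = rng.normal(5.0, sigma, size=n)
    loglik = np.array([np.sum(-0.5*np.log(2*np.pi*sigma**2) - (x - t)**2/(2*sigma**2))
                       for t in theta_grid])
    post = np.exp(loglik - loglik.max()) * prior
    post /= np.trapz(post, theta_grid)
    # the posterior peak moves onto the MLE (sample mean) and the posterior narrows as n grows
    print(n, theta_grid[np.argmax(post)], x.mean())
```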
36 ML vs. Bayesian Estimation: General Prior. Maximum Likelihood Estimation: easy to compute (use differential calculus); easy to interpret (returns one model); p(x|θ̂) has the assumed parametric form. Bayesian Estimation: hard to compute (needs multidimensional integration); harder to interpret (returns a weighted average of models); p(x|D) does not necessarily have the assumed parametric form; can give better results since it uses more information about the problem (prior information).