Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE)
Ken Kreutz-Delgado (Nuno Vasconcelos)
ECE 175A, Winter 2012, UCSD

Statistical Learning
Goal: given a relationship between a feature vector x and a vector y, and data samples (x_i, y_i), find an approximating function f(x) ≈ y:
$$x \;\rightarrow\; f(\cdot) \;\rightarrow\; \hat{y} = f(x) \approx y$$
This is called training or learning. There are two major types of learning:
Unsupervised classification (aka clustering) or regression ("blind curve fitting"): only X is known.
Supervised classification or regression: both X and the target value Y are known during training; only X is known at test time.

Optimal Classifiers
Performance depends on the data/feature-space metric. Some metrics are better than others, and the meaning of "better" is connected to how well adapted the metric is to the properties of the data. But can we be more rigorous? What do we mean by optimal?
To talk about optimality we need to talk about a cost or loss $L(y, \hat{y})$, where $\hat{y} = f(x)$. The average loss (risk) is the function that we want to minimize. The risk depends on the true $y$ and the prediction $\hat{y}$, and tells us how good our predictor/estimator is.

Data-Conditional Risk, R(x, i), for 0/1 Loss
An important special case of interest: zero loss for no error and equal loss for the two error types. This is equivalent to the zero/one loss:
$$L[i, j] = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$
e.g., for the frog-classification example (dart frog vs. regular frog):

prediction \ true class   dart frog   regular frog
regular                       1            0
dart                          0            1

Under this loss
$$i^*(x) = \arg\min_i \sum_j L[i, j]\, P_{Y|X}(j \mid x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x)$$

Data-Conditional Risk, R(x, i), for 0/1 Loss
Note, then, that in the 0/1 loss case,
$$R(x, i) = E\left[L[Y, i] \mid x\right] = \sum_{j \neq i} P_{Y|X}(j \mid x) = 1 - P_{Y|X}(i \mid x).$$
I.e., the data-conditional risk under the 0/1 loss is equal to the data-conditional probability of error. Thus the optimal Bayesian decision rule (BDR) under 0/1 loss minimizes the conditional probability of error. This is given by the MAP BDR:
$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x).$$

Data-Conditional Risk, R(x, i), for 0/1 Loss
Summarizing:
$$i^*(x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x) = \arg\min_i \left[1 - P_{Y|X}(i \mid x)\right] = \arg\max_i P_{Y|X}(i \mid x)$$
The optimal decision rule is the MAP rule: pick the class with the largest probability given the observation x. This is the Bayes decision rule (BDR) for the 0/1 loss. We will often simplify our discussion by assuming this loss, but you should always be aware that other losses may be used.

BDR (under 0/1 Loss)
For the zero/one loss, the following three decision rules are optimal and equivalent:
1) $i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$
2) $i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$
3) $i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$
Form 1) is usually hard to use; 3) is frequently easier than 2).
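Form 3) is the one most often implemented. Here is a minimal sketch in Python/NumPy: the two-class 2-D Gaussian model, its parameter values, and the helper name `bdr` are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-class, 2-D Gaussian model (made-up means, covariances, priors).
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.7, 0.3]

def bdr(x):
    """Form 3 of the BDR: argmax_i [log P_{X|Y}(x|i) + log P_Y(i)]."""
    scores = [multivariate_normal.logpdf(x, mean=m, cov=S) + np.log(p)
              for m, S, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

print(bdr(np.array([0.3, -0.1])))  # expected: class 0
print(bdr(np.array([2.5, 1.8])))   # expected: class 1
```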

Gaussian BDR Classifier (0/1 Loss)
A very important case is that of Gaussian classes. The pdf of each class is a Gaussian of mean $\mu_i$ and covariance $\Sigma_i$:
$$P_{X|Y}(x \mid i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$$
The Gaussian BDR under 0/1 loss is
$$i^*(x) = \arg\max_i \left\{ -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\left[d\log(2\pi) + \log|\Sigma_i|\right] + \log P_Y(i) \right\}$$

Gaussian Classifier (0/1 Loss)
This can be written as
$$i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$
with
$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = d\log(2\pi) + \log|\Sigma_i| - 2\log P_Y(i),$$
and can be interpreted as a nearest class-neighbor classifier which uses a "funny" metric. Note that each class has its own distance function, which is related to the square of the Mahalanobis distance for that class plus the $\alpha_i$ term for that class: we effectively use different metrics in different regions of the space.
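A minimal sketch of this "nearest class with a funny metric" view, again in Python/NumPy; the per-class parameter values are made-up numbers and the helper names are my own, only the $d_i$ and $\alpha_i$ formulas come from the slide.

```python
import numpy as np

# Hypothetical per-class parameters (illustrative, not from the lecture).
mus    = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.array([[1.0, 0.2], [0.2, 1.0]]), np.array([[2.0, 0.0], [0.0, 0.5]])]
priors = [0.5, 0.5]
d = 2

def alpha(S, prior):
    # alpha_i = d*log(2*pi) + log|Sigma_i| - 2*log P_Y(i)
    return d * np.log(2 * np.pi) + np.log(np.linalg.det(S)) - 2 * np.log(prior)

def mahalanobis_sq(x, mu, S):
    diff = x - mu
    return diff @ np.linalg.solve(S, diff)

def classify(x):
    scores = [mahalanobis_sq(x, mu, S) + alpha(S, p)
              for mu, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmin(scores))

print(classify(np.array([0.2, 0.1])))  # expected: class 0
```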

Gaussian Classifier (0/1 Loss)
A special case of interest is when all classes have the same covariance, $\Sigma_i = \Sigma$:
$$i^*(x) = \arg\min_i \left[ d(x, \mu_i) + \alpha_i \right]$$
with
$$d(x, y) = (x - y)^T \Sigma^{-1} (x - y), \qquad \alpha_i = -2\log P_Y(i).$$
Note: $\alpha_i$ can be dropped when all classes have equal probability (the case shown in the figure on the slide). In this case the classifier is close in form to a NN classifier with Mahalanobis distance, but instead of finding the nearest training data point, it looks for the nearest class prototype $\mu_i$ using the Mahalanobis distance.

Gaussian Classifier (0/1 Loss)
Binary classification with $\Sigma_i = \Sigma$: one important property of this case is that the decision boundary is a hyperplane (homework). This can be shown by computing the set of points x such that
$$d(x, \mu_0) + \alpha_0 = d(x, \mu_1) + \alpha_1$$
and showing that they satisfy
$$w^T (x - x_0) = 0.$$
This is the equation of a hyperplane with normal $w$. The point $x_0$ can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case $w$ and $x_0$ are parallel. [Figure: a separating hyperplane with normal $w$ and offset point $x_0$ between sample points.]

Gaussian Classifier (0/1 Loss)
Furthermore, if all the covariances are the identity, $\Sigma_i = I$:
$$i^*(x) = \arg\min_i \left[ d(x, \mu_i) + \alpha_i \right]$$
with
$$d(x, y) = \|x - y\|^2, \qquad \alpha_i = -2\log P_Y(i).$$
This is just Euclidean-distance template matching with the class means as templates, e.g. for digit classification. Compare the complexity to nearest neighbors!

The Sigmoid in 0/1 Loss Detection
We have derived all of this from the log-based 0/1 BDR:
$$i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$$
When there are only two classes, it is also interesting to look at the original definition in an alternative form:
$$i^*(x) = \arg\max_i g_i(x)$$
with
$$g_i(x) = P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_{X|Y}(x \mid 0) P_Y(0) + P_{X|Y}(x \mid 1) P_Y(1)}$$

The Sigmoid in MAP Detection
Note that this can be written as
$$i^*(x) = \arg\max_i g_i(x), \qquad g_1(x) = 1 - g_0(x), \qquad g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_{X|Y}(x \mid 0)\, P_Y(0)}}$$
For Gaussian classes, the posterior probability for class 0 is
$$g_0(x) = \frac{1}{1 + \exp\left\{\tfrac{1}{2}\left[d_0(x, \mu_0) + \alpha_0 - d_1(x, \mu_1) - \alpha_1\right]\right\}}$$
where, as before,
$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = d\log(2\pi) + \log|\Sigma_i| - 2\log P_Y(i).$$
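A small sketch of this sigmoid form for a hypothetical 1-D two-class Gaussian problem (the numbers and helper names are illustrative); it also cross-checks the sigmoid against Bayes' rule computed directly from the densities.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D two-class Gaussian problem (illustrative numbers).
mu, var, prior = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]

def d(x, m, v):
    return (x - m) ** 2 / v                       # squared Mahalanobis distance (1-D)

def alpha(v, p):
    return np.log(2 * np.pi * v) - 2 * np.log(p)  # alpha_i for d = 1

def g0(x):
    # Posterior P(Y=0 | x) written as a sigmoid of the discriminant difference.
    z = 0.5 * (d(x, mu[0], var[0]) + alpha(var[0], prior[0])
               - d(x, mu[1], var[1]) - alpha(var[1], prior[1]))
    return 1.0 / (1.0 + np.exp(z))

# Sanity check against Bayes' rule applied directly to the class densities.
x = 0.7
num = norm.pdf(x, mu[0], np.sqrt(var[0])) * prior[0]
den = num + norm.pdf(x, mu[1], np.sqrt(var[1])) * prior[1]
print(g0(x), num / den)  # the two numbers should agree (~0.646)
```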

The Sigmoid in MAP Detection
The posterior probability for class 0,
$$g_0(x) = \frac{1}{1 + \exp\left\{\tfrac{1}{2}\left[d_0(x, \mu_0) + \alpha_0 - d_1(x, \mu_1) - \alpha_1\right]\right\}},$$
is a sigmoid. [Figure: the sigmoid-shaped posterior, with the decision boundary at $g_0(x) = 0.5$.]

The Sigmoid in Neural Networks
The sigmoid function also appears in neural networks. There, it can be interpreted as a posterior probability for a Gaussian problem where the covariances are the same.

The Sigmoid in Neural Networks
But not necessarily when the covariances are different.

Implementation
All of this is appealing, but in practice one doesn't know the values of the parameters $\mu_i$, $\Sigma_i$, $P_Y(i)$. In the homework we use an intuitive solution to design a Gaussian classifier:
Start from a collection of datasets: $D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\}$ = set of examples from class $i$.
For each class $i$, estimate the Gaussian BDR parameters using
$$\hat{\mu}_i = \frac{1}{n_i}\sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i}\sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{N},$$
where $N$ is the total number of examples (over all classes).
E.g., [figure: sample means computed for digit classification].
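A minimal sketch of these per-class estimates in Python/NumPy; the function name is my own, only the formulas come from the slide.

```python
import numpy as np

def estimate_class_parameters(X, y):
    """Per-class sample mean, covariance, and prior (the estimates on this slide).
    X: (N, d) array of feature vectors; y: (N,) array of integer class labels."""
    N = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        Sigma = diff.T @ diff / len(Xc)        # divide by n_i, not n_i - 1
        params[c] = (mu, Sigma, len(Xc) / N)   # (mean, covariance, prior)
    return params
```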

A Practical Gaussian MAP Classifier
Instead of the ideal BDR
$$i^*(x) = \arg\max_i \left\{ -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\left[d\log(2\pi) + \log|\Sigma_i|\right] + \log P_Y(i) \right\},$$
use the estimate of the BDR found from the data:
$$\hat{i}^*(x) = \arg\max_i \left\{ -\tfrac{1}{2}(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i) - \tfrac{1}{2}\left[d\log(2\pi) + \log|\hat{\Sigma}_i|\right] + \log \hat{P}_Y(i) \right\}$$
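A short sketch of the plug-in classifier, reusing the `estimate_class_parameters` helper from the previous sketch; the synthetic data and the helper name `plug_in_bdr` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plug_in_bdr(x, params):
    """Plug-in Gaussian MAP classifier: the ideal BDR with estimated parameters."""
    scores = {c: multivariate_normal.logpdf(x, mean=mu, cov=Sigma) + np.log(prior)
              for c, (mu, Sigma, prior) in params.items()}
    return max(scores, key=scores.get)

# Illustrative usage with synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([3, 3], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
params = estimate_class_parameters(X, y)
print(plug_in_bdr(np.array([2.8, 3.1]), params))  # expected: 1
```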

Important Warning
At this point all optimality claims for the BDR cease to be valid!! The BDR is guaranteed to achieve the minimum loss only when we use the true probabilities. When we plug in probability estimates, we could be implementing a classifier that is quite distant from the optimal. E.g., if $P_{X|Y}(x \mid i)$ looks like the example shown on the slide, one could never approximate it well by using simple parametric models (e.g. a single Gaussian).

Maximum Likelihood Estimation (MLE)
Given a parameterized pdf, how should one estimate the parameters which define the pdf? There are many techniques of parameter estimation; we shall utilize the maximum likelihood (ML) principle. This has three steps:
1) We choose a parametric model for all probabilities. To make this clear we denote the vector of parameters by $\Theta$ and the class-conditional distributions by
$$P_{X|Y}(x \mid i\,; \Theta), \qquad \Theta \in \mathbb{R}^p.$$
Note: this is a classical statistics approach, which means that $\Theta$ is NOT a random variable. It is a deterministic but unknown parameter, and the probabilities are a function of this unknown parameter.

Maximum Likelihood Estimation (MLE)
The three steps, continued:
2) Assemble a collection of datasets: $D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\}$ = set of examples from each class $i$.
3) Select the values of the parameters of class $i$ to be the ones that maximize the probability of the data from that class:
$$\hat{\Theta}_i = \arg\max_{\Theta} P_{X|Y}\!\left(D^{(i)} \mid i\,; \Theta\right) = \arg\max_{\Theta} \log P_{X|Y}\!\left(D^{(i)} \mid i\,; \Theta\right)$$
Note that it does not make any difference whether we maximize probabilities or their logs.

Maximum Likelihood Estimation (MLE)
Since each sample $D^{(i)}$ is considered independently, and each parameter vector $\Theta_i$ is estimated only from sample $D^{(i)}$, we simply have to repeat the procedure for all classes. So, from now on we omit the class variable:
$$\hat{\Theta}_{ML} = \arg\max_{\Theta} P_X(D; \Theta) = \arg\max_{\Theta} \log P_X(D; \Theta)$$
The function $\mathcal{L}(\Theta; D) = P_X(D; \Theta)$ is the likelihood of the parameter $\Theta$ given the data $D$, or simply the likelihood function.
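In practice this maximization can always be attempted numerically. A minimal sketch, assuming a 1-D Gaussian model with parameters $(\mu, \sigma)$ and reusing the small sample that appears in the numerical example later in the lecture; the optimizer choice and starting point are my own.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Numerical MLE: maximize log P_X(D; theta) by minimizing its negative.
D = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

def neg_log_likelihood(theta):
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return -np.sum(norm.logpdf(D, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 10.0], method="Nelder-Mead")
print(result.x)  # should approach the closed form derived later: mu = 30, sigma = sqrt(200)
```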

The Likelihood Function
Given a parameterized family of pdfs (also known as a statistical model) for the data $D$, we define a likelihood of the parameter vector $\Theta$ given $D$:
$$\mathcal{L}(\Theta) = \mathcal{L}(\Theta; D) = a(D)\, P_D(D; \Theta),$$
where $a(D) > 0$ for all $D$, and $a(D)$ is independent of the parameter $\Theta$. The choice $a(D) = 1$ yields the standard likelihood,
$$\mathcal{L}(\Theta; D) = P_D(D; \Theta),$$
which was shown on the previous slide.

Maximum Likelihood Principle
[Figure: two candidate densities $P_X(x; \Theta_1)$ and $P_X(x; \Theta_2)$ evaluated at the observed sample $x$; the likelihood $\mathcal{L}_x(\Theta) = P_X(x; \Theta)$ is larger for the parameter value whose density better explains the observation.]

The Likelihood Function
Note that the likelihood function is a function of the parameters. It does not have the same shape as the density itself; e.g., the likelihood function of a Gaussian is not bell-shaped. The likelihood is defined only after we have a data sample:
$$P_X(d; \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(d - \mu)^2}{2\sigma^2}\right\}, \qquad \Theta = (\mu, \sigma).$$

Maximum Likelihood Estimation (MLE)
Given a sample, to obtain the ML estimate we need to solve
$$\hat{\Theta}_{ML} = \arg\max_{\Theta} P_X(D; \Theta).$$
When $\Theta$ is a scalar, this is high-school calculus: we have a local maximum of $f(x)$ at a point $x$ when the first derivative at $x$ is zero ($x$ is a stationary point) and the second derivative is negative at $x$.

MLE Example
Gaussian with unknown mean and standard deviation: given a data sample $D = \{T_1, \ldots, T_N\}$ of independent and identically distributed (iid) measurements, the (standard) likelihood is
$$\mathcal{L}(\mu, \sigma; T_1, \ldots, T_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(T_i - \mu)^2}{2\sigma^2}\right\}$$

MLE Example
The log-likelihood is
$$\log \mathcal{L}(\mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2.$$
The derivative with respect to the mean is zero when
$$\frac{\partial \log \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(T_i - \mu) = 0,$$
yielding
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} T_i.$$
Note that this is just the sample mean.

MLE Example
The log-likelihood is
$$\log \mathcal{L}(\mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2.$$
The derivative with respect to the standard deviation is zero when
$$\frac{\partial \log \mathcal{L}}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{N}(T_i - \mu)^2 = 0,$$
or
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(T_i - \hat{\mu})^2.$$
Note that this is just the sample variance.

MLE Example
Numerical example: if the sample is {10, 20, 30, 40, 50}, then
$$\hat{\mu} = \frac{10 + 20 + 30 + 40 + 50}{5} = 30, \qquad \hat{\sigma}^2 = \frac{(-20)^2 + (-10)^2 + 0^2 + 10^2 + 20^2}{5} = 200, \qquad \hat{\sigma} = \sqrt{200} \approx 14.1.$$
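A two-line check of the closed-form MLE on this sample (plain NumPy; nothing assumed beyond the slide's formulas):

```python
import numpy as np

sample = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mu_hat = sample.mean()                                  # sample mean
sigma_hat = np.sqrt(np.mean((sample - mu_hat) ** 2))    # MLE std: divide by N, not N-1
print(mu_hat, sigma_hat)  # 30.0, ~14.14
```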

The Gradient
In higher dimensions, the generalization of the derivative is the gradient. The (Cartesian) gradient of a function $f(w)$ at $z$ is
$$\nabla f(z) = \left[\frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z)\right]^T.$$
The gradient has a nice geometric interpretation: it points in the direction of maximum growth of the function (the steepest-ascent direction), which makes it perpendicular to the contours where the function is constant. The above is the gradient for the simple (unweighted) Euclidean norm (aka the Cartesian gradient). [Figure: gradient vectors perpendicular to the level contours of $f(x, y)$.]

The Gradient
Note that if $\nabla f(x) = 0$, there is no direction of growth at $x$; also $-\nabla f(x) = 0$, so there is no direction of decrease at $x$. We are either at a local minimum, a local maximum, or a saddle point at $x$. Conversely, if there is a local min, max, or saddle point at $x$, there is no direction of growth or decrease at $x$, and $\nabla f(x) = 0$. This shows that we have a stationary point at $x$ if and only if $\nabla f(x) = 0$. To determine which type holds we need second-order conditions. [Figure: surfaces with a maximum, a minimum, and a saddle point.]

The Hessian
The extension of the scalar second-order derivative is the Hessian matrix of second partial derivatives:
$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$$
Note that the Hessian is symmetric. The Hessian gives us the quadratic function
$$f(x_0) + \tfrac{1}{2}(x - x_0)^T \nabla^2 f(x_0)(x - x_0)$$
that best approximates $f(x)$ at a stationary point $x_0$.

Hessian as a Quadratic Approximation
E.g., this means that, if the gradient is zero at $x_0$, we have:
a maximum when the function $f(x)$ can be locally approximated by a downward-opening (concave) quadratic bowl, i.e. $H(x_0)$ is negative definite;
a minimum when the function can be locally approximated by an upward-opening (convex) quadratic bowl, i.e. $H(x_0)$ is positive definite;
a saddle point otherwise ($H(x_0)$ is indefinite).
[Figure: maximum, minimum, and saddle-point surfaces.]

Hessian Gives Local Behavior
This is something that we already saw: for any symmetric matrix $M$, the quadratic function $x^T M x$
has a maximum (a downward-opening quadratic) at $x = 0$ when $M$ is negative definite;
has a minimum (an upward-opening quadratic bowl) at $x = 0$ when $M$ is positive definite;
has a saddle point at $x = 0$ otherwise.
Hence, similarly, what matters is the definiteness of the Hessian at a stationary point $x_0$. E.g., we have a maximum at a stationary point $x_0$ when the Hessian is negative definite at $x_0$.

Optimality Conditions
In summary: $w_0$ is a local minimum of $f(w)$ if and only if $f$ has zero gradient at $w_0$,
$$\nabla f(w_0) = 0,$$
and the Hessian of $f$ at $w_0$ is positive definite,
$$d^T \nabla^2 f(w_0)\, d \geq 0, \quad \forall d \in \mathbb{R}^n,$$
where
$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$$
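These conditions are easy to check numerically. A quick sketch that classifies a stationary point from the eigenvalues of a finite-difference Hessian; the helper `classify_stationary_point` and the example function are my own illustrations, not from the lecture.

```python
import numpy as np

def classify_stationary_point(f, x0, h=1e-5):
    """Second-order test at a stationary point x0 of f, via a finite-difference Hessian."""
    n = len(x0)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4 * h * h)
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "minimum"        # Hessian positive definite
    if np.all(eig < 0):
        return "maximum"        # Hessian negative definite
    return "saddle point (or degenerate)"

# Example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
print(classify_stationary_point(lambda x: x[0] ** 2 - x[1] ** 2, np.array([0.0, 0.0])))
```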

Maximum Likelihood Estimation (MLE)
Given a sample, to obtain an MLE we want to solve
$$\hat{\Theta}_{ML} = \arg\max_{\Theta} P_X(D; \Theta).$$
Candidate solutions are the parameter values $\hat{\Theta}$ such that
$$\nabla_{\Theta} P_X(D; \hat{\Theta}) = 0.$$
Note that you always have to check the second-order (Hessian) condition.

MLE Example
Back to our Gaussian example. Given iid samples $\{T_1, \ldots, T_N\}$, the likelihood is
$$\mathcal{L}(\mu, \sigma; T_1, \ldots, T_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(T_i - \mu)^2}{2\sigma^2}\right\}$$

MLE Example
The log-likelihood is
$$\log \mathcal{L}(\mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2.$$
The derivative with respect to the mean is
$$\frac{\partial \log \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(T_i - \mu),$$
from which we compute the second-order derivatives
$$\frac{\partial^2 \log \mathcal{L}}{\partial \mu^2} = -\frac{N}{\sigma^2}, \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \sigma\, \partial \mu} = -\frac{2}{\sigma^3}\sum_{i=1}^{N}(T_i - \mu).$$

MLE Example
The derivative with respect to the standard deviation is
$$\frac{\partial \log \mathcal{L}}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{N}(T_i - \mu)^2,$$
which yields the second-order derivatives
$$\frac{\partial^2 \log \mathcal{L}}{\partial \sigma^2} = \frac{N}{\sigma^2} - \frac{3}{\sigma^4}\sum_{i=1}^{N}(T_i - \mu)^2, \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \mu\, \partial \sigma} = -\frac{2}{\sigma^3}\sum_{i=1}^{N}(T_i - \mu).$$
The stationary parameter values are
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} T_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(T_i - \hat{\mu})^2.$$

MLE Example
The elements of the Hessian, evaluated at the stationary point (where $\sum_i (T_i - \hat{\mu}) = 0$ and $\sum_i (T_i - \hat{\mu})^2 = N\hat{\sigma}^2$), are
$$\frac{\partial^2 \log \mathcal{L}}{\partial \mu^2} = -\frac{N}{\hat{\sigma}^2}, \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \mu\, \partial \sigma} = 0, \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \sigma^2} = \frac{N}{\hat{\sigma}^2} - \frac{3N\hat{\sigma}^2}{\hat{\sigma}^4} = -\frac{2N}{\hat{\sigma}^2}.$$
Thus the Hessian is
$$\nabla^2 \log \mathcal{L}(\hat{\mu}, \hat{\sigma}) = \begin{bmatrix} -\dfrac{N}{\hat{\sigma}^2} & 0 \\ 0 & -\dfrac{2N}{\hat{\sigma}^2} \end{bmatrix},$$
which is clearly negative definite at the stationary point. Thus we have determined the MLE of the parameters.
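A quick numeric confirmation on the sample {10, 20, 30, 40, 50} from the earlier numerical example: build the analytic Hessian at the MLE and check that its eigenvalues are negative.

```python
import numpy as np

T = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
N = len(T)
mu_hat = T.mean()
sigma_hat = np.sqrt(np.mean((T - mu_hat) ** 2))

# Analytic Hessian at the stationary point: diag(-N/sigma^2, -2N/sigma^2).
H = np.array([[-N / sigma_hat ** 2, 0.0],
              [0.0, -2 * N / sigma_hat ** 2]])
print(np.linalg.eigvalsh(H))  # both eigenvalues negative -> negative definite -> a maximum
```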

2nd MLE Example
To find the MLE of the two prior class probabilities $P_Y(i)$, note that
$$P_Y(1) = \pi, \qquad P_Y(0) = 1 - \pi, \qquad \pi \in [0, 1],$$
can be written as
$$P_Y(x) = \pi^x (1 - \pi)^{1 - x}, \qquad x \in \{0, 1\},$$
where $x$ is the so-called indicator (or 0/1) variable. Given iid indicator samples $D = \{x_1, \ldots, x_N\}$, we have
$$\mathcal{L}(\pi; D) = P_Y(D; \pi) = \prod_{i=1}^{N} \pi^{x_i} (1 - \pi)^{1 - x_i}$$

2nd MLE Example
Therefore
$$\log P_Y(D; \pi) = \sum_{i=1}^{N}\left[x_i \log \pi + (1 - x_i)\log(1 - \pi)\right].$$
Setting the derivative of the log-likelihood with respect to $\pi$ equal to zero,
$$\frac{\partial}{\partial \pi}\log P_Y(D; \pi) = \sum_{i=1}^{N}\left[\frac{x_i}{\pi} - \frac{1 - x_i}{1 - \pi}\right] = \frac{1}{\pi(1 - \pi)}\left[\sum_{i=1}^{N} x_i - N\pi\right] = 0,$$

2nd MLE Example
yields the MLE estimate
$$\hat{\pi}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{n_1}{N}, \qquad \text{where } n_1 = \sum_{i=1}^{N} x_i.$$
Note that this is just the relative frequency of occurrence of the value 1 in the sample. I.e., the MLE is just the count of the number of 1's over the total number of points! Again we see that the MLE yields an intuitively pleasing estimate of the unknown parameter.
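A small sketch that checks the closed-form result (relative frequency of 1's) against a direct numerical maximization of the log-likelihood; the sample is made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 0, 1, 1, 0, 1, 1])

pi_closed_form = x.mean()  # n_1 / N

def neg_log_likelihood(pi):
    return -np.sum(x * np.log(pi) + (1 - x) * np.log(1 - pi))

pi_numeric = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6),
                             method="bounded").x
print(pi_closed_form, pi_numeric)  # both should be 0.625
```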

2nd MLE Example
Check that the second derivative is negative:
$$\frac{\partial^2}{\partial \pi^2}\log P_Y(D; \pi) = -\sum_{i=1}^{N}\left[\frac{x_i}{\pi^2} + \frac{1 - x_i}{(1 - \pi)^2}\right] < 0$$
for $0 < \pi < 1$.

Combining the MLE Examples
For Gaussian classes, all of the above formulas can be generalized to the random-vector case as follows. Let $D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\}$ be the set of iid $d$-dimensional vector examples from each class $i$. The MLE estimates in the vector random-data case are
$$\hat{\mu}_i = \frac{1}{n_i}\sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i}\sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{N}.$$
These are the sample estimates given earlier with no justification. The ML solutions are intuitive, which is usually the case.

END