CS 2750 Machine Learning. Lecture 5: Density estimation. Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square.

Announcements. Homework: due on Wednesday before the class. Reports: hand in before the class. Programs: submit electronically. Collaborations on homeworks: you may discuss the material with your fellow students, but the reports and programs should be written individually.

Outline. Density estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian estimation. Bernoulli distribution. Binomial distribution. Multinomial distribution. Normal distribution.

Density estimation. Data: D = {D1, D2, ..., Dn}, where Di = xi is a vector of attribute values. Attributes are modeled by random variables X = {X1, X2, ..., Xd} with continuous or discrete values, e.g. blood pressure with numerical values, or chest pain with discrete values [no-pain, mild, moderate, strong]. Underlying true probability distribution: p(X).

Density estimation. Data: D = {D1, D2, ..., Dn}, where Di = xi is a vector of attribute values. Objective: estimate the underlying true probability distribution over the variables X, p(X), using the examples in D: the true distribution p(X) generates the n samples D = {D1, D2, ..., Dn}, from which we compute the estimate p̂(X). Standard (i.i.d.) assumptions: the samples are independent of each other and come from the same, identical distribution (a fixed p(X)).

Density estimation. Types of density estimation: Parametric: the distribution is modeled using a set of parameters Θ, p(X | Θ); example: the mean and covariances of a multivariate normal. Estimation: find the parameters Θ describing the data D. Non-parametric: the model of the distribution utilizes all examples in D, as if all examples were parameters of the distribution; example: nearest-neighbor. Semi-parametric.

Learning via parameter estimation. In this lecture we consider parametric density estimation. Basic settings: a set of random variables X = {X1, X2, ..., Xd}; a model of the distribution over the variables in X with parameters Θ: p̂(X | Θ); data D = {D1, D2, ..., Dn}. Objective: find the parameters Θ̂ that describe p(X | Θ) the best.

Parameter estimation. Maximum likelihood (ML): maximize p(D | Θ, ξ). Yields one set of parameters Θ_ML; the target distribution is approximated as p̂(X) = p(X | Θ_ML). Bayesian parameter estimation: uses the posterior distribution over possible parameters, p(Θ | D, ξ) = p(D | Θ, ξ) p(Θ | ξ) / p(D | ξ). Yields all possible settings of Θ and their weights; the target distribution is approximated as p̂(X) = p(X | D) = ∫ p(X | Θ) p(Θ | D, ξ) dΘ.

Parameter estimation. Other possible criteria: Maximum a posteriori probability (MAP): maximize p(Θ | D, ξ), the mode of the posterior. Yields one set of parameters Θ_MAP; approximation: p̂(X) = p(X | Θ_MAP). Expected value of the parameter: Θ̂ = E[Θ], the mean of the posterior, with the expectation taken with regard to the posterior p(Θ | D, ξ). Yields one set of parameters; approximation: p̂(X) = p(X | Θ̂).

Parameter estimation. Coin example: we have a coin that can be biased. Outcomes: two possible values, head or tail. Data: D, a sequence of outcomes xi such that xi = 1 for a head and xi = 0 for a tail. Model: θ is the probability of a head, (1 − θ) the probability of a tail. Objective: we would like to estimate the probability of a head, θ̂, from the data.

Parameter estimation. Example. Assume an unknown and possibly biased coin, with probability of a head θ. Data: H H T T H H T H T H T T T H T H H H H T H H H H T. Heads: 15. Tails: 10. What would be your estimate of the probability of a head, θ̃?

Solution: use the frequencies of occurrences to make the estimate: θ̃ = 15/25 = 0.6. This is the maximum likelihood estimate of the parameter θ.

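A minimal sketch of this frequency-based estimate in Python (an illustration added here, not part of the original slides; the string encoding and variable names are just for demonstration):

```python
# ML estimate of the head probability for a possibly biased coin,
# computed as the frequency of heads in the observed sequence.
data = "HHTTHHTHTHTTTHTHHHHTHHHHT"  # the 25-flip sequence from the slide

n1 = data.count("H")       # number of heads: 15
n2 = data.count("T")       # number of tails: 10
theta_ml = n1 / (n1 + n2)  # frequency estimate
print(theta_ml)            # 0.6
```
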
Probability of an outcome. Data: D, a sequence of outcomes xi such that xi = 1 for a head and xi = 0 for a tail. Model: θ probability of a head, (1 − θ) probability of a tail. Assume we know the probability θ. The probability of an outcome x of a single coin flip is then P(x | θ) = θ^x (1 − θ)^(1 − x), the Bernoulli distribution. It combines the probabilities of a head and a tail so that the exponent x picks the correct one: it gives θ for x = 1 and (1 − θ) for x = 0.

Probability of a sequence of outcomes. Assume a sequence of independent coin flips D = H H T H T H, encoded as D = 110101. What is the probability of observing the data sequence D, P(D | θ)?

Probability of a sequence of outcomes. Because the flips are independent, P(D | θ) = θ · θ · (1 − θ) · θ · (1 − θ) · θ. This is the likelihood of the data.

Probability of a sequence of outcomes. The product equals P(D | θ) = θ^4 (1 − θ)^2, and it can be rewritten using the Bernoulli distribution: P(D | θ) = ∏_{i=1}^{6} θ^{xi} (1 − θ)^{1 − xi}.

The goodness of fit to the data. Learning: we do not know the value of the parameter θ. Our learning goal: find the parameter θ that fits the data D the best. One solution to "the best": maximize the likelihood P(D | θ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1 − xi}. Intuition: the more likely the data are given the model, the better the fit. Note: instead of an error function that measures how badly the data fit the model, we have a measure that tells us how well the data fit: Error(D, θ) = − P(D | θ).

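The likelihood is straightforward to mirror in code. A minimal sketch under the 0/1 encoding above (an added illustration; the function name is hypothetical):

```python
# Likelihood of an observed coin-flip sequence under a Bernoulli model:
# P(D | theta) = prod_i theta^x_i * (1 - theta)^(1 - x_i).
def likelihood(theta, xs):
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else (1.0 - theta)
    return p

D = [1, 1, 0, 1, 0, 1]       # H H T H T H encoded as 110101
print(likelihood(0.5, D))    # 0.5**6 = 0.015625
print(likelihood(2/3, D))    # ~0.0219, higher: 2/3 is the ML estimate here
```
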
Example: Bernoulli distribution. Coin example: we have a coin that can be biased. Outcomes: two possible values, head or tail. Data: D, a sequence of outcomes xi such that xi = 1 for a head and xi = 0 for a tail. Model: θ probability of a head, (1 − θ) probability of a tail. Objective: estimate the probability of a head, θ̂. Probability of an outcome: P(x | θ) = θ^x (1 − θ)^(1 − x), the Bernoulli distribution.

Maximum likelihood (ML) estimate. Likelihood of the data: P(D | θ, ξ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1 − xi} = θ^{N1} (1 − θ)^{N2}, where N1 is the number of heads seen and N2 the number of tails seen. Maximum likelihood estimate: θ_ML = argmax_θ P(D | θ, ξ). Optimizing the log-likelihood is the same as maximizing the likelihood: l(D, θ) = log P(D | θ, ξ) = Σ_{i=1}^{n} [xi log θ + (1 − xi) log(1 − θ)] = N1 log θ + N2 log(1 − θ).

Maximum likelihood (ML) estimate. Optimize the log-likelihood l(D, θ) = N1 log θ + N2 log(1 − θ). Set the derivative to zero: ∂l(D, θ)/∂θ = N1/θ − N2/(1 − θ) = 0. Solving for θ gives the solution θ_ML = N1 / (N1 + N2).

Maximum likelihood estimate. Example. Assume an unknown and possibly biased coin, with probability of a head θ. Data: H H T T H H T H T H T T T H T H H H H T H H H H T. Heads: 15. Tails: 10. What is the ML estimate of the probability of a head and a tail?

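As a sanity check (an addition, not part of the lecture), a coarse grid search confirms that the closed-form solution maximizes the log-likelihood for the running example:

```python
# Numerically confirm that theta_ML = N1 / (N1 + N2) maximizes
# l(D, theta) = N1 log(theta) + N2 log(1 - theta).
import math

n1, n2 = 15, 10  # heads and tails from the running example

def log_likelihood(theta):
    return n1 * math.log(theta) + n2 * math.log(1.0 - theta)

grid = [i / 1000 for i in range(1, 1000)]  # theta values in (0, 1)
print(max(grid, key=log_likelihood))       # 0.6 = n1 / (n1 + n2)
```
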
Maximum likelihood estimate. Example. With 15 heads and 10 tails, the ML estimates are: head: θ_ML = 15/25 = 0.6; tail: 1 − θ_ML = 10/25 = 0.4.

Maximum a posteriori estimate. Selects the mode of the posterior distribution: θ_MAP = argmax_θ p(θ | D, ξ). How to choose the prior probability? Via Bayes' rule, p(θ | D, ξ) = P(D | θ, ξ) p(θ | ξ) / P(D | ξ), where P(D | θ, ξ) = ∏_{i=1}^{n} θ^{xi} (1 − θ)^{1 − xi} = θ^{N1} (1 − θ)^{N2} is the likelihood of the data, p(θ | ξ) is the prior probability on θ, and P(D | ξ) is the normalizing factor.

Prior distribution. Choice of prior: the Beta distribution, p(θ | ξ) = Beta(θ | α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1), where Γ(x) is the Gamma function; for integer x, Γ(x) = (x − 1)!. The posterior distribution is then again a Beta distribution: p(θ | D, ξ) = P(D | θ, ξ) p(θ | ξ) / P(D | ξ) = Beta(θ | α + N1, β + N2). Why use the Beta distribution? The Beta distribution fits Bernoulli trials; they are conjugate choices. [Figure: Beta distribution densities over θ ∈ (0, 1) for several (α, β) settings, including α = 0.5, β = 0.5.]

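A minimal sketch of the conjugate Beta-Bernoulli update (an added illustration; variable names are ours): the posterior parameters are obtained by simply adding the observed counts to the prior counts, with no integration required:

```python
# Conjugate update: Beta(alpha, beta) prior + (N1 heads, N2 tails)
# -> Beta(alpha + N1, beta + N2) posterior.
alpha, beta = 5, 5  # prior "counts" from the slide example
n1, n2 = 15, 10     # observed heads and tails

alpha_post, beta_post = alpha + n1, beta + n2
print(alpha_post, beta_post)  # Beta(20, 15) posterior
```
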
Maximum a posteriori probability. The MAP estimate selects the mode of the posterior distribution p(θ | D, ξ) = P(D | θ, ξ) p(θ | ξ) / P(D | ξ) = Beta(θ | α + N1, β + N2). Notice that the parameters of the prior act like counts of heads and tails (they are sometimes also referred to as prior counts). Solution (the mode of the Beta posterior): θ_MAP = (N1 + α − 1) / (N1 + N2 + α + β − 2).

MAP estimate example. Assume an unknown and possibly biased coin, with probability of a head θ. Data: H H T T H H T H T H T T T H T H H H H T H H H H T. Heads: 15. Tails: 10. Assume the prior p(θ | ξ) = Beta(θ | 5, 5). What is the MAP estimate?

MAP estimate example. With 15 heads, 10 tails, and the prior Beta(θ | 5, 5): θ_MAP = (15 + 5 − 1) / (15 + 10 + 5 + 5 − 2) = 19/33.

MAP estimate example. Note that the prior and the data fit (data likelihood) are combined. The MAP estimate can be biased by large prior counts, and it is hard to overturn them with a smaller sample size. Data: H H T T H H T H T H T T T H T H H H H T H H H H T (heads: 15, tails: 10). With p(θ | ξ) = Beta(θ | 5, 5): θ_MAP = 19/33 ≈ 0.58. With p(θ | ξ) = Beta(θ | 5, 20): θ_MAP = 19/48 ≈ 0.40.

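A minimal sketch computing both MAP estimates from the example (an added illustration; the helper theta_map is hypothetical):

```python
# MAP estimate: the mode of the Beta(alpha + N1, beta + N2) posterior.
def theta_map(n1, n2, alpha, beta):
    return (n1 + alpha - 1) / (n1 + n2 + alpha + beta - 2)

print(theta_map(15, 10, 5, 5))   # 19/33 ~= 0.576 (weak, symmetric prior)
print(theta_map(15, 10, 5, 20))  # 19/48 ~= 0.396 (large tail prior counts bias the estimate)
```
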
Bayesian framework. Both the ML and the MAP estimates pick a single value of the parameter. Assume there are two different parameter settings that are close in terms of their probability values: using only one of them may introduce a strong bias if we use it, for example, for predictions. The Bayesian parameter estimate remedies this limitation of a single choice: it uses all possible parameter values, with the posterior p(θ | D, ξ) = Beta(θ | α + N1, β + N2). The posterior can be used to define p̂(X): p̂(X) = p(X | D) = ∫ p(X | Θ) p(Θ | D, ξ) dΘ.

Bayesian framework. Predictive probability of the outcome x = 1 in the next trial: P(x = 1 | D, ξ) = ∫₀¹ P(x = 1 | θ, ξ) p(θ | D, ξ) dθ = ∫₀¹ θ p(θ | D, ξ) dθ = E[θ], where p(θ | D, ξ) = Beta(θ | α + N1, β + N2) is the posterior density. The predictive probability is therefore the expected value of the parameter, with the expectation taken with regard to the posterior distribution.

Expected value of the parameter. How do we obtain the expected value? For θ with density Beta(θ | α, β): E[θ] = ∫₀¹ θ Beta(θ | α, β) dθ = [Γ(α + β) / (Γ(α) Γ(β))] ∫₀¹ θ^α (1 − θ)^(β−1) dθ = [Γ(α + β) / (Γ(α) Γ(β))] · [Γ(α + 1) Γ(β) / Γ(α + β + 1)] = α / (α + β). Note: Γ(x + 1) = x Γ(x); for integer x, Γ(x + 1) = x!.

Expected value of the parameter. Substituting the posterior p(θ | D, ξ) = Beta(θ | α + N1, β + N2), we get θ̂ = E[θ] = (α + N1) / (α + β + N1 + N2). Note that the mean of the posterior is yet another reasonable parameter choice.
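
A minimal sketch of this predictive probability (an added illustration; the helper predictive_heads is hypothetical):

```python
# Bayesian predictive probability of a head in the next trial:
# the posterior mean E[theta] = (alpha + N1) / (alpha + beta + N1 + N2).
def predictive_heads(n1, n2, alpha, beta):
    return (alpha + n1) / (alpha + beta + n1 + n2)

print(predictive_heads(15, 10, 5, 5))  # 20/35 ~= 0.571
```

For the running example (15 heads, 10 tails, Beta(5, 5) prior) this gives 20/35 ≈ 0.571, close to the MAP estimate 19/33 ≈ 0.576 and the ML estimate 0.6, as expected for a weak symmetric prior.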