Parametric Density Estimation: Bayesian Estimation. Naïve Bayes Classifier


Bayesian Parameter Estimation

Suppose we have some idea of the range where the parameters θ should be. Shouldn't we formalize such prior knowledge, in hopes that it will lead to better parameter estimation? Let θ be a random variable with prior distribution P(θ). This is the key difference between ML and Bayesian parameter estimation. This key assumption allows us to fully exploit the information provided by the data.

Bayesian Parameter Estimation

θ is a random variable with prior p(θ). Unlike the MLE case, p(x|θ) is a conditional density. The training data D allow us to convert p(θ) into a posterior probability density p(θ|D): after we observe the data D, we can compute the posterior p(θ|D) using Bayes rule. But θ is not our final goal; our final goal is the unknown p(x). Therefore a better thing to do is to compute p(x|D), which is as close as we can come to the unknown p(x)!

Bayesian Estimation: Formula for p(x|D)

From the definition of the joint distribution:

    p(x|D) = ∫ p(x, θ|D) dθ

Using the definition of conditional probability:

    p(x|D) = ∫ p(x|θ, D) p(θ|D) dθ

But p(x|θ, D) = p(x|θ), since p(x|θ) is completely specified by θ. Using Bayes formula,

    p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ,   with   p(D|θ) = ∏_{i=1..n} p(x_i|θ)

So the unknown p(x|D) is expressed through the known, fully specified p(x|θ):

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ

Bayesian Estimation vs. MLE

So in principle p(x|D) can be computed. In practice, the integration may be hard to do analytically, and we may have to resort to numerical methods:

    p(x|D) = ∫ p(x|θ) ∏_{i=1..n} p(x_i|θ) p(θ) dθ / ∫ ∏_{i=1..n} p(x_i|θ) p(θ) dθ

Contrast this with the MLE solution, which requires only differentiation of the likelihood to get p(x|θ̂). Differentiation is easy and can always be done analytically.
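
Since the integral usually has no closed form, one common workaround is numerical integration over a grid of θ values. Below is a minimal sketch of this idea (not from the lecture; the function and variable names are illustrative), assuming a one-dimensional θ:

    import numpy as np

    def predictive_density(x, data, theta_grid, prior_pdf, likelihood_pdf):
        # Unnormalized log-posterior log p(D|theta) + log p(theta) on the grid.
        log_post = np.array([np.sum(np.log(likelihood_pdf(data, t)))
                             for t in theta_grid]) + np.log(prior_pdf(theta_grid))
        post = np.exp(log_post - log_post.max())      # subtract max to avoid underflow
        post /= np.trapz(post, theta_grid)            # normalize p(theta|D)
        # p(x|D) = integral of p(x|theta) p(theta|D) dtheta, done numerically.
        return np.trapz(likelihood_pdf(x, theta_grid) * post, theta_grid)

For the Gaussian case on the following slides, likelihood_pdf(x, t) could be, e.g., scipy.stats.norm.pdf(x, t, sigma); the grid just needs to cover the region where the posterior has appreciable mass.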

Bayesian Estimation vs. MLE

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ

Here p(x|θ) is the proposed model with a certain θ, and p(θ|D) is the support θ receives from the data. The above equation implies that if we are less certain about the exact value of θ, we should consider a weighted average of p(x|θ) over the possible values of θ. Contrast this with the MLE solution, which always gives us a single model: p(x|θ̂).

Bayesian Estimation for Gaussian with unknown μ

Let p(x|μ) be N(μ, σ²); that is, σ² is known, but μ is unknown and needs to be estimated, so θ = μ. Assume a prior over μ:

    p(μ) ~ N(μ₀, σ₀²)

Here μ₀ encodes some prior knowledge about the true mean, while σ₀² measures our prior uncertainty.

Bayesian Estimation for Gaussian with unknown μ

The posterior distribution is:

    p(μ|D) ∝ p(D|μ) p(μ)
           = α′ exp[ −(1/2) ( Σ_{i=1..n} ((x_i − μ)/σ)² + ((μ − μ₀)/σ₀)² ) ]
           = α″ exp[ −(1/2) ( (n/σ² + 1/σ₀²) μ² − 2 ( (1/σ²) Σ_i x_i + μ₀/σ₀² ) μ ) ]

where factors that do not depend on μ have been absorbed into the constants α′ and α″. p(μ|D) is the exponential of a quadratic function of μ, i.e. it is a normal density, and it remains normal for any number n of training samples. If we write

    p(μ|D) = (1/(√(2π) σ_n)) exp[ −(1/2) ((μ − μ_n)/σ_n)² ]

then, identifying the coefficients, we get

    1/σ_n² = n/σ² + 1/σ₀²   and   μ_n/σ_n² = (n/σ²) μ̂_n + μ₀/σ₀²

where μ̂_n = (1/n) Σ_i x_i is the sample mean.

Bayesian Estimation for Gaussian with unknown μ

Solving explicitly for μ_n and σ_n², we obtain:

    μ_n = ( nσ₀² / (nσ₀² + σ²) ) μ̂_n + ( σ² / (nσ₀² + σ²) ) μ₀

our best guess for μ after observing n samples, and

    σ_n² = σ₀² σ² / (nσ₀² + σ²)

the uncertainty about the guess, which decreases monotonically with n.
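
As a quick illustration (a sketch, not part of the lecture), these closed-form updates are one line each in code:

    import numpy as np

    def gaussian_mean_posterior(x, sigma2, mu0, sigma02):
        # Posterior over the unknown mean is N(mu_n, sigma_n2), given known
        # noise variance sigma2 and prior N(mu0, sigma02).
        n, xbar = len(x), np.mean(x)          # xbar is the sample mean (the MLE)
        mu_n = (n * sigma02 * xbar + sigma2 * mu0) / (n * sigma02 + sigma2)
        sigma_n2 = (sigma02 * sigma2) / (n * sigma02 + sigma2)
        return mu_n, sigma_n2

Note that as n grows, sigma_n2 → 0 and mu_n → xbar, matching the behavior described on the next slide.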

Bayesian Estimation for Gaussian with unknown μ

Each additional observation decreases our uncertainty about the true value of μ. As n increases, p(μ|D) becomes more and more sharply peaked, approaching a Dirac delta function as n approaches infinity. This behavior is known as Bayesian learning.

Bayesian Estimation for Gaussian with unknown μ

    μ_n = ( nσ₀² / (nσ₀² + σ²) ) μ̂_n + ( σ² / (nσ₀² + σ²) ) μ₀

In general, μ_n is a linear combination of the sample mean μ̂_n and the prior mean μ₀, with coefficients that are non-negative and sum to 1. Thus μ_n lies somewhere between μ̂_n and μ₀.
If σ₀ ≠ 0, then μ_n → μ̂_n as n → ∞.
If σ₀ = 0, our a priori certainty that μ = μ₀ is so strong that no number of observations can change our opinion.
If the a priori guess is very uncertain (σ₀ is large), we take μ_n ≈ μ̂_n.

Bayesian Estimation: Example for U[0,θ]

Let X be U[0,θ]. Recall that p(x|θ) = 1/θ inside [0,θ], and 0 elsewhere. Suppose we assume a U[0,10] prior on θ:

    p(θ) = 1/10 for 0 ≤ θ ≤ 10

This is a good prior to use if we just know the range of θ but don't know anything else.

Bayesian Estimation: Example for U[0,θ]

We need to compute

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ

using

    p(θ|D) = p(D|θ) p(θ) / ∫ p(D|θ) p(θ) dθ   and   p(D|θ) = ∏_{i=1..n} p(x_i|θ)

When computing the MLE of θ, we had

    p(D|θ) = 1/θⁿ for θ ≥ max{x₁,…,x_n}, and 0 otherwise.

Thus

    p(θ|D) = c/θⁿ for max{x₁,…,x_n} ≤ θ ≤ 10, and 0 otherwise,

where c is the normalizing constant, i.e. c = 1 / ∫ from max{x₁,…,x_n} to 10 of (1/θⁿ) dθ.
[Figure: p(θ) = 1/10 is flat on [0,10]; p(θ|D) ∝ 1/θⁿ is concentrated just above max{x₁,…,x_n}.]

Bayesian Estimation: Example for U[0,θ]

We need to compute p(x|D) = ∫ p(x|θ) p(θ|D) dθ, with p(x|θ) = 1/θ on [0,θ] and p(θ|D) = c/θⁿ on [max{x₁,…,x_n}, 10]. We have 2 cases:

1. Case x ≤ max{x₁, x₂,…,x_n}:
    p(x|D) = ∫ from max{x₁,…,x_n} to 10 of c/θⁿ⁺¹ dθ = a constant independent of x.
2. Case x > max{x₁, x₂,…,x_n} (and x ≤ 10):
    p(x|D) = ∫ from x to 10 of c/θⁿ⁺¹ dθ = (c/n)(1/xⁿ − 1/10ⁿ), which decreases as x grows.
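
A small sketch of the two cases in code (assuming n > 1 so the normalizer below has this closed form; names are illustrative):

    def uniform_predictive(x, data, prior_upper=10.0):
        n, M = len(data), max(data)
        # Normalizer of the posterior p(theta|D) = c / theta^n on [M, prior_upper].
        c = (n - 1) / (M**(1 - n) - prior_upper**(1 - n))
        if x < 0 or x > prior_upper:
            return 0.0
        lo = max(x, M)      # case 1: x <= M gives a constant; case 2: decays with x
        return (c / n) * (lo**(-n) - prior_upper**(-n))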

Bayesian Estimation: Example for U[0,θ]

[Figure: the ML density p(x|θ̂) is uniform on [0, max{x₁,…,x_n}], while the Bayes density p(x|D) is flat up to max{x₁,…,x_n} and then decays toward 10.]

Note that even for x > max{x₁, x₂,…,x_n}, the Bayes density is not zero, which makes sense. A curious fact: the Bayes density is not uniform, i.e. it does not have the functional form that we have assumed!

ML vs. Bayesian Estimation with Broad Prior

Suppose p(θ) is flat and broad (close to a uniform prior). p(θ|D) tends to sharpen if there is a lot of data:

    p(θ|D) ∝ p(D|θ) p(θ)

Thus p(D|θ) p(θ) will have the same sharp peak as p(θ|D). But by definition, the peak of p(D|θ) is the ML estimate θ̂. The integral is then dominated by the peak:

    p(x|D) = ∫ p(x|θ) p(θ|D) dθ ≈ ∫ p(x|θ̂) p(θ|D) dθ = p(x|θ̂)

Thus as n goes to infinity, the Bayesian estimate will approach the density corresponding to the MLE!

ML vs. Bayesian Estimation

Number of training data: the two methods are equivalent assuming an infinite number of training data (and prior distributions that do not exclude the true solution). For small training data sets, they give different results in most cases.
Computational complexity: ML uses differential calculus or gradient search for maximizing the likelihood; Bayesian estimation requires complex multidimensional integration techniques.

ML vs. Bayesian Estimation

Solution complexity: ML solutions are easier to interpret (they must be of the assumed parametric form). A Bayesian estimation solution might not be of the parametric form assumed; it is harder to interpret, since it returns a weighted average of models.
Prior distribution: if the prior distribution p(θ) is uniform, Bayesian estimation solutions are equivalent to ML solutions.

Naïve Bayes Classifier

Unbiased Learning of Bayes Classifiers is Impractical

Learn a Bayes classifier by estimating P(X|Y) and P(Y). Assume Y is boolean and X is a vector of n boolean attributes. In this case, we need to estimate a set of parameters

    θ_ij = P(X = x_i | Y = y_j)

where X takes on 2ⁿ possible values and Y takes on 2 possible values. How many parameters? For a particular value y_j, and the 2ⁿ possible values of x_i, we need to compute 2ⁿ − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2ⁿ − 1) such parameters. A complex model means high variance with limited data!!!

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z:

    ∀ i, j, k   P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

Example: P(Thunder | Rain, Lightning) = P(Thunder | Lightning). Note that in general Thunder is not independent of Rain, but it is given Lightning.
Equivalently: P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z).

Derivation of the Naive Bayes Algorithm

The Naive Bayes algorithm assumes that the attributes X₁,…,X_n are all conditionally independent of one another given Y. This dramatically simplifies the representation of P(X|Y) and the estimation of P(X|Y) from the training data. Consider X = (X₁, X₂):

    P(X|Y) = P(X₁, X₂ | Y) = P(X₁ | X₂, Y) P(X₂ | Y) = P(X₁ | Y) P(X₂ | Y)

For X containing n attributes:

    P(X₁,…,X_n | Y) = ∏_{i=1..n} P(X_i | Y)

Given boolean X_i and Y, we now need only 2n parameters to define P(X|Y), a dramatic reduction compared to the 2(2ⁿ − 1) parameters needed if we make no conditional independence assumption.
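
To get a feel for the size of this reduction, here is a quick arithmetic check (n = 30 is an illustrative choice, not from the slides):

    n = 30                    # number of boolean attributes (illustrative)
    print(2 * (2**n - 1))     # 2147483646 parameters without the NB assumption
    print(2 * n)              # 60 parameters with it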

The Naïve Bayes Classifier

Given: the prior P(Y); n conditionally independent features X₁,…,X_n given the class Y; and for each X_i, the likelihood P(X_i | Y).
The probability that Y takes on its k-th possible value is

    P(Y = y_k | X₁,…,X_n) ∝ P(Y = y_k) ∏_i P(X_i | Y = y_k)

The decision rule:

    y* = argmax over y_k of P(Y = y_k) ∏_i P(X_i | Y = y_k)

If the assumption holds, NB is the optimal classifier!
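
The decision rule is only a few lines of code. This is a minimal sketch (the data structures are assumptions, not from the slides): priors maps each class to P(y), and likelihoods[y][i] maps the value of attribute X_i to P(X_i = value | y). Logs are used so that a product of many small probabilities does not underflow.

    import math

    def nb_classify(x, priors, likelihoods):
        # Score each class with log P(y) + sum_i log P(x_i | y); return the argmax.
        scores = {y: math.log(p_y) + sum(math.log(likelihoods[y][i][xi])
                                         for i, xi in enumerate(x))
                  for y, p_y in priors.items()}
        return max(scores, key=scores.get)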

Naïve Bayes for Discrete Inputs

Given n attributes X_i, each taking on J possible discrete values, and a discrete variable Y taking on K possible values.
MLE for the likelihood P(X_i = x_ij | Y = y_k), given a set of training examples D:

    P̂(X_i = x_ij | Y = y_k) = #D{X_i = x_ij ∧ Y = y_k} / #D{Y = y_k}

where the #D{x} operator returns the number of elements in the set D that satisfy property x.
MLE for the prior:

    P̂(Y = y_k) = #D{Y = y_k} / |D|

where |D| is the number of elements in the training set D.

NB Example

Given the training data (a table of 14 labeled PlayTennis examples; the table itself did not survive transcription), classify the following novel instance:

    (Outlook = sunny, Temp = cool, Humidity = high, Wind = strong)

NB Example

    y_NB = argmax over y ∈ {yes, no} of P(y) P(Outlook=sunny|y) P(Temp=cool|y) P(Humidity=high|y) P(Wind=strong|y)

Priors:

    P(PlayTennis = yes) = 9/14 = 0.64,   P(PlayTennis = no) = 5/14 = 0.36

Conditional probabilities, e.g. for Wind = strong:

    P(Wind=strong | PlayTennis=yes) = 3/9 = 0.33,   P(Wind=strong | PlayTennis=no) = 3/5 = 0.6

Products:

    P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
    P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.02

So the classifier answers PlayTennis = no.
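
A numeric check of this example: the priors and the Wind = strong conditionals are quoted above; the remaining conditionals are the standard values from the PlayTennis table, filled in here as an assumption since the table did not survive transcription.

    p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # ~ 0.0053
    p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # ~ 0.0206
    print("yes" if p_yes > p_no else "no")           # -> no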

Subtleties of the NB Classifier (1): Violating the NB Assumption

Usually, features are not conditionally independent. Nonetheless, NB often performs well even when the assumption is violated. [Domingos & Pazzani '96] discuss some conditions for good performance.

Subtleties of the NB Classifier (2): Insufficient Training Data

What if you never see a training instance where X₁ = a when Y = b? Then P̂(X₁ = a | Y = b) = 0, and thus, no matter what values X₂,…,X_n take:

    P̂(Y = b | X₁ = a, X₂,…,X_n) = 0

Solution?

Subtleties of the NB Classifier (2): Insufficient Training Data

To avoid this, use a smoothed estimate, which effectively adds a number of additional "hallucinated" examples and assumes these hallucinated examples are spread evenly over the possible values of X_i. The smoothed estimates are given by

    P̂(X_i = x_ij | Y = y_k) = ( #D{X_i = x_ij ∧ Y = y_k} + l ) / ( #D{Y = y_k} + lJ )
    P̂(Y = y_k) = ( #D{Y = y_k} + l ) / ( |D| + lK )

where l, the number of hallucinated examples, determines the strength of the smoothing. If l = 1, this is called Laplace smoothing.
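
A sketch of the smoothed estimate in code (assuming the training data is a list of (x, y) pairs with x a tuple of discrete attribute values; setting l = 0 recovers the plain MLE):

    from collections import Counter

    def smoothed_likelihood(data, i, J, l=1):
        # P(X_i = x | Y = y) with l hallucinated examples per attribute value.
        joint = Counter((x[i], y) for x, y in data)   # #D{X_i = x and Y = y}
        marg = Counter(y for _, y in data)            # #D{Y = y}
        return lambda x, y: (joint[(x, y)] + l) / (marg[y] + l * J)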

Naive Bayes for Continuous Inputs

When the X_i are continuous, we must choose some other way to represent the distributions P(X_i | Y). One common approach is to assume that for each possible discrete value y_k of Y, the distribution of each continuous X_i is Gaussian. In order to train such a Naïve Bayes classifier, we must estimate the mean and standard deviation of each of these Gaussians.

Naive Bayes for Continuous Inputs

MLE for the means:

    μ̂_ik = ( 1 / Σ_j δ(Yʲ = y_k) ) Σ_j X_iʲ δ(Yʲ = y_k)

where j refers to the j-th training example, and δ(Y = y_k) is 1 if Y = y_k and 0 otherwise. Note that the role of δ is to select only those training examples for which Y = y_k.
MLE for the standard deviation:

    σ̂_ik² = ( 1 / Σ_j δ(Yʲ = y_k) ) Σ_j ( X_iʲ − μ̂_ik )² δ(Yʲ = y_k)
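
These estimators are simple per-class averages, as the sketch below shows (numpy conventions assumed; X is an examples-by-features array, y the label vector):

    import numpy as np

    def gaussian_nb_fit(X, y):
        # For each class k, the boolean mask plays the role of delta(Y^j = y_k).
        params = {}
        for k in np.unique(y):
            mask = (y == k)
            params[k] = (X[mask].mean(axis=0),   # MLE mean per attribute
                         X[mask].std(axis=0))    # MLE std (1/N normalizer)
        return params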

Learning to Classify Text

Applications: learn which news articles are of interest; learn to classify web pages by topic. Naïve Bayes is among the most effective algorithms for this task.
Target concept Interesting?: Document → {+, −}
1. Represent each document by a vector of words: one attribute per word position in the document.
2. Learning: use the training examples to estimate P(+), P(−), P(doc|+), and P(doc|−).

Text Classification Example

Text: "Text Classification, or the task of automatically assigning semantic categories to natural language text, has become one of the key methods for organizing online information. Since hand-coding classification rules is costly or even impractical, most modern approaches employ machine learning techniques to automatically learn text classifiers from examples." The text contains 48 words.
Text representation: (a₁ = 'text', a₂ = 'classification', …, a₄₈ = 'examples'). The representation contains 48 attributes.
Note: the text size may vary, but this will not cause a problem.

NB Conditional Independence Assumption

    P(doc | y) = ∏_{i=1..length(doc)} P(a_i = w_k | y)

where P(a_i = w_k | y) is the probability that the word in position i is w_k, the k-th word in the English vocabulary, given class y.
The NB assumption is that the word probabilities for one text position are independent of the words in other positions, given the document classification. This is clearly not true: the probability of the word 'learning' may be greater if the preceding word is 'machine'. The assumption is necessary, however; without it the number of probability terms is prohibitive. NB performs remarkably well despite the incorrectness of the assumption.

Text Classification Example (continued)

With the text above represented as (a₁ = 'text', a₂ = 'classification', …, a₄₈ = 'examples'), classification is:

    y* = argmax over y ∈ {+, −} of P(y) ∏_i P(a_i | y)
       = argmax over y ∈ {+, −} of P(y) P(a₁ = 'text' | y) ⋯ P(a₄₈ = 'examples' | y)

Estimating the Likelihood

Estimating P(a_i = w_k | y) is problematic because we need to estimate it for each combination of text position, English word, and target value: 48 × 50,000 × 2 ≈ 5 million such terms.
An assumption that reduces the number of terms is the Bag of Words Model: the probability of encountering a specific word w_k is independent of the specific word position,

    P(a_i = w_k | y) = P(a_m = w_k | y)   ∀ i, m

Instead of estimating P(a₁ = w_k | y), P(a₂ = w_k | y), …, we estimate a single term P(w_k | y). Now we have 50,000 × 2 distinct terms.

Estimating the Likelihood

The estimate for the likelihood is

    P(w_k | y) = ( n_k + 1 ) / ( n + |Vocabulary| )

where
    n — the total number of word positions in all training examples whose target value is y,
    n_k — the number of times the word w_k is found among these n word positions,
    |Vocabulary| — the total number of distinct words found within the training data.

Learn_Naive_Bayes_Text(Examples, V)

1. Collect all words and other tokens that occur in Examples:
   Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(y) and P(w_k | y) terms. For each target value y in V do:
   - docs_y ← the subset of Examples for which the target value is y
   - P(y) ← |docs_y| / |Examples|
   - Text_y ← a single document created by concatenating all members of docs_y
   - n ← the total number of words in Text_y (counting duplicate words multiple times)
   - For each word w_k in Vocabulary:
     * n_k ← the number of times the word w_k occurs in Text_y
     * P(w_k | y) ← ( n_k + 1 ) / ( n + |Vocabulary| )
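
A direct Python transcription of this pseudocode (a sketch; documents are assumed to be pre-tokenized lists of words):

    from collections import Counter

    def learn_naive_bayes_text(examples):
        # examples: list of (tokens, label) pairs.
        vocabulary = {w for tokens, _ in examples for w in tokens}
        priors, cond = {}, {}
        for y in {label for _, label in examples}:
            docs_y = [tokens for tokens, label in examples if label == y]
            priors[y] = len(docs_y) / len(examples)
            text_y = [w for tokens in docs_y for w in tokens]   # concatenation
            n, counts = len(text_y), Counter(text_y)
            cond[y] = {w: (counts[w] + 1) / (n + len(vocabulary))
                       for w in vocabulary}
        return vocabulary, priors, cond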

Classify_Naive_Bayes_Text(Doc)

- positions ← all word positions in Doc that contain tokens found in Vocabulary
- Return

    y* = argmax over y ∈ {+, −} of P(y) ∏ over i in positions of P(a_i | y)
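
And the matching classification step as a sketch, again using logs so that the product over many word positions does not underflow:

    import math

    def classify_naive_bayes_text(doc, vocabulary, priors, cond):
        positions = [w for w in doc if w in vocabulary]   # keep known tokens only
        scores = {y: math.log(priors[y]) +
                     sum(math.log(cond[y][w]) for w in positions)
                  for y in priors}
        return max(scores, key=scores.get)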