Lecture 19 of 42: MAP and MLE Continued; Minimum Description Length (MDL)
Wednesday, 28 February 2007
William H. Hsu, KSU
http://www.kddresearch.org
Readings for next class: Chapter 5, Mitchell

Lecture Outline
- Read Sections 6.1-6.5, Mitchell
- Overview of Bayesian learning
  - Framework: using probabilistic criteria to generate hypotheses of all kinds
  - Probability: foundations
- Bayes's Theorem
  - Definition of conditional (posterior) probability
  - Ramifications of Bayes's Theorem
    - Answering probabilistic queries
    - MAP hypotheses
- Generating Maximum A Posteriori (MAP) hypotheses
- Generating Maximum Likelihood (ML) hypotheses
- Next week: Sections 6.6-6.13, Mitchell; Roth; Pearl and Verma
  - More Bayesian learning: MDL, BOC, Gibbs, Simple (Naïve) Bayes
  - Learning over text
Bayes's Theorem and the MAP Hypothesis

Choosing Hypotheses
- Generally want the most probable hypothesis given the training data
- Define: arg max_{x ∈ Ω} f(x) ≡ the value of x in the sample space Ω with the highest f(x)
- Maximum a posteriori hypothesis, h_MAP:
  h_MAP = arg max_{h ∈ H} P(h | D)
        = arg max_{h ∈ H} P(D | h) P(h) / P(D)
        = arg max_{h ∈ H} P(D | h) P(h)

ML Hypothesis
- Assume that P(h_i) = P(h_j) for all pairs i, j (uniform priors, i.e., P(H) ~ Uniform)
- Can then simplify further and choose the maximum likelihood hypothesis:
  h_ML = arg max_{h_i ∈ H} P(D | h_i)

Bayes's Theorem: Query Answering (QA)

Answering User Queries
- Suppose we want to perform intelligent inferences over a database DB
  - Scenario 1: DB contains records (instances), some labeled with answers
  - Scenario 2: DB contains probabilities (annotations) over propositions
- QA: an application of probabilistic inference

QA Using Prior and Conditional Probabilities: Example
- Query: does the patient have cancer or not?
- Suppose the patient takes a lab test and the result comes back positive
  - Correct + result in only 98% of the cases in which the disease is actually present
  - Correct - result in only 97% of the cases in which the disease is not present
  - Only 0.008 of the entire population has this cancer
- α (false negative for H_0 ≡ Cancer) = 0.02 (NB: for 1-point sample)
- β (false positive for H_0 ≡ Cancer) = 0.03 (NB: for 1-point sample)
- P(Cancer) = 0.008          P(¬Cancer) = 0.992
  P(+ | Cancer) = 0.98       P(- | Cancer) = 0.02
  P(+ | ¬Cancer) = 0.03      P(- | ¬Cancer) = 0.97
- P(+ | H_0) P(H_0) = 0.98 × 0.008 = 0.0078;  P(+ | H_A) P(H_A) = 0.03 × 0.992 = 0.0298
- h_MAP = H_A ≡ ¬Cancer
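The cancer query above can be checked numerically. This is a minimal sketch, using only the probabilities given on the slide; the variable names are illustrative.

```python
# Worked version of the slide's cancer query: a positive test arrives and we
# compare the unnormalized posteriors P(+ | h) P(h) of the two hypotheses.
p_cancer = 0.008                     # P(Cancer): population prior
p_pos_given_cancer = 0.98            # P(+ | Cancer)
p_pos_given_no_cancer = 0.03         # P(+ | not Cancer)

score_cancer = p_pos_given_cancer * p_cancer              # 0.98 * 0.008 = 0.00784
score_no_cancer = p_pos_given_no_cancer * (1 - p_cancer)  # 0.03 * 0.992 = 0.02976

# MAP decision: pick the hypothesis with the larger unnormalized posterior
h_map = "Cancer" if score_cancer > score_no_cancer else "NoCancer"

# Normalizing by P(+) gives the actual posterior probability of cancer
p_cancer_given_pos = score_cancer / (score_cancer + score_no_cancer)
print(h_map, p_cancer_given_pos)
```

Note that even after a positive test, the posterior probability of cancer is only about 0.21, because the prior P(Cancer) = 0.008 is so small; this is why h_MAP = ¬Cancer.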
Basic Formulas for Probabilities

Product Rule (Alternative Statement of Bayes's Theorem)
- P(A ∧ B) = P(A | B) P(B)
- Proof: requires axiomatic set theory, as does Bayes's Theorem

Sum Rule
- P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
- Sketch of proof (immediate from axiomatic set theory): draw a Venn diagram of two sets denoting events A and B; let A ∧ B denote the event corresponding to A ∩ B

Theorem of Total Probability
- Suppose events A_1, A_2, ..., A_n are mutually exclusive and exhaustive
  - Mutually exclusive: i ≠ j ⇒ A_i ∩ A_j = ∅
  - Exhaustive: Σ_i P(A_i) = 1
- Then P(B) = Σ_{i=1}^{n} P(B | A_i) P(A_i)
- Proof: follows from the product rule and the 3rd Kolmogorov axiom

MAP and ML Hypotheses: A Pattern Recognition Framework

Pattern Recognition Framework
- Automated speech recognition (ASR), automated image recognition
- Diagnosis

Forward Problem: One Step in ML Estimation
- Given: model h, observations (data) D
- Estimate: P(D | h), the probability that the model generated the data

Backward Problem: Pattern Recognition / Prediction Step
- Given: model h, observations D
- Maximize: P(h(X) = x | h, D) for a new X (i.e., find the best x)

Forward-Backward (Learning) Problem
- Given: model space H, data D
- Find: h ∈ H such that P(h | D) is maximized (i.e., the MAP hypothesis)

More Info
- http://www.cs.brown.edu/research/ai/dynamics/tutorial/documents/HiddenMarkovModels.html
- Emphasis on a particular H (the space of hidden Markov models)
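The product rule and the theorem of total probability can be verified on a small made-up joint distribution. The numbers below are purely illustrative; the check is that Σ_i P(B | A_i) P(A_i) reproduces the marginal P(B).

```python
# Toy numeric check of the product rule and the theorem of total probability.
# The joint distribution over a 3-event partition {A1, A2, A3} and B is made up.
joint = {                      # P(A_i and B=b)
    ("A1", True): 0.10, ("A1", False): 0.20,
    ("A2", True): 0.15, ("A2", False): 0.25,
    ("A3", True): 0.05, ("A3", False): 0.25,
}

def p_a(a):                    # marginal P(A_i), summing over B
    return sum(p for (ai, b), p in joint.items() if ai == a)

def p_b():                     # marginal P(B)
    return sum(p for (ai, b), p in joint.items() if b)

def p_b_given(a):              # conditional P(B | A_i), via the product rule
    return joint[(a, True)] / p_a(a)

# Theorem of total probability: P(B) = sum_i P(B | A_i) P(A_i)
total = sum(p_b_given(a) * p_a(a) for a in ("A1", "A2", "A3"))
print(total, p_b())
```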
Bayesian Learning Example: Unbiased Coin [1]

Coin Flip
- Sample space: Ω = {Head, Tail}
- Scenario: the given coin is either fair or has a 60% bias in favor of Head
  - h_1 ≡ fair coin: P(Head) = 0.5
  - h_2 ≡ 60% bias towards Head: P(Head) = 0.6
- Objective: to decide between the default (null) and alternative hypotheses

A Priori (aka Prior) Distribution on H
- P(h_1) = 0.75, P(h_2) = 0.25
- Reflects the learning agent's prior beliefs regarding H
- Learning is revision of the agent's beliefs

Collection of Evidence
- First piece of evidence: d ≡ a single coin toss, comes up Head
- Q: What does the agent believe now?
- A: Compute P(d) = P(d | h_1) P(h_1) + P(d | h_2) P(h_2)

Bayesian Learning Example: Unbiased Coin [2]

Bayesian Inference: Compute P(d) = P(d | h_1) P(h_1) + P(d | h_2) P(h_2)
- P(Head) = 0.5 · 0.75 + 0.6 · 0.25 = 0.375 + 0.15 = 0.525
- This is the probability of the observation d = Head

Bayesian Learning
- Now apply Bayes's Theorem
  - P(h_1 | d) = P(d | h_1) P(h_1) / P(d) = 0.375 / 0.525 = 0.714
  - P(h_2 | d) = P(d | h_2) P(h_2) / P(d) = 0.15 / 0.525 = 0.286
  - Belief has been revised downwards for h_1, upwards for h_2
  - The agent still thinks that the fair coin is the more likely hypothesis
- Suppose we were to use the ML approach (i.e., assume equal priors)
  - Belief for h_2 is revised upwards from 0.5
  - The data then support the biased coin better

More Evidence: Sequence D of 100 Coin Flips with 70 Heads and 30 Tails
- P(D) = (0.5)^70 (0.5)^30 · 0.75 + (0.6)^70 (0.4)^30 · 0.25
- Now P(h_1 | D) << P(h_2 | D)
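The two updates on this slide (one Head, then 70 heads in 100 flips) can be reproduced with a short Bayesian-update function. This is a sketch using the slide's priors and likelihoods; log-likelihoods are used so the 100-flip case does not underflow.

```python
import math

# The slide's two hypotheses about the coin and their prior probabilities.
priors = {"h1_fair": 0.75, "h2_biased": 0.25}
p_head = {"h1_fair": 0.5, "h2_biased": 0.6}

def posterior(heads, tails):
    """P(h | D) after observing the given counts, via Bayes's Theorem.

    Works in log space: log P(h) + heads*log P(Head|h) + tails*log P(Tail|h),
    then normalizes, so long sequences (e.g., 100 flips) stay numerically safe.
    """
    log_score = {h: math.log(priors[h])
                    + heads * math.log(p_head[h])
                    + tails * math.log(1 - p_head[h])
                 for h in priors}
    m = max(log_score.values())
    unnorm = {h: math.exp(s - m) for h, s in log_score.items()}
    z = sum(unnorm.values())
    return {h: u / z for h, u in unnorm.items()}

print(posterior(1, 0))     # one Head: the fair coin is still favored
print(posterior(70, 30))   # 100 flips, 70 heads: the biased coin dominates
```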
Brute Force MAP Hypothesis Learner

Intuitive Idea: Produce the Most Likely h Given the Observed D
- Algorithm Find-MAP-Hypothesis (D)
  1. FOR each hypothesis h ∈ H, calculate the conditional (i.e., posterior) probability:
     P(h | D) = P(D | h) P(h) / P(D)
  2. RETURN the hypothesis h_MAP with the highest conditional probability:
     h_MAP = arg max_{h ∈ H} P(h | D)

Relation to Concept Learning
- Usual concept learning task
  - Instance space X
  - Hypothesis space H
  - Training examples D
- Consider the Find-S algorithm
  - Given: D
  - Return: the most specific h in the version space VS_{H,D}

MAP and Concept Learning
- Bayes's Rule: an application of Bayes's Theorem
- What would Bayes's Rule produce as the MAP hypothesis?
- Does Find-S output a MAP hypothesis?
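The Find-MAP-Hypothesis procedure above is practical only for a finite, enumerable H, and a minimal sketch is then direct. The hypothesis space, prior, and likelihood in the usage example are illustrative placeholders (three candidate coin biases), not part of the slide.

```python
# Minimal sketch of the slide's brute-force Find-MAP-Hypothesis for finite H.
def find_map_hypothesis(hypotheses, prior, likelihood, data):
    """Return (h_MAP, P(h_MAP | D)) by scoring every h in H.

    prior(h)            -> P(h)
    likelihood(data, h) -> P(D | h)
    P(D) is obtained by summing P(D | h) P(h) over all h (total probability).
    """
    scores = {h: likelihood(data, h) * prior(h) for h in hypotheses}
    p_data = sum(scores.values())                   # P(D)
    posteriors = {h: s / p_data for h, s in scores.items()}
    h_map = max(posteriors, key=posteriors.get)
    return h_map, posteriors[h_map]

# Illustrative usage: three candidate coin biases, uniform prior, D = 8 heads in 10.
hyps = [0.3, 0.5, 0.8]
prior = lambda h: 1 / len(hyps)
lik = lambda d, h: h ** d["heads"] * (1 - h) ** d["tails"]
print(find_map_hypothesis(hyps, prior, lik, {"heads": 8, "tails": 2}))
```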
Bayesian Concept Learning and Version Spaces

Assumptions
- Fixed set of instances <x_1, x_2, ..., x_m>
- Let D denote the set of classifications: D = <c(x_1), c(x_2), ..., c(x_m)>

Choose P(D | h)
- P(D | h) = 1 if h is consistent with D (i.e., ∀i. h(x_i) = c(x_i))
- P(D | h) = 0 otherwise

Choose P(h) ~ Uniform
- Uniform distribution: P(h) = 1 / |H|
- Uniform priors correspond to no background knowledge about h
- Recall: maximum entropy

MAP Hypothesis
- P(h | D) = 1 / |VS_{H,D}| if h is consistent with D, 0 otherwise

Evolution of Posterior Probabilities
- Start with uniform priors
  - Equal probabilities assigned to each hypothesis
  - Maximum uncertainty (entropy), minimum prior information
- [Figure: bar charts of P(h), P(h | D_1), and P(h | D_1, D_2) over the hypotheses]
- Evidential inference
  - Introduce data (evidence) D_1: belief revision occurs
    - The learning agent revises the conditional probability of inconsistent hypotheses to 0
    - Posterior probabilities for the remaining h ∈ VS_{H,D} are revised upward
  - Add more data (evidence) D_2: further belief revision
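The version-space posterior P(h | D) = 1 / |VS_{H,D}| can be demonstrated concretely. As an illustrative (not from the slide) choice of H, take all 16 boolean functions on 2-bit inputs, represented as truth tables, with a made-up target concept.

```python
from itertools import product

# Version-space posterior demo: uniform P(h) and a 0/1 likelihood mean every h
# consistent with D gets posterior 1/|VS|, and every inconsistent h gets 0.
instances = list(product([0, 1], repeat=2))         # x in {0,1}^2
H = list(product([0, 1], repeat=len(instances)))    # h as a truth table; |H| = 16

target = lambda x: x[0] and x[1]                    # hidden concept c (illustrative)
D = [(x, target(x)) for x in instances[:2]]         # labels for the first 2 instances

def consistent(h, data):
    return all(h[instances.index(x)] == c for x, c in data)

VS = [h for h in H if consistent(h, D)]
posterior = {h: (1 / len(VS) if h in VS else 0.0) for h in H}
print(len(VS), max(posterior.values()))             # 4 consistent hypotheses, each 1/4
```

Adding a third labeled instance would shrink VS further and push the surviving posteriors up, exactly the "evolution of posterior probabilities" pictured on the slide.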
Characterizing Learning Algorithms by Equivalent MAP Learners

Inductive System
- Training examples D and hypothesis space H → Candidate Elimination algorithm → output hypotheses

Equivalent Bayesian Inference System
- Training examples D and hypothesis space H → Brute Force MAP Learner → output hypotheses
- With P(h) ~ Uniform and P(D | h) = δ(h(x_i), c(x_i))
- Prior knowledge made explicit

Maximum Likelihood: Learning a Real-Valued Function [1]

[Figure: noisy training points y_i scattered about the target f(x), with noise e and the fitted h_ML]

Problem Definition
- Target function: any real-valued function f
- Training examples <x_i, y_i>, where y_i is a noisy training value
  - y_i = f(x_i) + e_i
  - e_i is a random variable (noise), i.i.d. ~ Normal(0, σ²), aka Gaussian noise
- Objective: approximate f as closely as possible

Solution
- The maximum likelihood hypothesis h_ML minimizes the sum of squared errors (SSE):
  h_ML = arg min_{h ∈ H} Σ_{i=1}^{m} (d_i - h(x_i))²
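The SSE-minimization claim can be illustrated empirically: score a grid of candidate hypotheses by Gaussian log-likelihood and by SSE and observe that both criteria select the same one. Everything here (the linear hypothesis class h_a(x) = a·x, the target slope 2.0, σ = 0.5) is an illustrative assumption, not from the slide.

```python
import math
import random

# Sketch: under Gaussian noise, maximizing likelihood and minimizing SSE pick
# the same hypothesis. We compare both scores over a grid of candidate slopes.
random.seed(0)
sigma = 0.5
f = lambda x: 2.0 * x                                # unknown target (illustrative)
xs = [i / 10 for i in range(20)]
ds = [f(x) + random.gauss(0, sigma) for x in xs]     # d_i = f(x_i) + e_i

def sse(a):                                          # SSE of hypothesis h_a(x) = a*x
    return sum((d - a * x) ** 2 for x, d in zip(xs, ds))

def log_lik(a):                                      # ln p(D | h_a) under N(0, sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - 0.5 * ((d - a * x) / sigma) ** 2 for x, d in zip(xs, ds))

grid = [a / 100 for a in range(100, 300)]            # candidate slopes 1.00 .. 2.99
a_ml = max(grid, key=log_lik)
a_sse = min(grid, key=sse)
print(a_ml, a_sse)                                   # the two criteria agree
```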
Maximum Likelihood: Learning a Real-Valued Function [2]

Derivation of the Least Squares Solution
- Assume the noise is Gaussian (prior knowledge)
- Maximum likelihood solution:
  h_ML = arg max_{h ∈ H} p(D | h)
       = arg max_{h ∈ H} Π_{i=1}^{m} (1 / √(2πσ²)) e^{-(1/2)((d_i - h(x_i)) / σ)²}
- Problem: computing exponents and comparing reals is expensive!
- Solution: maximize the log probability instead:
  h_ML = arg max_{h ∈ H} Σ_{i=1}^{m} [ln (1 / √(2πσ²)) - (1/2)((d_i - h(x_i)) / σ)²]
       = arg max_{h ∈ H} Σ_{i=1}^{m} -(1/2)((d_i - h(x_i)) / σ)²
       = arg max_{h ∈ H} Σ_{i=1}^{m} -(d_i - h(x_i))²
       = arg min_{h ∈ H} Σ_{i=1}^{m} (d_i - h(x_i))²

Learning to Predict Probabilities

Application: Predicting Survival Probability from Patient Data

Problem Definition
- Given training examples <x_i, d_i>, where d_i ∈ {0, 1}
- Want to train a neural network to output a probability given x_i (not a 0 or 1)

Maximum Likelihood Estimator (MLE)
- In this case one can show:
  h_ML = arg max_{h ∈ H} Σ_{i=1}^{m} [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))]
- Weight update rule for a sigmoid unit
  - [Figure: sigmoid unit with inputs x_1, ..., x_n, weights w_1, ..., w_n, and bias input x_0 = 1; net = Σ_{j=0}^{n} w_j x_j; output o = σ(net)]
  - w ← w + Δw, where Δw = r Σ_{i=1}^{m} (d_i - h(x_i)) x_i for each weight (from start-layer to end-layer unit)
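The MLE rule above can be exercised on a single sigmoid unit: gradient ascent on the cross-entropy objective, with the update Δw = r Σ_i (d_i − h(x_i)) x_i taken directly from the slide. The synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import math
import random

# Sketch of the slide's MLE training of a sigmoid unit: gradient ascent on
# sum_i [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))], using the update
# delta_w = r * sum_i (d_i - h(x_i)) * x_i.
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

random.seed(1)
# One input feature plus bias x_0 = 1; the (made-up) true event probability
# rises with x, so outputs should be probabilities increasing in x.
data = [((1.0, x), 1 if random.random() < sigmoid(2 * x - 1) else 0)
        for x in [i / 10 - 1 for i in range(21)]]

w = [0.0, 0.0]                                   # [w_0 (bias), w_1]
r = 0.1                                          # learning rate
for _ in range(500):
    grad = [0.0, 0.0]
    for x, d in data:
        h = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j in range(2):
            grad[j] += (d - h) * x[j]            # (d_i - h(x_i)) * x_ij
    w = [wi + r * gi for wi, gi in zip(w, grad)]

print(w)                                         # learned weights [w_0, w_1]
```

Note the unit's output h(x) is always strictly between 0 and 1, so it can be read as a probability, which is the point of this objective versus squared error on 0/1 targets.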
Most Probable Classification of New Instances

MAP and MLE: Limitations
- Problem so far: find the most likely hypothesis given the data
- Sometimes we just want the best classification of a new instance x, given D
- A solution method: find the best (MAP) h and use it to classify
  - This may not be optimal, though!
  - Analogy: estimating a distribution using the mode versus the integral
    - One finds the maximum, the other the area

Refined Objective
- Want to determine the most probable classification
- Need to combine the predictions of all hypotheses
- Predictions must be weighted by their conditional probabilities
- Result: Bayes Optimal Classifier (next time)

Minimum Description Length (MDL) Principle: Occam's Razor

Occam's Razor
- Recall: prefer the shortest hypothesis (an inductive bias)
- Questions
  - Why short hypotheses, as opposed to an arbitrary class of rare hypotheses?
  - What is special about minimum description length?
- Answers
  - MDL approximates an optimal coding strategy for hypotheses
  - In certain cases, this coding strategy maximizes conditional probability
- Issues
  - How exactly is minimum length being achieved (length of what)?
  - When and why can we use MDL learning for MAP hypothesis learning?
  - What does MDL learning really entail (what does the principle buy us)?

MDL Principle
- Prefer the h that minimizes the coding length of the model plus the coding length of the exceptions
  - Model: encode h using a coding scheme C_1
  - Exceptions: encode the conditioned data D | h using a coding scheme C_2
MDL Hypothesis

MDL and Optimal Coding: Bayesian Information Criterion (BIC)
- h_MDL = arg min_{h ∈ H} [L_{C_1}(h) + L_{C_2}(D | h)]
  - e.g., H ≡ decision trees, D ≡ labeled training data
  - L_{C_1}(h) ≡ number of bits required to describe tree h under encoding C_1
  - L_{C_2}(D | h) ≡ number of bits required to describe D given h under encoding C_2
  - NB: L_{C_2}(D | h) = 0 if all x are classified perfectly by h (need only describe exceptions)
  - Hence h_MDL trades off tree size against training errors
- Bayesian Information Criterion: BIC(h) = lg P(D | h) + lg P(h)
  h_MAP = arg max_{h ∈ H} P(D | h) P(h)
        = arg max_{h ∈ H} [lg P(D | h) + lg P(h)] = arg max_{h ∈ H} BIC(h)
        = arg min_{h ∈ H} [-lg P(D | h) - lg P(h)]
- Interesting fact from information theory: the optimal (shortest expected code length) code for an event with probability p is -lg(p) bits
- Interpret h_MAP as minimizing the total length of h and D | h under the optimal code
- BIC = -MDL (i.e., the arg max of BIC is the arg min of the MDL criterion)
- Prefer the hypothesis that minimizes length(h) + length(misclassifications)

Concluding Remarks on MDL

What Can We Conclude?
- Q: Does this prove once and for all that short hypotheses are best?
- A: Not necessarily
  - It only shows: if we find log-optimal representations for P(h) and P(D | h), then h_MAP = h_MDL
  - No reason to believe that h_MDL is preferable for arbitrary codings C_1, C_2
- Case in point: practical probabilistic knowledge bases
  - Elicitation of a full description of P(h) and P(D | h) is hard
  - A human implementor might prefer to specify relative probabilities

Information Theoretic Learning: Ideas
- Learning as compression
  - Abu-Mostafa: complexity of learning problems (in terms of minimal codings)
  - Wolff: computing (especially search) as compression
- (Bayesian) model selection: searching H using probabilistic criteria
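The BIC = −MDL identity is easy to check numerically: under the optimal −lg(p) code lengths, minimizing L(h) + L(D | h) selects exactly the hypothesis that maximizes P(D | h) P(h). The probability tables below are made up for illustration.

```python
import math

# Numeric check of the slide's claim: with code lengths -lg P(h) and -lg P(D|h),
# the MDL minimizer and the MAP maximizer are the same hypothesis.
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}        # P(h)      (illustrative)
lik = {"h1": 0.001, "h2": 0.01, "h3": 0.008}     # P(D | h)  (illustrative)

lg = math.log2

h_map = max(prior, key=lambda h: lik[h] * prior[h])
h_mdl = min(prior, key=lambda h: -lg(lik[h]) - lg(prior[h]))   # L(D|h) + L(h), bits
print(h_map, h_mdl)                              # the same hypothesis
```

This agreement depends on using the log-optimal code; with arbitrary codings C_1, C_2 the two criteria can diverge, which is exactly the caveat in the concluding remarks.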
Bayesian Classification

Framework
- Find the most probable classification (as opposed to the MAP hypothesis)
- f: X → V (domain ≡ instance space, range ≡ finite set of values)
- Instances x ∈ X can be described as a collection of features: x ≡ (x_1, x_2, ..., x_n)
- Performance element: Bayesian classifier
  - Given: an example (e.g., Boolean-valued instances)
  - Output: the most probable value v_j ∈ V (NB: the prior for x is constant with respect to v_MAP)
  v_MAP = arg max_{v_j ∈ V} P(v_j | x)
        = arg max_{v_j ∈ V} P(v_j | x_1, x_2, ..., x_n)
        = arg max_{v_j ∈ V} P(x_1, x_2, ..., x_n | v_j) P(v_j)

Parameter Estimation Issues
- Estimating P(v_j) is easy: for each value v_j, count its frequency in D = {<x, f(x)>}
- However, it is infeasible to estimate P(x_1, x_2, ..., x_n | v_j): too many feature combinations, so most observed counts are 0
- In practice, we need to make assumptions that allow us to estimate P(x | d)

Bayes Optimal Classifier (BOC)

Intuitive Idea
- h_MAP(x) is not necessarily the most probable classification!
- Example
  - Three possible hypotheses: P(h_1 | D) = 0.4, P(h_2 | D) = 0.3, P(h_3 | D) = 0.3
  - Suppose that for a new instance x, h_1(x) = +, h_2(x) = -, h_3(x) = -
  - What is the most probable classification of x?

Bayes Optimal Classification (BOC)
- v* = v_BOC = arg max_{v_j ∈ V} Σ_{h ∈ H} P(v_j | h) P(h | D)
- Example
  - P(h_1 | D) = 0.4, P(- | h_1) = 0, P(+ | h_1) = 1
  - P(h_2 | D) = 0.3, P(- | h_2) = 1, P(+ | h_2) = 0
  - P(h_3 | D) = 0.3, P(- | h_3) = 1, P(+ | h_3) = 0
  - Σ_{h ∈ H} P(+ | h) P(h | D) = 0.4;  Σ_{h ∈ H} P(- | h) P(h | D) = 0.6
- Result: v* = arg max_{v_j ∈ V} Σ_{h ∈ H} P(v_j | h) P(h | D) = -
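The three-hypothesis BOC example can be computed directly, using only the posteriors and predictions given on the slide; the deterministic predictions make P(v | h) an indicator.

```python
# Worked version of the slide's Bayes Optimal Classifier example.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}    # each h's classification of x

def boc(values=("+", "-")):
    """v* = arg max_v sum_h P(v | h) P(h | D), with P(v | h) = 1 iff h predicts v."""
    score = {v: sum(p for h, p in posterior.items() if prediction[h] == v)
             for v in values}
    return max(score, key=score.get), score

v_star, score = boc()
print(v_star, score)
```

The result is '-' by 0.6 to 0.4, even though the single most probable hypothesis h_1 (the MAP hypothesis) predicts '+', which is exactly the point of the slide.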
Terminology

Introduction to Bayesian Learning
- Probability foundations
  - Definitions: subjectivist, frequentist, logicist
  - (3) Kolmogorov axioms
- Bayes's Theorem
  - Prior probability of an event
  - Joint probability of an event
  - Conditional (posterior) probability of an event
- Maximum A Posteriori (MAP) and Maximum Likelihood (ML) hypotheses
  - MAP hypothesis: highest conditional probability given the observations (data)
  - ML hypothesis: highest likelihood of generating the observed data
  - ML estimation (MLE): estimating parameters to find the ML hypothesis
- Bayesian inference: computing conditional probabilities (CPs) in a model
- Bayesian learning: searching the model (hypothesis) space using CPs

Summary Points
- Introduction to Bayesian learning
  - Framework: using probabilistic criteria to search H
  - Probability foundations
    - Definitions: subjectivist, objectivist; Bayesian, frequentist, logicist
    - Kolmogorov axioms
- Bayes's Theorem
  - Definition of conditional (posterior) probability
  - Product rule
- Maximum A Posteriori (MAP) and Maximum Likelihood (ML) hypotheses
  - Bayes's Rule and MAP
  - Uniform priors: allow use of MLE to generate MAP hypotheses
  - Relation to version spaces, candidate elimination
- Next week: Sections 6.6-6.10, Mitchell; Chapters 14-15, Russell and Norvig; Roth
  - More Bayesian learning: MDL, BOC, Gibbs, Simple (Naïve) Bayes
  - Learning over text