CS 188: Artificial Intelligence, Spring 2011
Final Review, 5/2/2011
Pieter Abbeel, UC Berkeley

Probabilistic Reasoning
- Probability
  - Random Variables
  - Joint and Marginal Distributions
  - Conditional Distribution
  - Inference by Enumeration
  - Product Rule, Chain Rule, Bayes Rule
  - Independence
- Distributions over LARGE numbers of random variables -> Bayesian networks
  - Representation
  - Inference
    - Exact: Enumeration, Variable Elimination
    - Approximate: Sampling
  - Learning
    - Maximum likelihood parameter estimation
    - Laplace smoothing
    - Linear interpolation

Probability recap
- Conditional probability: P(x | y) = P(x, y) / P(y)
- Product rule: P(x, y) = P(x | y) P(y)
- Chain rule: P(x1, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ... = prod_i P(xi | x1, ..., x(i-1))
- X, Y independent iff: for all x, y: P(x, y) = P(x) P(y)
  - equivalently, iff: for all x, y: P(x | y) = P(x)
  - equivalently, iff: for all x, y: P(y | x) = P(y)
- X and Y are conditionally independent given Z iff: for all x, y, z: P(x, y | z) = P(x | z) P(y | z)
  - equivalently, iff: for all x, y, z: P(x | y, z) = P(x | z)
  - equivalently, iff: for all x, y, z: P(y | x, z) = P(y | z)

Inference by Enumeration
- P(sun)?  P(sun | winter)?  P(sun | winter, hot)?  (see the sketch at the end of this section)

  S       T     W     P
  summer  hot   sun   0.30
  summer  hot   rain  0.05
  summer  cold  sun   0.10
  summer  cold  rain  0.05
  winter  hot   sun   0.10
  winter  hot   rain  0.05
  winter  cold  sun   0.15
  winter  cold  rain  0.20

Chain Rule -> Bayes net
- Chain rule: can always write any joint distribution as an incremental product of conditional distributions:
  P(x1, x2, ..., xn) = prod_i P(xi | x1, ..., x(i-1))
- Bayes nets: make conditional independence assumptions of the form:
  P(xi | x1, ..., x(i-1)) = P(xi | parents(Xi))
  giving us: P(x1, x2, ..., xn) = prod_i P(xi | parents(Xi))

Example: Alarm Network
- Burglary (B) and Earthquake (E) are parents of Alarm (A); John calls (J) and Mary calls (M) each depend only on A.

  B    P(B)       E    P(E)
  +b   0.001      +e   0.002
  -b   0.999      -e   0.998

  A    J    P(J|A)     A    M    P(M|A)
  +a   +j   0.9        +a   +m   0.7
  +a   -j   0.1        +a   -m   0.3
  -a   +j   0.05       -a   +m   0.01
  -a   -j   0.95       -a   -m   0.99

  B    E    A    P(A|B,E)
  +b   +e   +a   0.95
  +b   +e   -a   0.05
  +b   -e   +a   0.94
  +b   -e   -a   0.06
  -b   +e   +a   0.29
  -b   +e   -a   0.71
  -b   -e   +a   0.001
  -b   -e   -a   0.999
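Not from the slides: a minimal Python sketch of inference by enumeration over the S/T/W table above. The table entries and the three queries are from the slide; the names (`joint`, `prob`) are illustrative.

```python
# Inference by enumeration: answer queries by summing entries of the joint.
joint = {
    ('summer', 'hot',  'sun'):  0.30,
    ('summer', 'hot',  'rain'): 0.05,
    ('summer', 'cold', 'sun'):  0.10,
    ('summer', 'cold', 'rain'): 0.05,
    ('winter', 'hot',  'sun'):  0.10,
    ('winter', 'hot',  'rain'): 0.05,
    ('winter', 'cold', 'sun'):  0.15,
    ('winter', 'cold', 'rain'): 0.20,
}

def prob(s=None, t=None, w=None):
    """Sum the joint entries consistent with a partial assignment to S, T, W."""
    return sum(p for (s_, t_, w_), p in joint.items()
               if (s is None or s_ == s)
               and (t is None or t_ == t)
               and (w is None or w_ == w))

print(prob(w='sun'))                                 # P(sun) = 0.65
print(prob(s='winter', w='sun') / prob(s='winter'))  # P(sun | winter) = 0.25 / 0.50 = 0.5
print(prob(s='winter', t='hot', w='sun')
      / prob(s='winter', t='hot'))                   # P(sun | winter, hot) = 0.10 / 0.15
```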
Size of a Bayes Net
- How big is a joint distribution over N Boolean variables? 2^N
- Size of the representation if we use the chain rule? Still 2^N
- How big is an N-node net if nodes have up to k parents? O(N * 2^(k+1))
- Both give you the power to calculate P(X1, ..., XN).
- BNs: huge space savings! Also easier to elicit local CPTs, and faster to answer queries.

Bayes Nets: Assumptions
- Assumptions we are required to make to define the Bayes net when given the graph:
  P(xi | x1, ..., x(i-1)) = P(xi | parents(Xi))
- Given a Bayes net graph, additional conditional independences can be read off directly from the graph.
- Question: Are two nodes necessarily independent given certain evidence?
  - If no, can prove with a counterexample: i.e., pick a set of CPTs, and show that the independence assumption is violated by the resulting distribution.
  - If yes, can prove with: Algebra (tedious), or D-separation (analyzes the graph).

D-Separation
- Question: Are X and Y conditionally independent given evidence variables {Z}?
- Yes, if X and Y are "separated" by Z: consider all (undirected) paths from X to Y; no active paths = independence!
- A path is active if each triple along it is active:
  - Causal chain A -> B -> C (either direction) where B is unobserved
  - Common cause A <- B -> C where B is unobserved
  - Common effect (aka v-structure) A -> B <- C where B or one of its descendants is observed
- All it takes to block a path is a single inactive segment.
  (Figure: the active and inactive triples.)
- Algorithm, given the query "Xi independent of Xj given {Xk1, ..., Xkn}?" (see the sketch at the end of this section):
  - Shade all evidence nodes.
  - For all (undirected!) paths between Xi and Xj: check whether the path is active. If active, return "independence not guaranteed".
  - (If reaching this point, all paths have been checked and shown inactive.) Return "Xi independent of Xj given {Xk1, ..., Xkn} is guaranteed".

Example
- (Figure: a small Bayes net over R, B, T, D; running d-separation answers queries such as "R independent of B?" with yes/no.)

All Conditional Independences
- Given a Bayes net structure, can run d-separation to build a complete list of conditional independences that are necessarily true, of the form "Xi independent of Xj given {Xk1, ..., Xkn}".
- This list determines the set of probability distributions that can be represented by Bayes nets with this graph structure.
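Not from the slides: a minimal Python sketch of the triple-activity test that the d-separation algorithm above applies along each path. The `parents` encoding and the function names are assumptions for illustration.

```python
def triple_active(parents, observed, a, b, c):
    """Is the undirected triple a - b - c active, given the observed set?
    `parents` maps each node to the set of its parents in the Bayes net."""
    def observed_self_or_descendant(n):
        # True if n, or any descendant of n, is observed.
        return n in observed or any(observed_self_or_descendant(child)
                                    for child, ps in parents.items() if n in ps)
    if a in parents[b] and c in parents[b]:
        # Common effect (v-structure) a -> b <- c:
        # active iff b or one of its descendants is observed.
        return observed_self_or_descendant(b)
    # Causal chain (either direction) or common cause: active iff b is unobserved.
    return b not in observed

# Alarm network: B -> A <- E, A -> J, A -> M.
parents = {'B': set(), 'E': set(), 'A': {'B', 'E'}, 'J': {'A'}, 'M': {'A'}}
print(triple_active(parents, set(),  'B', 'A', 'E'))  # False: v-structure, A unobserved
print(triple_active(parents, {'J'}, 'B', 'A', 'E'))   # True: A's descendant J is observed
```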
Topology Limits Distributions
- Given some graph topology G, only certain joint distributions can be encoded.
- The graph structure guarantees certain (conditional) independences. (There might be more independence.)
- Adding arcs increases the set of distributions, but has several costs.
- Full conditioning can encode any distribution.
  (Figure: three-node topologies over X, Y, Z with their guaranteed independence sets, shrinking from many independences down to the empty set {} for the fully connected graph.)

Bayes Nets Status
- Representation (done)
- Inference
- Learning Bayes Nets from Data

Inference by Enumeration
- Given unlimited time, inference in BNs is easy. Recipe:
  - State the marginal probabilities you need.
  - Figure out ALL the atomic probabilities you need.
  - Calculate and combine them.
- Example: a query in the alarm network (B, E, A, J, M).
- In this simple method, we only need the BN to synthesize the joint entries.

Variable Elimination
- Why is inference by enumeration so slow? You join up the whole joint distribution before you sum out the hidden variables. You end up repeating a lot of work!
- Idea: interleave joining and marginalizing! Called "Variable Elimination".
- Still NP-hard, but usually much faster than inference by enumeration.

Variable Elimination Outline (network R -> T -> L)
- Track objects called factors. Initial factors are local CPTs (one per node):

  P(R)          P(T|R)            P(L|T)
  +r   0.1      +r  +t  0.8       +t  +l  0.3
  -r   0.9      +r  -t  0.2       +t  -l  0.7
                -r  +t  0.1       -t  +l  0.1
                -r  -t  0.9       -t  -l  0.9

- Any known values are selected. E.g. if we know L = -l, the initial factors are P(R), P(T|R), and the selected rows P(-l|T): +t -l 0.7, -t -l 0.9.
- VE: alternately join factors and eliminate variables (a sketch of these two primitives follows).
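Not from the slides: a minimal sketch of VE's two primitives, join and sum out, named in the outline above. The factor representation (a dict from sorted (variable, value) tuples to probabilities) is an assumption for illustration.

```python
def join(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    out = {}
    for a1, p1 in f1.items():
        for a2, p2 in f2.items():
            d1, d2 = dict(a1), dict(a2)
            # Rows combine only where the shared variables agree.
            if all(d1[v] == d2[v] for v in d1.keys() & d2.keys()):
                out[tuple(sorted({**d1, **d2}.items()))] = p1 * p2
    return out

def sum_out(var, f):
    """Eliminate a variable from a factor by summing over its values."""
    out = {}
    for a, p in f.items():
        reduced = tuple((v, x) for v, x in a if v != var)
        out[reduced] = out.get(reduced, 0.0) + p
    return out
```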
Variable Elimination Example (network R -> T -> L, query P(L); numbers reproduced in the sketch at the end of this section)
- Initial factors:
  P(R): +r 0.1, -r 0.9
  P(T|R): +r +t 0.8, +r -t 0.2, -r +t 0.1, -r -t 0.9
  P(L|T): +t +l 0.3, +t -l 0.7, -t +l 0.1, -t -l 0.9
- Join R (multiply P(R) into P(T|R)): P(R,T): +r +t 0.08, +r -t 0.02, -r +t 0.09, -r -t 0.81; P(L|T) is carried along unchanged.
- Sum out R: P(T): +t 0.17, -t 0.83
- Join T: P(T,L): +t +l 0.051, +t -l 0.119, -t +l 0.083, -t -l 0.747
- Sum out T: P(L): +l 0.134, -l 0.866
  (* VE is variable elimination)

Example (query P(B | +j, +m) in the alarm network)
- Choose E: join all factors mentioning E, then sum out E.
- Choose A: join all factors mentioning A, then sum out A.
- Finish with B: join the remaining factors.
- Normalize.

General Variable Elimination
- Query: P(Q | E1 = e1, ..., Ek = ek)
- Start with initial factors: local CPTs (but instantiated by evidence).
- While there are still hidden variables (not Q or evidence):
  - Pick a hidden variable H.
  - Join all factors mentioning H.
  - Eliminate (sum out) H.
- Join all remaining factors and normalize.

Approximate Inference: Sampling
- Basic idea: draw N samples from a sampling distribution S; compute an approximate posterior probability; show this converges to the true probability P.
- Why? Faster than computing the exact answer.
- Prior sampling: sample ALL variables in topological order, as this can be done quickly.
- Rejection sampling for query P(Q | e1, ..., ek): like prior sampling, but reject when a variable is sampled inconsistent with the query, in this case when a variable Ei is sampled differently from ei.
- Likelihood weighting for query P(Q | e1, ..., ek): like prior sampling, but the variables Ei are not sampled; when it's their turn, they get set to ei, and the sample gets weighted by P(ei | value of parents(Ei) in current sample).
- Gibbs sampling: repeatedly samples each non-evidence variable conditioned on all other variables -> can incorporate downstream evidence.
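Not from the slides: running the R -> T -> L example with the join/sum_out sketch above reproduces the slide's numbers (same assumed factor encoding).

```python
P_R = {(('R', '+r'),): 0.1, (('R', '-r'),): 0.9}
P_T_given_R = {(('R', r), ('T', t)): p for (r, t), p in {
    ('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
    ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}.items()}
P_L_given_T = {(('L', l), ('T', t)): p for (t, l), p in {
    ('+t', '+l'): 0.3, ('+t', '-l'): 0.7,
    ('-t', '+l'): 0.1, ('-t', '-l'): 0.9}.items()}

f = sum_out('R', join(P_R, P_T_given_R))   # P(T): +t 0.17,  -t 0.83
f = sum_out('T', join(f, P_L_given_T))     # P(L): +l 0.134, -l 0.866
print(f)
```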
Prior Sampling Example (network: Cloudy -> Sprinkler, Cloudy -> Rain, {Sprinkler, Rain} -> WetGrass; see the sketch at the end of this section)

  P(C)          P(S|C)            P(R|C)
  +c   0.5      +c  +s  0.1       +c  +r  0.8
  -c   0.5      +c  -s  0.9       +c  -r  0.2
                -c  +s  0.5       -c  +r  0.2
                -c  -s  0.5       -c  -r  0.8

  P(W|S,R)
  +s  +r  +w  0.99     +s  +r  -w  0.01
  +s  -r  +w  0.90     +s  -r  -w  0.10
  -s  +r  +w  0.90     -s  +r  -w  0.10
  -s  -r  +w  0.01     -s  -r  -w  0.99

- Samples: +c, -s, +r, +w;  -c, +s, -r, +w;  ...
- We'll get a bunch of samples from the BN:
  +c, -s, +r, +w
  +c, +s, +r, +w
  -c, +s, +r, -w
  +c, -s, +r, +w
  -c, -s, -r, +w
- If we want to know P(W): we have counts <+w: 4, -w: 1>; normalize to get P(W) = <+w: 0.8, -w: 0.2>.
- This will get closer to the true distribution with more samples.
- Can estimate anything else, too: what about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
- Fast: can use fewer samples if less time.

Likelihood Weighting (same network, with the evidence variables fixed)
- Sampling distribution if z is sampled and e is fixed evidence: S_WS(z, e) = prod_i P(zi | parents(Zi))
- Now, samples have weights: w(z, e) = prod_i P(ei | parents(Ei))
- Together, the weighted sampling distribution is consistent: S_WS(z, e) * w(z, e) = P(z, e)

Gibbs Sampling
- Idea: instead of sampling from scratch, create samples that are each like the last one.
- Procedure: resample one variable at a time, conditioned on all the rest, but keep the evidence fixed.
- Properties: now samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators!
- What's the point: both upstream and downstream variables condition on the evidence.

Markov Models
- A Markov model is a chain-structured BN: X1 -> X2 -> X3 -> X4 -> ...
- Each node is identically distributed (stationarity); the value of X at a given time is called the state.
- The chain is just a (growing) BN: we can always use generic BN reasoning on it if we truncate the chain at a fixed length.
- Stationary distributions: for most chains, the distribution we end up in is independent of the initial distribution. Called the stationary distribution of the chain:
  P_inf(X) = sum_x P_(t+1|t)(X | x) P_inf(x)
- Example applications: web link analysis (PageRank) and Gibbs Sampling.
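Not from the slides: a minimal Python sketch of prior sampling for the Cloudy/Sprinkler/Rain/WetGrass network above. The CPT numbers are from the slide; the function names are illustrative.

```python
import random

def bernoulli(p):
    """Return '+' with probability p, else '-'."""
    return '+' if random.random() < p else '-'

def prior_sample():
    """Sample all variables in topological order: C, S, R, W."""
    c = bernoulli(0.5)                                   # P(C)
    s = bernoulli({'+': 0.1, '-': 0.5}[c])               # P(S | C)
    r = bernoulli({'+': 0.8, '-': 0.2}[c])               # P(R | C)
    w = bernoulli({('+', '+'): 0.99, ('+', '-'): 0.90,   # P(W | S, R)
                   ('-', '+'): 0.90, ('-', '-'): 0.01}[(s, r)])
    return c, s, r, w

# Estimate P(+w) by counting and normalizing, as on the slide.
samples = [prior_sample() for _ in range(10000)]
print(sum(s[3] == '+' for s in samples) / len(samples))
```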
Hidden Markov Models
- Underlying Markov chain over states S; you observe outputs (effects) at each time step.
  (Figure: X1 -> X2 -> X3 -> X4 -> X5, with an observed Ei hanging off each Xi.)
- Speech recognition HMMs: Xi: specific positions in specific words; Ei: acoustic signals
- Machine translation HMMs: Xi: translation options; Ei: observations are words
- Robot tracking: Xi: positions on a map; Ei: range readings

Online Belief Updates
- Every time step, we start with the current P(X | evidence).
- We update for time: P(xt | e(1:t-1)) = sum over x(t-1) of P(xt | x(t-1)) P(x(t-1) | e(1:t-1))
- We update for evidence: P(xt | e(1:t)) is proportional to P(et | xt) P(xt | e(1:t-1))
- The forward algorithm does both at once (and doesn't normalize).

Particle Filtering
- Particle filtering = likelihood weighting + resampling at each time slice.
- Why: sometimes |X| is too big to use exact inference.
  (Figure: a grid of states with probabilities such as 0.1, 0.2, 0.5 approximated by particle counts.)
- Elapse time: each particle is moved by sampling its next position from the transition model. "Particle" is just a new name for "sample".
- Observe: we don't sample the observation, we fix it and downweight our samples based on the evidence, as in likelihood weighting.
- Resample: rather than tracking weighted samples, we resample: N times, we choose from our weighted sample distribution.
- (See the sketch of one full step after these slides.)

Dynamic Bayes Nets (DBNs)
- We want to track multiple variables over time, using multiple sources of evidence.
- Idea: repeat a fixed Bayes net structure at each time; variables from time t can condition on those from t-1.
  (Figure: an unrolled DBN for t = 1, 2, 3, with state variables G and evidence variables E at each slice.)
- Discrete-valued dynamic Bayes nets are also HMMs.

Bayes Nets Status
- Representation (done)
- Inference (done)
- Learning Bayes Nets from Data
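Not from the slides: a minimal sketch of one particle filtering step (elapse, observe, resample). The interfaces `sample_transition` and `evidence_prob` are assumptions standing in for a concrete transition and sensor model.

```python
import random

def particle_filter_step(particles, evidence, sample_transition, evidence_prob):
    """particles: a list of states; sample_transition(x) draws x' from P(X' | x);
    evidence_prob(e, x) returns P(e | x)."""
    # Elapse time: move each particle by sampling from the transition model.
    particles = [sample_transition(x) for x in particles]
    # Observe: fix the evidence and downweight, as in likelihood weighting.
    weights = [evidence_prob(evidence, x) for x in particles]
    # Resample: draw N new particles from the weighted sample distribution.
    return random.choices(particles, weights=weights, k=len(particles))
```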
Parameter Estimation
- Estimating the distribution of random variables like X or X | Y.
- Empirically: use training data. For each outcome x, look at the empirical rate of that value:
  P_ML(x) = count(x) / total number of samples
  e.g. for the sample sequence r, g, g: P_ML(r) = 1/3.
- This is the estimate that maximizes the likelihood of the data.
- Laplace smoothing: pretend we saw every outcome k extra times:
  P_LAP,k(x) = (count(x) + k) / (N + k |X|)
- Smooth each condition independently:
  P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k |X|)
- (See the sketch of both estimators at the end of this section.)

Bayes Nets Status
- Representation (done)
- Inference (done)
- Learning Bayes Nets from Data (done)

Classification: Feature Vectors
- "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..."
  -> (# free: 2, YOUR_NAME: 0, MISSPELLED: 2, FROM_FRIEND: 0, ...) -> SPAM or +
- (Pixel image of a digit) -> (PIXEL-7,12: 1, PIXEL-7,13: 0, ..., NUM_LOOPS: 1, ...) -> "2"

Classification overview
- Naive Bayes: builds a model of the training data; gives prediction probabilities; strong assumptions about feature independence; one pass through the data (counting).
- Perceptron: makes fewer assumptions about the data; mistake-driven learning; multiple passes through the data (prediction); often more accurate.
- MIRA: like perceptron, but with adaptive scaling of the size of the update.
- SVM: properties similar to perceptron; convex optimization formulation.
- Nearest-Neighbor: non-parametric: more expressive with more training data.
- Kernels: an efficient way to make linear learning architectures into nonlinear ones.

Bayes Nets for Classification
- One method of classification: use a probabilistic model!
- Features are observed random variables Fi; Y is the query variable.
- Use probabilistic inference to compute the most likely Y: y = argmax_y P(y | f1, ..., fn).
- You already know how to do this inference.

General Naive Bayes
- A general naive Bayes model (structure: Y -> F1, F2, ..., Fn):
  P(Y, F1, ..., Fn) = P(Y) prod_i P(Fi | Y)
- Parameters: |Y| for P(Y) plus n * |F| * |Y| for the conditionals, versus |Y| * |F|^n for the full joint.
- We only specify how each feature depends on the class; the total number of parameters is linear in n.
- Our running example: digits.
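Not from the slides: a minimal sketch of the two estimators above; k = 0 gives the maximum likelihood estimate, k > 0 gives Laplace smoothing. The function name `estimate` is illustrative.

```python
from collections import Counter

def estimate(samples, domain, k=0):
    """P_LAP,k(x) = (count(x) + k) / (N + k |X|); k = 0 is the ML estimate."""
    counts, n = Counter(samples), len(samples)
    return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

data = ['r', 'g', 'g']                    # the slide's example sequence
print(estimate(data, ['r', 'g']))         # ML:      r: 1/3, g: 2/3
print(estimate(data, ['r', 'g'], k=1))    # Laplace: r: 2/5, g: 3/5
```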
Bag-of-Words Naive Bayes
- Generative model: P(C, W1, ..., Wn) = P(C) prod_i P(Wi | C)
- Wi is the word at position i, not the i-th word in the dictionary!
- Bag-of-words: each position is identically distributed; all positions share the same conditional probabilities P(W|C) -> when learning the parameters, data is shared over all positions in the document (rather than separately learning a distribution for each position in the document).
- Our running example: spam vs. ham.

Linear Classifier
- Binary linear classifier: predict positive iff w . f(x) >= 0.
- Multiclass linear classifier: a weight vector for each class: w_y.
  - Score (activation) of a class y: w_y . f(x)
  - Prediction: highest score wins: y = argmax_y w_y . f(x)
- Binary = multiclass where the negative class has weight zero.

Perceptron
- Perceptron = an algorithm to learn the weights w (see the sketch after these slides):
  - Start with zero weights.
  - Pick up training instances one by one.
  - Classify with the current weights.
  - If correct: no change! If wrong: lower the score of the wrong answer, raise the score of the right answer:
    w_y = w_y - f(x);  w_y* = w_y* + f(x)

Problems with the Perceptron
- Noise: if the data isn't separable, the weights might thrash. Averaging weight vectors over time can help (averaged perceptron).
- Mediocre generalization: finds a "barely" separating solution.
- Overtraining: test / held-out accuracy usually rises, then falls. Overtraining is a kind of overfitting.

Fixing the Perceptron: MIRA
- Choose an update size that fixes the current mistake and also minimizes the change to w.
- Update w by solving: minimize ||w - w'||^2 subject to the current example being classified correctly with a margin.

Support Vector Machines
- Maximizing the margin: good according to intuition, theory, practice.
- Support vector machines (SVMs) find the separator with max margin.
- Basically, SVMs are MIRA where you optimize over all examples at once (with a constant C trading off margin against training error).
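Not from the slides: a minimal sketch of the multiclass perceptron described above, with features encoded as dicts from feature name to value (an assumed encoding).

```python
from collections import defaultdict

def dot(w, f):
    return sum(w[k] * v for k, v in f.items())

def train_perceptron(data, classes, passes=5):
    """data: a list of (feature_dict, true_label) pairs. Start with zero
    weights; classify with the current weights; on a mistake, lower the wrong
    class's score and raise the right class's score."""
    weights = {y: defaultdict(float) for y in classes}
    for _ in range(passes):
        for f, y_true in data:
            y_pred = max(classes, key=lambda y: dot(weights[y], f))
            if y_pred != y_true:                 # if correct: no change
                for k, v in f.items():
                    weights[y_pred][k] -= v      # lower score of wrong answer
                    weights[y_true][k] += v      # raise score of right answer
    return weights
```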
Non-Linear Separators
- Data that is linearly separable (with some noise) works out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space, e.g. x -> (x, x^2)?
- General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Phi: x -> phi(x)
  (This and the next few slides adapted from Ray Mooney, UT.)

Some Kernels
- Kernels implicitly map original vectors to higher-dimensional spaces, take the dot product there, and hand the result back.
- Linear kernel: K(x, x') = x . x' (i.e., phi(x) = x)
- Quadratic kernel: K(x, x') = (x . x' + 1)^2
- Polynomial kernel: K(x, x') = (x . x' + 1)^d

Why and When Kernels?
- Can't you just add these features on your own (e.g. add all pairs of features instead of using the quadratic kernel)?
  - Yes, in principle, just compute them; no need to modify any algorithms.
  - But the number of features can get large (or infinite).
- Kernels let us compute with these features implicitly. Example: the implicit dot product in the polynomial, Gaussian, and string kernels takes much less space and time per dot product. (See the check at the end of this section.)
- When can we use kernels? When our learning algorithm can be reformulated in terms of only inner products between feature vectors. Examples: perceptron, support vector machines.

K-Nearest Neighbors
- 1-NN: copy the label of the most similar data point.
- K-NN: let the k nearest neighbors vote (have to devise a weighting scheme).
  (Figure: fits with 2, 10, 100, and 10000 examples, compared to the truth.)
- Parametric models: fixed set of parameters; more data means better settings.
- Non-parametric models: complexity of the classifier increases with data; better in the limit, often worse in the non-limit.
- (K)NN is non-parametric.
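Not from the slides: a small check that the quadratic kernel (x . x' + 1)^2 equals an ordinary dot product in an explicit expanded feature space, which is what "computing with these features implicitly" means. The feature map `phi` is the standard expansion for this kernel.

```python
import itertools, math

def quadratic_kernel(x, y):
    return (sum(a * b for a, b in zip(x, y)) + 1) ** 2

def phi(x):
    """Explicit feature map whose dot product equals (x . y + 1)^2."""
    return ([1.0]
            + [math.sqrt(2) * a for a in x]                # linear terms
            + [a * a for a in x]                           # squares
            + [math.sqrt(2) * x[i] * x[j]                  # pairwise cross terms
               for i, j in itertools.combinations(range(len(x)), 2)])

x, y = [1.0, 2.0], [3.0, 0.5]
print(quadratic_kernel(x, y))                          # 25.0
print(sum(a * b for a, b in zip(phi(x), phi(y))))      # 25.0, the same value
```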
Basic Similarity
- Many similarities are based on feature dot products: sim(x, x') = f(x) . f(x')
- If the features are just the pixels: sim(x, x') = sum_i xi x'i
- Note: not all similarities are of this form.

Important Concepts
- Data: labeled instances, e.g. emails marked spam/ham: training set, held-out set, test set.
- Features: attribute-value pairs which characterize each x.
- Experimentation cycle: learn parameters (e.g. model probabilities) on the training set; (tune hyperparameters on the held-out set); compute accuracy on the test set. Very important: never "peek" at the test set!
- Evaluation: accuracy = fraction of instances predicted correctly.
- Overfitting and generalization: we want a classifier which does well on test data. Overfitting: fitting the training data very closely, but not generalizing well. We'll investigate overfitting and generalization formally in a few lectures.

Tuning on Held-Out Data
- Now we've got two kinds of unknowns:
  - Parameters: the probabilities P(X | Y), P(Y)
  - Hyperparameters: the amount of smoothing to do: k, alpha (naive Bayes); the number of passes over the training data (perceptron)
- Where to learn? Learn parameters from the training data. Must tune hyperparameters on different data: for each value of the hyperparameters, train and test on the held-out data; choose the best value and do a final test on the test data.

Extension: Web Search
- Information retrieval: given information needs, produce information. Includes, e.g., web search, question answering, and classic IR.
- Web search: not exactly classification, but rather ranking. x = "Apple Computers"

Feature-Based Ranking
- x = "Apple Computers"; each candidate result y gets a feature vector f(x, y).

Perceptron for Ranking
- Inputs x; candidates y; many feature vectors f(x, y); one weight vector w.
- Prediction: y = argmax_y w . f(x, y)
- Update (if wrong): w = w + f(x, y*) - f(x, y)
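Not from the slides: a minimal sketch of the ranking perceptron update above, with candidates encoded as a dict from candidate to feature dict (an assumed encoding).

```python
from collections import defaultdict

def dot(w, f):
    return sum(w[k] * v for k, v in f.items())

def rank_update(w, candidates, y_star):
    """Predict the highest-scoring candidate; if wrong, move w toward
    f(x, y*) and away from f(x, y_pred)."""
    y_pred = max(candidates, key=lambda y: dot(w, candidates[y]))
    if y_pred != y_star:
        for k, v in candidates[y_star].items():
            w[k] += v
        for k, v in candidates[y_pred].items():
            w[k] -= v
    return w

# Hypothetical usage: two candidate results for one query.
w = defaultdict(float)
rank_update(w, {'result_a': {'title_match': 0.2},
                'result_b': {'title_match': 1.0}}, 'result_b')
```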