Probabilistic Reasoning. CS 188: Artificial Intelligence, Spring 2011. Inference by Enumeration. Probability Recap. Chain Rule → Bayes Net

Transcription:

CS 188: Artificial Intelligence, Spring 2011. Final Review, 5/2/2011. Pieter Abbeel, UC Berkeley.

Probabilistic Reasoning. Probability: random variables, joint and marginal distributions, conditional distributions, inference by enumeration, product rule, chain rule, Bayes rule, independence. Distributions over LARGE numbers of random variables → Bayesian networks: representation; inference (exact: enumeration, variable elimination; approximate: sampling); learning (maximum likelihood parameter estimation, Laplace smoothing, linear interpolation).

Probability recap. Conditional probability: P(x | y) = P(x, y) / P(y). Product rule: P(x, y) = P(x | y) P(y). Chain rule: P(x_1, ..., x_n) = ∏_i P(x_i | x_1, ..., x_{i-1}). X and Y are independent iff ∀x, y: P(x, y) = P(x) P(y); equivalently, iff ∀x, y: P(x | y) = P(x); equivalently, iff ∀x, y: P(y | x) = P(y). X and Y are conditionally independent given Z iff ∀x, y, z: P(x, y | z) = P(x | z) P(y | z); equivalently, iff ∀x, y, z: P(x | y, z) = P(x | z); equivalently, iff ∀x, y, z: P(y | x, z) = P(y | z).

Inference by Enumeration. P(sun)? P(sun | winter)? P(sun | winter, hot)?

S       T     W     P
summer  hot   sun   0.30
summer  hot   rain  0.05
summer  cold  sun   0.10
summer  cold  rain  0.05
winter  hot   sun   0.10
winter  hot   rain  0.05
winter  cold  sun   0.15
winter  cold  rain  0.20

Chain Rule → Bayes Net. Chain rule: we can always write any joint distribution as an incremental product of conditional distributions: P(x_1, ..., x_n) = ∏_i P(x_i | x_1, ..., x_{i-1}). Bayes nets make conditional independence assumptions of the form P(x_i | x_1, ..., x_{i-1}) = P(x_i | parents(X_i)), giving us P(x_1, ..., x_n) = ∏_i P(x_i | parents(X_i)).

Example: Alarm Network. Burglary (B) and Earthquake (E) are the parents of Alarm (A); Alarm is the parent of John calls (J) and Mary calls (M).

P(B):        +b 0.001   -b 0.999
P(E):        +e 0.002   -e 0.998
P(J | A):    +a +j 0.9    +a -j 0.1    -a +j 0.05    -a -j 0.95
P(M | A):    +a +m 0.7    +a -m 0.3    -a +m 0.01    -a -m 0.99
P(A | B, E): +b +e +a 0.95    +b +e -a 0.05    +b -e +a 0.94    +b -e -a 0.06
             -b +e +a 0.29    -b +e -a 0.71    -b -e +a 0.001   -b -e -a 0.999
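Returning to the three inference-by-enumeration queries above (P(sun), P(sun | winter), P(sun | winter, hot)): below is a minimal Python sketch that answers them directly from the weather table. The joint entries are taken from the slide; the helper function name p is our own.

# Inference by enumeration: sum the joint entries consistent with a query,
# then normalize by the probability of the evidence.
joint = {
    ("summer", "hot",  "sun"):  0.30, ("summer", "hot",  "rain"): 0.05,
    ("summer", "cold", "sun"):  0.10, ("summer", "cold", "rain"): 0.05,
    ("winter", "hot",  "sun"):  0.10, ("winter", "hot",  "rain"): 0.05,
    ("winter", "cold", "sun"):  0.15, ("winter", "cold", "rain"): 0.20,
}

def p(s=None, t=None, w=None):
    """Sum joint entries consistent with a (partial) assignment to S, T, W."""
    return sum(pr for (si, ti, wi), pr in joint.items()
               if (s is None or si == s)
               and (t is None or ti == t)
               and (w is None or wi == w))

print(p(w="sun"))                                               # P(sun) = 0.65
print(p(s="winter", w="sun") / p(s="winter"))                   # P(sun | winter) = 0.5
print(p(s="winter", t="hot", w="sun") / p(s="winter", t="hot")) # P(sun | winter, hot) ≈ 0.67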

Size of a Bayes Net. How big is the joint distribution over N Boolean variables? 2^N. How big is the representation if we use the chain rule? 2^N. How big is an N-node net if nodes have up to k parents? O(N * 2^(k+1)). Both give you the power to calculate any joint probability. BNs: huge space savings! Easier to elicit local CPTs. Faster to answer queries.

Bayes Nets: Assumptions. Assumptions we are required to make to define the Bayes net when given the graph: P(x_i | x_1, ..., x_{i-1}) = P(x_i | parents(X_i)). Given a Bayes net graph, additional conditional independences can be read off directly from the graph. Question: are two nodes necessarily independent given certain evidence? If no, we can prove it with a counterexample, i.e., pick a set of CPTs and show that the independence assumption is violated by the resulting distribution. If yes, we can prove it with algebra (tedious) or with d-separation (which analyzes the graph).

D-Separation. Question: are X and Y conditionally independent given evidence variables {Z}? Yes, if X and Y are "separated" by Z. Consider all (undirected) paths from X to Y: no active paths = independence! A path is active if each triple along it is active:
- Causal chain A → B → C where B is unobserved (either direction)
- Common cause A ← B → C where B is unobserved
- Common effect (aka v-structure) A → B ← C where B or one of its descendants is observed
All it takes to block a path is a single inactive segment. [Slide figure: active vs. inactive triples.]

D-separation procedure. Given a query X_i ⫫ X_j | {X_k1, ..., X_kn}: shade all evidence nodes; for all (undirected!) paths between X_i and X_j, check whether the path is active; if any path is active, the independence is not guaranteed. If this point is reached, all paths have been checked and shown inactive: return X_i ⫫ X_j | {X_k1, ..., X_kn}.

Example. [Slide figure: a small network over R, B, D, T with three example queries, each answered "yes".]

All Conditional Independences. Given a Bayes net structure, we can run d-separation to build a complete list of the conditional independences that are necessarily true, of the form X_i ⫫ X_j | {X_k1, ..., X_kn}. This list determines the set of probability distributions that can be represented by Bayes nets with this graph structure.
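The three triple rules above are mechanical enough to encode directly. Below is a minimal Python sketch of a triple-activity check, run on the alarm network from the earlier slide; the graph encoding (a dict mapping each node to its set of parents) and the helper names are our own, and this covers single triples only, not the full enumeration of paths.

# The alarm network from the slides, as node -> set of parents.
alarm = {'B': set(), 'E': set(), 'A': {'B', 'E'}, 'J': {'A'}, 'M': {'A'}}

def descendants(graph, node):
    """All nodes reachable from `node` by following child edges."""
    children = {n: [c for c, ps in graph.items() if n in ps] for n in graph}
    found, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in found:
                found.add(c)
                stack.append(c)
    return found

def triple_active(graph, a, b, c, observed):
    """Is the (undirected) triple A - B - C active given the observed set?"""
    if a in graph[b] and c in graph[b]:                      # common effect (v-structure)
        return b in observed or bool(descendants(graph, b) & observed)
    chain = (a in graph[b] and b in graph[c]) or (c in graph[b] and b in graph[a])
    cause = b in graph[a] and b in graph[c]                  # common cause
    return (chain or cause) and b not in observed

print(triple_active(alarm, 'B', 'A', 'E', observed=set()))   # False: v-structure, A unobserved
print(triple_active(alarm, 'B', 'A', 'E', observed={'M'}))   # True: a descendant of A is observed
print(triple_active(alarm, 'B', 'A', 'J', observed={'A'}))   # False: causal chain blocked at A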

Topology Limits Distributions. Given some graph topology G, only certain joint distributions can be encoded. The graph structure guarantees certain (conditional) independences. (There might be more independence.) Adding arcs increases the set of distributions that can be encoded, but has several costs. Full conditioning can encode any distribution. [Slide figure: the sets of independences implied by progressively denser three-node graphs.]

Bayes Nets Status: representation; inference; learning Bayes nets from data.

Inference by Enumeration. Given unlimited time, inference in BNs is easy. Recipe: state the marginal probabilities you need; figure out ALL the atomic probabilities you need; calculate and combine them. Example: the alarm network over B, E, A, J, M. In this simple method, we only need the BN to synthesize the joint entries.

Variable Elimination. Why is inference by enumeration so slow? You join up the whole joint distribution before you sum out the hidden variables, so you end up repeating a lot of work! Idea: interleave joining and marginalizing! This is called Variable Elimination. It is still NP-hard, but usually much faster than inference by enumeration.

Variable Elimination Outline. Track objects called factors. The initial factors are the local CPTs (one per node); for the R → T → L chain:

P(R):      +r 0.1    -r 0.9
P(T | R):  +r +t 0.8    +r -t 0.2    -r +t 0.1    -r -t 0.9
P(L | T):  +t +l 0.3    +t -l 0.7    -t +l 0.1    -t -l 0.9

Any known values are selected: e.g., if we know the value of L, the factor P(L | T) is replaced by the rows selected for that value. VE: alternately join factors and eliminate variables.
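As a concrete illustration of the enumeration recipe, here is a minimal Python sketch of inference by enumeration for P(B | +j, +m) in the alarm network, synthesizing the needed joint entries from the CPTs given earlier; the dictionary encoding and helper names are our own.

from itertools import product

P_B = {'+b': 0.001, '-b': 0.999}
P_E = {'+e': 0.002, '-e': 0.998}
P_A = {('+b', '+e'): 0.95, ('+b', '-e'): 0.94,
       ('-b', '+e'): 0.29, ('-b', '-e'): 0.001}   # P(+a | B, E)
P_J = {'+a': 0.9, '-a': 0.05}                     # P(+j | A)
P_M = {'+a': 0.7, '-a': 0.01}                     # P(+m | A)

def joint(b, e, a, j, m):
    """One atomic probability, as a product of CPT entries."""
    pa = P_A[(b, e)] if a == '+a' else 1 - P_A[(b, e)]
    pj = P_J[a] if j == '+j' else 1 - P_J[a]
    pm = P_M[a] if m == '+m' else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# P(B | +j, +m): sum out the hidden variables E and A, then normalize.
scores = {b: sum(joint(b, e, a, '+j', '+m')
                 for e, a in product(('+e', '-e'), ('+a', '-a')))
          for b in ('+b', '-b')}
Z = sum(scores.values())
print({b: s / Z for b, s in scores.items()})      # P(+b | +j, +m) ≈ 0.284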

Variable Elimination Example. Query P(L) on the R → T → L chain, starting from the factors P(R), P(T | R), P(L | T) above.

Join on R:  f(R, T) = P(R) * P(T | R):  +r +t 0.08    +r -t 0.02    -r +t 0.09    -r -t 0.81   (P(L | T) unchanged)
Sum out R:  f(T):  +t 0.17    -t 0.83
Join on T:  f(T, L):  +t +l 0.051    +t -l 0.119    -t +l 0.083    -t -l 0.747
Sum out T:  P(L):  +l 0.134    -l 0.866

Example: for P(B | +j, +m) in the alarm network, choose E (join and sum out E), choose A (join and sum out A), finish with B, and normalize.

General Variable Elimination. Query: P(Q | e_1, ..., e_k). Start with the initial factors: the local CPTs (but instantiated by the evidence). While there are still hidden variables (not Q or evidence): pick a hidden variable H; join all factors mentioning H; eliminate (sum out) H. Finally, join all remaining factors and normalize.

Approximate Inference: Sampling. Basic idea: draw N samples from a sampling distribution S; compute an approximate posterior probability; show this converges to the true probability P. Why? Faster than computing the exact answer.
- Prior sampling: sample ALL variables in topological order, as this can be done quickly.
- Rejection sampling for a query P(Q | e_1, ..., e_k): like prior sampling, but reject a sample when a variable is sampled inconsistently with the query, in this case when a variable E_i is sampled differently from e_i.
- Likelihood weighting for a query P(Q | e_1, ..., e_k): like prior sampling, but the variables E_i are not sampled; when it is their turn, they get set to e_i, and the sample gets weighted by P(e_i | value of parents(E_i) in the current sample).
- Gibbs sampling: repeatedly samples each non-evidence variable conditioned on all the other variables → can incorporate downstream evidence.
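Returning to the R → T → L elimination above: below is a minimal Python sketch of the same computation, with the join and sum-out steps written out explicitly rather than with a general Factor class. The P(L | T) entries for +l (0.3 and 0.1) are implied by the slide's intermediate results rather than shown there directly.

P_R = {'+r': 0.1, '-r': 0.9}
P_T_given_R = {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
               ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}
P_L_given_T = {('+t', '+l'): 0.3, ('+t', '-l'): 0.7,
               ('-t', '+l'): 0.1, ('-t', '-l'): 0.9}

# Join on R: f(R, T) = P(R) * P(T | R)
f_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)] for (r, t) in P_T_given_R}
# f_RT matches the slide: +r+t 0.08, +r-t 0.02, -r+t 0.09, -r-t 0.81

# Sum out R: f(T)
f_T = {t: sum(p for (r, ti), p in f_RT.items() if ti == t) for t in ('+t', '-t')}
# f_T matches the slide: +t 0.17, -t 0.83

# Join on T, then sum out T: P(L)
P_L = {l: sum(f_T[t] * P_L_given_T[(t, l)] for t in ('+t', '-t')) for l in ('+l', '-l')}
print(P_L)   # approximately {'+l': 0.134, '-l': 0.866}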

Prior Sampling Example. Network: Cloudy is the parent of Sprinkler and Rain; Sprinkler and Rain are the parents of WetGrass.

P(C):        +c 0.5   -c 0.5
P(S | C):    +c: +s 0.1, -s 0.9      -c: +s 0.5, -s 0.5
P(R | C):    +c: +r 0.8, -r 0.2      -c: +r 0.2, -r 0.8
P(W | S, R): +s +r: +w 0.99, -w 0.01    +s -r: +w 0.90, -w 0.10    -s +r: +w 0.90, -w 0.10    -s -r: +w 0.01, -w 0.99

Samples: +c, -s, +r, +w; -c, +s, -r, +w.

We'll get a bunch of samples from the BN: +c, -s, +r, +w; +c, +s, +r, +w; -c, +s, +r, -w; +c, -s, +r, +w; -c, -s, -r, +w. If we want to know P(W): we have counts <+w: 4, -w: 1>; normalize to get P(W) ≈ <+w: 0.8, -w: 0.2>. This will get closer to the true distribution with more samples. We can estimate anything else, too: what about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)? Fast: we can use fewer samples if there is less time.

Likelihood Weighting. If z is sampled and e is fixed evidence, the sampling distribution is S(z, e) = ∏_i P(z_i | parents(Z_i)). Now samples have weights: w(z, e) = ∏_i P(e_i | parents(E_i)). Together, the weighted sampling distribution is consistent: S(z, e) * w(z, e) = P(z, e). Sample: +c, +s, +r, +w.

Gibbs Sampling. Idea: instead of sampling from scratch, create samples that are each like the last one. Procedure: resample one variable at a time, conditioned on all the rest, but keep the evidence fixed. Properties: now the samples are not independent (in fact they're nearly identical), but sample averages are still consistent estimators! What's the point: both upstream and downstream variables condition on the evidence.

Markov Models. A Markov model is a chain-structured BN: X_1 → X_2 → X_3 → X_4 → ... Each node is identically distributed (stationarity). The value of X at a given time is called the state. The chain is just a (growing) BN; we can always use generic BN reasoning on it if we truncate the chain at a fixed length.

Stationary Distributions. For most chains, the distribution we end up in is independent of the initial distribution; it is called the stationary distribution of the chain: P_∞(X) = Σ_x P_{t+1|t}(X | x) P_∞(x). Example applications: web link analysis (PageRank) and Gibbs sampling.
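Below is a minimal Python sketch of prior sampling in the Cloudy/Sprinkler/Rain/WetGrass network above, including the counting step used to estimate P(W) and a rejection-sampling-style estimate of P(+c | +w); the CPT numbers come from the slide, while the function names are our own.

import random
from collections import Counter

def bern(p_true):
    """Return '+' with probability p_true, else '-'."""
    return '+' if random.random() < p_true else '-'

def prior_sample():
    """Sample C, S, R, W in topological order from the CPTs above."""
    c = bern(0.5)
    s = bern({'+': 0.1, '-': 0.5}[c])
    r = bern({'+': 0.8, '-': 0.2}[c])
    w = bern({('+', '+'): 0.99, ('+', '-'): 0.90,
              ('-', '+'): 0.90, ('-', '-'): 0.01}[(s, r)])
    return c, s, r, w

samples = [prior_sample() for _ in range(10000)]

# Estimate P(W) by counting and normalizing.
w_counts = Counter(w for _, _, _, w in samples)
print({k + 'w': n / len(samples) for k, n in w_counts.items()})

# Rejection-sampling flavour of P(+c | +w): keep only samples consistent with +w.
kept = [c for c, _, _, w in samples if w == '+']
print(kept.count('+') / len(kept))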

Hidden Markov Models. An underlying Markov chain over states X; you observe outputs (effects) at each time step. [Slide figure: X_1 → ... → X_5, with an emission E_t below each X_t.] Speech recognition HMMs: the X_i are specific positions in specific words, the E_i are acoustic signals. Machine translation HMMs: the X_i are translation options, the E_i (the observations) are words. Robot tracking: the X_i are positions on a map, the E_i are range readings.

Online Belief Updates. Every time step, we start with the current P(X | evidence). We update for time: P(x_t | e_{1:t-1}) = Σ_{x_{t-1}} P(x_{t-1} | e_{1:t-1}) P(x_t | x_{t-1}). We update for evidence: P(x_t | e_{1:t}) ∝ P(x_t | e_{1:t-1}) P(e_t | x_t). The forward algorithm does both at once (and doesn't normalize).

Particle Filtering. Particle filtering = likelihood weighting + resampling at each time slice. Why: sometimes |X| is too big to use exact inference. [Slide figure: a grid of approximate belief values.] Elapse time: each particle is moved by sampling its next position from the transition model ("particle" is just a new name for "sample"). Observe: we don't sample the observation; we fix it and downweight our samples based on the evidence, as in likelihood weighting. Resample: rather than tracking weighted samples, we resample: N times, we choose from our weighted sample distribution.

Dynamic Bayes Nets (DBNs). We want to track multiple variables over time, using multiple sources of evidence. Idea: repeat a fixed Bayes net structure at each time; variables from time t can condition on those from t-1. [Slide figure: variables G and evidence E replicated at t = 1, 2, 3.] Discrete-valued dynamic Bayes nets are also HMMs.

Bayes Nets Status: representation; inference; learning Bayes nets from data.
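Below is a minimal Python sketch of the elapse-time and observe updates above, run on a hypothetical two-state umbrella-style HMM; the transition and emission numbers are illustrative assumptions, not from the slides.

def elapse_time(belief, transition):
    """P(X_t | e_{1:t-1}) = sum over x of P(x | e_{1:t-1}) * P(X_t | x)."""
    return {x2: sum(belief[x1] * transition[x1][x2] for x1 in belief)
            for x2 in belief}

def observe(belief, emission, evidence):
    """Weight each state by P(e_t | X_t), then normalize."""
    unnorm = {x: belief[x] * emission[x][evidence] for x in belief}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}

# Hypothetical two-state weather HMM (numbers are made up for illustration).
T = {'rain': {'rain': 0.7, 'sun': 0.3}, 'sun': {'rain': 0.3, 'sun': 0.7}}
E = {'rain': {'umbrella': 0.9, 'no-umbrella': 0.1},
     'sun':  {'umbrella': 0.2, 'no-umbrella': 0.8}}

belief = {'rain': 0.5, 'sun': 0.5}
for e in ['umbrella', 'umbrella', 'no-umbrella']:
    belief = observe(elapse_time(belief, T), E, e)
print(belief)   # current P(X | all evidence so far)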

Parameter Estimation. Estimating the distribution of random variables like X or X | Y. Empirically: use training data. For each outcome x, look at the empirical rate of that value; e.g., from the sample r, g, g we get P_ML(r) = 1/3. This is the estimate that maximizes the likelihood of the data. Laplace smoothing: pretend you saw every outcome k extra times: P_LAP,k(x) = (count(x) + k) / (N + k|X|). Smooth each condition independently: P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|).

Bayes Nets Status: representation; inference; learning Bayes nets from data.

Classification: Feature Vectors. Example (spam): "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..." → # free: 2, YOUR_NAME: 0, MISSPELLED: 2, FROM_FRIEND: 0, ... → SPAM or +. Example (digits): PIXEL-7,12: 1, PIXEL-7,13: 0, ..., NUM_LOOPS: 1, ... → "2".

Classification overview. Naive Bayes: builds a model from training data; gives prediction probabilities; strong assumptions about feature independence; one pass through the data (counting). Perceptron: makes fewer assumptions about the data; mistake-driven learning; multiple passes through the data (prediction); often more accurate. MIRA: like the perceptron, but with adaptive scaling of the size of the update. SVM: properties similar to the perceptron; convex optimization formulation. Nearest-Neighbor: non-parametric, more expressive with more training data. Kernels: an efficient way to make linear learning architectures into nonlinear ones.

Bayes Nets for Classification. One method of classification: use a probabilistic model! Features are observed random variables F_i; Y is the query variable. Use probabilistic inference to compute the most likely Y.

General Naive Bayes. A general naive Bayes model: P(Y, F_1, ..., F_n) = P(Y) ∏_i P(F_i | Y). The full joint would have |Y| × |F|^n values; naive Bayes needs only |Y| parameters for P(Y) and n × |F| × |Y| parameters for the conditionals. We only specify how each feature depends on the class, so the total number of parameters is linear in n. You already know how to do this inference. Our running example: digits.
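A minimal Python sketch of maximum-likelihood vs. Laplace-smoothed estimation for the r, g, g sample above; the function name and the choice of k are ours.

from collections import Counter

def estimate(samples, outcomes, k=0):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |X|); k = 0 gives the ML estimate."""
    counts = Counter(samples)
    n = len(samples)
    return {x: (counts[x] + k) / (n + k * len(outcomes)) for x in outcomes}

data = ['r', 'g', 'g']
print(estimate(data, ['r', 'g'], k=0))   # ML:          r -> 1/3, g -> 2/3
print(estimate(data, ['r', 'g'], k=1))   # Laplace k=1: r -> 2/5, g -> 3/5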

Bag-of-Words Naive Bayes. Generative model: P(C, W_1, ..., W_n) = P(C) ∏_i P(W_i | C), where W_i is the word at position i, not the i-th word in the dictionary! Bag-of-words: each position is identically distributed; all positions share the same conditional probabilities P(W | C) → when learning the parameters, the data is shared over all positions in the document (rather than separately learning a distribution for each position in the document). Our running example: spam vs. ham.

Linear Classifier. Binary linear classifier: classify as positive if w · f(x) ≥ 0. Multiclass linear classifier: a weight vector for each class, w_y; score (activation) of class y: w_y · f(x); prediction: the highest score wins. Binary = multiclass where the negative class has weight zero.

Perceptron = an algorithm to learn the weights w. Start with zero weights. Pick up training instances one by one. Classify with the current weights. If correct, no change! If wrong: lower the score of the wrong answer, raise the score of the right answer.

Problems with the Perceptron. Noise: if the data isn't separable, the weights might thrash; averaging weight vectors over time can help (averaged perceptron). Mediocre generalization: it finds a "barely" separating solution. Overtraining: test / held-out accuracy usually rises, then falls; overtraining is a kind of overfitting.

Fixing the Perceptron: MIRA. Choose an update size that fixes the current mistake and also minimizes the change to w.

Support Vector Machines. Maximizing the margin: good according to intuition, theory, and practice. Support vector machines (SVMs) find the separator with the maximum margin. Basically, SVMs are MIRA where you optimize over all examples at once, updating w by solving a quadratic program (with regularization constant C).
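A minimal Python sketch of the multiclass perceptron update described above; the sparse-dict feature representation and the tiny spam/ham toy data are our own illustration.

def score(w, f):
    """Dot product of a (sparse) weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def train_perceptron(data, classes, passes=5):
    w = {y: {} for y in classes}                  # start with zero weights
    for _ in range(passes):
        for f, y in data:                         # pick up instances one by one
            y_hat = max(classes, key=lambda c: score(w[c], f))
            if y_hat != y:                        # if wrong: raise right, lower wrong
                for k, v in f.items():
                    w[y][k] = w[y].get(k, 0.0) + v
                    w[y_hat][k] = w[y_hat].get(k, 0.0) - v
    return w

toy = [({'free': 2.0, 'money': 1.0}, 'spam'),
       ({'meeting': 1.0, 'notes': 1.0}, 'ham')]
print(train_perceptron(toy, ['spam', 'ham']))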

Non-Linear Separators. Data that is linearly separable (with some noise) works out great. But what are we going to do if the dataset is just too hard? How about mapping the data to a higher-dimensional space? General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x). (This and the next few slides are adapted from Ray Mooney, UT.)

Some Kernels. Kernels implicitly map original vectors to higher-dimensional spaces, take the dot product there, and hand the result back. Linear kernel: K(x, x') = x · x', i.e., φ(x) = x. Quadratic kernel: e.g., K(x, x') = (x · x' + 1)^2. Polynomial kernel of degree d: e.g., K(x, x') = (x · x' + 1)^d.

Why and When Kernels? Can't you just add these features on your own (e.g., add all pairs of features instead of using the quadratic kernel)? Yes, in principle: just compute them; no need to modify any algorithms. But the number of features can get large (or infinite). Kernels let us compute with these features implicitly; for example, the implicit dot product in a polynomial, Gaussian, or string kernel takes much less space and time per dot product. When can we use kernels? When our learning algorithm can be reformulated in terms of only inner products between feature vectors. Examples: the perceptron, support vector machines.

K-Nearest Neighbors. 1-NN: copy the label of the most similar data point. K-NN: let the k nearest neighbors vote (you have to devise a weighting scheme). [Slide figure: fits with 2, 10, 100, and 10000 examples compared to the truth.] Parametric models: a fixed set of parameters; more data means better settings. Non-parametric models: the complexity of the classifier increases with the data; better in the limit, often worse in the non-limit. (K)NN is non-parametric.
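A minimal Python sketch checking numerically that the quadratic kernel (x · z + 1)^2 equals an explicit dot product in a higher-dimensional feature space; the expanded feature map phi for 2-D inputs is written out by hand and is our own illustration.

import math

def quadratic_kernel(x, z):
    """K(x, z) = (x . z + 1)^2 for 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1] + 1) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces the quadratic kernel."""
    r2 = math.sqrt(2)
    return [x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1], r2 * x[0], r2 * x[1], 1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)
print(quadratic_kernel(x, z))    # 4.0
print(dot(phi(x), phi(z)))       # 4.0: same value, computed in the expanded space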

Basic Similarity. Many similarities are based on feature dot products: K(x, x') = f(x) · f(x'). If the features are just the pixels, this is the dot product of the pixel vectors. Note: not all similarities are of this form.

Important Concepts. Data: labeled instances, e.g., emails marked spam/ham, split into a training set, a held-out set, and a test set. Features: attribute-value pairs which characterize each x. Experimentation cycle: learn parameters (e.g., model probabilities) on the training set; (tune hyperparameters on the held-out set); compute accuracy on the test set. Very important: never "peek" at the test set! Evaluation: accuracy = the fraction of instances predicted correctly. Overfitting and generalization: we want a classifier which does well on test data; overfitting is fitting the training data very closely but not generalizing well. We'll investigate overfitting and generalization formally in a few lectures.

Tuning on Held-Out Data. Now we've got two kinds of unknowns. Parameters: the probabilities P(X | Y) and P(Y). Hyperparameters: the amount of smoothing to do, k, α (naive Bayes); the number of passes over the training data (perceptron). Where to learn? Learn parameters from the training data. Hyperparameters must be tuned on different data: for each value of the hyperparameters, train and test on the held-out data; choose the best value and do a final test on the test data.

Extension: Web Search. Information retrieval: given information needs, produce information. This includes, e.g., web search, question answering, and classic IR. Example query: x = "Apple Computers". Web search is not exactly classification, but rather ranking.

Feature-Based Ranking. For a query x = "Apple Computers", each candidate result y gets a feature vector f(x, y).

Perceptron for Ranking. Inputs x; candidates y. Many feature vectors f(x, y); one weight vector w. Prediction: y = argmax_y w · f(x, y). Update (if wrong): w = w + f(x, y*) - f(x, y), where y* is the correct candidate and y is the predicted one.
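A minimal Python sketch of the ranking-perceptron update above: one weight vector, one feature vector per (query, candidate) pair. The candidate names and feature values are hypothetical illustrations, not from the slides.

def score(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def rank_update(w, candidates, correct):
    """candidates: candidate -> feature dict f(x, y); correct: the right candidate y*."""
    predicted = max(candidates, key=lambda y: score(w, candidates[y]))
    if predicted != correct:                      # update only on a mistake
        for k, v in candidates[correct].items():
            w[k] = w.get(k, 0.0) + v
        for k, v in candidates[predicted].items():
            w[k] = w.get(k, 0.0) - v
    return w

w = {}
candidates = {'fruit_blog': {'title_match': 1.0, 'click_rate': 0.1},
              'apple.com':  {'title_match': 1.0, 'click_rate': 0.9}}
w = rank_update(w, candidates, correct='apple.com')
print(w)   # weights nudged toward the correct candidate's features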