Sampling from Bayes Nets


http://www.youtube.com/watch?v=mvrtaljp8dm
http://www.youtube.com/watch?v=geqip_0vjec

Paper reviews
Reviews should be useful feedback for the authors: a critique of the paper. No paper is perfect! If you don't understand something, state it. Distinguish "technically sound" from "convinced". Give explicit examples, the more the better; cite sections, paragraphs, tables, figures, equations, etc. Make the different sections of your review clear; many conference reviews will have a similar format.

Asking questions about distributions
We want to be able to ask questions about these probability distributions. Given n variables, a query splits the variables into three sets: the query variable(s), the known/evidence variables, and the unknown/hidden variables. The question is P(query | evidence). If we had no hidden variables, we could just multiply the relevant values in the different CPTs; to answer a query with hidden variables, we need to sum the hidden variables out!

Two approaches to exact inference

Enumeration: top-down; multiply probabilities and sum out the hidden variables (the family-out example: fo = family out, lo = light on, bp = bowel problem, do = dog out, hb = hear bark):
P(fo | hb, lo) = α P(fo) P(lo | fo) Σ_bp P(bp) Σ_do P(do | fo, bp) P(hb | do)

Variable elimination: avoids repeated work; works bottom-up (right to left); two operations: pointwise product of factors and summing out hidden variables:
P(fo | hb, lo) = α f1(fo) f2(lo, fo) Σ_bp f3(bp) Σ_do f4(do, fo, bp) f5(hb, do)

So is VE any better than enumeration? Yes and no. For singly connected networks (polytrees), yes. In general, no: the problem is NP-hard (the slide shows a reduction construction whose inputs I1, ..., I5 each have prior probability 0.5). But inference is still tractable in some cases. Special case: trees (each node has one parent); VE is linear in this case. So what about all those graphs with cycles? Approximate inference!
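
To make the query-splitting and enumeration discussion concrete, here is a minimal inference-by-enumeration sketch in Python. It uses the cloudy/sprinkler/rain/wet-grass network from the following slides; the numeric CPT values are the standard textbook numbers and are an assumption here, not something given in these notes.

```python
import itertools

# Hypothetical CPTs for the cloudy/sprinkler/rain/wet-grass network used on the
# following slides; the numeric values are standard textbook numbers (an assumption).
P_C = 0.5                                        # P(Cloudy = T)
P_S = {True: 0.1, False: 0.5}                    # P(Sprinkler = T | Cloudy)
P_R = {True: 0.8, False: 0.2}                    # P(Rain = T | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0} # P(WetGrass = T | Sprinkler, Rain)

def joint(c, s, r, w):
    """Full-joint probability of one complete assignment (product of CPT entries)."""
    p = P_C if c else 1 - P_C
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return p

def enumeration_ask(query, evidence):
    """P(query | evidence): sum the joint over the hidden variables, then normalize."""
    names = ["C", "S", "R", "W"]
    dist = {True: 0.0, False: 0.0}
    for values in itertools.product([True, False], repeat=4):
        a = dict(zip(names, values))
        if all(a[k] == v for k, v in evidence.items()):
            dist[a[query]] += joint(*values)
    alpha = 1.0 / (dist[True] + dist[False])     # the normalization constant
    return {v: p * alpha for v, p in dist.items()}

print(enumeration_ask("R", {"W": True}))         # e.g. P(Rain | WetGrass = true)
```

Variable elimination would instead cache and reuse intermediate factors rather than recomputing the joint for every complete assignment.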

Approximate Inference by Stochastic Simulation
Basics: sampling from an empty network (no evidence). Recall that when we wanted to find out an underlying distribution (of, say, a coin or a die) we used sampling to estimate it. Basic idea: draw N samples from the network's distribution and compute an approximate probability P̂; for large sample sizes this converges to the true probability P. The next few slides step through drawing one sample from the cloudy / sprinkler / rain / wet-grass network, sampling each variable in turn from its CPT.

Working through the network in topological order (sample Cloudy from P(C), then Sprinkler from P(S | C), then Rain from P(R | C), then WetGrass from P(W | S, R)) produces one complete sample, e.g. [T, F, T, T] for [Cloudy, Sprinkler, Rain, WetGrass].
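
A minimal prior (forward) sampling sketch of the walkthrough above, again assuming the standard textbook CPT values for the cloudy/sprinkler/rain/wet-grass network:

```python
import random

# Prior (forward) sampling sketch; CPT values are assumed textbook numbers, not from the slides.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                    # P(Sprinkler = T | Cloudy)
P_R = {True: 0.8, False: 0.2}                    # P(Rain = T | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0} # P(WetGrass = T | Sprinkler, Rain)

def prior_sample():
    """Visit the variables in topological order, sampling each from its CPT given its parents."""
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return [c, s, r, w]                          # one complete sample, e.g. [True, False, True, True]

print([prior_sample() for _ in range(10)])
```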

Calculating probabilities
If we do this a number of times, we can approximate answers to queries. Samples ([C, S, R, W]):
[T, T, F, T]
[F, F, F, F]
[F, T, F, T]
[F, F, T, T]
[T, F, F, F]
[T, T, F, T]
[F, T, F, T]
[T, F, F, F]
[F, T, T, F]
[T, T, F, F]

What is the probability of rain?
p(rain) ≈ (num with rain) / (total samples) = 2/10 = 0.2

Rejection sampling
What if we want to know the probability conditioned on some evidence, e.g. p(rain | wet_grass)? Adding evidence: rejection. P̂(X | e) is estimated from the samples agreeing with e. Estimating P(rain | wet_grass) from the samples above:
p(rain | wet_grass) ≈ (num with rain and wet_grass) / (num with wet_grass) = 1/5 = 0.2
Problem with rejection?
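
A rejection-sampling sketch of the same idea (the CPT values are again assumed textbook numbers): draw prior samples, throw away those that disagree with the evidence, and count within the survivors.

```python
import random

# Rejection-sampling sketch with assumed (standard textbook) CPT values.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}

def prior_sample():
    c = random.random() < P_C
    s = random.random() < P_S[c]
    r = random.random() < P_R[c]
    w = random.random() < P_W[(s, r)]
    return {"C": c, "S": s, "R": r, "W": w}

def rejection_sampling(query, evidence, n=10000):
    """Estimate P(query = T | evidence): keep only the prior samples that agree with
    the evidence, then count the query within the survivors."""
    kept = [s for s in (prior_sample() for _ in range(n))
            if all(s[k] == v for k, v in evidence.items())]
    if not kept:
        return None                              # everything was rejected (rare evidence)
    return sum(s[query] for s in kept) / len(kept)

print(rejection_sampling("R", {"W": True}))      # estimate of P(Rain | WetGrass = true)
```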

Likelihood weighting
The problem with rejection sampling is that we may have to generate a lot of samples: low-probability/rare evidence, large networks. Likelihood weighting: rather than randomly sampling all of the variables, only randomly pick values for the query and hidden variables; the evidence variables are fixed to their observed values, and each sample is weighted by the likelihood of obtaining those fixed values.

Likelihood weighting: p(rain | cloudy, wet_grass)
Cloudy is evidence, so it is fixed and contributes weight P(cloudy) = 0.5. The non-evidence variables are sampled at random from their CPTs (e.g. Sprinkler from P(S | C)).

Continuing the walkthrough for p(rain | cloudy, wet_grass): Sprinkler and Rain are sampled at random given Cloudy; WetGrass is evidence, so it is fixed and contributes weight P(wet_grass | sprinkler, rain) = 0.9. Resulting sample: [T, F, T, T] with weight 0.5 * 0.9 = 0.45.
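
A likelihood-weighting sketch under the same assumed CPTs: evidence variables are clamped to their observed values, and each sample carries the product of the probabilities of those clamped values as its weight.

```python
import random

# Likelihood-weighting sketch with assumed (standard textbook) CPT values.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}

def weighted_sample(evidence):
    """One weighted sample: clamp evidence variables, sample the rest, and multiply
    the weight by the probability of each clamped value."""
    w = 1.0
    def draw(name, p_true):
        nonlocal w
        if name in evidence:                     # evidence: clamp and weight
            value = evidence[name]
            w *= p_true if value else 1 - p_true
            return value
        return random.random() < p_true          # non-evidence: sample as usual
    c = draw("C", P_C)
    s = draw("S", P_S[c])
    r = draw("R", P_R[c])
    g = draw("W", P_W[(s, r)])
    return {"C": c, "S": s, "R": r, "W": g}, w

def likelihood_weighting(query, evidence, n=10000):
    """P(query = T | evidence) as a ratio of weighted sums."""
    num = den = 0.0
    for _ in range(n):
        sample, w = weighted_sample(evidence)
        den += w
        if sample[query]:
            num += w
    return num / den

print(likelihood_weighting("R", {"C": True, "W": True}))  # P(Rain | Cloudy, WetGrass)
```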

Likelihood weighting: p(rain | cloudy, wet_grass). Weighted samples:

[C, S, R, W]   weight
[T, F, T, T]   0.45
[T, F, T, T]   0.45
[T, T, F, T]   0.45
[T, F, T, T]   0.45
[T, F, F, T]   0.005
[T, T, T, T]   0.495
[T, T, T, T]   0.495
[T, F, T, T]   0.45
[T, T, T, T]   0.495
[T, T, F, T]   0.45

p(rain | cloudy, wet_grass) ≈ (weighted sum of samples with rain, cloudy, wet_grass) / (weighted sum of samples with cloudy, wet_grass)

Problems with likelihood weighting
As the number of variables increases, the weights will be very small; similar to rejection sampling, only a small number of higher-probability samples will actually affect the outcome. If the evidence variables are late in the (topological) ordering of the BN, the simulations will not be influenced by the evidence and so the samples will not look much like reality.

Approximate inference using MCMC
MCMC = Markov chain Monte Carlo. Idea: rather than generating each sample from scratch, transition between "states" (complete assignments) of the network: choose one variable and sample it given its Markov blanket. The algorithm: start in some valid configuration of the variables, then repeat the following steps: pick a non-evidence variable, randomly sample it given its Markov blanket, and count this new state as a sample. If the process visits 20 states where the query variable (e.g. Rain) is true and 60 states where it is false, then the answer to the query is <20/80, 60/80> = <0.25, 0.75>.
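
A Gibbs-sampling (MCMC) sketch under the same assumed CPTs. Evidence variables stay fixed; each step resamples one non-evidence variable. For a network this small, sampling a variable given its Markov blanket can be done by normalizing the full joint over the variable's two values, which is what the sketch does.

```python
import random

# Gibbs-sampling (MCMC) sketch with assumed (standard textbook) CPT values.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}
P_R = {True: 0.8, False: 0.2}
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}

def joint(a):
    """Full-joint probability of a complete assignment given as a dict."""
    p = P_C if a["C"] else 1 - P_C
    p *= P_S[a["C"]] if a["S"] else 1 - P_S[a["C"]]
    p *= P_R[a["C"]] if a["R"] else 1 - P_R[a["C"]]
    p *= P_W[(a["S"], a["R"])] if a["W"] else 1 - P_W[(a["S"], a["R"])]
    return p

def gibbs(query, evidence, n=10000):
    state = dict(evidence)
    non_evidence = [v for v in ["C", "S", "R", "W"] if v not in evidence]
    for v in non_evidence:                       # start in some valid configuration
        state[v] = random.random() < 0.5
    count = 0
    for _ in range(n):
        v = random.choice(non_evidence)          # pick a non-evidence variable
        p_true = joint({**state, v: True})       # resample it given everything else
        p_false = joint({**state, v: False})     # (equivalent to its Markov blanket)
        state[v] = random.random() < p_true / (p_true + p_false)
        count += state[query]                    # count this new state as a sample
    return count / n

print(gibbs("R", {"C": True, "W": True}))        # estimate of P(Rain | Cloudy, WetGrass)
```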

MCMC, continued: if the evidence fixes two of the variables (here Cloudy = T and WetGrass = T), the remaining two variables give 4 network states, with possible transitions between them. Wander for a while, average what you see.

Document classification
The naïve Bayes classifier works surprisingly well for its simplicity. We can do better! (Big Boy models)

Revisiting the naïve Bayes model: a class node with word children w1, w2, w3, ..., wn, one per word in our vocabulary (n words).

Generating a document
The generative story of a model describes how the model would generate a sample (a document). It can help us understand the independences and how the model works. As before, we can generate a random sample from the BN. How does that work for naïve Bayes? How would we generate a positive document?

(Walkthrough of generating a document from the naïve Bayes net: the class node has prior P(class) and each word node wi has a CPT P(wi | class). Fix the class, then randomly sample each word variable in turn; if the draw comes up true, that word occurs in the document.)

(Continuing the walkthrough: each word variable is sampled independently given the class; if the draw is true the word occurs in the document, if false it does not.) How are we simplifying the data?
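
A sketch of this generative story for the Bernoulli (boolean) naïve Bayes model. The vocabulary, class names, and probabilities below are made-up illustrative values, not taken from the lecture.

```python
import random

# Illustrative Bernoulli naive Bayes generator; vocabulary and probabilities are made up.
VOCAB = ["banana", "ate", "good", "terrible"]
P_CLASS_POS = 0.5
P_WORD_GIVEN_CLASS = {        # P(word occurs in the document | class)
    "positive": {"banana": 0.3, "ate": 0.4, "good": 0.7, "terrible": 0.05},
    "negative": {"banana": 0.3, "ate": 0.4, "good": 0.1, "terrible": 0.60},
}

def generate_document():
    """Sample the class, then sample each word variable independently given the class."""
    label = "positive" if random.random() < P_CLASS_POS else "negative"
    occurs = {w: random.random() < P_WORD_GIVEN_CLASS[label][w] for w in VOCAB}
    return label, occurs      # which words occur / do not occur in the document

print(generate_document())
```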

Bag-of-words representation
Notice that there is no ordering in the model: "I ate a banana" is viewed as the same as "ate I banana a". This is called the bag-of-words representation. In the NB model a word either occurs or doesn't occur; there is no frequency information. Word occurrences are independent given the class: when we sample, the only thing we condition on is the class.

Incorporating frequency
Multinomial model: rather than picking whether or not each word occurs, pick what each word in the document will be. Now, rather than boolean random variables, we have one random variable per position in the document (first word, 2nd word, 3rd word, ..., mth word), each ranging over the words in the vocabulary. What will the conditional probability tables look like?

Each position in the document has a distribution over all words for each class: the CPT is a table of p(word | class) with one row per (class, word) pair (class A with w1, w2, w3, w4; class B with w1; and so on). In practice, we use the same distribution for all word positions! Generating a document then means randomly picking a word for each position in turn, e.g. "The" for the first position.

Then randomly pick a word for the second position ("The", then "a"), and so on for all m positions.

Multinomial model
It is called a multinomial model because the word frequencies drawn for a document of length m follow a multinomial distribution: sampling with replacement from a fixed distribution. Word occurrences are still independent; it doesn't matter what other words we've drawn. Although technically the position is specified, it doesn't really give us positional information. It is still a naïve Bayes model!

Boolean NB vs. multinomial NB: 20 Newsgroups results (http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf).
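
A matching sketch for the multinomial model: the same per-class word distribution is used at every position, so generating a document is sampling with replacement. The vocabulary, probabilities, and document length are again made-up illustrative values.

```python
import random

# Illustrative multinomial naive Bayes generator; vocabulary, probabilities, and the
# document length are made up. The same per-class word distribution is used for
# every position, so this is sampling with replacement from a fixed distribution.
VOCAB = ["the", "a", "banana", "good", "terrible"]
P_WORD_GIVEN_CLASS = {
    "positive": [0.3, 0.2, 0.1, 0.35, 0.05],
    "negative": [0.3, 0.2, 0.1, 0.05, 0.35],
}

def generate_document(label, m=8):
    """Pick a word for each of the m positions, independently, from P(word | class)."""
    return random.choices(VOCAB, weights=P_WORD_GIVEN_CLASS[label], k=m)

print(generate_document("positive"))
```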

Boolean NB vs. multinomial NB: Industry Sector data (web-page classes) (http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf).

Plate notation
It can be tedious to write out all of the children in a BN. When they're all the same type, we can use plate notation: a plate represents a set of variables, we specify the repetition count by putting a number (e.g. M) in the lower right corner, and edges crossing plate boundaries are considered to be multiple edges. In plate notation, the multinomial NB model's position nodes (first word, 2nd word, ..., mth word) collapse into a single word node inside a plate repeated M times, with the class as its parent.

Dirichlet Compound Multinomial (DCM)
Plate diagram: class → topic → word, with the word node in a plate repeated M times. To generate a document: pick a class; based on that, draw a multinomial representing a topic. p(topic | class) is represented by a Dirichlet distribution, which gives us a distribution over multinomials. Based on this multinomial, sample words as before.
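
A sketch of the DCM generative story: draw one multinomial per document from the class's Dirichlet, then sample every word from it. The class names, Dirichlet parameters, and document length below are made-up illustrative values.

```python
import random

# Illustrative DCM generator; class names, Dirichlet parameters, and document length are
# made up. Drawing a per-document multinomial from the class's Dirichlet, then sampling
# all words from it, makes repeated (bursty) words more likely than plain multinomial NB.
VOCAB = ["the", "a", "bush", "good", "terrible"]
DIRICHLET_ALPHA = {                   # one Dirichlet parameter vector per class
    "positive": [2.0, 2.0, 0.5, 3.0, 0.2],
    "negative": [2.0, 2.0, 0.5, 0.2, 3.0],
}

def sample_dirichlet(alpha):
    """Draw one multinomial (a distribution over the vocabulary) from Dirichlet(alpha)."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(label, m=8):
    theta = sample_dirichlet(DIRICHLET_ALPHA[label])   # per-document word distribution
    return random.choices(VOCAB, weights=theta, k=m)   # sample words as before

print(generate_document("positive"))
```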

DCM (for those that like math)
The key problem with the NB multinomial is that words tend to be bursty: if a word occurs once, it's likely to occur again, particularly content words (e.g. "Bush"). The DCM model lets us model burstiness by picking, for a given document, a multinomial under which that document's words have a higher probability of occurring.

DCM vs. multinomial (http://www.cs.pomona.edu/~dkauchak/papers/kauchak05modeling.pdf):

Model         Industry   20 Newsgroups
Multinomial   0.600      0.85
DCM           0.806      0.890

Topic models
Often a document isn't just about one idea/topic. Topic models view documents as a blend of topics: topic 1 + topic 2 + topic 3.

Topic models
In plate notation, the DCM model is class → topic → word with the word plate repeated M times; topic models also have a plate of T topics. How might we model this as a Bayes net?

LDA model (Latent Dirichlet Allocation)
Plate diagram: a Dirichlet prior, a topic node, and a word node, with plates over the M words of a document and the T topics. To generate a document, for each word in the document: pick a topic given the document, then pick a word given the topic (and the prior). Key paper: http://www.cs.princeton.edu/~blei/papers/bleingjordan200.pdf
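
A sketch of the LDA generative story described above: per document, draw topic proportions from a Dirichlet prior; per word, pick a topic from those proportions and then a word from that topic. The topics, their word distributions, and the prior below are made-up illustrative values.

```python
import random

# Illustrative LDA generator; the topics, their word distributions, and the Dirichlet
# prior over topic proportions are all made up for the sketch.
VOCAB = ["game", "team", "election", "vote", "banana"]
TOPIC_WORD = [                        # one word distribution per topic (T = 2 here)
    [0.4, 0.4, 0.05, 0.05, 0.1],      # a "sports"-ish topic
    [0.05, 0.05, 0.4, 0.4, 0.1],      # a "politics"-ish topic
]
ALPHA = [1.0, 1.0]                    # Dirichlet prior over topic proportions

def sample_dirichlet(alpha):
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(m=10):
    theta = sample_dirichlet(ALPHA)   # this document's blend of topics
    words = []
    for _ in range(m):                # for each word in the document:
        topic = random.choices(range(len(TOPIC_WORD)), weights=theta)[0]   # pick a topic
        words.append(random.choices(VOCAB, weights=TOPIC_WORD[topic])[0])  # pick a word given the topic
    return words

print(generate_document())
```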

LDA results
Two binary classification tasks from the Industry Sector data: LDA was used to extract features, then SVMs (support vector machines) were used for classification.

Document classification summary
Generative models represent the underlying probability distribution; they can be used for classification, but also for other tasks. Models: Bernoulli (boolean) naïve Bayes, multinomial naïve Bayes, Dirichlet Compound Multinomial, Latent Dirichlet Allocation. Discriminative models: support vector machines, Markov random fields; more expensive to train; very good, but for classification only.

Midterm
Open book, but still only 75 minutes, so don't rely on the book too much. Anything we've talked about in class or read about is fair game. The written questions are a good place to start.

Review: Intro to AI
What is AI; goals; challenges; problem areas.

Review: Uninformed search
Reasoning through search; the agent paradigm (sensors, actuators, environment, etc.); setting up problems as search: state space (starting state, next-state function, goal state), actions, costs; problem characteristics: observability, determinism, known/unknown state space; techniques: BFS, DFS, uniform cost search, depth-limited search, iterative deepening.

Review: Uninformed search, continued
Things to know about search algorithms: time, space, completeness, optimality, when to use them; graph search vs. tree search. Informed search: heuristic functions (admissibility, combining functions, dominance); methods: greedy best-first search, A*.

Review: Adversarial search
Game playing through search: ply, depth, branching factor, state space sizes, optimal play; game characteristics: observability, number of players, discrete vs. continuous, real-time vs. turn-based, determinism.

Review: Adversarial search, continued
Minimax algorithm; alpha-beta pruning (optimality, etc.); evaluation functions (heuristics) and the horizon effect; improvements: transposition table, history/end-game tables; dealing with chance/non-determinism: expected minimax; dealing with partially observable games.

Review: Local search
When to use it / what types of problems; general formulation; hill-climbing: greedy, random restarts, randomness; simulated annealing; local beam search; tabu list; genetic algorithms.

Review: CSPs
Problem formulation: variables, domains, constraints; why CSPs? applications?; constraint graph; CSP as search: backtracking algorithm, forward checking, arc consistency; heuristics: most constrained variable, least constraining value, ...

Review: Basic probability
Why probability (vs., say, logic)?; vocabulary: experiment, sample, event, random variable, probability distribution; unconditional/prior probability; joint distribution; conditional probability; Bayes' rule; estimating probabilities.

Review: Bayes nets
Representation; dependencies/independencies: d-separation, Markov blanket; reasoning/querying: exact (enumeration, variable elimination), sampling (basic sampling, MCMC).

Review: Bayesian classification
Problem formulation (argmax, etc.); the NB model; other models: multinomial, DCM, LDA; training, testing, evaluation; plate notation.